Date: 2004-10-21 00:20:00
Tags: unix
adventures in unix

This post describes a little fun I had today repairing a broken Linux system. One of my computers at home stopped working properly: almost anything I tried to run gave a "fork: Resource temporarily unavailable" error. Read on for a tale of what you might call a shell game...

My weather radio system runs on a recently installed Debian Linux system. This was the system that suffered a hard drive crash last April. The weather radio system had not been operating since the crash, because I needed to build a new kernel with sound card support. I finally had some time to do that last weekend, and set up the weather radio system so it worked again.

Today at work I switched to an already-open 'screen' session on that machine. I noticed that the weather radio system had shut down with a "fork: Resource temporarily unavailable" message. OK, so I figured there was still something not quite right with the new system, and went to start it back up, planning to look into it later. Same fork error. Try 'ls', same error. Try any external (non-builtin) command: fork error. Something was seriously hosed.

This was annoying but not serious; in the worst case I would just reboot the machine when I got home. But I wanted to see whether I could find out what was wrong remotely.

I remembered the trick of using echo * to get a listing of everything in a directory when external commands don't work. I ran echo /proc/* and found that there were about 800 processes active, which seemed like too many. The weather radio system had shut down, but it's made up of many different processes connected with pipes, so perhaps something stayed running. How could I find out what these processes were?
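
In shell terms it boils down to something like this (the [0-9]* glob is a small refinement that skips the non-numeric entries in /proc, and the set trick is just one builtin-only way to get a count):

echo *                         # a builtin plus a glob stands in for ls
echo /proc/[0-9]*              # the numeric entries in /proc are the running pids
set -- /proc/[0-9]*; echo $#   # count them without forking anything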

It turns out that the /proc filesystem is a wealth of information. However, most of it is in the form of symlinks (e.g. /proc/12345/exe is a symlink to the actual process executable for pid 12345) or pseudo-files that you can read to obtain status information. There is no shell builtin command to find out what a symlink points to, and there is no shell builtin to display a file (ls and cat are both external commands).
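
For contrast, this is how you'd normally poke at a process on a healthy box; every one of these would just have produced the fork error here, because they're all external programs:

ls -l /proc/12345/exe    # where the exe symlink points (but ls is external)
cat /proc/12345/status   # readable status info (but cat is external too)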

The first thing we tried was cd /proc/12345/cwd, which should change into the current working directory of process 12345 (cd being a builtin, it doesn't need to fork). Maybe that would give me a clue. I tried this for one of the many similarly-numbered processes, and got a permission denied error. I was not logged in as root, and evidently I didn't have permission to change into whatever that current directory was.

Having determined that these were probably not "my" processes, it seemed that I needed a way to become root. su and sudo are both external commands, and I certainly couldn't make a new ssh connection from another machine. My coworker johnkw came up with exec su - which would replace a currently running shell with su, rather than forking first. If this worked, I would have a root shell; if not, I would likely lose my shell completely. Since I was running in screen and had three shells available, I could afford to lose one if this didn't work. I tried it, and although I got yet another fork error in the process, it dropped me in a root shell anyway!
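
For the record, the trick is literally just this; the second form (which only occurred to me later, see the comments) sacrifices a spare shell to run any single external command, where "file" is just a placeholder:

exec su -       # replace the current shell with su; no fork involved
exec cat file   # same idea: give up this shell to run one external command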

So, now I could change into the current directory of one of the offending processes. It just turned out to be the root directory, so no great clue there.

Another thing in the /proc tree is the cmdline pseudo-file, which contains the command line used to invoke the process. Without cat, there didn't seem to be any way to view this file. A review of the shell builtin commands didn't turn up anything at first. But [info]bovineone found that the read command could do it: "read foo < /proc/12345/cmdline". The read command reads one line of its input and puts it in the shell variable foo. This showed that the command was /bin/cat, hardly helpful as I couldn't think of anything that might spawn a whole bunch of cats.
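
Roughly like this, with 12345 again standing in for one of the mystery pids. The cmdline pseudo-file is NUL-separated rather than newline-terminated, so read returns a failure status and bash quietly drops the NUL bytes, but enough comes through to identify the command:

read foo < /proc/12345/cmdline   # read is a builtin, so no fork
echo $foo                        # the argv, with the NUL separators squeezed out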

We figured that it might be useful to find out what the parent process was. Again the /proc filesystem delivers: each process has a stat pseudo-file which contains a couple dozen numbers and other bits of status. We figured that one of those was probably the parent pid. I logged on to another (working) Linux box and found that the fourth field is the parent pid. Back on the problem box, the parent was inetd! Why on earth was inetd spawning cats? We were not ruling out the possibility that my box had been hacked, even though it is still behind my firewall.
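
With builtins only, pulling the parent pid out looks something like this (it relies on read splitting the line into words, which is fine as long as the command name contains no spaces):

read pid comm state ppid rest < /proc/12345/stat   # read splits the stat line on whitespace
echo $ppid                                         # the fourth field is the parent's pid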

I wanted to see what was in inetd.conf, but the read command would insist on only reading the first line. Eventually I found the -d option, which lets you choose a delimiter character other than newline! So, with "read -dZ foo < /etc/inetd.conf" (this set the delimiter to Z, which didn't appear in the file), I got to view the whole inetd.conf file. The line in question was:

1234 stream tcp nowait root /bin/cat cat /dev/ttyS0
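
For completeness, the builtin-only file viewer comes down to these two lines; the quotes around $foo matter, because an unquoted echo would flatten all the newlines back out:

read -dZ foo < /etc/inetd.conf   # Z never appears in the file, so read slurps all of it
echo "$foo"                      # quoted, so the embedded newlines survive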

What the heck was that cat line? Then I remembered: it's the gateway for my temperature sensor, so I can read the serial port from a remote machine (a different machine does the data collection and aggregation using RRDtool). This had been working fine for months until the reboot last weekend (and I hadn't noticed the breakage until now).

So I decided to kill inetd (it wasn't doing anything else useful anyway, and kill is conveniently another builtin), which finally made the fork problem go away, and I could use external commands normally. I ran killall cat to kill all the stray cats (pardon the pun), which surprisingly took several seconds but cleaned everything up nicely.

The reason all the cat processes had hung was that after rebooting, the serial port was set to the wrong baud rate. Apparently when this happens nothing is ever read from the serial port, and the cats never terminate (it's not very robust). So, after about three days, hundreds of cat processes had accumulated and the kernel would no longer let anything else fork.
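
I haven't checked exactly which limit got tripped, but the per-user process cap is easy to inspect with the ulimit builtin (which, being a builtin, would even have worked while everything was jammed):

ulimit -u   # maximum number of processes available to this user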

I have changed inetd.conf to run the cat process as the user nobody instead of root. I'm not sure whether this will help, because inetd itself still runs as root and is the process doing the forking. At least I now know what caused this and how to fix it if it happens again.
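
The changed line is the same as before with only the user field swapped, though I haven't checked yet whether the nobody user can even open /dev/ttyS0:

1234 stream tcp nowait nobody /bin/cat cat /dev/ttyS0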

The lesson learned from all this is that the shell built-in commands are just barely flexible enough to get you out of a pinch if you really need to. It certainly helped having a couple of other creative minds helping me with the problem too. Thanks guys!

[info]hoyhoy
2004-10-21T05:32:16Z
I thought hard process ulimits were per user. Is it possible to enforce a hard maximum process ulimit for uid 0?
[info]_fool
2004-10-21T05:45:45Z
dunno about that, but it's easy to enforce one per system on many axes including open files, file descriptors, et al, which could easily get tripped up with multiple executions of even something as simple as cat.
[info]taral
2004-10-30T19:23:17Z
I suggest making that "cat" command run as non-root. I also suggest installing "sash", a shell for exactly these circumstances. :)
[info]ghewgill
2004-11-01T01:11:05Z
I don't think a non-root 'cat' would have helped, because fork() was failing no matter what I tried to run. However since the shell 'exec' command worked and I had three shells to play with, it occurs to me now that I could have afforded to sacrifice one of them to run a one-time cat.

sash looks like it would have done the trick too, thanks for the pointer.
[info]taral
2004-11-01T04:23:29Z
A non-root 'cat' wouldn't eat all the processes in the system perhaps? I dunno. It's more secure though. :)
Greg Hewgill <greg@hewgill.com>