On a couple of occasions recently, we’ve noticed swap use getting out of hand on a server or two. There’s been no common cause so far, but the troubleshooting approach has been the same in each case.
To tell the difference between a VM that is generally “just a bit tight on resources” and a situation where a process has run away, it can be handy to work out which processes are hitting swap.
The approach I’ve been using isn’t particularly elegant, but it has proved useful so I’m documenting it here:
grep VmSwap /proc/*/status 2>&1 | perl -ne '/\/(\d+)\/[^\d]*(\d+) (.B)$/g;if($2>0){$name=`ps -p $1 -o comm=`;chomp($name);print "$name ($1) $2$3\n"}'
Let’s pick it apart a component at a time.
grep VmSwap /proc/*/status 2>&1
The first step is to pull out the VmSwap line from the PID status files held in /proc. There’s one of these files for each process on the system and it tracks all sorts of stuff. VmSwap is how much swap is currently being used by this process. The grep gives output like this:
...
/proc/869/status:VmSwap: 232 kB
/proc/897/status:VmSwap: 136 kB
/proc/9039/status:VmSwap: 5368 kB
/proc/9654/status:VmSwap: 312 kB
...
That’s got a lot of useful info in it (e.g. the PID is there, as is the amount of swap in use), but it’s not particularly friendly. The PID is buried in the filename, and it would be more useful to have the name of the process alongside the PID.
Time for some perl…
perl -ne '/\/(\d+)\/[^\d]*(\d+) (.B)$/g;if($2>0){$name=`ps -p $1 -o comm=`;chomp($name);print "$name ($1) $2$3\n"}'
Dealing with the shell side of things first (before we dive into the perl code): “-ne” says to perl “I want you to run the following code against every line of input I pipe your way”. Strictly, “-n” wraps the code in a loop over each line of input, and “-e” says the code is given on the command line.
The first thing we do in perl itself is run a regular expression across the line of input looking for three things: the PID, the amount of swap used and the units reported. When the regex matches, this info gets stored in $1, $2 and $3 respectively.
I’m pretty sure the units are always kB but matching the units as well seemed safer than assuming!
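To make the regex step easier to poke at on its own, here is a rough shell sketch of the same extraction using sed instead of perl. The function name parse_vmswap_lines is made up for illustration, and the pattern assumes grep's "file:VmSwap: N kB" line layout shown above.

```shell
# Hypothetical helper: turn grep's "/proc/PID/status:VmSwap: N kB" lines
# into "PID N UNIT", mirroring the three capture groups in the perl regex.
parse_vmswap_lines() {
    # -n plus the trailing "p" prints only lines where the substitution
    # matched, so anything that is not a VmSwap line is silently dropped.
    sed -n -E 's#^/proc/([0-9]+)/status:VmSwap:[[:space:]]*([0-9]+) (.B)$#\1 \2 \3#p'
}

# Feed it one sample line to see the three pieces come out.
printf '%s\n' '/proc/869/status:VmSwap:      232 kB' | parse_vmswap_lines
```

Because only matching lines are printed, any error messages that grep's 2>&1 sends down the pipe simply fall out at this stage.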
The if statement allows us to ignore processes which are using 0kB of swap because we don’t care about them, and they can cause problems for the next stage:
$name=`ps -p $1 -o comm=`;chomp($name)
To get the process name, we run a “ps” command in backticks, which allows us to capture the output. “-p $1” tells ps that we want information about a specific PID (which we matched earlier and stored in $1), and “-o comm=” specifies a custom output format which is just the process name. The trailing “=” gives the column an empty header, so ps doesn’t print a header line.
chomp is there to strip the ‘\n’ off the end of the ps output.
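For comparison, a minimal sketch of the same lookup in plain shell. The name_for_pid helper and the "?" placeholder are my own inventions for illustration; the ps flags are the same ones used in the one-liner.

```shell
# Hypothetical helper: print "NAME (PID)" for a single process.
name_for_pid() {
    pid=$1
    # $(...) strips trailing newlines from the captured output,
    # so no chomp equivalent is needed in shell.
    name=$(ps -p "$pid" -o comm= 2>/dev/null) || name=''
    # If ps found nothing (e.g. the process has exited), fall back to
    # a "?" placeholder. That placeholder is an assumption of this sketch.
    printf '%s (%s)\n' "${name:-?}" "$pid"
}

# Look up the current shell's own PID.
name_for_pid "$$"
```

The fallback matters because a process can exit between grep reading its status file and ps being asked for its name.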
print "$name ($1) $2$3\n"
Lastly we print out the $name of the process, its PID and the amount of swap it’s using.
So now, you get output like this:
...
automount (869) 232kB
cron (897) 136kB
munin-node (9039) 5364kB
exim4 (9654) 312kB
...
The output is a little untidy, and there is almost certainly a more elegant way to get the same information. If you have an improvement, let me know in the comments!
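One possible tidy-up, offered as a sketch rather than a drop-in replacement: each status file also carries a “Name:” line, so awk can pick up the name and the swap figure in one pass without calling ps, and sort can put the biggest users first. The swap_by_process function name and its optional directory argument are assumptions made for this example (the argument exists only so it can be pointed at a test tree).

```shell
# Hypothetical helper: list processes using swap, largest first.
# Optional argument: a proc-style directory tree (defaults to /proc).
swap_by_process() {
    proc_root=${1:-/proc}
    awk '
        # Remember the process name; strip the label so names
        # containing spaces survive intact.
        /^Name:/ { sub(/^Name:[ \t]*/, ""); name = $0 }
        # Only report processes with a non-zero VmSwap value.
        /^VmSwap:/ && $2 > 0 {
            n = split(FILENAME, parts, "/")
            pid = parts[n - 1]          # the directory above "status"
            # Prefix a numeric sort key, removed again by cut below.
            printf "%d\t%s (%s) %s%s\n", $2, name, pid, $2, $3
        }
    ' "$proc_root"/[0-9]*/status 2>/dev/null | sort -rn | cut -f2-
}

# Usage: swap_by_process | head -n 10
```

The output keeps the same “name (PID) amount” layout as the perl version, just ordered by swap use.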
You might also be interested in “iotop”, which shows per-process disk I/O in a style similar to “top”. I don’t think it will show per-disk I/O unfortunately, but it’s a good start for finding which processes are thrashing any disks (which will include swap thrashing). The amount of swap shown by VmSwap in /proc per-process is also the number shown in the “SWAP” column in regular “top”. (Not quite the same, but the combo of those two gives a lot of the same info and allows you to correlate it per-process 🙂)
There was some reason I couldn’t use the SWAP column in regular top (although enough time has passed now that I can’t remember the reasoning… it was something to do with the way top calculates those figures).
iotop and atop are both interesting tools, but weren’t available on the servers I was looking at (and as they weren’t “my” servers, installing extra packages might have been deemed a bit rude 🙂).
I should probably fess up that the full approach as listed in the post is a more automated version of what I did by hand – and this post was mostly written so that I could come back to it next time and just copy/paste a one liner 🙂
Also, “atop” is excellent and gives a lot of info like pages in/out and virtual memory growth of processes during the update interval, which can show if particular processes are using more memory at a rapid rate.