Date: 2007-11-30 00:19:00
premature optimization

For (literally) years, I've been pulling monthly web server logs from my server with a simple script that I run by hand. I have been using the scp -C option, which is supposed to compress the data before sending it across the wire. I'm sure it does so, but it sure doesn't seem very effective. Have a look at these timings:

$ time scp -C access_log
     4882.68 real        29.09 user        20.99 sys
$ time ssh 'bzip2 -c logs/' >/dev/null
     859.47 real         0.50 user         0.22 sys
$ time ssh 'gzip -c logs/' >/dev/null
     108.90 real         0.67 user         0.33 sys

These numbers are all counterintuitive. I would have expected the scp -C option to perform better (the man page claims it uses the gzip algorithm, so why does it do so badly in this case?). I would also have expected bzip2 to be faster overall than gzip, because there would be less data to transfer (in reality, the server couldn't compress the data as fast as the network could carry it).

This month's access_log file is just over 250 MB. The gzip compressed version is 6.2 MB. The bzip2 compressed version is 3.7 MB. It looks like it doesn't pay to wait nearly an order of magnitude longer for a result that's not even twice as small.
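The size-versus-speed tradeoff is easy to reproduce locally, with no network involved. A rough sketch (the sample data and filenames here are invented for illustration, and it assumes gzip and bzip2 are installed):

```shell
# Generate a few MB of repetitive, log-like text to stand in for access_log.
seq 1 100000 | sed 's/^/127.0.0.1 - - "GET \/index.html HTTP\/1.0" 200 /' > sample.log

# Compress with each tool, keeping the original file.
time gzip -c sample.log > sample.log.gz
time bzip2 -c sample.log > sample.log.bz2

# bzip2 usually wins on size, gzip on speed; compare for yourself.
ls -l sample.log sample.log.gz sample.log.bz2
```

On highly repetitive web-log data, both tools compress dramatically, which is why the transfer-time savings dominate everything else.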

This exercise has just reinforced the point that it pays to actually measure performance in different situations, instead of going with what "feels" right. Premature optimization is the root of all evil [Hoare].

How does scp do without -C?
It's worse than with -C, at least. I'm not sure; I'll try to run that tonight to see.
I was wondering if -C might not actually do any compression.
With scp -C you're not compressing much (probably compressing small packets one at a time without reference to each other, which doesn't help much), you're decompressing at the client end, and you're writing the result to disk. So you pay for compression time (which shows up in "real" but not in the other two), lots of transfer time (which again counts toward "real" but not toward the other two), decompression time (which would count toward "user" and "real"), and writing-to-disk time (which would count toward "sys" and "real").

With ssh bzip2 > /dev/null you're paying for compression time and it's a slow compressor; that counts toward "real". Then you pay much less for network transfer, and you DON'T decompress at the client end and you just drop the data instead of writing it, so "user" and "sys" are both very small. With ssh gzip > /dev/null it's the same story but you pay less for compression because it's a faster compression, and that outweighs the increase in transfer time (which was probably not a big component of the overall real-time).
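The compress-on-the-server, decompress-on-the-client pipeline described above looks like this (the hostname and log path in the comment are hypothetical placeholders; the local pipe below just demonstrates that the pipeline round-trips the data exactly):

```shell
# Remote form, as discussed in the post (hypothetical host "server"):
#   ssh server 'gzip -c access_log' | gunzip -c > access_log

# Local stand-in for the same pipeline, to show it is lossless:
printf '127.0.0.1 - - "GET / HTTP/1.0" 200\n' > original.log
gzip -c original.log | gunzip -c > copy.log
cmp original.log copy.log && echo "round trip OK"
```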

I'd suggest that to get a more accurate comparison you should do something like time ssh 'gzip -c logs/' | gunzip -c > access_log , so as to count decompression and disk time on the client end, though to time it properly you might have to throw a subshell in there to count the time spent by both ssh and gunzip.
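The subshell suggestion can be sketched locally (a local gzip pipe stands in for the ssh leg, and the filenames are made up). The point of the grouping is that the subshell can't exit until every stage of the pipeline has finished, so the timing covers all of them:

```shell
printf 'one\ntwo\nthree\n' > demo.log

# bash's "time" keyword happens to cover a whole pipeline anyway, but with
# /usr/bin/time or plainer shells, the subshell grouping is what guarantees
# both the producer and the consumer are counted.
bash -c 'time ( gzip -c demo.log | gunzip -c > demo.out )'

cmp demo.log demo.out
```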

I also wonder how fast the server machine you're connecting to is - if it's slow, or if it's loaded and running the ssh daemon niced or something, then it may be running CPU-intensive stuff slowly.
I think the most surprising thing about this is the really poor performance of the -C option. I wonder how big the chunks are that it compresses; they're clearly not big enough to make a significant impact on the data set I'm using.

While this wasn't exactly a controlled experiment, I think I've got a pretty good handle on the various factors involved, at least between bzip2 and gzip: bzip2 was CPU-limited on the sender (about 70% utilisation) and gzip was bandwidth-limited. The receiver's decompression is in both cases faster than the sender's compression - the machine is roughly the same speed CPU-wise but more lightly loaded and has faster I/O, and decompression is less work anyway.
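The CPU-limited versus bandwidth-limited distinction can be read straight from the time output: if user + sys is close to real, the process was CPU-bound; if real is much larger, it spent most of its time waiting on something else (disk or network). A rough sketch with made-up local data:

```shell
seq 1 200000 > big.txt

# CPU-heavy: user time should account for most of the elapsed (real) time.
time bzip2 -c big.txt > big.txt.bz2

# Mostly I/O: user time should be near zero compared to real.
time cat big.txt > /dev/null
```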

One factor I can't control is the network bandwidth. My ISP, in exchange for unlimited monthly traffic volume, retains the right to shape my traffic depending on whatever they feel like (read: depending on how much the pirates are torrenting). I typically get 50 KB/s or so downloads. (Yes, this is the price I pay for living on the ass end of the internet. :)

In this situation the most important factor is overall elapsed time. I think I'll go with the gzip solution as the best case there. It's interesting that Nerf actually saw numbers similar to mine in the next reply.
(anonymous) : Another set of tests
The script:

time scp billabong:$FILE file
md5 file
time scp -C billabong:$FILE file
md5 file
time ssh billabong "gzip -c $FILE" | gunzip > file
md5 file
time ssh billabong "bzip2 -c $FILE" | bunzip2 > file
md5 file

The output:
httpd-access.log 100% |*****************************| 60575 KB 04:41
283.00 real 0.05 user 0.50 sys
MD5 (file) = da3790490964243e6c4ecf754e530ef9
httpd-access.log 100% |*****************************| 60575 KB 04:36
278.65 real 0.03 user 0.52 sys
MD5 (file) = da3790490964243e6c4ecf754e530ef9
9.44 real 0.16 user 0.02 sys
MD5 (file) = da3790490964243e6c4ecf754e530ef9
153.70 real 0.12 user 0.00 sys
MD5 (file) = da3790490964243e6c4ecf754e530ef9

So, my results are very similar to what greg was seeing.

[info]ghewgill : Re: Another set of tests
Interesting, thanks for that.
ssh -C is, I believe, compressing at a very different layer. Profiling might help...
Greg Hewgill