Date: 2007-11-30 00:19:00
premature optimization

For (literally) years, I've been pulling monthly web server logs from my server with a simple script that I run by hand. I have been using the scp -C option, which is supposed to compress the data before sending it across the wire. I'm sure it does so, but it sure doesn't seem very effective. Have a look at these timings:

$ time scp -C access_log
     4882.68 real        29.09 user        20.99 sys
$ time ssh 'bzip2 -c logs/' >/dev/null
     859.47 real         0.50 user         0.22 sys
$ time ssh 'gzip -c logs/' >/dev/null
     108.90 real         0.67 user         0.33 sys

These numbers are all counterintuitive. I would have expected the scp -C option to perform better (the man page claims it uses the gzip algorithm, so why does it do so badly in this case?). I would also have expected bzip2 to be faster overall than gzip, because there would be less data to transfer (in reality, the server couldn't compress the data as fast as the network could carry it).

This month's access_log file is just over 250 MB. The gzip compressed version is 6.2 MB. The bzip2 compressed version is 3.7 MB. It looks like it doesn't pay to wait nearly an order of magnitude longer for a result that's not even twice as small.
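The size-versus-speed tradeoff is easy to reproduce locally, with no network involved. A rough sketch (the sample data and filenames here are invented for illustration, and it assumes gzip and bzip2 are installed):

```shell
# Generate a few MB of repetitive, log-like text to stand in for access_log.
seq 1 100000 | sed 's/^/127.0.0.1 - - "GET \/index.html HTTP\/1.0" 200 /' > sample.log

# Compress with each tool, keeping the original file.
time gzip -c sample.log > sample.log.gz
time bzip2 -c sample.log > sample.log.bz2

# bzip2 usually wins on size, gzip on speed; compare for yourself.
ls -l sample.log sample.log.gz sample.log.bz2
```

On highly repetitive web-log data, both tools compress dramatically, which is why the transfer-time savings dominate everything else.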

This exercise has just reinforced the point that it pays to actually measure performance in different situations, instead of going with what "feels" right. Premature optimization is the root of all evil [Hoare].

How does scp do without -C?
It's worse than with -C, at least. I'm not sure; I'll try to run that tonight to see.
I was wondering if -C might not actually do any compression.
With scp -C you're not compressing much (probably compressing small packets one at a time without reference to each other, which doesn't help much), you're decompressing at the client end, and you're writing the result to disk. So you pay for compression time (which shows up in "real" but not in the other two), lots of transfer time (which again counts toward "real" but not toward the other two), decompression time (which would count toward "user" and "real"), and writing-to-disk time (which would count toward "sys" and "real").

With ssh bzip2 > /dev/null you're paying for compression time and it's a slow compressor; that counts toward "real". Then you pay much less for network transfer, and you DON'T decompress at the client end and you just drop the data instead of writing it, so "user" and "sys" are both very small. With ssh gzip > /dev/null it's the same story but you pay less for compression because it's a faster compression, and that outweighs the increase in transfer time (which was probably not a big component of the overall real-time).
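The compress-on-the-server, decompress-on-the-client pipeline described above looks like this (the hostname and log path in the comment are hypothetical placeholders; the local pipe below just demonstrates that the pipeline round-trips the data exactly):

```shell
# Remote form, as discussed in the post (hypothetical host "server"):
#   ssh server 'gzip -c access_log' | gunzip -c > access_log

# Local stand-in for the same pipeline, to show it is lossless:
printf '127.0.0.1 - - "GET / HTTP/1.0" 200\n' > original.log
gzip -c original.log | gunzip -c > copy.log
cmp original.log copy.log && echo "round trip OK"
```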

I'd suggest that to get a more accurate comparison you should do something like time ssh 'gzip -c logs/' | gunzip -c > access_log , so as to count decompression and disk time on the client end, though to time it properly you might have to throw a subshell in there to count the time spent by both ssh and gunzip.
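The subshell suggestion can be sketched locally (a local gzip pipe stands in for the ssh leg, and the filenames are made up). The point of the grouping is that the subshell can't exit until every stage of the pipeline has finished, so the timing covers all of them:

```shell
printf 'one\ntwo\nthree\n' > demo.log

# bash's "time" keyword happens to cover a whole pipeline anyway, but with
# /usr/bin/time or plainer shells, the subshell grouping is what guarantees
# both the producer and the consumer are counted.
bash -c 'time ( gzip -c demo.log | gunzip -c > demo.out )'

cmp demo.log demo.out
```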

I also wonder how fast the server machine you're connecting to is - if it's slow, or if it's loaded and running the ssh daemon niced or something, then it may be running CPU-intensive stuff slowly.
I think the most surprising thing about this is the really poor performance of the -C option. I wonder how big the chunks are that it compresses; they're clearly not big enough to make a significant impact on the data set I'm using.

While this wasn't exactly a controlled experiment, I think I've got a pretty good handle on the various factors involved, at least between bzip2 and gzip: bzip2 was CPU-limited on the sender (about 70% utilisation) and gzip was bandwidth-limited. The receiver's decompression is in both cases faster than the sender's compression - the machine is roughly the same speed CPU-wise but more lightly loaded and has faster I/O, and decompression is less work anyway.
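The CPU-limited versus bandwidth-limited distinction can be read straight from the time output: if user + sys is close to real, the process was CPU-bound; if real is much larger, it spent most of its time waiting on something else (disk or network). A rough sketch with made-up local data:

```shell
seq 1 200000 > big.txt

# CPU-heavy: user time should account for most of the elapsed (real) time.
time bzip2 -c big.txt > big.txt.bz2

# Mostly I/O: user time should be near zero compared to real.
time cat big.txt > /dev/null
```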

One factor I can't control is the network bandwidth. My ISP, in exchange for unlimited monthly traffic volume, retains the right to shape my traffic depending on whatever they feel like (read: depending on how much the pirates are torrenting). I typically get 50 KB/s or so downloads. (Yes, this is the price I pay for living on the ass end of the internet. :)

In this situation the most important factor is overall elapsed time. I think I'll go with the gzip solution as the best case there. It's interesting that Nerf actually saw numbers similar to mine in the next reply.
(anonymous) : Another set of tests
The script:

time scp billabong:$FILE file
md5 file
time scp -C billabong:$FILE file
md5 file
time ssh billabong "gzip -c $FILE" | gunzip > file
md5 file
time ssh billabong "bzip2 -c $FILE" | bunzip2 > file
md5 file

The output:
httpd-access.log 100% |*****************************| 60575 KB 04:41
283.00 real 0.05 user 0.50 sys
MD5 (file) = da3790490964243e6c4ecf754e530ef9
httpd-access.log 100% |*****************************| 60575 KB 04:36
278.65 real 0.03 user 0.52 sys
MD5 (file) = da3790490964243e6c4ecf754e530ef9
9.44 real 0.16 user 0.02 sys
MD5 (file) = da3790490964243e6c4ecf754e530ef9
153.70 real 0.12 user 0.00 sys
MD5 (file) = da3790490964243e6c4ecf754e530ef9

So, my results are very similar to what greg was seeing.

[info]ghewgill : Re: Another set of tests
Interesting, thanks for that.
ssh -C is, I believe, compressing at a very different layer. Profiling might help...
Greg Hewgill