In my quest to create useful reference ebooks for the Kindle, I've created ebooks of the top-rated programming questions on Stack Overflow for the top 20 tags.
Stack Overflow ebooks
These files are created from the monthly Stack Overflow Creative Commons data dump. I've got a combination of Python, Java, and XSLT scripts that process the raw XML database dumps into something usable. Then the Amazon kindlegen program creates the ebook file (in Mobipocket format).
Last year I got a new computer with a fast CPU and lots of memory and disk space. While working on the XML processor, I realised that I was doing a lot of work seeking around a big XML file (it's over 4 GB) and collecting questions and answers together. This was taking quite some time because of the sheer size of the files. Since I am running a 64-bit OS (FreeBSD 8 amd64), the address space is large enough to memory-map the entire XML file, so I did that and then didn't have to think about seeking anymore. Letting the OS manage the caching is a much better approach, and the improved performance really shows.
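As a rough illustration of the memory-mapping idea, here is a minimal Python sketch. It is not the code from my scripts: the `find_row` helper is hypothetical, and it assumes each post is a self-closing `<row ... />` element whose first attribute is `Id`. The point is only that the mapped file can be searched and sliced like one big bytes object, with the OS deciding which pages to keep in memory.

```python
import mmap

def find_row(path, post_id):
    """Return the raw bytes of the <row> element with the given Id, or None.

    Assumes (for illustration only) that Id is the first attribute of each row.
    """
    needle = b'<row Id="%d"' % post_id
    with open(path, "rb") as f:
        # Length 0 maps the whole file; pages are loaded lazily by the OS.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            start = mm.find(needle)
            if start == -1:
                return None
            end = mm.find(b"/>", start)
            if end == -1:
                return None
            # Slicing copies just this element out of the mapping.
            return mm[start:end + 2]
```

A linear `find` like this is still a scan, so a real processor would probably record the byte offset of each row once and then jump straight to it through the mapping.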
The preprocessing step (which needs to run once per data dump) creates the HTML file for each question and its set of answers. I was originally storing all the files in one directory, but a million files in a single directory wasn't working very well. I ended up splitting the question number into groups of three digits, so 1234567.html is actually stored in 123/456/7.html. The whole preprocessing step takes about two hours to run.
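The three-digit split can be sketched in a few lines of Python. The function name `sharded_path` is made up for this example, and how question numbers with six or fewer digits are laid out is an assumption here rather than a description of my scripts.

```python
def sharded_path(question_id, group=3):
    """Map a question number to a nested relative path of three-digit groups."""
    digits = str(question_id)
    # Split from the left: "1234567" -> ["123", "456", "7"]
    parts = [digits[i:i + group] for i in range(0, len(digits), group)]
    # The last group becomes the file name, the others become directories.
    return "/".join(parts) + ".html"
```

With this layout no directory holds more than about a thousand subdirectories plus a thousand files, which keeps directory lookups fast.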
Creating each ebook file is then a single XSLT transformation (taking about a minute), plus the kindlegen step, which can take several minutes depending on the number of questions. The performance of kindlegen isn't very impressive and appears to be O(n²) in the number of pages.
The source for all this is available on GitHub.
2011-01-25T13:02:09Z