As I mentioned previously, one of the projects that has been rattling around in the back of my head is an attribution system for Wikipedia, sort of like the "blame" function of your favourite source control system. I've been thinking about how to approach this problem on and off for months, and last weekend I sat down and hacked out a bunch of code.
The goal of this project is to provide a view of the current state of a Wikipedia article, with annotations indicating who wrote what and when. So if you're reading the Aardvark article and you want to know who said aardvarks live to be "over 24 years old", you could click on those words and it would bring up the revision with that modification.
For the impatient who don't want to read about how this works, skip down to the last paragraph.
To approach this problem, first I needed the entire Wikipedia history of all pages. Fortunately, the Wikimedia Foundation provides a database dump of the entire history. The most recent successful dump was in January 2008, and the dump file is 18522193111 bytes (18 GB) compressed and 2807444044080 bytes (2.8 TB) uncompressed. The dump file is in XML format, and I had to write a custom parser in C++ that could efficiently handle that volume of data.
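To give a feel for what streaming through the dump involves, here is a minimal Python sketch (not the C++ parser described above). It assumes the usual MediaWiki export layout of `page` elements containing a `title` and a series of `revision` elements, each with an `id` and a `text`; the function names `local` and `iter_revisions` and the tiny sample document are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

def local(tag):
    """Strip an XML namespace prefix like '{http://...}page' down to 'page'."""
    return tag.rsplit("}", 1)[-1]

def iter_revisions(source):
    """Yield (page_title, revision_id, text) tuples without loading the whole file."""
    title = None
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "revision":
            rev_id, text = None, ""
            # Only direct children: the contributor's own <id> is nested deeper.
            for child in elem:
                child_name = local(child.tag)
                if child_name == "id":
                    rev_id = child.text
                elif child_name == "text":
                    text = child.text or ""
            yield title, rev_id, text
            elem.clear()  # release the revision text once it has been handed out
        elif name == "page":
            elem.clear()  # drop the emptied revisions of a finished page

# Hypothetical two-revision sample, just to exercise the function.
SAMPLE = b"""<mediawiki><page><title>Aardvark</title><id>1</id>
<revision><id>10</id><contributor><username>alice</username><id>7</id></contributor>
<text>hello</text></revision>
<revision><id>11</id><text>hello world</text></revision>
</page></mediawiki>"""

for row in iter_revisions(io.BytesIO(SAMPLE)):
    print(row)
```

The point of `iterparse` plus `clear()` is that memory use stays tied to one page at a time rather than to the size of the dump, which matters when the uncompressed file runs to terabytes.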
One of the difficulties with the Wikipedia dataset is that for a given page, there are usually many revisions that fix vandalism and other antisocial behaviour. A naive attribution algorithm would claim that the person (or bot) who reverted a blanked page to its previous state actually "wrote" the whole page. Obviously this won't do, so the first step is to simplify the history and remove vandalism and the revisions that fix it. I do this by computing a hash of the contents of each individual revision. Then I scan through the history, skipping over any changes that are eliminated by a later revert.
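The revert-skipping idea can be sketched in a few lines of Python. This is one plausible reading of the step, not necessarily the exact procedure used: whenever a revision's hash matches an earlier surviving revision, everything in between (the vandalism) and the revert itself are dropped. The choice of SHA-1 and the name `simplify_history` are assumptions for illustration.

```python
import hashlib

def simplify_history(revisions):
    """Collapse reverted spans.

    revisions: list of (rev_id, text) pairs, oldest first.
    Returns the surviving (rev_id, text) pairs, oldest first.
    """
    kept = []       # surviving (rev_id, text, digest) entries
    position = {}   # content digest -> index of that revision in `kept`
    for rev_id, text in revisions:
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest in position:
            # This revision restores an earlier state: forget everything
            # that happened after that state, and skip the revert itself.
            pos = position[digest]
            for _, _, dropped in kept[pos + 1:]:
                del position[dropped]
            del kept[pos + 1:]
        else:
            position[digest] = len(kept)
            kept.append((rev_id, text, digest))
    return [(rev_id, text) for rev_id, text, _ in kept]

history = [
    (1, "Aardvarks eat ants."),
    (2, "Aardvarks eat ants and termites."),
    (3, ""),                                   # page blanked by a vandal
    (4, "Aardvarks eat ants and termites."),   # revert to revision 2
    (5, "Aardvarks eat ants and termites at night."),
]
print([rev_id for rev_id, _ in simplify_history(history)])
```

With this history, the blanking (3) and the revert (4) both disappear, so the text restored by the revert stays credited to whoever wrote revision 2.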
After I simplify the history, I use the Python difflib module to compute the actual changes between each successive revision of the page. I keep track of which user wrote which words, and after running through all the revisions I end up with a list of words and who wrote them.
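Here is a rough sketch of how word-level attribution with difflib might look. It assumes whitespace-separated words and uses `difflib.SequenceMatcher` opcodes; the function name `attribute_words` and the sample authors are hypothetical, and the real tokenisation of wiki markup is surely messier than `str.split()`.

```python
import difflib

def attribute_words(revisions):
    """Return [(word, author), ...] for the final revision.

    revisions: list of (author, text) pairs, oldest first.
    """
    attributed = []  # (word, author) pairs for the revision processed so far
    for author, text in revisions:
        new_words = text.split()
        old_words = [word for word, _ in attributed]
        matcher = difflib.SequenceMatcher(None, old_words, new_words, autojunk=False)
        updated = []
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                # Unchanged words keep whoever they were already credited to.
                updated.extend(attributed[i1:i2])
            else:
                # 'insert' and 'replace' credit the new words to this author;
                # for 'delete' the range j1:j2 is empty, so nothing is added.
                updated.extend((word, author) for word in new_words[j1:j2])
        attributed = updated
    return attributed

revisions = [
    ("alice", "aardvarks eat ants"),
    ("bob", "aardvarks eat ants and termites"),
]
for word, author in attribute_words(revisions):
    print(word, author)
```

Passing `autojunk=False` turns off difflib's heuristic that treats very frequent items as junk on long sequences, which would otherwise make common words like "the" match poorly in long articles.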
Finally, I take the attribution word list and annotate the current state of the article page with HTML mouseover highlights. My original implementation was hand-coded CSS and JavaScript, which worked on Firefox 3 but nothing else. I then switched to the jQuery library, which was super easy to get working and suddenly made my code work across Firefox, IE6, IE7, Chrome, Safari, and Opera (at least that's all the browsers I tried).
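The annotation step could be sketched like this in Python: consecutive words from the same revision are wrapped in one `span` whose class names the revision, so a script can highlight every span sharing that class on mouseover. The `rev-N` class naming and the function `annotate_html` are invented for this example, and real article text would need its wiki markup rendered first.

```python
from html import escape
from itertools import groupby

def annotate_html(attributed):
    """Wrap runs of words from the same revision in a <span>.

    attributed: list of (word, rev_id) pairs in document order,
    where rev_id is an integer revision number.
    """
    parts = []
    # groupby merges only adjacent words that share a revision id.
    for rev_id, group in groupby(attributed, key=lambda pair: pair[1]):
        words = " ".join(escape(word) for word, _ in group)
        parts.append('<span class="rev-%d">%s</span>' % (rev_id, words))
    return " ".join(parts)

print(annotate_html([("aardvarks", 1), ("eat", 1), ("ants", 2)]))
```

Using a shared class per revision, rather than a unique id per span, is what lets a single selector light up all the scattered text from one edit at once.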
The results for the first 1450 or so articles in the English Wikipedia database are here (this contains some articles that don't start with A because MediaWiki allows article renames). Moving your mouse over text highlights in yellow all the text that was written in the same revision as whatever your mouse is pointing at. Clicking on text brings up the corresponding revision diff on Wikipedia.
2008-10-31T15:02:16Z