Date: 2008-10-31 23:47:00
wikipedia blame
As I mentioned previously, one of the projects that has been rattling around in the back of my head is an attribution system for Wikipedia sort of like the "blame" function of your favourite source control system. I've been thinking about how to approach this problem on and off for months, and last weekend I sat down and hacked out a bunch of code.

The goal of this project is to provide a view of the current state of a Wikipedia article, with annotations indicating who wrote what when. So if you're reading the Aardvark article and you want to know who said aardvarks live to be "over 24 years old", you could click on those words and it would bring up the revision with that modification.

For the impatient who don't want to read about how this works, skip down to the last paragraph.

To approach this problem, first I needed the entire revision history of all Wikipedia pages. Fortunately, the Wikimedia Foundation provides a database dump of the entire history. The most recent successful dump was in January 2008; the dump file is 18522193111 bytes (18 GB) compressed and 2807444044080 bytes (2.8 TB) uncompressed. The dump is in XML format, and I wrote a custom parser in C++ to handle that volume of data efficiently.

One of the difficulties with the Wikipedia dataset is that for a given page, there are usually many revisions that fix vandalism and other antisocial behaviour. A naive attribution algorithm would claim that the person (or bot) who reverted a blanked page to its previous state actually "wrote" the whole page. Obviously this won't do, so the first step is to simplify the history and remove vandalism and the revisions that fix it. I do this by computing a hash of the contents of each individual revision. Then I scan through the history, skipping over any changes that are eliminated by a later revert.
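The hash-and-scan idea above can be sketched as follows. This is a hypothetical reconstruction, not the original code: if a revision's content hash matches an earlier revision exactly, everything in between (the vandalism and the revert itself) changed nothing overall and can be discarded:

```python
import hashlib

def simplify_history(revisions):
    """Drop spans of revisions that a later revision reverted away.

    revisions: chronological list of (author, text) tuples. A revision whose
    content hash matches an earlier kept revision is treated as a revert,
    and everything after that earlier revision is discarded.
    """
    kept = []   # surviving (author, text) revisions
    seen = {}   # content hash -> index into kept
    for author, text in revisions:
        h = hashlib.sha1(text.encode("utf-8")).digest()
        if h in seen:
            # Revert back to an earlier state: discard the intervening
            # revisions (and the revert itself) from the kept history.
            idx = seen[h]
            del kept[idx + 1:]
            seen = {k: v for k, v in seen.items() if v <= idx}
        else:
            seen[h] = len(kept)
            kept.append((author, text))
    return kept
```

With this pass done, neither the vandal nor the reverting bot shows up in the attribution at all.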

After I simplify the history, I use the Python difflib module to compute the actual changes between each successive revision of the page. I keep track of which user wrote which words, and after running through all the revisions I end up with a list of words and who wrote them.
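The attribution pass can be sketched with difflib's SequenceMatcher, diffing at word granularity: unchanged words keep their earlier author, while inserted or replaced words are credited to the revision's author. Again, this is an illustrative reconstruction rather than the original code:

```python
import difflib

def attribute_words(revisions):
    """Return a list of (word, author) for the final text, given
    revisions as (author, text) tuples in chronological order."""
    attributed = []  # (word, author) pairs for the current revision
    for author, text in revisions:
        new_words = text.split()
        old_words = [w for w, _ in attributed]
        sm = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
        result = []
        for op, i1, i2, j1, j2 in sm.get_opcodes():
            if op == "equal":
                result.extend(attributed[i1:i2])  # keep earlier attribution
            elif op in ("replace", "insert"):
                result.extend((w, author) for w in new_words[j1:j2])
            # "delete": words removed from the article carry nothing forward
        attributed = result
    return attributed
```

For example, if alice writes "aardvarks are mammals" and bob later changes it to "aardvarks are nocturnal mammals", only the word "nocturnal" is attributed to bob.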

Finally, I take the attribution word list and annotate the current state of the article page with HTML mouseover highlights. My original implementation was hand-coded CSS and Javascript, which worked on Firefox 3 but nothing else. I then switched to the jQuery library, which was super easy to get working and suddenly made my code work across Firefox, IE6, IE7, Chrome, Safari, and Opera (at least, that's all the browsers I tried).

The results for the first 1450 or so articles in the English Wikipedia database are here (this includes some articles that don't start with A, because MediaWiki allows article renames). Moving your mouse over text highlights in yellow all the text that was written in the same revision as whatever your mouse is pointing at. Clicking on text brings up the corresponding revision diff on Wikipedia.
Very nice. I'd be surprised if Wikipedia didn't incorporate your tool soon. :)
If you haven't already seen it, you might be interested in looking at IBM's HistoryFlow application for Wikipedia. It shows a graphical representation of how an article has evolved in bits and pieces over time and gives attribution by color coding. (However, I think your mouse-over function works better.)
Interesting, thanks for the pointer. It looks like it's a different presentation of a similar analysis result. I note that their granularity is sentences, while I chose individual words.
Nice one! You should post on wikitech-l about it.
(anonymous): Any plans to release the code?
Hey, this sounds like a great project. Any plans to release the code? I'd love to fork it and play around with it on, say, github.
ghewgill: Re: Any plans to release the code?
I'd like to do this as I'm sure other people will find it useful. The code base is currently a disaster as it's intertwined with some other work I was doing on page revert statistics, which I never quite finished. But this is on my todo list anyway.
Really useful tool (I was looking for exactly this when I found it).
Why don't you enable doing it on demand, using Special:Export?
Interesting, I didn't know that Special:Export existed. I'll give that a go, but it can take anywhere from a few minutes to well over an hour to process the diffs for an article, depending on the article length and revision history. I see that Special:Export is limited to 1000 revisions, which should help with that.
Actually, I've looked at Special:Export and it's slow. But still, is there any chance of your making it publicly available?
Yes, I plan to. The code base is currently a disaster as it's intertwined with some other work I was doing on page revert statistics, which I never quite finished. But most of that code was designed to quickly extract relevant info from the full XML dump, which I wouldn't need when using Special:Export.
ext_219544: Status
Any progress on getting the code cleaned up for release? I'd really like to play around with the code too.
Greg Hewgill