Date: 2007-08-11 16:46:00
the abomination of mediawiki templates
The dict command line program is something I use pretty much daily. I always have a command prompt window open, and frequently (perhaps compulsively) look up words with which I'm unfamiliar. dict uses the standard Dictionary Server Protocol to connect to a dictionary server (by default, dict.org) to retrieve definitions. The dict.org server has quite a number of different dictionaries available, of varying utility.
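
The protocol itself is refreshingly simple. As a rough illustration (not production code), a DEFINE exchange in Python looks something like the sketch below; the status codes are from RFC 2229, and error handling and the dot-stuffing of body lines beginning with "." are left out:

    import socket

    def define(word, host="dict.org", port=2628, db="*"):
        # Minimal DICT (RFC 2229) client sketch: send one DEFINE command and
        # collect the definition bodies that come back.
        with socket.create_connection((host, port)) as sock:
            f = sock.makefile("rwb")
            f.readline()                                # 220 server banner
            f.write(f"DEFINE {db} {word}\r\n".encode())
            f.flush()
            if not f.readline().startswith(b"150"):     # 150 = n definitions follow
                return []
            definitions, current = [], None
            while True:
                line = f.readline().rstrip(b"\r\n")
                if current is not None:                 # inside a definition body
                    if line == b".":                    # a lone "." ends the body
                        current = None
                    else:
                        current.append(line.decode("utf-8", "replace"))
                elif line.startswith(b"250"):           # 250 = ok, all done
                    break
                elif line.startswith(b"151"):           # 151 = one definition follows
                    current = []
                    definitions.append(current)
            f.write(b"QUIT\r\n")
            f.flush()
            return ["\n".join(d) for d in definitions]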

Enter Wiktionary. Wiktionary is a sister project to the better-known Wikipedia. Like Wikipedia, anyone can contribute definitions for any word and edit them. The incredible thing about Wiktionary is that it is fully multilingual: it aims to provide definitions for every word in every language, in every other language. So you can look up definitions in English for thank you, danke, спасибо, or even ありがとう. No matter what your preferred language, ideally you will be able to use Wiktionary to look up anything found in print anywhere and find a definition in your chosen language. This is an enormously ambitious goal and I look forward to seeing it grow.

I want to use dict to look up words in Wiktionary. As nice as a web browser is, often I just want to use a simple command line program to do a quick lookup without all the extra fluff. So, the first thing I did was implement a DICT protocol server in Python. Next, I downloaded the entire English Wiktionary from Wikimedia Downloads, which gives me the raw wiki markup for each entry in one big XML file. Then I wrote a quick program to extract the entries from the XML and do some simple formatting of the Wiktionary pages. This is where things started to get complicated.
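
For what it's worth, the extraction step is the easy part. A rough sketch using ElementTree's iterparse is below; the export namespace URI changes between dump versions, so treat that constant as a placeholder to adjust for your file:

    import xml.etree.ElementTree as ET

    # The dump wraps everything in a versioned export namespace; check your file.
    NS = "{http://www.mediawiki.org/xml/export-0.3/}"

    def pages(dump_path):
        # Stream (title, wikitext) pairs out of a MediaWiki XML dump without
        # loading the whole multi-gigabyte file into memory at once.
        for _, elem in ET.iterparse(dump_path):
            if elem.tag == NS + "page":
                title = elem.findtext(NS + "title")
                text = elem.findtext(NS + "revision/" + NS + "text") or ""
                yield title, text
                elem.clear()    # discard the subtree we just finished with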

The MediaWiki software offers simple templates, which allow page authors to include common text and markup in articles. Wikipedia makes limited use of templates, but as I've discovered, Wiktionary uses them extensively and in decidedly nontrivial ways. For example, have a look at the template documentation for indicating English noun plural forms. It seems reasonably easy to use, but behind the scenes the source for the en-noun template is a nearly impenetrable forest of curly braces, wiki markup, HTML, and XML-like tags and comments. My program attempts to parse this.

The MediaWiki template language is an example of a domain-specific language. However, as languages go, it is not terribly well specified or documented. The Wikimedia Help:Template page seems to be the best documentation I can find, and it's chock full of contrived examples and pathological cases without even bothering to present a clear grammar. The lack of a grammar, together with the Byzantine expansion rules, makes this template language somewhat challenging to parse. This is a prime example of Greenspun's Tenth Rule, which states: "Any sufficiently complicated C or Fortran program contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp."

I'm not sure where to go from here. I've implemented what one would consider a fairly conventional recursive descent parser for the MediaWiki template language, but my implementation doesn't quite match the documented examples in the weird corner cases. It seems to me that the only way to build a parser that behaves the same way is to follow the vague instructions in the Template mechanism documentation. This means implementing an ad hoc parser, and it might even be necessary to borrow the implementation from MediaWiki itself. I really hadn't anticipated taking this project that far.
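
To give a rough idea of the shape of such a parser, here is a stripped-down sketch (not the actual Wiktiondict code): it builds a tree of {{...}} invocations split on top-level | characters, and deliberately ignores the hard parts, such as parser functions, {{{...}}} parameters, and the }} versus }}} ambiguity, which are exactly where the corner cases live:

    def parse(text):
        # Sketch only: split wikitext into plain strings and template nodes.
        # A template node is ("template", [part, ...]) where each part is the
        # node list between top-level '|' separators; parts[0] is the name.
        nodes, _ = _parse(text, 0, top=True)
        return nodes

    def _parse(text, i, top):
        nodes, buf = [], []
        while i < len(text):
            if text.startswith("{{", i):
                if buf:
                    nodes.append("".join(buf))
                    buf = []
                node, i = _parse_template(text, i + 2)
                nodes.append(node)
            elif not top and (text.startswith("}}", i) or text[i] == "|"):
                break               # let the enclosing template handle it
            else:
                buf.append(text[i])
                i += 1
        if buf:
            nodes.append("".join(buf))
        return nodes, i

    def _parse_template(text, i):
        parts = []
        while True:
            part, i = _parse(text, i, top=False)
            parts.append(part)
            if i >= len(text) or text.startswith("}}", i):
                return ("template", parts), i + 2
            i += 1                  # step over the '|' between arguments

Expansion is then a matter of walking the tree, looking up each template's source, substituting parameters, and parsing the result again, and that second half is where things get hard to pin down.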

An appealing option is to avoid evaluating templates at all, and offer only the definitions of words without all the extra etymology and inflection information. I had already planned to offer several different views of the database: raw wiki markup; full formatted page; normal view without translations; and a brief view with just definitions. Perhaps I should start with the brief view and work my way up.
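
As a rough illustration of why the brief view is appealing, extracting just the numbered senses barely needs to understand the markup at all. Assuming the usual Wiktionary convention that definitions are lines starting with "#" (with "#:" and "#*" used for example sentences and quotations), something like this sketch would do:

    def brief(wikitext):
        # Crude "brief" view: keep only the numbered sense lines, dropping
        # etymology sections, inflection templates, translation tables, and
        # so on. Inline templates and [[links]] within a sense still remain.
        senses = [line.lstrip("#").strip()
                  for line in wikitext.splitlines()
                  if line.startswith("#") and not line.startswith(("#:", "#*"))]
        return "\n".join(f"{n}. {s}" for n, s in enumerate(senses, 1))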

If you've read this far and want to try what I've got in its current state, try: dict -h hewgill.com word. The available databases (-d option) are: en-raw, en-full, en, and en-brief. Not every database will have complete (or any) information at any given time.
[info]infinitevoid
2007-08-11T04:56:23Z
I would've sworn you were going to say "any sufficiently complicated C or Fortran program is indistinguishable from magic." :)
(anonymous) : any sufficiently complicated
2007-09-20T22:01:00Z
Yeah, I thought that was what was coming as well... lol
(anonymous) : Re: any sufficiently complicated
2007-10-12T20:41:32Z
Yeah, that en-noun template source code really looks like magic :)
(anonymous)
2008-10-27T17:27:07Z
Well, you could say both. Hmm... Combine them and you get "An ad hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp is indistinguishable from magic". Which is also true :D.
[info]thewordnerd
2007-08-11T05:08:34Z
A naive question, perhaps, but how difficult would it be to write a dict server that just sends off an HTTP request and scrapes the screen for the results? I haven't studied the protocol so don't know what it returns, but this seems to dodge the issue of parsing the XML directly. It's more fragile in that your functionality is tied to the layout (unless the mediawiki engine lets you specify themes/templates per request and then presumably you could request a simple/printer-friendly theme), but you'd also have access to the latest data without needing to update the database periodically.
[info]ghewgill
2007-08-11T05:53:53Z
I thought about that but I don't want to do it, for at least two reasons: (1) performance, and (2) load on the Wikimedia Foundation servers. By having a copy of the database stored locally, I can return a response immediately, without having to relay the request to a remote server and wait for it to render and return the response. Scraping the HTML is at least as messy as processing MediaWiki templates, so I think I'm on an acceptable track here.

The database dump is updated monthly or so, and dictionaries don't change too terribly often. :)
[info]bovineone
2007-08-12T05:51:24Z
Why not just set up a local MediaWiki install and import all of the pages from their public dumps? Then you could create a very minimal MediaWiki skin that doesn't have the noisy HTML in it, and a very simple web scrape against your local server (or maybe even just a command-line PHP invocation) would get you what you want.
[info]ghewgill
2007-08-12T08:11:05Z
That's an interesting idea, and although it might work I'm afraid it's far too heavyweight for my tastes. Besides, I don't have PHP installed on my web server, and don't plan to do so. :)
[info]decibel45
2007-08-11T05:58:51Z
Look on the bright side; at least MediaWiki runs on PostgreSQL...
[info]davidmccabe : Better approach
2008-05-20T19:38:04Z
Parsing MediaWiki markup yourself is impossible. Instead, grab the raw pre-parsed HTML:

http://en.wiktionary.org/wiki/foo?action=render

Then run that through lynx or a similar html->text converter.
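
A sketch of that pipeline in Python (assuming lynx is installed and on the PATH; any HTML-to-text converter would do):

    import subprocess
    import urllib.parse
    import urllib.request

    def plain_text_entry(word):
        # Fetch the already-rendered HTML fragment and let lynx flatten it.
        url = ("http://en.wiktionary.org/wiki/"
               + urllib.parse.quote(word) + "?action=render")
        html = urllib.request.urlopen(url).read()
        result = subprocess.run(["lynx", "-dump", "-stdin"],
                                input=html, capture_output=True)
        return result.stdout.decode("utf-8", "replace")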

[info]ghewgill : Re: Better approach
2008-05-20T19:58:34Z
Impossible is such a strong word. In fact, clearly it is not impossible, since MediaWiki does it. The problem is that the algorithm is underspecified and so the only way of reimplementing it is to inspect the MediaWiki code itself.

The point of the Wiktiondict project, though, is to offer a server that is independent of the Wiktionary servers. This gives me the opportunity to make the response time as fast as possible, and it's not just a proxy to another service. This solution also doesn't add extra load to the Wiktionary servers.
[info]davidmccabe : Re: Better approach
2008-05-20T20:25:33Z
Fine, it's exceedingly impractical almost all of the time.

I was about to suggest running your own MediaWiki, but it looks like somebody already suggested that. Takes about five minutes to install. PHP is gross but it won't bite.

[info]ijon
2008-11-28T00:52:02Z
Thanks for referring me to this entry (from my stackoverflow question). I'm facing a similar situation with Wikipedia markup, for an offline reader I'm working on (BzReader). I'll be adding MediaWiki features to the parser in stages; I still haven't decided what to do about the templates.

Have you made any progress on this since writing the entry?
[info]ghewgill
2008-12-01T08:39:24Z
I have, though I mostly punted on the issue of actually trying to parse and execute templates. A description of the current state of the project is at http://hewgill.com/dict. You can also find the current state of the code here.