I was chatting with Phil the other day while he was working on some statistical analysis tools for the Twitter API. It reminded me of a little project that had popped into my head a while ago: how hard would it be to identify the language in which a Twitter status update is written? With only 140 characters, some of which are going to be a URL or something, there won't be much info to work with. Is it possible?
The Twitter API provides a method to get the 20 most recent updates from the general public on Twitter. You can see this public timeline in your browser, or fetch it as XML or JSON or whatever. I took one look at that page and was immediately struck by:
I was disheartened by what I saw, ready to give up on the project. Here's a typical gem:
ne1 der.....2 b frndz wid me..
But you know what? Regardless of whether any of this could be considered correct, literate, crap, spam, or what have you, it's what people are actually writing. It's (mostly) human communication, even if it has low information content. Doesn't it deserve analysis anyway? Real-world problems are almost never the easy ones.
2009-09-19T13:33:58Z