Date: 2009-09-20 01:25:00
the general public on twitter

I was chatting with Phil the other day and he was working on some statistical analysis tools for the Twitter API. I was reminded of a little project that had popped into my head a while ago: How hard would it be to identify the language in which a Twitter status update is written? With only 140 characters, some of which are going to be a URL or something, there won't be much information to go on. Is it possible?

The Twitter API provides a method to get the 20 most recent updates from the general public on Twitter. You can see this public timeline in your browser as well as get it via XML or JSON or whatever. I took one look at that page and was immediately struck by:

I was disheartened by what I saw, ready to give up on the project. Here's a typical gem:

ne1 der.....2 b frndz wid me..

But you know what? Regardless of whether any of this could be considered correct, literate, crap, spam, or what have you, it's what people are actually writing. It's (mostly) human communication, even if it has a low information content. Doesn't it deserve analysis anyway? Real world problems are almost never the easy ones.

How are you defining "information content"? It sounds like you mean something like useful or important information content... but I suspect Twitter items may actually have very high information content in the sense of entropy, as used in information theory, and measuring that might be interesting and easier.
Right, I wasn't referring to the technical measure of information content, but something more like useful or important. Maybe "semantic content" is what I mean.
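
For anyone curious about the entropy measure mentioned above, here is a rough sketch in Python. It computes the Shannon entropy of the character frequencies in a single status, in bits per character. The function name `char_entropy` is made up for this illustration, and a per-character estimate from one 140-character string is a crude proxy at best, so treat it as a starting point rather than a real measurement.

```python
import math
from collections import Counter


def char_entropy(text):
    """Shannon entropy of the character distribution in text, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    # H = -sum(p * log2(p)) over the observed character frequencies
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# The sample status quoted in the post
status = "ne1 der.....2 b frndz wid me.."
print(round(char_entropy(status), 3))
```

A string made of one repeated character scores 0 bits, and a string where every character is distinct scores the maximum for its length, so this only says how evenly the characters are spread, not whether the text means anything.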

There are certainly many ways of looking at this data, and there certainly is a lot of it. I'll see what I can distill out of it.
Hah! The first 20 public posts that I pulled up included a "Chantal" who posted in Dutch!!!
Greg Hewgill