the general public on twitter

Date: 2009-09-20 01:25:00

I was chatting with Phil the other day and he was working on some statistical analysis tools for the Twitter API. I was reminded of a little project that had popped into my head a while ago: How hard would it be to identify the language in which a Twitter status update is written? With only 140 characters, some of which are going to be taken up by a URL or something, there won't be much info there. Is it possible?

The Twitter API provides a method to get the most recent 20 updates from the general public on Twitter. You can see this public timeline in your browser, or get it via XML or JSON or whatever. I took one look at that page and was immediately struck by:

the wide range of languages represented (I imagine this varies by time of day)
the high frequency of spelling errors, both active (because of the 140-character limit) and passive (because people can't or won't spell properly)
the wide range of "words" used that simply aren't in any dictionary (mmmmm, XD, arrr, dat), although "arrr" is permissible because it's International Talk Like A Pirate Day today, of course
the low information content of most of the crap people post on Twitter
the (attempted) spam (who would actually read or click on your gratuitous message about affiliate marketing?)

I was disheartened by what I saw, ready to give up on the project. Here's a typical gem:

ne1 der.....2 b frndz wid me..

But you know what? Regardless of whether any of this could be considered correct, literate, crap, spam, or what have you, it's what people are actually writing. It's (mostly) human communication, even if it has a low information content. Doesn't it deserve analysis anyway? Real-world problems are almost never the easy ones.

Comment

mskala
2009-09-19T13:33:58Z

How are you defining "information content"? It sounds like you mean something like useful or important information content... but I suspect Twitter items may actually have very high information content in the sense of entropy, as used in information theory, and measuring that might be interesting and easier.

ghewgill
2009-09-19T13:49:37Z

Right, I wasn't referring to the technical measure of information content, but something more like useful or important. Maybe "semantic content" is what I mean.

There are certainly many ways of looking at this data, and there certainly is a lot of it. I'll see what I can distill out of it.
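
As a rough sketch of the entropy measure mskala mentions above, something like the following would estimate Shannon entropy in bits per character from the character frequencies of a single update. The function name char_entropy is made up for illustration, and one 30-character string is far too small a sample to say anything meaningful about Twitter as a whole; this only shows the calculation.

```python
import math
from collections import Counter


def char_entropy(text: str) -> float:
    """Empirical Shannon entropy of text, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)  # how often each character occurs
    total = len(text)
    # H = -sum(p * log2(p)) over the observed character probabilities
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# The sample update quoted in the post above.
sample = "ne1 der.....2 b frndz wid me.."
print(f"{char_entropy(sample):.2f} bits per character")
```

A string made of one repeated character scores 0, and a string where every character is different scores log2 of its length, so short messy updates can score fairly high on this measure without carrying much "semantic content".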

ivo
2009-09-19T15:17:08Z

Hah! The first 20 public posts that I pulled up included a "Chantal" who posted in Dutch!!!

Greg Hewgill <greg@hewgill.com>