Best practice: Two billion tweets used as research material

What do we tweet about? Which words are used in which location, and in which context? For the past year, this information has been publicly available via the website. The SURFsara Hadoop cluster provides the computing power and the requisite storage capacity.

09 JUN 2014

Tweet text analysis

The website was initiated by Erik Tjong Kim Sang, a researcher at the Meertens Institute. He supports researchers with an interest in tweets: “For example, some researchers want to know how often a specific word is used. In that case, it can also be relevant to find out how many tweets are posted in the Netherlands. The website offers access to the collection of tweets used by linguistics researchers to study the frequency, user location and context of specific words.”

1.5 million tweets per day

The data collection was started in 2010 and currently totals some two billion tweets. Between 1.5 and 2 million Dutch-language tweets are added to the database every day. “This allows us to study the development of language use”, Tjong Kim Sang explains. “We can also trace the rise and decline of specific hypes or debates.”

Predicting events

So who uses the website and for what purposes? “We have a group of permanent users in Nijmegen, at the department of Computational Linguistics”, Tjong Kim Sang explains. “One of the researchers is trying to predict events on the basis of tweets. If effective, this technique could be applied to predict violent clashes between rival football supporters. Another good case in point would be a study on autism, where the researcher tries to predict whether someone suffers from autism on the basis of a tweet analysis.”

Heat maps

The website allows researchers to create heat maps. These geographical representations help pinpoint where in the Netherlands specific words are used frequently. Tjong Kim Sang: “For example, the maps show you where specific dialect terms are common. Dialect from Brabant is frequently used in tweets from Amsterdam, which might be due to the large number of students from that region living in the city.” The website also allows users to track regional debates, such as the many tweets on the recent earthquakes in Groningen.
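To give a feel for the kind of figure a heat map could be built on, here is a minimal sketch in Python. It is not the website's actual method: the function name `relative_frequency_by_region`, the `(region, text)` tweet format and the sample tweets are all invented for illustration. It computes, per region, the share of tweets that contain a given word.

```python
from collections import Counter

def relative_frequency_by_region(tweets, word):
    """Return {region: share of that region's tweets containing `word`}.

    `tweets` is an iterable of (region, text) pairs; this format is an
    assumption made for the example, not the website's data model.
    """
    totals = Counter()  # number of tweets seen per region
    hits = Counter()    # number of tweets per region that contain the word
    target = word.lower()
    for region, text in tweets:
        totals[region] += 1
        # Naive whitespace tokenisation; real tweets need more care.
        if target in text.lower().split():
            hits[region] += 1
    return {region: hits[region] / totals[region] for region in totals}

# Invented sample data, for illustration only.
sample = [
    ("Amsterdam", "houdoe en tot morgen"),
    ("Amsterdam", "lekker weer vandaag"),
    ("Groningen", "weer een aardbeving gevoeld"),
    ("Groningen", "moi allemaal"),
]
shares = relative_frequency_by_region(sample, "houdoe")
```

Dividing by the number of tweets per region, rather than using raw counts, keeps busy regions from dominating the map simply because more tweets are posted there.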

Computing power

Some calculations require a great deal of computing power, especially when a researcher wants to study the use of specific words over a longer period of time (such as a month). SURFsara runs the website on the HPC Cloud, which is also used to store the tweets. The collected tweets are then copied to the Hadoop cluster, which does the actual computing. “These are extremely large calculations”, Tjong Kim Sang explains. “Some heat maps require many hours of computing time.”
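Counting a word over a month of tweets fits the map-and-reduce style of computation that Hadoop clusters are designed for. The sketch below imitates that pattern in plain Python on a handful of invented tweets. It does not use Hadoop, and the function names and the `(date, text)` format are assumptions made for the example rather than a description of the actual jobs run at SURFsara.

```python
from collections import defaultdict
from itertools import chain

def map_tweet(tweet, word):
    """Map step: emit (date, 1) when the tweet text contains the word."""
    date, text = tweet
    if word.lower() in text.lower().split():
        yield (date, 1)

def reduce_counts(pairs):
    """Reduce step: sum the emitted counts per date."""
    totals = defaultdict(int)
    for date, count in pairs:
        totals[date] += count
    return dict(totals)

def daily_counts(tweets, word):
    """Run the map step over all tweets, then reduce the results."""
    pairs = chain.from_iterable(map_tweet(t, word) for t in tweets)
    return reduce_counts(pairs)

# Invented sample data, for illustration only.
tweets = [
    ("2014-06-01", "Aardbeving in Groningen"),
    ("2014-06-01", "weer een aardbeving"),
    ("2014-06-02", "mooi weer"),
    ("2014-06-03", "aardbeving gevoeld"),
]
counts = daily_counts(tweets, "aardbeving")
```

Because each tweet is processed independently in the map step, the work can be spread over many machines and only the small per-date totals need to be combined afterwards.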

