Terrier/Tweets11

Twitter Support in Terrier

Building upon Terrier 3.5, this page describes how to index and retrieve tweets, e.g. from the TREC Microblog Tweets11 corpus and from samples of the Gardenhose stream. Support for Twitter corpora has not yet been integrated into the core of Terrier (3.5). Instead, we provide a tarball containing the additional Java classes and jar files needed to index a Twitter collection with Terrier 3.5.

This tarball can be downloaded from the JIRA issue regarding Twitter corpus support with Terrier (TR-171).

Indexing Tweets11

When indexing the Tweets11 corpus, we assume that you have the collection stored in JSON format, one tweet per line. This is the default output if you used the JSON crawler. If you used the HTML crawler, you need to run the HTML scraper provided with it to write out the collection in JSON format (the scraper extracts useful content from each page, such as the tweet text and username). For example, a JSON line should look like this:

{"text":"RT @NemesisRepublic: RT @Wallstroker: I know this has prob been put up already. But just wanted to share this amazing find! http://bbc.in...","id":32368820383383552,"id_str":"32368820383383552","truncated":true,"user":{"screen_name":"ibisroofing","protected":false},"retweeted_status":{"id":32367958483275776,"id_str":"32367958483275776","created_at":null,"text":"RT @Wallstroker: I know this has prob been put up already. But just wanted to share this amazing find! http://bbc.in/hFI6BY #history","truncated":false,"retweet_count":0,"in_reply_to_screen_name":null,"in_reply_to_user_id_str":null,"in_reply_to_user_id":null,"in_reply_to_status_id_str":null,"in_reply_to_status_id":null,"contributors":null,"user":{"screen_name":"NemesisRepublic","protected":false,"lang":"en","name":"Nemesis Republic","profile_image_url":"http://a3.twimg.com/profile_images/1372019809/e18126ed-be2a-4cfb-80c2-55a8ad329427_bigger.png"},"entities":{"hashtags":["history"],"urls":["http://bbc.in/hFI6BY"],"user_mentions":["Wallstroker"]}
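Before indexing, it can be worth confirming that every line of the collection really is a single JSON object. The following Python sketch is not part of the Terrier tarball; the function name `check_tweet_lines` and the choice of `id` and `text` as required fields are assumptions made for illustration only.

```python
import json

# Assumed minimum set of fields; adjust to whatever your indexing setup relies on.
REQUIRED_FIELDS = ("id", "text")


def check_tweet_lines(lines):
    """Return (line_number, problem) pairs for lines that are not usable tweets."""
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # blank lines are ignored rather than reported
        try:
            tweet = json.loads(line)
        except json.JSONDecodeError as error:
            problems.append((number, "invalid JSON: " + error.msg))
            continue
        if not isinstance(tweet, dict):
            problems.append((number, "not a JSON object"))
            continue
        missing = [field for field in REQUIRED_FIELDS if field not in tweet]
        if missing:
            problems.append((number, "missing fields: " + ", ".join(missing)))
    return problems
```

Since an open file iterates line by line, a collection file can be passed in directly, for example `check_tweet_lines(open("tweets.json", encoding="utf-8"))`, where `tweets.json` is a placeholder file name. An empty result means every non-blank line parsed and carried the assumed fields.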

Once you have the collection in this format, follow these instructions to index the collection:

#use the new collection class
trec.collection.class=TwitterJSONCollection

#record extra fields in the index
FieldTags.process=TWEET,RAW,NAME,SNAME,LOC

#record extra information in the meta index
indexer.meta.forward.keys=docno,id,created_at,text,retweet_count,in_reply_to_screen_name,in_reply_to_user_id,in_reply_to_status_id,user.name,user.screen_name,user.lang,user.profile_image_url,place.name,place.id,geo.lat,geo.lng,retweet.text,retweet.id,retweet.created_at,retweet.retweet_count,retweet.in_reply_to_screen_name,retweet.in_reply_to_user_id,retweet.in_reply_to_status_id,retweet.user.name,retweet.user.screen_name,retweet.user.lang,retweet.user.profile_image_url,retweet.place.name,retweet.place.id,retweet.geo.lat,retweet.geo.lng
indexer.meta.forward.keylens=32,30,30,200,10,30,30,30,60,30,10,250,160,30,30,30,200,30,30,10,30,30,30,160,60,10,250,160,30,30,30

#additional configuration for single pass indexing
docs.check=50
memory.heap.usage=0.70
indexing.max.docs.per.builder=100000000
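The `indexer.meta.forward.keys` and `indexer.meta.forward.keylens` properties above are parallel lists: each key is paired with the length at the same position. With this many entries it is easy to drop or misplace one when editing, so a quick check can help. The Python sketch below is an illustration only: `parse_properties` and `meta_key_lengths` are hypothetical helper names, and the parser is deliberately simplified (it handles `key=value` lines and `#` comments, not line continuations or escapes).

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, separator, value = line.partition("=")
        if separator:
            props[key.strip()] = value.strip()
    return props


def meta_key_lengths(props):
    """Pair each meta index key with its length; fail if the lists differ in size."""
    keys = props["indexer.meta.forward.keys"].split(",")
    lengths = [int(item) for item in props["indexer.meta.forward.keylens"].split(",")]
    if len(keys) != len(lengths):
        raise ValueError(
            "%d keys but %d lengths" % (len(keys), len(lengths))
        )
    return dict(zip(keys, lengths))
```

Calling `meta_key_lengths(parse_properties(text))` on the contents of your properties file returns a dictionary such as `{"docno": 32, "id": 30, ...}`, which also makes it easier to see which length belongs to which field.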

last edited 2012-03-02 15:49:28 by CraigMacdonald