Twitter Support in Terrier
Building upon Terrier 3.5, this page describes how to index and retrieve tweets, for example from the TREC Microblog Tweets11 corpus or from samples of the Gardenhose stream. Support for Twitter corpora has not yet been integrated into the core of Terrier 3.5. Instead, we provide a tarball containing the additional Java classes and jar files needed to index a Twitter collection with Terrier 3.5.
This tarball can be downloaded from the JIRA issue regarding Twitter corpus support with Terrier (TR-171).
Indexing Tweets11
When indexing the Tweets11 corpus, we assume that you have the collection stored in JSON format, one tweet per line. This is the default output if you used the JSON crawler. If you used the HTML crawler, you need to run the HTML scraper provided with it to write out the collection in JSON format (this scrapes each page for useful content such as the tweet text, username, etc.). For example, a JSON line should look like this:
{"text":"RT @NemesisRepublic: RT @Wallstroker: I know this has prob been put up already. But just wanted to share this amazing find! http://bbc.in...","id":32368820383383552,"id_str":"32368820383383552","truncated":true,"user":{"screen_name":"ibisroofing","protected":false},"retweeted_status":{"id":32367958483275776,"id_str":"32367958483275776","created_at":null,"text":"RT @Wallstroker: I know this has prob been put up already. But just wanted to share this amazing find! http://bbc.in/hFI6BY #history","truncated":false,"retweet_count":0,"in_reply_to_screen_name":null,"in_reply_to_user_id_str":null,"in_reply_to_user_id":null,"in_reply_to_status_id_str":null,"in_reply_to_status_id":null,"contributors":null,"user":{"screen_name":"NemesisRepublic","protected":false,"lang":"en","name":"Nemesis Republic","profile_image_url":"http://a3.twimg.com/profile_images/1372019809/e18126ed-be2a-4cfb-80c2-55a8ad329427_bigger.png"},"entities":{"hashtags":["history"],"urls":["http://bbc.in/hFI6BY"],"user_mentions":["Wallstroker"]}
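Before indexing, it can be worth checking that each file really does hold one complete JSON object per line. The following Python sketch is not part of Terrier or the plugin. The function names and the choice of required fields (id, text, user) are assumptions based on the example line above, not a list taken from the plugin's source.

```python
import json

# Fields assumed to be required, based on the example line above.
# This is an assumption, not a list taken from the plugin's source.
REQUIRED_FIELDS = ("id", "text", "user")


def check_tweet_lines(lines):
    """Return (line_number, problem) pairs for lines that do not look like
    one complete JSON tweet."""
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            tweet = json.loads(line)
        except json.JSONDecodeError as err:
            # Typically a tweet that was cut off or split across lines.
            problems.append((number, "invalid JSON: " + err.msg))
            continue
        if not isinstance(tweet, dict):
            problems.append((number, "not a JSON object"))
            continue
        missing = [f for f in REQUIRED_FIELDS if f not in tweet]
        if missing:
            problems.append((number, "missing fields: " + ", ".join(missing)))
    return problems


def check_tweet_file(path):
    """Check a file that should contain one JSON tweet per line."""
    with open(path, encoding="utf-8") as handle:
        return check_tweet_lines(handle)
```

An empty list from check_tweet_file suggests the file matches the expected one-tweet-per-line layout. Any reported line numbers point at records that are truncated or malformed.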
Once you have the collection in this format, follow these instructions to index it:
Download Terrier 3.5 from http://terrier.org and extract it.
Download the TREC Microblog Terrier Plugin from the JIRA issue regarding Twitter corpus support with Terrier (TR-171) and extract it from within your terrier-3.5 directory. You will need the JAVA_HOME environment variable to be set. Windows users in particular should check that this has been set.
If you are using a Unix-based OS, run bin/compile_package.sh twitter to build the new Twitter-related classes. If you are running Windows, instead run bin\compile_package.bat twitter, which is provided with the tarball.
Add the following properties to your etc/terrier.properties:
#use the new collection class
trec.collection.class=TwitterJSONCollection
#record extra fields in the index
FieldTags.process=TWEET,RAW,NAME,SNAME,LOC
#record extra information in the meta index
indexer.meta.forward.keys=docno,id,created_at,text,retweet_count,in_reply_to_screen_name,in_reply_to_user_id,in_reply_to_status_id,user.name,user.screen_name,user.lang,user.profile_image_url,place.name,place.id,geo.lat,geo.lng,retweet.text,retweet.id,retweet.created_at,retweet.retweet_count,retweet.in_reply_to_screen_name,retweet.in_reply_to_user_id,retweet.in_reply_to_status_id,retweet.user.name,retweet.user.screen_name,retweet.user.lang,retweet.user.profile_image_url,retweet.place.name,retweet.place.id,retweet.geo.lat,retweet.geo.lng
indexer.meta.forward.keylens=32,30,30,200,10,30,30,30,60,30,10,250,160,30,30,30,200,30,30,10,30,30,30,160,60,10,250,160,30,30,30
#additional configuration for single pass indexing
docs.check=50
memory.heap.usage=0.70
indexing.max.docs.per.builder=100000000
Create a collection.spec file in the etc/ folder containing the full paths and filenames of all files in your tweets corpus, one per line.
Run Terrier indexing as normal, e.g. using bin/trec_terrier.sh -i, or bin/trec_terrier.sh -i -j for single-pass indexing (trec_terrier.bat on Windows).
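Two of the steps above are easy to get wrong by hand: listing every corpus file in collection.spec, and keeping the two meta index lists in step, since each entry in indexer.meta.forward.keys is expected to have a corresponding length in indexer.meta.forward.keylens. The Python sketch below is one way to automate both checks before running the indexer. It is not part of Terrier or the plugin. The function names are hypothetical, and the properties parser is deliberately simplified (plain key=value lines only, no line continuations).

```python
import os


def write_collection_spec(corpus_dir, spec_path):
    """Write the absolute path of every file under corpus_dir to spec_path,
    one path per line, and return the list of paths written."""
    paths = []
    for root, _dirs, files in os.walk(corpus_dir):
        for name in files:
            paths.append(os.path.abspath(os.path.join(root, name)))
    paths.sort()  # stable ordering makes runs repeatable
    with open(spec_path, "w", encoding="utf-8") as spec:
        for path in paths:
            spec.write(path + "\n")
    return paths


def read_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments.
    Simplified: does not handle the full Java properties syntax."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def meta_lists_match(props):
    """True if the meta key list and the key length list have the same
    number of entries and every length is a non-negative integer."""
    keys = props.get("indexer.meta.forward.keys", "").split(",")
    lens = props.get("indexer.meta.forward.keylens", "").split(",")
    return len(keys) == len(lens) and all(n.strip().isdigit() for n in lens)
```

For example, write_collection_spec("/path/to/tweets", "etc/collection.spec") would list every file under a hypothetical corpus directory, and meta_lists_match(read_properties(open("etc/terrier.properties").read())) should return True for the configuration shown above, which has 31 keys and 31 lengths.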