Diff for "Terrier/Tweets11"

Differences between revisions 30 and 31

Line 5:
- Building upon Terrier 4.0, this page describes how to index and retrieve tweets - e.g. from the TREC Microblog [http://trec.nist.gov/data/tweets/ Tweets11] corpus, and samples of the Gardenhose stream. Since version 4.0, support for Twitter corpora has not been integrated into the core of Terrier.
+ Building upon Terrier 4.0, this page describes how to index and retrieve tweets - e.g. from the TREC Microblog [http://trec.nist.gov/data/tweets/ Tweets11] corpus, and samples of the Gardenhose stream. Since version 4.0, support for Twitter corpora has been integrated into the core of Terrier.

Twitter Support in Terrier

Building upon Terrier 4.0, this page describes how to index and retrieve tweets, e.g. from the TREC Microblog Tweets11 corpus (http://trec.nist.gov/data/tweets/) and from samples of the Gardenhose stream. Since version 4.0, support for Twitter corpora has been integrated into the core of Terrier.

Indexing Tweets11

When indexing the Tweets11 corpus, we assume that the collection is stored in JSON format, one tweet per line. This is the default output of the JSON crawler. If you used the HTML crawler, you need to run the HTML scraper provided with it to write out the collection in JSON format (this scrapes each page for useful content such as the tweet text and username). For example, a JSON line should look like this:

{"text":"RT @NemesisRepublic: RT @Wallstroker: I know this has prob been put up already. But just wanted to share this amazing find! http://bbc.in...","id":32368820383383552,"id_str":"32368820383383552","truncated":true,"user":{"screen_name":"ibisroofing","protected":false},"retweeted_status":{"id":32367958483275776,"id_str":"32367958483275776","created_at":null,"text":"RT @Wallstroker: I know this has prob been put up already. But just wanted to share this amazing find! http://bbc.in/hFI6BY #history","truncated":false,"retweet_count":0,"in_reply_to_screen_name":null,"in_reply_to_user_id_str":null,"in_reply_to_user_id":null,"in_reply_to_status_id_str":null,"in_reply_to_status_id":null,"contributors":null,"user":{"screen_name":"NemesisRepublic","protected":false,"lang":"en","name":"Nemesis Republic","profile_image_url":"http://a3.twimg.com/profile_images/1372019809/e18126ed-be2a-4cfb-80c2-55a8ad329427_bigger.png"},"entities":{"hashtags":["history"],"urls":["http://bbc.in/hFI6BY"],"user_mentions":["Wallstroker"]}}}
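Before indexing, it can be useful to confirm that a corpus file really holds one JSON object per line. The following is a minimal sketch in Python, not part of Terrier: the function name check_tweet_file is hypothetical, and the choice to require the "id" and "text" fields is an assumption based on the example line above rather than a documented requirement of TwitterJSONCollection. It also assumes the file is already gzipped.

```python
import gzip
import json
import sys


def check_tweet_file(path, max_errors=5):
    """Scan a gzipped file that should hold one JSON tweet per line.

    Returns (n_ok, errors), where errors is a list of (line_number, message).
    Scanning stops once max_errors problems have been found.
    """
    n_ok = 0
    errors = []
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # blank lines are skipped, not counted
            try:
                tweet = json.loads(line)
            except json.JSONDecodeError as exc:
                errors.append((line_no, "invalid JSON: " + exc.msg))
            else:
                if not isinstance(tweet, dict):
                    errors.append((line_no, "not a JSON object"))
                else:
                    # Assumption: every tweet should carry at least these fields.
                    missing = [k for k in ("id", "text") if k not in tweet]
                    if missing:
                        errors.append((line_no, "missing " + ",".join(missing)))
                    else:
                        n_ok += 1
            if len(errors) >= max_errors:
                break
    return n_ok, errors


if __name__ == "__main__":
    # Check every file named on the command line.
    for corpus_file in sys.argv[1:]:
        ok, problems = check_tweet_file(corpus_file)
        print(corpus_file, "valid lines:", ok, "problems:", problems)
```

Saved under a name of your choosing, the script can be run with one or more corpus files as arguments; a line that was cut off mid-object is reported with its line number.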

Once you have the collection in this format, follow these instructions to index it:

  • Download Terrier 4.0 from http://terrier.org and extract it.

  • You will need the JAVA_HOME environment variable to be set. Windows users in particular should check that this has been set.

  • Add the following properties to your etc/terrier.properties:

#use the new collection class
trec.collection.class=TwitterJSONCollection

#record extra fields in the index
FieldTags.process=TWEET,RAW,NAME,SNAME,LOC

#record extra information in the meta index
indexer.meta.forward.keys=docno,id,created_at,text,retweet_count,in_reply_to_screen_name,in_reply_to_user_id,in_reply_to_status_id,user.name,user.screen_name,user.lang,user.profile_image_url,place.name,place.id,geo.lat,geo.lng,retweet.text,retweet.id,retweet.created_at,retweet.retweet_count,retweet.in_reply_to_screen_name,retweet.in_reply_to_user_id,retweet.in_reply_to_status_id,retweet.user.name,retweet.user.screen_name,retweet.user.lang,retweet.user.profile_image_url,retweet.place.name,retweet.place.id,retweet.geo.lat,retweet.geo.lng
indexer.meta.forward.keylens=32,30,30,200,10,30,30,30,60,30,10,250,160,30,30,30,200,30,30,10,30,30,30,160,60,10,250,160,30,30,30

#additional configuration for single pass indexing
docs.check=50
memory.heap.usage=0.70
indexing.max.docs.per.builder=100000000

  • Create a collection.spec file in the etc/ folder containing the full path and filename of every file in your tweets corpus. The files must be gzipped (as of version 4.0).

  • Run Terrier indexing as normal, e.g. using bin/trec_terrier.sh -i, or bin/trec_terrier.sh -i -j for single-pass indexing (use trec_terrier.bat on Windows).
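
Two of the steps above are easy to get wrong by hand: listing every corpus file in collection.spec, and keeping indexer.meta.forward.keys and indexer.meta.forward.keylens the same length. The Python sketch below is a hypothetical helper, not part of Terrier. It assumes that collection.spec holds one absolute path per line, that corpus files end in .gz, and that the properties file uses simple key=value lines without line continuations. The names write_collection_spec and count_meta_entries are invented for illustration.

```python
import os


def write_collection_spec(corpus_dir, spec_path):
    """Write the absolute path of every .gz file under corpus_dir, one per line.

    Returns the sorted list of paths that were written.
    """
    paths = []
    for root, _dirs, files in os.walk(corpus_dir):
        for name in files:
            if name.endswith(".gz"):
                paths.append(os.path.abspath(os.path.join(root, name)))
    paths.sort()
    with open(spec_path, "w", encoding="utf-8") as out:
        for path in paths:
            out.write(path + "\n")
    return paths


def count_meta_entries(properties_path):
    """Return (number of meta keys, number of key lengths) from a properties file.

    The two counts should match; a mismatch usually means an entry was
    added to one list but not the other.
    """
    props = {}
    # latin-1 is the traditional encoding of Java properties files
    # and never fails to decode.
    with open(properties_path, encoding="latin-1") as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue  # skip blanks, comments, and malformed lines
            key, value = line.split("=", 1)
            props[key.strip()] = value.strip()
    keys = [k for k in props.get("indexer.meta.forward.keys", "").split(",") if k]
    lens = [v for v in props.get("indexer.meta.forward.keylens", "").split(",") if v]
    return len(keys), len(lens)
```

For example, write_collection_spec("/data/tweets11", "etc/collection.spec") would list the corpus (the /data/tweets11 path is a placeholder), and count_meta_entries("etc/terrier.properties") should return two equal numbers for the configuration shown above.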

last edited 2015-04-02 21:03:22 by IadhOunis