Terrier/FIRE

FIRE Data

The corpus for the FIRE-2008 and FIRE-2010 ad hoc tasks is in Indian languages: the documents are in Bengali, Hindi and Marathi, and the queries are in Bengali, Hindi, Marathi, Tamil, Telugu, Malayalam, Gujarati, etc.

Code Changes

Some code changes are necessary before Terrier can index or retrieve FIRE data. They are listed below:

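Wherever the code calls Character.isLetterOrDigit((char)ch), change the call to Character.isUnicodeIdentifierPart((char)ch). Unlike isLetterOrDigit, isUnicodeIdentifierPart also accepts combining marks, such as the dependent vowel signs used in Indian scripts.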

Wherever the code calls either sw.write((char)ch) or sw.write(ch), change the call to if (ch != 0) sw.write((char)ch); and make the same change for calls to sw.append.
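The effect of the two changes can be seen in a small standalone sketch. This is not Terrier's tokeniser: the class FireTokenSketch and its tokenize method are hypothetical, written only to mimic a character-by-character loop that collects accepted characters in a StringWriter named sw. It assumes the NUL guard is there because isUnicodeIdentifierPart, unlike isLetterOrDigit, returns true for the NUL character.

```java
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of Terrier.
class FireTokenSketch {

    // Splits text into tokens. With patched == false it uses the original
    // test and write; with patched == true it uses the two changes above.
    static List<String> tokenize(String text, boolean patched) {
        List<String> tokens = new ArrayList<>();
        StringWriter sw = new StringWriter();
        for (int i = 0; i < text.length(); i++) {
            int ch = text.charAt(i);
            boolean accepted = patched
                    ? Character.isUnicodeIdentifierPart((char) ch)
                    : Character.isLetterOrDigit((char) ch);
            if (accepted) {
                if (patched) {
                    // Patched write: NUL passes the test above, so skip it here.
                    if (ch != 0) sw.write((char) ch);
                } else {
                    sw.write((char) ch);
                }
            } else if (sw.getBuffer().length() > 0) {
                // A rejected character ends the current token.
                tokens.add(sw.toString());
                sw = new StringWriter();
            }
        }
        if (sw.getBuffer().length() > 0) {
            tokens.add(sw.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The Hindi word "Hindi": consonants with vowel signs and an anusvara.
        String word = "\u0939\u093F\u0902\u0926\u0940";
        System.out.println(tokenize(word, false)); // split at each combining mark
        System.out.println(tokenize(word, true));  // kept as one token
    }
}
```

With the original test the word breaks at every vowel sign, leaving only the bare consonants as tokens; with the patched test it survives as a single token, and any NUL characters in the input are dropped rather than written into the term.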

To make things easier, FIRE has worked with the University of Glasgow to produce patches. These are attached to this wiki page: one for Terrier v2.2.1 and one for Terrier v3.0.

Configuration Properties

The following changes are needed in the terrier.properties file:

TrecDocTags.doctag=DOC
TrecDocTags.idtag=DOCNO
trec.collection.class=TRECUTFCollection
string.use_utf=true
trec.encoding=utf-8

# for Terrier 3
indexer.meta.forward.keylens=50
# for Terrier 2
docno.byte.length=50