FIRE Data
The corpus for the FIRE-2008 and FIRE-2010 ad hoc tasks covers Indian languages: documents in Bengali, Hindi and Marathi, and queries in Bengali, Hindi, Marathi, Tamil, Telugu, Malayalam, Gujarati, etc.
Terrier works well as a retrieval system for FIRE, but it needs careful configuration and, for older versions, a few code changes. Below, we detail the code changes and a recommended configuration.
Code Changes
Terrier 3.5 has all the necessary code changes to support FIRE. However, for earlier versions (2.2.1 and 3.0), some changes in the code are necessary for indexing/retrieving FIRE data. These changes are listed below:
For each case where the code uses Character.isLetterOrDigit((char)ch), change it to Character.isUnicodeIdentifierPart((char)ch)
For each case where the code uses either sw.write((char)ch) or sw.write(ch), change it to if (ch != 0) sw.write((char)ch); and make the same change to calls to sw.append.
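The effect of the two changes above can be sketched in a small, self-contained Java class. This is an illustration only, not Terrier code: the class name FireTokenSketch, the methods tokenise and isTokenChar, and the patched flag are all hypothetical. The sketch relies on two documented properties of java.lang.Character: isLetterOrDigit rejects combining marks such as Devanagari vowel signs, while isUnicodeIdentifierPart accepts them, and isUnicodeIdentifierPart also accepts the NUL character (it is identifier-ignorable), which may be why the ch != 0 guard is needed.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a character-level tokenising loop; not part of Terrier.
class FireTokenSketch {

    // patched == false: the original check; patched == true: the FIRE change.
    static boolean isTokenChar(int ch, boolean patched) {
        return patched
                ? Character.isUnicodeIdentifierPart((char) ch)
                : Character.isLetterOrDigit((char) ch);
    }

    // Splits text into tokens at every character that is not a token character.
    static List<String> tokenise(String text, boolean patched) {
        List<String> tokens = new ArrayList<>();
        StringWriter sw = new StringWriter();
        try (Reader in = new StringReader(text)) {
            int ch;
            while ((ch = in.read()) != -1) {
                if (isTokenChar(ch, patched)) {
                    // Second FIRE change: never write NUL into the token.
                    if (ch != 0) sw.write((char) ch);
                } else {
                    flush(sw, tokens);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        flush(sw, tokens);
        return tokens;
    }

    // Emits the buffered token, if any, and clears the buffer.
    private static void flush(StringWriter sw, List<String> tokens) {
        if (sw.getBuffer().length() > 0) {
            tokens.add(sw.toString());
            sw.getBuffer().setLength(0);
        }
    }

    public static void main(String[] args) {
        // The Hindi word "Hindi": HA, VOWEL SIGN I, ANUSVARA, DA, VOWEL SIGN II.
        String word = "\u0939\u093F\u0902\u0926\u0940";
        System.out.println(tokenise(word, false)); // split at each combining mark
        System.out.println(tokenise(word, true));  // kept as one token
    }
}
```

With the original check, the word is broken into the two bare consonants because the vowel signs and the anusvara are treated as separators; with the changed check it survives as a single token.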
To make things easier, FIRE has worked with the University of Glasgow to produce patches. These are attached to this wiki page: one for Terrier v2.2.1 (FIRE-tr2.2.1-v1.patch) and one for Terrier v3.0 (FIRE-tr3.0-v1.patch). To apply a patch, download it to your Terrier folder, apply it with the patch command (patch -p0 < FIRE-tr-etc.patch), and then follow the instructions in the Terrier documentation about recompiling. NB: No patches are required for Terrier 3.5.
Example, Terrier 2:
patch -p0 < FIRE-tr2.2.1-v1.patch
make clean compile
Note that once this patch has been applied, we recommend that you rebuild your indices using trec_terrier.sh -i.
Configuration Properties
Make the following changes in the terrier.properties file:
TrecDocTags.doctag=DOC
TrecDocTags.idtag=DOCNO
trec.encoding=utf-8

#for Terrier 2 and Terrier 3.0
trec.collection.class=TRECUTFCollection
string.use_utf=true

#for Terrier 3.5
trec.collection.class=TRECCollection
tokeniser=UTFTokeniser

#for Terrier 3 and 3.5
indexer.meta.forward.keylens=50

#for Terrier 2
docno.byte.length=50
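As a quick sanity check before indexing, the settings above can be verified programmatically. The following is a minimal sketch, assuming a Terrier 3.5 setup and assuming that each "#for Terrier ..." comment applies to the properties that follow it. The class FirePropertiesCheck and its methods are hypothetical helpers, not part of Terrier; the sketch uses only java.util.Properties from the standard library.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

// Hypothetical helper: checks a properties text against the Terrier 3.5 settings listed on this page.
class FirePropertiesCheck {

    // Expected keys and values for Terrier 3.5, copied from the list above.
    static Map<String, String> expectedFor35() {
        Map<String, String> m = new LinkedHashMap<>();
        m.put("TrecDocTags.doctag", "DOC");
        m.put("TrecDocTags.idtag", "DOCNO");
        m.put("trec.encoding", "utf-8");
        m.put("trec.collection.class", "TRECCollection");
        m.put("tokeniser", "UTFTokeniser");
        m.put("indexer.meta.forward.keylens", "50");
        return m;
    }

    // Returns one message per missing or mismatched property; empty if all match.
    static List<String> problems(String propertiesText) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> e : expectedFor35().entrySet()) {
            String actual = p.getProperty(e.getKey());
            if (actual == null) {
                out.add(e.getKey() + " is missing (expected " + e.getValue() + ")");
            } else if (!actual.trim().equals(e.getValue())) {
                out.add(e.getKey() + "=" + actual.trim() + " (expected " + e.getValue() + ")");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "TrecDocTags.doctag=DOC\nTrecDocTags.idtag=DOCNO\ntrec.encoding=utf-8\n"
                + "#for Terrier 3.5\ntrec.collection.class=TRECCollection\ntokeniser=UTFTokeniser\n"
                + "indexer.meta.forward.keylens=50\n";
        System.out.println(problems(text)); // an empty list means every setting matches
    }
}
```

In practice the text would be read from your own terrier.properties file; the inline string in main is only there to keep the sketch self-contained.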