The corpus for the FIRE-2008 and FIRE-2010 ad hoc tasks contains Indian-language text: documents in Bengali, Hindi and Marathi, and queries in Bengali, Hindi, Marathi, Tamil, Telugu, Malayalam, Gujarati, etc.
Terrier is a great choice of retrieval system for FIRE. However, it needs careful configuration and a few code changes. Below, we detail the code changes and a recommended configuration.
Some changes in the code are necessary for indexing/retrieving FIRE data. These changes are listed below:
Wherever the code uses Character.isLetterOrDigit((char)ch), change it to Character.isUnicodeIdentifierPart((char)ch).
Wherever the code uses either sw.write((char)ch) or sw.write(ch), guard the call with a check for the null character: if (ch != 0) sw.write((char)ch); The same change has to be made for calls to sw.append.
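The effect of these two changes can be seen in a small standalone sketch. It is not Terrier's code: the class FireTokenSketch and its tokenize method are hypothetical, and the tokenizer is deliberately simplified to one rule (a token ends at the first rejected character). It shows that Character.isLetterOrDigit rejects Devanagari vowel signs and the virama, which are combining marks, so a Hindi word is split into fragments, while Character.isUnicodeIdentifierPart accepts them. The latter also accepts the null character, which is one plausible reason the write calls need the ch != 0 guard.

```java
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified tokenizer; not part of Terrier.
public class FireTokenSketch {

    // Character test before (patched = false) and after (patched = true) the change.
    static boolean accept(int ch, boolean patched) {
        return patched
            ? Character.isUnicodeIdentifierPart((char) ch)
            : Character.isLetterOrDigit((char) ch);
    }

    // Split text into tokens; a token ends at the first rejected character.
    static List<String> tokenize(String text, boolean patched) {
        List<String> tokens = new ArrayList<>();
        StringWriter sw = new StringWriter();
        for (int i = 0; i < text.length(); i++) {
            int ch = text.charAt(i);
            if (accept(ch, patched)) {
                // Guard from the second change: never write the null character.
                if (ch != 0) sw.write((char) ch);
            } else if (sw.getBuffer().length() > 0) {
                tokens.add(sw.toString());
                sw = new StringWriter();
            }
        }
        if (sw.getBuffer().length() > 0) tokens.add(sw.toString());
        return tokens;
    }

    public static void main(String[] args) {
        // The Hindi word "Hindi": ha, vowel sign i, na, virama, da, vowel sign ii.
        String hindi = "\u0939\u093F\u0928\u094D\u0926\u0940";
        System.out.println(tokenize(hindi, false).size()); // 3 fragments
        System.out.println(tokenize(hindi, true).size());  // 1 whole word
    }
}
```

With the original test the word breaks into three single-consonant fragments; with the patched test it stays intact as one six-character token.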
To make things easier, FIRE have worked with the Univ. of Glasgow to produce patches. These are attached to this wiki page: one for Terrier v2.2.1 (FIRE-tr2.2.1-v1.patch) and one for Terrier v3.0 (FIRE-tr3.0-v1.patch). To apply a patch, download it to your Terrier folder, apply it with the patch command (patch -p0 < FIRE-tr-etc.patch), and finally follow the instructions in the Terrier documentation about recompiling.
Example for Terrier 2:
patch -p0 < FIRE-tr2.2.1-v1.patch
make clean compile
Note that once this patch has been applied, we recommend that you rebuild your indices with trec_terrier.sh -i.
The following settings are needed in the terrier.properties file:
TrecDocTags.doctag=DOC
TrecDocTags.idtag=DOCNO
trec.collection.class=TRECUTFCollection
string.use_utf=true
trec.encoding=utf-8
#for Terrier 3
indexer.meta.forward.keylens=50
#for Terrier 2
docno.byte.length=50
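As a rough illustration of why the encoding settings matter, the sketch below decodes the same UTF-8 bytes with two charsets using only the standard Java library. The class EncodingSketch is hypothetical and does not touch Terrier. It assumes the collection files are stored as UTF-8, as the trec.encoding=utf-8 setting suggests.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; not part of Terrier.
public class EncodingSketch {

    // Decode raw bytes with the given charset.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // The Hindi word "Hindi" (6 characters, 3 bytes each in UTF-8).
        String hindi = "\u0939\u093F\u0928\u094D\u0926\u0940";
        byte[] raw = hindi.getBytes(StandardCharsets.UTF_8);

        String right = decode(raw, StandardCharsets.UTF_8);
        String wrong = decode(raw, StandardCharsets.ISO_8859_1);

        System.out.println(right.equals(hindi)); // true
        System.out.println(wrong.length());      // 18, one character per byte
    }
}
```

Decoded with a single-byte charset, each Devanagari character turns into three unrelated Latin-1 characters, so indexed terms would never match UTF-8 queries.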