Terrier/FIRE

The corpus for the FIRE-2008 and FIRE-2010 ad hoc tasks is in Indian languages: the documents are in Bengali, Hindi and Marathi, and the queries are in Bengali, Hindi, Marathi, Tamil, Telugu, Malayalam, Gujarati, etc.

Indexing

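Some changes to the Terrier source code are necessary for indexing and retrieving FIRE data, because the existing character tests treat only ASCII letters and digits (or Unicode letters and digits, without combining marks) as part of a term. The changes are listed below: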

Wherever the code tests for a non-term character using either
    !Character.isLetterOrDigit((char)ch)
or
    (ch < 'A' || ch > 'Z') && (ch < 'a' || ch > 'z') && (ch < '0' || ch > '9')
change the test to
    !Character.isUnicodeIdentifierPart((char)ch)

Wherever the code tests for a term character using either
    Character.isLetterOrDigit((char)ch)
or
    ((ch >= 'A') && (ch <= 'Z')) || ((ch >= 'a') && (ch <= 'z')) || ((ch >= '0') && (ch <= '9'))
change the test to
    Character.isUnicodeIdentifierPart((char)ch)

Wherever the code uses either sw.write((char)ch) or sw.write(ch), change the call to if (ch != 0) sw.write((char)ch);

Make the same change for calls to sw.append.
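The effect of these replacements can be illustrated with a small standalone sketch. It is not Terrier code: the class FireTokenSketch and the helpers isTokenChar and firstToken are hypothetical names invented for this example, and the loop is only a simplified stand-in for a tokeniser. It shows that Character.isLetterOrDigit rejects Devanagari vowel signs and the virama (which are combining marks), so a word would be cut into fragments, while Character.isUnicodeIdentifierPart accepts them. It also shows that the NUL character passes the new test, which is one plausible reason for the ch != 0 guard.

```java
import java.io.StringWriter;

// Hypothetical illustration only; not part of the Terrier code base.
class FireTokenSketch {

    // The replacement test: true for letters, digits and combining marks,
    // among other identifier characters.
    static boolean isTokenChar(int ch) {
        return Character.isUnicodeIdentifierPart((char) ch);
    }

    // Simplified stand-in for a tokeniser: returns the first run of
    // token characters in the text, skipping NUL characters.
    static String firstToken(String text) {
        StringWriter sw = new StringWriter();
        for (int i = 0; i < text.length(); i++) {
            int ch = text.charAt(i);
            if (!isTokenChar(ch)) {
                if (sw.getBuffer().length() > 0) break; // token finished
                continue;                               // skip leading separators
            }
            if (ch != 0) sw.write((char) ch);           // the guarded write
        }
        return sw.toString();
    }

    public static void main(String[] args) {
        // The Hindi word "Hindi": HA, VOWEL SIGN I, NA, VIRAMA, DA, VOWEL SIGN II
        String hindi = "\u0939\u093F\u0928\u094D\u0926\u0940";
        char vowelSign = '\u093F';

        // The old test rejects the vowel sign, the new one accepts it.
        System.out.println(Character.isLetterOrDigit(vowelSign)); // false
        System.out.println(isTokenChar(vowelSign));               // true

        // With the new test the whole word survives as one token.
        System.out.println(firstToken(hindi + " \u092D\u093E\u0937\u093E").equals(hindi)); // true

        // NUL passes the new test, so the write has to filter it out.
        System.out.println(isTokenChar(0));                       // true
        System.out.println(firstToken("a\u0000b").equals("ab"));  // true
    }
}
```

Note that Character.isUnicodeIdentifierPart also accepts connecting punctuation such as the underscore, so terms may contain a few characters that the original ASCII test would have treated as separators.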

The following properties must be set in the terrier.properties file:

TrecDocTags.doctag=DOC
TrecDocTags.idtag=DOCNO
indexer.meta.forward.keylens=50
trec.collection.class=TRECUTFCollection
string.use_utf=true
trec.encoding=utf-8
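Before starting a long indexing run it can help to check that these settings are actually present. The following is a minimal sketch of such a check using only the standard java.util.Properties class; it does not use any Terrier API. The class FirePropertiesCheck and its methods are hypothetical names, and the checks (all six keys present, a supported encoding, numeric key lengths) are assumptions about what is worth verifying, not rules taken from Terrier.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.util.Properties;

// Hypothetical sanity check; not part of Terrier.
class FirePropertiesCheck {

    // The keys listed above.
    static final String[] REQUIRED = {
        "TrecDocTags.doctag", "TrecDocTags.idtag",
        "indexer.meta.forward.keylens", "trec.collection.class",
        "string.use_utf", "trec.encoding"
    };

    // Parses properties from text (a file reader would work the same way).
    static Properties parse(String text) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return p;
    }

    // Returns a description of the first problem found, or null if none.
    static String firstProblem(Properties p) {
        for (String key : REQUIRED) {
            if (p.getProperty(key) == null) return "missing " + key;
        }
        String enc = p.getProperty("trec.encoding").trim();
        if (!Charset.isSupported(enc)) return "unsupported encoding " + enc;
        // Treat the value as a comma-separated list of integers.
        for (String len : p.getProperty("indexer.meta.forward.keylens").split(",")) {
            try {
                Integer.parseInt(len.trim());
            } catch (NumberFormatException e) {
                return "non-numeric key length " + len;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String text = String.join("\n",
            "TrecDocTags.doctag=DOC",
            "TrecDocTags.idtag=DOCNO",
            "indexer.meta.forward.keylens=50",
            "trec.collection.class=TRECUTFCollection",
            "string.use_utf=true",
            "trec.encoding=utf-8");
        String problem = firstProblem(parse(text));
        System.out.println(problem == null ? "ok" : problem);
    }
}
```

To check a real installation, the same firstProblem method could be applied to properties loaded from the terrier.properties file with a FileReader.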