Terrier/FIRE

The corpora of the FIRE-2008 and FIRE-2010 ad hoc tasks contain text in Indian languages: documents in Bengali, Hindi and Marathi, and queries in Bengali, Hindi, Marathi, Tamil, Telugu, Malayalam, Gujarati and others.

Indexing

Some changes to the Terrier source code are necessary before FIRE data can be indexed or retrieved. They are listed below.

For each case where the code uses either [!Character.isLetterOrDigit((char)ch)] or [(ch < 'A' || ch > 'Z') && (ch < 'a' || ch > 'z') && (ch < '0' || ch > '9')], change it to [!Character.isUnicodeIdentifierPart((char)ch)]

For each case where the code uses either [Character.isLetterOrDigit((char)ch)] or [((ch >= 'A') && (ch <= 'Z')) || ((ch >= 'a') && (ch <= 'z')) || ((ch >= '0') && (ch <= '9'))], change it to [Character.isUnicodeIdentifierPart((char)ch)]

For each case where the code uses either [sw.write((char)ch)] or [sw.write(ch)] (or the corresponding append call), change it to [if (ch != 0) sw.write((char)ch);] (keeping write or append as in the original call)
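
The effect of these replacements can be seen in a small standalone sketch. It is not Terrier code: the class FireTokenSketch and its methods are hypothetical names used only for illustration, and the loop is a simplified stand-in for a tokeniser. It assumes the reason for the change is that Indic vowel signs and the virama are combining marks rather than letters, so a test based on isLetterOrDigit cuts words apart at those characters.

```java
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; it is not part of Terrier.
class FireTokenSketch {

    // The test the patched code uses to decide whether ch belongs to a term.
    static boolean isTermChar(int ch) {
        return Character.isUnicodeIdentifierPart((char) ch);
    }

    // Collects maximal runs of term characters from the input text.
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        StringWriter sw = new StringWriter();
        for (int i = 0; i < text.length(); i++) {
            int ch = text.charAt(i);
            if (isTermChar(ch)) {
                // Guard from the third change: never write a NUL character,
                // which isUnicodeIdentifierPart accepts as "ignorable".
                if (ch != 0) sw.write((char) ch);
            } else if (sw.getBuffer().length() > 0) {
                terms.add(sw.toString());
                sw.getBuffer().setLength(0);
            }
        }
        if (sw.getBuffer().length() > 0) terms.add(sw.toString());
        return terms;
    }

    public static void main(String[] args) {
        // Two Hindi words, written with escapes so the source stays ASCII.
        String text = "\u0939\u093F\u0928\u094D\u0926\u0940 \u092D\u093E\u0937\u093E";
        System.out.println(tokenize(text));
        // U+093F (DEVANAGARI VOWEL SIGN I) is a combining mark.
        System.out.println(Character.isLetterOrDigit('\u093F'));
        System.out.println(Character.isUnicodeIdentifierPart('\u093F'));
    }
}
```

With the original isLetterOrDigit test, the first word would be broken at each vowel sign and at the virama. Note that isUnicodeIdentifierPart also accepts the underscore and ignorable control characters, which is presumably why the NUL guard is needed.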

Then set the following properties in etc/terrier.properties:

TrecDocTags.doctag=DOC
TrecDocTags.idtag=DOCNO
indexer.meta.forward.keylens=50
trec.collection.class=TRECUTFCollection
string.use_utf=true
trec.encoding=utf-8