Terrier/CollectionOfFiles

Indexing a collection of documents

If you have a collection of files in mixed formats, such as plain text (.txt), Microsoft Word (.doc), Microsoft Excel (.xls), Microsoft PowerPoint (.ppt) or PDF (.pdf), Terrier can index them using the SimpleFileCollection class. There are two ways to generate the index:

1. A simple method is to use the DesktopTerrier to create the index.

2. Configure Terrier to index using SimpleFileCollection, by setting the following properties:


trec.collection.class=SimpleFileCollection

#indexing.simplefilecollection.extensionsparsers - use this to define parsers for known file extensions
indexing.simplefilecollection.extensionsparsers=txt:FileDocument,text:FileDocument,tex:FileDocument,bib:FileDocument,pdf:PDFDocument,html:HTMLDocument,htm:HTMLDocument,xhtml:HTMLDocument,html:TaggedDocument,doc:MSWordDocument,ppt:MSPowerpointDocument,xls:MSExcelDocument

#if this is defined, Terrier will use the given parser to open any file for which it has no explicit parser
#indexing.simplefilecollection.defaultparser=FileDocument

#configure the meta index to record the filename
indexer.meta.forward.keys=filename
indexer.meta.forward.keylens=256

#whether directories should be opened and the files inside them indexed
indexing.simplefilecollection.recurse=true


In particular, the property indexing.simplefilecollection.extensionsparsers is a comma-separated list of extension:class pairs. Each pair maps a file extension (e.g. txt) to the class used to read and parse files with that extension.
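
To sanity-check a value before putting it in the configuration, the extension:class format described above can be split apart with a few lines of Python. This is only an illustrative sketch and not Terrier's own parsing code; the function names parse_extension_parsers and duplicate_extensions are hypothetical. It reports extensions that appear more than once, since this page does not say which parser is used when an extension is listed twice.

```python
def parse_extension_parsers(value):
    """Split 'ext:Class,ext:Class' into a list of (ext, Class) tuples, in order."""
    pairs = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing comma
        ext, sep, parser = item.partition(":")
        if not sep or not ext.strip() or not parser.strip():
            raise ValueError(f"malformed entry: {item!r}")
        pairs.append((ext.strip(), parser.strip()))
    return pairs


def duplicate_extensions(pairs):
    """Return {ext: [Class, ...]} for every extension listed more than once."""
    seen = {}
    for ext, parser in pairs:
        seen.setdefault(ext, []).append(parser)
    return {ext: parsers for ext, parsers in seen.items() if len(parsers) > 1}


if __name__ == "__main__":
    value = "txt:FileDocument,pdf:PDFDocument,html:HTMLDocument,html:TaggedDocument"
    pairs = parse_extension_parsers(value)
    print(pairs)
    print(duplicate_extensions(pairs))
```

Running the check on the full property value makes it easy to spot an extension that has been mapped to two different classes.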

Finally, add the list of files to be indexed to the etc/collection.spec file.
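
For a large directory tree, the file list can be generated rather than typed by hand. The sketch below assumes that etc/collection.spec takes one file path per line; the helper names list_files and write_collection_spec, the example directory, and the chosen extensions are all hypothetical and should be adapted to your own setup.

```python
import os


def list_files(root, extensions):
    """Return sorted absolute paths under root whose extension is in extensions."""
    wanted = {e.lower().lstrip(".") for e in extensions}
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            ext = os.path.splitext(name)[1].lstrip(".").lower()
            if ext in wanted:
                found.append(os.path.abspath(os.path.join(dirpath, name)))
    return sorted(found)


def write_collection_spec(paths, spec_path):
    """Write one path per line (assumed collection.spec layout)."""
    with open(spec_path, "w", encoding="utf-8") as out:
        for path in paths:
            out.write(path + "\n")


if __name__ == "__main__":
    # Hypothetical locations: adjust to your documents and Terrier install.
    files = list_files("/path/to/documents", ["txt", "pdf", "doc"])
    write_collection_spec(files, "etc/collection.spec")
```

Filtering by extension here keeps the list consistent with the extensions that have a parser configured.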

Query Biased Summarisation

Terrier can support query-biased summarisation (e.g. for the Web results interface) for files indexed by SimpleFileCollection.


#for documents other than "tagged"
FileDocument.abstract.length=1024

TaggedDocument.abstracts=title,abstract
TaggedDocument.abstracts.tags=title,ELSE
TaggedDocument.abstracts.lengths=100,2048

#configure the meta index to record the filename, title and abstract
indexer.meta.forward.keys=filename,title,abstract
indexer.meta.forward.keylens=256,100,1024
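
The abstract lengths and the meta index key lengths are configured in separate, position-matched lists, so it is easy for them to drift apart. The following sketch compares the two; it is not part of Terrier, the function names are hypothetical, and it makes no claim about how Terrier handles a value that is longer than its meta key length. It only flags the mismatch so that you can decide which setting to change.

```python
def parse_meta_keys(keys, keylens):
    """Pair each meta key with its length; both lists must have the same size."""
    names = [k.strip() for k in keys.split(",")]
    lengths = [int(n) for n in keylens.split(",")]
    if len(names) != len(lengths):
        raise ValueError("keys and keylens must have the same number of entries")
    return dict(zip(names, lengths))


def oversized_abstracts(abstract_names, abstract_lengths, meta):
    """Return {name: (abstract_length, meta_length)} where the abstract is longer.

    A name missing from the meta index is reported with a meta length of 0.
    """
    abstracts = parse_meta_keys(abstract_names, abstract_lengths)
    return {
        name: (length, meta.get(name, 0))
        for name, length in abstracts.items()
        if length > meta.get(name, 0)
    }


if __name__ == "__main__":
    meta = parse_meta_keys("filename,title,abstract", "256,100,1024")
    print(oversized_abstracts("title,abstract", "100,2048", meta))
```

With the values shown on this page, the check reports that the abstract length of 2048 is larger than the 1024 characters reserved for the abstract key.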

last edited 2011-06-09 18:39:07 by CraigMacdonald