Indexing a collection of documents
If you have a collection of files in various formats, such as plain text (.txt), Microsoft Word (.doc), Microsoft Excel (.xls), Microsoft PowerPoint (.ppt) or PDF (.pdf), then Terrier can index them, mainly courtesy of the SimpleFileCollection class. There are two ways to generate the index:
1. A simple method is to use Desktop Terrier to create the index.
2. Configure Terrier to index using SimpleFileCollection, with properties such as the following:
trec.collection.class=SimpleFileCollection
#indexing.simplefilecollection.extensionsparsers - use this to define parsers for known file extensions
indexing.simplefilecollection.extensionsparsers=txt:FileDocument,text:FileDocument,tex:FileDocument,bib:FileDocument,pdf:PDFDocument,html:HTMLDocument,htm:HTMLDocument,xhtml:HTMLDocument,html:TaggedDocument,doc:MSWordDocument,ppt:MSPowerpointDocument,xls:MSExcelDocument
#if this is defined, then terrier will attempt to open any file it doesn't have an explicit parser for with the parser given
#indexing.simplefilecollection.defaultparser=FileDocument
#configure the meta index to record the filename
indexer.meta.forward.keys=filename
indexer.meta.forward.keylens=256
#If directories should be opened and any files indexed
indexing.simplefilecollection.recurse=true
In particular, the property indexing.simplefilecollection.extensionsparsers is a comma-separated list of extension:class pairs. Each pair maps a file extension (e.g. txt) to the class used to read and parse files with that extension.
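To make the format of that property concrete, here is a minimal Python sketch that splits such a value into (extension, class) pairs and looks up the class for a filename. It only illustrates the syntax and is not Terrier's own code. The function names are hypothetical, and lower-casing extensions and taking the first matching pair are assumptions of this sketch.

```python
import os


def parse_extension_parsers(value):
    """Split 'ext:Class,ext:Class' into a list of (extension, class name) pairs."""
    pairs = []
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue
        extension, _, class_name = entry.partition(":")
        pairs.append((extension.strip().lower(), class_name.strip()))
    return pairs


def parser_for(filename, pairs, default=None):
    """Return the class name paired with the filename's extension, or default.

    Assumption of this sketch: the first matching pair wins.
    """
    extension = os.path.splitext(filename)[1].lstrip(".").lower()
    for ext, class_name in pairs:
        if ext == extension:
            return class_name
    return default


pairs = parse_extension_parsers("txt:FileDocument,pdf:PDFDocument,doc:MSWordDocument")
print(parser_for("report.pdf", pairs))
print(parser_for("notes.xyz", pairs, default="FileDocument"))
```

The `default` argument plays the role that indexing.simplefilecollection.defaultparser has in the configuration: it is what an unlisted extension falls back to.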
Finally, you should add the list of files to be indexed into the etc/collection.spec file.
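One way to build that list is with a small script. The Python sketch below walks a directory tree and writes the path of every file with a wanted extension, assuming collection.spec takes one path per line. The function names and the example paths in the comments are hypothetical.

```python
import os


def list_files(root, extensions):
    """Return sorted absolute paths of files under root with a wanted extension."""
    wanted = {ext.lower().lstrip(".") for ext in extensions}
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            ext = os.path.splitext(name)[1].lstrip(".").lower()
            if ext in wanted:
                found.append(os.path.abspath(os.path.join(dirpath, name)))
    return sorted(found)


def write_collection_spec(paths, spec_path):
    """Write one path per line to spec_path (assumed collection.spec layout)."""
    with open(spec_path, "w", encoding="utf-8") as spec:
        for path in paths:
            spec.write(path + "\n")


# Example usage, with hypothetical locations:
# paths = list_files("/path/to/documents", ["txt", "pdf", "doc"])
# write_collection_spec(paths, "etc/collection.spec")
```

Filtering by extension here avoids listing files that have no parser configured.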
Query Biased Summarisation
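Terrier can support query-biased summarisation (e.g. for the Web results interface) for files indexed by SimpleFileCollection. To do so, configure the document classes to save abstracts and record them in the meta index, for example: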
#for documents other than "tagged"
FileDocument.abstract.length=1024
TaggedDocument.abstracts=title,abstract
TaggedDocument.abstracts.tags=title,ELSE
TaggedDocument.abstracts.lengths=100,2048
#configure the meta index to record the filename, title and abstract
indexer.meta.forward.keys=filename,title,abstract
indexer.meta.forward.keylens=256,100,1024