Terrier/HowToXMLCollections

So you want to index an XML collection such as INEX?

First, you'll need to create a class which implements the Collection interface. It should find all the files it has to process, then open each one in turn and create a Document object for each article. I'd suggest you call this INEXCollection.
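The file-finding half of that could look something like the sketch below. Note that SimpleCollection is a hypothetical, cut-down stand-in written for this example, not Terrier's real Collection interface, whose methods and signatures may differ. The ".xml" suffix test and the getCurrentFile() method are likewise assumptions. A real implementation would hand each file to your Document class rather than return the File itself.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical, simplified stand-in for Terrier's Collection interface. */
interface SimpleCollection {
    /** Moves to the next document; returns false once there are none left. */
    boolean nextDocument();

    /** Returns the file backing the current document. */
    File getCurrentFile();
}

/** Walks a directory tree and exposes every .xml file as one document. */
class INEXCollection implements SimpleCollection {
    private final List<File> files = new ArrayList<>();
    private int current = -1; // index of the current file, -1 before the first

    INEXCollection(File root) {
        collect(root);
        Collections.sort(files); // stable order, so document ids are repeatable
    }

    /** Recursively gathers files whose name ends in .xml (an assumption). */
    private void collect(File f) {
        if (f.isDirectory()) {
            File[] children = f.listFiles();
            if (children == null) {
                return; // unreadable directory: skip it
            }
            for (File child : children) {
                collect(child);
            }
        } else if (f.getName().toLowerCase().endsWith(".xml")) {
            files.add(f);
        }
    }

    @Override
    public boolean nextDocument() {
        if (current + 1 >= files.size()) {
            return false;
        }
        current++;
        return true;
    }

    @Override
    public File getCurrentFile() {
        return files.get(current);
    }
}
```

You would then loop with while (collection.nextDocument()) and build one Document per file inside the loop.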

Second, you'll want to create a class which implements the Document interface. In this you'll want to parse the document, probably using your favourite XML parser, e.g. SAX, DOM, whatever. The interface you need to implement contains two important methods: getTerm(), which returns each term of the document in turn as a String; and getFields(), which returns the fields the current term occurs in. If you want to do something special with the fields, you should return only the fields which the indexer should take note of.
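Here is a minimal sketch of such a class using the SAX parser that ships with the JDK. SimpleDocument is a hypothetical stand-in for Terrier's Document interface: it uses the method names from the text above plus an assumed endOfDocument(), and the real interface may differ. The field names "title" and "body", the lowercasing and the tokenisation rule are illustrative assumptions too.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical, simplified stand-in for Terrier's Document interface. */
interface SimpleDocument {
    String getTerm();          // next term, or null when there are none left
    boolean endOfDocument();   // true once every term has been returned
    Set<String> getFields();   // fields of the term last returned by getTerm()
}

class INEXDocument implements SimpleDocument {
    // Only these element names are reported to the indexer (an assumption).
    private static final Set<String> INDEXED_FIELDS =
            new HashSet<>(Arrays.asList("title", "body"));

    private final List<String> terms = new ArrayList<>();
    private final List<Set<String>> termFields = new ArrayList<>();
    private int pos = -1; // index of the term last returned

    INEXDocument(String xml) throws Exception {
        DefaultHandler handler = new DefaultHandler() {
            private final Deque<String> open = new ArrayDeque<>();
            private final StringBuilder text = new StringBuilder();

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                flush();
                open.push(qName.toLowerCase(Locale.ROOT));
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                flush();
                open.pop();
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // SAX may deliver text in several chunks, so buffer it first.
                text.append(ch, start, length);
            }

            /** Tokenises the buffered text and tags each term with its fields. */
            private void flush() {
                Set<String> fields = new HashSet<>(open);
                fields.retainAll(INDEXED_FIELDS); // drop tags the indexer should ignore
                String lower = text.toString().toLowerCase(Locale.ROOT);
                for (String t : lower.split("[^\\p{L}\\p{N}]+")) {
                    if (!t.isEmpty()) {
                        terms.add(t);
                        termFields.add(fields);
                    }
                }
                text.setLength(0);
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
    }

    @Override
    public String getTerm() {
        return endOfDocument() ? null : terms.get(++pos);
    }

    @Override
    public boolean endOfDocument() {
        return pos + 1 >= terms.size();
    }

    @Override
    public Set<String> getFields() {
        return pos < 0 ? Collections.<String>emptySet() : termFields.get(pos);
    }
}
```

Parsing the whole document up front keeps the sketch short. For large articles you might prefer a pull parser so that terms are produced lazily. A term nested inside a tag that is not in INDEXED_FIELDS still inherits any indexed ancestor, so text in a b element inside body is reported as being in body.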

Because you have taken note of the fields, Terrier will allow you to query within a field, e.g. title:term1 body:(term2 term3).


CategoryTerrier