Terrier/HowToXMLCollections

So you want to index an XML collection such as INEX?

Firstly, you'll need to create a class which implements the Collection interface. It finds all the files it has to process, then opens each one in turn and creates a Document object for each article. I'd suggest you call this INEXCollection if you're indexing INEX. You can change the Collection class used by TRECIndexing by altering the trec.collection.class property (see Terrier/FAQ).
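To make the file-finding part concrete, here is a minimal sketch of such a collection. It does not use Terrier's real Collection interface: XmlCollection, nextDocument() and getDocumentFile() are hypothetical stand-ins so the example compiles on its own, and the assumption that articles are files ending in .xml is mine. A real implementation would implement Terrier's interface and return a Document rather than a File.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for Terrier's Collection interface; take the real
// method names and signatures from the Terrier Javadoc.
interface XmlCollection {
    boolean nextDocument();   // advance to the next file; false when none are left
    File getDocumentFile();   // the file the collection is currently positioned on
}

// Finds every .xml file under a root directory and hands them out one at a time.
class INEXCollection implements XmlCollection {
    private final List<File> files = new ArrayList<>();
    private int current = -1; // index of the current file; -1 before the first call

    INEXCollection(File root) {
        collect(root);
    }

    // Recursively gather .xml files, in sorted order so indexing is repeatable.
    private void collect(File dir) {
        File[] entries = dir.listFiles();
        if (entries == null) return; // not a directory, or unreadable
        Arrays.sort(entries);
        for (File f : entries) {
            if (f.isDirectory()) collect(f);
            else if (f.getName().endsWith(".xml")) files.add(f);
        }
    }

    public boolean nextDocument() {
        if (current + 1 >= files.size()) return false;
        current++;
        return true;
    }

    // Only valid after nextDocument() has returned true at least once.
    public File getDocumentFile() {
        return files.get(current);
    }
}
```

The indexer would then loop with while (collection.nextDocument()) and process each file in turn.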

Secondly, you'll want to create a class which implements the Document interface. This is where you parse the document, probably using your favourite XML parser, e.g. SAX or DOM. The interface you need to implement contains two important methods: getTerm(), which returns the terms of the document one at a time, each as a String; and getFields(), which returns the fields the current term occurs in. If you want to do something special with the fields, you inform the indexer of which fields it should take note of, using FieldTags.process=title,body
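As a rough illustration of the parsing side, the sketch below uses the JDK's SAX parser to turn an article into a stream of terms, remembering which tags enclosed each one. XmlDocument is a hypothetical stand-in for Terrier's Document interface that only borrows the names getTerm() and getFields(); the lowercase, split-on-non-alphanumerics tokenisation and the choice to treat every enclosing tag as a field are assumptions of mine, not Terrier's behaviour.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for Terrier's Document interface; check the
// Terrier Javadoc for the real signatures.
interface XmlDocument {
    String getTerm();         // next term, or null at the end of the document
    Set<String> getFields();  // tags enclosing the term last returned
}

class INEXDocument extends DefaultHandler implements XmlDocument {
    private final List<String> terms = new ArrayList<>();
    private final List<Set<String>> termFields = new ArrayList<>();
    private final Deque<String> openTags = new ArrayDeque<>();
    private final StringBuilder text = new StringBuilder();
    private int position = -1;

    // Parses the whole article up front; terms are then served from memory.
    INEXDocument(String xml) throws Exception {
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), this);
    }

    // Tokenise the buffered text and tag each term with the currently open elements.
    private void flush() {
        for (String token : text.toString().toLowerCase().split("[^a-z0-9]+")) {
            if (token.isEmpty()) continue;
            terms.add(token);
            termFields.add(new HashSet<>(openTags));
        }
        text.setLength(0);
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        flush();              // text seen so far belongs to the outer elements only
        openTags.push(qName);
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        flush();              // text seen so far belongs to the element being closed
        openTags.pop();
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // SAX may deliver one text node in several chunks, so buffer until a tag boundary.
        text.append(ch, start, length);
    }

    public String getTerm() {
        if (position + 1 >= terms.size()) return null;
        position++;
        return terms.get(position);
    }

    // Only valid after getTerm() has returned a term.
    public Set<String> getFields() {
        return termFields.get(position);
    }
}
```

For an article such as <article><title>XML Retrieval</title><body>Indexing with Terrier</body></article>, the terms from the title carry the fields article and title, and the terms from the body carry article and body. With FieldTags.process=title,body the indexer would presumably only take note of the title and body entries.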

Because you have taken note of the fields, Terrier will allow you to query within a field, e.g. title:term1 body:(term2 term3).

If your XML collection doesn't have clear document boundaries, i.e. it isn't obvious how big a retrieval unit should be (document, chapter, section, paragraph), then you'll need to think carefully about how you index and retrieve from such a collection.



last edited 2005-10-17 15:05:18 by CraigMacdonald