INDEXING DESIGN
Terrier Indexing architecture rough diagram:
(TODO UPDATE)
Interfaces
These are the interfaces as currently implemented, as of Terrier 1.0.0.
Collection: represents a collection.
examples: TRECCollection, SimpleFileCollection, (as well as SQLCollection, POP3Collection, IMAPCollection)
methods: boolean endOfCollection(); boolean nextDocument(); Document getDocument(); String docid(); void reset();
See Also:
Document: represents a document of a collection
examples: TRECDocument, HTMLDocument, PDFDocument, FileDocument, MSPowerpointDocument, MSWordDocument, MSExcelDocument
methods: String getNextTerm(); String getField();
See Also:
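As a rough illustration of how the two interfaces above fit together, here is a minimal sketch of walking a Collection and reading terms from each Document. The interfaces are hypothetical mirrors of the method lists above, not the real Terrier classes. InMemoryCollection, InMemoryDocument and walk() are invented for this example, and the sketch assumes getNextTerm() returns null once the document is exhausted and that getField() names the field of the current term.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class CollectionWalkSketch {

    // Hypothetical mirror of the Document methods listed above.
    interface Document {
        String getNextTerm(); // assumption: returns null when no terms remain
        String getField();    // assumption: field of the current term
    }

    // Hypothetical mirror of the Collection methods listed above.
    interface Collection {
        boolean endOfCollection();
        boolean nextDocument();
        Document getDocument();
        String docid();
        void reset();
    }

    // Toy document: splits its text on whitespace.
    static class InMemoryDocument implements Document {
        private final Iterator<String> terms;
        InMemoryDocument(String text) {
            this.terms = Arrays.asList(text.trim().split("\\s+")).iterator();
        }
        public String getNextTerm() { return terms.hasNext() ? terms.next() : null; }
        public String getField() { return "body"; }
    }

    // Toy collection backed by parallel arrays of ids and texts.
    static class InMemoryCollection implements Collection {
        private final String[] ids;
        private final String[] texts;
        private int current = -1; // positioned before the first document
        InMemoryCollection(String[] ids, String[] texts) {
            this.ids = ids;
            this.texts = texts;
        }
        public boolean endOfCollection() { return current >= ids.length - 1; }
        public boolean nextDocument() {
            if (endOfCollection()) return false;
            current++;
            return true;
        }
        public Document getDocument() { return new InMemoryDocument(texts[current]); }
        public String docid() { return ids[current]; }
        public void reset() { current = -1; }
    }

    // The loop an indexer would run: every document, then every term in it.
    static List<String> walk(Collection c) {
        List<String> seen = new ArrayList<>();
        while (c.nextDocument()) {
            Document d = c.getDocument();
            String term;
            while ((term = d.getNextTerm()) != null) {
                seen.add(c.docid() + ":" + term);
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        Collection c = new InMemoryCollection(
                new String[] {"d1", "d2"},
                new String[] {"terrier indexes documents", "documents contain terms"});
        System.out.println(walk(c));
    }
}
```

A real Collection would read from disk, a database or a mail server, but the calling loop would keep the same shape.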
Indexer: uses a Collection to get each Document, extracts its terms, then stops and stems them
examples: BasicIndexer, BlockIndexer
methods: void buildDirect(); void buildInverted();
Uses a list of named Fields from the properties file to note which fields a term belongs to. This named list can then be used at query time to check whether the required term occurs in that field. Example: Query(intitle:"index of" mp3)
Passes terms to a TermPipeline, which performs stopping and stemming (and other options, e.g. translation)
Then passes terms to the LexiconBuilder and DirectFileBuilder
Finally, invokes the InvertedIndexBuilder
See Also:
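The flow described for the Indexer can be sketched in memory as below. This is only an illustration under stated assumptions, not Terrier's implementation: IndexerFlowSketch, addTerm(), pipeline() and occursInField() are invented names, plain maps stand in for the LexiconBuilder, DirectFileBuilder and InvertedIndexBuilder, and the pipeline is a toy (lowercasing, a three-word stop list, stripping a trailing "s").

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class IndexerFlowSketch {

    private static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("the", "of", "a"));

    // Stand-in for the direct file: docid -> (term -> frequency in that document).
    final Map<String, Map<String, Integer>> direct = new LinkedHashMap<>();
    // Stand-in for the lexicon: term -> total frequency.
    final Map<String, Integer> lexicon = new TreeMap<>();
    // Stand-in for the inverted index: term -> (docid -> frequency).
    final Map<String, Map<String, Integer>> inverted = new TreeMap<>();
    // Field information: "field:term" -> docids where the term occurs in that field.
    final Map<String, Set<String>> fieldPostings = new TreeMap<>();

    // Toy term pipeline; returns null when the term is dropped.
    static String pipeline(String term) {
        String t = term.toLowerCase();
        if (STOPWORDS.contains(t)) return null;
        return (t.length() > 3 && t.endsWith("s")) ? t.substring(0, t.length() - 1) : t;
    }

    // First pass: one term occurrence goes through the pipeline,
    // then into the lexicon, the direct structure and the field postings.
    void addTerm(String docid, String field, String rawTerm) {
        String t = pipeline(rawTerm);
        if (t == null) return;
        direct.computeIfAbsent(docid, k -> new TreeMap<>()).merge(t, 1, Integer::sum);
        lexicon.merge(t, 1, Integer::sum);
        fieldPostings.computeIfAbsent(field + ":" + t, k -> new TreeSet<>()).add(docid);
    }

    // Second pass: invert the direct structure once all documents are in.
    void buildInverted() {
        inverted.clear();
        for (Map.Entry<String, Map<String, Integer>> doc : direct.entrySet()) {
            for (Map.Entry<String, Integer> tf : doc.getValue().entrySet()) {
                inverted.computeIfAbsent(tf.getKey(), k -> new TreeMap<>())
                        .put(doc.getKey(), tf.getValue());
            }
        }
    }

    // Query-time field check, in the spirit of intitle:term.
    boolean occursInField(String field, String term, String docid) {
        Set<String> docs = fieldPostings.get(field + ":" + pipeline(term));
        return docs != null && docs.contains(docid);
    }

    public static void main(String[] args) {
        IndexerFlowSketch ix = new IndexerFlowSketch();
        for (String t : new String[] {"Index", "of", "mp3"}) ix.addTerm("d1", "title", t);
        ix.addTerm("d1", "body", "songs");
        for (String t : new String[] {"index", "files"}) ix.addTerm("d2", "body", t);
        ix.buildInverted();
        System.out.println(ix.inverted);
        System.out.println(ix.occursInField("title", "index", "d1"));
    }
}
```

The two-pass shape is the point of the sketch: the inverted structure is derived from the direct one only after every document has been processed.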
TermPipeline: processes each term passed on by the Indexer, e.g. stopping and stemming
examples: EnglishStemming, EnglishStemmingLite, Stopwords
methods: String processTerm(String t);
Further Examples: Acronym expander, Stemmers in different languages
See Also:
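A minimal sketch of how term pipeline stages with a processTerm(String) method might be chained. The interface here is a hypothetical mirror of the signature listed above. StopwordFilter, ToySuffixStripper and Chain are invented for this example, the suffix stripper is a deliberately crude stand-in rather than a real stemming algorithm, and the sketch assumes a stage returns null to drop a term.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class TermPipelineSketch {

    // Hypothetical mirror of the processTerm signature listed above.
    interface TermPipeline {
        String processTerm(String t); // assumption: null means "term dropped"
    }

    // Drops any term found in a fixed stop list.
    static class StopwordFilter implements TermPipeline {
        private final Set<String> stopwords;
        StopwordFilter(String... words) {
            this.stopwords = new HashSet<>(Arrays.asList(words));
        }
        public String processTerm(String t) {
            return stopwords.contains(t) ? null : t;
        }
    }

    // Crude stand-in for a stemmer: strips "ing" or a trailing "s".
    static class ToySuffixStripper implements TermPipeline {
        public String processTerm(String t) {
            if (t.length() > 5 && t.endsWith("ing")) return t.substring(0, t.length() - 3);
            if (t.length() > 3 && t.endsWith("s")) return t.substring(0, t.length() - 1);
            return t;
        }
    }

    // Runs the stages in order and stops as soon as one drops the term.
    static class Chain implements TermPipeline {
        private final List<TermPipeline> stages;
        Chain(TermPipeline... stages) {
            this.stages = Arrays.asList(stages);
        }
        public String processTerm(String t) {
            String current = t;
            for (TermPipeline stage : stages) {
                if (current == null) return null;
                current = stage.processTerm(current);
            }
            return current;
        }
    }

    public static void main(String[] args) {
        TermPipeline pipeline = new Chain(
                new StopwordFilter("the", "of", "and"),
                new ToySuffixStripper());
        for (String term : new String[] {"the", "indexing", "terms", "mp3"}) {
            System.out.println(term + " -> " + pipeline.processTerm(term));
        }
    }
}
```

Other stages, such as an acronym expander or a stemmer for another language, would slot into the chain the same way.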
I think I need to implement my own Collection/Document/Indexer
You probably want to read Terrier/XMLCollections