Terrier/TODO

RELEASE OUTSTANDING

Applications
Release
Querying
Indexing

CORE

1.0

1.1 +

QUERYING API

Changes:

SEARCH ENGINE INTERFACE

INDEXING DESIGN

Terrier Indexing architecture rough diagram:

terrier-indexing

The current indexing architecture is closely tied to indexing TREC and other test collections (i.e. in the above diagram, "Corpus decodor" is handled by the same code described as "Tokeniser & parser"; in particular, the parsing of one collection file is too closely tied to the parsing of a document). The planned changes to the indexing process should allow other collections of documents to be indexed. Examples would be indexing from a database, indexing files from the filesystem, or indexing from a third-party API such as a POP3 or IMAP server.
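As a rough illustration of the separation described above, the following Java sketch splits "where documents come from" from "how one document is tokenised". All names here (SketchCollection, SketchDocument, PlainTextDocument, InMemoryCollection, SketchIndexer) are hypothetical and chosen for this example only; they are not the planned Terrier interfaces, and the in-memory collection merely stands in for a database, filesystem or mail server source.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;

// Hypothetical: one document, exposed as a stream of terms.
interface SketchDocument {
    String id();
    Iterator<String> terms();
}

// Hypothetical: any source of documents (test collection files, database rows, mailbox, ...).
interface SketchCollection extends Iterator<SketchDocument> { }

// Tokenising lives in the document, independent of where the text came from.
final class PlainTextDocument implements SketchDocument {
    private final String id;
    private final String text;

    PlainTextDocument(String id, String text) {
        this.id = id;
        this.text = text;
    }

    public String id() { return id; }

    public Iterator<String> terms() {
        List<String> out = new ArrayList<>();
        // Lower-case, then split on anything that is not a letter or digit.
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) out.add(t);
        }
        return out.iterator();
    }
}

// Stand-in for a non-TREC source: each row is {id, text}.
final class InMemoryCollection implements SketchCollection {
    private final Iterator<String[]> rows;

    InMemoryCollection(List<String[]> rows) { this.rows = rows.iterator(); }

    public boolean hasNext() { return rows.hasNext(); }

    public SketchDocument next() {
        String[] row = rows.next();
        return new PlainTextDocument(row[0], row[1]);
    }
}

// The indexer only sees the two interfaces; here it just counts terms
// where a real indexer would build its index structures.
final class SketchIndexer {
    static int countTerms(SketchCollection collection) {
        int total = 0;
        while (collection.hasNext()) {
            Iterator<String> terms = collection.next().terms();
            while (terms.hasNext()) {
                terms.next();
                total++;
            }
        }
        return total;
    }
}

final class CollectionSketchDemo {
    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
                new String[] {"d1", "Hello, world"},
                new String[] {"d2", "indexing from a database"});
        System.out.println(SketchIndexer.countTerms(new InMemoryCollection(rows)));
    }
}
```

Under this split, supporting a new source would only need a new collection class; the tokenising code and the indexer stay unchanged.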

General outline of thoughts:

Interfaces

These are the current designs as they stand at the moment. Where Option1 and Option2 are mentioned, we are undecided about how to construct the API; this mainly relates to the boundary between Document and Indexer, and to how stemming and stopping should be involved.

PreIndexing phase support

Allow phases prior to and after indexing to run in Java, rather than in Perl/Bash scripts. Example: adding anchor texts to a collection would be a pre-indexing phase.

Other post-indexing phases may also exist, so perhaps we should generalise this and provide a Runnable-like interface, e.g. run(ApplicationSetup, Collection) and run(DirectIndex, Lexicon, InvertedIndex).
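A minimal sketch of what such a Runnable-like arrangement could look like is given below. The two run(...) signatures follow the ones suggested above, but everything else is an assumption made for illustration: the *Stub classes are empty placeholders for ApplicationSetup, Collection, DirectIndex, Lexicon and InvertedIndex, and PreIndexingPhase, PostIndexingPhase and PhaseRunner are hypothetical names, not an agreed design.

```java
import java.util.ArrayList;
import java.util.List;

// Empty placeholders for the real classes named in the text (assumptions).
final class ApplicationSetupStub { }
final class CollectionStub { }
final class DirectIndexStub { }
final class LexiconStub { }
final class InvertedIndexStub { }

// Hypothetical: a phase that runs before indexing, e.g. adding anchor texts.
interface PreIndexingPhase {
    void run(ApplicationSetupStub setup, CollectionStub collection);
}

// Hypothetical: a phase that runs once the index structures exist.
interface PostIndexingPhase {
    void run(DirectIndexStub direct, LexiconStub lexicon, InvertedIndexStub inverted);
}

// Runs registered phases around a placeholder indexing step and records the order.
final class PhaseRunner {
    private final List<PreIndexingPhase> pre = new ArrayList<>();
    private final List<PostIndexingPhase> post = new ArrayList<>();
    final List<String> trace = new ArrayList<>();

    PhaseRunner addPre(PreIndexingPhase phase) {
        pre.add(phase);
        return this;
    }

    PhaseRunner addPost(PostIndexingPhase phase) {
        post.add(phase);
        return this;
    }

    List<String> runAll() {
        ApplicationSetupStub setup = new ApplicationSetupStub();
        CollectionStub collection = new CollectionStub();
        for (PreIndexingPhase phase : pre) {
            phase.run(setup, collection);
        }

        // Placeholder for the actual indexing step.
        trace.add("index");

        DirectIndexStub direct = new DirectIndexStub();
        LexiconStub lexicon = new LexiconStub();
        InvertedIndexStub inverted = new InvertedIndexStub();
        for (PostIndexingPhase phase : post) {
            phase.run(direct, lexicon, inverted);
        }
        return trace;
    }
}

final class PhaseSketchDemo {
    public static void main(String[] args) {
        PhaseRunner runner = new PhaseRunner();
        runner.addPre((setup, collection) -> runner.trace.add("anchor-text"));
        runner.addPost((direct, lexicon, inverted) -> runner.trace.add("statistics"));
        System.out.println(runner.runAll());
    }
}
```

Keeping pre- and post-indexing phases as two small interfaces means each phase only receives the objects that exist at that point; a single shared interface would be an equally plausible choice.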

Querying architecture

Terrier querying architecture rough diagram:

terrier-retrieving