New Querying APIs
A new querying API has been implemented so that Terrier suits a wider range of applications, including interactive applications. To this end, we have encapsulated every query in a SearchRequest object, which the Manager passes through the following stages of retrieval:
A query has to be parsed into a syntax tree - this allows Terrier to identify terms, phrases, requirements, fields, proximity requirements, weights etc from the grammar of the query entered. For this we use a parser generated by the Antlr parser generator.
The query tree is then traversed. This allows three operations: each term is passed through the TermPipeline (stemming, stopping etc); controls are identified and removed; and terms are aggregated for the Matching process.
The aggregated terms (known as MatchingTerms) form the query for the main retrieval (Matching) stage, where relevant documents are determined and scores are assigned using the assigned weighting model. There are two additional (new) substages at this time:
Term Score Modifiers - alter the score given to a term in a given document - eg when the term occurs in the desired field (eg TITLE, H1 etc)
Document Score Modifiers - alter the score of a given document - eg if all the terms occur in the document, but not in a phrase as desired
Post Processing is for application-specific code to alter the result set in an unspecified way. Terrier provides automatic QueryExpansion, where relevant terms from the top N documents are added to the query and the matching stage is rerun.
Post Filtering is like Post Processing, but only one document of the result set may be operated on at any one time - this allows results to be filtered out (eg documents not in a specific DNS domain, for search engine results).
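The stages above can be sketched as a fixed sequence applied to one request object. This is only a hypothetical illustration: SketchRequest, QueryStagesSketch and their method names are invented for this example and are not Terrier's classes. Parsing is a whitespace split rather than an Antlr grammar, the term pipeline is lowercasing plus a tiny stopword list, the score is a count of matching query terms standing in for a weighting model, and Post Processing is omitted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical stand-in for a search request; NOT Terrier's SearchRequest.
class SketchRequest {
    final String originalQuery;
    final List<String> matchingTerms = new ArrayList<>();
    final Map<String, Double> results = new LinkedHashMap<>(); // docno -> score

    SketchRequest(String originalQuery) {
        this.originalQuery = originalQuery;
    }
}

// Hypothetical stage runner; the stage names follow the text, the code does not.
class QueryStagesSketch {
    static final Set<String> STOPWORDS = Set.of("the", "of", "a");

    // Parsing and pre-processing: split the query, run each token through a
    // toy term pipeline (lowercase, stopword removal), aggregate the survivors.
    static void parseAndPreProcess(SketchRequest rq) {
        for (String token : rq.originalQuery.trim().split("\\s+")) {
            String term = token.toLowerCase(Locale.ROOT);
            if (!term.isEmpty() && !STOPWORDS.contains(term)) {
                rq.matchingTerms.add(term);
            }
        }
    }

    // Matching: score = number of aggregated query terms the document contains.
    static void match(SketchRequest rq, Map<String, String> docs) {
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            List<String> docTerms =
                Arrays.asList(doc.getValue().toLowerCase(Locale.ROOT).split("\\s+"));
            double score = 0.0;
            for (String term : rq.matchingTerms) {
                if (docTerms.contains(term)) {
                    score += 1.0;
                }
            }
            if (score > 0.0) {
                rq.results.put(doc.getKey(), score);
            }
        }
    }

    // Post filtering: each result is examined on its own and kept or dropped.
    static void postFilter(SketchRequest rq, Predicate<String> keep) {
        rq.results.keySet().removeIf(docno -> !keep.test(docno));
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("a.example.org/1", "The history of Terrier");
        docs.put("b.example.com/2", "Terrier retrieval platform");
        SketchRequest rq = new SketchRequest("the Terrier platform");
        parseAndPreProcess(rq);
        match(rq, docs);
        postFilter(rq, docno -> docno.startsWith("a.example.org"));
        System.out.println(rq.matchingTerms + " -> " + rq.results);
    }
}
```

The post filter receives a single document identifier at a time, which mirrors the restriction described above; a post process would instead receive the whole result set.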
The querying stage of Terrier is governed by controls, which are string->string mappings. These can be set in two places:
As defaults in the terrier.properties file.
In the query, using the syntax name:value. They are removed from the query at Pre Processing time. To prevent users from being able to alter the functionality of Terrier in undesired ways, only controls that have specifically been enabled in the terrier.properties file can be used in the query.
The controls present in the Terrier 1.0 core are documented separately: Terrier/QueryingControls
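As a rough sketch of how name:value controls might be pulled out of a query, the example below starts from a set of defaults, overrides them with any enabled control found in the query text, and returns what is left of the query. ControlSketch and its extract method are hypothetical names, not part of Terrier, and the control names and values used are illustrative. It also assumes that a name:value token whose name is not enabled is simply left in the query text; the source does not say what Terrier does in that case.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.StringJoiner;

// Hypothetical helper; NOT Terrier's implementation of controls.
class ControlSketch {
    final Map<String, String> controls; // defaults overridden by query controls
    final String remainingQuery;        // query text with controls removed

    private ControlSketch(Map<String, String> controls, String remainingQuery) {
        this.controls = controls;
        this.remainingQuery = remainingQuery;
    }

    static ControlSketch extract(String query,
                                 Map<String, String> defaults,
                                 Set<String> allowed) {
        Map<String, String> controls = new LinkedHashMap<>(defaults);
        StringJoiner rest = new StringJoiner(" ");
        for (String token : query.trim().split("\\s+")) {
            int colon = token.indexOf(':');
            boolean hasNameAndValue = colon > 0 && colon < token.length() - 1;
            if (hasNameAndValue && allowed.contains(token.substring(0, colon))) {
                // Enabled control: record it and drop it from the query text.
                controls.put(token.substring(0, colon), token.substring(colon + 1));
            } else if (!token.isEmpty()) {
                // Assumption: anything else stays in the query untouched.
                rest.add(token);
            }
        }
        return new ControlSketch(controls, rest.toString());
    }

    public static void main(String[] args) {
        ControlSketch parsed = extract(
            "information retrieval qe:on start:5",
            Map.of("qe", "off"),
            Set.of("qe"));
        System.out.println(parsed.controls + " | " + parsed.remainingQuery);
    }
}
```

Keeping the allow-list as an explicit argument makes the security point above visible: a user typing start:5 changes nothing unless start has been enabled.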
Controls are often used to turn on post processes or post filters. However, controls need to be mapped to class names, using the querying.postprocesses.controls and querying.postfilters.controls properties in the terrier.properties file. In addition, as order is often important, you should specify the order using the querying.postprocesses.order and querying.postfilters.order properties.
querying.postprocesses.order=QueryExpansion
querying.postprocesses.controls=qe:QueryExpansion
querying.postfilters.order=Scope
querying.postfilters.controls=scope:Scope
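To illustrate how the controls and order properties could combine, the sketch below reads both properties and returns the class names whose control is switched on, in the configured order. PostProcessOrderSketch and enabledInOrder are hypothetical names, not Terrier code. Two things are assumptions: that multiple entries are separated by commas or whitespace, and that a control counts as enabled when its value is "on". The rr:Rerank mapping is made up for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

// Hypothetical helper; NOT how Terrier's Manager is implemented.
class PostProcessOrderSketch {

    // kind is "postprocesses" or "postfilters", matching the property names.
    static List<String> enabledInOrder(Properties props, String kind,
                                       Map<String, String> controls) {
        // Invert "control:ClassName" pairs into ClassName -> control.
        Map<String, String> classToControl = new HashMap<>();
        for (String pair : split(props.getProperty("querying." + kind + ".controls", ""))) {
            int colon = pair.indexOf(':');
            if (colon > 0 && colon < pair.length() - 1) {
                classToControl.put(pair.substring(colon + 1), pair.substring(0, colon));
            }
        }
        // Walk the configured order and keep classes whose control is "on".
        List<String> enabled = new ArrayList<>();
        for (String className : split(props.getProperty("querying." + kind + ".order", ""))) {
            String control = classToControl.get(className);
            if (control != null && "on".equals(controls.get(control))) {
                enabled.add(className);
            }
        }
        return enabled;
    }

    // Assumption: entries are separated by commas and/or whitespace.
    static List<String> split(String value) {
        List<String> parts = new ArrayList<>();
        for (String part : value.split("[,\\s]+")) {
            if (!part.isEmpty()) {
                parts.add(part);
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("querying.postprocesses.order", "QueryExpansion,Rerank");
        props.setProperty("querying.postprocesses.controls", "qe:QueryExpansion,rr:Rerank");
        Map<String, String> controls = new HashMap<>();
        controls.put("rr", "on");
        controls.put("qe", "on");
        System.out.println(enabledInOrder(props, "postprocesses", controls));
    }
}
```

The order in which the user supplied the controls does not matter here; only the order property decides the sequence, which is the reason the text recommends setting it.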
New Indexing API
Terrier 1.0 also has a new indexing API which allows more diverse collections of documents to be indexed. To this end, Terrier breaks indexing up into several responsibilities:
Collection interface - decodes the collection, and provides a stream of Document objects. Terrier comes with TRECCollection and SimpleFileCollection - other examples might be EmailCollection, SQLCollection, INEXCollection
Document interface - decodes each individual document, and provides a stream of Terms. If the collection has a notion of Fields (eg Title, document text; or TITLE, H1, B, A etc for HTML) which should be indexed, then the Document object should also let the indexer know which fields a given term occurs in.
Indexer - receives each term from each document of the collection, passes it through a TermPipeline, and then adds it to the following: the current temporary Lexicon, the DocumentIndex, the DirectIndex. When all documents from all collections have been indexed, the inverted index is then built.
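The division of responsibilities above can be sketched with two small interfaces and a toy indexer. Everything here is hypothetical: IndexingSketch, its nested Coll and Doc interfaces and the in-memory maps are invented for illustration and do not reflect Terrier's actual interfaces or on-disk structures. Fields are left out, and the term pipeline is reduced to lowercasing and a tiny stopword list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical toy indexer; NOT Terrier's Indexer, Collection or Document.
class IndexingSketch {
    interface Doc { Iterator<String> terms(); }       // a stream of terms
    interface Coll { Iterator<Doc> documents(); }     // a stream of documents

    static final Set<String> STOPWORDS = Set.of("the", "of", "a");

    final Map<String, Integer> lexicon = new TreeMap<>();             // term -> total frequency
    final List<Integer> documentIndex = new ArrayList<>();            // docid -> document length
    final List<Map<String, Integer>> directIndex = new ArrayList<>(); // docid -> term -> frequency
    final Map<String, Map<Integer, Integer>> invertedIndex = new TreeMap<>(); // term -> docid -> frequency

    // Toy term pipeline: lowercase, then drop stopwords (null means dropped).
    static String pipeline(String term) {
        String lowered = term.toLowerCase(Locale.ROOT);
        return STOPWORDS.contains(lowered) ? null : lowered;
    }

    void index(List<Coll> collections) {
        for (Coll collection : collections) {
            Iterator<Doc> docs = collection.documents();
            while (docs.hasNext()) {
                Map<String, Integer> postings = new TreeMap<>();
                int length = 0;
                Iterator<String> terms = docs.next().terms();
                while (terms.hasNext()) {
                    String term = pipeline(terms.next());
                    if (term == null) {
                        continue;
                    }
                    postings.merge(term, 1, Integer::sum);
                    lexicon.merge(term, 1, Integer::sum);
                    length++;
                }
                documentIndex.add(length);
                directIndex.add(postings);
            }
        }
        // Only after every collection is consumed: invert the direct index.
        for (int docid = 0; docid < directIndex.size(); docid++) {
            for (Map.Entry<String, Integer> e : directIndex.get(docid).entrySet()) {
                invertedIndex.computeIfAbsent(e.getKey(), k -> new TreeMap<>())
                             .put(docid, e.getValue());
            }
        }
    }

    public static void main(String[] args) {
        Doc first = () -> Arrays.asList("The", "quick", "Terrier").iterator();
        Doc second = () -> Arrays.asList("a", "Terrier").iterator();
        Coll collection = () -> Arrays.asList(first, second).iterator();
        IndexingSketch indexer = new IndexingSketch();
        indexer.index(Arrays.asList(collection));
        System.out.println(indexer.invertedIndex);
    }
}
```

Because the indexer only sees streams of documents and terms, a new source such as an email or SQL collection would only need its own Coll and Doc implementations in this sketch.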