Terrier IR Platform
2.2.1

uk.ac.gla.terrier.indexing
Class TRECFullUTFTokenizer

java.lang.Object
  extended by uk.ac.gla.terrier.indexing.TRECFullTokenizer
      extended by uk.ac.gla.terrier.indexing.TRECFullUTFTokenizer
All Implemented Interfaces:
Tokenizer

public class TRECFullUTFTokenizer
extends TRECFullTokenizer

A subclass of TRECFullTokenizer that is less restrictive than its parent. In this class, any character for which Character.isLetterOrDigit() returns true is accepted as part of a valid query term.
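
To illustrate what accepting every Character.isLetterOrDigit() character means in practice, here is a minimal standalone sketch. It does not use or reproduce Terrier code: the class LetterOrDigitSketch, its tokenize method, and the ASCII-only check are hypothetical names introduced for comparison only, and the ASCII rule is merely an assumed example of a stricter test, not the documented behaviour of TRECFullTokenizer.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of Terrier.
public class LetterOrDigitSketch {

    // Assumed example of a stricter rule (ASCII letters and digits only).
    // The parent class's actual rule is not described on this page.
    public static boolean isAsciiLetterOrDigit(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');
    }

    // Splits text into runs of characters accepted by Character.isLetterOrDigit().
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                // A separator ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Unicode escapes stand for i-diaeresis and e-acute.
        System.out.println(tokenize("na\u00EFve caf\u00E9 42"));
        System.out.println(isAsciiLetterOrDigit('\u00E9'));      // false
        System.out.println(Character.isLetterOrDigit('\u00E9')); // true
    }
}
```

Under the Unicode-aware test, accented letters stay inside their tokens, whereas an ASCII-only rule would split or drop them.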

Since:
2.1
Version:
$Revision: 1.3 $
Author:
Craig Macdonald

Field Summary
 
Fields inherited from class uk.ac.gla.terrier.indexing.TRECFullTokenizer
br, counter, EOD, EOF, error, inDocnoTag, inTagToProcess, inTagToSkip, lastChar, number_of_terms
 
Constructor Summary
TRECFullUTFTokenizer()
           
TRECFullUTFTokenizer(java.io.BufferedReader br)
           
TRECFullUTFTokenizer(TagSet _tagSet, TagSet _exactSet)
           
TRECFullUTFTokenizer(TagSet _ts, TagSet _exactSet, java.io.BufferedReader br)
           
 
Method Summary
 java.lang.String nextToken()
          Returns the next string that is not a tag.
 
Methods inherited from class uk.ac.gla.terrier.indexing.TRECFullTokenizer
close, closeBufferedReader, currentTag, getByteOffset, inDocnoTag, inTagToProcess, inTagToSkip, isEndOfDocument, isEndOfFile, nextDocument, setIgnoreMissingClosingTags, setInput
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

TRECFullUTFTokenizer

public TRECFullUTFTokenizer()

TRECFullUTFTokenizer

public TRECFullUTFTokenizer(java.io.BufferedReader br)

TRECFullUTFTokenizer

public TRECFullUTFTokenizer(TagSet _tagSet,
                            TagSet _exactSet)

TRECFullUTFTokenizer

public TRECFullUTFTokenizer(TagSet _ts,
                            TagSet _exactSet,
                            java.io.BufferedReader br)
Method Detail

nextToken

public java.lang.String nextToken()
Returns the next string that is not a tag. All encountered tags are pushed or popped according to whether they are opening or closing tags.

Specified by:
nextToken in interface Tokenizer
Overrides:
nextToken in class TRECFullTokenizer
Returns:
the next token, or null if the end of file is encountered.
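
The push/pop behaviour described above can be sketched as a small standalone class. This is an assumption-laden simplification, not the Terrier implementation: the class TagStackSketch and its tagStack field are hypothetical, input comes from a String rather than a BufferedReader, and tag sets, DOCNO handling, and document boundaries are ignored.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical illustration class; not part of Terrier.
public class TagStackSketch {
    private final String text;
    private int pos = 0;

    // Currently open tags, innermost on top.
    public final Deque<String> tagStack = new ArrayDeque<>();

    public TagStackSketch(String text) {
        this.text = text;
    }

    // Returns the next run of letters/digits that is not a tag,
    // or null once the end of the input is reached.
    public String nextToken() {
        StringBuilder token = new StringBuilder();
        while (pos < text.length()) {
            char c = text.charAt(pos++);
            if (c == '<') {
                if (token.length() > 0) {
                    pos--; // leave the tag for the next call
                    return token.toString();
                }
                int end = text.indexOf('>', pos);
                if (end < 0) { // unterminated tag: treat as end of input
                    pos = text.length();
                    break;
                }
                String tag = text.substring(pos, end);
                pos = end + 1;
                if (tag.startsWith("/")) {
                    if (!tagStack.isEmpty()) {
                        tagStack.pop(); // closing tag
                    }
                } else {
                    tagStack.push(tag); // opening tag
                }
            } else if (Character.isLetterOrDigit(c)) {
                token.append(c);
            } else if (token.length() > 0) {
                return token.toString(); // separator ends the token
            }
        }
        return token.length() > 0 ? token.toString() : null;
    }
}
```

A caller would invoke nextToken() repeatedly until it returns null; after each call, tagStack.peek() names the innermost tag enclosing the returned token.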

Terrier Information Retrieval Platform 2.2.1. Copyright 2004-2008 University of Glasgow