Terrier/MemoryIssues

Memory Issues

Terrier can use large amounts of memory during both indexing and retrieval. This can become a problem because the maximum amount of memory a Java Virtual Machine (JVM) may use has to be specified when the JVM is created. In Terrier 1.0.x, we have tried to pick good defaults for memory usage, but when you're indexing a new collection, these may not be sufficient. This Wiki page documents the symptoms and causes of high memory usage, and some possible resolutions.

Java limits the maximum memory in two ways:

  * Heap memory, whose maximum size is set with the -Xmx option.
  * Stack memory, whose maximum size is set with the -Xss option.

When you run into memory problems using Terrier, it is straightforward to alter the settings of Java or Terrier to work around the problem. If you get stuck, you can ask on the Terrier Forum. If you still can't find a solution, then try coding around the problem - then send us a patch, and we'll include it in a future version of Terrier!

Heap memory

A heap memory problem is signified by a Java error: OutOfMemoryError. You can allow Java to use more heap memory by altering the -Xmx line in the script you are using to start Terrier.

Defaults (as of Terrier 1.0.2):
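For example, assuming a launch script that invokes Java directly (the classpath variable and main-class placeholder below are illustrative, not actual Terrier script contents), raising the heap limit to 1024 megabytes would look like:

```shell
# Hypothetical line from a Terrier launch script. The -Xmx option caps the
# JVM heap; changing e.g. -Xmx512m to -Xmx1024m doubles the maximum heap.
java -Xmx1024m -classpath "$CLASSPATH" $MAIN_CLASS "$@"
```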

Stack memory

A stack memory issue is signified by a Java error: StackOverflowError. You can allow Java to use more stack memory by adding or altering the -Xss line in the script you are using to start Terrier. If there is no -Xss line, then Java is using its default maximum stack size.

Defaults (as of Terrier 1.0.2):
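For example, assuming a launch script that invokes Java directly (the classpath variable and main-class placeholder below are illustrative), an -Xss option could be added as follows:

```shell
# Hypothetical launch-script line: -Xss sets the maximum stack size of each
# thread. Here the stack limit is raised to 8 megabytes per thread.
java -Xmx512m -Xss8m -classpath "$CLASSPATH" $MAIN_CLASS "$@"
```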

Notes:

Indexing

Indexing a new large collection can be a daunting process. Be prepared to give indexing a couple of attempts, referring to this page when things go wrong. Also consider doing direct indexing and inverted indexing in two steps, using trec_terrier.sh -i -d followed by trec_terrier.sh -i -v. Doing this ensures that Terrier has the maximum amount of memory available for building the inverted index.
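The two-step invocation described above looks like this (assuming trec_terrier.sh is run from the Terrier bin directory; the ./ path is illustrative):

```shell
# Step 1: build only the direct index
./trec_terrier.sh -i -d
# Step 2: build the inverted index from the direct index, in a fresh JVM
# invocation that has the full heap available to it
./trec_terrier.sh -i -v
```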

OutOfMemoryError while building direct index

While building the direct index, Terrier attempts to write as much data to disk as possible to conserve memory. However, to keep indexing speed good, the class TermCodes keeps in memory a mapping from the String of each term to a term-id to be saved in the direct index. If you are indexing a large collection, this mapping may grow too large to keep in memory.

Resolutions:

StackOverflowError while building block direct index

read collection specification
Processing file1
BlockIndexer creating direct index 
Exception in thread "main" java.lang.StackOverflowError
at uk.ac.gla.terrier.structures.trees.BlockTree.traversePreOrder(BlockTree.java:127)
at uk.ac.gla.terrier.structures.trees.BlockTree.traversePreOrder(BlockTree.java:128)
at uk.ac.gla.terrier.structures.trees.BlockTree.traversePreOrder(BlockTree.java:128) 

Currently, block numbers are stored in a binary tree; however, the pre-order traversal of a large binary tree can use a considerable amount of stack space. A BlockTree can become large if a term is repeated very many times in a document.

Resolutions:

OutOfMemoryError while building inverted index

While building the inverted index, the InvertedIndexBuilder reads the term ids of a number of terms from the Lexicon, then scans the direct index looking for occurrences of those terms, building up a postings list to write to the InvertedIndex. The default number of terms to process in one iteration is 75,000. The generated postings list is kept in memory until it is flushed to disk.

Note that the postings list is considerably larger when indexing using blocks, in which case the 75,000 default may be too large for some collections.

Resolutions:
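One possible resolution is to reduce the number of terms processed per iteration via terrier.properties. A sketch, assuming the property is named invertedfile.processterms (check the exact property name against the documentation for your Terrier version):

```properties
# terrier.properties: process fewer terms per iteration when building the
# inverted index, trading indexing speed for a smaller in-memory postings list
invertedfile.processterms=25000
```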

Retrieval

DocumentIndex

When retrieving from an index, the first step in loading the Index is loading the DocumentIndex. There are three classes that can be used when accessing the DocumentIndex:

The default DocumentIndex implementation used by Terrier is DocumentIndexEncoded. If you have memory problems during retrieval with a large collection, change to the standard DocumentIndex implementation by changing DocumentIndexEncoded to DocumentIndex in the .log file of your index. This does, however, have a performance impact, as the document lengths needed for weighting documents during retrieval must now be read from disk.
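The substitution itself is a one-line edit in the .log file. As an illustration only (the fully qualified class name below is an assumption about how the class is recorded in that file), the same change expressed with sed on a sample line:

```shell
# Apply the DocumentIndexEncoded -> DocumentIndex substitution to a sample
# line; on a real index you would run the same sed expression over the
# index's .log file after backing it up.
echo "uk.ac.gla.terrier.structures.DocumentIndexEncoded" \
  | sed 's/DocumentIndexEncoded/DocumentIndex/'
# prints: uk.ac.gla.terrier.structures.DocumentIndex
```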