
Memory Issues

Terrier can use high amounts of memory during both indexing and retrieval. This can be problematic because the maximum amount of memory that a Java Virtual Machine (JVM) can use must be specified when it is created. In Terrier, we have tried to pick good defaults for memory usage, but when you're indexing a new collection, these may not be sufficient. This Wiki page documents the symptoms and causes of high memory usage, and some possible resolutions.

When you run into memory problems using Terrier, it is straightforward to alter the settings of Java or Terrier to work around the problem. If you get stuck, you can ask on the [WWW] Terrier Forum. If you still can't find a solution, try coding around the problem, then send us a patch and we'll include it in a future version of Terrier!

Java limits the maximum memory in two ways, corresponding to the two types of memory available:

1. Heap Memory: A heap memory problem is signified by a Java error: OutOfMemoryError. You can allow Java to use more heap memory by setting the TERRIER_HEAP_MEM environment variable before starting Terrier; the default (as of Terrier 3.6) is 1024MB, set in anyclass.sh (line 106). If you see OutOfMemoryError: GC overhead limit exceeded, consider adding -XX:-UseGCOverheadLimit to the JAVA_OPTIONS environment variable, particularly for single-pass or Hadoop indexing (see below, and the sketch after this list).

2. Stack Memory: A stack memory issue is signified by a Java error: StackOverflowError. You can allow Java to use more stack memory by adding or altering the -Xss option in the script you are using to start Terrier. If there is no -Xss option, then Java is using its default maximum stack size. NB: StackOverflowErrors are much rarer in more recent versions of Terrier.
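
For example, a minimal shell sketch (the 2048M and 16m values are illustrative; choose values that suit your machine):

  # give Terrier 2GB of heap; the bin/ scripts pass this value to -Xmx
  export TERRIER_HEAP_MEM=2048M
  # extra JVM options: disable the GC overhead limit check and
  # (hypothetically) request a 16MB stack instead of editing the script
  export JAVA_OPTIONS="-XX:-UseGCOverheadLimit -Xss16m"
  bin/trec_terrier.sh -i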

Notes:

Indexing

Indexing a new large collection can be a daunting process. Be prepared to give indexing a couple of attempts, referring to this page when things go wrong. Recall that Terrier has several indexing methods:

1. Classical multi-pass indexing

2. Single-pass indexing

3. Hadoop indexing
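
As a quick reference, we believe the corresponding trec_terrier.sh invocations in Terrier 3.x are as follows (consult your version's documentation if the flags differ):

  bin/trec_terrier.sh -i        # 1. classical multi-pass indexing
  bin/trec_terrier.sh -i -j     # 2. single-pass indexing
  bin/trec_terrier.sh -i -H     # 3. Hadoop indexing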

Classical multi-pass indexing

1. OutOfMemoryError while building the direct index: While building the direct index, Terrier attempts to write as much data to disk as possible to conserve memory. However, to maintain good indexing speed, the TermCodes class keeps in memory a mapping from the String of each term to the term id saved in the direct index. If you are indexing a large collection, this mapping may grow too large to keep in memory.

Resolutions:

The simplest resolution is to increase the heap memory available, as described above. Also consider doing direct indexing and inverted indexing in two steps, using trec_terrier.sh -i -d then trec_terrier.sh -i -v, as shown below. Doing this ensures that Terrier has the maximum amount of memory available for building the inverted index.
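
For example:

  bin/trec_terrier.sh -i -d     # pass 1: build the direct index
  bin/trec_terrier.sh -i -v     # pass 2: invert it to build the inverted index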

2. OutOfMemoryError while building the inverted index: While building the inverted index, the InvertedIndexBuilder reads the term ids of a number of terms from the Lexicon, then scans the direct index looking for occurrences of those terms, building up postings lists to write to the InvertedIndex. These postings lists are kept in memory until they are flushed to disk at the end of each iteration. The default number of terms to process in one iteration amounts to 20,000,000 pointers (postings) for non-blocks indices, and 2,000,000 for blocks indices. The number of pointers kept in memory can be reduced using the invertedfile.processpointers property.

Resolutions:

Lower the invertedfile.processpointers property in your terrier.properties file, so that each iteration keeps fewer postings in memory, at the cost of more scans of the direct index; alternatively, increase the heap memory available, as described above. An illustrative setting is shown below.
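
A sketch (the value 10,000,000 is arbitrary; tune it to your heap size, and use a smaller value for blocks indices):

  # in etc/terrier.properties
  invertedfile.processpointers=10000000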

Single-pass indexing

Single-pass indexing is much faster and more scalable than classical indexing. It works by keeping in memory the partial posting lists for the terms from as many documents as possible (called a run), flushing them to disk when memory is exhausted. The disadvantage is that no direct index is built (though one can be obtained later using the Inverted2DirectIndexBuilder, as sketched below).
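
A hedged sketch of rebuilding the direct index afterwards, assuming the Terrier 3.x class name and the generic anyclass.sh launcher, and that the builder operates on the default index named in terrier.properties:

  # invert the inverted index back into a direct index
  bin/anyclass.sh org.terrier.structures.indexing.singlepass.Inverted2DirectIndexBuilder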

It has four properties for controlling memory consumption; see the Terrier documentation on configuring indexing for their names and defaults.

Additionally, we recommend disabling Java's "GC overhead limit exceeded" check (see Heap Memory above).

Hadoop Indexing

Hadoop indexing is based upon single-pass indexing, so the same properties apply. Additionally, you need to carefully control the memory allocated to the MapReduce child JVMs, e.g.:

  mapred.child.java.opts=-Xmx2000m\ -XX:-UseGCOverheadLimit
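
How you set this depends on your Hadoop deployment; one common approach (an assumption about your cluster setup, not something specific to Terrier) is to put it in Hadoop's conf/mapred-site.xml, where the escaped space is not needed:

  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx2000m -XX:-UseGCOverheadLimit</value>
  </property>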

Retrieval

DocumentIndex

All document lengths must be loaded into memory before retrieval can commence. If you are using a field-based weighting model, you should ensure that the field lengths are also loaded into memory, by changing the DocumentIndex implementation named in the index's data.properties file from FSADocumentIndex to FSAFieldDocumentIndex, as sketched below.
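
A sketch of the change, assuming the document index class is recorded under the index.document.class key (check your index's data.properties for the exact key and package names in your Terrier version):

  # in the index's data.properties
  index.document.class=org.terrier.structures.FSAFieldDocumentIndex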

MetaIndex

Matching

An OutOfMemoryError may occur during the construction of the taat.Full matching class if the number of documents in the index is extremely large. In this case, follow the instructions above to increase the amount of heap memory available to Terrier.


CategoryTerrier
