Terrier can use large amounts of memory during both indexing and retrieval. This can cause problems because the maximum amount of memory that a Java Virtual Machine (JVM) may use has to be specified when the JVM is created. In Terrier, we have tried to pick good defaults for memory usage, but when you're indexing a new collection, these may not be sufficient. This Wiki page documents the symptoms and causes of high memory usage, and some possible resolutions.
Java limits the maximum memory in two ways:
Heap usage (number and size of objects etc)
Stack usage (how many recursive function calls)
When you run into memory problems using Terrier, it is straightforward to alter the settings of Java or Terrier to work around the problem. If you get stuck, you can ask on the Terrier Forum. If you still can't find a solution, try coding around the problem, then send us a patch and we'll include it in a future version of Terrier!
The two limits show different symptoms and are raised in different ways:
1. Heap memory: A heap memory problem is signified by the Java error OutOfMemoryError. You can allow Java to use more heap memory by setting the TERRIER_HEAP_MEM environment variable before starting Terrier. Default (as of Terrier 3.6): 1024MB, set in anyclass.sh (line 106). If you see OutOfMemoryError: GC overhead limit exceeded, consider adding -XX:-UseGCOverheadLimit to the JAVA_OPTIONS environment variable, particularly for single-pass or Hadoop indexing (see below).
2. Stack memory: A stack memory problem is signified by the Java error StackOverflowError. You can allow Java to use more stack memory by adding or altering the -Xss line in the script you are using to start Terrier. If there is no -Xss line, Java is using its default maximum stack size. NB: Stack overflows are much rarer in more recent versions of Terrier.
All this information assumes a standard Sun Java VM (we're currently using 1.4.2_04). We do not have any experience of IBM JVMs etc.
Despite Java producing platform-independent code, some stack settings on one platform may not be directly suitable on other platforms.
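To confirm which heap limit a JVM has actually picked up, the standard java.lang.Runtime API can be queried from inside the process. The following is a minimal sketch, not part of Terrier; the class name HeapReport and its method names are hypothetical.

```java
// Hypothetical helper (not a Terrier class): reports the heap limit this JVM was started with.
public class HeapReport {
    private static final long MB = 1024L * 1024L;

    /** Maximum heap the JVM will attempt to use, in megabytes (governed by -Xmx). */
    public static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / MB;
    }

    /** Heap currently occupied by objects, in megabytes. */
    public static long usedHeapMb() {
        Runtime rt = Runtime.getRuntime();
        // totalMemory() is the heap allocated so far; freeMemory() is the unused part of it.
        return (rt.totalMemory() - rt.freeMemory()) / MB;
    }

    public static void main(String[] args) {
        System.out.println("Max heap:  " + maxHeapMb() + " MB");
        System.out.println("Used heap: " + usedHeapMb() + " MB");
    }
}
```

Running this class with the same Java options that Terrier is started with should show whether a raised heap setting has taken effect.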
Indexing a new large collection can be a daunting process. Be prepared to give indexing a couple of attempts, referring to this page when things go wrong. Recall that Terrier has several indexing methods:
1. Classical multi-pass indexing
2. Single-pass indexing
3. Hadoop indexing
Classical multi-pass indexing
1. OutOfMemoryError while building direct index: While building the direct index, Terrier writes as much data to disk as possible to conserve memory. However, to keep indexing fast, the class TermCodes keeps in memory a mapping from each term's String to the term id that is saved in the direct index. If you are indexing a large collection, TermCodes may grow too large to keep in memory. Possible resolutions:
Increase the amount of heap memory available to Java.
Decrease the property indexing.max.docs.per.builder (default value is 18000000 - 18 million documents).
Also consider doing direct indexing and inverted indexing in two steps, using trec_terrier.sh -i -d then trec_terrier.sh -i -v . Doing this ensures that Terrier has the maximum amount of memory available for building the inverted index.
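The memory cost of a term-to-id mapping can be illustrated with a small sketch. This is not Terrier's TermCodes class; the class TermCodeSketch and its methods are hypothetical, and the sketch only assumes that each distinct term gets one in-memory entry that lives for the whole indexing run.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration (not Terrier's TermCodes): every distinct term
// gets one map entry, so memory grows with the vocabulary of the collection.
public class TermCodeSketch {
    private final Map<String, Integer> codes = new HashMap<>();

    /** Returns the id of a term, assigning the next free id on first sight. */
    public int getCode(String term) {
        Integer code = codes.get(term);
        if (code == null) {
            code = codes.size();      // ids are assigned 0, 1, 2, ...
            codes.put(term, code);    // this entry stays in memory from now on
        }
        return code;
    }

    /** Number of distinct terms currently held in memory. */
    public int size() {
        return codes.size();
    }

    public static void main(String[] args) {
        TermCodeSketch tc = new TermCodeSketch();
        for (String t : "to be or not to be".split(" ")) {
            System.out.println(t + " -> " + tc.getCode(t));
        }
        System.out.println("distinct terms held in memory: " + tc.size());
    }
}
```

Repeated terms cost nothing extra, but every new term adds an entry, which is why large collections with large vocabularies put pressure on the heap.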
2. OutOfMemoryError while building inverted index: While building the inverted index, the InvertedIndexBuilder reads the term ids of a number of terms from the Lexicon, then scans the direct index looking for occurrences of those terms, building up postings lists to write to the InvertedIndex. The generated postings lists are kept in memory until flushed to disk at the end of each iteration. By default, the number of terms processed in one iteration amounts to 20,000,000 pointers (postings) for non-blocks indices and 2,000,000 for blocks indices. The number of pointers kept in memory can be reduced using the invertedfile.processpointers property. Possible resolutions:
Increase the amount of heap memory available to Java (see above).
Set invertedfile.processpointers to a lower value (the lower this value is, the longer indexing will take, as more iterations over the direct file have to be performed). The default value is 20,000,000.
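The trade-off between memory and indexing time can be estimated with simple arithmetic. The sketch below assumes that each iteration handles at most invertedfile.processpointers pointers and needs one scan of the direct file; in practice iterations are split on term boundaries, so the real count may differ slightly. The class name and the collection size of 500,000,000 pointers are hypothetical.

```java
// Hypothetical estimate (not Terrier code): number of direct-file scans needed
// for a given invertedfile.processpointers value.
public class InvertedPassEstimate {

    /** Ceiling of totalPointers / processPointers. */
    public static long passes(long totalPointers, long processPointers) {
        if (totalPointers < 0 || processPointers <= 0) {
            throw new IllegalArgumentException("totalPointers must be >= 0 and processPointers > 0");
        }
        return (totalPointers + processPointers - 1) / processPointers;
    }

    public static void main(String[] args) {
        long totalPointers = 500_000_000L; // hypothetical collection size
        System.out.println("default 20,000,000: " + passes(totalPointers, 20_000_000L) + " passes");
        System.out.println("lowered  5,000,000: " + passes(totalPointers, 5_000_000L) + " passes");
    }
}
```

Under these assumptions, quartering the property quarters the postings held in memory per iteration and quadruples the number of scans.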
Single-pass indexing is much faster and more scalable than classical indexing. It works by keeping the partial posting lists for the terms from as many documents as possible in memory (called a run), until memory is exhausted. The disadvantage is that no direct index is built (but this can be obtained using the Inverted2DirectIndexBuilder later).
It has five properties for controlling memory consumption:
memory.reserved - amount of free memory threshold before a run is committed. Default is 50 000 000 (50MB) and 100 000 000 (100MB) for 32bit and 64bit JVMs respectively.
memory.heap.usage - proportion of max heap allocated to JVM before a run is committed. Default 0.70.
indexing.singlepass.max.postings.memory - maximum amount of memory that the postings can consume before a run is committed. Default is 0, which is no limit.
indexing.singlepass.max.documents.flush - maximum number of documents before a run is committed. Default is 0, which is no limit.
docs.check - how many documents are indexed between checks of the amount of free memory. Default is 20 - check memory consumption every 20 documents.
Additionally, we recommend disabling Java's GC overhead limit exceeded check - see Heap Memory above.
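The way the properties above interact can be sketched as a single decision function. This is an illustration only, not Terrier's implementation: the class RunCommitSketch is hypothetical, and how the conditions are combined is an assumption. In particular, the sketch requires both memory conditions to hold, on the reasoning that the JVM's free memory is measured against the heap allocated so far, which may still be able to grow.

```java
// Hypothetical sketch (not Terrier code) of when a single-pass run might be committed.
// Field defaults follow the values quoted in the text (64-bit memory.reserved).
public class RunCommitSketch {
    public long memoryReserved = 100_000_000L; // memory.reserved
    public double memoryHeapUsage = 0.70;      // memory.heap.usage
    public long maxPostingsMemory = 0L;        // indexing.singlepass.max.postings.memory, 0 = no limit
    public int maxDocumentsFlush = 0;          // indexing.singlepass.max.documents.flush, 0 = no limit
    public int docsCheck = 20;                 // docs.check

    public boolean shouldCommit(int docsInRun, long postingsBytes,
                                long freeBytes, long totalBytes, long maxBytes) {
        // Hard limits apply only when set to a non-zero value.
        if (maxDocumentsFlush > 0 && docsInRun >= maxDocumentsFlush) return true;
        if (maxPostingsMemory > 0 && postingsBytes >= maxPostingsMemory) return true;

        // Memory is only inspected every docsCheck documents.
        if (docsInRun == 0 || docsInRun % docsCheck != 0) return false;

        boolean belowReserve = freeBytes < memoryReserved;
        boolean heapNearlyGrown = (double) totalBytes / maxBytes > memoryHeapUsage;
        // Assumption: both conditions must hold before the run is committed.
        return belowReserve && heapNearlyGrown;
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        RunCommitSketch s = new RunCommitSketch();
        System.out.println("commit now? "
            + s.shouldCommit(20, 0L, rt.freeMemory(), rt.totalMemory(), rt.maxMemory()));
    }
}
```

A larger docs.check value means fewer checks but a greater chance of exhausting memory between two checks, which matters most when individual documents are large.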
Hadoop indexing is based upon single-pass indexing, so the same properties apply. Additionally, you need to carefully control the memory allocated to the MapReduce child JVMs, e.g. mapred.child.java.opts=-Xmx2000m\ -XX:-UseGCOverheadLimit
All document lengths must be loaded into memory before retrieval can commence. If you are using a field-based weighting model, you should ensure that the field lengths are also loaded into memory, by changing the DocumentIndex used in the index's data.properties file from FSADocumentIndex to FSAFieldDocumentIndex.
An OutOfMemoryError may occur during the construction of the taat.Full matching class, if the number of documents in an index is extremely large. In this case, follow the instructions above to increase the amount of heap memory available to Terrier.