Terrier can use large amounts of memory during both indexing and retrieval. This can become a problem because the maximum amount of memory a Java Virtual Machine (JVM) may use has to be specified when the JVM is created. In Terrier 1.0.x we have tried to pick good defaults for memory usage, but when you are indexing a new collection these may not be sufficient. This Wiki page documents the symptoms and causes of high memory usage, and some possible resolutions.
Java limits the maximum memory in two ways:
Heap usage (the number and size of allocated objects)
Stack usage (the depth of nested or recursive method calls)
When you run into memory problems using Terrier, it is usually straightforward to alter the settings of Java or Terrier to work around the problem. If you get stuck, you can ask on the Terrier Forum. If you still can't find a solution, then try coding around the problem - then send us a patch, and we'll include it in a future version of Terrier!
A heap memory problem is signified by the Java error java.lang.OutOfMemoryError. You can allow Java to use more heap memory by altering the -Xmx setting in the script you are using to start Terrier.
Defaults (as of Terrier 1.0.2):
trec_terrier.sh: 512MB (line 85)
trec_terrier.bat: 512MB (line 68)
anyclass.sh: 512MB (line 82)
desktop_terrier.sh: 120MB (line 81)
desktop_terrier.bat: 120MB (line 77)
interactive_terrier.sh: 64MB (line 82)
interactive_terrier.bat: 64MB (line 64)
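For example, to raise the heap limit in trec_terrier.sh from its 512MB default, edit the java invocation in the script. The main class name and option layout below are illustrative, not copied from the script; check your own copy for the exact line:

```shell
# Before (around line 85 of trec_terrier.sh): JVM capped at a 512MB heap
java -Xmx512M uk.ac.gla.terrier.applications.TRECTerrier "$@"

# After: allow the JVM heap to grow to 1GB
java -Xmx1024M uk.ac.gla.terrier.applications.TRECTerrier "$@"
```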
A stack memory issue is signified by the Java error java.lang.StackOverflowError. You can allow Java to use more stack memory by adding or altering the -Xss setting in the script you are using to start Terrier. If there is no -Xss setting, then Java uses its default maximum stack size.
Defaults (as of Terrier 1.0.2):
trec_terrier.sh: Unspecified (line 85)
trec_terrier.bat: Unspecified (line 68)
anyclass.sh: Unspecified (line 82)
desktop_terrier.sh: 64MB (line 82)
desktop_terrier.bat: 64MB (line 77)
interactive_terrier.sh: Unspecified (line 82)
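Assuming a startup line of the shape shown above for the heap case, -Xss is added alongside -Xmx. Note that -Xss sets the maximum stack size per thread, not a whole-JVM limit (the invocation below is illustrative):

```shell
# Give each thread up to a 128MB stack, alongside the existing heap cap
java -Xmx512M -Xss128M uk.ac.gla.terrier.applications.TRECTerrier "$@"
```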
All of this information assumes a standard Sun Java VM (we are currently using 1.4.2_04). We do not have any experience with IBM JVMs etc.
Although Java produces platform-independent code, stack settings that work on one platform may not be directly suitable on another.
Indexing a new large collection can be a daunting process. Be prepared to give indexing a couple of attempts, referring to this page when things go wrong. Also consider doing direct indexing and inverted indexing in two steps, using trec_terrier.sh -i -d followed by trec_terrier.sh -i -v. Doing this ensures that Terrier has the maximum amount of memory available for building the inverted index.
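The two-step invocation looks like this (flags as described above; the heap settings inside the script apply to each run separately):

```shell
# Step 1: parse the collection and build the direct index
./trec_terrier.sh -i -d

# Step 2: invert the direct index to build the inverted index;
# running this in a fresh JVM leaves the full heap available for inversion
./trec_terrier.sh -i -v
```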
OutOfMemoryError while building direct index
While building the direct index, Terrier attempts to write as much data to disk as possible to conserve memory. However, to keep indexing speed good, the class TermCodes keeps in memory a mapping from the String of each term to a term-id to be saved in the direct index. If you are indexing a large collection, this mapping may grow too large to keep in memory.
Increase the amount of heap memory available to Java.
Consider breaking the collection into smaller chunks, indexing each separately, and then merging the resulting indices yourself.
StackOverflowError while building block direct index
read collection specification
Processing file1
BlockIndexer creating direct index
Exception in thread "main" java.lang.StackOverflowError
        at uk.ac.gla.terrier.structures.trees.BlockTree.traversePreOrder(BlockTree.java:127)
        at uk.ac.gla.terrier.structures.trees.BlockTree.traversePreOrder(BlockTree.java:128)
        at uk.ac.gla.terrier.structures.trees.BlockTree.traversePreOrder(BlockTree.java:128)
Currently, block numbers are stored in a binary tree; however, the recursive traversal of a large binary tree can consume a considerable amount of stack space. A BlockTree can become large if a term is repeated very many times in a document.
Increase the stack size available to Java
Adjust/set the max.blocks property so that Terrier does not record all the blocks of a large document.
Adjust/set the indexing.max.tokens property so that Terrier does not index too deeply into a document.
Do not index that particular document - if you're using the TRECCollection, you can make a list of bad documents in a file and set the trec.blacklist property to that file.
(A future version of Terrier may replace the BlockTree with a non-recursive data structure)
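The property-based resolutions above would be set in your terrier.properties file. The values and the blacklist path below are purely illustrative, not recommendations:

```properties
# Record at most this many blocks per document
max.blocks=100000

# Stop indexing tokens from a document beyond this count
indexing.max.tokens=120000

# File listing documents that TRECCollection should skip (hypothetical path)
trec.blacklist=/path/to/bad-documents.txt
```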
OutOfMemoryError while building inverted index
While building the inverted index, the InvertedIndexBuilder reads the term ids of a number of terms from the Lexicon, then scans the direct index looking for occurrences of those terms, building up a postings list to write to the InvertedIndex. The default number of terms to process in one iteration is 75,000. The generated postings list is kept in memory until it is flushed to disk.
Note that the postings list is considerably larger when indexing using blocks, in which case the 75,000 default may be too large for some collections.
Increase the amount of memory available to Java - for instance by changing the -Xmx value on line 85 of trec_terrier.sh
Adjust/set invertedfile.processterms to a lower value (the lower this value is, the longer indexing will take, as more passes over the direct file have to be performed). The default value is 75,000. Try decreasing this to 25,000 to start with, then lower still if problems remain. I have personally gone as low as 5,000 terms.
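For instance, in terrier.properties (25,000 is a starting point per the advice above, not a universally correct value):

```properties
# Fewer terms per pass keeps the in-memory postings lists smaller,
# at the cost of more passes over the direct index
invertedfile.processterms=25000
```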
When retrieving from an index, the first step in loading the Index is loading the DocumentIndex. There are three classes that can be used when accessing the DocumentIndex:
DocumentIndex : the index is accessed on disk. (least memory usage)
DocumentIndexEncoded : the index is loaded compressed into memory, and decoded on access (considerable memory usage)
DocumentIndexInMemory : the index is decompressed into memory (high memory usage)
The default DocumentIndex used by Terrier is DocumentIndexEncoded. If you have memory problems during retrieval with a large collection, change to the standard DocumentIndex implementation by changing DocumentIndexEncoded to DocumentIndex in the .log file of your index. This does, however, have a performance impact, as the document lengths needed for weighting documents during retrieval must now be read from disk.