ClueWeb09 is a corpus of general Web documents. It is distributed by Carnegie Mellon University. The category B subset consists of all documents in the ClueWeb09_English_1 folder, approximately 50M documents.


Indexing ClueWeb09 (category B set) can be completed in just over 1 day on a modern machine using single-pass indexing (i.e. bin/trec_terrier.sh -i -j). However, depending on the amount of available memory, you may need to increase the maximum number of open files allowed on your machine. We do not recommend indexing with the classical indexer (i.e. bin/trec_terrier.sh -i).
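
As a rough sketch of raising the open-file limit before indexing, the snippet below lifts the soft limit for the current shell, capped at the hard limit. The target of 65536 is an illustrative assumption, not a value recommended by Terrier; pick one that suits your machine.

```shell
# Hypothetical target for the open-file limit; adjust for your machine.
target=65536

soft=$(ulimit -S -n)
hard=$(ulimit -H -n)

# An unprivileged process can raise its soft limit only up to the hard limit.
if [ "$hard" != "unlimited" ] && [ "$target" -gt "$hard" ]; then
    target=$hard
fi

# Raise the soft limit only if it is currently below the target.
if [ "$soft" != "unlimited" ] && [ "$soft" -lt "$target" ]; then
    ulimit -S -n "$target" || echo "could not raise the open-file limit" >&2
fi

echo "open files: soft=$(ulimit -S -n) hard=$(ulimit -H -n)"

# Then start single-pass indexing from the same shell:
# bin/trec_terrier.sh -i -j
```

The change applies only to the current shell and the processes it starts, so the indexing command has to be launched from that same shell. Raising the hard limit itself normally needs administrator rights.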

Alternatively, using a Hadoop cluster with MapReduce indexing (bin/trec_terrier.sh -i -H) can dramatically speed things up: for example, 5 nodes with 3 processors each give an indexing time of 5.5 hours.

In both cases, ensure that -XX:-UseGCOverheadLimit is set on the java command line (even on the Hadoop tasktrackers), as the GC overhead limit that this option disables can otherwise cause spurious OutOfMemory errors.


#using Hadoop MapReduce, with/without HOD

#using Hadoop MapReduce, with HOD
plugin.hadoop.hod.params= -Mmapred.map.tasks=15 -Mmapred.tasktracker.map.tasks.maximum=3 -Mmapred.tasktracker.reduce.tasks.maximum=3

It is also desirable to set the following property in hadoop-site.xml: mapred.child.java.opts=-Xmx2600m\ -XX:-UseGCOverheadLimit (each node should have enough memory for mapred.tasktracker.map.tasks.maximum * 2600m).
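
To make the memory requirement concrete, here is a small worked calculation using the values quoted above (3 map tasks per tasktracker, 2600m heap per child JVM). The variable names are purely illustrative, and the figure covers map task heaps only, not reduce tasks or the rest of the node.

```shell
# Values taken from the settings above; adjust to your cluster.
map_slots=3     # mapred.tasktracker.map.tasks.maximum
heap_mb=2600    # -Xmx2600m in mapred.child.java.opts

# Heap needed on one node if every map slot is busy at once.
required_mb=$((map_slots * heap_mb))

echo "Map task heap per node: ${required_mb} MB"
```

With these values each node needs about 7800 MB for map task heaps alone.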

If you want to ignore scripts and CSS stylesheets during indexing, try to set the following property



ClueWeb09 and ClueWeb12 have a large meta index. Terrier plays it safe and keeps this on disk. Hence you may see WARNings like:

WARN structures.CompressingMetaIndex: Structure meta reading lookup file directly from disk (SLOW)
INFO structures.CompressingMetaIndex: Structure meta reading reverse map for key docno directly from disk
WARN structures.CompressingMetaIndex: Structure meta reading data file directly from disk (SLOW)
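
If you capture Terrier's output to a file, a quick way to check whether the meta index is still being read from disk is to count these warnings. This is only a sketch: the helper name is made up for illustration, and the log file is whatever path you redirected the output to.

```shell
# Count "(SLOW)" meta index warnings in a saved Terrier log.
# $1 is the path of the captured log (a hypothetical file chosen by the caller).
count_slow_meta_warnings() {
    grep -c 'structures.CompressingMetaIndex: .*directly from disk (SLOW)' "$1"
}

# Example usage, assuming output was saved to retrieval.log:
# count_slow_meta_warnings retrieval.log
```

A count of zero suggests that the lookup and data files are no longer being read straight from disk.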

If you want retrieval to be faster, you can make the index use more memory by adjusting the index's data.properties file as follows:


For a big index such as ClueWeb09, it's preferable to use DAAT retrieval; this has been the default since Terrier 4.0.


