Terrier/ClueWeb09-B
ClueWeb09 is a corpus of general Web documents, distributed by Carnegie Mellon University. The B subset (category B) consists of all documents in the ClueWeb09_English_1 folder, approximately 50 million documents.
Indexing
Indexing ClueWeb09 (category B) takes just over 1 day on a modern machine using single-pass indexing. Depending on the amount of available memory, you may need to increase the maximum number of open files allowed on your machine.
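Before starting a long indexing run, it can help to check the open-file limit that a process launched from the current shell would inherit. The following is a minimal Python sketch for Unix-like systems using the standard resource module. The function names and the wanted threshold are hypothetical; the page does not state how many open files indexing needs.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def describe(limit):
    """Render a limit for display, treating RLIM_INFINITY as unlimited."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


def has_headroom(soft, wanted):
    """True if the soft limit already allows `wanted` open files.

    `wanted` is an illustrative threshold chosen by the caller,
    not a value documented for Terrier.
    """
    return soft == resource.RLIM_INFINITY or soft >= wanted


if __name__ == "__main__":
    soft, hard = open_file_limits()
    print("soft limit:", describe(soft))
    print("hard limit:", describe(hard))
```

If the soft limit is too low, it is usually raised in the shell (for example with ulimit -n) before launching the indexing job, since child processes inherit the limit.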
Alternatively, MapReduce indexing on a Hadoop cluster can speed things up dramatically. For example, 5 nodes with 3 processors each give an indexing time of 5.5 hours.
In both cases, ensure that -XX:-UseGCOverheadLimit is set on the java command line (including on the Hadoop tasktrackers). This disables the JVM's GC overhead limit, which can otherwise cause spurious OutOfMemory errors.
trec.collection.class=WARC018Collection
indexer.meta.forward.keys=docno,url
indexer.meta.forward.keylens=26,256
indexer.meta.reverse.keys=docno
#using Hadoop MapReduce, with/without HOD
terrier.index.path=hdfs://namenode:9000/path/to/index
#using Hadoop MapReduce, with HOD
plugin.hadoop.hod=/path/to/hadoop/contrib/hod/bin/hod
plugin.hadoop.hod.nodes=5
plugin.hadoop.hod.params= -Mmapred.map.tasks=15 -Mmapred.tasktracker.map.tasks.maximum=3 -Mmapred.tasktracker.reduce.tasks.maximum=3
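The HOD parameters above appear to follow from the cluster shape: 5 nodes with 3 processors each give 15 map tasks, with at most 3 map and 3 reduce tasks per tasktracker. The sketch below rebuilds the value of plugin.hadoop.hod.params under that assumption. The helper name hod_params is hypothetical, and the rule "map tasks = nodes x slots per node" is inferred from the example values rather than stated by Terrier or Hadoop.

```python
def hod_params(nodes, slots_per_node):
    """Build a plugin.hadoop.hod.params value for a given cluster shape.

    Assumes one map slot and one reduce slot per processor, and one
    map task per slot across the cluster (an inference from the
    5-node, 3-processor example, not a documented rule).
    """
    settings = [
        ("mapred.map.tasks", nodes * slots_per_node),
        ("mapred.tasktracker.map.tasks.maximum", slots_per_node),
        ("mapred.tasktracker.reduce.tasks.maximum", slots_per_node),
    ]
    return " ".join(f"-M{key}={value}" for key, value in settings)


if __name__ == "__main__":
    print("plugin.hadoop.hod.nodes=5")
    print("plugin.hadoop.hod.params= " + hod_params(nodes=5, slots_per_node=3))
```

For a different cluster, only the two arguments change; the resulting line can be pasted into the Terrier properties file.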
It is also desirable to set the following property in hadoop-site.xml:
mapred.child.java.opts=-Xmx2600m\ -XX:-UseGCOverheadLimit
Each node should have enough memory to allow ${mapred.tasktracker.map.tasks.maximum} * 2600m.
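As a worked example of the memory requirement, the sketch below reads the heap size from a java options string and multiplies it by the number of map slots per node. The function names are hypothetical. The calculation covers only the map-task heaps mentioned above; reduce tasks, JVM overhead and the operating system need memory on top of that.

```python
import re


def xmx_to_mb(java_opts):
    """Extract the -Xmx heap size from a java options string, in MB."""
    match = re.search(r"-Xmx(\d+)([kKmMgG])", java_opts)
    if match is None:
        raise ValueError("no -Xmx setting found")
    value, unit = int(match.group(1)), match.group(2).lower()
    factor = {"k": 1 / 1024, "m": 1, "g": 1024}[unit]
    return value * factor


def map_heap_per_node_mb(map_slots, java_opts):
    """Heap needed per node if every map slot runs a child JVM at full -Xmx."""
    return map_slots * xmx_to_mb(java_opts)


if __name__ == "__main__":
    opts = "-Xmx2600m -XX:-UseGCOverheadLimit"
    # 3 map slots per tasktracker, as in the HOD parameters on this page
    print(map_heap_per_node_mb(3, opts), "MB")  # prints "7800 MB"
```

With 3 map slots per node and -Xmx2600m, each node therefore needs at least 7800 MB available for map-task heaps alone.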
To ignore scripts and CSS stylesheets during indexing, set the following property:
TrecDocTags.skip=SCRIPT,STYLE
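To illustrate the effect of skipping SCRIPT and STYLE, the sketch below extracts text from an HTML fragment while dropping everything inside the listed tags. It uses Python's standard html.parser and is only an approximation of the behaviour. It is not Terrier's parser, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class SkipTagsExtractor(HTMLParser):
    """Collect text content, ignoring anything nested inside skipped tags."""

    def __init__(self, skip_tags):
        super().__init__()
        self.skip_tags = {tag.lower() for tag in skip_tags}
        self.depth = 0      # how many skipped tags are currently open
        self.chunks = []    # text fragments that would be indexed

    def handle_starttag(self, tag, attrs):
        if tag in self.skip_tags:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.skip_tags and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html, skip_tags=("SCRIPT", "STYLE")):
    """Return the text of `html` with the content of `skip_tags` removed."""
    parser = SkipTagsExtractor(skip_tags)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


if __name__ == "__main__":
    page = ("<html><head><style>p{color:red}</style>"
            "<script>var x=1;</script></head>"
            "<body><p>Hello</p><p>world</p></body></html>")
    print(visible_text(page))  # prints "Hello world"
```

Without the skip list, the stylesheet rule and the JavaScript statement would appear as terms alongside the page text.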