Terrier/ClueWeb09-B

ClueWeb09 is a corpus of general Web documents, distributed by Carnegie Mellon University. The category B subset consists of all documents in the ClueWeb09_English_1 folder, approximately 50 million documents.

Indexing

Indexing ClueWeb09 (category B set) with single-pass indexing can be completed in just over 1 day on a modern machine. However, depending on the amount of available memory, you may need to increase the maximum number of open files allowed on your machine.
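As a minimal sketch of checking the open files limit mentioned above, the following commands inspect the per-process limits in the current shell and try to raise the soft limit to the hard limit. It assumes a POSIX-like shell where ulimit supports -S, -H and -n; the variable names are illustrative and not part of Terrier.

```shell
#!/bin/sh
# Inspect the per-process open-file limits for the current shell.
soft_limit=$(ulimit -Sn)
hard_limit=$(ulimit -Hn)
echo "open files: soft=$soft_limit hard=$hard_limit"

# Try to raise the soft limit to the hard limit for this session only.
# Going beyond the hard limit needs an administrator-level change.
if [ "$soft_limit" != "$hard_limit" ]; then
    ulimit -Sn "$hard_limit" || echo "could not raise the soft limit" >&2
fi
echo "open files now: $(ulimit -Sn)"
```

The change only applies to the current shell and its child processes, so indexing would need to be started from the same session.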

Alternatively, MapReduce indexing on a Hadoop cluster can dramatically reduce indexing time: for example, 5 nodes with 3 processors each give an indexing time of 5.5 hours.

In both cases, ensure that -XX:-UseGCOverheadLimit is set on the java command line (including on the Hadoop tasktrackers). Without this option, the JVM's GC overhead limit check can cause spurious OutOfMemory errors.

trec.collection.class=WARC018Collection
indexer.meta.forward.keys=docno,url
indexer.meta.forward.keylens=26,256
indexer.meta.reverse.keys=docno

#using Hadoop MapReduce, with/without HOD
terrier.index.path=hdfs://namenode:9000/path/to/index

#using Hadoop MapReduce, with HOD
plugin.hadoop.hod=/path/to/hadoop/contrib/hod/bin/hod
plugin.hadoop.hod.nodes=5
plugin.hadoop.hod.params= -Mmapred.map.tasks=15 -Mmapred.tasktracker.map.tasks.maximum=3 -Mmapred.tasktracker.reduce.tasks.maximum=3

It is also desirable to set the following property in hadoop-site.xml:

mapred.child.java.opts=-Xmx2600m\ -XX:-UseGCOverheadLimit

Each node should have enough memory to allow ${mapred.tasktracker.map.tasks.maximum} * 2600m.
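The memory requirement above is simple arithmetic, sketched here with the values from the HOD configuration on this page. The variable names are illustrative only. That 15 map tasks corresponds to 5 nodes with 3 map slots each is an inference from the settings, not something stated explicitly.

```shell
#!/bin/sh
# Values copied from the configuration on this page.
nodes=5              # plugin.hadoop.hod.nodes
map_slots=3          # mapred.tasktracker.map.tasks.maximum
child_heap_mb=2600   # -Xmx2600m in mapred.child.java.opts

# Heap needed on one node if all of its map slots are busy at once.
required_mb=$((map_slots * child_heap_mb))
echo "map task heap per node: ${required_mb} MB"

# Map slots across the cluster (assumed to match mapred.map.tasks=15).
total_map_slots=$((nodes * map_slots))
echo "map slots in the cluster: ${total_map_slots}"
```

This counts only the heap of the map task JVMs. The operating system, the tasktracker itself and any concurrently running reduce tasks need memory on top of that.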

If you want to ignore scripts and CSS stylesheets during indexing, set the following property:

TrecDocTags.skip=SCRIPT,STYLE

last edited 2011-06-16 16:15:27 by CraigMacdonald