Terrier/ClueWeb09-B

ClueWeb09 is a corpus of general Web documents. It is distributed by Carnegie Mellon University. The B subset consists of all documents in the ClueWeb09_English_1 folder, approximately 50M documents.

Indexing

Indexing of ClueWeb09 (category B set) can be completed in just over 1 day on a modern machine using single-pass indexing (i.e. bin/trec_terrier.sh -i -j). However, depending on the amount of available memory, you may need to increase the maximum number of open files allowed on your machine. We do not recommend indexing with the classical indexer (i.e. bin/trec_terrier.sh -i).
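
On Linux and most Unix-like systems, the open-file limit mentioned above can be inspected, and raised for the current session, with the shell's ulimit builtin before the indexer is launched. The sketch below is only an illustration: the target of 4096 is an assumed figure, not a Terrier recommendation, and the variable names are made up for the example. Raising the soft limit above the hard limit normally needs administrator rights.

```shell
# Inspect the per-process limits on open file descriptors.
soft_limit=$(ulimit -Sn)   # soft limit: what the indexer process will inherit
hard_limit=$(ulimit -Hn)   # hard limit: ceiling for raising the soft limit
echo "open files: soft=${soft_limit} hard=${hard_limit}"

# Assumed target value, for illustration only; choose one that suits your machine.
wanted=4096

# Raise the soft limit for this shell session if it is below the target.
# If the hard limit is lower than the target, this fails and prints a notice.
if [ "$soft_limit" != "unlimited" ] && [ "$soft_limit" -lt "$wanted" ]; then
    ulimit -Sn "$wanted" || echo "could not raise the limit; ask an administrator"
fi

# Then start single-pass indexing from this same shell:
# bin/trec_terrier.sh -i -j
```

The new limit applies only to processes started from this shell, so the indexer has to be launched from the same session.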

Alternatively, using a Hadoop cluster with MapReduce indexing (bin/trec_terrier.sh -i -H) can dramatically speed things up: for example, 5 nodes with 3 processors each give an indexing time of 5.5 hours.

In both cases, ensure that -XX:-UseGCOverheadLimit is set on the java command line (even on the Hadoop tasktrackers), as the JVM's GC overhead limit can otherwise cause spurious OutOfMemory errors.

trec.collection.class=WARC018Collection
indexer.meta.forward.keys=docno,url
indexer.meta.forward.keylens=26,256
indexer.meta.reverse.keys=docno
metaindex.compressed.crop.long=true

#using Hadoop MapReduce, with/without HOD
terrier.index.path=hdfs://namenode:9000/path/to/index

#using Hadoop MapReduce, with HOD
plugin.hadoop.hod=/path/to/hadoop/contrib/hod/bin/hod
plugin.hadoop.hod.nodes=5
plugin.hadoop.hod.params= -Mmapred.map.tasks=15 -Mmapred.tasktracker.map.tasks.maximum=3 -Mmapred.tasktracker.reduce.tasks.maximum=3

It is also desirable to have the following property in hadoop-site.xml: mapred.child.java.opts=-Xmx2600m\ -XX:-UseGCOverheadLimit (your node should have enough memory to allow ${{mapred.tasktracker.map.tasks.maximum}} * 2600m).
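
As a rough check of the memory requirement above, the arithmetic can be worked through in the shell. The two figures come from the settings shown earlier (3 map tasks per tasktracker, 2600 MB of heap per child JVM). The variable names are illustrative, and the estimate is a lower bound that ignores reduce tasks, the Hadoop daemons themselves, and JVM overhead beyond the heap.

```shell
# Values taken from the configuration above.
map_slots=3          # mapred.tasktracker.map.tasks.maximum
child_heap_mb=2600   # -Xmx2600m in mapred.child.java.opts

# Heap needed on one node if every map slot is busy at the same time.
total_mb=$((map_slots * child_heap_mb))
echo "map task heap per node: ${total_mb} MB"   # prints: map task heap per node: 7800 MB
```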

If you want to ignore scripts and CSS stylesheets during indexing, set the following property:

TrecDocTags.skip=SCRIPT,STYLE

Retrieval

ClueWeb09 and ClueWeb12 have a big meta index. Terrier plays safe and keeps this on disk. Hence you may see warnings like:

WARN structures.CompressingMetaIndex: Structure meta reading lookup file directly from disk (SLOW)
INFO structures.CompressingMetaIndex: Structure meta reading reverse map for key docno directly from disk
WARN structures.CompressingMetaIndex: Structure meta reading data file directly from disk (SLOW)

If you want retrieval to be faster, you can make the index use more memory by editing the index's data.properties file as follows:

#index.meta.index-source=file
#index.meta.data-source=file
index.meta.index-source=fileinmem
index.meta.data-source=fileinmem
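
The edit above can also be scripted. The following is a minimal sketch, assuming the two keys are already present in data.properties with the value file, as the commented-out lines suggest. The function name use_inmem_meta and the path in the usage line are hypothetical. It keeps a backup copy, comments out the existing settings, and appends the fileinmem values.

```shell
# use_inmem_meta FILE
# Switch the meta index sources in FILE to fileinmem.
# A backup of the original is kept as FILE.bak.
use_inmem_meta() {
    props=$1
    cp "$props" "$props.bak" || return 1

    # Comment out the existing settings for the two keys ('&' is the matched text).
    sed -e 's/^index\.meta\.index-source=/#&/' \
        -e 's/^index\.meta\.data-source=/#&/' \
        "$props.bak" > "$props" || return 1

    # Append the in-memory settings.
    printf '%s\n' \
        'index.meta.index-source=fileinmem' \
        'index.meta.data-source=fileinmem' >> "$props"
}

# Usage (hypothetical path; point it at your own index directory):
# use_inmem_meta /path/to/index/data.properties
```

If retrieval then runs out of memory, restoring the .bak copy returns the index to reading the meta structures from disk.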

For a big index such as ClueWeb09, it is preferable to use DAAT retrieval. This has been the default since Terrier 4.0:

trec.matching=org.terrier.matching.daat.Full 

last edited 2016-12-03 11:47:19 by CraigMacdonald