Diff for "Terrier/ClueWeb12"

Differences between revisions 6 and 7

Revision 6, line 13:
Terrier 4.0 makes several improvements to indexing for Clue``Web12 compared to Terrier 3.6 (see [http://terrier.org/issues/browse/TR-295 TR-295]) - no further patches are required to index Clue``Web12.

Revision 7, line 13:
Terrier 4.0 makes several improvements to indexing for Clue``Web12 compared to Terrier 3.6 (see [http://terrier.org/issues/browse/TR-295 TR-295]) - no further patches are required to index Clue``Web12 from Terrier 4.0 onwards.

Terrier/ClueWeb12

Patches

For Terrier 3.5, you will need to apply a few patches:

  • TR-209: Allow long metaindex values to be cropped automatically by the MetaIndex

  • TR-225: Support for ClueWeb12 collection

  • TR-295: WARC10Collection incorrectly misses some documents

Terrier 3.6 does not require patching to index ClueWeb12.

  • TR-295: WARC10Collection incorrectly misses some documents

Terrier 4.0 makes several improvements to indexing for ClueWeb12 compared to Terrier 3.6 (see TR-295) - no further patches are required to index ClueWeb12 from Terrier 4.0 onwards.

Configuration

Generally, follow the instructions on the Terrier/ClueWeb09-B page, with these ClueWeb12-specific settings:

trec.collection.class=WARC10Collection
indexer.meta.forward.keys=docno,url
indexer.meta.forward.keylens=26,512
indexer.meta.reverse.keys=docno

TrecDocTags.skip=SCRIPT,STYLE
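
The forward meta keys and their lengths above are paired by position (docno with 26, url with 512), so a mismatch between the two lists is easy to miss. The following is a minimal sketch of a sanity check, not part of Terrier: the class name MetaKeyCheck and the method keyLengths are hypothetical, it uses only java.util.Properties, and the sample docno assumes the usual ClueWeb12 identifier form.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical helper, not a Terrier class: checks that the forward meta
// keys and key lengths from the configuration line up one-to-one.
public class MetaKeyCheck {

    // Pairs each forward meta key with its configured maximum length.
    public static Map<String, Integer> keyLengths(Properties props) {
        String[] keys = props.getProperty("indexer.meta.forward.keys", "").split(",");
        String[] lens = props.getProperty("indexer.meta.forward.keylens", "").split(",");
        if (keys.length != lens.length) {
            throw new IllegalArgumentException(
                "keys and keylens differ in count: " + keys.length + " vs " + lens.length);
        }
        Map<String, Integer> limits = new LinkedHashMap<>();
        for (int i = 0; i < keys.length; i++) {
            limits.put(keys[i].trim(), Integer.parseInt(lens[i].trim()));
        }
        return limits;
    }

    public static void main(String[] args) throws IOException {
        // The same settings as listed in this section.
        String config =
            "trec.collection.class=WARC10Collection\n"
            + "indexer.meta.forward.keys=docno,url\n"
            + "indexer.meta.forward.keylens=26,512\n"
            + "indexer.meta.reverse.keys=docno\n";
        Properties props = new Properties();
        props.load(new StringReader(config));

        Map<String, Integer> limits = keyLengths(props);

        // Assumed example of a ClueWeb12 document identifier (illustrative only).
        String sampleDocno = "clueweb12-0000tw-00-00000";
        boolean fits = sampleDocno.length() <= limits.get("docno");
        System.out.println("limits=" + limits + ", sample docno fits: " + fits);
    }
}
```

If the two lists ever differ in length, the check fails early rather than leaving the mismatch to surface during indexing.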

Retrieval

last edited 2019-06-05 09:06:06 by CraigMacdonald