TREC Web Corpus : WT10g

Note (2002-10-11): The following description of WT10g was last updated in March 2000. To obtain WT10g and/or the more recent .GOV test collection, see our access to data page.

Goals in the preparation of WT10g

There were a number of goals in the preparation of WT10g. These included:

A more substantial quantity of Web data than was available in WT2g.
A higher "quality" of Web data than is present in either WT2g or VLC2. This meant trying to eliminate non-English and binary data documents. (Foreign language documents are not uninteresting, but retrieval over mixed language collections is currently served by the cross-language track in TREC and the new cross-language workshop.) It also meant trying to eliminate "uninteresting" servers and/or documents.
Elimination of large quantities of redundant or duplicate data.
A larger number of inter-server links than was present in WT2g.
Better support for distributed information retrieval experiments.
Preservation of certain statistical properties from the VLC2, such as server size distribution.

Properties of WT10g

1 692 096 documents
11 680 servers
an average of 144 documents per server
a minimum of 5 documents per server
171 740 inter-server links (within the collection)
9977 servers with inter-server in-links (within the collection)
8999 servers with inter-server out-links (within the collection)
1 295 841 documents with out-links (within the collection)
1 532 012 documents with in-links (within the collection)

Contents of WT10g

WT10g consists of data distributed on 5 CDs, numbered CD1 to CD5. The data is split into individual directories, WTX001, WTX002 and so on. Within each directory, documents are bundled together into files of roughly 2MB in size, numbered B01, B02 .. B50. The bundle files are all compressed using gzip, so exist as B01.gz etc. There is no particular ordering from docids to documents (other than the VLC2 ordering).
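As a rough illustration of the layout just described, the following Python sketch lists the bundle files expected in one directory and decompresses one of them. The root path is a hypothetical location to which a CD has been copied, and the Latin-1 decoding is an assumption, since this description does not state a character encoding.

```python
import gzip
from pathlib import Path


def bundle_paths(root, directory_number, bundle_count=50):
    """Yield the expected bundle paths for one WTXnnn directory.

    `root` is a hypothetical directory holding a copy of one CD.
    `bundle_count` defaults to 50; WTX104 holds only 7 bundles.
    """
    directory = Path(root) / f"WTX{directory_number:03d}"
    for bundle_number in range(1, bundle_count + 1):
        yield directory / f"B{bundle_number:02d}.gz"


def read_bundle(path):
    """Return the decompressed text of one gzip-compressed bundle file."""
    # Assumption: Latin-1 is used only because it decodes every byte
    # without error; the real encoding of each page may differ.
    with gzip.open(path, "rt", encoding="latin-1") as handle:
        return handle.read()
```

For example, list(bundle_paths("/data/cd1", 1)) yields 50 paths running from /data/cd1/WTX001/B01.gz to /data/cd1/WTX001/B50.gz.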

CD1 contains data for the following:

  WTX001 .. WTX024, each directory contains 50 bundle files B01.gz .. B50.gz

CD2 contains data for the following:

  WTX025 .. WTX048, each directory contains 50 bundle files B01.gz .. B50.gz

CD3 contains data for the following:

  WTX049 .. WTX072, each directory contains 50 bundle files B01.gz .. B50.gz

CD4 contains data for the following:

  WTX073 .. WTX096, each directory contains 50 bundle files B01.gz .. B50.gz

CD5 contains data for the following:

  WTX097 .. WTX104, each directory contains 50 bundle files B01.gz .. B50.gz

                    except WTX104, containing 7 bundle files B01.gz .. B07.gz

CD5 also contains:

  info

which has additional information generated for WT10g data, described below.

Note well: The contents of this directory ( WT10g::CD5::info ) do not constitute part of WT10g's data.

None of the files in this info directory should be indexed.

It contains the following files:

README - this file

docid_to_url - mappings: WT10g docid -> URL (*)

homepages - mappings: server name -> WT10g docid

in_links - mappings: WT10g docid -> set of WT10g docids, whose pages contain (incoming) links to this page (*)

out_links - mappings: WT10g docid -> set of WT10g docids, whose pages are named by (outgoing) links from this page (*)

servers - server names

url_to_docid - mappings: URL -> WT10g docid

wt10g_to_vlc2 - mappings: WT10g docid -> VLC2 docid (*)

URLs are of the form:       http://server_name/path

Server names are of the form:   www.foo.com:port_number

Port numbers are of the form:   1234 (but are usually just 80)

WT10g docids are of the form:   WTX123-B45-6789, where the final component is the document's number within the bundle, counted from 1

VLC2 docids are of the form:    IA012-003456-B078-901, where the final component is the document's number within the bundle, counted from 1
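The identifier forms above can be taken apart mechanically. The Python sketch below is illustrative only: it assumes that the first two components of a WT10g docid name the directory and the bundle file holding the document (which the directory and bundle naming suggests but this file does not state outright), and that every server name carries an explicit port. All function names are hypothetical.

```python
import re
from typing import NamedTuple

# Pattern for docids of the form WTX123-B45-6789.
DOCID_PATTERN = re.compile(r"^WTX(\d{3})-B(\d{2})-(\d+)$")


class DocId(NamedTuple):
    directory: int  # 123 in WTX123
    bundle: int     # 45 in B45
    number: int     # position within the bundle, counted from 1


def parse_docid(docid):
    """Split a WT10g docid into its three numeric components."""
    match = DOCID_PATTERN.match(docid)
    if match is None:
        raise ValueError(f"not a WT10g docid: {docid!r}")
    return DocId(*(int(part) for part in match.groups()))


def bundle_location(docid):
    """Return the relative path of the bundle assumed to hold the document."""
    parsed = parse_docid(docid)
    return f"WTX{parsed.directory:03d}/B{parsed.bundle:02d}.gz"


def split_server_name(server_name):
    """Split 'www.foo.com:80' into ('www.foo.com', 80)."""
    # Assumes the port is always present, as the form above indicates.
    host, _, port = server_name.rpartition(":")
    return host, int(port)
```

For example, bundle_location("WTX001-B01-1") gives WTX001/B01.gz.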

(*) Note well: All info files are sorted using the Linux sort routine, using the first entry of each line as the sort key. Since the last component of a WT10g docid is numbered sequentially from 1 upwards, and the sort order is alphabetical, these files have a slightly confusing ordering, which is not identical to the numeric ordering of the documents within each bundle. For example, the first entries of docid_to_url are:

WTX001-B01-1 http://www.ram.org:80/ramblings/movies/jimmy_hollywood.html

WTX001-B01-10 http://sd48.mountain-inter.net:80/hss/teachers/Prothero.html

WTX001-B01-100 http://www.ccs.org:80/hc/9607/win95.html

WTX001-B01-101 http://www.cdc.net:80/~dejavu/scuba-spec.html

WTX001-B01-102 http://www.cdm.com:80/humanres/jobs/enevga.html

after which there are a number of other entries followed by:

WTX001-B01-198 http://www.cdc.net:80/~dupre/pharmacy/CD581.html

WTX001-B01-199 http://www.cdnemb-washdc.org:80/baltimor.html

WTX001-B01-2 http://www.radio.cbc.ca:80/radio/programs/current/quirks/archives/feb1796.htm

WTX001-B01-20 http://moe.med.yale.edu:80/mirror/vat/la.html

WTX001-B01-200 http://www.cdc.net:80/~dupre/pharmacy/pbsound.html

WTX001-B01-201 http://www.cdnemb-washdc.org:80/sanfran.html

and so on.
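If the numeric order is wanted instead, the entries can be re-sorted with a key that compares the last docid component as an integer. This is a minimal Python sketch, assuming each line of docid_to_url holds a docid and a URL separated by whitespace, as in the entries shown above. The function names are hypothetical.

```python
def docid_sort_key(docid):
    """Sort key giving the numeric order of documents within each bundle."""
    directory, bundle, number = docid.split("-")
    # The directory and bundle parts are zero-padded, so comparing them
    # as strings is enough; only the last part must become an integer.
    return directory, bundle, int(number)


def read_mapping_lines(lines):
    """Parse 'docid URL' lines into (docid, url) pairs, skipping blanks."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        docid, url = line.split(None, 1)
        entries.append((docid, url))
    return entries


def in_numeric_order(entries):
    """Return the (docid, url) pairs sorted by numeric docid order."""
    return sorted(entries, key=lambda entry: docid_sort_key(entry[0]))
```

With this key, WTX001-B01-2 sorts directly after WTX001-B01-1 rather than after WTX001-B01-199.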

Document format

Every document in the collection follows the same layout. All documents are delimited by <DOC></DOC> tags. The unique WT10g document identifier is enclosed within <DOCNO></DOCNO> tags, and the old VLC2 document identifier is contained on the next line between <DOCOLDNO></DOCOLDNO> tags. Next comes a <DOCHDR></DOCHDR> section, which provides various bits of information about the document as reported by the HTTP server that served it to the original Internet Archive crawler. Lastly, the actual HTML source is given.
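The tag layout described here lends itself to a simple splitter. The following Python sketch is one possible approach rather than a reference parser: it takes the text of an already decompressed bundle and assumes that a literal </DOC> never appears inside a page's HTML source, which is not guaranteed. The function name and the dictionary keys are hypothetical.

```python
import re

DOC_PATTERN = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
DOCNO_PATTERN = re.compile(r"<DOCNO>(.*?)</DOCNO>", re.DOTALL)
DOCOLDNO_PATTERN = re.compile(r"<DOCOLDNO>(.*?)</DOCOLDNO>", re.DOTALL)
DOCHDR_PATTERN = re.compile(r"<DOCHDR>(.*?)</DOCHDR>", re.DOTALL)


def _field(pattern, text):
    """Return the stripped contents of the first match, or None."""
    match = pattern.search(text)
    return match.group(1).strip() if match else None


def split_documents(bundle_text):
    """Yield one dictionary per <DOC>...</DOC> block in a bundle."""
    for doc in DOC_PATTERN.finditer(bundle_text):
        body = doc.group(1)
        header = DOCHDR_PATTERN.search(body)
        # Everything after the header section is treated as HTML source.
        html = body[header.end():] if header else body
        yield {
            "docno": _field(DOCNO_PATTERN, body),
            "docoldno": _field(DOCOLDNO_PATTERN, body),
            "header": header.group(1).strip() if header else None,
            "html": html.strip(),
        }
```

Each yielded dictionary carries the WT10g docid, the VLC2 docid, the server-reported header text and the HTML source as separate fields.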

Disclaimer

While all reasonable attempts have been made to accurately identify URLs and link references occurring in documents in this collection, we make no guarantee as to the correctness or completeness of the information contained in the files in this directory. In particular, URL canonicalisation is a fiendishly problematic task, especially with relative URLs and HTML tags such as base hrefs. Similarly, servers are identified sometimes by IP addresses and sometimes by hostname. It may be the case that some hostnames are aliases for others, and/or for IP addresses represented within the collection. In all cases, do not rely on the info files to be completely accurate.
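To give a sense of what canonicalisation involves, here is a small Python sketch that resolves a link against its page (or against a base href, if one is given) and rewrites the result into the http://server_name:port/path form used in the info files. It is not the procedure used to build those files, which is not described here. Defaulting to port 80 and dropping fragments are assumptions, and hostname aliases and IP addresses are not handled at all.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def canonicalise(link, page_url, base_href=None):
    """Resolve `link` and return it with a lower-case host and explicit port."""
    # A base href, when present, replaces the page URL as the base
    # against which relative links are resolved.
    base = urljoin(page_url, base_href) if base_href else page_url
    parts = urlsplit(urljoin(base, link))
    host = (parts.hostname or "").lower()
    port = parts.port or 80  # assumption: plain HTTP on the default port
    path = parts.path or "/"
    # The fragment is dropped because it does not name a separate document.
    return urlunsplit((parts.scheme, f"{host}:{port}", path, parts.query, ""))
```

For instance, the link ../b.html on the page http://WWW.Foo.com/a/x/index.html resolves to http://www.foo.com:80/a/b.html.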

If you encounter any major discrepancies within the info files, we would be very grateful to hear about them.

Updated: 2003-05-13