The .GOV test collection

Modified: 2 April 2002

.GOV is a TREC test collection.

Stats:

Documents:                                1247753 (1.25 million)

By MIME type (as reported by the server):

  text/html                               1053372
  application/pdf                         131333
  text/plain                              43754
  application/msword                      13842
  application/postscript                  5673
  other types that turned out to be text  44

Bundles:                                  4613
Total size:                               19455030550 bytes = 18.1G (35.3G without the 100k limit)
Average bundle size:                      4217435 bytes = 4.0M
Average document size:                    15592 bytes = 15.2k (probably raised by the PDF, Word and PostScript documents)
Document truncation length:               100kb (in practice "roughly 100k")
Documents without words:                  55 (e.g. where pdftotext produced a large amount of output that was then truncated)
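
The size figures above can be cross-checked against one another. The following is a minimal sketch, not part of the collection's tooling. It assumes that the k/M/G suffixes are binary multiples (powers of 1024) and that the two averages are the total size divided by the bundle and document counts, rounded down. The function name `summarize` is hypothetical.

```python
# Figures copied from the statistics listed above.
TOTAL_BYTES = 19455030550
DOCUMENTS = 1247753
BUNDLES = 4613

# Assumption: the k/M/G suffixes on this page are powers of 1024.
KIB, MIB, GIB = 1024, 1024**2, 1024**3


def summarize(total_bytes, documents, bundles):
    """Return the average bundle and document sizes in whole bytes.

    Assumption: averages are rounded down (integer division).
    """
    return {
        "avg_bundle_bytes": total_bytes // bundles,
        "avg_doc_bytes": total_bytes // documents,
    }


stats = summarize(TOTAL_BYTES, DOCUMENTS, BUNDLES)
print(f"total size:   {TOTAL_BYTES} = {TOTAL_BYTES / GIB:.1f}G")
print(f"avg bundle:   {stats['avg_bundle_bytes']} = {stats['avg_bundle_bytes'] / MIB:.1f}M")
print(f"avg document: {stats['avg_doc_bytes']} = {stats['avg_doc_bytes'] / KIB:.1f}k")
```

Under these assumptions the sketch reproduces the listed values (18.1G, 4217435 = 4.0M, 15592 = 15.2k), which suggests the averages were computed from the truncated collection rather than from the original 35.3G crawl.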

Distribution information: http://ir.dcs.gla.ac.uk/test_collections/access_to_data.html