Modified: 2 April 2002
.GOV is a TREC test collection.
Stats:
Documents:              1247753 (1.25 million)
  text/html:              1053372 (mime types reported by server)
  application/pdf:        131333
  text/plain:             43754
  application/msword:     13842
  application/postscript: 5673
  other stuff which turned out to be text: 44
Bundles:                4613
Total size:             19455030550 = 18.1G (without 100k limit was 35.3G)
Average bunsize:        4217435 = 4.0M
Average docsize:        15592 = 15.2k (higher due to pdf+word+ps I think)
Doc truncation length:  100kb (turned out to be "roughly 100k")
Docs without words:     55 (e.g. due to pdftotext producing much text followed by truncation)
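The averages and the G/M/k figures above can be re-derived from the three raw counts. The sketch below is only a sanity check: the function and key names are made up for illustration, and it assumes that averages were rounded down to whole bytes and that G, M and k mean powers of 1024. The source states neither assumption, but both reproduce the listed numbers.

```python
# Raw figures copied from the stats above.
TOTAL_BYTES = 19455030550
DOCUMENTS = 1247753
BUNDLES = 4613


def derived_stats(total_bytes, documents, bundles):
    """Recompute the derived figures (hypothetical helper, not from the source).

    Assumes floor division for the averages and binary units
    (1k = 1024 bytes, 1M = 1024k, 1G = 1024M).
    """
    avg_bundle = total_bytes // bundles
    avg_doc = total_bytes // documents
    return {
        "total_gib": round(total_bytes / 2**30, 1),
        "avg_bundle_bytes": avg_bundle,
        "avg_bundle_mib": round(avg_bundle / 2**20, 1),
        "avg_doc_bytes": avg_doc,
        "avg_doc_kib": round(avg_doc / 2**10, 1),
    }


stats = derived_stats(TOTAL_BYTES, DOCUMENTS, BUNDLES)
for name, value in stats.items():
    print(f"{name}: {value}")
```

Under those assumptions the output agrees with the listed 18.1G total, the 4217435-byte (4.0M) average bundle and the 15592-byte (15.2k) average document.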
Distribution information: http://ir.dcs.gla.ac.uk/test_collections/access_to_data.html