This page lists publicly available IR test collections.
Some are held locally and some are pointers to remote sites.
Collections held at Glasgow
Collection  Docs    Queries  Size (MB)
LISA        5,872   35       3.4
NPL         11,429  93       3.1
CACM        3,204   64       2.2
CISI        1,460   112      2.2
Cranfield   1,400   225      1.6
Time        423     83       1.5
Medline     1,033   30       1.1
ADI         82      35       0.04
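To compare the collections above at a glance, the figures in the table can be turned into average document size and documents per query. The following is a minimal sketch, not part of the original page: the names COLLECTIONS, kb_per_doc and summarize are illustrative, and it assumes 1 MB = 1024 KB and that the listed sizes cover the whole distribution of each collection.

```python
# Figures copied from the table: name -> (documents, queries, size in MB).
COLLECTIONS = {
    "LISA": (5872, 35, 3.4),
    "NPL": (11429, 93, 3.1),
    "CACM": (3204, 64, 2.2),
    "CISI": (1460, 112, 2.2),
    "Cranfield": (1400, 225, 1.6),
    "Time": (423, 83, 1.5),
    "Medline": (1033, 30, 1.1),
    "ADI": (82, 35, 0.04),
}


def kb_per_doc(docs, size_mb):
    """Average size per document in KB, assuming 1 MB = 1024 KB."""
    return size_mb * 1024 / docs


def summarize(collections):
    """Return (name, KB per document, documents per query) rows,
    sorted by KB per document, largest first."""
    rows = []
    for name, (docs, queries, size_mb) in collections.items():
        rows.append((name, kb_per_doc(docs, size_mb), docs / queries))
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    for name, kb, docs_per_query in summarize(COLLECTIONS):
        print(f"{name:<10} {kb:6.2f} KB/doc {docs_per_query:7.1f} docs/query")
```

Because the sizes are rounded to one or two decimal places, the per-document averages are rough guides only.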
To quote from the readme file:
This test collection was created to assist information retrieval research.
It is a clinically-oriented MEDLINE subset, consisting of 348,566 references
(out of a total of over 7 million), covering all references from 270 medical
journals over a five-year period (1987-1991).
- Reuters-21578 collection
The Reuters-21578 text categorization test collection is available through this link.
The TREC collections aren't in the public domain. However, we are now the official distributors for
the following TREC collections: WT2G, WT10G, DOTGOV, DOTGOV2.
- Europarl Parallel Corpus
Europarl is a parallel corpus for statistical machine translation testing.