These datasets are distributed by the University of Glasgow to support research
on information retrieval and related technologies. All collections have been or
are being used by several tracks of the TREC conference.
Medium of Distribution
We use 2.5" or 3.5" SATA hard disk drives to distribute collections that are too
large for DVD or CD-ROM: GOV2, Blogs06 and Blogs08 will be available ONLY on this
medium. You will need to specify either a Linux or a Windows file system. If you
receive a collection on a hard drive, you will need to install it in your system.
If you don't have a spare slot in your machine, consider using an external hard
disk enclosure with a USB 2.0 or FireWire interface. These are available quite
cheaply, and we use one for writing the disks. The hard drive is yours to keep.
Available Collections
The Web/Blog research collections are distributed by the University of Glasgow for
research purposes only. In order to receive copies of one or more of these
collections, you must sign an agreement with the University of Glasgow and pay
a contribution to the University's various costs in preparing, maintaining and
distributing the data.
Web Collections:
Collection | Size | Fee | Sample Documents |
.GOV2 | 426 GB | £650 | - |
.GOV | 18 GB | £500 | |
WT10g | 10 GB | £500 | |
WT2g | 2 GB | £350 | |
Blog Collections:
Collection | Size | Fee | Sample Documents |
Blogs06 | 148 GB | £500 | - |
Blogs08 | 2.25 TB | £600 | - |
Notes:
Obtaining Test Collections
To obtain a test collection, please follow the steps described below.
Information on Agreements (IMPORTANT - PLEASE READ):
Please note that the organisation agreements are
normally signed for one research group or a small unit within a legal entity,
and not for the whole entity. The licensed group is usually a small and
homogeneous group of researchers working together on the same topic and within
the same location.
For example, the license could be for the Information Retrieval research group
of the Department of Computer Science of University X. In this case, the
“Organisation” on the license is the Information Retrieval group of the
Department of Computer Science, while the “Corporation/Legal Entity” is
University X. The Machine Learning research group of the same Department
would need to buy another license.
Steps to obtain the collections:
The organisational agreement must be signed by a person with authority to do so
on behalf of your organisation. This person should initial each page of the
agreement (see the “Initials” field at the bottom right corner of each page).
Notes:
1. We cannot ship the collections until we have both received your signed organisation agreement in good order (see Information on Agreements above) and cleared your payment.
2. Payment is made by electronic transfer to the University of Glasgow's bank account; payment by cheque is also possible.
If you are in a hurry:
Please ensure that you complete all
of the above steps as early as possible! The most common causes of
delayed shipment are:
We can usually process and ship
standard requests within a day or two of clearing your payment. However, while
we make every effort to ship data quickly, note that i)
distributing data is not our only job, and ii) other groups may be ahead of you
in the data distribution queue.
Requests that do not comply with the
above guidelines will not be processed.
What Happens Next?
Please note that the current fees are at their lowest possible values. We regret to
inform you that we are unable to offer any reduction on the fees, even for
those organisations based in developing countries.