TREC-BLOG

OVERVIEW

The Blog track explores information-seeking behavior in the blogosphere. The track was first introduced in TREC 2006 and will run again in TREC 2009.

This wiki page provides the guidelines for participation in the 2009 edition of the TREC Blog track. Updates and new information will be posted on this page.

CONTENTS

  1. OVERVIEW
  2. MAILING-LIST
  3. DATASETS
    1. Blogs06 Collection
    2. Blogs08 Collection
  4. History of Blog Track
    1. TREC 2006
    2. TREC 2007
    3. TREC 2008
  5. TREC 2009
    1. Faceted Blog Distillation Task
    2. Top Stories Identification Task
  6. Timeline
  7. History of Document
  8. Track Coordinators

MAILING-LIST

There is a mailing list for TREC-blog that is run by NIST. To subscribe to the trec-blog list, send an email message to listproc@nist.gov such that the body of the message consists of the line

If you later wish to unsubscribe from the TREC-blog mailing list, send an email to listproc@nist.gov such that the body of the message consists of the line

To contact the Blog track organisers, please email trecblog-organisers (at) dcs.gla.ac.uk.

DATASETS

Blogs06 Collection

The TREC Blogs06 collection is a large sample of the blogosphere, and contains spam as well as possibly non-blogs, e.g. RSS feeds from news broadcasters. It was crawled over an eleven-week period from 6th December 2005 until 21st February 2006. The collection is 148GB in size, consisting of:

The collection contains over 3.2 million permalink documents. Further information on the Blogs06 collection and how it was created can be found in the DCS Technical Report TR-2006-224, Department of Computing Science, University of Glasgow, at http://www.dcs.gla.ac.uk/~craigm/publications/macdonald06creating.pdf

The collection was used in TREC 2006, 2007 and 2008.

Blogs08 Collection

In 2009, the Blog track will use a new collection called Blogs08. The collection is larger than the previous Blogs06 collection and covers a much longer timespan. Indeed, Blogs08 is one order of magnitude bigger than Blogs06, and samples the blogosphere from January 2008 to February 2009. The uncompressed permalinks are approximately 1.3TB in size; including feeds, the collection amounts to over 2TB of data.

Blogs08 has 28,488,767 blog posts from 1,303,520 blog feeds. Further details and statistics about the collection are provided at: http://ir.dcs.gla.ac.uk/test_collections/blogs08info.html

The collection has been available since 9th April 2009. License details and information on how to get access to the TREC Blogs08 collection are provided at http://ir.dcs.gla.ac.uk/test_collections

History of Blog Track

TREC 2006

In TREC 2006, we had two tasks, a main task (opinion retrieval) and an open task. The opinion retrieval task focused on a specific aspect of blogs: the opinionated nature of many blogs. The second task was introduced to allow participants the opportunity to influence the determination of a suitable second task for 2007 on other aspects of blogs, such as the temporal/event-related nature of many blogs, or the severity of spam in the blogosphere.

Further detailed information about the TREC 2006 Blog track can be found at http://www.science.uva.nl/~mdr/Wikis/. The TREC 2006 Wiki is password protected. You will need to ask Maarten de Rijke for a login (mdr (at) science.uva.nl).

The TREC 2006 Blog track 'Overview paper' has appeared in the Proceedings of TREC 2006, and is available from the TREC Web site at http://trec.nist.gov/pubs/trec15/papers/BLOG06.OVERVIEW.pdf NB: You should cite this paper when you refer to the TREC 2006 Blog track, or describe the opinion finding task in publications.

TREC 2007

TREC 2007 saw the addition of a new main task and a new subtask, namely a blog distillation (feed search) task and a polarity subtask respectively, along with a second year of the opinion retrieval task. The polarity subtask was added as a natural extension of the opinion task, and was intended to represent a text classification-related task, requiring participants to determine the polarity (or orientation) of the opinions in the retrieved documents, namely whether the opinions are positive, negative or mixed. The newly introduced blog distillation task was an articulation of an ad hoc search task, where users wish to identify blogs (i.e. feeds) about a given topic, which they can subscribe to and read on a regular basis.

Further detailed information about the TREC 2007 Blog track can be found at TREC-BLOG/TREC2007

The TREC 2007 Blog track 'Overview paper' has appeared in the Proceedings of TREC 2007, and is available from the TREC Web site at http://trec.nist.gov/pubs/trec16/t16_proceedings.html NB: Please cite this paper when you refer to the blog track or when you describe the TREC 2007 opinion finding task and/or polarity subtask and/or blog distillation task in publications.

You can also consult the ICWSM 2008 paper below, which summarises two years of the TREC Blog track: 'On the TREC Blog Track', in Proceedings of the International Conference on Weblogs and Social Media (ICWSM 2008), Seattle, 2008. http://www.dcs.gla.ac.uk/~craigm/publications/ounis08trecblog.pdf

TREC 2008

Following our conclusions from both the TREC 2006 and the 2007 Blog tracks, we structured the Blog track 2008 around four tasks:

The Blogs06 test collection was used for all experiments (see http://ir.dcs.gla.ac.uk/test_collections/blog06info.html).

Further detailed information about the TREC 2008 Blog track can be found at TREC-BLOG/TREC2008

The TREC 2008 Blog track 'Overview paper' will appear in the Proceedings of TREC 2008, after it completes the WERB process. In the meantime, an updated draft of the paper is available at http://www.dcs.gla.ac.uk/~ounis/blogOverview2008.pdf

TREC 2009

As discussed in the Blog track workshop at TREC 2008, the Blog track 2009 will make use of the new Blogs08 test collection, a larger and more up-to-date sample of the blogosphere, which covers a much longer timespan than the Blogs06 collection.

The Blog track 2009 aims to investigate more refined and complex search scenarios in the blogosphere. In particular, following discussions at the Blog track workshop at TREC 2008, we propose to run the following tasks:

Faceted Blog Distillation Task

Task Background

Blog search users often wish to identify blogs about a given topic, which they can subscribe to and read on a regular basis. This user task is most often manifested in two scenarios:

In the TREC Blog track, we have been investigating the latter scenario: blog distillation. The blog distillation task can be summarised as 'Find me a blog with a principal, recurring interest in X'. For a given area X, systems should suggest feeds that are principally devoted to X over the timespan of the feed, and that would be recommended for subscription as interesting feeds about X (i.e. a user may be interested in adding them to their RSS reader).

In its TREC 2007 and TREC 2008 form, the blog distillation task focused only on topical relevance and did not address the quality of the retrieved blogs. Following a position paper by Marti Hearst et al. in SSM 2008, we propose a refinement of the blog distillation task that takes into account a number of attributes or facets such as the authority of the blog, its opinionated nature, the trustworthiness of its authors, or the genre of the blog and its style of writing. The new faceted blog distillation task can be summarised as 'Find me a good blog with a principal, recurring interest in X'. The task has the following characteristics:

The facets will be allocated on a per-topic basis. Evaluation will be done as for the blog distillation task in 2008, with the caveat that blogs should be assessed on the facets active for a given topic.

Training can be done on the Blogs06 collection using the previous years' relevance assessments, albeit without facets.

Task Definition

We propose several facets for the TREC 2009 blog distillation task, which may be of varying difficulty to identify for participant systems. Topics will have facets of interest attached to them, but there will be a reasonable spread between all facets in use for this year. The facets that will be considered for TREC 2009 are:

1. Opinionated: Some bloggers may make opinionated comments on the topics of interest, while others report factual information. A user may be interested in blogs in which opinionated content is prevalent. For this facet, the values of interest are 'opinionated' vs 'factual' blogs.

2. Personal: Companies are increasingly using blogging as an activity for PR purposes. However, a user may not wish to read such mostly marketing or commercial blogs, and may prefer instead to keep to blogs that appear to be written in personal time without commercial influences. For this facet, the values of interest are 'personal' vs 'official' blogs.

3. In-depth: Users might be interested to follow bloggers whose posts express in-depth thoughts and analysis on the reported issues, preferring these over bloggers who simply provide quick bites on these topics, without taking the time to analyse the implications of the provided information. For this facet, the values of interest are 'indepth' vs. 'shallow' blogs (in terms of their treatment of the subject).

NB:

For each topic, systems should supply the top 100 blogs that they judge both relevant to the topic and likely to satisfy the first value of the facet attached to the topic (e.g. opinionated), followed by the top 100 blogs for the second value of that facet (e.g. factual). In addition, for each topic, systems should provide a ranking of blogs where no facet value is applied (denoted by 'none').

Example:

<top>
<num>1051</num>
<query>Example query</query>
<facet>personal</facet>
<description> longer statement of the information need </description>
<narrative> description </narrative> 
</top>
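As an illustration only, a topic file in the layout of the example above could be read as sketched here. The function name parse_topics and the regular expressions are assumptions made for this sketch, not part of any official tooling, and the released topic files may differ in detail from the example.

```python
import re

# Assumes each topic is wrapped in <top>...</top> and that every field is a
# simple <tag>value</tag> pair, as in the example topic above.
TOP_RE = re.compile(r"<top>(.*?)</top>", re.DOTALL)
FIELD_RE = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)


def parse_topics(text):
    """Return one dict per <top> block, keyed by tag name (hypothetical helper)."""
    topics = []
    for block in TOP_RE.findall(text):
        fields = {tag: value.strip() for tag, value in FIELD_RE.findall(block)}
        topics.append(fields)
    return topics


sample = """<top>
<num>1051</num>
<query>Example query</query>
<facet>personal</facet>
<description> longer statement of the information need </description>
<narrative> description </narrative>
</top>"""

topics = parse_topics(sample)
print(topics[0]["num"], topics[0]["facet"])
```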

Runs have the format detailed below. In particular, for each topic, you should produce three rankings of 100 blogs each: one for the first value of the facet enabled, one with the second value of the facet enabled, and one for a baseline ranking with no facet whatsoever enabled. For example, for the personal facet, the first ranking would have 100 blogs that your system thinks are Personal, the second ranking would have 100 blogs which your system thinks are Official, while the third ranking would have 100 blogs which your system thinks are relevant to the topic, without any consideration for the facet.

topic-facet_value1 Q0 docno rank sim runtag
....
topic-facet_value2 Q0 docno rank sim runtag
....
topic-facet_none Q0 docno rank sim runtag

For example:

1051-personal Q0 blog08-feed-00002 1 10 testRun
1051-personal Q0 blog08-feed-00001 2 9 testRun
...
1051-official Q0 blog08-feed-00501 1 10.1 testRun
1051-official Q0 blog08-feed-00112 2 9.2 testRun
...
1051-none Q0 blog08-feed-00001 1 20.1 testRun
1051-none Q0 blog08-feed-00041 2 17.1 testRun
...
1052...
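The run format above can be produced mechanically once a system has its three rankings per topic. The following is a minimal sketch under stated assumptions: format_run and the FACET_VALUES table are hypothetical names, and the keys of that table assume that the topic's facet tag carries the first value of the facet, as in the example topic.

```python
# Facet value pairs as named in the task definition; "none" is the baseline
# ranking. The dictionary keys are an assumption about the <facet> tag values.
FACET_VALUES = {
    "opinionated": ("opinionated", "factual"),
    "personal": ("personal", "official"),
    "indepth": ("indepth", "shallow"),
}


def format_run(topic, facet, rankings, runtag, depth=100):
    """Format the three rankings for one topic as run lines.

    rankings maps a facet value (or "none") to a list of (docno, score)
    pairs that are already sorted by decreasing score.
    """
    first, second = FACET_VALUES[facet]
    lines = []
    for value in (first, second, "none"):
        # Keep at most `depth` blogs per ranking; ranks start at 1.
        for rank, (docno, score) in enumerate(rankings[value][:depth], start=1):
            lines.append(f"{topic}-{value} Q0 {docno} {rank} {score} {runtag}")
    return lines


rankings = {
    "personal": [("blog08-feed-00002", 10), ("blog08-feed-00001", 9)],
    "official": [("blog08-feed-00501", 10.1), ("blog08-feed-00112", 9.2)],
    "none": [("blog08-feed-00001", 20.1), ("blog08-feed-00041", 17.1)],
}
run_lines = format_run(1051, "personal", rankings, "testRun")
print("\n".join(run_lines))
```

With the toy rankings shown, the printed lines match the six explicit lines of the example for topic 1051.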

Participating groups may submit up to four runs for the faceted blog distillation task. We wholeheartedly encourage the submission of manual runs, which are invaluable in improving the quality of the collection. (An automatic run is one that involves no human interaction. In contrast, a manual run is one where (for example) you formulate queries, search manually, give relevance feedback, and/or rerank documents by hand.)

Assessment

Topic development and relevance assessments for this task will be performed by NIST. We have actively pursued the option of obtaining query logs from a commercial search engine to assist the creation of realistic topics.

The following scale will be used for the assessment:

Evaluation

There will be 50 test targets. Metrics will be precision/recall based, with MAP as the most important metric.
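For reference, mean average precision (MAP) over a set of rankings can be sketched as follows. This is a simplified illustration that assumes binary relevance judgements; the function names and the toy data are hypothetical, and the official evaluation tooling and assessment scale may differ.

```python
def average_precision(ranking, relevant):
    """Average precision of one ranked list against a set of relevant ids."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, docno in enumerate(ranking, start=1):
        if docno in relevant:
            hits += 1
            # Precision at the rank of each relevant item retrieved.
            precision_sum += hits / rank
    # Relevant items that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(run, qrels):
    """Mean of average precision over every topic listed in qrels."""
    if not qrels:
        return 0.0
    total = sum(average_precision(run.get(topic, []), relevant)
                for topic, relevant in qrels.items())
    return total / len(qrels)


# Toy data: two rankings with made-up ids and judgements.
run = {"1051-personal": ["a", "b", "c"], "1051-official": ["x", "y"]}
qrels = {"1051-personal": {"a", "c"}, "1051-official": {"y"}}
map_score = mean_average_precision(run, qrels)
print(round(map_score, 4))
```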

Top Stories Identification Task

Task Background

Query logs from commercial search engines show a fair number of news-related queries, suggesting that blog search users are interested in the blogosphere's response to news stories as they develop.

We propose to run a new pilot search task addressing the news dimension in the blogosphere: For a given unit of time (e.g. a date), systems will be asked to identify the top news stories (similar to what is displayed on the main page of Google Blog Search or Google News), and provide a list of relevant blog posts discussing each news story. The ranked list of blog posts should be diverse, covering different aspects of, or opinions on, the news story.

Participating System: Inputs & Output

Participating groups will be provided with a large sample of news headlines and their corresponding dates from throughout the timespan of the Blogs08 corpus. Participants will also have access to the Blogs08 corpus, from which they can extract relevant date information.

In response to a date "query", systems should provide a ranking of 100 headlines that they think were important on the specified day. Moreover, for each headline, they should provide a ranking of 10 blog posts which are relevant to and discuss the news story headline.

The dates of the provided headlines will be the ones used by the news broadcaster. For example, a story that happens in Europe very early in the morning of day d can be issued with a date d-1 by an American news broadcaster. Because of this possible disparity between the date on which the news broadcaster issued the headline and the date on which the story actually happened, participating systems should rank all headlines dated within one day of the query date d (i.e. headlines on day d-1, day d, and day d+1).

On the other hand, note that relevant blog posts may naturally be posted on or after the date of the news headline, or even shortly before it (recall the possible time disparity). They just have to be on topic, i.e. related to the news headline. The blog posts selected for a given headline should be diverse in that they discuss different aspects, perspectives or opinions of the news story.
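The one-day window around the query date that applies to headlines can be computed as in the sketch below. It assumes dates in the YYYYMMDD form used by the sample query; the helper names and the mapping from headline identifiers to dates are hypothetical, since the exact layout of the date information in the headline corpus is not specified here.

```python
from datetime import datetime, timedelta


def candidate_dates(query_date, window=1):
    """Return the YYYYMMDD strings for days d-window .. d+window."""
    day = datetime.strptime(query_date, "%Y%m%d").date()
    return [(day + timedelta(days=offset)).strftime("%Y%m%d")
            for offset in range(-window, window + 1)]


def candidate_headlines(headline_dates, query_date):
    """Keep the headline ids whose date falls inside the window.

    headline_dates is an assumed mapping from headline id to YYYYMMDD date.
    """
    allowed = set(candidate_dates(query_date))
    return sorted(hid for hid, date in headline_dates.items() if date in allowed)


# Toy data with made-up dates for two headline ids.
headline_dates = {
    "BLOG08-NEWS-0000001": "20080420",
    "BLOG08-NEWS-0000002": "20080425",
}
print(candidate_dates("20080424"))
print(candidate_headlines(headline_dates, "20080424"))
```

No such window applies to the supporting blog posts, which only have to be on topic.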

Importantly, the aim of the task is to ascertain the usefulness of the blogosphere in real-time news identification. Moreover, as the headline information is available on the Web, groups should use only the data provided, and not resort to external news resources or systems to enrich their system's knowledge. When external resources - beyond the Blogs08 collection and the provided sample of headlines and their corresponding dates - are used, these should be clearly mentioned. Runs using external resources will be reported separately.

Sample news headline corpus:

BLOG08-NEWS-0000001 News headline 1 here
BLOG08-NEWS-0000002 News headline 2 here
...

Sample query:

<top>
<num>1110</num>
<date>20080424</date>
</top>
...

The system response format is similar to that of the TREC Enterprise track Expert Search task. For each headline, the response includes a list of at most 10 supporting relevant documents that discuss the news story and cover its various aspects.

Sample system response:

1110 Q0 BLOG08-NEWS-0000002 1 10.0 runtag
SUPPORT BLOG08-20080426-000258281 1 1.5 runtag
SUPPORT BLOG08-20080426-000333190 2 1.3 runtag
1110 Q0 BLOG08-NEWS-0010056 2 9.8 runtag
...

Participating groups may submit up to four runs for the top stories identification task. Each run consists of a ranking of 100 headlines, and their corresponding supporting relevant posts.
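A run in the response format above can be checked before submission with a small reader such as the one sketched here. The function names are hypothetical, and the limits simply restate the figures given in these guidelines (100 headlines per query, at most 10 supporting posts per headline).

```python
def parse_top_stories_run(lines):
    """Group each SUPPORT line under the headline line that precedes it."""
    results = []
    current = None
    for raw in lines:
        parts = raw.split()
        if not parts or parts[0] == "...":
            continue  # skip blank lines and the elision marker used in samples
        if parts[0] == "SUPPORT":
            if current is None:
                raise ValueError("SUPPORT line before any headline line")
            _, docno, rank, score, _runtag = parts
            current["support"].append((docno, int(rank), float(score)))
        else:
            topic, _q0, headline, rank, score, runtag = parts
            current = {"topic": topic, "headline": headline, "rank": int(rank),
                       "score": float(score), "runtag": runtag, "support": []}
            results.append(current)
    return results


def check_limits(results, max_headlines=100, max_support=10):
    """Return True if no query or headline exceeds the stated limits."""
    per_topic = {}
    for entry in results:
        per_topic[entry["topic"]] = per_topic.get(entry["topic"], 0) + 1
        if len(entry["support"]) > max_support:
            return False
    return all(count <= max_headlines for count in per_topic.values())


sample = [
    "1110 Q0 BLOG08-NEWS-0000002 1 10.0 runtag",
    "SUPPORT BLOG08-20080426-000258281 1 1.5 runtag",
    "SUPPORT BLOG08-20080426-000333190 2 1.3 runtag",
    "1110 Q0 BLOG08-NEWS-0010056 2 9.8 runtag",
    "...",
]
parsed = parse_top_stories_run(sample)
print(len(parsed), check_limits(parsed))
```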

Assessment

Assessors will use multiple sources of evidence to answer three questions: (i) What are the top news stories for a given day? (ii) Which blog posts are relevant to a given news story? (iii) Which aspects of the news story do the blog posts discuss?

1. News Story Headline Assessment: Only headlines published within one day of the query date d (i.e. on day d-1, day d or day d+1) can be judged relevant. Assessors will decide, using various sources of evidence, what the top stories were for a given day.

2. Blog Post Assessment: For each top news story, assessors will decide on the relevant blog posts discussing the news story.

3. Relevant Blog Post Diversity Assessment: For the relevant blog posts for a news story, assessors will group these posts into topics covering various aspects of the news story.

Evaluation

There will be 50 test targets. Evaluation will use precision/recall measures based on correct story headlines, with MAP as the most important metric.

The second-level evaluation will examine how good each system is at identifying relevant blog posts for each news story. This evaluation will also be scored by MAP. However, similar to the TREC 2009 Web track, we will also examine diversity: systems will be penalised for retrieving blog posts which do not add any information or perspectives to those already retrieved.

Timeline

History of Document

Track Coordinators

last edited 2009-08-07 11:22:12 by IadhOunis