TREC-BLOG/TREC2009

TREC Blog Track 2009

As discussed in the Blog track workshop at TREC 2008, the Blog track 2009 will make use of the new Blogs08 test collection, a larger and more up-to-date sample of the blogosphere that covers a much longer time span than the Blogs06 collection.

The Blog track 2009 aims to investigate more refined and complex search scenarios in the blogosphere. In particular, following discussions at the Blog track workshop at TREC 2008, we propose to run the following tasks:

A draft version of the final TREC 2009 Blog track overview is available online [WWW] http://www.dcs.gla.ac.uk/~craigm/publications/blogOverview2009.pdf.

Faceted Blog Distillation Task

Task Background

Blog search users often wish to identify blogs about a given topic, which they can subscribe to and read on a regular basis. This user task is most often manifested in two scenarios:

In the TREC Blog track, we have been investigating the latter scenario – blog distillation. The blog distillation task can be summarised as: Find me a blog with a principal, recurring interest in X. For a given area X, systems should suggest feeds that are principally devoted to X over the timespan of the feed, and that a user would be recommended to subscribe to as an interesting feed about X (i.e. a user may be interested in adding it to their RSS reader).

In its TREC 2007 and TREC 2008 forms, the blog distillation task focused only on topical relevance; it did not address the quality of the retrieved blogs. Following a [WWW] position paper by Marti Hearst et al. in SSM 2008, we propose a refinement of the blog distillation task that takes into account a number of attributes or facets, such as the authority of the blog, its opinionated nature, the trustworthiness of its authors, or the genre of the blog and its style of writing. The new faceted blog distillation task can be summarised as: Find me a good blog with a principal, recurring interest in X. The task has the following characteristics:

The facets will be allocated on a per-topic basis. Evaluation will be done as for the blog distillation task in 2008, with the caveat that blogs should be assessed on the facets active for a given topic.

Training can be done on the Blogs06 collection using the previous years' relevance assessments, albeit without facets.

Task Definition

We propose several facets for the TREC 2009 blog distillation task, which may be of varying difficulty to identify for participant systems. Topics will have facets of interest attached to them, but there will be a reasonable spread between all facets in use for this year. The facets that will be considered for TREC 2009 are:

1. Opinionated: Some bloggers make opinionated comments on the topics of interest, while others report factual information. A user may be interested in blogs that are predominantly opinionated. For this facet, the values of interest are 'opinionated' vs 'factual' blogs.

2. Personal: Companies are increasingly using blogging for PR purposes. However, a user may not wish to read such marketing-oriented or commercial blogs, preferring instead blogs that appear to be written in the author's personal time, without commercial influences. For this facet, the values of interest are 'personal' vs 'official' blogs.

3. In-depth: Users might be interested in following bloggers whose posts express in-depth thoughts and analysis of the reported issues, preferring these over bloggers who simply provide quick bites on these topics without taking the time to analyse the implications of the provided information. For this facet, the values of interest are 'indepth' vs. 'shallow' blogs (in terms of their treatment of the subject).

NB:

For each topic, systems should supply the top 100 blogs that they consider both relevant to the topic and likely to satisfy the first facet value attached to the topic (e.g. opinionated), followed by a ranking for the second facet value (e.g. factual). In addition, for each topic, systems should provide a ranking of blogs where no facet value is applied (denoted by 'none').

Example:

<top>
<num>1051</num>
<query>Example query</query>
<facet>personal</facet>
<description> longer statement of the information need </description>
<narrative> description </narrative> 
</top>

Runs have the format detailed below. In particular, for each topic, you should produce three rankings of 100 blogs each: one with the first facet value enabled, one with the second facet value enabled, and one baseline ranking with no facet enabled at all. For example, for the personal facet, the first ranking would have 100 blogs that your system thinks are personal, the second ranking would have 100 blogs that your system thinks are official, while the third ranking would have 100 blogs that your system thinks are relevant to the topic, without any consideration of the facet.

topic-facet_value1 Q0 docno rank sim runtag
....
topic-facet_value2 Q0 docno rank sim runtag
....
topic-facet_none Q0 docno rank sim runtag

For example:

1051-personal Q0 blog08-feed-00002 1 10 testRun
1051-personal Q0 blog08-feed-00001 2 9 testRun
...
1051-official Q0 blog08-feed-00501 1 10.1 testRun
1051-official Q0 blog08-feed-00112 2 9.2 testRun
...
1051-none Q0 blog08-feed-00001 1 20.1 testRun
1051-none Q0 blog08-feed-00041 2 17.1 testRun
...
1052...
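The three-ranking run format above can be produced mechanically. The following is a minimal illustrative sketch (the function name, the scored blogs, and the topic/facet values are hypothetical, chosen to mirror the example above):

```python
def format_faceted_run(topic, value1, value2, ranked, runtag):
    """Produce run-file lines for one topic in the format
    'topic-facet_value Q0 docno rank sim runtag'.
    ranked maps each facet label (value1, value2, 'none') to a list of
    (docno, score) pairs already sorted by decreasing score."""
    lines = []
    for label in (value1, value2, "none"):
        for rank, (docno, score) in enumerate(ranked[label][:100], start=1):
            lines.append(f"{topic}-{label} Q0 {docno} {rank} {score} {runtag}")
    return lines

# Illustrative rankings matching the example above.
rankings = {
    "personal": [("blog08-feed-00002", 10.0), ("blog08-feed-00001", 9.0)],
    "official": [("blog08-feed-00501", 10.1), ("blog08-feed-00112", 9.2)],
    "none":     [("blog08-feed-00001", 20.1), ("blog08-feed-00041", 17.1)],
}
lines = format_faceted_run(1051, "personal", "official", rankings, "testRun")
```

A real run would concatenate the lines for all 50 topics into one file, truncating each ranking at 100 blogs as required.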

Participating groups may submit up to four runs for the faceted blog distillation task. We wholeheartedly encourage the submission of manual runs, which are invaluable in improving the quality of the collection. (An automatic run is one that involves no human interaction. In contrast, a manual run is one where (for example) you formulate queries, search manually, give relevance feedback, and/or rerank documents by hand.)

Assessment

Topic development and relevance assessments for this task will be performed by NIST. We have actively pursued the option of obtaining query logs from a commercial search engine to assist the creation of realistic topics.

The following scale will be used for the assessment:

Evaluation

The number of test targets is 50. Metrics will be precision/recall-based, with MAP as the primary measure.
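As a reminder of how MAP is computed, here is a minimal sketch (function names and toy data are illustrative): average precision for one topic is the mean of precision@k over the ranks k at which relevant documents are retrieved, and MAP is the mean of this value over all topics.

```python
def average_precision(ranking, relevant):
    """AP for one topic: mean of precision@k at each rank k that
    retrieves a relevant document (0 if nothing relevant is found)."""
    hits, precisions = 0, []
    for k, docno in enumerate(ranking, start=1):
        if docno in relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs, qrels):
    """MAP: runs maps topic -> ranked docnos,
    qrels maps topic -> set of relevant docnos."""
    return sum(average_precision(runs[t], qrels[t]) for t in qrels) / len(qrels)
```

For example, a ranking ["a", "b", "c", "d"] against relevant set {"a", "c"} scores (1/1 + 2/3) / 2 = 5/6.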

Top Stories Identification Task

Task Background

Query logs from commercial search engines show a fair number of news-related queries, suggesting that blog search users have an interest in the blogosphere's response to news stories as they develop.

We propose to run a new pilot search task addressing the news dimension in the blogosphere: for a given unit of time (e.g. a date), systems will be asked to identify the top news stories (similar to what is displayed on the main page of Google Blog Search or Google News), and to provide a list of relevant blog posts discussing each news story. The ranked list of blog posts should be diverse, covering different aspects of, or opinions on, the news story.

Participating System: Inputs & Output

Participating groups will be provided with a large sample of news headlines and their corresponding dates from throughout the timespan of the Blogs08 corpus. Participants will also have access to the Blogs08 corpus, from which they can extract relevant date information.

In response to a date "query", systems should provide a ranking of 100 headlines that they think were important on the specified day. Moreover, for each headline, they should provide a ranking of 10 blog posts that are relevant to and discuss the news story headline.

The dates of the provided headlines are those used by the news broadcaster. For example, a story that happens in Europe very early in the morning of day d may be dated d-1 by an American news broadcaster. Because of this possible disparity between the date the headline was issued by the news broadcaster and the date the story actually happened, participating systems should consider all headlines within +-1 day of the query date d (i.e. headlines on day d, day d-1, and day d+1).

On the other hand, note that relevant blog posts may naturally be posted on or after the date of the news headline, or even shortly before it (recall the possible time disparity). They need only be on topic, i.e. related to the news headline. The blog posts selected for a given headline should be diverse, in that they discuss different aspects, perspectives or opinions of the news story.
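The +-1 day window on headline dates described above can be sketched as a simple filter. This is a minimal illustration (the function name and sample headline IDs/dates are hypothetical, using the YYYYMMDD date format of the topics):

```python
from datetime import datetime, timedelta

def headlines_in_window(query_date, headlines, window_days=1):
    """Keep headlines dated within +-window_days of the query date.
    query_date and headline dates are strings in YYYYMMDD format;
    headlines is a list of (headline_id, date) pairs."""
    centre = datetime.strptime(query_date, "%Y%m%d")
    delta = timedelta(days=window_days)
    return [(hid, d) for hid, d in headlines
            if abs(datetime.strptime(d, "%Y%m%d") - centre) <= delta]

candidates = headlines_in_window("20080424", [
    ("BLOG08-NEWS-0000001", "20080423"),   # d-1: inside the window
    ("BLOG08-NEWS-0000002", "20080424"),   # d:   inside the window
    ("BLOG08-NEWS-0000003", "20080426"),   # d+2: outside the window
])
```

Systems would then rank only the surviving candidates for the query date.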

Importantly, the aim of the task is to ascertain the usefulness of the blogosphere in real-time news identification. Moreover, since the headline information is available on the Web, groups should use only the data provided, and not resort to external news resources or systems to enrich their system's knowledge. When external resources - beyond the Blogs08 collection and the provided sample of headlines and their corresponding dates - are used, this should be clearly stated. Runs using external resources will be reported separately.

Sample news headline corpus:

BLOG08-NEWS-0000001 News headline 1 here
BLOG08-NEWS-0000002 News headline 2 here
...

Sample query:

<top>
<num>1110</num>
<date>20080424</date>
</top>
...

The system response format is similar to that of the TREC Enterprise track expert search task: each retrieved headline is followed by a list of supporting relevant documents (at most 10) covering various aspects of the news story.

Sample system response:

1110 Q0 BLOG08-NEWS-0000002 1 10.0 runtag
SUPPORT BLOG08-20080426-000258281 1 1.5 runtag
SUPPORT BLOG08-20080426-000333190 2 1.3 runtag
1110 Q0 BLOG08-NEWS-0010056 2 9.8 runtag
...

Participating groups may submit up to four runs for the top stories identification task. Each run consists of a ranking of 100 headlines, and their corresponding supporting relevant posts.
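The two-level response format above (headline lines interleaved with SUPPORT lines) can be generated as follows. This is an illustrative sketch only; the function name and the scored stories are hypothetical, chosen to reproduce the sample response above:

```python
def format_top_stories_run(topic, stories, runtag):
    """Produce run-file lines for one date query. stories is an ordered
    list of (headline_id, score, supports) triples, where supports is
    an ordered list of (post_id, score) pairs (at most 10 are kept).
    Each headline line is followed by its SUPPORT lines."""
    lines = []
    for rank, (hid, score, supports) in enumerate(stories[:100], start=1):
        lines.append(f"{topic} Q0 {hid} {rank} {score} {runtag}")
        for srank, (pid, sscore) in enumerate(supports[:10], start=1):
            lines.append(f"SUPPORT {pid} {srank} {sscore} {runtag}")
    return lines

lines = format_top_stories_run(1110, [
    ("BLOG08-NEWS-0000002", 10.0, [("BLOG08-20080426-000258281", 1.5),
                                   ("BLOG08-20080426-000333190", 1.3)]),
    ("BLOG08-NEWS-0010056", 9.8, []),
], "runtag")
```

A full run concatenates these lines for all 50 date queries.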

Assessment

Assessors will use multiple sources of evidence to answer three questions: (i) What are the top news stories for a given day? (ii) Which blog posts are relevant to a given news story? (iii) Which aspects of the news story do the blog posts discuss?

1. News Story Headline Assessment: Only headlines published within +-1 day of the query date d can be judged relevant. Assessors will decide, using various sources of evidence, what the top stories were for a given day.

2. Blog Post Assessment: For each top news story, assessors will decide on the relevant blog posts discussing the news story.

3. Relevant Blog Post Diversity Assessment: For the relevant blog posts for a news story, assessors will group these posts into topics covering various aspects of the news story.

Evaluation

The number of test targets will be 50. Evaluation will use precision/recall measures based on correct story headlines, with MAP as the primary metric.

The second-level evaluation will examine how well each system identifies relevant related blog posts. In this second evaluation, we will also score by MAP. However, similar to the TREC 2009 Web track, we will also examine diversity: systems will be penalised for retrieving blog posts that add no information or perspectives to those already retrieved.
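One common way to penalise redundancy, in the spirit of the alpha-nDCG measure used for diversity evaluation in the TREC 2009 Web track, is to discount a post's gain geometrically each time an already-seen aspect reappears. The sketch below is purely illustrative (the function name, aspect labels, and the choice of alpha = 0.5 are assumptions, not the track's official scoring):

```python
def diversity_gain(ranked_posts, post_aspects, alpha=0.5):
    """Cumulative gain over a ranked list of posts, where covering an
    aspect for the (n+1)-th time contributes alpha**n instead of 1.
    post_aspects maps each post id to the set of aspects it covers."""
    seen = {}     # aspect -> number of times already covered
    total = 0.0
    for post in ranked_posts:
        for aspect in post_aspects.get(post, ()):
            total += alpha ** seen.get(aspect, 0)
            seen[aspect] = seen.get(aspect, 0) + 1
    return total
```

For instance, two posts covering the same single aspect followed by one post covering a new aspect yields 1 + 0.5 + 1 = 2.5, whereas three posts covering three distinct aspects would yield 3.0.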

Timeline

History of Document

last edited 2010-05-14 13:11:31 by IadhOunis