TREC-BLOG

OVERVIEW

The Blog track explores information seeking behaviour in the blogosphere. The track was first introduced in TREC 2006, and will run again in TREC 2010.

This Wiki page provides the guidelines for participating in the 2010 edition of the TREC Blog track. Updates and new information will appear on this page.

CONTENTS

  1. OVERVIEW
  2. MAILING-LIST
  3. DATASETS
    1. Blogs06 Collection
    2. Blogs08 Collection
  4. History of Blog Track
    1. TREC 2006
    2. TREC 2007
    3. TREC 2008
    4. TREC 2009
  5. TREC 2010
    1. Faceted Blog Distillation Task
    2. Top Stories Identification Task
  6. Timeline
  7. History of Document
  8. Track Coordinators

MAILING-LIST

There is a mailing list for TREC-blog that is run by NIST. To subscribe to the trec-blog list, send an email message to [MAILTO] listproc@nist.gov such that the body of the message consists of the line

If you later wish to unsubscribe from the TREC-blog mailing list, send an email to [MAILTO] listproc@nist.gov such that the body of the message consists of the line

If you wish to contact the Blog track organisers directly, please email trecblog-organisers (at) dcs.gla.ac.uk. Note that the Blog track organisers cannot directly add you to the TREC Blog mailing list.

DATASETS

Blogs06 Collection

The TREC Blogs06 collection is a large sample of the blogosphere, and contains spam as well as some possibly non-blog content, e.g. RSS feeds from news broadcasters. It was crawled over an eleven-week period, from 6th December 2005 until 21st February 2006. The collection is 148GB in size, consisting of blog feeds, permalink documents (i.e. individual blog posts), and blog homepages.

The collection was used in TREC 2006, 2007 and 2008.

The collection contains over 3.2 million permalink documents. Further information on the Blogs06 collection and how it was created can be found in the DCS Technical Report TR-2006-224, Department of Computing Science, University of Glasgow, at [WWW] http://www.dcs.gla.ac.uk/~craigm/publications/macdonald06creating.pdf

As described in the report above, a quantity of spam was deliberately included in the Blogs06 collection, in the form of a number of "assumed spam blogs" (splogs). While TREC participants did not have access to the list of assumed splogs, the list is now attached here: blog06_spam_feeds.txt.gz. NB: You should not make use of this list except to study spam and its effects within the Blogs06 collection. For example, removing all splogs in this list and then comparing results with those of the TREC 2006 opinion finding task participants is not fair, as they did not have access to this list.

Further information about how spam affected the participating groups can be found in the Blog track overview papers (2006-2008), as well as in the following paper:

C. Macdonald, I. Ounis and I. Soboroff. Is Spam an Issue for Opinionated Blog Post Search? In Proceedings of SIGIR 2009, Boston, USA. [WWW] pdf

Blogs08 Collection

In 2009, the Blog track adopted a new collection, called Blogs08. The collection is larger than the previous Blogs06 collection, with a much longer timespan. Indeed, Blogs08 is one order of magnitude bigger than Blogs06, and samples the blogosphere from January 2008 to February 2009. The uncompressed permalink size is approximately 1.3TB; including feeds, this amounts to over 2TB of data.

Blogs08 has 28,488,767 blog posts from 1,303,520 blog feeds. Further details and statistics about the collection are provided at: [WWW] http://ir.dcs.gla.ac.uk/test_collections/blogs08info.html

The collection has been available since 9th April 2009. License details and information on how to obtain access to the TREC Blogs08 collection are provided at [WWW] http://ir.dcs.gla.ac.uk/test_collections

History of Blog Track

TREC 2006

In TREC 2006, we had two tasks: a main task (opinion retrieval) and an open task. The opinion retrieval task focused on a specific aspect of blogs: the opinionated nature of many of them. The open task was introduced to give participants the opportunity to influence the choice of a suitable second task for 2007, addressing other aspects of blogs, such as the temporal/event-related nature of many blogs, or the severity of spam in the blogosphere.

Further detailed information about the TREC 2006 Blog track can be found at [WWW] http://www.science.uva.nl/~mdr/Wikis/ The TREC 2006 Wiki is password protected; you will need to ask Maarten de Rijke for a login (mdr (at) science.uva.nl).

The TREC 2006 Blog track 'Overview paper' appeared in the Proceedings of TREC 2006, and is available from the TREC Web site at [WWW] http://trec.nist.gov/pubs/trec15/papers/BLOG06.OVERVIEW.pdf NB: You should cite this paper when you refer to the TREC 2006 Blog track, or when you describe the opinion finding task in publications.

TREC 2007

TREC 2007 saw the addition of a new main task and a new subtask: a blog distillation (feed search) task and a polarity subtask, respectively, alongside a second year of the opinion retrieval task. The polarity subtask was a natural extension of the opinion task, intended as a text classification-related task requiring participants to determine the polarity (or orientation) of the opinions in the retrieved documents, namely whether the opinions are positive, negative or mixed. The newly introduced blog distillation task was an instance of an ad hoc search task, where users wish to identify blogs (i.e. feeds) about a given topic, which they can subscribe to and read on a regular basis.

Further detailed information about the TREC 2007 Blog track can be found in TREC-BLOG/TREC2007

The TREC 2007 Blog track 'Overview paper' appeared in the Proceedings of TREC 2007, and is available from the TREC Web site at [WWW] http://trec.nist.gov/pubs/trec16/t16_proceedings.html NB: Please cite this paper when you refer to the Blog track, or when you describe the TREC 2007 opinion finding task, polarity subtask, and/or blog distillation task in publications.

You can also look at the ICWSM 2008 paper below, which summarises the first two years of the TREC Blog track: 'On the TREC Blog Track', in Proceedings of the International Conference on Weblogs and Social Media (ICWSM 2008), Seattle, 2008. [WWW] http://www.dcs.gla.ac.uk/~craigm/publications/ounis08trecblog.pdf

TREC 2008

Following our conclusions from both the TREC 2006 and the 2007 Blog tracks, we structured the 2008 Blog track around four tasks: a baseline adhoc retrieval task, the opinion finding task, the polarity subtask, and the blog distillation task.

The Blogs06 test collection was used for all experiments (see [WWW] http://ir.dcs.gla.ac.uk/test_collections/blog06info.html ).

Further detailed information about the TREC 2008 Blog track can be found in TREC-BLOG/TREC2008

The TREC 2008 Blog track 'Overview paper' is available at [WWW] http://trec.nist.gov/pubs/trec17/papers/BLOG.OVERVIEW08.pdf

TREC 2009

In 2009, the Blog track introduced a new corpus, called Blogs08, and tackled two new tasks: the faceted blog distillation task and the top stories identification task.

Further detailed information about the TREC 2009 Blog track can be found in TREC-BLOG/TREC2009

The final TREC 2009 Blog track overview paper is available online at [WWW] http://trec.nist.gov/pubs/trec18/papers/BLOG09.OVERVIEW.pdf

TREC 2010

The 2010 Blog track refines the tasks of 2009, using more queries and a two-stage submission procedure. In particular, the following two tasks will run again: the faceted blog distillation task and the top stories identification task.

To participate in the TREC 2010 Blog track, please ensure that you have responded to the [WWW] TREC 2010 Call for Participation.

To submit runs, use the TREC Blog track submit form at [WWW] http://ir.nist.gov/trecsubmit/blog.html

Faceted Blog Distillation Task

Task Background

Blog search users often wish to identify blogs about a given topic, which they can subscribe to and read on a regular basis. This user task is most often manifested in two scenarios:

In the TREC Blog track, we have been investigating the latter scenario – blog distillation. The blog distillation task can be summarised as "Find me a blog with a principal, recurring interest in X". For a given area X, systems should suggest feeds that are principally devoted to X over the timespan of the feed, and that a user would be recommended to subscribe to as an interesting feed about X (i.e. the user may be interested in adding it to their RSS reader).

In its TREC 2007 and TREC 2008 form, the blog distillation task focused only on topical relevance; it did not address the quality aspect of the retrieved blogs. Following a [WWW] position paper by Marti Hearst et al. at SSM 2008, for TREC 2009 we proposed a refinement of the blog distillation task that takes into account a number of attributes or facets, such as the authority of the blog, its opinionated nature, the trustworthiness of its authors, or the genre of the blog and its style of writing. The new faceted blog distillation task can be summarised as "Find me a good blog with a principal, recurring interest in X". The task has the following characteristics:

The facets will be allocated on a per-topic basis. Evaluation will be done as for the blog distillation task in 2009, with the caveat that blogs should be assessed on the facets active for a given topic. Training can be done on the Blogs08 collection using relevance assessments from TREC 2009.

The same three facets proposed for the TREC 2009 blog distillation task will be used. Each topic has a facet of interest attached to it, and there will be a reasonable spread across all facets in use this year. The facets are:

1. Opinionated: Some bloggers make opinionated comments on the topics of interest, while others report factual information. A user may be interested in blogs that are predominantly opinionated. For this facet, the values of interest are 'opinionated' vs 'factual' blogs.

2. Personal: Companies are increasingly using blogging for PR purposes. However, a user may not wish to read such mostly marketing or commercial blogs, preferring instead blogs that appear to be written in someone's personal time, without commercial influence. For this facet, the values of interest are 'personal' vs 'official' blogs.

3. In-depth: Users may be interested in following bloggers whose posts express in-depth thoughts and analysis of the reported issues, preferring these over bloggers who simply provide quick bites on these topics, without taking the time to analyse the implications of the provided information. For this facet, the values of interest are 'indepth' vs 'shallow' blogs (in terms of their treatment of the subject).

NB: For a given topic, the appropriate facet will be chosen by the TREC assessors during topic development.

Example:

<top>
<num>1051</num>
<query>Example query</query>
<facet>personal</facet>
<description> longer statement of the information need </description>
<narrative> description </narrative> 
</top>

This task will be run as two separate sub-tasks, namely the Baseline Blog Distillation and the Faceted Blog Distillation.

Task Definitions
Baseline Blog Distillation: Definition

The baseline blog distillation sub-task consists of ranking, for each topic, the 100 blogs that your system considers most relevant to the topic, without any consideration of the facets attached to the topic. This sub-task corresponds exactly to the TREC 2007 & 2008 blog distillation tasks, or to the "None" facet rankings from TREC 2009.

Faceted Blog Distillation: Definition

In the faceted blog distillation sub-task, for each topic, systems should supply the top 100 blogs that they consider both relevant to the topic and likely to satisfy the first facet value of interest attached to the topic (e.g. opinionated), followed by a second ranking of 100 blogs for the second value of the facet (e.g. factual).

Evaluation

There will be at least 50 new test topics. We require all runs to use both the old topics (all 50 of the 2009 blog distillation topics) and the new topics.

Metrics will be precision/recall-based, with mean average precision (MAP) as the most important metric.
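
As an illustration of the measure, here is a minimal sketch of average precision for a single topic (a hypothetical helper in Python, not the official evaluation tool):

# Minimal sketch (hypothetical helper): average precision for one topic.
# 'ranking' is a list of docnos in rank order; 'relevant' is the set of
# docnos judged relevant for the topic.
def average_precision(ranking, relevant):
    hits, precision_sum = 0, 0.0
    for rank, docno in enumerate(ranking, start=1):
        if docno in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0

# MAP is then the mean of average_precision over all topics.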

Task Submissions

There will be a two stage submission procedure. This is detailed separately for each stage below.

Submitting Baseline Blog Distillation Runs

Run format will be as follows:

topic Q0 docno rank sim runtag

For example, for a run named "testRun":

1051 Q0 blog08-feed-00002 1 10 testRun
...
1052 Q0 blog08-feed-38120 1 19.9 testRun
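
As an illustration, a minimal sketch (in Python, with hypothetical variable names) of producing a run file in this format:

# Minimal sketch: writing a baseline distillation run file. 'results'
# maps each topic number to (docno, score) pairs; the names here are
# hypothetical, only the six-column output format matters.
results = {1051: [("blog08-feed-00002", 10.0), ("blog08-feed-38120", 9.5)]}

with open("testRun.txt", "w") as out:
    for topic in sorted(results):
        # At most 100 blogs per topic, ranked by descending score.
        ranked = sorted(results[topic], key=lambda pair: -pair[1])[:100]
        for rank, (docno, sim) in enumerate(ranked, start=1):
            out.write("%d Q0 %s %d %g testRun\n" % (topic, docno, rank, sim))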

Participants may submit up to two runs to the Baseline Blog Distillation sub-task, including a compulsory automatic "query-only" run.

To submit Baseline Blog Distillation sub-task runs, use the TREC Blog track submit form at [WWW] http://ir.nist.gov/trecsubmit/blog.html

Submitting Faceted Blog Distillation Runs

Runs have the format detailed below. In particular, for each topic, you should produce two rankings of 100 blogs each: one with the first value of the facet enabled, and one with the second value of the facet enabled. For example, for the personal facet, the first ranking would have 100 blogs that your system thinks are personal, and the second ranking would have 100 blogs that your system thinks are official.

topic-facet_value1 Q0 docno rank sim runtag
....
topic-facet_value2 Q0 docno rank sim runtag

For example:

1051-personal Q0 blog08-feed-00002 1 10 testRun
1051-personal Q0 blog08-feed-00001 2 9 testRun
...
1051-official Q0 blog08-feed-00501 1 10.1 testRun
1051-official Q0 blog08-feed-00112 2 9.2 testRun
...
1052...
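
As a sanity check before submitting, here is a minimal sketch of validating this format (a hypothetical helper; the topic-to-facet-values mapping is assumed to be built from the topics file):

from collections import defaultdict

# Minimal sketch (hypothetical helper): check that a faceted run has
# both facet-value rankings for every topic, each with at most 100
# blogs. 'topic_facets' is assumed to come from the topics file,
# e.g. {1051: ("personal", "official")}.
def check_faceted_run(path, topic_facets):
    counts = defaultdict(int)
    with open(path) as run:
        for line in run:
            key = line.split()[0]            # e.g. "1051-personal"
            topic, value = key.split("-", 1)
            counts[(int(topic), value)] += 1
    for topic, values in topic_facets.items():
        for value in values:
            n = counts[(topic, value)]
            assert 0 < n <= 100, "topic %s, value %s: %d lines" % (topic, value, n)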

You are permitted to submit up to 4 runs based on each of your two previously submitted baseline task runs (i.e. 4 runs per baseline of your own, 8 maximum). These will use all 100 topics. One of the submitted runs must be an automatic, query-only run. If your system cannot be cleanly decomposed into baseline and facet-ranking components, then you can select "N/A" as the baseline run (though this will make it difficult for you to see the advantage of your facet-ranking features). For the clarity of your own analysis, we suggest not varying the length of the query (with/without description or narrative) between the baseline and the faceted run.

We wholeheartedly encourage the submission of manual runs, which are invaluable in improving the quality of the collection. (An automatic run is one that involves no human interaction. In contrast, a manual run is one where (for example) you formulate queries, search manually, give relevance feedback, and/or rerank documents by hand.)

One of the aims of the faceted blog distillation task is to compare how effective systems are at ranking with respect to a facet. For this reason, runs will be compared against a common baseline, where possible. To aid this cross-comparison, NIST will select a few "standard baselines" from the baseline blog distillation task, to be redistributed to all participants.

You MAY submit up to 4 runs for EACH of the provided standard baselines or your own baseline runs (in total, you can submit a maximum of 4*(3 standard baselines + your 2 own baseline runs) = 20 runs). While the submission form will ask you to state the baseline run, for easier analysis please make it clear in your run names and descriptions how these runs relate to the runs based on your own baselines. In particular, please use the suffix _s1 for runs using standard baseline 1, _s2 for runs using standard baseline 2, and _s3 for runs using standard baseline 3. Moreover, state the relationship in the run description (e.g., FooRun3_s2 is "FooRun3's facet reranker, using standard baseline 2", where FooRun3 is a run that uses your own baseline).

We particularly encourage you to apply any given facet-ranking approach to ALL three standard baselines. This allows interesting and useful insights to be drawn about the performance and robustness of your faceted search approach.

Note that you can participate in the faceted blog distillation task without having previously submitted a run to the baseline task. In particular, you can apply your faceted blog distillation approach to the standard baselines, AND/OR use the "N/A" option for the corresponding baseline for a faceted blog distillation run.

Since you may submit up to 20 runs, please take care how you assign priorities for pooling. Please only give high priority to the runs you really want pooled (3 runs at most), and lower priority to others.

To submit Faceted Blog Distillation sub-task runs, use the TREC Blog track submit form at [WWW] http://ir.nist.gov/trecsubmit/blog.html

Assessment

Topic development and relevance assessment for this task will not be performed by participants. One of the aims of the task is to increase the number of topics for each facet. For this reason, some topics may be judged through the use of crowdsourcing.

The following scale will be used for the assessment:

Top Stories Identification Task

Task Background

The top stories identification task was first run as a pilot task in TREC 2009 to address the news dimension of the blogosphere. In particular, we address whether the blogosphere can be used to identify the most important news stories for a given day (for a motivation of this task, see the TREC 2009 Blog Overview).

This task involves two aspects:

1. Identifying top news stories for a given unit of time and category - the 'Story Ranking Task'

2. Identifying relevant blog posts for a given news story, that cover different/diverse aspects or opinions - the 'News Blog Post Ranking Task'

Unlike last year, we will use standardised news categories (e.g. Sport, Technology, World, Business) within the task. Moreover, the task will mimic a real-time environment, as described below.

Story Ranking Task Definition

Participating groups will be provided with a large sample of the full text and headlines of news stories, with their corresponding dates, from throughout the timespan of the Blogs08 corpus. Participants also have access to the Blogs08 corpus, which contains blog posts covering the same timespan. For TREC 2009, news headlines from the NYTimes were used. This year, Thomson-Reuters have kindly provided the TRC2 newswire corpus, covering the same dates. The TRC2 corpus is much larger, contains both the headline and the content of each news story, and is distributed by NIST free of charge. See the Active Participants page for the forms needed to access the corpus. The completed organisational agreement must be emailed/faxed back to the indicated address at NIST; you will then be able to download the corpus. In particular, please use the TRC2-headlines-docs-TRECBLOG.v2.gz file, as this contains story ids and timestamps (see below). NB: Please note the requirement to acknowledge Thomson-Reuters in your TREC notebook papers, as well as in any research disseminations that use this data.

Unlike TREC 2009, this task will be treated as an online event detection task, i.e. it mimics a real-time environment. To facilitate this, we will provide timestamp information for the data listed below. In particular, the timestamp is an integer representing the number of days elapsed since 14th January 2008, and is present in:

1. the TRC2 news stories (the <BLOGS08DAY> tag);

2. the day topics (the <blogs08day> tag);

3. the Blogs08 blog posts.
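
As an illustration, a minimal sketch of deriving such a timestamp from a calendar date; the exact off-by-one convention should be verified against the timestamps in the distributed files:

from datetime import date

# Minimal sketch: days elapsed since the 14th January 2008 epoch.
# Verify whether the epoch itself counts as day 0 or day 1 against
# the timestamps in the distributed files.
EPOCH = date(2008, 1, 14)

def blogs08day(d):
    return (d - EPOCH).days

print(blogs08day(date(2008, 4, 24)))  # 101 under this convention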

In response to a date "query", systems should provide, for EACH category of news (see the definition of categories below), a ranking of 100 news stories that they think were important on the specified day (as defined by matching the <BLOGS08DAY> and <blogs08day> timestamps between the news story and the topic). When ranking stories, only evidence from blog posts published AT OR BEFORE the query timestamp may be used.

A blacklist of news stories from TRC2 is provided on the Active Participants website as topNews-blacklist.docnos.txt.gz. Systems must not use or return these stories.

Moreover, blog post evidence from after the "date query" timestamp MUST NOT be used to identify top news. The aim of the task is to ascertain the usefulness of the blogosphere in real-time news identification. Hence, groups should use only the data provided, and not resort to external news resources or systems to enrich their system's knowledge, particularly where such evidence is after the event. When such "timely" external resources - beyond the Blogs08 collection and the TRC2 newswire corpus and their corresponding timestamps - are used, these should be clearly mentioned. Runs using timely external resources will be reported separately.

Format of TRC2 news corpus:

<DOC>
<DOCNO>TRC2-date-number</DOCNO>
<BLOGS08DAY>5</BLOGS08DAY>
<DATE>date</DATE>
<HEADLINE>headline of article</HEADLINE>
<CONTENT>content of article</CONTENT>
</DOC>
...

where the DOCNO tag contains the unique identifier of the headline, which you should return in the story ranking task; the BLOGS08DAY tag contains the integer timestamp described above; and the HEADLINE and CONTENT tags contain the headline and content of the story, as provided by Thomson-Reuters.
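
For illustration, a minimal sketch of reading this format, assuming well-formed, non-nested tags as in the excerpt above:

import gzip
import re

# Minimal sketch: iterate over TRC2 documents and extract the fields
# shown above. Assumes well-formed, non-nested <DOC>...</DOC> blocks.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.S)
FIELD_RE = re.compile(r"<(DOCNO|BLOGS08DAY|DATE|HEADLINE|CONTENT)>(.*?)</\1>", re.S)

def read_trc2(path):
    with gzip.open(path, "rt", errors="replace") as f:
        data = f.read()  # fine for the headlines file; stream for larger data
    for m in DOC_RE.finditer(data):
        yield dict(FIELD_RE.findall(m.group(1)))

# Example: count the stories on a given day.
# n = sum(1 for doc in read_trc2("TRC2-headlines-docs-TRECBLOG.v2.gz")
#         if int(doc["BLOGS08DAY"]) == 100)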

Sample query:

<top>
<num>TS10-01</num>
<date>2008-04-24</date>
<day>Wednesday</day>
<blogs08day>100</blogs08day>
</top>
...

where the num tag contains the topic number, and blogs08day contains the integer timestamp described above. Only TRC2 news stories with the same value in their BLOGS08DAY tag as the topic's blogs08day tag should be ranked in response to a day-topic. For example, for a topic with <blogs08day>5</blogs08day>, you should only rank TRC2 news stories with <BLOGS08DAY>5</BLOGS08DAY>, using blog post evidence from Blogs08 with timestamp <= 5. The date and day tags are self-explanatory.
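
Putting these constraints together, a minimal sketch (with hypothetical variable names) of selecting the rankable stories and the admissible blog post evidence for one day-topic:

# Minimal sketch of the constraints above (hypothetical names):
# 'stories' holds TRC2 records as parsed earlier; 'posts' holds
# (post, timestamp) pairs from Blogs08; 'blacklist' is the set of
# docnos from topNews-blacklist.docnos.txt.gz.
def candidates_for_topic(topic_day, stories, posts, blacklist):
    # Only stories from the topic's day may be ranked, minus the blacklist.
    rankable = [s for s in stories
                if int(s["BLOGS08DAY"]) == topic_day
                and s["DOCNO"] not in blacklist]
    # Only blog post evidence from at or before the topic's day may be used.
    evidence = [p for p, day in posts if day <= topic_day]
    return rankable, evidence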

A further change from TREC 2009 is the use of categories. Unlike TREC 2009, there is no overall category. Instead, we propose the following categories, and intend them to be interpreted from a US perspective: US, World, Sport, Business, and Technology.

The format of system responses this year is similar to that of the faceted blog distillation task. In particular, for each numbered "query date", we wish to have a ranking of 100 news stories for EACH of the 5 news categories listed above.

Sample system response:

TS10-01-world Q0 TRC2-2008-04-24-0004 1 10.0 runtag
TS10-01-world Q0 TRC2-2008-04-24-0010 2 9.0 runtag
...
TS10-01-us Q0 TRC2-2008-04-24-1001 1 10.8 runtag
TS10-01-us Q0 TRC2-2008-04-24-1101 2 8.3 runtag
...

where the format is topicnumber-category Q0 headline rank score name_of_run.

Participating groups may submit up to three runs for the story ranking task.

The number of test targets (date topics) will be 50. Evaluation will use precision/recall measures based on correct stories, while the "most important" metric will be MAP. We note that some news stories are duplicated, as their contents may evolve across multiple releases. We will examine how much of an issue duplication is in participants' systems, using additional appropriate evaluation measures.

News Blog Post Ranking Task Definition

After the submission of the story ranking task runs, the organisers will select a number of news stories for which relevant blog posts should be identified. The number of news stories in this stage will be 68.

For each news story, participating systems should provide three rankings of 50 blog posts each; the retrieved posts should be relevant to the news story and discuss it. Each ranking will be "centered" on a different period of time:

1. Before the timestamp of the "query date", i.e. blog posts must have timestamp <= query timestamp

2. One day after the "query date", i.e. blog posts must have timestamp <= query timestamp + 1 day

3. One week after the "query date", i.e. blog posts must have timestamp <= query timestamp + 7 days

Each ranking of blog posts should be diverse, i.e. cover multiple aspects of the news story (e.g. different opinions, types of blog posts, etc.).
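
For illustration, a minimal sketch of the three timestamp cutoffs, using the integer day convention described earlier:

# Minimal sketch: admissible blog post timestamps for the three rankings.
def cutoff(query_day, window):
    return {"before": query_day,      # timestamp <= query day
            "day": query_day + 1,     # timestamp <= query day + 1
            "week": query_day + 7}[window]

def admissible(post_day, query_day, window):
    return post_day <= cutoff(query_day, window)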

Format of topics:

<top>
<num>TS10b-NUM</num>
<blogs08day> NUM </blogs08day>
<story> TRC-DATE-NUM </story>
</top>

where TS10b-NUM is the topic number and TRC-DATE-NUM is the docno of a TRC2 headline.

Format of submitted runs:

TS10b-NUM-before Q0 BLOG08-20080426-000333190 1 9.8 runtag
TS10b-NUM-before Q0 BLOG08-20080426-000532120 2 9.0 runtag
...
TS10b-NUM-day Q0 BLOG08-20080427-010333190 1 9.9 runtag
...
TS10b-NUM-week Q0 BLOG08-20080430-11111111 1 4.2 runtag
...

Evaluation will be performed using both relevance measures (e.g. MAP) and diversity measures, primarily alpha-NDCG@10.
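
For participants who wish to estimate diversity during development, here is a minimal sketch of alpha-nDCG@k under hypothetical data structures (the official evaluation may differ in details such as the judgments used and the normalisation):

import math

# Minimal sketch of alpha-nDCG@k (hypothetical data structures):
# 'ranking' is a list of docnos; 'nuggets[d]' is the set of aspects
# (nuggets) that document d is judged to cover. The ideal ranking is
# approximated greedily, as is usual for this measure.
def alpha_dcg(ranking, nuggets, alpha=0.5, k=10):
    seen = {}  # aspect -> number of times covered so far
    dcg = 0.0
    for rank, d in enumerate(ranking[:k], start=1):
        aspects = nuggets.get(d, ())
        dcg += sum((1 - alpha) ** seen.get(a, 0) for a in aspects) / math.log(rank + 1, 2)
        for a in aspects:
            seen[a] = seen.get(a, 0) + 1
    return dcg

def alpha_ndcg(ranking, nuggets, alpha=0.5, k=10):
    remaining, ideal, seen = set(nuggets), [], {}
    while remaining and len(ideal) < k:
        # Greedily pick the document with the highest residual gain.
        best = max(remaining,
                   key=lambda d: sum((1 - alpha) ** seen.get(a, 0) for a in nuggets[d]))
        remaining.remove(best)
        ideal.append(best)
        for a in nuggets[best]:
            seen[a] = seen.get(a, 0) + 1
    idcg = alpha_dcg(ideal, nuggets, alpha, k)
    return alpha_dcg(ranking, nuggets, alpha, k) / idcg if idcg else 0.0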

For the blog post ranking task, you may submit up to three runs.

To submit news blog post ranking runs, use the TREC Blog track submit form at [WWW] http://ir.nist.gov/trecsubmit/blog.html

Timeline

History of Document

Track Coordinators
