OVERVIEW
The Blog track explores the information seeking behavior in the blogosphere. The track was first introduced in TREC 2006. The Blog track will run again in TREC 2009.
This wiki page provides the guidelines for participation in the 2009 edition of the TREC Blog track. Updates and new information will be posted on this page.
CONTENTS
- OVERVIEW
- MAILING-LIST
- DATASETS
- History of Blog Track
- TREC 2009
- Timeline
- History of Document
- Track Coordinators
MAILING-LIST
There is a mailing list for TREC-blog that is run by NIST. To subscribe to the trec-blog list, send an email message to
listproc@nist.gov such that the body of the message consists of the line
subscribe trec-blog <FirstName> <LastName>
If you later wish to unsubscribe from the TREC-blog mailing list, send an email to
listproc@nist.gov such that the body of the message consists of the line
unsubscribe trec-blog
If you wish to contact the Blog track organisers, please email: (trecblog-organisers (at) dcs.gla.ac.uk)
DATASETS
Blogs06 Collection
The TREC Blogs06 collection is a large sample of the blogosphere, and contains spam as well as possibly non-blogs, e.g. RSS feeds from news broadcasters. It was crawled over an eleven-week period from 6th December 2005 to 21st February 2006. The collection is 148GB in size, consisting of:
38.6GB of feeds
88.8GB of permalink documents
28.8GB of homepages
The collection contains over 3.2 million permalink documents. Further information on the Blogs06 collection and how it was created can be found in the DCS Technical Report TR-2006-224, Department of Computing Science, University of Glasgow, at
http://www.dcs.gla.ac.uk/~craigm/publications/macdonald06creating.pdf
The collection was used in TREC 2006, 2007 and 2008.
Blogs08 Collection
In 2009, the Blog track will use a new collection called Blogs08. The collection is larger than the previous Blogs06 collection, with a much longer timespan. Indeed, Blogs08 is one order of magnitude bigger than Blogs06, and samples the blogosphere from January 2008 to February 2009. The uncompressed permalinks are approximately 1.3TB in size; with feeds included, the collection amounts to over 2TB of data.
Blogs08 has 28,488,767 blog posts from 1,303,520 blog feeds. Further details and statistics about the collection are provided at:
http://ir.dcs.gla.ac.uk/test_collections/blogs08info.html
The collection has been available since 9th April 2009. License details and information on how to get access to the TREC Blogs08 collection are provided at
http://ir.dcs.gla.ac.uk/test_collections
History of Blog Track
TREC 2006
In TREC 2006, we had two tasks: a main task (opinion retrieval) and an open task. The opinion retrieval task focused on a specific aspect of blogs: the opinionated nature of many blogs. The open task was introduced to give participants the opportunity to influence the choice of a suitable second task for 2007 on other aspects of blogs, such as the temporal/event-related nature of many blogs, or the severity of spam in the blogosphere.
Further detailed information about the TREC 2006 Blog track can be found at
http://www.science.uva.nl/~mdr/Wikis/
The TREC 2006 Wiki is password protected. You will need to ask Maarten de Rijke for a login (mdr (at) science.uva.nl)
The TREC 2006 Blog track 'Overview paper' has appeared in the Proceedings of TREC 2006, and is available from the TREC Web site at
http://trec.nist.gov/pubs/trec15/papers/BLOG06.OVERVIEW.pdf
NB: You should cite this paper when you refer to the TREC 2006 Blog track, or describe the opinion finding task in publications.
TREC 2007
TREC 2007 saw the addition of a new main task and a new subtask, namely a blog distillation (feed search) task and a polarity subtask respectively, along with a second year of the opinion retrieval task. The polarity subtask was added as a natural extension of the opinion task, and was intended to represent a text classification-related task, requiring participants to determine the polarity (or orientation) of the opinions in the retrieved documents, namely whether the opinions are positive, negative or mixed. The newly introduced blog distillation task was an articulation of an ad hoc search task, where users wish to identify blogs (i.e. feeds) about a given topic, which they can subscribe to and read on a regular basis.
Further detailed information about the TREC 2007 Blog track can be found in TREC-BLOG/TREC2007
The TREC 2007 Blog track 'Overview paper' has appeared in the Proceedings of TREC 2007, and is available from the TREC Web site at
http://trec.nist.gov/pubs/trec16/t16_proceedings.html
NB: Please cite this paper when you refer to the Blog track or when you describe the TREC 2007 opinion finding task and/or polarity subtask and/or blog distillation task in publications.
You can also look at the ICWSM 2008 paper below, which summarises two years of the TREC Blog track: 'On the TREC Blog Track', In Proceedings of the International Conference on Weblogs and Social Media (ICWSM 2008), Seattle, 2008
http://www.dcs.gla.ac.uk/~craigm/publications/ounis08trecblog.pdf
TREC 2008
Following our conclusions from both the TREC 2006 and the 2007 Blog tracks, we structured the Blog track 2008 around four tasks:
Baseline adhoc (blog post) retrieval task
Opinion finding (blog post) retrieval task
Polarised opinion finding (blog post) retrieval task
Blog finding distillation task
The Blogs06 test collection was used for all experiments (see
http://ir.dcs.gla.ac.uk/test_collections/blog06info.html )
Further detailed information about the TREC 2008 Blog track can be found in TREC-BLOG/TREC2008
The TREC 2008 Blog track 'Overview paper' will appear in the Proceedings of TREC 2008, after it completes the WERB process. In the meantime, an updated draft of the paper is available at
http://www.dcs.gla.ac.uk/~ounis/blogOverview2008.pdf
TREC 2009
As discussed in the Blog track workshop at TREC 2008, the Blog track 2009 will make use of the new Blogs08 test collection, a larger and more up-to-date sample of the blogosphere, which covers a much longer timespan than the Blogs06 collection.
The Blog track 2009 aims to investigate more refined and complex search scenarios in the blogosphere. In particular, following discussions at the Blog track workshop at TREC 2008, we propose to run the following tasks:
Faceted blog distillation: a more refined version of the blog distillation task that addresses the quality aspect of the retrieved blogs.
Top stories identification: a task that addresses news-related issues in the blogosphere.
Faceted Blog Distillation Task
Task Background
Blog search users often wish to identify blogs about a given topic, which they can subscribe to and read on a regular basis. This user task is most often manifested in two scenarios:
Filtering: The user subscribes to a repeating search in their RSS reader.
Distillation: The user searches for blogs with a recurring central interest, and then adds these to their RSS reader.
In the TREC Blog track, we have been investigating the latter scenario: blog distillation. The blog distillation task can be summarised as Find me a blog with a principal, recurring interest in X. For a given area X, systems should suggest feeds that are principally devoted to X over the timespan of the feed, and that would be recommended for subscription as an interesting feed about X (i.e. a user may be interested in adding it to their RSS reader).
In its TREC 2007 and TREC 2008 form, the blog distillation task only focuses on topical relevance. It does not address the quality aspect of the retrieved blogs. Following a position paper by Marti Hearst et al. in SSM 2008, we propose a refinement of the blog distillation task that takes into account a number of attributes or facets such as the authority of the blog, its opinionated nature, the trustworthiness of its authors, or the genre of the blog and its style of writing. The new faceted blog distillation task can be summarised as Find me a good blog with a principal, recurring interest in X. The task has the following characteristics:
It goes beyond topical-relevance
It integrates a quality aspect in the evaluation of the retrieved blogs
It mimics an exploratory search task
The facets will be allocated on a per-topic basis. Evaluation will be done as for the blog distillation task in 2008, with the caveat that blogs should be assessed on the facets active for a given topic.
Training can be done on the Blogs06 collection using the previous years' relevance assessments, albeit without facets.
Task Definition
We propose several facets for the TREC 2009 blog distillation task, which may be of varying difficulty to identify for participant systems. Topics will have facets of interest attached to them, but there will be a reasonable spread between all facets in use for this year. The facets that will be considered for TREC 2009 are:
1. Opinionated: Some bloggers may make opinionated comments on the topics of interest, while others report factual information. A user may be interested in blogs that are predominantly opinionated. For this facet, the values of interest are 'opinionated' vs 'factual' blogs.
2. Personal: Companies are increasingly using blogging as an activity for PR purposes. However, a user may not wish to read such mostly marketing or commercial blogs, and may prefer instead to keep to blogs that appear to be written in personal time without commercial influences. For this facet, the values of interest are 'personal' vs 'official' blogs.
3. In-depth: Users might be interested to follow bloggers whose posts express in-depth thoughts and analysis on the reported issues, preferring these over bloggers who simply provide quick bites on these topics, without taking the time to analyse the implications of the provided information. For this facet, the values of interest are 'indepth' vs. 'shallow' blogs (in terms of their treatment of the subject).
NB:
For a given topic, the appropriate facet will be chosen by the TREC assessors during topic development.
In future incarnations of this task, systems may be asked to select automatically the facets they think are interesting for a given query.
For each topic, systems should supply the top 100 blogs which they think are both relevant to the topic, and which are likely to satisfy the first value (e.g. opinionated) of interest attached to the topic, followed by the second value (e.g. factual) of interest attached to the facet. In addition, for each topic, systems should provide a ranking of blogs where 'no facet value is applied' (denoted by 'none').
Example:
<top>
<num>1051</num>
<query>Example query</query>
<facet>personal</facet>
<description> longer statement of the information need </description>
<narrative> description </narrative>
</top>
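As an illustration only, a topic in the form shown above could be read with a few lines of Python. This is a minimal sketch, not an official parser: the regular-expression approach and the function name parse_topics are assumptions, and the field names are simply those that appear in the example topic.

```python
import re

# Field names taken from the example topic; the parsing approach itself is
# an assumption and not part of the track guidelines.
TOPIC_FIELDS = ("num", "query", "facet", "description", "narrative")


def parse_topics(text):
    """Return one dict per <top>...</top> element found in text."""
    topics = []
    for body in re.findall(r"<top>(.*?)</top>", text, flags=re.DOTALL):
        topic = {}
        for field in TOPIC_FIELDS:
            match = re.search(rf"<{field}>(.*?)</{field}>", body, flags=re.DOTALL)
            if match:
                # Collapse whitespace so multi-line fields become one clean string.
                topic[field] = " ".join(match.group(1).split())
        topics.append(topic)
    return topics


SAMPLE = (
    "<top> <num>1051</num> <query>Example query</query> "
    "<facet>personal</facet> "
    "<description> longer statement of the information need </description> "
    "<narrative> description </narrative> </top>"
)

topics = parse_topics(SAMPLE)
print(topics[0]["num"], topics[0]["facet"])
```

The facet field of each parsed topic tells a system which pair of facet values applies to that topic.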
Runs have the format detailed below. In particular, for each topic, you should produce three rankings of 100 blogs each: one for the first value of the facet enabled, one with the second value of the facet enabled, and one for a baseline ranking with no facet whatsoever enabled. For example, for the personal facet, the first ranking would have 100 blogs that your system thinks are Personal, the second ranking would have 100 blogs which your system thinks are Official, while the third ranking would have 100 blogs which your system thinks are relevant to the topic, without any consideration for the facet.
topic-facet_value1 Q0 docno rank sim runtag
....
topic-facet_value2 Q0 docno rank sim runtag
....
topic-facet_none Q0 docno rank sim runtag
For example:
1051-personal Q0 blog08-feed-00002 1 10 testRun
1051-personal Q0 blog08-feed-00001 2 9 testRun
...
1051-official Q0 blog08-feed-00501 1 10.1 testRun
1051-official Q0 blog08-feed-00112 2 9.2 testRun
...
1051-none Q0 blog08-feed-00001 1 20.1 testRun
1051-none Q0 blog08-feed-00041 2 17.1 testRun
...
1052...
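The three rankings per topic could be written out along the following lines. This is a sketch under stated assumptions: the function name format_faceted_run and the in-memory layout (a dict from facet value to scored feeds) are hypothetical, and it assumes one record per line, as the run format above suggests.

```python
def format_faceted_run(topic, rankings, runtag, depth=100):
    """Build run lines for one topic.

    rankings maps a facet value (e.g. "personal", "official", "none")
    to a list of (feed_id, score) pairs. This layout is hypothetical.
    """
    lines = []
    for value, scored in rankings.items():
        # Sort by descending score and keep at most `depth` feeds per ranking.
        ordered = sorted(scored, key=lambda pair: pair[1], reverse=True)[:depth]
        for rank, (feed_id, score) in enumerate(ordered, start=1):
            lines.append(f"{topic}-{value} Q0 {feed_id} {rank} {score} {runtag}")
    return lines


rankings = {
    "personal": [("blog08-feed-00001", 9), ("blog08-feed-00002", 10)],
    "official": [("blog08-feed-00501", 10.1), ("blog08-feed-00112", 9.2)],
    "none": [("blog08-feed-00001", 20.1), ("blog08-feed-00041", 17.1)],
}
run_lines = format_faceted_run(1051, rankings, "testRun")
print("\n".join(run_lines))
```

Sorting inside the function keeps rank and score consistent, so a feed with a higher score never receives a worse rank within the same ranking.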
Participating groups may submit up to four runs for the faceted blog distillation task. We wholeheartedly encourage the submission of manual runs, which are invaluable in improving the quality of the collection. (An automatic run is one that involves no human interaction. In contrast, a manual run is one where (for example) you formulate queries, search manually, give relevance feedback, and/or rerank documents by hand.)
Assessment
Topic development and relevance assessments for this task will be performed by NIST. We have actively pursued the option of obtaining query logs from a commercial search engine to assist the creation of realistic topics.
The following scale will be used for the assessment:
[-1] i.e. Not judged. The content of the blog was not examined due to offensive URLs or headers (such documents do exist in the collection due to spam). Although the content itself was not assessed, it is very likely, given the offensive headers, that the blog is irrelevant.
[0] i.e. Not relevant. The blog and its posts were examined, and do not contain any interest in the target topic area, or refer to it only in passing.
[1] i.e. Relevant but facet value unknown.
[2] i.e. Relevant and clearly inclined towards first facet value.
[3] i.e. Relevant and clearly inclined towards second facet value.
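One possible way to turn these grades into a binary relevance decision for each of the three rankings is sketched below. The mapping is an assumption made for illustration: the guidelines do not state how grades are collapsed for scoring, and the names is_relevant, "first", "second" and "none" are hypothetical.

```python
# Assessment grades from the scale above.
NOT_JUDGED = -1
NOT_RELEVANT = 0
RELEVANT_UNKNOWN = 1
FIRST_VALUE = 2
SECOND_VALUE = 3


def is_relevant(grade, ranking):
    """Binary relevance of a blog for one of the three rankings.

    ranking is "first", "second" or "none". Assumed mapping (not confirmed
    by the guidelines): the baseline ranking counts any relevant blog,
    while each facet ranking counts only blogs inclined towards its value.
    """
    if ranking == "none":
        return grade in (RELEVANT_UNKNOWN, FIRST_VALUE, SECOND_VALUE)
    if ranking == "first":
        return grade == FIRST_VALUE
    if ranking == "second":
        return grade == SECOND_VALUE
    raise ValueError(f"unknown ranking: {ranking}")


print(is_relevant(FIRST_VALUE, "first"), is_relevant(SECOND_VALUE, "first"))
```

Under this reading, a not-judged blog (grade -1) is treated as not relevant in every ranking.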
Evaluation
The number of test targets is 50. Metrics will be precision/recall based, with mean average precision (MAP) as the most important metric.
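For reference, MAP can be computed as in the sketch below. It uses the standard definition of average precision with binary relevance; the function names and the dict-based inputs are assumptions for illustration, and official scores would come from the TREC evaluation tools rather than from this code. Each ranking is assumed to contain no duplicate ids.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision of one ranking against a set of relevant ids."""
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            # Precision at the rank of each relevant item retrieved.
            precision_sum += hits / rank
    # Relevant items never retrieved contribute zero to the sum.
    return precision_sum / len(relevant_ids)


def mean_average_precision(runs, qrels):
    """runs maps topic -> ranked id list; qrels maps topic -> relevant id set."""
    scores = [average_precision(runs.get(topic, []), relevant)
              for topic, relevant in qrels.items()]
    return sum(scores) / len(scores) if scores else 0.0


ap = average_precision(["a", "b", "c", "d"], {"a", "c", "e"})
map_score = mean_average_precision(
    {"1051": ["a", "b"], "1052": ["x", "y"]},
    {"1051": {"a"}, "1052": {"y"}},
)
print(round(ap, 4), map_score)
```

In the first call, relevant items are found at ranks 1 and 3 and one relevant item is missed, so the average precision is (1/1 + 2/3) / 3.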
Top Stories Identification Task
Task Background
Query logs from commercial search engines show a fair number of news-related queries, suggesting that blog search users are interested in the blogosphere's response to news stories as they develop.
We propose to run a new pilot search task addressing the news dimension in the blogosphere: For a given unit of time (e.g. date), systems will be asked to identify the top news stories (similar to what is displayed on the main page of Google Blog Search or Google News), and provide a list of relevant blog posts discussing each news story. The ranked list of blog posts should have a diverse nature, covering different/diverse aspects or opinions of the news story.
Participating System: Inputs & Output
Participating groups will be provided with a large sample of news headlines and their corresponding dates from throughout the timespan of the Blogs08 corpus. Participants will also have access to the Blogs08 corpus, from which they can extract relevant date information.
In response to a date "query", systems should provide a ranking of 100 headlines that they think were important on the specified day. In addition, for each headline, they should provide a ranking of 10 blog posts which are relevant to and discuss the news story headline.
The dates of the provided headlines will be the ones used by the news broadcaster. For example, a story that happens in Europe very early in the morning of day d, can be issued with a date d-1 by an American news broadcaster. Because of this possible time disparity between the date when the headline was issued by the news broadcaster and the one where the story actually happened, the participating systems should rank all headlines corresponding to the query date d +-1 days (i.e. headlines on day d, day d-1, and day d+1).
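The date window described above can be computed directly from the query date. The sketch below assumes dates in the YYYYMMDD form used in the sample query; the function names and the (headline_id, date) pair layout are hypothetical, since the guidelines do not specify how headline dates are stored.

```python
from datetime import datetime, timedelta


def candidate_dates(query_date, window_days=1):
    """Return the YYYYMMDD dates within window_days either side of query_date."""
    day = datetime.strptime(query_date, "%Y%m%d").date()
    return [
        (day + timedelta(days=offset)).strftime("%Y%m%d")
        for offset in range(-window_days, window_days + 1)
    ]


def headlines_in_window(headlines, query_date):
    """Keep the ids of (headline_id, date) pairs dated within the window."""
    allowed = set(candidate_dates(query_date))
    return [headline_id for headline_id, date in headlines if date in allowed]


# Hypothetical headline dates, for illustration only.
sample_headlines = [
    ("BLOG08-NEWS-0000001", "20080423"),
    ("BLOG08-NEWS-0000002", "20080426"),
]
print(candidate_dates("20080424"))
print(headlines_in_window(sample_headlines, "20080424"))
```

Using date arithmetic rather than string manipulation keeps the window correct across month and year boundaries.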
Note that relevant blog posts may be posted on or after the date of the news headline, or even shortly before it (recall the possible time disparity). They just have to be on topic, i.e. related to the news headline. The blog posts selected for a given headline should be diverse in that they discuss different aspects, perspectives or opinions of the news story.
Importantly, the aim of the task is to ascertain the usefulness of the blogosphere in real-time news identification. Moreover, as the headline information is available on the Web, groups should use only the data provided, and not resort to external news resources or systems to enrich their system's knowledge. When external resources - beyond the Blogs08 collection and the provided sample of headlines and their corresponding dates - are used, these should be clearly mentioned. Runs using external resources will be reported separately.
Sample news headline corpus:
BLOG08-NEWS-0000001 News headline 1 here
BLOG08-NEWS-0000002 News headline 2 here
...
Sample query:
<top>
<num>1110</num>
<date>20080424</date>
</top>
...
The system response format is similar to that of the TREC Enterprise track Expert Search task. Each response includes, for every headline, a list of at most 10 supporting relevant blog posts that discuss the news story and cover its various aspects.
Sample system response:
1110 Q0 BLOG08-NEWS-0000002 1 10.0 runtag
SUPPORT BLOG08-20080426-000258281 1 1.5 runtag
SUPPORT BLOG08-20080426-000333190 2 1.3 runtag
1110 Q0 BLOG08-NEWS-0010056 2 9.8 runtag
...
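A response in this shape could be generated as sketched below. The sketch assumes that each headline record is followed by its SUPPORT records, one record per line; the function name format_top_stories and the nested-tuple input layout are hypothetical.

```python
def format_top_stories(topic, stories, runtag, max_posts=10):
    """Build response lines for one date query.

    stories is a list of (headline_id, score, posts) with the best headline
    first; posts is a list of (post_id, score) with the best post first.
    """
    lines = []
    for rank, (headline_id, score, posts) in enumerate(stories, start=1):
        lines.append(f"{topic} Q0 {headline_id} {rank} {score} {runtag}")
        # At most max_posts supporting posts are listed under each headline.
        for post_rank, (post_id, post_score) in enumerate(posts[:max_posts], start=1):
            lines.append(f"SUPPORT {post_id} {post_rank} {post_score} {runtag}")
    return lines


stories = [
    ("BLOG08-NEWS-0000002", 10.0, [
        ("BLOG08-20080426-000258281", 1.5),
        ("BLOG08-20080426-000333190", 1.3),
    ]),
    ("BLOG08-NEWS-0010056", 9.8, []),
]
response = format_top_stories(1110, stories, "runtag")
print("\n".join(response))
```

Truncating the post list inside the function guards against exceeding the limit of 10 supporting posts per headline.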
Participating groups may submit up to four runs for the top stories identification task. Each run consists of a ranking of 100 headlines, and their corresponding supporting relevant posts.
Assessment
Assessors will use multiple sources of evidence to answer three questions: (i) What are the top news stories for a given day? (ii) Which blog posts are relevant to a given news story? (iii) Which aspects of the news story do the blog posts discuss?
1. News Story Headline Assessment: Only headlines published on the query date d+-1 days can be judged relevant. Assessors will decide using various sources of evidence what the top stories were for a given day.
2. Blog Post Assessment: For each top news story, assessors will decide on the relevant blog posts discussing the news story.
3. Relevant Blog Post Diversity Assessment: For the relevant blog posts for a news story, assessors will group these posts into topics covering various aspects of the news story.
Evaluation
The number of test targets will be 50. Evaluation will use precision/recall measures based on correct story headlines, with MAP as the most important metric.
The second-level evaluation will examine how good each system is at identifying relevant blog posts for each headline. This evaluation will also be scored by MAP. However, as in the TREC 2009 Web track, we will also examine diversity: systems will be penalised for retrieving blog posts that add no information or perspectives beyond those already retrieved.
Timeline
9th April: Blogs08 collection ready for distribution
Mid-May: Search tasks defined
6th July: Blog distillation topics available
7th July: Top Stories ID task queries available
28th August: Top Stories ID task runs due
31st August: Blog distillation task runs due
5th September: News stories ID participant assessment phase starts
30th September: News stories ID participant assessment phase ends
History of Document
March 04, 2009: first draft
April 22, 2009: draft of faceted blog distillation task guidelines added
April 28, 2009: draft of top news story identification task guidelines added
June 22, 2009: timeline updated
August 7, 2009: updated run formats.