This wiki page provides all details about the TREC Blog track campaign run in 2008. It can be seen as the archive of the TREC 2008 Blog track. Details about the new TREC Blog track 2009 campaign can be found in TREC-BLOG.
Further details about the 2008 edition can be found in the TREC 2008 Blog track 'Overview paper', which will appear in the Proceedings of TREC 2008 after it completes the WERB process. For information, an updated draft of the paper is available at
http://www.dcs.gla.ac.uk/~ounis/blogOverview2008.pdf
Search Tasks
In TREC 2008, two main tasks ran: the opinion finding task and the blog distillation task. Following from our conclusions from both TREC 2006 and 2007 Blog track campaigns, we have structured the TREC Blog track in 4 search tasks:
Baseline adhoc (blog post) retrieval task
Opinion finding (blog post) retrieval task
Polarised opinion finding (blog post) retrieval task
Blog finding distillation task
The Blog06 test collection was used in TREC 2008. (see
http://ir.dcs.gla.ac.uk/test_collections/blog06info.html )
Baseline Adhoc Retrieval Task
The baseline adhoc retrieval task involves locating blog posts that contain relevant information about a given topic target. This task will use the same topics as the other two blog post retrieval tasks; however, all opinion-finding retrieval techniques should be turned off.
This task corresponds to the topic-relevance baseline runs from the TREC 2007 overview paper.
Opinion Retrieval Task
The opinion retrieval task involves locating blog posts that express an opinion about a given target. It can be summarised as What do people think about <target>. It is a subjective task. The target can be a "traditional" named entity -- a name of a person, location, or organization -- but also a concept (such as a type of technology), a product name, or an event. Note that the topic of the post does not necessarily have to be the target, but an opinion about the target must be present in the post or one of the comments to the post.
For example, for the target "skype":
Excerpt from relevant, opinionated post (permalink
http://gigaom.com/2005/12/01/skype-20-eats-its-young/):
Skype 2.0 eats its young
The elaborate press release and WSJ review while impressive don’t help mask the fact that, Skype is short on new ground breaking ideas. Personalization via avatars and ring-tones... big new idea? Not really. Phil Wolff over on Skype Journal puts it nicely when he writes, "If you’ve been using Skype, the Beta version of Skype 2.0 for Windows won’t give you a new Wow! experience." ...
Excerpt from unopinionated post (permalink
http://www.slashphone.com/115/3152.html):
Skype Launches Skype 2.0 Features Skype Video
Skype released the beta version of Skype 2.0, the newest version of its software that allows anyone with an Internet connection to make free Internet calls. The software is designed for greater ease of use, integrated video calling, and ...
Evaluation will be by standard IR evaluation measures, such as MAP. For each submitted run, you should designate which topic-relevance baseline run it corresponds to. For training, participants could use the previous years' 100 queries, with their associated relevance judgements - available from
http://trec.nist.gov/data/blog06.html.
Polarity Task
The polarity task will run again this year, but in a different form. It is now a closer analogue to a user task in which, for a given query, a system retrieves both the positive and the negative opinionated documents/sentences and categorises them in the user display (Opinmind.com used to do this).
For this year's polarity task, for each topic, your system should retrieve and rank all the positive opinionated documents. Then for each topic, your system should retrieve and rank all the negative opinionated documents. For submission, these two runs should be concatenated together in one run file, separated by a blank line. Note that mixed opinionated documents, i.e. documents labelled [3] in the relevance assessment procedure, should not be listed in the positive (resp. negative) rankings of retrieved documents.
Evaluation will be by standard IR evaluation measures, such as MAP. For each submitted run, you should designate which topic-relevance baseline run it corresponds to. For training, participants could use the previous years' 100 queries, with their associated relevance judgements - available from
http://trec.nist.gov/data/blog06.html.
Blog Post Tasks Assessment
We will use the same assessment procedure defined in 2006, and used in 2007. The retrieval unit is documents from the permalink component of the Blog06 test collection. The content of a blog post is defined as the content of the post itself and the contents of all comments to the post: if the relevant content is in a comment, then the permalink is declared to be relevant. Note that blogs and non-blogs will be treated equally in this task.
The following scale will be used for the assessment:
*[-1] i.e. Not judged. The content of the post was not examined due to an offensive URL or header (such documents do exist in the collection due to spam). Although the content itself was not assessed, it is very likely, given the offensive header, that the post is irrelevant.
*[0] i.e. Not relevant. The post and its comments were examined, and do not contain any information about the target, or refer to it only in passing.
*[1] i.e. Relevant. The post or its comments contain information about the target, but do not express an opinion towards it. To be assessed as "Relevant", the information given about the target should be substantial enough to be included in a report compiled about this entity.
If the post or its comments are not only on target, but also contain an explicit expression of opinion or sentiment about the target, showing some personal attitude of the writer(s), then judge the document using the labels below.
*[2] i.e. Relevant, negative opinions. The post contains an explicit expression of opinion or sentiment about the target, showing some personal attitude of the writer(s), and the opinion expressed is explicitly negative about, or against, the target.
*[3] i.e. Relevant, mixed positive and negative opinions. Same as [2], but contains both positive and negative opinions.
*[4] i.e. Relevant, positive opinion. Same as [2], but the opinion expressed is explicitly positive about, or supporting, the target.
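As a rough illustration of how the scale above might be collapsed into binary judgements for each blog post task, the sketch below maps an assessment label to per-task relevance. The grouping is an assumption read off the label definitions (labels 1 to 4 are on target, 2 to 4 are opinionated, 4 is positive, 2 is negative, and mixed posts labelled 3 belong to neither polarity ranking); it is not an official qrels specification, and the function and task names are hypothetical.

```python
# Hypothetical helper: collapse the assessment scale into binary relevance per task.
# Assumption: the grouping follows the label definitions above, not an official tool.

def is_relevant(label: int, task: str) -> bool:
    """Return True if a post with this assessment label counts as relevant for `task`."""
    relevant_labels = {
        "baseline": {1, 2, 3, 4},  # any post with substantial information on the target
        "opinion": {2, 3, 4},      # on target and opinionated
        "positive": {4},           # positive half of the polarity task
        "negative": {2},           # negative half of the polarity task
    }
    return label in relevant_labels[task]
```

Under this reading, unjudged posts ([-1]) and non-relevant posts ([0]) are never counted as relevant for any task.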
Evaluation
The number of new test targets will be 50 (topics numbered 1001-1050); however, the topic set will also include the 100 queries from the TREC 2006 and TREC 2007 opinion finding tasks (i.e. 150 topics in total). For all blog post retrieval tasks, evaluation measures will be precision/recall based, where the actual "most important measure" will be MAP.
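For readers unfamiliar with MAP, the following is a minimal sketch of how average precision and its mean over topics can be computed. It is for illustration only: official scores come from trec_eval, whose conventions (for example, which topics are averaged over) may differ from this sketch. The function names and the dictionary-based inputs are assumptions.

```python
from typing import Dict, List, Set


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Average of the precision values at the ranks where relevant documents occur."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, docno in enumerate(ranking, start=1):
        if docno in relevant:
            hits += 1
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(
    runs: Dict[str, List[str]], qrels: Dict[str, Set[str]]
) -> float:
    """Mean of average precision over every topic present in `qrels` (an assumption)."""
    if not qrels:
        return 0.0
    total = sum(average_precision(runs.get(topic, []), rel) for topic, rel in qrels.items())
    return total / len(qrels)
```

For a ranking `["a", "b", "c"]` with relevant set `{"a", "c"}`, the precision values at the relevant ranks are 1/1 and 2/3, so the average precision is 5/6.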
Blog Post Retrieval Task Submissions
Baseline Adhoc Retrieval Task
You can submit up to 2 topic-relevance baselines using all 150 topics. One baseline must be an automatic, title-only run.
For this task, the usual trec_eval format will be used. The submission file should contain lines of the format
topic Q0 docno rank sim runtag
where
topic is the topic number
Q0 is a literal "Q0" (a historical field...)
docno is the permalink document number (BLOG06-200.....-...)
rank is the rank at which the system returned the document (1 .. n)
sim is the system's similarity score
runtag is the run's identifier string
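The line format above can be produced with a few lines of code. The sketch below is only an illustration: the function name, the four-decimal score formatting, and the placeholder document numbers in the usage note are assumptions, not part of the track guidelines.

```python
from typing import List, Tuple


def format_run(topic: int, scored_docs: List[Tuple[str, float]], runtag: str) -> List[str]:
    """Return 'topic Q0 docno rank sim runtag' lines, best-scoring document first."""
    # Sort by descending similarity so that rank 1 is the highest-scoring document.
    ordered = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
    return [
        f"{topic} Q0 {docno} {rank} {sim:.4f} {runtag}"
        for rank, (docno, sim) in enumerate(ordered, start=1)
    ]
```

For example, `format_run(1001, [("DOC-B", 0.5), ("DOC-A", 1.5)], "fooRun1")` yields the lines `1001 Q0 DOC-A 1 1.5000 fooRun1` and `1001 Q0 DOC-B 2 0.5000 fooRun1`, where `DOC-A` and `DOC-B` stand in for real permalink document numbers.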
Opinion Finding Task
For the opinion finding task, you are permitted to submit up to 4 runs which must be based on your two previously submitted baseline task runs. These will use all 150 topics. One of these 4 submitted runs must be an automatic, title-only run. If your system cannot be clearly broken down into baseline and opinion-finding features, then you can select "N/A" as the baseline run (though this will make it difficult for you to see the advantage of your opinion finding features).
In addition, TREC will provide 5 standard topic-relevance baseline runs previously submitted. You should attempt to apply your opinion-finding techniques to as many of these 5 baseline runs as possible. This will enable you to have different baselines, and to assess the impact of your opinion-finding technique independently of your initial baseline, hence determining the validity of your conclusions. Each of the provided 5 standard baseline runs could be seen as a black box engine, which has returned a list of relevant documents when all opinion finding features have been switched off.
You MAY submit up to 4 runs for EACH of the provided five standard baselines. Please make it clear in your run descriptions how to compare these runs to the runs based on your own baselines (e.g., "FooRun3's opinion reranker, using baseline2." - FooRun3 being a run that uses your own baseline).
Since you may submit up to 24 runs in total, please take care how you assign priorities for pooling. Please only give high priority to the runs you really want pooled (3 runs at most), and lower priority to others.
Run formats are as for the baseline adhoc retrieval task.
Polarity Task
For the polarity task, you are permitted to submit up to 2 runs, which must be based on your two previously submitted baseline runs. These will use all 150 topics. One of these 2 submitted runs must be an automatic, title-only run. If your system cannot be clearly broken down into baseline and polarity-detection features, then you can select "N/A" as the baseline run (though this will make it difficult for you to see the advantage of your polarity-detection features).
In addition, TREC will provide the same 5 standard baseline runs as for the opinion finding task. You should attempt to apply your polarity detection techniques to as many of these 5 baseline runs as possible. Again, each of the provided 5 standard baseline runs could be seen as a black box engine, which has returned a list of relevant documents when all opinion finding features have been switched off.
You MAY submit up to 2 runs for EACH of the provided five standard baselines. Please make it clear in your run descriptions how to compare these runs to the runs based on your own baselines (e.g., "FooRunPol3's Polarity reranker, using baseline2." - FooRunPol3 being a run that uses your own baseline).
Since you may submit up to 12 runs in total, please take care how you assign priorities for pooling. Please only give high priority to the runs you really want pooled (3 runs at most), and lower priority to others.
As described above, the run formats for the polarity task are as for the baseline adhoc and opinion finding tasks, except that each run file contains the output for both positive opinionated posts and negative opinionated posts, separated by a blank line.
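The layout of a polarity run file can be sketched as follows: the positive rankings for all topics come first, then a blank line, then the negative rankings for all topics. This is only an illustration under stated assumptions: the function names are hypothetical, both halves are assumed to share the same runtag, and the document numbers in the usage note are placeholders.

```python
from typing import Dict, List, Tuple

# topic number -> [(docno, sim), ...], already ordered best first
Ranking = Dict[int, List[Tuple[str, float]]]


def run_lines(ranking: Ranking, runtag: str) -> List[str]:
    """Format one ranking per topic as 'topic Q0 docno rank sim runtag' lines."""
    lines = []
    for topic in sorted(ranking):
        for rank, (docno, sim) in enumerate(ranking[topic], start=1):
            lines.append(f"{topic} Q0 {docno} {rank} {sim:.4f} {runtag}")
    return lines


def polarity_run_text(positive: Ranking, negative: Ranking, runtag: str) -> str:
    """Concatenate the positive and negative runs, separated by one blank line."""
    pos_block = "\n".join(run_lines(positive, runtag))
    neg_block = "\n".join(run_lines(negative, runtag))
    return pos_block + "\n\n" + neg_block + "\n"
```

Calling `polarity_run_text({1001: [("P1", 2.0)]}, {1001: [("N1", 1.0)]}, "fooPol1")` gives one positive line, a blank line, and one negative line for topic 1001.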
Blog Distillation (Feed Search) Task
Blog search users often wish to identify blogs about a given topic, which they can subscribe to and read on a regular basis. This user task is most often manifested in two scenarios:
Filtering: The user subscribes to a repeating search in their RSS reader.
Distillation: The user searches for blogs with a recurring central interest, and then adds these to their RSS reader.
In the TREC Blog track, we have been investigating the latter scenario: blog distillation. The blog distillation task can be summarised as Find me a blog with a principal, recurring interest in X. For a given area X, systems should suggest feeds that are principally devoted to X over the timespan of the feed, and that would be recommended to a user as interesting feeds about X to subscribe to (i.e. a user may be interested in adding them to their RSS reader).
This task is particularly interesting for the following reasons:
A similar (yet different) task has been investigated in the Enterprise track (Expert Search) in a smaller setting of around 1000 candidate experts. For blog distillation, the Blogs06 corpus contains around 100k blogs and offers a Web-like setting (with anchor text, linkage, spam, etc).
A Topic Distillation task was run in the Web track. In Topic Distillation, a relevant site was required to (i) be principally devoted to the topic, (ii) provide credible information on the topic, and (iii) not be part of a larger site also principally devoted to the topic.
While the definition of blog distillation as explained above is different, the idea is to provide the users with the key blogs about a given topic. Note that point (iii) is not applicable in a blog setting.
Operationality
As in TREC 2007, the blog distillation task will follow the community judging model, where the topics will be proposed and judged by the participating groups.
(23rd June): Each participating group will initially provide 6 or 7 topics along with some relevant feeds.
(After submission): Submitted blogs will be pooled, and the groups which proposed topics will evaluate them.
The proposed assessment guidelines are at TREC-BLOG/BlogDistillationAssessmentGuidelines.
Topic Development Phase
We need each participating group to create 6 topics for this task. Your aim is to identify some topics 'X', and a few (e.g. 2 or 3) relevant feeds (identified by their feedno). Your topic area should be specific enough that there are not likely to be hundreds or thousands of relevant feeds (so 'cars' is probably too vague a topic). Once you have decided on your topics, send them in an email with two or three relevant feeds to ian.soboroff (AT SYMBOL) nist.gov. PLEASE DO NOT POST THEM TO THE MAILING LIST, OR TO THE ORGANISERS LIST.
Format:
<top>
<title> a short query-ish title </title>
<desc> Description: The desc is a sentence-length description of what you are looking for, and should include the title words. </desc>
<narr> Narrative: The narr is a paragraph-length description of what you are looking for. Use it to give details on what feeds or blogs are relevant and what feeds or blogs are not. If there are "gray areas", state them here. </narr>
<feeds> feedno feedno feedno </feeds>
<comments> Anything else you want to say. </comments>
</top>
When choosing topics, please avoid topics that are too general and would lead to very many relevant feeds being identified. For example, the topic 'Linux' is too general for this collection. A better topic might be 'linux filesystems'. Also avoid topics with temporal aspects: for example, Christmas is an external event and the blogosphere will likely react to it, but very few blogs will have a recurring interest in that event.
Example:
<top>
<title> solaris </title>
<desc> Description: Blogs describing experiences administrating the Solaris operating system, or its new features or developments. </desc>
<narr> Narrative: Relevant blogs will post regularly about administrating or using the Solaris operating system from Sun, its latest features or developments. Blogs with posts about Solaris the movie are not relevant, nor are blogs which only have a few posts about Solaris. </narr>
<feeds> BLOG06-feed-053948 BLOG06-feed-078402 BLOG06-feed-018020 </feeds>
<comments> None. </comments>
</top>
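Groups that want to check their topics before emailing them might read the fields back with a small script. The sketch below is one possible approach using regular expressions; the function name and the choice to return empty strings for missing fields are assumptions, and it is not an official validation tool.

```python
import re
from typing import Dict

# Field names taken from the topic format described above.
FIELDS = ("title", "desc", "narr", "feeds", "comments")


def parse_topic(text: str) -> Dict[str, str]:
    """Extract each field of a single <top> block, with whitespace normalised."""
    topic = {}
    for field in FIELDS:
        match = re.search(rf"<{field}>(.*?)</{field}>", text, flags=re.DOTALL)
        # A missing field becomes an empty string so that callers can detect it.
        topic[field] = " ".join(match.group(1).split()) if match else ""
    return topic
```

Splitting `parse_topic(text)["feeds"]` on whitespace then gives the list of proposed relevant feeds, which can be checked to contain the suggested two or three entries.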
All topics should be submitted to Ian by end of Monday 23rd June.
Topic Development System
To help the participating groups create their blog distillation topics, we have provided a standard search system for *documents* on the Blog06 collection. It also displays the feed for each document, and you can view all the documents for a given feed. You can access it from:
http://ir.dcs.gla.ac.uk/terrier/search_blogs06/
The engine will be taken down once the topic development phase ends.
If you have your own search system for the Blogs06 collection (say, from last year's track), feel free to use that.
You don't need to state all the relevant feeds for a topic, as there will be a separate assessment phase in September, after all runs have been submitted.
Evaluation
Participants can submit up to 4 runs to the blog distillation task. As usual, at least one automatic, title-only run is required. Each submitted run should have feeds ranked by their likelihood of having a principal (recurring) interest in the topic. For each submitted run, we ask that up to 100 feeds are returned per topic.
The submitted runs to the blog distillation task will follow the usual trec_eval format, i.e.
topic Q0 feedno rank sim runtag
where
topic is the topic number
Q0 is a literal "Q0" (a historical field...)
feedno is the feed number of the blog (BLOG06-feed-......)
rank is the rank at which the system returned the document (1 .. n)
sim is the system's similarity score
runtag is the run's identifier string
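Before submitting, a run file could be sanity-checked against the format and the 100-feeds-per-topic limit described above. The following sketch is only an illustration: the function name is hypothetical, and the six-digit feed number pattern is an assumption based on the example feed numbers on this page.

```python
import re
from collections import Counter
from typing import List

# Assumption: feed numbers have six digits, as in the examples on this page.
FEEDNO = re.compile(r"^BLOG06-feed-\d{6}$")
MAX_FEEDS_PER_TOPIC = 100


def check_distillation_run(lines: List[str]) -> List[str]:
    """Return a list of human-readable problems found in the run lines."""
    problems = []
    per_topic = Counter()
    for n, line in enumerate(lines, start=1):
        fields = line.split()
        if len(fields) != 6:
            problems.append(f"line {n}: expected 6 fields, got {len(fields)}")
            continue
        topic, q0, feedno, _rank, _sim, _runtag = fields
        if q0 != "Q0":
            problems.append(f"line {n}: second field must be Q0")
        if not FEEDNO.match(feedno):
            problems.append(f"line {n}: unexpected feedno {feedno}")
        per_topic[topic] += 1
    for topic, count in per_topic.items():
        if count > MAX_FEEDS_PER_TOPIC:
            problems.append(f"topic {topic}: {count} feeds, limit is {MAX_FEEDS_PER_TOPIC}")
    return problems
```

An empty result means no problems were detected by these particular checks; it does not guarantee that the run will be accepted.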
Evaluation is likely to be by Mean Average Precision. In order to identify the key blogs, there may be multiple levels of relevance during assessments (i.e. highly relevant, key, relevant, non-relevant), allowing measures such as NDCG to be used.
Track Timeline
Early May: All Blog post retrieval tasks - topics released
10th June: Baseline adhoc retrieval task - runs due
12th June: Blog Distillation - topic development phase starts
23rd June: Blog Distillation - topics due
2nd July: Opinion Finding retrieval task - runs due
2nd July: Polarity task - runs due
3rd July: Blog Distillation - topics released
11th August: Blog Distillation task - runs due
5th September: Blog Distillation Task - Participants Judging Phase starts
30th September: Blog Distillation Task - Participants Judging Phase ends