Blogs06

Craig Macdonald edited this page Dec 17, 2019 · 2 revisions

Blogs06 is a collection of blog posts and feeds used by the TREC Blog track from 2006 to 2008. It can be obtained from the [http://ir.dcs.gla.ac.uk/test_collections/ University of Glasgow]. Two tasks have been defined on the Blogs06 collection: opinion finding and blog distillation. The topics and qrels can be obtained from:

Terrier can index the permalinks (blog posts only) of the Blogs06 collection with very few configuration changes:

TrecDocTags.doctag=DOC
TrecDocTags.idtag=DOCNO
#skip some more tags that carry no content
TrecDocTags.skip=DOCHDR,FEEDNO,FEEDURL,BLOGHPNO,BLOGHPURL,PERMALINK,DATE_XML

indexing.singlepass.max.postings.memory=500000000
indexer.meta.forward.keys=docno
indexer.meta.forward.keylens=31
indexer.meta.reverse.keys=docno

If you want URLs stored in your index, set the following properties:

trec.collection.class=TRECWebCollection
indexer.meta.forward.keys=docno,url
indexer.meta.forward.keylens=31,256
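
Each entry in indexer.meta.forward.keys is paired by position with an entry in indexer.meta.forward.keylens, so the two lists need the same number of entries. The following is a minimal Python sketch for checking a properties fragment before indexing. It is not part of Terrier: the helper names `parse_properties` and `meta_key_lengths` are hypothetical, and the parser only handles simple `key=value` lines and `#` comments.

```python
def parse_properties(text):
    """Parse simple key=value lines; later definitions override earlier ones."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Ignore blank lines and '#' comments.
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def meta_key_lengths(props):
    """Pair each forward meta key with its configured maximum length."""
    keys = props["indexer.meta.forward.keys"].split(",")
    lens = [int(n) for n in props["indexer.meta.forward.keylens"].split(",")]
    if len(keys) != len(lens):
        raise ValueError(
            f"{len(keys)} meta keys but {len(lens)} key lengths configured"
        )
    return dict(zip(keys, lens))


# The URL-enabled settings from this page.
FRAGMENT = """
trec.collection.class=TRECWebCollection
indexer.meta.forward.keys=docno,url
indexer.meta.forward.keylens=31,256
"""

if __name__ == "__main__":
    print(meta_key_lengths(parse_properties(FRAGMENT)))
```

A mismatch, such as adding url to the keys without adding a second length, raises a ValueError instead of surfacing later during indexing.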

See the Terrier documentation on Web-based Terrier to learn how to build a Web search engine for this collection.