This page provides spam scores for the ClueWeb12 (cw12) dataset using spam models developed by Cormack, Smucker, and Clarke for ClueWeb09.
The file waterloo-spam-cw12-encoded.tar contains one gzip file for each of the cw12 directories. Each file was encoded using compress-spam12.c before being gzip'd. After gunzipping, each file must be decoded using decompress-spam12.c. To fetch and decode all of the files, run the following commands (assuming a Linux-like setup and a bash shell):
    wget http://www.mansci.uwaterloo.ca/~msmucker/cw12spam/waterloo-spam-cw12-encoded.tar
    wget http://www.mansci.uwaterloo.ca/~msmucker/cw12spam/decompress-spam12.c
    gcc -o decompress-spam12 decompress-spam12.c
    mkdir waterloo-spam-cw12-decoded
    tar -xvf waterloo-spam-cw12-encoded.tar
    cd waterloo-spam-cw12-encoded
    for f in *.spamPct.gz ; do cat $f | gunzip -c | ../decompress-spam12 | gzip -c > ../waterloo-spam-cw12-decoded/$f ; done

The tar is 654 MB. Decoded, but still gzip'd, the files are 2.6 GB.
The format of each decoded file is:
    percentile-score clueweb-docid

where the percentile score indicates the percentage of the documents in the corpus that are "spammier" as per the "fusion" spam score. The spammiest documents have a score of 0, and the least spammy have a score of 99. We have not extensively tested the spam scores on cw12, but they appear reasonable.
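Because higher scores mean less spammy documents, one common use of these files is to keep only the documents at or above some percentile. The sketch below is one possible way to do that with awk. The function name filter_spam and the threshold of 70 are illustrative choices, not part of the distributed files, and the sketch assumes the two fields on each line are separated by whitespace.

```shell
# Hypothetical helper: print the docids whose percentile score is at least MIN.
# Reads decoded "percentile-score clueweb-docid" lines on standard input.
# Usage: filter_spam MIN
filter_spam() {
    awk -v min="$1" '$1 + 0 >= min + 0 { print $2 }'
}

# Example use on one decoded file (the file name is a placeholder):
#   gunzip -c waterloo-spam-cw12-decoded/SOME-DIRECTORY.spamPct.gz | filter_spam 70
```

A suitable threshold depends on the task, so it is worth trying several values before settling on one.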
The docids are not listed in any particular order in each file.
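Since the docids are unordered, looking up scores for a particular set of documents means scanning each file rather than seeking to a known position. The following is a minimal sketch of such a scan. The function name scores_for_docids and the idea of keeping the wanted docids in a separate file, one per line, are assumptions made for illustration.

```shell
# Hypothetical helper: print "docid score" for every wanted docid that is found.
# $1 is a file listing the wanted docids, one per line.
# Decoded "percentile-score clueweb-docid" lines are read from standard input.
scores_for_docids() {
    awk 'NR == FNR { want[$1] = 1; next }
         ($2 in want) { print $2, $1 }' "$1" -
}

# Example use (both file names are placeholders):
#   gunzip -c waterloo-spam-cw12-decoded/SOME-DIRECTORY.spamPct.gz | scores_for_docids wanted-docids.txt
```

The wanted docids are held in memory while the score file is streamed, so the output follows the order of the score file, not the order of the wanted list.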
The fusion spam score is the average of the scores produced by the three models described in "Efficient and Effective Spam Filtering and Re-ranking for Large Web Datasets", with the modification that the "Britney" model has been trained on a data set that is very similar to, but slightly different from, the one used to train the model for ClueWeb09.
Questions are best directed to Mark Smucker.