This site is an online appendix to the paper "A nearly four-year longitudinal study of search-engine poisoning" [1].

It contains links to, and brief descriptions of, the data files we used for the longitudinal analysis of search-engine manipulation through variants of the search-redirection attack. The paper discusses 4 datasets, but the data files are not partitioned as strictly. We therefore describe the data files and note along the way which datasets each of them covers.


In the paper we discuss the use of 2 query corpora that our measurements are based on, and here we provide them as text files containing one query per line:
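Because each corpus is a plain text file with one query per line, it can be read with a few lines of code. The following is a minimal sketch, not part of the original tooling: the function name `load_queries` is hypothetical, and UTF-8 encoding is an assumption that may need adjusting for the actual files.

```python
from pathlib import Path


def load_queries(path):
    """Return the non-empty queries from a one-query-per-line text file.

    Assumption: the file is UTF-8 encoded.
    """
    text = Path(path).read_text(encoding="utf-8")
    # Strip surrounding whitespace and skip blank lines.
    return [line.strip() for line in text.splitlines() if line.strip()]
```

Calling `load_queries` with the path of a downloaded corpus file returns its queries as a list of strings, in file order.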

Set 1

This dataset contains measurements from April 20th, 2010 until November 20th, 2011. It is available as a MySQL dump and covers datasets 1 and 2 in the paper. The dump defines two separate databases, representing T1 (db name: googlenew) and T2 (db name: pharma), that are almost identical in structure; the only differences are in the definition of some primary keys. Overall it contains the following tables and columns per table (note that some columns are not used and are therefore not discussed in this guide):
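
Before importing the dump, it can help to check which databases it defines. The sketch below scans the dump for `USE` statements, which multi-database dumps produced by mysqldump normally contain. That the dump provided here follows this layout is an assumption, and the function name `databases_in_dump` is hypothetical.

```python
import re

# Matches lines such as: USE `googlenew`;
USE_STATEMENT = re.compile(r"^\s*USE\s+`?([A-Za-z0-9_$]+)`?\s*;", re.IGNORECASE)


def databases_in_dump(lines):
    """Return database names in order of first appearance in USE statements."""
    seen = []
    for line in lines:
        match = USE_STATEMENT.match(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen
```

The function accepts any iterable of lines, so an open file object can be passed directly and the dump is never loaded into memory at once. If the assumption holds, the result for this dump should include `googlenew` and `pharma`.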

Set 2

This set represents period T3 in the paper (November 2011 to September 2013), and its format is notably different from that of Set 1. Specifically, throughout this period we provide 2 data files per day: one containing that day's search results, and one containing traditional redirection chains. From April 10th, 2012 until the end of our measurements we provide an additional data file per day, containing cookie-based redirection chains.
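
The rule above for which files exist on a given day can be sketched as follows. The function name and the labels it returns are hypothetical and do not correspond to actual file names. The sketch also assumes that April 10th, 2012 is itself the first day with cookie-based chains, and it does not check that the day falls within T3.

```python
from datetime import date

# First day with cookie-based redirection chains, per the description above.
COOKIE_CHAINS_START = date(2012, 4, 10)


def expected_file_kinds(day):
    """Return the kinds of data files expected for a given day of Set 2."""
    kinds = ["search_results", "redirection_chains"]
    if day >= COOKIE_CHAINS_START:
        # From this date on, a third file per day is provided.
        kinds.append("cookie_redirection_chains")
    return kinds
```

A check of this kind is useful when iterating over the daily files, since a missing cookie-based file before April 10th, 2012 is expected rather than an error.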

Set 3

This set represents period T4 in the paper and contains measurements related to the non-pharmaceutical queries in Q'. The format and structure are identical to those of Set 1 (i.e., it is a MySQL dump), with each of the 6 query categories and its observations stored in a separate database:

Citing the work

The data are made available under a Creative Commons Attribution-NonCommercial 4.0 International License. If you use any of these data, please cite the associated paper:

[1] Nektarios Leontiadis, Tyler Moore, and Nicolas Christin. A Nearly Four-Year Longitudinal Study of Search-Engine Poisoning. To appear in Proceedings of the 21st ACM Conference on Computer and Communications Security (CCS'14). Scottsdale, AZ. November 2014.
