Introduction
This site is an online appendix to the paper "A Nearly Four-Year Longitudinal Study of Search-Engine Poisoning" [1].
It contains links to and a brief description of the data files we used for the longitudinal analysis of search engine manipulation through variants of the search-redirection attack. While in the paper we talk about 4 datasets, their representation in the data files is not as strictly partitioned. Therefore, we describe the data files and we make the association between them and the datasets along the way.
Queries
In the paper we discuss the use of 2 query corpora that our measurements are based on, and here we provide them as text files containing one query per line:
- Q → 218 terms on prescription drugs. For each query we also provide its classification as benign, bad or gray
- Q' → 100 terms for each of the following 6 categories: antivirus, software, counterfeit software, e-books, gambling, and luxury watches
Set 1
This set contains measurements from April 20th, 2010 until November 20th, 2011. It is available as a MySQL dump and covers datasets 1 and 2 in the paper. The dump defines two separate databases, representing T1 (db name: googlenew) and T2 (db name: pharma), that are almost identical in structure; the only differences are in the definitions of some primary keys. Together they contain the following tables and columns (some columns are unused and are therefore not discussed in this guide):
- timestamp:
- id → unique identifier
- time → timestamp of the observation (shared across all observations in the same day)
- query:
- id → unique identifier
- query → the Google query URL (contains the query as a URL parameter)
- query_time:
- query_id → id linking to the query table
- time_id → id linking to the timestamp table
- result:
- id → unique identifier
- url → the URL of a query's result
- result_time:
- result_id → id linking to the result table
- time_id → id linking to the timestamp table
- ranking → position of the result at the given timestamp (day)
- query_result:
- query_id → id linking to the query table
- result_id → id linking to the result table
- time_id → id linking to the timestamp table
- redirect_data:
- id → unique identifier
- url → the URL of a redirection. This is the location to which either a result or an intermediate point in the redirection chain (e.g. a traffic broker) redirects
- result_redirect:
- result_id → id linking to the result table. The ids of all search-redirecting results are found in this table
- redirect_id → id linking to the redirect_data table
- redirect_time:
- redirect_data_id → id linking to the redirect_data table
- time_id → id linking to the timestamp table
- redirect_redirect:
- from_id → id linking to the redirect_data table designating the origin of an intermediate redirection
- to_id → id linking to the redirect_data table designating the destination of an intermediate redirection
- redirect_host:
- id → unique identifier
- host → the FQDN of a URL appearing in the redirect_data table
- redirect_host_data:
- host_id → id linking to the redirect_host table (containing FQDNs)
- data_id → id linking to the redirect_data table (the associated URL)
- ip_address:
- id → unique identifier
- ip → the IP address of a redirecting host identified at observation time
- ip_time:
- ip_id → id linking to the ip_address table
- time_id → id linking to the timestamp table
- host_ip:
- host_id → id linking to the redirect_host table (containing FQDNs)
- ip_id → id linking to the ip_address table (the associated IP address)
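As an illustration of how these tables relate, the following sketch lists one day's results per query with their rankings, and flags the results that appear in result_redirect. It is a minimal example under stated assumptions, not the analysis code used for the paper: an in-memory SQLite database stands in for the MySQL dump, only the columns described above are modeled, the column types are guesses, and all rows are made up.

```python
import sqlite3

# In-memory SQLite stand-in for one database of the MySQL dump.
# Assumption: column types are guesses; only documented columns are modeled.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE "timestamp" (id INTEGER PRIMARY KEY, time TEXT);
CREATE TABLE "query" (id INTEGER PRIMARY KEY, "query" TEXT);
CREATE TABLE result (id INTEGER PRIMARY KEY, url TEXT);
CREATE TABLE query_result (query_id INTEGER, result_id INTEGER, time_id INTEGER);
CREATE TABLE result_time (result_id INTEGER, time_id INTEGER, ranking INTEGER);
CREATE TABLE redirect_data (id INTEGER PRIMARY KEY, url TEXT);
CREATE TABLE result_redirect (result_id INTEGER, redirect_id INTEGER);

-- Made-up rows, for illustration only.
INSERT INTO "timestamp" VALUES (1, '2010-04-20');
INSERT INTO "query" VALUES (1, 'http://www.google.com/search?q=example+query');
INSERT INTO result VALUES (1, 'http://a.example/page'), (2, 'http://b.example/page');
INSERT INTO query_result VALUES (1, 1, 1), (1, 2, 1);
INSERT INTO result_time VALUES (1, 1, 1), (2, 1, 2);
INSERT INTO redirect_data VALUES (1, 'http://shop.example/');
INSERT INTO result_redirect VALUES (2, 1);
""")

# Join a day's observations: query -> result -> ranking, plus a 0/1 flag
# telling whether the result is listed in result_redirect.
SQL = """
SELECT t.time, q."query" AS query_url, rt.ranking, r.url,
       EXISTS (SELECT 1 FROM result_redirect rr
               WHERE rr.result_id = r.id) AS redirects
FROM query_result qr
JOIN "query" q      ON q.id = qr.query_id
JOIN result r       ON r.id = qr.result_id
JOIN "timestamp" t  ON t.id = qr.time_id
JOIN result_time rt ON rt.result_id = qr.result_id
                   AND rt.time_id = qr.time_id
WHERE t.time = ?
ORDER BY q.id, rt.ranking
"""

rows = conn.execute(SQL, ("2010-04-20",)).fetchall()
for row in rows:
    print(row)
```

The same joins should carry over to MySQL with little change, but the actual column types and the format of the time column should be checked against the dump first.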
Set 2
This set represents period T3 in the paper (November 2011 to September 2013), and its format differs notably from that of Set 1. Throughout this period we provide 2 data files per day: one containing that day's search results, and one containing traditional redirection chains. From April 10th, 2012 until the end of our measurements we provide an additional data file per day, containing cookie-based redirection chains.
- Search results
- Name in the format: urls-yyyy-MM-dd.tmp
- Each line: <query,result>
- The order in which the results for a given query appear implicitly encodes their rankings
- Traditional redirection chains
- Name in the format: chains-yyyy-MM-dd.tmp
- Each line: <timestamp, query, result_url, [redirection_url, ip address]+> or <timestamp, query, result_url, HTTP_error_code>
- These files contain both traditional and cookie-based redirection chains. For the latter, the chain is not fully expanded and the endpoint has the same FQDN as the original, redirecting search result.
- Cookie-based redirection chains
- Name in the format: chains-cookie-yyyy-MM-dd.tmp
- Each line: <timestamp, result_url, [redirection_url, ip address]+>
- These files contain only stateful redirection chains, and appear after April 9th, 2012
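The file layouts above can be read with a few lines of code. The sketch below is illustrative only and rests on assumptions that the description does not settle: fields are taken to be comma-separated with no commas inside queries or URLs, the angle brackets are treated as notation rather than literal characters, and the timestamp format in the example is invented. The function names are hypothetical.

```python
from datetime import date

def daily_file_names(day):
    """Return the data file names expected for a given day of Set 2."""
    stamp = day.strftime("%Y-%m-%d")
    names = {"results": f"urls-{stamp}.tmp", "chains": f"chains-{stamp}.tmp"}
    # Cookie-based chain files exist from April 10th, 2012 onward.
    if day >= date(2012, 4, 10):
        names["cookie_chains"] = f"chains-cookie-{stamp}.tmp"
    return names

def rank_results(lines):
    """Derive 1-based rankings per query from the order of <query,result> lines."""
    per_query = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Assumption: the first comma separates the query from the result URL.
        query, result = line.split(",", 1)
        per_query.setdefault(query, []).append(result.strip())
    return {q: list(enumerate(urls, start=1)) for q, urls in per_query.items()}

def parse_chain_line(line, has_query=True, sep=","):
    """Split one chain line into its fields.

    has_query=True  -> traditional chain file (timestamp, query, result_url, ...)
    has_query=False -> cookie-based chain file (timestamp, result_url, ...)
    Assumption: `sep` does not occur inside any field.
    """
    fields = [f.strip() for f in line.split(sep)]
    head = 3 if has_query else 2
    record = {"timestamp": fields[0], "result_url": fields[head - 1],
              "error": None, "hops": []}
    if has_query:
        record["query"] = fields[1]
    rest = fields[head:]
    if len(rest) == 1:
        # A single trailing field is the HTTP error code.
        record["error"] = rest[0]
    else:
        # Otherwise the tail is a sequence of (redirection_url, ip address) pairs.
        record["hops"] = list(zip(rest[0::2], rest[1::2]))
    return record

example = parse_chain_line(
    "2012-04-10 00:00:00, example query, http://a.example/, "
    "http://b.example/, 192.0.2.1, http://c.example/, 192.0.2.2"
)
print(example["hops"])
```

If queries or URLs in the real files do contain commas, a stricter parser (for instance one that anchors on the `http` prefix of each URL) would be needed in place of the plain split.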
Set 3
This set represents period T4 in the paper and contains measurements related to the non-pharmaceutical queries in Q'. The format and structure are identical to those of Set 1 (i.e. it is a MySQL dump), with the queries and observations for each of the 6 query categories stored in a separate database:
- Antivirus software → antivirus
- Software (general) → applications_good
- Counterfeit software → applications_bad
- E-books → books
- Gambling → gambling
- Luxury watches → watches
Citing the work
The data are made available under a Creative Commons Attribution-NonCommercial 4.0 International License. If you use any part of this dataset, please cite the associated paper:
[1] Nektarios Leontiadis, Tyler Moore, and Nicolas Christin. A Nearly Four-Year Longitudinal Study of Search-Engine Poisoning. To appear in Proceedings of the 21st ACM Conference on Computer and Communications Security (CCS'14). Scottsdale, AZ. November 2014.