Search engine manipulation datasets

This site is an online appendix to the paper "A nearly four-year longitudinal study of search-engine poisoning" [1].

Set 1

This dataset

timestamp:
- id → unique identifier
- time → timestamp of the observation (shared across all observations in the same day)
query:
- id → unique identifier
- query → the Google query URL (contains the query as a URL parameter)
query_time:
- query_id → id linking to the query table
- time_id → id linking to the timestamp table
result:
- id → unique identifier
- url → the URL of a query's result
result_time:
- result_id → id linking to the result table
- time_id → id linking to the timestamp table
- ranking → position of the result at the given timestamp (day)
query_result:
- query_id → id linking to the query table
- result_id → id linking to the result table
- time_id → id linking to the timestamp table
redirect_data:
- id → unique identifier
- url → the URL of a redirection, i.e. the location to which either a result or an intermediate point in the redirection chain (e.g. a traffic broker) redirects
result_redirect:
- result_id → id linking to the result table. The ids of all search-redirecting results are found in this table
- redirect_id → id linking to the redirect_data table
redirect_time:
- redirect_data_id → id linking to the redirect_data table
- time_id → id linking to the timestamp table
redirect_redirect:
- from_id → id linking to the redirect_data table designating the origin of an intermediate redirection
- to_id → id linking to the redirect_data table designating the destination of an intermediate redirection
redirect_host:
- id → unique identifier
- host → the FQDN of a URL appearing in the redirect_data table
redirect_host_data:
- host_id → id linking to the redirect_host table (containing FQDNs)
- data_id → id linking to the redirect_data table (the associated URL)
ip_address:
- id → unique identifier
- ip → the IP address of a redirecting host identified at observation time
ip_time:
- ip_id → id linking to the ip_address table
- time_id → id linking to the timestamp table
host_ip:
- host_id → id linking to the redirect_host table (containing FQDNs)
- ip_id → id linking to the ip_address table (the associated IP address)
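
The link tables above are meant to be joined through their id columns. The following is a minimal sketch of one such join, assuming the data has been loaded into SQLite; the column types, the toy rows, and the function names `build_demo` and `redirecting_results` are all hypothetical and not part of the dataset. It lists, per day, the results that redirected, together with their ranking and the URL they redirected to.

```python
import sqlite3

# Hypothetical miniature of six of the tables described above.
# Column types and the use of SQLite are assumptions for illustration.
SCHEMA = """
CREATE TABLE "timestamp"     (id INTEGER PRIMARY KEY, time TEXT);
CREATE TABLE result          (id INTEGER PRIMARY KEY, url TEXT);
CREATE TABLE result_time     (result_id INTEGER, time_id INTEGER, ranking INTEGER);
CREATE TABLE redirect_data   (id INTEGER PRIMARY KEY, url TEXT);
CREATE TABLE result_redirect (result_id INTEGER, redirect_id INTEGER);
CREATE TABLE redirect_time   (redirect_data_id INTEGER, time_id INTEGER);
"""

# Redirecting results, restricted to days on which both the result and
# its redirection target were observed.
QUERY = """
SELECT t.time, rt.ranking, r.url, rd.url
FROM result_redirect AS rr
JOIN result        AS r   ON r.id = rr.result_id
JOIN redirect_data AS rd  ON rd.id = rr.redirect_id
JOIN result_time   AS rt  ON rt.result_id = r.id
JOIN redirect_time AS rdt ON rdt.redirect_data_id = rd.id
                         AND rdt.time_id = rt.time_id
JOIN "timestamp"   AS t   ON t.id = rt.time_id
ORDER BY t.time, rt.ranking
"""


def build_demo():
    """Create an in-memory database filled with made-up toy rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany('INSERT INTO "timestamp" VALUES (?, ?)',
                     [(1, "2011-01-01"), (2, "2011-01-02")])
    conn.executemany("INSERT INTO result VALUES (?, ?)",
                     [(1, "http://compromised.example/page"),
                      (2, "http://benign.example/")])
    conn.executemany("INSERT INTO result_time VALUES (?, ?, ?)",
                     [(1, 1, 3), (2, 1, 1), (1, 2, 5)])
    conn.execute("INSERT INTO redirect_data VALUES (10, 'http://broker.example/in')")
    conn.execute("INSERT INTO result_redirect VALUES (1, 10)")
    conn.execute("INSERT INTO redirect_time VALUES (10, 1)")  # seen on day 1 only
    return conn


def redirecting_results(conn):
    """Return (day, ranking, result_url, redirect_url) rows."""
    return conn.execute(QUERY).fetchall()


if __name__ == "__main__":
    for row in redirecting_results(build_demo()):
        print(row)
```

Joining on redirect_time as well as result_time is an interpretive choice: without it, a result would be reported as redirecting on every day it appeared in the search results, not only on the days the redirection was observed.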

Set 2

This set

Set 1

Search results

Name in the format: urls-yyyy-MM-dd.tmp
Each line: <query,result>
The order in which results appear for a given query implicitly provides their rankings
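
Because rankings are implicit in line order, they can be recovered by counting results per query as the file is read. The sketch below assumes that each line is plain text of the form `query,result`, that the first comma separates the two fields, and that queries themselves contain no commas; none of this is confirmed by the description above, and the function name `rankings_from_lines` and the sample lines are hypothetical.

```python
from collections import defaultdict


def rankings_from_lines(lines):
    """Map each query to a list of (rank, result_url), rank starting at 1."""
    ordered = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Assumption: the first comma separates the query from the result URL.
        query, sep, result = line.partition(",")
        if not sep:
            continue  # not a query,result line
        ordered[query].append(result)
    return {
        query: [(rank, url) for rank, url in enumerate(urls, start=1)]
        for query, urls in ordered.items()
    }


if __name__ == "__main__":
    # Made-up sample lines, not taken from the dataset.
    sample = [
        "cheap watches,http://a.example/",
        "cheap watches,http://b.example/page?x=1,2",
        "free ebooks,http://c.example/",
    ]
    print(rankings_from_lines(sample))
```

To process a real file, the same function can be called on an open file object, for example `rankings_from_lines(open(path, encoding="utf-8", errors="replace"))`.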

Traditional redirection chains

Name in the format: chains-yyyy-MM-dd.tmp
Each line: <timestamp, query, result_url, [redirection_url, ip address]+> or <timestamp, query, result_url, HTTP_error_code>
These files contain both traditional and cookie-based redirection chains. For the latter, the chain is not fully expanded, and the endpoint has the same FQDN as the original redirecting search result.
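
The two line shapes, and the same-FQDN property of unexpanded cookie-based chains, can be sketched as follows. This is an illustration under assumptions: fields are taken to be comma-separated with no commas inside any field, and a line with exactly one field after the result URL is treated as an HTTP error code. The names `Chain`, `parse_chain_line`, and `looks_unexpanded` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple
from urllib.parse import urlsplit


@dataclass
class Chain:
    timestamp: str
    query: str
    result_url: str
    hops: List[Tuple[str, str]] = field(default_factory=list)  # (url, ip)
    http_error: Optional[str] = None


def parse_chain_line(line):
    """Parse one line of a chains file (assumed comma-separated)."""
    fields = [f.strip() for f in line.strip().split(",")]
    if len(fields) < 4:
        raise ValueError("expected at least four fields")
    timestamp, query, result_url, *rest = fields
    if len(rest) == 1:
        # Assumption: a single trailing field is the HTTP error code.
        return Chain(timestamp, query, result_url, http_error=rest[0])
    if len(rest) % 2 != 0:
        raise ValueError("redirection URLs and IP addresses must come in pairs")
    hops = list(zip(rest[0::2], rest[1::2]))
    return Chain(timestamp, query, result_url, hops=hops)


def looks_unexpanded(chain):
    """True when the last hop has the same host as the search result."""
    if not chain.hops:
        return False
    start = urlsplit(chain.result_url).hostname
    end = urlsplit(chain.hops[-1][0]).hostname
    return start is not None and start == end


if __name__ == "__main__":
    # Made-up line; the timestamp format in the real files may differ.
    demo = "t0, cheap watches, http://a.example/p, http://a.example/p?c=1, 192.0.2.1"
    chain = parse_chain_line(demo)
    print(chain, looks_unexpanded(chain))
```

Comparing hostnames is only a heuristic for spotting candidate cookie-based chains: a traditional chain that happens to end on the same host would be flagged as well.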

Cookie-based redirection chains

Name in the format: chains-cookie-yyyy-MM-dd.tmp
Each line: <timestamp, result_url, [redirection_url, ip address]+>
These files contain only stateful (cookie-based) redirection chains and appear after April 9th, 2012.
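
All three file types encode the observation day in their names, so a directory of files can be sorted by kind and date before parsing. A minimal sketch, assuming that `yyyy-MM-dd` denotes a four-digit year, two-digit month, and two-digit day; the function name `classify` is hypothetical.

```python
import re
from datetime import datetime

# "chains-cookie" is listed before "chains" so the longer prefix wins.
FILENAME = re.compile(r"^(urls|chains-cookie|chains)-(\d{4}-\d{2}-\d{2})\.tmp$")


def classify(filename):
    """Return (kind, date) for a dataset file name, or None if it does not match."""
    match = FILENAME.match(filename)
    if match is None:
        return None
    try:
        day = datetime.strptime(match.group(2), "%Y-%m-%d").date()
    except ValueError:
        return None  # well-formed pattern but not a real calendar date
    return match.group(1), day


if __name__ == "__main__":
    for name in ["urls-2011-05-01.tmp", "chains-cookie-2012-04-10.tmp", "notes.txt"]:
        print(name, classify(name))
```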

Set 3

This set

set 1

Antivirus software → antivirus

Software (general) → applications_good

Counterfeit software → applications_bad

E-books → books

Gambling → gambling

Luxury watches → watches

Citing the work

Attribution-NonCommercial 4.0 International License
