Traveling the Silk Road: Datasets

Nicolas Christin
Carnegie Mellon University

Introduction

This site is an online appendix to the paper Traveling the Silk Road: A measurement analysis of a large anonymous online marketplace [1].

In an effort to make my research results reproducible, I make available here: 1) the data files used to generate the plots in the paper [1], 2) a subset of the databases I used in my analysis, and 3) some of my analysis code. This code is sparsely documented, unmaintained, has not been thoroughly tested with the "sanitized" databases provided here, and may require some more work to be useful.

All data was, at some point or another, publicly available on the Silk Road website. However, I took a conservative approach and chose not to release any textual information (item name, description, or feedback text), because I could not manually inspect each entry to ensure that no potentially private information (e.g., URLs, email addresses) would be inadvertently released. I also anonymized all handles (user id, item id); these handles are already anonymous on Silk Road, but I did not want current items/users to be directly linkable to this dataset.

Despite these caveats, all results from the paper should be reproducible with this publicly released dataset, except for the numbers on early finalization and stealth listings.

If you have questions or comments, please contact me. Due to the overall volume of email I have to handle (of which this research represents a very, very small subset), I regrettably cannot guarantee timely responses, and I certainly will not be able to answer requests regarding SQL, UNIX, Perl or R syntax. There are a number of very good tutorials for all these resources available online.

Figure data

Figure data is provided as plain-text files. I recommend gnuplot to plot these files.

SQL data format

All data is in SQL format and can be loaded into MySQL. To use a given database, e.g., master.sql.gz, run the following commands:

gzip -d -c master.sql.gz > master.sql
mysqladmin -p -u USERNAME create master
mysql -p -u USERNAME master < master.sql

where USERNAME is the MySQL username you wish to associate with this database. The mysqladmin step creates the empty master database that the import command connects to; skip it if that database already exists. You can then operate on the master database from the mysql command prompt, or through scripts.

Each database contains three tables:

An item table which contains all items processed. This table has the following fields:
- item_id: A varchar(10) field containing a hash of the corresponding Silk Road item handle. Note that this field is not the actual Silk Road item handle, but a hashed version of it. You cannot directly map these handles to item handles currently used on the site.
- seller: A varchar(10) field containing a hash of the Silk Road handle of the seller advertising the item. Note that this field is not the actual Silk Road seller handle, but a hashed version of it. You cannot directly map these handles to seller handles currently in use on the site.
- ships_to: A varchar(50) field containing the advertised acceptable shipping destinations for the item.
- ships_from: A varchar(50) field containing the advertised origin of the item.
- category: An unsigned integer denoting the item category. A mapping between category identifiers and actual category names can be found in categories.csv (see Downloads).
- first_seen: An unsigned integer representing the first time the crawler found the item on the site, expressed in UNIX epoch time. Note that first_seen is bounded below by the beginning of the measurement interval.
- last_seen: An unsigned integer representing the last time the crawler found the item on the site, expressed in UNIX epoch time. Note that last_seen is bounded above by the end of the measurement interval.
A feedback table which contains feedback on all items processed. This table has the following fields:
- item_id: A varchar(10) field containing a hash of the corresponding Silk Road item handle. This corresponds to the handle in the item table.
- feedback_time: An unsigned integer representing (an approximate value of) the time, in UNIX epoch time, at which the feedback was deposited.
- feedback_rating: A smallint representing the feedback rating that was given.
- feedback_hash: A varchar(32) representing a hash of the text entered as feedback.
A price table containing the price of the items, with the following fields:
- item_id: A varchar(10) field containing a hash of the corresponding Silk Road item handle. This corresponds to the handle in the item (and feedback) table(s).
- price: A decimal(10,2) unsigned representing the price in Bitcoin of the item as advertised at the time of parsing.
- time: An unsigned integer representing the time (in UNIX epoch time) at which the price for that item was recorded.
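
As a rough illustration of how the three tables fit together, the sketch below rebuilds the schema described above in an in-memory SQLite database (a stand-in for MySQL, chosen only so the example is self-contained) and summarizes each item. Every row is made up for illustration, the helper names summarize_items and epoch_to_utc are hypothetical, and the column types are SQLite approximations of the MySQL types listed above.

```python
import sqlite3
from datetime import datetime, timezone

# In-memory SQLite stand-in for the MySQL schema described above.
# All rows are invented for illustration; none come from the dataset.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item (item_id TEXT, seller TEXT, ships_to TEXT, ships_from TEXT,
                   category INTEGER, first_seen INTEGER, last_seen INTEGER);
CREATE TABLE feedback (item_id TEXT, feedback_time INTEGER,
                       feedback_rating INTEGER, feedback_hash TEXT);
CREATE TABLE price (item_id TEXT, price REAL, time INTEGER);
""")
conn.execute(
    "INSERT INTO item VALUES (?, ?, ?, ?, ?, ?, ?)",
    ("aaaaaaaaaa", "ssssssssss", "Worldwide", "Undeclared", 3,
     1328220000, 1330820000),
)
conn.executemany("INSERT INTO feedback VALUES (?, ?, ?, ?)", [
    ("aaaaaaaaaa", 1328300000, 5, "h1"),
    ("aaaaaaaaaa", 1329000000, 4, "h2"),
])
conn.executemany("INSERT INTO price VALUES (?, ?, ?)", [
    ("aaaaaaaaaa", 1.50, 1328220000),
    ("aaaaaaaaaa", 2.50, 1330820000),
])


def epoch_to_utc(ts):
    """Convert a UNIX epoch timestamp to an ISO 8601 string in UTC."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).isoformat()


def summarize_items(connection):
    """Return one summary dict per item.

    Correlated subqueries are used instead of joining feedback and price
    directly, which would multiply rows and inflate the counts.
    """
    rows = connection.execute("""
        SELECT i.item_id, i.first_seen, i.last_seen,
               (SELECT COUNT(*) FROM feedback f WHERE f.item_id = i.item_id),
               (SELECT AVG(p.price) FROM price p WHERE p.item_id = i.item_id)
        FROM item i
        ORDER BY i.item_id
    """).fetchall()
    return [
        {
            "item_id": item_id,
            "first_seen": epoch_to_utc(first_seen),
            "last_seen": epoch_to_utc(last_seen),
            "n_feedback": n_feedback,
            "avg_price_btc": avg_price,
        }
        for item_id, first_seen, last_seen, n_feedback, avg_price in rows
    ]


summary = summarize_items(conn)
print(summary)
```

The same SELECT statement should carry over to MySQL with little or no change, but it has not been tested against the released databases.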

Code

Most of the code used in the paper consists of very simple SQL queries that can be readily executed. Three groups of plots require slightly more complex programs: the plot of the evolution of sellers (Figure 5), the survivability analyses (which were done using R), and the sales volume and commissions plot (which was written in Perl). My code is sparsely documented, not currently maintained, and was not well tested with the publicly released databases I provide here. All of this may make it relatively cumbersome to reuse. If this does not deter you, the scripts can be found in code.tar.gz (see Downloads).

Downloads

categories.csv: Mapping between category identifiers and category names.

figures.tar.gz: Data files for all plots in the paper.

queries.sql: A set of SQL queries (and comments) corresponding to many of the plots/tables in the paper.

code.tar.gz: SQL queries, Perl, shell and R scripts used to produce some of the plots in the paper. Untested with these databases, not maintained, use at your own risk. Refer to the queries.sql file for an overview of what the scripts do.

master.sql.gz: The database containing all items, feedback, prices over the entire collection interval; this corresponds to the database D of the paper.

all_snapshots.sql.tar.gz: A number of snapshots corresponding to the databases D_t of the paper (can be decompressed using tar zxvf all_snapshots.sql.tar.gz). The number present in the file name (e.g., 1330820000) denotes the approximate time at which a given snapshot was taken. This time corresponds to the approximate time at which crawling for a given snapshot started. Crawling usually takes up to 24 hours, and you will find items with later timestamps in the data. Important: due to a different method of parsing items than in the master database, there are usually multiple entries for each item in these snapshot databases. Please ensure that you are always grouping by item_id to avoid counting duplicates.
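
The duplicate-entry caveat can be illustrated with a small sketch. It uses an in-memory SQLite table as a stand-in for a snapshot database, with made-up rows and a reduced set of columns; how the duplicate rows actually differ in the real snapshots is an assumption here, and the helper name snapshot_start is hypothetical.

```python
import sqlite3
from datetime import datetime, timezone


def snapshot_start(filename):
    """Parse the epoch prefix of a snapshot file name such as
    '1330820000.sql.gz' into a timezone-aware UTC datetime."""
    epoch = int(filename.split(".", 1)[0])
    return datetime.fromtimestamp(epoch, tz=timezone.utc)


# Stand-in snapshot table; rows are invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE item (item_id TEXT, seller TEXT, "
    "first_seen INTEGER, last_seen INTEGER)"
)
conn.executemany("INSERT INTO item VALUES (?, ?, ?, ?)", [
    ("aaaaaaaaaa", "s1", 1330820100, 1330820100),
    ("aaaaaaaaaa", "s1", 1330820100, 1330830000),  # second entry, same item
    ("bbbbbbbbbb", "s2", 1330825000, 1330825000),
])

# Counting rows overstates the number of items in the snapshot.
naive_count = conn.execute("SELECT COUNT(*) FROM item").fetchone()[0]

# Counting distinct item_id values gives the number of items.
distinct_count = conn.execute(
    "SELECT COUNT(DISTINCT item_id) FROM item"
).fetchone()[0]

# Grouping by item_id collapses the duplicates into one row per item.
per_item = conn.execute("""
    SELECT item_id, MIN(first_seen), MAX(last_seen)
    FROM item
    GROUP BY item_id
    ORDER BY item_id
""").fetchall()

print(snapshot_start("1330820000.sql.gz").isoformat())
print(naive_count, distinct_count)
print(per_item)
```

Taking MIN(first_seen) and MAX(last_seen) is only one possible way to merge duplicate rows; pick the aggregate that matches the question you are asking.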

You can also download individual snapshots:
1328220000.sql.gz, 1328310000.sql.gz, 1328390000.sql.gz, 1328480000.sql.gz, 1328570000.sql.gz, 1328650000.sql.gz, 1328740000.sql.gz, 1328920000.sql.gz, 1329000000.sql.gz, 1329080000.sql.gz, 1329180000.sql.gz, 1329260000.sql.gz, 1329350000.sql.gz, 1329430000.sql.gz, 1329520000.sql.gz, 1329610000.sql.gz, 1329690000.sql.gz, 1329780000.sql.gz, 1329870000.sql.gz, 1329950000.sql.gz, 1330030000.sql.gz, 1330120000.sql.gz, 1330210000.sql.gz, 1330300000.sql.gz, 1330380000.sql.gz, 1330470000.sql.gz, 1330560000.sql.gz, 1330640000.sql.gz, 1330820000.sql.gz, 1330900000.sql.gz, 1330990000.sql.gz, 1331760000.sql.gz, 1331850000.sql.gz, 1331940000.sql.gz, 1332020000.sql.gz, 1332100000.sql.gz, 1332190000.sql.gz, 1332300000.sql.gz, 1332370000.sql.gz, 1332460000.sql.gz, 1332540000.sql.gz, 1332630000.sql.gz, 1332710000.sql.gz, 1332800000.sql.gz, 1332880000.sql.gz, 1332970000.sql.gz, 1333050000.sql.gz, 1333150000.sql.gz, 1333320000.sql.gz, 1333410000.sql.gz, 1333490000.sql.gz, 1333570000.sql.gz, 1333660000.sql.gz, 1333750000.sql.gz, 1333920000.sql.gz, 1334620000.sql.gz, 1334780000.sql.gz, 1334870000.sql.gz, 1334950000.sql.gz, 1335050000.sql.gz, 1335130000.sql.gz, 1335300000.sql.gz, 1335650000.sql.gz, 1335740000.sql.gz, 1335820000.sql.gz, 1335990000.sql.gz, 1336250000.sql.gz, 1336340000.sql.gz, 1336430000.sql.gz, 1336510000.sql.gz, 1336600000.sql.gz, 1336690000.sql.gz, 1336770000.sql.gz, 1336860000.sql.gz, 1336940000.sql.gz, 1337030000.sql.gz, 1337120000.sql.gz, 1337210000.sql.gz, 1337290000.sql.gz, 1337390000.sql.gz, 1337460000.sql.gz, 1337550000.sql.gz, 1337630000.sql.gz, 1337730000.sql.gz, 1337810000.sql.gz, 1337890000.sql.gz, 1338500000.sql.gz, 1338590000.sql.gz, 1338670000.sql.gz, 1338760000.sql.gz, 1339080000.sql.gz, 1339190000.sql.gz, 1339280000.sql.gz, 1339360000.sql.gz, 1339450000.sql.gz, 1339530000.sql.gz, 1339630000.sql.gz, 1339710000.sql.gz, 1339790000.sql.gz, 1339880000.sql.gz, 1339970000.sql.gz, 1340140000.sql.gz, 1340230000.sql.gz, 1340310000.sql.gz, 1340400000.sql.gz, 
1340490000.sql.gz, 1340570000.sql.gz, 1340660000.sql.gz, 1340850000.sql.gz, 1340920000.sql.gz, 1341000000.sql.gz, 1341090000.sql.gz, 1341180000.sql.gz, 1341270000.sql.gz, 1341350000.sql.gz, 1341440000.sql.gz, 1341520000.sql.gz, 1341610000.sql.gz, 1341700000.sql.gz, 1341780000.sql.gz, 1341870000.sql.gz, 1341950000.sql.gz, 1342730000.sql.gz, 1342820000.sql.gz, 1342910000.sql.gz, 1343080000.sql.gz.

Frequently asked questions

Some snapshots appear to be missing!
As explained in the paper, there are collection gaps, due to the site going down, for instance.

Do you have data more recent than late July 2012?
No.

None of the item or seller ids in the database match those I can find on the site. What is wrong?
This is by design. The item_id and seller fields in the database are (salted) hashed versions of the "real" item and seller IDs, so that it is not easy to link sellers and items currently listed on the site with this older data.
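
For readers unfamiliar with salted hashing, here is a minimal sketch of the general idea. The hash function, the salt, and the truncation actually used for this dataset are not documented here; SHA-256, the example salts, the 10-character truncation (picked only to match the varchar(10) column width), and the function name anonymize_handle are all assumptions made for illustration.

```python
import hashlib


def anonymize_handle(handle, salt):
    """Return a 10-character pseudonym for a handle.

    Illustrative only: SHA-256 and the truncation length are assumptions,
    not the procedure used to produce the released data.
    """
    digest = hashlib.sha256(salt + handle.encode("utf-8")).hexdigest()
    return digest[:10]


# Hypothetical salts and handle, for demonstration purposes.
salt_a = b"example-salt-a"
salt_b = b"example-salt-b"

pseudonym_1 = anonymize_handle("some-item-handle", salt_a)
pseudonym_2 = anonymize_handle("some-item-handle", salt_a)
pseudonym_3 = anonymize_handle("some-item-handle", salt_b)

# Same salt and handle give the same pseudonym, so records stay linkable
# within the dataset; without the salt, pseudonyms cannot be recomputed
# from handles seen on the site.
print(pseudonym_1 == pseudonym_2)
print(pseudonym_1 == pseudonym_3)
```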

I really need to have textual descriptions of items, feedback, ... for my work, can you send this to me?
It depends. If you are an academic researcher, as evidenced by an email address and a webpage (describing your research and publications) in a .edu or known non-US academic domain, we can at least talk. If you are an undergraduate student, please get your advisor to contact me; if you are a graduate student, please cc your faculty advisor. Please note that, in any case, I cannot guarantee I have the data you request, or that I will be able to make the data available to you.

Can I get your crawling/parsing code?
Unfortunately, no. Not only would the crawling code reveal some information about the account(s) I have been using, but the website structure has also changed since data collection stopped, so these scripts are no longer useful.

How do I connect to Silk Road, shop on it, etc...?
The paper [1] discusses all of this. Remember, though, that most items on Silk Road are considered contraband or illicit in most jurisdictions, and purchasing them may be punishable by law, sometimes with harsh sentences.

I have a question about Tor, Bitcoin...
Many online resources can answer these questions far better than I could. Please refer to them.

Citation

The data is under an "Attribution-NonCommercial" Creative Commons License. License terms are available here.

If you use any of this dataset, please cite the associated paper:

[1] Nicolas Christin. Traveling the Silk Road: A measurement analysis of a large anonymous online marketplace. To appear in Proceedings of the 22nd International World Wide Web Conference (WWW'13). Rio de Janeiro, Brazil. May 2013.
Preliminary version available as CMU CyLab Technical Report CMU-CyLab-12-018. (Also: arXiv 1207.7139 [cs.CY].) July 2012 (revised November 2012).

Acknowledgments

This research was partially supported by CyLab at Carnegie Mellon under grant DAAD19-02-1-0389 from the Army Research Office, and by the National Science Foundation under ITR award CCF-0424422 (TRUST).