| |  \ \ / /                
 __      _____| |__ \ V / _ __ __ _ _   _ 
 \ \ /\ / / _ \ '_ \ > < | '__/ _` | | | |
  \ V  V /  __/ |_) / . \| | | (_| | |_| |
   \_/\_/ \___|_.__/_/ \_\_|  \__,_|\__, |
                                     __/ |
                                    |___/ 
                                   [v.1.0]
			

About

webXray is a tool for detecting third-party HTTP requests on large numbers of web pages and matching them to the companies which receive user data.

Why

Third-party HTTP requests are the lowest-level mechanism by which user data may be surreptitiously disclosed to unknown parties on the web. This may be for perfectly benign reasons, such as an embedded picture from another site, or it may be a form of surveillance utilizing tracking pixels, cookies, or even sophisticated fingerprinting techniques.

How

As a departure from existing tools, webXray facilitates the identification of the real-world entities to which requests are made by correlating domain requests with the owners of domains. In other words, webXray allows you to see which companies are monitoring which pages.
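
As a rough illustration of that correlation step, the Python sketch below maps requested domains to their owners. The DOMAIN_OWNERS table and the owners_for function are hypothetical names made up for this example; they are not webXray's actual data or API, and the three entries are only a few well-known cases.

```python
# Hypothetical lookup table: a tiny stand-in for a real domain-owner database.
DOMAIN_OWNERS = {
    "google-analytics.com": "Google",
    "doubleclick.net": "Google",
    "facebook.net": "Facebook",
}


def owners_for(requested_domains):
    """Return the set of owners behind a collection of requested domains.

    Domains missing from the table are reported as "Unknown".
    """
    return {DOMAIN_OWNERS.get(domain, "Unknown") for domain in requested_domains}


if __name__ == "__main__":
    seen = ["google-analytics.com", "doubleclick.net", "example-cdn.test"]
    print(sorted(owners_for(seen)))
```

Collapsing the result into a set is the point of the exercise: several domains often lead back to one company.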

The core of webXray is a Python program which ingests addresses of webpages, passes them to the headless web browser PhantomJS, and parses the resulting requests to determine which go to domains exogenous to the primary (or first-party) domain of the site. This data is then stored in MySQL for later analysis.
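
The first-party versus third-party test can be sketched in a few lines of Python. This is a simplified illustration, not webXray's own code: the function names are invented here, and the "last two labels" shortcut for finding a base domain is an assumption that fails for suffixes such as "co.uk" (a real tool would consult the Public Suffix List).

```python
from urllib.parse import urlparse


def naive_base_domain(url):
    """Return the last two labels of the URL's hostname (e.g. 'example.com').

    Assumption: good enough for illustration only; wrong for 'co.uk' and
    similar multi-label suffixes. Returns '' when the URL has no hostname.
    """
    host = urlparse(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def third_party_requests(page_url, request_urls):
    """Keep only the requests whose base domain differs from the page's.

    Requests without a hostname (for example data: URLs) are skipped.
    """
    first_party = naive_base_domain(page_url)
    return [
        url for url in request_urls
        if naive_base_domain(url) not in ("", first_party)
    ]
```

For example, a page on www.example.com that loads a stylesheet from static.example.com and a script from www.google-analytics.com would yield only the second request.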

Who

webXray was originally developed for academic research, but may be used by anybody with an interest in the hidden structures of the web, privacy, and surveillance.

Cost

Free!*

(*Subject to the terms of the GNU General Public License.)

Get the Software

webXray can be downloaded from GitHub.

Dependencies

In order to use webXray the following must be installed on your system:

Python 3.4+ https://www.python.org
PhantomJS 1.9+ http://phantomjs.org
MySQL https://www.mysql.com
MySQL Python Connector https://dev.mysql.com/downloads/connector/python/

Regarding the MySQL Python Connector, I have had the best luck with the "Platform Independent" version. Make sure you use Python3 to install it.

OS Specific Directions

Installing on Ubuntu 14.04

Ubuntu 14.04 already has Python3 installed. To make sure you have 3.4, type:

python3 --version

...if you see 3.4 or above you are good to go!
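
If you would rather check from inside Python, a minimal sketch of the same test follows. The function name python_is_new_enough is made up for this example; the 3.4 minimum comes from the Dependencies list above.

```python
import sys

MINIMUM_PYTHON = (3, 4)  # the minimum version listed under Dependencies


def python_is_new_enough(version_info=sys.version_info):
    """Return True when the given interpreter version is 3.4 or above."""
    return tuple(version_info[:2]) >= MINIMUM_PYTHON


if __name__ == "__main__":
    print("OK" if python_is_new_enough() else "Python 3.4+ required")
```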

While Ubuntu has Python 3 included by default, it does not include the python3 package manager, so you will need to install that using this command:

apt-get install python3-pip

Next, install PhantomJS. To do so, execute this command:

apt-get install phantomjs

...this will install PhantomJS. webXray requires 1.9 or above. To see if you have this, type:

phantomjs --version
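
The same check can be scripted. The sketch below assumes that 'phantomjs --version' prints a dotted version number such as 1.9.8, and the helper names are invented for this example:

```python
import subprocess


def parse_version(text):
    """Turn a string such as '1.9.8' into a tuple of ints: (1, 9, 8).

    Parts that are not purely numeric are skipped.
    """
    return tuple(int(part) for part in text.strip().split(".") if part.isdigit())


def phantomjs_is_new_enough(minimum=(1, 9)):
    """Run 'phantomjs --version' and compare the result with the minimum.

    Returns False when the phantomjs binary cannot be run at all.
    """
    try:
        output = subprocess.check_output(["phantomjs", "--version"])
    except (OSError, subprocess.CalledProcessError):
        return False
    return parse_version(output.decode("utf-8")) >= minimum
```

Comparing tuples rather than strings matters here: as text, "1.10" sorts before "1.9", while (1, 10) correctly compares greater than (1, 9).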

Assuming all is going well so far, it's time to install the MySQL database server. To do so, enter this command:

apt-get install mysql-server

This will walk you through setting up MySQL. It will encourage you to enter a password. Strictly speaking this is a good idea, but if you are only using the database for webXray it is likely OK to run it without a password, in which case webXray will work with no changes. However, if you do choose to use a password, you will need to modify the file 'webxray/MySQLDriver.py' so webXray can connect.
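
The contents of 'webxray/MySQLDriver.py' are not reproduced here, so the following is only a sketch of the kind of change involved. It relies on the MySQL Python Connector's mysql.connector.connect() function, which accepts user, password, and host keyword arguments; every other name below, including the "root" user, is a hypothetical placeholder.

```python
# Hypothetical settings; the real names inside webxray/MySQLDriver.py may differ.
DB_USER = "root"
DB_PASSWORD = ""  # set this to the password you chose during MySQL setup
DB_HOST = "localhost"


def connection_settings(user=DB_USER, password=DB_PASSWORD, host=DB_HOST):
    """Build keyword arguments for mysql.connector.connect().

    The password is only included when one has been set.
    """
    settings = {"user": user, "host": host}
    if password:
        settings["password"] = password
    return settings


def open_connection(**overrides):
    """Open a connection using the MySQL Python Connector."""
    import mysql.connector  # imported here so the settings can be built without it

    return mysql.connector.connect(**connection_settings(**overrides))
```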

Now start the MySQL server:

service mysql start

Each time you reboot your computer you will need to start the MySQL server.

The last Ubuntu-specific step is to install git:

apt-get install git

Great, you can now skip to installing the Python MySQL Connector.

Other Linux Flavors

This should be similar to the above; if you are using another flavor, you can likely figure it out from the Ubuntu directions. webXray was tested and developed primarily on CentOS and RHEL; the major pain there is installing Python 3.

Windows Specific Instructions

Get a Linux cloud server (which costs fractions of a cent per hour these days). Ubuntu is the easiest flavor of Linux to get started with and the directions above will serve you well. Seriously, this is your best option. You can do it. I'm both confident in your abilities and proud of you for taking this important step in life.

Mac OSX Specific Directions

Unlike Windows, Mac OSX is UNIX-based, which means you can run webXray pretty easily.

First you need to install Homebrew, which will manage the rest of your installation steps. You can get Homebrew here; it may take a little while to get set up. We'll be patient while you do that.

OK, the following assumes Homebrew is installed on your system. Using the terminal (if you have installed Homebrew this is where you are already), you will now install Python3:

brew install python3

To make sure you have the right version of Python installed, type:

python3 --version

...if you see 3.4 or above you are good to go!

Next, install PhantomJS. To do so, execute this command:

brew install phantomjs

...this will install PhantomJS. webXray requires 1.9 or above. To see if you have this, type:

phantomjs --version

Assuming all is going well so far, it's time to install the MySQL database server. To do so, enter this command:

brew install mysql

This will walk you through setting up MySQL. It will encourage you to enter a password. Strictly speaking this is a good idea, but if you are only using the database for webXray it is likely OK to run it without a password, in which case webXray will work with no changes. However, if you do choose to use a password, you will need to modify the file 'webxray/MySQLDriver.py' so webXray can connect.

Now start the MySQL server:

mysql.server start

Each time you reboot your computer you will need to start the MySQL server.

The last Mac-specific step is to install wget, a program that allows you to download resources from the web. To do so, type:

brew install wget

Great, you can now continue to installing the Python MySQL Connector.

Installing Python MySQL Connector

The Python MySQL Connector allows Python to talk to your MySQL database, which is how webXray stores data. This part can either go very easily, or be a big pain in the ass. First, the easy way; type this in your terminal:

pip3 install mysql-connector-python --allow-external mysql-connector-python

That should work in most cases. If it does not, you will have to install the Python MySQL Connector manually. To do so, first download it using this command:

wget https://dev.mysql.com/get/Downloads/Connector-Python/mysql-connector-python-2.0.4.tar.gz

Next, you need to extract the archive:

tar -xvf mysql-connector-python-2.0.4.tar.gz

Next, go to the directory:

cd mysql-connector-python-2.0.4/

...and run the installer:

python3 setup.py install

...then go back to the previous directory:

cd ..
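
Whichever route you took, one way to confirm that the connector is visible to Python3 is to try importing it. A minimal sketch follows; the function name connector_installed is made up for this example, while 'mysql.connector' is the module name the connector installs.

```python
import importlib


def connector_installed():
    """Return True when 'import mysql.connector' succeeds in this interpreter."""
    try:
        importlib.import_module("mysql.connector")
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    print("connector found" if connector_installed() else "connector missing")
```

Run it with python3 rather than python, since a connector installed for Python 2 will not be found by Python 3.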

Continue below to...

Install webXray!

The last step is to get webXray from GitHub!

git clone https://github.com/timlib/webXray.git

Now webXray is ready to go! To use it type:

cd webXray

python3 run_webxray.py -i

This is the interactive mode, which will guide you through scanning the top 1,000 Alexa websites.

Using webXray to Analyze Your Own List of Pages

The entire point of webXray is to allow you to analyze pages of your choosing. To do so, first place all of the page addresses you wish to scan into a text file and place this file in the "page_lists" directory. Make sure your addresses start with "http"; otherwise webXray will not recognize them as valid addresses. Once you have placed your page list in the proper directory, you may run webXray in interactive mode and it will allow you to select your page list. Easy-peasy.
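
Before a long scan it can be worth checking a page list for addresses that would be rejected. The sketch below applies the "starts with http" rule described above; it assumes one address per line, and the function name load_page_list is hypothetical rather than part of webXray.

```python
def load_page_list(path):
    """Split the lines of a page list into (valid, rejected) addresses.

    An address counts as valid when it starts with "http". Blank lines
    are ignored.
    """
    valid, rejected = [], []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            address = line.strip()
            if not address:
                continue
            (valid if address.startswith("http") else rejected).append(address)
    return valid, rejected
```

A line such as www.example.org would land in the rejected list until it is rewritten as http://www.example.org.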

Viewing Reports

Use the interactive mode to guide you through generating an analysis. When it is completed, the output will be written to the '/reports' directory as a number of CSV files.
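
To see what a finished analysis produced, a short Python helper can list each CSV file together with its first row. This is a sketch under assumptions: the directory name "reports" is taken from the text above, the function name summarize_reports is invented, and nothing is assumed about the file names or columns webXray writes.

```python
import csv
import glob
import os


def summarize_reports(report_dir="reports"):
    """Map each CSV file name in report_dir to its first row.

    The first row may or may not be a header. Empty files map to an
    empty list.
    """
    summary = {}
    for path in sorted(glob.glob(os.path.join(report_dir, "*.csv"))):
        with open(path, newline="", encoding="utf-8") as handle:
            summary[os.path.basename(path)] = next(csv.reader(handle), [])
    return summary


if __name__ == "__main__":
    for name, first_row in summarize_reports().items():
        print(name, first_row)
```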

Important Note on Speed

webXray can analyze many pages in parallel and has achieved speeds of up to 30,000 pages per hour. However, out of the box webXray is configured to scan only four pages in parallel. If you think your system can handle more (and chances are it can!), open the 'run_webxray.py' file and search for the string 'pool_size'. There you will find instructions on how to increase the number of pages scanned concurrently. The bigger you can make pool_size, the faster you will go.
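
To show what a pool size controls, here is a small Python sketch using a thread pool from the standard library. It is not the code in 'run_webxray.py': scan_page is a placeholder for the real per-page work, and the choice of threads is an assumption made for illustration. The default of 4 mirrors the out-of-the-box setting mentioned above.

```python
from multiprocessing.pool import ThreadPool


def scan_page(address):
    """Placeholder for the real per-page work (loading the page, storing requests)."""
    return (address, len(address))


def scan_all(addresses, pool_size=4):
    """Process addresses with at most pool_size of them in flight at once."""
    with ThreadPool(pool_size) as pool:
        # map() hands addresses to idle workers and returns results in input order.
        return pool.map(scan_page, addresses)
```

Raising pool_size only helps while the machine has spare CPU, memory, and bandwidth; past that point each page simply takes longer to finish.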

Publications and Data

Privacy Implications of Health Information Seeking on the Web (PDF)

Communications of the ACM, March, 2015.

Data: 1G Zip File

What web browsing reveals about your health (PDF)

The BMJ (British Medical Journal), November, 2015.

Data: 1G Zip File

Exposing the Hidden Web: Third-Party HTTP Requests On One Million Websites (PDF)

International Journal of Communication, October, 2015.

Data: 35,569,481 Request Records, 600MB Zip File

Add Your Own!

If you have used webXray for your own research, please let us know and we will add you to this section.

Transparency

Tracking

In order to give the webmaster an idea of how much traffic this site is getting, the open-source, locally-hosted analytics software Piwik is used. If you have the Do Not Track setting on, your data will not be stored. Collected data is minimized to the extent Piwik allows, and is purged regularly.

Help!

If you are having problems installing the software, please email me and, if I have time, I will try to help as I am able: tlibert<[@]>asc.upenn.edu.