| |                        
__      _____| |____  ___ __ __ _ _   _ 
\ \ /\ / / _ \ '_ \ \/ / '__/ _` | | | |
 \ V  V /  __/ |_) >  <| | | (_| | |_| |
  \_/\_/ \___|_.__/_/\_\_|  \__,_|\__, |
                                   __/ |
                                  |___/ 
                           	   [v 2.0]
			

About

webxray is a tool for analyzing third-party content on webpages and identifying the companies which collect user data. A command line user interface makes webxray easy to use for non-programmers, and those with advanced needs may analyze millions of pages with proper configuration. webxray is a professional tool designed for academic research, and may be used by privacy compliance officers, regulators, and those who are generally curious about hidden data flows on the web.

webxray uses a custom library of domain ownership to chart the flow of data from a given third-party domain to a corporate owner, and if applicable, to parent companies. Tracking attribution reports produced by webxray provide robust granularity. Reports of the average numbers of third-parties and cookies per-site, most commonly occurring third-party domains and elements, volumes of data transferred, use of SSL encryption, and more are provided out-of-the-box. A flexible data schema allows for the generation of custom reports as well as authoring extensions to add additional data sources.
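As a rough illustration of the ownership lookup described above, the sketch below walks from a third-party domain to its direct owner and then up through any parent companies. The dictionary layout, the function name `ownership_chain`, and every domain and company name are hypothetical placeholders; they are not webxray's actual library or schema.

```python
# Hypothetical ownership data: NOT webxray's real domain ownership library.
DOMAIN_OWNERS = {
    "tracker.example": "Example Ads",
    "cdn.example": "Example Hosting",
}

# Maps a company to its parent company, when it has one (also hypothetical).
PARENT_COMPANIES = {
    "Example Ads": "Example Holdings",
    "Example Holdings": "Example Group",
}


def ownership_chain(domain):
    """Return the companies from the direct owner up to the top-level parent."""
    chain = []
    company = DOMAIN_OWNERS.get(domain)
    # Follow parent links until a company has no recorded parent; the
    # membership check stops the loop if the data accidentally has a cycle.
    while company is not None and company not in chain:
        chain.append(company)
        company = PARENT_COMPANIES.get(company)
    return chain


if __name__ == "__main__":
    print(ownership_chain("tracker.example"))
    print(ownership_chain("unknown.example"))
```

A domain with no known owner yields an empty list, which is how this sketch represents "no attribution available".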

By default, webxray uses Chrome to load pages, stores data in a SQLite database, and can be used on a normal desktop computer. Users with advanced needs may install webxray on a server and leverage MySQL or PostgreSQL for heavy-duty data storage.

Below you will find detailed instructions on how to install the software needed to run webxray in the Dependencies section. The Installation and First Run section provides guidance on getting started. Additional sections provide instructions on Using webxray to Analyze Your Own List of Pages, Viewing and Understanding Reports, Advanced Options, and Getting Help.

Dependencies

webxray depends on several pieces of software being installed on your computer in advance. If you are familiar with installing dependencies on your own, you may install what is listed below and skip to Installation and First Run. If you are not familiar with dependencies, follow the detailed instructions for Ubuntu and macOS below. Note that webxray can be run on Windows, but detailed instructions are not currently available.

The dependencies for a standard webxray install are as follows:

Python Version 3.4+ https://www.python.org
Google Chrome Version 64+ https://www.google.com/chrome/
Chromedriver https://sites.google.com/a/chromium.org/chromedriver/
Selenium https://pypi.python.org/pypi/selenium

OS Specific Directions

Installing on Ubuntu

Step One: Install Google Chrome (if you already have Chrome go to Step Two)

If you are using Ubuntu desktop, download Chrome here: https://www.google.com/chrome/

If you are on Ubuntu server, run the following commands:

wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb

sudo dpkg -i google-chrome-stable_current_amd64.deb

It is likely you will get errors; if so, run the following:

sudo apt -f install

Run the following command to make sure Chrome is installed. If you get an error, try the above steps again or search the web for advice.

google-chrome --version

Step Two: Install chromedriver

Chromedriver allows other programs to control Chrome. You must download chromedriver from Google: https://sites.google.com/a/chromium.org/chromedriver/

If you are on a server you can find the correct address using the above steps on another computer and use wget to get the file to your server.

Once you have downloaded chromedriver, install it with the following commands:

unzip chromedriver_linux64.zip

sudo mv chromedriver /usr/bin/

Run the following command to make sure chromedriver is installed. If you get an error, try the above steps again or search the web for advice.

chromedriver --version

Step Three: Install pip3

While Ubuntu includes Python3 by default, it does not include the Python3 package manager pip3, so you will need to install it using this command:

sudo apt install python3-pip

Run the following command to make sure pip3 is installed. If you get an error, try the above steps again or search the web for advice.

pip3 --version

Step Four: Install Selenium for Python3

Selenium is the glue between Python3 and web browsers. Install it with the following command:

sudo pip3 install selenium

You are now ready to install webxray.
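The individual version checks in the steps above can also be combined into one quick pre-flight check. The sketch below is only an illustration and not part of webxray: the function name `missing_dependencies` is hypothetical, and the command names are the ones used in the Ubuntu steps.

```python
import importlib.util
import shutil


def missing_dependencies(commands, modules):
    """Return the names of commands and Python modules that cannot be found."""
    missing = []
    for command in commands:
        # shutil.which returns None when the executable is not on the PATH.
        if shutil.which(command) is None:
            missing.append(command)
    for module in modules:
        # find_spec returns None when a top-level module cannot be imported.
        if importlib.util.find_spec(module) is None:
            missing.append(module)
    return missing


if __name__ == "__main__":
    problems = missing_dependencies(
        ["google-chrome", "chromedriver", "pip3"], ["selenium"]
    )
    if problems:
        print("Missing:", ", ".join(problems))
    else:
        print("All dependencies found.")
```

This only confirms that each item can be located; it does not check version numbers, so the "--version" commands above are still worth running.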

macOS Specific Directions

macOS is a UNIX-based system, and setting up webxray is relatively straightforward.

Step One: Install Homebrew

Homebrew is a command-line tool which helps you install and manage various other command-line tools. To install Homebrew, go to the following site and follow the instructions. Note that it may take some time to download and install: https://brew.sh.

By default, Homebrew sends information to Google Analytics. You can disable that with the following command using the terminal (which you should have open after installing Homebrew):

brew analytics off

Step Two: Install Python3

Python3 is needed to run webxray. Enter the following command to install it:

brew install python3

To make sure you have the right version of Python installed, run the following command:

python3 --version

If you see 3.4 or above, you are good to go!

Step Three: Install chromedriver

Chromedriver allows other programs to control Chrome. Homebrew will install chromedriver for you using the following command:

brew install chromedriver

Run the following command to make sure chromedriver is installed. If you get an error, try the above steps again or search the web for advice.

chromedriver --version

Step Four: Install Selenium for Python3

Selenium is the glue between Python3 and web browsers. Install it with the following command:

sudo pip3 install selenium

You are now ready to install webxray.

Installation and First Run

The basic installation uses Chrome as the browser in 'headless' mode, meaning you will not see the browser open the pages you are analyzing. Data is stored in a SQLite database and you do not need to install a database server; databases are created in the directory './webxray/resources/db/sqlite/'.

Once your dependencies are installed you can download webxray from GitHub or you can clone the GitHub repository using the following command:

git clone https://github.com/timlib/webxray.git

Now webxray is ready to go! To use it enter the following commands:

cd webxray

python3 run_webxray.py

This is the interactive mode, which will guide you through scanning a list of sample websites.

Note that if you are running webxray as the 'root' user it may not run properly due to limitations in Chrome. If webxray stalls or crashes after the 'Building List of Pages' message, run webxray as a non-root user.
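After a first run, the SQLite databases created in './webxray/resources/db/sqlite/' can be inspected with Python's built-in sqlite3 module. The sketch below lists the tables in each database file it finds; it makes no assumption about webxray's table names, and the function name `list_tables` is hypothetical.

```python
import os
import sqlite3


def list_tables(db_path):
    """Return the names of all tables in a SQLite database file."""
    connection = sqlite3.connect(db_path)
    try:
        # sqlite_master is SQLite's built-in catalog of schema objects.
        rows = connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        connection.close()
    return [name for (name,) in rows]


if __name__ == "__main__":
    db_dir = "./webxray/resources/db/sqlite/"
    if os.path.isdir(db_dir):
        for file_name in sorted(os.listdir(db_dir)):
            try:
                print(file_name, list_tables(os.path.join(db_dir, file_name)))
            except sqlite3.DatabaseError:
                # Skip files in the directory that are not SQLite databases.
                continue
```

Run it from the directory that contains the 'webxray' folder so the relative path resolves.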

Using webxray to Analyze Your Own List of Pages

The raison d'être of webxray is to allow you to analyze pages of your choosing. To do so, first place all of the page addresses you wish to scan into a text file and place this file in the "page_lists" directory. Make sure your addresses start with "http://" or "https://"; if they do not, webxray will not recognize them as valid addresses. Once you have placed your page list in the proper directory, you may run webxray and it will allow you to select your page list.
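Before starting a long scan, it can help to check a page list for addresses that lack the required prefix. The sketch below is a stand-alone pre-check and not webxray's own validation code; the function name `split_page_list` is hypothetical, and it assumes one address per line.

```python
def split_page_list(lines):
    """Split raw lines into (valid, rejected) lists of page addresses."""
    valid, rejected = [], []
    for line in lines:
        address = line.strip()
        if not address:
            continue  # ignore blank lines
        # str.startswith accepts a tuple of allowed prefixes.
        if address.startswith(("http://", "https://")):
            valid.append(address)
        else:
            rejected.append(address)
    return valid, rejected


if __name__ == "__main__":
    # In practice, pass an open page list file instead of this sample list.
    sample = ["http://example.com\n", "example.org\n", "https://example.net\n"]
    valid, rejected = split_page_list(sample)
    print(len(valid), "valid addresses")
    for address in rejected:
        print("Missing http:// or https:// prefix:", address)
```

Because the function accepts any iterable of lines, an open file object can be passed in directly.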

Viewing and Understanding Reports

Once you have completed your data collection, use the interactive mode to generate an analysis. When the analysis is complete, the results are written to the '/reports' directory as a number of CSV files.
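The report files can be opened in any spreadsheet program, or explored with Python's built-in csv module. The sketch below summarizes whatever CSV files it finds in a directory; it assumes only that the first row of each file is a header, and it does not assume any particular report names or columns. The function name `summarize_reports` and the relative path 'reports' are assumptions.

```python
import csv
import os


def summarize_reports(report_dir):
    """Map each CSV file name in report_dir to (header row, data row count)."""
    summary = {}
    for file_name in sorted(os.listdir(report_dir)):
        if not file_name.lower().endswith(".csv"):
            continue  # skip anything that is not a CSV file
        path = os.path.join(report_dir, file_name)
        with open(path, newline="", encoding="utf-8") as csv_file:
            rows = list(csv.reader(csv_file))
        header = rows[0] if rows else []
        # Every row after the header counts as data; empty files count as 0.
        summary[file_name] = (header, max(len(rows) - 1, 0))
    return summary


if __name__ == "__main__":
    report_dir = "reports"  # assumed location relative to the current directory
    if os.path.isdir(report_dir):
        for name, (header, count) in summarize_reports(report_dir).items():
            print(name, "columns:", header, "rows:", count)
```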

Advanced Options

The default installation instructions above are perfect for analyzing moderate volumes of pages (e.g. fewer than 1,000). However, the true value of webxray is the ability to scan vast volumes of pages. The following are details on how to leverage the power of many advanced functions. Unlike the above, these directions assume you are familiar with installing dependencies on your own and doing light editing of Python3 code.

Analyze a Single Page

Sometimes you just want to run a single quick scan. To do so, use the command below. Be sure to replace "http://example.com" with the address of the site you want to scan.

python3 run_webxray.py -s http://example.com

Run Many Browsers in Parallel to Increase Speed

By default, webxray will only run a single browser at a time. Given webxray waits 30 seconds for a page to load, this means it will take over 8 hours to scan 1,000 pages. However, most systems can handle running many more browsers at a time, resulting in significant speed gains.
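The arithmetic behind that figure is easy to reproduce for your own page counts. The sketch below treats the 30-second wait as the only cost per page, so it gives a lower bound rather than a prediction; the function name `estimate_scan_hours` and the `browsers` parameter are hypothetical and not part of webxray.

```python
def estimate_scan_hours(num_pages, browser_wait=30, browsers=1):
    """Estimate a lower bound, in hours, on the time needed to scan num_pages.

    Assumes each page costs exactly browser_wait seconds and that the work
    divides evenly across the browsers running in parallel.
    """
    if browsers < 1:
        raise ValueError("browsers must be at least 1")
    total_seconds = num_pages * browser_wait / browsers
    return total_seconds / 3600


if __name__ == "__main__":
    for browsers in (1, 4, 8):
        hours = estimate_scan_hours(1000, browsers=browsers)
        print(browsers, "browser(s):", round(hours, 1), "hours")
```

With one browser, 1,000 pages at 30 seconds each is 30,000 seconds, or roughly 8.3 hours, which matches the "over 8 hours" figure above.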

To use one browser per available processor core, open run_webxray.py and change "pool_size = 1" to "pool_size = None" - this is the most straightforward way to increase speed. Since webxray spends most of its time waiting for pages to load, you may also experiment with setting pool_size above the number of available cores.
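In Python's multiprocessing module, a pool created with a size of None starts one worker per CPU reported by the operating system. The sketch below assumes webxray hands pool_size to such a pool, which this document does not confirm, and shows how to find out what None would resolve to on your machine, along with a few larger values to try. Both function names are hypothetical.

```python
import os


def effective_pool_size(pool_size):
    """Return the number of workers a pool_size setting would resolve to."""
    if pool_size is None:
        # os.cpu_count() can return None on unusual platforms; fall back to 1.
        return os.cpu_count() or 1
    return pool_size


def candidate_pool_sizes(multipliers=(1, 2, 4)):
    """Return pool sizes to experiment with, as multiples of the core count."""
    cores = effective_pool_size(None)
    return [cores * multiplier for multiplier in multipliers]


if __name__ == "__main__":
    print("pool_size = None resolves to", effective_pool_size(None), "browsers")
    print("Values to experiment with:", candidate_pool_sizes())
```

Start with the smallest candidate and increase it only while scans complete without crashes.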

It is possible to do some performance tuning to determine how many browsers you can run before you get crashes. If you desire to scan more than 100,000 pages, performance tuning is highly advised.

Use a More Powerful Database

If you are scanning more than a few thousand pages, you will need a database engine far more powerful than SQLite. Both MySQL and PostgreSQL are supported by webxray. The instructions below will tell you how to set up each database. Note that most webxray development has been done on MySQL, so PostgreSQL has undergone less testing.

To use MySQL:

First you must install MySQL: https://www.mysql.com and the mysql-connector module for Python3: https://dev.mysql.com/downloads/connector/python/.

Second, because MySQL's 'utf8' character set stores at most three bytes per character and therefore cannot hold all of Unicode, you must configure your database to use full Unicode, which MySQL calls 'utf8mb4'. Follow this guide to get MySQL to work correctly. Note that you only need to follow Step 5 and can ignore the rest: https://mathiasbynens.be/notes/mysql-utf8mb4#character-sets.

Third, you must configure webxray to use MySQL. To do so, open the file 'run_webxray.py', find the line "db_engine = 'sqlite'", and change it to "db_engine = 'mysql'".
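To see why the character set matters, note that any character outside Unicode's Basic Multilingual Plane, such as most emoji, takes four bytes in UTF-8, which is more than MySQL's three-byte 'utf8' can store. The sketch below flags such text; it is an illustration only, and the function name `needs_utf8mb4` is hypothetical.

```python
def needs_utf8mb4(text):
    """Return True if text contains a character that takes 4 bytes in UTF-8."""
    # Code points above U+FFFF lie outside the Basic Multilingual Plane
    # and are always encoded as four bytes in UTF-8.
    return any(ord(char) > 0xFFFF for char in text)


if __name__ == "__main__":
    plain = "caf\u00e9"          # accented letter, two bytes in UTF-8
    emoji = "\U0001F600"         # grinning face emoji, four bytes in UTF-8
    print(needs_utf8mb4(plain), len(plain[-1].encode("utf-8")))
    print(needs_utf8mb4(emoji), len(emoji.encode("utf-8")))
```

Page content and cookie values collected from the web can contain such characters, which is why the full 'utf8mb4' character set is needed.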

Fourth, open the file './webxray/MySQLDriver.py' and update your username and password in the '__init__' function.

You should be all set to use MySQL now. Run webxray as you normally would.

To use PostgreSQL:

First, you must install PostgreSQL: https://www.postgresql.org and the psycopg2 module for Python3: http://initd.org/psycopg/.

Second, you must configure webxray to use PostgreSQL. To do so, open the file 'run_webxray.py', find the line "db_engine = 'sqlite'", and change it to "db_engine = 'postgres'".

Third, open the file './webxray/PostgreSQLDriver.py' and update your username and password in the '__init__' function.

You should be all set to use PostgreSQL now. Run webxray as you normally would.

Change How Long the Browser Waits After Loading a Page

In order to get all the third-party elements possible, webxray waits for 30 seconds after loading a page. You can make this longer or shorter by changing the line "browser_wait = 30" in run_webxray.py. Note that 30 seconds works very well, and due to specifics of Chrome, setting it to fewer than 30 seconds may result in lost cookies.

Run Chrome in Windowed Mode

By default, Chrome is run without a window opening on your computer, which uses fewer resources, is less annoying, and is required if your computer doesn't have a monitor. If you do want to see pages loading, you can enable 'windowed' mode by opening the file './webxray/ChromeDriver.py', finding the line "self.headless = True", and changing it to "self.headless = False".

Use PhantomJS

The state of PhantomJS is in disarray as people are moving to Chrome. This is sad, as the browser is extremely resource-friendly and very useful for scanning large volumes of web pages. If you still want to use PhantomJS, you must first install it on your system, then open the run_webxray.py file and change the line "browser_type = 'chrome'" to "browser_type = 'phantomjs'".

Hash Elements

It is possible to have webxray calculate the MD5 hash of all elements being downloaded. This is useful for tasks such as checking for malware. In the 'store' function of './webxray/OutputStore.py' you will find the 'get_file_hashes' and 'hash_3p_only' parameters; modify them as desired. Note that elements must be re-downloaded for hashing, so it slows down your network connection considerably and makes heavy CPU demands, as you are running a large number of hashing operations. This is why hashing is off by default.
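For reference, an MD5 hash of downloaded content can be computed with Python's built-in hashlib module. The sketch below is a generic illustration and not the code in OutputStore.py; the function names `md5_hex` and `md5_of_file` are hypothetical.

```python
import hashlib


def md5_hex(data):
    """Return the MD5 hash of a bytes object as a hexadecimal string."""
    return hashlib.md5(data).hexdigest()


def md5_of_file(path, chunk_size=65536):
    """Hash a file in chunks so large elements need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as element_file:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: element_file.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    print(md5_hex(b"hello"))
```

Two elements with the same hash almost certainly have identical content, which is what makes hashes useful for matching elements against lists of known files. MD5 is suitable for this kind of matching but should not be relied on where deliberate collisions are a concern.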

Hidden Options

There are many options hidden throughout the code base. Most often you can find hints in function parameters and by running this command:

python3 run_webxray.py -h

Getting Help

If you are having problems installing the software or find bugs, please open an issue on GitHub. If you have advanced needs and require assistance, or if you are interested in commissioning custom written reports rather than running the software yourself, please email Timothy Libert: contact@webxray.eu.