| |                        
__      _____| |____  ___ __ __ _ _   _ 
\ \ /\ / / _ \ '_ \ \/ / '__/ _` | | | |
 \ V  V /  __/ |_) >  <| | | (_| | |_| |
  \_/\_/ \___|_.__/_/\_\_|  \__,_|\__, |
                                   __/ |
                                  |___/ 
                           	   [v 2.2]
			

About

webxray is a tool for analyzing third-party content on webpages and identifying the companies which collect user data. A command line user interface makes webxray easy to use for non-programmers, and those with advanced needs may analyze millions of pages with proper configuration. webxray is a professional tool designed for academic research, and may be used by privacy compliance officers, regulators, and those who are generally curious about hidden data flows on the web.

webxray uses a custom library of domain ownership to chart the flow of data from a given third-party domain to a corporate owner, and if applicable, to parent companies. Tracking attribution reports produced by webxray provide robust granularity. Reports of the average numbers of third-parties and cookies per-site, most commonly occurring third-party domains and elements, volumes of data transferred, use of SSL encryption, and more are provided out-of-the-box. A flexible data schema allows for the generation of custom reports as well as authoring extensions to add additional data sources.

The public version of webxray uses Chrome to load pages, stores data in a SQLite database, and can be used on a normal desktop computer.

The Dependencies section below gives detailed instructions on how to install the software needed to run webxray. The Installation and First Run section provides guidance on getting started. Additional sections provide instructions on Using webxray to Analyze Your Own List of Pages, Viewing and Understanding Reports, Advanced Options, and Getting Help.

Dependencies

webxray depends on several pieces of software being installed on your computer in advance. If you are familiar with installing dependencies on your own, you may install what is listed below and skip to Installation and First Run. If you are not familiar with dependencies, follow the detailed instructions for Ubuntu and macOS below. Note that webxray can be run on Windows, but detailed instructions are not currently available.

The dependencies for a standard webxray install are as follows:

Python Version 3.4+ https://www.python.org
Google Chrome Version 64+ https://www.google.com/chrome/
Chromedriver https://sites.google.com/a/chromium.org/chromedriver/
Selenium https://pypi.python.org/pypi/selenium
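
If you installed the dependencies on your own, a short Python script can confirm that they are visible before you continue. The following is a minimal sketch, not part of webxray: the function name check_dependencies and the list of Chrome executable names are assumptions made for illustration. On macOS, Chrome is usually installed as an application rather than placed on the PATH, so the "chrome" entry may report MISSING there even when Chrome is installed.

```python
import importlib.util
import shutil
import sys

# Executable names are assumptions; "google-chrome" and "chromedriver"
# match the commands used in the Ubuntu directions of this document.
CHROME_NAMES = ("google-chrome", "google-chrome-stable", "chrome")


def check_dependencies(min_python=(3, 4)):
    """Return a dict mapping each dependency name to True or False."""
    return {
        # Python 3.4 or newer is required.
        "python": sys.version_info[:2] >= min_python,
        # Look for a Chrome executable on the PATH.
        "chrome": any(shutil.which(name) is not None for name in CHROME_NAMES),
        # Look for the chromedriver executable on the PATH.
        "chromedriver": shutil.which("chromedriver") is not None,
        # Check that the selenium package can be imported.
        "selenium": importlib.util.find_spec("selenium") is not None,
    }


if __name__ == "__main__":
    for name, found in check_dependencies().items():
        print("{:<13}{}".format(name, "ok" if found else "MISSING"))
```

Any entry reported as MISSING points to the step in the directions below that still needs to be completed.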

OS Specific Directions

Installing on Ubuntu

Step One: Install Google Chrome (if you already have Chrome go to Step Two)

If you are using Ubuntu desktop, download Chrome here: https://www.google.com/chrome/

If you are on Ubuntu server, run the following commands:

wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb

sudo dpkg -i google-chrome-stable_current_amd64.deb

It is likely you will get errors. If so, run the following:

sudo apt -f install

Run the following command to make sure Chrome is installed. If you get an error, try the above steps again or search the web for advice.

google-chrome --version

Step Two: Install chromedriver

Chromedriver allows other programs to control Chrome. You must download chromedriver from Google at the address listed in the Dependencies section: https://sites.google.com/a/chromium.org/chromedriver/

If you are on a server, you can find the correct download address on another computer and use wget to get the file to your server.

Once you have downloaded chromedriver, install it with the following commands:

unzip chromedriver_linux64.zip

sudo mv chromedriver /usr/bin/

Run the following command to make sure chromedriver is installed. If you get an error, try the above steps again or search the web for advice.

chromedriver --version

Step Three: Install pip3

While Ubuntu includes Python3 by default, it does not include the Python3 package manager pip3, so you will need to install it using this command:

sudo apt install python3-pip

Run the following command to make sure pip3 is installed. If you get an error, try the above steps again or search the web for advice.

pip3 --version

Step Four: Install Selenium for Python3

Selenium is the glue between Python3 and web browsers. Install it with the following command:

sudo pip3 install selenium

You are now ready to install webxray.

macOS Specific Directions

macOS is a UNIX-based system and setting up webxray is relatively straightforward.

Step One: Install Homebrew

Homebrew is a command-line tool which helps you install and manage various other command-line tools. To install Homebrew, go to the following site and follow the instructions. Note that it may take some time to download and install: https://brew.sh.

By default, Homebrew sends information to Google Analytics. You can disable that with the following command using the terminal (which you should have open after installing Homebrew):

brew analytics off

Step Two: Install Python3

Python3 is needed to run webxray. Enter the following command to install it:

brew install python3

To make sure you have the right version of Python installed, run the following command:

python3 --version

If you see 3.4 or above, you are good to go!

Step Three: Install chromedriver

Chromedriver allows other programs to control Chrome. Homebrew will install chromedriver for you using the following command:

brew install chromedriver

Run the following command to make sure chromedriver is installed. If you get an error, try the above steps again or search the web for advice.

chromedriver --version

Step Four: Install Selenium for Python3

Selenium is the glue between Python3 and web browsers. Install it with the following command:

sudo pip3 install selenium

You are now ready to install webxray.

Installation and First Run

The basic installation uses Chrome as the browser in 'headless' mode, meaning you will not see the browser open the pages you are analyzing. Data is stored in a SQLite database and you do not need to install a database server; databases are created in the directory './webxray/resources/db/sqlite/'.

Once your dependencies are installed you can download webxray from GitHub or you can clone the GitHub repository using the following command:

git clone https://github.com/timlib/webxray.git

Now webxray is ready to go! To use it enter the following commands:

cd webxray

python3 run_webxray.py

This is the interactive mode, which will guide you through scanning a list of sample websites.

Important Note: If you are running webxray as the 'root' user in Linux it may not run properly due to limitations in Chrome. If webxray stalls or crashes after the 'Building List of Pages' message, run webxray as a non-root user.
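
After a first run, the SQLite databases mentioned above can be opened with Python's built-in sqlite3 module. The following is a minimal sketch that lists the tables in a database file so you can see what was stored. The function name list_tables is an assumption for illustration, and no table names are assumed here because the schema is not described in this document. Pass the path of a database that webxray has created in './webxray/resources/db/sqlite/' as a command-line argument.

```python
import sqlite3
import sys


def list_tables(db_path):
    """Return the names of all tables in the SQLite database at db_path."""
    conn = sqlite3.connect(db_path)
    try:
        # sqlite_master is SQLite's built-in catalog of schema objects.
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [row[0] for row in rows]


if __name__ == "__main__":
    # Each command-line argument is treated as the path to a database file.
    for db_path in sys.argv[1:]:
        print(db_path)
        for table in list_tables(db_path):
            print("  " + table)
```

Note that sqlite3.connect creates an empty database file if the path does not already exist, so double-check the path before running the script.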

Using webxray to Analyze Your Own List of Pages

The raison d'être of webxray is to allow you to analyze pages of your choosing. To do so, first place all of the page addresses you wish to scan into a text file and place this file in the "page_lists" directory. Make sure your addresses start with "http://" or "https://". If they do not, webxray will not recognize them as valid addresses. Once you have placed your page list in the proper directory, you may run webxray and it will allow you to select your page list.
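
Before starting a long scan, it can help to check a page list for addresses that would be skipped. The following is a minimal sketch of the prefix rule described above. The function name split_page_list is an assumption for illustration, and webxray's own validation may be stricter than this check.

```python
def split_page_list(lines):
    """Split lines into (valid, rejected) lists of addresses.

    An address counts as valid here only if it starts with "http://" or
    "https://". Blank lines are ignored.
    """
    valid, rejected = [], []
    for line in lines:
        address = line.strip()
        if not address:
            continue  # skip blank lines
        if address.startswith(("http://", "https://")):
            valid.append(address)
        else:
            rejected.append(address)
    return valid, rejected


if __name__ == "__main__":
    sample = ["https://example.com", "example.org", "", "http://example.net/page"]
    valid, rejected = split_page_list(sample)
    print("valid:", valid)
    print("rejected:", rejected)
```

To check a real list, open your text file and pass the file object to split_page_list, then add the missing "http://" or "https://" prefix to any rejected addresses.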

Viewing and Understanding Reports

Once you have completed your data collection, use the interactive mode to guide you through generating an analysis. When the analysis is complete, it is written to the '/reports' directory, which will contain a number of CSV files.
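
The CSV files can be opened in any spreadsheet program, or summarized with a few lines of Python. The following is a minimal sketch that prints the column names and row count of every CSV file in a directory. It assumes that each file is UTF-8 encoded and that its first row holds column names. The function name summarize_reports is hypothetical, and no specific report file names are assumed.

```python
import csv
import glob
import os
import sys


def summarize_reports(report_dir):
    """Return {file name: (header row, number of data rows)} for each CSV."""
    summary = {}
    for path in sorted(glob.glob(os.path.join(report_dir, "*.csv"))):
        with open(path, newline="", encoding="utf-8") as handle:
            rows = list(csv.reader(handle))
        # Assumption: the first row of each report holds the column names.
        header = rows[0] if rows else []
        summary[os.path.basename(path)] = (header, max(len(rows) - 1, 0))
    return summary


if __name__ == "__main__":
    # Each command-line argument is treated as a directory of CSV reports.
    for report_dir in sys.argv[1:]:
        for name, (header, count) in summarize_reports(report_dir).items():
            print("{}: {} rows, columns: {}".format(name, count, ", ".join(header)))
```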

Advanced Options

The following sections describe several advanced functions. Unlike the directions above, they assume you are capable of doing light editing of Python3 code.

Analyze a Single Page

Sometimes you just want to run a single quick scan. To do so, use the command below. Be sure to replace "http://example.com" with the address of the site you want to scan.

python3 run_webxray.py -s http://example.com

Run Many Browsers in Parallel to Increase Speed

By default, webxray runs only a single browser at a time. Since webxray waits 45 seconds for each page to load, scanning 1,000 pages takes over 12 hours (1,000 x 45 seconds = 12.5 hours). However, most systems can handle running many more browsers at a time, resulting in significant speed gains.

To use one browser per available processor core, open run_webxray.py and change "pool_size = 1" to "pool_size = None". This is the most straightforward way to increase speed. Since webxray spends most of its time waiting for pages to load, you may also experiment with setting pool_size above the number of available cores.
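
To get a rough idea of how pool_size affects total scan time, the arithmetic can be sketched as follows. This is an estimate under simplifying assumptions, not a measurement: it assumes each browser handles one page at a time and spends exactly browser_wait seconds on it, so real scans will take longer. The function name estimated_scan_hours is hypothetical and is not part of webxray.

```python
import math
import os


def estimated_scan_hours(page_count, browser_wait=45, pool_size=1):
    """Return a lower-bound estimate of scan time in hours.

    pool_size=None is treated as one browser per processor core,
    matching the meaning of "pool_size = None" described above.
    """
    if pool_size is None:
        pool_size = os.cpu_count() or 1
    # Pages are processed in rounds of pool_size pages at a time.
    rounds = math.ceil(page_count / pool_size)
    return rounds * browser_wait / 3600


if __name__ == "__main__":
    for size in (1, 4, None):
        hours = estimated_scan_hours(1000, pool_size=size)
        print("pool_size={}: about {:.1f} hours".format(size, hours))
```

With one browser, 1,000 pages at 45 seconds each works out to 12.5 hours. With four browsers the same list needs 250 rounds, or a little over 3 hours.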

It is possible to do some performance tuning to determine how many browsers you can run before you get crashes. If you desire to scan more than 100,000 pages, performance tuning is highly advised.

Change How Long the Browser Waits After Loading a Page

In order to get all the third-party elements possible, webxray waits for 45 seconds after loading a page. You can make this longer or shorter by changing the line "browser_wait = 45" in run_webxray.py. Note that 45 seconds works very well, and due to specifics of Chrome, setting it to fewer than 45 seconds may result in lost cookies.

Run Chrome in Windowed Mode

By default, Chrome is run without a window opening on your computer, which uses fewer resources, is less annoying, and is required if your computer doesn't have a monitor. If you do want to see pages loading, you can enable 'windowed' mode by opening the file './webxray/ChromeDriver.py', finding the line "self.headless = True", and changing it to "self.headless = False".

Getting Help

If you are having problems installing the software or find bugs, please open an issue on GitHub. If you have advanced needs and require assistance, or if you are interested in commissioning custom-written reports rather than running the software yourself, please email Timothy Libert: contact@webxray.eu.