webXray [v 2.2]
About
A new version of webXray will be released in 2021. Version 2.2 is no longer maintained, and this page is kept for reference purposes because the code is not currently available.
webXray is a tool for analyzing third-party content on webpages and identifying the companies which collect user data. A command line user interface makes webXray easy to use for non-programmers, and those with advanced needs may analyze millions of pages with proper configuration. webXray is a professional tool designed for academic research, and may be used by privacy compliance officers, regulators, and those who are generally curious about hidden data flows on the web.
webXray uses a custom library of domain ownership to chart the flow of data from a given third-party domain to a corporate owner, and if applicable, to parent companies. Tracking attribution reports produced by webXray provide robust granularity. Reports of the average numbers of third-parties and cookies per-site, most commonly occurring third-party domains and elements, volumes of data transferred, use of SSL encryption, and more are provided out-of-the-box. A flexible data schema allows for the generation of custom reports as well as authoring extensions to add additional data sources.
The public version of webXray uses Chrome to load pages, stores data in a SQLite database, and can be used on a normal desktop computer. There is also a proprietary forensic version of webXray designed to meet the demands of academic research and litigation. If you have academic needs, please contact Tim Libert; if you have litigation needs, please contact us at the webXray company website.
The Dependencies section below gives detailed instructions on how to install the software needed to run webXray. The Installation and First Run section provides guidance on getting started. Additional sections provide instructions on Using webXray to Analyze Your Own List of Pages, Viewing and Understanding Reports, as well as Advanced Options, and Getting Help.
Dependencies
webXray depends on several pieces of software being installed on your computer in advance. If you are familiar with installing dependencies on your own, you may install what is listed below and skip to Installation and First Run. If you are not familiar with dependencies, follow the detailed instructions for Ubuntu and macOS below. Note that webXray can be run on Windows, but detailed instructions are not currently available.
The dependencies for a standard webXray install are as follows:
Python Version 3.4+ | https://www.python.org |
Google Chrome Version 64+ | https://www.google.com/chrome/ |
Chromedriver | https://sites.google.com/a/chromium.org/chromedriver/ |
Selenium | https://pypi.python.org/pypi/selenium |
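If you installed the dependencies yourself, a quick check along the following lines may help confirm that everything is in place. This is only a sketch and is not part of webXray: the executable names in CHROME_NAMES are assumptions, and on macOS Chrome is normally installed as an application rather than a command on your PATH, so it may be reported as missing even when it is present. The script also does not check the Chrome version number.

```python
import importlib.util
import shutil
import sys

# Assumed executable names; Chrome may be installed under a different name on your system.
CHROME_NAMES = ("google-chrome", "google-chrome-stable", "chrome")


def check_dependencies():
    """Return a dict mapping each dependency to True (found) or False (missing)."""
    return {
        "python>=3.4": sys.version_info >= (3, 4),
        "chrome": any(shutil.which(name) for name in CHROME_NAMES),
        "chromedriver": shutil.which("chromedriver") is not None,
        "selenium": importlib.util.find_spec("selenium") is not None,
    }


# Print a short report, one dependency per line.
for name, found in check_dependencies().items():
    print("{:<14} {}".format(name, "ok" if found else "MISSING"))
```

Anything reported as MISSING can be installed by following the OS-specific directions.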
OS Specific Directions
Installing on Ubuntu
Step One: Install Google Chrome (if you already have Chrome go to Step Two)
If you are using Ubuntu desktop, download Chrome here: https://www.google.com/chrome/
If you are on Ubuntu server, run the following commands:
wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
sudo dpkg -i google-chrome-stable_current_amd64.deb
You will likely get errors at this point. If so, run the following:
sudo apt -f install
Run the following command to make sure Chrome is installed. If you get an error, try the above steps again or search the web for advice.
google-chrome --version
Step Two: Install chromedriver
Chromedriver allows other programs to control Chrome. You must download chromedriver from Google:
- In a browser go to: https://sites.google.com/a/chromium.org/chromedriver/downloads
- Find the version of chromedriver which corresponds to the version of Chrome displayed in the previous step.
- Download 'chromedriver_linux64.zip'
If you are on a server you can find the correct address using the above steps on another computer and use wget to get the file to your server.
Once you have downloaded chromedriver, install it with the following commands:
unzip chromedriver_linux64.zip
sudo mv chromedriver /usr/bin/
Run the following command to make sure chromedriver is installed. If you get an error, try the above steps again or search the web for advice.
chromedriver --version
Step Three: Install pip3
While Ubuntu has Python3 included by default it does not include the Python3 package manager pip3, so you will need to install it using this command:
sudo apt install python3-pip
Run the following command to make sure pip3 is installed. If you get an error, try the above steps again or search the web for advice.
pip3 --version
Step Four: Install Selenium for Python3
Selenium is the glue between Python3 and web browsers. Install it with the following command:
sudo pip3 install selenium
You are now ready to install webXray.
macOS Specific Directions
macOS is a UNIX-based system and setting up webXray is relatively straightforward.
Step One: Install Homebrew
Homebrew is a command-line tool which helps you install and manage various other command-line tools. To install Homebrew, go to the following site and follow the instructions (note that it may take some time to download and install): https://brew.sh.
By default, Homebrew sends information to Google Analytics. You can disable that by entering the following command in the terminal (which you should have open after installing Homebrew):
brew analytics off
Step Two: Install Python3
Python3 is needed to run webXray. Enter the following command to install it:
brew install python3
To make sure you have the right version of Python installed run the following command:
python3 --version
...if you see 3.4 or above you are good to go!
Step Three: Install chromedriver
Chromedriver allows other programs to control Chrome. Homebrew will install chromedriver for you using the following command:
brew install chromedriver
Run the following command to make sure chromedriver is installed. If you get an error, try the above steps again or search the web for advice.
chromedriver --version
Step Four: Install Selenium for Python3
Selenium is the glue between Python3 and web browsers. Install it with the following command:
sudo pip3 install selenium
You are now ready to install webXray.
Installation and First Run
The basic installation uses Chrome as the browser in 'headless' mode, meaning you will not see the browser open the pages you are analyzing. Data is stored in a SQLite database and you do not need to install a database server; databases are created in the directory './webXray/resources/db/sqlite/'.
Once your dependencies are installed you can download webXray from GitHub or you can clone the GitHub repository using the following command:
git clone https://github.com/timlib/webXray.git
Now webXray is ready to go! To use it enter the following commands:
cd webXray
python3 run_webXray.py
This starts the interactive mode, which will guide you through scanning a list of sample websites.
Important Note: If you are running webXray as the 'root' user in Linux it may not run properly due to limitations in Chrome. If webXray stalls or crashes after the 'Building List of Pages' message, run webXray as a non-root user.
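Because the data ends up in an ordinary SQLite file, you can look inside it with Python's built-in sqlite3 module once a scan has finished. The sketch below makes no assumptions about webXray's table names; it simply asks SQLite which tables exist and how many rows each one holds. The helper names list_tables and count_rows, and the database file name in the usage comment, are hypothetical.

```python
import sqlite3


def list_tables(conn):
    """Return the names of all tables in an open SQLite connection, sorted."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [row[0] for row in rows]


def count_rows(conn, table):
    """Return the number of rows in one table (name taken from list_tables)."""
    return conn.execute('SELECT COUNT(*) FROM "{}"'.format(table)).fetchone()[0]


# Example usage (hypothetical file name; use the database created by your own scan):
# conn = sqlite3.connect("./webXray/resources/db/sqlite/my_scan.db")
# for table in list_tables(conn):
#     print(table, count_rows(conn, table))
```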
Using webXray to Analyze Your Own List of Pages
The raison d'être of webXray is to allow you to analyze pages of your choosing. To do so, first place all of the page addresses you wish to scan into a text file and place this file in the "page_lists" directory. Make sure your addresses start with "http://" or "https://"; otherwise, webXray will not recognize them as valid addresses. Once you have placed your page list in the proper directory, you may run webXray and it will allow you to select your page list.
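Before starting a long scan, it can be worth checking your page list for addresses that are missing the required prefix. The following is a minimal sketch, not webXray's own validation (which may be stricter): it only tests for the "http://" or "https://" prefix described above. The function name split_page_list and the file name in the usage comment are hypothetical.

```python
def split_page_list(lines):
    """Split lines into (valid, rejected) lists based on the required URL prefix."""
    valid, rejected = [], []
    for line in lines:
        address = line.strip()
        if not address:
            continue  # ignore blank lines
        if address.startswith(("http://", "https://")):
            valid.append(address)
        else:
            rejected.append(address)
    return valid, rejected


# Example usage (hypothetical file name):
# with open("page_lists/my_pages.txt") as handle:
#     valid, rejected = split_page_list(handle)
# print(len(valid), "usable addresses;", len(rejected), "need fixing:", rejected)
```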
Viewing and Understanding Reports
Once you have completed your data collection, use the interactive mode to guide you through generating an analysis. When the analysis is complete, the results are written to the '/reports' directory as a number of CSV files:
- db_summary.csv: a basic report of what is in the database and how many pages loaded
- stats.csv: provides top-level stats on how many domains are contacted, cookies, javascript, etc.
- aggregated_tracking_attribution.csv: details on percentages of sites tracked by different companies and their subsidiaries
- 3p_domain.csv: most frequently occurring third-party domains
- 3p_element.csv: most frequently occurring third-party elements of all types
- 3p_image.csv: most frequently occurring third-party images
- 3p_javascript.csv: most frequently occurring third-party javascript
- 3p_ssl_use.csv: rates at which detected third-parties encrypt requests
- data_xfer_summary.csv: volume and percentage of data received from first- and third-party domains
- data_xfer_aggregated.csv: volume and percentage of data received from various companies
- data_xfer_by_domain.csv: volume and percentage of data received from specific third-party domains
- network: pairings between page domains and third-party domains, which you can import into network visualization software
- per_page_data_flow.csv: one giant file that lists the requests made for each page, off by default
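The reports are plain CSV files, so they open in any spreadsheet program or can be read with Python's built-in csv module. The sketch below assumes only that each file has a header row; the column names are not documented here, so the code prints whatever it finds. The function name preview_csv and the path in the usage comment are hypothetical.

```python
import csv
import itertools


def preview_csv(handle, max_rows=5):
    """Return (header, rows) holding the header and the first few data rows."""
    reader = csv.reader(handle)
    header = next(reader, [])  # empty list if the file has no rows at all
    rows = list(itertools.islice(reader, max_rows))
    return header, rows


# Example usage (hypothetical path; point it at a file in your own reports directory):
# with open("reports/my_scan/3p_domain.csv", newline="") as handle:
#     header, rows = preview_csv(handle)
# print(header)
# for row in rows:
#     print(row)
```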
Advanced Options
The following are details on how to leverage the power of many advanced functions, and unlike the above, these directions assume you are capable of doing light editing of Python3 code.
Analyze a Single Page
Sometimes you just want to run a single quick scan. To do so, use the command below. Be sure to replace "http://example.com" with the address of the site you want to scan.
python3 run_webXray.py -s http://example.com
Run Many Browsers in Parallel to Increase Speed
By default, webXray runs only a single browser at a time. Because webXray waits 45 seconds for each page to load, scanning 1,000 pages takes at least 12.5 hours (1,000 × 45 seconds = 45,000 seconds). However, most systems can handle running many more browsers at a time, resulting in significant speed gains.
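To get a rough feel for how long a scan will take, the arithmetic can be sketched as follows. This is a lower bound under simplifying assumptions: it counts only the fixed wait per page, ignores start-up and database time, and assumes pages are shared evenly between browsers. The function name estimate_scan_hours is hypothetical.

```python
import math


def estimate_scan_hours(pages, wait_seconds=45, browsers=1):
    """Lower-bound scan time in hours: each browser works through its share in sequence."""
    pages_per_browser = math.ceil(pages / browsers)
    return pages_per_browser * wait_seconds / 3600


print(estimate_scan_hours(1000))              # 12.5
print(estimate_scan_hours(1000, browsers=8))  # 1.5625
```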
To use one browser per available processor core, open run_webXray.py and change "pool_size = 1" to "pool_size = None" - this is the most straightforward way to increase speed. Since webXray spends most of its time waiting for pages to load, you may also experiment with setting pool_size above the number of available cores.
It is possible to do some performance tuning to determine how many browsers you can run before you get crashes. If you desire to scan more than 100,000 pages, performance tuning is highly advised.
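The meaning of "None" here matches how Python's standard worker pools behave: when no size is given, the pool creates one worker per CPU core. The sketch below illustrates that convention. It assumes, without confirmation from the webXray source, that pool_size is passed to a pool of this kind; fake_scan is a hypothetical stand-in for loading a page, and a thread pool is used only to keep the example small.

```python
import os
from multiprocessing.pool import ThreadPool


def resolve_pool_size(pool_size):
    """Mirror the standard-library convention: None means one worker per CPU core."""
    if pool_size is None:
        return os.cpu_count() or 1
    return pool_size


def fake_scan(url):
    """Hypothetical stand-in for loading one page in a browser."""
    return (url, "ok")


def scan_all(urls, pool_size=None):
    """Run fake_scan over all urls with the requested number of workers."""
    with ThreadPool(processes=resolve_pool_size(pool_size)) as pool:
        return pool.map(fake_scan, urls)  # results keep the order of urls


print("workers when pool_size is None:", resolve_pool_size(None))
```

When tuning, raising the pool size in small steps and watching memory use is a reasonable way to find the point at which browsers start to crash.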
Change How Long the Browser Waits After Loading a Page
In order to get all the third-party elements possible, webXray waits for 45 seconds after loading a page. You can make this longer or shorter by changing the line "browser_wait = 45" in run_webXray.py. Note that 45 seconds works very well, and due to specifics of Chrome, setting it to fewer than 45 seconds may result in lost cookies.
Run Chrome in Windowed Mode
By default, Chrome is run without a window opening on your computer, which uses fewer resources, is less annoying, and is required if your computer doesn't have a monitor. If you do want to see pages loading, you can enable 'windowed' mode by opening the file './webXray/ChromeDriver.py', finding the line "self.headless = True", and changing it to "self.headless = False".
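For readers curious about what a headless switch typically does, the following sketch shows one common way to start Chrome with or without a window through Selenium. It is not the contents of ChromeDriver.py: the function names are hypothetical, and it assumes a Selenium release that accepts the "options" keyword and a chromedriver on your PATH. Selenium is imported inside make_driver so the rest of the example runs without it.

```python
def chrome_arguments(headless=True):
    """Return the Chrome command-line flags for the chosen display mode."""
    return ["--headless"] if headless else []


def make_driver(headless=True):
    """Start Chrome through Selenium; requires selenium and chromedriver to be installed."""
    from selenium import webdriver  # imported lazily so the helper above works alone

    options = webdriver.ChromeOptions()
    for argument in chrome_arguments(headless):
        options.add_argument(argument)
    return webdriver.Chrome(options=options)


# Example usage (opens a visible browser window, then closes it):
# driver = make_driver(headless=False)
# driver.get("http://example.com")
# driver.quit()
```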
Getting Help
If you are having problems installing the software or find bugs, please open an issue on GitHub. If you have advanced needs and require assistance, or if you are interested in commissioning custom written reports rather than running the software yourself, please email Timothy Libert: timlibert@cmu.edu.