Tracking Users on the World Wide Web HENRIK WRAMNER Master of Science Thesis Stockholm, Sweden 2011


Master’s Thesis in Computer Science (30 ECTS credits)
at the School of Computer Science and Engineering
Royal Institute of Technology, year 2011
Supervisor at CSC was Olof Hagsand
Examiner was Stefan Arnborg

TRITA-CSC-E 2011:041
ISRN-KTH/CSC/E--11/041--SE
ISSN-1653-5715

Royal Institute of Technology
School of Computer Science and Communication
KTH CSC
SE-100 44 Stockholm, Sweden
URL: www.kth.se/csc


Abstract

The currently dominant business model of the web is to finance website content with advertisements. Advertisers usually pay only if an ad view or ad click results in a subsequent purchase by the web site visitor. In order to detect which web sites led the visitor to a final purchase, it is necessary to track web user actions. The currently used methods of tracking web users are identified as HTTP cookies and a basic form of browser fingerprinting. Both methods are shown to be on a downward trend, becoming less able to track web user actions.

In this degree project several other potential methods of web user tracking are developed. Some of the tracking methods are also implemented as prototypes and subsequently evaluated. The evaluation shows that all implemented methods are able to successfully track some subset of all web user actions. The conclusion is that tracking web users by using the web browser cache is the best tracking method found in this degree project. The reasons are that this method can track a large share of user actions and that it causes only minor customer effects and minor server side effects for the company performing the tracking.


Referat

Tracking of users on the web

The dominant business model on the web is that web pages are financed by advertising. Usually the web site owners are paid by the advertiser only if an ad view or an ad click later leads to a completed purchase. To be able to determine which web page is responsible for a completed purchase, the actions of web users must be tracked. The current standard methods of tracking on the web are browser fingerprinting and the use of HTTP cookies. These two tracking methods are shown to be increasingly unreliable.

In this degree project several alternative tracking methods are developed. Some of these are also implemented as prototypes and evaluated. The evaluation shows that all implemented prototypes can be used to track some subset of all web user actions. The conclusion is that using the web browser cache is the best tracking method developed during the degree project. The reasons are that a large share of web user actions can be tracked, that it causes only minor effects for the customers of the tracking company, and that it gives rise to only minor effects on the servers of the tracking company.


Preface

This degree project was requested and supported by TradeDoubler AB in Stockholm, Sweden. Most of the work was done on the premises of TradeDoubler AB at Sveavägen, Stockholm.

During this degree project I received help and support from several people. I would like to express my gratitude to everyone involved:

• Mikael Löthman, my supervisor at TradeDoubler, for excellent support, encouragement and feedback during the entire degree project.

• Olof Hagsand, my supervisor at KTH, for giving valuable advice during the degree project and also giving superior feedback on the report.

• All of TradeDoubler, especially the Hercules group, for providing great knowledge and feedback and also for providing a pleasant working environment during this degree project.

• My family and friends for supporting and encouraging me.


Contents

1 Introduction
    1.1 The problem background
        1.1.1 The ad view
        1.1.2 The ad click
        1.1.3 The closure
    1.2 The problem statement
    1.3 Goals of this degree project
    1.4 Methodology
    1.5 Summary

2 Background
    2.1 Client side code on the web
    2.2 The hypertext transfer protocol
        2.2.1 HTTP User-Agent
        2.2.2 HTTP cookie
    2.3 Web user tracking methods
    2.4 The tracking methods of TradeDoubler
        2.4.1 Cookies on the TradeDoubler domain
        2.4.2 Cookies on the merchant domain
        2.4.3 Browser fingerprinting
    2.5 Why the current tracking methods of TradeDoubler are insufficient
        2.5.1 Problems with cookies
        2.5.2 Problems with browser fingerprinting
    2.6 Summary

3 New potential approaches of tracking web user actions
    3.1 Overview
    3.2 CSS history leakage
        3.2.1 How to use the browser history for tracking
    3.3 The web browser cache
        3.3.1 How to use the web browser cache for tracking
    3.4 Client side storage capabilities
        3.4.1 Rich Internet Applications
        3.4.2 HTML 5
        3.4.3 Javascript and the window.name property
        3.4.4 How to use client side storage capabilities for tracking
    3.5 Extended browser fingerprinting
        3.5.1 How to use extended browser fingerprinting for tracking
    3.6 HTTP basic authentication
        3.6.1 How to use HTTP basic authentication for tracking
    3.7 Summary

4 Implementation
    4.1 Selection of methods to implement
    4.2 CSS history leakage
        4.2.1 Implementation
    4.3 The web browser cache
        4.3.1 Implementation
    4.4 Client side storage capabilities
        4.4.1 Implementation
    4.5 Methods not implemented
        4.5.1 Extended browser fingerprinting
        4.5.2 HTTP basic authentication
    4.6 Summary

5 Evaluation Criteria
    5.1 Evaluation criteria
        5.1.1 How large share of user actions can be tracked?
        5.1.2 What are the customer effects?
        5.1.3 What are the server side effects?
    5.2 Summary

6 Result
    6.1 CSS history leakage
        6.1.1 How large share of user actions can be tracked?
        6.1.2 What are the customer effects?
        6.1.3 What are the server side effects?
        6.1.4 Summary of CSS history leakage evaluation result
    6.2 The web browser cache
        6.2.1 How large share of user actions can be tracked?
        6.2.2 What are the customer effects?
        6.2.3 What are the server side effects?
        6.2.4 Summary of the web browser cache evaluation result
    6.3 Client side storage capabilities
        6.3.1 How large share of user actions can be tracked?
        6.3.2 What are the customer effects?
        6.3.3 What are the server side effects?
        6.3.4 Summary of client side storage evaluation result

7 Conclusions
    7.1 Conclusions
    7.2 Recommendations
    7.3 Future work
    7.4 Summary

Bibliography


Chapter 1

Introduction

This chapter gives the reader an introduction to online advertising, what web tracking is, why web tracking is needed and what the purpose of this degree project is. The problem statement and the goals of this degree project are defined, and the scientific methodology used is described.

1.1 The problem background

Many web sites on the Internet are provided free of charge to their visitors. Even large and content-rich sites with obviously high operating costs are provided for free. How does the business model behind this work? The answer is that many of these sites are ad-financed. An advertiser pays the site owner in exchange for a visible ad on the site. This ad might for example be a graphical banner link or a text link shown with the page content. The site owners displaying the ads are called affiliates and the advertisers are called merchants.

The ad is usually shown to persuade the web site audience to take some sort of action. The most common type of intended action is a purchase. Some ads might have a different purpose, such as to persuade the visitor to register on another web site or fill in a car loan application. These resulting actions are called closures. There are also some ad campaigns where the merchant wants to build its brand awareness; thus no further action apart from the ad view by the visitor is necessarily required.

There are different models for how these payments are generated. These models depend on the intended user action resulting from the ad view. Some merchants pay an amount of money per view of the ad, others pay an amount of money per click on the ad. However, the most common model is that the merchant pays only for an ad click or view resulting in a subsequent closure.

In order for the business model to work there needs to be some tracking of the actions of visitors, all the way from the ad view through the ad click and on to the final closure. The time between these different actions could potentially be several weeks. If a closure cannot be connected to a previous action the affiliates are not paid.

Most affiliates do not develop their own tracking system and do not deal directly with merchants. Instead a middleman company, called an Internet marketing company, provides the tracking technology and also the customer base to partner merchants with affiliates. TradeDoubler AB is such a marketing company.

There is no way to track 100 % of all the interesting visitor actions over time. There is a huge number of different web browsers and versions, each with different characteristics and capabilities. Also there is no easy way to know exactly how large a share of visitors is not tracked by the current tracking technologies. What is certain is that the more users are tracked, the larger the profit and perhaps also the fairer its distribution.

The current tracking technology of TradeDoubler AB is primarily based on cookies and a basic form of browser fingerprinting, both of which seem to be industry standard methods of web tracking [1, 2, 3, 4]. These tracking methods are explained in detail in Chapter 2.

Some of the competitors of TradeDoubler AB are using other, perhaps more sophisticated tracking methods with names such as Flash cookie tracking [5] and HTTP ETag tracking [6]. Both of these methods are explored in Chapter 3. Also, different user settings in new web browsers seem to make the current standard tracking methods less effective or even useless. In order to increase its competitive advantage and profit, TradeDoubler wants to explore and evaluate different ways of tracking web users.

In order to state the problem of this degree project, the different visitor actions necessary for TradeDoubler to track must be defined.

1.1.1 The ad view

An ad is usually in the form of a graphical banner. However, text links are also available for some merchant campaigns. If the merchant pays the affiliates per view it is not necessary to track the clients at all. The marketing company just needs to keep count of how many times and where this ad has been displayed. Note that there is no way to confidently know if the visitor actually sees the ad displayed; instead the count is of how many times the ad has been sent to any web browser.

There are some merchants willing to pay for a closure if a previous view is detected. This is rare and the payment is low in comparison to a confirmed click leading to a closure. TradeDoubler does support this feature and thus ad views must be traceable.

1.1.2 The ad click

In addition to tracking ad views, TradeDoubler must also be able to track ad clicks. An ad click is when a visitor to the affiliate site clicks on any form of ad and is transferred to the merchant landing page.

1.1.3 The closure

Whenever a visitor to a merchant makes a closure, TradeDoubler tracks the closure and detects whether it matches previous view or click data of the visitor connected to that merchant.

1.2 The problem statement

The problem statement of this degree project is: What are possible methods to track web user actions and how do they compare?

1.3 Goals of this degree project

• Theory: To find and develop different methods for web tracking.

• Implementation: To implement prototypes of the most interesting methods.

• Evaluation: To evaluate how the methods compare to each other.

1.4 Methodology

The methodology used in this Master’s degree project was that of experimental computer science. The work was divided into three steps, each step closely corresponding to one of the specified goals of this degree project.

The first step consisted of information gathering to find as many different tracking methods as possible. The author got a thorough presentation of the current TradeDoubler tracking system and then proceeded with reading literature and academic reports on the subject. The author then analyzed how other companies performed their tracking. This was done with the help of information on their sites and also by performing a code analysis on the client side and an inspection of the web traffic when performing a click and a closure action.

The next step was the implementation of prototypes. To be able to decide which methods to implement, an evaluation scheme was defined. The possible tracking methods were then pre-evaluated to decide which were suitable candidates for final prototyping. These chosen tracking methods were then implemented with a focus on achieving a good result in the subsequent evaluation.

The last step was the evaluation of the implemented prototypes according to the defined evaluation scheme. From the evaluation the conclusions and recommendations of this degree project were made.

1.5 Ethical considerations

Tracking users on the world wide web is sometimes, depending on the context, considered unethical. This degree project report will not discuss the ethical issues surrounding web tracking, but will instead focus only on the technological aspects.

1.6 Summary

This chapter explained the importance of tracking web user actions in order for the ad-financed business model of the web to work. The problem statement was introduced by explaining the three interesting user actions to track (the ad view, the ad click and the closure), and also by explaining the three different actors where the tracking must be implemented (the affiliate, the merchant and the Internet marketing company). Finally the methodology was given as a three step process: finding methods of tracking, implementing the interesting tracking methods and lastly evaluating the implemented tracking methods according to the evaluation criteria.


Chapter 2

Background

In this chapter the world wide web (www, web) technologies, terms and other concepts used in this thesis are introduced. The current standard ways of tracking web users are also explained.

2.1 Client side code on the web

A web page usually contains both content to be displayed and some interactive functionality. In order for the client web browser to display the content and enable the functionality, some kind of client side interpreted code is necessary. The basis for all web site coding is the HyperText Markup Language (HTML). HTML is in itself not a programming language but is instead used for presentation. HTML gives the ability to embed text, graphics, forms, links and other objects on a single web page. HTML can also embed scripts in different languages. These scripts are small applications that are run in the web browser on the client side.

All the embedded objects and scripts can be hosted on a different web server than the actual HTML document. HTML even allows one HTML page to embed another HTML page in a so-called iframe. When the embedded objects are located on a different web server they are known as third party content. Third party content is often used when performing web user tracking.

For a complete technical specification of the current HTML standard, please refer to the W3C HTML 4.01 specification [9].

2.2 The hypertext transfer protocol

In order to understand the most common forms of web tracking, one must understand the basics of how web browsers and web servers communicate over the web and how web browsers interpret the downloaded data.

The hypertext transfer protocol (HTTP) is a communication protocol originally intended for transferring HTML web pages. Currently most data communication over the web is made via HTTP; thus HTTP could be said to be the foundation of the web. HTTP is specified as a request-response protocol in a client-server computing model. This means that clients, which are usually web browsers, submit HTTP request messages to a server. For each incoming HTTP request the server then issues an HTTP response, hopefully containing the requested resource. An example of an HTTP request-response exchange is given in Figure 2.1, where the browser user visits http://www.google.se:

[Figure: message sequence between Client and Server. The client sends HTTP GET /, the server answers HTTP 200 OK, and the client then sends further HTTP GET requests for embedded resources such as /intl/en_com/images/srpr/logo1w.png.]

Figure 2.1. The HTTP communication resulting from a web client visiting http://www.google.se.

• Initially the web browser sends an HTTP request to www.google.se requesting the root resource (usually called index.html).

• The Google web server sends an HTTP response containing a 200 OK status, indicating that the incoming request has succeeded, with an HTML document as the HTTP response message body.

• The HTML document is rendered by the web browser, which in turn finds numerous embedded resources and thus issues a further HTTP request for each of these.

• These HTTP requests should subsequently lead to further 200 OK responses containing the requested resources.

There are numerous features in the current HTTP; for a complete specification please refer to RFC 2616 [10]. Some of the HTTP features, namely the HTTP User-Agent and the HTTP cookie, are of great importance to the current standard tracking methods, and are thus explained in detail.
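The request-response exchange described above can be sketched in a few lines of Python. This is only an illustration, not part of the thesis: the function names `build_get_request` and `parse_response` are hypothetical, and the sample response is written by hand rather than captured from a real server.

```python
def build_get_request(host: str, path: str = "/") -> str:
    """Return the text of a minimal HTTP/1.1 GET request."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "Connection: close",
        "",  # an empty line ends the header section
        "",
    ]
    return "\r\n".join(lines)


def parse_response(raw: str):
    """Split a raw HTTP response into status code, headers and body."""
    head, _, body = raw.partition("\r\n\r\n")
    status_line, *header_lines = head.split("\r\n")
    status_code = int(status_line.split(" ", 2)[1])
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status_code, headers, body


request = build_get_request("www.google.se")

# Hand-written response, for illustration only.
sample_response = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html\r\n"
    "\r\n"
    '<html><img src="/logo.png"></html>'
)
status, headers, body = parse_response(sample_response)
```

A browser would continue by scanning `body` for embedded resources, such as the `img` element here, and issue one further request per resource.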


2.2.1 HTTP User-Agent

One of the standard header fields in an HTTP request is the User-Agent. The purpose of the User-Agent field is to include information about the originating web browser, such as browser vendor and browser version. The User-Agent information can be used on the server side for many reasons. One example is for statistical purposes; another is to dynamically generate different client side code depending on the known capabilities of the browser issuing the HTTP request.

2.2.2 HTTP cookie

Fundamentally HTTP is a stateless protocol, meaning that the protocol itself has no way of remembering previous interactions between the server and client. When a request and the resulting response are finished the connection between the server and client is closed until the next request-response sequence. However, there are ways to make HTTP act stateful. One way to achieve statefulness is by using HTTP cookies.

A cookie is a piece of text set in the header of a server response. The text contains one or more name-value pairs specified by the server. A web client with support for cookies stores this data in a file for a specified amount of time given by the server. On all subsequent HTTP requests to the same domain, all cookies stored on the computer and given by the requested domain are submitted in the request header [10].

HTTP cookies are usually set by the same domain as the visited web page. However, third party content, as described in Section 2.1, can also set cookies. Cookies set by third party content are known as third party cookies.

The textbook example of cookie usage is a virtual shopping cart on an e-commerce web site. With a stateless protocol there is no way to know what the user has put in the shopping cart. With a cookie the web server has a way to store this state information on the client side. Another common implementation is that the client gets a cookie with a unique identifier (ID) and that the server saves the state for any given ID in a database. Thus the server saves all client state and the client instead saves an ID as a representation of its own state.
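The second variant, where the cookie carries only an ID and the server keeps the state, can be sketched with the Python standard library. This is a minimal sketch under stated assumptions: the cookie name `sid`, the 30 day lifetime and the in-memory dictionary standing in for the database are all chosen for illustration.

```python
import uuid
from http.cookies import SimpleCookie

# Server side state, keyed by the ID stored in the client's cookie.
# A dict stands in for the database mentioned in the text.
carts = {}


def handle_request(cookie_header: str, item: str):
    """Add `item` to the visitor's cart; return (Set-Cookie value, cart)."""
    cookie = SimpleCookie(cookie_header)
    if "sid" in cookie and cookie["sid"].value in carts:
        session_id = cookie["sid"].value  # returning visitor
    else:
        session_id = uuid.uuid4().hex  # first visit: issue a new ID
        carts[session_id] = []
    carts[session_id].append(item)

    reply = SimpleCookie()
    reply["sid"] = session_id
    reply["sid"]["max-age"] = 30 * 24 * 3600  # assumed lifetime: 30 days
    return reply["sid"].OutputString(), carts[session_id]


first_cookie, cart = handle_request("", "book")
# On later requests the browser sends back only the name=value part.
_, cart = handle_request(first_cookie.split(";")[0], "pen")
```

After the second call the cart holds both items, although the client itself has only ever stored the opaque ID.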

2.3 Web user tracking methods

In Chapter 1 the three different visitor actions of interest to track were defined. These were the ad view, the ad click and finally the closure. It is usually the marketing company that performs the tracking of all these actions. Whenever one of these actions takes place the marketing company needs to store what action happened and in some way identify who did it. The reason is that, to be able to trace a final closure back to a previous click or view, the company must be able to identify the actions of the same web browser at different points in time.


A big part of the problem is how to uniquely identify a web browser. There are two ways of achieving this:

Fingerprinting the web browser. The marketing companies are sometimes able to uniquely identify a web user by using a combination of retrievable characteristics from the client web browser, also called a browser fingerprint. The marketing company must then retrieve a fingerprint at the time of an ad click or view and subsequently retrieve a new fingerprint at the time of a closure. If the two fingerprints match, the different actions are connected and the responsible affiliate gets paid.

Tagging the web browser. The marketing companies could also store some unique tag in every web browser performing an ad click or an ad view and subsequently retrieve this tag at the time of a closure. The standard method of doing this is by using HTTP cookies to store the unique tag in the web browser.

2.4 The tracking methods of TradeDoubler

TradeDoubler uses both the method of tagging and the method of fingerprinting web browsers in order to perform tracking of the different types of visitor actions.

2.4.1 Cookies on the TradeDoubler domain

The first method of TradeDoubler is the usage of HTTP cookies. Whenever an HTTP response is issued the server can set cookies which are then stored on the clients. With any subsequent HTTP request from the client to a specific domain, all cookies previously set by that domain are submitted.

TradeDoubler does not have full access to all affiliate and merchant web servers. Instead third party content is embedded into all involved web pages. The third party content is located on the TradeDoubler web server.

Whenever an ad is requested from the TradeDoubler server, a cookie containing data about the ad view is set. This cookie is called an impression cookie and contains data about the ad, the affiliate and the time. Since the ad content is requested from a TradeDoubler server, the cookie is retrievable on all subsequent HTTP requests to the TradeDoubler domain. Because of the huge number of ad views every day TradeDoubler does not keep the impression data in a database. Instead the web browser keeps a list of its previous ad views in the impression cookie.

This HTTP communication is illustrated in Figure 2.2 with the following explanation:

• The web browser makes an HTTP request for the root document.

• The server responds with an HTML document containing an embedded banner ad on the TradeDoubler domain.

• The web browser issues another HTTP request for the embedded banner.

• The TradeDoubler server decides what banner to display and responds with an HTTP redirect to the image location on another domain. With the redirect HTTP response an impression cookie is set.

• The web browser follows the HTTP redirect and issues another HTTP request for the banner image at the specified location.

[Figure: message sequence between Client and Server. The client sends HTTP GET / and receives HTTP 200 OK with HTML containing <img src="http://www.tradedoubler.com/banner.gif">. The client sends HTTP GET tradedoubler.com/banner.gif and receives an HTTP 301 redirect to http://www.somemerchant.com/banner.gif with the header Set-Cookie: imp="details about impression". The client then sends HTTP GET http://www.somemerchant.com/banner.gif.]

Figure 2.2. Illustration of how impressions are stored

Thus ad views can be stored persistently on the client in the impression cookie. How to connect the ad views with closures is illustrated in Figure 2.4.
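The redirect step, where the list of ad views is kept in the cookie rather than in a database, could look roughly like the following sketch. The thesis does not give the cookie layout, so the `ad.affiliate.timestamp` entry format, the `|` separator, the cap of 20 entries and the function name `record_impression` are all assumptions made for illustration; only the cookie name `imp` and the 301 redirect come from Figure 2.2.

```python
from http.cookies import SimpleCookie

MAX_IMPRESSIONS = 20  # assumed cap so the cookie stays small


def record_impression(cookie_header, ad_id, affiliate_id, timestamp, banner_url):
    """Return (status, headers) for an ad request: a redirect plus the updated cookie."""
    cookie = SimpleCookie(cookie_header)
    previous = cookie["imp"].value.split("|") if "imp" in cookie else []
    entries = previous + [f"{ad_id}.{affiliate_id}.{timestamp}"]
    entries = entries[-MAX_IMPRESSIONS:]  # keep only the most recent views

    reply = SimpleCookie()
    reply["imp"] = "|".join(entries)
    headers = [
        ("Location", banner_url),
        ("Set-Cookie", reply["imp"].OutputString()),
    ]
    return "301 Moved Permanently", headers


banner = "http://www.somemerchant.com/banner.gif"
_, first = record_impression("", 17, 42, 1300000000, banner)
# The browser returns the cookie on its next ad request to the same domain.
status, second = record_impression(dict(first)["Set-Cookie"], 18, 42, 1300000100, banner)
stored = SimpleCookie(dict(second)["Set-Cookie"])["imp"].value
```

Here `stored` ends up holding both impressions, which is how the browser rather than the server carries the view history.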

Identifying and storing clicks needs a different approach. All ads displayed on an affiliate page link to a page on the web server of TradeDoubler. The TradeDoubler page performs some actions and then instantly redirects the web browser to the merchant page, as illustrated in Figure 2.3 with the following explanation:

• The web user clicks on a banner ad on the affiliate page. This banner links to a TradeDoubler page.

• An HTTP request is issued to the linked TradeDoubler page. This incoming HTTP request informs TradeDoubler that a click has been made.

• TradeDoubler generates a unique identifier and stores it together with the affiliate ID, the merchant ID and the time of the click action in a click database.

• TradeDoubler then issues an HTTP response instructing the web browser to instantly redirect to the merchant web site. In this HTTP response a cookie is set containing the unique identifier previously generated.

[Figure: the banner ad on the affiliate page leads through the TradeDoubler redirect page, which is connected to a database, to the merchant landing page.]

Figure 2.3. Illustration of how clicks are stored

Thus TradeDoubler has stored the click data in a database and tagged the unique web browser with an ID as a representation of this data. Therefore TradeDoubler is also able to store clicks.
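The click handling steps above can be sketched as follows. This is not TradeDoubler's implementation: the cookie name `click`, the 302 status code, the 30 day lifetime and the dictionary standing in for the click database are assumptions for illustration.

```python
import time
import uuid

click_db = {}  # stands in for the click database


def handle_click(affiliate_id: str, merchant_id: str, landing_url: str):
    """Store the click and return (status, headers) redirecting to the merchant."""
    click_id = uuid.uuid4().hex  # unique identifier for this click
    click_db[click_id] = {
        "affiliate": affiliate_id,
        "merchant": merchant_id,
        "time": time.time(),
    }
    headers = [
        ("Location", landing_url),
        # Tag the browser with the ID; the click data itself stays on the server.
        ("Set-Cookie", f"click={click_id}; Max-Age={30 * 24 * 3600}; Path=/"),
    ]
    return "302 Found", headers


status, headers = handle_click("aff-1", "shop-9", "http://www.somemerchant.com/landing")
```

In contrast to the impression cookie, the cookie here carries only the identifier, so the amount of data stored in the browser does not grow with the details recorded per click.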

As noted in Section 1.1, both the click and the view usually need to be connected to a subsequent closure at the merchant site. The connection is accomplished on a page loaded at the merchant site after a successful closure. This page is known as the confirmation page. In the confirmation page a third party invisible image is embedded. The embedded image is located on the TradeDoubler server as illustrated in Figure 2.4 with the following explanation:

• The web user clicks on the ad and the click events illustrated in Figure 2.3 take place.

• The web user then browses the merchant web site, perhaps returning days later, and buys something.

• When the closure is made the merchant confirmation page is loaded in the web browser. On the confirmation page an invisible image is embedded. The embedded image is located on the TradeDoubler web server and is known as the trackback pixel.

• The HTTP request for the trackback pixel contains all cookies previously set by the TradeDoubler domain, including both the impression cookie and the unique identifier click cookie.

• Thus TradeDoubler is able to connect a closure with both previous ad clicks and ad views.

[Figure: the banner ad on the affiliate page leads through the TradeDoubler redirect page, which is connected to a database, to the merchant landing page. After "Buy now!" the merchant confirmation page is loaded, which embeds the TradeDoubler tracking pixel.]

Figure 2.4. Connecting previous actions to a closure

The cookies set by embedded third party content, known as third party cookies, and the cookies set by a redirect page are perhaps the most common ways to track web users.
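The matching done when the trackback pixel is requested can be sketched as below. The cookie names `click` and `imp`, the `|`-separated impression format and the sample database row are hypothetical; the thesis only states that both cookies arrive with the pixel request.

```python
from http.cookies import SimpleCookie

# Hypothetical click database with one previously stored click.
click_db = {
    "c0ffee": {"affiliate": "aff-1", "merchant": "shop-9", "time": 1300000000},
}


def handle_trackback(cookie_header: str, merchant_id: str) -> dict:
    """Connect a closure at `merchant_id` to earlier clicks and views."""
    cookie = SimpleCookie(cookie_header)
    result = {"click": None, "views": []}
    if "click" in cookie:
        click = click_db.get(cookie["click"].value)
        # Only credit clicks that led to the merchant reporting the closure.
        if click is not None and click["merchant"] == merchant_id:
            result["click"] = click
    if "imp" in cookie:
        result["views"] = cookie["imp"].value.split("|")
    return result


result = handle_trackback(
    "click=c0ffee; imp=17.42.1300000000|18.42.1300000100", "shop-9"
)
```

If the browser blocks third party cookies, `cookie_header` is empty and the result contains neither a click nor any views.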

2.4.2 Cookies on the merchant domain

TradeDoubler also uses an alternative cookie method as a backup. This method is only used by a few of the merchant sites because it requires additional coding on the merchant web server.

A URL to any resource specifies where it is located and how to retrieve it. In addition it may also contain a query string consisting of multiple parameters, also known as request parameters. An example is this URL:

http://www.example.com/index.html?query_string

Entering the URL into the web browser address bar instructs it to use HTTP to download the index.html resource from the www.example.com domain. Everything after the “?” is considered the query string.
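As a small illustration, the Python standard library can split such a URL into its parts. The parameter names `affiliate` and `ad` are made up for the example.

```python
from urllib.parse import parse_qs, urlsplit

url = "http://www.example.com/index.html?affiliate=42&ad=17"

parts = urlsplit(url)
params = parse_qs(parts.query)  # maps each name to a list of values

host = parts.netloc  # "www.example.com"
resource = parts.path  # "/index.html"
affiliate = params["affiliate"][0]  # "42"
```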

Whenever a user clicks on an ad, the URL specified contains a request parameter with information about the referring affiliate. The request parameters are read by TradeDoubler on the redirect page and then resubmitted with the HTTP redirect to the merchant. Finally the merchant reads the request parameters and sets a cookie belonging to its domain containing information about the referring affiliate and the time of the click. Thus a first party click cookie belonging to the merchant domain is stored on the client web browser.

If a closure is subsequently made, the merchant has access to all information about the click in the click cookie of the HTTP request for the confirmation page. The click data is dynamically appended as request parameters to the trackback pixel described in Section 2.4.1. Thus TradeDoubler gets access to both the closure and the connecting click as request parameters in an incoming HTTP request for the trackback pixel.

Note that cookies on the merchant domain is only viable for tracking ad clicks.It is not possible to track ad views with this approach. This is because first partycookie storage of ad views must be set by the affiliate domain; thus there is no wayto access the ad view data at the time of the closure.
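The merchant side of this method could look roughly as follows. This is a sketch under stated assumptions, not the code merchants actually run: the cookie name td_click, the parameter names and the domain tracker.example.com are all hypothetical.

```python
from http.cookies import SimpleCookie
from urllib.parse import parse_qs, urlencode, urlsplit

def click_cookie_header(landing_url, click_time):
    """Landing page: copy the affiliate request parameter into a first party cookie."""
    params = parse_qs(urlsplit(landing_url).query)
    affiliate = params["affiliate"][0]            # hypothetical parameter name
    cookie = SimpleCookie()
    cookie["td_click"] = f"{affiliate}.{click_time}"  # hypothetical cookie name
    cookie["td_click"]["path"] = "/"
    return cookie.output(header="Set-Cookie:")

def trackback_pixel_url(cookie_header, order_id):
    """Confirmation page: append the click data from the cookie to the pixel URL."""
    cookie = SimpleCookie(cookie_header)
    affiliate, click_time = cookie["td_click"].value.split(".")
    query = urlencode({"affiliate": affiliate, "click_time": click_time, "order": order_id})
    return "http://tracker.example.com/pixel.gif?" + query  # hypothetical domain
```

The first function runs when the visitor arrives from the redirect, the second when the confirmation page is rendered and the browser has sent the click cookie back to the merchant.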

2.4.3 Browser fingerprinting

The last backup method of TradeDoubler is the browser fingerprinting method, which is not dependent on HTTP cookies. This method only tracks ad clicks and not ad views.

Whenever a user clicks on an ad, TradeDoubler receives an HTTP request for the redirect page. In Section 2.4.2 it was shown that all required information about the click is submitted in the request parameters. This HTTP request also contains information about the current web browser in the HTTP User-Agent header. The IP address of the visitor requesting the page is also available. TradeDoubler calculates a hash sum of the User-Agent and IP address and stores this hash sum together with the other click data in a database. The reason for hashing this data is to make sure no IP addresses are actually stored in or retrievable from the database, thus protecting the privacy of the web browser users.

If a closure is subsequently made at the merchant site then the trackback pixel, described in Section 2.4.1, is requested by the visitor who made the closure. If there are no cookies or request parameters in the HTTP request for the trackback pixel, TradeDoubler instead looks at the User-Agent and the IP address of the incoming request. If these match the User-Agent and IP address of a previously stored click in the database for the same merchant there is a high probability that this is the same user. Thus the affiliate responsible for the click is credited.
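The store and match steps can be sketched in a few lines. The hash function (SHA-256 here), the separator and the in-memory dictionary standing in for the database are assumptions; the text does not say which hash function or storage layout TradeDoubler uses.

```python
import hashlib

clicks = {}  # in-memory stand-in for the database: (merchant, fingerprint) -> affiliate

def fingerprint(user_agent, ip_address):
    # SHA-256 is an assumption; only the hash sum is kept, never the raw IP address.
    data = (user_agent + "|" + ip_address).encode("utf-8")
    return hashlib.sha256(data).hexdigest()

def store_click(merchant, affiliate, user_agent, ip_address):
    """Called when the redirect page is requested after an ad click."""
    clicks[(merchant, fingerprint(user_agent, ip_address))] = affiliate

def match_closure(merchant, user_agent, ip_address):
    """Called for a trackback pixel request; returns the affiliate to credit or None."""
    return clicks.get((merchant, fingerprint(user_agent, ip_address)))
```

A closure only matches if both the User-Agent string and the IP address are identical to those of the stored click, which is why the method degrades when either value changes or is shared by many users.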

2.5 Why the current tracking methods of TradeDoubler are insufficient

There are several reasons why the current tracking methods of TradeDoubler are insufficient and therefore could potentially be improved.

2.5.1 Problems with cookies

The cookie based tracking approaches suffer from a trend of browser users disabling cookie support and removing existing cookies. Third party cookies are not commonly used for purposes other than tracking web users. Because of this, many web browsers provide a feature for completely disabling third party cookies, and some browsers also disable cookies set on a redirect page [11]. Various antivirus software also blocks or removes cookies from sites known to be tracking users.

In addition, all major web browser vendors have started to implement various privacy modes [7]. These privacy modes usually result in no information being stored in the web browser after the current browser session has ended. Thus cookies lose their persistence over time and only exist for the current browser session.

All these problems lead to cookies being less persistent in time and perhaps not being stored by the web browser at all.

2.5.2 Problems with browser fingerprinting

The browser fingerprints are becoming less unique and thus less able to uniquely identify a web user. This is because of issues with the IP address as well as the User-Agent string.

IP address

An IP address is provided by the Internet service provider (ISP) and is usually unique per customer. However there is no guarantee that the customer keeps the same IP address over an extended period of time. There is also a problem with many companies keeping all of their employees behind a proxy or a network address translator (NAT), resulting in a shared external IP address for all devices on the company network [12]. Since there is a shortage of IP addresses available on the Internet, even some ISPs are employing similar techniques. All this means that the uniqueness of an IP address is facing a downward trend, thus reducing its value in uniquely identifying a web browser [13].

User-Agent string

The User-Agent string faces a somewhat similar problem of losing its uniqueness. In old versions of popular web browsers the User-Agent string contained lots of information, including:

• What web browser was used? Including version number.

• What operating system was used? Including a detailed version number.

• What is the operating system language setting?

• Which browser add-ons were installed? Including version number.

All this, especially the last point, resulted in the User-Agent string often being unique amongst millions of ad clicks and thus sufficient by itself in almost uniquely identifying a web browser.

New versions of web browsers tend to submit less information in the User-Agent header. For instance Google Chrome does not submit any information about installed add-ons. Less information submitted has the effect that the most common distinct User-Agent strings amongst tracked clicks each occur a substantial number of times, as visible in Table 2.1.

Table 2.1. Showing different web browser shares of all tracked clicks in September 2010

User-Agent string | Share of incoming clicks
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/533.4 (KHTML, like Gecko) Chrome/5.0.375.126 Safari/533.4 | 3.09 %
Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/533.4 (KHTML, like Gecko) Chrome/5.0.375.126 Safari/533.4 | 2.30 %
Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.4 (KHTML, like Gecko) Chrome/5.0.375.126 Safari/533.4 | 1.77 %
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/533.4 (KHTML, like Gecko) Chrome/5.0.375.125 Safari/533.4 | 1.21 %
Mozilla/5.0 (Windows; U; Windows NT 6.1; fr; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 | 1.01 %

The least distinct User-Agent string in the TradeDoubler database for the month of September corresponded to more than 3 % of all incoming clicks. Even more surprising is the fact that Google Chrome, which is only the third most popular browser with a market share of less than 20 % [8], occupied the top four positions of the least distinct User-Agent strings.
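Shares of this kind can be computed by counting identical User-Agent strings. The sketch below illustrates the calculation only; the sample data is made up and the function name is hypothetical.

```python
from collections import Counter

def user_agent_shares(user_agents):
    """Return (User-Agent string, share in percent) pairs, most common first."""
    counts = Counter(user_agents)
    total = sum(counts.values())
    return [(ua, 100.0 * n / total) for ua, n in counts.most_common()]

# Made-up sample data, not taken from the TradeDoubler database.
sample = ["Chrome/5.0.375.126"] * 3 + ["Firefox/3.6.8"] + ["MSIE 8.0"]
shares = user_agent_shares(sample)
print(shares[0])  # the least distinct User-Agent string and its share
```

The higher the share of the most common strings, the less a User-Agent string contributes to telling two web browsers apart.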

2.6 Summary

This chapter introduced the reader to how the server-client model of the web works by explaining client side coding and the communication with HTTP. Two interesting features of HTTP, the cookie and the User-Agent, were introduced in order to describe the three current tracking methods of TradeDoubler. The three methods of tracking were: using HTTP cookies on the TradeDoubler domain, using HTTP cookies on the merchant domain and lastly performing browser fingerprinting. Finally the problems of the current tracking methods were defined. The problem is that the current methods are becoming less effective at setting a unique identifier with the cookie approaches and at the same time less effective at uniquely identifying the web browser via the fingerprinting method.

Chapter 3

New potential approaches of tracking web user actions

In Section 2.5 the current tracking methods were shown to face a trend of becoming less effective. This chapter identifies and explores potential alternative approaches of tracking web user actions. The methods are partly based on the web technologies explained in Chapter 2. The methods are explained and illustrated with some possible and preliminary implementation details. Some of the tracking methods explained here are also implemented as prototypes in Chapter 4 and evaluated in Chapter 6.

3.1 Overview

Five new potential methods of tracking were found. All of these methods are described in detail in this chapter. Here is an overview of these five methods together with a short description:

CSS history leakage: The CSS history leakage method inserts a unique URL into the history of a web browser performing a click action and subsequently uses CSS to detect if any such unique URL exists in the web browser history at the time of a closure. Thus a closure can be connected to a click.

The web browser cache: The web browser cache method inserts an entry with a unique identifier into the web browser cache. This unique identifier can then be retrieved at the time of a closure and thus connect the closure to a click.

Client side storage capabilities: The cookie based approaches work by inserting a unique identifier in a cookie file on the client. Various other methods of storing a unique identifier on the client were identified. These include HTML 5 capabilities, a Javascript property called window.name, MSIE userData, Silverlight Isolated Storage and Flash Local Shared Objects. These storage capabilities provide alternatives to cookies for saving a unique identifier.

Extended browser fingerprinting: The ordinary fingerprinting based approach uses the IP address and the User-Agent string. The fingerprinting is extended by finding additional variable parameters retrievable from the web browser, thus increasing the uniqueness of the fingerprint.

HTTP basic authentication: The HTTP basic authentication method uses the fact that after a web browser has performed an HTTP basic authentication the same authentication data will be resubmitted in subsequent HTTP requests. Thus, if TradeDoubler can get a specific web browser to authenticate with some unique authentication data at the time of a click or a view, the same authentication data will be submitted at the time of a closure. Thus the actions can be connected.

3.2 CSS history leakage

Cascading style sheets (CSS) is a way of describing the look and formatting of a document [14]. Usually it is used together with HTML. HTML can also be used to define the look and formatting but the advantage of using CSS is the possibility to define the look and formatting for all elements of the same type at the same time. The CSS file can then be included in multiple HTML web pages, thus enabling the developer to instantly change the look and formatting of all pages at once.

One feature of CSS that has proved valuable is the ability to define the look of web links. The feature can also be used to distinguish between visited and unvisited web links, for example by setting a different colour for the two groups.

The CSS 1 specification contains a bug, enabling the potential leakage of the browser history. The issue has been discussed in various academic papers and also in the online security community [15, 16, 17, 18, 19].

The CSS bug has also been known to browser vendors since at least the year 2000 [20]. At the start of this degree project none of the major web browsers had implemented a fix, perhaps since the issue seemed like one of mostly theoretical consequences. However, new and efficient exploits [22] of the CSS bug available on the web have turned the attention of the browser vendors to this issue; some have implemented a fix [21] and others are preparing to do so [20].

The mechanism behind the bug is based on a feature of CSS. CSS provides a way of specifying a different look for links previously visited by the web browser. The following lines of CSS code define the colour of unvisited and visited links respectively:

a {color: #0645ad;}
a:visited {color: #0b0080;}

The effect in the web browser is illustrated in Figure 3.1.

Figure 3.1. Illustrating the difference between visited (purple) and unvisited (blue) links

There are more properties that can be set on links in CSS. One is the background property. In the background property the CSS document may include a URL to an image, which is then used as a background image for the displayed link. CSS also gives the ability to assign different styles to different links by tagging them with different ids in the HTML code. Both are shown in the following lines of code:

<style>
#kth:visited {
  background: url(http://www.tradedoubler.com/kth-is-visited.gif)
}
#google:visited {
  background: url(http://www.tradedoubler.com/google-is-visited.gif)
}
</style>

<a id="kth" href="http://www.kth.se">kth</a>
<a id="google" href="http://www.google.se">google</a>

If the web browser history contains any of the specified URLs then the visited pseudo class is applied to the corresponding link. The web browser makes an HTTP request for a background URL if and only if the corresponding link URL has previously been visited. An incoming HTTP request for either image thus implies that the browser user has visited the connected URL before.

Note that it is only possible to probe an exact URL, including request parameters, with the CSS history leakage method. There is no way of probing an entire domain or using wild cards and such. By optimizing the method and utilizing Javascript, tens of thousands of URLs per second can be probed [15].

3.2.1 How to use the browser history for tracking

In Chapter 1 the problem of tracking was shown to be similar to the problem of how to uniquely identify a web browser. If there was some way to insert a unique URL in the history of a specific web browser and subsequently probe for the unique URL with the CSS history leakage bug, this web browser would be uniquely identifiable. How could the CSS history leakage be used for tracking purposes then?

When a visitor to an affiliate clicks a link the web browser first requests a page on the TradeDoubler domain where it gets a redirect to the merchant domain as illustrated in Figure 3.2.

[Figure: affiliate page, banner ad, TradeDoubler redirect page, merchant landing page]

Figure 3.2. The current redirecting after an ad click

An additional redirect page at the TradeDoubler domain is proposed, illustrated in Figure 3.3 and detailed below:

• Whenever the first redirect page is reached the usual cookie setting and database storing is done. In addition, a unique URL is generated on the same domain and stored in a database together with information about the affiliate, the merchant and the time.

• Next the web browser receives an HTTP redirect to the new and unique URL.

[Figure: affiliate page, banner ad, TradeDoubler redirect page, second redirect with unique URL, merchant landing page]

Figure 3.3. A proposed additional redirect page

• The page with the unique URL has two functions. First it must redirect the web browser to the merchant page, but the URL also becomes an entry in the web browser history list.

Thus a specific browser is successfully tagged with a unique entry. For the unique entry to be of any use to the problem, TradeDoubler must also be able to retrieve the unique entry. This can be done on the merchant site after a closure has been made at the confirmation page, by using the CSS history leakage method. TradeDoubler needs to supply a list of URLs to probe on the client. The list contains all clicks to this merchant during a specified time period, which can be queried dynamically from the database of previously generated URLs connected to a specific merchant.
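The server side of this proposal could be sketched as follows. Everything specific here is an assumption made for illustration: the domain tracker.example.com, the URL layout, the token format and the in-memory dictionary standing in for the database. The sketch also presumes a web browser in which the history leakage has not been fixed.

```python
import secrets

BASE = "http://tracker.example.com"  # hypothetical tracking domain
click_db = {}                        # token -> click data; stand-in for the database

def register_click(affiliate, merchant, timestamp):
    """First redirect page: create the unique URL for the second redirect."""
    token = secrets.token_hex(16)
    click_db[token] = {"affiliate": affiliate, "merchant": merchant, "time": timestamp}
    return f"{BASE}/r/{token}"

def probe_markup(merchant, since):
    """Confirmation page: build the CSS rules and links that probe recent click URLs."""
    rules, links = [], []
    for token, click in click_db.items():
        if click["merchant"] == merchant and click["time"] >= since:
            # The "p" prefix keeps the id valid even if the token starts with a digit.
            rules.append(f"#p{token}:visited {{ background: url({BASE}/hit/{token}.gif) }}")
            links.append(f'<a id="p{token}" href="{BASE}/r/{token}"></a>')
    return "<style>\n" + "\n".join(rules) + "\n</style>\n" + "\n".join(links)

def resolve_hit(path):
    """Incoming request for /hit/<token>.gif: look up the click it reveals, if any."""
    name = path.rsplit("/", 1)[-1]
    token = name[:-4] if name.endswith(".gif") else name
    return click_db.get(token)
```

The size of the generated markup grows with the number of clicks in the chosen time period, so the time window passed as since is a trade-off between coverage and page weight.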

3.3 The web browser cache

Any given site often shares some resources between the individual pages. An example would be the KTH logo visible on the top left of all pages on the www.kth.se site. A client browsing multiple pages on the KTH web site would thus issue multiple HTTP requests for the same resource. The multiple HTTP requests are a potential waste of bandwidth and lead to increases in response time.

One feature of HTTP is its cache functionality. If the web browser could store some frequently accessed resources on the local hard drive both bandwidth use and response time could be greatly reduced. At the same time it is often important that the web browser shows the latest version of a requested resource. HTTP supports caching with the help of two different HTTP response header fields and subsequent conditional HTTP requests [10].

To achieve client side caching the server must set at least one of two header fields. These are the ETag and the Last-Modified headers.

The ETag may contain an arbitrary string. The ETag string is often a hash sum of the file requested. If the client supports caching then the file in the HTTP response is stored locally together with the server generated ETag string. If the client subsequently makes an HTTP request for the same resource, the HTTP request contains the If-None-Match header containing the previously stored ETag string. The server then reads the If-None-Match string and compares it to the current ETag value of the requested resource. If the ETag values match the server can issue a short response containing an HTTP 304 Not Modified status. This tells the client that its cached version is the same as the current version on the server and that it should still be used.

If the server does not find a match, then a full HTTP response is issued containing the new version of the resource as well as a new ETag, as illustrated in Figure 3.4:

[Figure: message sequence between client and server]
Client -> Server: HTTP GET /img/logo-main.png
Server -> Client: HTTP 200 OK, Etag: W/"15746-1284645858000", + Image data
Client -> Server: HTTP GET /img/logo-main.png, If-None-Match: W/"15746-1284645858000"
Server -> Client: HTTP 304 Not Modified

Figure 3.4. An example of an HTTP ETag exchange

The Last-Modified header works in a similar way as the ETag. The difference is that the Last-Modified header instead contains a timestamp. If the resource is cached on the client together with the timestamp then subsequent HTTP requests are issued with the If-Modified-Since header containing the same timestamp. The server can then decide whether to issue a short 304 Not Modified response or to resend a new version of the requested resource with a new timestamp.
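The server side decision for the ETag variant can be sketched as below. This is a simplified illustration, not a complete HTTP implementation: headers are modelled as a plain dictionary, the ETag is assumed to be a SHA-256 hash sum of the resource, and details such as weak validators or multiple ETags in one If-None-Match header are ignored.

```python
import hashlib

def respond(resource_bytes, request_headers):
    """Return (status, headers, body) for a GET request, honouring If-None-Match."""
    # Assumption: the ETag is a hash sum of the current version of the resource.
    etag = '"' + hashlib.sha256(resource_bytes).hexdigest() + '"'
    if request_headers.get("If-None-Match") == etag:
        # The cached copy is still current: short response without a body.
        return 304, {"ETag": etag}, b""
    # No cached copy or an outdated one: send the full resource with its ETag.
    return 200, {"ETag": etag}, resource_bytes
```

A first request without If-None-Match yields a 200 response; repeating the request with the received ETag yields 304 until the resource changes on the server.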

3.3.1 How to use the web browser cache for tracking

How can the web browser cache be used to uniquely identify a web browser? The answer is basically with the ETag and/or the Last-Modified header. Even if the ETag header field usually contains a hash sum of the requested file, the web server is free to set an arbitrary string. This means a unique identifier can be inserted anytime a certain resource is requested for the first time. Any subsequent HTTP request for the same resource, if the web browser still has the resource in its cache, is conditional and contains the unique identifier specified in the original HTTP response. The web browser thus identifies itself with the unique ID previously set.

To use the ETag feature for tracking, an invisible pixel image can be inserted on all interesting pages in the way illustrated in Figure 3.5 and explained below:

[Figure: affiliate page, banner ad, TradeDoubler redirect page, merchant landing page, “Buy now!”, merchant confirmation page, TradeDoubler tracking pixel]

Figure 3.5. Implementing the ETag as a unique identifier

• The first time a user visits an affiliate page the tracking pixel is loaded for the first time from the TradeDoubler server. The incoming HTTP request does not contain an ETag value, so the TradeDoubler server generates a unique identifier and passes it in the ETag header.

• If the visitor then clicks the banner ad the tracking pixel is once again requested on the merchant landing page. Now, the web browser makes a conditional request for the same pixel. Since the HTTP request for the pixel contains the unique identifier it must be the same web browser. Thus TradeDoubler can store the click data in a database.

• Finally the web user makes a closure and gets transferred to the merchant confirmation page. The web browser once again makes a conditional HTTP request for the tracking pixel. TradeDoubler can then use the identifier to connect the closure with the previous click.
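The three steps above can be sketched as a single request handler. This is an illustrative sketch, not the prototype of this degree project: headers are modelled as a dictionary, the event list stands in for the database, and the performed action is simply passed in as an argument.

```python
import uuid

events = []  # in-memory stand-in for the tracking database

def serve_pixel(request_headers, action):
    """Handle a request for the tracking pixel and record which browser caused it.

    `action` ("view", "click" or "closure") is supplied by the caller here; how the
    server can deduce it from the request is a separate question.
    """
    browser_id = request_headers.get("If-None-Match")
    if browser_id is None:
        # First request from this browser: hand out a new identifier as the ETag.
        browser_id = '"' + uuid.uuid4().hex + '"'
        status = 200
    else:
        # Conditional request: the browser identifies itself with its cached ETag.
        status = 304
    events.append((browser_id, action))
    return status, {"ETag": browser_id}

def actions_for(browser_id):
    """Return all recorded actions of one browser, in the order they occurred."""
    return [action for bid, action in events if bid == browser_id]
```

Answering conditional requests with 304 keeps the original identifier in the cache, so the same value is presented again on the next page that embeds the pixel.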

The Last-Modified header field could be used in the same way, with the exception that the unique identifier must look like a timestamp.

There is one problem with this approach. How can it be determined which web page is causing the conditional HTTP request for the tracking pixel? Knowing which web page the pixel is embedded in is necessary for deducing what action has been performed. One method could perhaps be to use request parameters. However, this is not possible. The HTTP request must have the exact same URL for the browser to use the cached resource and issue a conditional HTTP request. The solution is instead to use the HTTP Referer header. When a web page includes other resources the web browser makes an HTTP request for these containing the web page URL in the HTTP Referer header. Thus the different web page origins of the conditional HTTP requests are distinguishable. This means that for the click to be stored correctly, the merchant landing page URL must contain an affiliate identifier. This is easily achieved by submitting the affiliate identifier as a request parameter in the redirect to the merchant landing page.
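Deducing the action from the Referer header could look roughly like this. The URL layout of the merchant site (the paths /landing and /confirmation) and the parameter name affiliate are hypothetical; a real deployment would need rules per merchant.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical merchant URL layout, for illustration only.
LANDING_PATH = "/landing"
CONFIRMATION_PATH = "/confirmation"

def classify(referer):
    """Return (action, affiliate) deduced from the Referer of a pixel request."""
    parts = urlsplit(referer)
    if parts.path == CONFIRMATION_PATH:
        return "closure", None
    if parts.path == LANDING_PATH:
        # The affiliate identifier travels as a request parameter of the landing page.
        affiliate = parse_qs(parts.query).get("affiliate", [None])[0]
        return "click", affiliate
    # Any other embedding page is treated as an ad view in this sketch.
    return "view", None
```

Because the pixel URL itself never changes, the browser keeps issuing conditional requests, while the Referer value still tells the pages apart.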

The approach of tracking using the HTTP ETag has been discussed in various papers [17, 18] as well as implemented in real world examples [6, 23, 24].

Another approach is to put the identifier in the cached content itself, instead of putting it in some meta-information such as the HTTP ETag. One way is to put a unique global variable in a cached Javascript file [18], thus the web browser can be probed for identity by client side Javascript code. Another real world implementation [25] of cache tracking uses a dynamically rendered and unique image file with the ID encoded in every pixel. These pixels can then be probed for the ID with client side code.

3.4 Client side storage capabilities

It was shown in Section 2.3 that HTTP cookies enabled a way to store data on the client side. There are several other options of achieving client side storage.

3.4.1 Rich Internet Applications

Rich Internet Application (RIA) platforms are web browser plugins that interact directly with the client side operating system. In comparison, normal web applications can only use the features available in the client web browser, while the RIAs allow web developers to develop more advanced web applications than the current web standards allow.

There are many of these platforms available, the most common being Adobe Flash, Sun Java and Microsoft Silverlight with a web browser penetration rate of about 97%, 79% and 55% respectively [26].

These RIA platforms have loads of features; however there is one of particular interest for web tracking. That is the ability to store data in the web browser and subsequently retrieve it. All these RIA platforms have some feature enabling such browser storage.

3.4.2 HTML 5

The HTML specification is in the process of getting a complete overhaul. The goal is to release the new HTML 5 standard, thus replacing the old HTML standard and enabling numerous new features. One of these new features is the ability to store data on the web client.

The current HTML 5 specification draft contains support for client side storage via an API called web storage [27]. The web storage feature is available in newer versions of most common web browsers. Previous drafts of HTML 5 included another feature called web database [28]. The web database feature is currently put on hold in the new standard specification, but some browser vendors have already implemented it [29].

In addition Microsoft Internet Explorer (MSIE) has a non-standard client side storage capability called MSIE userData [30]. The MSIE userData is not part of any HTML specification, but the feature is similar to the HTML 5 methods and they are therefore grouped together in this degree report.

3.4.3 Javascript and the window.name property

The most common script language used on the web is Javascript. Javascript enables programmatic access to objects within the HTML code, elements on the web page and some features of the web browser.

One feature of Javascript is to store and retrieve values as variables. The variables are only meant to survive for one web page view. When the browser loads a new web page all the previous Javascript state together with stored data is discarded.

Different developers have independently found a similar way to make Javascript variables persist between page loads [41, 42]. The method involves a property called window.name. The window.name property is accessible by Javascript and it can hold a string value of arbitrary length and content. This property was originally intended to let different web browser windows communicate with each other. One example would be a web page loading a popup window with a specified window.name and then letting links and Javascript address this popup window via the window.name string.

3.4.4 How to use client side storage capabilities for tracking

These techniques have many similarities with HTTP cookies. The main purpose is to either store tracking data in the client storage or to put a unique identifier on the client and store the tracking data on the server side.

The main difference between these techniques and HTTP cookies is how the data is set and retrieved. With HTTP cookies the necessary data can be set in any HTTP response header and all data is resubmitted in any HTTP request. The automatic resubmission cannot be achieved natively with any of these alternative methods. To set and retrieve data some sort of client side code is necessary.

There are some reports of tracking companies using these techniques as a backup to HTTP cookies [31]. One method described is the tracking company setting an identifier in an HTTP cookie and then setting the same identifier in a Flash Local Shared Object (LSO). If the cookie subsequently is deleted by the browser user it is later restored with the data from the LSO [34].

The method of restoring user deleted cookies, also known as cookie respawning, could be considered unethical and Adobe has condemned the behaviour [32]. A class action lawsuit has also been filed against the companies involved in one such implementation [33].

These client side storage methods do not need to rely on cookie respawning to provide tracking functionality. Instead a combination of various scripting methods could be used to read and submit the necessary data from the client to the server, without involving cookies.

3.5 Extended browser fingerprinting

Section 2.4.3 defined the concept of browser fingerprinting. The fingerprinting method tries to uniquely identify a web browser based on some characteristics of the incoming HTTP requests, namely the HTTP User-Agent field and the originating source IP address. Some problems with the fingerprinting method becoming less reliable were also shown.

There are more browser characteristics that can be used for fingerprinting purposes. Extensive work on fingerprinting has been made in a proof of concept web browser fingerprinting application available on the web [35]. The technique has also been discussed in academic reports [37, 38]. There are also companies selling products claiming to be able to uniquely fingerprint devices on the Internet [39]. One of these companies claims to be able to identify a fingerprint on 89 % of all web site visitors in comparison to HTTP cookies where 78 % are claimed to be identifiable [40]. If those claims are true, tracking an additional 14 % of user actions is plausible (89/78 ≈ 1.14).

All characteristics that vary between different web browsers will increase the uniqueness of the fingerprint. In addition to the HTTP User-Agent and the IP address the following retrievable fingerprinting characteristics have been identified during the research of this degree project:

HTTP_Accept Headers: The HTTP_Accept headers are a part of the HTTP specification and are submitted with every HTTP request. The HTTP_Accept headers consist of several fields with names such as Accept, Accept-Language, Accept-Encoding and Accept-Charset. These header fields enable the web client to specify to the web server what format is acceptable in the next HTTP response. The HTTP_Accept headers will sometimes vary between web browsers depending on installed software on the client computer. For example a Microsoft Office installation will affect the HTTP_Accept headers sent by the Internet Explorer web browser. [35]

Browser Plug-in Details: Javascript can detect installed plug-ins in the web browser. [35]

Time Zone: Javascript can detect the web client time zone. [35]

Screen Size and Colour Depth: Javascript can detect screen information such as resolution and colour depth. [35]

System Fonts: There are several ways to detect installed system fonts. It can be done with a Flash application, a Java applet or CSS. [35]

MAC address: A Java applet can detect the MAC address of the network card. The MAC address is a unique ID in itself. The main problem with this method is that this Java applet could take up to 30 seconds to load. [36]

System clock error: The web client system clock time can be detected with Javascript or sometimes in a TCP packet header field.

System clock skew: The system clock skew is how much the web client system clock deviates from the real time per tick. The clock skew can potentially be measured with a sequence of TCP traffic. [38]

3.5.1 How to use extended browser fingerprinting for tracking

Basically a web browser fingerprint could uniquely identify a web browser at different points in time. Thus if the fingerprint could be retrieved during critical user actions, the user actions could potentially be connected. The current fingerprinting method at TradeDoubler only uses variable characteristics in the HTTP header. Thus any incoming HTTP request can be fingerprinted. The extended fingerprinting methods would be more unique but most of the variable characteristics must be detected on the client side with Javascript. The problem is then how to submit the variable characteristics to the web server. One approach is to add all the characteristics as request parameters on an invisible pixel on all interesting pages.
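The pixel approach can be sketched from both ends. In practice the client side would be Javascript; here the URL construction is written in Python only to keep the example in one language. The domain tracker.example.com, the parameter names and the choice of SHA-256 are assumptions.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit

def fingerprint_pixel_url(characteristics):
    """Client side idea: encode the detected characteristics as request parameters."""
    # Sorting makes the URL independent of the order in which values were collected.
    query = urlencode(sorted(characteristics.items()))
    return "http://tracker.example.com/fp.gif?" + query  # hypothetical domain

def extended_fingerprint(url, user_agent, ip_address):
    """Server side: hash the header based values together with the submitted ones."""
    submitted = sorted(parse_qsl(urlsplit(url).query))
    material = repr((user_agent, ip_address, submitted)).encode("utf-8")
    return hashlib.sha256(material).hexdigest()  # SHA-256 is an assumption
```

Two requests yield the same fingerprint only if the User-Agent, the IP address and every submitted characteristic agree, so each added characteristic can only make the fingerprint more distinctive, at the cost of making it less stable over time.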

3.6 HTTP basic authentication

Another useful feature of HTTP is the basic authentication [10]. The HTTP basic authentication is designed to allow the client program to provide credentials with an HTTP request. If a client requests a restricted resource on the server without passing any credentials the server can respond with a 401 Unauthorized. Most client web browsers then prompt the browser user for login credentials, which are a username and a password, in a popup window. After the user inputs the credentials a new similar HTTP request is made. This HTTP request contains the Authorization header containing the credentials. The server side can then choose whether to grant access or respond with a 401 Unauthorized, thus denying access to the requested resource and prompting the client for new credentials.
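The exchange can be illustrated with a small sketch of both sides. Headers are modelled as a dictionary, the realm name "restricted" is arbitrary and the credential check against a set of pairs is a simplification made for this example.

```python
import base64

def authorization_header(username, password):
    """Build the value a browser sends in the Authorization header."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return "Basic " + token

def handle(request_headers, valid_credentials):
    """Return (status, headers) for a request to a restricted resource."""
    challenge = {"WWW-Authenticate": 'Basic realm="restricted"'}  # arbitrary realm
    value = request_headers.get("Authorization", "")
    if not value.startswith("Basic "):
        return 401, challenge  # no credentials passed: ask the client for them
    try:
        decoded = base64.b64decode(value[len("Basic "):]).decode("utf-8")
    except ValueError:
        return 401, challenge  # malformed credentials
    username, _, password = decoded.partition(":")
    if (username, password) in valid_credentials:
        return 200, {}
    return 401, challenge      # wrong credentials: prompt again
```

Note that the credentials are only base64 encoded, not encrypted, so anyone who can read the request can also read the username and password.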

3.6.1 How to use HTTP basic authentication for tracking

Most web browsers remember the authorization data that the user entered until the browser is closed. The web browser also passes along the credentials with any subsequent request to the same domain during the current browser session.

If the web browser authenticates with a unique identifier it will also resubmit the unique identifier with any subsequent HTTP request during the current browser session. The question is then how to get the web browser to authenticate with a unique identifier. For obvious reasons the web browser user cannot be asked to input a unique identifier in a login popup window. Whether this can be done in some other way is evaluated in Section 4.5.2.

3.7 Summary

This chapter introduced five potential approaches to performing the necessary web user tracking. These were the CSS history leakage, the web browser cache, the client side storage capabilities, the extended browser fingerprinting and the HTTP basic authentication method. Some illustrative and preliminary implementation details were also given for each potential tracking method.

Chapter 4

Implementation

In this chapter the most promising tracking methods are selected for implementation as prototypes. These implementations are based on the drafted design suggestions from Chapter 3, but some adjustments were made to perform better in the subsequent evaluation in Chapter 6.

4.1 Selection of methods to implement

All the tracking methods in Chapter 3 are viable options for prototyping. Two of them, the extended browser fingerprinting method and the HTTP basic authentication method, were never fully implemented, for reasons given in Section 4.5.1 and Section 4.5.2. The other three methods were implemented as prototypes.

4.2 CSS history leakage

The first implemented method was the CSS history leakage. The idea behind the CSS history leakage method is to generate and insert a unique URL into the web browser history at the time of a click. Subsequently the web browser is probed at the time of a closure to see if its history contains any of the previously generated URLs. Thus a list of URLs to be probed has to be generated and submitted to the web browser. If a unique URL is found, the closure can be connected to the click that generated that URL.

4.2.1 Implementation

The final prototype of the CSS history leakage method is similar to the drafted design in Section 3.2. The implementation consists of the same two parts: the insertion of a generated unique URL at the time of a click and the subsequent retrieval of the same URL at the time of a closure. The prototype model is illustrated in Figure 4.1 with the following explanation:

[Diagram elements: Affiliate page with banner ad; TradeDoubler redirect page; second redirect, with unique URL; Merchant landing page with "Buy now!" button; Merchant confirmation page with iframe containing URLs and trackback pixel; Database]

Figure 4.1. Prototype model of tracking using the web browser history

• The user clicks the banner on the affiliate site and gets transferred to the TradeDoubler redirect page.

• When TradeDoubler receives the incoming HTTP request for the redirect page a unique URL is generated. The generated URL looks like:

http://www.tradedoubler.com/merchantID_clickID.html

The merchantID contains an identifier of the merchant and clickID contains an incremental click counter. To illustrate, the first three ad clicks leading to the merchant with ID 42 would generate and insert the following URLs into the respective web browsers:

http://www.tradedoubler.com/42_1.html

http://www.tradedoubler.com/42_2.html

http://www.tradedoubler.com/42_3.html

• The generated URL together with the necessary click data is stored in the TradeDoubler database.

• The web browser then gets redirected to the generated URL, thus inserting the unique URL into the web browser history. The web browser is then instantly redirected to the merchant landing page.

• On the merchant landing page the user clicks on the buy button to confirm a closure.

• The closure page embeds an iframe located on the TradeDoubler server. When the TradeDoubler web server receives the request for the iframe, the database is queried for all URLs generated by clicks to the specified merchant during a specified time interval. The result is then submitted in the HTTP response and displayed as HTML links in the iframe.

• A CSS file and a Javascript file are also embedded in the iframe. The CSS file sets a certain colour on any of the listed URLs in the iframe if and only if the URL also exists in the web browser history.

• The Javascript then checks the colour of all the URLs to see if any of them exists in the web browser history. If there is a match the Javascript notifies the TradeDoubler server and thus a connection is made between the click and the closure.

To summarize the above: a unique identifying entry is stored in the web browser history at the time of a click and the same entry is subsequently retrieved at the time of a closure. Thus a closure can be connected to a previous click.
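The server side of these steps can be sketched with an in-memory stand-in for the TradeDoubler database. The sketch assumes that the click counter is kept per merchant, which matches the 42_1 to 42_3 example but is not stated explicitly; createClickStore, recordClick, urlsToProbe and lookup are hypothetical names.

```javascript
const BASE = "http://www.tradedoubler.com/";

// In-memory stand-in for the database; a real system would persist this.
function createClickStore() {
  const counters = new Map(); // merchantID -> last clickID used
  const clicks = new Map();   // generated URL -> stored click record
  return {
    // Called at click time: generates and stores the unique URL.
    recordClick(merchantId, clickData, timestamp) {
      const clickId = (counters.get(merchantId) || 0) + 1;
      counters.set(merchantId, clickId);
      const url = BASE + merchantId + "_" + clickId + ".html";
      clicks.set(url, { merchantId, clickId, timestamp, clickData });
      return url;
    },
    // Called at closure time: the URLs to list as links in the iframe.
    urlsToProbe(merchantId, since) {
      return [...clicks.entries()]
        .filter(([, c]) => c.merchantId === merchantId && c.timestamp >= since)
        .map(([url]) => url);
    },
    // Called when the Javascript reports a visited URL.
    lookup(url) {
      return clicks.get(url) || null;
    },
  };
}

const store = createClickStore();
const urls = [1, 2, 3].map((t) => store.recordClick(42, { note: "click " + t }, t));
console.log(urls);
```

Here the timestamps 1, 2 and 3 are placeholder values. Calling store.urlsToProbe(42, 2) returns only the URLs of the second and third click, which illustrates how the time interval limits the list sent to the web browser.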

4.3 The web browser cache

The second implementation was the usage of the web browser cache. According to the drafted design in Section 3.3.1, the approach is to use the ETag to store a unique identifier in the client cache. When the visitor browses the affiliate sites the ID can be generated and set in an ETag, because an embedded pixel image results in an incoming HTTP request to the TradeDoubler domain. Subsequently the user can be tracked with the ETag on the merchant landing page (implying a click) and on the merchant confirmation page (implying a closure).

There are some problems with this design. According to the evaluation criteria in Section 5.1, current customer implementations should preferably be left unaffected. With the drafted design from Section 3.3.1, affiliates will be affected because they need to embed an additional image on their site. Merchants will be even more affected since they would need to both embed the image and also make sure that the HTTP referer contains the right data as described in Section 3.3.1. The HTTP referer contains the URL shown in the address bar of the browser, thus the merchant will have to put the required data into the visible URL of both the landing page and the confirmation page. This approach requires some back end coding on the merchant site to work and is also not visually appealing to the visitors because of the effects on the web browser address bar.

4.3.1 Implementation

The final prototype uses a different approach from the drafted design in Section 3.3.1. Since the embedded pixel image on the affiliate site is not used to store ad views, the unique identifier does not need to be set before an actual click happens. Thus the affiliates do not have to embed the pixel image at all.

The next part was to limit what the merchants needed to implement. The technique chosen is to let the client load and cache a Javascript on the TradeDoubler redirect page and subsequently let the merchant embed the same Javascript on the merchant confirmation page. The embedded Javascript contains a dynamically generated ID in a global variable and a method which appends this ID as a request parameter to a URL. Thus the necessary merchant implementation details are limited to two steps, both on the closure page:

• Embed the Javascript in the HTML header.

• Call the Javascript's getID() method to append the ID to the ordinary trackback pixel as a request parameter.

The prototype model is illustrated in Figure 4.2 with an explanation below:

[Diagram elements: Affiliate page with banner ad; TradeDoubler redirect page with Javascript with unique ID; Merchant landing page with "Buy now!" button; Merchant confirmation page with TradeDoubler tracking pixel]

Figure 4.2. Prototype model of tracking using the web browser cache

• The user clicks the banner on the affiliate site and gets transferred to the TradeDoubler redirect page.

• On the TradeDoubler redirect page an HTML page is submitted with the HTTP response. In the HTML document a Javascript is embedded. Thus the web browser issues an additional HTTP request for the Javascript.

• The TradeDoubler server responds with a dynamically generated Javascript containing a unique ID in a global variable and a getID() method. An ETag is also added to the response, instructing the web browser to keep the generated Javascript in its cache. At the same time the click data belonging to the ID is saved in the TradeDoubler database. Finally the HTML page instructs the web browser to redirect to the merchant landing page.

• On the merchant landing page the user clicks the buy button to confirm a closure.

• The closure page embeds the same Javascript as the redirect page. The web browser once again issues an HTTP request. If the Javascript is still in the browser cache the HTTP request is conditional and contains the ETag. The TradeDoubler server sees the ETag and responds with an HTTP 304 Not Modified. Thus the web browser continues to use the cached Javascript with the unique ID generated at the click action. Finally the Javascript getID() method is called, which appends the ID as a request parameter to the trackback pixel.

To summarize the above: an ID is stored in the web browser cache at the time of a click and the same ID is subsequently retrieved at the time of a closure. Thus a closure can be connected to a previous click.
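The server logic for the cached Javascript can be sketched as a pure function over the request headers. This is an illustration under assumptions rather than the prototype's code: the global variable name tdClickId, the request parameter name id, the lower-cased header object and the generateId callback are all hypothetical, and getID() is assumed to take the trackback URL as its argument.

```javascript
// Builds the dynamically generated script. tdClickId and the "id"
// parameter name are assumptions made for this sketch.
function buildTrackingScript(id) {
  return (
    "var tdClickId = " + JSON.stringify(id) + ";\n" +
    "function getID(url) {\n" +
    "  var sep = url.indexOf('?') === -1 ? '?' : '&';\n" +
    "  return url + sep + 'id=' + encodeURIComponent(tdClickId);\n" +
    "}\n"
  );
}

// Decides between a fresh script (200 with ETag) and 304 Not Modified.
function serveTrackingScript(requestHeaders, generateId) {
  const etag = requestHeaders["if-none-match"];
  if (etag) {
    // Conditional request: the browser still holds the script from the
    // click, so it is told to keep using its cached copy (and its ID).
    return { status: 304, headers: { ETag: etag }, body: "" };
  }
  const id = generateId();
  return {
    status: 200,
    headers: { ETag: '"' + id + '"', "Content-Type": "application/javascript" },
    body: buildTrackingScript(id),
  };
}

// Click: no cached copy yet, so a new ID is generated and served.
const first = serveTrackingScript({}, () => "click-1001");
// Closure: the browser revalidates with the ETag it received earlier.
const revalidated = serveTrackingScript(
  { "if-none-match": first.headers.ETag },
  () => "click-9999"
);
console.log(first.status, revalidated.status);
```

The second call never invokes its ID generator, which mirrors the key property of the method: as long as the script stays in the cache, the ID from the click survives until the closure.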

Note that the prototype does not track ad views. Ad view tracking could be achieved on graphical banner advertisements by putting a unique ETag on each HTTP response containing the banner image file. These ETags could then be retrieved at the time of a closure by embedding all possible graphical banners on a non-visible part of the merchant confirmation page. The reason this has not been prototyped is that the image files are often located on web servers on other domains. These could be the web servers of the merchants or of a third-party content delivery company. In those cases TradeDoubler has no way of setting the ETag at the time of an ad view.

4.4 Client side storage capabilities

The third and last implementation was the client side storage method. There are various ways of storing data on the client, as described in Section 3.4. A common characteristic of most web browser storage locations is the fact that only the domain setting the data can later read it. This is true also for cookies, as described in Section 2.2.2.

A problem is that ad clicks and ad views take place on the affiliate domain and the closures happen on the merchant domain, but the tracking is done by TradeDoubler. The current cookie based tracking solution of TradeDoubler solves this by letting affiliates embed third-party content from the TradeDoubler domain, so that HTTP cookies belonging to TradeDoubler can be set. A similar solution can be made with the other client side storage capabilities.

4.4.1 Implementation

The main part of the prototype is a Javascript which is able to store both ad click data and ad view data in all available storage locations. The storage locations used in the Javascript are:

• HTML 5 localStorage

• HTML 5 globalStorage

• HTML 5 sessionStorage

• HTML 5 SQL database

• MSIE userData

• Javascript window.name

• Microsoft Silverlight Isolated Storage

• Adobe Flash Local Shared Objects

The Javascript has a method which takes a text string as input and stores the string in all of the listed storage locations that are available.
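The store-everywhere method can be sketched with a small backend interface. The interface (isAvailable, set, get), the function names and the in-memory backends below are assumptions made so that the sketch runs outside a browser; they stand in for the real storage locations listed above.

```javascript
// In-memory stand-in for one storage location. The available flag
// simulates a feature that is missing in the current browser.
function memoryBackend(name, available) {
  const data = new Map();
  return {
    name: name,
    isAvailable: () => available,
    set: (key, value) => { data.set(key, value); },
    get: (key) => (data.has(key) ? data.get(key) : null),
  };
}

// Writes the string to every available backend; returns where it landed.
function storeEverywhere(backends, key, value) {
  const written = [];
  for (const backend of backends) {
    try {
      if (backend.isAvailable()) {
        backend.set(key, value);
        written.push(backend.name);
      }
    } catch (e) {
      // A location may throw (quota, privacy settings); skip it.
    }
  }
  return written;
}

// Reads the value back from the first backend that still has it.
function readFirst(backends, key) {
  for (const backend of backends) {
    try {
      if (backend.isAvailable()) {
        const value = backend.get(key);
        if (value !== null) return value;
      }
    } catch (e) {
      // Ignore unreadable locations and try the next one.
    }
  }
  return null;
}

const backends = [
  memoryBackend("localStorage", false), // pretend it is unsupported here
  memoryBackend("window.name", true),
  memoryBackend("flash", true),
];
const written = storeEverywhere(backends, "td_click", "merchant=42;click=7");
console.log(written, readFirst(backends, "td_click"));
```

In a browser, a backend for HTML 5 localStorage could implement the same three functions with a typeof localStorage check, localStorage.setItem and localStorage.getItem. Writing to every location means that the data survives as long as at least one location is left untouched.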

The prototype model is illustrated in Figure 4.3 with an explanation below:

[Diagram elements: Affiliate page with TradeDoubler iframe (including storage Javascript and ad); TradeDoubler redirect page; Merchant landing page with "Buy now!" button; Merchant confirmation page with TradeDoubler trackback iframe]

Figure 4.3. Prototype model of tracking using client storage capabilities

• The user visits an affiliate page. The affiliate embeds an iframe located on the TradeDoubler domain. This iframe contains both a banner ad and the storage Javascript. The reason for putting the banner and the Javascript in an iframe is to be able to store both ad view data and ad click data belonging to the TradeDoubler domain. When the web browser has finished loading this iframe, details about the banner view are stored in the available storage locations.

• The user clicks on the banner. A method in the Javascript is called that stores information about the click in all available storage locations. The web browser is then instructed to follow the link to the merchant landing page.

• On the merchant landing page the user clicks on the buy button to confirm a closure.

• The merchant closure page embeds an iframe located on the TradeDoubler domain. The merchant also submits the closure details as request parameters to this iframe. The iframe contains a Javascript which retrieves all view and click data for this merchant and submits them together with the details about the closure to the TradeDoubler server.

• Thus TradeDoubler is able to connect a closure to previous clicks and views.
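The retrieval step at the time of a closure can be sketched as follows. The sketch assumes that the stored text string is a JSON array of records with type and merchantId fields; this record format, the function name buildClosureReport and the example values are assumptions, since the storage format of the prototype is not specified here.

```javascript
// Selects the stored views and clicks for one merchant and bundles them
// with the closure details, ready to be submitted to the server.
function buildClosureReport(storedText, merchantId, closure) {
  let records;
  try {
    records = JSON.parse(storedText);
  } catch (e) {
    records = []; // missing or unreadable data: report the closure alone
  }
  if (!Array.isArray(records)) records = [];
  const relevant = records.filter((r) => r && r.merchantId === merchantId);
  return {
    merchantId: merchantId,
    closure: closure,
    clicks: relevant.filter((r) => r.type === "click"),
    views: relevant.filter((r) => r.type === "view"),
  };
}

// Made-up example data as it might have been stored on the affiliate page.
const stored = JSON.stringify([
  { type: "view", merchantId: 42, ad: "banner-1" },
  { type: "click", merchantId: 42, ad: "banner-1" },
  { type: "click", merchantId: 7, ad: "banner-9" },
]);
const report = buildClosureReport(stored, 42, { orderValue: "100" });
console.log(JSON.stringify(report));
```

Filtering on the client keeps data about other merchants out of the request, at the cost of trusting the client with the selection.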

4.5 Methods not implemented

Two of the potential tracking methods were not implemented as prototypes.

4.5.1 Extended browser fingerprinting

The extended browser fingerprinting method can potentially uniquely identify almost all distinct web browsers given enough variable characteristics. There are however some problems in comparison to the basic fingerprinting method described in Section 2.5.2.

Using more variable characteristics in the browser fingerprint increases the probability of a fingerprint change between different user actions. What happens if one of these characteristics changes between a click and a subsequent closure? If the fingerprinting variables are likely to change, the ability to connect the click to the closure might be lost.

A good fingerprinting algorithm should tolerate small changes in the fingerprint and still recognise it as the same unique web browser. For example, if all fingerprint characteristics match except that the web browser has a newer version number, it is likely the same unique user. Also, different characteristics are not equally identifying. For example, a change in the IP address is more likely to indicate a new user than a change in the version number of the web browser. Therefore a good fingerprinting algorithm needs to assign a weight to each variable characteristic. If the sum of the weights of the matching characteristics is above some threshold value, it should be considered a matching fingerprint.
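The weighted matching rule can be illustrated with a short sketch. The characteristic names, the weights and the threshold below are made-up values chosen only to show the mechanism; choosing good values is exactly the part that is left out of scope.

```javascript
// Sums the weights of all characteristics that are equal in both
// fingerprints. Characteristics without a weight are ignored.
function matchScore(a, b, weights) {
  let score = 0;
  for (const key of Object.keys(weights)) {
    if (key in a && key in b && a[key] === b[key]) {
      score += weights[key];
    }
  }
  return score;
}

// A fingerprint matches when the score is above the threshold.
function isSameBrowser(a, b, weights, threshold) {
  return matchScore(a, b, weights) > threshold;
}

// Hypothetical weights: the IP address counts more than the version.
const weights = { ip: 5, browserVersion: 1, screen: 2, timezone: 1, plugins: 3 };
const threshold = 8; // hypothetical

const atClick = { ip: "192.0.2.1", browserVersion: "3.5", screen: "1280x800", timezone: "-60", plugins: "Flash" };
const upgraded = Object.assign({}, atClick, { browserVersion: "3.6" });
const otherIp = Object.assign({}, atClick, { ip: "198.51.100.7" });

console.log(isSameBrowser(atClick, upgraded, weights, threshold));
console.log(isSameBrowser(atClick, otherIp, weights, threshold));
```

With these values a browser upgrade still matches (score 11 out of 12), while a changed IP address does not (score 7), which reflects the two examples in the text.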

How to construct a good algorithm for doing this is out of the scope of this degree project.

4.5.2 HTTP basic authentication

Some preliminary work was done on implementing the HTTP basic authentication tracking method. The key issue is how to get the web browser to authorize with some unique identifier without displaying a login popup window. After a successful authentication the web browser will resubmit the same authentication data in each subsequent HTTP request. Thus, it is in theory possible to tag a web browser with an ID in the HTTP authentication header and later retrieve it.

Some web browsers do authenticate if the web page contains an embedded imagewith the following URL format:

<img src="http://UNIQUE_ID:[email protected]/image.gif">

After successfully requesting and receiving the image the web browser will remember the login credentials and subsequently resubmit them with all HTTP requests to the TradeDoubler domain. However, most browsers ignore the login credentials of the embedded image and thus this approach was not successful. Another approach is to use Javascript functionality to issue an HTTP request. An example of this is the following Javascript code:

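// Base64 is assumed to be a helper library loaded on the page;
// browsers that provide the built-in btoa() can use that instead.
var req = new XMLHttpRequest();
var user = "UNIQUE_IDENTIFIER";
var pwd = "password";
var enc = Base64.encode(user + ":" + pwd);

// The third argument (false) makes the request synchronous.
req.open("GET", "http://www.tradedoubler.com/pixel.gif", false);
req.setRequestHeader("Authorization", "Basic " + enc);
req.send(null);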

This approach works successfully in some web browsers but not all.

After some more work it was decided that the HTTP basic authentication method would not be fully implemented and evaluated. The primary reason is that even if it could work reliably with all HTTP compliant web browsers, the authorization data would still only be cached in the web browser for the current browsing session. When the web browser is closed the cached authorization data is discarded and the ID is lost. Since many other methods allow persistent storage over numerous browser sessions, the HTTP basic authentication method was not as appealing and no more work was done on implementing a prototype.

4.6 Summary

In this chapter the prototype implementations were described. The methods selected for implementation were the CSS history leakage method, the web browser cache method and the client side storage method. The implementations were made with focus on performing well in the subsequent evaluation. The motivation for not implementing the final two methods was also given. The HTTP basic authentication method suffered from not having an acceptable time persistence, and the browser fingerprinting method was too complicated to implement acceptably within the time scope of this degree project.

Chapter 5

Evaluation Criteria

In this chapter the criteria for evaluating the prototype implementations from Chapter 4 are presented.

5.1 Evaluation criteria

To evaluate and compare the different tracking methods three evaluation criteria were selected.

5.1.1 How large share of user actions can be tracked?

TradeDoubler and the affiliates usually get paid only when a previous ad click or ad view can be tracked to a closure. Therefore one of the main criteria when evaluating a tracking method is: How large share of user actions can be tracked?

There are several reasons why 100 % of all user actions cannot be tracked. One reason is the fact that there are various web browsers and web browser versions available, all with different capabilities. Not all web browsers support all tracking methods, and sometimes the web users themselves use programs or web browser settings that make them harder to track.

To get a good estimate of the current traceable share of all user actions, the methods would have to be tested on real web traffic. Since testing the prototypes on real traffic was out of the scope of this degree project, the criterion was instead evaluated with three sub-criteria:

• Each method is evaluated to see if ad clicks, ad views or both are traceable.

• Each method is tested for browser compatibility to see which web browsers are traceable. The prototypes are tested with the latest version (2010-12-01) of Microsoft Internet Explorer, Mozilla Firefox, Google Chrome, Opera, Apple Safari and Apple Safari for iPhone. All these browsers also have the latest version (2010-12-01) of Adobe Flash and Microsoft Silverlight, with the exception of Apple Safari for iPhone, where neither is available.

• A good tracking method should have some time persistence. The current methods of TradeDoubler can track ad clicks and ad views for up to 30 days. Since the time persistence of the evaluated tracking methods depends greatly on user settings and user browsing behavior, a precise estimate cannot be reached without testing on real web traffic. Instead a rough estimate is made based on the default settings of the different tested web browsers.

5.1.2 What are the customer effects?

There are two different groups of customers of TradeDoubler, namely the merchants and the affiliates, as described in Section 1.1. The tracking system will affect both the merchants and the affiliates as it requires some implementation details on each participating web site.

The current tracking methods of TradeDoubler have different requirements for the two groups of customers:

• The affiliates need to embed an image with an associated link. This image is the advertisement shown to the web user and the link takes the user to the merchant, as illustrated in Figure 2.3.

• The merchants need to embed the trackback pixel and make sure it contains the correct request parameters, as illustrated in Figure 2.4. Some merchants also need code to handle incoming request parameters, as described in Section 2.4.2.

A new tracking method should preferably need little or no implementation changes in either affiliate or merchant web sites. If changes are necessary they are preferred on the merchant sites, since there are much fewer merchants than affiliates and also since merchants usually have better coding expertise. Naturally, small changes in implementation are preferred over large ones.

For each evaluated tracking method a description of the required changes in customer implementations is given.

5.1.3 What are the server side effects?

TradeDoubler handles huge volumes of data. The servers of TradeDoubler must for example handle the serving of content and the storing of various action data, and also perform the tracking at the time of a closure. When implementing new tracking methods special care must be taken to avoid wasting the limited resources of the TradeDoubler servers.

A new tracking method should not replace the old tracking methods, but rather work side by side with them. It is preferred if the new tracking methods can be implemented without needing significant changes to the older ones. Also, two tracking methods running side by side would perform much better and minimize server cost if they could reuse the same user action data instead of redundantly storing the necessary data twice.

For each evaluated tracking method a description of the necessary TradeDoubler server side changes is given. It is also evaluated whether the tracking methods are able to reuse the stored click and view data of the current tracking methods instead of redundantly storing the same data again.

5.2 Summary

In this chapter three evaluation criteria were defined:

• How large share of user actions can be tracked?

• What are the customer effects?

• What are the server side effects?

Chapter 6

Result

In this chapter the prototype implementations from Chapter 4 are evaluated according to the evaluation criteria given in Chapter 5.

6.1 CSS history leakage

The CSS history leakage method inserts a unique URL into the history of a web browser performing a click action and subsequently uses CSS to detect if any unique URL exists in the web browser history at the time of a closure. Thus a closure can be connected to a click.

6.1.1 How large share of user actions can be tracked?

The prototype implementation could successfully track ad clicks but not ad views.

The prototype tracking worked successfully in all the tested browsers with default settings. Also, since both the web browser history and CSS have been standard features of web browsers for a long time, many older browsers are also likely to support the CSS history leakage method.

To get an accurate estimate of the average time persistence the prototype needs to be tested on real traffic. However, all tested web browsers have default settings for how long the browsing history is stored, as seen in Table 6.1. This should only be considered a rough estimate of the time persistence of the browser history storage, since browser users may clear the history and may also change the default settings for the web browser history.

6.1.2 What are the customer effects?

The affiliate sites are unaffected by the prototype. The merchant sites only need a minor modification: the ordinary embedded trackback pixel on the closure page is replaced with an embedded iframe, using the same request parameters.

Table 6.1. The default settings of the web browsers' history persistence

Web browser          History
Internet Explorer 8  20 days
Firefox 3.5          >90 days
Safari 5             30 days
Safari for iPhone    >10 days
Chrome 7             120 days
Opera 10             1000 URLs

6.1.3 What are the server side effects?

The CSS history leakage method requires additional resources on the TradeDoubler servers. This is because during a closure action the list of URLs previously generated for the specified merchant during a specified time interval needs to be queried from the TradeDoubler database and then submitted to the web browser performing the closure action. This requires more processing power and additional bandwidth for TradeDoubler.

The CSS history leakage method uses a different identifying mechanism than the cookie based approach. Therefore the cookie ID is not reused for the CSS history leakage method. If the two methods coexist the same click data is stored twice in the database, once for the cookie tracking and once for the browser history tracking. Thus the click data requires twice as much storage space on the TradeDoubler servers.

6.1.4 Summary of CSS history leakage evaluation result

The CSS history leakage method operates by inserting and retrieving a unique entry in the web browser history. The prototype is able to track ad clicks but not ad views. There are several advantages with using this method:

• It can be used without needing changes to current affiliate implementations.

• It only needs small changes to current merchant implementations.

• It is likely more time persistent than the current cookie based method, since clearing the browser history affects the browser user more than clearing for example the web browser cache.

The CSS history leakage method also has several disadvantages:

• The prototype implementation cannot track ad views and there is probably no easy way to achieve this.

• Newer versions of Google Chrome and Apple Safari have blocked the CSS history leakage and Mozilla Firefox will likely follow soon, thus the method is not future safe.

• It needs significantly more server resources than the other methods.

In summary, the CSS history leakage method can track clicks in a majority of today's web browsers, but will likely not work in future web browser versions. Thus it is not a good long-term tracking solution for TradeDoubler.

6.2 The web browser cache

The web browser cache method inserts an entry with a unique identifier into the web browser cache. This unique identifier can subsequently be retrieved at the time of a closure and thus connect the closure to a click.

6.2.1 How large share of user actions can be tracked?

The prototype implementation can successfully track ad clicks but not ad views. Theoretically it is possible to track ad views too, as described in the implementation details in Section 4.3.1.

The prototype implementation works successfully in all the tested browsers with default settings. Also, since the caching functionality has been a part of HTTP for a long time, most older browsers should also support the cache method.

It is not an easy task to estimate the average time persistence without testing the prototype on real traffic. The time persistence depends on the web browser cache size and on how often the cache is manually cleared, but mostly on the surfing habits of the web browser user. All the tested web browsers have some maximum cache size, as seen in Table 6.2. Most browsers allow the user to manually change the cache size. When the cache is full the cached resources are replaced on a first in, first out basis, implying that the cached Javascript has less time persistence in the web browser of a more frequent web user.

Table 6.2. The default cache size of different web browsers

Web browser          Default cache size
Internet Explorer 8  50 MB
Firefox 3.5          50 MB
Safari 5             >100 MB
Safari for iPhone    1582 kB
Chrome 7             320 MB
Opera 10             20 MB
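The relation between cache size, surfing habits and time persistence can be made concrete with a back-of-the-envelope estimate. The sketch assumes strict first in, first out replacement and a constant amount of newly cached data per day; the 5 MB per day figure is a made-up example value, not a measurement, and the function name is hypothetical.

```javascript
// Rough estimate of how many days a cached entry survives under strict
// first in, first out replacement and a constant daily cache volume.
function estimateCacheLifetimeDays(cacheSizeBytes, bytesCachedPerDay) {
  if (bytesCachedPerDay <= 0) return Infinity; // nothing ever pushes it out
  return cacheSizeBytes / bytesCachedPerDay;
}

const MB = 1024 * 1024;
// Default cache sizes taken from Table 6.2; 5 MB per day is hypothetical.
console.log(estimateCacheLifetimeDays(50 * MB, 5 * MB)); // 50 MB default cache
console.log(estimateCacheLifetimeDays(20 * MB, 5 * MB)); // 20 MB default cache
```

Under these assumptions a 50 MB cache would keep the Javascript for about 10 days and a 20 MB cache for about 4 days, and doubling the daily volume would halve both figures.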

6.2.2 What are the customer effects?

The affiliate sites are unaffected by the prototype. The merchant sites only need minor changes to their respective closure pages, namely including a Javascript and calling a method to append the cached ID to the ordinary trackback pixel.

6.2.3 What are the server side effects?

The original cookie based approach of tracking puts an ID in a cookie at the time of a click, while at the same time storing the click data in the TradeDoubler database. The same ID can be used as the global variable in the unique cached Javascript. Thus both methods use the same ID that points to the same click data in the TradeDoubler database and no redundant click information is written. Therefore there are only minor server side effects of using this prototype, such as generating and serving the Javascript.

6.2.4 Summary of the web browser cache evaluation result

The web browser cache method operates by inserting and retrieving a unique identifier in a cached Javascript in the web browser. The prototype implementation is able to track ad clicks but not ad views. There are several pros with using this method:

• It works with all tested browsers, and likely works with most older browser versions as well. This is because the mechanism used has been a standard part of HTTP for many years.

• It requires no changes to current affiliate implementations.

• It only needs minor changes to current merchant implementations.

• The stored click data from the cookie approach can be reused with the web browser cache method, thus no redundant information is stored in the TradeDoubler database.

The web browser cache method also has some cons:

• It needs changes, albeit minor, to current merchant implementations.

• The current prototype does not have the ability to track ad views. This might be possible with a better implementation as described in Section 4.3.

In summary, the web browser cache method can track ad clicks and perhaps also ad views with some persistence in time. Since the mechanism behind the method has been a part of the HTTP specification for a long time it likely works in most browser versions available today. Also, it will likely continue to work in future browser versions.

6.3 Client side storage capabilities

The client side storage capabilities method works by using various storage locations for storing click and view data in the user's web browser. These include HTML 5 capabilities, a Javascript property called window.name, MSIE userData, Silverlight Isolated Storage and Flash Local Shared Objects.

6.3.1 How large share of user actions can be tracked?The prototype implementation can successfully track both clicks and views.

The prototype tracking works successfully in all the tested browsers. All stor-age locations are not available in all web browsers but all web browser have someavailable storage location. This method of tracking only needs one available storagelocation to work.

To get an accurate estimate of the average time persistence the prototype needsto be tested on real web traffic. However, most of the storage locations are notautomatically cleared as is the case with all the previous methods. In order to clearthe stored data the browser users needs to manually clear all locations. This cannotbe done from a single user interface in any of the tested browser. This means thatthe client storage method of tracking is likely the most time persistent trackingmethod available with a potential unlimited time persistence.

6.3.2 What are the customer effects?

The affiliates need to change their current implementation from embedding a banner image to instead embedding an iframe. The merchant sites need to replace the trackback pixel on the closure page with an iframe using the same request parameters.

6.3.3 What are the server side effects?

The prototype implementation has minor effects on server side resources since the click and view data is stored locally in the users' browsers. Thus no additional data is stored at the TradeDoubler server.

The only server side effects of the prototype are minor ones, such as generating and serving the different iframes and JavaScript files.

6.3.4 Summary of client side storage evaluation result

The client side storage method operates by inserting the ad click and ad view data in all available client side storage locations. This stored data is subsequently retrieved at the time of a closure. The prototype implementation is able to track both ad clicks and ad views. There are several advantages with using this method:

• It works with all tested browsers. All of the tested browsers had some kind of client side storage available.

• Since the ad click and ad view data is all stored on the client side, no redundant information is stored on the TradeDoubler server with the current prototype.

• This tracking method will likely have the greatest time persistence of all the methods found.

The client side storage method also has some disadvantages:


• Even though the client side storage method works with all tested browsers, it will likely not work with older browser versions that are still in use. This is because client side storage is a recently introduced capability in web browsers.

• It requires changes, albeit small, to the merchant implementations.

• It requires changes, albeit small, to the affiliate implementations.

• Since the prototype stores all data in the client web browser, TradeDoubler loses some control over that data.

In summary, the client side storage method can track ad clicks and ad views, likely with a great persistence in time. Even if it does not work with older web browsers, it will likely work in future web browser versions, as more browser vendors implement the different storage mechanisms.


Chapter 7

Conclusions

In this chapter the conclusions of this degree project are described. Some recommendations are also given for TradeDoubler, and future work is discussed.

7.1 Conclusions

Amongst the evaluated prototypes, one method has the best overall result in the evaluation in Chapter 6: the web browser cache method. The reasons are good results on all evaluation criteria:

A large share of user actions can be tracked. The prototype implementation can track ad clicks, and a better implementation was shown to be able to track ad views as well. The web browser cache method had the greatest browser compatibility of all implemented methods, which is likely true for both earlier and future browser versions. A high degree of web browser compatibility implies that a larger share of user actions can be tracked, as does the likely good time persistence of the method.

Minor customer effects. The web browser cache method had very small effects on customer implementations. This is especially true for the affiliate sites, where no changes were required and where small or no implementation effects were most important.

Small server side effects. Since the web browser cache method can reuse the ID from the cookie based method, no duplicate user action data needs to be written to the TradeDoubler database.

In addition to the above, and in comparison to the other implemented tracking methods, the web browser cache method was by far the easiest to develop a working prototype for across all the tested web browsers. No web browser specific adaptation of the code was needed at all.


The other two implemented methods, namely the CSS history leakage and the client side storage capabilities, are also capable of tracking users, but they suffer from the following problems:

• It is uncertain whether the web browser vendors will plug the CSS history leakage in the future, which would completely disable this method of tracking web users.

• Client side storage requires customers, especially the affiliates, to change their web sites. If not all customers change their web site implementations to support the client side storage method, it will not reach its full potential share of user actions tracked.

7.2 Recommendations

The author's recommendation for TradeDoubler is to implement the web browser cache method as a first choice.

Having the web browser cache method as a first choice does not mean the other methods are not of any interest. Several methods of tracking may be combined to reach an even larger tracked share of user actions, just as TradeDoubler currently uses the three separate methods described in Section 2.3.

The next step after implementing the web browser cache method should be to implement the client side storage method. The CSS history leakage method does not look like a good long-term tracking solution as of now, but the situation might change depending on how the browser vendors react. The current recommendation of the author is to wait and see before possibly implementing it.

7.3 Future work

Tracking web users is not a research area where a single piece of research and its results lead to profit forever. Since the web browsers and the web in general are rapidly evolving, continuous research is required to keep tracking methods at the cutting edge.

The extended browser fingerprinting method, even though not implemented in this degree project, is also a good candidate for future implementation. In comparison to the other methods, fingerprinting is much harder to prevent from working. All other methods could potentially be blocked by the browser vendors, but a browser fingerprint is always achievable as long as there are some identifiable variable characteristics of the web browser. Also, since other companies are already using the fingerprinting approach, it has been proven to be a possible way to track web users. Future research could further explore the fingerprinting approach.
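The idea of deriving an identifier from variable browser characteristics can be illustrated with the sketch below. It is an assumption-laden example, not a method evaluated in this degree project: the function name `browser_fingerprint` and the characteristic names used in the usage note are hypothetical, and SHA-256 is chosen only as a convenient hash.

```python
import hashlib
from typing import Dict


def browser_fingerprint(characteristics: Dict[str, str]) -> str:
    """Hash a set of browser characteristics into a stable identifier."""
    # Sort the keys so that the same characteristics always produce the
    # same fingerprint, regardless of the order they were collected in.
    canonical = "\n".join(
        f"{key}={characteristics[key]}" for key in sorted(characteristics)
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

A caller might pass values such as the user agent string, screen resolution and time zone. The more characteristics that vary between browsers, the more likely it is that two different browsers get different fingerprints.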


7.4 Summary

In this chapter the conclusions of this degree project were presented. All three evaluated methods were viable for real implementations. However, the web browser cache method was recommended as the best method to implement because of its overall evaluation result. Finally, some future work regarding tracking methods was discussed.


