collecting user-data-socially-responsibly

Post on 16-Apr-2017

Category:

Data & Analytics

“Collecting User's Data in a Socially-Responsible Manner.” Photograph: Daniel Beltra/Greenpeace

Konark Modi @konarkmodi

Josep M. Pujol @solso

About Cliqz

• 80+ - Team size

• 500,000 - DAU

• 3 Million+ - Downloads (Germany only)

• 1 billion+ - Indexed pages (We do not believe in indexing the web.)

• 5 TB - Indexed in memory (based on open-source and in-house built NoSQL stores.)

• 10x more coverage for anti-phishing protection - Compared with other players such as Safe Browsing by Google.

• Upcoming products like Anti-tracking etc.

We Love Data …

Let's step back a bit in time, to get the context.

Source : http://thehumanfaceofbigdata.com

“Data is the new oil” - Clive Humby (2006)

Data is still being collected without enough controls and measures, and the biggest by-product of this collection is SESSIONS.

Is privacy the new green?

How?

[Diagram labels: Alice, Alice, Bob; MAP/REDUCE on the server side; the client side marked "Uncharted water".]

Instead …

[Diagram labels: Alice, Alice, Bob; MAP/REDUCE (three times); Client-Side; Server-Side; "Uncharted water".]

Who is responsible ?

Is there a conspiracy theory or an evil plan ?

Well, we have a simpler explanation:

It’s the consequence of common development practices, which result in users’ data being traded, knowingly or unknowingly!

Demo

This looks like a toy example ?

Which queries are so bad that they force people to redo the same query elsewhere?

Let’s take a more complex case

[Diagram sequence: Alice issues the query "apache big data conf" on search engine 1 and on search engine 2. Labels: Client-Side, Server-Side, Map-Reduce, "Uncharted water".]

As mentioned before, we believe in data and are not against its collection.

• Stopping data collection altogether would be foolish and dangerous. It would also mean stopping the wheels of innovation.

• Who would benefit the most by supporting a ban on advertisements of tobacco products?

“Socially responsible manner” is an analogy: the events being collected must not suffer from pollutants such as explicit IDs and implicit IDs, and must reach home securely.

Why does CLIQZ Care ?

• German data privacy laws

• Security breaches

• When government knocks on your door

So what do we bring to the table?

HUMAN WEB

• We have developed HumanWeb to balance the right to privacy with the need to build products that improve the web and allow for more openness.

• It ensures that data from which sessions or linkages to navigation patterns could be inferred is not collected.

• It does not create so much data that individuals could be identified.

• We do not want to know who "YOU" are, what "YOU" searched and when "YOU" searched.

• It is designed so that a "malicious/untrustworthy" actor, or for that matter anyone at Cliqz, who gets access to the raw data flow cannot infer or identify individuals.

Sample events:

{
  "action": action of the message,
  "ver": version name,
  "type": "humanweb",
  "payload": { }, // the actual data
  "ts": UTC time capped to the day, e.g. 20150909
}
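The envelope above can be sketched in a few lines of Python. This is only an illustration of the five fields and of capping the timestamp to the day; the function names, the "page" action and the "1.0" version string are assumptions, and storing "ts" as a string is a guess.

```python
import json
from datetime import datetime, timezone


def cap_to_day(moment: datetime) -> str:
    """Return the UTC date as YYYYMMDD, dropping the time of day."""
    return moment.astimezone(timezone.utc).strftime("%Y%m%d")


def make_event(action: str, payload: dict, ver: str, moment: datetime) -> dict:
    """Build an envelope with the five fields listed on the slide."""
    return {
        "action": action,
        "ver": ver,
        "type": "humanweb",
        "payload": payload,
        "ts": cap_to_day(moment),
    }


event = make_event(
    action="page",   # hypothetical action name
    payload={},      # the actual data would go here
    ver="1.0",       # hypothetical version string
    moment=datetime(2015, 9, 9, 17, 42, 5, tzinfo=timezone.utc),
)
print(json.dumps(event))
```

Because the time of day is dropped, two events sent on the same day carry the same "ts" and cannot be ordered or linked by timestamp.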

• Sample event for Page

• Sample event for Query

HumanWeb

[Diagram of the client-side pipeline. Labels: Local storage | Structural data about webpages; Map-Reduce aggregations, heuristics, filtering, hashing; Final checks: filtering, sanitisation / masking; Event queue [{event1}, {event2}, {event3}] | scheduled to ensure events are not sent in a batch; Secure Channel.]
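One way to read "schedule to ensure not sent in batch" is that every queued event gets its own send time. The sketch below assumes a uniformly random delay and a shuffled order; the function name, the delay window and the event shape are hypothetical and not taken from the HumanWeb code.

```python
import random


def schedule_events(events, now, max_delay_s=3600.0, rng=None):
    """Give every queued event its own random send time after `now`.

    Returns (send_time, event) pairs sorted by send time, so that events
    collected together do not leave the client as one batch.
    """
    rng = rng or random.Random()
    shuffled = list(events)
    rng.shuffle(shuffled)  # break the original collection order
    plan = [(now + rng.uniform(0.0, max_delay_s), event) for event in shuffled]
    plan.sort(key=lambda pair: pair[0])
    return plan


queue = [{"id": "event1"}, {"id": "event2"}, {"id": "event3"}]
plan = schedule_events(queue, now=0.0, max_delay_s=600.0, rng=random.Random(7))
for send_at, event in plan:
    print(f"{send_at:7.1f}s  {event['id']}")
```

Spreading the sends out means the server cannot group events into a session simply because they arrived together.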

Privacy breaches on the way home

To achieve total privacy, we must rely on a network of proxies that remove any network-related data such as cookies, IP and headers, so that fingerprinting is impossible.
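As a rough illustration of what such a proxy does with headers, the sketch below drops fields that could identify the sender before forwarding. The deny-list and function name are assumptions made for this example; the sender's IP address is hidden separately, because the proxy opens its own connection to the next hop.

```python
# Hypothetical deny-list; a real proxy would need a reviewed, maintained list.
IDENTIFYING_HEADERS = {
    "cookie",
    "user-agent",
    "referer",
    "x-forwarded-for",
    "forwarded",
    "via",
    "accept-language",
}


def strip_identifying_headers(headers: dict) -> dict:
    """Return a copy of `headers` without fields that could identify the sender."""
    return {
        name: value
        for name, value in headers.items()
        if name.lower() not in IDENTIFYING_HEADERS
    }


incoming = {
    "Content-Type": "application/octet-stream",
    "Cookie": "session=abc123",
    "User-Agent": "ExampleBrowser/1.0",
    "X-Forwarded-For": "203.0.113.7",
}
forwarded = strip_identifying_headers(incoming)
print(forwarded)
```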

SecureChannel : Protection from network fingerprinting

SecureChannel : What do we encrypt ?

• The queries from the user (initiated by their activity in the address bar of Cliqz’s instrumented Firefox).

• All telemetry signals (initiated by Cliqz’s instrumented Firefox)

• All messages regarding the HumanWeb data collection effort.

Also, before reaching our infrastructure the encrypted messages are routed through a mesh of proxies.

SecureChannel : How do we encrypt ?

Life-cycle of hashes / keys:

• AES: Hash-keys used with AES are used only once, even if the user types the same query.

• Public / private key pair (client):

  • The keys on the client side are all short-lived; we continuously generate keys on the client side.

  • The public/private key pair of the client (the extension) is meant to be used only once and then thrown away. Key pairs are regenerated to fill a pool while the browser is idle.

• Public / private key pair (server):

  • Only the public part of this key is shared with the extension.

  • The client uses it while encrypting the request. This is a long-lived key, currently changed only if it is compromised.

Client side: 128-bit symmetric AES encryption, OpenSSL RSA 1024-bit encryption.

EventLogger: 128-bit symmetric AES encryption, OpenSSL RSA 4096-bit encryption.
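The client-side key life-cycle (use once, throw away, refill the pool while idle) can be sketched as follows. This is a minimal illustration, not the extension's code: the class name and pool size are assumptions, and random bytes stand in for a freshly generated RSA key pair.

```python
import secrets
from collections import deque


class OneTimeKeyPool:
    """Pool of single-use keys, refilled while the client is idle."""

    def __init__(self, size: int = 4):
        self.size = size
        self._keys = deque()
        self.refill()

    def _generate(self) -> bytes:
        # Placeholder: random bytes stand in for a new RSA key pair.
        return secrets.token_bytes(16)

    def refill(self) -> None:
        """Top the pool up to `size`; meant to be called during idle time."""
        while len(self._keys) < self.size:
            self._keys.append(self._generate())

    def take(self) -> bytes:
        """Hand out a key and forget it, so it can never be reused."""
        if not self._keys:
            self.refill()  # fallback when idle-time refills could not keep up
        return self._keys.popleft()


pool = OneTimeKeyPool(size=4)
first, second = pool.take(), pool.take()
pool.refill()  # would be triggered by an idle callback in a browser
```

Generating keys ahead of time keeps the expensive key generation off the request path, which is presumably why the pool is filled while the browser is idle.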

SecureChannel : How do we encrypt ? (Extension)

encryptedRequest(iv:encryptedMsg:encryptedKey)

iv: initialization vector
msg = (originalRequest + ExtensionPublicKey)
key = md5(msg)
encryptedMsg = AES.encrypt(msg, key, {mode: CBC, padding: PKCS7, iv: iv})
encryptedKey = sign(EventLoggerPublicKey, key)

Each request to be encrypted has the following components:

• Message / request to encrypt: query or data.

• ExtensionPublicKey: chosen from a pool of public keys for that user on the machine (the key is used only once and then discarded).

• Initialisation vector: derived from a wordarray of 16-bits.

• EventLoggerPublicKey: our public key, shared with the extension.
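The shape of the request can be illustrated with the Python standard library. Only the structure follows the slide (msg = request + one-time key, key = md5(msg), and a colon-separated envelope); the cipher is a deliberately insecure XOR placeholder standing in for both AES-CBC and RSA, so the IV is generated but not consumed. The function names and the base64 encoding of each field are assumptions.

```python
import base64
import hashlib
import os


def toy_cipher(data: bytes, key: bytes) -> bytes:
    """Placeholder for AES-CBC and RSA: XOR with a repeating key. NOT secure."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


def build_request(original_request: bytes, extension_public_key: bytes,
                  event_logger_public_key: bytes) -> str:
    """Assemble iv:encryptedMsg:encryptedKey, each field base64-encoded."""
    iv = os.urandom(16)                            # fresh IV for every request
    msg = original_request + extension_public_key  # request plus one-time key
    key = hashlib.md5(msg).digest()                # key = md5(msg), as on the slide
    encrypted_msg = toy_cipher(msg, key)           # stands in for AES.encrypt(...)
    encrypted_key = toy_cipher(key, event_logger_public_key)  # stands in for RSA
    fields = (iv, encrypted_msg, encrypted_key)
    return ":".join(base64.b64encode(f).decode("ascii") for f in fields)


envelope = build_request(b"query=apache big data conf", b"|PK", b"server-key")
print(envelope.count(":"))  # two separators, three fields
```

The base64 alphabet contains no colon, so the receiver can split the envelope on ":" without ambiguity.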

SecureChannel : Routing ? (Extension)

• Extension maintains a list of proxies which are healthy / good at that point in time.

• When sending a request / message, the extension picks the end-point in round-robin fashion (for now).

• To avoid the risk of proxies being malicious with the message, we scramble the message and split it into ‘n’ parts, with ‘n’ chosen at random, just before it is sent from the extension.

• The value of ‘n’ is determined by the extension; we expect ‘n’ to be 1, 2, 4 or 8 for the time being. The value of ‘n’ is not known to the proxies, so a proxy cannot tell whether it has all the parts.

• The only way to tamper with a message is to have all the parts needed to decrypt it. Because messages are scrambled, split and sent through different proxies, they are safe from the proxies.

• The Event Logger waits for all the parts of a message; only by combining them at our (secure) Event Logger can the message be decrypted.
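The splitting and routing steps above can be sketched as below. This is an illustration under assumptions: the chunking scheme (near-equal slices), the function names and the proxy names are hypothetical, and only the set of values for n and the round-robin choice come from the slide.

```python
import random


def split_and_scramble(message: bytes, n: int, rng: random.Random) -> list:
    """Cut `message` into n near-equal chunks and return them in random order."""
    size = -(-len(message) // n)  # ceiling division
    chunks = [message[i * size:(i + 1) * size] for i in range(n)]
    rng.shuffle(chunks)           # proxies never see the original order
    return chunks


def route_round_robin(parts: list, proxies: list, start: int = 0) -> list:
    """Pair each part with a proxy, cycling through the healthy-proxy list."""
    return [(proxies[(start + i) % len(proxies)], part)
            for i, part in enumerate(parts)]


rng = random.Random()
n = rng.choice([1, 2, 4, 8])  # values of n named on the slide
parts = split_and_scramble(b"iv:encryptedMsg:encryptedKey", n, rng)
plan = route_round_robin(parts, ["proxy-a", "proxy-b", "proxy-c"])  # hypothetical names
for proxy, part in plan:
    print(proxy, part)
```

With more parts than proxies, a single proxy still sees several parts; how the real system handles that case is not stated in the slides.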

SecureChannel : How do we decrypt ? (Server)

EncryptedRequest = iv:encryptedMsg:encryptedKey
key = unlock(EventLoggerPrivateKey, encryptedKey)
msg = AES.decrypt(encryptedMsg, key, {mode: CBC, padding: PKCS7, iv: iv})
request = msg.data
ExtensionPublicKey = msg.pk (we need it to sign the response)

Important:

• Because the server receives messages in parts, to get the key and message we rely on combinations.

• The message itself is scrambled, so even if it is decrypted we need to stitch it together by trying different combinations.
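"Trying different combinations" can be pictured as a search over orderings of the received parts. The slides do not say how the server recognises the right combination, so the sketch below assumes an MD5 checksum of the original bytes is available as the acceptance test; the function name and the example split are hypothetical.

```python
import hashlib
from itertools import permutations


def stitch(parts, expected_md5_hex):
    """Try orderings of `parts` until the joined bytes match the checksum."""
    for ordering in permutations(parts):
        candidate = b"".join(ordering)
        if hashlib.md5(candidate).hexdigest() == expected_md5_hex:
            return candidate
    return None  # parts missing or corrupted


original = b"iv:encryptedMsg:encryptedKey"
scrambled = [b"g:encry", b"iv:encr", b"ptedKey", b"yptedMs"]  # hypothetical split
restored = stitch(scrambled, hashlib.md5(original).hexdigest())
print(restored == original)
```

With at most 8 parts there are at most 8! = 40,320 orderings, so an exhaustive search stays cheap.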

All talk and no play makes Jack a dull boy!

Demo

Thank You http://www.cliqz.com/en

We believe it’s possible, we are actually doing it

photo: projectsecretidentity.org
