(mobile web applications track) "profiling user activities with minimal traffic traces" -...

Upload: icwe2015

Post on 15-Aug-2015

COPYRIGHT © 2011 ALCATEL-LUCENT. ALL RIGHTS RESERVED. ALCATEL-LUCENT — INTERNAL PROPRIETARY — USE PURSUANT TO COMPANY INSTRUCTION

Profiling User Activities With Minimal Traffic Traces

Tiep Mai, Deepak Ajwani and Alessandra Sala

Bell Laboratories, Ireland

Outline

Telecom data and privacy issue

Truncated URL dataset

User behavior analysis on limited data

• Micro-action burst decomposition

• Representative URL selection

Future work and Conclusions

End-to-End View of the Telecom Network

Mobile user

Webservices

Client-side data

Server-side data

Telecom data

Huge data but with limited features

Empower telecom data analysis with this data

Providing Personalized Services

• Personalized services require user activity profiling

Traditional approaches rely on features extracted from rich data sources

Server side data: full URLs of visited pages, page categories, transaction data, search queries, click-through rates, etc.

Client side data: full URLs (cookies), application data (web browsing), etc.

Network side data: full URLs, HTTP packet content, etc.

• Our goal: Provide medium-grained user profiling for a large user pool from a limited, privacy-preserving dataset

User privacy considerations

Mobile Web Traces

User Behavioral Analysis from Timestamped Data

• Mobile traces provide valuable insights into user behavior

Critical to enable service personalization and enrich the user's online experience

• Complete mobile web traces risk revealing sensitive information

http://finance.yahoo.com/q?s=BAC (Bank of America Corp. stock price)

https://www.google.ie/#q=postnatal+depression (sensitive health condition)

http://www.amazon.com/Dell-Inspiron-i15R-15-6-inch-Laptop/dp/B009US2BKA (specific purchased product)

Removing Sensitive Data from URL Traces

• Telecom operators are subject to restrictive privacy legislation

• Conservative approach to sharing data

Anonymized, truncated and sampled data

Traces from 10,000 anonymized users over 30 days, i.e. more than 130 million records

• Focus on the dataset of truncated URLs or IP addresses

• Resulting data:

1. Truncated: only the domain is kept, e.g. www.amazon.com/Dell-Inspiron-i15R-15-6-inch-Laptop/dp/B009US2BKA becomes www.amazon.com

2. Noisy: unintentional web traffic such as advertisements, web analytics, etc.

Quality of behavior analysis depends on effectively separating unintentional traffic from user activities in the truncated URLs
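A minimal sketch of the truncation step, assuming that truncation keeps only the host name and drops path, query and fragment. The function name `truncate_url` is hypothetical and not taken from the slides; the operator's actual preprocessing is not described here.

```python
from urllib.parse import urlsplit


def truncate_url(url):
    """Keep only the host part of a URL (assumed truncation rule)."""
    # urlsplit only recognises the host when a scheme or a leading "//" is present
    if "://" not in url:
        url = "//" + url
    return urlsplit(url).hostname


if __name__ == "__main__":
    for full in (
        "http://finance.yahoo.com/q?s=BAC",
        "https://www.google.ie/#q=postnatal+depression",
        "www.amazon.com/Dell-Inspiron-i15R-15-6-inch-Laptop/dp/B009US2BKA",
    ):
        print(truncate_url(full))
```

After this step the stock symbol, the search query and the product identifier from the earlier examples are no longer present in the record.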

Web Browsing Behaviors Across Time & Users

• Collection of web traces of several URL types

• Aim: filter out traces that do not represent explicit user action

Identify features to drive detection of unintentional traces

Validate across different users

• Diversity of web domains:

High diversity in user activities

High diversity across users

Methodology Approach

• User activities as a collection of micro user actions, i.e. bursts

Web clicks, chat replies

• Assumption: Each burst represents an atomic user activity

Combination of intended and unintended web traffic

• Methodology

1. Burst decomposition

2. Activity extraction:

Domain classification: leverage specialized features of domain appearance in bursts

Online representative URL selection and activity association

Increase prediction accuracy by 20%

Burst Decomposition – Statistical Parametric Distribution Fitting

• Goal: Decompose the web-trace back into constituent data bursts

• A threshold on packet inter-arrival time (IAT) is needed to separate traces into bursts

• Study the inter-arrival time distribution

• No parametric distribution would match most user traces
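Once an IAT threshold is available, the decomposition itself is straightforward. The sketch below is illustrative only: the record layout `(timestamp, domain)`, the function name `split_into_bursts` and the 3-second threshold in the usage example are assumptions, not values from the slides.

```python
def split_into_bursts(records, threshold):
    """Group (timestamp, domain) records of one user into bursts.

    A new burst starts whenever the gap to the previous record
    exceeds `threshold` (same time unit as the timestamps).
    """
    bursts = []
    for record in sorted(records, key=lambda r: r[0]):
        if bursts and record[0] - bursts[-1][-1][0] <= threshold:
            bursts[-1].append(record)   # gap small enough: same burst
        else:
            bursts.append([record])     # first record or large gap: new burst
    return bursts


if __name__ == "__main__":
    trace = [(0, "a"), (1, "b"), (2, "a"), (10, "c"), (11, "a"), (30, "d")]
    for burst in split_into_bursts(trace, threshold=3):
        print(burst)
```

The hard part, as the slide notes, is choosing the threshold, since no single parametric distribution fits the IATs of most users.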

Burst Decomposition Algorithm

• Robust burst decomposition algorithm that is independent of the distribution shape

• Starting from the smallest IAT value, find the point at which the additional probability gained by extending the threshold further is insignificant compared to the probability already accumulated up to that point
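One possible reading of this rule, sketched on the empirical IAT distribution. This is not the authors' algorithm: the multiplicative step `growth`, the significance ratio `eps` and the name `find_iat_threshold` are all assumptions made for illustration.

```python
from bisect import bisect_right


def find_iat_threshold(iats, growth=2.0, eps=0.05):
    """Pick an IAT threshold without assuming a distribution shape.

    Starting from the smallest positive IAT, the candidate t is multiplied
    by `growth` until the empirical probability in (t, growth * t] falls
    below `eps` times the probability accumulated in [0, t].
    """
    values = sorted(v for v in iats if v > 0)
    if not values:
        raise ValueError("need at least one positive inter-arrival time")
    n = len(values)
    t = values[0]
    while t < values[-1]:
        accumulated = bisect_right(values, t) / n
        extended = bisect_right(values, growth * t) / n - accumulated
        if extended < eps * accumulated:
            return t            # extending further adds almost nothing
        t *= growth
    return values[-1]           # no plateau found: fall back to the largest IAT


if __name__ == "__main__":
    sample = [1] * 50 + [2] * 30 + [3] * 15 + [600] * 5
    print(find_iat_threshold(sample))
```

In the sample, most gaps are a few time units long and a small share are very long idle periods, so the search stops shortly after the short gaps are covered.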

Domain Classification – Initial Insight

• Goal: automatically identify URLs representing user activities

• Measurements are aggregated for all users for each domain

Record-level measurements

Burst-level measurements

Domain Classification - Methodology

• Logistic regression

• Validation error and AIC, BIC

• Two discriminating features

$o_{b,j=1} - u_{b,j=1}$ (~ 22.87) : probability that a domain comes first in bursts with more than one unique domain

$u_{b,j=2}$ (~ -9.51) : probability that a domain comes in bursts with two unique domains
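A rough sketch of how the two features could be computed per domain from a list of bursts. It assumes each probability is taken over the bursts that contain the domain, which the slides do not state; the function name `burst_features` and the example domains are hypothetical.

```python
from collections import defaultdict


def burst_features(bursts):
    """Compute the two burst-level features for every domain.

    `bursts` is a list of bursts, each an ordered list of domain names.
    Probabilities are estimated over the bursts containing the domain
    (an assumption made for this sketch).
    """
    counts = defaultdict(lambda: {"bursts": 0, "first": 0, "u1": 0, "u2": 0})
    for burst in bursts:
        if not burst:
            continue
        unique = set(burst)
        for domain in unique:
            c = counts[domain]
            c["bursts"] += 1
            c["first"] += burst[0] == domain   # domain opens the burst
            c["u1"] += len(unique) == 1        # burst has one unique domain
            c["u2"] += len(unique) == 2        # burst has two unique domains
    return {
        domain: {
            "o1_minus_u1": (c["first"] - c["u1"]) / c["bursts"],
            "u2": c["u2"] / c["bursts"],
        }
        for domain, c in counts.items()
    }


if __name__ == "__main__":
    example = [
        ["news.example", "ads.example"],
        ["news.example"],
        ["news.example", "ads.example", "cdn.example"],
        ["shop.example", "ads.example"],
    ]
    for name, feats in burst_features(example).items():
        print(name, feats)
```

A domain that tends to open multi-domain bursts scores high on the first feature, while a domain that mostly rides along in two-domain bursts scores high on the second, which matches the signs of the two values quoted above.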

Trade-offs of Domain Classification Results

• Trade-off between accuracy, sensitivity, precision and specificity

Maximizing accuracy

Maximizing sensitivity and specificity
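The four measures depend on where the decision threshold on the classifier score is placed. The sketch below uses the standard confusion-matrix definitions; the labels, scores and thresholds in the usage example are made-up illustration data, not results from the study.

```python
def classification_metrics(y_true, scores, threshold):
    """Accuracy, sensitivity, precision and specificity at one threshold.

    `y_true` holds 1 for domains that represent user activity and 0
    otherwise; a domain is predicted positive when score >= threshold.
    """
    tp = fp = tn = fn = 0
    for label, score in zip(y_true, scores):
        predicted = score >= threshold
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif not predicted and label:
            fn += 1
        else:
            tn += 1

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),   # true positive rate (recall)
        "precision": ratio(tp, tp + fp),
        "specificity": ratio(tn, tn + fp),   # true negative rate
    }


if __name__ == "__main__":
    labels = [1, 1, 1, 0, 0, 0, 0, 0]
    scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1, 0.05]
    for thr in (0.5, 0.35):
        print(thr, classification_metrics(labels, scores, thr))
```

Lowering the threshold in this toy data raises sensitivity without hurting specificity, but in general the measures pull in different directions, which is the trade-off the slide refers to.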

Future Work

• Mapping domain to activities (reading, shopping, browsing) and identifying user activities online

• Activity query and recommendation

• Correlating truncated URL data with user location data

Spatio-temporal study of user activities

Conclusions and Remarks

• Telecom data: Huge but limited; Strict privacy regulations

• URL trace data:

Privacy preservation with truncation

Noisy data

Burst property of micro user actions

• Goal: Perform activity extraction and behaviour analysis for a large user-pool with limited and noisy data

• Method:

Burst decomposition and feature extraction

Representative URL identification and activity extraction

Medium-grained behavior analysis is feasible with limited, noisy and privacy-preserving URL data

Thank you

• Questions?