Chi-Yao Hong, UIUC
Fang Yu, MSR Silicon Valley
Yinglian Xie, MSR Silicon Valley
Populated IP Addresses — Classification and Applications
ACM CCS (October, 2012)
A Seminar at Advanced Defense Lab
• Introduction
• System Design
• Implementation
• Evaluation
• Application
Outline
2012/9/25
• While online services have become everyday essentials for billions of users, they are also heavily abused by attackers.
  • Web-based email
• Online service providers often rely on IP addresses to perform blacklisting and service throttling.
  • IP addresses that are associated with a large number of user requests must be treated differently.
Introduction
• We define IP addresses that are associated with a large number of user requests as Populated IP (PIP) addresses.
  • PIPs are not equivalent to the traditional concept of proxies, NATs, gateways, or other middleboxes.
Populated IP Addresses
• In this paper, we introduce PIPMiner, a fully automated method to extract and classify PIPs.
Goal
• We take a data-driven approach using service logs that are readily available to all service providers.
• We train a non-linear support vector machine (SVM) classifier that is highly tolerant of noise in the input data.
System Design
• PIP Selection
  • Phase 1: IP addresses with at least rL requests (rL = 1,000)
  • Phase 2: IP addresses that have been used by at least uM accounts, together accounting for at least rM requests (uM = 10, rM = 300)
System Flow
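The two selection thresholds can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not PIPMiner's DryadLINQ implementation: the log format (one `(ip, account_id)` pair per request) and all function names are hypothetical, and because the slide does not say how the two phases are combined, the sketch reports each test separately.

```python
from collections import defaultdict

R_L = 1000  # Phase 1: request threshold rL (from the slide)
U_M = 10    # Phase 2: minimum number of distinct accounts uM (from the slide)
R_M = 300   # Phase 2: minimum number of requests rM (from the slide)


def summarize(log):
    """Count requests and distinct accounts per IP.

    `log` is a hypothetical iterable of (ip, account_id) pairs,
    one pair per request.
    """
    requests = defaultdict(int)
    accounts = defaultdict(set)
    for ip, account in log:
        requests[ip] += 1
        accounts[ip].add(account)
    return requests, accounts


def check_phases(log):
    """Return {ip: (passes_phase1, passes_phase2)} for every IP in the log.

    How the two phases combine is not stated on the slide, so that
    decision is left to the caller (assumption).
    """
    requests, accounts = summarize(log)
    result = {}
    for ip, n_requests in requests.items():
        phase1 = n_requests >= R_L
        phase2 = len(accounts[ip]) >= U_M and n_requests >= R_M
        result[ip] = (phase1, phase2)
    return result
```

For example, an IP with 1,000 requests from only 3 accounts passes the Phase 1 test alone, while an IP with 300 requests spread over 12 accounts passes the Phase 2 test alone.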
• Population Features capture aggregated user characteristics.
• Time Series Features model the detailed request patterns.
• IP Block Level Features aggregate IP block level activities and help recognize proxy farms.
Features
Population Features
Time Series Features
• Large proxy farms often redirect traffic to different outgoing network interfaces for load-balancing purposes.
• Determine neighboring IP addresses:
  • Neighboring IPs must be announced by the same AS.
  • Neighboring IPs are continuous over the IP address space, and each neighboring IP is itself a PIP.
IP Block Level Features
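One way to read the two neighbor rules is as a single pass over the sorted PIP list that starts a new block whenever the address gap or the AS changes. The sketch below assumes IPv4 addresses and a hypothetical `asn_of` mapping from IP string to origin AS number; neither the function name nor the data layout comes from the paper.

```python
import ipaddress


def neighbor_blocks(pips, asn_of):
    """Group PIPs into blocks of neighboring addresses.

    Two PIPs land in the same block only if their addresses are
    consecutive and they are announced by the same AS. Every member
    is a PIP by construction, since only PIPs are passed in.

    `pips`   : iterable of IPv4 address strings classified as PIPs
    `asn_of` : hypothetical dict mapping IP string -> origin AS number
    """
    ordered = sorted(set(pips), key=lambda ip: int(ipaddress.IPv4Address(ip)))
    blocks = []
    prev_value = None
    prev_asn = None
    for ip in ordered:
        value = int(ipaddress.IPv4Address(ip))
        asn = asn_of[ip]
        if blocks and value == prev_value + 1 and asn == prev_asn:
            blocks[-1].append(ip)   # extends the current block
        else:
            blocks.append([ip])     # address gap or AS change: new block
        prev_value, prev_asn = value, asn
    return blocks
```

Block-level features could then be computed by aggregating the per-IP request data over each returned block.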
EX: Block Level Time Series
• Non-linear SVM
Training and Classification
Kernel Function k(xi, x)
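The body of this slide did not survive extraction, so the sketch below shows only the standard form of a kernel SVM decision function, f(x) = Σ αi·yi·k(xi, x) + b, where the xi are the support vectors. The Gaussian (RBF) kernel is an assumed, commonly used choice, not necessarily the kernel used in the paper, and the support vectors and coefficients in the usage note are toy values.

```python
import math


def rbf_kernel(xi, x, gamma=0.5):
    """Gaussian RBF kernel k(xi, x) = exp(-gamma * ||xi - x||^2).

    Assumed kernel; `gamma` is an illustrative default.
    """
    squared_distance = sum((a - b) ** 2 for a, b in zip(xi, x))
    return math.exp(-gamma * squared_distance)


def decision_value(x, support_vectors, coeffs, bias, kernel=rbf_kernel):
    """Evaluate f(x) = sum_i coeffs[i] * k(x_i, x) + bias.

    `coeffs[i]` stands for alpha_i * y_i, the product of the learned
    weight and the label of support vector x_i.
    """
    total = sum(c * kernel(sv, x) for sv, c in zip(support_vectors, coeffs))
    return total + bias


def classify(x, support_vectors, coeffs, bias):
    """Return +1 or -1 according to the sign of the decision value."""
    return 1 if decision_value(x, support_vectors, coeffs, bias) >= 0 else -1
```

With toy support vectors `[(0.0, 0.0), (2.0, 2.0)]`, coefficients `[1.0, -1.0]`, and zero bias, a point near the first support vector gets a positive decision value and a point near the second gets a negative one. In practice the support vectors and coefficients come from a trained model rather than being set by hand.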
• Data Parsing and Feature Extraction (Stage 1)
  • We implement PIPMiner on top of DryadLINQ [link], a distributed programming model for large-scale computing.
  • Using a 240-machine cluster
• Training and Testing (Stage 2)
  • Quad-core CPU with 8 GB RAM
  • LIBSVM [link] and LIBLINEAR [link] toolkits
Implementation
• We apply PIPMiner to a month-long Hotmail login log from August 2010 and identify 1.7 million PIP addresses. (200 MB)
  • 0.5% of the observed IP addresses
  • The source of more than 20.1% of the total requests
  • Associated with 13.7% of the total accounts in our dataset
• At Stage 1, PIPMiner processes a 296 GB dataset in only 1.5 hours.
Evaluation
PIP Score Distribution
PIP Address Distribution
[Figure labels: Dynamic IP]
• Among 1.7 million PIP addresses, 973K of them can be labeled based on the account reputation data.
Accuracy Evaluation
Accuracy of Individual Components
Accuracy against Data Length
• Future Reputation
  • The reputation score of July 2011 (after 11 months)
Validation of Unlabeled Cases
• Windows Live ID Sign-up Abuse Problem
  • We focus on the sign-ups related to Hotmail and use the Hotmail reputation trace from July 2011 (after 11 months) to determine whether a particular sign-up account was malicious or not.
• We study the sign-up behavior on two types of IP addresses.
  • The first is the 1.7 million derived PIPs.
  • The second is the set of IP addresses that have more than 20 sign-ups from the Windows Live ID system but are not included in the 1.7 million PIPs.
Application
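The two groups of IP addresses compared in this study can be separated as follows. This is a minimal sketch assuming a hypothetical input of one IP string per sign-up event plus a set of already derived PIPs; only the "more than 20 sign-ups" threshold comes from the slide.

```python
from collections import Counter

SIGNUP_THRESHOLD = 20  # "more than 20 sign-ups" (from the slide)


def split_signup_ips(signup_ips, pips):
    """Split sign-up source IPs into the two groups described above.

    `signup_ips` : hypothetical iterable with one IP string per sign-up
    `pips`       : set of IP strings already derived as PIPs

    Returns (pip_group, heavy_non_pip_group); each maps IP -> sign-up count.
    """
    counts = Counter(signup_ips)
    pip_group = {ip: n for ip, n in counts.items() if ip in pips}
    heavy_non_pip_group = {
        ip: n
        for ip, n in counts.items()
        if ip not in pips and n > SIGNUP_THRESHOLD
    }
    return pip_group, heavy_non_pip_group
```

Sign-ups from IPs that fall in neither group (non-PIPs with 20 or fewer sign-ups) are simply left out of the comparison.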
• Precision = 97%
Using PIPs to Predict User Reputation
Thank you for listening
Q & A