Adversarial Analytics - 2013 Strata & Hadoop World Talk
![Page 1: Adversarial Analytics - 2013 Strata & Hadoop World Talk](https://reader033.vdocuments.site/reader033/viewer/2022051415/55d4f96abb61eb741f8b471c/html5/thumbnails/1.jpg)
Adversarial Analytics 101
Robert L. Grossman Open Data Group
Strata Conference & Hadoop World October 28, 2013
![Page 2: Adversarial Analytics - 2013 Strata & Hadoop World Talk](https://reader033.vdocuments.site/reader033/viewer/2022051415/55d4f96abb61eb741f8b471c/html5/thumbnails/2.jpg)
Examples of Adversarial Analytics
• Compete for resources
  – Low latency trading
  – Auctions
• Steal someone else’s resources
  – Credit card fraud rings
  – Insurance fraud rings
• Hide your activity
  – Click spam
  – Exfiltration of information
![Page 3: Adversarial Analytics - 2013 Strata & Hadoop World Talk](https://reader033.vdocuments.site/reader033/viewer/2022051415/55d4f96abb61eb741f8b471c/html5/thumbnails/3.jpg)
Conventional vs Adversarial Analytics

| | Conventional Analytics | Adversarial Analytics |
|---|---|---|
| Type of game | Gain / Gain | Gain / Loss |
| Subject | Individual | Group / ring / organization |
| Behavior | Drifts over time | Sudden changes |
| Transparency | Behavior usually open | Behavior usually hidden |
| Automation | Manual (an individual) | Sometimes another model |
![Page 4: Adversarial Analytics - 2013 Strata & Hadoop World Talk](https://reader033.vdocuments.site/reader033/viewer/2022051415/55d4f96abb61eb741f8b471c/html5/thumbnails/4.jpg)
1. Points of Compromise
Source: Wall Street Journal, May 4, 2007
![Page 5: Adversarial Analytics - 2013 Strata & Hadoop World Talk](https://reader033.vdocuments.site/reader033/viewer/2022051415/55d4f96abb61eb741f8b471c/html5/thumbnails/5.jpg)
Case Study: Points of Compromise
• TJX compromise
• Wireless Point of Sale devices compromised
• Personal information on 451,000 individuals taken
• Information used months later for fraudulent purchases.
• WSJ reports that 45.7 million credit card accounts at risk
• “TJX's breach-related bill could surpass $1 billion over five years”
Source: Wall Street Journal, May 4, 2007
![Page 6: Adversarial Analytics - 2013 Strata & Hadoop World Talk](https://reader033.vdocuments.site/reader033/viewer/2022051415/55d4f96abb61eb741f8b471c/html5/thumbnails/6.jpg)
Learning Tradecraft
![Page 7: Adversarial Analytics - 2013 Strata & Hadoop World Talk](https://reader033.vdocuments.site/reader033/viewer/2022051415/55d4f96abb61eb741f8b471c/html5/thumbnails/7.jpg)
The adversary is often a group, criminal ring, etc.
![Page 8: Adversarial Analytics - 2013 Strata & Hadoop World Talk](https://reader033.vdocuments.site/reader033/viewer/2022051415/55d4f96abb61eb741f8b471c/html5/thumbnails/8.jpg)
It’s Not an Individual…
Source: Justice Department, New York Times.
![Page 9: Adversarial Analytics - 2013 Strata & Hadoop World Talk](https://reader033.vdocuments.site/reader033/viewer/2022051415/55d4f96abb61eb741f8b471c/html5/thumbnails/9.jpg)
Gonzalez Tradecraft
1. Exploit vulnerabilities in wireless networks outside of retail stores.
2. Exploit vulnerabilities in databases of retail organizations.
3. Gain unauthorized access to networks that process and store credit card transactions.
4. Exfiltrate 40 million track 2 records (data on a credit card’s magnetic stripe).
![Page 10: Adversarial Analytics - 2013 Strata & Hadoop World Talk](https://reader033.vdocuments.site/reader033/viewer/2022051415/55d4f96abb61eb741f8b471c/html5/thumbnails/10.jpg)
Gonzalez Tradecraft (cont’d)
5. Sell track 2 data in Eastern Europe, USA, and other places.
6. Create counterfeit ATM and debit cards using track 2 data.
7. Conceal and launder the illegal proceeds obtained through anonymous web currencies in the US and Russia.
Source: US District Court, District of Massachusetts, Indictment of Albert Gonzalez, August 5, 2008.
![Page 11: Adversarial Analytics - 2013 Strata & Hadoop World Talk](https://reader033.vdocuments.site/reader033/viewer/2022051415/55d4f96abb61eb741f8b471c/html5/thumbnails/11.jpg)
Data Is Large
• TJX compromise data is too large to fit into memory
• Data is difficult to fit into database
• Millions of possible points of compromise
• Data must be kept for months to years
Source: Wall Street Journal, May 4, 2007
![Page 12: Adversarial Analytics - 2013 Strata & Hadoop World Talk](https://reader033.vdocuments.site/reader033/viewer/2022051415/55d4f96abb61eb741f8b471c/html5/thumbnails/12.jpg)
2. Introduction to Adversarial Analytics
![Page 13: Adversarial Analytics - 2013 Strata & Hadoop World Talk](https://reader033.vdocuments.site/reader033/viewer/2022051415/55d4f96abb61eb741f8b471c/html5/thumbnails/13.jpg)
Conventional Analytics
• Example: predictive model to select online ads.
• Example: predictive model to classify sentiment of a customer service interaction.

Adversarial Analytics
• Example: commercial competitor trying to maximize trading gains.
• Example: criminal gang attacking a system and hiding its behavior.
![Page 14: Adversarial Analytics - 2013 Strata & Hadoop World Talk](https://reader033.vdocuments.site/reader033/viewer/2022051415/55d4f96abb61eb741f8b471c/html5/thumbnails/14.jpg)
What’s Different About the Models?

| | Conventional Analytics | Adversarial Analytics |
|---|---|---|
| Data | Labeled | Unlabeled |
| Data size | Small to large | Small to very large |
| Model | Classification / Regression | Clustering, change detection |
| Frequency of update | Monthly, quarterly, yearly | Hourly, daily, weekly, … |
![Page 15: Adversarial Analytics - 2013 Strata & Hadoop World Talk](https://reader033.vdocuments.site/reader033/viewer/2022051415/55d4f96abb61eb741f8b471c/html5/thumbnails/15.jpg)
Types of Adversaries

| | Home Team | Adversary |
|---|---|---|
| $1,000 | Use products. | Use existing analytic techniques. |
| $100,000 | Build custom analytic models. | Create new analytic techniques. |
| $10,000,000 | Use specialized teams and approaches. | Change the game. |

Source: This approach is based in part on the threat hierarchy in the Defense Science Board (DSB) Report on Resilient Military Systems and the Cyber Threat, January, 2013
![Page 16: Adversarial Analytics - 2013 Strata & Hadoop World Talk](https://reader033.vdocuments.site/reader033/viewer/2022051415/55d4f96abb61eb741f8b471c/html5/thumbnails/16.jpg)
What is Different About the Adversary?

| | Conventional Analytics | Adversarial Analytics |
|---|---|---|
| Entity | Person browsing | Gang attacking a system |
| What is modeled? | Events, entities | Events, entities, rings |
| Behavior | Component of natural behavior | Waiting for opportunities |
| Evolution | Drift | Active change in behavior |
| Obfuscation | Ignore | Hide |
![Page 17: Adversarial Analytics - 2013 Strata & Hadoop World Talk](https://reader033.vdocuments.site/reader033/viewer/2022051415/55d4f96abb61eb741f8b471c/html5/thumbnails/17.jpg)
Updating Models

| | Conventional Analytics | Adversarial Analytics |
|---|---|---|
| When do you update model? | When there is a major change in behavior | Frequently, to gain an advantage. |
| Process for updating models | Automation and analytic infrastructure helpful. | Automation and analytic infrastructure essential. |
![Page 18: Adversarial Analytics - 2013 Strata & Hadoop World Talk](https://reader033.vdocuments.site/reader033/viewer/2022051415/55d4f96abb61eb741f8b471c/html5/thumbnails/18.jpg)
3. More Examples
![Page 19: Adversarial Analytics - 2013 Strata & Hadoop World Talk](https://reader033.vdocuments.site/reader033/viewer/2022051415/55d4f96abb61eb741f8b471c/html5/thumbnails/19.jpg)
Click Fraud
• Is a click on an ad legitimate?
• Is it done by a computer or a human?
• If done by a human, is it an adversary?
![Page 20: Adversarial Analytics - 2013 Strata & Hadoop World Talk](https://reader033.vdocuments.site/reader033/viewer/2022051415/55d4f96abb61eb741f8b471c/html5/thumbnails/20.jpg)
![Page 21: Adversarial Analytics - 2013 Strata & Hadoop World Talk](https://reader033.vdocuments.site/reader033/viewer/2022051415/55d4f96abb61eb741f8b471c/html5/thumbnails/21.jpg)
Low Latency Trading
• 2+ billion bids, asks, executes / day
• > 90% are machine generated
• Most are cancelled
• Decisions are made in ms

[Diagram labels: ECN; Trades; Market data; Broker/Dealer; Low latency trader; Algorithm; Back testing; Adversarial traders]
![Page 22: Adversarial Analytics - 2013 Strata & Hadoop World Talk](https://reader033.vdocuments.site/reader033/viewer/2022051415/55d4f96abb61eb741f8b471c/html5/thumbnails/22.jpg)
Connecting Chicago & NYC

| Technology | Vendor | Round Trip Time (ms) |
|---|---|---|
| Microwave | Windy Apple | 9.0 |
| Dark fiber, shorter path | Spread Networks | 13.1 |
| Standard fiber, ISP | Various | 14.5+ |
Source: lowlatency.com, June 27, 2012; Wired Magazine, August 3, 2012
![Page 23: Adversarial Analytics - 2013 Strata & Hadoop World Talk](https://reader033.vdocuments.site/reader033/viewer/2022051415/55d4f96abb61eb741f8b471c/html5/thumbnails/23.jpg)
![Page 24: Adversarial Analytics - 2013 Strata & Hadoop World Talk](https://reader033.vdocuments.site/reader033/viewer/2022051415/55d4f96abb61eb741f8b471c/html5/thumbnails/24.jpg)
Example: Conficker
Source: Conficker Working Group: Lessons Learned, June 2010 (Published January 2011), www.confickerworkinggroup.org
![Page 25: Adversarial Analytics - 2013 Strata & Hadoop World Talk](https://reader033.vdocuments.site/reader033/viewer/2022051415/55d4f96abb61eb741f8b471c/html5/thumbnails/25.jpg)
Command and control center
Infected PCs
![Page 26: Adversarial Analytics - 2013 Strata & Hadoop World Talk](https://reader033.vdocuments.site/reader033/viewer/2022051415/55d4f96abb61eb741f8b471c/html5/thumbnails/26.jpg)
Source: Conficker Working Group: Lessons Learned, June 2010 (Published January 2011), www.confickerworkinggroup.org
![Page 27: Adversarial Analytics - 2013 Strata & Hadoop World Talk](https://reader033.vdocuments.site/reader033/viewer/2022051415/55d4f96abb61eb741f8b471c/html5/thumbnails/27.jpg)
Source: Conficker Working Group: Lessons Learned, June 2010 (Published January 2011), www.confickerworkinggroup.org
![Page 28: Adversarial Analytics - 2013 Strata & Hadoop World Talk](https://reader033.vdocuments.site/reader033/viewer/2022051415/55d4f96abb61eb741f8b471c/html5/thumbnails/28.jpg)
![Page 29: Adversarial Analytics - 2013 Strata & Hadoop World Talk](https://reader033.vdocuments.site/reader033/viewer/2022051415/55d4f96abb61eb741f8b471c/html5/thumbnails/29.jpg)
4. Building Models for Adversarial Analytics
![Page 30: Adversarial Analytics - 2013 Strata & Hadoop World Talk](https://reader033.vdocuments.site/reader033/viewer/2022051415/55d4f96abb61eb741f8b471c/html5/thumbnails/30.jpg)
Building Models To Identify Fraudulent Transactions

Scoring transactions:
1. Get next event (transaction).
2. Update one or more associated feature vectors (account-based).
3. Use model to process updated feature vectors to compute scores.
4. Post-process scores using rules to compute alerts.

Building the model:
1. Get a dataset of labeled data (i.e. fraud is labeled).
2. Build a candidate predictive model that scores each transaction with likelihood of fraud.
3. Compare lift of candidate model to current champion model on hold-out dataset.
4. Deploy new model if performance is better.
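
The two procedures on this slide can be sketched roughly as follows. This is only an illustration, not the speaker's implementation: the feature vector (count, total, maximum amount), the `toy_model` scoring rule, the alert threshold, and the function names are all assumptions made up for the example. `lift_at` shows one common way to compute lift when comparing a candidate model with the champion on a hold-out set.

```python
from collections import defaultdict

def update_features(features, txn):
    """Scoring step 2: fold one transaction into its account's feature vector."""
    f = features[txn["account"]]
    f["count"] += 1
    f["total"] += txn["amount"]
    f["max_amount"] = max(f["max_amount"], txn["amount"])
    return f

def toy_model(f):
    """Scoring step 3 stand-in (hypothetical): the score rises when the
    largest transaction dwarfs the account's mean amount."""
    if f["max_amount"] <= 0:
        return 0.0
    mean = f["total"] / f["count"]
    return 1.0 - mean / f["max_amount"]

def score_stream(transactions, model=toy_model, threshold=0.6, min_count=3):
    """Scoring steps 1-4: read events, update features, score, apply a rule."""
    features = defaultdict(lambda: {"count": 0, "total": 0.0, "max_amount": 0.0})
    alerts = []
    for txn in transactions:                      # 1. get next event
        f = update_features(features, txn)        # 2. update feature vector
        score = model(f)                          # 3. compute score
        # 4. post-process with a rule: require some history before alerting
        if score >= threshold and f["count"] >= min_count:
            alerts.append((txn["account"], score))
    return alerts

def lift_at(scores, labels, fraction=0.1):
    """Fraud rate in the top-scored fraction divided by the overall fraud rate.
    Assumes labels are 0/1 and that at least one label is 1."""
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    k = max(1, int(len(ranked) * fraction))
    top_rate = sum(label for _, label in ranked[:k]) / k
    base_rate = sum(labels) / len(labels)
    return top_rate / base_rate
```

Under this sketch, a candidate would replace the champion only if its `lift_at` value on the same hold-out data is higher.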
![Page 31: Adversarial Analytics - 2013 Strata & Hadoop World Talk](https://reader033.vdocuments.site/reader033/viewer/2022051415/55d4f96abb61eb741f8b471c/html5/thumbnails/31.jpg)
Building Models to Identify Fraud Rings
• Look for suspicious patterns and relationships among claims.
• Look for hidden relationships through common phone numbers, nearby addresses, etc.

[Diagram: a network of drivers linked to shared repair shops and clinics]
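
One simple way to surface such hidden relationships is to treat claims as nodes, link any two claims that share an attribute value, and report the connected components. The sketch below does this with a union-find structure. The field names (`phone`, `clinic`, `repair_shop`), the `min_size` cutoff, and the use of exact matching (rather than "nearby" addresses) are assumptions for illustration only.

```python
from collections import defaultdict

def find_rings(claims, keys=("phone", "clinic", "repair_shop"), min_size=3):
    """Return groups of claim ids connected through shared attribute values."""
    parent = {c["id"]: c["id"] for c in claims}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    first_seen = {}  # (attribute name, value) -> first claim id carrying it
    for c in claims:
        for k in keys:
            v = c.get(k)
            if v is None:
                continue
            if (k, v) in first_seen:
                union(first_seen[(k, v)], c["id"])
            else:
                first_seen[(k, v)] = c["id"]

    groups = defaultdict(set)
    for c in claims:
        groups[find(c["id"])].add(c["id"])
    # Keep only components large enough to be worth an investigator's time.
    rings = [sorted(g) for g in groups.values() if len(g) >= min_size]
    return sorted(rings, key=lambda g: g[0])
```

Components found this way are candidates for review, not proof of fraud; a shared clinic alone is often innocent.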
![Page 32: Adversarial Analytics - 2013 Strata & Hadoop World Talk](https://reader033.vdocuments.site/reader033/viewer/2022051415/55d4f96abb61eb741f8b471c/html5/thumbnails/32.jpg)
Method 1: Look for Testing Behavior
• Adversaries often test an approach first.
• Build models to detect the testing.
![Page 33: Adversarial Analytics - 2013 Strata & Hadoop World Talk](https://reader033.vdocuments.site/reader033/viewer/2022051415/55d4f96abb61eb741f8b471c/html5/thumbnails/33.jpg)
Method 2: Identify Common Behavior
• Common cluster algorithms (e.g. k-means) require specifying the number of clusters.
• For many applications, we keep the number of clusters k relatively small.
• With microclustering, we produce many clusters, even if some of them are quite similar.
• Microclusters can sometimes be labeled.
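
As a rough sketch of the idea, microclustering can be approximated by running ordinary k-means with a deliberately large k and then labeling each small cluster from whatever labeled points fall into it. The code below uses a plain Lloyd's-algorithm k-means written with the standard library. The majority-vote labeling rule and the names `kmeans` and `label_microclusters` are assumptions for illustration, not the method used in the talk.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means on a list of float tuples.
    Returns (centers, assignments). For microclustering, pass a large k."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        for i, p in enumerate(points):
            assign[i] = min(
                range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])),
            )
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, assign

def label_microclusters(assign, labels, k):
    """Label each cluster "bad" or "good" by majority vote over its labeled
    members (1 = bad, 0 = good, None = unlabeled); ties and unlabeled
    clusters get None."""
    bad = [0] * k
    good = [0] * k
    for a, label in zip(assign, labels):
        if label == 1:
            bad[a] += 1
        elif label == 0:
            good[a] += 1
    return ["bad" if b > g else "good" if g > b else None for b, g in zip(bad, good)]
```

With many small clusters, even a handful of known fraud cases can be enough to label the clusters they land in.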
![Page 34: Adversarial Analytics - 2013 Strata & Hadoop World Talk](https://reader033.vdocuments.site/reader033/viewer/2022051415/55d4f96abb61eb741f8b471c/html5/thumbnails/34.jpg)
Microcluster Based Scoring
• The closer a feature vector is to a good cluster, the more likely it is to be good
• The closer it is to a bad cluster, the more likely it is to be bad.
Bad Cluster
Good Cluster
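
A minimal sketch of this scoring rule, assuming microcluster centers have already been computed and labeled "good" or "bad": the score compares the distance to the nearest bad center with the distance to the nearest good center. The specific ratio used here is an assumption chosen for simplicity; the slide only states the monotonic relationship.

```python
import math

def microcluster_score(x, centers, labels):
    """Score in [0, 1]: near 1 when x is close to a bad cluster,
    near 0 when x is close to a good cluster.
    Assumes at least one center carries each of the two labels."""
    d_bad = min(math.dist(x, c) for c, l in zip(centers, labels) if l == "bad")
    d_good = min(math.dist(x, c) for c, l in zip(centers, labels) if l == "good")
    if d_bad + d_good == 0:
        return 0.5  # x sits on both a good and a bad center
    return d_good / (d_bad + d_good)
```

For example, with a good center at (0, 0) and a bad center at (10, 0), the point (9, 0) scores 0.9 and the point (1, 0) scores 0.1.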
![Page 35: Adversarial Analytics - 2013 Strata & Hadoop World Talk](https://reader033.vdocuments.site/reader033/viewer/2022051415/55d4f96abb61eb741f8b471c/html5/thumbnails/35.jpg)
Method 3: Scaling Baseline Algorithms
Drive by exploits
Source: Wall Street Journal
Points of compromise
![Page 36: Adversarial Analytics - 2013 Strata & Hadoop World Talk](https://reader033.vdocuments.site/reader033/viewer/2022051415/55d4f96abb61eb741f8b471c/html5/thumbnails/36.jpg)
What are the Common Elements?
• Time stamps
• Sites – e.g. Web sites, vendors, computers, network devices
• Entities – e.g. visitors, users, flows
• Log files fill disks; many, many disks
• Behavior occurs at all scales
• Want to identify phenomena at all scales
• Need to group “similar behavior”
![Page 37: Adversarial Analytics - 2013 Strata & Hadoop World Talk](https://reader033.vdocuments.site/reader033/viewer/2022051415/55d4f96abb61eb741f8b471c/html5/thumbnails/37.jpg)
Examples of Site-Entity Data

| Example | Sites | Entities |
|---|---|---|
| Drive-by exploits | Web sites | Computers |
| Compromised user accounts | Compromised computers | User accounts |
| Compromised payment systems | Merchants | Cardholders |

Source: Collin Bennett, Robert L. Grossman, David Locke, Jonathan Seidman and Steve Vejcik, MalStone: Towards a Benchmark for Analytics on Large Data Clouds, The 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2010), ACM, 2010.
![Page 38: Adversarial Analytics - 2013 Strata & Hadoop World Talk](https://reader033.vdocuments.site/reader033/viewer/2022051415/55d4f96abb61eb741f8b471c/html5/thumbnails/38.jpg)
[Diagram: sites and entities plotted against a time axis marked d_{k-1} and d_k, showing an Exposure Window followed by a Monitor Window]
![Page 39: Adversarial Analytics - 2013 Strata & Hadoop World Talk](https://reader033.vdocuments.site/reader033/viewer/2022051415/55d4f96abb61eb741f8b471c/html5/thumbnails/39.jpg)
The Mark Model
• Some sites are marked (percent of mark is a parameter and type of sites marked is a draw from a distribution)
• Some entities become marked after visiting a marked site (this is a draw from a distribution)
• There is a delay between the visit and when the entity becomes marked (this is a draw from a distribution)
• There is a background process that marks some entities independent of visit (this adds noise to problem)
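
The mark model can be simulated with a few random draws. The sketch below is only an illustration of the four bullets above: the parameter names, the uniform visit times, and the exponential delay are all assumptions, not the distributions used by the MalStone data generator.

```python
import random

def simulate_marks(n_sites, n_entities, n_visits, p_site=0.1, p_infect=0.5,
                   p_background=0.01, mean_delay=5.0, horizon=100.0, seed=0):
    """Return (marked_sites, visits, mark_time) under a toy mark model.
    visits is a list of (site, entity, time); mark_time maps entity -> time."""
    rng = random.Random(seed)
    # Some sites are marked; the fraction is a parameter.
    marked_sites = {s for s in range(n_sites) if rng.random() < p_site}
    visits = []
    mark_time = {}
    for _ in range(n_visits):
        site = rng.randrange(n_sites)
        entity = rng.randrange(n_entities)
        t = rng.uniform(0.0, horizon)
        visits.append((site, entity, t))
        # Some entities become marked after visiting a marked site,
        # with a random (here exponential) delay after the visit.
        if site in marked_sites and rng.random() < p_infect:
            m = t + rng.expovariate(1.0 / mean_delay)
            mark_time[entity] = min(m, mark_time.get(entity, m))
    # Background process: marks some entities independent of any visit (noise).
    for entity in range(n_entities):
        if rng.random() < p_background:
            m = rng.uniform(0.0, horizon)
            mark_time[entity] = min(m, mark_time.get(entity, m))
    return marked_sites, visits, mark_time
```

Synthetic data of this kind is useful mainly because the marked sites are known, so a detection statistic can be checked against ground truth.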
![Page 40: Adversarial Analytics - 2013 Strata & Hadoop World Talk](https://reader033.vdocuments.site/reader033/viewer/2022051415/55d4f96abb61eb741f8b471c/html5/thumbnails/40.jpg)
Subsequent Proportion of Marks
• Fix a site s[j]
• Let A[j] be the entities that transact at s[j] during the exposure window (ExpWin); if an entity is marked, its visit must occur before the mark
• Let B[j] be all entities in A[j] that become marked sometime during the monitor window (MonWin)
• Subsequent proportion of marks is r[j] = | B[j] | / | A[j] |
• Easily computed with MapReduce.
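
The statistic r[j] can be sketched in a few lines of single-machine Python that mirror the map and reduce phases. The record layout `(site, entity, time)`, the `mark_time` dictionary, and the half-open windows are assumptions for illustration; a production job would distribute the same grouping by site across a cluster.

```python
from collections import defaultdict

def subsequent_proportion_of_marks(visits, mark_time, exp_win, mon_win):
    """Compute r[j] = |B[j]| / |A[j]| for every site with a non-empty A[j].
    visits: iterable of (site, entity, time).
    mark_time: dict entity -> time it became marked (absent if never marked).
    exp_win, mon_win: (start, end) pairs, treated as half-open [start, end)."""
    exp_start, exp_end = exp_win
    mon_start, mon_end = mon_win

    # "Map" phase: key each qualifying visit by site to build A[j].
    a = defaultdict(set)
    for site, entity, t in visits:
        if not (exp_start <= t < exp_end):
            continue  # visit falls outside the exposure window
        m = mark_time.get(entity)
        if m is not None and t >= m:
            continue  # entity was already marked at the time of the visit
        a[site].add(entity)

    # "Reduce" phase: per site, count members of A[j] marked in the monitor window.
    r = {}
    for site, entities in a.items():
        b = {e for e in entities
             if e in mark_time and mon_start <= mark_time[e] < mon_end}
        r[site] = len(b) / len(entities)
    return r
```

Sites with an unusually high r[j] relative to the background rate are the candidate points of compromise.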
![Page 41: Adversarial Analytics - 2013 Strata & Hadoop World Talk](https://reader033.vdocuments.site/reader033/viewer/2022051415/55d4f96abb61eb741f8b471c/html5/thumbnails/41.jpg)
Computing Site Entity Statistics with MapReduce

| | MalStone B |
|---|---|
| Hadoop MapReduce v0.18.3 | 799 min |
| Hadoop Python Streams v0.18.3 | 142 min |
| Sector UDFs v1.19 | 44 min |
| # Nodes | 20 nodes |
| # Records | 10 Billion |
| Size of Dataset | 1 TB |

Source: Collin Bennett, Robert L. Grossman, David Locke, Jonathan Seidman and Steve Vejcik, MalStone: Towards a Benchmark for Analytics on Large Data Clouds, The 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2010), ACM, 2010.
![Page 42: Adversarial Analytics - 2013 Strata & Hadoop World Talk](https://reader033.vdocuments.site/reader033/viewer/2022051415/55d4f96abb61eb741f8b471c/html5/thumbnails/42.jpg)
5. Summary
![Page 43: Adversarial Analytics - 2013 Strata & Hadoop World Talk](https://reader033.vdocuments.site/reader033/viewer/2022051415/55d4f96abb61eb741f8b471c/html5/thumbnails/43.jpg)
Some Rules
1. Understand your adversary's tradecraft; he or she will go after the easiest or weakest link.
2. Your adversary will be creating new opportunities; you must create new models and analytic approaches.
3. Risk is usually viewed as the cost of doing business, but every once in a while the cost may be 100x – 1,000x greater.
4. The quicker you can update your strategy and models, the greater your advantage. Use automation (e.g. PMML-based scoring engines) to deploy models more quickly.
![Page 44: Adversarial Analytics - 2013 Strata & Hadoop World Talk](https://reader033.vdocuments.site/reader033/viewer/2022051415/55d4f96abb61eb741f8b471c/html5/thumbnails/44.jpg)
Questions?
![Page 45: Adversarial Analytics - 2013 Strata & Hadoop World Talk](https://reader033.vdocuments.site/reader033/viewer/2022051415/55d4f96abb61eb741f8b471c/html5/thumbnails/45.jpg)
About Robert L. Grossman
Robert Grossman (@bobgrossman) is the Founder and a Partner of Open Data Group, which specializes in building predictive models over big data. He is also a Senior Fellow at the Computation Institute and Institute for Genomics and Systems Biology at the University of Chicago and a Professor in the Biological Sciences Division. He has led the development of new open source software tools for analyzing big data, cloud computing, and high performance networking. Prior to starting Open Data Group, he founded Magnify, Inc., which provides data mining solutions to the insurance industry. Grossman was Magnify’s CEO until 2001 and its Chairman until it was sold to ChoicePoint in 2005. He blogs occasionally about big data, data science, and data engineering at rgrossman.com.