floncon 2013 | january 7 10 | albuquerque, new mexico · 2013. 1. 7. · floncon 2013 | january...

25
FlonCon 2013 | January 7–10 | Albuquerque, New Mexico John Munro / [email protected] Jason Trost / [email protected]

Upload: others

Post on 09-Feb-2021

2 views

Category:

Documents


0 download

TRANSCRIPT

  • FlonCon 2013 | January 7–10 | Albuquerque, New Mexico

    John Munro / [email protected] Jason Trost / [email protected]

  • • John Munro ([email protected]) – Network Security Researcher and Data Scientist

    • Jason Trost ([email protected]) – Senior Software Engineer – Specializes in Hadoop/Storm/BigData

    Introductions

  • • The Problem • Our Approach • DGA Domain Classifier • String Statistics as Features • Malicious Domain Classifier • Demo • Real-time Streaming Platform

    Agenda

  • The Problem

    yahoo.com

    docs.joomla.org

    youtube.com

    bibz01.apple.com

    p4.httzd5e2ufizo.3bawhfuec45dca65.401724.s1.v4.ipv6-exp.l.google.com

    ns3.ohio.gov

    abulqe.com za6.limfoklubs.com

    Ct0u2xj5dbe4.wvw—game465.com

    txmxbo.info

    Wmk41035u3751s0bgv4n91b0b7h74v.ipcheker.com

  • The Problem

    yahoo.com

    docs.joomla.org

    youtube.com

    bibz01.apple.com

    p4.httzd5e2ufizo.3bawhfuec45dca65.401724.s1.v4.ipv6-exp.l.google.com

    ns3.ohio.gov

    abulqe.com za6.limfoklubs.com

    Ct0u2xj5dbe4.wvw—game465.com

    txmxbo.info

    Wmk41035u3751s0bgv4n91b0b7h74v.ipcheker.com

  • • Massive Volumes – Some of our partners

    deal with TBs per day of DNS PCAPs

    • Incredible Rates – One partner sees

    13k requests/sec

    – Another closer to 100k/sec

    The Problem

  • • Real-time streaming classification – In parallel across multiple servers

    • Markov Models – Random Domain Generation Traffic – Normal Benign Traffic

    • Random Forests – Benign vs Malicious

    • Periodically retrained – In order to maintain accuracy

    Our Approach: Machine Learning!

  • • Benign Domains – Millions of popular, real domains

    • Correlated with the Alexa top 10k domains

    • Malicious Domains – 800k domains gathered from an internal malware

    sandbox

    – Public blacklist domains from Conficker and Murofet Botnets

    Data Sources

  • Markov Models

  • • Domain Generation Algorithm (DGA) • Popular Domain Model

    – Trained: 258,039 domains from Day 1 of our Benign set – Tested: 331,359 domains from Day 2 of our

    Benign set

    – Accuracy: 99.40 % with 1,458 Unknown • Randomly Generated Domain Model

    – Trained: 90,884 domains from Conficker Botnet – Tested: 295,306 domains from Murofet Botnet – Accuracy: 99.34 % with 1,923 Unknown

    Markovian DGA Classifier

  • String Statistics as Features

  • Feature Usefulness

  • Random Forests Algorithm

    FPO VIDEO TO COME

  • • Pros: – Very high accuracy – Scalable across many nodes – Built-in protection from over fitting – Can handle very large data sets with many features – Robust with respect to goodness of features – Practical for real world use – Does not assume a distribution – Only two parameters to tune – Memory efficient

    • Cons: – Not the quickest classifier, but plenty fast in practice

    Random Forests

  • • Performance measured by 10 – fold Cross Validation

    • Training Set – 200k Benign – 200k Malicious

    Malicious Domain Classifier

    0

    0.01

    0.02

    0.03

    0.04

    0.05

    0.06

    10 20 30 40 50 60 70 80 90 100

    Ou

    t o

    f B

    ag E

    rro

    r

    Number of Trees

    Cross Validation

    K = 3 K = 5 K = 10

  • Results

    0.975

    0.976

    0.977

    0.978

    0.979

    0.98

    0.981

    0.982

    0.983

    0.984

    10 20 30 40 50 60 70 80 90 100

    Pre

    cisi

    on

    Number of Trees

    Bad Precision

    0.9745

    0.975

    0.9755

    0.976

    0.9765

    0.977

    0.9775

    0.978

    0.9785

    0.979

    0.9795

    10 20 30 40 50 60 70 80 90 100

    Pre

    cisi

    on

    Number of Trees

    Good Precision

    0.976

    0.9765

    0.977

    0.9775

    0.978

    0.9785

    0.979

    0.9795

    0.98

    0.9805

    10 20 30 40 50 60 70 80 90 100

    Acc

    ura

    cy

    Number of Trees

    Bad Accuracy

    0.976

    0.9765

    0.977

    0.9775

    0.978

    0.9785

    0.979

    0.9795

    0.98

    0.9805

    10 20 30 40 50 60 70 80 90 100

    Acc

    ura

    cy

    Number of Trees

    Good Accuracy

    K = 3 K = 5 K = 10

  • Results

    0

    10,000

    20,000

    30,000

    40,000

    50,000

    60,000

    10 20 30 40 50 60 70 80 90 100

    Cla

    ssif

    ica

    tio

    ns/

    sec

    Number of Trees

    Classification Throughput

    K=3

    K=5

    K=10

    0

    50

    100

    150

    200

    250

    300

    350

    400

    10 20 30 40 50 60 70 80 90 100

    Mo

    del

    Siz

    e (M

    B)

    Number of Trees

    Model Size

    K=3

    K=5

    K=10

  • Results

  • Demo

  • • Velocity is a platform for processing, analyzing, and visualizing large-scale event data in real-time

    • It was designed to be horizontally scalable and is built using Twitter’s Storm

    • It was built primarily for internal use with DNS events, IDS alerts, and netflow data, but it is in the process of being commercialized

    Realtime Streaming Platform

  • Velocity Pipeline

  • • Malicious domain classification • DGA domain identification using Markov

    Models

    • Summary Statistics based on domain string work well

    • Random Forests are very successful at classifying domains as Benign or Malicious

    • Real-time, distributed implementation

    Conclusion

  • • Include more features: TTL, frequency seen, etc.

    • Correlation of bad domains based on ASN, Country, Organization, etc.

    • Identify subnets that are infected based on high traffic to bad domains

    • Identify Content Delivery Networks • Self Organizing Maps and other visualizations

    Future Work

  • Questions

  • • John Munro • Email: [email protected]

    • Jason Trost • Email: [email protected] • Twitter: @jason_trost • Blog: www.covert.io

    Contact Information

    mailto:[email protected]:[email protected]://twitter.com/jason_trosthttps://twitter.com/jason_trosthttp://www.covert.io/