(un)supervised learning for malware with quickspan · seeland clustering algorithm (un)supervised...

17
(Un)Supervised Learning for Malware with quickSpan Thomas Given-Wilson Cisco- Ecole Polytechnique Symposium, Paris, France 10th April 2018

Upload: others

Post on 05-Jul-2020

11 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: (Un)Supervised Learning for Malware with quickSpan · Seeland Clustering Algorithm (Un)Supervised Learning for Malware with quickSpan 13 Find clusters in size ordered graphs using

(Un)Supervised Learning for Malware with quickSpanThomas Given-Wilson

Cisco- Ecole Polytechnique Symposium,Paris, France 10th April 2018

Page 2: (Un)Supervised Learning for Malware with quickSpan · Seeland Clustering Algorithm (Un)Supervised Learning for Malware with quickSpan 13 Find clusters in size ordered graphs using

Goals

(Un)Supervised Learning for Malware with quickSpan 2

Supervised Learning:• Detect existing and new variants of malware• Use behaviour (not YARA), here SCDGs• Middle line-of-defenceUnsupervised Learning:• Group unknown programs• Use behaviour, again SCDGs• Aid analysis

Page 3: (Un)Supervised Learning for Malware with quickSpan · Seeland Clustering Algorithm (Un)Supervised Learning for Malware with quickSpan 13 Find clusters in size ordered graphs using

SCDG Toy Example

(Un)Supervised Learning for Malware with quickSpan 3

1 - Binary

2 - Trace

3 - Graph

Page 4: (Un)Supervised Learning for Malware with quickSpan · Seeland Clustering Algorithm (Un)Supervised Learning for Malware with quickSpan 13 Find clusters in size ordered graphs using

Common Subgraphs: gSpan Algorithm

(Un)Supervised Learning for Malware with quickSpan 4

Find common sub-graphs:• Builds set of (sub-)graphs• That meet “support” (% of

graph set), here support=1.0• Example set of graphs

(Note: ignores edgedirection and edge labels here.)

Page 5: (Un)Supervised Learning for Malware with quickSpan · Seeland Clustering Algorithm (Un)Supervised Learning for Malware with quickSpan 13 Find clusters in size ordered graphs using

gSpan (Implementation) Problems

(Un)Supervised Learning for Malware with quickSpan 5

Algorithm:• Scales poorly with graph size• Vulnerable to pathological casesImplementation (gBolt [1]):• Ignores edge direction• Runs till completion/failure• Memory exhaustion crashes

1 - https://github.com/Jokeren/DataMining-gSpan

Page 6: (Un)Supervised Learning for Malware with quickSpan · Seeland Clustering Algorithm (Un)Supervised Learning for Malware with quickSpan 13 Find clusters in size ordered graphs using

quickSpan: Improved gSpan Implementation

(Un)Supervised Learning for Malware with quickSpan 6

Extended and improved parallel version of gBolt:• Edge direction• (Safe) Time out• Memory safety (safe failure)• Minimum and maximum output size• Fast exit on “success”• Thread control• Removes duplicate sub-graphs• Incremental output• “Best” output• …

Page 7: (Un)Supervised Learning for Malware with quickSpan · Seeland Clustering Algorithm (Un)Supervised Learning for Malware with quickSpan 13 Find clusters in size ordered graphs using

quickSpan:Comparison

(Un)Supervised Learning for Malware with quickSpan 7

Performs very well overall! Data sets where quickSpan is not clear winner:• AIDS• Cancer

Note: All data sets arepublic and gathered fromgraph mining papers.(except Pathological).

Page 8: (Un)Supervised Learning for Malware with quickSpan · Seeland Clustering Algorithm (Un)Supervised Learning for Malware with quickSpan 13 Find clusters in size ordered graphs using

Supervised Classification Using gSpan

(Un)Supervised Learning for Malware with quickSpan 8

Learning:1. Given graphs from malware family Fi:

obtain canonical representation gi =gSpan(Fi, learning_parameters)

Classification:1. Given sample s

For each gi:calculate similarity using sim(s,gi) =

max_size(gSpan({s,gi}, classification_parameters))2. Classification result r = max ( sim(s,gi) )3. If r > threshold then classify as “malware”

otherwise classify as “cleanware” …(or “unknown”)

Page 9: (Un)Supervised Learning for Malware with quickSpan · Seeland Clustering Algorithm (Un)Supervised Learning for Malware with quickSpan 13 Find clusters in size ordered graphs using

Classification Example

(Un)Supervised Learning for Malware with quickSpan 9

Learning:• Find common subgraphs• Take largest 1Classification:• Find common sub-graphs• Test threshold (0.25)?

size(AB)/size(ABC) = 2/3 > 0.25 => true, classify as “malware”

Learned Graph Sample

Page 10: (Un)Supervised Learning for Malware with quickSpan · Seeland Clustering Algorithm (Un)Supervised Learning for Malware with quickSpan 13 Find clusters in size ordered graphs using

Mirai Case Study: Data Set

(Un)Supervised Learning for Malware with quickSpan 10

Mirai:• Collected from IoT honeypot (6 April to 14 August 2017)• Selected only X86 32-bit ELFCleanware:• Static-get distribution[1]

After (attempted) SCDG extraction:• 504 Mirai SCDGs• 942 clean SCDGs

1 - http://s.minos.io/

Page 11: (Un)Supervised Learning for Malware with quickSpan · Seeland Clustering Algorithm (Un)Supervised Learning for Malware with quickSpan 13 Find clusters in size ordered graphs using

Mirai Case Study: Experiments

Best parameters:• Undirected• Learning time 2500• Support 0.7• Number of graphs 50• Threshold 0.32

(Un)Supervised Learning for Malware with quickSpan 11

Conclusions:• Zero false positives!• Low false negatives.• Reasonable classification

time (1.56s)Compared with YARA:• Comparable/better 𝐹".$ score

(YARA: 84.38 to 98.61%)• Higher cost

Page 12: (Un)Supervised Learning for Malware with quickSpan · Seeland Clustering Algorithm (Un)Supervised Learning for Malware with quickSpan 13 Find clusters in size ordered graphs using

Unsupervised Learning: Clustering

(Un)Supervised Learning for Malware with quickSpan 12

• Exploit gSpan for clustering graphs– Base algorithm by Seeland et al.– Allows overlap of clusters

• Apply this to SCDGs– Cluster by common behaviour– Should correlate with families

• Aim is to cluster suspected malware samples– Complement to classification– Weaker correlation => show new families/evolution

Page 13: (Un)Supervised Learning for Malware with quickSpan · Seeland Clustering Algorithm (Un)Supervised Learning for Malware with quickSpan 13 Find clusters in size ordered graphs using

Seeland Clustering Algorithm

(Un)Supervised Learning for Malware with quickSpan 13

Find clusters in size ordered graphs using gSpan:For each graph g in the sequence:

clustered = falseFor each cluster C in clusters:

If gspan(C+g,1.0,min_size) != {} thenC += gclustered = true

If clustered == falseclusters.append({g})

Cluster 1

New graph..

Cluster 2Run gSpan with Cluster 1 …

not common enough, new cluster.common enough, add to cluster 1.

Run gSpan with Cluster 2 …

not common enough, but already clustered.

Page 14: (Un)Supervised Learning for Malware with quickSpan · Seeland Clustering Algorithm (Un)Supervised Learning for Malware with quickSpan 13 Find clusters in size ordered graphs using

Practical Clustering

(Un)Supervised Learning for Malware with quickSpan 14

The min_size parameter is unclear, possibilities:• Fixed/absolute (base algorithm):

– User must specify/know the optimal size– Size is chosen uniformly for all graph sizes

• Percentage of graph:– User must specify/know the percentage that is optimal– Percentage automatically adjusts for graph size

• Computation based:– User doesn’t need to specify– Automatically adjusts for graph size

The sorting by size is sub-optimal:• Instead we use random ordering

Page 15: (Un)Supervised Learning for Malware with quickSpan · Seeland Clustering Algorithm (Un)Supervised Learning for Malware with quickSpan 13 Find clusters in size ordered graphs using

Initial Experiment Configuration

(Un)Supervised Learning for Malware with quickSpan 15

Small sample experiment with:• 5 families (1-20 graphs each):

– Browserfox 6 graphs– Loadmoney 16 graphs– Shodi 1 graph– Virlock 20 graphs– Zbot 8 graphs

• quickSpan parameters:– Timeout– Direction– min_size

Page 16: (Un)Supervised Learning for Malware with quickSpan · Seeland Clustering Algorithm (Un)Supervised Learning for Malware with quickSpan 13 Find clusters in size ordered graphs using

Experimental Results

(Un)Supervised Learning for Malware with quickSpan 16

LM1

LM3

LM2

LM5LM4

LM6

LM8

LM9

LM7

LM10 LM12

LM11

LM15

LM14

LM13 LM16

BF1BF2

BF3

BF6BF4BF5

SH

VI1

VI3

VI2VI4

VI5

VI6

VI8

VI7

VI9

VI10VI11

VI12

VI13

VI14

VI15

VI16

VI17

VI18

VI19

VI20

ZB1

ZB2

ZB8ZB3

ZB7

ZB6

ZB5ZB4

Browserfox (BF)Loadmoney (LM)Shodi (SH)Virlock (VI)Zbot (ZB)

Parameters:Timeout=500Directed=TrueMin_size=

Computation

Page 17: (Un)Supervised Learning for Malware with quickSpan · Seeland Clustering Algorithm (Un)Supervised Learning for Malware with quickSpan 13 Find clusters in size ordered graphs using

Conclusions

(Un)Supervised Learning for Malware with quickSpan 17

gSpan:• Good graph mining algorithm• quickSpan adds performance and quality of life features• Useful for both supervised and unsupervised learningSupervised Learning:• Good results on Mirai• Comparable with YARA• Highly tuneableUnsupervised Learning:• Good results for correctness of clustering• Highly tuneable