automated worm fingerprinting
DESCRIPTION
Automated Worm Fingerprinting. Sumeet Singh, Cristian Estan, George Varghese, and Stefan Savage. Introduction. Problem: how to react quickly to worms? CodeRed 2001 Infected ~360,000 hosts within 11 hours Sapphire/Slammer (376 bytes) 2002 Infected ~75,000 hosts within 10 minutes. - PowerPoint PPT PresentationTRANSCRIPT
Introduction
Problem: how to react quickly to worms?
CodeRed 2001 Infected ~360,000 hosts within 11 hours
Sapphire/Slammer (376 bytes) 2002 Infected ~75,000 hosts within 10 minutes
Existing Approaches
Detection Ad hoc intrusion detection
Characterization Manual signature extraction
Isolates and decompiles a new worm Look for and test unique signatures Can take hours or days
Earlybird
Automatically detect and contain new worms
Two observations Some portion of the content in existing
worms is invariant Rare to see the same string recurring
from many sources to many destinations
Earlybird
Automatically extract the signature of all known worms Also Blaster, MyDoom, and Kibuv.B
hours or days before any public signatures were distributed
Few false positives
Background and Related Work
Almost all IPs were scanned by Slammer < 10 minutes Limited only by bandwidth constraints
The SQL Slammer Worm:30 Minutes After “Release”
- Infections doubled every 8.5 seconds- Spread 100X faster than Code Red- At peak, scanned 55 million hosts per second.
Network Effects Of The SQL Slammer Worm
At the height of infections Several ISPs noted significant bandwidth
consumption at peering points Average packet loss approached 20% South Korea lost almost all Internet
service for period of time Financial ATMs were affected Some airline ticketing systems
overwhelmed
Signature-Based Methods
Pretty effective if signatures can be generated quickly For CodeRed, 60 minutes For Slammer, 1 – 5 minutes
Scan Detection
Look for unusual frequency and distribution of address scanning
Limitations Not suited to worms that spread in a non-
random fashion (i.e. emails) Detects infected sites Does not produce a signature
Honeypots
Monitored idle hosts with untreated vulnerabilities Used to isolate worms
Limitations Manual extraction of
signatures Depend on quick infections
Behavioral Detection
Looks for unusual system call patterns Sending a packet from the same buffer
containing a received packet Can detect slow moving worms
Limitations Needs application-specific knowledge Cannot infer a large-scale outbreak
Characterization
Process of analyzing and identifying a new worm
Current approaches Use a priori vulnerability signatures Automated signature extraction
Vulnerability Signatures
Example Slammer Worm
UDP traffic on port 1434 that is longer than 100 bytes (buffer overflow)
Can be deployed before the outbreak Can only be applied to well-known
vulnerabilities
Some Automated Signature Extraction Techniques
Allows viruses to infect decoy progs Extracts the modified regions of the
decoy Uses heuristics to identify invariant code
strings across infected instances Limitation
Assumes the presence of a virus in a controlled environment
Some Automated Signature Extraction Techniques
Honeycomb Finds longest common subsequences
among sets of strings found in messages Autograph
Uses network-level data to infer worm signatures
Limitations Scale and full distributed deployments
Containment
Mechanism used to deter the spread of an active worm Host quarantine
Via IP ACLs on routers or firewalls String-matching Connection throttling
On all outgoing connections
Defining Worm Behavior
Content invariance Portions of a worm are invariant (e.g. the
decryption routine) Content prevalence
Appears frequently on the network Address dispersion
Distribution of destination addresses more uniform to spread fast
Finding Worm Signatures
Traffic pattern is sufficient for detecting worms Relatively straightforward Extract all possible substrings Raise an alarm when
FrequencyCounter[substring] > threshold1 SourceCounter[substring] > threshold2 DestCounter[substring] > threshold3
Detecting Common Strings
Cannot afford to detect all substrings Maybe can afford to detect all strings
with a small fixed length
Detecting Common Strings
Cannot afford to detect all substrings Maybe can afford to detect all strings
with a small fixed length
A horse is a horse, of course, of course
F1 = (c1p4 + c2p3 + c3p2 + c4p1 + c5) mod M
Detecting Common Strings
Cannot afford to detect all substrings Maybe can afford to detect all strings
with a small fixed length
A horse is a horse, of course, of course
F1 = (c1p4 + c2p3 + c3p2 + c4p1 + c5) mod M
F2 = (c2p4 + c3p3 + c4p2 + c5p1 + c6) mod M
Too CPU-Intensive
Each packet with payload of s bytes has s-β+1 strings of length β A packet with 1,000 bytes of payload
Needs 960 fingerprints for string length of 40 Still too expensive
Prone to Denial-of-Service attacks
CPU Scaling
Random sampling may miss many substrings
Solution: value sampling Track only certain substrings
e.g. last 6 bits of fingerprint are 0 P(not tracking a worm)
= P(not tracking any of its substrings)
CPU Scaling
Example Track only substrings with last 6 bits = 0s String length = 40 1,000 char string
960 substrings 960 fingerprints ‘11100…101010’…‘10110…000000’…
Track only ‘xxxxx….000000’ substrings Probably 960 / 26 = 15 substrings in total
string that end in 6 0’s
CPU Scaling
P(finding a 100-byte signature) = 55% P(finding a 200-byte signature) = 92% P(finding a 400-byte signature) =
99.64%
Estimating Content Prevalence
Finding the packet payloads that appear at least x times among the N packets sent During a given interval
Estimating Content Prevalence
Table[payload] 1 GB table filled in 10 seconds
Table[hash[payload]] 1 GB table filled in 4 minutes Tracking millions of ants to track a few
elephants Collisions...false positives
Stage 2
packet memory
packet1 1
Stage 1
Multistage Filters
No false negatives!(guaranteed detection)
Estimating Address Dispersion
Not sufficient to count the number of source and destination pairs e.g. send a mail to a mailing list
Two sources—mail server and the sender Many destinations
Need to count the unique source and destination traffic flows For each substring
[Estan et al. 2003]
Bitmap counting – direct bitmap
HASH(green)=10001001
Set bits in the bitmap using hash of the flow ID of incoming packets
Bitmap counting – direct bitmap
HASH(green)=10001001
Packets from the same flow always hash to the same bit
Bitmap counting – direct bitmap
HASH(yellow)=01100011
As the bitmap fills up, estimates get inaccurate
Bitmap counting – direct bitmap
Solution: use more bits
Problem: memory scales with the number of flows
HASH(blue)=00100100
Bitmap counting – virtual bitmap
Solution: a) store only a portion of the bitmap
b) multiply estimate by scaling factor
Bitmap counting – virtual bitmap
HASH(yellow)=01100011
Problem: estimate inaccurate when few flows active
Bitmap counting – multires. bmp
Problem: must update up to three bitmaps per packet
Solution: combine bitmaps into one
OR
OR
Putting It Together
header payload
substring fingerprintssubstring fingerprints
key src cnt dest cnt
AD entry exist?update counters
key cntelseupdate counter
cnt > prevalence threshold?create AD entry
Content Prevalence Table(multistage filters)
Address Dispersion Table(scalable counters)
counters > dispersion threshold?report key as suspicious worm
Putting It Together
Sample frequency: 1/64 String length: 40 Use 4 hash functions to update
prevalence table Multistage filter reset every 60 seconds
System Design
Two major components Sensors
Sift through traffic for a given address space Report signatures
An aggregator Coordinates real-time updates Distributes signatures
Implementation and Environment
Written in C and MySQL (5,000 lines) Prototype executes on a 1.6Ghz AMD
Opteron 242 1U Server Linux 2.6 kernel
EarlyBird Performance
Processes 1TB of traffic per day 200Mbps of continuous traffic Can be pipelined and parallelized for
achieve 40Gbps
Parameter Tuning
Prevalence threshold: 3 Very few signatures repeat
Address dispersion threshold 30 sources and 30 destinations Reset every few hours Reduces the number of reported
signatures down to ~25,000
Parameter Tuning
Tradeoff between and speed and accuracy Can detect Slammer in 1 second as
opposed to 5 seconds With 100x more reported signatures
Memory Consumption
Prevalence table 4 stages
Each with ~500,000 bins (8 bits/bin) 2MB total
Address dispersion table 25K entries (28 bytes each) < 1 MB
Total: < 4MB
Trace-Based Verification
Two main sources of false positives 2,000 common protocol headers
e.g. HTTP, SMTP Whitelisted
SPAM e-mails BitTorrent
Many-to-many download
Inter-Packet Signatures
An attacker might evade detection by splitting an invariant string across packets
With 7MB extra, EarlyBird can keep per flow states and fingerprint across packets
Live Experience with EarlyBird
Detected precise signatures CodeRed variants MyDoom mail worm Sasser Kibvu.B