CINBAD CERN/HP ProCurve Joint Project
on Networking
26 May 2009
Ryszard Erazm Jurga - CERN
Milosz Marian Hulboj - CERN
Outline
Introduction to CERN network
CINBAD project and its goals
Data - sources, collection and analysis
Results and conclusions
[Figure: simplified overall CERN campus network topology, showing the Meyrin, Prevessin and LHC areas, the Computer Center and Vault, minor starpoints, farms, and Internet links]
Numbers:
150+ 10Gb routers
1'000 10Gb ports
2'200+ switches
50'000 active user devices
70'000 1Gb user ports
140 Gbps WAN connectivity
4.8 Tbps LCG Core
CINBAD codename deciphered
CERN Investigation of Network Behaviour and Anomaly Detection
Project Goal
"To understand the behaviour of large computer networks (10'000+ nodes) in High Performance Computing or large Campus installations to be able to:
• Detect traffic anomalies in the system
• Be able to perform trend analysis
• Automatically take counter measures
• Provide post-mortem analysis facilities"
What is an anomaly? (1)
Anomalies are a fact in computer networks
Anomaly definition is very domain specific:
• Network faults
• Malicious attacks
• Viruses/worms
• Misconfiguration
• …
But there is a common denominator:
"An anomaly is a deviation of the system from the normal (expected) behaviour (baseline)"
"Normal behaviour (baseline) is not stationary and is not always easy to define"
"Anomalies are not necessarily easy to detect"
What is an anomaly? (2)
Just a few examples of anomalies:
Network infrastructure misuse
• unauthorised DHCP/DNS server (either malicious or accidental)
• network scans
• worms and viruses
Violation of a local network/security policy
• NAT, TOR usage (not allowed at CERN)
CINBAD project principle
data sources → collectors → storage → analysis
sFlow – main data source
Based on packet sampling (RFC 3176)
• on average, 1-out-of-N packets is sampled by an agent and sent to a collector
• packet header and payload included (max 128 bytes)
– switching/routing/transport protocol information
– application protocol data (e.g. http, dns)
• SNMP counters included
• low CPU/memory requirements – scalable
For more details, see our technical report:
http://openlab-mu-internal.web.cern.ch/openlab-mu-internal/Documents/2_Technical_Documents/Technical_Reports/2007/RJ-MM_SamplingReport.pdf
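The 1-out-of-N sampling described above can be sketched as follows. The skip-counter scheme and the payload truncation follow the slide, but the details here are an illustrative assumption (real sFlow agents sample in switch hardware, not in software):

```python
import random

def sample_stream(packets, n):
    """Sample on average 1-out-of-n packets, as an sFlow agent does.

    A minimal sketch: after each sampled packet the agent draws a new
    random skip count, so the long-run rate is roughly 1/n but the
    choice of packet is unpredictable.
    """
    skip = random.randint(0, 2 * n - 1)
    for pkt in packets:
        if skip == 0:
            # only the start of the packet is kept (header plus the
            # beginning of the payload, max 128 bytes here)
            yield pkt[:128]
            skip = random.randint(0, 2 * n - 1)
        else:
            skip -= 1
```

With n = 400 and 70'000 1Gb ports, this is what keeps the CPU/memory cost low enough to scale: the collector sees a thin, statistically representative slice of the traffic rather than every packet.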
sFlow data usage
Typically: traffic accounting (e.g. for billing, network planning, SLA etc.)
Useful for post-mortem network analysis – a CERN-wide network tcpdump:
• access to the whole CERN sampled network traffic
• information on where and when the traffic has been seen
• particularly useful in detecting packets 'that should not be there' (policy violation)
Rare examples of sFlow usage for anomaly detection
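Finding packets 'that should not be there' can be illustrated with a rogue-DHCP check over decoded samples. The record fields and the authorised-server list below are hypothetical placeholders, not the CINBAD schema:

```python
# Hypothetical list of servers allowed to answer DHCP requests
AUTHORISED_DHCP_SERVERS = {"10.0.0.1", "10.0.0.2"}

def rogue_dhcp_servers(samples):
    """Return source IPs acting as DHCP servers without authorisation.

    Each sample is assumed to be a dict of transport-layer fields
    recovered from the sampled packet header (an assumption made for
    this sketch).
    """
    rogues = set()
    for s in samples:
        # DHCP server-to-client messages use UDP source port 67
        if s["proto"] == "udp" and s["src_port"] == 67:
            if s["src_ip"] not in AUTHORISED_DHCP_SERVERS:
                rogues.add(s["src_ip"])
    return rogues
```

Because sampling sees traffic network-wide, even a DHCP server answering only a handful of clients on one switch will eventually appear in the sampled stream.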
Other data sources
Packet sampling data is not enough!
• data is partial, cannot provide 100% accuracy
• not always easy to identify the anomaly
More data is needed to understand the flow of data in the network:
• external sources provide useful information and time triggers
• correlation between various data sources
• example: Central Antivirus Service at CERN
CINBAD sFlow data collection
Up to 1 Gbps of raw sFlow, multicasted to two collectors (primary and redundant)
[Diagram: Collector / Redundant Collector → Level I Processing → unpacked data on Level I Disk Storage → Level II Processing → CINBAD DB (Level II Storage) → Level III Processing / Data Mining, steered by a Configurator]
Current collection based on traffic from ~1000 switches:
~6000 sampled packets per second
~3500 SNMP counter sets per second
Multistage data storage
Stage 1
• sFlow datagram tree-like format unpacked into CINBAD file format to enable fast direct access
• minimal space overhead introduced
Stage 2
• Oracle DB as a long-term storage
• data aggregation
– tradeoff between sFlow randomness and space, data lifetime and anomaly detection requirements
– e.g. number of destination IPs for a given source IP
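The destination-IP aggregation mentioned above can be sketched as follows. Field names are illustrative, and the real Stage 2 aggregation runs in the Oracle DB rather than in application code:

```python
from collections import defaultdict

def dst_fanout(samples):
    """Aggregate sampled records into per-source destination fan-out.

    Instead of keeping every sampled packet, store only the number of
    distinct destination IPs seen per source IP. A sudden jump in
    fan-out is a classic network-scan indicator, so the aggregate
    preserves what anomaly detection needs while saving space.
    """
    fanout = defaultdict(set)
    for s in samples:
        fanout[s["src_ip"]].add(s["dst_ip"])
    return {src: len(dsts) for src, dsts in fanout.items()}
```

This is the tradeoff the slide refers to: the raw samples are discarded after aggregation, trading sFlow's packet-level randomness for longer data lifetime in bounded space.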
Data analysis
Various approaches are being investigated
Statistical analysis methods
• detect a change from "normal network behaviour"
– selection of suitable metrics is needed
• can detect new, unknown anomalies
• poor anomaly type identification
Signature based
• we ported SNORT to work with sampled data
• performs well against known problems
• tends to have a low false positive rate
• does not work against unknown anomalies
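A minimal sketch of the statistical approach, assuming a simple sliding-window baseline and a z-score threshold; as noted earlier, real baselines are non-stationary, so a deployment would need e.g. time-of-day profiles rather than a flat window:

```python
import statistics

def is_anomalous(history, current, threshold=3.0):
    """Flag a metric value that deviates from its recent baseline.

    The baseline is the mean/stdev of a window of past values of some
    chosen metric (packets per second, fan-out, ...); a value more than
    `threshold` standard deviations away is reported as anomalous.
    """
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        return current != mean
    return abs(current - mean) / stdev > threshold
```

This illustrates both properties from the slide: any metric excursion is caught, even for an unseen anomaly type, but the alert says only "this metric deviated", not what caused it.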
Synergy from both detection techniques
[Diagram: the incoming sFlow stream feeds both a sample-based SNORT evaluation engine, driven by translated rules, and a statistical analysis engine, driven by traffic profiles; both emit anomaly alerts and feed back new baselines and new signatures]
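The synergy can be illustrated by running both paths over one batch of samples. The signature patterns, volume metric and threshold below are illustrative placeholders, not the CINBAD interfaces:

```python
def detect(samples, signatures, baseline_max):
    """Run both detection paths over one batch of sampled payloads.

    Signature path (SNORT-like): match known byte patterns against
    sampled payloads - precise, but blind to unknown anomalies.
    Statistical path: compare a simple volume metric against its
    baseline - catches the unknown, but identifies the type poorly.
    """
    alerts = []
    for payload in samples:
        for sig in signatures:
            if sig in payload:
                alerts.append(("signature", sig))
    if len(samples) > baseline_max:
        alerts.append(("statistical", len(samples)))
    return alerts
```

The feedback loop in the diagram goes one step further: a statistical alert can be investigated and turned into a new signature, while signature hits refine the traffic profiles used as baselines.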
Current results
Campus and Internet traffic analysis
• identified anomalies which went undetected by the central CERN IDS
Detected a number of misbehaviours
• both statistical and pattern matching approaches used
• TOR, DNS abuse, Trojans, worms, network scans, p2p applications, rogue DHCP servers, etc.
Findings reported to the CERN security team
• security team adapted their policies
Achievements and next steps
It has been demonstrated that pattern matching for anomaly detection is possible with sFlow data
• payload data is a key advantage of sFlow
• sFlow allows distributed detection at the edge of the network
Combining statistical analysis with pattern matching is providing encouraging initial results
Integration of the two mechanisms holds promise for zero-day detection