rethinking netflow: a case for a coordinated “risc” architecture for flow monitoring vyas sekar...
Post on 19-Dec-2015
217 views
TRANSCRIPT
1
Rethinking NetFlow: A Case for a Coordinated “RISC” Architecture for Flow Monitoring
Vyas SekarJoint work with
Mike Reiter, Hui ZhangDavid Andersen, Anupam Gupta,
Ramana Kompella, Walter Willinger
Flow Monitoring is critical for effective Network Management
Traffic Engineering
Analyze new user apps
AnomalyDetection
Network Forensics
Worm Detection
Accounting
Botnet analysis
…….
Many management applications Evolving and growing over time Need high-fidelity measurements
3
Requirements for monitoring Network
Operations Center
Flow reports
report = ( flow = same src-dst, ports, proto) + pkt/byte counters
Respect resource constraints
High flow coverage
Provide network-wide goals
Low data management overhead
High-fidelity for all applications
4
Sampling due to resource constraints
• Routers cannot record every packet/flow– Constraints: CPU, Memory, Bandwidth
• Resource constraints don’t go away!– Network demands scale even as routers become
more powerful
• Some form of sampling is inevitable– Record/report only a subset of the traffic
5
Current solution
Respect resource constraints
High flow coverage
Provide network-wide goals
Low data management overhead
Uniform packet sampling, e.g., Cisco NetFlow• Each router independently samples packets• Aggregates sampled packets into flow reports
Biased towardslarge flows
Redundant measurements
Too coarse
High-fidelity for all applications Not very goodfor security
6
How do we meet the requirements?
High-fidelity for all applications
Respect resource constraints
High flow coverage
Provide network-wide goals
Low data mgmt overhead
Part 1: Coordinated Sampling
Part 2: “RISC” monitoring
7
How do we meet the requirements?
High-fidelity for all applications
Respect resource constraints
High flow coverage
Provide network-wide goals
Low data mgmt overhead
Part 1: Coordinated Sampling
Part 2: “RISC” monitoring
8
High-level idea
Sampling algorithm not biased to large flows
Packet sampling has low flow coverage due to bias toward large flows
Routers sample independently Wasted measurements
Can’t reason about network-wide goals
Treat routers in the network as a system to be managed in a coordinated fashion!
9
Part 1 Outline
• Motivation
• Design of cSamp (Coordinated Sampling)
• Evaluation
• Practical deployment
10
Design
• Random flow sampling (single router)– Sample flows not packets
• Hash-based coordination (single path)– Efficient, non-redundant sampling– Coordination without explicit communication
• Network-wide optimization (whole network)– Satisfy network-wide constraints and objectives
11
Design (single router)
• Random flow sampling– Sample flows not packets
12
Flow sampling
1613111
Flow memory(flow, counter #pkts)
3
[3,10]Hash range
6
Sample flows, not packets, to increase flow coverage
1131611
Compute hash, log if in range
Version IHL TOS LengthIdentification Flags OffsetTTL Protocol ChecksumSource IP addressDestination IP address ……SourcePort DestinationPort
Hash Flowid [0,Max]
1131611
Pack
et h
eade
r
1
1
13
Design (single path)
• Random flow sampling (single router)– Sample flows not packets
• Hash-based coordination – Efficient, non-redundant sampling– Coordination without explicit communication
14
Hash-based coordination
Flow memory
1[1,4]
Hash range
3
Flow memory
8[7,9]
Hash range
Non-overlapping hash-ranges avoids redundant monitoringCoordination without communication
Stream:11816135
R1R2
4
1
1
15
Design (whole network)
• Random flow sampling (single router)– Sample flows not packets
• Hash-based coordination (single path)– Efficient, non-redundant sampling– Coordination without explicit communication
• Network-wide optimization– Satisfy network-wide constraints and objectives
16
Network-wide view
Many paths = Origin-Destination (OD) pairs in a networke.g., NYC-PIT, PIT-SFO
Moving from a single-path to network?
17
Network-wide coordination
Assign non-overlapping ranges per OD-pair/path
[1,5]
[1,3]
[3,7]
[1,2]
[7,9]
[5,8]
18
cSamp algorithm on each router
[5,10]
[1,4]
Sampling Manifest
1. Get OD-Pair from packet
3. Look up hash-range for OD-pair from sampling manifest 2. Compute hash (flow = packet 5-tuple)
4.Log if hash falls in range for this OD-pair
Red vs. Green?
Flow memory
2
2
1
OD Range
19
Overall system architecture
[1,5]
[5,9]
[3,7]
[1,2]
[7,9]
[5,8]
Network Operations
Center
Generate sampling manifests
Applications
Configuration Dissemination
Flow reports
20
Framework for generating manifests
Network-wide optimization
OD-pair infoTraffic, Path(routers)
Router constraintse.g., SRAM for flowrecords
Sampling manifests
{<OD-Pair,Hash-range>} per router
Objective:Max i ε ODPairs Coveragei Traffici
Subject to achieving maximum Mini ε ODPairs {Coveragei}
LinearProgram
Inputs
Output
21
Part 1 Outline
• Motivation
• Design of cSamp (Coordinated Sampling)
• Evaluation
• Practical deployment
22
cSamp vs. other sampling solutions
• Metrics reflect initial goals– Coverage, network-wide goals, redundancy
• Flow sampling– Fixed-rate and Maximal flow sampling– Use same memory (400K flow records)
• Packet sampling– 1-in-100 and 1-in-50 (edge)– Allow infinite memory
23
Total flow coveragecSamp is 2-3X better than packet sampling, 30% over maximal flow sampling
24
Minimum fractional coveragecSamp is significantly better than other solutions!
Maximal flow sampling is inadequate for network-wide objectives
25
How do these solutions fare?
Requirements
Packet Sampling(NetFlow)
Flow Sampling cSamp
Respect resource constraints
High flow coverage Network-wide goals Minimize redundant measurements
26
Part 1 Outline
• Motivation
• Design of cSamp (Coordinated Sampling)
• Evaluation
• Practical deployment
27
Practical Issues
2. Is the optimization scalable?Need two improvements
(binary search + max-flow)
3. What about multi-path routing?Simple, lightweight extension
1. What about traffic dynamics?History + short-term adaptation
4. How do interior routers identify OD-pairs?Assume ingress routers mark packets
28
Why we may want to avoid this ….
Extra overhead on ingress
OD-pair id might be ambiguous (multi-egress peers)
Need to modify packet headers or add shim header
May require overhaul of routing infrastructure
How do interior routers identify OD-pairs?Assume ingress routers mark packets
29
Can we realize the benefits of cSamp withoutrequiring OD-pair identification?
Use local info. at router to make sampling decisions
“Stitch” coverage for a path across routers on that path
30
R1 R3R2R1 R4R0
What local info can I get from
packet and routing table?
{Previous Hop, My Id, NextHop}
SamplingSpecGranularity at which
sampling decisions are made
How much traffic to sample for this SamplingSpec?
SamplingAtomDiscrete hash-
ranges, select some of them to log
31
=
=
“Stitching” together coverage
union
union
R1
R2
R4R3
R5
R6
R7
32
Problem Formulation
Coverage for path Pi
Load on router Rj
Maximize: Total flow coverage: i TiCi
Minimum fractional coverage: mini {Ci } Subject To:
j, Loadj Lj
33
Maximize: Total flow coverage: i TiCi
Min. frac coverage: mini {Ci } Subject To:
j, Loadj Lj
Sorry .. NP-hard!Can’t even approximate
min without resource augmentation
Total flow coverage: Submodular maximization with partition-knapsack constraintsEfficient greedy algorithm with near-optimal performance
Min. fractional flow coverage: Intelligent augmentation much better than theoretical guaranteePartial/incremental deployment of adding OD-pair identifiers
34
Total flow coverage
cSamp-T (tuple+) gives near-ideal total flow coverage vs. cSamp
35
Minimum fractional coverage
With smart resource augmentation, cSamp-T gives good min. frac. coverage
36
How do we meet the requirements?
High-fidelity for all applications
Respect resource constraints
High flow coverage
Provide network-wide goals
Low data mgmt overhead
Part 1: Coordinated Sampling
Part 2: “RISC” monitoring
37
FSD
HeavyHitters
Entropy SuperSpreaders
ChangeDetection
Outdegreehistogram
PortAddr
PortAddr
SrcDst
What functionality should we put on
routers ?
38
Current Research: Application-Specific!
FSD
PortAddr
Separate Counters & Estimation algorithms
Per App
HeavyHitters
Entropy SuperSpreaders
ChangeDetection
Outdegreehistogram
PortAddr Src
Dst
Why? Application-specific approaches provide higher fidelity
Traffic
39
Alternative: “RISC”
Traffic
Why? Late-binding to applications, Easier to implement, “Future-proof”
FSD
HeavyHitters
Entropy SuperSpreaders
ChangeDetection
Outdegreehistogram
PortAddr
PortAddr
SrcDst
GenericData
CollectionDecouple Collection
and Computation
40
RISC vs. Application-Specific
Revisit this perception that RISC does not provide good performance
Requirements
Application Specific
RISC 1.0NetFlow
RISC 2.0
Fidelity acrossapplications ?ImplementationComplexity ?Processing Overhead ?Changing Applications ?Enable Diagnostics ?
41
Why this might make sense?
Each app-specific algorithm requires dedicated counters
Primary bottleneck for high-speed monitoring = SRAM counters
Look at aggregate memory usage across applications
Pool in these resources into a few sampling primitivesRun these with sufficient fidelity!
42
Challenges
What RISC primitives should we implement?
Does it perform comparably to application-specific approaches?
Combination of flow sampling, sample and hold, cSamp
Yes! RISC with aggregate resources is comparable or even better
43
Challenges
What RISC primitives should we implement?
Does it perform comparably to application-specific approaches?
Combination of flow sampling, sample and hold, cSamp
Yes! RISC with aggregate resources is comparable or even better
44
RequirementsRISC 2.0
Fidelity acrossapplications ?ImplementationComplexity ?Processing Overhead ?Changing Applications ?Enable Diagnostics ?
Two broad classes“Structure”
Flow Sampling“Volume”
Sample and Hold
Provide flow reports like
NetFlow
Coordination Network-wideOptimization
What RISC primitives should we implement?
45
Sample and Hold
1613111
Flow memory(flow, counter #pkts)
1
6
Accurate counts of “heavy hitters” with few counters
1131611
AlgorithmIf flow is already logged updateSample packet with probability p
If new flow create counter
1131611
1
1
234
46
Putting the pieces together
47
Challenges
What RISC primitives should we implement?
Does it perform comparably to application-specific approaches?
Combination of flow sampling, sample and hold, cSamp
Yes! RISC with aggregate resources is comparable or even better
48
FSD
HeavyHitters
Entropy SuperSpreaders
ChangeDetection
Outdegreehistogram
PortAddr
PortAddr
SrcDst
FlowSamp + Sample & Hold
Calculate aggregate memory usage
Compute “Relative Accuracy
Difference” + good
- bad
49
Sensitivity to Application Portfolio
Bigger app. portfolio or Some resource intensive apps Better gains for RISC approach
“Relative Accuracy
Difference” + good
- bad
Bigger portfolio More resources
50
Evaluation: Single Router
RISC > Application-specific for most applications
Worse for heavyhitter, but not by much!
“Relative Accuracy
Difference” + good
- bad
51
Network-wide evaluation: Per-Ingress
Results for network-wide per ingress mirror single router evaluation
RISC > Application-specific for most applications
“Relative Accuracy
Difference” + good
- bad
52
RISC vs. Application-Specific
Requirements
Application Specific
RISC 1.0NetFlow
RISC 2.0FS, SH, cSamp
Fidelity acrossapplications ImplementationComplexity Processing Overhead Changing Applications Enable Diagnostics
Yes, We Can!
53
Meeting requirements for Flow Monitoring
High-fidelity for all applications
Respect resource constraints
High flow coverage
Provide network-wide goals
Low data mgmt overhead
CoordinatedSamplingWith “RISC” primitives
54
Looking beyond flow monitoring
• WAN optimization, Content-Aware Networking“Intelligent and targeted caching of traffic at selected nodes of the network can help engineer traffic by reducing load and avoiding hot spots, cut end-to-end latency, and enhance the end-user experience.”
– SmartRE (with Ashok Anand, Aditya Akella)
• Intrusion Detection/Prevention for Enterprises (ongoing work .. )