
1

Rethinking NetFlow: A Case for a Coordinated “RISC” Architecture for Flow Monitoring

Vyas Sekar

Joint work with Mike Reiter, Hui Zhang, David Andersen, Anupam Gupta, Ramana Kompella, Walter Willinger

Flow Monitoring is critical for effective Network Management

Traffic Engineering

Analyze new user apps

Anomaly Detection

Network Forensics

Worm Detection

Accounting

Botnet analysis

…….

Many management applications

Evolving and growing over time

Need high-fidelity measurements

3

Requirements for monitoring

Network Operations Center

Flow reports

Flow report = flow identifier (same src-dst, ports, proto) + pkt/byte counters

Respect resource constraints

High flow coverage

Provide network-wide goals

Low data management overhead

High-fidelity for all applications

4

Sampling due to resource constraints

• Routers cannot record every packet/flow
– Constraints: CPU, memory, bandwidth

• Resource constraints don’t go away!
– Network demands scale even as routers become more powerful

• Some form of sampling is inevitable
– Record/report only a subset of the traffic

5

Current solution


High flow coverage


Low data management overhead

Uniform packet sampling, e.g., Cisco NetFlow
• Each router independently samples packets
• Aggregates sampled packets into flow reports

Biased towards large flows

Redundant measurements

Too coarse

High-fidelity for all applications

Not very good for security

6

How do we meet the requirements?



High flow coverage


Low data mgmt overhead

Part 1: Coordinated Sampling

Part 2: “RISC” monitoring

7




High flow coverage





8

High-level idea

Sampling algorithm not biased to large flows

Packet sampling has low flow coverage due to bias toward large flows

Routers sample independently → wasted measurements

Can’t reason about network-wide goals

Treat routers in the network as a system to be managed in a coordinated fashion!

9

Part 1 Outline

• Motivation

• Design of cSamp (Coordinated Sampling)

• Evaluation

• Practical deployment

10

Design

• Random flow sampling (single router)
– Sample flows, not packets

• Hash-based coordination (single path)
– Efficient, non-redundant sampling
– Coordination without explicit communication

• Network-wide optimization (whole network)
– Satisfy network-wide constraints and objectives

11

Design (single router)

• Random flow sampling
– Sample flows, not packets

12

Flow sampling

Sample flows, not packets, to increase flow coverage

Compute hash, log if in range

[Figure: packet header fields (Version, IHL, TOS, Length, Identification, Flags, Offset, TTL, Protocol, Checksum, Source IP address, Destination IP address, …, Source Port, Destination Port) feed Hash(Flowid) in [0, Max]; hash range [3,10]; flow memory (flow, counter #pkts)]
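A minimal sketch of hash-based flow sampling on one router, as described on this slide. The SHA-1 hash, the 32-bit hash space, and the names `flow_hash` and `flow_sample` are assumptions made for illustration; the slides do not say which hash function is used.

```python
import hashlib
from collections import Counter

HASH_MAX = 2**32 - 1  # assumed size of the hash space [0, Max]


def flow_hash(five_tuple):
    """Map a flow identifier (5-tuple) to an integer in [0, HASH_MAX]."""
    key = "|".join(str(field) for field in five_tuple).encode()
    return int.from_bytes(hashlib.sha1(key).digest()[:4], "big")


def flow_sample(packets, lo, hi):
    """Count packets only for flows whose hash lies in [lo, hi].

    Every packet of a selected flow is counted, so a logged flow gets an
    exact packet count, and small flows are as likely to be selected as
    large ones.
    """
    flow_memory = Counter()
    for five_tuple in packets:
        if lo <= flow_hash(five_tuple) <= hi:
            flow_memory[five_tuple] += 1
    return flow_memory
```

Because the decision depends only on the flow identifier, a flow is either logged in full or not at all, which is what removes the bias toward large flows.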

13

Design (single path)


• Hash-based coordination
– Efficient, non-redundant sampling
– Coordination without explicit communication

14

Hash-based coordination

Non-overlapping hash ranges avoid redundant monitoring

Coordination without communication

[Figure: packet stream 11816135 traverses R1 and R2; R1 has hash range [1,4] and R2 has hash range [7,9], each with its own flow memory]
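The coordination idea can be sketched as follows, using the slide's ranges for R1 and R2. The stream of hash values below is illustrative only, and `monitor_path` and `ranges_overlap` are hypothetical helper names.

```python
from collections import Counter


def ranges_overlap(ranges):
    """Return True if any two inclusive (lo, hi) ranges share a hash value."""
    ordered = sorted(ranges.values())
    return any(prev[1] >= cur[0] for prev, cur in zip(ordered, ordered[1:]))


def monitor_path(stream, ranges):
    """Each router on the path logs a flow only if its hash is in the router's range.

    stream: flow hash values, one per packet (toy integers here).
    ranges: {router: (lo, hi)} inclusive hash ranges.
    """
    memories = {router: Counter() for router in ranges}
    for h in stream:
        for router, (lo, hi) in ranges.items():
            if lo <= h <= hi:
                memories[router][h] += 1
    return memories


ranges = {"R1": (1, 4), "R2": (7, 9)}
stream = [1, 1, 8, 1, 6, 1, 3, 5]  # illustrative hash values
memories = monitor_path(stream, ranges)
```

Since the ranges do not overlap, no flow is logged twice, and the routers never exchange messages: each one only needs to know its own range.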

15

Design (whole network)


• Hash-based coordination (single path)
– Efficient, non-redundant sampling
– Coordination without explicit communication

• Network-wide optimization
– Satisfy network-wide constraints and objectives

16

Network-wide view

Many paths = Origin-Destination (OD) pairs in a network, e.g., NYC-PIT, PIT-SFO

Moving from a single path to the whole network?

17

Network-wide coordination

Assign non-overlapping ranges per OD-pair/path

[Figure: example topology with hash ranges assigned to routers along each path: [1,5], [1,3], [3,7], [1,2], [7,9], [5,8]]

18

cSamp algorithm on each router

1. Get OD-pair from packet

2. Compute hash (flow = packet 5-tuple)

3. Look up hash range for OD-pair from sampling manifest

4. Log if hash falls in range for this OD-pair

[Figure: a router with a sampling manifest (columns: OD, Range) holding the ranges [5,10] and [1,4], and a flow memory of logged flows. Callout: “Red vs. Green?”]
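The per-router steps on this slide can be sketched as below. The packet representation (a dict with `od_pair` and `five_tuple` keys), the toy hash space of 1 to 10, the OD-pair names in the manifest, and the function names are all assumptions for illustration.

```python
import hashlib
from collections import Counter

HASH_SPACE = 10  # toy hash space 1..10, matching the small ranges on the slide


def hash_flow(five_tuple, space=HASH_SPACE):
    """Hash a 5-tuple to an integer in [1, space] (assumed hash function)."""
    digest = hashlib.sha1(repr(five_tuple).encode()).digest()
    return int.from_bytes(digest[:4], "big") % space + 1


def csamp_process(packet, manifest, flow_memory, hash_fn=hash_flow):
    """Apply the four cSamp steps to one packet; return True if it was logged."""
    od_pair = packet["od_pair"]             # 1. get OD-pair from packet
    h = hash_fn(packet["five_tuple"])       # 2. compute hash of the 5-tuple
    hash_range = manifest.get(od_pair)      # 3. look up range in the manifest
    if hash_range is None:
        return False                        # no entry for this OD-pair here
    lo, hi = hash_range
    if lo <= h <= hi:                       # 4. log if the hash is in range
        flow_memory[(od_pair, packet["five_tuple"])] += 1
        return True
    return False


# Hypothetical manifest for one router: OD-pair -> inclusive hash range.
manifest = {"NYC-PIT": (1, 4), "PIT-SFO": (5, 10)}
```

The same flow hash can lead to different decisions for different OD-pairs, because the range consulted depends on the OD-pair of the packet.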

19

Overall system architecture

[Figure: overall architecture. Components: Network Operations Center, Applications, Generate sampling manifests, Configuration dissemination, Flow reports. Example per-router hash ranges: [1,5], [5,9], [3,7], [1,2], [7,9], [5,8]]

20

Framework for generating manifests

Inputs
• OD-pair info: traffic, path (routers)
• Router constraints, e.g., SRAM for flow records

Network-wide optimization (linear program)

Objective: maximize Σ_{i ∈ ODPairs} Coverage_i × Traffic_i

Subject to achieving the maximum min_{i ∈ ODPairs} {Coverage_i}

Output
• Sampling manifests: {<OD-pair, hash-range>} per router
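The slide does not show how the optimizer's output becomes a manifest. One plausible sketch, assuming the optimizer returns for each OD-pair the fraction of its flows that each router on the path should log (the hypothetical `fractions` input), is to lay those fractions end to end over a unit hash space [0, 1). All names here are illustrative.

```python
def build_manifests(paths, fractions):
    """Turn per-router sampling fractions into non-overlapping hash ranges.

    paths:     {od_pair: [router, ...]} routers on each OD-pair's path
    fractions: {(od_pair, router): fraction of the OD-pair's flows to log}
    Returns (manifests, coverage), where manifests[router][od_pair] is a
    half-open range (lo, hi) in [0, 1) and coverage[od_pair] is the total
    fraction of that OD-pair's flows logged somewhere on its path.
    """
    manifests, coverage = {}, {}
    for od_pair, routers in paths.items():
        start = 0.0
        for router in routers:
            frac = fractions.get((od_pair, router), 0.0)
            if frac > 0:
                end = min(start + frac, 1.0)  # coverage cannot exceed 100%
                manifests.setdefault(router, {})[od_pair] = (start, end)
                start = end
        coverage[od_pair] = start
    return manifests, coverage


def router_loads(manifests, traffic):
    """Expected number of flow records each router must hold."""
    return {
        router: sum((hi - lo) * traffic[od] for od, (lo, hi) in ranges.items())
        for router, ranges in manifests.items()
    }


def total_flow_coverage(coverage, traffic):
    """The slide's objective: sum of Coverage_i x Traffic_i over OD-pairs."""
    return sum(coverage[od] * traffic[od] for od in coverage)
```

Each router's load can then be checked against its memory constraint, and the two coverage values (total and minimum over OD-pairs) can be compared across candidate solutions.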

21

Part 1 Outline

• Motivation

• Design of cSamp (Coordinated Sampling)

• Evaluation

• Practical deployment


22

cSamp vs. other sampling solutions

• Metrics reflect initial goals
– Coverage, network-wide goals, redundancy

• Flow sampling
– Fixed-rate and maximal flow sampling
– Use same memory (400K flow records)

• Packet sampling
– 1-in-100 and 1-in-50 (edge)
– Allow infinite memory

23

Total flow coverage

cSamp is 2-3X better than packet sampling, 30% over maximal flow sampling

24

Minimum fractional coverage

cSamp is significantly better than other solutions!

Maximal flow sampling is inadequate for network-wide objectives

25

How do these solutions fare?

Requirements, compared across Packet Sampling (NetFlow), Flow Sampling, and cSamp:

• High flow coverage
• Network-wide goals
• Minimize redundant measurements

26

Part 1 Outline

• Motivation

• Design of cSamp (Coordinated Sampling)

• Evaluation

• Practical deployment


27

Practical Issues

1. What about traffic dynamics? History + short-term adaptation

2. Is the optimization scalable? Need two improvements (binary search + max-flow)

3. What about multi-path routing? Simple, lightweight extension

4. How do interior routers identify OD-pairs? Assume ingress routers mark packets

28

Why we may want to avoid this ….

Extra overhead on ingress

OD-pair id might be ambiguous (multi-egress peers)

Need to modify packet headers or add shim header

May require overhaul of routing infrastructure

How do interior routers identify OD-pairs? Assume ingress routers mark packets

29

Can we realize the benefits of cSamp without requiring OD-pair identification?

Use local info. at router to make sampling decisions

“Stitch” coverage for a path across routers on that path

30

What local info can I get from the packet and routing table?

{Previous Hop, My Id, Next Hop}

SamplingSpec: granularity at which sampling decisions are made

How much traffic to sample for this SamplingSpec?

SamplingAtom: discrete hash-ranges; select some of them to log

[Figure: example paths through routers R0 to R4]

31

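“Stitching” together coverage

[Figure: the coverage of a path is the union of the hash-ranges logged by the routers on that path; example topology with routers R1 to R7]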
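A small sketch of the stitching idea, assuming the hash space is divided into a fixed number of equal SamplingAtoms and that a SamplingSpec is the triple (previous hop, this router, next hop). `NUM_ATOMS`, `specs_on_path`, and `path_coverage` are hypothetical names; `None` stands for the network edge.

```python
NUM_ATOMS = 10  # assumed number of equal-size SamplingAtoms in the hash space


def specs_on_path(path):
    """SamplingSpecs (prev hop, router, next hop) seen along a path."""
    padded = [None] + list(path) + [None]
    return [tuple(padded[i:i + 3]) for i in range(len(path))]


def path_coverage(path, selected):
    """Fraction of the hash space logged by at least one router on the path.

    selected: {sampling_spec: set of atom indices that router logs}
    Atoms logged by several routers count once: coverage is a union.
    """
    covered = set()
    for spec in specs_on_path(path):
        covered |= selected.get(spec, set())
    return len(covered) / NUM_ATOMS


selected = {
    (None, "R1", "R2"): {0, 1},
    ("R1", "R2", "R3"): {1, 2, 3},
}
coverage = path_coverage(["R1", "R2", "R3"], selected)
```

In this example atom 1 is logged at both R1 and R2, so it is wasted effort for this path; the optimization on the following slides tries to avoid such overlap within router budgets.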

32

Problem Formulation

Coverage for path P_i: C_i

Load on router R_j: Load_j

Maximize:
  Total flow coverage: Σ_i T_i C_i
  Minimum fractional coverage: min_i {C_i}

Subject to: ∀j, Load_j ≤ L_j

33

Maximize:
  Total flow coverage: Σ_i T_i C_i
  Min. frac. coverage: min_i {C_i}

Subject to: ∀j, Load_j ≤ L_j

Sorry... NP-hard! Can’t even approximate the min. without resource augmentation

Total flow coverage: submodular maximization with partition-knapsack constraints. Efficient greedy algorithm with near-optimal performance

Min. fractional flow coverage: intelligent augmentation is much better than the theoretical guarantee. Partial/incremental deployment of OD-pair identifiers
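To make the greedy idea concrete, here is a generic greedy heuristic for the total-flow-coverage objective under per-router load budgets. It is a sketch under stated assumptions, not necessarily the authors' algorithm: the hash space is split into `NUM_ATOMS` equal atoms, logging one atom at a SamplingSpec costs the traffic through that spec divided by `NUM_ATOMS`, and every router appears in `budget`. The near-optimality claim on the slide is not something this sketch verifies.

```python
NUM_ATOMS = 4  # assumed number of equal-size SamplingAtoms


def specs_on_path(path):
    """SamplingSpecs (prev hop, router, next hop) along a path."""
    padded = [None] + list(path) + [None]
    return [tuple(padded[i:i + 3]) for i in range(len(path))]


def greedy_select(paths, traffic, budget):
    """Greedily pick (spec, atom) pairs with the best coverage gain per unit load.

    paths:   {path_name: [router, ...]}
    traffic: {path_name: number of flows T_i}
    budget:  {router: maximum load L_j in flow records}
    Returns (selected atoms per spec, total flow coverage, load per router).
    """
    spec_paths = {}
    for name, routers in paths.items():
        for spec in specs_on_path(routers):
            spec_paths.setdefault(spec, []).append(name)

    covered = {name: set() for name in paths}
    load = {router: 0.0 for router in budget}
    selected = {}
    while True:
        best = None
        for spec, names in spec_paths.items():
            router = spec[1]
            cost = sum(traffic[n] for n in names) / NUM_ATOMS
            if load[router] + cost > budget[router]:
                continue  # this router cannot afford another atom
            for atom in range(NUM_ATOMS):
                if atom in selected.get(spec, set()):
                    continue
                # Marginal gain: flows on paths that do not yet cover this atom.
                gain = sum(traffic[n] for n in names
                           if atom not in covered[n]) / NUM_ATOMS
                if gain > 0 and (best is None or gain / cost > best[0]):
                    best = (gain / cost, spec, atom, cost)
        if best is None:
            break
        _, spec, atom, cost = best
        selected.setdefault(spec, set()).add(atom)
        load[spec[1]] += cost
        for n in spec_paths[spec]:
            covered[n].add(atom)

    total = sum(traffic[n] * len(covered[n]) / NUM_ATOMS for n in paths)
    return selected, total, load
```

On a single path of two routers that can each afford half of the path's flows, the heuristic assigns disjoint atoms to the two routers, so the path is fully covered without redundancy.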

34

Total flow coverage

cSamp-T (tuple+) gives near-ideal total flow coverage vs. cSamp

35

Minimum fractional coverage

With smart resource augmentation, cSamp-T gives good min. frac. coverage

36




High flow coverage





37

What functionality should we put on routers?

[Figure: candidate applications: FSD, Heavy Hitters, Entropy, SuperSpreaders, Change Detection, Outdegree histogram; traffic features: Port, Addr, Src, Dst]

38

Current Research: Application-Specific!

[Figure: Traffic feeds separate per-application modules: FSD, Heavy Hitters, Change Detection, Outdegree histogram; traffic features: Port, Addr, Src, Dst]

Separate counters and estimation algorithms per app

Why? Application-specific approaches provide higher fidelity

39

Alternative: “RISC”

[Figure: Traffic feeds a generic data collection layer that serves FSD, Heavy Hitters, Change Detection, Outdegree histogram; traffic features: Port, Addr, Src, Dst]

Generic data collection: decouple collection and computation

Why? Late binding to applications, easier to implement, “future-proof”

40

RISC vs. Application-Specific

Revisit this perception that RISC does not provide good performance

Requirements, compared across Application-Specific, RISC 1.0 (NetFlow), and RISC 2.0:

• Fidelity across applications?
• Implementation complexity?
• Processing overhead?
• Changing applications?
• Enable diagnostics?

41

Why this might make sense?

Each app-specific algorithm requires dedicated counters

Primary bottleneck for high-speed monitoring = SRAM counters

Look at aggregate memory usage across applications

Pool these resources into a few sampling primitives

Run these with sufficient fidelity!

42

Challenges

What RISC primitives should we implement?

Combination of flow sampling, sample and hold, cSamp

Does it perform comparably to application-specific approaches?

Yes! RISC with aggregate resources is comparable or even better

43

Challenges





44

Requirements for RISC 2.0:

• Fidelity across applications?
• Implementation complexity?
• Processing overhead?
• Changing applications?
• Enable diagnostics?

Two broad classes:
• “Structure”: Flow Sampling
• “Volume”: Sample and Hold

Provide flow reports like NetFlow

Coordination: network-wide optimization


45

Sample and Hold

Accurate counts of “heavy hitters” with few counters

Algorithm:
• If flow is already logged, update its counter
• Otherwise, sample packet with probability p
• If sampled (new flow), create counter

[Figure: packet stream passing through sample and hold into flow memory (flow, counter #pkts)]
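The sample-and-hold algorithm on this slide can be sketched as follows. The function name, the flow representation (any hashable identifier), and the optional `rng` argument for reproducibility are assumptions for illustration.

```python
import random
from collections import Counter


def sample_and_hold(packets, p, rng=None):
    """Count packets per flow, creating a counter only when a packet is sampled.

    packets: iterable of flow identifiers, one per packet
    p:       probability of sampling a packet of a not-yet-logged flow
    rng:     optional random.Random instance (assumed, for repeatable runs)
    """
    rng = rng or random.Random()
    flow_memory = Counter()
    for flow in packets:
        if flow in flow_memory:
            flow_memory[flow] += 1   # flow already logged: update counter
        elif rng.random() < p:
            flow_memory[flow] = 1    # sampled packet of a new flow: create counter
    return flow_memory
```

A flow with many packets is very likely to be sampled early, after which every later packet is counted. That is why heavy hitters get accurate counts while most small flows never consume a counter.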

46

Putting the pieces together

47

Challenges





48

[Figure: application-specific algorithms (FSD, Heavy Hitters, Change Detection, Outdegree histogram; traffic features: Port, Addr, Src, Dst) compared against FlowSamp + Sample & Hold]

Calculate aggregate memory usage

Compute “Relative Accuracy Difference” (+ good, - bad)

49

Sensitivity to Application Portfolio

Bigger app. portfolio or some resource-intensive apps → better gains for RISC approach

Bigger portfolio → more resources

[Plot: “Relative Accuracy Difference” (+ good, - bad)]

50

Evaluation: Single Router

RISC > Application-specific for most applications

Worse for heavyhitter, but not by much!




51

Network-wide evaluation: Per-Ingress

Results for network-wide per ingress mirror single router evaluation

RISC > Application-specific for most applications




52

RISC vs. Application-Specific

Requirements, compared across Application-Specific, RISC 1.0 (NetFlow), and RISC 2.0 (FS, SH, cSamp):

• Fidelity across applications
• Implementation complexity
• Processing overhead
• Changing applications
• Enable diagnostics

Yes, We Can!

53

Meeting requirements for Flow Monitoring



High flow coverage



Coordinated Sampling with “RISC” primitives

54

Looking beyond flow monitoring

• WAN optimization, Content-Aware Networking

“Intelligent and targeted caching of traffic at selected nodes of the network can help engineer traffic by reducing load and avoiding hot spots, cut end-to-end latency, and enhance the end-user experience.”

– SmartRE (with Ashok Anand, Aditya Akella)

• Intrusion Detection/Prevention for Enterprises (ongoing work)

55

Thanks!

www.cs.cmu.edu/~vyass
