ENMA: Co-operation in the corporation
Mort (Richard Mortier)
MSR-Cambridge
September 2004
Network management
…is the process of monitoring and controlling a large, complex distributed system of dumb devices where failures are common and resources scarce
Enterprise networks are large but closely managed
  Contrast with the Internet or university campus networks
No-one has the big picture!
  Internet routeing uses distributed protocols
  Current management tools all consider local info
  Patchy SNMP support, configuration issues, sampling artefacts, tools generate CPU and network load
This project
Building edge-based network management platform
  Collect flow information from hosts, and
  Combine with topology information from routeing protocols
  Enable visualization, analysis, simulation, control
Avoid problems of not-quite-standard interfaces
  Management support is typically ‘non-critical’ (i.e. buggy) and not extensively tested for inter-operability
Do the work where resources are plentiful
  Hosts have lots of cycles and little traffic (relatively)
  Protocol visibility: see into tunnels, IPSec, etc.
Problem context: Enterprise networks
Large
  10^5 edge devices, 10^3 network devices
Geographically distributed
  Multiple continents, 10^2 countries
Tightly controlled
  IT department has (nearly) complete control over user desktops and network-connected equipment
Talk outline
System outline
What would it be good for?
In more detail…
Research issues
System outline
[Diagram: system components. Labels: Control; Packets; Flows; Routeing protocol; Topology; Visualize; Simulate; Simulator; Distributed database (traffic matrix over srcs × dsts, set of routes)]
Where is my traffic going today?
Pictures of current topology and traffic
  Routes + flows + forwarding rules = BIG PICTURE
In fact, where did my traffic go yesterday?
  Keep historical data for capacity planning, etc.
A platform for anomaly detection
  Historical data suggests “normality”, live monitoring allows anomalies to be detected
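One way to read "historical data suggests normality" is as a per-link baseline check. The following is a minimal sketch, assuming a simple mean and standard-deviation baseline; the slides do not name a detection method, and `is_anomalous` and its threshold of 3 are illustrative choices rather than part of the project.

```python
from statistics import mean, stdev

def is_anomalous(history, live, threshold=3.0):
    """Flag a live load sample that deviates from the historical baseline.

    history: past load samples for one link (at least two values).
    live: the current sample.
    threshold: illustrative cut-off, in standard deviations (an assumption).
    """
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Flat history: any change at all counts as a deviation.
        return live != mu
    return abs(live - mu) / sigma > threshold

# Made-up samples for one link.
history = [100, 104, 98, 101, 97, 102]
print(is_anomalous(history, 103))  # close to the historical range
print(is_anomalous(history, 250))  # far outside it
```

A real deployment would probably keep separate baselines per time of day, since enterprise load is strongly diurnal.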
Where might my traffic go tomorrow?
Plug into a simulator back-end
  Discrete event simulator, flow allocation solver
Run multiple ‘what-if’ scenarios
  …failures …reconfigurations …technology deployments
  E.g. “What happens if we coalesce all the Exchange servers in one data-centre?”
Where should my traffic be going?
Close the loop: compute link weights to implement policy goals
  Recompute on order of hours/days
Allows more dynamic policies
  Modify network configuration to track e.g. time-of-day load changes
  Might make network more efficient (~cheaper)
Where are we now?
Three major components
  Flow collection
  Route collection
  Distributed database
Still studying feasibility
  Starting to build prototypes
Data collection
Flow collection
  Hosts track active flows
  Using low-overhead event posting infrastructure, ETW
  Built prototype device driver provider & user-space consumer
  Used packet traces for feasibility study on (client, server)
  Peaks at (165, 5667) live and (39, 567) active flows per sec
Route collection
  OSPF is link-state: passively collect link state adverts
  Extension of my work at Sprint (for IS-IS and BGP); also been done at AT&T (NSDI’04 paper)
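Host-side flow tracking can be sketched as a table updated on each packet event. This is a minimal sketch, assuming a flow is identified by its 5-tuple and retired after an idle timeout; the ETW provider and consumer are not modelled, and `FlowKey`, `FlowTable` and the 60-second timeout are hypothetical names and values chosen for illustration.

```python
from collections import namedtuple

# Assumed flow definition: the classic 5-tuple.
FlowKey = namedtuple("FlowKey", "src dst sport dport proto")

class FlowTable:
    """Per-host table of flows, updated from packet events."""

    def __init__(self, idle_timeout=60.0):
        self.idle_timeout = idle_timeout  # hypothetical value, in seconds
        self.flows = {}  # FlowKey -> [first_seen, last_seen, packets, bytes]

    def observe(self, key, size, now):
        """Record one packet of `size` bytes seen at time `now`."""
        rec = self.flows.get(key)
        if rec is None:
            self.flows[key] = [now, now, 1, size]
        else:
            rec[1] = now
            rec[2] += 1
            rec[3] += size

    def expire(self, now):
        """Remove and return flows idle for longer than idle_timeout."""
        done = {k: v for k, v in self.flows.items()
                if now - v[1] > self.idle_timeout}
        for k in done:
            del self.flows[k]
        return done

table = FlowTable(idle_timeout=60.0)
web = FlowKey("10.0.0.1", "10.0.1.9", 40000, 80, "tcp")
dns = FlowKey("10.0.0.1", "10.0.2.2", 40001, 53, "udp")
table.observe(web, 1500, now=0.0)
table.observe(dns, 80, now=1.0)
table.observe(web, 1500, now=50.0)
finished = table.expire(now=70.0)  # only the DNS flow has been idle > 60 s
```

Expired records are what a host would report upstream; the flows still in the table are the "active" flows.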
The distributed database
Logically contains
  1. Traffic flow matrix (bandwidths), {srcs} × {dsts}
  2. …each entry annotated with current route from src to dst
  N.B. src/dst might be e.g. (IP end-point, application)
Large dynamic data set suggests aggregation
Related work
  { distributed, continuous query, temporal } databases
  Sensor networks
  Potential starting points: Astrolabe or SDIMS (SIGCOMM’04)
Where/what/how much to aggregate?
  Is data read- or write-dominated?
  Which is more dynamic, flow or topology data?
  Can the system successfully self-tune?
The distributed database
Construct traffic matrix from flow monitoring
  Hosts can supply flows they source and sink
  Only need a subset of this data to get complete traffic matrix
Construct topology from route collection
  OSPF supplies topology → routes
Wish to be able to answer queries like
  “Who are the top-10 traffic generators?”
    Easy to aggregate, don’t care about topology
  “What is the load on link l?”
    Can aggregate from hosts, but need to know routes
  “What happens if we remove links {l…m}?”
    Interaction between traffic matrix, topology, even flow control
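The first two example queries can be sketched over an in-memory traffic matrix. This is a minimal sketch on a single machine, not the distributed database itself; the flow records, the topology and the function names are made up for illustration.

```python
from collections import defaultdict

# Flow records as (src, dst, bytes); values are invented for illustration.
flows = [("a", "c", 500), ("a", "d", 200), ("b", "c", 300), ("a", "c", 100)]

# Current route for each (src, dst) pair as a list of links (hypothetical topology).
routes = {
    ("a", "c"): [("a", "r1"), ("r1", "c")],
    ("a", "d"): [("a", "r1"), ("r1", "d")],
    ("b", "c"): [("b", "r1"), ("r1", "c")],
}

def traffic_matrix(flows):
    """Aggregate flow records into {(src, dst): bytes}."""
    tm = defaultdict(int)
    for src, dst, nbytes in flows:
        tm[(src, dst)] += nbytes
    return dict(tm)

def top_generators(tm, n=10):
    """Top-n sources by bytes sent; needs no topology information."""
    per_src = defaultdict(int)
    for (src, _dst), nbytes in tm.items():
        per_src[src] += nbytes
    return sorted(per_src.items(), key=lambda kv: kv[1], reverse=True)[:n]

def link_load(tm, routes, link):
    """Bytes crossing `link`; needs the route annotation on each entry."""
    return sum(nbytes for pair, nbytes in tm.items() if link in routes[pair])

tm = traffic_matrix(flows)
print(top_generators(tm))
print(link_load(tm, routes, ("r1", "c")))
```

The contrast matches the slide: `top_generators` touches only the matrix, while `link_load` also needs the routes.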
The distributed database
Building simulation model
  OSPF data gives topology, event list, routes
  Simple load model to start with (load ~ # subnets)
  Precedence matrix (from SPF) reduces flow-data query set
Can we do as well/better than e.g. NetFlow?
  Accuracy/coverage trade-off
How should we distribute the DB?
  Just OSPF data? Just flow data? A mixture?
  How many levels of aggregation? How many nodes do queries touch?
What sort of API is suitable?
  Example queries for sample applications
Research issues
Corner cases
Scalability
Robustness, accuracy
Control systems
Research issues
Corner cases
  Multi-homed hosts: how best to define a flow
  L4 routeing, NAT, proxy ARP, transparent proxies
  (Solve using device config files, perhaps SNMP)
Scalability
  Host measurement must not be intrusive (in terms of packet latency, CPU load, network bandwidth)
  Aggregators must elect themselves in such a way that they do not implode under event load
  What happens if network radically alters? E.g.
    Extensive use of multicast
    Connection patterns shift due to e.g. P2P deployment
Research issues
Robustness
  Network management had better still work as nodes fail or the network partitions!
Accuracy in the face of late, partial information
  By accident: unmonitored hosts
  By design: aggregation, more detail about local area
  Inference of link contribution to cumulative metrics, e.g. RTT
Network control: modify link weights
  How efficient is the current configuration anyway?
  What are plausible timescales to reconfigure?
Summary
Aim to build a coherent edge-based network management platform using flow monitoring and standard routeing protocols
  Applications include visualization, simulation, dynamic control
Research issues include
  Scalability: want to manage a 300,000 node network
  Robustness: must work as nodes fail or network partitions
  Accuracy: will not be able to monitor 100% of traffic
  Control systems: use the data to optimize the network in real-time, as well as just observe and simulate
Current status
Submitted HotNets paper
Prototype ETW provider/consumer driver
Studied feasibility of flow monitoring
Prototype OSPF collector & topology reconstruction
Investigating “distributed database” via simulation
  Query properties
  System decomposition
Questions, comments?
Backup slides
SNMP
Internet routeing
OSPF
BGP
Security
SNMP
Protocol to manage information tables at devices
Provides get, set, trap, notify operations
  get, set: read, write values
  trap: signal a condition (e.g. threshold exceeded)
  notify: reliable trap
Complexity mostly in the table design
  Some standard tables, but many vendor specific
  Non-critical, so often tables populated incorrectly
Internet routeing
Q: how to get a packet from node to destination?
A1: advertise all reachable destinations and apply a consistent cost function (distance vector)
A2: learn network topology and compute consistent shortest paths (link state)
  Each node (1) discovers and advertises adjacencies; (2) builds link state database; (3) computes shortest paths
A1, A2: Forward to next-hop using longest-prefix-match
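Longest-prefix match, the forwarding step shared by both approaches, can be sketched with Python's standard `ipaddress` module. This is a minimal sketch using a linear scan over an invented forwarding table; real routers use tries or hardware lookup, and `longest_prefix_match` and the next-hop names are illustrative.

```python
import ipaddress

def longest_prefix_match(table, addr):
    """Return the next hop for the most specific prefix containing addr, or None.

    table: {prefix string: next hop}; a linear scan, chosen for clarity only.
    """
    ip = ipaddress.ip_address(addr)
    best = None
    for prefix, next_hop in table.items():
        net = ipaddress.ip_network(prefix)
        if ip in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, next_hop)
    return best[1] if best else None

# Invented forwarding table: default route plus two nested prefixes.
table = {"0.0.0.0/0": "gw", "10.0.0.0/8": "r1", "10.1.0.0/16": "r2"}
print(longest_prefix_match(table, "10.1.2.3"))   # matches all three; /16 wins
print(longest_prefix_match(table, "10.2.0.1"))   # matches /0 and /8; /8 wins
print(longest_prefix_match(table, "192.0.2.1"))  # only the default route matches
```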
OSPF (~link state routeing)
Q: how to route given packet from any node to destination?
A: learn network topology; compute shortest paths
For each node
  Discover adjacencies (~immediate neighbours); advertise
  Build link state database (~network topology)
  Compute shortest paths to all destination prefixes
  Forward to next-hop using longest-prefix-match (~most specific route)
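The "compute shortest paths" step can be sketched with Dijkstra's algorithm over a link state database. This is a minimal sketch, assuming the database is already assembled as a map from node to neighbour costs; it ignores OSPF areas, LSA types and equal-cost multipath, and `shortest_paths` and the example topology are illustrative.

```python
import heapq

def shortest_paths(lsdb, source):
    """Dijkstra over a link state database {node: {neighbour: cost}}.

    Returns {node: (cost, first_hop)} for every node reachable from source;
    first_hop is the neighbour of source to forward to (None for source itself).
    """
    result = {}
    queue = [(0, source, None)]
    while queue:
        cost, node, first_hop = heapq.heappop(queue)
        if node in result:
            continue  # already settled via a cheaper path
        result[node] = (cost, first_hop)
        for neighbour, weight in lsdb.get(node, {}).items():
            if neighbour not in result:
                hop = neighbour if node == source else first_hop
                heapq.heappush(queue, (cost + weight, neighbour, hop))
    return result

# Invented four-node topology with symmetric link weights.
lsdb = {
    "a": {"b": 1, "c": 4},
    "b": {"a": 1, "c": 1, "d": 5},
    "c": {"a": 4, "b": 1, "d": 1},
    "d": {"b": 5, "c": 1},
}
paths = shortest_paths(lsdb, "a")
```

Because a passive collector holds the same link state database as every router, running the same computation from each node's point of view reconstructs the routes the network is using.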
BGP (~path vector routeing)
Q: how to route given packet from any node to destination?
A: neighbours tell you destinations they can reach; pick cheapest option
For each node
  Receive (destination, cost, next-hop) for all destinations known to neighbour
  Select among all possible next-hops for given destination
  Advertise selected (destination, cost+, next-hop') for all known destinations
Selection process is complicated
Routes can be modified/hidden at all three stages
  General mechanism for application of policy
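The receive, select, advertise cycle can be sketched using the slide's simplified (destination, cost, next-hop) model. This is a minimal sketch that selects on a single scalar cost and applies no policy; real BGP selection compares several path attributes and can modify or hide routes at each stage. `select_and_advertise`, `link_cost` and the example routes are illustrative assumptions.

```python
def select_and_advertise(received, self_id, link_cost=1):
    """One round of simplified path-vector processing at one node.

    received: iterable of (destination, cost, next_hop) from all neighbours.
    Returns (selected, adverts): the cheapest route per destination, and the
    routes to re-advertise with the cost increased and this node as next hop.
    """
    selected = {}
    for dest, cost, next_hop in received:
        if dest not in selected or cost < selected[dest][0]:
            selected[dest] = (cost, next_hop)
    adverts = [(dest, cost + link_cost, self_id)
               for dest, (cost, _hop) in sorted(selected.items())]
    return selected, adverts

# Invented routes heard from two neighbours, n1 and n2.
received = [
    ("10.0.0.0/8", 3, "n1"),
    ("10.0.0.0/8", 2, "n2"),
    ("192.168.0.0/16", 5, "n1"),
]
selected, adverts = select_and_advertise(received, "me")
```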
Security
Threat: malicious/compromised host
  Authenticate participants
  Must secure route collector as if a router
Threat: DoS on monitors
  Difference between client under DoS and server?
  Rate-pace output from monitors
Threat: eavesdropping
  Standard IPSec/encryption solutions