TRANSCRIPT
perfSONAR
Broadening the Reach Workshop, Raleigh, NC 09/04/14 – 09/05/14 John Hicks – Network Research Engineer
perfSONAR Outline
• Performance Introduction & Motivation
• perfSONAR Preliminaries
• Tool Use
• Deployment & Regular Testing
• Debugging Strategies
• perfSONAR Community
2 – ESnet Science Engagement ([email protected]) - 9/4/14
Test and Measurement – Keeping the Network Clean
• The wide area network, the Science DMZ, and all its systems can be functioning perfectly
• Eventually something is going to break
– Networks and systems are built with many, many components
– Sometimes things just break – this is why we buy support contracts
• Other problems arise as well – bugs, mistakes, whatever
• We must be able to find and fix problems when they occur
• Why is this so important? Because we use TCP!
Where Are The Problems?
[Figure: end-to-end path from the source campus (S) across regional, backbone, and NREN networks to the destination campus (D). Callouts: congested or faulty links between domains; congested intra-campus links; latency-dependent problems inside domains with small RTT.]
Soft Network Failures
• Soft failures are where basic connectivity functions, but high performance is not possible.
• TCP was intentionally designed to hide all transmission errors from the user:
– “As long as the TCPs continue to function properly and the internet system does not become completely partitioned, no transmission errors will affect the users.” (From IEN 129, RFC 761)
• Some soft failures only affect high bandwidth long RTT flows.
• Hard failures are easy to detect & fix – soft failures can lie hidden for years!
• One network problem can often mask others
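Why a small loss rate can go unnoticed on short paths yet cripple long-RTT flows can be illustrated with the well-known Mathis et al. approximation, throughput ≤ (MSS / RTT) × (C / √loss). The formula is not on the slide; the MSS, loss rate, RTT values, and the choice C = 1 below are illustrative assumptions, so treat this as a rough sketch rather than a prediction for any real path.

```python
import math

def mathis_throughput_bps(mss_bytes, rtt_s, loss_rate, c=1.0):
    """Approximate upper bound on steady-state TCP throughput, in bits/s.

    Mathis et al. rule of thumb: rate <= (MSS / RTT) * (C / sqrt(p)).
    c=1.0 is a common simplification (an assumption here).
    """
    if loss_rate <= 0 or rtt_s <= 0:
        raise ValueError("loss_rate and rtt_s must be positive")
    return (mss_bytes * 8 / rtt_s) * (c / math.sqrt(loss_rate))

if __name__ == "__main__":
    mss = 1460    # bytes, assumed standard Ethernet MSS
    loss = 1e-4   # assumed: 1 packet lost in 10,000
    for rtt_ms in (1, 10, 50, 100):
        bps = mathis_throughput_bps(mss, rtt_ms / 1000, loss)
        print(f"RTT {rtt_ms:>3} ms: about {bps / 1e6:,.0f} Mbit/s")
```

With these assumed numbers the bound falls from roughly 1.2 Gbit/s at 1 ms to roughly 12 Mbit/s at 100 ms for the same loss rate, which is why a fault can test clean locally and still break a wide-area transfer.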
Network Monitoring
• All networks do some form of monitoring.
• Addresses needs of local staff for understanding state of the network
o Would this information be useful to external users?
o Can these tools function on a multi-domain basis?
• Beyond passive methods, there are active tools.
o E.g. often we want a ‘throughput’ number. Can we automate that idea?
o Wouldn’t it be nice to get some sort of plot of performance over the course of a day? Week? Year? Multiple endpoints?
• perfSONAR = Measurement Middleware
perfSONAR Outline
• Performance Introduction & Motivation
• perfSONAR Preliminaries
• Tool Use
• Deployment & Regular Testing
• Debugging Strategies
• perfSONAR Community
perfSONAR
• All the previous Science DMZ network diagrams have little perfSONAR boxes everywhere
– The reason for this is that consistent behavior requires correctness
– Correctness requires the ability to find and fix problems
[Figure: Science DMZ architecture. The border router connects the WAN over a clean, high-bandwidth WAN path to the Science DMZ switch/router, which serves a high performance Data Transfer Node with high-speed storage. The enterprise border router/firewall serves the site/campus LAN, which has access to Science DMZ resources. Per-service security policy control points are marked, links are 10GE, and perfSONAR hosts sit at three points in the diagram.]
• You can’t fix what you can’t find
• You can’t find what you can’t see
• perfSONAR lets you see
• Especially important when deploying high performance services
– If there is a problem with the infrastructure, need to fix it
– If the problem is not with your stuff, need to prove it
• Many players in an end to end path
• Ability to show correct behavior aids in problem localization
What is perfSONAR?
• perfSONAR is a tool to:
– Set network performance expectations
– Find network problems (“soft failures”)
– Help fix these problems
– All in multi-domain environments
• These problems are all harder when multiple networks are involved
• perfSONAR provides a standard way to publish active and passive monitoring data
– This data is interesting to network researchers as well as network operators
perfSONAR Toolkit
• The “perfSONAR Toolkit” is an open source implementation and packaging of the perfSONAR measurement infrastructure and protocols from ESnet and Internet2
• http://psps.perfsonar.net/toolkit
• All components are available as RPMs, and bundled into a CentOS 6-based “netinstall” and a “Live CD”
• perfSONAR tools are much more accurate if run on a dedicated perfSONAR host, not on the DTN
• Very easy to install and configure
• Usually takes less than 30 minutes
Toolkit Use Case
• The general use case is to establish some set of tests to other locations/facilities
• To answer the what/why questions:
– Regular testing with select tools helps to establish patterns – how much bandwidth we would see during the course of the day – or when packet loss appears
– We do this to ‘points of interest’ to see how well a real activity (e.g. Globus transfer) would do.
• If performance is ‘bad’, don’t expect much from the data movement tool
Deployment By The Numbers
• Last updated August 2014. Adoption trend increases with each release. CC-NIE and innovation platform helped as well.
http://stats.es.net/ServicesDirectory/ - 1200+ as of August 2014
• perfSONAR interface is meant to be simple (e.g. so easy even an Engineer/Scientist/CIO could do it)
• Enabling this on campus is the first step to seeing a simulation of performance for a bulk data tool. Ideally you would place the perfSONAR server where the users are (e.g. if they are still traversing a firewall, why don’t you learn their pain?)
• Configuring regular tests is systematic – pick regional and far away destinations.
• Dust off NetFlow, and see where the data is going – configure tests to those locations too.
Transition
• Use the correct tool for the job
– To determine the correct tool, maybe we need to start with what we want to accomplish …
• What do we care about measuring?
– Packet Loss, Duplication, out-of-orderness (transport layer)
– Achievable Bandwidth (e.g. “Throughput”)
– Latency (Round Trip and One Way)
– Jitter (Delay variation)
– Interface Utilization/Discards/Errors (network layer)
– Traveled Route
– MTU Feedback
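To make several of these metrics concrete, here is a minimal sketch that derives loss, duplication, reordering, one-way delay, and jitter from a stream of numbered probe packets. The record format `(seq, send_ts, recv_ts)` and the function name are hypothetical, not the output of any perfSONAR tool, and the one-way delay is only meaningful if sender and receiver clocks are synchronized.

```python
from statistics import mean

def summarize_probes(sent_count, received):
    """Summarize one probe stream.

    received: list of (seq, send_ts, recv_ts) tuples in arrival order,
    timestamps in seconds (hypothetical record format).
    """
    seen = set()
    duplicates = 0
    reordered = 0
    highest_seq = -1
    delays = []
    for seq, send_ts, recv_ts in received:
        if seq in seen:
            duplicates += 1      # same sequence number arrived again
            continue
        seen.add(seq)
        if seq < highest_seq:
            reordered += 1       # arrived after a later-numbered packet
        else:
            highest_seq = seq
        delays.append(recv_ts - send_ts)
    # Jitter is taken here as the mean absolute change between consecutive delays.
    jitter = (mean(abs(b - a) for a, b in zip(delays, delays[1:]))
              if len(delays) > 1 else 0.0)
    return {
        "lost": sent_count - len(seen),
        "duplicates": duplicates,
        "reordered": reordered,
        "min_delay_s": min(delays) if delays else None,
        "jitter_s": jitter,
    }

if __name__ == "__main__":
    sample = [
        (0, 0.0, 0.010),
        (2, 0.2, 0.212),
        (1, 0.1, 0.115),   # out of order
        (2, 0.2, 0.213),   # duplicate
        (4, 0.4, 0.410),   # seq 3 never arrives
    ]
    print(summarize_probes(5, sample))
```

Other definitions of jitter and reordering exist; the ones used here were chosen only because they are easy to read.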
The Metrics
perfSONAR Outline
• Performance Introduction & Motivation
• perfSONAR Preliminaries
• Hands On
• Tool Use
• Common Pitfalls
• Deployment & Regular Testing
• Debugging Strategies
• Use Cases & Success Stories
Importance of Regular Testing
• We can’t wait for users to report problems and then fix them (soft failures can go unreported for years!)
• Things just break sometimes
– Failing optics
– Somebody messed around in a patch panel and kinked a fiber
– Hardware goes bad
• Problems that get fixed have a way of coming back
– System defaults come back after hardware/software upgrades
– New employees may not know why the previous employee set things up a certain way and back out fixes
• Important to continually collect, archive, and alert on active throughput test results
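The "collect, archive, and alert" loop can be sketched as a simple comparison of the newest throughput result against a baseline built from archived results. This is a hypothetical illustration, not perfSONAR's own alerting mechanism: the function name, the 50% threshold, and the minimum sample count are assumptions, and the returned strings merely echo NAGIOS-style state names.

```python
from statistics import median

def check_throughput(history_mbps, latest_mbps, min_fraction=0.5, min_samples=5):
    """Classify the latest throughput result against archived history.

    history_mbps: earlier results for the same source/destination pair.
    min_fraction: alert if the latest result is below this share of the
                  median baseline (assumed threshold).
    """
    if len(history_mbps) < min_samples:
        return "UNKNOWN"      # not enough archive to judge
    baseline = median(history_mbps)
    if latest_mbps < min_fraction * baseline:
        return "CRITICAL"
    return "OK"

if __name__ == "__main__":
    archive = [9100, 9300, 9250, 9400, 9200]   # made-up Mbit/s values
    for result in (9000, 3100):
        print(result, check_throughput(archive, result))
```

Using the median rather than the mean keeps one earlier bad test from dragging the baseline down and hiding a regression.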
perfSONAR Dashboard: http://ps-dashboard.es.net
Regular perfSONAR Tests
• We run regular tests to check for two things
– TCP throughput
– One way delay and packet loss
• perfSONAR has mechanisms for managing regular testing between perfSONAR hosts
– Statistics collection and archiving
– Graphs
– Dashboard display
– Integrate with NAGIOS
• This infrastructure is deployed now – perfSONAR hosts at facilities can take advantage of it
• At-a-glance health check for data infrastructure
Develop a Test Plan
• What are you going to measure?
– Achievable bandwidth
  • 2-3 regional destinations
  • 4-8 important collaborators
  • 4-8 (more if you are willing, especially to start) times per day to each destination
  • 20-30 second tests within a region, longer across oceans and continents
– Loss/Availability/Latency
  • OWAMP: ~10-20 collaborators over diverse paths
– Interface Utilization & Errors (via SNMP)
• What are you going to do with the results?
– NAGIOS Alerts
– Reports to user community
– Dashboard
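A quick back-of-the-envelope check shows how little of the day such a throughput plan occupies. The sketch below takes the upper end of the ranges on the slide (3 regional plus 8 collaborator destinations, 8 tests per day, 30 seconds each); the class name, the assumption that tests never overlap, and the 10 Gbit/s link rate used for the worst-case volume are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class ThroughputPlan:
    destinations: int     # regional + collaborator test targets
    tests_per_day: int    # tests per day to each destination
    duration_s: int       # length of one test in seconds

    def seconds_per_day(self):
        """Total test time per day, assuming tests never overlap."""
        return self.destinations * self.tests_per_day * self.duration_s

    def busy_fraction(self):
        """Share of the day the tester spends running throughput tests."""
        return self.seconds_per_day() / 86400

    def worst_case_terabytes(self, link_gbps):
        """Data sent per day if every test filled the link (upper bound)."""
        return self.seconds_per_day() * link_gbps / 8 / 1000

if __name__ == "__main__":
    plan = ThroughputPlan(destinations=11, tests_per_day=8, duration_s=30)
    print(f"{plan.seconds_per_day()} s/day, "
          f"{plan.busy_fraction():.1%} of the day, "
          f"up to {plan.worst_case_terabytes(10):.1f} TB at 10 Gbit/s")
```

Under these assumptions the tester is busy for 2640 seconds, about 3% of the day, which leaves room to add destinations or test more often when first getting started.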
Host Considerations
• http://psps.perfsonar.net/toolkit/hardware.html
• Dedicated perfSONAR hardware is best
– Server class is a good choice
– Desktop/Laptop/Mini (Mac, Shuttle) can be problematic, but work in a diagnostic capacity
• Other applications will perturb results
• Separate hosts for throughput tests and latency/loss tests is preferred
– Throughput tests can cause increased latency and loss
– Latency tests on a throughput host are still useful however
• 1Gbps vs 10Gbps testers
– There are a number of problems that only show up at speeds above 1Gbps
• Virtual Machines do not always work well as perfSONAR hosts (use specific)
– Clock sync issues are a bit of a factor
– Throughput is reduced significantly for 10G hosts
– VM technology and motherboard technology have come a long way, YMMV
– NDT/NAGIOS/SNMP/1G BWCTL are good choices for a VM, OWAMP/10G BWCTL are not
perfSONAR Deployment Locations
• Critical to deploy such that you can test with useful semantics
• perfSONAR hosts allow parts of the path to be tested separately
– Reduced visibility for devices between perfSONAR hosts
– Must rely on counters or other means where perfSONAR can’t go
• Effective test methodology derived from protocol behavior
– TCP suffers much more from packet loss as latency increases
– TCP is more likely to cause loss as latency increases
– Testing should leverage this in two ways
  • Design tests so that they are likely to fail if there is a problem
  • Mimic the behavior of production traffic as much as possible
– Note: don’t design your tests to succeed
  • The point is not to “be green” even if there are problems
  • The point is to find problems when they come up so that the problems are fixed quickly
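One way to see why latency matters for test design is the bandwidth-delay product: the amount of data TCP must keep in flight to fill a path grows linearly with RTT, so longer paths mean larger windows, larger bursts into buffers, and more to recover after a loss. The sketch below is only an illustration; the 10 Gbit/s rate and the RTT values are assumed, not measurements.

```python
def bdp_bytes(bandwidth_bps, rtt_s):
    """Bandwidth-delay product: bytes in flight needed to fill the path."""
    return bandwidth_bps * rtt_s / 8

if __name__ == "__main__":
    rate = 10e9  # assumed 10 Gbit/s path
    for rtt_ms in (1, 10, 50, 100):
        mb = bdp_bytes(rate, rtt_ms / 1000) / 1e6
        print(f"RTT {rtt_ms:>3} ms: about {mb:.2f} MB in flight")
```

A test between two hosts 1 ms apart needs only about 1.25 MB in flight at this rate, so it exercises buffers and loss recovery far less than production traffic crossing a continent, which is the sense in which a short test is "designed to succeed".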
perfSONAR Outline
• Performance Introduction & Motivation
• perfSONAR Preliminaries
• Tool Use
• Deployment & Regular Testing
• Debugging Strategies
• perfSONAR Community
WAN Test Methodology – Problem Isolation
• Segment-to-segment testing is unlikely to be helpful
– TCP dynamics will be different
– Problem links can test clean over short distances
– An exception to this is hops that go through a firewall
• Run long-distance tests
– Run the longest clean test you can, then look for the shortest dirty test that includes the path of the clean test
• In order for this to work, the testers need to be deployed already when you start troubleshooting
– Internet2 has at least one perfSONAR host at each hub location.
• Many (most?) R&E providers in the world have deployed at least 1
– If your provider does not have perfSONAR deployed ask them why, and then ask when they will have it done
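The "longest clean test, shortest dirty test" rule can be sketched as a small search over test results. The model here is an assumption made for illustration: perfSONAR hosts are numbered in order along the path, each result covers the span between two host indices, and span length in hops stands in for distance (RTT would be the better measure in practice). The function and variable names are hypothetical.

```python
def isolate(results):
    """Narrow down where a problem lies from clean/dirty test results.

    results: dict mapping (start, end) host indices along the path
             (start < end) to True for a clean test, False for a dirty one.
    Returns (longest_clean, shortest_dirty, suspect_spans) or None.
    """
    def length(span):
        return span[1] - span[0]

    clean = [s for s, ok in results.items() if ok]
    dirty = [s for s, ok in results.items() if not ok]
    if not clean or not dirty:
        return None
    longest_clean = max(clean, key=length)
    # Dirty tests whose path fully includes the clean test's path.
    containing = [d for d in dirty
                  if d[0] <= longest_clean[0] and longest_clean[1] <= d[1]]
    if not containing:
        return None
    shortest_dirty = min(containing, key=length)
    # Whatever the dirty test covers beyond the clean test is suspect.
    suspects = []
    if shortest_dirty[0] < longest_clean[0]:
        suspects.append((shortest_dirty[0], longest_clean[0]))
    if longest_clean[1] < shortest_dirty[1]:
        suspects.append((longest_clean[1], shortest_dirty[1]))
    return longest_clean, shortest_dirty, suspects

if __name__ == "__main__":
    # Hypothetical hosts: 0 campus DMZ, 1 campus border, 2 regional,
    # 3 backbone, 4 lab border, 5 lab DMZ.
    observed = {(0, 5): False, (1, 5): True, (2, 5): True}
    print(isolate(observed))
```

In this made-up case the end-to-end test is dirty while the test from the campus border to the lab is clean, so attention shifts to the span between hosts 0 and 1, inside the campus.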
Network Performance TroubleshooJng Example
[Figure: a university campus and a national laboratory connected across the WAN, with poor performance between them. Each site has a border perfSONAR host and a Science DMZ perfSONAR host; links are 10GE and Nx10GE.]
Wide Area Testing – Full Context
[Figure: the same campus-to-lab path shown in full context, with border and Science DMZ perfSONAR hosts at each site and further perfSONAR hosts along the intermediate networks. Approximate segment latencies: campus ~1 msec, regional path ~2 msec, Internet2 path ~15 msec, ESnet path ~30 msec, lab ~1 msec. Links are 10GE, Nx10GE, and 100GE; end-to-end performance is poor.]
perfSONAR Outline
• Performance Introduction & Motivation
• perfSONAR Preliminaries
• Tool Use
• Deployment & Regular Testing
• Debugging Strategies
• perfSONAR Community
perfSONAR Community
• perfSONAR-PS is working to build a strong user community to support the use and development of the software.
• perfSONAR-PS Mailing Lists
– Announcement Lists:
  • https://mail.internet2.edu/wws/subrequest/perfsonar-announce
– Users List:
  • https://mail.internet2.edu/wws/subrequest/perfsonar-user
More on perfSONAR
• http://psps.perfsonar.net/
• https://code.google.com/p/perfsonar-ps/