perfSONAR Broadening the Reach Workshop, Raleigh, NC 09/04/14 – 09/05/14 John Hicks – Network Research Engineer


perfSONAR Outline

• Performance Introduction & Motivation
• perfSONAR Preliminaries
• Tool Use
• Deployment & Regular Testing
• Debugging Strategies
• perfSONAR Community

2 – ESnet Science Engagement ([email protected]) - 9/4/14

Test and Measurement – Keeping the Network Clean

• The wide area network, the Science DMZ, and all its systems can be functioning perfectly
• Eventually something is going to break
  – Networks and systems are built with many, many components
  – Sometimes things just break – this is why we buy support contracts
• Other problems arise as well – bugs, mistakes, whatever
• We must be able to find and fix problems when they occur
• Why is this so important? Because we use TCP!


Where Are The Problems?

[Figure: end-to-end path from a source (S) on the Source Campus to a destination (D) on the Destination Campus, crossing Regional, NREN, and Backbone networks. Callouts: congested or faulty links between domains; congested intra-campus links; latency-dependent problems inside domains with small RTT.]

Soft Network Failures

• Soft failures are where basic connectivity functions, but high performance is not possible.
• TCP was intentionally designed to hide all transmission errors from the user:
  – “As long as the TCPs continue to function properly and the internet system does not become completely partitioned, no transmission errors will affect the users.” (From IEN 129, RFC 761)
• Some soft failures only affect high bandwidth long RTT flows.
• Hard failures are easy to detect & fix – soft failures can lie hidden for years!
• One network problem can often mask others
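Why a soft failure can hurt only the long-RTT flows can be illustrated with the widely cited Mathis et al. approximation for loss-limited TCP throughput, rate ≤ (MSS / RTT) × (C / √p). The sketch below is not part of the original slides; the MSS, loss rate, RTT values, and the simplification C = 1 are all illustrative assumptions.

```python
import math


def mathis_throughput_bps(mss_bytes: float, rtt_s: float, loss_rate: float, c: float = 1.0) -> float:
    """Approximate upper bound on steady-state TCP throughput in bits/s.

    Uses the Mathis approximation: rate <= (MSS / RTT) * (C / sqrt(p)).
    """
    if rtt_s <= 0 or not (0 < loss_rate <= 1):
        raise ValueError("rtt_s must be > 0 and loss_rate must be in (0, 1]")
    return (mss_bytes * 8 / rtt_s) * (c / math.sqrt(loss_rate))


if __name__ == "__main__":
    mss = 1460    # bytes; typical for a 1500-byte MTU (assumption)
    loss = 1e-4   # 0.01% packet loss (assumption)
    # Hypothetical path classes with illustrative round-trip times.
    for label, rtt_ms in [("campus", 1), ("regional", 10), ("cross-country", 50), ("intercontinental", 150)]:
        bps = mathis_throughput_bps(mss, rtt_ms / 1000, loss)
        print(f"{label:17s} RTT {rtt_ms:4d} ms -> at most {bps / 1e6:8.1f} Mbit/s")
```

With the same small loss rate, the bound shrinks in proportion to RTT, so a flaw that is invisible on a 1 ms campus path can cap a 50 ms path at a small fraction of line rate.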


Network Monitoring

• All networks do some form of monitoring.
• Addresses needs of local staff for understanding state of the network
  o Would this information be useful to external users?
  o Can these tools function on a multi-domain basis?
• Beyond passive methods, there are active tools.
  o E.g. often we want a ‘throughput’ number. Can we automate that idea?
  o Wouldn’t it be nice to get some sort of plot of performance over the course of a day? Week? Year? Multiple endpoints?
• perfSONAR = Measurement Middleware


perfSONAR Outline

• Performance Introduction & Motivation
• perfSONAR Preliminaries
• Tool Use
• Deployment & Regular Testing
• Debugging Strategies
• perfSONAR Community


perfSONAR

• All the previous Science DMZ network diagrams have little perfSONAR boxes everywhere
  – The reason for this is that consistent behavior requires correctness
  – Correctness requires the ability to find and fix problems


[Figure: Science DMZ network diagram. Labeled elements: WAN, Border Router, Science DMZ Switch/Router, Enterprise Border Router/Firewall, Site / Campus LAN, and a high performance Data Transfer Node with high-speed storage. Annotations: clean, high-bandwidth WAN path; site / campus access to Science DMZ resources; per-service security policy control points. Links are labeled 10GE and 10G, and three perfSONAR hosts are marked.]

• You can’t fix what you can’t find
• You can’t find what you can’t see
• perfSONAR lets you see
• Especially important when deploying high performance services
  – If there is a problem with the infrastructure, need to fix it
  – If the problem is not with your stuff, need to prove it
• Many players in an end to end path
• Ability to show correct behavior aids in problem localization

What is perfSONAR?

• perfSONAR is a tool to:
  • Set network performance expectations
  • Find network problems (“soft failures”)
  • Help fix these problems
  • All in multi-domain environments
• These problems are all harder when multiple networks are involved
• perfSONAR provides a standard way to publish active and passive monitoring data
  – This data is interesting to network researchers as well as network operators


perfSONAR Toolkit

• The “perfSONAR Toolkit” is an open source implementation and packaging of the perfSONAR measurement infrastructure and protocols from ESnet and Internet2
• http://psps.perfsonar.net/toolkit
• All components are available as RPMs, and bundled into a CentOS 6-based “netinstall” and a “Live CD”
• perfSONAR tools are much more accurate if run on a dedicated perfSONAR host, not on the DTN
• Very easy to install and configure
• Usually takes less than 30 minutes


Toolkit Use Case

• The general use case is to establish some set of tests to other locations/facilities
• To answer the what/why questions:
  – Regular testing with select tools helps to establish patterns – how much bandwidth we would see during the course of the day – or when packet loss appears
  – We do this to ‘points of interest’ to see how well a real activity (e.g. Globus transfer) would do.
• If performance is ‘bad’, don’t expect much from the data movement tool


Deployment By The Numbers

• Last updated August 2014. Adoption trend increases with each release. CC-NIE and innovation platform helped as well.


http://stats.es.net/ServicesDirectory/ - 1200+ as of August 2014


Transition

• perfSONAR interface is meant to be simple (e.g. so easy even an Engineer Scientist CIO could do it)
• Enabling this on campus is the first step to seeing a simulation of performance for a bulk data tool. Ideally you would place the perfSONAR server where the users are (e.g. if they are still traversing a firewall, why don’t you learn their pain?)
• Configuring regular tests is systematic – pick regional and far away destinations.
• Dust off netflow, and see where the data is going – configure tests to those locations too.

The Metrics

• Use the correct tool for the job
  – To determine the correct tool, maybe we need to start with what we want to accomplish …
• What do we care about measuring?
  – Packet Loss, Duplication, out-of-orderness (transport layer)
  – Achievable Bandwidth (e.g. “Throughput”)
  – Latency (Round Trip and One Way)
  – Jitter (Delay variation)
  – Interface Utilization/Discards/Errors (network layer)
  – Traveled Route
  – MTU Feedback
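Several of these metrics (loss, duplication, out-of-orderness, delay variation) can be derived from a stream of numbered, timestamped probe packets. The sketch below is an illustration only, not how any perfSONAR tool is implemented; the function names and the sample data are hypothetical, and "delay variation" here is just the mean absolute difference between consecutive delays, one of several possible definitions.

```python
from statistics import mean


def summarize_probe_stream(sent_count, received_seqs):
    """Count lost, duplicated, and reordered probes from arrival-ordered sequence numbers."""
    if sent_count <= 0:
        raise ValueError("sent_count must be positive")
    seen = set()
    duplicates = 0
    reordered = 0
    highest = -1
    for seq in received_seqs:
        if seq in seen:
            duplicates += 1          # same sequence number arrived again
            continue
        seen.add(seq)
        if seq < highest:
            reordered += 1           # arrived after a later-numbered probe
        else:
            highest = seq
    lost = sent_count - len(seen)
    return {
        "lost": lost,
        "loss_rate": lost / sent_count,
        "duplicates": duplicates,
        "reordered": reordered,
    }


def delay_variation_ms(delays_ms):
    """Mean absolute difference between consecutive delay samples (a simple jitter estimate)."""
    if len(delays_ms) < 2:
        return 0.0
    return mean(abs(b - a) for a, b in zip(delays_ms, delays_ms[1:]))


if __name__ == "__main__":
    # Hypothetical arrivals: probes 5 and 8 never arrive, 4 arrives twice, 2 arrives after 3.
    print(summarize_probe_stream(10, [0, 1, 3, 2, 4, 4, 6, 7, 9]))
    print(delay_variation_ms([10.0, 12.0, 11.0, 15.0]))
```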

perfSONAR Outline

• Performance Introduction & Motivation
• perfSONAR Preliminaries
• Hands On
• Tool Use
• Common Pitfalls
• Deployment & Regular Testing
• Debugging Strategies
• Use Cases & Success Stories


Importance of Regular Testing

• We can’t wait for users to report problems and then fix them (soft failures can go unreported for years!)
• Things just break sometimes
  – Failing optics
  – Somebody messed around in a patch panel and kinked a fiber
  – Hardware goes bad
• Problems that get fixed have a way of coming back
  – System defaults come back after hardware/software upgrades
  – New employees may not know why the previous employee set things up a certain way and back out fixes
• Important to continually collect, archive, and alert on active throughput test results
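The "collect, archive, and alert" loop can be sketched as a comparison of the newest throughput result against a baseline built from archived results. This is a minimal illustration, not the perfSONAR or NAGIOS alerting logic; the function name, the median baseline, and the 50% threshold are assumptions chosen for the example.

```python
from statistics import median


def throughput_alert(history_mbps, latest_mbps, min_fraction=0.5, min_samples=5):
    """Return an alert string if the latest result falls well below the archived baseline.

    Returns None when the result looks normal or when there is too little
    archived data to set an expectation.
    """
    if len(history_mbps) < min_samples:
        return None
    baseline = median(history_mbps)
    if latest_mbps < min_fraction * baseline:
        return (
            f"ALERT: {latest_mbps:.0f} Mbit/s is below "
            f"{min_fraction:.0%} of baseline {baseline:.0f} Mbit/s"
        )
    return None


if __name__ == "__main__":
    archive = [9200, 9400, 9100, 9300, 9350]   # hypothetical archived test results
    print(throughput_alert(archive, 9000))      # normal result, no alert
    print(throughput_alert(archive, 900))       # degraded result, alert
```

A median baseline is used here so that one earlier bad test in the archive does not lower the expectation.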


perfSONAR Dashboard: http://ps-dashboard.es.net


Regular perfSONAR Tests

• We run regular tests to check for two things
  – TCP throughput
  – One way delay and packet loss
• perfSONAR has mechanisms for managing regular testing between perfSONAR hosts
  – Statistics collection and archiving
  – Graphs
  – Dashboard display
  – Integrate with NAGIOS
• This infrastructure is deployed now – perfSONAR hosts at facilities can take advantage of it
• At-a-glance health check for data infrastructure


Develop a Test Plan

• What are you going to measure?
  – Achievable bandwidth
    • 2-3 regional destinations
    • 4-8 important collaborators
    • 4-8 (more if you are willing, especially to start) times per day to each destination
    • 20-30 second tests within a region, longer across oceans and continents
  – Loss/Availability/Latency
    • OWAMP: ~10-20 collaborators over diverse paths
  – Interface Utilization & Errors (via SNMP)
• What are you going to do with the results?
  – NAGIOS Alerts
  – Reports to user community
  – Dashboard
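A quick way to sanity-check a throughput test plan is to add up how many seconds per day the test host would spend sending. The sketch below plugs in the low and high ends of the ranges from the plan above; the class and field names are hypothetical, and it counts only tests this host initiates, ignoring longer intercontinental tests and tests started by remote hosts.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 24 * 60 * 60


@dataclass
class ThroughputPlan:
    regional_destinations: int
    collaborator_destinations: int
    tests_per_day: int        # per destination
    test_duration_s: int

    def busy_seconds_per_day(self) -> int:
        """Total seconds per day spent running throughput tests."""
        destinations = self.regional_destinations + self.collaborator_destinations
        return destinations * self.tests_per_day * self.test_duration_s

    def busy_fraction(self) -> float:
        """Share of the day the host is occupied by its own throughput tests."""
        return self.busy_seconds_per_day() / SECONDS_PER_DAY


if __name__ == "__main__":
    light = ThroughputPlan(2, 4, 4, 20)   # low end of each range
    heavy = ThroughputPlan(3, 8, 8, 30)   # high end of each range
    for name, plan in [("light", light), ("heavy", heavy)]:
        print(f"{name}: {plan.busy_seconds_per_day()} s/day ({plan.busy_fraction():.1%} of the day)")
```

Even the high end of the ranges keeps the host busy for only a few percent of the day, which leaves room for on-demand diagnostic tests.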


Host Considerations

• http://psps.perfsonar.net/toolkit/hardware.html
• Dedicated perfSONAR hardware is best
  – Server class is a good choice
  – Desktop/Laptop/Mini (Mac, Shuttle) can be problematic, but work in a diagnostic capacity
• Other applications will perturb results
• Separate hosts for throughput tests and latency/loss tests are preferred
  – Throughput tests can cause increased latency and loss
  – Latency tests on a throughput host are still useful however
• 1Gbps vs 10Gbps testers
  – There are a number of problems that only show up at speeds above 1Gbps
• Virtual Machines do not always work well as perfSONAR hosts (use specific)
  – Clock sync issues are a bit of a factor
  – Throughput is reduced significantly for 10G hosts
  – VM technology and motherboard technology have come a long way, YMMV
  – NDT/NAGIOS/SNMP/1G BWCTL are good choices for a VM, OWAMP/10G BWCTL are not
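Why clock synchronization matters for one-way measurements can be shown with simple arithmetic: a one-way delay is the receiver's timestamp minus the sender's timestamp, so any offset between the two clocks lands directly in the result. The sketch below is an illustration with a hypothetical function name and made-up numbers, not a model of how OWAMP handles time.

```python
def measured_one_way_delays_ms(true_fwd_ms, true_rev_ms, receiver_clock_offset_ms):
    """Return (forward, reverse) one-way delays as they would be measured.

    receiver_clock_offset_ms is how far host B's clock runs ahead of host A's.
    Forward (A -> B) is inflated by the offset; reverse (B -> A) is reduced by it.
    """
    fwd = true_fwd_ms + receiver_clock_offset_ms
    rev = true_rev_ms - receiver_clock_offset_ms
    return fwd, rev


if __name__ == "__main__":
    # Hypothetical symmetric path, 10 ms each way, with B's clock 3 ms ahead.
    fwd, rev = measured_one_way_delays_ms(10.0, 10.0, 3.0)
    print(f"forward {fwd} ms, reverse {rev} ms, sum {fwd + rev} ms")
```

The offset cancels in the sum, which is why round-trip measurements tolerate poor clocks while one-way delay measurements on a host with an unstable clock (for example some VMs) do not.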


perfSONAR Deployment Locations

• Critical to deploy such that you can test with useful semantics
• perfSONAR hosts allow parts of the path to be tested separately
  – Reduced visibility for devices between perfSONAR hosts
  – Must rely on counters or other means where perfSONAR can’t go
• Effective test methodology derived from protocol behavior
  – TCP suffers much more from packet loss as latency increases
  – TCP is more likely to cause loss as latency increases
  – Testing should leverage this in two ways
    • Design tests so that they are likely to fail if there is a problem
    • Mimic the behavior of production traffic as much as possible
  – Note: don’t design your tests to succeed
    • The point is not to “be green” even if there are problems
    • The point is to find problems when they come up so that the problems are fixed quickly
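One way to see why TCP suffers more from loss as latency increases: under classic Reno-style congestion avoidance (an assumption here; newer algorithms such as CUBIC recover faster), a single loss halves the congestion window, which then grows back by about one segment per round trip. The time to return to full rate therefore scales with the square of the RTT. A rough sketch with illustrative link speed and MSS values:

```python
def reno_recovery_time_s(link_bps: float, rtt_s: float, mss_bytes: int = 1460) -> float:
    """Approximate time to regain full rate after one loss under Reno-style AIMD.

    The full window is the bandwidth-delay product in segments; after a loss the
    window is halved and regrows by one segment per RTT.
    """
    window_segments = link_bps * rtt_s / (mss_bytes * 8)
    return (window_segments / 2) * rtt_s


if __name__ == "__main__":
    link = 10e9  # 10 Gbit/s path (assumption)
    for rtt_ms in (1, 10, 50):
        t = reno_recovery_time_s(link, rtt_ms / 1000)
        print(f"RTT {rtt_ms:3d} ms -> about {t:8.1f} s to recover from a single loss")
```

Under these assumptions a loss costs well under a second at 1 ms RTT but many minutes at 50 ms, which is why a long-distance test is far more likely to fail when a problem exists.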


Sample Site Deployment


perfSONAR Outline

• Performance Introduction & Motivation
• perfSONAR Preliminaries
• Tool Use
• Deployment & Regular Testing
• Debugging Strategies
• perfSONAR Community


WAN Test Methodology – Problem Isolation

• Segment-to-segment testing is unlikely to be helpful
  – TCP dynamics will be different
  – Problem links can test clean over short distances
  – An exception to this is hops that go through a firewall
• Run long-distance tests
  – Run the longest clean test you can, then look for the shortest dirty test that includes the path of the clean test
• In order for this to work, the testers need to be already deployed when you start troubleshooting
  – Internet2 has at least one perfSONAR host at each hub location.
• Many (most?) R&E providers in the world have deployed at least 1
  – If your provider does not have perfSONAR deployed ask them why, and then ask when they will have it done
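The "longest clean test, shortest dirty test" rule can be sketched as a small search over test results between perfSONAR hosts along one path. This is an illustration under simplifying assumptions (a single linear path, each test marked simply clean or dirty); the function name and host names are hypothetical.

```python
def localize(path, results):
    """Suggest suspect hops from clean/dirty test results.

    path:    ordered list of perfSONAR host names along the end-to-end path.
    results: dict mapping (src, dst) host-name pairs to True (clean) or False (dirty).
    Returns a list of (host, next_host) hops covered by the shortest enclosing
    dirty test but not by the longest clean test, or None if no such pair exists.
    """
    index = {name: i for i, name in enumerate(path)}
    spans = []
    for (src, dst), clean in results.items():
        lo, hi = sorted((index[src], index[dst]))
        spans.append((lo, hi, clean))

    clean_spans = [(lo, hi) for lo, hi, clean in spans if clean]
    if not clean_spans:
        return None
    c_lo, c_hi = max(clean_spans, key=lambda s: s[1] - s[0])   # longest clean test

    enclosing_dirty = [
        (lo, hi) for lo, hi, clean in spans
        if not clean and lo <= c_lo and hi >= c_hi
    ]
    if not enclosing_dirty:
        return None
    d_lo, d_hi = min(enclosing_dirty, key=lambda s: s[1] - s[0])  # shortest dirty test

    return [(path[i], path[i + 1]) for i in range(d_lo, d_hi) if not (c_lo <= i < c_hi)]


if __name__ == "__main__":
    path = ["campus-dmz", "campus-border", "regional", "backbone", "lab-border", "lab-dmz"]
    results = {
        ("campus-dmz", "lab-dmz"): False,      # end-to-end test is dirty
        ("campus-border", "lab-dmz"): True,    # long test from the border is clean
        ("campus-dmz", "backbone"): False,
    }
    print(localize(path, results))
```

In this made-up case the only hop covered by the dirty end-to-end test but not by the clean test is the one between the campus Science DMZ host and the campus border host, so that is where to look first.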


Network Performance Troubleshooting Example

[Figure: a University Campus and a National Laboratory connected across the WAN over 10GE and Nx10GE links. Each site has a Border perfSONAR host and a Science DMZ perfSONAR host. Poor performance is observed between the two sites.]


Wide Area Testing – Full Context

[Figure: the wide area path between the Campus and the Lab shown in full context, with Border and Science DMZ perfSONAR hosts at the sites and additional perfSONAR hosts along the intermediate networks. Links are labeled 10GE, Nx10GE, and 100GE. Approximate latency per segment: Campus ~1 msec, Regional Path ~2 msec, Internet2 path ~15 msec, ESnet path ~30 msec, Lab ~1 msec. Poor performance is observed end to end.]


perfSONAR Outline

• Performance Introduction & Motivation
• perfSONAR Preliminaries
• Tool Use
• Deployment & Regular Testing
• Debugging Strategies
• perfSONAR Community


perfSONAR Community

• perfSONAR-PS is working to build a strong user community to support the use and development of the software.
• perfSONAR-PS Mailing Lists
  – Announcement Lists:
    • https://mail.internet2.edu/wws/subrequest/perfsonar-announce
  – Users List:
    • https://mail.internet2.edu/wws/subrequest/perfsonar-user


More on perfSONAR

• http://psps.perfsonar.net/
• https://code.google.com/p/perfsonar-ps/
