
Page 1: perfSONAR use at US LHC Facilities

March 9th 2010, LHCOPN
Eric Boyd, Deputy Technology Officer

Page 2: Outline

• Problem Statement: Distributed monitoring of US-based LHC sites
• Our Solution: pS Performance Toolkit
• User Expectations and Use Cases
• Challenges
• Supporting the infrastructure
• Specifics
• Operational recommendations
• Measurement Best Common Practices
• Success Stories
• Future Directions


Page 3: [figure-only slide; no extractable text]

Page 4: Problem Statement

• US-based LHC sites (e.g. US ATLAS Tier2s) wanted a monitoring solution to ensure peak performance between each other and the Tier1 (BNL).
• Connectivity varies by location (e.g. ESnet, Internet2, NLR, Ultralight, RENs – plus the campus infrastructure to worry about).
• Simple requirements:
  • 'Available Bandwidth' testing – scheduled and on demand
  • 'Latency' testing – scheduled and on demand
  • 'Passive' monitoring (e.g. SNMP)
  • Installation, configuration, and maintenance should be minimal
  • Homogeneous hardware and software – eliminate systematic errors in the results by keeping the platform the same
  • Solution should be nimble enough to adapt at other facilities (e.g. Tier 3s, US CMS, integration with MDM)


Page 5: Possible Solutions – 'Composite' Approach

• Recommend a suite of tools to install; provide guides/workshops
• Would require dedicated hardware – configuration and management would be up to local sites
  • Could recommend a platform
  • Could rely on local staff to choose a machine that meets criteria
• Edge Solution vs. Network Core
  • Exchange point, backbone, regional, and campus networks are working to make perfSONAR data available within the network core
  • Edge facilities, e.g. where the science data lives and is processed, can get involved in the same manner


Page 6: Possible Solutions – 'Composite' Approach

• Positive Features:
  • Relative cost is low – existing hardware can be used to run most measurement tools
  • Integration with existing tools on other networks
• Drawbacks:
  • Labor intensive for local staff to maintain the system
  • Hardware may differ from site to site
  • Development team may spend a lot of time in a 'support' role (e.g. getting the tools installed and configured)


Page 7: Possible Solutions – 'Appliance' Approach

• Prepare the entire environment for the target use
  • Uniform system makeup/hardware/versions
  • Pre-installed and pre-configured
• Centrally Managed Solution:
  • Central facility monitors the health and updates of the framework
  • Support available for software and hardware
• Locally Managed Solution:
  • Local institutions monitor daily activities
  • Software support (e.g. configuration, bug and security patches) is available
  • Regular updates anticipated to address bugs/enhancements
• Edge Solution vs. Network Core
  • As in the 'Composite' approach, the edges can still participate at a protocol level with all perfSONAR products


Page 8: Possible Solutions – 'Appliance' Approach

• Positive Features:
  • Integration with existing tools on other networks
  • Homogeneous software and hardware
  • Easy maintenance and upgrade path
• Drawbacks:
  • Costs associated with management
    • Hardware support
    • Potentially contracts to manage the software functionality and operation


Page 9: Solution – pS Performance Toolkit (pSPT)

• In short, this is a Locally Managed Appliance
• The pSPT is a bootable CD
  • Contains all necessary software in a single package
  • Wizard interface to configure aspects of the system
  • Upgrade path is simple: burn a new CD and reboot!
• Hardware is similar at all US ATLAS locations; every site has two:
  • 'KOI' 1U Server
    • Pentium 2.2 GHz, Dual Core
    • 2 GB RAM
    • 160 GB Hard Drive
• Daily operations (e.g. system maintenance and monitoring) done by the local facility
• Software support (e.g. updates, interim bug fixes, mailing list for questions) provided by the development team


Page 10: Solution – User Expectations

• Installation
  • Must be 'easy'. Given the variability in what is easy for a sysadmin vs. a physicist vs. an administrator, we aimed very low.
  • Burning a CD and rebooting a machine is as simple as it gets
• Configuration
  • Also must be 'easy'. Step-by-step instructions guide the user through configuring the system and tests.
  • Ability for power users to skip the guided approach
  • Status and feedback on the process
• Operation
  • System should work without human intervention
  • Reboots or system halts should not result in a loss of data or configuration; resume operation when back up and running
  • System should alert when in distress


Page 11: Solution – User Expectations (cont.)

• Maintenance
  • Once again: 'easy'. Most maintenance tasks, e.g. checking the disk and software, can be automated.
  • Integration into alert systems (Nagios) is in progress
  • Upgrades should be the same as installation
• Data Use
  • Way to access collected data – either through GUIs or web services
  • Easy interpretation of results, e.g. make sure all the measurements we are doing are actually useful (!)
• Support
  • Security patches and bug fixes must be made available in a timely manner
  • Method must exist to ask questions on installation, configuration, upkeep
  • Community (e.g. US ATLAS) can self-support along with help from the development team over time


Page 12: Solution – Use Cases

• US ATLAS (Scientific VO) Use Case:
  • 2 servers per facility
    • Bandwidth testing
    • Latency testing (sensitive – isolated from other measurements)
  • Configuration is a one-time (initial) task
    • Configure system information (network settings, location)
    • GUIs to guide regular test setup
    • State and measurement data saved on local disk
  • Maintenance consists of examining data and upgrading the CD when required
  • Testing is designed to occur without intervention
  • Data consumption
    • On-board GUIs to visualize the results
    • Built on the perfSONAR platform – data can be easily shared (and located in other locations) to construct new GUIs


Page 13: Solution – Use Cases (cont.)

• Regional/Campus Network Use Case:
  • Simple deployment within the core and edges of the network
  • Integrates into a perfSONAR deployment on a backbone or exchange point
• General Diagnostic Use Case:
  • Instant availability of a testing point anywhere in the network
  • Will not harm the operating system of a non-dedicated resource (e.g. one-time use)
• Remote Facility Use Case:
  • Non-technical staff can easily deploy for diagnostic purposes
  • Interval and magnitude of testing can be adjusted to account for network availability


Page 14: Solution – Supporting the Infrastructure

• perfSONAR-PS Development Team
  • ESnet, Fermilab, Indiana University, Internet2, SLAC, University of Delaware
  • Collaboration with perfSONAR-MDM to ensure protocol compatibility of all software and services
• US ATLAS Support:
  • Regular release schedule (~4 per year), on-demand releases if something goes wrong
  • Alerts on vulnerabilities, patches made in a timely manner
  • Feedback mechanism for bug reports and enhancements
  • Mailing list for discussion
    • Maintained by developers
    • Encourages building a community to answer questions and solve non-software-related problems


Page 15: Solution Details – Operations

• Installation of 2 hosts per facility (Tier 2s)
  • Latency and bandwidth hosts
  • Position near the 'edge' of the facility
  • Optional: add another host near the storage/compute nodes
• Installation of 1 'large' host (Tier 3s, RENs)
  • Position near the edge
  • Can run both bandwidth and latency tests, but results may be tainted
• 'Edge' institutional deployment options:
  • Border is good for testing connectivity to the outside world and path decomposition for problem diagnosis
  • Co-located with compute/storage is good for testing what the application will see (e.g. whether transfers travel through a firewall, etc.)


Page 16: Solution Details – Measurement BCP

• Bandwidth
  • TCP tests every 4 hours, 20 seconds in length; test to all Tier2s and the Tier1 (a scheduling sketch follows this list)
• Latency
  • One Way – constant stream of 10 packets per second; test to and from all Tier2s and the corresponding Tier1
  • Round Trip – 10 packets every 5 minutes; test to all Tier2s and the Tier1
• Passive Monitoring
  • No official stance – interest in making border router data available, as well as any links of interest
  • Currently do not use circuit status monitoring (e.g. E2Emon)
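The toolkit schedules these tests itself; purely as an illustration, the bandwidth schedule above could be driven by a few lines of Python shelling out to the bwctl client (the peer host names are hypothetical; -c names the receiving host and -t sets the test length in seconds). The one-way latency stream would use owamp's owping in the same spirit.

    import subprocess
    import time

    # Hypothetical peers: the Tier1 plus the other Tier2 measurement hosts.
    PEERS = ["bwctl.tier1.example.org", "bwctl.tier2a.example.org"]

    def run_bandwidth_tests():
        """Run one 20-second TCP throughput test (bwctl -t 20) to each peer."""
        for peer in PEERS:
            # -c <host> makes the remote side the receiver of the test.
            result = subprocess.run(["bwctl", "-c", peer, "-t", "20"],
                                    capture_output=True, text=True)
            print(peer, result.stdout)

    while True:
        run_bandwidth_tests()
        time.sleep(4 * 60 * 60)  # repeat every 4 hours, per the BCP above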


Page 17: Success Stories – Ultralight

• University of Michigan to BNL
  • Poor performance in a single direction
  • Traversed 5 networks (UofM, Ultralight, Internet2, ESnet, BNL)
  • perfSONAR available on all parts of the path (e.g. demarcation points); simplified due to infrastructure being in place
• Process (sketched in code below):
  • Test from BNL to each intermediate point
  • Test from UofM to each intermediate point
  • Isolated the problem to a single section of the Ultralight network
• Once we knew where to look, we had to figure out what to do:
  • Physical infrastructure – no damage to infrastructure found; cleaning performed for good measure
  • Hardware – line cards properly seated; no errors found
  • Software – router operating systems up to date? Any unchecked alarms?
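The isolation step of that process can be expressed in a few lines of Python. This is purely illustrative: the path labels and throughput numbers are made up, and measure() stands in for real bwctl runs between pSPT hosts.

    # Measurement points along the path, source to destination (labels only).
    PATH = ["uofm", "ultralight", "internet2", "esnet", "bnl"]
    FAULT = ("ultralight", "internet2")  # pretend this segment is degraded

    def measure(src, dst):
        """Stand-in for a bwctl run between two pSPT hosts; returns Mb/s.
        Any test whose path crosses the faulty segment sees low throughput."""
        i, j = sorted((PATH.index(src), PATH.index(dst)))
        crosses = i <= PATH.index(FAULT[0]) and PATH.index(FAULT[1]) <= j
        return 90 if crosses else 940

    def isolate(path, floor_mbps=400):
        """Test outward from each end; the fault lies between the last good
        point reachable from the source and the last good point reachable
        from the destination."""
        good_from_src = 0
        for i in range(1, len(path)):
            if measure(path[0], path[i]) < floor_mbps:
                break
            good_from_src = i
        good_from_dst = len(path) - 1
        for i in range(len(path) - 2, -1, -1):
            if measure(path[-1], path[i]) < floor_mbps:
                break
            good_from_dst = i
        return path[good_from_src], path[good_from_dst]

    print(isolate(PATH))  # -> ('ultralight', 'internet2')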


Page 18: Success Stories – Ultralight (cont.)

• Soft Failure:
  • A fault or situation that doesn't cause loss of connectivity, but will impact performance
  • May go unnoticed for long periods of time
  • May impact a select set of users
• Ultralight switch was flooded with a global routing table from a peer – caused an unchecked warning flag in the configuration
  • Limited buffer sizes – even though the switch was configured to have these be large
  • Performance tools (NDT) noticed the discrepancy
  • The fix was to upgrade the software and reboot

• Publication: http://www.internet2.edu/performance/200904-CS-UL.pdf


Page 19: Success Stories – REDDnet

• REDDnet (e.g. US CMS Tier3, Vanderbilt University)
  • Distributed data storage – equipment co-located at LHC schools and positioned near other resources (compute resources, core infrastructure)
• Observed that between many facilities, data transfer activities were taking much longer than expected (orders of magnitude slower)
• Solution was two parts:
  • Diagnostics:
    • Install tools to get a baseline of performance
    • Helped to identify where effort should be spent first
  • Regular monitoring:
    • Regular bandwidth and latency testing
    • Establish patterns – e.g. is congestion heavily influencing the performance, or is something wrong in the design?


Page 20: Success Stories – REDDnet (cont.)

• Problem Breakdown:
  • Campus network design
    • Most facilities featured firewalls where scientific traffic was treated the same as enterprise traffic
    • Performance tools were able to spot excessive queuing and dropped traffic
    • A death sentence for large data transfers – explaining the situation to campus administrators cleared this up immediately
  • Hardware limitations
    • Local administrators assigned non-capable unmanaged switches on occasion
    • Performance tools were able to spot buffering bottlenecks immediately
    • Replacing hardware is the best solution – configuring settings where applicable will also work


Page 21: Success Stories – REDDnet (cont.)

• Problem Breakdown (cont.):
  • Unchecked physical infrastructure errors
    • Latency tools detected a steady stream of loss on a given link
    • Network staff on a downstream network were contacted to view passive monitoring data (it was not available through perfSONAR)
    • CRC errors found on a dirty fiber at the demarcation of the networks
  • Hardware failure
    • Bursty loss observed on a given link; tools were able to isolate it to a single device
    • Observing the device over a 2-day period showed that processing load would spike every couple of minutes
    • The device had two power units (primary and backup); the secondary unit was not completely plugged into the wall – the power management software was constantly flapping and affecting the routing ability


Page 22: Success Stories – REDDnet (cont.)

• Problem Breakdown (cont.):
  • Host Tuning:
    • Storage and compute nodes were not performing as well as the monitoring units predicted
    • Same old story: TCP settings on the resources were not appropriate for the job at hand (a quick sanity check is sketched below)
    • See also: http://fasterdata.es.net/TCP-tuning/background.html
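By way of illustration, a small Python check in this spirit compares a Linux host's TCP buffer ceilings against a floor suitable for long, high-bandwidth paths. The 16 MB floor is an example value chosen here, not a figure quoted from fasterdata.es.net.

    # Compare Linux TCP buffer ceilings against an illustrative floor for
    # high bandwidth-delay-product paths (example value only).
    FLOOR = 16 * 1024 * 1024  # 16 MB

    def check_tcp_tuning():
        for path in ("/proc/sys/net/core/rmem_max",
                     "/proc/sys/net/core/wmem_max"):
            with open(path) as f:
                value = int(f.read().split()[0])
            verdict = "ok" if value >= FLOOR else "TOO LOW"
            print(f"{path}: {value} (want >= {FLOOR}) {verdict}")

    if __name__ == "__main__":
        check_tcp_tuning()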

• Epilogue:
  • Regular monitoring now in place – Nagios configured to give alarms when expected performance drops below a threshold (see the sketch below)
  • Data transfers working at expected levels – REDDnet is now prepared for LHC turn-on
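A threshold alarm of that sort follows the standard Nagios plugin convention of exiting 0, 1, or 2 for OK, WARNING, or CRITICAL. In this sketch the measured value arrives on the command line and the thresholds are illustrative placeholders; a real check would read the newest result from the toolkit's measurement store.

    import sys

    def check(peer, mbps, warn=600.0, crit=300.0):
        """Nagios plugin convention: exit 0=OK, 1=WARNING, 2=CRITICAL."""
        if mbps < crit:
            print(f"CRITICAL - {peer} throughput {mbps:.0f} Mb/s")
            sys.exit(2)
        if mbps < warn:
            print(f"WARNING - {peer} throughput {mbps:.0f} Mb/s")
            sys.exit(1)
        print(f"OK - {peer} throughput {mbps:.0f} Mb/s")
        sys.exit(0)

    if __name__ == "__main__":
        # e.g.  python check_bw.py bnl 450  ->  WARNING, exit code 1
        check(sys.argv[1], float(sys.argv[2]))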


Page 23: Future Directions

• Further pSPT Development
• Expanding to other scientific communities
• Circuit Monitoring
• Upcoming event:
  • NSF CISE (and OCI) sponsoring a workshop, Summer 2010
  • Bring together researchers and R&E network operators to spread perfSONAR deployment and build lasting relationships that serve as conduits between the two communities


Page 24: Future Directions – pSPT Enhancement

• Common complaints from users:
  • GUIs focused around user diagnosis, not specific tool metric display
  • Need more guidance on what must be vs. what could be configured in the wizard GUIs
  • Want to be able to install directly to a host (eliminate the CD)
  • Support for a wider range of 10G hardware (the development team has limited testing access currently)
  • Tighter integration of tooling – e.g. coordinate the latency testing with bandwidth testing so measurements do not overlap
• Integration with Nagios:
  • Process monitoring and alerting
  • Data monitoring (e.g. expected value drops below a threshold)
• Integration into logging infrastructure (syslog-ng)
• Integration of circuit (static and dynamic) monitoring


Page 25: Future Directions – pSPT Enhancement

• Roadmap for the 3.2 series (late Summer 2010):
  • Migration to a Red Hat/CentOS Live CD platform
    • Mirrors the software infrastructure of most LHC facilities
  • Re-design of the wizard interfaces
  • Nagios/logging upgrades
  • Circuit monitoring integration
  • Testing on a wider variety of hardware


Page 26: Future Directions – Other VOs

• Currently working with other VOs that anticipate performance monitoring needs:
  • LSST – telescopes
  • NEES – earthquake simulation
  • Other physics communities (Daya Bay, LIGO)
• Overall concept of the pSPT won't change
  • Other VOs have different operational requirements and capabilities
  • Specific aspects of performance may matter more (e.g. stability vs. raw bandwidth)


Page 27: Future Directions – Circuit Monitoring

• Circuit monitoring is extremely important, both in terms of static links and dynamic circuit networks
• Recent demonstrations in GLIF focused on operational aspects (difficulty of setup, what can be shown):
  • Multi-domain circuit monitoring (Fall 2008)
  • Granularity of circuits, e.g. identifying domain-specific components via information services (Fall 2009)
• Recent work in the OGF:
  • Standardizing the methods used to name and locate circuits and segments
  • Push to define dynamic circuit architecture and protocols


Page 28: Future Directions – Circuit Monitoring

• Desirable goals:
  • Define a succinct system that meets the needs of both worlds
  • Methods to share information with related systems, e.g. circuit identification must be tied to monitoring status and performance
• The perfSONAR-PS consortium is looking at these problems currently:
  • Desire to integrate into dynamic circuit (e.g. IDC protocol) operations
  • Desire to distribute functionality in a future release of the pSPT


Page 29: Conclusion

• Different ways to approach monitoring a loosely coupled VO
  • Approach is dictated by resources
  • Scalability is a large factor – not only of the software/hardware architecture, but of the human resources available for installation, testing, and maintenance as well
• Development will address the needs of the community and advance the usefulness of the tools
  • Accepting feedback to make the product better
  • Offering limited but clearly stated support
  • Educating the users is a priority – much of the confidence in a tool comes from comfort with it
• Questions?


Page 30: perfSONAR use at US LHC Facilities

March 9th 2010, LHCOPN
Eric Boyd, Deputy Technology Officer

For more information, visit http://psps.perfsonar.net/toolkit
