GlobalNOC Services Update
2015 Internet2 Global Summit
Annual Report
๏ http://globalnoc.iu.edu/annual-report/2014/
4/28/15
Service Desk
๏ Welcomed ARE-ON and OSHEAN to the GlobalNOC Family
๏ Consolidated All I2 FootPrints Projects Into 1, Cutting Notifications to 1/5 of the Former Volume
๏ Grew by 4 Staff and 1 Robot
Year in Review:
Service Desk
๏ Conducted DR Exercise in Early December 2014 with Positive Result
๏ Created and Implemented a Major Incident Communication Policy
Year in Review:
Service Desk
Activity Metrics for 2014
• 1.9 million alarms/year ~ 5200/day
• 30,000 tickets created/year ~ 82/day
• 15,600 phone calls received/year ~ 43/day
• 264,000 e-mails sent and received/year ~ 720/day
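The per-day figures are the yearly totals divided by 365 and rounded. A minimal sketch of that arithmetic follows; the dictionary and function names are illustrative and not part of any GlobalNOC tool.

```python
# Yearly totals taken from the slide; per-day values are total / 365.
YEARLY_TOTALS = {
    "alarms": 1_900_000,
    "tickets created": 30_000,
    "phone calls received": 15_600,
    "e-mails sent and received": 264_000,
}


def per_day(yearly_total: int, days: int = 365) -> float:
    """Average daily count for a yearly total."""
    return yearly_total / days


if __name__ == "__main__":
    for name, total in YEARLY_TOTALS.items():
        print(f"{name}: about {per_day(total):.0f}/day")
```

The slide rounds the results to two significant figures (for example, 5200 and 720).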
Service Desk
๏ Pursuing ISO/IEC 20000 certification
• Why?
• By When?
• What Will the Net Effect Be?
Year Ahead:
2015 Priorities
2015 Focus Areas
Automation
Goal
๏ Find the worst things to do by hand. Make a machine do those things.
๏ Things that are:
• Dangerous
• Slow
• Annoying
Focus Areas
๏ Business Processes
• on-call button
• auto-assign issues
• auto-notify
• auto-discover devices in a new network
๏ Reporting
• How many times did we call an engineer?
๏ Config automation
• alerting on config drift
• generate template config for new boxes
• push & pipeline
๏ Incident Advisor
• auto-fix
• hints
• Annoying
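As an illustration of "alerting on config drift", one possible approach is to diff the intended configuration against the running one and alert when the diff is non-empty. This is a minimal sketch using Python's standard `difflib`; the function names and sample configs are hypothetical and do not describe GlobalNOC's actual tooling.

```python
import difflib


def config_drift(intended: str, running: str) -> list:
    """Return unified-diff lines between intended and running config.

    An empty list means the two configs are identical (no drift).
    """
    return list(
        difflib.unified_diff(
            intended.splitlines(),
            running.splitlines(),
            fromfile="intended",
            tofile="running",
            lineterm="",
        )
    )


def should_alert(intended: str, running: str) -> bool:
    """Alert whenever any line differs between the two configs."""
    return bool(config_drift(intended, running))


if __name__ == "__main__":
    # Hypothetical sample configs, for illustration only.
    intended = "hostname rtr1\nntp server 10.0.0.1"
    running = "hostname rtr1\nntp server 10.0.0.2"
    if should_alert(intended, running):
        print("\n".join(config_drift(intended, running)))
```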
Service Management
Goal
๏ MINIMIZE
• unplanned work
• confusion
• inconsistency
๏ Stay flexible, agile, and custom
Huh?
๏ STANDARDIZE: for processes where consistency is most important
๏ ORGANIZE: a simple lightweight structure where custom and novel work happens
2 Parts
๏ Part 1: ISO/IEC 20000 Certification
• Sparked by Internet2 effort, working to reach certification
• Aligned with ITIL
  • Incident Management
  • Change Management
  • Capacity Management
  • Availability Management
  • etc…
2 Parts
๏ Part 2: Other service-level improvements
• Service Dashboard (end users, network owners)
• Prioritize improvements
• Faster Turn-up
• Change Management
So what…
๏ It’s not good enough anymore to talk about boxes and circuits. Everything is more complicated now.
๏ We don’t deliver networks, we deliver services
๏ Requires rigor to make sure those services work, and agility to make sure those services evolve quickly
Example
๏ What’s the availability of everyone’s IP Service for Internet2?
๏ complexities:
• multiple sessions
• connectors back each other up
๏ Let’s define available!
๏ First, a service is down if packets have to be retransmitted
๏ So:
• Up = ALL BGP sessions are established, no loss known
• At Risk = at least 1 session is down, but at least one route is still in the routing table
• Down = no routes
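The three states above can be sketched as a small classifier. This is an illustration only: the function names, the inputs (a list of per-session established flags and a route count), and the availability formula (share of samples not Down) are assumptions, not GlobalNOC's reporting code. The slide does not say how to treat "all sessions up but loss known"; the sketch assumes At Risk.

```python
from enum import Enum
from typing import List


class ServiceState(Enum):
    UP = "Up"
    AT_RISK = "At Risk"
    DOWN = "Down"


def classify(sessions_established: List[bool], route_count: int,
             loss_known: bool = False) -> ServiceState:
    """Classify one sample of a connector's IP service."""
    if route_count == 0:
        # Down = no routes
        return ServiceState.DOWN
    if sessions_established and all(sessions_established) and not loss_known:
        # Up = ALL BGP sessions established, no loss known
        return ServiceState.UP
    # At Risk = some session down (or, by assumption, loss known)
    # while at least one route is still in the routing table
    return ServiceState.AT_RISK


def availability(samples: List[ServiceState]) -> float:
    """Assumed definition: fraction of samples that are not Down."""
    if not samples:
        raise ValueError("no samples")
    not_down = sum(1 for s in samples if s is not ServiceState.DOWN)
    return not_down / len(samples)


if __name__ == "__main__":
    print(classify([True, False], route_count=120).value)
```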
Data Model
[Diagram. Labels: Entity; Routed R&E Service; BGP Peering (ASN, Peer IP); BGP Routing Data (Routes, Peer State); SLA; Reporting Engine; Weekly Report]
Service Awareness
Corresponding process
[Flowchart. Steps and decisions: report generated; SLA met? (yes/no); send to NPT; outage in GRNOC control? (yes/no); recommend changes; Recommended Changes; Approve Changes? (yes/no); Published Report; Published Report with Outline of Changes. Lanes: NTP; Dir of Op; Sys; Network Owner]
Work Management
Goal
๏ Get a coherent system to manage our work
• systems
• tools
• disciplines
• processes
๏ In other words, track, prioritize, and measure everything we do.
This means
๏ For the people who do work:
• “Where do I go to see everything I’m supposed to be doing? What should I be doing first?”
๏ For the managers:
• “Are we too busy? Are we working on the right things?”
๏ For the strategic view:
• “Are we doing well/better than a year ago?”
How does work get tracked
๏ Tickets
๏ Emails
๏ Post-its
๏ Workflow records
๏ Meeting docs
๏ Many todo lists
The future
๏ Review ticketing
๏ Look at structured processes
๏ Project management
๏ Unified view of workload and results
Recruiting
Goal
๏ Make sure we have enough talented people… now and 5 years from now
Parts
๏ Attract & hire
๏ Pipeline
• Get more students in
• Improve Development
Attracting
๏ How do we attract experts that fit?
๏ Challenges
• Scary job descriptions
• People don’t know what R&E or GlobalNOC does
• Indiana - No really, it’s a nice place!
Pipeline
๏ Getting people into the pipeline
• Students have worked very well
• Summer of Networking
• How do we get more?
๏ Keeping the talent growing
• Develop people well
• Level up!
What’s New With
GlobalNOC Software?
SNAPP
๏ High performance SNMP measurement/visualization tool
๏ 3 major revisions, project began in 2002
๏ RRDtool based storage
๏ High performance SNMP data collector
๏ Web-based data browser and Web-services API
SNAPP 4 with TSDS
๏ Moving from RRDtool to a non-relational database
• “TSDS” Database based on MongoDB
• Sophisticated query language: TSQL
• Rich meta-data integrated with data. Allows for powerful queries; long-term longitudinal analysis
๏ General Time Series Data Store, not just SNMP data
• Ex. NOC activity metrics / key performance indicators; optical characteristics (light levels, loss, etc.); environmental/power data; aggregate flow data; OWAMP; BWCTL
Alertmon Improvements
๏ Alert Collapsing
• Collapse services on a host when host is not reachable
• Root cause analysis based on dependency graph allows for intelligent collapsing of alerts and suggests root cause of multiple alerts.
• Monitoring of management VPN endpoints to collapse alerts behind VPN when management network access is impaired
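One way to picture dependency-graph alert collapsing is sketched below. It is not Alertmon's implementation: the data layout (a `depends_on` mapping from each node to the nodes it relies on), the function names, and the rule "an alerting node with no alerting upstream dependency is a suggested root cause" are assumptions made for illustration, and the graph is assumed to be acyclic.

```python
from typing import Dict, Set


def upstream(node: str, depends_on: Dict[str, Set[str]]) -> Set[str]:
    """All nodes that `node` depends on, directly or transitively."""
    seen: Set[str] = set()
    stack = list(depends_on.get(node, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(depends_on.get(current, ()))
    return seen


def collapse_alerts(alerting: Set[str],
                    depends_on: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Map each suggested root cause to the alerts collapsed beneath it."""
    ups = {node: upstream(node, depends_on) for node in alerting}
    # A root cause is an alerting node none of whose dependencies alert.
    roots = {node for node in alerting if not (ups[node] & alerting)}
    collapsed: Dict[str, Set[str]] = {root: set() for root in roots}
    for node in alerting - roots:
        for root in ups[node] & roots:
            collapsed[root].add(node)
    return collapsed


if __name__ == "__main__":
    # Hypothetical topology: services depend on a host, hosts on a VPN endpoint.
    deps = {
        "http@web1": {"web1"},
        "ssh@web1": {"web1"},
        "web1": {"vpn-a"},
        "db1": {"vpn-a"},
    }
    print(collapse_alerts({"http@web1", "ssh@web1", "web1"}, deps))
```

With the host `web1` alerting, its service alerts collapse under it; if the VPN endpoint `vpn-a` also alerts, everything behind it collapses under `vpn-a` instead.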
NOAA Operations Portal
๏ High-level overview of network status
• Operational Status Map
• Performance Measurement Overview
• Operations Calendars
• Detailed data pulled from other GlobalNOC tools
๏ Multi-network aggregate views
SciPass Science DMZ
๏ Campus Networks are enterprise infrastructure
• large number of small flows
• security is a required capability
๏ not elephant flow friendly
๏ could just bypass but that doesn’t provide required security
๏ what about performance assurance?
Approach
๏ Combine
• OpenFlow Switch
• Bro
• perfSONAR
๏ create reactive system
๏ default to secure / slow path
๏ use IDS to control what goes on fast path
Reactive Bypass Performance
• 64 ms - time to detect and bypass
• 250 ms - doubled throughput of firewall
• 1.5 sec - same throughput as no firewall
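The reactive approach (default to the secure/slow path, let the IDS decide what moves to the fast path) can be sketched as a toy model. This is plain Python for illustration only; the `Flow` and `ReactiveBypass` names are hypothetical and do not correspond to the SciPass, OpenFlow, or Bro APIs.

```python
from typing import NamedTuple, Set


class Flow(NamedTuple):
    """Simplified flow key; a real system would match more fields."""
    src: str
    dst: str
    dst_port: int


class ReactiveBypass:
    """Every flow starts on the slow (firewalled) path.

    Only a positive IDS verdict installs a bypass to the fast path.
    """

    def __init__(self) -> None:
        self._fast_path: Set[Flow] = set()

    def path_for(self, flow: Flow) -> str:
        return "fast" if flow in self._fast_path else "slow"

    def on_ids_verdict(self, flow: Flow, trusted: bool) -> None:
        if trusted:
            self._fast_path.add(flow)      # install bypass
        else:
            self._fast_path.discard(flow)  # revert to the secure path


if __name__ == "__main__":
    bypass = ReactiveBypass()
    transfer = Flow("10.1.1.5", "192.0.2.20", 2811)
    print(bypass.path_for(transfer))   # before any verdict
    bypass.on_ids_verdict(transfer, trusted=True)
    print(bypass.path_for(transfer))   # after a trusted verdict
```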
Find Out More
๏ Software Page
• https://globalnoc.iu.edu/sdn/scipass.html
๏ Code Repository
• https://github.com/GlobalNOC/SciPass
๏ email
• [email protected]
• [email protected]
FlowSpace Firewall
๏ Developed in partnership with Internet2
๏ Open Source Software
๏ OpenFlow Hypervisor
• “Slice” OpenFlow 1.0 based on VLAN ID
๏ Currently running on Internet2 AL2S
๏ Other deployments growing. We’re interested in helping get FlowSpace Firewall running on your OpenFlow network
๏ More Information/Download: http://globalnoc.iu.edu/sdn/fsfw.html/