
Addressing the “things that go bump in the net” – perfSONAR/DYNES/LHCONE

March 20th 2012, OSG/ATLAS/CMS
Jason Zurawski, Internet2 Research Liaison
zurawski@internet2.edu

•  Current Networking
   –  perfSONAR Status (ATLAS, CMS, LHCOPN, LHCONE)
   –  Reaching for the Brass Ring (why we monitor)

•  Future Networking
   –  DYNES
   –  LHCONE

2 – 3/19/12, © 2012 Internet2

Agenda

•  “In any large system, there's always something broken.”

•  Networks are large and complex. There are multiple “layers”, and we employ experts with knowledge of specific parts just to keep things running
   –  Anything that can give an expert (or a layman) more insight into what is really going on is very valuable

A Jon Postel quote

•  Everyone should be familiar with what perfSONAR is about; this talk is not about that
   –  Mentioned in the “2013 NITRD Program Supplement to the President's Budget” (page 50): http://www.nitrd.gov/PUBS%5C2013supplement%5CFY13NITRDSupplement.pdf

“Why?”

•  If you are not running it, this is not a sales pitch
•  Things I will highlight:
   –  It is being used widely
   –  It is finding problems

•  US ATLAS
   –  All Tier2s and Tier1 upgrading to new Dell R310/R610 (available as ‘perfsonar node’ in the portal)
   –  Dashboard: https://perfsonar.usatlas.bnl.gov:8443/exda/?page=25&cloudName=USATLAS
   –  Other non-US clouds (Canada, Japan, Italy) coming up as well

•  CMS
   –  All Tier2s (and Tier1) have monitoring in place. Should be testing to each other

perfSONAR-PS Status

•  LHCOPN
   –  All Tier1s and Tier0 have machines in place with tests in place
   –  Dashboard: https://perfsonar.usatlas.bnl.gov:8443/exda/?page=25&cloudName=LHCOPN

•  LHCONE
   –  16 sites are being monitored as a part of the LHCONE Arch prototype phase. Some are fully configured, others are not (work in progress – Shawn is leading this).
   –  Dashboard: https://perfsonar.usatlas.bnl.gov:8443/exda/?page=25&cloudName=LHCONE

perfSONAR-PS Status – cont.

•  Current release – 3.2.1.1
   –  Expect a 3.2.2 in mid 2012
   –  Bugfixes for the most part, no real ‘new’ features
   –  http://psps.perfsonar.net/toolkit

•  Items on the longer list:
   –  Controlling an entire deployment instead of an individual island (N.B. some are exploring CFEngine and the like in this space)
   –  Integrating the tools into a more portable dashboard (basing this heavily on the work by BNL)
   –  Bottom line – lots to do, little time and resources to do it (but this isn’t news)

perfSONAR-PS Software


•  Network monitoring is:
   –  A way to pick out problems (packet loss, congestion, routing changes, low throughput)
   –  Used by operators to find problems before the users (you) find them
   –  Used by users (you) to keep the operators honest

•  Network monitoring isn’t:
   –  An instant way to solve said problems. It will tell you ‘what’; it won’t tell you ‘how’ or ‘why’ without spending some time on the problem
   –  Automatic. There is some work that needs to be put in by all levels (operators, VOs, etc.)

“Why Care/Devote Resources?”

•  “The Network is Slow”
   –  Yes, it’s OK to say this. Don’t overdo it though (e.g. complaining at getting 8.5 Gbps when you got 9.3 Gbps yesterday), and try to show evidence when you do say it (e.g. your graphs)

•  Looking at the regular data (and alarming on it)
   –  ATLAS, LHCOPN, etc. have the regular tests for this exact reason
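The advice on this slide (compare a new result against the regular tests before calling the network slow) can be sketched as a simple baseline check. This is a minimal illustration, not how the ATLAS or LHCOPN dashboards actually alarm: the function name `throughput_alarm`, the 25% drop threshold, and the sample history are all assumptions made for this example.

```python
from statistics import median


def throughput_alarm(history_gbps, latest_gbps, drop_fraction=0.25):
    """Return True when the latest result falls well below the recent baseline.

    history_gbps  -- earlier regular test results in Gbps (hypothetical data)
    latest_gbps   -- the newest test result in Gbps
    drop_fraction -- how far below the median counts as a problem (assumed 25%)
    """
    if not history_gbps:
        return False  # nothing to compare against yet
    baseline = median(history_gbps)
    return latest_gbps < baseline * (1.0 - drop_fraction)


# Hypothetical week of regular tests on a 10G path.
history = [9.3, 9.2, 9.4, 9.3, 9.1]

print(throughput_alarm(history, 8.5))  # small dip, not worth a complaint
print(throughput_alarm(history, 4.0))  # large drop, worth reporting with graphs
```

With these sample numbers the median is 9.3 Gbps, so only results below roughly 7 Gbps raise the alarm; the 8.5 Gbps case from the slide stays quiet.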

Anatomy of a Problem

•  One-off tests
   –  Log on to the boxes (it’s easy, just like any other Linux machine) and run some tests. Don’t know how? Ask!

•  Escalation
   –  You can escalate when you are in over your head. ESnet/Internet2 are here to help.
   –  Also – talk to your local IT people so they are aware. They don’t bite.

•  Waiting (is the hardest part)
   –  Debugging sucks.
   –  It takes a long time.
   –  It involves multiple parties (this is what makes it take longer)

Anatomy of a Problem – cont.

•  1 of the Transatlantic Link Pairs (New York to Amsterdam)
•  Performance bad in one direction (from the EU to the US).
   –  No problems seen in the other (US to EU) direction.
   –  Common issue: downloaders (e.g. people not in your network) see a problem vs uploaders (people in your network).

•  Depending on who/where you are, this may not be an issue for you:
   –  US sites ‘downloading’ from EU may see this
   –  EU sites that use the NLR routes to reach locations in the US will be affected (NLR uses AMS->NEWY route exclusively)
   –  EU sites that use the Internet2/ESnet routes through Amsterdam to NY to reach US sites will be affected. If the EU site uses FRANK->WASH to reach US, there will be no problem.

Current Problem (some of you know this)

•  It was actually – GEANT, Internet2, and ESnet commissioned regular inter-domain testing between the networks in late 2011
•  Reports came in in late 2011
•  The hard part(s):
   –  Debugging
   –  Passive vs Active
   –  LHCONE

Why wasn’t this caught?

A Basic Topology

•  All of the major networks show up at MANLAN XP
•  Recent upgrade to switching fabric
•  Major R&E path to Europe is ACE (America Connects to Europe) IRNC Link
   –  2x10G LAGed circuit
•  GEANT Amsterdam Exchange feeds into other networks (GEANT, SURFnet, etc.)

An even better topology

•  TA circuits are SONET. Ciena CD and Alcatel terminate these on either end
•  Switching/routing fabric is connected to these two devices to support more connections (10G Ethernet for the most part)

Where we are spending time right now

•  Narrowed the problem as much as possible. Testers on Internet2/GEANT are 1 hop off of the switching fabric on either end (and we still see loss)
•  Is this a buffering issue? Is this a protocol issue? Is this an equipment fabric issue?

1st test – interface swapping @ MANLAN

•  Configuration change on Ciena (MANLAN) side to verify this device
•  Blast through a set number of packets, make sure in and out packet counters agree
   –  They did…

2nd – interface swapping @ MANLAN

•  Configuration change on Brocade (MANLAN) side to verify this device
•  Blast through a set number of packets, make sure in and out packet counters agree
   –  They did…
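The counter check used in both interface-swapping tests can be sketched as below. This is only an illustration of the idea: the function `check_burst`, its argument names, and the sample numbers are hypothetical, and it assumes the test burst is the only traffic crossing the interface while the counters are read.

```python
def check_burst(sent, device_in, device_out, received):
    """Compare packet counts for one fixed-size test burst.

    sent       -- packets the tester transmitted (hypothetical input)
    device_in  -- increase in the device's input counter during the burst
    device_out -- increase in the device's output counter during the burst
    received   -- packets the far-end tester actually saw
    """
    return {
        "device_counters_agree": device_in == device_out,
        "dropped_in_device": device_in - device_out,
        "lost_end_to_end": sent - received,
    }


# Counters agree, yet the far end is still missing packets, so the
# device under test is probably not where the loss happens.
result = check_burst(sent=1_000_000, device_in=1_000_000,
                     device_out=1_000_000, received=999_400)
print(result)
```

A result like this matches the situation on the slides: each device passes its own counter check while the end-to-end measurement still shows loss, which is why the search moved on to buffering and encapsulation.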

3rd – It’s the buffering, stupid

•  All of these devices are functioning at 10 Gbps line rates
•  Ethernet, SONET, and WAN-PHY do have minor speed differences
   –  A burst of packets on an input could overdrive an output.
   –  There needs to be enough buffering to cover these cases
   –  Input vs output have different queues
•  Buffering was increased to the max – around 32K (yes, this doesn’t sound like a lot, and it’s not. Enough to handle a couple of frames only…)
   –  It did reduce the loss percentage
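A rough back-of-the-envelope check shows why a buffer of this size is small. The slide does not give units for “32K”; the sketch below assumes 32 KiB (32768 bytes), and the frame sizes and the 10.0 vs 9.5 Gbps rate mismatch are illustrative values, not measurements from the MANLAN equipment.

```python
def frames_in_buffer(buffer_bytes, frame_bytes):
    """Whole frames of a given size that fit in the buffer."""
    return buffer_bytes // frame_bytes


def time_to_fill_us(buffer_bytes, in_gbps, out_gbps):
    """Microseconds until the buffer overflows when input outpaces output."""
    excess_bps = (in_gbps - out_gbps) * 1e9
    if excess_bps <= 0:
        return float("inf")  # output keeps up, the buffer never fills
    return buffer_bytes * 8 / excess_bps * 1e6


BUFFER_BYTES = 32 * 1024  # assumption: "32K" means 32 KiB

print(frames_in_buffer(BUFFER_BYTES, 1500))  # → 21 standard-size frames
print(frames_in_buffer(BUFFER_BYTES, 9000))  # → 3 jumbo frames

# Illustrative mismatch: a 10.0 Gbps burst feeding a 9.5 Gbps output.
print(time_to_fill_us(BUFFER_BYTES, 10.0, 9.5))
```

Under these assumptions the buffer holds only a handful of jumbo frames and fills in roughly half a millisecond of sustained overdrive, which is consistent with the slide's remark that the increase reduced the loss but did not eliminate it.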

New testing (~1 week from now)

•  Protocol encapsulation is tricky
   –  Ethernet frame is shoved into a SONET frame for transit
   –  WAN-PHY (a form of Ethernet w/ extra encapsulation) would be in the same boat
   –  Is the translation getting garbled? Note that some devices will happily pass a bad packet on a given layer, and as it gets handed back up, error correction will reject it.
•  Testing these theories is a bit invasive, so it’s taking a little time to schedule

•  Test Coverage - B+
   –  Internet2, ESnet, GEANT, and the experiments all have testers available
   –  Some of the GEANT testers are limited in functionality

•  “Reportability” - D
   –  I took the role of ‘user’ this time. My ticket was closed 3 (!) times:
      •  The day after I opened it, because there were no counters reporting loss. It was re-opened after I complained they had to “try harder”
      •  1 week later, after testing in MANLAN revealed no issues (I was told to “go ask someone else”). It was re-opened after I noted the problem is not solved from a “user” perspective
      •  1 week after that, when I was told “open tickets count against the engineer assigned” [maybe they are not fed that day?]. I let it be closed this time, and dealt with my tickets in other systems
   –  This is something that needs to be fixed

Where it worked, where it isn’t working

•  NOC to Customer Interactions - C-
   –  NOC treated report with skepticism. Calling it ‘my’ packet loss (e.g. they don’t trust the measurement tools, and look to the passive counters as the law of the land)
   –  I had to escalate this into management to keep things ‘open’. Strong desire to close tickets that are viewed as ‘not my problem’. There is no home for the homeless…

•  NOC to NOC Interactions - B-
   –  NOCs coordinate resources well, but timelines to find a fix are slow. A downtime of 5 minutes is scheduled a full 2 weeks out, and only after approval at high levels

•  Getting a resolution - Incomplete
   –  More testing is needed/is expected.
   –  This is a very challenging problem, and the time it has taken to solve reflects this (e.g. no clear sign of packet loss on devices, but applications react poorly).

Where it worked, where it isn’t working

•  Jason
   –  Still trying to update ATLAS/CMS when I hear news
   –  Staying on top of them to get this fixed (there are still some that deny this exists)

•  US ATLAS Throughput Group
   –  Thinking about the process to recommend for the end scientist/site to report issues in a trackable manner
   –  “Customers” to the networks, use that relationship when possible

•  Networks
   –  Do a better job of coordinating resources and responding to problems

Actions


•  What is it – read the content here if you need to: http://www.internet2.edu/dynes

•  Basic idea:
   –  Provide hardware and Open Source software to address data intensive science on campuses
      •  Switch, data movement server, controller PC for hardware
      •  FDT, OSCARS, and perfSONAR for software
      •  Goal is to encourage campuses to create a research grade network (e.g. the ‘science dmz’ - http://fasterdata.es.net/fasterdata/science-dmz/)
   –  Can’t provide raw capacity, but is a tool to manage existing capacity
      •  Layer 2 networking (e.g. dynamic capacity – possibility of bandwidth guarantees)
      •  End to end ‘circuit’ capabilities (e.g. protected VLANs)

Updates on DYNES

Campus Nets – Clogging Ur Tubes

Campus Nets – What about Science?

Campus Nets w/ DYNES Vision

[Figure slides; surviving labels: “Internets”, “Encapsulated Layer 2 (MPLS)”]

Updates on DYNES – cont.

•  Status (see web for more details):
   –  Group A (~9 sites), deployed and working
   –  Group B (~11 sites), deployed, and starting to come online
   –  Group C (~14 sites), ordered and being configured, deployment in the next month
   –  We have funding left if you are not connected, and are still interested

•  Related Work:
   –  Working w/ AMPATH and RNP in Brazil to connect OSCARS circuits to research facilities (e.g. SPRACE). Demos were done last year and were successful.
   –  Early talks with LSST (telescope in Chile) to support management of data flows approaching 80 Gbps in 2020
   –  Early talks with Globus Online to integrate support into this tool to reach DYNES sites using OSCARS and traditional IP networking

Updates on DYNES – cont.

•  Where do we go from here?
•  Applications
   –  FDT is integrated and can use the APIs to use Layer 2 technologies (OSCARS/ION + maybe someday soon ‘OpenFlow’)
   –  What about PhEDEx/DQ2 directly?
   –  FTS (since this is the scheduling bit under the data movers)
   –  What about the underlying OSG tools?
      •  Which ones make sense, SRM? Others?
   –  Why integrate an application?
      •  Layer 2 technologies are ‘HOT/FAST Lane’ compared to campus IP. Can give you a direct path to the Campus WAN and through the regional network (congestion free)
      •  IP connectivity may ‘work’, but it’s hard to manage end to end (especially for TCP)
      •  Data movers that can take advantage of this are more likely to get resources in constrained environments

DYNES Open Questions/Next Steps

•  Network
   –  LHCONE (see next) will have support for Layer 2 services
   –  Regionals/Campuses in the US are being invited to participate in Layer 2 networks
      •  DYNES via Internet2 ION/ESnet SDN, etc.
      •  OpenFlow is gaining a lot of traction

•  Vision (being implemented by some already): intelligent applications that make the choice for the user.
   –  Don’t have to care about the network on the bottom, things just ‘work’
   –  Let the scientists be scientists, not engineers

DYNES Open Questions/Next Steps


•  In case you missed it…
   –  No more global VLAN (not scalable, too much of a pain)
   –  Direct L2 circuits (e.g. through OSCARS or similar technologies) still being explored
   –  Current work is on Islands of L3VPNs
      •  VRF – Virtual [VPN] Routing and Forwarding is being used

•  Purpose?
   –  Allows participants to move traffic between one another as needed.
   –  Built using available components of the R&E networking infrastructure (e.g. ESnet, GEANT, Internet2, USLHCnet, ACE, CERNLIGHT, Starlight, MANLAN, etc.)

LHCONE

LHCONE – The Idea

•  How is this done?
   –  It is possible to implement a shared broadcast domain using a specific IP prefix, or it can be implemented via a VRF
      •  Virtual routers (Internet2) vs dedicated resources (e.g. Starlight Cisco)
   –  Difference between this and shared VLAN:
      •  There are routed boundaries between portions of the shared structures
      •  There is a requirement for the exchange of routing information across those boundaries.
         –  This information will be exchanged using BGP.

LHCONE Guts

LHCONE – More Exact

•  Hard for me to answer this – I am not the user
•  As “users”, you all have some important things to do:
   –  Do your science as before
   –  Can you reach the places you need to reach? Are things any better or worse than before?
   –  Is your life measurably better with LHCONE vs without (don’t answer this now, have a cookie or something first)
      •  Since this is ‘just the network’ you may not even notice (unless it’s not working)

•  All kidding aside – the next steps for this lie with the stakeholders, and it is anticipated that you will ‘vote’ with your opinions as well as funding dollars.

LHCONE – What’s Next?

•  Monitoring
   –  Monitoring is not a sexy topic, it’s a means to an end
   –  We (networks, as well as VOs) need it to make sure that things are working so that users (all of you) aren’t sad

•  L2 & Advanced Networking
   –  Lots of opportunity to use new technologies
   –  Hard sell to add features into applications
   –  We (network providers) can ‘help’ with adaptations, but we don’t have the manpower/funding to lead in this area.

Closing Thoughts


For more information, visit http://www.internet2.edu/research/
