www.chameleoncloud.org

CHAMELEON: OPERATIONAL LESSONS
Kate Keahey, Jason Anderson (ANL, UC), Paul Ruth (RENCI), Jacob Colleran (UC, ANL), Cody Hammock (TACC), Joe Stubbs (TACC), Zhuo Zhen (UC, ANL)
{keahey, jasonanderson}@uchicago.edu
July 29, 2019, HARC Workshop, Chicago, IL




CHAMELEON IN A NUTSHELL
• We like to change: testbed that adapts itself to your experimental needs
  ◦ Deep reconfigurability (bare metal) and isolation (CHI), but also ease of use (KVM)
  ◦ CHI: power on/off, reboot, custom kernel, serial console access, etc.
• We want to be all things to all people: balancing large-scale and diverse
  ◦ Large-scale: ~large homogenous partition (~15,000 cores), 5 PB of storage distributed over 2 sites (now +1!) connected with 100G network…
  ◦ …and diverse: ARMs, Atoms, FPGAs, GPUs, Corsa switches, etc.
• Cloud on cloud: leveraging mainstream cloud technologies
  ◦ Powered by OpenStack with bare metal reconfiguration (Ironic) + “special sauce”
  ◦ Chameleon team contribution recognized as official OpenStack component
• We live to serve: open, production testbed for Computer Science Research
  ◦ Started in 10/2014, testbed available since 07/2015, renewed in 10/2017
  ◦ Currently 3,000+ users, 500+ projects, 100+ institutions


CLOUDS VERSUS HPC RESOURCES
• Traditional HPC resources: interfaces, complexity, efficiency & cost
• Clouds: interfaces, complexity, efficiency & cost
• Differences in complexity:
  ◦ Operational complexity: networking, security, and others
  ◦ Greater sharing of artifacts: appliance management
  ◦ Relative immaturity of the paradigm
• Cloud systems are more complex because they solve a more complex problem


EXPERIMENTAL INSTRUMENTS VERSUS CLOUDS

Bare-Metal Infrastructures: security, fewer layers of abstraction, relative immaturity of infrastructure

Networking: access to L2 for all, complexity/automation, integration with commercial offerings

Chasing the Research Frontier and Adaptation: emphasis on development/adding new features, closer collaboration with user community


…AND IT NEEDS TO SCALE

[Figure: chart with axes “accessibility” and “# of diverse CS experiments you can run”, positioning traditional open HPC resources, open cloud resources, manually configurable closed testbeds, and Chameleon.]


WHAT DOES IT MEAN TO PEOPLE?
• Operators
  ◦ Very high level of skill: more diverse and deeper expertise
  ◦ Significant learning curve
  ◦ Teams of operators with different specialties
  ◦ Development experience is critical
• More effort
  ◦ Many moving parts, immature parts, new parts, unexpected parts
• Close interaction with user community
  ◦ Users are increasingly less customers and increasingly more partners


HELPING HUMANS IN THE LOOP
• Researchers and instructors (users)
  ◦ Make interfaces to cloud more intuitive (or at least similar to commercial clouds)
  ◦ Facilitate creation of ecosystem for sharing knowledge
  ◦ Direct instruction and guidance
• Host institutions, service providers (operators)
  ◦ Reduce cost of running Chameleon as low as possible
  ◦ Enable plugging into existing ecosystems
• Ourselves
  ◦ Enable team members of variable expertise to be productive
  ◦ Give insight into usage and health of Chameleon
  ◦ Add force-multipliers to make team have outsized impact


MONITORING: THREE PILLARS

Quantify
• Symptom-based metrics
  ◦ Prometheus
  ◦ Chameleon-specific metrics
• Log indexing and search
  ◦ Elasticsearch, Fluentd
  ◦ Kolla-Ansible

Detect
• Metric-based alerts
  ◦ Prometheus, Alertmanager
• “Black-box” probes
  ◦ Periodic checks for external connectivity to public APIs
• “Smoke tests”
  ◦ Suite of Jenkins tests, run nightly (expensive) or hourly (cheap)
  ◦ Checks “happy path” through system
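A black-box probe of the kind listed under Detect can be sketched in a few lines of Python. This is a minimal illustration, not Chameleon's actual probe: the function names `http_status` and `probe`, and the rule that any status below 400 counts as healthy, are assumptions made for this example.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Dict


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code of a GET request to `url`."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, just not with a success code.
        return err.code


def probe(url: str, fetch: Callable[[str], int] = http_status) -> Dict[str, object]:
    """Run one external check and report reachability and latency.

    `fetch` is injectable so the probe logic can be exercised without
    a network connection.
    """
    started = time.monotonic()
    try:
        status = fetch(url)
        ok = 200 <= status < 400  # assumed health rule for this sketch
        error = None
    except Exception as exc:  # DNS failure, refused connection, timeout, ...
        status, ok, error = None, False, str(exc)
    return {
        "url": url,
        "ok": ok,
        "status": status,
        "error": error,
        "seconds": time.monotonic() - started,
    }
```

Run periodically from outside the site (for example from a scheduler), the returned dictionary could feed a metric-based alert; how the result is exported is left open here.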

React
• Runbooks
  ◦ Documentation of known errors and mitigations (for operators)
  ◦ Helps new team members be productive
• Hammers
  ◦ Automated solutions for known errors
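The “hammer” idea, an automated fix attached to a known error, might look like the following sketch. It assumes a hammer is just a log pattern paired with a remediation callback; the `Hammer` class, `apply_hammers`, and the dry-run default are hypothetical names and choices for illustration, not the project's implementation.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Hammer:
    name: str                              # short identifier for the known error
    pattern: str                           # regex that recognizes the error in a log line
    fix: Callable[[Dict[str, str]], None]  # remediation, given the regex's named groups


def apply_hammers(line: str, hammers: List[Hammer], dry_run: bool = True) -> List[str]:
    """Return the names of hammers matching `line`; run their fixes unless dry_run."""
    fired = []
    for hammer in hammers:
        match = re.search(hammer.pattern, line)
        if match is None:
            continue
        if not dry_run:
            hammer.fix(match.groupdict())
        fired.append(hammer.name)
    return fired
```

Defaulting to a dry run keeps a newly written hammer from acting on production until an operator has seen what it would match.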


IS IT WORTH THE TIME TO AUTOMATE?
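One way to frame the question is a break-even estimate: hours saved over a planning horizon minus hours spent building the automation. The framing, the function name `automation_payoff_hours`, and its parameters are assumptions for this sketch; it also ignores the ongoing cost of maintaining the automation.

```python
def automation_payoff_hours(
    minutes_saved_per_run: float,
    runs_per_week: float,
    weeks: float,
    hours_to_automate: float,
) -> float:
    """Net hours gained over the horizon; positive means automating pays off."""
    hours_saved = minutes_saved_per_run * runs_per_week * weeks / 60
    return hours_saved - hours_to_automate


# A task that costs 10 minutes, six times a week, over one year,
# weighed against 40 hours of automation work.
print(automation_payoff_hours(10, 6, 52, 40))  # 12.0
```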


AUTOMATION
• New appliance releases
  ◦ No “snowflake” images: expressed in code and built with diskimage-builder
• New system releases
  ◦ Patches are tested, built into a new (Docker) container, then pushed to a local registry for release
  ◦ A job is triggered to deploy new container version using controlled process (Kolla-Ansible)
  ◦ Nobody has to learn how to build/install packages! Downside: somebody should know how to fix problems with pipeline.
• Maintenance processes
  ◦ Taking node out of production, attaching metadata for operators
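The system-release flow above (test, build a container, push to a local registry, deploy with Kolla-Ansible) could be scripted roughly as follows. This is a sketch under assumptions: the function names, the registry address, and the caller-supplied test command are hypothetical, and a real deployment would also need Kolla-Ansible configuration pointing at the new image tag, which is not shown.

```python
import shlex
import subprocess
from typing import List, Sequence


def release_plan(
    image: str,
    registry: str,
    build_dir: str,
    inventory: str,
    test_cmd: Sequence[str],
) -> List[List[str]]:
    """Return the ordered commands for one release: test, build, push, deploy."""
    tag = f"{registry}/{image}"
    return [
        list(test_cmd),                                # patches are tested first
        ["docker", "build", "-t", tag, build_dir],     # built into a new container
        ["docker", "push", tag],                       # pushed to the local registry
        ["kolla-ansible", "-i", inventory, "deploy"],  # deployed by a controlled process
    ]


def run_release(plan: List[List[str]], dry_run: bool = True) -> List[str]:
    """Print each step; execute it only when dry_run is False."""
    lines = []
    for step in plan:
        line = shlex.join(step)
        lines.append(line)
        print(line)
        if not dry_run:
            subprocess.run(step, check=True)  # stop at the first failing step
    return lines
```

Keeping the plan as data makes the pipeline easy to review and to dry-run, which helps with the downside noted above: someone still has to debug the pipeline when it breaks.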


PACKAGING CHAMELEON
• What is CHI-in-a-Box?
  ◦ Install Chameleon on your own infrastructure with set of provisioning scripts + software bundles

[Figure labels: Traditional software, OpenStack]


CHI-IN-A-BOX USE CASES
• Chameleon Associate
  ◦ Resources added directly to Chameleon, while retaining project identity
  ◦ Chameleon provides user management (and user support!), resource discovery and appliance catalog
  ◦ Jointly maintained by Chameleon staff and associate site partnership
• Chameleon Part-time Associate
  ◦ Similar to above, but all resources are expected to be taken offline at times
• Independent Testbed
  ◦ Associate site deploys Chameleon, but operates user management and support themselves.
• First site already deployed at NU


PACKAGING: OUR APPROACH
• Distribute as set of provisioning scripts, build on commodity technology
  ◦ Kolla, Kolla-Ansible, Ansible
  ◦ All infrastructure expressed as versioned code. Infrastructure can be built from scratch repeatably (good for disaster recovery).
• Provide installation and support documentation
  ◦ Install guide, troubleshooting, runbooks
• We “dogfood” CHI-in-a-Box internally
  ◦ Being consumers of our own product improves quality
  ◦ Focus on reducing coupling between sites for reliability


USERS: THE FINAL FRONTIER
• Tickets vs. support lists
  ◦ Dedicated, trackable communication versus discoverable, noisy communications
• Covenant between users and operators
  ◦ Everything works better when users are educated about “proper use”
• Education and outreach
  ◦ OpenStack documentation is mixed blessing
  ◦ Chameleon docs are updated with each release
  ◦ Live webinars and face-to-face meetups most impactful
• Incentivizing an ecosystem
  ◦ Sharing is much more powerful in the cloud!


PARTING THOUGHTS
• Physical environment: Chameleon is a rapidly evolving experimental platform
  ◦ From “adapts to the needs of your experiment”…
  ◦ …to “adapts to the changing research frontier”
• Clouds are hard to operate because they solve a complex problem, and experimental facilities even more so
  ◦ More skilled personnel, more effort, and especially in testbeds more development
• Towards an ecosystem: a meeting place of users sharing resources and research
  ◦ Testbeds are more than just experimental platforms: common/shared platform is a “common denominator” that can eliminate much complexity that goes into systematic experimentation, sharing, and reproducibility
  ◦ Working with other operators via CHI-in-a-Box and BYOH initiatives
  ◦ Working with users via providing sharing mechanisms and fostering community development


WE’RE HERE TO CHANGE

CHAMELEON:

www.chameleoncloud.org