Hardware Lifecycle at Scale - Open Compute Project
TRANSCRIPT
![Page 1: Hardware Lifecycle at Scale - Open Compute Project](https://reader030.vdocuments.site/reader030/viewer/2022012610/619d2d8086041b057b74e9e0/html5/thumbnails/1.jpg)
![Page 2: Hardware Lifecycle at Scale - Open Compute Project](https://reader030.vdocuments.site/reader030/viewer/2022012610/619d2d8086041b057b74e9e0/html5/thumbnails/2.jpg)
Hardware Lifecycle at Scale
Brian Dodds, Craig Ross
![Page 3: Hardware Lifecycle at Scale - Open Compute Project](https://reader030.vdocuments.site/reader030/viewer/2022012610/619d2d8086041b057b74e9e0/html5/thumbnails/3.jpg)
Agenda
1. Facebook’s Infrastructure Evolution
2. Hardware Lifecycle
3. Learnings
4. Wrap Up
![Page 4: Hardware Lifecycle at Scale - Open Compute Project](https://reader030.vdocuments.site/reader030/viewer/2022012610/619d2d8086041b057b74e9e0/html5/thumbnails/4.jpg)
Facebook's Infrastructure Evolution
![Page 5: Hardware Lifecycle at Scale - Open Compute Project](https://reader030.vdocuments.site/reader030/viewer/2022012610/619d2d8086041b057b74e9e0/html5/thumbnails/5.jpg)
Facebook’s Growth
2010: 600M
2012: 1B, Intro, Acquisition
2014: 1.3B, 200M, 200M, Acquisition
2016: 1.65B, 900M, 500M, 1B
![Page 6: Hardware Lifecycle at Scale - Open Compute Project](https://reader030.vdocuments.site/reader030/viewer/2022012610/619d2d8086041b057b74e9e0/html5/thumbnails/6.jpg)
Facebook’s Scale Today
Each Day:
• Billions of photo and video uploads
• Trillions of user requests
• Tens of trillions of database queries
• 100s of trillions of cache queries
Huge demands on servers, storage, network, and power
![Page 7: Hardware Lifecycle at Scale - Open Compute Project](https://reader030.vdocuments.site/reader030/viewer/2022012610/619d2d8086041b057b74e9e0/html5/thumbnails/7.jpg)
Why Build Our Own Hardware?
Advantages
• Faster response to growth demands
• Optimize end-to-end (Application->Power->Thermal)
• Highest Operational Efficiency
• Commodity components
Be Open
![Page 8: Hardware Lifecycle at Scale - Open Compute Project](https://reader030.vdocuments.site/reader030/viewer/2022012610/619d2d8086041b057b74e9e0/html5/thumbnails/8.jpg)
The Facebook Datacenter
![Page 9: Hardware Lifecycle at Scale - Open Compute Project](https://reader030.vdocuments.site/reader030/viewer/2022012610/619d2d8086041b057b74e9e0/html5/thumbnails/9.jpg)
Infrastructure Evolution
Timeline, 2010 to 2016:
• Open Compute Project Launch
• Hardware: Compute
• Hardware: Storage
• Hardware: Network
• Fabric
• Sites: PRN, FRC, LLA, ATN, FTW, CLN
![Page 10: Hardware Lifecycle at Scale - Open Compute Project](https://reader030.vdocuments.site/reader030/viewer/2022012610/619d2d8086041b057b74e9e0/html5/thumbnails/10.jpg)
Hardware Evolution
Timeline, 2010 to 2016:
• Rack & Power: Freedom triplet, Open Rack V1, Open Rack V2
• Compute: Freedom, Windmill, Winterfell, Leopard, Yosemite
• Storage: Knox, Honey Badger, BluRay, Lightning
• Network Switch: Wedge
• Network: Back Pack
• GPU: Big Sur
![Page 11: Hardware Lifecycle at Scale - Open Compute Project](https://reader030.vdocuments.site/reader030/viewer/2022012610/619d2d8086041b057b74e9e0/html5/thumbnails/11.jpg)
Facebook Datacenters
![Page 12: Hardware Lifecycle at Scale - Open Compute Project](https://reader030.vdocuments.site/reader030/viewer/2022012610/619d2d8086041b057b74e9e0/html5/thumbnails/12.jpg)
Hardware Lifecycle
Infrastructure @ Scale
![Page 13: Hardware Lifecycle at Scale - Open Compute Project](https://reader030.vdocuments.site/reader030/viewer/2022012610/619d2d8086041b057b74e9e0/html5/thumbnails/13.jpg)
Lifecycle phases: Hack, Design, Build, Deploy, Sustain, Decom
![Page 17: Hardware Lifecycle at Scale - Open Compute Project](https://reader030.vdocuments.site/reader030/viewer/2022012610/619d2d8086041b057b74e9e0/html5/thumbnails/17.jpg)
![Page 18: Hardware Lifecycle at Scale - Open Compute Project](https://reader030.vdocuments.site/reader030/viewer/2022012610/619d2d8086041b057b74e9e0/html5/thumbnails/18.jpg)
Chassis Level Assembly
Rack Assembly (in Region)
Data Centers
![Page 19: Hardware Lifecycle at Scale - Open Compute Project](https://reader030.vdocuments.site/reader030/viewer/2022012610/619d2d8086041b057b74e9e0/html5/thumbnails/19.jpg)
Component Level Manufacturing
Chassis + Rack Level Assembly (in Region)
Data Centers
![Page 20: Hardware Lifecycle at Scale - Open Compute Project](https://reader030.vdocuments.site/reader030/viewer/2022012610/619d2d8086041b057b74e9e0/html5/thumbnails/20.jpg)
![Page 21: Hardware Lifecycle at Scale - Open Compute Project](https://reader030.vdocuments.site/reader030/viewer/2022012610/619d2d8086041b057b74e9e0/html5/thumbnails/21.jpg)
![Page 22: Hardware Lifecycle at Scale - Open Compute Project](https://reader030.vdocuments.site/reader030/viewer/2022012610/619d2d8086041b057b74e9e0/html5/thumbnails/22.jpg)
![Page 23: Hardware Lifecycle at Scale - Open Compute Project](https://reader030.vdocuments.site/reader030/viewer/2022012610/619d2d8086041b057b74e9e0/html5/thumbnails/23.jpg)
![Page 24: Hardware Lifecycle at Scale - Open Compute Project](https://reader030.vdocuments.site/reader030/viewer/2022012610/619d2d8086041b057b74e9e0/html5/thumbnails/24.jpg)
Learnings
![Page 25: Hardware Lifecycle at Scale - Open Compute Project](https://reader030.vdocuments.site/reader030/viewer/2022012610/619d2d8086041b057b74e9e0/html5/thumbnails/25.jpg)
Hardware Evolution
Timeline, 2010 to 2016:
• Rack & Power: Freedom triplet, Open Rack V1, Open Rack V2
• Compute: Freedom, Windmill, Winterfell, Leopard, Yosemite
• Storage: Knox, Honey Badger, BluRay, Lightning
• Network Switch: Wedge
• Network: Six Pack
• GPU: Big Sur
![Page 26: Hardware Lifecycle at Scale - Open Compute Project](https://reader030.vdocuments.site/reader030/viewer/2022012610/619d2d8086041b057b74e9e0/html5/thumbnails/26.jpg)
Learnings - Sensors
Issues: BMC and PSU monitoring woes
Learnings: Improve monitoring of critical sensors.
![Page 27: Hardware Lifecycle at Scale - Open Compute Project](https://reader030.vdocuments.site/reader030/viewer/2022012610/619d2d8086041b057b74e9e0/html5/thumbnails/27.jpg)
Learnings - Supply Chain/Application
Issues: Single-sourced epidemic failure. App performance issues. Row Hammer.
Learnings: Multi-source components, robust app testing @ scale, improve component monitoring.
![Page 28: Hardware Lifecycle at Scale - Open Compute Project](https://reader030.vdocuments.site/reader030/viewer/2022012610/619d2d8086041b057b74e9e0/html5/thumbnails/28.jpg)
Learnings - DC Tooling
Issues: Shipped hardware before all tooling was finished, leaving hardware idle.
Learnings: Make tooling a first-class citizen for phase exit.
![Page 29: Hardware Lifecycle at Scale - Open Compute Project](https://reader030.vdocuments.site/reader030/viewer/2022012610/619d2d8086041b057b74e9e0/html5/thumbnails/29.jpg)
Hardware Eventually Fails
![Page 30: Hardware Lifecycle at Scale - Open Compute Project](https://reader030.vdocuments.site/reader030/viewer/2022012610/619d2d8086041b057b74e9e0/html5/thumbnails/30.jpg)
Robust Infrastructure
Monitor, Alarm, Remediate, Design Feedback
![Page 31: Hardware Lifecycle at Scale - Open Compute Project](https://reader030.vdocuments.site/reader030/viewer/2022012610/619d2d8086041b057b74e9e0/html5/thumbnails/31.jpg)
![Page 32: Hardware Lifecycle at Scale - Open Compute Project](https://reader030.vdocuments.site/reader030/viewer/2022012610/619d2d8086041b057b74e9e0/html5/thumbnails/32.jpg)
Monitoring
Many servers, components, services, and regions
![Page 33: Hardware Lifecycle at Scale - Open Compute Project](https://reader030.vdocuments.site/reader030/viewer/2022012610/619d2d8086041b057b74e9e0/html5/thumbnails/33.jpg)
Monitoring
Failure Rate
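The slide does not say how failure rate is computed. As a minimal sketch, assuming an annualized failure rate (failures divided by device-years in service) grouped by a cohort label, the calculation could look like the following. The function names, the cohort labels, and the sample records are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def annualized_failure_rate(failures, device_days):
    """Return failures per device-year of service, as a fraction."""
    if device_days <= 0:
        raise ValueError("device_days must be positive")
    return failures / (device_days / 365.0)


def failure_rate_by_cohort(records):
    """Aggregate (cohort, failures, device_days) records per cohort.

    Returns a dict mapping each cohort label to its annualized failure rate.
    """
    totals = defaultdict(lambda: [0, 0.0])
    for cohort, failures, device_days in records:
        totals[cohort][0] += failures
        totals[cohort][1] += device_days
    return {
        cohort: annualized_failure_rate(failures, days)
        for cohort, (failures, days) in totals.items()
    }


# Hypothetical sample data: (cohort label, failures, device-days in service).
records = [
    ("hdd-model-a", 2, 36500),
    ("hdd-model-a", 1, 36500),
    ("hdd-model-b", 4, 36500),
]
rates = failure_rate_by_cohort(records)
for cohort, rate in sorted(rates.items()):
    print(f"{cohort}: {rate:.1%} per device-year")
```

Normalizing by device-years rather than by raw device count keeps cohorts of different sizes and ages comparable, which matters when the fleet spans many servers, components, services, and regions.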
![Page 34: Hardware Lifecycle at Scale - Open Compute Project](https://reader030.vdocuments.site/reader030/viewer/2022012610/619d2d8086041b057b74e9e0/html5/thumbnails/34.jpg)
Monitoring
Error Types
![Page 35: Hardware Lifecycle at Scale - Open Compute Project](https://reader030.vdocuments.site/reader030/viewer/2022012610/619d2d8086041b057b74e9e0/html5/thumbnails/35.jpg)
Monitoring
Filters
![Page 36: Hardware Lifecycle at Scale - Open Compute Project](https://reader030.vdocuments.site/reader030/viewer/2022012610/619d2d8086041b057b74e9e0/html5/thumbnails/36.jpg)
![Page 37: Hardware Lifecycle at Scale - Open Compute Project](https://reader030.vdocuments.site/reader030/viewer/2022012610/619d2d8086041b057b74e9e0/html5/thumbnails/37.jpg)
Alarms
Anomaly Detection
Anomaly Within Cohorts
Gradual Increases and Sudden Spikes
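The slide names three kinds of anomaly but not the detection method. The sketch below is one possible way to illustrate them, not the presenters' implementation: a median/MAD modified z-score for outliers within a cohort, a mean-plus-k-standard-deviations check for sudden spikes, and a least-squares slope for gradual increases. All function names, thresholds, and sample values are assumptions made for this example.

```python
import statistics


def cohort_outliers(rates, threshold=3.5):
    """Return cohort members whose rate is far from the cohort median.

    Uses the modified z-score 0.6745 * |x - median| / MAD.
    """
    values = list(rates.values())
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread in the cohort, so nothing can be scored
    return sorted(
        name for name, v in rates.items()
        if 0.6745 * abs(v - med) / mad > threshold
    )


def is_sudden_spike(series, k=3.0):
    """True if the latest point exceeds mean + k * stdev of the earlier points."""
    history, latest = series[:-1], series[-1]
    if len(history) < 2:
        return False
    return latest > statistics.mean(history) + k * statistics.stdev(history)


def trend_slope(series):
    """Least-squares slope of the series against its index (needs >= 2 points)."""
    n = len(series)
    x_mean = (n - 1) / 2
    y_mean = sum(series) / n
    num = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(series))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def is_gradual_increase(series, min_slope):
    """True if the fitted slope over the window exceeds min_slope."""
    return len(series) >= 3 and trend_slope(series) > min_slope


# Hypothetical data: failure rates per rack, and two daily error-count series.
cohort = {"rack-01": 0.010, "rack-02": 0.012, "rack-03": 0.011,
          "rack-04": 0.009, "rack-05": 0.050}
spike = [5, 6, 5, 6, 5, 20]
ramp = [1.0, 1.1, 1.2, 1.3, 1.4, 1.5]

print(cohort_outliers(cohort))
print(is_sudden_spike(spike), is_sudden_spike(ramp))
print(is_gradual_increase(ramp, min_slope=0.05))
```

The ramp series shows why the two time-series checks are kept separate: a slow, steady rise never crosses the spike threshold, so it needs a trend test of its own.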
![Page 38: Hardware Lifecycle at Scale - Open Compute Project](https://reader030.vdocuments.site/reader030/viewer/2022012610/619d2d8086041b057b74e9e0/html5/thumbnails/38.jpg)
![Page 39: Hardware Lifecycle at Scale - Open Compute Project](https://reader030.vdocuments.site/reader030/viewer/2022012610/619d2d8086041b057b74e9e0/html5/thumbnails/39.jpg)
Remediation
• Phase 1: Root Cause Analysis
• Phase 2: Review Remediation Plan
• Phase 3: Implement Remediation
The Journey is 1% Finished
Chart axes: 0% to 40%; Mar 2014, Jul 2014, Nov 2014, Mar 2015, Jul 2015, Now
![Page 40: Hardware Lifecycle at Scale - Open Compute Project](https://reader030.vdocuments.site/reader030/viewer/2022012610/619d2d8086041b057b74e9e0/html5/thumbnails/40.jpg)
![Page 41: Hardware Lifecycle at Scale - Open Compute Project](https://reader030.vdocuments.site/reader030/viewer/2022012610/619d2d8086041b057b74e9e0/html5/thumbnails/41.jpg)
Design Improvements
HDD Slot Temperature vs. Swap Rate
Higher temps.
More swaps.
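The slide's underlying data is not included in the transcript. As a rough sketch of how a temperature-versus-swap-rate relationship might be quantified, the example below computes a Pearson correlation coefficient over per-slot averages. The `pearson` helper and every number in the sample lists are hypothetical, made up purely to show the calculation.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-slot data: average temperature (deg C) and the fraction
# of drives swapped in that slot over some observation period.
slot_temps_c = [30, 32, 35, 38, 41, 44]
swap_rate = [0.010, 0.011, 0.013, 0.016, 0.020, 0.025]

r = pearson(slot_temps_c, swap_rate)
print(f"correlation between slot temperature and swap rate: {r:.2f}")
```

A coefficient near 1 would be consistent with the slide's "higher temps, more swaps" observation, although correlation alone does not establish that temperature causes the swaps.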
![Page 42: Hardware Lifecycle at Scale - Open Compute Project](https://reader030.vdocuments.site/reader030/viewer/2022012610/619d2d8086041b057b74e9e0/html5/thumbnails/42.jpg)
Wrap Up
![Page 43: Hardware Lifecycle at Scale - Open Compute Project](https://reader030.vdocuments.site/reader030/viewer/2022012610/619d2d8086041b057b74e9e0/html5/thumbnails/43.jpg)
Key takeaways
• FB scale is growing. Infrastructure needs to innovate
• Move fast and adapt with robust HW lifecycle
• Everything fails – minimize impact with tooling
![Page 44: Hardware Lifecycle at Scale - Open Compute Project](https://reader030.vdocuments.site/reader030/viewer/2022012610/619d2d8086041b057b74e9e0/html5/thumbnails/44.jpg)