uc santa cruz impact of failure on interconnection networks for large storage systems qin xin, ethan...
TRANSCRIPT
UC Santa Cruz
Impact of Failure on Interconnection Networks for Large Storage Systems
Qin Xin, Ethan L. Miller, Thomas J. E. Schwarz*, Darrell D. E. Long
Storage Systems Research Center University of California, Santa Cruz
*Department of Computer EngineeringSanta Clara University
MSST'05 24/13/05
Problems
Reliability is a real concern for large storage systems• A large number of components • Complex interconnections• Human errors
Robust network interconnection is desired• Failures of switches, routers, and network links are
common• Organization of nodes and links is crucial
Scale makes things complicated: petabyte storage system --10,000s disks, 100s routers,100,000s links
• Robust network topology Tolerant multi-points of failures without noticeable
performance degradation
MSST'05 34/13/05
Contributions
Examine various network topologies• Butterfly networks• Mesh structure• Hypercube• Tradeoffs among them: cost, robustness,
performance. Estimate the impacts of link and node failures
by simulation• I/O path connectivity• Number of hops in varied degraded modes• Failure impact to its neighborhood
MSST'05 44/13/05
Network Topology for Storage Systems
Butterfly networks• Reasonable cost• Poor robustness
Mesh• Brick-structured, good
robustness• Relatively long path
Hypercube• Superior robustness• Complex, costly
2
3
4
server 1
server 2
server 3
server 4
router
router
router
router
switch
switch
switch
switch
switch
switch
switch
switch
server
router
server
0 1
3 4
2
5
6 7 8
router 1
router 2
MSST'05 54/13/05
A case study: Failure Scenarios
server
0 1
3 4
2
5
6 7 8
router 1
router 2
server
0 1
3 4
2
5
6 7 8
router 1
router 2
server
0 1
3 4
2
5
6 7 8
router 1
router 2
server
0 1
3 4
2
5
6 7 8
router 1
router 2
MSST'05 64/13/05
Simulated Results: Connectivity
I/O path connectivity• Measured in “nines”: -log10(1-P), P: m/n, m: # of good
requests, n: # of total requests. (3 nines = 99.9%)• Poor connectively when the connection nodes are lost. • Hypercube and mesh are more robust in connectivity than
butterfly networks.
MSST'05 74/13/05
Simulation Results: Neighborhoodmesh
hypercube Failure impact on network
neighborhood• Neighbors suffer in presence of
failures.• Hypercube outperforms butterfly &
mesh structure in avg. I/O load.
butterfly
MSST'05 84/13/05
Summary
A well-chosen topology can provide robust interconnection.
Neighbors around failures suffer much more than average.
Hypercube structure provides better interconnection in degraded modes than butterfly and mesh structure.
MSST'05 94/13/05
Questions?
For further information, please visit:• http://ssrc.cse.ucsc.edu/• http://www.cs.ucsc.edu/~qxin/
Thanks to SSRC faculty, students, affiliates, and alumni Thanks to our sponsors
• Lawrence Livermore National Laboratory, Los Alamos National Laboratory, Sandia National Laboratory
• Engenio Information Technologies, HP Lab, Hitachi Global Storage Technologies, IBM Research, Intel, LSI Logic, Microsoft Research, Network Appliance, and Veritas
• NSF, USENIX
MSST'05 104/13/05
Big Picture: A Petabyte-Scale Storage System
High Speed NetworkWAN
...
... ...
...
MetadataServer
...
MetadataServer
SuperComputer
SuperComputer
devicedevice
Client
Client
10s10,000s
1000s
100,
000s
Potential for data loss• Thousands of disks• Petabytes of data