Achieving Power-Efficiency in Clusters without Distributed File System Complexity
Hrishikesh Amur, Karsten Schwan
Georgia Tech
Green Computing Research Initiative at GT
Circuit level: DVFS, power states, clock gating (ECE)
Chip and Package: power multiplexing, spatiotemporal migration (SCS, ECE)
Board: VirtualPower, scheduling/scaling/operating system… (SCS, ME, ECE)
Rack: mechanical design, thermal and airflow analysis, VPTokens, OS and management (ME, SCS)
Power distribution and delivery (ECE)
Datacenter and beyond: design, IT management, HVAC control… (ME, SCS, OIT…)
Focus of our work:
Data-intensive applications that use distributed storage
[Chart: per-system power breakdown across CPU, memory, PCI slots, motherboard, disks, fan]
Per-system Power Breakdown
Power off entire nodes
Approach to Power-Efficiency of Cluster
Turning Off Nodes Breaks Conventional DFS
One replica of all data placed on a small set of nodes
Primary replica maintains availability, allowing nodes storing other replicas to be turned off [Sierra, Rabbit]
Modifications to Data Layout Policy
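The modified layout policy above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the actual Sierra or Rabbit algorithm; the function name, node list, and replication factor are made up for the example:

```python
import random

def place_replicas(block_id, nodes, primary_set_size, replication=3):
    """Place one replica of every block on a small 'primary' node set.

    Hypothetical layout sketch: the primary replica always lands on one
    of the first primary_set_size nodes, so the remaining nodes (which
    hold only secondary copies) can be powered off without making any
    block unavailable.
    """
    primaries = nodes[:primary_set_size]
    others = nodes[primary_set_size:]
    primary = primaries[block_id % len(primaries)]
    # spread the remaining replicas over the power-manageable nodes
    rng = random.Random(block_id)  # deterministic, for illustration only
    secondaries = rng.sample(others, replication - 1)
    return [primary] + secondaries

nodes = [f"node{i}" for i in range(10)]
layout = place_replicas(42, nodes, primary_set_size=3)
# layout[0] is on the always-on primary set; holders of layout[1:] can be turned off
```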
Where is new data to be written when part of the cluster is turned off?
Handling New Data
New Data: Temporary Offloading
Temporary off-loading of new data to 'on' nodes is a solution, but at a cost:
Additional copying of large amounts of data
Use of network bandwidth
Increased complexity!!
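The off-loading path and the bookkeeping it requires can be illustrated with a small sketch; the function, the off-load log, and the node names are hypothetical, not taken from the paper:

```python
import zlib

def write_block(block_id, target_node, on_nodes, offload_log):
    """Write a block, off-loading it if its target node is powered off.

    Hypothetical sketch: when the target is off, the block goes to some
    'on' node and is logged so it can be copied back later -- the extra
    copy and the log itself are the added complexity the slide notes.
    """
    if target_node in on_nodes:
        return target_node
    # pick a deterministic fallback among the powered-on nodes
    fallback = on_nodes[zlib.crc32(block_id.encode()) % len(on_nodes)]
    offload_log.append((block_id, fallback, target_node))  # pending copy-back
    return fallback

log = []
placed = write_block("blk-7", "node9", ["node0", "node1", "node2"], log)
# placed is one of the 'on' nodes; log records the deferred copy to node9
```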
Failure of primary nodes causes a large number of nodes to be started up to restore availability
To solve this, additional groups with secondary, tertiary, etc. copies have to be maintained
Again, increased complexity!!
Handling Primary Failures
Making a DFS power-proportional increases its complexity significantly
Provide fine-grained control over what components to turn off
Our Solution
Switch between two extreme power modes: max_perf and io_server
How do we save power?
Fine-grained control allows all disks to be kept on, maintaining access to stored data
How does this keep the DFS simple?
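The two-mode design can be sketched as a tiny state machine. The class and attribute names are illustrative assumptions; the real prototype switches the disks between physical nodes via a SATA switch rather than a software flag:

```python
from enum import Enum

class PowerMode(Enum):
    MAX_PERF = "max_perf"    # full-power node serves both compute and I/O
    IO_SERVER = "io_server"  # low-power node keeps serving the disks

class NodeController:
    """Hypothetical per-node controller for the two extreme power modes.

    The invariant the slides emphasize: the disks stay on and reachable
    in *both* modes, so the DFS never sees stored data disappear.
    """
    def __init__(self):
        self.mode = PowerMode.MAX_PERF
        self.disks_accessible = True

    def switch(self, mode):
        self.mode = mode
        # disks are never powered off, regardless of mode
        self.disks_accessible = True

ctl = NodeController()
ctl.switch(PowerMode.IO_SERVER)
# ctl.disks_accessible stays True: stored data remains available
```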
Prototype Node Architecture
[Diagram: Asterix Node and Obelix Node share disks through a SATA Switch; a VMM hosts the serving VM]
max_perf Mode
[Diagram: the serving VM runs on the Obelix Node]
io_server Mode
[Diagram: the serving VM runs on the Asterix Node]
Increased Performance/Power
[Chart: Throughput/Watt (MB/s/W) vs. number of servers in max_perf mode (1–4), for Obelix and Asterix-II]
Virtualization Overhead: Reads
[Chart: read throughput (MB/s) on Obelix and Asterix for Linux, domU, dom0, domU*]
Virtualization Overhead: Writes
[Chart: write throughput (MB/s) on Obelix and Asterix for Linux, domU, dom0, domU*]
Turning entire nodes off complicates the DFS
Better to be able to turn individual components off, or to achieve more power-proportional platforms/components
Prototype uses separate machines and shared disks
Summary
Load Management Policies
Static
◦ e.g., DFS, DMS, monitoring/management tasks…
Dynamic
◦ e.g., based on runtime monitoring and management/scheduling…
◦ helpful to do power metering on a per-process/VM basis
X86+Atom+IB…
VM-level Power Metering: Our Approach
Built power profiles for various platform resources
◦ CPU, memory, cache, I/O…
Utilize low-level hardware counters to track resource utilization on a per-VM basis
◦ xenoprofile, IPMI, Xen tools…
◦ track sets of VMs separately
Maintain low/acceptable overheads while maintaining desired accuracy
◦ limit the amount of necessary information and the number of monitored events: use instructions retired/s and LLC misses/s only
◦ establish accuracy bounds
Apply monitored information to a power model to determine VM power utilization at runtime
◦ in contrast to static, purely profile-based approaches
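As a concrete illustration, a linear model over the two monitored counter rates might look like the sketch below. The coefficients and base power are made-up placeholders standing in for the per-platform profiling results the approach derives:

```python
def vm_power_watts(instr_retired_per_s, llc_misses_per_s,
                   p_base=60.0, c_instr=2.0e-9, c_miss=5.0e-7):
    """Hypothetical linear power model over the two monitored events.

    p_base, c_instr, and c_miss are invented stand-ins for coefficients
    obtained by profiling platform resources (CPU, memory, cache, I/O).
    """
    return p_base + c_instr * instr_retired_per_s + c_miss * llc_misses_per_s

# attribute power at runtime from each VM's monitored counter rates
counter_rates = {
    "vm-cpu-bound": (2.0e9, 1.0e6),  # high instruction rate, few LLC misses
    "vm-mem-bound": (0.5e9, 4.0e6),  # lower IPC, many LLC misses
}
estimates = {vm: vm_power_watts(i, m) for vm, (i, m) in counter_rates.items()}
```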