
Low-Power High Performance Computing

Panagiotis Kritikakos

August 16, 2011

MSc in High Performance Computing

The University of Edinburgh

Year of Presentation: 2011


Abstract

The emerging development of computer systems for HPC requires a change in processor architecture. New design approaches and technologies need to be embraced by the HPC community, both to make Exascale supercomputers possible within the next two decades and to reduce the CO2 emissions of supercomputers and scientific clusters, leading to greener computing. Power is listed as one of the most important issues and constraints for future Exascale systems. In this project we build a hybrid cluster, investigating, measuring and evaluating the performance of low-power CPUs, such as the Intel Atom and ARM (Marvell 88F6281), against a commodity Intel Xeon CPU of the kind found in standard HPC and data-centre clusters. Three main factors are considered: computational performance and efficiency, power efficiency and porting effort.


Contents

1 Introduction
  1.1 Report organisation

2 Background
  2.1 RISC versus CISC
  2.2 HPC Architectures
    2.2.1 System architectures
    2.2.2 Memory architectures
  2.3 Power issues in modern HPC systems
  2.4 Energy and application efficiency

3 Literature review
  3.1 Green500
  3.2 Supercomputing in Small Spaces (SSS)
  3.3 The AppleTV Cluster
  3.4 Sony Playstation 3 Cluster
  3.5 Microsoft XBox Cluster
  3.6 IBM BlueGene/Q
  3.7 Less Watts
  3.8 Energy-efficient cooling
    3.8.1 Green Revolution Cooling
    3.8.2 Google Data Centres
    3.8.3 Nordic Research
  3.9 Exascale

4 Technology review
  4.1 Low-power Architectures
    4.1.1 ARM
    4.1.2 Atom
    4.1.3 PowerPC and Power
    4.1.4 MIPS

5 Benchmarking, power measurement and experimentation
  5.1 Benchmark suites
    5.1.1 HPCC Benchmark Suite
    5.1.2 NPB Benchmark Suite
    5.1.3 SPEC Benchmarks
    5.1.4 EEMBC Benchmarks
  5.2 Benchmarks
    5.2.1 HPL
    5.2.2 STREAM
    5.2.3 CoreMark
  5.3 Power measurement
    5.3.1 Metrics
    5.3.2 Measuring unit power
    5.3.3 The measurement procedure
  5.4 Experiments design and execution
  5.5 Validation and reproducibility

6 Cluster design and deployment
  6.1 Architecture support
    6.1.1 Hardware considerations
    6.1.2 Software considerations
    6.1.3 Soft Float vs Hard Float
  6.2 Fortran
  6.3 C/C++
  6.4 Java
  6.5 Hardware decisions
  6.6 Software decisions
  6.7 Networking
  6.8 Porting
    6.8.1 Fortran to C
    6.8.2 Binary incompatibility
    6.8.3 Scripts developed

7 Results and analysis
  7.1 Thermal Design Power
  7.2 Idle readings
  7.3 Benchmark results
    7.3.1 Serial performance: CoreMark
    7.3.2 Parallel performance: HPL
    7.3.3 Memory performance: STREAM
    7.3.4 HDD and SSD power consumption

8 Future work

9 Conclusions

A CoreMark results

B HPL results

C STREAM results

D Shell Scripts
  D.1 add_node.sh
  D.2 status.sh
  D.3 armrun.sh
  D.4 watt_log.sh
  D.5 fortran2c.sh

E Benchmark outputs samples
  E.1 CoreMark output sample
  E.2 HPL output sample
  E.3 STREAM output sample

F Project evaluation
  F.1 Goals
  F.2 Work plan
  F.3 Risks
  F.4 Changes

G Final Project Proposal
  G.1 Content
  G.2 The work to be undertaken
    G.2.1 Deliverables
  G.3 Tasks
  G.4 Additional information / Knowledge required


List of Tables

6.1 Cluster nodes hardware specifications
6.2 Cluster nodes software specifications
6.3 Network configuration

7.1 Maximum TDP per processor
7.2 Average system power consumption on idle
7.3 CoreMark results with 1 million iterations
7.4 HPL problem sizes
7.5 HPL problem sizes
7.6 STREAM results for 500MB array size

A.1 CoreMark results for various iterations

B.1 HPL problem sizes
B.2 HPL problem sizes
B.3 HPL results for N=500

C.1 STREAM results for array size of 500MB


List of Figures

2.1 Single Instruction Single Data (Reproduced from Blaise Barney, LLNL)
2.2 Single Instruction Multiple Data (Reproduced from Blaise Barney, LLNL)
2.3 Multiple Instruction Single Data (Reproduced from Blaise Barney, LLNL)
2.4 Multiple Instruction Multiple Data (Reproduced from Blaise Barney, LLNL)
2.5 Distributed memory architecture (Reproduced from Blaise Barney, LLNL)
2.6 Shared Memory UMA architecture (Reproduced from Blaise Barney, LLNL)
2.7 Shared Memory NUMA architecture (Reproduced from Blaise Barney, LLNL)
2.8 Hybrid Distributed-Shared Memory architecture (Reproduced from Blaise Barney, LLNL)
2.9 Moore's law for power consumption (Reproduced from Wu-chun Feng, LANL)

3.1 GRCooling four-rack CarnotJet system at Midas Networks (source: GRCooling)
3.2 Google data-centre in Finland, next to the Finnish gulf (source: Google)
3.3 NATO ammunition depot at Rennesøy, Norway (source: Green Mountain Data Centre AS)
3.4 Projected power demand of a supercomputer (M. Kogge)

4.1 OpenRD board SoC with ARM (Marvell 88F6281) (Cantanko)
4.2 Intel D525 board with Intel Atom dual-core
4.3 IBM's BlueGene/Q 16-core compute node (Timothy Prickett Morgan, The Register)
4.4 Pipelined MIPS, showing the five stages (instruction fetch, instruction decode, execute, memory access and write back) (Wikimedia Commons)
4.5 Motherboard with Loongson 2G processor (Wikimedia Commons)

5.1 Power measurement setup

6.1 The seven-node cluster that was built as part of this project
6.2 Cluster connectivity

7.1 Power readings over time
7.2 CoreMark results for 1 million iterations
7.3 CoreMark results for 1 thousand iterations
7.4 CoreMark results for 2 million iterations
7.5 CoreMark results for 1 million iterations utilising 1 thread per core
7.6 CoreMark performance for 1, 2, 4, 6 and 8 cores per system
7.7 CoreMark performance speedup per system
7.8 CoreMark performance on Intel Xeon
7.9 Power consumption over time while executing CoreMark
7.10 HPL results for large problem size, calculated with ACT's script
7.11 HPL results for problem size 80% of the system memory
7.12 HPL results for N=500
7.13 HPL total power consumption for N equal to 80% of memory
7.14 HPL total power consumption for N calculated with ACT's script
7.15 HPL total power consumption for N=7296
7.16 HPL total power consumption for N=500
7.17 Power consumption over time while executing HPL
7.18 STREAM results for 500MB array size
7.19 STREAM results for 3GB array size
7.20 Power consumption over time while executing STREAM
7.21 Power consumption with 3.5" HDD and 2.5" SSD


Listings

2.1 Assembly on RISC
2.2 Assembly on CISC


Acknowledgements

I would like to thank my supervisors Mr Sean McGeever and Dr. Lorna Smith. Their guidance and help throughout the project were of great value and contributed greatly to the successful completion of this project.


Chapter 1

Introduction

As computer systems continue to evolve, power is becoming more and more of a constraint in modern systems, especially those targeted at supercomputing and HPC in general. The demand for continuously increasing performance requires additional processors per board, where electrical power and heat become limiting factors. This is discussed in detail in DARPA's Exascale Computing study [1]. For the last few years there has been increasing interest in the use of GPUs in HPC, as they offer FLOP-per-Watt performance far greater than standard CPUs. Designing power-limited systems can have a negative effect on the delivered application performance, due to less powerful processors and designs not suited to the required tasks, and as a consequence reduces the scope and effectiveness of such systems. For the upcoming Exascale systems this is going to be a major issue. New design approaches need to be considered, exploiting low-power architectures and technologies that can deliver acceptable performance for HPC and other scientific applications at reasonable and acceptable power levels.

The Green500 [6] list argues that the goal for high-performance systems over the past decades has been to increase performance relative to price. Increasing performance, and speedup as a consequence, does not necessarily mean that the system is efficient. SSS reports that "from the early 1990s to the early 2000s, the performance of our n-body code for galaxy formation improved by 2000-fold, but the performance per watt only improved 300-fold and the performance per square foot only 65-fold. Clearly, we have been building less and less efficient supercomputers, thus resulting in the construction of massive data-centers, and even, entirely new buildings (and hence, leading to an extraordinarily high total cost of ownership). Perhaps a more insidious problem to the above inefficiency is that the reliability (and usability) of these systems continues to decrease as traditional supercomputers continue to follow Moore's Law for Power Consumption." [8] [9].

Up to now, chip vendors have been following Moore's law [8]. When more than one core is incorporated within the same chip, the clock speed per core is decreased. This is not an issue, as two cores with a reduced clock speed give better performance than a single chip with a relatively higher clock speed.


Decreasing the clock speed decreases the electrical power needed, as well as the corresponding heat produced within the chip. This concept is followed in most modern multi-core chips. The idea behind low-power HPC stands on the same ground: a significant number of low-power, low electricity consumption chips and systems can be clustered together. This could deliver the performance required by HPC and other scientific applications in an efficient manner, in terms of both application performance and energy consumption.

Putting together nodes with low-power chips will not solve the problem right away. As these architectures are not widely used in the HPC field, the required tools, mainly compilers and libraries, might not be available or supported. An effort may be required to port them to the new architectures. Even with the tools in place, the codes themselves may require porting and optimisation as well, in order to exploit the underlying hardware.

From a management perspective, every megawatt of reduced power consumption means savings of around $1M per year for large supercomputers, as the IESP Roadmap reports [2]. The IESP Roadmap also reports that high-end servers (which are also used to build HPC clusters) were estimated to consume 2% of North American power as of 2006. The same report mentions that IDC (International Data Corporation) estimates that HPC systems will become the largest fraction of the high-end server market. That means the impact of the electrical power required by such systems needs to be reduced [2].

In this project we designed and built a hybrid cluster, investigating, measuring and evaluating the performance of low-power CPUs, such as the Intel Atom and ARM, against a commodity Intel Xeon CPU of the kind found in standard HPC and data-centre clusters. Three main factors are considered: computational performance and efficiency, power efficiency and porting effort.

1.1 Report organisation

This dissertation is organised in three main groups of chapters. The first group includes chapters 2 to 4, presenting background material and the literature and technology reviews. Chapter 5 can be considered a group of its own, discussing the benchmark suites and the benchmarks considered and used, the power measurement techniques and methods, and the experimentation process used throughout the project. The third group includes chapters 6 to 9, discussing the design and deployment of our hybrid low-power cluster, the results and analysis of the experiments that were conducted, suggestions for future work and, finally, conclusions on the project.


Chapter 2

Background

In this chapter we compare RISC and CISC systems, present the system and memory architectures that can be found in HPC, and explain what each one means. In addition, we discuss the power issues in modern HPC systems and how energy efficiency relates to application efficiency.

2.1 RISC versus CISC

The majority of modern commodity processors, which are also used within the field of HPC, implement the CISC (Complex Instruction Set Computing) architecture. However, the need for energy efficiency, lower cost, multiple cores and scaling is leading to a simplification of the underlying architectures, requiring hardware vendors to develop energy-efficient, high-performance RISC (Reduced Instruction Set Computing) processors.

RISC emphasises a simple instruction set made of highly optimised, single-clock instructions and a large number of general-purpose registers. That is a better match to integrated circuits and compiler technology than complex instruction sets [3] [4]. Complex operations can be synthesised by the compiler from simple instructions, minimising the need for additional transistors. This leads to an emphasis on software, with more of the available transistors used as registers.

For instance, multiplying two variables in assembly language and storing the result in the first variable (i.e. a = a * b) would look like the following on a RISC system (assuming 2:3 and 5:2 are memory locations).

Listing 2.1: Assembly on RISC

    LOAD  A, 2:3
    LOAD  B, 5:2
    PROD  A, B
    STORE 2:3, A


Each operation - LOAD, PROD, STORE - is executed in one clock cycle, so the whole sequence of four instructions takes four clock cycles. Due to the simplicity of the operations, however, the processor completes the task relatively quickly.

CISC design assumes that hardware is always faster than software, and that a multi-clock complex instruction set, implemented with additional transistors in the processor, can deliver better performance. It also minimises the number of assembly lines: each instruction can perform several low-level operations, as opposed to RISC processors where each instruction performs a single simple operation per clock cycle. CISC places the emphasis on hardware by spending additional transistors on implementing complex instructions.

Within a CISC system, the multiplication example above requires a single line of assembly code.

Listing 2.2: Assembly on CISC

    MULT 2:3, 5:2

In this case, the system must support an additional instruction, MULT. This is a complex instruction that performs the loads, the multiplication and the store directly in hardware, without the need to specify the LOAD and STORE instructions explicitly. However, due to its complexity the instruction takes several clock cycles, so the overall execution time is approximately the same as on RISC. When thinking of large codes with intensive computation running on supercomputers with thousands of cores, the additional transistors needed to handle complex instructions can create power and heat issues and add significantly to the energy demands of the systems themselves.
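For reference, the source-level operation behind Listings 2.1 and 2.2 is just an in-place multiplication. The tiny C fragment below is purely illustrative (it is not taken from the dissertation): a compiler for a RISC target lowers it to the explicit load/multiply/store sequence of Listing 2.1, while a CISC target can encode the same work as a single memory-to-memory multiply such as MULT.

    /* Source-level view of Listings 2.1 and 2.2: a = a * b. A RISC compiler
       emits explicit load, multiply and store instructions, while a CISC ISA
       can encode the same work as one complex instruction. */
    void multiply_in_place(int *a, const int *b)
    {
        *a = *a * *b;
    }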

Modern RISC processors have become more complex than the early designs. They implement additional, more complex instructions and can execute two instructions per clock cycle. However, when comparing modern RISC against modern CISC processors, the differences in complexity and architectural design still exist, producing differences in both performance and energy consumption.

2.2 HPC Architectures

In this section we present the different architectures in terms of systems and memory. Both RISC and CISC processors can belong to any of the architectures discussed below.

2.2.1 System architectures

High performance computing and parallel architectures were first classified by Michael J. Flynn. Flynn's taxonomy^1 defines four classes of architecture based on the instruction and data streams. These classes are:

^1 IEEE Trans. Comput., vol. C-21, no. 9, pp. 948-960, Sept. 1972.


• SISD - Single Instruction Single Data

• SIMD - Single Instruction Multiple Data

• MISD - Multiple Instruction Single Data

• MIMD - Multiple Instruction Multiple Data

Single Instruction Single Data: This classification defines a serial system that does not provide any form of parallelism for either stream (instruction or data). A single instruction stream is executed on a single clock cycle, and a single data stream is used as input to an instruction on a single clock cycle. Systems that belong to this group are old mainframes, workstations and standard single-core personal computers.

Figure 2.1: Single Instruction Single Data (Reproduced from Blaise Barney, LLNL).

Single Instruction Multiple Data: This classification defines a type of parallel processing where each processor executes the same set of instructions on a different stream of data on every clock cycle. Each instruction is issued by the front-end, and each processor can communicate with any other processor but has access only to its own memory. Array and vector processors, as well as GPUs, belong to this group.

Multiple Instruction Single Data: This classification defines the most uncommon parallel architecture, where multiple processors execute different instruction streams on the same data stream on every clock cycle. This architecture can be used for fault tolerance, where different systems working on the same data stream must report the same results.

Multiple Instruction Multiple Data: This classification defines the most common parallel architecture used today. Modern multi-core desktops and laptops fall within this category. Each processor executes a different instruction stream on a different data stream on every clock cycle.

2.2.2 Memory architectures

There are two main memory architectures that can be found within HPC systems: distributed memory and shared memory. An MIMD system can be built with either memory architecture.


Figure 2.2: Single Instruction Multiple Data (Reproduced from Blaise Barney, LLNL).

Figure 2.3: Multiple Instruction Single Data (Reproduced from Blaise Barney, LLNL).

Figure 2.4: Multiple Instruction Multiple Data (Reproduced from Blaise Barney, LLNL).



Distributed memory: In this architecture, each processor has its own local memory, apart from caches, and each processor is connected to every other processor via an interconnect. This requires the processors to communicate via the message-passing programming model. This memory architecture enables the development of Massively Parallel Processing (MPP) systems; examples include the Cray XT6, IBM BlueGene and any Beowulf cluster. Each processor acts as an individual system, running its own copy of the operating system. The total memory size can be increased by adding more processors and, in theory, can grow to any size. However, performance and scalability rely on an appropriate interconnect, and the architecture introduces system management overhead.

Figure 2.5: Distributed memory architecture (Reproduced from Blaise Barney, LLNL).

Shared memory: In this architecture, each processor has access to a global shared memory. Communication between the processors takes place via writes and reads to memory, using the shared-variable programming model. The most common architecture of this type is Symmetric Multi-Processing (SMP), which can be divided into two variants: Uniform Memory Access (UMA) and Non-Uniform Memory Access (NUMA). In a UMA system, a single SMP machine, every processor has equal access time to the global memory; a NUMA system is made by physically linking two or more SMP systems, where each system can directly access the memory of the others but with non-uniform access times. The processors do not require message passing but an appropriate shared-memory programming model. Example systems include IBM and Sun HPC servers and any multi-processor PC or commodity server. The system appears as a single machine to the external user and runs a single copy of the operating system. Scaling the number of processors within a single system is not trivial, as memory access becomes a bottleneck.

Hybrid Distributed-Shared Memory: This could be characterised as the most common memory architecture used in supercomputers and other clusters today. It employs both distributed and shared memory and is usually built by interconnecting multiple SMP nodes: within each node memory is shared (typically in a UMA fashion), while each node has direct access only to its own memory and must send explicit messages to the other nodes in order to communicate.
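As a concrete illustration of the hybrid model described above, the sketch below combines message passing between nodes with shared-memory threading within a node. It is illustrative only, not code from this project, and assumes an MPI library and an OpenMP-capable C compiler are available (e.g. built with mpicc -fopenmp).

    /* Hybrid MPI + OpenMP sketch: one MPI process per node communicates by
       message passing, while OpenMP threads share that node's memory.
       Illustrative only; not taken from the project's benchmark codes. */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double local_sum = 0.0;

        /* Shared-memory parallelism within the node. */
        #pragma omp parallel reduction(+:local_sum)
        {
            local_sum += omp_get_thread_num() + 1.0;  /* stand-in for real work */
        }

        /* Distributed-memory communication between nodes. */
        double global_sum = 0.0;
        MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0,
                   MPI_COMM_WORLD);

        if (rank == 0)
            printf("global sum over %d processes: %f\n", size, global_sum);

        MPI_Finalize();
        return 0;
    }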


Figure 2.6: Shared Memory UMA architecture (Reproduced from Blaise Barney, LLNL).

Figure 2.7: Shared Memory NUMA architecture (Reproduced from Blaise Barney, LLNL).

Figure 2.8: Hybrid Distributed-Shared Memory architecture (Reproduced from Blaise Barney, LLNL).


2.3 Power issues in modern HPC systems

Modern HPC systems and clusters are usually built from commodity multi-core systems. Connecting such systems with a fast interconnect can create supercomputers and offers the platforms desired by scientists and prospective HPC users. The increase in speed is mainly achieved by increasing the number of cores within each system, while dropping the clock frequency of each core, and by increasing the number of systems in each cluster. The main issue with the CPU technology used today is that it is designed without power efficiency in mind, solely following Moore's law for theoretical performance. While this has worked for Petascale systems that use such processors, it is a challenge for the design, building and deployment of supercomputers that need to achieve Exascale performance.

In order to address, and bypass up to a point, the power issues with current technology, the use of GPUs is increasing, as they offer better FLOP-per-Watt performance. Physicists, among others, suggest that Moore's law will gradually cease to hold true around 2020 [3]. That introduces the need for a new technology and design in CPUs, as supercomputers will no longer be able to rely on Moore's law to increase their theoretical peak performance. Alan Gara of IBM says that "the biggest part (of energy savings) comes from the use of a new processor optimised for energy efficiency rather than thread performance". He continues that in order to achieve that, "a different approach needs to be followed for building supercomputers, and that is the use of scalable, energy-efficient processors". More experts have addressed the power issues in a similar manner. Pete Beckman of ANL argues that "the issue of electrical power and the shift to multiple cores will dramatically change the architecture and programming of these systems". Bill Dally, chief scientist at NVIDIA, states that "an Exascale system needs to be constructed in such a way that it can run in a machine room with a total power budget not higher than what supercomputers use today". This can be achieved by improving the energy efficiency of the computing resources, closing the gap and reaching Exascale computing at acceptable levels.

The CPU is not the only major power consumer in modern systems: memory, communications and storage add greatly to the overall power consumption of a system. Memory transistors are charged every time a specific memory cell needs to be accessed. On commodity systems, memory chips are independent components, separate from the processor (RAM, not cache memory). This increases the power cost, as an additional memory interface and bus are needed for communication between the memory and the processor. Embedded devices follow the concept of System-on-Chip (SoC), where all components are part of the same module, reducing distances and interfaces, and hence power.

Communication between nodes, as opposed to communication between components of a single node, requires power as well. The longer the distance between systems, the more power is needed to drive the signal between them. Optical and serial links are already used to make communication faster and more efficient, which partly addresses the power issue.



Figure 2.9: Moore's law for power consumption (Reproduced from Wu-chun Feng, LANL).

On the other hand, the larger the systems become, the more communication they need. It is important to keep the distance between the independent nodes as short as possible. Decreasing the size of each node and keeping the extremes of a cluster close together could significantly reduce power needs and costs.

Commodity storage devices such as Hard Disk Drives are the most common within HPC clusters, due to their simplicity, easy maintainability and relatively low cost. The target is to get a faster interconnect between the nodes and the storage pools, rather than to replace the storage devices themselves. High I/O is not very common in HPC, but it is very common in specific science fields that use HPC resources, such as astronomy, biology and the geosciences, which tend to work with enormous data-sets. Such data-intensive use-cases will increase storage demands in terms of capacity, performance and power. HDDs of smaller physical size and SSDs (Solid State Drives) are becoming more common in data-intensive research and applications.

2.4 Energy and application efficiency

The driving force behind building new systems until very recently, and still for most vendors, has been to achieve the highest clock speed possible, following Moore's law. However, it has been pointed out that around 2020 Moore's law will gradually cease to hold and a replacement technology will need to be found.


Transistors will be so small that quantum or atomic physics will take over and electrons will leak out of the wires [5]. Even with today's systems, Moore's law does not guarantee application efficiency, and of course it does not imply energy efficiency as overall clock speeds increase. On the contrary, application efficiency follows May's law^2, which states that software efficiency halves every 18 months, compensating for Moore's law. The main reason behind this is that every new generation of hardware introduces new, complex hardware optimisations, handled by the compiler, and compilers come up against an efficiency barrier with parallel computing. These two issues, especially that of energy efficiency, can be considered the biggest constraints on the design and development of acceptable Exascale systems in terms of performance, efficiency, consumption and cost. To address this, HPC vendors and institutes have started using GP-GPUs (General Purpose Graphics Processing Units) within supercomputers, to achieve high performance without adding extra high-power commodity processors, leading to hybrid supercomputers. The fastest supercomputer in the world today is a RISC system, the K computer of the RIKEN Advanced Institute for Computational Science (AICS) in Japan, which uses SPARC64 processors and delivers a performance of 8.62 petaflops (a petaflop being 1,000 trillion floating-point calculations per second). This system consumes 9.89 megawatts. The second fastest supercomputer, the Tianhe-1A of the National Supercomputing Center in Tianjin, China, is a hybrid machine which achieves 2.56 petaflops and consumes 4.04 megawatts. This is achieved by combining commodity CPUs, Intel Xeon, with NVIDIA GPUs. These numbers clearly show the difference that GPUs can make in terms of power consumption for large systems.

GPUs are able to execute specially ported code in much less time than standard CPUs, mainly due to their large number of cores and their design simplicity, delivering better performance per Watt. While a GPU can cost more overall in terms of power, it performs the operations very quickly, so that over time it recovers the cost and proves to be both more energy and application efficient than standard CPUs. In addition, it takes the processing load off the processor, reducing the energy demands on the standard CPU. Low-power processors and low-power clusters follow the same concept by using a large number of cores with the simplicity of reduced instruction sets. We can also hypothesise, based on the increased use of GPUs and the porting of applications to these platforms, that in the future the programming models for GPUs will spread even further and GPUs will become easier to program. In that case, the standard CPU could play the role of data distributor to the GPUs, with low-power CPUs being the most suitable candidates for such a task as they will not need to undertake computationally intensive jobs.

From a power consumption perspective, the systems mentioned earlier consume 9.89 and 4.04 megawatts for the K computer and Tianhe-1A respectively. The K computer is listed in 6th position on the Green500 list. The most power-efficient supercomputer, the IBM BlueGene/Q Prototype 2 hosted at NNSA/SC, consumes 40.95 kW and achieves 2097.19 MFLOPs per Watt. It is listed in 110th position in the TOP500 list, delivering 85.9 TFLOPs when executing the Linpack benchmark.
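The MFLOPs-per-Watt figure quoted above is simply delivered floating-point performance divided by total power draw. The minimal C fragment below checks the BlueGene/Q Prototype 2 number using the values quoted in this section (85.9 TFLOPs of Linpack performance at 40.95 kW); the variable names are illustrative only, and the result (roughly 2098) matches the 2097.19 MFLOPs per Watt quoted above to within rounding of the published figures.

    #include <stdio.h>

    int main(void)
    {
        /* Figures quoted above for IBM BlueGene/Q Prototype 2. */
        double linpack_tflops = 85.9;   /* delivered Linpack performance */
        double power_kw      = 40.95;   /* total power draw */

        /* Green500-style metric: MFLOPs per Watt. */
        double mflops = linpack_tflops * 1.0e6;   /* 1 TFLOP = 10^6 MFLOPs */
        double watts  = power_kw * 1.0e3;
        printf("%.2f MFLOPs/Watt\n", mflops / watts);   /* ~2097.68 */

        return 0;
    }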

^2 May's Law and Parallel Software - http://www.linux-mag.com/id/8422/


Chapter 3

Literature review

In this chapter we look into projects related to low-power computing that have built and benchmarked low-power clusters.

3.1 Green500

The Green500 list is a re-ordering of the well-known TOP500 list, ranking the most energy-efficient supercomputers. Green500 raises awareness about power consumption, promotes alternative total-cost-of-ownership performance metrics, and seeks to ensure that supercomputers only simulate climate change rather than create it [6]. Green500 was started in April 2005 by Dr. Wu-chun Feng at the IEEE IPDPS Workshop on High-Performance, Power-Aware Computing.

3.2 Supercomputing in Small Spaces (SSS)

The SSS project was started in 2001 by Wu-chun Feng, Michael S. Warren and Eric H. Wiegle, aiming at low-power architectural approaches and power-aware, software-based approaches. In 2002, the SSS project deployed the Green Destiny cluster, a 240-node system consuming 3.2 kW, which placed it at #393 on the TOP500 list at the time.

The SSS project has been making it clear that traditional supercomputers need to stop following Moore's law for power consumption. Modern systems have been becoming less and less efficient, following May's law, which states that software efficiency halves every 18 months. The project supports this with the observation that "from the early 1990s to the early 2000s, the performance of our n-body code for galaxy formation improved by 2000-fold, but the performance per watt only improved 300-fold and the performance per square foot only 65-fold" [9].


3.3 The AppleTV Cluster

A research team at the Ludwig-Maximilians University in Munich, Germany, has built and experimented with a low-power ARM cluster made of AppleTV devices, the AppleTV Cluster. They also evaluated another ARM-based system, a BeagleBoard xM [28]. The team used CoreMark, High Performance Linpack, Membench and STREAM to measure the CPU (serial and parallel) and memory performance of each system. The CoreMark benchmark scored 1920 and 2316 iterations per second on the BeagleBoard xM and AppleTV respectively. On the HPL benchmark, the systems achieved 22.6 and 57.5 MFLOPs respectively in single precision, and 29.3 and 40.8 MFLOPs in double precision. NEON acceleration (128-bit registers) on the BeagleBoard allowed it to achieve 33.8 MFLOPs in single precision.

In terms of memory performance, the team reports copy rates of 481.1 and 749.8 MB/s for the BeagleBoard xM and AppleTV respectively. The researchers state that a modern Intel Core i7 CPU with 800MHz DDR2 RAM (the same frequency and technology as in the ARM systems used) can deliver more than ten times the reported bandwidth [28].

The power consumption of the AppleTV cluster, which achieves an overall system performance of 160.4 MFLOPs, is 10 Watts for the whole cluster when executing the HPL benchmark and 4 Watts when idle. That results in 16 MFLOPs per Watt when fully executing the benchmark.

3.4 Sony Playstation 3 Cluster

Researchers at North Carolina State University have built a Sony PS3 cluster^3. The Sony PS3 uses an eight-core Cell Broadband Engine processor at 3.2 GHz and 256MB of XDR RAM, suitable for SMP and MPI programming. The 9-node cluster ran a PowerPC version of Fedora Linux and achieved a total of 218 GFLOPs and 25.6 GB/s memory bandwidth. The researchers do not state any power consumption measurements. However, the power consumption of Sony PS3 consoles varies from 76 Watts up to 200 Watts in normal use, and the consoles ship with a 380 Watt power supply. The processor varies in feature size from a 90nm Cell CPU down to a 45nm Cell.

^3 Sony PS3 Cluster - http://moss.csc.ncsu.edu/~mueller/cluster/ps3/


3.5 Microsoft XBox Cluster

Another research team, at the University of Houston, has built a low-cost computer cluster from unmodified XBox game consoles^4. The Microsoft XBox comes with an Intel Celeron/P3 733 MHz processor and 64MB of DDR RAM. The 4-node cluster achieved a total of 1.4 GFLOPs when executing High Performance Linpack on Debian GNU/Linux, consuming between 96 and 130 Watts. That gives a range of 10.7 to 14.58 MFLOPs per Watt. The cluster supported MPI and the Intel C++ and Fortran compilers.

3.6 IBM BlueGene/Q

In terms of high-end supercomputing projects, the IBM BlueGene/Q prototype machines aim at designing and building energy-efficient supercomputers based on embedded processors. On the latest Green500 list (June 2011), the BlueGene/Q Prototype 2 is listed as the most energy-efficient system, achieving a total of 85880 GFLOPs overall performance. That translates to 2097.19 MFLOPs per Watt, as it consumes 40.95 kW. The second most energy-efficient entry belongs to the BlueGene/Q Prototype 1, achieving 1684.20 MFLOPs per Watt. The BlueGene/Q is not yet available on the market.

3.7 Less Watts

Rising concerns over power efficiency, the desire to cut power costs and the drive to reduce overall CO2 emissions have pushed software vendors to look into saving power at the software level. The Open Source Technology Center of Intel Corporation has established an open source project, LessWatts.org, that aims to save power with Linux on Intel platforms. The project focuses on end users, developers and operating system vendors by delivering the components and tools needed to reduce the energy required by the Linux operating system^6. It targets desktops, laptops and commodity servers and achieves power savings by enabling, or disabling, specific features in the Linux kernel.

3.8 Energy-efficient cooling

Apart from the considerations and research into reducing the overall energy of a system by using energy-efficient processors, research has been done and solutions produced for reducing the cooling needs of clusters and data-centres, which require huge amounts of power in total, including both the power needed for the systems themselves and for the cooling infrastructure.

^4 Microsoft XBox Cluster - http://www.bgfax.com/xbox/home.html
^6 Less Watts. Saving Power with Linux - http://www.lesswatts.org/


The main driving force behind such methods is the growing cost of keeping large systems and clusters at the correct temperature. HPC clusters require sophisticated and effective cooling infrastructure as well; such infrastructure might use more energy than the computing systems themselves. These new cooling systems do not solve the issues of heat within the processor, the efficiency of a system or its scalability to perform beyond a petaflop. They do, however, provide an environmentally friendly cooling infrastructure, cutting maintenance costs and overall energy demands for large clusters, similar to those of supercomputers.

3.8.1 Green Revolution Cooling

Green Revolution Cooling is a US-based company that offers cooling solutions for data-centres. They use a fluid submersion technology, GreenDEF, that reduces the cooling energy used by clusters by 90-95% and server power usage by 10-20% [19]. While these figures are interesting for commodity servers, and even more so for cooling systems, such approaches do not target the power efficiency of the processor architecture or the baseline power needs of the systems. These solutions can be used with existing or future HPC clusters in order to achieve an overall low-power, environmentally friendly infrastructure.

Figure 3.1: GRCooling four-rack CarnotJet system at Midas Networks (source: GRCooling).

3.8.2 Google Data Centres

Google has been investing in smart, innovative and efficient designs for the large data-centres used to provide web services to millions of users. Two of their data-centres in Europe, one in Belgium and one in Finland, use no air conditioning or chiller systems.


Instead, they cool the systems using natural resources such as air temperature and water. In Belgium, the average air temperature is lower than the average temperature that cooling systems provide to data-centres, so it can be used to cool the systems. Moreover, as the data-centre is close to an industrial canal, canal water is purified and used for cooling. In Finland, the facility is built next to the Gulf of Finland, making it possible to use the low temperature of the sea water to cool the data-centre [20].

Figure 3.2: Google data-centre in Finland, next to the Finnish gulf (source: Google).

3.8.3 Nordic Research

Institutions, as well as industry, in Scandinavia and Iceland are investigating green, energy-efficient solutions to support large HPC and data-centre infrastructure at the lowest cost and with reduced CO2 emissions. To achieve this, projects aim to exploit abandoned mines (the Lefdal Mine Project) [19], a retired NATO ammunition depot built into mountain halls (Green Mountain Data Centre AS) [22], and new data-centres designed for remote mountain locations close to hydro-electric power plants, providing natural cooling and green energy [23].

A new initiative has been signed between DCSC (Denmark), UNINETT Sigma (Norway), SNIC (Sweden) and the University of Iceland for a Nordic supercomputer to operate in Iceland later in 2011. Iceland was chosen as its climate offers suitable natural resources for cooling such a computing infrastructure; the country produces 70% of its electricity from hydro power, 29.9% from geothermal and only 0.1% from fossil fuels [24].


Figure 3.3: NATO ammunition depot at Rennesøy, Norway (source: Green Mountain Data Centre AS).

3.9 Exascale

The increasing number of computationally intensive problems and applications, such as weather prediction, nuclear simulation or the analysis of space data, has created the need for new computing facilities targeting Exascale performance. The IESP defines Exascale as "a system that is taken to mean that one or more key attributes of the system has 1,000 times the value of what an attribute of a Petascale system of 2010 has". Building Exascale systems with current technological trends would require huge amounts of energy, among other things such as storage rooms and cooling, to keep them running. Wilfried Verachtert, high-performance computing project manager at the Belgian research institute IMEC, argues that "the power demand for an Exascale computer made using today's technology would keep 14 nuclear reactors running. There are a few very hard problems we have to face in building an Exascale computer. Energy is number one. Right now we need 7,000MW for Exascale performance. We want to get that down to 50MW, and that is still higher than we want."

There are two main approaches being investigated for the design and building of Exascale systems, the Low-power Architectural Approach and the Project Aware, Software-based Approach [10], both of which are still at the prototype level.

• Low-power, Architectural Approach: This is the approach we have chosen to follow in this project. Low-power, energy-efficient processors replace the standard commodity, high-power processors used in HPC clusters up to now. Using energy-efficient processors would enable system engineers to build larger systems, with larger numbers of processors, in order to achieve Exascale performance at acceptable levels.


IBM's BlueGene/Q Prototype 2 is currently the most energy-efficient low-power supercomputer of its size, using low-power PowerPC processors [10].

The same architectural approach can be followed for other parts of the hardware: energy-efficient storage devices, efficient high-bandwidth networking and appropriate power supplies can all decrease the total footprint of each system.

• Project Aware, Software-based Approach: Many systems researchers suggest that the low-power architectural approach sacrifices too much performance, to a degree unacceptable for HPC applications. A more architecture-independent approach is therefore suggested. This involves the use of high-power CPUs that support dynamic voltage and frequency scaling (DVFS), which allows the design and programming of algorithms that conserve power by scaling the processor voltage and frequency up and down as needed by the application [10] (a minimal sketch of driving this mechanism from Linux follows below).
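The sketch below shows one way software can request a lower clock frequency on Linux, through the kernel's cpufreq sysfs interface. It is a generic illustration rather than project code, and assumes the "userspace" governor is available, that the process has permission to write these files, and that the 800000 kHz value is purely illustrative.

    /* Minimal sketch of software-driven frequency scaling on Linux via the
       cpufreq sysfs interface. Assumes the "userspace" governor is available
       and the process may write these files; 800000 kHz is illustrative. */
    #include <stdio.h>

    static int write_sysfs(const char *path, const char *value)
    {
        FILE *f = fopen(path, "w");
        if (!f) {
            perror(path);
            return -1;
        }
        fprintf(f, "%s\n", value);
        fclose(f);
        return 0;
    }

    int main(void)
    {
        /* Hand control of CPU0's frequency to userspace... */
        write_sysfs("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor",
                    "userspace");
        /* ...and request a lower clock (in kHz) during a non-critical phase. */
        write_sysfs("/sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed",
                    "800000");
        return 0;
    }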

The approach chosen for this project is the low-power architectural approach, as it enables the design and building of reliable, efficient HPC systems of any size and does not require any significant change to existing parallel algorithms and code. In specific designs and use-cases, a hybrid approach (a combination of the two) might be the golden mean between acceptable performance, power consumption, efficiency and reliability.

Figure 3.4 below presents the projected power demands of supercomputers from 2006 to 2020. Given that the graph was compiled in 2010 and that its 2011 predictions match the current TOP500 systems, we can trust the predicted power demands of supercomputers over time, allowing for some deviation. This justifies the need for energy-efficient supercomputers.

Figure 3.4: Projected power demand of a supercomputer (M. Kogge)


Chapter 4

Technology review

In this chapter we examine the most developed and most likely low-power processor candidates for HPC.

4.1 Low-power Architectures

Low-power processors are not a new trend in the processor business. They are, however, a new necessity in modern computer systems, especially supercomputers. Energy-efficient processors have been used for many years in embedded systems as well as in consumer electronic devices, and systems in the HPC field have long used low-power RISC processors such as Sun's SPARC and IBM's PowerPC. In this section we look into the most likely low-power processor candidates for future supercomputing systems.

4.1.1 ARM

ARM processors are widely used in portable consumer devices, such as mobile phones and handheld organisers, as well as in networking equipment and other embedded devices such as the AppleTV. Modern ARM cores, such as the Cortex-A8 (single core, ranging from 600MHz to 1.2GHz), the Cortex-A9 (single-core, dual-core and quad-core versions with clock speeds up to 2GHz) and the upcoming Cortex-A15 (dual-core and quad-core versions, ranging from 1GHz to 2.5GHz), are 32-bit processors using 16 registers and designed around the Harvard memory model, in which the processor has two separate memories, one for instructions and one for data. This allows two simultaneous memory fetches. As ARM cores are RISC cores, they implement the simple load/store model.

The latest ARM processor in production, and available in existing systems, is the ARM Cortex-A9, which uses the ARMv7 architecture, ARM's first-generation superscalar architecture.


It is the highest-performance ARM processor, designed around an advanced, high-efficiency, dynamic-length, multi-issue superscalar, out-of-order, speculating 8-stage pipeline. The Cortex-A9 delivers high levels of performance and power efficiency with the functionality required for leading-edge products across a broad range of systems [9] [10]. It comes in both multi-core (MPCore) and single-core versions, making it a promising alternative for low-power HPC clusters. What ARM cores lack is a 64-bit address space, as they support only 32-bit addressing. The recent Cortex-A9 comes with an optional NEON media and floating-point processing engine, aiming to deliver higher performance for the most intensive applications, such as video encoding [11].

The Cortex-A8 also uses the ARMv7 architecture, but implements a 13-stage integer pipeline and a 10-stage NEON pipeline. NEON support is used to accelerate multimedia and signal-processing applications; its default presence in the Cortex-A8 follows from the fact that this processor is mainly designed for embedded devices. However, NEON can also be used as an accelerator for processing multiple data elements with a single instruction; this enables the ARM to perform four multiply-accumulates per cycle via dual-issued instructions to two pipelines [11]. NEON supports 64-bit and 128-bit registers and can operate on both integer and floating-point data.
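To illustrate the SIMD style of processing NEON provides, the fragment below multiply-accumulates four single-precision values at a time using the standard arm_neon.h intrinsics. It is a generic sketch, not code from this project, and assumes a compiler targeting an ARM core with NEON enabled (e.g. GCC with -mfpu=neon).

    /* SIMD multiply-accumulate with NEON intrinsics: y[i] += a * x[i],
       four single-precision lanes per operation. Generic sketch, assuming a
       NEON-enabled toolchain; not taken from the project code. */
    #include <arm_neon.h>

    void saxpy_neon(float *y, const float *x, float a, int n)
    {
        float32x4_t va = vdupq_n_f32(a);          /* broadcast scalar a */
        int i;
        for (i = 0; i + 4 <= n; i += 4) {
            float32x4_t vx = vld1q_f32(x + i);    /* load 4 floats */
            float32x4_t vy = vld1q_f32(y + i);
            vy = vmlaq_f32(vy, va, vx);           /* vy += va * vx */
            vst1q_f32(y + i, vy);                 /* store 4 floats */
        }
        for (; i < n; i++)                        /* scalar tail */
            y[i] += a * x[i];
    }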

Commercial server manufacturers are already shipping low-power servers with ARM cores. A number of low-cost, low-power ARM boxes and development boards are also available on the market, such as the OpenRD, DreamPlug, PandaBoard and BeagleBoard. Moreover, NVIDIA has announced Project Denver, which aims to build custom ARM-based CPU cores alongside its GPUs, targeting both personal computers and supercomputers [9].

Figure 4.1: OpenRD board SoC with ARM (Marvell 88F6281) (Cantanko).


4.1.2 Atom

Atom is Intel's low-power processor, aimed at laptops and low-cost, low-power servers and desktops, with clock speeds ranging from 800MHz to 2.13GHz. It supports both 32-bit and 64-bit registers, and being an x86-based architecture makes it one of the most suitable alternative candidates to standard high-power processors so far. Server vendors already ship systems with Atom chips, and their low price makes them very appealing for prototype low-power systems that do not require software alterations. Each instruction loaded into the CPU is translated into micro-operations that perform memory load and store operations on each ALU, extending the traditional RISC design and allowing the processor to perform multiple tasks per clock cycle. The processor has a 16-stage pipeline, where each pipeline stage is broken down into three parts: decoding, dispatching and cache access [11].

The Intel Atom processor has two ALUs and two FPUs. The first ALU handles shift operations while the second handles jumps. The FPUs are used for arithmetic operations, including integer ones: the first FPU is used for addition only, while the second handles single-instruction, multiple-data (SIMD) operations and operations involving multiplication and division. Basic operations can be executed and completed within a single clock cycle, while the processor can take up to 31 clock cycles for more complex instructions such as floating-point division. The newest models support Hyper-Threading technology, allowing parallel execution of two threads per core and presenting a dual-core system as virtually four cores [11] (the sketch below shows how this appears to software).
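As a small illustration of how Hyper-Threading appears to software, the fragment below queries the number of online logical processors on Linux. _SC_NPROCESSORS_ONLN is a glibc/POSIX extension rather than anything Atom-specific, and on a dual-core Atom with Hyper-Threading enabled it would typically report four.

    /* Report the logical processor count the OS exposes. On a dual-core Atom
       with Hyper-Threading enabled this is typically 4, matching the
       "virtually four cores" described above. */
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        long n = sysconf(_SC_NPROCESSORS_ONLN);
        printf("online logical processors: %ld\n", n);
        return 0;
    }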

Figure 4.2: Intel D525 Board with Intel Atom dual-core.


4.1.3 PowerPC and Power

PowerPC is one of the oldest low-power RISC processor families used in the HPC field and is still used in some of the world's fastest supercomputers, such as IBM's BlueGene/P. PowerPC processors are also available in standard commercial servers for general-purpose computing, not just for HPC, and support both 32-bit and 64-bit operation. PowerPC processors are mainly found in IBM systems, making them an expensive solution for low-budget projects and institutes.

The latest BlueGene/Q uses one of the latest Power processors, the A2. The PowerPC A2 is described as massively multicore and multi-threaded, with 64-bit support and clock speeds ranging from 1.4GHz to 2.3GHz. It supports up to 16 cores per processor with 4-way multi-threading, allowing simultaneous multithreading of up to 64 threads per processor [18]. Each chip has integrated memory and I/O controllers.

Figure 4.3: IBM's BlueGene/Q 16-core compute node (Timothy Prickett Morgan, The Register).

Due to its low power consumption and flexibility, the design of the A2 is used in the PowerEN (Power Edge of Network) processor, a hybrid between a networking processor and a standard server processor. This type of processor is also known as a wire-speed processor, merging characteristics of network processors, such as low-power cores, accelerators, integrated network and memory I/O, smaller memory line sizes and low total power, with characteristics of standard processors, such as full ISA cores and support for standard programming models, operating systems, hypervisors and full virtualisation. Wire-speed processors are used in applications in the areas of network processing, intelligent I/O devices, distributed computing and streaming. The architectural emphasis on power efficiency drops the power consumption to below 50% of the initial power consumption. The large number of hardware threads is able to deliver better throughput per watt than a standard CPU, but with poorer single-thread performance. Power is also minimised by operating at the lowest voltage necessary to function at a specific frequency [17].


4.1.4 MIPS

MIPS is a RISC processor that is widely used in consumer devices, most notably the Sony PlayStation (PSX) and the Sony PlayStation Portable (PSP). Being a low-power processor, its design follows the RISC principle of having all instructions complete in one cycle. It supports both 32-bit and 64-bit registers and implements the von Neumann memory architecture.

Figure 4.4: Pipelined MIPS, showing the five stages: instruction fetch, instruction decode, execute, memory access and write back (Wikimedia Commons).

Being of RISC design, MIPS uses a fixed-length, regularly encoded instruction set built around the load/store model, a fundamental concept of the RISC architecture. Arithmetic and logic operations use 3-operand instructions, enabling compilers to optimise complex expressions, and the ISA provides branch/jump options and delayed jump instructions. Floating-point registers are supported in both 32-bit and 64-bit widths, in the same way as the general-purpose registers. The absence of integer condition codes simplifies superscalar implementations. MIPS offers flexible, high-performance caches and memory management with well-defined cache control options. The 64-bit floating-point registers and the pairing of two single 32-bit floating-point operations improve overall performance and speed up specific tasks by enabling SIMD [31] [32] [33] [34].

MIPS Technologies licenses its architecture designs to third parties so that they can design and build their own MIPS-based processors. The Chinese Academy of Sciences has designed the MIPS-based Loongson processor, and Chinese institutes have started designing and building MIPS chips for their next-generation supercomputers [8]. China's Institute of Computing Technology (ICT) has licensed the MIPS32 and MIPS64 architectures from MIPS Technologies [35].

Figure 4.5: Motherboard with Loongson 2G processor (Wikimedia Commons).

Looking at the market, commercial MIPS products do not target the server market or the generic computing market, making it almost impossible to identify appropriate off-the-shelf systems for designing and building a MIPS low-power HPC cluster with the software support needed for HPC codes.


Chapter 5

Benchmarking, power measurement and experimentation

In this chapter we give a brief description of the benchmarking suites we have considered and of the benchmarks we finally ran.

5.1 Benchmark suites

5.1.1 HPCC Benchmark Suite

The HPCC suite consists of seven low-level benchmarks, reporting performance on floating-point operations, memory bandwidth and communication latency and bandwidth. The most common benchmark for measuring floating-point performance is Linpack, widely used for measuring the peak performance of supercomputer systems. While all of the benchmarks are written in C, Linpack builds upon the BLAS library, which is written in Fortran. Compiling the benchmarks successfully on the ARM architecture therefore requires the GNU version of the BLAS library, which is available in C. The HPCC benchmarks, while easy to compile and execute, do not represent a complete HPC or scientific application. They are useful for identifying the performance of a system at a low level but do not represent the performance of the system as a whole when executing a complete HPC application [14]. The HPCC benchmarks are free of cost.
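To give a flavour of the porting effort involved, the following sketch shows how such a build might be approached on the ARM nodes; the HPCC version, template makefile name and BLAS location are illustrative assumptions, not the exact files used in this project.

    # Sketch only: building HPCC against a C BLAS on ARM (names and paths illustrative)
    tar xzf hpcc-1.4.1.tar.gz && cd hpcc-1.4.1
    # Start from one of the makefile templates shipped under hpl/setup
    cp hpl/setup/Make.Linux_PII_CBLAS Make.Linux_ARM
    # Edit Make.Linux_ARM so that CC points to the MPI C compiler wrapper (mpicc),
    # and LAdir/LAlib point to a BLAS built from C sources rather than the Fortran BLAS
    make arch=Linux_ARM    # produces the hpcc binary in the top-level directory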

5.1.2 NPB Benchmark Suite

The NAS Parallel Benchmarks are developed by the NASA Advanced Supercomputing (NAS) Division. This benchmarking suite provides benchmarks for MPI, OpenMP, High-Performance Fortran and Java, as well as serial versions of the parallel codes. The suite provides 11 benchmarks, the majority developed in Fortran and only 4 written in C. Most of the benchmarks are low-level, targeting specific system operations such as floating-point operations per second, memory bandwidth and I/O performance. Examples of full applications are provided as well, for acquiring more accurate results on the performance of high-performance systems [15]. The NAS Parallel Benchmarks are free of cost.

5.1.3 SPEC Benchmarks

The Standard Performance Evaluation Corporation (SPEC) provides a large variety of benchmarks, both kernel and application benchmarks, for many different systems, including MPI and OpenMP versions. The suites of interest to the HPC community are the SPEC CPU, MPI, OMP and Power benchmarks. The majority of the benchmarks represent HPC and scientific applications, allowing the overall performance of a system to be measured.

The CPU benchmarks are designed to provide performance measurements that can be used to compare computationally intensive workloads on different computer systems. The suite provides CPU-intensive codes, stressing a system's processor, memory subsystem and compiler. It provides 29 codes, where 25 are available in C/C++ and 6 in Fortran.

The MPI benchmarks are used for evaluating MPI-parallel, floating-point, compute-intensive performance across a wide range of cluster and SMP hardware. The suite provides 18 codes, where 12 are developed in C/C++ and 6 in Fortran.

The OMP benchmarks are used for evaluating floating-point, compute-intensive performance on SMP hardware using OpenMP applications. The suite provides 11 benchmarks, with only 2 of the codes available in C and 9 in Fortran.

The Power benchmark is one of the first industry-standard benchmarks used to measure the power and performance of servers and clusters in the same way as is done for performance alone. While it allows power measurements, it does not allow the performance of an HPC or other scientific application to be observed, as it uses Java server-based codes to evaluate the system's power consumption.

5.1.4 EEMBC Benchmarks

The EEMBC (Embedded Microprocessor Benchmark Consortium) provides a wide range of benchmarks for embedded devices such as those used in networking, digital media, automotive, industrial, consumer and office equipment products. Some of the benchmarks are free of cost and open source, while others are provided under an academic or commercial licence. The benchmark suites provide codes for measuring single-core and multi-core performance, power consumption, telecom/networking performance and floating-point performance, as well as various codes for different classes of consumer electronic devices.


5.2 Benchmarks

In this section we describe the benchmarks we used to evaluate the systems in this project. These benchmarks do not represent full HPC codes, but are established and well-defined benchmarks used widely for reporting the performance of computing systems. Full HPC codes tend to take a long time to execute, which proved to be a constraint given the time available to the project. That is an additional reason behind the decision to run simpler kernel benchmarks, for which the data sets can be defined by the user.

5.2.1 HPL

We use High-Performance Linpack to measure the performance in flops of each system. HPL solves a random dense linear system in double-precision arithmetic, either on a single node or on distributed-memory systems. The algorithm used in this code uses "a 2D block-cyclic data distribution - Right-looking variant of the LU factorisation with row partial pivoting featuring multiple look-ahead depths - Recursive panel factorisation with pivot search and column broadcast combined - Various virtual panel broadcast topologies - bandwidth reducing swap-broadcast algorithm - backward substitution with look-ahead of depth 1" [16] [17]. The results outline how long it takes to solve the linear system and how many Mflops or Gflops are achieved during the computation. HPL is part of the HPCC Benchmark suite.
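As an illustration, a typical HPL run on this kind of cluster looks roughly as follows; the process count and file names are placeholders, and the problem size (N), block size (NB) and process grid (P x Q) are read from the HPL.dat file in the working directory.

    # Sketch: launching the HPL binary over MPI and extracting the reported performance
    mpiexec -n 8 ./xhpl | tee hpl.out
    # Each completed test prints a line with the solve time and the Gflops achieved,
    # followed by the residual checks that confirm the run validates
    grep -A 2 "T/V" hpl.out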

5.2.2 STREAM

STREAM is a synthetic benchmark that measures memory bandwidth and the computation rate for simple vector kernels [12]. The benchmark tests four different memory functions: copy, scale, add and triad. It reports the bandwidth in MB/s as well as the average, minimum and maximum time taken to complete each of the operations. STREAM is part of the HPCC Benchmark suite. It can be executed either in serial or in multi-threaded mode.
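For illustration, the C version of STREAM can be built and run along the following lines; the file name is a placeholder, and the macro controlling the array size (N in older releases, STREAM_ARRAY_SIZE in newer ones) may need to be adjusted in the source to match the desired memory footprint.

    # Sketch: building STREAM with OpenMP support and running it with a chosen thread count
    gcc -O3 -fopenmp stream.c -o stream
    OMP_NUM_THREADS=1 ./stream    # serial run
    OMP_NUM_THREADS=8 ./stream    # multi-threaded run; reports Copy/Scale/Add/Triad in MB/s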

5.2.3 CoreMark

The CoreMark benchmark is developed by the Embedded Microprocessor Benchmark Consortium. It is a generic, simple benchmark targeted at the functionality of a single processing core within a system. It uses a mixture of read/write, integer and control operations, including matrix manipulation, linked-list manipulation, state machine operations and cyclic redundancy checks, an operation commonly used in embedded systems. The benchmark reports how many iterations are performed in total and per second, plus the total execution time and total processor ticks. It can be executed either in serial or in multi-threaded mode, enabling hyper-threaded cores to be evaluated more effectively. CoreMark does not represent a real application; it stresses the processor's pipeline operations, memory accesses (including caches) and integer operations [26].

5.3 Power measurement

Power measurement techniques vary and can be applied at many different parts of the system. Power consumption can be measured between the power supply and the electrical socket, between the motherboard (or another hardware component) and the power supply, as well as between individual parts of the system. Initially, we want to measure the system as a whole. That will let us know which systems can be bought "off-the-shelf" on a best performance-per-Watt basis.

For our experiments we adopt the technique used by the Green500 to measure the power consumption of a system. That is, a power meter is placed between the power supply's AC input of the selected unit and a socket connected to the external power supply system. That allows us to measure the power consumption of the system as a whole. The power meter reports the power consumption of the system at any time and in any state, whether idle or running a specific code. By logging data at specific times, we can identify the power consumption at any moment required.

An alternative method of measuring the same form of power consumption is through sensor-enabled software tools installed within the operating system. That has as a prerequisite that the hardware provides the needed sensors. Of the systems we have used, the high-power Intel Xeon systems provided the necessary sensors and software, allowing us to use software tools on the host system to measure the power consumption. The low-power systems do not provide sensor support, preventing us from using software tools to gather their power consumption. Because of this, we have used external power meters on all of the systems, so that all readings are obtained by the same method and the experiments are as fair as possible.

Power measurement can also be performed on individual components of the system. That would allow us to measure specifically how much power each processor consumes without being affected by any other part of the system. With this method, we could also measure the power requirements and consumption of different parts of the system, such as the processor and the memory. While this is of great interest, and perhaps one of the best ways to qualify and quantify exactly where power is going and how it is used by each component, due to time constraints we could not invest the time and effort in this method within this project.


5.3.1 Metrics

In this project we use the same metric as the Green500 list, the "performance-per-Watt" (PPW) metric that is used to rank the energy efficiency of supercomputers. The metric is defined by the following equation:

PPW = Performance / Power    (5.1)

Performance in equation (5.1) is defined as the maximal performance achieved by the corresponding benchmark, expressed in GFLOPS (Giga FLoating-point OPerations per Second) for High Performance Linpack, MB/s (MegaBytes per Second) for STREAM and Iter/s (Iterations per Second) for CoreMark. Power in equation (5.1) is defined as the average system power consumption during the execution of each benchmark for the given problem size, expressed in Watts.
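As a worked example of the metric, taking the Intel Atom HPL figures that appear later in Table 7.5 (4.28 GFLOPS at an average draw of 55 Watts):

    PPW = 4.28 GFLOPS / 55 W ≈ 0.0778 GFLOPS/W ≈ 77.8 MFLOPS per Watt

which is the value reported in the PPW column of that table.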

5.3.2 Measuring unit power

The power measurements were performed using the Watts up? PRO ES7 and the CREATE ComfortLINE8 power meters. The meter is placed between the power supply's AC input of the machine to be monitored and the socket connected to the external power supply infrastructure, and reports the Watts consumed at any time. The power meter is provided with a USB interface and software that allow us to record the data we need on an external system and study it at any desired time. This methodology reflects the technique followed to submit performance results to the Green500 list [12]. The basic set-up is illustrated by figure 5.1 below.

Figure 5.1: Power measurement setup.

5.3.3 The measurement procedure

The measurement procedure consists of nine simple steps, similar to those described in the Green500 List Power Measurement Tutorial [15].

7Watts up? - http://www.wattsupmeters.com/
8CREATE - The Energy Education Experts - http://www.create.org.uk/


1. Connect the power meter between the electricity socket and the physical machine.

2. Power on the meter (if required).

3. Power on the physical machine.

4. Start the power usage logger.

5. Initialise and execute the benchmark.

6. Start recording of power consumption.

7. Finish recording of power consumption.

8. Record the reported benchmark performance.

9. Load power usage data and calculate average and PPW.

With the physical machine and the power meter connected and both running, we initialise the execution of the benchmark and then start recording the power consumption data for the system. We use a problem size large enough to keep the fastest system busy long enough to provide a reliable recording of power usage during the execution time. That gives even more execution time on the other systems, allowing us to gather accurate power consumption data for every system we are examining. For each benchmark the problem size can vary depending on hardware limitations (e.g. memory size, storage).
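The averaging itself is straightforward. The sketch below assumes the meter's logging software writes one reading in Watts per line, once per second, to a plain-text file (the actual log format of the Watts up? and CREATE software may differ); under that assumption the same file also yields the total energy as the sum of the samples.

    #!/bin/bash
    # Sketch: average power and total energy from a one-sample-per-second log file
    LOG="$1"
    awk '{ sum += $1; n++ }
         END { if (n > 0)
                 printf "samples=%d  avg_power=%.2f W  energy=%.0f W·s\n", n, sum/n, sum
         }' "$LOG"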

5.4 Experiments design and execution

Experimentation is the process of defining a series of experiments, or tests, that are conducted in order to discover something about a particular process or system. In other words, "experiments are used to study the performance of processes and systems" [25]. The performance of a system, though, depends on variables and factors, both controllable and uncontrollable.

The accuracy of an experiment, meaning the success of the measurement and the observation, depends on these controllable and uncontrollable variables and factors, as they can affect the results. These variables can vary under different conditions and environments. For instance, the execution of unnecessary applications while conducting the experiments is a controllable variable that can negatively affect the experimental results. The operating system's CPU scheduling algorithm, on the other hand, is not a controllable variable and can vary within the same operating system when executed on a different architecture; this plays a major role in the differentiation of the results from system to system. Likewise, the architecture of the CPU itself is an uncontrolled variable that will affect the results. The controllable factors for this project have been identified as below:

• Execution of non-Operating System specific applications and processes.


• Installation of unnecessary software packages, as that can result in additional power consumption for unneeded services.

• Multiple uses of a system by different users.

These factors have been eliminated in order to get more representative and unaffectedresults. The uncontrolled factors have been identified as below:

• Operating System scheduling algorithms.

• Operating System services/applications.

• Underlying hardware architecture and implementation.

• Network noise and delay.

From this list, the only factor that is partially controlled is the network noise and delay. We use private IPs with NAT, which prevents the machines from being contacted from outside the private network unless they issue an external call. Keeping the systems to be measured outside a public network eliminates the noise and delay that would otherwise arrive on the physical wire from devices connected to that network. Finally, the technical phase of experimentation has been separated into seven stages:

• Designing the experiments.

• Planning the experiments.

• Conducting the experiments.

• Collecting the data.

• Generating data sets.

• Generating graphs and tables.

• Results analysis

5.5 Validation and reproducibility

The validation of each benchmark is confirmed either by the validation tests it provides, such as the residual tests in HPL, or by being accepted for publication, which was the case for the CoreMark results that are published on the CoreMark website9. The STREAM benchmark also states at the end of each run whether it validates or not. With all of the experiments for all of the benchmarks validated, we can claim accuracy and correctness for the results we present below.

Reproducibility is confirmed by executing each benchmark four times with the same options and definitions. The average of all of the runs is taken and presented in the results that follow. The power readings have been taken at a frequency of once per second during the execution of each benchmark. The average is then calculated to identify the average power consumption of each system when running a specific benchmark.

9CoreMark scores - http://www.coremark.org/benchmark/index.php?pg=benchmark


Chapter 6

Cluster design and deployment

In this chapter I discuss the hardware and software specifications of the hybrid cluster I have designed and built as part of this project, along with the issues I encountered and how I solved them.

6.1 Architecture support

6.1.1 Hardware considerations

To evaluate the performance of low-power processors effectively, we need a suitable infrastructure that enables us to run the same experiments across a number of different systems, both low-power and high-power, in order to perform a comparison on equal terms. Identifying systems that are identical in every aspect apart from the CPU is realistically not feasible within the time and budget of this project. Therefore, the experiments are designed in such a way that we measure the same software metrics on each system. In the analysis of the results we take into consideration any important differences in the hardware that can affect the interpretation of the results.

The project experiments with different architectures: standard x86 [9] (i.e. Intel Xeon), RISC x86 [10] (i.e. Intel Atom) and ARM [11] (Marvell Sheeva 88F6281). These fall within a modern comparison of CISC (Complex Instruction Set Computing) versus RISC (Reduced Instruction Set Computing) designs for HPC use. Each of these architectures, though, uses a different register width: both x86 architectures support 64-bit registers, while ARM supports only 32-bit registers. That may prove to be an issue for scientific codes from a software performance perspective, as the same code may behave and perform differently when compiled on a 32-bit and a 64-bit system.

While the registers (processor registers, data registers, address registers etc.) are one of the main differences between architectures, identical systems are very hard to build when using chips of different architectures, in terms of the other parts of the hardware. The boards will need to be different, memory chips and sizes may differ, and networking support can differ as well (e.g. Megabit versus Gigabit Ethernet). Also, different hard disk types, such as HDD versus SSD, will affect the total power consumption of a system.

6.1.2 Software considerations

Moving from the architectural differences up to the software level, some tool-chains (libraries, compilers, etc.) are not identical for every architecture. For instance, the official GNU GCC ARM tool-chain is at version 4.0.3 while the standard x86 version is at 4.5.2. We solved this by using the binary distributions that come by default with the Linux distributions of specific vendors, such as Red Hat in our case, which ships GCC 4.1.2 with its operating system on any supported architecture. The source code can also be used to compile the needed tools, but that proves to be a time-consuming, and sometimes non-trivial, task. It might be the only way, though, of installing a specific version of a tool-chain when there is no binary compiled for the needed architecture.

The compiled Linux distributions available for ARM, such as Debian GNU/Linux and Fedora, are built for the ARMv5 architecture, which is older than the architecture the latest ARM processors are based on, ARMv7. Other distributions, such as Slackware Linux, are compiled for the even older ARMv4. Using an operating system, compilers, tools and libraries that are compiled for an older architecture does not take advantage of the additional capabilities of the newer architecture's instruction set. A simple example is the comparison between x86 and x86_64 systems: a standard x86-compiled operating system running on x86_64 hardware would not take advantage of the larger virtual and physical address spaces, preventing applications and codes from using larger data sets.

Intel Atom, on the other hand, does not have any issues with compiler, tool or software support. Being an x86-based architecture, it supports and can handle any x86 package that is available for the commodity high-power hardware widely used nowadays in scientific clusters and supercomputers.

6.1.3 Soft Float vs Hard Float

Soft floats use an FPU (Floating Point Unit) emulator at the software level, while hard floats use the hardware's FPU. As described earlier, most modern ARM processors come with FPU support. However, in order to provide full FPU support, the required tools and libraries need to be recompiled from scratch. Dependency packages would need to be recompiled as well, and that can include low-level libraries such as the C library. The supported Linux distributions, compilers, tools and libraries that target the ARMv5 architecture use soft floats, as ARMv5 does not come with hardware FPU support. Therefore, they are unable to take advantage of the processor's FPU and the additional NEON SIMD instructions. It is reported that recompiling the whole operating system from scratch with hard-float support can increase performance by up to 300% [27]. At present there is no distribution fully available that takes advantage of the hardware FPU, and recompiling the largest part of a distribution from scratch is beyond the scope of this project.

6.2 Fortran

The GNU ARM tool-chain provides C and C++ compilers but not a Fortran compiler. That is a limitation in itself, as it means no Fortran code can be compiled and run widely on the ARM architecture. That can be a restricting factor for many scientists and HPC system designers at this moment, as a great number of HPC and scientific applications are written in Fortran. Specific Linux distributions, such as Debian GNU/Linux, Ubuntu and Fedora, provide their own compiled GCC packages, including Fortran support.

On a non-supported system, porting Fortran code to C can be time consuming. One way to do this is to use Netlib's f2c [22] library, which can automatically translate Fortran code to C. Even when the whole code is converted successfully, additional work may be needed to link the MPI or OpenMP calls correctly within the C version. What is more, the f2c tool supports only Fortran 77 codes. As part of this project, we have created a basic script to automate the process of converting the original Fortran 77 code to C and compiling it. Other proprietary and open-source compiler vendors, such as G95, PathScale and PGI, do not yet provide Fortran, or other, compilers for the ARM architecture.
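For illustration, the conversion-and-compile step that such a script automates looks roughly as follows; the script and file names are placeholders, and it assumes f2c and its runtime library (libf2c) are installed.

    #!/bin/bash
    # Sketch: convert a Fortran 77 source file to C with f2c and compile the result
    SRC="$1"              # e.g. ./fortran2c-sketch.sh mycode.f
    BASE="${SRC%.f}"
    f2c "$SRC"            # emits ${BASE}.c
    gcc -O2 "${BASE}.c" -lf2c -lm -o "$BASE"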

6.3 C/C++

The C/C++ support of the ARM architecture is entirely adequate and at the same level as on the other architectures. However, we have used the GNU C/C++ compiler and have not investigated any proprietary compilers; compiler suites that are common in HPC, such as PathScale and PGI, do not support the ARM architecture. Both MPI and OpenMP are supported on all the architectures we have used, without any need for additional software libraries or porting of the existing codes.

6.4 Java

The Java runtime environment is supported on the ARM architecture by the official Oracle Java for embedded systems, for ARMv5 (soft float), ARMv6 (hard float) and ARMv7 (hard float). It lacks the Java compiler, though, which means the application would have to be developed and compiled on a system of another architecture that provides the Java compiler, with the resulting binary then executed on the ARM system.

6.5 Hardware decisions

In order to evaluate the systems, the design of the cluster reflects that of a hybrid cluster, interconnecting systems of different architectures together. Our cluster consists of the following machines.

Processor                       Memory              Storage  NIC    Status
Intel Xeon 8-core (E5507)       16GB DDR3-1.3GHz    SATA     1GigE  Front-end / Gateway
Intel Xeon 8-core (E5507)       16GB DDR3-1.3GHz    SATA     1GigE  Compute-node 1
Intel Xeon 8-core (E5507)       16GB DDR3-1.3GHz    SATA     1GigE  Compute-node 2
Intel Xeon 8-core (E5507)       16GB DDR3-1.3GHz    SATA     1GigE  Compute-node 3
Intel Atom 2-core (D525)        4GB DDR2-800MHz     SATA     1GigE  Compute-node 4
Intel Atom 2-core (D525)        4GB DDR2-800MHz     SATA     1GigE  Compute-node 5
ARM (Marvell 88F6281) 1-core    512MB DDR2-800MHz   NAND     1GigE  Compute-node 6
ARM (Marvell 88F6281) 1-core    512MB DDR2-800MHz   NAND     1GigE  Compute-node 7

Table 6.1: Cluster nodes hardware specifications

The cluster provides access to 34 cores, 57GB of RAM and 3.6TB of storage. All of the systems, both the gateway and the compute nodes, are connected to a single switch. The gateway has a public and a private IP, and each compute node has a private IP. That enables all the nodes to communicate with each other, while the gateway allows them to access the external public network and the Internet if needed.

6.6 Software decisions

The software details of each system are outlined in the table that follows.


System      OS        C/C++/Fortran  MPI            OpenMP      Java
Front-end   SL 5.5    GCC-4.1.2      MPICH2-1.3.2   GCC-4.1.2   JAVA 1.6
Node1       SL 5.5    GCC-4.1.2      MPICH2-1.3.2   GCC-4.1.2   JAVA 1.6
Node2       SL 5.5    GCC-4.1.2      MPICH2-1.3.2   GCC-4.1.2   JAVA 1.6
Node3       SL 5.5    GCC-4.1.2      MPICH2-1.3.2   GCC-4.1.2   JAVA 1.6
Node4       SL 5.5    GCC-4.1.2      MPICH2-1.3.2   GCC-4.1.2   JAVA 1.6
Node5       SL 5.5    GCC-4.1.2      MPICH2-1.3.2   GCC-4.1.2   JAVA 1.6
Node6       Fedora 8  GCC-4.1.2      MPICH2-1.3.2   GCC-4.1.2   JAVAE 1.6
Node7       Fedora 8  GCC-4.1.2      MPICH2-1.3.2   GCC-4.1.2   JAVAE 1.6

Table 6.2: Cluster nodes software specifications

The x86-based systems run Scientific Linux 5.5 x86_64, with the latest supported GNU Compiler Collection, which provides C, C++ and Fortran compilers. We have installed the latest MPICH2 version to enable programming with the message-passing model. Regarding OpenMP support, GCC supports shared-variable programming using OpenMP directives when the appropriate flag is specified at compile time. For Java, we have deployed Oracle's SDK, which provides both the compiler and the runtime environment.
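For illustration, a typical build of an MPI code and of an OpenMP code with this tool-chain looks as follows (the source file names are placeholders):

    # MPI: MPICH2 provides the mpicc wrapper around the system GCC
    mpicc -O3 mpi_code.c -o mpi_code
    mpiexec -n 8 ./mpi_code
    # OpenMP: GCC only needs the -fopenmp flag at compile (and link) time
    gcc -O3 -fopenmp omp_code.c -o omp_code
    OMP_NUM_THREADS=4 ./omp_code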

Regarding the ARM systems, there are some differences. The operating system installed is Fedora 8, which belongs to the same family as Scientific Linux, both being Red Hat related projects, but this specific version is older. Deploying a more recent operating system is possible, but due to project time limitations we used the pre-installed operating system, compilers and libraries. However, the GCC version is the same across all systems, and MPI and OpenMP are supported by MPICH2 and GCC respectively. In relation to Java, Oracle provides an official version of the Java Runtime Environment for embedded devices, and that is the version we can run on the ARM architecture. It lacks the Java compiler, though, allowing only the execution of pre-compiled Java applications.

The batch system used to connect the front-end and the nodes is Torque, which is based on OpenPBS10. Torque provides both the server and client sides of the batch system as well as its own scheduler, which is, however, not very flexible. We did not face any issues installing and configuring the batch system across the different architectures and systems.

10PBS Works - Enabling On-Demand Computing - http://www.pbsworks.com/
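For illustration, a job is submitted to Torque through a PBS script along the following lines; the queue name, resource requests and paths are illustrative rather than the exact configuration of this cluster.

    #!/bin/bash
    # Sketch of a Torque/PBS job script (submitted with: qsub hpl.pbs)
    #PBS -N hpl_run
    #PBS -l nodes=1:ppn=8
    #PBS -l walltime=02:00:00
    #PBS -q batch
    cd "$PBS_O_WORKDIR"        # directory the job was submitted from
    mpiexec -n 8 ./xhpl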


Figure 6.1: The seven-node cluster that was built as part of this project.


6.7 Networking

In terms of network connectivity, the front-end acts as a gateway to the public network and the Internet; it therefore has a public IP, which can be used to access it remotely as a login node. As the front-end needs to communicate with the nodes as well, it uses a second interface with a private IP within the network 192.168.1.0/24. Each of the compute nodes uses a private IP on a single NAT (Network Address Translation) interface. That allows each node to communicate with every other node in the cluster as well as with the front-end, which is used as the gateway when they need to communicate with the public network.

Hostname  IP              Status
lhpc0     129.215.175.13  Gateway
lhpc0     192.168.1.1     Front-end
lhpc1     192.168.1.2     compute-node
lhpc2     192.168.1.3     compute-node
lhpc3     192.168.1.4     compute-node
lhpc4     192.168.1.5     compute-node
lhpc5     192.168.1.6     compute-node
lhpc6     192.168.1.7     compute-node
lhpc7     192.168.1.8     compute-node

Table 6.3: Network configuration
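For illustration, the gateway role of the front-end implies a NAT configuration along the following lines; the interface names are illustrative (eth0 facing the public network, eth1 facing the private 192.168.1.0/24 network), not necessarily those of the actual machine.

    # Sketch: enabling forwarding and NAT on the front-end
    echo 1 > /proc/sys/net/ipv4/ip_forward
    iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
    iptables -A FORWARD -i eth1 -o eth0 -j ACCEPT
    iptables -A FORWARD -i eth0 -o eth1 -m state --state RELATED,ESTABLISHED -j ACCEPT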

The physical connectivity between the systems is illustrated by the figure below.

Figure 6.2: Cluster connectivity


6.8 Porting

The main reason for porting an application is an incompatibility between the architecture the application was initially developed for and the target architecture. As already mentioned in this report, the ARM architecture does not widely support a Fortran compiler. As a result, either specific Linux distributions must be used or Fortran code must be ported to C or C++ in order to run successfully on ARM. It is not part of this project to investigate the extent to which this can be done, either for the benchmarks used or for any other HPC or scientific application.

The Intel Atom processor, being of x86 architecture, can support all the widely used HPC and scientific tools and codes. That means no porting is needed for any benchmark or code one wishes to run on such a platform; thus, Atom systems can be used to build low-power clusters for HPC with Fortran support. Hybrid clusters (i.e. consisting of Atom and other low-power machines) can be deployed as well. That would require the appropriate configuration of the batch system into different queues, reflecting the configuration of each group of systems. For instance, there could be a Fortran-supported queue and a generic queue for C/C++. Queues that group together systems of the same architecture can be created as well, following the same concept as the GPU queues and standard CPU queues already found on clusters and supercomputers; a sketch of such a queue configuration is given below.
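As a sketch of what such a configuration could look like in Torque (queue and property names are illustrative, not the configuration actually deployed here), nodes would be tagged with an architecture property in the server's nodes file and each queue would route jobs to nodes carrying the matching property:

    # Sketch: per-architecture execution queues in Torque, defined with qmgr
    qmgr -c "create queue x86 queue_type=execution"
    qmgr -c "set queue x86 resources_default.neednodes = x86"
    qmgr -c "set queue x86 enabled = true"
    qmgr -c "set queue x86 started = true"
    qmgr -c "create queue arm queue_type=execution"
    qmgr -c "set queue arm resources_default.neednodes = arm"
    qmgr -c "set queue arm enabled = true"
    qmgr -c "set queue arm started = true"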

6.8.1 Fortran to C

While investigating the issue of Fortran support on ARM, I came across a possible workaround for platforms that do not support Fortran: the f2c (Fortran-to-C) tool from the Netlib repository, which can convert Fortran code to C. There are two main issues with this tool. Firstly, f2c is developed for converting only Fortran 77 code to C. Secondly, and more relevant to HPC and scientific codes, calls to the MPI and OpenMP libraries might not be converted successfully, so the converted C code may fail to compile even when linked correctly with the MPI and OpenMP C libraries.

The f2c tool was used, for instance, to port the LAPACK library to C, and it has also influenced the development of the GNU g77 compiler, which uses a modified version of the f2c runtime libraries. We think that with more effort and a closer study of f2c, it could be used to convert HPC codes directly from Fortran 77 to C.

6.8.2 Binary incompatibility

Another issue with hybrid systems made of different architectures is the binary incompatibility of compiled code. A code that is compiled on an x86 system will not, in most cases, be able to execute on the ARM architecture, and vice versa, unless it is a very basic one without system calls that relate to the underlying system architecture. This is a barrier to the design and deployment of hybrid clusters like the one we built for this project.

This architecture incompatibility requires login nodes to exist and be available for each architecture, so that users are able to compile their applications for the target platform. In addition, each architecture should provide its own batch system. However, in order to eliminate the need for additional systems and the added complexity of additional schedulers and queues, a single login node can be used with specific scripts to enable code compilation for the different architectures. This single front-end, as well as the scheduler (be it the same machine or another), can have different queues for each architecture, allowing users to submit jobs to the desired platform each time without conflicts and faulty runs caused by binary incompatibility.

6.8.3 Scripts developed

To ease and automate the deployment and management of the cluster, as well as the power readings, we have developed a few shell scripts as part of this project. The source of the scripts can be found in Appendix D.

• add_node.sh: Adds a new node to the batch system. It copies all the necessary files to the targeted system, starts the required services, mounts the filesystems and attaches the node to the batch pool. Usage: ./add_node.sh <host> [architecture]

• status.sh: Reports on the status of each node, i.e. whether the batch services are running or not. Usage: ./status.sh

• armrun.sh: Can be used to execute any command remotely on the ARM systems from the x86 login node. In particular, it can be used to compile ARM-targeted code from the x86 login node without requiring a login to an ARM system. Usage: ./armrun <command>

• watt_log.sh: Captures power usage on Dell PowerEdge servers with IPMI sensor support. It logs the readings in a defined file, from which the average can be calculated as well. Usage: ./watt_log.sh <option> <logfile> [application to monitor]

• fortran2c.sh: Converts Fortran 77 code to C using the f2c tool and generates the binary after compiling the resulting C file. Usage: ./fortran2c.sh <Fortran file>


Chapter 7

Results and analysis

In this chapter we present and analyse the results we gathered during the experimentation process on the hybrid cluster we built during this project. We start by discussing the quoted Thermal Design Power and the idle power consumption of each system, and then go into more detail for each benchmark individually.

7.1 Thermal Design Power

Each processor vendor specifies a maximum Thermal Design Power (TDP). This is the maximum amount of power the processor's cooling system is required to dissipate and therefore, in practice, the maximum power the processor is expected to draw. It is expressed in Watts. Below we present the values given by the vendors of each of the processors we used.

Processor               GHz   TDP                  Per core
Intel Xeon, 4-core      2.27  80 Watt              20 Watt
Intel Atom, 2-core      1.80  13 Watt              6.5 Watt
ARM (Marvell 88F6281)   1.2   0.87 Watt (870 mW)   0.87 Watt

Table 7.1: Maximum TDP per processor.

The Intel Xeon system uses two quad-core processors with a TDP of 80 Watts each, giving a total maximum of 160 Watts per system. These first values already give a clear idea of the power consumption of each system. Dividing the TDP of each processor by its number of cores gives 20 Watts for each Intel Xeon core, 6.5 Watts for each Intel Atom core and just 870 mW for the ARM (Marvell 88F6281). This clearly shows the difference between commodity server processors, low-power server processors and purely embedded processors. The cooling mechanism within each system is scaled up or down according to the scope of the system and the design of the processor.


7.2 Idle readings

In order to identify the power consumption of a system when it is idle (i.e. not processing), we gathered the power consumption without running any special software or any of the benchmarks. We measured each system for 48 hours, giving a concrete indication of how much power each system consumes in idle mode. The results are listed below.

Processor               Watt
Intel Xeon, 8-core      118 Watt
Intel Atom, 2-core      44 Watt
ARM (Marvell 88F6281)   8 Watt

Table 7.2: Average system power consumption on idle.

Figure 7.1: Power readings over time.

In figure 7.1 we can see that each system tends to use relatively more power when it boots and then stabilises, keeping a constant power consumption rate over time when not executing any special software. These results therefore reflect the power consumption of each system while running its respective operating system after a fresh installation, with the only additional service running being the batch system that we installed. We can also observe that the Intel Xeon system increases its power usage slightly, by 1 Watt (from 118 to 119), approximately every 20 seconds, most probably due to a specific operating system service or procedure. The results confirm the TDP values given by each manufacturer, as the systems with the lowest TDP values are also those that consume the least power when idle.

7.3 Benchmark results

In this section we present and discuss the results of each benchmark individually, across the various architectures and platforms on which they were executed.

7.3.1 Serial performance: CoreMark

Table 7.3 shows the results of the CoreMark benchmark. As the power consumption of the CPU drops, its efficiency increases: the Intel Xeon system performs 55.76 iterations per Watt consumed, Intel Atom 65.9 iterations per Watt and ARM 206.63 iterations per Watt. In terms of power efficiency, ARM is ahead of the other two candidates. The trade-off comes in the total execution time, as the single-core ARM takes 3.5x and 1.5x longer to complete the iterations than Intel Xeon and Intel Atom respectively. Intel Atom, while consuming less than half the power of Intel Xeon, achieves a performance-per-Watt (PPW) that does not differ greatly from that of Intel Xeon, while taking 2.3x longer to complete the operations.

Processor               Iterations/Sec  Total time (s)  Usage      PPW (Iters./s per Watt)
Intel Xeon              6636.40         150.68          119 Watt   55.76
Intel Atom              2969.70         336.73          45 Watt    65.9
ARM (Marvell 88F6281)   1859.67         537.72          9 Watt     206.63

Table 7.3: CoreMark results with 1 million iterations.

Calculating the total energy used to perform the same number of iterations (average power multiplied by execution time), ARM (Marvell 88F6281) proves to be the most power efficient, with Intel Atom following and Intel Xeon consuming the most. The systems consume in total 17930, 16152 and 4839 Watt-seconds for Intel Xeon, Intel Atom and ARM respectively.


Figure 7.2: CoreMark results for 1 million iterations.

The same differences in performance, and in power consumption, are observed with both smaller and larger numbers of iterations, as presented in figures 7.3 and 7.4 respectively. We also observe that the number of iterations per second remains approximately the same regardless of the total number of iterations; the total execution time increases proportionally as the total number of iterations increases. The differences between the various systems in execution time stay near the same values, with power consumption staying at the same levels as well. The results that put the ARM system ahead of the other two candidates in terms of performance per Watt can be explained by the simplicity of the CoreMark benchmark, which targets integer operations.


Figure 7.3: CoreMark results for 1 thousand iterations.

Figure 7.4: CoreMark results for 2 million iterations.

The results presented so far show the performance of a single core per system. The Intel Xeon system, however, has 8 cores and the Intel Atom 4 logical cores (2 cores with Hyper-Threading on each). The results with all threads enabled on each system are illustrated by figure 7.5.

Figure 7.5: CoreMark results for 1 million iterations utilising 1 thread per core.

We can observe that the performance increases almost proportionally for Intel Xeon and Intel Atom, achieving in total 51516.21 and 9076.67 iterations per second, giving 432.90 and 201.7 iterations per second per Watt respectively. With these results, the ARM processor is ahead of the Intel Atom by 4.93 iterations per second per Watt, and 261.65 iterations per second per Watt behind Intel Xeon, which has a significantly higher clock speed (2.27GHz versus 1.2GHz). With these considerations in mind, as well as the fact that this ARM processor does not support 64-bit registers, we could argue that there is plenty of room for development and progress for the ARM microprocessor, as we can also see from its current developments with multi-core support and NEON acceleration.

CoreMark is not based on, and does not represent, any real application, but it allows us to draw some conclusions specifically about the performance of a single core and the CPU itself. The presented results show clearly that the CPU with the highest clock speed and architectural complexity achieves the highest performance, being able to perform a larger number of iterations per second in a shorter total execution time. In our experiments, Intel Xeon, which achieves the best performance, also uses the highest amount of power, both instantaneously and over the whole execution, to perform the total number of iterations. Based on the figures and results presented earlier in this section, ARM is the most efficient processor on a performance-per-Watt basis, handling integer operations very efficiently.


Looking solely at iterations per second, figures 7.6 and 7.7 show how each system performs, in terms of iterations and speedup, for the serial version as well as for 2, 4, 6 and 8 threads. Intel Xeon scales well, while Amdahl's law catches up with Intel Atom and ARM once we exploit more threads than the physical number of cores. Thus the ARM system, being a single-core machine, performs any task, serial or multi-threaded, as a serial one.
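For reference, the speedup plotted in figure 7.7 is the usual ratio of multi-threaded to single-threaded performance, which for CoreMark can be computed directly from the iteration rates:

    S(p) = (iterations per second with p threads) / (iterations per second with 1 thread)

which is equivalent to T(1)/T(p) for a fixed total number of iterations.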

Figure 7.6: CoreMark performance for 1, 2, 4, 6 and 8 cores per system.


Figure 7.7: CoreMark performance speedup per system.

The same rule applies to Intel Xeon. Figure 7.8 shows that the Intel Xeon system hits the performance wall once more threads are allocated than the actual number of cores on the system.

Figure 7.8: CoreMark performance on Intel Xeon.

In figure 7.9 we can see the power changes over time while benchmarking each system with CoreMark. As can be clearly seen, the power usage throughout the execution of the benchmark on each system is stable. The low-power systems do not raise their power consumption as much as the high-power Intel Xeon system. An explanation for this can be that, in order to keep load balance between the processors, the system utilises more than a single core even when executing a single thread, thus requiring more power. The Intel Xeon system increases its power usage by 5.88%, Intel Atom by 0.8% and ARM by 12.5%.

Figure 7.9: Power consumption over time while executing CoreMark.

7.3.2 Parallel performance: HPL

For HPL, we used four different approaches to identify a suitable problem size for each system. The first is the rule of thumb suggested by the HPL developers, giving a problem size using nearly 80% of the total system memory11. The second is an automated script provided by Advanced Clustering Technologies, Inc. that calculates the ideal problem size based on the information given for the target system12. The third is using the ideal problem size of the smallest machine on all of the systems. The fourth is using a very small problem size, to identify differences in performance depending on problem size, as problem sizes that do not fit in the physical memory of the system would need to use swap memory, with a corresponding drop in performance. All the problem sizes are presented in table 7.4, and a sketch of the underlying calculation is given after the table.

11http://www.netlib.org/benchmark/hpl/faqs.html#pbsize
12http://www.advancedclustering.com/faq/how-do-i-tune-my-hpldat-file.html


Processor                                          Problem size  Block size  Method
Intel Xeon                                         13107         128         HPL
Intel Atom                                         3276          128         HPL
ARM (Marvell 88F6281)                              409           128         HPL
Intel Xeon                                         41344         128         ACT
Intel Atom                                         20608         128         ACT
ARM (Marvell 88F6281)                              7296          128         ACT
Intel Xeon / Intel Atom / ARM (Marvell 88F6281)    7296          128         Equal size
Intel Xeon / Intel Atom / ARM (Marvell 88F6281)    500           32          Small size

Table 7.4: HPL problem sizes.
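The rule of thumb behind these figures is that the matrix should fill about 80% of the available memory, with 8 bytes per double-precision element, rounded down to a multiple of the block size; the short sketch below reproduces the ACT values in Table 7.4 (41344, 20608 and 7296 for 16GB, 4GB and 512MB of memory respectively).

    # Sketch: N = sqrt(0.80 x mem_bytes / 8), rounded down to a multiple of NB
    mem_mib=16384     # total memory of the node in MiB (e.g. 16384 for the Xeon nodes)
    nb=128            # HPL block size
    awk -v m="$mem_mib" -v nb="$nb" 'BEGIN {
        n = sqrt(0.80 * m * 1024 * 1024 / 8)   # elements per matrix dimension
        print "N =", int(n / nb) * nb          # round down to a multiple of NB
    }'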

Table 7.5 presents the results of the HPL benchmark for all the different problem sizes we used. Figures 7.10 and 7.11 present the results of the benchmark with the problem sizes defined by HPL and ACT.

Processor               GFLOPs  Usage     PPW              Problem size
Intel Xeon              1.22    197 Watt  6.1 MFLOPs/W     13107
Intel Atom              4.28    55 Watt   77.81 MFLOPs/W   3276
ARM (Marvell 88F6281)   1.11    9 Watt    123.3 MFLOPs/W   409
Intel Xeon              1.21    197 Watt  6.1 MFLOPs/W     41344
Intel Atom              3.48    55 Watt   63.27 MFLOPs/W   20608
ARM (Marvell 88F6281)   1.10    9 Watt    122 MFLOPs/W     7296
Intel Xeon              1.21    197 Watt  6.14 MFLOPs/W    7296
Intel Atom              4.15    55 Watt   75.45 MFLOPs/W   7296
ARM (Marvell 88F6281)   1.10    9 Watt    122.2 MFLOPs/W   7296
Intel Xeon              7.18    197 Watt  36.44 MFLOPs/W   500
Intel Atom              5.46    55 Watt   99.27 MFLOPs/W   500
ARM (Marvell 88F6281)   1.13    9 Watt    125.5 MFLOPs/W   500

Table 7.5: HPL results for the different problem sizes.


Figure 7.10: HPL results for large problem size, calculated with ACT's script.

Figure 7.11: HPL results for problem size 80% of the system memory.

The GFLOPs rate as well as the power consumption remains at the same level for both the Intel Xeon and ARM systems, while Intel Atom improves its overall performance by 800 MFLOPs and 14.54 MFLOPs per Watt when using a problem size equal to 80% of the memory. We also experimented with a smaller problem size, N = 500, which allows the systems to achieve higher performance. These results are illustrated in figure 7.12.

Figure 7.12: HPL results for N=500.

This experiment shows that Intel Xeon is capable of achieving relatively high performance for small problem sizes, while Intel Atom increases its performance by approximately 2 GFLOPs and ARM (Marvell 88F6281) by 20 MFLOPs, staying at the same level of performance as for large problem sizes. Despite the increase in performance of both Intel Xeon and Intel Atom, ARM achieves the best performance-per-Watt, with 152.2 MFLOPs per Watt versus 96.5 MFLOPs per Watt for Intel Atom and 37.51 MFLOPs per Watt for Intel Xeon. These results should not surprise us: in the Green500 list, the first entry belongs to the BlueGene/Q Prototype 2, which is ranked as only the 110th fastest supercomputer in the TOP500 list, showing that the fastest supercomputer is not necessarily the most power-efficient one, and vice versa.

Regarding the reported performance, we must not underestimate the fact that the operating systems installed on the ARM machines (including tools, compilers and libraries), as well as Marvell's design and implementation of the processor we used, do not support a hardware FPU and use soft float, i.e. an FPU implemented at the software level. That prevents the systems from using the NEON SIMD acceleration, from which the executed benchmarks could benefit. As there is increased interest from both the desktop/laptop and the HPC communities in exploiting low-power chips, we can reasonably expect hardware FPU support to be available in the near future, enabling applications to achieve higher performance and take full advantage of the underlying hardware. It is reported that NEON SIMD acceleration can increase HPL performance by 60% [28], and ARM states that NEON technology can accelerate multimedia and signal processing algorithms by at least 3x on ARMv7 [29] [30].

While the performance in GFLOPs is of course important and interesting, we must not leave aside the total execution time and the total power consumption a system needs in order to solve a problem of a given size. The CoreMark results have shown that the ARM system achieves the best performance in terms of performance-per-Watt, as well as in overall power consumption, for integer operations. The results for HPL differ from this and are less clear, depending on the problem size.

In figure 7.13 we can see the total power used by each system when solving a problem whose size is equal to 80% of the total main system memory. This experiment clearly shows that the larger the memory, the larger the problem size required, which in turn leads to more power usage. We see that Intel Xeon uses 236250 Watt-seconds and takes 119.24 seconds to complete, Intel Atom uses 3004 Watt-seconds and takes 54.62 seconds, while ARM uses 36.63 Watt-seconds in total and takes 4.07 seconds, for N equal to 13107, 3276 and 409 for Intel Xeon, Intel Atom and ARM respectively.

Figure 7.13: HPL total power consumption for N equal to 80% of memory.

The great difference in problem sizes does not allow us to draw specific conclusions, either for the achieved performance in GFLOPs or for the power usage. Figure 7.14 presents the total power consumption for the problem sizes calculated with the ACT script, that is for N equal to 41344 for Intel Xeon, 20608 for Intel Atom and 7296 for ARM (Marvell 88F6281). That results in 7625270 Watt-seconds and 38707.10 seconds for Intel Xeon, 919756.75 Watt-seconds and 16722.85 seconds for Intel Atom, and 211988 Watt-seconds and 23554.23 seconds for ARM. We can see here that as the problem size increases for each node, both the total power usage and the execution time increase, as expected. In this experiment we see that while Intel Atom is able to solve its linear problem faster, the ARM system is still ahead when comparing performance-per-Watt.

Figure 7.14: HPL total power consumption for N calculated with ACT's script.

In order to quantify the total power consumption in a better way, we performed another experiment with a problem size N equal to 7296 on all systems. Figure 7.15 presents the power consumption of each system. We can see that this problem size is solved relatively quickly on Intel Xeon and Intel Atom, taking 41934 Watt-seconds and 213.95 seconds for Intel Xeon and 34266.1 Watt-seconds and 623.02 seconds for Intel Atom. The ARM system uses in total 211988 Watt-seconds and takes 23554.23 seconds to solve the problem. That puts it at the bottom of the power-efficiency ranking for this given problem, due to the lack of a hardware floating-point unit.

In order to quantify the differences in performance, we draw the same graph for the problem size N equal to 500. The problem size is rather small to draw concrete conclusions about the performance of the Intel Xeon and Intel Atom systems, as they both solve the problem within a second, using 197 and 52 Watt-seconds respectively, while the ARM system takes 7.37 seconds and consumes 66.33 Watt-seconds, in total 130.67 Watt-seconds less than Intel Xeon and 11.33 Watt-seconds more than Intel Atom. The results are illustrated by figure 7.16.


Figure 7.15: HPL total power consumption for N=7296.

Figure 7.16: HPL total power consumption for N=500.


All the results clearly show that the ARM system lags in floating-point operations, although it is competitive in terms of performance-per-Watt for small floating-point problem sizes. As we have mentioned, the ARM system we used, the OpenRD Client with the Marvell Sheeva 88F6281, does not implement the FPU at hardware level, nor does it provide NEON acceleration for SIMD processing. The underlying compilers and libraries perform the floating-point operations at software level, and that is a performance drawback for the system. Intel Atom is very competitive when compared to the high-power Intel Xeon, as it can achieve reasonably high performance with relatively low power consumption.

The graph in figure 7.17 shows the power consumption over time of each system when executing the HPL benchmark. The low-power systems reach their peak power consumption, and keep a stable rate, only a few seconds after the benchmark starts executing. On the other hand, as with the CoreMark benchmark, the high-power Intel Xeon system takes approximately 10 to 15 seconds to reach its peak power consumption, and then keeps a stable rate during the rest of the HPL run. This confirms the suggestion in the Green500 power measurement tutorial to start recording the actual power consumption 10 seconds after the benchmark has been initialised. The Intel Xeon system raises its power consumption by 56%, Intel Atom by 17.9% and ARM by 14.28%.

It is important to note that the build-up of power consumption for real applications, and for different types of applications, might differ from that of the HPL benchmark, or any other benchmark.


Figure 7.17: Power consumption over time while executing HPL.


7.3.3 Memory performance: STREAM

Processor       Function   Rate (MB/s)   Avg. time   Usage
Intel Xeon      Copy       3612.4793     0.0978      118 Watt
                Scale      3642.3530     0.0968
                Add        3960.9033     0.1334
                Triad      4009.4806     0.1319
Intel Atom      Copy       2851.0365     0.1236      44 Watt
                Scale      2282.0852     0.1543
                Add        3033.9793     0.1742
                Triad      2237.8844     0.2361
ARM Cortex-A8   Copy       777.8065      0.4029      8 Watt
                Scale      190.8710      1.6398
                Add        173.9241      2.6886
                Triad      113.8851      4.0880

Table 7.6: STREAM results for 500MB array size.

As an overall observation, we see that the power consumption does not increase at all when performing intensive memory operations with a small array size in the STREAM benchmark. ARM proves to be the most efficient in terms of performance-per-Watt, as it copies 97.2MB per Watt consumed, while Intel Atom and Intel Xeon copy 54.8MB and 64.79MB respectively; that is, ARM is 1.7x and 1.5x more efficient in terms of performance per unit of actual power usage. These results reflect the performance of the system when using the maximum memory that could be handled by the OpenRD Client system, 512MB of physical memory in total. These results are presented in figure 7.18.

We executed an additional experiment on the Intel Atom and Intel Xeon boxes with a larger array size, 3GB, nearly the maximum that can be handled by the Intel Atom system, which has 4096MB of physical memory available. This experiment showed a difference in the power consumption of each system, increasing the usage by 4 Watts on each, to 122 and 48 Watts on Intel Xeon and Intel Atom respectively. The increase by the same amount of power in both systems may reflect the similarities they share, both being of x86 architecture. The performance results with the 3GB array size are presented in figure 7.19. The performance of both systems is kept at the same levels, with Intel Xeon slightly increasing its performance and power efficiency on the Copy and Scale functions, from 3612MB/s to 3627MB/s and from 3642MB/s to 3670MB/s respectively, compared to the smaller array size. The Add and Triad functions decrease slightly with the larger array size, from 3960MB/s to 3943MB/s and from 4009MB/s to 3991MB/s. These differences are so small that they fall within the range of statistical error and standard deviation.

The performance differences between the various memory subsystems can be explained by the bandwidth interface and the frequency of each system. The Intel Xeon system uses a higher-bandwidth interface and a higher data-rate frequency (DDR3 at 1333MHz) than the other two systems (DDR2 at 800MHz). Looking more closely at the low-power systems, both use the same bandwidth interface and data-rate frequency. The bandwidth advantage of the Intel Atom system lies in the fact that its memory subsystem is made of two chips of 2GB each, while the ARM system uses four chips of 128MB each. That makes the Intel Atom system capable of fitting the whole array (500MB) into a single chip, requiring less data movement. Nevertheless, the ARM system keeps a higher performance-per-Watt than the Intel Atom system.
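As a rough sanity check on this explanation, the theoretical peak bandwidth of the two memory interfaces can be estimated from the data rate and the bus width. The sketch below is a back-of-envelope calculation assuming a single channel with a 64-bit (8-byte) bus; sustained STREAM bandwidth is always well below such peaks.

#!/bin/bash
# Peak bandwidth (MB/s) ~ data rate (MT/s) x bus width (bytes), per channel.
peak () { echo "$1 * 8" | bc; }

echo "DDR3-1333 (Intel Xeon):      $(peak 1333) MB/s peak"   # ~10664 MB/s
echo "DDR2-800 (Intel Atom, ARM):  $(peak 800) MB/s peak"    # ~6400 MB/s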


Figure 7.18: STREAM results for 500MB array size.



Figure 7.19: STREAM results for 3GB array size.

Figure 7.20 confirms that the per-second power consumption remains equally stable for each of the array sizes used to stress the memory subsystem of each system. With the larger 3GB array, the Intel Xeon system increases its power consumption by 3.38% and the Intel Atom by 9.1%.



Figure 7.20: Power consumption over time while executing STREAM.

7.3.4 HDD and SSD power consumption

We have mentioned earlier in this work that altering components within the targeted systems could affect performance, by either increasing or decreasing it. The component that is easiest to test is the storage device. By default, the Intel Xeon and Intel Atom machines come with commodity SATA hard disk drives (with a SCSI interface for Intel Xeon). We replaced the hard disk drive on one of the Intel Atom machines with a SATA solid-state drive. An SSD does not involve spinning platters and thus avoids the power required to spin them.

During the experiments we performed, various power consumptions were observed and we cannot draw a specific pattern, apart from the general observation that the SSD decreases the overall power consumption of the system. When idle, the system with the SSD uses 6 Watts less than the system with the HDD. On the CoreMark experiments, the SSD system consumes 3 Watts less. On STREAM, the SSD system consumes 4 Watts less, while when executing HPL the difference is 10 Watts, giving a total power draw of 58 Watts. These results are illustrated by figure 7.21.



Figure 7.21: Power consumption with 3.5" HDD and 2.5" SSD.

HDDs of smaller physical size, for instance 2.5" instead of the standard 3.5", may decrease the power consumption as well. As we did not have one of these disks available, we could not confirm this hypothesis. Previous research suggests that as the physical size of the disk decreases, its power consumption decreases as well, improving the power efficiency of the whole system [36] [37].

These differences in power do not only reduce the costs and improve the scalability, in terms of power, of such systems; they also allow the deployment of extra nodes consuming the amount of power saved by the more efficient components. For instance, the maximum difference between the HDD and SSD systems is 10 Watts, which is enough for an additional ARM system consuming at most 9 Watts. At larger scale, saving power on a single component can allow the deployment of additional compute nodes that consume the same power as would otherwise be wasted by a less power-efficient component on each system.
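As a concrete illustration of this trade-off, the number of extra low-power nodes that such a saving could fund can be estimated from the per-node difference. The sketch below assumes a hypothetical cluster of 100 Atom nodes, the 10 Watt HDD-to-SSD saving observed here and the 9 Watt maximum draw of the ARM node.

#!/bin/bash
# Extra nodes fundable from a per-node component saving, within the same power envelope.
nodes=100            # hypothetical cluster size
saving_per_node=10   # Watts saved per node by the HDD-to-SSD swap observed above
arm_node_power=9     # maximum draw of one ARM node measured in this work

extra=$(( (nodes * saving_per_node) / arm_node_power ))
echo "Saving ${saving_per_node} W on each of ${nodes} nodes funds ${extra} extra ARM nodes"
# With these assumptions: 1000 W saved allows roughly 111 additional 9 W ARM nodes.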

While CoreMark, HPL and STREAM do not perform intensive I/O operations, theyallow us to measure the standalone power consumption of the SSD and compare itagainst that of the HDD.


Chapter 8

Future work

Future work in this field could investigate a number of different possibilities, as outlined below:

• Real HPC/scientific applications: Real HPC and scientific applications could be executed on the existing cluster and their results used for analysis and comparison against the results presented in this dissertation.

• Modern ARM: The cluster could be extended by deploying more modern ARM systems, such as the Cortex-A9 and the upcoming Cortex-A15, which support hardware FPUs and multiple cores.

• Intensive I/O: Additional I/O-intensive benchmarks and applications could be executed to identify the power consumption of such applications, rather than applications and codes that do not make heavy use of I/O operations.

• Detailed power measurements: More detailed power measurements could be performed by measuring each system component individually and quantifying how and where exactly power is used.

• CPUs vs. GPUs: Comparison between performance and performance-per-Wattof low-power CPUs and GPUs.

• Parallelism: Extend the existing cluster by adding a significant number of low-power nodes to exploit more parallelism.


Chapter 9

Conclusions

This dissertation has achieved its goals: it has researched the current trends and technologies in low-power systems and techniques for High Performance Computing infrastructures and has reported the related work in the field. We have also designed and successfully built a hybrid seven-node cluster consisting of three different systems, Intel Xeon, Intel Atom and ARM (Marvell 88F6281), providing access to 34 cores, 57GB of RAM and 3.6TB of storage, and we have described the issues faced and how they were solved. The cluster environment supports programming in both the message-passing and shared-variable models; MPI, OpenMP and Java threads are supported on all of the platforms. We have experimented with and analysed the performance, power consumption and power efficiency of each system in the cluster.

Observing the market and the developments in HPC systems, low-power processors will start being one of the default choices in the very near future. The energy demands of large systems will require a shift to processors and systems that consider energy by design. Consumer electronics devices are becoming more and more powerful, as they need to execute computationally intensive applications, and are still designed with energy efficiency in mind.

To qualify and quantify the computational performance and efficiency as well as the power efficiency of each system, we ran three main benchmarks (CoreMark, High Performance Linpack and STREAM) in order to quantify the performance of each system on a performance-per-Watt basis for integer operations, floating-point operations and memory bandwidth. On CoreMark, the serial integer benchmark, the ARM system achieves the best performance-per-Watt, with 206.63 iterations per Watt against Intel Xeon and Intel Atom, which perform 55.76 and 56.03 iterations per Watt respectively on a single thread, and 432.90 and 171.25 when utilising every thread per core. This allows us to conclude that the ARM processor is very competitive and can achieve very high scores on integer operations, performing better than Intel Atom, which is a dual-core system with hyper-threading support, providing access to four logical cores.

The ARM system does not have a hardware FPU due to its ARMv5 architecture, and so it lacks performance on floating-point operations, as we can see from the HPL results: it achieves at maximum 1.37 GFLOPs, while Intel Xeon reaches 7.39 GFLOPs and Intel Atom 6.08 GFLOPs. In terms of power consumption, while ARM achieves the best performance-per-Watt, 152.2 versus 37.51 and 96.50 MFLOPs per Watt for Intel Xeon and Intel Atom respectively, it takes much longer to solve large problems. That introduces a high overhead in total power consumption, with the consequence that it uses more power in total than Intel Xeon or Intel Atom.

In terms of memory performance, for small array sizes the power consumption remains at minimal levels. Larger data sizes, above 2GB, increase the consumption on Intel Xeon and Intel Atom by 4 Watts. The ARM system is able to handle only small data sizes, up to 512MB. Intel Xeon achieves the highest bandwidth as it uses DDR3 at 1333MHz, while Intel Atom and ARM use DDR2 at 800MHz. Intel Atom also scores higher than ARM as it is able to store the maximum data set the ARM system can handle within a single memory chip, unlike ARM, which uses four individual memory chips.

Individual components affect system performance as well. We have observed that SSD storage can reduce the power consumption by 3 to 10 Watts when compared to a standard 3.5" HDD at 7200rpm. Other components, such as a different memory subsystem, interconnect or power supply, could also affect the system's performance and power consumption. Due to time and budget constraints, we did not experiment with different components for each of these subsystems.

In terms of porting and software support, all of the tested platforms support C, C++, Fortran and Java. ARM does not support the Java compiler, only the Java Runtime Environment. Intel Atom, being an x86-based architecture (despite its RISC-like internal design), supports and is fully compatible with any x86 system currently in use. ARM does not provide the same binary compatibility with existing systems, due to architectural differences, and requires recompilation of the targeted code. What is more, ARMv5 is not capable of performing floating-point operations at the hardware level, having to use soft float instead. The latest architecture, ARMv7, provides hardware FPU functionality as well as SIMD acceleration. To take advantage of the hardware FPU and SIMD acceleration, changes need to be made at the software level as well: Linux distributions, or the needed compilers and libraries with all their dependencies, need to be recompiled for the ARMv7 architecture in order to support the hardware FPU and improve the overall system performance.
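As an illustration of the software-level difference, the GCC float-ABI options select between software floating point and the ARMv7 hardware FPU/NEON path. The commands below are a hedged sketch: the exact flags that are usable depend on the toolchain, the C library ABI and the distribution in use, and stream.c stands in for any of the benchmark sources used here.

# Soft-float build, as required on the ARMv5 Marvell 88F6281 used in this work:
gcc -O2 -mfloat-abi=soft stream.c -o stream_soft

# Hard-float/NEON build for an ARMv7 part such as the Cortex-A9 (only valid if the
# whole toolchain and its libraries were built for the same float ABI):
gcc -O2 -march=armv7-a -mfpu=neon -mfloat-abi=hard stream.c -o stream_hard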

There is emerging interest among the HPC communities in exploiting low-power architectures and new designs to support the design and development of Exascale systems efficiently and reliably. In combination with the market developments in consumer devices, from desktops to mobile phones, this ensures that the functionality and performance of low-power processors will keep increasing to levels acceptable for HPC and scientific applications. The end of Moore's law introduces an extra need for the development of such systems.


Appendix A

CoreMark results

Processor    Iterations   Iterations/Sec   Total time (sec)   Threads   Consumption   PPW
Intel Xeon   100000       6617.25          15.11              1         119 Watt      55.60
Intel Atom   100000       2954.12          33.85              1         54 Watt       54.70
ARM          100000       1859.70          53.77              1         9 Watt        206.63
Intel Xeon   1000000      6636.40          150.68             1         119 Watt      55.76
Intel Atom   1000000      2969.70          336.73             1         53 Watt       56.03
ARM          1000000      1859.67          537.72             1         9 Watt        206.63
Intel Xeon   2000000      6610.49          302.54             1         126 Watt      54.46
Intel Atom   2000000      2953.23          677.22             1         54 Watt       54.68
ARM          2000000      1861.36          1074.48            1         9 Watt        206.81
Intel Xeon   1000000      51516.21         155.89             8         119 Watt      432.90
Intel Atom   1000000      9076.67          440.69             4         53 Watt       171.25
ARM          1000000      1859.67          537.72             1         9 Watt        206.63

Table A.1: CoreMark results for various iterations.


Appendix B

HPL results

Processor               Problem size   Block size   Method
Intel Xeon              13107          128          HPL
Intel Atom              3276           128          HPL
ARM (Marvell 88F6281)   409            128          HPL
Intel Xeon              41344          128          ACT
Intel Atom              20608          128          ACT
ARM (Marvell 88F6281)   7296           128          ACT
Intel Xeon              7296           128          Equal size
Intel Atom              7296           128          Equal size
ARM (Marvell 88F6281)   7296           128          Equal size
Intel Xeon              500            32           Small size
Intel Atom              500            32           Small size
ARM (Marvell 88F6281)   500            32           Small size

Table B.1: HPL problem sizes.


Processor               GFLOPs   Usage      PPW            Problem size
Intel Xeon              1.22     197 Watt   6.1 MFLOPs     13107
Intel Atom              4.28     55 Watt    77.81 MFLOPs   3276
ARM (Marvell 88F6281)   1.11     9 Watt     123.3 MFLOPs   409
Intel Xeon              1.21     197 Watt   6.1 MFLOPs     41344
Intel Atom              3.48     55 Watt    63.27 MFLOPs   20608
ARM (Marvell 88F6281)   1.10     9 Watt     122 MFLOPs     7296
Intel Xeon              1.21     197 Watt   6.14 MFLOPs    7296
Intel Atom              4.15     55 Watt    75.45 MFLOPs   7296
ARM (Marvell 88F6281)   1.10     9 Watt     122.2 MFLOPs   7296
Intel Xeon              7.18     197 Watt   36.44 MFLOPs   500
Intel Atom              5.46     55 Watt    99.27 MFLOPs   500
ARM (Marvell 88F6281)   1.13     9 Watt     125.5 MFLOPs   500

Table B.2: HPL results for the problem sizes listed in Table B.1.

Processor    GFLOPs   Usage      PPW
Intel Xeon   7.39     197 Watt   37.51 MFLOPs
Intel Atom   6.08     63 Watt    96.50 MFLOPs
ARM          1.37     9 Watt     152.2 MFLOPs

Table B.3: Peak HPL results.


Appendix C

STREAM results

Processor               Size    Function   Rate (MB/s)   Avg. time   Usage
Intel Xeon              500MB   Copy       3612.4793     0.0978      118 Watt
                                Scale      3642.3530     0.0968
                                Add        3960.9033     0.1334
                                Triad      4009.4806     0.1319
Intel Atom              500MB   Copy       2851.0365     0.1236      52 Watt
                                Scale      2282.0852     0.1543
                                Add        3033.9793     0.1742
                                Triad      2237.8844     0.2361
ARM (Marvell 88F6281)   500MB   Copy       777.8065      0.4029      8 Watt
                                Scale      190.8710      1.6398
                                Add        173.9241      2.6886
                                Triad      113.8851      4.0880
Intel Xeon              3GB     Copy       3627.5380     0.5886      122 Watt
                                Scale      3670.4334     0.5816
                                Add        3943.3052     0.8120
                                Triad      3991.4984     0.8022
Intel Atom              3GB     Copy       2875.2246     0.7422      56 Watt
                                Scale      2275.0291     0.9379
                                Add        3035.8659     1.0544
                                Triad      2269.4263     1.4103

Table C.1: STREAM results for array sizes of 500MB and 3GB.


Appendix D

Shell Scripts

D.1 add_node.sh

#!/bin/bash
#
# Author: Panagiotis Kritikakos

NODE=$1
ARCH=$2

ssh root@${NODE} 'mkdir /root/.ssh; chmod 700 /root/.ssh'
scp /root/.ssh/id_dsa.pub root@${NODE}:.ssh/authorized_keys

# ARM nodes need their own fstab; all other nodes share the standard one.
if [ "${ARCH}" == "ARM" ] || [ "${ARCH}" == "arm" ]; then
  scp fstab.arm root@${NODE}:/etc/fstab
else
  scp fstab root@${NODE}:/etc/fstab
fi

scp hosts root@${NODE}:/etc/hosts
scp profile root@${NODE}:/etc/profile
scp mom_priv.config root@${NODE}:/var/spool/torque/mom_priv/config
scp pbs_mom root@${NODE}:/etc/init.d/.

ssh root@${NODE} 'mount /home'
ssh root@${NODE} 'mkdir /usr/local/mpich2-1.3.2p1'
ssh root@${NODE} 'mount /usr/local/mpich2-1.3.2p1'

ssh root@${NODE} '/sbin/chkconfig --add pbs_mom \
  && /sbin/chkconfig --levels 234 pbs_mom on'

ssh root@${NODE} '/sbin/service pbs_mom start'

# Restart the PBS server so that it picks up the new node.
qterm -t quick
pbs_server
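A hypothetical invocation of the script above, assuming the node's hostname is already resolvable and listed in the copied hosts file; the second argument only matters for ARM nodes, which receive their own fstab:

./add_node.sh lhpc6 arm    # enrol the ARM node lhpc6 into the Torque/PBS cluster
./status.sh                # verify that pbs_mom is running on every node in nodes.txt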

D.2 status.sh

#!/bin/bash
#
# Author: Panagiotis Kritikakos

for i in `cat nodes.txt`
do
  ssh root@$i 'hostname; service pbs_mom status'
done

D.3 armrun.sh

#!/bin/bash
#
# Author: Panagiotis Kritikakos

ARMHOST=lhpc6
ARGUMENTS=$*

ssh ${ARMHOST} ${ARGUMENTS}

D.4 watt_log.sh

#!/bin/bash
#
# Author: Panagiotis Kritikakos

option=$1
logfile=$2
code=$3

getAvg(){
  totalwatts=`cat ${logfile} | \
    awk '{total = total + $1}END{print total}'`
  elements=`cat ${logfile} | wc -l`
  avgwatts=`echo "${totalwatts} / ${elements}" | bc`
  printf "\n\n Average watts: ${avgwatts}\n\n"
}

if [ "${option}" == "average" ]; then
  getAvg
  exit 0
fi

if [ $# -lt 3 ] || [ $# -gt 3 ]; then
  echo " Specify logfile and code"
  exit 1
fi

if [ -e ${logfile} ]; then rm -f ${logfile}; fi

codeis=`ps aux | grep ${code} | grep -v grep | wc -l`

while [ ${codeis} -gt 0 ]; do
  sudo /usr/sbin/ipmi-sensors | grep -w "System Level" | \
    awk {'print $5'} | awk ' sub("\\.*0+$","") ' >> ${logfile}
  sleep 1
  codeis=`ps aux | grep ${code} | grep -v grep | wc -l`
done

getAvg
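A hypothetical usage of the logger above: the benchmark is started first, then the script is pointed at its process name. One IPMI reading is appended per second while the process is alive, and the average of the samples is printed when it exits; the file and binary names are illustrative.

./xhpl &                                  # start the benchmark to be measured
./watt_log.sh record hpl_run.watts xhpl   # sample power once per second while xhpl runs
./watt_log.sh average hpl_run.watts       # re-print the average from the saved log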

D.5 fortran2c.sh

#!/bin/bash

fortranFile=$1
fileName=`echo $1 | sed 's/\(.*\)\..*/\1/'`

f2c $fortranFile
gcc ${fileName}.c -o $fileName -lf2c


Appendix E

Benchmark outputs samples

E.1 CoreMark output sample

The output that follows is a sample output from an Intel Xeon system when executing CoreMark with 100000 iterations and a single thread.

2K performance run parameters for coremark.
CoreMark Size     : 666
Total ticks       : 15112
Total time (secs) : 15.112000
Iterations/Sec    : 6617.257808
Iterations        : 100000
Compiler version  : GCC4.1.2 20080704 (Red Hat 4.1.2-50)
Compiler flags    : -O2 -DPERFORMANCE_RUN=1 -lrt
Memory location   : Please put data memory location here
                    (e.g. code in flash, data on heap etc)
seedcrc           : 0xe9f5
[0]crclist        : 0xe714
[0]crcmatrix      : 0x1fd7
[0]crcstate       : 0x8e3a
[0]crcfinal       : 0xd340
Correct operation validated. See readme.txt for run and reporting rules.
CoreMark 1.0 : 6617.257808 / GCC4.1.2 20080704 (Red Hat 4.1.2-50) -O2 -DPERFORMANCE_RUN=1 -lrt / Heap

E.2 HPL output sample

The output that follows is a sample output from an Intel Atom system when executing HPL with problem size N=407.

Gflops : Rate of execution for solving the linear system.


The following parameter values will be used:

N      : 407
NB     : 128
PMAP   : Row-major process mapping
P      : 1
Q      : 1
PFACT  : Right
NBMIN  : 4
NDIV   : 2
RFACT  : Crout
BCAST  : 1ringM
DEPTH  : 1
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

----------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0

==================================================================
T/V                N    NB     P     Q               Time             Gflops
--------------------------------------------------------------------------
WR11C2R4        3274   128     2     2              54.62          4.287e-01
--------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0051440 ...... PASSED
==================================================================

Finished 1 tests with the following results:
         1 tests completed and passed residual checks,
         0 tests completed and failed residual checks,
         0 tests skipped because of illegal input values.

----------------------------------------------------------------------------

End of Tests.
==================================================================


E.3 STREAM output sample

The output that follows is a sample output from an Intel Atom system when executing STREAM with array size 441.7MB.

-------------------------------------------------------------
STREAM version $Revision: 5.9 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 19300000, Offset = 0
Total memory required = 441.7 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 2 microseconds.
Each test below will take on the order of 1194623 microseconds.
   (= 597311 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time   Min time   Max time
Copy:         777.8065      0.4029     0.3970     0.4318
Scale:        190.8710      1.6398     1.6178     1.6900
Add:          173.9241      2.6886     2.6632     2.7319
Triad:        113.8851      4.0880     4.0673     4.1260
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------


Appendix F

Project evaluation

F.1 Goals

The project has achieved the following goals, as set by the project proposal and as presented in this dissertation:

• Report on low-power architectures targeted for HPC systems.

• Report on related work done in the field of low-power HPC.

• Report on the analysis and specification of requirements for the low-power HPCproject.

• Report on the constraints of the available architectures on their use in HPC.

• Functional low-power seven-node cluster targeted for HPC applications.

• A specific set of benchmarks that can run across all chosen architectures.

• Final MSc dissertation.

The final project proposal can be found in Appendix G.

F.2 Work plan

The schedule presented in the Project Preparation report has been followed and we have met the deadlines it describes. Slight changes have been made, and the schedule had to be adjusted as the project progressed; the changes applied to the time scales of certain tasks.


F.3 Risks

During the project preparation, the following risks have been identified:

• Risk 1: Unavailable architectures.

• Risk 2: Unavailable tool-chains.

• Risk 3: More time required to build cluster/port code than to run benchmarks.

• Risk 4: Unavailability of identical tools /underlying platform.

• Risk 5: Architectural differences.

These risks did not ultimately affect the project, and we managed to mitigate those that occurred, as described in the project preparation report. However, we came across two further risks that had not been initially identified:

• Risk 6: Service outage and support by the University’s related groups.

• Risk 7: Absence due to summer holidays.

The first one affected the project for a week and slowed down the experimentation process, as the cluster could not be accessed remotely due to network issues that were later solved. During this outage, the cluster had to be accessed physically to conduct experiments and gather results. The second one did not cause any issues to the project itself, but if there had been no absence, more experiments and more benchmarks could have been designed and executed.

F.4 Changes

The most important change was the decision over which benchmarks to run. We left aside the SPEC benchmarks as they would require long execution times, something that could not be afforded within the project. We also left out the NAS Parallel Benchmarks, as they proved somewhat complicated to execute in a comparable manner on all three architectures and would have taken rather long to finish execution and gather the needed results. For that reason we finally decided to proceed with the CoreMark benchmark to measure the serial performance of a core, High-Performance Linpack to measure the parallel performance of a system, and STREAM to measure the memory bandwidth of a system. These three benchmarks have been a good choice as they are widely used and accepted in the HPC field, are configurable, easy to run and complete their execution in relatively short time, enabling us to design a number of different experiments for qualifying and quantifying our results.


Appendix G

Final Project Proposal

G.1 Content

The main scope of the project is to investigate, measure and compare the performance of low-power CPUs versus standard commodity 32/64-bit x86 CPUs when executing selected High-Performance Computing applications. Performance factors to be investigated include: the computational performance along with power consumption and porting effort of standard HPC codes across to the low-power architectures.

Using 32/64-bit x86 as the baseline, a number of different low-power CPUs will be investigated and compared, such as ARM, Intel Atom and PowerPC. The performance, in terms of cost and efficiency, of the various architectures will be measured by using well-known and established benchmarking suites. Due to the differences in the architectures and the available supported compilers, a set of appropriate benchmarks will need to be identified. Fortran compilers are not available on the ARM platform, therefore a number of C or C++ codes that represent either HPC applications or parts of HPC operations will need to be identified, in order to put the systems under stress.

G.2 The work to be undertaken

G.2.1 Deliverables

• Report on low-power architectures targeted for HPC systems.

• Report on related work done in the field of low-power HPC.

• Report on the analysis and specification of requirements for the low-power HPCproject.

• Report on the constraints of the available architectures on their use in HPC - e.g., 32-bit only, toolchain availability, existing code ports.


• Functional low-power cluster, between 6 and 12 nodes, targeted for HPC applications.

• A specific set of codes that can run across all chosen architectures.

• Final MSc dissertation.

• Project presentation.

G.3 Tasks

• Survey of available and possible low-power architecture for HPC use.

• Survey on existing work done in the low-power HPC field.

• Deployment of low-power HPC cluster.

• Identification of an appropriate set of benchmarks to run on all architectures, run experiments and analyse the results.

• Writing of the dissertation reflecting the work undertaken and the outcomes of the project.

G.4 Additional information / Knowledge required

• Programming knowledge and skills are assumed, as the benchmark codes might require porting.

• Systems engineering knowledge to build up, configure and deploy a low-power cluster.

• Understanding of different methods/techniques of power measurement for computer systems.

• Presentation skills for writing a good dissertation and presenting the results of the project in public.


Bibliography

[1] P. M. Kogge et al., "Exascale Computing Study: Technology Challenges in Achieving Exascale Systems", DARPA Information Processing Techniques Office, Washington, DC, pp. 278, September 28, 2008.

[2] J. Dongarra et al., "International Exascale Software Project: Roadmap 1.1", http://www.Exascale.org/mediawiki/images/2/20/IESP-roadmap.pdf, February 2011

[3] D. A. Patterson and D. R. Ditzel, "The Case for the Reduced Instruction Set Computer", ACM SIGARCH Computer Architecture News, 8:6, 25-33, Oct. 1980.

[4] D. W. Clark and W. D. Strecker, "Comments on 'The Case for the Reduced Instruction Set Computer'", ibid, 34-38, Oct. 1980.

[5] Michio Kaku, "The Physics of the Future", 2010

[6] S. Sharma, Chung-Hsing Hsu and Wu-chun Feng, "Making a Case for a Green500 List", 20th IEEE International Parallel & Distributed Processing Symposium (IPDPS), Workshop on High-Performance, Power-Aware Computing (HPPAC), April 2006

[7] W. Feng, M. Warren, E. Weigle, "Honey, I Shrunk the Beowulf", In the Proceedings of the 2002 International Conference on Parallel Processing, August 2002

[8] Wu-chun Feng, "The Importance of Being Low Power in High Performance Computing", CTWatch Quarterly, Volume 1, Number 3, Page 12, August 2005

[9] NVIDIA Press, http://pressroom.nvidia.com/easyir/customrel.do?easyirid=A0D622CE9F579F09&version=live&releasejsp=release_157&xhtml=true&prid=705184, Accessed 13 May 2011

[10] HPC Wire, http://www.hpcwire.com/hpcwire/2011-03-07/china_makes_its_own_supercomputing_cores.html, Accessed 13 May 2011

[11] Katie Roberts-Hoffman, Pawankumar Hedge, ARM (Marvell 88F6281) vs. Intel Atom: Architectural and Benchmark Comparisons, EE6304 Computer Architecture course project, University of Texas at Dallas, 2009


[12] R. Ge, X. Feng, H. Pyla, K. Cameron, W. Feng, Power Measurement Tutorial for the Green500 List, The Green500 List: Environmentally Responsible Supercomputing, June 27, 2007

[13] J. J. Dongarra, The LINPACK benchmark: an explanation, In the Proceedings of the 1st International Conference on Supercomputing, Springer-Verlag New York, Inc., New York, NY, USA, 1988

[14] Piotr R. Luszczek et al., The HPC Challenge (HPCC) benchmark suite, In the Proceedings of SC '06, the 2006 ACM/IEEE Conference on Supercomputing, New York, NY, USA, 2006

[15] D. Weeratunga et al., "The NAS Parallel Benchmarks", NAS Technical Report RNR-94-007, NASA Ames Research Center, Moffett Field, CA, March 1994

[16] Cathy May et al., "The PowerPC Architecture: A Specification for a New Family of RISC Processors", Morgan Kaufmann Publishers, 1994

[17] Charles Johnson et al., A Wire-Speed Power Processor: 2.3GHz 45nm SOI with 16 Cores and 64 Threads, IEEE International Solid-State Circuits Conference, white paper, 2010

[18] D. M. Tullsen, S. J. Eggers, H. M. Levy, "Simultaneous multithreading: maximizing on-chip parallelism", ISCA '95, pp. 392-403, June 22, 1995

[19] Green Revolution Cooling, http://www.grcooling.com, Accessed 2 June 2011

[20] Google Data Centers, http://www.google.com/corporate/datacenter/index.html, Accessed 2 June 2011

[21] Sindre Kvalheim, "Lefdal Mine Project", META magazine, Number 2: 2011, p. 14-15, Notur II Project, February 2011

[22] Knut Molaug, "Green Mountain Data Centre AS", META magazine, Number 2: 2011, p. 16-17, Notur II Project, February 2011

[23] Bjørn Rønning, "Rjukan Mountain Hall - RMH", META magazine, Number 2: 2011, p. 18-19, Notur II Project, February 2011

[24] Jacko Koster, "A Nordic Supercomputer in Iceland", META magazine, Number 2: 2011, p. 13, Notur II Project, February 2011

[25] Douglas Montgomery, "Design and Analysis of Experiments", John Wiley & Sons, sixth edition, 2004

[26] CoreMark, an EEMBC Benchmark, http://www.coremark.org, Accessed 12 May 2011

[27] Genesi's Hard Float optimizations speeds up Linux performance up to 300% on ARM Laptops, http://armdevices.net/2011/06/21/genesis-hard-float-optimizations-speeds-up-linux-performance-up-to-300-on-arm-laptops/, Accessed 21 June 2011


[28] K. Fürlinger, C. Klausecker, D. Kranzlmüller, The AppleTV-Cluster: Towards Energy Efficient Parallel Computing on Consumer Electronic Devices, whitepaper, Ludwig-Maximilians-Universität, April 2011

[29] NEON Technology, http://www.arm.com/products/processors/technologies/neon.php, Accessed 21 June 2011

[30] ARM, ARM NEON support in the ARM compiler, White Paper, September 2008

[31] MIPS Technologies, MIPS64 Architecture for Programmers Volume I: Introduction to the MIPS64, v3.02

[32] MIPS Technologies, MIPS64 Architecture for Programmers Volume I-B: Introduction to the microMIPS64, v3.02

[33] MIPS Technologies, MIPS64 Architecture for Programmers Volume II: The MIPS64 Instruction Set, v3.02

[34] MIPS Technologies, MIPS Architecture For Programmers Volume III: The MIPS64 and microMIPS64 Privileged Resource Architecture, v3.12

[35] MIPS Technologies, China's Institute of Computing Technology Licenses Industry-Standard MIPS Architectures, http://www.mips.com/news-events/newsroom/release-archive-2009/6_15_09.dot, Accessed 21 June 2011

[36] Young-Jin Kim, Kwon-Taek Kwon, Jihong Kim, Energy-efficient disk replacement and file placement techniques for mobile systems with hard disks, In the Proceedings of the 2007 ACM Symposium on Applied Computing, New York, NY, USA, 2007

[37] Young-Jin Kim, Kwon-Taek Kwon, Jihong Kim, Energy-efficient file placement techniques for heterogeneous mobile storage systems, In the Proceedings of EMSOFT '06, the 6th ACM & IEEE International Conference on Embedded Software, ACM, New York, NY, USA, 2006
