voltage noise in multi-core processors: empirical ...rbertran.site.ac.upc.edu/pdfs/micro2014.pdf ·...

Voltage Noise in Multi-core Processors:Empirical Characterization and Optimization Opportunities

Ramon Bertran∗, Alper Buyuktosunoglu∗, Pradip Bose∗, Timothy J. Slegel†,Gerard Salem‡, Sean Carey†, Richard F. Rizzolo†, Thomas Strach§

∗IBM Research, Yorktown Heights, NY, USA†IBM Systems & Technology Group, Poughkeepsie, NY, USA‡IBM Systems & Technology Group, Burlington, VT, USA§IBM Systems & Technology Group, Boeblingen, Germany

Email: {rbertra,alperb,pbose,slegel,gsalem,smcarey,rizzolo}@us.ibm.com, [email protected]

Abstract—Voltage noise characterization is an essential aspectof optimizing the shipped voltage of high-end processor basedsystems. Voltage noise, i.e. variations in the supply voltage dueto transient fluctuations on current, can negatively affect the ro-bustness of the design if it is not properly characterized. Modelingand estimation of voltage noise in a pre-silicon setting is typicallyinadequate because it is difficult to model the chip/systempackaging and power distribution network (PDN) parametersvery precisely. Therefore, a systematic, direct measurement-basedcharacterization of voltage noise in a post-silicon setting ismandatory in validating the robustness of the design.

In this paper, we present a direct measurement-based voltagenoise characterization of a state-of-the-art mainframe class multi-core processor. We develop a systematic methodology to generatenoise stressmarks. We study the sensitivity of noise in relation tothe different parameters involved in noise generation: (a) stim-ulus sequence frequency, (b) supply current delta, (c) numberof noise events and, (d) degree of alignment or synchronizationof events in a multi-core context. By sensing per-core noise in amulti-core chip, we characterize the noise propagation acrossthe cores. This insight opens up new opportunities for noisemitigation via workload mappings and dynamic voltage guard-banding.

Keywords-dI/dt; inductive noise; voltage droop; stressmarkgeneration; noise-aware workload mapping; dynamic guard-banding; multi-core hardware measurements

I. INTRODUCTION

For mainframe-class systems like the IBM System z series,zero-fault reliability is an essential part of the design specification.Continued technology scaling and the well known ‘power-wall’ has added pressure to the historically conservative designmargins used for reliable operation. Specifically, in the faceof aggressive supply voltage scaling and power managementsolutions to reduce the power burden, the need to characterize thevoltage noise in precise detail has become imperative. Thus, asystematic approach is needed to define and validate an operatingvoltage point that ensures robust performance and zero-faultreliability, without being overly conservative.

Modeling and characterization of voltage noise in a pre-siliconsetting is typically inadequate. This is because it is difficultto model the chip/system packaging and power distributionnetwork (PDN) parameters very precisely. Therefore, a system-atic, direct measurement-based characterization of voltage noisein a post-silicon setting (during the processor testing process)is required to define the adequate voltage levels and packagecharacteristics that ensure robust functionality.

A key part of such post-silicon processor testing processis the use of specially crafted benchmarks that maximize the

voltage noise, known as dI/dt stressmarks1. Manual, expert-driven generation (and fine-tuning) of such dI/dt stressmarkscan be a tedious and error-prone process. This is specially truein a multi-core setting, in which inter-core synchronization ofnoise events and experimental discovery of the system resonancefrequencies can require hundreds (or even thousands) of testruns with hand-crafted programs.

In this paper, we use a latest generation IBM Systemz multi-core processor (IBM zEnterpriseTM EC12 processorchip) to illustrate a systematic, automation-driven voltage noisecharacterization methodology. This methodology is deemed to bea key post-silicon aid in determining the optimal voltage levelsand package characteristics to ensure the robust performance andultra-reliable functionality that is targeted for IBM mainframesystems. In particular, we present a tight bounds analysis toquantify maximum voltage noise on the targeted system. Thisincludes a characterization of noise propagation across the coresin the multi-core chip. The main contributions of the paper are:

• A systematic methodology to generate all types of dI/dtstressmarks. The methodology, in contrast to previousworks (e.g. [25], [26], [28]), permits the control of differentparameters affecting the voltage noise generated: amountof ΔI , synchronization level of ΔI events, number ofconsecutive ΔI events and frequency of ΔI events. This‘white-box’ approach provides a rich toolbox to performdetailed noise characterizations.

• A complete noise sensitivity analysis of the differentparameters affecting voltage noise: the amount of ΔI ,the synchronization level of ΔI events, the number ofconsecutive ΔI events and the frequency of ΔI events.This detailed characterization allows us to: (a) validate therobustness of the zEC12 power delivery system, (b) confirmthe shift of the ‘1st droop’ resonance frequency to lowerbands (around the 2MHz band) due to the usage of deeptrench technology and large on-chip eDRAMs, (c) empiri-cally quantify the relative importance of each parameterinvolved in the noise generation. We show that the amountof ΔI and the synchronization of ΔI events are the maincontributors of noise and the number of consecutive ΔIevents and their frequency are secondary factors. Overall,we present a significantly more comprehensive noisecharacterization of multi-core systems compared to priorart (e.g. [2], [27], [31], [40], [41], [55]), which do not studyall the parameters related to noise.

1In this work we use the terms dI/dt and ΔI interchangeably and we referto a dI/dt event or ΔI event as a fast transition in power consumption.

2014 47th Annual IEEE/ACM International Symposium on Microarchitecture

1072-4451/14 $31.00 © 2014 IEEE

DOI 10.1109/MICRO.2014.12

368

2014 47th Annual IEEE/ACM International Symposium on Microarchitecture

1072-4451/14 $31.00 © 2014 IEEE

DOI 10.1109/MICRO.2014.12

368

Figure 1: Schema of a power distribution network (PDN).

• An analysis of the noise propagation on multi-core proces-sors based on per-core noise hardware measurements. Wedescribe a method to empirically evaluate the effects ofphysical layout, process variation and other power deliverysystem design parameters on the noise perceived by eachcore. For instance, we show that the L3 cache —with arelatively large capacitance— isolates the noise comingfrom different cores. Thus, it acts as a damping element.

• A discussion of the optimization opportunities that arisefrom noise propagation behavior. We provide a conceptualoverview of the potential benefits of a noise-aware workloadmapping policy and a utilization-based dynamic voltageguard-banding.

To the best of our knowledge, there is no previous workon complete noise characterization of multi-core processorsin which both intra- and inter-core noise behavior is capturedand interpreted in the context of a real, high-end mainframesystem. Accurate noise characterization and robustness testing isa pre-requisite to shipping systems to customers in this high-endmarket segment. Hence, this work is of significant and tangiblevalue to a commercial systems vendor.

The rest of the paper is organized as follows: In Section IIwe review the fundamentals of voltage noise. Section IIIpresents the stressmark generation methodology. Section IVdetails the experimental set up. Section V and VI present theanalysis of noise behavior and inter-core noise characteristicsrespectively. Section VII discusses the optimization opportunities.Section VIII presents the related work and we conclude inSection IX.

II. BACKGROUND

This section provides the basic background to understand thereasons for the existence of voltage noise and its negative effectson energy efficiency and reliability. For further details, we referthe reader to the several previous works on the topic [16], [20],[22], [24], [30], [34], [38], [43], [46].

A. Problem OverviewIn a computer system, the electrical power has to be distributed

from the main power source to all the devices present in themicro-processor. This is done through the power distributionnetwork (PDN). Figure 1 shows a schematic diagram of aPDN. First, a voltage regulator module (VRM) attached in themotherboard reduces the voltage to a proper package voltage.Then, the power is distributed through the motherboard to thepackage, where controlled collapse chip connections (C4s) —also known as flip chips— connect the package with the micro-processor connection pads to distribute the power inside thechip. Finally, the on-chip PDN distributes the power across allthe devices in the micro-processor.

In an ideal scenario, the on-chip devices would observethe voltage level generated by the VRM, Vnom. However, the

Figure 2: Model of a power distribution network (PDN).

parasitic components present in the PDN produce variationsin the voltage supply level perceived by the devices (VDie).Figure 2 shows a simplified model of the power distribu-tion network that includes the components (resistances andimpedances) present on the motherboard, package and die.Resistive components (RMb, RPkg1, RPkg2, RDie) producea voltage droop across the network referred to as IR-droop(Vdroop), whereas inductive components (LMb, LPkg1, LPkg2,LDie) produce voltage fluctuations across the network. We referto these fluctuations as didt-droop (Vdidt), which in conjunctionwith the Vdroop constitute the voltage noise (Vnoise). Thus,the VDie = Vnom − Vdroop − Vdidt = Vnom − Vnoise.

The maximum magnitude that Vnoise can achieve definesthe voltage margin (Vmargin) required for a reliable operation.Designers mitigate the Vnoise magnitude by adding decouplingcapacitance —also known as decap— at different points inthe PDN (CMb, CPkg and CDie in Figure 2) [6]. However,since the amount of decap is limited by area constraints, theVnoise cannot be nullified. In this work, we focus our analysis incharacterizing the Δ in VDie (Vnoise) depending on the activitybeing generated in a multi-core processor (Idie(t)).

B. FundamentalsIn order to understand the reasons of such Vnoise, we have

to review some fundamentals on circuit theory. In reactive RLCcircuits with time-varying signals —like the PDNs— Ohm’slaw is generalized to:

ΔVDie = Vnoise = ΔI · Z (1)

where the Z denotes the impedance, i.e. the complex general-ization of resistance. Z can be further decomposed to:

Z =√R2 + (XL −XC)2 (2)

where R is the resistance in ohms of the resistor elements, XL isthe reactance of the inductor elements and XC is the reactanceof the capacitor elements. Given the following definition ofreactance of the inductor and capacitor elements that assumes asinusoidal input signal2:

XL = 2π · f · L (3)

XC = (2π · f · C)−1 (4)

where f is the frequency of the signal, L is the inductance inhenries, C the capacitance in farads; we observe that both XL

and XC depend on the input signal frequency. As a result, fora given R, L, C, the Vnoise can be expressed as a function ofthe ΔI generated and the frequency (f ) at which this ΔI isgenerated.

2Fourier analysis can be used to approximate any arbitrary signal as a sumof sinusoids.

369369

The resonance frequency (f ).: Parallel RLC circuits havethe property of resonance at specific frequencies. Resonanceis the disposition of the circuit to have higher magnitudeoscillations for some frequencies than others. Resonance occursbecause energy is kept in two distinct ways in the circuit: in themagnetic field of the inductors and in the charge of capacitors.The rate at which the energy is transferred from one to the otherin a periodic fashion and where the magnitude of oscillations ismaximum is referred to as resonant frequency. The PDN modelshown in Figure 2 can be decomposed into a set of parallel RLCcircuits. Therefore, this results in various resonant frequencies:

fr = (2π√L · C)−1 (5)

that are the consequence of the interaction of the differentinductances and decoupling capacitances present in the PDN.One effect of stimulating a parallel RLC circuit at resonantfrequency is that the impedance (Z) is maximized. As a result,the Vnoise generated is also maximized (Equation 1).

Since the activity in the device can eventually generate ΔI atsuch resonant frequencies, the impedance Z of the system mustbe defined not only based on the operating clock frequency, butalso on the spectrum of frequencies where current fluctuationscan exist. This definition is done during the package designprocess, when PDN impedance (Z) profiles and decap mapsare generated. In that process, package designers ensure that atarget maximum impedance Z is not surpassed for any givenfrequency by placing enough decaps in parallel. This guaranteesthat Vnoise remains within a constrained magnitude, allowingfor affordable and reliable voltage margins (Vmargin).

Then, one could define the Vmargin based on impedanceZ profiles generated during the package design process [2],[14]. However, that would lead to a pessimistic design since,after all, the Vmargin defined by this process is based on an in-lab generated worst-case maximum ΔI . Therefore, to improvethe efficiency of the design, Vmargin must be squeezed to thelimits using possible real-world worst-case situations, but withoutimpacting the reliability of the design.

According to the formulas mentioned above, in order to checkthe reliability of the design it is sufficient to validate that thevoltage noise is within the limits (Vnoise < Vmargin) when amaximum ΔI event is generated at resonant frequency. Thisis done through stressmarks, which are specifically tailored togenerate such (possible but improbable) worst case conditions.Generating them is a challenging task that gained focus inrecent years [25], [26], [28]. In this work, besides presentinga methodology to generate these stressmarks, we focus ourdiscussion on exploring, in detail, the effects of the ΔImagnitude and the ΔI frequency on the Vnoise.

C. Vnoise in Multi-core ProcessorsIn the previous section we have seen that the Vnoise can be

expressed as a function of the ΔI generated by the activity in theprocessor and the frequency (f ) at which this ΔI is generated.Ideally, in a multi-core processor, this could be expressed as:

Vnoiseproc = max0≤i≤#cores

(Vnoisei) = max0≤i≤#cores

(γi(ΔIi, fΔIi))

(6)

where the total Vnoiseproc is defined as the maximum of thelocal noise Vnoisei generated in each core. This local noise is a

function (γi) of the ΔIi that the core generates and the frequencyat which it is generated (fΔIi ). This definition assumes thatnoise on each core is independent from each other. However,this is not accurate since the on-chip PDN is shared across thecores in order to minimize the noise [21]. Therefore, a morerealistic definition of the local noise on a given core i (Vnoisei )would be:

Vnoisei = γi(ΔIi, fΔIi) + ηi(Vnoisej , ∀j �= i) (7)

where the interaction of local noises generated by different cores(second element of the formula) is taken into account. Noticethat we define noise function γ and interaction function η on aper core basis in order to express that different cores can showdifferent noise levels and interactions depending, for instance,on their physical layout or due to process variations. This simpleformulation of a much more complex system provides a highlevel view of the complexity of performing a sound voltagenoise characterization of a multi-core processor.

In this work, we focus our analysis in understanding theVnoise depending on the activity being generated on each corein the processor. Specifically, we first examine in detail thedifferent parameters of interest in noise generation: (a) theamount of ΔI , (b) the frequency of ΔI , (c) the alignmentof ΔI events generated on each core, and (d) the numberof consecutive ΔI events generated. Then, we analyze theinteractions of local noises generated by different cores in orderto identify possible trends resulting from a particular physicallayout or decap placement. This is done by analyzing datagathered from noise measurement macros implemented on eachcore of the processor. This is the first work performing suchlocal vs. global noise characterization on a production system.This type of characterization efforts provide a fundamental pieceof data to researchers and system architects in order to definethe requirements of voltage noise mitigation mechanisms [4],[15], [27], [29], [49].

III. EXPERIMENTAL FRAMEWORK

This section describes the experimental platform as wellas the different methods used to gather the experimental datapresented in this paper. Unless otherwise stated, experimentshave been run on different processors multiple times to checktheir reproducibility, and arithmetic average values are reported.

Evaluation platform: The experimental platform is an IBMzEnterpriseTM EC12 system (zEC12) [44], [51]. Each processorchip (CP) in the system contains six super-scalar out-of-orderprocessor cores running at 5.5 GHz. Each core contains privateinstruction and data L1 and L2 SRAM caches with 256B cachelines. The center of the chip contains a 48 MB eDRAM L3cache that is shared by all 6 cores. The DDR3 interface is on theleft side of the chip (MCU), and the I/O bus controller (GX) ison the right. Figure 3 shows the chip layout. For the purpose ofthis work, various CP chips of zEC12 systems were measured.

Software environment: When running experiments, theplatform runs SUSE Linux Enterprise Server 11 SP3 with linuxkernel version 3.0.82-0.7. This version provides the standardPCL API [8] to access hardware performance counters. We usethis standard interface to gather performance counter data toassess the generation of dI/dt stressmarks.

370370

Figure 3: IBM zEC12 CP chip layout. Each unit, i.e. the cores,the MCU, the GX and the nest, implements a skitter macro thatallows measurement of voltage noise.

Voltage control: Each IBM zEC12 system is provided witha management console, i.e. a service element, from which controland monitoring commands can be sent to the different devices inthe system. Specifically, chip voltage can be controlled in stepsof 0.5% of the nominal voltage. This fine-grained granularityenables accurately finding the failure point when errors start tooccur. Errors are detected using the recovery unit (R-Unit) thatis implemented in these highly reliable systems [45], [52].

In order to find the available voltage margin, i.e. the voltageat failure [26], [27], the operating voltage is slowly decreaseduntil first failure happens. This process, commonly named Vmin

experiment, is the ultimate bullet-proof method to check theavailable voltage margin. The drawback of this method is its longturn-around time. This is because voltage is decreased slowly—0.5% every two minutes— until first failure. In addition, foreach experiment, the system performs failure checks and has tobe rebooted. Therefore, since we had other means to evaluatethe voltage noise generated (see below), we only used thismechanism to double-check the stressmarks generated and togather the extra system information, e.g. the circuit paths withinthe design macros that fail first.

Power measurement: The service element can monitor thepower consumption of each device in the system. Specifically,power measurements are done at the chip level by reading thecurrent and voltage of the input rails. The granularity of thereadings is in milliwatts. This power consumption data has beenused extensively to assess the generation of the dI/dt stressmarks.

Voltage noise measurement: The measurement of thevoltage noise on the different cores of the system is done throughthe skitter macros implemented in the different units of thezEC12 CP chip [13], [42]. We also performed oscilloscopemeasurements of key experiments to confirm the voltagebehavior. Next, we provide an overview about the skitters. Werefer the reader to [13], [42] for an in-depth explanation of theiroperation and usability.

Skitter macros contain a latched-tapped delay line of 129inverters, for the best timing resolution, with a nominal delaybetween 5 and 8 ps depending on the threshold voltage andtechnology. These delay lines and sampling latches form anedge-capture circuit that captures the rising and falling edges ofthe clock signal. The sampling latches take a snapshot of thestate of the inverter chain every cycle, forming a 129 bit stringof 0’s with 1’s where the edges are detected. If the inverter

��

��

��

��

��

��

��

��

��

��

��

��

� �� !��

��

��

Figure 4: dI/dt stressmark generation methodology. First, theEPI profile of the target platform is generated. Then, the profileis used to obtain max./min. power instruction sequences. Finally,instruction sequences, system and stimulus frequencies, andsynchronization options are used to generate the desired dI/dtstressmark.

delays are constant and there is no clock jitter, then the clockedges (the 1’s) are captured in the same latches (same bit in thestring). When there is clock jitter, the location of the capturededges varies.

Besides measuring the clock jitter, skitter macros implicitlytake into account the effect of voltage changes on the clock jitterand skew. This is because the delay line is very sensitive tovoltage changes. In zEC12 processor chip, skitters were placedin locations that were known to have the potential for large ΔIevents. Additional skitters were then placed to add visibilitythroughout the design. In this study, we provide results for theskitter macros implemented on the cores.

Skitter macros can be configured to run in sticky-mode. Inthat case, the latches record all locations where one or moreclock edges occur within a period of time. This allows the worst-case timing uncertainty (noise) to be measured while runningany application. We used this mode for measuring worst-casevoltage noise while running our dI/dt stressmarks.

Results of skitter macros are reported in percentage peak-to-peak variation (%p2p). The %p2p value is directly related tothe voltage droop in the PDN. That is, the higher the %p2pnoise, the higher the voltage droop. These characteristics enableus to quickly and reliably evaluate the voltage droops in thesystem, avoiding the time-consuming Vmin experiments.

IV. DI/DT STRESSMARK GENERATION METHODOLOGY

In order to profile the voltage noise of a given architecture,several thousands of dI/dt stressmarks —with different properties,such as the target stimulus frequency, ΔI synchronizationmethod or alignment constraints— have to be generated. Manualimplementation is impractical. Therefore, we require a systematicmethodology to generate dI/dt stressmarks with the differentdesired properties.

Figure 4 shows a high-level view of the methodologyused to generate dI/dt stressmarks. The methodology uses theMicroprobe [3] micro-benchmark generation framework as theunderlying infrastructure to generate the dI/dt stressmarks. Tothat end, before starting this noise characterization effort, aback-end knowledge base for the zEC12 architecture had to beimplemented via target definition files [3].

As discussed in Section II, there are various parameters thataffect the noise generated. One parameter is the magnitude of

371371

Rank # Instr. Description Power1 CIB Compare immediate and branch (32<8) 1.582 CRB Compare and branch (32) 1.573 BXHG Branch on index high (64) 1.574 CGIB Compare immediate and branch (64<8) 1.555 CHHSI Compare halfword immediate (16<16) 1.55

1297 DDTRA Divide long DFP with rounding mode 1.011298 MXTRA Multiply extended DFP with rounding mode 1.011299 MDTRA Multiply long DFP with rounding mode 11300 STCK Store clock 11301 SRNM Set rounding mode 1

Table I: First and last five instructions in the zEC12 EPI profile.Instruction power is normalized to SRNM instruction.

the core ΔI . In order to maximize it, maximum and minimumpower instruction sequences first need to be defined, and thenconcatenated. Another parameter is the frequency at which coreΔIs are generated, controlling it permits the detection of theresonant regions of the system. One more parameter is theinstant at which the core ΔIs are generated, by synchronizingthem the overall voltage droop is maximized. Finally, thelast parameter that affects the overall noise generated is thenumber of consecutive ΔI events generated. The rest of thesection describes how we generate these parametrizable dI/dtstressmarks.

A. Energy-per-instruction profile

The first step required to produce dI/dt stressmarks is thegeneration of an energy-per-instruction (EPI) profile [3]. Thisprofile allows ranking all the ISA instructions by their powerusage. The rank is used afterwards to select the instructioncandidates when searching for maximum and minimum powerinstruction sequences.

This systematic methodology for generating an EPI profileconsists in generating a micro-benchmark for each and everyinstruction in the ISA. The micro-benchmark skeleton is anendless loop with 4000 repetitions of the instruction, withoutdependencies. Micro-benchmarks are run for a few secondsand power and performance metrics are gathered. Metrics arepost-processed afterwards in order to generate the EPI profile.

The fact that all instructions are profiled provides a completepicture of the ISA characteristics. This is useful to detect poten-tial instruction candidates for low and high power instructionsequence generation that would not be considered by an expert.For instance, in Table I, which shows the first and last fiveinstructions in the instruction ranking, we see the non-intuitivecase where a compare immediate instruction (CHHSI) is inthe Top 5 of most energy consuming instructions. Besides thedetection of these non-intuitive cases, the profile also providesimportant feedback to optimize future implementations of thearchitecture.

B. Maximum and minimum power instruction sequences

In order to generate the maximum power instruction sequence,we implement a methodology based on the one presentedin [3]. In that work, the authors propose to select instructioncandidates by functional unit basis —the top instruction for eachfunctional unit was selected— and then evaluate all the possiblecombinations of a given length to select the highest power one.We adapt that approach to the extra complexity of the zEC12CISC architecture.

Figure 5 shows the methodology used to generate themaximum power instruction sequence. It includes the followingsteps:

1) Instruction candidate selection: We use the EPI profile tocategorize the instructions by their functional unit usageand issue class. From each category, we select the topmost power-consuming instructions. Categories with lowpower or low IPC3 are discarded to reduce the numberof instruction candidates to nine, avoiding a design spaceexplosion problem.

2) Sequence candidate generation: We generate all possiblecombinations of length six of these nine instructions(96 = 531 441). Length six is selected because it is twicethe dispatch group size in the zEC12 architecture. Thus, itprovides the best trade-off between combinations exploredand experimental time.

3) Microarchitectural filtering: The 531 441 combinationsare filtered using constraints based on the microarchi-tecture details implemented in Microprobe. For instance,sequences that are known to not have an average dispatchgroup size of 3 —the maximum possible in zEC12— arefiltered out because they will not exhibit a high IPC. Otherconstraints, such as the maximum number of branches orprefetches per sequence are also added. This results in areduction of the design space from 531 441 sequences to32000.

4) IPC filtering: Since it is still not practical to evaluatethe power of the 32000 remaining candidate sequences,we filter them based on their IPC. We select the topthousand sequences showing the highest IPC. Two reasonsmotivated us to choose IPC filtering: (a) it is well-known that IPC is directly related to power and, (b) IPCevaluations are much faster than the power ones because:(i) it is possible to run them in parallel using differentcores and machines and, (ii) a run of a few seconds isenough to obtain the IPC. This is much faster than thepower evaluations, which have to be done on the sameprocessor with the same experimental conditions (e.g. chiptemperature) between runs for a fair comparison.

5) Power evaluation: We evaluate the power consumptionof the remaining thousand sequence candidates. Thesequence showing the highest power is selected as themaximum power instruction sequence. We validate thesequence on different processors to confirm its high powerconsumption.

Finally, we also rely on the EPI profile to define the minimumpower sequence. We select the last instruction of the instructionrank as the minimum power sequence. Note that, the no-operationinstruction (nop) is not the optimal candidate. Instead, long-latency instructions (such as divisions or decimal instructions)are better candidates because they stall all parts of the processor.The EPI profile provides an evidence of that fact, confirmingprevious findings [22].

3We refer to IPC as the micro-operations executed per cycle, which in aCISC architecture might differ from the general IPC definition of instructioncommitted per cycle.

372372

Figure 5: Maximum power instruction sequence generation methodology. First, instruction candidates are selected from the EPIprofile. Second, several combinations are generated and filtered using microarchitecture constraints. Then, IPC is evaluated toselect the top thousand instruction sequence candidates which will proceed to the power evaluation stage.

��

��

��

��

��

��

��

�

�

��

Figure 6: dI/dt stressmark skeleton. Rich number of configurableparameters to control the noise generation: number of steps, syn-chronization options, low and high power instruction sequencesand their IPCs, stimulus frequency and target system frequency.

C. dI/dt stressmark generation

In order to understand in detail the voltage noise in multi-coresystems and present studies such as the ones presented in Sec-tions V and VI, several dI/dt stressmarks with different propertiesare needed. For instance, we need dI/dt stressmarks generatingdifferent amounts of ΔI , with different stimulus frequencies—i.e. duration between ΔI events—, with different number ofconsecutive ΔI events, with specific alignment between ΔIevents, etc. Therefore, the dI/dt stressmark generation shouldbe highly configurable.

We have implemented a dI/dt stressmark generation policyon top of the Microprobe framework. As shown in the last stepof Figure 4, the policy takes as input the high and low powersequences and their IPCs, the target system frequency, the targetstimulus frequency, the number of consecutive ΔI events togenerate and the synchronization options. With that information,one can derive the length of high and low power sequencesto generate low/high activity at the given stimulus frequency.Then, high and low power sequences are concatenated within aloop to generate a dI/dt stressmark as shown in Figure 6. Weexplored the addition of intruction dependencies between highand low power sequences to ensure a sharper activity changebut results were similar.

As explained in Section II-C, an important challenge toovercome in multi-core systems is how to synchronize thestressmarks running on different cores. This is required tomaximize the ΔI generated in a given instant. Althoughprobabilistic approaches exist to ensure an eventual alignmentof ΔI events within a time window [26], we implemented adeterministic approach. We leverage the advanced state-of-the-art timing facilities present in the zEC12 architecture to perform

perfect cycle-accurate alignment within the stressmarks. Thetiming facilities provide a global 64-bit Time-Of-Day (TOD)register that allows for the control of the misalignment betweenstressmarks in steps of 62.5 nanoseconds. This cycle-accuratecapability of controlling the misalignment permits the analysisof the sensitivity of noise to ΔI event misalignments presentedin Section V-B. Note that without the right architecture supportthe perfect control of alignment would not be possible.

During the dI/dt stressmark generation methodology def-inition, we also studied the introduction of disruptive(e.g. branch/cache/TLB misses) events and memory hierarchyactivity to maximize the ΔI generated. However, we decidedto generate core-contained sequences without disruptive eventsbecause: (a) disruptive events showed small differences in powerconsumption with respect to the minimum power sequence(b) the introduction of memory activity in the maximum powersequence did not improve the maximum power significantly;(c) disruptive events and memory activity in shared resourceslimit the capacity to control the stimulus frequency generatedby the workload in a multi-core context.

Overall, the flexibility of dI/dt stressmark generation method-ology allows us to generate any type of dI/dt stressmarksby controlling every relevant aspect involved in the noisegeneration: the magnitude of the ΔI , the frequency of ΔIevents, the number of consecutive ΔI events and the alignmentbetween the ΔI events. This ‘white-box’ approach providesto the user all the knobs to control the noise. This approachcomplements the existing solutions that are only focused infinding the worst-case noise scenario. It would be possibleto implement optimization algorithms —such as the geneticalgorithms employed in previous works [26]— on top of thepresented solution. The following sections present detailed noiseanalysis that would not be possible without such dI/dt stressmarkgeneration flexibility.

V. UNDERSTANDING VOLTAGE NOISE

In this section, we present sensitivity analyses for eachparameter related to the voltage noise. This permits us to:(a) validate the dI/dt stressmark generation policy, (b) gaindetailed insights on noise behavior in multi-core systems,(c) understand the conditions where worst case noise scenariooccurs, and (d) confirm the robustness of the zEC12 design.Although we provide per-core noise measurements, this sectionanalyzes the global behavior of the noise at chip level. Section VIstudies the noise propagation across cores and the inter-coreinteractions. These detailed characterizations are used to motivatethe optimization opportunities presented in Section VII.

373373

��

��

�

��

�

��

�

��

�

��

�

��

�

��

�

��

�

��

�� !"#$

��

��

��

�� !"#$

(a) Maximum per core noise measured when running the dI/dtstressmarks with different stimulus frequencies

��

��

� ��

��

!� ��"��

(b) Post-silicon impedance (Z) profile generated via direct measure-ments during package characterization

Figure 7: Noise sensitivity to stimulus frequency. The detectedresonance bands in (a) are in accordance with the Z profile (b).

A. Noise sensitivity to stimulus frequency

The first question to answer is if the skitter-measured noisegenerated by the dI/dt stressmarks is magnified at the expectedresonant bands. For this purpose, we explore a wide range ofstimulus frequencies. For each stimulus frequency, we generatea dI/dt stressmark without synchronization and we run one copyof it on each core.

Figure 7a shows the per-core noise measurements for differentstimulus frequencies. Note that the step function of noisereadings is due to the discrete nature of the skitter bit-string. Themeasurements confirm two resonant bands around the 40KHzregion and 2MHz region, respectively. This agrees with the post-silicon impedance profile generated via direct measurementsduring package testing process (Figure 7b).

The six cores show the same trends in the entire range of thefrequency spectrum. This is expected because the PDN is sharedand therefore, the noise is propagated across cores. Nevertheless,some cores exhibit more noise than others. For instance, themaximum noise seen is ∼41%p2p variation in cores 2 and 4.These differences in the noise seen on each core are mainly dueto manufacturing process variation, although —as we presentlater on— some other factors such as the physical layout orinter-core noise propagation can also contribute.

(a) 20 microseconds (b) Single-period

Figure 8: Oscilloscope shot of voltage noise on core 0 whenexecuting the maximum dI/dt stressmark at ∼2MHz.

One important thing to note is the displacement of the ‘firstdroop’ to the ∼2MHz region4. The traditional ‘first droop’ inolder systems was at frequencies around 30-100 MHz and it wascaused by the charge exchange oscillation between the chip andthe surrounding module capacitors. The path inductance betweenchip and module capacitors is fairly small, and therefore the ‘firstdroop’ oscillation time is short. Modern IBM systems use deeptrench technology for implementing the large on-chip eDRAM.This augmented the on-chip capacitance by 40x. As a result,the resonant bands shifted to a much lower frequency, aroundthe ∼2MHz region in Figure 7a. The outcome is that there isno longer an oscillatory power noise behavior at frequenciesabove 5 MHz, making the design more robust.

Finally, although the noise levels seen are confirming thecorrectness of the dI/dt stressmarks generated, we performedoscilloscope measurements to double-check the activity beinggenerated. Figure 8a shows oscilloscope measurements of thevoltage in core 0 when running the noisiest dI/dt stressmark(∼2MHz). The repetition of the sinusoidal form in the 20microseconds shot confirms the correctness of the stressmark.A single period is shown in Figure 8b. The measurementsconsistently show large peak-to-peak variations in the voltagesupply. Despite these large noise events, the zEC12 platformdid not trigger any recovery mechanism, confirming its robustdesign.

B. Noise sensitivity to alignmentIn this section, we repeat the same experiment but adding the

synchronization mechanism in order to evaluate the effect ofsynchronization on the ΔI events happening on each core. Weexpect that synchronizing the ΔI events will increase the voltagenoise but we would like to quantify, empirically, that increase inorder to discern the relative importance of the synchronizationeffect versus other parameters.

We generate dI/dt stressmarks with synchronization for awide spectrum of stimulus frequencies and we run a copy ofeach stressmark on every core. The synchronization code isconfigured to loop until the low-order bits of the clock value(TOD register) are zero; this happens every 4ms. After that, thedI/dt loop is executed for a thousand iterations (thousand ΔIevents) before re-synching again. This ensures the stressmarksare in sync and minimizes the synchronization overhead. Weevaluate the noise sensitivity to the number of consecutive ΔIevents in Section V-E.

Figure 9 shows the noise perceived by the cores. As expected,the noise increased significantly confirming the effectivenessof the synchronization mechanism. We see a general increaseof around 20 %p2p points. For instance, the maximum noise

4One can also say that the ‘first droop’ disappeared.

374374

��

��

��

��

��

�

�

��

��

��

��

��

�

�

��

��#��

��

�� %��&'(�)

!"

�"

��

��

Figure 9: Maximum per core noise measured for differentstimulus frequencies when running the dI/dt stressmarks thatare synchronized every 4ms. Synchronization enlarged the noiselevels seen in Figure 7a.

seen at the resonant region around the 2MHz point was 41%(Figure 7a), while with synchronization it is 61% (Figure 9).The noise is not only magnified around the resonant regions. Itis also magnified in the rest of the stimulus frequencies explored.Actually, we measured higher noise in non-resonant bands withsynchronization enabled than in the resonant bands without thesynchronization (region between 40KHz and 2MHz stimulusfrequencies in Figure 9). This is because the effect due tothe large ΔI events resulting from synchronization exceeds theeffect of resonance. This suggests that ΔI event synchronizationis a more important factor to avoid than ΔI events occurringat resonance frequency.

C. Noise sensitivity to misalignmentSince ΔI event synchronization is an important factor to

consider, we want to evaluate ‘how much’ misalignment isnecessary to diminish the synchronization effect. By quantifying,empirically, the time window where ΔI events should nothappen simultaneously, one can define the requirements of noisereduction mechanisms on future architecture generations.

For this experiment, we generate dI/dt stressmarks witha stimulus frequency of 2MHz (within the resonant band)and synchronizing every 4ms after a thousand ΔI events.However, in this case, we generate different versions varyingthe exit condition of the synchronization loop. For instance, onestressmark exits the synchronization loop when bits 39–63 ofthe TOD register are zero, while the other does so when bits39–62 of the TOD register are zero and bit 63 is one. Thissmall modification ensures a 62.5ns misalignment between thestressmarks.

Figure 10 shows the average %p2p noise seen for eachcore versus the maximum allowed misalignment. To createa given maximum allowed misalignment, the stressmarks aredistributed evenly within the misalignment range. For instance,for a maximum allowed misalignment of 125ns, 2 stressmarksare synchronized at t = 0ns, 2 at t = 62.5ns and 2 att = 125ns. Since there are multiple possible stressmarks tocore mappings, we execute all of them and report the averagevalues. The results show that a small misalignment of 62.5nsis sufficient to reduce the noise to levels similar to the oneswithout synchronization. For this study, we were limited by the62.5ns granularity ensured by the particular implementation of

� ��

��

��

��

��

��

��

��

!"�"�#��$��

��

��

Figure 10: Sensitivity of noise to ΔI event alignment. X-axis shows the maximum misalignment allowed between thestressmarks. A small misalignment is sufficient to diminish thesynchronization effect.

the architecture, but it is likely that smaller misalignments, inthe order of a few nanoseconds, will be enough to nullify thealignment effect.

In conclusion, the fact is that synchronized ΔI eventsgenerate large voltage noise but the strict alignment requirementconsiderably reduces their likelihood to happen. Moreover, thislikelihood will decrease in the future with the addition of morecores to the system. Nevertheless, a platform with such highreliability standards as the zEC12 cannot take the risk of nottaking into account these improbable but possible noise events.

D. Noise sensitivity to ΔI

As explained in Section II, voltage noise is directly related tothe amount of ΔI generated in a given instant. In this sectionwe quantify that relation. For this experiment, we use threetypes of workloads: nothing (idle), medium dI/dt and maximumdI/dt. The dI/dt ones have a stimulus frequency of 2MHz anduse the synchronization mechanism in order to maximize thenoise. To generate the medium dI/dt stressmark we use aninstruction sequence that consumes exactly the average betweenthe maximum and the minimum power sequence. In other words,two medium dI/dt stressmarks generate the same ΔI as onemaximum dI/dt stressmark. We run all possible workload tocore mappings (6 cores & 3 workloads⇒ 36 combinations)and gather skitter noise measurements.

Figure 11a shows the maximum noise generated with respectto the percentage of the maximum possible ΔI that can begenerated. As expected, the noise grows with the amountof ΔI . This quantification is crucial to accurately define therequirements of noise reduction mechanisms. For instance, ifwe want to implement a noise mitigation mechanism to keep%p2p noise below 30%, we should not allow more than 60%ΔI .

In addition, we want to know if the way the ΔI is generatedaffects the noise. That is, we want to answer the followingquestion: what workload distribution creates more noise? Asingle stressmark generating a given ΔI or two stressmarksgenerating ΔI/2? Figure 11b shows the results of Figure 11aorganized by the workload distribution. In this case, noise isaveraged across cores and mappings with the same workloaddistribution and ΔI . The results depict a trend showing thatwhen the source of the noise is more spread out, the noise is

375375

� ��

�

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

(a) Maximum per core noise measurements with respect to thepercentage of maximum possible ΔI that can be generated. Eachpoint shows, for a given % of ΔI and core number, the maximum%p2p noise seen across all workload mappings generating that particular% of ΔI . The dotted regions designate the minimum number of coresneeded to generate a particular noise level.

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

�� !�"��#��$� !��#��%

�&�&��

(b) Average noise measurements with respect to ΔI generated andworkload distribution. For instance, point 1-4 shows the average noiseacross all possible workload mappings of 1 maximum dI/dt stressmarkand 4 medium dI/dt stressmarks (and 1 core idling).

Figure 11: Noise sensitivity to ΔI . We observe in (a) that noise is directly related to the amount of ΔI and the number of coresenabled. We also detect in (b) a trend showing that when the source of the noise is more spread out, the noise is higher.

higher. For instance, in Figure 11b we see that, for 50% ΔI , thenoise slightly decreases when going from a 0-6 configuration (allcores running the medium dI/dt workload) to a 3-0 configuration(3 cores running the dI/dt workload). Nevertheless, the trendis not significant. This suggests that the important factor is theamount of ΔI generated and not the source of the ΔI .

E. Noise sensitivity to number of consecutive ΔI eventsTo wrap up this sensitivity analysis section, we evaluate the

sensitivity of noise regarding the number of consecutive ΔIevents generated. We want to know if a single ΔI event isenough to generate large noise or if we need more. We generatemaximum dI/dt stressmarks with different stimulus frequenciesand synchronization enabled while varying the number ofconsecutive ΔI events between synchronization points. We alsogenerate stressmarks for these different stimulus frequencieswithout synchronization, i.e. ∞ number of consecutive ΔIevents. The stimulus frequencies selected are at resonant bands(35KHz and 2.5MHz) and their surroundings. For this lastsensitivity experiment, instead of skitter measurements we runVmin experiments in order to measure the available marginempirically.

Figure 12 shows the available margin —the amount of Vbias5

required to get the first failure— per stimulus frequency andnumber of consecutive ΔI events. Results are normalized to theworst case, i.e. highest Vbias to fail. The results show that neitherthe number of consecutive ΔI events nor the stimulus frequencyaffect the voltage margin available significantly. This is becausethe noise generated with just a single synchronized ΔI event islarge enough so that the noise amplification due to resonancebecomes relatively small. Note that although the skitters showedmuch higher %p2p noise at resonant frequencies, the fact thatwe are in the high region of noise —where the linearitybetween Vnoise and skitter measurements diminishes [13], [42]—translates this to similar Vbias levels. When the ΔI eventsynchronization is disabled (∞ events/no synch in Figure 12),the margin available with respect to the worst case is more

5Vmin = Vbias × Vnominal

��

��

��

��

��

��

��

��

��

��

�

��

��

��

!� �

��"��

��

#�� $�� %��

Figure 12: Available margin for different number of consecutiveΔI events and stimulus frequencies. The synchronization of ΔIevents is essential to reduce the available margin. The stimulusfrequency is a secondary factor.

than doubled (from 0–2% range to 5–7%). This confirms theimportance of alignment in the noise formation. Finally, 1Hzand 100MHz frequency points show higher margins becauseeither the stressmarks are misaligned6 or the stimulus frequencyis too high to generate ΔI events.

In order to put these available margins in context, wedepict a line showing what we would consider the worst caseavailable margin for a typical customer code. The extrapolationassumes the following: (a) ΔI events are not synchronized—synchronized ΔI events are very unlikely to happen— and (b)the magnitude of the ΔI events generated on each core is around∼80% of the maximum possible ΔI . This is based on the factthat, historically, maximum power stressmarks showed ∼20%higher than worst case regular user codes. Therefore, in thiscontext, there is plenty of margin for optimization opportunities.One can apply dynamic guard-banding to reduce the marginsdynamically (like in POWER7/POWER7+ systems [11], [12],[29]) or noise reduction techniques to reduce the maximumnoise that can be generated by a workload [27].

6The fact that frequency (1Hz) is much larger than the synchronization interval(4ms) leads to stressmarks being aligned at different 4ms interval points.

376376

F. Sensitivity summary

We quantified the sensitivity of noise with respect to differentparameters. We confirm that the shift of the ‘first droop’ to lowerfrequencies —around the 1-5MHz range— is due to the on-chipcapacitance increase because of the implementation of largeeDRAMs using deep trench technology. This could potentiallyaffect some prior works on noise reduction techniques that aretoo tailored to reduce high frequency noise events (30-100MHzrange).

In addition, we conclude that the key enabler for noisegeneration is the amount of ΔI generated in a given instantof time. The synchronization of ΔI events on different coresis crucial in that regard. If a mechanism is implemented toavoid the synchronization of ΔI events happening on differentcores, the noise can be reduced by 2-3x. The results alsoshow that any mechanism implemented to reduce the noiseshould be implemented on a chip-wide basis. Two observationssupport that fact: (a) large intra-core ΔI events happeningon a few number of cores do not lead to high noise and, (b)relatively small ΔI events happening simultaneously to allcores could lead to a large Vnoise. This global requirement fornoise reduction techniques imposes significant implementationchallenges since prediction/detection and reduction of ΔI eventsshould be done globally. The next generation processor chipfor System z mainframes will include a mechanism to globallymonitor/reduce noise if necessary.

Finally, we also confirm experimentally that the stimulus fre-quency and the number of consecutive ΔI events are secondaryfactors. We do not see any value in implementing mechanismsto avoid resonance or a given number of consecutive ΔI eventsif the main factors (magnitude of ΔI and synchronization ofΔI events) are not solved first. Throughout this section, weanalyzed the noise globally and we proved the robustness ofthe zEC12 power delivery system. The next section providesdetailed analyses on inter-core noise propagation.

VI. NOISE PROPAGATION ON MULTI-CORES

In this section, we analyze how the noise is propagatedin a multi-core system. We want to discern all patterns (ifany) that could arise as a result of the system design suchas core placement, decap placement or overall chip layout.Detecting such patterns is important for: (a) validating thedesign, e.g. ensuring that there is no particular core/region morevulnerable to noise, (b) detecting optimization opportunities. Forexample, if the current delivery system is not robust and somecores are more correlated in terms of noise, then one can avoidscheduling tasks on them at the same time if other cores areavailable. This would reduce the worst-case noise.

To evaluate the inter-core noise relations, we use the sameexperimental setup as in Section V-D. We also evaluate allthe possible workload mappings to cores. This provides thecomplete picture of possible core activities. By performingstatistical analyses on the dataset, one can detect trends oranomalies.

First, in Figure 13a, we compute the correlation factor betweenthe noise seen in all the possible mappings for each pair ofcores. This tells us if some cores are more related to others. Asexpected, inter-core noise correlation factors are high (>0.91).

��

��

��

��

��

��

��

��

��

��

��

�

��

��

��

��

(a) Correlation coefficient between noise observed by each core forall possible workload mappings.

(b) Simulation of a ΔI event on core 0 confirming that the noisefrom core 0 is transferred faster and more strongly to cores 2 and4 than to the other cores.

Figure 13: Inter-core noise propagation. We detect the existenceof two noise-related core clusters resulting from the chip layout.

This is because noise is globally propagated across the PDN.Therefore, if there is noise in the PDN, all the cores are affected.However, we also see that some cores are more correlated thanothers.

As shown in Figure 13a, we detect two clusters of cores:cores 0,2,4 and cores 1,3,5. Two main reasons explain thisresult: First, if we revisit Figure 3, the clusters correspond to theupper three cores and the lower three cores, respectively. Sincethe L3 —with a relatively large capacitance— sits betweenthe clusters, it slightly isolates the noise from one cluster toanother, acting as a damping element. Second, the PDN topologyalso affects how the noise is propagated. The two clusters areon different on-chip voltage domains sharing a single packagevoltage domain.

In order to confirm these findings, we use our in-housePDN simulation set up based on Cadence/Sigrity Speed 2000software [5]. We evaluate the effects of a large ΔI event onCore 0, while the other cores are idling. Figure 13b shows thesimulated voltage on the six cores over time. We see that thenoise in the cores 0, 2, 4 on one side of the chip is larger thanthe noise in the cores on the opposite side of the chip (cores1,3,5). This correlates with our observations. Moreover, we alsosee that the noise from core 0 is transferred faster to cores 2and 4 than to the other cores.

We show an example of how these inter-core relations canaffect the noise generated in Figure 14. The figure shows twodifferent mappings of three dI/dt stressmarks on the six-coreprocessor. In Figure 14a, the dI/dt stressmarks are mapped oncores 1,4,5 and we see a worst-case noise of 24.6 %p2p oncore 1. In contrast, if we map the three dI/dt stressmarks inone of the clusters detected (as shown in Figure 14b), the worst

377377

(a) Best case

��

��

��

��

(b) Worst case

��

��

��

��

Figure 14: Two different mappings of 3 worst-case dI/dtstressmarks. Each core shows the workload and the %p2p noise.Worst-case noise is highlighted on each mapping to spotlightthat the worst-case noise depends on workload to core mapping.

case noise increases to 28.2 %p2p (on core 2). The main reasonfor this effect is the clustering relation mentioned above. Inaddition, the noise seen in core 2 of Figure 14b is amplifiedbecause it sits in the middle of two other ‘noisy’ cores (core 0and 4).

To summarize, we have seen that technology and physicallayout as well as other PDN design parameters —such asthe number of power domains— can affect how the noise ispropagated across the different cores in a multi-core chip. Theamount of propagation depends on the distance/capacitancebetween elements. Therefore, there are some isolation benefitsrelated to element layout. However, their significance has tobe evaluated on each particular design. These interactionswill likely be higher in the future due to the higher processvariation and number of cores. This suggests that there mightbe increasing opportunities to decrease worst-case noise viaappropriate workload-mapping techniques. The next sectiondiscusses some of them.

VII. OPTIMIZATION OPPORTUNITIES

Based on the knowledge learned during the characterization ofnoise performed in previous sections, we infer two optimizationopportunities: noise-aware workload mapping and utilization-based dynamic guard-banding. In this section, we discuss thepotential gains of implementing them in order to build a strongcase for further research on the topic. Detailed proposals andevaluations are needed to assess their implementability on futuregenerations of processors. In any case, we believe that they couldserve as a guide for future research on noise-related techniques.Note also that, given the robustness of the power deliverysystem implemented in IBM zEnterpriseTM EC12 systems (asshown throughout the paper), we do not foresee the necessityto implement noise reduction techniques for maintaining thesystem reliability.

A. Noise-aware workload mapping

As we have seen in Section VI, the worst case noise for agiven number of workloads is not equal for all possible mappings.That is, depending on the mapping used, the worst case noisewould be different. This opens up the possibility of implementingnoise-aware workload mapping policies. In other words, onecan implement a task mapping policy with the objective of

� � � � � � �

�

��

��

��

��

��

��

�

��

�

��

�

��

�

��

�

� ��

��

��

��

��

Figure 15: Noise reduction opportunity to reduce worst casenoise via workload mapping.

minimizing the worst-case noise. Then, one can proactivelysqueeze the available voltage margin accordingly or just letother hardware-implemented mechanisms such as critical pathmonitors [11], [12], [29] to reap the benefits of lower noiseautomatically.

Figure 15 shows the maximum benefits that could arise fromnoise-aware workload mapping. We compare for each number ofworkloads (dI/dt stressmarks) to schedule, the mapping showingthe highest core noise (Worst Mapping in Figure 15) vs. theone showing the lowest core noise (Best Mapping in Figure 15).The difference between them is shown in the secondary y-axis. When we have 2, 3 or 4 workloads to schedule thereare mappings that could reduce the worst-case noise between2 and 3 %p2p points. For low or high number of workloadsto schedule, the benefits are smaller because the potential ΔIis too small or too high. Overall, these results show a smallpotential for the approach on current systems. However, weforesee that with the increasing number of cores and processvariation, the optimization opportunities will increase. This isbecause the number of possible combinations (mappings) willgrow exponentially as well as the variation among them.

B. Utilization-based dynamic guard-banding

From Figure 11a, it is clear that the worst case noise isdirectly related to the amount of ΔI generated. At the sametime, the amount of ΔI that can be generated is bounded by thenumber of cores that are executing a workload (regions depictedin Figure 11a). If the hardware, or the software managing theresources such as the OS or the firmware, is aware of the numberof cores that can execute a workload, then it could safely adaptthe available margin accordingly. That is, once a new core isrequested to execute some workload, the hardware would raisethe voltage to maintain the safety margin because the worst-casenoise is higher. Similarly, when a core is freed from execution,the hardware would decrease the voltage to ensure that themargin is not over-provisioned. Although mechanisms such ascritical path monitors [11], [12], [29] already do that implicitly,they can benefit from this worst-case bound approach becausethe dynamic range where they actuate will be tailored to thereal worst case for each situation. In the end, the benefits ofthis simple mechanism depend on the utilization rates of theprocessor on real environments, with potential huge impact onenergy efficiency when the system is not fully utilized.

378378

VIII. RELATED WORK

Several works have explored ways to optimize the PDN andminimize voltage noise. This section summarizes some of them.

Architectural solutions: Several architecture-level solutionstarget the reduction of voltage noise events. Powell et al. [35]propose a pipeline muffling technique to avoid transient andlarge change in resource utilization. In [39], the authors presenta voltage noise prediction mechanism, a signature-based noisereduction scheme and an efficient implementation of thembased on bloom-filters. Gupta et al. [18] presents a delay-commit mechanism that allows rolling-back the execution if noiseemergencies have been detected. In [36], the authors use pipelinethrottling knobs to change the frequency of voltage noise,avoiding the resonant frequency band. Different works [32],[33], [47] target the problem of minimizing the noise whenactivating/deactivating, or clock-gating, functional units. In [22],the authors propose techniques to eliminate voltage emergenciesbased on control theory. The authors extend their work in [23]by using wavelet-based approach for identifying voltage levelsat run-time. SMT aspect of voltage noise is studied in [7].In that work, the authors propose intelligent hardware threadmanagement to control voltage noise. In a broader scope, theauthors in [9], [10] propose a reliable architecture that eliminatesthe need of voltage margins.

Software-based noise reduction: Voltage noise can not onlybe mitigated using hardware solutions, it can also be reducedby appropriate software control. In [48], the authors proposea noise-aware instruction scheduling algorithm for compilers.Types of noise events are categorized and code transformationsto avoid them are proposed in [17]. Dynamic software-basedoptimization systems are presented in [19], [37]. Based oncontinuous hardware profiling, the solutions use compiler-basedtransformations to reduce the frequency of noise events. In [31],the authors propose a synchronization methodology to mitigatethe noise related to thread synchronization primitives. Ourwork foresees two software-based solutions related to voltagenoise that were not explored in prior art: noise-aware workloadmapping and utilization-based dynamic guard-banding. Althoughthese solutions do not directly target the reduction of noise, theyalleviate the guard-banding that is required to be noise-safe.

Noise in production processors: Empirical quantification ofnoise on real hardware provides fundamental insights and directdata to the research community. Reddi et al. [41], [40] analyzethe noise behavior on an Intel R© CoreTM2 Duo processor. Theyuse a technique —based on removing package capacitors— toextrapolate voltage noise in future systems. The extrapolation isused to analytically evaluate a temporal-based thread schedulingpolicy that minimizes voltage emergencies. In [27], Kim et al.analyze the floating point throttling mechanism implemented inAMD Orochi processors. After analyzing the power, performanceand reliability trade-offs, they propose a noise reduction solution—controlling, dynamically, the floating-point unit throttling— thatincreases system performance. In [1], the authors use the errorcorrection mechanisms already built into modern processorsto dynamically adapt voltage margins while keeping a safeoperating voltage. Our work provides a step forward in theunderstanding of voltage noise in production processors byproviding insights on intra-/inter-core noise behavior, crossing

—for the first-time— the chip-level measurement barrier. Inaddition, we also present a complete characterization of all theparameters related to voltage noise.

dI/dt stressmark generation: Correct dI/dt stressmark gen-eration is crucial to understand the noise behavior [22]. In [28],Kim et al. propose a systematic methodology to generate ofdI/dt stressmarks based on genetic algorithm-based optimization.In [26], the same authors validate the methodology on differentprocessors, AMD Buldozer and Phenom, by proposing a ditheringtechnique to align the threads in a multi-core system. Ournovelty on systematic dI/dt generation is that we provide a fullyconfigurable method to generate all kinds of dI/dt stressmarks,not only the worst case. Besides, we show a deterministicapproach leveraging the time facilities to solve the threadalignment issue for multi-core systems, which also allows tocontrol the miss-alignment between threads.

Noise modeling and PDN optimization: It is also importantto the community to understand the trends of the noise in futureprocessors in order to anticipate possible limitations. To that end,exploratory PDN and noise modeling are crucial. In [55], theauthors model the noise in a future 10nm technology processorrunning at near-threshold voltages. They highlight the importanceof accurate voltage noise characterizations, like the one presentedin this paper, to avoid the large guard-band requirements theypredict for future near-threshold processors. The same issue isaddressed in [53], where the authors use a PDN optimizationmodel to anticipate that future processors will have I/O padslimited due to the increase of pads required to maintain reliablelevels of voltage noise. This pad budgeting problem is furtheranalyzed in [50], [54].

IX. CONCLUSIONS

This paper addresses the issue of generating complete voltagenoise characterizations on current, state-of-the-art, multi-coreprocessors. We presented a configurable ‘white-box’ methodol-ogy to systematically generate dI/dt stressmarks with differentcharacteristics. Then, we showed how that methodology enabledthe detailed characterization of the different parameters involvedin voltage noise generation. This permits the validation of thepower delivery system design and the quantification of therelative importance of each parameter involved in the noisegeneration. In addition, we presented a noise propagation anal-ysis on multi-core systems, detecting empirically the differentnoise interactions between the cores. Finally, we discussedpotential optimization opportunities emerging from this newunderstanding of noise behavior.

Overall, we proved the robustness of the power deliverysystem implemented in IBM zEnterpriseTM EC12 systems. Theempirical and detailed characterization of the noise behaviorenables an accurate definition of the requirements for noisemonitoring/mitigation mechanisms.

ACKNOWLEDGMENT

This work has been partially sponsored by Defense AdvancedResearch Projects Agency (DARPA), Microsystems TechnologyOffice (MTO), under contract no. HR0011-13-C-0022. The viewsexpressed are those of the authors and do not reflect the officialpolicy or position of the Department of Defense or the U.S.

379379

Government. This document is: Approved for Public Release,Distribution Unlimited.

Many individuals across IBM made this work possible. Weextend particular thanks to Michael D. Campbell, Andrew Z.Muszynski, Jochen Supper, Hubert Harrer, David S. Hutton,Douglas C. Mershimer, Ervin Conn Jr, Karl E. Anderson, ScottB. Swaney, Stanley F. Lutek, Andreas Krebbel and SreekalaAnandavally. Finally, we would like to thank the anonymousreviewers for their comments and feedback.

REFERENCES

[1] A. Bacha and R. Teodorescu, “Dynamic reduction of voltage mar-gins by leveraging on-chip ecc in itanium ii processors” in ISCA-40,Jun 2013, Tel-Aviv, Israel.

[2] W. Becker et al., “Mid-frequency simultaneous switching noise incomputer systems” in ECTC-47, May 1997, San Jose, CA, USA.

[3] R. Bertran et al., “Systematic Energy Characterization ofCMP/SMT Processor Systems via Automated Micro-Benchmarks”in MICRO-45, Dec 2012, Vancouver, BC, Canada.

[4] K. Bowman et al., “A 22nm dynamically adaptive clock distribu-tion for voltage droop tolerance” in VLSI, Jun 2012, Honolulu, HI,USA.

[5] “Cadence/sigrity speed 2000.” [Online]. Available:http://www.cadence.com/products/sigrity/speed2000

[6] R. Downing et al., “Decoupling capacitor effects on switchingnoise” in IEEE Transactions on Components, Hybrids, and Man-ufacturing Technology, vol. 16, no. 5, Aug 1993.

[7] W. El-Essawy and D. Albonesi, “Mitigating inductive noise in SMTprocessors” in ISLPED, Aug 2004, Newport Beach, CA, USA.

[8] S. Eranian, “Linux Has a Generic Performance Monitoring API!”CSCADS Workshop, Tahoe City, California, USA, Jul 2009.

[9] D. Ernst et al., “Razor: Circuit-level correction of timing errors forlow-power operation” in IEEE Micro, vol. 24, no. 6, Nov 2004.

[10] D. Ernst et al., “Razor: a low-power pipeline based on circuit-level timing speculation” in MICRO-36, Dec 2003, San Diego, CA,USA.

[11] M. S. Floyd et al., “Introducing the adaptive energy managementfeatures of the power7 chip” in IEEE Micro, vol. 31, no. 2, Mar2011.

[12] M. S. Floyd et al., “Adaptive energy-management features of theibm power7 chip” in IBM JRD, vol. 55, no. 3, May 2011.

[13] R. Franch et al., “On-chip Timing Uncertainty Measurements onIBM Microprocessors” in ITC, Oct 2008, Santa Clara, CA, USA.

[14] B. Garben et al., “Mid-frequency delta-i noise analysis of complexcomputer system boards with multiprocessor modules and verifica-tion by measurements” in IEEE Trans. on Adv. Packaging, vol. 24,no. 3, Aug 2001.

[15] A. Grenat et al., “Adaptive clocking system for improved powerefficiency in a 28nm x86-64 microprocessor” in ISSCC, Feb 2014,San Francisco, CA, USA.

[16] M. Gupta et al., “Understanding voltage variations in chip multi-processors using a distributed power-delivery network” in DATE,Apr 2007, Nice, France.

[17] M. Gupta et al., “Towards a software approach to mitigate voltageemergencies” in ISLPED, Aug 2007, Portland, OR, USA.

[18] M. Gupta et al., “Decor: A delayed commit and rollback mech-anism for handling inductive noise in processors” in HPCA, Feb2008, Salt Lake City, UT, USA.

[19] K. Hazelwood and D. Brooks, “Eliminating voltage emergenciesvia microarchitectural voltage control feedback and dynamic opti-mization” in ISLPED, Aug 2004, Newport Beach, CA, USA.

[20] D. Herrell and B. Beker, “Modeling of power distribution systemsfor high-performance microprocessors” in IEEE Trans.s on Adv.Packaging, vol. 22, no. 3, Aug 1999.

[21] N. James et al., “Comparison of Split-Versus Connected-CoreSupplies in the POWER6 Microprocessor” in ISSCC, Feb 2007,San Francisco, CA, USA.

[22] R. Joseph et al., “Control techniques to eliminate voltage emer-gencies in high performance processors” in HPCA, Feb 2003,Anaheim, CA, USA.

[23] R. Joseph et al., “Wavelet analysis for microprocessor design:Experiences with wavelet-based di/dt characterization” in HPCA,Feb 2004, Madrid, Spain.

[24] G. Katopis, “Delta-i noise specification for a high-performancecomputing machine” in Proceedings of the IEEE, vol. 73, no. 9,Sep 1985.

[25] M. Ketkar and E. Chiprout, “A microarchitecture-based frameworkfor pre- and post-silicon power delivery analysis” in MICRO-42,Dec 2009, New York, NY, USA.

[26] Y. Kim et al., “AUDIT: Stress Testing the Automatic Way” inMICRO-45, Dec 2012, Vancouver, BC, Canada.

[27] Y. Kim et al., “Performance boosting under reliability and powerconstraints” in ICCAD, Nov 2013, San Jose, CA, USA.

[28] Y. Kim and L. John, “Automated di/dt stressmark generation formicroprocessor power delivery networks” in ISLPED, Aug 2011,Fukuoka, Japan.

[29] C. Lefurgy et al., “Active management of timing guardband tosave energy in POWER7” in MICRO-44, Dec 2011, Porto Alegre,Brazil.

[30] B. McCredie and W. Becker, “Modeling, measurement, and simu-lation of simultaneous switching noise” in IEEE Transactions onComponents, Packaging, and Manufacturing Technology, vol. 19,no. 3, Aug 1996.

[31] T. N. Miller et al., “Vrsync: Characterizing and eliminatingsynchronization-induced voltage emergencies in many-core proces-sors” in ISCA-39, Jun 2012, Portland, OR, USA.

[32] M. Pant et al., “An architectural solution for the inductive noiseproblem due to clock-gating” in ISLPED, Aug 1999, San Diego,CA, USA.

[33] M. Pant et al., “Inductive noise reduction at the architectural level”in VLSI-13, Jan 2000, Calcutta, India.

[34] S. Pant and E. Chiprout, “Power grid physics and implications forcad” in DAC-43, Jul 2006, San Francisco, CA, USA.

[35] M. Powell and T. N. Vijaykumar, “Pipeline muffling and a prioricurrent ramping: architectural techniques to reduce high-frequencyinductive noise” in ISLPED, Aug 2003, Seoul, Korea.

[36] M. Powell and T. N. Vijaykumar, “Exploiting resonant behavior toreduce inductive noise” in ISCA-31, Jun 2004, Munich, Germany.

[37] V. J. Reddi et al., “Eliminating voltage emergencies via software-guided code transformations” in TACO, vol. 7, no. 2, Oct.2010.

[38] V. J. Reddi et al., “Voltage Noise: Why Its Bad, and What To DoAbout It” in SELSE-5, Mar 2009, Stanford, CA, USA.

[39] V. J. Reddi et al., “Voltage emergency prediction: Using signaturesto reduce operating margins” in HPCA-15, Feb 2009, Raleigh, NC,USA.

[40] V. J. Reddi et al., “Voltage noise in production processors” in IEEEMicro, vol. 31, no. 1, Jan 2011.

[41] V. J. Reddi et al., “Voltage smoothing: Characterizing and miti-gating voltage noise in production processors via software-guidedthread scheduling” in MICRO-43, Dec 2010, Atlanta, GE, USA.

[42] P. Restle et al., “Timing uncertainty measurements on the Power5microprocessor” in ISSCC, Feb 2004, San Francisco, CA, USA.

[43] R. Senthinathan and J. Prince, Simultaneous Switching Noise andCMOS Devices and Systems, vol. 249, no. 1, Jan 1994.

[44] C. Shum et al., “IBM zEC12: The Third-Generation High-Frequency Mainframe Microprocessor” in IEEE Micro, vol. 33,no. 2, Mar 2013.

[45] T. Slegel et al., “IBM’s S/390 G5 microprocessor design” in IEEEMicro, vol. 19, no. 2, Mar 1999.

[46] L. Smith et al., “Power distribution system design methodologyand capacitor selection for modern CMOS technology” in IEEETrans. on Adv. Packaging, vol. 22, no. 3, Aug 1999.

[47] Z. Tang et al., “Ramp up/down functional unit to reduce steppower” in First International Workshop on Power-Aware ComputerSystems, Nov 2001, Cambridge, MA, USA.

[48] M. C. Toburen, “Power analysis and instruction scheduling forreduced di/dt in the execution core of high-performance micropro-cessors” Master’s thesis, North Carolina State University, Raleigh,NC, USA, Jun 1999.

[49] C. Tokunaga et al., “A graphics execution core in 22nm CMOSfeaturing adaptive clocking, selective boosting and state-retentivesleep” in ISSCC, Feb 2014, San Francisco, CA, USA.

[50] K. Wang et al., “Walking pads: Fast power-supply pad-placementoptimization” in ASP-DAC-19, Jan 2014, Singapore.

[51] J. Warnock et al., “Circuit and Physical Design of thezEnterpriseTM EC12 Microprocessor Chips and Multi-Chip Mod-ule” in IEEE Journal of Solid-State Circuits, vol. 49, no. 1, Jan2014.

[52] C. Webb, “IBM z10: The Next-Generation Mainframe Micropro-cessor” in IEEE Micro, vol. 28, no. 2, Mar 2008.

[53] R. Zhang et al., “Some limits of power delivery in the multicoreera” in WEED-4 workshop at ISCA-39, Jun 2012, Portland, OR,USA.

[54] R. Zhang et al., “Architecture implications of pads as a scarceresource” in ISCA-41, Jun 2014, Minneapolis, MN, USA.

[55] X. Zhang et al., “Characterizing and evaluating voltage noisein multi-core near-threshold processors” in ISLPED, Sep 2013,Beijing, China.

380380

voltage noise in multi-core processors: empirical ...rbertran.site.ac.upc.edu/pdfs/micro2014.pdf ·...

Documents