maximizing limited resources: a limit-based study and ...tcarlson/pdfs/alipour2018mlralsatoo… ·...

18
Maximizing Limited Resources: A Limit-based Study and Taxonomy of Out-of-order Commit Mehdi Alipour · Trevor E. Carlson · David Black-Schaffer · Stefanos Kaxiras Abstract Out-of-order execution is essential for high performance, general-purpose computation, as it can find and execute useful work instead of stalling. How- ever, it is typically limited by the requirement of vis- ibly sequential, atomic instruction execution in other words, in-order instruction commit. While in-order com- mit has a number of advantages, such as providing precise interrupts and avoiding complications with the memory consistency model, it requires the core to hold on to resources (reorder buffer entries, load/store queue entries, physical registers) until they are released in pro- gram order. In contrast, out-of-order commit can re- lease some resources much earlier, yielding improved performance and/or lower resource requirements. Non- speculative out-of-order commit is limited in terms of correctness by the conditions described in the work of Bell and Lipasti [5]. In this paper we revisit out-of-order commit by ex- amining the potential performance benefits of lifting these conditions one by one and in combination, for both non-speculative and speculative out-of-order com- mit. While correctly handling recovery for all out-of- order commit conditions currently requires complex track- ing and expensive checkpointing, this work aims to demon- strate the potential for selective, speculative out-of-order commit using an oracle implementation without specu- lative rollback costs. Mehdi Alipour · David Black-Schaffer · Stefanos Kaxiras Uppsala University, Department of Information Technology Uppsala, Sweden E-mail: fi[email protected] Trevor E. Carlson National University of Singapore (NUS), Department of Computer Science Singapore E-mail: [email protected] Through this analysis of the potential of out-of- order commit, we learn that: a) there is significant un- tapped potential for aggressive variants of out-of-order commit; b) it is important to optimize the out-of-order commit depth for a balanced design, as smaller cores benefit from reduced depth while larger cores continue to benefit from deeper designs; c) the focus on imple- menting only a subset of the out-of-order commit con- ditions could lead to efficient implementations; d) the benefits of out-of-order commit increases with higher memory latency and in conjunction with prefetching; e) out-of-order commit exposes additional parallelism in the memory hierarchy. Keywords Superscalar Processors · Out-of-Order Commit · Performance Evaluation · Memory Hierarchy Parallelism 1 Introduction Typical dynamically-scheduled superscalar processors execute instructions out-of-order but commit in-order to present to the programmer the illusion that instruc- tions execute atomically and sequentially as intended by the program. In this context, precise interrupts are easily provided as the processor verifies correct execu- tion before each instruction is committed [25]. The disadvantage of in-order commit (IOC) is that it ties up resources (such as reorder buffer [ROB] en- tries, load-store queue [LSQ] entries, and physical reg- isters) for a much longer time than is necessary for correct execution. In-order commit ties up resources until all instructions complete and commit in the cor- rect sequential program order. This means that execu- tion is halted when any of the resources are exhausted: when the ROB fills up, or when we run out of either

Upload: others

Post on 11-Dec-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Maximizing Limited Resources: A Limit-based Study and ...tcarlson/pdfs/alipour2018mlralsatoo… · Maximizing Limited Resources: A Limit-based Study and Taxonomy of Out-of-order Commit

Maximizing Limited Resources A Limit-based Study andTaxonomy of Out-of-order Commit

Mehdi Alipour middot Trevor E Carlson middot David Black-Schaffer middot Stefanos Kaxiras

Abstract Out-of-order execution is essential for high performance general-purpose computation as it can find and execute useful work instead of stalling How-ever it is typically limited by the requirement of vis-ibly sequential atomic instruction execution mdash in other words in-order instruction commit While in-order com-mit has a number of advantages such as providing precise interrupts and avoiding complications with the memory consistency model it requires the core to hold on to resources (reorder buffer entries loadstore queue entries physical registers) until they are released in pro-gram order In contrast out-of-order commit can re-lease some resources much earlier yielding improved performance andor lower resource requirements Non-speculative out-of-order commit is limited in terms of correctness by the conditions described in the work of Bell and Lipasti [5]

In this paper we revisit out-of-order commit by ex-amining the potential performance benefits of lifting these conditions one by one and in combination for both non-speculative and speculative out-of-order com-mit While correctly handling recovery for all out-of-order commit conditions currently requires complex track-ing and expensive checkpointing this work aims to demon-strate the potential for selective speculative out-of-order commit using an oracle implementation without specu-lative rollback costs

Mehdi Alipour middot David Black-Schaffer middot Stefanos KaxirasUppsala University Department of Information TechnologyUppsala SwedenE-mail firstlastituuse

Trevor E CarlsonNational University of Singapore (NUS)Department of Computer ScienceSingaporeE-mail tcarlsoncompnusedusg

Through this analysis of the potential of out-of-order commit we learn that a) there is significant un-tapped potential for aggressive variants of out-of-ordercommit b) it is important to optimize the out-of-ordercommit depth for a balanced design as smaller coresbenefit from reduced depth while larger cores continueto benefit from deeper designs c) the focus on imple-menting only a subset of the out-of-order commit con-ditions could lead to efficient implementations d) thebenefits of out-of-order commit increases with highermemory latency and in conjunction with prefetchinge) out-of-order commit exposes additional parallelismin the memory hierarchy

Keywords Superscalar Processors middot Out-of-OrderCommit middot Performance Evaluation middot Memory HierarchyParallelism

1 Introduction

Typical dynamically-scheduled superscalar processorsexecute instructions out-of-order but commit in-orderto present to the programmer the illusion that instruc-tions execute atomically and sequentially as intendedby the program In this context precise interrupts areeasily provided as the processor verifies correct execu-tion before each instruction is committed [25]

The disadvantage of in-order commit (IOC) is thatit ties up resources (such as reorder buffer [ROB] en-tries load-store queue [LSQ] entries and physical reg-isters) for a much longer time than is necessary forcorrect execution In-order commit ties up resourcesuntil all instructions complete and commit in the cor-rect sequential program order This means that execu-tion is halted when any of the resources are exhaustedwhen the ROB fills up or when we run out of either

2 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

SLMSLM_ROOC

SLM_AOOC

NHMNHM_ROOC

NHM_AOOC

HSWHSW_ROOC

HSW_AOOC

075100125150175200225

IPC normalized to SLM

commit-depth=4 commit-depth=ROB size

Fig 1 Performance comparison (harmonic mean ofIPCs across SPEC CPU2006 [16]) of in-order and twotypes of safe out-of-order commit reluctant (ROOC)and aggressive (AOOC) with a commit depth of 4 andROB size for increasingly aggressive microarchitecturesThese experiments respect all traditional commit con-ditions and show that aggressive out-of-order commitcan reach the performance of the next class of proces-sor

LSQ entries or registers To overcome this hurdle de-signers size these structures so that they minimize thechances of any single resource becoming exhausted cre-ating a balanced microarchitecture This however con-tributes to the power-inefficiency of the OoO cores Atthe very least after a certain point increasing the sizeof the LSQ (which is usually implemented as an ex-pensive CAM) or the size of the register file resultsin an increase in energy consumption that far exceedsthe performance benefit a significant disadvantage forpower-constrained processors Thus the incentive forpursuing out-of-order commit (OOC) lies in the promiseof higher performance with fewer resources A turn-ing point in our understanding of out-of-order commitcame with the work of Bell and Lipasti [5] who artic-ulated the limiting factors for non-speculative OOCThe necessary conditions to allow an instruction to becommitted range from the completion status of the in-struction itself to the branch prediction and exceptionstate of intervening instructions Several proposals forout-of-order commit implicitly or explicitly abide bythese conditions potentially harming efficiency to en-force them

11 Analyses performed

The question we explore in this paper is what couldthe performance gain be if we had the means to evadeany one or even all of these conditions together To thebest of our knowledge this work is the first to evalu-ate the potential performance contribution of relaxingeach individual commit condition For example we in-

vestigate the potential performance gain if we couldcorrectly commit past unresolved branches stores withunresolved addresses or instructions that can generateexceptions We explore the interactions of out-of-ordercommit across a range of important processor designpoints including variations in prefetchers MSHRs mem-ory latency branch prediction and the size of the dy-namic instruction window Answering these questionsallows us to understand the most profitable conditionsto address Our work confirms previous studies [2 5 21]that the least aggressive core benefits the most fromleast aggressive out-of-order commit and in addition byintroducing a taxonomy for the first time we show thatmore aggressive cores gain more comparatively fromaggressive out-of-order commit All in all our studyshows that as a future direction not only safe but alsounsafe out-of-order commit appears to be very promis-ing This is especially true if the potential benefits canbe tapped with much more efficient and selective mech-anisms to guarantee correctness instead of bulk check-pointing and rollback Already such non-speculativemechanisms have been proposed [24]

12 Why We Do it

Bell and Lipasti first articulated the conditions for ldquosaferdquonon-speculative out of order commit [5] in an effort totackle the problem of improving single-thread perfor-mance This was at the same time that IC manufac-turing broke through the 100nm technology node [103] However the acceptance for OOC architectures hasbeen slow Today significantly improving single-threadperformance in an energy-efficient manner remains achallenge The goal of this work is to help researchersto hone in on the most profitable aspects of OOC byoffering i) a detailed exploration of the limits of out-of-order commit conditions and ii) a taxonomy for out-of-order commit [2]

13 Analyses outside the scope of this work

However in this paper we do not quantify the cost thatwould be required to guarantee correctness when com-mitting past any or all of these conditions as this wouldbe tied to a specific hardware or software mechanismInstead we evaluate the potential performance benefitsavailable as a means to gauge the potential of futureproposals in relation to their cost In the same veinwe have not explored every possible core optimizationthat could be construed as extending the instructionwindow outside the core (eg run-ahead execution) Inthis work we chose to study three common architecturesto provide baselines with understood costs

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 3

14 Contribution

We study safemdashnon-speculativemdash out-of-order commit(refraining from committing until all conditions are met)and unsafemdashspeculativemdash out-of-order commit (com-mitting even before all conditions are met assuming thepotential for correct recovery) In addition to studyingboth safe and unsafe out-of-order commit as a minimumand maximum potential performance improvement wealso study out-of-order commit in two additional di-mensions aggressiveness and degree of speculation

ndash Aggressiveness A dimension that determines thepotential benefit is the aggressiveness of the out-of-order-commit implementation how far into the in-struction window we try to commit Prior work [22]examines this aggressiveness in their implementa-tion of checkpoint-based out-of-order commit wherethey show a middle-ground in aggressiveness is neededto mitigate checkpoint-based penalties while stillimproving performanceIn addition work on non-speculative out-of-ordercommit [5] has recognized that a restricted commitwindow can achieve a good portion of the perfor-mance improvements compared to an unrestrictedunlimited out-of-order commit implementation

ndash Degree of speculation Apart from the aggres-siveness of commit we also consider varying degreesof speculation The insight of this study is that selec-tive speculation preserving or relaxing of one moreconditions necessary for out-of-order commit couldlead to more efficient implementations In fact re-cent work [15 24] has shown that this might be thecase for load instructions

15 Results

This study aims to provide a guidepost for each majorparameter of out-of-order commit to provide an upperbound on the performance benefits given a particularaggressiveness and degree of selective speculation Inthis work we show that

ndash there is significant untapped potential for unsafeout-of-order commit beyond traditional in-order andsafe out-of-order commit the gap widens in morepowerful architectures

ndash for energy-efficient cores with moderately-sized in-struction windows reluctant1 limited out-of-order

1 Reluctant out-of-order commit continues normal in-oredrcommit and is only enabled when the hardware cannot continueto make forward progress This mode switching happens earlyenough which is a cycle before the related queue (ROB IQLSQ) is full so that the CPU stall is avoided

commit is sufficient to reap the most benefit whileaggressive versions of out-of-order commit becomea requirement for larger cores

ndash the order of importance of the commit conditionschanges depending on the type of application thearchitecture (limited or aggressive OoO processor)and the aggressiveness of out-of-order commit (com-mit depth)

ndash we unexpectedly found that a focus on specific out-of-order commit conditions could be an importantfuture direction for high-performance efficient out-of-order processors

ndash the potential benefits of out-of-order commit increaseswith memory latency (relatively more for unsafe)while the benefits of the prefetching strategy thatwe picked are orthogonal to out-of-order commitbenefits This raises the enticing possibility of re-ducing system-wide silicon financial cost withoutcompromising performance by coupling dense buthigher-latency (slow-but-efficient) DRAM with out-of-order commit cores

ndash out-of-order commit increases memory hierarchy par-allelism [7]

ndash While it is generally acceptable that by releasingpipeline resources as early as possible out-of-ordercommit improves performance in minor and smallcores relatively more than in large cores in this workwe show that this is only true for reluctant out-of-order commit In fact performance improvement inlarge out-of-order cores can exceed that of smallercores if aggressive out-of-order commit is employed

ndash Our results show the potential for future systemsthat implement out-of-order commit and indicatewhich are the most promising directions (safe vsunsafe commit and which of Bell and Lipastirsquos con-ditions [5] are most important to support) for futuredesigns

The rest of this paper is organized as follows InSection 2 we first provide an overview of the conditionsthat need to be honored for in-order commit as wellas provide an overview of out-of-order commit Nextin Section 3 we present our evaluation methodologyand simulated system configurations Section 4 pro-vides a detailed performance analysis for each out-of-order commit condition both from the point of view ofaggressive and reluctant out-of-order commit In Sec-tion 5 and Section 6 we compare the early releaseof physical register and memory hierarchy parallelismrespectively with Out-of-Order commit extending pre-vious work [2] Finally the related works and conclusionare presented in Section 7 and 8 respectively

4 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

2 Out-of-Order Commit

The introduction of the reorder buffer (ROB) to pro-vide in-order commit in an out-of-order scheduled su-perscalar pipeline was an important advance for com-puter architecture culminating an effort that startedwith Tomasulorsquos algorithm and included techniques suchas reservation stations and the register update unit(RUU) [26] The reorder buffer maintains precise ar-chitectural state in the presence of interrupts unknownmemory dependencies or memory re-orderings that canperturb the ordering required by a memory consistencymodel Out-of-order commit on the other hand at-tempts to break this rigid updating of the architecturalstate either in a safe way (ie one that does not requireadditional speculation and rollback to revert changes tothe architectural state) or in an unsafe way (ie onethat does)

21 Safe vs Unsafe OOC

A turning point in our understanding of out-of-ordercommit came with the work of Bell and Lipasti [5] inthe form of a number of limiting conditions for safe out-of-order commit The necessary conditions to allow aninstruction to be committed out-of-order are1 The instruction is completed instructions can com-

mit only after their completion2 The instruction is not involved in memory replay

traps This condition simply says that we cannotcommit speculative loads or their dependent instruc-tions This condition relates to unresolved memorydependencies or memory consistency enforcementFor example total store order (TSO) requires a re-play of speculative loads that violate loadrarrload or-dering when this reordering is detected by othercores

3 Register WAR hazards are resolved (ie a write toa particular register cannot be permitted to commitbefore all prior reads of that architectural registerhave been completed)

4 Previous branches are successfully predicted Thiscondition simply says that we can commit only whileon the correct path of execution

5 No prior instruction in program order is going toraise an exception This condition provides preciseinterrupts and is essential in easing handling of egpage faultsIn the rest of this paper we will use the following

convention to discuss how we evade the BellLipasticonditions1 Safe_OOC where all out-of-order-commit condi-

tions are preserved This case provides the minimum

potential performance improvement of out-of-ordercommit but also the minimum hardware to imple-ment as it does not rely on speculation and rollbackbeyond what is already available in the out-of-ordercore

2 Unsafe_OOC where one or more (or all) of theout-of-order commit conditions are evaded (apartfrom true dependencies) Doing so the maximumpotential performance improvement of out-of-ordercommit is evaluated but this may require extra sup-port for speculation and rollback to be able to revertchanges in the architectural state that were foundto be incorrect after the commit

22 Reluctant vs Aggressive OOC

Aside from the limiting conditions described above aseparate dimension is the aggressiveness of committingout-of-order Thus concerning the mechanics of out-of-order commit we distinguish two versions

1 Reluctant out-of-order-commit (ROOC) where theout-of-order commit mechanisms are engaged onlywhen needed and

2 Aggressive out-of-order-commit (AOOC) wherethe out-of-order commit mechanisms are continu-ously active looking for opportunities to commitinstructions as early as possible

While the always on nature of the aggressive out-of-order commit is obvious in its meaning the when neededof the reluctant out-of-order-commit requires clarifica-tion For this paper reluctant out-of-order commit isengaged only when the core is in imminent danger ofgoing into a complete stall In other words we engagereluctant out-of-order commit only when one of the crit-ical resources (ROB entries registers load-store queueentries) is all but exhausted and cannot support thefetching of new instructions in the front end of thepipeline As such reluctant out-of-order commit acts asa safety valve to release the pressure on resources (justbefore this pressure reaches a critical point) ratherthan aggressively trying to keep resource pressure low

In contrast aggressive out-of-order commit releasesresources more eagerly but disregards the following is-sues

1 it might prove wasteful as traditional in-order com-mit may still be able to provide sufficient resourcesfor forward progress

2 it may be futile as the chances of encountering aninstruction that restricts further commit (eg anunresolved branch) tends to increase with aggres-siveness

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 5

3 it creates a significant management problem as out-of-order commit can create gaps in several struc-tures including the ROB and also the load queueand store queue (which is not completely addressedin prior works [21 22 5])

Commit Width and DepthThe final parameter to explore within the context

of out-of-order commit is the commit depth to scanfor potential instructions to commit out-of-order Whilecommit width is the number of instructions that can becommitted simultaneously per cycle the commit depthis the measure of how far the core can scan looking forinstructions to commit out-of-order in a given cycle

23 OOC Conditions

In Section 3 we describe the methodology used to com-mit instructions out-of-order With this methodologywe can selectively relax the commit conditions describedin Subsection 21 (except completion) while still guar-anteeing correct execution For example by relaxingthe branch condition (Unsafe_BR) ldquocommitting onlydown a non-speculative pathrdquo we can continue to freeresources past unresolved branches but effectively onlycommit from the correct path In this paper we donot evaluate the implementation required to relax theseconditions but instead evaluate the potential for per-formance improvement We evaluate the maximum po-tential performance improvement with a speculative out-of-order rollback cost of zero In this section we providedetails on the performance implications of the relax-ation of a single and combination of conditions

231 Instruction complete

The core waits for an instruction to finish executing be-fore commit can occur We do not examine early com-mit of loads [14 15] that miss in the cache and insteadwe consider them available for commit only after thedata returns and is bound to the destination register

232 Memory replay traps (safe_ST and safe_LD)

We describe two sub-cases for this conditionStore-Load (safe_ST) This condition applies to

same-thread memory dependencies involving a store anda load In particular we cannot commit a load out-of-order in the presence of a prior store with an un-resolved address If the store and the load prove tobe dependent (the load should have taken the valueof the store) the commit would have been incorrectThe LD condition disallows the commit of a load and

its dependent instructions until all prior stores resolvetheir addresses and all the memory dependencies arecorrectly enforced By relaxing this condition we cancommit loads and their dependent instructions even ifprior non-aliasing stores have unresolved addresses

Load-Load (safe_LD) This concerns memory con-sistency models that enforce loadrarrload ordering (egSequential Consistency or TSO) Under this orderingconstraint it is possible to allow loads out-of-order aslong as this is not observed in the memory system Thesafe_LD condition disallows the out-of-order commitof loads unless it is guaranteed that the correct or-der will be observed by the memory system To relaxthis condition we allow loadrarrload re-orderings that arenot observed by other cores A very specific case wouldbe a memory mapped IO (MMIO) request that mightchange the order of memory operations The MMIOcase acts as a rsquocoprocessorrsquo meaning that we have amulti-processor system here We ignore memory requestsfrom other cores (IO coprocessor)

233 WAR hazards

WAR hazards are already handled by the out-of-ordercore within the ROB and we assume a solution suchas the Value Buffer [21] for committing out-of-orderThus we do not consider this condition further

234 Unresolved Branches (safe_BR)

This condition guarantees that we commit only fromthe correct path of execution Out-of-order commit shouldnot proceed past unresolved branches until they are cor-rectly resolved We can relax this condition and commitpast an unresolved branch if we are able to undo thecommit To evaluate maximum performance potentialwe assume a zero rollback cost for out-of-order commitmisspredictions However the normal branch mispre-diction cost (10 cycles) is faithfully accounted (see Sub-section 49 for more details) In addition we evaluatethe rollback count for this condition in Subsection 49

235 Exceptions (safe_EXC)

This condition caters to precise interrupts Enforcingthis condition requires that we do not commit past aninstruction (floating-point memory access or any in-struction that may cause an exception) unless we makesure that the instruction will not cause a exception Torelax this safe_EXC condition we assume the code re-gions are exception free

6 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

Fig 2 Conceptual comparison of in-order and out-of-order commit (commit depth=4) C Ready to commitX Not ready to commit E Executing I Issued OEmpty entry

Fig 3 Functionality of AOOC and ROOC In thisexample AOOC commits one more instruction thanROOC (commit depth=4) C Ready to commit XNot ready to commit E Executing I Issued OEmpty entry

24 Safe and Unsafe Out-of-Order Commit

Normally discussion of out-of-order commit tends tofocus on the changes that occur in the back end ofthe pipeline However the purpose of out-of-order com-mit is to enable the front end to proceed The per-ception that out-of-order commit increases performancecan also be misleading instruction execution is not spedup it is the removal of conditions that stall the frontend that increases performance More specifically inan unconstrained architecture (unrestricted ROB reg-isters and loadstore queue entries) safe out-of-ordercommit does not perform faster than in-order commitFor restricted real-world implementations Safe_OOChelps to ameliorate this problem

In contrast Unsafe_OOC has the capacity to ex-ceed the performance of an unconstrained in-order com-mit architecture as it removes penalties needed to guar-antee correctness (eg correctly handling memory de-

pendencies correctly enforcing memory consistency or-dering etc) in situations that the hardware does nothave any other means of imposing such correctness Inthis case Unsafe_OOC has the potential to violate cor-rectness and requires a means to revert back to a safestate if a violation occurs This can be a good trade-offwhen the conditions that violate correctness are rareWe implement an oracle version of Unsafe_OOC inthat it violates the correctness conditions because itwill be safe to do so (and therefore no correctness prob-lems will occur) This removes the need to provide themechanisms to revert to a safe state in the case of a mis-speculation This model provides a best-case speedup asrecovery from misspeculation is zero cost

25 Aggressive and Reluctant OOC

Orthogonal to the enforcement of the out-of-order com-mit conditions the aggressiveness of out-of-order com-mit plays a significant role in the resulting performanceand the cost it incurs We introduced two approachesfor out-of-order commit Aggressive (AOOC) and Re-luctant (ROOC) out-of-order commit A good way todescribe them is to contrast their main difference howoften each mechanism is engaged

AOOC is engaged all the time ie it constantlytries to find instructions to commit out-of-order if theopportunity arises In this respect it aims to increasecommit bandwidth

In contrast ROOC is only engaged when one of thecritical resources in the core (ROB entries registersLoadstore queue entries) is about to be exhaustedROOC is concerned about front end stalls not aboutcommit bandwidth This means that the number of in-structions that ROOC needs to find that can commitout-of-order is limited ROOC needs to provide enoughfree entries in the ROBregistersLSQ so that the frontend can dispatch as many instructions as possible (up tothe dispatch width) to the out-of-order engine Figure 2shows this with an example The reason that the com-mit stage of an in-order commit is blocked is becauseof an unresolved instruction at the head of ROB Forexample given a four-wide superscalar shown in Fig-ure 2 tries to commit four instructions per cycle Whilethe first instruction at the head of ROB can committhe second oldest instruction causes the commit stageto block and instead of four only one instruction com-mits (in-order)

The behavior of out-of-order commit depends on itsaggressiveness

AOOC AOOC attempts to find up to commit-width instructions so it can satisfy the need to commit

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 7

Table 1 Baseline core parameters

Full system simulation 34 GHz x86-64L1id 32KiB 8-way 4clkL2 256KiB 8-way 12clkL3 1MiB 8-way 36clkDRAM 200clkBranch Predictor Tournament front end penalty 10clkPrefetcher OFF(default)ON Stride degree 8

Table 2 Microarchitecture configuration with reorderbuffer (ROB) instruction queue (IQ) load and storequeues (LQSQ) and integer and floating-point regis-ter file (RF) details Dispatch width (D) commit width(CW) and Out-of-Order commit depth (CD) is thesame to enable a fair comparison (SLM hardware hasa DCW of 2) Register values are additional physicalregisters above architectural state

Microarchitecture DCWCD ROB IQ LQSQ RF(INTFP)Silvermont-like (SLM) 448 32 32 1016 3232Nehalem-like (NHM) 448 128 56 4836 6868Haswell-like (HSW) 448 192 60 7242 130130

at the highest possible commit bandwidth In our ex-ample it aims to find four instructions However sinceonly one more is needed to fill the gap in the head groupof four leading instructions AOOC produces an excessof commit-ready instructions that can potentially beexploited in the near future

ROOC ROOC on the other hand aims to findthe minimum number of instructions needed to commit(out-of-order) so that no resource is exhausted and thefront end can continue to issue instructions at its peakbandwidth The reason for seeking the minimum num-ber of instructions to commit out-of-order is that thisminimizes the perturbations in instruction order Thiscould potentially lead to more efficient hardware im-plementations While Figure 3 only considers the ROBin the general case all dispatch-related resources areconsidered in the same way In our particular exam-ple ROOC needs to find just one instruction to com-mit out-of-order The ROB already contains two emptyslots and the in-order commit mechanisms can find oneinstruction to commit leaving one left for ROOC

In contrast in AOOC mode there are in total threeinstructions that commit (instructions 1 3 and 4in Figure 3) The result is that AOOC needs to scandeeper than ROOC

3 Methodology

We use the gem5 [6] simulator in full system mode tosimulate an x86-64 target with a frequency of 34GHzTo test our models we use the SPEC CPU2006 bench-mark suite We use ten uniformly distributed check-points from each benchmark Each simulation check-

point has three phases that begin with 250M instruc-tions of cache warm-up followed by 100k instructions ofdetailed pipeline warm-up and ends with a detailed sim-ulation of 100M instructions Furthermore three dif-ferent configurations similar to three conventional out-of-order processors are used Intelrsquos SLM NHM andHSW [9] microarchitectures Tables 1 and 2 list the de-tailed configuration of the simulation environment

To implement our out-of-order commit model wefirst configure a simulated machine with a very largenumber of core resources We then monitor the numberof committed and non-committed instructions that ap-pear in the pipeline and control at dispatch whetherwe can support additional instructions in the back endof the processor In this way we dynamically determinethe resource availability in the processor for each cyclebased on the out-of-order commit conditions

4 Out-of-Order Commit Evaluation

In this section we analyze the benefits of out-of-ordercommit on the performance of a number of applica-tions We look at how the commit bandwidth changeswith out-of-order commit and how the effective re-source size of each critical component changes as weenable different out-of-order commit conditions Nextwe show how out-of-order commit conditions affect per-formance Safe_OOC and Unsafe_OOC are two ex-treme points that are defined by either enabling (re-specting) or disabling all of the out-of-order commitconditions This results in a minimum and maximumpotential performance improvement across the bench-mark suite To understand the effect of each conditionin isolation we study the effect of each one both in thepresence and absence of other conditions These stud-ies on Safe_OOC and Unsafe_OOC conditions wereconducted for both Aggressive (AOOC) and Reluctant(ROOC) out-of-order commit

41 Microarchitecture Aggressiveness

We target three microarchitectures resembling IntelrsquosSilvermont (SLM) Nehalem (NHM) and Haswell (HSW)as small medium and large cores (See Table 2 for de-tails) As an overview Figure 1 shows the performanceimprovement for each microarchitecture assuming Safe-OOC (all conditions respected) for all benchmarks onaverage across SPEC CPU2006 [16] We can see thatin the case of narrow commit depth (four in this fig-ure) a relatively small out-of-order processor (SLM)has more potential for relative improvement comparedto the medium and aggressive microarchitectures The

8 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

SLM SLMOOC

NHMNHMOOC

HSWHSWOOC

20406080

100

Cycle

s ()

4 committed3 committed

2 committed1 committed

0 committed

Fig 4 Commit bandwidth distribution for the SPECCPU2006 benchmarks of a 4-wide core with in-ordercommit and out-of-order commit respecting all commitconditions (Safe_OOC) Out-of-order commit increasescommit pressure even without aggressive speculation

reason is that the smaller processor (with a shorter in-struction reach and given a balanced design smallerhardware structures) will more likely stall as it exposesa smaller amount of the potential ILP in an applica-tion In case of a larger commit depth the more ag-gressive cores (NHM and HSW) have higher potentialperformance improvement (See Section 44 for a de-tailed overview) Out-of-order commit frees the proces-sor from the traditional limits reducing the number oftimes the processor experiences exhausted resources Inmedium and large aggressive cores thanks to a largerROB as well as other hardware resources more intrinsicILP is extracted by traditional in-order commit leav-ing less potential for out-of-order commit with a narrowcommit depth

42 The Effect on Commit Bandwidth

Figure 4 shows the distribution of the number of com-mitted instructions per cycle for three different microar-chitectures for both in-order and out-of-order commitacross all SPEC CPU2006 benchmarks Although an in-order commit 4-wide commit HSW microarchitecturecan retire up to four instructions per cycle this occurson average less than 20 of the time In practice wesee a large number of commit stage stalls (zero instruc-tions committed per cycle) For out-of-order committhe distribution shifts toward four instructions per cy-cle reflecting the improved commit performance (andthe resulting improvement in overall performance) Fi-nally this figure shows that for smaller less aggressivecores (such as SLM see Table 2) out-of-order commitprovides a relatively larger improvement compared tothe other microarchitectures because of the short com-

LQ SQ IQ RF ROB00

05

10

15

20

Norm

alize

d effective siz

e

IOC Safe_OOC Unsafe_OOC

Fig 5 Effective resource use in in-order and Aggressiveout-of-order commit across SPEC CPU2006

mit depth of four Larger cores improve more with moreaggressive commit depths and out-of-order commit pa-rameters (see section 44 for details)

43 The Effect on Resources

One of the main issues that out-of-order commit canhelp resolve is the early release (and subsequent reuse)of hardware resources that otherwise would still be re-quired to maintain the in-order state in an in-ordercommit processor Therefore with a small ROB (andother appropriately sized structures) given in-order com-mit the core can more easily stall when it runs out ofresources For example consider a ROB size of 32 froma SLM microarchitecture When the ROB is full it con-tains 32 micro-operations in flight and the CPUrsquos backend will no longer accept additional micro-operationsfrom the front end Increasing the physical size of theROB along with other resources is one potential butrather expensive solution to this issue An alternativesolution and one of the benefits of using out-of-ordercommit is the early release of resources This early re-lease increases the effective size of a resource (comparedto an in-order commit core) improving performance ofthe core

Benchmark specific analysis The xalan bench-mark is one of the top 5 benchmarks with a large num-ber of CPU stalls caused by exhausted resources BothAOOC and ROOC are very effective for the xalan bench-mark OOC is able to provide additional free entries inthe ROB RF and LSQ (see Figure 7)

On the other hand leslie3d has the lowest num-ber of CPU stalls based on exhausted resources whichlimits the potential for improvement with OOC

Figure 5 compares the effective size of the SLM mi-croarchitecture between in-order commit and both safeand unsafe aggressive out-of-order commit (AOOC) mod-els for the SPEC CPU2006 benchmarks The resultsare normalized to a fixed size SLM_IOC microarchitec-ture (see Table 2) An effective size of 10 translates tofrequent stalls due to resource exhaustion while sizes

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 9

4 8 16 32 128192

(a) SLM

20

40

60

80

100

120

Perfo

rmance

improvem

ent ()

4 8 16 32 128192

Commit depth

(b) NHM

20

40

60

80

100

120

Safe_OOC Unsafe_OOC

4 8 16 32 128192

(c) HSW

20

40

60

80

100

120

Fig 6 Effect of commit depth on performance improve-ment normalized to their respective in-order commitperformance across SPEC CPU2006 The smaller corereceives less benefit from increasing the depth of com-mit Safe_OOC saturates at a commit depth of 8 16and 32 for SLM NHM and HSW respectively (the dis-tance to the first unresolved instruction from the headof the ROB) There is no saturation in performance im-provement of Unsafe_OOC while the commit depth isincreased

greater than 10 shows the effective resource size in-creases due to OOC

For aggressive out-of-order commit the larger effec-tive sizes show the reach of this technique We see thatthe utilization of all structures except for the ROB isalmost the same for safe and unsafe out-of-order com-mit which allows Safe_OOC to achieve most of theperformance of the unsafe version Nevertheless the un-safe core is much better at improving the reach of theROB allowing applications like hmmer continue to showa benefit when moving from safe to unsafe out-of-ordercommit See Section 452 for more details on hmmer

44 Evaluation of Commit Depth

To gauge the effect of the commit depth (ie how far wescan the ROB to find instructions that can commit out-of-order) we impose a hard limit on it and evaluate theeffects on the resulting performance The strictest limitis the commit-width itself starting to commit out-of-order from the first commit-width instructions We thenrelax this to the immediate vicinity (eg double thecommit-width) and progressively relax until we reachthe size of the ROB

In Figure 6 we see that a commit depth of 4 (equalto commit width) provides the smallest benefit but alsothe smallest difference between Safe_OOC and Un-safe_OOC In addition the SLM core benefits morefrom OOC compared to NHM and HSW when com-mit depth is smaller than 8 At a commit depth of 8and above the large aggressive cores benefit much more

than the smaller core type with the maximum improve-ment of Unsafe_OOC at 80 119 and 129 for SLMNHM and HSW respectively For aggressive cores thelarger commit depth allows for continued performanceimprovements and will be necessary for the design of abalanced aggressive out-of-order commit design

45 Out-of-Order Commit Performance

In this section we analyze the out-of-order commit con-ditions to determine the minimum (and maximum) per-formance improvement potential The minimum andmaximum improvement is provided by Safe_OOC andUnsafe_OOC respectively In Figure 7 we show theamount of improvement provided by both safe and un-safe out-of-order commit for both aggressive and reluc-tant modes for all three microarchitectures

451 Safe_OOC

Honoring all out-of-order commit conditions results in amodest performance improvement One implication forfuture processors is that Safe_OOC does not requireadditional support for speculative out-of-order rollbackrecovery mechanisms it requires support for the com-mit of instructions out-of-order and the freeing of struc-tures for use by future instructions

Safe_AOOC When Safe_AOOC is evaluated onthe SLM microarchitecture (see Figure 7a) the rangeof improvement spans from a low of 3 (leslie3d) upto 82 (xalan) with an average of 44 In the NHMmicroarchitecture the improvement ranges from 9 to108 for milc and xalan with an average improvementof 57 and for HSW we see an improvement of 1for mcf to 110 for wrf with an average of 55 (SeeSection 44 for more details)

Safe_ROOC For Safe_ROOC the performanceimprovement is lower for all three microarchitecturescompared to AOOC (See Subsection 25 for additionaldetails) In the SLM microarchitecture the range ofperformance improvement of safe ROOC is from 2 to55 for calculix and gcc respectively with an averageimprovement of 20 In the NHM microarchitecturewe see performance improvements that range from 1(mcf) to 22 (bwaves) with an average improvement of7 (8 for HSW) Because NHM and HSM have fewerCPU stalls ROOC has less of an effect on performancecompared with the smaller more efficient core

452 Unsafe_OOC

By relaxing all conditions Unsafe_OOC provides themaximum potential for performance improvement Un-

10 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

020406080

100120140

Perfo

rman

ceim

prov

emen

t () a) AOOC across SpecCPU

SLM_UnsafeSLM_Safe

NHM_UnsafeNHM_Safe

HSW_UnsafeHSW_Safe

astarbwaves

bzip cactusadm

calculixdealii

gamessgcc gemsfdtd

gobmkgromacs

h264refhmmer

lbm leslie3dlibquantum

mcf milc namdomnetpp

perl povraysjeng

soplexsphinx3

tontowrf xalan

zeusmpSpecCPU-avg

020406080

100120140

b) ROOC across SpecCPU

Fig 7 IPC improvement of safe and unsafe out-of-order commit relative to in-order commit as a baseline for bothreluctant and aggressive versions applied on SPEC CPU2006 benchmarks on three microarchitectures

safe_OOC will require recovery mechanisms for thesetechniques which can reduce the performance potentialbecause of recovery costs

To understand the effectiveness of all conditions to-gether we consider zero cost for recovery for any mis-speculated out-of-order commit condition

Unsafe_AOOC This technique provides the high-est performance improvement for all three different ar-chitectures In SLM cores the improvement ranges from9 to 120 for libquantum and astar applicationsrespectively and the average is 59 In NHM archi-tecture the average is 72 and the range is betweenmilc and astar respectively with 11 and 196 im-provement In HSW cores similar to NHM cores astarhas the maximum benefit from Unsafe_ooc with 192improvement while the minimum improvement is fordealii at 7 The average improvement for this classof architecture is 70

Unsafe_ROOC Reluctant out-of-order commit islower performing because it is not continuously lookingto commit additional instructions (See Section 25 fordetails) In the case of SLM the improvement rangesfrom 3 to 93 respectively for leselie3d and xalanwith an average of a 38 improvement In NHM mcfand bwaves with 1 and 22 show the maximum andthe minimum improvement with an average of 7 (HSWis similar with an average of 8) Between the threemicroarchitectures the limited SLM benefits the mostfrom ROOC because of the large number of stalls seenby this core Therefore ROOC especially Unsafe_ROOCis an interesting methodology to improve the perfor-mance of relatively small but energy efficient CPUs aswe see a relatively high performance improvement fora less aggressive commit implementation

Benchmark-specific Analysis The hmmer bench-mark is a particularly strong case for the benefits of out-of-order commit for SLM This application is L1-cacheresident and exhibits very few last-level cache missesNevertheless we still see a very strong improvement inperformance from 33 to 52 for Safe_OOC increas-ing to 51 to 66 for Unsafe_OOC Looking aheadto Figure 8 and Figure 11 we can see that evadingthe branch condition provides the most benefit for thisapplication Making room for additional instructions toallow the hardware to expose additional ILP works welleven for those applications without a significant num-ber of LLC misses The mcf benchmark contains load-dependent branches has the highest MPKI (misses perkilo instructions) among the benchmarks and thereforeit has a rather low IPC when it is executed on an in-order commit CPU This results in a good opportunityto improve performance as it is extremely limited bythese misses

46 Performance Effects of Commit Conditions

In the previous section by analyzing safe and unsafeout-of-order commit we observe that there is a largegap between the performance improvement of these twoimplementations Understanding the cause of this per-formance improvement (by looking at individual com-mit conditions in isolation) allows us to better under-stand where to focus future hardware efforts

461 Positive Contribution of Out-of-Order CommitConditions

To study the gap between safe and unsafe out-of-ordercommit (Figure 7) we analyze the effect of relaxing

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 11

020406080

100

Contrib

ution in

improv

emen

t () a) SLM-AOOC across SpecCPU

Safe Unsafe_LD Unsafe_ST Unsafe_BR Unsafe_EXC

020406080

100b) AOOC SpecCPU average

astarbwaves

bzip cactusadm

calculixdealii

gamessgcc gemsfdtd

gobmkgromacs

h264refhmmer

lbm leslie3dlibquantum

mcf milc namdomnetpp

perl povraysjeng

soplexsphinx3

tontowrf xalan

zeusmp

020406080

100c) SLM-ROOC across SpecCPU

SLM-avgNHM-avg

HSW-avg

020406080

100d) ROOC SpecCPU average

Fig 8 Contribution of safe and selectively unsafe out-of-order commit on three different microarchitecturesUnsafe_XX is equivalent to activating (enforcing) all out-of-order commit conditions except XX (the XX conditionis relaxed) By relaxing the specific XX condition the dependence between other conditions is also observed

each condition in the presence of the other preservedconditions in Figure 8 We analyze the SLM microar-chitecture in detail and provide averages across all mi-croarchitectures for both AOOC and ROOC Each out-of-order commit condition in analyzed in isolation andwe consider Unsafe_OOC (all relaxed conditions) asthe 100 potential improvement budget In the case ofthe mcf benchmark in Figure 7a the safe and unsafeOOC performance improvement is 33 and 71 re-spectively (46 of the potential improvement budget isprovided by Safe_OOC) We also observe that by re-laxing the LD condition (unsafe_LD) 52 of potentialimprovement budget is achievable (see Figure 8a) InFigure 8 we can see in some applications (like namd inAOOC mode and leslie3d in ROOC mode) that re-laxing just a single condition is not sufficient to fill thegap between safe and unsafe OOC This does not meanthat a single condition is not important but rather thatother preserved conditions are preventing out-of-ordercommit from achieving its full potential

AOOC We observe that for most of applicationsUnsafe_BR and Unsafe_LD are the most interestingconditions (Figure 8a) Additionally the more aggres-sive the core the more important the Unsafe_LD con-dition becomes In SLM NHM and HSW CPUs Un-safe_LD respectively fills 4 10 and 12 and Un-safe_BR fills 9 8 and 7 of the gap between safeand unsafe OOC Unsafe_ST is not very effective be-cause of the rarity of this condition and the conserva-tive memory dependence predictor used Unsafe_EXCor relaxing exceptions are not that effective becausethey are very rare especially in integer benchmarks

ROOC Relaxing OOC conditions in ROOC hasless effect in reducing the gap between safe and un-safe OOC and this is because of the nature of this on-demand OOC mode which is enabled and needed more

in SLM and much less in NHM and HSW (see Fig-ure 7b)

Benchmark-specific Analysis The astar andgobmk benchmarks have the highest number of mis-predicted branches per 1000 instructions Therefore re-laxing the branch condition (unsafe_BR) improves theperformance of these two benchmarks by 68 and 94respectively On the other hand benchmarks such ascactusadm and lbm have a low branch mispredictionrate (05) and therefore relaxing the branch condi-tion does not show significant improvement for thesebenchmarks

The sphinx benchmark has high degree of intrinsicILP2 thus an increase in the number of effective re-sources allows this benchmark to improve performanceAs most of its load instructions are L2-cache missesrelaxing the load condition (Unsafe_LD) improves per-formance of this benchmark

462 Negative contribution of OOC conditions

In this section we analyze the gap between safe andunsafe out-of-order commit from a different angle

We relax all of the out-of-order commit conditionsexcept for one For example safe_LD means the LDcondition is preserved but ST BR and EXC conditionsare relaxed By preserving one of the conditions we lookat the negative effect (performance reduction) of theactivated condition compared to Unsafe_OOC

Figure 11 depicts the effect of each condition onperformance For most of the benchmarks the BR andthe LD conditions are the most effective ones Amongfloating-point benchmarks LD and EXC conditions havea large impact on performance Therefore relaxing theEXC condition as it is rare could lead to significant

2 The sphinx benchmark continues to show performance im-provements as the in-order commit processor aggressiveness in-creases from SLM to HSW

12 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

200 400 600 800 1000DRAM latency (cycle)

020406080

100120140160

Perfo

rman

ceim

prov

emen

t (

)Safe_OOC Unsafe_OOC

Fig 9 A comparison between safe and unsafe out-of-order commit across DRAM latencies for SLM andAOOC on SPEC CPU2006 OOC improvement in-creases with a higher DRAM latency Unsafe_OOCoutperforms Safe_OOC This study uses caches thatare four times smaller compared to Table 1 to put ad-ditional stress on the DRAM This has been done forboth in-order and out-of-order commit configurations

performance improvements at relatively low cost espe-cially if recovery mechanisms in software are used SThas the least effect among out-of-order commit condi-tions when it is preserved in isolation from other condi-tions This is valid between all three microarchitectures

47 Memory Latency Evaluation

Current DRAM cells are optimized for cost and notfor access latency [20] The potential to use more af-fordable but higher latency memory could have a largeimpact on allocation of memory in datacenter serversas high density DRAM modules cost much more thanthose that are 4times smaller (175times per GB [4]) To eval-uate the potential of out-of-order commit to handlehigher DRAM latency we evaluate memory latency from200 cycles to 1000 cycles in 200 cycle increments Inaddition we reduce the size of all caches by 4 timeswhen compared to Table 1 to put additional stress onthe DRAM subsystem Here we see an increasing per-formance improvement for the Unsafe_OOC conditionwhile Safe_OOC increases linearly compared to the in-order commit core For memory-intensive applicationsan efficient implementation of Unsafe_OOC could al-low the use of denser higher-latency DRAM that couldpotentially cost much less

48 Prefetching Evaluation

Prefetching is essential for performance in modern sys-tems and interacts tightly with the pipeline and out-of-order commit We do not consider all prefetchingoptions but instead focus on the potential benefitsTherefore to evaluate this interaction we configure theL1-D cache in gem5 with a stride prefetcher of degree8 Figure 10 compares the relative performance gains

IOC+pref

AOOCAOOC+pref

(b) SLM

0

20

40

60

80

Perfo

rman

ceim

prov

emen

t (

)

IOC+pref

AOOCAOOC+pref

(b) NHM

0

20

40

60

80

IOC+pref

AOOCAOOC+pref

(c) HSW

0

20

40

60

80

Fig 10 The effect of aggressive out-of-order commit onthe performance with and without prefetchers in theL1 data cache across SPEC CPU2006 Improvementis relative to the baseline IOC architectures withoutprefetchers

of prefetchers for in-order commit out-of-order com-mit and prefetchers for out-of-order commit Acrossall three architectures both out-of-order commit andprefetching on their own provide roughly 40 improve-ment in performance with out-of-order commit beingslightly better However the combination of out-of-ordercommit and prefetching delivers nearly 70 better per-formance nearly the sum of the two independent contri-butions In fact combining aggressive out-of-order com-mit with prefetchers allows us to reach 84 of the ideal(sum of both) improvement for SLM 77 for NHMand 83 for HSW showing that these techniques workwell together

49 Rollback Costs for Unsafe Branches

While this work does not evaluate hardware costs weare able to evaluate the number of rollbacks caused bycommitting past a mispredicted branch for each of theevaluated configurations For this evaluation we use acommit depth of 8 with aggressive out-of-order commit(AOOC) and Unsafe_BR (only the branch conditionis not respected for out-of-order commit) In Table 3we list the number of executed branches mispredictedbranches the number of rollbacks caused by specula-tively committing a mispredicted branch and the num-ber of instructions that were rolled back due to this roll-back While the number of executed branches increaseswith the aggressiveness of the core (because these corescan more aggressively speculate past branches) the av-erage number of rollbacks (and instructions rolled back)decreases slightly with the aggressiveness of the coreThese aggressive cores resolve speculative state quicklyreducing the number of instructions that need to becommitted out-of-order This results in a similar num-ber of rollbacks per thousand architecturally committedinstructions around 3 for all configurations

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 13

0minus20minus40minus60minus80Co

ntrib

ution in

degrad

ation (

) a) SLM-AOOC across SpecCPU

Safe_LD Safe_ST Safe_BR Safe_EXC

0minus20minus40minus60minus80

b) AOOC SpecCPU average

astarbwaves

bzip cactusadm

calculixdealii

gamessgcc gemsfdtd

gobmkgromacs

h264refhmmer

lbm leslie3dlibquantum

mcf milc namdomnetpp

perl povraysjeng

soplexsphinx3

tontowrf xalan

zeusmp

0minus20minus40minus60minus80

b) SLM-ROOC across SpecCPU

SLM-avgNHM-avg

HSW-avg

0minus20minus40minus60minus80

d) ROOC SpecCPU average

Fig 11 Maximum potential performance improvement is provided by Unsafe_OOC in which all conditions areunsafe By preserving all conditions maximum performance is reduced This figure shows the negative effect ofpreserving (or making safe) a single condition compared to the maximum potential performance improvementSafe_XX is equivalent to disabling all out-of-order commit conditions (all unsafe highest performance) where XXindicates the only safe and preserved condition

Table 3 Out-of-Order Commit Costs for UnsafeBranches

Metric-PKI SLM NHM HSWAvg Max Avg Max Avg Max

Branches 1462 3733 1533 4608 1568 4957Mispred br 78 509 81 545 82 560Num rollbacks 32 243 31 256 30 265Instrs rollback 85 529 66 430 54 343

5 Memory Parallelism

Overlapping cache misses to service them in parallelin particular long-latency accesses to DRAM but alsolower-latency accesses to the LLC can result deliversignificant performance benefits [8] This memory par-allelism is typically achieved through the use of multi-ple Miss Status Holding Registers (MSHRs) [19] whichtrack outstanding memory requests and allow them toexecute in parallel In this section we compare in-ordercommit and out-of-order commit in terms of memoryparallelism (both to DRAM (MLP) and within the cachehierarchy (MHP) [7]) by changing the number of L1MSHRs and observing the effect on performance Toexplore these effects we select three applications thatare highly memory-bound [18] (mcf) medium memory-bound (lbm) and largely not memory-bound (gcc) inFigure 12 One key observation is that out-of-order com-mit in both reluctant and aggressive modes is muchbetter than in-order commit in exposing intrinsic ap-plication memory parallelism Figure 12 shows that thegap between in-order and out-of-order commit is muchlarger in the case of HSW which means that the moreaggressive is the microarchitectures the more MLP iscovered if we apply out-of-order instruction commit Incase of lbm comparing the SLM NHM and HSW weobserved that the rate of performance improvement inthe range of (1 lt= MSHRSize lt= 5) is higher than

the rate in the range of (5 lt MSHRSize lt= 9) Inthe range of (MSHRSize gt 9)) HSW and NHM havecontinued improvement This is most likely the resultof an additional loop iteration containing DRAM ac-cesses that is being covered by out-of-order instructioncommit

Figure 12 also allows us to explore the impact ofmemory-boundedness by comparing across the threeapplications (mcf is most memory bound gcc least) Al-thoughmcf (Figure 12 (d) (e) and (f)) is more memory-bound than lbm (Figure 12 (g) (h) and (i)) it exposesless memory parallelism when the number of MSHRsis increased (the gap between in-order commit and safeaggressive out-of-order commit is larger in case of lbmcompared to mcf ) This is because mcf has more iso-lated cache misses that are misses that cannot be ser-viced at the same time to provide memory level par-allelism Also mcf has branch instructions based onmemory accesses (dependent loads) that miss in thecache hierarchy thereby reducing the number of in-structions that can potentially be committed out-of-order lbm has medium level of memory-boundedness[18] and therefore exposes much more memory paral-lelism as the number of MSHRs is increased Its per-formance improvement is therefore much better whenthe CPU commits instructions out-of-order gcc bench-mark Figure 12 (j) (k) and (l) is one of the leastmemory-bound benchmarks As a result increasing thenumber of MSHRs does not improve its performanceeven in the case of unsafe out-of-order commit Over-all out-of-order commit outperforms in-order commitby exposing additional memory parallelism (both toDRAM and in the hierarchy) In summary in case ofMLP out-of-order commit provides more benefit forbigger architecture and more memory-bound applica-tions

14 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

1 2 3 4 5 6 7 8 9 10123456789

Norm

alize

d Pe

rform

ance

improv

emen

t ()

(a) SPEC2006 SLM

1 2 3 4 5 6 7 8 9 10123456789

(b) SPEC2006 NHM

IOC Safe-AOOC Unsafe-AOOC

1 2 3 4 5 6 7 8 9 10123456789

(c) SPEC2006 HSW

1 2 3 4 5 6 7 8 9 10123456789

(d) mcf SLM

1 2 3 4 5 6 7 8 9 10123456789

(e) mcf NHM

1 2 3 4 5 6 7 8 9 10123456789

(f) mcf HSW

1 2 3 4 5 6 7 8 9 10123456789

(g) lbm SLM

1 2 3 4 5 6 7 8 9 10123456789

(h) lbm NHM

1 2 3 4 5 6 7 8 9 10123456789

(i) lbm HSW

1 2 3 4 5 6 7 8 9 10123456789

(j) gcc SLM

1 2 3 4 5 6 7 8 9 10MSHR size

123456789

(k) gcc NHM

1 2 3 4 5 6 7 8 9 10123456789

(l) gcc HSW

Fig 12 Memory hierarchy parallelism comparison between in-order commit and safe and unsafe out-of-ordercommit in MSHRs size of 1 to 10 The results have been normalized to in-order commit with MSHR size of oneSubgraph (a) (b) and (c) are based on harmonic mean across SPEC CPU2006 The mcf lbm and gcc benchmarksare respectively representative of categories with high medium and low memory boundedness

6 Early Release of Physical Registers vsOut-of-order Commit

Register renaming enables the avoidance of false de-pendencies through the register file thereby improvingcore performance It does this by translating architec-tural registers to larger set of physical registers As thisrenaming is done early in the pipeline and the physi-

cal registers are not released until commit which mayhappen long after the last consumer reads the data thephysical register entries may be kept alive for longerthan what the dependencies alone require To addressthis resource constraint techniques such as delayed al-location and Rarly Release of the Physical Register

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 15

Table 4 The Contribution of exhausted (full) regis-ter file (RF) and reorder buffer (ROB) on CPU stallsOther reasons to the CPU stalls could be the instruc-tion queue (IQ) the load-store queue (LSQ) instruc-tions cache misses or not having a free functional unitto allocate to the ready-to-dispatch instructions

Microarchitecture RF ROB Othermin avg max min avg max min avg max

SLM 6 59 86 13 36 92 0 5 27NHM 2 68 91 8 30 98 0 20 7HSW 2 69 91 8 29 98 0 20 7

(ERPR) [23] have been developed In this section wecompare ERPR with the OOC taxonomy [2]

OOC and ERPR are similar as they release physicalregister as early as possible Aggressive OOC outper-forms ERPR in general because it releases ROB andLSQ entries early as well as physical registers Unfor-tunately neither ERPR nor OOC can release IQ entriesearlier than usual because OOC only address processorstate after the instructions have left the IQ and ERPRis effective before instructions are inserted tot the IQFrom another point of view OOC put additional pres-sure on the IQ and provides free entries in one or all ofthe resource of superscalar processors

61 Analysis of CPU Aggressiveness

CPU aggressiveness analysis in this context is basedon microarchitecture configurations (see Table 2) andcommit-depth The overall trend of Figure 13 showsthat AOOC outperforms ROOC and ERPR in all con-figurations This is not surprising as AOOC is alwaysenabled while ROOC is enabled only when a resourcesis exhausted (see Section 25)

ERPR is essentially a subset of AOOC that only fo-cuses on the register file (RF) Therefore the higher isthe program sensitivity to the RF capacity the higher isthe effect of ERPR To better understand this behaviorwe analyzed the reasons for CPU stalls across differentmicroarchitectures (Table 4) The data show that in-creasing the effective size of RF (eg what ERPR doesconceptually) is less effective in SLM compare to NHMand HSW barbecue is SLM the CPU stalls are moreoften due to other resources such as the ROB size Asa result since in SLM architecture the size of RF isless effective in CPU stalls ROOC outperforms ERPRin terms of performance improvement (ROOC can vir-tually extend the effective size of ROB LSQ as wellas RF although ERPR only does this on RF) RF inNHM and HSW architectures is more effective on CPUstall than in SLM architecture therefore ERPR out-

performs ROOC regarding performance improvementin these two architecture

By focusing only on ERPR we observe that thewider the core the more effective is the is increasingcommit-depth (wider commit-depth covers more readyinstructions to commit) For example in the case ofSLM although widening the commit-depth releases ad-ditional physical registers earlier than general thesefreed registers are not effective anymore because theCPU stalls are due to limits in other resources (likeROB IQ or LSQ) As NHM and HSW have more re-sources increasing the commit-depth is more effectivein those architectures In the case of AOOC increas-ing the commit-depth is more effective than ROOCand ERPR because it enables additional instructionsto be committed out-of-order (because it is active onevery cycle compare to ROOC and also affects all ofthe resources compare to ERPR that only affects theRF) In addition we can see that the more aggressivethe CPU the more effective is increasing the commit-depth This is because wider cores execute and com-plete instructions that are far from the head of ROBearlier with their additional resources(either providedby the baseline or released by the OOOC) and theycan therefore provide more instructions to commit ifthe commit-depth is increased

62 Analysis of Out-of-Order Conditions

Figure 13 shows the potential for performance improve-ment from relaxing the out-of-order commit conditionsacross the three architectures for both safe and unsafeout-of-order-commit As has been seen previously [2 521] the least aggressive out-of-order commit configu-ration (safe_OOC with a commit depth of 4) is mostbeneficial for the least aggressive out-of-order processorSLM (compare the left-most bars of Figure 13(a-c))This is expected as the SLM processor has the least ex-ecution hardware and therefore suffers from more stallsthan the more aggressive processors

Increasing the commit depth from 4 to 8 (secondgroup of bars in Figure 13(a-c)) shows that the more ag-gressive out-of-order processors (NHM and HSW) nowbenefit more from out-of-order commit than the sim-pler SLM architecture This new result shows the inter-action between the aggressiveness of the out-of-orderexecution and out-of-order commit for more aggres-sive out-of-order execution there is more benefit frommore aggressive out-of-order commit The reason is thataggressive architectures appear to have more unusedresources available for computation This trend con-tinues through the more aggressive out-of-order com-mit modes as well with unsafe optimizations (Figure

16 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

4 8 16 320

20406080

100

Perfo

rman

ceim

prov

emen

t () (a) SLM-SAFE

4 8 16 32 128Commit Depth

020406080

100

(b) NHM-SAFEAOOC ROOC ERPR

4 8 16 32 128 1920

20406080

100

(c) HSW-SAFE

4 8 16 320

20406080

100

(d) SLM-UnsafeLD

4 8 16 32 1280

20406080

100

(e) NHM-UnsafeLD

4 8 16 32 128 1920

20406080

100

(f) HSW-UnsafeLD

4 8 16 320

20406080

100

(g) SLM-UnsafeBR

4 8 16 32 1280

20406080

100

(h) NHM-UnsafeBR

4 8 16 32 128 1920

20406080

100

(i) HSW-UnsafeBR

4 8 16 320

20406080

100

(j) SLM-UnsafeST

4 8 16 32 1280

20406080

100

(k) NHM-UnsafeST

4 8 16 32 128 1920

20406080

100

(l) HSW-UnsafeST

4 8 16 320

20406080

100

(m) SLM-UnsafeEXC

4 8 16 32 1280

20406080

100

(n) NHM-UnsafeEXC

4 8 16 32 128 1920

20406080

100

(o) HSW-UnsafeEXC

4 8 16 320

20406080

100

(p) SLM-Unsafe

4 8 16 32 1280

20406080

100

(q) NHM-Unsafe

4 8 16 32 128 1920

20406080

100

(r) HSW-Unsafe

Fig 13 Comparison between AOOC ROOC and ERPR in different commit-depth and microarchitecture setupin six different modes UnsafeXX is equivalent to (enforcing) all out-of-order commit conditions except conditionXX By relaxing the specific XX condition the dependence between other conditions is also observed

13(d-r)) benefiting the more aggressive out-of-order ex-ecution designs (NHM and HSW) more than the sim-pler one (SLM) The reasons for this are similar tothe safe case as the more aggressive processors arefaster to achieve the first out-of-order condition in-struction completion and therefore providing greatercommit depths (eg changing the commit depth from 4to 8 in Figure 13 (a-c)) results in a larger likelihoodof finding instructions to commit out-of-order Table 5

uses this analysis to rank the relative importance ofeach out-of-order commit condition for the three ar-chitectures This information provides a guideline forwhich areas to work on to achieve the best performanceimprovement depending on the baseline out-of-order ex-ecution

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 17

Conditioncommit-depth 4 8 16 32 128 192

Unsafe_LD 1 2 2 2 NA NAUnsafe_BR 2 1 1 1 NA NAUnsafe_ST 4 4 4 4 NA NAUnsafe_EXC 3 3 3 3 NA NA

(a) SLMConditioncommit-depth 4 8 16 32 128 192

Unsafe_LD 1 2 2 2 2 NAUnsafe_BR 3 1 1 1 1 NAUnsafe_ST 4 4 4 4 4 NAUnsafe_EXC 2 3 3 3 3 NA

(b) NHMConditioncommit-depth 4 8 16 32 128 192

Unsafe_LD 4 2 2 2 2 2Unsafe_BR 1 1 1 1 1 1Unsafe_ST 3 3 4 4 4 4Unsafe_EXC 2 4 3 3 3 3

(c) HSW

Table 5 Ranking of the benefits of out-of-order commitconditions in different microarchitecture based on thedepth of commit

7 Related Work

The goal of this work is to provide a detailed under-standing into the potential for performance benefitsacross different out-of-order commit conditions and lev-els of aggressiveness While this work aims to describethe maximum potential benefit for each individual con-dition there have been many previous works that de-scribe hardware solutions for early release of hardwarestructures and out-of-order commit strategiesBelowwe provide an overview of these works and how theyfit into the categories described in this work

71 Speculative Release of Hardware StructuresA number of implementations require register and pro-cessor state checkpointing support to speculatively re-tire or release hardware structures [12 11 22 1 17]

Processor state checkpointing especially with theadvent of very large SIMD registers such as AVX-512registers which now support up to 32 registers with upto 512 bits per register can require a significant amountof state to be saved when speculation is aggressive

72 Non-speculative Structure Release

Non-speculative early release of hardware structures be-fore commit requires knowledge that no older instruc-tion (that has come earlier in the instruction stream)can cause the program to abort raise an exception orrequire exposure of the architected state at that timeNon-speculative solutions have the potential to be themost energy efficient a necessity in an era of the end ofDennard Scaling for power-limited platforms A range

of solutions from hardware-only to software-assistedsolutions are described below

Compiler support Two previous works [1 13]allow for early commit or reclaim of resources basedon compiler knowledge but require software recompi-lation

Checkpoint-free A number of solutions do notuse checkpoints [5 21 13] and therefore do not requirespeculation (they respect the all commit conditions) orused software help to extend knowledge to the hard-ware

Selective early commit Early commit of loads [1415] allows for loads that have not yet received data fromthe memory hierarchy to become part of the commit-ted state of a processor This allows the processor tocontinue to process instructions past normally blockedstructures improving performance for memory-boundworkloads To accomplish this the authors [15] de-couple page faults that can occur from the fetching ofdata While recompilation is not strictly required forthis technique the authors evaluate their work using acompilation strategy to expose loads early Late alloca-tion and early release of physical registers is proposedas a register renaming scheme in [23] They explored theeffect of late allocation and early release of physical reg-ister on the performance of an 8-way superscalar pro-cessor without the need for additional speculation andmaintains precise exceptions The out-of-order commitmethods evaluated in this work overlap with the earlyrelease of registers while also evaluating the early re-lease of other structures (like the LSQ)

Evading out-of-order commit conditions[24] provides a coherency protocol level solution suchthat the loadrarrload reordering can be evaded non spec-ulatively under TSO In their design it is no longer nec-essary to squash and re-execute speculatively reorderedloads if their reordering can be hidden by the coherencyprotocol This allows speculatively reordered loads thatotherwise adhere to the rest of the Bell-Lipasti condi-tions to be committed out-of-order without affectingthe memory model

8 Conclusion

To obtain higher performance extending the reach ofthe processor core has been a focus in much of microar-chitecture research One promising direction is the useof out-of-order commit which releases precious proces-sor resources early to allow the processor to extend itsreach past typical hardware limits In this work wepresent a limit study for out-of-order commit throughthe introduction of reluctant and aggressive out-of-order

18 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

commit modes We show how smaller processors evenwith a limited commit scan depth can benefit fromout-of-order commit strategies but that larger aggres-sive cores require deeper commit scan depths to achieveimproved performance In addition we provide a de-tailed breakdown of the contributions for each out-of-order commit condition for the SPEC CPU2006 bench-mark suite and compare against similar works suchas early release of physical registers Our results showa very high potential for performance improvementabove 225x for some benchmarks and believe that out-of-order commit strategies can play an important rolefor future energy-efficient and high-performance proces-sor designs

References

1 Afram F Zeng H Ghose K A group-commitmechanism for ROB-based processors implement-ing the x86 ISA In HPCA pp 47ndash58 (2013)

2 Alipour M Carlson TE Kaxiras S Exploringthe performance limits of out-of-order commit InCF pp 211ndash220 (2017)

3 Allan A Edenfeld D Joyner Jr WH KahngAB Rodgers M Zorian Y 2001 technologyroadmap for semiconductors Computer 35(1) 42ndash53 (2002)

4 Badalone R Dramrsquos surprising role in the costof data centers httpwwwdatacenterknowledgecomarchives20151112dont (2015)

5 Bell GB Lipasti MH Deconstructing commitIn ISPASS pp 68ndash77 (2004)

6 Binkert N Beckmann B Black G ReinhardtSK Saidi A Basu A Hestness J Hower DRKrishna T Sardashti S Sen R Sewell KShoaib M Vaish N Hill MD Wood DA Thegem5 simulator SIGARCH Comput Archit News39(2) 1ndash7 (2011)

7 Carlson TE Heirman W Allam O Kaxiras SEeckhout L The load slice core microarchitectureIn ISCA pp 272ndash284 (2015)

8 Chou Y Fahs B Abraham S Microarchitec-ture optimizations for exploiting memory-level par-allelism In ISCA pp 76ndash88 (2004)

9 Corporation I Intel Rcopy 64 and ia-32 ar-chitectures optimization reference manualhttpwwwintelcomcontentwwwusenarchitecture-and-technology64-ia-32-architectures-optimization-manualhtml(2016)

10 Corporation I Intel Rcopy intelrsquos rsquotick-tockrsquo seeminglydead becomes rsquoprocess-architecture-optimizationrsquohttpwwwanandtechcomshow10183

intels-tick-tock-seemingly-dead-becomes-process-architecture-optimization (2016)

11 Cristal A Ortega D Llosa J Valero M Out-of-order commit processors In HPCA pp 48ndash59(2004)

12 Cristal A Santana OJ Valero M MartiacutenezJF Toward kilo-instruction processors ACMTrans Archit Code Optim 1(4) 389ndash417 (2004)

13 Duong N Veidenbaum AV Compiler-assistedselective out-of-order commit IEEE Computer Ar-chitecture Letters 12(1) 21ndash24 (2013)

14 Gwennap L Digital leads the pack with 21164 InMicroprocessor Report 8(12) pp 249ndash260 (1994)

15 Ham T Aragoacuten JL Martonosi M Desc De-coupled supply-compute communication manage-ment for heterogeneous architectures In MICROpp 191ndash203 (2015)

16 Henning JL SPEC CPU2006 benchmark descrip-tions SIGARCH Comput Archit News 34(4) 1ndash17 (2006)

17 Hilton A Roth A BOLT Energy-efficient out-of-order latency-tolerant execution In HPCA pp1ndash12 (2010)

18 Jaleel A Memory characterization of workloadsusing instrumentation driven simulation httpwwwglueumdeduajaleelworkload (2010)

19 Kroft D Lockup-free instruction fetchprefetchcache organization In ISCA pp 81ndash87 (1981)

20 Lee D Kim Y Seshadri V Liu J Subrama-nian L Mutlu O Tiered-latency dram A lowlatency and low cost dram architecture In HPCApp 615ndash626 (2013)

21 Marti S Borras J Rodriguez P Tena RMarin J A complexity-effective out-of-order re-tirement microarchitecture IEEE Transactions onComputers 58(12) 1626ndash1639 (2009)

22 Martinez JF Renau J Huang MC PrvulovicM Cherry Checkpointed early resource recyclingin out-of-order microprocessors In MICRO pp3ndash14 (2002)

23 Monreal T Vinals V Gonzalez J GonzalezA Valero M Late allocation and early release ofphysical registers IEEE Trans Comput 53(10)1244ndash1259 (2004)

24 Ros A Carlson TE Alipour M Kaxiras SNon-speculative load-load reordering in TSO InISCA pp 187ndash200 (2017)

25 Smith JE Pleszkun AR Implementing preciseinterrupts in pipelined processors IEEE Transac-tions on Computers 37(5) 562ndash573 (1988)

26 Sohi GS Vajapeyam S Instruction issue logicfor high-performance interruptable pipelined pro-cessors In ISCA pp 27ndash34 (1987)

  • Introduction
  • Out-of-Order Commit
  • Methodology
  • Out-of-Order Commit Evaluation
  • Memory Parallelism
  • Early Release of Physical Registers vs Out-of-order Commit
  • Related Work
  • Conclusion
Page 2: Maximizing Limited Resources: A Limit-based Study and ...tcarlson/pdfs/alipour2018mlralsatoo… · Maximizing Limited Resources: A Limit-based Study and Taxonomy of Out-of-order Commit

2 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

SLMSLM_ROOC

SLM_AOOC

NHMNHM_ROOC

NHM_AOOC

HSWHSW_ROOC

HSW_AOOC

075100125150175200225

IPC normalized to SLM

commit-depth=4 commit-depth=ROB size

Fig 1 Performance comparison (harmonic mean ofIPCs across SPEC CPU2006 [16]) of in-order and twotypes of safe out-of-order commit reluctant (ROOC)and aggressive (AOOC) with a commit depth of 4 andROB size for increasingly aggressive microarchitecturesThese experiments respect all traditional commit con-ditions and show that aggressive out-of-order commitcan reach the performance of the next class of proces-sor

LSQ entries or registers To overcome this hurdle de-signers size these structures so that they minimize thechances of any single resource becoming exhausted cre-ating a balanced microarchitecture This however con-tributes to the power-inefficiency of the OoO cores Atthe very least after a certain point increasing the sizeof the LSQ (which is usually implemented as an ex-pensive CAM) or the size of the register file resultsin an increase in energy consumption that far exceedsthe performance benefit a significant disadvantage forpower-constrained processors Thus the incentive forpursuing out-of-order commit (OOC) lies in the promiseof higher performance with fewer resources A turn-ing point in our understanding of out-of-order commitcame with the work of Bell and Lipasti [5] who artic-ulated the limiting factors for non-speculative OOCThe necessary conditions to allow an instruction to becommitted range from the completion status of the in-struction itself to the branch prediction and exceptionstate of intervening instructions Several proposals forout-of-order commit implicitly or explicitly abide bythese conditions potentially harming efficiency to en-force them

11 Analyses performed

The question we explore in this paper is what couldthe performance gain be if we had the means to evadeany one or even all of these conditions together To thebest of our knowledge this work is the first to evalu-ate the potential performance contribution of relaxingeach individual commit condition For example we in-

vestigate the potential performance gain if we couldcorrectly commit past unresolved branches stores withunresolved addresses or instructions that can generateexceptions We explore the interactions of out-of-ordercommit across a range of important processor designpoints including variations in prefetchers MSHRs mem-ory latency branch prediction and the size of the dy-namic instruction window Answering these questionsallows us to understand the most profitable conditionsto address Our work confirms previous studies [2 5 21]that the least aggressive core benefits the most fromleast aggressive out-of-order commit and in addition byintroducing a taxonomy for the first time we show thatmore aggressive cores gain more comparatively fromaggressive out-of-order commit All in all our studyshows that as a future direction not only safe but alsounsafe out-of-order commit appears to be very promis-ing This is especially true if the potential benefits canbe tapped with much more efficient and selective mech-anisms to guarantee correctness instead of bulk check-pointing and rollback Already such non-speculativemechanisms have been proposed [24]

12 Why We Do it

Bell and Lipasti first articulated the conditions for ldquosaferdquonon-speculative out of order commit [5] in an effort totackle the problem of improving single-thread perfor-mance This was at the same time that IC manufac-turing broke through the 100nm technology node [103] However the acceptance for OOC architectures hasbeen slow Today significantly improving single-threadperformance in an energy-efficient manner remains achallenge The goal of this work is to help researchersto hone in on the most profitable aspects of OOC byoffering i) a detailed exploration of the limits of out-of-order commit conditions and ii) a taxonomy for out-of-order commit [2]

13 Analyses outside the scope of this work

However in this paper we do not quantify the cost thatwould be required to guarantee correctness when com-mitting past any or all of these conditions as this wouldbe tied to a specific hardware or software mechanismInstead we evaluate the potential performance benefitsavailable as a means to gauge the potential of futureproposals in relation to their cost In the same veinwe have not explored every possible core optimizationthat could be construed as extending the instructionwindow outside the core (eg run-ahead execution) Inthis work we chose to study three common architecturesto provide baselines with understood costs

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 3

14 Contribution

We study safemdashnon-speculativemdash out-of-order commit(refraining from committing until all conditions are met)and unsafemdashspeculativemdash out-of-order commit (com-mitting even before all conditions are met assuming thepotential for correct recovery) In addition to studyingboth safe and unsafe out-of-order commit as a minimumand maximum potential performance improvement wealso study out-of-order commit in two additional di-mensions aggressiveness and degree of speculation

ndash Aggressiveness A dimension that determines thepotential benefit is the aggressiveness of the out-of-order-commit implementation how far into the in-struction window we try to commit Prior work [22]examines this aggressiveness in their implementa-tion of checkpoint-based out-of-order commit wherethey show a middle-ground in aggressiveness is neededto mitigate checkpoint-based penalties while stillimproving performanceIn addition work on non-speculative out-of-ordercommit [5] has recognized that a restricted commitwindow can achieve a good portion of the perfor-mance improvements compared to an unrestrictedunlimited out-of-order commit implementation

ndash Degree of speculation Apart from the aggres-siveness of commit we also consider varying degreesof speculation The insight of this study is that selec-tive speculation preserving or relaxing of one moreconditions necessary for out-of-order commit couldlead to more efficient implementations In fact re-cent work [15 24] has shown that this might be thecase for load instructions

15 Results

This study aims to provide a guidepost for each majorparameter of out-of-order commit to provide an upperbound on the performance benefits given a particularaggressiveness and degree of selective speculation Inthis work we show that

ndash there is significant untapped potential for unsafeout-of-order commit beyond traditional in-order andsafe out-of-order commit the gap widens in morepowerful architectures

ndash for energy-efficient cores with moderately-sized in-struction windows reluctant1 limited out-of-order

1 Reluctant out-of-order commit continues normal in-oredrcommit and is only enabled when the hardware cannot continueto make forward progress This mode switching happens earlyenough which is a cycle before the related queue (ROB IQLSQ) is full so that the CPU stall is avoided

commit is sufficient to reap the most benefit whileaggressive versions of out-of-order commit becomea requirement for larger cores

ndash the order of importance of the commit conditionschanges depending on the type of application thearchitecture (limited or aggressive OoO processor)and the aggressiveness of out-of-order commit (com-mit depth)

ndash we unexpectedly found that a focus on specific out-of-order commit conditions could be an importantfuture direction for high-performance efficient out-of-order processors

ndash the potential benefits of out-of-order commit increaseswith memory latency (relatively more for unsafe)while the benefits of the prefetching strategy thatwe picked are orthogonal to out-of-order commitbenefits This raises the enticing possibility of re-ducing system-wide silicon financial cost withoutcompromising performance by coupling dense buthigher-latency (slow-but-efficient) DRAM with out-of-order commit cores

ndash out-of-order commit increases memory hierarchy par-allelism [7]

ndash While it is generally acceptable that by releasingpipeline resources as early as possible out-of-ordercommit improves performance in minor and smallcores relatively more than in large cores in this workwe show that this is only true for reluctant out-of-order commit In fact performance improvement inlarge out-of-order cores can exceed that of smallercores if aggressive out-of-order commit is employed

ndash Our results show the potential for future systemsthat implement out-of-order commit and indicatewhich are the most promising directions (safe vsunsafe commit and which of Bell and Lipastirsquos con-ditions [5] are most important to support) for futuredesigns

The rest of this paper is organized as follows InSection 2 we first provide an overview of the conditionsthat need to be honored for in-order commit as wellas provide an overview of out-of-order commit Nextin Section 3 we present our evaluation methodologyand simulated system configurations Section 4 pro-vides a detailed performance analysis for each out-of-order commit condition both from the point of view ofaggressive and reluctant out-of-order commit In Sec-tion 5 and Section 6 we compare the early releaseof physical register and memory hierarchy parallelismrespectively with Out-of-Order commit extending pre-vious work [2] Finally the related works and conclusionare presented in Section 7 and 8 respectively

4 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

2 Out-of-Order Commit

The introduction of the reorder buffer (ROB) to pro-vide in-order commit in an out-of-order scheduled su-perscalar pipeline was an important advance for com-puter architecture culminating an effort that startedwith Tomasulorsquos algorithm and included techniques suchas reservation stations and the register update unit(RUU) [26] The reorder buffer maintains precise ar-chitectural state in the presence of interrupts unknownmemory dependencies or memory re-orderings that canperturb the ordering required by a memory consistencymodel Out-of-order commit on the other hand at-tempts to break this rigid updating of the architecturalstate either in a safe way (ie one that does not requireadditional speculation and rollback to revert changes tothe architectural state) or in an unsafe way (ie onethat does)

21 Safe vs Unsafe OOC

A turning point in our understanding of out-of-ordercommit came with the work of Bell and Lipasti [5] inthe form of a number of limiting conditions for safe out-of-order commit The necessary conditions to allow aninstruction to be committed out-of-order are1 The instruction is completed instructions can com-

mit only after their completion2 The instruction is not involved in memory replay

traps This condition simply says that we cannotcommit speculative loads or their dependent instruc-tions This condition relates to unresolved memorydependencies or memory consistency enforcementFor example total store order (TSO) requires a re-play of speculative loads that violate loadrarrload or-dering when this reordering is detected by othercores

3 Register WAR hazards are resolved (ie a write toa particular register cannot be permitted to commitbefore all prior reads of that architectural registerhave been completed)

4 Previous branches are successfully predicted Thiscondition simply says that we can commit only whileon the correct path of execution

5 No prior instruction in program order is going toraise an exception This condition provides preciseinterrupts and is essential in easing handling of egpage faultsIn the rest of this paper we will use the following

convention to discuss how we evade the BellLipasticonditions1 Safe_OOC where all out-of-order-commit condi-

tions are preserved This case provides the minimum

potential performance improvement of out-of-ordercommit but also the minimum hardware to imple-ment as it does not rely on speculation and rollbackbeyond what is already available in the out-of-ordercore

2 Unsafe_OOC where one or more (or all) of theout-of-order commit conditions are evaded (apartfrom true dependencies) Doing so the maximumpotential performance improvement of out-of-ordercommit is evaluated but this may require extra sup-port for speculation and rollback to be able to revertchanges in the architectural state that were foundto be incorrect after the commit

22 Reluctant vs Aggressive OOC

Aside from the limiting conditions described above aseparate dimension is the aggressiveness of committingout-of-order Thus concerning the mechanics of out-of-order commit we distinguish two versions

1 Reluctant out-of-order-commit (ROOC) where theout-of-order commit mechanisms are engaged onlywhen needed and

2 Aggressive out-of-order-commit (AOOC) wherethe out-of-order commit mechanisms are continu-ously active looking for opportunities to commitinstructions as early as possible

While the always on nature of the aggressive out-of-order commit is obvious in its meaning the when neededof the reluctant out-of-order-commit requires clarifica-tion For this paper reluctant out-of-order commit isengaged only when the core is in imminent danger ofgoing into a complete stall In other words we engagereluctant out-of-order commit only when one of the crit-ical resources (ROB entries registers load-store queueentries) is all but exhausted and cannot support thefetching of new instructions in the front end of thepipeline As such reluctant out-of-order commit acts asa safety valve to release the pressure on resources (justbefore this pressure reaches a critical point) ratherthan aggressively trying to keep resource pressure low

In contrast aggressive out-of-order commit releasesresources more eagerly but disregards the following is-sues

1 it might prove wasteful as traditional in-order com-mit may still be able to provide sufficient resourcesfor forward progress

2 it may be futile as the chances of encountering aninstruction that restricts further commit (eg anunresolved branch) tends to increase with aggres-siveness

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 5

3 it creates a significant management problem as out-of-order commit can create gaps in several struc-tures including the ROB and also the load queueand store queue (which is not completely addressedin prior works [21 22 5])

Commit Width and DepthThe final parameter to explore within the context

of out-of-order commit is the commit depth to scanfor potential instructions to commit out-of-order Whilecommit width is the number of instructions that can becommitted simultaneously per cycle the commit depthis the measure of how far the core can scan looking forinstructions to commit out-of-order in a given cycle

23 OOC Conditions

In Section 3 we describe the methodology used to com-mit instructions out-of-order With this methodologywe can selectively relax the commit conditions describedin Subsection 21 (except completion) while still guar-anteeing correct execution For example by relaxingthe branch condition (Unsafe_BR) ldquocommitting onlydown a non-speculative pathrdquo we can continue to freeresources past unresolved branches but effectively onlycommit from the correct path In this paper we donot evaluate the implementation required to relax theseconditions but instead evaluate the potential for per-formance improvement We evaluate the maximum po-tential performance improvement with a speculative out-of-order rollback cost of zero In this section we providedetails on the performance implications of the relax-ation of a single and combination of conditions

231 Instruction complete

The core waits for an instruction to finish executing be-fore commit can occur We do not examine early com-mit of loads [14 15] that miss in the cache and insteadwe consider them available for commit only after thedata returns and is bound to the destination register

232 Memory replay traps (safe_ST and safe_LD)

We describe two sub-cases for this conditionStore-Load (safe_ST) This condition applies to

same-thread memory dependencies involving a store anda load In particular we cannot commit a load out-of-order in the presence of a prior store with an un-resolved address If the store and the load prove tobe dependent (the load should have taken the valueof the store) the commit would have been incorrectThe LD condition disallows the commit of a load and

its dependent instructions until all prior stores resolvetheir addresses and all the memory dependencies arecorrectly enforced By relaxing this condition we cancommit loads and their dependent instructions even ifprior non-aliasing stores have unresolved addresses

Load-Load (safe_LD) This concerns memory con-sistency models that enforce loadrarrload ordering (egSequential Consistency or TSO) Under this orderingconstraint it is possible to allow loads out-of-order aslong as this is not observed in the memory system Thesafe_LD condition disallows the out-of-order commitof loads unless it is guaranteed that the correct or-der will be observed by the memory system To relaxthis condition we allow loadrarrload re-orderings that arenot observed by other cores A very specific case wouldbe a memory mapped IO (MMIO) request that mightchange the order of memory operations The MMIOcase acts as a rsquocoprocessorrsquo meaning that we have amulti-processor system here We ignore memory requestsfrom other cores (IO coprocessor)

233 WAR hazards

WAR hazards are already handled by the out-of-ordercore within the ROB and we assume a solution suchas the Value Buffer [21] for committing out-of-orderThus we do not consider this condition further

234 Unresolved Branches (safe_BR)

This condition guarantees that we commit only fromthe correct path of execution Out-of-order commit shouldnot proceed past unresolved branches until they are cor-rectly resolved We can relax this condition and commitpast an unresolved branch if we are able to undo thecommit To evaluate maximum performance potentialwe assume a zero rollback cost for out-of-order commitmisspredictions However the normal branch mispre-diction cost (10 cycles) is faithfully accounted (see Sub-section 49 for more details) In addition we evaluatethe rollback count for this condition in Subsection 49

235 Exceptions (safe_EXC)

This condition caters to precise interrupts Enforcingthis condition requires that we do not commit past aninstruction (floating-point memory access or any in-struction that may cause an exception) unless we makesure that the instruction will not cause a exception Torelax this safe_EXC condition we assume the code re-gions are exception free

6 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

Fig 2 Conceptual comparison of in-order and out-of-order commit (commit depth=4) C Ready to commitX Not ready to commit E Executing I Issued OEmpty entry

Fig 3 Functionality of AOOC and ROOC In thisexample AOOC commits one more instruction thanROOC (commit depth=4) C Ready to commit XNot ready to commit E Executing I Issued OEmpty entry

24 Safe and Unsafe Out-of-Order Commit

Normally discussion of out-of-order commit tends tofocus on the changes that occur in the back end ofthe pipeline However the purpose of out-of-order com-mit is to enable the front end to proceed The per-ception that out-of-order commit increases performancecan also be misleading instruction execution is not spedup it is the removal of conditions that stall the frontend that increases performance More specifically inan unconstrained architecture (unrestricted ROB reg-isters and loadstore queue entries) safe out-of-ordercommit does not perform faster than in-order commitFor restricted real-world implementations Safe_OOChelps to ameliorate this problem

In contrast Unsafe_OOC has the capacity to ex-ceed the performance of an unconstrained in-order com-mit architecture as it removes penalties needed to guar-antee correctness (eg correctly handling memory de-

pendencies correctly enforcing memory consistency or-dering etc) in situations that the hardware does nothave any other means of imposing such correctness Inthis case Unsafe_OOC has the potential to violate cor-rectness and requires a means to revert back to a safestate if a violation occurs This can be a good trade-offwhen the conditions that violate correctness are rareWe implement an oracle version of Unsafe_OOC inthat it violates the correctness conditions because itwill be safe to do so (and therefore no correctness prob-lems will occur) This removes the need to provide themechanisms to revert to a safe state in the case of a mis-speculation This model provides a best-case speedup asrecovery from misspeculation is zero cost

25 Aggressive and Reluctant OOC

Orthogonal to the enforcement of the out-of-order com-mit conditions the aggressiveness of out-of-order com-mit plays a significant role in the resulting performanceand the cost it incurs We introduced two approachesfor out-of-order commit Aggressive (AOOC) and Re-luctant (ROOC) out-of-order commit A good way todescribe them is to contrast their main difference howoften each mechanism is engaged

AOOC is engaged all the time ie it constantlytries to find instructions to commit out-of-order if theopportunity arises In this respect it aims to increasecommit bandwidth

In contrast ROOC is only engaged when one of thecritical resources in the core (ROB entries registersLoadstore queue entries) is about to be exhaustedROOC is concerned about front end stalls not aboutcommit bandwidth This means that the number of in-structions that ROOC needs to find that can commitout-of-order is limited ROOC needs to provide enoughfree entries in the ROBregistersLSQ so that the frontend can dispatch as many instructions as possible (up tothe dispatch width) to the out-of-order engine Figure 2shows this with an example The reason that the com-mit stage of an in-order commit is blocked is becauseof an unresolved instruction at the head of ROB Forexample given a four-wide superscalar shown in Fig-ure 2 tries to commit four instructions per cycle Whilethe first instruction at the head of ROB can committhe second oldest instruction causes the commit stageto block and instead of four only one instruction com-mits (in-order)

The behavior of out-of-order commit depends on itsaggressiveness

AOOC AOOC attempts to find up to commit-width instructions so it can satisfy the need to commit

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 7

Table 1 Baseline core parameters

Full system simulation 34 GHz x86-64L1id 32KiB 8-way 4clkL2 256KiB 8-way 12clkL3 1MiB 8-way 36clkDRAM 200clkBranch Predictor Tournament front end penalty 10clkPrefetcher OFF(default)ON Stride degree 8

Table 2 Microarchitecture configuration with reorderbuffer (ROB) instruction queue (IQ) load and storequeues (LQSQ) and integer and floating-point regis-ter file (RF) details Dispatch width (D) commit width(CW) and Out-of-Order commit depth (CD) is thesame to enable a fair comparison (SLM hardware hasa DCW of 2) Register values are additional physicalregisters above architectural state

Microarchitecture DCWCD ROB IQ LQSQ RF(INTFP)Silvermont-like (SLM) 448 32 32 1016 3232Nehalem-like (NHM) 448 128 56 4836 6868Haswell-like (HSW) 448 192 60 7242 130130

at the highest possible commit bandwidth In our ex-ample it aims to find four instructions However sinceonly one more is needed to fill the gap in the head groupof four leading instructions AOOC produces an excessof commit-ready instructions that can potentially beexploited in the near future

ROOC ROOC on the other hand aims to findthe minimum number of instructions needed to commit(out-of-order) so that no resource is exhausted and thefront end can continue to issue instructions at its peakbandwidth The reason for seeking the minimum num-ber of instructions to commit out-of-order is that thisminimizes the perturbations in instruction order Thiscould potentially lead to more efficient hardware im-plementations While Figure 3 only considers the ROBin the general case all dispatch-related resources areconsidered in the same way In our particular exam-ple ROOC needs to find just one instruction to com-mit out-of-order The ROB already contains two emptyslots and the in-order commit mechanisms can find oneinstruction to commit leaving one left for ROOC

In contrast in AOOC mode there are in total threeinstructions that commit (instructions 1 3 and 4in Figure 3) The result is that AOOC needs to scandeeper than ROOC

3 Methodology

We use the gem5 [6] simulator in full system mode tosimulate an x86-64 target with a frequency of 34GHzTo test our models we use the SPEC CPU2006 bench-mark suite We use ten uniformly distributed check-points from each benchmark Each simulation check-

point has three phases that begin with 250M instruc-tions of cache warm-up followed by 100k instructions ofdetailed pipeline warm-up and ends with a detailed sim-ulation of 100M instructions Furthermore three dif-ferent configurations similar to three conventional out-of-order processors are used Intelrsquos SLM NHM andHSW [9] microarchitectures Tables 1 and 2 list the de-tailed configuration of the simulation environment

To implement our out-of-order commit model wefirst configure a simulated machine with a very largenumber of core resources We then monitor the numberof committed and non-committed instructions that ap-pear in the pipeline and control at dispatch whetherwe can support additional instructions in the back endof the processor In this way we dynamically determinethe resource availability in the processor for each cyclebased on the out-of-order commit conditions

4 Out-of-Order Commit Evaluation

In this section we analyze the benefits of out-of-ordercommit on the performance of a number of applica-tions We look at how the commit bandwidth changeswith out-of-order commit and how the effective re-source size of each critical component changes as weenable different out-of-order commit conditions Nextwe show how out-of-order commit conditions affect per-formance Safe_OOC and Unsafe_OOC are two ex-treme points that are defined by either enabling (re-specting) or disabling all of the out-of-order commitconditions This results in a minimum and maximumpotential performance improvement across the bench-mark suite To understand the effect of each conditionin isolation we study the effect of each one both in thepresence and absence of other conditions These stud-ies on Safe_OOC and Unsafe_OOC conditions wereconducted for both Aggressive (AOOC) and Reluctant(ROOC) out-of-order commit

41 Microarchitecture Aggressiveness

We target three microarchitectures resembling IntelrsquosSilvermont (SLM) Nehalem (NHM) and Haswell (HSW)as small medium and large cores (See Table 2 for de-tails) As an overview Figure 1 shows the performanceimprovement for each microarchitecture assuming Safe-OOC (all conditions respected) for all benchmarks onaverage across SPEC CPU2006 [16] We can see thatin the case of narrow commit depth (four in this fig-ure) a relatively small out-of-order processor (SLM)has more potential for relative improvement comparedto the medium and aggressive microarchitectures The

8 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

SLM SLMOOC

NHMNHMOOC

HSWHSWOOC

20406080

100

Cycle

s ()

4 committed3 committed

2 committed1 committed

0 committed

Fig 4 Commit bandwidth distribution for the SPECCPU2006 benchmarks of a 4-wide core with in-ordercommit and out-of-order commit respecting all commitconditions (Safe_OOC) Out-of-order commit increasescommit pressure even without aggressive speculation

reason is that the smaller processor (with a shorter in-struction reach and given a balanced design smallerhardware structures) will more likely stall as it exposesa smaller amount of the potential ILP in an applica-tion In case of a larger commit depth the more ag-gressive cores (NHM and HSW) have higher potentialperformance improvement (See Section 44 for a de-tailed overview) Out-of-order commit frees the proces-sor from the traditional limits reducing the number oftimes the processor experiences exhausted resources Inmedium and large aggressive cores thanks to a largerROB as well as other hardware resources more intrinsicILP is extracted by traditional in-order commit leav-ing less potential for out-of-order commit with a narrowcommit depth

42 The Effect on Commit Bandwidth

Figure 4 shows the distribution of the number of com-mitted instructions per cycle for three different microar-chitectures for both in-order and out-of-order commitacross all SPEC CPU2006 benchmarks Although an in-order commit 4-wide commit HSW microarchitecturecan retire up to four instructions per cycle this occurson average less than 20 of the time In practice wesee a large number of commit stage stalls (zero instruc-tions committed per cycle) For out-of-order committhe distribution shifts toward four instructions per cy-cle reflecting the improved commit performance (andthe resulting improvement in overall performance) Fi-nally this figure shows that for smaller less aggressivecores (such as SLM see Table 2) out-of-order commitprovides a relatively larger improvement compared tothe other microarchitectures because of the short com-

LQ SQ IQ RF ROB00

05

10

15

20

Norm

alize

d effective siz

e

IOC Safe_OOC Unsafe_OOC

Fig 5 Effective resource use in in-order and Aggressiveout-of-order commit across SPEC CPU2006

mit depth of four Larger cores improve more with moreaggressive commit depths and out-of-order commit pa-rameters (see section 44 for details)

43 The Effect on Resources

One of the main issues that out-of-order commit canhelp resolve is the early release (and subsequent reuse)of hardware resources that otherwise would still be re-quired to maintain the in-order state in an in-ordercommit processor Therefore with a small ROB (andother appropriately sized structures) given in-order com-mit the core can more easily stall when it runs out ofresources For example consider a ROB size of 32 froma SLM microarchitecture When the ROB is full it con-tains 32 micro-operations in flight and the CPUrsquos backend will no longer accept additional micro-operationsfrom the front end Increasing the physical size of theROB along with other resources is one potential butrather expensive solution to this issue An alternativesolution and one of the benefits of using out-of-ordercommit is the early release of resources This early re-lease increases the effective size of a resource (comparedto an in-order commit core) improving performance ofthe core

Benchmark specific analysis The xalan bench-mark is one of the top 5 benchmarks with a large num-ber of CPU stalls caused by exhausted resources BothAOOC and ROOC are very effective for the xalan bench-mark OOC is able to provide additional free entries inthe ROB RF and LSQ (see Figure 7)

On the other hand leslie3d has the lowest num-ber of CPU stalls based on exhausted resources whichlimits the potential for improvement with OOC

Figure 5 compares the effective size of the SLM mi-croarchitecture between in-order commit and both safeand unsafe aggressive out-of-order commit (AOOC) mod-els for the SPEC CPU2006 benchmarks The resultsare normalized to a fixed size SLM_IOC microarchitec-ture (see Table 2) An effective size of 10 translates tofrequent stalls due to resource exhaustion while sizes

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 9

4 8 16 32 128192

(a) SLM

20

40

60

80

100

120

Perfo

rmance

improvem

ent ()

4 8 16 32 128192

Commit depth

(b) NHM

20

40

60

80

100

120

Safe_OOC Unsafe_OOC

4 8 16 32 128192

(c) HSW

20

40

60

80

100

120

Fig 6 Effect of commit depth on performance improve-ment normalized to their respective in-order commitperformance across SPEC CPU2006 The smaller corereceives less benefit from increasing the depth of com-mit Safe_OOC saturates at a commit depth of 8 16and 32 for SLM NHM and HSW respectively (the dis-tance to the first unresolved instruction from the headof the ROB) There is no saturation in performance im-provement of Unsafe_OOC while the commit depth isincreased

greater than 10 shows the effective resource size in-creases due to OOC

For aggressive out-of-order commit the larger effec-tive sizes show the reach of this technique We see thatthe utilization of all structures except for the ROB isalmost the same for safe and unsafe out-of-order com-mit which allows Safe_OOC to achieve most of theperformance of the unsafe version Nevertheless the un-safe core is much better at improving the reach of theROB allowing applications like hmmer continue to showa benefit when moving from safe to unsafe out-of-ordercommit See Section 452 for more details on hmmer

44 Evaluation of Commit Depth

To gauge the effect of the commit depth (ie how far wescan the ROB to find instructions that can commit out-of-order) we impose a hard limit on it and evaluate theeffects on the resulting performance The strictest limitis the commit-width itself starting to commit out-of-order from the first commit-width instructions We thenrelax this to the immediate vicinity (eg double thecommit-width) and progressively relax until we reachthe size of the ROB

In Figure 6 we see that a commit depth of 4 (equalto commit width) provides the smallest benefit but alsothe smallest difference between Safe_OOC and Un-safe_OOC In addition the SLM core benefits morefrom OOC compared to NHM and HSW when com-mit depth is smaller than 8 At a commit depth of 8and above the large aggressive cores benefit much more

than the smaller core type with the maximum improve-ment of Unsafe_OOC at 80 119 and 129 for SLMNHM and HSW respectively For aggressive cores thelarger commit depth allows for continued performanceimprovements and will be necessary for the design of abalanced aggressive out-of-order commit design

45 Out-of-Order Commit Performance

In this section we analyze the out-of-order commit con-ditions to determine the minimum (and maximum) per-formance improvement potential The minimum andmaximum improvement is provided by Safe_OOC andUnsafe_OOC respectively In Figure 7 we show theamount of improvement provided by both safe and un-safe out-of-order commit for both aggressive and reluc-tant modes for all three microarchitectures

451 Safe_OOC

Honoring all out-of-order commit conditions results in amodest performance improvement One implication forfuture processors is that Safe_OOC does not requireadditional support for speculative out-of-order rollbackrecovery mechanisms it requires support for the com-mit of instructions out-of-order and the freeing of struc-tures for use by future instructions

Safe_AOOC When Safe_AOOC is evaluated onthe SLM microarchitecture (see Figure 7a) the rangeof improvement spans from a low of 3 (leslie3d) upto 82 (xalan) with an average of 44 In the NHMmicroarchitecture the improvement ranges from 9 to108 for milc and xalan with an average improvementof 57 and for HSW we see an improvement of 1for mcf to 110 for wrf with an average of 55 (SeeSection 44 for more details)

Safe_ROOC For Safe_ROOC the performanceimprovement is lower for all three microarchitecturescompared to AOOC (See Subsection 25 for additionaldetails) In the SLM microarchitecture the range ofperformance improvement of safe ROOC is from 2 to55 for calculix and gcc respectively with an averageimprovement of 20 In the NHM microarchitecturewe see performance improvements that range from 1(mcf) to 22 (bwaves) with an average improvement of7 (8 for HSW) Because NHM and HSM have fewerCPU stalls ROOC has less of an effect on performancecompared with the smaller more efficient core

452 Unsafe_OOC

By relaxing all conditions Unsafe_OOC provides themaximum potential for performance improvement Un-

10 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

020406080

100120140

Perfo

rman

ceim

prov

emen

t () a) AOOC across SpecCPU

SLM_UnsafeSLM_Safe

NHM_UnsafeNHM_Safe

HSW_UnsafeHSW_Safe

astarbwaves

bzip cactusadm

calculixdealii

gamessgcc gemsfdtd

gobmkgromacs

h264refhmmer

lbm leslie3dlibquantum

mcf milc namdomnetpp

perl povraysjeng

soplexsphinx3

tontowrf xalan

zeusmpSpecCPU-avg

020406080

100120140

b) ROOC across SpecCPU

Fig 7 IPC improvement of safe and unsafe out-of-order commit relative to in-order commit as a baseline for bothreluctant and aggressive versions applied on SPEC CPU2006 benchmarks on three microarchitectures

safe_OOC will require recovery mechanisms for thesetechniques which can reduce the performance potentialbecause of recovery costs

To understand the effectiveness of all conditions to-gether we consider zero cost for recovery for any mis-speculated out-of-order commit condition

Unsafe_AOOC This technique provides the high-est performance improvement for all three different ar-chitectures In SLM cores the improvement ranges from9 to 120 for libquantum and astar applicationsrespectively and the average is 59 In NHM archi-tecture the average is 72 and the range is betweenmilc and astar respectively with 11 and 196 im-provement In HSW cores similar to NHM cores astarhas the maximum benefit from Unsafe_ooc with 192improvement while the minimum improvement is fordealii at 7 The average improvement for this classof architecture is 70

Unsafe_ROOC Reluctant out-of-order commit islower performing because it is not continuously lookingto commit additional instructions (See Section 25 fordetails) In the case of SLM the improvement rangesfrom 3 to 93 respectively for leselie3d and xalanwith an average of a 38 improvement In NHM mcfand bwaves with 1 and 22 show the maximum andthe minimum improvement with an average of 7 (HSWis similar with an average of 8) Between the threemicroarchitectures the limited SLM benefits the mostfrom ROOC because of the large number of stalls seenby this core Therefore ROOC especially Unsafe_ROOCis an interesting methodology to improve the perfor-mance of relatively small but energy efficient CPUs aswe see a relatively high performance improvement fora less aggressive commit implementation

Benchmark-specific Analysis The hmmer bench-mark is a particularly strong case for the benefits of out-of-order commit for SLM This application is L1-cacheresident and exhibits very few last-level cache missesNevertheless we still see a very strong improvement inperformance from 33 to 52 for Safe_OOC increas-ing to 51 to 66 for Unsafe_OOC Looking aheadto Figure 8 and Figure 11 we can see that evadingthe branch condition provides the most benefit for thisapplication Making room for additional instructions toallow the hardware to expose additional ILP works welleven for those applications without a significant num-ber of LLC misses The mcf benchmark contains load-dependent branches has the highest MPKI (misses perkilo instructions) among the benchmarks and thereforeit has a rather low IPC when it is executed on an in-order commit CPU This results in a good opportunityto improve performance as it is extremely limited bythese misses

46 Performance Effects of Commit Conditions

In the previous section by analyzing safe and unsafeout-of-order commit we observe that there is a largegap between the performance improvement of these twoimplementations Understanding the cause of this per-formance improvement (by looking at individual com-mit conditions in isolation) allows us to better under-stand where to focus future hardware efforts

461 Positive Contribution of Out-of-Order CommitConditions

To study the gap between safe and unsafe out-of-ordercommit (Figure 7) we analyze the effect of relaxing

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 11

020406080

100

Contrib

ution in

improv

emen

t () a) SLM-AOOC across SpecCPU

Safe Unsafe_LD Unsafe_ST Unsafe_BR Unsafe_EXC

020406080

100b) AOOC SpecCPU average

astarbwaves

bzip cactusadm

calculixdealii

gamessgcc gemsfdtd

gobmkgromacs

h264refhmmer

lbm leslie3dlibquantum

mcf milc namdomnetpp

perl povraysjeng

soplexsphinx3

tontowrf xalan

zeusmp

020406080

100c) SLM-ROOC across SpecCPU

SLM-avgNHM-avg

HSW-avg

020406080

100d) ROOC SpecCPU average

Fig 8 Contribution of safe and selectively unsafe out-of-order commit on three different microarchitecturesUnsafe_XX is equivalent to activating (enforcing) all out-of-order commit conditions except XX (the XX conditionis relaxed) By relaxing the specific XX condition the dependence between other conditions is also observed

each condition in the presence of the other preservedconditions in Figure 8 We analyze the SLM microar-chitecture in detail and provide averages across all mi-croarchitectures for both AOOC and ROOC Each out-of-order commit condition in analyzed in isolation andwe consider Unsafe_OOC (all relaxed conditions) asthe 100 potential improvement budget In the case ofthe mcf benchmark in Figure 7a the safe and unsafeOOC performance improvement is 33 and 71 re-spectively (46 of the potential improvement budget isprovided by Safe_OOC) We also observe that by re-laxing the LD condition (unsafe_LD) 52 of potentialimprovement budget is achievable (see Figure 8a) InFigure 8 we can see in some applications (like namd inAOOC mode and leslie3d in ROOC mode) that re-laxing just a single condition is not sufficient to fill thegap between safe and unsafe OOC This does not meanthat a single condition is not important but rather thatother preserved conditions are preventing out-of-ordercommit from achieving its full potential

AOOC We observe that for most of applicationsUnsafe_BR and Unsafe_LD are the most interestingconditions (Figure 8a) Additionally the more aggres-sive the core the more important the Unsafe_LD con-dition becomes In SLM NHM and HSW CPUs Un-safe_LD respectively fills 4 10 and 12 and Un-safe_BR fills 9 8 and 7 of the gap between safeand unsafe OOC Unsafe_ST is not very effective be-cause of the rarity of this condition and the conserva-tive memory dependence predictor used Unsafe_EXCor relaxing exceptions are not that effective becausethey are very rare especially in integer benchmarks

ROOC Relaxing OOC conditions in ROOC hasless effect in reducing the gap between safe and un-safe OOC and this is because of the nature of this on-demand OOC mode which is enabled and needed more

in SLM and much less in NHM and HSW (see Fig-ure 7b)

Benchmark-specific Analysis The astar andgobmk benchmarks have the highest number of mis-predicted branches per 1000 instructions Therefore re-laxing the branch condition (unsafe_BR) improves theperformance of these two benchmarks by 68 and 94respectively On the other hand benchmarks such ascactusadm and lbm have a low branch mispredictionrate (05) and therefore relaxing the branch condi-tion does not show significant improvement for thesebenchmarks

The sphinx benchmark has high degree of intrinsicILP2 thus an increase in the number of effective re-sources allows this benchmark to improve performanceAs most of its load instructions are L2-cache missesrelaxing the load condition (Unsafe_LD) improves per-formance of this benchmark

462 Negative contribution of OOC conditions

In this section we analyze the gap between safe andunsafe out-of-order commit from a different angle

We relax all of the out-of-order commit conditionsexcept for one For example safe_LD means the LDcondition is preserved but ST BR and EXC conditionsare relaxed By preserving one of the conditions we lookat the negative effect (performance reduction) of theactivated condition compared to Unsafe_OOC

Figure 11 depicts the effect of each condition onperformance For most of the benchmarks the BR andthe LD conditions are the most effective ones Amongfloating-point benchmarks LD and EXC conditions havea large impact on performance Therefore relaxing theEXC condition as it is rare could lead to significant

2 The sphinx benchmark continues to show performance im-provements as the in-order commit processor aggressiveness in-creases from SLM to HSW

12 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

200 400 600 800 1000DRAM latency (cycle)

020406080

100120140160

Perfo

rman

ceim

prov

emen

t (

)Safe_OOC Unsafe_OOC

Fig 9 A comparison between safe and unsafe out-of-order commit across DRAM latencies for SLM andAOOC on SPEC CPU2006 OOC improvement in-creases with a higher DRAM latency Unsafe_OOCoutperforms Safe_OOC This study uses caches thatare four times smaller compared to Table 1 to put ad-ditional stress on the DRAM This has been done forboth in-order and out-of-order commit configurations

performance improvements at relatively low cost espe-cially if recovery mechanisms in software are used SThas the least effect among out-of-order commit condi-tions when it is preserved in isolation from other condi-tions This is valid between all three microarchitectures

47 Memory Latency Evaluation

Current DRAM cells are optimized for cost and notfor access latency [20] The potential to use more af-fordable but higher latency memory could have a largeimpact on allocation of memory in datacenter serversas high density DRAM modules cost much more thanthose that are 4times smaller (175times per GB [4]) To eval-uate the potential of out-of-order commit to handlehigher DRAM latency we evaluate memory latency from200 cycles to 1000 cycles in 200 cycle increments Inaddition we reduce the size of all caches by 4 timeswhen compared to Table 1 to put additional stress onthe DRAM subsystem Here we see an increasing per-formance improvement for the Unsafe_OOC conditionwhile Safe_OOC increases linearly compared to the in-order commit core For memory-intensive applicationsan efficient implementation of Unsafe_OOC could al-low the use of denser higher-latency DRAM that couldpotentially cost much less

48 Prefetching Evaluation

Prefetching is essential for performance in modern sys-tems and interacts tightly with the pipeline and out-of-order commit We do not consider all prefetchingoptions but instead focus on the potential benefitsTherefore to evaluate this interaction we configure theL1-D cache in gem5 with a stride prefetcher of degree8 Figure 10 compares the relative performance gains

IOC+pref

AOOCAOOC+pref

(b) SLM

0

20

40

60

80

Perfo

rman

ceim

prov

emen

t (

)

IOC+pref

AOOCAOOC+pref

(b) NHM

0

20

40

60

80

IOC+pref

AOOCAOOC+pref

(c) HSW

0

20

40

60

80

Fig 10 The effect of aggressive out-of-order commit onthe performance with and without prefetchers in theL1 data cache across SPEC CPU2006 Improvementis relative to the baseline IOC architectures withoutprefetchers

of prefetchers for in-order commit out-of-order com-mit and prefetchers for out-of-order commit Acrossall three architectures both out-of-order commit andprefetching on their own provide roughly 40 improve-ment in performance with out-of-order commit beingslightly better However the combination of out-of-ordercommit and prefetching delivers nearly 70 better per-formance nearly the sum of the two independent contri-butions In fact combining aggressive out-of-order com-mit with prefetchers allows us to reach 84 of the ideal(sum of both) improvement for SLM 77 for NHMand 83 for HSW showing that these techniques workwell together

49 Rollback Costs for Unsafe Branches

While this work does not evaluate hardware costs weare able to evaluate the number of rollbacks caused bycommitting past a mispredicted branch for each of theevaluated configurations For this evaluation we use acommit depth of 8 with aggressive out-of-order commit(AOOC) and Unsafe_BR (only the branch conditionis not respected for out-of-order commit) In Table 3we list the number of executed branches mispredictedbranches the number of rollbacks caused by specula-tively committing a mispredicted branch and the num-ber of instructions that were rolled back due to this roll-back While the number of executed branches increaseswith the aggressiveness of the core (because these corescan more aggressively speculate past branches) the av-erage number of rollbacks (and instructions rolled back)decreases slightly with the aggressiveness of the coreThese aggressive cores resolve speculative state quicklyreducing the number of instructions that need to becommitted out-of-order This results in a similar num-ber of rollbacks per thousand architecturally committedinstructions around 3 for all configurations

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 13

0minus20minus40minus60minus80Co

ntrib

ution in

degrad

ation (

) a) SLM-AOOC across SpecCPU

Safe_LD Safe_ST Safe_BR Safe_EXC

0minus20minus40minus60minus80

b) AOOC SpecCPU average

astarbwaves

bzip cactusadm

calculixdealii

gamessgcc gemsfdtd

gobmkgromacs

h264refhmmer

lbm leslie3dlibquantum

mcf milc namdomnetpp

perl povraysjeng

soplexsphinx3

tontowrf xalan

zeusmp

0minus20minus40minus60minus80

b) SLM-ROOC across SpecCPU

SLM-avgNHM-avg

HSW-avg

0minus20minus40minus60minus80

d) ROOC SpecCPU average

Fig 11 Maximum potential performance improvement is provided by Unsafe_OOC in which all conditions areunsafe By preserving all conditions maximum performance is reduced This figure shows the negative effect ofpreserving (or making safe) a single condition compared to the maximum potential performance improvementSafe_XX is equivalent to disabling all out-of-order commit conditions (all unsafe highest performance) where XXindicates the only safe and preserved condition

Table 3 Out-of-Order Commit Costs for UnsafeBranches

Metric-PKI SLM NHM HSWAvg Max Avg Max Avg Max

Branches 1462 3733 1533 4608 1568 4957Mispred br 78 509 81 545 82 560Num rollbacks 32 243 31 256 30 265Instrs rollback 85 529 66 430 54 343

5 Memory Parallelism

Overlapping cache misses to service them in parallelin particular long-latency accesses to DRAM but alsolower-latency accesses to the LLC can result deliversignificant performance benefits [8] This memory par-allelism is typically achieved through the use of multi-ple Miss Status Holding Registers (MSHRs) [19] whichtrack outstanding memory requests and allow them toexecute in parallel In this section we compare in-ordercommit and out-of-order commit in terms of memoryparallelism (both to DRAM (MLP) and within the cachehierarchy (MHP) [7]) by changing the number of L1MSHRs and observing the effect on performance Toexplore these effects we select three applications thatare highly memory-bound [18] (mcf) medium memory-bound (lbm) and largely not memory-bound (gcc) inFigure 12 One key observation is that out-of-order com-mit in both reluctant and aggressive modes is muchbetter than in-order commit in exposing intrinsic ap-plication memory parallelism Figure 12 shows that thegap between in-order and out-of-order commit is muchlarger in the case of HSW which means that the moreaggressive is the microarchitectures the more MLP iscovered if we apply out-of-order instruction commit Incase of lbm comparing the SLM NHM and HSW weobserved that the rate of performance improvement inthe range of (1 lt= MSHRSize lt= 5) is higher than

the rate in the range of (5 lt MSHRSize lt= 9) Inthe range of (MSHRSize gt 9)) HSW and NHM havecontinued improvement This is most likely the resultof an additional loop iteration containing DRAM ac-cesses that is being covered by out-of-order instructioncommit

Figure 12 also allows us to explore the impact ofmemory-boundedness by comparing across the threeapplications (mcf is most memory bound gcc least) Al-thoughmcf (Figure 12 (d) (e) and (f)) is more memory-bound than lbm (Figure 12 (g) (h) and (i)) it exposesless memory parallelism when the number of MSHRsis increased (the gap between in-order commit and safeaggressive out-of-order commit is larger in case of lbmcompared to mcf ) This is because mcf has more iso-lated cache misses that are misses that cannot be ser-viced at the same time to provide memory level par-allelism Also mcf has branch instructions based onmemory accesses (dependent loads) that miss in thecache hierarchy thereby reducing the number of in-structions that can potentially be committed out-of-order lbm has medium level of memory-boundedness[18] and therefore exposes much more memory paral-lelism as the number of MSHRs is increased Its per-formance improvement is therefore much better whenthe CPU commits instructions out-of-order gcc bench-mark Figure 12 (j) (k) and (l) is one of the leastmemory-bound benchmarks As a result increasing thenumber of MSHRs does not improve its performanceeven in the case of unsafe out-of-order commit Over-all out-of-order commit outperforms in-order commitby exposing additional memory parallelism (both toDRAM and in the hierarchy) In summary in case ofMLP out-of-order commit provides more benefit forbigger architecture and more memory-bound applica-tions

14 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

1 2 3 4 5 6 7 8 9 10123456789

Norm

alize

d Pe

rform

ance

improv

emen

t ()

(a) SPEC2006 SLM

1 2 3 4 5 6 7 8 9 10123456789

(b) SPEC2006 NHM

IOC Safe-AOOC Unsafe-AOOC

1 2 3 4 5 6 7 8 9 10123456789

(c) SPEC2006 HSW

1 2 3 4 5 6 7 8 9 10123456789

(d) mcf SLM

1 2 3 4 5 6 7 8 9 10123456789

(e) mcf NHM

1 2 3 4 5 6 7 8 9 10123456789

(f) mcf HSW

1 2 3 4 5 6 7 8 9 10123456789

(g) lbm SLM

1 2 3 4 5 6 7 8 9 10123456789

(h) lbm NHM

1 2 3 4 5 6 7 8 9 10123456789

(i) lbm HSW

1 2 3 4 5 6 7 8 9 10123456789

(j) gcc SLM

1 2 3 4 5 6 7 8 9 10MSHR size

123456789

(k) gcc NHM

1 2 3 4 5 6 7 8 9 10123456789

(l) gcc HSW

Fig 12 Memory hierarchy parallelism comparison between in-order commit and safe and unsafe out-of-ordercommit in MSHRs size of 1 to 10 The results have been normalized to in-order commit with MSHR size of oneSubgraph (a) (b) and (c) are based on harmonic mean across SPEC CPU2006 The mcf lbm and gcc benchmarksare respectively representative of categories with high medium and low memory boundedness

6 Early Release of Physical Registers vsOut-of-order Commit

Register renaming enables the avoidance of false de-pendencies through the register file thereby improvingcore performance It does this by translating architec-tural registers to larger set of physical registers As thisrenaming is done early in the pipeline and the physi-

cal registers are not released until commit which mayhappen long after the last consumer reads the data thephysical register entries may be kept alive for longerthan what the dependencies alone require To addressthis resource constraint techniques such as delayed al-location and Rarly Release of the Physical Register

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 15

Table 4 The Contribution of exhausted (full) regis-ter file (RF) and reorder buffer (ROB) on CPU stallsOther reasons to the CPU stalls could be the instruc-tion queue (IQ) the load-store queue (LSQ) instruc-tions cache misses or not having a free functional unitto allocate to the ready-to-dispatch instructions

Microarchitecture RF ROB Othermin avg max min avg max min avg max

SLM 6 59 86 13 36 92 0 5 27NHM 2 68 91 8 30 98 0 20 7HSW 2 69 91 8 29 98 0 20 7

(ERPR) [23] have been developed In this section wecompare ERPR with the OOC taxonomy [2]

OOC and ERPR are similar as they release physicalregister as early as possible Aggressive OOC outper-forms ERPR in general because it releases ROB andLSQ entries early as well as physical registers Unfor-tunately neither ERPR nor OOC can release IQ entriesearlier than usual because OOC only address processorstate after the instructions have left the IQ and ERPRis effective before instructions are inserted tot the IQFrom another point of view OOC put additional pres-sure on the IQ and provides free entries in one or all ofthe resource of superscalar processors

61 Analysis of CPU Aggressiveness

CPU aggressiveness analysis in this context is basedon microarchitecture configurations (see Table 2) andcommit-depth The overall trend of Figure 13 showsthat AOOC outperforms ROOC and ERPR in all con-figurations This is not surprising as AOOC is alwaysenabled while ROOC is enabled only when a resourcesis exhausted (see Section 25)

ERPR is essentially a subset of AOOC that only fo-cuses on the register file (RF) Therefore the higher isthe program sensitivity to the RF capacity the higher isthe effect of ERPR To better understand this behaviorwe analyzed the reasons for CPU stalls across differentmicroarchitectures (Table 4) The data show that in-creasing the effective size of RF (eg what ERPR doesconceptually) is less effective in SLM compare to NHMand HSW barbecue is SLM the CPU stalls are moreoften due to other resources such as the ROB size Asa result since in SLM architecture the size of RF isless effective in CPU stalls ROOC outperforms ERPRin terms of performance improvement (ROOC can vir-tually extend the effective size of ROB LSQ as wellas RF although ERPR only does this on RF) RF inNHM and HSW architectures is more effective on CPUstall than in SLM architecture therefore ERPR out-

performs ROOC regarding performance improvementin these two architecture

By focusing only on ERPR we observe that thewider the core the more effective is the is increasingcommit-depth (wider commit-depth covers more readyinstructions to commit) For example in the case ofSLM although widening the commit-depth releases ad-ditional physical registers earlier than general thesefreed registers are not effective anymore because theCPU stalls are due to limits in other resources (likeROB IQ or LSQ) As NHM and HSW have more re-sources increasing the commit-depth is more effectivein those architectures In the case of AOOC increas-ing the commit-depth is more effective than ROOCand ERPR because it enables additional instructionsto be committed out-of-order (because it is active onevery cycle compare to ROOC and also affects all ofthe resources compare to ERPR that only affects theRF) In addition we can see that the more aggressivethe CPU the more effective is increasing the commit-depth This is because wider cores execute and com-plete instructions that are far from the head of ROBearlier with their additional resources(either providedby the baseline or released by the OOOC) and theycan therefore provide more instructions to commit ifthe commit-depth is increased

62 Analysis of Out-of-Order Conditions

Figure 13 shows the potential for performance improve-ment from relaxing the out-of-order commit conditionsacross the three architectures for both safe and unsafeout-of-order-commit As has been seen previously [2 521] the least aggressive out-of-order commit configu-ration (safe_OOC with a commit depth of 4) is mostbeneficial for the least aggressive out-of-order processorSLM (compare the left-most bars of Figure 13(a-c))This is expected as the SLM processor has the least ex-ecution hardware and therefore suffers from more stallsthan the more aggressive processors

Increasing the commit depth from 4 to 8 (secondgroup of bars in Figure 13(a-c)) shows that the more ag-gressive out-of-order processors (NHM and HSW) nowbenefit more from out-of-order commit than the sim-pler SLM architecture This new result shows the inter-action between the aggressiveness of the out-of-orderexecution and out-of-order commit for more aggres-sive out-of-order execution there is more benefit frommore aggressive out-of-order commit The reason is thataggressive architectures appear to have more unusedresources available for computation This trend con-tinues through the more aggressive out-of-order com-mit modes as well with unsafe optimizations (Figure

16 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

4 8 16 320

20406080

100

Perfo

rman

ceim

prov

emen

t () (a) SLM-SAFE

4 8 16 32 128Commit Depth

020406080

100

(b) NHM-SAFEAOOC ROOC ERPR

4 8 16 32 128 1920

20406080

100

(c) HSW-SAFE

4 8 16 320

20406080

100

(d) SLM-UnsafeLD

4 8 16 32 1280

20406080

100

(e) NHM-UnsafeLD

4 8 16 32 128 1920

20406080

100

(f) HSW-UnsafeLD

4 8 16 320

20406080

100

(g) SLM-UnsafeBR

4 8 16 32 1280

20406080

100

(h) NHM-UnsafeBR

4 8 16 32 128 1920

20406080

100

(i) HSW-UnsafeBR

4 8 16 320

20406080

100

(j) SLM-UnsafeST

4 8 16 32 1280

20406080

100

(k) NHM-UnsafeST

4 8 16 32 128 1920

20406080

100

(l) HSW-UnsafeST

4 8 16 320

20406080

100

(m) SLM-UnsafeEXC

4 8 16 32 1280

20406080

100

(n) NHM-UnsafeEXC

4 8 16 32 128 1920

20406080

100

(o) HSW-UnsafeEXC

4 8 16 320

20406080

100

(p) SLM-Unsafe

4 8 16 32 1280

20406080

100

(q) NHM-Unsafe

4 8 16 32 128 1920

20406080

100

(r) HSW-Unsafe

Fig 13 Comparison between AOOC ROOC and ERPR in different commit-depth and microarchitecture setupin six different modes UnsafeXX is equivalent to (enforcing) all out-of-order commit conditions except conditionXX By relaxing the specific XX condition the dependence between other conditions is also observed

13(d-r)) benefiting the more aggressive out-of-order ex-ecution designs (NHM and HSW) more than the sim-pler one (SLM) The reasons for this are similar tothe safe case as the more aggressive processors arefaster to achieve the first out-of-order condition in-struction completion and therefore providing greatercommit depths (eg changing the commit depth from 4to 8 in Figure 13 (a-c)) results in a larger likelihoodof finding instructions to commit out-of-order Table 5

uses this analysis to rank the relative importance ofeach out-of-order commit condition for the three ar-chitectures This information provides a guideline forwhich areas to work on to achieve the best performanceimprovement depending on the baseline out-of-order ex-ecution

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 17

Conditioncommit-depth 4 8 16 32 128 192

Unsafe_LD 1 2 2 2 NA NAUnsafe_BR 2 1 1 1 NA NAUnsafe_ST 4 4 4 4 NA NAUnsafe_EXC 3 3 3 3 NA NA

(a) SLMConditioncommit-depth 4 8 16 32 128 192

Unsafe_LD 1 2 2 2 2 NAUnsafe_BR 3 1 1 1 1 NAUnsafe_ST 4 4 4 4 4 NAUnsafe_EXC 2 3 3 3 3 NA

(b) NHMConditioncommit-depth 4 8 16 32 128 192

Unsafe_LD 4 2 2 2 2 2Unsafe_BR 1 1 1 1 1 1Unsafe_ST 3 3 4 4 4 4Unsafe_EXC 2 4 3 3 3 3

(c) HSW

Table 5 Ranking of the benefits of out-of-order commitconditions in different microarchitecture based on thedepth of commit

7 Related Work

The goal of this work is to provide a detailed under-standing into the potential for performance benefitsacross different out-of-order commit conditions and lev-els of aggressiveness While this work aims to describethe maximum potential benefit for each individual con-dition there have been many previous works that de-scribe hardware solutions for early release of hardwarestructures and out-of-order commit strategiesBelowwe provide an overview of these works and how theyfit into the categories described in this work

71 Speculative Release of Hardware StructuresA number of implementations require register and pro-cessor state checkpointing support to speculatively re-tire or release hardware structures [12 11 22 1 17]

Processor state checkpointing especially with theadvent of very large SIMD registers such as AVX-512registers which now support up to 32 registers with upto 512 bits per register can require a significant amountof state to be saved when speculation is aggressive

72 Non-speculative Structure Release

Non-speculative early release of hardware structures be-fore commit requires knowledge that no older instruc-tion (that has come earlier in the instruction stream)can cause the program to abort raise an exception orrequire exposure of the architected state at that timeNon-speculative solutions have the potential to be themost energy efficient a necessity in an era of the end ofDennard Scaling for power-limited platforms A range

of solutions from hardware-only to software-assistedsolutions are described below

Compiler support Two previous works [1 13]allow for early commit or reclaim of resources basedon compiler knowledge but require software recompi-lation

Checkpoint-free A number of solutions do notuse checkpoints [5 21 13] and therefore do not requirespeculation (they respect the all commit conditions) orused software help to extend knowledge to the hard-ware

Selective early commit Early commit of loads [1415] allows for loads that have not yet received data fromthe memory hierarchy to become part of the commit-ted state of a processor This allows the processor tocontinue to process instructions past normally blockedstructures improving performance for memory-boundworkloads To accomplish this the authors [15] de-couple page faults that can occur from the fetching ofdata While recompilation is not strictly required forthis technique the authors evaluate their work using acompilation strategy to expose loads early Late alloca-tion and early release of physical registers is proposedas a register renaming scheme in [23] They explored theeffect of late allocation and early release of physical reg-ister on the performance of an 8-way superscalar pro-cessor without the need for additional speculation andmaintains precise exceptions The out-of-order commitmethods evaluated in this work overlap with the earlyrelease of registers while also evaluating the early re-lease of other structures (like the LSQ)

Evading out-of-order commit conditions[24] provides a coherency protocol level solution suchthat the loadrarrload reordering can be evaded non spec-ulatively under TSO In their design it is no longer nec-essary to squash and re-execute speculatively reorderedloads if their reordering can be hidden by the coherencyprotocol This allows speculatively reordered loads thatotherwise adhere to the rest of the Bell-Lipasti condi-tions to be committed out-of-order without affectingthe memory model

8 Conclusion

To obtain higher performance extending the reach ofthe processor core has been a focus in much of microar-chitecture research One promising direction is the useof out-of-order commit which releases precious proces-sor resources early to allow the processor to extend itsreach past typical hardware limits In this work wepresent a limit study for out-of-order commit throughthe introduction of reluctant and aggressive out-of-order

18 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

commit modes We show how smaller processors evenwith a limited commit scan depth can benefit fromout-of-order commit strategies but that larger aggres-sive cores require deeper commit scan depths to achieveimproved performance In addition we provide a de-tailed breakdown of the contributions for each out-of-order commit condition for the SPEC CPU2006 bench-mark suite and compare against similar works suchas early release of physical registers Our results showa very high potential for performance improvementabove 225x for some benchmarks and believe that out-of-order commit strategies can play an important rolefor future energy-efficient and high-performance proces-sor designs

References

1 Afram F Zeng H Ghose K A group-commitmechanism for ROB-based processors implement-ing the x86 ISA In HPCA pp 47ndash58 (2013)

2 Alipour M Carlson TE Kaxiras S Exploringthe performance limits of out-of-order commit InCF pp 211ndash220 (2017)

3 Allan A Edenfeld D Joyner Jr WH KahngAB Rodgers M Zorian Y 2001 technologyroadmap for semiconductors Computer 35(1) 42ndash53 (2002)

4 Badalone R Dramrsquos surprising role in the costof data centers httpwwwdatacenterknowledgecomarchives20151112dont (2015)

5 Bell GB Lipasti MH Deconstructing commitIn ISPASS pp 68ndash77 (2004)

6 Binkert N Beckmann B Black G ReinhardtSK Saidi A Basu A Hestness J Hower DRKrishna T Sardashti S Sen R Sewell KShoaib M Vaish N Hill MD Wood DA Thegem5 simulator SIGARCH Comput Archit News39(2) 1ndash7 (2011)

7 Carlson TE Heirman W Allam O Kaxiras SEeckhout L The load slice core microarchitectureIn ISCA pp 272ndash284 (2015)

8 Chou Y Fahs B Abraham S Microarchitec-ture optimizations for exploiting memory-level par-allelism In ISCA pp 76ndash88 (2004)

9 Corporation I Intel Rcopy 64 and ia-32 ar-chitectures optimization reference manualhttpwwwintelcomcontentwwwusenarchitecture-and-technology64-ia-32-architectures-optimization-manualhtml(2016)

10 Corporation I Intel Rcopy intelrsquos rsquotick-tockrsquo seeminglydead becomes rsquoprocess-architecture-optimizationrsquohttpwwwanandtechcomshow10183

intels-tick-tock-seemingly-dead-becomes-process-architecture-optimization (2016)

11 Cristal A Ortega D Llosa J Valero M Out-of-order commit processors In HPCA pp 48ndash59(2004)

12 Cristal A Santana OJ Valero M MartiacutenezJF Toward kilo-instruction processors ACMTrans Archit Code Optim 1(4) 389ndash417 (2004)

13 Duong N Veidenbaum AV Compiler-assistedselective out-of-order commit IEEE Computer Ar-chitecture Letters 12(1) 21ndash24 (2013)

14 Gwennap L Digital leads the pack with 21164 InMicroprocessor Report 8(12) pp 249ndash260 (1994)

15 Ham T Aragoacuten JL Martonosi M Desc De-coupled supply-compute communication manage-ment for heterogeneous architectures In MICROpp 191ndash203 (2015)

16 Henning JL SPEC CPU2006 benchmark descrip-tions SIGARCH Comput Archit News 34(4) 1ndash17 (2006)

17 Hilton A Roth A BOLT Energy-efficient out-of-order latency-tolerant execution In HPCA pp1ndash12 (2010)

18 Jaleel A Memory characterization of workloadsusing instrumentation driven simulation httpwwwglueumdeduajaleelworkload (2010)

19 Kroft D Lockup-free instruction fetchprefetchcache organization In ISCA pp 81ndash87 (1981)

20 Lee D Kim Y Seshadri V Liu J Subrama-nian L Mutlu O Tiered-latency dram A lowlatency and low cost dram architecture In HPCApp 615ndash626 (2013)

21 Marti S Borras J Rodriguez P Tena RMarin J A complexity-effective out-of-order re-tirement microarchitecture IEEE Transactions onComputers 58(12) 1626ndash1639 (2009)

22 Martinez JF Renau J Huang MC PrvulovicM Cherry Checkpointed early resource recyclingin out-of-order microprocessors In MICRO pp3ndash14 (2002)

23 Monreal T Vinals V Gonzalez J GonzalezA Valero M Late allocation and early release ofphysical registers IEEE Trans Comput 53(10)1244ndash1259 (2004)

24 Ros A Carlson TE Alipour M Kaxiras SNon-speculative load-load reordering in TSO InISCA pp 187ndash200 (2017)

25 Smith JE Pleszkun AR Implementing preciseinterrupts in pipelined processors IEEE Transac-tions on Computers 37(5) 562ndash573 (1988)

26 Sohi GS Vajapeyam S Instruction issue logicfor high-performance interruptable pipelined pro-cessors In ISCA pp 27ndash34 (1987)

  • Introduction
  • Out-of-Order Commit
  • Methodology
  • Out-of-Order Commit Evaluation
  • Memory Parallelism
  • Early Release of Physical Registers vs Out-of-order Commit
  • Related Work
  • Conclusion
Page 3: Maximizing Limited Resources: A Limit-based Study and ...tcarlson/pdfs/alipour2018mlralsatoo… · Maximizing Limited Resources: A Limit-based Study and Taxonomy of Out-of-order Commit

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 3

14 Contribution

We study safemdashnon-speculativemdash out-of-order commit(refraining from committing until all conditions are met)and unsafemdashspeculativemdash out-of-order commit (com-mitting even before all conditions are met assuming thepotential for correct recovery) In addition to studyingboth safe and unsafe out-of-order commit as a minimumand maximum potential performance improvement wealso study out-of-order commit in two additional di-mensions aggressiveness and degree of speculation

ndash Aggressiveness A dimension that determines thepotential benefit is the aggressiveness of the out-of-order-commit implementation how far into the in-struction window we try to commit Prior work [22]examines this aggressiveness in their implementa-tion of checkpoint-based out-of-order commit wherethey show a middle-ground in aggressiveness is neededto mitigate checkpoint-based penalties while stillimproving performanceIn addition work on non-speculative out-of-ordercommit [5] has recognized that a restricted commitwindow can achieve a good portion of the perfor-mance improvements compared to an unrestrictedunlimited out-of-order commit implementation

ndash Degree of speculation Apart from the aggres-siveness of commit we also consider varying degreesof speculation The insight of this study is that selec-tive speculation preserving or relaxing of one moreconditions necessary for out-of-order commit couldlead to more efficient implementations In fact re-cent work [15 24] has shown that this might be thecase for load instructions

15 Results

This study aims to provide a guidepost for each majorparameter of out-of-order commit to provide an upperbound on the performance benefits given a particularaggressiveness and degree of selective speculation Inthis work we show that

ndash there is significant untapped potential for unsafeout-of-order commit beyond traditional in-order andsafe out-of-order commit the gap widens in morepowerful architectures

ndash for energy-efficient cores with moderately-sized in-struction windows reluctant1 limited out-of-order

1 Reluctant out-of-order commit continues normal in-oredrcommit and is only enabled when the hardware cannot continueto make forward progress This mode switching happens earlyenough which is a cycle before the related queue (ROB IQLSQ) is full so that the CPU stall is avoided

commit is sufficient to reap the most benefit whileaggressive versions of out-of-order commit becomea requirement for larger cores

ndash the order of importance of the commit conditionschanges depending on the type of application thearchitecture (limited or aggressive OoO processor)and the aggressiveness of out-of-order commit (com-mit depth)

ndash we unexpectedly found that a focus on specific out-of-order commit conditions could be an importantfuture direction for high-performance efficient out-of-order processors

ndash the potential benefits of out-of-order commit increaseswith memory latency (relatively more for unsafe)while the benefits of the prefetching strategy thatwe picked are orthogonal to out-of-order commitbenefits This raises the enticing possibility of re-ducing system-wide silicon financial cost withoutcompromising performance by coupling dense buthigher-latency (slow-but-efficient) DRAM with out-of-order commit cores

ndash out-of-order commit increases memory hierarchy par-allelism [7]

ndash While it is generally acceptable that by releasingpipeline resources as early as possible out-of-ordercommit improves performance in minor and smallcores relatively more than in large cores in this workwe show that this is only true for reluctant out-of-order commit In fact performance improvement inlarge out-of-order cores can exceed that of smallercores if aggressive out-of-order commit is employed

ndash Our results show the potential for future systemsthat implement out-of-order commit and indicatewhich are the most promising directions (safe vsunsafe commit and which of Bell and Lipastirsquos con-ditions [5] are most important to support) for futuredesigns

The rest of this paper is organized as follows InSection 2 we first provide an overview of the conditionsthat need to be honored for in-order commit as wellas provide an overview of out-of-order commit Nextin Section 3 we present our evaluation methodologyand simulated system configurations Section 4 pro-vides a detailed performance analysis for each out-of-order commit condition both from the point of view ofaggressive and reluctant out-of-order commit In Sec-tion 5 and Section 6 we compare the early releaseof physical register and memory hierarchy parallelismrespectively with Out-of-Order commit extending pre-vious work [2] Finally the related works and conclusionare presented in Section 7 and 8 respectively

4 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

2 Out-of-Order Commit

The introduction of the reorder buffer (ROB) to pro-vide in-order commit in an out-of-order scheduled su-perscalar pipeline was an important advance for com-puter architecture culminating an effort that startedwith Tomasulorsquos algorithm and included techniques suchas reservation stations and the register update unit(RUU) [26] The reorder buffer maintains precise ar-chitectural state in the presence of interrupts unknownmemory dependencies or memory re-orderings that canperturb the ordering required by a memory consistencymodel Out-of-order commit on the other hand at-tempts to break this rigid updating of the architecturalstate either in a safe way (ie one that does not requireadditional speculation and rollback to revert changes tothe architectural state) or in an unsafe way (ie onethat does)

21 Safe vs Unsafe OOC

A turning point in our understanding of out-of-ordercommit came with the work of Bell and Lipasti [5] inthe form of a number of limiting conditions for safe out-of-order commit The necessary conditions to allow aninstruction to be committed out-of-order are1 The instruction is completed instructions can com-

mit only after their completion2 The instruction is not involved in memory replay

traps This condition simply says that we cannotcommit speculative loads or their dependent instruc-tions This condition relates to unresolved memorydependencies or memory consistency enforcementFor example total store order (TSO) requires a re-play of speculative loads that violate loadrarrload or-dering when this reordering is detected by othercores

3 Register WAR hazards are resolved (ie a write toa particular register cannot be permitted to commitbefore all prior reads of that architectural registerhave been completed)

4 Previous branches are successfully predicted Thiscondition simply says that we can commit only whileon the correct path of execution

5 No prior instruction in program order is going toraise an exception This condition provides preciseinterrupts and is essential in easing handling of egpage faultsIn the rest of this paper we will use the following

convention to discuss how we evade the BellLipasticonditions1 Safe_OOC where all out-of-order-commit condi-

tions are preserved This case provides the minimum

potential performance improvement of out-of-ordercommit but also the minimum hardware to imple-ment as it does not rely on speculation and rollbackbeyond what is already available in the out-of-ordercore

2 Unsafe_OOC where one or more (or all) of theout-of-order commit conditions are evaded (apartfrom true dependencies) Doing so the maximumpotential performance improvement of out-of-ordercommit is evaluated but this may require extra sup-port for speculation and rollback to be able to revertchanges in the architectural state that were foundto be incorrect after the commit

22 Reluctant vs Aggressive OOC

Aside from the limiting conditions described above aseparate dimension is the aggressiveness of committingout-of-order Thus concerning the mechanics of out-of-order commit we distinguish two versions

1 Reluctant out-of-order-commit (ROOC) where theout-of-order commit mechanisms are engaged onlywhen needed and

2 Aggressive out-of-order-commit (AOOC) wherethe out-of-order commit mechanisms are continu-ously active looking for opportunities to commitinstructions as early as possible

While the always on nature of the aggressive out-of-order commit is obvious in its meaning the when neededof the reluctant out-of-order-commit requires clarifica-tion For this paper reluctant out-of-order commit isengaged only when the core is in imminent danger ofgoing into a complete stall In other words we engagereluctant out-of-order commit only when one of the crit-ical resources (ROB entries registers load-store queueentries) is all but exhausted and cannot support thefetching of new instructions in the front end of thepipeline As such reluctant out-of-order commit acts asa safety valve to release the pressure on resources (justbefore this pressure reaches a critical point) ratherthan aggressively trying to keep resource pressure low

In contrast aggressive out-of-order commit releasesresources more eagerly but disregards the following is-sues

1 it might prove wasteful as traditional in-order com-mit may still be able to provide sufficient resourcesfor forward progress

2 it may be futile as the chances of encountering aninstruction that restricts further commit (eg anunresolved branch) tends to increase with aggres-siveness

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 5

3 it creates a significant management problem as out-of-order commit can create gaps in several struc-tures including the ROB and also the load queueand store queue (which is not completely addressedin prior works [21 22 5])

Commit Width and DepthThe final parameter to explore within the context

of out-of-order commit is the commit depth to scanfor potential instructions to commit out-of-order Whilecommit width is the number of instructions that can becommitted simultaneously per cycle the commit depthis the measure of how far the core can scan looking forinstructions to commit out-of-order in a given cycle

23 OOC Conditions

In Section 3 we describe the methodology used to com-mit instructions out-of-order With this methodologywe can selectively relax the commit conditions describedin Subsection 21 (except completion) while still guar-anteeing correct execution For example by relaxingthe branch condition (Unsafe_BR) ldquocommitting onlydown a non-speculative pathrdquo we can continue to freeresources past unresolved branches but effectively onlycommit from the correct path In this paper we donot evaluate the implementation required to relax theseconditions but instead evaluate the potential for per-formance improvement We evaluate the maximum po-tential performance improvement with a speculative out-of-order rollback cost of zero In this section we providedetails on the performance implications of the relax-ation of a single and combination of conditions

231 Instruction complete

The core waits for an instruction to finish executing be-fore commit can occur We do not examine early com-mit of loads [14 15] that miss in the cache and insteadwe consider them available for commit only after thedata returns and is bound to the destination register

232 Memory replay traps (safe_ST and safe_LD)

We describe two sub-cases for this conditionStore-Load (safe_ST) This condition applies to

same-thread memory dependencies involving a store anda load In particular we cannot commit a load out-of-order in the presence of a prior store with an un-resolved address If the store and the load prove tobe dependent (the load should have taken the valueof the store) the commit would have been incorrectThe LD condition disallows the commit of a load and

its dependent instructions until all prior stores resolvetheir addresses and all the memory dependencies arecorrectly enforced By relaxing this condition we cancommit loads and their dependent instructions even ifprior non-aliasing stores have unresolved addresses

Load-Load (safe_LD) This concerns memory con-sistency models that enforce loadrarrload ordering (egSequential Consistency or TSO) Under this orderingconstraint it is possible to allow loads out-of-order aslong as this is not observed in the memory system Thesafe_LD condition disallows the out-of-order commitof loads unless it is guaranteed that the correct or-der will be observed by the memory system To relaxthis condition we allow loadrarrload re-orderings that arenot observed by other cores A very specific case wouldbe a memory mapped IO (MMIO) request that mightchange the order of memory operations The MMIOcase acts as a rsquocoprocessorrsquo meaning that we have amulti-processor system here We ignore memory requestsfrom other cores (IO coprocessor)

233 WAR hazards

WAR hazards are already handled by the out-of-ordercore within the ROB and we assume a solution suchas the Value Buffer [21] for committing out-of-orderThus we do not consider this condition further

234 Unresolved Branches (safe_BR)

This condition guarantees that we commit only fromthe correct path of execution Out-of-order commit shouldnot proceed past unresolved branches until they are cor-rectly resolved We can relax this condition and commitpast an unresolved branch if we are able to undo thecommit To evaluate maximum performance potentialwe assume a zero rollback cost for out-of-order commitmisspredictions However the normal branch mispre-diction cost (10 cycles) is faithfully accounted (see Sub-section 49 for more details) In addition we evaluatethe rollback count for this condition in Subsection 49

235 Exceptions (safe_EXC)

This condition caters to precise interrupts Enforcingthis condition requires that we do not commit past aninstruction (floating-point memory access or any in-struction that may cause an exception) unless we makesure that the instruction will not cause a exception Torelax this safe_EXC condition we assume the code re-gions are exception free

6 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

Fig 2 Conceptual comparison of in-order and out-of-order commit (commit depth=4) C Ready to commitX Not ready to commit E Executing I Issued OEmpty entry

Fig 3 Functionality of AOOC and ROOC In thisexample AOOC commits one more instruction thanROOC (commit depth=4) C Ready to commit XNot ready to commit E Executing I Issued OEmpty entry

24 Safe and Unsafe Out-of-Order Commit

Normally discussion of out-of-order commit tends tofocus on the changes that occur in the back end ofthe pipeline However the purpose of out-of-order com-mit is to enable the front end to proceed The per-ception that out-of-order commit increases performancecan also be misleading instruction execution is not spedup it is the removal of conditions that stall the frontend that increases performance More specifically inan unconstrained architecture (unrestricted ROB reg-isters and loadstore queue entries) safe out-of-ordercommit does not perform faster than in-order commitFor restricted real-world implementations Safe_OOChelps to ameliorate this problem

In contrast Unsafe_OOC has the capacity to ex-ceed the performance of an unconstrained in-order com-mit architecture as it removes penalties needed to guar-antee correctness (eg correctly handling memory de-

pendencies correctly enforcing memory consistency or-dering etc) in situations that the hardware does nothave any other means of imposing such correctness Inthis case Unsafe_OOC has the potential to violate cor-rectness and requires a means to revert back to a safestate if a violation occurs This can be a good trade-offwhen the conditions that violate correctness are rareWe implement an oracle version of Unsafe_OOC inthat it violates the correctness conditions because itwill be safe to do so (and therefore no correctness prob-lems will occur) This removes the need to provide themechanisms to revert to a safe state in the case of a mis-speculation This model provides a best-case speedup asrecovery from misspeculation is zero cost

25 Aggressive and Reluctant OOC

Orthogonal to the enforcement of the out-of-order com-mit conditions the aggressiveness of out-of-order com-mit plays a significant role in the resulting performanceand the cost it incurs We introduced two approachesfor out-of-order commit Aggressive (AOOC) and Re-luctant (ROOC) out-of-order commit A good way todescribe them is to contrast their main difference howoften each mechanism is engaged

AOOC is engaged all the time ie it constantlytries to find instructions to commit out-of-order if theopportunity arises In this respect it aims to increasecommit bandwidth

In contrast ROOC is only engaged when one of thecritical resources in the core (ROB entries registersLoadstore queue entries) is about to be exhaustedROOC is concerned about front end stalls not aboutcommit bandwidth This means that the number of in-structions that ROOC needs to find that can commitout-of-order is limited ROOC needs to provide enoughfree entries in the ROBregistersLSQ so that the frontend can dispatch as many instructions as possible (up tothe dispatch width) to the out-of-order engine Figure 2shows this with an example The reason that the com-mit stage of an in-order commit is blocked is becauseof an unresolved instruction at the head of ROB Forexample given a four-wide superscalar shown in Fig-ure 2 tries to commit four instructions per cycle Whilethe first instruction at the head of ROB can committhe second oldest instruction causes the commit stageto block and instead of four only one instruction com-mits (in-order)

The behavior of out-of-order commit depends on itsaggressiveness

AOOC AOOC attempts to find up to commit-width instructions so it can satisfy the need to commit

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 7

Table 1 Baseline core parameters

Full system simulation 34 GHz x86-64L1id 32KiB 8-way 4clkL2 256KiB 8-way 12clkL3 1MiB 8-way 36clkDRAM 200clkBranch Predictor Tournament front end penalty 10clkPrefetcher OFF(default)ON Stride degree 8

Table 2 Microarchitecture configuration with reorderbuffer (ROB) instruction queue (IQ) load and storequeues (LQSQ) and integer and floating-point regis-ter file (RF) details Dispatch width (D) commit width(CW) and Out-of-Order commit depth (CD) is thesame to enable a fair comparison (SLM hardware hasa DCW of 2) Register values are additional physicalregisters above architectural state

Microarchitecture DCWCD ROB IQ LQSQ RF(INTFP)Silvermont-like (SLM) 448 32 32 1016 3232Nehalem-like (NHM) 448 128 56 4836 6868Haswell-like (HSW) 448 192 60 7242 130130

at the highest possible commit bandwidth In our ex-ample it aims to find four instructions However sinceonly one more is needed to fill the gap in the head groupof four leading instructions AOOC produces an excessof commit-ready instructions that can potentially beexploited in the near future

ROOC ROOC on the other hand aims to findthe minimum number of instructions needed to commit(out-of-order) so that no resource is exhausted and thefront end can continue to issue instructions at its peakbandwidth The reason for seeking the minimum num-ber of instructions to commit out-of-order is that thisminimizes the perturbations in instruction order Thiscould potentially lead to more efficient hardware im-plementations While Figure 3 only considers the ROBin the general case all dispatch-related resources areconsidered in the same way In our particular exam-ple ROOC needs to find just one instruction to com-mit out-of-order The ROB already contains two emptyslots and the in-order commit mechanisms can find oneinstruction to commit leaving one left for ROOC

In contrast in AOOC mode there are in total threeinstructions that commit (instructions 1 3 and 4in Figure 3) The result is that AOOC needs to scandeeper than ROOC

3 Methodology

We use the gem5 [6] simulator in full system mode tosimulate an x86-64 target with a frequency of 34GHzTo test our models we use the SPEC CPU2006 bench-mark suite We use ten uniformly distributed check-points from each benchmark Each simulation check-

point has three phases that begin with 250M instruc-tions of cache warm-up followed by 100k instructions ofdetailed pipeline warm-up and ends with a detailed sim-ulation of 100M instructions Furthermore three dif-ferent configurations similar to three conventional out-of-order processors are used Intelrsquos SLM NHM andHSW [9] microarchitectures Tables 1 and 2 list the de-tailed configuration of the simulation environment

To implement our out-of-order commit model wefirst configure a simulated machine with a very largenumber of core resources We then monitor the numberof committed and non-committed instructions that ap-pear in the pipeline and control at dispatch whetherwe can support additional instructions in the back endof the processor In this way we dynamically determinethe resource availability in the processor for each cyclebased on the out-of-order commit conditions

4 Out-of-Order Commit Evaluation

In this section we analyze the benefits of out-of-ordercommit on the performance of a number of applica-tions We look at how the commit bandwidth changeswith out-of-order commit and how the effective re-source size of each critical component changes as weenable different out-of-order commit conditions Nextwe show how out-of-order commit conditions affect per-formance Safe_OOC and Unsafe_OOC are two ex-treme points that are defined by either enabling (re-specting) or disabling all of the out-of-order commitconditions This results in a minimum and maximumpotential performance improvement across the bench-mark suite To understand the effect of each conditionin isolation we study the effect of each one both in thepresence and absence of other conditions These stud-ies on Safe_OOC and Unsafe_OOC conditions wereconducted for both Aggressive (AOOC) and Reluctant(ROOC) out-of-order commit

41 Microarchitecture Aggressiveness

We target three microarchitectures resembling IntelrsquosSilvermont (SLM) Nehalem (NHM) and Haswell (HSW)as small medium and large cores (See Table 2 for de-tails) As an overview Figure 1 shows the performanceimprovement for each microarchitecture assuming Safe-OOC (all conditions respected) for all benchmarks onaverage across SPEC CPU2006 [16] We can see thatin the case of narrow commit depth (four in this fig-ure) a relatively small out-of-order processor (SLM)has more potential for relative improvement comparedto the medium and aggressive microarchitectures The

8 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

SLM SLMOOC

NHMNHMOOC

HSWHSWOOC

20406080

100

Cycle

s ()

4 committed3 committed

2 committed1 committed

0 committed

Fig 4 Commit bandwidth distribution for the SPECCPU2006 benchmarks of a 4-wide core with in-ordercommit and out-of-order commit respecting all commitconditions (Safe_OOC) Out-of-order commit increasescommit pressure even without aggressive speculation

reason is that the smaller processor (with a shorter in-struction reach and given a balanced design smallerhardware structures) will more likely stall as it exposesa smaller amount of the potential ILP in an applica-tion In case of a larger commit depth the more ag-gressive cores (NHM and HSW) have higher potentialperformance improvement (See Section 44 for a de-tailed overview) Out-of-order commit frees the proces-sor from the traditional limits reducing the number oftimes the processor experiences exhausted resources Inmedium and large aggressive cores thanks to a largerROB as well as other hardware resources more intrinsicILP is extracted by traditional in-order commit leav-ing less potential for out-of-order commit with a narrowcommit depth

42 The Effect on Commit Bandwidth

Figure 4 shows the distribution of the number of com-mitted instructions per cycle for three different microar-chitectures for both in-order and out-of-order commitacross all SPEC CPU2006 benchmarks Although an in-order commit 4-wide commit HSW microarchitecturecan retire up to four instructions per cycle this occurson average less than 20 of the time In practice wesee a large number of commit stage stalls (zero instruc-tions committed per cycle) For out-of-order committhe distribution shifts toward four instructions per cy-cle reflecting the improved commit performance (andthe resulting improvement in overall performance) Fi-nally this figure shows that for smaller less aggressivecores (such as SLM see Table 2) out-of-order commitprovides a relatively larger improvement compared tothe other microarchitectures because of the short com-

LQ SQ IQ RF ROB00

05

10

15

20

Norm

alize

d effective siz

e

IOC Safe_OOC Unsafe_OOC

Fig 5 Effective resource use in in-order and Aggressiveout-of-order commit across SPEC CPU2006

mit depth of four Larger cores improve more with moreaggressive commit depths and out-of-order commit pa-rameters (see section 44 for details)

43 The Effect on Resources

One of the main issues that out-of-order commit canhelp resolve is the early release (and subsequent reuse)of hardware resources that otherwise would still be re-quired to maintain the in-order state in an in-ordercommit processor Therefore with a small ROB (andother appropriately sized structures) given in-order com-mit the core can more easily stall when it runs out ofresources For example consider a ROB size of 32 froma SLM microarchitecture When the ROB is full it con-tains 32 micro-operations in flight and the CPUrsquos backend will no longer accept additional micro-operationsfrom the front end Increasing the physical size of theROB along with other resources is one potential butrather expensive solution to this issue An alternativesolution and one of the benefits of using out-of-ordercommit is the early release of resources This early re-lease increases the effective size of a resource (comparedto an in-order commit core) improving performance ofthe core

Benchmark specific analysis The xalan bench-mark is one of the top 5 benchmarks with a large num-ber of CPU stalls caused by exhausted resources BothAOOC and ROOC are very effective for the xalan bench-mark OOC is able to provide additional free entries inthe ROB RF and LSQ (see Figure 7)

On the other hand leslie3d has the lowest num-ber of CPU stalls based on exhausted resources whichlimits the potential for improvement with OOC

Figure 5 compares the effective size of the SLM mi-croarchitecture between in-order commit and both safeand unsafe aggressive out-of-order commit (AOOC) mod-els for the SPEC CPU2006 benchmarks The resultsare normalized to a fixed size SLM_IOC microarchitec-ture (see Table 2) An effective size of 10 translates tofrequent stalls due to resource exhaustion while sizes

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 9

4 8 16 32 128192

(a) SLM

20

40

60

80

100

120

Perfo

rmance

improvem

ent ()

4 8 16 32 128192

Commit depth

(b) NHM

20

40

60

80

100

120

Safe_OOC Unsafe_OOC

4 8 16 32 128192

(c) HSW

20

40

60

80

100

120

Fig 6 Effect of commit depth on performance improve-ment normalized to their respective in-order commitperformance across SPEC CPU2006 The smaller corereceives less benefit from increasing the depth of com-mit Safe_OOC saturates at a commit depth of 8 16and 32 for SLM NHM and HSW respectively (the dis-tance to the first unresolved instruction from the headof the ROB) There is no saturation in performance im-provement of Unsafe_OOC while the commit depth isincreased

greater than 10 shows the effective resource size in-creases due to OOC

For aggressive out-of-order commit the larger effec-tive sizes show the reach of this technique We see thatthe utilization of all structures except for the ROB isalmost the same for safe and unsafe out-of-order com-mit which allows Safe_OOC to achieve most of theperformance of the unsafe version Nevertheless the un-safe core is much better at improving the reach of theROB allowing applications like hmmer continue to showa benefit when moving from safe to unsafe out-of-ordercommit See Section 452 for more details on hmmer

44 Evaluation of Commit Depth

To gauge the effect of the commit depth (ie how far wescan the ROB to find instructions that can commit out-of-order) we impose a hard limit on it and evaluate theeffects on the resulting performance The strictest limitis the commit-width itself starting to commit out-of-order from the first commit-width instructions We thenrelax this to the immediate vicinity (eg double thecommit-width) and progressively relax until we reachthe size of the ROB

In Figure 6 we see that a commit depth of 4 (equalto commit width) provides the smallest benefit but alsothe smallest difference between Safe_OOC and Un-safe_OOC In addition the SLM core benefits morefrom OOC compared to NHM and HSW when com-mit depth is smaller than 8 At a commit depth of 8and above the large aggressive cores benefit much more

than the smaller core type with the maximum improve-ment of Unsafe_OOC at 80 119 and 129 for SLMNHM and HSW respectively For aggressive cores thelarger commit depth allows for continued performanceimprovements and will be necessary for the design of abalanced aggressive out-of-order commit design

45 Out-of-Order Commit Performance

In this section we analyze the out-of-order commit con-ditions to determine the minimum (and maximum) per-formance improvement potential The minimum andmaximum improvement is provided by Safe_OOC andUnsafe_OOC respectively In Figure 7 we show theamount of improvement provided by both safe and un-safe out-of-order commit for both aggressive and reluc-tant modes for all three microarchitectures

451 Safe_OOC

Honoring all out-of-order commit conditions results in amodest performance improvement One implication forfuture processors is that Safe_OOC does not requireadditional support for speculative out-of-order rollbackrecovery mechanisms it requires support for the com-mit of instructions out-of-order and the freeing of struc-tures for use by future instructions

Safe_AOOC When Safe_AOOC is evaluated onthe SLM microarchitecture (see Figure 7a) the rangeof improvement spans from a low of 3 (leslie3d) upto 82 (xalan) with an average of 44 In the NHMmicroarchitecture the improvement ranges from 9 to108 for milc and xalan with an average improvementof 57 and for HSW we see an improvement of 1for mcf to 110 for wrf with an average of 55 (SeeSection 44 for more details)

Safe_ROOC For Safe_ROOC the performanceimprovement is lower for all three microarchitecturescompared to AOOC (See Subsection 25 for additionaldetails) In the SLM microarchitecture the range ofperformance improvement of safe ROOC is from 2 to55 for calculix and gcc respectively with an averageimprovement of 20 In the NHM microarchitecturewe see performance improvements that range from 1(mcf) to 22 (bwaves) with an average improvement of7 (8 for HSW) Because NHM and HSM have fewerCPU stalls ROOC has less of an effect on performancecompared with the smaller more efficient core

452 Unsafe_OOC

By relaxing all conditions Unsafe_OOC provides themaximum potential for performance improvement Un-

10 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

020406080

100120140

Perfo

rman

ceim

prov

emen

t () a) AOOC across SpecCPU

SLM_UnsafeSLM_Safe

NHM_UnsafeNHM_Safe

HSW_UnsafeHSW_Safe

astarbwaves

bzip cactusadm

calculixdealii

gamessgcc gemsfdtd

gobmkgromacs

h264refhmmer

lbm leslie3dlibquantum

mcf milc namdomnetpp

perl povraysjeng

soplexsphinx3

tontowrf xalan

zeusmpSpecCPU-avg

020406080

100120140

b) ROOC across SpecCPU

Fig 7 IPC improvement of safe and unsafe out-of-order commit relative to in-order commit as a baseline for bothreluctant and aggressive versions applied on SPEC CPU2006 benchmarks on three microarchitectures

safe_OOC will require recovery mechanisms for thesetechniques which can reduce the performance potentialbecause of recovery costs

To understand the effectiveness of all conditions to-gether we consider zero cost for recovery for any mis-speculated out-of-order commit condition

Unsafe_AOOC This technique provides the high-est performance improvement for all three different ar-chitectures In SLM cores the improvement ranges from9 to 120 for libquantum and astar applicationsrespectively and the average is 59 In NHM archi-tecture the average is 72 and the range is betweenmilc and astar respectively with 11 and 196 im-provement In HSW cores similar to NHM cores astarhas the maximum benefit from Unsafe_ooc with 192improvement while the minimum improvement is fordealii at 7 The average improvement for this classof architecture is 70

Unsafe_ROOC Reluctant out-of-order commit islower performing because it is not continuously lookingto commit additional instructions (See Section 25 fordetails) In the case of SLM the improvement rangesfrom 3 to 93 respectively for leselie3d and xalanwith an average of a 38 improvement In NHM mcfand bwaves with 1 and 22 show the maximum andthe minimum improvement with an average of 7 (HSWis similar with an average of 8) Between the threemicroarchitectures the limited SLM benefits the mostfrom ROOC because of the large number of stalls seenby this core Therefore ROOC especially Unsafe_ROOCis an interesting methodology to improve the perfor-mance of relatively small but energy efficient CPUs aswe see a relatively high performance improvement fora less aggressive commit implementation

Benchmark-specific Analysis The hmmer bench-mark is a particularly strong case for the benefits of out-of-order commit for SLM This application is L1-cacheresident and exhibits very few last-level cache missesNevertheless we still see a very strong improvement inperformance from 33 to 52 for Safe_OOC increas-ing to 51 to 66 for Unsafe_OOC Looking aheadto Figure 8 and Figure 11 we can see that evadingthe branch condition provides the most benefit for thisapplication Making room for additional instructions toallow the hardware to expose additional ILP works welleven for those applications without a significant num-ber of LLC misses The mcf benchmark contains load-dependent branches has the highest MPKI (misses perkilo instructions) among the benchmarks and thereforeit has a rather low IPC when it is executed on an in-order commit CPU This results in a good opportunityto improve performance as it is extremely limited bythese misses

46 Performance Effects of Commit Conditions

In the previous section by analyzing safe and unsafeout-of-order commit we observe that there is a largegap between the performance improvement of these twoimplementations Understanding the cause of this per-formance improvement (by looking at individual com-mit conditions in isolation) allows us to better under-stand where to focus future hardware efforts

461 Positive Contribution of Out-of-Order CommitConditions

To study the gap between safe and unsafe out-of-ordercommit (Figure 7) we analyze the effect of relaxing

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 11

020406080

100

Contrib

ution in

improv

emen

t () a) SLM-AOOC across SpecCPU

Safe Unsafe_LD Unsafe_ST Unsafe_BR Unsafe_EXC

020406080

100b) AOOC SpecCPU average

astarbwaves

bzip cactusadm

calculixdealii

gamessgcc gemsfdtd

gobmkgromacs

h264refhmmer

lbm leslie3dlibquantum

mcf milc namdomnetpp

perl povraysjeng

soplexsphinx3

tontowrf xalan

zeusmp

020406080

100c) SLM-ROOC across SpecCPU

SLM-avgNHM-avg

HSW-avg

020406080

100d) ROOC SpecCPU average

Fig 8 Contribution of safe and selectively unsafe out-of-order commit on three different microarchitecturesUnsafe_XX is equivalent to activating (enforcing) all out-of-order commit conditions except XX (the XX conditionis relaxed) By relaxing the specific XX condition the dependence between other conditions is also observed

each condition in the presence of the other preservedconditions in Figure 8 We analyze the SLM microar-chitecture in detail and provide averages across all mi-croarchitectures for both AOOC and ROOC Each out-of-order commit condition in analyzed in isolation andwe consider Unsafe_OOC (all relaxed conditions) asthe 100 potential improvement budget In the case ofthe mcf benchmark in Figure 7a the safe and unsafeOOC performance improvement is 33 and 71 re-spectively (46 of the potential improvement budget isprovided by Safe_OOC) We also observe that by re-laxing the LD condition (unsafe_LD) 52 of potentialimprovement budget is achievable (see Figure 8a) InFigure 8 we can see in some applications (like namd inAOOC mode and leslie3d in ROOC mode) that re-laxing just a single condition is not sufficient to fill thegap between safe and unsafe OOC This does not meanthat a single condition is not important but rather thatother preserved conditions are preventing out-of-ordercommit from achieving its full potential

AOOC We observe that for most of applicationsUnsafe_BR and Unsafe_LD are the most interestingconditions (Figure 8a) Additionally the more aggres-sive the core the more important the Unsafe_LD con-dition becomes In SLM NHM and HSW CPUs Un-safe_LD respectively fills 4 10 and 12 and Un-safe_BR fills 9 8 and 7 of the gap between safeand unsafe OOC Unsafe_ST is not very effective be-cause of the rarity of this condition and the conserva-tive memory dependence predictor used Unsafe_EXCor relaxing exceptions are not that effective becausethey are very rare especially in integer benchmarks

ROOC Relaxing OOC conditions in ROOC hasless effect in reducing the gap between safe and un-safe OOC and this is because of the nature of this on-demand OOC mode which is enabled and needed more

in SLM and much less in NHM and HSW (see Fig-ure 7b)

Benchmark-specific Analysis The astar andgobmk benchmarks have the highest number of mis-predicted branches per 1000 instructions Therefore re-laxing the branch condition (unsafe_BR) improves theperformance of these two benchmarks by 68 and 94respectively On the other hand benchmarks such ascactusadm and lbm have a low branch mispredictionrate (05) and therefore relaxing the branch condi-tion does not show significant improvement for thesebenchmarks

The sphinx benchmark has high degree of intrinsicILP2 thus an increase in the number of effective re-sources allows this benchmark to improve performanceAs most of its load instructions are L2-cache missesrelaxing the load condition (Unsafe_LD) improves per-formance of this benchmark

462 Negative contribution of OOC conditions

In this section we analyze the gap between safe andunsafe out-of-order commit from a different angle

We relax all of the out-of-order commit conditionsexcept for one For example safe_LD means the LDcondition is preserved but ST BR and EXC conditionsare relaxed By preserving one of the conditions we lookat the negative effect (performance reduction) of theactivated condition compared to Unsafe_OOC

Figure 11 depicts the effect of each condition onperformance For most of the benchmarks the BR andthe LD conditions are the most effective ones Amongfloating-point benchmarks LD and EXC conditions havea large impact on performance Therefore relaxing theEXC condition as it is rare could lead to significant

2 The sphinx benchmark continues to show performance im-provements as the in-order commit processor aggressiveness in-creases from SLM to HSW

12 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

200 400 600 800 1000DRAM latency (cycle)

020406080

100120140160

Perfo

rman

ceim

prov

emen

t (

)Safe_OOC Unsafe_OOC

Fig 9 A comparison between safe and unsafe out-of-order commit across DRAM latencies for SLM andAOOC on SPEC CPU2006 OOC improvement in-creases with a higher DRAM latency Unsafe_OOCoutperforms Safe_OOC This study uses caches thatare four times smaller compared to Table 1 to put ad-ditional stress on the DRAM This has been done forboth in-order and out-of-order commit configurations

performance improvements at relatively low cost espe-cially if recovery mechanisms in software are used SThas the least effect among out-of-order commit condi-tions when it is preserved in isolation from other condi-tions This is valid between all three microarchitectures

47 Memory Latency Evaluation

Current DRAM cells are optimized for cost and notfor access latency [20] The potential to use more af-fordable but higher latency memory could have a largeimpact on allocation of memory in datacenter serversas high density DRAM modules cost much more thanthose that are 4times smaller (175times per GB [4]) To eval-uate the potential of out-of-order commit to handlehigher DRAM latency we evaluate memory latency from200 cycles to 1000 cycles in 200 cycle increments Inaddition we reduce the size of all caches by 4 timeswhen compared to Table 1 to put additional stress onthe DRAM subsystem Here we see an increasing per-formance improvement for the Unsafe_OOC conditionwhile Safe_OOC increases linearly compared to the in-order commit core For memory-intensive applicationsan efficient implementation of Unsafe_OOC could al-low the use of denser higher-latency DRAM that couldpotentially cost much less

48 Prefetching Evaluation

Prefetching is essential for performance in modern sys-tems and interacts tightly with the pipeline and out-of-order commit We do not consider all prefetchingoptions but instead focus on the potential benefitsTherefore to evaluate this interaction we configure theL1-D cache in gem5 with a stride prefetcher of degree8 Figure 10 compares the relative performance gains

IOC+pref

AOOCAOOC+pref

(b) SLM

0

20

40

60

80

Perfo

rman

ceim

prov

emen

t (

)

IOC+pref

AOOCAOOC+pref

(b) NHM

0

20

40

60

80

IOC+pref

AOOCAOOC+pref

(c) HSW

0

20

40

60

80

Fig 10 The effect of aggressive out-of-order commit onthe performance with and without prefetchers in theL1 data cache across SPEC CPU2006 Improvementis relative to the baseline IOC architectures withoutprefetchers

of prefetchers for in-order commit out-of-order com-mit and prefetchers for out-of-order commit Acrossall three architectures both out-of-order commit andprefetching on their own provide roughly 40 improve-ment in performance with out-of-order commit beingslightly better However the combination of out-of-ordercommit and prefetching delivers nearly 70 better per-formance nearly the sum of the two independent contri-butions In fact combining aggressive out-of-order com-mit with prefetchers allows us to reach 84 of the ideal(sum of both) improvement for SLM 77 for NHMand 83 for HSW showing that these techniques workwell together

49 Rollback Costs for Unsafe Branches

While this work does not evaluate hardware costs weare able to evaluate the number of rollbacks caused bycommitting past a mispredicted branch for each of theevaluated configurations For this evaluation we use acommit depth of 8 with aggressive out-of-order commit(AOOC) and Unsafe_BR (only the branch conditionis not respected for out-of-order commit) In Table 3we list the number of executed branches mispredictedbranches the number of rollbacks caused by specula-tively committing a mispredicted branch and the num-ber of instructions that were rolled back due to this roll-back While the number of executed branches increaseswith the aggressiveness of the core (because these corescan more aggressively speculate past branches) the av-erage number of rollbacks (and instructions rolled back)decreases slightly with the aggressiveness of the coreThese aggressive cores resolve speculative state quicklyreducing the number of instructions that need to becommitted out-of-order This results in a similar num-ber of rollbacks per thousand architecturally committedinstructions around 3 for all configurations

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 13

0minus20minus40minus60minus80Co

ntrib

ution in

degrad

ation (

) a) SLM-AOOC across SpecCPU

Safe_LD Safe_ST Safe_BR Safe_EXC

0minus20minus40minus60minus80

b) AOOC SpecCPU average

astarbwaves

bzip cactusadm

calculixdealii

gamessgcc gemsfdtd

gobmkgromacs

h264refhmmer

lbm leslie3dlibquantum

mcf milc namdomnetpp

perl povraysjeng

soplexsphinx3

tontowrf xalan

zeusmp

0minus20minus40minus60minus80

b) SLM-ROOC across SpecCPU

SLM-avgNHM-avg

HSW-avg

0minus20minus40minus60minus80

d) ROOC SpecCPU average

Fig 11 Maximum potential performance improvement is provided by Unsafe_OOC in which all conditions areunsafe By preserving all conditions maximum performance is reduced This figure shows the negative effect ofpreserving (or making safe) a single condition compared to the maximum potential performance improvementSafe_XX is equivalent to disabling all out-of-order commit conditions (all unsafe highest performance) where XXindicates the only safe and preserved condition

Table 3 Out-of-Order Commit Costs for UnsafeBranches

Metric-PKI SLM NHM HSWAvg Max Avg Max Avg Max

Branches 1462 3733 1533 4608 1568 4957Mispred br 78 509 81 545 82 560Num rollbacks 32 243 31 256 30 265Instrs rollback 85 529 66 430 54 343

5 Memory Parallelism

Overlapping cache misses to service them in parallelin particular long-latency accesses to DRAM but alsolower-latency accesses to the LLC can result deliversignificant performance benefits [8] This memory par-allelism is typically achieved through the use of multi-ple Miss Status Holding Registers (MSHRs) [19] whichtrack outstanding memory requests and allow them toexecute in parallel In this section we compare in-ordercommit and out-of-order commit in terms of memoryparallelism (both to DRAM (MLP) and within the cachehierarchy (MHP) [7]) by changing the number of L1MSHRs and observing the effect on performance Toexplore these effects we select three applications thatare highly memory-bound [18] (mcf) medium memory-bound (lbm) and largely not memory-bound (gcc) inFigure 12 One key observation is that out-of-order com-mit in both reluctant and aggressive modes is muchbetter than in-order commit in exposing intrinsic ap-plication memory parallelism Figure 12 shows that thegap between in-order and out-of-order commit is muchlarger in the case of HSW which means that the moreaggressive is the microarchitectures the more MLP iscovered if we apply out-of-order instruction commit Incase of lbm comparing the SLM NHM and HSW weobserved that the rate of performance improvement inthe range of (1 lt= MSHRSize lt= 5) is higher than

the rate in the range of (5 lt MSHRSize lt= 9) Inthe range of (MSHRSize gt 9)) HSW and NHM havecontinued improvement This is most likely the resultof an additional loop iteration containing DRAM ac-cesses that is being covered by out-of-order instructioncommit

Figure 12 also allows us to explore the impact ofmemory-boundedness by comparing across the threeapplications (mcf is most memory bound gcc least) Al-thoughmcf (Figure 12 (d) (e) and (f)) is more memory-bound than lbm (Figure 12 (g) (h) and (i)) it exposesless memory parallelism when the number of MSHRsis increased (the gap between in-order commit and safeaggressive out-of-order commit is larger in case of lbmcompared to mcf ) This is because mcf has more iso-lated cache misses that are misses that cannot be ser-viced at the same time to provide memory level par-allelism Also mcf has branch instructions based onmemory accesses (dependent loads) that miss in thecache hierarchy thereby reducing the number of in-structions that can potentially be committed out-of-order lbm has medium level of memory-boundedness[18] and therefore exposes much more memory paral-lelism as the number of MSHRs is increased Its per-formance improvement is therefore much better whenthe CPU commits instructions out-of-order gcc bench-mark Figure 12 (j) (k) and (l) is one of the leastmemory-bound benchmarks As a result increasing thenumber of MSHRs does not improve its performanceeven in the case of unsafe out-of-order commit Over-all out-of-order commit outperforms in-order commitby exposing additional memory parallelism (both toDRAM and in the hierarchy) In summary in case ofMLP out-of-order commit provides more benefit forbigger architecture and more memory-bound applica-tions

14 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

1 2 3 4 5 6 7 8 9 10123456789

Norm

alize

d Pe

rform

ance

improv

emen

t ()

(a) SPEC2006 SLM

1 2 3 4 5 6 7 8 9 10123456789

(b) SPEC2006 NHM

IOC Safe-AOOC Unsafe-AOOC

1 2 3 4 5 6 7 8 9 10123456789

(c) SPEC2006 HSW

1 2 3 4 5 6 7 8 9 10123456789

(d) mcf SLM

1 2 3 4 5 6 7 8 9 10123456789

(e) mcf NHM

1 2 3 4 5 6 7 8 9 10123456789

(f) mcf HSW

1 2 3 4 5 6 7 8 9 10123456789

(g) lbm SLM

1 2 3 4 5 6 7 8 9 10123456789

(h) lbm NHM

1 2 3 4 5 6 7 8 9 10123456789

(i) lbm HSW

1 2 3 4 5 6 7 8 9 10123456789

(j) gcc SLM

1 2 3 4 5 6 7 8 9 10MSHR size

123456789

(k) gcc NHM

1 2 3 4 5 6 7 8 9 10123456789

(l) gcc HSW

Fig 12 Memory hierarchy parallelism comparison between in-order commit and safe and unsafe out-of-ordercommit in MSHRs size of 1 to 10 The results have been normalized to in-order commit with MSHR size of oneSubgraph (a) (b) and (c) are based on harmonic mean across SPEC CPU2006 The mcf lbm and gcc benchmarksare respectively representative of categories with high medium and low memory boundedness

6 Early Release of Physical Registers vsOut-of-order Commit

Register renaming enables the avoidance of false de-pendencies through the register file thereby improvingcore performance It does this by translating architec-tural registers to larger set of physical registers As thisrenaming is done early in the pipeline and the physi-

cal registers are not released until commit which mayhappen long after the last consumer reads the data thephysical register entries may be kept alive for longerthan what the dependencies alone require To addressthis resource constraint techniques such as delayed al-location and Rarly Release of the Physical Register

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 15

Table 4 The Contribution of exhausted (full) regis-ter file (RF) and reorder buffer (ROB) on CPU stallsOther reasons to the CPU stalls could be the instruc-tion queue (IQ) the load-store queue (LSQ) instruc-tions cache misses or not having a free functional unitto allocate to the ready-to-dispatch instructions

Microarchitecture RF ROB Othermin avg max min avg max min avg max

SLM 6 59 86 13 36 92 0 5 27NHM 2 68 91 8 30 98 0 20 7HSW 2 69 91 8 29 98 0 20 7

(ERPR) [23] have been developed In this section wecompare ERPR with the OOC taxonomy [2]

OOC and ERPR are similar as they release physicalregister as early as possible Aggressive OOC outper-forms ERPR in general because it releases ROB andLSQ entries early as well as physical registers Unfor-tunately neither ERPR nor OOC can release IQ entriesearlier than usual because OOC only address processorstate after the instructions have left the IQ and ERPRis effective before instructions are inserted tot the IQFrom another point of view OOC put additional pres-sure on the IQ and provides free entries in one or all ofthe resource of superscalar processors

61 Analysis of CPU Aggressiveness

CPU aggressiveness analysis in this context is basedon microarchitecture configurations (see Table 2) andcommit-depth The overall trend of Figure 13 showsthat AOOC outperforms ROOC and ERPR in all con-figurations This is not surprising as AOOC is alwaysenabled while ROOC is enabled only when a resourcesis exhausted (see Section 25)

ERPR is essentially a subset of AOOC that only fo-cuses on the register file (RF) Therefore the higher isthe program sensitivity to the RF capacity the higher isthe effect of ERPR To better understand this behaviorwe analyzed the reasons for CPU stalls across differentmicroarchitectures (Table 4) The data show that in-creasing the effective size of RF (eg what ERPR doesconceptually) is less effective in SLM compare to NHMand HSW barbecue is SLM the CPU stalls are moreoften due to other resources such as the ROB size Asa result since in SLM architecture the size of RF isless effective in CPU stalls ROOC outperforms ERPRin terms of performance improvement (ROOC can vir-tually extend the effective size of ROB LSQ as wellas RF although ERPR only does this on RF) RF inNHM and HSW architectures is more effective on CPUstall than in SLM architecture therefore ERPR out-

performs ROOC regarding performance improvementin these two architecture

By focusing only on ERPR we observe that thewider the core the more effective is the is increasingcommit-depth (wider commit-depth covers more readyinstructions to commit) For example in the case ofSLM although widening the commit-depth releases ad-ditional physical registers earlier than general thesefreed registers are not effective anymore because theCPU stalls are due to limits in other resources (likeROB IQ or LSQ) As NHM and HSW have more re-sources increasing the commit-depth is more effectivein those architectures In the case of AOOC increas-ing the commit-depth is more effective than ROOCand ERPR because it enables additional instructionsto be committed out-of-order (because it is active onevery cycle compare to ROOC and also affects all ofthe resources compare to ERPR that only affects theRF) In addition we can see that the more aggressivethe CPU the more effective is increasing the commit-depth This is because wider cores execute and com-plete instructions that are far from the head of ROBearlier with their additional resources(either providedby the baseline or released by the OOOC) and theycan therefore provide more instructions to commit ifthe commit-depth is increased

62 Analysis of Out-of-Order Conditions

Figure 13 shows the potential for performance improve-ment from relaxing the out-of-order commit conditionsacross the three architectures for both safe and unsafeout-of-order-commit As has been seen previously [2 521] the least aggressive out-of-order commit configu-ration (safe_OOC with a commit depth of 4) is mostbeneficial for the least aggressive out-of-order processorSLM (compare the left-most bars of Figure 13(a-c))This is expected as the SLM processor has the least ex-ecution hardware and therefore suffers from more stallsthan the more aggressive processors

Increasing the commit depth from 4 to 8 (secondgroup of bars in Figure 13(a-c)) shows that the more ag-gressive out-of-order processors (NHM and HSW) nowbenefit more from out-of-order commit than the sim-pler SLM architecture This new result shows the inter-action between the aggressiveness of the out-of-orderexecution and out-of-order commit for more aggres-sive out-of-order execution there is more benefit frommore aggressive out-of-order commit The reason is thataggressive architectures appear to have more unusedresources available for computation This trend con-tinues through the more aggressive out-of-order com-mit modes as well with unsafe optimizations (Figure

16 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

4 8 16 320

20406080

100

Perfo

rman

ceim

prov

emen

t () (a) SLM-SAFE

4 8 16 32 128Commit Depth

020406080

100

(b) NHM-SAFEAOOC ROOC ERPR

4 8 16 32 128 1920

20406080

100

(c) HSW-SAFE

4 8 16 320

20406080

100

(d) SLM-UnsafeLD

4 8 16 32 1280

20406080

100

(e) NHM-UnsafeLD

4 8 16 32 128 1920

20406080

100

(f) HSW-UnsafeLD

4 8 16 320

20406080

100

(g) SLM-UnsafeBR

4 8 16 32 1280

20406080

100

(h) NHM-UnsafeBR

4 8 16 32 128 1920

20406080

100

(i) HSW-UnsafeBR

4 8 16 320

20406080

100

(j) SLM-UnsafeST

4 8 16 32 1280

20406080

100

(k) NHM-UnsafeST

4 8 16 32 128 1920

20406080

100

(l) HSW-UnsafeST

4 8 16 320

20406080

100

(m) SLM-UnsafeEXC

4 8 16 32 1280

20406080

100

(n) NHM-UnsafeEXC

4 8 16 32 128 1920

20406080

100

(o) HSW-UnsafeEXC

4 8 16 320

20406080

100

(p) SLM-Unsafe

4 8 16 32 1280

20406080

100

(q) NHM-Unsafe

4 8 16 32 128 1920

20406080

100

(r) HSW-Unsafe

Fig 13 Comparison between AOOC ROOC and ERPR in different commit-depth and microarchitecture setupin six different modes UnsafeXX is equivalent to (enforcing) all out-of-order commit conditions except conditionXX By relaxing the specific XX condition the dependence between other conditions is also observed

13(d-r)) benefiting the more aggressive out-of-order ex-ecution designs (NHM and HSW) more than the sim-pler one (SLM) The reasons for this are similar tothe safe case as the more aggressive processors arefaster to achieve the first out-of-order condition in-struction completion and therefore providing greatercommit depths (eg changing the commit depth from 4to 8 in Figure 13 (a-c)) results in a larger likelihoodof finding instructions to commit out-of-order Table 5

uses this analysis to rank the relative importance ofeach out-of-order commit condition for the three ar-chitectures This information provides a guideline forwhich areas to work on to achieve the best performanceimprovement depending on the baseline out-of-order ex-ecution

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 17

Conditioncommit-depth 4 8 16 32 128 192

Unsafe_LD 1 2 2 2 NA NAUnsafe_BR 2 1 1 1 NA NAUnsafe_ST 4 4 4 4 NA NAUnsafe_EXC 3 3 3 3 NA NA

(a) SLMConditioncommit-depth 4 8 16 32 128 192

Unsafe_LD 1 2 2 2 2 NAUnsafe_BR 3 1 1 1 1 NAUnsafe_ST 4 4 4 4 4 NAUnsafe_EXC 2 3 3 3 3 NA

(b) NHMConditioncommit-depth 4 8 16 32 128 192

Unsafe_LD 4 2 2 2 2 2Unsafe_BR 1 1 1 1 1 1Unsafe_ST 3 3 4 4 4 4Unsafe_EXC 2 4 3 3 3 3

(c) HSW

Table 5 Ranking of the benefits of out-of-order commitconditions in different microarchitecture based on thedepth of commit

7 Related Work

The goal of this work is to provide a detailed under-standing into the potential for performance benefitsacross different out-of-order commit conditions and lev-els of aggressiveness While this work aims to describethe maximum potential benefit for each individual con-dition there have been many previous works that de-scribe hardware solutions for early release of hardwarestructures and out-of-order commit strategiesBelowwe provide an overview of these works and how theyfit into the categories described in this work

71 Speculative Release of Hardware StructuresA number of implementations require register and pro-cessor state checkpointing support to speculatively re-tire or release hardware structures [12 11 22 1 17]

Processor state checkpointing especially with theadvent of very large SIMD registers such as AVX-512registers which now support up to 32 registers with upto 512 bits per register can require a significant amountof state to be saved when speculation is aggressive

72 Non-speculative Structure Release

Non-speculative early release of hardware structures be-fore commit requires knowledge that no older instruc-tion (that has come earlier in the instruction stream)can cause the program to abort raise an exception orrequire exposure of the architected state at that timeNon-speculative solutions have the potential to be themost energy efficient a necessity in an era of the end ofDennard Scaling for power-limited platforms A range

of solutions from hardware-only to software-assistedsolutions are described below

Compiler support Two previous works [1 13]allow for early commit or reclaim of resources basedon compiler knowledge but require software recompi-lation

Checkpoint-free A number of solutions do notuse checkpoints [5 21 13] and therefore do not requirespeculation (they respect the all commit conditions) orused software help to extend knowledge to the hard-ware

Selective early commit Early commit of loads [1415] allows for loads that have not yet received data fromthe memory hierarchy to become part of the commit-ted state of a processor This allows the processor tocontinue to process instructions past normally blockedstructures improving performance for memory-boundworkloads To accomplish this the authors [15] de-couple page faults that can occur from the fetching ofdata While recompilation is not strictly required forthis technique the authors evaluate their work using acompilation strategy to expose loads early Late alloca-tion and early release of physical registers is proposedas a register renaming scheme in [23] They explored theeffect of late allocation and early release of physical reg-ister on the performance of an 8-way superscalar pro-cessor without the need for additional speculation andmaintains precise exceptions The out-of-order commitmethods evaluated in this work overlap with the earlyrelease of registers while also evaluating the early re-lease of other structures (like the LSQ)

Evading out-of-order commit conditions[24] provides a coherency protocol level solution suchthat the loadrarrload reordering can be evaded non spec-ulatively under TSO In their design it is no longer nec-essary to squash and re-execute speculatively reorderedloads if their reordering can be hidden by the coherencyprotocol This allows speculatively reordered loads thatotherwise adhere to the rest of the Bell-Lipasti condi-tions to be committed out-of-order without affectingthe memory model

8 Conclusion

To obtain higher performance extending the reach ofthe processor core has been a focus in much of microar-chitecture research One promising direction is the useof out-of-order commit which releases precious proces-sor resources early to allow the processor to extend itsreach past typical hardware limits In this work wepresent a limit study for out-of-order commit throughthe introduction of reluctant and aggressive out-of-order

18 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

commit modes We show how smaller processors evenwith a limited commit scan depth can benefit fromout-of-order commit strategies but that larger aggres-sive cores require deeper commit scan depths to achieveimproved performance In addition we provide a de-tailed breakdown of the contributions for each out-of-order commit condition for the SPEC CPU2006 bench-mark suite and compare against similar works suchas early release of physical registers Our results showa very high potential for performance improvementabove 225x for some benchmarks and believe that out-of-order commit strategies can play an important rolefor future energy-efficient and high-performance proces-sor designs

References

1 Afram F Zeng H Ghose K A group-commitmechanism for ROB-based processors implement-ing the x86 ISA In HPCA pp 47ndash58 (2013)

2 Alipour M Carlson TE Kaxiras S Exploringthe performance limits of out-of-order commit InCF pp 211ndash220 (2017)

3 Allan A Edenfeld D Joyner Jr WH KahngAB Rodgers M Zorian Y 2001 technologyroadmap for semiconductors Computer 35(1) 42ndash53 (2002)

4 Badalone R Dramrsquos surprising role in the costof data centers httpwwwdatacenterknowledgecomarchives20151112dont (2015)

5 Bell GB Lipasti MH Deconstructing commitIn ISPASS pp 68ndash77 (2004)

6 Binkert N Beckmann B Black G ReinhardtSK Saidi A Basu A Hestness J Hower DRKrishna T Sardashti S Sen R Sewell KShoaib M Vaish N Hill MD Wood DA Thegem5 simulator SIGARCH Comput Archit News39(2) 1ndash7 (2011)

7 Carlson TE Heirman W Allam O Kaxiras SEeckhout L The load slice core microarchitectureIn ISCA pp 272ndash284 (2015)

8 Chou Y Fahs B Abraham S Microarchitec-ture optimizations for exploiting memory-level par-allelism In ISCA pp 76ndash88 (2004)

9 Corporation I Intel Rcopy 64 and ia-32 ar-chitectures optimization reference manualhttpwwwintelcomcontentwwwusenarchitecture-and-technology64-ia-32-architectures-optimization-manualhtml(2016)

10 Corporation I Intel Rcopy intelrsquos rsquotick-tockrsquo seeminglydead becomes rsquoprocess-architecture-optimizationrsquohttpwwwanandtechcomshow10183

intels-tick-tock-seemingly-dead-becomes-process-architecture-optimization (2016)

11 Cristal A Ortega D Llosa J Valero M Out-of-order commit processors In HPCA pp 48ndash59(2004)

12 Cristal A Santana OJ Valero M MartiacutenezJF Toward kilo-instruction processors ACMTrans Archit Code Optim 1(4) 389ndash417 (2004)

13 Duong N Veidenbaum AV Compiler-assistedselective out-of-order commit IEEE Computer Ar-chitecture Letters 12(1) 21ndash24 (2013)

14 Gwennap L Digital leads the pack with 21164 InMicroprocessor Report 8(12) pp 249ndash260 (1994)

15 Ham T Aragoacuten JL Martonosi M Desc De-coupled supply-compute communication manage-ment for heterogeneous architectures In MICROpp 191ndash203 (2015)

16 Henning JL SPEC CPU2006 benchmark descrip-tions SIGARCH Comput Archit News 34(4) 1ndash17 (2006)

17 Hilton A Roth A BOLT Energy-efficient out-of-order latency-tolerant execution In HPCA pp1ndash12 (2010)

18 Jaleel A Memory characterization of workloadsusing instrumentation driven simulation httpwwwglueumdeduajaleelworkload (2010)

19 Kroft D Lockup-free instruction fetchprefetchcache organization In ISCA pp 81ndash87 (1981)

20 Lee D Kim Y Seshadri V Liu J Subrama-nian L Mutlu O Tiered-latency dram A lowlatency and low cost dram architecture In HPCApp 615ndash626 (2013)

21 Marti S Borras J Rodriguez P Tena RMarin J A complexity-effective out-of-order re-tirement microarchitecture IEEE Transactions onComputers 58(12) 1626ndash1639 (2009)

22 Martinez JF Renau J Huang MC PrvulovicM Cherry Checkpointed early resource recyclingin out-of-order microprocessors In MICRO pp3ndash14 (2002)

23 Monreal T Vinals V Gonzalez J GonzalezA Valero M Late allocation and early release ofphysical registers IEEE Trans Comput 53(10)1244ndash1259 (2004)

24 Ros A Carlson TE Alipour M Kaxiras SNon-speculative load-load reordering in TSO InISCA pp 187ndash200 (2017)

25 Smith JE Pleszkun AR Implementing preciseinterrupts in pipelined processors IEEE Transac-tions on Computers 37(5) 562ndash573 (1988)

26 Sohi GS Vajapeyam S Instruction issue logicfor high-performance interruptable pipelined pro-cessors In ISCA pp 27ndash34 (1987)

  • Introduction
  • Out-of-Order Commit
  • Methodology
  • Out-of-Order Commit Evaluation
  • Memory Parallelism
  • Early Release of Physical Registers vs Out-of-order Commit
  • Related Work
  • Conclusion
Page 4: Maximizing Limited Resources: A Limit-based Study and ...tcarlson/pdfs/alipour2018mlralsatoo… · Maximizing Limited Resources: A Limit-based Study and Taxonomy of Out-of-order Commit

4 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

2 Out-of-Order Commit

The introduction of the reorder buffer (ROB) to pro-vide in-order commit in an out-of-order scheduled su-perscalar pipeline was an important advance for com-puter architecture culminating an effort that startedwith Tomasulorsquos algorithm and included techniques suchas reservation stations and the register update unit(RUU) [26] The reorder buffer maintains precise ar-chitectural state in the presence of interrupts unknownmemory dependencies or memory re-orderings that canperturb the ordering required by a memory consistencymodel Out-of-order commit on the other hand at-tempts to break this rigid updating of the architecturalstate either in a safe way (ie one that does not requireadditional speculation and rollback to revert changes tothe architectural state) or in an unsafe way (ie onethat does)

21 Safe vs Unsafe OOC

A turning point in our understanding of out-of-ordercommit came with the work of Bell and Lipasti [5] inthe form of a number of limiting conditions for safe out-of-order commit The necessary conditions to allow aninstruction to be committed out-of-order are1 The instruction is completed instructions can com-

mit only after their completion2 The instruction is not involved in memory replay

traps This condition simply says that we cannotcommit speculative loads or their dependent instruc-tions This condition relates to unresolved memorydependencies or memory consistency enforcementFor example total store order (TSO) requires a re-play of speculative loads that violate loadrarrload or-dering when this reordering is detected by othercores

3 Register WAR hazards are resolved (ie a write toa particular register cannot be permitted to commitbefore all prior reads of that architectural registerhave been completed)

4 Previous branches are successfully predicted Thiscondition simply says that we can commit only whileon the correct path of execution

5 No prior instruction in program order is going toraise an exception This condition provides preciseinterrupts and is essential in easing handling of egpage faultsIn the rest of this paper we will use the following

convention to discuss how we evade the BellLipasticonditions1 Safe_OOC where all out-of-order-commit condi-

tions are preserved This case provides the minimum

potential performance improvement of out-of-ordercommit but also the minimum hardware to imple-ment as it does not rely on speculation and rollbackbeyond what is already available in the out-of-ordercore

2 Unsafe_OOC where one or more (or all) of theout-of-order commit conditions are evaded (apartfrom true dependencies) Doing so the maximumpotential performance improvement of out-of-ordercommit is evaluated but this may require extra sup-port for speculation and rollback to be able to revertchanges in the architectural state that were foundto be incorrect after the commit

22 Reluctant vs Aggressive OOC

Aside from the limiting conditions described above aseparate dimension is the aggressiveness of committingout-of-order Thus concerning the mechanics of out-of-order commit we distinguish two versions

1 Reluctant out-of-order-commit (ROOC) where theout-of-order commit mechanisms are engaged onlywhen needed and

2 Aggressive out-of-order-commit (AOOC) wherethe out-of-order commit mechanisms are continu-ously active looking for opportunities to commitinstructions as early as possible

While the always on nature of the aggressive out-of-order commit is obvious in its meaning the when neededof the reluctant out-of-order-commit requires clarifica-tion For this paper reluctant out-of-order commit isengaged only when the core is in imminent danger ofgoing into a complete stall In other words we engagereluctant out-of-order commit only when one of the crit-ical resources (ROB entries registers load-store queueentries) is all but exhausted and cannot support thefetching of new instructions in the front end of thepipeline As such reluctant out-of-order commit acts asa safety valve to release the pressure on resources (justbefore this pressure reaches a critical point) ratherthan aggressively trying to keep resource pressure low

In contrast aggressive out-of-order commit releasesresources more eagerly but disregards the following is-sues

1 it might prove wasteful as traditional in-order com-mit may still be able to provide sufficient resourcesfor forward progress

2 it may be futile as the chances of encountering aninstruction that restricts further commit (eg anunresolved branch) tends to increase with aggres-siveness

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 5

3 it creates a significant management problem as out-of-order commit can create gaps in several struc-tures including the ROB and also the load queueand store queue (which is not completely addressedin prior works [21 22 5])

Commit Width and DepthThe final parameter to explore within the context

of out-of-order commit is the commit depth to scanfor potential instructions to commit out-of-order Whilecommit width is the number of instructions that can becommitted simultaneously per cycle the commit depthis the measure of how far the core can scan looking forinstructions to commit out-of-order in a given cycle

23 OOC Conditions

In Section 3 we describe the methodology used to com-mit instructions out-of-order With this methodologywe can selectively relax the commit conditions describedin Subsection 21 (except completion) while still guar-anteeing correct execution For example by relaxingthe branch condition (Unsafe_BR) ldquocommitting onlydown a non-speculative pathrdquo we can continue to freeresources past unresolved branches but effectively onlycommit from the correct path In this paper we donot evaluate the implementation required to relax theseconditions but instead evaluate the potential for per-formance improvement We evaluate the maximum po-tential performance improvement with a speculative out-of-order rollback cost of zero In this section we providedetails on the performance implications of the relax-ation of a single and combination of conditions

231 Instruction complete

The core waits for an instruction to finish executing be-fore commit can occur We do not examine early com-mit of loads [14 15] that miss in the cache and insteadwe consider them available for commit only after thedata returns and is bound to the destination register

232 Memory replay traps (safe_ST and safe_LD)

We describe two sub-cases for this conditionStore-Load (safe_ST) This condition applies to

same-thread memory dependencies involving a store anda load In particular we cannot commit a load out-of-order in the presence of a prior store with an un-resolved address If the store and the load prove tobe dependent (the load should have taken the valueof the store) the commit would have been incorrectThe LD condition disallows the commit of a load and

its dependent instructions until all prior stores resolvetheir addresses and all the memory dependencies arecorrectly enforced By relaxing this condition we cancommit loads and their dependent instructions even ifprior non-aliasing stores have unresolved addresses

Load-Load (safe_LD) This concerns memory con-sistency models that enforce loadrarrload ordering (egSequential Consistency or TSO) Under this orderingconstraint it is possible to allow loads out-of-order aslong as this is not observed in the memory system Thesafe_LD condition disallows the out-of-order commitof loads unless it is guaranteed that the correct or-der will be observed by the memory system To relaxthis condition we allow loadrarrload re-orderings that arenot observed by other cores A very specific case wouldbe a memory mapped IO (MMIO) request that mightchange the order of memory operations The MMIOcase acts as a rsquocoprocessorrsquo meaning that we have amulti-processor system here We ignore memory requestsfrom other cores (IO coprocessor)

233 WAR hazards

WAR hazards are already handled by the out-of-ordercore within the ROB and we assume a solution suchas the Value Buffer [21] for committing out-of-orderThus we do not consider this condition further

234 Unresolved Branches (safe_BR)

This condition guarantees that we commit only fromthe correct path of execution Out-of-order commit shouldnot proceed past unresolved branches until they are cor-rectly resolved We can relax this condition and commitpast an unresolved branch if we are able to undo thecommit To evaluate maximum performance potentialwe assume a zero rollback cost for out-of-order commitmisspredictions However the normal branch mispre-diction cost (10 cycles) is faithfully accounted (see Sub-section 49 for more details) In addition we evaluatethe rollback count for this condition in Subsection 49

235 Exceptions (safe_EXC)

This condition caters to precise interrupts Enforcingthis condition requires that we do not commit past aninstruction (floating-point memory access or any in-struction that may cause an exception) unless we makesure that the instruction will not cause a exception Torelax this safe_EXC condition we assume the code re-gions are exception free

6 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

Fig 2 Conceptual comparison of in-order and out-of-order commit (commit depth=4) C Ready to commitX Not ready to commit E Executing I Issued OEmpty entry

Fig 3 Functionality of AOOC and ROOC In thisexample AOOC commits one more instruction thanROOC (commit depth=4) C Ready to commit XNot ready to commit E Executing I Issued OEmpty entry

24 Safe and Unsafe Out-of-Order Commit

Normally discussion of out-of-order commit tends tofocus on the changes that occur in the back end ofthe pipeline However the purpose of out-of-order com-mit is to enable the front end to proceed The per-ception that out-of-order commit increases performancecan also be misleading instruction execution is not spedup it is the removal of conditions that stall the frontend that increases performance More specifically inan unconstrained architecture (unrestricted ROB reg-isters and loadstore queue entries) safe out-of-ordercommit does not perform faster than in-order commitFor restricted real-world implementations Safe_OOChelps to ameliorate this problem

In contrast Unsafe_OOC has the capacity to ex-ceed the performance of an unconstrained in-order com-mit architecture as it removes penalties needed to guar-antee correctness (eg correctly handling memory de-

pendencies correctly enforcing memory consistency or-dering etc) in situations that the hardware does nothave any other means of imposing such correctness Inthis case Unsafe_OOC has the potential to violate cor-rectness and requires a means to revert back to a safestate if a violation occurs This can be a good trade-offwhen the conditions that violate correctness are rareWe implement an oracle version of Unsafe_OOC inthat it violates the correctness conditions because itwill be safe to do so (and therefore no correctness prob-lems will occur) This removes the need to provide themechanisms to revert to a safe state in the case of a mis-speculation This model provides a best-case speedup asrecovery from misspeculation is zero cost

25 Aggressive and Reluctant OOC

Orthogonal to the enforcement of the out-of-order com-mit conditions the aggressiveness of out-of-order com-mit plays a significant role in the resulting performanceand the cost it incurs We introduced two approachesfor out-of-order commit Aggressive (AOOC) and Re-luctant (ROOC) out-of-order commit A good way todescribe them is to contrast their main difference howoften each mechanism is engaged

AOOC is engaged all the time ie it constantlytries to find instructions to commit out-of-order if theopportunity arises In this respect it aims to increasecommit bandwidth

In contrast ROOC is only engaged when one of thecritical resources in the core (ROB entries registersLoadstore queue entries) is about to be exhaustedROOC is concerned about front end stalls not aboutcommit bandwidth This means that the number of in-structions that ROOC needs to find that can commitout-of-order is limited ROOC needs to provide enoughfree entries in the ROBregistersLSQ so that the frontend can dispatch as many instructions as possible (up tothe dispatch width) to the out-of-order engine Figure 2shows this with an example The reason that the com-mit stage of an in-order commit is blocked is becauseof an unresolved instruction at the head of ROB Forexample given a four-wide superscalar shown in Fig-ure 2 tries to commit four instructions per cycle Whilethe first instruction at the head of ROB can committhe second oldest instruction causes the commit stageto block and instead of four only one instruction com-mits (in-order)

The behavior of out-of-order commit depends on itsaggressiveness

AOOC AOOC attempts to find up to commit-width instructions so it can satisfy the need to commit

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 7

Table 1 Baseline core parameters

Full system simulation 34 GHz x86-64L1id 32KiB 8-way 4clkL2 256KiB 8-way 12clkL3 1MiB 8-way 36clkDRAM 200clkBranch Predictor Tournament front end penalty 10clkPrefetcher OFF(default)ON Stride degree 8

Table 2 Microarchitecture configuration with reorderbuffer (ROB) instruction queue (IQ) load and storequeues (LQSQ) and integer and floating-point regis-ter file (RF) details Dispatch width (D) commit width(CW) and Out-of-Order commit depth (CD) is thesame to enable a fair comparison (SLM hardware hasa DCW of 2) Register values are additional physicalregisters above architectural state

Microarchitecture DCWCD ROB IQ LQSQ RF(INTFP)Silvermont-like (SLM) 448 32 32 1016 3232Nehalem-like (NHM) 448 128 56 4836 6868Haswell-like (HSW) 448 192 60 7242 130130

at the highest possible commit bandwidth In our ex-ample it aims to find four instructions However sinceonly one more is needed to fill the gap in the head groupof four leading instructions AOOC produces an excessof commit-ready instructions that can potentially beexploited in the near future

ROOC ROOC on the other hand aims to findthe minimum number of instructions needed to commit(out-of-order) so that no resource is exhausted and thefront end can continue to issue instructions at its peakbandwidth The reason for seeking the minimum num-ber of instructions to commit out-of-order is that thisminimizes the perturbations in instruction order Thiscould potentially lead to more efficient hardware im-plementations While Figure 3 only considers the ROBin the general case all dispatch-related resources areconsidered in the same way In our particular exam-ple ROOC needs to find just one instruction to com-mit out-of-order The ROB already contains two emptyslots and the in-order commit mechanisms can find oneinstruction to commit leaving one left for ROOC

In contrast in AOOC mode there are in total threeinstructions that commit (instructions 1 3 and 4in Figure 3) The result is that AOOC needs to scandeeper than ROOC

3 Methodology

We use the gem5 [6] simulator in full system mode tosimulate an x86-64 target with a frequency of 34GHzTo test our models we use the SPEC CPU2006 bench-mark suite We use ten uniformly distributed check-points from each benchmark Each simulation check-

point has three phases that begin with 250M instruc-tions of cache warm-up followed by 100k instructions ofdetailed pipeline warm-up and ends with a detailed sim-ulation of 100M instructions Furthermore three dif-ferent configurations similar to three conventional out-of-order processors are used Intelrsquos SLM NHM andHSW [9] microarchitectures Tables 1 and 2 list the de-tailed configuration of the simulation environment

To implement our out-of-order commit model wefirst configure a simulated machine with a very largenumber of core resources We then monitor the numberof committed and non-committed instructions that ap-pear in the pipeline and control at dispatch whetherwe can support additional instructions in the back endof the processor In this way we dynamically determinethe resource availability in the processor for each cyclebased on the out-of-order commit conditions

4 Out-of-Order Commit Evaluation

In this section we analyze the benefits of out-of-ordercommit on the performance of a number of applica-tions We look at how the commit bandwidth changeswith out-of-order commit and how the effective re-source size of each critical component changes as weenable different out-of-order commit conditions Nextwe show how out-of-order commit conditions affect per-formance Safe_OOC and Unsafe_OOC are two ex-treme points that are defined by either enabling (re-specting) or disabling all of the out-of-order commitconditions This results in a minimum and maximumpotential performance improvement across the bench-mark suite To understand the effect of each conditionin isolation we study the effect of each one both in thepresence and absence of other conditions These stud-ies on Safe_OOC and Unsafe_OOC conditions wereconducted for both Aggressive (AOOC) and Reluctant(ROOC) out-of-order commit

41 Microarchitecture Aggressiveness

We target three microarchitectures resembling IntelrsquosSilvermont (SLM) Nehalem (NHM) and Haswell (HSW)as small medium and large cores (See Table 2 for de-tails) As an overview Figure 1 shows the performanceimprovement for each microarchitecture assuming Safe-OOC (all conditions respected) for all benchmarks onaverage across SPEC CPU2006 [16] We can see thatin the case of narrow commit depth (four in this fig-ure) a relatively small out-of-order processor (SLM)has more potential for relative improvement comparedto the medium and aggressive microarchitectures The

8 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

SLM SLMOOC

NHMNHMOOC

HSWHSWOOC

20406080

100

Cycle

s ()

4 committed3 committed

2 committed1 committed

0 committed

Fig 4 Commit bandwidth distribution for the SPECCPU2006 benchmarks of a 4-wide core with in-ordercommit and out-of-order commit respecting all commitconditions (Safe_OOC) Out-of-order commit increasescommit pressure even without aggressive speculation

reason is that the smaller processor (with a shorter in-struction reach and given a balanced design smallerhardware structures) will more likely stall as it exposesa smaller amount of the potential ILP in an applica-tion In case of a larger commit depth the more ag-gressive cores (NHM and HSW) have higher potentialperformance improvement (See Section 44 for a de-tailed overview) Out-of-order commit frees the proces-sor from the traditional limits reducing the number oftimes the processor experiences exhausted resources Inmedium and large aggressive cores thanks to a largerROB as well as other hardware resources more intrinsicILP is extracted by traditional in-order commit leav-ing less potential for out-of-order commit with a narrowcommit depth

42 The Effect on Commit Bandwidth

Figure 4 shows the distribution of the number of com-mitted instructions per cycle for three different microar-chitectures for both in-order and out-of-order commitacross all SPEC CPU2006 benchmarks Although an in-order commit 4-wide commit HSW microarchitecturecan retire up to four instructions per cycle this occurson average less than 20 of the time In practice wesee a large number of commit stage stalls (zero instruc-tions committed per cycle) For out-of-order committhe distribution shifts toward four instructions per cy-cle reflecting the improved commit performance (andthe resulting improvement in overall performance) Fi-nally this figure shows that for smaller less aggressivecores (such as SLM see Table 2) out-of-order commitprovides a relatively larger improvement compared tothe other microarchitectures because of the short com-

LQ SQ IQ RF ROB00

05

10

15

20

Norm

alize

d effective siz

e

IOC Safe_OOC Unsafe_OOC

Fig 5 Effective resource use in in-order and Aggressiveout-of-order commit across SPEC CPU2006

mit depth of four Larger cores improve more with moreaggressive commit depths and out-of-order commit pa-rameters (see section 44 for details)

43 The Effect on Resources

One of the main issues that out-of-order commit canhelp resolve is the early release (and subsequent reuse)of hardware resources that otherwise would still be re-quired to maintain the in-order state in an in-ordercommit processor Therefore with a small ROB (andother appropriately sized structures) given in-order com-mit the core can more easily stall when it runs out ofresources For example consider a ROB size of 32 froma SLM microarchitecture When the ROB is full it con-tains 32 micro-operations in flight and the CPUrsquos backend will no longer accept additional micro-operationsfrom the front end Increasing the physical size of theROB along with other resources is one potential butrather expensive solution to this issue An alternativesolution and one of the benefits of using out-of-ordercommit is the early release of resources This early re-lease increases the effective size of a resource (comparedto an in-order commit core) improving performance ofthe core

Benchmark specific analysis The xalan bench-mark is one of the top 5 benchmarks with a large num-ber of CPU stalls caused by exhausted resources BothAOOC and ROOC are very effective for the xalan bench-mark OOC is able to provide additional free entries inthe ROB RF and LSQ (see Figure 7)

On the other hand leslie3d has the lowest num-ber of CPU stalls based on exhausted resources whichlimits the potential for improvement with OOC

Figure 5 compares the effective size of the SLM mi-croarchitecture between in-order commit and both safeand unsafe aggressive out-of-order commit (AOOC) mod-els for the SPEC CPU2006 benchmarks The resultsare normalized to a fixed size SLM_IOC microarchitec-ture (see Table 2) An effective size of 10 translates tofrequent stalls due to resource exhaustion while sizes

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 9

4 8 16 32 128192

(a) SLM

20

40

60

80

100

120

Perfo

rmance

improvem

ent ()

4 8 16 32 128192

Commit depth

(b) NHM

20

40

60

80

100

120

Safe_OOC Unsafe_OOC

4 8 16 32 128192

(c) HSW

20

40

60

80

100

120

Fig 6 Effect of commit depth on performance improve-ment normalized to their respective in-order commitperformance across SPEC CPU2006 The smaller corereceives less benefit from increasing the depth of com-mit Safe_OOC saturates at a commit depth of 8 16and 32 for SLM NHM and HSW respectively (the dis-tance to the first unresolved instruction from the headof the ROB) There is no saturation in performance im-provement of Unsafe_OOC while the commit depth isincreased

greater than 10 shows the effective resource size in-creases due to OOC

For aggressive out-of-order commit the larger effec-tive sizes show the reach of this technique We see thatthe utilization of all structures except for the ROB isalmost the same for safe and unsafe out-of-order com-mit which allows Safe_OOC to achieve most of theperformance of the unsafe version Nevertheless the un-safe core is much better at improving the reach of theROB allowing applications like hmmer continue to showa benefit when moving from safe to unsafe out-of-ordercommit See Section 452 for more details on hmmer

44 Evaluation of Commit Depth

To gauge the effect of the commit depth (ie how far wescan the ROB to find instructions that can commit out-of-order) we impose a hard limit on it and evaluate theeffects on the resulting performance The strictest limitis the commit-width itself starting to commit out-of-order from the first commit-width instructions We thenrelax this to the immediate vicinity (eg double thecommit-width) and progressively relax until we reachthe size of the ROB

In Figure 6 we see that a commit depth of 4 (equalto commit width) provides the smallest benefit but alsothe smallest difference between Safe_OOC and Un-safe_OOC In addition the SLM core benefits morefrom OOC compared to NHM and HSW when com-mit depth is smaller than 8 At a commit depth of 8and above the large aggressive cores benefit much more

than the smaller core type with the maximum improve-ment of Unsafe_OOC at 80 119 and 129 for SLMNHM and HSW respectively For aggressive cores thelarger commit depth allows for continued performanceimprovements and will be necessary for the design of abalanced aggressive out-of-order commit design

45 Out-of-Order Commit Performance

In this section we analyze the out-of-order commit con-ditions to determine the minimum (and maximum) per-formance improvement potential The minimum andmaximum improvement is provided by Safe_OOC andUnsafe_OOC respectively In Figure 7 we show theamount of improvement provided by both safe and un-safe out-of-order commit for both aggressive and reluc-tant modes for all three microarchitectures

451 Safe_OOC

Honoring all out-of-order commit conditions results in amodest performance improvement One implication forfuture processors is that Safe_OOC does not requireadditional support for speculative out-of-order rollbackrecovery mechanisms it requires support for the com-mit of instructions out-of-order and the freeing of struc-tures for use by future instructions

Safe_AOOC When Safe_AOOC is evaluated onthe SLM microarchitecture (see Figure 7a) the rangeof improvement spans from a low of 3 (leslie3d) upto 82 (xalan) with an average of 44 In the NHMmicroarchitecture the improvement ranges from 9 to108 for milc and xalan with an average improvementof 57 and for HSW we see an improvement of 1for mcf to 110 for wrf with an average of 55 (SeeSection 44 for more details)

Safe_ROOC For Safe_ROOC the performanceimprovement is lower for all three microarchitecturescompared to AOOC (See Subsection 25 for additionaldetails) In the SLM microarchitecture the range ofperformance improvement of safe ROOC is from 2 to55 for calculix and gcc respectively with an averageimprovement of 20 In the NHM microarchitecturewe see performance improvements that range from 1(mcf) to 22 (bwaves) with an average improvement of7 (8 for HSW) Because NHM and HSM have fewerCPU stalls ROOC has less of an effect on performancecompared with the smaller more efficient core

452 Unsafe_OOC

By relaxing all conditions Unsafe_OOC provides themaximum potential for performance improvement Un-

10 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

020406080

100120140

Perfo

rman

ceim

prov

emen

t () a) AOOC across SpecCPU

SLM_UnsafeSLM_Safe

NHM_UnsafeNHM_Safe

HSW_UnsafeHSW_Safe

astarbwaves

bzip cactusadm

calculixdealii

gamessgcc gemsfdtd

gobmkgromacs

h264refhmmer

lbm leslie3dlibquantum

mcf milc namdomnetpp

perl povraysjeng

soplexsphinx3

tontowrf xalan

zeusmpSpecCPU-avg

020406080

100120140

b) ROOC across SpecCPU

Fig 7 IPC improvement of safe and unsafe out-of-order commit relative to in-order commit as a baseline for bothreluctant and aggressive versions applied on SPEC CPU2006 benchmarks on three microarchitectures

safe_OOC will require recovery mechanisms for thesetechniques which can reduce the performance potentialbecause of recovery costs

To understand the effectiveness of all conditions to-gether we consider zero cost for recovery for any mis-speculated out-of-order commit condition

Unsafe_AOOC This technique provides the high-est performance improvement for all three different ar-chitectures In SLM cores the improvement ranges from9 to 120 for libquantum and astar applicationsrespectively and the average is 59 In NHM archi-tecture the average is 72 and the range is betweenmilc and astar respectively with 11 and 196 im-provement In HSW cores similar to NHM cores astarhas the maximum benefit from Unsafe_ooc with 192improvement while the minimum improvement is fordealii at 7 The average improvement for this classof architecture is 70

Unsafe_ROOC Reluctant out-of-order commit islower performing because it is not continuously lookingto commit additional instructions (See Section 25 fordetails) In the case of SLM the improvement rangesfrom 3 to 93 respectively for leselie3d and xalanwith an average of a 38 improvement In NHM mcfand bwaves with 1 and 22 show the maximum andthe minimum improvement with an average of 7 (HSWis similar with an average of 8) Between the threemicroarchitectures the limited SLM benefits the mostfrom ROOC because of the large number of stalls seenby this core Therefore ROOC especially Unsafe_ROOCis an interesting methodology to improve the perfor-mance of relatively small but energy efficient CPUs aswe see a relatively high performance improvement fora less aggressive commit implementation

Benchmark-specific Analysis The hmmer bench-mark is a particularly strong case for the benefits of out-of-order commit for SLM This application is L1-cacheresident and exhibits very few last-level cache missesNevertheless we still see a very strong improvement inperformance from 33 to 52 for Safe_OOC increas-ing to 51 to 66 for Unsafe_OOC Looking aheadto Figure 8 and Figure 11 we can see that evadingthe branch condition provides the most benefit for thisapplication Making room for additional instructions toallow the hardware to expose additional ILP works welleven for those applications without a significant num-ber of LLC misses The mcf benchmark contains load-dependent branches has the highest MPKI (misses perkilo instructions) among the benchmarks and thereforeit has a rather low IPC when it is executed on an in-order commit CPU This results in a good opportunityto improve performance as it is extremely limited bythese misses

46 Performance Effects of Commit Conditions

In the previous section by analyzing safe and unsafeout-of-order commit we observe that there is a largegap between the performance improvement of these twoimplementations Understanding the cause of this per-formance improvement (by looking at individual com-mit conditions in isolation) allows us to better under-stand where to focus future hardware efforts

461 Positive Contribution of Out-of-Order CommitConditions

To study the gap between safe and unsafe out-of-ordercommit (Figure 7) we analyze the effect of relaxing

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 11

020406080

100

Contrib

ution in

improv

emen

t () a) SLM-AOOC across SpecCPU

Safe Unsafe_LD Unsafe_ST Unsafe_BR Unsafe_EXC

020406080

100b) AOOC SpecCPU average

astarbwaves

bzip cactusadm

calculixdealii

gamessgcc gemsfdtd

gobmkgromacs

h264refhmmer

lbm leslie3dlibquantum

mcf milc namdomnetpp

perl povraysjeng

soplexsphinx3

tontowrf xalan

zeusmp

020406080

100c) SLM-ROOC across SpecCPU

SLM-avgNHM-avg

HSW-avg

020406080

100d) ROOC SpecCPU average

Fig 8 Contribution of safe and selectively unsafe out-of-order commit on three different microarchitecturesUnsafe_XX is equivalent to activating (enforcing) all out-of-order commit conditions except XX (the XX conditionis relaxed) By relaxing the specific XX condition the dependence between other conditions is also observed

each condition in the presence of the other preservedconditions in Figure 8 We analyze the SLM microar-chitecture in detail and provide averages across all mi-croarchitectures for both AOOC and ROOC Each out-of-order commit condition in analyzed in isolation andwe consider Unsafe_OOC (all relaxed conditions) asthe 100 potential improvement budget In the case ofthe mcf benchmark in Figure 7a the safe and unsafeOOC performance improvement is 33 and 71 re-spectively (46 of the potential improvement budget isprovided by Safe_OOC) We also observe that by re-laxing the LD condition (unsafe_LD) 52 of potentialimprovement budget is achievable (see Figure 8a) InFigure 8 we can see in some applications (like namd inAOOC mode and leslie3d in ROOC mode) that re-laxing just a single condition is not sufficient to fill thegap between safe and unsafe OOC This does not meanthat a single condition is not important but rather thatother preserved conditions are preventing out-of-ordercommit from achieving its full potential

AOOC We observe that for most of applicationsUnsafe_BR and Unsafe_LD are the most interestingconditions (Figure 8a) Additionally the more aggres-sive the core the more important the Unsafe_LD con-dition becomes In SLM NHM and HSW CPUs Un-safe_LD respectively fills 4 10 and 12 and Un-safe_BR fills 9 8 and 7 of the gap between safeand unsafe OOC Unsafe_ST is not very effective be-cause of the rarity of this condition and the conserva-tive memory dependence predictor used Unsafe_EXCor relaxing exceptions are not that effective becausethey are very rare especially in integer benchmarks

ROOC Relaxing OOC conditions in ROOC hasless effect in reducing the gap between safe and un-safe OOC and this is because of the nature of this on-demand OOC mode which is enabled and needed more

in SLM and much less in NHM and HSW (see Fig-ure 7b)

Benchmark-specific Analysis The astar andgobmk benchmarks have the highest number of mis-predicted branches per 1000 instructions Therefore re-laxing the branch condition (unsafe_BR) improves theperformance of these two benchmarks by 68 and 94respectively On the other hand benchmarks such ascactusadm and lbm have a low branch mispredictionrate (05) and therefore relaxing the branch condi-tion does not show significant improvement for thesebenchmarks

The sphinx benchmark has high degree of intrinsicILP2 thus an increase in the number of effective re-sources allows this benchmark to improve performanceAs most of its load instructions are L2-cache missesrelaxing the load condition (Unsafe_LD) improves per-formance of this benchmark

462 Negative contribution of OOC conditions

In this section we analyze the gap between safe andunsafe out-of-order commit from a different angle

We relax all of the out-of-order commit conditionsexcept for one For example safe_LD means the LDcondition is preserved but ST BR and EXC conditionsare relaxed By preserving one of the conditions we lookat the negative effect (performance reduction) of theactivated condition compared to Unsafe_OOC

Figure 11 depicts the effect of each condition onperformance For most of the benchmarks the BR andthe LD conditions are the most effective ones Amongfloating-point benchmarks LD and EXC conditions havea large impact on performance Therefore relaxing theEXC condition as it is rare could lead to significant

2 The sphinx benchmark continues to show performance im-provements as the in-order commit processor aggressiveness in-creases from SLM to HSW

12 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

200 400 600 800 1000DRAM latency (cycle)

020406080

100120140160

Perfo

rman

ceim

prov

emen

t (

)Safe_OOC Unsafe_OOC

Fig 9 A comparison between safe and unsafe out-of-order commit across DRAM latencies for SLM andAOOC on SPEC CPU2006 OOC improvement in-creases with a higher DRAM latency Unsafe_OOCoutperforms Safe_OOC This study uses caches thatare four times smaller compared to Table 1 to put ad-ditional stress on the DRAM This has been done forboth in-order and out-of-order commit configurations

performance improvements at relatively low cost espe-cially if recovery mechanisms in software are used SThas the least effect among out-of-order commit condi-tions when it is preserved in isolation from other condi-tions This is valid between all three microarchitectures

47 Memory Latency Evaluation

Current DRAM cells are optimized for cost and notfor access latency [20] The potential to use more af-fordable but higher latency memory could have a largeimpact on allocation of memory in datacenter serversas high density DRAM modules cost much more thanthose that are 4times smaller (175times per GB [4]) To eval-uate the potential of out-of-order commit to handlehigher DRAM latency we evaluate memory latency from200 cycles to 1000 cycles in 200 cycle increments Inaddition we reduce the size of all caches by 4 timeswhen compared to Table 1 to put additional stress onthe DRAM subsystem Here we see an increasing per-formance improvement for the Unsafe_OOC conditionwhile Safe_OOC increases linearly compared to the in-order commit core For memory-intensive applicationsan efficient implementation of Unsafe_OOC could al-low the use of denser higher-latency DRAM that couldpotentially cost much less

48 Prefetching Evaluation

Prefetching is essential for performance in modern sys-tems and interacts tightly with the pipeline and out-of-order commit We do not consider all prefetchingoptions but instead focus on the potential benefitsTherefore to evaluate this interaction we configure theL1-D cache in gem5 with a stride prefetcher of degree8 Figure 10 compares the relative performance gains

IOC+pref

AOOCAOOC+pref

(b) SLM

0

20

40

60

80

Perfo

rman

ceim

prov

emen

t (

)

IOC+pref

AOOCAOOC+pref

(b) NHM

0

20

40

60

80

IOC+pref

AOOCAOOC+pref

(c) HSW

0

20

40

60

80

Fig 10 The effect of aggressive out-of-order commit onthe performance with and without prefetchers in theL1 data cache across SPEC CPU2006 Improvementis relative to the baseline IOC architectures withoutprefetchers

of prefetchers for in-order commit out-of-order com-mit and prefetchers for out-of-order commit Acrossall three architectures both out-of-order commit andprefetching on their own provide roughly 40 improve-ment in performance with out-of-order commit beingslightly better However the combination of out-of-ordercommit and prefetching delivers nearly 70 better per-formance nearly the sum of the two independent contri-butions In fact combining aggressive out-of-order com-mit with prefetchers allows us to reach 84 of the ideal(sum of both) improvement for SLM 77 for NHMand 83 for HSW showing that these techniques workwell together

49 Rollback Costs for Unsafe Branches

While this work does not evaluate hardware costs weare able to evaluate the number of rollbacks caused bycommitting past a mispredicted branch for each of theevaluated configurations For this evaluation we use acommit depth of 8 with aggressive out-of-order commit(AOOC) and Unsafe_BR (only the branch conditionis not respected for out-of-order commit) In Table 3we list the number of executed branches mispredictedbranches the number of rollbacks caused by specula-tively committing a mispredicted branch and the num-ber of instructions that were rolled back due to this roll-back While the number of executed branches increaseswith the aggressiveness of the core (because these corescan more aggressively speculate past branches) the av-erage number of rollbacks (and instructions rolled back)decreases slightly with the aggressiveness of the coreThese aggressive cores resolve speculative state quicklyreducing the number of instructions that need to becommitted out-of-order This results in a similar num-ber of rollbacks per thousand architecturally committedinstructions around 3 for all configurations

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 13

0minus20minus40minus60minus80Co

ntrib

ution in

degrad

ation (

) a) SLM-AOOC across SpecCPU

Safe_LD Safe_ST Safe_BR Safe_EXC

0minus20minus40minus60minus80

b) AOOC SpecCPU average

astarbwaves

bzip cactusadm

calculixdealii

gamessgcc gemsfdtd

gobmkgromacs

h264refhmmer

lbm leslie3dlibquantum

mcf milc namdomnetpp

perl povraysjeng

soplexsphinx3

tontowrf xalan

zeusmp

0minus20minus40minus60minus80

b) SLM-ROOC across SpecCPU

SLM-avgNHM-avg

HSW-avg

0minus20minus40minus60minus80

d) ROOC SpecCPU average

Fig 11 Maximum potential performance improvement is provided by Unsafe_OOC in which all conditions areunsafe By preserving all conditions maximum performance is reduced This figure shows the negative effect ofpreserving (or making safe) a single condition compared to the maximum potential performance improvementSafe_XX is equivalent to disabling all out-of-order commit conditions (all unsafe highest performance) where XXindicates the only safe and preserved condition

Table 3 Out-of-Order Commit Costs for UnsafeBranches

Metric-PKI SLM NHM HSWAvg Max Avg Max Avg Max

Branches 1462 3733 1533 4608 1568 4957Mispred br 78 509 81 545 82 560Num rollbacks 32 243 31 256 30 265Instrs rollback 85 529 66 430 54 343

5 Memory Parallelism

Overlapping cache misses to service them in parallelin particular long-latency accesses to DRAM but alsolower-latency accesses to the LLC can result deliversignificant performance benefits [8] This memory par-allelism is typically achieved through the use of multi-ple Miss Status Holding Registers (MSHRs) [19] whichtrack outstanding memory requests and allow them toexecute in parallel In this section we compare in-ordercommit and out-of-order commit in terms of memoryparallelism (both to DRAM (MLP) and within the cachehierarchy (MHP) [7]) by changing the number of L1MSHRs and observing the effect on performance Toexplore these effects we select three applications thatare highly memory-bound [18] (mcf) medium memory-bound (lbm) and largely not memory-bound (gcc) inFigure 12 One key observation is that out-of-order com-mit in both reluctant and aggressive modes is muchbetter than in-order commit in exposing intrinsic ap-plication memory parallelism Figure 12 shows that thegap between in-order and out-of-order commit is muchlarger in the case of HSW which means that the moreaggressive is the microarchitectures the more MLP iscovered if we apply out-of-order instruction commit Incase of lbm comparing the SLM NHM and HSW weobserved that the rate of performance improvement inthe range of (1 lt= MSHRSize lt= 5) is higher than

the rate in the range of (5 lt MSHRSize lt= 9) Inthe range of (MSHRSize gt 9)) HSW and NHM havecontinued improvement This is most likely the resultof an additional loop iteration containing DRAM ac-cesses that is being covered by out-of-order instructioncommit

Figure 12 also allows us to explore the impact ofmemory-boundedness by comparing across the threeapplications (mcf is most memory bound gcc least) Al-thoughmcf (Figure 12 (d) (e) and (f)) is more memory-bound than lbm (Figure 12 (g) (h) and (i)) it exposesless memory parallelism when the number of MSHRsis increased (the gap between in-order commit and safeaggressive out-of-order commit is larger in case of lbmcompared to mcf ) This is because mcf has more iso-lated cache misses that are misses that cannot be ser-viced at the same time to provide memory level par-allelism Also mcf has branch instructions based onmemory accesses (dependent loads) that miss in thecache hierarchy thereby reducing the number of in-structions that can potentially be committed out-of-order lbm has medium level of memory-boundedness[18] and therefore exposes much more memory paral-lelism as the number of MSHRs is increased Its per-formance improvement is therefore much better whenthe CPU commits instructions out-of-order gcc bench-mark Figure 12 (j) (k) and (l) is one of the leastmemory-bound benchmarks As a result increasing thenumber of MSHRs does not improve its performanceeven in the case of unsafe out-of-order commit Over-all out-of-order commit outperforms in-order commitby exposing additional memory parallelism (both toDRAM and in the hierarchy) In summary in case ofMLP out-of-order commit provides more benefit forbigger architecture and more memory-bound applica-tions

14 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

1 2 3 4 5 6 7 8 9 10123456789

Norm

alize

d Pe

rform

ance

improv

emen

t ()

(a) SPEC2006 SLM

1 2 3 4 5 6 7 8 9 10123456789

(b) SPEC2006 NHM

IOC Safe-AOOC Unsafe-AOOC

1 2 3 4 5 6 7 8 9 10123456789

(c) SPEC2006 HSW

1 2 3 4 5 6 7 8 9 10123456789

(d) mcf SLM

1 2 3 4 5 6 7 8 9 10123456789

(e) mcf NHM

1 2 3 4 5 6 7 8 9 10123456789

(f) mcf HSW

1 2 3 4 5 6 7 8 9 10123456789

(g) lbm SLM

1 2 3 4 5 6 7 8 9 10123456789

(h) lbm NHM

1 2 3 4 5 6 7 8 9 10123456789

(i) lbm HSW

1 2 3 4 5 6 7 8 9 10123456789

(j) gcc SLM

1 2 3 4 5 6 7 8 9 10MSHR size

123456789

(k) gcc NHM

1 2 3 4 5 6 7 8 9 10123456789

(l) gcc HSW

Fig 12 Memory hierarchy parallelism comparison between in-order commit and safe and unsafe out-of-ordercommit in MSHRs size of 1 to 10 The results have been normalized to in-order commit with MSHR size of oneSubgraph (a) (b) and (c) are based on harmonic mean across SPEC CPU2006 The mcf lbm and gcc benchmarksare respectively representative of categories with high medium and low memory boundedness

6 Early Release of Physical Registers vsOut-of-order Commit

Register renaming enables the avoidance of false de-pendencies through the register file thereby improvingcore performance It does this by translating architec-tural registers to larger set of physical registers As thisrenaming is done early in the pipeline and the physi-

cal registers are not released until commit which mayhappen long after the last consumer reads the data thephysical register entries may be kept alive for longerthan what the dependencies alone require To addressthis resource constraint techniques such as delayed al-location and Rarly Release of the Physical Register

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 15

Table 4 The Contribution of exhausted (full) regis-ter file (RF) and reorder buffer (ROB) on CPU stallsOther reasons to the CPU stalls could be the instruc-tion queue (IQ) the load-store queue (LSQ) instruc-tions cache misses or not having a free functional unitto allocate to the ready-to-dispatch instructions

Microarchitecture RF ROB Othermin avg max min avg max min avg max

SLM 6 59 86 13 36 92 0 5 27NHM 2 68 91 8 30 98 0 20 7HSW 2 69 91 8 29 98 0 20 7

(ERPR) [23] have been developed In this section wecompare ERPR with the OOC taxonomy [2]

OOC and ERPR are similar as they release physicalregister as early as possible Aggressive OOC outper-forms ERPR in general because it releases ROB andLSQ entries early as well as physical registers Unfor-tunately neither ERPR nor OOC can release IQ entriesearlier than usual because OOC only address processorstate after the instructions have left the IQ and ERPRis effective before instructions are inserted tot the IQFrom another point of view OOC put additional pres-sure on the IQ and provides free entries in one or all ofthe resource of superscalar processors

61 Analysis of CPU Aggressiveness

CPU aggressiveness analysis in this context is basedon microarchitecture configurations (see Table 2) andcommit-depth The overall trend of Figure 13 showsthat AOOC outperforms ROOC and ERPR in all con-figurations This is not surprising as AOOC is alwaysenabled while ROOC is enabled only when a resourcesis exhausted (see Section 25)

ERPR is essentially a subset of AOOC that only fo-cuses on the register file (RF) Therefore the higher isthe program sensitivity to the RF capacity the higher isthe effect of ERPR To better understand this behaviorwe analyzed the reasons for CPU stalls across differentmicroarchitectures (Table 4) The data show that in-creasing the effective size of RF (eg what ERPR doesconceptually) is less effective in SLM compare to NHMand HSW barbecue is SLM the CPU stalls are moreoften due to other resources such as the ROB size Asa result since in SLM architecture the size of RF isless effective in CPU stalls ROOC outperforms ERPRin terms of performance improvement (ROOC can vir-tually extend the effective size of ROB LSQ as wellas RF although ERPR only does this on RF) RF inNHM and HSW architectures is more effective on CPUstall than in SLM architecture therefore ERPR out-

performs ROOC regarding performance improvementin these two architecture

By focusing only on ERPR we observe that thewider the core the more effective is the is increasingcommit-depth (wider commit-depth covers more readyinstructions to commit) For example in the case ofSLM although widening the commit-depth releases ad-ditional physical registers earlier than general thesefreed registers are not effective anymore because theCPU stalls are due to limits in other resources (likeROB IQ or LSQ) As NHM and HSW have more re-sources increasing the commit-depth is more effectivein those architectures In the case of AOOC increas-ing the commit-depth is more effective than ROOCand ERPR because it enables additional instructionsto be committed out-of-order (because it is active onevery cycle compare to ROOC and also affects all ofthe resources compare to ERPR that only affects theRF) In addition we can see that the more aggressivethe CPU the more effective is increasing the commit-depth This is because wider cores execute and com-plete instructions that are far from the head of ROBearlier with their additional resources(either providedby the baseline or released by the OOOC) and theycan therefore provide more instructions to commit ifthe commit-depth is increased

62 Analysis of Out-of-Order Conditions

Figure 13 shows the potential for performance improve-ment from relaxing the out-of-order commit conditionsacross the three architectures for both safe and unsafeout-of-order-commit As has been seen previously [2 521] the least aggressive out-of-order commit configu-ration (safe_OOC with a commit depth of 4) is mostbeneficial for the least aggressive out-of-order processorSLM (compare the left-most bars of Figure 13(a-c))This is expected as the SLM processor has the least ex-ecution hardware and therefore suffers from more stallsthan the more aggressive processors

Increasing the commit depth from 4 to 8 (secondgroup of bars in Figure 13(a-c)) shows that the more ag-gressive out-of-order processors (NHM and HSW) nowbenefit more from out-of-order commit than the sim-pler SLM architecture This new result shows the inter-action between the aggressiveness of the out-of-orderexecution and out-of-order commit for more aggres-sive out-of-order execution there is more benefit frommore aggressive out-of-order commit The reason is thataggressive architectures appear to have more unusedresources available for computation This trend con-tinues through the more aggressive out-of-order com-mit modes as well with unsafe optimizations (Figure

16 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

4 8 16 320

20406080

100

Perfo

rman

ceim

prov

emen

t () (a) SLM-SAFE

4 8 16 32 128Commit Depth

020406080

100

(b) NHM-SAFEAOOC ROOC ERPR

4 8 16 32 128 1920

20406080

100

(c) HSW-SAFE

4 8 16 320

20406080

100

(d) SLM-UnsafeLD

4 8 16 32 1280

20406080

100

(e) NHM-UnsafeLD

4 8 16 32 128 1920

20406080

100

(f) HSW-UnsafeLD

4 8 16 320

20406080

100

(g) SLM-UnsafeBR

4 8 16 32 1280

20406080

100

(h) NHM-UnsafeBR

4 8 16 32 128 1920

20406080

100

(i) HSW-UnsafeBR

4 8 16 320

20406080

100

(j) SLM-UnsafeST

4 8 16 32 1280

20406080

100

(k) NHM-UnsafeST

4 8 16 32 128 1920

20406080

100

(l) HSW-UnsafeST

4 8 16 320

20406080

100

(m) SLM-UnsafeEXC

4 8 16 32 1280

20406080

100

(n) NHM-UnsafeEXC

4 8 16 32 128 1920

20406080

100

(o) HSW-UnsafeEXC

4 8 16 320

20406080

100

(p) SLM-Unsafe

4 8 16 32 1280

20406080

100

(q) NHM-Unsafe

4 8 16 32 128 1920

20406080

100

(r) HSW-Unsafe

Fig 13 Comparison between AOOC ROOC and ERPR in different commit-depth and microarchitecture setupin six different modes UnsafeXX is equivalent to (enforcing) all out-of-order commit conditions except conditionXX By relaxing the specific XX condition the dependence between other conditions is also observed

13(d-r)) benefiting the more aggressive out-of-order ex-ecution designs (NHM and HSW) more than the sim-pler one (SLM) The reasons for this are similar tothe safe case as the more aggressive processors arefaster to achieve the first out-of-order condition in-struction completion and therefore providing greatercommit depths (eg changing the commit depth from 4to 8 in Figure 13 (a-c)) results in a larger likelihoodof finding instructions to commit out-of-order Table 5

uses this analysis to rank the relative importance ofeach out-of-order commit condition for the three ar-chitectures This information provides a guideline forwhich areas to work on to achieve the best performanceimprovement depending on the baseline out-of-order ex-ecution

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 17

Conditioncommit-depth 4 8 16 32 128 192

Unsafe_LD 1 2 2 2 NA NAUnsafe_BR 2 1 1 1 NA NAUnsafe_ST 4 4 4 4 NA NAUnsafe_EXC 3 3 3 3 NA NA

(a) SLMConditioncommit-depth 4 8 16 32 128 192

Unsafe_LD 1 2 2 2 2 NAUnsafe_BR 3 1 1 1 1 NAUnsafe_ST 4 4 4 4 4 NAUnsafe_EXC 2 3 3 3 3 NA

(b) NHMConditioncommit-depth 4 8 16 32 128 192

Unsafe_LD 4 2 2 2 2 2Unsafe_BR 1 1 1 1 1 1Unsafe_ST 3 3 4 4 4 4Unsafe_EXC 2 4 3 3 3 3

(c) HSW

Table 5 Ranking of the benefits of out-of-order commitconditions in different microarchitecture based on thedepth of commit

7 Related Work

The goal of this work is to provide a detailed under-standing into the potential for performance benefitsacross different out-of-order commit conditions and lev-els of aggressiveness While this work aims to describethe maximum potential benefit for each individual con-dition there have been many previous works that de-scribe hardware solutions for early release of hardwarestructures and out-of-order commit strategiesBelowwe provide an overview of these works and how theyfit into the categories described in this work

71 Speculative Release of Hardware StructuresA number of implementations require register and pro-cessor state checkpointing support to speculatively re-tire or release hardware structures [12 11 22 1 17]

Processor state checkpointing especially with theadvent of very large SIMD registers such as AVX-512registers which now support up to 32 registers with upto 512 bits per register can require a significant amountof state to be saved when speculation is aggressive

72 Non-speculative Structure Release

Non-speculative early release of hardware structures be-fore commit requires knowledge that no older instruc-tion (that has come earlier in the instruction stream)can cause the program to abort raise an exception orrequire exposure of the architected state at that timeNon-speculative solutions have the potential to be themost energy efficient a necessity in an era of the end ofDennard Scaling for power-limited platforms A range

of solutions from hardware-only to software-assistedsolutions are described below

Compiler support Two previous works [1 13]allow for early commit or reclaim of resources basedon compiler knowledge but require software recompi-lation

Checkpoint-free A number of solutions do notuse checkpoints [5 21 13] and therefore do not requirespeculation (they respect the all commit conditions) orused software help to extend knowledge to the hard-ware

Selective early commit Early commit of loads [1415] allows for loads that have not yet received data fromthe memory hierarchy to become part of the commit-ted state of a processor This allows the processor tocontinue to process instructions past normally blockedstructures improving performance for memory-boundworkloads To accomplish this the authors [15] de-couple page faults that can occur from the fetching ofdata While recompilation is not strictly required forthis technique the authors evaluate their work using acompilation strategy to expose loads early Late alloca-tion and early release of physical registers is proposedas a register renaming scheme in [23] They explored theeffect of late allocation and early release of physical reg-ister on the performance of an 8-way superscalar pro-cessor without the need for additional speculation andmaintains precise exceptions The out-of-order commitmethods evaluated in this work overlap with the earlyrelease of registers while also evaluating the early re-lease of other structures (like the LSQ)

Evading out-of-order commit conditions[24] provides a coherency protocol level solution suchthat the loadrarrload reordering can be evaded non spec-ulatively under TSO In their design it is no longer nec-essary to squash and re-execute speculatively reorderedloads if their reordering can be hidden by the coherencyprotocol This allows speculatively reordered loads thatotherwise adhere to the rest of the Bell-Lipasti condi-tions to be committed out-of-order without affectingthe memory model

8 Conclusion

To obtain higher performance extending the reach ofthe processor core has been a focus in much of microar-chitecture research One promising direction is the useof out-of-order commit which releases precious proces-sor resources early to allow the processor to extend itsreach past typical hardware limits In this work wepresent a limit study for out-of-order commit throughthe introduction of reluctant and aggressive out-of-order

18 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

commit modes We show how smaller processors evenwith a limited commit scan depth can benefit fromout-of-order commit strategies but that larger aggres-sive cores require deeper commit scan depths to achieveimproved performance In addition we provide a de-tailed breakdown of the contributions for each out-of-order commit condition for the SPEC CPU2006 bench-mark suite and compare against similar works suchas early release of physical registers Our results showa very high potential for performance improvementabove 225x for some benchmarks and believe that out-of-order commit strategies can play an important rolefor future energy-efficient and high-performance proces-sor designs

References

1 Afram F Zeng H Ghose K A group-commitmechanism for ROB-based processors implement-ing the x86 ISA In HPCA pp 47ndash58 (2013)

2 Alipour M Carlson TE Kaxiras S Exploringthe performance limits of out-of-order commit InCF pp 211ndash220 (2017)

3 Allan A Edenfeld D Joyner Jr WH KahngAB Rodgers M Zorian Y 2001 technologyroadmap for semiconductors Computer 35(1) 42ndash53 (2002)

4 Badalone R Dramrsquos surprising role in the costof data centers httpwwwdatacenterknowledgecomarchives20151112dont (2015)

5 Bell GB Lipasti MH Deconstructing commitIn ISPASS pp 68ndash77 (2004)

6 Binkert N Beckmann B Black G ReinhardtSK Saidi A Basu A Hestness J Hower DRKrishna T Sardashti S Sen R Sewell KShoaib M Vaish N Hill MD Wood DA Thegem5 simulator SIGARCH Comput Archit News39(2) 1ndash7 (2011)

7 Carlson TE Heirman W Allam O Kaxiras SEeckhout L The load slice core microarchitectureIn ISCA pp 272ndash284 (2015)

8 Chou Y Fahs B Abraham S Microarchitec-ture optimizations for exploiting memory-level par-allelism In ISCA pp 76ndash88 (2004)

9 Corporation I Intel Rcopy 64 and ia-32 ar-chitectures optimization reference manualhttpwwwintelcomcontentwwwusenarchitecture-and-technology64-ia-32-architectures-optimization-manualhtml(2016)

10 Corporation I Intel Rcopy intelrsquos rsquotick-tockrsquo seeminglydead becomes rsquoprocess-architecture-optimizationrsquohttpwwwanandtechcomshow10183

intels-tick-tock-seemingly-dead-becomes-process-architecture-optimization (2016)

11 Cristal A Ortega D Llosa J Valero M Out-of-order commit processors In HPCA pp 48ndash59(2004)

12 Cristal A Santana OJ Valero M MartiacutenezJF Toward kilo-instruction processors ACMTrans Archit Code Optim 1(4) 389ndash417 (2004)

13 Duong N Veidenbaum AV Compiler-assistedselective out-of-order commit IEEE Computer Ar-chitecture Letters 12(1) 21ndash24 (2013)

14 Gwennap L Digital leads the pack with 21164 InMicroprocessor Report 8(12) pp 249ndash260 (1994)

15 Ham T Aragoacuten JL Martonosi M Desc De-coupled supply-compute communication manage-ment for heterogeneous architectures In MICROpp 191ndash203 (2015)

16 Henning JL SPEC CPU2006 benchmark descrip-tions SIGARCH Comput Archit News 34(4) 1ndash17 (2006)

17 Hilton A Roth A BOLT Energy-efficient out-of-order latency-tolerant execution In HPCA pp1ndash12 (2010)

18 Jaleel A Memory characterization of workloadsusing instrumentation driven simulation httpwwwglueumdeduajaleelworkload (2010)

19 Kroft D Lockup-free instruction fetchprefetchcache organization In ISCA pp 81ndash87 (1981)

20 Lee D Kim Y Seshadri V Liu J Subrama-nian L Mutlu O Tiered-latency dram A lowlatency and low cost dram architecture In HPCApp 615ndash626 (2013)

21 Marti S Borras J Rodriguez P Tena RMarin J A complexity-effective out-of-order re-tirement microarchitecture IEEE Transactions onComputers 58(12) 1626ndash1639 (2009)

22 Martinez JF Renau J Huang MC PrvulovicM Cherry Checkpointed early resource recyclingin out-of-order microprocessors In MICRO pp3ndash14 (2002)

23 Monreal T Vinals V Gonzalez J GonzalezA Valero M Late allocation and early release ofphysical registers IEEE Trans Comput 53(10)1244ndash1259 (2004)

24 Ros A Carlson TE Alipour M Kaxiras SNon-speculative load-load reordering in TSO InISCA pp 187ndash200 (2017)

25 Smith JE Pleszkun AR Implementing preciseinterrupts in pipelined processors IEEE Transac-tions on Computers 37(5) 562ndash573 (1988)

26 Sohi GS Vajapeyam S Instruction issue logicfor high-performance interruptable pipelined pro-cessors In ISCA pp 27ndash34 (1987)

  • Introduction
  • Out-of-Order Commit
  • Methodology
  • Out-of-Order Commit Evaluation
  • Memory Parallelism
  • Early Release of Physical Registers vs Out-of-order Commit
  • Related Work
  • Conclusion
Page 5: Maximizing Limited Resources: A Limit-based Study and ...tcarlson/pdfs/alipour2018mlralsatoo… · Maximizing Limited Resources: A Limit-based Study and Taxonomy of Out-of-order Commit

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 5

3 it creates a significant management problem as out-of-order commit can create gaps in several struc-tures including the ROB and also the load queueand store queue (which is not completely addressedin prior works [21 22 5])

Commit Width and DepthThe final parameter to explore within the context

of out-of-order commit is the commit depth to scanfor potential instructions to commit out-of-order Whilecommit width is the number of instructions that can becommitted simultaneously per cycle the commit depthis the measure of how far the core can scan looking forinstructions to commit out-of-order in a given cycle

23 OOC Conditions

In Section 3 we describe the methodology used to com-mit instructions out-of-order With this methodologywe can selectively relax the commit conditions describedin Subsection 21 (except completion) while still guar-anteeing correct execution For example by relaxingthe branch condition (Unsafe_BR) ldquocommitting onlydown a non-speculative pathrdquo we can continue to freeresources past unresolved branches but effectively onlycommit from the correct path In this paper we donot evaluate the implementation required to relax theseconditions but instead evaluate the potential for per-formance improvement We evaluate the maximum po-tential performance improvement with a speculative out-of-order rollback cost of zero In this section we providedetails on the performance implications of the relax-ation of a single and combination of conditions

231 Instruction complete

The core waits for an instruction to finish executing be-fore commit can occur We do not examine early com-mit of loads [14 15] that miss in the cache and insteadwe consider them available for commit only after thedata returns and is bound to the destination register

232 Memory replay traps (safe_ST and safe_LD)

We describe two sub-cases for this conditionStore-Load (safe_ST) This condition applies to

same-thread memory dependencies involving a store anda load In particular we cannot commit a load out-of-order in the presence of a prior store with an un-resolved address If the store and the load prove tobe dependent (the load should have taken the valueof the store) the commit would have been incorrectThe LD condition disallows the commit of a load and

its dependent instructions until all prior stores resolvetheir addresses and all the memory dependencies arecorrectly enforced By relaxing this condition we cancommit loads and their dependent instructions even ifprior non-aliasing stores have unresolved addresses

Load-Load (safe_LD) This concerns memory con-sistency models that enforce loadrarrload ordering (egSequential Consistency or TSO) Under this orderingconstraint it is possible to allow loads out-of-order aslong as this is not observed in the memory system Thesafe_LD condition disallows the out-of-order commitof loads unless it is guaranteed that the correct or-der will be observed by the memory system To relaxthis condition we allow loadrarrload re-orderings that arenot observed by other cores A very specific case wouldbe a memory mapped IO (MMIO) request that mightchange the order of memory operations The MMIOcase acts as a rsquocoprocessorrsquo meaning that we have amulti-processor system here We ignore memory requestsfrom other cores (IO coprocessor)

233 WAR hazards

WAR hazards are already handled by the out-of-ordercore within the ROB and we assume a solution suchas the Value Buffer [21] for committing out-of-orderThus we do not consider this condition further

234 Unresolved Branches (safe_BR)

This condition guarantees that we commit only fromthe correct path of execution Out-of-order commit shouldnot proceed past unresolved branches until they are cor-rectly resolved We can relax this condition and commitpast an unresolved branch if we are able to undo thecommit To evaluate maximum performance potentialwe assume a zero rollback cost for out-of-order commitmisspredictions However the normal branch mispre-diction cost (10 cycles) is faithfully accounted (see Sub-section 49 for more details) In addition we evaluatethe rollback count for this condition in Subsection 49

235 Exceptions (safe_EXC)

This condition caters to precise interrupts Enforcingthis condition requires that we do not commit past aninstruction (floating-point memory access or any in-struction that may cause an exception) unless we makesure that the instruction will not cause a exception Torelax this safe_EXC condition we assume the code re-gions are exception free

6 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

Fig 2 Conceptual comparison of in-order and out-of-order commit (commit depth=4) C Ready to commitX Not ready to commit E Executing I Issued OEmpty entry

Fig 3 Functionality of AOOC and ROOC In thisexample AOOC commits one more instruction thanROOC (commit depth=4) C Ready to commit XNot ready to commit E Executing I Issued OEmpty entry

24 Safe and Unsafe Out-of-Order Commit

Normally discussion of out-of-order commit tends tofocus on the changes that occur in the back end ofthe pipeline However the purpose of out-of-order com-mit is to enable the front end to proceed The per-ception that out-of-order commit increases performancecan also be misleading instruction execution is not spedup it is the removal of conditions that stall the frontend that increases performance More specifically inan unconstrained architecture (unrestricted ROB reg-isters and loadstore queue entries) safe out-of-ordercommit does not perform faster than in-order commitFor restricted real-world implementations Safe_OOChelps to ameliorate this problem

In contrast Unsafe_OOC has the capacity to ex-ceed the performance of an unconstrained in-order com-mit architecture as it removes penalties needed to guar-antee correctness (eg correctly handling memory de-

pendencies correctly enforcing memory consistency or-dering etc) in situations that the hardware does nothave any other means of imposing such correctness Inthis case Unsafe_OOC has the potential to violate cor-rectness and requires a means to revert back to a safestate if a violation occurs This can be a good trade-offwhen the conditions that violate correctness are rareWe implement an oracle version of Unsafe_OOC inthat it violates the correctness conditions because itwill be safe to do so (and therefore no correctness prob-lems will occur) This removes the need to provide themechanisms to revert to a safe state in the case of a mis-speculation This model provides a best-case speedup asrecovery from misspeculation is zero cost

25 Aggressive and Reluctant OOC

Orthogonal to the enforcement of the out-of-order com-mit conditions the aggressiveness of out-of-order com-mit plays a significant role in the resulting performanceand the cost it incurs We introduced two approachesfor out-of-order commit Aggressive (AOOC) and Re-luctant (ROOC) out-of-order commit A good way todescribe them is to contrast their main difference howoften each mechanism is engaged

AOOC is engaged all the time ie it constantlytries to find instructions to commit out-of-order if theopportunity arises In this respect it aims to increasecommit bandwidth

In contrast ROOC is only engaged when one of thecritical resources in the core (ROB entries registersLoadstore queue entries) is about to be exhaustedROOC is concerned about front end stalls not aboutcommit bandwidth This means that the number of in-structions that ROOC needs to find that can commitout-of-order is limited ROOC needs to provide enoughfree entries in the ROBregistersLSQ so that the frontend can dispatch as many instructions as possible (up tothe dispatch width) to the out-of-order engine Figure 2shows this with an example The reason that the com-mit stage of an in-order commit is blocked is becauseof an unresolved instruction at the head of ROB Forexample given a four-wide superscalar shown in Fig-ure 2 tries to commit four instructions per cycle Whilethe first instruction at the head of ROB can committhe second oldest instruction causes the commit stageto block and instead of four only one instruction com-mits (in-order)

The behavior of out-of-order commit depends on itsaggressiveness

AOOC AOOC attempts to find up to commit-width instructions so it can satisfy the need to commit

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 7

Table 1 Baseline core parameters

Full system simulation 34 GHz x86-64L1id 32KiB 8-way 4clkL2 256KiB 8-way 12clkL3 1MiB 8-way 36clkDRAM 200clkBranch Predictor Tournament front end penalty 10clkPrefetcher OFF(default)ON Stride degree 8

Table 2 Microarchitecture configuration with reorderbuffer (ROB) instruction queue (IQ) load and storequeues (LQSQ) and integer and floating-point regis-ter file (RF) details Dispatch width (D) commit width(CW) and Out-of-Order commit depth (CD) is thesame to enable a fair comparison (SLM hardware hasa DCW of 2) Register values are additional physicalregisters above architectural state

Microarchitecture DCWCD ROB IQ LQSQ RF(INTFP)Silvermont-like (SLM) 448 32 32 1016 3232Nehalem-like (NHM) 448 128 56 4836 6868Haswell-like (HSW) 448 192 60 7242 130130

at the highest possible commit bandwidth In our ex-ample it aims to find four instructions However sinceonly one more is needed to fill the gap in the head groupof four leading instructions AOOC produces an excessof commit-ready instructions that can potentially beexploited in the near future

ROOC ROOC on the other hand aims to findthe minimum number of instructions needed to commit(out-of-order) so that no resource is exhausted and thefront end can continue to issue instructions at its peakbandwidth The reason for seeking the minimum num-ber of instructions to commit out-of-order is that thisminimizes the perturbations in instruction order Thiscould potentially lead to more efficient hardware im-plementations While Figure 3 only considers the ROBin the general case all dispatch-related resources areconsidered in the same way In our particular exam-ple ROOC needs to find just one instruction to com-mit out-of-order The ROB already contains two emptyslots and the in-order commit mechanisms can find oneinstruction to commit leaving one left for ROOC

In contrast in AOOC mode there are in total threeinstructions that commit (instructions 1 3 and 4in Figure 3) The result is that AOOC needs to scandeeper than ROOC

3 Methodology

We use the gem5 [6] simulator in full system mode tosimulate an x86-64 target with a frequency of 34GHzTo test our models we use the SPEC CPU2006 bench-mark suite We use ten uniformly distributed check-points from each benchmark Each simulation check-

point has three phases that begin with 250M instruc-tions of cache warm-up followed by 100k instructions ofdetailed pipeline warm-up and ends with a detailed sim-ulation of 100M instructions Furthermore three dif-ferent configurations similar to three conventional out-of-order processors are used Intelrsquos SLM NHM andHSW [9] microarchitectures Tables 1 and 2 list the de-tailed configuration of the simulation environment

To implement our out-of-order commit model wefirst configure a simulated machine with a very largenumber of core resources We then monitor the numberof committed and non-committed instructions that ap-pear in the pipeline and control at dispatch whetherwe can support additional instructions in the back endof the processor In this way we dynamically determinethe resource availability in the processor for each cyclebased on the out-of-order commit conditions

4 Out-of-Order Commit Evaluation

In this section we analyze the benefits of out-of-ordercommit on the performance of a number of applica-tions We look at how the commit bandwidth changeswith out-of-order commit and how the effective re-source size of each critical component changes as weenable different out-of-order commit conditions Nextwe show how out-of-order commit conditions affect per-formance Safe_OOC and Unsafe_OOC are two ex-treme points that are defined by either enabling (re-specting) or disabling all of the out-of-order commitconditions This results in a minimum and maximumpotential performance improvement across the bench-mark suite To understand the effect of each conditionin isolation we study the effect of each one both in thepresence and absence of other conditions These stud-ies on Safe_OOC and Unsafe_OOC conditions wereconducted for both Aggressive (AOOC) and Reluctant(ROOC) out-of-order commit

41 Microarchitecture Aggressiveness

We target three microarchitectures resembling IntelrsquosSilvermont (SLM) Nehalem (NHM) and Haswell (HSW)as small medium and large cores (See Table 2 for de-tails) As an overview Figure 1 shows the performanceimprovement for each microarchitecture assuming Safe-OOC (all conditions respected) for all benchmarks onaverage across SPEC CPU2006 [16] We can see thatin the case of narrow commit depth (four in this fig-ure) a relatively small out-of-order processor (SLM)has more potential for relative improvement comparedto the medium and aggressive microarchitectures The

8 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

SLM SLMOOC

NHMNHMOOC

HSWHSWOOC

20406080

100

Cycle

s ()

4 committed3 committed

2 committed1 committed

0 committed

Fig 4 Commit bandwidth distribution for the SPECCPU2006 benchmarks of a 4-wide core with in-ordercommit and out-of-order commit respecting all commitconditions (Safe_OOC) Out-of-order commit increasescommit pressure even without aggressive speculation

reason is that the smaller processor (with a shorter in-struction reach and given a balanced design smallerhardware structures) will more likely stall as it exposesa smaller amount of the potential ILP in an applica-tion In case of a larger commit depth the more ag-gressive cores (NHM and HSW) have higher potentialperformance improvement (See Section 44 for a de-tailed overview) Out-of-order commit frees the proces-sor from the traditional limits reducing the number oftimes the processor experiences exhausted resources Inmedium and large aggressive cores thanks to a largerROB as well as other hardware resources more intrinsicILP is extracted by traditional in-order commit leav-ing less potential for out-of-order commit with a narrowcommit depth

42 The Effect on Commit Bandwidth

Figure 4 shows the distribution of the number of com-mitted instructions per cycle for three different microar-chitectures for both in-order and out-of-order commitacross all SPEC CPU2006 benchmarks Although an in-order commit 4-wide commit HSW microarchitecturecan retire up to four instructions per cycle this occurson average less than 20 of the time In practice wesee a large number of commit stage stalls (zero instruc-tions committed per cycle) For out-of-order committhe distribution shifts toward four instructions per cy-cle reflecting the improved commit performance (andthe resulting improvement in overall performance) Fi-nally this figure shows that for smaller less aggressivecores (such as SLM see Table 2) out-of-order commitprovides a relatively larger improvement compared tothe other microarchitectures because of the short com-

LQ SQ IQ RF ROB00

05

10

15

20

Norm

alize

d effective siz

e

IOC Safe_OOC Unsafe_OOC

Fig 5 Effective resource use in in-order and Aggressiveout-of-order commit across SPEC CPU2006

mit depth of four Larger cores improve more with moreaggressive commit depths and out-of-order commit pa-rameters (see section 44 for details)

43 The Effect on Resources

One of the main issues that out-of-order commit canhelp resolve is the early release (and subsequent reuse)of hardware resources that otherwise would still be re-quired to maintain the in-order state in an in-ordercommit processor Therefore with a small ROB (andother appropriately sized structures) given in-order com-mit the core can more easily stall when it runs out ofresources For example consider a ROB size of 32 froma SLM microarchitecture When the ROB is full it con-tains 32 micro-operations in flight and the CPUrsquos backend will no longer accept additional micro-operationsfrom the front end Increasing the physical size of theROB along with other resources is one potential butrather expensive solution to this issue An alternativesolution and one of the benefits of using out-of-ordercommit is the early release of resources This early re-lease increases the effective size of a resource (comparedto an in-order commit core) improving performance ofthe core

Benchmark specific analysis The xalan bench-mark is one of the top 5 benchmarks with a large num-ber of CPU stalls caused by exhausted resources BothAOOC and ROOC are very effective for the xalan bench-mark OOC is able to provide additional free entries inthe ROB RF and LSQ (see Figure 7)

On the other hand leslie3d has the lowest num-ber of CPU stalls based on exhausted resources whichlimits the potential for improvement with OOC

Figure 5 compares the effective size of the SLM mi-croarchitecture between in-order commit and both safeand unsafe aggressive out-of-order commit (AOOC) mod-els for the SPEC CPU2006 benchmarks The resultsare normalized to a fixed size SLM_IOC microarchitec-ture (see Table 2) An effective size of 10 translates tofrequent stalls due to resource exhaustion while sizes

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 9

4 8 16 32 128192

(a) SLM

20

40

60

80

100

120

Perfo

rmance

improvem

ent ()

4 8 16 32 128192

Commit depth

(b) NHM

20

40

60

80

100

120

Safe_OOC Unsafe_OOC

4 8 16 32 128192

(c) HSW

20

40

60

80

100

120

Fig 6 Effect of commit depth on performance improve-ment normalized to their respective in-order commitperformance across SPEC CPU2006 The smaller corereceives less benefit from increasing the depth of com-mit Safe_OOC saturates at a commit depth of 8 16and 32 for SLM NHM and HSW respectively (the dis-tance to the first unresolved instruction from the headof the ROB) There is no saturation in performance im-provement of Unsafe_OOC while the commit depth isincreased

greater than 10 shows the effective resource size in-creases due to OOC

For aggressive out-of-order commit the larger effec-tive sizes show the reach of this technique We see thatthe utilization of all structures except for the ROB isalmost the same for safe and unsafe out-of-order com-mit which allows Safe_OOC to achieve most of theperformance of the unsafe version Nevertheless the un-safe core is much better at improving the reach of theROB allowing applications like hmmer continue to showa benefit when moving from safe to unsafe out-of-ordercommit See Section 452 for more details on hmmer

44 Evaluation of Commit Depth

To gauge the effect of the commit depth (ie how far wescan the ROB to find instructions that can commit out-of-order) we impose a hard limit on it and evaluate theeffects on the resulting performance The strictest limitis the commit-width itself starting to commit out-of-order from the first commit-width instructions We thenrelax this to the immediate vicinity (eg double thecommit-width) and progressively relax until we reachthe size of the ROB

In Figure 6 we see that a commit depth of 4 (equalto commit width) provides the smallest benefit but alsothe smallest difference between Safe_OOC and Un-safe_OOC In addition the SLM core benefits morefrom OOC compared to NHM and HSW when com-mit depth is smaller than 8 At a commit depth of 8and above the large aggressive cores benefit much more

than the smaller core type with the maximum improve-ment of Unsafe_OOC at 80 119 and 129 for SLMNHM and HSW respectively For aggressive cores thelarger commit depth allows for continued performanceimprovements and will be necessary for the design of abalanced aggressive out-of-order commit design

45 Out-of-Order Commit Performance

In this section we analyze the out-of-order commit con-ditions to determine the minimum (and maximum) per-formance improvement potential The minimum andmaximum improvement is provided by Safe_OOC andUnsafe_OOC respectively In Figure 7 we show theamount of improvement provided by both safe and un-safe out-of-order commit for both aggressive and reluc-tant modes for all three microarchitectures

451 Safe_OOC

Honoring all out-of-order commit conditions results in amodest performance improvement One implication forfuture processors is that Safe_OOC does not requireadditional support for speculative out-of-order rollbackrecovery mechanisms it requires support for the com-mit of instructions out-of-order and the freeing of struc-tures for use by future instructions

Safe_AOOC When Safe_AOOC is evaluated onthe SLM microarchitecture (see Figure 7a) the rangeof improvement spans from a low of 3 (leslie3d) upto 82 (xalan) with an average of 44 In the NHMmicroarchitecture the improvement ranges from 9 to108 for milc and xalan with an average improvementof 57 and for HSW we see an improvement of 1for mcf to 110 for wrf with an average of 55 (SeeSection 44 for more details)

Safe_ROOC For Safe_ROOC the performanceimprovement is lower for all three microarchitecturescompared to AOOC (See Subsection 25 for additionaldetails) In the SLM microarchitecture the range ofperformance improvement of safe ROOC is from 2 to55 for calculix and gcc respectively with an averageimprovement of 20 In the NHM microarchitecturewe see performance improvements that range from 1(mcf) to 22 (bwaves) with an average improvement of7 (8 for HSW) Because NHM and HSM have fewerCPU stalls ROOC has less of an effect on performancecompared with the smaller more efficient core

452 Unsafe_OOC

By relaxing all conditions Unsafe_OOC provides themaximum potential for performance improvement Un-

10 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

020406080

100120140

Perfo

rman

ceim

prov

emen

t () a) AOOC across SpecCPU

SLM_UnsafeSLM_Safe

NHM_UnsafeNHM_Safe

HSW_UnsafeHSW_Safe

astarbwaves

bzip cactusadm

calculixdealii

gamessgcc gemsfdtd

gobmkgromacs

h264refhmmer

lbm leslie3dlibquantum

mcf milc namdomnetpp

perl povraysjeng

soplexsphinx3

tontowrf xalan

zeusmpSpecCPU-avg

020406080

100120140

b) ROOC across SpecCPU

Fig 7 IPC improvement of safe and unsafe out-of-order commit relative to in-order commit as a baseline for bothreluctant and aggressive versions applied on SPEC CPU2006 benchmarks on three microarchitectures

safe_OOC will require recovery mechanisms for thesetechniques which can reduce the performance potentialbecause of recovery costs

To understand the effectiveness of all conditions to-gether we consider zero cost for recovery for any mis-speculated out-of-order commit condition

Unsafe_AOOC This technique provides the high-est performance improvement for all three different ar-chitectures In SLM cores the improvement ranges from9 to 120 for libquantum and astar applicationsrespectively and the average is 59 In NHM archi-tecture the average is 72 and the range is betweenmilc and astar respectively with 11 and 196 im-provement In HSW cores similar to NHM cores astarhas the maximum benefit from Unsafe_ooc with 192improvement while the minimum improvement is fordealii at 7 The average improvement for this classof architecture is 70

Unsafe_ROOC Reluctant out-of-order commit islower performing because it is not continuously lookingto commit additional instructions (See Section 25 fordetails) In the case of SLM the improvement rangesfrom 3 to 93 respectively for leselie3d and xalanwith an average of a 38 improvement In NHM mcfand bwaves with 1 and 22 show the maximum andthe minimum improvement with an average of 7 (HSWis similar with an average of 8) Between the threemicroarchitectures the limited SLM benefits the mostfrom ROOC because of the large number of stalls seenby this core Therefore ROOC especially Unsafe_ROOCis an interesting methodology to improve the perfor-mance of relatively small but energy efficient CPUs aswe see a relatively high performance improvement fora less aggressive commit implementation

Benchmark-specific Analysis The hmmer bench-mark is a particularly strong case for the benefits of out-of-order commit for SLM This application is L1-cacheresident and exhibits very few last-level cache missesNevertheless we still see a very strong improvement inperformance from 33 to 52 for Safe_OOC increas-ing to 51 to 66 for Unsafe_OOC Looking aheadto Figure 8 and Figure 11 we can see that evadingthe branch condition provides the most benefit for thisapplication Making room for additional instructions toallow the hardware to expose additional ILP works welleven for those applications without a significant num-ber of LLC misses The mcf benchmark contains load-dependent branches has the highest MPKI (misses perkilo instructions) among the benchmarks and thereforeit has a rather low IPC when it is executed on an in-order commit CPU This results in a good opportunityto improve performance as it is extremely limited bythese misses

46 Performance Effects of Commit Conditions

In the previous section by analyzing safe and unsafeout-of-order commit we observe that there is a largegap between the performance improvement of these twoimplementations Understanding the cause of this per-formance improvement (by looking at individual com-mit conditions in isolation) allows us to better under-stand where to focus future hardware efforts

461 Positive Contribution of Out-of-Order CommitConditions

To study the gap between safe and unsafe out-of-ordercommit (Figure 7) we analyze the effect of relaxing

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 11

020406080

100

Contrib

ution in

improv

emen

t () a) SLM-AOOC across SpecCPU

Safe Unsafe_LD Unsafe_ST Unsafe_BR Unsafe_EXC

020406080

100b) AOOC SpecCPU average

astarbwaves

bzip cactusadm

calculixdealii

gamessgcc gemsfdtd

gobmkgromacs

h264refhmmer

lbm leslie3dlibquantum

mcf milc namdomnetpp

perl povraysjeng

soplexsphinx3

tontowrf xalan

zeusmp

020406080

100c) SLM-ROOC across SpecCPU

SLM-avgNHM-avg

HSW-avg

020406080

100d) ROOC SpecCPU average

Fig 8 Contribution of safe and selectively unsafe out-of-order commit on three different microarchitecturesUnsafe_XX is equivalent to activating (enforcing) all out-of-order commit conditions except XX (the XX conditionis relaxed) By relaxing the specific XX condition the dependence between other conditions is also observed

each condition in the presence of the other preservedconditions in Figure 8 We analyze the SLM microar-chitecture in detail and provide averages across all mi-croarchitectures for both AOOC and ROOC Each out-of-order commit condition in analyzed in isolation andwe consider Unsafe_OOC (all relaxed conditions) asthe 100 potential improvement budget In the case ofthe mcf benchmark in Figure 7a the safe and unsafeOOC performance improvement is 33 and 71 re-spectively (46 of the potential improvement budget isprovided by Safe_OOC) We also observe that by re-laxing the LD condition (unsafe_LD) 52 of potentialimprovement budget is achievable (see Figure 8a) InFigure 8 we can see in some applications (like namd inAOOC mode and leslie3d in ROOC mode) that re-laxing just a single condition is not sufficient to fill thegap between safe and unsafe OOC This does not meanthat a single condition is not important but rather thatother preserved conditions are preventing out-of-ordercommit from achieving its full potential

AOOC We observe that for most of applicationsUnsafe_BR and Unsafe_LD are the most interestingconditions (Figure 8a) Additionally the more aggres-sive the core the more important the Unsafe_LD con-dition becomes In SLM NHM and HSW CPUs Un-safe_LD respectively fills 4 10 and 12 and Un-safe_BR fills 9 8 and 7 of the gap between safeand unsafe OOC Unsafe_ST is not very effective be-cause of the rarity of this condition and the conserva-tive memory dependence predictor used Unsafe_EXCor relaxing exceptions are not that effective becausethey are very rare especially in integer benchmarks

ROOC Relaxing OOC conditions in ROOC hasless effect in reducing the gap between safe and un-safe OOC and this is because of the nature of this on-demand OOC mode which is enabled and needed more

in SLM and much less in NHM and HSW (see Fig-ure 7b)

Benchmark-specific Analysis The astar andgobmk benchmarks have the highest number of mis-predicted branches per 1000 instructions Therefore re-laxing the branch condition (unsafe_BR) improves theperformance of these two benchmarks by 68 and 94respectively On the other hand benchmarks such ascactusadm and lbm have a low branch mispredictionrate (05) and therefore relaxing the branch condi-tion does not show significant improvement for thesebenchmarks

The sphinx benchmark has high degree of intrinsicILP2 thus an increase in the number of effective re-sources allows this benchmark to improve performanceAs most of its load instructions are L2-cache missesrelaxing the load condition (Unsafe_LD) improves per-formance of this benchmark

462 Negative contribution of OOC conditions

In this section we analyze the gap between safe andunsafe out-of-order commit from a different angle

We relax all of the out-of-order commit conditionsexcept for one For example safe_LD means the LDcondition is preserved but ST BR and EXC conditionsare relaxed By preserving one of the conditions we lookat the negative effect (performance reduction) of theactivated condition compared to Unsafe_OOC

Figure 11 depicts the effect of each condition onperformance For most of the benchmarks the BR andthe LD conditions are the most effective ones Amongfloating-point benchmarks LD and EXC conditions havea large impact on performance Therefore relaxing theEXC condition as it is rare could lead to significant

2 The sphinx benchmark continues to show performance im-provements as the in-order commit processor aggressiveness in-creases from SLM to HSW

12 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

200 400 600 800 1000DRAM latency (cycle)

020406080

100120140160

Perfo

rman

ceim

prov

emen

t (

)Safe_OOC Unsafe_OOC

Fig 9 A comparison between safe and unsafe out-of-order commit across DRAM latencies for SLM andAOOC on SPEC CPU2006 OOC improvement in-creases with a higher DRAM latency Unsafe_OOCoutperforms Safe_OOC This study uses caches thatare four times smaller compared to Table 1 to put ad-ditional stress on the DRAM This has been done forboth in-order and out-of-order commit configurations

performance improvements at relatively low cost espe-cially if recovery mechanisms in software are used SThas the least effect among out-of-order commit condi-tions when it is preserved in isolation from other condi-tions This is valid between all three microarchitectures

47 Memory Latency Evaluation

Current DRAM cells are optimized for cost and notfor access latency [20] The potential to use more af-fordable but higher latency memory could have a largeimpact on allocation of memory in datacenter serversas high density DRAM modules cost much more thanthose that are 4times smaller (175times per GB [4]) To eval-uate the potential of out-of-order commit to handlehigher DRAM latency we evaluate memory latency from200 cycles to 1000 cycles in 200 cycle increments Inaddition we reduce the size of all caches by 4 timeswhen compared to Table 1 to put additional stress onthe DRAM subsystem Here we see an increasing per-formance improvement for the Unsafe_OOC conditionwhile Safe_OOC increases linearly compared to the in-order commit core For memory-intensive applicationsan efficient implementation of Unsafe_OOC could al-low the use of denser higher-latency DRAM that couldpotentially cost much less

48 Prefetching Evaluation

Prefetching is essential for performance in modern sys-tems and interacts tightly with the pipeline and out-of-order commit We do not consider all prefetchingoptions but instead focus on the potential benefitsTherefore to evaluate this interaction we configure theL1-D cache in gem5 with a stride prefetcher of degree8 Figure 10 compares the relative performance gains

IOC+pref

AOOCAOOC+pref

(b) SLM

0

20

40

60

80

Perfo

rman

ceim

prov

emen

t (

)

IOC+pref

AOOCAOOC+pref

(b) NHM

0

20

40

60

80

IOC+pref

AOOCAOOC+pref

(c) HSW

0

20

40

60

80

Fig 10 The effect of aggressive out-of-order commit onthe performance with and without prefetchers in theL1 data cache across SPEC CPU2006 Improvementis relative to the baseline IOC architectures withoutprefetchers

of prefetchers for in-order commit out-of-order com-mit and prefetchers for out-of-order commit Acrossall three architectures both out-of-order commit andprefetching on their own provide roughly 40 improve-ment in performance with out-of-order commit beingslightly better However the combination of out-of-ordercommit and prefetching delivers nearly 70 better per-formance nearly the sum of the two independent contri-butions In fact combining aggressive out-of-order com-mit with prefetchers allows us to reach 84 of the ideal(sum of both) improvement for SLM 77 for NHMand 83 for HSW showing that these techniques workwell together

49 Rollback Costs for Unsafe Branches

While this work does not evaluate hardware costs weare able to evaluate the number of rollbacks caused bycommitting past a mispredicted branch for each of theevaluated configurations For this evaluation we use acommit depth of 8 with aggressive out-of-order commit(AOOC) and Unsafe_BR (only the branch conditionis not respected for out-of-order commit) In Table 3we list the number of executed branches mispredictedbranches the number of rollbacks caused by specula-tively committing a mispredicted branch and the num-ber of instructions that were rolled back due to this roll-back While the number of executed branches increaseswith the aggressiveness of the core (because these corescan more aggressively speculate past branches) the av-erage number of rollbacks (and instructions rolled back)decreases slightly with the aggressiveness of the coreThese aggressive cores resolve speculative state quicklyreducing the number of instructions that need to becommitted out-of-order This results in a similar num-ber of rollbacks per thousand architecturally committedinstructions around 3 for all configurations

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 13

0minus20minus40minus60minus80Co

ntrib

ution in

degrad

ation (

) a) SLM-AOOC across SpecCPU

Safe_LD Safe_ST Safe_BR Safe_EXC

0minus20minus40minus60minus80

b) AOOC SpecCPU average

astarbwaves

bzip cactusadm

calculixdealii

gamessgcc gemsfdtd

gobmkgromacs

h264refhmmer

lbm leslie3dlibquantum

mcf milc namdomnetpp

perl povraysjeng

soplexsphinx3

tontowrf xalan

zeusmp

0minus20minus40minus60minus80

b) SLM-ROOC across SpecCPU

SLM-avgNHM-avg

HSW-avg

0minus20minus40minus60minus80

d) ROOC SpecCPU average

Fig 11 Maximum potential performance improvement is provided by Unsafe_OOC in which all conditions areunsafe By preserving all conditions maximum performance is reduced This figure shows the negative effect ofpreserving (or making safe) a single condition compared to the maximum potential performance improvementSafe_XX is equivalent to disabling all out-of-order commit conditions (all unsafe highest performance) where XXindicates the only safe and preserved condition

Table 3 Out-of-Order Commit Costs for UnsafeBranches

Metric-PKI SLM NHM HSWAvg Max Avg Max Avg Max

Branches 1462 3733 1533 4608 1568 4957Mispred br 78 509 81 545 82 560Num rollbacks 32 243 31 256 30 265Instrs rollback 85 529 66 430 54 343

5 Memory Parallelism

Overlapping cache misses to service them in parallelin particular long-latency accesses to DRAM but alsolower-latency accesses to the LLC can result deliversignificant performance benefits [8] This memory par-allelism is typically achieved through the use of multi-ple Miss Status Holding Registers (MSHRs) [19] whichtrack outstanding memory requests and allow them toexecute in parallel In this section we compare in-ordercommit and out-of-order commit in terms of memoryparallelism (both to DRAM (MLP) and within the cachehierarchy (MHP) [7]) by changing the number of L1MSHRs and observing the effect on performance Toexplore these effects we select three applications thatare highly memory-bound [18] (mcf) medium memory-bound (lbm) and largely not memory-bound (gcc) inFigure 12 One key observation is that out-of-order com-mit in both reluctant and aggressive modes is muchbetter than in-order commit in exposing intrinsic ap-plication memory parallelism Figure 12 shows that thegap between in-order and out-of-order commit is muchlarger in the case of HSW which means that the moreaggressive is the microarchitectures the more MLP iscovered if we apply out-of-order instruction commit Incase of lbm comparing the SLM NHM and HSW weobserved that the rate of performance improvement inthe range of (1 lt= MSHRSize lt= 5) is higher than

the rate in the range of (5 lt MSHRSize lt= 9) Inthe range of (MSHRSize gt 9)) HSW and NHM havecontinued improvement This is most likely the resultof an additional loop iteration containing DRAM ac-cesses that is being covered by out-of-order instructioncommit

Figure 12 also allows us to explore the impact ofmemory-boundedness by comparing across the threeapplications (mcf is most memory bound gcc least) Al-thoughmcf (Figure 12 (d) (e) and (f)) is more memory-bound than lbm (Figure 12 (g) (h) and (i)) it exposesless memory parallelism when the number of MSHRsis increased (the gap between in-order commit and safeaggressive out-of-order commit is larger in case of lbmcompared to mcf ) This is because mcf has more iso-lated cache misses that are misses that cannot be ser-viced at the same time to provide memory level par-allelism Also mcf has branch instructions based onmemory accesses (dependent loads) that miss in thecache hierarchy thereby reducing the number of in-structions that can potentially be committed out-of-order lbm has medium level of memory-boundedness[18] and therefore exposes much more memory paral-lelism as the number of MSHRs is increased Its per-formance improvement is therefore much better whenthe CPU commits instructions out-of-order gcc bench-mark Figure 12 (j) (k) and (l) is one of the leastmemory-bound benchmarks As a result increasing thenumber of MSHRs does not improve its performanceeven in the case of unsafe out-of-order commit Over-all out-of-order commit outperforms in-order commitby exposing additional memory parallelism (both toDRAM and in the hierarchy) In summary in case ofMLP out-of-order commit provides more benefit forbigger architecture and more memory-bound applica-tions

14 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

1 2 3 4 5 6 7 8 9 10123456789

Norm

alize

d Pe

rform

ance

improv

emen

t ()

(a) SPEC2006 SLM

1 2 3 4 5 6 7 8 9 10123456789

(b) SPEC2006 NHM

IOC Safe-AOOC Unsafe-AOOC

1 2 3 4 5 6 7 8 9 10123456789

(c) SPEC2006 HSW

1 2 3 4 5 6 7 8 9 10123456789

(d) mcf SLM

1 2 3 4 5 6 7 8 9 10123456789

(e) mcf NHM

1 2 3 4 5 6 7 8 9 10123456789

(f) mcf HSW

1 2 3 4 5 6 7 8 9 10123456789

(g) lbm SLM

1 2 3 4 5 6 7 8 9 10123456789

(h) lbm NHM

1 2 3 4 5 6 7 8 9 10123456789

(i) lbm HSW

1 2 3 4 5 6 7 8 9 10123456789

(j) gcc SLM

1 2 3 4 5 6 7 8 9 10MSHR size

123456789

(k) gcc NHM

1 2 3 4 5 6 7 8 9 10123456789

(l) gcc HSW

Fig 12 Memory hierarchy parallelism comparison between in-order commit and safe and unsafe out-of-ordercommit in MSHRs size of 1 to 10 The results have been normalized to in-order commit with MSHR size of oneSubgraph (a) (b) and (c) are based on harmonic mean across SPEC CPU2006 The mcf lbm and gcc benchmarksare respectively representative of categories with high medium and low memory boundedness

6 Early Release of Physical Registers vsOut-of-order Commit

Register renaming enables the avoidance of false de-pendencies through the register file thereby improvingcore performance It does this by translating architec-tural registers to larger set of physical registers As thisrenaming is done early in the pipeline and the physi-

cal registers are not released until commit which mayhappen long after the last consumer reads the data thephysical register entries may be kept alive for longerthan what the dependencies alone require To addressthis resource constraint techniques such as delayed al-location and Rarly Release of the Physical Register

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 15

Table 4 The Contribution of exhausted (full) regis-ter file (RF) and reorder buffer (ROB) on CPU stallsOther reasons to the CPU stalls could be the instruc-tion queue (IQ) the load-store queue (LSQ) instruc-tions cache misses or not having a free functional unitto allocate to the ready-to-dispatch instructions

Microarchitecture RF ROB Othermin avg max min avg max min avg max

SLM 6 59 86 13 36 92 0 5 27NHM 2 68 91 8 30 98 0 20 7HSW 2 69 91 8 29 98 0 20 7

(ERPR) [23] have been developed In this section wecompare ERPR with the OOC taxonomy [2]

OOC and ERPR are similar as they release physicalregister as early as possible Aggressive OOC outper-forms ERPR in general because it releases ROB andLSQ entries early as well as physical registers Unfor-tunately neither ERPR nor OOC can release IQ entriesearlier than usual because OOC only address processorstate after the instructions have left the IQ and ERPRis effective before instructions are inserted tot the IQFrom another point of view OOC put additional pres-sure on the IQ and provides free entries in one or all ofthe resource of superscalar processors

61 Analysis of CPU Aggressiveness

CPU aggressiveness analysis in this context is basedon microarchitecture configurations (see Table 2) andcommit-depth The overall trend of Figure 13 showsthat AOOC outperforms ROOC and ERPR in all con-figurations This is not surprising as AOOC is alwaysenabled while ROOC is enabled only when a resourcesis exhausted (see Section 25)

ERPR is essentially a subset of AOOC that only fo-cuses on the register file (RF) Therefore the higher isthe program sensitivity to the RF capacity the higher isthe effect of ERPR To better understand this behaviorwe analyzed the reasons for CPU stalls across differentmicroarchitectures (Table 4) The data show that in-creasing the effective size of RF (eg what ERPR doesconceptually) is less effective in SLM compare to NHMand HSW barbecue is SLM the CPU stalls are moreoften due to other resources such as the ROB size Asa result since in SLM architecture the size of RF isless effective in CPU stalls ROOC outperforms ERPRin terms of performance improvement (ROOC can vir-tually extend the effective size of ROB LSQ as wellas RF although ERPR only does this on RF) RF inNHM and HSW architectures is more effective on CPUstall than in SLM architecture therefore ERPR out-

performs ROOC regarding performance improvementin these two architecture

By focusing only on ERPR we observe that thewider the core the more effective is the is increasingcommit-depth (wider commit-depth covers more readyinstructions to commit) For example in the case ofSLM although widening the commit-depth releases ad-ditional physical registers earlier than general thesefreed registers are not effective anymore because theCPU stalls are due to limits in other resources (likeROB IQ or LSQ) As NHM and HSW have more re-sources increasing the commit-depth is more effectivein those architectures In the case of AOOC increas-ing the commit-depth is more effective than ROOCand ERPR because it enables additional instructionsto be committed out-of-order (because it is active onevery cycle compare to ROOC and also affects all ofthe resources compare to ERPR that only affects theRF) In addition we can see that the more aggressivethe CPU the more effective is increasing the commit-depth This is because wider cores execute and com-plete instructions that are far from the head of ROBearlier with their additional resources(either providedby the baseline or released by the OOOC) and theycan therefore provide more instructions to commit ifthe commit-depth is increased

62 Analysis of Out-of-Order Conditions

Figure 13 shows the potential for performance improve-ment from relaxing the out-of-order commit conditionsacross the three architectures for both safe and unsafeout-of-order-commit As has been seen previously [2 521] the least aggressive out-of-order commit configu-ration (safe_OOC with a commit depth of 4) is mostbeneficial for the least aggressive out-of-order processorSLM (compare the left-most bars of Figure 13(a-c))This is expected as the SLM processor has the least ex-ecution hardware and therefore suffers from more stallsthan the more aggressive processors

Increasing the commit depth from 4 to 8 (secondgroup of bars in Figure 13(a-c)) shows that the more ag-gressive out-of-order processors (NHM and HSW) nowbenefit more from out-of-order commit than the sim-pler SLM architecture This new result shows the inter-action between the aggressiveness of the out-of-orderexecution and out-of-order commit for more aggres-sive out-of-order execution there is more benefit frommore aggressive out-of-order commit The reason is thataggressive architectures appear to have more unusedresources available for computation This trend con-tinues through the more aggressive out-of-order com-mit modes as well with unsafe optimizations (Figure

16 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

4 8 16 320

20406080

100

Perfo

rman

ceim

prov

emen

t () (a) SLM-SAFE

4 8 16 32 128Commit Depth

020406080

100

(b) NHM-SAFEAOOC ROOC ERPR

4 8 16 32 128 1920

20406080

100

(c) HSW-SAFE

4 8 16 320

20406080

100

(d) SLM-UnsafeLD

4 8 16 32 1280

20406080

100

(e) NHM-UnsafeLD

4 8 16 32 128 1920

20406080

100

(f) HSW-UnsafeLD

4 8 16 320

20406080

100

(g) SLM-UnsafeBR

4 8 16 32 1280

20406080

100

(h) NHM-UnsafeBR

4 8 16 32 128 1920

20406080

100

(i) HSW-UnsafeBR

4 8 16 320

20406080

100

(j) SLM-UnsafeST

4 8 16 32 1280

20406080

100

(k) NHM-UnsafeST

4 8 16 32 128 1920

20406080

100

(l) HSW-UnsafeST

4 8 16 320

20406080

100

(m) SLM-UnsafeEXC

4 8 16 32 1280

20406080

100

(n) NHM-UnsafeEXC

4 8 16 32 128 1920

20406080

100

(o) HSW-UnsafeEXC

4 8 16 320

20406080

100

(p) SLM-Unsafe

4 8 16 32 1280

20406080

100

(q) NHM-Unsafe

4 8 16 32 128 1920

20406080

100

(r) HSW-Unsafe

Fig 13 Comparison between AOOC ROOC and ERPR in different commit-depth and microarchitecture setupin six different modes UnsafeXX is equivalent to (enforcing) all out-of-order commit conditions except conditionXX By relaxing the specific XX condition the dependence between other conditions is also observed

13(d-r)) benefiting the more aggressive out-of-order ex-ecution designs (NHM and HSW) more than the sim-pler one (SLM) The reasons for this are similar tothe safe case as the more aggressive processors arefaster to achieve the first out-of-order condition in-struction completion and therefore providing greatercommit depths (eg changing the commit depth from 4to 8 in Figure 13 (a-c)) results in a larger likelihoodof finding instructions to commit out-of-order Table 5

uses this analysis to rank the relative importance ofeach out-of-order commit condition for the three ar-chitectures This information provides a guideline forwhich areas to work on to achieve the best performanceimprovement depending on the baseline out-of-order ex-ecution

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 17

Conditioncommit-depth 4 8 16 32 128 192

Unsafe_LD 1 2 2 2 NA NAUnsafe_BR 2 1 1 1 NA NAUnsafe_ST 4 4 4 4 NA NAUnsafe_EXC 3 3 3 3 NA NA

(a) SLMConditioncommit-depth 4 8 16 32 128 192

Unsafe_LD 1 2 2 2 2 NAUnsafe_BR 3 1 1 1 1 NAUnsafe_ST 4 4 4 4 4 NAUnsafe_EXC 2 3 3 3 3 NA

(b) NHMConditioncommit-depth 4 8 16 32 128 192

Unsafe_LD 4 2 2 2 2 2Unsafe_BR 1 1 1 1 1 1Unsafe_ST 3 3 4 4 4 4Unsafe_EXC 2 4 3 3 3 3

(c) HSW

Table 5 Ranking of the benefits of out-of-order commitconditions in different microarchitecture based on thedepth of commit

7 Related Work

The goal of this work is to provide a detailed under-standing into the potential for performance benefitsacross different out-of-order commit conditions and lev-els of aggressiveness While this work aims to describethe maximum potential benefit for each individual con-dition there have been many previous works that de-scribe hardware solutions for early release of hardwarestructures and out-of-order commit strategiesBelowwe provide an overview of these works and how theyfit into the categories described in this work

71 Speculative Release of Hardware StructuresA number of implementations require register and pro-cessor state checkpointing support to speculatively re-tire or release hardware structures [12 11 22 1 17]

Processor state checkpointing especially with theadvent of very large SIMD registers such as AVX-512registers which now support up to 32 registers with upto 512 bits per register can require a significant amountof state to be saved when speculation is aggressive

72 Non-speculative Structure Release

Non-speculative early release of hardware structures be-fore commit requires knowledge that no older instruc-tion (that has come earlier in the instruction stream)can cause the program to abort raise an exception orrequire exposure of the architected state at that timeNon-speculative solutions have the potential to be themost energy efficient a necessity in an era of the end ofDennard Scaling for power-limited platforms A range

of solutions from hardware-only to software-assistedsolutions are described below

Compiler support Two previous works [1 13]allow for early commit or reclaim of resources basedon compiler knowledge but require software recompi-lation

Checkpoint-free A number of solutions do notuse checkpoints [5 21 13] and therefore do not requirespeculation (they respect the all commit conditions) orused software help to extend knowledge to the hard-ware

Selective early commit Early commit of loads [1415] allows for loads that have not yet received data fromthe memory hierarchy to become part of the commit-ted state of a processor This allows the processor tocontinue to process instructions past normally blockedstructures improving performance for memory-boundworkloads To accomplish this the authors [15] de-couple page faults that can occur from the fetching ofdata While recompilation is not strictly required forthis technique the authors evaluate their work using acompilation strategy to expose loads early Late alloca-tion and early release of physical registers is proposedas a register renaming scheme in [23] They explored theeffect of late allocation and early release of physical reg-ister on the performance of an 8-way superscalar pro-cessor without the need for additional speculation andmaintains precise exceptions The out-of-order commitmethods evaluated in this work overlap with the earlyrelease of registers while also evaluating the early re-lease of other structures (like the LSQ)

Evading out-of-order commit conditions[24] provides a coherency protocol level solution suchthat the loadrarrload reordering can be evaded non spec-ulatively under TSO In their design it is no longer nec-essary to squash and re-execute speculatively reorderedloads if their reordering can be hidden by the coherencyprotocol This allows speculatively reordered loads thatotherwise adhere to the rest of the Bell-Lipasti condi-tions to be committed out-of-order without affectingthe memory model

8 Conclusion

To obtain higher performance extending the reach ofthe processor core has been a focus in much of microar-chitecture research One promising direction is the useof out-of-order commit which releases precious proces-sor resources early to allow the processor to extend itsreach past typical hardware limits In this work wepresent a limit study for out-of-order commit throughthe introduction of reluctant and aggressive out-of-order

18 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

commit modes We show how smaller processors evenwith a limited commit scan depth can benefit fromout-of-order commit strategies but that larger aggres-sive cores require deeper commit scan depths to achieveimproved performance In addition we provide a de-tailed breakdown of the contributions for each out-of-order commit condition for the SPEC CPU2006 bench-mark suite and compare against similar works suchas early release of physical registers Our results showa very high potential for performance improvementabove 225x for some benchmarks and believe that out-of-order commit strategies can play an important rolefor future energy-efficient and high-performance proces-sor designs

References

1 Afram F Zeng H Ghose K A group-commitmechanism for ROB-based processors implement-ing the x86 ISA In HPCA pp 47ndash58 (2013)

2 Alipour M Carlson TE Kaxiras S Exploringthe performance limits of out-of-order commit InCF pp 211ndash220 (2017)

3 Allan A Edenfeld D Joyner Jr WH KahngAB Rodgers M Zorian Y 2001 technologyroadmap for semiconductors Computer 35(1) 42ndash53 (2002)

4 Badalone R Dramrsquos surprising role in the costof data centers httpwwwdatacenterknowledgecomarchives20151112dont (2015)

5 Bell GB Lipasti MH Deconstructing commitIn ISPASS pp 68ndash77 (2004)

6 Binkert N Beckmann B Black G ReinhardtSK Saidi A Basu A Hestness J Hower DRKrishna T Sardashti S Sen R Sewell KShoaib M Vaish N Hill MD Wood DA Thegem5 simulator SIGARCH Comput Archit News39(2) 1ndash7 (2011)

7 Carlson TE Heirman W Allam O Kaxiras SEeckhout L The load slice core microarchitectureIn ISCA pp 272ndash284 (2015)

8 Chou Y Fahs B Abraham S Microarchitec-ture optimizations for exploiting memory-level par-allelism In ISCA pp 76ndash88 (2004)

9 Corporation I Intel Rcopy 64 and ia-32 ar-chitectures optimization reference manualhttpwwwintelcomcontentwwwusenarchitecture-and-technology64-ia-32-architectures-optimization-manualhtml(2016)

10 Corporation I Intel Rcopy intelrsquos rsquotick-tockrsquo seeminglydead becomes rsquoprocess-architecture-optimizationrsquohttpwwwanandtechcomshow10183

intels-tick-tock-seemingly-dead-becomes-process-architecture-optimization (2016)

11 Cristal A Ortega D Llosa J Valero M Out-of-order commit processors In HPCA pp 48ndash59(2004)

12 Cristal A Santana OJ Valero M MartiacutenezJF Toward kilo-instruction processors ACMTrans Archit Code Optim 1(4) 389ndash417 (2004)

13 Duong N Veidenbaum AV Compiler-assistedselective out-of-order commit IEEE Computer Ar-chitecture Letters 12(1) 21ndash24 (2013)

14 Gwennap L Digital leads the pack with 21164 InMicroprocessor Report 8(12) pp 249ndash260 (1994)

15 Ham T Aragoacuten JL Martonosi M Desc De-coupled supply-compute communication manage-ment for heterogeneous architectures In MICROpp 191ndash203 (2015)

16 Henning JL SPEC CPU2006 benchmark descrip-tions SIGARCH Comput Archit News 34(4) 1ndash17 (2006)

17 Hilton A Roth A BOLT Energy-efficient out-of-order latency-tolerant execution In HPCA pp1ndash12 (2010)

18 Jaleel A Memory characterization of workloadsusing instrumentation driven simulation httpwwwglueumdeduajaleelworkload (2010)

19 Kroft D Lockup-free instruction fetchprefetchcache organization In ISCA pp 81ndash87 (1981)

20 Lee D Kim Y Seshadri V Liu J Subrama-nian L Mutlu O Tiered-latency dram A lowlatency and low cost dram architecture In HPCApp 615ndash626 (2013)

21 Marti S Borras J Rodriguez P Tena RMarin J A complexity-effective out-of-order re-tirement microarchitecture IEEE Transactions onComputers 58(12) 1626ndash1639 (2009)

22 Martinez JF Renau J Huang MC PrvulovicM Cherry Checkpointed early resource recyclingin out-of-order microprocessors In MICRO pp3ndash14 (2002)

23 Monreal T Vinals V Gonzalez J GonzalezA Valero M Late allocation and early release ofphysical registers IEEE Trans Comput 53(10)1244ndash1259 (2004)

24 Ros A Carlson TE Alipour M Kaxiras SNon-speculative load-load reordering in TSO InISCA pp 187ndash200 (2017)

25 Smith JE Pleszkun AR Implementing preciseinterrupts in pipelined processors IEEE Transac-tions on Computers 37(5) 562ndash573 (1988)

26 Sohi GS Vajapeyam S Instruction issue logicfor high-performance interruptable pipelined pro-cessors In ISCA pp 27ndash34 (1987)

  • Introduction
  • Out-of-Order Commit
  • Methodology
  • Out-of-Order Commit Evaluation
  • Memory Parallelism
  • Early Release of Physical Registers vs Out-of-order Commit
  • Related Work
  • Conclusion
Page 6: Maximizing Limited Resources: A Limit-based Study and ...tcarlson/pdfs/alipour2018mlralsatoo… · Maximizing Limited Resources: A Limit-based Study and Taxonomy of Out-of-order Commit

6 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

Fig 2 Conceptual comparison of in-order and out-of-order commit (commit depth=4) C Ready to commitX Not ready to commit E Executing I Issued OEmpty entry

Fig 3 Functionality of AOOC and ROOC In thisexample AOOC commits one more instruction thanROOC (commit depth=4) C Ready to commit XNot ready to commit E Executing I Issued OEmpty entry

24 Safe and Unsafe Out-of-Order Commit

Normally discussion of out-of-order commit tends tofocus on the changes that occur in the back end ofthe pipeline However the purpose of out-of-order com-mit is to enable the front end to proceed The per-ception that out-of-order commit increases performancecan also be misleading instruction execution is not spedup it is the removal of conditions that stall the frontend that increases performance More specifically inan unconstrained architecture (unrestricted ROB reg-isters and loadstore queue entries) safe out-of-ordercommit does not perform faster than in-order commitFor restricted real-world implementations Safe_OOChelps to ameliorate this problem

In contrast Unsafe_OOC has the capacity to ex-ceed the performance of an unconstrained in-order com-mit architecture as it removes penalties needed to guar-antee correctness (eg correctly handling memory de-

pendencies correctly enforcing memory consistency or-dering etc) in situations that the hardware does nothave any other means of imposing such correctness Inthis case Unsafe_OOC has the potential to violate cor-rectness and requires a means to revert back to a safestate if a violation occurs This can be a good trade-offwhen the conditions that violate correctness are rareWe implement an oracle version of Unsafe_OOC inthat it violates the correctness conditions because itwill be safe to do so (and therefore no correctness prob-lems will occur) This removes the need to provide themechanisms to revert to a safe state in the case of a mis-speculation This model provides a best-case speedup asrecovery from misspeculation is zero cost

25 Aggressive and Reluctant OOC

Orthogonal to the enforcement of the out-of-order com-mit conditions the aggressiveness of out-of-order com-mit plays a significant role in the resulting performanceand the cost it incurs We introduced two approachesfor out-of-order commit Aggressive (AOOC) and Re-luctant (ROOC) out-of-order commit A good way todescribe them is to contrast their main difference howoften each mechanism is engaged

AOOC is engaged all the time ie it constantlytries to find instructions to commit out-of-order if theopportunity arises In this respect it aims to increasecommit bandwidth

In contrast ROOC is only engaged when one of thecritical resources in the core (ROB entries registersLoadstore queue entries) is about to be exhaustedROOC is concerned about front end stalls not aboutcommit bandwidth This means that the number of in-structions that ROOC needs to find that can commitout-of-order is limited ROOC needs to provide enoughfree entries in the ROBregistersLSQ so that the frontend can dispatch as many instructions as possible (up tothe dispatch width) to the out-of-order engine Figure 2shows this with an example The reason that the com-mit stage of an in-order commit is blocked is becauseof an unresolved instruction at the head of ROB Forexample given a four-wide superscalar shown in Fig-ure 2 tries to commit four instructions per cycle Whilethe first instruction at the head of ROB can committhe second oldest instruction causes the commit stageto block and instead of four only one instruction com-mits (in-order)

The behavior of out-of-order commit depends on itsaggressiveness

AOOC AOOC attempts to find up to commit-width instructions so it can satisfy the need to commit

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 7

Table 1 Baseline core parameters

Full system simulation 34 GHz x86-64L1id 32KiB 8-way 4clkL2 256KiB 8-way 12clkL3 1MiB 8-way 36clkDRAM 200clkBranch Predictor Tournament front end penalty 10clkPrefetcher OFF(default)ON Stride degree 8

Table 2 Microarchitecture configuration with reorderbuffer (ROB) instruction queue (IQ) load and storequeues (LQSQ) and integer and floating-point regis-ter file (RF) details Dispatch width (D) commit width(CW) and Out-of-Order commit depth (CD) is thesame to enable a fair comparison (SLM hardware hasa DCW of 2) Register values are additional physicalregisters above architectural state

Microarchitecture DCWCD ROB IQ LQSQ RF(INTFP)Silvermont-like (SLM) 448 32 32 1016 3232Nehalem-like (NHM) 448 128 56 4836 6868Haswell-like (HSW) 448 192 60 7242 130130

at the highest possible commit bandwidth In our ex-ample it aims to find four instructions However sinceonly one more is needed to fill the gap in the head groupof four leading instructions AOOC produces an excessof commit-ready instructions that can potentially beexploited in the near future

ROOC ROOC on the other hand aims to findthe minimum number of instructions needed to commit(out-of-order) so that no resource is exhausted and thefront end can continue to issue instructions at its peakbandwidth The reason for seeking the minimum num-ber of instructions to commit out-of-order is that thisminimizes the perturbations in instruction order Thiscould potentially lead to more efficient hardware im-plementations While Figure 3 only considers the ROBin the general case all dispatch-related resources areconsidered in the same way In our particular exam-ple ROOC needs to find just one instruction to com-mit out-of-order The ROB already contains two emptyslots and the in-order commit mechanisms can find oneinstruction to commit leaving one left for ROOC

In contrast in AOOC mode there are in total threeinstructions that commit (instructions 1 3 and 4in Figure 3) The result is that AOOC needs to scandeeper than ROOC

3 Methodology

We use the gem5 [6] simulator in full system mode tosimulate an x86-64 target with a frequency of 34GHzTo test our models we use the SPEC CPU2006 bench-mark suite We use ten uniformly distributed check-points from each benchmark Each simulation check-

point has three phases that begin with 250M instruc-tions of cache warm-up followed by 100k instructions ofdetailed pipeline warm-up and ends with a detailed sim-ulation of 100M instructions Furthermore three dif-ferent configurations similar to three conventional out-of-order processors are used Intelrsquos SLM NHM andHSW [9] microarchitectures Tables 1 and 2 list the de-tailed configuration of the simulation environment

To implement our out-of-order commit model wefirst configure a simulated machine with a very largenumber of core resources We then monitor the numberof committed and non-committed instructions that ap-pear in the pipeline and control at dispatch whetherwe can support additional instructions in the back endof the processor In this way we dynamically determinethe resource availability in the processor for each cyclebased on the out-of-order commit conditions

4 Out-of-Order Commit Evaluation

In this section we analyze the benefits of out-of-ordercommit on the performance of a number of applica-tions We look at how the commit bandwidth changeswith out-of-order commit and how the effective re-source size of each critical component changes as weenable different out-of-order commit conditions Nextwe show how out-of-order commit conditions affect per-formance Safe_OOC and Unsafe_OOC are two ex-treme points that are defined by either enabling (re-specting) or disabling all of the out-of-order commitconditions This results in a minimum and maximumpotential performance improvement across the bench-mark suite To understand the effect of each conditionin isolation we study the effect of each one both in thepresence and absence of other conditions These stud-ies on Safe_OOC and Unsafe_OOC conditions wereconducted for both Aggressive (AOOC) and Reluctant(ROOC) out-of-order commit

41 Microarchitecture Aggressiveness

We target three microarchitectures resembling IntelrsquosSilvermont (SLM) Nehalem (NHM) and Haswell (HSW)as small medium and large cores (See Table 2 for de-tails) As an overview Figure 1 shows the performanceimprovement for each microarchitecture assuming Safe-OOC (all conditions respected) for all benchmarks onaverage across SPEC CPU2006 [16] We can see thatin the case of narrow commit depth (four in this fig-ure) a relatively small out-of-order processor (SLM)has more potential for relative improvement comparedto the medium and aggressive microarchitectures The

8 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

SLM SLMOOC

NHMNHMOOC

HSWHSWOOC

20406080

100

Cycle

s ()

4 committed3 committed

2 committed1 committed

0 committed

Fig 4 Commit bandwidth distribution for the SPECCPU2006 benchmarks of a 4-wide core with in-ordercommit and out-of-order commit respecting all commitconditions (Safe_OOC) Out-of-order commit increasescommit pressure even without aggressive speculation

reason is that the smaller processor (with a shorter in-struction reach and given a balanced design smallerhardware structures) will more likely stall as it exposesa smaller amount of the potential ILP in an applica-tion In case of a larger commit depth the more ag-gressive cores (NHM and HSW) have higher potentialperformance improvement (See Section 44 for a de-tailed overview) Out-of-order commit frees the proces-sor from the traditional limits reducing the number oftimes the processor experiences exhausted resources Inmedium and large aggressive cores thanks to a largerROB as well as other hardware resources more intrinsicILP is extracted by traditional in-order commit leav-ing less potential for out-of-order commit with a narrowcommit depth

42 The Effect on Commit Bandwidth

Figure 4 shows the distribution of the number of com-mitted instructions per cycle for three different microar-chitectures for both in-order and out-of-order commitacross all SPEC CPU2006 benchmarks Although an in-order commit 4-wide commit HSW microarchitecturecan retire up to four instructions per cycle this occurson average less than 20 of the time In practice wesee a large number of commit stage stalls (zero instruc-tions committed per cycle) For out-of-order committhe distribution shifts toward four instructions per cy-cle reflecting the improved commit performance (andthe resulting improvement in overall performance) Fi-nally this figure shows that for smaller less aggressivecores (such as SLM see Table 2) out-of-order commitprovides a relatively larger improvement compared tothe other microarchitectures because of the short com-

LQ SQ IQ RF ROB00

05

10

15

20

Norm

alize

d effective siz

e

IOC Safe_OOC Unsafe_OOC

Fig 5 Effective resource use in in-order and Aggressiveout-of-order commit across SPEC CPU2006

mit depth of four Larger cores improve more with moreaggressive commit depths and out-of-order commit pa-rameters (see section 44 for details)

43 The Effect on Resources

One of the main issues that out-of-order commit canhelp resolve is the early release (and subsequent reuse)of hardware resources that otherwise would still be re-quired to maintain the in-order state in an in-ordercommit processor Therefore with a small ROB (andother appropriately sized structures) given in-order com-mit the core can more easily stall when it runs out ofresources For example consider a ROB size of 32 froma SLM microarchitecture When the ROB is full it con-tains 32 micro-operations in flight and the CPUrsquos backend will no longer accept additional micro-operationsfrom the front end Increasing the physical size of theROB along with other resources is one potential butrather expensive solution to this issue An alternativesolution and one of the benefits of using out-of-ordercommit is the early release of resources This early re-lease increases the effective size of a resource (comparedto an in-order commit core) improving performance ofthe core

Benchmark specific analysis The xalan bench-mark is one of the top 5 benchmarks with a large num-ber of CPU stalls caused by exhausted resources BothAOOC and ROOC are very effective for the xalan bench-mark OOC is able to provide additional free entries inthe ROB RF and LSQ (see Figure 7)

On the other hand leslie3d has the lowest num-ber of CPU stalls based on exhausted resources whichlimits the potential for improvement with OOC

Figure 5 compares the effective size of the SLM mi-croarchitecture between in-order commit and both safeand unsafe aggressive out-of-order commit (AOOC) mod-els for the SPEC CPU2006 benchmarks The resultsare normalized to a fixed size SLM_IOC microarchitec-ture (see Table 2) An effective size of 10 translates tofrequent stalls due to resource exhaustion while sizes

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 9

4 8 16 32 128192

(a) SLM

20

40

60

80

100

120

Perfo

rmance

improvem

ent ()

4 8 16 32 128192

Commit depth

(b) NHM

20

40

60

80

100

120

Safe_OOC Unsafe_OOC

4 8 16 32 128192

(c) HSW

20

40

60

80

100

120

Fig 6 Effect of commit depth on performance improve-ment normalized to their respective in-order commitperformance across SPEC CPU2006 The smaller corereceives less benefit from increasing the depth of com-mit Safe_OOC saturates at a commit depth of 8 16and 32 for SLM NHM and HSW respectively (the dis-tance to the first unresolved instruction from the headof the ROB) There is no saturation in performance im-provement of Unsafe_OOC while the commit depth isincreased

greater than 10 shows the effective resource size in-creases due to OOC

For aggressive out-of-order commit the larger effec-tive sizes show the reach of this technique We see thatthe utilization of all structures except for the ROB isalmost the same for safe and unsafe out-of-order com-mit which allows Safe_OOC to achieve most of theperformance of the unsafe version Nevertheless the un-safe core is much better at improving the reach of theROB allowing applications like hmmer continue to showa benefit when moving from safe to unsafe out-of-ordercommit See Section 452 for more details on hmmer

44 Evaluation of Commit Depth

To gauge the effect of the commit depth (ie how far wescan the ROB to find instructions that can commit out-of-order) we impose a hard limit on it and evaluate theeffects on the resulting performance The strictest limitis the commit-width itself starting to commit out-of-order from the first commit-width instructions We thenrelax this to the immediate vicinity (eg double thecommit-width) and progressively relax until we reachthe size of the ROB

In Figure 6 we see that a commit depth of 4 (equalto commit width) provides the smallest benefit but alsothe smallest difference between Safe_OOC and Un-safe_OOC In addition the SLM core benefits morefrom OOC compared to NHM and HSW when com-mit depth is smaller than 8 At a commit depth of 8and above the large aggressive cores benefit much more

than the smaller core type with the maximum improve-ment of Unsafe_OOC at 80 119 and 129 for SLMNHM and HSW respectively For aggressive cores thelarger commit depth allows for continued performanceimprovements and will be necessary for the design of abalanced aggressive out-of-order commit design

45 Out-of-Order Commit Performance

In this section we analyze the out-of-order commit con-ditions to determine the minimum (and maximum) per-formance improvement potential The minimum andmaximum improvement is provided by Safe_OOC andUnsafe_OOC respectively In Figure 7 we show theamount of improvement provided by both safe and un-safe out-of-order commit for both aggressive and reluc-tant modes for all three microarchitectures

451 Safe_OOC

Honoring all out-of-order commit conditions results in amodest performance improvement One implication forfuture processors is that Safe_OOC does not requireadditional support for speculative out-of-order rollbackrecovery mechanisms it requires support for the com-mit of instructions out-of-order and the freeing of struc-tures for use by future instructions

Safe_AOOC When Safe_AOOC is evaluated onthe SLM microarchitecture (see Figure 7a) the rangeof improvement spans from a low of 3 (leslie3d) upto 82 (xalan) with an average of 44 In the NHMmicroarchitecture the improvement ranges from 9 to108 for milc and xalan with an average improvementof 57 and for HSW we see an improvement of 1for mcf to 110 for wrf with an average of 55 (SeeSection 44 for more details)

Safe_ROOC For Safe_ROOC the performanceimprovement is lower for all three microarchitecturescompared to AOOC (See Subsection 25 for additionaldetails) In the SLM microarchitecture the range ofperformance improvement of safe ROOC is from 2 to55 for calculix and gcc respectively with an averageimprovement of 20 In the NHM microarchitecturewe see performance improvements that range from 1(mcf) to 22 (bwaves) with an average improvement of7 (8 for HSW) Because NHM and HSM have fewerCPU stalls ROOC has less of an effect on performancecompared with the smaller more efficient core

452 Unsafe_OOC

By relaxing all conditions Unsafe_OOC provides themaximum potential for performance improvement Un-

10 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

020406080

100120140

Perfo

rman

ceim

prov

emen

t () a) AOOC across SpecCPU

SLM_UnsafeSLM_Safe

NHM_UnsafeNHM_Safe

HSW_UnsafeHSW_Safe

astarbwaves

bzip cactusadm

calculixdealii

gamessgcc gemsfdtd

gobmkgromacs

h264refhmmer

lbm leslie3dlibquantum

mcf milc namdomnetpp

perl povraysjeng

soplexsphinx3

tontowrf xalan

zeusmpSpecCPU-avg

020406080

100120140

b) ROOC across SpecCPU

Fig 7 IPC improvement of safe and unsafe out-of-order commit relative to in-order commit as a baseline for bothreluctant and aggressive versions applied on SPEC CPU2006 benchmarks on three microarchitectures

safe_OOC will require recovery mechanisms for thesetechniques which can reduce the performance potentialbecause of recovery costs

To understand the effectiveness of all conditions to-gether we consider zero cost for recovery for any mis-speculated out-of-order commit condition

Unsafe_AOOC This technique provides the high-est performance improvement for all three different ar-chitectures In SLM cores the improvement ranges from9 to 120 for libquantum and astar applicationsrespectively and the average is 59 In NHM archi-tecture the average is 72 and the range is betweenmilc and astar respectively with 11 and 196 im-provement In HSW cores similar to NHM cores astarhas the maximum benefit from Unsafe_ooc with 192improvement while the minimum improvement is fordealii at 7 The average improvement for this classof architecture is 70

Unsafe_ROOC Reluctant out-of-order commit islower performing because it is not continuously lookingto commit additional instructions (See Section 25 fordetails) In the case of SLM the improvement rangesfrom 3 to 93 respectively for leselie3d and xalanwith an average of a 38 improvement In NHM mcfand bwaves with 1 and 22 show the maximum andthe minimum improvement with an average of 7 (HSWis similar with an average of 8) Between the threemicroarchitectures the limited SLM benefits the mostfrom ROOC because of the large number of stalls seenby this core Therefore ROOC especially Unsafe_ROOCis an interesting methodology to improve the perfor-mance of relatively small but energy efficient CPUs aswe see a relatively high performance improvement fora less aggressive commit implementation

Benchmark-specific Analysis The hmmer bench-mark is a particularly strong case for the benefits of out-of-order commit for SLM This application is L1-cacheresident and exhibits very few last-level cache missesNevertheless we still see a very strong improvement inperformance from 33 to 52 for Safe_OOC increas-ing to 51 to 66 for Unsafe_OOC Looking aheadto Figure 8 and Figure 11 we can see that evadingthe branch condition provides the most benefit for thisapplication Making room for additional instructions toallow the hardware to expose additional ILP works welleven for those applications without a significant num-ber of LLC misses The mcf benchmark contains load-dependent branches has the highest MPKI (misses perkilo instructions) among the benchmarks and thereforeit has a rather low IPC when it is executed on an in-order commit CPU This results in a good opportunityto improve performance as it is extremely limited bythese misses

46 Performance Effects of Commit Conditions

In the previous section by analyzing safe and unsafeout-of-order commit we observe that there is a largegap between the performance improvement of these twoimplementations Understanding the cause of this per-formance improvement (by looking at individual com-mit conditions in isolation) allows us to better under-stand where to focus future hardware efforts

461 Positive Contribution of Out-of-Order CommitConditions

To study the gap between safe and unsafe out-of-ordercommit (Figure 7) we analyze the effect of relaxing

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 11

020406080

100

Contrib

ution in

improv

emen

t () a) SLM-AOOC across SpecCPU

Safe Unsafe_LD Unsafe_ST Unsafe_BR Unsafe_EXC

020406080

100b) AOOC SpecCPU average

astarbwaves

bzip cactusadm

calculixdealii

gamessgcc gemsfdtd

gobmkgromacs

h264refhmmer

lbm leslie3dlibquantum

mcf milc namdomnetpp

perl povraysjeng

soplexsphinx3

tontowrf xalan

zeusmp

020406080

100c) SLM-ROOC across SpecCPU

SLM-avgNHM-avg

HSW-avg

020406080

100d) ROOC SpecCPU average

Fig 8 Contribution of safe and selectively unsafe out-of-order commit on three different microarchitecturesUnsafe_XX is equivalent to activating (enforcing) all out-of-order commit conditions except XX (the XX conditionis relaxed) By relaxing the specific XX condition the dependence between other conditions is also observed

each condition in the presence of the other preservedconditions in Figure 8 We analyze the SLM microar-chitecture in detail and provide averages across all mi-croarchitectures for both AOOC and ROOC Each out-of-order commit condition in analyzed in isolation andwe consider Unsafe_OOC (all relaxed conditions) asthe 100 potential improvement budget In the case ofthe mcf benchmark in Figure 7a the safe and unsafeOOC performance improvement is 33 and 71 re-spectively (46 of the potential improvement budget isprovided by Safe_OOC) We also observe that by re-laxing the LD condition (unsafe_LD) 52 of potentialimprovement budget is achievable (see Figure 8a) InFigure 8 we can see in some applications (like namd inAOOC mode and leslie3d in ROOC mode) that re-laxing just a single condition is not sufficient to fill thegap between safe and unsafe OOC This does not meanthat a single condition is not important but rather thatother preserved conditions are preventing out-of-ordercommit from achieving its full potential

AOOC We observe that for most of applicationsUnsafe_BR and Unsafe_LD are the most interestingconditions (Figure 8a) Additionally the more aggres-sive the core the more important the Unsafe_LD con-dition becomes In SLM NHM and HSW CPUs Un-safe_LD respectively fills 4 10 and 12 and Un-safe_BR fills 9 8 and 7 of the gap between safeand unsafe OOC Unsafe_ST is not very effective be-cause of the rarity of this condition and the conserva-tive memory dependence predictor used Unsafe_EXCor relaxing exceptions are not that effective becausethey are very rare especially in integer benchmarks

ROOC Relaxing OOC conditions in ROOC hasless effect in reducing the gap between safe and un-safe OOC and this is because of the nature of this on-demand OOC mode which is enabled and needed more

in SLM and much less in NHM and HSW (see Fig-ure 7b)

Benchmark-specific Analysis The astar andgobmk benchmarks have the highest number of mis-predicted branches per 1000 instructions Therefore re-laxing the branch condition (unsafe_BR) improves theperformance of these two benchmarks by 68 and 94respectively On the other hand benchmarks such ascactusadm and lbm have a low branch mispredictionrate (05) and therefore relaxing the branch condi-tion does not show significant improvement for thesebenchmarks

The sphinx benchmark has high degree of intrinsicILP2 thus an increase in the number of effective re-sources allows this benchmark to improve performanceAs most of its load instructions are L2-cache missesrelaxing the load condition (Unsafe_LD) improves per-formance of this benchmark

462 Negative contribution of OOC conditions

In this section we analyze the gap between safe andunsafe out-of-order commit from a different angle

We relax all of the out-of-order commit conditionsexcept for one For example safe_LD means the LDcondition is preserved but ST BR and EXC conditionsare relaxed By preserving one of the conditions we lookat the negative effect (performance reduction) of theactivated condition compared to Unsafe_OOC

Figure 11 depicts the effect of each condition onperformance For most of the benchmarks the BR andthe LD conditions are the most effective ones Amongfloating-point benchmarks LD and EXC conditions havea large impact on performance Therefore relaxing theEXC condition as it is rare could lead to significant

2 The sphinx benchmark continues to show performance im-provements as the in-order commit processor aggressiveness in-creases from SLM to HSW

12 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

200 400 600 800 1000DRAM latency (cycle)

020406080

100120140160

Perfo

rman

ceim

prov

emen

t (

)Safe_OOC Unsafe_OOC

Fig 9 A comparison between safe and unsafe out-of-order commit across DRAM latencies for SLM andAOOC on SPEC CPU2006 OOC improvement in-creases with a higher DRAM latency Unsafe_OOCoutperforms Safe_OOC This study uses caches thatare four times smaller compared to Table 1 to put ad-ditional stress on the DRAM This has been done forboth in-order and out-of-order commit configurations

performance improvements at relatively low cost espe-cially if recovery mechanisms in software are used SThas the least effect among out-of-order commit condi-tions when it is preserved in isolation from other condi-tions This is valid between all three microarchitectures

47 Memory Latency Evaluation

Current DRAM cells are optimized for cost and notfor access latency [20] The potential to use more af-fordable but higher latency memory could have a largeimpact on allocation of memory in datacenter serversas high density DRAM modules cost much more thanthose that are 4times smaller (175times per GB [4]) To eval-uate the potential of out-of-order commit to handlehigher DRAM latency we evaluate memory latency from200 cycles to 1000 cycles in 200 cycle increments Inaddition we reduce the size of all caches by 4 timeswhen compared to Table 1 to put additional stress onthe DRAM subsystem Here we see an increasing per-formance improvement for the Unsafe_OOC conditionwhile Safe_OOC increases linearly compared to the in-order commit core For memory-intensive applicationsan efficient implementation of Unsafe_OOC could al-low the use of denser higher-latency DRAM that couldpotentially cost much less

48 Prefetching Evaluation

Prefetching is essential for performance in modern sys-tems and interacts tightly with the pipeline and out-of-order commit We do not consider all prefetchingoptions but instead focus on the potential benefitsTherefore to evaluate this interaction we configure theL1-D cache in gem5 with a stride prefetcher of degree8 Figure 10 compares the relative performance gains

IOC+pref

AOOCAOOC+pref

(b) SLM

0

20

40

60

80

Perfo

rman

ceim

prov

emen

t (

)

IOC+pref

AOOCAOOC+pref

(b) NHM

0

20

40

60

80

IOC+pref

AOOCAOOC+pref

(c) HSW

0

20

40

60

80

Fig 10 The effect of aggressive out-of-order commit onthe performance with and without prefetchers in theL1 data cache across SPEC CPU2006 Improvementis relative to the baseline IOC architectures withoutprefetchers

of prefetchers for in-order commit out-of-order com-mit and prefetchers for out-of-order commit Acrossall three architectures both out-of-order commit andprefetching on their own provide roughly 40 improve-ment in performance with out-of-order commit beingslightly better However the combination of out-of-ordercommit and prefetching delivers nearly 70 better per-formance nearly the sum of the two independent contri-butions In fact combining aggressive out-of-order com-mit with prefetchers allows us to reach 84 of the ideal(sum of both) improvement for SLM 77 for NHMand 83 for HSW showing that these techniques workwell together

49 Rollback Costs for Unsafe Branches

While this work does not evaluate hardware costs weare able to evaluate the number of rollbacks caused bycommitting past a mispredicted branch for each of theevaluated configurations For this evaluation we use acommit depth of 8 with aggressive out-of-order commit(AOOC) and Unsafe_BR (only the branch conditionis not respected for out-of-order commit) In Table 3we list the number of executed branches mispredictedbranches the number of rollbacks caused by specula-tively committing a mispredicted branch and the num-ber of instructions that were rolled back due to this roll-back While the number of executed branches increaseswith the aggressiveness of the core (because these corescan more aggressively speculate past branches) the av-erage number of rollbacks (and instructions rolled back)decreases slightly with the aggressiveness of the coreThese aggressive cores resolve speculative state quicklyreducing the number of instructions that need to becommitted out-of-order This results in a similar num-ber of rollbacks per thousand architecturally committedinstructions around 3 for all configurations

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 13

0minus20minus40minus60minus80Co

ntrib

ution in

degrad

ation (

) a) SLM-AOOC across SpecCPU

Safe_LD Safe_ST Safe_BR Safe_EXC

0minus20minus40minus60minus80

b) AOOC SpecCPU average

astarbwaves

bzip cactusadm

calculixdealii

gamessgcc gemsfdtd

gobmkgromacs

h264refhmmer

lbm leslie3dlibquantum

mcf milc namdomnetpp

perl povraysjeng

soplexsphinx3

tontowrf xalan

zeusmp

0minus20minus40minus60minus80

b) SLM-ROOC across SpecCPU

SLM-avgNHM-avg

HSW-avg

0minus20minus40minus60minus80

d) ROOC SpecCPU average

Fig 11 Maximum potential performance improvement is provided by Unsafe_OOC in which all conditions areunsafe By preserving all conditions maximum performance is reduced This figure shows the negative effect ofpreserving (or making safe) a single condition compared to the maximum potential performance improvementSafe_XX is equivalent to disabling all out-of-order commit conditions (all unsafe highest performance) where XXindicates the only safe and preserved condition

Table 3 Out-of-Order Commit Costs for UnsafeBranches

Metric-PKI SLM NHM HSWAvg Max Avg Max Avg Max

Branches 1462 3733 1533 4608 1568 4957Mispred br 78 509 81 545 82 560Num rollbacks 32 243 31 256 30 265Instrs rollback 85 529 66 430 54 343

5 Memory Parallelism

Overlapping cache misses to service them in parallelin particular long-latency accesses to DRAM but alsolower-latency accesses to the LLC can result deliversignificant performance benefits [8] This memory par-allelism is typically achieved through the use of multi-ple Miss Status Holding Registers (MSHRs) [19] whichtrack outstanding memory requests and allow them toexecute in parallel In this section we compare in-ordercommit and out-of-order commit in terms of memoryparallelism (both to DRAM (MLP) and within the cachehierarchy (MHP) [7]) by changing the number of L1MSHRs and observing the effect on performance Toexplore these effects we select three applications thatare highly memory-bound [18] (mcf) medium memory-bound (lbm) and largely not memory-bound (gcc) inFigure 12 One key observation is that out-of-order com-mit in both reluctant and aggressive modes is muchbetter than in-order commit in exposing intrinsic ap-plication memory parallelism Figure 12 shows that thegap between in-order and out-of-order commit is muchlarger in the case of HSW which means that the moreaggressive is the microarchitectures the more MLP iscovered if we apply out-of-order instruction commit Incase of lbm comparing the SLM NHM and HSW weobserved that the rate of performance improvement inthe range of (1 lt= MSHRSize lt= 5) is higher than

the rate in the range of (5 lt MSHRSize lt= 9) Inthe range of (MSHRSize gt 9)) HSW and NHM havecontinued improvement This is most likely the resultof an additional loop iteration containing DRAM ac-cesses that is being covered by out-of-order instructioncommit

Figure 12 also allows us to explore the impact ofmemory-boundedness by comparing across the threeapplications (mcf is most memory bound gcc least) Al-thoughmcf (Figure 12 (d) (e) and (f)) is more memory-bound than lbm (Figure 12 (g) (h) and (i)) it exposesless memory parallelism when the number of MSHRsis increased (the gap between in-order commit and safeaggressive out-of-order commit is larger in case of lbmcompared to mcf ) This is because mcf has more iso-lated cache misses that are misses that cannot be ser-viced at the same time to provide memory level par-allelism Also mcf has branch instructions based onmemory accesses (dependent loads) that miss in thecache hierarchy thereby reducing the number of in-structions that can potentially be committed out-of-order lbm has medium level of memory-boundedness[18] and therefore exposes much more memory paral-lelism as the number of MSHRs is increased Its per-formance improvement is therefore much better whenthe CPU commits instructions out-of-order gcc bench-mark Figure 12 (j) (k) and (l) is one of the leastmemory-bound benchmarks As a result increasing thenumber of MSHRs does not improve its performanceeven in the case of unsafe out-of-order commit Over-all out-of-order commit outperforms in-order commitby exposing additional memory parallelism (both toDRAM and in the hierarchy) In summary in case ofMLP out-of-order commit provides more benefit forbigger architecture and more memory-bound applica-tions

14 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

1 2 3 4 5 6 7 8 9 10123456789

Norm

alize

d Pe

rform

ance

improv

emen

t ()

(a) SPEC2006 SLM

1 2 3 4 5 6 7 8 9 10123456789

(b) SPEC2006 NHM

IOC Safe-AOOC Unsafe-AOOC

1 2 3 4 5 6 7 8 9 10123456789

(c) SPEC2006 HSW

1 2 3 4 5 6 7 8 9 10123456789

(d) mcf SLM

1 2 3 4 5 6 7 8 9 10123456789

(e) mcf NHM

1 2 3 4 5 6 7 8 9 10123456789

(f) mcf HSW

1 2 3 4 5 6 7 8 9 10123456789

(g) lbm SLM

1 2 3 4 5 6 7 8 9 10123456789

(h) lbm NHM

1 2 3 4 5 6 7 8 9 10123456789

(i) lbm HSW

1 2 3 4 5 6 7 8 9 10123456789

(j) gcc SLM

1 2 3 4 5 6 7 8 9 10MSHR size

123456789

(k) gcc NHM

1 2 3 4 5 6 7 8 9 10123456789

(l) gcc HSW

Fig 12 Memory hierarchy parallelism comparison between in-order commit and safe and unsafe out-of-ordercommit in MSHRs size of 1 to 10 The results have been normalized to in-order commit with MSHR size of oneSubgraph (a) (b) and (c) are based on harmonic mean across SPEC CPU2006 The mcf lbm and gcc benchmarksare respectively representative of categories with high medium and low memory boundedness

6 Early Release of Physical Registers vsOut-of-order Commit

Register renaming enables the avoidance of false de-pendencies through the register file thereby improvingcore performance It does this by translating architec-tural registers to larger set of physical registers As thisrenaming is done early in the pipeline and the physi-

cal registers are not released until commit which mayhappen long after the last consumer reads the data thephysical register entries may be kept alive for longerthan what the dependencies alone require To addressthis resource constraint techniques such as delayed al-location and Rarly Release of the Physical Register

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 15

Table 4 The Contribution of exhausted (full) regis-ter file (RF) and reorder buffer (ROB) on CPU stallsOther reasons to the CPU stalls could be the instruc-tion queue (IQ) the load-store queue (LSQ) instruc-tions cache misses or not having a free functional unitto allocate to the ready-to-dispatch instructions

Microarchitecture RF ROB Othermin avg max min avg max min avg max

SLM 6 59 86 13 36 92 0 5 27NHM 2 68 91 8 30 98 0 20 7HSW 2 69 91 8 29 98 0 20 7

(ERPR) [23] have been developed In this section wecompare ERPR with the OOC taxonomy [2]

OOC and ERPR are similar as they release physicalregister as early as possible Aggressive OOC outper-forms ERPR in general because it releases ROB andLSQ entries early as well as physical registers Unfor-tunately neither ERPR nor OOC can release IQ entriesearlier than usual because OOC only address processorstate after the instructions have left the IQ and ERPRis effective before instructions are inserted tot the IQFrom another point of view OOC put additional pres-sure on the IQ and provides free entries in one or all ofthe resource of superscalar processors

61 Analysis of CPU Aggressiveness

CPU aggressiveness analysis in this context is basedon microarchitecture configurations (see Table 2) andcommit-depth The overall trend of Figure 13 showsthat AOOC outperforms ROOC and ERPR in all con-figurations This is not surprising as AOOC is alwaysenabled while ROOC is enabled only when a resourcesis exhausted (see Section 25)

ERPR is essentially a subset of AOOC that only fo-cuses on the register file (RF) Therefore the higher isthe program sensitivity to the RF capacity the higher isthe effect of ERPR To better understand this behaviorwe analyzed the reasons for CPU stalls across differentmicroarchitectures (Table 4) The data show that in-creasing the effective size of RF (eg what ERPR doesconceptually) is less effective in SLM compare to NHMand HSW barbecue is SLM the CPU stalls are moreoften due to other resources such as the ROB size Asa result since in SLM architecture the size of RF isless effective in CPU stalls ROOC outperforms ERPRin terms of performance improvement (ROOC can vir-tually extend the effective size of ROB LSQ as wellas RF although ERPR only does this on RF) RF inNHM and HSW architectures is more effective on CPUstall than in SLM architecture therefore ERPR out-

performs ROOC regarding performance improvementin these two architecture

By focusing only on ERPR we observe that thewider the core the more effective is the is increasingcommit-depth (wider commit-depth covers more readyinstructions to commit) For example in the case ofSLM although widening the commit-depth releases ad-ditional physical registers earlier than general thesefreed registers are not effective anymore because theCPU stalls are due to limits in other resources (likeROB IQ or LSQ) As NHM and HSW have more re-sources increasing the commit-depth is more effectivein those architectures In the case of AOOC increas-ing the commit-depth is more effective than ROOCand ERPR because it enables additional instructionsto be committed out-of-order (because it is active onevery cycle compare to ROOC and also affects all ofthe resources compare to ERPR that only affects theRF) In addition we can see that the more aggressivethe CPU the more effective is increasing the commit-depth This is because wider cores execute and com-plete instructions that are far from the head of ROBearlier with their additional resources(either providedby the baseline or released by the OOOC) and theycan therefore provide more instructions to commit ifthe commit-depth is increased

62 Analysis of Out-of-Order Conditions

Figure 13 shows the potential for performance improve-ment from relaxing the out-of-order commit conditionsacross the three architectures for both safe and unsafeout-of-order-commit As has been seen previously [2 521] the least aggressive out-of-order commit configu-ration (safe_OOC with a commit depth of 4) is mostbeneficial for the least aggressive out-of-order processorSLM (compare the left-most bars of Figure 13(a-c))This is expected as the SLM processor has the least ex-ecution hardware and therefore suffers from more stallsthan the more aggressive processors

Increasing the commit depth from 4 to 8 (secondgroup of bars in Figure 13(a-c)) shows that the more ag-gressive out-of-order processors (NHM and HSW) nowbenefit more from out-of-order commit than the sim-pler SLM architecture This new result shows the inter-action between the aggressiveness of the out-of-orderexecution and out-of-order commit for more aggres-sive out-of-order execution there is more benefit frommore aggressive out-of-order commit The reason is thataggressive architectures appear to have more unusedresources available for computation This trend con-tinues through the more aggressive out-of-order com-mit modes as well with unsafe optimizations (Figure

16 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

4 8 16 320

20406080

100

Perfo

rman

ceim

prov

emen

t () (a) SLM-SAFE

4 8 16 32 128Commit Depth

020406080

100

(b) NHM-SAFEAOOC ROOC ERPR

4 8 16 32 128 1920

20406080

100

(c) HSW-SAFE

4 8 16 320

20406080

100

(d) SLM-UnsafeLD

4 8 16 32 1280

20406080

100

(e) NHM-UnsafeLD

4 8 16 32 128 1920

20406080

100

(f) HSW-UnsafeLD

4 8 16 320

20406080

100

(g) SLM-UnsafeBR

4 8 16 32 1280

20406080

100

(h) NHM-UnsafeBR

4 8 16 32 128 1920

20406080

100

(i) HSW-UnsafeBR

4 8 16 320

20406080

100

(j) SLM-UnsafeST

4 8 16 32 1280

20406080

100

(k) NHM-UnsafeST

4 8 16 32 128 1920

20406080

100

(l) HSW-UnsafeST

4 8 16 320

20406080

100

(m) SLM-UnsafeEXC

4 8 16 32 1280

20406080

100

(n) NHM-UnsafeEXC

4 8 16 32 128 1920

20406080

100

(o) HSW-UnsafeEXC

4 8 16 320

20406080

100

(p) SLM-Unsafe

4 8 16 32 1280

20406080

100

(q) NHM-Unsafe

4 8 16 32 128 1920

20406080

100

(r) HSW-Unsafe

Fig 13 Comparison between AOOC ROOC and ERPR in different commit-depth and microarchitecture setupin six different modes UnsafeXX is equivalent to (enforcing) all out-of-order commit conditions except conditionXX By relaxing the specific XX condition the dependence between other conditions is also observed

13(d-r)) benefiting the more aggressive out-of-order ex-ecution designs (NHM and HSW) more than the sim-pler one (SLM) The reasons for this are similar tothe safe case as the more aggressive processors arefaster to achieve the first out-of-order condition in-struction completion and therefore providing greatercommit depths (eg changing the commit depth from 4to 8 in Figure 13 (a-c)) results in a larger likelihoodof finding instructions to commit out-of-order Table 5

uses this analysis to rank the relative importance ofeach out-of-order commit condition for the three ar-chitectures This information provides a guideline forwhich areas to work on to achieve the best performanceimprovement depending on the baseline out-of-order ex-ecution

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 17

Conditioncommit-depth 4 8 16 32 128 192

Unsafe_LD 1 2 2 2 NA NAUnsafe_BR 2 1 1 1 NA NAUnsafe_ST 4 4 4 4 NA NAUnsafe_EXC 3 3 3 3 NA NA

(a) SLMConditioncommit-depth 4 8 16 32 128 192

Unsafe_LD 1 2 2 2 2 NAUnsafe_BR 3 1 1 1 1 NAUnsafe_ST 4 4 4 4 4 NAUnsafe_EXC 2 3 3 3 3 NA

(b) NHMConditioncommit-depth 4 8 16 32 128 192

Unsafe_LD 4 2 2 2 2 2Unsafe_BR 1 1 1 1 1 1Unsafe_ST 3 3 4 4 4 4Unsafe_EXC 2 4 3 3 3 3

(c) HSW

Table 5 Ranking of the benefits of out-of-order commitconditions in different microarchitecture based on thedepth of commit

7 Related Work

The goal of this work is to provide a detailed under-standing into the potential for performance benefitsacross different out-of-order commit conditions and lev-els of aggressiveness While this work aims to describethe maximum potential benefit for each individual con-dition there have been many previous works that de-scribe hardware solutions for early release of hardwarestructures and out-of-order commit strategiesBelowwe provide an overview of these works and how theyfit into the categories described in this work

71 Speculative Release of Hardware StructuresA number of implementations require register and pro-cessor state checkpointing support to speculatively re-tire or release hardware structures [12 11 22 1 17]

Processor state checkpointing especially with theadvent of very large SIMD registers such as AVX-512registers which now support up to 32 registers with upto 512 bits per register can require a significant amountof state to be saved when speculation is aggressive

72 Non-speculative Structure Release

Non-speculative early release of hardware structures be-fore commit requires knowledge that no older instruc-tion (that has come earlier in the instruction stream)can cause the program to abort raise an exception orrequire exposure of the architected state at that timeNon-speculative solutions have the potential to be themost energy efficient a necessity in an era of the end ofDennard Scaling for power-limited platforms A range

of solutions from hardware-only to software-assistedsolutions are described below

Compiler support Two previous works [1 13]allow for early commit or reclaim of resources basedon compiler knowledge but require software recompi-lation

Checkpoint-free A number of solutions do notuse checkpoints [5 21 13] and therefore do not requirespeculation (they respect the all commit conditions) orused software help to extend knowledge to the hard-ware

Selective early commit Early commit of loads [1415] allows for loads that have not yet received data fromthe memory hierarchy to become part of the commit-ted state of a processor This allows the processor tocontinue to process instructions past normally blockedstructures improving performance for memory-boundworkloads To accomplish this the authors [15] de-couple page faults that can occur from the fetching ofdata While recompilation is not strictly required forthis technique the authors evaluate their work using acompilation strategy to expose loads early Late alloca-tion and early release of physical registers is proposedas a register renaming scheme in [23] They explored theeffect of late allocation and early release of physical reg-ister on the performance of an 8-way superscalar pro-cessor without the need for additional speculation andmaintains precise exceptions The out-of-order commitmethods evaluated in this work overlap with the earlyrelease of registers while also evaluating the early re-lease of other structures (like the LSQ)

Evading out-of-order commit conditions[24] provides a coherency protocol level solution suchthat the loadrarrload reordering can be evaded non spec-ulatively under TSO In their design it is no longer nec-essary to squash and re-execute speculatively reorderedloads if their reordering can be hidden by the coherencyprotocol This allows speculatively reordered loads thatotherwise adhere to the rest of the Bell-Lipasti condi-tions to be committed out-of-order without affectingthe memory model

8 Conclusion

To obtain higher performance extending the reach ofthe processor core has been a focus in much of microar-chitecture research One promising direction is the useof out-of-order commit which releases precious proces-sor resources early to allow the processor to extend itsreach past typical hardware limits In this work wepresent a limit study for out-of-order commit throughthe introduction of reluctant and aggressive out-of-order

18 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

commit modes We show how smaller processors evenwith a limited commit scan depth can benefit fromout-of-order commit strategies but that larger aggres-sive cores require deeper commit scan depths to achieveimproved performance In addition we provide a de-tailed breakdown of the contributions for each out-of-order commit condition for the SPEC CPU2006 bench-mark suite and compare against similar works suchas early release of physical registers Our results showa very high potential for performance improvementabove 225x for some benchmarks and believe that out-of-order commit strategies can play an important rolefor future energy-efficient and high-performance proces-sor designs

References

1 Afram F Zeng H Ghose K A group-commitmechanism for ROB-based processors implement-ing the x86 ISA In HPCA pp 47ndash58 (2013)

2 Alipour M Carlson TE Kaxiras S Exploringthe performance limits of out-of-order commit InCF pp 211ndash220 (2017)

3 Allan A Edenfeld D Joyner Jr WH KahngAB Rodgers M Zorian Y 2001 technologyroadmap for semiconductors Computer 35(1) 42ndash53 (2002)

4 Badalone R Dramrsquos surprising role in the costof data centers httpwwwdatacenterknowledgecomarchives20151112dont (2015)

5 Bell GB Lipasti MH Deconstructing commitIn ISPASS pp 68ndash77 (2004)

6 Binkert N Beckmann B Black G ReinhardtSK Saidi A Basu A Hestness J Hower DRKrishna T Sardashti S Sen R Sewell KShoaib M Vaish N Hill MD Wood DA Thegem5 simulator SIGARCH Comput Archit News39(2) 1ndash7 (2011)

7 Carlson TE Heirman W Allam O Kaxiras SEeckhout L The load slice core microarchitectureIn ISCA pp 272ndash284 (2015)

8 Chou Y Fahs B Abraham S Microarchitec-ture optimizations for exploiting memory-level par-allelism In ISCA pp 76ndash88 (2004)

9 Corporation I Intel Rcopy 64 and ia-32 ar-chitectures optimization reference manualhttpwwwintelcomcontentwwwusenarchitecture-and-technology64-ia-32-architectures-optimization-manualhtml(2016)

10 Corporation I Intel Rcopy intelrsquos rsquotick-tockrsquo seeminglydead becomes rsquoprocess-architecture-optimizationrsquohttpwwwanandtechcomshow10183

intels-tick-tock-seemingly-dead-becomes-process-architecture-optimization (2016)

11 Cristal A Ortega D Llosa J Valero M Out-of-order commit processors In HPCA pp 48ndash59(2004)

12 Cristal A Santana OJ Valero M MartiacutenezJF Toward kilo-instruction processors ACMTrans Archit Code Optim 1(4) 389ndash417 (2004)

13 Duong N Veidenbaum AV Compiler-assistedselective out-of-order commit IEEE Computer Ar-chitecture Letters 12(1) 21ndash24 (2013)

14 Gwennap L Digital leads the pack with 21164 InMicroprocessor Report 8(12) pp 249ndash260 (1994)

15 Ham T Aragoacuten JL Martonosi M Desc De-coupled supply-compute communication manage-ment for heterogeneous architectures In MICROpp 191ndash203 (2015)

16 Henning JL SPEC CPU2006 benchmark descrip-tions SIGARCH Comput Archit News 34(4) 1ndash17 (2006)

17 Hilton A Roth A BOLT Energy-efficient out-of-order latency-tolerant execution In HPCA pp1ndash12 (2010)

18 Jaleel A Memory characterization of workloadsusing instrumentation driven simulation httpwwwglueumdeduajaleelworkload (2010)

19 Kroft D Lockup-free instruction fetchprefetchcache organization In ISCA pp 81ndash87 (1981)

20 Lee D Kim Y Seshadri V Liu J Subrama-nian L Mutlu O Tiered-latency dram A lowlatency and low cost dram architecture In HPCApp 615ndash626 (2013)

21 Marti S Borras J Rodriguez P Tena RMarin J A complexity-effective out-of-order re-tirement microarchitecture IEEE Transactions onComputers 58(12) 1626ndash1639 (2009)

22 Martinez JF Renau J Huang MC PrvulovicM Cherry Checkpointed early resource recyclingin out-of-order microprocessors In MICRO pp3ndash14 (2002)

23 Monreal T Vinals V Gonzalez J GonzalezA Valero M Late allocation and early release ofphysical registers IEEE Trans Comput 53(10)1244ndash1259 (2004)

24 Ros A Carlson TE Alipour M Kaxiras SNon-speculative load-load reordering in TSO InISCA pp 187ndash200 (2017)

25 Smith JE Pleszkun AR Implementing preciseinterrupts in pipelined processors IEEE Transac-tions on Computers 37(5) 562ndash573 (1988)

26 Sohi GS Vajapeyam S Instruction issue logicfor high-performance interruptable pipelined pro-cessors In ISCA pp 27ndash34 (1987)

  • Introduction
  • Out-of-Order Commit
  • Methodology
  • Out-of-Order Commit Evaluation
  • Memory Parallelism
  • Early Release of Physical Registers vs Out-of-order Commit
  • Related Work
  • Conclusion
Page 7: Maximizing Limited Resources: A Limit-based Study and ...tcarlson/pdfs/alipour2018mlralsatoo… · Maximizing Limited Resources: A Limit-based Study and Taxonomy of Out-of-order Commit

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 7

Table 1 Baseline core parameters

Full system simulation 34 GHz x86-64L1id 32KiB 8-way 4clkL2 256KiB 8-way 12clkL3 1MiB 8-way 36clkDRAM 200clkBranch Predictor Tournament front end penalty 10clkPrefetcher OFF(default)ON Stride degree 8

Table 2 Microarchitecture configuration with reorderbuffer (ROB) instruction queue (IQ) load and storequeues (LQSQ) and integer and floating-point regis-ter file (RF) details Dispatch width (D) commit width(CW) and Out-of-Order commit depth (CD) is thesame to enable a fair comparison (SLM hardware hasa DCW of 2) Register values are additional physicalregisters above architectural state

Microarchitecture DCWCD ROB IQ LQSQ RF(INTFP)Silvermont-like (SLM) 448 32 32 1016 3232Nehalem-like (NHM) 448 128 56 4836 6868Haswell-like (HSW) 448 192 60 7242 130130

at the highest possible commit bandwidth In our ex-ample it aims to find four instructions However sinceonly one more is needed to fill the gap in the head groupof four leading instructions AOOC produces an excessof commit-ready instructions that can potentially beexploited in the near future

ROOC ROOC on the other hand aims to findthe minimum number of instructions needed to commit(out-of-order) so that no resource is exhausted and thefront end can continue to issue instructions at its peakbandwidth The reason for seeking the minimum num-ber of instructions to commit out-of-order is that thisminimizes the perturbations in instruction order Thiscould potentially lead to more efficient hardware im-plementations While Figure 3 only considers the ROBin the general case all dispatch-related resources areconsidered in the same way In our particular exam-ple ROOC needs to find just one instruction to com-mit out-of-order The ROB already contains two emptyslots and the in-order commit mechanisms can find oneinstruction to commit leaving one left for ROOC

In contrast in AOOC mode there are in total threeinstructions that commit (instructions 1 3 and 4in Figure 3) The result is that AOOC needs to scandeeper than ROOC

3 Methodology

We use the gem5 [6] simulator in full system mode tosimulate an x86-64 target with a frequency of 34GHzTo test our models we use the SPEC CPU2006 bench-mark suite We use ten uniformly distributed check-points from each benchmark Each simulation check-

point has three phases that begin with 250M instruc-tions of cache warm-up followed by 100k instructions ofdetailed pipeline warm-up and ends with a detailed sim-ulation of 100M instructions Furthermore three dif-ferent configurations similar to three conventional out-of-order processors are used Intelrsquos SLM NHM andHSW [9] microarchitectures Tables 1 and 2 list the de-tailed configuration of the simulation environment

To implement our out-of-order commit model wefirst configure a simulated machine with a very largenumber of core resources We then monitor the numberof committed and non-committed instructions that ap-pear in the pipeline and control at dispatch whetherwe can support additional instructions in the back endof the processor In this way we dynamically determinethe resource availability in the processor for each cyclebased on the out-of-order commit conditions

4 Out-of-Order Commit Evaluation

In this section we analyze the benefits of out-of-ordercommit on the performance of a number of applica-tions We look at how the commit bandwidth changeswith out-of-order commit and how the effective re-source size of each critical component changes as weenable different out-of-order commit conditions Nextwe show how out-of-order commit conditions affect per-formance Safe_OOC and Unsafe_OOC are two ex-treme points that are defined by either enabling (re-specting) or disabling all of the out-of-order commitconditions This results in a minimum and maximumpotential performance improvement across the bench-mark suite To understand the effect of each conditionin isolation we study the effect of each one both in thepresence and absence of other conditions These stud-ies on Safe_OOC and Unsafe_OOC conditions wereconducted for both Aggressive (AOOC) and Reluctant(ROOC) out-of-order commit

41 Microarchitecture Aggressiveness

We target three microarchitectures resembling IntelrsquosSilvermont (SLM) Nehalem (NHM) and Haswell (HSW)as small medium and large cores (See Table 2 for de-tails) As an overview Figure 1 shows the performanceimprovement for each microarchitecture assuming Safe-OOC (all conditions respected) for all benchmarks onaverage across SPEC CPU2006 [16] We can see thatin the case of narrow commit depth (four in this fig-ure) a relatively small out-of-order processor (SLM)has more potential for relative improvement comparedto the medium and aggressive microarchitectures The

8 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

SLM SLMOOC

NHMNHMOOC

HSWHSWOOC

20406080

100

Cycle

s ()

4 committed3 committed

2 committed1 committed

0 committed

Fig 4 Commit bandwidth distribution for the SPECCPU2006 benchmarks of a 4-wide core with in-ordercommit and out-of-order commit respecting all commitconditions (Safe_OOC) Out-of-order commit increasescommit pressure even without aggressive speculation

reason is that the smaller processor (with a shorter in-struction reach and given a balanced design smallerhardware structures) will more likely stall as it exposesa smaller amount of the potential ILP in an applica-tion In case of a larger commit depth the more ag-gressive cores (NHM and HSW) have higher potentialperformance improvement (See Section 44 for a de-tailed overview) Out-of-order commit frees the proces-sor from the traditional limits reducing the number oftimes the processor experiences exhausted resources Inmedium and large aggressive cores thanks to a largerROB as well as other hardware resources more intrinsicILP is extracted by traditional in-order commit leav-ing less potential for out-of-order commit with a narrowcommit depth

42 The Effect on Commit Bandwidth

Figure 4 shows the distribution of the number of com-mitted instructions per cycle for three different microar-chitectures for both in-order and out-of-order commitacross all SPEC CPU2006 benchmarks Although an in-order commit 4-wide commit HSW microarchitecturecan retire up to four instructions per cycle this occurson average less than 20 of the time In practice wesee a large number of commit stage stalls (zero instruc-tions committed per cycle) For out-of-order committhe distribution shifts toward four instructions per cy-cle reflecting the improved commit performance (andthe resulting improvement in overall performance) Fi-nally this figure shows that for smaller less aggressivecores (such as SLM see Table 2) out-of-order commitprovides a relatively larger improvement compared tothe other microarchitectures because of the short com-

LQ SQ IQ RF ROB00

05

10

15

20

Norm

alize

d effective siz

e

IOC Safe_OOC Unsafe_OOC

Fig 5 Effective resource use in in-order and Aggressiveout-of-order commit across SPEC CPU2006

mit depth of four Larger cores improve more with moreaggressive commit depths and out-of-order commit pa-rameters (see section 44 for details)

43 The Effect on Resources

One of the main issues that out-of-order commit canhelp resolve is the early release (and subsequent reuse)of hardware resources that otherwise would still be re-quired to maintain the in-order state in an in-ordercommit processor Therefore with a small ROB (andother appropriately sized structures) given in-order com-mit the core can more easily stall when it runs out ofresources For example consider a ROB size of 32 froma SLM microarchitecture When the ROB is full it con-tains 32 micro-operations in flight and the CPUrsquos backend will no longer accept additional micro-operationsfrom the front end Increasing the physical size of theROB along with other resources is one potential butrather expensive solution to this issue An alternativesolution and one of the benefits of using out-of-ordercommit is the early release of resources This early re-lease increases the effective size of a resource (comparedto an in-order commit core) improving performance ofthe core

Benchmark specific analysis The xalan bench-mark is one of the top 5 benchmarks with a large num-ber of CPU stalls caused by exhausted resources BothAOOC and ROOC are very effective for the xalan bench-mark OOC is able to provide additional free entries inthe ROB RF and LSQ (see Figure 7)

On the other hand leslie3d has the lowest num-ber of CPU stalls based on exhausted resources whichlimits the potential for improvement with OOC

Figure 5 compares the effective size of the SLM mi-croarchitecture between in-order commit and both safeand unsafe aggressive out-of-order commit (AOOC) mod-els for the SPEC CPU2006 benchmarks The resultsare normalized to a fixed size SLM_IOC microarchitec-ture (see Table 2) An effective size of 10 translates tofrequent stalls due to resource exhaustion while sizes

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 9

4 8 16 32 128192

(a) SLM

20

40

60

80

100

120

Perfo

rmance

improvem

ent ()

4 8 16 32 128192

Commit depth

(b) NHM

20

40

60

80

100

120

Safe_OOC Unsafe_OOC

4 8 16 32 128192

(c) HSW

20

40

60

80

100

120

Fig 6 Effect of commit depth on performance improve-ment normalized to their respective in-order commitperformance across SPEC CPU2006 The smaller corereceives less benefit from increasing the depth of com-mit Safe_OOC saturates at a commit depth of 8 16and 32 for SLM NHM and HSW respectively (the dis-tance to the first unresolved instruction from the headof the ROB) There is no saturation in performance im-provement of Unsafe_OOC while the commit depth isincreased

greater than 10 shows the effective resource size in-creases due to OOC

For aggressive out-of-order commit the larger effec-tive sizes show the reach of this technique We see thatthe utilization of all structures except for the ROB isalmost the same for safe and unsafe out-of-order com-mit which allows Safe_OOC to achieve most of theperformance of the unsafe version Nevertheless the un-safe core is much better at improving the reach of theROB allowing applications like hmmer continue to showa benefit when moving from safe to unsafe out-of-ordercommit See Section 452 for more details on hmmer

44 Evaluation of Commit Depth

To gauge the effect of the commit depth (ie how far wescan the ROB to find instructions that can commit out-of-order) we impose a hard limit on it and evaluate theeffects on the resulting performance The strictest limitis the commit-width itself starting to commit out-of-order from the first commit-width instructions We thenrelax this to the immediate vicinity (eg double thecommit-width) and progressively relax until we reachthe size of the ROB

In Figure 6 we see that a commit depth of 4 (equalto commit width) provides the smallest benefit but alsothe smallest difference between Safe_OOC and Un-safe_OOC In addition the SLM core benefits morefrom OOC compared to NHM and HSW when com-mit depth is smaller than 8 At a commit depth of 8and above the large aggressive cores benefit much more

than the smaller core type with the maximum improve-ment of Unsafe_OOC at 80 119 and 129 for SLMNHM and HSW respectively For aggressive cores thelarger commit depth allows for continued performanceimprovements and will be necessary for the design of abalanced aggressive out-of-order commit design

45 Out-of-Order Commit Performance

In this section we analyze the out-of-order commit con-ditions to determine the minimum (and maximum) per-formance improvement potential The minimum andmaximum improvement is provided by Safe_OOC andUnsafe_OOC respectively In Figure 7 we show theamount of improvement provided by both safe and un-safe out-of-order commit for both aggressive and reluc-tant modes for all three microarchitectures

451 Safe_OOC

Honoring all out-of-order commit conditions results in amodest performance improvement One implication forfuture processors is that Safe_OOC does not requireadditional support for speculative out-of-order rollbackrecovery mechanisms it requires support for the com-mit of instructions out-of-order and the freeing of struc-tures for use by future instructions

Safe_AOOC When Safe_AOOC is evaluated onthe SLM microarchitecture (see Figure 7a) the rangeof improvement spans from a low of 3 (leslie3d) upto 82 (xalan) with an average of 44 In the NHMmicroarchitecture the improvement ranges from 9 to108 for milc and xalan with an average improvementof 57 and for HSW we see an improvement of 1for mcf to 110 for wrf with an average of 55 (SeeSection 44 for more details)

Safe_ROOC For Safe_ROOC the performanceimprovement is lower for all three microarchitecturescompared to AOOC (See Subsection 25 for additionaldetails) In the SLM microarchitecture the range ofperformance improvement of safe ROOC is from 2 to55 for calculix and gcc respectively with an averageimprovement of 20 In the NHM microarchitecturewe see performance improvements that range from 1(mcf) to 22 (bwaves) with an average improvement of7 (8 for HSW) Because NHM and HSM have fewerCPU stalls ROOC has less of an effect on performancecompared with the smaller more efficient core

452 Unsafe_OOC

By relaxing all conditions Unsafe_OOC provides themaximum potential for performance improvement Un-

10 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

020406080

100120140

Perfo

rman

ceim

prov

emen

t () a) AOOC across SpecCPU

SLM_UnsafeSLM_Safe

NHM_UnsafeNHM_Safe

HSW_UnsafeHSW_Safe

astarbwaves

bzip cactusadm

calculixdealii

gamessgcc gemsfdtd

gobmkgromacs

h264refhmmer

lbm leslie3dlibquantum

mcf milc namdomnetpp

perl povraysjeng

soplexsphinx3

tontowrf xalan

zeusmpSpecCPU-avg

020406080

100120140

b) ROOC across SpecCPU

Fig 7 IPC improvement of safe and unsafe out-of-order commit relative to in-order commit as a baseline for bothreluctant and aggressive versions applied on SPEC CPU2006 benchmarks on three microarchitectures

safe_OOC will require recovery mechanisms for thesetechniques which can reduce the performance potentialbecause of recovery costs

To understand the effectiveness of all conditions to-gether we consider zero cost for recovery for any mis-speculated out-of-order commit condition

Unsafe_AOOC This technique provides the high-est performance improvement for all three different ar-chitectures In SLM cores the improvement ranges from9 to 120 for libquantum and astar applicationsrespectively and the average is 59 In NHM archi-tecture the average is 72 and the range is betweenmilc and astar respectively with 11 and 196 im-provement In HSW cores similar to NHM cores astarhas the maximum benefit from Unsafe_ooc with 192improvement while the minimum improvement is fordealii at 7 The average improvement for this classof architecture is 70

Unsafe_ROOC Reluctant out-of-order commit islower performing because it is not continuously lookingto commit additional instructions (See Section 25 fordetails) In the case of SLM the improvement rangesfrom 3 to 93 respectively for leselie3d and xalanwith an average of a 38 improvement In NHM mcfand bwaves with 1 and 22 show the maximum andthe minimum improvement with an average of 7 (HSWis similar with an average of 8) Between the threemicroarchitectures the limited SLM benefits the mostfrom ROOC because of the large number of stalls seenby this core Therefore ROOC especially Unsafe_ROOCis an interesting methodology to improve the perfor-mance of relatively small but energy efficient CPUs aswe see a relatively high performance improvement fora less aggressive commit implementation

Benchmark-specific Analysis The hmmer bench-mark is a particularly strong case for the benefits of out-of-order commit for SLM This application is L1-cacheresident and exhibits very few last-level cache missesNevertheless we still see a very strong improvement inperformance from 33 to 52 for Safe_OOC increas-ing to 51 to 66 for Unsafe_OOC Looking aheadto Figure 8 and Figure 11 we can see that evadingthe branch condition provides the most benefit for thisapplication Making room for additional instructions toallow the hardware to expose additional ILP works welleven for those applications without a significant num-ber of LLC misses The mcf benchmark contains load-dependent branches has the highest MPKI (misses perkilo instructions) among the benchmarks and thereforeit has a rather low IPC when it is executed on an in-order commit CPU This results in a good opportunityto improve performance as it is extremely limited bythese misses

46 Performance Effects of Commit Conditions

In the previous section by analyzing safe and unsafeout-of-order commit we observe that there is a largegap between the performance improvement of these twoimplementations Understanding the cause of this per-formance improvement (by looking at individual com-mit conditions in isolation) allows us to better under-stand where to focus future hardware efforts

461 Positive Contribution of Out-of-Order CommitConditions

To study the gap between safe and unsafe out-of-ordercommit (Figure 7) we analyze the effect of relaxing

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 11

020406080

100

Contrib

ution in

improv

emen

t () a) SLM-AOOC across SpecCPU

Safe Unsafe_LD Unsafe_ST Unsafe_BR Unsafe_EXC

020406080

100b) AOOC SpecCPU average

astarbwaves

bzip cactusadm

calculixdealii

gamessgcc gemsfdtd

gobmkgromacs

h264refhmmer

lbm leslie3dlibquantum

mcf milc namdomnetpp

perl povraysjeng

soplexsphinx3

tontowrf xalan

zeusmp

020406080

100c) SLM-ROOC across SpecCPU

SLM-avgNHM-avg

HSW-avg

020406080

100d) ROOC SpecCPU average

Fig 8 Contribution of safe and selectively unsafe out-of-order commit on three different microarchitecturesUnsafe_XX is equivalent to activating (enforcing) all out-of-order commit conditions except XX (the XX conditionis relaxed) By relaxing the specific XX condition the dependence between other conditions is also observed

each condition in the presence of the other preservedconditions in Figure 8 We analyze the SLM microar-chitecture in detail and provide averages across all mi-croarchitectures for both AOOC and ROOC Each out-of-order commit condition in analyzed in isolation andwe consider Unsafe_OOC (all relaxed conditions) asthe 100 potential improvement budget In the case ofthe mcf benchmark in Figure 7a the safe and unsafeOOC performance improvement is 33 and 71 re-spectively (46 of the potential improvement budget isprovided by Safe_OOC) We also observe that by re-laxing the LD condition (unsafe_LD) 52 of potentialimprovement budget is achievable (see Figure 8a) InFigure 8 we can see in some applications (like namd inAOOC mode and leslie3d in ROOC mode) that re-laxing just a single condition is not sufficient to fill thegap between safe and unsafe OOC This does not meanthat a single condition is not important but rather thatother preserved conditions are preventing out-of-ordercommit from achieving its full potential

AOOC We observe that for most of applicationsUnsafe_BR and Unsafe_LD are the most interestingconditions (Figure 8a) Additionally the more aggres-sive the core the more important the Unsafe_LD con-dition becomes In SLM NHM and HSW CPUs Un-safe_LD respectively fills 4 10 and 12 and Un-safe_BR fills 9 8 and 7 of the gap between safeand unsafe OOC Unsafe_ST is not very effective be-cause of the rarity of this condition and the conserva-tive memory dependence predictor used Unsafe_EXCor relaxing exceptions are not that effective becausethey are very rare especially in integer benchmarks

ROOC Relaxing OOC conditions in ROOC hasless effect in reducing the gap between safe and un-safe OOC and this is because of the nature of this on-demand OOC mode which is enabled and needed more

in SLM and much less in NHM and HSW (see Fig-ure 7b)

Benchmark-specific Analysis The astar andgobmk benchmarks have the highest number of mis-predicted branches per 1000 instructions Therefore re-laxing the branch condition (unsafe_BR) improves theperformance of these two benchmarks by 68 and 94respectively On the other hand benchmarks such ascactusadm and lbm have a low branch mispredictionrate (05) and therefore relaxing the branch condi-tion does not show significant improvement for thesebenchmarks

The sphinx benchmark has high degree of intrinsicILP2 thus an increase in the number of effective re-sources allows this benchmark to improve performanceAs most of its load instructions are L2-cache missesrelaxing the load condition (Unsafe_LD) improves per-formance of this benchmark

462 Negative contribution of OOC conditions

In this section we analyze the gap between safe andunsafe out-of-order commit from a different angle

We relax all of the out-of-order commit conditionsexcept for one For example safe_LD means the LDcondition is preserved but ST BR and EXC conditionsare relaxed By preserving one of the conditions we lookat the negative effect (performance reduction) of theactivated condition compared to Unsafe_OOC

Figure 11 depicts the effect of each condition onperformance For most of the benchmarks the BR andthe LD conditions are the most effective ones Amongfloating-point benchmarks LD and EXC conditions havea large impact on performance Therefore relaxing theEXC condition as it is rare could lead to significant

2 The sphinx benchmark continues to show performance im-provements as the in-order commit processor aggressiveness in-creases from SLM to HSW

12 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

200 400 600 800 1000DRAM latency (cycle)

020406080

100120140160

Perfo

rman

ceim

prov

emen

t (

)Safe_OOC Unsafe_OOC

Fig 9 A comparison between safe and unsafe out-of-order commit across DRAM latencies for SLM andAOOC on SPEC CPU2006 OOC improvement in-creases with a higher DRAM latency Unsafe_OOCoutperforms Safe_OOC This study uses caches thatare four times smaller compared to Table 1 to put ad-ditional stress on the DRAM This has been done forboth in-order and out-of-order commit configurations

performance improvements at relatively low cost espe-cially if recovery mechanisms in software are used SThas the least effect among out-of-order commit condi-tions when it is preserved in isolation from other condi-tions This is valid between all three microarchitectures

47 Memory Latency Evaluation

Current DRAM cells are optimized for cost and notfor access latency [20] The potential to use more af-fordable but higher latency memory could have a largeimpact on allocation of memory in datacenter serversas high density DRAM modules cost much more thanthose that are 4times smaller (175times per GB [4]) To eval-uate the potential of out-of-order commit to handlehigher DRAM latency we evaluate memory latency from200 cycles to 1000 cycles in 200 cycle increments Inaddition we reduce the size of all caches by 4 timeswhen compared to Table 1 to put additional stress onthe DRAM subsystem Here we see an increasing per-formance improvement for the Unsafe_OOC conditionwhile Safe_OOC increases linearly compared to the in-order commit core For memory-intensive applicationsan efficient implementation of Unsafe_OOC could al-low the use of denser higher-latency DRAM that couldpotentially cost much less

48 Prefetching Evaluation

Prefetching is essential for performance in modern sys-tems and interacts tightly with the pipeline and out-of-order commit We do not consider all prefetchingoptions but instead focus on the potential benefitsTherefore to evaluate this interaction we configure theL1-D cache in gem5 with a stride prefetcher of degree8 Figure 10 compares the relative performance gains

IOC+pref

AOOCAOOC+pref

(b) SLM

0

20

40

60

80

Perfo

rman

ceim

prov

emen

t (

)

IOC+pref

AOOCAOOC+pref

(b) NHM

0

20

40

60

80

IOC+pref

AOOCAOOC+pref

(c) HSW

0

20

40

60

80

Fig 10 The effect of aggressive out-of-order commit onthe performance with and without prefetchers in theL1 data cache across SPEC CPU2006 Improvementis relative to the baseline IOC architectures withoutprefetchers

of prefetchers for in-order commit out-of-order com-mit and prefetchers for out-of-order commit Acrossall three architectures both out-of-order commit andprefetching on their own provide roughly 40 improve-ment in performance with out-of-order commit beingslightly better However the combination of out-of-ordercommit and prefetching delivers nearly 70 better per-formance nearly the sum of the two independent contri-butions In fact combining aggressive out-of-order com-mit with prefetchers allows us to reach 84 of the ideal(sum of both) improvement for SLM 77 for NHMand 83 for HSW showing that these techniques workwell together

49 Rollback Costs for Unsafe Branches

While this work does not evaluate hardware costs weare able to evaluate the number of rollbacks caused bycommitting past a mispredicted branch for each of theevaluated configurations For this evaluation we use acommit depth of 8 with aggressive out-of-order commit(AOOC) and Unsafe_BR (only the branch conditionis not respected for out-of-order commit) In Table 3we list the number of executed branches mispredictedbranches the number of rollbacks caused by specula-tively committing a mispredicted branch and the num-ber of instructions that were rolled back due to this roll-back While the number of executed branches increaseswith the aggressiveness of the core (because these corescan more aggressively speculate past branches) the av-erage number of rollbacks (and instructions rolled back)decreases slightly with the aggressiveness of the coreThese aggressive cores resolve speculative state quicklyreducing the number of instructions that need to becommitted out-of-order This results in a similar num-ber of rollbacks per thousand architecturally committedinstructions around 3 for all configurations

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 13

0minus20minus40minus60minus80Co

ntrib

ution in

degrad

ation (

) a) SLM-AOOC across SpecCPU

Safe_LD Safe_ST Safe_BR Safe_EXC

0minus20minus40minus60minus80

b) AOOC SpecCPU average

astarbwaves

bzip cactusadm

calculixdealii

gamessgcc gemsfdtd

gobmkgromacs

h264refhmmer

lbm leslie3dlibquantum

mcf milc namdomnetpp

perl povraysjeng

soplexsphinx3

tontowrf xalan

zeusmp

0minus20minus40minus60minus80

b) SLM-ROOC across SpecCPU

SLM-avgNHM-avg

HSW-avg

0minus20minus40minus60minus80

d) ROOC SpecCPU average

Fig 11 Maximum potential performance improvement is provided by Unsafe_OOC in which all conditions areunsafe By preserving all conditions maximum performance is reduced This figure shows the negative effect ofpreserving (or making safe) a single condition compared to the maximum potential performance improvementSafe_XX is equivalent to disabling all out-of-order commit conditions (all unsafe highest performance) where XXindicates the only safe and preserved condition

Table 3 Out-of-Order Commit Costs for UnsafeBranches

Metric-PKI SLM NHM HSWAvg Max Avg Max Avg Max

Branches 1462 3733 1533 4608 1568 4957Mispred br 78 509 81 545 82 560Num rollbacks 32 243 31 256 30 265Instrs rollback 85 529 66 430 54 343

5 Memory Parallelism

Overlapping cache misses to service them in parallelin particular long-latency accesses to DRAM but alsolower-latency accesses to the LLC can result deliversignificant performance benefits [8] This memory par-allelism is typically achieved through the use of multi-ple Miss Status Holding Registers (MSHRs) [19] whichtrack outstanding memory requests and allow them toexecute in parallel In this section we compare in-ordercommit and out-of-order commit in terms of memoryparallelism (both to DRAM (MLP) and within the cachehierarchy (MHP) [7]) by changing the number of L1MSHRs and observing the effect on performance Toexplore these effects we select three applications thatare highly memory-bound [18] (mcf) medium memory-bound (lbm) and largely not memory-bound (gcc) inFigure 12 One key observation is that out-of-order com-mit in both reluctant and aggressive modes is muchbetter than in-order commit in exposing intrinsic ap-plication memory parallelism Figure 12 shows that thegap between in-order and out-of-order commit is muchlarger in the case of HSW which means that the moreaggressive is the microarchitectures the more MLP iscovered if we apply out-of-order instruction commit Incase of lbm comparing the SLM NHM and HSW weobserved that the rate of performance improvement inthe range of (1 lt= MSHRSize lt= 5) is higher than

the rate in the range of (5 lt MSHRSize lt= 9) Inthe range of (MSHRSize gt 9)) HSW and NHM havecontinued improvement This is most likely the resultof an additional loop iteration containing DRAM ac-cesses that is being covered by out-of-order instructioncommit

Figure 12 also allows us to explore the impact ofmemory-boundedness by comparing across the threeapplications (mcf is most memory bound gcc least) Al-thoughmcf (Figure 12 (d) (e) and (f)) is more memory-bound than lbm (Figure 12 (g) (h) and (i)) it exposesless memory parallelism when the number of MSHRsis increased (the gap between in-order commit and safeaggressive out-of-order commit is larger in case of lbmcompared to mcf ) This is because mcf has more iso-lated cache misses that are misses that cannot be ser-viced at the same time to provide memory level par-allelism Also mcf has branch instructions based onmemory accesses (dependent loads) that miss in thecache hierarchy thereby reducing the number of in-structions that can potentially be committed out-of-order lbm has medium level of memory-boundedness[18] and therefore exposes much more memory paral-lelism as the number of MSHRs is increased Its per-formance improvement is therefore much better whenthe CPU commits instructions out-of-order gcc bench-mark Figure 12 (j) (k) and (l) is one of the leastmemory-bound benchmarks As a result increasing thenumber of MSHRs does not improve its performanceeven in the case of unsafe out-of-order commit Over-all out-of-order commit outperforms in-order commitby exposing additional memory parallelism (both toDRAM and in the hierarchy) In summary in case ofMLP out-of-order commit provides more benefit forbigger architecture and more memory-bound applica-tions

14 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

1 2 3 4 5 6 7 8 9 10123456789

Norm

alize

d Pe

rform

ance

improv

emen

t ()

(a) SPEC2006 SLM

1 2 3 4 5 6 7 8 9 10123456789

(b) SPEC2006 NHM

IOC Safe-AOOC Unsafe-AOOC

1 2 3 4 5 6 7 8 9 10123456789

(c) SPEC2006 HSW

1 2 3 4 5 6 7 8 9 10123456789

(d) mcf SLM

1 2 3 4 5 6 7 8 9 10123456789

(e) mcf NHM

1 2 3 4 5 6 7 8 9 10123456789

(f) mcf HSW

1 2 3 4 5 6 7 8 9 10123456789

(g) lbm SLM

1 2 3 4 5 6 7 8 9 10123456789

(h) lbm NHM

1 2 3 4 5 6 7 8 9 10123456789

(i) lbm HSW

1 2 3 4 5 6 7 8 9 10123456789

(j) gcc SLM

1 2 3 4 5 6 7 8 9 10MSHR size

123456789

(k) gcc NHM

1 2 3 4 5 6 7 8 9 10123456789

(l) gcc HSW

Fig 12 Memory hierarchy parallelism comparison between in-order commit and safe and unsafe out-of-ordercommit in MSHRs size of 1 to 10 The results have been normalized to in-order commit with MSHR size of oneSubgraph (a) (b) and (c) are based on harmonic mean across SPEC CPU2006 The mcf lbm and gcc benchmarksare respectively representative of categories with high medium and low memory boundedness

6 Early Release of Physical Registers vsOut-of-order Commit

Register renaming enables the avoidance of false de-pendencies through the register file thereby improvingcore performance It does this by translating architec-tural registers to larger set of physical registers As thisrenaming is done early in the pipeline and the physi-

cal registers are not released until commit which mayhappen long after the last consumer reads the data thephysical register entries may be kept alive for longerthan what the dependencies alone require To addressthis resource constraint techniques such as delayed al-location and Rarly Release of the Physical Register

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 15

Table 4 The Contribution of exhausted (full) regis-ter file (RF) and reorder buffer (ROB) on CPU stallsOther reasons to the CPU stalls could be the instruc-tion queue (IQ) the load-store queue (LSQ) instruc-tions cache misses or not having a free functional unitto allocate to the ready-to-dispatch instructions

Microarchitecture RF ROB Othermin avg max min avg max min avg max

SLM 6 59 86 13 36 92 0 5 27NHM 2 68 91 8 30 98 0 20 7HSW 2 69 91 8 29 98 0 20 7

(ERPR) [23] have been developed In this section wecompare ERPR with the OOC taxonomy [2]

OOC and ERPR are similar as they release physicalregister as early as possible Aggressive OOC outper-forms ERPR in general because it releases ROB andLSQ entries early as well as physical registers Unfor-tunately neither ERPR nor OOC can release IQ entriesearlier than usual because OOC only address processorstate after the instructions have left the IQ and ERPRis effective before instructions are inserted tot the IQFrom another point of view OOC put additional pres-sure on the IQ and provides free entries in one or all ofthe resource of superscalar processors

61 Analysis of CPU Aggressiveness

CPU aggressiveness analysis in this context is basedon microarchitecture configurations (see Table 2) andcommit-depth The overall trend of Figure 13 showsthat AOOC outperforms ROOC and ERPR in all con-figurations This is not surprising as AOOC is alwaysenabled while ROOC is enabled only when a resourcesis exhausted (see Section 25)

ERPR is essentially a subset of AOOC that only fo-cuses on the register file (RF) Therefore the higher isthe program sensitivity to the RF capacity the higher isthe effect of ERPR To better understand this behaviorwe analyzed the reasons for CPU stalls across differentmicroarchitectures (Table 4) The data show that in-creasing the effective size of RF (eg what ERPR doesconceptually) is less effective in SLM compare to NHMand HSW barbecue is SLM the CPU stalls are moreoften due to other resources such as the ROB size Asa result since in SLM architecture the size of RF isless effective in CPU stalls ROOC outperforms ERPRin terms of performance improvement (ROOC can vir-tually extend the effective size of ROB LSQ as wellas RF although ERPR only does this on RF) RF inNHM and HSW architectures is more effective on CPUstall than in SLM architecture therefore ERPR out-

performs ROOC regarding performance improvementin these two architecture

By focusing only on ERPR we observe that thewider the core the more effective is the is increasingcommit-depth (wider commit-depth covers more readyinstructions to commit) For example in the case ofSLM although widening the commit-depth releases ad-ditional physical registers earlier than general thesefreed registers are not effective anymore because theCPU stalls are due to limits in other resources (likeROB IQ or LSQ) As NHM and HSW have more re-sources increasing the commit-depth is more effectivein those architectures In the case of AOOC increas-ing the commit-depth is more effective than ROOCand ERPR because it enables additional instructionsto be committed out-of-order (because it is active onevery cycle compare to ROOC and also affects all ofthe resources compare to ERPR that only affects theRF) In addition we can see that the more aggressivethe CPU the more effective is increasing the commit-depth This is because wider cores execute and com-plete instructions that are far from the head of ROBearlier with their additional resources(either providedby the baseline or released by the OOOC) and theycan therefore provide more instructions to commit ifthe commit-depth is increased

62 Analysis of Out-of-Order Conditions

Figure 13 shows the potential for performance improve-ment from relaxing the out-of-order commit conditionsacross the three architectures for both safe and unsafeout-of-order-commit As has been seen previously [2 521] the least aggressive out-of-order commit configu-ration (safe_OOC with a commit depth of 4) is mostbeneficial for the least aggressive out-of-order processorSLM (compare the left-most bars of Figure 13(a-c))This is expected as the SLM processor has the least ex-ecution hardware and therefore suffers from more stallsthan the more aggressive processors

Increasing the commit depth from 4 to 8 (secondgroup of bars in Figure 13(a-c)) shows that the more ag-gressive out-of-order processors (NHM and HSW) nowbenefit more from out-of-order commit than the sim-pler SLM architecture This new result shows the inter-action between the aggressiveness of the out-of-orderexecution and out-of-order commit for more aggres-sive out-of-order execution there is more benefit frommore aggressive out-of-order commit The reason is thataggressive architectures appear to have more unusedresources available for computation This trend con-tinues through the more aggressive out-of-order com-mit modes as well with unsafe optimizations (Figure

16 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

4 8 16 320

20406080

100

Perfo

rman

ceim

prov

emen

t () (a) SLM-SAFE

4 8 16 32 128Commit Depth

020406080

100

(b) NHM-SAFEAOOC ROOC ERPR

4 8 16 32 128 1920

20406080

100

(c) HSW-SAFE

4 8 16 320

20406080

100

(d) SLM-UnsafeLD

4 8 16 32 1280

20406080

100

(e) NHM-UnsafeLD

4 8 16 32 128 1920

20406080

100

(f) HSW-UnsafeLD

4 8 16 320

20406080

100

(g) SLM-UnsafeBR

4 8 16 32 1280

20406080

100

(h) NHM-UnsafeBR

4 8 16 32 128 1920

20406080

100

(i) HSW-UnsafeBR

4 8 16 320

20406080

100

(j) SLM-UnsafeST

4 8 16 32 1280

20406080

100

(k) NHM-UnsafeST

4 8 16 32 128 1920

20406080

100

(l) HSW-UnsafeST

4 8 16 320

20406080

100

(m) SLM-UnsafeEXC

4 8 16 32 1280

20406080

100

(n) NHM-UnsafeEXC

4 8 16 32 128 1920

20406080

100

(o) HSW-UnsafeEXC

4 8 16 320

20406080

100

(p) SLM-Unsafe

4 8 16 32 1280

20406080

100

(q) NHM-Unsafe

4 8 16 32 128 1920

20406080

100

(r) HSW-Unsafe

Fig 13 Comparison between AOOC ROOC and ERPR in different commit-depth and microarchitecture setupin six different modes UnsafeXX is equivalent to (enforcing) all out-of-order commit conditions except conditionXX By relaxing the specific XX condition the dependence between other conditions is also observed

13(d-r)) benefiting the more aggressive out-of-order ex-ecution designs (NHM and HSW) more than the sim-pler one (SLM) The reasons for this are similar tothe safe case as the more aggressive processors arefaster to achieve the first out-of-order condition in-struction completion and therefore providing greatercommit depths (eg changing the commit depth from 4to 8 in Figure 13 (a-c)) results in a larger likelihoodof finding instructions to commit out-of-order Table 5

uses this analysis to rank the relative importance ofeach out-of-order commit condition for the three ar-chitectures This information provides a guideline forwhich areas to work on to achieve the best performanceimprovement depending on the baseline out-of-order ex-ecution

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 17

Conditioncommit-depth 4 8 16 32 128 192

Unsafe_LD 1 2 2 2 NA NAUnsafe_BR 2 1 1 1 NA NAUnsafe_ST 4 4 4 4 NA NAUnsafe_EXC 3 3 3 3 NA NA

(a) SLMConditioncommit-depth 4 8 16 32 128 192

Unsafe_LD 1 2 2 2 2 NAUnsafe_BR 3 1 1 1 1 NAUnsafe_ST 4 4 4 4 4 NAUnsafe_EXC 2 3 3 3 3 NA

(b) NHMConditioncommit-depth 4 8 16 32 128 192

Unsafe_LD 4 2 2 2 2 2Unsafe_BR 1 1 1 1 1 1Unsafe_ST 3 3 4 4 4 4Unsafe_EXC 2 4 3 3 3 3

(c) HSW

Table 5 Ranking of the benefits of out-of-order commitconditions in different microarchitecture based on thedepth of commit

7 Related Work

The goal of this work is to provide a detailed under-standing into the potential for performance benefitsacross different out-of-order commit conditions and lev-els of aggressiveness While this work aims to describethe maximum potential benefit for each individual con-dition there have been many previous works that de-scribe hardware solutions for early release of hardwarestructures and out-of-order commit strategiesBelowwe provide an overview of these works and how theyfit into the categories described in this work

71 Speculative Release of Hardware StructuresA number of implementations require register and pro-cessor state checkpointing support to speculatively re-tire or release hardware structures [12 11 22 1 17]

Processor state checkpointing especially with theadvent of very large SIMD registers such as AVX-512registers which now support up to 32 registers with upto 512 bits per register can require a significant amountof state to be saved when speculation is aggressive

72 Non-speculative Structure Release

Non-speculative early release of hardware structures be-fore commit requires knowledge that no older instruc-tion (that has come earlier in the instruction stream)can cause the program to abort raise an exception orrequire exposure of the architected state at that timeNon-speculative solutions have the potential to be themost energy efficient a necessity in an era of the end ofDennard Scaling for power-limited platforms A range

of solutions from hardware-only to software-assistedsolutions are described below

Compiler support Two previous works [1 13]allow for early commit or reclaim of resources basedon compiler knowledge but require software recompi-lation

Checkpoint-free A number of solutions do notuse checkpoints [5 21 13] and therefore do not requirespeculation (they respect the all commit conditions) orused software help to extend knowledge to the hard-ware

Selective early commit Early commit of loads [1415] allows for loads that have not yet received data fromthe memory hierarchy to become part of the commit-ted state of a processor This allows the processor tocontinue to process instructions past normally blockedstructures improving performance for memory-boundworkloads To accomplish this the authors [15] de-couple page faults that can occur from the fetching ofdata While recompilation is not strictly required forthis technique the authors evaluate their work using acompilation strategy to expose loads early Late alloca-tion and early release of physical registers is proposedas a register renaming scheme in [23] They explored theeffect of late allocation and early release of physical reg-ister on the performance of an 8-way superscalar pro-cessor without the need for additional speculation andmaintains precise exceptions The out-of-order commitmethods evaluated in this work overlap with the earlyrelease of registers while also evaluating the early re-lease of other structures (like the LSQ)

Evading out-of-order commit conditions[24] provides a coherency protocol level solution suchthat the loadrarrload reordering can be evaded non spec-ulatively under TSO In their design it is no longer nec-essary to squash and re-execute speculatively reorderedloads if their reordering can be hidden by the coherencyprotocol This allows speculatively reordered loads thatotherwise adhere to the rest of the Bell-Lipasti condi-tions to be committed out-of-order without affectingthe memory model

8 Conclusion

To obtain higher performance extending the reach ofthe processor core has been a focus in much of microar-chitecture research One promising direction is the useof out-of-order commit which releases precious proces-sor resources early to allow the processor to extend itsreach past typical hardware limits In this work wepresent a limit study for out-of-order commit throughthe introduction of reluctant and aggressive out-of-order

18 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

commit modes We show how smaller processors evenwith a limited commit scan depth can benefit fromout-of-order commit strategies but that larger aggres-sive cores require deeper commit scan depths to achieveimproved performance In addition we provide a de-tailed breakdown of the contributions for each out-of-order commit condition for the SPEC CPU2006 bench-mark suite and compare against similar works suchas early release of physical registers Our results showa very high potential for performance improvementabove 225x for some benchmarks and believe that out-of-order commit strategies can play an important rolefor future energy-efficient and high-performance proces-sor designs

References

1 Afram F Zeng H Ghose K A group-commitmechanism for ROB-based processors implement-ing the x86 ISA In HPCA pp 47ndash58 (2013)

2 Alipour M Carlson TE Kaxiras S Exploringthe performance limits of out-of-order commit InCF pp 211ndash220 (2017)

3 Allan A Edenfeld D Joyner Jr WH KahngAB Rodgers M Zorian Y 2001 technologyroadmap for semiconductors Computer 35(1) 42ndash53 (2002)

4 Badalone R Dramrsquos surprising role in the costof data centers httpwwwdatacenterknowledgecomarchives20151112dont (2015)

5 Bell GB Lipasti MH Deconstructing commitIn ISPASS pp 68ndash77 (2004)

6 Binkert N Beckmann B Black G ReinhardtSK Saidi A Basu A Hestness J Hower DRKrishna T Sardashti S Sen R Sewell KShoaib M Vaish N Hill MD Wood DA Thegem5 simulator SIGARCH Comput Archit News39(2) 1ndash7 (2011)

7 Carlson TE Heirman W Allam O Kaxiras SEeckhout L The load slice core microarchitectureIn ISCA pp 272ndash284 (2015)

8 Chou Y Fahs B Abraham S Microarchitec-ture optimizations for exploiting memory-level par-allelism In ISCA pp 76ndash88 (2004)

9 Corporation I Intel Rcopy 64 and ia-32 ar-chitectures optimization reference manualhttpwwwintelcomcontentwwwusenarchitecture-and-technology64-ia-32-architectures-optimization-manualhtml(2016)

10 Corporation I Intel Rcopy intelrsquos rsquotick-tockrsquo seeminglydead becomes rsquoprocess-architecture-optimizationrsquohttpwwwanandtechcomshow10183

intels-tick-tock-seemingly-dead-becomes-process-architecture-optimization (2016)

11 Cristal A Ortega D Llosa J Valero M Out-of-order commit processors In HPCA pp 48ndash59(2004)

12 Cristal A Santana OJ Valero M MartiacutenezJF Toward kilo-instruction processors ACMTrans Archit Code Optim 1(4) 389ndash417 (2004)

13 Duong N Veidenbaum AV Compiler-assistedselective out-of-order commit IEEE Computer Ar-chitecture Letters 12(1) 21ndash24 (2013)

14 Gwennap L Digital leads the pack with 21164 InMicroprocessor Report 8(12) pp 249ndash260 (1994)

15 Ham T Aragoacuten JL Martonosi M Desc De-coupled supply-compute communication manage-ment for heterogeneous architectures In MICROpp 191ndash203 (2015)

16 Henning JL SPEC CPU2006 benchmark descrip-tions SIGARCH Comput Archit News 34(4) 1ndash17 (2006)

17 Hilton A Roth A BOLT Energy-efficient out-of-order latency-tolerant execution In HPCA pp1ndash12 (2010)

18 Jaleel A Memory characterization of workloadsusing instrumentation driven simulation httpwwwglueumdeduajaleelworkload (2010)

19 Kroft D Lockup-free instruction fetchprefetchcache organization In ISCA pp 81ndash87 (1981)

20 Lee D Kim Y Seshadri V Liu J Subrama-nian L Mutlu O Tiered-latency dram A lowlatency and low cost dram architecture In HPCApp 615ndash626 (2013)

21 Marti S Borras J Rodriguez P Tena RMarin J A complexity-effective out-of-order re-tirement microarchitecture IEEE Transactions onComputers 58(12) 1626ndash1639 (2009)

22 Martinez JF Renau J Huang MC PrvulovicM Cherry Checkpointed early resource recyclingin out-of-order microprocessors In MICRO pp3ndash14 (2002)

23 Monreal T Vinals V Gonzalez J GonzalezA Valero M Late allocation and early release ofphysical registers IEEE Trans Comput 53(10)1244ndash1259 (2004)

24 Ros A Carlson TE Alipour M Kaxiras SNon-speculative load-load reordering in TSO InISCA pp 187ndash200 (2017)

25 Smith JE Pleszkun AR Implementing preciseinterrupts in pipelined processors IEEE Transac-tions on Computers 37(5) 562ndash573 (1988)

26 Sohi GS Vajapeyam S Instruction issue logicfor high-performance interruptable pipelined pro-cessors In ISCA pp 27ndash34 (1987)

  • Introduction
  • Out-of-Order Commit
  • Methodology
  • Out-of-Order Commit Evaluation
  • Memory Parallelism
  • Early Release of Physical Registers vs Out-of-order Commit
  • Related Work
  • Conclusion
Page 8: Maximizing Limited Resources: A Limit-based Study and ...tcarlson/pdfs/alipour2018mlralsatoo… · Maximizing Limited Resources: A Limit-based Study and Taxonomy of Out-of-order Commit

8 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

SLM SLMOOC

NHMNHMOOC

HSWHSWOOC

20406080

100

Cycle

s ()

4 committed3 committed

2 committed1 committed

0 committed

Fig 4 Commit bandwidth distribution for the SPECCPU2006 benchmarks of a 4-wide core with in-ordercommit and out-of-order commit respecting all commitconditions (Safe_OOC) Out-of-order commit increasescommit pressure even without aggressive speculation

reason is that the smaller processor (with a shorter in-struction reach and given a balanced design smallerhardware structures) will more likely stall as it exposesa smaller amount of the potential ILP in an applica-tion In case of a larger commit depth the more ag-gressive cores (NHM and HSW) have higher potentialperformance improvement (See Section 44 for a de-tailed overview) Out-of-order commit frees the proces-sor from the traditional limits reducing the number oftimes the processor experiences exhausted resources Inmedium and large aggressive cores thanks to a largerROB as well as other hardware resources more intrinsicILP is extracted by traditional in-order commit leav-ing less potential for out-of-order commit with a narrowcommit depth

42 The Effect on Commit Bandwidth

Figure 4 shows the distribution of the number of com-mitted instructions per cycle for three different microar-chitectures for both in-order and out-of-order commitacross all SPEC CPU2006 benchmarks Although an in-order commit 4-wide commit HSW microarchitecturecan retire up to four instructions per cycle this occurson average less than 20 of the time In practice wesee a large number of commit stage stalls (zero instruc-tions committed per cycle) For out-of-order committhe distribution shifts toward four instructions per cy-cle reflecting the improved commit performance (andthe resulting improvement in overall performance) Fi-nally this figure shows that for smaller less aggressivecores (such as SLM see Table 2) out-of-order commitprovides a relatively larger improvement compared tothe other microarchitectures because of the short com-

LQ SQ IQ RF ROB00

05

10

15

20

Norm

alize

d effective siz

e

IOC Safe_OOC Unsafe_OOC

Fig 5 Effective resource use in in-order and Aggressiveout-of-order commit across SPEC CPU2006

mit depth of four Larger cores improve more with moreaggressive commit depths and out-of-order commit pa-rameters (see section 44 for details)

43 The Effect on Resources

One of the main issues that out-of-order commit canhelp resolve is the early release (and subsequent reuse)of hardware resources that otherwise would still be re-quired to maintain the in-order state in an in-ordercommit processor Therefore with a small ROB (andother appropriately sized structures) given in-order com-mit the core can more easily stall when it runs out ofresources For example consider a ROB size of 32 froma SLM microarchitecture When the ROB is full it con-tains 32 micro-operations in flight and the CPUrsquos backend will no longer accept additional micro-operationsfrom the front end Increasing the physical size of theROB along with other resources is one potential butrather expensive solution to this issue An alternativesolution and one of the benefits of using out-of-ordercommit is the early release of resources This early re-lease increases the effective size of a resource (comparedto an in-order commit core) improving performance ofthe core

Benchmark specific analysis The xalan bench-mark is one of the top 5 benchmarks with a large num-ber of CPU stalls caused by exhausted resources BothAOOC and ROOC are very effective for the xalan bench-mark OOC is able to provide additional free entries inthe ROB RF and LSQ (see Figure 7)

On the other hand leslie3d has the lowest num-ber of CPU stalls based on exhausted resources whichlimits the potential for improvement with OOC

Figure 5 compares the effective size of the SLM mi-croarchitecture between in-order commit and both safeand unsafe aggressive out-of-order commit (AOOC) mod-els for the SPEC CPU2006 benchmarks The resultsare normalized to a fixed size SLM_IOC microarchitec-ture (see Table 2) An effective size of 10 translates tofrequent stalls due to resource exhaustion while sizes

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 9

4 8 16 32 128192

(a) SLM

20

40

60

80

100

120

Perfo

rmance

improvem

ent ()

4 8 16 32 128192

Commit depth

(b) NHM

20

40

60

80

100

120

Safe_OOC Unsafe_OOC

4 8 16 32 128192

(c) HSW

20

40

60

80

100

120

Fig 6 Effect of commit depth on performance improve-ment normalized to their respective in-order commitperformance across SPEC CPU2006 The smaller corereceives less benefit from increasing the depth of com-mit Safe_OOC saturates at a commit depth of 8 16and 32 for SLM NHM and HSW respectively (the dis-tance to the first unresolved instruction from the headof the ROB) There is no saturation in performance im-provement of Unsafe_OOC while the commit depth isincreased

greater than 10 shows the effective resource size in-creases due to OOC

For aggressive out-of-order commit the larger effec-tive sizes show the reach of this technique We see thatthe utilization of all structures except for the ROB isalmost the same for safe and unsafe out-of-order com-mit which allows Safe_OOC to achieve most of theperformance of the unsafe version Nevertheless the un-safe core is much better at improving the reach of theROB allowing applications like hmmer continue to showa benefit when moving from safe to unsafe out-of-ordercommit See Section 452 for more details on hmmer

44 Evaluation of Commit Depth

To gauge the effect of the commit depth (ie how far wescan the ROB to find instructions that can commit out-of-order) we impose a hard limit on it and evaluate theeffects on the resulting performance The strictest limitis the commit-width itself starting to commit out-of-order from the first commit-width instructions We thenrelax this to the immediate vicinity (eg double thecommit-width) and progressively relax until we reachthe size of the ROB

In Figure 6 we see that a commit depth of 4 (equalto commit width) provides the smallest benefit but alsothe smallest difference between Safe_OOC and Un-safe_OOC In addition the SLM core benefits morefrom OOC compared to NHM and HSW when com-mit depth is smaller than 8 At a commit depth of 8and above the large aggressive cores benefit much more

than the smaller core type with the maximum improve-ment of Unsafe_OOC at 80 119 and 129 for SLMNHM and HSW respectively For aggressive cores thelarger commit depth allows for continued performanceimprovements and will be necessary for the design of abalanced aggressive out-of-order commit design

45 Out-of-Order Commit Performance

In this section we analyze the out-of-order commit con-ditions to determine the minimum (and maximum) per-formance improvement potential The minimum andmaximum improvement is provided by Safe_OOC andUnsafe_OOC respectively In Figure 7 we show theamount of improvement provided by both safe and un-safe out-of-order commit for both aggressive and reluc-tant modes for all three microarchitectures

451 Safe_OOC

Honoring all out-of-order commit conditions results in amodest performance improvement One implication forfuture processors is that Safe_OOC does not requireadditional support for speculative out-of-order rollbackrecovery mechanisms it requires support for the com-mit of instructions out-of-order and the freeing of struc-tures for use by future instructions

Safe_AOOC When Safe_AOOC is evaluated onthe SLM microarchitecture (see Figure 7a) the rangeof improvement spans from a low of 3 (leslie3d) upto 82 (xalan) with an average of 44 In the NHMmicroarchitecture the improvement ranges from 9 to108 for milc and xalan with an average improvementof 57 and for HSW we see an improvement of 1for mcf to 110 for wrf with an average of 55 (SeeSection 44 for more details)

Safe_ROOC For Safe_ROOC the performanceimprovement is lower for all three microarchitecturescompared to AOOC (See Subsection 25 for additionaldetails) In the SLM microarchitecture the range ofperformance improvement of safe ROOC is from 2 to55 for calculix and gcc respectively with an averageimprovement of 20 In the NHM microarchitecturewe see performance improvements that range from 1(mcf) to 22 (bwaves) with an average improvement of7 (8 for HSW) Because NHM and HSM have fewerCPU stalls ROOC has less of an effect on performancecompared with the smaller more efficient core

452 Unsafe_OOC

By relaxing all conditions Unsafe_OOC provides themaximum potential for performance improvement Un-

10 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

020406080

100120140

Perfo

rman

ceim

prov

emen

t () a) AOOC across SpecCPU

SLM_UnsafeSLM_Safe

NHM_UnsafeNHM_Safe

HSW_UnsafeHSW_Safe

astarbwaves

bzip cactusadm

calculixdealii

gamessgcc gemsfdtd

gobmkgromacs

h264refhmmer

lbm leslie3dlibquantum

mcf milc namdomnetpp

perl povraysjeng

soplexsphinx3

tontowrf xalan

zeusmpSpecCPU-avg

020406080

100120140

b) ROOC across SpecCPU

Fig 7 IPC improvement of safe and unsafe out-of-order commit relative to in-order commit as a baseline for bothreluctant and aggressive versions applied on SPEC CPU2006 benchmarks on three microarchitectures

safe_OOC will require recovery mechanisms for thesetechniques which can reduce the performance potentialbecause of recovery costs

To understand the effectiveness of all conditions to-gether we consider zero cost for recovery for any mis-speculated out-of-order commit condition

Unsafe_AOOC This technique provides the high-est performance improvement for all three different ar-chitectures In SLM cores the improvement ranges from9 to 120 for libquantum and astar applicationsrespectively and the average is 59 In NHM archi-tecture the average is 72 and the range is betweenmilc and astar respectively with 11 and 196 im-provement In HSW cores similar to NHM cores astarhas the maximum benefit from Unsafe_ooc with 192improvement while the minimum improvement is fordealii at 7 The average improvement for this classof architecture is 70

Unsafe_ROOC Reluctant out-of-order commit islower performing because it is not continuously lookingto commit additional instructions (See Section 25 fordetails) In the case of SLM the improvement rangesfrom 3 to 93 respectively for leselie3d and xalanwith an average of a 38 improvement In NHM mcfand bwaves with 1 and 22 show the maximum andthe minimum improvement with an average of 7 (HSWis similar with an average of 8) Between the threemicroarchitectures the limited SLM benefits the mostfrom ROOC because of the large number of stalls seenby this core Therefore ROOC especially Unsafe_ROOCis an interesting methodology to improve the perfor-mance of relatively small but energy efficient CPUs aswe see a relatively high performance improvement fora less aggressive commit implementation

Benchmark-specific Analysis The hmmer bench-mark is a particularly strong case for the benefits of out-of-order commit for SLM This application is L1-cacheresident and exhibits very few last-level cache missesNevertheless we still see a very strong improvement inperformance from 33 to 52 for Safe_OOC increas-ing to 51 to 66 for Unsafe_OOC Looking aheadto Figure 8 and Figure 11 we can see that evadingthe branch condition provides the most benefit for thisapplication Making room for additional instructions toallow the hardware to expose additional ILP works welleven for those applications without a significant num-ber of LLC misses The mcf benchmark contains load-dependent branches has the highest MPKI (misses perkilo instructions) among the benchmarks and thereforeit has a rather low IPC when it is executed on an in-order commit CPU This results in a good opportunityto improve performance as it is extremely limited bythese misses

46 Performance Effects of Commit Conditions

In the previous section by analyzing safe and unsafeout-of-order commit we observe that there is a largegap between the performance improvement of these twoimplementations Understanding the cause of this per-formance improvement (by looking at individual com-mit conditions in isolation) allows us to better under-stand where to focus future hardware efforts

461 Positive Contribution of Out-of-Order CommitConditions

To study the gap between safe and unsafe out-of-ordercommit (Figure 7) we analyze the effect of relaxing

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 11

020406080

100

Contrib

ution in

improv

emen

t () a) SLM-AOOC across SpecCPU

Safe Unsafe_LD Unsafe_ST Unsafe_BR Unsafe_EXC

020406080

100b) AOOC SpecCPU average

astarbwaves

bzip cactusadm

calculixdealii

gamessgcc gemsfdtd

gobmkgromacs

h264refhmmer

lbm leslie3dlibquantum

mcf milc namdomnetpp

perl povraysjeng

soplexsphinx3

tontowrf xalan

zeusmp

020406080

100c) SLM-ROOC across SpecCPU

SLM-avgNHM-avg

HSW-avg

020406080

100d) ROOC SpecCPU average

Fig 8 Contribution of safe and selectively unsafe out-of-order commit on three different microarchitecturesUnsafe_XX is equivalent to activating (enforcing) all out-of-order commit conditions except XX (the XX conditionis relaxed) By relaxing the specific XX condition the dependence between other conditions is also observed

each condition in the presence of the other preservedconditions in Figure 8 We analyze the SLM microar-chitecture in detail and provide averages across all mi-croarchitectures for both AOOC and ROOC Each out-of-order commit condition in analyzed in isolation andwe consider Unsafe_OOC (all relaxed conditions) asthe 100 potential improvement budget In the case ofthe mcf benchmark in Figure 7a the safe and unsafeOOC performance improvement is 33 and 71 re-spectively (46 of the potential improvement budget isprovided by Safe_OOC) We also observe that by re-laxing the LD condition (unsafe_LD) 52 of potentialimprovement budget is achievable (see Figure 8a) InFigure 8 we can see in some applications (like namd inAOOC mode and leslie3d in ROOC mode) that re-laxing just a single condition is not sufficient to fill thegap between safe and unsafe OOC This does not meanthat a single condition is not important but rather thatother preserved conditions are preventing out-of-ordercommit from achieving its full potential

AOOC We observe that for most of applicationsUnsafe_BR and Unsafe_LD are the most interestingconditions (Figure 8a) Additionally the more aggres-sive the core the more important the Unsafe_LD con-dition becomes In SLM NHM and HSW CPUs Un-safe_LD respectively fills 4 10 and 12 and Un-safe_BR fills 9 8 and 7 of the gap between safeand unsafe OOC Unsafe_ST is not very effective be-cause of the rarity of this condition and the conserva-tive memory dependence predictor used Unsafe_EXCor relaxing exceptions are not that effective becausethey are very rare especially in integer benchmarks

ROOC Relaxing OOC conditions in ROOC hasless effect in reducing the gap between safe and un-safe OOC and this is because of the nature of this on-demand OOC mode which is enabled and needed more

in SLM and much less in NHM and HSW (see Fig-ure 7b)

Benchmark-specific Analysis The astar andgobmk benchmarks have the highest number of mis-predicted branches per 1000 instructions Therefore re-laxing the branch condition (unsafe_BR) improves theperformance of these two benchmarks by 68 and 94respectively On the other hand benchmarks such ascactusadm and lbm have a low branch mispredictionrate (05) and therefore relaxing the branch condi-tion does not show significant improvement for thesebenchmarks

The sphinx benchmark has high degree of intrinsicILP2 thus an increase in the number of effective re-sources allows this benchmark to improve performanceAs most of its load instructions are L2-cache missesrelaxing the load condition (Unsafe_LD) improves per-formance of this benchmark

462 Negative contribution of OOC conditions

In this section we analyze the gap between safe andunsafe out-of-order commit from a different angle

We relax all of the out-of-order commit conditionsexcept for one For example safe_LD means the LDcondition is preserved but ST BR and EXC conditionsare relaxed By preserving one of the conditions we lookat the negative effect (performance reduction) of theactivated condition compared to Unsafe_OOC

Figure 11 depicts the effect of each condition onperformance For most of the benchmarks the BR andthe LD conditions are the most effective ones Amongfloating-point benchmarks LD and EXC conditions havea large impact on performance Therefore relaxing theEXC condition as it is rare could lead to significant

2 The sphinx benchmark continues to show performance im-provements as the in-order commit processor aggressiveness in-creases from SLM to HSW

12 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

200 400 600 800 1000DRAM latency (cycle)

020406080

100120140160

Perfo

rman

ceim

prov

emen

t (

)Safe_OOC Unsafe_OOC

Fig 9 A comparison between safe and unsafe out-of-order commit across DRAM latencies for SLM andAOOC on SPEC CPU2006 OOC improvement in-creases with a higher DRAM latency Unsafe_OOCoutperforms Safe_OOC This study uses caches thatare four times smaller compared to Table 1 to put ad-ditional stress on the DRAM This has been done forboth in-order and out-of-order commit configurations

performance improvements at relatively low cost espe-cially if recovery mechanisms in software are used SThas the least effect among out-of-order commit condi-tions when it is preserved in isolation from other condi-tions This is valid between all three microarchitectures

47 Memory Latency Evaluation

Current DRAM cells are optimized for cost and notfor access latency [20] The potential to use more af-fordable but higher latency memory could have a largeimpact on allocation of memory in datacenter serversas high density DRAM modules cost much more thanthose that are 4times smaller (175times per GB [4]) To eval-uate the potential of out-of-order commit to handlehigher DRAM latency we evaluate memory latency from200 cycles to 1000 cycles in 200 cycle increments Inaddition we reduce the size of all caches by 4 timeswhen compared to Table 1 to put additional stress onthe DRAM subsystem Here we see an increasing per-formance improvement for the Unsafe_OOC conditionwhile Safe_OOC increases linearly compared to the in-order commit core For memory-intensive applicationsan efficient implementation of Unsafe_OOC could al-low the use of denser higher-latency DRAM that couldpotentially cost much less

48 Prefetching Evaluation

Prefetching is essential for performance in modern sys-tems and interacts tightly with the pipeline and out-of-order commit We do not consider all prefetchingoptions but instead focus on the potential benefitsTherefore to evaluate this interaction we configure theL1-D cache in gem5 with a stride prefetcher of degree8 Figure 10 compares the relative performance gains

IOC+pref

AOOCAOOC+pref

(b) SLM

0

20

40

60

80

Perfo

rman

ceim

prov

emen

t (

)

IOC+pref

AOOCAOOC+pref

(b) NHM

0

20

40

60

80

IOC+pref

AOOCAOOC+pref

(c) HSW

0

20

40

60

80

Fig 10 The effect of aggressive out-of-order commit onthe performance with and without prefetchers in theL1 data cache across SPEC CPU2006 Improvementis relative to the baseline IOC architectures withoutprefetchers

of prefetchers for in-order commit out-of-order com-mit and prefetchers for out-of-order commit Acrossall three architectures both out-of-order commit andprefetching on their own provide roughly 40 improve-ment in performance with out-of-order commit beingslightly better However the combination of out-of-ordercommit and prefetching delivers nearly 70 better per-formance nearly the sum of the two independent contri-butions In fact combining aggressive out-of-order com-mit with prefetchers allows us to reach 84 of the ideal(sum of both) improvement for SLM 77 for NHMand 83 for HSW showing that these techniques workwell together

49 Rollback Costs for Unsafe Branches

While this work does not evaluate hardware costs weare able to evaluate the number of rollbacks caused bycommitting past a mispredicted branch for each of theevaluated configurations For this evaluation we use acommit depth of 8 with aggressive out-of-order commit(AOOC) and Unsafe_BR (only the branch conditionis not respected for out-of-order commit) In Table 3we list the number of executed branches mispredictedbranches the number of rollbacks caused by specula-tively committing a mispredicted branch and the num-ber of instructions that were rolled back due to this roll-back While the number of executed branches increaseswith the aggressiveness of the core (because these corescan more aggressively speculate past branches) the av-erage number of rollbacks (and instructions rolled back)decreases slightly with the aggressiveness of the coreThese aggressive cores resolve speculative state quicklyreducing the number of instructions that need to becommitted out-of-order This results in a similar num-ber of rollbacks per thousand architecturally committedinstructions around 3 for all configurations

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 13

0minus20minus40minus60minus80Co

ntrib

ution in

degrad

ation (

) a) SLM-AOOC across SpecCPU

Safe_LD Safe_ST Safe_BR Safe_EXC

0minus20minus40minus60minus80

b) AOOC SpecCPU average

astarbwaves

bzip cactusadm

calculixdealii

gamessgcc gemsfdtd

gobmkgromacs

h264refhmmer

lbm leslie3dlibquantum

mcf milc namdomnetpp

perl povraysjeng

soplexsphinx3

tontowrf xalan

zeusmp

0minus20minus40minus60minus80

b) SLM-ROOC across SpecCPU

SLM-avgNHM-avg

HSW-avg

0minus20minus40minus60minus80

d) ROOC SpecCPU average

Fig 11 Maximum potential performance improvement is provided by Unsafe_OOC in which all conditions areunsafe By preserving all conditions maximum performance is reduced This figure shows the negative effect ofpreserving (or making safe) a single condition compared to the maximum potential performance improvementSafe_XX is equivalent to disabling all out-of-order commit conditions (all unsafe highest performance) where XXindicates the only safe and preserved condition

Table 3 Out-of-Order Commit Costs for UnsafeBranches

Metric-PKI SLM NHM HSWAvg Max Avg Max Avg Max

Branches 1462 3733 1533 4608 1568 4957Mispred br 78 509 81 545 82 560Num rollbacks 32 243 31 256 30 265Instrs rollback 85 529 66 430 54 343

5 Memory Parallelism

Overlapping cache misses to service them in parallelin particular long-latency accesses to DRAM but alsolower-latency accesses to the LLC can result deliversignificant performance benefits [8] This memory par-allelism is typically achieved through the use of multi-ple Miss Status Holding Registers (MSHRs) [19] whichtrack outstanding memory requests and allow them toexecute in parallel In this section we compare in-ordercommit and out-of-order commit in terms of memoryparallelism (both to DRAM (MLP) and within the cachehierarchy (MHP) [7]) by changing the number of L1MSHRs and observing the effect on performance Toexplore these effects we select three applications thatare highly memory-bound [18] (mcf) medium memory-bound (lbm) and largely not memory-bound (gcc) inFigure 12 One key observation is that out-of-order com-mit in both reluctant and aggressive modes is muchbetter than in-order commit in exposing intrinsic ap-plication memory parallelism Figure 12 shows that thegap between in-order and out-of-order commit is muchlarger in the case of HSW which means that the moreaggressive is the microarchitectures the more MLP iscovered if we apply out-of-order instruction commit Incase of lbm comparing the SLM NHM and HSW weobserved that the rate of performance improvement inthe range of (1 lt= MSHRSize lt= 5) is higher than

the rate in the range of (5 lt MSHRSize lt= 9) Inthe range of (MSHRSize gt 9)) HSW and NHM havecontinued improvement This is most likely the resultof an additional loop iteration containing DRAM ac-cesses that is being covered by out-of-order instructioncommit

Figure 12 also allows us to explore the impact ofmemory-boundedness by comparing across the threeapplications (mcf is most memory bound gcc least) Al-thoughmcf (Figure 12 (d) (e) and (f)) is more memory-bound than lbm (Figure 12 (g) (h) and (i)) it exposesless memory parallelism when the number of MSHRsis increased (the gap between in-order commit and safeaggressive out-of-order commit is larger in case of lbmcompared to mcf ) This is because mcf has more iso-lated cache misses that are misses that cannot be ser-viced at the same time to provide memory level par-allelism Also mcf has branch instructions based onmemory accesses (dependent loads) that miss in thecache hierarchy thereby reducing the number of in-structions that can potentially be committed out-of-order lbm has medium level of memory-boundedness[18] and therefore exposes much more memory paral-lelism as the number of MSHRs is increased Its per-formance improvement is therefore much better whenthe CPU commits instructions out-of-order gcc bench-mark Figure 12 (j) (k) and (l) is one of the leastmemory-bound benchmarks As a result increasing thenumber of MSHRs does not improve its performanceeven in the case of unsafe out-of-order commit Over-all out-of-order commit outperforms in-order commitby exposing additional memory parallelism (both toDRAM and in the hierarchy) In summary in case ofMLP out-of-order commit provides more benefit forbigger architecture and more memory-bound applica-tions

14 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

1 2 3 4 5 6 7 8 9 10123456789

Norm

alize

d Pe

rform

ance

improv

emen

t ()

(a) SPEC2006 SLM

1 2 3 4 5 6 7 8 9 10123456789

(b) SPEC2006 NHM

IOC Safe-AOOC Unsafe-AOOC

1 2 3 4 5 6 7 8 9 10123456789

(c) SPEC2006 HSW

1 2 3 4 5 6 7 8 9 10123456789

(d) mcf SLM

1 2 3 4 5 6 7 8 9 10123456789

(e) mcf NHM

1 2 3 4 5 6 7 8 9 10123456789

(f) mcf HSW

1 2 3 4 5 6 7 8 9 10123456789

(g) lbm SLM

1 2 3 4 5 6 7 8 9 10123456789

(h) lbm NHM

1 2 3 4 5 6 7 8 9 10123456789

(i) lbm HSW

1 2 3 4 5 6 7 8 9 10123456789

(j) gcc SLM

1 2 3 4 5 6 7 8 9 10MSHR size

123456789

(k) gcc NHM

1 2 3 4 5 6 7 8 9 10123456789

(l) gcc HSW

Fig 12 Memory hierarchy parallelism comparison between in-order commit and safe and unsafe out-of-ordercommit in MSHRs size of 1 to 10 The results have been normalized to in-order commit with MSHR size of oneSubgraph (a) (b) and (c) are based on harmonic mean across SPEC CPU2006 The mcf lbm and gcc benchmarksare respectively representative of categories with high medium and low memory boundedness

6 Early Release of Physical Registers vsOut-of-order Commit

Register renaming enables the avoidance of false de-pendencies through the register file thereby improvingcore performance It does this by translating architec-tural registers to larger set of physical registers As thisrenaming is done early in the pipeline and the physi-

cal registers are not released until commit which mayhappen long after the last consumer reads the data thephysical register entries may be kept alive for longerthan what the dependencies alone require To addressthis resource constraint techniques such as delayed al-location and Rarly Release of the Physical Register

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 15

Table 4 The Contribution of exhausted (full) regis-ter file (RF) and reorder buffer (ROB) on CPU stallsOther reasons to the CPU stalls could be the instruc-tion queue (IQ) the load-store queue (LSQ) instruc-tions cache misses or not having a free functional unitto allocate to the ready-to-dispatch instructions

Microarchitecture RF ROB Othermin avg max min avg max min avg max

SLM 6 59 86 13 36 92 0 5 27NHM 2 68 91 8 30 98 0 20 7HSW 2 69 91 8 29 98 0 20 7

(ERPR) [23] have been developed In this section wecompare ERPR with the OOC taxonomy [2]

OOC and ERPR are similar as they release physicalregister as early as possible Aggressive OOC outper-forms ERPR in general because it releases ROB andLSQ entries early as well as physical registers Unfor-tunately neither ERPR nor OOC can release IQ entriesearlier than usual because OOC only address processorstate after the instructions have left the IQ and ERPRis effective before instructions are inserted tot the IQFrom another point of view OOC put additional pres-sure on the IQ and provides free entries in one or all ofthe resource of superscalar processors

61 Analysis of CPU Aggressiveness

CPU aggressiveness analysis in this context is basedon microarchitecture configurations (see Table 2) andcommit-depth The overall trend of Figure 13 showsthat AOOC outperforms ROOC and ERPR in all con-figurations This is not surprising as AOOC is alwaysenabled while ROOC is enabled only when a resourcesis exhausted (see Section 25)

ERPR is essentially a subset of AOOC that only fo-cuses on the register file (RF) Therefore the higher isthe program sensitivity to the RF capacity the higher isthe effect of ERPR To better understand this behaviorwe analyzed the reasons for CPU stalls across differentmicroarchitectures (Table 4) The data show that in-creasing the effective size of RF (eg what ERPR doesconceptually) is less effective in SLM compare to NHMand HSW barbecue is SLM the CPU stalls are moreoften due to other resources such as the ROB size Asa result since in SLM architecture the size of RF isless effective in CPU stalls ROOC outperforms ERPRin terms of performance improvement (ROOC can vir-tually extend the effective size of ROB LSQ as wellas RF although ERPR only does this on RF) RF inNHM and HSW architectures is more effective on CPUstall than in SLM architecture therefore ERPR out-

performs ROOC regarding performance improvementin these two architecture

By focusing only on ERPR we observe that thewider the core the more effective is the is increasingcommit-depth (wider commit-depth covers more readyinstructions to commit) For example in the case ofSLM although widening the commit-depth releases ad-ditional physical registers earlier than general thesefreed registers are not effective anymore because theCPU stalls are due to limits in other resources (likeROB IQ or LSQ) As NHM and HSW have more re-sources increasing the commit-depth is more effectivein those architectures In the case of AOOC increas-ing the commit-depth is more effective than ROOCand ERPR because it enables additional instructionsto be committed out-of-order (because it is active onevery cycle compare to ROOC and also affects all ofthe resources compare to ERPR that only affects theRF) In addition we can see that the more aggressivethe CPU the more effective is increasing the commit-depth This is because wider cores execute and com-plete instructions that are far from the head of ROBearlier with their additional resources(either providedby the baseline or released by the OOOC) and theycan therefore provide more instructions to commit ifthe commit-depth is increased

62 Analysis of Out-of-Order Conditions

Figure 13 shows the potential for performance improve-ment from relaxing the out-of-order commit conditionsacross the three architectures for both safe and unsafeout-of-order-commit As has been seen previously [2 521] the least aggressive out-of-order commit configu-ration (safe_OOC with a commit depth of 4) is mostbeneficial for the least aggressive out-of-order processorSLM (compare the left-most bars of Figure 13(a-c))This is expected as the SLM processor has the least ex-ecution hardware and therefore suffers from more stallsthan the more aggressive processors

Increasing the commit depth from 4 to 8 (secondgroup of bars in Figure 13(a-c)) shows that the more ag-gressive out-of-order processors (NHM and HSW) nowbenefit more from out-of-order commit than the sim-pler SLM architecture This new result shows the inter-action between the aggressiveness of the out-of-orderexecution and out-of-order commit for more aggres-sive out-of-order execution there is more benefit frommore aggressive out-of-order commit The reason is thataggressive architectures appear to have more unusedresources available for computation This trend con-tinues through the more aggressive out-of-order com-mit modes as well with unsafe optimizations (Figure

16 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

4 8 16 320

20406080

100

Perfo

rman

ceim

prov

emen

t () (a) SLM-SAFE

4 8 16 32 128Commit Depth

020406080

100

(b) NHM-SAFEAOOC ROOC ERPR

4 8 16 32 128 1920

20406080

100

(c) HSW-SAFE

4 8 16 320

20406080

100

(d) SLM-UnsafeLD

4 8 16 32 1280

20406080

100

(e) NHM-UnsafeLD

4 8 16 32 128 1920

20406080

100

(f) HSW-UnsafeLD

4 8 16 320

20406080

100

(g) SLM-UnsafeBR

4 8 16 32 1280

20406080

100

(h) NHM-UnsafeBR

4 8 16 32 128 1920

20406080

100

(i) HSW-UnsafeBR

4 8 16 320

20406080

100

(j) SLM-UnsafeST

4 8 16 32 1280

20406080

100

(k) NHM-UnsafeST

4 8 16 32 128 1920

20406080

100

(l) HSW-UnsafeST

4 8 16 320

20406080

100

(m) SLM-UnsafeEXC

4 8 16 32 1280

20406080

100

(n) NHM-UnsafeEXC

4 8 16 32 128 1920

20406080

100

(o) HSW-UnsafeEXC

4 8 16 320

20406080

100

(p) SLM-Unsafe

4 8 16 32 1280

20406080

100

(q) NHM-Unsafe

4 8 16 32 128 1920

20406080

100

(r) HSW-Unsafe

Fig 13 Comparison between AOOC ROOC and ERPR in different commit-depth and microarchitecture setupin six different modes UnsafeXX is equivalent to (enforcing) all out-of-order commit conditions except conditionXX By relaxing the specific XX condition the dependence between other conditions is also observed

13(d-r)) benefiting the more aggressive out-of-order ex-ecution designs (NHM and HSW) more than the sim-pler one (SLM) The reasons for this are similar tothe safe case as the more aggressive processors arefaster to achieve the first out-of-order condition in-struction completion and therefore providing greatercommit depths (eg changing the commit depth from 4to 8 in Figure 13 (a-c)) results in a larger likelihoodof finding instructions to commit out-of-order Table 5

uses this analysis to rank the relative importance ofeach out-of-order commit condition for the three ar-chitectures This information provides a guideline forwhich areas to work on to achieve the best performanceimprovement depending on the baseline out-of-order ex-ecution

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 17

Conditioncommit-depth 4 8 16 32 128 192

Unsafe_LD 1 2 2 2 NA NAUnsafe_BR 2 1 1 1 NA NAUnsafe_ST 4 4 4 4 NA NAUnsafe_EXC 3 3 3 3 NA NA

(a) SLMConditioncommit-depth 4 8 16 32 128 192

Unsafe_LD 1 2 2 2 2 NAUnsafe_BR 3 1 1 1 1 NAUnsafe_ST 4 4 4 4 4 NAUnsafe_EXC 2 3 3 3 3 NA

(b) NHMConditioncommit-depth 4 8 16 32 128 192

Unsafe_LD 4 2 2 2 2 2Unsafe_BR 1 1 1 1 1 1Unsafe_ST 3 3 4 4 4 4Unsafe_EXC 2 4 3 3 3 3

(c) HSW

Table 5 Ranking of the benefits of out-of-order commitconditions in different microarchitecture based on thedepth of commit

7 Related Work

The goal of this work is to provide a detailed under-standing into the potential for performance benefitsacross different out-of-order commit conditions and lev-els of aggressiveness While this work aims to describethe maximum potential benefit for each individual con-dition there have been many previous works that de-scribe hardware solutions for early release of hardwarestructures and out-of-order commit strategiesBelowwe provide an overview of these works and how theyfit into the categories described in this work

71 Speculative Release of Hardware StructuresA number of implementations require register and pro-cessor state checkpointing support to speculatively re-tire or release hardware structures [12 11 22 1 17]

Processor state checkpointing especially with theadvent of very large SIMD registers such as AVX-512registers which now support up to 32 registers with upto 512 bits per register can require a significant amountof state to be saved when speculation is aggressive

72 Non-speculative Structure Release

Non-speculative early release of hardware structures be-fore commit requires knowledge that no older instruc-tion (that has come earlier in the instruction stream)can cause the program to abort raise an exception orrequire exposure of the architected state at that timeNon-speculative solutions have the potential to be themost energy efficient a necessity in an era of the end ofDennard Scaling for power-limited platforms A range

of solutions from hardware-only to software-assistedsolutions are described below

Compiler support Two previous works [1 13]allow for early commit or reclaim of resources basedon compiler knowledge but require software recompi-lation

Checkpoint-free A number of solutions do notuse checkpoints [5 21 13] and therefore do not requirespeculation (they respect the all commit conditions) orused software help to extend knowledge to the hard-ware

Selective early commit Early commit of loads [1415] allows for loads that have not yet received data fromthe memory hierarchy to become part of the commit-ted state of a processor This allows the processor tocontinue to process instructions past normally blockedstructures improving performance for memory-boundworkloads To accomplish this the authors [15] de-couple page faults that can occur from the fetching ofdata While recompilation is not strictly required forthis technique the authors evaluate their work using acompilation strategy to expose loads early Late alloca-tion and early release of physical registers is proposedas a register renaming scheme in [23] They explored theeffect of late allocation and early release of physical reg-ister on the performance of an 8-way superscalar pro-cessor without the need for additional speculation andmaintains precise exceptions The out-of-order commitmethods evaluated in this work overlap with the earlyrelease of registers while also evaluating the early re-lease of other structures (like the LSQ)

Evading out-of-order commit conditions[24] provides a coherency protocol level solution suchthat the loadrarrload reordering can be evaded non spec-ulatively under TSO In their design it is no longer nec-essary to squash and re-execute speculatively reorderedloads if their reordering can be hidden by the coherencyprotocol This allows speculatively reordered loads thatotherwise adhere to the rest of the Bell-Lipasti condi-tions to be committed out-of-order without affectingthe memory model

8 Conclusion

To obtain higher performance extending the reach ofthe processor core has been a focus in much of microar-chitecture research One promising direction is the useof out-of-order commit which releases precious proces-sor resources early to allow the processor to extend itsreach past typical hardware limits In this work wepresent a limit study for out-of-order commit throughthe introduction of reluctant and aggressive out-of-order

18 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

commit modes We show how smaller processors evenwith a limited commit scan depth can benefit fromout-of-order commit strategies but that larger aggres-sive cores require deeper commit scan depths to achieveimproved performance In addition we provide a de-tailed breakdown of the contributions for each out-of-order commit condition for the SPEC CPU2006 bench-mark suite and compare against similar works suchas early release of physical registers Our results showa very high potential for performance improvementabove 225x for some benchmarks and believe that out-of-order commit strategies can play an important rolefor future energy-efficient and high-performance proces-sor designs

References

1 Afram F Zeng H Ghose K A group-commitmechanism for ROB-based processors implement-ing the x86 ISA In HPCA pp 47ndash58 (2013)

2 Alipour M Carlson TE Kaxiras S Exploringthe performance limits of out-of-order commit InCF pp 211ndash220 (2017)

3 Allan A Edenfeld D Joyner Jr WH KahngAB Rodgers M Zorian Y 2001 technologyroadmap for semiconductors Computer 35(1) 42ndash53 (2002)

4 Badalone R Dramrsquos surprising role in the costof data centers httpwwwdatacenterknowledgecomarchives20151112dont (2015)

5 Bell GB Lipasti MH Deconstructing commitIn ISPASS pp 68ndash77 (2004)

6 Binkert N Beckmann B Black G ReinhardtSK Saidi A Basu A Hestness J Hower DRKrishna T Sardashti S Sen R Sewell KShoaib M Vaish N Hill MD Wood DA Thegem5 simulator SIGARCH Comput Archit News39(2) 1ndash7 (2011)

7 Carlson TE Heirman W Allam O Kaxiras SEeckhout L The load slice core microarchitectureIn ISCA pp 272ndash284 (2015)

8 Chou Y Fahs B Abraham S Microarchitec-ture optimizations for exploiting memory-level par-allelism In ISCA pp 76ndash88 (2004)

9 Corporation I Intel Rcopy 64 and ia-32 ar-chitectures optimization reference manualhttpwwwintelcomcontentwwwusenarchitecture-and-technology64-ia-32-architectures-optimization-manualhtml(2016)

10 Corporation I Intel Rcopy intelrsquos rsquotick-tockrsquo seeminglydead becomes rsquoprocess-architecture-optimizationrsquohttpwwwanandtechcomshow10183

intels-tick-tock-seemingly-dead-becomes-process-architecture-optimization (2016)

11 Cristal A Ortega D Llosa J Valero M Out-of-order commit processors In HPCA pp 48ndash59(2004)

12 Cristal A Santana OJ Valero M MartiacutenezJF Toward kilo-instruction processors ACMTrans Archit Code Optim 1(4) 389ndash417 (2004)

13 Duong N Veidenbaum AV Compiler-assistedselective out-of-order commit IEEE Computer Ar-chitecture Letters 12(1) 21ndash24 (2013)

14 Gwennap L Digital leads the pack with 21164 InMicroprocessor Report 8(12) pp 249ndash260 (1994)

15 Ham T Aragoacuten JL Martonosi M Desc De-coupled supply-compute communication manage-ment for heterogeneous architectures In MICROpp 191ndash203 (2015)

16 Henning JL SPEC CPU2006 benchmark descrip-tions SIGARCH Comput Archit News 34(4) 1ndash17 (2006)

17 Hilton A Roth A BOLT Energy-efficient out-of-order latency-tolerant execution In HPCA pp1ndash12 (2010)

18 Jaleel A Memory characterization of workloadsusing instrumentation driven simulation httpwwwglueumdeduajaleelworkload (2010)

19 Kroft D Lockup-free instruction fetchprefetchcache organization In ISCA pp 81ndash87 (1981)

20 Lee D Kim Y Seshadri V Liu J Subrama-nian L Mutlu O Tiered-latency dram A lowlatency and low cost dram architecture In HPCApp 615ndash626 (2013)

21 Marti S Borras J Rodriguez P Tena RMarin J A complexity-effective out-of-order re-tirement microarchitecture IEEE Transactions onComputers 58(12) 1626ndash1639 (2009)

22 Martinez JF Renau J Huang MC PrvulovicM Cherry Checkpointed early resource recyclingin out-of-order microprocessors In MICRO pp3ndash14 (2002)

23 Monreal T Vinals V Gonzalez J GonzalezA Valero M Late allocation and early release ofphysical registers IEEE Trans Comput 53(10)1244ndash1259 (2004)

24 Ros A Carlson TE Alipour M Kaxiras SNon-speculative load-load reordering in TSO InISCA pp 187ndash200 (2017)

25 Smith JE Pleszkun AR Implementing preciseinterrupts in pipelined processors IEEE Transac-tions on Computers 37(5) 562ndash573 (1988)

26 Sohi GS Vajapeyam S Instruction issue logicfor high-performance interruptable pipelined pro-cessors In ISCA pp 27ndash34 (1987)

  • Introduction
  • Out-of-Order Commit
  • Methodology
  • Out-of-Order Commit Evaluation
  • Memory Parallelism
  • Early Release of Physical Registers vs Out-of-order Commit
  • Related Work
  • Conclusion
Page 9: Maximizing Limited Resources: A Limit-based Study and ...tcarlson/pdfs/alipour2018mlralsatoo… · Maximizing Limited Resources: A Limit-based Study and Taxonomy of Out-of-order Commit

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 9

4 8 16 32 128192

(a) SLM

20

40

60

80

100

120

Perfo

rmance

improvem

ent ()

4 8 16 32 128192

Commit depth

(b) NHM

20

40

60

80

100

120

Safe_OOC Unsafe_OOC

4 8 16 32 128192

(c) HSW

20

40

60

80

100

120

Fig 6 Effect of commit depth on performance improve-ment normalized to their respective in-order commitperformance across SPEC CPU2006 The smaller corereceives less benefit from increasing the depth of com-mit Safe_OOC saturates at a commit depth of 8 16and 32 for SLM NHM and HSW respectively (the dis-tance to the first unresolved instruction from the headof the ROB) There is no saturation in performance im-provement of Unsafe_OOC while the commit depth isincreased

greater than 10 shows the effective resource size in-creases due to OOC

For aggressive out-of-order commit the larger effec-tive sizes show the reach of this technique We see thatthe utilization of all structures except for the ROB isalmost the same for safe and unsafe out-of-order com-mit which allows Safe_OOC to achieve most of theperformance of the unsafe version Nevertheless the un-safe core is much better at improving the reach of theROB allowing applications like hmmer continue to showa benefit when moving from safe to unsafe out-of-ordercommit See Section 452 for more details on hmmer

44 Evaluation of Commit Depth

To gauge the effect of the commit depth (ie how far wescan the ROB to find instructions that can commit out-of-order) we impose a hard limit on it and evaluate theeffects on the resulting performance The strictest limitis the commit-width itself starting to commit out-of-order from the first commit-width instructions We thenrelax this to the immediate vicinity (eg double thecommit-width) and progressively relax until we reachthe size of the ROB

In Figure 6 we see that a commit depth of 4 (equalto commit width) provides the smallest benefit but alsothe smallest difference between Safe_OOC and Un-safe_OOC In addition the SLM core benefits morefrom OOC compared to NHM and HSW when com-mit depth is smaller than 8 At a commit depth of 8and above the large aggressive cores benefit much more

than the smaller core type with the maximum improve-ment of Unsafe_OOC at 80 119 and 129 for SLMNHM and HSW respectively For aggressive cores thelarger commit depth allows for continued performanceimprovements and will be necessary for the design of abalanced aggressive out-of-order commit design

45 Out-of-Order Commit Performance

In this section we analyze the out-of-order commit con-ditions to determine the minimum (and maximum) per-formance improvement potential The minimum andmaximum improvement is provided by Safe_OOC andUnsafe_OOC respectively In Figure 7 we show theamount of improvement provided by both safe and un-safe out-of-order commit for both aggressive and reluc-tant modes for all three microarchitectures

451 Safe_OOC

Honoring all out-of-order commit conditions results in amodest performance improvement One implication forfuture processors is that Safe_OOC does not requireadditional support for speculative out-of-order rollbackrecovery mechanisms it requires support for the com-mit of instructions out-of-order and the freeing of struc-tures for use by future instructions

Safe_AOOC When Safe_AOOC is evaluated onthe SLM microarchitecture (see Figure 7a) the rangeof improvement spans from a low of 3 (leslie3d) upto 82 (xalan) with an average of 44 In the NHMmicroarchitecture the improvement ranges from 9 to108 for milc and xalan with an average improvementof 57 and for HSW we see an improvement of 1for mcf to 110 for wrf with an average of 55 (SeeSection 44 for more details)

Safe_ROOC For Safe_ROOC the performanceimprovement is lower for all three microarchitecturescompared to AOOC (See Subsection 25 for additionaldetails) In the SLM microarchitecture the range ofperformance improvement of safe ROOC is from 2 to55 for calculix and gcc respectively with an averageimprovement of 20 In the NHM microarchitecturewe see performance improvements that range from 1(mcf) to 22 (bwaves) with an average improvement of7 (8 for HSW) Because NHM and HSM have fewerCPU stalls ROOC has less of an effect on performancecompared with the smaller more efficient core

452 Unsafe_OOC

By relaxing all conditions Unsafe_OOC provides themaximum potential for performance improvement Un-

10 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

020406080

100120140

Perfo

rman

ceim

prov

emen

t () a) AOOC across SpecCPU

SLM_UnsafeSLM_Safe

NHM_UnsafeNHM_Safe

HSW_UnsafeHSW_Safe

astarbwaves

bzip cactusadm

calculixdealii

gamessgcc gemsfdtd

gobmkgromacs

h264refhmmer

lbm leslie3dlibquantum

mcf milc namdomnetpp

perl povraysjeng

soplexsphinx3

tontowrf xalan

zeusmpSpecCPU-avg

020406080

100120140

b) ROOC across SpecCPU

Fig 7 IPC improvement of safe and unsafe out-of-order commit relative to in-order commit as a baseline for bothreluctant and aggressive versions applied on SPEC CPU2006 benchmarks on three microarchitectures

safe_OOC will require recovery mechanisms for thesetechniques which can reduce the performance potentialbecause of recovery costs

To understand the effectiveness of all conditions to-gether we consider zero cost for recovery for any mis-speculated out-of-order commit condition

Unsafe_AOOC This technique provides the high-est performance improvement for all three different ar-chitectures In SLM cores the improvement ranges from9 to 120 for libquantum and astar applicationsrespectively and the average is 59 In NHM archi-tecture the average is 72 and the range is betweenmilc and astar respectively with 11 and 196 im-provement In HSW cores similar to NHM cores astarhas the maximum benefit from Unsafe_ooc with 192improvement while the minimum improvement is fordealii at 7 The average improvement for this classof architecture is 70

Unsafe_ROOC Reluctant out-of-order commit islower performing because it is not continuously lookingto commit additional instructions (See Section 25 fordetails) In the case of SLM the improvement rangesfrom 3 to 93 respectively for leselie3d and xalanwith an average of a 38 improvement In NHM mcfand bwaves with 1 and 22 show the maximum andthe minimum improvement with an average of 7 (HSWis similar with an average of 8) Between the threemicroarchitectures the limited SLM benefits the mostfrom ROOC because of the large number of stalls seenby this core Therefore ROOC especially Unsafe_ROOCis an interesting methodology to improve the perfor-mance of relatively small but energy efficient CPUs aswe see a relatively high performance improvement fora less aggressive commit implementation

Benchmark-specific Analysis The hmmer bench-mark is a particularly strong case for the benefits of out-of-order commit for SLM This application is L1-cacheresident and exhibits very few last-level cache missesNevertheless we still see a very strong improvement inperformance from 33 to 52 for Safe_OOC increas-ing to 51 to 66 for Unsafe_OOC Looking aheadto Figure 8 and Figure 11 we can see that evadingthe branch condition provides the most benefit for thisapplication Making room for additional instructions toallow the hardware to expose additional ILP works welleven for those applications without a significant num-ber of LLC misses The mcf benchmark contains load-dependent branches has the highest MPKI (misses perkilo instructions) among the benchmarks and thereforeit has a rather low IPC when it is executed on an in-order commit CPU This results in a good opportunityto improve performance as it is extremely limited bythese misses

46 Performance Effects of Commit Conditions

In the previous section by analyzing safe and unsafeout-of-order commit we observe that there is a largegap between the performance improvement of these twoimplementations Understanding the cause of this per-formance improvement (by looking at individual com-mit conditions in isolation) allows us to better under-stand where to focus future hardware efforts

461 Positive Contribution of Out-of-Order CommitConditions

To study the gap between safe and unsafe out-of-ordercommit (Figure 7) we analyze the effect of relaxing

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 11

020406080

100

Contrib

ution in

improv

emen

t () a) SLM-AOOC across SpecCPU

Safe Unsafe_LD Unsafe_ST Unsafe_BR Unsafe_EXC

020406080

100b) AOOC SpecCPU average

astarbwaves

bzip cactusadm

calculixdealii

gamessgcc gemsfdtd

gobmkgromacs

h264refhmmer

lbm leslie3dlibquantum

mcf milc namdomnetpp

perl povraysjeng

soplexsphinx3

tontowrf xalan

zeusmp

020406080

100c) SLM-ROOC across SpecCPU

SLM-avgNHM-avg

HSW-avg

020406080

100d) ROOC SpecCPU average

Fig 8 Contribution of safe and selectively unsafe out-of-order commit on three different microarchitecturesUnsafe_XX is equivalent to activating (enforcing) all out-of-order commit conditions except XX (the XX conditionis relaxed) By relaxing the specific XX condition the dependence between other conditions is also observed

each condition in the presence of the other preservedconditions in Figure 8 We analyze the SLM microar-chitecture in detail and provide averages across all mi-croarchitectures for both AOOC and ROOC Each out-of-order commit condition in analyzed in isolation andwe consider Unsafe_OOC (all relaxed conditions) asthe 100 potential improvement budget In the case ofthe mcf benchmark in Figure 7a the safe and unsafeOOC performance improvement is 33 and 71 re-spectively (46 of the potential improvement budget isprovided by Safe_OOC) We also observe that by re-laxing the LD condition (unsafe_LD) 52 of potentialimprovement budget is achievable (see Figure 8a) InFigure 8 we can see in some applications (like namd inAOOC mode and leslie3d in ROOC mode) that re-laxing just a single condition is not sufficient to fill thegap between safe and unsafe OOC This does not meanthat a single condition is not important but rather thatother preserved conditions are preventing out-of-ordercommit from achieving its full potential

AOOC We observe that for most of applicationsUnsafe_BR and Unsafe_LD are the most interestingconditions (Figure 8a) Additionally the more aggres-sive the core the more important the Unsafe_LD con-dition becomes In SLM NHM and HSW CPUs Un-safe_LD respectively fills 4 10 and 12 and Un-safe_BR fills 9 8 and 7 of the gap between safeand unsafe OOC Unsafe_ST is not very effective be-cause of the rarity of this condition and the conserva-tive memory dependence predictor used Unsafe_EXCor relaxing exceptions are not that effective becausethey are very rare especially in integer benchmarks

ROOC Relaxing OOC conditions in ROOC hasless effect in reducing the gap between safe and un-safe OOC and this is because of the nature of this on-demand OOC mode which is enabled and needed more

in SLM and much less in NHM and HSW (see Fig-ure 7b)

Benchmark-specific Analysis The astar andgobmk benchmarks have the highest number of mis-predicted branches per 1000 instructions Therefore re-laxing the branch condition (unsafe_BR) improves theperformance of these two benchmarks by 68 and 94respectively On the other hand benchmarks such ascactusadm and lbm have a low branch mispredictionrate (05) and therefore relaxing the branch condi-tion does not show significant improvement for thesebenchmarks

The sphinx benchmark has high degree of intrinsicILP2 thus an increase in the number of effective re-sources allows this benchmark to improve performanceAs most of its load instructions are L2-cache missesrelaxing the load condition (Unsafe_LD) improves per-formance of this benchmark

462 Negative contribution of OOC conditions

In this section we analyze the gap between safe andunsafe out-of-order commit from a different angle

We relax all of the out-of-order commit conditionsexcept for one For example safe_LD means the LDcondition is preserved but ST BR and EXC conditionsare relaxed By preserving one of the conditions we lookat the negative effect (performance reduction) of theactivated condition compared to Unsafe_OOC

Figure 11 depicts the effect of each condition onperformance For most of the benchmarks the BR andthe LD conditions are the most effective ones Amongfloating-point benchmarks LD and EXC conditions havea large impact on performance Therefore relaxing theEXC condition as it is rare could lead to significant

2 The sphinx benchmark continues to show performance im-provements as the in-order commit processor aggressiveness in-creases from SLM to HSW

12 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

200 400 600 800 1000DRAM latency (cycle)

020406080

100120140160

Perfo

rman

ceim

prov

emen

t (

)Safe_OOC Unsafe_OOC

Fig 9 A comparison between safe and unsafe out-of-order commit across DRAM latencies for SLM andAOOC on SPEC CPU2006 OOC improvement in-creases with a higher DRAM latency Unsafe_OOCoutperforms Safe_OOC This study uses caches thatare four times smaller compared to Table 1 to put ad-ditional stress on the DRAM This has been done forboth in-order and out-of-order commit configurations

performance improvements at relatively low cost espe-cially if recovery mechanisms in software are used SThas the least effect among out-of-order commit condi-tions when it is preserved in isolation from other condi-tions This is valid between all three microarchitectures

47 Memory Latency Evaluation

Current DRAM cells are optimized for cost and notfor access latency [20] The potential to use more af-fordable but higher latency memory could have a largeimpact on allocation of memory in datacenter serversas high density DRAM modules cost much more thanthose that are 4times smaller (175times per GB [4]) To eval-uate the potential of out-of-order commit to handlehigher DRAM latency we evaluate memory latency from200 cycles to 1000 cycles in 200 cycle increments Inaddition we reduce the size of all caches by 4 timeswhen compared to Table 1 to put additional stress onthe DRAM subsystem Here we see an increasing per-formance improvement for the Unsafe_OOC conditionwhile Safe_OOC increases linearly compared to the in-order commit core For memory-intensive applicationsan efficient implementation of Unsafe_OOC could al-low the use of denser higher-latency DRAM that couldpotentially cost much less

48 Prefetching Evaluation

Prefetching is essential for performance in modern sys-tems and interacts tightly with the pipeline and out-of-order commit We do not consider all prefetchingoptions but instead focus on the potential benefitsTherefore to evaluate this interaction we configure theL1-D cache in gem5 with a stride prefetcher of degree8 Figure 10 compares the relative performance gains

IOC+pref

AOOCAOOC+pref

(b) SLM

0

20

40

60

80

Perfo

rman

ceim

prov

emen

t (

)

IOC+pref

AOOCAOOC+pref

(b) NHM

0

20

40

60

80

IOC+pref

AOOCAOOC+pref

(c) HSW

0

20

40

60

80

Fig 10 The effect of aggressive out-of-order commit onthe performance with and without prefetchers in theL1 data cache across SPEC CPU2006 Improvementis relative to the baseline IOC architectures withoutprefetchers

of prefetchers for in-order commit out-of-order com-mit and prefetchers for out-of-order commit Acrossall three architectures both out-of-order commit andprefetching on their own provide roughly 40 improve-ment in performance with out-of-order commit beingslightly better However the combination of out-of-ordercommit and prefetching delivers nearly 70 better per-formance nearly the sum of the two independent contri-butions In fact combining aggressive out-of-order com-mit with prefetchers allows us to reach 84 of the ideal(sum of both) improvement for SLM 77 for NHMand 83 for HSW showing that these techniques workwell together

49 Rollback Costs for Unsafe Branches

While this work does not evaluate hardware costs weare able to evaluate the number of rollbacks caused bycommitting past a mispredicted branch for each of theevaluated configurations For this evaluation we use acommit depth of 8 with aggressive out-of-order commit(AOOC) and Unsafe_BR (only the branch conditionis not respected for out-of-order commit) In Table 3we list the number of executed branches mispredictedbranches the number of rollbacks caused by specula-tively committing a mispredicted branch and the num-ber of instructions that were rolled back due to this roll-back While the number of executed branches increaseswith the aggressiveness of the core (because these corescan more aggressively speculate past branches) the av-erage number of rollbacks (and instructions rolled back)decreases slightly with the aggressiveness of the coreThese aggressive cores resolve speculative state quicklyreducing the number of instructions that need to becommitted out-of-order This results in a similar num-ber of rollbacks per thousand architecturally committedinstructions around 3 for all configurations

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 13

0minus20minus40minus60minus80Co

ntrib

ution in

degrad

ation (

) a) SLM-AOOC across SpecCPU

Safe_LD Safe_ST Safe_BR Safe_EXC

0minus20minus40minus60minus80

b) AOOC SpecCPU average

astarbwaves

bzip cactusadm

calculixdealii

gamessgcc gemsfdtd

gobmkgromacs

h264refhmmer

lbm leslie3dlibquantum

mcf milc namdomnetpp

perl povraysjeng

soplexsphinx3

tontowrf xalan

zeusmp

0minus20minus40minus60minus80

b) SLM-ROOC across SpecCPU

SLM-avgNHM-avg

HSW-avg

0minus20minus40minus60minus80

d) ROOC SpecCPU average

Fig 11 Maximum potential performance improvement is provided by Unsafe_OOC in which all conditions areunsafe By preserving all conditions maximum performance is reduced This figure shows the negative effect ofpreserving (or making safe) a single condition compared to the maximum potential performance improvementSafe_XX is equivalent to disabling all out-of-order commit conditions (all unsafe highest performance) where XXindicates the only safe and preserved condition

Table 3 Out-of-Order Commit Costs for UnsafeBranches

Metric-PKI SLM NHM HSWAvg Max Avg Max Avg Max

Branches 1462 3733 1533 4608 1568 4957Mispred br 78 509 81 545 82 560Num rollbacks 32 243 31 256 30 265Instrs rollback 85 529 66 430 54 343

5 Memory Parallelism

Overlapping cache misses to service them in parallelin particular long-latency accesses to DRAM but alsolower-latency accesses to the LLC can result deliversignificant performance benefits [8] This memory par-allelism is typically achieved through the use of multi-ple Miss Status Holding Registers (MSHRs) [19] whichtrack outstanding memory requests and allow them toexecute in parallel In this section we compare in-ordercommit and out-of-order commit in terms of memoryparallelism (both to DRAM (MLP) and within the cachehierarchy (MHP) [7]) by changing the number of L1MSHRs and observing the effect on performance Toexplore these effects we select three applications thatare highly memory-bound [18] (mcf) medium memory-bound (lbm) and largely not memory-bound (gcc) inFigure 12 One key observation is that out-of-order com-mit in both reluctant and aggressive modes is muchbetter than in-order commit in exposing intrinsic ap-plication memory parallelism Figure 12 shows that thegap between in-order and out-of-order commit is muchlarger in the case of HSW which means that the moreaggressive is the microarchitectures the more MLP iscovered if we apply out-of-order instruction commit Incase of lbm comparing the SLM NHM and HSW weobserved that the rate of performance improvement inthe range of (1 lt= MSHRSize lt= 5) is higher than

the rate in the range of (5 lt MSHRSize lt= 9) Inthe range of (MSHRSize gt 9)) HSW and NHM havecontinued improvement This is most likely the resultof an additional loop iteration containing DRAM ac-cesses that is being covered by out-of-order instructioncommit

Figure 12 also allows us to explore the impact ofmemory-boundedness by comparing across the threeapplications (mcf is most memory bound gcc least) Al-thoughmcf (Figure 12 (d) (e) and (f)) is more memory-bound than lbm (Figure 12 (g) (h) and (i)) it exposesless memory parallelism when the number of MSHRsis increased (the gap between in-order commit and safeaggressive out-of-order commit is larger in case of lbmcompared to mcf ) This is because mcf has more iso-lated cache misses that are misses that cannot be ser-viced at the same time to provide memory level par-allelism Also mcf has branch instructions based onmemory accesses (dependent loads) that miss in thecache hierarchy thereby reducing the number of in-structions that can potentially be committed out-of-order lbm has medium level of memory-boundedness[18] and therefore exposes much more memory paral-lelism as the number of MSHRs is increased Its per-formance improvement is therefore much better whenthe CPU commits instructions out-of-order gcc bench-mark Figure 12 (j) (k) and (l) is one of the leastmemory-bound benchmarks As a result increasing thenumber of MSHRs does not improve its performanceeven in the case of unsafe out-of-order commit Over-all out-of-order commit outperforms in-order commitby exposing additional memory parallelism (both toDRAM and in the hierarchy) In summary in case ofMLP out-of-order commit provides more benefit forbigger architecture and more memory-bound applica-tions

14 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

1 2 3 4 5 6 7 8 9 10123456789

Norm

alize

d Pe

rform

ance

improv

emen

t ()

(a) SPEC2006 SLM

1 2 3 4 5 6 7 8 9 10123456789

(b) SPEC2006 NHM

IOC Safe-AOOC Unsafe-AOOC

1 2 3 4 5 6 7 8 9 10123456789

(c) SPEC2006 HSW

1 2 3 4 5 6 7 8 9 10123456789

(d) mcf SLM

1 2 3 4 5 6 7 8 9 10123456789

(e) mcf NHM

1 2 3 4 5 6 7 8 9 10123456789

(f) mcf HSW

1 2 3 4 5 6 7 8 9 10123456789

(g) lbm SLM

1 2 3 4 5 6 7 8 9 10123456789

(h) lbm NHM

1 2 3 4 5 6 7 8 9 10123456789

(i) lbm HSW

1 2 3 4 5 6 7 8 9 10123456789

(j) gcc SLM

1 2 3 4 5 6 7 8 9 10MSHR size

123456789

(k) gcc NHM

1 2 3 4 5 6 7 8 9 10123456789

(l) gcc HSW

Fig 12 Memory hierarchy parallelism comparison between in-order commit and safe and unsafe out-of-ordercommit in MSHRs size of 1 to 10 The results have been normalized to in-order commit with MSHR size of oneSubgraph (a) (b) and (c) are based on harmonic mean across SPEC CPU2006 The mcf lbm and gcc benchmarksare respectively representative of categories with high medium and low memory boundedness

6 Early Release of Physical Registers vsOut-of-order Commit

Register renaming enables the avoidance of false de-pendencies through the register file thereby improvingcore performance It does this by translating architec-tural registers to larger set of physical registers As thisrenaming is done early in the pipeline and the physi-

cal registers are not released until commit which mayhappen long after the last consumer reads the data thephysical register entries may be kept alive for longerthan what the dependencies alone require To addressthis resource constraint techniques such as delayed al-location and Rarly Release of the Physical Register

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 15

Table 4 The Contribution of exhausted (full) regis-ter file (RF) and reorder buffer (ROB) on CPU stallsOther reasons to the CPU stalls could be the instruc-tion queue (IQ) the load-store queue (LSQ) instruc-tions cache misses or not having a free functional unitto allocate to the ready-to-dispatch instructions

Microarchitecture RF ROB Othermin avg max min avg max min avg max

SLM 6 59 86 13 36 92 0 5 27NHM 2 68 91 8 30 98 0 20 7HSW 2 69 91 8 29 98 0 20 7

(ERPR) [23] have been developed In this section wecompare ERPR with the OOC taxonomy [2]

OOC and ERPR are similar as they release physicalregister as early as possible Aggressive OOC outper-forms ERPR in general because it releases ROB andLSQ entries early as well as physical registers Unfor-tunately neither ERPR nor OOC can release IQ entriesearlier than usual because OOC only address processorstate after the instructions have left the IQ and ERPRis effective before instructions are inserted tot the IQFrom another point of view OOC put additional pres-sure on the IQ and provides free entries in one or all ofthe resource of superscalar processors

61 Analysis of CPU Aggressiveness

CPU aggressiveness analysis in this context is basedon microarchitecture configurations (see Table 2) andcommit-depth The overall trend of Figure 13 showsthat AOOC outperforms ROOC and ERPR in all con-figurations This is not surprising as AOOC is alwaysenabled while ROOC is enabled only when a resourcesis exhausted (see Section 25)

ERPR is essentially a subset of AOOC that only fo-cuses on the register file (RF) Therefore the higher isthe program sensitivity to the RF capacity the higher isthe effect of ERPR To better understand this behaviorwe analyzed the reasons for CPU stalls across differentmicroarchitectures (Table 4) The data show that in-creasing the effective size of RF (eg what ERPR doesconceptually) is less effective in SLM compare to NHMand HSW barbecue is SLM the CPU stalls are moreoften due to other resources such as the ROB size Asa result since in SLM architecture the size of RF isless effective in CPU stalls ROOC outperforms ERPRin terms of performance improvement (ROOC can vir-tually extend the effective size of ROB LSQ as wellas RF although ERPR only does this on RF) RF inNHM and HSW architectures is more effective on CPUstall than in SLM architecture therefore ERPR out-

performs ROOC regarding performance improvementin these two architecture

By focusing only on ERPR we observe that thewider the core the more effective is the is increasingcommit-depth (wider commit-depth covers more readyinstructions to commit) For example in the case ofSLM although widening the commit-depth releases ad-ditional physical registers earlier than general thesefreed registers are not effective anymore because theCPU stalls are due to limits in other resources (likeROB IQ or LSQ) As NHM and HSW have more re-sources increasing the commit-depth is more effectivein those architectures In the case of AOOC increas-ing the commit-depth is more effective than ROOCand ERPR because it enables additional instructionsto be committed out-of-order (because it is active onevery cycle compare to ROOC and also affects all ofthe resources compare to ERPR that only affects theRF) In addition we can see that the more aggressivethe CPU the more effective is increasing the commit-depth This is because wider cores execute and com-plete instructions that are far from the head of ROBearlier with their additional resources(either providedby the baseline or released by the OOOC) and theycan therefore provide more instructions to commit ifthe commit-depth is increased

62 Analysis of Out-of-Order Conditions

Figure 13 shows the potential for performance improve-ment from relaxing the out-of-order commit conditionsacross the three architectures for both safe and unsafeout-of-order-commit As has been seen previously [2 521] the least aggressive out-of-order commit configu-ration (safe_OOC with a commit depth of 4) is mostbeneficial for the least aggressive out-of-order processorSLM (compare the left-most bars of Figure 13(a-c))This is expected as the SLM processor has the least ex-ecution hardware and therefore suffers from more stallsthan the more aggressive processors

Increasing the commit depth from 4 to 8 (secondgroup of bars in Figure 13(a-c)) shows that the more ag-gressive out-of-order processors (NHM and HSW) nowbenefit more from out-of-order commit than the sim-pler SLM architecture This new result shows the inter-action between the aggressiveness of the out-of-orderexecution and out-of-order commit for more aggres-sive out-of-order execution there is more benefit frommore aggressive out-of-order commit The reason is thataggressive architectures appear to have more unusedresources available for computation This trend con-tinues through the more aggressive out-of-order com-mit modes as well with unsafe optimizations (Figure

16 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

4 8 16 320

20406080

100

Perfo

rman

ceim

prov

emen

t () (a) SLM-SAFE

4 8 16 32 128Commit Depth

020406080

100

(b) NHM-SAFEAOOC ROOC ERPR

4 8 16 32 128 1920

20406080

100

(c) HSW-SAFE

4 8 16 320

20406080

100

(d) SLM-UnsafeLD

4 8 16 32 1280

20406080

100

(e) NHM-UnsafeLD

4 8 16 32 128 1920

20406080

100

(f) HSW-UnsafeLD

4 8 16 320

20406080

100

(g) SLM-UnsafeBR

4 8 16 32 1280

20406080

100

(h) NHM-UnsafeBR

4 8 16 32 128 1920

20406080

100

(i) HSW-UnsafeBR

4 8 16 320

20406080

100

(j) SLM-UnsafeST

4 8 16 32 1280

20406080

100

(k) NHM-UnsafeST

4 8 16 32 128 1920

20406080

100

(l) HSW-UnsafeST

4 8 16 320

20406080

100

(m) SLM-UnsafeEXC

4 8 16 32 1280

20406080

100

(n) NHM-UnsafeEXC

4 8 16 32 128 1920

20406080

100

(o) HSW-UnsafeEXC

4 8 16 320

20406080

100

(p) SLM-Unsafe

4 8 16 32 1280

20406080

100

(q) NHM-Unsafe

4 8 16 32 128 1920

20406080

100

(r) HSW-Unsafe

Fig 13 Comparison between AOOC ROOC and ERPR in different commit-depth and microarchitecture setupin six different modes UnsafeXX is equivalent to (enforcing) all out-of-order commit conditions except conditionXX By relaxing the specific XX condition the dependence between other conditions is also observed

13(d-r)) benefiting the more aggressive out-of-order ex-ecution designs (NHM and HSW) more than the sim-pler one (SLM) The reasons for this are similar tothe safe case as the more aggressive processors arefaster to achieve the first out-of-order condition in-struction completion and therefore providing greatercommit depths (eg changing the commit depth from 4to 8 in Figure 13 (a-c)) results in a larger likelihoodof finding instructions to commit out-of-order Table 5

uses this analysis to rank the relative importance ofeach out-of-order commit condition for the three ar-chitectures This information provides a guideline forwhich areas to work on to achieve the best performanceimprovement depending on the baseline out-of-order ex-ecution

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 17

Conditioncommit-depth 4 8 16 32 128 192

Unsafe_LD 1 2 2 2 NA NAUnsafe_BR 2 1 1 1 NA NAUnsafe_ST 4 4 4 4 NA NAUnsafe_EXC 3 3 3 3 NA NA

(a) SLMConditioncommit-depth 4 8 16 32 128 192

Unsafe_LD 1 2 2 2 2 NAUnsafe_BR 3 1 1 1 1 NAUnsafe_ST 4 4 4 4 4 NAUnsafe_EXC 2 3 3 3 3 NA

(b) NHMConditioncommit-depth 4 8 16 32 128 192

Unsafe_LD 4 2 2 2 2 2Unsafe_BR 1 1 1 1 1 1Unsafe_ST 3 3 4 4 4 4Unsafe_EXC 2 4 3 3 3 3

(c) HSW

Table 5 Ranking of the benefits of out-of-order commitconditions in different microarchitecture based on thedepth of commit

7 Related Work

The goal of this work is to provide a detailed under-standing into the potential for performance benefitsacross different out-of-order commit conditions and lev-els of aggressiveness While this work aims to describethe maximum potential benefit for each individual con-dition there have been many previous works that de-scribe hardware solutions for early release of hardwarestructures and out-of-order commit strategiesBelowwe provide an overview of these works and how theyfit into the categories described in this work

71 Speculative Release of Hardware StructuresA number of implementations require register and pro-cessor state checkpointing support to speculatively re-tire or release hardware structures [12 11 22 1 17]

Processor state checkpointing especially with theadvent of very large SIMD registers such as AVX-512registers which now support up to 32 registers with upto 512 bits per register can require a significant amountof state to be saved when speculation is aggressive

72 Non-speculative Structure Release

Non-speculative early release of hardware structures be-fore commit requires knowledge that no older instruc-tion (that has come earlier in the instruction stream)can cause the program to abort raise an exception orrequire exposure of the architected state at that timeNon-speculative solutions have the potential to be themost energy efficient a necessity in an era of the end ofDennard Scaling for power-limited platforms A range

of solutions from hardware-only to software-assistedsolutions are described below

Compiler support Two previous works [1 13]allow for early commit or reclaim of resources basedon compiler knowledge but require software recompi-lation

Checkpoint-free A number of solutions do notuse checkpoints [5 21 13] and therefore do not requirespeculation (they respect the all commit conditions) orused software help to extend knowledge to the hard-ware

Selective early commit Early commit of loads [1415] allows for loads that have not yet received data fromthe memory hierarchy to become part of the commit-ted state of a processor This allows the processor tocontinue to process instructions past normally blockedstructures improving performance for memory-boundworkloads To accomplish this the authors [15] de-couple page faults that can occur from the fetching ofdata While recompilation is not strictly required forthis technique the authors evaluate their work using acompilation strategy to expose loads early Late alloca-tion and early release of physical registers is proposedas a register renaming scheme in [23] They explored theeffect of late allocation and early release of physical reg-ister on the performance of an 8-way superscalar pro-cessor without the need for additional speculation andmaintains precise exceptions The out-of-order commitmethods evaluated in this work overlap with the earlyrelease of registers while also evaluating the early re-lease of other structures (like the LSQ)

Evading out-of-order commit conditions[24] provides a coherency protocol level solution suchthat the loadrarrload reordering can be evaded non spec-ulatively under TSO In their design it is no longer nec-essary to squash and re-execute speculatively reorderedloads if their reordering can be hidden by the coherencyprotocol This allows speculatively reordered loads thatotherwise adhere to the rest of the Bell-Lipasti condi-tions to be committed out-of-order without affectingthe memory model

8 Conclusion

To obtain higher performance extending the reach ofthe processor core has been a focus in much of microar-chitecture research One promising direction is the useof out-of-order commit which releases precious proces-sor resources early to allow the processor to extend itsreach past typical hardware limits In this work wepresent a limit study for out-of-order commit throughthe introduction of reluctant and aggressive out-of-order

18 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

commit modes We show how smaller processors evenwith a limited commit scan depth can benefit fromout-of-order commit strategies but that larger aggres-sive cores require deeper commit scan depths to achieveimproved performance In addition we provide a de-tailed breakdown of the contributions for each out-of-order commit condition for the SPEC CPU2006 bench-mark suite and compare against similar works suchas early release of physical registers Our results showa very high potential for performance improvementabove 225x for some benchmarks and believe that out-of-order commit strategies can play an important rolefor future energy-efficient and high-performance proces-sor designs

References

1 Afram F Zeng H Ghose K A group-commitmechanism for ROB-based processors implement-ing the x86 ISA In HPCA pp 47ndash58 (2013)

2 Alipour M Carlson TE Kaxiras S Exploringthe performance limits of out-of-order commit InCF pp 211ndash220 (2017)

3 Allan A Edenfeld D Joyner Jr WH KahngAB Rodgers M Zorian Y 2001 technologyroadmap for semiconductors Computer 35(1) 42ndash53 (2002)

4 Badalone R Dramrsquos surprising role in the costof data centers httpwwwdatacenterknowledgecomarchives20151112dont (2015)

5 Bell GB Lipasti MH Deconstructing commitIn ISPASS pp 68ndash77 (2004)

6 Binkert N Beckmann B Black G ReinhardtSK Saidi A Basu A Hestness J Hower DRKrishna T Sardashti S Sen R Sewell KShoaib M Vaish N Hill MD Wood DA Thegem5 simulator SIGARCH Comput Archit News39(2) 1ndash7 (2011)

7 Carlson TE Heirman W Allam O Kaxiras SEeckhout L The load slice core microarchitectureIn ISCA pp 272ndash284 (2015)

8 Chou Y Fahs B Abraham S Microarchitec-ture optimizations for exploiting memory-level par-allelism In ISCA pp 76ndash88 (2004)

9 Corporation I Intel Rcopy 64 and ia-32 ar-chitectures optimization reference manualhttpwwwintelcomcontentwwwusenarchitecture-and-technology64-ia-32-architectures-optimization-manualhtml(2016)

10 Corporation I Intel Rcopy intelrsquos rsquotick-tockrsquo seeminglydead becomes rsquoprocess-architecture-optimizationrsquohttpwwwanandtechcomshow10183

intels-tick-tock-seemingly-dead-becomes-process-architecture-optimization (2016)

11 Cristal A Ortega D Llosa J Valero M Out-of-order commit processors In HPCA pp 48ndash59(2004)

12 Cristal A Santana OJ Valero M MartiacutenezJF Toward kilo-instruction processors ACMTrans Archit Code Optim 1(4) 389ndash417 (2004)

13 Duong N Veidenbaum AV Compiler-assistedselective out-of-order commit IEEE Computer Ar-chitecture Letters 12(1) 21ndash24 (2013)

14 Gwennap L Digital leads the pack with 21164 InMicroprocessor Report 8(12) pp 249ndash260 (1994)

15 Ham T Aragoacuten JL Martonosi M Desc De-coupled supply-compute communication manage-ment for heterogeneous architectures In MICROpp 191ndash203 (2015)

16 Henning JL SPEC CPU2006 benchmark descrip-tions SIGARCH Comput Archit News 34(4) 1ndash17 (2006)

17 Hilton A Roth A BOLT Energy-efficient out-of-order latency-tolerant execution In HPCA pp1ndash12 (2010)

18 Jaleel A Memory characterization of workloadsusing instrumentation driven simulation httpwwwglueumdeduajaleelworkload (2010)

19 Kroft D Lockup-free instruction fetchprefetchcache organization In ISCA pp 81ndash87 (1981)

20 Lee D Kim Y Seshadri V Liu J Subrama-nian L Mutlu O Tiered-latency dram A lowlatency and low cost dram architecture In HPCApp 615ndash626 (2013)

21 Marti S Borras J Rodriguez P Tena RMarin J A complexity-effective out-of-order re-tirement microarchitecture IEEE Transactions onComputers 58(12) 1626ndash1639 (2009)

22 Martinez JF Renau J Huang MC PrvulovicM Cherry Checkpointed early resource recyclingin out-of-order microprocessors In MICRO pp3ndash14 (2002)

23 Monreal T Vinals V Gonzalez J GonzalezA Valero M Late allocation and early release ofphysical registers IEEE Trans Comput 53(10)1244ndash1259 (2004)

24 Ros A Carlson TE Alipour M Kaxiras SNon-speculative load-load reordering in TSO InISCA pp 187ndash200 (2017)

25 Smith JE Pleszkun AR Implementing preciseinterrupts in pipelined processors IEEE Transac-tions on Computers 37(5) 562ndash573 (1988)

26 Sohi GS Vajapeyam S Instruction issue logicfor high-performance interruptable pipelined pro-cessors In ISCA pp 27ndash34 (1987)

  • Introduction
  • Out-of-Order Commit
  • Methodology
  • Out-of-Order Commit Evaluation
  • Memory Parallelism
  • Early Release of Physical Registers vs Out-of-order Commit
  • Related Work
  • Conclusion
Page 10: Maximizing Limited Resources: A Limit-based Study and ...tcarlson/pdfs/alipour2018mlralsatoo… · Maximizing Limited Resources: A Limit-based Study and Taxonomy of Out-of-order Commit

10 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

020406080

100120140

Perfo

rman

ceim

prov

emen

t () a) AOOC across SpecCPU

SLM_UnsafeSLM_Safe

NHM_UnsafeNHM_Safe

HSW_UnsafeHSW_Safe

astarbwaves

bzip cactusadm

calculixdealii

gamessgcc gemsfdtd

gobmkgromacs

h264refhmmer

lbm leslie3dlibquantum

mcf milc namdomnetpp

perl povraysjeng

soplexsphinx3

tontowrf xalan

zeusmpSpecCPU-avg

020406080

100120140

b) ROOC across SpecCPU

Fig 7 IPC improvement of safe and unsafe out-of-order commit relative to in-order commit as a baseline for bothreluctant and aggressive versions applied on SPEC CPU2006 benchmarks on three microarchitectures

safe_OOC will require recovery mechanisms for thesetechniques which can reduce the performance potentialbecause of recovery costs

To understand the effectiveness of all conditions to-gether we consider zero cost for recovery for any mis-speculated out-of-order commit condition

Unsafe_AOOC This technique provides the high-est performance improvement for all three different ar-chitectures In SLM cores the improvement ranges from9 to 120 for libquantum and astar applicationsrespectively and the average is 59 In NHM archi-tecture the average is 72 and the range is betweenmilc and astar respectively with 11 and 196 im-provement In HSW cores similar to NHM cores astarhas the maximum benefit from Unsafe_ooc with 192improvement while the minimum improvement is fordealii at 7 The average improvement for this classof architecture is 70

Unsafe_ROOC Reluctant out-of-order commit islower performing because it is not continuously lookingto commit additional instructions (See Section 25 fordetails) In the case of SLM the improvement rangesfrom 3 to 93 respectively for leselie3d and xalanwith an average of a 38 improvement In NHM mcfand bwaves with 1 and 22 show the maximum andthe minimum improvement with an average of 7 (HSWis similar with an average of 8) Between the threemicroarchitectures the limited SLM benefits the mostfrom ROOC because of the large number of stalls seenby this core Therefore ROOC especially Unsafe_ROOCis an interesting methodology to improve the perfor-mance of relatively small but energy efficient CPUs aswe see a relatively high performance improvement fora less aggressive commit implementation

Benchmark-specific Analysis The hmmer bench-mark is a particularly strong case for the benefits of out-of-order commit for SLM This application is L1-cacheresident and exhibits very few last-level cache missesNevertheless we still see a very strong improvement inperformance from 33 to 52 for Safe_OOC increas-ing to 51 to 66 for Unsafe_OOC Looking aheadto Figure 8 and Figure 11 we can see that evadingthe branch condition provides the most benefit for thisapplication Making room for additional instructions toallow the hardware to expose additional ILP works welleven for those applications without a significant num-ber of LLC misses The mcf benchmark contains load-dependent branches has the highest MPKI (misses perkilo instructions) among the benchmarks and thereforeit has a rather low IPC when it is executed on an in-order commit CPU This results in a good opportunityto improve performance as it is extremely limited bythese misses

46 Performance Effects of Commit Conditions

In the previous section by analyzing safe and unsafeout-of-order commit we observe that there is a largegap between the performance improvement of these twoimplementations Understanding the cause of this per-formance improvement (by looking at individual com-mit conditions in isolation) allows us to better under-stand where to focus future hardware efforts

461 Positive Contribution of Out-of-Order CommitConditions

To study the gap between safe and unsafe out-of-ordercommit (Figure 7) we analyze the effect of relaxing

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 11

020406080

100

Contrib

ution in

improv

emen

t () a) SLM-AOOC across SpecCPU

Safe Unsafe_LD Unsafe_ST Unsafe_BR Unsafe_EXC

020406080

100b) AOOC SpecCPU average

astarbwaves

bzip cactusadm

calculixdealii

gamessgcc gemsfdtd

gobmkgromacs

h264refhmmer

lbm leslie3dlibquantum

mcf milc namdomnetpp

perl povraysjeng

soplexsphinx3

tontowrf xalan

zeusmp

020406080

100c) SLM-ROOC across SpecCPU

SLM-avgNHM-avg

HSW-avg

020406080

100d) ROOC SpecCPU average

Fig 8 Contribution of safe and selectively unsafe out-of-order commit on three different microarchitecturesUnsafe_XX is equivalent to activating (enforcing) all out-of-order commit conditions except XX (the XX conditionis relaxed) By relaxing the specific XX condition the dependence between other conditions is also observed

each condition in the presence of the other preservedconditions in Figure 8 We analyze the SLM microar-chitecture in detail and provide averages across all mi-croarchitectures for both AOOC and ROOC Each out-of-order commit condition in analyzed in isolation andwe consider Unsafe_OOC (all relaxed conditions) asthe 100 potential improvement budget In the case ofthe mcf benchmark in Figure 7a the safe and unsafeOOC performance improvement is 33 and 71 re-spectively (46 of the potential improvement budget isprovided by Safe_OOC) We also observe that by re-laxing the LD condition (unsafe_LD) 52 of potentialimprovement budget is achievable (see Figure 8a) InFigure 8 we can see in some applications (like namd inAOOC mode and leslie3d in ROOC mode) that re-laxing just a single condition is not sufficient to fill thegap between safe and unsafe OOC This does not meanthat a single condition is not important but rather thatother preserved conditions are preventing out-of-ordercommit from achieving its full potential

AOOC We observe that for most of applicationsUnsafe_BR and Unsafe_LD are the most interestingconditions (Figure 8a) Additionally the more aggres-sive the core the more important the Unsafe_LD con-dition becomes In SLM NHM and HSW CPUs Un-safe_LD respectively fills 4 10 and 12 and Un-safe_BR fills 9 8 and 7 of the gap between safeand unsafe OOC Unsafe_ST is not very effective be-cause of the rarity of this condition and the conserva-tive memory dependence predictor used Unsafe_EXCor relaxing exceptions are not that effective becausethey are very rare especially in integer benchmarks

ROOC Relaxing OOC conditions in ROOC hasless effect in reducing the gap between safe and un-safe OOC and this is because of the nature of this on-demand OOC mode which is enabled and needed more

in SLM and much less in NHM and HSW (see Fig-ure 7b)

Benchmark-specific Analysis The astar andgobmk benchmarks have the highest number of mis-predicted branches per 1000 instructions Therefore re-laxing the branch condition (unsafe_BR) improves theperformance of these two benchmarks by 68 and 94respectively On the other hand benchmarks such ascactusadm and lbm have a low branch mispredictionrate (05) and therefore relaxing the branch condi-tion does not show significant improvement for thesebenchmarks

The sphinx benchmark has high degree of intrinsicILP2 thus an increase in the number of effective re-sources allows this benchmark to improve performanceAs most of its load instructions are L2-cache missesrelaxing the load condition (Unsafe_LD) improves per-formance of this benchmark

462 Negative contribution of OOC conditions

In this section we analyze the gap between safe andunsafe out-of-order commit from a different angle

We relax all of the out-of-order commit conditionsexcept for one For example safe_LD means the LDcondition is preserved but ST BR and EXC conditionsare relaxed By preserving one of the conditions we lookat the negative effect (performance reduction) of theactivated condition compared to Unsafe_OOC

Figure 11 depicts the effect of each condition onperformance For most of the benchmarks the BR andthe LD conditions are the most effective ones Amongfloating-point benchmarks LD and EXC conditions havea large impact on performance Therefore relaxing theEXC condition as it is rare could lead to significant

2 The sphinx benchmark continues to show performance im-provements as the in-order commit processor aggressiveness in-creases from SLM to HSW

12 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

200 400 600 800 1000DRAM latency (cycle)

020406080

100120140160

Perfo

rman

ceim

prov

emen

t (

)Safe_OOC Unsafe_OOC

Fig 9 A comparison between safe and unsafe out-of-order commit across DRAM latencies for SLM andAOOC on SPEC CPU2006 OOC improvement in-creases with a higher DRAM latency Unsafe_OOCoutperforms Safe_OOC This study uses caches thatare four times smaller compared to Table 1 to put ad-ditional stress on the DRAM This has been done forboth in-order and out-of-order commit configurations

performance improvements at relatively low cost espe-cially if recovery mechanisms in software are used SThas the least effect among out-of-order commit condi-tions when it is preserved in isolation from other condi-tions This is valid between all three microarchitectures

47 Memory Latency Evaluation

Current DRAM cells are optimized for cost and notfor access latency [20] The potential to use more af-fordable but higher latency memory could have a largeimpact on allocation of memory in datacenter serversas high density DRAM modules cost much more thanthose that are 4times smaller (175times per GB [4]) To eval-uate the potential of out-of-order commit to handlehigher DRAM latency we evaluate memory latency from200 cycles to 1000 cycles in 200 cycle increments Inaddition we reduce the size of all caches by 4 timeswhen compared to Table 1 to put additional stress onthe DRAM subsystem Here we see an increasing per-formance improvement for the Unsafe_OOC conditionwhile Safe_OOC increases linearly compared to the in-order commit core For memory-intensive applicationsan efficient implementation of Unsafe_OOC could al-low the use of denser higher-latency DRAM that couldpotentially cost much less

48 Prefetching Evaluation

Prefetching is essential for performance in modern sys-tems and interacts tightly with the pipeline and out-of-order commit We do not consider all prefetchingoptions but instead focus on the potential benefitsTherefore to evaluate this interaction we configure theL1-D cache in gem5 with a stride prefetcher of degree8 Figure 10 compares the relative performance gains

IOC+pref

AOOCAOOC+pref

(b) SLM

0

20

40

60

80

Perfo

rman

ceim

prov

emen

t (

)

IOC+pref

AOOCAOOC+pref

(b) NHM

0

20

40

60

80

IOC+pref

AOOCAOOC+pref

(c) HSW

0

20

40

60

80

Fig 10 The effect of aggressive out-of-order commit onthe performance with and without prefetchers in theL1 data cache across SPEC CPU2006 Improvementis relative to the baseline IOC architectures withoutprefetchers

of prefetchers for in-order commit out-of-order com-mit and prefetchers for out-of-order commit Acrossall three architectures both out-of-order commit andprefetching on their own provide roughly 40 improve-ment in performance with out-of-order commit beingslightly better However the combination of out-of-ordercommit and prefetching delivers nearly 70 better per-formance nearly the sum of the two independent contri-butions In fact combining aggressive out-of-order com-mit with prefetchers allows us to reach 84 of the ideal(sum of both) improvement for SLM 77 for NHMand 83 for HSW showing that these techniques workwell together

49 Rollback Costs for Unsafe Branches

While this work does not evaluate hardware costs weare able to evaluate the number of rollbacks caused bycommitting past a mispredicted branch for each of theevaluated configurations For this evaluation we use acommit depth of 8 with aggressive out-of-order commit(AOOC) and Unsafe_BR (only the branch conditionis not respected for out-of-order commit) In Table 3we list the number of executed branches mispredictedbranches the number of rollbacks caused by specula-tively committing a mispredicted branch and the num-ber of instructions that were rolled back due to this roll-back While the number of executed branches increaseswith the aggressiveness of the core (because these corescan more aggressively speculate past branches) the av-erage number of rollbacks (and instructions rolled back)decreases slightly with the aggressiveness of the coreThese aggressive cores resolve speculative state quicklyreducing the number of instructions that need to becommitted out-of-order This results in a similar num-ber of rollbacks per thousand architecturally committedinstructions around 3 for all configurations

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 13

0minus20minus40minus60minus80Co

ntrib

ution in

degrad

ation (

) a) SLM-AOOC across SpecCPU

Safe_LD Safe_ST Safe_BR Safe_EXC

0minus20minus40minus60minus80

b) AOOC SpecCPU average

astarbwaves

bzip cactusadm

calculixdealii

gamessgcc gemsfdtd

gobmkgromacs

h264refhmmer

lbm leslie3dlibquantum

mcf milc namdomnetpp

perl povraysjeng

soplexsphinx3

tontowrf xalan

zeusmp

0minus20minus40minus60minus80

b) SLM-ROOC across SpecCPU

SLM-avgNHM-avg

HSW-avg

0minus20minus40minus60minus80

d) ROOC SpecCPU average

Fig 11 Maximum potential performance improvement is provided by Unsafe_OOC in which all conditions areunsafe By preserving all conditions maximum performance is reduced This figure shows the negative effect ofpreserving (or making safe) a single condition compared to the maximum potential performance improvementSafe_XX is equivalent to disabling all out-of-order commit conditions (all unsafe highest performance) where XXindicates the only safe and preserved condition

Table 3 Out-of-Order Commit Costs for UnsafeBranches

Metric-PKI SLM NHM HSWAvg Max Avg Max Avg Max

Branches 1462 3733 1533 4608 1568 4957Mispred br 78 509 81 545 82 560Num rollbacks 32 243 31 256 30 265Instrs rollback 85 529 66 430 54 343

5 Memory Parallelism

Overlapping cache misses to service them in parallelin particular long-latency accesses to DRAM but alsolower-latency accesses to the LLC can result deliversignificant performance benefits [8] This memory par-allelism is typically achieved through the use of multi-ple Miss Status Holding Registers (MSHRs) [19] whichtrack outstanding memory requests and allow them toexecute in parallel In this section we compare in-ordercommit and out-of-order commit in terms of memoryparallelism (both to DRAM (MLP) and within the cachehierarchy (MHP) [7]) by changing the number of L1MSHRs and observing the effect on performance Toexplore these effects we select three applications thatare highly memory-bound [18] (mcf) medium memory-bound (lbm) and largely not memory-bound (gcc) inFigure 12 One key observation is that out-of-order com-mit in both reluctant and aggressive modes is muchbetter than in-order commit in exposing intrinsic ap-plication memory parallelism Figure 12 shows that thegap between in-order and out-of-order commit is muchlarger in the case of HSW which means that the moreaggressive is the microarchitectures the more MLP iscovered if we apply out-of-order instruction commit Incase of lbm comparing the SLM NHM and HSW weobserved that the rate of performance improvement inthe range of (1 lt= MSHRSize lt= 5) is higher than

the rate in the range of (5 lt MSHRSize lt= 9) Inthe range of (MSHRSize gt 9)) HSW and NHM havecontinued improvement This is most likely the resultof an additional loop iteration containing DRAM ac-cesses that is being covered by out-of-order instructioncommit

Figure 12 also allows us to explore the impact ofmemory-boundedness by comparing across the threeapplications (mcf is most memory bound gcc least) Al-thoughmcf (Figure 12 (d) (e) and (f)) is more memory-bound than lbm (Figure 12 (g) (h) and (i)) it exposesless memory parallelism when the number of MSHRsis increased (the gap between in-order commit and safeaggressive out-of-order commit is larger in case of lbmcompared to mcf ) This is because mcf has more iso-lated cache misses that are misses that cannot be ser-viced at the same time to provide memory level par-allelism Also mcf has branch instructions based onmemory accesses (dependent loads) that miss in thecache hierarchy thereby reducing the number of in-structions that can potentially be committed out-of-order lbm has medium level of memory-boundedness[18] and therefore exposes much more memory paral-lelism as the number of MSHRs is increased Its per-formance improvement is therefore much better whenthe CPU commits instructions out-of-order gcc bench-mark Figure 12 (j) (k) and (l) is one of the leastmemory-bound benchmarks As a result increasing thenumber of MSHRs does not improve its performanceeven in the case of unsafe out-of-order commit Over-all out-of-order commit outperforms in-order commitby exposing additional memory parallelism (both toDRAM and in the hierarchy) In summary in case ofMLP out-of-order commit provides more benefit forbigger architecture and more memory-bound applica-tions

14 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

1 2 3 4 5 6 7 8 9 10123456789

Norm

alize

d Pe

rform

ance

improv

emen

t ()

(a) SPEC2006 SLM

1 2 3 4 5 6 7 8 9 10123456789

(b) SPEC2006 NHM

IOC Safe-AOOC Unsafe-AOOC

1 2 3 4 5 6 7 8 9 10123456789

(c) SPEC2006 HSW

1 2 3 4 5 6 7 8 9 10123456789

(d) mcf SLM

1 2 3 4 5 6 7 8 9 10123456789

(e) mcf NHM

1 2 3 4 5 6 7 8 9 10123456789

(f) mcf HSW

1 2 3 4 5 6 7 8 9 10123456789

(g) lbm SLM

1 2 3 4 5 6 7 8 9 10123456789

(h) lbm NHM

1 2 3 4 5 6 7 8 9 10123456789

(i) lbm HSW

1 2 3 4 5 6 7 8 9 10123456789

(j) gcc SLM

1 2 3 4 5 6 7 8 9 10MSHR size

123456789

(k) gcc NHM

1 2 3 4 5 6 7 8 9 10123456789

(l) gcc HSW

Fig 12 Memory hierarchy parallelism comparison between in-order commit and safe and unsafe out-of-ordercommit in MSHRs size of 1 to 10 The results have been normalized to in-order commit with MSHR size of oneSubgraph (a) (b) and (c) are based on harmonic mean across SPEC CPU2006 The mcf lbm and gcc benchmarksare respectively representative of categories with high medium and low memory boundedness

6 Early Release of Physical Registers vsOut-of-order Commit

Register renaming enables the avoidance of false de-pendencies through the register file thereby improvingcore performance It does this by translating architec-tural registers to larger set of physical registers As thisrenaming is done early in the pipeline and the physi-

cal registers are not released until commit which mayhappen long after the last consumer reads the data thephysical register entries may be kept alive for longerthan what the dependencies alone require To addressthis resource constraint techniques such as delayed al-location and Rarly Release of the Physical Register

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 15

Table 4 The Contribution of exhausted (full) regis-ter file (RF) and reorder buffer (ROB) on CPU stallsOther reasons to the CPU stalls could be the instruc-tion queue (IQ) the load-store queue (LSQ) instruc-tions cache misses or not having a free functional unitto allocate to the ready-to-dispatch instructions

Microarchitecture RF ROB Othermin avg max min avg max min avg max

SLM 6 59 86 13 36 92 0 5 27NHM 2 68 91 8 30 98 0 20 7HSW 2 69 91 8 29 98 0 20 7

(ERPR) [23] have been developed In this section wecompare ERPR with the OOC taxonomy [2]

OOC and ERPR are similar as they release physicalregister as early as possible Aggressive OOC outper-forms ERPR in general because it releases ROB andLSQ entries early as well as physical registers Unfor-tunately neither ERPR nor OOC can release IQ entriesearlier than usual because OOC only address processorstate after the instructions have left the IQ and ERPRis effective before instructions are inserted tot the IQFrom another point of view OOC put additional pres-sure on the IQ and provides free entries in one or all ofthe resource of superscalar processors

61 Analysis of CPU Aggressiveness

CPU aggressiveness analysis in this context is basedon microarchitecture configurations (see Table 2) andcommit-depth The overall trend of Figure 13 showsthat AOOC outperforms ROOC and ERPR in all con-figurations This is not surprising as AOOC is alwaysenabled while ROOC is enabled only when a resourcesis exhausted (see Section 25)

ERPR is essentially a subset of AOOC that only fo-cuses on the register file (RF) Therefore the higher isthe program sensitivity to the RF capacity the higher isthe effect of ERPR To better understand this behaviorwe analyzed the reasons for CPU stalls across differentmicroarchitectures (Table 4) The data show that in-creasing the effective size of RF (eg what ERPR doesconceptually) is less effective in SLM compare to NHMand HSW barbecue is SLM the CPU stalls are moreoften due to other resources such as the ROB size Asa result since in SLM architecture the size of RF isless effective in CPU stalls ROOC outperforms ERPRin terms of performance improvement (ROOC can vir-tually extend the effective size of ROB LSQ as wellas RF although ERPR only does this on RF) RF inNHM and HSW architectures is more effective on CPUstall than in SLM architecture therefore ERPR out-

performs ROOC regarding performance improvementin these two architecture

By focusing only on ERPR we observe that thewider the core the more effective is the is increasingcommit-depth (wider commit-depth covers more readyinstructions to commit) For example in the case ofSLM although widening the commit-depth releases ad-ditional physical registers earlier than general thesefreed registers are not effective anymore because theCPU stalls are due to limits in other resources (likeROB IQ or LSQ) As NHM and HSW have more re-sources increasing the commit-depth is more effectivein those architectures In the case of AOOC increas-ing the commit-depth is more effective than ROOCand ERPR because it enables additional instructionsto be committed out-of-order (because it is active onevery cycle compare to ROOC and also affects all ofthe resources compare to ERPR that only affects theRF) In addition we can see that the more aggressivethe CPU the more effective is increasing the commit-depth This is because wider cores execute and com-plete instructions that are far from the head of ROBearlier with their additional resources(either providedby the baseline or released by the OOOC) and theycan therefore provide more instructions to commit ifthe commit-depth is increased

62 Analysis of Out-of-Order Conditions

Figure 13 shows the potential for performance improve-ment from relaxing the out-of-order commit conditionsacross the three architectures for both safe and unsafeout-of-order-commit As has been seen previously [2 521] the least aggressive out-of-order commit configu-ration (safe_OOC with a commit depth of 4) is mostbeneficial for the least aggressive out-of-order processorSLM (compare the left-most bars of Figure 13(a-c))This is expected as the SLM processor has the least ex-ecution hardware and therefore suffers from more stallsthan the more aggressive processors

Increasing the commit depth from 4 to 8 (secondgroup of bars in Figure 13(a-c)) shows that the more ag-gressive out-of-order processors (NHM and HSW) nowbenefit more from out-of-order commit than the sim-pler SLM architecture This new result shows the inter-action between the aggressiveness of the out-of-orderexecution and out-of-order commit for more aggres-sive out-of-order execution there is more benefit frommore aggressive out-of-order commit The reason is thataggressive architectures appear to have more unusedresources available for computation This trend con-tinues through the more aggressive out-of-order com-mit modes as well with unsafe optimizations (Figure

16 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

4 8 16 320

20406080

100

Perfo

rman

ceim

prov

emen

t () (a) SLM-SAFE

4 8 16 32 128Commit Depth

020406080

100

(b) NHM-SAFEAOOC ROOC ERPR

4 8 16 32 128 1920

20406080

100

(c) HSW-SAFE

4 8 16 320

20406080

100

(d) SLM-UnsafeLD

4 8 16 32 1280

20406080

100

(e) NHM-UnsafeLD

4 8 16 32 128 1920

20406080

100

(f) HSW-UnsafeLD

4 8 16 320

20406080

100

(g) SLM-UnsafeBR

4 8 16 32 1280

20406080

100

(h) NHM-UnsafeBR

4 8 16 32 128 1920

20406080

100

(i) HSW-UnsafeBR

4 8 16 320

20406080

100

(j) SLM-UnsafeST

4 8 16 32 1280

20406080

100

(k) NHM-UnsafeST

4 8 16 32 128 1920

20406080

100

(l) HSW-UnsafeST

4 8 16 320

20406080

100

(m) SLM-UnsafeEXC

4 8 16 32 1280

20406080

100

(n) NHM-UnsafeEXC

4 8 16 32 128 1920

20406080

100

(o) HSW-UnsafeEXC

4 8 16 320

20406080

100

(p) SLM-Unsafe

4 8 16 32 1280

20406080

100

(q) NHM-Unsafe

4 8 16 32 128 1920

20406080

100

(r) HSW-Unsafe

Fig 13 Comparison between AOOC ROOC and ERPR in different commit-depth and microarchitecture setupin six different modes UnsafeXX is equivalent to (enforcing) all out-of-order commit conditions except conditionXX By relaxing the specific XX condition the dependence between other conditions is also observed

13(d-r)) benefiting the more aggressive out-of-order ex-ecution designs (NHM and HSW) more than the sim-pler one (SLM) The reasons for this are similar tothe safe case as the more aggressive processors arefaster to achieve the first out-of-order condition in-struction completion and therefore providing greatercommit depths (eg changing the commit depth from 4to 8 in Figure 13 (a-c)) results in a larger likelihoodof finding instructions to commit out-of-order Table 5

uses this analysis to rank the relative importance ofeach out-of-order commit condition for the three ar-chitectures This information provides a guideline forwhich areas to work on to achieve the best performanceimprovement depending on the baseline out-of-order ex-ecution

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 17

Conditioncommit-depth 4 8 16 32 128 192

Unsafe_LD 1 2 2 2 NA NAUnsafe_BR 2 1 1 1 NA NAUnsafe_ST 4 4 4 4 NA NAUnsafe_EXC 3 3 3 3 NA NA

(a) SLMConditioncommit-depth 4 8 16 32 128 192

Unsafe_LD 1 2 2 2 2 NAUnsafe_BR 3 1 1 1 1 NAUnsafe_ST 4 4 4 4 4 NAUnsafe_EXC 2 3 3 3 3 NA

(b) NHMConditioncommit-depth 4 8 16 32 128 192

Unsafe_LD 4 2 2 2 2 2Unsafe_BR 1 1 1 1 1 1Unsafe_ST 3 3 4 4 4 4Unsafe_EXC 2 4 3 3 3 3

(c) HSW

Table 5 Ranking of the benefits of out-of-order commitconditions in different microarchitecture based on thedepth of commit

7 Related Work

The goal of this work is to provide a detailed under-standing into the potential for performance benefitsacross different out-of-order commit conditions and lev-els of aggressiveness While this work aims to describethe maximum potential benefit for each individual con-dition there have been many previous works that de-scribe hardware solutions for early release of hardwarestructures and out-of-order commit strategiesBelowwe provide an overview of these works and how theyfit into the categories described in this work

71 Speculative Release of Hardware StructuresA number of implementations require register and pro-cessor state checkpointing support to speculatively re-tire or release hardware structures [12 11 22 1 17]

Processor state checkpointing especially with theadvent of very large SIMD registers such as AVX-512registers which now support up to 32 registers with upto 512 bits per register can require a significant amountof state to be saved when speculation is aggressive

72 Non-speculative Structure Release

Non-speculative early release of hardware structures be-fore commit requires knowledge that no older instruc-tion (that has come earlier in the instruction stream)can cause the program to abort raise an exception orrequire exposure of the architected state at that timeNon-speculative solutions have the potential to be themost energy efficient a necessity in an era of the end ofDennard Scaling for power-limited platforms A range

of solutions from hardware-only to software-assistedsolutions are described below

Compiler support Two previous works [1 13]allow for early commit or reclaim of resources basedon compiler knowledge but require software recompi-lation

Checkpoint-free A number of solutions do notuse checkpoints [5 21 13] and therefore do not requirespeculation (they respect the all commit conditions) orused software help to extend knowledge to the hard-ware

Selective early commit Early commit of loads [1415] allows for loads that have not yet received data fromthe memory hierarchy to become part of the commit-ted state of a processor This allows the processor tocontinue to process instructions past normally blockedstructures improving performance for memory-boundworkloads To accomplish this the authors [15] de-couple page faults that can occur from the fetching ofdata While recompilation is not strictly required forthis technique the authors evaluate their work using acompilation strategy to expose loads early Late alloca-tion and early release of physical registers is proposedas a register renaming scheme in [23] They explored theeffect of late allocation and early release of physical reg-ister on the performance of an 8-way superscalar pro-cessor without the need for additional speculation andmaintains precise exceptions The out-of-order commitmethods evaluated in this work overlap with the earlyrelease of registers while also evaluating the early re-lease of other structures (like the LSQ)

Evading out-of-order commit conditions[24] provides a coherency protocol level solution suchthat the loadrarrload reordering can be evaded non spec-ulatively under TSO In their design it is no longer nec-essary to squash and re-execute speculatively reorderedloads if their reordering can be hidden by the coherencyprotocol This allows speculatively reordered loads thatotherwise adhere to the rest of the Bell-Lipasti condi-tions to be committed out-of-order without affectingthe memory model

8 Conclusion

To obtain higher performance extending the reach ofthe processor core has been a focus in much of microar-chitecture research One promising direction is the useof out-of-order commit which releases precious proces-sor resources early to allow the processor to extend itsreach past typical hardware limits In this work wepresent a limit study for out-of-order commit throughthe introduction of reluctant and aggressive out-of-order

18 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

commit modes We show how smaller processors evenwith a limited commit scan depth can benefit fromout-of-order commit strategies but that larger aggres-sive cores require deeper commit scan depths to achieveimproved performance In addition we provide a de-tailed breakdown of the contributions for each out-of-order commit condition for the SPEC CPU2006 bench-mark suite and compare against similar works suchas early release of physical registers Our results showa very high potential for performance improvementabove 225x for some benchmarks and believe that out-of-order commit strategies can play an important rolefor future energy-efficient and high-performance proces-sor designs

References

1 Afram F Zeng H Ghose K A group-commitmechanism for ROB-based processors implement-ing the x86 ISA In HPCA pp 47ndash58 (2013)

2 Alipour M Carlson TE Kaxiras S Exploringthe performance limits of out-of-order commit InCF pp 211ndash220 (2017)

3 Allan A Edenfeld D Joyner Jr WH KahngAB Rodgers M Zorian Y 2001 technologyroadmap for semiconductors Computer 35(1) 42ndash53 (2002)

4 Badalone R Dramrsquos surprising role in the costof data centers httpwwwdatacenterknowledgecomarchives20151112dont (2015)

5 Bell GB Lipasti MH Deconstructing commitIn ISPASS pp 68ndash77 (2004)

6 Binkert N Beckmann B Black G ReinhardtSK Saidi A Basu A Hestness J Hower DRKrishna T Sardashti S Sen R Sewell KShoaib M Vaish N Hill MD Wood DA Thegem5 simulator SIGARCH Comput Archit News39(2) 1ndash7 (2011)

7 Carlson TE Heirman W Allam O Kaxiras SEeckhout L The load slice core microarchitectureIn ISCA pp 272ndash284 (2015)

8 Chou Y Fahs B Abraham S Microarchitec-ture optimizations for exploiting memory-level par-allelism In ISCA pp 76ndash88 (2004)

9 Corporation I Intel Rcopy 64 and ia-32 ar-chitectures optimization reference manualhttpwwwintelcomcontentwwwusenarchitecture-and-technology64-ia-32-architectures-optimization-manualhtml(2016)

10 Corporation I Intel Rcopy intelrsquos rsquotick-tockrsquo seeminglydead becomes rsquoprocess-architecture-optimizationrsquohttpwwwanandtechcomshow10183

intels-tick-tock-seemingly-dead-becomes-process-architecture-optimization (2016)

11 Cristal A Ortega D Llosa J Valero M Out-of-order commit processors In HPCA pp 48ndash59(2004)

12 Cristal A Santana OJ Valero M MartiacutenezJF Toward kilo-instruction processors ACMTrans Archit Code Optim 1(4) 389ndash417 (2004)

13 Duong N Veidenbaum AV Compiler-assistedselective out-of-order commit IEEE Computer Ar-chitecture Letters 12(1) 21ndash24 (2013)

14 Gwennap L Digital leads the pack with 21164 InMicroprocessor Report 8(12) pp 249ndash260 (1994)

15 Ham T Aragoacuten JL Martonosi M Desc De-coupled supply-compute communication manage-ment for heterogeneous architectures In MICROpp 191ndash203 (2015)

16 Henning JL SPEC CPU2006 benchmark descrip-tions SIGARCH Comput Archit News 34(4) 1ndash17 (2006)

17 Hilton A Roth A BOLT Energy-efficient out-of-order latency-tolerant execution In HPCA pp1ndash12 (2010)

18 Jaleel A Memory characterization of workloadsusing instrumentation driven simulation httpwwwglueumdeduajaleelworkload (2010)

19 Kroft D Lockup-free instruction fetchprefetchcache organization In ISCA pp 81ndash87 (1981)

20 Lee D Kim Y Seshadri V Liu J Subrama-nian L Mutlu O Tiered-latency dram A lowlatency and low cost dram architecture In HPCApp 615ndash626 (2013)

21 Marti S Borras J Rodriguez P Tena RMarin J A complexity-effective out-of-order re-tirement microarchitecture IEEE Transactions onComputers 58(12) 1626ndash1639 (2009)

22 Martinez JF Renau J Huang MC PrvulovicM Cherry Checkpointed early resource recyclingin out-of-order microprocessors In MICRO pp3ndash14 (2002)

23 Monreal T Vinals V Gonzalez J GonzalezA Valero M Late allocation and early release ofphysical registers IEEE Trans Comput 53(10)1244ndash1259 (2004)

24 Ros A Carlson TE Alipour M Kaxiras SNon-speculative load-load reordering in TSO InISCA pp 187ndash200 (2017)

25 Smith JE Pleszkun AR Implementing preciseinterrupts in pipelined processors IEEE Transac-tions on Computers 37(5) 562ndash573 (1988)

26 Sohi GS Vajapeyam S Instruction issue logicfor high-performance interruptable pipelined pro-cessors In ISCA pp 27ndash34 (1987)

  • Introduction
  • Out-of-Order Commit
  • Methodology
  • Out-of-Order Commit Evaluation
  • Memory Parallelism
  • Early Release of Physical Registers vs Out-of-order Commit
  • Related Work
  • Conclusion
Page 11: Maximizing Limited Resources: A Limit-based Study and ...tcarlson/pdfs/alipour2018mlralsatoo… · Maximizing Limited Resources: A Limit-based Study and Taxonomy of Out-of-order Commit

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 11

020406080

100

Contrib

ution in

improv

emen

t () a) SLM-AOOC across SpecCPU

Safe Unsafe_LD Unsafe_ST Unsafe_BR Unsafe_EXC

020406080

100b) AOOC SpecCPU average

astarbwaves

bzip cactusadm

calculixdealii

gamessgcc gemsfdtd

gobmkgromacs

h264refhmmer

lbm leslie3dlibquantum

mcf milc namdomnetpp

perl povraysjeng

soplexsphinx3

tontowrf xalan

zeusmp

020406080

100c) SLM-ROOC across SpecCPU

SLM-avgNHM-avg

HSW-avg

020406080

100d) ROOC SpecCPU average

Fig 8 Contribution of safe and selectively unsafe out-of-order commit on three different microarchitecturesUnsafe_XX is equivalent to activating (enforcing) all out-of-order commit conditions except XX (the XX conditionis relaxed) By relaxing the specific XX condition the dependence between other conditions is also observed

each condition in the presence of the other preservedconditions in Figure 8 We analyze the SLM microar-chitecture in detail and provide averages across all mi-croarchitectures for both AOOC and ROOC Each out-of-order commit condition in analyzed in isolation andwe consider Unsafe_OOC (all relaxed conditions) asthe 100 potential improvement budget In the case ofthe mcf benchmark in Figure 7a the safe and unsafeOOC performance improvement is 33 and 71 re-spectively (46 of the potential improvement budget isprovided by Safe_OOC) We also observe that by re-laxing the LD condition (unsafe_LD) 52 of potentialimprovement budget is achievable (see Figure 8a) InFigure 8 we can see in some applications (like namd inAOOC mode and leslie3d in ROOC mode) that re-laxing just a single condition is not sufficient to fill thegap between safe and unsafe OOC This does not meanthat a single condition is not important but rather thatother preserved conditions are preventing out-of-ordercommit from achieving its full potential

AOOC We observe that for most of applicationsUnsafe_BR and Unsafe_LD are the most interestingconditions (Figure 8a) Additionally the more aggres-sive the core the more important the Unsafe_LD con-dition becomes In SLM NHM and HSW CPUs Un-safe_LD respectively fills 4 10 and 12 and Un-safe_BR fills 9 8 and 7 of the gap between safeand unsafe OOC Unsafe_ST is not very effective be-cause of the rarity of this condition and the conserva-tive memory dependence predictor used Unsafe_EXCor relaxing exceptions are not that effective becausethey are very rare especially in integer benchmarks

ROOC Relaxing OOC conditions in ROOC hasless effect in reducing the gap between safe and un-safe OOC and this is because of the nature of this on-demand OOC mode which is enabled and needed more

in SLM and much less in NHM and HSW (see Fig-ure 7b)

Benchmark-specific Analysis The astar andgobmk benchmarks have the highest number of mis-predicted branches per 1000 instructions Therefore re-laxing the branch condition (unsafe_BR) improves theperformance of these two benchmarks by 68 and 94respectively On the other hand benchmarks such ascactusadm and lbm have a low branch mispredictionrate (05) and therefore relaxing the branch condi-tion does not show significant improvement for thesebenchmarks

The sphinx benchmark has high degree of intrinsicILP2 thus an increase in the number of effective re-sources allows this benchmark to improve performanceAs most of its load instructions are L2-cache missesrelaxing the load condition (Unsafe_LD) improves per-formance of this benchmark

462 Negative contribution of OOC conditions

In this section we analyze the gap between safe andunsafe out-of-order commit from a different angle

We relax all of the out-of-order commit conditionsexcept for one For example safe_LD means the LDcondition is preserved but ST BR and EXC conditionsare relaxed By preserving one of the conditions we lookat the negative effect (performance reduction) of theactivated condition compared to Unsafe_OOC

Figure 11 depicts the effect of each condition onperformance For most of the benchmarks the BR andthe LD conditions are the most effective ones Amongfloating-point benchmarks LD and EXC conditions havea large impact on performance Therefore relaxing theEXC condition as it is rare could lead to significant

2 The sphinx benchmark continues to show performance im-provements as the in-order commit processor aggressiveness in-creases from SLM to HSW

12 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

200 400 600 800 1000DRAM latency (cycle)

020406080

100120140160

Perfo

rman

ceim

prov

emen

t (

)Safe_OOC Unsafe_OOC

Fig 9 A comparison between safe and unsafe out-of-order commit across DRAM latencies for SLM andAOOC on SPEC CPU2006 OOC improvement in-creases with a higher DRAM latency Unsafe_OOCoutperforms Safe_OOC This study uses caches thatare four times smaller compared to Table 1 to put ad-ditional stress on the DRAM This has been done forboth in-order and out-of-order commit configurations

performance improvements at relatively low cost espe-cially if recovery mechanisms in software are used SThas the least effect among out-of-order commit condi-tions when it is preserved in isolation from other condi-tions This is valid between all three microarchitectures

47 Memory Latency Evaluation

Current DRAM cells are optimized for cost and notfor access latency [20] The potential to use more af-fordable but higher latency memory could have a largeimpact on allocation of memory in datacenter serversas high density DRAM modules cost much more thanthose that are 4times smaller (175times per GB [4]) To eval-uate the potential of out-of-order commit to handlehigher DRAM latency we evaluate memory latency from200 cycles to 1000 cycles in 200 cycle increments Inaddition we reduce the size of all caches by 4 timeswhen compared to Table 1 to put additional stress onthe DRAM subsystem Here we see an increasing per-formance improvement for the Unsafe_OOC conditionwhile Safe_OOC increases linearly compared to the in-order commit core For memory-intensive applicationsan efficient implementation of Unsafe_OOC could al-low the use of denser higher-latency DRAM that couldpotentially cost much less

48 Prefetching Evaluation

Prefetching is essential for performance in modern sys-tems and interacts tightly with the pipeline and out-of-order commit We do not consider all prefetchingoptions but instead focus on the potential benefitsTherefore to evaluate this interaction we configure theL1-D cache in gem5 with a stride prefetcher of degree8 Figure 10 compares the relative performance gains

IOC+pref

AOOCAOOC+pref

(b) SLM

0

20

40

60

80

Perfo

rman

ceim

prov

emen

t (

)

IOC+pref

AOOCAOOC+pref

(b) NHM

0

20

40

60

80

IOC+pref

AOOCAOOC+pref

(c) HSW

0

20

40

60

80

Fig 10 The effect of aggressive out-of-order commit onthe performance with and without prefetchers in theL1 data cache across SPEC CPU2006 Improvementis relative to the baseline IOC architectures withoutprefetchers

of prefetchers for in-order commit out-of-order com-mit and prefetchers for out-of-order commit Acrossall three architectures both out-of-order commit andprefetching on their own provide roughly 40 improve-ment in performance with out-of-order commit beingslightly better However the combination of out-of-ordercommit and prefetching delivers nearly 70 better per-formance nearly the sum of the two independent contri-butions In fact combining aggressive out-of-order com-mit with prefetchers allows us to reach 84 of the ideal(sum of both) improvement for SLM 77 for NHMand 83 for HSW showing that these techniques workwell together

49 Rollback Costs for Unsafe Branches

While this work does not evaluate hardware costs weare able to evaluate the number of rollbacks caused bycommitting past a mispredicted branch for each of theevaluated configurations For this evaluation we use acommit depth of 8 with aggressive out-of-order commit(AOOC) and Unsafe_BR (only the branch conditionis not respected for out-of-order commit) In Table 3we list the number of executed branches mispredictedbranches the number of rollbacks caused by specula-tively committing a mispredicted branch and the num-ber of instructions that were rolled back due to this roll-back While the number of executed branches increaseswith the aggressiveness of the core (because these corescan more aggressively speculate past branches) the av-erage number of rollbacks (and instructions rolled back)decreases slightly with the aggressiveness of the coreThese aggressive cores resolve speculative state quicklyreducing the number of instructions that need to becommitted out-of-order This results in a similar num-ber of rollbacks per thousand architecturally committedinstructions around 3 for all configurations

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 13

0minus20minus40minus60minus80Co

ntrib

ution in

degrad

ation (

) a) SLM-AOOC across SpecCPU

Safe_LD Safe_ST Safe_BR Safe_EXC

0minus20minus40minus60minus80

b) AOOC SpecCPU average

astarbwaves

bzip cactusadm

calculixdealii

gamessgcc gemsfdtd

gobmkgromacs

h264refhmmer

lbm leslie3dlibquantum

mcf milc namdomnetpp

perl povraysjeng

soplexsphinx3

tontowrf xalan

zeusmp

0minus20minus40minus60minus80

b) SLM-ROOC across SpecCPU

SLM-avgNHM-avg

HSW-avg

0minus20minus40minus60minus80

d) ROOC SpecCPU average

Fig 11 Maximum potential performance improvement is provided by Unsafe_OOC in which all conditions areunsafe By preserving all conditions maximum performance is reduced This figure shows the negative effect ofpreserving (or making safe) a single condition compared to the maximum potential performance improvementSafe_XX is equivalent to disabling all out-of-order commit conditions (all unsafe highest performance) where XXindicates the only safe and preserved condition

Table 3 Out-of-Order Commit Costs for UnsafeBranches

Metric-PKI SLM NHM HSWAvg Max Avg Max Avg Max

Branches 1462 3733 1533 4608 1568 4957Mispred br 78 509 81 545 82 560Num rollbacks 32 243 31 256 30 265Instrs rollback 85 529 66 430 54 343

5 Memory Parallelism

Overlapping cache misses to service them in parallelin particular long-latency accesses to DRAM but alsolower-latency accesses to the LLC can result deliversignificant performance benefits [8] This memory par-allelism is typically achieved through the use of multi-ple Miss Status Holding Registers (MSHRs) [19] whichtrack outstanding memory requests and allow them toexecute in parallel In this section we compare in-ordercommit and out-of-order commit in terms of memoryparallelism (both to DRAM (MLP) and within the cachehierarchy (MHP) [7]) by changing the number of L1MSHRs and observing the effect on performance Toexplore these effects we select three applications thatare highly memory-bound [18] (mcf) medium memory-bound (lbm) and largely not memory-bound (gcc) inFigure 12 One key observation is that out-of-order com-mit in both reluctant and aggressive modes is muchbetter than in-order commit in exposing intrinsic ap-plication memory parallelism Figure 12 shows that thegap between in-order and out-of-order commit is muchlarger in the case of HSW which means that the moreaggressive is the microarchitectures the more MLP iscovered if we apply out-of-order instruction commit Incase of lbm comparing the SLM NHM and HSW weobserved that the rate of performance improvement inthe range of (1 lt= MSHRSize lt= 5) is higher than

the rate in the range of (5 lt MSHRSize lt= 9) Inthe range of (MSHRSize gt 9)) HSW and NHM havecontinued improvement This is most likely the resultof an additional loop iteration containing DRAM ac-cesses that is being covered by out-of-order instructioncommit

Figure 12 also allows us to explore the impact ofmemory-boundedness by comparing across the threeapplications (mcf is most memory bound gcc least) Al-thoughmcf (Figure 12 (d) (e) and (f)) is more memory-bound than lbm (Figure 12 (g) (h) and (i)) it exposesless memory parallelism when the number of MSHRsis increased (the gap between in-order commit and safeaggressive out-of-order commit is larger in case of lbmcompared to mcf ) This is because mcf has more iso-lated cache misses that are misses that cannot be ser-viced at the same time to provide memory level par-allelism Also mcf has branch instructions based onmemory accesses (dependent loads) that miss in thecache hierarchy thereby reducing the number of in-structions that can potentially be committed out-of-order lbm has medium level of memory-boundedness[18] and therefore exposes much more memory paral-lelism as the number of MSHRs is increased Its per-formance improvement is therefore much better whenthe CPU commits instructions out-of-order gcc bench-mark Figure 12 (j) (k) and (l) is one of the leastmemory-bound benchmarks As a result increasing thenumber of MSHRs does not improve its performanceeven in the case of unsafe out-of-order commit Over-all out-of-order commit outperforms in-order commitby exposing additional memory parallelism (both toDRAM and in the hierarchy) In summary in case ofMLP out-of-order commit provides more benefit forbigger architecture and more memory-bound applica-tions

14 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

1 2 3 4 5 6 7 8 9 10123456789

Norm

alize

d Pe

rform

ance

improv

emen

t ()

(a) SPEC2006 SLM

1 2 3 4 5 6 7 8 9 10123456789

(b) SPEC2006 NHM

IOC Safe-AOOC Unsafe-AOOC

1 2 3 4 5 6 7 8 9 10123456789

(c) SPEC2006 HSW

1 2 3 4 5 6 7 8 9 10123456789

(d) mcf SLM

1 2 3 4 5 6 7 8 9 10123456789

(e) mcf NHM

1 2 3 4 5 6 7 8 9 10123456789

(f) mcf HSW

1 2 3 4 5 6 7 8 9 10123456789

(g) lbm SLM

1 2 3 4 5 6 7 8 9 10123456789

(h) lbm NHM

1 2 3 4 5 6 7 8 9 10123456789

(i) lbm HSW

1 2 3 4 5 6 7 8 9 10123456789

(j) gcc SLM

1 2 3 4 5 6 7 8 9 10MSHR size

123456789

(k) gcc NHM

1 2 3 4 5 6 7 8 9 10123456789

(l) gcc HSW

Fig 12 Memory hierarchy parallelism comparison between in-order commit and safe and unsafe out-of-ordercommit in MSHRs size of 1 to 10 The results have been normalized to in-order commit with MSHR size of oneSubgraph (a) (b) and (c) are based on harmonic mean across SPEC CPU2006 The mcf lbm and gcc benchmarksare respectively representative of categories with high medium and low memory boundedness

6 Early Release of Physical Registers vsOut-of-order Commit

Register renaming enables the avoidance of false de-pendencies through the register file thereby improvingcore performance It does this by translating architec-tural registers to larger set of physical registers As thisrenaming is done early in the pipeline and the physi-

cal registers are not released until commit which mayhappen long after the last consumer reads the data thephysical register entries may be kept alive for longerthan what the dependencies alone require To addressthis resource constraint techniques such as delayed al-location and Rarly Release of the Physical Register

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 15

Table 4 The Contribution of exhausted (full) regis-ter file (RF) and reorder buffer (ROB) on CPU stallsOther reasons to the CPU stalls could be the instruc-tion queue (IQ) the load-store queue (LSQ) instruc-tions cache misses or not having a free functional unitto allocate to the ready-to-dispatch instructions

Microarchitecture RF ROB Othermin avg max min avg max min avg max

SLM 6 59 86 13 36 92 0 5 27NHM 2 68 91 8 30 98 0 20 7HSW 2 69 91 8 29 98 0 20 7

(ERPR) [23] have been developed In this section wecompare ERPR with the OOC taxonomy [2]

OOC and ERPR are similar as they release physicalregister as early as possible Aggressive OOC outper-forms ERPR in general because it releases ROB andLSQ entries early as well as physical registers Unfor-tunately neither ERPR nor OOC can release IQ entriesearlier than usual because OOC only address processorstate after the instructions have left the IQ and ERPRis effective before instructions are inserted tot the IQFrom another point of view OOC put additional pres-sure on the IQ and provides free entries in one or all ofthe resource of superscalar processors

61 Analysis of CPU Aggressiveness

CPU aggressiveness analysis in this context is basedon microarchitecture configurations (see Table 2) andcommit-depth The overall trend of Figure 13 showsthat AOOC outperforms ROOC and ERPR in all con-figurations This is not surprising as AOOC is alwaysenabled while ROOC is enabled only when a resourcesis exhausted (see Section 25)

ERPR is essentially a subset of AOOC that only fo-cuses on the register file (RF) Therefore the higher isthe program sensitivity to the RF capacity the higher isthe effect of ERPR To better understand this behaviorwe analyzed the reasons for CPU stalls across differentmicroarchitectures (Table 4) The data show that in-creasing the effective size of RF (eg what ERPR doesconceptually) is less effective in SLM compare to NHMand HSW barbecue is SLM the CPU stalls are moreoften due to other resources such as the ROB size Asa result since in SLM architecture the size of RF isless effective in CPU stalls ROOC outperforms ERPRin terms of performance improvement (ROOC can vir-tually extend the effective size of ROB LSQ as wellas RF although ERPR only does this on RF) RF inNHM and HSW architectures is more effective on CPUstall than in SLM architecture therefore ERPR out-

performs ROOC regarding performance improvementin these two architecture

By focusing only on ERPR we observe that thewider the core the more effective is the is increasingcommit-depth (wider commit-depth covers more readyinstructions to commit) For example in the case ofSLM although widening the commit-depth releases ad-ditional physical registers earlier than general thesefreed registers are not effective anymore because theCPU stalls are due to limits in other resources (likeROB IQ or LSQ) As NHM and HSW have more re-sources increasing the commit-depth is more effectivein those architectures In the case of AOOC increas-ing the commit-depth is more effective than ROOCand ERPR because it enables additional instructionsto be committed out-of-order (because it is active onevery cycle compare to ROOC and also affects all ofthe resources compare to ERPR that only affects theRF) In addition we can see that the more aggressivethe CPU the more effective is increasing the commit-depth This is because wider cores execute and com-plete instructions that are far from the head of ROBearlier with their additional resources(either providedby the baseline or released by the OOOC) and theycan therefore provide more instructions to commit ifthe commit-depth is increased

62 Analysis of Out-of-Order Conditions

Figure 13 shows the potential for performance improve-ment from relaxing the out-of-order commit conditionsacross the three architectures for both safe and unsafeout-of-order-commit As has been seen previously [2 521] the least aggressive out-of-order commit configu-ration (safe_OOC with a commit depth of 4) is mostbeneficial for the least aggressive out-of-order processorSLM (compare the left-most bars of Figure 13(a-c))This is expected as the SLM processor has the least ex-ecution hardware and therefore suffers from more stallsthan the more aggressive processors

Increasing the commit depth from 4 to 8 (secondgroup of bars in Figure 13(a-c)) shows that the more ag-gressive out-of-order processors (NHM and HSW) nowbenefit more from out-of-order commit than the sim-pler SLM architecture This new result shows the inter-action between the aggressiveness of the out-of-orderexecution and out-of-order commit for more aggres-sive out-of-order execution there is more benefit frommore aggressive out-of-order commit The reason is thataggressive architectures appear to have more unusedresources available for computation This trend con-tinues through the more aggressive out-of-order com-mit modes as well with unsafe optimizations (Figure

16 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

4 8 16 320

20406080

100

Perfo

rman

ceim

prov

emen

t () (a) SLM-SAFE

4 8 16 32 128Commit Depth

020406080

100

(b) NHM-SAFEAOOC ROOC ERPR

4 8 16 32 128 1920

20406080

100

(c) HSW-SAFE

4 8 16 320

20406080

100

(d) SLM-UnsafeLD

4 8 16 32 1280

20406080

100

(e) NHM-UnsafeLD

4 8 16 32 128 1920

20406080

100

(f) HSW-UnsafeLD

4 8 16 320

20406080

100

(g) SLM-UnsafeBR

4 8 16 32 1280

20406080

100

(h) NHM-UnsafeBR

4 8 16 32 128 1920

20406080

100

(i) HSW-UnsafeBR

4 8 16 320

20406080

100

(j) SLM-UnsafeST

4 8 16 32 1280

20406080

100

(k) NHM-UnsafeST

4 8 16 32 128 1920

20406080

100

(l) HSW-UnsafeST

4 8 16 320

20406080

100

(m) SLM-UnsafeEXC

4 8 16 32 1280

20406080

100

(n) NHM-UnsafeEXC

4 8 16 32 128 1920

20406080

100

(o) HSW-UnsafeEXC

4 8 16 320

20406080

100

(p) SLM-Unsafe

4 8 16 32 1280

20406080

100

(q) NHM-Unsafe

4 8 16 32 128 1920

20406080

100

(r) HSW-Unsafe

Fig 13 Comparison between AOOC ROOC and ERPR in different commit-depth and microarchitecture setupin six different modes UnsafeXX is equivalent to (enforcing) all out-of-order commit conditions except conditionXX By relaxing the specific XX condition the dependence between other conditions is also observed

13(d-r)) benefiting the more aggressive out-of-order ex-ecution designs (NHM and HSW) more than the sim-pler one (SLM) The reasons for this are similar tothe safe case as the more aggressive processors arefaster to achieve the first out-of-order condition in-struction completion and therefore providing greatercommit depths (eg changing the commit depth from 4to 8 in Figure 13 (a-c)) results in a larger likelihoodof finding instructions to commit out-of-order Table 5

uses this analysis to rank the relative importance ofeach out-of-order commit condition for the three ar-chitectures This information provides a guideline forwhich areas to work on to achieve the best performanceimprovement depending on the baseline out-of-order ex-ecution

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 17

Conditioncommit-depth 4 8 16 32 128 192

Unsafe_LD 1 2 2 2 NA NAUnsafe_BR 2 1 1 1 NA NAUnsafe_ST 4 4 4 4 NA NAUnsafe_EXC 3 3 3 3 NA NA

(a) SLMConditioncommit-depth 4 8 16 32 128 192

Unsafe_LD 1 2 2 2 2 NAUnsafe_BR 3 1 1 1 1 NAUnsafe_ST 4 4 4 4 4 NAUnsafe_EXC 2 3 3 3 3 NA

(b) NHMConditioncommit-depth 4 8 16 32 128 192

Unsafe_LD 4 2 2 2 2 2Unsafe_BR 1 1 1 1 1 1Unsafe_ST 3 3 4 4 4 4Unsafe_EXC 2 4 3 3 3 3

(c) HSW

Table 5 Ranking of the benefits of out-of-order commitconditions in different microarchitecture based on thedepth of commit

7 Related Work

The goal of this work is to provide a detailed under-standing into the potential for performance benefitsacross different out-of-order commit conditions and lev-els of aggressiveness While this work aims to describethe maximum potential benefit for each individual con-dition there have been many previous works that de-scribe hardware solutions for early release of hardwarestructures and out-of-order commit strategiesBelowwe provide an overview of these works and how theyfit into the categories described in this work

71 Speculative Release of Hardware StructuresA number of implementations require register and pro-cessor state checkpointing support to speculatively re-tire or release hardware structures [12 11 22 1 17]

Processor state checkpointing especially with theadvent of very large SIMD registers such as AVX-512registers which now support up to 32 registers with upto 512 bits per register can require a significant amountof state to be saved when speculation is aggressive

72 Non-speculative Structure Release

Non-speculative early release of hardware structures be-fore commit requires knowledge that no older instruc-tion (that has come earlier in the instruction stream)can cause the program to abort raise an exception orrequire exposure of the architected state at that timeNon-speculative solutions have the potential to be themost energy efficient a necessity in an era of the end ofDennard Scaling for power-limited platforms A range

of solutions from hardware-only to software-assistedsolutions are described below

Compiler support Two previous works [1 13]allow for early commit or reclaim of resources basedon compiler knowledge but require software recompi-lation

Checkpoint-free A number of solutions do notuse checkpoints [5 21 13] and therefore do not requirespeculation (they respect the all commit conditions) orused software help to extend knowledge to the hard-ware

Selective early commit Early commit of loads [1415] allows for loads that have not yet received data fromthe memory hierarchy to become part of the commit-ted state of a processor This allows the processor tocontinue to process instructions past normally blockedstructures improving performance for memory-boundworkloads To accomplish this the authors [15] de-couple page faults that can occur from the fetching ofdata While recompilation is not strictly required forthis technique the authors evaluate their work using acompilation strategy to expose loads early Late alloca-tion and early release of physical registers is proposedas a register renaming scheme in [23] They explored theeffect of late allocation and early release of physical reg-ister on the performance of an 8-way superscalar pro-cessor without the need for additional speculation andmaintains precise exceptions The out-of-order commitmethods evaluated in this work overlap with the earlyrelease of registers while also evaluating the early re-lease of other structures (like the LSQ)

Evading out-of-order commit conditions[24] provides a coherency protocol level solution suchthat the loadrarrload reordering can be evaded non spec-ulatively under TSO In their design it is no longer nec-essary to squash and re-execute speculatively reorderedloads if their reordering can be hidden by the coherencyprotocol This allows speculatively reordered loads thatotherwise adhere to the rest of the Bell-Lipasti condi-tions to be committed out-of-order without affectingthe memory model

8 Conclusion

To obtain higher performance extending the reach ofthe processor core has been a focus in much of microar-chitecture research One promising direction is the useof out-of-order commit which releases precious proces-sor resources early to allow the processor to extend itsreach past typical hardware limits In this work wepresent a limit study for out-of-order commit throughthe introduction of reluctant and aggressive out-of-order

18 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

commit modes We show how smaller processors evenwith a limited commit scan depth can benefit fromout-of-order commit strategies but that larger aggres-sive cores require deeper commit scan depths to achieveimproved performance In addition we provide a de-tailed breakdown of the contributions for each out-of-order commit condition for the SPEC CPU2006 bench-mark suite and compare against similar works suchas early release of physical registers Our results showa very high potential for performance improvementabove 225x for some benchmarks and believe that out-of-order commit strategies can play an important rolefor future energy-efficient and high-performance proces-sor designs

References

1 Afram F Zeng H Ghose K A group-commitmechanism for ROB-based processors implement-ing the x86 ISA In HPCA pp 47ndash58 (2013)

2 Alipour M Carlson TE Kaxiras S Exploringthe performance limits of out-of-order commit InCF pp 211ndash220 (2017)

3 Allan A Edenfeld D Joyner Jr WH KahngAB Rodgers M Zorian Y 2001 technologyroadmap for semiconductors Computer 35(1) 42ndash53 (2002)

4 Badalone R Dramrsquos surprising role in the costof data centers httpwwwdatacenterknowledgecomarchives20151112dont (2015)

5 Bell GB Lipasti MH Deconstructing commitIn ISPASS pp 68ndash77 (2004)

6 Binkert N Beckmann B Black G ReinhardtSK Saidi A Basu A Hestness J Hower DRKrishna T Sardashti S Sen R Sewell KShoaib M Vaish N Hill MD Wood DA Thegem5 simulator SIGARCH Comput Archit News39(2) 1ndash7 (2011)

7 Carlson TE Heirman W Allam O Kaxiras SEeckhout L The load slice core microarchitectureIn ISCA pp 272ndash284 (2015)

8 Chou Y Fahs B Abraham S Microarchitec-ture optimizations for exploiting memory-level par-allelism In ISCA pp 76ndash88 (2004)

9 Corporation I Intel Rcopy 64 and ia-32 ar-chitectures optimization reference manualhttpwwwintelcomcontentwwwusenarchitecture-and-technology64-ia-32-architectures-optimization-manualhtml(2016)

10 Corporation I Intel Rcopy intelrsquos rsquotick-tockrsquo seeminglydead becomes rsquoprocess-architecture-optimizationrsquohttpwwwanandtechcomshow10183

intels-tick-tock-seemingly-dead-becomes-process-architecture-optimization (2016)

11 Cristal A Ortega D Llosa J Valero M Out-of-order commit processors In HPCA pp 48ndash59(2004)

12 Cristal A Santana OJ Valero M MartiacutenezJF Toward kilo-instruction processors ACMTrans Archit Code Optim 1(4) 389ndash417 (2004)

13 Duong N Veidenbaum AV Compiler-assistedselective out-of-order commit IEEE Computer Ar-chitecture Letters 12(1) 21ndash24 (2013)

14 Gwennap L Digital leads the pack with 21164 InMicroprocessor Report 8(12) pp 249ndash260 (1994)

15 Ham T Aragoacuten JL Martonosi M Desc De-coupled supply-compute communication manage-ment for heterogeneous architectures In MICROpp 191ndash203 (2015)

16 Henning JL SPEC CPU2006 benchmark descrip-tions SIGARCH Comput Archit News 34(4) 1ndash17 (2006)

17 Hilton A Roth A BOLT Energy-efficient out-of-order latency-tolerant execution In HPCA pp1ndash12 (2010)

18 Jaleel A Memory characterization of workloadsusing instrumentation driven simulation httpwwwglueumdeduajaleelworkload (2010)

19 Kroft D Lockup-free instruction fetchprefetchcache organization In ISCA pp 81ndash87 (1981)

20 Lee D Kim Y Seshadri V Liu J Subrama-nian L Mutlu O Tiered-latency dram A lowlatency and low cost dram architecture In HPCApp 615ndash626 (2013)

21 Marti S Borras J Rodriguez P Tena RMarin J A complexity-effective out-of-order re-tirement microarchitecture IEEE Transactions onComputers 58(12) 1626ndash1639 (2009)

22 Martinez JF Renau J Huang MC PrvulovicM Cherry Checkpointed early resource recyclingin out-of-order microprocessors In MICRO pp3ndash14 (2002)

23 Monreal T Vinals V Gonzalez J GonzalezA Valero M Late allocation and early release ofphysical registers IEEE Trans Comput 53(10)1244ndash1259 (2004)

24 Ros A Carlson TE Alipour M Kaxiras SNon-speculative load-load reordering in TSO InISCA pp 187ndash200 (2017)

25 Smith JE Pleszkun AR Implementing preciseinterrupts in pipelined processors IEEE Transac-tions on Computers 37(5) 562ndash573 (1988)

26 Sohi GS Vajapeyam S Instruction issue logicfor high-performance interruptable pipelined pro-cessors In ISCA pp 27ndash34 (1987)

  • Introduction
  • Out-of-Order Commit
  • Methodology
  • Out-of-Order Commit Evaluation
  • Memory Parallelism
  • Early Release of Physical Registers vs Out-of-order Commit
  • Related Work
  • Conclusion
Page 12: Maximizing Limited Resources: A Limit-based Study and ...tcarlson/pdfs/alipour2018mlralsatoo… · Maximizing Limited Resources: A Limit-based Study and Taxonomy of Out-of-order Commit

12 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

200 400 600 800 1000DRAM latency (cycle)

020406080

100120140160

Perfo

rman

ceim

prov

emen

t (

)Safe_OOC Unsafe_OOC

Fig 9 A comparison between safe and unsafe out-of-order commit across DRAM latencies for SLM andAOOC on SPEC CPU2006 OOC improvement in-creases with a higher DRAM latency Unsafe_OOCoutperforms Safe_OOC This study uses caches thatare four times smaller compared to Table 1 to put ad-ditional stress on the DRAM This has been done forboth in-order and out-of-order commit configurations

performance improvements at relatively low cost espe-cially if recovery mechanisms in software are used SThas the least effect among out-of-order commit condi-tions when it is preserved in isolation from other condi-tions This is valid between all three microarchitectures

47 Memory Latency Evaluation

Current DRAM cells are optimized for cost and notfor access latency [20] The potential to use more af-fordable but higher latency memory could have a largeimpact on allocation of memory in datacenter serversas high density DRAM modules cost much more thanthose that are 4times smaller (175times per GB [4]) To eval-uate the potential of out-of-order commit to handlehigher DRAM latency we evaluate memory latency from200 cycles to 1000 cycles in 200 cycle increments Inaddition we reduce the size of all caches by 4 timeswhen compared to Table 1 to put additional stress onthe DRAM subsystem Here we see an increasing per-formance improvement for the Unsafe_OOC conditionwhile Safe_OOC increases linearly compared to the in-order commit core For memory-intensive applicationsan efficient implementation of Unsafe_OOC could al-low the use of denser higher-latency DRAM that couldpotentially cost much less

48 Prefetching Evaluation

Prefetching is essential for performance in modern sys-tems and interacts tightly with the pipeline and out-of-order commit We do not consider all prefetchingoptions but instead focus on the potential benefitsTherefore to evaluate this interaction we configure theL1-D cache in gem5 with a stride prefetcher of degree8 Figure 10 compares the relative performance gains

IOC+pref

AOOCAOOC+pref

(b) SLM

0

20

40

60

80

Perfo

rman

ceim

prov

emen

t (

)

IOC+pref

AOOCAOOC+pref

(b) NHM

0

20

40

60

80

IOC+pref

AOOCAOOC+pref

(c) HSW

0

20

40

60

80

Fig 10 The effect of aggressive out-of-order commit onthe performance with and without prefetchers in theL1 data cache across SPEC CPU2006 Improvementis relative to the baseline IOC architectures withoutprefetchers

of prefetchers for in-order commit out-of-order com-mit and prefetchers for out-of-order commit Acrossall three architectures both out-of-order commit andprefetching on their own provide roughly 40 improve-ment in performance with out-of-order commit beingslightly better However the combination of out-of-ordercommit and prefetching delivers nearly 70 better per-formance nearly the sum of the two independent contri-butions In fact combining aggressive out-of-order com-mit with prefetchers allows us to reach 84 of the ideal(sum of both) improvement for SLM 77 for NHMand 83 for HSW showing that these techniques workwell together

49 Rollback Costs for Unsafe Branches

While this work does not evaluate hardware costs weare able to evaluate the number of rollbacks caused bycommitting past a mispredicted branch for each of theevaluated configurations For this evaluation we use acommit depth of 8 with aggressive out-of-order commit(AOOC) and Unsafe_BR (only the branch conditionis not respected for out-of-order commit) In Table 3we list the number of executed branches mispredictedbranches the number of rollbacks caused by specula-tively committing a mispredicted branch and the num-ber of instructions that were rolled back due to this roll-back While the number of executed branches increaseswith the aggressiveness of the core (because these corescan more aggressively speculate past branches) the av-erage number of rollbacks (and instructions rolled back)decreases slightly with the aggressiveness of the coreThese aggressive cores resolve speculative state quicklyreducing the number of instructions that need to becommitted out-of-order This results in a similar num-ber of rollbacks per thousand architecturally committedinstructions around 3 for all configurations

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 13

0minus20minus40minus60minus80Co

ntrib

ution in

degrad

ation (

) a) SLM-AOOC across SpecCPU

Safe_LD Safe_ST Safe_BR Safe_EXC

0minus20minus40minus60minus80

b) AOOC SpecCPU average

astarbwaves

bzip cactusadm

calculixdealii

gamessgcc gemsfdtd

gobmkgromacs

h264refhmmer

lbm leslie3dlibquantum

mcf milc namdomnetpp

perl povraysjeng

soplexsphinx3

tontowrf xalan

zeusmp

0minus20minus40minus60minus80

b) SLM-ROOC across SpecCPU

SLM-avgNHM-avg

HSW-avg

0minus20minus40minus60minus80

d) ROOC SpecCPU average

Fig 11 Maximum potential performance improvement is provided by Unsafe_OOC in which all conditions areunsafe By preserving all conditions maximum performance is reduced This figure shows the negative effect ofpreserving (or making safe) a single condition compared to the maximum potential performance improvementSafe_XX is equivalent to disabling all out-of-order commit conditions (all unsafe highest performance) where XXindicates the only safe and preserved condition

Table 3 Out-of-Order Commit Costs for UnsafeBranches

Metric-PKI SLM NHM HSWAvg Max Avg Max Avg Max

Branches 1462 3733 1533 4608 1568 4957Mispred br 78 509 81 545 82 560Num rollbacks 32 243 31 256 30 265Instrs rollback 85 529 66 430 54 343

5 Memory Parallelism

Overlapping cache misses to service them in parallelin particular long-latency accesses to DRAM but alsolower-latency accesses to the LLC can result deliversignificant performance benefits [8] This memory par-allelism is typically achieved through the use of multi-ple Miss Status Holding Registers (MSHRs) [19] whichtrack outstanding memory requests and allow them toexecute in parallel In this section we compare in-ordercommit and out-of-order commit in terms of memoryparallelism (both to DRAM (MLP) and within the cachehierarchy (MHP) [7]) by changing the number of L1MSHRs and observing the effect on performance Toexplore these effects we select three applications thatare highly memory-bound [18] (mcf) medium memory-bound (lbm) and largely not memory-bound (gcc) inFigure 12 One key observation is that out-of-order com-mit in both reluctant and aggressive modes is muchbetter than in-order commit in exposing intrinsic ap-plication memory parallelism Figure 12 shows that thegap between in-order and out-of-order commit is muchlarger in the case of HSW which means that the moreaggressive is the microarchitectures the more MLP iscovered if we apply out-of-order instruction commit Incase of lbm comparing the SLM NHM and HSW weobserved that the rate of performance improvement inthe range of (1 lt= MSHRSize lt= 5) is higher than

the rate in the range of (5 lt MSHRSize lt= 9) Inthe range of (MSHRSize gt 9)) HSW and NHM havecontinued improvement This is most likely the resultof an additional loop iteration containing DRAM ac-cesses that is being covered by out-of-order instructioncommit

Figure 12 also allows us to explore the impact ofmemory-boundedness by comparing across the threeapplications (mcf is most memory bound gcc least) Al-thoughmcf (Figure 12 (d) (e) and (f)) is more memory-bound than lbm (Figure 12 (g) (h) and (i)) it exposesless memory parallelism when the number of MSHRsis increased (the gap between in-order commit and safeaggressive out-of-order commit is larger in case of lbmcompared to mcf ) This is because mcf has more iso-lated cache misses that are misses that cannot be ser-viced at the same time to provide memory level par-allelism Also mcf has branch instructions based onmemory accesses (dependent loads) that miss in thecache hierarchy thereby reducing the number of in-structions that can potentially be committed out-of-order lbm has medium level of memory-boundedness[18] and therefore exposes much more memory paral-lelism as the number of MSHRs is increased Its per-formance improvement is therefore much better whenthe CPU commits instructions out-of-order gcc bench-mark Figure 12 (j) (k) and (l) is one of the leastmemory-bound benchmarks As a result increasing thenumber of MSHRs does not improve its performanceeven in the case of unsafe out-of-order commit Over-all out-of-order commit outperforms in-order commitby exposing additional memory parallelism (both toDRAM and in the hierarchy) In summary in case ofMLP out-of-order commit provides more benefit forbigger architecture and more memory-bound applica-tions

14 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

1 2 3 4 5 6 7 8 9 10123456789

Norm

alize

d Pe

rform

ance

improv

emen

t ()

(a) SPEC2006 SLM

1 2 3 4 5 6 7 8 9 10123456789

(b) SPEC2006 NHM

IOC Safe-AOOC Unsafe-AOOC

1 2 3 4 5 6 7 8 9 10123456789

(c) SPEC2006 HSW

1 2 3 4 5 6 7 8 9 10123456789

(d) mcf SLM

1 2 3 4 5 6 7 8 9 10123456789

(e) mcf NHM

1 2 3 4 5 6 7 8 9 10123456789

(f) mcf HSW

1 2 3 4 5 6 7 8 9 10123456789

(g) lbm SLM

1 2 3 4 5 6 7 8 9 10123456789

(h) lbm NHM

1 2 3 4 5 6 7 8 9 10123456789

(i) lbm HSW

1 2 3 4 5 6 7 8 9 10123456789

(j) gcc SLM

1 2 3 4 5 6 7 8 9 10MSHR size

123456789

(k) gcc NHM

1 2 3 4 5 6 7 8 9 10123456789

(l) gcc HSW

Fig 12 Memory hierarchy parallelism comparison between in-order commit and safe and unsafe out-of-ordercommit in MSHRs size of 1 to 10 The results have been normalized to in-order commit with MSHR size of oneSubgraph (a) (b) and (c) are based on harmonic mean across SPEC CPU2006 The mcf lbm and gcc benchmarksare respectively representative of categories with high medium and low memory boundedness

6 Early Release of Physical Registers vsOut-of-order Commit

Register renaming enables the avoidance of false de-pendencies through the register file thereby improvingcore performance It does this by translating architec-tural registers to larger set of physical registers As thisrenaming is done early in the pipeline and the physi-

cal registers are not released until commit which mayhappen long after the last consumer reads the data thephysical register entries may be kept alive for longerthan what the dependencies alone require To addressthis resource constraint techniques such as delayed al-location and Rarly Release of the Physical Register

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 15

Table 4 The Contribution of exhausted (full) regis-ter file (RF) and reorder buffer (ROB) on CPU stallsOther reasons to the CPU stalls could be the instruc-tion queue (IQ) the load-store queue (LSQ) instruc-tions cache misses or not having a free functional unitto allocate to the ready-to-dispatch instructions

Microarchitecture RF ROB Othermin avg max min avg max min avg max

SLM 6 59 86 13 36 92 0 5 27NHM 2 68 91 8 30 98 0 20 7HSW 2 69 91 8 29 98 0 20 7

(ERPR) [23] have been developed In this section wecompare ERPR with the OOC taxonomy [2]

OOC and ERPR are similar as they release physicalregister as early as possible Aggressive OOC outper-forms ERPR in general because it releases ROB andLSQ entries early as well as physical registers Unfor-tunately neither ERPR nor OOC can release IQ entriesearlier than usual because OOC only address processorstate after the instructions have left the IQ and ERPRis effective before instructions are inserted tot the IQFrom another point of view OOC put additional pres-sure on the IQ and provides free entries in one or all ofthe resource of superscalar processors

61 Analysis of CPU Aggressiveness

CPU aggressiveness analysis in this context is basedon microarchitecture configurations (see Table 2) andcommit-depth The overall trend of Figure 13 showsthat AOOC outperforms ROOC and ERPR in all con-figurations This is not surprising as AOOC is alwaysenabled while ROOC is enabled only when a resourcesis exhausted (see Section 25)

ERPR is essentially a subset of AOOC that only fo-cuses on the register file (RF) Therefore the higher isthe program sensitivity to the RF capacity the higher isthe effect of ERPR To better understand this behaviorwe analyzed the reasons for CPU stalls across differentmicroarchitectures (Table 4) The data show that in-creasing the effective size of RF (eg what ERPR doesconceptually) is less effective in SLM compare to NHMand HSW barbecue is SLM the CPU stalls are moreoften due to other resources such as the ROB size Asa result since in SLM architecture the size of RF isless effective in CPU stalls ROOC outperforms ERPRin terms of performance improvement (ROOC can vir-tually extend the effective size of ROB LSQ as wellas RF although ERPR only does this on RF) RF inNHM and HSW architectures is more effective on CPUstall than in SLM architecture therefore ERPR out-

performs ROOC regarding performance improvementin these two architecture

By focusing only on ERPR we observe that thewider the core the more effective is the is increasingcommit-depth (wider commit-depth covers more readyinstructions to commit) For example in the case ofSLM although widening the commit-depth releases ad-ditional physical registers earlier than general thesefreed registers are not effective anymore because theCPU stalls are due to limits in other resources (likeROB IQ or LSQ) As NHM and HSW have more re-sources increasing the commit-depth is more effectivein those architectures In the case of AOOC increas-ing the commit-depth is more effective than ROOCand ERPR because it enables additional instructionsto be committed out-of-order (because it is active onevery cycle compare to ROOC and also affects all ofthe resources compare to ERPR that only affects theRF) In addition we can see that the more aggressivethe CPU the more effective is increasing the commit-depth This is because wider cores execute and com-plete instructions that are far from the head of ROBearlier with their additional resources(either providedby the baseline or released by the OOOC) and theycan therefore provide more instructions to commit ifthe commit-depth is increased

62 Analysis of Out-of-Order Conditions

Figure 13 shows the potential for performance improve-ment from relaxing the out-of-order commit conditionsacross the three architectures for both safe and unsafeout-of-order-commit As has been seen previously [2 521] the least aggressive out-of-order commit configu-ration (safe_OOC with a commit depth of 4) is mostbeneficial for the least aggressive out-of-order processorSLM (compare the left-most bars of Figure 13(a-c))This is expected as the SLM processor has the least ex-ecution hardware and therefore suffers from more stallsthan the more aggressive processors

Increasing the commit depth from 4 to 8 (secondgroup of bars in Figure 13(a-c)) shows that the more ag-gressive out-of-order processors (NHM and HSW) nowbenefit more from out-of-order commit than the sim-pler SLM architecture This new result shows the inter-action between the aggressiveness of the out-of-orderexecution and out-of-order commit for more aggres-sive out-of-order execution there is more benefit frommore aggressive out-of-order commit The reason is thataggressive architectures appear to have more unusedresources available for computation This trend con-tinues through the more aggressive out-of-order com-mit modes as well with unsafe optimizations (Figure

16 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

4 8 16 320

20406080

100

Perfo

rman

ceim

prov

emen

t () (a) SLM-SAFE

4 8 16 32 128Commit Depth

020406080

100

(b) NHM-SAFEAOOC ROOC ERPR

4 8 16 32 128 1920

20406080

100

(c) HSW-SAFE

4 8 16 320

20406080

100

(d) SLM-UnsafeLD

4 8 16 32 1280

20406080

100

(e) NHM-UnsafeLD

4 8 16 32 128 1920

20406080

100

(f) HSW-UnsafeLD

4 8 16 320

20406080

100

(g) SLM-UnsafeBR

4 8 16 32 1280

20406080

100

(h) NHM-UnsafeBR

4 8 16 32 128 1920

20406080

100

(i) HSW-UnsafeBR

4 8 16 320

20406080

100

(j) SLM-UnsafeST

4 8 16 32 1280

20406080

100

(k) NHM-UnsafeST

4 8 16 32 128 1920

20406080

100

(l) HSW-UnsafeST

4 8 16 320

20406080

100

(m) SLM-UnsafeEXC

4 8 16 32 1280

20406080

100

(n) NHM-UnsafeEXC

4 8 16 32 128 1920

20406080

100

(o) HSW-UnsafeEXC

4 8 16 320

20406080

100

(p) SLM-Unsafe

4 8 16 32 1280

20406080

100

(q) NHM-Unsafe

4 8 16 32 128 1920

20406080

100

(r) HSW-Unsafe

Fig 13 Comparison between AOOC ROOC and ERPR in different commit-depth and microarchitecture setupin six different modes UnsafeXX is equivalent to (enforcing) all out-of-order commit conditions except conditionXX By relaxing the specific XX condition the dependence between other conditions is also observed

13(d-r)) benefiting the more aggressive out-of-order ex-ecution designs (NHM and HSW) more than the sim-pler one (SLM) The reasons for this are similar tothe safe case as the more aggressive processors arefaster to achieve the first out-of-order condition in-struction completion and therefore providing greatercommit depths (eg changing the commit depth from 4to 8 in Figure 13 (a-c)) results in a larger likelihoodof finding instructions to commit out-of-order Table 5

uses this analysis to rank the relative importance ofeach out-of-order commit condition for the three ar-chitectures This information provides a guideline forwhich areas to work on to achieve the best performanceimprovement depending on the baseline out-of-order ex-ecution

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 17

Conditioncommit-depth 4 8 16 32 128 192

Unsafe_LD 1 2 2 2 NA NAUnsafe_BR 2 1 1 1 NA NAUnsafe_ST 4 4 4 4 NA NAUnsafe_EXC 3 3 3 3 NA NA

(a) SLMConditioncommit-depth 4 8 16 32 128 192

Unsafe_LD 1 2 2 2 2 NAUnsafe_BR 3 1 1 1 1 NAUnsafe_ST 4 4 4 4 4 NAUnsafe_EXC 2 3 3 3 3 NA

(b) NHMConditioncommit-depth 4 8 16 32 128 192

Unsafe_LD 4 2 2 2 2 2Unsafe_BR 1 1 1 1 1 1Unsafe_ST 3 3 4 4 4 4Unsafe_EXC 2 4 3 3 3 3

(c) HSW

Table 5 Ranking of the benefits of out-of-order commitconditions in different microarchitecture based on thedepth of commit

7 Related Work

The goal of this work is to provide a detailed under-standing into the potential for performance benefitsacross different out-of-order commit conditions and lev-els of aggressiveness While this work aims to describethe maximum potential benefit for each individual con-dition there have been many previous works that de-scribe hardware solutions for early release of hardwarestructures and out-of-order commit strategiesBelowwe provide an overview of these works and how theyfit into the categories described in this work

71 Speculative Release of Hardware StructuresA number of implementations require register and pro-cessor state checkpointing support to speculatively re-tire or release hardware structures [12 11 22 1 17]

Processor state checkpointing especially with theadvent of very large SIMD registers such as AVX-512registers which now support up to 32 registers with upto 512 bits per register can require a significant amountof state to be saved when speculation is aggressive

72 Non-speculative Structure Release

Non-speculative early release of hardware structures be-fore commit requires knowledge that no older instruc-tion (that has come earlier in the instruction stream)can cause the program to abort raise an exception orrequire exposure of the architected state at that timeNon-speculative solutions have the potential to be themost energy efficient a necessity in an era of the end ofDennard Scaling for power-limited platforms A range

of solutions from hardware-only to software-assistedsolutions are described below

Compiler support Two previous works [1 13]allow for early commit or reclaim of resources basedon compiler knowledge but require software recompi-lation

Checkpoint-free A number of solutions do notuse checkpoints [5 21 13] and therefore do not requirespeculation (they respect the all commit conditions) orused software help to extend knowledge to the hard-ware

Selective early commit Early commit of loads [1415] allows for loads that have not yet received data fromthe memory hierarchy to become part of the commit-ted state of a processor This allows the processor tocontinue to process instructions past normally blockedstructures improving performance for memory-boundworkloads To accomplish this the authors [15] de-couple page faults that can occur from the fetching ofdata While recompilation is not strictly required forthis technique the authors evaluate their work using acompilation strategy to expose loads early Late alloca-tion and early release of physical registers is proposedas a register renaming scheme in [23] They explored theeffect of late allocation and early release of physical reg-ister on the performance of an 8-way superscalar pro-cessor without the need for additional speculation andmaintains precise exceptions The out-of-order commitmethods evaluated in this work overlap with the earlyrelease of registers while also evaluating the early re-lease of other structures (like the LSQ)

Evading out-of-order commit conditions[24] provides a coherency protocol level solution suchthat the loadrarrload reordering can be evaded non spec-ulatively under TSO In their design it is no longer nec-essary to squash and re-execute speculatively reorderedloads if their reordering can be hidden by the coherencyprotocol This allows speculatively reordered loads thatotherwise adhere to the rest of the Bell-Lipasti condi-tions to be committed out-of-order without affectingthe memory model

8 Conclusion

To obtain higher performance extending the reach ofthe processor core has been a focus in much of microar-chitecture research One promising direction is the useof out-of-order commit which releases precious proces-sor resources early to allow the processor to extend itsreach past typical hardware limits In this work wepresent a limit study for out-of-order commit throughthe introduction of reluctant and aggressive out-of-order

18 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

commit modes We show how smaller processors evenwith a limited commit scan depth can benefit fromout-of-order commit strategies but that larger aggres-sive cores require deeper commit scan depths to achieveimproved performance In addition we provide a de-tailed breakdown of the contributions for each out-of-order commit condition for the SPEC CPU2006 bench-mark suite and compare against similar works suchas early release of physical registers Our results showa very high potential for performance improvementabove 225x for some benchmarks and believe that out-of-order commit strategies can play an important rolefor future energy-efficient and high-performance proces-sor designs

References

1 Afram F Zeng H Ghose K A group-commitmechanism for ROB-based processors implement-ing the x86 ISA In HPCA pp 47ndash58 (2013)

2 Alipour M Carlson TE Kaxiras S Exploringthe performance limits of out-of-order commit InCF pp 211ndash220 (2017)

3 Allan A Edenfeld D Joyner Jr WH KahngAB Rodgers M Zorian Y 2001 technologyroadmap for semiconductors Computer 35(1) 42ndash53 (2002)

4 Badalone R Dramrsquos surprising role in the costof data centers httpwwwdatacenterknowledgecomarchives20151112dont (2015)

5 Bell GB Lipasti MH Deconstructing commitIn ISPASS pp 68ndash77 (2004)

6 Binkert N Beckmann B Black G ReinhardtSK Saidi A Basu A Hestness J Hower DRKrishna T Sardashti S Sen R Sewell KShoaib M Vaish N Hill MD Wood DA Thegem5 simulator SIGARCH Comput Archit News39(2) 1ndash7 (2011)

7 Carlson TE Heirman W Allam O Kaxiras SEeckhout L The load slice core microarchitectureIn ISCA pp 272ndash284 (2015)

8 Chou Y Fahs B Abraham S Microarchitec-ture optimizations for exploiting memory-level par-allelism In ISCA pp 76ndash88 (2004)

9 Corporation I Intel Rcopy 64 and ia-32 ar-chitectures optimization reference manualhttpwwwintelcomcontentwwwusenarchitecture-and-technology64-ia-32-architectures-optimization-manualhtml(2016)

10 Corporation I Intel Rcopy intelrsquos rsquotick-tockrsquo seeminglydead becomes rsquoprocess-architecture-optimizationrsquohttpwwwanandtechcomshow10183

intels-tick-tock-seemingly-dead-becomes-process-architecture-optimization (2016)

11 Cristal A Ortega D Llosa J Valero M Out-of-order commit processors In HPCA pp 48ndash59(2004)

12 Cristal A Santana OJ Valero M MartiacutenezJF Toward kilo-instruction processors ACMTrans Archit Code Optim 1(4) 389ndash417 (2004)

13 Duong N Veidenbaum AV Compiler-assistedselective out-of-order commit IEEE Computer Ar-chitecture Letters 12(1) 21ndash24 (2013)

14 Gwennap L Digital leads the pack with 21164 InMicroprocessor Report 8(12) pp 249ndash260 (1994)

15 Ham T Aragoacuten JL Martonosi M Desc De-coupled supply-compute communication manage-ment for heterogeneous architectures In MICROpp 191ndash203 (2015)

16 Henning JL SPEC CPU2006 benchmark descrip-tions SIGARCH Comput Archit News 34(4) 1ndash17 (2006)

17 Hilton A Roth A BOLT Energy-efficient out-of-order latency-tolerant execution In HPCA pp1ndash12 (2010)

18 Jaleel A Memory characterization of workloadsusing instrumentation driven simulation httpwwwglueumdeduajaleelworkload (2010)

19 Kroft D Lockup-free instruction fetchprefetchcache organization In ISCA pp 81ndash87 (1981)

20 Lee D Kim Y Seshadri V Liu J Subrama-nian L Mutlu O Tiered-latency dram A lowlatency and low cost dram architecture In HPCApp 615ndash626 (2013)

21 Marti S Borras J Rodriguez P Tena RMarin J A complexity-effective out-of-order re-tirement microarchitecture IEEE Transactions onComputers 58(12) 1626ndash1639 (2009)

22 Martinez JF Renau J Huang MC PrvulovicM Cherry Checkpointed early resource recyclingin out-of-order microprocessors In MICRO pp3ndash14 (2002)

23 Monreal T Vinals V Gonzalez J GonzalezA Valero M Late allocation and early release ofphysical registers IEEE Trans Comput 53(10)1244ndash1259 (2004)

24 Ros A Carlson TE Alipour M Kaxiras SNon-speculative load-load reordering in TSO InISCA pp 187ndash200 (2017)

25 Smith JE Pleszkun AR Implementing preciseinterrupts in pipelined processors IEEE Transac-tions on Computers 37(5) 562ndash573 (1988)

26 Sohi GS Vajapeyam S Instruction issue logicfor high-performance interruptable pipelined pro-cessors In ISCA pp 27ndash34 (1987)

  • Introduction
  • Out-of-Order Commit
  • Methodology
  • Out-of-Order Commit Evaluation
  • Memory Parallelism
  • Early Release of Physical Registers vs Out-of-order Commit
  • Related Work
  • Conclusion
Page 13: Maximizing Limited Resources: A Limit-based Study and ...tcarlson/pdfs/alipour2018mlralsatoo… · Maximizing Limited Resources: A Limit-based Study and Taxonomy of Out-of-order Commit

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 13

0minus20minus40minus60minus80Co

ntrib

ution in

degrad

ation (

) a) SLM-AOOC across SpecCPU

Safe_LD Safe_ST Safe_BR Safe_EXC

0minus20minus40minus60minus80

b) AOOC SpecCPU average

astarbwaves

bzip cactusadm

calculixdealii

gamessgcc gemsfdtd

gobmkgromacs

h264refhmmer

lbm leslie3dlibquantum

mcf milc namdomnetpp

perl povraysjeng

soplexsphinx3

tontowrf xalan

zeusmp

0minus20minus40minus60minus80

b) SLM-ROOC across SpecCPU

SLM-avgNHM-avg

HSW-avg

0minus20minus40minus60minus80

d) ROOC SpecCPU average

Fig 11 Maximum potential performance improvement is provided by Unsafe_OOC in which all conditions areunsafe By preserving all conditions maximum performance is reduced This figure shows the negative effect ofpreserving (or making safe) a single condition compared to the maximum potential performance improvementSafe_XX is equivalent to disabling all out-of-order commit conditions (all unsafe highest performance) where XXindicates the only safe and preserved condition

Table 3 Out-of-Order Commit Costs for UnsafeBranches

Metric-PKI SLM NHM HSWAvg Max Avg Max Avg Max

Branches 1462 3733 1533 4608 1568 4957Mispred br 78 509 81 545 82 560Num rollbacks 32 243 31 256 30 265Instrs rollback 85 529 66 430 54 343

5 Memory Parallelism

Overlapping cache misses to service them in parallelin particular long-latency accesses to DRAM but alsolower-latency accesses to the LLC can result deliversignificant performance benefits [8] This memory par-allelism is typically achieved through the use of multi-ple Miss Status Holding Registers (MSHRs) [19] whichtrack outstanding memory requests and allow them toexecute in parallel In this section we compare in-ordercommit and out-of-order commit in terms of memoryparallelism (both to DRAM (MLP) and within the cachehierarchy (MHP) [7]) by changing the number of L1MSHRs and observing the effect on performance Toexplore these effects we select three applications thatare highly memory-bound [18] (mcf) medium memory-bound (lbm) and largely not memory-bound (gcc) inFigure 12 One key observation is that out-of-order com-mit in both reluctant and aggressive modes is muchbetter than in-order commit in exposing intrinsic ap-plication memory parallelism Figure 12 shows that thegap between in-order and out-of-order commit is muchlarger in the case of HSW which means that the moreaggressive is the microarchitectures the more MLP iscovered if we apply out-of-order instruction commit Incase of lbm comparing the SLM NHM and HSW weobserved that the rate of performance improvement inthe range of (1 lt= MSHRSize lt= 5) is higher than

the rate in the range of (5 lt MSHRSize lt= 9) Inthe range of (MSHRSize gt 9)) HSW and NHM havecontinued improvement This is most likely the resultof an additional loop iteration containing DRAM ac-cesses that is being covered by out-of-order instructioncommit

Figure 12 also allows us to explore the impact ofmemory-boundedness by comparing across the threeapplications (mcf is most memory bound gcc least) Al-thoughmcf (Figure 12 (d) (e) and (f)) is more memory-bound than lbm (Figure 12 (g) (h) and (i)) it exposesless memory parallelism when the number of MSHRsis increased (the gap between in-order commit and safeaggressive out-of-order commit is larger in case of lbmcompared to mcf ) This is because mcf has more iso-lated cache misses that are misses that cannot be ser-viced at the same time to provide memory level par-allelism Also mcf has branch instructions based onmemory accesses (dependent loads) that miss in thecache hierarchy thereby reducing the number of in-structions that can potentially be committed out-of-order lbm has medium level of memory-boundedness[18] and therefore exposes much more memory paral-lelism as the number of MSHRs is increased Its per-formance improvement is therefore much better whenthe CPU commits instructions out-of-order gcc bench-mark Figure 12 (j) (k) and (l) is one of the leastmemory-bound benchmarks As a result increasing thenumber of MSHRs does not improve its performanceeven in the case of unsafe out-of-order commit Over-all out-of-order commit outperforms in-order commitby exposing additional memory parallelism (both toDRAM and in the hierarchy) In summary in case ofMLP out-of-order commit provides more benefit forbigger architecture and more memory-bound applica-tions

14 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

1 2 3 4 5 6 7 8 9 10123456789

Norm

alize

d Pe

rform

ance

improv

emen

t ()

(a) SPEC2006 SLM

1 2 3 4 5 6 7 8 9 10123456789

(b) SPEC2006 NHM

IOC Safe-AOOC Unsafe-AOOC

1 2 3 4 5 6 7 8 9 10123456789

(c) SPEC2006 HSW

1 2 3 4 5 6 7 8 9 10123456789

(d) mcf SLM

1 2 3 4 5 6 7 8 9 10123456789

(e) mcf NHM

1 2 3 4 5 6 7 8 9 10123456789

(f) mcf HSW

1 2 3 4 5 6 7 8 9 10123456789

(g) lbm SLM

1 2 3 4 5 6 7 8 9 10123456789

(h) lbm NHM

1 2 3 4 5 6 7 8 9 10123456789

(i) lbm HSW

1 2 3 4 5 6 7 8 9 10123456789

(j) gcc SLM

1 2 3 4 5 6 7 8 9 10MSHR size

123456789

(k) gcc NHM

1 2 3 4 5 6 7 8 9 10123456789

(l) gcc HSW

Fig 12 Memory hierarchy parallelism comparison between in-order commit and safe and unsafe out-of-ordercommit in MSHRs size of 1 to 10 The results have been normalized to in-order commit with MSHR size of oneSubgraph (a) (b) and (c) are based on harmonic mean across SPEC CPU2006 The mcf lbm and gcc benchmarksare respectively representative of categories with high medium and low memory boundedness

6 Early Release of Physical Registers vsOut-of-order Commit

Register renaming enables the avoidance of false de-pendencies through the register file thereby improvingcore performance It does this by translating architec-tural registers to larger set of physical registers As thisrenaming is done early in the pipeline and the physi-

cal registers are not released until commit which mayhappen long after the last consumer reads the data thephysical register entries may be kept alive for longerthan what the dependencies alone require To addressthis resource constraint techniques such as delayed al-location and Rarly Release of the Physical Register

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 15

Table 4 The Contribution of exhausted (full) regis-ter file (RF) and reorder buffer (ROB) on CPU stallsOther reasons to the CPU stalls could be the instruc-tion queue (IQ) the load-store queue (LSQ) instruc-tions cache misses or not having a free functional unitto allocate to the ready-to-dispatch instructions

Microarchitecture RF ROB Othermin avg max min avg max min avg max

SLM 6 59 86 13 36 92 0 5 27NHM 2 68 91 8 30 98 0 20 7HSW 2 69 91 8 29 98 0 20 7

(ERPR) [23] have been developed In this section wecompare ERPR with the OOC taxonomy [2]

OOC and ERPR are similar as they release physicalregister as early as possible Aggressive OOC outper-forms ERPR in general because it releases ROB andLSQ entries early as well as physical registers Unfor-tunately neither ERPR nor OOC can release IQ entriesearlier than usual because OOC only address processorstate after the instructions have left the IQ and ERPRis effective before instructions are inserted tot the IQFrom another point of view OOC put additional pres-sure on the IQ and provides free entries in one or all ofthe resource of superscalar processors

61 Analysis of CPU Aggressiveness

CPU aggressiveness analysis in this context is basedon microarchitecture configurations (see Table 2) andcommit-depth The overall trend of Figure 13 showsthat AOOC outperforms ROOC and ERPR in all con-figurations This is not surprising as AOOC is alwaysenabled while ROOC is enabled only when a resourcesis exhausted (see Section 25)

ERPR is essentially a subset of AOOC that only fo-cuses on the register file (RF) Therefore the higher isthe program sensitivity to the RF capacity the higher isthe effect of ERPR To better understand this behaviorwe analyzed the reasons for CPU stalls across differentmicroarchitectures (Table 4) The data show that in-creasing the effective size of RF (eg what ERPR doesconceptually) is less effective in SLM compare to NHMand HSW barbecue is SLM the CPU stalls are moreoften due to other resources such as the ROB size Asa result since in SLM architecture the size of RF isless effective in CPU stalls ROOC outperforms ERPRin terms of performance improvement (ROOC can vir-tually extend the effective size of ROB LSQ as wellas RF although ERPR only does this on RF) RF inNHM and HSW architectures is more effective on CPUstall than in SLM architecture therefore ERPR out-

performs ROOC regarding performance improvementin these two architecture

By focusing only on ERPR we observe that thewider the core the more effective is the is increasingcommit-depth (wider commit-depth covers more readyinstructions to commit) For example in the case ofSLM although widening the commit-depth releases ad-ditional physical registers earlier than general thesefreed registers are not effective anymore because theCPU stalls are due to limits in other resources (likeROB IQ or LSQ) As NHM and HSW have more re-sources increasing the commit-depth is more effectivein those architectures In the case of AOOC increas-ing the commit-depth is more effective than ROOCand ERPR because it enables additional instructionsto be committed out-of-order (because it is active onevery cycle compare to ROOC and also affects all ofthe resources compare to ERPR that only affects theRF) In addition we can see that the more aggressivethe CPU the more effective is increasing the commit-depth This is because wider cores execute and com-plete instructions that are far from the head of ROBearlier with their additional resources(either providedby the baseline or released by the OOOC) and theycan therefore provide more instructions to commit ifthe commit-depth is increased

62 Analysis of Out-of-Order Conditions

Figure 13 shows the potential for performance improve-ment from relaxing the out-of-order commit conditionsacross the three architectures for both safe and unsafeout-of-order-commit As has been seen previously [2 521] the least aggressive out-of-order commit configu-ration (safe_OOC with a commit depth of 4) is mostbeneficial for the least aggressive out-of-order processorSLM (compare the left-most bars of Figure 13(a-c))This is expected as the SLM processor has the least ex-ecution hardware and therefore suffers from more stallsthan the more aggressive processors

Increasing the commit depth from 4 to 8 (secondgroup of bars in Figure 13(a-c)) shows that the more ag-gressive out-of-order processors (NHM and HSW) nowbenefit more from out-of-order commit than the sim-pler SLM architecture This new result shows the inter-action between the aggressiveness of the out-of-orderexecution and out-of-order commit for more aggres-sive out-of-order execution there is more benefit frommore aggressive out-of-order commit The reason is thataggressive architectures appear to have more unusedresources available for computation This trend con-tinues through the more aggressive out-of-order com-mit modes as well with unsafe optimizations (Figure

16 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

4 8 16 320

20406080

100

Perfo

rman

ceim

prov

emen

t () (a) SLM-SAFE

4 8 16 32 128Commit Depth

020406080

100

(b) NHM-SAFEAOOC ROOC ERPR

4 8 16 32 128 1920

20406080

100

(c) HSW-SAFE

4 8 16 320

20406080

100

(d) SLM-UnsafeLD

4 8 16 32 1280

20406080

100

(e) NHM-UnsafeLD

4 8 16 32 128 1920

20406080

100

(f) HSW-UnsafeLD

4 8 16 320

20406080

100

(g) SLM-UnsafeBR

4 8 16 32 1280

20406080

100

(h) NHM-UnsafeBR

4 8 16 32 128 1920

20406080

100

(i) HSW-UnsafeBR

4 8 16 320

20406080

100

(j) SLM-UnsafeST

4 8 16 32 1280

20406080

100

(k) NHM-UnsafeST

4 8 16 32 128 1920

20406080

100

(l) HSW-UnsafeST

4 8 16 320

20406080

100

(m) SLM-UnsafeEXC

4 8 16 32 1280

20406080

100

(n) NHM-UnsafeEXC

4 8 16 32 128 1920

20406080

100

(o) HSW-UnsafeEXC

4 8 16 320

20406080

100

(p) SLM-Unsafe

4 8 16 32 1280

20406080

100

(q) NHM-Unsafe

4 8 16 32 128 1920

20406080

100

(r) HSW-Unsafe

Fig 13 Comparison between AOOC ROOC and ERPR in different commit-depth and microarchitecture setupin six different modes UnsafeXX is equivalent to (enforcing) all out-of-order commit conditions except conditionXX By relaxing the specific XX condition the dependence between other conditions is also observed

13(d-r)) benefiting the more aggressive out-of-order ex-ecution designs (NHM and HSW) more than the sim-pler one (SLM) The reasons for this are similar tothe safe case as the more aggressive processors arefaster to achieve the first out-of-order condition in-struction completion and therefore providing greatercommit depths (eg changing the commit depth from 4to 8 in Figure 13 (a-c)) results in a larger likelihoodof finding instructions to commit out-of-order Table 5

uses this analysis to rank the relative importance ofeach out-of-order commit condition for the three ar-chitectures This information provides a guideline forwhich areas to work on to achieve the best performanceimprovement depending on the baseline out-of-order ex-ecution

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 17

Conditioncommit-depth 4 8 16 32 128 192

Unsafe_LD 1 2 2 2 NA NAUnsafe_BR 2 1 1 1 NA NAUnsafe_ST 4 4 4 4 NA NAUnsafe_EXC 3 3 3 3 NA NA

(a) SLMConditioncommit-depth 4 8 16 32 128 192

Unsafe_LD 1 2 2 2 2 NAUnsafe_BR 3 1 1 1 1 NAUnsafe_ST 4 4 4 4 4 NAUnsafe_EXC 2 3 3 3 3 NA

(b) NHMConditioncommit-depth 4 8 16 32 128 192

Unsafe_LD 4 2 2 2 2 2Unsafe_BR 1 1 1 1 1 1Unsafe_ST 3 3 4 4 4 4Unsafe_EXC 2 4 3 3 3 3

(c) HSW

Table 5 Ranking of the benefits of out-of-order commitconditions in different microarchitecture based on thedepth of commit

7 Related Work

The goal of this work is to provide a detailed under-standing into the potential for performance benefitsacross different out-of-order commit conditions and lev-els of aggressiveness While this work aims to describethe maximum potential benefit for each individual con-dition there have been many previous works that de-scribe hardware solutions for early release of hardwarestructures and out-of-order commit strategiesBelowwe provide an overview of these works and how theyfit into the categories described in this work

71 Speculative Release of Hardware StructuresA number of implementations require register and pro-cessor state checkpointing support to speculatively re-tire or release hardware structures [12 11 22 1 17]

Processor state checkpointing especially with theadvent of very large SIMD registers such as AVX-512registers which now support up to 32 registers with upto 512 bits per register can require a significant amountof state to be saved when speculation is aggressive

72 Non-speculative Structure Release

Non-speculative early release of hardware structures be-fore commit requires knowledge that no older instruc-tion (that has come earlier in the instruction stream)can cause the program to abort raise an exception orrequire exposure of the architected state at that timeNon-speculative solutions have the potential to be themost energy efficient a necessity in an era of the end ofDennard Scaling for power-limited platforms A range

of solutions from hardware-only to software-assistedsolutions are described below

Compiler support Two previous works [1 13]allow for early commit or reclaim of resources basedon compiler knowledge but require software recompi-lation

Checkpoint-free A number of solutions do notuse checkpoints [5 21 13] and therefore do not requirespeculation (they respect the all commit conditions) orused software help to extend knowledge to the hard-ware

Selective early commit Early commit of loads [1415] allows for loads that have not yet received data fromthe memory hierarchy to become part of the commit-ted state of a processor This allows the processor tocontinue to process instructions past normally blockedstructures improving performance for memory-boundworkloads To accomplish this the authors [15] de-couple page faults that can occur from the fetching ofdata While recompilation is not strictly required forthis technique the authors evaluate their work using acompilation strategy to expose loads early Late alloca-tion and early release of physical registers is proposedas a register renaming scheme in [23] They explored theeffect of late allocation and early release of physical reg-ister on the performance of an 8-way superscalar pro-cessor without the need for additional speculation andmaintains precise exceptions The out-of-order commitmethods evaluated in this work overlap with the earlyrelease of registers while also evaluating the early re-lease of other structures (like the LSQ)

Evading out-of-order commit conditions[24] provides a coherency protocol level solution suchthat the loadrarrload reordering can be evaded non spec-ulatively under TSO In their design it is no longer nec-essary to squash and re-execute speculatively reorderedloads if their reordering can be hidden by the coherencyprotocol This allows speculatively reordered loads thatotherwise adhere to the rest of the Bell-Lipasti condi-tions to be committed out-of-order without affectingthe memory model

8 Conclusion

To obtain higher performance extending the reach ofthe processor core has been a focus in much of microar-chitecture research One promising direction is the useof out-of-order commit which releases precious proces-sor resources early to allow the processor to extend itsreach past typical hardware limits In this work wepresent a limit study for out-of-order commit throughthe introduction of reluctant and aggressive out-of-order

18 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

commit modes We show how smaller processors evenwith a limited commit scan depth can benefit fromout-of-order commit strategies but that larger aggres-sive cores require deeper commit scan depths to achieveimproved performance In addition we provide a de-tailed breakdown of the contributions for each out-of-order commit condition for the SPEC CPU2006 bench-mark suite and compare against similar works suchas early release of physical registers Our results showa very high potential for performance improvementabove 225x for some benchmarks and believe that out-of-order commit strategies can play an important rolefor future energy-efficient and high-performance proces-sor designs

References

1 Afram F Zeng H Ghose K A group-commitmechanism for ROB-based processors implement-ing the x86 ISA In HPCA pp 47ndash58 (2013)

2 Alipour M Carlson TE Kaxiras S Exploringthe performance limits of out-of-order commit InCF pp 211ndash220 (2017)

3 Allan A Edenfeld D Joyner Jr WH KahngAB Rodgers M Zorian Y 2001 technologyroadmap for semiconductors Computer 35(1) 42ndash53 (2002)

4 Badalone R Dramrsquos surprising role in the costof data centers httpwwwdatacenterknowledgecomarchives20151112dont (2015)

5 Bell GB Lipasti MH Deconstructing commitIn ISPASS pp 68ndash77 (2004)

6 Binkert N Beckmann B Black G ReinhardtSK Saidi A Basu A Hestness J Hower DRKrishna T Sardashti S Sen R Sewell KShoaib M Vaish N Hill MD Wood DA Thegem5 simulator SIGARCH Comput Archit News39(2) 1ndash7 (2011)

7 Carlson TE Heirman W Allam O Kaxiras SEeckhout L The load slice core microarchitectureIn ISCA pp 272ndash284 (2015)

8 Chou Y Fahs B Abraham S Microarchitec-ture optimizations for exploiting memory-level par-allelism In ISCA pp 76ndash88 (2004)

9 Corporation I Intel Rcopy 64 and ia-32 ar-chitectures optimization reference manualhttpwwwintelcomcontentwwwusenarchitecture-and-technology64-ia-32-architectures-optimization-manualhtml(2016)

10 Corporation I Intel Rcopy intelrsquos rsquotick-tockrsquo seeminglydead becomes rsquoprocess-architecture-optimizationrsquohttpwwwanandtechcomshow10183

intels-tick-tock-seemingly-dead-becomes-process-architecture-optimization (2016)

11 Cristal A Ortega D Llosa J Valero M Out-of-order commit processors In HPCA pp 48ndash59(2004)

12 Cristal A Santana OJ Valero M MartiacutenezJF Toward kilo-instruction processors ACMTrans Archit Code Optim 1(4) 389ndash417 (2004)

13 Duong N Veidenbaum AV Compiler-assistedselective out-of-order commit IEEE Computer Ar-chitecture Letters 12(1) 21ndash24 (2013)

14 Gwennap L Digital leads the pack with 21164 InMicroprocessor Report 8(12) pp 249ndash260 (1994)

15 Ham T Aragoacuten JL Martonosi M Desc De-coupled supply-compute communication manage-ment for heterogeneous architectures In MICROpp 191ndash203 (2015)

16 Henning JL SPEC CPU2006 benchmark descrip-tions SIGARCH Comput Archit News 34(4) 1ndash17 (2006)

17 Hilton A Roth A BOLT Energy-efficient out-of-order latency-tolerant execution In HPCA pp1ndash12 (2010)

18 Jaleel A Memory characterization of workloadsusing instrumentation driven simulation httpwwwglueumdeduajaleelworkload (2010)

19 Kroft D Lockup-free instruction fetchprefetchcache organization In ISCA pp 81ndash87 (1981)

20 Lee D Kim Y Seshadri V Liu J Subrama-nian L Mutlu O Tiered-latency dram A lowlatency and low cost dram architecture In HPCApp 615ndash626 (2013)

21 Marti S Borras J Rodriguez P Tena RMarin J A complexity-effective out-of-order re-tirement microarchitecture IEEE Transactions onComputers 58(12) 1626ndash1639 (2009)

22 Martinez JF Renau J Huang MC PrvulovicM Cherry Checkpointed early resource recyclingin out-of-order microprocessors In MICRO pp3ndash14 (2002)

23 Monreal T Vinals V Gonzalez J GonzalezA Valero M Late allocation and early release ofphysical registers IEEE Trans Comput 53(10)1244ndash1259 (2004)

24 Ros A Carlson TE Alipour M Kaxiras SNon-speculative load-load reordering in TSO InISCA pp 187ndash200 (2017)

25 Smith JE Pleszkun AR Implementing preciseinterrupts in pipelined processors IEEE Transac-tions on Computers 37(5) 562ndash573 (1988)

26 Sohi GS Vajapeyam S Instruction issue logicfor high-performance interruptable pipelined pro-cessors In ISCA pp 27ndash34 (1987)

  • Introduction
  • Out-of-Order Commit
  • Methodology
  • Out-of-Order Commit Evaluation
  • Memory Parallelism
  • Early Release of Physical Registers vs Out-of-order Commit
  • Related Work
  • Conclusion
Page 14: Maximizing Limited Resources: A Limit-based Study and ...tcarlson/pdfs/alipour2018mlralsatoo… · Maximizing Limited Resources: A Limit-based Study and Taxonomy of Out-of-order Commit

14 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

1 2 3 4 5 6 7 8 9 10123456789

Norm

alize

d Pe

rform

ance

improv

emen

t ()

(a) SPEC2006 SLM

1 2 3 4 5 6 7 8 9 10123456789

(b) SPEC2006 NHM

IOC Safe-AOOC Unsafe-AOOC

1 2 3 4 5 6 7 8 9 10123456789

(c) SPEC2006 HSW

1 2 3 4 5 6 7 8 9 10123456789

(d) mcf SLM

1 2 3 4 5 6 7 8 9 10123456789

(e) mcf NHM

1 2 3 4 5 6 7 8 9 10123456789

(f) mcf HSW

1 2 3 4 5 6 7 8 9 10123456789

(g) lbm SLM

1 2 3 4 5 6 7 8 9 10123456789

(h) lbm NHM

1 2 3 4 5 6 7 8 9 10123456789

(i) lbm HSW

1 2 3 4 5 6 7 8 9 10123456789

(j) gcc SLM

1 2 3 4 5 6 7 8 9 10MSHR size

123456789

(k) gcc NHM

1 2 3 4 5 6 7 8 9 10123456789

(l) gcc HSW

Fig 12 Memory hierarchy parallelism comparison between in-order commit and safe and unsafe out-of-ordercommit in MSHRs size of 1 to 10 The results have been normalized to in-order commit with MSHR size of oneSubgraph (a) (b) and (c) are based on harmonic mean across SPEC CPU2006 The mcf lbm and gcc benchmarksare respectively representative of categories with high medium and low memory boundedness

6 Early Release of Physical Registers vsOut-of-order Commit

Register renaming enables the avoidance of false de-pendencies through the register file thereby improvingcore performance It does this by translating architec-tural registers to larger set of physical registers As thisrenaming is done early in the pipeline and the physi-

cal registers are not released until commit which mayhappen long after the last consumer reads the data thephysical register entries may be kept alive for longerthan what the dependencies alone require To addressthis resource constraint techniques such as delayed al-location and Rarly Release of the Physical Register

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 15

Table 4 The Contribution of exhausted (full) regis-ter file (RF) and reorder buffer (ROB) on CPU stallsOther reasons to the CPU stalls could be the instruc-tion queue (IQ) the load-store queue (LSQ) instruc-tions cache misses or not having a free functional unitto allocate to the ready-to-dispatch instructions

Microarchitecture RF ROB Othermin avg max min avg max min avg max

SLM 6 59 86 13 36 92 0 5 27NHM 2 68 91 8 30 98 0 20 7HSW 2 69 91 8 29 98 0 20 7

(ERPR) [23] have been developed In this section wecompare ERPR with the OOC taxonomy [2]

OOC and ERPR are similar as they release physicalregister as early as possible Aggressive OOC outper-forms ERPR in general because it releases ROB andLSQ entries early as well as physical registers Unfor-tunately neither ERPR nor OOC can release IQ entriesearlier than usual because OOC only address processorstate after the instructions have left the IQ and ERPRis effective before instructions are inserted tot the IQFrom another point of view OOC put additional pres-sure on the IQ and provides free entries in one or all ofthe resource of superscalar processors

61 Analysis of CPU Aggressiveness

CPU aggressiveness analysis in this context is basedon microarchitecture configurations (see Table 2) andcommit-depth The overall trend of Figure 13 showsthat AOOC outperforms ROOC and ERPR in all con-figurations This is not surprising as AOOC is alwaysenabled while ROOC is enabled only when a resourcesis exhausted (see Section 25)

ERPR is essentially a subset of AOOC that only fo-cuses on the register file (RF) Therefore the higher isthe program sensitivity to the RF capacity the higher isthe effect of ERPR To better understand this behaviorwe analyzed the reasons for CPU stalls across differentmicroarchitectures (Table 4) The data show that in-creasing the effective size of RF (eg what ERPR doesconceptually) is less effective in SLM compare to NHMand HSW barbecue is SLM the CPU stalls are moreoften due to other resources such as the ROB size Asa result since in SLM architecture the size of RF isless effective in CPU stalls ROOC outperforms ERPRin terms of performance improvement (ROOC can vir-tually extend the effective size of ROB LSQ as wellas RF although ERPR only does this on RF) RF inNHM and HSW architectures is more effective on CPUstall than in SLM architecture therefore ERPR out-

performs ROOC regarding performance improvementin these two architecture

By focusing only on ERPR we observe that thewider the core the more effective is the is increasingcommit-depth (wider commit-depth covers more readyinstructions to commit) For example in the case ofSLM although widening the commit-depth releases ad-ditional physical registers earlier than general thesefreed registers are not effective anymore because theCPU stalls are due to limits in other resources (likeROB IQ or LSQ) As NHM and HSW have more re-sources increasing the commit-depth is more effectivein those architectures In the case of AOOC increas-ing the commit-depth is more effective than ROOCand ERPR because it enables additional instructionsto be committed out-of-order (because it is active onevery cycle compare to ROOC and also affects all ofthe resources compare to ERPR that only affects theRF) In addition we can see that the more aggressivethe CPU the more effective is increasing the commit-depth This is because wider cores execute and com-plete instructions that are far from the head of ROBearlier with their additional resources(either providedby the baseline or released by the OOOC) and theycan therefore provide more instructions to commit ifthe commit-depth is increased

62 Analysis of Out-of-Order Conditions

Figure 13 shows the potential for performance improve-ment from relaxing the out-of-order commit conditionsacross the three architectures for both safe and unsafeout-of-order-commit As has been seen previously [2 521] the least aggressive out-of-order commit configu-ration (safe_OOC with a commit depth of 4) is mostbeneficial for the least aggressive out-of-order processorSLM (compare the left-most bars of Figure 13(a-c))This is expected as the SLM processor has the least ex-ecution hardware and therefore suffers from more stallsthan the more aggressive processors

Increasing the commit depth from 4 to 8 (secondgroup of bars in Figure 13(a-c)) shows that the more ag-gressive out-of-order processors (NHM and HSW) nowbenefit more from out-of-order commit than the sim-pler SLM architecture This new result shows the inter-action between the aggressiveness of the out-of-orderexecution and out-of-order commit for more aggres-sive out-of-order execution there is more benefit frommore aggressive out-of-order commit The reason is thataggressive architectures appear to have more unusedresources available for computation This trend con-tinues through the more aggressive out-of-order com-mit modes as well with unsafe optimizations (Figure

16 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

4 8 16 320

20406080

100

Perfo

rman

ceim

prov

emen

t () (a) SLM-SAFE

4 8 16 32 128Commit Depth

020406080

100

(b) NHM-SAFEAOOC ROOC ERPR

4 8 16 32 128 1920

20406080

100

(c) HSW-SAFE

4 8 16 320

20406080

100

(d) SLM-UnsafeLD

4 8 16 32 1280

20406080

100

(e) NHM-UnsafeLD

4 8 16 32 128 1920

20406080

100

(f) HSW-UnsafeLD

4 8 16 320

20406080

100

(g) SLM-UnsafeBR

4 8 16 32 1280

20406080

100

(h) NHM-UnsafeBR

4 8 16 32 128 1920

20406080

100

(i) HSW-UnsafeBR

4 8 16 320

20406080

100

(j) SLM-UnsafeST

4 8 16 32 1280

20406080

100

(k) NHM-UnsafeST

4 8 16 32 128 1920

20406080

100

(l) HSW-UnsafeST

4 8 16 320

20406080

100

(m) SLM-UnsafeEXC

4 8 16 32 1280

20406080

100

(n) NHM-UnsafeEXC

4 8 16 32 128 1920

20406080

100

(o) HSW-UnsafeEXC

4 8 16 320

20406080

100

(p) SLM-Unsafe

4 8 16 32 1280

20406080

100

(q) NHM-Unsafe

4 8 16 32 128 1920

20406080

100

(r) HSW-Unsafe

Fig 13 Comparison between AOOC ROOC and ERPR in different commit-depth and microarchitecture setupin six different modes UnsafeXX is equivalent to (enforcing) all out-of-order commit conditions except conditionXX By relaxing the specific XX condition the dependence between other conditions is also observed

13(d-r)) benefiting the more aggressive out-of-order ex-ecution designs (NHM and HSW) more than the sim-pler one (SLM) The reasons for this are similar tothe safe case as the more aggressive processors arefaster to achieve the first out-of-order condition in-struction completion and therefore providing greatercommit depths (eg changing the commit depth from 4to 8 in Figure 13 (a-c)) results in a larger likelihoodof finding instructions to commit out-of-order Table 5

uses this analysis to rank the relative importance ofeach out-of-order commit condition for the three ar-chitectures This information provides a guideline forwhich areas to work on to achieve the best performanceimprovement depending on the baseline out-of-order ex-ecution

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 17

Conditioncommit-depth 4 8 16 32 128 192

Unsafe_LD 1 2 2 2 NA NAUnsafe_BR 2 1 1 1 NA NAUnsafe_ST 4 4 4 4 NA NAUnsafe_EXC 3 3 3 3 NA NA

(a) SLMConditioncommit-depth 4 8 16 32 128 192

Unsafe_LD 1 2 2 2 2 NAUnsafe_BR 3 1 1 1 1 NAUnsafe_ST 4 4 4 4 4 NAUnsafe_EXC 2 3 3 3 3 NA

(b) NHMConditioncommit-depth 4 8 16 32 128 192

Unsafe_LD 4 2 2 2 2 2Unsafe_BR 1 1 1 1 1 1Unsafe_ST 3 3 4 4 4 4Unsafe_EXC 2 4 3 3 3 3

(c) HSW

Table 5 Ranking of the benefits of out-of-order commitconditions in different microarchitecture based on thedepth of commit

7 Related Work

The goal of this work is to provide a detailed under-standing into the potential for performance benefitsacross different out-of-order commit conditions and lev-els of aggressiveness While this work aims to describethe maximum potential benefit for each individual con-dition there have been many previous works that de-scribe hardware solutions for early release of hardwarestructures and out-of-order commit strategiesBelowwe provide an overview of these works and how theyfit into the categories described in this work

71 Speculative Release of Hardware StructuresA number of implementations require register and pro-cessor state checkpointing support to speculatively re-tire or release hardware structures [12 11 22 1 17]

Processor state checkpointing especially with theadvent of very large SIMD registers such as AVX-512registers which now support up to 32 registers with upto 512 bits per register can require a significant amountof state to be saved when speculation is aggressive

72 Non-speculative Structure Release

Non-speculative early release of hardware structures be-fore commit requires knowledge that no older instruc-tion (that has come earlier in the instruction stream)can cause the program to abort raise an exception orrequire exposure of the architected state at that timeNon-speculative solutions have the potential to be themost energy efficient a necessity in an era of the end ofDennard Scaling for power-limited platforms A range

of solutions from hardware-only to software-assistedsolutions are described below

Compiler support Two previous works [1 13]allow for early commit or reclaim of resources basedon compiler knowledge but require software recompi-lation

Checkpoint-free A number of solutions do notuse checkpoints [5 21 13] and therefore do not requirespeculation (they respect the all commit conditions) orused software help to extend knowledge to the hard-ware

Selective early commit Early commit of loads [1415] allows for loads that have not yet received data fromthe memory hierarchy to become part of the commit-ted state of a processor This allows the processor tocontinue to process instructions past normally blockedstructures improving performance for memory-boundworkloads To accomplish this the authors [15] de-couple page faults that can occur from the fetching ofdata While recompilation is not strictly required forthis technique the authors evaluate their work using acompilation strategy to expose loads early Late alloca-tion and early release of physical registers is proposedas a register renaming scheme in [23] They explored theeffect of late allocation and early release of physical reg-ister on the performance of an 8-way superscalar pro-cessor without the need for additional speculation andmaintains precise exceptions The out-of-order commitmethods evaluated in this work overlap with the earlyrelease of registers while also evaluating the early re-lease of other structures (like the LSQ)

Evading out-of-order commit conditions[24] provides a coherency protocol level solution suchthat the loadrarrload reordering can be evaded non spec-ulatively under TSO In their design it is no longer nec-essary to squash and re-execute speculatively reorderedloads if their reordering can be hidden by the coherencyprotocol This allows speculatively reordered loads thatotherwise adhere to the rest of the Bell-Lipasti condi-tions to be committed out-of-order without affectingthe memory model

8 Conclusion

To obtain higher performance extending the reach ofthe processor core has been a focus in much of microar-chitecture research One promising direction is the useof out-of-order commit which releases precious proces-sor resources early to allow the processor to extend itsreach past typical hardware limits In this work wepresent a limit study for out-of-order commit throughthe introduction of reluctant and aggressive out-of-order

18 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

commit modes We show how smaller processors evenwith a limited commit scan depth can benefit fromout-of-order commit strategies but that larger aggres-sive cores require deeper commit scan depths to achieveimproved performance In addition we provide a de-tailed breakdown of the contributions for each out-of-order commit condition for the SPEC CPU2006 bench-mark suite and compare against similar works suchas early release of physical registers Our results showa very high potential for performance improvementabove 225x for some benchmarks and believe that out-of-order commit strategies can play an important rolefor future energy-efficient and high-performance proces-sor designs

References

1 Afram F Zeng H Ghose K A group-commitmechanism for ROB-based processors implement-ing the x86 ISA In HPCA pp 47ndash58 (2013)

2 Alipour M Carlson TE Kaxiras S Exploringthe performance limits of out-of-order commit InCF pp 211ndash220 (2017)

3 Allan A Edenfeld D Joyner Jr WH KahngAB Rodgers M Zorian Y 2001 technologyroadmap for semiconductors Computer 35(1) 42ndash53 (2002)

4 Badalone R Dramrsquos surprising role in the costof data centers httpwwwdatacenterknowledgecomarchives20151112dont (2015)

5 Bell GB Lipasti MH Deconstructing commitIn ISPASS pp 68ndash77 (2004)

6 Binkert N Beckmann B Black G ReinhardtSK Saidi A Basu A Hestness J Hower DRKrishna T Sardashti S Sen R Sewell KShoaib M Vaish N Hill MD Wood DA Thegem5 simulator SIGARCH Comput Archit News39(2) 1ndash7 (2011)

7 Carlson TE Heirman W Allam O Kaxiras SEeckhout L The load slice core microarchitectureIn ISCA pp 272ndash284 (2015)

8 Chou Y Fahs B Abraham S Microarchitec-ture optimizations for exploiting memory-level par-allelism In ISCA pp 76ndash88 (2004)

9 Corporation I Intel Rcopy 64 and ia-32 ar-chitectures optimization reference manualhttpwwwintelcomcontentwwwusenarchitecture-and-technology64-ia-32-architectures-optimization-manualhtml(2016)

10 Corporation I Intel Rcopy intelrsquos rsquotick-tockrsquo seeminglydead becomes rsquoprocess-architecture-optimizationrsquohttpwwwanandtechcomshow10183

intels-tick-tock-seemingly-dead-becomes-process-architecture-optimization (2016)

11 Cristal A Ortega D Llosa J Valero M Out-of-order commit processors In HPCA pp 48ndash59(2004)

12 Cristal A Santana OJ Valero M MartiacutenezJF Toward kilo-instruction processors ACMTrans Archit Code Optim 1(4) 389ndash417 (2004)

13 Duong N Veidenbaum AV Compiler-assistedselective out-of-order commit IEEE Computer Ar-chitecture Letters 12(1) 21ndash24 (2013)

14 Gwennap L Digital leads the pack with 21164 InMicroprocessor Report 8(12) pp 249ndash260 (1994)

15 Ham T Aragoacuten JL Martonosi M Desc De-coupled supply-compute communication manage-ment for heterogeneous architectures In MICROpp 191ndash203 (2015)

16 Henning JL SPEC CPU2006 benchmark descrip-tions SIGARCH Comput Archit News 34(4) 1ndash17 (2006)

17 Hilton A Roth A BOLT Energy-efficient out-of-order latency-tolerant execution In HPCA pp1ndash12 (2010)

18 Jaleel A Memory characterization of workloadsusing instrumentation driven simulation httpwwwglueumdeduajaleelworkload (2010)

19 Kroft D Lockup-free instruction fetchprefetchcache organization In ISCA pp 81ndash87 (1981)

20 Lee D Kim Y Seshadri V Liu J Subrama-nian L Mutlu O Tiered-latency dram A lowlatency and low cost dram architecture In HPCApp 615ndash626 (2013)

21 Marti S Borras J Rodriguez P Tena RMarin J A complexity-effective out-of-order re-tirement microarchitecture IEEE Transactions onComputers 58(12) 1626ndash1639 (2009)

22 Martinez JF Renau J Huang MC PrvulovicM Cherry Checkpointed early resource recyclingin out-of-order microprocessors In MICRO pp3ndash14 (2002)

23 Monreal T Vinals V Gonzalez J GonzalezA Valero M Late allocation and early release ofphysical registers IEEE Trans Comput 53(10)1244ndash1259 (2004)

24 Ros A Carlson TE Alipour M Kaxiras SNon-speculative load-load reordering in TSO InISCA pp 187ndash200 (2017)

25 Smith JE Pleszkun AR Implementing preciseinterrupts in pipelined processors IEEE Transac-tions on Computers 37(5) 562ndash573 (1988)

26 Sohi GS Vajapeyam S Instruction issue logicfor high-performance interruptable pipelined pro-cessors In ISCA pp 27ndash34 (1987)

  • Introduction
  • Out-of-Order Commit
  • Methodology
  • Out-of-Order Commit Evaluation
  • Memory Parallelism
  • Early Release of Physical Registers vs Out-of-order Commit
  • Related Work
  • Conclusion
Page 15: Maximizing Limited Resources: A Limit-based Study and ...tcarlson/pdfs/alipour2018mlralsatoo… · Maximizing Limited Resources: A Limit-based Study and Taxonomy of Out-of-order Commit

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 15

Table 4 The Contribution of exhausted (full) regis-ter file (RF) and reorder buffer (ROB) on CPU stallsOther reasons to the CPU stalls could be the instruc-tion queue (IQ) the load-store queue (LSQ) instruc-tions cache misses or not having a free functional unitto allocate to the ready-to-dispatch instructions

Microarchitecture RF ROB Othermin avg max min avg max min avg max

SLM 6 59 86 13 36 92 0 5 27NHM 2 68 91 8 30 98 0 20 7HSW 2 69 91 8 29 98 0 20 7

(ERPR) [23] have been developed In this section wecompare ERPR with the OOC taxonomy [2]

OOC and ERPR are similar as they release physicalregister as early as possible Aggressive OOC outper-forms ERPR in general because it releases ROB andLSQ entries early as well as physical registers Unfor-tunately neither ERPR nor OOC can release IQ entriesearlier than usual because OOC only address processorstate after the instructions have left the IQ and ERPRis effective before instructions are inserted tot the IQFrom another point of view OOC put additional pres-sure on the IQ and provides free entries in one or all ofthe resource of superscalar processors

61 Analysis of CPU Aggressiveness

CPU aggressiveness analysis in this context is basedon microarchitecture configurations (see Table 2) andcommit-depth The overall trend of Figure 13 showsthat AOOC outperforms ROOC and ERPR in all con-figurations This is not surprising as AOOC is alwaysenabled while ROOC is enabled only when a resourcesis exhausted (see Section 25)

ERPR is essentially a subset of AOOC that only fo-cuses on the register file (RF) Therefore the higher isthe program sensitivity to the RF capacity the higher isthe effect of ERPR To better understand this behaviorwe analyzed the reasons for CPU stalls across differentmicroarchitectures (Table 4) The data show that in-creasing the effective size of RF (eg what ERPR doesconceptually) is less effective in SLM compare to NHMand HSW barbecue is SLM the CPU stalls are moreoften due to other resources such as the ROB size Asa result since in SLM architecture the size of RF isless effective in CPU stalls ROOC outperforms ERPRin terms of performance improvement (ROOC can vir-tually extend the effective size of ROB LSQ as wellas RF although ERPR only does this on RF) RF inNHM and HSW architectures is more effective on CPUstall than in SLM architecture therefore ERPR out-

performs ROOC regarding performance improvementin these two architecture

By focusing only on ERPR we observe that thewider the core the more effective is the is increasingcommit-depth (wider commit-depth covers more readyinstructions to commit) For example in the case ofSLM although widening the commit-depth releases ad-ditional physical registers earlier than general thesefreed registers are not effective anymore because theCPU stalls are due to limits in other resources (likeROB IQ or LSQ) As NHM and HSW have more re-sources increasing the commit-depth is more effectivein those architectures In the case of AOOC increas-ing the commit-depth is more effective than ROOCand ERPR because it enables additional instructionsto be committed out-of-order (because it is active onevery cycle compare to ROOC and also affects all ofthe resources compare to ERPR that only affects theRF) In addition we can see that the more aggressivethe CPU the more effective is increasing the commit-depth This is because wider cores execute and com-plete instructions that are far from the head of ROBearlier with their additional resources(either providedby the baseline or released by the OOOC) and theycan therefore provide more instructions to commit ifthe commit-depth is increased

62 Analysis of Out-of-Order Conditions

Figure 13 shows the potential for performance improve-ment from relaxing the out-of-order commit conditionsacross the three architectures for both safe and unsafeout-of-order-commit As has been seen previously [2 521] the least aggressive out-of-order commit configu-ration (safe_OOC with a commit depth of 4) is mostbeneficial for the least aggressive out-of-order processorSLM (compare the left-most bars of Figure 13(a-c))This is expected as the SLM processor has the least ex-ecution hardware and therefore suffers from more stallsthan the more aggressive processors

Increasing the commit depth from 4 to 8 (secondgroup of bars in Figure 13(a-c)) shows that the more ag-gressive out-of-order processors (NHM and HSW) nowbenefit more from out-of-order commit than the sim-pler SLM architecture This new result shows the inter-action between the aggressiveness of the out-of-orderexecution and out-of-order commit for more aggres-sive out-of-order execution there is more benefit frommore aggressive out-of-order commit The reason is thataggressive architectures appear to have more unusedresources available for computation This trend con-tinues through the more aggressive out-of-order com-mit modes as well with unsafe optimizations (Figure

16 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

4 8 16 320

20406080

100

Perfo

rman

ceim

prov

emen

t () (a) SLM-SAFE

4 8 16 32 128Commit Depth

020406080

100

(b) NHM-SAFEAOOC ROOC ERPR

4 8 16 32 128 1920

20406080

100

(c) HSW-SAFE

4 8 16 320

20406080

100

(d) SLM-UnsafeLD

4 8 16 32 1280

20406080

100

(e) NHM-UnsafeLD

4 8 16 32 128 1920

20406080

100

(f) HSW-UnsafeLD

4 8 16 320

20406080

100

(g) SLM-UnsafeBR

4 8 16 32 1280

20406080

100

(h) NHM-UnsafeBR

4 8 16 32 128 1920

20406080

100

(i) HSW-UnsafeBR

4 8 16 320

20406080

100

(j) SLM-UnsafeST

4 8 16 32 1280

20406080

100

(k) NHM-UnsafeST

4 8 16 32 128 1920

20406080

100

(l) HSW-UnsafeST

4 8 16 320

20406080

100

(m) SLM-UnsafeEXC

4 8 16 32 1280

20406080

100

(n) NHM-UnsafeEXC

4 8 16 32 128 1920

20406080

100

(o) HSW-UnsafeEXC

4 8 16 320

20406080

100

(p) SLM-Unsafe

4 8 16 32 1280

20406080

100

(q) NHM-Unsafe

4 8 16 32 128 1920

20406080

100

(r) HSW-Unsafe

Fig 13 Comparison between AOOC ROOC and ERPR in different commit-depth and microarchitecture setupin six different modes UnsafeXX is equivalent to (enforcing) all out-of-order commit conditions except conditionXX By relaxing the specific XX condition the dependence between other conditions is also observed

13(d-r)) benefiting the more aggressive out-of-order ex-ecution designs (NHM and HSW) more than the sim-pler one (SLM) The reasons for this are similar tothe safe case as the more aggressive processors arefaster to achieve the first out-of-order condition in-struction completion and therefore providing greatercommit depths (eg changing the commit depth from 4to 8 in Figure 13 (a-c)) results in a larger likelihoodof finding instructions to commit out-of-order Table 5

uses this analysis to rank the relative importance ofeach out-of-order commit condition for the three ar-chitectures This information provides a guideline forwhich areas to work on to achieve the best performanceimprovement depending on the baseline out-of-order ex-ecution

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 17

Conditioncommit-depth 4 8 16 32 128 192

Unsafe_LD 1 2 2 2 NA NAUnsafe_BR 2 1 1 1 NA NAUnsafe_ST 4 4 4 4 NA NAUnsafe_EXC 3 3 3 3 NA NA

(a) SLMConditioncommit-depth 4 8 16 32 128 192

Unsafe_LD 1 2 2 2 2 NAUnsafe_BR 3 1 1 1 1 NAUnsafe_ST 4 4 4 4 4 NAUnsafe_EXC 2 3 3 3 3 NA

(b) NHMConditioncommit-depth 4 8 16 32 128 192

Unsafe_LD 4 2 2 2 2 2Unsafe_BR 1 1 1 1 1 1Unsafe_ST 3 3 4 4 4 4Unsafe_EXC 2 4 3 3 3 3

(c) HSW

Table 5 Ranking of the benefits of out-of-order commitconditions in different microarchitecture based on thedepth of commit

7 Related Work

The goal of this work is to provide a detailed under-standing into the potential for performance benefitsacross different out-of-order commit conditions and lev-els of aggressiveness While this work aims to describethe maximum potential benefit for each individual con-dition there have been many previous works that de-scribe hardware solutions for early release of hardwarestructures and out-of-order commit strategiesBelowwe provide an overview of these works and how theyfit into the categories described in this work

71 Speculative Release of Hardware StructuresA number of implementations require register and pro-cessor state checkpointing support to speculatively re-tire or release hardware structures [12 11 22 1 17]

Processor state checkpointing especially with theadvent of very large SIMD registers such as AVX-512registers which now support up to 32 registers with upto 512 bits per register can require a significant amountof state to be saved when speculation is aggressive

72 Non-speculative Structure Release

Non-speculative early release of hardware structures be-fore commit requires knowledge that no older instruc-tion (that has come earlier in the instruction stream)can cause the program to abort raise an exception orrequire exposure of the architected state at that timeNon-speculative solutions have the potential to be themost energy efficient a necessity in an era of the end ofDennard Scaling for power-limited platforms A range

of solutions from hardware-only to software-assistedsolutions are described below

Compiler support Two previous works [1 13]allow for early commit or reclaim of resources basedon compiler knowledge but require software recompi-lation

Checkpoint-free A number of solutions do notuse checkpoints [5 21 13] and therefore do not requirespeculation (they respect the all commit conditions) orused software help to extend knowledge to the hard-ware

Selective early commit Early commit of loads [1415] allows for loads that have not yet received data fromthe memory hierarchy to become part of the commit-ted state of a processor This allows the processor tocontinue to process instructions past normally blockedstructures improving performance for memory-boundworkloads To accomplish this the authors [15] de-couple page faults that can occur from the fetching ofdata While recompilation is not strictly required forthis technique the authors evaluate their work using acompilation strategy to expose loads early Late alloca-tion and early release of physical registers is proposedas a register renaming scheme in [23] They explored theeffect of late allocation and early release of physical reg-ister on the performance of an 8-way superscalar pro-cessor without the need for additional speculation andmaintains precise exceptions The out-of-order commitmethods evaluated in this work overlap with the earlyrelease of registers while also evaluating the early re-lease of other structures (like the LSQ)

Evading out-of-order commit conditions[24] provides a coherency protocol level solution suchthat the loadrarrload reordering can be evaded non spec-ulatively under TSO In their design it is no longer nec-essary to squash and re-execute speculatively reorderedloads if their reordering can be hidden by the coherencyprotocol This allows speculatively reordered loads thatotherwise adhere to the rest of the Bell-Lipasti condi-tions to be committed out-of-order without affectingthe memory model

8 Conclusion

To obtain higher performance extending the reach ofthe processor core has been a focus in much of microar-chitecture research One promising direction is the useof out-of-order commit which releases precious proces-sor resources early to allow the processor to extend itsreach past typical hardware limits In this work wepresent a limit study for out-of-order commit throughthe introduction of reluctant and aggressive out-of-order

18 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

commit modes We show how smaller processors evenwith a limited commit scan depth can benefit fromout-of-order commit strategies but that larger aggres-sive cores require deeper commit scan depths to achieveimproved performance In addition we provide a de-tailed breakdown of the contributions for each out-of-order commit condition for the SPEC CPU2006 bench-mark suite and compare against similar works suchas early release of physical registers Our results showa very high potential for performance improvementabove 225x for some benchmarks and believe that out-of-order commit strategies can play an important rolefor future energy-efficient and high-performance proces-sor designs

References

1 Afram F Zeng H Ghose K A group-commitmechanism for ROB-based processors implement-ing the x86 ISA In HPCA pp 47ndash58 (2013)

2 Alipour M Carlson TE Kaxiras S Exploringthe performance limits of out-of-order commit InCF pp 211ndash220 (2017)

3 Allan A Edenfeld D Joyner Jr WH KahngAB Rodgers M Zorian Y 2001 technologyroadmap for semiconductors Computer 35(1) 42ndash53 (2002)

4 Badalone R Dramrsquos surprising role in the costof data centers httpwwwdatacenterknowledgecomarchives20151112dont (2015)

5 Bell GB Lipasti MH Deconstructing commitIn ISPASS pp 68ndash77 (2004)

6 Binkert N Beckmann B Black G ReinhardtSK Saidi A Basu A Hestness J Hower DRKrishna T Sardashti S Sen R Sewell KShoaib M Vaish N Hill MD Wood DA Thegem5 simulator SIGARCH Comput Archit News39(2) 1ndash7 (2011)

7 Carlson TE Heirman W Allam O Kaxiras SEeckhout L The load slice core microarchitectureIn ISCA pp 272ndash284 (2015)

8 Chou Y Fahs B Abraham S Microarchitec-ture optimizations for exploiting memory-level par-allelism In ISCA pp 76ndash88 (2004)

9 Corporation I Intel Rcopy 64 and ia-32 ar-chitectures optimization reference manualhttpwwwintelcomcontentwwwusenarchitecture-and-technology64-ia-32-architectures-optimization-manualhtml(2016)

10 Corporation I Intel Rcopy intelrsquos rsquotick-tockrsquo seeminglydead becomes rsquoprocess-architecture-optimizationrsquohttpwwwanandtechcomshow10183

intels-tick-tock-seemingly-dead-becomes-process-architecture-optimization (2016)

11 Cristal A Ortega D Llosa J Valero M Out-of-order commit processors In HPCA pp 48ndash59(2004)

12 Cristal A Santana OJ Valero M MartiacutenezJF Toward kilo-instruction processors ACMTrans Archit Code Optim 1(4) 389ndash417 (2004)

13 Duong N Veidenbaum AV Compiler-assistedselective out-of-order commit IEEE Computer Ar-chitecture Letters 12(1) 21ndash24 (2013)

14 Gwennap L Digital leads the pack with 21164 InMicroprocessor Report 8(12) pp 249ndash260 (1994)

15 Ham T Aragoacuten JL Martonosi M Desc De-coupled supply-compute communication manage-ment for heterogeneous architectures In MICROpp 191ndash203 (2015)

16 Henning JL SPEC CPU2006 benchmark descrip-tions SIGARCH Comput Archit News 34(4) 1ndash17 (2006)

17 Hilton A Roth A BOLT Energy-efficient out-of-order latency-tolerant execution In HPCA pp1ndash12 (2010)

18 Jaleel A Memory characterization of workloadsusing instrumentation driven simulation httpwwwglueumdeduajaleelworkload (2010)

19 Kroft D Lockup-free instruction fetchprefetchcache organization In ISCA pp 81ndash87 (1981)

20 Lee D Kim Y Seshadri V Liu J Subrama-nian L Mutlu O Tiered-latency dram A lowlatency and low cost dram architecture In HPCApp 615ndash626 (2013)

21 Marti S Borras J Rodriguez P Tena RMarin J A complexity-effective out-of-order re-tirement microarchitecture IEEE Transactions onComputers 58(12) 1626ndash1639 (2009)

22 Martinez JF Renau J Huang MC PrvulovicM Cherry Checkpointed early resource recyclingin out-of-order microprocessors In MICRO pp3ndash14 (2002)

23 Monreal T Vinals V Gonzalez J GonzalezA Valero M Late allocation and early release ofphysical registers IEEE Trans Comput 53(10)1244ndash1259 (2004)

24 Ros A Carlson TE Alipour M Kaxiras SNon-speculative load-load reordering in TSO InISCA pp 187ndash200 (2017)

25 Smith JE Pleszkun AR Implementing preciseinterrupts in pipelined processors IEEE Transac-tions on Computers 37(5) 562ndash573 (1988)

26 Sohi GS Vajapeyam S Instruction issue logicfor high-performance interruptable pipelined pro-cessors In ISCA pp 27ndash34 (1987)

  • Introduction
  • Out-of-Order Commit
  • Methodology
  • Out-of-Order Commit Evaluation
  • Memory Parallelism
  • Early Release of Physical Registers vs Out-of-order Commit
  • Related Work
  • Conclusion
Page 16: Maximizing Limited Resources: A Limit-based Study and ...tcarlson/pdfs/alipour2018mlralsatoo… · Maximizing Limited Resources: A Limit-based Study and Taxonomy of Out-of-order Commit

16 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

4 8 16 320

20406080

100

Perfo

rman

ceim

prov

emen

t () (a) SLM-SAFE

4 8 16 32 128Commit Depth

020406080

100

(b) NHM-SAFEAOOC ROOC ERPR

4 8 16 32 128 1920

20406080

100

(c) HSW-SAFE

4 8 16 320

20406080

100

(d) SLM-UnsafeLD

4 8 16 32 1280

20406080

100

(e) NHM-UnsafeLD

4 8 16 32 128 1920

20406080

100

(f) HSW-UnsafeLD

4 8 16 320

20406080

100

(g) SLM-UnsafeBR

4 8 16 32 1280

20406080

100

(h) NHM-UnsafeBR

4 8 16 32 128 1920

20406080

100

(i) HSW-UnsafeBR

4 8 16 320

20406080

100

(j) SLM-UnsafeST

4 8 16 32 1280

20406080

100

(k) NHM-UnsafeST

4 8 16 32 128 1920

20406080

100

(l) HSW-UnsafeST

4 8 16 320

20406080

100

(m) SLM-UnsafeEXC

4 8 16 32 1280

20406080

100

(n) NHM-UnsafeEXC

4 8 16 32 128 1920

20406080

100

(o) HSW-UnsafeEXC

4 8 16 320

20406080

100

(p) SLM-Unsafe

4 8 16 32 1280

20406080

100

(q) NHM-Unsafe

4 8 16 32 128 1920

20406080

100

(r) HSW-Unsafe

Fig 13 Comparison between AOOC ROOC and ERPR in different commit-depth and microarchitecture setupin six different modes UnsafeXX is equivalent to (enforcing) all out-of-order commit conditions except conditionXX By relaxing the specific XX condition the dependence between other conditions is also observed

13(d-r)) benefiting the more aggressive out-of-order ex-ecution designs (NHM and HSW) more than the sim-pler one (SLM) The reasons for this are similar tothe safe case as the more aggressive processors arefaster to achieve the first out-of-order condition in-struction completion and therefore providing greatercommit depths (eg changing the commit depth from 4to 8 in Figure 13 (a-c)) results in a larger likelihoodof finding instructions to commit out-of-order Table 5

uses this analysis to rank the relative importance ofeach out-of-order commit condition for the three ar-chitectures This information provides a guideline forwhich areas to work on to achieve the best performanceimprovement depending on the baseline out-of-order ex-ecution

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 17

Conditioncommit-depth 4 8 16 32 128 192

Unsafe_LD 1 2 2 2 NA NAUnsafe_BR 2 1 1 1 NA NAUnsafe_ST 4 4 4 4 NA NAUnsafe_EXC 3 3 3 3 NA NA

(a) SLMConditioncommit-depth 4 8 16 32 128 192

Unsafe_LD 1 2 2 2 2 NAUnsafe_BR 3 1 1 1 1 NAUnsafe_ST 4 4 4 4 4 NAUnsafe_EXC 2 3 3 3 3 NA

(b) NHMConditioncommit-depth 4 8 16 32 128 192

Unsafe_LD 4 2 2 2 2 2Unsafe_BR 1 1 1 1 1 1Unsafe_ST 3 3 4 4 4 4Unsafe_EXC 2 4 3 3 3 3

(c) HSW

Table 5 Ranking of the benefits of out-of-order commitconditions in different microarchitecture based on thedepth of commit

7 Related Work

The goal of this work is to provide a detailed under-standing into the potential for performance benefitsacross different out-of-order commit conditions and lev-els of aggressiveness While this work aims to describethe maximum potential benefit for each individual con-dition there have been many previous works that de-scribe hardware solutions for early release of hardwarestructures and out-of-order commit strategiesBelowwe provide an overview of these works and how theyfit into the categories described in this work

71 Speculative Release of Hardware StructuresA number of implementations require register and pro-cessor state checkpointing support to speculatively re-tire or release hardware structures [12 11 22 1 17]

Processor state checkpointing especially with theadvent of very large SIMD registers such as AVX-512registers which now support up to 32 registers with upto 512 bits per register can require a significant amountof state to be saved when speculation is aggressive

72 Non-speculative Structure Release

Non-speculative early release of hardware structures be-fore commit requires knowledge that no older instruc-tion (that has come earlier in the instruction stream)can cause the program to abort raise an exception orrequire exposure of the architected state at that timeNon-speculative solutions have the potential to be themost energy efficient a necessity in an era of the end ofDennard Scaling for power-limited platforms A range

of solutions from hardware-only to software-assistedsolutions are described below

Compiler support Two previous works [1 13]allow for early commit or reclaim of resources basedon compiler knowledge but require software recompi-lation

Checkpoint-free A number of solutions do notuse checkpoints [5 21 13] and therefore do not requirespeculation (they respect the all commit conditions) orused software help to extend knowledge to the hard-ware

Selective early commit Early commit of loads [1415] allows for loads that have not yet received data fromthe memory hierarchy to become part of the commit-ted state of a processor This allows the processor tocontinue to process instructions past normally blockedstructures improving performance for memory-boundworkloads To accomplish this the authors [15] de-couple page faults that can occur from the fetching ofdata While recompilation is not strictly required forthis technique the authors evaluate their work using acompilation strategy to expose loads early Late alloca-tion and early release of physical registers is proposedas a register renaming scheme in [23] They explored theeffect of late allocation and early release of physical reg-ister on the performance of an 8-way superscalar pro-cessor without the need for additional speculation andmaintains precise exceptions The out-of-order commitmethods evaluated in this work overlap with the earlyrelease of registers while also evaluating the early re-lease of other structures (like the LSQ)

Evading out-of-order commit conditions[24] provides a coherency protocol level solution suchthat the loadrarrload reordering can be evaded non spec-ulatively under TSO In their design it is no longer nec-essary to squash and re-execute speculatively reorderedloads if their reordering can be hidden by the coherencyprotocol This allows speculatively reordered loads thatotherwise adhere to the rest of the Bell-Lipasti condi-tions to be committed out-of-order without affectingthe memory model

8 Conclusion

To obtain higher performance extending the reach ofthe processor core has been a focus in much of microar-chitecture research One promising direction is the useof out-of-order commit which releases precious proces-sor resources early to allow the processor to extend itsreach past typical hardware limits In this work wepresent a limit study for out-of-order commit throughthe introduction of reluctant and aggressive out-of-order

18 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

commit modes We show how smaller processors evenwith a limited commit scan depth can benefit fromout-of-order commit strategies but that larger aggres-sive cores require deeper commit scan depths to achieveimproved performance In addition we provide a de-tailed breakdown of the contributions for each out-of-order commit condition for the SPEC CPU2006 bench-mark suite and compare against similar works suchas early release of physical registers Our results showa very high potential for performance improvementabove 225x for some benchmarks and believe that out-of-order commit strategies can play an important rolefor future energy-efficient and high-performance proces-sor designs

References

1 Afram F Zeng H Ghose K A group-commitmechanism for ROB-based processors implement-ing the x86 ISA In HPCA pp 47ndash58 (2013)

2 Alipour M Carlson TE Kaxiras S Exploringthe performance limits of out-of-order commit InCF pp 211ndash220 (2017)

3 Allan A Edenfeld D Joyner Jr WH KahngAB Rodgers M Zorian Y 2001 technologyroadmap for semiconductors Computer 35(1) 42ndash53 (2002)

4 Badalone R Dramrsquos surprising role in the costof data centers httpwwwdatacenterknowledgecomarchives20151112dont (2015)

5 Bell GB Lipasti MH Deconstructing commitIn ISPASS pp 68ndash77 (2004)

6 Binkert N Beckmann B Black G ReinhardtSK Saidi A Basu A Hestness J Hower DRKrishna T Sardashti S Sen R Sewell KShoaib M Vaish N Hill MD Wood DA Thegem5 simulator SIGARCH Comput Archit News39(2) 1ndash7 (2011)

7 Carlson TE Heirman W Allam O Kaxiras SEeckhout L The load slice core microarchitectureIn ISCA pp 272ndash284 (2015)

8 Chou Y Fahs B Abraham S Microarchitec-ture optimizations for exploiting memory-level par-allelism In ISCA pp 76ndash88 (2004)

9 Corporation I Intel Rcopy 64 and ia-32 ar-chitectures optimization reference manualhttpwwwintelcomcontentwwwusenarchitecture-and-technology64-ia-32-architectures-optimization-manualhtml(2016)

10 Corporation I Intel Rcopy intelrsquos rsquotick-tockrsquo seeminglydead becomes rsquoprocess-architecture-optimizationrsquohttpwwwanandtechcomshow10183

intels-tick-tock-seemingly-dead-becomes-process-architecture-optimization (2016)

11 Cristal A Ortega D Llosa J Valero M Out-of-order commit processors In HPCA pp 48ndash59(2004)

12 Cristal A Santana OJ Valero M MartiacutenezJF Toward kilo-instruction processors ACMTrans Archit Code Optim 1(4) 389ndash417 (2004)

13 Duong N Veidenbaum AV Compiler-assistedselective out-of-order commit IEEE Computer Ar-chitecture Letters 12(1) 21ndash24 (2013)

14 Gwennap L Digital leads the pack with 21164 InMicroprocessor Report 8(12) pp 249ndash260 (1994)

15 Ham T Aragoacuten JL Martonosi M Desc De-coupled supply-compute communication manage-ment for heterogeneous architectures In MICROpp 191ndash203 (2015)

16 Henning JL SPEC CPU2006 benchmark descrip-tions SIGARCH Comput Archit News 34(4) 1ndash17 (2006)

17 Hilton A Roth A BOLT Energy-efficient out-of-order latency-tolerant execution In HPCA pp1ndash12 (2010)

18 Jaleel A Memory characterization of workloadsusing instrumentation driven simulation httpwwwglueumdeduajaleelworkload (2010)

19 Kroft D Lockup-free instruction fetchprefetchcache organization In ISCA pp 81ndash87 (1981)

20 Lee D Kim Y Seshadri V Liu J Subrama-nian L Mutlu O Tiered-latency dram A lowlatency and low cost dram architecture In HPCApp 615ndash626 (2013)

21 Marti S Borras J Rodriguez P Tena RMarin J A complexity-effective out-of-order re-tirement microarchitecture IEEE Transactions onComputers 58(12) 1626ndash1639 (2009)

22 Martinez JF Renau J Huang MC PrvulovicM Cherry Checkpointed early resource recyclingin out-of-order microprocessors In MICRO pp3ndash14 (2002)

23 Monreal T Vinals V Gonzalez J GonzalezA Valero M Late allocation and early release ofphysical registers IEEE Trans Comput 53(10)1244ndash1259 (2004)

24 Ros A Carlson TE Alipour M Kaxiras SNon-speculative load-load reordering in TSO InISCA pp 187ndash200 (2017)

25 Smith JE Pleszkun AR Implementing preciseinterrupts in pipelined processors IEEE Transac-tions on Computers 37(5) 562ndash573 (1988)

26 Sohi GS Vajapeyam S Instruction issue logicfor high-performance interruptable pipelined pro-cessors In ISCA pp 27ndash34 (1987)

  • Introduction
  • Out-of-Order Commit
  • Methodology
  • Out-of-Order Commit Evaluation
  • Memory Parallelism
  • Early Release of Physical Registers vs Out-of-order Commit
  • Related Work
  • Conclusion
Page 17: Maximizing Limited Resources: A Limit-based Study and ...tcarlson/pdfs/alipour2018mlralsatoo… · Maximizing Limited Resources: A Limit-based Study and Taxonomy of Out-of-order Commit

Maximizing Limited Resources A Limit-based Study and Taxonomy of Out-of-order Commit 17

Conditioncommit-depth 4 8 16 32 128 192

Unsafe_LD 1 2 2 2 NA NAUnsafe_BR 2 1 1 1 NA NAUnsafe_ST 4 4 4 4 NA NAUnsafe_EXC 3 3 3 3 NA NA

(a) SLMConditioncommit-depth 4 8 16 32 128 192

Unsafe_LD 1 2 2 2 2 NAUnsafe_BR 3 1 1 1 1 NAUnsafe_ST 4 4 4 4 4 NAUnsafe_EXC 2 3 3 3 3 NA

(b) NHMConditioncommit-depth 4 8 16 32 128 192

Unsafe_LD 4 2 2 2 2 2Unsafe_BR 1 1 1 1 1 1Unsafe_ST 3 3 4 4 4 4Unsafe_EXC 2 4 3 3 3 3

(c) HSW

Table 5 Ranking of the benefits of out-of-order commitconditions in different microarchitecture based on thedepth of commit

7 Related Work

The goal of this work is to provide a detailed under-standing into the potential for performance benefitsacross different out-of-order commit conditions and lev-els of aggressiveness While this work aims to describethe maximum potential benefit for each individual con-dition there have been many previous works that de-scribe hardware solutions for early release of hardwarestructures and out-of-order commit strategiesBelowwe provide an overview of these works and how theyfit into the categories described in this work

71 Speculative Release of Hardware StructuresA number of implementations require register and pro-cessor state checkpointing support to speculatively re-tire or release hardware structures [12 11 22 1 17]

Processor state checkpointing especially with theadvent of very large SIMD registers such as AVX-512registers which now support up to 32 registers with upto 512 bits per register can require a significant amountof state to be saved when speculation is aggressive

72 Non-speculative Structure Release

Non-speculative early release of hardware structures be-fore commit requires knowledge that no older instruc-tion (that has come earlier in the instruction stream)can cause the program to abort raise an exception orrequire exposure of the architected state at that timeNon-speculative solutions have the potential to be themost energy efficient a necessity in an era of the end ofDennard Scaling for power-limited platforms A range

of solutions from hardware-only to software-assistedsolutions are described below

Compiler support Two previous works [1 13]allow for early commit or reclaim of resources basedon compiler knowledge but require software recompi-lation

Checkpoint-free A number of solutions do notuse checkpoints [5 21 13] and therefore do not requirespeculation (they respect the all commit conditions) orused software help to extend knowledge to the hard-ware

Selective early commit Early commit of loads [1415] allows for loads that have not yet received data fromthe memory hierarchy to become part of the commit-ted state of a processor This allows the processor tocontinue to process instructions past normally blockedstructures improving performance for memory-boundworkloads To accomplish this the authors [15] de-couple page faults that can occur from the fetching ofdata While recompilation is not strictly required forthis technique the authors evaluate their work using acompilation strategy to expose loads early Late alloca-tion and early release of physical registers is proposedas a register renaming scheme in [23] They explored theeffect of late allocation and early release of physical reg-ister on the performance of an 8-way superscalar pro-cessor without the need for additional speculation andmaintains precise exceptions The out-of-order commitmethods evaluated in this work overlap with the earlyrelease of registers while also evaluating the early re-lease of other structures (like the LSQ)

Evading out-of-order commit conditions[24] provides a coherency protocol level solution suchthat the loadrarrload reordering can be evaded non spec-ulatively under TSO In their design it is no longer nec-essary to squash and re-execute speculatively reorderedloads if their reordering can be hidden by the coherencyprotocol This allows speculatively reordered loads thatotherwise adhere to the rest of the Bell-Lipasti condi-tions to be committed out-of-order without affectingthe memory model

8 Conclusion

To obtain higher performance extending the reach ofthe processor core has been a focus in much of microar-chitecture research One promising direction is the useof out-of-order commit which releases precious proces-sor resources early to allow the processor to extend itsreach past typical hardware limits In this work wepresent a limit study for out-of-order commit throughthe introduction of reluctant and aggressive out-of-order

18 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

commit modes We show how smaller processors evenwith a limited commit scan depth can benefit fromout-of-order commit strategies but that larger aggres-sive cores require deeper commit scan depths to achieveimproved performance In addition we provide a de-tailed breakdown of the contributions for each out-of-order commit condition for the SPEC CPU2006 bench-mark suite and compare against similar works suchas early release of physical registers Our results showa very high potential for performance improvementabove 225x for some benchmarks and believe that out-of-order commit strategies can play an important rolefor future energy-efficient and high-performance proces-sor designs

References

1 Afram F Zeng H Ghose K A group-commitmechanism for ROB-based processors implement-ing the x86 ISA In HPCA pp 47ndash58 (2013)

2 Alipour M Carlson TE Kaxiras S Exploringthe performance limits of out-of-order commit InCF pp 211ndash220 (2017)

3 Allan A Edenfeld D Joyner Jr WH KahngAB Rodgers M Zorian Y 2001 technologyroadmap for semiconductors Computer 35(1) 42ndash53 (2002)

4 Badalone R Dramrsquos surprising role in the costof data centers httpwwwdatacenterknowledgecomarchives20151112dont (2015)

5 Bell GB Lipasti MH Deconstructing commitIn ISPASS pp 68ndash77 (2004)

6 Binkert N Beckmann B Black G ReinhardtSK Saidi A Basu A Hestness J Hower DRKrishna T Sardashti S Sen R Sewell KShoaib M Vaish N Hill MD Wood DA Thegem5 simulator SIGARCH Comput Archit News39(2) 1ndash7 (2011)

7 Carlson TE Heirman W Allam O Kaxiras SEeckhout L The load slice core microarchitectureIn ISCA pp 272ndash284 (2015)

8 Chou Y Fahs B Abraham S Microarchitec-ture optimizations for exploiting memory-level par-allelism In ISCA pp 76ndash88 (2004)

9 Corporation I Intel Rcopy 64 and ia-32 ar-chitectures optimization reference manualhttpwwwintelcomcontentwwwusenarchitecture-and-technology64-ia-32-architectures-optimization-manualhtml(2016)

10 Corporation I Intel Rcopy intelrsquos rsquotick-tockrsquo seeminglydead becomes rsquoprocess-architecture-optimizationrsquohttpwwwanandtechcomshow10183

intels-tick-tock-seemingly-dead-becomes-process-architecture-optimization (2016)

11 Cristal A Ortega D Llosa J Valero M Out-of-order commit processors In HPCA pp 48ndash59(2004)

12 Cristal A Santana OJ Valero M MartiacutenezJF Toward kilo-instruction processors ACMTrans Archit Code Optim 1(4) 389ndash417 (2004)

13 Duong N Veidenbaum AV Compiler-assistedselective out-of-order commit IEEE Computer Ar-chitecture Letters 12(1) 21ndash24 (2013)

14 Gwennap L Digital leads the pack with 21164 InMicroprocessor Report 8(12) pp 249ndash260 (1994)

15 Ham T Aragoacuten JL Martonosi M Desc De-coupled supply-compute communication manage-ment for heterogeneous architectures In MICROpp 191ndash203 (2015)

16 Henning JL SPEC CPU2006 benchmark descrip-tions SIGARCH Comput Archit News 34(4) 1ndash17 (2006)

17 Hilton A Roth A BOLT Energy-efficient out-of-order latency-tolerant execution In HPCA pp1ndash12 (2010)

18 Jaleel A Memory characterization of workloadsusing instrumentation driven simulation httpwwwglueumdeduajaleelworkload (2010)

19 Kroft D Lockup-free instruction fetchprefetchcache organization In ISCA pp 81ndash87 (1981)

20 Lee D Kim Y Seshadri V Liu J Subrama-nian L Mutlu O Tiered-latency dram A lowlatency and low cost dram architecture In HPCApp 615ndash626 (2013)

21 Marti S Borras J Rodriguez P Tena RMarin J A complexity-effective out-of-order re-tirement microarchitecture IEEE Transactions onComputers 58(12) 1626ndash1639 (2009)

22 Martinez JF Renau J Huang MC PrvulovicM Cherry Checkpointed early resource recyclingin out-of-order microprocessors In MICRO pp3ndash14 (2002)

23 Monreal T Vinals V Gonzalez J GonzalezA Valero M Late allocation and early release ofphysical registers IEEE Trans Comput 53(10)1244ndash1259 (2004)

24 Ros A Carlson TE Alipour M Kaxiras SNon-speculative load-load reordering in TSO InISCA pp 187ndash200 (2017)

25 Smith JE Pleszkun AR Implementing preciseinterrupts in pipelined processors IEEE Transac-tions on Computers 37(5) 562ndash573 (1988)

26 Sohi GS Vajapeyam S Instruction issue logicfor high-performance interruptable pipelined pro-cessors In ISCA pp 27ndash34 (1987)

  • Introduction
  • Out-of-Order Commit
  • Methodology
  • Out-of-Order Commit Evaluation
  • Memory Parallelism
  • Early Release of Physical Registers vs Out-of-order Commit
  • Related Work
  • Conclusion
Page 18: Maximizing Limited Resources: A Limit-based Study and ...tcarlson/pdfs/alipour2018mlralsatoo… · Maximizing Limited Resources: A Limit-based Study and Taxonomy of Out-of-order Commit

18 Mehdi Alipour Trevor E Carlson David Black-Schaffer Stefanos Kaxiras

commit modes We show how smaller processors evenwith a limited commit scan depth can benefit fromout-of-order commit strategies but that larger aggres-sive cores require deeper commit scan depths to achieveimproved performance In addition we provide a de-tailed breakdown of the contributions for each out-of-order commit condition for the SPEC CPU2006 bench-mark suite and compare against similar works suchas early release of physical registers Our results showa very high potential for performance improvementabove 225x for some benchmarks and believe that out-of-order commit strategies can play an important rolefor future energy-efficient and high-performance proces-sor designs

References

1 Afram F Zeng H Ghose K A group-commitmechanism for ROB-based processors implement-ing the x86 ISA In HPCA pp 47ndash58 (2013)

2 Alipour M Carlson TE Kaxiras S Exploringthe performance limits of out-of-order commit InCF pp 211ndash220 (2017)

3 Allan A Edenfeld D Joyner Jr WH KahngAB Rodgers M Zorian Y 2001 technologyroadmap for semiconductors Computer 35(1) 42ndash53 (2002)

4 Badalone R Dramrsquos surprising role in the costof data centers httpwwwdatacenterknowledgecomarchives20151112dont (2015)

5 Bell GB Lipasti MH Deconstructing commitIn ISPASS pp 68ndash77 (2004)

6 Binkert N Beckmann B Black G ReinhardtSK Saidi A Basu A Hestness J Hower DRKrishna T Sardashti S Sen R Sewell KShoaib M Vaish N Hill MD Wood DA Thegem5 simulator SIGARCH Comput Archit News39(2) 1ndash7 (2011)

7 Carlson TE Heirman W Allam O Kaxiras SEeckhout L The load slice core microarchitectureIn ISCA pp 272ndash284 (2015)

8 Chou Y Fahs B Abraham S Microarchitec-ture optimizations for exploiting memory-level par-allelism In ISCA pp 76ndash88 (2004)

9 Corporation I Intel Rcopy 64 and ia-32 ar-chitectures optimization reference manualhttpwwwintelcomcontentwwwusenarchitecture-and-technology64-ia-32-architectures-optimization-manualhtml(2016)

10 Corporation I Intel Rcopy intelrsquos rsquotick-tockrsquo seeminglydead becomes rsquoprocess-architecture-optimizationrsquohttpwwwanandtechcomshow10183

intels-tick-tock-seemingly-dead-becomes-process-architecture-optimization (2016)

11 Cristal A Ortega D Llosa J Valero M Out-of-order commit processors In HPCA pp 48ndash59(2004)

12 Cristal A Santana OJ Valero M MartiacutenezJF Toward kilo-instruction processors ACMTrans Archit Code Optim 1(4) 389ndash417 (2004)

13 Duong N Veidenbaum AV Compiler-assistedselective out-of-order commit IEEE Computer Ar-chitecture Letters 12(1) 21ndash24 (2013)

14 Gwennap L Digital leads the pack with 21164 InMicroprocessor Report 8(12) pp 249ndash260 (1994)

15 Ham T Aragoacuten JL Martonosi M Desc De-coupled supply-compute communication manage-ment for heterogeneous architectures In MICROpp 191ndash203 (2015)

16 Henning JL SPEC CPU2006 benchmark descrip-tions SIGARCH Comput Archit News 34(4) 1ndash17 (2006)

17 Hilton A Roth A BOLT Energy-efficient out-of-order latency-tolerant execution In HPCA pp1ndash12 (2010)

18 Jaleel A Memory characterization of workloadsusing instrumentation driven simulation httpwwwglueumdeduajaleelworkload (2010)

19 Kroft D Lockup-free instruction fetchprefetchcache organization In ISCA pp 81ndash87 (1981)

20 Lee D Kim Y Seshadri V Liu J Subrama-nian L Mutlu O Tiered-latency dram A lowlatency and low cost dram architecture In HPCApp 615ndash626 (2013)

21 Marti S Borras J Rodriguez P Tena RMarin J A complexity-effective out-of-order re-tirement microarchitecture IEEE Transactions onComputers 58(12) 1626ndash1639 (2009)

22 Martinez JF Renau J Huang MC PrvulovicM Cherry Checkpointed early resource recyclingin out-of-order microprocessors In MICRO pp3ndash14 (2002)

23 Monreal T Vinals V Gonzalez J GonzalezA Valero M Late allocation and early release ofphysical registers IEEE Trans Comput 53(10)1244ndash1259 (2004)

24 Ros A Carlson TE Alipour M Kaxiras SNon-speculative load-load reordering in TSO InISCA pp 187ndash200 (2017)

25 Smith JE Pleszkun AR Implementing preciseinterrupts in pipelined processors IEEE Transac-tions on Computers 37(5) 562ndash573 (1988)

26 Sohi GS Vajapeyam S Instruction issue logicfor high-performance interruptable pipelined pro-cessors In ISCA pp 27ndash34 (1987)

  • Introduction
  • Out-of-Order Commit
  • Methodology
  • Out-of-Order Commit Evaluation
  • Memory Parallelism
  • Early Release of Physical Registers vs Out-of-order Commit
  • Related Work
  • Conclusion