
  • IEICE TRANS. INF. & SYST., VOL.E96-D, NO.3 MARCH 2013

    PAPER

    Performance Comparison of the 2.5D Integrated Multi-core and the 3-Dimensional (3D) Stacked Multi-core Processors

    Takaaki HANADA†a), Member, Farhad MEHDIPOUR††b), Nonmember, Koji INOUE††c), and Kazuaki MURAKAMI††d), Members

    SUMMARY This work compares the 2.5D multicore processor and the 3D stacked multicore processor. Three-dimensional (3D) integration technology improves the performance and reduces the energy of LSIs by stacking multiple dies and connecting them with inter-die vertical interconnects. Many researchers and designers have discussed 3D integrated LSI designs, and two of them (functional unit block 3D stacking and 2.5D integration) are especially promising in terms of performance and design reusability. The two designs differ in wiring and structure, and these differences affect their performance, which depends on the wire latency and the temperature. Thus, both 3D stacked LSIs and 2.5D LSIs have advantages and disadvantages. Once temperature is taken into account, it is not obvious from the qualitative features alone which design has the better effective performance.

    In this paper, we compare the 3D stacked multicore and the 2.5D multicore using three evaluation metrics: the shared cache access latency, the rated clock frequency under the thermal constraints, and the effective performance under the thermal constraints. The model-based comparison of the shared cache access latency shows that the average access latency of the 3D multicore is 24.4% to 34.1% shorter than that of the 2.5D multicore. The rated clock frequency comparison using the thermal model shows that the 3D multicore needs a lower clock frequency than the 2.5D multicore. Finally, the effective performance comparison using processor simulation shows that the 3D multicore gains an advantage over the 2.5D multicore under some conditions, namely when processing memory-intensive workloads and when stacking low heat density FUBs. Under the other conditions, the 2.5D multicore is the better design point for performance.

    key words: Multicore Processor, Three-dimensional (3D) Integration, 2.5D Integration, Thermal Analysis, Performance Comparison

    1. Introduction

    Large scale integrated circuits (LSIs), such as microprocessors, are required to deliver ever higher performance. The performance of LSIs has been improved by increasing the number of implemented transistors and by raising the clock speed through wire scaling. In recent years, however, wire scaling is hardly effective for wire latency

    Manuscript received February 16, 2010.
    Manuscript revised December 16, 2011.
    †Graduate School of Information Science and Electrical Engineering, Kyushu University, 744 Motooka Nishi-ku Fukuoka 819-0395 JAPAN
    ††Faculty of Information Science and Electrical Engineering, Kyushu University
    a) E-mail: [email protected]
    b) E-mail: [email protected]
    c) E-mail: [email protected]
    d) E-mail: [email protected]
    DOI: 10.1587/transinf.E96.D.1

    reduction because of the poor scaling of wire RC delays in deep sub-micron processes. Furthermore, as the die size increases, the wiring and interconnection logic complexity begin to adversely affect performance.

    Three-dimensional (3D) integration is an emerging fabrication technology which improves the performance and reduces the energy of LSIs [1][2]. 3D integrated LSIs consist of multiple stacked dies and inter-die vertical interconnects (e.g., Through-Silicon Vias, TSVs). Stacking multiple layers and using flexible vertical interconnects provide greater device density and shorter interconnects, owing to the ability to place and route in the third dimension.

    The 3D integrated multicore processor is one of the

    applications of 3D integration technology. 3D multicore designs take different forms depending on the 3D stacking granularity, and the choice of stacking granularity leads to different design options, trade-offs, and benefits. One design uses 3D partitioning to split individual functional unit blocks (FUBs) across multiple layers [3][4][5]. These fine-grained 3D stacking designs have the potential to reduce wires significantly in every FUB. However, they are costly because of the massive number of TSVs to implement and the low reusability of FUB designs. Thus, for realistic design and fabrication, many researchers and designers focus on the following two designs. In the first, wires are partitioned between FUBs and FUBs are stacked on other FUBs; such designs are referred to as the Inter-FUB 3D stacked multicore in this paper [6][7]. This intra-die 3D stacking design has the potential for global wire reduction and FUB design reusability. In the second, LSI dies are placed on a silicon interposer; it is referred to as the 2.5D integrated multicore in this paper [8][9]. This coarse-grained 3D integration design offers little potential for wire reduction, but it retains FUB design reusability. In this paper, we focus on the multicores derived from these two stacking granularities and discuss them below.

    There are differences in wiring and structure between the Inter-FUB 3D stacked multicore and the 2.5D integrated multicore, and these differences affect their performance. The Inter-FUB 3D stacked multicore has the potential to reduce wire latency compared to the 2.5D multicore. An example of this advantage is the wire latency reduction between the cores and the shared

    Copyright © 2013 The Institute of Electronics, Information and Communication Engineers


    cache [10][11]. In contrast, 3D integrated circuits tend to raise the die temperature because of their high thermal density and low thermal conductivity [12]. Therefore, reducing the thermal density is important for keeping the temperature safe. An example of thermal density reduction is selecting the clock frequency under the Thermal Design Power. This thermal problem is not serious in the 2.5D multicore because the number of stacked dies is small.

    From the above discussion, the 3D stacked multicore and the 2.5D multicore each have advantages and disadvantages (wire latency and thermal density). Generally, the 3D stacked multicore has the potential for better performance than the 2.5D multicore thanks to the wire latency reductions, but once thermal problems are considered this is not obvious from the qualitative features alone. In this paper, we compare the performance of the 3D stacked multicore and the 2.5D multicore quantitatively, using three evaluation metrics: the shared cache access latency, the rated clock frequency under the thermal constraints, and the effective performance under the thermal constraints. This paper makes the following contributions:

    The 3D multicores have better wiring latency than the 2.5D multicores, because the TSV latencies are shorter than the latencies of the horizontal global wires on the Si-Interposer.

    The 2.5D multicores have the potential for better performance than the 3D multicores when high thermal density FUBs are 3D stacked.

    Depending on the situation, the thermal problem strongly affects the performance of the 3D multicores. The resulting performance overhead negates the performance gained from the wiring latency improvements of 3D stacking.

    This paper is organized as follows. Section 2 presents related work and discusses how our contributions differ. Section 3 explains the assumptions about the multicores made in this paper. Section 4 discusses the qualitative features of the target multicores, and Section 5 discusses the performance comparisons of the multicores. Finally, we conclude the paper and show future work in Section 6.

    2. Related Works

    This paper consists of three performance analyses: wire latencies, clock frequencies under the thermal constraints, and effective performance. In relation to these analyses, many recent research efforts have explored the possibilities of 2.5D and 3D integrated LSIs.

    The wire latency of 2.5D integrated processors is studied by Kumer and Roullard [8][9]. For example, Roullard analyzed the wire latency of 2.5D integrated processors based on an estimate of the vertical interconnection latency. They assumed that the 2.5D integrated processor has processor dies and DRAM dies stacked on a silicon interposer. Roullard's quantitative analysis showed that the inter-die latency of the 2.5D integrated processor is longer than that of 3D integrated processors. For finer-grained 3D stacking granularities, the wire latency of 3D integrated multicores has been analyzed in several prior works [3][6][7]. In particular, Puttaswamy analyzed the shared cache access time of Inter-FUB integrated 3D processors [11].

    Puttaswamy and Kursun have analyzed the temperature

    of 3D integrated processors. They show that the temperature of 3D integrated processors is higher than that of conventional 2D processors, and that the temperature rises with the number of stacked dies [12][13]. In particular, vertically overlapping high thermal density FUBs create a spot of high heat density (a hot spot), and this overlap raises the temperature drastically. Based on this observation, Kursun argues that temperature-aware floorplanning is important for minimizing hot spots in 3D integrated processors.

    Awasthi evaluated the thermal specifications and the performance of Inter-FUB 3D integrated multicores separately [10]. Their evaluation showed that 3D integrated multicores perform well. However, such prior works do not consider the effect of temperature on performance. Additionally, no prior work compares the effective performance of 2.5D multicores and 3D integrated multicores.

    Part of the performance comparison framework in this paper is based on Loi's prior performance evaluation [14]. Loi evaluated the effective performance of processors with stacked DRAM main memory under thermal constraints, focusing on the trade-off between the memory access bandwidth and the clock frequency. In this paper, in contrast, we focus on the trade-off of 3D integrated multicore processors between the shared LLC access latency and the clock frequency.

    3. Assumptions

    3.1 Common Assumptions of Multicores

    In this paper, we assume that the multicore processors are symmetric multicores. Every core has an L1 instruction cache and an L1 data cache. The L1 caches are connected to a unified shared last level cache (LLC) through an on-chip network. The shared LLC is a multibanked cache which consists of several SRAM cache memory banks. Specifically, the L1 caches are physically connected to the LLC banks by a shared bus, and the L1 caches can access any cache bank. In the assumed bus network, a bus controller is needed for bus arbitration and memory interleaving. Thus, we assume that a bus controller is connected to the shared buses. The architectural parameters of the multicores are shown in Table 1.

    Also, in 2.5D or 3D integrated multicores, the dies are connected to each other with the silicon interposer or TSVs. If a single shared bus consisted of both on-die global wires and inter-die materials, the bus would be too long, which would limit the bus clock frequency. Thus, we assume that the shared bus consists of multi-layered buses [9]. So, the


    Table 1 Common Parameters of Multicore Architectures

    ISA                   Alpha EV6 core like
    Technology Node       32 nm
    Issue width           1 (In-Order)
    L1 (I/D) Cache        Size 32 kB, Access latency 0.43 ns,
                          Assoc. 2 ways, Block size 64 B
    L2 Shared Cache,      Assoc. 8 ways, Block size 64 B
    Common features
    Main Memory           Access latency 105 ns

    [Figure: block diagram. On each die, four cores with their L1 caches ($) share Bus Layer 0 (on-chip bus); a bus bridge (B.B.) on each die connects Bus Layer 0 to Bus Layer 1 (SI-IP bus / 3D bus), which links the dies.]

    Fig. 1 2.5D / 3D Multicore Processor Block Diagram

    on-chip network access of the assumed multicores is Non-Uniform Cache Access (NUCA). In this paper, the shared bus has two layers: the on-die bus (Bus Layer 0) and the inter-die bus (Bus Layer 1). Bus Layer 0 and Bus Layer 1 are connected to each other by a Bus Bridge, which is a FIFO buffer for transporting data between the two layers. Figure 1 shows the common block diagram of the assumed multicores.

    In this paper, it is assumed that the processing model is simultaneous multithreading. If the cores miss in the L1 cache frequently, access conflicts occur in the shared LLC.

    Also, the multicores are packaged and assembled with thermal solutions (heat spreader, heat sink, and so on), as in workstations and servers.

    Additionally, it is assumed that the multicores are designed based on the Thermal Design Power so as to stay within the rated temperature. The multicore peak power is adjusted by clock frequency scaling. The source voltage is scaled together with the clock frequency, and the ranges are based on the specifications of real commercial processors.

    3.2 Assumptions of 2.5D and 3D Stacked Multicores

    In the 2.5D multicore case, it is assumed that the multicore dies are mounted on the silicon interposer using microbumps. The dies are connected to each other through the microbumps and the redistribution layer (RDL), which is the wiring implemented on the silicon interposer. Thus, in 2.5D multicores, Bus Layer 1 is composed of the RDL, the microbumps, and the Bus Controller.

    In contrast, it is assumed that 3D stacked multicores consist of stacked dies and TSVs. Every core is connected to the shared LLC banks with on-die wires and TSVs.

    Table 2 The List of Assumed Multicore Processors

                      # of cores   # of dies   # of LLC banks   LLC size
    2.5D 8 Cores           8           2             16          16 MB
    2.5D 16 Cores         16           4             32          32 MB
    3D 8 Cores             8           2             16          16 MB
    3D 16 Cores           16           4             32          32 MB

    Table 3 Comparison between the 2.5D and 3D Multicores

    Factors                 2.5D                3D
    TSV Latency             small               up to # of dies
    RDL Latency             up to RDL length    none
    Thermal Density         low                 high
    Thermal Conductivity    high                low

    Thus, in 3D stacked multicores, Bus Layer 1 is composed of TSVs and the Bus Controller. Additionally, we assume that the dies are connected face-to-back, that is, the front face (metal layer) of every die faces the back face (bulk silicon layer) of the adjacent die. Thus, all TSVs have the same physical and electrical specifications. The multicores to be compared are shown in Table 2.

    Furthermore, in the 3D stacked multicore case, the 3D floorplan affects the thermal density of the 3D chip. In this paper, we compare several floorplans discussed in prior research [10][13][15]. In particular, we focus on three floorplans: cores placed vertically adjacent to other cores, cores placed horizontally adjacent to other cores, and cores not placed adjacent to other cores.

    4. Qualitative Performance Comparison

    The 2.5D multicore and the 3D multicore have in common that several small dies are integrated into one chip. However, their structures and wiring differ. Thus, the thermal density and the inter-FUB communication latency of the 2.5D multicore differ from those of the 3D multicore.

    Inter-die communication occurs when cores access LLC banks on other dies. In inter-die communication on the 2.5D multicore, the access latency includes the RDL latency on the silicon interposer. Prior research reports that the RDL latency increases with the RDL length [9]. Thus, the LLC access latency of the 2.5D multicore depends on the RDL wire length. In contrast, in the 3D multicore case, the access latency includes the TSV latencies. The latency of vertical TSV communication increases with the number of TSVs traversed. Prior work reports that the latency of one TSV is smaller than that of a 1 mm global wire [16]. Thus, the LLC access latency of the 3D multicore can be smaller than that of the 2.5D multicore.

    The thermal density of the 3D multicore also differs from that of the 2.5D multicore. The generated heat is dissipated quickly to the air via the package and the glue in the 2.5D multicore


    Table 4 Interconnect Physical Parameters

    Technology Node            32 nm
    TSV Latency [16]           20 ps / TSV
    RDL Latency [9]            0.12 ns / mm
    RDL Length                 2.5D 8 Cores: 11.54 mm
                               2.5D 16 Cores: 16.32 mm
    On Die Bus Clock Speed     1333 MHz

    cases. In contrast, in the 3D multicore cases, the generated heat is dissipated more slowly than in the 2.5D case, because, except for the layer next to the package, the heat must pass through the other stacked dies. Thus, the thermal conductivity of the 3D multicore is lower than that of the 2.5D multicore. Additionally, high thermal density FUBs (such as processor cores) may overlap each other in the 3D multicore case. Therefore, 3D multicores tend to be hotter than 2.5D multicores.

    From the above discussion, the qualitative comparison of the 2.5D multicore and the 3D multicore is summarized in Table 3. From this comparison, the LLC access latency of the 3D multicore is expected to be shorter than that of the 2.5D multicore. In contrast, the 3D multicore tends to be hotter than the 2.5D multicore and needs to reduce its power dissipation, for example by clock frequency scaling. In the next section, we discuss the quantitative performance comparison of the 2.5D multicore and the 3D multicore.

    5. Quantitative Performance Comparison

    5.1 LLC Access Latency

    To compare the LLC access latency, we build a model of the access latency. A shared LLC access occurs on an L1 cache miss. The LLC access latency is split into the shared bus latency $T_{Bus}$ and the LLC bank decode latency $T_{LLC\_Bank}$, so the shared LLC access latency is the sum of these two latencies. The access model is shown in equation (1):

    $$T_{LLC} = T_{Bus} + T_{LLC\_Bank} \tag{1}$$

    The base shared bus latency model is shown in equation (2). This model is based on Cho's bus latency model [17].

    $$T_{Bus} = 2 \times \frac{1}{f_{Bus}} \tag{2}$$

    The integer 2 in equation (2) represents the two stages of a bus communication (1st: REQ+ADRS, 2nd: DQ+ACK), and $f_{Bus}$ is the bus clock frequency. If a core accesses an LLC bank on the same die, the bus latency is only the Bus Layer 0 latency, which is modeled by equation (2).

    Furthermore, in the 2.5D multicore case, if a core accesses an LLC bank on another die, the shared bus latency $T_{Bus}$ includes the communication latency of the silicon interposer. Likewise, in the 3D multicore case, if a core accesses an LLC bank on another die, the shared bus latency $T_{Bus}$ includes the communication latency of the vertical TSV bus.

    The shared bus access latency model of the accesses to

    Fig. 2 Shared LLC Access Latency Evaluation Result

    the other die is shown in equations (3) and (4). In these equations, $T_{TSV}$ is the latency of one TSV, $T_{RDL}$ is the RDL latency, and $N_{TSV}$ is the number of TSVs that the signal passes through.

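    $$T_{Bus.2.5D.Worst} = 2 \times \left( \frac{1}{f_{Bus}} + T_{RDL} + \frac{1}{f_{Bus}} \right) \tag{3}$$

    $$T_{Bus.3D.Worst} = 2 \times \left( \frac{1}{f_{Bus}} + N_{TSV} T_{TSV} + \frac{1}{f_{Bus}} \right) \tag{4}$$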

    In this section, we also compare the worst-case access latency and the average access latency of the multicores. Based on Cho's latency model, the average access latencies are shown in equations (5) and (6).

    $$T_{Bus.2.5D.AVG} = \frac{4}{N_{Banks}} T_{Bus} + \frac{N_{Banks} - 4}{N_{Banks}} T_{Bus.2.5D.Worst} \tag{5}$$

    $$T_{Bus.3D.AVG} = \frac{4}{N_{Banks}} T_{Bus} + \frac{N_{Banks} - 4}{N_{Banks}} T_{Bus.3D.Worst} \tag{6}$$

    In equations (5) and (6), $N_{Banks}$ is the number of LLC banks and the integer 4 is the number of LLC banks implemented on a die.

    The decode latency of the shared LLC banks, $T_{LLC\_Bank}$, is computed with the cache modeling tool CACTI [18]. The RDL latency $T_{RDL}$ is calculated from the RDL wire length and the RDL latency per unit length. The RDL wire length is estimated from the 2.5D multicore floorplans. The RDL latency per unit length is taken from Roullard's analysis report [9], and the TSV latency is taken from Savidis's analysis report [16]. The parameters of the cache access model are listed in Table 4.

    The LLC access latency comparison results are shown in Figure 2. The unit of the bars is ns. From left to right, the bars show the breakdown of the access time to the same die (Best), the access time to other dies (Worst), and the average access time (AVG).
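    As an illustration, the bus latency model of equations (2) to (6) can be evaluated with the parameters of Table 4. The following Python sketch is not the evaluation code of this paper: the function names are hypothetical, the bank decode latency from CACTI is left out because its value is not listed here, and the number of TSVs traversed in the worst case is assumed to be the number of dies minus one.

```python
"""Sketch of the shared-bus latency model, equations (2) to (6)."""

F_BUS_HZ = 1333e6        # on-die bus clock speed (Table 4)
T_TSV_NS = 0.020         # latency per TSV: 20 ps (Table 4)
RDL_NS_PER_MM = 0.12     # RDL latency per mm (Table 4)
BANKS_PER_DIE = 4        # the integer 4 in equations (5) and (6)


def bus_cycle_ns(f_bus_hz=F_BUS_HZ):
    """One bus clock period, 1 / f_Bus, in nanoseconds."""
    return 1e9 / f_bus_hz


def t_bus_same_die(f_bus_hz=F_BUS_HZ):
    """Equation (2): two bus stages on Bus Layer 0 only."""
    return 2 * bus_cycle_ns(f_bus_hz)


def t_bus_worst(inter_die_ns, f_bus_hz=F_BUS_HZ):
    """Equations (3), (4): two stages of (cycle + inter-die latency + cycle)."""
    cycle = bus_cycle_ns(f_bus_hz)
    return 2 * (cycle + inter_die_ns + cycle)


def t_bus_avg(n_banks, worst_ns, f_bus_hz=F_BUS_HZ):
    """Equations (5), (6): weight same-die and other-die accesses by bank count."""
    local = BANKS_PER_DIE / n_banks
    return local * t_bus_same_die(f_bus_hz) + (1 - local) * worst_ns


def rdl_latency_ns(length_mm):
    """T_RDL: RDL latency per mm times the RDL wire length."""
    return RDL_NS_PER_MM * length_mm


def tsv_latency_ns(n_tsv):
    """N_TSV * T_TSV: latency of traversing n_tsv TSVs."""
    return n_tsv * T_TSV_NS


if __name__ == "__main__":
    # (label, LLC banks, RDL length in mm, dies) from Tables 2 and 4.
    # N_TSV = dies - 1 is an assumption of this sketch, not a stated value.
    configs = [("8 cores", 16, 11.54, 2), ("16 cores", 32, 16.32, 4)]
    for label, n_banks, rdl_mm, dies in configs:
        worst_25d = t_bus_worst(rdl_latency_ns(rdl_mm))
        worst_3d = t_bus_worst(tsv_latency_ns(dies - 1))
        print(label,
              "2.5D avg bus latency [ns]: %.3f" % t_bus_avg(n_banks, worst_25d),
              "3D avg bus latency [ns]: %.3f" % t_bus_avg(n_banks, worst_3d))
```

    Because one TSV adds 20 ps while the RDL adds 0.12 ns per mm over more than 10 mm, the inter-die term of equation (3) is much larger than that of equation (4) in both configurations.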


    Fig. 3 Dies Power Consumption Breakdown (4 Cores, 1.2 GHz)

    Table 5 Device Parameters and Thermal Parameters

    Host Die                            thickness 0.89 mm, 100 W/(m-K)
    Stacked Die                         thickness 0.02 mm, 100 W/(m-K)
    Heat Spreader                       0.82 x 50 x 50 mm^3, 400 W/(m-K)
    Heat Sink                           8.70 x 75 x 75 mm^3, 400 W/(m-K)
    Convective Thermal Conductivity     2 W/K

    Figure 2 shows that the LLC access latencies of the 3D multicore are shorter than those of the 2.5D multicore. Looking at the individual factors, the RDL latency accounts for a large part of the 2.5D multicore LLC access latency: 31% in the 8-core case and 38% in the 16-core case. In contrast, the TSV latencies account for only a small part of the 3D multicore LLC access latency, which shows that the additional LLC access latency from increasing the number of stacked dies is small. As a result, the 3D multicore has the better LLC access latency: according to the models, it is shorter than that of the 2.5D multicore by 24.4% in the 8-core case and by 34.1% in the 16-core case.

    5.2 Maximum Clock Frequency Under the Thermal Constraint

    In this section, we analyze the temperature of the multicore processors in order to compare their rated clock frequencies under the thermal constraint. From the temperature analysis, we obtain the correlation between the clock frequency and the operating temperature of the multicores, and from it the clock frequency under the thermal constraint.

    The temperature analysis is done with the thermal simulation tool HotSpot 5.0 [19], which supports temperature analysis of 3D processors. Table 5 summarizes the physical parameters for the thermal simulation. The air temperature is fixed to 30 °C, and the rated temperature is set to 90 °C.

    Furthermore, in this paper, we use the steady-state temperature of the multicores for the comparison. We assume that the power dissipation of a 3D multi-core is always at its peak (which is the actual worst case), and we observe the chip temperature as the steady-state temperature after sufficient warm-up. The peak power is calculated as the sum of the dynamic power and the static power:

    $$P_{Peak} = \alpha C_L f V_{dd}^2 + P_{Static} \tag{7}$$
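    As a rough illustration of equation (7), the sketch below scales the dynamic term with the clock frequency and the source voltage. It uses the base point (1200 MHz, 1.10 V) and the voltage step of 0.05 V per 200 MHz stated in this section; interpolating the voltage linearly between steps is an assumption of this sketch, and the function names and the power inputs `base_dynamic_w` and `static_w` are hypothetical placeholders for values that would come from a power model.

```python
"""Sketch of peak-power scaling under equation (7)."""

BASE_FREQ_MHZ = 1200.0   # base clock frequency
BASE_VDD = 1.10          # base source voltage in volts
VDD_STEP = 0.05          # voltage change in volts ...
FREQ_STEP_MHZ = 200.0    # ... per this much frequency change


def vdd_at(freq_mhz):
    """Source voltage at freq_mhz (linear interpolation is an assumption)."""
    return BASE_VDD + VDD_STEP * (freq_mhz - BASE_FREQ_MHZ) / FREQ_STEP_MHZ


def dynamic_power_scale(freq_mhz):
    """Ratio of dynamic power at freq_mhz to dynamic power at the base point.

    With the switching activity and load capacitance unchanged, the dynamic
    term of equation (7) scales as f * Vdd^2.
    """
    return (freq_mhz / BASE_FREQ_MHZ) * (vdd_at(freq_mhz) / BASE_VDD) ** 2


def peak_power(freq_mhz, base_dynamic_w, static_w):
    """Equation (7): scaled dynamic power plus temperature-independent static power."""
    return base_dynamic_w * dynamic_power_scale(freq_mhz) + static_w


if __name__ == "__main__":
    for freq in (800.0, 1200.0, 1600.0, 2000.0, 2400.0):
        print("%6.0f MHz  Vdd = %.2f V  dynamic power x%.3f"
              % (freq, vdd_at(freq), dynamic_power_scale(freq)))
```

    Since the voltage rises with the frequency, the dynamic power grows faster than linearly in the clock frequency, which is why modest frequency reductions can bring a stacked design back under its thermal limit.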

    [Figure: floorplans. (a) Common-Die in 2.5D Cases: Core #0 to Core #3 and an LLC Banks Array (8MB). (b) 2.5D 8 Cores and (c) 2.5D 16 Cores: dies on the Si Interposer connected by the Inter-Die Bus (RDLs). (d) Homogeneous-Stack, (e) Heterogeneous-Stack, and (f) Checker-Stack (each top: 8 Cores, bottom: 16 Cores): dies connected by the Inter-Die Bus (TSVs).]

    Fig. 4 Multicore Processor Floorplans

    In equation (7), the first term is the dynamic power $P_{Dynamic}$ and $P_{Static}$ is the static power; $\alpha$ is the transistor switching activity, $C_L$ is the load capacitance, $f$ is the clock frequency, and $V_{dd}$ is the source voltage. Equation (7) shows that the dynamic power scales with the clock frequency. The source voltage is also scaled with the clock frequency: we assume that the voltage varies by 0.05 V per 200 MHz, following [20]. For the peak power dissipation, we assume that the switching activity is always 1. In this paper, the base clock frequency is 1200 MHz and the base voltage is 1.10 V. The base power breakdown used in the thermal simulation is calculated with the multicore power model McPAT [21]. For example, Figure 3 shows the peak power breakdown of the 4 cores + 8 MB multi-core die at 1200 MHz.

    Also, in the thermal simulation, we assume that the static power $P_{Static}$ does not vary as the temperature rises. In reality, the static power is sensitive to the chip temperature. The leakage currents and the temperature sensitivity of 3D multicores are well analyzed by prior research [?].

    Figure 4 shows the floorplans of the multicores in the thermal simulation. Floorplans (b) and (c) are the 2.5D multicore floorplans, and floorplans (d), (e), and (f) are the 3D multicore floorplans. We assume that the dies are mounted on the silicon interposer in the 2.5D multicore case, and that the dies are stacked and connected by TSVs in the 3D multicore case. In these floorplans, every die has cores and/or shared LLC banks. In the 2.5D multicores and the homogeneous-stacked 3D multicores, all dies are identical to the common die (a), which has 4 cores and 8 MB of shared LLC banks. The floorplans and the areas of the FUBs are taken from prior reports [19][21][22]. We also assume that the gaps between the dies on the silicon interposer are 0.5 mm, based on the prior report [9]. Furthermore, the other functional blocks (like a shared bus,


    [Figure: steady temperature (°C, axis from 30 to 110) versus clock frequency (GHz, axis from 0 to 2.8). Series: 8 Cores (2D); 16 Cores (2D); 8 Cores (2 Tiers, Homo); 8 Cores (2 Tiers, Hetero); 8 Cores (2 Tiers, Checker); 16 Cores (4 Tiers, Homo); 16 Cores (4 Tiers, Hetero); 16 Cores (4 Tiers, Checker). A horizontal line marks the safe temperature of 90 °C.]

    Fig. 5 Steady Temperature of the Multicores

    Bus controller, TSVs, and so on) are included in the LLC bank arrays in Figure 4.

    For mitigating the temperature in the 3D stacked multicores, the stacking structure is an important factor [3][13]. Thus, we analyze three floorplans in the 3D multicore temperature analysis, illustrated in (d), (e), and (f) of Figure 4. Homogeneous-stacking (d) is a floorplan which tends to generate hot spots because the cores overlap vertically. The Heterogeneous-stacking (e) floorplan consists of core-only layers and LLC-bank-only layers. In floorplan (e), the cores are placed near the heat sink, so the generated heat is dissipated from the cores immediately. Checker-stacking (f) is a floorplan in which the cores do not adjoin each other. The generated heat is therefore diffused uniformly in the processor, and hot spots should hardly occur.

    Figure 5 shows the results of the temperature analysis. In the figure, the horizontal axis is the clock frequency of the multicores and the vertical axis is the steady temperature of the hot spot in the multicores.

    The results show that the temperature of the 3D multicore tends to be higher than that of the 2.5D multicore. This tendency is caused by the differences in the heat density and in the thermal conductivity from the dies to the heat sink. The effect of the heat density also appears among the 3D multicore floorplans. Homogeneous-stack causes a higher temperature than the other floorplans because of its high heat density. The steady temperature of the Checker-stack floorplan is slightly higher than that of the Heterogeneous-stack floorplan, because some of the cores are placed on layers far from the heat sink in the Checker-stack floorplan. Some of the generated heat is thus dissipated slowly to the outside of the package, and the steady temperature is higher than in the Heterogeneous-stack case.

    From the temperature analysis, we obtain the maximum clock frequency of the 2.5D and 3D multicores. First, we fit an approximate curve to the plots in Figure 5. Next, we obtain the maximum clock frequency from the intersection of the approximate curve and the safe temperature line. In this paper, we assume that the safe temperature is 90 °C. Table 6 shows the resulting rated clock frequencies for the 2.5D and 3D multicores obtained by the

    Table 6 Maximum Clock Frequency of the Multicores under the Safe Temperature

                                   8 Cores     16 Cores
    2.5D                           2.53 GHz    1.75 GHz
    3D  homogeneous-stack          2.22 GHz    1.40 GHz
    3D  heterogeneous-stack        2.53 GHz    1.59 GHz
    3D  checker-stack              2.44 GHz    1.54 GHz

    [Figure: normalized BIPS (axis from 0.7 to 1.2) for the 2.5D, homo, hetero, and check configurations, in two groups: 8 Cores (16MB LLC) and 16 Cores (32MB LLC).]

    Fig. 6 Effective Performance of the Multicores

    temperature analysis. The results show that the 3D multicores need a lower clock frequency than the 2.5D multicores. They also show that thermally-aware 3D multicore floorplans mitigate the clock frequency overhead.

    5.3 Effective Performance

    Using the results of the above sections, we compare the effective performance of the 2.5D and 3D multicores. In this comparison, the effective performance metric is BIPS (Billion Instructions Per Second), which is calculated by the following model:

    $$BIPS = IPC \times f_{MAX} \tag{8}$$

    In equation (8), $IPC$ is the average number of executed Instructions Per Clock cycle, and $f_{MAX}$ (GHz) is the clock frequency under the thermal constraint from Table 6. The IPC is obtained from multicore simulation. To get the average IPC, we use the processor simulator Gem5, which is provided by the University of Michigan [23]. The multicores to be compared are listed in Table 2, and the assumptions about the core architecture are shown in Table 1. The LLC access latency and the clock frequency under the thermal constraint are taken from the results of Sections 5.1 and 5.2. We obtain the average IPC over several workloads: in this simulation, we use a total of 21 parallelized benchmark programs from SPLASH-2 [24] and PARSEC [25], with the input sizes train and simlarge.

    In addition, in our experimental environment, we cannot simulate the multi-layer bus configuration. Thus, we simulate a single bus configuration, and the access time to the shared LLC banks is fixed to the average LLC access time calculated in Section 5.1.

    Figure 6 shows the experimental results. The horizontal items


    Table 7 The List of the Assumed LLC-stacked Multicores

                      # of cores   # of dies   # of LLC banks   LLC size
    2.5D 4 Cores           4           2             24          24 MB
    2.5D 8 Cores           8           4             48          48 MB
    3D 4 Cores             4           2             24          24 MB
    3D 8 Cores             8           4             48          48 MB

    represent the multicores being compared, and the vertical axis is the BIPS normalized to the 2.5D multicore case. The top and bottom of each bar show the maximum and minimum performance over all program simulations, and the central marker of each bar shows the geometric mean of the performance over all program simulations.

    The results show that the 2.5D multicores perform well compared to the 3D multicores. For example, in the 8-core case, the 3D multicores perform worse than the 2.5D multicore except in the Heterogeneous-stacking case. This means that the disadvantage of the 3D multicores (lower clock frequency) outweighs their advantage (shorter LLC access latency). In the 16-core case, the performance gap is wider than in the 8-core case because of the clock frequency overhead. However, when executing memory-intensive programs, the performance of the 3D multicores is higher than that of the 2.5D multicores, because LLC accesses occur frequently in memory-intensive programs and the LLC access latency therefore strongly affects the effective performance. Thus, focusing only on the execution of memory-intensive workloads, the 3D multicores have a competitive advantage over the 2.5D multicores.
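    The trade-off described above can be made concrete with equation (8): a 3D multicore matches the 2.5D multicore in BIPS only if its IPC gain from the shorter LLC access latency at least offsets its lower clock frequency. The following Python sketch computes this break-even IPC ratio from the clock frequencies of Table 6. The function names are hypothetical, and the sketch does not reproduce the simulated IPC values, which are not listed in this paper.

```python
"""Sketch: break-even IPC gain for 3D multicores under equation (8)."""

# Maximum clock frequencies in GHz under the safe temperature (Table 6).
F_MAX_GHZ = {
    "8 cores": {"2.5D": 2.53, "homogeneous": 2.22,
                "heterogeneous": 2.53, "checker": 2.44},
    "16 cores": {"2.5D": 1.75, "homogeneous": 1.40,
                 "heterogeneous": 1.59, "checker": 1.54},
}


def bips(ipc, f_max_ghz):
    """Equation (8): BIPS = IPC * f_MAX, with f_MAX in GHz."""
    return ipc * f_max_ghz


def break_even_ipc_ratio(f_25d_ghz, f_3d_ghz):
    """Smallest IPC_3D / IPC_2.5D for which the 3D BIPS reaches the 2.5D BIPS."""
    return f_25d_ghz / f_3d_ghz


if __name__ == "__main__":
    for cores, freqs in F_MAX_GHZ.items():
        for floorplan in ("homogeneous", "heterogeneous", "checker"):
            ratio = break_even_ipc_ratio(freqs["2.5D"], freqs[floorplan])
            print("%-8s %-13s needs IPC gain of at least %+.1f%%"
                  % (cores, floorplan, (ratio - 1.0) * 100.0))
```

    A workload benefits from 3D stacking only when the IPC improvement from the shorter LLC access latency exceeds this ratio, which is consistent with the advantage appearing for memory-intensive programs.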

    5.4 Additional Experiment

    From the comparison results in Section 5.3, the heat density strongly affects the effective performance. This thermal problem is expected to be mitigated in a 3D multicore in which all stacked FUBs have low heat density. Therefore, we also compare the performance of multicores stacked with a large cache [1] [26], which are low-power processors compared with the core-stacked multicores in Section 3.

    The assumed cache-stacked multicores have the same architectures, on-chip networks, processing, and stacked structures as the assumed multicores in Section 3. However, they differ from the multicores in Section 3 in the number of cores and LLC banks. Table 7 shows the target cache-stacked multicores, which are based on the multicore configurations in Section 3; in this section, half of the cores are replaced with LLC banks. Also, in this section we consider only the Heterogeneous-stacking floorplan for the 3D cache-stacked multicores.

    For these cache-stacked multicores, Figure 7 shows

    the LLC access latency comparison results. The LLC access latency is calculated from the models in Section 5.1. Figure 7 shows that the 3D stacked multicores perform well compared with the 2.5D multicores. As mentioned in Section 5.1, this is because the inter-die communication latency of the 3D stacked multicores is smaller than that of the 2.5D multicores.

    Fig. 7 Shared LLC Access Latency Evaluation Result

    Table 8 Maximum Clock Frequency of the LLC-stacked Multicores under the Safe Temperature

                                 4 Cores     8 Cores
    2.5D                         2.89 GHz    2.17 GHz
    3D (heterogeneous-stack)     3.01 GHz    2.04 GHz

    Next, Table 8 shows the maximum clock frequency of the cache-stacked multicores under the thermal constraints. These clock frequencies are obtained with the thermal simulation and the calculation method in Section 5.2. Table 8 shows that the thermal problem of the 3D multicore is mitigated when low-heat-density FUBs are stacked, as in the cache-stacked multicores. Thus, in the cache-stacked case, the 3D multicore can potentially operate at a clock frequency as high as that of the 2.5D multicore. In this case, the clock frequency of the 3D multicore is +4.1% in the 4-core case and -6.1% in the 8-core case relative to the 2.5D multicores.

    Finally, Figure 8 shows the normalized effective performance of the cache-stacked multicores. These results are obtained with the experimental environment in Section 5.3. Figure 8 shows that the 3D multicores perform well compared with the 2.5D multicores (on average +6.3% with 4 cores and +0.7% with 8 cores). Thus, this paper makes the following contributions:

    3D stacked multicores can potentially achieve a shorter inter-die access latency than 2.5D multicores.

    The heat density strongly affects the clock frequency of the 3D stacked multicore. The 2.5D multicore can potentially operate at a higher clock frequency than the 3D stacked multicore.

    The effective performance comparison shows that, under our assumptions, the clock frequency affects performance more strongly than the LLC access latency. 3D multicores can obtain an advantage over 2.5D multicores only when low-heat-density FUBs are stacked. Thus, for high-performance operation in 3D stacked multicores, it is


    [Figure 8: normalized BIPS (vertical axis, 0.85 to 1.25) of the 2.5D and 3D (hetero) multicores, for 4 Cores (24MB LLC) and 8 Cores (48MB LLC)]

    Fig. 8 Effective Performance of the LLC stacked Multicores

    more important to reduce the heat density (for example, by reducing the generated heat and by thermal-diffusion-oriented floorplanning) than to reduce the LLC access latency.
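Because equation (8) is a product, a normalized BIPS figure can be split into a clock frequency ratio and an IPC ratio. The sketch below does this for the cache-stacked multicores, using the Table 8 frequencies and the average gains quoted for Figure 8. It rests on two assumptions that the text does not state explicitly: that the quoted averages can be treated as a mean of per-program ratios for which this split holds (it holds exactly for a geometric mean, since the frequency is the same for every program), and that the rounded table entries are accurate enough, so the recomputed frequency differences may deviate slightly from the percentages quoted in the text. All names in the code are hypothetical.

```python
# Maximum clock frequencies in GHz, copied from Table 8.
F_MAX_GHZ = {
    "4 cores": {"2.5D": 2.89, "3D": 3.01},
    "8 cores": {"2.5D": 2.17, "3D": 2.04},
}

# Average BIPS gain of the 3D design over the 2.5D design (Figure 8 text).
AVG_BIPS_GAIN = {"4 cores": 0.063, "8 cores": 0.007}


def frequency_ratio(config: str) -> float:
    """Clock frequency of the 3D design relative to the 2.5D design."""
    f = F_MAX_GHZ[config]
    return f["3D"] / f["2.5D"]


def implied_ipc_ratio(config: str) -> float:
    """IPC ratio implied by BIPS ratio = IPC ratio x frequency ratio."""
    return (1.0 + AVG_BIPS_GAIN[config]) / frequency_ratio(config)


for config in F_MAX_GHZ:
    f_ratio = frequency_ratio(config)
    ipc_ratio = implied_ipc_ratio(config)
    print(
        f"{config}: frequency {100 * (f_ratio - 1):+.1f}%, "
        f"implied average IPC {100 * (ipc_ratio - 1):+.1f}%"
    )
```

Under these assumptions, the 4-core advantage comes mostly from the higher clock frequency, while in the 8-core case the IPC improvement from the shorter LLC access latency has to compensate for the lower clock frequency.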

    6. Conclusion

    This work compares the 2.5D multicore processor and the inter-FUB 3D stacked multicore processor using three evaluation metrics: the shared cache access latency, the rated clock frequency under thermal constraints, and the effective performance under thermal constraints. The model-based comparison of the shared cache access latency shows that the 3D multicore has a 24.4% to 34.1% shorter average access latency than the 2.5D multicore. The rated clock frequency comparison using the thermal model shows that the 3D multicore requires a lower clock frequency design than the 2.5D multicore. Finally, the effective performance comparison using processor simulation shows that the 3D multicore can obtain an advantage over the 2.5D multicore under some conditions, namely when processing memory-intensive workloads and when stacking low-heat-density FUBs. Under the other conditions, the 2.5D multicore is the better design point for performance.

    Acknowledgments

    We would like to thank the reviewers for their feedback on the paper. Special thanks go to Nobuaki Miyakawa, Masayoshi Yoshimura, and Krishna Chaitanya Nunna for their cooperation. Part of this work is collaborative research with Panasonic. Furthermore, we used the Computing System for Research at the Research Institute for Information Technology, Kyushu University.

    References

    [1] B. Black, M. Annavaram, N. Brekelbaum, J. DeVale, L. Jiang, G. H. Loh, D. McCauley, P. Morrow, D. W. Nelson, D. Pantuso, P. Reed, J. Rupley, S. Shankar, J. Shen, and C. Webb, "Die Stacking (3D) Microarchitecture," in Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 39, pp. 469–479, 2006.

    [2] G. H. Loh, Y. Xie, and B. Black, "Processor Design in 3D Die-Stacking Technologies," IEEE Micro, vol. 27, pp. 31–48, May–June 2007.

    [3] Y. Xie, G. H. Loh, B. Black, and K. Bernstein, "Design Space Exploration for 3D Architectures," J. Emerg. Technol. Comput. Syst., vol. 2, pp. 65–103, Apr. 2006.

    [4] K. Puttaswamy and G. H. Loh, "Scalability of 3D-integrated Arithmetic Units in High-Performance Microprocessors," in Proceedings of the 44th annual Design Automation Conference, DAC '07, pp. 622–625, 2007.

    [5] K. Puttaswamy and G. H. Loh, "Thermal Herding: Microarchitecture Techniques for Controlling Hotspots in High-Performance 3D-Integrated Processors," in Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture, HPCA '07, pp. 193–204, 2007.

    [6] J. Kim, C. Nicopoulos, D. Park, R. Das, Y. Xie, V. Narayanan, M. S. Yousif, and C. R. Das, "A Novel Dimensionally-decomposed Router for On-chip Communication in 3D Architectures," in Proceedings of the 34th annual International Symposium on Computer Architecture, ISCA '07, pp. 138–149, 2007.

    [7] F. Li, C. Nicopoulos, T. Richardson, Y. Xie, V. Narayanan, and M. Kandemir, "Design and Management of 3D Chip Multiprocessors Using Network-in-Memory," in Proceedings of the 33rd annual International Symposium on Computer Architecture, ISCA '06, pp. 130–141, 2006.

    [8] G. Kumar, T. Bandyopadhyay, V. Sukumaran, V. Sundaram, S. K. Lim, and R. Tummala, "Ultra-high I/O Density Glass/Silicon Interposers for High Bandwidth Smart Mobile Applications," in Electronic Components and Technology Conference (ECTC), 2011 IEEE 61st, pp. 217–223, May 31–June 3, 2011.

    [9] J. Roullard, A. Farcy, S. Capraro, T. Lacrevaz, C. Bermond, G. Houzet, J. Charbonnier, C. Fuchs, C. Ferrandon, P. LeDuc, and B. Flechet, "Evaluation of 3D Interconnect Routing and Stacking Strategy to Optimize High Speed Signal Transmission for Memory on Logic," in Electronic Components and Technology Conference (ECTC), 2012 IEEE 62nd, pp. 8–13, May 29–June 1, 2012.

    [10] M. Awasthi and R. Balasubramonian, "Exploring the Design Space for 3D Clustered Architectures," in 3rd IBM Watson Conference on Interaction between Architecture, Circuits, and Compilers, 2006.

    [11] K. Puttaswamy and G. Loh, "3D-Integrated SRAM Components for High-Performance Microprocessors," IEEE Transactions on Computers, vol. 58, pp. 1369–1381, Oct. 2009.

    [12] K. Puttaswamy and G. H. Loh, "Thermal Analysis of a 3D Die-stacked High-performance Microprocessor," in Proceedings of the 16th ACM Great Lakes Symposium on VLSI, GLSVLSI '06, pp. 19–24, 2006.

    [13] E. Kursun, J. Wakil, and M. Iyengar, "Analysis of Spatial and Temporal Behavior of Three-dimensional Multi-core Architectures Towards Run-time Thermal Management," in Thermal and Thermomechanical Phenomena in Electronic Systems (ITherm), 2010 12th IEEE Intersociety Conference on, pp. 1–8, June 2010.

    [14] G. L. Loi, B. Agrawal, N. Srivastava, S.-C. Lin, T. Sherwood, and K. Banerjee, "A Thermally-aware Performance Analysis of Vertically Integrated (3-D) Processor-memory Hierarchy," in Proceedings of the 43rd annual Design Automation Conference, DAC '06, pp. 991–996, 2006.

    [15] X. Zhou, Y. Xu, Y. Du, Y. Zhang, and J. Yang, "Thermal Management for 3D Processors via Task Scheduling," in Proceedings of the 2008 37th International Conference on Parallel Processing, ICPP '08, pp. 115–122, 2008.

    [16] I. Savidis, S. M. Alam, A. Jain, S. Pozder, R. E. Jones, and R. Chatterjee, "Electrical Modeling and Characterization of Through-silicon Vias (TSVs) for 3-D Integrated Circuits," Microelectronics Journal, vol. 41, no. 1, pp. 9–16, 2010.

    [17] Y.-S. Cho, E.-J. Choi, and K.-R. Cho, "Modeling and Analysis of the System Bus Latency on the SoC Platform," in Proceedings of the 2006 International Workshop on System-level Interconnect Prediction, SLIP '06, pp. 67–74, 2006.

    [18] S. Wilton and N. Jouppi, "CACTI: An Enhanced Cache Access and Cycle Time Model," IEEE Journal of Solid-State Circuits, vol. 31, no. 5, pp. 677–688, 1996.

    [19] K. Skadron, M. R. Stan, W. Huang, S. Velusamy, K. Sankaranarayanan, and D. Tarjan, "Temperature-aware Microarchitecture," SIGARCH Comput. Archit. News, vol. 31, no. 2, pp. 2–13, 2003.

    [20] J. Charles, P. Jassi, N. Ananth, A. Sadat, and A. Fedorova, "Evaluation of the Intel Core i7 Turbo Boost Feature," in Workload Characterization, 2009. IISWC 2009. IEEE International Symposium on, pp. 188–197, Oct. 2009.

    [21] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi, "McPAT: An Integrated Power, Area, and Timing Modeling Framework for Multicore and Manycore Architectures," in Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 42, pp. 469–480, 2009.

    [22] A. Jain, W. Anderson, T. Benninghoff, D. Berucci, M. Braganza, J. Burnetie, et al., "A 1.2 GHz Alpha Microprocessor with 44.8 GB/s Chip Pin Bandwidth," in Solid-State Circuits Conference, 2001. Digest of Technical Papers. ISSCC. 2001 IEEE International, pp. 240–241, 2001.

    [23] N. L. Binkert, R. G. Dreslinski, L. R. Hsu, K. T. Lim, A. G. Saidi, and S. K. Reinhardt, "The M5 Simulator: Modeling Networked Systems," IEEE Micro, vol. 26, no. 4, pp. 52–60, 2006.

    [24] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta, "The SPLASH-2 Programs: Characterization and Methodological Considerations," in Proceedings of the 22nd annual International Symposium on Computer Architecture, ISCA '95, pp. 24–36, 1995.

    [25] C. Bienia, S. Kumar, J. P. Singh, and K. Li, "The PARSEC Benchmark Suite: Characterization and Architectural Implications," in Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, PACT '08, pp. 72–81, 2008.

    [26] G. Sun, X. Wu, and Y. Xie, "Exploration of 3D Stacked L2 Cache Design for High Performance and Efficient Thermal Control," in Proceedings of the 14th ACM/IEEE International Symposium on Low Power Electronics and Design, ISLPED '09, pp. 295–298, 2009.