slides 115 of hello

Upload: manpreetgugnani

Post on 02-Jun-2018

TRANSCRIPT

  • 8/11/2019 Slides 115 of hello

    1/18

    University of Michigan, Electrical Engineering and Computer Science

    Application-Specific Processing on a General Purpose Core via Transparent Instruction Set Customization

    Nathan Clark, Manjunath Kudlur, Hyunchul Park, Scott Mahlke, Krisztián Flautner*

    Advanced Computer Architecture Lab, University of Michigan

    *ARM Ltd.

  • 8/11/2019 Slides 115 of hello

    2/18

    A Case for Customization

    General purpose processors handle many applications fairly well, but:

    Each application has different requirements

    Need for efficient execution

    Impressive design wins through customization:

    Performance, power, area

    Up to 3.5x speedup [Hot Chips 16]

  • 8/11/2019 Slides 115 of hello

    3/18

    Instruction Set Customization

    Computationally demanding parts of applications run on special hardware

    New instructions use the special hardware

    [Figure: dataflow graph with nodes XOR, MPY, LD, SHR, MOV, and AND, in which a group of operations is replaced by a single CUSTOM instruction]

  • 8/11/2019 Slides 115 of hello

    4/18

    Traditional vs. Transparent Customization

    Traditional: high non-recurring engineering (NRE) costs

    Transparent: universal accelerator, no ISA change

    [Figure: CPUs under the Traditional and Transparent approaches, with a Compute Accelerator (CCA)]

  • 8/11/2019 Slides 115 of hello

    5/18

    Design of a Compute Accelerator

    Goal: support important computation subgraphs

    Array of function units (FUs)

    Exploits subgraph parallelism

    Allows natural data propagation

    [Figure: array of FUs fed by inputs IN 1 and IN 2; pipeline with Fetch, Issue, two ALUs alongside the CCA, and WB]

  • 8/11/2019 Slides 115 of hello

    6/18

    CCA Shape: 164.gzip

    [Figure: a subgraph of Or, And, and Mov operations from 164.gzip laid out on the FU array, with a count of 1 at each FU position it uses]

  • 8/11/2019 Slides 115 of hello

    7/18

    CCA Shape: Blowfish

    [Figure: a subgraph of And, Xor, Add, and Mov operations from Blowfish laid out on the FU array, with counts of 1 and 2 at the FU positions used]
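
The two shape slides suggest that each operation in a subgraph lands in a row determined by how deep it sits in the dependence chain. As a minimal sketch of that idea (not the authors' tool), the code below assigns each op to a row and reports how many ops fall in each row. The graph representation, the function names `assign_rows` and `shape`, and the example subgraph are all hypothetical.

```python
from collections import Counter


def assign_rows(graph):
    """Map each op to a row: one more than the deepest row of its producers.

    `graph` maps an op name to the list of op names it reads from
    (hypothetical representation; external inputs are simply not listed).
    """
    rows = {}

    def row_of(op):
        if op not in rows:
            rows[op] = 1 + max((row_of(p) for p in graph[op]), default=0)
        return rows[op]

    for op in graph:
        row_of(op)
    return rows


def shape(graph):
    """Return the number of ops in each row, from row 1 downward."""
    if not graph:
        return []
    per_row = Counter(assign_rows(graph).values())
    return [per_row[r] for r in range(1, max(per_row) + 1)]


# Hypothetical subgraph: two independent ops feed an Or, which feeds an And.
example = {
    "and1": [],
    "mov1": [],
    "or1": ["and1", "mov1"],
    "and2": ["or1"],
}

print(assign_rows(example))  # {'and1': 1, 'mov1': 1, 'or1': 2, 'and2': 3}
print(shape(example))        # [2, 1, 1]
```

Summing such shapes over many subgraphs is one way to arrive at per-position usage counts like those shown on the slides.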

  • 8/11/2019 Slides 115 of hello

    8/18

    CCA Utilization

    Dynamic % of subgraphs using each FU, by row and column of the FU array

    Row   Col 1   Col 2   Col 3   Col 4   Col 5   Col 6   Col 7
    1     100     59.0    22.9    13.1    6.5     4.2     0.3
    2     91.1    50.6    9.9     4.1     0.6     0.2     0.0
    3     57.4    17.8    6.3     2.9     0.1     0.0     0.0
    4     18.5    8.3     1.6     0.1     0.0     0.0     0.0
    5     8.7     2.1     0.1     0.0     0.0     0.0     0.0
    6     2.1     1.2     0.1     0.0     0.0     0.0     0.0
    7     1.2     0.1     0.1     0.0     0.0     0.0     0.0
    8     0.1     0.1     0.0     0.0     0.0     0.0     0.0

  • 8/11/2019 Slides 115 of hello

    9/18

    CCA Operations

    Dynamic opcodes in important subgraphs

    Excluded mpy/div, load/store, branch

    Two main categories: logicals, adds

    Subgraphs rarely have more than 3 dependent adds

    Opcode    %
    Add       28.7
    And       12.5
    Move      11.7
    Sext      10.4
    Lshift    9.8
    Or        8.7
    Xor       5.1
    Sub       4.8
    Rshift    2.4
    Compare   0.4

  • 8/11/2019 Slides 115 of hello

    10/18

    Proposed CCA Design

    4 inputs/2 outputs

    Two FU types: Arith/logic, Logic

    Crossbar between rows

    Captures > 99% of important subgraphs

    [Figure: FU array with inputs I1, I2, I3, I4 and outputs O1, O2]

  • 8/11/2019 Slides 115 of hello

    11/18

    Synthesis of CCA

    Synopsys design tools, 130nm library

    Depth   Configuration          Control (bits)   Delay (ns)   Cell area (mm^2)   Subgraphs supported
    7       6A-4L-4A-3L-2A-2L-1L   245              5.62         0.48               99.3%
    6       6A-4L-4A-3L-2A-1L      229              4.56         0.45               95.1%
    5       6A-4L-4A-2L-1L         197              3.50         0.40               87.6%
    4       6A-4L-3A-2L            172              3.19         0.38               81.8%
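
The configuration strings in the table can be read as one entry per row of the array, with a count of FUs and a type letter. The sketch below assumes that "A" means an arith/logic FU and "L" means a logic-only FU, which matches the two FU types named on the design slide but is not spelled out in the table. The helper names, the `ARITH_OPS` set, and the placement rule (each dependence level goes into one row, in order) are simplifying assumptions for illustration, not the authors' mapping algorithm.

```python
def parse_config(config):
    """Turn '6A-4L-3A-2L' into [(6, 'A'), (4, 'L'), (3, 'A'), (2, 'L')]."""
    return [(int(row[:-1]), row[-1]) for row in config.split("-")]


def total_fus(config):
    """Total number of function units across all rows."""
    return sum(count for count, _ in parse_config(config))


# Assumed split: these opcodes need an arith/logic (A) row; anything else is
# treated as a logic op that can sit in either row type.
ARITH_OPS = {"add", "sub", "compare"}


def fits(config, levels):
    """Greedy check that a levelized subgraph fits a configuration.

    `levels[i]` lists the opcodes at dependence level i. Each level is
    placed in the first remaining row that is wide enough and, if the
    level contains an arithmetic op, is an A row. Rows may be skipped.
    """
    rows = parse_config(config)
    next_row = 0
    for level in levels:
        needs_arith = any(op in ARITH_OPS for op in level)
        while next_row < len(rows):
            count, kind = rows[next_row]
            next_row += 1
            if len(level) <= count and (kind == "A" or not needs_arith):
                break  # this level is placed; move on to the next one
        else:
            return False  # ran out of rows before placing this level
    return True


print(total_fus("6A-4L-4A-3L-2A-2L-1L"))                            # 22
print(fits("6A-4L-3A-2L", [["add"], ["xor"], ["add"], ["xor"]]))    # True
print(fits("6A-4L-3A-2L", [["add"], ["add"], ["add"]]))             # False
print(fits("6A-4L-4A-3L-2A-2L-1L", [["add"], ["add"], ["add"]]))    # True
```

Under these assumptions, a chain of three dependent adds needs three A rows, so it fits the depth-7 configuration but not the depth-4 one.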

  • 8/11/2019 Slides 115 of hello

    12/18

    CCA Utilization

    Design space: selection (static or dynamic) vs. realization (static or dynamic)

    Static selection, static realization: ASIPs
    - ISA change
    - High NRE

    Static selection, dynamic realization:
    + Powerful selection
    + Simple hardware
    - Some ISA change
    - Recompile necessary

    Dynamic selection, dynamic realization:
    + No ISA change
    + No recompile
    - Simple selection
    - Hardware complexity

  • 8/11/2019 Slides 115 of hello

    13/18

    Dynamic Selection, Dynamic Realization

    Detect and replace subgraphs in fill unit of trace cache

    [Figure: pipeline diagram with I-Cache, Trace Cache, Decode, Execute, Retire, Trace Construction, and Subgraph Selection and Insertion]

    Original trace:

    ADD r4, r1, #1
    LSR r2, r2, #4
    XOR r5, r4, r2
    LD r3
    ADD r6, r5, r3
    XOR r7, r6, r8
    SHR

    Trace after subgraph replacement:

    LSR r2, r2, #4
    LD r3
    CUSTOM
    SHR
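
The replacement shown on this slide can be sketched in software, under some loud assumptions. The code below treats only ADD and XOR as collapsible (chosen so the result matches the slide's example), collapses every such instruction in the trace into one CUSTOM op, and hoists the remaining earlier instructions above it after checking register dependences. The `Inst` type, `COLLAPSIBLE`, and `replace_subgraph` are hypothetical names; the real fill-unit logic is hardware and is not described at this level in the slides.

```python
from typing import NamedTuple, Optional, Tuple


class Inst(NamedTuple):
    op: str
    dst: Optional[str] = None
    srcs: Tuple[str, ...] = ()


# Assumption for this sketch: only these opcodes are collapsed.
COLLAPSIBLE = {"ADD", "XOR"}


def replace_subgraph(trace):
    """Collapse all COLLAPSIBLE instructions in `trace` into one CUSTOM op."""
    selected = [i for i, inst in enumerate(trace) if inst.op in COLLAPSIBLE]
    if not selected:
        return list(trace)
    last = selected[-1]

    before, after = [], []
    for i, inst in enumerate(trace):
        if inst.op in COLLAPSIBLE:
            continue
        if i > last:
            after.append(inst)  # already follows the whole subgraph
            continue
        # Hoisting `inst` above earlier subgraph ops must not break a
        # read-after-write, write-after-read, or write-after-write dependence.
        for j in selected:
            if j > i:
                break
            s = trace[j]
            conflict = s.dst in inst.srcs or (
                inst.dst is not None
                and (inst.dst in s.srcs or inst.dst == s.dst)
            )
            if conflict:
                raise ValueError(f"cannot hoist {inst.op} above {s.op}")
        before.append(inst)

    # Register inputs of the subgraph: read inside it, not produced inside it.
    produced, inputs = set(), []
    for j in selected:
        for src in trace[j].srcs:
            if src.startswith("r") and src not in produced and src not in inputs:
                inputs.append(src)
        produced.add(trace[j].dst)

    return before + [Inst("CUSTOM", None, tuple(inputs))] + after


# The trace from the slide; operands not shown on the slide are left out.
trace = [
    Inst("ADD", "r4", ("r1", "#1")),
    Inst("LSR", "r2", ("r2", "#4")),
    Inst("XOR", "r5", ("r4", "r2")),
    Inst("LD", "r3"),
    Inst("ADD", "r6", ("r5", "r3")),
    Inst("XOR", "r7", ("r6", "r8")),
    Inst("SHR"),
]

new_trace = replace_subgraph(trace)
print([inst.op for inst in new_trace])  # ['LSR', 'LD', 'CUSTOM', 'SHR']
print(new_trace[2].srcs)                # ('r1', 'r2', 'r3', 'r8')
```

In this example the collapsed subgraph reads four registers, which is within the 4-input limit of the proposed CCA. Which results must be written back depends on liveness information the slide does not show, so outputs are not computed here.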

  • 8/11/2019 Slides 115 of hello

    14/18

    Simulation

    SimpleScalar

    ARM instruction set

    4-wide execution, 1 compute accelerator

    128 RUU entries

    32k inst. trace cache, 256 inst. traces

    5000 cycle selection/insert latency

    L1 I-cache: 32k, 2 way, 2 cycle hit

    L1 D-cache: 32k, 4 way, 2 cycle hit

  • 8/11/2019 Slides 115 of hello

    15/18

    Varying CCA Latency

    [Figure: chart of speedup (y-axis, 1.00 to 1.45) for the SPECint, MediaBench, and Encryption benchmarks, with one series per CCA latency (Lat) of 6, 4, 2, and 1]

  • 8/11/2019 Slides 115 of hello

    16/18

    Static Selection, Dynamic Realization

    Compiler selects subgraphs offline

    Communicated to the hardware at load time

    Control bits stored in a table and inserted at decode

    Original code:

    ADD r4, r1, #1
    LSR r2, r2, #4
    XOR r5, r4, r2
    LD r3
    ADD r6, r5, r3
    XOR r7, r6, r8
    SHR

    Code after compiler selection:

    LSR r2, r2, #4
    LD r3
    CCA_Start #2
    ADD r4, r1, #1
    XOR r5, r4, r2
    ADD r6, r5, r3
    XOR r7, r6, r8
    CCA_End
    SHR

    [Figure: pipeline diagram with I-Cache, Control Table, Decode, Execute, and Retire]
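
A decode-time view of the marked code can be sketched as follows. This is an illustration only: it assumes the immediate in `CCA_Start #2` is an index into the control table, which the slide does not state, and the table contents here are a placeholder string rather than real control bits. The function name `decode` and the tuple layout of the collapsed op are hypothetical.

```python
def decode(stream, control_table):
    """Collapse each CCA_Start ... CCA_End region into one CUSTOM entry.

    Instructions outside a region pass through unchanged as strings.
    A region becomes ("CUSTOM", control_bits, original_instructions).
    """
    out = []
    region = None  # instructions collected inside an open region, if any
    entry = None   # control-table index taken from CCA_Start (assumed meaning)
    for text in stream:
        op = text.split()[0]
        if op == "CCA_Start":
            entry = int(text.split("#")[1])
            region = []
        elif op == "CCA_End":
            out.append(("CUSTOM", control_table[entry], tuple(region)))
            region = None
        elif region is not None:
            region.append(text)
        else:
            out.append(text)
    return out


program = [
    "LSR r2, r2, #4",
    "LD r3",
    "CCA_Start #2",
    "ADD r4, r1, #1",
    "XOR r5, r4, r2",
    "ADD r6, r5, r3",
    "XOR r7, r6, r8",
    "CCA_End",
    "SHR",
]

# Placeholder entry; the real control-bit encoding is not given in the slides.
control_table = {2: "<control bits for subgraph 2>"}

for item in decode(program, control_table):
    print(item)
```

The decoded stream has four entries: the LSR, the LD, one CUSTOM entry carrying the four marked instructions, and the SHR.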

  • 8/11/2019 Slides 115 of hello

    17/18

    Dynamic vs. Static Selection

    [Figure: chart of speedup (y-axis, 1.00 to 1.45) comparing Dynamic Selection and Static Selection on the SPECint, MediaBench, and Encryption benchmarks]

  • 8/11/2019 Slides 115 of hello

    18/18

    Summary

    Transparent instruction set customization

    Benefits of customization without changing the ISA

    Presented design of a compute accelerator

    Handles the majority of important computation subgraphs in many benchmarks

    Developed ways to utilize the accelerator

    Table-based static selection, dynamic realization

    Trace cache based dynamic selection, dynamic realization