mars: a 64-core armv8 processor...btb itlb i cache dirpre indpre srs loop detect instruction buffer...

23
Mars: A 64-core ARMv8 Processor Charles Zhang Phytium Technology Co., Ltd

Upload: others

Post on 16-Aug-2020

7 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Mars: A 64-core ARMv8 Processor...BTB ITLB I Cache DirPre IndPre SRS Loop Detect Instruction Buffer decoderdecoder decoder decoder Rename Logic Arch. Reg file Phy. Reg file Dispatch

Mars: A 64-core ARMv8 Processor

Charles Zhang

Phytium Technology Co., Ltd

Page 2: Mars: A 64-core ARMv8 Processor...BTB ITLB I Cache DirPre IndPre SRS Loop Detect Instruction Buffer decoderdecoder decoder decoder Rename Logic Arch. Reg file Phy. Reg file Dispatch

Phytium Technology Co., Ltd

Statements

The following slides are presented to introduce the

general features of one of our products, instead of any

commitment about it. It is for information purposes

only, and may not be incorporated into any contract. It

is not suggested to make purchasing decisions

accordingly. The development, release, and timing of

any features or functionality described here remains at

the sole discretion of Phytium.

2

Page 3: Mars: A 64-core ARMv8 Processor...BTB ITLB I Cache DirPre IndPre SRS Loop Detect Instruction Buffer decoderdecoder decoder decoder Rename Logic Arch. Reg file Phy. Reg file Dispatch

Phytium Technology Co., Ltd

A Brief Introduction of Phytium

China corporation, founded in 2012

Guangzhou

Tianjin

Vision

Leading edge CPU and ASIC provider in China

Market focuses on chips for

Internet & Cloud Computing infrastructure

Traditional workload mainframe servers

3

Page 4: Mars: A 64-core ARMv8 Processor...BTB ITLB I Cache DirPre IndPre SRS Loop Detect Instruction Buffer decoderdecoder decoder decoder Rename Logic Arch. Reg file Phy. Reg file Dispatch

Phytium Technology Co., Ltd

China is a Fast-growing Server Market

4

Company

1Q15

Revenue

1Q15 Market

Share (%)

1Q14

Revenue

1Q14 Market

Share (%)

1Q15-1Q14

Growth (%)

HP 3,191,694,948 23.8 2,890,992,229 25.5 10.4

Dell 2,296,473,026 17.1 2,006,639,006 17.7 14.4

IBM 1,887,939,141 14.1 2,244,631,789 19.8 -15.9

Lenovo 970,254,659 7.2 127,973,470 1.1 658.2

Cisco 890,179,930 6.6 616,620,000 5.4 44.4

Others 4,157,871,704 31.0 3,469,383,444 30.6 19.8

Total 13,394,413,409 100.0 11,356,239,939 100.0 17.9

Company

1Q15

Revenue

1Q15 Market

Share (%)

1Q14

Revenue

1Q14 Market

Share (%)

1Q15-1Q14

Growth (%)

Inspur 332,613,480 21 227,328,256 17 46

Dell 322,063,140 20 246,281,271 19 31

Lenovo 295,914,571 18 80,084,826 6 270

HP 217,487,450 14 167,775,923 13 30

Huawei 197,490,419 12 189,963,266 14 4

Sugon 140,377.091 9 70,705,366 5 99

Others 104,566,737 6 329,549,621 25 -68

Total 1,610,512,888 100.0 1,311,688,529 100.0 23

Source: Gartner (May 2015)

China

WW

Page 5: Mars: A 64-core ARMv8 Processor...BTB ITLB I Cache DirPre IndPre SRS Loop Detect Instruction Buffer decoderdecoder decoder decoder Rename Logic Arch. Reg file Phy. Reg file Dispatch

Phytium Technology Co., Ltd

What is Mars for?

5

High performance

High volume of memory

High bandwidth memory access

High bandwidth I/O access

Large scale cache coherency maintained

Moderate performance

High power efficiency

High density computing

High bandwidth memory

access

Low cost

Mars

Earth

Page 6: Mars: A 64-core ARMv8 Processor...BTB ITLB I Cache DirPre IndPre SRS Loop Detect Instruction Buffer decoderdecoder decoder decoder Rename Logic Arch. Reg file Phy. Reg file Dispatch

Phytium Technology Co., Ltd

Mars Overview

Architecture Features

64 Xiaomi cores, ARMv8

compatible

Hardware-maintained global

cache coherency

Panel-based data affinity

architecture

Mesh topology on chip network

32MB L2 cache

8 Cache & Memory Chips (CMC)

128MB L3 cache

16 DDR3-1600 channels

Two 16-lane PCIE3.0 i/f

ECC and parity protection on all

caches, tags and TLBs

6

Physical • ~180M instances

• 2.0GHz@28nm

• 120W

Performance • Peak:512GFLOPS

• Mem BW:204GB/s

• I/O BW: 32GB/s

panel0 panel1 panel3 panel2

panel4 panel5 panel7 panel6

CM

C

PCIe

PCIe

DDR3

DDR3

CM

C

DDR3

DDR3

CM

C

DDR3

DDR3

CM

C

DDR3

DDR3

CMC

DD

R3

DD

R3

CMC

DD

R3

DD

R3

CMC

DD

R3

DD

R3

CMC

DD

R3

DD

R3

Page 7: Mars: A 64-core ARMv8 Processor...BTB ITLB I Cache DirPre IndPre SRS Loop Detect Instruction Buffer decoderdecoder decoder decoder Rename Logic Arch. Reg file Phy. Reg file Dispatch

Phytium Technology Co., Ltd

Panel Architecture

Eight Xiaomi Cores

Compatible design with ARMv8 arch license

Both AArch32 and AArch64 modes

EL0~EL3 supported

ASIMD-128 supported

Adv. hybrid Branch Prediction

4 fetch/4 decode/4 dispatch Out-of-Order

superscalar pipeline

Cache Hierarchy

Separated L1 ICache and L1 Dcache

Shared L2 cache, totally 4MB

Directory-based cache coherency

maintenance

Directory Control Unit (DCU)

Routing Cell

7

Xiaomi

Xiaomi

Xiaomi

Xiaomi L2cache

Routing Cell

DCU

DCU

Xiaomi

Xiaomi

Xiaomi

Xiaomi L2cache

6000μm

10600μm

Page 8: Mars: A 64-core ARMv8 Processor...BTB ITLB I Cache DirPre IndPre SRS Loop Detect Instruction Buffer decoderdecoder decoder decoder Rename Logic Arch. Reg file Phy. Reg file Dispatch

Phytium Technology Co., Ltd 8

Xiaomi Core

ITLB I Cache BTB

DirPre

IndPre

SRS Loop

Detect

Instruction Buffer

decoder decoder decoder decoder

Rename Logic

Arch.

Reg

file Phy.

Reg

file Dispatch Logic

Reorder

Buffer

SCInt/VT

Queue

FP/VT

Queue

LD/ST

Queue

ALU

/BR

FMA

C/FDI

V

FMA

C/FDI

V

DTLB

D Cache

L2 Cache

STB & Prefetch

Prefetch

Debug

/Trace

/Interrupt

/Timer ALU

/BR

MUL

/DIV

MUL

/DIV

MCI/VT

Queue

Page 9: Mars: A 64-core ARMv8 Processor...BTB ITLB I Cache DirPre IndPre SRS Loop Detect Instruction Buffer decoderdecoder decoder decoder Rename Logic Arch. Reg file Phy. Reg file Dispatch

Phytium Technology Co., Ltd

ITLB I Cache BTB

DirPre

IndPre

SRS Loop

Detect

Instruction Buffer

decoder decoder decoder decoder

Rename Logic

Arch.

Reg

file Phy.

Reg

file Dispatch Logic

Reorder

Buffer

SCInt/VT

Queue

FP/VT

Queue

LD/ST

Queue

ALU

/BR

FMA

C/FDI

V

FMA

C/FDI

V

DTLB

D Cache

L2 Cache

STB & Prefetch

Prefetch

Debug

/Trace

/Interrupt

/Timer ALU

/BR

MUL

/DIV

MUL

/DIV

MCI/VT

Queue

9

Xiaomi Core Front End

ITLB I Cache BTB

DirPre

IndPre

SRS Loop

Detect

Instruction Buffer

Prefetch

• 32KB L1 instr. Cache

• Next line prefetch

• Hybrid Branch Predictor

• 2048-entry BTB

• Direction predict with TAGE predictor

• 512-entry indirect predictor

• 48-entry Speculative Return Stack

• Four instructions fetched per cycle

• 32-entry instruction buffer

• Loop detect and Instr. Cache bypass

Page 10: Mars: A 64-core ARMv8 Processor...BTB ITLB I Cache DirPre IndPre SRS Loop Detect Instruction Buffer decoderdecoder decoder decoder Rename Logic Arch. Reg file Phy. Reg file Dispatch

Phytium Technology Co., Ltd

ITLB I Cache BTB

DirPre

IndPre

SRS Loop

Detect

Instruction Buffer

decoder decoder decoder decoder

Rename Logic

Arch.

Reg

file Phy.

Reg

file Dispatch Logic

Reorder

Buffer

SCInt/VT

Queue

FP/VT

Queue

LD/ST

Queue

ALU

/BR

FMA

C/FDI

V

FMA

C/FDI

V

DTLB

D Cache

L2 Cache

STB & Prefetch

Prefetch

Debug

/Trace

/Interrupt

/Timer ALU

/BR

MUL

/DIV

MUL

/DIV

MCI/VT

Queue

10

Xiaomi Core Decode, Rename & Dispatch

• Up to four instructions

decoded per cycle

• 192 physical registers

• Up to four instructions

renamed per cycle decoder decoder decoder decoder

Rename Logic

Arch.

Reg

file Phy.

Reg

file Dispatch Logic

Reorder

Buffer

• Up to four instructions dispatched per cycle

• Reorder buffer can hold 160 instructions, and about 210+ instructions

can be in-flight in the whole pipeline.

• Dispatch in-order, execution out-of-order, retirement in-order.

Page 11: Mars: A 64-core ARMv8 Processor...BTB ITLB I Cache DirPre IndPre SRS Loop Detect Instruction Buffer decoderdecoder decoder decoder Rename Logic Arch. Reg file Phy. Reg file Dispatch

Phytium Technology Co., Ltd

ITLB I Cache BTB

DirPre

IndPre

SRS Loop

Detect

Instruction Buffer

decoder decoder decoder decoder

Rename Logic

Arch.

Reg

file Phy.

Reg

file Dispatch Logic

Reorder

Buffer

SCInt/VT

Queue

FP/VT

Queue

LD/ST

Queue

ALU

/BR

FMA

C/FDI

V

FMA

C/FDI

V

DTLB

D Cache

L2 Cache

STB & Prefetch

Prefetch

Debug

/Trace

/Interrupt

/Timer ALU

/BR

MUL

/DIV

MUL

/DIV

MCI/VT

Queue

11

Xiaomi Core Function Units

• Two separated 16-entry integer and ASIMD queues shared by four

integer units

• Two integer unit can process single-cycle integer instructions and

integer SIMD instructions, one can also process branch instructions.

• Two integer units can process multi-cycle integer instructions and

integer SIMD instructions.

• One shared16-entry floating point and ASIMD queue

• Two FP/ASIMD units equipped, which can be combined into one

lockstep ASIMD unit.

• FMA supported in both units.

• FMUL: 3cycles, FADD: 3cycles, FMA: 6cycles

SCInt/VT

Queue

FP/VT

Queue

ALU

/BR

FMA

C/FDI

V

FMA

C/FDI

V

ALU

/BR

MUL

/DIV

MUL

/DIV

MCI/VT

Queue

Page 12: Mars: A 64-core ARMv8 Processor...BTB ITLB I Cache DirPre IndPre SRS Loop Detect Instruction Buffer decoderdecoder decoder decoder Rename Logic Arch. Reg file Phy. Reg file Dispatch

Phytium Technology Co., Ltd

ITLB I Cache BTB

DirPre

IndPre

SRS Loop

Detect

Instruction Buffer

decoder decoder decoder decoder

Rename Logic

Arch.

Reg

file Phy.

Reg

file Dispatch Logic

Reorder

Buffer

SCInt/VT

Queue

FP/VT

Queue

LD/ST

Queue

ALU

/BR

FMA

C/FDI

V

FMA

C/FDI

V

DTLB

D Cache

L2 Cache

STB & Prefetch

Prefetch

Debug

/Trace

/Interrupt

/Timer ALU

/BR

MUL

/DIV

MUL

/DIV

MCI/VT

Queue

12

Xiaomi Core Function Units

• One 24-entry load/store queue

• 32KB L1 data cache

• 6 outstanding loads

• 4 cycles latency from load to use

• Next line and stride detected data

prefetch

• Streamlined pattern auto detected

LD/ST

Queue

DTLB

D Cache

STB & Prefetch

Page 13: Mars: A 64-core ARMv8 Processor...BTB ITLB I Cache DirPre IndPre SRS Loop Detect Instruction Buffer decoderdecoder decoder decoder Rename Logic Arch. Reg file Phy. Reg file Dispatch

Phytium Technology Co., Ltd

Cache coherence protocol Hawk cache coherence protocol

Distributed directory-based global cache coherency

MOESI-like packet-based coherence protocol

A home node DCU(directory control unit) supports

Affinitive pairing of L2Cs and CMCs

“Infinite” capacity for non-conflicting Reads & Writes

Optimized transaction flow for exclusive atomic accesses

Reduced latency by cacheline forwarding

13

L2C L2C L2C

Hawk

L3C &

Memory I/O

Interconnects

Global Exclusive Monitor

Co

re0

Co

re7

Coherence Logic

Pa

ne

l N

MEM

Co

re0

Co

re7

Coherence Logic

Pa

ne

l 0

Local Monitor

Page 14: Mars: A 64-core ARMv8 Processor...BTB ITLB I Cache DirPre IndPre SRS Loop Detect Instruction Buffer decoderdecoder decoder decoder Rename Logic Arch. Reg file Phy. Reg file Dispatch

Phytium Technology Co., Ltd

Network on Chip

2D Concentrated Mesh Architecture

Cell based switch with 6 bidirectional ports

Uniform package format for each port, a port can be configured to be

connected with a device or cascade cell

4 physical channels for CC and 1 channel for debug, DOR Y-X routing

Low latency: 3 cycles for each hop

High bandwidth: 384GB/s each cell

Cell1

53

0

24

L2cache

L2cache

MIU/IOU

MIU or

Cascade

Dest. Lat. (cycles)

0 3

1 6

2 9

3 12

4 15

5 12

6 9

7 6

Avg. 9

3

4

Cell01

53

0

24 Cell1

0

24

1

53

Cell45

13

2

04 Cell5

2

04

5

13 Cell7

5

13

2

04

Cell20

24

1

53Cell3

1

53

0

24

Cell62

04

5

13

master

0

1 2

5 6

7

14

Page 15: Mars: A 64-core ARMv8 Processor...BTB ITLB I Cache DirPre IndPre SRS Loop Detect Instruction Buffer decoderdecoder decoder decoder Rename Logic Arch. Reg file Phy. Reg file Dispatch

Phytium Technology Co., Ltd

Cache & Memory Chip

L3 cache

16MB Data Array

2MB Data ECC

DDR bandwidth

2 x DDR3-800:25.6GB/s

Proprietary interface between Mars &

CMC

Parallel interface

Needs more pins, but lower latency than

serdes

Separate write/cmd and read data

channel

L3

Bank0

Mars Interface

L3

Bank1

L3

Bank2

L3

Bank3

Mem

Ctrl0

Mem

Ctrl1

D

D

R

D

D

R

15

Effective read channel bandwidth:12.8GB/s

Effective write/cmd channel bandwidth:6.4GB/s

Page 16: Mars: A 64-core ARMv8 Processor...BTB ITLB I Cache DirPre IndPre SRS Loop Detect Instruction Buffer decoderdecoder decoder decoder Rename Logic Arch. Reg file Phy. Reg file Dispatch

Phytium Technology Co., Ltd

Latency of affinitive access

Memory access latency(ns)

Local L1 cache hit ~2

Local L2 cache hit ~8

Affinitive L2 cache hit ~20

Affinitive L3 cache hit ~36

Affinitive DDR access ~70

• Panel : 2.0GHz

• NoC: 2.0GHz

• CMC: 1.5GHz

* PCB latency not considered

16

Xiaomi

Xiaomi

Xiaomi

Xiaomi L2cache

Routing Cell

DCU

DCU

Xiaomi

Xiaomi

Xiaomi

Xiaomi L2cache

Xiaomi

Xiaomi

Xiaomi

Xiaomi L2cache

Routing Cell

DCU

DCU

Xiaomi

Xiaomi

Xiaomi

Xiaomi L2cache

CM

C

Page 17: Mars: A 64-core ARMv8 Processor...BTB ITLB I Cache DirPre IndPre SRS Loop Detect Instruction Buffer decoderdecoder decoder decoder Rename Logic Arch. Reg file Phy. Reg file Dispatch

Phytium Technology Co., Ltd

Memory Tune (mTune)

Rich Data Collection

Number of cache hits/misses for L1/L2/L3

Workload of cache pipelines

Busyness of the NoC

ECC corrections of the memory system

Support Multiple Metrics

Average Miss rate/Hit rate

Minimal/Maximal/Average Access Latency

Bandwidth Analysis

Concurrent Average Memory Access Time (CAMAT)

Support MPI/OpenMP Applications

Thread behavior analysis

Inter-process behavior analysis

17

Page 18: Mars: A 64-core ARMv8 Processor...BTB ITLB I Cache DirPre IndPre SRS Loop Detect Instruction Buffer decoderdecoder decoder decoder Rename Logic Arch. Reg file Phy. Reg file Dispatch

Phytium Technology Co., Ltd

Scalable Debug System

ARMv8 CoreSight Compatible debug system

Scalable dedicated debug network across 64 cores

Distributed debug components

Configurable events broadcast scope

Timestamp broadcasts with single signal to simplify

implementation

18

Cell0

DNB_C

DNB_C

1

530

DNB_M

DNB_I2

4 Cell14 31 0

5 2

DNB_C

DNB_C

DNB_M

Cell4

DNB_C

DNB_C

5

132

04

DNB_M

Cell54 35 2

1 0

DNB_C

DNB_CDNB_M

Cell33 40 1

2 5

DNB_C

DNB_C

DNB_M

Cell24 31 0

5 2

DNB_C

DNB_C

DNB_M

Cell7

DNB_C

DNB_C

5

132

04

DNB_M

Cell64 35 2

1 0

DNB_C

DNB_CDNB_M

panel0 panel1

panel4 panel5

panel3 panel2

panel7 panel6

hdbg

JTAG1

JTAG2

Page 19: Mars: A 64-core ARMv8 Processor...BTB ITLB I Cache DirPre IndPre SRS Loop Detect Instruction Buffer decoderdecoder decoder decoder Rename Logic Arch. Reg file Phy. Reg file Dispatch

Phytium Technology Co., Ltd

Physical Design

28nm process

0.9v core/1.8v IO

10 metal layers

~180M instances

2.0GHz

120W

640mm2 die size

FCBGA

~3000 pins

19

25.38mm

25.2mm

Page 20: Mars: A 64-core ARMv8 Processor...BTB ITLB I Cache DirPre IndPre SRS Loop Detect Instruction Buffer decoderdecoder decoder decoder Rename Logic Arch. Reg file Phy. Reg file Dispatch

Phytium Technology Co., Ltd

Performance Evaluation

SpecCPU2006

20

Single copy of SPEC CPU benchmark 64 copies of SPEC CPU benchmark

19.2 17.8

0

5

10

15

20

25

INT FP

SPEC_CPU2006_base

672 585

0

100

200

300

400

500

600

700

800

INT FP

SPEC_CPU2006_rate

Page 21: Mars: A 64-core ARMv8 Processor...BTB ITLB I Cache DirPre IndPre SRS Loop Detect Instruction Buffer decoderdecoder decoder decoder Rename Logic Arch. Reg file Phy. Reg file Dispatch

Phytium Technology Co., Ltd

Performance Evaluation

STREAM

21

0

10

20

30

40

50

60

70

80

90

1 2 4 8 16 24 32 40 48 56 64

STREAM triad

#cores

Bandw

idth

(G

B/s

)

Page 22: Mars: A 64-core ARMv8 Processor...BTB ITLB I Cache DirPre IndPre SRS Loop Detect Instruction Buffer decoderdecoder decoder decoder Rename Logic Arch. Reg file Phy. Reg file Dispatch

Phytium Technology Co., Ltd

Next Generation Scale-up CPU

More powerful core

Aggressive Branch Predictor

Multithreading

More aggressive ILP exploitation

Wider SIMD

More RAS features

Higher bandwidth memory access

Higher power efficiency

22

Page 23: Mars: A 64-core ARMv8 Processor...BTB ITLB I Cache DirPre IndPre SRS Loop Detect Instruction Buffer decoderdecoder decoder decoder Rename Logic Arch. Reg file Phy. Reg file Dispatch

Mars: A 64-core ARMv8 Processor

Charles Zhang

[email protected]