8/3/2019 S2 1405 Schooler Presentation
Tile Processors: Many-Core for Embedded and Cloud Computing
Richard Schooler, VP Software Engineering
Tilera Corporation
-
HPEC, 15 September 2010. Copyright 2010 Tilera Corporation. All Rights Reserved.
Exploiting Natural Parallelism
High-performance applications have lots of parallelism!
Embedded apps:
  Networking: packets, flows
  Media: streams, images, functional & data parallelism
Cloud apps:
  Many clients: network sessions
  Data mining: distributed data & computation
Lots of different levels:
  SIMD (fine-grain data parallelism)
  Thread/process (medium-grain task parallelism)
  Distributed system (coarse-grain job parallelism)
-
Everyone is going manycore, but can the architecture scale?
The computing world is ready for radical change
[Figure: #Cores versus Time (axis marks 2005, 2006, 2010, 2014, 2020; 1, 2, 4, 32 cores). Labels: Intel, Sun, IBM Cell, Larrabee, nCores. Annotation: Performance & Performance/W gap.]
-
Key Many-Core Challenges: The 3 Ps
Performance challenge
  How to scale from 1 to 1000 cores: the number of cores is the new megahertz
Power efficiency challenge
  Performance per watt is the new metric: systems are often constrained by power & cooling
Programming challenge
  How to provide a converged many-core solution in a standard programming environment
-
"Problems cannot be solved by the same level of thinking that created them."
Current technologies fail to deliver
  Incremental performance increase
  High power
  Low level of integration
  Increasingly bigger cores
We need new thinking to get
  10x performance
  10x performance per watt
  Converged computing
  Standard programming models
-
Stepping Back: How Did We Get Here?
Moore's Conundrum: More devices =>? More performance
Old answers: More complex cores; bigger caches
  But power-hungry
New answers: More cores
  But do conventional approaches scale?
Diminishing returns!
-
The Old Challenge: CPU-on-a-chip
20 MIPS CPU in 1987
Few thousand gates
-
What to do with all those transistors?
The Opportunity: Billions of Transistors
Old CPU:
-
Take Inspiration from ASICs
ASICs have high performance and low power
  Custom-routed, short wires
  Lots of ALUs, registers, memories: huge on-chip parallelism
But how to build a programmable chip?
[Figure: ASIC diagram with several "mem" blocks.]
-
Replace Long Wires with Routed Interconnect
[Figure: diagram with a "Ctrl" block.]
[IEEE Computer 97]
-
From Centralized Clump of CPUs
[Figure: 16 ALUs attached to a central bypass network (Bypass Net) and register file (RF).]
-
To Distributed ALUs, Routed Bypass Network
[Figure: ALUs connected through routers (R).]
Scalar Operand Network (SON) [TPDS 2005]
-
From a Large Centralized Cache
-
to a Distributed Shared Cache
[Figure: ALUs, each with a router (R) and a cache ($).]
[ISCA 1999]
-
Distributed Everything + Routed Interconnect: Tiled Multicore
[Figure: grid of tiles, each with an ALU, a router (R), and a cache ($).]
Each tile is a processor, so programmable
-
Tiled Multicore Captures ASIC Benefits and is Programmable
Scales to large numbers of cores
Modular: design and verify 1 tile
Power efficient
  Short wires plus locality opts: CV²f
  Chandrakasan effect, more cores at lower freq and voltage: CV²f
Core + Switch = Tile
[Figure: tile (Processor Core plus switch S) contrasted with Current Bus Architecture.]
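The power argument on this slide rests on the dynamic power relation it abbreviates as CV²f (power proportional to capacitance times voltage squared times frequency). A minimal sketch of that arithmetic follows. The function names, the choice of 4 cores, the 0.7 voltage factor, and the assumption that the workload parallelizes perfectly are all illustrative and not taken from the slides.

```python
def dynamic_power(capacitance, voltage, frequency):
    """Dynamic switching power P = C * V^2 * f, in arbitrary units."""
    return capacitance * voltage ** 2 * frequency


def many_core_power(n_cores, capacitance, voltage, frequency, voltage_scale):
    """Total power of n_cores, each clocked at frequency / n_cores.

    voltage_scale is an assumed factor by which the supply voltage can be
    lowered at the reduced clock; it is illustrative, not a measured value.
    """
    per_core = dynamic_power(capacitance, voltage * voltage_scale,
                             frequency / n_cores)
    return n_cores * per_core


# One fast core versus four slower cores with the same aggregate clock rate.
baseline = dynamic_power(1.0, 1.0, 1.0)
four_cores = many_core_power(4, 1.0, 1.0, 1.0, voltage_scale=0.7)
ratio = four_cores / baseline
print(f"relative power at equal aggregate clock rate: {ratio:.2f}")
```

With the frequency term unchanged in aggregate, any saving comes from the squared voltage term, which is why running more cores at lower frequency and voltage can pay off when the work parallelizes.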
-
Tilera processor portfolio
Demonstrating the scale of many-core
[Figure: roadmap timeline, 2007 to 2010: TILE64, TILEPro36, TILEPro64; Gx16 & Gx36 (2x the performance); Gx64 & Gx100 (up to 8x performance).]
-
TILE-Gx100: Complete System-on-a-Chip with 100 64-bit cores
1.2GHz - 1.5GHz
32 MBytes total cache
546 Gbps peak mem BW
200 Tbps iMesh BW
80-120 Gbps packet I/O
  8 ports XAUI / 2 XAUI
  2 40Gb Interlaken
  32 ports 1GbE (SGMII)
80 Gbps PCIe I/O
3 StreamIO ports (20Gb)
Wire-speed packet eng.
  120Mpps
MiCA engines:
  40 Gbps crypto
  compress & decompress
[Figure: chip block diagram: four DDR3 memory controllers, mPIPE, two MiCA engines, three PCIe 2.0 SerDes blocks (8-lane, 4-lane, 8-lane), two Interlaken ports, eight 10 GbE XAUI / 4x GbE SGMII SerDes blocks, and flexible I/O (UART x2, USB x2, JTAG, I2C, SPI).]
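To give a feel for what the 120 Mpps wire-speed figure means for software, the sketch below estimates how many core cycles are available per packet. It assumes, purely for illustration, that every one of the 100 cores does packet work and that the load is spread evenly; the function name is hypothetical and the slides make no such claim.

```python
def cycles_per_packet(cores, clock_hz, packets_per_second):
    """Average core cycles available per packet when load is spread evenly."""
    return cores * clock_hz / packets_per_second


# TILE-Gx100 figures from the slide: 100 cores, 1.2 to 1.5 GHz, 120 Mpps.
budget_low_clock = cycles_per_packet(100, 1.2e9, 120e6)
budget_high_clock = cycles_per_packet(100, 1.5e9, 120e6)
print(budget_low_clock, budget_high_clock)
```

Under these assumptions the budget is on the order of a thousand cycles per packet, which suggests why the slide lists a dedicated packet engine alongside the cores.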
-
TILE-Gx36: Scaling to a broad range of applications
36 Processor Cores
866 MHz, 1.2GHz, 1.5GHz clk
12 MBytes total cache
40 Gbps total packet I/O
  4 ports 10GbE (XAUI)
  16 ports 1GbE (SGMII)
48 Gbps PCIe I/O
2 16Gbps Stream IO ports
Wire-speed packet engine
  60Mpps
MiCA engine:
  20 Gbps crypto
  Compress & decompress
[Figure: chip block diagram: two DDR3 memory controllers, mPIPE, one MiCA engine, four 10 GbE XAUI / 4x GbE SGMII SerDes blocks, three PCIe 2.0 SerDes blocks (8-lane, 4-lane, 4-lane), and flexible I/O (UART x2, USB x2, JTAG, I2C, SPI).]
-
Full-Featured General Converged Cores
Processor
  Each core is a complete computer
  3-way VLIW CPU
  SIMD instructions: 32, 16, and 8-bit ops
  Instructions for video (e.g., SAD) and networking
  Protection and interrupts
Memory
  L1 cache and L2 cache
  Virtual and physical address space
  Instruction and data TLBs
  Cache integrated 2D DMA engine
Runs SMP Linux
Runs off-the-shelf C/C++ programs
Signal processing and general apps
[Figure: core block diagram: register file, three execution pipelines, 16K L1-I, 8K L1-D, 64K L2, I-TLB, D-TLB, 2D DMA, terabit switch.]
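The slide names SAD (sum of absolute differences) as an example of a video instruction. As a rough illustration of what that operation computes, here is a plain Python sketch. It models only the arithmetic, not the Tilera instruction itself, and the helper names and sample pixel values are hypothetical.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equal-length pixel blocks."""
    if len(block_a) != len(block_b):
        raise ValueError("blocks must have the same length")
    return sum(abs(a - b) for a, b in zip(block_a, block_b))


def best_match(reference, candidates):
    """Index of the candidate block with the smallest SAD to the reference."""
    return min(range(len(candidates)),
               key=lambda i: sad(reference, candidates[i]))


reference = [10, 20, 30, 40]
candidates = [[90, 90, 90, 90], [11, 19, 30, 42], [0, 0, 0, 0]]
print(best_match(reference, candidates))
```

Video encoders typically evaluate this kind of comparison many times per frame during motion estimation, which is the usual motivation for providing it as a SIMD instruction over 8-bit values.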
-
Software must complement the hardware
Enable re-use of existing code-bases
Standards-based development environment
  e.g. gcc, C, C++, Java, Linux
  Comprehensive command-line & GUI-based tools
Support multiple OS models
  One OS running SMP
  Multiple virtualized OSs with protection
  Bare metal or zero-overhead with background OS environment
Support range of parallel programming styles
  Threaded programming (pThreads, TBB)
  Run-to-Completion with load-balancing
  Decomposition & Pipelining
  Higher-level frameworks (Erlang, OpenMP, Hadoop etc.)
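Two of the programming styles listed above, run-to-completion and pipelining, can be contrasted in a small sketch. This is an illustration using Python's standard library rather than Tilera's tools; the `parse` and `classify` stages and the sample packets are hypothetical stand-ins for real packet-processing work.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor


def parse(packet):
    """Stage 1 (hypothetical): strip framing whitespace."""
    return packet.strip()


def classify(payload):
    """Stage 2 (hypothetical): tag the payload with a flow bucket."""
    return (payload, len(payload) % 2)


def run_to_completion(packets, workers=4):
    """Each worker carries one packet through every stage."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: classify(parse(p)), packets))


def pipelined(packets):
    """Each stage runs on its own thread; a queue links the stages."""
    handoff = queue.Queue()
    results = []

    def classify_stage():
        while True:
            payload = handoff.get()
            if payload is None:  # sentinel: the upstream stage is done
                break
            results.append(classify(payload))

    consumer = threading.Thread(target=classify_stage)
    consumer.start()
    for packet in packets:
        handoff.put(parse(packet))
    handoff.put(None)
    consumer.join()
    return results


packets = [" syn ", "ack", " data"]
print(run_to_completion(packets) == pipelined(packets))
```

Both functions produce the same results in the same order. The difference is where the parallelism sits: across packets in the first case, across stages in the second.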
-
Software Roadmap
Standards & open source integration
  Compiler: gcc, g++ 4.4+
  Linux:
    Kernel: Tile architecture integrated to 2.6.36
    User-space: glibc, broader set of standard packages
Extended programming and runtime environments
  Java: porting OpenJDK
  Virtualization: porting KVM
-
Tile architecture: The future of many-core computing
Multicore is the way forward
  But we need the right architecture to utilize it
The Tile architecture addresses the challenges
  Scales to 100s of cores
  Delivers very low power
  Runs your existing code
Standards-based software
  Familiar tools
  Full range of standard programming environments
-
Thank you!
Questions?
-
Research Vision to Commercial Product
[Figure: timeline with marks at 1996, 1997, 2002, 2007, 2010, and 2018. Labels: the opportunity (1B transistors in 2007); a blank slate (CPU, Mem); MIT Raw, 16 cores; Tile Processor, 64 cores; TILE-Gx100, 100 cores (chip block diagram); 100B transistors; the future?]
-
Standard tools and programming model
Multicore Development Environment
Standards-based tools
  Standard programming
    SMP Linux 2.6
    ANSI C/C++
    Java, PHP
  Integrated tools
    GCC compiler
    Standard gdb, gprof
    Eclipse IDE
  Innovative tools
    Multicore debug
    Multicore profile
Standard application stack
  Application layer
    Open source apps
    Standard C/C++ libs
  Operating System layer
    64-way SMP Linux
    Zero Overhead Linux
    Bare metal environment
  Hypervisor layer
    Virtualizes hardware
    I/O device drivers
[Figure: software stack, top to bottom: Applications (applications, libraries), Operating System (kernel, drivers), Hypervisor (virtualization and high-speed I/O drivers), Tile Processor (Tile, Tile, Tile, Tile).]
-
Standard Software Stack
Compiler, OS, Hypervisor
Language Support
Infrastructure Apps
Perl
gcc & g++
Commercial Linux Distribution
Management Protocols: IPMI 2.0
Transcoding
Network Monitoring
-
High single core performance
Comparable to Atom & ARM Cortex-A9 cores
Single-Core Single thread CoreMark Comparison
[Figure: bar chart of CoreMark score (axis up to 3,500) for Tilera TILEPro64 (866 MHz), Tilera TILE-Gx36 (1.25 GHz), ARM Cortex-A9 (1 GHz), and Intel Atom N270 (1600 MHz).]
- Data for TILEPro, ARM Cortex-A9, Atom N270 is available on the CoreMark website http://coremark.org/home.php
- TILE-Gx and single thread Atom results were measured in Tilera labs
- Single core, single thread result for ARM is calculated based on chip scores
-
Significant value across multiple markets
Networking: Classification, L4-7 Services, Load Balancing, Monitoring/QoS, Security
Multimedia: Video Conferencing, Media Streaming, Transcoding
Wireless: Base Station, Media Gateway, Service Nodes, Test Equipment
Cloud: Apache, Memcached, Web Applications, LAMP stack
High Performance
Low Power
Standard Programming
Over 100 customers
Over 40 customers going into production
Tier 1 customers in all target markets
-
Targeting markets with highly parallel applications
Web: Web Serving, In Memory Cache, Data Mining
Media delivery: Transcoding, Video delivery, Wireless media
Government: Lawful interception, Surveillance
Other
Common Themes
  Hundreds and thousands of servers running each application
  Thousands of parallel transactions
  All need better performance and power efficiency
-
The Tile Processor Architecture
Mesh interconnect, power-optimized cores
200 Tbps on-chip bandwidth
2 Dimensional mesh network
Scales to large numbers of cores
Modular: Design-and-verify 1 tile
Power efficient:
  Short wires & locality optimize CV²f
  Chandrakasan effect, more cores at lower freq and voltage: CV²f
Core + Switch = Tile
[Figure: tile (Processor Core plus switch S) contrasted with Traditional Bus/Ring Architecture.]
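To illustrate how traffic might cross a two-dimensional mesh of tiles, the sketch below uses dimension-ordered (X first, then Y) routing. The slides only state that the network is a 2D mesh; the routing policy, the coordinate scheme, and the 8x8 grid size are assumptions made for this example.

```python
def xy_route(src, dst):
    """Tiles visited from src to dst, moving along X first and then along Y."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


def hop_count(src, dst):
    """Switch-to-switch hops between two tiles (Manhattan distance)."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


route = xy_route((0, 0), (2, 1))
corner_to_corner = hop_count((0, 0), (7, 7))  # assumed 8x8 grid
print(route, corner_to_corner)
```

Each hop uses only a short link between neighboring tiles. In this model the worst-case distance grows with the side length of the grid rather than with the number of cores.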
-
Distributed everything
Cache, memory management, connectivity
Global 64-bit address space
Distributed caches
  Big centralized caches don't scale
    Contention
    Long latency
    High power
  Distributed caches have numerous benefits
    Lower power (less logic lit-up per access)
    Exploit locality (local L1 & L2)
    Can exploit various cache placement mechanisms to enhance performance
Completely HW Coherent Cache System
-
Highest compute density
2U form factor
4 hot pluggable modules
8 Tilera TILEPro processors
512 general purpose cores
1.3 trillion operations/sec
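The core count and the operations-per-second figure can be cross-checked with simple arithmetic. The sketch below is one possible reading, not the vendor's stated derivation: it assumes the 866 MHz TILEPro64 clock quoted on the CoreMark slide and one operation per slot of a 3-way VLIW core on every cycle.

```python
PROCESSORS = 8             # from the slide
CORES_PER_PROCESSOR = 64   # implied by 512 cores across 8 processors
CLOCK_HZ = 866_000_000     # assumed: TILEPro64 clock from the CoreMark slide
OPS_PER_CYCLE = 3          # assumed: one operation per slot of a 3-way VLIW

cores = PROCESSORS * CORES_PER_PROCESSOR
peak_ops_per_second = cores * CLOCK_HZ * OPS_PER_CYCLE
print(cores, peak_ops_per_second / 1e12)
```

Under these assumptions the peak works out to roughly 1.33 trillion operations per second, in line with the rounded figure on the slide.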
-
Power efficient and eco-friendly server
10,000 cores in an 8 kilowatt rack
35-50 watts max per node
Server power of 400 watts
90%+ efficient power supplies
Shared fans and power supplies
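The rack-level claim follows from the per-server figures if each server is taken to be the 512-core, 2U system described on the compute density slide. That pairing, and the reading of a node as one of eight processor modules per server, are assumptions made for this sketch.

```python
RACK_POWER_W = 8_000       # from the slide
SERVER_POWER_W = 400       # from the slide
CORES_PER_SERVER = 512     # assumed: the 2U server on the compute density slide
NODES_PER_SERVER = 8       # assumed: one node per processor
NODE_POWER_MAX_W = 50      # upper end of the 35-50 W range on the slide

servers_per_rack = RACK_POWER_W // SERVER_POWER_W
cores_per_rack = servers_per_rack * CORES_PER_SERVER
node_power_per_server = NODES_PER_SERVER * NODE_POWER_MAX_W
print(servers_per_rack, cores_per_rack, node_power_per_server)
```

Twenty such servers fit the 8 kW budget, which gives a little over 10,000 cores. Eight nodes at the 50 W maximum also add up to the quoted 400 W server figure.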
-
Coherent distributed cache system
Globally Shared Physical Address Space
  Full Hardware Cache Coherence
  Standard shared memory programming model
Distributed cache
  Each tile has local L1 and L2 caches
  Aggregate of L2 serves as a globally shared L3
  Any cache block can be replicated locally
  Hardware tracks sharers, invalidates stale copies
Dynamic Distributed Cache (DDC)
  Memory pages distributed across all cores or homed by allocating core
Coherent I/O
  Hardware maintains coherence
  I/O reads/writes coherent with tile caches
  Reads/writes delivered by HW to home cache
  Header/packet delivered directly to tile caches
[Figure: Completely Coherent System block diagram: four DDR2 controllers (0 to 3), PCIe 0 and PCIe 1, XAUI 0 and XAUI 1, GbE 0 and GbE 1 (with SerDes), and flexible I/O (UART, JTAG, SPI, I2C).]
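The idea that hardware tracks sharers and invalidates stale copies can be pictured with a toy software model. The sketch below is not the DDC implementation: the class name, the 64-byte line size, and the modulo-based choice of home tile are assumptions chosen only to make the bookkeeping concrete.

```python
class HomeDirectory:
    """Toy model: every cache line has a home tile that tracks its sharers."""

    def __init__(self, n_tiles, line_bytes=64):
        self.n_tiles = n_tiles
        self.line_bytes = line_bytes   # assumed line size
        self.sharers = {}              # line number -> set of tile ids

    def home_tile(self, addr):
        # Hypothetical placement: spread lines across tiles by line number.
        return (addr // self.line_bytes) % self.n_tiles

    def read(self, tile, addr):
        """Record that `tile` now holds a copy of the line containing addr."""
        line = addr // self.line_bytes
        self.sharers.setdefault(line, set()).add(tile)

    def write(self, tile, addr):
        """Make `tile` the only holder; return the tiles to invalidate."""
        line = addr // self.line_bytes
        stale = self.sharers.get(line, set()) - {tile}
        self.sharers[line] = {tile}
        return stale


directory = HomeDirectory(n_tiles=64)
directory.read(1, 0x1000)
directory.read(2, 0x1000)
invalidated = directory.write(3, 0x1000)
print(sorted(invalidated), directory.home_tile(0x1000))
```

In this model a write leaves exactly one valid copy of the line, and the returned set lists the tiles whose copies would need invalidating.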