risc-v i/o scale-out architecture for distributed data analytics...summary • scale-out analytics...

Integrated Device Technology

The Analog and Digital Company™

RISC-VI/OScale-outArchitectureforDistributedData

Analytics

MohammadAkhterIntegratedDeviceTechnology,Inc.

[email protected]

July12,2016

©Copyright2016, IDTInc.

Contributors/Acknowledgement

• HenryCook,SiFive,Inc.• HowardMao,SiFive,Inc.• KrsteAsanovic,SiFive,Inc.• MohammadAkhter,IDT,Inc.• StephenDurr,IDT,Inc.

2

Integrated Device Technology

The Analog and Digital Company™

SocialmediaisdrivingMobileInternetAccess

Massive Data Growth Driving Internet

3

3.6BMoreVideouploadedtoYouTubethanthetraditionalTV

networks3.6BPhotosuploaded in2014

(whatsapp, facebook,instagram,snapchat …)

807M 807MpeoplewatchedYouTubevideo"Charliebitmyfingeragain“

(642MtoDisneylandsince1955)

91%

UnstructuredData….MachineDatanotincluded….

The Real-Time Usage Rising

Live Data from Wireless to the Cloud

Networksand

Data Centersmust

THINK FAST

Social Media, Finance, Health, Video, Audio, Cloud Compute, Transportation, Military C3I

• Millions of users• More data per user• Low Latency• Data Analytics

4

Focus on the Use Cases …

5

HPCandAnalyticsà HeterogeneousComputingà PerformanceCriticalDataAnalyticsà LowestLatencyProcessing

WirelessAccessNetworkà Edgehighperformancecomputingà DeterministicLowestLatencyà Fault-TolerantSystemà NetworkandSignalProcessing

Robotics/SensorFusion/Autonmousà Real-timeVideo/ImageRecognitionà TimesynchronizedProcessingà DeterministicLatencyà Fault-TolerantSystem

SpaceandMilitaryà HighPerformanceEmbedComputingà HeterogeneousArchitectureà VideoandImageAnalyticsà LowestLatencyProcessing

Underlying Structure for Analytics…

6

Demands balanced low latency computing, I/O, memory, and storage processing

Wireless Network EvolutionDriven by Real-time Data with better QoS

7

150xReduction

LowerLatencytoachievesuperiorQoE

~10x

~10-50x

HigherBandwidthdrivenbyVideo/SocialMedia

Edge-Core Network ConvergenceRace to msec!

88

UserRadioBase-Station

GatewayInternet/C

ore Network

Servers/Storage

AccessEdgeCore/Aggregation

Traditional Computing Model

9

I/O, Memory and Storage

Bottleneck

Suffers from inherent mismatch between I/O, Memory, and Storage and Processing (CPU/Accelearators) Performance Evolution

Distributed Edge Network Computing Model

10

Power

X86/GPU

EdgeServers

CloudRAN

4GBTSwithintegratedcomputing

à Reducesbandwidthdemand fromAccesstoCore

à Provides Real-timepredictivedecision

à UserCentricoptimizationofContentdelivery

ComputingandNetworkConvergenceUsercentricNetwork

Real-timeUserCentricNetwork

11

Deep Learning Video Example(Hacking a new Computing)

12

Hacking a new Computer• BalancingGPU/CPUperformancewithMemoryandI/O

Hacking a new Computer

• Scale-outComputingandstorage

4 GPU Eval Cards8G I/O NIC

4 GPU (NVIDIA Tegra K1)56G I/O (4x14G NIC) 200G Switching

38U System144 TFLops608 GPU Nodes (4x4x36 )38 Port Switching Fabric

Hacking a new Computer

Deep Learning Micro-ClusterReal-time Video/Social Analytics

RapidIO Switch Appliance

Low Power ARM+GPU Cluster with RapidIO

~29 frames/sec/node~100 ns RapidIO switching latency

> TFlops Processing/node

~5W-11 W per GPU node

Example Analytics Deep Learning Model for Image/Video

17

Analytics Computing Model Challenges

18

+SharedmemorybetweenCPUandGPU+Supports Scale-outSystem- SharedMemoryandStoragechallengingwithouthardwaresupport+Switchlatencyexcellent(~100nsec)- Memorytomemorylatencycanbeimproved

Scale-out Analytics Clustering Model

19

+ LargeScale-outAnalyticsModel+ Lowlatency<200- 300nsec+ LowPower+ SharedMemory/StoragePool+ OpenISAModel+ StandardFabric

RISC-V Scale-out Analytics Clustering Model

20

FabricTileLink and AMBA

• Off-chipFabric– Scale-outandRemoteMemoryAccelerationwithRapidIO

• On-chipFabric– TileLink

• UCBerkley• Hierarchical/DeadlockFree/Built-inMessagetypes(supportsAccelerators)

• ExtensiblewithCustomMessageforCoherency(5-stateandothers)

– AMBA• ARMInc.• AMBAAXIandACEProtocolSpecification

– IssueE.22Feb,2013• CC-NUMAArchitecture,Up-to5StateModel

21

RISC-V Architecture with RapidIO

22

TileLink-RapidIOI/OandMemoryFabric

RISC-V Multi-chip Generator

23

InitialVersionofMulti-chipHardwareGeneratoravailable

Remote Memory/IO Scale-out Packets

24

DestinationID SourceID FlowControl Payload ProtectionAddressing TYPE

SoC1 SoC2 SoC4

SoC3

SoC5

Request-ResponsePeer-to-PeerModel

TileLink Basic Protocol and RapidIO Packet Structure

25

client

manager

Acquire FinishGrantProbe Release

+0 +1 +2 +3

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

Byte 0 ackID vc CRF prio tt=10 Ftype destinationID[7:0] sourceID[7:0] W 0Byte 4 TType (transaction) rQoS rd-/wr-size srcTID rSizeBurst W 1

Byte 8 udrB rsvd rOffset rsvd rUnion W 2

Byte 12 Address Address W 3Byte 16 Address Address W 4Byte 20 Payload(Word0) W 5Byte 24 Payload(Word1) W 6

…Payload(Word15) W 20

CRC Padding W 21

Hardware Simulation Model

26

RapidIOEgress

RapidIOIngress

TargetLatencyCPUtoCPU~100– 200nsec

Summary• Scale-outAnalyticswithbalancingFabricandComputing

– Scale-outthroughdistributedEdgeComputingModel• Reducesround-triplatency• Reducesaccesstocorenetworkbandwidth

– RISC-VwithintegratedFabricforI/OandMemory• FabricforAnyTopology(Mesh/Hypercube)• LowLatencydirectlyfromSoC on-chipFabric• SupportsTileLink scale-outformulti-nodeclustering• EnablesBalancedCoherentandscale-outarchitecture

27

RISC-VCPUGeneratormodelwithPortforRapidIOavailable!

risc-v i/o scale-out architecture for distributed data analytics...summary • scale-out analytics...

Documents