Integrated Device Technology
The Analog and Digital Company™
RISC-V I/O Scale-out Architecture for Distributed Data Analytics
Mohammad Akhter, Integrated Device Technology, Inc.
July 12, 2016
© Copyright 2016, IDT Inc.
Contributors/Acknowledgement
• Henry Cook, SiFive, Inc.
• Howard Mao, SiFive, Inc.
• Krste Asanovic, SiFive, Inc.
• Mohammad Akhter, IDT, Inc.
• Stephen Durr, IDT, Inc.
Social media is driving Mobile Internet Access
Massive Data Growth Driving Internet
More video uploaded to YouTube than the traditional TV networks
3.6B photos uploaded in 2014 (WhatsApp, Facebook, Instagram, Snapchat …)
807M people watched the YouTube video "Charlie bit my finger again" (642M to Disneyland since 1955)
91% unstructured data (machine data not included)
The Real-Time Usage Rising
Live Data from Wireless to the Cloud
Networks and Data Centers must THINK FAST
Social Media, Finance, Health, Video, Audio, Cloud Compute, Transportation, Military C3I
• Millions of users
• More data per user
• Low Latency
• Data Analytics
Focus on the Use Cases …
HPC and Analytics → Heterogeneous Computing → Performance Critical Data Analytics → Lowest Latency Processing
Wireless Access Network → Edge high performance computing → Deterministic Lowest Latency → Fault-Tolerant System → Network and Signal Processing
Robotics/Sensor Fusion/Autonomous → Real-time Video/Image Recognition → Time synchronized Processing → Deterministic Latency → Fault-Tolerant System
Space and Military → High Performance Embedded Computing → Heterogeneous Architecture → Video and Image Analytics → Lowest Latency Processing
Underlying Structure for Analytics…
Demands balanced low latency computing, I/O, memory, and storage processing
Wireless Network Evolution: Driven by Real-time Data with better QoS
150x Reduction
Lower Latency to achieve superior QoE
~10x
~10-50x
Higher Bandwidth driven by Video/Social Media
Edge-Core Network Convergence: Race to msec!
[Figure: User, Radio, Base-Station, Gateway, Internet/Core Network, Servers/Storage; network tiers: Access, Edge, Core/Aggregation]
Traditional Computing Model
I/O, Memory and Storage Bottleneck
Suffers from an inherent mismatch between the performance evolution of I/O, Memory, and Storage and that of Processing (CPU/Accelerators)
Distributed Edge Network Computing Model
[Figure labels: Power, X86/GPU, Edge Servers, Cloud RAN, 4G BTS with integrated computing]
→ Reduces bandwidth demand from Access to Core
→ Provides Real-time predictive decision
→ User Centric optimization of Content delivery
Computing and Network Convergence: User centric Network
Real-time User Centric Network
Deep Learning Video Example (Hacking a new Computing)
Hacking a new Computer
• Balancing GPU/CPU performance with Memory and I/O
Hacking a new Computer
• Scale-out Computing and storage
4 GPU Eval Cards, 8G I/O NIC
4 GPU (NVIDIA Tegra K1), 56G I/O (4x14G NIC), 200G Switching
38U System, 144 TFlops, 608 GPU Nodes (4x4x36), 38 Port Switching Fabric
Hacking a new Computer
Deep Learning Micro-Cluster: Real-time Video/Social Analytics
RapidIO Switch Appliance
Low Power ARM+GPU Cluster with RapidIO
~29 frames/sec/node
~100 ns RapidIO switching latency
> TFlops Processing/node
~5W-11 W per GPU node
Example Analytics Deep Learning Model for Image/Video
Analytics Computing Model Challenges
+ Shared memory between CPU and GPU
+ Supports Scale-out System
- Shared Memory and Storage challenging without hardware support
+ Switch latency excellent (~100 nsec)
- Memory to memory latency can be improved
Scale-out Analytics Clustering Model
+ Large Scale-out Analytics Model
+ Low latency < 200-300 nsec
+ Low Power
+ Shared Memory/Storage Pool
+ Open ISA Model
+ Standard Fabric
RISC-V Scale-out Analytics Clustering Model
Fabric: TileLink and AMBA
• Off-chip Fabric
  – Scale-out and Remote Memory Acceleration with RapidIO
• On-chip Fabric
  – TileLink
    • UC Berkeley
    • Hierarchical / Deadlock Free / Built-in Message types (supports Accelerators)
    • Extensible with Custom Message for Coherency (5-state and others)
  – AMBA
    • ARM Inc.
    • AMBA AXI and ACE Protocol Specification, Issue E, 22 Feb, 2013
    • CC-NUMA Architecture, Up to 5 State Model
RISC-V Architecture with RapidIO
TileLink-RapidIO I/O and Memory Fabric
RISC-V Multi-chip Generator
Initial Version of Multi-chip Hardware Generator available
Remote Memory/IO Scale-out Packets
Packet fields: DestinationID, SourceID, FlowControl, Payload, Protection, Addressing, TYPE
[Figure: SoC1, SoC2, SoC3, SoC4, SoC5]
Request-Response Peer-to-Peer Model
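The request-response peer-to-peer model can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names follow the slide (DestinationID, SourceID, FlowControl, Payload, Protection, Addressing, TYPE), but the `PacketType` values, the `SoC` class, the `make_request` and `handle` helpers, and the toy `checksum` are hypothetical and are not taken from the RapidIO specification or from the presented hardware.

```python
from dataclasses import dataclass
from enum import Enum


class PacketType(Enum):
    # Hypothetical transaction types, named for illustration only.
    READ_REQUEST = "read_request"
    WRITE_REQUEST = "write_request"
    RESPONSE = "response"


@dataclass(frozen=True)
class ScaleOutPacket:
    destination_id: int    # DestinationID: SoC the packet is routed to
    source_id: int         # SourceID: SoC that sent the packet
    ptype: PacketType      # TYPE
    address: int           # Addressing: remote memory or I/O address
    payload: bytes = b""   # Payload
    flow_control: int = 0  # FlowControl (carried but not modeled here)
    protection: int = 0    # Protection (toy checksum, see below)


def checksum(data):
    """Toy protection value: byte sum modulo 2**16 (an assumption, not a real CRC)."""
    return sum(data) & 0xFFFF


class SoC:
    """A node that can both issue requests and answer them (peer-to-peer)."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.memory = {}  # address -> bytes, stands in for local memory

    def make_request(self, target_id, ptype, address, payload=b""):
        return ScaleOutPacket(
            destination_id=target_id,
            source_id=self.node_id,
            ptype=ptype,
            address=address,
            payload=payload,
            protection=checksum(payload),
        )

    def handle(self, pkt):
        """Serve one request and return the response addressed to its sender."""
        if pkt.destination_id != self.node_id:
            raise ValueError("packet routed to the wrong node")
        if pkt.ptype is PacketType.WRITE_REQUEST:
            self.memory[pkt.address] = pkt.payload
            data = b""
        elif pkt.ptype is PacketType.READ_REQUEST:
            data = self.memory.get(pkt.address, b"")
        else:
            raise ValueError("a response is not answered")
        # The response swaps source and destination so it returns to the requester.
        return ScaleOutPacket(
            destination_id=pkt.source_id,
            source_id=self.node_id,
            ptype=PacketType.RESPONSE,
            address=pkt.address,
            payload=data,
            protection=checksum(data),
        )
```

Any `SoC` instance can play either role: `SoC(2).handle(SoC(1).make_request(2, PacketType.READ_REQUEST, 0x1000))` returns a response addressed back to node 1, and the same two objects can exchange roles for the next transaction.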
TileLink Basic Protocol and RapidIO Packet Structure
[Figure: TileLink messages exchanged between client and manager: Acquire, Probe, Release, Grant, Finish]
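The five message types named on the slide can be arranged into one transaction as a rough sketch. The ordering below (Acquire from the requesting client, Probe and Release with other clients that hold a copy, then Grant and Finish) reflects the early TileLink protocol as commonly described; the function name `acquire_block`, the client names, and the simplifying assumption that every other holder is probed are illustrative only.

```python
def acquire_block(requester, sharers):
    """Return the ordered message trace for one acquire transaction.

    requester: name of the client asking for the cache block
    sharers:   set of client names that currently hold a copy

    Simplifying assumption: the requester wants exclusive access, so the
    manager probes every other holder and waits for its Release.
    """
    trace = [("Acquire", requester, "manager")]
    for other in sorted(sharers - {requester}):
        trace.append(("Probe", "manager", other))    # manager asks holder to give up the block
        trace.append(("Release", other, "manager"))  # holder acknowledges and releases it
    trace.append(("Grant", "manager", requester))    # manager supplies data/permissions
    trace.append(("Finish", requester, "manager"))   # requester confirms completion
    return trace
```

Each trace entry is a `(message, sender, receiver)` tuple, so calling `acquire_block("c0", {"c1", "c2"})` yields an Acquire, two Probe/Release pairs, a Grant, and a Finish, in that order.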
RapidIO packet layout (32-bit words; byte offsets +0 to +3, bits 0 to 31 per word)

Offset    Fields                                                                          Word
Byte 0    ackID, vc, CRF, prio, tt=10, Ftype, destinationID[7:0], sourceID[7:0]           W 0
Byte 4    TType (transaction), rQoS, rd-/wr-size, srcTID, rSizeBurst                      W 1
Byte 8    udrB, rsvd, rOffset, rsvd, rUnion                                               W 2
Byte 12   Address, Address                                                                W 3
Byte 16   Address, Address                                                                W 4
Byte 20   Payload (Word 0)                                                                W 5
Byte 24   Payload (Word 1)                                                                W 6
…         Payload (Word 15)                                                               W 20
          CRC, Padding                                                                    W 21
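The word numbering in the layout above allows a quick size calculation. The sketch below assumes each row is one 32-bit word, as the bit ruler suggests, and groups the words into sections; the section names and the grouping of W 0 to W 2 as "header" are labels chosen here for illustration, not terms from the slide.

```python
WORD_BYTES = 4  # each row of the layout is one 32-bit word

# (section name, first word, last word), inclusive, as transcribed from the layout.
SECTIONS = [
    ("header", 0, 2),         # W 0 to W 2: routing, transaction and size fields
    ("address", 3, 4),        # W 3 to W 4
    ("payload", 5, 20),       # W 5 to W 20: Payload Word 0 to Word 15
    ("crc_padding", 21, 21),  # W 21
]


def section_sizes(sections=SECTIONS):
    """Size in bytes of each section of the packet."""
    return {name: (last - first + 1) * WORD_BYTES for name, first, last in sections}


def payload_efficiency(sections=SECTIONS):
    """Fraction of the packet's bytes that carry payload."""
    sizes = section_sizes(sections)
    return sizes["payload"] / sum(sizes.values())
```

For the layout as transcribed, this gives a 22-word (88-byte) packet carrying 64 bytes of payload, so a little under three quarters of each full-size packet is payload.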
Hardware Simulation Model
RapidIO Egress
RapidIO Ingress
Target Latency CPU to CPU ~100–200 nsec
Summary
• Scale-out Analytics with balancing Fabric and Computing
  – Scale-out through distributed Edge Computing Model
    • Reduces round-trip latency
    • Reduces access to core network bandwidth
  – RISC-V with integrated Fabric for I/O and Memory
    • Fabric for Any Topology (Mesh/Hypercube)
    • Low Latency directly from SoC on-chip Fabric
    • Supports TileLink scale-out for multi-node clustering
    • Enables Balanced Coherent and scale-out architecture
RISC-V CPU Generator model with Port for RapidIO available!