EXTENDED TASK QUEUING: ACTIVE MESSAGES FOR HETEROGENEOUS SYSTEMS
MICHAEL LEBEANE
POST GRADUATE RESEARCHER, AMD RESEARCH
GRADUATE STUDENT, THE UNIVERSITY OF TEXAS AT [email protected]
| EXTENDED TASK QUEUING: ACTIVE MESSAGES FOR HETEROGENEOUS SYSTEMS | AUGUST 2, 2018
AUTHORS
Michael LeBeane, Brandon Potter, Abhisek Pan, Alexandru Dutu, Vinay Agarwala, Wonchan Lee, Deepak Majeti, Bibek Ghimire, Eric Van Tassell, Samuel Wasmundt, Brad Benton, Mauricio Breternitz, Michael L. Chu, Mithuna Thottethodi, Lizy K. John, Steven K. Reinhardt
DISCLAIMER
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2016 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their respective owners.
The work described in this presentation was made with Government support awarded by the DOE. The Government may have certain rights in this work.
ACCELERATORS EVERYWHERE!
Accelerators (especially GPUs) are everywhere in modern HPC
Over 80 of the Top 500 supercomputers use accelerators[1]
Hundreds of applications are designed to leverage GPU compute[2]
Accelerator communication across nodes is cumbersome…
INTRODUCTION
[1] TOP500.org, “Highlights – June 2016,” http://www.top500.org/lists/2016/06/highlights, 2016.
[2] Nvidia, “GPU-Accelerated Applications,” http://www.nvidia.com/content/gpu-applications/pdf/gpu-apps-catalog-mar2015.pdf, 2016.
CROSS-NODE HETEROGENEITY
Current HPC GPU Communication
‒ Send data through some local interconnect (e.g., PCIe) out of an HPC NIC (e.g., InfiniBand®)
‒ Target receives data and invokes GPU driver to enqueue task
‒ Optimized variants exist (e.g., GPUDirect RDMA[3]), but still must sync with CPU driver
Can we do better?
‒ Yes! But first we need to understand two important technologies…
INTRODUCTION
[Figure: initiator and target nodes connected by a network; each node contains a CPU with cache and memory, an IOC, a NIC with memory, and an accelerator with memory.]
[3] Mellanox, “Mellanox GPUDirect RDMA user manual,” http://www.mellanox.com/related-docs/prod_software/Mellanox_GPUDirect_User_Manual_v1.2.pdf, 2015.
REMOTE DIRECT MEMORY ACCESS (RDMA)
RDMA allows for direct access of remote memory without involving the CPU
‒Heavy lifting is performed on the NIC (off-load networking model)
‒Generally expressed in terms of remote Put/Get operations
Many common RDMA interfaces
‒RoCE, InfiniBand, iWARP, Portals 4, etc.
BACKGROUND
[Figure: initiator and target nodes connected by a network; each node contains a CPU, cache, memory, IOC, and NIC.]
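The Put/Get model described on this slide can be sketched in a few lines of C. This is only an illustrative model, not a real RDMA API: `rdma_region_t`, `rdma_put`, and `rdma_get` are hypothetical names, and a `memcpy` stands in for the NIC's DMA engine. The point it shows is that the initiator names a location in memory the target has exposed, and no target-side CPU code runs on the data path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of a memory region the target has exposed to its NIC. */
typedef struct {
    uint8_t *base;   /* start of the registered region */
    size_t   length; /* size of the region in bytes */
} rdma_region_t;

/* One-sided put: copy len bytes from the initiator's buffer into the remote
 * region at the given offset. memcpy stands in for the NIC's DMA engine.
 * Returns 0 on success, -1 if the access would leave the region. */
static int rdma_put(rdma_region_t *remote, size_t offset,
                    const void *src, size_t len)
{
    if (offset > remote->length || len > remote->length - offset)
        return -1;
    memcpy(remote->base + offset, src, len);
    return 0;
}

/* One-sided get: copy len bytes out of the remote region into dst. */
static int rdma_get(const rdma_region_t *remote, size_t offset,
                    void *dst, size_t len)
{
    if (offset > remote->length || len > remote->length - offset)
        return -1;
    memcpy(dst, remote->base + offset, len);
    return 0;
}
```

A caller would wrap a buffer in an `rdma_region_t`, put a message at some offset, and read it back with a get; accesses that fall outside the region are rejected rather than truncated.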
TIGHTLY COUPLED FRAMEWORKS
Tightly Coupled Frameworks
‒Complete system architectures and interconnects integrating CPUs, GPUs, and other accelerators
‒HSA™, OpenCAPI™, Gen-Z, CCIX, etc.
HSA[4] will be our example tightly coupled framework for this work
Relevant Features
‒User-level, architected command queuing
‒Globally coherent memory regions
‒Shared virtual address space
BACKGROUND
[Figure, panel “Shared Virtual Memory”: tightly coupled CPU (with MMU) and accelerator (with IOMMU and OS driver) over shared physical memory.]
[Figure, panel “Architected Queuing”: tightly coupled CPU (producer) and accelerator (consumer) exchanging command packets through a command queue in virtual memory.]
[4] HSA Foundation, “HSA platform system architecture specification 1.0,” http://www.hsafoundation.com/standards, 2015.
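User-level architected queuing can be pictured as a ring buffer in shared memory plus a doorbell. The sketch below is a simplified, single-threaded model under stated assumptions: `command_packet_t`, `command_queue_t`, and the two functions are hypothetical names, the packet is reduced to two fields, and the memory ordering that a real producer/consumer pair needs is left out.

```c
#include <assert.h>
#include <stdint.h>

#define QUEUE_SLOTS 8u  /* power of two so indices can be masked */

/* Hypothetical, heavily reduced command packet. */
typedef struct {
    uint64_t kernel; /* which task to run */
    uint64_t arg;    /* single argument for the task */
} command_packet_t;

/* Queue living in memory visible to both producer and consumer. */
typedef struct {
    command_packet_t ring[QUEUE_SLOTS];
    uint64_t write_index; /* advanced by the producer (CPU) */
    uint64_t read_index;  /* advanced by the consumer (accelerator) */
    uint64_t doorbell;    /* last write index published to the consumer */
} command_queue_t;

/* Producer: write the packet, then ring the doorbell. No system call or
 * driver is involved. Returns -1 when the ring is full. */
static int queue_enqueue(command_queue_t *q, command_packet_t pkt)
{
    if (q->write_index - q->read_index == QUEUE_SLOTS)
        return -1;
    q->ring[q->write_index & (QUEUE_SLOTS - 1u)] = pkt;
    q->write_index++;
    q->doorbell = q->write_index;
    return 0;
}

/* Consumer: take the next published packet. Returns -1 when none is ready. */
static int queue_dequeue(command_queue_t *q, command_packet_t *out)
{
    if (q->read_index == q->doorbell)
        return -1;
    *out = q->ring[q->read_index & (QUEUE_SLOTS - 1u)];
    q->read_index++;
    return 0;
}
```

Because the queue and doorbell are plain memory locations, any device that can write them can act as the producer, which is the property the rest of the talk relies on.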
CROSS-NODE HETEROGENEITY
Tightly-coupled accelerator frameworks enable efficient, user-level task invocation between devices inside a node
RDMA enables efficient data movement between devices across nodes
By combining the intra-node tasking model of HSA with the inter-node data movement of RDMA, we can produce a generalized, user-level tasking framework for accelerators in distributed memory systems.
What would such a system look like?
INTRODUCTION
CROSS-NODE HETEROGENEITY
HSA-like Solution
‒Can communicate through shared virtual address space
‒CPU must still launch tasks on target-side GPU
‒Can we do even better?
INTRODUCTION
[Figure: initiator and target nodes connected by a network; each node contains a NIC, CPU, accelerator, caches, and memory.]
CROSS-NODE HETEROGENEITY
Our solution: Extended Task Queuing (XTQ)
‒Can communicate through shared virtual address space
‒NIC is aware of all on-chip compute devices
‒NIC is an HSA device
INTRODUCTION
[Figure: initiator and target nodes connected by a network; each node contains an XTQ NIC, CPU, accelerator, caches, and memory.]
INTRODUCING XTQ
XTQ uses an HSA-compliant, RDMA-capable NIC to provide an active messaging[5] framework for all devices in distributed systems
XTQ reduces the launch latency for remote GPU task invocation
‒Tasks are directly scheduled on the GPU by the NIC using shared memory queues
XTQ removes the message processing on the CPU for GPU-destined tasks
‒The CPU is free to perform more useful computation
INTRODUCTION
[5] T. Eicken, D. Culler, S. Goldstein, and K. Schauser, “Active messages: A mechanism for integrated communication and computation,” in Int. Symp. on Computer Architecture (ISCA), 1992, pp. 256–266.
OVERVIEW
XTQ NIC extends RDMA operations to access HSA task queues
On the initiator, the put operation is very standard
‒NIC performs local DMA read of send buffer and transfers over the network
The magic happens on the target side
XTQ ARCHITECTURE
[Figure: initiator and target nodes connected by a network; each node contains an XTQ NIC, CPU, accelerator, caches, and memory.]
TARGET-SIDE XTQ PUT
Payload data streams into target-side receive buffer
Command descriptor is placed into HSA queue
NIC notifies the accelerator using memory-mapped doorbell
Accelerator reads command packet
Accelerator reads transferred data
Accelerator writes shared memory completion signal
CPU reads shared memory completion signal
XTQ ARCHITECTURE
[Figure, repeated across four slides with one step highlighted at a time: tightly coupled CPU, accelerator, and XTQ NIC; labeled elements are Doorbell, Lookup, and, in virtual memory, Payload Data, Command Queue, and Signal.]
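The target-side sequence can be walked through with a small single-threaded C model. Everything here is a hypothetical stand-in rather than XTQ itself: `target_node_t` collapses the command queue to one slot, the "kernel" simply sums the payload bytes, and the completion signal is assumed to start at 1 and drop to 0 on completion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical target node state, all in memory shared by NIC,
 * accelerator, and CPU. */
typedef struct {
    uint8_t  recv_buf[64]; /* target-side receive buffer */
    uint64_t queue_slot;   /* one-entry stand-in for the command queue:
                              holds the payload length to process */
    int      doorbell;     /* set by the NIC, cleared by the accelerator */
    int      signal;       /* completion signal: 1 = pending, 0 = done */
    uint32_t result;       /* output of the stand-in kernel */
} target_node_t;

/* Target CPU: set up state before any message arrives. */
static void target_init(target_node_t *t)
{
    memset(t, 0, sizeof *t);
    t->signal = 1;
}

/* NIC: payload into the receive buffer, command into the queue,
 * then ring the doorbell. The CPU is not involved. */
static int nic_deliver_put(target_node_t *t, const void *payload, size_t len)
{
    if (len > sizeof t->recv_buf)
        return -1;
    memcpy(t->recv_buf, payload, len);
    t->queue_slot = len;
    t->doorbell = 1;
    return 0;
}

/* Accelerator: on doorbell, read the command, read the data,
 * run the task, and write the completion signal. */
static void accelerator_poll(target_node_t *t)
{
    if (!t->doorbell)
        return;
    t->doorbell = 0;
    uint64_t len = t->queue_slot;
    uint32_t sum = 0;
    for (uint64_t i = 0; i < len; i++)
        sum += t->recv_buf[i];
    t->result = sum;
    t->signal = 0;
}

/* CPU: the only CPU-side step is reading the completion signal. */
static int cpu_task_done(const target_node_t *t)
{
    return t->signal == 0;
}
```

In this model the CPU appears only in `target_init` and `cpu_task_done`, mirroring the slides: setup before the message and a signal read after it.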
CHALLENGES
Address Translation?
‒How does initiator know about remote VAs at the target?
‒Use coordinated indices specified by the initiator
‒Lookup tables are populated by the target-side XTQ Library
XTQ ARCHITECTURE
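[Figure: Example Queue Lookup. The initiator's message (RDMA header, command packet, kernel arguments, data payload) carries a target PID and a queue index; on the target, the Queue Lookup Table Base Address Register and the queue index select an entry in the Queue Lookup Table, which holds a queue address (0xF123 in the example) in unified virtual memory.]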
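The index-based translation can be sketched as a small table that the target fills in and the NIC consults. This is a minimal model with assumed details: the names, the fixed table size, and the use of 0 as "unregistered" are all hypothetical, and a single table stands in for the per-process table that the base address register would select.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define XTQ_MAX_QUEUES 16u /* hypothetical table size */

/* One lookup table per target process; entry value 0 means unregistered. */
typedef struct {
    uint64_t queue_va[XTQ_MAX_QUEUES];
} queue_lookup_table_t;

/* Target-side library: publish a queue's virtual address under an index
 * that initiator and target have agreed on out of band. */
static int lookup_register(queue_lookup_table_t *t, uint32_t index, uint64_t va)
{
    if (index >= XTQ_MAX_QUEUES || va == 0)
        return -1;
    t->queue_va[index] = va;
    return 0;
}

/* NIC: translate the index carried in an incoming message into a local
 * virtual address. Returns 0 if the index is out of range or unregistered. */
static uint64_t lookup_resolve(const queue_lookup_table_t *t, uint32_t index)
{
    return index < XTQ_MAX_QUEUES ? t->queue_va[index] : 0;
}
```

The initiator never sees a remote virtual address; it only sends the index, so the target can move or recreate its queue without the initiator changing anything.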
CHALLENGES
Flow Control? Security?
- XTQ data structures need flow control and security
- Low-level networking APIs provide mechanisms to support these features
- XTQ can adopt the policies of the transport it extends
XTQ ARCHITECTURE
XTQ RDMA EXTENSIONS
XTQ Put is implemented as a simple extension to the standard RDMA put operation
‒ Compatible with many low-level RDMA transports (e.g., InfiniBand, RoCE, Portals 4, iWARP)
XTQ Registration API is used to provide index-to-address translations
XTQ API
Regular RDMA Put Operation
Put Command Fields
‒ Target NID/PID
‒ Send Buffer Ptr.
‒ Send Buffer Length
‒ Target Buffer Index
‒ Transport specific metadata
XTQ-Enhanced RDMA Put Operation
Put Command Fields (as above), plus Additional XTQ Fields
‒ Remote Queue Index
‒ Remote Function/Kernel Index
‒ HSA-style command packet
‒ Kernel/Function Launch Parameters
XTQ Rewrite Registration API
Register Queue
‒ Queue Desc. VA
Register Function
‒ Function Ptr. VA
‒ Target Side Buffer VA
Register Kernel
‒ Kernel Ptr. VA
‒ Target Side Buffer VA
‒ Kernel Argument Size
‒ HSA-style completion signal VA
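One way to read the field lists is that an XTQ put is a regular put descriptor with a few extra members appended. The C sketch below illustrates that reading only; the struct and function names and the field widths are assumptions, and transport-specific metadata is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor for a regular RDMA put. */
typedef struct {
    uint32_t    target_nid;          /* target node ID */
    uint32_t    target_pid;          /* target process ID */
    const void *send_buf;            /* send buffer pointer */
    size_t      send_len;            /* send buffer length */
    uint32_t    target_buffer_index; /* which target buffer receives the data */
} rdma_put_cmd_t;

/* Hypothetical additional XTQ fields. */
typedef struct {
    uint32_t    remote_queue_index;  /* which registered queue to enqueue on */
    uint32_t    remote_kernel_index; /* which registered kernel or function */
    const void *command_packet;      /* HSA-style command packet */
    size_t      command_packet_len;
    const void *launch_params;       /* kernel/function launch parameters */
    size_t      launch_params_len;
} xtq_fields_t;

/* XTQ put = regular put fields, unchanged, plus the XTQ fields. */
typedef struct {
    rdma_put_cmd_t put;
    xtq_fields_t   xtq;
} xtq_put_cmd_t;

/* Build an XTQ put from an existing put descriptor. */
static xtq_put_cmd_t xtq_put_make(rdma_put_cmd_t base, xtq_fields_t extra)
{
    xtq_put_cmd_t cmd;
    cmd.put = base;
    cmd.xtq = extra;
    return cmd;
}
```

Keeping the regular put fields as an untouched prefix is what would let the same descriptor ride on top of an existing transport.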
EXPERIMENTAL FRAMEWORK
All data collected in gem5[6]
‒System call emulation mode (no OS)
‒AMD GPU model[7]
‒Full Support for HSA
‒Tightly coupled system
Portals 4-based NIC model[8]
‒Low-level RDMA network programming API currently supported by:
‒ MPICH, Open MPI, GASNet, Berkeley UPC, GNU UPC, and others
‒XTQ implemented as an extension of the Portals 4 remote Put operation
RESULTS
CPU and Memory Configuration
CPU Type: 8-wide OOO, 4 GHz, 8 cores
I,D-Cache: 64K, 2-way, 1 cycle
L2-Cache: 2MB, 8-way, 4 cycles
L3-Cache: 16MB, 16-way, 20 cycles
DRAM: DDR3, 4 Channels, 800MHz
GPU Configuration
GPU Type: 1 GHz, 24 Compute Units
D-Cache: 16kB, 64B line, 16-way, 4 cycles
I-Cache: 32kB, 64B line, 8-way, 4 cycles
L2-Cache: 768kB, 64B line, 16-way, 24 cycles
NIC Configuration
Link Speed: 100ns / 100Gbps
Network API: Portals 4
Topology: Star
[6] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, R. Sen, K. Sewell, M. Shoaib, N. Vaish, M. D. Hill, and D. A. Wood, “The gem5 simulator,” SIGARCH Comput. Archit. News, pp. 1–7, 2011.
[7] AMD. (2015) The AMD gem5 APU simulator: Modeling heterogeneous systems in gem5. http://gem5.org/GPU_Models.
[8] Sandia National Laboratories, “The Portals 4.0.2 network programming interface,” http://www.cs.sandia.gov/Portals/portals402.pdf, 2014.
SYSTEM MODELS
Target-side tasking control path:
CPU: CPU performs computation
HSA: GPU performs computation through CPU-side HSA Runtime
XTQ: GPU performs computation using XTQ NIC-to-accelerator tasking
RESULTS
[Figure: three node diagrams labeled CPU, HSA, and XTQ, each showing a NIC, CPU, GPU, caches, and memory.]
MICROBENCHMARKS
RESULTS
[Figure: four panels: Latency Decomposition, MPI Accumulate, MPI Reduce, MPI Allreduce.]
Latency Decomposition: stacked bars, time in ns (axis 0 to 2500), for XTQ (64B), HSA (64B), XTQ (4KB), and HSA (4KB). Bar labels as extracted, grouped four per component in legend order:
CPU PtlPut: 311, 312, 310, 313
NIC Initiator Put: 156, 113, 241, 219
Network: 306, 303, 435, 432
NIC Target Put: 94, 64, 145, 135
GPU Launch: 51, 412, 126, 414
GPU Kernel Execution: 228, 229, 592, 545
CPU Completion: 70, 75, 57, 68
Remaining panels: two plot speedup (log axis, 0.1 to 10) against data items (4-byte integers, 1 to 1M) for CPU, HSA, and XTQ; one plots runtime in us (axis 0 to 2000) against nodes (0 to 64) for CPU, HSA, and XTQ.
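The latency decomposition can be totalled per configuration with a short C sketch. It rests on an assumption that the extracted slide text does not confirm: that the 28 bar labels are grouped four per component in legend order (CPU PtlPut, NIC Initiator Put, Network, NIC Target Put, GPU Launch, GPU Kernel Execution, CPU Completion), and that within each group they follow the axis order XTQ (64B), HSA (64B), XTQ (4KB), HSA (4KB). The identifiers are hypothetical.

```c
#include <assert.h>

enum { XTQ_64B, HSA_64B, XTQ_4KB, HSA_4KB, N_CONFIGS };
enum { N_COMPONENTS = 7 };

/* Bar labels in ns, one row per component, one column per configuration.
 * ASSUMPTION: row and column order as described in the text above. */
static const int latency_ns[N_COMPONENTS][N_CONFIGS] = {
    {311, 312, 310, 313}, /* CPU PtlPut */
    {156, 113, 241, 219}, /* NIC Initiator Put */
    {306, 303, 435, 432}, /* Network */
    { 94,  64, 145, 135}, /* NIC Target Put */
    { 51, 412, 126, 414}, /* GPU Launch */
    {228, 229, 592, 545}, /* GPU Kernel Execution */
    { 70,  75,  57,  68}, /* CPU Completion */
};

/* Sum all components for one configuration. */
static int total_latency_ns(int config)
{
    int total = 0;
    for (int c = 0; c < N_COMPONENTS; c++)
        total += latency_ns[c][config];
    return total;
}
```

Under that assumption the totals come to 1216 ns for XTQ (64B) against 1508 ns for HSA (64B), and 1906 ns for XTQ (4KB) against 2126 ns for HSA (4KB), with the GPU Launch row accounting for the largest difference in both cases.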
MACHINE LEARNING BENCHMARKS
Benchmarks from the Microsoft Cognitive Toolkit (CNTK)[9]
Results are projected using real runs augmented with Allreduce() speed-up numbers from the simulator
RESULTS
[Figure: projected speedup (axis 0.8 to 2) for CPU, HSA, and XTQ on AlexNet, AN4 LST, CIFAR, Large Synth, MNIST Conv, and MNIST Hidden.]
[9] Agarwal et al., “An introduction to computational networks and the Computational Network Toolkit,” Microsoft, Technical Report, 2014.
SUMMARY
XTQ uses an HSA-compliant, RDMA-capable NIC to provide an active messaging framework for all devices in distributed systems
XTQ reduces the launch latency for remote GPU task invocation
‒Tasks are directly scheduled on the GPU by the NIC using shared memory queues
XTQ removes the message processing on the CPU for GPU-destined tasks
‒The CPU is free to perform more useful computation
CONCLUSION
THANK YOU!
QUESTIONS?
REFERENCES
[1] TOP500.org, “Highlights – June 2016,” http://www.top500.org/lists/2016/06/highlights, 2016.
[2] Nvidia, “GPU-Accelerated Applications,” http://www.nvidia.com/content/gpu-applications/pdf/gpu-apps-catalog-mar2015.pdf, 2016.
[3] Mellanox, “Mellanox GPUDirect RDMA user manual,” http://www.mellanox.com/related-docs/prod_software/Mellanox_GPUDirect_User_Manual_v1.2.pdf, 2015.
[4] HSA Foundation, “HSA platform system architecture specification 1.0,” http://www.hsafoundation.com/standards, 2015.
[5] T. Eicken, D. Culler, S. Goldstein, and K. Schauser, “Active messages: A mechanism for integrated communication and computation,” in Int. Symp. on Computer Architecture (ISCA), 1992, pp. 256–266.
[6] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, R. Sen, K. Sewell, M. Shoaib, N. Vaish, M. D. Hill, and D. A. Wood, “The gem5 simulator,” SIGARCH Comput. Archit. News, pp. 1–7, 2011.
[7] AMD. (2015) The AMD gem5 APU simulator: Modeling heterogeneous systems in gem5. http://gem5.org/GPU_Models.
[8] Sandia National Laboratories, “The Portals 4.0.2 network programming interface,” http://www.cs.sandia.gov/Portals/portals402.pdf, 2014.
[9] A. Agarwal, E. Akchurin, C. Basoglu, G. Chen, S. Cyphers, J. Droppo, A. Eversole, B. Guenter, M. Hillebrand, T. R. Hoens, X. Huang, Z. Huang, V. Ivanov, A. Kamenev, P. Kranen, O. Kuchaiev, W. Manousek, A. May, B. Mitra, O. Nano, G. Navarro, A. Orlov, H. Parthasarathi, B. Peng, M. Radmilac, A. Reznichenko, F. Seide, M. L. Seltzer, M. Slaney, A. Stolcke, H. Wang, Y. Wang, K. Yao, D. Yu, Y. Zhang, and G. Zweig, “An introduction to computational networks and the Computational Network Toolkit,” Microsoft, Technical Report, 2014.
XTQ PORTALS EXTENSIONS
XTQ API
XtqPut(ptl_handle_md_t  md_handle,
       ptl_size_t       local_offset,
       ptl_size_t       length,
       ptl_handle_md_t  md_handle2,
       ptl_size_t       local_offset2,
       ptl_size_t       length2,
       ptl_ack_req_t    ack_req,
       ptl_process_t    target_id,
       ptl_pt_index_t   pt_index,
       ptl_match_bits_t match_bits,
       ptl_size_t       remote_offset,
       void            *user_ptr,
       ptl_hdr_data_t   hdr_data);
Primary operation is XtqPut
Same as regular PtlPut except:
- Two send buffers
  - One for generic data
  - One for XTQ command packet
[Figure: the two send buffers, labeled XTQ Command Packet and Generic Data Buffer.]
XTQ PORTALS EXTENSIONS
Three functions populate the target-side Rewrite Tables
One each for queues, kernel descriptors, and function descriptors
XTQ API
int XtqRegisterQueue(int id,
                     hsa_queue_t *q);
int XtqRegisterFunction(int id,
                        void *buffer_pointer,
                        void (*function)(uint64_t, uint64_t, uint64_t, uint64_t));
int XtqRegisterKernel(int id,
                      void *buffer_pointer,
                      int kernarg_size,
                      int rewrite_offset,
                      hsa_signal_t completion_signal,
                      uint64_t kernel);
SAMPLE CODE
Initiator
// Initialize RDMA comm layer
int rank = RdmaInit();
int index = 42;
// Construct XTQ command
void *cmd = ConstructCmd(CMD_SIZE, index);
// Post initialization sync with target
ExecutionBarrier();
// Launch on remote GPU using XTQ
XtqPut(TARGET, cmd, CMD_SIZE,
       payload, BUFFER_SIZE);
XTQ API
Target
// Initialize RDMA comm layer
int rank = RdmaInit();
int index = 42;
// Post receive buffer
RdmaPostBuffer(recv_buf);
// Initialize HSA CPU Runtime
TaskingInit(&signal, &kernel, &queue);
// Register Kernel/Queues
XtqRegisterKernel(signal, kernel, index);
XtqRegisterQueue(queue, index);
// Post initialization sync with initiator
ExecutionBarrier();
// Wait for GPU to complete task
SignalWait(signal);
XTQ PACKET FORMAT
XTQ command packets are HSA AQL packets
‒Currently CPU and GPU formats are supported
‒Optional Fields:
  ‒Kernel Arguments
  ‒Data Payload
XTQ ARCHITECTURE
[Figure: a typical XTQ message is a 64-byte AQL packet (rows 64 bits wide) followed by optional, variable-length kernel arguments and data payload. The AQL packet is either a GPU AQL packet (Header, Dispatch, WG X, WG Y, WG Z, Reserved, Grid X, Grid Y, Grid Z, Private Seg Size, Group Seg Size, Kernel Object, Kernel Argument Address, Reserved, Completion Signal) or a CPU AQL packet (Header, Type, Reserved, Return Address, Argument 0 to Argument 3, Reserved, Completion Signal).]
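The two 64-byte packet layouts can be written down as C structs. This is a sketch based on the field names on the slide and on the field widths of the HSA kernel dispatch and agent dispatch packets; the type names are hypothetical, pointers and signals are shown as plain 64-bit values, and mapping the slide's "Dispatch" label to the 16-bit setup field is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* GPU (kernel dispatch) packet, 64 bytes. */
typedef struct {
    uint16_t header;
    uint16_t setup;               /* ASSUMPTION: "Dispatch" on the slide */
    uint16_t workgroup_size_x;    /* WG X */
    uint16_t workgroup_size_y;    /* WG Y */
    uint16_t workgroup_size_z;    /* WG Z */
    uint16_t reserved0;
    uint32_t grid_size_x;         /* Grid X */
    uint32_t grid_size_y;         /* Grid Y */
    uint32_t grid_size_z;         /* Grid Z */
    uint32_t private_segment_size;
    uint32_t group_segment_size;
    uint64_t kernel_object;
    uint64_t kernarg_address;     /* Kernel Argument Address */
    uint64_t reserved2;
    uint64_t completion_signal;
} gpu_aql_packet_t;

/* CPU (agent dispatch) packet, 64 bytes. */
typedef struct {
    uint16_t header;
    uint16_t type;
    uint32_t reserved0;
    uint64_t return_address;
    uint64_t arg[4];              /* Argument 0 to Argument 3 */
    uint64_t reserved2;
    uint64_t completion_signal;
} cpu_aql_packet_t;

/* Both formats must occupy exactly one 64-byte queue slot. */
_Static_assert(sizeof(gpu_aql_packet_t) == 64, "GPU packet must be 64 bytes");
_Static_assert(sizeof(cpu_aql_packet_t) == 64, "CPU packet must be 64 bytes");
```

Because both formats share the header position, the trailing completion signal, and the overall size, a consumer can read the header first and then interpret the remaining bytes as one layout or the other.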
XTQ REWRITE FUNCTIONALITY
Initiator does not know about target side resources needed for tasking
Several fields populated by the target-side NIC using coordinated indices specified by the initiator
Rewrite tables are populated by the target-side XTQ Library
XTQ ARCHITECTURE
[Figure: the initiator's message (RDMA header, kernel object ptr, kernel arguments ptr, completion signal ptr, kernel arguments, data payload) carries a target PID and a kernel index. On the target, the Kernel Lookup Table Base Address Register and the kernel index select a Kernel Lookup Table entry, and the three pointer fields are filled in to point at the kernel image, the target-side buffer (holding Arg0, Arg1), and the HSA signal in unified virtual memory.]
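The rewrite step can be sketched as a table lookup followed by three field assignments. The model below is illustrative only: `kernel_entry_t`, `xtq_packet_t`, and `xtq_rewrite` are hypothetical names, only the rewritten fields of the packet are shown, and pointing the kernel-argument field at the start of the target-side buffer is an assumption (the registration call's `rewrite_offset` parameter is not modeled).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical kernel lookup table entry, filled in by the target. */
typedef struct {
    uint64_t kernel_object;     /* Kernel Ptr. VA */
    uint64_t target_buffer;     /* Target Side Buffer VA */
    uint32_t kernarg_size;      /* Kernel Argument Size */
    uint64_t completion_signal; /* HSA-style completion signal VA */
    int      valid;             /* 0 until the entry is registered */
} kernel_entry_t;

/* Only the packet fields the NIC rewrites are modeled here. */
typedef struct {
    uint64_t kernel_object;
    uint64_t kernarg_address;
    uint64_t completion_signal;
} xtq_packet_t;

/* NIC: replace the fields the initiator could not know with target-local
 * addresses from the table entry named by the message's kernel index.
 * Returns -1 if the index is out of range or not registered. */
static int xtq_rewrite(const kernel_entry_t *table, size_t entries,
                       uint32_t index, xtq_packet_t *pkt)
{
    if (index >= entries || !table[index].valid)
        return -1;
    pkt->kernel_object     = table[index].kernel_object;
    pkt->kernarg_address   = table[index].target_buffer;
    pkt->completion_signal = table[index].completion_signal;
    return 0;
}
```

Rejecting unregistered indices matters here: the packet is about to be handed to an accelerator, so a lookup miss has to stop the enqueue rather than leave stale pointers in place.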
PORTALS 4
Low-level network interface designed to provide RDMA primitives for higher level network protocols
Software reference implementation of Portals 4 using InfiniBand is publicly available
Currently supported by:
‒ MPICH, Open MPI, GASNet, Berkeley UPC, GNU UPC, and others
XTQ leverages Portals for its basic RDMA features
BACKGROUND
ACTIVE MESSAGES
Messages that specify computation
Each message contains enough info to perform some action
Message contains pointer to code
Input data is optionally transmitted
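The active message idea can be shown with a minimal C dispatcher. This is a sketch under stated assumptions: all names are hypothetical, and the message carries a handler index rather than a raw code pointer, since sender and receiver need not share an address space. The two handlers are toy examples.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A handler runs on the receiver with the message payload and some
 * receiver-side state. */
typedef void (*am_handler_t)(const void *payload, size_t len, void *state);

/* Hypothetical message: names its own handler and optionally carries data. */
typedef struct {
    uint32_t handler_index; /* which computation to run */
    size_t   len;           /* number of valid payload bytes (may be 0) */
    uint8_t  payload[32];   /* optional input data */
} active_message_t;

/* Toy handler 0: reset the accumulator; ignores its payload. */
static void am_reset(const void *payload, size_t len, void *state)
{
    (void)payload;
    (void)len;
    *(uint32_t *)state = 0;
}

/* Toy handler 1: add every payload byte to the accumulator. */
static void am_accumulate(const void *payload, size_t len, void *state)
{
    const uint8_t *bytes = payload;
    uint32_t *acc = state;
    for (size_t i = 0; i < len; i++)
        *acc += bytes[i];
}

static const am_handler_t am_handlers[] = { am_reset, am_accumulate };

/* Receiver: look up the handler named by the message and run it.
 * Returns -1 for an unknown handler or an oversized payload. */
static int am_deliver(const active_message_t *m, void *state)
{
    if (m->handler_index >= sizeof am_handlers / sizeof am_handlers[0])
        return -1;
    if (m->len > sizeof m->payload)
        return -1;
    am_handlers[m->handler_index](m->payload, m->len, state);
    return 0;
}
```

The receiver needs no per-message logic beyond the table lookup: the message itself says what to run, and the payload is whatever that handler expects.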
BACKGROUND