Distributing Computation to Large GPU Clusters | GTC
TRANSCRIPT
Distributing Computation to Large GPU Clusters
What is this about?
DiCE: Software library for writing applications
scaling to many GPUs and CPUs in a cluster
Used since 2003 in our rendering products:
NVIDIA indeX, NVIDIA Iray
[Images courtesy of Vyacheslav Serov, Rüdiger Raab, and Thomas Zancker]
Why are we presenting this here?
DiCE is a base technology in indeX
— Clustering / networking / distribution based on DiCE
DiCE API exposed by indeX
— Distribute pre-computation of data for indeX
— Do your own computation…
Design Goals
"Provide a software library to be used by rendering
experts to write scalable software for GPU clusters."
— Not required: low-level parallelization / networking knowledge
— High level of abstraction / easy to use...
— Not specific to a particular domain (e.g. rendering)
— High performance, meant for interactive applications
Other solutions...
Unique Combination of Features
Simple programming model
Ease of deployment / commodity hardware
Unified multi-core and cluster parallelization
GPU support
Dynamic clustering
Focus on interactive applications
Multi-user support e.g. for web services
Available on Windows, Linux, Mac OS X
Overview
Networking / Clustering
Datastore
Job System
C++ API
Application
DiCE and indeX
[Diagram: the DiCE stack (Networking / Clustering, Datastore, Job System, C++ API, Application) with indeX built on top, using the Job System]
[Diagram: a job split into fragments 0–19]
Parallelization Model
Programmer: split the work into n fragments!
— As independent as possible
— Potentially thousands per "frame"!
No a-priori knowledge about resources in the cluster!
Goal: distribute work over all GPUs / CPUs in the cluster
Parallelization Model
Fragmented Job
~ similar to a CUDA kernel
Implement a C++ class:
void execute_fragment(int i, int n) {…}
To be called once for every fragment
Ask DiCE to execute the job in n fragments
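The fragmented-job model can be sketched as follows. This is an illustrative stand-in, not the DiCE API: `Fragmented_job` and `execute_in_fragments` are hypothetical names, and plain CPU threads stand in for DiCE's GPU/CPU scheduling.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical job interface: execute_fragment(i, n) is called once per
// fragment i of n, analogous to one CUDA kernel invocation.
struct Fragmented_job {
    virtual ~Fragmented_job() = default;
    virtual void execute_fragment(int i, int n) = 0;
};

// Stand-in scheduler: here each fragment simply runs on its own thread;
// DiCE would instead dispatch fragments to the GPUs/CPUs of the cluster.
inline void execute_in_fragments(Fragmented_job& job, int n) {
    std::vector<std::thread> workers;
    for (int i = 0; i < n; ++i)
        workers.emplace_back([&job, i, n] { job.execute_fragment(i, n); });
    for (auto& w : workers) w.join();
}

// Example job: each fragment fills its own slice of a framebuffer, so the
// fragments are fully independent of each other.
struct My_render_job : Fragmented_job {
    std::vector<int> framebuf;
    explicit My_render_job(int pixels) : framebuf(pixels, 0) {}
    void execute_fragment(int i, int n) override {
        int begin = (int)framebuf.size() * i / n;
        int end   = (int)framebuf.size() * (i + 1) / n;
        for (int p = begin; p < end; ++p)
            framebuf[p] = p;  // "render" pixel p
    }
};
```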
Parallelization Model – Cluster
Not a shared memory model!
Idea: split execution and integration of results
void execute_remote(int i, int n, OUT) {…}  — remote host
void receive_result(int i, int n, IN) {…}  — origin host
execute_remote() + receive_result() = execute_fragment()
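A minimal sketch of this split, with string streams standing in for the network; `My_sum_job` and `run_distributed` are hypothetical names, not DiCE interfaces. Chaining execute_remote() and receive_result() per fragment reproduces what execute_fragment() would do in a shared-memory setting.

```cpp
#include <cassert>
#include <iostream>
#include <sstream>
#include <vector>

struct My_sum_job {
    std::vector<float> data;  // replicated job data (e.g. the scene)
    float total = 0.0f;       // result, lives only on the origin host

    // Runs on the remote host: compute the partial sum for fragment i of n
    // and serialize it into the outgoing "network" stream.
    void execute_remote(int i, int n, std::ostream& out) const {
        int begin = (int)data.size() * i / n;
        int end   = (int)data.size() * (i + 1) / n;
        float partial = 0.0f;
        for (int p = begin; p < end; ++p) partial += data[p];
        out << partial;
    }

    // Runs on the origin host: deserialize the partial result, integrate it.
    void receive_result(int i, int n, std::istream& in) {
        float partial = 0.0f;
        in >> partial;
        total += partial;
    }
};

// Simulated distributed run: each fragment's result travels through a
// string stream standing in for the network transport.
inline float run_distributed(My_sum_job& job, int n) {
    for (int i = 0; i < n; ++i) {
        std::stringstream wire;
        job.execute_remote(i, n, wire);   // on some remote host
        job.receive_result(i, n, wire);   // back on the origin host
    }
    return job.total;
}
```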
Parallelization Model – Single Host
My_job: Scene, Camera, Framebuf[ ]
1 host, 2 GPUs
Fragment assignment: 0 → GPU 1, 1 → GPU 1, 2 → GPU 2, 3 → GPU 2, 4 → GPU 1, 5 → GPU 2
execute_fragment() is called once per fragment (0–5), in parallel on the assigned GPUs.
Parallelization Model
Fragment assignment across hosts: 0 → GPU 1 / Host 1, 1 → GPU 1 / Host 2, 2 → GPU 2 / Host 2, 3 → GPU 2 / Host 1, 4 → GPU 1 / Host 3, 5 → GPU 2 / Host 3
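One naive way to picture such an assignment is a static round-robin over all GPU slots in the cluster. This is a hypothetical illustration only (`assign_round_robin` is not a DiCE function); DiCE schedules fragments dynamically, without a-priori knowledge of the cluster's resources.

```cpp
#include <cassert>
#include <vector>

// A GPU slot identified by 1-based host and GPU ids.
struct Gpu_slot { int host; int gpu; };

// Static round-robin: fragment f goes to slot f mod (hosts * gpus_per_host).
inline std::vector<Gpu_slot> assign_round_robin(int fragments, int hosts,
                                                int gpus_per_host) {
    std::vector<Gpu_slot> assignment(fragments);
    int slots = hosts * gpus_per_host;
    for (int f = 0; f < fragments; ++f) {
        int s = f % slots;
        assignment[f] = {s / gpus_per_host + 1, s % gpus_per_host + 1};
    }
    return assignment;
}
```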
Parallelization Model – 3 Hosts
3 hosts, 2 GPUs each
Host 1: My_job (Scene, Camera, Framebuf[ ])
Hosts 2 and 3: replicated My_job (Scene, Camera; no framebuffer)
Fragment assignment: 0 → GPU 1 / Host 1, 1 → GPU 1 / Host 2, 2 → GPU 2 / Host 2, 3 → GPU 2 / Host 1, 4 → GPU 1 / Host 3, 5 → GPU 2 / Host 3
Fragments 1, 2, 4 and 5 run execute_remote() on Hosts 2 and 3; fragments 0 and 3 run execute_fragment() locally on Host 1. The remote results are sent back and integrated on Host 1 via receive_result().
Parallelization Model - Hierarchical
[Diagram: Viewer Host → Compositor Host → Render Hosts → GPUs; a Compositor Job spawns Rendering Jobs, which spawn GPU Jobs for their GPU fragments]
Datastore
In-memory NoSQL datastore for arbitrary C++ objects
Store an object on some host / retrieve it on any host
Data transport (mostly) transparent to the application
Datastore Objects
class My_adder : public Element< UUID >        // derive from base class
{
    float m_a;                                 // arbitrary member variables
    int m_b;
    float sum() { return m_a + m_b; }          // arbitrary member functions
    void serialize(ISerializer* serializer)    // implement serialization
    {
        serializer->write(m_a);
        serializer->write(m_b);
    }
    void deserialize(IDeserializer* deserializer)  // implement deserialization
    {
        deserializer->read(m_a);
        deserializer->read(m_b);
    }
};
register_serializable_class< My_adder >();     // register the class
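To illustrate why serialization is what makes cross-host storage work, here is a toy in-memory datastore that keeps objects only in serialized byte form. `Serializer`, `Deserializer`, and `Datastore` are simplified stand-ins for the DiCE interfaces, assuming nothing beyond what the slides show.

```cpp
#include <cassert>
#include <cstring>
#include <map>
#include <string>
#include <vector>

// Toy byte-stream serializer/deserializer for trivially copyable members.
struct Serializer {
    std::vector<char> bytes;
    template <typename T> void write(const T& v) {
        const char* p = reinterpret_cast<const char*>(&v);
        bytes.insert(bytes.end(), p, p + sizeof(T));
    }
};
struct Deserializer {
    const std::vector<char>* bytes;
    size_t pos = 0;
    template <typename T> void read(T& v) {
        std::memcpy(&v, bytes->data() + pos, sizeof(T));
        pos += sizeof(T);
    }
};

// The class from the slide, against the toy interfaces.
struct My_adder {
    float m_a = 0;
    int m_b = 0;
    float sum() const { return m_a + m_b; }
    void serialize(Serializer* s) const { s->write(m_a); s->write(m_b); }
    void deserialize(Deserializer* d) { d->read(m_a); d->read(m_b); }
};

// Store under a name on one "host", retrieve on another: the object only
// ever crosses host boundaries as bytes.
struct Datastore {
    std::map<std::string, std::vector<char>> storage;
    void store(const std::string& name, const My_adder& obj) {
        Serializer s;
        obj.serialize(&s);
        storage[name] = s.bytes;
    }
    My_adder retrieve(const std::string& name) const {
        Deserializer d{&storage.at(name)};
        My_adder obj;
        obj.deserialize(&d);
        return obj;
    }
};
```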
Datastore: Cache
Per-host cache for objects
— Accessing an object makes sure it is in the cache!
— If necessary, it is fetched from other hosts
If the cache is full: throw away objects owned by others (LRU)
— Store more data in the cluster than a single host could
Configurable redundant storage for handling host failure
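The eviction policy can be illustrated with a plain LRU map. This is a sketch of the LRU idea only, under hypothetical names; DiCE's actual cache additionally fetches misses from other hosts and keeps locally owned objects.

```cpp
#include <cassert>
#include <list>
#include <string>
#include <unordered_map>

// Bounded cache with least-recently-used eviction.
class Lru_cache {
    using Item = std::pair<std::string, int>;
    size_t m_capacity;
    std::list<Item> m_items;  // front = most recently used
    std::unordered_map<std::string, std::list<Item>::iterator> m_index;
public:
    explicit Lru_cache(size_t capacity) : m_capacity(capacity) {}

    void put(const std::string& key, int value) {
        auto it = m_index.find(key);
        if (it != m_index.end()) m_items.erase(it->second);
        m_items.emplace_front(key, value);
        m_index[key] = m_items.begin();
        if (m_items.size() > m_capacity) {        // cache full:
            m_index.erase(m_items.back().first);  // throw away LRU entry
            m_items.pop_back();
        }
    }

    bool get(const std::string& key, int& value) {
        auto it = m_index.find(key);
        if (it == m_index.end()) return false;  // miss: DiCE would now fetch
        // Move the entry to the front to mark it as recently used.
        m_items.splice(m_items.begin(), m_items, it->second);
        value = it->second->second;
        return true;
    }
};
```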
Datastore Transactions
Important for multi-user operation
ACID
— Atomicity: transaction commit / abort
— Consistency: cluster-wide locks available
— Isolation: starting a transaction "freezes" the view on the datastore
— Durability: redundancy
Transaction Isolation
Isolation based on multi-version capability
Copy-on-write
[Diagram: objects A (version A5) and X (version X9), seen by transactions T7 and T8; a write to X creates a new version X10 instead of overwriting X9; once no transaction needs X9 anymore, it is discarded]
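The copy-on-write versioning behind this isolation can be sketched as follows. Illustrative only: `Versioned_store` and its version numbering are hypothetical, and real DiCE transactions also handle commit/abort and cluster-wide distribution.

```cpp
#include <cassert>
#include <iterator>
#include <map>
#include <string>

// Multi-version store: each named object keeps one value per version.
// A transaction at version v reads the newest version <= v; a write
// installs a fresh version instead of overwriting (copy-on-write).
class Versioned_store {
    std::map<std::string, std::map<int, int>> m_objects;  // name -> version -> value
public:
    void write(const std::string& name, int version, int value) {
        m_objects[name][version] = value;  // copy-on-write: new version
    }
    int read(const std::string& name, int version) const {
        const auto& versions = m_objects.at(name);
        auto it = versions.upper_bound(version);  // first version > v
        assert(it != versions.begin());           // some version <= v exists
        return std::prev(it)->second;             // newest version <= v
    }
    // Once no transaction older than `version` remains, obsolete versions
    // can be discarded.
    void purge_below(const std::string& name, int version) {
        auto& versions = m_objects[name];
        auto keep = versions.lower_bound(version);
        versions.erase(versions.begin(), keep);
    }
};
```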
Networking / Clustering
Handles cluster building and data transfers
— Self-organizing, dynamic addition and removal of hosts
— Tested with up to 1000 hosts
— Several networking protocols for different environments…
Network Layer: UDP with Multicast
Unicast: Send to each host
Multicast: Like radio, send once, received by many
Self Organization:
— Multicast address identifies cluster
— Multicast “beacon” packets to announce to other hosts
— “Election” process to elect one synchronizer
Multicast / unicast used for bulk data transfers
— Especially effective for many hosts
Network Layer: TCP
For networks with
— Low-bandwidth multicast or
— No multicast (e.g. Amazon Web Services)
Discovering hosts:
— Via the UDP multicast layer or
— Via at least one known host
TCP used for all data transport
Still fully dynamic
Network Layer: InfiniBand
Native InfiniBand with RDMA
RDMA used for speeding up bulk data transfer
Fastest transmissions > 30 Gbit/s end-to-end
[Diagram: RDMA transfers directly between Host 1's memory (at 0x1234) and Host 2's memory (at 0x4532), bypassing both CPUs]
Other Features
More multi-user capabilities (scopes, ...)
"Futures"
Global logging system
HTTP Server
RTMP Video streaming
Cloud Bridge
...
Summary
DiCE is a library for writing scalable applications
DiCE has been used for 10 years in our rendering products
Currently directly usable if you use indeX
Thank you …
Stefan Radig, Sr. Manager, NVIDIA Iray and DiCE