INTEL® HPC DEVELOPER CONFERENCE
Fuel Your Insight
Large-scale Distributed Rendering with the OSPRay Ray Tracing Framework
Carson Brownlee
Shared-memory
Distributed-memory
Why MPI?
Data that exceeds the memory limits of a single node
Performance limitations
Tiled displays
In Situ
Strong Scaling
Weak Scaling
High Fidelity Rendering
Related Work
Sending Rays, Kilauea - Kato ’01,’02,’03
Interactive Ray Tracing on Clusters - Wald et al. ‘03
Distributed Shared Memory - DeMarle et al. ‘03
IceT Compositing - Moreland et al. ’11
Multiple Device
API commands are processed by the currently active device, which provides a modular backend for handling API calls (see the sketch after this list). Current devices include:
1. Local
2. MPI
3. COI (Now deprecated in favor of MPI)
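The run commands on the following slides select the MPI device via the --osp:mpi flag; a minimal sketch of what that looks like from application code, assuming the OSPRay 1.x C API (ospray/ospray.h):

#include <ospray/ospray.h>

int main(int argc, char **argv) {
  // "--osp:mpi" on the command line asks ospInit() to create the MPI
  // device instead of the default local (shared-memory) device; the
  // rest of the application code is unchanged between the two devices.
  ospInit(&argc, (const char **)argv);
  // ... create renderer, camera, model, and framebuffer as usual ...
  return 0;
}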
Using OSPRay MPI
Compile
OSPRAY_BUILD_MPI_DEVICE=ON
Requires MPI Library with multi-threading support (IMPI recommended)
OSPRAY_EXP_DATA_PARALLEL=ON (experimental)
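A typical configure invocation with both options enabled might look like this (the source path is a placeholder, not from the slides):

cmake -DOSPRAY_BUILD_MPI_DEVICE=ON -DOSPRAY_EXP_DATA_PARALLEL=ON ../ospray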
Run
mpirun -n 3 ./ospGlutViewer --osp:mpi teapot.obj (mpirun args vary)
mpirun -ppn 1 -n 1 -host localhost ./ospGlutViewer --osp:mpi teapot.obj : -n 2 -host n1,n2 ./ospray_mpi_worker --osp:mpi
ParaView
VTKOSPRAY_ARGS="--osp:mpi" mpirun -ppn 1 -n 1 -host localhost ./paraview : -n 1 -host n1,n2 ./ospray_mpi_worker --osp:mpi
Distributed Framebuffer
Supports both data-replicated and data-distributed rendering
Tile ownership
Stores accumulation buffer locally
Pixel Operations
Processed tiles with framebuffer colors sent to display node
Tiling Pseudocode
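The slide's pseudocode is not reproduced in this text; below is a minimal sketch of the tile flow described above. The names (Tile, renderTile, tileOwner, sendTile) are illustrative, not OSPRay's actual identifiers.

// Each rank renders the tiles the load balancer assigns to it, then
// sends each finished tile to the rank that owns it in the distributed
// framebuffer (DFB).
for (int tileID = myRank; tileID < numTiles; tileID += numRanks) {
  Tile tile = renderTile(tileID);                    // local rendering work
  int owner = tileOwner(tile.x, tile.y, numRanks);   // DFB tile ownership (see Load Balancing)
  sendTile(owner, tile);                             // asynchronous send to the owning rank
}
// The owning rank accumulates the tile (the accumulation buffer is stored
// locally), applies pixel operations, and forwards the final tile colors
// to the display node.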
Load Balancing
Static load balancing
Tiles are strided across ranks to avoid work imbalance (example assignment for 3 ranks):
1 2 3 1 2 3
2 3 1 2 3 1
3 1 2 3 1 2
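One formula that reproduces the strided pattern above (illustrative; not necessarily the exact expression in OSPRay's source):

int tileOwner(int tileX, int tileY, int numRanks) {
  // Using both tile coordinates shifts the assignment each row, so
  // neighboring tiles in a row and in a column land on different ranks.
  return (tileX + tileY) % numRanks;
}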
Work API Comm
API:
ospRenderFrame() {…}
MPIDevice:
MPIDevice::renderFrame()
{
  work::RenderFrame work(_fb, _renderer, fbChannelFlags);
  processWork(&work);
}
Work:
void RenderFrame::serialize(SerialBuffer &b) const {
  b << (int64)fbHandle << (int64)rendererHandle << channels;
}
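On the worker side the handles are read back out of the buffer before run() executes; a hypothetical counterpart is sketched below (the actual method name, buffer operators, and handle type in OSPRay's source may differ):

void RenderFrame::deserialize(SerialBuffer &b) {
  int64 fb, renderer;
  // Read fields in the same order serialize() wrote them
  b >> fb >> renderer >> channels;
  fbHandle = ObjectHandle(fb);
  rendererHandle = ObjectHandle(renderer);
}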
Work API Comm
Work:
void RenderFrame::run() {
  FrameBuffer *fb = (FrameBuffer*)fbHandle.lookup();
  Renderer *renderer = (Renderer*)rendererHandle.lookup();
  renderer->renderFrame(fb, channels);
}
Worker:
mpi::recv(mpi::Address(&mpi::app, (int32)mpi::RECV_ALL), workCommands);
for (work::Work *&w : workCommands)
  w->run();
Async Comm Layer
Actions are separated into
receive queue
process queue
send queue
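A minimal sketch of the process-queue side, assuming a blocking thread-safe queue (SyncQueue, Consumer, and pop() are illustrative names, not OSPRay's):

void processThread(SyncQueue<Message*> &processQueue, Consumer *consumer) {
  // Pop messages delivered by the receive thread and hand them to the
  // consumer's incoming() handler (see the DFB example on the next slide);
  // outgoing replies are pushed onto the send queue.
  for (;;) {
    Message *msg = processQueue.pop();  // blocks until a message is available
    consumer->incoming(msg);
  }
}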
Async Comm Layer
struct MasterTileMessage : public mpi::async::CommLayer::Message {
  vec2i coords;
  float error;
  uint32 color[TILE_SIZE][TILE_SIZE];
};
void DFB::incoming(mpi::async::CommLayer::Message *_msg) {
  switch (_msg->command) {
  case MASTER_WRITE_TILE_NONE:
    this->processMessage((MasterTileMessage_NONE*)_msg);
    break;
  }
}
Distributed Data
Currently experimental and only for Volume data
env var OSPRAY_DATA_PARALLEL=blockXxBlockYxBlockZ
Data is projected onto image tiles; each node determines which tiles its data overlaps
Tiles sent to owning node for compositing
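For example (illustrative values), OSPRAY_DATA_PARALLEL=4x4x4 splits the volume into a 4x4x4 grid of 64 bricks, which are distributed across the worker ranks as resident data.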
Strong Scaling
Distributed API
Ability to specify what is run where
3 Modes:
Master/Slave
- All non-master ranks execute the commands issued by the master rank
Collaborative
- All ranks issue the same commands
Independent
- Commands run locally on each rank
D-API Example - Distributed Volume Rendering
Sync: initialization
Sync: create shared volume
Local: create resident volume section
Local: add local volume to synchronous volume
Master: add annotations
Sync: render
Distributed API
ospdApiMode(OSPD_MODE_INDEPENDENT);
// Each rank independently creates its resident section of the volume data
OSPVolume localVol = ospNewVolume("shared_structured_volume");
OSPData ospLocalVolData = ospNewData(volumeData.size(), OSP_UCHAR, volumeData.data(), OSP_DATA_SHARED_BUFFER);
ospCommit(ospLocalVolData);
// Switch back to collaborative mode and commit the collab volume and add it to the world
ospdApiMode(OSPD_MODE_COLLABORATIVE);
ospCommit(volume); // 'volume' is the shared volume created collaboratively earlier (not shown)
ospAddVolume(world, volume);
ospCommit(world);
D-API Implementation
void MPIDevice::processWork(work::Work* work)
{
  if (currentApiMode == OSPD_MODE_MASTER) {
    mpi::send(mpi::Address(&mpi::worker, (int32)mpi::SEND_ALL), work);
  } else if (currentApiMode == OSPD_MODE_COLLABORATIVE) {
    // sync calls
  }
  work->run();
}
Tiled Displays
DisplayWald - Experimental
Built as an OSPRay module
Requires MPI
Stereo supported
Routing through a single head node is supported if the display nodes are not directly reachable from the compute nodes
DisplayWald - Experimental
Server (displays):
mpirun -perhost 1 -n 6 ./ospDisplayWald -w 3 -h 2 --no-head-node
mpirun -perhost 1 -n 6 ./ospDisplayWald -w 3 -h 2 --head-node
// will output hostname and port
Client (renderer):
mpirun -n <numRanks> ./ospDwViewer --display-wall-host host:port
Performance Tips
Wayness - a single MPI process per node is ideal
Excessive API calls can currently cause very long load times
Affinity issues - check that CPU utilization is pegged at 100%
KNL cache mode - OSPRay runs best in cache/quadrant mode
Samples per pixel - negative values render a subset of the image per frame
Questions?
Legal Notices and Disclaimers
Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors.
Performance tests are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
Copyright © 2016 Intel Corporation. All rights reserved. Intel, Intel Inside, the Intel logo, Intel Xeon and Intel Xeon Phi are trademarks of Intel Corporation in the United States and other countries. *Other names and brands may be claimed as the property of others.