A Preliminary Investigation on Optimizing Charm++ for Homogeneous Multi-core Machines
Chao Mei, 05/02/2008
The 6th Charm++ Workshop
Motivation
Clusters are built from multicore chips:
  4 cores/node on BG/P
  8 cores/node on Abe (2 Intel quad-core chips)
  16 cores/node on Ranger (4 AMD quad-core chips)
  …
Charm++ has had an SMP build for many years, but it has not been tuned
So, what are the issues in getting high performance?
Start with a kNeighbor benchmark
A synthetic kNeighbor benchmark: each element communicates with its neighbors at stride K (with wrap-around), and the neighbors send back an acknowledgment.
An iteration: all elements finish the above communication.
Environment:
  An SMP node with 2 quad-core Xeons, using only 7 cores
  Ubuntu 7.04; gcc 4.2
  Charm++ builds: net-linux-amd64-smp vs. net-linux-amd64
  1 element/core, K = 3
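To make the communication pattern concrete, here is a minimal C sketch of the wrap-around neighbor arithmetic, assuming "K-stride" means the peers K positions away on each side (the benchmark's exact neighbor set may differ); send_msg() is a hypothetical placeholder, not a Charm++ call.

```c
void send_msg(int peer);  /* hypothetical: send one message to element `peer` */

/* One iteration for element `me` out of N elements. */
void do_iteration(int me, int N, int K) {
    int right = (me + K) % N;       /* neighbor K steps ahead, wrapping around  */
    int left  = (me - K + N) % N;   /* neighbor K steps behind, wrapping around */
    send_msg(right);                /* each receiver replies with an acknowledgment */
    send_msg(left);
}
```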
Performance at first glance
[Chart: iteration time (ms) vs. message size (bytes), 0 to 16000; series: Non-SMP, SMP]
Outline
Examine the communication models of the Non-SMP and SMP layers in Charm++
Describe the current optimizations for SMP step by step
Discuss a different approach to utilizing multicore
Conclude with future work
Communication model for multicore machines
Possible overheads in SMP version
Locks:
  Overuse of locks to ensure correctness
  Locks in message queues
  …
False sharing:
  Some per-thread data structures are allocated together in array form: e.g., each element of "CmiState state[numThds]" belongs to a different thread (see the sketch below)
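A minimal sketch of why such an array layout causes false sharing, and one common fix; the struct fields, the 64-byte cache-line size, and the gcc alignment attribute are illustrative assumptions, not the actual CmiState definition.

```c
/* Illustrative stand-in for a per-thread state record; the real
 * CmiState has many more fields. */
typedef struct {
    int   msgCount;      /* updated frequently by its owner thread */
    void *localQueue;
} CmiStateSketch;

/* Problem: adjacent array elements share a cache line, so one thread's
 * writes invalidate that line in every other thread's cache. */
CmiStateSketch state[8];

/* One fix: align (and thereby pad) each element to a full cache line,
 * assuming 64-byte lines, so no two threads ever touch the same line. */
typedef struct {
    CmiStateSketch s;
} __attribute__((aligned(64))) PaddedState;

PaddedState paddedState[8];
```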
Reducing the usage of locks
Examining the source code revealed overuse of locks; the sections enclosed by locks were narrowed (see the sketch below)
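A minimal before/after sketch of narrowing a critical section with pthreads; prepare_msg(), enqueue(), and msgQueue are hypothetical stand-ins for runtime internals, not Charm++ APIs.

```c
#include <pthread.h>
#include <stdlib.h>

static pthread_mutex_t qLock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical stand-ins for runtime internals. */
static void *msgQueue;
static void *prepare_msg(int size) { return malloc(size); }
static void  enqueue(void *q, void *msg) { (void)q; (void)msg; /* push msg */ }

/* Before: the lock also covers thread-private work. */
void send_before(int size) {
    pthread_mutex_lock(&qLock);
    void *msg = prepare_msg(size);   /* touches no shared state */
    enqueue(msgQueue, msg);          /* the only shared-state access */
    pthread_mutex_unlock(&qLock);
}

/* After: only the shared-queue update sits inside the critical section. */
void send_after(int size) {
    void *msg = prepare_msg(size);   /* moved outside the lock */
    pthread_mutex_lock(&qLock);
    enqueue(msgQueue, msg);
    pthread_mutex_unlock(&qLock);
}
```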
[Chart: iteration time (ms) vs. message size (bytes), 0 to 16000; series: Non-SMP, SMP, SMP-Relaxed lock]
Overhead in message queues
A micro-benchmark to show the overhead in message queues:
  N producers, 1 consumer
  lock vs. memory fence + atomic operation (fetch-and-increment)
  1 queue vs. N queues
[Chart: average iteration time (us) vs. number of producers, 1 to 8; series: multiQ-fence, singleQ-fence+atomic op, singleQ-lock]
1. Each producer produces 10K items per iteration
2. One iteration: the consumer consumes all items
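A minimal C sketch of the single-queue fence + fetch-and-increment variant, using gcc's __sync builtins (available in gcc 4.2); the ring size, slot-publication scheme, and function names are assumptions for illustration, and the sketch ignores wrap-around past unconsumed slots.

```c
#include <stddef.h>

#define QCAP (1 << 20)              /* illustrative capacity */

static void * volatile slots[QCAP]; /* non-NULL slot = message ready */
static volatile unsigned tail = 0;  /* next slot to reserve          */
static unsigned head = 0;           /* consumer-only cursor          */

/* Any producer: reserve a slot with an atomic fetch-and-increment
 * (a full memory barrier in gcc), then publish the pointer. */
void enqueue(void *msg) {
    unsigned i = __sync_fetch_and_add(&tail, 1);
    slots[i % QCAP] = msg;
}

/* The single consumer: spin until a producer publishes, then recycle. */
void *dequeue(void) {
    void *msg;
    while ((msg = slots[head % QCAP]) == NULL)
        ;                           /* wait for a message */
    slots[head % QCAP] = NULL;
    head++;
    return msg;
}
```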
Applying multiQ + fence
Less than 2% improvement: there is much less contention here than in the micro-benchmark
[Chart: iteration time (ms) vs. message size (bytes), 0 to 16000; series: Non-SMP, SMP-Relaxed lock, SMP-Relaxed lock-multiQ-Fence]
Big overhead in msg allocation
We noticed that:
  We used our own default memory module
  Every memory allocation is protected by a lock
  It provides some useful functionality in the Charm++ system (a historical reason for not using other memory modules): memory footprint information, the memory debugger, and Isomalloc
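A minimal sketch of the serialization this implies, assuming one global lock around the allocator; locked_malloc() and memLock are illustrative names, not the actual Charm++ memory module.

```c
#include <pthread.h>
#include <stdlib.h>

static pthread_mutex_t memLock = PTHREAD_MUTEX_INITIALIZER;

/* Every allocation from every worker thread funnels through one lock,
 * so per-message allocation serializes across the whole node. */
void *locked_malloc(size_t size) {
    pthread_mutex_lock(&memLock);
    void *p = malloc(size);          /* stand-in for the custom allocator */
    pthread_mutex_unlock(&memLock);
    return p;
}
```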
Switching to OS memory module
Thanks to recent updates, we do not lose the aforementioned functionality
[Chart: iteration time (ms) vs. message size (bytes), 0 to 16000; series: Non-SMP, SMP-Relaxed lock-SingleQ-Fence, SMP-Reduced lock overhead]
Identifying false sharing overhead
Another micro-benchmark: each element repeatedly sends itself a message, reusing the same message each time (i.e., no new message is allocated)
Benchmark the timing of 1000 iterations
Use the Intel VTune performance analysis tool, focusing on the cache misses caused by "Invalidate" in the MESI coherence protocol
Declaring variables with the "__thread" specifier makes them thread-private (see the sketch below)
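A minimal sketch of this fix, assuming gcc's __thread thread-local storage; the variable names are illustrative, not the actual runtime fields.

```c
/* Before: one shared array indexed by thread rank; neighboring
 * elements share cache lines, producing MESI "Invalidate" misses. */
static int counters[8];               /* thread r updates counters[r] */

/* After: gcc's __thread gives each thread its own private instance,
 * so no two threads ever write the same cache line. */
static __thread int myCounter;        /* each thread updates its own copy */
```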
Performance for the micro-benchmark
Parameters: 1 element/core, 7 cores
Before: 1.236 us per iteration
After: 0.913 us per iteration
Adding the gains from removing false sharing
Around 1% improvement
[Chart: iteration time (ms) vs. message size (bytes), 0 to 16000; series: Non-SMP, SMP, SMP-Optimized]
Rethinking communication model
A POSIX shared memory layer:
  No threads; every core still runs a process
  Inter-core message passing does not go through the NIC but through memory copy (inter-process communication), as sketched below
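A minimal sketch of setting up one shared segment between processes with POSIX shared memory (shm_open + mmap, link with -lrt); the segment name, size, and the idea of copying messages directly into it are illustrative assumptions, not the actual layer's design.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map a shared segment that processes on the same node can use to
 * exchange messages by memcpy instead of going through the NIC.
 * "/charm_seg" and the 1 MB size are illustrative choices. */
void *map_segment(void) {
    int fd = shm_open("/charm_seg", O_CREAT | O_RDWR, 0600);
    if (fd < 0) return NULL;
    if (ftruncate(fd, 1 << 20) < 0) { close(fd); return NULL; }
    void *base = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    close(fd);                       /* the mapping survives the close */
    return base == MAP_FAILED ? NULL : base;
}

/* Sender side: an inter-core "send" becomes a copy into the segment,
 * which the receiving process reads through its own mapping. */
void copy_msg(void *base, const void *msg, size_t len) {
    memcpy(base, msg, len);
}
```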
Performance comparison
[Chart: iteration time (ms) vs. message size (bytes), 0 to 16000; series: Non-SMP, SMP-Optimized, Posix Shared Memory]
Future work
Other platforms: BG/P
Optimize the POSIX shared memory version
Effects on real applications: for NAMD, initial results show that the SMP version helps on up to 24 nodes on Abe
Other communication models: an adaptive one?