TRANSCRIPT
Software and Services Group
Concurrent Collections
Easy and effective distributed computing
Frank Schlimbach, Kath Knobe, James Brodman
2012/03/06
This Talk
• In this presentation the focus is on distributed memory
− Introduction to Concurrent Collections (CnC) concepts
− How it is used
− Results
− How it helps with distributed computing
• It also works well in shared memory
− Intel's own benchmark effort not yet public
> CnC scored high across the board
Outline
• Introduction to Concurrent Collections
• Introduction to using Intel® Concurrent Collections for C++
• distCnC: Concurrent Collections for distributed memory
What problem does CnC address?
Parallel programs are notoriously difficult for mortals to write, debug, maintain and port
Parallel programs are notoriously difficult for anyone to tune
The Big Idea
•Don’t specify what operations run in parallel
−Difficult and depends on target
•Specify the semantic ordering constraints only
−Easier and depends only on application
Exactly Two Sources of Ordering Requirements
• Producer must execute before consumer
[Figure: producer-consumer: step1 (compute step) -> item (data item) -> step2 (compute step)]
Exactly Two Sources of Ordering Requirements
• Producer must execute before consumer
• Controller must execute before controllee
[Figure: producer-consumer (step1 -> item -> step2, compute steps linked by a data item) and controller-controllee (step1 produces a control tag that prescribes step2)]
1: The whiteboard drawing: computations, data, producer/consumer relations, I/O
[Figure: compute steps Cholesky, Trisolve, Update; data item Array]
2: Distinguish among the instances
[Figure: compute steps Cholesky: iter; Trisolve: row, iter; Update: col, row, iter; data item Array: col, row, iter]
3: What are the control tag collections
[Figure: control tags CholeskyTag: iter; TrisolveTag: row, iter; UpdateTag: col, row, iter, each prescribing the corresponding compute step; data item Array: col, row, iter]
4: Who produces control
[Figure: same graph, now with edges showing which compute steps produce the control tags CholeskyTag, TrisolveTag and UpdateTag]
Separation of Concerns
[Figure: application problem -> CnC domain spec -> mapping the CnC spec to the platform]
The domain expert (finance, gaming, chemistry, …) doesn't need to know a lot about parallelism.
The tuning expert (parallelism, locality, load balancing, …) doesn't need to know a lot about the app.
CnC model allows for a wide variety of runtime approaches

                     grain     distribution   schedule
HP TStreams          static    static         static
Intel CnC            static    static         dynamic
Georgia Tech CnC     static    dynamic        dynamic
HP TStreams          static    dynamic        dynamic
Rice CnC             dynamic   dynamic        dynamic
Concurrent Collections Promise
• Separates concerns (domain and tuning)
− Increases productivity
• Domain language
− Determinism (with respect to the results)
− Easy to debug
− A unified programming model for shared and distributed memory
• Tuning language
− Provides effective tuning (not covered in this presentation)
Outline
• Introduction to Concurrent Collections
• Introduction to using Intel® Concurrent Collections for C++
• distCnC: Concurrent Collections for distributed memory
Intel® Concurrent Collections for C++
• Template library with runtime
> For Windows (IA-32/Intel 64) and Linux (Intel 64)
> Shared and distributed memory
• Upcoming WhatIf release end of Q1'12
− Header files, shared libs
− Samples
− Documentation
> CnC concepts
> Using CnC
> Tutorials
− No translator
http://software.intel.com/en-us/articles/intel-concurrent-collections-for-cc/
Intel® Concurrent Collections for C++
[Software stack: CnC translator and CnC C++ API on top of the CnC runtime; the runtime builds on TBB and MPI/ITC, which sit on libc, pthread, … and the OS]
C++ API
• The C++ API maps the CnC concepts into C++
− Creating a CnC graph and using it
− Generality through collections being templates
− Type safety
− Debug and tuning interface
− Tuning options
− Range support for tags
− Running on distributed systems
How to write a CnC application
• Design
− Identify steps, items and tags (see previous slides)
• Write C++ code
− Create a context with all collections in it
− Each step is a user function with the required interface
− A step instance is prescribed by a put to the tag collection
− Steps consume and produce item/tag instances with get and put
− Steps call gets before any puts and any memory allocation
− The environment produces initial tag/item instances and invokes the graph
Major classes provided by the CnC C++ API
Tag collection: CnC::tag_collection< tag_type, tuner_type >
  prescribes, put, put_range, iterate
Item collection: CnC::item_collection< tag_type, item_type, tuner_type >
  put, get, iterate
Step collection: CnC::step_collection< step_type, tuner_type >
  consumes, produces
Graph/context: CnC::context< DerivedContext >
  Derive from CnC::context< yourContext >
  Declare collections as members
  Initialize collections in the context constructor with the context as argument
  Prescribe your steps in the context constructor
[Figure: myStep (compute step), myTag (control tag), myItem (data item)]
Sample Context

struct cholesky_context : public CnC::context< cholesky_context >
{
    // Step collections
    CnC::step_collection< cholesky > sc_cholesky;
    ...
    // Item collections
    CnC::item_collection< triple, tile_const_ptr_type > Array;
    // Tag collections
    CnC::tag_collection< int > tc_cholesky;
    ...
    cholesky_context( int _b = 0, int _p = 0, int _n = 0 )
        : sc_cholesky( *this ), Array( *this ), tc_cholesky( *this ) // collections take the context
    {
        tc_cholesky.prescribes( sc_cholesky, *this );
        sc_cholesky.consumes( Array );
        ...
    }
};
Sample step (cholesky)

// Perform unblocked Cholesky factorization on the input tile.
// Output is a lower triangular matrix.
int cholesky::execute( const int & t, cholesky_context & c ) const
{
    tile_const_ptr_type A_block;
    tile_ptr_type L_block;
    int b = c.b;                                      // fixed parameters from the context
    const int iter = t;
    c.Array.get( triple(iter, col, row), A_block );   // get the consumed input tile
    L_block = std::make_shared< tile_type >( b );     // allocate the output tile
    for( int k_b = 0; k_b < b; k_b++ ) {
        // do all the math on this tile; computation proceeds as normal
    }
    c.Array.put( triple(iter+1, col, row), L_block ); // put the produced output tile
    return CnC::CNC_Success;
}

Notes: execute must be const with no side effects. The step can have global/constant state. The step class must be copy-constructible. (col and row are defined in the full sample and elided here.)
Tuning
• CnC tuning influences how the runtime manages the interaction between CnC entities (items, steps, tags)
− This is different from and orthogonal to optimizations of the serial language (C++)
• The tuning expert gives hints to the runtime
− For a specific application
• Examples
− Garbage collection
> Detect that an item will not be used any more
> A semantic attribute, different from traditional GC
− Influence step scheduling
− Memoizing step execution
− Work/data distribution
• Much of the tuning potential is yet to come
Outline
• Introduction to Concurrent Collections
• Introduction to using Intel® Concurrent Collections for C++
• distCnC: Concurrent Collections for distributed memory
Why distributed CnC?
• CnC is a declarative and high-level model
− Abstracts from the memory model
− Facilitates switching between shared and distributed memory (no explicit message passing needed)
• CnC comes with a control methodology
− Allows controlling work distribution
• CnC comes with a data methodology
− Provides hooks for selective data distribution
CnC provides a unified programming model for shared and distributed memory.
Limitations of distributed computing apply
• Usual caveats for distributed memory apply
− E.g. the ratio between data exchange and computation
• Different algorithms might be needed for distributed vs. shared memory
Programming methodology and framework stay the same in any case.
Over a wide class of applications the algorithm stays the same.
Spike
• Parallel solver for banded linear systems of equations
• Complex algorithm, but "easily" maps onto CnC (at least for James Brodman)
• 15 step collections
• 24 item collections
• 12 tag collections
• Same code for shared and distributed memory
Spike Performance
[Chart: runtime (s) for 1048576 unknowns with 128 super/sub-diagonals, 32 partitions. Shared memory: MKL, MKL+OMP, HTA (TBB), CnC. Shared/distributed memory (4 nodes, GigE): HTA (MPI), DistCnC]
Spike Scalability
[Charts: Spike time (sec) and parallel efficiency (%) vs. #nodes (1-32, 24 h-threads each); matrix 10M x 10M, 257 bands]
Making a CnC program distCnC-ready*
• #include <cnc/dist_cnc.h>
− Sets a #define and declares the dist_cnc_init template
• Instantiate a CnC::dist_cnc_init< … > object
− First thing in main; must persist throughout main
− Template parameters are the contexts used in the program
• Serialization of non-standard data types (tags and items)
− Simple mechanism (similar to BOOST)
− int, double, float, char etc. don't need explicit serialization
• Same binary runs on shared and distributed memory
*distCnC-readiness doesn't guarantee good performance, but it enables execution on a distributed memory system.
DistCnC-ready performance (UTS)
• Unbalanced tree search
• Tree shape unknown in advance
• CnC code is trivial
• CnC: 151 loc
• Shmem: ~1000 loc
• MPI: ~800 loc
• CnC performs better on single node (multi-threaded)
• Performance gap in the mid-sized region is a load-balancing issue
• Experimental version solves it
[Chart: UTS [T3XXL] speedup over 1 node vs. #threads/ranks (16-256) and #nodes (1-16), for MPI and CnC]
distCnC-ready Performance
Default round-robin distribution can be arbitrarily bad.
[Chart: CnC time (dist-ready) [GigE] in sec vs. #nodes (1, 2, 4; 24 h-threads each), for inverse, primes, mandelbrot, cholesky]
CnC makes work and data distribution easy and efficient
• By default, a simple local round-robin scheduling is done
− Data is sent to where it is needed (requested)
• The tuner can declare where work is to be executed:
  int tuner::compute_on( step_tag )
• Similarly, the tuner can declare where the data needs to go:
  int tuner::consumed_on( item_tag ) or
  vector< int > tuner::consumed_on( item_tag )
• Both return ranks/ids of address spaces
• The mapping tag -> rank can be computed
− Statically at compile time
− Dynamically at init time
− Dynamically on the fly
Tuned Performance
[Chart: CnC time (tuned) [IB] in sec vs. #nodes (2-32; 24 h-threads each), for inverse, primes, mandelbrot, cholesky]
Scalability Comparison (cholesky)
CnC: only a few lower-level optimizations
[Chart: Cholesky speedup over MKL-LAPACK (matrix 16k x 16k, blocks 100x100) vs. #nodes (1-32, 24 h-threads each), for MKL-LAPACK, CnC, MKL-scaLAPACK]
Scalability Comparison (inverse)
CnC: no lower-level optimizations yet
[Chart: matrix inverse speedup over MKL-LAPACK (matrix 16k x 16k, blocks 90x90) vs. #nodes (1-32, 24 h-threads each), for MKL-LAPACK, CnC, MKL-scaLAPACK]
Intel Confidential
Scalability (stencil)
3D 25-point stencil reads from 2 previous time-steps. Shmem performance is similar to cache-oblivious algorithms.
[Chart: RTM stencil speedup (3k x 1.5k^2, 32 time-steps) vs. #nodes (2-32, 24 h-threads each)]
[Chart: RTM stencil weak-scaling parallel efficiency (16 time-steps) vs. #nodes (1-32) and matrix size (3.2E+09 to 1.0E+11 elements, 12.0 to 383.3 GB)]
Efficiency stays above 97%!
Increased productivity when experimenting with distribution plans
• Partitioning the work directly affects communication for controller/controllee
• The CnC methodology also declares the links between work and data instances
• Allows a clean mechanism for the tuning expert to create a distribution plan
− For each item, declare the steps which depend on it ("consumed_by")
− With compute_on the runtime knows where to send the item
− The runtime auto-distributes data optimally according to the declared work distribution
− Depends on the program semantics only

int tuner::consumed_on( const tag t ) const
{
    return compute_on( consumed_by( t ) );
}
Distribution function (cyclic1)

int tuner::compute_on( const int & i ) const
{
    return i % numProcs();
}

Simple. Used for primes.
Distribution function (cyclic2)

int tuner::compute_on( const int & stage,
                       const int & stream ) const
{
    return stage % numProcs();
}

Good for a pipeline if stages require larger data sets.

int tuner::compute_on( const int & stage,
                       const int & stream ) const
{
    return stream % numProcs();
}

Good for a pipeline if large chunks of data go through all stages. Used for mandelbrot.
[Figure: pipelines (data parallel) crossed with stages (task parallel)]
Distribution function (blocked)

int tuner::compute_on( const int & x, const int & y, const int & z ) const
{
    return x * NBLOCKS_X / nx
         + ( y * NBLOCKS_Y / ny ) * NBLOCKS_X
         + ( z * NBLOCKS_Z / nz ) * NBLOCKS_X * NBLOCKS_Y;
}

Minimizes data transfer. Used for cholesky and stencil.
Distribution function (blocked cyclic)

int tuner::compute_on( const int & x, const int & y ) const
{
    return ( ( x + y * m_nx ) / BLOCKSIZE ) % numProcs();
}

Good if block size and/or shape is relevant. Good for a flat parallel model (pure MPI, pure threading).
Distribution function (tree)

int tuner::compute_on( const int & tag, context & c ) const
{
    return tag >= LIM
        ? CnC::COMPUTE_ON_LOCAL
        : tag % numProcs();
}

At a certain tree depth, stop distributing and stay local. Used for quickSort.
[Figure: binary tree of tags 1-30; subtrees below the cutoff execute locally]
Changing the distribution (cholesky)

int tuner::compute_on( ... ) const
{
    switch( dt ) {
        case BLOCKED_ROWS :
            return ( (j*j)/2 + 1 + i ) / s;
        case BLOCKED_CYCLIC :
            return ( (i/2)*n + (j/2) ) % np;
        case ROW_CYCLIC :
            return i % np;
        case COLUMN_CYCLIC :
            return j % np;
    }
    return 0; // unreachable for a valid dt
}
The optimal distribution might depend on several factors; CnC makes it easy to customize.
[Chart: Cholesky speedup over single-node MKL-LAPACK (matrix 16k x 16k, blocks 100x100) vs. #nodes (1-32, 24 h-threads each), for MKL-LAPACK, the four CnC distributions (COLUMN_CYCLIC, BLOCKED_ROWS, ROW_CYCLIC, BLOCKED_CYCLIC) and MKL-scaLAPACK]
Summary
CnC helps structure a program so that
− It doesn't limit parallelism and so exposes more potential parallelism
− Tuning can be separated and effective
− It can be optimized for shared and/or distributed memory
− In distributed systems specifically
> A key task, distributing data and work, becomes easy
> Minimizing the transferred data volume becomes a nice side effect of a good distribution
− Development is productive
> The CnC runtime hides all the difficulties of using low-level techniques
> The result is deterministic and independent of #threads or #processes
The road ahead
•We are working on the tuning side
−A tuning language
−More convenient interfaces
•Runtime optimizations
•Spec-level optimizations
•Fault-tolerance
•Looking for feedback
•Looking for new areas to apply CnC to
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that
are not unique to Intel microprocessors. These optimizations include SSE2®, SSE3, and SSSE3 instruction sets and
other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on
microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended
for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for
Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information
regarding the specific instruction sets covered by this notice.
Notice revision #20110804
Legal Disclaimer INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, reference www.intel.com/software/products. BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino Atom, Centrino Atom Inside, Centrino Inside, Centrino logo, Cilk, Core Inside, FlashFile, i960, InstantIP, Intel, the Intel logo, Intel386, Intel486, IntelDX2, IntelDX4, IntelSX2, Intel Atom, Intel Atom Inside, Intel Core, Intel Inside, Intel Inside logo, Intel. Leap ahead., Intel. Leap ahead. logo, Intel NetBurst, Intel NetMerge, Intel NetStructure, Intel SingleDriver, Intel SpeedStep, Intel StrataFlash, Intel Viiv, Intel vPro, Intel XScale, Itanium, Itanium Inside, MCS, MMX, Oplus, OverDrive, PDCharm, Pentium, Pentium Inside, skoool, Sound Mark, The Journey Inside, Viiv Inside, vPro Inside, VTune, Xeon, and Xeon Inside are trademarks of Intel Corporation in the U.S. and other countries. *Other names and brands may be claimed as the property of others. Copyright © 2011-2012. Intel Corporation.
http://intel.com/software/products