Getting Started in HPC Development
An Internet.com Developer eBook
Contents

Letter from the Editor
Utilizing a Multi-Core System with the Actor Model
Lots about Locks
Intel Parallel Studio XE and Intel Cluster Studio Tool Suites
Intel Array Building Blocks

This content was adapted from Internet.com's DevX website and Intel Parallel Universe Magazine. Contributors: James Leigh, Wooyoung Kim, Michael Voss, Michael McCool, Sanjay Goil and John McHugh.
Getting Started in HPC Development, an Internet.com Developer eBook. © 2010, Internet.com, a division of QuinStreet, Inc.

Letter from the Editor
By Michael Pastore

Many people, even in the IT industry, hear the term high-performance computing (HPC) and think of supercomputers that are used in scientific experiments or complex research applications. But as the amount of data continues to grow, and databases continue to expand, businesses in the private sector are going to need to harness some serious computing horsepower.
HPC is powered, in part, by powerful multicore processors that can speed up application performance. For software developers, this means learning to create applications with parallelism that can take advantage of these multicore processors. It also means changing the way that applications are developed.

There are a number of techniques, methods and technologies available that can help application developers pick up parallel programming and create applications that can run in an HPC environment. In this eBook from Internet.com and Intel, we're going to look at some of these tools and methods to give developers some ideas about what's available.
In our first article, James Leigh looks at developing efficient multi-threaded applications without using synchronized blocks. The actor model (which is native to some programming languages, such as Scala) is a pattern for concurrent computation that enables applications to take full advantage of multicore and multiprocessor computing. James likes the actor model because it abstracts the nitty-gritty of multiprocessor programming away from the developer. This reduces concurrency issues and improves the flexibility of the system. The actor model also has a low learning curve, so new developers can quickly see how actors are implemented and understand how they fit together.
In our next article, Wooyoung Kim and Michael Voss discuss why, in their opinion, locks remain the best choice for implementing synchronization and protecting critical sections of software code. Their article discusses some of their experiences with mutual exclusion locks in developing multithreaded concurrent applications, using the locks provided in Intel Threading Building Blocks as examples.
Then John McHugh and Sanjay Goil introduce us to Intel Parallel Studio XE, a set of new software development tool suites for developers of applications that run on both Windows and Linux, in C/C++ and Fortran, who need advanced performance for multicore today and future scaling to manycore. The Intel Parallel Studio XE 2011 bundle contains the latest versions of the Intel C/C++ and Fortran compilers,
the Intel MKL and Intel IPP performance libraries, the Intel PBB libraries (Intel TBB, Intel ArBB betas, and Intel Cilk Plus), the Intel Inspector XE correctness analyzer and the Intel VTune Amplifier XE performance profiler.

Finally, Michael McCool answers the following question: How can parallelism mechanisms in modern processor hardware, including vector SIMD instructions, be targeted in a portable, general way within existing programming languages? His answer is Intel Array Building Blocks, which he explains in more detail.

We hope you enjoy this eBook, and remember that you can always turn to Internet.com websites like DevX.com and Developer.com, as well as the Intel Software Network, for more information on the journey to developing for high-performance computing.
Utilizing a Multi-Core System with the Actor Model
By James Leigh
Download the code for this article from: http://assets.devx.com/devx/actor-model.zip.

A typical multi-threaded application in Java contains numerous synchronized methods and statements. It might also contain calls to the methods wait() and notify() that were introduced with Java 1.0, but these methods provide very primitive functionality and are easily misused. Java 5 introduced the java.util.concurrent package, which provides some higher-level abstractions over wait() and notify(). However, it can still be a challenge to use the synchronized and volatile keywords appropriately. Even when used correctly, using them efficiently can require complicated orchestrations of locks.
The biggest criticism of Java's synchronization is performance. Synchronization blocks too easily become overly encompassing. Although a synchronization block on its own is far from slow, when overly encompassing it becomes a contested synchronization block. Contested synchronized blocks, or other blocking operations, are slow and require the OS to put threads to sleep and use interrupts to activate them. This puts pressure on the scheduler, resulting in significant performance degradation.
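To make the criticism concrete, here is a small illustrative sketch (not taken from the article) of narrowing an overly encompassing synchronized block. The expensive formatting work does not touch shared state, so it can run outside the lock, shrinking the time the monitor is held and reducing contention:

```java
import java.util.ArrayList;
import java.util.List;

public class NarrowSync {
    private final List<String> log = new ArrayList<String>();

    // Overly encompassing: formatting happens while holding the monitor.
    public synchronized void logContested(int value) {
        log.add(format(value));
    }

    // Narrow: only the shared-state mutation is synchronized.
    public void logNarrow(int value) {
        String line = format(value);   // thread-local work, no lock needed
        synchronized (this) {
            log.add(line);             // shared state, lock required
        }
    }

    private String format(int value) {
        // Stand-in for expensive, side-effect-free work.
        return "value=" + Integer.toHexString(value);
    }

    public synchronized int size() {
        return log.size();
    }

    public static void main(String[] args) throws InterruptedException {
        final NarrowSync s = new NarrowSync();
        Thread[] ts = new Thread[4];
        for (int i = 0; i < ts.length; i++) {
            final int id = i;
            ts[i] = new Thread(new Runnable() {
                public void run() {
                    for (int j = 0; j < 1000; j++) s.logNarrow(id);
                }
            });
            ts[i].start();
        }
        for (Thread t : ts) t.join();
        System.out.println(s.size());  // 4000
    }
}
```

Both methods are correct; the narrow version simply holds the lock for less time when format() is costly.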
Actor Model
The actor model (native to some programming languages, such as Scala) is a pattern for concurrent computation that enables applications to take full advantage of multi-core and multi-processor computing. The fundamental idea behind the actor model is that the application is broken up into actors that perform particular roles. Every method call (or message) to an actor is executed in a unique thread, so you avoid all of the contested locking issues typically found in concurrent applications. This allows for more efficient concurrent processing while keeping the complexity of actor implementations low, as there is no need to consider concurrent execution within each actor implementation.
The class in Listing 1 shows what an actor class might look like. This class takes a string of words and saves them to an XML file, and includes a calculated code for every character stored. The code might be used later as an index or to find similar text blocks. Notice that this class is not thread safe and you can only use each instance from a single thread. This is normal, because each actor is used from only one thread. It is common not to have any synchronized or volatile keywords present in an actor class, because they are not needed.
Long-lived, normally synchronized objects used by different threads are better off with a dedicated thread free from any synchronization issues. Each method call is placed in the queue (the order within the queue is not important), waiting until the actor is available to process the call. Think of this queue like your email in-box: messages are received at any time and are acted on when time permits. Typically, calls are asynchronous and do
not block, so the calling thread continues execution and avoids any need to rely on thread interrupts. When callers need a result, you can pass a callback object as part of the parameters to allow the actor to notify the caller. In some cases, it is desirable to block the caller until the actor processes the message.
You can separate the storage actor in Listing 1 into a second actor, as shown in Listing 2. In this way, the storage actor calls an instance of HexCoderActor with itself as the callback. The storage actor does not wait for the HexCoder to generate the hex code, but instead continues with other items in its queue. This allows the storage actor's thread to specialize in writing the resulting XML file, while the text code is calculated asynchronously in another thread. Notice how these classes can take advantage of concurrent threads without any special keywords or deep knowledge of concurrent programming.
Every actor needs a manager to allocate and manage its thread. Each actor also needs a proxy to send messages to its queue. Implementing a basic actor manager is straightforward. Listing 3 shows such a manager written in Java 5. It uses Java's Proxy object to dynamically wrap an actor, implementing all of the actor's interfaces. Every method call on the proxy is then queued in an ExecutorService: void methods are asynchronous, and other method calls block until the executor has finished executing and the result is available.
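As an illustration of how such a manager might be used, here is a self-contained sketch modeled on Listing 3. The Counter interface and the simplified manage() method are invented for this example and are not part of the article's code archive:

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadFactory;

public class ManagerDemo {
    interface Counter {
        void increment();   // void: asynchronous through the proxy
        int get();          // non-void: blocks until the actor replies
    }

    static class CounterImpl implements Counter {
        private int count;  // no volatile/synchronized: single actor thread
        public void increment() { count++; }
        public int get() { return count; }
    }

    // Minimal stand-in for Listing 3's ActorManager.manage().
    static Object manage(final Object actor) {
        final ExecutorService executor = Executors.newSingleThreadExecutor(
            new ThreadFactory() {
                public Thread newThread(Runnable r) {
                    Thread t = new Thread(r);
                    t.setDaemon(true);  // let the JVM exit when main finishes
                    return t;
                }
            });
        InvocationHandler handler = new InvocationHandler() {
            public Object invoke(Object proxy, final Method method, final Object[] args)
                    throws Throwable {
                Future<Object> result = executor.submit(new Callable<Object>() {
                    public Object call() throws Exception {
                        return method.invoke(actor, args);
                    }
                });
                if (Void.TYPE.equals(method.getReturnType()))
                    return null;          // fire-and-forget
                return result.get();      // block for the result
            }
        };
        return Proxy.newProxyInstance(actor.getClass().getClassLoader(),
                actor.getClass().getInterfaces(), handler);
    }

    public static void main(String[] args) {
        Counter counter = (Counter) manage(new CounterImpl());
        for (int i = 0; i < 100; i++)
            counter.increment();           // queued on the actor's thread
        System.out.println(counter.get()); // 100: get() drains the queue first
    }
}
```

Because the executor is single-threaded and FIFO, the blocking get() call cannot run until every queued increment() has finished, so the caller sees a consistent count without any locks.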
Exception Handling and Worker Services
In every program, it is important to test and have proper exception handling. This becomes even more important with multi-threaded programming, because asynchronous execution quickly becomes difficult to debug. Because execution is not done sequentially, a sequential debugger is less useful. Similarly, stack traces are shorter and do not give caller details. In these situations, it is best to either have the actor handle exceptions itself or enable callbacks to handle both successful results and exceptions.
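One way to sketch the success/failure callback style is shown below; the interface and class names are hypothetical and not from the article's download:

```java
public class ErrorCallbackSketch {
    interface CodeCallback {
        void onSuccess(String code);
        void onFailure(Exception error);
    }

    static class CoderActor {
        // In a real actor this method would run on the actor's own thread;
        // the try/catch routes any failure back through the caller-supplied
        // callback instead of losing it in the actor's thread.
        void code(String text, CodeCallback callback) {
            try {
                if (text == null)
                    throw new IllegalArgumentException("text is null");
                callback.onSuccess(Integer.toHexString(text.hashCode()));
            } catch (Exception e) {
                callback.onFailure(e);
            }
        }
    }

    public static void main(String[] args) {
        CoderActor actor = new CoderActor();
        CodeCallback callback = new CodeCallback() {
            public void onSuccess(String code) {
                System.out.println("code: " + code);
            }
            public void onFailure(Exception error) {
                System.out.println("failed: " + error.getMessage());
            }
        };
        actor.code("tortilla soup", callback);  // prints a hex code
        actor.code(null, callback);             // prints: failed: text is null
    }
}
```

The caller always hears back, whether the work succeeded or threw, which substitutes for the caller-side stack trace that asynchronous execution takes away.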
You should also consider that calls to an actor do carry some overhead when compared to sequential calls. You need to queue messages passed to a separate thread, and you cannot optimize with compilers in the same manner as sequential calls. This makes the actor model less applicable to smaller, faster objects that are better implemented as immutable or stateful. However, there are also advantages to running actors in a dedicated thread. By avoiding synchronized and volatile keywords, the on-board chip memory does not need to sync up with the main memory as often, since the actor's thread is the only thread that can access its variables. Modern compilers can also observe that the head lock of the queue is only used from its actor thread and optimize it away, making it possible for actors to run without any interruption or mandatory memory flushing. Therefore, use actors for specialized worker services.
An example of worker services is an importing and indexing service. Consider the task of retrieving remote data, processing it locally, and storing it in a local database. You might break this up into three steps:

1. Retrieve data.
2. Process data.
3. Store result.

In this example, the remote data is not retrieved over a single connection, but rather in multiple files that are listed in index files, mixed in with the data files. The remote data is in a format that you cannot process directly, and you need to pre-process or format it first. Furthermore, you need to convert the data because it uses a different vocabulary. This creates six steps:

1. Retrieve index or data file.
2. Format the file for parsing.
3. Convert data.
4. If index, then list data files and go to step 1.
5. Process data files.
6. Insert data.

These six steps fit well into the actor model. Think of each of these steps as a job that one or more individuals (actors) need to perform.
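As a rough illustration of treating each step as an actor-like stage, consider this sketch; the class names are invented for this example (the article's real wiring appears in Listing 4):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PipelineSketch {
    // Each stage receives an item, does its own step, and hands the result
    // to the next stage. Because a stage only talks to the next stage's
    // accept() method, each stage can later be given its own thread (as an
    // actor) without changing this logic.
    interface Stage { void accept(String item); }

    static class Formatter implements Stage {
        private final Stage next;
        Formatter(Stage next) { this.next = next; }
        public void accept(String item) {
            next.accept(item.trim().toLowerCase());  // "format for parsing"
        }
    }

    static class Inserter implements Stage {
        final List<String> database = new ArrayList<String>();
        public void accept(String item) { database.add(item); }  // "insert data"
    }

    public static void main(String[] args) {
        Inserter insert = new Inserter();
        Stage pipeline = new Formatter(insert);  // format -> insert
        for (String raw : Arrays.asList("  Tortilla Soup ", "  CHEDDAR "))
            pipeline.accept(raw);
        System.out.println(insert.database);  // [tortilla soup, cheddar]
    }
}
```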
Included in this article is an implementation of the above actor model for retrieving remote recipes from multiple sites in multiple formats. Each recipe is listed in one or more index files on the web, and the recipe is in HTML.
With the stage set, let's introduce the actors:

Actor                  Trait              Role
RoundRobin             UrlConsumer        Distributes URLs to other actors
UrlResolver            UrlConsumer        Retrieves data streams for another actor
XhtmlTransformer       StreamTransformer  Formats HTML into XHTML for parsing
StyleSheetTransformer  StreamTransformer  Converts remote XML format into local data format
RdfParser              StreamTransformer  Parses data stream into data structure
SeeAlsoExtractor       RdfConsumer        Extracts URLs from index data
IngredientProcessor    RdfConsumer        Applies local processing rules on data
RDFInserter            RdfConsumer        Inserts data into a database
Listing 4 shows how these actors are connected to one another. The manage() methods are typed versions of the ActorManager#manage(Object) in Listing 3.
A ClusterMap and Main class are also provided in the download archive. To run the example, execute the Main class with the following two arguments: http://www.kraftcanada.com/en/search/SearchResults.aspx?gcatid=86 and http://www.cookingnook.com/free-online-recipes.html
The program retrieves these silos of information, harvests meaningful data, indexes it, and makes it available in a graphical user interface.

The Main class then opens the ClusterMap and begins harvesting the recipes. After a few recipes are harvested, select the check-box on the left to see the number of recipes that are harvested, and click the clear button at the top to update the list of words extracted from the ingredients section. In this way, you can index and search multiple distinct recipe sites. For example, to find recipes that include lemon, cheddar, and garlic (yum), click on these ingredients and the Tortilla Soup recipe is revealed to include all three ingredients from the recipes harvested (see Figure 1).
Figure 1. ClusterMap: The Tortilla Soup recipe is revealed after clicking certain ingredients.
In a multi-core system, the program uses over 30 threads to orchestrate the retrieval and processing of the data, downloading and processing as quickly as the remote host provides it. In spite of the multi-threaded performance, there is no need to consider typical multi-threaded challenges; the developer need only decide what each actor should do.
The actor model is a powerful metaphor to assist in creating multi-threaded applications, and by assigning remote addresses and enabling remote communication between actors, you can extend the model to assist with distributed challenges as well. By including life-cycle and dependency management and making actors aware of their environment, they can become agents participating in a self-organizing system. This architecture has worked well for many distributed problems such as on-line trading, disaster response, and modelling social structures. It has also been a source of inspiration for many service-oriented architectures.
In essence, the actor model abstracts the nitty-gritty of multi-processor programming away from the developer. This reduces concurrency issues and improves the flexibility of the system. This simple model has a low learning curve, so new developers can quickly see how actors are implemented and understand how they fit together. By managing the actors properly, you can leverage the same implementations from multi-processor systems onto distributed networked systems in a gradual manner that can scale with the development demands.
Listing 1. StorageActor: This listing shows what an actor class might look like.

// Requires java.io.*, java.util.*, and org.apache.commons.codec (Soundex).
// The XML element names in the write() calls were lost in extraction and are
// reconstructed here as plausible placeholders.
public class StorageActor implements Storage {
    private Writer out;
    private Set<String> recorded = new HashSet<String>();
    private Set<String> sorted = new TreeSet<String>();
    private StringEncoder encoder = new Soundex();

    public void init() throws IOException {
        out = new FileWriter("text.xml");
        out.write("<texts>\n");
    }

    public void close() throws IOException {
        out.write("</texts>\n");
        out.close();
    }

    public void store(String text) throws Exception {
        if (recorded.add(text)) {
            String code = code(text);
            store(code, text);
        }
    }

    public void store(String code, String text) throws IOException {
        out.write("<text code=\"" + code + "\">");
        out.write(text);
        out.write("</text>\n");
    }

    private String code(String text) throws EncoderException {
        for (String word : text.split("[^a-zA-Z]*")) {
            if (word.length() > 2) {
                String encoded = encoder.encode(word);
                sorted.add(encoded);
            }
        }
        int hash = 0;
        for (String encoded : sorted) {
            hash = hash * 31 + encoded.hashCode();
        }
        sorted.clear();
        return Integer.toHexString(hash);
    }
}
Listing 2. HexCoderActor: You can separate the storage actor from Listing 1 into a second actor.

public class HexCoderActor implements HexCoder {
    private Set<String> sorted = new TreeSet<String>();
    private StringEncoder encoder = new Soundex();

    public void code(String text, Storage callback) throws Exception {
        String code = code(text);
        callback.store(code, text);
    }

    private String code(String text) throws EncoderException {
        for (String word : text.split("[^a-zA-Z]*")) {
            if (word.length() > 2) {
                String encoded = encoder.encode(word);
                sorted.add(encoded);
            }
        }
        int hash = 0;
        for (String encoded : sorted) {
            hash = hash * 31 + encoded.hashCode();
        }
        sorted.clear();
        return Integer.toHexString(hash);
    }
}

public class StorageActor implements Storage {
    private Writer out;
    private Set<String> recorded = new HashSet<String>();
    private HexCoder coder;

    public StorageActor(HexCoder coder) {
        this.coder = coder;
    }

    public void init() throws IOException {
        out = new FileWriter("text.xml");
        out.write("<texts>\n");
    }

    public void close() throws IOException {
        out.write("</texts>\n");
        out.close();
    }

    public void store(String text) throws Exception {
        if (recorded.add(text)) {
            coder.code(text, this);
        }
    }

    public void store(String code, String text) throws IOException {
        out.write("<text code=\"" + code + "\">");
        out.write(text);
        out.write("</text>\n");
    }
}
Listing 3. ActorManager: The ActorManager, written in Java 5.

public class ActorManager {
    private final Map<ExecutorService, ExecutorService> executors =
            new ConcurrentHashMap<ExecutorService, ExecutorService>();

    public Object manage(Object actor) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        executors.put(executor, executor);
        Class<?> ac = actor.getClass();
        ClassLoader cl = ac.getClassLoader();
        Class<?>[] interfaces = ac.getInterfaces();
        ActorHandler handler = new ActorHandler(actor, executor);
        return Proxy.newProxyInstance(cl, interfaces, handler);
    }

    private class ActorHandler implements InvocationHandler {
        private Object actor;
        private ExecutorService executor;

        public ActorHandler(Object actor, ExecutorService executor) {
            this.actor = actor;
            this.executor = executor;
        }

        public Object invoke(final Object proxy, final Method method,
                final Object[] args) throws Throwable {
            Class<?> type = method.getReturnType();
            Future<Object> result = executor.submit(new Callable<Object>() {
                public Object call() throws Exception {
                    Object result = method.invoke(actor, args);
                    if (result == actor)
                        return proxy;
                    return result;
                }
            });
            if (Void.TYPE.equals(type))
                return null;
            return result.get();
        }
    }
}
Listing 4. ActorFactory: This listing shows how the actors are connected.

public void init() throws Exception {
    ClassLoader cl = Thread.currentThread().getContextClassLoader();
    URL xsl = cl.getResource(RECIPES_XSL);
    UrlConsumer[] consumers = new UrlConsumer[1 + PROCESSORS * 2];
    SeeAlsoConsumer seeAlso = _.seeAlso();
    // only one thread/actor can insert at a time
    RDFConsumer insert = _.insert(store);
    for (int i = 0; i
Lots about Locks
By Wooyoung Kim and Michael Voss
Remember the red telephone box, once a familiar sight on the streets of London? That's a good example of mutually exclusive access to a shared resource, although you probably didn't find any locks on them. Why? Because only one person at a time could use one to make a call, and civil persons would not listen to a stranger's conversation while waiting outside. Unfortunately, there are no guarantees that programs will be equally civil, so wise programmers use semaphores to keep processes from running amok and leaving shared resources, such as files and I/O devices, in inconsistent states.
Mutual exclusion locks (also called mutex locks, or simply locks or mutexes) are a special kind of semaphore. Each protects a single shared resource, qualifying it as a binary semaphore. Concurrent programs use locks to guarantee consistent communication among threads through shared variables or data structures. A piece of program code protected by a mutex lock is called a critical section.
Mutex locks are often implemented using an indivisible test-and-set instruction on today's prevalent multi-core systems. Although generally deemed efficient, relying on an indivisible test-and-set instruction incurs a few hidden performance penalties. First, execution of such an instruction requires memory access, so it interferes with other cores' progress, especially when the instruction is in a tight loop. The effect may be felt even more acutely on systems with a shared memory bus. Another penalty stems from cache coherency. Because the cache line containing a lock object is shared among cores, one thread's update to the lock invalidates the copies on the other cores. Each subsequent test of the lock on other cores triggers fetching the cache line. A related penalty is false sharing, where an unrelated write to another part of the cache line invalidates the whole cache line. Even if the lock remains unchanged, the cache line must be fetched to test the lock on a different core.
Given all these problems, one might wonder: Why use locks at all? What are the alternatives? One extreme alternative is to give up on communicating through shared variables and adopt the mantra of "no sharing." That involves replicating data and communicating via message passing. Unfortunately, the cost of replication and message passing is even greater than the overhead associated with locks on today's multi-core shared-memory architectures.
Another approach that has been actively pursued recently as an alternative to mutex locks is lock-free/non-blocking algorithms. Researchers have reported some isolated successes in designing practical non-blocking implementations. Nonetheless, non-blocking algorithms are hardly a holy grail. Designing efficient non-blocking data structures remains difficult, and the promised performance gain has been elusive at best. You'll see more about non-blocking algorithms at the end of this article.

With no proven better alternatives at present, it makes sense to make the most of mutex locks until they are rendered no longer necessary. This article discusses some experiences with mutex locks in developing multi-threaded concurrent applications, using the mutex locks provided in Intel Threading Building Blocks as examples.

Making the Most of Mutex Locks
Mutexes are often vilified as major performance snags in multi-threaded, concurrent application development; however, our experience suggests that mutex locks are the least evil among the synchronization methods available today. Even though the nominal overhead appears large, you can harness them to your advantage if you use them in well-disciplined ways. Throughout this article, you'll see some of the lessons learned, stated as guidelines, the first two of which are:

Guideline 1: Being Frugal Always Pays Off
Minimize explicit uses of locks. Instead, use concurrent containers and concurrent algorithms provided by efficient, thread-safe concurrency libraries. If you still find places in your application that you think benefit from explicit use of locks, then:

Guideline 2: Make Critical Sections as Small as Possible
When a thread arrives at a critical section and finds that another thread is already in it, it must wait. Keep the critical section small, and you will get small waiting times for threads and better overall performance. Examine when shared data in a critical section becomes private and see if you can safely take some of the accesses to that data out of the critical section.

For example, the code in Listing 1 implements a concurrent stack. It defines two methods, push() and pop(), each protected using a TBB mutex lock (smtx) that's acquired in the constructor and released in the destructor. The examples in Listing 1 rely on the C++ scoping rules to delimit the critical sections.

A cursory look at pop() shows that:

1. If the stack is empty, pop() returns false.
2. If the stack is not empty, the code acquires the mutex lock and then re-examines the stack.
3. If the stack has become empty since the previous test, pop() returns false.
4. Otherwise, the code updates the top variable and copies the old top element.
5. Finally, pop() releases the lock, reclaims the popped node, and returns true.

Here's a closer look at the critical section. Copying type T may take a lot of time, depending on T. Because of the lock, you know that, once updated, the old top value cannot be viewed by other threads; it becomes private and local to the thread inside the critical section. Therefore, you can safely yank the copy statement out of the critical section (following Guideline 2) as follows:
bool pop( T& _e ) {
    node* tp = NULL;
    if( !top ) goto done;
    {
        tbb::spin_mutex::scoped_lock lock( smtx );
        if( !top ) goto done;
        tp = top;
        top = top->nxt;
        // move the next line...
        // _e = tp->elt;
    }
    // ...to here
    _e = tp->elt;
    delete tp;
done:
    return tp != NULL;
}
As another example, consider implementing a tiny memory management routine. A thread allocates objects from its private blocks and returns objects to their parent block. It is possible for a thread to free objects allocated by another thread. Such objects are added to their parent block's public free list. In addition, a block with a non-empty public free list is added to a list (i.e., the public block list) formed with block_t::next_to_internalize and accessed through block_bin_t::mailbox, if not already in it.

The owner thread privatizes objects in a block's public free list, as needed. The function internalize_next() implements this functionality and is invoked when a thread runs out of private blocks with free objects to allocate. It takes a block bin private to the caller thread as its argument and pops the front block from the list bin->mailbox, if not empty. Then, it internalizes objects in the block's public free list:
block_t* internalize_next ( block_bin_t* bin )
{
    block_t* block;
    {
        tbb::spin_mutex::scoped_lock scoped_cs(bin->mailbox_lock);
        block = bin->mailbox;
        if( block ) {
            bin->mailbox = block->next_to_internalize;
            block->next_to_internalize = NULL;
        }
    }
    if( block )
        internalize_returned_objects( block );
    return block;
}
The function's critical section protects access to bin->mailbox with bin->mailbox_lock. Inside the critical section, if bin->mailbox is not empty, it pops the front block into block and resets the block's next_to_internalize.

Note that block is a local variable. By the time bin->mailbox is updated, block (which points to the old front block) becomes invisible to other threads, and access to its next_to_internalize field becomes race-free. Thus, you can safely move the reset statement outside the critical section:
block_t* internalize_next ( block_bin_t* bin )
{
    block_t* block;
    {
        tbb::spin_mutex::scoped_lock scoped_cs(bin->mailbox_lock);
        block = bin->mailbox;
        if( block ) {
            bin->mailbox = block->next_to_internalize;
            // move the next statement...
            // block->next_to_internalize = NULL;
        }
    }
    if( block ) {
        // ...to here
        block->next_to_internalize = NULL;
        internalize_returned_objects( block );
    }
    return block;
}
Guideline 3: Synchronize as Infrequently as Possible
The idea behind this guideline is that you can amortize the cost of a lock operation over a number of local operations. Doing so reduces the overall execution time, because executing atomic instructions tends to consume an order of magnitude more cycles.

Again, suppose you're designing a memory allocator that allocates objects out of a block. To reduce the number of trips to the operating system to get more memory blocks, the allocator uses a function called allocate_blocks() to get a big strip from the operating system, partition it into a number of blocks, and then put them in the global free block list shared among threads. The free block list free_list is implemented as a concurrent stack (see Listing 2).
Note that the code to push a newly carved-out block onto free_list is inside a while loop. Also, note that stack2::push() protects concurrent accesses to stack2::top through a mutex lock. That means allocate_blocks() acquires the lock free_list.mtx N times for a strip containing N blocks.

You can reduce that frequency to one per strip by adding a few thread-local instructions. The idea is to build a thread-local list of blocks in the while loop first (using two pointer variables, head and tail) and then push the entire list onto free_list with a single lock acquisition (see Listing 3). Finally, so that allocate_blocks() can access free_list's private fields, it's declared as a friend of stack2.
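The article's listings for this technique use C++ and TBB; the same batching idea can be sketched in Java (illustrative only, not from the article) using java.util.concurrent.locks.ReentrantLock:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

public class BatchedPush {
    private final List<Integer> freeList = new ArrayList<Integer>();
    private final ReentrantLock lock = new ReentrantLock();

    // One lock acquisition per block: N acquisitions for N blocks.
    public void pushEach(int[] blocks) {
        for (int b : blocks) {
            lock.lock();
            try { freeList.add(b); } finally { lock.unlock(); }
        }
    }

    // Guideline 3: build the list locally, then one acquisition per strip.
    public void pushBatch(int[] blocks) {
        List<Integer> local = new ArrayList<Integer>();  // thread-local work
        for (int b : blocks) local.add(b);
        lock.lock();
        try { freeList.addAll(local); } finally { lock.unlock(); }
    }

    public int size() {
        lock.lock();
        try { return freeList.size(); } finally { lock.unlock(); }
    }

    public static void main(String[] args) {
        BatchedPush p = new BatchedPush();
        p.pushEach(new int[] {1, 2, 3});
        p.pushBatch(new int[] {4, 5, 6});
        System.out.println(p.size());  // 6
    }
}
```

Both methods leave free_list in the same state; pushBatch simply pays the atomic-instruction cost once per strip instead of once per block.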
Guideline 4: Most of All, Know Your Application.
The guideline that will help you most in practice is to
analyze and understand your application using actual
use scenarios and representative input sets. Then you
can determine what kinds o locks are best used where.
Perormance analysis tools such as Intel Parallel
Amplier can help you identiy where the hot spots areand ne-tune your application accordingly.
A Smorgasbord of Lock Flavors
Intel Threading Building Blocks oers a gamut o
mutex locks with dierent traits, because critical sections
with dierent access patterns call out mutex locks with
dierent trade-os. Other libraries may oer similar
choices. You need to know your application to select the most appropriate lock flavor for each critical section.
Spin Mutex vs. Queuing Mutex
The most prominent distinguishing property of locks is fairness: whether or not a lock allows fair access to the critical section. This is an important consideration when choosing a lock, but its importance may vary depending on circumstances. For example, an operating system should guarantee that no process gets unfairly delayed when multiple processes contend against each other to get into a critical section. By contrast, unfairness among threads in a user process may be tolerable to some degree if it helps boost throughput.
TBB's spin_mutex is an unfair lock. Threads entering a critical section with a spin_mutex repeatedly attempt to acquire the lock (they spin-wait until they get into the critical section, thus the name). In theory, the waiting time for a spin_mutex is unbounded. The TBB queuing_mutex, on the other hand, is a fair lock, because a thread arriving earlier at a critical section will get into it earlier than one arriving later. Waiting threads form a queue. A newly arriving thread puts itself at the end of the queue using a non-blocking atomic operation and spin-waits until its flag is raised. A thread leaving the critical section hands the lock over to the next in line by raising the latter's flag.
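The queuing idea can be illustrated with a ticket lock, a simple fair spin lock built from the standard std::atomic type. This is an analogy only, not TBB's queuing_mutex implementation (which spins on a per-thread cache line rather than a shared counter):

```cpp
#include <atomic>
#include <cassert>

// A minimal ticket lock: threads are served strictly in arrival order,
// like customers taking numbered tickets at a counter.
class ticket_lock {
    std::atomic<unsigned> next{0};     // next ticket to hand out
    std::atomic<unsigned> serving{0};  // ticket currently being served
public:
    void lock() {
        unsigned my = next.fetch_add(1);            // take a ticket
        while (serving.load() != my) { /* spin */ } // wait for our turn
    }
    void unlock() {
        serving.fetch_add(1);                       // call the next ticket
    }
};
```

Unlike a plain test-and-set spin lock, no thread can overtake another here, so the wait is bounded by the number of earlier arrivals; the cost is extra coherence traffic on the shared counters.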
Unfortunately, there are no cast-in-stone guidelines or criteria that dictate when to use an unfair spin_mutex and when to use a fair queuing_mutex. In general, though, guaranteeing fairness costs more. When a critical section is brief and contention is light, the chance of a thread being starved is slim, and any additional overhead for unneeded fairness may not be warranted. In those cases, use a spin_mutex.
The TBB queuing_mutex spin-waits on a local cache line and does not interfere with other threads' memory access. Consider using a queuing mutex for modestly sized critical sections and/or when you expect a fairly high degree of contention.
One report claims that, using a test program with spin locks, a difference of up to 2x runtime per thread was observed, and some threads were unfairly granted the lock up to 1 million times on an 8-core Opteron machine. If you suspect your application suffers from unfairness due to a spin_mutex, switching to a fair mutex such as queuing_mutex is your answer. But before switching, back up your decision with concrete measurement data.
Reader-Writer Locks
Not all concurrent accesses need to be mutually exclusive. Indeed, accesses to many concurrent data structures are mostly read-accesses, and only occasionally need write-accesses. For these structures, keeping one reader spin-waiting while another reader is in the critical section is not necessary.
TBB reader/writer mutexes allow multiple readers to be in a critical section while giving writers exclusive access to it. The unfair version is called spin_rw_mutex, while the fair version is queuing_rw_mutex. These mutexes also allow readers to upgrade to writers and writers to downgrade to readers.
Under some circumstances, you can replace reader-side locks with less expensive operations (although potentially at the expense of writers). One such example is a sequential lock; another is a read-copy-update lock. These locks are less-general reader-writer locks, so using them properly in applications requires more stringent scrutiny.
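A sequential lock can be sketched with an atomic version counter: readers pay no lock cost but must retry if a write overlapped their read. This is a simplified illustration; a production seqlock also needs atomic or fenced accesses to the protected fields themselves, which this sketch elides:

```cpp
#include <atomic>
#include <cassert>

// Minimal sequence-lock sketch: readers retry instead of blocking writers.
// The version counter is odd while a write is in progress.
struct seqlock_pair {
    std::atomic<unsigned> seq{0};
    int a = 0, b = 0;

    void write(int x, int y) {           // a single writer is assumed
        seq.fetch_add(1);                // becomes odd: write in progress
        a = x; b = y;
        seq.fetch_add(1);                // becomes even: write complete
    }
    void read(int& x, int& y) const {
        unsigned s0, s1;
        do {
            s0 = seq.load();
            x = a; y = b;
            s1 = seq.load();
        } while (s0 != s1 || (s0 & 1));  // retry if a write intervened
    }
};
```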
Mutex and Recursive_Mutex
TBB provides a mutex that wraps around the underlying OS locks but, compared to the native version, adds portability across all supported operating systems. In addition, the TBB mutex releases the lock even when an exception is thrown from the critical section.
A sibling, recursive_mutex, permits a thread to acquire multiple locks on the same mutex. The thread must release all locks on a recursive_mutex before any other thread can acquire a lock on it.
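The standard library's std::recursive_mutex behaves the same way as the recursive_mutex described above, and makes the re-acquisition hazard concrete (an illustrative sketch, not TBB code):

```cpp
#include <mutex>
#include <cassert>

// The owning thread may re-acquire a recursive mutex; other threads
// block until every one of its acquisitions has been released.
std::recursive_mutex rm;
int depth = 0;

void helper() {
    std::lock_guard<std::recursive_mutex> g(rm);  // re-acquired by the same thread
    ++depth;
}

void outer() {
    std::lock_guard<std::recursive_mutex> g(rm);
    helper();   // with a plain std::mutex this call would self-deadlock
    ++depth;
}
```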
Avoiding Lock Pitfalls
There is no shortage of references that warn about the inevitable dangers of using locks, such as deadlocks and livelocks. However, you can reduce the chances of getting ensnared by these problems considerably by instituting a few simple rules.
Avoid explicit use of locks. Instead, use concurrent containers and concurrent algorithms provided in well-supported concurrency libraries such as Intel Threading Building Blocks. If you think your application requires explicit use of locks, avoid implementing your own locks and use well-tested, well-tuned locks such as TBB locks.
Avoid making calls to functions (particularly unknown ones) while holding a lock. In general, calling a function while holding a lock is not good practice. For one thing, it increases the size of the critical section, thus increasing the wait-times of other threads. More seriously, you may not know whether the function contains lock acquisition code. Even if it does not now, it may in the future. Such changes potentially lead to a deadlock situation, and when that happens, it is very difficult to locate and fix. If possible, refactor the critical section so that it computes the function arguments inside the critical section but invokes the function outside it.
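The refactoring described above can be sketched as follows; the names (snapshot_under_lock, process) are hypothetical, chosen only for illustration:

```cpp
#include <mutex>
#include <string>
#include <cassert>

std::mutex mtx;
std::string shared_msg = "hello";

// Only the copy of the shared data happens under the lock.
std::string snapshot_under_lock() {
    std::lock_guard<std::mutex> g(mtx);
    return shared_msg;
}

// A function with unknown internals; it might acquire locks of its own.
size_t process(const std::string& s) { return s.size(); }

size_t refactored() {
    std::string arg = snapshot_under_lock();  // compute the argument under the lock
    return process(arg);                      // invoke with no lock held
}
```

Because process() runs with no lock held, it can safely acquire any locks it wants without risking a deadlock against mtx.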
Avoid holding multiple locks. Circular lock acquisition is a leading cause of deadlock problems. If you must hold multiple locks, always acquire the locks in the same order, and release them in the reverse of the order in which they were acquired.
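When two locks really must be held together, C++17's std::scoped_lock acquires them with a built-in deadlock-avoidance algorithm, so every call site gets a consistent effective order without hand-sorting the mutexes (a sketch with hypothetical account/transfer names):

```cpp
#include <mutex>
#include <cassert>

struct account {
    std::mutex m;
    int balance = 0;
};

void transfer(account& from, account& to, int amount) {
    // Locks both mutexes atomically with respect to deadlock:
    // transfer(a, b, ...) and transfer(b, a, ...) cannot deadlock each other.
    std::scoped_lock both(from.m, to.m);
    from.balance -= amount;
    to.balance   += amount;
}
```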
Avoid using recursive locks. You may be able to find some isolated cases where recursive locks make great sense. However, locks don't compose well. Even a completely unrelated change to a part of your application may lead to a deadlock, and the problem will be very difficult to locate.
Even if you do everything you possibly can to avoid deadlocks and livelocks, problems may still occur. If you suspect your application has a deadlock or race condition, and you cannot locate it quickly, don't get burned by trying to resolve it by yourself. Use tools such as Intel Parallel Inspector.
Lock-Free and Non-Blocking Algorithms
As promised earlier, one strategy advocated by some researchers that avoids locks and their associated problems is to use non-blocking synchronization methods such as lock-free/wait-free programming techniques and software transactional memory. These techniques aim to provide wait-freedom, thereby addressing issues stemming from the blocking nature of locks without compromising performance.
Unfortunately, our experience with non-blocking algorithms has been (so far) disappointing, and many other developers and researchers agree. Almost all non-blocking algorithms invariably use one or more hardware-supported atomic operations, such as compare-and-swap (CAS) and load-link/store-conditional (LL/SC). Some even use double-word CAS (DCAS).
Dependence on these atomic primitives makes them difficult to write (see Doherty, Simon, et al., "DCAS is not a silver bullet for nonblocking algorithm design," Proceedings of the 16th Annual ACM Symposium on Parallelism in Algorithms and Architectures, 2004, and Herb Sutter's article "Lock-Free Code: A False Sense of Security"), difficult to validate for correctness (see Gotsman, Alexey, et al., "Proving That Non-Blocking Algorithms Don't Block," Symposium on Principles of Programming Languages, 2009), and difficult to port to other platforms. This is probably one reason why non-blocking algorithms have been limited to simple data structures. Furthermore, improved performance over lock-based implementations seems hard to get.
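To make the CAS retry pattern these algorithms rely on concrete, here is a minimal sketch using std::atomic (not drawn from the article's code):

```cpp
#include <atomic>
#include <cassert>

// A lock-free update: read the current value, compute the new one, and
// retry the compare-and-swap until no other thread has raced us.
int fetch_and_double(std::atomic<int>& x) {
    int old = x.load();
    // If another thread changes x between the load and the CAS,
    // compare_exchange_weak fails, reloads 'old', and we try again.
    while (!x.compare_exchange_weak(old, old * 2)) { /* retry */ }
    return old;
}
```

Even this tiny example shows the difficulty the text describes: correctness depends on the retry loop, the memory ordering of the atomic operations, and the absence of the ABA problem, all of which get much harder for real data structures.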
Arguments for the other benefits are not compelling enough to warrant the pain of switching to non-blocking algorithms. Fairness is contingent upon the underlying atomic operations; in some cases, livelock is still possible. For many user applications, benefits such as real-time support and fault tolerance are a good-to-have, not
a must-have. In other cases, solutions provided by operating systems are sufficient (e.g., priority inheritance for priority inversion).
Software Transactional Memory (STM) is another alternative to lock-based synchronization. It abstracts away the use of low-level atomic primitives using the notion of transactions, and simplifies synchronizing access to shared variables through optimistic execution and a roll-back mechanism. Like non-blocking algorithms, STM promises performance gains over lock-based synchronization, and also promises to avoid many common locking pitfalls. The results so far are not so favorable. One publication observes that the overall performance of TM is significantly worse at low levels of parallelism (see Cascaval, Calin, et al., "Software Transactional Memory: Why is it only a research toy?" ACM Queue, 2008, Vol. 6, No. 5). However, STM is a relatively young research area, so the jury is still out.
Lock It Up
Locks have been unfairly vilified as a hindrance to the development of efficient concurrent applications on burgeoning multi-core platforms. However, our experiences suggest that rather than discouraging the use of mutex locks, one should instead promote their well-disciplined use. More often than not, implementations with such locks outperform those with non-blocking algorithms or STM.
The most important consideration for making the best use of mutex locks is understanding the application well, using tools to aid that understanding where necessary, and selecting the best-fitting synchronization method for each critical section. When you do choose a mutex, use it with the recommended guidelines, but keep flexibility in mind. Doing so will prevent most common mutex-related pitfalls without incurring unwarranted performance penalties. Finally, shun the do-it-yourself temptation, and delegate work to well-supported concurrency libraries.
Listing 1. Concurrent Stack Implementation:
The push and pop methods are protected by a mutex lock acquired in the constructor and released in the destructor.
/* unintrusive concurrent stack */
#include "tbb/spin_mutex.h"

template <typename T>
class concurrent_stack
{
    class node {
        friend class concurrent_stack;
        node* nxt;
        T elt;
    public:
        node( T& _e ) : nxt(NULL), elt(_e) {}
    };
public:
    concurrent_stack() : top(NULL) {}
    void push( T& _e ) {
        node* n = new node( _e );
        tbb::spin_mutex::scoped_lock lock( smtx );
        n->nxt = top;
        top = n;
    }
    bool pop( T& _e ) {
        node* tp = NULL;
        if( !top ) goto done;
        {
            tbb::spin_mutex::scoped_lock lock( smtx );
            if( !top ) goto done;
            tp = top;
            top = top->nxt;
            _e = tp->elt;
        }
        delete tp;
    done:
        return tp != NULL;
    }
private:
    node* top;
    tbb::spin_mutex smtx;
};
Listing 2: Memory Allocator:
This memory allocator reduces trips to get memory by getting a big strip of memory from the operating system, partitioning it into a number of blocks, and then putting them in the global free block list.
class stack2 {
public:
    inline stack2() : top(NULL) {}
    inline void push ( void** ptr ) {
        tbb::spin_mutex::scoped_lock lock(mtx);
        *ptr = top;
        top = ptr;
    }
    inline void* pop ( void ) {
        if( !top ) return NULL;
        void **result;
        {
            tbb::spin_mutex::scoped_lock lock(mtx);
            if ( !(result=(void**) top) ) goto done;
            top = *result;
        }
        *result = NULL;
    done:
        return result;
    }
private:
    tbb::spin_mutex mtx;
    void* top;
};

stack2 free_list;

// get_strip(), align_strip(), strip_size, block_size and the block
// type are defined elsewhere in the article this listing is taken from.
int allocate_blocks() {
    uintptr_t raw_strip = get_strip();
    if( !raw_strip )
        return 0;
    uintptr_t aligned_strip = align_strip( raw_strip );
    uintptr_t endp = (uintptr_t) raw_strip + strip_size;
    uintptr_t b = aligned_strip;
    while ( b + block_size <= endp ) {
        uintptr_t block_endp = b + block_size;
        ((block*)b)->bump_ptr = (void*) block_endp;
        free_list.push( (void**)b );   // note this line
        b = block_endp;
    }
    return 1;
}
Listing 3. Build a Block List:
The while loop builds a thread-local list of blocks, and then pushes the entire list into free_list using only a single lock acquisition.
class stack2 {
    friend int allocate_blocks();
public:
    inline stack2() : top(NULL) {}
    inline void push ( void** ptr ) {
        tbb::spin_mutex::scoped_lock lock(mtx);
        *ptr = top;
        top = ptr;
    }
    inline void* pop ( void ) {
        if( !top ) return NULL;
        void **result;
        {
            tbb::spin_mutex::scoped_lock lock(mtx);
            if ( !(result=(void**) top) ) goto done;
            top = *result;
        }
        *result = NULL;
    done:
        return result;
    }
private:
    tbb::spin_mutex mtx;
    void* top;
};

stack2 free_list;

int allocate_blocks() {
    uintptr_t raw_strip = get_strip();
    if( !raw_strip )
        return 0;
    uintptr_t aligned_strip = align_strip( raw_strip );
    uintptr_t endp = (uintptr_t) raw_strip + strip_size;
    uintptr_t b = aligned_strip;
    uintptr_t head = 0;
    uintptr_t tail = b;
    while ( b + block_size <= endp ) {
        uintptr_t block_endp = b + block_size;
        ((block*)b)->bump_ptr = (void*) block_endp;
        // Link the block into the thread-local list instead of
        // calling free_list.push() once per block.
        * (uintptr_t*) b = head;
        head = b;
        b = block_endp;
    }
    {
        // Push the block list into free_list with a single lock acquisition
        tbb::spin_mutex::scoped_lock lock(free_list.mtx);
        * (void**) tail = free_list.top;
        free_list.top = (void*) head;
    }
    return 1;
}
This article originally appeared in Intel Parallel Universe
Magazine. Used with permission.
Intel Parallel Studio XE and Intel Cluster Studio Tool Suites
By Sanjay Goil and John McHugh

In September, Intel introduced Intel Parallel Studio 2011, a tool suite for Microsoft Windows Visual Studio C++ developers, with the singular objective of providing the essential performance tools for application development on Intel architecture. These tools provide significant innovation, and enable unprecedented developer productivity when building, debugging and tuning parallel applications for multicore. With the introduction of Intel Parallel Building Blocks (Intel PBB), developers have methods to introduce and extend parallelism in C/C++ applications for higher performance and efficiency.

Now Intel is extending the reach of the next-generation Intel tools to developers of applications on both Windows and Linux in C/C++ and Fortran who need advanced performance for multicore today and forward scaling to manycore. Intel Parallel Studio XE 2011 contains the C/C++ and Fortran compilers; the Intel Math Kernel Library (Intel MKL) and Intel Integrated Performance Primitives (Intel IPP) performance libraries; the Intel PBB libraries, namely Intel Threading Building Blocks (Intel TBB), Intel Cilk Plus, and Intel Array Building Blocks (Intel ArBB); the Intel Inspector XE correctness analyzer; and the Intel VTune Amplifier XE performance profiler.

HPC programmers have traditionally been able to use all
the compute power made available to them. Even with the performance leaps that Moore's law has allowed Intel architecture to deliver over the past decade, the hunger for additional performance continues to thrive. There are big unsolved problems in science and engineering, physical simulations at higher granularities, and problems where the economically viable compute power provides lower resolution or piecemeal simulation of smaller portions of the larger problem. This is what makes serving the HPC market so exciting for Intel, and it is a significant driver for innovation in both hardware and software methodologies for parallelism and performance.
Intel Cluster Studio introduces tools for HPC cluster development with MPI, including the scalable Intel MPI Library and the Intel Trace Analyzer and Collector performance profiler, with the industry-leading C/C++ and Fortran compilers for a complete cluster development toolkit. This is combined with the ease of deployment offered by the Intel Cluster Ready program, making deployment of cluster applications highly efficient.
Introducing New Tool Suites
Software developers of high performance applications require a complete set of development tools. While traditionally these tools include compilers, debuggers, and performance and parallel libraries, more often the issues in development come in error correctness and
performance profiling. The code doesn't run correctly, or exhibits error-prone behavior on some runs, pointing to data races, deadlocks, or performance bottlenecks in locks or synchronization, or exposes security risks at runtime. To this end, Intel's correctness analyzers and performance profilers are a great addition to the development environment for highly robust and secure code development.

For advanced and distributed performance, Intel is simplifying the procurement, deployment and use of HPC tools on multicore IA-32 and Intel 64 architecture nodes and HPC clusters programmed with the Message Passing Interface (MPI).

A software development project goes through several steps to get optimal performance on the target platform. Most often the developer gets a rudimentary performance profile of the application run to show hotspots. Once opportunities for optimization are identified, the coding aspects are handled by the compilers and the performance and parallel libraries to add parallelism, presenting task-level, data-level and vectorization opportunities. Finally, the correctness tools make robust code possible by checking for threading and memory errors, and identifying security vulnerabilities. This cycle typically repeats itself to find higher application efficiencies.
Highlights of Intel Parallel Studio XE 2011
Available for Multiple Operating Systems: Intel Parallel Studio XE provides the same set of tools to aid development for both Windows and Linux platforms. The C/C++ and Fortran compilers and the performance and parallelism libraries also bring advanced optimizations to the Mac*.
Robustness: Intel Inspector XE's memory and thread analyzer finds and pinpoints memory and threading errors before they happen.
Code Quality: Intel Parallel Studio XE enables developers to effectively find software security vulnerabilities through static security analysis.
Advanced Optimization: The compilers and libraries in Intel Composer XE offer advanced vectorization support, including support for Intel AVX. The C/C++ optimizing compiler now includes the Intel PBB library, expanding the types of problems that can be solved more easily in parallel, with increased scalability and reliability. For Fortran developers, it now offers co-array Fortran and additional support for the Fortran 2008 standard.
Performance: The Intel VTune Amplifier XE performance profiler finds bottlenecks in serial and parallel code that limit performance. Improvements include a more intuitive interface, a fast statistical call graph, and a timeline view. The Intel MKL and Intel IPP performance libraries provide robust multicore performance for commonly used math and data processing routines. A simple linking of the application with these libraries is an easy first step toward multicore parallelism.
Compatibility and Support: Intel Parallel Studio XE excels at compatibility with leading development environments and compilers. Intel offers broad support with forums and Intel Premier Support, which provides fast answers and covers all software updates for one year.
Old Name → New Name
Compiler Suite Professional Edition → Composer XE
C++ Compiler Professional Edition → C++ Composer XE
[Visual] Fortran Compiler Professional Edition → [Visual] Fortran Composer XE
Visual Fortran Compiler Professional Edition with IMSL → Visual Fortran Composer XE with IMSL
VTune Performance Analyzer (including Intel Thread Profiler) → VTune Amplifier XE
Thread Checker → Inspector XE
Cluster Toolkit Compiler Edition → Cluster Studio
What's new in Intel Composer XE
Intel Composer XE contains next-generation C/C++ and Fortran compilers (12.0) and the performance and parallel libraries Intel MKL 10.3, Intel IPP 7.0 and Intel TBB 3.0.
The latest Intel C/C++ compiler, Intel C++ Compiler XE 12.0, is optimized for the latest Intel architecture processor (code-named Sandy Bridge) with Intel AVX support. The product contains Intel PBB, which includes advances in mixing and matching task, vector, and data parallelism in applications to better map to multicore optimization opportunities: Intel Cilk Plus, Intel TBB and Intel ArBB (in Beta, available separately). There are vector optimizations with Intel AVX and SIMD pragmas, an array notation, and a tool to help in auto-parallelization called GAP (Guided Auto-Parallelization), for the highest performance and parallelism on the latest generation of x86 multicore CPUs. For Windows users, support for Visual Studio 2010 is included.
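The SIMD pragmas mentioned above are hints that assert a loop is safe to vectorize. As an illustration, here is the common saxpy loop written with the OpenMP form of the pragma (`#pragma omp simd`), which plays the same role as the Intel-specific pragma; compilers without the feature simply ignore the line:

```cpp
#include <cassert>

// y[i] = a*x[i] + y[i]; the pragma tells the compiler the loop has
// no loop-carried dependences, so iterations may run in SIMD lanes.
void saxpy(float a, const float* x, float* y, int n) {
#pragma omp simd
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```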
The tools introduced in Intel Parallel Studio XE 2011 are next-generation revisions of industry-leading tools for C/C++ and Fortran developers seeking cross-platform capabilities for the latest x86 processors on Windows and Linux platforms. Those familiar with Intel's industry-leading tools will see that the product names have transitioned in this new release, in all cases with significant additional capabilities. Other names remain the same.
The Intel Fortran Compiler XE 12.0 includes several advances: more complete support for the Fortran 2003 standard and some support for the Fortran 2008 standard, including co-array Fortran, vector optimizations with AVX, and help with auto-parallelization for the highest performance and parallelism on the latest x86 multicore CPUs.
The performance libraries continue to provide an easy way to include highly optimized and automatically parallel math and scientific functions, and data processing routines for high performance users. The math library, Intel MKL 10.3, includes enhancements such as better Intel AVX support, a summary statistics library, and enhanced C language support for LAPACK. The data processing library, Intel IPP 7.0, includes improved data compression and codecs, and support for Intel AVX and AES instructions, continuing to address data-processing-intensive application domains.
Enhanced Developer Productivity with Correctness Analyzers and Performance Profilers
Intel Parallel Studio XE 2011 combines ease-of-use innovations, introduced in Intel Parallel Studio, with advanced functionality for high performance, scalability and code robustness for Linux and Windows. Intel has traditionally offered developer tools on both Windows and Linux, and strives to offer the same functionality across both platforms, which is especially important for developing applications to run on both operating systems.
With the capabilities in the correctness analyzer, Intel Inspector XE, the product helps the C/C++ and Fortran developer with static and dynamic code analysis, through threading and memory analysis tools, to develop highly robust, secure and highly optimized applications.
New capabilities in this tool include:
- Simplified configuration and run analysis
- Finds coding defects quickly, such as:
  o Memory leaks and memory corruption
  o Threading data races and deadlocks
- Supports native threads; understands any parallel model built on top of threads
- Dynamic instrumentation works on standard builds and binaries
- Timeline view to explore the context of the respective threads
- Intuitive standalone GUI and command-line interface for Windows and Linux
- Advanced command-line reporting
Intel VTune Amplifier XE 2011 is the next generation of the Intel VTune Analyzer, a powerful tool to quickly find and provide greater insights into multicore performance bottlenecks. It takes away the guesswork and analyzes performance behavior in Windows* and Linux* applications, providing quick access to scalability bottlenecks for faster
and improved decision making. The next-generation Intel VTune performance profiler has new features, including:
- Easy predefined analyses
- Fast hotspot analysis (hot functions and call stack)
- Powerful filtering
- Threading timeline
- Frame analysis
- Attach to a running process (Windows)
- Event multiplexing
- Simplified remote collection
- Improved compare results
- Tight Visual Studio integration
- Non-root Linux install
  o Only the EBS driver install needs root
Software security starts very early in the development phase, and Intel Parallel Studio XE 2011 makes it faster to identify, locate, and fix software issues prior to software deployment. This helps identify and prevent critical software security vulnerabilities early in the development cycle, where the cost of finding and fixing errors is the lowest.
Intel's static security analysis (SSA), included in the Parallel Studio XE bundle, provides these unique advantages for robust code development:
- Easier, faster setup and ramp to get static analysis results
- Simple approach to configure and run static analysis
- Discover and fix defects at any phase of the development cycle
- Finds over 250 security errors, such as:
  o Buffer overruns and uninitialized variables
  o Unsafe library usage and arithmetic overflow
  o Unchecked input and heap corruption
- Tracks state associated with issues, even as source evolves and line numbers change
- Displays problem sets and location of source
- Provides filters, assignment of priority, and maintenance of problem set state
- Intuitive standalone GUI and command-line interface for Windows and Linux
Feature: Support for both Linux and Windows platforms
Benefit: Development capability with the same set of tools on both Windows and Linux platforms; enhanced performance, productivity, and programmability
Feature: C/C++ Compilers with Intel Parallel Building Blocks
Benefit: Breakthrough in providing a choice of parallelism for applications (task, data, vector) with mix and match for optimizing application performance; C/C++ standards support

Feature: Fortran Compilers with Fortran 2008 standards support, including Co-Array Fortran (CAF)
Benefit: Advances in the industry-leading Fortran compilers, with new support for scalable parallelism on nodes and clusters (cluster support available separately with Intel Cluster Studio 2011); Fortran standards support

Feature: Memory, threading, and security analysis tools in one package
Benefit: Enhances developer productivity and efficiency by simplifying and speeding the process of detecting difficult-to-find coding errors

Feature: Updated performance libraries
Benefit: Multicore performance for common math and data processing tasks, with a simple linking with these automatically parallel libraries

Feature: Updated performance profiler
Benefit: Several ease-of-use enhancements, deeper micro-architectural insights, an enhanced GUI, and quicker, more robust performance
Increase Performance and Scalability of HPC Cluster Computing
Intel Cluster Studio 2011 sets a new standard in distributed parallelism on Intel architecture-based clusters. This premier tool suite provides development flexibility for enabling MPI-based application performance for highly parallel shared-memory and cluster systems based on the IA-32 and Intel 64 architectures. The newly architected Intel MPI Library 4.0 is key to achieving these advantages by providing new levels of cluster scalability, improved interconnect support across many fabrics, faster on-node messaging, support for hybrid parallelization, and an application tuning capability that adjusts to the cluster and application structure. For the developer, the Intel Trace Analyzer and Collector 8.0 is enhanced with new features that accelerate the analysis and tuning cycle of MPI-based cluster applications. The suite is complemented with the latest Intel C/C++ and Fortran compiler technology, along with Intel MKL 10.3, Intel IPP 7.0, and Intel PBB (also sold as Intel Composer XE), to further optimize and parallelize application execution on each computing node. Co-array Fortran is supported on clusters in this package.
Along with Intel Cluster Ready (ICR), a program to define cluster architectures for increasing uptime, increasing productivity and reducing total cost of ownership (TCO) for IA-based HPC clusters, Intel Cluster Studio 2011 makes it easy to code, debug, and optimize to gain higher scalability for MPI-based cluster applications, up to petascale, and also is the premier
suite for developing and tuning hybrid-parallel codes that can mix MPI with multithreading paradigms such as OpenMP or Intel PBB.
Intel Cluster Studio 2011 provides an extensive software package containing the Intel C/C++ compilers and Intel Fortran compilers for all Intel architectures, plus all the Intel Cluster Tools that help you develop, analyze, and optimize the performance of parallel applications on Linux or Windows. By combining all the compilers and tools into one license package, Intel can provide single installation, interoperability, and support for the best-in-class cluster software tools.
Highlights of Intel Cluster Studio 2011
Scalability and High Performance: The interconnect-tuned and multicore-optimized Intel MPI Library delivers application performance on thousands of IA-32 and Intel 64 multicore processors.
Built-in Optimization: Utilize the optimizing compilers and libraries in Intel Composer XE to get the most out of advanced processor technologies. The C/C++ optimizing compiler now includes Intel PBB, which expands the types of problems that can be solved more easily in parallel, and with increased reliability. For Fortran developers, it now offers co-array Fortran (CAF) and additional support for the Fortran 2008 standard. The Intel compilers also deliver advanced vectorization support with SIMD pragmas.
Ease of MPI Tuning: The Intel Trace Analyzer and Collector has been enhanced with new features that accelerate the analysis and tuning cycle of MPI-based cluster applications.
Target Applications to Multiple Operating Systems: Leverage the same source code in the Intel compilers and libraries, which bring advanced optimizations to Windows and Linux.
Intel Cluster Ready Qualified: This program defines cluster architectures to increase uptime and productivity and reduce total cost of ownership (TCO) for IA-based HPC clusters.
Compatibility and Support: Intel Cluster Studio offers excellent compatibility with leading development environments and compilers, while providing optimal support for multiple generations of Intel processors and compatibles. Intel offers broad support through its forums and Intel Premier Support, which provides fast answers and covers all software updates for one year.
Features and Benefits

Feature: Analysis tools for MPI developers (load imbalance diagram; ideal interconnect simulator)
Benefit: Enhanced developer productivity and efficiencies by simplifying and speeding the detection of errors and
offering performance profiling of MPI messages.

Feature: Scalable Intel MPI Library with multi-rail IB support and Application Tuner
Benefit: Scale to tens of thousands of cores with one of the most scalable and robust commercial MPI libraries in
the industry. Ease of use with dynamic and configurable support across multiple cluster fabrics and multi-rail IB
support.

Feature: C/C++ compilers with Intel Parallel Building Blocks
Benefit: Breakthrough in providing choice of parallelism for applications (process, task, data, vector) with mix and
match for optimizing application performance on clusters of SMP nodes. C/C++ standards support.

Feature: Fortran compilers with Fortran 2008 standards support, including co-array Fortran (CAF) on clusters
Benefit: Advances in industry-leading Fortran compilers with new support for scalable parallelism on nodes and
clusters. Fortran standards support.

Feature: Updated performance libraries, Intel MKL and Intel IPP
Benefit: Multicore performance for common math and data processing tasks, with simple linking with these
automatically parallel libraries.

Feature: Support for both Linux and Windows platforms
Benefit: Development capability with the same set of tools on both Windows and Linux platforms for enhanced
performance, productivity, and programmability.
Summary
With the introduction of Intel Parallel Studio XE and Intel Cluster Studio, Intel is extending the reach of the next-generation
Intel tools to Windows and Linux C/C++ and Fortran developers needing advanced performance for multicore today and
forward scaling to manycore.

The Intel Parallel Studio XE 2011 bundle contains the latest versions of the Intel C/C++ and Fortran compilers, the Intel MKL
and Intel IPP performance libraries, the Intel PBB libraries (Intel TBB, Intel ArBB [in beta], and Intel Cilk Plus), the Intel
Inspector XE correctness analyzer, and the Intel VTune Amplifier XE performance profiler.
The Intel Cluster Studio 2011 bundle contains the latest versions of the Intel MPI Library, Intel Trace Analyzer and Collector,
the Intel C/C++ and Fortran compilers, the Intel MKL and Intel IPP performance libraries, and the Intel PBB libraries
(Intel TBB, Intel ArBB [in beta], and Intel Cilk Plus).
This article originally appeared in Intel Parallel Universe
Magazine. Used with permission.
Intel Array Building Blocks
By Michael McCool

Intel Array Building Blocks (Intel ArBB) is a sophisticated and powerful platform for portable data-parallel software
development. Intel ArBB will be available as a component of Intel Parallel Building Blocks, along with several other
tools and libraries for parallel programming. Intel ArBB can be used to parallelize compute-intensive applications
within a structured, deterministic-by-default framework. It also provides powerful runtime generic programming
mechanisms, yet can be used with existing compilers. In particular, it has been verified to work with the Intel,
Microsoft and gcc C++ compilers. Intel ArBB is currently in beta, and feedback is appreciated; it can be downloaded
today from http://intel.com/go/ArBB for either Windows or Linux.

Is Intel ArBB a language or a library? Yes, both at the same time. Intel ArBB is the answer to the following question:
How can parallelism mechanisms in modern processor hardware, including vector SIMD instructions, be targeted in
a portable, general way within existing programming languages? The answer is an embedded language. Intel ArBB
is a language extension implemented as an API. It has a library interface, but includes a capability for the dynamic
generation and optimization of parallelized and vectorized machine language.
Modern processors include many mechanisms for increasing performance through parallelism: multiple cores,
hyperthreading, superscalar instruction issue, pipelining, and single-instruction, multiple-data (SIMD) vector
instructions. The first two, multiple cores and hyperthreading, can be accessed through threads, although for
efficiency one may want to use lightweight tasks that share hardware threads. Instruction-level parallelism, such
as superscalar instruction issue and pipelining, is invoked automatically by the processor, as long as the instruction
stream avoids unnecessary data dependencies. However, the last form of parallelism, SIMD vector parallelism,
can only be accessed by generating special instructions that explicitly invoke multiple operations at once: SIMD
instructions. SIMD instructions perform the same operation on multiple components of a vector at once, so they are
sometimes also called SIMD vector instructions.
SIMD vector instructions are very powerful, and they are becoming more powerful over time. In current
processors that support streaming SIMD extensions (SSE), four single-precision floating-point operations
can be executed with a single SSE SIMD instruction. In next-generation AVX processors, the width of the SIMD
instructions will double, so eight such operations can be
executed at once. In the Intel Many Integrated Core (MIC) architecture, the width doubles again, so sixteen
such operations can be executed at once. The theoretical peak floating-point performance of a processor is
represented by the product of the number of cores, the width of the vector units and the clock rate. While the
clock rate is no longer scaling significantly, the number of cores and the SIMD vector width of each core continue
to scale. Vectorization, expressing computations using SIMD vector instructions, is essential to attain the peak
performance of modern processors.
However, there are two problems. First, using SIMD vector units requires use of specific machine-language
vector instructions. Second, different processors have different SIMD vector instruction extensions. The SSE,
AVX and MIC vector instructions are all different. While AVX machines can execute SSE instructions, this will not
access the full performance potential of AVX processors. This latter issue is not so critical, since current compiler
technology does permit the generation of multiple code paths in a single binary. For example, when using
the Intel C++ compiler, a single source program can be compiled for both SSE and AVX machines, and the
resulting program will use AVX code when possible. However, when using static compilers, developers still
need to know in advance which set of processors they wish to target, and the problem remains: how is efficient
vectorized code to be generated?
The traditional approach to supporting instruction set extensions is to modify the compiler to emit the
new instructions, and then to recompile programs as necessary. However, for SIMD vector instructions
this is not so easy. It is very difficult for a compiler to automatically identify serial structures in a program that
can be mapped to SIMD vector instructions. It can be done sometimes, but it is better for the programmer to
explicitly indicate which operations in the program should use SIMD vector operations and how. This requires new
constructs in the programming language that can be easily and reliably vectorized. Unfortunately, there is as
yet no widely accepted machine-independent standard for specifying vectorization in C and C++.
Intel Parallel Building Blocks (Intel PBB) actually includes three separate strategies for accessing vector
operations in a portable manner. The first strategy, which should not be overlooked, is to use a fixed-function
library: Intel Math Kernel Library (Intel MKL) and Intel Integrated Performance Primitives (Intel IPP)
include many mathematical operations that have already been vectorized. If the operation you need is part of
these optimized libraries, that is often the best solution. If not, and you have to code the algorithm yourself, there
are two other strategies available. First, you could use Intel Cilk Plus, an extension to C and C++ that includes
a notation to specify explicit vector operations on arrays. This notation is an extension to C/C++ available in the
Intel C/C++ compiler. The second general-purpose mechanism is Intel ArBB.
Intel ArBB is an embedded language, implemented as a C++ API, that in theory works with any ISO-standard C++
compiler. It uses standard C++ mechanisms for its syntax, declaring types for collections of data and overloading
operators so that operations can be expressed over those collections. In other words, it looks like a typical matrix-
vector math library. However, there is a difference. In an ordinary library, the C/C++ compiler generates the
code statically. In ArBB, in contrast, machine code is generated by the library itself, dynamically.
In practice, ArBB is very simple to use; in the following we will give a few examples. To set the stage, however,
we first need to discuss some basics. The ArBB C++ API defines both types and operations. Types include
scalar types for floating-point numbers, integers, and Booleans, as well as types for representing collections of
these types and user-defined types based on them. The ArBB scalar types are used in place of the ordinary C++
types for floats and integers, and have names like f32 (for single-precision float), i32 (for signed 32-bit integers),
and so forth. Using an ArBB scalar type indicates to ArBB that the corresponding machine language for operations
on this type should be generated dynamically by ArBB and not statically by C++. There are also types to manage
large collections of data. The simplest of these is called dense<T, D> and represents a contiguously stored
(dense) multidimensional array with element type T and dimensionality D. The dimensionality is optional and defaults to
1. The element type T can be any ArBB scalar type, or structures or classes with ArBB scalar types as elements.
There are two basic ways to specify parallel computations in ArBB: as sequences of operations over entire collections
(vector mode), or as functions replicated over every element of a collection (elemental mode). Vector mode is the
simplest: arithmetic operations on collections apply in parallel to the corresponding members of the collections. This
works even if the element type is user-defined and the user has overloaded the operator themselves. For example,
suppose we have four dense<f32> collections called A, B, C and D, all of the same size. Then the following expression
will operate in parallel on all the elements of these collections:
A += (B/C) * D;
Note that, in general, when a collection appears on both the left and right side of an expression, ArBB generates a
result as if all the inputs were read before any outputs are written. In practice, we have to put this expression inside
a function and invoke it with a call operation. However, any sequence of parallel vector operations can be inside such a
function:
void
doit(dense<f32>& A, dense<f32> B, dense<f32> C, dense<f32> D)
{
    A += (B/C) * D;
}
...
call(doit)(A,B,C,D);
The way this call actually works is that it calls the function doit precisely once and observes (rather than actually
performs) the sequence of ArBB type constructions, operations and destructions generated by this function. It records
this sequence, compiles it into optimized machine language, executes it (in parallel) and then caches it. The next time
the same function is called, call does not invoke the C++ function again: it will just retrieve the internally generated
machine code from its cache. For simple uses of Intel ArBB this is exactly what you want.
In more advanced use cases, however, you may want to generate different versions of the operation from the same
C++ function. For example, you can parameterize the sequence of Intel ArBB operations by ordinary C++ variables and
control flow, and you can use this to generate variants of a computation. Managing this powerful mechanism for generic
programming is enabled by another Intel ArBB type called a closure. A closure is an object that represents a captured
Intel ArBB function; it is conceptually similar to a lambda function, but is dynamically generated. The return type of call
is actually an appropriately typed closure. Another function, capture, is also available. It is similar to call in that it creates
a closure, but it does not cache it, so it can be called repeatedly on the same C++ function to generate variants. Again,
for simple uses of Intel ArBB explicit use of closures is not necessary, and you can just think of call as a straightforward
function invocation.
You can also write elemental functions over scalar Intel ArBB types:
void
kernel(f32& a, f32 b, f32 c, f32 d)
{
    a += (b/c)*d;
}
You can invoke elemental functions from inside a call by using the map operation. A map operation replicates the
function over every element of the input containers:
void
doit(dense<f32>& A, dense<f32> B, dense<f32> C, dense<f32> D)
{
    map(kernel)(A, B, C, D);
}
call(doit)(A,B,C,D);
It is also possible, from inside an elemental function, to access neighboring elements of the input. This makes it very
easy to write stencil operations, such as convolutions. You can also pass in either an entire container or a single element
to every argument of the map. Single-element arguments are replicated to match the size and shape of any containers
used as arguments. For example, suppose we use:
void
doit(dense<f32>& A, f32 b, f32 c, dense<f32> D)
{
    map(kernel)(A, b, c, D);
}
call(doit)(A,b,c,D);
with the same kernel function, but with the types of b and c matching the corresponding function argument exactly; in
this case, f32. There will still be as many parallel instances of the kernel as there are elements in the collections A and D,
but every instance will get a copy of the same value of b and c. In summary, call arguments need to match exactly, but
map functions are polymorphic, and any argument can be either a single element or a collection.
In addition to using these two basic patterns to express parallel operations, users of ArBB also have access to several
collective operations that act on or take an entire container as an input. These operations can shift the contents
of containers around, take cumulative sums (prefix scans), perform sets of reads and writes (known as scatters and
gathers), discard elements and pack the remainder into a contiguous sequence (known as pack; the inverse is unpack),
or simply combine all elements into a single element. Combination of all the elements of a container into a single