Getting Started in HPC Development


8/7/2019 7512_Getting Started in HPC Development_final2

An Internet.com Developer eBook

Getting Started in HPC Development


Contents

2 Letter from the Editor
4 Utilizing a Multi-Core System with the Actor Model
12 Lots about Locks
22 Intel Parallel Studio XE and Intel Cluster Studio Tool Suites
33 Intel Array Building Blocks

This content was adapted from Internet.com's DevX website and Intel Parallel Universe Magazine. Contributors: James Leigh, Wooyoung Kim, Michael Voss, Michael McCool, Sanjay Goil and John McHugh.


Getting Started in HPC Development, an Internet.com Developer eBook. © 2010 Internet.com, a division of QuinStreet, Inc.


Many people, even in the IT industry, hear the term high-performance computing (HPC) and think of supercomputers that are used in scientific experiments or complex research applications. But as the amount of data continues to grow, and databases continue to expand, businesses in the private sector are going to need to harness some serious computing horsepower.

HPC is powered, in part, by powerful multicore processors that can speed up application performance. For software developers, this means learning to create applications with parallelism that can take advantage of these multicore processors. This also means changing the way that applications are developed.

There are a number of techniques, methods and technologies available that can help application developers pick up parallel programming and create applications that can run in an HPC environment. In this eBook from Internet.com and Intel we're going to look at some of these tools and methods to give developers some ideas about what's available.

In our first article, James Leigh is going to look at developing efficient multi-threaded applications without using synchronized blocks. The actor model (which is native to some programming languages such as Scala) is a pattern for concurrent computation that enables applications to take full advantage of multicore and multiprocessor computing. James likes the actor model because it abstracts the nitty-gritty of multiprocessor programming away from the developer. This reduces concurrency issues and improves the flexibility of the system. The actor model also has a low learning curve, so new developers can quickly see how actors are implemented and understand how they fit together.

In our next article, Wooyoung Kim and Michael Voss discuss why, in their opinion, locks remain the best choice for implementing synchronization and protecting critical sections of software code. Their article discusses some of their experiences with mutual exclusion locks in developing multithreaded concurrent applications, using the locks provided in Intel Threading Building Blocks as examples.

Then John McHugh and Sanjay Goil are going to introduce us to Intel Parallel Studio XE, a set of new software development tool suites for developers of applications that run on both Windows and Linux in C/C++ and Fortran who need advanced performance for multicore today and future scaling to manycore. The Intel Parallel Studio XE 2011 bundle contains the latest versions of Intel C/C++ and Fortran compilers,

Letter from the Editor
By Michael Pastore


Intel MKL and Intel IPP performance libraries, Intel PBB libraries (Intel TBB, Intel ArBB betas, and Intel Cilk Plus), Intel Inspector XE correctness analyzer, and Intel VTune Amplifier XE performance profiler.

Finally, Michael McCool is going to answer the following question: How can parallelism mechanisms in modern processor hardware, including vector SIMD instructions, be targeted in a portable, general way within existing programming languages?

    And his answer will be Intel Array Building Blocks, which

    he will explain in more detail.

We hope you enjoy this eBook, and remember you can always turn to Internet.com websites like DevX.com and Developer.com, as well as the Intel Software Network, for more information on the journey to developing for high-performance computing.


Download the code for this article from: http://assets.devx.com/devx/actor-model.zip.

A typical multi-threaded application in Java contains numerous synchronized methods and statements. It might also contain calls to the methods wait() and notify() that were introduced with Java 1.0, but these methods provide very primitive functionality and are easily misused. Java 5 introduced the java.util.concurrent package, which provides some higher-level abstractions away from wait() and notify(). However, it can still be a challenge to appropriately use the synchronized and volatile keywords. Even when used correctly, getting them used efficiently can require complicated orchestrations of locks.

The biggest criticism of Java's synchronization is performance. Synchronization blocks become overly encompassing too easily. Although a synchronization block on its own is far from slow, when overly encompassing, it becomes a contested synchronization block. Contested synchronized blocks, or other blocking operations, are slow and require the OS to put threads to sleep and use interrupts to activate them. This puts pressure on the scheduler, resulting in significant performance degradation.
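As a hedged sketch of this point (the class and method names here are invented for illustration, not from the article), the difference between a contested and an uncontested synchronized block often comes down to how much work is done while holding the lock:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical example: holding the lock during slow work makes every
// other thread queue up behind it; computing outside the lock and
// synchronizing only the shared update keeps the critical section small.
public class ContendedCounter {
    private final List<Integer> results = new ArrayList<>();

    // Overly encompassing: the expensive computation runs inside the lock.
    public synchronized void recordSlow(int n) {
        int value = compute(n);   // slow work done while holding the lock
        results.add(value);
    }

    // Narrower: compute first, then synchronize only the shared update.
    public void recordFast(int n) {
        int value = compute(n);   // slow work done without the lock
        synchronized (this) {
            results.add(value);
        }
    }

    private int compute(int n) {
        int acc = 0;
        for (int i = 0; i < 1000; i++) acc += (n * i) % 7;  // stand-in for real work
        return acc;
    }

    public synchronized int size() { return results.size(); }

    public static void main(String[] args) throws InterruptedException {
        ContendedCounter c = new ContendedCounter();
        Thread[] ts = new Thread[4];
        for (int i = 0; i < ts.length; i++) {
            final int n = i;
            ts[i] = new Thread(() -> { for (int j = 0; j < 100; j++) c.recordFast(n); });
            ts[i].start();
        }
        for (Thread t : ts) t.join();
        System.out.println(c.size());  // 4 threads x 100 calls = 400
    }
}
```

Both variants are correct; the second simply spends far less time inside the lock, which is exactly what keeps a block from becoming contested.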

    Actor Model

The actor model (native to some programming languages such as Scala) is a pattern for concurrent computation that enables applications to take full advantage of multi-core and multi-processor computing. The fundamental idea behind the actor model is that the application is broken up into actors that perform particular roles. Every method call (or message) to an actor is executed in a unique thread, so you avoid all of the contested locking issues typically found in concurrent applications.

This allows for more efficient concurrent processing while keeping the complexity of actor implementations low, as there is no need to consider concurrent execution within each actor implementation.

The class in Listing 1 shows what an actor class might look like. This class takes a string of words and saves them to an XML file, and includes a calculated code for every character stored. The code might be used later as an index or to find similar text blocks. Notice that this class is not thread safe and you can only use each instance from a single thread. This is normal, because each actor is used from only one thread. It is common not to have any synchronized or volatile keywords present in an actor class, because they are not needed.

Long-lived, normally synchronized objects used by different threads are better off with a dedicated thread free from any synchronization issues. Each method call is placed in the queue (the order within the queue is not important) waiting until the actor is available to process the call. Think of this queue like your email in-box: messages are received at any time and are acted on when time permits. Typically, calls are asynchronous and do

Utilizing a Multi-Core System with the Actor Model
By James Leigh


not block, so the calling thread continues execution and avoids any need to rely on thread interrupts. When callers need a result, you can pass a callback object as part of the parameters to allow the actor to notify the caller. In some cases, it is desirable to block the caller until the actor processes the message.
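The callback pattern described above can be sketched roughly as follows. The Callback and EchoActor names are hypothetical (not from the article's download code), and a single-threaded executor stands in for the actor's dedicated thread:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// A caller passes a callback so the actor can deliver its result
// asynchronously instead of blocking the calling thread.
public class CallbackDemo {
    interface Callback { void done(String result); }

    // The single-threaded executor plays the role of the actor's mailbox.
    static class EchoActor {
        private final ExecutorService mailbox = Executors.newSingleThreadExecutor();
        void process(String msg, Callback cb) {
            mailbox.submit(() -> cb.done("processed: " + msg));  // asynchronous, non-blocking
        }
        void shutdown() { mailbox.shutdown(); }
    }

    static String runOnce(String msg) throws InterruptedException {
        EchoActor actor = new EchoActor();
        final String[] out = new String[1];
        CountDownLatch latch = new CountDownLatch(1);
        actor.process(msg, result -> { out[0] = result; latch.countDown(); });
        latch.await();       // block here only because this demo needs the answer
        actor.shutdown();
        return out[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runOnce("hello"));  // prints "processed: hello"
    }
}
```

The latch in runOnce() illustrates the "block the caller until the actor processes the message" case; ordinary callers would simply return after process() and let the callback fire later.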

You can separate the storage actor in Listing 1 into a second actor as shown in Listing 2. In this way, the storage actor calls an instance of HexCoderActor with itself as the callback. The storage actor does not wait for the HexCoder to generate the hex code, but instead continues with other items in its queue. This allows the storage actor's thread to specialize in writing the resulting XML file, while the text code is calculated asynchronously in another thread. Notice how these classes can take advantage of concurrent threads without any special keywords or deep knowledge of concurrent programming.

Every actor needs a manager to allocate and manage its thread. Each actor also needs a proxy to send messages to its queue. Implementing a basic actor manager is straightforward. Listing 3 shows such a manager written in Java 5. It uses Java's Proxy object to dynamically wrap an actor, implementing all of the actor's interfaces. Every method call on the proxy is then queued in an ExecutorService; void methods are asynchronous, and other method calls block until the executor has finished executing and the result is available.
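As a rough, condensed sketch of the dynamic-proxy idea (the Greeter interface and manage() helper here are invented for illustration; the article's full version is in Listing 3):

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Condensed version of the Listing 3 idea: wrap an object in a dynamic
// proxy that routes every call through a single-threaded executor.
public class ProxyActorDemo {
    interface Greeter { String greet(String name); }

    static class GreeterImpl implements Greeter {
        public String greet(String name) { return "hello, " + name; }
    }

    @SuppressWarnings("unchecked")
    static <T> T manage(final T actor, Class<T> iface) {
        // Daemon thread so the JVM can exit without an explicit shutdown.
        final ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r);
            t.setDaemon(true);
            return t;
        });
        InvocationHandler handler = (proxy, method, args) -> {
            Future<Object> result = executor.submit(() -> method.invoke(actor, args));
            if (Void.TYPE.equals(method.getReturnType()))
                return null;          // void calls are fire-and-forget
            return result.get();      // non-void calls block for the result
        };
        return (T) Proxy.newProxyInstance(iface.getClassLoader(),
                new Class<?>[] { iface }, handler);
    }

    public static void main(String[] args) {
        Greeter g = manage(new GreeterImpl(), Greeter.class);
        System.out.println(g.greet("world"));  // prints "hello, world"
    }
}
```

Because greet() returns a value, the call above blocks until the actor's thread produces the result, exactly as described for non-void methods.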

    Exception Handling and Worker Services

In every program, it is important to test and have proper exception handling. This becomes even more important with multi-threaded programming, because asynchronous execution quickly becomes difficult to debug. Because execution is not done sequentially, a sequential debugger is less useful. Similarly, stack traces are shorter and do not give caller details. In these situations, it is best to either have the actor handle exceptions itself or enable callbacks to handle both successful results and exceptions.

You should also consider that calls to an actor do carry some overhead when compared to sequential calls. You need to queue messages passed to a separate thread, and you cannot optimize with compilers in the same manner as sequential calls. This makes the actor model less applicable to smaller, faster objects that are better implemented as immutable or stateful. However, there are also advantages to running actors in a dedicated thread. By avoiding synchronized and volatile keywords, the on-board chip memory does not need to sync up with the main memory as often, since the actor's thread is the only thread that can access its variables. Modern compilers can also observe that the head-lock of the queue is only used from its actor thread and optimize it away, making it possible for actors to run without any interruption or mandatory memory flushing. Therefore, use actors for specialized worker services.

An example of worker services is an importing and indexing service. Consider the task of retrieving remote data, processing it locally, and storing it into a local database. You might break this up into three steps:

1. Retrieve data.
2. Process data.
3. Store result.

In this example, the remote data is not retrieved by a single connection, but rather in multiple files that are listed in index files, mixed in with the data files. The remote data is in a format that you cannot process directly, and you need to pre-process or format it first. Furthermore, you need to convert the data because it uses a different vocabulary. This creates six steps:

1. Retrieve index or data file.
2. Format the file for parsing.
3. Convert data.
4. If index, then list data files and go to step 1.
5. Process data files.
6. Insert data.

These six steps fit well into the actor model. Think of each of these steps as a job that one or more individuals (actors) need to perform.
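A minimal sketch of this steps-as-actors idea, with invented names (not the article's download code): each step owns a single-threaded mailbox and hands its output to the next step, so a strip of steps becomes a pipeline of actors.

```java
import java.util.concurrent.*;
import java.util.function.UnaryOperator;

// Hypothetical two-step pipeline: each step is an actor with its own
// single-threaded mailbox; a step forwards its output to the next step.
public class PipelineDemo {
    static class StepActor {
        private final ExecutorService mailbox = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r);
            t.setDaemon(true);   // let the JVM exit without explicit shutdown
            return t;
        });
        private final UnaryOperator<String> work;
        private final StepActor next;                 // null for the last step
        private final BlockingQueue<String> sink;     // collects final results

        StepActor(UnaryOperator<String> work, StepActor next, BlockingQueue<String> sink) {
            this.work = work; this.next = next; this.sink = sink;
        }

        void send(String msg) {
            mailbox.submit(() -> {
                String out = work.apply(msg);
                if (next != null) next.send(out); else sink.add(out);
            });
        }
    }

    static String run(String input) throws InterruptedException {
        BlockingQueue<String> results = new LinkedBlockingQueue<>();
        StepActor insert = new StepActor(s -> s + " -> inserted", null, results);
        StepActor format = new StepActor(s -> s + " -> formatted", insert, null);
        format.send(input);
        return results.take();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run("file"));  // prints "file -> formatted -> inserted"
    }
}
```

Adding the remaining steps (retrieve, convert, process) would just mean inserting more StepActor stages, each running on its own thread.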


Included in this article is an implementation of the above actor model for retrieving remote recipes from multiple sites in multiple formats. Each recipe is listed in one or more index files on the web, and the recipe is in HTML.

Actor                   Trait               Role
RoundRobin              UrlConsumer         Distributes URLs to other actors
UrlResolver             UrlConsumer         Retrieves data streams for another actor
XhtmlTransformer        StreamTransformer   Formats HTML into XHTML for parsing
StyleSheetTransformer   StreamTransformer   Converts remote XML format into local data format
RdfParser               StreamTransformer   Parses data stream into data structure
SeeAlsoExtractor        RdfConsumer         Extracts URLs from index data
IngredientProcessor     RdfConsumer         Applies local processing rules on data
RDFInserter             RdfConsumer         Inserts data into a database

Listing 4 shows how these actors are connected to one another. The manage() methods are typed versions of the ActorManager#manage(Object) in Listing 3.

A ClusterMap and Main class are also provided in the download archive. To run the example, execute the Main class with the following two arguments: http://www.kraftcanada.com/en/search/SearchResults.aspx?gcatid=86 and http://www.cookingnook.com/free-online-recipes.html

The program retrieves these silos of information, harvests meaningful data, indexes it, and makes it available in a graphical user interface.

With the stage set, let's introduce the actors:

The Main class then opens the ClusterMap and begins harvesting the recipes. After a few recipes are harvested, select the check-box on the left to see the number of recipes that are harvested and click the clear button at the top to update the list of words extracted from the ingredients section. In this way, you can index and search multiple distinct recipe sites. For example, to find recipes that include lemon, cheddar, and garlic (yum), click on these ingredients and the Tortilla Soup recipe is revealed to include all three ingredients from the recipes harvested (see Figure 1).


Figure 1. ClusterMap: The tortilla soup recipe is revealed after clicking certain ingredients.

In a multi-core system, the program uses over 30 threads to orchestrate the retrieval and processing of the data, downloading and processing as quickly as the remote host provides the data. In spite of the multi-threaded performance, there is no need to consider typical multi-threaded challenges, freeing the developer from worrying about anything beyond what each actor should do.

The actor model is a powerful metaphor to assist in creating multi-threaded applications, and by assigning remote addresses and enabling remote communication between actors, you can extend the model to assist in distributed challenges as well. By including life-cycle and dependency management and making actors aware of their environment, they can become agents, participating in a self-organizing system. This architecture has worked well for many distributed problems such as on-line trading, disaster response, and modelling social structure. It has also been the source of inspiration for many service-oriented architectures.

In essence, the actor model abstracts the nitty-gritty of multi-processor programming away from the developer. This reduces concurrency issues and improves the flexibility of the system. This simple model has a low learning curve, so new developers can quickly see how actors are implemented and understand how they fit together. By managing the actors properly, you can leverage the same implementations from multi-processor systems onto distributed networked systems in a gradual manner that can scale with the development demands.

Listing 1. StorageActor: This listing shows what an actor class might look like.

public class StorageActor implements Storage {

    private Writer out;
    private Set recorded = new HashSet();
    private Set sorted = new TreeSet();
    private StringEncoder encoder = new Soundex();

    public void init() throws IOException {
        out = new FileWriter("text.xml");
        out.write("\n");
    }

    public void close() throws IOException {
        out.write("\n");
        out.close();
    }

    public void store(String text) throws Exception {
        if (recorded.add(text)) {


            String code = code(text);
            store(code, text);
        }
    }

    public void store(String code, String text) throws IOException {
        out.write();
        out.write(text);
        out.write("\n");
    }

    private String code(String text) throws EncoderException {
        for (String word : text.split("[^a-zA-Z]*")) {
            if (word.length() > 2) {
                String encoded = encoder.encode(word);
                sorted.add(encoded);
            }
        }
        int hash = 0;
        for (String encoded : sorted) {
            hash = hash * 31 + encoded.hashCode();
        }
        sorted.clear();
        return Integer.toHexString(hash);
    }
}

Listing 2. HexCoderActor: You can separate the storage actor from Listing 1 into a second actor.

public class HexCoderActor implements HexCoder {

    private Set sorted = new TreeSet();
    private StringEncoder encoder = new Soundex();

    public void code(String text, Storage callback) throws Exception {
        String code = code(text);
        callback.store(code, text);
    }

    private String code(String text) throws EncoderException {
        for (String word : text.split("[^a-zA-Z]*")) {
            if (word.length() > 2) {
                String encoded = encoder.encode(word);
                sorted.add(encoded);
            }
        }
        int hash = 0;


        for (String encoded : sorted) {
            hash = hash * 31 + encoded.hashCode();
        }
        sorted.clear();
        return Integer.toHexString(hash);
    }
}

public class StorageActor implements Storage {

    private Writer out;
    private Set recorded = new HashSet();
    private HexCoder coder;

    public StorageActor(HexCoder coder) {
        this.coder = coder;
    }

    public void init() throws IOException {
        out = new FileWriter("text.xml");
        out.write("\n");
    }

    public void close() throws IOException {
        out.write("\n");
        out.close();
    }

    public void store(String text) throws Exception {
        if (recorded.add(text)) {
            coder.code(text, this);
        }
    }

    public void store(String code, String text) throws IOException {
        out.write();
        out.write(text);
        out.write("\n");
    }
}

Listing 3. ActorManager: The ActorManager as written in Java 5.

public class ActorManager {

    private final Map executors = new ConcurrentHashMap();

    public Object manage(Object actor) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        executors.put(executor, executor);


        Class ac = actor.getClass();
        ClassLoader cl = ac.getClassLoader();
        Class[] interfaces = ac.getInterfaces();
        ActorHandler handler = new ActorHandler(actor, executor);
        return Proxy.newProxyInstance(cl, interfaces, handler);
    }

    private class ActorHandler implements InvocationHandler {

        private Object actor;
        private ExecutorService executor;

        public ActorHandler(Object actor, ExecutorService executor) {
            this.actor = actor;
            this.executor = executor;
        }

        public Object invoke(final Object proxy, final Method method,
                final Object[] args) throws Throwable {
            Class type = method.getReturnType();
            Future result = executor.submit(new Callable() {
                public Object call() throws Exception {
                    Object result = method.invoke(actor, args);
                    if (result == actor)
                        return proxy;
                    return result;
                }
            });
            if (Void.TYPE.equals(type))
                return null;
            return result.get();
        }
    }
}

Listing 4. ActorFactory: This listing shows how the actors are connected.

public void init() throws Exception {
    ClassLoader cl = Thread.currentThread().getContextClassLoader();
    URL xsl = cl.getResource(RECIPES_XSL);
    UrlConsumer[] consumers = new UrlConsumer[1 + PROCESSORS * 2];
    SeeAlsoConsumer seeAlso = _.seeAlso();
    // only one thread/actor can insert at a time
    RDFConsumer insert = _.insert(store);


    for (int i=0;i


Remember the red telephone box, once a familiar sight on the streets of London? That's a good example of mutually exclusive access to a shared resource, although you probably didn't find any locks on them. Why? Because only one person at a time could use one to make a call, and civil persons would not listen to a stranger's conversation while waiting outside.

Unfortunately, there are no guarantees that programs will be equally civil, so wise programmers use semaphores to keep processes from running amok and leaving shared resources, such as files and I/O devices, in inconsistent states.

Mutual exclusion locks (also called mutex locks, or simply locks or mutexes) are a special kind of semaphore. Each protects a single shared resource, qualifying it as a binary semaphore. Concurrent programs use locks to guarantee consistent communication among threads through shared variables or data structures. A piece of program code protected by a mutex lock is called a critical section.

Mutex locks are often implemented using an indivisible test-and-set instruction in today's prevalent multi-core systems. Although generally deemed efficient, relying on an indivisible test-and-set instruction incurs a few hidden performance penalties. First, execution of such an

Lots about Locks
By Wooyoung Kim and Michael Voss

instruction requires memory access, so it interferes with other cores' progress, especially when the instruction is in a tight loop. The effect may be felt even more acutely on systems with a shared memory bus. Another penalty stems from cache coherency. Because the cache line containing a lock object is shared among cores, one thread's update to the lock invalidates the copies on the other cores. Each subsequent test of the lock on other cores triggers fetching the cache line. A related penalty is false sharing, where an unrelated write to another part of the cache line invalidates the whole cache line. Even if the lock remains unchanged, the cache line must be fetched to test the lock on a different core.

Given all these problems, one might wonder: Why use locks at all? What are the alternatives? One extreme alternative is to give up on communicating through shared variables and adopt the mantra of no sharing. That involves replicating data and communicating via message passing. Unfortunately, the cost of replication and message passing is even greater than the overhead associated with locks on today's multi-core shared-memory architectures.

Another approach that has been actively pursued recently as an alternative to mutex locks is lock-free/non-blocking algorithms. Researchers have reported some isolated successes in designing practical non-blocking


implementations. Nonetheless, non-blocking algorithms are hardly a holy grail. Designing efficient non-blocking data structures remains difficult, and the promised performance gain has been elusive at best. You'll see more about non-blocking algorithms at the end of this article.

With no proven better alternatives at present, it makes sense to make the most of mutex locks until they are rendered no longer necessary. This article discusses some experiences with mutex locks in developing multi-threaded concurrent applications, using the mutex locks provided in Intel Threading Building Blocks as examples.

Making the Most of Mutex Locks

Mutexes are often vilified as major performance snags in multi-threaded, concurrent application development; however, our experience suggests that mutex locks are the least evil among the synchronization methods available today. Even though the nominal overhead appears large, you can harness them to your advantage if you use them in well-disciplined ways. Throughout this article, you'll see some of the lessons learned, stated as guidelines, the first two of which are:

Guideline 1: Being Frugal Always Pays Off

Minimize explicit uses of locks. Instead, use concurrent containers and concurrent algorithms provided by efficient, thread-safe concurrency libraries. If you still find places in your application that you think benefit from explicit use of locks, then:

Guideline 2: Make Critical Sections as Small as Possible

When a thread arrives at a critical section and finds that another thread is already in it, it must wait. Keep the critical section small, and you will get small waiting times for threads and better overall performance. Examine when shared data in a critical section is made private and see if you can safely take some of the accesses to the data out of the critical section.

For example, the code in Listing 1 implements a concurrent stack. It defines two methods, push() and pop(), each protected using a TBB mutex lock (smtx) that's acquired in the constructor and released in the destructor. The examples in Listing 1 rely on the C++ scoping rules to delimit the critical sections.

A cursory look at pop() shows that:

1. If the stack is empty, pop() returns false.
2. If the stack is not empty, the code acquires the mutex lock and then re-examines the stack.
3. If the stack has become empty since the previous test, pop() returns false.
4. Otherwise, the code updates the top variable, and copies the old top element.
5. Finally, pop() releases the lock, reclaims the popped node, and returns true.

Here's a closer look at the critical section. Copying type T may take a lot of time, depending on T. Because of the lock, you know that, once updated, the old top value cannot be viewed by other threads; it becomes private and local to the thread inside the critical section. Therefore, you can safely yank the copy statement out of the critical section (following Guideline 2) as follows.

bool pop( T& _e ) {
    node* tp = NULL;
    if( !top ) goto done;
    {
        tbb::spin_mutex::scoped_lock lock( smtx );
        if( !top ) goto done;


        tp = top;
        top = top->nxt;
        // move the next line...
        // _e = tp->elt;
    }
    // ...to here
    _e = tp->elt;
    delete tp;
done:
    return tp != NULL;
}

As another example, consider implementing a tiny memory management routine. A thread allocates objects from its private blocks and returns objects to their parent block. It is possible for a thread to free objects allocated by another thread. Such objects are added to their parent block's public free list. In addition, a block with a non-empty public free list is added to a list (i.e., the public block list) formed with block_t::next_to_internalize and accessed through block_bin_t::mailbox, if not already in it.

The owner thread privatizes objects in a block's public free list, as needed. Function internalize_next() implements this functionality and is invoked when a thread runs out of private blocks with free objects to allocate. It takes a block bin private to the caller thread as its argument and pops the front block from the list bin->mailbox, if not empty. Then, it internalizes objects in the block's public free list:

block_t* internalize_next ( block_bin_t* bin )
{
    block_t* block;
    {
        tbb::spin_mutex::scoped_lock scoped_cs(bin->mailbox_lock);
        block = bin->mailbox;
        if( block ) {
            bin->mailbox = block->next_to_internalize;
            block->next_to_internalize = NULL;
        }
    }
    if( block )
        internalize_returned_objects( block );
    return block;
}

The function's critical section protects access to bin->mailbox with bin->mailbox_lock. Inside the critical section, if bin->mailbox is not empty, it pops the front block to block and resets the block's next_to_internalize.

Note that block is a local variable. By the time bin->mailbox is updated, block (which points to the old front block) becomes invisible to other threads, and access to its next_to_internalize field becomes race-free. Thus, you can safely move the reset statement outside the critical section:

  • 8/7/2019 7512_Getting Started in HPC Development_final2

    16/39

    15 Getting Started in HPC Development an Internet.com Developer eBook. 2010, Internet.com, a division of QuinStreet, IncBack to Contents

    Getting Started in HPC Development

block_t* internalize_next ( block_bin_t* bin )
{
    block_t* block;
    {
        tbb::spin_mutex::scoped_lock scoped_cs(bin->mailbox_lock);
        block = bin->mailbox;
        if( block ) {
            bin->mailbox = block->next_to_internalize;
            // move the next statement...
            // block->next_to_internalize = NULL;
        }
    }
    if( block ) {
        // ...to here
        block->next_to_internalize = NULL;
        internalize_returned_objects( block );
    }
    return block;
}

    Guideline 3: Synchronize as Infrequently as Possible

The idea behind this guideline is that you can amortize the cost of a lock operation over a number of local operations. Doing so reduces the overall execution time, because executing atomic instructions tends to consume an order of magnitude more cycles.

Again, suppose you're designing a memory allocator that allocates objects out of a block. To reduce the number of trips to the operating system to get more memory blocks, the allocator uses a function called allocate_blocks() to get a big strip from the operating system, partition it into a number of blocks, and then put them in the global free block list shared among threads. The free block list free_list is implemented as a concurrent stack (see Listing 2).

Note that the code to push a newly carved-out block into free_list is inside a while loop. Also, note that stack2::push() protects concurrent accesses to stack2::top through a mutex lock. That means allocate_blocks() acquires the lock free_list.mtx N times for a strip containing N blocks.

You can reduce that frequency to one per strip by adding a few thread-local instructions. The idea is to build a thread-local list of blocks in the while loop first (using two pointer variables head and tail) and then push the entire list into free_list with a single lock acquisition (see Listing 3). Finally, so that allocate_blocks() can access free_list's private fields, it's declared as a friend of stack2.

Guideline 4: Most of All, Know Your Application

The guideline that will help you most in practice is to analyze and understand your application using actual use scenarios and representative input sets. Then you can determine what kinds of locks are best used where. Performance analysis tools such as Intel Parallel Amplifier can help you identify where the hot spots are and fine-tune your application accordingly.

A Smorgasbord of Lock Flavors

Intel Threading Building Blocks offers a gamut of mutex locks with different traits, because critical sections with different access patterns call for mutex locks with different trade-offs. Other libraries may offer similar choices. You need to know your application to select the most appropriate lock flavor for each critical section.

Spin Mutex vs. Queuing Mutex

The most prominent distinguishing property for locks is fairness: whether a lock allows fair access to the critical section or not. This is an important consideration when choosing a lock, but its importance may vary depending on circumstances. For example, an operating system should guarantee that no process gets unfairly delayed when multiple processes contend against each other to get into a critical section. By contrast, unfairness among threads in a user process may be tolerable to some degree if it helps boost the throughput.

TBB's spin_mutex is an unfair lock. Threads entering a critical section with a spin_mutex repeatedly attempt to acquire the lock (they spin-wait until they get into the critical section, thus the name). In theory, the waiting time for a spin_mutex is unbounded. The TBB queuing_mutex, on the other hand, is a fair lock, because a thread arriving earlier at a critical section will get into it earlier than one arriving later. Waiting threads form a queue. A newly arriving thread puts itself at the end of the queue using a non-blocking atomic operation and spin-waits until its flag is raised. A thread leaving the critical section hands the lock over to the next in line by raising the latter's flag.

Unfortunately, there are no cast-in-stone guidelines or criteria that dictate when to use an unfair spin_mutex and when to use a fair queuing_mutex. In general, though, guaranteeing fairness costs more. When a critical section is brief and contention is light, the chance of a thread being starved is slim, and any additional overhead for unneeded fairness may not be warranted. In those cases, use a spin_mutex.

The TBB queuing_mutex spin-waits on a local cache line and does not interfere with other threads' memory access. Consider using a queuing mutex for modestly sized critical sections and/or when you expect a fairly high degree of contention.

One report claims that, using a test program with spin locks, a difference of up to 2x runtime per thread was observed, and some threads were unfairly granted the lock up to 1 million times on an 8-core Opteron machine. If you suspect your application suffers from unfairness due to a spin_mutex, switching to a fair mutex such as queuing_mutex is your answer. But before switching, back up your decision with concrete measurement data.

Reader-Writer Locks

Not all concurrent accesses need to be mutually exclusive. Indeed, accesses to many concurrent data structures are mostly read-accesses, and only occasionally need write-accesses. For these structures, keeping one reader spin-waiting while another reader is in the critical section is not necessary.

TBB reader/writer mutexes allow multiple readers to be in a critical section while giving writers exclusive access to it. The unfair version is called spin_rw_mutex, while the fair version is queuing_rw_mutex. These mutexes also allow readers to upgrade to writers and writers to downgrade to readers.

Under some circumstances, you can replace reader-side locks with less expensive operations (although potentially at the expense of writers). One such example is a sequential lock; another is a read-copy-update lock. These locks are less-general reader-writer locks, so using them properly in applications requires more stringent scrutiny.

Mutex and Recursive_Mutex

TBB provides a mutex that wraps around the underlying OS locks but, compared to the native version, adds portability across all supported operating systems. In addition, the TBB mutex releases the lock even when an exception is thrown from the critical section.

A sibling, recursive_mutex, permits a thread to acquire multiple locks on the same mutex. The thread must release all locks on a recursive_mutex before any other thread can acquire a lock on it.

  • 8/7/2019 7512_Getting Started in HPC Development_final2

    18/39

    17 Getting Started in HPC Development an Internet.com Developer eBook. 2010, Internet.com, a division of QuinStreet, IncBack to Contents

    Getting Started in HPC Development

Avoiding Lock Pitfalls

There is no shortage of references that warn about the inevitable dangers of using locks, such as deadlocks and livelocks. However, you can reduce the chances of getting ensnared by these problems considerably by instituting a few simple rules.

Avoid explicit use of locks. Instead, use concurrent containers and concurrent algorithms provided in well-supported concurrency libraries such as Intel Threading Building Blocks. If you think your application requires explicit use of locks, avoid implementing your own locks and use well-tested, well-tuned locks such as the TBB locks.

Avoid making calls to functions (particularly unknown ones) while holding a lock. In general, calling a function while holding a lock is not good practice. For one thing, it increases the size of the critical section, thus increasing the wait-times of other threads. More seriously, you may not know whether the function contains lock acquisition code. Even if it does not now, it may in the future. Such changes potentially lead to a deadlock situation, and when that happens, it's very difficult to locate and fix. If possible, refactor the critical section so that it computes the function arguments in the critical section but invokes the function outside the critical section.

Avoid holding multiple locks. Circular lock acquisition is a leading cause of deadlock problems. If you must hold multiple locks, always acquire the locks in the same order and then release them in the same order that they were acquired.

Avoid using recursive locks. You may be able to find some isolated cases where recursive locks make great sense. However, locks don't compose well. Even a completely unrelated change to a part of your application may lead to a deadlock, and the problem will be very difficult to locate.

Even if you do everything you possibly can to avoid deadlocks and livelocks, problems may still occur. If you suspect your application has a deadlock or race condition, and you cannot locate it quickly, don't get burned by trying to resolve it by yourself. Use tools such as Intel Parallel Inspector.

Lock-Free and Non-Blocking Algorithms

As promised earlier, one strategy advocated by some researchers that avoids locks and their associated problems is to use non-blocking synchronization methods, such as lock-free/wait-free programming techniques and software transactional memory. These techniques aim to provide wait-freedom, thereby addressing issues stemming from the blocking nature of locks without compromising performance.

Unfortunately, our experience with non-blocking algorithms has been (so far) disappointing, and many other developers and researchers agree. Almost all non-blocking algorithms invariably use one or more hardware-supported atomic operations, such as compare-and-swap (CAS) and load-link/store-conditional (LL/SC). Some even use double-word CAS (DCAS).

Dependence on these atomic primitives makes them difficult to write (see Doherty, Simon, et al, "DCAS is not a silver bullet for nonblocking algorithm design," Proceedings of the 16th Annual ACM Symposium on Parallelism in Algorithms and Architectures, 2004, and Herb Sutter's article "Lock-Free Code: A False Sense of Security"), difficult to validate for correctness (see Gotsman, Alexey, et al, "Proving That Non-Blocking Algorithms Don't Block," Symposium on Principles of Programming Languages, to appear in 2009), and difficult to port to other platforms. This is probably one reason why non-blocking algorithms have been limited to simple data structures. Furthermore, improved performance over lock-based implementations seems hard to get.

Arguments for the other benefits are not compelling enough to warrant the pain of switching to non-blocking algorithms. Fairness is contingent upon the underlying atomic operations; in some cases, livelock is still possible. For many user applications, benefits such as real-time support and fault tolerance are a good-to-have, not

  • 8/7/2019 7512_Getting Started in HPC Development_final2

    19/39

    18 Getting Started in HPC Development an Internet.com Developer eBook. 2010, Internet.com, a division of QuinStreet, IncBack to Contents

    Getting Started in HPC Development

a must-have. In other cases, solutions provided by operating systems are sufficient (e.g., priority inheritance for priority inversion).

Software Transactional Memory (STM) is another alternative to lock-based synchronization. It abstracts away the use of low-level atomic primitives using the notion of transactions, and simplifies synchronizing access to shared variables through optimistic execution and a roll-back mechanism. Like non-blocking algorithms, STM promises performance gains over lock-based synchronization, and also promises to avoid many common locking pitfalls. The results so far are not so favorable. One publication observes that the overall performance of TM is significantly worse at low levels of parallelism (see Cascaval, Calin, et al, "Software Transactional Memory: Why is it only a research toy?" ACM Queue, 2008, Vol. 6, 5). However, STM is a relatively young research area, so the jury is still out.

Lock It Up

Locks have been unfairly vilified as a hindrance to the development of efficient concurrent applications on burgeoning multi-core platforms. However, our experiences suggest that rather than discouraging the use of mutex locks, one should instead promote their well-disciplined use. More often than not, implementations with such locks outperform those with non-blocking algorithms or STM.

The most important consideration for making the best use of mutex locks is understanding the application well, using tools to aid that understanding where necessary, and selecting the best-fitting synchronization method for each critical section. When you do choose a mutex, use it with the recommended guidelines, but keep flexibility in mind. Doing so will prevent most common mutex-related pitfalls without incurring unwarranted performance penalties. Finally, shun the do-it-yourself temptation, and delegate work to well-supported concurrency libraries.

    Listing 1. Concurrent Stack Implementation:

    The push and pop methods are protected by a mutex lock acquired in the constructor and released in the destructor.

/* unintrusive concurrent stack */
#include "tbb/spin_mutex.h"

template <typename T>
class concurrent_stack
{
    class node {
        friend class concurrent_stack;
        node* nxt;
        T elt;
    public:
        node( T& _e ) : nxt(NULL), elt(_e) {}
    };
public:
    concurrent_stack() : top(0) { }
    void push( T& _e ) {
        node* n = new node( _e );
        tbb::spin_mutex::scoped_lock lock( smtx );
        n->nxt = top;
        top = n;
    }
    bool pop( T& _e ) {
        node* tp = NULL;
        if( !top ) goto done;
        {
            tbb::spin_mutex::scoped_lock lock( smtx );
            if( !top ) goto done;
            tp = top;
            top = top->nxt;
            _e = tp->elt;
        }
        delete tp;
    done:
        return tp!=NULL;
    }
private:
    node* top;
    tbb::spin_mutex smtx;
};

Listing 2. Memory Allocator:
This memory allocator reduces trips to get memory by getting a big strip of memory from the operating system, partitioning it into a number of blocks, and then putting them in the global free block list.

class stack2 {
public:
    inline stack2() : top(NULL) {}
    inline void push ( void** ptr ) {
        tbb::spin_mutex::scoped_lock lock(mtx);
        *ptr = top;
        top = ptr;
    }
    inline void* pop ( void ) {
        if( !top ) return NULL;
        void **result;
        {
            tbb::spin_mutex::scoped_lock lock(mtx);
            if ( !(result=(void**) top) ) goto done;
            top = *result;
        }
        *result = NULL;
    done:
        return result;
    }
private:
    tbb::spin_mutex mtx;
    void* top;
};

stack2 free_list;

int allocate_blocks() {
    uintptr_t raw_strip = get_strip();
    if( !raw_strip )
        return 0;
    uintptr_t aligned_strip = align_strip( raw_strip );
    uintptr_t endp = (uintptr_t) raw_strip+strip_size;
    uintptr_t b = aligned_strip;
    while ( b+block_size<=endp ) {
        uintptr_t block_endp = b+block_size;
        ((block_t*)b)->bump_ptr = (void*) block_endp;
        // note this line
        free_list.push( (void**)b );
        b = block_endp;
    }
    return 1;
}

Listing 3. Build a Block List:
The while loop builds a list of blocks, and then pushes the entire list into free_list using only a single lock acquisition.

class stack2 {
    friend int allocate_blocks();
public:
    inline stack2() : top(NULL) {}
    inline void push ( void** ptr ) {
        tbb::spin_mutex::scoped_lock lock(mtx);
        *ptr = top;
        top = ptr;
    }
    inline void* pop ( void ) {
        if( !top ) return NULL;
        void **result;
        {
            tbb::spin_mutex::scoped_lock lock(mtx);
            if ( !(result=(void**) top) ) goto done;
            top = *result;
        }
        *result = NULL;
    done:
        return result;
    }
private:
    tbb::spin_mutex mtx;
    void* top;
};

stack2 free_list;

int allocate_blocks() {
    uintptr_t raw_strip = get_strip();
    if( !raw_strip )
        return 0;
    uintptr_t aligned_strip = align_strip( raw_strip );
    uintptr_t endp = (uintptr_t) raw_strip+strip_size;
    uintptr_t b = aligned_strip;
    uintptr_t head = 0;
    uintptr_t tail = b;
    while ( b+block_size<=endp ) {
        uintptr_t block_endp = b+block_size;
        ((block_t*)b)->bump_ptr = (void*) block_endp;
        // free_list.push( (void**)b );  // per-block push replaced by the
        //                               // single push below
        * (uintptr_t*) b = head;
        head = b;
        b = block_endp;
    }
    {
        // Push the block list into free_list
        tbb::spin_mutex::scoped_lock lock(free_list.mtx);
        * (void**) tail = free_list.top;
        free_list.top = (void*) head;
    }
    return 1;
}

  • 8/7/2019 7512_Getting Started in HPC Development_final2

    23/39

    Getting Started in HPC Development an Internet.com Developer eBook. 2010, Internet.com, a division o QuinStreet, IncBack to Contents

    Getting Started in HPC Development

    This article originally appeared in Intel Parallel Universe

    Magazine. Used with permission.

Intel Parallel Studio XE and Intel Cluster Studio Tool Suites
By Sanjay Goil and John McHugh

In September, Intel introduced Intel Parallel Studio 2011, a tool suite for Microsoft Windows Visual Studio C++ developers, with the singular objective of providing the essential performance tools for application development on Intel architecture. These tools provide significant innovation, and enable unprecedented developer productivity when building, debugging and tuning parallel applications for multicore. With the introduction of Intel Parallel Building Blocks (Intel PBB), developers have methods to introduce and extend parallelism in C/C++ applications for higher performance and efficiency. Now Intel is extending the reach of the next-generation Intel tools to developers of applications on both Windows and Linux in C/C++ and Fortran who need advanced performance for multicore today and forward scaling to manycore. Intel Parallel Studio XE 2011 contains C/C++ and Fortran compilers; the Intel Math Kernel Library (Intel MKL) and Intel Integrated Performance Primitives (Intel IPP) performance libraries; the Intel PBB libraries, Intel Threading Building Blocks (Intel TBB), Intel Cilk Plus, and Intel Array Building Blocks (Intel ArBB); the Intel Inspector XE correctness analyzer; and the Intel VTune Amplifier XE performance profiler.

HPC programmers have traditionally been able to use all the compute power made available to them. Even with the performance leaps that Moore's law has allowed Intel architecture to deliver over the past decade, the hunger for additional performance continues to thrive. There are big unsolved problems in science and engineering, physical simulations at higher granularities, and problems where the economically viable compute power provides lower resolution or piecemeal simulation of smaller portions of the larger problem. This is what makes serving the HPC market so exciting for Intel, and it is a significant driver for innovation in both hardware and software methodologies for parallelism and performance.

Intel Cluster Studio introduces tools for HPC cluster development with MPI, including the scalable Intel MPI Library and the Intel Trace Analyzer and Collector performance profiler, with the industry-leading C/C++ and Fortran compilers, for a complete cluster development toolkit. This is combined with the ease of deployment offered by the Intel Cluster Ready program, making deployment of cluster applications highly efficient.

Introducing New Tool Suites

Software developers of high performance applications require a complete set of development tools. While traditionally these tools include compilers, debuggers, and performance and parallel libraries, more often the issues in development come in error correctness and performance profiling. The code doesn't run correctly, or exhibits error-prone behavior on some runs, pointing to data races, deadlocks, or performance bottlenecks in locks or synchronization, or exposes security risks at runtime. To this end, Intel's correctness analyzers and performance profilers are a great addition to the development environment for highly robust and secure code development.

For advanced and distributed performance, Intel is simplifying the procurement, deployment and use of HPC tools on multicore 32- and 64-node Intel architecture and HPC clusters programmed with the Message Passing Interface (MPI).

A software development project goes through several steps to get optimal performance on the target platform. Most often the developer gets a rudimentary performance profile of the application run to show hotspots. Once opportunities for optimization are identified, the coding aspects are handled by the compilers and performance and parallel libraries to add parallelism, presenting task-level, data-level and vectorization opportunities. Finally, the correctness tools make robust code possible by checking for threading and memory errors, and identifying security vulnerabilities. This cycle typically repeats itself to find higher application efficiencies.


Highlights of Intel Parallel Studio XE 2011

Available for Multiple Operating Systems: Intel Parallel Studio XE provides the same set of tools to aid development for both Windows and Linux platforms. C/C++ and Fortran compilers and performance and parallelism libraries bring advanced optimizations to the Mac*.

Robustness: Intel Inspector XE's memory and thread analyzer finds and pinpoints memory and threading errors before they happen.

Code Quality: Intel Parallel Studio XE enables developers to effectively find software security vulnerabilities through static security analysis.

Advanced Optimization: The compilers and libraries in Intel Composer XE offer advanced vectorization support, including support for Intel AVX. The C/C++ optimizing compiler now includes the Intel PBB library, expanding the types of problems that can be solved more easily in parallel with increased scalability and reliability. For Fortran developers, it now offers co-array Fortran and additional support for the Fortran 2008 standard.

Performance: The Intel VTune Amplifier XE performance profiler finds bottlenecks in serial and parallel code that limit performance. Improvements include a more intuitive interface, fast statistical call graph, and timeline view. The Intel MKL and Intel IPP performance libraries provide robust multicore performance for commonly used math and data processing routines. A simple linking of the application with these libraries is an easy first step toward multicore parallelism.

Compatibility and Support: Intel Parallel Studio XE excels at compatibility with leading development environments and compilers. Intel offers broad support with forums and Intel Premier Support, which provides fast answers and covers all software updates for one year.


Old Name -> New Name
Compiler Suite Professional Edition -> Composer XE
C++ Compiler Professional Edition -> C++ Composer XE
[Visual] Fortran Compiler Professional Edition -> [Visual] Fortran Composer XE
Visual Fortran Compiler Professional Edition with IMSL -> Visual Fortran Composer XE with IMSL
VTune Performance Analyzer (including Intel Thread Profiler) -> VTune Amplifier XE
Thread Checker -> Inspector XE
Cluster Toolkit Compiler Edition -> Cluster Studio

What's new in Intel Composer XE

Intel Composer XE contains next-generation C/C++ and Fortran compilers (12.0) and the performance and parallel libraries Intel MKL 10.3, Intel IPP 7.0 and Intel TBB 3.0.

The latest Intel C/C++ compiler, Intel C++ Compiler XE 12.0, is optimized for the latest Intel architecture processor (codenamed Sandy Bridge) with Intel AVX support. The product contains Intel PBB, which includes advances in mixing and matching task, vector, and data parallelism in applications to better map to multicore optimization opportunities: Intel Cilk Plus, Intel TBB and Intel ArBB (in beta, available separately). There are vector optimizations for Intel AVX with SIMD pragmas, in addition to array notation and a tool to help in auto-parallelization called GAP, for the highest performance and parallelism on the latest generation of x86 multicore CPUs. For Windows users, support for Visual Studio 2010 is included.

The tools introduced in Intel Parallel Studio XE 2011 are next-generation revisions of industry-leading tools for C/C++ and Fortran developers seeking cross-platform capabilities for the latest x86 processors on Windows and Linux platforms. Those familiar with Intel's industry-leading tools will see that the product names have transitioned in this new release, in all cases with significant additional capabilities. Other names remain the same.


The Intel Fortran Compiler XE 12.0 includes several advances: more complete support for the Fortran 2003 standard and some support for the Fortran 2008 standard, including Co-array Fortran, vector optimizations with AVX, and help with auto-parallelization for the highest performance and parallelism on the latest x86 multicore CPUs.

The performance libraries continue to provide an easy way to include highly optimized and automatically parallel math and scientific functions, and data processing routines for high performance users. The math library, Intel MKL 10.3, includes enhancements such as better Intel AVX support, a summary statistics library, and enhanced C language support for LAPACK. The data processing library, Intel IPP 7.0, includes improved data compression and codecs, and support for Intel AVX and AES instructions, continuing to address data-processing-intensive application domains.

Enhanced Developer Productivity with Correctness Analyzers and Performance Profilers

Intel Parallel Studio XE 2011 combines ease-of-use innovations, introduced in Intel Parallel Studio, with advanced functionality for high performance, scalability and code robustness for Linux and Windows. Intel has traditionally offered developer tools on both Windows and Linux, and strives to offer the same functionality across both platforms, which is especially important for developing applications to run on both operating systems.

[Figure: Introducing SIMD pragmas]

With the capabilities in the correctness analyzer, Intel Inspector XE, the product helps the C/C++ and Fortran developer with static and dynamic code analysis, through threading and memory analysis tools, to develop highly robust, secure and highly optimized applications.

New capabilities in this tool include:

Simplified configuration and run analysis
Finds coding defects quickly, such as:
o Memory leaks and memory corruption
o Threading data races and deadlocks
Supports native threads, understands any parallel model built on top of threads
Dynamic instrumentation works on standard builds and binaries
Timeline view to explore context of the respective threads
Intuitive standalone GUI and command line interface for Windows and Linux
Advanced command line reporting

Intel VTune Amplifier XE 2011 is the next generation of the Intel VTune Analyzer, which is a powerful tool to quickly find and provide greater insights into multicore performance bottlenecks. It takes away the guesswork and analyzes performance behavior in Windows* and Linux* applications, providing quick access to scalability bottlenecks for faster and improved decision making.

[Figure: Hotspots in the application; thread-based CPU usage]

The next-generation Intel VTune performance profiler has new features, including:

Easy predefined analyses
Fast hotspot analysis (hot functions and call stack)
Powerful filtering
Threading timeline
Frame analysis
Attach to a running process (Windows)
Event multiplexing
Simplified remote collection
Improved compare results
Tight Visual Studio integration
Non-root Linux install
o Only EBS driver install needs root

    Software security starts very early in the development phase, and Intel Parallel Studio XE 2011 makes it faster to identify, locate, and fix software issues prior to software deployment. This helps identify and prevent critical software security vulnerabilities early in the development cycle, where the cost of finding and fixing errors is the lowest.

    Intel's static security analysis (SSA), included in the Parallel Studio XE bundle, provides these unique advantages for robust code development:

    Easier, faster setup and ramp to get static analysis results

    Simple approach to configure and run static analysis

    Discover and fix defects at any phase of the development cycle

    Finds over 250 security errors, such as:

    o Buffer overruns and uninitialized variables

    o Unsafe library usage and arithmetic overflow

    o Unchecked input and heap corruption

    Tracks state associated with issues, even as source evolves and line numbers change

    Displays problem sets and location of source

    Provides filters, assignment of priority, and maintenance of problem set state

    Intuitive standalone GUI and command line interface for Windows and Linux

    Feature: Support for both Linux and Windows platforms
    Benefit: Development capability with the same set of tools on both Windows and Linux platforms; enhanced performance, productivity, and programmability

    Feature: C/C++ compilers with Intel Parallel Building Blocks
    Benefit: Breakthrough in providing choice of parallelism for applications (task, data, vector), with mix and match for optimizing application performance; C/C++ standards support

    Feature: Fortran compilers with Fortran 2008 standards support, including Co-Array Fortran (CAF)
    Benefit: Advances in the industry-leading Fortran compilers with new support for scalable parallelism on nodes and clusters (cluster support available separately with Intel Cluster Studio 2011); Fortran standards support

    Feature: Memory, threading, and security analysis tools in one package
    Benefit: Enhances developer productivity and efficiency by simplifying and speeding the process of detecting difficult-to-find coding errors

    Feature: Updated performance libraries
    Benefit: Multicore performance for common math and data processing tasks, with simple linking against these automatically parallel libraries

    Feature: Updated performance profiler
    Benefit: Several ease-of-use enhancements, deeper microarchitectural insights, enhanced GUI, and quicker, more robust performance

    Increase Performance and Scalability of HPC Cluster Computing

    Intel Cluster Studio 2011 sets a new standard in distributed parallelism on Intel architecture-based clusters. This premier tool suite provides development flexibility for enabling MPI-based application performance for highly parallel shared-memory and cluster systems based on 32- and 64-bit Intel architectures. The newly architected Intel MPI Library 4.0 is key to achieving these advantages by providing new levels of cluster scalability, improved interconnect support across many fabrics, faster on-node messaging, support for hybrid parallelization, and an application tuning capability that adjusts to the cluster and application structure. For the developer, the Intel Trace Analyzer and Collector 8.0 is enhanced with new features that accelerate the analysis and tuning cycle of MPI-based cluster applications. The suite is complemented with the latest Intel C/C++ and Fortran compiler technology along with Intel MKL 10.3, Intel IPP 7.0, and Intel PBB (also sold as Intel Composer XE) to further optimize and parallelize application execution on each computing node. Co-Array Fortran is supported on clusters in this package.

    Along with Intel Cluster Ready (ICR), a program to define cluster architectures for increasing uptime, increasing productivity, and reducing total cost of ownership (TCO) for IA-based HPC clusters, Intel Cluster Studio 2011 makes it easy to code, debug, and optimize to gain higher scalability for MPI-based cluster applications, up to petascale, and also is the premier

    suite for developing and tuning hybrid-parallel codes that can mix MPI with multithreading paradigms such as OpenMP or Intel PBB.

    Intel Cluster Studio 2011 provides an extensive software package containing Intel C/C++ compilers and Intel Fortran compilers for all Intel architectures, plus all the Intel Cluster Tools that help you develop, analyze, and optimize the performance of parallel applications on Linux or Windows. By combining all the compilers and tools into one license package, Intel can provide single installation, interoperability, and support for the best-in-class cluster software tools.

    Highlights of Intel Cluster Studio 2011

    Scalability and High Performance: The interconnect-tuned and multicore-optimized Intel MPI Library delivers application performance on thousands of 32- and 64-bit IA multicore processors.

    Built-in Optimization: Utilize optimizing compilers and libraries in Intel Composer XE to get the most out of advanced processor technologies. The C/C++ optimizing compiler now includes Intel PBB, which expands the types of problems that can be solved more easily in parallel, and with increased reliability. For Fortran developers, it now offers Co-Array Fortran (CAF) and additional support for the Fortran 2008 standard. Intel compilers also deliver advanced vectorization support with SIMD pragmas.

    Ease of MPI Tuning: Intel Trace Analyzer and Collector has been enhanced with new features that accelerate the analysis and tuning cycle of MPI-based cluster applications.

    Target Applications to Multiple Operating Systems: Leverage the same source code in Intel compilers and libraries, which bring advanced optimizations to Windows and Linux.

    Intel Cluster Ready Qualified: This program defines cluster architectures to increase uptime and productivity and reduce total cost of ownership (TCO) for IA-based HPC clusters.

    Compatibility and Support: Intel Cluster Studio offers excellent compatibility with leading development environments and compilers, while providing optimal support for multiple generations of Intel processors and compatibles. Intel offers broad support through its forums and Intel Premier Support, which provides fast answers and covers all software updates for one year.

    Feature: Analysis tools for MPI developers, including a load imbalance diagram and an ideal interconnect simulator
    Benefit: Enhanced developer productivity and efficiency by simplifying and speeding the detection of errors and offering performance profiling of MPI messages

    Feature: Scalable Intel MPI Library with multi-rail InfiniBand support and Application Tuner
    Benefit: Scale to tens of thousands of cores with one of the most scalable and robust commercial MPI libraries in the industry; ease of use with dynamic and configurable support across multiple cluster fabrics and multi-rail InfiniBand support

    Feature: C/C++ compilers with Intel Parallel Building Blocks
    Benefit: Breakthrough in providing choice of parallelism for applications (process, task, data, vector), with mix and match for optimizing application performance on clusters of SMP nodes; C/C++ standards support

    Feature: Fortran compilers with Fortran 2008 standards support, including Co-Array Fortran (CAF) on clusters
    Benefit: Advances in industry-leading Fortran compilers with new support for scalable parallelism on nodes and clusters; Fortran standards support

    Feature: Updated performance libraries, Intel MKL and Intel IPP
    Benefit: Multicore performance for common math and data processing tasks, with simple linking against these automatically parallel libraries

    Feature: Support for both Linux and Windows platforms
    Benefit: Development capability with the same set of tools on both Windows and Linux platforms for enhanced performance, productivity, and programmability

    Summary

    With the introduction of Intel Parallel Studio XE and Intel Cluster Studio, Intel is extending the reach of its next-generation tools to Windows and Linux C/C++ and Fortran developers needing advanced performance for multicore today and forward scaling to manycore.

    The Intel Parallel Studio XE 2011 bundle contains the latest versions of the Intel C/C++ and Fortran compilers, the Intel MKL and Intel IPP performance libraries, the Intel PBB libraries (Intel TBB, Intel ArBB [in beta], and Intel Cilk Plus), the Intel Inspector XE correctness analyzer, and the Intel VTune Amplifier XE performance profiler.

    The Intel Cluster Studio 2011 bundle contains the latest versions of the Intel MPI Library, Intel Trace Analyzer and Collector, the Intel C/C++ and Fortran compilers, the Intel MKL and Intel IPP performance libraries, and the Intel PBB libraries (Intel TBB, Intel ArBB [in beta], and Intel Cilk Plus).

    This article originally appeared in Intel Parallel Universe Magazine. Used with permission.

    Intel Array Building Blocks
    By Michael McCool

    Intel Array Building Blocks (Intel ArBB) is a sophisticated and powerful platform for portable data-parallel software development. Intel ArBB will be available as a component of Intel Parallel Building Blocks, along with several other tools and libraries for parallel programming. Intel ArBB can be used to parallelize compute-intensive applications within a structured, deterministic-by-default framework. It also provides powerful runtime generic programming mechanisms, yet can be used with existing compilers. In particular, it has been verified to work with the Intel, Microsoft, and gcc C++ compilers. Intel ArBB is currently in beta, and feedback is appreciated; it can be downloaded today from http://intel.com/go/ArBB for either Windows or Linux.

    Is Intel ArBB a language or a library? Yes: both at the same time. Intel ArBB is the answer to the following question: how can parallelism mechanisms in modern processor hardware, including vector SIMD instructions, be targeted in a portable, general way within existing programming languages? The answer is an embedded language. Intel ArBB is a language extension implemented as an API. It has a library interface, but includes a capability for the dynamic generation and optimization of parallelized and vectorized machine language.

    Modern processors include many mechanisms for increasing performance through parallelism: multiple cores, hyperthreading, superscalar instruction issue, pipelining, and single-instruction, multiple-data (SIMD) vector instructions. The first two, multiple cores and hyperthreading, can be accessed through threads, although for efficiency one may want to use lightweight tasks that share hardware threads. Instruction-level parallelism, such as superscalar instruction issue and pipelining, is invoked automatically by the processor, as long as the instruction stream avoids unnecessary data dependencies. However, the last form of parallelism, SIMD vector parallelism, can only be accessed by generating special instructions that explicitly invoke multiple operations at once: SIMD instructions. SIMD instructions perform the same operation on multiple components of a vector at once, so they are sometimes also called SIMD vector instructions.

    SIMD vector instructions are very powerful, and they are becoming more powerful over time. In current processors that support streaming SIMD extensions (SSE), four single-precision floating-point operations can be executed with a single SSE SIMD instruction. In next-generation AVX processors, the width of the SIMD instructions will double, so eight such operations can be

    executed at once. In the Intel Many Integrated Core (MIC) architecture, the width doubles again, so sixteen such operations can be executed at once. The theoretical peak floating-point performance of a processor is represented by the product of the number of cores, the width of the vector units, and the clock rate. While the clock rate is no longer scaling significantly, the number of cores and the SIMD vector width of each core continue to scale. Vectorization, expressing computations using SIMD vector instructions, is essential to attain the peak performance of modern processors.
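The formula above can be written down directly. The following is an illustrative sketch: the core count, clock rate, and the assumption of one SIMD floating-point operation per cycle per core are hypothetical, not measurements of any particular processor.

```cpp
// Theoretical peak FLOP/s = number of cores * SIMD vector width * clock rate.
// Assumes one SIMD floating-point operation issued per cycle per core;
// fused multiply-add units would double this figure again.
double peak_gflops(int cores, int simd_width, double clock_ghz)
{
    return cores * simd_width * clock_ghz;
}
```

For a hypothetical 4-core, 3 GHz processor, 8-wide AVX-class vectors give peak_gflops(4, 8, 3.0) == 96 GFLOP/s, twice the 48 GFLOP/s the same part reaches with 4-wide SSE vectors. This is why vectorization matters as much as core count.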

    However, there are two problems. First, using SIMD vector units requires use of specific machine-language vector instructions. Second, different processors have different SIMD vector instruction extensions. The SSE, AVX, and MIC vector instructions are all different. While AVX machines can execute SSE instructions, this will not access the full performance potential of AVX processors. This latter issue is not so critical, since current compiler technology does permit the generation of multiple code paths in a single binary. For example, when using the Intel C++ compiler, a single source program can be compiled for both SSE and AVX machines, and the resulting program will use AVX code when possible. However, when using static compilers, developers still need to know in advance which set of processors they wish to target, and the problem remains: how is efficient vectorized code to be generated?

    The traditional approach to supporting instruction set extensions is to modify the compiler to emit the new instructions, and then to recompile programs as necessary. However, for SIMD vector instructions this is not so easy. It is very difficult for a compiler to automatically identify serial structures in a program that can be mapped to SIMD vector instructions. It can be done sometimes, but it is better for the programmer to explicitly indicate which operations in the program should use SIMD vector operations, and how. This requires new constructs in the programming language that can be easily and reliably vectorized. Unfortunately, there is as yet no widely accepted machine-independent standard for specifying vectorization in C and C++.

    Intel Parallel Building Blocks (Intel PBB) actually includes three separate strategies for accessing vector operations in a portable manner. The first strategy, which should not be overlooked, is to use a fixed-function library: Intel Math Kernel Library (Intel MKL) and Intel Integrated Performance Primitives (Intel IPP) include many mathematical operations that have already been vectorized. If the operation you need is part of these optimized libraries, that is often the best solution. If not, and you have to code the algorithm yourself, there are two other strategies available. First, you could use Intel Cilk Plus, an extension to C and C++ that includes a notation to specify explicit vector operations on arrays. This notation is an extension to C/C++ available in the Intel C/C++ compiler. The second general-purpose mechanism is Intel ArBB.

    Intel ArBB is an embedded language, implemented as a C++ API, that in theory works with any ISO-standard C++ compiler. It uses standard C++ mechanisms for its syntax, declaring types for collections of data and overloading operators so that operations can be expressed over those collections. In other words, it looks like a typical matrix-vector math library. However, there is a difference. In an ordinary library, the C/C++ compiler generates the code statically. In ArBB, in contrast, machine code is generated by the library itself, dynamically.

    In practice, ArBB is very simple to use; in the following we will give a few examples. To set the stage, however, we first need to discuss some basics. The ArBB C++ API defines both types and operations. Types include scalar types for floating-point numbers, integers, and Booleans, as well as types for representing collections of these types and user-defined types based on them. The ArBB scalar types are used in place of the ordinary C++ types for floats and integers, and have names like f32 (for single-precision float), i32 (for signed 32-bit integers), and so forth. Using an ArBB scalar type indicates to ArBB that the corresponding machine language for operations on this type should be generated dynamically by ArBB and not statically by C++. There are also types to manage large collections of data. The simplest of these is called dense<T, D> and represents a contiguously stored

    (dense) multidimensional array with element type T and dimensionality D. The dimensionality is optional and defaults to 1. The element type T can be any ArBB scalar type, or structures or classes with ArBB scalar types as elements.

    There are two basic ways to specify parallel computations in ArBB: as sequences of operations over entire collections (vector mode), or as functions replicated over every element of a collection (elemental mode). Vector mode is the simplest: arithmetic operations on collections apply in parallel to the corresponding members of the collections. This works even if the element type is user-defined and the user has overloaded the operator themselves. For example, suppose we have four dense<f32> collections called A, B, C, and D, all of the same size. Then the following expression will operate in parallel on all the elements of these collections:

    A += (B/C) * D;

    Note that in general, when a collection appears on both the left and right side of an expression, ArBB generates a result as if all the inputs were read before any outputs are written. In practice, we have to put this expression inside a function and invoke it with a call operation. However, any sequence of parallel vector operations can be inside such a function:

    void
    doit(dense<f32>& A, dense<f32> B, dense<f32> C, dense<f32> D)
    {
        A += (B/C) * D;
    }
    ...
    call(doit)(A,B,C,D);
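To make the semantics concrete, here is what the call above computes, written as a plain scalar C++ loop. This is a sketch only: std::vector<float> stands in for ArBB's dense<f32>, and the parallel execution and dynamic code generation that ArBB actually performs are omitted.

```cpp
#include <cstddef>
#include <vector>

// Scalar model of the ArBB vector-mode function: A += (B/C) * D is an
// independent update at every element position, as if all inputs were
// read before any output is written.
void doit_scalar(std::vector<float>& A, const std::vector<float>& B,
                 const std::vector<float>& C, const std::vector<float>& D)
{
    for (std::size_t i = 0; i < A.size(); ++i)
        A[i] += (B[i] / C[i]) * D[i];
}
```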

    The way this call actually works is that it calls the function doit precisely once and observes (rather than actually performs) the sequence of ArBB type constructions, operations, and destructions generated by this function. It records this sequence, compiles it into optimized machine language, executes it (in parallel), and then caches it. The next time the same function is called, call does not invoke the C++ function again: it will just retrieve the internally generated machine code from its cache. For simple uses of Intel ArBB this is exactly what you want.

    In more advanced use cases, however, you may want to generate different versions of the operation from the same C++ function. For example, you can parameterize the sequence of Intel ArBB operations by ordinary C++ variables and control flow, and you can use this to generate variants of a computation. Managing this powerful mechanism for generic programming is enabled by another Intel ArBB type called a closure. A closure is an object that represents a captured Intel ArBB function; it is conceptually similar to a lambda function, but is dynamically generated. The return type of call is actually an appropriately typed closure. Another function, capture, is also available. It is similar to call in that it creates a closure, but it does not cache it, so it can be called repeatedly on the same C++ function to generate variants. Again, for simple uses of Intel ArBB, explicit use of closures is not necessary, and you can just think of call as a straightforward function invocation.
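The record-once, replay-from-cache behavior of call described above can be modeled in ordinary C++ as memoization. This is a toy sketch only: the "recording" here just builds a string, whereas ArBB records a trace of operations and compiles it to machine code.

```cpp
#include <functional>
#include <map>
#include <string>

// Counts how many times the recording step actually runs.
int recordings = 0;

// First lookup for a given name runs the recorder and caches its result;
// later lookups return the cached artifact without invoking it again.
std::string get_compiled(const std::string& name,
                         const std::function<std::string()>& record)
{
    static std::map<std::string, std::string> cache;
    auto it = cache.find(name);
    if (it != cache.end())
        return it->second;          // cache hit: the recorder is not re-run
    ++recordings;
    return cache[name] = record();  // record, "compile," and cache
}
```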

    You can also write elemental functions over scalar Intel ArBB types:

    void

    kernel(f32& a, f32 b, f32 c, f32 d)
    {
        a += (b/c)*d;
    }

    You can invoke elemental functions from inside a call by using the map operation. A map operation replicates the function over every element of the input containers.

    void
    doit(dense<f32>& A, dense<f32> B, dense<f32> C, dense<f32> D)
    {
        map(kernel)(A, B, C, D);
    }

    call(doit)(A,B,C,D);

    It is also possible, from inside an elemental function, to access neighboring elements of the input. This makes it very easy to write stencil operations, such as convolutions. You can also pass in either an entire container or a single element to every argument of the map. Single-element arguments are replicated to match the size and shape of any containers used as arguments. For example, suppose we use:

    void
    doit(dense<f32>& A, f32 b, f32 c, dense<f32> D)
    {
        map(kernel)(A, b, c, D);
    }

    call(doit)(A,b,c,D);

    with the same kernel function, but with the types of b and c matching the corresponding function argument exactly; in this case, f32. There will still be as many parallel instances of the kernel as there are elements in the collections A and D, but every instance will get a copy of the same value of b and c. In summary, call arguments need to match exactly, but map functions are polymorphic: any argument can be either a single element or a collection.
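The broadcast behavior of single-element map arguments can again be modeled as a plain scalar loop. This is a sketch: there is one kernel instance per element of A and D, and every instance sees the same copy of b and c.

```cpp
#include <cstddef>
#include <vector>

// Scalar model of map(kernel)(A, b, c, D) with single-element b and c:
// the kernel body a += (b/c)*d runs once per element, with the scalars
// b and c replicated across all instances.
void map_kernel_scalar(std::vector<float>& A, float b, float c,
                       const std::vector<float>& D)
{
    for (std::size_t i = 0; i < A.size(); ++i)
        A[i] += (b / c) * D[i];
}
```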

    In addition to using these two basic patterns to express parallel operations, users of ArBB also have access to several collective operations that act on or take an entire container as input. These operations can shift the contents of containers around, take cumulative sums (prefix scans), perform sets of reads and writes (known as scatters and gathers), discard elements and pack the remainder into a contiguous sequence (known as pack; the inverse is unpack), or simply combine all elements into a single element. Combination of all the elements of a container into a single