Getting Started in HPC Development


8/7/2019 7512_Getting Started in HPC Development_final2

An Internet.com Developer eBook

Getting Started in HPC Development


Contents

2 Letter from the Editor
4 Utilizing a Multi-Core System with the Actor Model
12 Lots about Locks
22 Intel Parallel Studio XE and Intel Cluster Studio Tool Suites
33 Intel Array Building Blocks

This content was adapted from Internet.com's DevX website and Intel Parallel Universe Magazine. Contributors: James Leigh, Wooyoung Kim, Michael Voss, Michael McCool, Sanjay Goil and John McHugh.


Getting Started in HPC Development, an Internet.com Developer eBook. © 2010 Internet.com, a division of QuinStreet, Inc.


Many people, even in the IT industry, hear the term high-performance computing (HPC) and think of supercomputers that are used in scientific experiments or complex research applications. But as the amount of data continues to grow, and databases continue to expand, businesses in the private sector are going to need to harness some serious computing horsepower.

HPC is powered, in part, by powerful multicore processors that can speed up application performance. For software developers, this means learning to create applications with parallelism that can take advantage of these multicore processors. This also means changing the way that applications are developed.

There are a number of techniques, methods and technologies available that can help application developers pick up parallel programming and create applications that can run in an HPC environment. In this eBook from Internet.com and Intel we're going to look at some of these tools and methods to give developers some ideas about what's available.

In our first article, James Leigh is going to look at developing efficient multi-threaded applications without using synchronized blocks. The actor model (which is native to some programming languages such as Scala) is a pattern for concurrent computation that enables applications to take full advantage of multicore and multiprocessor computing. James likes the actor model because it abstracts the nitty-gritty of multiprocessor programming away from the developer. This reduces concurrency issues and improves the flexibility of the system. The actor model also has a low learning curve, so new developers can quickly see how actors are implemented and understand how they fit together.

In our next article, Wooyoung Kim and Michael Voss discuss why, in their opinion, locks remain the best choice for implementing synchronization and protecting critical sections of software code. Their article discusses some of their experiences with mutual exclusion locks in developing multithreaded concurrent applications, using the locks provided in Intel Threading Building Blocks as examples.

Then John McHugh and Sanjay Goil are going to introduce us to Intel Parallel Studio XE, a set of new software development tool suites for developers of applications that run on both Windows and Linux in C/C++ and Fortran who need advanced performance for multicore today and future scaling to manycore. The Intel Parallel Studio XE 2011 bundle contains the latest versions of Intel C/C++ and Fortran compilers,

Letter from the Editor
By Michael Pastore


Intel MKL and Intel IPP performance libraries, Intel PBB libraries (Intel TBB, Intel ArBB betas, and Intel Cilk Plus), Intel Inspector XE correctness analyzer, and Intel VTune Amplifier XE performance profiler.

Finally, Michael McCool is going to answer the following question: How can parallelism mechanisms in modern processor hardware, including vector SIMD instructions, be targeted in a portable, general way within existing programming languages?

    And his answer will be Intel Array Building Blocks, which

    he will explain in more detail.

We hope you enjoy this eBook, and remember you can always turn to Internet.com websites like DevX.com and Developer.com, as well as the Intel Software Network, for more information on the journey to developing for high-performance computing.


Download the code for this article from: http://assets.devx.com/devx/actor-model.zip.

A typical multi-threaded application in Java contains numerous synchronized methods and statements. It might also contain calls to the methods wait() and notify() that were introduced with Java 1.0, but these methods provide very primitive functionality and are easily misused. Java 5 introduced the java.util.concurrent package, which provides some higher-level abstractions away from wait() and notify(). However, it can still be a challenge to appropriately use the synchronized and volatile keywords. Even when used correctly, getting them used efficiently can require complicated orchestrations of locks.

The biggest criticism of Java's synchronization is performance. Synchronization blocks become overly encompassing too easily. Although a synchronization block on its own is far from slow, when overly encompassing, it becomes a contested synchronization block. Contested synchronized blocks, or other blocking operations, are slow and require the OS to put threads to sleep and use interrupts to activate them. This puts pressure on the scheduler, resulting in significant performance degradation.
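As a hedged sketch of this point (the class and method names here are invented for illustration, not from the article), the difference between a contested and an uncontested synchronized block often comes down to how much work is done while holding the lock:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical example: holding the lock during slow work makes every
// other thread queue up behind it; computing outside the lock and
// synchronizing only the shared update keeps the critical section small.
public class ContendedCounter {
    private final List<Integer> results = new ArrayList<>();

    // Overly encompassing: the expensive computation runs inside the lock.
    public synchronized void recordSlow(int n) {
        int value = compute(n);   // slow work done while holding the lock
        results.add(value);
    }

    // Narrower: compute first, then synchronize only the shared update.
    public void recordFast(int n) {
        int value = compute(n);   // slow work done without the lock
        synchronized (this) {
            results.add(value);
        }
    }

    private int compute(int n) {
        int acc = 0;
        for (int i = 0; i < 1000; i++) acc += (n * i) % 7;  // stand-in for real work
        return acc;
    }

    public synchronized int size() { return results.size(); }

    public static void main(String[] args) throws InterruptedException {
        ContendedCounter c = new ContendedCounter();
        Thread[] ts = new Thread[4];
        for (int i = 0; i < ts.length; i++) {
            final int n = i;
            ts[i] = new Thread(() -> { for (int j = 0; j < 100; j++) c.recordFast(n); });
            ts[i].start();
        }
        for (Thread t : ts) t.join();
        System.out.println(c.size());  // 4 threads x 100 calls = 400
    }
}
```

Both variants are correct; the second simply spends far less time inside the lock, which is exactly what keeps a block from becoming contested.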

    Actor Model

The actor model (native to some programming languages such as Scala) is a pattern for concurrent computation that enables applications to take full advantage of multi-core and multi-processor computing. The fundamental idea behind the actor model is that the application is broken up into actors that perform particular roles. Every method call (or message) to an actor is executed in a unique thread, so you avoid all of the contested locking issues typically found in concurrent applications.

This allows for more efficient concurrent processing while keeping the complexity of actor implementations low, as there is no need to consider concurrent execution within each actor implementation.

The class in Listing 1 shows what an actor class might look like. This class takes a string of words and saves them to an XML file, and includes a calculated code for every character stored. The code might be used later as an index or to find similar text blocks. Notice that this class is not thread safe and you can only use each instance from a single thread. This is normal, because each actor is used from only one thread. It is common not to have any synchronized or volatile keywords present in an actor class, because they are not needed.

Long-lived, normally synchronized objects used by different threads are better off with a dedicated thread free from any synchronization issues. Each method call is placed in the queue (the order within the queue is not important) waiting until the actor is available to process the call. Think of this queue like your email in-box: messages are received at any time and are acted on when time permits. Typically, calls are asynchronous and do

Utilizing a Multi-Core System with the Actor Model
By James Leigh


not block, so the calling thread continues execution and avoids any need to rely on thread interrupts. When callers need a result, you can pass a callback object as part of the parameters to allow the actor to notify the caller. In some cases, it is desirable to block the caller until the actor processes the message.
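The callback pattern described above can be sketched roughly as follows. The Callback and EchoActor names are hypothetical (not from the article's download code), and a single-threaded executor stands in for the actor's dedicated thread:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// A caller passes a callback so the actor can deliver its result
// asynchronously instead of blocking the calling thread.
public class CallbackDemo {
    interface Callback { void done(String result); }

    // The single-threaded executor plays the role of the actor's mailbox.
    static class EchoActor {
        private final ExecutorService mailbox = Executors.newSingleThreadExecutor();
        void process(String msg, Callback cb) {
            mailbox.submit(() -> cb.done("processed: " + msg));  // asynchronous, non-blocking
        }
        void shutdown() { mailbox.shutdown(); }
    }

    static String runOnce(String msg) throws InterruptedException {
        EchoActor actor = new EchoActor();
        final String[] out = new String[1];
        CountDownLatch latch = new CountDownLatch(1);
        actor.process(msg, result -> { out[0] = result; latch.countDown(); });
        latch.await();       // block here only because this demo needs the answer
        actor.shutdown();
        return out[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runOnce("hello"));  // prints "processed: hello"
    }
}
```

The latch in runOnce() illustrates the "block the caller until the actor processes the message" case; ordinary callers would simply return after process() and let the callback fire later.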

You can separate the storage actor in Listing 1 into a second actor as shown in Listing 2. In this way, the storage actor calls an instance of HexCoderActor with itself as the callback. The storage actor does not wait for the HexCoder to generate the hex code, but instead continues with other items in its queue. This allows the storage actor's thread to specialize in writing the resulting XML file, while the text code is calculated asynchronously in another thread. Notice how these classes can take advantage of concurrent threads without any special keywords or deep knowledge of concurrent programming.

Every actor needs a manager to allocate and manage its thread. Each actor also needs a proxy to send messages to its queue. Implementing a basic actor manager is straightforward. Listing 3 shows such a manager written in Java 5. It uses Java's Proxy object to dynamically wrap an actor, implementing all of the actor's interfaces. Every method call on the proxy is then queued in an ExecutorService; void methods are asynchronous, and other method calls block until the executor has finished executing and the result is available.
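As a rough, condensed sketch of the dynamic-proxy idea (the Greeter interface and manage() helper here are invented for illustration; the article's full version is in Listing 3):

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Condensed version of the Listing 3 idea: wrap an object in a dynamic
// proxy that routes every call through a single-threaded executor.
public class ProxyActorDemo {
    interface Greeter { String greet(String name); }

    static class GreeterImpl implements Greeter {
        public String greet(String name) { return "hello, " + name; }
    }

    @SuppressWarnings("unchecked")
    static <T> T manage(final T actor, Class<T> iface) {
        // Daemon thread so the JVM can exit without an explicit shutdown.
        final ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r);
            t.setDaemon(true);
            return t;
        });
        InvocationHandler handler = (proxy, method, args) -> {
            Future<Object> result = executor.submit(() -> method.invoke(actor, args));
            if (Void.TYPE.equals(method.getReturnType()))
                return null;          // void calls are fire-and-forget
            return result.get();      // non-void calls block for the result
        };
        return (T) Proxy.newProxyInstance(iface.getClassLoader(),
                new Class<?>[] { iface }, handler);
    }

    public static void main(String[] args) {
        Greeter g = manage(new GreeterImpl(), Greeter.class);
        System.out.println(g.greet("world"));  // prints "hello, world"
    }
}
```

Because greet() returns a value, the call above blocks until the actor's thread produces the result, exactly as described for non-void methods.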

    Exception Handling and Worker Services

In every program, it is important to test and have proper exception handling. This becomes even more important with multi-threaded programming, because asynchronous execution quickly becomes difficult to debug. Because execution is not done sequentially, a sequential debugger is less useful. Similarly, stack traces are shorter and do not give caller details. In these situations, it is best to either have the actor handle exceptions itself or enable callbacks to handle both successful results and exceptions.

You should also consider that calls to an actor do carry some overhead when compared to sequential calls. You need to queue messages passed to a separate thread, and you cannot optimize with compilers in the same manner as sequential calls. This makes the actor model less applicable to smaller, faster objects that are better implemented as immutable or stateful. However, there are also advantages to running actors in a dedicated thread. By avoiding synchronized and volatile keywords, the on-board chip memory does not need to sync up with the main memory as often, since the actor's thread is the only thread that can access its variables. Modern compilers can also observe that the head-lock of the queue is only used from its actor thread and optimize it away, making it possible for actors to run without any interruption or mandatory memory flushing. Therefore, use actors for specialized worker services.

An example of worker services is an importing and indexing service. Consider the task of retrieving remote data, processing it locally, and storing it into a local database. You might break this up into three steps:

1. Retrieve data.
2. Process data.
3. Store result.

In this example, the remote data is not retrieved by a single connection, but rather in multiple files that are listed in index files, mixed in with the data files. The remote data is in a format that you cannot process directly, and you need to pre-process or format it first. Furthermore, you need to convert the data because it uses a different vocabulary. This creates six steps:

1. Retrieve index or data file.
2. Format the file for parsing.
3. Convert data.
4. If index, then list data files and go to step 1.
5. Process data files.
6. Insert data.

These six steps fit well into the actor model. Think of each of these steps as a job that one or more individuals (actors) need to perform.
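A minimal sketch of this steps-as-actors idea, with invented names (not the article's download code): each step owns a single-threaded mailbox and hands its output to the next step, so a strip of steps becomes a pipeline of actors.

```java
import java.util.concurrent.*;
import java.util.function.UnaryOperator;

// Hypothetical two-step pipeline: each step is an actor with its own
// single-threaded mailbox; a step forwards its output to the next step.
public class PipelineDemo {
    static class StepActor {
        private final ExecutorService mailbox = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r);
            t.setDaemon(true);   // let the JVM exit without explicit shutdown
            return t;
        });
        private final UnaryOperator<String> work;
        private final StepActor next;                 // null for the last step
        private final BlockingQueue<String> sink;     // collects final results

        StepActor(UnaryOperator<String> work, StepActor next, BlockingQueue<String> sink) {
            this.work = work; this.next = next; this.sink = sink;
        }

        void send(String msg) {
            mailbox.submit(() -> {
                String out = work.apply(msg);
                if (next != null) next.send(out); else sink.add(out);
            });
        }
    }

    static String run(String input) throws InterruptedException {
        BlockingQueue<String> results = new LinkedBlockingQueue<>();
        StepActor insert = new StepActor(s -> s + " -> inserted", null, results);
        StepActor format = new StepActor(s -> s + " -> formatted", insert, null);
        format.send(input);
        return results.take();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run("file"));  // prints "file -> formatted -> inserted"
    }
}
```

Adding the remaining steps (retrieve, convert, process) would just mean inserting more StepActor stages, each running on its own thread.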


Included in this article is an implementation of the above actor model for retrieving remote recipes from multiple sites in multiple formats. Each recipe is listed in one or more index files on the web, and the recipe is in HTML.

Actor                   Trait               Role
RoundRobin              UrlConsumer         Distributes URLs to other actors
UrlResolver             UrlConsumer         Retrieves data streams for another actor
XhtmlTransformer        StreamTransformer   Formats HTML into XHTML for parsing
StyleSheetTransformer   StreamTransformer   Converts remote XML format into local data format
RdfParser               StreamTransformer   Parses data stream into data structure
SeeAlsoExtractor        RdfConsumer         Extracts URLs from index data
IngredientProcessor     RdfConsumer         Applies local processing rules on data
RDFInserter             RdfConsumer         Inserts data into a database

Listing 4 shows how these actors are connected to one another. The manage() methods are typed versions of the ActorManager#manage(Object) in Listing 3.

A ClusterMap and Main class are also provided in the download archive. To run the example, execute the Main class with the following two arguments: http://www.kraftcanada.com/en/search/SearchResults.aspx?gcatid=86 and http://www.cookingnook.com/free-online-recipes.html

The program retrieves these silos of information, harvests meaningful data, indexes it, and makes it available in a graphical user interface.

With the stage set, let's introduce the actors:

The Main class then opens the ClusterMap and begins harvesting the recipes. After a few recipes are harvested, select the check-box on the left to see the number of recipes that are harvested and click the clear button at the top to update the list of words extracted from the ingredients section. In this way, you can index and search multiple distinct recipe sites. For example, to find recipes that include lemon, cheddar, and garlic (yum), click on these ingredients and the Tortilla Soup recipe is revealed to include all three ingredients from the recipes harvested (see Figure 1).


Figure 1. ClusterMap: The tortilla soup recipe is revealed after clicking certain ingredients.

In a multi-core system, the program uses over 30 threads to orchestrate the retrieval and processing of the data, downloading and processing as quickly as the remote host provides the data. In spite of the multi-threaded performance, there is no need to consider typical multi-threaded challenges, freeing the developer from worrying about anything beyond what each actor should do.

The actor model is a powerful metaphor to assist in creating multi-threaded applications, and by assigning remote addresses and enabling remote communication between actors, you can extend the model to assist in distributed challenges as well. By including life-cycle and dependency management and making actors aware of their environment, they can become agents, participating in a self-organizing system. This architecture has worked well for many distributed problems such as on-line trading, disaster response, and modelling social structure. It has also been the source of inspiration for many service-oriented architectures.

In essence, the actor model abstracts the nitty-gritty of multi-processor programming away from the developer. This reduces concurrency issues and improves the flexibility of the system. This simple model has a low learning curve, so new developers can quickly see how actors are implemented and understand how they fit together. By managing the actors properly, you can leverage the same implementations from multi-processor systems onto distributed networked systems in a gradual manner that can scale with the development demands.

Listing 1. StorageActor: This listing shows what an actor class might look like.

public class StorageActor implements Storage {

    private Writer out;
    private Set recorded = new HashSet();
    private Set sorted = new TreeSet();
    private StringEncoder encoder = new Soundex();

    public void init() throws IOException {
        out = new FileWriter("text.xml");
        out.write("\n");
    }

    public void close() throws IOException {
        out.write("\n");
        out.close();
    }

    public void store(String text) throws Exception {
        if (recorded.add(text)) {


            String code = code(text);
            store(code, text);
        }
    }

    public void store(String code, String text) throws IOException {
        out.write();
        out.write(text);
        out.write("\n");
    }

    private String code(String text) throws EncoderException {
        for (String word : text.split("[^a-zA-Z]*")) {
            if (word.length() > 2) {
                String encoded = encoder.encode(word);
                sorted.add(encoded);
            }
        }
        int hash = 0;
        for (String encoded : sorted) {
            hash = hash * 31 + encoded.hashCode();
        }
        sorted.clear();
        return Integer.toHexString(hash);
    }
}

Listing 2. HexCoderActor: You can separate the storage actor from Listing 1 into a second actor.

public class HexCoderActor implements HexCoder {

    private Set sorted = new TreeSet();
    private StringEncoder encoder = new Soundex();

    public void code(String text, Storage callback) throws Exception {
        String code = code(text);
        callback.store(code, text);
    }

    private String code(String text) throws EncoderException {
        for (String word : text.split("[^a-zA-Z]*")) {
            if (word.length() > 2) {
                String encoded = encoder.encode(word);
                sorted.add(encoded);
            }
        }
        int hash = 0;


        for (String encoded : sorted) {
            hash = hash * 31 + encoded.hashCode();
        }
        sorted.clear();
        return Integer.toHexString(hash);
    }
}

public class StorageActor implements Storage {

    private Writer out;
    private Set recorded = new HashSet();
    private HexCoder coder;

    public StorageActor(HexCoder coder) {
        this.coder = coder;
    }

    public void init() throws IOException {
        out = new FileWriter("text.xml");
        out.write("\n");
    }

    public void close() throws IOException {
        out.write("\n");
        out.close();
    }

    public void store(String text) throws Exception {
        if (recorded.add(text)) {
            coder.code(text, this);
        }
    }

    public void store(String code, String text) throws IOException {
        out.write();
        out.write(text);
        out.write("\n");
    }
}

Listing 3. ActorManager: The ActorManager as written in Java 5.

public class ActorManager {

    private final Map executors = new ConcurrentHashMap();

    public Object manage(Object actor) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        executors.put(executor, executor);


        Class ac = actor.getClass();
        ClassLoader cl = ac.getClassLoader();
        Class[] interfaces = ac.getInterfaces();
        ActorHandler handler = new ActorHandler(actor, executor);
        return Proxy.newProxyInstance(cl, interfaces, handler);
    }

    private class ActorHandler implements InvocationHandler {

        private Object actor;
        private ExecutorService executor;

        public ActorHandler(Object actor, ExecutorService executor) {
            this.actor = actor;
            this.executor = executor;
        }

        public Object invoke(final Object proxy, final Method method,
                final Object[] args) throws Throwable {
            Class type = method.getReturnType();
            Future result = executor.submit(new Callable() {
                public Object call() throws Exception {
                    Object result = method.invoke(actor, args);
                    if (result == actor)
                        return proxy;
                    return result;
                }
            });
            if (Void.TYPE.equals(type))
                return null;
            return result.get();
        }
    }
}

Listing 4. ActorFactory: This listing shows how the actors are connected.

public void init() throws Exception {
    ClassLoader cl = Thread.currentThread().getContextClassLoader();
    URL xsl = cl.getResource(RECIPES_XSL);
    UrlConsumer[] consumers = new UrlConsumer[1 + PROCESSORS * 2];
    SeeAlsoConsumer seeAlso = _.seeAlso();
    // only one thread/actor can insert at a time
    RDFConsumer insert = _.insert(store);


    for (int i=0;i


Remember the red telephone box, once a familiar sight on the streets of London? That's a good example of mutually exclusive access to a shared resource, although you probably didn't find any locks on them. Why? Because only one person at a time could use one to make a call, and civil persons would not listen to a stranger's conversation while waiting outside.

Unfortunately, there are no guarantees that programs will be equally civil, so wise programmers use semaphores to keep processes from running amok and leaving shared resources, such as files and I/O devices, in inconsistent states.

Mutual exclusion locks (also called mutex locks, or simply locks or mutexes) are a special kind of semaphore. Each protects a single shared resource, qualifying it as a binary semaphore. Concurrent programs use locks to guarantee consistent communication among threads through shared variables or data structures. A piece of program code protected by a mutex lock is called a critical section.

Mutex locks are often implemented using an indivisible test-and-set instruction in today's prevalent multi-core systems. Although generally deemed efficient, relying on an indivisible test-and-set instruction incurs a few hidden performance penalties. First, execution of such an

Lots about Locks
By Wooyoung Kim and Michael Voss

instruction requires memory access, so it interferes with other cores' progress, especially when the instruction is in a tight loop. The effect may be felt even more acutely on systems with a shared memory bus. Another penalty stems from cache coherency. Because the cache line containing a lock object is shared among cores, one thread's update to the lock invalidates the copies on the other cores. Each subsequent test of the lock on other cores triggers fetching the cache line. A related penalty is false sharing, where an unrelated write to another part of the cache line invalidates the whole cache line. Even if the lock remains unchanged, the cache line must be fetched to test the lock on a different core.

Given all these problems, one might wonder: Why use locks at all? What are the alternatives? One extreme alternative is to give up on communicating through shared variables and adopt the mantra of no sharing. That involves replicating data and communicating via message passing. Unfortunately, the cost of replication and message passing is even greater than the overhead associated with locks on today's multi-core shared-memory architectures.

Another approach that has been actively pursued recently as an alternative to mutex locks is lock-free/non-blocking algorithms. Researchers have reported some isolated successes in designing practical non-blocking


implementations. Nonetheless, non-blocking algorithms are hardly a holy grail. Designing efficient non-blocking data structures remains difficult, and the promised performance gain has been elusive at best. You'll see more about non-blocking algorithms at the end of this article.

With no proven better alternatives at present, it makes sense to make the most of mutex locks until they are rendered no longer necessary. This article discusses some experiences with mutex locks in developing multi-threaded concurrent applications, using the mutex locks provided in Intel Threading Building Blocks as examples.

Making the Most of Mutex Locks

Mutexes are often vilified as major performance snags in multi-threaded, concurrent application development; however, our experience suggests that mutex locks are the least evil among the synchronization methods available today. Even though the nominal overhead appears large, you can harness them to your advantage if you use them in well-disciplined ways. Throughout this article, you'll see some of the lessons learned, stated as guidelines, the first two of which are:

Guideline 1: Being Frugal Always Pays Off

Minimize explicit uses of locks. Instead, use concurrent containers and concurrent algorithms provided by efficient, thread-safe concurrency libraries. If you still find places in your application that you think benefit from explicit use of locks, then:

Guideline 2: Make Critical Sections as Small as Possible

When a thread arrives at a critical section and finds that another thread is already in it, it must wait. Keep the critical section small, and you will get small waiting times for threads and better overall performance. Examine when shared data in a critical section is made private and see if you can safely take some of the accesses to the data out of the critical section.

For example, the code in Listing 1 implements a concurrent stack. It defines two methods, push() and pop(), each protected using a TBB mutex lock (smtx) that's acquired in the constructor and released in the destructor. The examples in Listing 1 rely on the C++ scoping rules to delimit the critical sections.

A cursory look at pop() shows that:

1. If the stack is empty, pop() returns false.
2. If the stack is not empty, the code acquires the mutex lock and then re-examines the stack.
3. If the stack has become empty since the previous test, pop() returns false.
4. Otherwise, the code updates the top variable, and copies the old top element.
5. Finally, pop() releases the lock, reclaims the popped node, and returns true.

Here's a closer look at the critical section. Copying type T may take a lot of time, depending on T. Because of the lock, you know that, once updated, the old top value cannot be viewed by other threads; it becomes private and local to the thread inside the critical section. Therefore, you can safely yank the copy statement out of the critical section (following Guideline 2) as follows.

bool pop( T& _e ) {
    node* tp = NULL;
    if( !top ) goto done;
    {
        tbb::spin_mutex::scoped_lock lock( smtx );
        if( !top ) goto done;


        tp = top;
        top = top->nxt;
        // move the next line...
        // _e = tp->elt;
    }
    // ...to here
    _e = tp->elt;
    delete tp;
done:
    return tp != NULL;
}

As another example, consider implementing a tiny memory management routine. A thread allocates objects from its private blocks and returns objects to their parent block. It is possible for a thread to free objects allocated by another thread. Such objects are added to their parent block's public free list. In addition, a block with a non-empty public free list is added to a list (i.e., the public block list) formed with block_t::next_to_internalize and accessed through block_bin_t::mailbox, if not already in it.

The owner thread privatizes objects in a block's public free list, as needed. Function internalize_next() implements this functionality and is invoked when a thread runs out of private blocks with free objects to allocate. It takes a block bin private to the caller thread as its argument and pops the front block from the list bin->mailbox, if not empty. Then, it internalizes objects in the block's public free list:

block_t* internalize_next ( block_bin_t* bin )
{
    block_t* block;
    {
        tbb::spin_mutex::scoped_lock scoped_cs(bin->mailbox_lock);
        block = bin->mailbox;
        if( block ) {
            bin->mailbox = block->next_to_internalize;
            block->next_to_internalize = NULL;
        }
    }
    if( block )
        internalize_returned_objects( block );
    return block;
}

The function's critical section protects access to bin->mailbox with bin->mailbox_lock. Inside the critical section, if bin->mailbox is not empty, it pops the front block to block and resets the block's next_to_internalize.

Note that block is a local variable. By the time bin->mailbox is updated, block (which points to the old front block) becomes invisible to other threads, and access to its next_to_internalize field becomes race-free. Thus, you can safely move the reset statement outside the critical section:

  • 8/7/2019 7512_Getting Started in HPC Development_final2

    16/39

    15 Getting Started in HPC Development an Internet.com Developer eBook. 2010, Internet.com, a division of QuinStreet, IncBack to Contents

    Getting Started in HPC Development

block_t* internalize_next ( block_bin_t* bin )
{
    block_t* block;
    {
        tbb::spin_mutex::scoped_lock scoped_cs(bin->mailbox_lock);
        block = bin->mailbox;
        if( block ) {
            bin->mailbox = block->next_to_internalize;
            // move the next statement...
            // block->next_to_internalize = NULL;
        }
    }
    if( block ) {
        // ...to here
        block->next_to_internalize = NULL;
        internalize_returned_objects( block );
    }
    return block;
}

    Guideline 3: Synchronize as Infrequently as Possible

The idea behind this guideline is that you can amortize the cost of a lock operation over a number of local operations. Doing so reduces the overall execution time, because executing atomic instructions tends to consume an order of magnitude more cycles.

Again, suppose you're designing a memory allocator that allocates objects out of a block. To reduce the number of trips to the operating system to get more memory blocks, the allocator uses a function called allocate_blocks() to get a big strip from the operating system, partition it into a number of blocks, and then put them in the global free block list shared among threads. The free block list free_list is implemented as a concurrent stack (see Listing 2).

Note that the code to push a newly carved-out block into free_list is inside a while loop. Also, note that stack2::push() protects concurrent accesses to stack2::top through a mutex lock. That means allocate_blocks() acquires the lock free_list.mtx N times for a strip containing N blocks.

You can reduce that frequency to one per strip by adding a few thread-local instructions. The idea is to build a thread-local list of blocks in the while loop first (using two pointer variables head and tail) and then push the entire list into free_list with a single lock acquisition (see Listing 3). Finally, so that allocate_blocks() can access free_list's private fields, it's declared as a friend of stack2.

Guideline 4: Most of All, Know Your Application

The guideline that will help you most in practice is to analyze and understand your application using actual use scenarios and representative input sets. Then you can determine what kinds of locks are best used where. Performance analysis tools such as Intel Parallel Amplifier can help you identify where the hot spots are and fine-tune your application accordingly.

A Smorgasbord of Lock Flavors

Intel Threading Building Blocks offers a gamut of mutex locks with different traits, because critical sections with different access patterns call for mutex locks with different trade-offs. Other libraries may offer similar choices. You need to know your application to select the most appropriate lock flavor for each critical section.

Spin Mutex vs. Queuing Mutex

The most prominent distinguishing property for locks is fairness: whether a lock allows fair access to the critical section or not. This is an important consideration when choosing a lock, but its importance may vary depending on circumstances. For example, an operating system should guarantee that no process gets unfairly delayed when multiple processes contend against each other to get into a critical section. By contrast, unfairness among threads in a user process may be tolerable to some degree if it helps boost the throughput.

TBB's spin_mutex is an unfair lock. Threads entering a critical section with a spin_mutex repeatedly attempt to acquire the lock (they spin-wait until they get into the critical section, thus the name). In theory, the waiting time for a spin_mutex is unbounded. The TBB queuing_mutex, on the other hand, is a fair lock, because a thread arriving earlier at a critical section will get into it earlier than one arriving later. Waiting threads form a queue. A newly arriving thread puts itself at the end of the queue using a non-blocking atomic operation and spin-waits until its flag is raised. A thread leaving the critical section hands the lock over to the next in line by raising the latter's flag.

Unfortunately, there are no cast-in-stone guidelines or criteria that dictate when to use an unfair spin_mutex and when to use a fair queuing_mutex. In general, though, guaranteeing fairness costs more. When a critical section is brief and contention is light, the chance of a thread being starved is slim, and any additional overhead for unneeded fairness may not be warranted. In those cases, use a spin_mutex.

The TBB queuing_mutex spin-waits on a local cache line and does not interfere with other threads' memory access. Consider using a queuing mutex for modestly sized critical sections and/or when you expect a fairly high degree of contention.

One report claims that, using a test program with spin locks, a difference of up to 2x runtime per thread was observed, and some threads were unfairly granted the lock up to 1 million times on an 8-core Opteron machine. If you suspect your application suffers from unfairness due to a spin_mutex, switching to a fair mutex such as queuing_mutex is your answer. But before switching, back up your decision with concrete measurement data.

Reader-Writer Locks

Not all concurrent accesses need to be mutually exclusive. Indeed, accesses to many concurrent data structures are mostly read-accesses, and only occasionally need write-accesses. For these structures, keeping one reader spin-waiting while another reader is in the critical section is not necessary.

TBB reader/writer mutexes allow multiple readers to be in a critical section while giving writers exclusive access to it. The unfair version is called spin_rw_mutex, while the fair version is queuing_rw_mutex. These mutexes also allow readers to upgrade to writers and writers to downgrade to readers.

Under some circumstances, you can replace reader-side locks with less expensive operations (although potentially at the expense of writers). One such example is a sequential lock; another is a read-copy-update lock. These locks are less-general reader-writer locks, so using them properly in applications requires more stringent scrutiny.

Mutex and Recursive_Mutex

TBB provides a mutex that wraps around the underlying OS locks but, compared to the native version, adds portability across all supported operating systems. In addition, the TBB mutex releases the lock even when an exception is thrown from the critical section.

A sibling, recursive_mutex, permits a thread to acquire multiple locks on the same mutex. The thread must release all locks on a recursive_mutex before any other thread can acquire a lock on it.

  • 8/7/2019 7512_Getting Started in HPC Development_final2

    18/39

    17 Getting Started in HPC Development an Internet.com Developer eBook. 2010, Internet.com, a division of QuinStreet, IncBack to Contents

    Getting Started in HPC Development

Avoiding Lock Pitfalls

There is no shortage of references that warn about the inevitable dangers of using locks, such as deadlocks and livelocks. However, you can reduce the chances of getting ensnared by these problems considerably by instituting a few simple rules.

Avoid explicit use of locks. Instead, use concurrent containers and concurrent algorithms provided in well-supported concurrency libraries such as Intel Threading Building Blocks. If you think your application requires explicit use of locks, avoid implementing your own locks and use well-tested, well-tuned locks such as the TBB locks.

Avoid making calls to functions (particularly unknown ones) while holding a lock. In general, calling a function while holding a lock is not good practice. For one thing, it increases the size of the critical section, thus increasing the wait-times of other threads. More seriously, you may not know whether the function contains lock acquisition code. Even if it does not now, it may in the future. Such changes potentially lead to a deadlock situation, and when that happens, it's very difficult to locate and fix. If possible, refactor the critical section so that it computes the function arguments in the critical section but invokes the function outside the critical section.

Avoid holding multiple locks. Circular lock acquisition is a leading cause of deadlock problems. If you must hold multiple locks, always acquire the locks in the same order and then release them in the same order that they were acquired.

Avoid using recursive locks. You may be able to find some isolated cases where recursive locks make great sense. However, locks don't compose well. Even a completely unrelated change to a part of your application may lead to a deadlock, and the problem will be very difficult to locate.

Even if you do everything you possibly can to avoid deadlocks and livelocks, problems may still occur. If you suspect your application has a deadlock or race condition, and you cannot locate it quickly, don't get burned by trying to resolve it by yourself. Use tools such as Intel Parallel Inspector.

Lock-Free and Non-Blocking Algorithms

As promised earlier, one strategy advocated by some researchers that avoids locks and their associated problems is to use non-blocking synchronization methods, such as lock-free/wait-free programming techniques and software transactional memory. These techniques aim to provide wait-freedom, thereby addressing issues stemming from the blocking nature of locks without compromising performance.

Unfortunately, our experience with non-blocking algorithms has been (so far) disappointing, and many other developers and researchers agree. Almost all non-blocking algorithms invariably use one or more hardware-supported atomic operations, such as compare-and-swap (CAS) and load-link/store-conditional (LL/SC). Some even use double-word CAS (DCAS).

Dependence on these atomic primitives makes them difficult to write (see Doherty, Simon, et al, "DCAS is not a silver bullet for nonblocking algorithm design," Proceedings of the 16th Annual ACM Symposium on Parallelism in Algorithms and Architectures, 2004, and Herb Sutter's article "Lock-Free Code: A False Sense of Security"), difficult to validate for correctness (see Gotsman, Alexey, et al, "Proving That Non-Blocking Algorithms Don't Block," Symposium on Principles of Programming Languages, to appear in 2009), and difficult to port to other platforms. This is probably one reason why non-blocking algorithms have been limited to simple data structures. Furthermore, improved performance over lock-based implementations seems hard to get.

Arguments for the other benefits are not compelling enough to warrant the pain of switching to non-blocking algorithms. Fairness is contingent upon the underlying atomic operations; in some cases, livelock is still possible. For many user applications, benefits such as real-time support and fault tolerance are a good-to-have, not

  • 8/7/2019 7512_Getting Started in HPC Development_final2

    19/39

    18 Getting Started in HPC Development an Internet.com Developer eBook. 2010, Internet.com, a division of QuinStreet, IncBack to Contents

    Getting Started in HPC Development

a must-have. In other cases, solutions provided by operating systems are sufficient (e.g., priority inheritance for priority inversion).

Software Transactional Memory (STM) is another alternative to lock-based synchronization. It abstracts away the use of low-level atomic primitives using the notion of transactions, and simplifies synchronizing access to shared variables through optimistic execution and a roll-back mechanism. Like non-blocking algorithms, STM promises performance gains over lock-based synchronization, and also promises to avoid many common locking pitfalls. The results so far are not so favorable. One publication observes that the overall performance of TM is significantly worse at low levels of parallelism (see Cascaval, Calin, et al, "Software Transactional Memory: Why is it only a research toy?" ACM Queue, 2008, Vol. 6, 5). However, STM is a relatively young research area, so the jury is still out.

Lock It Up

Locks have been unfairly vilified as a hindrance to the development of efficient concurrent applications on burgeoning multi-core platforms. However, our experiences suggest that rather than discouraging the use of mutex locks, one should instead promote their well-disciplined use. More often than not, implementations with such locks outperform those with non-blocking algorithms or STM.

The most important consideration for making the best use of mutex locks is understanding the application well, using tools to aid that understanding where necessary, and selecting the best-fitting synchronization method for each critical section. When you do choose a mutex, use it with the recommended guidelines, but keep flexibility in mind. Doing so will prevent most common mutex-related pitfalls without incurring unwarranted performance penalties. Finally, shun the do-it-yourself temptation, and delegate work to well-supported concurrency libraries.

    Listing 1. Concurrent Stack Implementation:

    The push and pop methods are protected by a mutex lock acquired in the constructor and released in the destructor.

/* unintrusive concurrent stack */
#include "tbb/spin_mutex.h"

template <typename T>
class concurrent_stack
{
    class node {
        friend class concurrent_stack;
        node* nxt;
        T elt;
    public:
        node( T& _e ) : nxt(NULL), elt(_e) {}
    };
public:
    concurrent_stack() : top(0) { }
    void push( T& _e ) {
        node* n = new node( _e );
        tbb::spin_mutex::scoped_lock lock( smtx );
        n->nxt = top;
        top = n;
    }
    bool pop( T& _e ) {
        node* tp = NULL;
        if( !top ) goto done;
        {
            tbb::spin_mutex::scoped_lock lock( smtx );
            if( !top ) goto done;
            tp = top;
            top = top->nxt;
            _e = tp->elt;
        }
        delete tp;
    done:
        return tp!=NULL;
    }
private:
    node* top;
    tbb::spin_mutex smtx;
};

Listing 2. Memory Allocator:
This memory allocator reduces trips to get memory by getting a big strip of memory from the operating system, partitioning it into a number of blocks, and then putting them in the global free block list.

class stack2 {
public:
    inline stack2() : top(NULL) {}
    inline void push ( void** ptr ) {
        tbb::spin_mutex::scoped_lock lock(mtx);
        *ptr = top;
        top = ptr;
    }
    inline void* pop ( void ) {
        if( !top ) return NULL;
        void **result;
        {
            tbb::spin_mutex::scoped_lock lock(mtx);
            if ( !(result=(void**) top) ) goto done;
            top = *result;
        }
        *result = NULL;
    done:
        return result;
    }
private:
    tbb::spin_mutex mtx;
    void* top;
};

stack2 free_list;

int allocate_blocks() {
    uintptr_t raw_strip = get_strip();
    if( !raw_strip )
        return 0;
    uintptr_t aligned_strip = align_strip( raw_strip );
    uintptr_t endp = (uintptr_t) raw_strip+strip_size;
    uintptr_t b = aligned_strip;
    while ( b+block_size<=endp ) {
        uintptr_t block_endp = b+block_size;
        ((block_t*)b)->bump_ptr = (void*) block_endp;
        // note this line
        free_list.push( (void**)b );
        b = block_endp;
    }
    return 1;
}

Listing 3. Build a Block List:
The while loop builds a list of blocks, and then pushes the entire list into free_list using only a single lock acquisition.

class stack2 {
    friend int allocate_blocks();
public:
    inline stack2() : top(NULL) {}
    inline void push ( void** ptr ) {
        tbb::spin_mutex::scoped_lock lock(mtx);
        *ptr = top;
        top = ptr;
    }
    inline void* pop ( void ) {
        if( !top ) return NULL;
        void **result;
        {
            tbb::spin_mutex::scoped_lock lock(mtx);
            if ( !(result=(void**) top) ) goto done;
            top = *result;
        }
        *result = NULL;
    done:
        return result;
    }
private:
    tbb::spin_mutex mtx;
    void* top;
};

stack2 free_list;

int allocate_blocks() {
    uintptr_t raw_strip = get_strip();
    if( !raw_strip )
        return 0;
    uintptr_t aligned_strip = align_strip( raw_strip );
    uintptr_t endp = (uintptr_t) raw_strip+strip_size;
    uintptr_t b = aligned_strip;
    uintptr_t head = 0;
    uintptr_t tail = b;
    while ( b+block_size<=endp ) {
        uintptr_t block_endp = b+block_size;
        ((block_t*)b)->bump_ptr = (void*) block_endp;
        // free_list.push( (void**)b );  // per-block push replaced by the
        //                               // single push below
        * (uintptr_t*) b = head;
        head = b;
        b = block_endp;
    }
    {
        // Push the block list into free_list
        tbb::spin_mutex::scoped_lock lock(free_list.mtx);
        * (void**) tail = free_list.top;
        free_list.top = (void*) head;
    }
    return 1;
}

  • 8/7/2019 7512_Getting Started in HPC Development_final2

    23/39

    Getting Started in HPC Development an Internet.com Developer eBook. 2010, Internet.com, a division o QuinStreet, IncBack to Contents

    Getting Started in HPC Development

    This article originally appeared in Intel Parallel Universe

    Magazine. Used with permission.

Intel Parallel Studio XE and Intel Cluster Studio Tool Suites
By Sanjay Goil and John McHugh

In September, Intel introduced Intel Parallel Studio 2011, a tool suite for Microsoft Windows Visual Studio C++ developers, with the singular objective of providing the essential performance tools for application development on Intel architecture. These tools provide significant innovation, and enable unprecedented developer productivity when building, debugging and tuning parallel applications for multicore. With the introduction of Intel Parallel Building Blocks (Intel PBB), developers have methods to introduce and extend parallelism in C/C++ applications for higher performance and efficiency. Now Intel is extending the reach of the next-generation Intel tools to developers of applications on both Windows and Linux in C/C++ and Fortran who need advanced performance for multicore today and forward scaling to manycore. Intel Parallel Studio XE 2011 contains C/C++ and Fortran compilers; the Intel Math Kernel Library (Intel MKL) and Intel Integrated Performance Primitives (Intel IPP) performance libraries; the Intel PBB libraries, Intel Threading Building Blocks (Intel TBB), Intel Cilk Plus, and Intel Array Building Blocks (Intel ArBB); the Intel Inspector XE correctness analyzer; and the Intel VTune Amplifier XE performance profiler.

HPC programmers have traditionally been able to use all the compute power made available to them. Even with the performance leaps that Moore's law has allowed Intel architecture to deliver over the past decade, the hunger for additional performance continues to thrive. There are big unsolved problems in science and engineering, physical simulations at higher granularities, and problems where the economically viable compute power provides lower resolution or piecemeal simulation of smaller portions of the larger problem. This is what makes serving the HPC market so exciting for Intel, and it is a significant driver for innovation in both hardware and software methodologies for parallelism and performance.

Intel Cluster Studio introduces tools for HPC cluster development with MPI, including the scalable Intel MPI Library and the Intel Trace Analyzer and Collector performance profiler, with the industry-leading C/C++ and Fortran compilers, for a complete cluster development toolkit. This is combined with the ease of deployment offered by the Intel Cluster Ready program, making deployment of cluster applications highly efficient.

Introducing New Tool Suites

Software developers of high performance applications require a complete set of development tools. While traditionally these tools include compilers, debuggers, and performance and parallel libraries, more often the issues in development come in error correctness and performance profiling. The code doesn't run correctly, or exhibits error-prone behavior on some runs, pointing to data races, deadlocks, or performance bottlenecks in locks or synchronization, or exposes security risks at runtime. To this end, Intel's correctness analyzers and performance profilers are a great addition to the development environment for highly robust and secure code development.

For advanced and distributed performance, Intel is simplifying the procurement, deployment and use of HPC tools on multicore 32- and 64-node Intel architecture and HPC clusters programmed with the Message Passing Interface (MPI).

A software development project goes through several steps to get optimal performance on the target platform. Most often the developer gets a rudimentary performance profile of the application run to show hotspots. Once opportunities for optimization are identified, the coding aspects are handled by the compilers and performance and parallel libraries to add parallelism, presenting task-level, data-level and vectorization opportunities. Finally, the correctness tools make robust code possible by checking for threading and memory errors, and identifying security vulnerabilities. This cycle typically repeats itself to find higher application efficiencies.


Highlights of Intel Parallel Studio XE 2011

Available for Multiple Operating Systems: Intel Parallel Studio XE provides the same set of tools to aid development for both Windows and Linux platforms. C/C++ and Fortran compilers and performance and parallelism libraries bring advanced optimizations to the Mac*.

Robustness: Intel Inspector XE's memory and thread analyzer finds and pinpoints memory and threading errors before they happen.

Code Quality: Intel Parallel Studio XE enables developers to effectively find software security vulnerabilities through static security analysis.

Advanced Optimization: The compilers and libraries in Intel Composer XE offer advanced vectorization support, including support for Intel AVX. The C/C++ optimizing compiler now includes the Intel PBB library, expanding the types of problems that can be solved more easily in parallel with increased scalability and reliability. For Fortran developers, it now offers co-array Fortran and additional support for the Fortran 2008 standard.

Performance: The Intel VTune Amplifier XE performance profiler finds bottlenecks in serial and parallel code that limit performance. Improvements include a more intuitive interface, fast statistical call graph, and timeline view. The Intel MKL and Intel IPP performance libraries provide robust multicore performance for commonly used math and data processing routines. A simple linking of the application with these libraries is an easy first step toward multicore parallelism.

Compatibility and Support: Intel Parallel Studio XE excels at compatibility with leading development environments and compilers. Intel offers broad support with forums and Intel Premier Support, which provides fast answers and covers all software updates for one year.


Old Name -> New Name
Compiler Suite Professional Edition -> Composer XE
C++ Compiler Professional Edition -> C++ Composer XE
[Visual] Fortran Compiler Professional Edition -> [Visual] Fortran Composer XE
Visual Fortran Compiler Professional Edition with IMSL -> Visual Fortran Composer XE with IMSL
VTune Performance Analyzer (including Intel Thread Profiler) -> VTune Amplifier XE
Thread Checker -> Inspector XE
Cluster Toolkit Compiler Edition -> Cluster Studio

What's new in Intel Composer XE

Intel Composer XE contains next-generation C/C++ and Fortran compilers (12.0) and the performance and parallel libraries Intel MKL 10.3, Intel IPP 7.0 and Intel TBB 3.0.

The latest Intel C/C++ compiler, Intel C++ Compiler XE 12.0, is optimized for the latest Intel architecture processor (codenamed Sandy Bridge) with Intel AVX support. The product contains Intel PBB, which includes advances in mixing and matching task, vector, and data parallelism in applications to better map to multicore optimization opportunities: Intel Cilk Plus, Intel TBB and Intel ArBB (in beta, available separately). There are vector optimizations for Intel AVX with SIMD pragmas, in addition to array notation and a tool to help in auto-parallelization called GAP, for the highest performance and parallelism on the latest generation of x86 multicore CPUs. For Windows users, support for Visual Studio 2010 is included.

The tools introduced in Intel Parallel Studio XE 2011 are next-generation revisions of industry-leading tools for C/C++ and Fortran developers seeking cross-platform capabilities for the latest x86 processors on Windows and Linux platforms. Those familiar with Intel's industry-leading tools will see that the product names have transitioned in this new release, in all cases with significant additional capabilities. Other names remain the same.


The Intel Fortran Compiler XE 12.0 includes several advances: more complete support for the Fortran 2003 standard and some support for the Fortran 2008 standard, including Co-array Fortran, vector optimizations with AVX, and help with auto-parallelization for the highest performance and parallelism on the latest x86 multicore CPUs.

The performance libraries continue to provide an easy way to include highly optimized and automatically parallel math and scientific functions, and data processing routines for high performance users. The math library, Intel MKL 10.3, includes enhancements such as better Intel AVX support, a summary statistics library, and enhanced C language support for LAPACK. The data processing library, Intel IPP 7.0, includes improved data compression and codecs, and support for Intel AVX and AES instructions, continuing to address data-processing-intensive application domains.

Enhanced Developer Productivity with Correctness Analyzers and Performance Profilers

Intel Parallel Studio XE 2011 combines ease-of-use innovations, introduced in Intel Parallel Studio, with advanced functionality for high performance, scalability and code robustness for Linux and Windows. Intel has traditionally offered developer tools on both Windows and Linux, and strives to offer the same functionality across both platforms, which is especially important for developing applications to run on both operating systems.

[Figure: Introducing SIMD pragmas]

With the capabilities in the correctness analyzer, Intel Inspector XE, the product helps the C/C++ and Fortran developer with static and dynamic code analysis, through threading and memory analysis tools, to develop highly robust, secure and highly optimized applications.

New capabilities in this tool include:

Simplified configuration and run analysis
Finds coding defects quickly, such as:
o Memory leaks and memory corruption
o Threading data races and deadlocks
Supports native threads, understands any parallel model built on top of threads
Dynamic instrumentation works on standard builds and binaries
Timeline view to explore context of the respective threads
Intuitive standalone GUI and command line interface for Windows and Linux
Advanced command line reporting

Intel VTune Amplifier XE 2011 is the next generation of the Intel VTune Analyzer, which is a powerful tool to quickly find and provide greater insights into multicore performance bottlenecks. It takes away the guesswork and analyzes performance behavior in Windows* and Linux* applications, providing quick access to scalability bottlenecks for faster and improved decision making.

[Figure: Hotspots in the application; thread-based CPU usage]

The next-generation Intel VTune performance profiler has new features, including:

Easy predefined analyses
Fast hotspot analysis (hot functions and call stack)
Powerful filtering
Threading timeline
Frame analysis
Attach to a running process (Windows)
Event multiplexing
Simplified remote collection
Improved compare results
Tight Visual Studio integration
Non-root Linux install
o Only EBS driver install needs root

    Software security starts very early in the development phase, and Intel Parallel Studio XE 2011 makes it faster to identify, locate, and fix software issues prior to software deployment. This helps identify and prevent critical software security vulnerabilities early in the development cycle, where the cost of finding and fixing errors is the lowest.

    Intel's static security analysis (SSA), included in the Parallel Studio XE bundle, provides these unique advantages for robust code development:

    Easier, faster setup and ramp to get static analysis results

    Simple approach to configure and run static analysis

    Discover and fix defects at any phase of the development cycle

    Finds over 250 security errors, such as:

    o Buffer overruns and uninitialized variables

    o Unsafe library usage and arithmetic overflow

    o Unchecked input and heap corruption

    Tracks state associated with issues, even as source evolves and line numbers change

    Displays problem sets and location of source

    Provides filters, assignment of priority, and maintenance of problem set state

    Intuitive standalone GUI and command line interface for Windows and Linux

    Feature: Support for both Linux and Windows platforms
    Benefit: Development capability with the same set of tools on both Windows and Linux platforms; enhanced performance, productivity, and programmability

    Feature: C/C++ compilers with Intel Parallel Building Blocks
    Benefit: Breakthrough in providing choice of parallelism for applications (task, data, vector), with mix and match for optimizing application performance; C/C++ standards support

    Feature: Fortran compilers with Fortran 2008 standards support, including Co-Array Fortran (CAF)
    Benefit: Advances in the industry-leading Fortran compilers with new support for scalable parallelism on nodes and clusters (cluster support available separately with Intel Cluster Studio 2011); Fortran standards support

    Feature: Memory, threading, and security analysis tools in one package
    Benefit: Enhances developer productivity and efficiency by simplifying and speeding the process of detecting difficult-to-find coding errors

    Feature: Updated performance libraries
    Benefit: Multicore performance for common math and data processing tasks, with simple linking against these automatically parallel libraries

    Feature: Updated performance profiler
    Benefit: Several ease-of-use enhancements, deeper microarchitectural insights, enhanced GUI, and quicker, more robust performance

    Increase Performance and Scalability of HPC Cluster Computing

    Intel Cluster Studio 2011 sets a new standard in distributed parallelism on Intel architecture-based clusters. This premier tool suite provides development flexibility for enabling MPI-based application performance for highly parallel shared-memory and cluster systems based on 32- and 64-bit Intel architectures. The newly architected Intel MPI Library 4.0 is key to achieving these advantages by providing new levels of cluster scalability, improved interconnect support across many fabrics, faster on-node messaging, support for hybrid parallelization, and an application tuning capability that adjusts to the cluster and application structure. For the developer, the Intel Trace Analyzer and Collector 8.0 is enhanced with new features that accelerate the analysis and tuning cycle of MPI-based cluster applications. The suite is complemented with the latest Intel C/C++ and Fortran compiler technology along with Intel MKL 10.3, Intel IPP 7.0, and Intel PBB (also sold as Intel Composer XE) to further optimize and parallelize application execution on each computing node. Co-Array Fortran is supported on clusters in this package.

    Along with Intel Cluster Ready (ICR), a program to define cluster architectures for increasing uptime, increasing productivity, and reducing total cost of ownership (TCO) for IA-based HPC clusters, Intel Cluster Studio 2011 makes it easy to code, debug, and optimize to gain higher scalability for MPI-based cluster applications, up to petascale, and also is the premier

    suite for developing and tuning hybrid-parallel codes that can mix MPI with multithreading paradigms such as OpenMP or Intel PBB.

    Intel Cluster Studio 2011 provides an extensive software package containing Intel C/C++ compilers and Intel Fortran compilers for all Intel architectures, plus all the Intel Cluster Tools that help you develop, analyze, and optimize the performance of parallel applications on Linux or Windows. By combining all the compilers and tools into one license package, Intel can provide single installation, interoperability, and support for the best-in-class cluster software tools.

    Highlights of Intel Cluster Studio 2011

    Scalability and High Performance: The interconnect-tuned and multicore-optimized Intel MPI Library delivers application performance on thousands of 32- and 64-bit IA multicore processors.

    Built-in Optimization: Utilize optimizing compilers and libraries in Intel Composer XE to get the most out of advanced processor technologies. The C/C++ optimizing compiler now includes Intel PBB, which expands the types of problems that can be solved more easily in parallel, and with increased reliability. For Fortran developers, it now offers Co-Array Fortran (CAF) and additional support for the Fortran 2008 standard. Intel compilers also deliver advanced vectorization support with SIMD pragmas.

    Ease of MPI Tuning: Intel Trace Analyzer and Collector has been enhanced with new features that accelerate the analysis and tuning cycle of MPI-based cluster applications.

    Target Applications to Multiple Operating Systems: Leverage the same source code in Intel compilers and libraries, which bring advanced optimizations to Windows and Linux.

    Intel Cluster Ready Qualified: This program defines cluster architectures to increase uptime and productivity and reduce total cost of ownership (TCO) for IA-based HPC clusters.

    Compatibility and Support: Intel Cluster Studio offers excellent compatibility with leading development environments and compilers, while providing optimal support for multiple generations of Intel processors and compatibles. Intel offers broad support through its forums and Intel Premier Support, which provides fast answers and covers all software updates for one year.

    Feature: Analysis tools for MPI developers, including a load imbalance diagram and an ideal interconnect simulator
    Benefit: Enhanced developer productivity and efficiency by simplifying and speeding the detection of errors and offering performance profiling of MPI messages

    Feature: Scalable Intel MPI Library with multi-rail InfiniBand support and Application Tuner
    Benefit: Scale to tens of thousands of cores with one of the most scalable and robust commercial MPI libraries in the industry; ease of use with dynamic and configurable support across multiple cluster fabrics and multi-rail InfiniBand support

    Feature: C/C++ compilers with Intel Parallel Building Blocks
    Benefit: Breakthrough in providing choice of parallelism for applications (process, task, data, vector), with mix and match for optimizing application performance on clusters of SMP nodes; C/C++ standards support

    Feature: Fortran compilers with Fortran 2008 standards support, including Co-Array Fortran (CAF) on clusters
    Benefit: Advances in industry-leading Fortran compilers with new support for scalable parallelism on nodes and clusters; Fortran standards support

    Feature: Updated performance libraries, Intel MKL and Intel IPP
    Benefit: Multicore performance for common math and data processing tasks, with simple linking against these automatically parallel libraries

    Feature: Support for both Linux and Windows platforms
    Benefit: Development capability with the same set of tools on both Windows and Linux platforms for enhanced performance, productivity, and programmability

    Summary

    With the introduction of Intel Parallel Studio XE and Intel Cluster Studio, Intel is extending the reach of its next-generation tools to Windows and Linux C/C++ and Fortran developers needing advanced performance for multicore today and forward scaling to manycore.

    The Intel Parallel Studio XE 2011 bundle contains the latest versions of the Intel C/C++ and Fortran compilers, the Intel MKL and Intel IPP performance libraries, the Intel PBB libraries (Intel TBB, Intel ArBB [in beta], and Intel Cilk Plus), the Intel Inspector XE correctness analyzer, and the Intel VTune Amplifier XE performance profiler.

    The Intel Cluster Studio 2011 bundle contains the latest versions of the Intel MPI Library, Intel Trace Analyzer and Collector, the Intel C/C++ and Fortran compilers, the Intel MKL and Intel IPP performance libraries, and the Intel PBB libraries (Intel TBB, Intel ArBB [in beta], and Intel Cilk Plus).

    This article originally appeared in Intel Parallel Universe Magazine. Used with permission.

    Intel Array Building Blocks
    By Michael McCool

    Intel Array Building Blocks (Intel ArBB) is a sophisticated and powerful platform for portable data-parallel software development. Intel ArBB will be available as a component of Intel Parallel Building Blocks, along with several other tools and libraries for parallel programming. Intel ArBB can be used to parallelize compute-intensive applications within a structured, deterministic-by-default framework. It also provides powerful runtime generic programming mechanisms, yet can be used with existing compilers. In particular, it has been verified to work with the Intel, Microsoft, and gcc C++ compilers. Intel ArBB is currently in beta, and feedback is appreciated; it can be downloaded today from http://intel.com/go/ArBB for either Windows or Linux.

    Is Intel ArBB a language or a library? Yes: both at the same time. Intel ArBB is the answer to the following question: how can parallelism mechanisms in modern processor hardware, including vector SIMD instructions, be targeted in a portable, general way within existing programming languages? The answer is an embedded language. Intel ArBB is a language extension implemented as an API. It has a library interface, but includes a capability for the dynamic generation and optimization of parallelized and vectorized machine language.

    Modern processors include many mechanisms for increasing performance through parallelism: multiple cores, hyperthreading, superscalar instruction issue, pipelining, and single-instruction, multiple-data (SIMD) vector instructions. The first two, multiple cores and hyperthreading, can be accessed through threads, although for efficiency one may want to use lightweight tasks that share hardware threads. Instruction-level parallelism, such as superscalar instruction issue and pipelining, is invoked automatically by the processor, as long as the instruction stream avoids unnecessary data dependencies. However, the last form of parallelism, SIMD vector parallelism, can only be accessed by generating special instructions that explicitly invoke multiple operations at once: SIMD instructions. SIMD instructions perform the same operation on multiple components of a vector at once, so they are sometimes also called SIMD vector instructions.

    SIMD vector instructions are very powerful, and they are becoming more powerful over time. In current processors that support streaming SIMD extensions (SSE), four single-precision floating-point operations can be executed with a single SSE SIMD instruction. In next-generation AVX processors, the width of the SIMD instructions will double, so eight such operations can be

    executed at once. In the Intel Many Integrated Core (MIC) architecture, the width doubles again, so sixteen such operations can be executed at once. The theoretical peak floating-point performance of a processor is represented by the product of the number of cores, the width of the vector units, and the clock rate. While the clock rate is no longer scaling significantly, the number of cores and the SIMD vector width of each core continue to scale. Vectorization, expressing computations using SIMD vector instructions, is essential to attain the peak performance of modern processors.
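The formula above can be written down directly. The following is an illustrative sketch: the core count, clock rate, and the assumption of one SIMD floating-point operation per cycle per core are hypothetical, not measurements of any particular processor.

```cpp
// Theoretical peak FLOP/s = number of cores * SIMD vector width * clock rate.
// Assumes one SIMD floating-point operation issued per cycle per core;
// fused multiply-add units would double this figure again.
double peak_gflops(int cores, int simd_width, double clock_ghz)
{
    return cores * simd_width * clock_ghz;
}
```

For a hypothetical 4-core, 3 GHz processor, 8-wide AVX-class vectors give peak_gflops(4, 8, 3.0) == 96 GFLOP/s, twice the 48 GFLOP/s the same part reaches with 4-wide SSE vectors. This is why vectorization matters as much as core count.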

    However, there are two problems. First, using SIMD vector units requires use of specific machine-language vector instructions. Second, different processors have different SIMD vector instruction extensions. The SSE, AVX, and MIC vector instructions are all different. While AVX machines can execute SSE instructions, this will not access the full performance potential of AVX processors. This latter issue is not so critical, since current compiler technology does permit the generation of multiple code paths in a single binary. For example, when using the Intel C++ compiler, a single source program can be compiled for both SSE and AVX machines, and the resulting program will use AVX code when possible. However, when using static compilers, developers still need to know in advance which set of processors they wish to target, and the problem remains: how is efficient vectorized code to be generated?

    The traditional approach to supporting instruction set extensions is to modify the compiler to emit the new instructions, and then to recompile programs as necessary. However, for SIMD vector instructions this is not so easy. It is very difficult for a compiler to automatically identify serial structures in a program that can be mapped to SIMD vector instructions. It can be done sometimes, but it is better for the programmer to explicitly indicate which operations in the program should use SIMD vector operations, and how. This requires new constructs in the programming language that can be easily and reliably vectorized. Unfortunately, there is as yet no widely accepted machine-independent standard for specifying vectorization in C and C++.

    Intel Parallel Building Blocks (Intel PBB) actually includes three separate strategies for accessing vector operations in a portable manner. The first strategy, which should not be overlooked, is to use a fixed-function library: Intel Math Kernel Library (Intel MKL) and Intel Integrated Performance Primitives (Intel IPP) include many mathematical operations that have already been vectorized. If the operation you need is part of these optimized libraries, that is often the best solution. If not, and you have to code the algorithm yourself, there are two other strategies available. First, you could use Intel Cilk Plus, an extension to C and C++ that includes a notation to specify explicit vector operations on arrays. This notation is an extension to C/C++ available in the Intel C/C++ compiler. The second general-purpose mechanism is Intel ArBB.

    Intel ArBB is an embedded language, implemented as a C++ API, that in theory works with any ISO-standard C++ compiler. It uses standard C++ mechanisms for its syntax, declaring types for collections of data and overloading operators so that operations can be expressed over those collections. In other words, it looks like a typical matrix-vector math library. However, there is a difference. In an ordinary library, the C/C++ compiler generates the code statically. In ArBB, in contrast, machine code is generated by the library itself, dynamically.

    In practice, ArBB is very simple to use; in the following we will give a few examples. To set the stage, however, we first need to discuss some basics. The ArBB C++ API defines both types and operations. Types include scalar types for floating-point numbers, integers, and Booleans, as well as types for representing collections of these types and user-defined types based on them. The ArBB scalar types are used in place of the ordinary C++ types for floats and integers, and have names like f32 (for single-precision float), i32 (for signed 32-bit integers), and so forth. Using an ArBB scalar type indicates to ArBB that the corresponding machine language for operations on this type should be generated dynamically by ArBB and not statically by C++. There are also types to manage large collections of data. The simplest of these is called dense<T, D> and represents a contiguously stored

    (dense) multidimensional array with element type T and dimensionality D. The dimensionality is optional and defaults to 1. The element type T can be any ArBB scalar type, or structures or classes with ArBB scalar types as elements.

    There are two basic ways to specify parallel computations in ArBB: as sequences of operations over entire collections (vector mode), or as functions replicated over every element of a collection (elemental mode). Vector mode is the simplest: arithmetic operations on collections apply in parallel to the corresponding members of the collections. This works even if the element type is user-defined and the user has overloaded the operator themselves. For example, suppose we have four dense<f32> collections called A, B, C, and D, all of the same size. Then the following expression will operate in parallel on all the elements of these collections:

    A += (B/C) * D;

    Note that in general, when a collection appears on both the left and right side of an expression, ArBB generates a result as if all the inputs were read before any outputs are written. In practice, we have to put this expression inside a function and invoke it with a call operation. However, any sequence of parallel vector operations can be inside such a function:

    void
    doit(dense<f32>& A, dense<f32> B, dense<f32> C, dense<f32> D)
    {
        A += (B/C) * D;
    }
    ...
    call(doit)(A,B,C,D);
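To make the semantics concrete, here is what the call above computes, written as a plain scalar C++ loop. This is a sketch only: std::vector<float> stands in for ArBB's dense<f32>, and the parallel execution and dynamic code generation that ArBB actually performs are omitted.

```cpp
#include <cstddef>
#include <vector>

// Scalar model of the ArBB vector-mode function: A += (B/C) * D is an
// independent update at every element position, as if all inputs were
// read before any output is written.
void doit_scalar(std::vector<float>& A, const std::vector<float>& B,
                 const std::vector<float>& C, const std::vector<float>& D)
{
    for (std::size_t i = 0; i < A.size(); ++i)
        A[i] += (B[i] / C[i]) * D[i];
}
```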

    The way this call actually works is that it calls the function doit precisely once and observes (rather than actually performs) the sequence of ArBB type constructions, operations, and destructions generated by this function. It records this sequence, compiles it into optimized machine language, executes it (in parallel), and then caches it. The next time the same function is called, call does not invoke the C++ function again: it will just retrieve the internally generated machine code from its cache. For simple uses of Intel ArBB this is exactly what you want.

    In more advanced use cases, however, you may want to generate different versions of the operation from the same C++ function. For example, you can parameterize the sequence of Intel ArBB operations by ordinary C++ variables and control flow, and you can use this to generate variants of a computation. Managing this powerful mechanism for generic programming is enabled by another Intel ArBB type called a closure. A closure is an object that represents a captured Intel ArBB function; it is conceptually similar to a lambda function, but is dynamically generated. The return type of call is actually an appropriately typed closure. Another function, capture, is also available. It is similar to call in that it creates a closure, but it does not cache it, so it can be called repeatedly on the same C++ function to generate variants. Again, for simple uses of Intel ArBB, explicit use of closures is not necessary, and you can just think of call as a straightforward function invocation.
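The record-once, replay-from-cache behavior of call described above can be modeled in ordinary C++ as memoization. This is a toy sketch only: the "recording" here just builds a string, whereas ArBB records a trace of operations and compiles it to machine code.

```cpp
#include <functional>
#include <map>
#include <string>

// Counts how many times the recording step actually runs.
int recordings = 0;

// First lookup for a given name runs the recorder and caches its result;
// later lookups return the cached artifact without invoking it again.
std::string get_compiled(const std::string& name,
                         const std::function<std::string()>& record)
{
    static std::map<std::string, std::string> cache;
    auto it = cache.find(name);
    if (it != cache.end())
        return it->second;          // cache hit: the recorder is not re-run
    ++recordings;
    return cache[name] = record();  // record, "compile," and cache
}
```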

    You can also write elemental functions over scalar Intel ArBB types:

    void

    kernel(f32& a, f32 b, f32 c, f32 d)
    {
        a += (b/c)*d;
    }

    You can invoke elemental functions from inside a call by using the map operation. A map operation replicates the function over every element of the input containers.

    void
    doit(dense<f32>& A, dense<f32> B, dense<f32> C, dense<f32> D)
    {
        map(kernel)(A, B, C, D);
    }

    call(doit)(A,B,C,D);

    It is also possible, from inside an elemental function, to access neighboring elements of the input. This makes it very easy to write stencil operations, such as convolutions. You can also pass in either an entire container or a single element to every argument of the map. Single-element arguments are replicated to match the size and shape of any containers used as arguments. For example, suppose we use:

    void
    doit(dense<f32>& A, f32 b, f32 c, dense<f32> D)
    {
        map(kernel)(A, b, c, D);
    }

    call(doit)(A,b,c,D);

    with the same kernel function, but with the types of b and c matching the corresponding function argument exactly; in this case, f32. There will still be as many parallel instances of the kernel as there are elements in the collections A and D, but every instance will get a copy of the same value of b and c. In summary, call arguments need to match exactly, but map functions are polymorphic: any argument can be either a single element or a collection.
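The broadcast behavior of single-element map arguments can again be modeled as a plain scalar loop. This is a sketch: there is one kernel instance per element of A and D, and every instance sees the same copy of b and c.

```cpp
#include <cstddef>
#include <vector>

// Scalar model of map(kernel)(A, b, c, D) with single-element b and c:
// the kernel body a += (b/c)*d runs once per element, with the scalars
// b and c replicated across all instances.
void map_kernel_scalar(std::vector<float>& A, float b, float c,
                       const std::vector<float>& D)
{
    for (std::size_t i = 0; i < A.size(); ++i)
        A[i] += (b / c) * D[i];
}
```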

    In addition to using these two basic patterns to express parallel operations, users of ArBB also have access to several collective operations that act on or take an entire container as input. These operations can shift the contents of containers around, take cumulative sums (prefix scans), perform sets of reads and writes (known as scatters and gathers), discard elements and pack the remainder into a contiguous sequence (known as pack; the inverse is unpack), or simply combine all elements into a single element. Combination of all the elements of a container into a single