
Page 1:

1

Modelling and Analysis of Shared-memory Communication

S. R. Garea and T. Hoefler, "Modeling Communication in Cache-Coherent SMP Systems: A Case-Study with Xeon Phi," in Proceedings of the 22nd International Symposium on High-Performance Parallel and Distributed Computing (HPDC). ACM, June 2013, pp. 97-108.

B. Putigny, B. Ruelle, and B. Goglin, "Analysis of MPI Shared-Memory Communication Performance from a Cache Coherence Perspective," in 2014 IEEE International Parallel & Distributed Processing Symposium Workshops (IPDPSW), 19-23 May 2014, pp. 1238-1247.

Presented by Aishwarya Dhandapani and Taru Doodi

Page 2:

2

Modeling Communication in Cache-Coherent SMP Systems: A Case-Study with Xeon Phi

S. R. Garea and T. Hoefler, "Modeling Communication in Cache-Coherent SMP Systems: A Case-Study with Xeon Phi," in Proceedings of the 22nd International Symposium on High-Performance Parallel and Distributed Computing (HPDC). ACM, June 2013, pp. 97-108.

Page 3:

3

Aim

A novel state-based modeling approach for memory communication in cache-coherent systems.

Assigns a cost to each transition of a cache line between different cache states.

Demonstrate the applicability of the model to Intel Xeon Phi and show how it can be simplified for algorithm design.

Page 4:

4

Intel Xeon Phi 5110P: Architecture

60 cores @ 1056 MHz

Supports 4 hardware threads per core

x86 ISA (low latency)

32 KB of L1 data cache per core

32 KB of L1 instruction cache per core

512 KB of unified L2 cache per core

8 GB of global (GDDR5) memory

3 bidirectional ring buses:

Data block ring (64 bytes wide)

Address ring (read/write commands and memory addresses)

Acknowledgement ring

GDDR5 memory controllers connected to the ring

Page 5:

5

Intel Xeon Phi 5110P: Cache Coherency

Directory-based cache coherence

64 distributed tag directories (DTDs)

The DTD responsible for a given line is generally not local to the core that accesses it

On average, it will be at a distance of 15 cores due to the ring topology

Lines are assigned to DTDs using a hash function based on the address of the line (see the sketch below).

This results in an even load distribution but does not take advantage of locality in the network.
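The actual hash function is not publicly documented; as an illustration only, the line-to-DTD assignment can be pictured as hashing the cache-line address down to one of the 64 directories (the hash and all names below are assumptions, not Intel's implementation):

```c
#include <stdint.h>
#include <stdio.h>

#define NUM_DTDS  64
#define LINE_SIZE 64   /* bytes per cache line */

/* Illustrative only: map a physical address to one of the 64 distributed
 * tag directories by hashing the cache-line index (offset bits dropped). */
static unsigned dtd_for_address(uint64_t paddr)
{
    uint64_t line = paddr / LINE_SIZE;   /* cache-line index */
    line ^= line >> 6;                   /* fold higher bits into the low bits */
    line ^= line >> 12;
    return (unsigned)(line % NUM_DTDS);
}

int main(void)
{
    /* Consecutive lines spread over different DTDs, balancing load on the ring
     * but ignoring which core actually uses the line (no locality). */
    for (uint64_t a = 0; a < 4 * LINE_SIZE; a += LINE_SIZE)
        printf("address 0x%03lx -> DTD %u\n", (unsigned long)a, dtd_for_address(a));
    return 0;
}
```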

Page 6:

6

Cache-coherent Systems

Page 7:

7

Parametrizing the Model for Xeon Phi

The BenchIT benchmark is used to determine the base parameters of the model.

It measures the latency of T0 reading random lines from a buffer owned by T1, varying the state of the lines and the placement of T1:

In the same core (distance 0)

In an adjacent core (distance 1)

In a distant core (distance 15)

In a core located at the opposite side of the ring (distance 30)

Page 8:

8

Parameterizing Measurements

The communication with the DTD makes the distance between the two cores nearly irrelevant.

The distance-invariant performance is a design goal for excellent application scalability on Xeon Phi.

If the line is in I state, that is, it has to be fetched from memory, it does not matter if T1 is in the local core or in a remote one.

Page 9:

9

Communication Models

1. Single-line Ping-Pong Model

2. Multi-line Ping-Pong Model

3. DTD Contention Model

4. Ring Contention

Page 10:

10

Single-line Ping-Pong Model

This requires two sets of buffers on each process (thread) and a synchronization between sender and receiver.

A designated byte in the receive buffer, called the canary value, is used as a synchronization flag: the receiver waits for the message by repeatedly reading this byte (polling) until it changes (sketched below).

Such “canary protocols” are used in practice for small-message synchronizations.

For two threads T0 and T1 on different cores and Ss = Sr = E, the authors measured 497.1 ns (standard deviation σ = 77.2 ns) while the simplified model predicts 479.1 ns (σ = 36.1 ns). For the states Ss = I and Sr = E, they measured 842.8 ns (σ = 102 ns) while the simplified model predicts 748.1 ns (σ = 49.6 ns).
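A minimal sketch of the single-line canary protocol described above, assuming C11 atomics and 64-byte cache lines (the layout and names are illustrative, not the paper's code):

```c
#include <stdalign.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define LINE 64

/* One cache line: payload plus a canary byte used as the synchronization flag. */
typedef struct {
    alignas(LINE) uint8_t payload[LINE - 1];
    _Atomic uint8_t canary;
} msg_line_t;

/* Sender: write the payload first, then publish the canary last. */
static void send_line(msg_line_t *m, const uint8_t *data, size_t n, uint8_t seq)
{
    memcpy(m->payload, data, n);                       /* n <= LINE - 1 */
    atomic_store_explicit(&m->canary, seq, memory_order_release);
}

/* Receiver: poll the canary until it shows the expected sequence value. */
static void recv_line(msg_line_t *m, uint8_t *data, size_t n, uint8_t seq)
{
    while (atomic_load_explicit(&m->canary, memory_order_acquire) != seq)
        ;                                              /* spin (polling) */
    memcpy(data, m->payload, n);
}
```

The release/acquire pair stands in for the ordering that real implementations obtain from fences or from the x86 memory model; the coherence protocol moves the whole line, so payload and flag arrive together.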

Page 11:

11

Multi-line Ping-Pong Model

The receiver only polls for the canary value on the last line of the recv-buffer while the sender copies the content of the send-buffer.

This simple model misses several factors that affect performance:

Eviction overhead

Hardware prefetcher

Signal buses

DTD capabilities to serve outstanding requests

Buffer sizes varied from 64 bytes to 8 KB.

(Plots: ping-pong results for lines in I state and in E state.)

Page 12:

12

DTD Contention Model

DTDs may cause delays when they are contended.

Contention is benchmarked using a global send-buffer owned by one thread that every other thread (receivers) copies into a private recv-buffer.

With only two threads, the performance is expected to be R_{Ls,Ss} + R_{Lr,Sr}.

The contention on MIC for cached lines can be estimated with a linear model in the number of threads: T_C(n_th) = c · n_th + b (a fitting sketch follows below).
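The slide only states the linear form; a small sketch of how c and b could be obtained from measured (n_th, latency) points by ordinary least squares (the data below is hypothetical):

```c
#include <stdio.h>

/* Fit T_C(n_th) = c * n_th + b by ordinary least squares. */
static void fit_linear(const double *nth, const double *t, int m,
                       double *c, double *b)
{
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (int i = 0; i < m; ++i) {
        sx  += nth[i];          sy  += t[i];
        sxx += nth[i] * nth[i]; sxy += nth[i] * t[i];
    }
    *c = (m * sxy - sx * sy) / (m * sxx - sx * sx);
    *b = (sy - *c * sx) / m;
}

int main(void)
{
    /* Hypothetical measurements: latency (ns) vs. number of contending threads. */
    double nth[] = { 2, 4, 8, 16, 32 };
    double t[]   = { 520, 610, 790, 1150, 1870 };
    double c, b;
    fit_linear(nth, t, 5, &c, &b);
    printf("T_C(n_th) = %.1f * n_th + %.1f ns\n", c, b);
    return 0;
}
```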

Page 13:

13

Ring Contention

Two ping-pong benchmarks:

The first benchmark arranges threads into groups of four in which the communicating pairs are interleaved

e.g., if a group is formed by T0, T1, T2 and T3, and Ti is running on core i, the pairs are T0-T2 and T1-T3

The second benchmark forces pairs to communicate through the same part of the ring (e.g., with 6 threads, the pairs are T2-T3, T1-T4, T0-T5), assuming that communications use the shortest path (pair generation is sketched below).

There was no congestion caused by having several pairs of threads communicating simultaneously if they are accessing different memory addresses.
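A small sketch of how the two pairings can be generated for n threads, assuming thread Ti runs on core i (helper names are illustrative, not the paper's code):

```c
#include <stdio.h>

/* Benchmark 1: groups of four with interleaved pairs (T0-T2, T1-T3, ...). */
static void interleaved_pairs(int n)
{
    for (int g = 0; g + 3 < n; g += 4)
        printf("T%d-T%d  T%d-T%d\n", g, g + 2, g + 1, g + 3);
}

/* Benchmark 2: nested pairs that all cross the same part of the ring
 * (with 6 threads: T2-T3, T1-T4, T0-T5). */
static void nested_pairs(int n)
{
    for (int i = n / 2 - 1, j = n / 2; i >= 0 && j < n; --i, ++j)
        printf("T%d-T%d\n", i, j);
}

int main(void)
{
    interleaved_pairs(8);   /* T0-T2 T1-T3, then T4-T6 T5-T7 */
    nested_pairs(6);        /* T2-T3, T1-T4, T0-T5 */
    return 0;
}
```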

Page 14:

14

Designing Communication Algorithms

The communication algorithms tackled are different patterns of data exchange in which interference among threads greatly increases variability.

To capture all the variations, the algorithms are expressed as Min-Max Models in the paper.

The data buffers are initially assumed to be in exclusive state in the owner’s cache.

Page 15:

15

Fast Message Broadcasting

In a sender-driven approach, the sender copies the data into the recv-buffer, similar to the ping-pong benchmark. The receiver may notify the sender with the canary protocol that the recv-buffer is ready.

In a receiver-driven approach, the receiver copies the message after the sender has notified that it is ready (notification forwards). In addition, the receiver has to acknowledge the reception of the message (notification backwards).

Page 16:

16

Notification

The notifications forwards and backwards use shared structures so that they are accessible to every thread.

The root can notify that the message is ready to be copied, and the rest of the threads can confirm that they have received the message, so that the root can free the shared structure.

The notification forwards can be seen as a notification with payload where data and flag can be fetched in a single line.

If data is small enough to fit in the same line, the descendants will poll the notification line and they will copy the data directly from there.

If the space in the notification line is not enough, the parent will set the flag and an address (zero-copy protocol) from which descendants will copy the data.

Backwards notification from the descendants to the parent uses cache lines that are independent from the notification forwards structures to avoid interference in the copy of the data.

There are two variants of the backwards notification. The first uses one cache line to which every thread adds a value after finishing. The parent reads this value and checks whether the operation is done. This requires every child thread to write to the same line, and, since only one thread can write a line at a time, these writes are serialized.

In the second variant, each thread avoids serialization by writing to its own notification line, but the parent has to read them all to check whether the operation is done (both variants are sketched below).
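A minimal sketch of the two backwards-notification variants, assuming C11 atomics; NCHILD, the data layout, and the names are illustrative:

```c
#include <stdalign.h>
#include <stdatomic.h>

#define LINE   64
#define NCHILD 8    /* illustrative number of descendants */

/* Variant 1: all children add to a single counter in one shared line;
 * the writes are serialized by the coherence protocol. */
static alignas(LINE) atomic_int done_count;

void child_notify_shared(void)    { atomic_fetch_add(&done_count, 1); }
int  parent_all_done_shared(void) { return atomic_load(&done_count) == NCHILD; }

/* Variant 2: each child owns its own notification line, so the writes are
 * not serialized, but the parent has to read all of the lines. */
static struct { alignas(LINE) atomic_int flag; } done_flags[NCHILD];

void child_notify_private(int id) { atomic_store(&done_flags[id].flag, 1); }
int  parent_all_done_private(void)
{
    for (int i = 0; i < NCHILD; ++i)
        if (!atomic_load(&done_flags[i].flag))
            return 0;
    return 1;
}
```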

Page 17:

17

Small Broadcast

The structure of a generic tree is described by assuming that each level i can use a different number of descendants (k_i) and that the height of the tree is d (see the worked example below).

All of the k_i descendants of one thread from level i are accessing the same line.

Increasing the number of descendants also increases contention.

Different threads accessing different data should not cause any congestion.

In the worst case, the descendants can read the flag before it is set and interfere while the parent is copying the data and setting the flag.
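As a small worked example of the tree structure (not code from the paper), the number of threads covered by fan-outs k_0, ..., k_{d-1} is 1 + k_0 + k_0*k_1 + ...; larger k_i reach more threads per level but put more pollers on each notification line:

```c
#include <stdio.h>

/* Threads covered by a broadcast tree of height d with k[i] descendants
 * per node at level i (illustrative helper). */
static long tree_threads(const int *k, int d)
{
    long level_nodes = 1;   /* the root */
    long total = 1;
    for (int i = 0; i < d; ++i) {
        level_nodes *= k[i];
        total += level_nodes;
    }
    return total;
}

int main(void)
{
    int k[] = { 4, 4, 4 };
    /* 1 + 4 + 16 + 64 = 85 threads with four descendants per node and height 3. */
    printf("threads covered: %ld\n", tree_threads(k, 3));
    return 0;
}
```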

Page 18:

18

Barrier Synchronization

In the shared-memory scenario, assuming that every thread owns a notification line, each "send" operation consists of setting a flag and waiting until the receivers acknowledge that they have read this flag; a "receive" notifies the senders that the corresponding flags have been read (a simplified sketch follows below).

In the best case, the owner was the last reader of its line (when checking its value in the previous round), so it has the line in cache when setting it to ready (R_L), and the cost of checking it after every receiver has finished is R_R.

Although every thread has to read m lines, they are not contiguous and thus not exposed to the prefetcher, so the single-line model applies.

The contention model does not apply either because, although m threads are accessing each line, they are performing writes that have to be serialized.
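A simplified sketch of such a flag-based barrier with one notification line per thread, assuming C11 atomics and omitting the explicit acknowledgements discussed above (NTHREADS and the names are illustrative):

```c
#include <stdalign.h>
#include <stdatomic.h>

#define LINE     64
#define NTHREADS 8   /* illustrative */

/* Each thread owns one notification line holding a round counter. */
static struct { alignas(LINE) atomic_uint round; } note[NTHREADS];

/* Thread `me` announces the new round on its own line, then waits until
 * every other thread has reached the same round. */
void barrier(int me)
{
    unsigned r = atomic_load(&note[me].round) + 1;
    atomic_store_explicit(&note[me].round, r, memory_order_release);
    for (int i = 0; i < NTHREADS; ++i)
        while (atomic_load_explicit(&note[i].round, memory_order_acquire) < r)
            ;   /* spin until thread i reaches round r */
}
```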

Page 19:

19

Small Reductions

The root is receiving from multiple threads.

One approach could be having all those threads writing to a common location.

Each thread has to (1) check a flag to see if the buffer is ready (R_R), (2) read the buffer (R_L), (3) apply the reduction operation to the buffer using its private data (R_L), (4) write the result to the data buffer, and (5) notify that it has finished (R_R).

This causes serialization.

To avoid serialization, the root has several buffers in which each descendant writes its data.

The root reads them all and performs the operation.
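A minimal sketch of the serialization-free variant, in which the root provides one buffer and flag per descendant (C11 atomics; the sum operation, NCHILD, and all names are illustrative):

```c
#include <stdalign.h>
#include <stdatomic.h>

#define LINE   64
#define NCHILD 8   /* illustrative */

/* One cache line per child: its contribution plus a ready flag. */
static struct {
    alignas(LINE) double value;
    atomic_int ready;
} contrib[NCHILD];

void child_contribute(int id, double v)
{
    contrib[id].value = v;
    atomic_store_explicit(&contrib[id].ready, 1, memory_order_release);
}

double root_reduce(double own)
{
    double sum = own;
    for (int i = 0; i < NCHILD; ++i) {
        while (!atomic_load_explicit(&contrib[i].ready, memory_order_acquire))
            ;                            /* wait for child i */
        sum += contrib[i].value;         /* reduction operator: sum, as an example */
    }
    return sum;
}
```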

Page 20:

20

Experimental Setup

Intel Xeon Phi 5110P: 60 cores @ 1052 MHz

Host machine: Intel Xeon E5-2670 (Sandy Bridge), 8 cores @ 2.60 GHz

Intel MIC software stack: MPSS Gold update 2.1.4346-16

Intel Composer XE 2013.0.079

Intel Compiler v.13.0

Intel MPI v.4.1.0.024

Benchmarks used:

EPCC OpenMP Benchmarks 3.0

Intel MPI Benchmarks (IMB) 3.2.

Page 21:

21

Experiment

Before each iteration, threads are synchronized with a custom RDTSC-based synchronization (see the timing sketch below) and the data lines are placed in the desired cache state.

A second synchronization is performed right before the collective operation. The time is measured for every operation call, and the whole distribution of times is used for the statistical analysis of the results.

Synchronization is not done separately for OpenMP and MPI implementations.
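A minimal sketch of RDTSC-based timing as used in such experiments, assuming GCC/Clang/ICC on x86 (the custom synchronization scheme itself is more involved and not shown; a careful benchmark would also add serialization, e.g. __rdtscp or fences):

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc */

/* Read the time-stamp counter; cycles can be converted to time using the
 * fixed core frequency (about 1.05 GHz on the Xeon Phi 5110P). */
static inline uint64_t read_tsc(void)
{
    return __rdtsc();
}

/* Time one operation call in cycles; repeated calls give the distribution
 * used for the statistical analysis. */
uint64_t time_call(void (*op)(void))
{
    uint64_t t0 = read_tsc();
    op();
    return read_tsc() - t0;
}
```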

Page 22:

22

Performance

(Plots: small broadcast and large broadcast performance.)

Page 23:

23

Performance

(Plots: barrier synchronization and small reduction performance.)

Page 24:

24

Conclusion

The notification system and the interference caused by threads in the polling stages can impact performance more than the actual data transfer.

The model allows algorithm designers to abstract away from the architecture and the detailed cache-coherence protocol and to design algorithms on purely analytic grounds.

It is a powerful framework for tuning and developing parallel algorithms.

The developed algorithms are up to 4.3 times faster than Intel's hand-tuned implementations in high-performance libraries and compilers.

Page 25:

25

Comments

Extremely well written.

Provides a very good model that can be extended to other architectures.

Highly analytical; covers both best-case and worst-case scenarios.

Could provide a little more background on the other models it is compared against.

Page 26:

26

Analysis of MPI Shared-Memory Communication Performance from a Cache Coherence Perspective

B. Putigny, B. Ruelle, and B. Goglin, "Analysis of MPI Shared-Memory Communication Performance from a Cache Coherence Perspective," in 2014 IEEE International Parallel & Distributed Processing Symposium Workshops (IPDPSW), 19-23 May 2014, pp. 1238-1247.

Page 27:

27

Intra-node Communication and Cache Coherence (1/2)

Most parallel applications involve communication and data transfers between processes or threads

Most parallel applications use MPI, including for intra-node communication

Shared-memory MPI communication involves several copies of the data, in a shared-memory buffer and in the L1 caches

Therefore, cache coherence plays an important role in shared memory communications

However, MPI implementations tune shared memory communications based on metrics that hardly consider cache coherence

Page 28:

28

Intra-node Communication and Cache Coherence (2/2)

MPI implementations like Open MPI offer many configuration options for tuning intra-node communications

However, the effective use of these options requires an in-depth understanding of hardware and memory architectures, which is not common among programmers

Therefore, most programmers use an MPI implementation like MPICH2 that does not offer too many explicit tuning options

The disadvantage with such implementations is that they are usually outdated

Page 29:

29

The Solution? – Modelling shared memory communications

The paper suggests an approach that uses micro memory benchmarks to model communication patterns among threads in a processor

These benchmarks hide from the user hard-to-model hardware features, such as prefetchers or cache-coherence implementations that are not documented in detail

The benchmark outputs are then combined to rebuild the application's memory access pattern and predict its behavior

This is portable across different architectures as the modelling is done using benchmarks

Page 30:

30

Anatomy of Intra-node Communication

Most processors implement some form of the MESI protocol

Shared-memory MPI communications use an intermediate buffer that is shared between the sender and the receiver

The sender writes the message into the buffer before the receiver reads it

Popular MPI implementations like MPICH2 and Open MPI allocate a large buffer that is shared between processes

This buffer is divided into fixed size chunks among processes

Processes reuse their share of the buffer every time they want to send messages, even if the destination is different

When a message is too large to fit into one fixed size buffer, the process uses multiple buffers. There is a pipeline protocol in place to ensure the receiver’s reads overlap with the sender’s writes appropriately
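A minimal sketch of the pipelined copy through fixed-size shared chunks for a single sender/receiver pair (chunk sizes, names, and the ring of chunks are illustrative; real MPI implementations add message matching, headers, and flow control):

```c
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

#define CHUNK   4096   /* illustrative fixed chunk size */
#define NCHUNKS 4      /* small ring so reads and writes can overlap */

typedef struct {
    char data[CHUNK];
    size_t len;
    atomic_int full;   /* 0 = free for the sender, 1 = ready for the receiver */
} chunk_t;

static chunk_t ring[NCHUNKS];

/* Sender: stream the message through the ring of shared chunks. */
void send_pipelined(const char *msg, size_t n)
{
    for (size_t off = 0, i = 0; off < n; off += CHUNK, i = (i + 1) % NCHUNKS) {
        chunk_t *c = &ring[i];
        size_t len = (n - off < CHUNK) ? n - off : CHUNK;
        while (atomic_load_explicit(&c->full, memory_order_acquire))
            ;                            /* wait until the receiver drained it */
        memcpy(c->data, msg + off, len);
        c->len = len;
        atomic_store_explicit(&c->full, 1, memory_order_release);
    }
}

/* Receiver: drain chunks until n bytes have been assembled. */
void recv_pipelined(char *msg, size_t n)
{
    for (size_t off = 0, i = 0; off < n; i = (i + 1) % NCHUNKS) {
        chunk_t *c = &ring[i];
        while (!atomic_load_explicit(&c->full, memory_order_acquire))
            ;                            /* wait for the sender to fill it */
        memcpy(msg + off, c->data, c->len);
        off += c->len;
        atomic_store_explicit(&c->full, 0, memory_order_release);
    }
}
```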

Page 31:

31

Modelling Intra-node Communication (1/2)

Micro memory benchmarks are run on the test-bed platforms using the mbench framework

The mbench framework offers easy ways to set up memory buffers in each cache state and to measure the corresponding memory access throughput for different numbers of threads

The benchmark outputs are then combined to rebuild the memory access pattern and predict its behavior. The rebuilding takes the MESI protocol into account

Since most HPC architectures use some form of MESI, this model is expected to match a wide range of HPC applications

Page 32:

32

Modelling Intra-node Communication (2/2)

The model targets memory-bound applications, i.e., applications where memory access is the key performance criterion and cannot be significantly overlapped with computation

The model computes the time taken by the sender and the receiver to copy data to and from their buffers, which in turn allows the throughput of these communications to be computed (see the sketch below)
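A small sketch of the resulting throughput estimate, assuming the transfer decomposes into a sender copy into the shared buffer followed by a receiver copy out of it, with no overlap (this decomposition and the names are illustrative; the paper's full model distinguishes cache states and pipelining):

```c
/* Predicted communication throughput (bytes/s) from benchmarked copy rates. */
double predicted_throughput(double msg_bytes,
                            double write_bw,  /* sender copy into shared buffer */
                            double read_bw)   /* receiver copy out of shared buffer */
{
    double t_write = msg_bytes / write_bw;
    double t_read  = msg_bytes / read_bw;
    return msg_bytes / (t_write + t_read);
}
```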

Page 33:

33

Test Beds – Intel Xeon Sandy Bridge E5-2650 (1/2)

Two 8-core Intel Xeon E5-2650 processors @ 2 GHz

Sandy Bridge microarchitecture

16 cores with hyper-threading = 32 hardware threads in total

Page 34:

34

Test Beds – AMD Opteron 6272 (2/2)

AMD Opteron 6272 @ 2.1 GHz

Bulldozer microarchitecture

64 cores in total as seen by the OS

Page 35:

35

Evaluating the Model (1/2)

The performance predicted by the model is compared to an experiment that measures the performance of communication on Intel and AMD processors using MPI.

The model measures only data transfer times whereas MPI adds control overheads.

To overcome this, a synthetic experiment was designed that mimics the data transfers within the Open MPI 1.7 implementation.

Page 36:

36

Evaluating the Model (2/2)

Page 37:

37

Impact of Application Buffer Reuse (1/2)

Sometimes the application buffers that belong to the sender or the receiver are reused across message transfers.

And sometimes they are not reused.

This causes the MESI states involved to change and causes memory access performance to vary significantly.

Therefore, it is important to consider the changes in memory access throughput due to buffer reuse.

Page 38:

38

Impact of Application Buffer Reuse (2/2)

Page 39:

39

When transitioning from M -> S state

When a core tries to load data that has been recently modified by another core, a write-back is required if the two cores do not share a cache

Such write-backs prove to be expensive

Page 40:

40

When transitioning from S -> M state

The remote copy has to be invalidated before the local copy can switch from the S to the M state

The Intel Xeon E5 has a directory in its L3 cache that keeps track of these invalidation requests and cancels them when it is known that there are no other copies of that cache line.

This has been shown to improve performance by 5-10%

Page 41:

41

MOESI Protocol and Shared-buffer Reuse Order (1/2)

AMD platforms use MOESI protocol for cache coherence

This protocol is interesting because of the O -> M transition

One would assume that this transition would happen as quickly as the transition from S to M state in the MESI protocol

But on the contrary, it takes longer.

It is faster to write to a shared copy than to an owned one.

Also, in the Bulldozer architecture, reusing shared buffers seemed to speed up data transfers compared to using different buffers

Page 42:

42

Shared-buffer Reuse Order and MOESI Protocol (2/2)

Page 43:

43

Conclusions

Intra-node communications are becoming increasingly important with the advent of many core processors

Modern memory subsystems are too complex to model analytically from hardware documentation alone

The paper proposes a new approach to modelling shared-memory communication using micro memory benchmarks

The proposed model's accuracy was evaluated with an experimental setup and found to be acceptable

Some common state transitions and their impact on shared-memory communication were discussed

The unexpected behavior of the MOESI protocol was discussed

Page 44:

44

Future Work

The proposed model could be applied to real applications and its accuracy evaluated in practice

Other architectures that use a MESI-style cache-coherence protocol, such as ARM, could also be considered for the study

More complex communication operations, such as collective operations, could be studied

The optimizations suggested in the paper could be incorporated into an MPI implementation that would automatically tune shared-memory communication

Page 45:

45

Questions