
Everything You Always Wanted to Know About Synchronization but Were Afraid to Ask

Tudor David, Rachid Guerraoui, Vasileios Trigonakis∗

School of Computer and Communication Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Switzerland

{tudor.david, rachid.guerraoui, vasileios.trigonakis}@epfl.ch

Abstract
This paper presents the most exhaustive study of synchronization to date. We span multiple layers, from hardware cache-coherence protocols up to high-level concurrent software. We do so on different types of architectures, from single-socket – uniform and non-uniform – to multi-socket – directory and broadcast-based – many-cores. We draw a set of observations that, roughly speaking, imply that scalability of synchronization is mainly a property of the hardware.

1 Introduction
Scaling software systems to many-core architectures is one of the most important challenges in computing today. A major impediment to scalability is synchronization. From the Greek “syn”, i.e., with, and “khronos”, i.e., time, synchronization denotes the act of coordinating the timeline of a set of processes. Synchronization basically translates into cores slowing each other down, sometimes affecting performance to the point of annihilating the overall purpose of increasing their number [3, 21].

A synchronization scheme is said to scale if its performance does not degrade as the number of cores increases. Ideally, acquiring a lock should, for example, take the same time regardless of the number of cores sharing that lock. In the last few decades, a large body of work has been devoted to the design, implementation, evaluation, and application of synchronization schemes [1, 4, 6, 7, 10, 11, 14, 15, 26–29, 38–40, 43–45].

∗ Authors appear in alphabetical order.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Copyright is held by the Owner/Author(s).
SOSP’13, Nov. 3–6, 2013, Farmington, Pennsylvania, USA.
ACM 978-1-4503-2388-8/13/11.
http://dx.doi.org/10.1145/2517349.2522714

Figure 1: Analysis method.

Yet, the designer of a concurrent system still has little indication, a priori, of whether a given synchronization scheme will scale on a given modern many-core architecture and, a posteriori, about exactly why a given scheme did, or did not, scale.

One of the reasons for this state of affairs is that there are no results on modern architectures connecting the low-level details of the underlying hardware, e.g., the cache-coherence protocol, with synchronization at the software level. Recent work that evaluated the scalability of synchronization on modern hardware, e.g., [11, 26], was typically put in a specific application and architecture context, making the evaluation hard to generalize. In most cases, when scalability issues are faced, it is not clear if they are due to the underlying hardware, to the synchronization algorithm itself, to its usage of specific atomic operations, to the application context, or to the workload.

Of course, getting the complete picture of how synchronization schemes behave, in every single context, is very difficult. Nevertheless, in an attempt to shed some light on such a picture, we present the most exhaustive study of synchronization on many-cores to date. Our analysis seeks completeness in two directions (Figure 1).

1. We consider multiple synchronization layers, from basic many-core hardware up to complex concurrent software. First, we dissect the latencies of cache-coherence protocols. Then, we study the performance of various atomic operations, e.g., compare-and-swap, test-and-set, fetch-and-increment. Next, we proceed with locking and message passing techniques. Finally, we examine a concurrent hash table, an in-memory key-value store, and a software transactional memory.


2. We vary a set of important architectural attributes to better understand their effect on synchronization. We explore both single-socket (chip multi-processor) and multi-socket (multi-processor) many-cores. In the former category, we consider uniform (e.g., Sun Niagara 2) and non-uniform (e.g., Tilera TILE-Gx36) designs. In the latter category, we consider platforms that implement coherence based on a directory (e.g., AMD Opteron) or broadcast (e.g., Intel Xeon).

Our set of experiments, on what we believe constitute the most commonly used synchronization schemes and hardware architectures today, induces the following set of observations.

Crossing sockets is a killer. The latency of performing any operation on a cache line, e.g., a store or a compare-and-swap, simply does not scale across sockets. Our results indicate an increase from 2 to 7.5 times compared to intra-socket latencies, even under no contention. These differences amplify with contention at all synchronization layers and suggest that cross-socket sharing should be avoided.

Sharing within a socket is necessary but not sufficient. If threads are not explicitly placed on the same socket, the operating system might try to load balance them across sockets, inducing expensive communication. But, surprisingly, even with explicit placement within the same socket, an incomplete cache directory, combined with a non-inclusive last-level cache (LLC), might still induce cross-socket communication. On the Opteron for instance, this phenomenon entails a 3-fold increase compared to the actual intra-socket latencies. We discuss one way to alleviate this problem by circumventing certain access patterns.

Intra-socket (non-)uniformity does matter. Within a socket, the fact that the distance from the cores to the LLC is the same, or differs among cores, even only slightly, impacts the scalability of synchronization. For instance, under high contention, the Niagara (uniform) enables approximately 1.7 times higher scalability than the Tilera (non-uniform) for all locking schemes. The developer of a concurrent system should thus be aware that highly contended data pose a higher threat in the presence of even the slightest non-uniformity, e.g., non-uniformity inside a socket.

Loads and stores can be as expensive as atomic operations. In the context of synchronization, where memory operations are often accompanied with memory fences, loads and stores are generally not significantly cheaper than atomic operations with higher consensus numbers [19]. Even without fences, on data that are not locally cached, a compare-and-swap is roughly only 1.35 (on the Opteron) and 1.15 (on the Xeon) times more expensive than a load.

Message passing shines when contention is very high. Structuring an application with message passing reduces sharing and proves beneficial when a large number of threads contend for a few data. However, under low contention and/or a small number of cores, locks perform better on higher-layer concurrent testbeds, e.g., a hash table and a software transactional memory, even when message passing is provided in hardware (e.g., Tilera). This suggests the exclusive use of message passing for optimizing certain highly contended parts of a system.

Every locking scheme has its fifteen minutes of fame. None of the nine locking schemes we consider consistently outperforms any other one, on all target architectures or workloads. Strictly speaking, to seek optimality, a lock algorithm should thus be selected based on the hardware platform and the expected workload.

Simple locks are powerful. Overall, an efficient implementation of a ticket lock is the best performing synchronization scheme in most low contention workloads. Even under rather high contention, the ticket lock performs comparably to more complex locks, in particular within a socket. Consequently, given their small memory footprint, ticket locks should be preferred, unless it is certain that a specific lock will be very highly contended.

A high-level ramification of many of these observations is that the scalability of synchronization appears, first and above all, to be a property of the hardware, in the following sense. Basically, in order to be able to scale, synchronization should better be confined to a single socket, ideally a uniform one. On certain platforms (e.g., Opteron), this is simply impossible. Within a socket, sophisticated synchronization schemes are generally not worthwhile. Even if, strictly speaking, no size fits all, a proper implementation of a simple ticket lock seems enough.

In summary, our main contribution is the most exhaustive study of synchronization to date. Results of this study can be used to help predict the cost of a synchronization scheme, explain its behavior, design better schemes, as well as possibly improve future hardware design. SSYNC, the cross-platform synchronization suite we built to perform the study is, we believe, a contribution of independent interest. SSYNC abstracts various lock algorithms behind a common interface: it not only includes most state-of-the-art algorithms, but also provides platform-specific optimizations with substantial performance improvements. SSYNC also contains a library that abstracts message passing on various platforms, and a set of microbenchmarks for measuring the latencies of the cache-coherence protocols, the locks, and the message passing.


Name                 Opteron                     Xeon                       Niagara                    Tilera
System               AMD Magny Cours             Intel Westmere-EX          SUN SPARC-T5120            Tilera TILE-Gx36
Processors           4× AMD Opteron 6172         8× Intel Xeon E7-8867L     SUN UltraSPARC-T2          TILE-Gx CPU
# Cores              48                          80 (no hyper-threading)    8 (64 hardware threads)    36
Core clock           2.1 GHz                     2.13 GHz                   1.2 GHz                    1.2 GHz
L1 Cache             64/64 KiB I/D               32/32 KiB I/D              16/8 KiB I/D               32/32 KiB I/D
L2 Cache             512 KiB                     256 KiB                    -                          256 KiB
Last-level Cache     2×6 MiB (shared per die)    30 MiB (shared)            4 MiB (shared)             9 MiB Distributed
Interconnect         6.4 GT/s HyperTransport     6.4 GT/s QuickPath         Niagara2 Crossbar          Tilera iMesh
                     (HT) 3.0                    Interconnect (QPI)
Memory               128 GiB DDR3-1333           192 GiB Sync DDR3-1067     32 GiB FB-DIMM-400         16 GiB DDR3-800
#Channels / #Nodes   4 per socket / 8            4 per socket / 8           8 / 1                      4 / 2
OS                   Ubuntu 12.04.2 / 3.4.2      Red Hat EL 6.3 / 2.6.32    Solaris 10 u7              Tilera EL 6.3 / 2.6.40

Table 1: The hardware and the OS characteristics of the target platforms.

In addition, we provide implementations of a portable software transactional memory, a concurrent hash table, and a key-value store (Memcached [30]), all built using the libraries of SSYNC. SSYNC is available at:

http://lpd.epfl.ch/site/ssync

The rest of the paper is organized as follows. We recall in Section 2 some background notions and present in Section 3 our target platforms. We describe SSYNC in Section 4. We present our analyses of synchronization from the hardware and software perspectives, in Sections 5 and 6, respectively. We discuss related work in Section 7. In Section 8, we summarize our contributions, discuss some experimental results that we omit from the paper because of space limitations, and highlight some opportunities for future work.

2 Background and Context
Hardware-based Synchronization. Cache coherence is the hardware protocol responsible for maintaining the consistency of data in the caches. The cache-coherence protocol implements the two fundamental operations of an architecture: load (read) and store (write). In addition to these two operations, other, more sophisticated operations, i.e., atomic, are typically also provided: compare-and-swap, fetch-and-increment, etc.

Most contemporary processors use the MESI [37] cache-coherence protocol, or a variant of it. MESI supports the following states for a cache line: Modified: the data are stale in the memory and no other cache has a copy of this line; Exclusive: the data are up-to-date in the memory and no other cache has a copy of this line; Shared: the data are up-to-date in the memory and other caches might have copies of the line; Invalid: the data are invalid.

Coherence is usually implemented by either snooping or using a directory. By snooping, the individual caches monitor any traffic on the addresses they hold in order to ensure coherence. A directory keeps approximate or precise information of which caches hold copies of a memory location. An operation has to consult the directory, enforce coherence, and then update the directory.

Software-based Synchronization. The most popular software abstractions are locks, used to ensure mutual exclusion. Locks can be implemented using various techniques. The simplest are spin locks [4, 20], in which processes spin on a common memory location until they acquire the lock. Spin locks are generally considered to scale poorly because they involve high contention on a single cache line [4], an issue which is addressed by queue locks [29, 43]. To acquire a queue lock, a thread adds an entry to a queue and spins until the previous holder hands it the lock. Hierarchical locks [14, 27] are tailored to today’s non-uniform architectures by using node-local data structures and minimizing accesses to remote data. Whereas the aforementioned locks employ busy-waiting techniques, other implementations are cooperative. For example, in case of contention, the commonly used Pthread Mutex adds the core to a wait queue and is then suspended until it can acquire the lock.
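
To make the spin-lock idea concrete, the following is a minimal test-and-set lock in C11, with the test-and-test-and-set refinement mentioned above; this is an illustrative sketch, not one of the implementations evaluated in this paper.

    #include <stdatomic.h>

    /* A test-and-set spin lock: all threads spin on one shared cache line. */
    typedef struct {
        atomic_int locked;   /* 0 = free, 1 = held */
    } tas_lock_t;

    static void tas_lock(tas_lock_t *l)
    {
        for (;;) {
            /* test-and-test-and-set: spin on plain loads first so that failed
               attempts do not keep writing the shared line */
            while (atomic_load_explicit(&l->locked, memory_order_relaxed) != 0)
                ;
            /* the atomic exchange is the actual test-and-set */
            if (atomic_exchange_explicit(&l->locked, 1, memory_order_acquire) == 0)
                return;
        }
    }

    static void tas_unlock(tas_lock_t *l)
    {
        atomic_store_explicit(&l->locked, 0, memory_order_release);
    }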

An alternative to locks we consider in our study is to partition the system resources between processes. In this view, synchronization is achieved through message passing, which is either provided by the hardware or implemented in software [5]. Software implementations are generally built over cache-coherence protocols and impose a single writer and a single reader for the used cache lines.

3 Target Platforms
This section describes the four platforms considered in our experiments. Each is representative of a specific type of many-core architecture. We consider two large-scale multi-socket multi-cores, henceforth called the multi-sockets1, and two large-scale chip multi-processors (CMPs), henceforth called the single-sockets. The multi-sockets are a 4-socket AMD Opteron (Opteron) and an 8-socket Intel Xeon (Xeon), whereas the CMPs are an 8-core Sun Niagara 2 (Niagara) and a 36-core Tilera TILE-Gx36 (Tilera). The characteristics of the four platforms are detailed in Table 1.

All platforms have a single die per socket, aside from the Opteron, which has two. Given that these two dies are actually organized in a 2-socket topology, we use the term socket to refer to a single die for simplicity.

1 We also test a 2-socket Intel Xeon and a 2-socket AMD Opteron.


3.1 Multi-socket – Directory-based: Opteron

The 48-core AMD Opteron contains four multi-chip modules (MCMs). Each MCM has two 6-core dies with independent memory controllers. Hence, the system comprises, overall, eight memory nodes. The topology of the system is depicted in Figure 2(a). The maximum distance between two dies is two hops. The dies of an MCM are situated at a 1-hop distance, but they share more bandwidth than two dies of different MCMs.

The caches of the Opteron are write-back and non-inclusive [13]. Nevertheless, the hierarchy is not strictly exclusive; on an LLC hit the data are pulled in the L1 but may or may not be removed from the LLC (decided by the hardware [2]). The Opteron uses the MOESI protocol for cache coherence. The ‘O’ stands for the owned state, which indicates that this cache line has been modified (by the owner) but there might be more shared copies on other cores. This state allows a core to load a modified line of another core without the need to invalidate the modified line. The modified cache line simply changes to owned and the new core receives the line in shared state. Cache coherence is implemented with a broadcast-based protocol, assisted by what is called the HyperTransport Assist (also known as the probe filter) [13]. The probe filter is, essentially, a directory residing in the LLC2. An entry in the directory holds the owner of the cache line, which can be used to directly probe or invalidate the copy in the local caches of the owner core.

Figure 2: The system topologies. (a) AMD Opteron; (b) Intel Xeon.

2 Typically, the probe filter occupies 1 MiB of the LLC.

3.2 Multi-socket – Broadcast-based: Xeon
The 80-core Intel Xeon consists of eight sockets of 10 cores. These form a twisted hypercube, as depicted in Figure 2(b), maximizing the distance between two nodes to two hops. The Xeon uses inclusive caches [23], i.e., every new cache-line fill occurs in all the three levels of the hierarchy. The LLC is write-back; the data are written to the memory only upon an eviction of a modified line due to space or coherence. Within a socket, the Xeon implements coherence by snooping. Across sockets, it broadcasts snoop requests to the other sockets. Within the socket, the LLC keeps track of which cores might have a copy of a cache line. Additionally, the Xeon extends the MESI protocol with the forward state [22]. This state is a special form of the shared state and indicates the only cache that will respond to a load request for that line (thus reducing bandwidth usage).

3.3 Single-socket – Uniform: Niagara
The Sun Niagara 2 is a single-die processor that incorporates 8 cores. It is based on the chip multi-threading architecture; it provides 8 hardware threads per core, totaling 64 hardware threads. Each L1 cache is shared among the 8 hardware threads of a core and is write-through to the LLC. The 8 cores communicate with the shared LLC through a crossbar [36], which means that each core is equidistant from the LLC (uniform). The cache-coherence implementation is directory-based and uses duplicate tags [32], i.e., the LLC cache holds a directory of all the L1 lines.

3.4 Single-socket – Non-uniform: Tilera
The Tilera TILE-Gx36 [41] is a 36-core chip multi-processor. The cores, also called tiles, are allocated on a 2-dimensional mesh and are connected with Tilera’s iMesh on-chip network. iMesh handles the coherence of the data and also provides hardware message passing to the applications. The Tilera implements the Dynamic Distributed Cache technology [41]. All L2 caches are accessible by every core on the chip; thus, the L2s are used as a 9 MiB coherent LLC. The hardware uses a distributed directory to implement coherence. Each cache line has a home tile, i.e., the actual L2 cache where the data reside if cached by the distributed LLC. Consider, for example, the case of core x loading an address homed on tile y. If the data are not cached in the local L1 and L2 caches of x, a request for the data is sent to the L2 cache of y, which plays the role of the LLC for this address. Clearly, the latency of accessing the LLC depends on the distance between x and y, hence the Tilera is a non-uniform cache architecture.

4 SSYNC
SSYNC is our cross-platform synchronization suite; it works on x86_64, SPARC, and Tilera processors. SSYNC contains libslock, a library that abstracts lock algorithms behind a common interface, and libssmp, a library with fine-tuned implementations of message passing for each of the four platforms. SSYNC also includes microbenchmarks for measuring the latencies of the cache coherence, the locks, and the message passing, as well as ssht, i.e., a cache-efficient hash table, and TM2C, i.e., a highly portable transactional memory.

4.1 Libraries
libslock. This library contains a common interface and optimized implementations of a number of widely used locks.


libslock includes three spin locks, namely the test-and-set lock, the test-and-test-and-set lock with exponential back-off [4, 20], and the ticket lock [29]. The queue locks are the MCS lock [29] and the CLH lock [43]. We also employ an array-based lock [20]. libslock also contains hierarchical locks, such as the hierarchical CLH lock [27] and the hierarchical ticket lock (hticket) [14]3. Finally, libslock abstracts the Pthread Mutex interface. libslock also contains a cross-platform interface for atomic instructions and other architecture-dependent operations, such as fences and thread and memory placement functions.

libssmp. libssmp is our implementation of message passing over cache coherence4 (similar to the one in Barrelfish [6]). It uses cache-line-sized buffers (messages) in order to complete message transmissions with single cache-line transfers. Each buffer is one-directional and includes a byte flag to designate whether the buffer is empty or contains a message. For client-server communication, libssmp implements functions for receiving from any other thread, or from a specific subset of the threads. Even though the design of libssmp is identical on all platforms, we leverage the results of Section 5 to tailor libssmp to the specifics of each platform individually.
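
The sketch below illustrates the kind of cache-line-sized, one-directional buffer with a flag byte described above; the names and layout are ours, for illustration only, and do not reproduce the actual libssmp code.

    #include <stdatomic.h>
    #include <string.h>

    #define CACHE_LINE_SIZE 64
    #define MSG_PAYLOAD     (CACHE_LINE_SIZE - 1)

    /* One-directional buffer: exactly one cache line, one writer, one reader. */
    typedef struct {
        char payload[MSG_PAYLOAD];
        atomic_char flag;            /* 0 = empty, 1 = contains a message */
    } __attribute__((aligned(CACHE_LINE_SIZE))) ssmp_buf_t;

    /* Writer side: wait until the previous message was consumed, then publish. */
    static void buf_send(ssmp_buf_t *b, const void *msg, size_t len)
    {
        while (atomic_load_explicit(&b->flag, memory_order_acquire) != 0)
            ;                                   /* buffer still full */
        memcpy(b->payload, msg, len < MSG_PAYLOAD ? len : MSG_PAYLOAD);
        atomic_store_explicit(&b->flag, 1, memory_order_release);
    }

    /* Reader side: wait for a message, copy it out, mark the buffer empty. */
    static void buf_recv(ssmp_buf_t *b, void *out)
    {
        while (atomic_load_explicit(&b->flag, memory_order_acquire) == 0)
            ;                                   /* no message yet */
        memcpy(out, b->payload, MSG_PAYLOAD);
        atomic_store_explicit(&b->flag, 0, memory_order_release);
    }

With such a layout, sending or receiving one message touches a single cache line, which is why a one-way message costs roughly two cache-line transfers (see Section 6.2).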

4.2 Microbenchmarks
ccbench. ccbench is a tool for measuring the cost of operations on a cache line, depending on the line’s MESI state and placement in the system. ccbench brings the cache line to the desired state and then accesses it from either a local or a remote core. ccbench supports 30 cases, such as store on modified and test-and-set on shared lines.

stress tests. SSYNC provides tests for the primitives in libslock and libssmp. These tests can be used to measure the primitives’ latency or throughput under various conditions, e.g., number and placement of threads, level of contention.
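
As an illustration of this measurement style, the sketch below times a single load on a cache-line-aligned location with the x86-64 time-stamp counter; it is a simplified stand-in for ccbench, and it assumes that a second thread pinned elsewhere first brings the line to the desired MESI state.

    #include <stdint.h>
    #include <stdio.h>

    static inline uint64_t rdtsc(void)
    {
        uint32_t lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }

    /* one 64-byte cache line */
    static volatile uint64_t line[8] __attribute__((aligned(64)));

    int main(void)
    {
        /* a real benchmark would also serialize the timestamps and
           average many repetitions; this shows only the timed access */
        uint64_t start  = rdtsc();
        uint64_t v      = line[0];            /* the measured load */
        uint64_t cycles = rdtsc() - start;
        printf("load returned %lu after ~%lu cycles\n",
               (unsigned long)v, (unsigned long)cycles);
        return 0;
    }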

4.3 Concurrent Software
Hash Table (ssht). ssht is a concurrent hash table that exports three operations: put, get, and remove. It is designed to place the data as efficiently as possible in the caches in order to (i) allow for efficient prefetching and (ii) avoid false sharing. ssht can be configured to use any of the locks of libslock or the message passing of libssmp.

Transactional Memory (TM2C). TM2C [16] is a message-passing-based software transactional memory system for many-cores. TM2C is implemented using libssmp. TM2C also has a shared-memory version built with the spin locks of libslock.

3 In fact, based on the results of Section 5 and without being aware of [14], we designed and implemented the hticket algorithm.

4 On the Tilera, it is an interface to the hardware message passing.

5 Hardware-Level Analysis
In this section, we report on the latencies incurred by the hardware cache-coherence protocols and discuss how to reduce them in certain cases. These latencies constitute a good estimation of the cost of sharing a cache line in a many-core platform and have a significant impact on the scalability of any synchronization scheme. We use ccbench to measure basic operations such as load and store, as well as compare-and-swap (CAS), fetch-and-increment (FAI), test-and-set (TAS), and swap (SWAP).

5.1 Local Accesses
Table 3 contains the latencies for accessing the local caches of a core. In the context of synchronization, the values for the LLCs are worth highlighting. On the Xeon, the 44 cycles is the local latency to the LLC, but it also corresponds to the fastest communication between two cores on the same socket. The LLC plays the same role for the single-socket platforms; however, it is directly accessible by all the cores of the system. On the Opteron, the non-inclusive LLC holds both data and the cache directory, so the 40 cycles is the latency to both. However, the LLC is filled with data only upon an eviction from the L2, hence the access to the directory is more relevant to synchronization.

5.2 Remote Accesses
Table 2 contains the latencies to load, store, or perform an atomic operation on a cache line based on its previous state and location. Notice that the accesses to an invalid line are accesses to the main memory. In the following, we discuss all cache-coherence states, except for the invalid. We do not explicitly measure the effects of the forward state of the Xeon: there is no direct way to bring a cache line to this state. Its effects are included in the load from shared case.

Loads. On the Opteron, a load has basically the same latency regardless of the previous state of the line; essentially, the steps taken by the cache-coherence protocol are always the same. Interestingly, although the two dies of an MCM are tightly coupled, the benefits are rather small. The latencies between two dies in an MCM and two dies that are simply directly connected differ by roughly 12 cycles. One extra hop adds an additional overhead of 80 cycles. Overall, an access over two hops is approximately 3 times more expensive than an access within a die.

        Opteron   Xeon   Niagara   Tilera
L1            3      5         3        2
L2           15     11         -       11
LLC          40     44        24       45
RAM         136    355       176      118

Table 3: Local caches and memory latencies (cycles).


State        Opteron                          Xeon                     Niagara                      Tilera
             same   same   one    two        same   one    two        same          other           one             max
             die    MCM    hop    hops       die    hop    hops       core          core            hop             hops

loads
Modified       81    161    172    252        109    289    400          3            24              45              65
Owned          83    163    175    254          -      -      -          -             -               -               -
Exclusive      83    163    175    253         92    273    383          3            24              45              65
Shared         83    164    176    254         44    223    334          3            24              45              65
Invalid       136    237    247    327        355    492    601        176           176             118             162

stores
Modified       83    172    191    273        115    320    431         24            24              57              77
Owned         244    255    286    291          -      -      -          -             -               -               -
Exclusive      83    171    191    271        115    315    425         24            24              57              77
Shared        246    255    286    296        116    318    428         24            24              86             106

atomic operations: CAS (C), FAI (F), TAS (T), SWAP (S)
Operation     all    all    all    all        all    all    all     C/F/T/S       C/F/T/S         C/F/T/S         C/F/T/S
Modified      110    197    216    296        120    324    430     71/108/64/95  66/99/55/90     77/51/70/63     98/71/89/84
Shared        272    283    312    332        113    312    423     76/99/67/93   66/99/55/90     124/82/121/95   142/102/141/115

Table 2: Latencies (cycles) of the cache coherence to load/store/CAS/FAI/TAS/SWAP a cache line depending on the MESI state and the distance. The values are the average of 10000 repetitions with < 3% standard deviation.

The results in Table 2 represent the best-case scenario for the Opteron: at least one of the involved cores resides on the memory node of the directory. If the directory is remote to both cores, the latencies increase proportionally to the distance. In the worst case, where two cores are on different nodes, both 2 hops away from the directory, the latencies are 312 cycles. Even worse, even if both cores reside on the same node, they still have to access the remote directory, wasting any locality benefits.

In contrast, the Xeon does not have the locality issues of the Opteron. If the data are present within the socket, a load can be completed locally due to the inclusive LLC. Loading from the shared state is particularly interesting, because the LLC can directly serve the data without needing to probe the local caches of the previous holder (unlike the modified and exclusive states). However, the overhead of going off-socket on the Xeon is very high. For instance, loading from the shared state is 7.5 times more expensive over two hops than loading within the socket.

Unlike the large variability of the multi-sockets, the results are more stable on the single-sockets. On the Niagara, a load costs either an L1 or an L2 access, depending on whether the two threads reside on the same core. On the Tilera, the LLC is distributed, hence the latencies depend on the distance of the requesting core from the home tile of the cache line. The cost for two adjacent cores is 45 cycles, whereas for the two most remote cores5, it is 20 cycles higher (2 cycles per hop).

Stores. On the Opteron, both loads and stores on a modified or an exclusive cache line have similar latencies (no write-back to memory). However, a store on a shared line is different6. Every store on a shared or owned cache line incurs a broadcast invalidation to all nodes. This happens because the cache directory is incomplete (it does not keep track of the sharers) and does not in any way detect whether sharing is limited within the node7. Therefore, even if all sharers reside on the same node, a store needs to pay the overhead of a broadcast, thus increasing the cost from around 83 to 244 cycles. Obviously, the problem is aggravated if the directory is not local to any of the cores involved in the store. Finally, the scenario of storing on a cache line shared by all 48 cores costs 296 cycles.

5 10 hops distance on the 6-by-6 2-dimensional mesh of the Tilera.
6 For the store-on-shared test, we place two different sharers at the indicated distance from a third core that performs the store.

Again, the Xeon has the advantage of being able to locally complete an operation that involves solely cores of a single node. In general, stores behave similarly regardless of the previous state of the cache line. Finally, storing on a cache line shared by all 80 cores on the Xeon costs 445 cycles.

Similarly to a load, the results for a store exhibit much lower variability on the single-sockets. A store on the Niagara has essentially the latency of the L2, regardless of the previous state of the cache line and the number of sharers. On the Tilera, stores on a shared line are a bit more expensive due to the invalidation of the cache lines of the sharers. The cost of a store reaches a maximum of 200 cycles when all 36 cores share that line.

Atomic Operations. On the multi-sockets, CAS, TAS, FAI, and SWAP have essentially the same latencies. These latencies are similar to a store followed by a memory barrier. On the single-sockets, some operations clearly have different hardware implementations. For instance, on the Tilera, the FAI operation is faster than the others. Another interesting point is the latencies for performing an operation when the line is shared by all the cores of the system. On all platforms, the latencies follow the exact same trends as a store in the same scenario.

7 On the Xeon, the inclusive LLC is able to detect if there is sharing solely within the socket.


Implications. The latencies of cache coherence reveal some important issues that should be addressed in order to implement efficient synchronization. Cross-socket communication is 2 to 7.5 times more expensive than intra-socket communication. The problem on the broadcast-based design of the Xeon is larger than on the Opteron. However, within the socket, the inclusive LLC of the Xeon provides strong locality, which in turn translates into efficient intra-socket synchronization. In terms of locality, the incomplete directory of the Opteron is problematic in two ways. First, a read-write pattern of sharing will cause stores on owned and shared cache lines to exhibit the latency of a cross-socket operation, even if all sharers reside on the same socket. We thus expect intra-socket synchronization to behave similarly to cross-socket synchronization. Second, the location of the directory is crucial: if the cores that use some memory are remote to the directory, they pay the remote access overhead. To achieve good synchronization performance, the data have to originate from the local memory node (or be migrated to the local one). Overall, an Opteron MCM should be treated as a two-node platform.

The single-sockets exhibit quite a different behavior: they both use their LLCs for sharing. The latencies (to the LLC) on the Niagara are uniform, i.e., they are affected by neither the distance nor the number of the involved cores. We expect this uniformity to translate to synchronization that is not prone to contention. The non-uniform Tilera is affected both by the distance and the number of involved cores, therefore we expect scalability to be affected by contention. Regarding the atomic operations, both single-sockets have faster implementations for some of the operations (see Table 2). These should be preferred to achieve the best performance.

5.3 Enforcing Locality
A store to a shared or owned cache line on the Opteron induces an unnecessary broadcast of invalidations, even if all the involved cores reside on the same node (see Table 2). This results in a 3-fold increase in the latency of the store operation. In fact, to avoid this issue, we propose to explicitly maintain the cache line in the modified state. This can be easily achieved by calling the prefetchw x86 instruction before any load reference to that line. Of course, this optimization should be used with care because it disallows two cores to simultaneously hold a copy of the line.

Figure 3: Latency of acquire and release using different implementations of a ticket lock on the Opteron.

To illustrate the potential of this optimization, we engineer an efficient implementation of a ticket lock. A ticket lock consists of two counters: the next and the current. To acquire the lock, a thread atomically fetches and increments the next counter, i.e., it obtains a ticket. If the ticket equals the current counter, the thread has acquired the lock; otherwise, it spins until this becomes true. To release the lock, the thread increases the value of the current counter.

A particularly appealing characteristic of the ticket lock is the fact that the ticket minus the current counter is the number of threads queued before the current thread. Accordingly, it is intuitive to spin with a back-off proportional to the number of threads queued in front [29]. We use this back-off technique with and without the prefetchw optimization and compare the results with a non-optimized implementation of the ticket lock. Figure 3 depicts the latencies for acquiring and immediately releasing a single lock. Obviously, the non-optimized version scales terribly, delivering a latency of 720K cycles on 48 cores. In contrast, the versions with the proportional back-off scale significantly better. The prefetchw gives an extra performance boost, performing up to 2 times better on 48 cores.
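
A sketch of this scheme is shown below: a ticket lock whose waiters back off proportionally to the number of queued threads and issue a prefetchw hint before reading the current counter. It is written with C11 atomics and x86-64 inline assembly for illustration; it is not the exact SSYNC implementation, and the back-off constant in particular is arbitrary.

    #include <stdatomic.h>

    typedef struct {
        atomic_uint next;     /* ticket dispenser */
        atomic_uint current;  /* ticket currently being served */
    } ticket_lock_t;

    /* Bring the line holding *p into the local cache in modified state. */
    static inline void prefetch_for_write(const void *p)
    {
        __asm__ __volatile__("prefetchw %0" : : "m"(*(const volatile char *)p));
    }

    static void ticket_acquire(ticket_lock_t *l)
    {
        unsigned me = atomic_fetch_add_explicit(&l->next, 1, memory_order_relaxed);
        for (;;) {
            prefetch_for_write(&l->current);   /* keep the line modified (Section 5.3) */
            unsigned cur = atomic_load_explicit(&l->current, memory_order_acquire);
            if (cur == me)
                return;                        /* our turn */
            /* back off proportionally to the number of threads queued ahead of us */
            for (unsigned i = 0; i < (me - cur) * 512; i++)
                __asm__ __volatile__("pause");
        }
    }

    static void ticket_release(ticket_lock_t *l)
    {
        atomic_fetch_add_explicit(&l->current, 1, memory_order_release);
    }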

SSYNC uses the aforementioned optimization wherever possible. For example, the message passing implementation on the Opteron with this technique is up to 2.5 times faster than without it.

5.4 Stressing Atomic Operations
In this test, we stress the atomic operations. Each thread repeatedly tries to perform an atomic operation on a single shared location. For FAI, SWAP, and CAS FAI these calls are always eventually successful, i.e., they write to the target memory, whereas for TAS and CAS they are not. CAS FAI implements a FAI operation based on CAS. This enables us to highlight both the costs of spinning until the CAS is successful and the benefits of having a FAI instruction supported by the hardware. After completing a call, the thread pauses for a sufficient number of cycles to prevent the same thread from completing consecutive operations locally (long runs [31])8.
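
For illustration, a CAS-based fetch-and-increment of the kind CAS FAI performs can be sketched as follows; this shows the idea, not the exact benchmark code.

    #include <stdatomic.h>

    /* Fetch-and-increment emulated with a CAS retry loop. */
    static unsigned cas_fai(atomic_uint *target)
    {
        unsigned old = atomic_load_explicit(target, memory_order_relaxed);
        /* every failed CAS is another round of coherence traffic on the line */
        while (!atomic_compare_exchange_weak_explicit(target, &old, old + 1,
                                                      memory_order_acq_rel,
                                                      memory_order_relaxed))
            ;   /* on failure, 'old' is refreshed with the current value */
        return old;
    }

    /* With hardware support, the same operation is a single atomic instruction. */
    static unsigned hw_fai(atomic_uint *target)
    {
        return atomic_fetch_add_explicit(target, 1, memory_order_acq_rel);
    }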

On the multi-sockets, we allocate threads on the same socket and continue on the next socket once all cores of the previous one have been used. On the Niagara, we divide the threads evenly among the eight physical cores. On all platforms, we ensure that each thread allocates its local data from the local memory node. We repeat each experiment five times and show the average value.

8 The delay is proportional to the maximum latency across the involved cores and does not affect the total throughput in a way other than the intended.


Figure 4: Throughput of different atomic operations on a single memory location.

Figure 4 shows the results of this experiment. The multi-sockets exhibit a very steep decrease in the throughput once the location is accessed by more than one core. The latency of the operations increases from approximately 20 to 120 cycles. In contrast, the single-sockets generally show an increase in the throughput on the first few cores. This can be attributed to the cost of the local parts of the benchmark (e.g., a while loop) that consume time comparable to the latency of the operations. For more than six cores, however, the results stabilize (with a few exceptions).

Both the Opteron and the Xeon exhibit a stable throughput close to 20 Mops/s within a socket, which drops once there are cores on a second socket. Not surprisingly (see Table 2), the drop on the Xeon is larger than on the Opteron. The throughput on these platforms is dictated by the cache-coherence latencies, given that an atomic operation actually brings the data to its local cache. In contrast, on the single-sockets the throughput converges to a maximum value and exhibits no subsequent decrease. Some further interesting points worth highlighting are as follows. First, the Niagara (SPARC architecture) does not provide an atomic increment or swap instruction. Their implementations are based on CAS, therefore the behavior of FAI and CAS FAI is practically identical. SWAP shows some fluctuations on the Niagara, which we believe are caused by the scheduling of the hardware threads. However, SPARC provides a hardware TAS implementation that proves to be highly efficient. Likewise, the FAI implementation on the Tilera slightly outperforms the other operations.

Implications. Both multi-sockets have a very fast single-thread performance that drops on two or more cores and decreases further when there is cross-socket communication. Contrarily, both single-sockets have a lower single-thread throughput, but scale to a maximum value that is subsequently maintained regardless of the number of cores. This behavior indicates that globally stressing a cache line with atomic operations will introduce performance bottlenecks on the multi-sockets, while being somewhat less of a problem on the single-sockets. Finally, a system designer should take advantage of the best performing atomic operations available on each platform, like the TAS on the Niagara.

6 Software-Level Analysis
This section describes the software-oriented part of this study. We start by analyzing the behavior of locks under different levels of contention and continue with message passing. We use the same methodology as in Section 5.4. In addition, the globally shared data are allocated from the first participating memory node. We finally report on our findings on higher-level concurrent software.

6.1 Locks
We evaluate the locks in SSYNC under various degrees of contention on our platforms.

6.1.1 Uncontested Locking
In this experiment we measure the latency to acquire a lock based on the location of the previous holder. Although in a number of cases acquiring a lock does involve contention, a large portion of acquisitions in applications are uncontested, hence they have a similar behavior to this experiment.

Initially, we place a single thread that repeatedly acquires and releases the lock. We then add a second thread, as close as possible to the first one, and pin it further away in subsequent runs.

Figure 6: Uncontested lock acquisition latency based on the location of the previous owner of the lock.


Figure 5: Throughput of different lock algorithms using a single lock. (Best single-thread throughput: 22 Mops/s on the Opteron, 34 Mops/s on the Xeon.)

Figure 6 contains the latencies of the different locks when the previous holder is at various distances. Latencies suffer important increases on the multi-sockets as the second thread moves further from the first. In general, acquisitions that need to transfer data across sockets have a high cost. Remote acquisitions can be up to 12.5 and 11 times more expensive than local ones on the Opteron and the Xeon, respectively. In contrast, due to the shared and distributed LLCs, the Niagara and the Tilera suffer no and only a slight performance decrease, respectively, as the location of the second thread changes. The latencies of the locks are in accordance with the cache-coherence latencies presented in Table 2.

Moreover, the differences in the latencies are significantly larger between locks on the multi-sockets than on the single-sockets, making the choice of the lock algorithm in an uncontested scenario paramount to performance. More precisely, while spin locks closely follow the cache-coherence latencies, more complex locks generally introduce some additional overhead.

Implications. Using a lock, even if no contention is involved, is up to one order of magnitude more expensive when crossing sockets. The 350-450 cycles on a multi-socket, and the roughly 200 cycles on a single-socket, are not negligible, especially if the critical sections are short. Moreover, the penalties induced when crossing sockets in terms of latency tend to be higher for complex locks than for simple locks. Therefore, regardless of the platform, simple locks should be preferred when contention is very low.

6.1.2 Lock Algorithm Behavior
We study the behavior of locks under extreme and very low contention. On the one hand, highly contended locks are often the main scalability bottleneck. On the other hand, a large number of systems use locking strategies, such as fine-grained locks, that induce low contention. Therefore, good performance in these two scenarios is essential. We measure the total throughput of lock acquisitions that can be performed using each of the locks.

Each thread acquires a random lock, reads and writes one corresponding cache line of data, and releases the lock. Similarly to the atomic operations stress test (Section 5.4), in the extreme contention experiment (one lock), a thread pauses after it releases the lock, in order to ensure that the release becomes visible to the other cores before retrying to acquire the lock. Given the uniform structure of the platforms, we do not use hierarchical locks on the single-socket machines.
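
The per-thread loop of this test is conceptually as follows; the lock type, the number of locks, and the length of the pause are placeholders for illustration (an x86 pause is used for the delay), not the actual SSYNC parameters.

    #include <stdatomic.h>

    #define NUM_LOCKS 512   /* 1 for the extreme-contention experiment */

    typedef struct { atomic_int held; } lock_t;               /* placeholder lock */
    typedef struct { volatile unsigned long w[8]; } line_t;   /* one cache line   */

    static lock_t locks[NUM_LOCKS];
    static line_t data[NUM_LOCKS];

    static void lock_acquire(lock_t *l)
    {
        while (atomic_exchange_explicit(&l->held, 1, memory_order_acquire))
            ;
    }

    static void lock_release(lock_t *l)
    {
        atomic_store_explicit(&l->held, 0, memory_order_release);
    }

    static void worker(unsigned seed, unsigned long *ops)
    {
        for (;;) {
            seed = seed * 1103515245u + 12345u;       /* tiny PRNG               */
            unsigned i = (seed >> 16) % NUM_LOCKS;    /* pick a random lock      */
            lock_acquire(&locks[i]);
            unsigned long v = data[i].w[0];           /* read the protected line */
            data[i].w[0] = v + 1;                     /* ... and write it        */
            lock_release(&locks[i]);
            for (int p = 0; p < 128; p++)             /* pause so the release is */
                __asm__ __volatile__("pause");        /* seen before we retry    */
            (*ops)++;
        }
    }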

Extreme contention. The results of the maximum contention experiment (one lock) are depicted in Figure 5. As we described in Section 5, the Xeon exhibits very strong locality within a socket. Accordingly, the hierarchical locks, i.e., hticket and HCLH, perform the best by taking advantage of that. Although there is a very big drop from one to two cores on the multi-sockets, within the socket both the Opteron and the Xeon manage to keep a rather stable performance. However, once a second socket is involved, the throughput decreases again.

Not surprisingly, the CLH and the MCS locks are the most resilient to contention. They both guarantee that a single thread is spinning on each cache line and use the globally shared data only to enqueue for acquiring the lock. The ticket lock proves to be the best spin lock on this workload. Overall, the throughput on two or more cores on the multi-sockets is an order of magnitude lower than the single-core performance. In contrast, the single-sockets maintain a comparable performance on multiple cores.
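
To illustrate why the queue locks behave this way, here is a compact MCS lock sketch using GCC/Clang atomic built-ins: each waiter spins on the locked flag of its own queue node, and the only globally shared word (the tail pointer) is touched when enqueueing and dequeueing. This is a textbook-style sketch, not the libslock code.

    #include <stddef.h>

    typedef struct mcs_node {
        struct mcs_node *next;
        int              locked;          /* each waiter spins on its own node */
    } mcs_node_t;

    typedef struct { mcs_node_t *tail; } mcs_lock_t;   /* the only shared word */

    static void mcs_acquire(mcs_lock_t *lk, mcs_node_t *me)
    {
        me->next = NULL;
        me->locked = 1;
        /* swap ourselves in as the new tail; the previous tail is our predecessor */
        mcs_node_t *prev = __atomic_exchange_n(&lk->tail, me, __ATOMIC_ACQ_REL);
        if (prev != NULL) {
            __atomic_store_n(&prev->next, me, __ATOMIC_RELEASE);
            while (__atomic_load_n(&me->locked, __ATOMIC_ACQUIRE))
                ;                         /* spin on our own cache line */
        }
    }

    static void mcs_release(mcs_lock_t *lk, mcs_node_t *me)
    {
        mcs_node_t *succ = __atomic_load_n(&me->next, __ATOMIC_ACQUIRE);
        if (succ == NULL) {
            mcs_node_t *expected = me;
            /* no known successor: try to swing the tail back to empty */
            if (__atomic_compare_exchange_n(&lk->tail, &expected, NULL, 0,
                                            __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE))
                return;
            /* a successor is enqueueing concurrently; wait for it to link in */
            while ((succ = __atomic_load_n(&me->next, __ATOMIC_ACQUIRE)) == NULL)
                ;
        }
        __atomic_store_n(&succ->locked, 0, __ATOMIC_RELEASE);  /* hand over the lock */
    }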

Very low contention. The very low contention results (512 locks) are shown in Figure 7. Once again, one can observe the strong intra-socket locality of the Xeon. In general, simple locks match or even outperform the more complex queue locks. While on the Xeon the differences between locks become insignificant for a large number of cores, it is generally the ticket lock that performs the best on the Opteron, the Niagara, and the Tilera. In a low-contention scenario it is thus difficult to justify the memory requirements that complex lock algorithms have.


Figure 7: Throughput of different lock algorithms using 512 locks.

Figure 8: Throughput and scalability of locks depending on the number of locks (4, 16, 32, and 128 locks). The “X : Y” labels on top of each bar indicate the best-performing lock (Y) and the scalability over the single-thread execution (X).

It should be noted that, aside from the acquisitions and releases, the load and the store on the protected data also contribute to the lack of scalability of multi-sockets, for the reasons pointed out in Section 5.

Implications. None of the locks is consistently the best on all platforms. Moreover, no lock is consistently the best within a platform. While complex locks are generally the best under extreme contention, simple locks perform better under low contention. Under high contention, hierarchical locks should be used on multi-sockets with strong intra-socket locality, such as the Xeon. The Opteron, due to the previously discussed locality issues, and the single-sockets favor queue locks. In case of low contention, simple locks are better than complex implementations within a socket. Under extreme contention, while not as good as more complex locks, a ticket lock can avoid performance collapse within a socket. On the Xeon, the best performance is achieved when all threads run on the same socket, both for high and for low contention. Therefore, synchronization between sockets should be limited to the absolute minimum on such platforms. Finally, we observe that when each core is dedicated to a single thread, there is no scenario in which Pthread Mutexes perform the best. Mutexes are, however, useful when threads contend for a core. Therefore, unless multiple threads run on the same core, alternative implementations should be preferred.

6.1.3 Cross-Platform Lock Behavior
In this experiment, we compare lock behavior under various degrees of contention across architectures.

In order to have a straightforward cross-platform comparison, we run the tests on up to 36 cores. Having already explored the lock behavior of different algorithms, we only report the highest throughput achieved by any of the locks on each platform. We vary the contention by running experiments with 4, 16, 32, and 128 locks, thus examining high, intermediate, and low degrees of contention.

The results are shown in Figure 8. In all cases, the differences between the single- and multi-sockets are noticeable. Under high contention, single-sockets prevent performance collapse from one thread to two or more, whereas in the lower contention cases these platforms scale well. As noted in Section 5, stores and atomic operations are affected by contention on the Tilera, resulting in slightly less scalability than on the Niagara: on high contention workloads, the uniformity of the Niagara delivers up to 1.7 times more scalability (i.e., the rate at which performance increases) than the Tilera. In contrast, multi-sockets exhibit a significantly lower throughput for high contention when compared to single-thread performance. Multi-sockets provide limited scalability even in the low contention scenarios. The direct cause of this contrasting behavior is the higher latencies of the cache-coherence transitions on multi-sockets, as well as the differences in the throughput of the atomic operations. It is worth noticing that the Xeon scales well when all the threads are within a socket.


Figure 9: One-to-one communication latencies of message passing depending on the distance between the two cores.

Performance, however, severely degrades even with one thread on a remote socket. In contrast, the Opteron shows poor scalability regardless of the number of threads. The reason for this difference is the limited locality of the Opteron we discussed in Section 5.

Implications. There is a significant difference in scalability trends between multi- and single-sockets across various degrees of contention. Moreover, even a small degree of non-uniformity can have an impact on scalability. As contention drops, simple locks should be used in order to achieve high throughput on all architectures. Overall, we argue that synchronization-intensive systems should favor platforms that provide locality, i.e., platforms that can prevent cross-socket communication.

6.2 Message Passing
We evaluate the message passing implementations of SSYNC. To capture the most prominent communication patterns of a message passing application, we evaluate both one-to-one and client-server communication. The size of a message is 64 bytes (a cache line).

One-to-One Communication. Figure 9 depicts the latencies of two cores that exchange one-way and round-trip messages. As expected, the Tilera’s hardware message passing performs the best. Not surprisingly, a one-way message over cache coherence costs roughly twice the latency of transferring a cache line. Once a core x receives a message, it brings the receive buffer (i.e., a cache line) into its local caches. Consequently, the second core y has to fetch the buffer (first cache-line transfer) in order to write a new message. Afterwards, x has to re-fetch the buffer (second transfer) to get the message. Accordingly, the round-trip case takes approximately four times the cost of a cache-line transfer. The reasoning is exactly the same with one-way messages, but applies to both ways: send and then receive.

Client-Server Communication. Figure 10 depicts the one-way and round-trip throughput for a client-server pattern with a single server. Again, the hardware message passing of the Tilera performs the best. With 35 clients, one server delivers up to 16 Mops/s (round-trip) on the Tilera (less on the other platforms).

Figure 10: Total throughput (Mops/s) of client-server communication as a function of the number of clients.

In this benchmark, the server does not perform any computation between messages; therefore, the 16 Mops/s constitute an upper bound on the performance of a single server. It is interesting to note that if we reduce the size of the message to a single word, the throughput on the Tilera is 27 Mops/s for round-trip and more than 100 Mops/s for one-way messages.

Two additional observations are worth mentioning. First, the Xeon performs very well within a socket, especially for one-way messages; the inclusive LLC plays the role of the buffer for exchanging messages. However, even with a single client on a remote socket, the throughput drops from 25 to 8 Mops/s. The second point is that, as the number of cores increases, the round-trip throughput becomes higher than the one-way throughput on the Xeon. We also observe this effect on the Opteron once the length of the local computation of the server increases (not shown in the graph). This happens because the request-response model enables the server to efficiently prefetch the incoming messages. With one-way messages, the clients keep trying to send messages, saturating the incoming queues of the server. This leads to the clients busy-waiting on cache lines that already contain a message. Therefore, even if the server prefetches a message, the client will soon bring the cache line back to its own caches (or transform it to shared), making the subsequent operations of the server more expensive.

Implications. Message passing can achieve latencies similar to transferring a cache line from one core to another. This behavior is only slightly affected by contention, because each pair of cores uses individual cache lines for communication. This applies to both one-to-one and client-server communication. However, a single server has a rather low upper bound on the throughput it can achieve, even when not executing any computation. In a sense, we have to trade performance for scalability.
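To illustrate why the request-response (round-trip) pattern lends itself to prefetching, the following sketch shows a hypothetical single server polling per-client, cache-line-sized slots and issuing a software prefetch for the next slot while serving the current one. The slot layout, the state encoding, and handle_request are assumptions made for this example; __builtin_prefetch is a GCC/Clang builtin.

```c
/* Hypothetical client-server slot polling with prefetching; illustrative only. */
#include <stdatomic.h>
#include <stdint.h>

#define CACHE_LINE   64
#define NUM_CLIENTS  35                /* assumption: matches the benchmark above */

enum { SLOT_EMPTY = 0, SLOT_REQUEST = 1, SLOT_RESPONSE = 2 };

typedef struct {
    _Alignas(CACHE_LINE) atomic_int state;           /* one of the SLOT_* values */
    uint8_t data[CACHE_LINE - sizeof(atomic_int)];   /* request/response payload */
} slot_t;

static slot_t slots[NUM_CLIENTS];

/* Placeholder for the server's per-request work (none in the benchmark). */
static void handle_request(uint8_t *data) { (void)data; }

void server_loop(void)
{
    for (;;) {
        for (int c = 0; c < NUM_CLIENTS; c++) {
            /* Ask the hardware to start fetching the next client's slot
             * while the current one is being served. */
            __builtin_prefetch(&slots[(c + 1) % NUM_CLIENTS], 0, 1);

            if (atomic_load_explicit(&slots[c].state,
                                     memory_order_acquire) == SLOT_REQUEST) {
                handle_request(slots[c].data);
                /* Publish the response; the blocked client spins on this
                 * line and leaves the slot quiet until its next request. */
                atomic_store_explicit(&slots[c].state, SLOT_RESPONSE,
                                      memory_order_release);
            }
        }
    }
}
```

In the one-way case, a client would instead overwrite its slot as soon as it could, invalidating any line the server had prefetched, which is consistent with the behavior described above.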

6.3 Hash Table (ssht)

We evaluate ssht, i.e., the concurrent hash table implementation of SSYNC, under low (512 buckets) and high (12 buckets) contention, as well as with short (12 elements) and long (48 elements) buckets. We use 80% get, 10% put, and 10% remove operations, so as to keep the size of the hash table constant.


Figure 11: Throughput and scalability of the hash table (ssht) on different configurations (512 or 12 buckets, with 12 or 48 entries per bucket), for the lock-based and message passing (mp) versions on the Opteron, Xeon, Niagara, and Tilera. The "X : Y" labels on top of each bar indicate the best-performing lock (Y) and the scalability over the single-thread execution (X).

We configure ssht so that each bucket is protected by a single lock, the keys are 64-bit integers, and the payload size is 64 bytes. The scalability trends hold on other configurations as well. Finally, we configure the message passing (mp) version to use (i) one server per three cores (this configuration achieves the highest throughput) and (ii) round-trip operations, i.e., all operations block, waiting for a response from the server. It should be noted that dedicating some threads as servers reduces the contention induced on the shared data of the application. Figure 11 depicts the results on the four target platforms for the aforementioned scenarios (the single-thread throughput for message passing is actually a result of a one server / one client execution).

Low Contention. Increasing the length of the critical sections increases the scalability of the lock-based ssht on all platforms except the Tilera. The multi-sockets benefit from the efficient prefetching of the data of a bucket. All three systems benefit from the lower single-thread performance, which leads to higher scalability ratios. On the Tilera, the local data contend with the shared data for the L2 cache space, reducing scalability. On this workload, the message passing implementation is strictly slower than the lock-based ones, even on the Tilera. It is interesting to note that the Xeon scales slightly even outside the 10 cores of a socket, thus delivering the highest throughput among all platforms. Finally, the best performance in this scenario is achieved by simple spin locks.

High Contention. The situation is radically different under high contention. First, the message passing version not only outperforms the lock-based ones on three out of the four platforms (for high core counts), but it also delivers by far the highest throughput. The hardware threads of the Niagara do not favor client-server solutions; the servers are delayed due to the sharing of the core's resources with other threads. However, the Niagara achieves a 10-fold performance increase on 36 threads, which is the best scalability among the lock-based versions and approaches the optimal 12-fold scalability. It is worth mentioning that if we do not explicitly pin the threads on cores, the multi-sockets deliver 4 to 6 times lower maximum throughput on this workload.

Summary. These experiments illustrate two major points. First, increasing the length of a critical section can partially hide the costs of synchronization under low contention. This, of course, assumes that the data accessed in the critical section are mostly being read (so they can be shared) and follow a specific access pattern (so they can be prefetched). Second, the results illustrate how message passing can provide better scalability and performance than locking under extreme contention.
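For concreteness, the following is a minimal sketch of the per-bucket locking scheme described above: one lock per bucket, 64-bit keys, and a 64-byte payload. It uses Pthread spinlocks and is only an approximation under these assumptions, not the actual ssht code; the remove operation and error handling are omitted for brevity.

```c
/* Per-bucket-lock hash table sketch; illustrative only, not ssht itself. */
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define NUM_BUCKETS   512            /* 512 = low contention, 12 = high contention */
#define PAYLOAD_SIZE  64

typedef struct entry {
    uint64_t      key;
    uint8_t       payload[PAYLOAD_SIZE];
    struct entry *next;
} entry_t;

typedef struct {
    pthread_spinlock_t lock;         /* a single lock protects the whole bucket */
    entry_t           *head;
} bucket_t;

static bucket_t table[NUM_BUCKETS];

static void ht_init(void)
{
    for (int i = 0; i < NUM_BUCKETS; i++)
        pthread_spin_init(&table[i].lock, PTHREAD_PROCESS_PRIVATE);
}

static bucket_t *bucket_for(uint64_t key)
{
    return &table[key % NUM_BUCKETS];
}

/* get: 80% of the operations in the workload above. */
static int ht_get(uint64_t key, uint8_t *out)
{
    bucket_t *b = bucket_for(key);
    int found = 0;

    pthread_spin_lock(&b->lock);
    for (entry_t *e = b->head; e != NULL; e = e->next) {
        if (e->key == key) {
            memcpy(out, e->payload, PAYLOAD_SIZE);
            found = 1;
            break;
        }
    }
    pthread_spin_unlock(&b->lock);
    return found;
}

/* put: 10% of the operations; inserts at the head of the bucket list. */
static void ht_put(uint64_t key, const uint8_t *payload)
{
    bucket_t *b = bucket_for(key);
    entry_t  *e = malloc(sizeof(*e));   /* error handling omitted */

    e->key = key;
    memcpy(e->payload, payload, PAYLOAD_SIZE);

    pthread_spin_lock(&b->lock);
    e->next = b->head;
    b->head = e;
    pthread_spin_unlock(&b->lock);
}
```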

6.4 Key-Value Store (Memcached)

Memcached (v. 1.4.15) [30] is an in-memory key-value store based on a hash table. The hash table has a large number of buckets and is protected by fine-grain locks. However, during certain rebalancing and maintenance tasks, it dynamically switches to a global lock for short periods of time. Since we are interested in the effect of synchronization on performance and scalability, we replace the default Pthread Mutexes that protect the hash table, as well as the global locks, with the interface provided by libslock. In order to stress Memcached, we use the memslap tool from the libmemcached library [25] (v. 1.0.15). We deploy memslap on a remote server and use its default configuration. We use 500 client threads and run a get-only and a set-only test.

Get. The get test does not cause any switches to global locks. Due to the essentially non-existent contention, the lock algorithm has little effect in this test. In fact, even completely removing the locks of the hash table does not result in any performance difference. This indicates that there are bottlenecks other than synchronization.

Set. A write-intensive workload, however, stresses a number of global locks, which introduces contention. In the set test, the differences in lock behavior translate into differences in the performance of the application as well. Figure 12 shows the throughput on various platforms using different locks. We do not present the results on more than 18 cores, since none of the platforms scales further.
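The lock substitution described above can be illustrated with a small compile-time shim that hides the lock behind a uniform interface, so that the same call sites can be backed either by a Pthread mutex or by a simple TAS spin lock. This is a hypothetical sketch of the methodology, not libslock's actual interface; the app_lock names and the USE_TAS_LOCK flag are ours.

```c
/* Hypothetical lock-substitution shim; illustrative only. */
#include <pthread.h>
#include <stdatomic.h>

#ifdef USE_TAS_LOCK

typedef atomic_flag app_lock_t;
#define APP_LOCK_INIT ATOMIC_FLAG_INIT

static inline void app_lock(app_lock_t *l)
{
    /* Simple test-and-set spin lock. */
    while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
        ;
}

static inline void app_unlock(app_lock_t *l)
{
    atomic_flag_clear_explicit(l, memory_order_release);
}

#else  /* default: plain Pthread mutex */

typedef pthread_mutex_t app_lock_t;
#define APP_LOCK_INIT PTHREAD_MUTEX_INITIALIZER

static inline void app_lock(app_lock_t *l)   { pthread_mutex_lock(l); }
static inline void app_unlock(app_lock_t *l) { pthread_mutex_unlock(l); }

#endif

/* Example use: a global lock of the kind the set-only test contends on. */
static app_lock_t maintenance_lock = APP_LOCK_INIT;

void do_maintenance(void)
{
    app_lock(&maintenance_lock);
    /* ... short critical section ... */
    app_unlock(&maintenance_lock);
}
```

Compiling with -DUSE_TAS_LOCK switches every call site at once, which mirrors the kind of drop-in lock replacement performed in this experiment.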


Figure 12: Throughput (Kops/s) of Memcached using a set-only test, for 1 to 18 threads on each platform, with MUTEX, TAS, TICKET, and MCS locks. The maximum speed-up vs. single thread is indicated under the platform names (Opteron: 3.9x, Xeon: 6x, Niagara: 6.03x, Tilera: 5.9x).

Changing the Mutexes to ticket, MCS, or TAS locks achieves speedups between 29% and 50% on three of the four platforms (the bottleneck on the Niagara is due to network and OS issues). Moreover, the cache-coherence implementation of the Opteron proves again problematic: due to the periodic accesses to global locks, the previously presented issues strongly manifest, resulting in a maximum speedup of 3.9. On the Xeon, the throughput increases while all threads are running within a socket, after which it starts to decrease. Finally, thread scheduling has an important impact on performance: not allocating threads to the appropriate cores decreases performance by 20% on the multi-sockets.

Summary. Even in an application where the main limitations to performance are networking and the main memory, the impact of cache coherence and synchronization primitives is still important when contention is involved. When there is no contention and the data is either prefetched or read from the main memory, synchronization is less of an issue.

7 Related Work

Our study of synchronization is, to date, the most exhaustive in that it covers a wide range of schemes, spanning different layers and different many-core architectures. In the following, we discuss specific related work that concerns the main layers of our study.

Leveraging Cache Coherence. The characteristics of cache-coherence protocols on x86 multi-sockets have been investigated by a number of studies [17, 33], whose focus is on bandwidth limitations. The cache-coherence latencies are measured from the point of view of loading data. We extend these studies by also measuring stores and atomic operations and by analyzing the impact on higher-level concurrent software. Molka et al. [33] consider the effect of the AMD and Intel memory hierarchy characteristics on various workloads from SPEC OMPM2001 and conclude that throughput is mainly dictated by memory limitations. The results are thus of limited relevance for systems involving high contention. Moses et al. [35] use simulations to show that increasing non-uniformity entails a decrease in the performance of the TTAS lock under high contention. However, the conclusions are limited to spin locks and one specific hardware model. We generalize and quantify such observations on commonly used architectures and synchronization schemes, while also analyzing their implications.

Scaling Locks. Mellor-Crummey et al. [29] and Anderson [4] point out the lack of scalability of traditional spin locks. They introduce and test several alternatives, such as queue locks. Their evaluation is performed on large-scale multiprocessors, on which the memory latency distribution is significantly different than on today's many-cores. Luchangco et al. [27] study a NUMA-aware hierarchical CLH lock and compare its performance with a number of well-known locks. Aside from a Niagara processor, no other modern many-core is used. Our analysis extends their evaluation with more locks, more hardware platforms, and more layers.

Other studies focus on the Linux kernel [12] and conclude that the default ticket lock implementation causes important performance bottlenecks in the OS on a multi-core. Performance is improved in a number of different scenarios by replacing the ticket locks with complex locks. We confirm that plain spin locks do not scale across sockets and present some optimizations that alleviate the issue.

Various techniques have been proposed in order to improve the performance of highly contended locks, especially on multi-sockets. For example, combining [18] is an approach in which a thread can execute critical sections on behalf of others. In particular, RCL [26] replaces the "lock, execute, and unlock" pattern with remote procedure calls to a dedicated server core. For highly contended critical sections, this approach hides the contention behind messages and enables the server to locally access the protected data. However, as the RCL paper mentions, the scope of this solution is limited to high contention and a large number of cores, as our results on message passing confirm.

Scaling Systems on Many-Cores. In order to improve OS scalability on many-cores, a number of approaches deviate from traditional kernel designs. The OS is typically restructured to either improve locality (e.g., Tornado [15]), limit sharing (e.g., Corey [10]), or avoid resource sharing altogether by using message passing (e.g., Barrelfish [6], fos [45]). Boyd-Wickizer et al. [11] aim at verifying whether these scalability issues are indeed inherent to the Linux kernel design. The authors show how optimizing, using various concurrent programming techniques, removes several scalability issues from both the kernel and the applications. By doing so, they conclude that it is not necessary to give up the traditional kernel structure just yet.


Our study confirms these papers' observation that synchronization can be an important bottleneck. We also go a step further: we study the roots of the problem and observe that many of the detected issues are in fact hardware-related and would not necessarily manifest on different platforms.

8 Concluding Remarks

Summary. This paper dissects the cost of synchronization and studies its scalability along different directions. Our analysis extends from basic hardware synchronization protocols and primitives all the way to complex concurrent software. We also consider different representative hardware architectures. The results of our experiments and our cross-platform synchronization suite, SSYNC, can be used to evaluate the potential for scaling synchronization on different platforms and to develop concurrent applications and systems.

Observations & Ramifications. Our experiments yield various observations about synchronization on many-cores. The first obvious one is that crossing sockets significantly impacts synchronization, regardless of the layer, e.g., cache coherence, atomic operations, locks. Synchronization scales much better within a single socket, irrespective of the contention level. Systems with heavy sharing should reduce cross-socket synchronization to the minimum. As we pointed out, this is not always possible (e.g., on a multi-socket AMD Opteron), for the hardware can still induce cross-socket traffic, even if sharing is explicitly restricted within a socket. Message passing can be viewed as a way to reduce sharing, as it enforces partitioning of the shared data. However, it comes at the cost of lower performance (than locks) on a few cores or under low contention.

Another observation is that non-uniformity affects scalability even within a single-socket many-core, i.e., synchronization on a Sun Niagara 2 scales better than on a Tilera TILE-Gx36. Consequently, even on a single-socket many-core such as the TILE-Gx36, a system should reduce the amount of highly contended data to avoid performance degradation (due to the hardware).

We also notice that each of the nine state-of-the-art lock algorithms we evaluate performs best on at least one workload/platform combination. Nevertheless, if we reduce the context of synchronization to a single socket (either one socket of a multi-socket, or a single-socket many-core), then our results indicate that spin locks should be preferred over more complex locks. Complex locks have lower uncontested performance, a larger memory footprint, and only outperform spin locks under relatively high contention.

Finally, regarding hardware, we highlight the fact that implementing multi-socket coherence using broadcast or an incomplete directory (as on the Opteron) is not favorable to synchronization. One way to cope with this is to employ a directory combined with an inclusive LLC, both for intra-socket sharing and for detecting whether a cache line is shared outside the socket.

Miscellaneous. Due to the lack of space, we had to omit some rather interesting results. In particular, we also conducted our analysis on small-scale multi-sockets, i.e., a 2-socket AMD Opteron 2384 and a 2-socket Intel Xeon X5660. Overall, the scalability trends on these platforms are practically identical to those on the large-scale multi-sockets. For instance, the cross-socket cache-coherence latencies are roughly 1.6 and 2.7 times higher than the intra-socket latencies on the 2-socket Opteron and Xeon, respectively. In addition, we did not include the results of the software transactional memory. In brief, these results are in accordance with the results of the hash table (Section 6.3), both for locks and message passing. Furthermore, we also considered the MonetDB [34] column-store. In short, the behavior of the TPC-H benchmarks [42] on MonetDB is similar to the get workload of Memcached (Section 6.4): synchronization is not a bottleneck.

Limitations & Future Work. We cover what we believe to be the "standard" synchronization schemes and many-core architectures used today. Nevertheless, we do not study lock-free [19] techniques, an appealing way of designing mutual-exclusion-free data structures. There is also a recent body of work focusing on serializing critical sections over message passing by sending requests to a single server [18, 26]. It would be interesting to extend our experiments to those schemes as well.

Moreover, since the beginning of the multi-core revolution, power consumption has become increasingly important [8, 9]. It would be interesting to compare different platforms and synchronization schemes in terms of performance per watt. Finally, to facilitate synchronization, Intel has introduced the Transactional Synchronization Extensions (TSX) [24] in its Haswell micro-architecture. We will experiment with TSX in SSYNC.

Acknowledgements

We wish to thank our shepherd, Sam King, and the anonymous reviewers for their helpful comments. We would also like to thank Christos Kozyrakis for the fruitful discussions at the incipient stages of this work, as well as Babak Falsafi, Tim Harris, Luis Ceze, Rodrigo Rodrigues, Edouard Bugnion, and Mark Moir for their comments on an early draft of this paper. Finally, we wish to thank the EPFL DIAS lab and EcoCloud for providing us with some of the platforms employed in this study. Part of this work was funded by the European Union Seventh Framework Programme (FP7) under grant agreement 248465, the S(o)OS project, and by the Swiss National Science Foundation project 200021 143915, Reusable Concurrent Data Types.


References

[1] J. Abellan, J. Fernandez, and M. Acacio. GLocks: Efficient support for highly-contended locks in many-core CMPs. IPDPS 2011, pages 893–905.
[2] AMD. Software optimization guide for AMD family 10h and 12h processors. 2011.
[3] G. Amdahl. Validity of the single processor approach to achieving large scale computing capabilities. AFIPS 1967 (Spring), pages 483–485.
[4] T. E. Anderson. The performance of spin lock alternatives for shared-memory multiprocessors. IEEE TPDS, 1(1):6–16, 1990.
[5] H. Attiya, A. Bar-Noy, and D. Dolev. Sharing memory robustly in message-passing systems. PODC 1990, pages 363–375.
[6] A. Baumann, P. Barham, P. Dagand, T. Harris, R. Isaacs, S. Peter, T. Roscoe, A. Schupbach, and A. Singhania. The multikernel: a new OS architecture for scalable multicore systems. SOSP 2009, pages 29–44.
[7] L. Boguslavsky, K. Harzallah, A. Kreinen, K. Sevcik, and A. Vainshtein. Optimal strategies for spinning and blocking. J. Parallel Distrib. Comput., 21(2):246–254, 1994.
[8] S. Borkar. Design challenges of technology scaling. Micro, IEEE, 19(4):23–29, 1999.
[9] S. Borkar and A. Chien. The future of microprocessors. Communications of the ACM, 54(5):67–77, 2011.
[10] S. Boyd-Wickizer, H. Chen, R. Chen, Y. Mao, F. Kaashoek, R. Morris, A. Pesterev, L. Stein, M. Wu, Y. Dai, Y. Zhang, and Z. Zhang. Corey: an operating system for many cores. OSDI 2008, pages 43–57.
[11] S. Boyd-Wickizer, A. Clements, Y. Mao, A. Pesterev, M. Kaashoek, R. Morris, and N. Zeldovich. An analysis of Linux scalability to many cores. OSDI 2010, pages 1–16.
[12] S. Boyd-Wickizer, M. Kaashoek, R. Morris, and N. Zeldovich. Non-scalable locks are dangerous. Proceedings of the Linux Symposium, 2012.
[13] P. Conway, N. Kalyanasundharam, G. Donley, K. Lepak, and B. Hughes. Cache hierarchy and memory subsystem of the AMD Opteron processor. Micro, IEEE, 30(2):16–29, 2010.
[14] D. Dice, V. Marathe, and N. Shavit. Lock cohorting: a general technique for designing NUMA locks. PPoPP 2012, pages 247–256.
[15] B. Gamsa, O. Krieger, J. Appavoo, and M. Stumm. Tornado: Maximizing locality and concurrency in a shared memory multiprocessor operating system. OSDI 1999, pages 87–100.
[16] V. Gramoli, R. Guerraoui, and V. Trigonakis. TM2C: a software transactional memory for many-cores. EuroSys 2012, pages 351–364.
[17] D. Hackenberg, D. Molka, and W. Nagel. Comparing cache architectures and coherency protocols on x86-64 multicore SMP systems. MICRO 2009, pages 413–422.
[18] D. Hendler, I. Incze, N. Shavit, and M. Tzafrir. Flat combining and the synchronization-parallelism tradeoff. SPAA 2010, pages 355–364.
[19] M. Herlihy. Wait-free synchronization. ACM Transactions on Programming Languages and Systems, 13(1):124–149, 1991.
[20] M. Herlihy and N. Shavit. The art of multiprocessor programming, revised first edition. 2012.
[21] M. Hill and M. Marty. Amdahl's law in the multicore era. Computer, 41(7):33–38, 2008.
[22] Intel. An introduction to the Intel QuickPath interconnect. 2009.
[23] Intel. Intel 64 and IA-32 architectures software developer's manual. 2013.
[24] Intel. Transactional Synchronization Extensions overview. 2013.
[25] libmemcached. http://libmemcached.org/libMemcached.html.
[26] J. Lozi, F. David, G. Thomas, J. Lawall, and G. Muller. Remote core locking: migrating critical-section execution to improve the performance of multithreaded applications. USENIX ATC 2012.
[27] V. Luchangco, D. Nussbaum, and N. Shavit. A hierarchical CLH queue lock. ICPP 2006, pages 801–810.
[28] J. Mellor-Crummey and M. Scott. Synchronization without contention. ASPLOS 1991, pages 269–278.
[29] J. Mellor-Crummey and M. Scott. Algorithms for scalable synchronization on shared-memory multiprocessors. ACM TOCS, 1991.
[30] Memcached. http://www.memcached.org.
[31] M. Michael and M. Scott. Simple, fast, and practical non-blocking and blocking concurrent queue algorithms. PODC 1996, pages 267–275.
[32] Sun Microsystems. UltraSPARC T2 supplement to the UltraSPARC architecture. 2007.
[33] D. Molka, R. Schone, D. Hackenberg, and M. Muller. Memory performance and SPEC OpenMP scalability on quad-socket x86 64 systems. ICA3PP 2011, pages 170–181.
[34] MonetDB. http://www.monetdb.org/.
[35] J. Moses, R. Illikkal, L. Zhao, S. Makineni, and D. Newell. Effects of locking and synchronization on future large scale CMP platforms. CAECW 2006.
[36] U. Nawathe, M. Hassan, K. Yen, A. Kumar, A. Ramachandran, and D. Greenhill. Implementation of an 8-core, 64-thread, power-efficient SPARC server on a chip. Solid-State Circuits, IEEE Journal of, 43(1):6–20, 2008.
[37] M. Papamarcos and J. Patel. A low-overhead coherence solution for multiprocessors with private cache memories. ISCA 1984, pages 348–354.
[38] C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis. Evaluating MapReduce for multi-core and multiprocessor systems. HPCA 2007, pages 13–24.
[39] M. Schroeder and M. Burrows. Performance of Firefly RPC. SOSP 1989, pages 83–90.
[40] M. Scott and W. Scherer. Scalable queue-based spin locks with timeout. PPoPP 2001, pages 44–52.
[41] Tilera TILE-Gx. http://www.tilera.com/products/processors/TILE-Gx_Family.
[42] TPC-H. http://www.tpc.org/tpch/.
[43] T. S. Craig. Building FIFO and priority-queuing spin locks from atomic swap. Technical report, 1993.
[44] J. Tseng, H. Yu, S. Nagar, N. Dubey, H. Franke, P. Pattnaik, H. Inoue, and T. Nakatani. Performance studies of commercial workloads on a multi-core system. IISWC 2007, pages 57–65.
[45] D. Wentzlaff and A. Agarwal. Factored operating systems (fos): the case for a scalable operating system for multicores. OSR, 43(2):76–85, 2009.
