

    Assessment of Cache Coherence Protocols in Shared-memory

    Multiprocessors

    by

    Alexander Grbic

    A thesis submitted in conformity with the requirements

    for the degree of Doctor of Philosophy

    Graduate Department of Electrical and Computer Engineering

    University of Toronto

Copyright © 2003 by Alexander Grbic


    Abstract

    Assessment of Cache Coherence Protocols in Shared-memory Multiprocessors

    Alexander Grbic

    Doctor of Philosophy

    Graduate Department of Electrical and Computer Engineering

    University of Toronto

    2003

    The cache coherence protocol plays an important role in the performance of a distributed

    shared-memory (DSM) multiprocessor. A variety of cache coherence protocols exist and differ

    mainly in the scope of the sites that are updated by a write operation. These protocols can

    be complex and their impact on the performance of a multiprocessor system is often difficult

    to assess. To obtain good performance, both architects and users must understand processor

    communication, data locality, the properties of the interconnection network, and the nature of

    the coherence protocols. Analyzing the processor data sharing behavior and determining its

    effect on cache coherence communication traffic is the first step to a better understanding of

    overall performance. Toward this goal, this dissertation provides a framework for evaluating

    the coherence communication traffic of different protocols and considers using more than one

    protocol in a DSM multiprocessor.

    The framework consists of a data access characterization and the application of assessment

    rules. Its usefulness is demonstrated through an investigation into the performance of different

    cache coherence protocols for a variety of systems and parameters. It is shown to be effective

    for determining the relative performance of protocols and the effect of changes in system and

    application parameters. The investigation also shows that no single protocol is best suited for

    all communication patterns. Consequently, the dissertation also considers using more than one

cache coherence protocol in a DSM multiprocessor. The results show that a hybrid protocol

    can significantly reduce traffic in all levels of the interconnection network with little effect on

    execution time.


    Acknowledgements

    I would like to thank my supervisors, Professors Zvonko Vranesic and Sinisa Srbljic, for their

    suggestions, guidance and support throughout my thesis. Without their knowledge, experience

    and time this work would not have been possible. I am grateful for their continued faith in me

    in spite of my decisions to take on new challenges and responsibilities. In addition, I wish to

    acknowledge useful discussions with Professor Michael Stumm and thank him for his help.

    I cannot say enough to thank my wife Gordana and daughter Lidia for their love, patience

    and understanding. Gordana, you gave me the support I needed to keep going, even when it

    looked like there was no end in sight to my graduate work. Lidia, the moment you arrived you

    brightened up my life, provided me with inspiration and taught me about the important things.

    To both of you, my love.

    I would like to thank my parents, brother and sister for their support, sacrifices and their

    love. Tony and Vanda, thanks for being there for Gordana, Lidia and me whenever we needed

    you. Tony, your dedication to research has motivated me in more ways than just making me

    realize that you could finish before me.

I must also thank my friends for their continued friendship. Even though I've gone largely

into seclusion in the last while, you've kept in touch and always made me feel welcome. I

    express my thanks to the old Computer Group crowd and to the people at work for the friendly

    and frequent reminders of my unfinished business.

    I gratefully acknowledge the financial assistance provided to me through OGSST and NSERC

    Scholarships as well as a UofT Open Fellowship.


    Contents

1 Introduction
  1.1 Motivation
  1.2 Overview

2 Background
  2.1 Cache Coherence
    2.1.1 Type of Protocols
    2.1.2 Implementing Protocols
  2.2 Directory Protocols
  2.3 Implementations
  2.4 Understanding Protocol Performance
  2.5 Hybrid Protocols
    2.5.1 On-line Decision Function
    2.5.2 Off-line Decision Function
  2.6 The NUMAchine Multiprocessor - Evolution
  2.7 Memory Consistency Models
  2.8 Remarks

3 The NUMAchine Cache Coherence Protocol
  3.1 The NUMAchine Multiprocessor
    3.1.1 Architecture
    3.1.2 Interconnection Network

5 Sharing Patterns and Traffic
  5.1 Data Access Characterization
    5.1.1 Data Access Patterns
    5.1.2 Obtaining the Data Access Characterization
  5.2 Understanding Cache Coherence Protocols
    5.2.1 Description of Protocols
    5.2.2 Assumptions
    5.2.3 Assessment Rules
  5.3 Choice of Characterization Interval
  5.4 Confirmation of Rule 3
    5.4.1 Choosing Parameters
    5.4.2 Comparison
  5.5 Extending the Framework
  5.6 Remarks

6 Evaluation of Protocol Performance
  6.1 The Update Protocol
    6.1.1 The Update Protocol in a Distributed System
  6.2 The Write-through Protocol
  6.3 Uncached Operations
  6.4 Protocol Communication Costs
  6.5 Study Considerations
    6.5.1 Applications
    6.5.2 Page Placement
    6.5.3 Interval Sizes
  6.6 Data Access Characterization of Benchmarks
  6.7 Relative Performance of Different Protocols
    6.7.1 Applying the Assessment Rules
    6.7.2 Verifying the Assessment Rules

  6.8 Explanation of Application Behavior
  6.9 Remarks

7 Hybrid Cache Coherence Protocol
  7.1 General Description
  7.2 Processor Support
    7.2.1 Base Support
    7.2.2 Dirty Shared State Support
  7.3 Directory Support
    7.3.1 States
    7.3.2 Commands
  7.4 Transitions Between Protocols
    7.4.1 Dealing with Additional States in the Update Protocol
    7.4.2 Network Cache Transitions
    7.4.3 Cache Blocks in Transition
    7.4.4 Transitions Between Protocols in the Processor Cache
  7.5 Experimental Methodology
    7.5.1 Simulation Issues
    7.5.2 Applications
    7.5.3 Decision Function
  7.6 Hybrid Protocol Results
  7.7 Wrong Protocols for Intervals
  7.8 Decision Functions and Hybrid Protocol Execution Time
    7.8.1 Only the Traffic-based Decision Function Changes to Update (t2u)
    7.8.2 Only the Traffic-based Decision Function Changes to Invalidate (t2i)
    7.8.3 Only the Latency-based Decision Function Changes to Update (l2u)
    7.8.4 Only the Latency-based Decision Function Changes to Invalidate (l2i)
    7.8.5 General Comments
  7.9 Latency-based Decision Function
  7.10 Remarks

8 Conclusion
  8.1 Contributions
  8.2 Future Work

A NUMAchine Cache Coherence Protocol - Invalidate
  A.1 Local System Events
  A.2 Remote System Events
  A.3 Special Cases
    A.3.1 Negative Acknowledgments
    A.3.2 Exclusive Reads and Upgrades
    A.3.3 Non-inclusion of Network Cache, NOTIN Cases

B System Events

Bibliography

    List of Tables

2.1 Experimental and commercial multiprocessor architectures.
2.2 Cache coherence in experimental and commercial multiprocessors.
3.1 States in memory and network cache directories.
4.1 Simulation parameters.
4.2 Access latencies.
5.1 Values of parameters.
6.1 Communication costs in numbers of packets for invalidate, update, write-through and uncached operations.
6.2 System data access characterization and percentage of writes.
6.3 Data access characterization for the central ring.
6.4 Average number of packets per access for different cache coherence protocols on a 4-processor system.
6.5 Average number of packets per access for different cache coherence protocols on a 64-processor system central ring.
7.1 Parallel efficiency for SPLASH2 applications used in the hybrid protocol study.
7.2 Examples of NUMAchine system event costs in terms of number of packets for the invalidate and update protocols.
7.3 Frequency of using incorrect protocols given in numbers of intervals.
7.4 Disagreements between the traffic-based and latency-based decision functions given in numbers of intervals.
7.5 MRSW example for the case where only the traffic decision function changes to update (t2u).
7.6 SRMW example for the case where only the traffic decision function changes to update (t2u).
7.7 MRMW example for the case where only the traffic decision function changes to update (t2u).
7.8 MW example for the case where only the traffic decision function changes to update (t2u).
7.9 MRSW example for the case where only the traffic decision function changes to invalidate (t2i).
7.10 MRSW example for the case where only the latency decision function changes to update (l2u).
7.11 MRMW example for the case where only the latency decision function changes to update (l2u).
7.12 MRSW example for the case where only the latency decision function changes to invalidate (l2i).
7.13 MRMW example for the case where only the latency decision function changes to invalidate (l2i).
7.14 MW example for the case where only the latency decision function changes to invalidate (l2i).
A.1 System events for local requests.
A.2 System events for remote requests.
B.1 System event descriptions.
B.2 System event details.
B.3 Traffic and latency costs for system events.
B.4 System parameters that affect traffic.
B.5 Traffic costs for requests and responses.
B.6 System parameters that affect latency.
B.7 Latency of modules and the interconnection network.

    List of Figures

2.1 Invalidate and update protocols.
2.2 Cache coherence with a directory protocol.
2.3 The Hector multiprocessor.
3.1 NUMAchine architecture.
3.2 Routing mask.
3.3 Station and network level coherence.
3.4 Directory entries in memory and network cache.
3.5 Local write.
3.6 Local read.
3.7 Remote read.
3.8 Remote write.
5.1 Data access patterns.
5.2 Time/space characterization of data accesses.
5.3 Bus-based system.
5.4 Comparison of INV and UPD.
5.5 Comparison of INV and UNC.
5.6 Comparison of INV and WT.
5.7 Comparison of UPD and WT.
5.8 Comparison of UPD and UNC.
5.9 Comparison of WT and UNC.
5.10 Hierarchical system.
6.1 Data access characterization for Barnes.
6.2 Data access characterization for FFT.
6.3 Average number of packets per access for the invalidate and update protocols.
7.1 State transition diagrams for the processor cache.
7.2 Example of a violation of sequential consistency that can occur if the owner does not invalidate its copy when responding to an exclusive intervention request.
7.3 Example of remote exclusive read request to the LI state in the memory for the update protocol.
7.4 Example of local exclusive read request to the GI state in the memory.
7.5 Barnes with the base problem size and the ideal decision function.
7.6 FFT with the base problem size and the ideal decision function.
7.7 Ocean non-contiguous with the base problem size and the ideal decision function.
7.8 Radix with the base problem size and the ideal decision function.
7.9 Barnes with the small problem size and the ideal decision function.
7.10 FFT with the small problem size and the ideal decision function.
7.11 Ocean non-contiguous with the small problem size and the ideal decision function.
7.12 Radix with the small problem size and the ideal decision function.
7.13 Effect of changing cache block size to 256 bytes.
7.14 Effect of changing the ring width to 4 bytes.
7.15 Barnes with the base problem size and the latency-based decision function.
7.16 FFT with the base problem size and the latency-based decision function.
7.17 Ocean non-contiguous with the base problem size and the latency-based decision function.
7.18 Radix with the base problem size and the latency-based decision function.
7.19 Barnes with the small problem size and the latency-based decision function.
7.20 FFT with the small problem size and the latency-based decision function.
7.21 Ocean non-contiguous with the small problem size and the latency-based decision function.
7.22 Radix with the small problem size and the latency-based decision function.
A.1 Special exclusive read request example.

    Chapter 1

    Introduction

    The demand for multiprocessors has continued to grow in recent years and commercial machines

    with tens of processors are readily available today. In 2000, the sales of shared-memory systems

    with more than eight processors passed $16 billion [20]. This has been driven by the continuing

    need for computational power beyond what state-of-the-art uniprocessor systems can provide.

    Uses of multiprocessors have grown from mostly scientific and engineering applications to other

    areas such as databases and file and media servers.

    Multiprocessor architectures vary depending on the size of the machine and differ from ven-

    dor to vendor. Shared-memory architectures have become dominant in small and medium-sized

    machines that have up to 64 processors. They provide a single view of memory, which is shared

    among all processors, and a shared-memory model for programming, where communication is

    achieved through accesses to the same memory location. The success of this model is due to the

    ease of transition it provides from uniprocessors to multiprocessors. The programming model

    is similar to uniprocessors and it allows for the incremental parallelization of sequential code,

    while achieving high performance.

    To achieve high performance, the shared view of memory is implemented in hardware. The

    predominant architecture for small systems is based on a bus. At about 32 processors, this

    architecture reaches its limits. For larger systems, other types of interconnection networks,

    often hierarchical, are used and the memory is distributed throughout the machine. This type

    of architecture is referred to as a distributed shared-memory (DSM) multiprocessor.


    network and the memory system are believed to be the most important subsystems and will

    continue to be so over the next decade. When designing the interconnection network for a

    shared-memory multiprocessor, the cache coherence protocol is a key design consideration. The

performance of a protocol with a particular interconnection network has a considerable impact on the performance of the overall system. To obtain good performance with the system, both

    architects and users must understand processor communication, data locality, the properties of

    the interconnection network, and the nature of the protocols.

    A variety of cache coherence protocols exist and differ mainly in the scope of the sites

    that are updated by a write operation. These protocols can be complex and their impact on

    the performance of a multiprocessor system is often difficult to assess. The performance of

    a system is directly related to the latency associated with processor accesses. The latency of

    an access often depends on congestion in the system, which is directly related to the amount

    of communication traffic. Analyzing the processor data sharing behavior and determining its

    effect on the cache coherence communication costs is the first step in understanding the overall

    performance. This dissertation provides a framework for evaluating the communication costs of

    different protocols and comparing different protocols as well as assessing the effects of different

    system and application parameters on the performance. In addition to improving the latency

    of accesses, reducing the traffic can reduce the cost of the system by reducing the bandwidth

    requirements. The dissertation also presents a study of using more than one cache coherence

    protocol in a DSM multiprocessor and how communication requirements can be reduced with

    this approach.

Much of the work in this dissertation has been inspired by the author's involvement in the

    NUMAchine multiprocessor project [36] at the University of Toronto. The objective was to

    design a multiprocessor system which is cost-effective, with a scalability goal of 100 processors.

Costs were reduced by using commercial off-the-shelf parts and programmable logic devices.

    The author was directly involved in the design and development of a unique cache coherence

    protocol. Without loss of generality, many of the principles presented in this dissertation are

    applied to the NUMAchine multiprocessor as a specific example of a successful architecture for

    medium-scale systems.


    1.2 Overview

    Chapter 2 discusses cache coherence protocols in the context of distributed shared-memory

    multiprocessors. Next, a description of the NUMAchine cache coherence protocol and the

    unique combination of features it provides is given in Chapter 3. NUMAchine is a good example

of a cost-effective multiprocessor and its architecture is used as a platform for investigation

    throughout this work. Chapter 4 provides a description of the experimental setup and the choice

    of benchmark programs used to perform experiments described in later chapters. Chapter 5

    develops a framework for assessing the behavior of cache coherence protocols, which consists of

    a method for characterizing the sharing behavior for a program and a set of rules that explain

    the performance of the protocols. An analysis of several cache coherence protocols designed for

    NUMAchine using the proposed framework is given in Chapter 6. In Chapter 7, the possibility

    of using more than one protocol during the execution of an application is explored. Finally,

    Chapter 8 summarizes the major conclusions and describes possible future work.


    Chapter 2

    Background

    Shared memory multiprocessors have become popular because of the simple programming model

    they provide. A single shared address space is accessible to any processor in the system and

    communication between processors occurs by simply accessing the same data location. In a

    system with caches, the sharing of data in this way results in copies of the same cache block in

    multiple caches. Although this sharing is not a problem for read accesses, a problem can occur

    if one of the processors writes to shared data. This is the cache coherence problem.

    This chapter begins with a discussion of the cache coherence problem. Section 2.2 describes

    the solution commonly used in distributed shared memory (DSM) multiprocessors, called di-

    rectory cache coherence protocols. Section 2.3 provides a survey of representative DSM mul-

    tiprocessors and their cache coherence protocols. Various approaches used to understand the

    performance of cache coherence protocols are given in Section 2.4. Attempts at using more

    than one type of cache coherence protocol are described in Section 2.5. Since the research in

    this thesis is motivated by the development of the NUMAchine multiprocessor, a description of

    its evolution and relevant references are given in Section 2.6. NUMAchine provides a memory

    model called sequential consistency, which is briefly described in Section 2.7.


    2.1 Cache Coherence

    A typical shared memory multiprocessor contains multiple levels of caches in the memory hier-

    archy. Each processor may read data and store it in its cache. This results in copies of the same

    data being present in different caches at the same time. The problem occurs when a processor

performs a write to data. If only the value in the writing processor's cache is modified, no

    other processor will see the change. If some action is not taken, other processors will read a

    stale copy of the data. Intuitively, a read by another processor should return the last value

    written. To avoid the problem of reading stale data, all processors with copies of the data must

    be notified of the changes. Two properties must be ensured. First, changes to a data location

    must be made visible to all processors, which is called write propagation. Second, the changes

    to a location must be made visible in the same order to all processors, which is called write

    serialization.

Culler and Singh [21] define a coherent memory system as follows:

    A multiprocessor memory system is coherent if the results of any execution of a

    program are such that, for each location, it is possible to construct a hypothetical

    serial order of all operations to the location (i.e., put all reads/writes issued by all

    processors into a total order) that is consistent with the results of execution and in

    which

    1. operations issued by any particular processor occur in the order in which they

    were issued to the memory system by that processor, and

    2. the value returned by each operation is the value written by the last write to

    that location in the serial order.

    To solve the cache coherence problem, that is to maintain a coherent memory system, a

    distributed algorithm called a cache coherence protocol is used. A variety of cache coherence

    protocols exist [79] [57] [43] and differ mainly by the action performed on a write.

(In the original text, the word "process" is used instead of "processor".)
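To make the definition concrete, the following sketch brute-forces it for a single location (a Python illustration of my own, not anything from the thesis; the operation encoding is assumed): it searches for a serial order of the operations that satisfies both conditions.

from itertools import permutations

# Hypothetical helper: each op is (processor, kind, value), where 'value' is the
# value written by a write or returned by a read, and the list is given in
# interleaved program order (so per-processor issue order is preserved).
def is_coherent(ops):
    n = len(ops)
    for order in permutations(range(n)):
        pos = {i: p for p, i in enumerate(order)}
        # Condition 1: each processor's operations keep their issue order.
        if any(pos[i] > pos[j] for i in range(n) for j in range(i + 1, n)
               if ops[i][0] == ops[j][0]):
            continue
        # Condition 2: every read returns the last write in the serial order.
        last_written, ok = None, True
        for i in order:
            proc, kind, value = ops[i]
            if kind == 'w':
                last_written = value
            elif value != last_written:
                ok = False
                break
        if ok:
            return True
    return False

# P2 reads the new value and then a stale one: no valid serial order exists.
print(is_coherent([('P1', 'w', 1), ('P2', 'r', 1), ('P2', 'r', 0)]))  # False

The example fails because P2 first observes the new value and then a stale one; no serial order can arrange the operations so that both reads return the last value written.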



    a. Only memory has copy of A. b. Processors and memory share A.

    c. Copies of A invalidated. d. Copies of A updated.

    Figure 2.1: Invalidate and update protocols.

    2.1.1 Type of Protocols

    Cache coherence protocols can be classified into a number of categories based on the scope of

    sites that are updated by a write operation. Depending on how other processor caches are

notified of changes, protocols can be classified as invalidate and update, as shown in Figure 2.1.

    In Figure 2.1a only the memory has a valid copy of data block A. In Figure 2.1b both processors

    read A and store it in their respective caches. The difference between the protocols becomes

    apparent when, for example, processor P1 issues a write. In an invalidate protocol, processor

    P1 modifies its copy of the cache block and invalidates the other copies in the system as shown

    in Figure 2.1c. In an update protocol, the processor writes to its copy of the cache block and

    propagates the change to other copies in the system as shown in Figure 2.1d. Upon receiving

    the changes, the other caches update their contents.
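The following minimal sketch contrasts the two actions of Figure 2.1 (Python; the class and method names are illustrative, not from the thesis): both protocols change the writer's copy, and they differ only in what happens to the other copies.

class Sharers:
    def __init__(self):
        self.memory = {'A': 0}
        self.caches = {'P1': {}, 'P2': {}}

    def read(self, proc, block):
        # On a miss, fetch the block from memory (Figure 2.1b).
        if block not in self.caches[proc]:
            self.caches[proc][block] = self.memory[block]
        return self.caches[proc][block]

    def write(self, proc, block, value, protocol):
        self.caches[proc][block] = value
        for other, cache in self.caches.items():
            if other == proc or block not in cache:
                continue
            if protocol == 'invalidate':   # Figure 2.1c: remove other copies
                del cache[block]
            else:                          # Figure 2.1d: propagate new value
                cache[block] = value

s = Sharers()
s.read('P1', 'A'); s.read('P2', 'A')
s.write('P1', 'A', 7, 'update')
print(s.caches['P2'])   # {'A': 7}: P2's copy was updated in place

Rerunning the last two statements with 'invalidate' instead leaves P2's cache empty, forcing a miss on P2's next read of A.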

    Cache coherence protocols can be further classified depending on how the memory is updated


into write-through and write-back protocols. In a write-through protocol, the memory is updated

    whenever a processor performs a write; it writes through to the memory. In a write-back

    protocol, the memory can be updated in one of two ways. First, the memory is updated when

a processor with the only valid copy of the block replaces it. Second, a copy of the block is written back to memory when a processor reads it from the cache of another processor.
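A minimal sketch of the two memory-update policies just described, assuming a simple cache with a per-block dirty bit (Python; the names and the dirty/clean bookkeeping are the standard textbook mechanism, not details taken from the thesis):

def write(cache, memory, block, value, policy):
    cache[block] = ('dirty', value)          # the cached copy always changes
    if policy == 'write-through':
        memory[block] = value                # memory updated on every write
        cache[block] = ('clean', value)

def replace(cache, memory, block):
    state, value = cache.pop(block)
    if state == 'dirty':                     # write-back: memory updated only
        memory[block] = value                # when a dirty block leaves the cache

cache, memory = {}, {'A': 0}
write(cache, memory, 'A', 5, 'write-back')
print(memory['A'])                           # 0: memory is still stale
replace(cache, memory, 'A')
print(memory['A'])                           # 5: updated at replacement time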

    The choice of cache coherence protocol plays an important role in the performance of a

    multiprocessor system. Many systems are based on the write-back invalidate protocol. In

many cases, applications run efficiently using this type of protocol, but there are examples

    where other protocols can achieve better results.

    2.1.2 Implementing Protocols

    A cache coherence protocol is typically enforced by a set of cooperating finite state machines,

    which can be implemented in hardware, software, or some combination of the two. We focus

    on hardware implementations because they are relevant to distributed shared memory multi-

    processors. They perform well and make the accessing of data transparent to the programmer

    and the operating system. In addition, they can operate at a finer granularity of data, such as

    a cache block which can range from 16 to 256 bytes in most systems today.

    During program execution, the hardware implemented state machines check for certain

    conditions and act appropriately to maintain coherence. The actions are determined by the

    operation issued by the processor and the state information stored with each cache block. The

    state machines and the state information are typically located at the processors, memory and

    other locations of caches in the system. When a processor issues an operation, the controller

    decides the change of state and the appropriate action on the interconnect.
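As an illustration of such a controller's decision step, the sketch below encodes a transition table mapping (block state, processor operation) to (next state, interconnect action). The three states form a generic textbook MSI-style invalidate protocol, not the NUMAchine protocol described later.

# Transition table for one cache block; actions are what the controller would
# place on the interconnect (None means the access is satisfied locally).
TRANSITIONS = {
    ('Invalid',  'read'):  ('Shared',   'read request'),
    ('Invalid',  'write'): ('Modified', 'exclusive read request'),
    ('Shared',   'read'):  ('Shared',   None),
    ('Shared',   'write'): ('Modified', 'invalidate other copies'),
    ('Modified', 'read'):  ('Modified', None),
    ('Modified', 'write'): ('Modified', None),
}

def controller(state, op):
    next_state, action = TRANSITIONS[(state, op)]
    return next_state, action

print(controller('Shared', 'write'))
# ('Modified', 'invalidate other copies')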

    Existing hardware cache coherence schemes include snoopy schemes, directory schemes,

    and schemes that involve cache coherent interconnection networks. To describe each, it is

    first necessary to distinguish between different types of multiprocessor systems: symmetric

    multiprocessors (SMPs) and distributed shared memory multiprocessors (DSMs). In SMPs the

    time to access any part of memory is the same, while in DSMs the time depends on the location

    of the processor performing the access and the memory being accessed. This is known as non-


    uniform memory access (NUMA). DSM systems with cache coherence implemented in hardware,

    which is the norm, are also known as cache coherent NUMA (CC-NUMA) systems.

    For symmetric multiprocessors, snoopy protocols are popular because they are well under-

stood and relatively simple to implement. These schemes assume that the network traffic is visible to all devices. Each device performs coherence actions according to a protocol for the

    operations it issues. Communication between caches and memory is achieved using a broad-

    cast mechanism. For a bus-based multiprocessor, sending a message is effectively a broadcast

    because anything sent on the bus is visible to all other devices. Each device snoops on the in-

    terconnection network and performs actions according to the protocol for blocks it has stored.

    SMPs with snoopy protocols are limited in size, typically containing only tens of processors.

    Even with large caches, a limit on the number of processors is reached due to the amount

    of traffic on the bus and eventually due to physical constraints. At this point, some other

    interconnection network, that scales with system size, must be used.

    In distributed shared memory (DSM) systems a scalable interconnection network is used

    to connect processing nodes, which can contain one or more processors and memory. The

    interconnect consists of multiple components that contain traffic to that portion of the system,

    so that operations can be performed simultaneously in different parts of the network. In this

    type of a system, broadcasting to all caches is prohibitive because of the amount of network

    traffic generated. The following section describes cache coherence protocols called directory

    protocols which eliminate the need to broadcast requests to the system.

    Recently, a number of protocols have been proposed that combine snoopy and directory

    protocol implementations [62] [60] [61]. Their goal is to achieve the lower latency of requests

    associated with snoopy protocols while maintaining the lower bandwidth requirements of direc-

    tory protocols. Bandwidth adaptive snooping [62] switches between the two implementations

    based on recent network utilization. A snoopy protocol is used when there is ample bandwidth

    available, and a directory protocol at times of high utilization. The need for broadcasting can

    also be further reduced by multicasting requests to a predicted set of destinations [60]. To

    allow for the extension of these ideas to general interconnection networks, a new type of cache

    coherence protocol called Token cache coherence [61] has been introduced, which exchanges and


    performed and the cache block size. Although a protocol can be implemented with any inter-

    connection network, the specific features of the network can be used to optimize the protocol.

    The example given in Figure 2.2 assumes a single centralized directory with what is known as

a full bit vector scheme [18]: one presence bit is available for each processor. To avoid contention and to allow for a system that has a small up-front cost in small configurations, directories are

    distributed in a large system such that each memory in the system has a directory associated

    with it. Another major issue for directories is the amount of storage overhead required for

    larger systems. Ideally, the overhead should scale gracefully with the number of processors in

    the system. The full bit vector scheme does not scale well because the storage overhead per

    entry is proportional to the number of processors. To save on storage, the width and height of

    the directory can be varied. The width of the entry can be reduced by reducing the number of

    presence bits available per entry. For example, a single bit can be used to represent more than

    one processing node. These types of schemes are called coarse bit vector schemes [39]. Another

    type of scheme is called the limited pointer scheme [8] in which a limited number of pointers

    are provided. After all the pointers are used, further coherence commands are broadcast. The

    storage requirements of the directory can also be reduced by reducing the height of the directory,

    that is the number of entries. The directory is then essentially used as a cache [39]. Typical

    large-scale multiprocessors have a distributed full or coarse bit vector directory.
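The storage trade-off can be made concrete with the standard back-of-the-envelope formulas (the group size and pointer count below are illustrative defaults, not values taken from the thesis):

def directory_bits_per_block(processors, scheme, group=4, pointers=5):
    if scheme == 'full':       # one presence bit per processor
        return processors
    if scheme == 'coarse':     # one bit per group of processors
        return -(-processors // group)            # ceiling division
    if scheme == 'limited':    # a few pointers of ceil(log2(P)) bits each
        return pointers * max(1, (processors - 1).bit_length())

for p in (16, 64, 256):
    print(p, directory_bits_per_block(p, 'full'),
             directory_bits_per_block(p, 'coarse'),
             directory_bits_per_block(p, 'limited'))

With 256 processors a full bit vector needs 256 bits per directory entry, while a coarse vector with groups of 4 needs 64 and a 5-pointer limited scheme needs 40; only the full vector grows linearly with the machine size.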

    2.3 Implementations

    In this section a number of medium to large-scale DSM multiprocessors, both academic projects

    and commercial implementations, are described. An emphasis is placed on the specifics of the

    system architecture and the cache coherence protocol. A summary of the architectural features

    is given in Table 2.1 and the cache coherence protocols in Table 2.2. Note that the NUMAchine

    multiprocessor is provided in the table for comparison, but is not described in this section. It

    is described in Section 2.6 and Chapter 3.

    The DASH multiprocessor [55] [56] developed at Stanford University consists of processing

    nodes called clusters, which are connected by a pair of 2-D mesh networks. Each cluster

contains up to 4 processors (R3000) and a portion of the memory.

Name                      Cluster                    Cluster size  Interconnect
DASH                      bus                        4             mesh
Alewife                   non-clustered              -             mesh
FLASH                     non-clustered              -             mesh
NUMAchine                 bus                        4             ring hierarchy
SGI Origin                crossbar                   2             hypercube
Compaq AlphaServer GS320  10-port switch (crossbar)  4             8-port switch (crossbar)
Sun Fire 15K              bus                        4             crossbar
HP SPP2000 (X-class)      crossbar                   16            toroidal ring
HP Superdome              switch                     4             crossbar hierarchy
IBM NUMA-Q                bus                        4             ring

Table 2.1: Experimental and commercial multiprocessor architectures.

DASH implements

    a distributed, directory-based cache coherence protocol [54] which is of the invalidation

    type. A bus snooping protocol enforces coherence within a cluster and a full bit vector

    directory enforces coherence across clusters. DASH also contains a remote access cache,

    which is used to cache blocks belonging to other clusters.

    The Alewife Machine [7] developed at MIT also consists of processing nodes connected by

a mesh network. Each node consists of a single processor (Sparcle) and a portion of global

    memory. A directory scheme which contains only five pointers per cache block is used to

    reduce hardware requirements. If more than five nodes share a cache block, additional

    pointers are stored in the main memory using a scheme called LimitLESS directories [19].

    Common-case memory accesses are handled in hardware and a software trap is used to

    enforce coherence for memory blocks that are shared among a large number of processors.

    The FLASH multiprocessor [50] developed at Stanford University is the successor to

    DASH. Each node contains a processor (R10000), a portion of main memory, and a

    programmable node controller called MAGIC (Memory And General Interconnect Con-

    troller). This controller controls the datapath and implements coherence. A base directory

    cache coherence protocol exists and consists of a scalable directory data structure. FLASH

    uses a dynamic pointer allocation scheme for which a directory header for each block is

    stored in the main memory. The header contains boolean flags and a pointer to a linked

    list of nodes that contain the shared block.


    The SGI Origin multiprocessor [53] developed by Silicon Graphics Inc consists of up to 512

    nodes connected by a Craylink network in a hypercube configuration. Each node consists

    of up to 2 processors (R10000) and a portion of the global memory. One of the main goals

of the Origin is to limit the ratio of remote to local access latency to 2:1. The directory-based cache coherence protocol is similar to that of DASH. It is designed to

    be insensitive to network ordering, allowing for the use of any interconnection network. A

    full bit vector scheme which switches to a coarse bit vector scheme for a large number of

    processors is implemented. More recently SGI has introduced the Origin 3000 [73], which

    is similar in architecture, but includes an updated processor (R14000).

    The Compaq AlphaServer GS320 [29] developed by Compaq can scale to 64 processors.

Memory is distributed across 4-processor (Alpha 21264) nodes, called quad-processor

    building blocks, which are connected by a local switch. Eight such quads can be connected

    by a global switch. The cache coherence protocol is directory-based and uses a full bit

    vector scheme. The protocol exploits the architecture and its ordering properties to reduce

    the number of messages.

    The Sun Fire 15K Server [20] is a multiprocessor developed by Sun Microsystems. The

    Sun Fireplane interconnect, consisting of three 18x18 crossbars, is used to connect up to

    18 four-processor (UltraSparc III) boards. A snoopy-based protocol is used to maintain

    coherence within a board and across a limited number of boards. For larger systems, a

    directory protocol is used to maintain coherence across the Fireplane interconnect.

    The Exemplar series of multiprocessors [15] [82] [16] [1] was originally developed by Convex

    Computer Corporation and later continued by Hewlett Packard. The line went through

    a number of generations with the most recent being the SPP2000 (X-class). It consists of

    up to 16 processor nodes, called hypernodes, connected by a set of 4 unidirectional rings

    that use an SCI-based protocol. Each hypernode contains up to 16 processors (PA8000),

    and a local memory connected by a crossbar. The SCI cache coherence protocol is used

    to keep the node caches coherent. Within a hypernode, a full bit vector directory is used

    to enforce coherence.


    cation [25]. The protocol can be described in a protocol description language, from which the

verifier generates states and verifies against the protocol's specification. It is also difficult to

    ensure that the hardware implementation of a protocol is true to its original specification, so

approaches such as witness strings [4] have been used, where an execution trace used during verification is converted to an input stimulus for logic simulation.

    2.4 Understanding Protocol Performance

    Cache coherence protocols can have a large effect on the performance of multiprocessor systems.

    The performance depends on the data access behavior of applications and no single protocol

    works best for all data access patterns. In general, the invalidate protocol performs well for

    applications in which accesses to a particular data block are performed mostly by the same

    processor or when the data block migrates between processors. In these cases, it is not necessary

to send any messages through the network once the data is in the processor's cache. For

    applications that exhibit a more fine-grained sharing of data blocks, in which a single data item

    is frequently read and written by different processors, the update protocol performs better. By

    sending updates, the data item is always in the cache and misses due to invalidations are avoided.

    System designers and application developers need to be able to compare different protocols and

    assess the effects of different system and application parameters on the performance of protocols.

    To better understand the performance of different protocols a number of classifications of

    data sharing have been proposed. The classifications have been used for various purposes.

For invalidate protocols, Gupta and Weber [87] [38] proposed a number of classes of data access

    patterns. They are distinguished by their use in parallel programs and their invalidation pat-

terns: read-only, migratory, synchronization, mostly-read, frequently read-written, producer-consumer, and irregular read-write. Bennett et al. [13] used the concept

    producer-consumer, migratory, and irregular read-write. Bennett et al. [13] used the concept

    of data access patterns for protocol selection in the Munin software distributed shared memory

    system. They are: write-once, write-many, producer-consumer, private, migratory, result ob-

    jects, read-mostly, synchronization and general read-write. Adve at al. [5] compared hardware

    and software cache coherence protocols using an analytical model. They introduced data access


patterns that are similar to Weber and Gupta's: passively-shared, mostly-read, frequently read-

    written, migratory and synchronization. Brorsson and Stenstrom [17] used different data access

    patterns to analyze the performance of applications running on systems with a limited directory

    invalidate protocol. The data access patterns take into account the type of sharing, read onlyor read/write, and the degree of sharing, exclusive, shared-by-few and shared-by-many.

    In this thesis, the classification proposed by Srbljic et al. [78] is used as a basis for un-

    derstanding the performance of protocols. It is similar to the data access patterns introduced

    by Carter et al. [13] and by Brorsson and Stenstrom [17]. The main difference is that the

    fuzziness in the definition of data access patterns is avoided. For example, Brorsson and

    Stenstrom have data access patterns defined as shared-by-few and shared-by-many, where the

degree of sharing is fuzzy. Carter et al. introduced data access patterns like write-many

    and read-mostly, where the access mode is fuzzy (for example, read-mostly means that a data

    object is read more often than it is written). Srbljic et al. classify data accesses according to the

    number of processors that perform reads and writes to a particular data item. They are: Single

Reader Single Writer (SRSW), Multiple Reader (MR), Multiple Reader Single Writer (MRSW),

    Multiple Writer (MW), Single Reader Multiple Writer (SRMW), and Multiple Reader Multiple

    Writer (MRMW).
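A minimal sketch of this classification (Python; the boundary cases, such as a block with a single reader and no writer, are my reading of the category names rather than rules taken from [78]):

def classify(accesses):
    # accesses: list of (processor, 'r' or 'w') pairs for one block/interval
    readers = {p for p, op in accesses if op == 'r'}
    writers = {p for p, op in accesses if op == 'w'}
    if not writers:
        return 'MR' if len(readers) > 1 else 'SRSW'   # read-only data
    if len(writers) == 1:
        return 'MRSW' if len(readers) > 1 else 'SRSW'
    if len(readers) > 1:
        return 'MRMW'
    return 'SRMW' if readers else 'MW'

print(classify([('P1', 'r'), ('P2', 'r'), ('P1', 'w')]))   # MRSW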

    2.5 Hybrid Protocols

    Since different data blocks may exhibit different types of access behavior, a system which

    uses more than one cache coherence protocol has the potential to lead to an improvement

    in performance. Using the appropriate protocol can lead to a reduction in cache misses and

    coherence traffic, both of which can result in an improvement in performance. A hybrid cache

    coherence protocol can use any one of a given number of different basic protocols, such as

    invalidate or update, for each cache block.

    In addition, the data access behavior for a particular cache block may change during the

    execution of an application. To further increase the potential for performance improvement,

    the protocol for a block can be changed during the execution of an application. These protocols


    Dynamic hybrid protocols with on-line decision functions first appeared in small bus-based

    multiprocessors. They are briefly described in this section because similar techniques have been

    used in larger DSM systems. They use both invalidates and updates and take advantage of the

broadcast properties of the bus. The first such protocol is the write-once protocol [31], in which the first write to a block results in an update to the main memory and an invalidation to the

    other caches. The next write by the same processor results in a change to the local cache only

    and the memory is no longer updated. The Archibald scheme [10] [11] extends the write-once

    protocol by allowing a number of updates while there are no other accesses from other processors

    to that cache block. The competitive scheme [49] sends a number of updates based on a break-

    even point of communication overhead for the two protocols. Eggers and Katz [26] provide a

    comparison of a basic update, basic invalidate, the Archibald, and competitive schemes. They

    conclude that none of the protocols perform best for all applications. The schemes described

    were later extended. Anderson and Karlin extend the competitive scheme [9] by allowing for

    changes to the break-even point during the execution of an application. Dahlgren [22] suggests

    a number of extensions to the Archibald scheme. They consist of merging multiple writes into

a single write, using a write cache, to reduce bus traffic, and snooping on bus data to reduce cache misses, a technique called read snarfing.
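The competitive idea mentioned above is simple enough to sketch (Python; BREAK_EVEN and the names are placeholders, not values from [49]): each cached copy tolerates a bounded number of remote updates, and a copy that receives BREAK_EVEN updates without an intervening local read is invalidated, capping the update traffic spent on a useless copy.

BREAK_EVEN = 4   # illustrative: updates whose cost is taken to equal one miss

class CachedCopy:
    def __init__(self, value):
        self.value = value
        self.credits = BREAK_EVEN
        self.valid = True

    def local_read(self):
        self.credits = BREAK_EVEN        # a read shows the updates were useful
        return self.value

    def remote_update(self, value):
        if self.valid:
            self.value = value
            self.credits -= 1
            if self.credits == 0:        # past break-even: stop taking updates
                self.valid = False       # behave like an invalidation instead

copy = CachedCopy(0)
for v in range(4):
    copy.remote_update(v)
print(copy.valid)   # False: four unread updates reach the break-even point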

    A number of studies have also been performed on DSM systems with directory-based cache

    coherence protocols. Grahn, Stenstrom and Dubois [33] present a directory-based competitive

    scheme and compare it to an invalidate and an update scheme. They use a relaxed memory

    consistency model to hide the latency of updates with the use of a write-buffer at the second-

    level cache. They find that the update performs better than invalidate for applications with

    moderate bandwidth requirements and note that the competitive protocol does not perform well

    with migratory sharing. To reduce some of the traffic associated with the competitive-update

protocol, Dahlgren and Stenstrom [24] introduce a write cache to merge multiple writes. Nilsson

    and Stenstrom [66] add migratory detection to the update protocol to reduce the overhead of

migratory sharing. Additional details of this study are provided in [32]. In a study to determine

    the techniques that can be used to improve the performance of multiprocessors, Stenstrom et

    al. [80] evaluate a number of alternatives. On a sequentially consistent machine they compare


    adaptive sequential prefetching and migratory sharing detection, while on a machine with release

    consistency they compare adaptive sequential prefetching and a hybrid protocol. The hybrid

protocol uses a competitive-update scheme and a write cache. They find that coupled

with sequential prefetching, the hybrid protocol yields combined gains. Similarly, but in the context of reducing useless updates, Bianchini et al. [14] show the effect of bandwidth and

    block size on update and invalidate protocols. They compare a static hybrid protocol and

    a competitive update with coalescing write buffers. They find that software caching and a

    dynamic hybrid protocol reduce most of the useless writes. Coalescing write buffers produce

    the least amount of traffic and have the largest impact on execution time.

Two schemes that use something other than a competitive-update protocol are proposed by Srbljic [77] and Raynaud et al. [72]. Srbljic proposes counters that track the communication cost of the invalidate and update protocols; the protocol used at a given time is switched when the cost reaches a threshold value. Although the results are favorable, an artificial workload is used and few system details are modeled. Raynaud et al. [72] introduce the distance-adaptive model, in which the update pattern is recorded in the directory and then used to determine which blocks should be updated and which invalidated. They compare an invalidate protocol with migratory handling, a competitive update, a delayed competitive update, a delayed competitive update with migratory handling, and two distance-adaptive protocols. The distance-adaptive protocols perform better than the invalidate and competitive protocols.
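As a rough illustration of such counter-based switching, the sketch below charges a cost to the protocol currently in use for a block and flips to the other protocol when the accumulated cost crosses a threshold; all names, costs, and the per-block granularity are illustrative assumptions, not details from [77]:

    enum protocol { INVALIDATE, UPDATE };

    struct block_ctrl {
        enum protocol proto;
        long cost;                  /* traffic cost under the current protocol */
    };

    #define COST_UPDATE_MSG  1      /* assumed cost of one update packet         */
    #define COST_INVAL_MISS  8      /* assumed cost of a miss after invalidation */
    #define SWITCH_THRESHOLD 64     /* assumed switching threshold               */

    static void charge(struct block_ctrl *b, long cost)
    {
        b->cost += cost;
        if (b->cost >= SWITCH_THRESHOLD) {
            /* the current protocol is doing poorly; try the other one */
            b->proto = (b->proto == UPDATE) ? INVALIDATE : UPDATE;
            b->cost = 0;
        }
    }

    void on_write_traffic(struct block_ctrl *b, int sharers)
    {
        if (b->proto == UPDATE)
            charge(b, (long)sharers * COST_UPDATE_MSG);
    }

    void on_coherence_miss(struct block_ctrl *b)
    {
        if (b->proto == INVALIDATE)
            charge(b, COST_INVAL_MISS);
    }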

The disadvantage of run-time approaches is the inability to accurately predict future accesses. The decision function is based purely on information about previous accesses, and basing the prediction of future accesses on past accesses can be inaccurate, although recent work [65] [51] on using hardware techniques similar to branch prediction for coherence actions has yielded encouraging results. Another disadvantage is that run-time schemes require additional hardware, such as counters, which may add significant cost.

    2.5.2 Off-line Decision Function

Another approach to hybrid cache coherence protocols is to use an off-line decision function. The decision function can be implemented in hardware or software. The first method involves analyzing the memory trace for a specific application using hardware performance counters. An application which executes frequently can be fine-tuned using the information provided by specialized hardware. The second, and preferable, method involves implementing an off-line decision function at compile time. The main idea behind this approach is that information on which protocol to use can be extracted from the source code. In contrast to the on-line schemes, the decision is not based solely on previous accesses, which offers the possibility of more accurately predicting future data access patterns.
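For instance, a compiler might annotate writes with protocol hints extracted from the source. The fragment below is purely illustrative: the hint functions are hypothetical stand-ins (here no-op stubs) for the special write commands that such schemes insert into the memory reference stream:

    /* Hypothetical compiler-inserted coherence hints, stubbed out. */
    static void hint_write_update(void *addr)     { (void)addr; }
    static void hint_write_invalidate(void *addr) { (void)addr; }

    void producer(double *shared_out, double *scratch, int n)
    {
        for (int i = 0; i < n; i++) {
            /* consumers read this soon: prefer updates over invalidations */
            hint_write_update(&shared_out[i]);
            shared_out[i] = scratch[i] * 2.0;
        }
        for (int i = 0; i < n; i++) {
            /* rewritten repeatedly with no remote readers: invalidate once */
            hint_write_invalidate(&scratch[i]);
            scratch[i] = 0.0;
        }
    }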

A number of studies have shown the potential improvement from such a scheme. Veenstra and Fowler [84] demonstrate the advantages of dynamic schemes over static ones (for larger cache blocks), as well as of maintaining coherence on a per-block rather than a per-page basis; their performance results are obtained using an optimal off-line protocol. Mounes-Toussi and Lilja [64] present results on the potential of compile-time analysis. They introduce a dynamic hybrid scheme and different levels of compiler capability which insert special write-invalidate, write-update and write-only commands into the memory reference stream. They consider factors that could affect compiler analysis, such as imprecise array subscript analysis and inter-procedural analysis. The study compares the ideal compiler, non-ideal compiler, invalidate-only, update-only, and dynamic schemes, and finds that the compiler schemes outperform the others. Two similar studies [2] [70] assess the value of providing specialized producer-initiated communication primitives that are software controlled. Abdel-Shafi et al. [2] demonstrate that remote writes, called writesend and writethrough, can provide benefits over prefetching and that the combination of both is able to eliminate most of the overhead; the primitives are hand-inserted. Qin and Baer [70] use a protocol-processor implementation of cache coherence and annotate applications with primitives, evaluating a set of prefetch and post-store mechanisms. Sivasubramaniam [75] uses intelligent send-initiated data transfer mechanisms for transferring ownership of critical-section variables; the compiler is able to recognize writes within a critical section. A competitive-update mechanism implemented in software in the network interface is also evaluated. Poulsen and Yew [69], through their work on parallelizing compilers, propose a hybrid prefetching and data forwarding mechanism, in which data forwarding is compiler-inserted for communication between loop iterations. Finally,


    of particular importance to this thesis is the work done by Srbljic et al. [78], which presents a

    number of analytical models and indicates the potential for dynamic hybrid protocols.

Although the work in this thesis is concerned with DSM multiprocessors, one bus-based implementation is worth mentioning because of its compiler implementation of a decision function. Techniques for reducing coherence misses and invalidation traffic were compared by Dahlgren et al. [23]. The study concluded that their dynamic hybrid protocol does as well as their compiler-inserted update scheme in terms of misses, but does better in terms of bus traffic.

Off-line decision functions also have some disadvantages. Some of the run-time information required by the decision function is not easily obtainable; for example, many schemes require information about the interleaving of accesses from different processors. There are also a number of general limitations in compile-time analysis which can result in inaccuracies: performance can vary depending on the extent of memory disambiguation and on whether inter-procedural analysis is available.

    2.6 The NUMAchine Multiprocessor - Evolution

    The work in this thesis is motivated by the NUMAchine multiprocessor project and specifically

    the work done on cache coherence protocols in that context. Although the ideas are applicable

    to shared-memory multiprocessors in general, they are evaluated in detail in the context of

    the NUMAchine multiprocessor. In this section, an overview of NUMAchine development is

    provided. The details of the architecture and cache coherence protocol are given in Chapter 3.

Many of the features of NUMAchine are based on experience with its successful predecessor, a multiprocessor called Hector [86] [81], also developed at the University of Toronto. Hector is a ring-based, clustered, shared-memory machine, depicted in Figure 2.3. Cache coherence in Hector is implemented in software by the operating system using a page-based write-through to memory protocol.

    Although the software coherence scheme provided good performance, interest in developing a

    hardware cache coherent machine grew. Farkas investigated what it would take to provide cache

coherence on an architecture similar to Hector [27] [28].

[Figure 2.3: The Hector multiprocessor. PM = Processor Module; I/O = SCSI, Ethernet, etc.]

He describes how to provide a sequential consistency memory model. He identifies the need for locking at the home memory while a

transaction is in progress and for sending invalidation messages to the top of the hierarchy for

    multicasts. For the invalidation-based cache coherence protocol he proposes using a multicast

    rather than individual invalidations. He also describes an update-based protocol.

One of the goals in the NUMAchine project was to investigate a hardware cache-coherent machine that is cost-effective, easy to use, and performs well. A hierarchical ring structure and features such as processor clustering, a network cache, and a directory protocol were chosen. A cache coherence protocol optimized for the NUMAchine architecture was developed, based on the invalidation write-back scheme suggested in [27].

    An initial overview of the NUMAchine project is given in [3]. It includes plans for hardware,

    operating system and compiler development. A detailed description of the architecture with

    simulation results is given in the NUMAchine technical report [85]. Details of the prototype

implementation are provided in [34] and in NUMAchine-related theses [35] [58]. The architecture

    was subsequently analyzed in [37] and measured performance results were presented in [36].


2.7 Memory Consistency Models

    When writing parallel software, assumptions are made about how the memory system behaves.

    Although there is an intuitive notion about how a shared address space should behave, it

needs to be specified in more detail. Cache coherence dictates that writes to a single location must become visible to all processors in the same order, but it does not

    say anything about when writes to different locations become visible. Since programmers and

    system designers need to worry about this, more than cache coherence is needed to define the

    behavior of the shared address space. The order in which all memory operations are performed

    needs to be defined. This is called the memory consistency model. A number of different models

exist, the most intuitive being sequential consistency.

    Lamport [52] defines sequential consistency as:

A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.

    For this behavior to occur in a multiprocessor system, there must be constraints on the order

    in which memory operations appear to be performed. Determining how to design a system that

    provides this model is difficult, so sufficient conditions were defined. For example, to provide

    sequential consistency [21]:

1. Every processor issues memory operations in program order.

    2. After a write operation is issued, the issuing processor waits for the write to

    complete before issuing the next operation.

    3. After a read operation is issued, the issuing processor waits for the read to

complete, and for the write whose value is being returned by the read to complete, before issuing its next operation. That is, if the write whose value is being returned has performed with respect to this processor (as it must have if its value is being returned), then the processor should wait until the write has performed with respect to all processors.

(In the original text, the word process is used instead of processor.)

    The constraints focus on program order and the appearance that one operation is complete

    with respect to all processors before the next one is issued. This means that all writes to any

    location must appear to all processors to have occurred in the same order, which is a difficult

    requirement for most systems.
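As a concrete illustration, consider the classic producer/consumer fragment below (a C sketch; the variables are assumed shared, and compiler reordering is ignored for clarity). Under sequential consistency, a consumer that observes flag == 1 must also observe data == 42; weaker models make no such promise without explicit synchronization.

    int data = 0;   /* shared */
    int flag = 0;   /* shared */

    void producer(void)             /* runs on one processor */
    {
        data = 42;                  /* write 1 */
        flag = 1;                   /* write 2: under SC, seen after write 1 */
    }

    void consumer(void)             /* runs on another processor */
    {
        while (flag == 0)
            ;                       /* spin until the flag is observed */
        int r = data;               /* under SC this must read 42 */
        (void)r;
    }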

To allow for additional hardware and compiler optimizations, which are commonly used in uniprocessors, a number of less strict, relaxed, models have been proposed [6]. These optimizations can increase performance, which is the main reason many commercial multiprocessors use them, but at the cost of the added complexity of the relaxed models, which make it harder for users and designers of systems to understand and reason about correctness. Recently, a study re-examined the use of relaxed models because modern high-performance processors leave little additional performance to be gained from relaxed schemes; in light of this, there may be less incentive for implementing these less-intuitive programming models [45].

One of the goals of the NUMAchine multiprocessor was usability, and because of it the system was designed to support sequential consistency. Although providing this memory consistency model may be expensive in some architectures, the NUMAchine architecture inherently provides a simple and efficient means of supporting it. The necessary ordering between writes to different locations is provided by defining fixed sequencing points in the ring hierarchy [37]. This ensures that a multicast invalidation does not become active until it passes the sequencing point on the highest ring level that must be traversed to reach all multicast destinations. This imposes the necessary ordering, at the expense of an increase in the average traversal length for sequenced packets (i.e., invalidations).
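A small sketch of the sequencing decision follows; the test is one reading of the ordering requirement, not the actual NUMAchine hardware logic:

    #include <stdbool.h>

    /* An invalidation becomes active only after passing the sequencing
     * point on the highest ring level it must traverse to reach every
     * destination (illustrative rule). */
    bool sequence_at_central_ring(unsigned dest_rings_mask, unsigned sender_ring)
    {
        /* a destination on any ring other than the sender's forces the
           packet up to the central ring, which orders it globally */
        return (dest_rings_mask & ~(1u << sender_ring)) != 0;
    }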

    2.8 Remarks

    The related work and the survey of state-of-the-art multiprocessor implementations presented

    in this chapter provide a number of interesting points.


Cache coherence protocols are critical aspects of shared-memory multiprocessor systems, and much effort has gone into their design and implementation. Directory-based cache coherence protocols are the de facto standard for medium- to large-scale distributed shared-memory (DSM) multiprocessors. The best architecture and cache coherence protocol for a shared-memory multiprocessor has not been determined. However, the NUMAchine multiprocessor provides a good platform for research in cache coherence protocols because its architecture and cache coherence protocol are in line with current multiprocessors.

    To achieve good performance in a DSM multiprocessor, it is important to understand the

    communication patterns of applications and the behavior of cache coherence protocols for these

    patterns. Since no single protocol is best suited for all communication patterns, using more

    than one has shown some promise. An open question remains as to the benefits of such a

    scheme in a DSM multiprocessor, in particular one that supports sequential consistency.

Chapter 3

The NUMAchine Cache Coherence Protocol

[Figure 3.1: NUMAchine architecture. A central ring connects local rings, which connect stations; P = Processor, M = Memory, NI = Network Interface, I/O = SCSI, Ethernet, etc.]

    3.1.1 Architecture

    The NUMAchine architecture is hierarchical. Processors and memory are distributed across a

    number of nodes called stations. Each station contains a number of processors and a portion of

    the total system memory. The organization of the memory is such that each memory address

    has a fixed home station. The stations are connected by one or more levels of unidirectional

    bit-parallel rings which operate using a slotted-ring protocol. The time to access a memory

    location in the system varies depending on which processor issues the request and where the

    request is satisfied in the system. Therefore, the architecture is of the NUMA (Non-Uniform

    Memory Access) type.

    The 64-processor machine consists of two levels of rings as shown in Figure 3.1. At the top

    of the hierarchy, a central ring connects four local rings through inter-ring interfaces. At the

    next level, each local ring connects four stations through a ring interface. Each station contains

    four MIPS R4400 processors [41] with 1-MByte external secondary caches, a memory module

    (M) with up to 256 MBytes of DRAM for data and SRAM for status information of each cache

    block, a network interface (NI) which handles packets flowing between the station and the ring,

    and an I/O module which has standard interfaces for connecting disks and other I/O devices.


    The modules on a station are connected by a bus. Along with mechanisms to handle packets

    flowing to and from the rings, the network interface also contains an 8-MByte DRAM-based

    network cache for storing cache blocks from other stations. The network cache also contains

    SRAM used to store status information of cache blocks.

    3.1.2 Interconnection Network

    The interconnection network consists of a bus in each station and a hierarchy of rings connecting

    the stations. The rings are unidirectional and use a slotted protocol. The hierarchy provides

    increased total bandwidth by allowing for transfers to take place concurrently on several rings.

    Experience from the Hector multiprocessor [86] demonstrated that using an interconnection

    network based on rings provides a number of benefits:

• They are easy to build because they consist of point-to-point connections. The network interfaces are simple, with only one input port and one output port. The issues of loading and signal reflections from multiple connections that limit the number of connections that can be provided by a bus are avoided.

• They can transmit signals reliably at high clock rates because of the simplicity of the hardware required to implement them. Short critical paths in logic and short lines in the interconnection network make this possible.

• The multiprocessor can be expanded easily, without large wiring or topology changes, making the system highly modular.

• They provide a natural multicasting capability. The sender of a multicast needs to send only a single packet with multiple destinations selected. The packet travels around the ring and is replicated only when it reaches the interfaces of the destinations.

• They provide ordering among packets. A unique path exists between any two stations in the system, and the network interfaces are designed not to allow packets to bypass each other.


• They have subsequently been shown to perform well in comparison with meshes for configurations up to 128 processors [71].

The natural ordering among packets and the multicast ability are useful for efficiently implementing cache coherence and a sequentially consistent memory. The ordering of packets in the NUMAchine ring hierarchy is maintained because a unique path exists between any two stations and the point-to-point order of packets is preserved: a packet cannot overtake another one in the network on its way to a destination. The multicast capability is a fundamental property of rings. A single packet can be targeted for multiple destinations; the packet travels around the ring and is replicated at each destination.

A split-transaction protocol is used in the interconnection network, meaning that the transactions required to maintain coherence are split into requests and responses. For example, a processor places a read request on the bus and then releases the bus; when the memory is ready to respond with the data, it requests the use of the bus.

Requests and responses, broken up into packets, travel along a single physical interconnection network. The packets are buffered at each module's connection to the network to allow for more concurrency in the system. Each module contains incoming and outgoing buffers. Although only one physical network exists, it is split in the ring interface and processor modules into two virtual networks for deadlock avoidance. These modules contain two types of outgoing buffers: one for requests and the other for responses. During periods of congestion, requests are halted while responses are allowed to proceed. From the perspective of cache coherence, the interconnection network looks like a single ordered network: requests cannot pass other requests, and responses do not pass other responses. Only the ordering of responses with respect to requests can change, and vice versa.
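A minimal sketch of the arbitration implied by the two virtual networks follows; the FIFO, its size, and the congestion test are assumptions for illustration, not the actual hardware design:

    #include <stdbool.h>

    /* A toy FIFO of packet ids, standing in for a hardware buffer. */
    struct fifo { int pkt[16]; unsigned head, tail; };

    static bool fifo_empty(const struct fifo *q) { return q->head == q->tail; }
    static int  fifo_pop(struct fifo *q)         { return q->pkt[q->head++ % 16]; }

    struct module_port {
        struct fifo out_requests;   /* first virtual network  */
        struct fifo out_responses;  /* second virtual network */
    };

    /* Select the next packet for the single physical link (-1 if none). */
    static int next_packet(struct module_port *p, bool congested)
    {
        if (congested) {
            /* requests are halted; only responses may proceed, so that
               in-flight transactions can complete and deadlock is avoided */
            return fifo_empty(&p->out_responses)
                 ? -1 : fifo_pop(&p->out_responses);
        }
        /* uncongested: serve either queue (responses first, arbitrarily) */
        if (!fifo_empty(&p->out_responses))
            return fifo_pop(&p->out_responses);
        if (!fifo_empty(&p->out_requests))
            return fifo_pop(&p->out_requests);
        return -1;
    }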

    3.1.3 Communication Scheme

    The routing of packets begins and ends at a station. A novel routing scheme for packets is

    implemented in NUMAchine. The destination of a packet is specified using a routing mask.

    The routing mask consists of fields that represent levels in the hierarchy. The number of bits

    in a field corresponds to the number of targets in the next level of hierarchy.


[Figure 3.2: Routing mask. One 4-bit field per level; the masks for station 0 on local ring 0 (0001 0001) and station 3 on local ring 3 (1000 1000) OR to 1001 1001.]

In the two-level prototype, the routing mask consists of two 4-bit fields. Bits set in the first field indicate the destination ring, while bits set in the second field indicate the destination

    station on the ring. For point-to-point communication, each station in the hierarchy can be

    uniquely identified by setting one bit in each of the fields. Multicasting to multiple stations is

    possible by setting more than one bit in each of the fields; however, setting more than one bit

    per field can specify more stations than required. For example, to send a packet to station 0

    on local ring 0 (0001 0001) and to station 3 on local ring 3 (1000 1000), the routing mask is

    set to the logical OR of the two (1001 1001) as shown in Figure 3.2. Due to over-specification

    inherent in the mask, the packet would also be sent to station 0 on ring 3 (1000 0001) and

    station 3 on ring 0 (0001 1000).

    This communication scheme makes the routing of packets on the ring simple and fast. Each

    ring and each station needs only to check a single bit to determine whether it is the destination

    for the packet.
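The scheme can be expressed compactly in code. The sketch below follows the two-level mask layout and the example above; the helper names are mine, not NUMAchine's:

    #include <stdint.h>
    #include <stdio.h>

    /* 4-bit ring field (upper nibble) and 4-bit station field (lower). */
    static uint8_t make_mask(int ring, int station)
    {
        return (uint8_t)((1u << (4 + ring)) | (1u << station));
    }

    /* A station checks one bit per field to see whether it is a target. */
    static int is_destination(uint8_t mask, int ring, int station)
    {
        return (mask & (1u << (4 + ring))) && (mask & (1u << station));
    }

    int main(void)
    {
        /* station 0 / ring 0 is 0001 0001; station 3 / ring 3 is 1000 1000 */
        uint8_t m = make_mask(0, 0) | make_mask(3, 3);   /* 1001 1001 */

        printf("%d %d %d %d\n",
               is_destination(m, 0, 0),    /* intended destination: 1 */
               is_destination(m, 3, 3),    /* intended destination: 1 */
               is_destination(m, 3, 0),    /* over-specified extra: 1 */
               is_destination(m, 0, 3));   /* over-specified extra: 1 */
        return 0;
    }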

    3.1.4 Organization of the Network Cache

A third-level cache, called the network cache, exists on each station in the network interface module. It stores copies of cache blocks whose home memories are on other stations. It is a direct-mapped cache which does not enforce the inclusion property [12]. Not enforcing the inclusion property means that the network cache does not contain copies of all cache blocks in


caches below it in the hierarchy. For example, a processor secondary cache on the local station may contain a cache block that is not present in the network cache. The next section describes a number of interesting problems and solutions that arise from this property.

    3.2 Protocol Features

The NUMAchine cache coherence protocol is a hierarchical, directory-based, write-back invalidate protocol optimized for the NUMAchine architecture. It exploits the multicast mechanism and utilizes the inherent ordering provided by the ring.

    Before proceeding, it is useful to define some terminology. The home memory of a cache

    block refers to the memory module to which the cache block belongs. If a particular station

    is being discussed, it is referred to as the local station. Local memory or local network cache

    refer to the memory or network cache on that station. Remote station, remote memory or

    remote network cache refer to any memory, network cache or station other than the station

    being discussed.

    3.2.1 Processor Behavior

    The MIPS R4400MC [41] processor has two levels of caches: an on-chip primary cache and an

    off-chip secondary cache. It also comes with support for a variety of cache coherence protocols.

    Each cache block in the caches has a cache coherence state associated with it. In the secondary

cache, three basic states, dirty, shared, and invalid, are defined in the standard way for write-back invalidate protocols.

    The processor issues a request if it misses in its caches. A read miss occurs if the cache

    block is not in the cache or if it is in the invalid state. A write miss occurs if the cache block is

    not in the dirty state. The processor stalls on read and write misses. When replacing a cache

    block, the processor writes it back to the home memory if it is in the dirty state. Otherwise,

    the cache block is overwritten, without notifying the home memory.

    The processor can respond to a number of external requests. An external read request will

cause the processor to return the data if the cache block is in the dirty state and negatively acknowledge (NACK) the request otherwise. On an external invalidation, the processor will invalidate its copy of the cache block.
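The secondary-cache behavior just described can be summarized in a small state sketch; the encoding and function names are illustrative, not the R4400's actual interface:

    #include <stdbool.h>

    enum state { INVALID, SHARED, DIRTY };
    enum reply { REPLY_DATA, REPLY_NACK };

    struct cache_line { enum state st; };

    /* A read misses if the block is absent or invalid. */
    bool is_read_miss(const struct cache_line *l, bool present)
    {
        return !present || l->st == INVALID;
    }

    /* A write misses unless the block is held dirty. */
    bool is_write_miss(const struct cache_line *l, bool present)
    {
        return !present || l->st != DIRTY;
    }

    /* External read: supply data only if dirty, otherwise NACK. */
    enum reply on_external_read(const struct cache_line *l)
    {
        return (l->st == DIRTY) ? REPLY_DATA : REPLY_NACK;
    }

    /* External invalidation: drop the copy. */
    void on_external_invalidate(struct cache_line *l)
    {
        l->st = INVALID;
    }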

    3.2.2 Protocol Hierarchy

    The NUMAchine cache coherence protocol is hierarchical. Cache coherence is maintained at two

    levels as shown in Figure 3.3: the station level and the network level. Station-level coherence is

    maintained between the local memory and the processor caches on a station, or between the local

    network cache and the processor caches if the home location of a cache block is a remote station.

    Network-level coherence is maintained between the home memory of a cache block and all the

    remote network caches with copies of the cache block. Information for maintaining coherence

    at the station and network levels is stored in the directories; a directory-based protocol is used

at both levels.

[Figure 3.3: Station and network level coherence. Each station contains processors (P1-P4), memory (M), and a network cache (NC).]
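The directory state at the two levels can be pictured as follows; the field layouts are assumptions for illustration, not the actual NUMAchine directory format:

    #include <stdint.h>

    /* Network-level entry at the home memory: which stations hold copies. */
    struct mem_dir_entry {
        uint8_t station_mask;   /* one bit per station with a valid copy */
        uint8_t state;          /* e.g. valid / dirty / locked           */
    };

    /* Station-level entry at the local memory or network cache:
     * which on-station processors hold copies. */
    struct station_dir_entry {
        uint8_t proc_mask;      /* one bit per processor on the station */
        uint8_t state;
    };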

    3.2.3 Invalidations

    A cache coherence protocol must have a mechanism to make writes visible to all processors (write

    propagation). The NUMAchine cache coherence protocol uses invalidations for this purpose.

    A cache coherence protocol must also ensure that all processors see writes to a location as

    having happened in the same order (write serialization). To ensure write serialization the

    NUMAchine protocol uses locking states and takes advantage of the ordering properties of the

    interconnection network. In this section, the mechanism to perform writes is described.


    In a typical multiprocessor system, requests are serialized by either the memory or the

    current owner of a cache block. A write request first goes to the memory, which is aware of

    all copies in the system and sends individual invalidations to each processing node with a valid

    copy. Upon receiving the invalidation, each node replies with an invalidation acknowledgmentto the original requester. When the requester has received all the