

    Assessment of Cache Coherence Protocols in Shared-memory

    Multiprocessors

    by

    Alexander Grbic

    A thesis submitted in conformity with the requirements

    for the degree of Doctor of Philosophy

    Graduate Department of Electrical and Computer Engineering

    University of Toronto

Copyright © 2003 by Alexander Grbic


    Abstract

    Assessment of Cache Coherence Protocols in Shared-memory Multiprocessors

    Alexander Grbic

    Doctor of Philosophy

    Graduate Department of Electrical and Computer Engineering

    University of Toronto

    2003

    The cache coherence protocol plays an important role in the performance of a distributed

    shared-memory (DSM) multiprocessor. A variety of cache coherence protocols exist and differ

    mainly in the scope of the sites that are updated by a write operation. These protocols can

    be complex and their impact on the performance of a multiprocessor system is often difficult

    to assess. To obtain good performance, both architects and users must understand processor

    communication, data locality, the properties of the interconnection network, and the nature of

    the coherence protocols. Analyzing the processor data sharing behavior and determining its

    effect on cache coherence communication traffic is the first step to a better understanding of

    overall performance. Toward this goal, this dissertation provides a framework for evaluating

    the coherence communication traffic of different protocols and considers using more than one

    protocol in a DSM multiprocessor.

    The framework consists of a data access characterization and the application of assessment

    rules. Its usefulness is demonstrated through an investigation into the performance of different

    cache coherence protocols for a variety of systems and parameters. It is shown to be effective

    for determining the relative performance of protocols and the effect of changes in system and

    application parameters. The investigation also shows that no single protocol is best suited for

    all communication patterns. Consequently, the dissertation also considers using more than one

cache coherence protocol in a DSM multiprocessor. The results show that a hybrid protocol

    can significantly reduce traffic in all levels of the interconnection network with little effect on

    execution time.


    Acknowledgements

    I would like to thank my supervisors, Professors Zvonko Vranesic and Sinisa Srbljic, for their

    suggestions, guidance and support throughout my thesis. Without their knowledge, experience

    and time this work would not have been possible. I am grateful for their continued faith in me

    in spite of my decisions to take on new challenges and responsibilities. In addition, I wish to

    acknowledge useful discussions with Professor Michael Stumm and thank him for his help.

    I cannot say enough to thank my wife Gordana and daughter Lidia for their love, patience

    and understanding. Gordana, you gave me the support I needed to keep going, even when it

    looked like there was no end in sight to my graduate work. Lidia, the moment you arrived you

    brightened up my life, provided me with inspiration and taught me about the important things.

    To both of you, my love.

    I would like to thank my parents, brother and sister for their support, sacrifices and their

    love. Tony and Vanda, thanks for being there for Gordana, Lidia and me whenever we needed

    you. Tony, your dedication to research has motivated me in more ways than just making me

    realize that you could finish before me.

I must also thank my friends for their continued friendship. Even though I've gone largely

into seclusion in the last while, you've kept in touch and always made me feel welcome. I

    express my thanks to the old Computer Group crowd and to the people at work for the friendly

    and frequent reminders of my unfinished business.

    I gratefully acknowledge the financial assistance provided to me through OGSST and NSERC

    Scholarships as well as a UofT Open Fellowship.


    Contents

1 Introduction
  1.1 Motivation
  1.2 Overview

2 Background
  2.1 Cache Coherence
    2.1.1 Type of Protocols
    2.1.2 Implementing Protocols
  2.2 Directory Protocols
  2.3 Implementations
  2.4 Understanding Protocol Performance
  2.5 Hybrid Protocols
    2.5.1 On-line Decision Function
    2.5.2 Off-line Decision Function
  2.6 The NUMAchine Multiprocessor - Evolution
  2.7 Memory Consistency Models
  2.8 Remarks

3 The NUMAchine Cache Coherence Protocol
  3.1 The NUMAchine Multiprocessor
    3.1.1 Architecture
    3.1.2 Interconnection Network

5 Sharing Patterns and Traffic
  5.1 Data Access Characterization
    5.1.1 Data Access Patterns
    5.1.2 Obtaining the Data Access Characterization
  5.2 Understanding Cache Coherence Protocols
    5.2.1 Description of Protocols
    5.2.2 Assumptions
    5.2.3 Assessment Rules
  5.3 Choice of Characterization Interval
  5.4 Confirmation of Rule 3
    5.4.1 Choosing Parameters
    5.4.2 Comparison
  5.5 Extending the Framework
  5.6 Remarks

6 Evaluation of Protocol Performance
  6.1 The Update Protocol
    6.1.1 The Update Protocol in a Distributed System
  6.2 The Write-through Protocol
  6.3 Uncached Operations
  6.4 Protocol Communication Costs
  6.5 Study Considerations
    6.5.1 Applications
    6.5.2 Page Placement
    6.5.3 Interval Sizes
  6.6 Data Access Characterization of Benchmarks
  6.7 Relative Performance of Different Protocols
    6.7.1 Applying the Assessment Rules
    6.7.2 Verifying the Assessment Rules

  6.8 Explanation of Application Behavior
  6.9 Remarks

7 Hybrid Cache Coherence Protocol
  7.1 General Description
  7.2 Processor Support
    7.2.1 Base Support
    7.2.2 Dirty Shared State Support
  7.3 Directory Support
    7.3.1 States
    7.3.2 Commands
  7.4 Transitions Between Protocols
    7.4.1 Dealing with Additional States in the Update Protocol
    7.4.2 Network Cache Transitions
    7.4.3 Cache Blocks in Transition
    7.4.4 Transitions Between Protocols in the Processor Cache
  7.5 Experimental Methodology
    7.5.1 Simulation Issues
    7.5.2 Applications
    7.5.3 Decision Function
  7.6 Hybrid Protocol Results
  7.7 Wrong Protocols for Intervals
  7.8 Decision Functions and Hybrid Protocol Execution Time
    7.8.1 Only the Traffic-based Decision Function Changes to Update (t2u)
    7.8.2 Only the Traffic-based Decision Function Changes to Invalidate (t2i)
    7.8.3 Only the Latency-based Decision Function Changes to Update (l2u)
    7.8.4 Only the Latency-based Decision Function Changes to Invalidate (l2i)
    7.8.5 General Comments
  7.9 Latency-based Decision Function
  7.10 Remarks

8 Conclusion
  8.1 Contributions
  8.2 Future Work

A NUMAchine Cache Coherence Protocol - Invalidate
  A.1 Local System Events
  A.2 Remote System Events
  A.3 Special Cases
    A.3.1 Negative Acknowledgments
    A.3.2 Exclusive Reads and Upgrades
    A.3.3 Non-inclusion of Network Cache, NOTIN Cases

B System Events

Bibliography

    List of Tables

2.1 Experimental and commercial multiprocessor architectures.
2.2 Cache coherence in experimental and commercial multiprocessors.
3.1 States in memory and network cache directories.
4.1 Simulation parameters.
4.2 Access latencies.
5.1 Values of parameters.
6.1 Communication costs in numbers of packets for invalidate, update, write-through and uncached operations.
6.2 System data access characterization and percentage of writes.
6.3 Data access characterization for the central ring.
6.4 Average number of packets per access for different cache coherence protocols on a 4-processor system.
6.5 Average number of packets per access for different cache coherence protocols on a 64-processor system central ring.
7.1 Parallel efficiency for SPLASH2 applications used in the hybrid protocol study.
7.2 Examples of NUMAchine system event costs in terms of number of packets for the invalidate and update protocols.
7.3 Frequency of using incorrect protocols given in numbers of intervals.
7.4 Disagreements between the traffic-based and latency-based decision functions given in numbers of intervals.
7.5 MRSW example for the case where only the traffic decision function changes to update (t2u).
7.6 SRMW example for the case where only the traffic decision function changes to update (t2u).
7.7 MRMW example for the case where only the traffic decision function changes to update (t2u).
7.8 MW example for the case where only the traffic decision function changes to update (t2u).
7.9 MRSW example for the case where only the traffic decision function changes to invalidate (t2i).
7.10 MRSW example for the case where only the latency decision function changes to update (l2u).
7.11 MRMW example for the case where only the latency decision function changes to update (l2u).
7.12 MRSW example for the case where only the latency decision function changes to invalidate (l2i).
7.13 MRMW example for the case where only the latency decision function changes to invalidate (l2i).
7.14 MW example for the case where only the latency decision function changes to invalidate (l2i).
A.1 System events for local requests.
A.2 System events for remote requests.
B.1 System event descriptions.
B.2 System event details.
B.3 Traffic and latency costs for system events.
B.4 System parameters that affect traffic.
B.5 Traffic costs for requests and responses.
B.6 System parameters that affect latency.
B.7 Latency of modules and the interconnection network.

    List of Figures

2.1 Invalidate and update protocols.
2.2 Cache coherence with a directory protocol.
2.3 The Hector multiprocessor.
3.1 NUMAchine architecture.
3.2 Routing mask.
3.3 Station and network level coherence.
3.4 Directory entries in memory and network cache.
3.5 Local write.
3.6 Local read.
3.7 Remote read.
3.8 Remote write.
5.1 Data access patterns.
5.2 Time/space characterization of data accesses.
5.3 Bus-based system.
5.4 Comparison of INV and UPD.
5.5 Comparison of INV and UNC.
5.6 Comparison of INV and WT.
5.7 Comparison of UPD and WT.
5.8 Comparison of UPD and UNC.
5.9 Comparison of WT and UNC.
5.10 Hierarchical system.
6.1 Data access characterization for Barnes.
6.2 Data access characterization for FFT.
6.3 Average number of packets per access for the invalidate and update protocols.
7.1 State transition diagrams for the processor cache.
7.2 Example of a violation of sequential consistency that can occur if the owner does not invalidate its copy when responding to an exclusive intervention request.
7.3 Example of remote exclusive read request to the LI state in the memory for the update protocol.
7.4 Example of local exclusive read request to the GI state in the memory.
7.5 Barnes with the base problem size and the ideal decision function.
7.6 FFT with the base problem size and the ideal decision function.
7.7 Ocean non-contiguous with the base problem size and the ideal decision function.
7.8 Radix with the base problem size and the ideal decision function.
7.9 Barnes with the small problem size and the ideal decision function.
7.10 FFT with the small problem size and the ideal decision function.
7.11 Ocean non-contiguous with the small problem size and the ideal decision function.
7.12 Radix with the small problem size and the ideal decision function.
7.13 Effect of changing cache block size to 256 bytes.
7.14 Effect of changing the ring width to 4 bytes.
7.15 Barnes with the base problem size and the latency-based decision function.
7.16 FFT with the base problem size and the latency-based decision function.
7.17 Ocean non-contiguous with the base problem size and the latency-based decision function.
7.18 Radix with the base problem size and the latency-based decision function.
7.19 Barnes with the small problem size and the latency-based decision function.
7.20 FFT with the small problem size and the latency-based decision function.
7.21 Ocean non-contiguous with the small problem size and the latency-based decision function.
7.22 Radix with the small problem size and the latency-based decision function.
A.1 Special exclusive read request example.

    Chapter 1

    Introduction

    The demand for multiprocessors has continued to grow in recent years and commercial machines

    with tens of processors are readily available today. In 2000, the sales of shared-memory systems

    with more than eight processors passed $16 billion [20]. This has been driven by the continuing

    need for computational power beyond what state-of-the-art uniprocessor systems can provide.

    Uses of multiprocessors have grown from mostly scientific and engineering applications to other

    areas such as databases and file and media servers.

    Multiprocessor architectures vary depending on the size of the machine and differ from ven-

    dor to vendor. Shared-memory architectures have become dominant in small and medium-sized

    machines that have up to 64 processors. They provide a single view of memory, which is shared

    among all processors, and a shared-memory model for programming, where communication is

    achieved through accesses to the same memory location. The success of this model is due to the

    ease of transition it provides from uniprocessors to multiprocessors. The programming model

    is similar to uniprocessors and it allows for the incremental parallelization of sequential code,

    while achieving high performance.

    To achieve high performance, the shared view of memory is implemented in hardware. The

    predominant architecture for small systems is based on a bus. At about 32 processors, this

    architecture reaches its limits. For larger systems, other types of interconnection networks,

    often hierarchical, are used and the memory is distributed throughout the machine. This type

    of architecture is referred to as a distributed shared-memory (DSM) multiprocessor.


    network and the memory system are believed to be the most important subsystems and will

    continue to be so over the next decade. When designing the interconnection network for a

    shared-memory multiprocessor, the cache coherence protocol is a key design consideration. The

performance of a protocol with a particular interconnection network has a considerable impact on the performance of the overall system. To obtain good performance with the system, both

    architects and users must understand processor communication, data locality, the properties of

    the interconnection network, and the nature of the protocols.

    A variety of cache coherence protocols exist and differ mainly in the scope of the sites

    that are updated by a write operation. These protocols can be complex and their impact on

    the performance of a multiprocessor system is often difficult to assess. The performance of

    a system is directly related to the latency associated with processor accesses. The latency of

    an access often depends on congestion in the system, which is directly related to the amount

    of communication traffic. Analyzing the processor data sharing behavior and determining its

    effect on the cache coherence communication costs is the first step in understanding the overall

    performance. This dissertation provides a framework for evaluating the communication costs of

    different protocols and comparing different protocols as well as assessing the effects of different

    system and application parameters on the performance. In addition to improving the latency

    of accesses, reducing the traffic can reduce the cost of the system by reducing the bandwidth

    requirements. The dissertation also presents a study of using more than one cache coherence

    protocol in a DSM multiprocessor and how communication requirements can be reduced with

    this approach.

Much of the work in this dissertation has been inspired by the author's involvement in the

    NUMAchine multiprocessor project [36] at the University of Toronto. The objective was to

    design a multiprocessor system which is cost-effective, with a scalability goal of 100 processors.

Costs were reduced by using commercial off-the-shelf parts and programmable logic devices.

    The author was directly involved in the design and development of a unique cache coherence

    protocol. Without loss of generality, many of the principles presented in this dissertation are

    applied to the NUMAchine multiprocessor as a specific example of a successful architecture for

    medium-scale systems.


    1.2 Overview

    Chapter 2 discusses cache coherence protocols in the context of distributed shared-memory

    multiprocessors. Next, a description of the NUMAchine cache coherence protocol and the

    unique combination of features it provides is given in Chapter 3. NUMAchine is a good example

of a cost-effective multiprocessor and its architecture is used as a platform for investigation

    throughout this work. Chapter 4 provides a description of the experimental setup and the choice

    of benchmark programs used to perform experiments described in later chapters. Chapter 5

    develops a framework for assessing the behavior of cache coherence protocols, which consists of

    a method for characterizing the sharing behavior for a program and a set of rules that explain

    the performance of the protocols. An analysis of several cache coherence protocols designed for

    NUMAchine using the proposed framework is given in Chapter 6. In Chapter 7, the possibility

    of using more than one protocol during the execution of an application is explored. Finally,

    Chapter 8 summarizes the major conclusions and describes possible future work.


    Chapter 2

    Background

    Shared memory multiprocessors have become popular because of the simple programming model

    they provide. A single shared address space is accessible to any processor in the system and

    communication between processors occurs by simply accessing the same data location. In a

    system with caches, the sharing of data in this way results in copies of the same cache block in

    multiple caches. Although this sharing is not a problem for read accesses, a problem can occur

    if one of the processors writes to shared data. This is the cache coherence problem.

    This chapter begins with a discussion of the cache coherence problem. Section 2.2 describes

    the solution commonly used in distributed shared memory (DSM) multiprocessors, called di-

    rectory cache coherence protocols. Section 2.3 provides a survey of representative DSM mul-

    tiprocessors and their cache coherence protocols. Various approaches used to understand the

    performance of cache coherence protocols are given in Section 2.4. Attempts at using more

    than one type of cache coherence protocol are described in Section 2.5. Since the research in

    this thesis is motivated by the development of the NUMAchine multiprocessor, a description of

    its evolution and relevant references are given in Section 2.6. NUMAchine provides a memory

    model called sequential consistency, which is briefly described in Section 2.7.


    2.1 Cache Coherence

    A typical shared memory multiprocessor contains multiple levels of caches in the memory hier-

    archy. Each processor may read data and store it in its cache. This results in copies of the same

    data being present in different caches at the same time. The problem occurs when a processor

performs a write to data. If only the value in the writing processor's cache is modified, no

    other processor will see the change. If some action is not taken, other processors will read a

    stale copy of the data. Intuitively, a read by another processor should return the last value

    written. To avoid the problem of reading stale data, all processors with copies of the data must

    be notified of the changes. Two properties must be ensured. First, changes to a data location

    must be made visible to all processors, which is called write propagation. Second, the changes

    to a location must be made visible in the same order to all processors, which is called write

    serialization.

Culler and Singh [21] define a coherent memory system as follows:

    A multiprocessor memory system is coherent if the results of any execution of a

    program are such that, for each location, it is possible to construct a hypothetical

    serial order of all operations to the location (i.e., put all reads/writes issued by all

    processors into a total order) that is consistent with the results of execution and in

    which

    1. operations issued by any particular processor occur in the order in which they

    were issued to the memory system by that processor, and

    2. the value returned by each operation is the value written by the last write to

    that location in the serial order.

    To solve the cache coherence problem, that is to maintain a coherent memory system, a

    distributed algorithm called a cache coherence protocol is used. A variety of cache coherence

    protocols exist [79] [57] [43] and differ mainly by the action performed on a write.

(In the original text, the word "process" is used instead of "processor".)
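To make the definition concrete, the following sketch brute-forces it for a single location (a Python illustration of my own, not anything from the thesis; the operation encoding is assumed): it searches for a serial order of the operations that satisfies both conditions.

from itertools import permutations

# Hypothetical helper: each op is (processor, kind, value), where 'value' is the
# value written by a write or returned by a read, and the list is given in
# interleaved program order (so per-processor issue order is preserved).
def is_coherent(ops):
    n = len(ops)
    for order in permutations(range(n)):
        pos = {i: p for p, i in enumerate(order)}
        # Condition 1: each processor's operations keep their issue order.
        if any(pos[i] > pos[j] for i in range(n) for j in range(i + 1, n)
               if ops[i][0] == ops[j][0]):
            continue
        # Condition 2: every read returns the last write in the serial order.
        last_written, ok = None, True
        for i in order:
            proc, kind, value = ops[i]
            if kind == 'w':
                last_written = value
            elif value != last_written:
                ok = False
                break
        if ok:
            return True
    return False

# P2 reads the new value and then a stale one: no valid serial order exists.
print(is_coherent([('P1', 'w', 1), ('P2', 'r', 1), ('P2', 'r', 0)]))  # False

The example fails because P2 first observes the new value and then a stale one; no serial order can arrange the operations so that both reads return the last value written.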



    a. Only memory has copy of A. b. Processors and memory share A.

    c. Copies of A invalidated. d. Copies of A updated.

    Figure 2.1: Invalidate and update protocols.

    2.1.1 Type of Protocols

    Cache coherence protocols can be classified into a number of categories based on the scope of

    sites that are updated by a write operation. Depending on how other processor caches are

notified of changes, protocols can be classified as invalidate and update, as shown in Figure 2.1.

    In Figure 2.1a only the memory has a valid copy of data block A. In Figure 2.1b both processors

    read A and store it in their respective caches. The difference between the protocols becomes

    apparent when, for example, processor P1 issues a write. In an invalidate protocol, processor

    P1 modifies its copy of the cache block and invalidates the other copies in the system as shown

    in Figure 2.1c. In an update protocol, the processor writes to its copy of the cache block and

    propagates the change to other copies in the system as shown in Figure 2.1d. Upon receiving

    the changes, the other caches update their contents.
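The following minimal sketch contrasts the two actions of Figure 2.1 (Python; the class and method names are illustrative, not from the thesis): both protocols change the writer's copy, and they differ only in what happens to the other copies.

class Sharers:
    def __init__(self):
        self.memory = {'A': 0}
        self.caches = {'P1': {}, 'P2': {}}

    def read(self, proc, block):
        # On a miss, fetch the block from memory (Figure 2.1b).
        if block not in self.caches[proc]:
            self.caches[proc][block] = self.memory[block]
        return self.caches[proc][block]

    def write(self, proc, block, value, protocol):
        self.caches[proc][block] = value
        for other, cache in self.caches.items():
            if other == proc or block not in cache:
                continue
            if protocol == 'invalidate':   # Figure 2.1c: remove other copies
                del cache[block]
            else:                          # Figure 2.1d: propagate new value
                cache[block] = value

s = Sharers()
s.read('P1', 'A'); s.read('P2', 'A')
s.write('P1', 'A', 7, 'update')
print(s.caches['P2'])   # {'A': 7}: P2's copy was updated in place

Rerunning the last two statements with 'invalidate' instead leaves P2's cache empty, forcing a miss on P2's next read of A.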

    Cache coherence protocols can be further classified depending on how the memory is updated


into write-through and write-back protocols. In a write-through protocol, the memory is updated

    whenever a processor performs a write; it writes through to the memory. In a write-back

    protocol, the memory can be updated in one of two ways. First, the memory is updated when

a processor with the only valid copy of the block replaces it. Second, a copy of the block is written back to memory when a processor reads it from the cache of another processor.
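A minimal sketch of the two memory-update policies just described, assuming a simple cache with a per-block dirty bit (Python; the names and the dirty/clean bookkeeping are the standard textbook mechanism, not details taken from the thesis):

def write(cache, memory, block, value, policy):
    cache[block] = ('dirty', value)          # the cached copy always changes
    if policy == 'write-through':
        memory[block] = value                # memory updated on every write
        cache[block] = ('clean', value)

def replace(cache, memory, block):
    state, value = cache.pop(block)
    if state == 'dirty':                     # write-back: memory updated only
        memory[block] = value                # when a dirty block leaves the cache

cache, memory = {}, {'A': 0}
write(cache, memory, 'A', 5, 'write-back')
print(memory['A'])                           # 0: memory is still stale
replace(cache, memory, 'A')
print(memory['A'])                           # 5: updated at replacement time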

    The choice of cache coherence protocol plays an important role in the performance of a

    multiprocessor system. Many systems are based on the write-back invalidate protocol. In

many cases, applications run efficiently using this type of protocol, but there are examples

    where other protocols can achieve better results.

    2.1.2 Implementing Protocols

    A cache coherence protocol is typically enforced by a set of cooperating finite state machines,

    which can be implemented in hardware, software, or some combination of the two. We focus

    on hardware implementations because they are relevant to distributed shared memory multi-

    processors. They perform well and make the accessing of data transparent to the programmer

    and the operating system. In addition, they can operate at a finer granularity of data, such as

    a cache block which can range from 16 to 256 bytes in most systems today.

    During program execution, the hardware implemented state machines check for certain

    conditions and act appropriately to maintain coherence. The actions are determined by the

    operation issued by the processor and the state information stored with each cache block. The

    state machines and the state information are typically located at the processors, memory and

    other locations of caches in the system. When a processor issues an operation, the controller

    decides the change of state and the appropriate action on the interconnect.
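As an illustration of such a controller's decision step, the sketch below encodes a transition table mapping (block state, processor operation) to (next state, interconnect action). The three states form a generic textbook MSI-style invalidate protocol, not the NUMAchine protocol described later.

# Transition table for one cache block; actions are what the controller would
# place on the interconnect (None means the access is satisfied locally).
TRANSITIONS = {
    ('Invalid',  'read'):  ('Shared',   'read request'),
    ('Invalid',  'write'): ('Modified', 'exclusive read request'),
    ('Shared',   'read'):  ('Shared',   None),
    ('Shared',   'write'): ('Modified', 'invalidate other copies'),
    ('Modified', 'read'):  ('Modified', None),
    ('Modified', 'write'): ('Modified', None),
}

def controller(state, op):
    next_state, action = TRANSITIONS[(state, op)]
    return next_state, action

print(controller('Shared', 'write'))
# ('Modified', 'invalidate other copies')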

    Existing hardware cache coherence schemes include snoopy schemes, directory schemes,

    and schemes that involve cache coherent interconnection networks. To describe each, it is

    first necessary to distinguish between different types of multiprocessor systems: symmetric

    multiprocessors (SMPs) and distributed shared memory multiprocessors (DSMs). In SMPs the

    time to access any part of memory is the same, while in DSMs the time depends on the location

    of the processor performing the access and the memory being accessed. This is known as non-


    uniform memory access (NUMA). DSM systems with cache coherence implemented in hardware,

    which is the norm, are also known as cache coherent NUMA (CC-NUMA) systems.

    For symmetric multiprocessors, snoopy protocols are popular because they are well under-

stood and relatively simple to implement. These schemes assume that the network traffic is visible to all devices. Each device performs coherence actions according to a protocol for the

    operations it issues. Communication between caches and memory is achieved using a broad-

    cast mechanism. For a bus-based multiprocessor, sending a message is effectively a broadcast

    because anything sent on the bus is visible to all other devices. Each device snoops on the in-

    terconnection network and performs actions according to the protocol for blocks it has stored.

    SMPs with snoopy protocols are limited in size, typically containing only tens of processors.

    Even with large caches, a limit on the number of processors is reached due to the amount

    of traffic on the bus and eventually due to physical constraints. At this point, some other

    interconnection network, that scales with system size, must be used.

    In distributed shared memory (DSM) systems a scalable interconnection network is used

    to connect processing nodes, which can contain one or more processors and memory. The

    interconnect consists of multiple components that contain traffic to that portion of the system,

    so that operations can be performed simultaneously in different parts of the network. In this

    type of a system, broadcasting to all caches is prohibitive because of the amount of network

    traffic generated. The following section describes cache coherence protocols called directory

    protocols which eliminate the need to broadcast requests to the system.

    Recently, a number of protocols have been proposed that combine snoopy and directory

    protocol implementations [62] [60] [61]. Their goal is to achieve the lower latency of requests

    associated with snoopy protocols while maintaining the lower bandwidth requirements of direc-

    tory protocols. Bandwidth adaptive snooping [62] switches between the two implementations

    based on recent network utilization. A snoopy protocol is used when there is ample bandwidth

    available, and a directory protocol at times of high utilization. The need for broadcasting can

    also be further reduced by multicasting requests to a predicted set of destinations [60]. To

    allow for the extension of these ideas to general interconnection networks, a new type of cache

    coherence protocol called Token cache coherence [61] has been introduced, which exchanges and


    performed and the cache block size. Although a protocol can be implemented with any inter-

    connection network, the specific features of the network can be used to optimize the protocol.

    The example given in Figure 2.2 assumes a single centralized directory with what is known as

a full bit vector scheme [18]: one presence bit is available for each processor. To avoid contention and to allow for a system that has a small up-front cost in small configurations, directories are

    distributed in a large system such that each memory in the system has a directory associated

    with it. Another major issue for directories is the amount of storage overhead required for

    larger systems. Ideally, the overhead should scale gracefully with the number of processors in

    the system. The full bit vector scheme does not scale well because the storage overhead per

    entry is proportional to the number of processors. To save on storage, the width and height of

    the directory can be varied. The width of the entry can be reduced by reducing the number of

    presence bits available per entry. For example, a single bit can be used to represent more than

    one processing node. These types of schemes are called coarse bit vector schemes [39]. Another

    type of scheme is called the limited pointer scheme [8] in which a limited number of pointers

    are provided. After all the pointers are used, further coherence commands are broadcast. The

    storage requirements of the directory can also be reduced by reducing the height of the directory,

    that is the number of entries. The directory is then essentially used as a cache [39]. Typical

    large-scale multiprocessors have a distributed full or coarse bit vector directory.
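The storage trade-off can be made concrete with the standard back-of-the-envelope formulas (the group size and pointer count below are illustrative defaults, not values taken from the thesis):

def directory_bits_per_block(processors, scheme, group=4, pointers=5):
    if scheme == 'full':       # one presence bit per processor
        return processors
    if scheme == 'coarse':     # one bit per group of processors
        return -(-processors // group)            # ceiling division
    if scheme == 'limited':    # a few pointers of ceil(log2(P)) bits each
        return pointers * max(1, (processors - 1).bit_length())

for p in (16, 64, 256):
    print(p, directory_bits_per_block(p, 'full'),
             directory_bits_per_block(p, 'coarse'),
             directory_bits_per_block(p, 'limited'))

With 256 processors a full bit vector needs 256 bits per directory entry, while a coarse vector with groups of 4 needs 64 and a 5-pointer limited scheme needs 40; only the full vector grows linearly with the machine size.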

    2.3 Implementations

    In this section a number of medium to large-scale DSM multiprocessors, both academic projects

    and commercial implementations, are described. An emphasis is placed on the specifics of the

    system architecture and the cache coherence protocol. A summary of the architectural features

    is given in Table 2.1 and the cache coherence protocols in Table 2.2. Note that the NUMAchine

    multiprocessor is provided in the table for comparison, but is not described in this section. It

    is described in Section 2.6 and Chapter 3.

    The DASH multiprocessor [55] [56] developed at Stanford University consists of processing

    nodes called clusters, which are connected by a pair of 2-D mesh networks. Each cluster

contains up to 4 processors (R3000) and a portion of the memory.

Name                      Cluster                    Cluster size  Interconnect
DASH                      bus                        4             mesh
Alewife                   non-clustered              -             mesh
FLASH                     non-clustered              -             mesh
NUMAchine                 bus                        4             ring hierarchy
SGI Origin                crossbar                   2             hypercube
Compaq AlphaServer GS320  10-port switch (crossbar)  4             8-port switch (crossbar)
Sun Fire 15K              bus                        4             crossbar
HP SPP2000 (X-class)      crossbar                   16            toroidal ring
HP Superdome              switch                     4             crossbar hierarchy
IBM NUMA-Q                bus                        4             ring

Table 2.1: Experimental and commercial multiprocessor architectures.

DASH implements

    a distributed, directory-based cache coherence protocol [54] which is of the invalidation

    type. A bus snooping protocol enforces coherence within a cluster and a full bit vector

    directory enforces coherence across clusters. DASH also contains a remote access cache,

    which is used to cache blocks belonging to other clusters.

    The Alewife Machine [7] developed at MIT also consists of processing nodes connected by

a mesh network. Each node consists of a single processor (Sparcle) and a portion of global

    memory. A directory scheme which contains only five pointers per cache block is used to

    reduce hardware requirements. If more than five nodes share a cache block, additional

    pointers are stored in the main memory using a scheme called LimitLESS directories [19].

    Common-case memory accesses are handled in hardware and a software trap is used to

    enforce coherence for memory blocks that are shared among a large number of processors.

    The FLASH multiprocessor [50] developed at Stanford University is the successor to

    DASH. Each node contains a processor (R10000), a portion of main memory, and a

    programmable node controller called MAGIC (Memory And General Interconnect Con-

    troller). This controller controls the datapath and implements coherence. A base directory

    cache coherence protocol exists and consists of a scalable directory data structure. FLASH

    uses a dynamic pointer allocation scheme for which a directory header for each block is

    stored in the main memory. The header contains boolean flags and a pointer to a linked

    list of nodes that contain the shared block.


    The SGI Origin multiprocessor [53] developed by Silicon Graphics Inc consists of up to 512

    nodes connected by a Craylink network in a hypercube configuration. Each node consists

    of up to 2 processors (R10000) and a portion of the global memory. One of the main goals

of the Origin is to limit the ratio of remote to local access latency to 2:1. The directory-based cache coherence protocol is similar to that of DASH. It is designed to

    be insensitive to network ordering, allowing for the use of any interconnection network. A

    full bit vector scheme which switches to a coarse bit vector scheme for a large number of

    processors is implemented. More recently SGI has introduced the Origin 3000 [73], which

    is similar in architecture, but includes an updated processor (R14000).

    The Compaq AlphaServer GS320 [29] developed by Compaq can scale to 64 processors.

Memory is distributed across 4-processor (Alpha 21264) nodes, called quad-processor

    building blocks, which are connected by a local switch. Eight such quads can be connected

    by a global switch. The cache coherence protocol is directory-based and uses a full bit

    vector scheme. The protocol exploits the architecture and its ordering properties to reduce

    the number of messages.

    The Sun Fire 15K Server [20] is a multiprocessor developed by Sun Microsystems. The

    Sun Fireplane interconnect, consisting of three 18x18 crossbars, is used to connect up to

    18 four-processor (UltraSparc III) boards. A snoopy-based protocol is used to maintain

    coherence within a board and across a limited number of boards. For larger systems, a

    directory protocol is used to maintain coherence across the Fireplane interconnect.

    The Exemplar series of multiprocessors [15] [82] [16] [1] was originally developed by Convex

    Computer Corporation and later continued by Hewlett Packard. The line went through

    a number of generations with the most recent being the SPP2000 (X-class). It consists of

    up to 16 processor nodes, called hypernodes, connected by a set of 4 unidirectional rings

    that use an SCI-based protocol. Each hypernode contains up to 16 processors (PA8000),

    and a local memory connected by a crossbar. The SCI cache coherence protocol is used

    to keep the node caches coherent. Within a hypernode, a full bit vector directory is used

    to enforce coherence.


    cation [25]. The protocol can be described in a protocol description language, from which the

verifier generates states and verifies against the protocol's specification. It is also difficult to

    ensure that the hardware implementation of a protocol is true to its original specification, so

approaches such as witness strings [4] have been used, where an execution trace used during verification is converted to an input stimulus for logic simulation.

    2.4 Understanding Protocol Performance

    Cache coherence protocols can have a large effect on the performance of multiprocessor systems.

    The performance depends on the data access behavior of applications and no single protocol

    works best for all data access patterns. In general, the invalidate protocol performs well for

    applications in which accesses to a particular data block are performed mostly by the same

    processor or when the data block migrates between processors. In these cases, it is not necessary

to send any messages through the network once the data is in the processor's cache. For

    applications that exhibit a more fine-grained sharing of data blocks, in which a single data item

    is frequently read and written by different processors, the update protocol performs better. By

    sending updates, the data item is always in the cache and misses due to invalidations are avoided.

    System designers and application developers need to be able to compare different protocols and

    assess the effects of different system and application parameters on the performance of protocols.

    To better understand the performance of different protocols a number of classifications of

    data sharing have been proposed. The classifications have been used for various purposes.

For invalidate protocols, Gupta and Weber [87] [38] proposed a number of classes of data access

    patterns. They are distinguished by their use in parallel programs and their invalidation pat-

terns: read-only, migratory, synchronization, mostly-read, frequently read-written, producer-consumer, and irregular read-write. Bennett et al. [13] used the concept

    producer-consumer, migratory, and irregular read-write. Bennett et al. [13] used the concept

    of data access patterns for protocol selection in the Munin software distributed shared memory

    system. They are: write-once, write-many, producer-consumer, private, migratory, result ob-

    jects, read-mostly, synchronization and general read-write. Adve at al. [5] compared hardware

    and software cache coherence protocols using an analytical model. They introduced data access


patterns that are similar to Weber and Gupta's: passively-shared, mostly-read, frequently read-

    written, migratory and synchronization. Brorsson and Stenstrom [17] used different data access

    patterns to analyze the performance of applications running on systems with a limited directory

    invalidate protocol. The data access patterns take into account the type of sharing, read onlyor read/write, and the degree of sharing, exclusive, shared-by-few and shared-by-many.

    In this thesis, the classification proposed by Srbljic et al. [78] is used as a basis for un-

    derstanding the performance of protocols. It is similar to the data access patterns introduced

    by Carter et al. [13] and by Brorsson and Stenstrom [17]. The main difference is that the

    fuzziness in the definition of data access patterns is avoided. For example, Brorsson and

    Stenstrom have data access patterns defined as shared-by-few and shared-by-many, where the

degree of sharing is fuzzy. Carter et al. introduced data access patterns like write-many

    and read-mostly, where the access mode is fuzzy (for example, read-mostly means that a data

    object is read more often than it is written). Srbljic et al. classify data accesses according to the

    number of processors that perform reads and writes to a particular data item. They are: Single

Reader Single Writer (SRSW), Multiple Reader (MR), Multiple Reader Single Writer (MRSW),

    Multiple Writer (MW), Single Reader Multiple Writer (SRMW), and Multiple Reader Multiple

    Writer (MRMW).
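A minimal sketch of this classification (Python; the boundary cases, such as a block with a single reader and no writer, are my reading of the category names rather than rules taken from [78]):

def classify(accesses):
    # accesses: list of (processor, 'r' or 'w') pairs for one block/interval
    readers = {p for p, op in accesses if op == 'r'}
    writers = {p for p, op in accesses if op == 'w'}
    if not writers:
        return 'MR' if len(readers) > 1 else 'SRSW'   # read-only data
    if len(writers) == 1:
        return 'MRSW' if len(readers) > 1 else 'SRSW'
    if len(readers) > 1:
        return 'MRMW'
    return 'SRMW' if readers else 'MW'

print(classify([('P1', 'r'), ('P2', 'r'), ('P1', 'w')]))   # MRSW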

    2.5 Hybrid Protocols

    Since different data blocks may exhibit different types of access behavior, a system which

    uses more than one cache coherence protocol has the potential to lead to an improvement

    in performance. Using the appropriate protocol can lead to a reduction in cache misses and

    coherence traffic, both of which can result in an improvement in performance. A hybrid cache

    coherence protocol can use any one of a given number of different basic protocols, such as

    invalidate or update, for each cache block.

    In addition, the data access behavior for a particular cache block may change during the

    execution of an application. To further increase the potential for performance improvement,

    the protocol for a block can be changed during the execution of an application. These protocols


    Dynamic hybrid protocols with on-line decision functions first appeared in small bus-based

    multiprocessors. They are briefly described in this section because similar techniques have been

    used in larger DSM systems. They use both invalidates and updates and take advantage of the

broadcast properties of the bus. The first such protocol is the write-once protocol [31], in which the first write to a block results in an update to the main memory and an invalidation to the

    other caches. The next write by the same processor results in a change to the local cache only

    and the memory is no longer updated. The Archibald scheme [10] [11] extends the write-once

    protocol by allowing a number of updates while there are no other accesses from other processors

    to that cache block. The competitive scheme [49] sends a number of updates based on a break-

    even point of communication overhead for the two protocols. Eggers and Katz [26] provide a

    comparison of a basic update, basic invalidate, the Archibald, and competitive schemes. They

    conclude that none of the protocols perform best for all applications. The schemes described

    were later extended. Anderson and Karlin extend the competitive scheme [9] by allowing for

    changes to the break-even point during the execution of an application. Dahlgren [22] suggests

    a number of extensions to the Archibald scheme. They consist of merging multiple writes into

a single write, using a write cache, to reduce bus traffic, and snooping on bus data to reduce cache misses, a technique called read snarfing.
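The competitive idea mentioned above is simple enough to sketch (Python; BREAK_EVEN and the names are placeholders, not values from [49]): each cached copy tolerates a bounded number of remote updates, and a copy that receives BREAK_EVEN updates without an intervening local read is invalidated, capping the update traffic spent on a useless copy.

BREAK_EVEN = 4   # illustrative: updates whose cost is taken to equal one miss

class CachedCopy:
    def __init__(self, value):
        self.value = value
        self.credits = BREAK_EVEN
        self.valid = True

    def local_read(self):
        self.credits = BREAK_EVEN        # a read shows the updates were useful
        return self.value

    def remote_update(self, value):
        if self.valid:
            self.value = value
            self.credits -= 1
            if self.credits == 0:        # past break-even: stop taking updates
                self.valid = False       # behave like an invalidation instead

copy = CachedCopy(0)
for v in range(4):
    copy.remote_update(v)
print(copy.valid)   # False: four unread updates reach the break-even point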

    A number of studies have also been performed on DSM systems with directory-based cache

    coherence protocols. Grahn, Stenstrom and Dubois [33] present a directory-based competitive

    scheme and compare it to an invalidate and an update scheme. They use a relaxed memory

    consistency model to hide the latency of updates with the use of a write-buffer at the second-

    level cache. They find that the update performs better than invalidate for applications with

    moderate bandwidth requirements and note that the competitive protocol does not perform well

    with migratory sharing. To reduce some of the traffic associated with the competitive-update

protocol, Dahlgren and Stenstrom [24] introduce a write cache to merge multiple writes. Nilsson

    and Stenstrom [66] add migratory detection to the update protocol to reduce the overhead of

migratory sharing. Additional details of this study are provided in [32]. In a study to determine

    the techniques that can be used to improve the performance of multiprocessors, Stenstrom et

    al. [80] evaluate a number of alternatives. On a sequentially consistent machine they compare


    adaptive sequential prefetching and migratory sharing detection, while on a machine with release

    consistency they compare adaptive sequential prefetching and a hybrid protocol. The hybrid

protocol uses a competitive-update scheme and a write cache. They find that coupled

with sequential prefetching, the hybrid protocol yields combined gains. Similarly, but in the context of reducing useless updates, Bianchini et al. [14] show the effect of bandwidth and

    block size on update and invalidate protocols. They compare a static hybrid protocol and

    a competitive update with coalescing write buffers. They find that software caching and a

    dynamic hybrid protocol reduce most of the useless writes. Coalescing write buffers produce

    the least amount of traffic and have the largest impact on execution time.

Two schemes that use something other than a competitive-update protocol are proposed by Srbljic [77] and Raynaud et al. [72]. Srbljic proposes counters that track the communication cost of the invalidate and update protocols; the protocol used at a given time is switched when the cost reaches a threshold value. Although the results are favorable, an artificial workload is used and few system details are modeled. Raynaud et al. [72] introduce the distance-adaptive model, in which the update pattern is recorded in the directory and then used to determine which blocks should be updated and which invalidated. They compare an invalidate protocol with migratory handling, a competitive update, a delayed competitive update, a delayed competitive update with migratory handling, and two distance-adaptive protocols. The distance-adaptive protocols perform better than the invalidate and competitive protocols.
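As a rough illustration of such counter-based switching, the sketch below charges a cost to the protocol currently in use for a block and flips to the other protocol when the accumulated cost crosses a threshold; all names, costs, and the per-block granularity are illustrative assumptions, not details from [77]:

    enum protocol { INVALIDATE, UPDATE };

    struct block_ctrl {
        enum protocol proto;
        long cost;                  /* traffic cost under the current protocol */
    };

    #define COST_UPDATE_MSG  1      /* assumed cost of one update packet         */
    #define COST_INVAL_MISS  8      /* assumed cost of a miss after invalidation */
    #define SWITCH_THRESHOLD 64     /* assumed switching threshold               */

    static void charge(struct block_ctrl *b, long cost)
    {
        b->cost += cost;
        if (b->cost >= SWITCH_THRESHOLD) {
            /* the current protocol is doing poorly; try the other one */
            b->proto = (b->proto == UPDATE) ? INVALIDATE : UPDATE;
            b->cost = 0;
        }
    }

    void on_write_traffic(struct block_ctrl *b, int sharers)
    {
        if (b->proto == UPDATE)
            charge(b, (long)sharers * COST_UPDATE_MSG);
    }

    void on_coherence_miss(struct block_ctrl *b)
    {
        if (b->proto == INVALIDATE)
            charge(b, COST_INVAL_MISS);
    }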

The disadvantage of run-time approaches is the inability to accurately predict future accesses. The decision function is based purely on information about previous accesses, and basing the prediction of future accesses on past accesses can be inaccurate, although recent work [65] [51] on using hardware techniques similar to branch prediction for coherence actions has yielded encouraging results. Another disadvantage is that run-time schemes require additional hardware, such as counters, which may add significant cost.

    2.5.2 Off-line Decision Function

Another approach to hybrid cache coherence protocols is to use an off-line decision function. The decision function can be implemented in hardware or software. The first method involves analyzing the memory trace for a specific application using hardware performance counters. An application which executes frequently can be fine-tuned using the information provided by specialized hardware. The second, and preferable, method involves implementing an off-line decision function at compile time. The main idea behind this approach is that information on which protocol to use can be extracted from the source code. In contrast to the on-line schemes, the decision is not based solely on previous accesses, which offers the possibility of more accurately predicting future data access patterns.
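For instance, a compiler might annotate writes with protocol hints extracted from the source. The fragment below is purely illustrative: the hint functions are hypothetical stand-ins (here no-op stubs) for the special write commands that such schemes insert into the memory reference stream:

    /* Hypothetical compiler-inserted coherence hints, stubbed out. */
    static void hint_write_update(void *addr)     { (void)addr; }
    static void hint_write_invalidate(void *addr) { (void)addr; }

    void producer(double *shared_out, double *scratch, int n)
    {
        for (int i = 0; i < n; i++) {
            /* consumers read this soon: prefer updates over invalidations */
            hint_write_update(&shared_out[i]);
            shared_out[i] = scratch[i] * 2.0;
        }
        for (int i = 0; i < n; i++) {
            /* rewritten repeatedly with no remote readers: invalidate once */
            hint_write_invalidate(&scratch[i]);
            scratch[i] = 0.0;
        }
    }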

A number of studies have shown the potential improvement from such a scheme. Veenstra and Fowler [84] demonstrate the advantages of dynamic schemes over static ones (for larger cache blocks), as well as of maintaining coherence on a per-block rather than a per-page basis; their performance results are obtained using an optimal off-line protocol. Mounes-Toussi and Lilja [64] present results on the potential of compile-time analysis. They introduce a dynamic hybrid scheme and different levels of compiler capability which insert special write-invalidate, write-update and write-only commands into the memory reference stream. They consider factors that could affect compiler analysis, such as imprecise array subscript analysis and inter-procedural analysis. The study compares the ideal compiler, non-ideal compiler, invalidate-only, update-only, and dynamic schemes, and finds that the compiler schemes outperform the others. Two similar studies [2] [70] assess the value of providing specialized producer-initiated communication primitives that are software controlled. Abdel-Shafi et al. [2] demonstrate that remote writes, called writesend and writethrough, can provide benefits over prefetching and that the combination of both is able to eliminate most of the overhead; the primitives are hand-inserted. Qin and Baer [70] use a protocol-processor implementation of cache coherence and annotate applications with primitives, evaluating a set of prefetch and post-store mechanisms. Sivasubramaniam [75] uses intelligent send-initiated data transfer mechanisms for transferring ownership of critical-section variables; the compiler is able to recognize writes within a critical section. A competitive-update mechanism implemented in software in the network interface is also evaluated. Poulsen and Yew [69], through their work on parallelizing compilers, propose a hybrid prefetching and data forwarding mechanism, in which data forwarding is compiler-inserted for communication between loop iterations. Finally,


    of particular importance to this thesis is the work done by Srbljic et al. [78], which presents a

    number of analytical models and indicates the potential for dynamic hybrid protocols.

Although the work in this thesis is concerned with DSM multiprocessors, one bus-based implementation is worth mentioning because of its compiler implementation of a decision function. Techniques for reducing coherence misses and invalidation traffic were compared by Dahlgren et al. [23]. The study concluded that their dynamic hybrid protocol does as well as their compiler-inserted update scheme in terms of misses, but does better in terms of bus traffic.

Off-line decision functions also have some disadvantages. Some of the run-time information required by the decision function is not easily obtainable; for example, many schemes require information about the interleaving of accesses from different processors. There are also a number of general limitations in compile-time analysis which can result in inaccuracies: performance can vary depending on the extent of memory disambiguation and on whether inter-procedural analysis is available.

    2.6 The NUMAchine Multiprocessor - Evolution

    The work in this thesis is motivated by the NUMAchine multiprocessor project and specifically

    the work done on cache coherence protocols in that context. Although the ideas are applicable

    to shared-memory multiprocessors in general, they are evaluated in detail in the context of

    the NUMAchine multiprocessor. In this section, an overview of NUMAchine development is

    provided. The details of the architecture and cache coherence protocol are given in Chapter 3.

Many of the features of NUMAchine are based on experience with its successful predecessor, a multiprocessor called Hector [86] [81], also developed at the University of Toronto. Hector is a ring-based, clustered, shared-memory machine, depicted in Figure 2.3. Cache coherence in Hector is implemented in software by the operating system using a page-based write-through to memory protocol.

    Although the software coherence scheme provided good performance, interest in developing a

    hardware cache coherent machine grew. Farkas investigated what it would take to provide cache

coherence on an architecture similar to Hector [27] [28].

[Figure 2.3: The Hector multiprocessor. PM = Processor Module; I/O = SCSI, Ethernet, etc.]

He describes how to provide a sequential consistency memory model. He identifies the need for locking at the home memory while a

transaction is in progress and for sending invalidation messages to the top of the hierarchy for

    multicasts. For the invalidation-based cache coherence protocol he proposes using a multicast

    rather than individual invalidations. He also describes an update-based protocol.

One of the goals in the NUMAchine project was to investigate a hardware cache-coherent machine that is cost-effective, easy to use, and performs well. A hierarchical ring structure and features such as processor clustering, a network cache, and a directory protocol were chosen. A cache coherence protocol optimized for the NUMAchine architecture was developed, based on the invalidation write-back scheme suggested in [27].

    An initial overview of the NUMAchine project is given in [3]. It includes plans for hardware,

    operating system and compiler development. A detailed description of the architecture with

    simulation results is given in the NUMAchine technical report [85]. Details of the prototype

implementation are provided in [34] and in NUMAchine-related theses [35] [58]. The architecture

    was subsequently analyzed in [37] and measured performance results were presented in [36].


2.7 Memory Consistency Models

    When writing parallel software, assumptions are made about how the memory system behaves.

    Although there is an intuitive notion about how a shared address space should behave, it

needs to be specified in more detail. Cache coherence dictates that writes to a single location must become visible to all processors in the same order, but it does not

    say anything about when writes to different locations become visible. Since programmers and

    system designers need to worry about this, more than cache coherence is needed to define the

    behavior of the shared address space. The order in which all memory operations are performed

    needs to be defined. This is called the memory consistency model. A number of different models

exist, the most intuitive being sequential consistency.

    Lamport [52] defines sequential consistency as:

A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.

    For this behavior to occur in a multiprocessor system, there must be constraints on the order

    in which memory operations appear to be performed. Determining how to design a system that

    provides this model is difficult, so sufficient conditions were defined. For example, to provide

    sequential consistency [21]:

1. Every processor issues memory operations in program order.

    2. After a write operation is issued, the issuing processor waits for the write to

    complete before issuing the next operation.

    3. After a read operation is issued, the issuing processor waits for the read to

complete, and for the write whose value is being returned by the read to complete, before issuing its next operation. That is, if the write whose value is being returned has performed with respect to this processor (as it must have if its value is being returned), then the processor should wait until the write has performed with respect to all processors.

(In the original text, the word process is used instead of processor.)

    The constraints focus on program order and the appearance that one operation is complete

    with respect to all processors before the next one is issued. This means that all writes to any

    location must appear to all processors to have occurred in the same order, which is a difficult

    requirement for most systems.
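As a concrete illustration, consider the classic producer/consumer fragment below (a C sketch; the variables are assumed shared, and compiler reordering is ignored for clarity). Under sequential consistency, a consumer that observes flag == 1 must also observe data == 42; weaker models make no such promise without explicit synchronization.

    int data = 0;   /* shared */
    int flag = 0;   /* shared */

    void producer(void)             /* runs on one processor */
    {
        data = 42;                  /* write 1 */
        flag = 1;                   /* write 2: under SC, seen after write 1 */
    }

    void consumer(void)             /* runs on another processor */
    {
        while (flag == 0)
            ;                       /* spin until the flag is observed */
        int r = data;               /* under SC this must read 42 */
        (void)r;
    }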

To allow for additional hardware and compiler optimizations, which are commonly used in uniprocessors, a number of less strict, relaxed, models have been proposed [6]. These optimizations can increase performance, which is the main reason many commercial multiprocessors use them, but at the cost of the added complexity of the relaxed models, which make it harder for users and designers of systems to understand and reason about correctness. Recently, a study re-examined the use of relaxed models because modern high-performance processors leave little additional performance to be gained from relaxed schemes; in light of this, there may be less incentive for implementing these less-intuitive programming models [45].

One of the goals of the NUMAchine multiprocessor was usability, and because of it the system was designed to support sequential consistency. Although providing this memory consistency model may be expensive in some architectures, the NUMAchine architecture inherently provides a simple and efficient means of supporting it. The necessary ordering between writes to different locations is provided by defining fixed sequencing points in the ring hierarchy [37]. This ensures that a multicast invalidation does not become active until it passes the sequencing point on the highest ring level that must be traversed to reach all multicast destinations. This imposes the necessary ordering, at the expense of an increase in the average traversal length for sequenced packets (i.e., invalidations).
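A small sketch of the sequencing decision follows; the test is one reading of the ordering requirement, not the actual NUMAchine hardware logic:

    #include <stdbool.h>

    /* An invalidation becomes active only after passing the sequencing
     * point on the highest ring level it must traverse to reach every
     * destination (illustrative rule). */
    bool sequence_at_central_ring(unsigned dest_rings_mask, unsigned sender_ring)
    {
        /* a destination on any ring other than the sender's forces the
           packet up to the central ring, which orders it globally */
        return (dest_rings_mask & ~(1u << sender_ring)) != 0;
    }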

    2.8 Remarks

    The related work and the survey of state-of-the-art multiprocessor implementations presented

    in this chapter provide a number of interesting points.


Cache coherence protocols are critical aspects of shared-memory multiprocessor systems, and much effort has gone into their design and implementation. Directory-based cache coherence protocols are the de facto standard for medium- to large-scale distributed shared-memory (DSM) multiprocessors. The best architecture and cache coherence protocol for a shared-memory multiprocessor has not been determined. However, the NUMAchine multiprocessor provides a good platform for research in cache coherence protocols because its architecture and cache coherence protocol are in line with current multiprocessors.

    To achieve good performance in a DSM multiprocessor, it is important to understand the

    communication patterns of applications and the behavior of cache coherence protocols for these

    patterns. Since no single protocol is best suited for all communication patterns, using more

    than one has shown some promise. An open question remains as to the benefits of such a

    scheme in a DSM multiprocessor, in particular one that supports sequential consistency.

Chapter 3

The NUMAchine Cache Coherence Protocol

[Figure 3.1: NUMAchine architecture. A central ring connects local rings, which connect stations; P = Processor, M = Memory, NI = Network Interface, I/O = SCSI, Ethernet, etc.]

    3.1.1 Architecture

    The NUMAchine architecture is hierarchical. Processors and memory are distributed across a

    number of nodes called stations. Each station contains a number of processors and a portion of

    the total system memory. The organization of the memory is such that each memory address

    has a fixed home station. The stations are connected by one or more levels of unidirectional

    bit-parallel rings which operate using a slotted-ring protocol. The time to access a memory

    location in the system varies depending on which processor issues the request and where the

    request is satisfied in the system. Therefore, the architecture is of the NUMA (Non-Uniform

    Memory Access) type.

    The 64-processor machine consists of two levels of rings as shown in Figure 3.1. At the top

    of the hierarchy, a central ring connects four local rings through inter-ring interfaces. At the

    next level, each local ring connects four stations through a ring interface. Each station contains

    four MIPS R4400 processors [41] with 1-MByte external secondary caches, a memory module

    (M) with up to 256 MBytes of DRAM for data and SRAM for status information of each cache

    block, a network interface (NI) which handles packets flowing between the station and the ring,

    and an I/O module which has standard interfaces for connecting disks and other I/O devices.


    The modules on a station are connected by a bus. Along with mechanisms to handle packets

    flowing to and from the rings, the network interface also contains an 8-MByte DRAM-based

    network cache for storing cache blocks from other stations. The network cache also contains

    SRAM used to store status information of cache blocks.

    3.1.2 Interconnection Network

    The interconnection network consists of a bus in each station and a hierarchy of rings connecting

    the stations. The rings are unidirectional and use a slotted protocol. The hierarchy provides

    increased total bandwidth by allowing for transfers to take place concurrently on several rings.

    Experience from the Hector multiprocessor [86] demonstrated that using an interconnection

    network based on rings provides a number of benefits:

• They are easy to build because they consist of point-to-point connections. The network interfaces are simple, with only one input port and one output port. The issues of loading and signal reflections from multiple connections that limit the number of connections that can be provided by a bus are avoided.

• They can transmit signals reliably at high clock rates because of the simplicity of the hardware required to implement them. Short critical paths in logic and short lines in the interconnection network make this possible.

• The multiprocessor can be expanded easily, without large wiring or topology changes, making the system highly modular.

• They provide a natural multicasting capability. The sender of a multicast needs to send only a single packet with multiple destinations selected. The packet travels around the ring and is replicated only when it reaches the interfaces of the destinations.

• They provide ordering among packets. A unique path exists between any two stations in the system, and the network interfaces are designed not to allow packets to bypass each other.


• They have subsequently been shown to perform well in comparison with meshes for configurations up to 128 processors [71].

The natural ordering among packets and the multicast ability are useful for efficiently implementing cache coherence and a sequentially consistent memory. The ordering of packets in the NUMAchine ring hierarchy is maintained because a unique path exists between any two stations and the point-to-point order of packets is preserved: a packet cannot overtake another one in the network on its way to a destination. The multicast capability is a fundamental property of rings. A single packet can be targeted for multiple destinations; the packet travels around the ring and is replicated at each destination.

A split-transaction protocol is used in the interconnection network, meaning that the transactions required to maintain coherence are split into requests and responses. For example, a processor places a read request on the bus and then releases the bus; when the memory is ready to respond with the data, it requests the use of the bus.

Requests and responses, broken up into packets, travel along a single physical interconnection network. The packets are buffered at each module's connection to the network to allow for more concurrency in the system. Each module contains incoming and outgoing buffers. Although only one physical network exists, it is split in the ring interface and processor modules into two virtual networks for deadlock avoidance. These modules contain two types of outgoing buffers: one for requests and the other for responses. During periods of congestion, requests are halted while responses are allowed to proceed. From the perspective of cache coherence, the interconnection network looks like a single ordered network: requests cannot pass other requests, and responses do not pass other responses. Only the ordering of responses with respect to requests can change, and vice versa.
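A minimal sketch of the arbitration implied by the two virtual networks follows; the FIFO, its size, and the congestion test are assumptions for illustration, not the actual hardware design:

    #include <stdbool.h>

    /* A toy FIFO of packet ids, standing in for a hardware buffer. */
    struct fifo { int pkt[16]; unsigned head, tail; };

    static bool fifo_empty(const struct fifo *q) { return q->head == q->tail; }
    static int  fifo_pop(struct fifo *q)         { return q->pkt[q->head++ % 16]; }

    struct module_port {
        struct fifo out_requests;   /* first virtual network  */
        struct fifo out_responses;  /* second virtual network */
    };

    /* Select the next packet for the single physical link (-1 if none). */
    static int next_packet(struct module_port *p, bool congested)
    {
        if (congested) {
            /* requests are halted; only responses may proceed, so that
               in-flight transactions can complete and deadlock is avoided */
            return fifo_empty(&p->out_responses)
                 ? -1 : fifo_pop(&p->out_responses);
        }
        /* uncongested: serve either queue (responses first, arbitrarily) */
        if (!fifo_empty(&p->out_responses))
            return fifo_pop(&p->out_responses);
        if (!fifo_empty(&p->out_requests))
            return fifo_pop(&p->out_requests);
        return -1;
    }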

    3.1.3 Communication Scheme

    The routing of packets begins and ends at a station. A novel routing scheme for packets is

    implemented in NUMAchine. The destination of a packet is specified using a routing mask.

    The routing mask consists of fields that represent levels in the hierarchy. The number of bits

    in a field corresponds to the number of targets in the next level of hierarchy.


[Figure 3.2: Routing mask. One 4-bit field per level; the masks for station 0 on local ring 0 (0001 0001) and station 3 on local ring 3 (1000 1000) OR to 1001 1001.]

In the two-level prototype, the routing mask consists of two 4-bit fields. Bits set in the first field indicate the destination ring, while bits set in the second field indicate the destination

    station on the ring. For point-to-point communication, each station in the hierarchy can be

    uniquely identified by setting one bit in each of the fields. Multicasting to multiple stations is

    possible by setting more than one bit in each of the fields; however, setting more than one bit

    per field can specify more stations than required. For example, to send a packet to station 0

    on local ring 0 (0001 0001) and to station 3 on local ring 3 (1000 1000), the routing mask is

    set to the logical OR of the two (1001 1001) as shown in Figure 3.2. Due to over-specification

    inherent in the mask, the packet would also be sent to station 0 on ring 3 (1000 0001) and

    station 3 on ring 0 (0001 1000).

    This communication scheme makes the routing of packets on the ring simple and fast. Each

    ring and each station needs only to check a single bit to determine whether it is the destination

    for the packet.
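The scheme can be expressed compactly in code. The sketch below follows the two-level mask layout and the example above; the helper names are mine, not NUMAchine's:

    #include <stdint.h>
    #include <stdio.h>

    /* 4-bit ring field (upper nibble) and 4-bit station field (lower). */
    static uint8_t make_mask(int ring, int station)
    {
        return (uint8_t)((1u << (4 + ring)) | (1u << station));
    }

    /* A station checks one bit per field to see whether it is a target. */
    static int is_destination(uint8_t mask, int ring, int station)
    {
        return (mask & (1u << (4 + ring))) && (mask & (1u << station));
    }

    int main(void)
    {
        /* station 0 / ring 0 is 0001 0001; station 3 / ring 3 is 1000 1000 */
        uint8_t m = make_mask(0, 0) | make_mask(3, 3);   /* 1001 1001 */

        printf("%d %d %d %d\n",
               is_destination(m, 0, 0),    /* intended destination: 1 */
               is_destination(m, 3, 3),    /* intended destination: 1 */
               is_destination(m, 3, 0),    /* over-specified extra: 1 */
               is_destination(m, 0, 3));   /* over-specified extra: 1 */
        return 0;
    }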

    3.1.4 Organization of the Network Cache

A third-level cache, called the network cache, exists on each station in the network interface module. It stores copies of cache blocks whose home memories are on other stations. It is a direct-mapped cache which does not enforce the inclusion property [12]. Not enforcing the inclusion property means that the network cache does not contain copies of all cache blocks in


caches below it in the hierarchy. For example, a processor secondary cache on the local station may contain a cache block that is not present in the network cache. The next section describes a number of interesting problems and solutions that arise from this property.

    3.2 Protocol Features

The NUMAchine cache coherence protocol is a hierarchical, directory-based, write-back invalidate protocol optimized for the NUMAchine architecture. It exploits the multicast mechanism and utilizes the inherent ordering provided by the ring.

    Before proceeding, it is useful to define some terminology. The home memory of a cache

    block refers to the memory module to which the cache block belongs. If a particular station

    is being discussed, it is referred to as the local station. Local memory or local network cache

    refer to the memory or network cache on that station. Remote station, remote memory or

    remote network cache refer to any memory, network cache or station other than the station

    being discussed.

    3.2.1 Processor Behavior

    The MIPS R4400MC [41] processor has two levels of caches: an on-chip primary cache and an

    off-chip secondary cache. It also comes with support for a variety of cache coherence protocols.

    Each cache block in the caches has a cache coherence state associated with it. In the secondary

cache, three basic states, dirty, shared, and invalid, are defined in the standard way for write-back invalidate protocols.

    The processor issues a request if it misses in its caches. A read miss occurs if the cache

    block is not in the cache or if it is in the invalid state. A write miss occurs if the cache block is

    not in the dirty state. The processor stalls on read and write misses. When replacing a cache

    block, the processor writes it back to the home memory if it is in the dirty state. Otherwise,

    the cache block is overwritten, without notifying the home memory.

    The processor can respond to a number of external requests. An external read request will

cause the processor to return the data if the cache block is in the dirty state and negatively acknowledge (NACK) the request otherwise. On an external invalidation, the processor will invalidate its copy of the cache block.
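The secondary-cache behavior just described can be summarized in a small state sketch; the encoding and function names are illustrative, not the R4400's actual interface:

    #include <stdbool.h>

    enum state { INVALID, SHARED, DIRTY };
    enum reply { REPLY_DATA, REPLY_NACK };

    struct cache_line { enum state st; };

    /* A read misses if the block is absent or invalid. */
    bool is_read_miss(const struct cache_line *l, bool present)
    {
        return !present || l->st == INVALID;
    }

    /* A write misses unless the block is held dirty. */
    bool is_write_miss(const struct cache_line *l, bool present)
    {
        return !present || l->st != DIRTY;
    }

    /* External read: supply data only if dirty, otherwise NACK. */
    enum reply on_external_read(const struct cache_line *l)
    {
        return (l->st == DIRTY) ? REPLY_DATA : REPLY_NACK;
    }

    /* External invalidation: drop the copy. */
    void on_external_invalidate(struct cache_line *l)
    {
        l->st = INVALID;
    }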

    3.2.2 Protocol Hierarchy

    The NUMAchine cache coherence protocol is hierarchical. Cache coherence is maintained at two

    levels as shown in Figure 3.3: the station level and the network level. Station-level coherence is

    maintained between the local memory and the processor caches on a station, or between the local

    network cache and the processor caches if the home location of a cache block is a remote station.

    Network-level coherence is maintained between the home memory of a cache block and all the

    remote network caches with copies of the cache block. Information for maintaining coherence

    at the station and network levels is stored in the directories; a directory-based protocol is used

at both levels.

[Figure 3.3: Station and network level coherence. Each station contains processors (P1-P4), memory (M), and a network cache (NC).]
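The directory state at the two levels can be pictured as follows; the field layouts are assumptions for illustration, not the actual NUMAchine directory format:

    #include <stdint.h>

    /* Network-level entry at the home memory: which stations hold copies. */
    struct mem_dir_entry {
        uint8_t station_mask;   /* one bit per station with a valid copy */
        uint8_t state;          /* e.g. valid / dirty / locked           */
    };

    /* Station-level entry at the local memory or network cache:
     * which on-station processors hold copies. */
    struct station_dir_entry {
        uint8_t proc_mask;      /* one bit per processor on the station */
        uint8_t state;
    };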

    3.2.3 Invalidations

    A cache coherence protocol must have a mechanism to make writes visible to all processors (write

    propagation). The NUMAchine cache coherence protocol uses invalidations for this purpose.

    A cache coherence protocol must also ensure that all processors see writes to a location as

    having happened in the same order (write serialization). To ensure write serialization the

    NUMAchine protocol uses locking states and takes advantage of the ordering properties of the

    interconnection network. In this section, the mechanism to perform writes is described.


    In a typical multiprocessor system, requests are serialized by either the memory or the

    current owner of a cache block. A write request first goes to the memory, which is aware of

    all copies in the system and sends individual invalidations to each processing node with a valid

    copy. Upon receiving the invalidation, each node replies with an invalidation acknowledgmentto the original requester. When the requester has received all the