

Redundant Linked List Based Cache Coherence Protocol

Qiang Li and Stevan Vlaovic
Department of Computer Engineering
Santa Clara University, Santa Clara, CA 95053
qli@scuacc.scu.edu

Abstract

This article presents a distributed directory based cache coherence protocol that improves performance and facilitates error recovery in large scale multiprocessors. A number of distributed directory based protocols, such as the Scalable Coherent Interface (SCI, ANSI/IEEE Std 1596), use a linked list structure to maintain cache coherence. While they work well for small to medium size systems, the list traversal overhead becomes high when the system size grows into the range of thousands of processors. The system is also vulnerable to a single node failure, in that recovery from such a failure involves all the processors in the system. Single node failures can happen relatively frequently when a protocol is applied to SCI-based Local Area MultiProcessors (LAMP), where individual nodes are autonomous computers and can power up and down individually. We propose an enhancement to the linked list approach. A redundant spanning list is constructed as the list is built, which achieves two goals: 1) the list traversal time is reduced from O(N) to O(√N), and 2) recovery from a single node failure is confined to the processors involved in the failed list, unless the head of the list is lost.

Keywords: Distributed shared memory, cache coherence, fault tolerance, cache performance.

1: Introduction

Cache coherence is a common problem for multiprocessor systems. When the number of processors is small and a single bus can be used, a snooping scheme suffices. However, when the system grows large, a single bus can no longer satisfy the bandwidth needs of the processors. A multiple-bus structure must be used to scale the system up to even a moderate size. In this case, directory based cache coherence protocols are usually used. The Scalable Coherent Interface (SCI, ANSI/IEEE Std 1596) [1] is an attempt to solve the scalability problem suffered by traditional bus architectures when building large scale parallel machines by using point-to-point connections and a linked list based cache directory. SCI can connect a large number (up to 64K) of nodes or machines so that all memory modules associated with the nodes become part of a global physical memory space [6]. Accessing a memory location on a remote node is done by hardware as if it were in the local memory module, except that the latency is higher. This is effectively a Non-Uniform Memory Access time model (NUMA) [9, 10].

One important architectural model enabled by SCI is the Local Area MultiProcessors (LAMP), in which a large number of nodes, individual workstations, or processor-memory units in a multiprocessor box are connected into a seamless shared memory MP. The physical appearance of a LAMP is very similar to a cluster of computers connected by a local area network, with one fundamental difference: all nodes of a LAMP share physical memory, and cache coherence among the processors is maintained by hardware. In contrast, the shared memory of a cluster of workstations, if provided, is done by software emulation, such as the systems described in [5, 11, 13, 14].



As a result, an SCI-based LAMP can have a bandwidth of up to 1 GByte/sec and latencies in the range of sub-microseconds to microseconds.

SCI uses a linked list directory structure for its cache coherence protocol. The traversal time of the list is O(N). When the number of processors is large, a few hundred to a few thousand, the linked list can be long and the overhead of the traversal time may become too high. While tree-based directory structures are being studied, we propose a simple extension which gives O(√N) list traversal time and higher reliability.

The rest of the article is organized as follows. In Section 2, we discuss some cache coherence protocols, including SCI. In Section 3, we describe the proposed structure and analyze its characteristics.

2: Hardware cache coherency

2.1: Directory based protocols

Directory based protocols use a directory to keep track of the caches that share the same cache line. The individual caches can insert and delete themselves from the directory as appropriate, to reflect the use or roll-out of shared cache lines. When a processor writes into a cache line, the directory is used to find all copies that need to be either updated or purged.

The directory can either be centralized at memory, or distributed among the nodes in a distributed shared memory (DSM) machine. The centralized directory was first developed by Tang [15], later modified by Censier and Feautrier [2], and implemented in the Stanford DASH [4]. Generally, the centralized directory maintains a bit map of the individual caches, where each set bit represents a shared copy of a particular cache line in the corresponding cache. The advantage of this type of implementation is that the entire sharing list can be found by simply examining the appropriate bit map. However, the centralization of the directory also forces each potential reader and writer to access the directory. In a system that has a global main memory, this access would become a bottleneck. Additionally, the reliability of such a scheme is in question, as a fault in the node holding the directory means that the entire directory would be lost.
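To make the bit map concrete, here is a minimal C sketch of a centralized directory entry; the 64-node cap, the field names, and the idea of returning a message count are illustrative assumptions, not details taken from the cited designs.

#include <stdint.h>
#include <stdbool.h>

/* Hypothetical centralized directory entry: one presence bit per
 * node, so the entry grows with the machine size (capped here at
 * 64 nodes to fit one 64-bit word). */
#define MAX_NODES 64

typedef struct {
    uint64_t sharers;   /* bit i set => node i holds a shared copy */
    bool     dirty;     /* true if some cache holds a modified copy */
} dir_entry_t;

/* Record that node `id` obtained a shared copy. */
static void dir_add_sharer(dir_entry_t *e, int id) {
    e->sharers |= (uint64_t)1 << id;
}

/* On a write by `writer_id`, every other sharer is found by scanning
 * the bit map and must be sent an invalidation; returns how many
 * messages that is.  This single scan point is both the convenience
 * and the bottleneck of the centralized scheme. */
static int dir_invalidate_others(dir_entry_t *e, int writer_id) {
    int msgs = 0;
    for (int i = 0; i < MAX_NODES; i++)
        if (i != writer_id && (e->sharers & ((uint64_t)1 << i)))
            msgs++;
    e->sharers = (uint64_t)1 << writer_id;
    e->dirty = true;
    return msgs;
}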

By distributing the directory, the bottleneck is relieved and the reliability is increased [7, 6, 12]. Such designs are called distributed pointer (DP) protocols. In this type of system, a linked list is created dynamically to reflect the sharing members at that particular time. The caches can insert and delete themselves from a linked list as necessary. This avoids including every node in the directory even though only a small number of caches may be sharing the cache line. In the DP protocol, the memory holds one pointer to the linked list, and the cache line then has other associated pointers, depending on the specific protocol.
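By contrast, a distributed pointer directory needs only constant-size state at memory. The sketch below is an assumed, simplified rendering of that memory-side entry; the field names are illustrative and not taken from SCI or the cited papers.

/* Hypothetical memory-side directory state in a distributed
 * pointer (DP) protocol: per cache line, memory keeps only a
 * pointer to the head of the sharing list; the remaining list
 * pointers live in the sharers' own cache lines. */
typedef struct {
    int head_node;   /* node id of the sharing-list head, -1 if none */
    int state;       /* line state maintained by the protocol        */
} mem_dir_entry_t;

/* Compared with a bit map over all nodes, this entry stays the same
 * size regardless of the machine size, which is what lets the scheme
 * scale to thousands of nodes. */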

2.2: SCI’s cache coherence protocol

The cache coherence protocol specified by SCI is defined over a doubly linked list [1]. For a cache to join the sharing list, it must become the new head of the list. While insertions must occur at the head of the list, deletions can happen anywhere along the list. The only cache capable of writing is the head. Once the line is written, the rest of the list is purged; if the members wish to obtain the new copy, they must join the list again. This invalidation scheme follows the current trend of invalidating cache lines rather than updating them [3].

The mechanism for maintaining the list is fairly straightforward. A cache inserting itself into a list generally starts by generating an address. This address is forwarded to main memory (not necessarily remote), where the memory either supplies the data for the request or forwards the request to the head of the sharing list.



In this case, memory does not have the most recent copy, so the data must be retrieved from the list. The new requester then negotiates for the data in question with the head and becomes the new head of the list. This new head acquires a (bidirectional) pointer to main memory and to the old head. The old head is termed its forward neighbor; the old head's backward neighbor is the new head of the list. The list is thus a function of time, with the oldest entry being the tail. In order for a cache to remove itself from the list, it merely asks its forward and backward neighbors to connect to each other.
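The join and roll-out operations just described amount to head insertion and self-removal in a doubly linked list. The sketch below collapses the SCI packet exchanges into direct pointer updates, so the structure and function names are illustrative assumptions rather than the standard's actual interface.

#include <stddef.h>

/* Simplified view of one cache's entry in an SCI-style sharing
 * list: a doubly linked list whose head is tracked by memory. */
typedef struct sci_entry {
    struct sci_entry *fwd;   /* toward the tail (older entries) */
    struct sci_entry *bwd;   /* toward the head (newer entries) */
} sci_entry_t;

typedef struct {
    sci_entry_t *head;       /* memory's pointer to the list head */
} mem_line_t;

/* A new sharer always becomes the head: it links to the old head
 * and memory is updated to point at it. */
static void sci_join(mem_line_t *mem, sci_entry_t *newcomer) {
    newcomer->fwd = mem->head;   /* old head is the forward neighbor */
    newcomer->bwd = NULL;        /* the head has no backward neighbor */
    if (mem->head != NULL)
        mem->head->bwd = newcomer;
    mem->head = newcomer;
}

/* A sharer rolls out by asking its neighbors to connect directly. */
static void sci_leave(mem_line_t *mem, sci_entry_t *e) {
    if (e->bwd != NULL)
        e->bwd->fwd = e->fwd;
    else
        mem->head = e->fwd;      /* e was the head */
    if (e->fwd != NULL)
        e->fwd->bwd = e->bwd;
    e->fwd = e->bwd = NULL;
}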

3: Large scale shared list extensions

For large scale multiprocessor systems, an efficient mechanism to invalidate and update shared cache lines is necessary. The structure should avoid any points of contention and also be reasonably fault tolerant. We believe the single doubly linked list approach is somewhat inadequate in dealing with recovery from an arbitrary failure. Should the head or any node of the list exit ungracefully, part of the list is lost. The only way to regain the sharing list is to interrupt all the processors in the system and poll them to see which ones hold the offending cache line. Given that the sharing list often contains only a small portion of the system, the overhead is significant and the burden on the operating system is heavy. This is clearly not the optimal policy in a large scale system.

3.1: Binary tree extensions

To alleviate the inherently sequential nature of the purge operation, a tree structured extension to the SCI standard is under study. These extensions attempt to reduce the linear latency associated with the invalidation command to a logarithmic latency [1]. This is achieved by adding an additional pointer to the original specification. A writer is capable of invalidating the sharing list by following the binary tree structure, hence the logarithmic latency. However, this assumes a nearly balanced tree; otherwise this structure will also default to the sequential linked list. Since new additions to the list must be made at the root (head), maintaining a balanced tree is not trivial.

The working group in charge of these extensions is the Kiloprocessor Extensions to SCI (P1596.2). Part of the Wisconsin STEM [8] effort on behalf of cache coherency is being considered as an extension to SCI. The Wisconsin STEM (permuted acronym for Tree Merging Extensions to SCI) also organizes the sharing set as a binary tree. The overhead for each 64-bit cache line includes three pointers and one five-bit height field. A five-bit height is sufficient for 64K nodes, as discussed in [8].

The problem with STEM is that tree balancing is extremely difficult, as mentioned previously. Network combining mechanisms must be present, and there must be simultaneous requests, for the tree to be built efficiently. The need to interrupt all the nodes in the case of an error also remains in this extension. In the following sections we explore a redundantly connected sharing list that improves performance and facilitates graceful error recovery.

3.2: Redundantly connected sharing lists

The goals in our design are to keep the structure simple, increase reliability, increase performance with regard to invalidations and updates, and to make it fully compatible with the current SCI specifications, which represent a simple and elegant design.



Figure 1. The spanning list

The proposed structure is shown in Figure 1. The shaded nodes are called the hub nodes, and the list linking the hub nodes is called the spanning list. The portions of the original linked list between two hub nodes are called the local lists. When a request is sent from the head, it travels along the spanning list. At each hub node, a replica of the request is sent through the local list. A replicated request traveling in a local list is discarded when it encounters the next hub node. Thus, the total time of traversing a list is reduced from traversing N nodes (N is the total number of nodes in the sharing list) to traversing some hub nodes and a local list.

The node format is shown in Figure 2. The fwd and bwd pointers are used to connect the original list, and the sfwd and sbwd pointers (called the spanning pointers) are used to connect the spanning list. Among other information, the status field indicates whether or not a node is a hub node. The counter is used to decide which nodes become hub nodes, as discussed below. Although the additional pointers and the counter increase the cache overhead, this increase is inevitable when adding redundancy. As we will show later, the overhead is a good tradeoff for both performance and reliability.


Figure 2. Cache line format
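A rough C rendering of the cache line format in Figure 2, together with a purge walk that visits each sharer exactly once, is given below. The type and field names are assumptions, the head is assumed to be a hub node, and the sequential walk only illustrates coverage; in the real protocol the replicas travel the local lists in parallel, which is where the time saving comes from.

#include <stddef.h>

/* Rough model of the per-cache-line state in Figure 2. */
typedef enum { NODE_NORMAL, NODE_HUB, NODE_BYPASS } node_status_t;

typedef struct rll_node {
    struct rll_node *fwd, *bwd;     /* original doubly linked list */
    struct rll_node *sfwd, *sbwd;   /* spanning (redundant) list   */
    int              counter;       /* drives hub-node selection   */
    node_status_t    status;        /* normal, hub, or bypass node */
    /* ... other info (tag, state bits, data) ... */
} rll_node_t;

/* Purge starting from the head (assumed to be a hub node): walk the
 * spanning list of hub nodes and, at each hub, replay the purge down
 * its local list until the next hub is reached.  Returns the number
 * of nodes that receive the purge, i.e. every sharer exactly once. */
static int purge_from_head(rll_node_t *head) {
    int purged = 0;
    for (rll_node_t *hub = head; hub != NULL; hub = hub->sfwd) {
        purged++;                              /* the hub itself      */
        for (rll_node_t *n = hub->fwd;
             n != NULL && n->status != NODE_HUB; n = n->fwd)
            purged++;                          /* nodes in local list */
    }
    return purged;
}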

To minimize the time to traverse a list, the length of the local list needs to be examined. It is not difficult to see that the list structure in Figure 3 gives the optimal performance in that it has the same distance between the head and the last node of each local list. That is, assuming the time to traverse a node is the same for all nodes, a request will reach the last nodes of all local lists at the same time.

To achieve such a configuration, the counter shown in the node format (Figure 2) is needed. When the first node of the list is created, the counter is initialized to 2. When a new node is prepended to the list, the counter is passed to the new node, and the new node decrements the counter value by 1. If the new value is greater than 0, the node holds the value. If the new value becomes zero, the new node becomes a hub node, and its counter is then set to the value of the previous hub node's counter plus one.

Figure 3. An optimal configuration


Figure 4. The construction of a sharing cache list

Figure 4 shows a sample construction sequence of a list. Initially there is just one cache containing the data (part (a)). Since there is only one node, the pointers are all set to null. The counter is initialized to 2, as marked in the box. Then, in part (b), one more node joins the list and becomes the head (as specified by SCI). It carries a pointer to the hub node (dotted arrow) and the counter of the previous node minus one. Since the counter is greater than 0, the node does not become a hub node. Part (c) shows the transient state where the third node joins the list. It receives the spanning pointer to the previous hub node (dotted arrow) and the counter. The counter is now 0, so the new node becomes a hub node by taking the following actions (let x be the new node and y be the previous hub node):

1. x sends a pointer to itself to y;
2. y receives the pointer and points to x;
3. y sends its counter to x;
4. x receives the counter, adds 1 to it, and keeps it;
5. x marks itself as a hub node in its status field.

This sequence of actions leads to the configuration in part (d). Parts (e) and (f) show the configurations with 6 and 8 nodes, respectively.
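The prepend-and-promote rule, including the five actions above, can be sketched as local pointer operations on the node structure introduced after Figure 2. In the actual protocol these steps are message exchanges between caches, and the function name and pointer orientation here are assumptions.

/* rll_node_t, NODE_NORMAL and NODE_HUB are from the sketch that
 * follows Figure 2.  Prepend `newcomer` as the new head and decide,
 * via the counter rule, whether it becomes a hub node. */
static void rll_prepend(rll_node_t *newcomer, rll_node_t *old_head,
                        rll_node_t *prev_hub) {
    /* Standard SCI-style insertion at the head of the list. */
    newcomer->fwd = old_head;
    newcomer->bwd = NULL;
    old_head->bwd = newcomer;

    /* Inherit and decrement the counter passed down the list, and
     * carry the spanning pointer to the previous hub node. */
    newcomer->counter = old_head->counter - 1;
    newcomer->sfwd    = prev_hub;

    if (newcomer->counter > 0) {
        newcomer->status = NODE_NORMAL;              /* just hold it   */
    } else {
        /* Counter reached zero: promote to a hub node. */
        prev_hub->sbwd    = newcomer;                /* actions 1-2    */
        newcomer->counter = prev_hub->counter + 1;   /* actions 3-4    */
        newcomer->status  = NODE_HUB;                /* action 5       */
    }
}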

The aforementioned example gives the basic construction of the list. The resulting configuration does help performance and error recovery. When any of the nodes in the local lists fails, all the remaining nodes are still connected. However, the hub nodes are still vulnerable: when a hub node fails, part of the list becomes disconnected. To solve this problem, redundant links that bypass the hub nodes are needed.



It is natural to link the nodes before and after a hub node, creating the configuration shown in Figure 5. Notice that these connections are established using the sfwd and sbwd pointers.


Figure 5. Additional links to bypass the hub nodes

When a new node prepends itself to a hub node, additional actions are taken to connect the bypass links. Figure 6 shows the protocol. Suppose x is a hub node and is the current head, as shown in part (a). Node y is the backward neighbor of x. A new node, z, joins the list. Node x sends a packet to z containing a pointer to x, a pointer to y, the counter value 8, and a code indicating that the current head is a hub node (shown in part (b)). Upon receiving the information, z points to x and y, sets its own counter to 7 and its status to bypass, and sends a packet to y containing a pointer to z, as shown in part (c). Node y points to z when it receives the packet, which completes the process as shown in part (d). When the next node joins the list, node y checks its list and, realizing that it is a bypass node, sends the pointer to x as both the normal pointer and the spanning pointer.
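A very rough sketch of this bypass handshake is given below, again collapsing the packet exchange into direct assignments on the node structure from Figure 2. The function name, the pointer orientation, and the way the counter is derived are assumptions made for illustration only.

/* rll_node_t and NODE_BYPASS are from the sketch after Figure 2.
 * Create a bypass link around hub node x when a new node z joins in
 * front of it; y is the node on the far side of x.  After the
 * exchange, z and y point to each other through the spanning
 * pointers, so the list stays connected even if x fails. */
static void rll_prepend_to_hub(rll_node_t *z, rll_node_t *x, rll_node_t *y) {
    /* Ordinary insertion at the head, in front of the hub x. */
    z->fwd = x;
    z->bwd = NULL;
    x->bwd = z;

    /* x hands z its own identity, y's identity, and the counter;
     * z records the bypass link and marks itself accordingly. */
    z->counter = x->counter - 1;
    z->status  = NODE_BYPASS;
    z->sfwd    = y;       /* bypass toward the tail, skipping x */

    /* y is told about z and completes the other half of the link. */
    y->sbwd = z;          /* bypass toward the head, skipping x */
}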


Figure 6. Linking the bypass links



We now evaluate the performance of the proposed structure. For convenience and without losing generality, we assume the head of the list is a hub node. By definition, starting from the head toward the end of the list, each local list is one node shorter than the previous one, which can be seen clearly in Figure 7. If the first local list (the one after the head) has length k (including the leading hub node), the total number of nodes, N, in the list is

N = k + (k − 1) + ... + 1 = k(k + 1)/2

Assuming k ≫ 1, we have

k ≈ √(2N) ≈ 1.4√N

That is, given a list of N nodes, the time to send a request to all nodes is equivalent to the time to traverse 1.4√N nodes.
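As a quick numeric illustration (not from the paper), the short program below computes the optimal first local list length k ≈ √(2N) and compares the roughly 1.4√N nodes on the longest purge path with a full O(N) traversal.

#include <math.h>
#include <stdio.h>

int main(void) {
    /* For N sharers in the optimal configuration, the first local
     * list has length k ~ sqrt(2N), and a purge reaches the end of
     * every local list after visiting about 1.4*sqrt(N) nodes on
     * its longest path, instead of all N nodes sequentially. */
    const int sizes[] = { 64, 256, 1024, 4096, 16384 };
    for (int i = 0; i < (int)(sizeof sizes / sizeof sizes[0]); i++) {
        int n = sizes[i];
        printf("N = %5d   k = %6.1f   spanning cost = %6.1f   linear cost = %5d\n",
               n, sqrt(2.0 * n), 1.4 * sqrt((double)n), n);
    }
    return 0;
}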

4: Conclusion

We have presented an enhancement to the linked list based cache coherence protocol specified by SCI while keeping the protocol simple and fully compatible. The cost of traversing the entire list is reduced from O(N) to O(√N). Although this result may not be as good as the tree based approach, which gives O(log N) performance, the tree balancing problem there is as yet unsolved, and the reliability issues are not addressed. Our list structure can be efficiently constructed on the fly, without any balancing problem, and the resulting structure is always the optimal configuration. The proposed solution can reliably withstand a faulty node. This is especially important in a distributed environment, where the possibility of a single node failing is much greater than that of the entire system failing.

The problem that remains unsolved is the loss of the head of the list. The loss of the head is a more severe type of error, as the head could contain the most up-to-date copy of the cache line.



Recovery of the list in this situation can still be done by interrogating processors: locating only one node is sufficient to recover the entire sharing list. However, the loss of the up-to-date copy can be dealt with only by checkpointing in the related applications.

References

[1] IEEE Std 1596-1992, Scalable Coherent Interface. Institute of Electrical and Electronics Engineers, Inc., 445 Hoes Lane, P.O. Box 1331, Piscataway, NJ 08855-1331, 800-678-4333.

[2] L.M. Censier and P. Feautrier. A New Solution to Coherence Problems in Multicache Systems. IEEE Transactions on Computers, pages 49-58, June 1990.

[3] S.J. Eggers and R.H. Katz. A Characterization of Sharing in Parallel Programs and its Application to Coherency Protocol Evaluation. In Proceedings of the 15th Annual International Symposium on Computer Architecture, May 1988.

[4] D. Lenoski et al. The Directory-Based Cache Coherence Protocol for the DASH Multiprocessor. In Proceedings of the 17th International Symposium on Computer Architecture, pages 148-159, Los Alamitos, Calif., May 1990.

[5] B. Fleisch and G. Popek. Mirage: A Coherent Distributed Shared Memory Design. In Proceedings of the 14th ACM Symposium on Operating System Principles, pages 211-223, New York, 1989.

[6] David B. Gustavson. The Scalable Coherent Interface and Related Standards Projects. IEEE Micro, pages 10-21, February 1992.

[7] D.V. James, A.T. Laundrie, S. Gjessing, and G.S. Sohi. Scalable Coherent Interface. IEEE Computer, pages 74-77, June 1990.

[8] R.E. Johnson. Extending the Scalable Coherent Interface for Large-Scale Shared-Memory Multiprocessors. Ph.D. Thesis, University of Wisconsin-Madison, 1993.

[9] R.P. LaRowe, Jr., and C.S. Ellis. Experimental Comparison of Memory Management Policies for NUMA Multiprocessors. ACM Transactions on Computer Systems, pages 319-363, November 1991.

[10] R.P. LaRowe, Jr., C.S. Ellis, and M.A. Holliday. Evaluation of NUMA Memory Management Through Modeling and Measurements. IEEE Transactions on Parallel and Distributed Systems, pages 686-701, December 1992.

[11] K. Li and P. Hudak. Memory Coherence in Shared Virtual Memory Systems. ACM Transactions on Computer Systems, 7(4):321-359, November 1989.

[12] H. Nilsson and P. Stenstrom. The Scalable Tree Protocol - A Cache Coherence Approach to Large-Scale Multiprocessors. In Proceedings of the 4th IEEE Symposium on Parallel and Distributed Processing, May 1992.

[13] Bill Nitzberg and Virginia Lo. Distributed Shared Memory: A Survey of Issues and Algorithms. IEEE Computer, pages 52-60, August 1991.

[14] M. Stumm and S. Zhou. Algorithms Implementing Distributed Shared Memory. IEEE Computer, pages 52-60, August 1991.

[15] C.K. Tang. Cache System Design in the Tightly Coupled Multiprocessor System. In AFIPS Proceedings of the National Computer Conference, 1976.

[16] M. Thapar and B. Delagi. Stanford Distributed Directory Protocol. IEEE Computer, pages 78-80, June 1990.
