ECE4750 – Lab5 – Multicore System Lingnan Liu ll656 Mohammed Sameed Shafi m2788

1. Introduction

Lab 5 for this course introduces us to a multicore system. The need for a multicore system comes from the fact that in recent years transistor scaling has not kept up with its previous exponential growth, forcing computer architects to develop new ways to exploit not only DLP (data-level parallelism) and ILP (instruction-level parallelism) in program code, but also to provide TLP (thread-level parallelism) so that the program workload can be distributed over multiple processor cores. The tradeoff for using multiple cores comes in the form of increased on-chip area and energy, and in the need to optimize programs that may have been written with only a single core in mind. In the baseline design, we combine our efforts from lab 2 and lab 3 to compose a single-core processor system with separate instruction and data caches, and in the alternative design we tackle a four-core system, with each core having its own private instruction cache and four data caches shared among all the cores.

In this lab, we were able to complete the baseline (section 2) and alternative (section 3) designs, as well as work on extensions (section 6) to improve our system's performance. We first approach the design with a modular strategy, using our components from lab 2 (processor) and lab 3 (caches) to compose our single-core processor system. We then use an incremental design strategy to complete the multi-core design, which incorporates our work from lab 4 (ring network). This lab also shifts focus to a software component, where we code sorting algorithms to test both of our system designs. We start off with a quicksort algorithm on our single-core design, and then use a merge sort algorithm to test the functionality of our multicore design. We go for merge sort on the multicore design because quicksort is not as amenable to parallelization. As a high-level overview of our results, we were able to get a performance improvement on the given benchmarks for the multicore design compared to the single-core design, although the theoretical 4x speedup was not achieved on any of them. This peculiarity is discussed in the evaluation section (section 5). As section 5 explains, even though going for more cores might seem like a smart idea, it does not always help performance that much, while at the same time imposing the additional costs of higher area and energy consumption, along with the need to recompile legacy code to exploit TLP. Section 6 describes how we reduce the cache read hit latency to a single cycle to achieve better performance.

2. Baseline Design

Our baseline design is to implement a single-core system with two separate caches – one for instructions (i) and one for data (d). We utilize the bypass processor design from lab 2 and hook it up to two direct-mapped caches as shown in Figure 1. The cache request and response ports of the caches connect to the respective i/dcache ports of the processor, while the memory (mem) request and response ports connect to the outward-facing memreq/sp0/1 ports as shown. Note that the data bitwidth from the processor to the caches is 32 bits, while the data bitwidth from the caches to the test memory is a full cache line, which is 128 bits in this design. The processor also supports a special "stats" bit to tell the manager that stats should be enabled. We need this stats bit because a typical compiled program includes a lot of code to bootstrap, manage the stack, make library calls, and so on. The code we are actually interested in timing could be masked by these uninteresting parts of the boilerplate, so we enable stats only when we are inside the function of interest.

The baseline design for this lab provides an interesting starting point. It is a good design in the sense that it is easy to connect the three components (the processor and two caches), so it provides an easy entry point in terms of design. In terms of functionality, we can argue that the baseline is not a good design in that both caches have a four-cycle hit latency. This is rather unrealistic and causes the design to consume a high number of cycles. For memory operations such as loads and stores, it takes 8 cycles even on a cache hit (a 4-cycle hit latency for the instruction fetch plus another 4-cycle latency for the data cache access). For a cache miss, the penalty is even higher. Also, we utilize direct-mapped caches, which are not effective at handling conflict misses.

For the baseline design, we also implement a quicksort algorithm for sorting an array. The quicksort algorithm works in this fashion – we first select a pivot value (the value in the middle of the array) and then partition the array into two, with the goal of having one partition hold values lower than the pivot and the other hold values greater than the pivot. To achieve this, we pick values from each end of the partition and compare them against the pivot. If a value is already in its correct partition we leave it there; otherwise we swap the values between the partitions. We keep doing this for all the elements of the partitions. The pivot is now in its correctly sorted position and the partitions hold their respective values. We then recursively call the partition function and repeat the above procedure on both of the original partitions, splitting them further into smaller partitions and choosing new pivots in each. Quicksort serves as a good benchmarking algorithm because it is faster than comparable sorting algorithms and is relatively cache-efficient: it scans and partitions the input linearly, which means we make the most of every cache line we load – every number brought into the cache is read before the line is evicted. This exploits the spatial locality in the data sequence.
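
The partition step described above can be sketched in C as follows (an illustrative sketch only, not our actual benchmark source; it uses the middle element as the pivot and scans in from both ends, with the initial call being quicksort(a, 0, n - 1)):

    /* In-place quicksort with the middle element as the pivot.
     * Values already in the correct partition are skipped over;
     * misplaced pairs are swapped across the pivot. */
    static void quicksort(int *a, int lo, int hi)
    {
        if (lo >= hi)
            return;

        int pivot = a[(lo + hi) / 2];
        int i = lo, j = hi;

        while (i <= j) {
            while (a[i] < pivot) i++;   /* already in the low partition  */
            while (a[j] > pivot) j--;   /* already in the high partition */
            if (i <= j) {               /* otherwise swap across the pivot */
                int tmp = a[i];
                a[i] = a[j];
                a[j] = tmp;
                i++;
                j--;
            }
        }
        quicksort(a, lo, j);            /* recurse on both partitions */
        quicksort(a, i, hi);
    }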

3. Alternative Design

As a high-level overview, we build up a multicore system consisting of 4 processor cores, each with a private instruction cache, 4 data caches shared among the cores, and 3 memory networks, each composed of two ring networks, allowing the processors to access the shared multi-banked data caches. We make several changes to the design of the processor, the cache, and the network to get these three parts to work in unison. We also implement a parallel merge sort algorithm that runs on our system as another benchmark.

Additional instructions supporting atomic memory operations (AMOs) are implemented in the bypass processor. AMO instructions let us protect data that is shared between different threads and thus eliminate interference between cores running in parallel, which is a fundamental requirement of any parallel computing system.
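
To show how such AMO instructions are typically used, here is a minimal spin-lock sketch; C11 atomics stand in for the actual amo instructions emitted on our system, and the lock variable and function names are purely illustrative:

    #include <stdatomic.h>

    /* Simple test-and-set spin lock. On our system the atomic exchange
     * would map to an AMO instruction; C11 atomics are used here purely
     * for illustration. */
    static atomic_int lock = 0;

    static void acquire(void)
    {
        /* Atomically read the old value and write 1; loop until we saw 0. */
        while (atomic_exchange(&lock, 1) != 0)
            ;                      /* spin while another core holds the lock */
    }

    static void release(void)
    {
        atomic_store(&lock, 0);    /* let the next core in */
    }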

To let the processor exploit the larger capacity of the multi-banked shared data caches, a parameter named “p_num_banks” is added to the cache design. For the data caches, we set this parameter to 4, which means there are two bank-id bits between the offset and the index, and adjacent cache lines are interleaved across the four banks. For the instruction caches, we set the parameter to 1, restricting each processor to accessing only its dedicated instruction cache.
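
The resulting address split can be sketched as follows (a sketch only; the 4-bit offset follows from our 128-bit cache lines, while the index width is left as a parameter and the field positions are for illustration):

    #include <stdint.h>

    /* Address decomposition when p_num_banks = 4 and the cache line is
     * 16 bytes: [ tag | index | bank (2 bits) | offset (4 bits) ].
     * With p_num_banks = 1 the bank field is simply absent. */
    static uint32_t bank_of(uint32_t addr)
    {
        return (addr >> 4) & 0x3;                     /* two bank-id bits above the offset */
    }

    static uint32_t index_of(uint32_t addr, uint32_t idx_bits)
    {
        return (addr >> 6) & ((1u << idx_bits) - 1);  /* index sits above the bank bits */
    }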

Each memory network is composed of two ring networks, one for requests and one for responses. Two types of memory network exist in our multicore system: the refill net and the data cache net. In the refill networks for both the instruction caches and the data caches, only port 0 handles input and output with the main memory. In the data cache network, all four ports carry requests and responses between the processors and the data caches. A parameter named “p_single_bank” allows us to use the same MemNet design for these two different networks; it is set to 1 for the refill nets and 0 for the data cache net.
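
A rough C model of the resulting routing decision (not the Verilog itself; the function name and field positions are illustrative and match the banking sketch above):

    #include <stdint.h>

    /* Destination selection for a memory request. For a refill net
     * (p_single_bank = 1) everything funnels to port 0, the only port
     * attached to main memory; for the data cache net (p_single_bank = 0)
     * the two bank-id bits of the address pick one of the four data caches. */
    static uint32_t dest_port(uint32_t addr, int p_single_bank)
    {
        return p_single_bank ? 0 : ((addr >> 4) & 0x3);
    }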

For a multicore system, having a master processor that handles all the connections between the system and the peripheral elements is essential. We make proc0 the master processor and connect all the manager ports and status bits only to proc0. This simplifies the logic for accepting transactions and makes our multicore system robust enough to handle not only multi-threaded programs but also older single-threaded programs.

As another part of the alternative design, we implement a parallel sorting algorithm and make it another benchmark for our multicore system. There are many sorting algorithms to choose from, and merge sort, in our opinion, is the most amenable to parallel computing. By splitting the original data into smaller parts and merging them into larger sorted sets, merge sort has an inherently parallel character, as the merging procedure can be done in parallel. Thus, we choose merge sort as the template for our parallel sorting algorithm.

Our merge sort algorithm begins by treating the original data set as an array of n sub-lists of size 1 (where n is the length of the data set), each of which is trivially already “sorted”. The sub-lists are then iteratively merged into larger sorted sub-lists, ping-ponging between two buffers. Over the iterations, the size of the sorted sub-lists grows exponentially, and we keep merging these sub-lists until their size reaches the length of the original data set, at which point the whole data set is sorted.
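
A compact sketch of this bottom-up scheme (illustrative C, not the exact benchmark source; buffer handling follows the two-buffer ping-pong described above):

    #include <string.h>

    /* Merge sorted runs a[lo..mid) and a[mid..hi) into b[lo..hi). */
    void merge(const int *a, int lo, int mid, int hi, int *b)
    {
        int i = lo, j = mid;
        for (int k = lo; k < hi; k++) {
            if (i < mid && (j >= hi || a[i] <= a[j]))
                b[k] = a[i++];
            else
                b[k] = a[j++];
        }
    }

    /* Bottom-up merge sort: runs of width 1, 2, 4, ... are merged back and
     * forth between the two buffers until one run covers the whole array. */
    void merge_sort(int *a, int *tmp, int n)
    {
        int *src = a, *dst = tmp;
        for (int width = 1; width < n; width *= 2) {
            for (int lo = 0; lo < n; lo += 2 * width) {
                int mid = lo + width     < n ? lo + width     : n;
                int hi  = lo + 2 * width < n ? lo + 2 * width : n;
                merge(src, lo, mid, hi, dst);
            }
            int *t = src; src = dst; dst = t;   /* ping-pong the buffers */
        }
        if (src != a)                           /* result may end up in tmp */
            memcpy(a, src, n * sizeof(int));
    }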

To exploit the four cores in our multicore system, we first statically divide the work among the four cores and invoke on each core the scalar sort function that implements the algorithm described above. After every core has finished its share of the work, core 0 takes charge and merges the 4 sorted sub-lists into one list by invoking the merge function.
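
The static division of work can be sketched as follows (hedged C pseudocode: get_core_id() and barrier() are placeholders for the actual primitives used in our benchmark, while merge() and merge_sort() are the routines from the sketch above):

    #define NCORES 4

    extern void merge(const int *a, int lo, int mid, int hi, int *b);
    extern void merge_sort(int *a, int *tmp, int n);
    extern int  get_core_id(void);   /* placeholder for the core-id primitive */
    extern void barrier(void);       /* placeholder for the barrier primitive */

    /* Each core sorts its quarter of the array; core 0 then merges the
     * four sorted quarters in a final reduction step. */
    void parallel_sort(int *a, int *tmp, int n)
    {
        int id    = get_core_id();
        int chunk = n / NCORES;
        int lo    = id * chunk;
        int hi    = (id == NCORES - 1) ? n : lo + chunk;

        merge_sort(a + lo, tmp + lo, hi - lo);   /* scalar sort of our quarter */
        barrier();                               /* wait for all cores */

        if (id == 0) {                           /* core 0 does the reduction join */
            merge(a, 0,         chunk,     2 * chunk, tmp);
            merge(a, 2 * chunk, 3 * chunk, n,         tmp);
            merge(tmp, 0, 2 * chunk, n, a);      /* final result ends up in a */
        }
    }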

This alternative design does not perform better on single-threaded loads, as can be seen in Table 3. This is due to the fact that we use a ring network to connect the shared data caches. Firstly, since the number of elements to connect is small, we could probably have used a global crossbar, which would have resulted in a much lower network latency. Secondly, since each processor can access any of the data caches, there is much more contention on this ring topology. Because the processor can only have a single request in flight at a time, the injection rate into the ring network is limited by both the fraction of loads and stores and the round-trip latency. So if the round-trip latency is 10 cycles, and a load or store occurs on average 20% of the time, then the injection rate will be one packet every ~15 cycles. The point is that the injection rate will be very low given these simple processors. The alternative design does perform better on multi-threaded loads, but still does not give a 4x improvement; this is discussed in section 5. If we were to increase the number of cores, we might see increased performance, though the increase would not be linear, and we would further add to the congestion of the network, creating bottlenecks. The only reason to invest in the complexity of additional cores would be if the application could be further parallelized to take advantage of them, and if we could optimize the weak points of our design, namely the network and the cache. The tradeoff would come in terms of more area and energy. If the expected applications do not provide TLP, then investing in multicore complexity would be a waste.
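
The back-of-the-envelope arithmetic behind the ~15-cycle figure quoted above (assuming a CPI of roughly 1 between memory operations) is:

    \text{cycles per packet} \approx \underbrace{\tfrac{1}{0.2}}_{\text{instructions per load/store}} \times \text{CPI} + \underbrace{10}_{\text{round-trip latency}} \approx 5 + 10 = 15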

4. Testing Strategy

We use a combination of different testing strategies to get both of our designs working. Firstly, we use source/sink tests similar to the ones from lab 2 to verify that the single-core processor supports all the implemented instructions. The advantage of these tests is that we use a simple Verilog-based compiler and need only a minimal number of instructions to write them. This lets a test case easily take input data from the sources and verify the results against the sinks, and it also makes it much easier to create a random input/output data stream. To ensure that the alternative design works, we use the multi-threaded assembly test suite. These tests are very similar to the single-threaded versions, except that they check for correct computation on all of the cores. proc0 creates work for the other processors and waits for all of them to finish. Instead of directly reporting the test outcome as in the single-threaded assembly tests, each processor writes its outcome to a global array at an index dedicated to that core. Once all of the processors have executed their testing logic, proc0 checks the test outcome of each processor.

We also use self-checking tests, written as separate assembly files that are assembled and loaded into program memory by the simulator. The advantage here is that we can leverage a powerful compiler to write complicated test cases easily. The drawback is that we need the complete set of processor instructions ready before we can start writing these tests. The self-checking multithreaded assembly-file tests make sure all cores can execute instructions and generate correct results.

We leverage our tests from the previous labs to ensure that each of our components – processor, caches, and networks – works before we go on to test our baseline design. These tests now serve as whitebox tests within our design of a complete system. They also serve as unit tests, in that we make sure the individual parts are completely functional before we move on to testing the complete system. Line traces are again a very useful tool in identifying whether each of our cores is actually behaving as it should. Most of the testing we do is blackbox, in that we are testing the functionality of the complete system.

To test whether the AMO operations work on the multicore design, we make sure that multiple threads work on their corresponding array elements and that the master thread keeps track of all the atomic locations. The master thread then checks for the correct result.
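
The shape of such a test is sketched below (illustrative C; amo_add(), get_core_id(), and barrier() are hypothetical stand-ins for the actual intrinsics and synchronization primitives):

    #define NCORES     4
    #define N_PER_CORE 8

    extern int  get_core_id(void);          /* hypothetical helpers standing */
    extern void barrier(void);              /* in for the real primitives    */
    extern int  amo_add(int *addr, int v);  /* atomic fetch-and-add (AMO)    */

    int counter;          /* shared atomic location       */
    int done[NCORES];     /* per-core completion outcomes */

    int amo_test(void)
    {
        int id = get_core_id();

        /* Every core bumps the shared counter once per "array element". */
        for (int i = 0; i < N_PER_CORE; i++)
            amo_add(&counter, 1);

        done[id] = 1;          /* report outcome in a dedicated slot */
        barrier();

        if (id != 0)
            return 1;

        /* The master thread checks the atomic location and each outcome. */
        int ok = (counter == NCORES * N_PER_CORE);
        for (int c = 0; c < NCORES; c++)
            ok = ok && done[c];
        return ok;
    }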

5. Evaluation

We evaluate both the baseline and alternative designs over a series of benchmarks – vvadd, cmplx-mult, bin-search, masked-filter – and the sorting algorithms implemented by us – quick sort and merge sort. Figure 3 shows how both the baseline and alternative designs perform, and Table 2 gives further details of the performance improvement from moving the benchmarks to the multicore design. We can see from Table 2 that on none of the benchmarks are we able to achieve the theoretical 4x performance increase when going from 1 to 4 cores. This can be attributed to two main factors. The first is microarchitectural: the ring network topology is inefficient, and since the data caches are shared, there is contention whenever two processor cores try to access the same data cache. The second is that the software is not completely parallelizable in most cases; in almost all cases the performance increase will not be linear as cores are added. A further software-side reason is the overhead of managing the parallelized code, which can be seen in our merge sort algorithm, where only proc0 is responsible for the reduction (join) part.

As we study the benchmarks, we see that the performance of some is compute bound and of others memory bound. An example of memory-bound performance is the vvadd benchmark – since each vvadd element requires 2 loads, an add, and a store, the cache hit latency really affects performance. Since the processor core itself is quite decent (it is almost able to maintain a CPI of 1), it is the cache memory that lets the performance down. An example of a compute-bound benchmark is cmplx-mult, where the main performance latency is due to the variable-latency multiplier. This is also why the cmplx-mult benchmark achieves a much higher performance increase than vvadd. If we were to reduce the memory hit latency to 1 cycle, we would see far more improvement in the vvadd benchmark.

In regards to readability, the hand-written code from lab 2 is more readable as it uses a consistent and familiar register naming scheme. In the assembly code generated by the compiler for lab 5, we also see pseudo-instructions such as li (li r1, 3 = addiu r1, r0, 3) and move (move r1, r2 = addu r1, r2, r0). In terms of static code size, taking vvadd as an example, the hand-assembled code for lab 2 is loop-unrolled by a factor of 4, while in lab 5 the compiler unrolls it by a factor of 8. Though the number of iterations decreases as we increase the loop unrolling factor (which decreases the dynamic code size), the static code size of each iteration increases, and more registers are needed. Another example is cmplx-mult, for which we do not use any loop unrolling in lab 2, but the lab 5 compiler unrolls the loop by a factor of 2. Another point to note is that for vvadd the lab 5 compiler unrolls by a factor of 8, and since we have 100 iterations, the compiler needs to introduce some fixup code at the end to handle the mismatch. The lab 5 code gives better performance since the loop is unrolled by a higher factor.
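
The shape of the unrolled loop with its fixup code looks roughly like the following (a C-level sketch assuming n = 100 and an unroll factor of 8; the compiler of course emits assembly, not C):

    /* vvadd unrolled by 8 with fixup code: 100 iterations leave a
     * remainder of 4 that cannot go through the unrolled body. */
    void vvadd(int *dst, const int *a, const int *b, int n)
    {
        int i = 0;
        for (; i + 8 <= n; i += 8) {        /* unrolled main loop: 12 passes */
            dst[i + 0] = a[i + 0] + b[i + 0];
            dst[i + 1] = a[i + 1] + b[i + 1];
            dst[i + 2] = a[i + 2] + b[i + 2];
            dst[i + 3] = a[i + 3] + b[i + 3];
            dst[i + 4] = a[i + 4] + b[i + 4];
            dst[i + 5] = a[i + 5] + b[i + 5];
            dst[i + 6] = a[i + 6] + b[i + 6];
            dst[i + 7] = a[i + 7] + b[i + 7];
        }
        for (; i < n; i++)                  /* fixup loop for the leftover 4 */
            dst[i] = a[i] + b[i];
    }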

If we run the scalar benchmarks on the multicore system, we see a significant decrease in performance (Figure 4). This is due to the additional latencies added by the ring network. If we run the parallel benchmarks on the single-core system, we see that the overhead of doing so is relatively small. This overhead has two sources – software and hardware. On the software side, the number of instructions increases slightly, mostly due to the checks performed at the start of the program to identify how many cores are present in the system. On the hardware side, we see an increase in the number of cycles due to these additional instructions.

For the quicksort benchmark, we get worse performance on the multicore system because of the additional latencies introduced by the memory network in front of the data caches, which raises the data cache hit latency to approximately 12 cycles. Looking at the assembly code and the corresponding line traces, we can see that many of the instruction and data cache hit latencies (4 and 8 cycles respectively) overlap. For the merge sort benchmark, we get a performance decrease of almost a factor of two. This is because our merge sort is designed with multiple cores in mind. With only one core present, quicksort emerges as the winner: both algorithms average O(n log n) comparisons, but quicksort has better constants and cache behavior, while our merge sort pays extra for copying between buffers and for the parallelization overhead. In our implementation, the merge sort algorithm on the multicore system achieves a small speedup of about 1% over quicksort on a single core.

We can conclude by saying that even though we do see a performance increase on the benchmarks in going from the single-core to the multi-core design, we need to keep in mind the overhead that is generated. In this lab we run one benchmark at a time, even though in real-world applications several programs compete for the same processor core. This is where hyperthreading comes into play, allowing TLP to be exploited much more effectively.

6. Extensions

As the extension part of this lab, we optimize the blocking cache used in the baseline and alternative designs by reducing its read hit latency to 1 cycle, and we use this optimized cache in place of the original one in both the single-core and multi-core systems. The dominating performance drawback of our baseline and alternative designs is the unrealistically high 4-cycle hit latency of the original cache, which keeps the cycles per instruction (CPI) above 4; with the optimized cache we hope that instructions execute more efficiently and the performance of the system improves.

To reduce the hit latency while retaining the original module interface, changes must be made both to the control unit and to the datapath. In the control unit, we eliminate the IDLE and WAIT states, which do nothing but wait for processor requests and responses, and merge their function into the TAG_CHECK state and the data access states (according to the type of data access). In the refill and evict paths, we also remove the EVICT_PREPARE and REFILL_UPDATE states and merge them into the other two states of those paths. The complete state transition diagram is shown in Figure 6.

The remaining states and their functions are described below:

TC – STATE_TAG_CHECK – Consumes the incoming cache request and places it in the input registers. Checks the tag and transitions to the correct state based on whether the access was a hit or a miss and whether the cache line is valid or dirty. At the same time, the data array is read, and if the request is a read that hits, the valid bit of the cache response is set to 1.

IN – STATE_INIT_DATA_ACCESS – Handles writing to the appropriate cache line when an init transaction is received.

RD – STATE_READ_DATA_ACCESS – Handles reading from the cache line on a read hit when the processor is not yet ready to accept the response from the cache.

WD – STATE_WRITE_DATA_ACCESS – Handles writing to the cache line on a write hit.

RR – STATE_REFILL_REQUEST – Makes a request to memory for the appropriate data.

RW – STATE_REFILL_WAIT – Waits while the memory is busy; once the response is valid, this state sends the refill data back to the cache.

ER – STATE_EVICT_REQUEST – Reads the tag and data and prepares the message to be sent to memory. Makes a request to memory to write the prepared data.

EW – STATE_EVICT_WAIT – Waits while the memory is busy; once the response is valid, transitions to STATE_REFILL_WAIT.

There are also some states corresponding to the AMO transactions, which are not modified in our optimization.

Matching the reduced FSM in the control unit, changes are also made to the datapath. The input registers are retained, but we add more bypass paths to allow the TAG_CHECK state to use the request signals directly without waiting an extra cycle. The memory response register and the read tag register are removed, which is what lets us eliminate the REFILL_UPDATE and EVICT_PREPARE states. A bypass path through the read data register is added so we can remove the WAIT state while still supporting the AMO transactions. The optimized datapath is shown in Figure 5.

We use the same unit tests and line tracing as in lab 3 to test the functionality of our cache. We then replace the cache in our baseline and alternative designs and run the same unit tests and self-checking tests to make sure our caches function well within the whole system.

To evaluate the performance of our optimized system, we first invoke the lab 3 simulator to measure the performance of our optimized cache under various access patterns. We then use the simulator for this lab to gather performance information for the whole system, such as the number of cycles to finish each benchmark and the CPI. The data is recorded in Figure 7.

We can see that there is a large performance increase with our optimized cache, both in the baseline and in the alternative design. In the baseline, we even get the CPI close to 1 for some benchmarks, such as vvadd (1.45) and quicksort (1.61).

Appendix

Figure 1. Baseline single core configuration

Figure 2. Alternative multicore configuration

Figure 3. Performance comparison of single core and multi core designs on corresponding benchmarks.

Figure 4. Performance comparison of running scalar benchmarks on singlecore and multicore designs

Figure 5. Optimized datapath for the extension cache

Figure 6. Optimized control unit for the extension cache

Figure 7. Performance improvement on single core and multi core designs using optimized caches

Benchmark       System Design   No. of Cycles
vvadd           Single Core     2303
                Multicore       2018
cmplx-mult      Single Core     9065
                Multicore       4971
bin-search      Single Core     5879
                Multicore       4330
masked-filter   Single Core     25447
                Multicore       12063
quick-sort      Single Core     28366
merge-sort      Multicore       28120

Table 1. Performance of benchmarks on the baseline (single core) and alternative (multi core) systems

Data plotted in Figure 7 (number of cycles):

Benchmark       Single Core   Multicore   Extension Single Core   Extension Multicore
vvadd           2303          2018        1030                    1756
cmplx-mult      9065          4971        5834                    4579
bin-search      5879          4330        1869                    2191
masked-filter   25447         12063       8909                    8742
sort            28366         28120       10017                   19626

Benchmark       Percentage Increase (%)
vvadd           12.38
cmplx-mult      45.16
bin-search      26.35
masked-filter   52.60
sort            0.87

Table 2. Percentage increase in performance from the singlecore to the multicore design.

Scalar benchmark   Design       No. of Cycles
vvadd              Singlecore   2303
                   Multicore    3946
cmplx-mult         Singlecore   9065
                   Multicore    14024
bin-search         Singlecore   5879
                   Multicore    6631
masked-filter      Singlecore   25447
                   Multicore    32948
quicksort          Singlecore   28366
                   Multicore    35256

Table 3. Performance of scalar benchmarks on both the single core and multicore designs

Parallel benchmark   Design       No. of Cycles
vvadd                Singlecore   2555
cmplx-mult           Singlecore   9268
bin-search           Singlecore   12395
masked-filter        Singlecore   25476
mergesort            Singlecore   51714

Table 4. Performance of parallel benchmarks on the single core design