Journal of Computer Science and Engineering, ISSN 2043-9091, Volume 12, Issue 2, April 2012, http://www.journalcse.co.uk


JOURNAL OF COMPUTER SCIENCE AND ENGINEERING, VOLUME 12, ISSUE 2, APRIL 2012

 


© 2012 JCSE www.journalcse.co.uk

New Fetch Policies for Multicore Processor Simulator to Support Dynamic

Design in Fetch Stage

H. G. Konsowa1, E. M. Saad1 and M. H. A. Awadalla1

1Faculty of Engineering, Helwan University, Cairo, Egypt

Abstract—During the early design space exploration phase of the microprocessor design process, a variety of enhancements and design options are evaluated by analyzing the performance model of the microprocessor. Current multicore processors are based on complex designs, integrating different components on a single chip, such as hardware threads, processor cores, the memory hierarchy, and interconnection networks. The permanent need to enhance multicore performance motivates dynamic design, which uses historical data from previous runs to predict new values of architecture parameters. Multicore processor architectures are affected by the problem of long-latency instructions stalling the processor pipeline. This paper focuses on increasing data locality to reduce L2 data cache misses: three new fetch stage policies, EACH_LOOP_FETCH, INC-FETCH, and WZ-FETCH, based on the Ordinary Least Squares (OLS) regression statistical method, have been developed. These new fetch policies differ in thread selection time, which is represented by instruction count and window size. Furthermore, the multicore simulation tool multi2sim is adapted to cope with dynamic multicore processor design by adding a dynamic feature to the thread selection policy in the fetch stage. SPLASH2, a suite of parallel scientific workloads, has been used to validate the proposed adaptation of multi2sim. Intensive simulation experiments have been conducted, and the obtained results show that remarkable performance enhancements have been achieved, especially with the WZ-FETCH fetch policy.

Key words—Multicore design, Fetch policy, Performance, Simulation architecture.

—————————— ——————————

1 INTRODUCTION
To take advantage of the increasing execution speed of modern superscalar processors, fetch mechanisms must be able to supply a larger number of instructions to the processing cores. Current instruction cache designs are limited by the difficulty of fetching a branch and its target in a single cycle. Multithreading architectures, especially the Simultaneous Multithreading (SMT) architecture, provide processors with significant parallelism. SMT architecture is defined by system resources, i.e., computation resources and memory access resources, being fully shared by several concurrently executing threads in the same core [1]. With SMT employed in the processor, both horizontal waste and vertical waste are minimized [2]. However, resource sharing leads to uneven distribution among threads as well as performance unfairness, which cannot be remedied until a well-designed scheme is carried out. Long-latency loads can cause shared resources to be occupied inefficiently by some threads, and are thus one of the major obstacles to higher parallelism and

better performance [3]. As a result, some instruction fetch policies [2], [4] are designed to allocate priorities in the fetch stage, expecting to alleviate the impact of long-latency loads. Industry is embracing multicore by rapidly increasing the number of processing cores per chip. In 2005, both AMD and Intel offered dual-core x86 products; AMD shipped its first quad-core product in 2007. Meanwhile, Sun shipped an 8-core, 32-thread CMP in 2005. It is conceivable that the number of cores per chip will increase exponentially, at the rate of Moore's Law, over the next decade. In fact, an Intel research project explores CMPs with eighty identical processor/cache cores integrated onto a single die, and Berkeley researchers suggest future CMPs could contain thousands of cores. Multicore processors are widely used to execute different applications. A multicore processor depends on multithreading architecture in many stages. Multithreading architectures, especially the Simultaneous Multithreading (SMT) architecture, provide the microprocessor with significant parallelism, which is believed to be the key to better performance and more


throughputs. It improves resource efficiency by scheduling and executing concurrent threads in the same core. Moreover, fetch policies are proposed to assign priorities in the fetch stage to manage the shared resources. Many simulation tools are used to evaluate multicore performance. The rest of this paper is organized as follows. Section 2 reviews work related to the theme of the paper. Section 3 introduces the simulation framework and benchmark suite. Section 4 presents the developed methodology. Section 5 demonstrates experiments and discussion. Section 6 concludes the paper.

2 RELATED WORK
Current I-fetch policies address the problem of the latency of loads that miss in the L2 cache in several ways. Round Robin [2] is absolutely blind to this problem: instructions are alternately fetched from available threads, even when they have in-flight L2 misses. Simultaneous Multithreading (SMT) architectures are defined by execution resources being fully shared among several concurrently running threads in the same core [6]. Long-latency loads are one of the major obstacles to better performance, an expression of the Memory Wall in SMT architectures [7]. This is one of the major causes of performance degradation: it refers to a load that fails to hit in the Level 2 (L2) cache, and thus has to go to an even lower level in the hierarchical memory architecture. As a result, instruction fetch policies are designed to assign priorities in the fetch stage with the expectation of alleviating the effect of long-latency loads [5]. Tullsen [6] explored a variety of SMT fetch policies that assign fetch priority to threads according to various criteria. ICOUNT was the best performing policy; it assigns priority to a thread according to the number of instructions it has in the decode, rename, and issue stages (issue queues) of the pipeline. Threads with the fewest instructions are given the highest fetch priority, the reasoning being that such threads may be making more forward progress than others; this prevents one thread from clogging the issue queue and provides a mix of threads in the issue queue to increase parallelism. Two parameters characterize an ICOUNT scheme: the first dictates the maximum number of threads to fetch from each cycle, while the second denotes the maximum number of instructions to fetch per thread. There are other fetch policies based on L1 Data

Cache (D-Cache) misses, e.g., Data Gating (DG) [8] and DCache Warn (DWarn) [9], which assume that L1 D-Cache misses clearly indicate future L2 cache misses. Consequently, DG prevents a thread with an unresolved L1 D-Cache miss from fetching instructions, so that it is less likely to introduce long-latency loads into the system. Similarly, DWarn also takes action when an L1 D-Cache miss happens: it reduces the priority of threads with unresolved L1 D-Cache misses without gating them completely. By adjusting priority, it is expected to achieve a balance between minimal inefficient occupancy and optimized fetch width utilization. Lichen Weng and Chen Liu [4], [5] advocated that L1 and L2 cache misses are closely associated during execution, but the relationship is not as simple as L1 D-Cache misses leading to L2 cache misses. In fact, the relationship between L1 and L2 cache misses can hardly be described perfectly by a simple model. They used a statistical model, Ordinary Least Squares (OLS) regression, to represent the relationship between L1 and L2 cache misses. More recent policies, implemented on top of ICOUNT, focus on this problem and add more control over the issue queues, as well as over the physical registers. In [9], a load hit/miss predictor is used in a superscalar processor to guide the dispatch of instructions by the scheduler. This allows the scheduler to dispatch dependent instructions at the time they require data. The authors propose several hit/miss predictors that are adaptations of well-known branch miss predictors, and suggest adding a load miss predictor in an SMT processor in order to detect L2 misses. This predictor would guide the instruction fetch, switching between threads when any of them is predicted to miss in L2.
3 SIMULATION FRAMEWORK
Our framework consists of a multicore simulation tool and a subset of benchmark programs used to evaluate an architectural enhancement of the multicore, by loading all threads of the multicore either with one benchmark program executed in a parallel way or with multiple benchmark programs executed as a sequential workload [1] [10].
3.1 Multicore Simulation Description
Multi2Sim is a simulation framework for heterogeneous computing, including models for superscalar, multithreaded, multicore, and graphics processors. Multi2sim uses three different models, embodied in three different modules as shown in Fig. 1:


1) A functional simulation engine, also called the simulator kernel.
2) A detailed simulation.
3) An event-driven simulation.

Hereinafter we use two terms, context and thread. Context denotes a software entity, defined by the status of a virtual memory image and a logical register file. Thread refers to a processor hardware entity, which can be represented as a physical register file, a set of physical memory pages, a set of pipeline queues, etc.

Fig. 1. Simulator modules.

Functional Simulation: It is built as a library and provides an interface to the rest of the simulator. The functional simulator owns the functions to create/destroy software contexts, perform program loading, enumerate existing contexts, execute machine instructions, and handle speculative execution. It also provides checkpointing of the memory module and register file, which allows the register file and memory status to be saved and reloaded when a wrong execution path starts.

Detailed Simulation: The detailed simulator uses the functional engine to perform a timing-first simulation. This means that in each cycle, a sequence of calls to the kernel updates the states of the existing contexts. The detailed simulator analyses the nature of the executed machine instructions and calculates the operation latencies incurred by the hardware structures.

The main simulated hardware consists of pipeline structure as illustrated in Fig. 2 (stage resource, instructions queue, load-store queue, reorder buffer), branch predictor, cache memories (with variable size, associativity and replacement policy), memory management unit, and segmented unit of configurable latency.

Event-Driven Simulation: To increase simulator modularity, the machine instructions are handled in a single module (functional simulation), while the hardware components are simulated in the detailed simulation module. Function calls between the functional module and the detailed simulation module are implemented by the event-driven module, which acts as an interface to calculate latencies.

Fig. 2. Processor pipeline (fetch queue, trace queue and trace cache, decode and dispatch, issue queue, load-store queue, register file, functional units, reorder buffer, writeback with instruction and data caches, and commit).

From the above illustration, note that the pipelines are enabled in the detailed simulation module. Our simulation tool supports a set of parameters that specify how stages are organized in a multithreaded design. Stages can be shared among threads or private per thread, except the execution stage, which is shared by the definition of multithreading.

There must be an algorithm that schedules a thread on each stage every cycle. The pipeline model in the multicore simulator is divided into five stages:

1. Fetch stage
2. Decode/rename stage
3. Issue stage
4. Execution stage
5. Commit stage

The fetch stage takes instructions from the L1 cache and places them in the instruction fetch queue (IFQ). The decode/rename stage takes instructions from the IFQ, decodes them, renames their registers, and assigns them a slot in the reorder buffer (ROB) and instruction queue (IQ). Then, the issue stage consumes instructions from the IQ and sends them to the corresponding functional unit. During the execution stage, the functional units operate and write their results back to the register file. Finally, the commit stage retires instructions from the ROB in program order.

The multicore switches among threads according to a thread selection policy. Processors can be classified as coarse-grain multithreading (CGMT), fine-grain multithreading (FGMT), and simultaneous multithreading (SMT). An FGMT processor switches threads on a fixed schedule every processor cycle; a CGMT processor is characterized by thread switching induced by a long-latency operation or thread quantum expiration (switch-on-event); and an SMT processor is able to issue instructions from different threads in a single cycle.

All five stages above can be shared among all threads, or all stages can be replicated (except the execution stage) as many times as there are supported hardware threads.

3.2 Benchmark Suite Description
SPLASH2 consists of 11 parallel benchmarks, classified as kernels or applications [14]. All of them provide command-line arguments or configuration files to specify the input data size. SPLASH2 benchmarks perform computations, synchronizations, and communication, stressing processor cores, the memory hierarchy, and interconnection networks; thus, they are used for the evaluation of the baseline and proposed multicore architectures. Additionally, SPLASH2 benchmarks provide arguments to specify the number of contexts created at runtime, which allows the evaluation of systems with different numbers of cores. SPLASH is a suite of parallel scientific workloads. We consider barnes, fmm, and ocean from this suite. Each benchmark executes the same number of worker threads as cores, is fast-forwarded up to the second barrier synchronization point, and traces are collected for 500 million instructions.

4 THE PROPOSED METHODOLOGY
In this paper the multicore simulation tool multi2sim is adapted to cope with dynamic multicore design by enabling a dynamic feature for the thread selection policy in the fetch stage, making it more user friendly. The target is then how to add a dynamic thread selection policy to the fetch stage. The tool originally offers three thread selection policy choices (three switch cases: "Shared", "TimeSlice" and "SwitchOnEvent"), as shown in Table 1 [2].

Table 1: Classification of multithreading paradigms depending on Multi2Sim variables in a CPU configuration file.

Variable      | Coarse-Grain MT | Fine-Grain MT | Simultaneous MT
Fetch Kind    | SwitchOnEvent   | TimeSlice     | TimeSlice/Shared
Dispatch Kind | TimeSlice       | TimeSlice     | TimeSlice/Shared
Issue Kind    | TimeSlice       | TimeSlice     | Shared
Commit Kind   | TimeSlice       | TimeSlice     | TimeSlice/Shared

Our contribution in this paper is the development of the fetch stage by adding three new policies. These policies depend on the Ordinary Least Squares (OLS) regression statistical method [12]. To predict the future L2 data cache misses, we apply the OLS regression equation [12-15] using a number of samples equal to the window array size for each thread, i.e., the function that calculates the future L2 data cache misses exists at the thread level. This function is called the regression engine. The fetch thread selection algorithm then selects the thread with the least predicted L2 data cache miss value. To implement the above-mentioned update of the fetch stage, a new two-dimensional array is used to store L1 data cache misses and the corresponding actual L2 data cache misses, as shown in Fig. 3 [5]. This array represents the samples window size parameter. The storing of samples is done at discrete times, measured in number of instructions; it was implemented by adding an instruction counter to determine the time for storing a sample. At run time, the relation between L1 and L2 cache misses is updated dynamically along the execution by recalculating the OLS regression coefficients. The three new policies differ in when the OLS regression coefficients are calculated:

EACH_LOOP_Fetch: at each loop, whenever the system accesses the level-one cache, as shown in Fig. 4.

INC_Fetch: at each loop, whenever the system accesses the cache and the instruction counter reaches a certain value, defined as the sampling period, as shown in Fig. 5. It is measured in Misses Per Kilo Instructions (MPKI).

WS_Fetch: at each loop, whenever the system accesses the cache, the instruction counter equals the sampling period, and the store location reaches the maximum window size, as shown in Fig. 6.

 

Fig. 3. Fetching according to Evaluated L2 D-Cache Misses [5].


Multi2sim is adapted to take the values of the new parameters (window array size and instruction count) as input variables. Below is the added code of the window store function "ws_store()" for each new method, clarifying where the OLS regression coefficient function "calc_ols_reg_coeff()" is called.

Method one "EACH_LOOP_Fetch":

start ws_store1() // window size store
{
    wsl[current_cache][loc] = miss[current_cache];
    calc_ols_reg_coeff();
    calc_future_l2();
    loc++;
    if (loc == ws) {   // window size reached
        zero_ws();
        loc = 0;
    }
}

Fig. 4. "EACH_LOOP_Fetch" function in the simulation tool.

Method two "INC_Fetch":

start ws_store2() // window size store
{
    calc_future_l2();
    ins_count++;
    if (ins_count == 500) {   // instruction count (sampling period)
        wsl[current_cache][loc] = miss[current_cache];
        ins_count = 0;
        loc++;
        calc_ols_reg_coeff();
        if (loc == ws) {
            zero_ws();
            loc = 0;
        }
    }
}

Fig. 5. "INC_Fetch" function in the simulation tool.


Method three "WS_Fetch":

start ws_store3() // window size store
{
    calc_future_l2();
    ins_count++;
    if (ins_count == 500) {   // instruction count (sampling period)
        wsl[current_cache][loc] = miss[current_cache];
        ins_count = 0;
        loc++;
        if (loc == ws) {   // window size reached
            calc_ols_reg_coeff();
            zero_ws();
            loc = 0;
        }
    }
}

Fig. 6. "WS_Fetch" function in the simulation tool.

Also, in our study we vary the instruction count parameter, which represents the time of storing the samples, and study its effect on the prediction accuracy of L2 data cache misses.

5 SIMULATION RESULTS  

The simulation results include new fetch policies results and the prediction accuracy results.

5.1 New fetch policies results

Four metrics are used to measure the performance of three benchmark applications from the SPLASH suite.

The results are divided into two sets:

1) The results and graphs from running OCEAN benchmark application with different fetch policies.

2) The results and graphs from running three applications with different fetch policies (three benchmarks results).

OCEAN Results:

The execution time, instructions per second, cycles per second, and branch prediction accuracy are used as metrics to compare the three new fetch stage policies in the multicore processor. Table 2 below contains the values of these metrics for running the OCEAN benchmark application from the SPLASH suite with each new fetch policy.

Table 2: Metric values versus different fetch policies for the OCEAN benchmark.

Metric                        | EACH_LOOP_FETCH | INC-FETCH | WS_FETCH
Execution time (s)            | 66.8            | 43.09     | 42
Instructions per second (IPS) | 124033          | 192276    | 195149
Cycles per second (CPS)       | 74855           | 116040    | 117774
Branch prediction accuracy    | 0.9664          | 0.9664    | 0.9664

 

 

Fig. 7. Execution time metric versus different fetch policies for the OCEAN application from the SPLASH benchmarks.

When the WS_FETCH policy is used, the least execution time is achieved, as shown in Fig. 7. This means that the best location for the calc_ols_reg_coeff function call is when the window size reaches its maximum; in this experiment the window size is equal to five samples.

 

 

 


 

Fig. 8. IPS metric versus different fetch policies for the OCEAN application from the SPLASH benchmarks.

 

Fig. 9. CPS metric versus different fetch policies for the OCEAN application from the SPLASH benchmarks.

As shown in Fig. 8 and Fig. 9, the WS_FETCH fetch policy generates the highest CPS and IPS values for the OCEAN benchmark, compared with the EACH_LOOP_FETCH and INC-FETCH fetch policies.

Three benchmarks results: This section contains sample screenshots of the multi2sim tool outputs.

Result of method one "EACH_LOOP_FETCH":

[ CPU ]
Time = 12.39
Instructions = 1526475
InstructionsPerSecond = 123155
Contexts = 4
Memory = 10117120
MemoryMax = 40468480
Cycles = 749017
InstructionsPerCycle = 2.038
BranchPredictionAccuracy = 0.855
CyclesPerSecond = 60430

Fig. 10. Simulator screenshot output using the "EACH_LOOP_FETCH" fetch policy.

Result of method two "INC_FETCH":

; Multi2Sim 3.1 - A Simulation Framework for CPU-GPU Heterogeneous Computing
; Please use command 'm2s --help' for a list of command-line options.
; Last compilation: Jan 22 2012 18:19:03
Creating a vector of 100 elements...
Sorting vector...
Finished
;
; Simulation Statistics Summary
;
[ CPU ]
Time = 11.39
Instructions = 1526479
InstructionsPerSecond = 134037
Contexts = 4
Memory = 10117120
MemoryMax = 40468480
Cycles = 747114
InstructionsPerCycle = 2.043
BranchPredictionAccuracy = 0.8544
CyclesPerSecond = 65602

Fig. 11. Simulator result screen using the "INC_FETCH" fetch policy.

Result of method three "WZ_FETCH":

; Multi2Sim 3.1 - A Simulation Framework for CPU-GPU Heterogeneous Computing
; Please use command 'm2s --help' for a list of command-line options.
; Last compilation: Jan 22 2012 18:19:03
Creating a vector of 100 elements...
Sorting vector...
Finished
;
; Simulation Statistics Summary
;
[ CPU ]
Time = 11.19
Instructions = 1526477
InstructionsPerSecond = 136409
Contexts = 4
Memory = 10117120
MemoryMax = 40468480
Cycles = 745538
InstructionsPerCycle = 2.047
BranchPredictionAccuracy = 0.8553
CyclesPerSecond = 66623

Fig. 12. Simulator result screen using the "WZ_FETCH" fetch policy.

The execution time and instructions per second are two metrics in the multi2sim tool. These metrics are used to measure the performance of three benchmarks from the SPLASH suite. These applications are run using the three different fetch stage policies, as shown in Table 3.

Table 3: Execution time (s) versus different fetch policies for three SPLASH benchmarks.

Benchmark | EACH_LOOP_FETCH | INC-FETCH | WS_FETCH
LU        | 39.4            | 30.76     | 30.31
Sort      | 66.8            | 43.09     | 42.45
Ocean     | 12.39           | 11.39     | 11.19

 

 

 

Fig. 13. Execution time metric versus different fetch policies for three SPLASH benchmarks.

As shown in Fig. 13, the best execution times occur when the WS_FETCH fetch policy is used, for all benchmarks.

Table 4: IPS metric versus different fetch policies for three SPLASH benchmarks.

Benchmark | EACH_LOOP_FETCH | INC-FETCH | WS_FETCH
LU        | 203182          | 260213    | 264105
Sort      | 124033          | 192276    | 195149
Ocean     | 123155          | 134037    | 136409

 

Fig. 14. IPS metric versus different fetch policies for three SPLASH benchmarks.

Across the different applications, WS_FETCH proved to be the best policy, as shown in Fig. 13 and Fig. 14.

5.2 Prediction Accuracy

Besides improving the multi2sim tool to be dynamic and user friendly, the obtained results show that the prediction accuracy of the fetch engine can be measured. The prediction accuracy results are depicted in Fig. 15 through Fig. 17. We measured the prediction accuracy by comparing the actual and predicted L2 data misses.


 

Fig. 15. L1, predicted, and actual L2 data cache misses for instruction counts of 500 and 1000.

These figures show the effect of changing the sampling time on the relation between actual data cache misses at level one (L1) and the predicted value of data cache misses at level two (L2). The results show that a larger instruction count leads to improved prediction accuracy.

Fig. 16. Prediction accuracy for the Ocean program from the SPLASH benchmark suite (window array size = 7, instruction count = 500).

Fig. 17. Prediction accuracy for the Ocean program from the SPLASH benchmark suite (window array size = 7, instruction count = 1000).

6 CONCLUSION AND FUTURE WORK

In this paper the multicore simulation tool multi2sim was adapted to cope with dynamic multicore design by enabling a dynamic feature for the thread selection policy in the fetch stage. The improved simulator has been implemented and evaluated. Furthermore, three new fetch stage policies (EACH_LOOP_FETCH, INC-FETCH, and WZ-FETCH) have been developed. These policies are based on the Ordinary Least Squares (OLS) regression statistical method. The difference among these new fetch policies lies in the thread selection time, which is represented by instruction count and window size. All the new fetch thread selection algorithms select the thread with the least predicted L2 data cache miss value, to increase data locality. Our results show that the WZ-FETCH fetch policy is the best fetch policy across all the benchmark programs used, for all the metrics used (IPS, instruction count, and execution time). For future work, additional dynamic multicore parameters will be considered; furthermore, a flexible multicore processor can be designed.

REFERENCES

[1] R. Ubal, J. Sahuquillo, S. Petit, and P. López, “Multi2Sim: A Simulation Framework to Evaluate Multicore-Multithreaded Processors,” In Proc. of the 19th Int'l Symposium on Computer Architecture and High Performance Computing, Oct. 2007.

[2] L. Weng, G. Quan, and C. Liu, “PCOUNT: A Power Aware Fetch Policy in Simultaneous Multithreading Processors,” International Green Computing Conference and Workshops (IGCC), 2011.

[3] J. P. Shen and M. H. Lipasti, “Modern Processor Design: Fundamentals of Superscalar Processors,” July 2004.

[4] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi, “McPAT: An Integrated Power, Area, and Timing Modeling Framework for Multicore and Manycore Architectures,” In Proc. of the 42nd Int'l Symposium on Microarchitecture, Dec. 2009.

[5] L. Weng and C. Liu, “Fetching According to the Evaluated L2 Cache Misses By OLS Regression in SMT Architecture,” [Online]: http://web.eng.fiu.edu/cliu/Pub/ASPLOS, 2011.

[6] D.M. Tullsen and J.A. Brown, “Handling long-latency loads in a simultaneous multithreading processor,” ISCA, 2001.


[7] L. Weng and C. Liu, “Hardware-aided Monitoring of L1 and L2 D-Cache Misses in SMT,” [Online]: http://web.eng.fiu.edu/cliu/Pub/posterfhpm.pdf, 2011.

[8] L. Weng and C. Liu, “Fetching According to the Evaluated L2 Cache Misses By OLS Regression in SMT Architecture,” [Online]: http://web.eng.fiu.edu/cliu/Pub/ASPLOS, 2011.

[9] A. El-Moursy and D.H. Albonesi, “Front-end policies for improved issue efficiency in SMT processors,” HPCA, 2003.

[10] F. J. Cazorla, A. Ramirez, M. Valero, and E. Fernandez, “DCache Warn: an I-Fetch Policy to Increase SMT Efficiency,” IPDPS, 2004.

[11] “The Multi2Sim Simulation Framework,” http://www.multi2sim.org, 2011.

[12] A. Cottrell, “Regression Analysis: Basic Concepts,” [Online]. Available: Regression.pdf

[13] R. Ubal, J. Sahuquillo, S. Petit, and P. López, “Paired ROBs: a Cost-Effective Reorder Buffer Sharing Strategy for SMT Processors,” In Proc. of the 2009 Euro-Par Conference, Aug. 2009.

[14] J. H. Edmondson, P. Rubinfeld, and R. Preston, “Superscalar Instruction Execution in the 21164 Alpha Microprocessor,” IEEE Micro, 15(2), 1995.

[15] M. R. Marty, B. Beckmann, L. Yen, A. R. Alameldeen, M. Xu, and K. Moore, “GEMS: Multifacet’s General Execution-driven Multiprocessor Simulator,” Int'l Symposium on Computer Architecture, 2006.

Biographical notes

Eng. Hanan Mohamed Gamal Konsowa is a PhD student at the Faculty of Engineering, Helwan University. She received her B.Sc. in Communication and Electronics Engineering from Ain Shams University, Cairo, Egypt, and her M.Sc. in Electronics and Communications Engineering from Helwan University, Egypt, in 2006. Her research interests include multicore design, soft computing, and real-time systems. She works as a Technical Manager in an IBM Authorized Business Partner company.

Prof. Dr. Elsayed Mostafa Saad is a Professor of Electronic Circuits, Faculty of Engineering, Helwan University. He is the author and/or co-author of 132 scientific papers. He is a member of the National Radio Science Committee, was a member of the scientific consultant committee in the Egyptian Engineering Syndicate for Electrical Engineers until May 1995, and is a member of the Egyptian Engineering Syndicate, the European Circuit Society (ECS), and the Society of Electrical Engineering (SEE). His research interests include electronic circuits and communication.

Dr. Medhat Awadalla is an associate professor at the Communication, Electronics and Computers Department, Helwan University. He obtained his PhD from the University of Cardiff, UK, and his M.Sc. and B.Sc. from Helwan University, Egypt. His research interests include multicore design and real-time systems.