
Migrating Java Threads with Fuzzy Control on Asymmetric Multicore Systems for Better Energy Delay Product

Hsin-Ching Sun∗, Bor-Yeh Shen∗, Wuu Yang∗, Jenq-Kuen Lee§

∗Department of Computer Science, National Chiao Tung University, Hsinchu City 300, Taiwan
hcsun, byshen, [email protected]

§Department of Computer Science, National Tsing Hua University, Hsinchu City 300, Taiwan
[email protected]

Abstract—Asymmetric multicore systems have been studied as a new hardware platform for performance-power efficiency in the execution of application programs. Each core in the system has distinct performance and power characteristics. When exploiting asymmetric multicore systems, a major issue is how to distribute threads to the various cores. In this work, we build a pseudo-asymmetric system with the dynamic voltage and frequency scaling (DVFS) mechanism on an Intel Core i7-920 for physical power measurement and implement a tool agent for a regular JVM to form an asymmetry-aware JVM that supervises the execution of Java threads and migrates threads with a fuzzy-control scheduler. For result inspection, we use the energy delay product (EDP) as a metric to reveal the compromise between performance and energy use. Our fuzzy-control scheduler yields an EDP benefit for some benchmarks and lower overall energy consumption. 1

Keywords-Asymmetric multicore; JVM; power efficiency; scheduling;

I. INTRODUCTION

A. Asymmetric Multicore Systems

Computer processor development in the form of single-core systems has reached its limits in terms of power consumption, the memory gap, and thermal control. Chip validation also becomes more and more difficult as IC chips grow more and more complex. Although multicore systems then became the solution to the difficulties of single-core development, a problem concerning energy utilization and performance in multicore systems arises because under-utilized cores waste energy. To address this energy-saving problem, many researchers have proposed asymmetric multicore architectures in order to achieve better energy-performance efficiency [1]–[6].

Asymmetric multicore systems (ASYMS) consist of several cores (with identical or different ISAs). These cores deliver different performance and power characteristics. Figure 1 shows a 4-core system. The larger core is usually faster and more complex and consumes more energy. Different cores fit different performance and/or energy requirements for different applications.

1 The work reported in this paper is partially supported by the National Science Council, Taiwan, Republic of China, under grants NSC 99-2220-E-009-048- and NSC 99-2220-E-009-049-. W. Yang is the corresponding author.

Figure 1. An asymmetric multicore system that consists of four cores in a chip: Core 1 (1.56 GHz, single issue), Core 2 (1.56 GHz, single issue), Core 3 (2.12 GHz, dual issue), and Core 4 (2.67 GHz, four issue).

B. Toward Better Performance-Power Efficiency

When it comes to ASYMS, the software or the OS must be aware of this hardware asymmetry [2], [7]–[9]. A more aggressive thread-scheduling policy is required compared to traditional scheduling policies, whose sole purpose is load balancing among identical cores. There are two issues in scheduling on ASYMS: performance [4], [8], [10]–[12] and power efficiency [1], [2], [5], [7]. In this work, we propose a new scheduling mechanism based on fuzzy control for ASYMS. Our objective is to achieve better performance and power efficiency. We build a pseudo-asymmetric single-ISA multicore system with an Intel Core i7 in our experiment. Each core runs at a different frequency, controlled by the dynamic voltage and frequency scaling (DVFS) mechanism [13], for physical power measurement.

We implement our schedulers through the JVM tool interface on a regular JVM to form an asymmetry-aware JVM. The asymmetry-aware JVM monitors every Java thread with hardware performance counters. Two fuzzy-control schedulers are designed for the asymmetry-aware JVM to make scheduling decisions for all Java threads based on performance counters. Figure 2 shows the control loop of the fuzzy-control scheduler. Here, we use the number of executed instructions per cycle (IPC) as the heuristic utilization index, although our early results show that IPC may not carry enough information to serve as a utilization index. Our heuristic is that threads with a high instructions-per-cycle index should be moved to faster cores [2]. A result in a previous study [2] also shows that CPU-intensive programs get more speedup when they run on cores with higher frequency.

Figure 2. Migration decision through the fuzzy control system: performance-counter readings of a thread of execution are fuzzified, passed through an inference engine with rules, and defuzzified into the destination core.

We also use the number of last-level cache (LLC) misses as another index. The heuristic for this scheduler is to distribute Java threads among slower cores when there are LLC misses.

The overhead caused by the agent is also under our consideration. One of the most significant overheads is the migration overhead, since there can be many thread migrations [4], [13].

II. RELATED WORK

An early study [1] addressed the potential for power reduction in single-ISA heterogeneous multicore architectures. During program execution, system software dynamically chooses an appropriate core to meet the specific performance and power requirements [1]. Several off-the-shelf processor specifications were used to model the architectures for die-space, performance, and power simulation. Our work starts with the same perspective of choosing appropriate cores for applications under performance requirements, but it differs in that we use a fuzzy-control scheduler to choose cores.

A study of making use of asymmetric multicore systems for power efficiency was presented in [2], in which the mechanisms for exploiting asymmetric multicore systems were classified into thread-level parallelism specialization and microarchitecture specialization. In microarchitecture specialization, an application (especially a single-threaded application) experiencing different phases [14] during execution would have its CPU-intensive phases run on faster cores or its memory-intensive phases run on slower cores. Our work belongs to the category of microarchitecture specialization. We use hardware performance counters [2], [15], which were described as an effective way for on-line utilization estimation, to determine the various phases during execution.

Figure 3. Mapping between Java threads and Linux threads: in the HotSpot JVM on Linux, each Java thread is backed by an osThread created through the pthread library, i.e., a Linux thread.

Fuzzy-control theory is a modern control method [16]–[18]. Previous work [19] revealed an effective way to combine DVFS with a fuzzy control system at the circuit level. In that work [19], the supply current and the variation of the current are assumed to represent the workload of the CPU and serve as the inputs of the fuzzy control system. The supply voltage, which is the output of the fuzzy control system, changes dynamically on demand. The use of the variation inspired us to use the IPC history to predict future utilization and to make migration decisions accordingly.

III. IMPLEMENTATION

A. Linux Performance Counter Subsystem and Java Threads

The Linux Performance Counter Subsystem is a kernel-based subsystem that provides a framework for collecting performance data. Linux provides a simple system call, sys_perf_event_open, for performance counters. This system call returns a file descriptor through which a user program can read various hardware performance counters.
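As an illustration, the sketch below shows how a per-thread hardware counter can be opened through this system call and used to sample a thread's IPC. This is a minimal example under our own naming (open_counter, sample_ipc); the reset-per-period policy and the omitted error handling are simplifications for illustration, not the actual agent code.

```c
/* Minimal sketch: open hardware counters for one Linux thread (tid) and
 * sample its IPC over a scheduling period.  Error handling is omitted. */
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>

static int open_counter(pid_t tid, uint64_t config) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = config;            /* e.g., PERF_COUNT_HW_INSTRUCTIONS */
    attr.exclude_kernel = 1;
    /* pid = tid, cpu = -1: follow this thread on whichever core it runs */
    return (int) syscall(__NR_perf_event_open, &attr, tid, -1, -1, 0);
}

/* fd_insns/fd_cycles come from open_counter(tid, PERF_COUNT_HW_INSTRUCTIONS)
 * and open_counter(tid, PERF_COUNT_HW_CPU_CYCLES), respectively. */
static double sample_ipc(int fd_insns, int fd_cycles) {
    uint64_t insns = 0, cycles = 0;
    read(fd_insns, &insns, sizeof(insns));
    read(fd_cycles, &cycles, sizeof(cycles));
    ioctl(fd_insns, PERF_EVENT_IOC_RESET, 0);   /* next sample covers only */
    ioctl(fd_cycles, PERF_EVENT_IOC_RESET, 0);  /* the next period         */
    return cycles ? (double) insns / (double) cycles : 0.0;
}
```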

Both threads and processes can be monitored through the subsystem. Since a Java process relies on Linux for process and thread creation in our environment, all the performance counters of the spawned threads can easily be monitored.

Note that the mapping between Java threads and Linux threads is implementation-dependent. The JVM in our implementation is the Java HotSpot Virtual Machine [20]. The mapping between Java threads and Linux threads is provided by the pthread library. It is a 1-1 mapping, as shown in Figure 3. This means that we can use the Linux Performance Counter Subsystem to monitor each Java thread.

Figure 4. Membership functions for IPC: Bad, Not good, Normal, Good, and Excellent over the IPC domain 0–3.

Figure 5. Membership functions for IPC difference: Negative, Zero, and Positive over the difference domain −2 to 2.

B. JVM Tool Interface

The JVM Tool Interface (JVMTI) provides a programming interface to inspect the state and control the execution of applications running in the Java virtual machine. Since the mapping between Java threads and Linux threads is 1-1, we may monitor the creation of Java threads through JVMTI and register the mapping between Java threads and Linux threads. This mapping is used to interpret the performance-counter statistics.
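A minimal sketch of such an agent is shown below. The callback and bookkeeping names are ours, and the body only prints the mapping; the real agent would additionally open the performance counters for the thread and hand it to the scheduler. JVMTI delivers the ThreadStart event in the context of the newly started thread, so gettid() at that point yields the Linux thread id backing the Java thread.

```c
/* Sketch of a JVMTI agent that records the Linux tid behind each Java
 * thread as it starts (loaded with -agentpath:/path/to/libagent.so). */
#include <jvmti.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdio.h>

static void JNICALL on_thread_start(jvmtiEnv *jvmti, JNIEnv *jni, jthread thread) {
    jvmtiThreadInfo info;
    (*jvmti)->GetThreadInfo(jvmti, thread, &info);
    pid_t tid = (pid_t) syscall(SYS_gettid);   /* runs in the new thread */
    printf("Java thread %s -> Linux tid %d\n", info.name, (int) tid);
    /* Real agent: open perf counters for tid and register the thread with
     * the fuzzy-control scheduler here. */
    (*jvmti)->Deallocate(jvmti, (unsigned char *) info.name);
}

JNIEXPORT jint JNICALL Agent_OnLoad(JavaVM *vm, char *options, void *reserved) {
    jvmtiEnv *jvmti;
    jvmtiEventCallbacks callbacks = {0};

    (*vm)->GetEnv(vm, (void **) &jvmti, JVMTI_VERSION_1_2);
    callbacks.ThreadStart = on_thread_start;
    (*jvmti)->SetEventCallbacks(jvmti, &callbacks, sizeof(callbacks));
    (*jvmti)->SetEventNotificationMode(jvmti, JVMTI_ENABLE,
                                       JVMTI_EVENT_THREAD_START, NULL);
    return JNI_OK;
}
```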

C. Migration Mechanisms

We implement two fuzzy-control schedulers for our asymmetry-aware JVM: the IPC scheduler and the LLC-miss scheduler. Each scheduler is a closed control loop that takes a different input and makes migration decisions periodically.

A fuzzy-control scheduler consists of three parts: a fuzzification part, an inference engine with a set of rules, and a defuzzification part.

The fuzzification part converts a quantitative value into a qualitative value. Several linguistic variables may then be used in the inference engine. The inference engine comes with a set of rules, based on human knowledge of the controlled object, to infer output fuzzy sets. The defuzzification part converts the output sets into a crisp number, which is used to control the intended device. Here, we use the IPC fuzzy-control scheduler to explain the heuristic.

Figure 6. Membership functions for the destination core: four sets, one per core (core 0 to core 3), over the domain 0–4.

1) IPC scheduler: In the fuzzification part, there are linguistic variables. A linguistic variable stands for a human concept like "age" or "temperature" and has values like 23 years old or 31°C. A linguistic variable is associated with one or more membership functions, like "old" or "hot", that specify the degree to which a given input value describes the concept, for example, how old or how hot.

We have two input linguistic variables in our case: the number of instructions executed per cycle (IPC) and the IPC difference. The IPC difference is the difference between the current IPC reading and the previous reading. Since each core in the i7-920 processor is four-issue capable, we define a general possible domain of 0–3 for IPC values. Five membership functions for the IPC variable are defined over this domain, as shown in Figure 4. Although there are many possible choices for the shapes of the membership functions, the shape does not influence the output in a significant way according to [19]. We therefore choose triangular and trapezoidal membership functions because of their lower computational complexity. The same choices are made for the membership functions of the IPC difference, as shown in Figure 5.

The overlap between two adjacent membership functions is 50%. This implies that an input IPC value will hit at most two membership functions; the same holds for the IPC-difference variable. Thus, the inference engine needs at most four different rules to make an inference.
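A minimal sketch of this fuzzification step is given below. The evenly spaced peaks (0, 0.75, 1.5, 2.25, 3) and the half-base width of 0.75 are illustrative choices that reproduce the 50% overlap described above; they are not necessarily the exact breakpoints used in Figure 4.

```c
/* Fuzzify an IPC reading into degrees of membership in the five sets of
 * Figure 4, using triangular functions with 50% overlap (assumed breakpoints). */
enum { BAD, NOT_GOOD, NORMAL, GOOD, EXCELLENT, N_IPC_SETS };

/* Triangular membership: 0 outside (l, r), 1 at the peak p. */
static double tri(double x, double l, double p, double r) {
    if (x <= l || x >= r) return 0.0;
    return (x < p) ? (x - l) / (p - l) : (r - x) / (r - p);
}

static void fuzzify_ipc(double ipc, double deg[N_IPC_SETS]) {
    static const double peak[N_IPC_SETS] = {0.0, 0.75, 1.5, 2.25, 3.0};
    const double half = 0.75;              /* half-base width => 50% overlap */
    for (int i = 0; i < N_IPC_SETS; i++)
        deg[i] = tri(ipc, peak[i] - half, peak[i], peak[i] + half);
    /* With these breakpoints an IPC value in (0, 3) hits at most two sets. */
}
```

The same helper could presumably be reused for the three IPC-difference sets of Figure 5, with peaks at −2, 0, and 2 and a half-base width of 2.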

The design of the output variable depends highly on the target ASYMS. Figure 6 gives the basic design of the output variable, the destination core, in our environment. In general, the number of membership functions for the destination core depends on the number of different performance capabilities in the underlying ASYMS. In our case, there are four cores with different clock frequencies. Note that we can define the four membership functions with different areas to reflect the magnitude of the differences in frequency. For example, if one core has a much lower frequency than the other three, we would prefer not to use that slower core; we therefore define its membership function with a smaller area so as to bias the decision made by the final defuzzification part.

The inference engine consists of inference rules, based on our domain knowledge, to infer implied membership functions. Since we have five membership functions for the IPC variable and three for the IPC-variation variable, the inference engine should consist of fifteen rules. Figure 7 presents these 15 rules. For example, if the IPC is Good and the difference is Positive, the destination core is core 3. The inference logic used in our case is the Zadeh inference engine [17]. Finally, the inference will usually produce four implied membership functions, which are then used in the defuzzification process.

Figure 7. Inference rules for the IPC scheduler (each cell gives the destination core).

RULES       Bad   Not good   Normal   Good   Excellent
Negative     0       0         1        2        3
Zero         0       1         2        2        3
Positive     0       2         3        3        3

Figure 8. LLC-miss scheduler modeling. The input membership functions are good, not_good, bad, and too_bad over the LLC-load-miss domain and Negative, Zero, and Positive over the difference domain (the unit in both domains is 100 thousand); the output is the destination core (core 0 to core 3). Its inference rules (each cell gives the destination core) are:

RULES       good   Not_good   bad   too_bad
Negative     3        3        2       1
Zero         3        2        1       0
Positive     2        2        1       0

The defuzzification part converts the implied fuzzy sets into a crisp number that serves as the control signal. There are several common defuzzification methods, such as centroid, center average, and mean of maxima [17], [19]. We choose the centroid method because it is popular and because it responds to the different areas of the membership functions when we have a preference for some cores [17]. The calculated centroid will fall within one of the membership functions, and the controlled thread is migrated to the corresponding destination core. For instance, if the calculated centroid is 2.3, the thread is migrated to core 2.
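The sketch below ties the pieces together for one thread: min-max (Zadeh-style) inference over the rule table of Figure 7, a simplified center-average approximation of the centroid over the four output sets, and migration of the backing Linux thread with sched_setaffinity. The function names, the peak-weighted centroid approximation, and the rounding to the nearest core are our illustrative choices, not the authors' exact implementation; equal-area output sets are assumed here, whereas the unequal bottoms used for Asym B would weight the sums differently.

```c
/* Sketch: infer a destination core from fuzzified IPC and IPC-difference
 * degrees, defuzzify, and migrate the Linux thread backing the Java thread. */
#define _GNU_SOURCE
#include <sched.h>
#include <sys/types.h>
#include <math.h>

#define N_IPC_SETS  5   /* Bad, Not good, Normal, Good, Excellent */
#define N_DIFF_SETS 3   /* Negative, Zero, Positive               */
#define N_CORES     4

static const int rule[N_DIFF_SETS][N_IPC_SETS] = {   /* Figure 7 */
    {0, 0, 1, 2, 3},   /* Negative */
    {0, 1, 2, 2, 3},   /* Zero     */
    {0, 2, 3, 3, 3},   /* Positive */
};

static int decide_and_migrate(pid_t tid,
                              const double ipc_deg[N_IPC_SETS],
                              const double diff_deg[N_DIFF_SETS]) {
    double strength[N_CORES] = {0};

    /* Inference: each fired rule clips its output set at min(inputs);
     * rules pointing at the same core are combined with max. */
    for (int d = 0; d < N_DIFF_SETS; d++)
        for (int i = 0; i < N_IPC_SETS; i++) {
            double w = fmin(diff_deg[d], ipc_deg[i]);
            if (w > strength[rule[d][i]]) strength[rule[d][i]] = w;
        }

    /* Defuzzification: centroid approximated by weighting each output peak
     * (the core index) by the clipped strength of its set. */
    double num = 0.0, den = 0.0;
    for (int c = 0; c < N_CORES; c++) { num += strength[c] * c; den += strength[c]; }
    int core = (den > 0.0) ? (int) lround(num / den) : N_CORES - 1;

    /* Migration: pin the thread to the chosen core (e.g., 2.3 -> core 2). */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    sched_setaffinity(tid, sizeof(set), &set);
    return core;
}
```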

2) LLC-miss scheduler: Alternatively, we may use the number of llc-load-misses (cache misses when loading from the LLC), which is provided by the Linux performance counter system, as the input of the fuzzy-control scheduler. (The system provides LLC events such as load-reference, store-reference, load-miss, and store-miss. We choose llc-load-miss as a starting point in this scheduler.) Figure 8 gives the overall design of the LLC-miss scheduler. The major problem in designing the LLC-miss scheduler is determining what counts as a great number of LLC misses. We chose the values in Figure 8 from profiling data of the benchmarks in our experiment. As a net effect, threads experiencing more than 3 hundred thousand LLC misses in one scheduling period are scheduled to slower cores. The difference is derived from the current and previous LLC-miss counts as a prediction of an increasing or decreasing trend. The design of the output variable, the inference part, and the defuzzification part is the same as in the IPC scheduler.

Table I
INTEL I7 SPECIFICATION

Processor Number:  i7-920
# of cores:        4
Clock Speed:       2.66 GHz
L1 Cache:          32 KB L1 data, 32 KB L1 instruction per core
L2 Cache:          256 KB per core, inclusive
L3 Cache:          8 MB shared cache
Lithography:       45 nm
Motherboard:       GIGABYTE GA-EX58-UD3R

Table II
4 CONFIGURATIONS IN OUR EXPERIMENT

Configuration   CPU 0      CPU 1      CPU 2      CPU 3
Sym2.66         2.66GHz    2.66GHz    2.66GHz    2.66GHz
Asym A          2.26GHz    2.39GHz    2.53GHz    2.66GHz
Sym2.26         2.26GHz    2.26GHz    2.26GHz    2.26GHz
Asym B          1.60GHz    2.13GHz    2.39GHz    2.66GHz

We plan to merge the IPC scheduler and the LLC-miss scheduler in the future.

IV. EXPERIMENT

A. Hardware Configuration

The hardware platform we use in our experiment is an Intel Core i7-920 processor, which has four cores on a chip. A detailed list of specifications is shown in Table I. Note that both Intel Turbo Boost Technology and Intel Hyper-Threading Technology are turned off to eliminate their effects on performance and power consumption.

It should be mentioned that there is a shared L3 cache in the Intel i7-920. According to previous research [21], the shared cache prevents serious performance loss from thread migration. So we expect that we will not suffer much performance loss from thread migration in our experiments.

B. Software Environment

The operating system used in the experiments is Debian GNU/Linux built upon kernel version 2.6.32. The kernel provides the acpi-cpufreq module for controlling the frequencies of the cores. We use this module to set up an asymmetric environment. The available frequencies provided by this module for the hardware are 2.66GHz, 2.53GHz, 2.39GHz, 2.26GHz, 2.13GHz, 2.00GHz, 1.86GHz, 1.73GHz, and 1.60GHz. Table II shows the 4 configurations used in our experiments. The symmetric configuration sym2.66 consists of four cores, all running at the highest frequency. The asymmetric configuration A uses the four highest distinct frequencies. We also test our control system under a more extreme environment, asymmetric configuration B. Configuration sym2.26 is provided to show the worst-case performance loss in Asym A.
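For reference, a per-core frequency in this kind of setup can be fixed through the acpi-cpufreq sysfs interface; the sketch below writes the Asym A setting for CPU 0 (2.26 GHz) and assumes the userspace governor is available. This is our own illustration of the mechanism, not the authors' setup script, and error handling is omitted.

```c
/* Sketch: pin one core of the pseudo-asymmetric configuration to a fixed
 * frequency via acpi-cpufreq sysfs files (requires root). */
#include <stdio.h>

static void set_core_khz(int core, long khz) {
    char path[128];
    FILE *f;

    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_governor", core);
    f = fopen(path, "w");
    fprintf(f, "userspace\n");          /* let user space pick the frequency */
    fclose(f);

    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_setspeed", core);
    f = fopen(path, "w");
    fprintf(f, "%ld\n", khz);
    fclose(f);
}

int main(void) {
    set_core_khz(0, 2260000);           /* Asym A: CPU 0 at 2.26 GHz */
    return 0;
}
```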

Figure 9. Power measurement through the 12V2 rails: the power meter is inserted on the 12V2 rails between the power supply and the CPU on the motherboard.

Figure 10. Output variable (destination core) for Asym A and Asym B, drawn over the available frequency choices (1.60–2.66 GHz). For Asym A all four membership functions have a bottom width of 1; for Asym B the bottoms are 1, 2.5, 3, and 3.5 for cores 0–3, giving the slower cores smaller areas.

C. 12V2 Rails Measurement

According to the EPS12V specification [22], the i7 processor power is provided by the 12V2 rails. We instrument these rails as shown in Figure 9. The power meter used in our experiments is an NI-4065 digital multimeter. The meter provides both current and timing data, from which the power consumption is derived. Project SIKULI [23] is used in the measurement for test automation.
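Our reading of how the power figures are obtained (a reconstruction; the paper does not spell out the formula): with the 12V2 rails at a nominal 12 V and the multimeter reporting current samples $I_k$ at times $t_k$ over a run of length $T$,

$$E \approx \sum_{k} 12\,\mathrm{V} \cdot I_k \cdot (t_{k+1} - t_k), \qquad \bar{P} = \frac{E}{T}.$$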

D. Benchmark Configuration

SPECjvm2008 is our experimental benchmark. We configured the benchmark suite to use the lagom [24] workload, in which every application in the suite does a fixed amount of work in one run. The number of threads in each application is set to one so that there is one active thread during application execution. An exception is sunflow, because multiple active threads are required during its execution. We did not provide measurement data for sunflow because of the current constraints of the power meter.

E. Destination Core Preference

For the IPC scheduler, the output variable, destination core, reacts to the frequency gaps between cores. When there is a big frequency gap between two destination cores, a reasonable selection is to try not to select the much slower core. As a result, we design two output variables, one for Asym A and one for Asym B, as shown in Figure 10. For Asym B, each destination membership function has a different bottom, which results in a different area. The bottoms are calculated as follows: the bottom of the slowest destination core is always one. If there is one more frequency choice between two destination cores, we increase the bottom of the faster one by 0.5. For example, there are three more choices between core 0 and core 1 in Asym B, so the bottom of core 1 is calculated as 1 + 1.5 = 2.5, and so on. In this way, core 0 is less likely to be selected.

Table III
BASE RUNNING TIME AND POWER CONSUMPTION UNDER SYM2.66

application           sec      watt     application        sec      watt
compiler.compiler     18.04    22.5     fft.large          28.73    10.9
compiler.sunflow      32.08    17.9     lu.large           33.6     14.3
compress              78.15    13.4     sor.large          97.42    11.1
crypto.aes            94.96    14.1     sparse.large       50.44    13.7
crypto.rsa            148.6    14.6     monte carlo        1609     15.2
crypto.signverify     116.8    13.7     fft.small          86.29    14.4
derby                 96.49    11.9     lu.small           99.65    16.3
mpegaudio             128.6    13.6     sor.small          112.2    10.9
serial                46.4     13.9     sparse.small       48.86    15.9
xml.transform         28.70    17       xml.validation     29.35    14.9

Figure 11. Time increase ratio under Asym A for the IPC scheduler, with the Sym2.26 ratio shown for comparison, for each SPECjvm2008 application.

V. EXPERIMENTAL RESULTS

Table III gives the baseline data (execution time and power consumption), measured in the symmetric configuration sym2.66. Note that we test the LLC-miss scheduler only in Asym A for now. The experimental results for the IPC scheduler are provided for both Asym A and Asym B.

A. IPC scheduler

1) Time increase: Figure 11 gives the ratio of the execution time in asymmetric configuration A relative to the baseline (i.e., the symmetric configuration sym2.66) under our fuzzy-control scheduler. We also provide the execution time in the symmetric configuration sym2.26. (Note that 2.26GHz is the lowest frequency in configuration Asym A.) This gives an upper bound on the time increase ratio. A smaller increase in execution time should indicate that memory operations are more critical for the application. Figure 12 shows the breakdown of the execution time of each application across the cores under the fuzzy-control scheduler in Asym A. This demonstrates the selections made by the fuzzy-control scheduler.

Figure 12. Core selection for the IPC scheduler under Asym A (percentage of execution time spent on cores 0–3 for each SPECjvm2008 application).

Figure 13. Time increase ratio for the IPC scheduler under Asym B.

Figure 14. Core selection for the IPC scheduler under Asym B (percentage of execution time spent on cores 0–3 for each SPECjvm2008 application).

The time increase ratio in Sym2.26 indicates that the applications scimark.fft.large and scimark.lu.large are less sensitive to the change of frequency, especially scimark.fft.large. The fuzzy-control scheduler made the right choice in using the slowest core mostly for it.

As for scimark.sor.large and scimark.sor.small, the scheduler chooses slower cores since these applications remain in a low-IPC state during the whole execution. In contrast, the applications crypto.aes, crypto.rsa, scimark.monte_carlo, and scimark.sparse.small are forced to use the fastest core.

For the other applications, the choices made by the fuzzy-control scheduler result in less performance loss than in sym2.26 because the use of the cores is balanced.

Figure 13 shows the time increase ratio under asymmetric configuration B. The breakdown of execution time on each core for Asym B is provided in Figure 14. The results are similar to those under Asym A, except that core 0 has less chance to be selected. This prevents the great performance loss that comes from choosing the slowest core, as scimark.sor.* did under configuration Asym A.

Figure 15. Relative energy delay product for the IPC scheduler under asymmetric configuration A.

Figure 16. Relative energy delay product for the IPC scheduler under asymmetric configuration B.

2) Energy Delay Product: We compare the EDP in asymmetric configuration A under the fuzzy-control scheduler with the EDP in the symmetric configuration sym2.66 without a scheduler. Figure 15 presents the resulting EDP benefit. The fuzzy-control scheduler chooses faster cores mostly for crypto.*, mpegaudio, scimark.monte_carlo, scimark.lu.small, and scimark.sparse.small, since these applications exhibit a high-IPC state during the entire execution. Although the scheduler makes the right choices for these applications, there is a slight increase in EDP. Application scimark.fft.large reveals a positive benefit in EDP reduction.
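For reference, EDP is the product of energy and execution time. Assuming the average power $\bar{P}$ and running time $T$ of Table III are the measured quantities, the relative EDP plotted in Figures 15 and 16 would be (our notation, not necessarily the authors' exact formulation):

$$\mathrm{EDP} = E\,T = \bar{P}\,T^{2}, \qquad \mathrm{EDP}_{\mathrm{rel}} = \frac{\bar{P}_{\mathrm{asym}}\,T_{\mathrm{asym}}^{2}}{\bar{P}_{\mathrm{sym2.66}}\,T_{\mathrm{sym2.66}}^{2}}.$$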

The EDP reduction results for the other applications in Asym A are not good, especially for scimark.sor.*, which appears as a CPU-intensive application in Figure 11. A post-review of the performance-counter statistics of these applications shows that IPC may not always indicate the right kind of intensiveness (CPU or memory). This should lead to a new design of the fuzzy scheduler that takes another metric (e.g., last-level cache misses) into consideration.

The EDP reduction rate in Asym B is provided in Figure 16. More applications show a positive benefit than in Asym A. For these applications, the use of lower frequencies might increase the EDP benefit in their memory-intensive parts.

B. LLC-miss scheduler

1) Time increase: Figure 17 and Figure 18 give the time increase ratio and the core selection for the LLC-miss scheduler under Asym A, respectively. Considering the intensiveness shown in sym2.26, the scheduler makes the right decisions for most of the applications, except scimark.fft.small.

Figure 17. Time increase ratio for the LLC-miss scheduler under Asym A, with the Sym2.26 ratio shown for comparison.

Figure 18. Core selection for the LLC-miss scheduler under Asym A.

2) Energy Delay Product: Figure 19 presents the EDP benefit for the LLC-miss scheduler. Again, we obtain the best benefit for scimark.fft.large. Strictly speaking, scimark.fft.large is the only memory-intensive application in SPECjvm2008; the others are sensitive to the change of frequency. This leads to the question of how to identify this kind of memory-intensive application from performance-counter statistics, especially over a short time period.

C. Migration overhead

We measure the overhead by running the benchmarks in the symmetric configuration sym2.66 with the IPC scheduler, since this scheduler makes more migrations than the LLC-miss scheduler. Figure 20 gives the increase ratio of the execution time, and Table IV shows the number of migrations per second. Although the IPC scheduler causes massive migrations every second for most of the applications, we do not see a relevant increase in execution time. This result is consistent with previous research [21], which stated the advantage of having a shared cache between cores for preventing thread-migration overhead. We believe that the increase in execution time in the asymmetric configurations is mainly due to the use of lower frequencies.

Figure 19. Relative energy delay product for the LLC-miss scheduler under Asym A.

Figure 20. Time increase in the symmetric configuration with the fuzzy-control scheduler.

Table IV
THE NUMBER OF MIGRATIONS PER SECOND

application           migr/s     application              migr/s
compiler.compiler     64.26      scimark.fft.large        32.94
compiler.sunflow      69.12      scimark.lu.large         82.38
compress              65.92      scimark.sor.large        78.54
crypto.aes            11.61      scimark.sparse.large     85.86
crypto.rsa            0.27       scimark.monte carlo      0.16
crypto.signverify     46.43      scimark.fft.small        78.70
derby                 65.40      scimark.lu.small         89.82
mpegaudio             86.06      scimark.sor.small        69.34
serial                82.95      scimark.sparse.small     0.30
xml.transform         62.48      xml.validation           59.40

VI. CONCLUSION AND FUTURE WORK

Asymmetric multicore systems have been studied as a new hardware platform for energy-performance efficiency. A major issue is the scheduling problem [2], [4], [9], [11], [12]. Toward microarchitecture specialization [2], we present fuzzy-control schedulers based on performance-counter statistics in the JVM. The experimental results show that the fuzzy-control schedulers exhibit an EDP benefit for one memory-intensive application. One major piece of future work is to precisely identify this kind of memory-intensive application from performance-counter statistics.

Although we do not provide power reduction rate data, we achieve at most an 85% average power reduction rate over the entire SPECjvm2008 suite. This is important to the development of ASYMS. Although the fuzzy-control schedulers make the right decisions for several applications, there is still room for improvement. The results also show that the migration overhead is negligible in our ASYMS and that the major cause of performance degradation is the use of lower frequencies.

REFERENCES

[1] R. Kumar, K. I. Farkas, N. P. Jouppi, P. Ranganathan, and D. M. Tullsen, "Single-ISA heterogeneous multi-core architectures: The potential for processor power reduction," Microarchitecture, IEEE International Symposium on, vol. 0, p. 81, 2003.

[2] A. Fedorova, J. C. Saez, D. Shelepov, and M. Prieto, "Maximizing power efficiency with asymmetric multicore systems," Commun. ACM, vol. 52, no. 12, pp. 48–57, 2009.

[3] R. Kumar, D. M. Tullsen, N. P. Jouppi, and P. Ranganathan, "Heterogeneous chip multiprocessors," Computer, vol. 38, pp. 32–38, 2005.

[4] T. Li, D. Baumberger, D. A. Koufaty, and S. Hahn, "Efficient operating system scheduling for performance-asymmetric multi-core architectures," in SC '07: Proceedings of the 2007 ACM/IEEE Conference on Supercomputing. New York, NY, USA: ACM, 2007, pp. 1–11.

[5] J. C. Mogul, J. Mudigonda, N. Binkert, P. Ranganathan, and V. Talwar, "Using asymmetric single-ISA CMPs to save energy on operating systems," IEEE Micro, vol. 28, no. 3, pp. 26–41, 2008.

[6] M. Gillespie, "Preparing for the second stage of multi-core hardware: Asymmetric (heterogeneous) cores," Intel Corporation, Tech. Rep., 2008.

[7] V. Kumar and A. Fedorova, "Towards better performance per watt in virtual environments on asymmetric single-ISA multi-core systems," SIGOPS Oper. Syst. Rev., vol. 43, no. 3, pp. 105–109, 2009.

[8] S. Balakrishnan, R. Rajwar, M. Upton, and K. Lai, "The impact of performance asymmetry in emerging multicore architectures," SIGARCH Comput. Archit. News, vol. 33, no. 2, pp. 506–517, 2005.

[9] R. Strong, J. Mudigonda, J. C. Mogul, N. Binkert, and D. Tullsen, "Fast switching of threads between cores," SIGOPS Oper. Syst. Rev., vol. 43, pp. 35–45, April 2009. [Online]. Available: http://doi.acm.org/10.1145/1531793.1531801

[10] M. D. Hill and M. R. Marty, "Amdahl's law in the multicore era," Computer, vol. 41, no. 7, pp. 33–38, 2008.

[11] S. Ghiasi, T. Keller, and F. Rawson, "Scheduling for heterogeneous processors in server systems," in CF '05: Proceedings of the 2nd Conference on Computing Frontiers. New York, NY, USA: ACM, 2005, pp. 199–210.

[12] D. Shelepov, J. C. Saez Alcaide, S. Jeffery, A. Fedorova, N. Perez, Z. F. Huang, S. Blagodurov, and V. Kumar, "HASS: a scheduler for heterogeneous multicore systems," SIGOPS Oper. Syst. Rev., vol. 43, no. 2, pp. 66–75, 2009.

[13] K. K. Rangan, G.-Y. Wei, and D. Brooks, "Thread motion: fine-grained power management for multi-core systems," SIGARCH Comput. Archit. News, vol. 37, no. 3, pp. 302–313, 2009.

[14] T. Sherwood, E. Perelman, G. Hamerly, S. Sair, and B. Calder, "Discovering and exploiting program phases," IEEE Micro, vol. 23, no. 6, pp. 84–93, 2003.

[15] Performance counters for Linux. [Online]. Available: https://perf.wiki.kernel.org/index.php/Main_Page

[16] L. A. Zadeh, "Fuzzy logic, neural networks, and soft computing," Commun. ACM, vol. 37, pp. 77–84, March 1994. [Online]. Available: http://doi.acm.org/10.1145/175247.175255

[17] L.-X. Wang, A Course in Fuzzy Systems and Control. Upper Saddle River, NJ, USA: Prentice-Hall, Inc., 1997.

[18] L. A. Zadeh, "Fuzzy sets and their applications to cognitive and decision processes," Academic Press, pp. 1–39, 1975.

[19] H. Pourshaghaghi and J. de Gyvez, "Dynamic voltage scaling based on supply current tracking using fuzzy logic controller," in Electronics, Circuits, and Systems, 2009. ICECS 2009. 16th IEEE International Conference on, 2009, pp. 779–782.

[20] Java HotSpot Virtual Machine. [Online]. Available: http://www.oracle.com/technetwork/java/javase/tech/index-jsp-136373.html

[21] Q. Teng, P. Sweeney, and E. Duesterwald, "Understanding the cost of thread migration for multi-threaded Java applications running on a multicore platform," in Performance Analysis of Systems and Software, 2009. ISPASS 2009. IEEE International Symposium on, 2009, pp. 123–132.

[22] Entry-Level Power Supply Specification. [Online]. Available: http://en.wikipedia.org/wiki/Entry-Level_Power_Supply_Specification

[23] Project SIKULI. [Online]. Available: http://sikuli.org/

[24] SPECjvm2008 User's Guide. [Online]. Available: http://www.spec.org/jvm2008/docs/UserGuide.html