global task scheduling - 2017...

Moving big.LITTLETM Toward Fully Heterogeneous

Global Task Scheduling

Samuel Chiang 姜新雨

FAE Director APAC

Agenda

What is big.LITTLE?

What are the benefits?

How do you build a big.LITTLE SoC?

Where can you get the software?

What is next for big.LITTLE?

Conflicting Requirements

Why big.LITTLE ?

All workloads are not equal.

Applications do not require high performance all of the time.

Higher Efficiency Slow / Idle periods in high performance apps

Background Tasks

PER

FO

RM

AN

CE

Higher Performance

One or two dominant threads

Simultaneous execution of apps

“Demanding tasks”

What is big.LITTLE Processing?

Power Optimizing Technology that Delivers More Performance and Lower Power

Combines Architecturally Identical Performance- and Efficiency-Tuned Processors

Transparent to Application Software, shipping in production today

LITTLE

“Always on, always connected

tasks”

Significant Power Saving (selected use cases)

Cortex-A9

smartphone

big.LITTLE

big

Cortex® -A9

smartphone

big.LITTLE

Increased Performance

big.LITTLE Software Evolution

Improving Performance and Efficiency

2012 H1 2013 H2 2013

big.LITTLE MP CPU

Migration Cluster

Migration

Global Task

Scheduling

Big CPU Big CPU

Big CPU

LITTLE CPU


LITTLE CPU

Thread Thread

Thread Thread

Thread

Thread Thread

Thread

LITTLE CPU

Thread Thread

Thread Thread

Thread Thread

Thread Thread

Cluster 1 Cluster 2

Thread

2. Demanding Tasks detected

• Based on amount of time a thread is ready to run (run

queue residency)

Moved to a BIG CPU

1. System starts

Task fill up the system

3. Task no longer demanding

Moved to a LITTLE CPU

Big CPU

Thread

4. Global load balancer consolidates the workload

Linux Kernel scheduler at 10,000 ft

Memory Controller Ports

Load balancer: ensures the run queue is

loaded evenly for each CPU, when empty

pushes the CPU to idle. Called every time

the system requires scheduling. If running

queues are unbalanced idle tasks from

busiest processors are migrated

Completely Fair Schedule class:

schedules tasks following a Completely Fair

Scheduler (CFS) algorithm.

Cache Coherent Interconnect

L2

A-15 A-15 …

L2

A-7 A-7 … IO

Coherent

Master

System Port

Application Application

Linux Kernel Scheduler

Run

Queue

Run

Queue Run

Queue

Run

Queue

A Running Queue : is created for each

processor (CPU. Each queue contains a

list of runnable processes on a given

processor.

RT schedule class: schedules tasks

following real-time mechanism defined in

POSIX standard.

big.LITTLE Operation

big.LITTLE delivers optimum performance and efficiency in all use cases

Sustained Performance Envelope

1 - Maximum Responsiveness for high-intensity workload

2 - Sustained maximum interactive performance

3 - Long-use low-intensity workload

Responsive Peak Performance

Peak performance bursts above typical power envelope of the SoC

Delivering enhanced user experience at a touch


Bursts of Performance

Web Browsing + Audio uses big cores for bursts of performance

LITTLE cores run background tasks, audio, and browser tasks after the main screen

render completes

Partner Platform, 4 Cortex-A15 cores, 4 Cortex-A7 cores

0

10

20

30

40

50

60

70

80

90

100

A7 CPU0 A7 CPU1 A7 CPU2 A7 CPU3 A15 CPU0 A15 CPU1 A15 CPU2 A15 CPU3

DVFS states: BBench Browsing with Audio big Core High

Frequency

big Core Med

Frequency

big Core Low

Frequency

LITTLE Core High

Frequency

LITTLE Core Med

Frequency

LITTLE Core low

frequency

STANDBYWFI

OFF

Cluster Off

0 50 100

0

1

2

3

4

5

6

7

8

% Time

Number of parallel active CPUs

Bursts of Performance

0

10

20

30

40

50

60

70

80

90

100


DVFS states: Epic Citadel big Core High

Frequency

big Core Med

Frequency

big Core Low

Frequency

LITTLE Core High

Frequency

LITTLE Core Med

Frequency

LITTLE Core low

frequency

STANDBYWFI

OFF

Cluster Off0 50 100

0

1

2

3

4

5

6

7

8

% Time


Moderately intensive game, it can run predominantly on Cortex-A7 processor

Cortex-A15 used for short bursts of performance


Maximum Sustained Performance


Maximise performance within the sustainable power envelope

Efficiency of big.LITTLE enables increased performance in SoC

Performance Demands Increasing

Mobile device performance advancing at ever faster pace

Po

we

r

big.LITTLE, Restof SoC

big.LITTLE, CPU

A15 Only, Restof SoC

A15 Only, CPU

Maximizing Sustained Performance

Real Racing 3 Castle Master

CPU

Rest

of So

C

Rest

of So

C

CPU

CPU

CPU

thermal limit

big.LITTLE enables maximum performance under the thermal envelope for sustained

gaming

Maximizing Sustained Performance

CPU + GPU Utilization

ARM

MaliTM-T628

Cortex-A15

0%

20%

40%

60%

80%

100%

44%

Cortex-A15

0%

20%

40%

60%

80%

100%

76%

Cortex-A15

0%

20%

40%

60%

80%

100%75%

Continuous play on advanced console-quality gaming challenges thermal limits P

ow

er

A15 Only, Rest of

SoC

A15 Only, CPU


Rest

of So

C

CPU

CPU

thermal limit

Delivering Better User Experiences

big.LITTLE reduces sustained SoC power below the thermal limit

Lower power for potentially longer playing time

More power budget for GPU and visuals

CPU + GPU Utilization

ARM

Mali-T628

Cortex-A7 Cortex-A7 0%

100%

0%

100%

43% 42%

Cortex-A7 Cortex-A7 0%

100%

0%

100%

44% 41%

Cortex-A15

0%

20%

40%

60%

80%

100%

Cortex-A15

0%

20%

40%

60%

80%

100%

44%

47%

Po

we

r

big.LITTLE, Rest

of SoC

big.LITTLE,

CPU

A15 Only, Rest

of SoC

A15 Only, CPU


CPU

Rest

of So

C

Rest

of So

C

CPU

CPU

CPU

thermal limit

Increased Performance for Threaded Software

Well threaded games can make use of big and LITTLE processors extensively.

0

10

20

30

40

50

60

70

80

90

100


DVFS states: CastleMaster big Core High Frequency

big Core Med Frequency

big Core Low Frequency

LITTLE Core High

Frequency

LITTLE Core Med

Frequency

LITTLE Core low

frequency

STANDBYWFI

OFF

Cluster Off

0 20 40 60 80 100

0

1

2

3

4

5

6

7

8

% Time



Most Efficient Processing


Maximise energy efficiency with LITTLE processor for typical workloads

Extended device lifetime with the most efficient A-Class processor

Po

we

r

big.LITTLE, Rest ofSoC

big.LITTLE, CPU

A15 Only, Rest ofSoC

A15 Only, CPU

Low Intensity Use Cases

Significant power reduction at CPU level R

est

of So

C

R S

oC

C

PU

CPU

CPU

Audio Angry Birds Temple Run Video 1080p Home Screen

Rest

of So

C

Rest

of So

C

Rest

of So

C

Rest

of So

C

CPU

-76% -76% -73% -73% -42%

Rest

of So

C


Po

we

r

big.LITTLE, Rest ofSoC

big.LITTLE, CPU

A15 Only, Rest ofSoC

A15 Only, CPU


Savings are still significant at SoC level R

est

of So

C

R S

oC

C

PU

CPU

CPU

Rest

of So

C

Audio Angry Birds Temple Run Video 1080p Home Screen

Rest

of So

C

Rest

of So

C

Rest

of So

C

Rest

of So

C

CPU

-38% -40% -33% -33% -21%



Casual Games reside entirely on Cortex-A7 processors, except for drawing of initial

screens and App launch

0

10

20

30

40

50

60

70

80

90

100


DVFS states: Angry Birds big Core High Frequency

big Core Med Frequency

big Core Low Frequency

LITTLE Core High

Frequency

LITTLE Core Med

Frequency

LITTLE Core low

frequency

STANDBYWFI

OFF

Cluster Off

0 20 40 60 80 100

0

1

2

3

4

5

6

7

8

% Time



Agenda

What is big.LITTLE?

What are the benefits?

How do you build a big.LITTLE SoC?

Where can you get the software?

What is next for big.LITTLE?

Fine-Tuned to Different Performance Points

Simple, in-order, 8 stage pipelines

Performance better than mainstream, high-volume smartphones (Cortex-

A8 and Cortex-A9)

Most energy-efficient applications processor from ARM

Complex, out-of-order, multi-issue pipelines

Up to 2x the performance of today’s high-end smartphones

Highest performance in mobile power envelope

Cortex-A53

Cortex-A7

Cortex-A57

Cortex-A15

LIT

TL

E

big

Q

u

e

u

e

I

s

s

u

e

I

n

t

e

g

e

r

Integrating an Efficient System

CoreLink™ CCI-400

Cache Coherent Interconnect

DDR/LPDDR DDR/LPDDR

IO Coherent Masters

To Peripheral Interconnect DMC

MMU-400

GIC-400 Interrupt Control

L2 cache L2 Cache

Shader Shader

Shader Shader

Shader

Shader

L2 L2 Cache

Cortex-A7 Mali- T628 GPU Cortex-A15

MMU-400

TZC-400

Display and

Video Sub-

system

ADB-400 ADB-400 ADB-400 ADB-400

MMU-400 MMU-400

ACE Coherence enables

big.LITTLE and efficient

GPU compute

Coherency simplifies

software

Efficient Voltage

scaling for power

management

Common memory

view for all SoC

components

Path to memory with

TrustZone hardware

security

CPU Cluster and IO

Coherent GPU

Software Availability

CPU Migration

“In-Kernel Switcher”

http://www.linaro.org/linaro-blog/2013/05/02/the-linaro-iks-code-now-publicly-available


“big.LITTLE MP”

http://git.linaro.org/gitweb?p=arm/big.LITTLE/mp.git

Integration, tuning and validation on Chip Vendor SoC required

Upstreaming
















http://git.linaro.org/gitweb?p=arm/big.LITTLE/mp.git

big.LITTLE SoCs

Initial big.LITTLE SoCs now in silicon

10+ licensees in various stages of development

Rapid advancement in software and system optimization

Summary

big.LITTLE delivering on key benefits

Significantly higher peak performance within a tighter power budget

Improved SoC performance under thermal constraints

Power savings across a range of workloads and use scenarios

Performance and efficiency increase on threaded workloads

big.LITTLE technology is shipping in products today

Wider range of differentiated solutions from silicon partners

Devices transitioning to more advanced big.LITTLE

Demonstrated additional benefit from global task scheduling

Global task scheduling available for production platforms

Thank You

global task scheduling - 2017...

Documents