global task scheduling - 2017...
TRANSCRIPT
Moving big.LITTLETM Toward Fully Heterogeneous
Global Task Scheduling
Samuel Chiang 姜新雨
FAE Director APAC
Agenda
What is big.LITTLE?
What are the benefits?
How do you build a big.LITTLE SoC?
Where can you get the software?
What is next for big.LITTLE?
Conflicting Requirements
Why big.LITTLE ?
All workloads are not equal.
Applications do not require high performance all of the time.
Higher Efficiency Slow / Idle periods in high performance apps
Background Tasks
PER
FO
RM
AN
CE
Higher Performance
One or two dominant threads
Simultaneous execution of apps
“Demanding tasks”
What is big.LITTLE Processing?
Power Optimizing Technology that Delivers More Performance and Lower Power
Combines Architecturally Identical Performance- and Efficiency-Tuned Processors
Transparent to Application Software, shipping in production today
LITTLE
“Always on, always connected
tasks”
Significant Power Saving (selected use cases)
Cortex-A9
smartphone
big.LITTLE
big
Cortex® -A9
smartphone
big.LITTLE
Increased Performance
big.LITTLE Software Evolution
Improving Performance and Efficiency
2012 H1 2013 H2 2013
big.LITTLE MP CPU
Migration Cluster
Migration
Global Task
Scheduling
Big CPU Big CPU
Big CPU
LITTLE CPU
Global Task Scheduling
LITTLE CPU
Thread Thread
Thread Thread
Thread
Thread Thread
Thread
LITTLE CPU
Thread Thread
Thread Thread
Thread Thread
Thread Thread
Cluster 1 Cluster 2
Thread
2. Demanding Tasks detected
• Based on amount of time a thread is ready to run (run
queue residency)
Moved to a BIG CPU
1. System starts
Task fill up the system
3. Task no longer demanding
Moved to a LITTLE CPU
Big CPU
Thread
4. Global load balancer consolidates the workload
Linux Kernel scheduler at 10,000 ft
Memory Controller Ports
Load balancer: ensures the run queue is
loaded evenly for each CPU, when empty
pushes the CPU to idle. Called every time
the system requires scheduling. If running
queues are unbalanced idle tasks from
busiest processors are migrated
Completely Fair Schedule class:
schedules tasks following a Completely Fair
Scheduler (CFS) algorithm.
Cache Coherent Interconnect
L2
A-15 A-15 …
L2
A-7 A-7 … IO
Coherent
Master
System Port
Application Application
Linux Kernel Scheduler
Run
Queue
Run
Queue Run
Queue
Run
Queue
A Running Queue : is created for each
processor (CPU. Each queue contains a
list of runnable processes on a given
processor.
RT schedule class: schedules tasks
following real-time mechanism defined in
POSIX standard.
big.LITTLE Operation
big.LITTLE delivers optimum performance and efficiency in all use cases
Sustained Performance Envelope
1 - Maximum Responsiveness for high-intensity workload
2 - Sustained maximum interactive performance
3 - Long-use low-intensity workload
Responsive Peak Performance
Peak performance bursts above typical power envelope of the SoC
Delivering enhanced user experience at a touch
Sustained Performance Envelope
Bursts of Performance
Web Browsing + Audio uses big cores for bursts of performance
LITTLE cores run background tasks, audio, and browser tasks after the main screen
render completes
Partner Platform, 4 Cortex-A15 cores, 4 Cortex-A7 cores
0
10
20
30
40
50
60
70
80
90
100
A7 CPU0 A7 CPU1 A7 CPU2 A7 CPU3 A15 CPU0 A15 CPU1 A15 CPU2 A15 CPU3
DVFS states: BBench Browsing with Audio big Core High
Frequency
big Core Med
Frequency
big Core Low
Frequency
LITTLE Core High
Frequency
LITTLE Core Med
Frequency
LITTLE Core low
frequency
STANDBYWFI
OFF
Cluster Off
0 50 100
0
1
2
3
4
5
6
7
8
% Time
Number of parallel active CPUs
Bursts of Performance
0
10
20
30
40
50
60
70
80
90
100
A7 CPU0 A7 CPU1 A7 CPU2 A7 CPU3 A15 CPU0 A15 CPU1 A15 CPU2 A15 CPU3
DVFS states: Epic Citadel big Core High
Frequency
big Core Med
Frequency
big Core Low
Frequency
LITTLE Core High
Frequency
LITTLE Core Med
Frequency
LITTLE Core low
frequency
STANDBYWFI
OFF
Cluster Off0 50 100
0
1
2
3
4
5
6
7
8
% Time
Number of parallel active CPUs
Moderately intensive game, it can run predominantly on Cortex-A7 processor
Cortex-A15 used for short bursts of performance
Partner Platform, 4 Cortex-A15 cores, 4 Cortex-A7 cores
Maximum Sustained Performance
Sustained Performance Envelope
Maximise performance within the sustainable power envelope
Efficiency of big.LITTLE enables increased performance in SoC
Po
we
r
big.LITTLE, Restof SoC
big.LITTLE, CPU
A15 Only, Restof SoC
A15 Only, CPU
Maximizing Sustained Performance
Real Racing 3 Castle Master
CPU
Rest
of So
C
Rest
of So
C
CPU
CPU
CPU
thermal limit
big.LITTLE enables maximum performance under the thermal envelope for sustained
gaming
Maximizing Sustained Performance
CPU + GPU Utilization
ARM
MaliTM-T628
Cortex-A15
0%
20%
40%
60%
80%
100%
44%
Cortex-A15
0%
20%
40%
60%
80%
100%
76%
Cortex-A15
0%
20%
40%
60%
80%
100%75%
Continuous play on advanced console-quality gaming challenges thermal limits P
ow
er
A15 Only, Rest of
SoC
A15 Only, CPU
Real Racing 3 Castle Master
Rest
of So
C
CPU
CPU
thermal limit
Delivering Better User Experiences
big.LITTLE reduces sustained SoC power below the thermal limit
Lower power for potentially longer playing time
More power budget for GPU and visuals
CPU + GPU Utilization
ARM
Mali-T628
Cortex-A7 Cortex-A7 0%
100%
0%
100%
43% 42%
Cortex-A7 Cortex-A7 0%
100%
0%
100%
44% 41%
Cortex-A15
0%
20%
40%
60%
80%
100%
Cortex-A15
0%
20%
40%
60%
80%
100%
44%
47%
Po
we
r
big.LITTLE, Rest
of SoC
big.LITTLE,
CPU
A15 Only, Rest
of SoC
A15 Only, CPU
Real Racing 3 Castle Master
CPU
Rest
of So
C
Rest
of So
C
CPU
CPU
CPU
thermal limit
Increased Performance for Threaded Software
Well threaded games can make use of big and LITTLE processors extensively.
0
10
20
30
40
50
60
70
80
90
100
A7 CPU0 A7 CPU1 A7 CPU2 A7 CPU3 A15 CPU0 A15 CPU1 A15 CPU2 A15 CPU3
DVFS states: CastleMaster big Core High Frequency
big Core Med Frequency
big Core Low Frequency
LITTLE Core High
Frequency
LITTLE Core Med
Frequency
LITTLE Core low
frequency
STANDBYWFI
OFF
Cluster Off
0 20 40 60 80 100
0
1
2
3
4
5
6
7
8
% Time
Number of parallel active CPUs
Partner Platform, 4 Cortex-A15 cores, 4 Cortex-A7 cores
Most Efficient Processing
Sustained Performance Envelope
Maximise energy efficiency with LITTLE processor for typical workloads
Extended device lifetime with the most efficient A-Class processor
Po
we
r
big.LITTLE, Rest ofSoC
big.LITTLE, CPU
A15 Only, Rest ofSoC
A15 Only, CPU
Low Intensity Use Cases
Significant power reduction at CPU level R
est
of So
C
R S
oC
C
PU
CPU
CPU
Audio Angry Birds Temple Run Video 1080p Home Screen
Rest
of So
C
Rest
of So
C
Rest
of So
C
Rest
of So
C
CPU
-76% -76% -73% -73% -42%
Rest
of So
C
Partner Platform, 4 Cortex-A15 cores, 4 Cortex-A7 cores
Po
we
r
big.LITTLE, Rest ofSoC
big.LITTLE, CPU
A15 Only, Rest ofSoC
A15 Only, CPU
Low Intensity Use Cases
Savings are still significant at SoC level R
est
of So
C
R S
oC
C
PU
CPU
CPU
Rest
of So
C
Audio Angry Birds Temple Run Video 1080p Home Screen
Rest
of So
C
Rest
of So
C
Rest
of So
C
Rest
of So
C
CPU
-38% -40% -33% -33% -21%
Partner Platform, 4 Cortex-A15 cores, 4 Cortex-A7 cores
Low Intensity Use Cases
Casual Games reside entirely on Cortex-A7 processors, except for drawing of initial
screens and App launch
0
10
20
30
40
50
60
70
80
90
100
A7 CPU0 A7 CPU1 A7 CPU2 A7 CPU3 A15 CPU0 A15 CPU1 A15 CPU2 A15 CPU3
DVFS states: Angry Birds big Core High Frequency
big Core Med Frequency
big Core Low Frequency
LITTLE Core High
Frequency
LITTLE Core Med
Frequency
LITTLE Core low
frequency
STANDBYWFI
OFF
Cluster Off
0 20 40 60 80 100
0
1
2
3
4
5
6
7
8
% Time
Number of parallel active CPUs
Partner Platform, 4 Cortex-A15 cores, 4 Cortex-A7 cores
Agenda
What is big.LITTLE?
What are the benefits?
How do you build a big.LITTLE SoC?
Where can you get the software?
What is next for big.LITTLE?
Fine-Tuned to Different Performance Points
Simple, in-order, 8 stage pipelines
Performance better than mainstream, high-volume smartphones (Cortex-
A8 and Cortex-A9)
Most energy-efficient applications processor from ARM
Complex, out-of-order, multi-issue pipelines
Up to 2x the performance of today’s high-end smartphones
Highest performance in mobile power envelope
Cortex-A53
Cortex-A7
Cortex-A57
Cortex-A15
LIT
TL
E
big
Q
u
e
u
e
I
s
s
u
e
I
n
t
e
g
e
r
Integrating an Efficient System
CoreLink™ CCI-400
Cache Coherent Interconnect
DDR/LPDDR DDR/LPDDR
IO Coherent Masters
To Peripheral Interconnect DMC
MMU-400
GIC-400 Interrupt Control
L2 cache L2 Cache
Shader Shader
Shader Shader
Shader
Shader
L2 L2 Cache
Cortex-A7 Mali- T628 GPU Cortex-A15
MMU-400
TZC-400
Display and
Video Sub-
system
ADB-400 ADB-400 ADB-400 ADB-400
MMU-400 MMU-400
ACE Coherence enables
big.LITTLE and efficient
GPU compute
Coherency simplifies
software
Efficient Voltage
scaling for power
management
Common memory
view for all SoC
components
Path to memory with
TrustZone hardware
security
CPU Cluster and IO
Coherent GPU
Software Availability
CPU Migration
“In-Kernel Switcher”
http://www.linaro.org/linaro-blog/2013/05/02/the-linaro-iks-code-now-publicly-available
Global Task Scheduling
“big.LITTLE MP”
http://git.linaro.org/gitweb?p=arm/big.LITTLE/mp.git
Integration, tuning and validation on Chip Vendor SoC required
Upstreaming
big.LITTLE SoCs
Initial big.LITTLE SoCs now in silicon
10+ licensees in various stages of development
Rapid advancement in software and system optimization
Summary
big.LITTLE delivering on key benefits
Significantly higher peak performance within a tighter power budget
Improved SoC performance under thermal constraints
Power savings across a range of workloads and use scenarios
Performance and efficiency increase on threaded workloads
big.LITTLE technology is shipping in products today
Wider range of differentiated solutions from silicon partners
Devices transitioning to more advanced big.LITTLE
Demonstrated additional benefit from global task scheduling
Global task scheduling available for production platforms