1
EFetch: Optimizing Instruction Fetch for Event-Driven Web Applications
Gaurav Chadha, Scott Mahlke, Satish Narayanasamy
University of Michigan
August, 2014
University of Michigan, Electrical Engineering and Computer Science
2
Evolution of the Web
[Diagram: Web 1.0 vs. Web 2.0, each showing a server and a client exchanging published content and user generated content]
Web 1.0
• Static Web Pages
• Passively view content
Web 2.0
• Dynamic Web Pages
• Collaborate and generate content
3
Evolution of the Web
[Diagram: Web 1.0 vs. Web 2.0, each showing a server and a client exchanging published content and user generated content, annotated with "compute" labels]
• Rich user experience
4
Evolution of the Web
Web 1.0: yahoo.com in 1996
Web 2.0: yahoo.com in 2014
• 30x more instructions executed
Good client-side performance
• Rich User Experience
• Browser responsiveness
5
Core Specialization
[Diagram: multi-core processor with Core 1 to Core 4, each with private caches]
6
Web Core
[Diagram: multi-core processor with Core 1 to Core 4, each with private caches; labels "WebBoost" and "Web Core" mark the specialized core]
7
WebBoost 1.0
Script performance: High L1-I cache misses
Goal: Specialized instruction prefetcher for web client-side script
• Web client-side script performance
• Browser responsiveness
[Figure: Web browser computational components (JavaScript, GC, Parsing, CSS, Paint, Layout, Other) for Web 1.0 vs. Web 2.0]
8
Poor I-Cache Performance
• Web pages tend to support numerous functionalities
– Large instruction footprint
– Lack hot code
– Functionalities shown: graphics effects, image editing, online forms, document editing, web personalization, games, audio & video
• Web client-side script inefficiencies: code bloat
– JIT compiled by JS engine (V8, IonMonkey, Nitro, Chakra)
– Dynamic typing
9
Lack of Hot Code
[Figure: % of L1-I misses (y-axis, 0 to 100) vs. L1-I cache blocks (x-axis) for PARSEC, SPECint 2006, and Web Apps; annotations: 95%, 860, 20,400]
10
Poor I-Cache Performance
• Compared to conventional programs, JS code incurs many more L1-I misses
• Perfect I-Cache: 53% speedup
[Figure: L1-I MPKI (y-axis, 0 to 30) for amazon, bing, cnn, gmaps, gdocs, pixlr, PARSEC, and SPECint 2006]
11
Problem Statement
• Problem: Poor web client-side script I-Cache performance
• Opportunity: Web client-side scripts are executed in an event-driven model
• Solution:
– Specialized prefetcher that is customized for the event-driven execution model
– Identifies distinct events in the instruction stream
12
Outline
Event-driven Web Applications
EFetch
Facets of Instruction Prefetching
Design and Architecture
Methodology
Results
Conclusion
13
Web Browser Events
External Input Event
Mouse Click
On Load
Internal Browser Event
14
Event-driven Web Applications
[Diagram: a Renderer Thread pops events (E1, E2, E3) from the head of an Event Queue and executes them on the JS Engine]
• Events are inserted into the queue
– External Input Events: Mouse Click, Keyboard key press, GPS events
– Internal Events: Timer event, DOMContentLoaded
• Events generate other events
• Popping an event for execution: it executes on the JS Engine
• Event Queue empty: program waits
• Poor I-Cache performance
• Different events tend to execute different code
• Events typically execute for a very short duration
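The event-driven model on this slide can be sketched in a few lines of Python. This is only an illustrative model, not browser code: the `EventLoop` class, its `post` and `run` methods, and the event names are hypothetical. It shows events being popped from the head of a queue, and a handler inserting a further event while it runs.

```python
from collections import deque

class EventLoop:
    """Toy model of a renderer thread draining an event queue (hypothetical names)."""

    def __init__(self):
        self.queue = deque()   # the event queue; the head is the left end
        self.log = []          # order in which events were executed

    def post(self, name, handler):
        """Insert an event (external or internal) at the tail of the queue."""
        self.queue.append((name, handler))

    def run(self):
        """Pop events from the head until the queue is empty, then stop (the program waits)."""
        while self.queue:
            name, handler = self.queue.popleft()
            self.log.append(name)
            handler(self)      # a handler may post further events

def on_click(loop):
    # An external input event that generates an internal timer event.
    loop.post("timer", lambda l: None)

loop = EventLoop()
loop.post("click", on_click)
loop.post("load", lambda l: None)
loop.run()
```

Each handler here is tiny and different from the others, which mirrors the slide's point that events are short and tend to execute different code.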
15
EFetch
[Diagram: Renderer Thread executing events E1, E2, E3, each tagged with an Event ID]
• EFetch (Event Fetch): instruction prefetcher for event-driven web applications
• Technique:
– Uses an event ID to identify distinct events in the instruction stream
– Event ID is augmented to create an event signature that predicts control flow well
16
Event Signature
[Diagram: Renderer Thread executing events E1, E2, E3]
Event ID: combines Event Type and Event Handler
• Formed by the browser
• Uniquely identifies an event
Event Signature: combines Event ID and Function Call Context
• Formed in the hardware
• Function Call Context comes from context depth (3) ancestor functions in the Call Stack
• Correlates well with the program control flow
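A minimal sketch of how such a signature could be formed is given here. The slides do not specify the bit layout or the combining function, so everything concrete in this block is an assumption made for illustration: the 32-bit width, the packing in `make_event_id`, the rotate-and-XOR fold in `call_context`, and the final XOR. Only the context depth of 3 comes from the slide.

```python
CONTEXT_DEPTH = 3        # from the slide: 3 ancestor functions
MASK = 0xFFFFFFFF        # assumed 32-bit signature width

def make_event_id(event_type_code: int, handler_addr: int) -> int:
    """Hypothetical browser-side ID: event type in the top byte, handler address below."""
    return ((event_type_code & 0xFF) << 24) | (handler_addr & 0xFFFFFF)

def call_context(call_stack: list[int], depth: int = CONTEXT_DEPTH) -> int:
    """Fold the addresses of the last `depth` functions on the call stack into one word."""
    ctx = 0
    for addr in call_stack[-depth:]:
        ctx = ((ctx << 5) | (ctx >> 27)) & MASK   # rotate left by 5 (assumed mixing step)
        ctx ^= addr & MASK
    return ctx

def event_signature(event_id: int, call_stack: list[int]) -> int:
    """Assumed combination: XOR of the event ID with the function call context."""
    return event_id ^ call_context(call_stack)
```

With this sketch, the same function reached through different callers, or under a different event, yields a different signature, which is the property the slide relies on.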
17
Instruction Prefetcher: Facets
What to prefetch?
When to prefetch?
Instruction Prefetcher
18
What to Prefetch?
• Naïve solution: On a function call, prefetch the function body
– But, this is too late
• Our approach: On a function call, predict its callees and prefetch their function body addresses
[Diagram: the Event Signature (derived from the event ID) maps to callee entries c1 : <I-Cache Addr>, c2 : <I-Cache Addr>, c3 : <I-Cache Addr>, where ci is a callee]
19
Duplication of Addresses
[Diagram: within an event, functions f and g each call h]
• A function can appear in two distinct event signatures
• Its body addresses might be duplicated

Event signature    Callee I-Cache addresses
event, f, h        < A, B, C >
event, g, h        < A, C, D >
20
Compacting I-Cache Addresses
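[Diagram: functions f and g each call h]
• Uncompacted, each signature stores its own I-Cache addresses for h:
– event f h: < A, B, C >
– event g h: < A, C, D >
• Compacted, the addresses of h are stored once as < A, B, C, D >, and each signature keeps a callee bit vector:
– event f h: ( 1, 1, 1, 0 )
– event g h: ( 1, 0, 1, 1 )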
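The compaction on this slide can be reproduced with a short Python sketch. The function names `compact` and `expand` are hypothetical, and addresses are modeled as the letters used on the slide; real hardware would store cache-block addresses.

```python
def compact(per_signature):
    """Merge per-signature address lists into one shared list plus one bit vector each."""
    shared = []                                   # addresses stored once, in first-seen order
    for addrs in per_signature.values():
        for addr in addrs:
            if addr not in shared:
                shared.append(addr)
    vectors = {
        sig: [1 if addr in addrs else 0 for addr in shared]
        for sig, addrs in per_signature.items()
    }
    return shared, vectors

def expand(shared, bits):
    """Recover the addresses one signature should prefetch from its bit vector."""
    return [addr for addr, bit in zip(shared, bits) if bit]

# The slide's example: h is reached from f and from g within the same event.
per_signature = {
    ("event", "f", "h"): ["A", "B", "C"],
    ("event", "g", "h"): ["A", "C", "D"],
}
shared, vectors = compact(per_signature)
```

Here `shared` comes out as `["A", "B", "C", "D"]` and the two vectors as `[1, 1, 1, 0]` and `[1, 0, 1, 1]`, matching the slide.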
21
Recording Callees and Function Bodies
[Diagram: an event signature indexes the Context Table, whose entry lists callees (c1, c2, ...), each with a bit vector; the Function Table holds each function's I-Cache addresses, e.g. < A, B, C, D >]
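One way to picture how the two tables could be filled at run time and then read back is sketched below. This is an assumption-laden software model, not the hardware design: the `EFetchTables` class, its `record` and `predict` methods, the unbounded dictionaries, and the use of function names as keys are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class EFetchTables:
    # Context Table: event signature -> {callee: bit vector over the callee's blocks}
    context_table: dict = field(default_factory=dict)
    # Function Table: callee -> list of I-cache block addresses seen in its body
    function_table: dict = field(default_factory=dict)

    def record(self, signature, callee, block_addr):
        """Note that `callee`, called under `signature`, touched `block_addr`."""
        blocks = self.function_table.setdefault(callee, [])
        if block_addr not in blocks:
            blocks.append(block_addr)
        bits = self.context_table.setdefault(signature, {}).setdefault(callee, [])
        bits.extend([0] * (len(blocks) - len(bits)))   # grow vector to cover all blocks
        bits[blocks.index(block_addr)] = 1

    def predict(self, signature):
        """Return (callee, addresses to prefetch) pairs for a signature."""
        result = []
        for callee, bits in self.context_table.get(signature, {}).items():
            blocks = self.function_table[callee]
            # zip stops at the shorter list, so blocks beyond the vector count as 0
            result.append((callee, [b for b, bit in zip(blocks, bits) if bit]))
        return result
```

Recording blocks A, B, C for h under one signature and A, C, D under another leaves a single four-entry Function Table row for h, while each signature predicts only the blocks it actually used.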
22
Instruction Prefetcher: Facets
What to prefetch?
When to prefetch?
Instruction Prefetcher
23
When to Prefetch?
• When?: Important to prefetch sufficiently in advance, but not too early
• Goal: Prefetch the next predicted function
– Able to hide LLC hit latency
– Typically sufficient due to low instruction miss rate in LLC
• Our Design: Keep track of a speculative call stack
– Predictor Stack
24
Predictor Stack
• Maintains the call stack as predicted by the prefetcher
• Helps prefetch the next function predicted to be called
[Diagram: example in which f calls h and i, showing the Call Stack (calls and returns), the Predictor Stack, and the function prefetched at each step]
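The idea of a speculative call stack that stays one function ahead can be sketched as follows. This is an interpretation of the slide, not the published design: the `PredictorStack` class, its frame layout, and the assumption that predicted callees are visited in recorded order are all hypothetical, and recovery from mispredictions is left out.

```python
from typing import Optional

class PredictorStack:
    """Speculative call stack that walks predicted callees in order (hypothetical model)."""

    def __init__(self, predicted_callees):
        self.callees = predicted_callees   # function -> ordered list of predicted callees
        self.stack = []                    # frames: [function, index of next callee]

    def start(self, root: str) -> None:
        """Begin predicting from the function that has just been called."""
        self.stack = [[root, 0]]

    def next_prediction(self) -> Optional[str]:
        """Advance the speculative stack and return the next function to prefetch."""
        while self.stack:
            func, idx = self.stack[-1]
            kids = self.callees.get(func, [])
            if idx < len(kids):
                self.stack[-1][1] = idx + 1
                self.stack.append([kids[idx], 0])   # speculative call
                return kids[idx]
            self.stack.pop()                        # speculative return
        return None                                 # nothing left to predict

# The slide's example: f calls h and then i.
ps = PredictorStack({"f": ["h", "i"]})
ps.start("f")
order = [ps.next_prediction() for _ in range(3)]
```

In this model the first prediction after f is called is h, the next is i, and after that nothing remains to prefetch, so `order` ends with `None`.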
25
Architecture
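[Diagram: the Function Call Context (from the Call Stack) and the Event-ID are combined (X) into the Event Signature; the Context Table holds callees (ci) with bit vectors (bv); the Function Table has entries labeled b1, b2, d, and EA; the predicted callees and addresses feed the Predictor Stack and the Prefetch Queue]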
26
Methodology
• Instrumented open source browser: Chromium
– It uses the V8 JS engine, shared with Google Chrome
• Browsing sessions of popular websites were studied
– Their instruction traces were simulated with Sniper Sim
• Our focus was on JS code execution, which was simulated
27
Architectural Details
• Modeled after Samsung Exynos 5250
• Core: 4-wide OoO, 1.66 GHz
• L1-(I,D) Cache: 32 KB, 2-way
• L2 Cache: 2 MB, 16-way
• Energy Modeling: Vdd = 1.2 V, 45 nm
28
Related Work
• We compare EFetch with the following designs:
– L1I-64KB: Hardware overhead of EFetch provisioned towards extra L1-I cache capacity (64 KB)
– N2L: Next-2 line prefetcher
– CGP: Call Graph Prefetching (Annavaram et al., HPCA '01)
– PIF: Proactive Instruction Fetch (Ferdman et al., MICRO '11)
– RDIP: Return address stack Directed Instruction Prefetching (Kolli et al., MICRO '13)
29
Prefetcher Efficacy
[Figure: Prefetch Hits, Misses, and Erroneous Prefetches as % of (L1-I misses + L1-I prefetch hits), y-axis 0 to 250, for N2L, CGP, PIF, RDIP, and EFetch on amazon, bing, cnn, fb, gmaps, gdocs, and pixlr]
30
Performance
[Figure: % performance improvement (y-axis, 0 to 40) of L1I-64KB, N2L, CGP, PIF, RDIP, and EFetch on amazon, bing, cnn, fb, gmaps, gdocs, pixlr, and their geometric mean]
31
Energy Consumption
Design          CGP   PIF   RDIP   EFetch
Overhead (KB)   32    204   63     39
[Figure: relative energy consumed (y-axis, 0 to 1.2) by N2L, CGP, PIF, RDIP, and EFetch on amazon, bing, cnn, fb, gmaps, gdocs, pixlr, and their arithmetic mean]
• Prefetching hardware structures consume little energy
– Ranging from 0.01% of the total energy consumed for EFetch to 1.06% for PIF
• Erroneous prefetches consume a significant fraction of energy
32
Energy, Performance, Area
[Figure: Energy (y-axis, 0.7 to 1.05) vs. Performance (x-axis, 1.15 to 1.4) for EFetch, PIF, CGP, RDIP, and N2L]
33
Conclusion
• Web 2.0 places greater demands on client-side computing
• I-Cache performance is poor for web client-side script execution
• EFetch exploits the event-driven nature of web client-side script execution
• It achieves a 29% performance improvement over no prefetching
35
Performance Potential
[Figure: % performance improvement (y-axis, 0 to 80) for amazon, bing, cnn, fb, gmaps, gdocs, pixlr, and their geometric mean]
• Perfect I-Cache: 53% speedup