Optcarrot: a pure-Ruby NES emulator
TRANSCRIPT
• A NES Emulator written in Ruby
Demo
2
• To drive “Ruby3x3”
– Matz said “Ruby 3 will be 3 times faster than Ruby 2.0”
– Optcarrot is a CPU-intensive, real-life benchmark
• Currently runs at 20 fps in Ruby 2.0 → 60 fps in Ruby 3.0!
• A carrot to let horses (Ruby committers) optimize Ruby
• To challenge Ruby’s limit
– NES video resolution: 256 x 240 pixels / 60 fps
– A loop that merely runs once per pixel already takes 0.2 sec.:
    (256*240*60).times do |i|
      ary[0] = 0
    end
– We need to do all other tasks in the remaining 0.8 sec.? Impossible?
3
• Famicom programming with Ruby
(takkaw, 2007)
– Presentation NES ROM by Ruby
• MRI's incremental GC
(authornari, 2008)
– Mario-like game "Nario" is used
to demonstrate the real-time GC
• Burn (remore, 2014)
– A framework to create NES ROM
in Ruby
4
• NES architecture in three minutes
• How I achieved 20 fps
• Ruby interpreters’ benchmark
• Towards 60 fps
• Speaker's award & Conclusion
5
• The details of NES architecture
– In short: “See http://wiki.nesdev.com/ !”
• How to find the bottleneck
– In short: “Use stackprof!”
6
• I’ll talk about these topics at “Kawasaki Ruby Kaigi 01” (川崎Ruby会議01, 2016/08/20)
• NES Architecture in three minutes
• How I achieved 20 fps
• Ruby interpreters’ benchmark
• Towards 60 fps
• Speaker's award & Conclusion
7
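[Diagram: NES block diagram. The NES contains the CPU with RAM (2 kB) and the GPU with VRAM (2 kB); the cartridge contains the Program ROM and the Bitmap ROM. Arrow labels: control, read, read/write, render, interrupt.]
To be precise: the GPU is called the “PPU” (Picture Processing Unit) in the NES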
8
Execution time ratio: GPU 80%, CPU 10%, others 10%
• Why does GPU emulation take so much time?
– The GPU runs at a higher clock speed than the CPU
• GPU: 5.3 MHz
• CPU: 1.8 MHz
– The GPU does many complex tasks
• Background rendering
• Sprite rendering
• Scrolling
• Conflict detection
• Interrupts
9
• Per-pixel tasks (i.e. 256 x 240 x 60 = 3.7M times per second)
1. Identify what bitmap is shown here
2. Read attribute data (color, flip flag, z-index)
3. Read bitmap data from the ROM
4. Assemble them into video signal
[Diagram: for the target pixel, the GPU performs steps 1 to 4 using the background map and attribute map in VRAM and the bitmap ROM in the cartridge.]
To be precise: these tasks are actually done per eight pixels
10
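As a rough illustration of the four per-pixel steps listed above, here is a minimal Ruby sketch of a background lookup. It is not Optcarrot's code: the method name `background_pixel` and its parameters are hypothetical, and the data layout (32 x 30 tiles of 8 x 8 pixels, one attribute byte per 4 x 4 tiles, 16 bytes per tile in the bitmap ROM) is a simplified reading of the NES documentation. Sprites, scrolling, and the real color palette are ignored.

```ruby
# Simplified, hypothetical per-pixel background lookup (not Optcarrot's code).
# name_table:  960 tile indices (32 x 30 tiles), the "background map"
# attr_table:  64 attribute bytes, each covering 4 x 4 tiles
# pattern_rom: bitmap ROM, 16 bytes per tile (two 8-byte bit planes)
def background_pixel(x, y, name_table, attr_table, pattern_rom)
  tile_x = x / 8
  tile_y = y / 8

  # 1. Identify which bitmap (tile) is shown at this position.
  tile_index = name_table[tile_y * 32 + tile_x]

  # 2. Read attribute data: a 2-bit palette number per 2 x 2 tiles.
  attr = attr_table[(tile_y / 4) * 8 + (tile_x / 4)]
  shift = ((tile_y & 2) << 1) | (tile_x & 2)
  palette = (attr >> shift) & 3

  # 3. Read bitmap data from the ROM: one bit from each of the two planes.
  row = y % 8
  lo = pattern_rom[tile_index * 16 + row]
  hi = pattern_rom[tile_index * 16 + row + 8]
  bit = 7 - (x % 8)
  color = (((hi >> bit) & 1) << 1) | ((lo >> bit) & 1)

  # 4. Assemble them into one value for the video signal.
  palette * 4 + color
end
```

Calling something like this for every pixel of every frame means roughly 3.7M calls per second, which is the cost the later slides try to avoid.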
• Terribly complex
http://wiki.nesdev.com/w/index.php/File:Ntsc_timing.png
11
• NES Architecture in three minutes
• How I achieved 20 fps
– How to emulate CPU-GPU parallelism
– How to optimize GPU emulation
• Ruby interpreters’ benchmark
• Towards 60 fps
• Speaker's award & Conclusion
12
• Naïve approach: emulate CPU & GPU per clock
1. Run the CPU for one clock
2. Run the GPU for three clocks
3. Repeat 1 and 2
– Simple and accurate
– Very slow (~ 3 fps) because of too many method calls
[Diagram: along the clock axis, both the CPU and the GPU advance through many individual “step” calls.]
13
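The naïve per-clock loop described above can be sketched as follows. This is only an illustration under stated assumptions: `Chip` and `run_naive` are hypothetical names, and each `step` just counts a clock instead of emulating anything.

```ruby
# Hypothetical stand-in for an emulated chip: each step only counts one clock.
class Chip
  attr_reader :clock

  def initialize
    @clock = 0
  end

  def step
    @clock += 1
  end
end

# Naive scheduling: one CPU clock, then three GPU clocks, repeated.
def run_naive(cpu, gpu, cpu_clocks)
  cpu_clocks.times do
    cpu.step
    3.times { gpu.step }
  end
end

cpu = Chip.new
gpu = Chip.new
run_naive(cpu, gpu, 10)
```

Every emulated clock costs at least one Ruby method call here, which is the overhead the slide blames for the ~3 fps result.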
• “Catch-up” method: emulate CPU & GPU per control
1. Run the CPU until it tries to control the GPU
2. Run the GPU until it catches up with the CPU
3. Repeat 1 and 2
– Accurate and fast (~ 10 fps)
[Diagram: along the clock axis, the CPU advances in “run” segments; each time the CPU attempts to control the GPU, the GPU performs a “catchup”.]
14
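A minimal sketch of the catch-up idea, assuming a toy CPU whose “program” is just a list of `[operation, cycles]` pairs. The class names, the `:gpu_write` operation, and the fixed ratio of three GPU clocks per CPU clock are assumptions made for this illustration, not Optcarrot's actual design.

```ruby
# Hypothetical GPU stand-in: it only remembers how far it has been emulated.
class Gpu
  attr_reader :clock, :catchups, :writes

  def initialize
    @clock = 0
    @catchups = 0
    @writes = []
  end

  # Emulate the GPU in one batch up to the given GPU clock.
  def catchup(target_clock)
    @catchups += 1
    @clock += 1 while @clock < target_clock # real rendering work would go here
  end

  def write_register(value)
    @writes << [@clock, value]
  end
end

# Hypothetical CPU stand-in that runs freely until it touches the GPU.
class Cpu
  GPU_CLOCKS_PER_CPU_CLOCK = 3

  def initialize(gpu)
    @gpu = gpu
    @clock = 0
  end

  def run(program)
    program.each do |op, cycles|
      @clock += cycles
      next unless op == :gpu_write

      # The CPU attempts to control the GPU: let the GPU catch up first.
      @gpu.catchup(@clock * GPU_CLOCKS_PER_CPU_CLOCK)
      @gpu.write_register(:some_value)
    end
    @gpu.catchup(@clock * GPU_CLOCKS_PER_CPU_CLOCK) # final catch-up
  end
end

gpu = Gpu.new
Cpu.new(gpu).run([[:nop, 2], [:nop, 3], [:gpu_write, 4], [:nop, 2]])
```

The GPU is entered only twice for eleven CPU clocks, yet the register write still lands at the GPU clock it would have seen under per-clock emulation.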
• Naïve approach: per-pixel emulation
– Just like the actual hardware
[Diagram: the GPU performs steps 1 to 4 using the background map and attribute map in VRAM and the bitmap ROM in the cartridge.]
This calculation is done in every iteration → Slow!
15
• Pre-render the screen and update it on demand
[Diagram: the GPU keeps a pre-rendered screen buffer built from the background map and attribute map in VRAM and the bitmap ROM in the cartridge.]
– When VRAM is modified by the CPU, only the invalidated pixels are updated
– The screen buffer is transported to the TV once per frame
This explanation is exaggerated! Actually, the GPU emulation loop is not removed completely.
16
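The pre-render-and-invalidate idea can be sketched with a small cache, shown below. This is a deliberately simplified illustration (the slide itself notes that the real GPU loop is not removed completely); `CachedBackground` and its methods are hypothetical names, and `render_tile` is a placeholder for the expensive bitmap lookup.

```ruby
# Hypothetical tile cache: re-render a tile only after its VRAM entry changed.
class CachedBackground
  attr_reader :render_count

  def initialize(tile_count)
    @vram = Array.new(tile_count, 0)
    @buffer = Array.new(tile_count)
    @dirty = Array.new(tile_count, true) # nothing has been rendered yet
    @render_count = 0
  end

  # Called when the emulated CPU writes to VRAM.
  def write_vram(index, value)
    return if @vram[index] == value

    @vram[index] = value
    @dirty[index] = true
  end

  # Called once per frame: only invalidated tiles are rendered again.
  def frame
    @dirty.each_index do |i|
      next unless @dirty[i]

      @buffer[i] = render_tile(@vram[i])
      @dirty[i] = false
    end
    @buffer
  end

  private

  # Placeholder for the expensive part (8 x 8 = 64 pixels per tile).
  def render_tile(tile_index)
    @render_count += 1
    Array.new(64, tile_index)
  end
end

bg = CachedBackground.new(4)
bg.frame             # first frame renders every tile
bg.frame             # nothing changed: no tile is rendered
bg.write_vram(2, 7)
screen = bg.frame    # only tile 2 is rendered again
```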
• Intel® Core™ i7-4500U @ 2.40 GHz
• Ubuntu 16.04
17
• NES Architecture in three minutes
• How I achieved 20 fps
• Ruby interpreters’ benchmark
• Towards 60 fps
• Speaker's award & Conclusion
18
• Is not so big: <5000 lines of code
– cf. redmine: >30000 LOC
• Requires no library (in no-GUI mode)
– It works on miniruby
– ruby-ffi is used for GUI (SDL2)
• Uses only basic Ruby features
– It works on ruby 1.8 / mruby / topaz / opal (with shims and/or systematic modification of the source code)
19
[Bar chart: frames per second (fps) by Ruby implementation, axis 0.0 to 40.0]
trunk        28.7
ruby23       28.1
ruby22       25.5
ruby21       26.6
ruby20       25.0
ruby193      21.4
ruby187      5.83
omrpreview   21.9
jruby9k      39.2
jruby17      25.0
rubinius     4.10
mruby        7.48
topaz        27.0
opal         0.0287
20
MRI has been improved (1.8 → 1.9 → 2.0 → 2.3)
OMR preview isn’t fast? (MRI 2.2 w/ JIT)
JRuby9k is the fastest
ruby 2.0 achieves >20 fps (important for Ruby3x3)
Optcarrot works on subset Ruby impls.
• JRuby 9k is the fastest: “deoptimization” looks like a promising approach
– At first, optimized bytecode is generated that ignores rare/pathological cases
– When needed, it is discarded and naïve bytecode is regenerated
– BTW: JRuby’s boot time is too long
• OMR is not so fast?
– JIT has no advantage?
• Method calls and built-in methods may still be the bottleneck
– OMR seems not to support opt_case_dispatch yet
• i.e., a case statement is not optimized well?
21
• NES Architecture in three minutes
• How I achieved 20 fps
• Ruby interpreters’ benchmark
• Towards 60 fps
• Speaker's award & Conclusion
22
• We have kept the code reasonably clean so far
• Now, we use any means to achieve the speed
• CAUTION: Casual Ruby programmers MUST NOT
use the following ProTips™
– This is an experiment to study how to improve Ruby
implementation
23
• Method call is slow
– Replace the call with the body of the method
Before:
    while catchup?
      inc_addr
    end
After:
    while catchup?
      @addr += 1
    end
28 fps → 40 fps
24
• Instance variable access is slow
– Replace it with a local variable
– Note: this only works if the variable is not used outside this method
Before:
    while catchup?
      @addr += 1
    end
After:
    begin
      addr = @addr
      while catchup?
        addr += 1
      end
    ensure
      @addr = addr
    end
40 fps → 47 fps
25
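To check that such rewrites keep the behavior unchanged, the three loop variants (method call, inlined body, localized instance variable) can be put side by side. This is a toy sketch: the `Counter` class, its `@ticks` bookkeeping, and the `run_*` method names are made up for illustration, and `catchup?` deliberately never touches `@addr`, which is the condition the note above requires.

```ruby
# Toy class with three equivalent loops (hypothetical, for illustration only).
class Counter
  attr_reader :addr

  def initialize(limit)
    @addr = 0
    @limit = limit
    @ticks = 0
  end

  # Original form: one extra method call per iteration.
  def run_with_call
    while catchup?
      inc_addr
    end
  end

  # Method inlining: the call is replaced with the method body.
  def run_inlined
    while catchup?
      @addr += 1
    end
  end

  # Ivar localization: work on a local and write it back at the end.
  def run_localized
    begin
      addr = @addr
      while catchup?
        addr += 1
      end
    ensure
      @addr = addr
    end
  end

  private

  def inc_addr
    @addr += 1
  end

  # Must not read or write @addr, otherwise run_localized would be wrong.
  def catchup?
    (@ticks += 1) <= @limit
  end
end

results = %i[run_with_call run_inlined run_localized].map do |variant|
  counter = Counter.new(1000)
  counter.send(variant)
  counter.addr
end
```

All three variants end with the same `addr`; only their speed differs, and how much depends on the Ruby implementation.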
• Batch multiple frequent actions across some clocks
Before:
    while catchup?
      case @clock
      when 1 then do_A
      when 2 then do_B
      when 3 then do_C
      ...
      end
      @clock += 1
    end
After:
    while catchup?
      if can_be_fast?
        # fast-path
        do_A
        do_B
        do_C
        @clock += 3
      else
        case @clock
        when 1 then do_A
        when 2 then do_B
        when 3 then do_C
        ...
        end
        @clock += 1
      end
    end
47 fps → 63 fps
26
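A runnable toy version of the fast path is sketched below. The `Stepper` class is hypothetical: it assumes a repeating cycle of three actions, and `can_be_fast?` is allowed only at the start of a cycle when at least three clocks remain, so the batched path produces exactly the same sequence of actions as the clock-by-clock path.

```ruby
# Toy illustration of batching three per-clock actions (hypothetical names).
class Stepper
  attr_reader :log, :clock

  def initialize(target, fast: false)
    @clock = 0
    @target = target
    @fast = fast
    @log = []
  end

  def run
    while catchup?
      if @fast && can_be_fast?
        # fast path: three clocks' worth of work without dispatching each one
        do_a
        do_b
        do_c
        @clock += 3
      else
        case @clock % 3
        when 0 then do_a
        when 1 then do_b
        when 2 then do_c
        end
        @clock += 1
      end
    end
    self
  end

  private

  def catchup?
    @clock < @target
  end

  # Safe only at a cycle boundary with three or more clocks still to run.
  def can_be_fast?
    (@clock % 3).zero? && @clock + 3 <= @target
  end

  def do_a
    @log << :a
  end

  def do_b
    @log << :b
  end

  def do_c
    @log << :c
  end
end

slow = Stepper.new(10).run
fast = Stepper.new(10, fast: true).run
```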
[Bar chart: frames per second (fps) as each optimization is added, axis 0.0 to 80.0]
base                29.4
method inlining     40.3  (ProTip™ 1)
ivar localization   46.6  (ProTip™ 2)
fastpath            62.7  (ProTip™ 3)
misc                68.8
CPU misc            83.2
27
• Used Regexp to systematically rewrite the code
– instead of hand-rewriting
• Used Welch’s t-test to confirm each optimization
    src = File.read(__FILE__)
    src.gsub!(/.../) { ... } # method inlining
    src.gsub!(/.../) { ... } # ivar localization
    eval(src)
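The slide elides the actual patterns, so the following is only a guess at the general shape of such a rewrite: a tiny source string in which a call to `inc_addr` is textually replaced with its body before the code is evaluated. The `Scanner` class and the regular expression are invented for this sketch and are not taken from Optcarrot.

```ruby
# Source text of a small class (hypothetical example, kept as a string).
src = <<~'RUBY'
  class Scanner
    attr_reader :addr

    def initialize
      @addr = 0
      @left = 5
    end

    def catchup?
      (@left -= 1) >= 0
    end

    def inc_addr; @addr += 1; end

    def run
      while catchup?
        inc_addr
      end
    end
  end
RUBY

# Method inlining by text substitution: a line consisting only of a call to
# inc_addr is replaced with the method body, keeping its indentation.
optimized = src.gsub(/^([ \t]*)inc_addr$/) { "#{$1}@addr += 1" }

# Evaluating generated source is acceptable for an experiment like this one,
# but is not something to do with untrusted input.
eval(optimized)

scanner = Scanner.new
scanner.run
```

Generating the optimized source from the readable one keeps the hand-written code clean, which matches the slide's point about not hand-rewriting.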
28
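For reference, Welch's t-test compares two samples (for example, fps measurements before and after an optimization) without assuming equal variances. The sketch below computes only the t statistic and the Welch-Satterthwaite degrees of freedom; the helper names are hypothetical, and turning the result into a p-value needs a t-distribution table or a statistics library, which is not shown.

```ruby
def mean(xs)
  xs.sum.to_f / xs.size
end

# Unbiased sample variance (divides by n - 1).
def sample_variance(xs)
  m = mean(xs)
  xs.sum { |x| (x - m)**2 } / (xs.size - 1)
end

# Returns [t, degrees_of_freedom] for Welch's unequal-variance t-test.
def welch_t(a, b)
  va = sample_variance(a) / a.size
  vb = sample_variance(b) / b.size
  t = (mean(a) - mean(b)) / Math.sqrt(va + vb)
  df = (va + vb)**2 / (va**2 / (a.size - 1) + vb**2 / (b.size - 1))
  [t, df]
end

# Arbitrary example numbers, not measurements from the talk.
t, df = welch_t([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
```

A large absolute value of `t` for the given degrees of freedom suggests that the difference between the two runs is not just measurement noise.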
29
[Bar chart: frames per second (fps) by Ruby implementation, default mode vs. optimized mode, axis 0.0 to 90.0]
              default   optimized
trunk         28.6      84.0
ruby23        28.0      82.9
ruby22        25.2      78.2
ruby21        26.9      79.6
ruby20        26.1      68.1
ruby193       21.4      64.0
ruby187       5.87      1.46
omrpreview    22.8      69.0
jruby9k       39.3      (see note)
jruby17       25.3      (see note)
rubinius      3.97      6.13
mruby         7.02      2.43
topaz         29.3      0.754
opal          0.0285    0.0501
Note: the two JRuby entries have a single optimized-mode value between them (2.12 fps); for the other, the generated program is too large to fit within the JVM 64k bytecode limit.
30
• NES Architecture in three minutes
• How I achieved 20 fps
• Ruby interpreters’ benchmark
• Towards 60 fps
• Speaker's award & Conclusion
31
• The first person who
improved MRI performance
by using Optcarrot
– Instance variable access has been improved by about 10% [Bug #12274]
• Optcarrot has already
started to improve Ruby!
32
• Optcarrot, a pure-Ruby NES emulator
– Non-trivial benchmark for Ruby implementations
• Wide-range Ruby implementation benchmark
– AFAIK, this is the first real-life benchmark to compare
MRI / JRuby / Rubinius / mruby / topaz / opal
• ProTips™ for boosting a Ruby program
– Need to improve method calls and instance variables
instead of JIT?
• More details?
33
Kawasaki Ruby Kaigi 01 (川崎Ruby会議01)
(2016/08/20)
34
¥2,680 + tax ¥5,440 + tax