performance and predictability
DESCRIPTION
These days fast code needs to operate in harmony with its environment. At the deepest level this means working well with hardware: RAM, disks and SSDs. A unifying theme is treating memory access patterns in a uniform and predictable way that is sympathetic to the underlying hardware. For example, writing to and reading from RAM and hard disks can be significantly sped up by operating sequentially on the device, rather than randomly accessing the data. In this talk we’ll cover why access patterns are important, what kind of speed gain you can get and how you can write simple high-level code which works well with these kinds of patterns.
TRANSCRIPT
![Page 1: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/1.jpg)
PERFORMANCE AND PREDICTABILITY
Richard [email protected]
![Page 2: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/2.jpg)
![Page 3: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/3.jpg)
Why care about low level rubbish?
Branch Prediction
Memory Access
Storage
Conclusions
![Page 4: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/4.jpg)
![Page 5: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/5.jpg)
Performance Discussion
Product Solutions
“Just use our library/tool/framework, and everything is web-scale!”
Architecture Advocacy
“Always design your software like this.”
Methodology & Fundamentals
“Here are some principles and knowledge, use your brain”
![Page 6: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/6.jpg)
Do you care?
DON'T look at access patterns first
Many problems not Pattern Related
Networking
Database or External Service
Minimising I/O
Garbage Collection
Insufficient Parallelism
Harvest the low hanging fruit first
![Page 7: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/7.jpg)
Low Hanging is a cost-benefit analysis
![Page 8: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/8.jpg)
So when does it matter?
![Page 9: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/9.jpg)
Informed Design & Architecture *
* this is not a call for premature optimisation
![Page 10: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/10.jpg)
![Page 11: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/11.jpg)
That 10% that actually matters
![Page 12: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/12.jpg)
Performance of Predictable Accesses
![Page 13: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/13.jpg)
Why care about low level rubbish?
Branch Prediction
Memory Access
Storage
Conclusions
![Page 14: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/14.jpg)
What 4 things do CPUs actually do?
![Page 15: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/15.jpg)
Fetch, Decode, Execute, Writeback
![Page 16: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/16.jpg)
Pipelining speeds this up
![Page 17: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/17.jpg)
Pipelined
![Page 18: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/18.jpg)
What about branches?
public static int simple(int x, int y, int z) {
    int ret;
    if (x > 5) {
        ret = y + z;
    } else {
        ret = y;
    }
    return ret;
}
![Page 19: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/19.jpg)
Branches cause stalls, stalls kill performance
![Page 20: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/20.jpg)
Super-pipelined & Superscalar
![Page 21: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/21.jpg)
Can we eliminate branches?
![Page 22: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/22.jpg)
Naive Compilation
if (x > 5) {
    ret = z;
} else {
    ret = y;
}
    cmp x, 5
    jle L1        ; x <= 5: jump to the else branch
    mov z, ret
    jmp L2        ; skip over the else branch
L1: mov y, ret
L2: ...
![Page 23: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/23.jpg)
Conditional Branches
if (x > 5) {
    ret = z;
} else {
    ret = y;
}
cmp x, 5
mov z, ret
cmovle y, ret
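As a rough illustration of what a conditional move buys, the same selection can be written in Java with no jump in the source at all. This is only a sketch: the `BranchFreeSelect` class and its mask trick are illustrative assumptions rather than anything from the slides, and a JIT compiler may well emit a conditional move for the plain if/else version on its own.

```java
// Hypothetical helper illustrating branch-free selection; not from the slides.
final class BranchFreeSelect {

    // Reference version: the ordinary if/else from the slide.
    static int selectWithBranch(int x, int y, int z) {
        if (x > 5) {
            return z;
        } else {
            return y;
        }
    }

    // Branch-free version: turn the comparison into an all-ones or all-zeros mask.
    static int selectWithMask(int x, int y, int z) {
        // 5L - x is negative exactly when x > 5; long arithmetic avoids int overflow.
        int mask = (int) ((5L - x) >> 63);   // -1 if x > 5, otherwise 0
        return (z & mask) | (y & ~mask);
    }

    public static void main(String[] args) {
        int[] samples = {Integer.MIN_VALUE, -1, 5, 6, Integer.MAX_VALUE};
        for (int x : samples) {
            System.out.println(x + " -> branch: " + selectWithBranch(x, 10, 20)
                    + ", mask: " + selectWithMask(x, 10, 20));
        }
    }
}
```

Whether the mask form is actually faster depends on how predictable the branch was in the first place, so it is worth measuring before adopting it.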
![Page 24: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/24.jpg)
Strategy: predict branches and speculatively execute
![Page 25: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/25.jpg)
Static Prediction
A forward branch defaults to not taken
A backward branch defaults to taken
![Page 26: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/26.jpg)
![Page 27: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/27.jpg)
Static Hints (Pentium 4 or later)
__emit 0x3E defaults to taken
__emit 0x2E defaults to not taken
don’t use them, flip the branch
![Page 28: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/28.jpg)
Dynamic prediction: record history and predict future
![Page 29: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/29.jpg)
Branch Target Buffer (BTB)
a log of the history of each branch
also stores the address
it's finite!
![Page 30: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/30.jpg)
Local
record per conditional branch histories
Global
record shared history of conditional jumps
![Page 31: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/31.jpg)
Loop
specialised predictor when there’s a loop (jumping in a cycle n times)
Function
specialised buffer for predicting nearby function returns
Two level Adaptive Predictor
accounts for patterns of up to 3 if statements
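To make "record history and predict future" concrete, here is a toy simulation of a two-bit saturating counter, a textbook building block for dynamic predictors. The slides do not name this scheme, so treat the `TwoBitPredictor` class, its starting state and the branch patterns as assumptions chosen for illustration; real predictors are considerably more elaborate.

```java
// Toy model of a two-bit saturating counter predictor (an assumed, textbook scheme).
final class TwoBitPredictor {
    // States 0 and 1 predict "not taken"; states 2 and 3 predict "taken".
    private int state = 0;

    boolean predict() {
        return state >= 2;
    }

    // Move one step towards the observed outcome, saturating at 0 and 3.
    void update(boolean taken) {
        if (taken) {
            if (state < 3) state++;
        } else {
            if (state > 0) state--;
        }
    }

    // Replays a sequence of branch outcomes and counts wrong predictions.
    static int countMispredictions(boolean[] outcomes) {
        TwoBitPredictor predictor = new TwoBitPredictor();
        int misses = 0;
        for (boolean taken : outcomes) {
            if (predictor.predict() != taken) misses++;
            predictor.update(taken);
        }
        return misses;
    }

    // A loop back-edge: taken takenPerRound times, then not taken once, repeated.
    static boolean[] loopPattern(int takenPerRound, int rounds) {
        boolean[] outcomes = new boolean[(takenPerRound + 1) * rounds];
        for (int i = 0; i < outcomes.length; i++) {
            outcomes[i] = (i % (takenPerRound + 1)) != takenPerRound;
        }
        return outcomes;
    }

    // A branch that flips on every execution.
    static boolean[] alternatingPattern(int length) {
        boolean[] outcomes = new boolean[length];
        for (int i = 0; i < length; i++) {
            outcomes[i] = (i % 2) == 0;
        }
        return outcomes;
    }

    public static void main(String[] args) {
        System.out.println("loop pattern misses: "
                + countMispredictions(loopPattern(9, 100)));
        System.out.println("alternating pattern misses: "
                + countMispredictions(alternatingPattern(1000)));
    }
}
```

In this model the regular loop pattern is mispredicted far less often than the alternating one, which is the kind of irregular pattern that needs a history-based scheme such as the two level adaptive predictor.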
![Page 32: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/32.jpg)
Measurement and Monitoring
Use Performance Event Counters (Model Specific Registers)
Can be configured to store branch prediction information
Profilers & Tooling: perf (Linux), VTune, AMD CodeAnalyst, Visual Studio, Oracle Performance Studio
![Page 33: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/33.jpg)
Approach: Loop Unrolling
for (int x = 0; x < 100; x += 5)
{
    foo(x);
    foo(x+1);
    foo(x+2);
    foo(x+3);
    foo(x+4);
}
![Page 34: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/34.jpg)
Approach: minimise branches in loops
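One way to read this advice is to hoist a decision that does not change between iterations out of the loop. The sketch below assumes a hypothetical `doubled` flag and `HoistedBranch` class purely for illustration; a JIT compiler may perform the same transformation itself, so measure before rewriting code by hand.

```java
// Hypothetical example of moving a loop-invariant branch out of a loop.
final class HoistedBranch {

    // Branch inside the loop: the flag is tested on every iteration.
    static long sumBranchInside(int[] values, boolean doubled) {
        long total = 0;
        for (int v : values) {
            if (doubled) {
                total += 2L * v;
            } else {
                total += v;
            }
        }
        return total;
    }

    // Branch hoisted: decide once, then run a loop body with no conditional in it.
    static long sumBranchHoisted(int[] values, boolean doubled) {
        long total = 0;
        if (doubled) {
            for (int v : values) {
                total += 2L * v;
            }
        } else {
            for (int v : values) {
                total += v;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        int[] values = {1, 2, 3, 4};
        System.out.println(sumBranchInside(values, true));
        System.out.println(sumBranchHoisted(values, true));
    }
}
```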
![Page 35: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/35.jpg)
Approach: Separation of Concerns
Little branching on a latency critical path or thread
Heavily branchy administrative code on a different path
Helps adjust for the inflight branch prediction not tracking many targets
Only applicable with low context switching rates
Also easier to understand the code
![Page 36: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/36.jpg)
One more thing ...
![Page 37: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/37.jpg)
Division/Modulus
index = (index + 1) % SIZE;
vs
index = index + 1;
if (index == SIZE) {
    index = 0;
}
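The two forms above can be checked against each other with a small sketch. The `RingIndex` class and the third, mask-based variant are assumptions added for illustration; the mask version is only valid when `SIZE` is a power of two.

```java
// Three ways to advance a ring-buffer index; names are illustrative only.
final class RingIndex {
    static final int SIZE = 8; // assumed to be a power of two for the mask variant

    // Uses the modulus operator, which compiles to a division.
    static int nextWithModulus(int index) {
        return (index + 1) % SIZE;
    }

    // Uses a compare and reset, as on the slide; the branch is highly predictable.
    static int nextWithCompare(int index) {
        index = index + 1;
        if (index == SIZE) {
            index = 0;
        }
        return index;
    }

    // Uses a bit mask; correct only because SIZE is a power of two.
    static int nextWithMask(int index) {
        return (index + 1) & (SIZE - 1);
    }

    public static void main(String[] args) {
        for (int i = 0; i < SIZE; i++) {
            System.out.println(i + " -> " + nextWithModulus(i) + " "
                    + nextWithCompare(i) + " " + nextWithMask(i));
        }
    }
}
```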
![Page 38: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/38.jpg)
/ or % are 92 Cycles on Core i*
![Page 39: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/39.jpg)
Summary
CPUs are Super-pipelined and Superscalar
Branches cause stalls
Branches can be removed, minimised or simplified
Frequently get easier to read code as well
![Page 40: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/40.jpg)
Why care about low level rubbish?
Branch Prediction
Memory Access
Storage
Conclusions
![Page 41: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/41.jpg)
The Problem
[Diagram: one component labelled "Very Fast", the other "Relatively Slow"]
![Page 42: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/42.jpg)
The Solution: CPU Cache
Core Demands Data, looks at its cache
If present (a "hit") then data returned to register
If absent (a "miss") then data looked up from memory and stored in the cache
Fast memory is expensive, a small amount is affordable
![Page 43: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/43.jpg)
Multilevel Cache: Intel Sandy Bridge
[Diagram: each physical core (Physical Core 0 .... Physical Core N; with HT, 2 logical cores each) has its own Level 1 Data Cache, Level 1 Instruction Cache and Level 2 Cache; all cores share the Level 3 Cache]
![Page 44: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/44.jpg)
How bad is a miss?
| Location | Latency in clock cycles |
| --- | --- |
| Register | 0 |
| L1 Cache | 3 |
| L2 Cache | 9 |
| L3 Cache | 21 |
| Main Memory | 150-400 |
![Page 45: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/45.jpg)
Cache Lines
Data transferred in cache lines
Fixed size block of memory
Usually 64 bytes in current x86 CPUs
Between 32 and 256 bytes
Purely hardware consideration
![Page 46: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/46.jpg)
Prefetching
Prefetch = Eagerly load data
Adjacent Cache Line Prefetch
Data Cache Unit (Streaming) Prefetch
Problem: CPU Prediction isn't perfect
Solution: Arrange Data so accesses are predictable
![Page 47: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/47.jpg)
Sequential Locality
Referring to data that is arranged linearly in memory
Spatial Locality
Referring to data that is close together in memory
Temporal Locality
Repeatedly referring to same data in a short time span
![Page 48: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/48.jpg)
General Principles
Use smaller data types (-XX:+UseCompressedOops)
Avoid 'big holes' in your data
Make accesses as linear as possible
![Page 49: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/49.jpg)
Primitive Arrays
// Sequential Access = Predictable
for (int i = 0; i < someArray.length; i++)
    someArray[i]++;
![Page 50: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/50.jpg)
Primitive Arrays - Skipping Elements
// Holes Hurt
for (int i = 0; i < someArray.length; i += SKIP)
    someArray[i]++;
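A back-of-the-envelope way to see why holes hurt is to count how many cache lines a loop touches for a given skip. The sketch below assumes 64-byte cache lines, 4-byte `int` elements and an array that starts on a line boundary; the `CacheLineCount` class is hypothetical and models nothing beyond that arithmetic.

```java
// Counts accesses and distinct cache lines for a strided walk over an int array.
// Assumes 64-byte lines and an array that begins on a cache-line boundary.
final class CacheLineCount {
    static final int CACHE_LINE_BYTES = 64;      // assumption: typical x86 line size
    static final int INT_BYTES = Integer.BYTES;  // 4 bytes per element

    // Number of elements the loop actually visits.
    static int accesses(int length, int skip) {
        return (length + skip - 1) / skip;
    }

    // Number of distinct cache lines those visits fall into.
    static int cacheLinesTouched(int length, int skip) {
        int lines = 0;
        long lastLine = -1;
        for (int i = 0; i < length; i += skip) {
            long line = ((long) i * INT_BYTES) / CACHE_LINE_BYTES;
            if (line != lastLine) {
                lines++;
                lastLine = line;
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        int length = 1024;
        for (int skip : new int[] {1, 4, 16, 32}) {
            System.out.println("skip " + skip + ": " + accesses(length, skip)
                    + " accesses, " + cacheLinesTouched(length, skip) + " cache lines");
        }
    }
}
```

Under these assumptions a skip of 16 does one sixteenth of the work of a skip of 1 but still pulls in every cache line of the array, so the memory traffic does not shrink with the work.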
![Page 51: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/51.jpg)
Primitive Arrays - Skipping Elements
![Page 52: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/52.jpg)
Multidimensional Arrays
Multidimensional Arrays are really Arrays of Arrays in Java. (Unlike C)
Some people realign their accesses:
for (int col = 0; col < COLS; col++) {
    for (int row = 0; row < ROWS; row++) {
        array[ROWS * col + row]++;
    }
}
![Page 53: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/53.jpg)
Bad Access Alignment
Strides the wrong way, bad locality.
array[COLS * row + col]++;
Strides the right way, good locality.
array[ROWS * col + row]++;
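The two indexing schemes visit exactly the same elements and differ only in order, which a short sketch can confirm. The `StrideDemo` class and its tiny dimensions are assumptions for illustration; with the column loop outermost, `ROWS * col + row` advances by one element per inner iteration while `COLS * row + col` jumps by `COLS` elements.

```java
// Same loop nest, two index formulas over one flat array; names are illustrative.
final class StrideDemo {
    static final int ROWS = 4;
    static final int COLS = 3;

    // Inner loop moves through memory one element at a time (good locality).
    static void incrementUnitStride(int[] array) {
        for (int col = 0; col < COLS; col++) {
            for (int row = 0; row < ROWS; row++) {
                array[ROWS * col + row]++;
            }
        }
    }

    // Inner loop jumps COLS elements at a time (poor locality for large arrays).
    static void incrementLargeStride(int[] array) {
        for (int col = 0; col < COLS; col++) {
            for (int row = 0; row < ROWS; row++) {
                array[COLS * row + col]++;
            }
        }
    }

    public static void main(String[] args) {
        int[] array = new int[ROWS * COLS];
        incrementUnitStride(array);
        incrementLargeStride(array);
        System.out.println(java.util.Arrays.toString(array));
    }
}
```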
![Page 54: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/54.jpg)
Full Random Access
L1D - 5 clocks
L2 - 37 clocks
Memory - 280 clocks
Sequential Access
L1D - 5 clocks
L2 - 14 clocks
Memory - 28 clocks
![Page 55: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/55.jpg)
Data Layout Principles
Primitive Collections (GNU Trove, GS-Coll)
Arrays > Linked Lists
Hashtable > Search Tree
Avoid Code bloating (Loop Unrolling)
Custom Data Structures
Judy Arrays: an associative array/map
kD-Trees: generalised Binary Space Partitioning
Z-Order Curve: multidimensional data in one dimension
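As an example of the Z-order curve idea on this slide, the sketch below interleaves the bits of two coordinates so that points close together in 2D tend to land close together in a one-dimensional array. The `ZOrder` class is hypothetical, and it assumes each coordinate fits in 16 bits.

```java
// Z-order (Morton) encoding of two 16-bit coordinates into one 32-bit key.
final class ZOrder {

    // Spreads the low 16 bits of v so they occupy the even bit positions.
    static int spreadBits(int v) {
        v &= 0xFFFF;
        v = (v | (v << 8)) & 0x00FF00FF;
        v = (v | (v << 4)) & 0x0F0F0F0F;
        v = (v | (v << 2)) & 0x33333333;
        v = (v | (v << 1)) & 0x55555555;
        return v;
    }

    // x takes the even bit positions, y the odd ones.
    static int encode(int x, int y) {
        return spreadBits(x) | (spreadBits(y) << 1);
    }

    public static void main(String[] args) {
        // Walk a 4x4 grid and print each cell's position along the curve.
        for (int y = 0; y < 4; y++) {
            StringBuilder row = new StringBuilder();
            for (int x = 0; x < 4; x++) {
                row.append(encode(x, y)).append('\t');
            }
            System.out.println(row);
        }
    }
}
```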
![Page 56: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/56.jpg)
Data Locality vs Java Heap Layout
class Foo {
    Integer count;
    Bar bar;
    Baz baz;
}
// No alignment guarantees
for (Foo foo : foos) {
    foo.count = 5;
    foo.bar.visit();
}
[Diagram: heap locations 0, 1, 2, 3, ... holding Foo, bar, baz and count]
![Page 57: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/57.jpg)
Data Locality vs Java Heap Layout
Serious Java Weakness
Location of objects in memory hard to guarantee.
GC also interferes
Copying
Compaction
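One common workaround, in the spirit of the primitive collections mentioned earlier, is to keep a hot field in a primitive array instead of spreading it across separately allocated objects. This is a sketch under assumptions: `CountsLayout` and its simplified `Foo` are hypothetical, and whether the change pays off depends on the application.

```java
import java.util.Arrays;

// Contrasts an object-per-element layout with a primitive array for one field.
final class CountsLayout {

    // Simplified stand-in for the Foo class on the slide.
    static final class Foo {
        Integer count;
    }

    // Each foo, and each boxed Integer, may live anywhere on the heap.
    static void setCountsOnObjects(Foo[] foos, int value) {
        for (Foo foo : foos) {
            foo.count = value;
        }
    }

    // An int[] is one contiguous block, so this walk is sequential in memory.
    static void setCountsInArray(int[] counts, int value) {
        Arrays.fill(counts, value);
    }

    public static void main(String[] args) {
        Foo[] foos = new Foo[4];
        for (int i = 0; i < foos.length; i++) {
            foos[i] = new Foo();
        }
        setCountsOnObjects(foos, 5);

        int[] counts = new int[4];
        setCountsInArray(counts, 5);
        System.out.println(foos[0].count + " " + Arrays.toString(counts));
    }
}
```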
![Page 58: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/58.jpg)
Summary
Cache misses cause stalls, which kill performance
Measurable via Performance Event Counters
Common Techniques for optimizing code
![Page 59: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/59.jpg)
Why care about low level rubbish?
Branch Prediction
Memory Access
Storage
Conclusions
![Page 60: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/60.jpg)
Hard Disks
Commonly used persistent storage
Spinning Rust, with a head to read/write
Constant Angular Velocity - rotations per minute stays constant
![Page 61: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/61.jpg)
A simple model
Zone Constant Angular Velocity (ZCAV) / Zoned Bit Recording (ZBR)
Operation Time = Time to process the command + Time to seek + Rotational speed latency + Sequential Transfer Time
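The model above can be turned into a small calculation. All figures in this sketch are assumptions picked for illustration (a 7200 RPM drive, a 9 ms seek, 100 MB/s sequential transfer, negligible command time), not measurements from the talk, and `DiskModel` is a hypothetical name.

```java
// Worked version of the simple disk model; every number used is an assumed figure.
final class DiskModel {

    // Operation time in milliseconds: command + seek + rotational latency + transfer.
    static double operationMillis(double commandMs, double seekMs, double rpm,
                                  long bytes, double transferBytesPerSec) {
        double rotationalLatencyMs = 0.5 * 60_000.0 / rpm;  // on average half a rotation
        double transferMs = bytes / transferBytesPerSec * 1000.0;
        return commandMs + seekMs + rotationalLatencyMs + transferMs;
    }

    // How many times slower one random operation is than pure sequential transfer.
    static double slowdownVersusSequential(double commandMs, double seekMs, double rpm,
                                           long bytes, double transferBytesPerSec) {
        double transferMs = bytes / transferBytesPerSec * 1000.0;
        return operationMillis(commandMs, seekMs, rpm, bytes, transferBytesPerSec)
                / transferMs;
    }

    public static void main(String[] args) {
        double commandMs = 0.0;       // assumed negligible
        double seekMs = 9.0;          // assumed average seek time
        double rpm = 7200.0;          // assumed spindle speed
        long bytes = 4096;            // one 4 KB operation
        double bytesPerSec = 100e6;   // assumed 100 MB/s sequential rate

        System.out.printf("operation time: %.3f ms%n",
                operationMillis(commandMs, seekMs, rpm, bytes, bytesPerSec));
        System.out.printf("slowdown vs sequential: %.0fx%n",
                slowdownVersusSequential(commandMs, seekMs, rpm, bytes, bytesPerSec));
    }
}
```

With these assumed figures the seek and rotational terms dwarf the transfer term, and a random 4 KB operation comes out a few hundred times slower than the sequential rate.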
![Page 62: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/62.jpg)
ZBR implies faster transfer at the outer edge of the platter than near the centre
![Page 63: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/63.jpg)
Seeking vs Sequential reads
Seek and rotation times dominate for small amounts of data
Random writes of 4 KB can be 300 times slower than theoretical max data transfer
Consider the impact of context switching between applications or threads
![Page 64: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/64.jpg)
Fragmentation causes unnecessary seeks
![Page 65: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/65.jpg)
Sector alignment is irrelevant at the disk level
![Page 66: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/66.jpg)
![Page 67: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/67.jpg)
![Page 68: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/68.jpg)
Summary
Simple, sequential access patterns win
Fragmentation is your enemy
Alignment can be relevant in RAID scenarios
SSDs are a different game
![Page 69: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/69.jpg)
Why care about low level rubbish?
Branch Prediction
Memory Access
Storage
Conclusions
![Page 70: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/70.jpg)
Software runs on Hardware, play nice
![Page 71: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/71.jpg)
Predictability is a running theme
![Page 72: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/72.jpg)
Huge theoretical speedups
![Page 73: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/73.jpg)
YMMV: measured incremental improvements >>> going nuts
![Page 74: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/74.jpg)
More information
Articles
http://www.akkadia.org/drepper/cpumemory.pdf
https://gmplib.org/~tege/x86-timing.pdf
http://psy-lob-saw.blogspot.co.uk/
http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html
http://mechanical-sympathy.blogspot.co.uk
http://www.agner.org/optimize/microarchitecture.pdf
Mailing Lists:
https://groups.google.com/forum/#!forum/mechanical-sympathy
https://groups.google.com/a/jclarity.com/forum/#!forum/friends
http://gee.cs.oswego.edu/dl/concurrency-interest/
![Page 75: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/75.jpg)
![Page 76: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/76.jpg)
Q & [email protected]/java8lambdas
![Page 77: Performance and predictability](https://reader035.vdocuments.site/reader035/viewer/2022062419/55905b211a28ab642e8b4589/html5/thumbnails/77.jpg)