© 2015- Andrei Alexandrescu. Do not redistribute. 1 / 80

Writing Fast Code

code::dive Conference, Wrocław, Poland

Andrei Alexandrescu, Ph.D.

andrei@erdani.com

Nov 5, 2015

The Art of Benchmarking

Mind Amdahl’s Law

• Improvement from a component in a system is limited by the component’s participation in the system

• Collect app profile data before optimizing

• Focus on optimizing the top time-spenders
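
The first bullet can be made concrete with the usual statement of Amdahl's law: if a component accounts for a fraction p of total run time and is made s times faster, the whole program speeds up by 1 / ((1 - p) + p / s). A minimal sketch follows; the helper name `amdahlSpeedup` is hypothetical and not from the slides.

```cpp
#include <cassert>

// Overall speedup when a component taking fraction `p` of total run time
// (0 <= p <= 1) is made `s` times faster (s > 0). `amdahlSpeedup` is a
// hypothetical name used only for this illustration.
double amdahlSpeedup(double p, double s) {
    assert(p >= 0.0 && p <= 1.0 && s > 0.0);
    return 1.0 / ((1.0 - p) + p / s);
}
```

Doubling the speed of a component that takes 10% of the time gives `amdahlSpeedup(0.10, 2.0)`, about 1.05, while the same factor applied to a component taking 80% of the time gives about 1.67. This is why the profile should pick the target.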

Mind Amdahl’s Law

Choose hotspots from

whole application

Mind Lhadma’s Law

Optimize hotspots

outside whole

application; profile

again within application

Discussion

• Edit/benchmark cycle slow on whole app

• But optimizations have global effects

◦ Cache effects

◦ Memory allocation/use

• Microbenchmarks become less effective as the margin of improvement shrinks

Today’s Computing Architectures

• Extremely complex

• Trade reproducible performance for average

speed

• Interrupts, multiprocessing are the norm

• Dynamic frequency control is becoming

common

• Virtually impossible to get identical timings

for experiments

Intuition

• Ignores aspects of a complex reality

• Makes narrow/obsolete/wrong assumptions

• “Fewer instructions = faster code”

• “Data is faster than computation”

• “Computation is faster than data”

• The only good intuition: “I should time this.”

Deep Thought

Measuring gives you a

leg up on experts who

don’t need to measure

Benchmarking for Speed

• Goal: estimate the speed of some specific

algorithm

• Non-goals:

◦ Framework overheads (looping,

measurement)

◦ Clock resolution noise

◦ Vagaries during measurement (unrelated

system activity etc)

◦ Unrelated activity overheads:

setup/cleanup work

Measuring

• Measured time: $t_m = t + t_q + t_n + t_o$

where

• $t_m$ is measured (observed) time

• $t$ is the actual time of interest

• $t_q$ is time added by quantization noise

• $t_n$ is time added by various sources of noise

• $t_o$ is overhead time (measuring, looping, calling functions)

1. Quantization noise

• Especially visible for very fast operations

• Solution: repeat experiment many times,

measure once

• Adjust total time spent to make noise

negligible

startClock();
for (unsigned i = 0; i < iMax; ++i) {
    ... experiment code ...
}
measureTime();
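
A runnable sketch of the pattern above, assuming `std::chrono::steady_clock` stands in for the slide's `startClock`/`measureTime` pair. The helper name `timePerIteration` is hypothetical.

```cpp
#include <cassert>
#include <chrono>

// Runs `work` iMax times between two clock reads and returns the average
// time per iteration in nanoseconds. Reading the clock only twice spreads
// its quantization error over all iMax iterations.
template <class F>
double timePerIteration(F&& work, unsigned iMax) {
    assert(iMax > 0);
    auto const start = std::chrono::steady_clock::now();
    for (unsigned i = 0; i < iMax; ++i) {
        work();
    }
    auto const stop = std::chrono::steady_clock::now();
    std::chrono::duration<double, std::nano> const elapsed = stop - start;
    return elapsed.count() / iMax;
}
```

Raising `iMax` until the total elapsed time is far above the clock's resolution is what makes the quantization term negligible.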

2. Other noise

• All fluctuations

• Gaussian, but always additive

• Solution: measure many times, take the mode

auto minTime = numeric_limits<unsigned>::max();
for (unsigned epoch = 0; epoch < eMax; ++epoch) {
    startClock();
    for (unsigned i = 0; i < iMax; ++i) {
        ... code to benchmark ...
    }
    auto t = measureTime();
    if (minTime > t) minTime = t;
}
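
The epoch loop above can be sketched as a self-contained helper. This is an illustration under assumptions: `std::chrono::steady_clock` replaces the slide's clock functions, and `minTimePerIteration` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <limits>

// Times eMax epochs of iMax iterations each and returns the fastest epoch,
// in nanoseconds per iteration. Noise only adds time, so the smallest
// observation is the one least disturbed by it.
template <class F>
double minTimePerIteration(F&& work, unsigned eMax, unsigned iMax) {
    assert(eMax > 0 && iMax > 0);
    using Clock = std::chrono::steady_clock;
    double minTime = std::numeric_limits<double>::max();
    for (unsigned epoch = 0; epoch < eMax; ++epoch) {
        auto const start = Clock::now();
        for (unsigned i = 0; i < iMax; ++i) {
            work();
        }
        std::chrono::duration<double, std::nano> const t = Clock::now() - start;
        minTime = std::min(minTime, t.count() / iMax);
    }
    return minTime;
}
```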

Mode vs. Minimum

• Mode of a distribution: maximum density of

measured values

• Practically: Mode slowly converges to

minimum

◦ No “subtractive” speed noise

• Exceptions: Networking, (live|dead)locking

• Sometimes worst case does matter

3. Overhead noise

• All code surrounding the actual work

• Solution: measure and subtract the time for

the loop only

startClock();
for (unsigned i = 0; i < iMax; ++i) {
}
measureTime();

Wait a Minute

• Compiler optimizes away non-observable

behavior

• Empty loops are not observable

• Solution: make them observable

startClock();
for (volatile unsigned i = 0; i < iMax; ++i) {
}
measureTime();

Better, but less portable

startClock();
for (unsigned i = 0; i < iMax; ++i) {
    asm volatile("");
}
measureTime();
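
One way to put the measure-and-subtract idea together is sketched here. It assumes a GCC or Clang compiler for the inline asm statement and `std::chrono::steady_clock` as the timer; `emptyLoopNanos` and `subtractOverhead` are hypothetical names.

```cpp
#include <cassert>
#include <chrono>

// Nanoseconds taken by iMax iterations of an otherwise empty loop. The empty
// asm statement (GCC/Clang syntax) emits no instructions but keeps the
// compiler from deleting the loop.
double emptyLoopNanos(unsigned iMax) {
    auto const start = std::chrono::steady_clock::now();
    for (unsigned i = 0; i < iMax; ++i) {
        asm volatile("");
    }
    std::chrono::duration<double, std::nano> const t =
        std::chrono::steady_clock::now() - start;
    return t.count();
}

// Subtracts the loop overhead from a measured total, clamping at zero because
// noise can make the overhead estimate exceed the measurement.
double subtractOverhead(double measuredNanos, double overheadNanos) {
    assert(measuredNanos >= 0.0 && overheadNanos >= 0.0);
    double const net = measuredNanos - overheadNanos;
    return net > 0.0 ? net : 0.0;
}
```

The overhead loop should use the same `iMax` as the loop that contains the real work, so that the two totals are comparable.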

Speaking of observable

• Frequently benchmarks don’t use their

results

// Let's time this
unsigned n = 1000000;
while (n-- > 0) {
    asciiToInteger("1234567890");
}

• Compiler may call asciiToInteger once or

not at all!

Solution

• Define function that “uses” its parameter

template <class T>
void doNotOptimizeAway(T&& datum) {
    asm volatile("" : "+r" (datum));
}

Touching entire memory

void clobber() {
    asm volatile("" : : : "memory");
}

• Useful to insert at the end of a measurement

More Portable

• Use I/O guarded by a test the compiler cannot prove false

template <class T>
void doNotOptimizeAway(T&& datum) {
    if (getpid() == 1) {
        const void* p = &datum;
        putchar(*static_cast<const char*>(p));
    }
}

Use

• Define function that “uses” its parameter

// Let's time this
unsigned n = 1000000;
while (n-- > 0) {
    doNotOptimizeAway(
        asciiToInteger("1234567890"));
}
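
For a version of this loop that compiles on its own, the sketch below pairs the asm-based `doNotOptimizeAway` from the earlier slide (GCC/Clang only) with a stand-in `asciiToInteger`. The body of `asciiToInteger` and the wrapper `runBenchmarkLoop` are assumptions made for illustration; the slides do not show them.

```cpp
#include <cassert>

// GCC/Clang-only, as on the slides: the empty asm claims to read and modify
// `datum`, so the compiler must actually compute it.
template <class T>
void doNotOptimizeAway(T&& datum) {
    asm volatile("" : "+r"(datum));
}

// Hypothetical stand-in for the function being benchmarked; it assumes the
// input consists of decimal digits only.
unsigned asciiToInteger(const char* s) {
    unsigned result = 0;
    for (; *s != '\0'; ++s) {
        result = result * 10 + static_cast<unsigned>(*s - '0');
    }
    return result;
}

// Calls the benchmarked function n times, keeping every result "used".
// Returns the number of calls made.
unsigned runBenchmarkLoop(unsigned n) {
    unsigned calls = 0;
    while (n-- > 0) {
        doNotOptimizeAway(asciiToInteger("1234567890"));
        ++calls;
    }
    return calls;
}
```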

Which Counter to Use?

• Not a solved problem!

• High resolution, high fluctuation

◦ Windows: QueryPerformanceCounter

◦ Linux: clock_gettime(CLOCK_REALTIME)

• clock_getres broken on older kernels,

upgrade

• Low-resolution, low fluctuation

◦ Windows: timeGetTime

◦ Un*x: gettimeofday

• CPU-specific (RDTSC)—beware CPU jumping
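
Before settling on a POSIX clock it can help to ask the system what resolution it reports. A small sketch, assuming a POSIX platform; `clockResolutionNanos` is a hypothetical helper name, and per the bullet above the reported value may be unreliable on older kernels.

```cpp
#include <cassert>
#include <time.h>

// Resolution of a POSIX clock in nanoseconds as reported by clock_getres,
// or -1 if the call fails.
long long clockResolutionNanos(clockid_t id) {
    timespec res{};
    if (clock_getres(id, &res) != 0) {
        return -1;
    }
    return static_cast<long long>(res.tv_sec) * 1000000000LL + res.tv_nsec;
}
```

Calling `clockResolutionNanos(CLOCK_REALTIME)` gives the figure for the clock named on the slide. The reported resolution says nothing about fluctuation, which still has to be measured.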

Baselines

Baseline, Defined

• “Classic”, “naïve”, “established”,

“commonly-accepted” solution to the target of

optimization

• Your own production version of a complex

function

• Slower than baseline → no good

• std::sort for sorting

• iostreams for I/O (easy to beat)

• scanf for parsing numbers

Immutable Rule

Always choose relevant

baselines

Bad

“I sorted 1M floats in 8

milliseconds!”

Good

“Quicksort is 25%

faster than heapsort on

1M Gaussian floats”

Splendid

Differential Timing

• Traditional:

◦ Run baseline $n$ times, measure $t_a$

◦ Run contender $n$ times, measure $t_b$

◦ Relative improvement $r = t_a / t_b$

• Differential:

◦ Run baseline $2n$ times, measure $t_{2a}$

◦ Run baseline $n$ times and contender $n$ times, measure $t_{a+b}$

◦ Relative improvement $r = \dfrac{t_{2a}}{2t_{a+b} - t_{2a}}$

• Some overhead noises canceled
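
To see that the two formulas estimate the same quantity, note that without noise the baseline run 2n times takes twice the baseline time and the mixed run takes the baseline time plus the contender time. The differential denominator then reduces to twice the contender time. A minimal sketch with hypothetical helper names:

```cpp
#include <cassert>

// Traditional estimate: baseline time over contender time.
double traditionalRatio(double ta, double tb) {
    assert(tb > 0.0);
    return ta / tb;
}

// Differential estimate from t2a (baseline run 2n times) and tab (baseline
// and contender run n times each). Without noise, t2a = 2*ta and
// tab = ta + tb, so the denominator 2*tab - t2a equals 2*tb.
double differentialRatio(double t2a, double tab) {
    double const denominator = 2.0 * tab - t2a;
    assert(denominator > 0.0);
    return t2a / denominator;
}
```

With a baseline of 6 time units and a contender of 4, `traditionalRatio(6.0, 4.0)` and `differentialRatio(12.0, 10.0)` both evaluate to 1.5.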

Common Benchmarking Pitfalls

• Measuring speed of debug builds

• Different setup for baseline and measured

◦ Sequencing: heap allocator

◦ Warmth of cache, files, databases, DNS

• Including ancillary work in measurement

◦ malloc, printf common

• Mixtures: measure $t_a + t_b$, improve $t_a$, conclude $t_b$ got improved

• Optimize rare cases, pessimize others

Optimizing Rare Cases

Almost Portable

Optimization Techniques

Categories

• Generalities

• Storage

• Numbers

• Strength reduction

• Minimize indirect (e.g. array) writes

Generalities

• Prefer static linking and position-dependent code (PDC)

• Prefer 64-bit code, 32-bit data

• Prefer (32-bit) array indexing to pointers

◦ Prefer a[i++] to a[++i]

• Prefer regular memory access patterns

• Minimize flow, avoid data dependencies

Storage Pecking Order

• Use static const for all immutables

◦ Beware cache issues

• Use stack for most variables

◦ Hot

◦ 0-cost addressing, like struct/class fields

• Globals: aliasing issues

• thread_local slowest, use local caching

◦ 1 instruction in Windows, Linux

◦ 3-4 in OSX

Integrals

• Prefer 32-bit ints to all other sizes

◦ 64 bit may make some code 20x slower

◦ 8, 16-bit computations use conversion to

32 bits and back

◦ Use small ints in arrays

• Prefer unsigned to signed

◦ Except when converting to floating point

• “Most numbers are small”

Floating Point

• Double precision as fast as single precision

• Extended precision just a bit slower

• Do not mix the three

• 1-2 FP addition/subtraction units

• 1-2 FP multiplication/division units

• SSE accelerates throughput for certain

computation kernels

• ints→FPs cheap, FPs→ints expensive

Strength reduction

• Don’t waste time replacing a /= 2 with

a >>= 1

• Speed hierarchy:

◦ comparisons

◦ (u)int add, subtract, bitops, shift

◦ FP add, sub (separate unit!)

◦ (u)int32 mul; FP mul

◦ FP division, remainder

◦ (u)int division, remainder

Strength reduction: Example

• Compute digit count in base-10

representation

uint32_t digits10(uint64_t v) {
    uint32_t result = 0;
    do {
        ++result;
        v /= 10;
    } while (v);
    return result;
}

• Uses integral division extensively

Strength reduction: Example

uint32_t digits10(uint64_t v) {
    uint32_t result = 1;
    for (;;) {
        if (v < 10) return result;
        if (v < 100) return result + 1;
        if (v < 1000) return result + 2;
        if (v < 10000) return result + 3;
        // Skip ahead by 4 orders of magnitude
        v /= 10000U;
        result += 4;
    }
}

• More comparisons and additions, fewer /=

• (This is not loop unrolling!)
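
Before timing the two versions against each other, it is worth checking that they return the same answers. The sketch below copies both functions from the slides under the hypothetical names `digits10Naive` and `digits10Fast` and compares them around every power of ten; the checker `digits10VersionsAgree` is an addition for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <initializer_list>

// Baseline from the slides: one division per digit.
uint32_t digits10Naive(uint64_t v) {
    uint32_t result = 0;
    do {
        ++result;
        v /= 10;
    } while (v);
    return result;
}

// Strength-reduced version from the slides: one division per four digits.
uint32_t digits10Fast(uint64_t v) {
    uint32_t result = 1;
    for (;;) {
        if (v < 10) return result;
        if (v < 100) return result + 1;
        if (v < 1000) return result + 2;
        if (v < 10000) return result + 3;
        v /= 10000U;
        result += 4;
    }
}

// Returns true if both versions agree on every power of ten and its
// neighbours, which is where an off-by-one would show up.
bool digits10VersionsAgree() {
    uint64_t p = 1;
    for (int k = 0; k < 20; ++k) {
        for (uint64_t v : {p - 1, p, p + 1}) {
            if (digits10Naive(v) != digits10Fast(v)) return false;
        }
        if (k < 19) p *= 10;
    }
    return true;
}
```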

Pro version

uint32_t digits10(uint64_t v) {
    uint32_t result = 1;
    for (;;) {
        if (LIKELY(v < 10)) return result;
        if (LIKELY(v < 100)) return result + 1;
        if (LIKELY(v < 1000)) return result + 2;
        if (LIKELY(v < 10000)) return result + 3;
        // Skip ahead by 4 orders of magnitude
        v /= 10000U;
        result += 4;
    }
}

• 102 vs. 149 bytes in size

• Does not improve execution time!

LIKELY

• Uses static hinting for the compiler

#if defined(__GNUC__) && __GNUC__ >= 4
#define LIKELY(x) (__builtin_expect((x), 1))
#define UNLIKELY(x) (__builtin_expect((x), 0))
#else
#define LIKELY(x) (x)
#define UNLIKELY(x) (x)
#endif

Minimize Indirect Writes: Why?

• Disables enregistering

• A write is really a read and a write

• Maculates the cache

Minimize Indirect Writes

uint32_t u64ToAsciiClassic(uint64_t value, char* dst) {
    // Write backwards.
    auto start = dst;
    do {
        *dst++ = '0' + (value % 10);
        value /= 10;
    } while (value != 0);
    const uint32_t result = dst - start;
    // Reverse in place.
    for (dst--; dst > start; start++, dst--) {
        std::iter_swap(dst, start);
    }
    return result;
}

Minimize Indirect Writes

• Gambit: make one extra pass to compute

length

uint32_t uint64ToAscii(uint64_t v, char *const buffer) {
    auto const result = digits10(v);
    uint32_t pos = result - 1;
    while (v >= 10) {
        auto const q = v / 10;
        auto const r = static_cast<uint32_t>(v % 10);
        buffer[pos--] = '0' + r;
        v = q;
    }
    assert(pos == 0);
    // Last digit is trivial to handle
    *buffer = static_cast<uint32_t>(v) + '0';
    return result;
}

Improvements

• Fewer indirect writes

• Regular access patterns

• Fast on small numbers

• Data dependencies reduced
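
The two-pass conversion can be exercised end to end with the sketch below. It reuses `digits10` and `uint64ToAscii` from the slides, with explicit `char` casts added; the wrapper `toDecimalString` is a hypothetical convenience for this example and is not how a production caller would avoid allocation.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Digit count, strength-reduced as on the earlier slides.
uint32_t digits10(uint64_t v) {
    uint32_t result = 1;
    for (;;) {
        if (v < 10) return result;
        if (v < 100) return result + 1;
        if (v < 1000) return result + 2;
        if (v < 10000) return result + 3;
        v /= 10000U;
        result += 4;
    }
}

// Two-pass conversion from the slide: compute the length first, then write
// each digit once, from the last position towards the first. The output is
// not NUL-terminated.
uint32_t uint64ToAscii(uint64_t v, char* const buffer) {
    auto const result = digits10(v);
    uint32_t pos = result - 1;
    while (v >= 10) {
        auto const q = v / 10;
        auto const r = static_cast<uint32_t>(v % 10);
        buffer[pos--] = static_cast<char>('0' + r);
        v = q;
    }
    assert(pos == 0);
    *buffer = static_cast<char>('0' + static_cast<uint32_t>(v));
    return result;
}

// Hypothetical convenience wrapper for this example only.
std::string toDecimalString(uint64_t v) {
    char buffer[20];  // UINT64_MAX has 20 decimal digits
    auto const len = uint64ToAscii(v, buffer);
    return std::string(buffer, len);
}
```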

Checkpoint

• You can’t improve what you can’t measure

◦ Pro tip: You can’t measure what you don’t

measure

• Always choose good baselines

• Optimize code judiciously for today’s

dynamics

The Forgotten

Parallelism

How to use those

ALUs?

Instruction-Level Parallelism (ILP)

• Pipelining

• Superscalar execution

• Out-of-order execution

• Register renaming

• Speculative execution

• . . . and many more

better instruction-level parallelism

=

fewer data dependencies

Automatic?

Important?

Minimizing Data Dependencies

Study: string to integral

unsigned atoui(const char* b, const char* e) {
    enforce(b < e);
    unsigned result = 0;
    for (; b != e; ++b) {
        enforce(*b >= '0' && *b <= '9');
        result = result * 10 + (*b - '0');
    }
    return result;
}

How does it work?

523924 = (((((0 * 10 + 5) * 10 + 2)

* 10 + 3) * 10 + 9) * 10 + 2) * 10

+ 4

Divide & Conquer?

"523924" : "523" then "924"

523924 = 523 * 1000 + 924

Key: Multiply, then add

523924 = 5 * 100,000

+ 2 * 10,000

+ 3 * 1,000

+ 9 * 100

+ 2 * 10

+ 4

Reducing dependencies

unsigned atoui(const char* b, const char* e) {
    static const unsigned pow10[20] = {
        10000000000000000000UL,
        ...
        1
    };
    enforce(b < e);
    unsigned result = 0;
    auto i = sizeof(pow10) / sizeof(*pow10) - (e - b);
    for (; b != e; ++b) {
        enforce(*b >= '0' && *b <= '9');
        result += pow10[i++] * (*b - '0');
    }
    return result;
}
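
A compilable sketch of the table-driven idea, under stated assumptions: the result is 32 bits wide, so the table here has only 10 entries instead of the slide's elided 20-entry table, `enforce` is a hypothetical stand-in that throws `std::range_error`, and values that overflow 32 bits are not detected.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Hypothetical stand-in for the slides' enforce(): throws when cond is false.
inline void enforce(bool cond) {
    if (!cond) throw std::range_error("enforce failed");
}

// Each digit is multiplied by its own power of ten, so the products do not
// depend on one another; only the final additions are chained.
// Assumption: at most 10 digits, and no overflow check for values that do
// not fit in 32 bits.
uint32_t atoui(const char* b, const char* e) {
    static const uint32_t pow10[10] = {
        1000000000, 100000000, 10000000, 1000000, 100000,
        10000, 1000, 100, 10, 1
    };
    std::size_t const tableSize = sizeof(pow10) / sizeof(*pow10);
    enforce(b < e);
    std::size_t const len = static_cast<std::size_t>(e - b);
    enforce(len <= tableSize);
    uint32_t result = 0;
    std::size_t i = tableSize - len;
    for (; b != e; ++b) {
        enforce(*b >= '0' && *b <= '9');
        result += pow10[i++] * static_cast<uint32_t>(*b - '0');
    }
    return result;
}
```

Compared with the `result * 10 + digit` loop, no multiplication here has to wait for the previous iteration's result.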

associative

means

parallelizable

Strength reduction, again

unsigned atoui(const char* b, const char* e) {
    static const unsigned pow10[20] = {
        10000000000000000000UL,
        ...
        1
    };
    enforce(b < e);
    unsigned result = 0;
    auto i = sizeof(pow10) / sizeof(*pow10) - (e - b);
    for (; b != e; ++b) {
        enforce(*b >= '0' && *b <= '9');
        result += pow10[i++] * (*b - '0');
    }
    return result;
}

Strength reduction, again

// Reduce the strength of this
if (*b < '0' || *b > '9')
    throw range_error("ehm");
result += pow10[i++] * (*b - '0');

Tip: Use Bitwise Ops

if ((*b < '0') | (*b > '9'))
    throw range_error("ehm");
result += pow10[i++] * (*b - '0');

Tip: Use isdigit

if (!isdigit(*b))
    throw range_error("ehm");
result += pow10[i++] * (*b - '0');

Pro Tip: Use unsigned wraparoo

auto d = unsigned(*b) - '0';
if (d >= 10)
    throw range_error("ehm");
result += pow10[i++] * d;
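
The wraparound test replaces two comparisons with one. A short sketch that checks the two forms against each other for every character value; the function names are hypothetical.

```cpp
#include <cassert>

// Two-comparison range check, as in the original loop.
bool isDigitTwoTests(char c) {
    return c >= '0' && c <= '9';
}

// Single-comparison check: subtracting '0' in unsigned arithmetic maps
// '0'..'9' to 0..9 and wraps every other character to a value >= 10.
bool isDigitWraparound(char c) {
    auto const d = unsigned(c) - '0';
    return d < 10;
}

// Returns true if the two checks agree on every possible char value.
bool digitChecksAgree() {
    for (int v = 0; v <= 255; ++v) {
        char const c = static_cast<char>(static_cast<unsigned char>(v));
        if (isDigitTwoTests(c) != isDigitWraparound(c)) return false;
    }
    return true;
}
```

The subtraction also yields the digit value itself, so the check costs nothing beyond the work the loop already has to do.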

modulo subtraction

is

injective

Love/money makes the world go round

Math makes strength reduction go round

Think Outside the Loop

for (; b != e; ++b) {
    auto d = unsigned(*b) - '0';
    enforce(d < 10);
    result += pow10[i++] * d;
}

Think Outside the Loop

for (; e - b >= 4; b += 4, i += 4) {
    auto d0 = unsigned(*b) - '0';
    enforce(d0 < 10);
    auto s0 = pow10[i] * d0;
    auto d1 = unsigned(b[1]) - '0';
    enforce(d1 < 10);
    auto s1 = pow10[i + 1] * d1;
    auto d2 = unsigned(b[2]) - '0';
    enforce(d2 < 10);
    auto s2 = pow10[i + 2] * d2;
    auto d3 = unsigned(b[3]) - '0';
    enforce(d3 < 10);
    auto s3 = pow10[i + 3] * d3;
    result += s0 + s1 + s2 + s3;
}
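
The four-at-a-time loop handles only whole groups of four digits, so a complete function also needs a loop for the remaining zero to three digits. The sketch below is one possible assembly, not necessarily the author's: it assumes a 32-bit result with a 10-entry table, a hypothetical throwing `enforce`, and the hypothetical name `atouiUnrolled`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Hypothetical stand-in for the slides' enforce(): throws when cond is false.
inline void enforce(bool cond) {
    if (!cond) throw std::range_error("enforce failed");
}

// Assumption: at most 10 digits and no 32-bit overflow detection.
uint32_t atouiUnrolled(const char* b, const char* e) {
    static const uint32_t pow10[10] = {
        1000000000, 100000000, 10000000, 1000000, 100000,
        10000, 1000, 100, 10, 1
    };
    std::size_t const tableSize = sizeof(pow10) / sizeof(*pow10);
    enforce(b < e);
    std::size_t const len = static_cast<std::size_t>(e - b);
    enforce(len <= tableSize);
    uint32_t result = 0;
    std::size_t i = tableSize - len;
    // Main loop: four independent digit products per iteration.
    for (; e - b >= 4; b += 4, i += 4) {
        auto const d0 = unsigned(b[0]) - '0';
        enforce(d0 < 10);
        auto const s0 = pow10[i] * d0;
        auto const d1 = unsigned(b[1]) - '0';
        enforce(d1 < 10);
        auto const s1 = pow10[i + 1] * d1;
        auto const d2 = unsigned(b[2]) - '0';
        enforce(d2 < 10);
        auto const s2 = pow10[i + 2] * d2;
        auto const d3 = unsigned(b[3]) - '0';
        enforce(d3 < 10);
        auto const s3 = pow10[i + 3] * d3;
        result += s0 + s1 + s2 + s3;
    }
    // Tail loop: the remaining 0 to 3 digits.
    for (; b != e; ++b) {
        auto const d = unsigned(*b) - '0';
        enforce(d < 10);
        result += pow10[i++] * d;
    }
    return result;
}
```

Whether the unrolled form is actually faster depends on the compiler and CPU, so it should be timed against the simple loop as a baseline.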

Summary

• Measure

• Use good baselines

• Reduce strength

• Reduce dependencies

• Knowing math is knowing optimization

Q & A