halide: a language and compiler for optimizing parallelism ...halide: a language and compiler for...
TRANSCRIPT
![Page 1: Halide: A Language and Compiler for Optimizing Parallelism ...Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines](https://reader030.vdocuments.site/reader030/viewer/2022040723/5e3245097b69d4219c7a8946/html5/thumbnails/1.jpg)
Halide: A Language and Compiler for Optimizing Parallelism,
Locality, and Recomputation in Image Processing Pipelines
Jonathan Ragan-Kelley, Andrew Adams, Sylvain Paris, Marc Levoy, Saman Amarasinghe, Connely Barnes, Fredo Durand
MIT CSAIL, Adobe, Stanford
Friday, October 4, 13
![Page 2: Halide: A Language and Compiler for Optimizing Parallelism ...Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines](https://reader030.vdocuments.site/reader030/viewer/2022040723/5e3245097b69d4219c7a8946/html5/thumbnails/2.jpg)
Parallelism“Moore’s law” growth will require exponentially more parallelism.
LocalityData should move as little as possible.
High Performance Image Processing
Friday, October 4, 13
![Page 3: Halide: A Language and Compiler for Optimizing Parallelism ...Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines](https://reader030.vdocuments.site/reader030/viewer/2022040723/5e3245097b69d4219c7a8946/html5/thumbnails/3.jpg)
Message #1: Performance requires complex tradeoffs
redundant work
locality
parallelism
tradeoff
Friday, October 4, 13
![Page 4: Halide: A Language and Compiler for Optimizing Parallelism ...Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines](https://reader030.vdocuments.site/reader030/viewer/2022040723/5e3245097b69d4219c7a8946/html5/thumbnails/4.jpg)
Where does performance come from?
Hardware
Programredundant
worklocality
parallelism
tradeoff
Friday, October 4, 13
![Page 5: Halide: A Language and Compiler for Optimizing Parallelism ...Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines](https://reader030.vdocuments.site/reader030/viewer/2022040723/5e3245097b69d4219c7a8946/html5/thumbnails/5.jpg)
Hardware
Message #2: organization of computationis a first-class issue
Algorithm
Organization of computation
redundant work
locality
parallelism
tradeoffProgram
Program:
Friday, October 4, 13
![Page 6: Halide: A Language and Compiler for Optimizing Parallelism ...Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines](https://reader030.vdocuments.site/reader030/viewer/2022040723/5e3245097b69d4219c7a8946/html5/thumbnails/6.jpg)
Algorithm vs. Organization: 3x3 blur
void box_filter_3x3(const Image &in, Image &blury) { Image blurx(in.width(), in.height()); // allocate blurx array
blurx(x, y) = (in(x-‐1, y) + in(x, y) + in(x+1, y))/3;
blury(x, y) = (blurx(x, y-‐1) + blurx(x, y) + blurx(x, y+1))/3;}
for (int y = 0; y < in.height(); y++) for (int x = 0; x < in.width(); x++)
for (int x = 0; x < in.width(); x++) for (int y = 0; y < in.height(); y++)
Friday, October 4, 13
![Page 7: Halide: A Language and Compiler for Optimizing Parallelism ...Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines](https://reader030.vdocuments.site/reader030/viewer/2022040723/5e3245097b69d4219c7a8946/html5/thumbnails/7.jpg)
Algorithm vs. Organization: 3x3 blur
void box_filter_3x3(const Image &in, Image &blury) { Image blurx(in.width(), in.height()); // allocate blurx array
blurx(x, y) = (in(x-‐1, y) + in(x, y) + in(x+1, y))/3;
blury(x, y) = (blurx(x, y-‐1) + blurx(x, y) + blurx(x, y+1))/3;}
for (int y = 0; y < in.height(); y++) for (int x = 0; x < in.width(); x++)
for (int x = 0; x < in.width(); x++) for (int y = 0; y < in.height(); y++)
Same algorithm, different organizationOne of them is 15x faster
Friday, October 4, 13
![Page 8: Halide: A Language and Compiler for Optimizing Parallelism ...Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines](https://reader030.vdocuments.site/reader030/viewer/2022040723/5e3245097b69d4219c7a8946/html5/thumbnails/8.jpg)
void box_filter_3x3(const Image &in, Image &blury) { __m128i one_third = _mm_set1_epi16(21846); #pragma omp parallel for for (int yTile = 0; yTile < in.height(); yTile += 32) { __m128i a, b, c, sum, avg; __m128i blurx[(256/8)*(32+2)]; // allocate tile blurx array for (int xTile = 0; xTile < in.width(); xTile += 256) { __m128i *blurxPtr = blurx; for (int y = -‐1; y < 32+1; y++) { const uint16_t *inPtr = &(in[yTile+y][xTile]); for (int x = 0; x < 256; x += 8) { a = _mm_loadu_si128((__m128i*)(inPtr-‐1)); b = _mm_loadu_si128((__m128i*)(inPtr+1)); c = _mm_load_si128((__m128i*)(inPtr)); sum = _mm_add_epi16(_mm_add_epi16(a, b), c); avg = _mm_mulhi_epi16(sum, one_third); _mm_store_si128(blurxPtr++, avg); inPtr += 8; }} blurxPtr = blurx; for (int y = 0; y < 32; y++) { __m128i *outPtr = (__m128i *)(&(blury[yTile+y][xTile])); for (int x = 0; x < 256; x += 8) { a = _mm_load_si128(blurxPtr+(2*256)/8); b = _mm_load_si128(blurxPtr+256/8); c = _mm_load_si128(blurxPtr++); sum = _mm_add_epi16(_mm_add_epi16(a, b), c); avg = _mm_mulhi_epi16(sum, one_third); _mm_store_si128(outPtr++, avg);}}}}}
9.9 → 0.9 ms/megapixelHand-optimized C++ 11x faster
(quad core x86)
Tiled, fused
Vectorized
Multithreaded
Redundant computation
Near roof-line optimum
Friday, October 4, 13
![Page 9: Halide: A Language and Compiler for Optimizing Parallelism ...Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines](https://reader030.vdocuments.site/reader030/viewer/2022040723/5e3245097b69d4219c7a8946/html5/thumbnails/9.jpg)
Same algorithm, different organization
void box_filter_3x3(const Image &in, Image &blury) { __m128i one_third = _mm_set1_epi16(21846); #pragma omp parallel for for (int yTile = 0; yTile < in.height(); yTile += 32) { __m128i a, b, c, sum, avg; __m128i blurx[(256/8)*(32+2)]; // allocate tile blurx array for (int xTile = 0; xTile < in.width(); xTile += 256) { __m128i *blurxPtr = blurx; for (int y = -‐1; y < 32+1; y++) { const uint16_t *inPtr = &(in[yTile+y][xTile]); for (int x = 0; x < 256; x += 8) { a = _mm_loadu_si128((__m128i*)(inPtr-‐1)); b = _mm_loadu_si128((__m128i*)(inPtr+1)); c = _mm_load_si128((__m128i*)(inPtr)); sum = _mm_add_epi16(_mm_add_epi16(a, b), c); avg = _mm_mulhi_epi16(sum, one_third); _mm_store_si128(blurxPtr++, avg); inPtr += 8; }} blurxPtr = blurx; for (int y = 0; y < 32; y++) { __m128i *outPtr = (__m128i *)(&(blury[yTile+y][xTile])); for (int x = 0; x < 256; x += 8) { a = _mm_load_si128(blurxPtr+(2*256)/8); b = _mm_load_si128(blurxPtr+256/8); c = _mm_load_si128(blurxPtr++); sum = _mm_add_epi16(_mm_add_epi16(a, b), c); avg = _mm_mulhi_epi16(sum, one_third); _mm_store_si128(outPtr++, avg);}}}}}
void box_filter_3x3(const Image &in, Image &blury) { Image blurx(in.width(), in.height()); // allocate blurx array
for (int y = 0; y < in.height(); y++) for (int x = 0; x < in.width(); x++) blurx(x, y) = (in(x-‐1, y) + in(x, y) + in(x+1, y))/3;
for (int y = 0; y < in.height(); y++) for (int x = 0; x < in.width(); x++) blury(x, y) = (blurx(x, y-‐1) + blurx(x, y) + blurx(x, y+1))/3;}
For a given algorithm,organize to optimize:
tradeoff
redundant work
locality
parallelismFriday, October 4, 13
![Page 10: Halide: A Language and Compiler for Optimizing Parallelism ...Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines](https://reader030.vdocuments.site/reader030/viewer/2022040723/5e3245097b69d4219c7a8946/html5/thumbnails/10.jpg)
(Re)organizing computation is hard
Optimizing parallelism, locality requires transforming program & data structure.
What transformations are legal?
What transformations are beneficial?
libraries don’t solve this:BLAS, IPP, MKL, OpenCV, MATLABoptimized kernels compose into inefficient pipelines (no fusion)
Friday, October 4, 13
![Page 11: Halide: A Language and Compiler for Optimizing Parallelism ...Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines](https://reader030.vdocuments.site/reader030/viewer/2022040723/5e3245097b69d4219c7a8946/html5/thumbnails/11.jpg)
Halide’s answer: decouple algorithm from schedule
Algorithm: what is computedSchedule: where and when it’s computed
Easy for programmers to build pipelines
Easy to specify & explore optimizationsmanual or automatic search
Easy for the compiler to generate fast code
Friday, October 4, 13
![Page 12: Halide: A Language and Compiler for Optimizing Parallelism ...Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines](https://reader030.vdocuments.site/reader030/viewer/2022040723/5e3245097b69d4219c7a8946/html5/thumbnails/12.jpg)
Halide algorithm:blurx(x, y) = (in(x-‐1, y) + in(x, y) + in(x+1, y))/3;blury(x, y) = (blurx(x, y-‐1) + blurx(x, y) + blurx(x, y+1))/3;
Halide schedule:blury.tile(x, y, xi, yi, 256, 32).vectorize(xi, 8).parallel(y);blurx.compute_at(blury, x).store_at(blury, x).vectorize(x, 8);
Friday, October 4, 13
![Page 13: Halide: A Language and Compiler for Optimizing Parallelism ...Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines](https://reader030.vdocuments.site/reader030/viewer/2022040723/5e3245097b69d4219c7a8946/html5/thumbnails/13.jpg)
The schedule defines intra-stage order, inter-stage interleavingshow pipeline and domain.schedule specifies:- interleaving (up/down)- order (across domain)how we specify choices:- order of domain (loop synthesis)- granularity at which to allocate, store and reuse values (loop level)- granularity at which to interleave computation
For each stage:1) In what order should it
compute its values?2) When should it
compute its inputs?
blurx
blury
input
Friday, October 4, 13
![Page 14: Halide: A Language and Compiler for Optimizing Parallelism ...Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines](https://reader030.vdocuments.site/reader030/viewer/2022040723/5e3245097b69d4219c7a8946/html5/thumbnails/14.jpg)
Tradeoff space modeled by granularity of interleaving
validschedulescompute
granularity
storagegranularity
fine interleavinghigh locality
redundantcomputation
coarse interleavinglow locality
no redundantcomputation
captu
ring r
euse
const
rains o
rder
(less p
arallel
ism)
breadth-firstexecutiontile-level
fusion
totalfusion
high locality,no redundancy?
Friday, October 4, 13
![Page 15: Halide: A Language and Compiler for Optimizing Parallelism ...Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines](https://reader030.vdocuments.site/reader030/viewer/2022040723/5e3245097b69d4219c7a8946/html5/thumbnails/15.jpg)
The schedule defines producer-consumer interleaving
Breadth-first execution minimizes recomputation for large kernels
Halide’s schedule is designed to span these tradeoffs
Tile-level fusion trades off locality vs. recomputation for stencils
Fine-grained fusion optimizes locality for point-wise operations
tradeoff
redundant work
locality
parallelismFriday, October 4, 13
![Page 16: Halide: A Language and Compiler for Optimizing Parallelism ...Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines](https://reader030.vdocuments.site/reader030/viewer/2022040723/5e3245097b69d4219c7a8946/html5/thumbnails/16.jpg)
blur_x.compute_at_root() blur_x.compute_at(blury, x) blur_x.compute_at(blury, x) .store_at_root()
blur_x.compute_at(blury, x) .vectorize(x, 4)
blur_y.tile(x, y, xi, yi, 8, 8) .parallel(y) .vectorize(xi, 4)
blur_x.compute_at(blury, y) .store_at_root() .split(x, x, xi, 8) .vectorize(xi, 4) .parallel(x)
blur_y.split(x, x, xi, 8) .vectorize(xi, 4) .parallel(x)
blur_x.compute_at(blury, y) .store_at(blury, yi) .vectorize(x, 4)
blur_y.split(y, y, yi, 8) .parallel(y) .vectorize(x, 4)
in blurx bluryin blurx blury in blurx blury
in blurx bluryin blurx blury in blurx blury
Schedule primitives compose to create many organizations
redundant work
locality
parallelism
redundant work
locality
parallelism
redundant work
locality
parallelism
redundant work
locality
parallelism
redundant work
locality
parallelism
redundant work
locality
parallelism
Friday, October 4, 13
![Page 17: Halide: A Language and Compiler for Optimizing Parallelism ...Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines](https://reader030.vdocuments.site/reader030/viewer/2022040723/5e3245097b69d4219c7a8946/html5/thumbnails/17.jpg)
void box_filter_3x3(const Image &in, Image &blury) { __m128i one_third = _mm_set1_epi16(21846); #pragma omp parallel for for (int yTile = 0; yTile < in.height(); yTile += 32) { __m128i a, b, c, sum, avg; __m128i blurx[(256/8)*(32+2)]; // allocate tile blurx array for (int xTile = 0; xTile < in.width(); xTile += 256) { __m128i *blurxPtr = blurx; for (int y = -‐1; y < 32+1; y++) { const uint16_t *inPtr = &(in[yTile+y][xTile]); for (int x = 0; x < 256; x += 8) { a = _mm_loadu_si128((__m128i*)(inPtr-‐1)); b = _mm_loadu_si128((__m128i*)(inPtr+1)); c = _mm_load_si128((__m128i*)(inPtr)); sum = _mm_add_epi16(_mm_add_epi16(a, b), c); avg = _mm_mulhi_epi16(sum, one_third); _mm_store_si128(blurxPtr++, avg); inPtr += 8; }} blurxPtr = blurx; for (int y = 0; y < 32; y++) { __m128i *outPtr = (__m128i *)(&(blury[yTile+y][xTile])); for (int x = 0; x < 256; x += 8) { a = _mm_load_si128(blurxPtr+(2*256)/8); b = _mm_load_si128(blurxPtr+256/8); c = _mm_load_si128(blurxPtr++); sum = _mm_add_epi16(_mm_add_epi16(a, b), c); avg = _mm_mulhi_epi16(sum, one_third); _mm_store_si128(outPtr++, avg);}}}}}
0.9 ms/megapixelC++
Func box_filter_3x3(Func in) { Func blurx, blury; Var x, y, xi, yi;
// The algorithm -‐ no storage, order blurx(x, y) = (in(x-‐1, y) + in(x, y) + in(x+1, y))/3; blury(x, y) = (blurx(x, y-‐1) + blurx(x, y) + blurx(x, y+1))/3;
// The schedule -‐ defines order, locality; implies storage blury.tile(x, y, xi, yi, 256, 32) .vectorize(xi, 8).parallel(y); blurx.compute_at(blury, x).store_at(blury, x).vectorize(x, 8); return blury;}
0.9 ms/megapixelHalide
Friday, October 4, 13
![Page 18: Halide: A Language and Compiler for Optimizing Parallelism ...Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines](https://reader030.vdocuments.site/reader030/viewer/2022040723/5e3245097b69d4219c7a8946/html5/thumbnails/18.jpg)
Local Laplacian Filtersprototype for Adobe Photoshop Camera Raw / Lightroom
Adobe: 1500 lines of expert-optimized C++multi-threaded, SSE3 months of work10x faster than original C++
Halide: 60 lines1 intern-day
2x faster on CPU,7x faster on GPU
Friday, October 4, 13
![Page 19: Halide: A Language and Compiler for Optimizing Parallelism ...Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines](https://reader030.vdocuments.site/reader030/viewer/2022040723/5e3245097b69d4219c7a8946/html5/thumbnails/19.jpg)
… …
… …
Local Laplacian Filters
Adobe: 1500 lines of expert-optimized C++multi-threaded, SSE3 months of work10x faster than original C++
Halide: 60 lines1 intern-day
2x faster on CPU,7x faster on GPU
Friday, October 4, 13
![Page 20: Halide: A Language and Compiler for Optimizing Parallelism ...Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines](https://reader030.vdocuments.site/reader030/viewer/2022040723/5e3245097b69d4219c7a8946/html5/thumbnails/20.jpg)
redundant work
locality
parallelism
The Halide language & compiler
Simpler programs
Faster than hand-tuned code
Scalable across architectures
Decouples algorithm from organizationthrough a scheduling co-language to navigate fundamental tradeoffs.
open source at http://halide-lang.org
Fredo XXX: remind two (three) big messages
Friday, October 4, 13
![Page 21: Halide: A Language and Compiler for Optimizing Parallelism ...Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines](https://reader030.vdocuments.site/reader030/viewer/2022040723/5e3245097b69d4219c7a8946/html5/thumbnails/21.jpg)
The schedule defines producer-consumer interleaving
tone curve
color correctFine-grained fusion optimizes locality for point-wise operations
Friday, October 4, 13
![Page 22: Halide: A Language and Compiler for Optimizing Parallelism ...Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines](https://reader030.vdocuments.site/reader030/viewer/2022040723/5e3245097b69d4219c7a8946/html5/thumbnails/22.jpg)
The schedule defines producer-consumer interleaving
tone curve
color correct
tone curve
gaussian blur
Fine-grained fusion optimizes locality for point-wise operations
Friday, October 4, 13
![Page 23: Halide: A Language and Compiler for Optimizing Parallelism ...Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines](https://reader030.vdocuments.site/reader030/viewer/2022040723/5e3245097b69d4219c7a8946/html5/thumbnails/23.jpg)
The schedule defines producer-consumer interleaving
tone curve
color correct
tone curve
gaussian blurBreadth-first execution minimizes recomputation for large kernels
Fine-grained fusion optimizes locality for point-wise operations
Friday, October 4, 13
![Page 24: Halide: A Language and Compiler for Optimizing Parallelism ...Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines](https://reader030.vdocuments.site/reader030/viewer/2022040723/5e3245097b69d4219c7a8946/html5/thumbnails/24.jpg)
The schedule defines producer-consumer interleaving
tone curve
color correct
tone curve
gaussian blur
tone curve
median filter
Breadth-first execution minimizes recomputation for large kernels
Fine-grained fusion optimizes locality for point-wise operations
Friday, October 4, 13
![Page 25: Halide: A Language and Compiler for Optimizing Parallelism ...Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines](https://reader030.vdocuments.site/reader030/viewer/2022040723/5e3245097b69d4219c7a8946/html5/thumbnails/25.jpg)
The schedule defines producer-consumer interleaving
tone curve
color correct
tone curve
gaussian blur
tone curve
median filter
Fine-grained fusion optimizes locality for point-wise operations
Breadth-first execution minimizes recomputation for large kernels
Tile-level fusion trades off locality vs. recomputation for stencils
Friday, October 4, 13
![Page 26: Halide: A Language and Compiler for Optimizing Parallelism ...Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines](https://reader030.vdocuments.site/reader030/viewer/2022040723/5e3245097b69d4219c7a8946/html5/thumbnails/26.jpg)
The schedule defines producer-consumer interleaving
tone curve
color correct
tone curve
gaussian blur
tone curve
median filter
Breadth-first execution minimizes recomputation for large kernels
Fine-grained fusion optimizes locality for point-wise operations
Tile-level fusion trades off locality vs. recomputation for stencils
Friday, October 4, 13
![Page 27: Halide: A Language and Compiler for Optimizing Parallelism ...Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines](https://reader030.vdocuments.site/reader030/viewer/2022040723/5e3245097b69d4219c7a8946/html5/thumbnails/27.jpg)
The schedule defines producer-consumer interleaving
tone curve
color correct
tone curve
gaussian blur
tone curve
median filter
Breadth-first execution minimizes recomputation for large kernels
Fine-grained fusion optimizes locality for point-wise operations
Tile-level fusion trades off locality vs. recomputation for stencils
Friday, October 4, 13
![Page 28: Halide: A Language and Compiler for Optimizing Parallelism ...Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines](https://reader030.vdocuments.site/reader030/viewer/2022040723/5e3245097b69d4219c7a8946/html5/thumbnails/28.jpg)
The schedule defines producer-consumer interleaving
tone curve
color correct
tone curve
gaussian blur
tone curve
median filter
Breadth-first execution minimizes recomputation for large kernels
Fine-grained fusion optimizes locality for point-wise operations
Tile-level fusion trades off locality vs. recomputation for stencils
Friday, October 4, 13
![Page 29: Halide: A Language and Compiler for Optimizing Parallelism ...Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines](https://reader030.vdocuments.site/reader030/viewer/2022040723/5e3245097b69d4219c7a8946/html5/thumbnails/29.jpg)
The schedule defines producer-consumer interleaving
Breadth-first execution minimizes recomputation for large kernels
Halide’s schedule is designed to span these tradeoffs
Tile-level fusion trades off locality vs. recomputation for stencils
Fine-grained fusion optimizes locality for point-wise operations
tradeoff
redundant work
locality
parallelismFriday, October 4, 13