image 2x shrink sse implementation, benchmarking comparison between merom and penryn

Copyright © 2007 Intel Corporation.

Image 2x Shrink SSE implementation,
benchmarking comparison between Merom and Penryn

Dr. Zvi Danovich, Senior Application Engineer

November – December 2007



Agenda

General description of 2x Shrink
Step 1: weights computation
Step 2: components computation
Benchmarks and conclusions



General description

A pixel has 3 components (r, g, b) and a 4th, 'a', the weight; each is 1 byte long.

Each pair of pixel lines is interpolated into one line that is 2x shorter: 2 opposite pixel pairs are combined into 1 pixel of the shrunk image.

New (interpolated) component: C = ∑(c·a)_{0-3} / ∑(a)_{0-3}, where 'c' is r, g or b.

New weight 'a': A = min(255, ½ ∑(a)_{0-3}).

Preliminary step: reading (loading) m128i_Ev01, m128i_Ev23, m128i_Od01, m128i_Od23.

[Figure: source even line (m128i_Ev01, m128i_Ev23) and source odd line (m128i_Od01, m128i_Od23), each pixel laid out as r, g, b, a, mapped to "shrunk" pixels 0, 1, 2, 3]
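The two formulas above can be checked against a plain scalar model before looking at the SSE steps. The sketch below is not taken from the slides: the function names, the rounding to nearest, and the handling of a block whose weights are all zero are assumptions made for illustration.

```python
def shrink_block(p0, p1, p2, p3):
    """Combine four (r, g, b, a) pixels of one 2x2 block into one pixel."""
    pixels = (p0, p1, p2, p3)
    a_sum = sum(p[3] for p in pixels)
    new_a = min(255, a_sum // 2)              # A = min(255, 1/2 * sum(a))
    if a_sum == 0:
        # Assumption: the slides do not say what happens when all weights are 0.
        return (0, 0, 0, new_a)
    rgb = tuple(
        round(sum(p[k] * p[3] for p in pixels) / a_sum)   # C = sum(c*a) / sum(a)
        for k in range(3)
    )
    return rgb + (new_a,)


def shrink_2x(even_line, odd_line):
    """Shrink one pair of pixel lines into a single line that is half as long."""
    return [
        shrink_block(even_line[i], even_line[i + 1], odd_line[i], odd_line[i + 1])
        for i in range(0, len(even_line) - 1, 2)
    ]


even = [(200, 0, 0, 255), (0, 0, 100, 255), (10, 20, 30, 100), (10, 20, 30, 100)]
odd = [(200, 0, 0, 255), (0, 0, 100, 255), (50, 60, 70, 0), (50, 60, 70, 0)]
result = shrink_2x(even, odd)
```

In the second block the odd-line pixels have weight 0, so they do not influence the colour at all; only the halved weight sum reflects them.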



Step 1: weights computation

1.1 Building the partial sums (a0+a1), (a2+a3), …

Building 8*16bit 'a'-s from the even and odd lines by 2 shuffles and logical 'or':

m128i_8a = | a0 | a1 | a2 | a3 | a4 | a5 | a6 | a7 |

Partial sums by MADD with 8*16bit '1'-s:

MADD( | a0 | a1 | a2 | a3 | a4 | a5 | a6 | a7 | , | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ) = | a0+a1 | a2+a3 | a4+a5 | a6+a7 |
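As a rough scalar model of sub-step 1.1, the sketch below gathers the eight 'a' bytes into 16-bit lanes and applies a MADD-like operation with a vector of ones. The helper names and, in particular, the lane order chosen in `gather_a_lanes` are assumptions; the slide only states that two shuffles and an OR are used.

```python
def gather_a_lanes(even_quad, odd_quad):
    """Collect the 'a' byte of 8 pixels into eight 16-bit lanes.

    Stands in for the '2 shuffles and logical or' of the slide. The lane
    order is an assumption: lanes 0-3 feed the first shrunk pixel,
    lanes 4-7 the second.
    """
    e, o = even_quad, odd_quad
    return [e[0][3], e[1][3], o[0][3], o[1][3],
            e[2][3], e[3][3], o[2][3], o[3][3]]


def madd_epi16(x, y):
    """Model of MADD: multiply 16-bit lanes, add adjacent products (4 x 32-bit)."""
    return [x[i] * y[i] + x[i + 1] * y[i + 1] for i in range(0, 8, 2)]


# Four (r, g, b, a) pixels from the even line and four from the odd line.
even_quad = [(0, 0, 0, 10), (0, 0, 0, 20), (0, 0, 0, 30), (0, 0, 0, 40)]
odd_quad = [(0, 0, 0, 50), (0, 0, 0, 60), (0, 0, 0, 70), (0, 0, 0, 80)]

m128i_8a = gather_a_lanes(even_quad, odd_quad)
partial_sums = madd_epi16(m128i_8a, [1] * 8)   # (a0+a1), (a2+a3), (a4+a5), (a6+a7)
```

Multiplying by ones turns the multiply-add into a pure pairwise addition, which is why a single MADD is enough to halve the number of lanes.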



Step 1: weights computation (cont)

1.2 Building the sums (a0+a1+a2+a3), (a4+a5+a6+a7), … and reciprocals

Perform the same computation for the second pair of pixel quads, obtaining:

| a8+a9 | a10+a11 | a12+a13 | a14+a15 |

Building the final sums using HADD:

HADD( | a0+a1 | a2+a3 | a4+a5 | a6+a7 | , | a8+a9 | a10+a11 | a12+a13 | a14+a15 | ) = | ∑(a)_{0-3} | ∑(a)_{4-7} | ∑(a)_{8-11} | ∑(a)_{12-15} |

Converting the result to floating point (FP) and computing the reciprocals:

| FP 1/∑(a)_{0-3} | FP 1/∑(a)_{4-7} | FP 1/∑(a)_{8-11} | FP 1/∑(a)_{12-15} |

Here we have 4 FP 'a'-sum reciprocals: the normalization coefficients.
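Sub-step 1.2 can be modelled in the same scalar style. This is only a sketch: `hadd_epi32` is a hypothetical helper that mimics a horizontal add of 32-bit lanes, and the reciprocal is computed as an exact division in double precision, whereas the slides do not say whether the original code divides or uses an approximate reciprocal instruction. A zero weight sum is not handled here.

```python
def hadd_epi32(x, y):
    """Model of HADD on 32-bit lanes: add adjacent lanes of x, then of y."""
    return [x[0] + x[1], x[2] + x[3], y[0] + y[1], y[2] + y[3]]


# Partial sums from sub-step 1.1 for the first and second pair of pixel quads.
partial_lo = [30, 110, 70, 150]      # (a0+a1) ... (a6+a7)
partial_hi = [2, 2, 510, 510]        # (a8+a9) ... (a14+a15)

a_sums = hadd_epi32(partial_lo, partial_hi)      # one weight sum per shrunk pixel
a_recips = [1.0 / float(s) for s in a_sums]      # normalization coefficients
```

Each entry of `a_sums` is the total weight of one 2x2 block, so the four reciprocals can be reused unchanged for the r, g and b components in Step 2.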



Step 1: weights computation (cont)

1.3 Building new A0, A1, A2, A3

Computing the new 'a': min(255, ½∑a). Here (∑a)k is the 'a'-sum of shrunk pixel k.

SRAI( | (∑a)0 | (∑a)1 | (∑a)2 | (∑a)3 | , 1) = | ½(∑a)0 | ½(∑a)1 | ½(∑a)2 | ½(∑a)3 |
(arithmetic shift 1 bit to the right: division by 2)

MIN( | ½(∑a)0 | ½(∑a)1 | ½(∑a)2 | ½(∑a)3 | , | 255 | 255 | 255 | 255 | ) = | A0 | A1 | A2 | A3 |

As all values are <= 255, each 32-bit result is equivalent to a single byte.

And, finally, a logical shift to the 4th position:

| A0 | A1 | A2 | A3 | (each A in the 4th byte of its pixel)

This is the basis of the resulting quad of pixels.
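The three operations of sub-step 1.3 map onto three short list transformations. The sketch assumes that "4th position" means the highest byte of a little-endian 32-bit pixel, that is, a left shift by 24 bits; the function name is hypothetical.

```python
def build_new_a(a_sums):
    """Return one 32-bit word per shrunk pixel with A placed in the 4th byte."""
    halved = [s >> 1 for s in a_sums]                   # SRAI by 1: division by 2
    clamped = [min(h, 255) for h in halved]             # MIN with 255
    return [(c << 24) & 0xFFFFFFFF for c in clamped]    # shift to the 4th byte


a_quad = build_new_a([140, 220, 4, 1020])
```

The last input shows why the clamp is needed: four fully opaque pixels sum to 1020, half of which (510) no longer fits in one byte.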



Step 2: components computation

2.1 Computing 4 'b'-s

Building the partial sums (a0b0+a1b1), (a2b2+a3b3), …

Building 8*16bit 'b'-s from the even and odd lines by 2 shuffles and logical 'or':

| b0 | b1 | b2 | b3 | b4 | b5 | b6 | b7 |

Partial sums by MADD with the 8*16bit 'a'-s from the previous step:

MADD( | b0 | b1 | b2 | b3 | b4 | b5 | b6 | b7 | , | a0 | a1 | a2 | a3 | a4 | a5 | a6 | a7 | ) = | a0b0+a1b1 | a2b2+a3b3 | a4b4+a5b5 | a6b6+a7b7 |

Short notation for the result: | ∑(ab)_{0,1} | ∑(ab)_{2,3} | ∑(ab)_{4,5} | ∑(ab)_{6,7} |
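The same MADD-style model used for the weights gives the weighted partial sums when the vector of ones is replaced by the 'a' lanes. The sketch below uses made-up lane values and a hypothetical helper name; it also checks the largest value a lane can reach, which is an observation added here rather than a statement from the slides.

```python
def madd_epi16(x, y):
    """Model of MADD: multiply 16-bit lanes, add adjacent products (4 x 32-bit)."""
    return [x[i] * y[i] + x[i + 1] * y[i + 1] for i in range(0, 8, 2)]


a_lanes = [10, 20, 50, 60, 30, 40, 70, 80]   # 8 x 16-bit 'a'-s from Step 1
b_lanes = [1, 2, 3, 4, 5, 6, 7, 8]           # 8 x 16-bit 'b'-s

ab_partial = madd_epi16(b_lanes, a_lanes)    # (a0b0+a1b1), (a2b2+a3b3), ...

# Largest possible lane value: both inputs are bytes, so at most 2 * 255 * 255.
worst_case = madd_epi16([255] * 8, [255] * 8)[0]
```

Because every input is at most 255, the 32-bit results cannot overflow, which is what makes the integer MADD safe to use before the conversion to floating point.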



Step 2: components computation (cont)

2.2 Building the sums (a0b0+a1b1+a2b2+a3b3), … and final results in FP form

Perform the same computation for the second pair of pixel quads, obtaining:

| ∑(ab)_{8,9} | ∑(ab)_{10,11} | ∑(ab)_{12,13} | ∑(ab)_{14,15} |

Building the final NON-normalized interpolation sums using HADD:

HADD( | ∑(ab)_{0,1} | ∑(ab)_{2,3} | ∑(ab)_{4,5} | ∑(ab)_{6,7} | , | ∑(ab)_{8,9} | ∑(ab)_{10,11} | ∑(ab)_{12,13} | ∑(ab)_{14,15} | ) = | ∑(ab)_{0-3} | ∑(ab)_{4-7} | ∑(ab)_{8-11} | ∑(ab)_{12-15} |

Converting the result to floating point (FP) with cvtepi32_ps and normalizing by multiplication (mul_ps) with the 'a'-sum reciprocals from Step 1:

| FP ∑(ab)_{0-3} | FP ∑(ab)_{4-7} | FP ∑(ab)_{8-11} | FP ∑(ab)_{12-15} | * | FP 1/∑(a)_{0-3} | FP 1/∑(a)_{4-7} | FP 1/∑(a)_{8-11} | FP 1/∑(a)_{12-15} | = | B0 | B1 | B2 | B3 |

Here we have 4 final 'b' values in FP form.
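A scalar sketch of sub-step 2.2 follows. The helper names and the sample numbers are hypothetical, and Python floats are double precision, so the last digits can differ from a single-precision SSE run.

```python
def hadd_epi32(x, y):
    """Model of HADD on 32-bit lanes: add adjacent lanes of x, then of y."""
    return [x[0] + x[1], x[2] + x[3], y[0] + y[1], y[2] + y[3]]


def normalize(ab_partial_lo, ab_partial_hi, a_recips):
    """Non-normalized sums -> FP -> multiplied by the 'a'-sum reciprocals."""
    ab_sums = hadd_epi32(ab_partial_lo, ab_partial_hi)     # sum(ab) per shrunk pixel
    ab_fp = [float(s) for s in ab_sums]                     # role of cvtepi32_ps
    return [s * r for s, r in zip(ab_fp, a_recips)]         # role of mul_ps


a_sums = [4, 8, 2, 1020]
a_recips = [1.0 / s for s in a_sums]                        # from Step 1
b_fp = normalize([6, 6, 8, 16], [50, 50, 130050, 130050], a_recips)
```

Multiplying by a precomputed reciprocal replaces three divisions per pixel (one each for r, g and b) with three multiplications, at the cost of one reciprocal per pixel in Step 1.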



Step 2: components computation (cont)

2.3 Building new B0, B1, B2, B3

Converting the new 'B'-s to integer form with cvtps_epi32:

| B0 | B1 | B2 | B3 | (FP) -> | B0 | B1 | B2 | B3 | (integer)

As all values are <= 255, each 32-bit result is equivalent to a single byte.

Logical shift to the 3rd position and logical sum (OR) with the quad of 'A'-s from the previous step:

| B0 | B1 | B2 | B3 | OR | A0 | A1 | A2 | A3 | = | B0 A0 | B1 A1 | B2 A2 | B3 A3 |

Future resulting quad of pixels: A and B are ready.
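Sub-step 2.3 can be sketched as a rounding conversion followed by a shift and an OR. The sketch assumes that "3rd position" means a left shift by 16 bits within a little-endian 32-bit pixel and that the conversion rounds to nearest, which is the default rounding mode for cvtps_epi32; the function name is hypothetical.

```python
def merge_b_with_a(b_fp, a_quad):
    """Convert FP 'B'-s to integers, move them to the 3rd byte, OR with the 'A'-s."""
    b_int = [round(v) for v in b_fp]                         # role of cvtps_epi32
    return [((b << 16) | a) & 0xFFFFFFFF for b, a in zip(b_int, a_quad)]


a_quad = [0x46000000, 0x6E000000, 0x02000000, 0xFF000000]   # from sub-step 1.3
ba_quad = merge_b_with_a([3.0, 3.2, 49.7, 255.0], a_quad)
```

Since A occupies only the 4th byte and B only the 3rd, the OR never mixes bits of the two components.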



Step 2: components computation (cont)

2.4-2.9 Building new quads of G and R and summing final results

Perform sub-steps 2.1-2.3 for the 'G'-s and 'R'-s; the 'G'-s quad is shifted to the 2nd position before the logical sum, and the 'R'-s quad is not shifted.

| B0 A0 | B1 A1 | B2 A2 | B3 A3 | OR | G0 | G1 | G2 | G3 | OR | R0 | R1 | R2 | R3 | = | R0 G0 B0 A0 | R1 G1 B1 A1 | R2 G2 B2 A2 | R3 G3 B3 A3 |

This final quad of pixels is stored in the resulting image.
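The final packing can be illustrated with plain integers. The sketch assumes little-endian 32-bit pixels, so that position 1 (R, not shifted) is the lowest byte and position 4 (A) the highest; the function names and sample values are hypothetical.

```python
def pack_quad(r, g, b, a):
    """OR four component quads into four 32-bit r, g, b, a pixels."""
    return [
        rv | (gv << 8) | (bv << 16) | (av << 24)    # R unshifted, G 2nd, B 3rd, A 4th
        for rv, gv, bv, av in zip(r, g, b, a)
    ]


def to_memory(pixels):
    """Byte image of the quad as it would be stored on a little-endian machine."""
    return b"".join(p.to_bytes(4, "little") for p in pixels)


quad = pack_quad([1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [255, 254, 253, 252])
stored = to_memory(quad)
```

Read back byte by byte, `stored` has the same r, g, b, a order as the source pixels, so the 16 bytes can be written to the shrunk image with one store.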



Benchmarking (1 thread)

Merom core - WC, 2.66GHz

Penryn core - HPTN, 2.88GHz

[Chart: serial ("Ser") time vs. vector time for each core; surviving bar labels: 6.4, 0.7]

Speed-up on Penryn (7.0x) is 1.5x better than on Merom (4.6x).

It is close to the theoretical limit for 8-16bit vector operations!

VTune CPI = 0.78

VTune CPI = 0.46

Overall speed-up Penryn(Vector)/Merom(Ser) = 8.1x
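The "1.5x better" figure is simply the ratio of the two reported speed-ups. The short check below also compares the Penryn speed-up with a limit of 8, which is an assumption about what the slide means by the theoretical limit (eight 16-bit lanes in one 128-bit register); the slides do not state the limit as a number.

```python
merom_speedup = 4.6      # vector vs. serial on Merom, from the slide
penryn_speedup = 7.0     # vector vs. serial on Penryn, from the slide

relative_gain = penryn_speedup / merom_speedup    # about 1.52

assumed_limit = 8        # assumption: 8 x 16-bit lanes per 128-bit register
fraction_of_limit = penryn_speedup / assumed_limit
```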
