GPU-Accelerated Gaussian Processes for Object Detection
Calum Blair*, John Thompson*, Neil Robertson†
*Institute for Digital Communications, University of Edinburgh
†Institute for Sensors, Signals and Systems, Heriot-Watt University
SSPD 2015, 9th September 2015
Contents
Introduction & Motivation
Method
Related Work
Gaussian Processes and Inference
GPU Acceleration
Results
Summary
Motivation
Pedestrian or object detection in images with realistic confidence measures†
†Blair, Thompson, Robertson, Introspective Classification for Pedestrian Detection, SSPD 2014
Object detection in sonar (SAS) imagery*
*Blair, Thompson, Robertson, Identifying Anomalous Objects in SAS Imagery Using Uncertainty, Fusion 2015
Goals
Reliable and fast object detection: use Gaussian processes as a complete or final-stage classifier, with support vector machines (SVMs) as the baseline. GPU acceleration is needed.
Classification Algorithm Structure
[Diagram: feature vector → model → classification pipeline, with complexity annotations O(n), O(n²), and O(n³) (for training data).]
Gaussian Process Classifiers (GPCs)
Given training data X and matching labels y ∈ {0, 1}, learn the model parameters, then perform probabilistic prediction p(y = +1|x∗) for a new data sample x∗.
Stage 1: define a latent function f(x) with a Gaussian distribution N(µ(x), k(X, x∗)).
The covariance function k can be linear:
$k(x_i, x_j) = \sigma \, x_i \cdot x_j$ .   (1)
or squared-exponential (RBF):
$k(x_i, x_j) = \sigma \exp\!\left( -\dfrac{(x_i - x_j)^2}{2\ell^2} \right)$ .   (2)
where σ and ℓ are learned during training.
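A minimal NumPy sketch (not from the slides) of these two covariance functions, treating (x_i − x_j)² as a squared Euclidean distance for vector-valued samples; the function names and the use of SciPy's cdist are illustrative choices:

```python
import numpy as np
from scipy.spatial.distance import cdist

def linear_kernel(X, Z, sigma=1.0):
    """Linear covariance (1): k(x_i, x_j) = sigma * x_i . x_j."""
    return sigma * X @ Z.T

def squared_exp_kernel(X, Z, sigma=1.0, ell=1.0):
    """Squared-exponential covariance (2) with lengthscale ell."""
    d2 = cdist(X, Z, "sqeuclidean")           # (x_i - x_j)^2 for every pair
    return sigma * np.exp(-d2 / (2.0 * ell**2))
```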
Classifier Algorithm
Estimate distribution of f∗ which best fits x∗:
$p(f_* \mid X, y, x_*) = \int p(f_* \mid X, x_*, \mathbf{f}) \, p(\mathbf{f} \mid X, y) \, d\mathbf{f}$ .   (3)
where f is the distribution of the latent function over X.
Stage 2: ‘squash’ f∗ using a sigmoid with output range [0, 1]:
$\sigma(x) = \dfrac{1}{1 + \exp(-f(x))}$ .   (4)
Final class membership probability π:
$\pi \triangleq p(y = +1 \mid X, y, x_*) = \int \sigma(f_*) \, p(f_* \mid X, y, x_*) \, df_*$ .   (5)
The training process is O(n³), while testing is O(n²).
Graphical Interpretation
[Figure: two panels. Stage 1: {X, x∗} → f, the latent function over the input range; Stage 2: f → π, the sigmoid squashing of f into [0, 1].]
Baseline Algorithm
Support Vector Machine Comparison
Test equation:
$f(x) = \sum_{i=1}^{N} \alpha_i K(x, w_i) + b$   (6)
α, w and b are learned during training. Use a radial basis function (RBF) kernel: the same as (2) above.
Difference from GPCs: w is a condensed model of the training data, whereas X contains all samples seen.
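For comparison, a hedged NumPy sketch of the SVM test equation (6) with the RBF kernel from (2); the support vectors, weights alpha and bias b are assumed to come from a separately trained model (e.g. LIBSVM), and the function name is hypothetical:

```python
import numpy as np

def svm_decision(x, support_vectors, alpha, b, sigma=1.0, ell=1.0):
    """SVM test equation (6) with the RBF kernel of (2).
    support_vectors (the w_i), alpha and b are assumed to come from training."""
    d2 = np.sum((support_vectors - x) ** 2, axis=1)          # ||x - w_i||^2
    return alpha @ (sigma * np.exp(-d2 / (2.0 * ell**2))) + b
```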
Graphical Interpretation
GPC-RBF: X ←→ x∗: comparison of x∗ against the entire training set X.
SVM-RBF: x∗ ←→ w: comparison against a smaller set of support points w.
[Figure: x∗ compared with training samples X1…X5 (GPC) and with support vectors w1…w4 (SVM).]
Accelerating Matrix Computations
LAPACK (Linear Algebra Package) is the standard library. It uses BLAS (Basic Linear Algebra Subprograms): vector, matrix and vector-matrix algorithms for multiplication and linear equations. Highly optimised versions (which tweak the order of operations and cache contents) are available for Intel x86 (MKL, gotoBLAS, ...) and NVIDIA CUDA GPUs (cuBLAS, MAGMA, nvBLAS).
MATLAB, NumPy etc. make BLAS calls: C = AB →
C = sgemm(A, B)
(single-precision general matrix-matrix multiply).
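To make the dispatch concrete, a small illustration (not from the slides) using SciPy's BLAS bindings; scipy.linalg.blas.sgemm is the single-precision GEMM routine the high-level expression maps onto:

```python
import numpy as np
from scipy.linalg import blas

A = np.random.rand(512, 256).astype(np.float32)
B = np.random.rand(256, 128).astype(np.float32)

C_highlevel = A @ B                       # NumPy dispatches this to optimised BLAS
C_blas = blas.sgemm(alpha=1.0, a=A, b=B)  # explicit sgemm call: C = alpha * A * B

assert np.allclose(C_highlevel, C_blas, atol=1e-3)
```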
BLAS Limitations
Must reformulate high-level operations to match the available subroutines:
$\exp\!\left( -\dfrac{(x_i - x_j)^2}{2\ell^2} \right)$   (7)
expands to:
$(x_i - x_j)^2 = x_i^2 + x_j^2 - 2 x_i x_j$ .   (8)
Three separate calls and three separate data accesses: a problem when A and B approach 1 GB. On the GPU, memory is limited and latency dominates, and here X (the training samples) is huge; see the NumPy sketch below. We now describe the modification of the GPC algorithm.
[Figure: matrix layout; X has one sample per row, x∗ one sample per column.]
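A hedged NumPy sketch of the expansion in (8) for vector-valued samples: all pairwise squared distances come from three matrix-level operations, the dominant one being a single GEMM; the function name is illustrative:

```python
import numpy as np

def pairwise_sq_dists(X, X_star):
    """(x_i - x_j)^2 for every training/test pair via the expansion in (8):
    three separate matrix operations, and three passes over the data."""
    xx = np.sum(X ** 2, axis=1)[:, None]        # x_i^2 terms, one per row of X
    ss = np.sum(X_star ** 2, axis=1)[None, :]   # x_j^2 terms, one per test sample
    cross = X @ X_star.T                        # x_i . x_j, a single (s)gemm call
    return xx + ss - 2.0 * cross
```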
Inference
Goal: find π in (5) via the predictive mean E[f∗|X, y, x∗] and the predictive variance V[f∗|X, y, x∗]†. The training and test covariances form part of a larger matrix:
$\begin{bmatrix} \mathbf{f} \\ f_* \end{bmatrix} \sim \mathcal{N}\!\left( 0, \begin{bmatrix} K(X, X) & K(X, x_*) \\ K(x_*, X) & K(x_*, x_*) \end{bmatrix} \right)$   (9)
Define K_X as K(X, X), K_X∗ as K(X, x∗) and K∗ as K(x∗, x∗). Write a conditional Gaussian on (9) as:
$f_* \mid X, x_*, \mathbf{f} \sim \mathcal{N}\!\left( K(x_*, X) K_X^{-1} \mathbf{f}, \; K_* - K(x_*, X) K_X^{-1} K_{X*} \right)$   (10)
We now have one term of the f∗ expression.
†Ch. 3, Rasmussen & Williams, Gaussian Processes for Machine Learning (2006).
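A brief NumPy sketch (an illustration under stated assumptions, not the paper's code) of conditioning the joint Gaussian (9) to obtain (10); it uses a linear solve rather than forming K_X⁻¹ explicitly:

```python
import numpy as np

def condition_joint_gaussian(K_X, K_Xs, K_s, f):
    """Conditional Gaussian (10) from the blocks of the joint covariance (9):
    mean = K(x*, X) K_X^{-1} f,  cov = K(x*, x*) - K(x*, X) K_X^{-1} K(X, x*)."""
    alpha = np.linalg.solve(K_X, K_Xs)   # alpha = K_X^{-1} K(X, x*)
    mean = alpha.T @ f
    cov = K_s - K_Xs.T @ alpha
    return mean, cov
```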
Abridged Maths
Approximate the posterior term with a Gaussian:
$p(\mathbf{f} \mid X, y) \approx q(\mathbf{f} \mid X, y) = \mathcal{N}(\hat{\mathbf{f}}, A^{-1})$   (11)
Obtain predictive mean as:
$\mathbb{E}_q[f_* \mid X, y, x_*] = K_{X*}^{\top} \nabla \log p(y \mid \hat{\mathbf{f}})$ .   (12)
Define predictive variance as:
$\mathbb{V}_q[f_* \mid X, y, x_*] = K_* - K_{X*}^{\top} (K_X + W^{-1})^{-1} K_{X*}$ ,   (13)
Using $W \triangleq -\nabla\nabla \log p(y \mid \mathbf{f})$, $L = \mathrm{cholesky}(I + W^{1/2} K_X W^{1/2})$, and $v = L \backslash (W^{1/2} K_{X*})$, simplify to:
$\mathbb{V}_q[f_* \mid X, y, x_*] = K_* - v^{\top} v$   (14)
See paper for complete derivations
Probabilistic Prediction: Full algorithm
Require: X, x∗, y, f̂, W, L, kernel function k(x_i, x_j)
1: K_X∗ = K(X, x∗)  ◮
2: K∗ = K(x∗, x∗)  ◮
3: E_q[f∗ | X, y, x∗] = K_X∗ᵀ ∇ log p(y | f̂)   // latent mean
4: v = L \ (W^{1/2} K_X∗)  ◮
5: V_q[f∗ | X, y, x∗] = K∗ − vᵀv   // latent variance
6: π̄∗ = ∫ σ(z) N(z | E_q[f∗], V_q[f∗]) dz   // prediction
7: return π̄∗
Figure: Calculate π̄∗ at test time. Compute-heavy lines are marked with ◮.
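Below is a NumPy sketch of the test-time algorithm above; it assumes a logistic likelihood with labels y ∈ {0, 1} (so ∇ log p(y|f̂) = y − σ(f̂)), takes the kernel evaluations of lines 1-2 as inputs, and approximates the integral on line 6 with Gauss-Hermite quadrature (one of several possible approximations):

```python
import numpy as np
from scipy.linalg import solve_triangular

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gpc_predict(K_Xs, K_s, y, f_hat, W_sqrt, L):
    """Predict pi for one test sample (lines 3-7 of the algorithm above).
    K_Xs: (n,) vector K(X, x*);  K_s: scalar K(x*, x*);
    y, f_hat: (n,) training labels and Laplace mode;
    W_sqrt: (n, n) diagonal W^{1/2};  L: cholesky(I + W^{1/2} K_X W^{1/2})."""
    grad = y - sigmoid(f_hat)                            # grad log p(y|f) at f_hat
    mean = K_Xs @ grad                                   # line 3: latent mean
    v = solve_triangular(L, W_sqrt @ K_Xs, lower=True)   # line 4
    var = K_s - v @ v                                    # line 5: latent variance
    z, w = np.polynomial.hermite_e.hermegauss(20)        # line 6: quadrature nodes
    return (w @ sigmoid(mean + np.sqrt(var) * z)) / w.sum()
```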
Optimisation
[Figure: block structure of the joint covariance matrix from (9): K(X, X), K(X, x∗), K(x∗, X), K(x∗, x∗).]
Each sample has d ∼ 5000. The same block-level data is reused for ∼100 sliding windows in an image.
Optimised Matrix Multiplication
[Figure: X has one sample per row; x∗ has one sample per column.]
C = AB: load tiles of A and B into fast memory. (Existing work optimised tile sizes via automated parameter exploration.) For K_X∗ and K_X, A = X: the usual matrix structure, one sample per row, no overlap.
Improvements
When B is x∗: the data are densely packed; instead of one row per window, re-use nearby data already in fast shared memory. Time to access A and C (the resulting K_X∗ matrix) dominates. This gives big reductions in time and memory consumption, with further improvements from instruction-level parallelism; see the sketch after the figure below.
[Figure: X has one sample per row; x∗ is strided over packed data.]
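The packing argument can be illustrated hypothetically in NumPy (rather than CUDA) with overlapping sliding windows: adjacent windows are views into the same buffer, so most of their data is shared and, on the GPU, can be re-used from fast shared memory instead of being re-loaded; the image and window sizes below are arbitrary:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

image = np.random.rand(480, 640).astype(np.float32)
windows = sliding_window_view(image, (128, 64))   # one view per window position

# Overlapping windows are views into the same underlying buffer, not copies,
# so adjacent detection windows share almost all of their data.
print(windows.shape)             # (353, 577, 128, 64)
print(windows.base is not None)  # True: no data was duplicated
```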
Results
Timing: test processing speed on a single image
Accuracy: test on a large dataset
Timing
Algorithm   Processor   Implementation   Time (s)   Speedup
GPC         CPU         MATLAB BLAS       10.28
GPC         GPU         GPGPGPU            2.77     3.7×
SVM         CPU         LIBSVM            66.92
SVM         GPU         cuSVM              1.74     38.5×
Matrix multiplication stage for a 640 × 480 image on the CPU (12-core Intel Xeon X5650, 2.67 GHz) and GPU (NVIDIA GeForce GTX 680, 1536 cores, 2 GB RAM). The cuSVM implementation is faster because the SVM needs fewer support vectors (∼3000 vs ∼14000 GPC training vectors).
Reliability Diagram
Figure: GPC and SVM reliability; classifiers closer to the black line are more reliable.
Conclusion
GPC compared to the baseline SVM: similar speed but a gain in reliability.
Best case is a 3.7× speedup compared to an optimised implementation on the CPU.
Improvements are usually possible even over heavily optimised initial code, when matched to the application.
Code is available for download†.
Questions?
†http://homepages.ed.ac.uk/cblair2/
Training posterior
Require: X, y, f, kernel function k(x_i, x_j)
1: f̂ ≜ E_q[f | X, y] = argmax_f p(f | X, y)   // using Newton's method
2: K_X = K(X, X)
3: W = −∇∇ log p(y | f̂)
4: L = cholesky(I + W^{1/2} K_X W^{1/2})
5: return W, L, f̂, K_X
Figure: Prepare the training posterior. This only needs to be done once and can be re-used during testing.
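A NumPy sketch of this preparation step, assuming a logistic likelihood with y ∈ {0, 1}; the Newton iteration follows Rasmussen & Williams, Alg. 3.1, and the outputs pair with the gpc_predict sketch above:

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gpc_train_posterior(K_X, y, n_iter=20):
    """Laplace approximation of the training posterior (lines 1-5 above)."""
    n = K_X.shape[0]
    f = np.zeros(n)
    for _ in range(n_iter):                       # line 1: Newton's method for f_hat
        pi = sigmoid(f)
        W_sqrt = np.sqrt(pi * (1.0 - pi))         # W = -grad grad log p(y|f), diagonal
        L = cholesky(np.eye(n) + W_sqrt[:, None] * K_X * W_sqrt[None, :], lower=True)
        b = pi * (1.0 - pi) * f + (y - pi)
        a = b - W_sqrt * solve_triangular(
            L.T, solve_triangular(L, W_sqrt * (K_X @ b), lower=True), lower=False)
        f = K_X @ a
    pi = sigmoid(f)                               # lines 3-4 at the converged mode
    W_sqrt = np.sqrt(pi * (1.0 - pi))
    L = cholesky(np.eye(n) + W_sqrt[:, None] * K_X * W_sqrt[None, :], lower=True)
    return np.diag(W_sqrt), L, f, K_X             # line 5: W^{1/2}, L, f_hat, K_X
```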
Derivation of Mean
Laplace approximation: treat the posterior over the training data and labels in our f∗ term (3) as a Gaussian:
$p(\mathbf{f} \mid X, y) \approx q(\mathbf{f} \mid X, y) = \mathcal{N}(\hat{\mathbf{f}}, A^{-1})$ ,   (15)
where
$\hat{\mathbf{f}} = \arg\max_{\mathbf{f}} p(\mathbf{f} \mid X, y)$ ,   (16)
and (where ∇ represents differentiation):
$A = -\nabla\nabla \log p(\mathbf{f} \mid X, y) \big|_{\mathbf{f} = \hat{\mathbf{f}}}$ .   (17)
f̂ can thus be found by applying Bayes' rule to the posterior distribution over the training variables, p(f|X, y) = p(y|f)p(f|X)/p(y|X). Discard p(y|X), since it does not depend on f when maximising. Take the log of p(f|X, y) and differentiate to get the predictive mean:
$\mathbb{E}_q[f_* \mid X, y, x_*] = K_{X*}^{\top} \nabla \log p(y \mid \hat{\mathbf{f}})$ .   (18)
Derivation of Variance
Define predictive variance as:
$\mathbb{V}_q[f_* \mid X, y, x_*] = K_* - K_{X*}^{\top} (K_X + W^{-1})^{-1} K_{X*}$ ,   (19)
using $W \triangleq -\nabla\nabla \log p(y \mid \mathbf{f})$. Defining the symmetric positive definite matrix $B = I + W^{1/2} K_X W^{1/2}$, with $LL^{\top} = B$ so that $L = \mathrm{cholesky}(B)$, and $v = L \backslash (W^{1/2} K_{X*})$, simplify to:
$\mathbb{V}_q[f_* \mid X, y, x_*] = K_* - K_{X*}^{\top} W^{1/2} (LL^{\top})^{-1} W^{1/2} K_{X*}$   (20)
$\mathbb{V}_q[f_* \mid X, y, x_*] = K_* - v^{\top} v$   (21)
The posterior term in π (5) is now approximated as a Gaussian q(f∗|X, y, x∗) with mean E and variance V.
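A small numerical check (hypothetical, not from the paper) that (19), (20) and (21) agree, using random matrices with the required structure:

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

rng = np.random.default_rng(0)
n, m = 50, 3
A = rng.standard_normal((n, n))
K_X = A @ A.T + n * np.eye(n)              # symmetric positive definite K_X
K_Xs = rng.standard_normal((n, m))         # stand-in for K(X, x*)
K_s = np.eye(m)                            # stand-in for K(x*, x*)
W_sqrt = np.diag(np.sqrt(rng.uniform(0.1, 1.0, n)))   # diagonal W^{1/2}

V19 = K_s - K_Xs.T @ np.linalg.solve(K_X + np.linalg.inv(W_sqrt @ W_sqrt), K_Xs)
L = cholesky(np.eye(n) + W_sqrt @ K_X @ W_sqrt, lower=True)
V20 = K_s - K_Xs.T @ W_sqrt @ np.linalg.inv(L @ L.T) @ W_sqrt @ K_Xs
v = solve_triangular(L, W_sqrt @ K_Xs, lower=True)
V21 = K_s - v.T @ v

print(np.allclose(V19, V20), np.allclose(V20, V21))   # True True
```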
The solution of the division involving the lower triangular matrix L on line 4 requires too much memory to obtain any benefit from performing the calculation on a GPU. In our experiments it proved to be faster to execute this on the CPU; the memory limitations on the GPU meant that the test covariance matrix K_X∗ had to be partitioned into very small batches, because of the large size of K_X.