Page 1: Title

Accelerating the ‘fields’ Package in R: Theory and Application

John Paige¹ ([email protected])
Isaac Lyngaas² ([email protected])
Srinath Vadlamani³ ([email protected])
Doug Nychka³ ([email protected])

¹Macalester College
²Florida State University
³National Center for Atmospheric Research

July 31, 2014

Page 2: Outline

Introduction
• Introduction to Problem
• Goals and Motivation

Computation in fields
• Mathematical Foundation

Accelerating fields
• Eigen and Cholesky Decompositions
• Parameter Optimization

Conclusions and Future Work

References

Appendix

Page 3: Analyzing Spatial Data with fields: Kriging

[Figure: "Colorado Average Spring High Temperature (Celsius)" — observations plotted over longitude −108 to −102 and latitude 37 to 41, color scale 5–20 °C.]

Dataset from fields package (Nychka et al., 2014b)


Page 5: Analyzing Spatial Data with fields: Kriging

[Figure: Kriging estimate of Colorado average spring high temperature for λ = 0.103, θ = 5.45.]

Dataset from fields package (Nychka et al., 2014b)


Page 12: Difficult to Analyze Large Datasets

Difficulties in Kriging:
• Kriging with many observations
  • O(n³)
  • 100,000 observations, fixed parameters: ∼8 hours
  • 100,000 observations, 10 parameter samples: ∼80 hours
• Kriging over time for many parameter sets:
  • 5,000 observations, 30 years (monthly data), 20 parameter samples: nearly 14 hours

[Figure: "fields’ CO2 Dataset" — observation locations over longitude −150 to 150 and latitude −50 to 50.]

CO2 dataset in fields package (Nychka et al., 2014b)

Page 13: Research Goals

• Learn Kriging theory
• Accelerate the fields package in R
  • Cholesky and eigen decompositions
  • Spatial parameter estimation (maximum likelihood estimation)


Page 15: Why Focus on fields?

• Made in R (Nychka et al., 2014b)
• Free, open source
• Popular with statisticians
• Easy to use
• Fast (for R)

[Figure: "Spatial Problem Workflow Times" — time (minutes, log scale 0.1–100) vs. number of observations (0–10000) for geoR and for fields (with mKrig).]

geoR and fields packages used in this plot: Diggle and Ribeiro (2007); Nychka et al. (2014b); Ribeiro Jr and Diggle (2001)

Page 16: Why Focus on Matrix Decompositions and Optimization?

• Eigen and Cholesky decompositions take a long time
  • Over 10,000 observations means over 40% of the computation time
• Spatial parameter optimization requires matrix decompositions
• Better optimization means fewer matrix decompositions

Note: running on a Caldera node on Jellystone
• GPUs: 2 Tesla M2070-Q
• CPUs: 2 8-core 2.6-GHz Intel Xeon E5-2670 (Sandy Bridge)

[Figure: "Spatial Surface Estimation Computation Time" — time (minutes, 0–6) vs. number of observations (0–10000) for the complete workflow and for the workflow’s Cholesky decompositions.]

Page 17: Why Focus on Matrix Decompositions and Optimization?

[Figure: "Percent Workflow Time Taken by Cholesky Decompositions" — percent (0–100) vs. number of observations (0–10000), measured on the same Caldera node.]



Page 29: Kriging Set-Up (With No Trend Component)

y_i = g(x_i) + e_i

y_i: observations
x_i: locations
g: Kriging surface
e_i: independent errors following a N(0, σ²) distribution

Assumptions (Nychka et al., 2014a):
• E(g(x)) = 0
  • Kriging surface has zero mean
• E[g(x)g(x′)] = ρ · k(x, x′):
  • k: correlation function (e.g. e^(−||x − x′||/θ))
  • ρ: correlation strength (or signal strength)

Reparameterization:
λ = σ²/ρ: smoothing parameter (1/λ: signal-to-noise ratio)
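The exponential-correlation assumption above is easy to sketch numerically. The following is an illustrative Python/NumPy version — not the package’s R code — and the name `exp_cov` and its arguments are my own labels:

```python
import numpy as np

def exp_cov(X, theta, rho=1.0):
    """Exponential covariance: rho * exp(-||x_i - x_j|| / theta)."""
    # Pairwise Euclidean distances between rows of X (an n x d array of locations)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return rho * np.exp(-d / theta)

# Toy check on three 2-D locations
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
Sigma = exp_cov(X, theta=0.5)
assert np.allclose(np.diag(Sigma), 1.0)   # k(x, x) = 1
assert np.allclose(Sigma, Sigma.T)        # covariance matrices are symmetric
```

For fixed locations, θ controls how quickly correlation decays with distance, and ρ scales the whole matrix; λ enters later through Σ + λI.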


Page 33: Parameter Optimization: Maximum Likelihood Estimation

Definition. A likelihood function, L, gives the chance that a set of observations occurs in a model given the model parameters.

Definition. α̂_MLE is a maximum likelihood estimate of α if α̂_MLE maximizes the data likelihood:

L(α̂_MLE) ≥ L(α), ∀α

The data log-likelihood given in Nychka et al. (2014a) for a given θ and λ is:

ln L(θ, λ) = −(1/2)[ yᵀ(Σ_θ + λI)⁻¹ y + ln|Σ_θ + λI| ] + C
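Dropping the constant C, this log-likelihood can be evaluated directly with one linear solve and one determinant — an illustrative Python/NumPy sketch (not the fields implementation), where `K` plays the role of Σ_θ:

```python
import numpy as np

def log_lik(y, K, lam):
    """ln L(theta, lam) up to the constant C, by a direct O(n^3) solve."""
    M = K + lam * np.eye(len(y))
    quad = y @ np.linalg.solve(M, y)     # y^T (Sigma_theta + lam*I)^{-1} y
    logdet = np.linalg.slogdet(M)[1]     # ln|Sigma_theta + lam*I|
    return -0.5 * (quad + logdet)

# Sanity check: K = I, lam = 0  =>  ln L = -0.5 * y^T y
assert abs(log_lik(np.ones(3), np.eye(3), 0.0) + 1.5) < 1e-12
```

The appendix slides show how the same quantity is computed through Cholesky and eigen decompositions instead of this naive solve.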


Page 39: Parameter Optimization in fields

ln L(θ, λ) = −(1/2)[ yᵀ(Σ_θ + λI)⁻¹ y + ln|Σ_θ + λI| ] + C

• y: vector of observation values
• Σ_θ: covariance matrix, where (Σ_θ)_ij = ρ k(x_i, x_j) = ρ e^(−||x_i − x_j||/θ)
• C: constant

How can this likelihood be computed quickly?
• Krig uses the eigendecomposition
• mKrig uses the Cholesky decomposition


Page 41: Accelerating fields: Eigen and Cholesky Decompositions

• MAGMA (Agullo et al., 2009)
  • Freely available library with multi-GPU computing capability
  • Much faster than default R

Page 42: Accelerated Workflow Time

[Figure: "Spatial Problem Workflow Times" — time (minutes, 0–8) vs. number of observations (0–15000) for default R, one GPU, and two GPUs.]

• > 2,500 observations ⇒ the accelerated workflow is faster
• > 10,000 observations ⇒ ≥ 1.55× speedup (for 1 or 2 GPUs)

Page 43: Accelerated Workflow Time

[Figure: same "Spatial Problem Workflow Times" plot as Page 42.]

• > 13,000 observations ⇒ the two-GPU workflow is faster than one GPU by ≥ 1 second

Page 44: Accelerating fields: Maximum Likelihood Estimation

Goal:
• Quickly find the set of model parameters maximizing the data likelihood

Questions:
• Is the eigen or Cholesky decomposition faster for maximizing likelihood?
• How do different ways of splitting up these decompositions among cores and GPUs affect likelihood maximization time?

Observations from CO2 dataset in fields package (Nychka et al., 2014b)


Page 48: Cholesky Wins Over Eigendecomposition

[Figure: "Cholesky Decomposition Speedup Over Eigendecomposition (Default Implementations)" — speedup (0–140) vs. number of observations (0–15000).]

• Optimizing λ for fixed θ:
  • 10 to 15 Cholesky decompositions
  • One eigendecomposition
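The "10 to 15 Cholesky decompositions" figure reflects a one-dimensional search over λ that pays one factorization per likelihood evaluation. A hypothetical sketch in Python/SciPy (not fields’ actual optimizer; the function name, bracket values, and evaluation counter are mine):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def optimize_lambda(y, Sigma, bracket=(1e-4, 1e2)):
    """1-D search over lambda; each likelihood evaluation costs one Cholesky."""
    n = len(y)
    evals = [0]  # count the O(n^3) factorizations performed

    def neg_ll(log_lam):
        evals[0] += 1
        M = Sigma + np.exp(log_lam) * np.eye(n)
        L = np.linalg.cholesky(M)                 # one Cholesky per evaluation
        z = np.linalg.solve(L, y)                 # z = L^{-1} y
        quad = z @ z                              # y^T M^{-1} y
        logdet = 2.0 * np.sum(np.log(np.diag(L))) # ln|M| = 2 sum ln L_ii
        return 0.5 * (quad + logdet)              # -ln L up to the constant

    res = minimize_scalar(neg_ll, bounds=np.log(bracket), method="bounded")
    return np.exp(res.x), evals[0]
```

Each eigendecomposition, by contrast, can be reused for every λ, which is the trade-off this slide sequence is weighing.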

Page 49: Cholesky Wins Over Eigendecomposition

[Figure: same speedup plot as Page 48.]

• < 20,000 observations ⇒ Cholesky is at least 18× faster

Page 50: Cholesky Wins Over Eigendecomposition

[Figure: same speedup plot as Page 48.]

• The Cholesky decomposition achieved better speedups with GPUs

Page 51: Cholesky Wins Over Eigendecomposition

[Figure: same speedup plot as Page 48.]

• Multidimensional parameter space:
  • The likelihood gradient is easier to estimate with the Cholesky decomposition

Page 52: Splitting Up Cholesky Decompositions in Likelihood Calculations

• Two GPUs per Caldera node, so either:
  • Use both GPUs per Cholesky decomposition
  • Use one GPU per Cholesky decomposition
• Compare with likelihood calculation times using:
  • Default Cholesky decomposition run serially
  • Default Cholesky decomposition parallelized in Rmpi

[Diagram: 16 likelihood evaluations assigned four ways — both GPUs working through all 16 problems in sequence; one GPU per problem (GPU1: problems 1–8, GPU2: problems 9–16); a single core running all 16 serially; and 16 cores taking one problem each.]


Page 56: Splitting Up Likelihood Calculations: Results

[Figure: "Likelihood Calculation Time (For 16 Problems)" — time (seconds, 0–250) vs. number of observations (0–15000) for serial default, 16-core parallel default, serial with 2 GPUs per problem, and parallel with 1 GPU per problem.]

• Substantial speedups versus serial, default R (10,000 observations):
  • Two GPUs in parallel (one per problem): 8.3×
  • Two GPUs in serial (two per problem): 5.2×

Page 57: Splitting Up Likelihood Calculations: Results

[Figure: same "Likelihood Calculation Time (For 16 Problems)" plot as Page 56.]

• Speedups versus parallel, default R (12,500 observations):
  • Two GPUs in parallel (one per problem): 1.67×
  • Two GPUs in serial (two per problem): 1.02×

Page 58: Conclusions

• Evidence suggests that using the Cholesky rather than the eigen decomposition for likelihood maximization is faster (at least for ≤ 20,000 observations)
• Successfully accelerated fields’ spatial workflow computations
  • > 10,000 observations ⇒ ≥ 1.55× speedup (for 1 or 2 GPUs)
• Demonstrated the viability of two-GPU parallelized likelihood calculations using the Cholesky decomposition
  • 2 GPUs, 2 per problem: 5.2× speedup over the current implementation for ≥ 10,000 observations
  • 2 GPUs, 1 per problem: 8.3× speedup over the current implementation for ≥ 10,000 observations
• Splitting up likelihood calculations between two GPUs in parallel is faster than using two GPUs serially or 16 cores in parallel

Page 59: Future Work

• Reach even higher GPU speedups (by improving memory usage in the R wrapper)
• Further investigate Cholesky vs. eigendecomposition maximum likelihood estimation speed
• Implement a multidimensional parallel optimization algorithm
  • Latin hypercube sampling
  • L-BFGS-B (Zhu et al., 1997)
• Fully incorporate the code into the fields package
• Test fields parallelization across nodes

Page 60: References

Emmanuel Agullo, Jim Demmel, Jack Dongarra, Bilel Hadri, Jakub Kurzak, Julien Langou, Hatem Ltaief, Piotr Luszczek, and Stanimire Tomov. Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects. In Journal of Physics: Conference Series, volume 180, page 012037. IOP Publishing, 2009.

Peter Diggle and Paulo Justiniano Ribeiro. Model-based Geostatistics. Springer, 2007.

Douglas Nychka, Reinhard Furrer, and Stephan Sain. Smoothing and spatial statistics: a unified and practical approach with fields. To be published by Springer, 2014a.

Douglas Nychka, Reinhard Furrer, and Stephan Sain. fields: Tools for spatial data, 2014b. URL http://CRAN.R-project.org/package=fields. R package version 7.1.

Paulo J Ribeiro Jr and Peter J Diggle. geoR: A package for geostatistical analysis. R News, 1(2):14–18, 2001.

Ciyou Zhu, Richard H Byrd, Peihuang Lu, and Jorge Nocedal. Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Transactions on Mathematical Software (TOMS), 23(4):550–560, 1997.

Page 61: MAGMA Cholesky Decomposition Times

[Figure: "Default and MAGMA-Accelerated Cholesky Times in R" — time (seconds, 0–40) vs. number of observations (0–15000) for default, MAGMA (1 GPU), and MAGMA (2 GPUs).]

Page 62: Parameter Optimization in fields: Cholesky Decomposition

ln L(θ, λ) = −(1/2)[ yᵀ(Σ + λI)⁻¹ y + ln|Σ + λI| ]

Solving with the Cholesky decomposition (Nychka et al., 2014a):

Note: Σ + λI is symmetric positive definite. Let (Σ + λI) = LLᵀ. Then:

yᵀ(Σ + λI)⁻¹ y = yᵀ(LLᵀ)⁻¹ y

Also:

|Σ + λI| = |LLᵀ| = |L|²
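The Cholesky route above can be sketched numerically — an illustrative Python/SciPy version (not the package’s R code), dropping the constant C:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def log_lik_chol(y, Sigma, lam):
    """ln L up to the constant, via (Sigma + lam*I) = L L^T."""
    M = Sigma + lam * np.eye(len(y))
    c, low = cho_factor(M, lower=True)
    quad = y @ cho_solve((c, low), y)            # y^T (Sigma + lam*I)^{-1} y
    # |M| = |L|^2, so ln|M| = 2 * sum(log(diag(L)))
    logdet = 2.0 * np.sum(np.log(np.diag(c)))
    return -0.5 * (quad + logdet)
```

Triangular solves against L replace the explicit inverse, and the determinant falls out of the factor’s diagonal for free — the reason one factorization suffices per (θ, λ) evaluation.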

Page 63: Parameter Optimization in fields: Eigendecomposition

ln L(θ, λ) = −(1/2)[ yᵀ(Σ + λI)⁻¹ y + ln|Σ + λI| ]

Solving with the eigendecomposition (Nychka et al., 2014a): Let Σ = UDU⁻¹, so that (Σ + λI) = U(D + λI)U⁻¹. Then:

yᵀ(Σ + λI)⁻¹ y = yᵀ(U(D + λI)U⁻¹)⁻¹ y = yᵀ U(D + λI)⁻¹ U⁻¹ y.

Also:

|Σ + λI| = |U(D + λI)U⁻¹| = |U| · |D + λI| · |U⁻¹| = |D + λI|.
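Because Σ is symmetric (so U⁻¹ = Uᵀ), the decomposition above is computed once and then reused for every λ: inverting D + λI and taking its determinant are O(n) operations on the eigenvalues. An illustrative Python/NumPy sketch of that profiling trick (names are mine, not fields internals):

```python
import numpy as np

def log_lik_eigen_profile(y, Sigma, lams):
    """ln L (up to the constant) for many lambda values from ONE eigendecomposition."""
    D, U = np.linalg.eigh(Sigma)       # one O(n^3) decomposition of symmetric Sigma
    z = U.T @ y                        # rotate y once: z = U^{-1} y
    out = []
    for lam in lams:                   # each lambda is now O(n)
        d = D + lam                    # eigenvalues of Sigma + lam*I
        quad = np.sum(z**2 / d)        # y^T (Sigma + lam*I)^{-1} y
        logdet = np.sum(np.log(d))     # ln|Sigma + lam*I| = ln|D + lam*I|
        out.append(-0.5 * (quad + logdet))
    return np.array(out)
```

This is the trade-off behind Krig vs. mKrig: one expensive eigendecomposition amortized over all λ, versus one cheaper Cholesky per λ.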


Page 68: Analyzing Spatial Data with fields: Walkthrough

• Call mKrig or Krig with locations, observations, model parameters, and a covariance function
  • In this case, exponential covariance: θ = 20 (correlation range), 1/λ = 10 (signal-to-noise ratio)
• predictSurface computes the spatial surface
• predictSE computes the standard errors of the surface
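For the zero-mean set-up on Page 29, the surface value at a new location x₀ takes the standard Kriging form ĝ(x₀) = k₀ᵀ(K + λI)⁻¹y, where (k₀)_i = k(x₀, x_i). A minimal Python/NumPy sketch of that predictor — illustrative only, not the mKrig internals:

```python
import numpy as np

def krig_predict(X, y, X0, theta, lam):
    """Zero-mean Kriging predictor: g_hat(x0) = k0^T (K + lam*I)^{-1} y."""
    def k(A, B):
        # exponential correlation between rows of A and rows of B
        d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
        return np.exp(-d / theta)
    K = k(X, X)                                   # n x n correlation at the data
    w = np.linalg.solve(K + lam * np.eye(len(y)), y)
    return k(X0, X) @ w                           # predictions at the new locations
```

With λ = 0 the surface interpolates the observations exactly; larger λ smooths the surface toward zero, which is the role of the smoothing parameter throughout the talk.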
