parallelizing spatial data mining algorithms: a case study with multiscale and multigranular...

21
Parallelizing Spatial Data Mining Algorithms: A case study with Multiscale and Multigranular Classification PGAS 2006 Vijay Gandhi, Mete Celik, Shashi Shekhar Army High Performance Computing and Research Center (AHPCRC) University of Minnesota

Upload: blake-horton

Post on 18-Dec-2015

220 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: Parallelizing Spatial Data Mining Algorithms: A case study with Multiscale and Multigranular Classification PGAS 2006 Vijay Gandhi, Mete Celik, Shashi

Parallelizing Spatial Data Mining Algorithms:A case study with Multiscale and

Multigranular Classification

PGAS 2006Vijay Gandhi, Mete Celik, Shashi Shekhar

Army High Performance Computing and Research Center (AHPCRC)

University of Minnesota

Page 2: Parallelizing Spatial Data Mining Algorithms: A case study with Multiscale and Multigranular Classification PGAS 2006 Vijay Gandhi, Mete Celik, Shashi

Overview

PGAS Relevance, Application Domain

Problem Definition Approach Experimental Results Conclusion & Future Work

Page 3: Parallelizing Spatial Data Mining Algorithms: A case study with Multiscale and Multigranular Classification PGAS 2006 Vijay Gandhi, Mete Celik, Shashi

PGAS Relevance

How effective is UPC in parallelizing spatial applications?

How effective is UPC in improving productivity of researchers in spatial domain?

Page 4: Parallelizing Spatial Data Mining Algorithms: A case study with Multiscale and Multigranular Classification PGAS 2006 Vijay Gandhi, Mete Celik, Shashi

Spatial Applications: An Example

Multiscale Multigranular Image Classification

Input Output Images at Multiple Scales

Page 5: Parallelizing Spatial Data Mining Algorithms: A case study with Multiscale and Multigranular Classification PGAS 2006 Vijay Gandhi, Mete Celik, Shashi

Model

= observations = a classification model = log-likelihood (Quality Measure) of M = Penalty function

Calculation of log-likelihood of Uses Expectation Maximization Computationally Expensive

7 hours of Computation time for an input image of size512 x 512 pixels with 4 Classes

MSMG Classification - Formulation

})(2)|({maxargˆ MpenMxlMM

xM

)|( Mxlpen

M

Page 6: Parallelizing Spatial Data Mining Algorithms: A case study with Multiscale and Multigranular Classification PGAS 2006 Vijay Gandhi, Mete Celik, Shashi

Spatial Application: Multiscale Multigranular Image Classification Applications

Land-cover change Analysis Environmental Assessment Agricultural Monitoring

Challenges Expensive computation of Quality Measure i.e.

likelihood Large amount of data Many dimensions

Page 7: Parallelizing Spatial Data Mining Algorithms: A case study with Multiscale and Multigranular Classification PGAS 2006 Vijay Gandhi, Mete Celik, Shashi

Pseudo-code : Serial Version 1.    Initialize parameters. 2.    for each Class 3.    for each Spatial Scale 4.    for each Quad 5.    Calculate Quality Measure 8.    end for Quad 9.    end for Spatial Scale 10. end for Class 11. Post-processing

Q? What are the options for parallelization?

Page 8: Parallelizing Spatial Data Mining Algorithms: A case study with Multiscale and Multigranular Classification PGAS 2006 Vijay Gandhi, Mete Celik, Shashi

Parallelization – Problem Definition Given

Serial version of a Spatial Data Mining Algorithm Likelihood of each specific class at each pixel Class-hierarchy Maximum Spatial Scale

Find Parallel formulation of the algorithm

Objective Scalability e.g. Isoefficiency

Constraints Parallel Platform “UPC”

Page 9: Parallelizing Spatial Data Mining Algorithms: A case study with Multiscale and Multigranular Classification PGAS 2006 Vijay Gandhi, Mete Celik, Shashi

Challenges in ParallelizationDescription of work

Compute Quality Measure for combinations of Class-label, Scale, Quad (Spatial Unit)

Challengesa) Variable workload across computations

of quality measureb) Many dimensions to parallelize

i.e. Class-label, Scale, Quadc) Dependency across scales

Page 10: Parallelizing Spatial Data Mining Algorithms: A case study with Multiscale and Multigranular Classification PGAS 2006 Vijay Gandhi, Mete Celik, Shashi

Class-level Parallelization 1.    Initialize parameters and memory 2.    upc_forall Class 3.    for each Spatial Scale 4.    for each Quad 5.    Calculate Quality Measure 8.    end for Quad 9.    end for Spatial Scale 10. end upc_forall Class 11. Post-processing

Page 11: Parallelizing Spatial Data Mining Algorithms: A case study with Multiscale and Multigranular Classification PGAS 2006 Vijay Gandhi, Mete Celik, Shashi

Class-level Parallelization Disadvantages:

Workload distribution is uneven (Cost of Quality measure changes with Class) Number of parallel processors is restricted

to number of classes

Examples

Page 12: Parallelizing Spatial Data Mining Algorithms: A case study with Multiscale and Multigranular Classification PGAS 2006 Vijay Gandhi, Mete Celik, Shashi

Quad-level Parallelization

1.    Initialize parameters and memory 2.    for each Spatial Scale 3. upc_forall Quad 4.    for each Class 5.    Calculate Quality Measure 6 end for Class 7.   end upc_forall Quad 8. upc_barrier 9.    end for Spatial Scale 11. Post-processing

Page 13: Parallelizing Spatial Data Mining Algorithms: A case study with Multiscale and Multigranular Classification PGAS 2006 Vijay Gandhi, Mete Celik, Shashi

Quad-level Parallelization

Advantages: Workload distribution is more even Greater number of processors can be

used Number of Quads = f (Number of

pixels) Example

Input: 4 Classes, Scale of 6

Input Image Size

Number of Quads

64 x 64 98,304

128 x 128 393,216

512 x 512 6,291,456

1024 x 1024 25,165,824

Page 14: Parallelizing Spatial Data Mining Algorithms: A case study with Multiscale and Multigranular Classification PGAS 2006 Vijay Gandhi, Mete Celik, Shashi

Experimental Design

Input 64 x 64 pixels image (Plymouth County, Massachusetts) 4 class labels (Everything, Woodland, Vegetated, Suburban)

Language UPC

Hardware Platform

Cray X1

Number of Processors

1-8

Page 15: Parallelizing Spatial Data Mining Algorithms: A case study with Multiscale and Multigranular Classification PGAS 2006 Vijay Gandhi, Mete Celik, Shashi

Workload

Input class hierarchy Output Images at Multiple scales

Scale: 64 x 64 Scale: 2 x 2

Page 16: Parallelizing Spatial Data Mining Algorithms: A case study with Multiscale and Multigranular Classification PGAS 2006 Vijay Gandhi, Mete Celik, Shashi

Effect of Number of Processors

Quad-level parallelization gives better speed-up Room for Speed-up for both approaches Q? Class-level << Quad-level. Why?

Speedup Efficiency Plot

0

0.2

0.4

0.6

0.8

1

1.2

1 2 4 8Number of Processors

E f

f i c

i e

n c

y

cy.

Class-level Quad-level

01234567

2 4 8Number of Processors

S p

e e

d u

p

u

Class-level Quad-level

Page 17: Parallelizing Spatial Data Mining Algorithms: A case study with Multiscale and Multigranular Classification PGAS 2006 Vijay Gandhi, Mete Celik, Shashi

Workload Distribution

Quad-level parallelization provides better load-balance Probably because of large number of Quads (~100,000)

Fixed Parameter - Number of processors: 4

Page 18: Parallelizing Spatial Data Mining Algorithms: A case study with Multiscale and Multigranular Classification PGAS 2006 Vijay Gandhi, Mete Celik, Shashi

Conclusions How effective is UPC in parallelizing

Spatial applications? Quad-level parallelization

Speed-up of 6.65 on 8 processors Large number of Quads (98,304)

Class-level parallelization Speed-ups are lower Smaller number of Classes (4)

Page 19: Parallelizing Spatial Data Mining Algorithms: A case study with Multiscale and Multigranular Classification PGAS 2006 Vijay Gandhi, Mete Celik, Shashi

Conclusions How effective is UPC in improving

productivity of researches in spatial domain?

Coding effort was reduced 20 lines of new code in program with base size of 2000 lines 1 person-month

Analysis effort refocused Identify units of parallel work i.e. Quality Measure Identify dimensions to parallelize i.e. Quad, Class, Scale Selecting dimension(s) to parallelize

Dependency Analysis (Ruled out Scale) Number of Units (Larger the better) Load Balancing

6 person-month

Page 20: Parallelizing Spatial Data Mining Algorithms: A case study with Multiscale and Multigranular Classification PGAS 2006 Vijay Gandhi, Mete Celik, Shashi

Future Work

Improve Efficiency Explore Dynamic Load Balancing Other parallel formulations

Spatial Databases / Spatial Data Mining Group AHPCRC

Richard Welsh, NCS University of Boston

Junchang Ju, Eric D. Kolaczyk, Sucharita Gopal

Acknowledgements

Page 21: Parallelizing Spatial Data Mining Algorithms: A case study with Multiscale and Multigranular Classification PGAS 2006 Vijay Gandhi, Mete Celik, Shashi

References E. D. Kolaczyk, J. J., and G. S. Multiscale,

Multigranular Statistical Image Segmentation. Journal of the American Statistical Association, 100, 1358-1369, 2005.

Z. Kato, M. Berthod, and J. Zerubia. A hierarchical Markov random field model and multi-temperature annealing for parallel image classification. Graphical Models and Image Processing, 58(1):18–37, January 1996.

A. Y. Grama, A. Gupta, V. Kumar. Isoefficiency: Measuring the Scalability of Parallel Algorithms and Architectures. IEEE Parallel & Distributed Technology: Systems & Technology, 1, 12-21, 1993.