
1 Parallel Computing 2009 – Vrije Universiteit, Amsterdam

Multimedia Content Analysis on Clusters and Grids

Frank J. Seinstra (fjseins@cs.vu.nl)

Computer Systems Group, Faculty of Sciences,

Vrije Universiteit, Amsterdam

2 Parallel Computing 2009 – Vrije Universiteit, Amsterdam

Overview (1)

Part 1: What is Multimedia Content Analysis (MMCA)?
Part 2: Why parallel computing in MMCA – and how?
Part 3: Software Platform: Parallel-Horus
Part 4: Example – Parallel Image Processing on Clusters

3 Parallel Computing 2009 – Vrije Universiteit, Amsterdam

Overview (2)

Part 5: ‘Grids’ and their specific problems
Part 6: A Software Platform for MMCA on ‘Grids’?
Part 7: Large-scale MMCA applications on ‘Grids’
Part 8: Future research directions

4 Parallel Computing 2009 – Vrije Universiteit, Amsterdam

Introduction

A Few Realistic Problem Scenarios

5 Parallel Computing 2009 – Vrije Universiteit, Amsterdam

A Real Problem…

News broadcast – September 21, 2005:

Police investigation: over 80,000 CCTV recordings
First match found only 2.5 months after the attacks

Automatic analysis?

6 Parallel Computing 2009 – Vrije Universiteit, Amsterdam

Another Real Problem…

Web Video Search (example query: “Sarah Palin”):

Search based on annotations
Known to be notoriously bad (e.g., YouTube)
Instead: search based on video content

7 Parallel Computing 2009 – Vrije Universiteit, Amsterdam

Are these realistic problems?

NFI (Dutch Forensics Institute, Den Haag):
Surveillance camera analysis
Crime scene reconstruction

Beeld&Geluid (Dutch Institute for Sound and Vision, Hilversum):
Interactive access to Dutch national TV history

8 Parallel Computing 2009 – Vrije Universiteit, Amsterdam

But there are many more:

Healthcare

Astronomy

Remote Sensing

Entertainment (e.g. see: PhotoSynth.net)

….

9 Parallel Computing 2009 – Vrije Universiteit, Amsterdam

Part 1

What is Multimedia Content Analysis?

10 Parallel Computing 2009 – Vrije Universiteit, Amsterdam

Multimedia

Multimedia = Text + Sound + Image + Video + ….

Video = image + image + image + ….

In many (not all) multimedia applications:
calculations are executed on each separate video frame independently

So: we focus on Image Processing (+ Computer Vision)

11 Parallel Computing 2009 – Vrije Universiteit, Amsterdam

What is a Digital Image?

“An image is a continuous function that has been discretized in spatial coordinates, brightness and color frequencies”

Most often: 2-D, with ‘pixels’ as scalar or vector values

However: image dimensionality can range from 1-D to n-D
Example (medical): 5-D = x, y, z, time, emission wavelength

Pixel dimensionality can range from 1-D to n-D
Generally: 1-D = binary/grayscale; 3-D = color (e.g. RGB)
n-D = hyper-spectral (e.g. remote sensing by satellites)

12 Parallel Computing 2009 – Vrije Universiteit, Amsterdam

Complete A-Z Multimedia Applications

In: image => Out: ‘meaning’ful result, e.g. “Blue Car”, “Supernova at X,Y,t…”, “Pres. Bush stepping off Airforce 1”

Data transformations along the A-Z chain:
Image ===> (sub-) Image
Image ===> Scalar / Vector Value
Image ===> Array of S/V Values
===> Feature Vector

Low level operations, intermediate level operations, and high level operations: supported by (Parallel-) Horus and Impala

13 Parallel Computing 2009 – Vrije Universiteit, Amsterdam

Low Level Image Processing Patterns (1)

Unary Pixel Operation (example: absolute value)
Binary Pixel Operation (example: addition)
Template / Kernel / Filter / Neighborhood Operation (example: Gauss filter)
N-ary Pixel Operation…
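To make these patterns concrete: a minimal C++ sketch (illustrative only, not actual Horus code) of a unary and a binary pixel operation on an image stored as a flat array:

#include <cmath>
#include <cstddef>
#include <vector>

using Image = std::vector<float>;  // row-major, width * height pixels

// Unary pixel operation (example: absolute value), applied per pixel.
void unaryAbs(Image& dst, const Image& src) {
    for (std::size_t i = 0; i < src.size(); ++i)
        dst[i] = std::fabs(src[i]);
}

// Binary pixel operation (example: addition), element-wise on two images.
void binaryAdd(Image& dst, const Image& a, const Image& b) {
    for (std::size_t i = 0; i < a.size(); ++i)
        dst[i] = a[i] + b[i];
}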

14 Parallel Computing 2009 – Vrije Universiteit, Amsterdam

Low Level Image Processing Patterns (2)

Reduction Operation (example: sum)
N-Reduction Operation (example: histogram)
Geometric Transformation (example: rotation, using a transformation matrix)
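Likewise, a minimal sketch (again not Horus code) of a reduction and an n-reduction over the same flat image representation:

#include <array>
#include <vector>

using Image = std::vector<float>;

// Reduction operation (example: sum): all pixels reduce to one value.
float reduceSum(const Image& src) {
    float sum = 0.0f;
    for (float p : src) sum += p;
    return sum;
}

// N-reduction (example: histogram): each pixel contributes to one of N bins.
// Assumes pixel values lie in [0, 256).
std::array<int, 256> histogram(const Image& src) {
    std::array<int, 256> bins{};  // zero-initialized
    for (float p : src) {
        int b = static_cast<int>(p);
        if (b >= 0 && b < 256) ++bins[b];
    }
    return bins;
}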

15 Parallel Computing 2009 – Vrije Universiteit, Amsterdam

Example Application: Template Matching

Input Image

Template

Result Image

for all images {
    inputIm = readFile ( … );
    unaryPixOpI ( sqrdInIm, inputIm, “set” );
    binaryPixOpI ( sqrdInIm, inputIm, “mul” );

    for all symbol images {
        symbol = readFile ( … );
        weight = readFile ( … );
        unaryPixOpI ( filtIm1, sqrdInIm, “set” );
        unaryPixOpI ( filtIm2, inputIm, “set” );
        genNeighborhoodOp ( filtIm1, borderMirror, weight, “mul”, “sum” );
        binaryPixOpI ( symbol, weight, “mul” );
        genNeighborhoodOp ( filtIm2, borderMirror, symbol, “mul”, “sum” );
        binaryPixOpI ( filtIm1, filtIm2, “sub” );
        binaryPixOpI ( maxIm, filtIm1, “max” );
    }
    writeFile ( …, maxIm, … );
}

See: http://www.cs.vu.nl/~fjseins/ParHorusCode/

16 Parallel Computing 2009 – Vrije Universiteit, Amsterdam

Part 2

Why Parallel Computing in MMCA (and how)?

17 Parallel Computing 2009 – Vrije Universiteit, Amsterdam

The ‘Need for Speed’ in MMCA

Growing interest in international ‘benchmark evaluations’
Task: find ‘semantic concepts’ automatically
Example: NIST TRECVID (200+ hours of video)

A problem of scale:
At least 30-50 hours of processing time per hour of video
Beeld&Geluid: 20,000 hours of TV broadcasts per year
NASA: over 10 TB of hyper-spectral image data per day
London Underground: over 120,000 years of processing…!!!

18 Parallel Computing 2009 – Vrije Universiteit, Amsterdam

High Performance Computing

Solution: parallel & distributed computing at a very large scale

Candidate hardware: general purpose CPUs, GPUs, accelerators, clusters, grids

Question: What type of high-performance hardware is most suitable?

Our initial choice: clusters of general purpose CPUs (e.g. the DAS cluster), for many pragmatic reasons…

19 Parallel Computing 2009 – Vrije Universiteit, Amsterdam

For non-experts in Parallel Computing?

A spectrum of approaches, trading programmer effort against parallel efficiency:

Automatic parallelizing compilers
Extended high level languages (e.g., HPF)
Parallel languages (e.g., Occam, Orca)
Shared memory specifications (e.g., OpenMP)
Message passing libraries (e.g., MPI, PVM)
User transparent parallelization tools
Parallel image processing libraries
Parallel image processing languages (e.g., Apply, IAL)

20 Parallel Computing 2009 – Vrije Universiteit, Amsterdam

Existing Parallel Image Processing Libs

These suffer from many problems:

No ‘familiar’ programming model:
identifying parallelism is still the responsibility of the programmer (e.g. data partitioning [Taniguchi97], loop parallelism [Niculescu02, Olk95])

Reduced maintainability / portability:
multiple implementations for each operation [Jamieson94]
restricted to a particular machine [Moore97, Webb93]

Non-optimal efficiency of parallel execution:
machine characteristics ignored in optimization [Juhasz98, Lee97]
optimization across library calls ignored [all]

21 Parallel Computing 2009 – Vrije Universiteit, Amsterdam

Our Approach

Goal: a sustainable software library for user-transparent parallel image processing

(1) Sustainability:
Maintainability, extensibility, portability (i.e. from Horus)
Applicability to commodity clusters

(2) User transparency:
Strictly sequential API (identical to Horus)
Intra-operation efficiency & inter-operation efficiency

22 Parallel Computing 2009 – Vrije Universiteit, Amsterdam

Part 3 (a)

Software Platform: Parallel-Horus (parallel algorithms)

23 Parallel Computing 2009 – Vrije Universiteit, Amsterdam

What Type(s) of Parallelism to support?

Data parallelism: “exploitation of concurrency that derives from the application of the same operation to multiple elements of a data structure” [Foster, 1995]

Task parallelism: “a model of parallel computing in which many different operations may be executed concurrently” [Wilson, 1995]

24 Parallel Computing 2009 – Vrije Universiteit, Amsterdam

Why Data Parallelism (only)?

Natural approach for low level image processing
Scalability (in general: #pixels >> #different tasks)
Load balancing is easy
Finding independent tasks automatically is hard

In other words: it is just the best starting point… (but not necessarily optimal at all times)

25 Parallel Computing 2009 – Vrije Universiteit, Amsterdam

Many Algorithms Embarrassingly Parallel

Parallel Operation on Image
{
    Scatter Image (1)
    Sequential Operation on Partial Image (2)
    Gather Result Data (3)
}

Works (with minor issues) for: unary, binary, and n-ary operations & (n-) reduction operations; a minimal MPI sketch follows
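A minimal MPI sketch of this scatter / compute / gather pattern for a unary operation (illustrative, not Parallel-Horus code; assumes the pixel count divides evenly over the processes):

#include <cmath>
#include <mpi.h>
#include <vector>

void parallelUnaryAbs(std::vector<float>& image, int root, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int total = static_cast<int>(image.size());
    MPI_Bcast(&total, 1, MPI_INT, root, comm);   // share image size
    int chunk = total / size;

    std::vector<float> part(chunk);
    MPI_Scatter(image.data(), chunk, MPI_FLOAT,  // (1) scatter image
                part.data(),  chunk, MPI_FLOAT, root, comm);

    for (float& p : part) p = std::fabs(p);      // (2) sequential op on part

    MPI_Gather(part.data(),  chunk, MPI_FLOAT,   // (3) gather result data
               image.data(), chunk, MPI_FLOAT, root, comm);
}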

26 Parallel Computing 2009 – Vrije Universiteit, Amsterdam

Other only marginally more complex (1)

Parallel Filter Operation on Image
{
    Scatter Image (1)
    Allocate Scratch (2)
    Copy Image into Scratch (3)
    Handle / Communicate Borders (4)
    Sequential Filter Operation on Scratch (5)
    Gather Image (6)
}

Also possible: ‘overlapping’ scatter, but not very useful in iterative filtering
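Step (4), the border exchange, might look as follows for a row-wise distribution (a sketch under stated assumptions, not Parallel-Horus code; the scratch buffer holds halo rows, then the CPU’s own rows, then halo rows, and localRows >= halo):

#include <mpi.h>
#include <vector>

void exchangeBorders(std::vector<float>& scratch, int width,
                     int localRows, int halo, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int up   = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;  // no-op at edges
    int down = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    int n = halo * width;                                   // floats per border
    float* topHalo    = scratch.data();                     // filled by 'up'
    float* firstOwn   = scratch.data() + n;
    float* lastOwn    = scratch.data() + localRows * width;
    float* bottomHalo = scratch.data() + (halo + localRows) * width;

    // Send my first own rows up; receive the lower neighbor's first rows.
    MPI_Sendrecv(firstOwn,   n, MPI_FLOAT, up,   0,
                 bottomHalo, n, MPI_FLOAT, down, 0, comm, MPI_STATUS_IGNORE);
    // Send my last own rows down; receive the upper neighbor's last rows.
    MPI_Sendrecv(lastOwn,    n, MPI_FLOAT, down, 1,
                 topHalo,    n, MPI_FLOAT, up,   1, comm, MPI_STATUS_IGNORE);
}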

27 Parallel Computing 2009 – Vrije Universiteit, Amsterdam

Other only marginally more complex (2)

Parallel Geometric Transformation on Image
{
    Broadcast Image (1)
    Create Partial Image (2)
    Sequential Transform on Partial Image (3)
    Gather Result Image (4)
}

Potentially faster implementations exist for special cases

28 Parallel Computing 2009 – Vrije Universiteit, Amsterdam

Challenge: Separable Recursive Filtering

A Gauss filter (a template / kernel / filter / neighborhood operation) is separable: the 2-D filter is equivalent to a 1-D filter in one direction… followed by a 1-D filter in the other direction

29 Parallel Computing 2009 – Vrije Universiteit, Amsterdam

Challenge: Separable Recursive Filtering

Separable filters: 1 x 2-D becomes 2 x 1-D
Drastically reduces sequential computation time

Recursive filtering: the result of each filter step (a pixel value) is stored back into the input image
So: a recursive filter uses (part of) its output as input
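The separability part is easy to illustrate; a minimal C++ sketch of a 2-D Gaussian applied as two 1-D passes (recursion omitted; borders clamped for brevity, where Horus offers several border policies):

#include <algorithm>
#include <vector>

using Image = std::vector<float>;

// One 1-D convolution pass over a width x height image, in x or in y.
void convolve1D(Image& dst, const Image& src, int w, int h,
                const std::vector<float>& k, bool horizontal) {
    int r = static_cast<int>(k.size()) / 2;  // kernel radius
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            float sum = 0.0f;
            for (int i = -r; i <= r; ++i) {
                int xx = horizontal ? std::clamp(x + i, 0, w - 1) : x;
                int yy = horizontal ? y : std::clamp(y + i, 0, h - 1);
                sum += k[i + r] * src[yy * w + xx];
            }
            dst[y * w + x] = sum;
        }
}

// Usage: one pass with horizontal = true, then one with horizontal = false,
// using the same 1-D Gauss kernel: O(2k) work per pixel instead of O(k*k).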

30 Parallel Computing 2009 – Vrije Universiteit, Amsterdam

Parallel Recursive Filtering: Solution 1

Pipeline: SCATTER -> FILTER X-dir -> TRANSPOSE -> FILTER Y-dir -> GATHER

Drawback: the transpose operation is very expensive (especially when the number of CPUs is large)

31 Parallel Computing 2009 – Vrije Universiteit, Amsterdam

Parallel Recursive Filtering: Solution 2

Option 1: loop carrying dependence at the final stage (sub-image level):
minimal communication overhead, but full serialization

Option 2: loop carrying dependence at the innermost stage (pixel-column level):
high communication overhead, fine-grained wave-front parallelism

Option 3: tiled loop carrying dependence at an intermediate stage (image-tile level):
moderate communication overhead, coarse-grained wave-front parallelism

32 Parallel Computing 2009 – Vrije Universiteit, Amsterdam

Wavefront parallelism

Wavefront parallelism across CPUs (CPU 0 … CPU 3)

Drawback: partial serialization, i.e. non-optimal use of the available CPUs

33 Parallel Computing 2009 – Vrije Universiteit, Amsterdam

Parallel Recursive Filtering: Solution 3

Multipartitioning: skewed cyclic block partitioning (illustrated for CPU 0 … CPU 3)
Each CPU owns at least one tile in each of the distributed dimensions
All neighboring tiles in a particular direction are owned by the same CPU

34 Parallel Computing 2009 – Vrije Universiteit, Amsterdam

Parallel Recursive Filtering: Solution 3

Full parallelism:
First in one direction… and then in the other…
Border exchange at the end of each sweep
Communication at the end of a sweep is always with the same node

35 Parallel Computing 2009 – Vrije Universiteit, Amsterdam

Part 3 (b)

Software Platform: Parallel-Horus (platform design)

36 Parallel Computing 2009 – Vrije Universiteit, Amsterdam

Parallel-Horus: Parallelizable Patterns

Layered design: Horus on top of Parallelizable Patterns and Parallel Extensions, implemented on MPI

Minimal intrusion:
Sequential API
Re-use as much as possible of the original sequential Horus library code
Parallelization is localized
Easy to implement extensions

37 Parallel Computing 2009 – Vrije Universiteit, Amsterdam

Pattern implementations (old vs. new)

New (parallel) version:

template<class …, class …, class …>
inline DstArrayT* CxPatUnaryPixOp(… dst, … src, … upo)
{
    if (dst == 0)
        dst = CxArrayClone<DstArrayT>(src);

    if (!PxRunParallel()) {            // run sequential
        CxFuncUpoDispatch(dst, src, upo);
    } else {                           // run parallel
        PxArrayPreStateTransition(src, …, …);
        PxArrayPreStateTransition(dst, …, …);
        CxFuncUpoDispatch(dst, src, upo);
        PxArrayPostStateTransition(dst);
    }
    return dst;
}

Old (sequential) version:

template<class …, class …, class …>
inline DstArrayT* CxPatUnaryPixOp(… dst, … src, … upo)
{
    if (dst == 0)
        dst = CxArrayClone<DstArrayT>(src);

    CxFuncUpoDispatch(dst, src, upo);

    return dst;
}

38 Parallel Computing 2009 – Vrije Universiteit, Amsterdam

Inter-Operation Optimization

Don’t do this: Scatter, ImageOp, Gather, Scatter, ImageOp, Gather

Do this: Scatter, ImageOp, ImageOp, Gather (avoid communication)

Lazy Parallelization: applied on the fly!

39 Parallel Computing 2009 – Vrije Universiteit, Amsterdam

Finite State Machine

Communication operations serve as state transition functions between distributed data structure states

State transitions are performed only when absolutely necessary

State transition functions allow correct conversion of legal sequential code to legal parallel code at all times

Nice features:
Requires no a priori knowledge of loops and branches
Can be done on the fly at run-time (with no measurable overhead)
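A minimal sketch of this finite state machine idea (names are illustrative, not the Parallel-Horus API):

enum class DistState { None, Scattered, Replicated };  // simplified state set

struct DistImage {
    DistState state = DistState::None;  // where the data currently lives
};

// State transition function: communicate only when strictly necessary.
void toState(DistImage& im, DistState required) {
    if (im.state == required) return;   // already legal: no communication
    switch (required) {
        case DistState::Scattered:  /* scatter(im);   */ break;
        case DistState::Replicated: /* broadcast(im); */ break;
        case DistState::None:       /* gather(im);    */ break;
    }
    im.state = required;
}

// A pixel operation demands 'Scattered' inputs; back-to-back pixel
// operations thus trigger at most one scatter and no intermediate gathers.
void parUnaryOp(DistImage& dst, DistImage& src) {
    toState(src, DistState::Scattered);
    toState(dst, DistState::Scattered);
    // run the sequential operation on the local partial images
}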

40 Parallel Computing 2009 – Vrije Universiteit, Amsterdam

Part 4

Example – Parallel Image Processing on Clusters

41 Parallel Computing 2009 – Vrije Universiteit, Amsterdam

Example: Curvilinear Structure Detection

Apply an anisotropic Gaussian filter bank to the input image
Maximum response when the filter is tuned to the line direction

Here: 3 different implementations:
(1) fixed filters applied to a rotating image
(2) rotating filters applied to the fixed input image, separable (UV)
(3) rotating filters applied to the fixed input image, non-separable (2D)

Depending on the parameter space: a few minutes to several hours

42 Parallel Computing 2009 – Vrije Universiteit, Amsterdam

Sequential = Parallel (1)

for all orientations theta {
    geometricOp ( inputIm, &rotatIm, -theta, LINEAR, 0, p, “rotate” );
    for all smoothing scales sy {
        for all differentiation scales sx {
            genConvolution ( filtIm1, mirrorBorder, “gauss”, sx, sy, 2, 0 );
            genConvolution ( filtIm2, mirrorBorder, “gauss”, sx, sy, 0, 0 );
            binaryPixOpI ( filtIm1, filtIm2, “negdiv” );
            binaryPixOpC ( filtIm1, sx*sy, “mul” );
            binaryPixOpI ( contrIm, filtIm1, “max” );
        }
    }
    geometricOp ( contrIm, &backIm, theta, LINEAR, 0, p, “rotate” );
    binaryPixOpI ( resltIm, backIm, “max” );
}

IMPLEMENTATION 1

43 Parallel Computing 2009 – Vrije Universiteit, Amsterdam

Sequential = Parallel (2 & 3)

for all orientations theta {
    for all smoothing scales sy {
        for all differentiation scales sx {
            genConvolution ( filtIm1, mirrorBorder, “func”, sx, sy, 2, 0 );
            genConvolution ( filtIm2, mirrorBorder, “func”, sx, sy, 0, 0 );
            binaryPixOpI ( filtIm1, filtIm2, “negdiv” );
            binaryPixOpC ( filtIm1, sx*sy, “mul” );
            binaryPixOpI ( resltIm, filtIm1, “max” );
        }
    }
}

IMPLEMENTATIONS 2 and 3

44 Parallel Computing 2009 – Vrije Universiteit, Amsterdam

Measurements (DAS-1)

[Chart: Performance: execution time (s) vs. Nr. CPUs (0-120) for Conv2D, ConvUV, and ConvRot; extracted values include 2085.985, 666.720, 437.641, 25.837, 20.017, and 4.813 s]

[Chart: Scaled Speedup vs. Nr. CPUs (0-120), with a linear reference, for Conv2D, ConvUV, and ConvRot; extracted values: 90.93, 104.21, 25.80]

512x512 image, 36 orientations, 8 anisotropic filters

=> Part of the efficiency of parallel execution always remains in the hands of the application programmer!

45 Parallel Computing 2009 – Vrije Universiteit, Amsterdam

Measurements (DAS-2)

512x512 image, 36 orientations, 8 anisotropic filters

Execution times (s), with lazy parallelization (LazyPar) on / off:

#Nodes | Conv2D (on) | ConvUV (on) | Conv2D (off) | ConvUV (off)
     1 |     425.115 |     185.889 |      425.115 |      185.889
     2 |     213.358 |      93.824 |      237.450 |      124.169
     4 |     107.470 |      47.462 |      133.273 |       79.847
     8 |      54.025 |      23.765 |       82.781 |       60.158
    16 |      27.527 |      11.927 |       55.399 |       47.407
    24 |      18.464 |       8.016 |       48.022 |       45.724
    32 |      13.939 |       6.035 |       42.730 |       43.050
    48 |       9.576 |       4.149 |       38.164 |       40.944
    64 |       7.318 |       3.325 |       36.851 |       41.265

[Chart: speedup vs. #Nodes (0-64), with a linear reference, for Conv2D and ConvUV with lazy parallelization on and off]

So: lazy parallelization (or: optimization across library calls) is very important for high efficiency!

46 Parallel Computing 2009 – Vrije Universiteit, Amsterdam

Part 5

‘Grids’ and their Specific Problems

47 Parallel Computing 2009 – Vrije Universiteit, Amsterdam

The ‘Promise of The Grid’

1997 and beyond: efficient and transparent (i.e. easy-to-use) wall-socket computing over a distributed set of resources

Compare: the electrical power grid

48 Parallel Computing 2009 – Vrije Universiteit, Amsterdam

Grid Problems (1)

Getting an account on remote compute clusters is hard!
Find the right person to contact…
Hope he/she does not completely ignore your request…
Provide proof of (a.o.) relevance, ethics, ‘trusted’ nationality…
Fill in and sign NDAs, Foreign National Information sheets, official usage documents, etc…
Wait for the account to be created, and for the username to be sent to you…
Hope to obtain an initial password as well…

Getting access to an existing international Grid testbed is easier, but only marginally so…

49 Parallel Computing 2009 – Vrije Universiteit, Amsterdam

Grid Problems (2)

Getting your C++/MPI code to compile and run is hard!
Copying your code to the remote cluster (‘scp’ often not allowed)…
Setting up your environment and finding the right MPI compiler (mpicc, mpiCC, … ???)…
Making the necessary include libraries available…
Finding out how to use the cluster reservation system…
Finding the correct way to start your program (mpiexec, mpirun, … and on which nodes ???)…
Getting your compute nodes to communicate with other machines (generally not allowed)…

So: nothing is standardized yet (not even Globus); a working application in one Grid domain will generally fail in all others

50 Parallel Computing 2009 – Vrije Universiteit, Amsterdam

Grid Problems (3)

Keeping an application running (efficiently) is hard!

Grids are inherently dynamic: networks and CPUs are shared with others, causing fluctuations in resource availability

Grids are inherently faulty: compute nodes and clusters may crash at any time

Grids are inherently heterogeneous: optimization for run-time execution efficiency is by-and-large unknown territory

So: an application that runs (efficiently) at one moment should be expected to fail a moment later

51 Parallel Computing 2009 – Vrije Universiteit, Amsterdam

Realizing the ‘Promise of the Grid’

A set of fundamental methodologies is required, each solving part of the Grid’s complexities

For most of these methodologies solutions exist today:
Ibis: IPL, SmartSockets, JavaGAT, IbisDeploy
Parallel-Horus (or the Ibis version: Jorus)

52 Parallel Computing 2009 – Vrije Universiteit, Amsterdam

Part 6

A Software Platform for MMCA on Grids

53 Parallel Computing 2009 – Vrije Universiteit, Amsterdam

Wide-area Multimedia Services

Parallel-Horus clients and Parallel-Horus servers: services run on clusters world-wide and respond to client requests

Each server runs in a data parallel manner
Client requests are executed fully asynchronously
Task parallel execution of data parallel services

54 Parallel Computing 2009 – Vrije Universiteit, Amsterdam

Situation in 2005

Parallel-Horus clients and servers, connected with C++, MPI, sockets, and SSH (incl. tunneling)

Unstable / faulty communication
Execution on each cluster ‘by hand’
Connectivity problems
Code pre-installed at each cluster site

55 Parallel Computing 2009 – Vrije Universiteit, Amsterdam

Situation in 2009

Parallel-Horus clients and Parallel-Jorus servers, connected with IPL / SmartSockets and deployed with IbisDeploy / JavaGAT

All Java / Ibis
Overall, C++ is 10% faster than Java…
…but Java is much easier to use on a worldwide scale
You have seen the video…

56 Parallel Computing 2009 – Vrije Universiteit, Amsterdam

Part 7

Large-scale MMCA Applications on ‘Grids’

57 Parallel Computing 2009 – Vrije Universiteit, Amsterdam

Color Based Object Recognition (1)

Our solution: place a ‘retina’ over the input image; each of 37 ‘retinal areas’ serves as a ‘receptive field’

For each receptive field:
Obtain 6 local histograms, invariant to shading / lighting
Estimate Weibull parameters ß and γ for each histogram

=> scene description by a set of 37 x 6 x 2 = 444 parameters: a 444-valued feature vector

58 Parallel Computing 2009 – Vrije Universiteit, Amsterdam

Color Based Object Recognition (2)

Learning phase:
The set of 444 parameters is stored in a database
So: learning from 1 example, under a single visual setting

Recognition phase (e.g. “a hedgehog”):
Validation by showing objects under at least 50 different conditions:
Lighting direction
Lighting color
Viewing position

59 Parallel Computing 2009 – Vrije Universiteit, Amsterdam

Color Based Object Recognition (3)

In a laboratory setting (1000 objects):
300 objects were correctly recognized under all (!) visual conditions
The 700 remaining objects were ‘missed’ under extreme conditions only

60 Parallel Computing 2009 – Vrije Universiteit, Amsterdam

61 Parallel Computing 2009 – Vrije Universiteit, Amsterdam

62 Parallel Computing 2009 – Vrije Universiteit, Amsterdam

Results on DAS-2

[Charts: client side speedup vs. Nr. of CPUs, each with a linear reference: single cluster (up to 64 CPUs) and four clusters (up to 96 CPUs)]

Recognition on a single machine: +/- 30 seconds
Using multiple clusters: up to 10 frames per second
Insightful: ‘distant’ clusters can be used effectively

63 Parallel Computing 2009 – Vrije Universiteit, Amsterdam

Part 8

Future Research Directions

64 Parallel Computing 2009 – Vrije Universiteit, Amsterdam

Current / Future Research

Applicability of graphics processors (GPUs) and other accelerators (NVIDIA GPUs, the Cell Broadband Engine, FPGAs): can we make these ‘easily’ programmable?

Concurrent use of the complex set of heterogeneous and hierarchical hardware available in ‘the real world’

65 Parallel Computing 2009 – Vrije Universiteit, Amsterdam

The End

contact & information:

fjseins@cs.vu.nl

http://www.cs.vu.nl/~fjseins/

66 Parallel Computing 2009 – Vrije Universiteit, Amsterdam

Appendix

Intermediate Level MMCA Algorithms

67 Parallel Computing 2009 – Vrije Universiteit, Amsterdam

Feature Vectors

Feature Vector: a labeled sequence of (scalar) values
Each (scalar) value represents a property of the image data
Label: from user annotation or from automatic clustering

Example: a histogram with bin counts 1 2 3 4 5 6 6 6 6 5 4 3 3 can be approximated by a mathematical function, e.g. a Weibull distribution with only 2 parameters ‘ß’ and ‘γ’

Let’s call this concept “FIRE”: <FIRE, ß=0.93, γ=0.13>
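In code, such a feature could be represented roughly as follows (a sketch; the Weibull fitting itself is omitted):

#include <string>
#include <vector>

struct Feature {
    std::string label;  // e.g. "FIRE", from annotation or clustering
    float beta;         // fitted Weibull parameter ß
    float gamma;        // fitted Weibull parameter γ
};

using FeatureVector = std::vector<Feature>;  // a labeled sequence of values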

68 Parallel Computing 2009 – Vrije Universiteit, Amsterdam

Annotation (low level)

Annotation for low level ‘visual words’:
Define N low level visual concepts (e.g. Sky, Road, USA Flag)
Assign concepts to image regions
For each region, calculate a feature vector:

<SKY, ß=0.93, γ=0.13, … >
<SKY, ß=0.91, γ=0.15, … >
<SKY, ß=0.97, γ=0.12, … >
<ROAD, ß=0.89, γ=0.09, … >
<USA FLAG, ß=0.99, γ=0.14, … >

Result: N human-defined ‘visual words’, each having multiple descriptions

69 Parallel Computing 2009 – Vrije Universiteit, Amsterdam

Alternative: Clustering

Example:
Split the image into X regions, and obtain a feature vector for each
All feature vectors have a position in a high-dimensional space
Apply a clustering algorithm to obtain N clusters

Result: N non-human ‘visual words’, each with multiple descriptions

70 Parallel Computing 2009 – Vrije Universiteit, Amsterdam

Feature Vectors for Full Images

Compute the similarity between each image region and each low level visual word (e.g. Sky, Grass, Road)…

…and count the number of region-matches with each visual word, e.g.: 3 x ‘Sky’; 7 x ‘Grass’; 4 x ‘Road’; …

This defines an accumulated feature vector for a full image (see the sketch below)
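A minimal sketch of this accumulation step (illustrative: nearest visual word by squared Euclidean distance in parameter space, then counting matches):

#include <cstddef>
#include <limits>
#include <vector>

using Params = std::vector<float>;  // description of one region or word

// Index of the visual word closest to a region's feature values.
std::size_t nearestWord(const Params& region,
                        const std::vector<Params>& words) {
    std::size_t best = 0;
    float bestDist = std::numeric_limits<float>::max();
    for (std::size_t w = 0; w < words.size(); ++w) {
        float d = 0.0f;
        for (std::size_t i = 0; i < region.size(); ++i) {
            float diff = region[i] - words[w][i];
            d += diff * diff;
        }
        if (d < bestDist) { bestDist = d; best = w; }
    }
    return best;
}

// Accumulated feature vector: counts[w] = number of regions matching word w.
std::vector<int> countMatches(const std::vector<Params>& regions,
                              const std::vector<Params>& words) {
    std::vector<int> counts(words.size(), 0);
    for (const Params& r : regions) ++counts[nearestWord(r, words)];
    return counts;
}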

71 Parallel Computing 2009 – Vrije Universiteit, Amsterdam

Annotation (high level)

Annotation for high level ‘semantic concepts’:
Define M high level visual concepts, e.g.: ‘sports event’, ‘outdoors’, ‘airplane’, ‘president Bush’, ‘traffic situation’, ‘human interaction’, …

For all images in a known (training) set, assign all appropriate high level concepts, e.g.:
outdoors; traffic situation
president Bush; human interaction
outdoors; airplane; traffic situation

72 Parallel Computing 2009 – Vrije Universiteit, Amsterdam

‘Recognition’ by Classification

The new, accumulated feature vectors again define positions in a high-dimensional space

Classification defines a separation boundary in that space (e.g. ‘sports event’ vs. NOT ‘sports event’), given the known high-level concepts

‘Recognition’: position the accumulated feature vector and see on which side of the boundary it is
The distance to the boundary defines a probability (so we can provide ranked results)