Dr. Mahout: Analyzing clinical data using scalable and distributed computing

Shannon Quinn, CPCB
[email protected] | [email protected]
November 10, 2011



Page 1: Dr. Mahout: Analyzing clinical data using scalable and distributed computing

Dr. Mahout: Analyzing clinical data using scalable and distributed computing

Shannon Quinn

[email protected] | [email protected]

November 10, 2011

1/29

Page 2: Dr. Mahout: Analyzing clinical data using scalable and distributed computing

Punchline

• Cloud computing for biological and clinical data analysis
• Problem: high-dimensional, noisy!

Image credits: heart tissue: biomedcentral; fMRI: wikipedia; segmentation: biodynamics UCSD; tech2date.com

2/29

Page 3: Dr. Mahout: Analyzing clinical data using scalable and distributed computing

Disclaimer

• Biology jargon
• Academic jargon

3/29

Page 4: Dr. Mahout: Analyzing clinical data using scalable and distributed computing

My Background

• 2nd year Ph.D. student in CPCB Program
• Research in bioimage informatics

4/29

Page 5: Dr. Mahout: Analyzing clinical data using scalable and distributed computing

My Background

Other

5/29

http://collegefootballbelt.com/Logos/

http://s3.amazonaws.com/data.tumblr.com/

Page 6: Dr. Mahout: Analyzing clinical data using scalable and distributed computing

Computational biology and …the cloud?

Biological data
• is BIG
• requires repetitive analysis in chunks
• modeling involves linear algebra and statistics

6/29

Page 7: Dr. Mahout: Analyzing clinical data using scalable and distributed computing

Use case 1: protein behavior

[Figure: timescale of relevant motions, with axis ticks at 10^-15, 10^-12, 10^-9, 10^-6, 10^-3, 10^0. Labeled motions: bond vibration, side-chain rotation, domain shifts/max. catalysis, protein folding, global conformational shifts]

detail / sampling: a common tradeoff…

7/29

Page 8: Dr. Mahout: Analyzing clinical data using scalable and distributed computing

Molecular dynamics

8/29

Page 9: Dr. Mahout: Analyzing clinical data using scalable and distributed computing

“The curse of [MD] dimensionality”

MD := for every atom, for every t …

F = ma

9/29

http://icanhascheezburger.files.wordpress.com/

http://www.pdb.org/pdb/explore/explore.do?structureId=3fxi
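To make the "for every atom, for every t, F = ma" loop concrete, here is a minimal sketch of one common integration scheme (velocity Verlet). The spring force field, the function names, and the parameter values are assumptions for illustration only; a real MD package uses a far richer force field and is not shown in the slides.

```python
import numpy as np

def spring_forces(positions, k=1.0):
    """Toy force field (an assumption): every atom is tethered to the origin by a spring."""
    return -k * positions

def simulate(positions, velocities, masses, dt, n_steps, force_fn=spring_forces):
    """Velocity Verlet: at every time step, every atom gets acceleration a = F / m."""
    positions = positions.copy()
    velocities = velocities.copy()
    acc = force_fn(positions) / masses[:, None]
    trajectory = [positions.copy()]
    for _ in range(n_steps):
        # Advance positions with the current velocity and acceleration.
        positions += velocities * dt + 0.5 * acc * dt**2
        # Recompute forces at the new positions, then finish the velocity update.
        new_acc = force_fn(positions) / masses[:, None]
        velocities += 0.5 * (acc + new_acc) * dt
        acc = new_acc
        trajectory.append(positions.copy())
    return np.stack(trajectory)  # shape: (n_steps + 1, n_atoms, 3)

rng = np.random.default_rng(0)
n_atoms = 5
traj = simulate(
    positions=rng.normal(size=(n_atoms, 3)),
    velocities=np.zeros((n_atoms, 3)),
    masses=np.ones(n_atoms),
    dt=0.01,
    n_steps=1000,
)
print(traj.shape)
```

Even this toy run stores one coordinate array per time step, which hints at why trajectories for real proteins (many atoms, very many steps) become large, high-dimensional data sets.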

Page 10: Dr. Mahout: Analyzing clinical data using scalable and distributed computing

Pipeline for MD trajectory analysis

Find a “surface” of protein shapes

1. MD output
2. Define surface (graph!)
3. Partition surface

10/29

http://www.dillgroup.ucsf.edu/
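The three pipeline steps can be sketched on a single machine as follows. This is only an illustration under stated assumptions: the synthetic "MD output", the Gaussian affinity with bandwidth `sigma`, and the two-way split on the second eigenvector of the normalized Laplacian are choices made here, not details given in the slides.

```python
import numpy as np

def affinity_graph(frames, sigma=2.0):
    """Step 2: Gaussian affinity between every pair of conformations (frames)."""
    flat = frames.reshape(len(frames), -1)
    sq_dists = ((flat[:, None, :] - flat[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma**2))

def spectral_bipartition(W):
    """Step 3: split the graph using the second eigenvector of the normalized Laplacian."""
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    laplacian = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(laplacian)  # eigenvalues come back in ascending order
    fiedler = d_inv_sqrt * eigvecs[:, 1]
    return (fiedler > 0).astype(int)

# Step 1 (stand-in for MD output): 20 frames of a 4-atom toy molecule that
# fluctuates around two distinct shapes.
rng = np.random.default_rng(0)
shape_a = rng.normal(size=(4, 3))
shape_b = shape_a + 2.0
frames = np.concatenate([
    shape_a + 0.05 * rng.normal(size=(10, 4, 3)),
    shape_b + 0.05 * rng.normal(size=(10, 4, 3)),
])

labels = spectral_bipartition(affinity_graph(frames))
print(labels)
```

The dense pairwise affinity matrix grows quadratically with the number of frames, which is one reason to push the matrix work onto a distributed framework.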

Page 11: Dr. Mahout: Analyzing clinical data using scalable and distributed computing

Mahout implementation

Defining surface/graph:

MatrixMultiplicationJob (matrixmult)

TransposeJob (transpose)

DistributedLanczosSolver (svd)

StochasticSVD (ssvd)

Partitioning surface/graph:

SpectralKMeans (spectralkmeans)

Eigencuts (eigencuts)

Kmeans (kmeans)

. . .

11/29
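As a rough illustration of how matrix multiplication, transposition, and a small SVD combine into a scalable decomposition, here is a single-machine NumPy sketch of the general randomized-projection idea that stochastic SVD methods build on. It is not Mahout's distributed implementation; the function name, the `oversample` parameter, and the test matrix are assumptions.

```python
import numpy as np

def randomized_svd(A, rank, oversample=10, seed=0):
    """Approximate the top `rank` singular triplets of A via random projection."""
    rng = np.random.default_rng(seed)
    # Project A onto a small random subspace (one matrix multiplication).
    omega = rng.normal(size=(A.shape[1], rank + oversample))
    Q, _ = np.linalg.qr(A @ omega)
    # Shrink the problem: B has only rank + oversample rows (transpose + multiply).
    B = Q.T @ A
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ U_small)[:, :rank], s[:rank], Vt[:rank]

# Toy input: a 200 x 50 matrix of exact rank 5.
rng = np.random.default_rng(1)
A = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 50))

U, s, Vt = randomized_svd(A, rank=5)
print(np.allclose(U @ np.diag(s) @ Vt, A))
```

The only operations that touch the full matrix are two multiplications, which is what makes this style of decomposition a natural fit for chunked, distributed execution.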

Page 12: Dr. Mahout: Analyzing clinical data using scalable and distributed computing

MD in Mahout conclusion

• MD simulations (x@Home projects)
• Existing Mahout functionality
• Additional algorithms

http://folding.stanford.edu/

12/29

Page 13: Dr. Mahout: Analyzing clinical data using scalable and distributed computing

Use case 2: diseases affecting cilia

What are cilia?
• Hairlike structures
• Keep things moving
• Diseased cilia = [image]

13/29

http://fc06.deviantart.net/fs71/f/2010/177/d/5/Sad_Panda_by_jinxii24.jpg

Page 14: Dr. Mahout: Analyzing clinical data using scalable and distributed computing

Importance of correct diagnoses

• Symptoms look familiar
• Consequences do not

14/29

Page 15: Dr. Mahout: Analyzing clinical data using scalable and distributed computing

Beat pattern of cilia tells a lot!

Clinicians look at cilia motion in making their diagnoses

1. What is the motion called?
2. Can we create a database of motions?

15/29

Page 16: Dr. Mahout: Analyzing clinical data using scalable and distributed computing

Clinicians’ ultimate goal

[Figure panels: Category 1, Category 2, Category 3, each marked “?”]

16/29

Page 17: Dr. Mahout: Analyzing clinical data using scalable and distributed computing

Cilia as dynamic textures

Computer vision

Saisan et al. 2001

Properties

17/29

Page 18: Dr. Mahout: Analyzing clinical data using scalable and distributed computing

The [proposed] pipeline

Step 1
• Clinician captures video and uploads it

http://googolplex.dyndns.org/cilia/

18/29

Page 19: Dr. Mahout: Analyzing clinical data using scalable and distributed computing

The [proposed] pipeline

Step 2
• Mahout job: autoregressive modeling

Appearance model: $y_t \sim C x_t$

Dynamic model: $x_t \sim A_1 x_{t-1} + \dots$

http://web.media.mit.edu/~tristan/phd/dissertation/figures/manifold2.jpg

19/29
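One common way to fit the appearance and dynamic models is an SVD of the frame matrix followed by a least-squares fit of the transition matrix. The sketch below assumes a first-order model (only the A_1 term), noise-free synthetic "video", and hypothetical names such as `fit_dynamic_texture`; the slides do not specify the estimation procedure or the model order.

```python
import numpy as np

def fit_dynamic_texture(Y, n_states):
    """Fit y_t ~ C x_t and x_t ~ A x_{t-1} from Y with shape (n_pixels, n_frames)."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    C = U[:, :n_states]                       # appearance model
    X = s[:n_states, None] * Vt[:n_states]    # hidden state for every frame
    # Dynamic model: least-squares solution of X[:, 1:] ~ A @ X[:, :-1].
    A = X[:, 1:] @ np.linalg.pinv(X[:, :-1])
    return C, A, X

# Synthetic "video": 50 pixels, 60 frames, driven by a slowly decaying 2-state rotation.
rng = np.random.default_rng(0)
theta = 0.3
A_true = 0.99 * np.array([[np.cos(theta), -np.sin(theta)],
                          [np.sin(theta),  np.cos(theta)]])
C_true = rng.normal(size=(50, 2))
states = [np.array([1.0, 0.0])]
for _ in range(59):
    states.append(A_true @ states[-1])
Y = C_true @ np.stack(states, axis=1)

C, A, X = fit_dynamic_texture(Y, n_states=2)
eigvals = np.linalg.eigvals(A)
print(np.abs(eigvals), np.abs(np.angle(eigvals)))
```

The recovered A equals A_true only up to a change of state basis, so the sketch compares eigenvalues (decay rate and rotation angle) rather than the matrix entries themselves.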

Page 20: Dr. Mahout: Analyzing clinical data using scalable and distributed computing

The [proposed] pipeline

Step 3
• Add the transition matrices to cloud library

A = [matrix figure]

20/29

Page 21: Dr. Mahout: Analyzing clinical data using scalable and distributed computing

The [proposed] pipeline

Step 4
• Recompute network with added videos

[Figure: plot with axes “Axis 1” and “Axis 2”; one point marked “?”]

21/29

Page 22: Dr. Mahout: Analyzing clinical data using scalable and distributed computing

One more thing…

What’s really cool about AR models:
• Can you spot the fake?

[Figure panels: Synthetic, Original]

22/29
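A synthetic sequence can be generated from a fitted AR model by running the dynamic model forward and mapping each state through the appearance model. The sketch below assumes a first-order model with optional Gaussian driving noise; the toy `C`, `A`, and `x0` values and the function name are made up for illustration.

```python
import numpy as np

def synthesize(C, A, x0, n_frames, noise_scale=0.0, seed=0):
    """Generate frames y_t = C x_t with x_{t+1} = A x_t + noise."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    frames = []
    for _ in range(n_frames):
        frames.append(C @ x)
        x = A @ x + noise_scale * rng.normal(size=x.shape)
    return np.stack(frames, axis=1)  # shape: (n_pixels, n_frames)

# Toy model (assumed values): 2 hidden states driving 6 "pixels".
A = np.array([[0.9, -0.2],
              [0.2,  0.9]])
C = np.arange(12, dtype=float).reshape(6, 2)
x0 = np.array([1.0, 0.0])

video = synthesize(C, A, x0, n_frames=30)
print(video.shape)
```

With `noise_scale` above zero, repeated calls with different seeds give different but statistically similar sequences, which is the sense in which a synthetic clip can pass for an original.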

Page 23: Dr. Mahout: Analyzing clinical data using scalable and distributed computing

Mahout implementation

Learning autoregressive models:

MatrixMultiplicationJob (matrixmult)

TransposeJob (transpose)

DistributedLanczosSolver (svd)

StochasticSVD (ssvd)

Comparing autoregressive parameters:

SpectralKMeans (spectralkmeans)

Eigencuts (eigencuts)

Frobenius norm

Tensors

? ? ?

23/29
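For the Frobenius norm option, a minimal sketch of comparing a library of transition matrices is shown below. It assumes every model was fitted in the same state basis (otherwise entrywise comparison is not meaningful), and the three example matrices are made up.

```python
import numpy as np

def frobenius_distances(matrices):
    """Pairwise Frobenius-norm distances between same-shaped transition matrices."""
    stacked = np.stack(matrices)
    diffs = stacked[:, None] - stacked[None, :]
    # With a pair of axes and the default ord, np.linalg.norm computes the Frobenius norm.
    return np.linalg.norm(diffs, axis=(-2, -1))

library = [np.eye(2), np.zeros((2, 2)), 2.0 * np.eye(2)]
D = frobenius_distances(library)
print(D)
```

The resulting distance matrix can be turned into affinities and handed to a clustering step, so adding a new video amounts to adding one row and one column.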

Page 24: Dr. Mahout: Analyzing clinical data using scalable and distributed computing

Cilia on Mahout conclusions

• Autoregressive modeling uses linear algebra that is already implemented
• Maintaining AR library requires new functionality
• Mahout framework gives us elbow room

24/29

Page 25: Dr. Mahout: Analyzing clinical data using scalable and distributed computing

Final Thoughts

• Biological / biomedical data is large, high-dimensional, and noisy
• We extend Mahout’s current linear algebra framework (spectral clustering, autoregressive models)
• We provide a cloud framework!

25/29

Page 26: Dr. Mahout: Analyzing clinical data using scalable and distributed computing

Research Group

University of Pittsburgh
• Dr. Chakra Chennubhotla Lab (advisor)

CMU@Qatar
• Dr. Majd Sakr Lab (collaborator)

University of Pittsburgh Medical Center
• Dr. Cecilia Lo Lab (collaborator)

26/29

Page 27: Dr. Mahout: Analyzing clinical data using scalable and distributed computing

Sources

Resources
• Apache Mahout
• Spectrally Clustered

Links
• Categorizing ciliary motion defects (BSEC 2011)
• Eigencuts spectral clustering algorithm

Technical report (coming soon!)

27/29

Page 28: Dr. Mahout: Analyzing clinical data using scalable and distributed computing

Contact

Shannon Quinn
• [email protected] | [email protected]
• http://www.magsolweb.net/

28/29

Page 29: Dr. Mahout: Analyzing clinical data using scalable and distributed computing

Thank you!

29/29

http://icanhascheezburger.files.wordpress.com/