grizzly - informal overview - pydata boston 2013

grizzly: statistical analysis with multidimensional dataflows in python. Adrian Heilbut, Boston University and Broad Institute, http://www.empiricist.ca. (graphs for reproducible interactive visualization and analysis). PyData Boston 2013

Upload: adrianheilbut

Post on 25-Jun-2015


TRANSCRIPT

Page 1: grizzly - informal overview - pydata boston 2013

grizzly

statistical analysis with multidimensional dataflows in python

Adrian Heilbut

Boston University and Broad Institute

http://www.empiricist.ca

(graphs for reproducible interactive visualization and analysis)

PyData Boston 2013

Page 2: grizzly - informal overview - pydata boston 2013

1. Motivation Biological discovery from complex, multidimensional data; common features of complex biological data and analyses

2. Problems and Goals Reproducible, efficient, elegant, collaborative, interactive analysis; Data + analysis evolving over time

3. Toy Dataset A simple dataset with hierarchical and temporal structure

4. Strategies Separate concerns; Represent types and structure explicitly; Abstract away data management; Formalize

5. Inspirations OLAP and data cube models; Declarative visualization grammars; Scientific workflow systems

6. Core Ideas Dataflows + Temporal Graphs + Multidimensional Types + Syntactic syrup

7. Toy Demos

8. Implementation

9. Biology application Mechanisms of drug side effects in Parkinson’s Disease

10. Summary and Conclusions

Page 3: grizzly - informal overview - pydata boston 2013

Motivation

• Common and unique features of scientific data

• Examples of complex datasets and analyses in computational biology

• Data analysis desiderata


Page 4: grizzly - informal overview - pydata boston 2013

Biological data is increasingly complex; many datasets and analyses share common structural features

• High-dimensional measurements

• Longitudinal / time-course measurements

• Hierarchical structure of dimensions

• Multiple modalities (expression, protein concentration, phosphorylation)

• Complex experimental designs

• Complex analysis designs

• Complex pre-processing pipelines

• Many parameter choices

• Many cell types

• Many treatments

• Many organisms

• Many patients

• Many replicates

Page 5: grizzly - informal overview - pydata boston 2013

Ex 1. Cancer Profiling and Signatures

Cancer Cell Line Encyclopedia (CCLE), Broad / Novartis, Barretina 2012

[Figure: 1000 cell lines; expression for 20,000 genes, mutation status, drug response]

Page 6: grizzly - informal overview - pydata boston 2013

[Figure: developmental time courses. Time points: P0, P07, P12, P18, P21, P56; P0, P07, P15, P21; E0, E11, E15, E18. Processes: proliferation, differentiation, migration & patterning]

3 reps, 40k probes

Page 7: grizzly - informal overview - pydata boston 2013

Saline Acute (9)

Low Dose Levodopa Chronic (12)

Saline Chronic (11)

6-OHDA

Ascorbate

Day 1 Expression + AIM

CP73

Day 8 Expression + AIM

High Dose Levodopa Acute (10)

High Dose Levodopa Chronic (11)

Saline Chronic (10)

Low Levodopa Chronic (8)

Saline Chronic (7)

6-OHDA

Ascorbate

CP101

Day 8 Expression + AIM

High Levodopa Chronic (8)

Saline Chronic (10)

Change in Expression between treatment groups

Expression vs. AIM (correlation) within treatment groups / cell types

Statistics (per gene)

Expression vs. AIM (correlation) within combined treatment groups

~ 23,000 x 200 matrix of stats for different contrasts between groups

Page 8: grizzly - informal overview - pydata boston 2013

Unique characteristics of scientific data

• Relatively short half-life of data and projects

• Uncertain and complex analysis methods

• Constantly changing data

• Lots of internal and external structure over dimensions

• Teams with diverse backgrounds and skills over multiple institutions and locations

• Communication of data is a primary goal

• High-risk and high-value outcomes: project selection, experimental follow-up, clinical decisions

The distinctive characteristics, uses, and problems of scientific data analysis motivate the need for tailored abstractions and tools

Page 9: grizzly - informal overview - pydata boston 2013

Desiderata for Data Analysis

• Correctness

• Thoroughness (scientific hypothesis space + analysis space)

• Reproducibility

• Verifiability (analysis and underlying data, others and oneself)

• Clarity

• Provenance (of the data, and of the analysis)

• Interactivity (Exploration, Drill-down)

• Computational Efficiency

• Scientist Efficiency

Page 10: grizzly - informal overview - pydata boston 2013

Vision

Every figure, every table, and every quantitative claim in a scientific analysis or publication should be verifiable and explorable

it should link to an understandable, executable, modifiable representation of the data analysis pipeline by which it was generated

one should be able to trace back all the way to the primary experimental data

it should be easy and fun to play with

Page 11: grizzly - informal overview - pydata boston 2013

Problems and Goals

Errors have serious consequences

Practical problems in day-to-day analysis

Unmet need for better tools


Page 12: grizzly - informal overview - pydata boston 2013
Page 13: grizzly - informal overview - pydata boston 2013

Mistakes even happen in Cambridge...

Reinhart / Rogoff

Herndon, Ash, Pollin

[Figure panels: Original, Correct]

Page 14: grizzly - informal overview - pydata boston 2013

it’s even worse than it appears...

Kimball, 2013

The ability to easily drill down to view and assess the underlying data is critical

Page 15: grizzly - informal overview - pydata boston 2013

Elements of statistical analysis

Input data

statistical algorithms

output data

visualizations

summary tables

Page 16: grizzly - informal overview - pydata boston 2013

Version 2.

[Diagram: many input data sets feeding several statistical algorithms, which produce many output data sets]

Page 17: grizzly - informal overview - pydata boston 2013

Version 247... (ah_2013_09_13_v247_3-17am)

[Diagram: many input data sets feeding many statistical algorithms, which produce many output data sets]

Page 18: grizzly - informal overview - pydata boston 2013

v247_figs.pdf: 75 MB (450 pages)

v247_table_1.tab (shown 10 times)

Page 19: grizzly - informal overview - pydata boston 2013

Toy Dataset

Multidimensional profiling of fermentation metabolites of S. cerevisiae


Page 20: grizzly - informal overview - pydata boston 2013

Beer ratings: BeerAdvocate.com & RateBeer.com, via Stanford SNAP & a very kind blogger

Multidimensional: Appearance, Aroma, Palate, Taste, Overall

Hierarchies:

Location -> Brewery -> Beer

Beer style -> Beer

Temporal

Toy Dataset: Multidimensional profiling of fermentation metabolites of S. cerevisiae

Page 21: grizzly - informal overview - pydata boston 2013

Strategies

• Separate concerns

• Abstract away data management problems

• Formalize

• Optimize representations (logical and physical)


Page 22: grizzly - informal overview - pydata boston 2013

Separation of Concerns

• Each of these components evolves over time

• Each may be modified independently by different people with different goals

Input data

statistical algorithms

output data

visualizations

summary tables

Page 23: grizzly - informal overview - pydata boston 2013

Abstract and automate data management

Deciding and remembering how to name columns and files and track changes over time is not what I’d like to spend time on

Especially since I’ll probably do it inconsistently with what I decided to do last week

If the system is responsible for persisting data, caching and memoization can be done automatically.
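A minimal sketch of what automatic caching could look like when the system, rather than the analyst, names and persists results. This is not Grizzly's implementation: the cache directory, the `memoized` decorator, and the `mean_rating` step are all hypothetical, and the key is derived only from the step name and its parameters.

```python
import hashlib
import json
import pickle
import tempfile
from pathlib import Path

# Hypothetical cache location; a real system would choose its own storage.
CACHE_DIR = Path(tempfile.mkdtemp(prefix="analysis_cache_"))


def cache_key(func_name, params):
    """Derive a stable key from the step name and its parameters."""
    payload = json.dumps({"func": func_name, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def memoized(func):
    """Persist results on disk so calls with equal parameters are reused."""
    def wrapper(**params):
        path = CACHE_DIR / (cache_key(func.__name__, params) + ".pkl")
        if path.exists():
            with path.open("rb") as fh:
                return pickle.load(fh)
        result = func(**params)
        with path.open("wb") as fh:
            pickle.dump(result, fh)
        return result
    return wrapper


calls = []  # records how often the function body actually runs


@memoized
def mean_rating(ratings, scale):
    calls.append(1)
    return scale * sum(ratings) / len(ratings)


first = mean_rating(ratings=[3, 4, 5], scale=1.0)
second = mean_rating(ratings=[3, 4, 5], scale=1.0)  # served from the cache
```

Note that the file name is generated, not chosen by hand. The sketch does not hash the function body, so editing a step would not invalidate its cached results; a fuller design would need to account for that.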

Page 24: grizzly - informal overview - pydata boston 2013

Logical and physical representations matter

• Choice of representation and notation has a major effect on ease and efficiency with which concepts can be manipulated, by either a person or a computer

• Given our goals for an analysis system, and engineering instinct to separate independent concerns, what are optimal representations for

• data?

• analysis programs?

• visualizations and summary tables?

Page 25: grizzly - informal overview - pydata boston 2013

[Diagram: many input data sets feeding many statistical algorithms, which produce many output data sets]

How do scientists actually think about analyses?

Page 26: grizzly - informal overview - pydata boston 2013

Inspirations (and their deficiencies..)

1. OLAP (On-Line Analytical Processing) and MDX (Multidimensional Expressions)

2. Tableau / Polaris

3. Scientific workflow systems

VisTrails, KNIME

Galaxy, Genepattern

Page 27: grizzly - informal overview - pydata boston 2013

1: OLAP (on-line analytical processing)

Page 28: grizzly - informal overview - pydata boston 2013

2. Declarative Visualization Grammars (Polaris/Tableau; Stolte 2003)

• key idea: declarative specification of visualizations is possible and works well

• recent focus has been on business analytics, rather than statistical graphics

• assumes a static, structured database (i.e. OLAP star schema); Stolte 2000

Page 29: grizzly - informal overview - pydata boston 2013

3. Scientific Workflow Systems

VisTrails

Page 30: grizzly - informal overview - pydata boston 2013

Hypothesis

Careful design and selection of representations for data, programs, and visualizations will make it possible to satisfy our data analysis objectives:

• multidimensional cubes with static, semantic types for conceptual representation of data

• directed acyclic graphs of functions with static, multidimensional input and output type signatures for our statistical programs

• declarative queries to generate summary tables

• declarative visualization grammar to generate graphics

(this is not how most researchers represent their analyses today)

Correctness, Thoroughness, Reproducibility, Verifiability, Clarity, Provenance, Interactivity, Computational Efficiency, Scientist Efficiency

Page 31: grizzly - informal overview - pydata boston 2013

Core Ideas

Multidimensional Cubes and OLAP

Semantic Types

Dataflow Programming

Page 32: grizzly - informal overview - pydata boston 2013

Data consists of facts about the world.

1 5.5 3 3 4 5

2 6 2 3 2 2

3 8 5 5 4 4.5

ceci n’est pas data

Page 33: grizzly - informal overview - pydata boston 2013

Data consists of facts about the world.

BeerID  ABV  Smell  Color  Taste  Overall
1       5.5  3      3      4      5
2       6    2      3      2      2
3       8    5      5      4      4.5

Page 34: grizzly - informal overview - pydata boston 2013

Facts lie in specific domains defined by the structure of the real world or experimental design

Column   Type           Notes
BeerID   Integer        BeerAdvocate BeerID
ABV      float          % EtOH
Smell    ordinal (1-5)  5 is best
Color    ordinal (1-5)  5 is best
Taste    ordinal (1-5)  5 is best
Overall  ordinal (1-5)  5 is best

BeerID  ABV  Smell  Color  Taste  Overall
1       5.5  3      3      4      5
2       6    2      3      2      2
3       8    5      5      4      4.5

Page 35: grizzly - informal overview - pydata boston 2013

There are a number of possible representations; logically but not practically equivalent

Column   Type           Notes
BeerID   Integer        BeerAdvocate BeerID
ABV      float          % EtOH
Smell    ordinal (1-5)  5 is best
Color    ordinal (1-5)  5 is best
Taste    ordinal (1-5)  5 is best
Overall  ordinal (1-5)  5 is best

Wide form:

BeerID  ABV  Smell  Color  Taste  Overall
1       5.5  3      3      4      5
2       6    2      3      2      2
3       8    5      5      4      4.5

Long form:

BeerID  Measure  Value
1       ABV      5.5
1       Smell    3
1       Color    3
1       Taste    4
1       Overall  5
2       ABV      6
2       Smell    2
2       Color    3
2       Taste    2
2       Overall  2
3       ABV      8
3       Smell    5
3       Color    5
3       Taste    4
3       Overall  4.5

cf. pandas reshaping (melt / pivot), R reshape melt/cast
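As a small illustration of moving between the two logically equivalent layouts, here is a sketch using pandas `melt` and `pivot` on the three-beer table from this slide. The variable names are only for illustration.

```python
import pandas as pd

# Wide form: one row per beer, one column per measure (values from the slide).
wide = pd.DataFrame(
    {
        "BeerID": [1, 2, 3],
        "ABV": [5.5, 6.0, 8.0],
        "Smell": [3.0, 2.0, 5.0],
        "Color": [3.0, 3.0, 5.0],
        "Taste": [4.0, 2.0, 4.0],
        "Overall": [5.0, 2.0, 4.5],
    }
)

# Wide -> long: one row per (BeerID, Measure) pair.
long_df = wide.melt(id_vars="BeerID", var_name="Measure", value_name="Value")

# Long -> wide: pivot back, then restore the original column order.
round_trip = (
    long_df.pivot(index="BeerID", columns="Measure", values="Value")
    .reset_index()[list(wide.columns)]
)
round_trip.columns.name = None
```

The round trip recovers the same values, which is the sense in which the layouts are logically equivalent; which one is practical depends on the computation.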

Page 36: grizzly - informal overview - pydata boston 2013

Data Representations

• Scientific / statistical data is usually in matrix format, and it must be for efficient storage and computation

• Relational model is good for precisely encoding logical structure of data, but

• moving between relations and matrices is cumbersome

• defining a relational schema for all intermediate data would be a lot of work, especially as it changes over time

• on its own, the relational model does not explicitly represent semantics and units

Page 37: grizzly - informal overview - pydata boston 2013

Conceptual Model: OLAP Data Cubes

Cartesian product of a set of dimensions (finite discrete sets) defines an N-dimensional grid

A multidimensional dataset is a function mapping locations in that grid to typed values called measures (identities of the measures can also be considered as just a special kind of dimension)

[Figure: two example cubes. Beer cube: dimensions Beer ID, UserID, Time; measure: overall beer rating. Expression cube: dimensions Gene, Brain Region, Stage of Development; measure: log2 gene expression]

Page 38: grizzly - informal overview - pydata boston 2013

Conceptual Model: Data Cubes as functions mapping dimensions to measures

def BeerRatingsByUser(UserID, BeerID):
    return (Taste, Smell, Color, Texture, Overall)

def BeerRatingsByBeer(BeerID):
    return (mean Taste, mean Smell, mean Color, mean Texture, mean Overall)

def ExpressionBySample(Gene, Region, SampleID):
    return (log2 expression)

def ExpressionByRegionTime(Gene, Region, Timepoint):
    return (median expression, mean expression, std deviation, median abs deviation, # replicates)
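The signatures above are conceptual. One way to make the idea concrete, as a sketch only, is a pandas DataFrame whose index levels are the dimensions and whose columns are the measures. The ratings below are made up, and only two of the measures are included to keep it short.

```python
import pandas as pd

# Made-up ratings, purely for illustration.
ratings = pd.DataFrame(
    {
        "UserID": ["u1", "u1", "u2", "u2"],
        "BeerID": [1, 2, 1, 2],
        "Taste": [4.0, 2.0, 5.0, 3.0],
        "Overall": [5.0, 2.0, 4.0, 3.0],
    }
).set_index(["UserID", "BeerID"])  # the dimensions become the index


def beer_ratings_by_user(user_id, beer_id):
    """Cube lookup: (UserID, BeerID) -> (Taste, Overall)."""
    row = ratings.loc[(user_id, beer_id)]
    return row["Taste"], row["Overall"]


# Aggregating away the UserID dimension yields a lower-dimensional cube.
by_beer = ratings.groupby(level="BeerID").mean()


def beer_ratings_by_beer(beer_id):
    """Cube lookup: (BeerID) -> (mean Taste, mean Overall)."""
    row = by_beer.loc[beer_id]
    return row["Taste"], row["Overall"]
```

For example, `beer_ratings_by_beer(1)` averages the two users' ratings of beer 1.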

Page 39: grizzly - informal overview - pydata boston 2013

Hierarchies

Dimensions are related to each other in structures that reflect:

• the nature of the world

• experimental methods and designs

• analysis processes and decisions

These hierarchical relationships are critical to understanding and performing analyses, and need to be represented explicitly.
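A sketch of representing a hierarchy explicitly, using the toy dataset's Location -> Brewery -> Beer structure. The beers, breweries, locations, and ratings are hypothetical, and `roll_up` is an illustrative helper, not part of any library.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical hierarchy edges, child -> parent, one mapping per level.
beer_to_brewery = {
    "beer_a": "brewery_x",
    "beer_b": "brewery_x",
    "beer_c": "brewery_y",
}
brewery_to_location = {"brewery_x": "Boston", "brewery_y": "Portland"}

# Made-up overall ratings per beer.
overall = {"beer_a": 4.0, "beer_b": 3.0, "beer_c": 5.0}


def roll_up(values, child_to_parent, aggregate=mean):
    """Aggregate values at one level of the hierarchy up to its parent level."""
    grouped = defaultdict(list)
    for child, value in values.items():
        grouped[child_to_parent[child]].append(value)
    return {parent: aggregate(vals) for parent, vals in grouped.items()}


by_brewery = roll_up(overall, beer_to_brewery)
by_location = roll_up(by_brewery, brewery_to_location)
```

Because the hierarchy is data rather than a naming convention, the same `roll_up` works at every level. Note that rolling up means of means is not the same as averaging over all beers in a location; the aggregate has to be chosen with that in mind.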

Page 40: grizzly - informal overview - pydata boston 2013

Multidimensional Semantic Types

1970s / 80s: Semantic Database formalisms

Specify different kinds of relationships and interactions between objects (eg. containment, is-a, relations / cross-products)

Overshadowed by ER model and later, UML..

1990s: OLAP

Page 41: grizzly - informal overview - pydata boston 2013

Dataflow

Lots of domains model computation as ‘declarative’ dataflows

circuit design

audio / video processing

Page 42: grizzly - informal overview - pydata boston 2013

Grizzly Computation Model

Directed Acyclic Graph of processing nodes

Inputs and outputs of every node are typed cubes

Function nodes add type information to describe their output dimensions

‘Apply’ nodes propagate to their outputs the types of any input dimensions that they don’t modify

Computation is declarative / intensional, not imperative; nodes automatically process whatever is on their inputs, like an electrical circuit

(ReviewID, BeerID) --> (Appearance, Aroma, Palate, Taste, Overall)

CalcMedian

Ratings(BeerID) --> (Appearance, Aroma, Palate, Taste, Overall)

(ReviewID, BeerID, SourceID) --> (Appearance, Aroma, Palate, Taste, Overall)

(SourceID, BeerID) --> (MedianAppearance, MedianAroma, MedianPalate, MedianTaste, MedianOverall)

Apply
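A minimal sketch of the computation model described on this slide: a DAG of nodes, each carrying a declared signature (dimensions --> measures), evaluated in dependency order. This is not Grizzly's code. `Node`, `run`, and the review values are hypothetical, the signatures are carried as metadata only and not checked, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter
from statistics import median


class Node:
    """A processing step with a declared cube signature: dims --> measures."""

    def __init__(self, name, func, inputs, out_dims, out_measures):
        self.name = name
        self.func = func
        self.inputs = inputs              # names of upstream nodes
        self.out_dims = out_dims          # e.g. ("BeerID",)
        self.out_measures = out_measures  # e.g. ("MedianOverall",)


def run(nodes):
    """Evaluate every node once, in dependency order."""
    order = TopologicalSorter({n.name: set(n.inputs) for n in nodes.values()})
    results = {}
    for name in order.static_order():
        node = nodes[name]
        results[name] = node.func(*(results[i] for i in node.inputs))
    return results


def load_reviews():
    # (ReviewID, BeerID) --> Overall; values are made up.
    return {(1, "a"): 4.0, (2, "a"): 5.0, (3, "b"): 2.0, (4, "a"): 3.0}


def calc_median(reviews):
    # Collapse the ReviewID dimension: (BeerID) --> MedianOverall.
    by_beer = {}
    for (_review_id, beer_id), overall in reviews.items():
        by_beer.setdefault(beer_id, []).append(overall)
    return {beer: median(vals) for beer, vals in by_beer.items()}


nodes = {
    "reviews": Node("reviews", load_reviews, [],
                    ("ReviewID", "BeerID"), ("Overall",)),
    "medians": Node("medians", calc_median, ["reviews"],
                    ("BeerID",), ("MedianOverall",)),
}
results = run(nodes)
```

Since the dependencies are explicit in `inputs`, the same structure could drive provenance tracking or memoization of intermediate results.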

Page 43: grizzly - informal overview - pydata boston 2013

Advantages of DAG representation

• Static type specifications allow precise and clear modeling / design of an analysis pipeline before having to write all the code needed to implement it

• Model can be turned into an actual working program, instead of just being a schematic diagram

• Provenance tracking without extra instrumentation

• Memoization of intermediate results is easy because data dependencies are already explicit

• Easier to understand, reason about, and explain to others

• Easier to track modification history as graph edits

Page 44: grizzly - informal overview - pydata boston 2013

Syntactic Syrup: CubeApply

Takes cross-product of a set of input cubes / vectors and applies function to all results

(BeerID) --> (Appearance, Aroma, Palate, Taste, Overall)

BeerRank

(BeerID) --> (RankScore)

(BeerID) --> (Appearance, Aroma, Palate, Taste, Overall)

(BeerID, RankModelID) --> (RankScore)

(AppWeight, AromaWt, PalWt, TasteWt, OverallWt)

(RankModelID) --> (AppWt, AromaWt, PalWt, TasteWt, OverallWt)
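A sketch of the CubeApply idea in plain Python: take the cross-product of the entries of the input cubes and apply a function to each combination, so that the output is keyed by the combined dimensions (BeerID, RankModelID). The `cube_apply` helper, the rank models, and all numbers are hypothetical.

```python
from itertools import product

# (BeerID) --> (Appearance, Aroma, Palate, Taste, Overall); made-up values.
ratings = {
    "beer_a": (4.0, 4.0, 3.0, 5.0, 4.0),
    "beer_b": (2.0, 3.0, 3.0, 2.0, 2.0),
}

# (RankModelID) --> (AppWt, AromaWt, PalWt, TasteWt, OverallWt); made-up values.
rank_models = {
    "equal": (0.2, 0.2, 0.2, 0.2, 0.2),
    "overall_only": (0.0, 0.0, 0.0, 0.0, 1.0),
}


def beer_rank(scores, weights):
    """Weighted sum of the five ratings."""
    return sum(s * w for s, w in zip(scores, weights))


def cube_apply(func, *cubes):
    """Apply func to every combination of entries from the input cubes.

    The output cube is keyed by the tuple of input keys.
    """
    out = {}
    for combo in product(*(cube.items() for cube in cubes)):
        keys = tuple(k for k, _ in combo)
        values = [v for _, v in combo]
        out[keys] = func(*values)
    return out


# (BeerID, RankModelID) --> (RankScore)
rank_scores = cube_apply(beer_rank, ratings, rank_models)
```

The function itself only knows about one beer and one weight vector; the cross-product and the bookkeeping of output dimensions are handled by `cube_apply`.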

Page 45: grizzly - informal overview - pydata boston 2013

Slicing, Dicing

Since semantic type data is always propagated, in principle we can define the schema for any intermediate data (including hierarchy structure) and make use of existing OLAP tools to run declarative queries
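For readers unfamiliar with the terms, here is a sketch of slice, dice, and pivot on a small cube. It uses pandas as a stand-in rather than an actual OLAP tool, and the source labels and values are made up.

```python
import pandas as pd

# (SourceID, BeerID) --> MedianOverall; made-up values.
cube = pd.DataFrame(
    {
        "SourceID": ["BA", "BA", "RB", "RB"],
        "BeerID": [1, 2, 1, 2],
        "MedianOverall": [4.0, 2.5, 4.5, 3.0],
    }
).set_index(["SourceID", "BeerID"])

# Slice: fix one dimension to a single value, dropping that dimension.
ba_only = cube.xs("BA", level="SourceID")

# Dice: keep a subset of values on one dimension, keeping all dimensions.
beer_1 = cube[cube.index.get_level_values("BeerID").isin([1])]

# Pivot: lay two dimensions out as rows and columns of a summary table.
table = cube["MedianOverall"].unstack("SourceID")
```

Each operation is expressed in terms of dimension names, not positions or ad hoc column names, which is what propagated type information makes possible for intermediate data.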

Page 46: grizzly - informal overview - pydata boston 2013

Implementation

• Type system

• DAGs

• Execution

• Data Management

• Visualizations

• ...queries?

Page 47: grizzly - informal overview - pydata boston 2013

Requirements for a practical system

• Programmable and extensible, without requiring discontinuous changes to existing habits

• OLAP systems not general enough; energy barrier to setting up a ‘data warehouse’ for a particular scientific analysis is too high; arbitrary, complex statistics not supported

• System must be deployable over the web, so analyses and results can be easily shared with geographically dispersed collaborators and the scientific community

• Free and open source

Page 48: grizzly - informal overview - pydata boston 2013

Current Support for Hierarchies in Pandas

• Hierarchical dataframes only support ‘uniform’ hierarchies

• lots of real analysis requires comparisons across many different types

• Metadata is unstructured

• can’t compute effectively on column names

• Manual management

• consistency of column naming and interpretation depends entirely on programmer discipline

Page 49: grizzly - informal overview - pydata boston 2013

Simple Semantic Types over Pandas

['[["cmp", ["6-OHDA, chronicSaline", "Ascorbate, chronicSaline"]], ["ct", "cp73"], ["mc", "bh"], ["st", "pval"], ["tt", "welch ttest"]]',

'[["cmp", ["6-OHDA, chronicSaline", "Ascorbate, chronicSaline"]], ["ct", "cp73"], ["mc", "nominal"], ["st", "pval"], ["tt", "student ttest"]]',

'[["cmp", ["6-OHDA, chronicSaline", "Ascorbate, chronicSaline"]], ["ct", "cp73"], ["mc", "bonf"], ["st", "pval"], ["tt", "student ttest"]]',

'[["cmp", ["6-OHDA, chronicSaline", "Ascorbate, chronicSaline"]], ["ct", "cp73"], ["mc", "bh"], ["st", "pval"], ["tt", "student ttest"]]',

'[["cmp", ["6-OHDA, chronicSaline", "Ascorbate, chronicSaline"]], ["ct", "cp73"], ["st", "pval"], ["tt", "levene"]]

ct: CP73, CP101

tt: student ttest, welch ttest

st: pval, t-stat

mc: bonf, bh, nom

[Diagram: column labels are the cross-product (X) of the dimensions ct, tt, mc, cmp, st]
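The column labels on this slide look like JSON-encoded lists of [key, value] pairs, sorted by key. Assuming that encoding, a sketch of selecting columns by their attributes could look like the following. The `encode` and `select` helpers, the gene names, and the p-values are hypothetical, and the `cmp` attribute is left out for brevity.

```python
import json

import pandas as pd


def encode(**attrs):
    """Serialize attributes as a JSON list of [key, value] pairs, sorted by key."""
    return json.dumps(sorted(attrs.items()))


def select(frame, **wanted):
    """Return the columns whose decoded attributes match every requested value."""
    keep = [
        col for col in frame.columns
        if all(dict(json.loads(col)).get(k) == v for k, v in wanted.items())
    ]
    return frame[keep]


# Made-up per-gene statistics; each column is labeled by its attributes.
stats = pd.DataFrame(
    {
        encode(ct="cp73", st="pval", tt="welch ttest", mc="bh"): [0.01, 0.20],
        encode(ct="cp73", st="pval", tt="student ttest", mc="bh"): [0.02, 0.30],
        encode(ct="cp101", st="pval", tt="welch ttest", mc="bonf"): [0.05, 0.90],
    },
    index=["gene_1", "gene_2"],
)

welch = select(stats, tt="welch ttest")
cp73_student = select(stats, ct="cp73", tt="student ttest")
```

Structured labels make it possible to compute on column metadata instead of relying on naming discipline, at the cost of labels that are awkward to read directly.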

Page 50: grizzly - informal overview - pydata boston 2013

Temporal Graph Database

• The canonical representations for types, ‘programs’, and pointers to data are all typed property graphs (DAGs) that can hold JSON payloads

• The full edit history of the graph is recorded, so the user can rewind / replay and branch
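One simple way to get rewind, replay, and branch is to treat the graph as an append-only log of edits and derive its state by replaying a prefix of that log. The sketch below illustrates that idea only; it is not GZDB, and the class and method names are hypothetical.

```python
import copy


class TemporalGraph:
    """Property graph whose state is derived by replaying an append-only edit log."""

    def __init__(self, log=None):
        self.log = list(log or [])

    def add_node(self, node_id, **payload):
        self.log.append(("add_node", node_id, payload))

    def add_edge(self, src, dst):
        self.log.append(("add_edge", src, dst))

    def state(self, upto=None):
        """Replay the first `upto` edits (all of them by default)."""
        nodes, edges = {}, set()
        for op, a, b in self.log[:upto]:
            if op == "add_node":
                nodes[a] = copy.deepcopy(b)
            else:
                edges.add((a, b))
        return nodes, edges

    def branch(self, upto):
        """Start a new history from an earlier point, leaving this one intact."""
        return TemporalGraph(copy.deepcopy(self.log[:upto]))


g = TemporalGraph()
g.add_node("reviews", kind="data")
g.add_node("medians", kind="function")
g.add_edge("reviews", "medians")

early_nodes, early_edges = g.state(upto=2)  # rewind to before the edge existed
alt = g.branch(upto=2)                      # branch from that earlier point
alt.add_node("means", kind="function")
```

Node payloads here are plain dictionaries, which map directly onto JSON.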

Page 51: grizzly - informal overview - pydata boston 2013

Generic Visualization Components

to compose visualizations & reports

Page 52: grizzly - informal overview - pydata boston 2013

Architecture Overview

GZDB

Graph Editor

Grizzly Webapp

SQLAlchemy

Postgres

IPython

Pandas

HTML Viz Widgets

GZData

GZFlow

CherryPy

D3, SlickGrid, Flot, jsPlumb

Filesystem

Page 53: grizzly - informal overview - pydata boston 2013

Biological Applications

Page 54: grizzly - informal overview - pydata boston 2013

Bio Example 1: Striatal Gene Expression w. L-DOPA

Summary tables

Drilldown and provenance from summary tables to primary data

Page 55: grizzly - informal overview - pydata boston 2013

Drilldown from summary to statistical tables

Page 56: grizzly - informal overview - pydata boston 2013

Drilldown from statistical tables to plots of primary data

Page 57: grizzly - informal overview - pydata boston 2013

Bio Example 2: Complex, interactive visualizations: BOMBASTIC

Subspace clustering of time-series data

A. Define blocks and an ordering

B. Cluster each block independently

C. Represent resulting clusters in a tree and explore/filter interactively

Each (predefined) subspace has unique information; we want to understand patterns both within and between blocks

Page 58: grizzly - informal overview - pydata boston 2013
Page 59: grizzly - informal overview - pydata boston 2013

Summary

Increasing complexity of biological data presents critical requirements for better systems for collaborative analysis of high-dimensional, multi-factor, dynamic data

A dataflow computation model with semantic, multidimensional types offers significant advantages for meeting these requirements

Grizzly defines a simple, formal model for multidimensional data and DAGs of operations on that data, adapting and combining ideas from OLAP, declarative visualization, and dataflow programming.

Proof-of-concept implementation in python establishes feasibility

Applications to analysis of real biological experiments (PD, Neuro, Cancer) will evaluate practical utility and benefits

Correctness, Thoroughness, Reproducibility, Verifiability, Clarity, Provenance, Interactivity, Computational Efficiency, Scientist Efficiency

Page 60: grizzly - informal overview - pydata boston 2013

Acknowledgements: Software

• IPython

• NumPy

• Pandas

• Statsmodels

• Patsy

• CherryPy

• SQLAlchemy

• postgres

• NetworkX

• igraph

• backbone

• underscore

• jsPlumb

• flot

• D3.js

Page 61: grizzly - informal overview - pydata boston 2013

Acknowledgements