data analytics challenges in genomics

Data analysis challenges in genomics Guest lecture, Data Mining Uppsala 2013-10-08 Mikael Huss Science for Life Laboratory / Stockholm University

Upload: mikaelhuss

Post on 10-May-2015


DESCRIPTION

Lecture given for the Data Mining course at Uppsala university in October 2013. The presentation talks about data analysis in the context of genomics, next-generation sequencing, metagenomics etc.

TRANSCRIPT

Page 1: Data analytics challenges in genomics

Data analysis challenges in genomics

Guest lecture, Data Mining

Uppsala 2013-10-08

Mikael Huss

Science for Life Laboratory / Stockholm University

Page 2: Data analytics challenges in genomics

Where I work

Science for Life Laboratory Stockholm, at Karolinska institutet science park

A national center for high-throughput biology, i.e. massively parallel measurements of DNA/RNA (“genomics”, “next generation DNA sequencing”), proteins (“proteomics”, mass spectrometry) etc.

Nodes in Uppsala & Stockholm; funded by strategic grants

Offers services to customers, mostly DNA sequencing + associated analysis

Page 3: Data analytics challenges in genomics

Outline

1. Context (short intro to DNA sequencing)

2. Big goals / visions

3. Examples of data mining applications and technical challenges

Page 4: Data analytics challenges in genomics

1. Some context on DNA sequencing

Page 5: Data analytics challenges in genomics

All* living organisms have DNA as their blueprint

GTTACGTAACCGTTACGTA…..CCTTGATCGTAAC….Etc. (2x3 billion letters for humans)

*OK, some viruses have RNA

?

Page 6: Data analytics challenges in genomics

DNA (…ACGT…): Blueprint / source code. Pretty much identical in all your cells.

RNA (…ACGU…): “Expressed”, “active” genes. Differs between tissues, cell types, disease vs health.

Proteins (…KVL…): The molecules that actually do stuff.

Reading the nucleotides or amino acids is called sequencing

It is easier to isolate and therefore to sequence DNA and RNA

DNA sequencing means “reading the genome”
RNA sequencing can be used to get a snapshot of the active genes
Protein abundance can be measured, but that is harder to do on a massive scale

A short refresher on molecular genetics!

(http://ds9a.nl/amazing-dna)
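The DNA -> RNA -> protein flow on this slide can be sketched in a few lines. This is a minimal illustration only: the input sequence is made up, the codon table is deliberately partial (just the codons used here, taken from the standard genetic code), and real transcription and translation involve much more than string replacement.

```python
# Partial codon table (standard genetic code); only the codons used below.
CODON_TABLE = {
    "AUG": "M",  # methionine, also the usual start codon
    "AAA": "K",  # lysine
    "GUU": "V",  # valine
    "CUG": "L",  # leucine
    "UAA": "*",  # stop codon
}


def transcribe(dna: str) -> str:
    """Coding-strand DNA -> mRNA: same letters, with T replaced by U."""
    return dna.upper().replace("T", "U")


def translate(rna: str) -> str:
    """Read the mRNA three letters at a time until a stop codon."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


dna = "ATGAAAGTTCTGTAA"   # made-up example sequence
rna = transcribe(dna)
protein = translate(rna)
print(rna)      # AUGAAAGUUCUGUAA
print(protein)  # MKVL
```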

Page 7: Data analytics challenges in genomics

SciLifeLab

Presently sequencing ~3 megabases of DNA per second

Corresponding to about 3 human genome sizes per hour
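The conversion from bases per second to genome sizes per hour is simple arithmetic. A quick check, assuming one human genome copy is about 3 billion bases:

```python
BASES_PER_SECOND = 3e6      # ~3 megabases per second (from the slide)
HUMAN_GENOME_BASES = 3e9    # ~3 billion letters in one copy of the human genome

bases_per_hour = BASES_PER_SECOND * 3600
genome_sizes_per_hour = bases_per_hour / HUMAN_GENOME_BASES

print(f"{bases_per_hour:.2e} bases per hour")
print(f"{genome_sizes_per_hour:.1f} human genome sizes per hour")
```

Note that this counts raw bases. Each position is normally read many times over to get a reliable genome, so the number of finished genomes per hour is considerably lower.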

Also RNA, protein measurements

Page 8: Data analytics challenges in genomics

What is sequencing good for?

- Mapping new genomes

- Comparing individual genomes to each other

- Looking at how genes are expressed (RNA sequencing)

Page 9: Data analytics challenges in genomics

De novo genome sequencing

Arabidopsis (0.12 Gbp)

Populus (0.45 Gbp)

Humans (3 Gbp)

Conifers (20 Gbp)

Spruce (20 Gbp)

Mapping new genomes

E.g. Norway spruce (Christmas tree)

Economically the most important Swedish tree

Provide basis for research on
• tools for breeding for tree productivity, quality, health
• tools for cellulose and wood fibre modification (new materials)

Page 10: Data analytics challenges in genomics

Working in the context of a known reference genome.

Common application: Looking for genes responsible for hereditary diseases

Often rare monogenic or common complex diseases

More than 6,000 known monogenic diseases

Only ~ ½ have a gene associated (OMIM)

Complex diseases – diabetes, asthma, MS, ….

Resequencing and variation analysis

Page 11: Data analytics challenges in genomics

Functional genomics

Variation between
- Tissues
- Cell types
- Cell states
- Individuals

- How genes actually get expressed

Page 12: Data analytics challenges in genomics

Functional genomics

Furusawa and Kaneko, Biology Direct 2009 4:17

Transcriptional patterns

“cell types” as attractors in systems of interacting genes

Page 13: Data analytics challenges in genomics

2. Big goals / visions

Page 14: Data analytics challenges in genomics

Big goals / visions

• Precision medicine
– Genomic medicine
– Personalized medicine
– Individualized treatments

• Understanding natural diversity
– Discovering new organisms
– Mapping ecological niches

• Understanding complex diseases
– Molecular definitions of diseases
– Lifestyle and epigenetics

Page 16: Data analytics challenges in genomics

Mount Sinai Medical Center / Eric Schadt

Page 17: Data analytics challenges in genomics
Page 18: Data analytics challenges in genomics

Personal sequencing?

Genomics apps

Page 19: Data analytics challenges in genomics

Community genomics & crowdsourced clinical trials

https://www.23andme.com/about/factoids/

Page 20: Data analytics challenges in genomics

Exploring the human microbiome

Estimated 10x more bacterial cells than human cells in the human body

Three “enterotypes”

Page 21: Data analytics challenges in genomics

Personal microbiome sequencing

Page 22: Data analytics challenges in genomics

Big goals / visions

• Precision medicine
– Genomic medicine
– Personalized medicine
– Individualized treatments

• Understanding natural diversity
– Discovering new organisms
– Mapping ecological niches

• Understanding complex diseases
– Molecular definitions of diseases
– Lifestyle and epigenetics

Page 23: Data analytics challenges in genomics

Environmental samples: soil, ocean etc

Identifying new viruses in human or environmental samples; <1% known so far

Page 24: Data analytics challenges in genomics

http://www.ted.com/talks/nathan_wolfe_what_s_left_to_explore.html

Page 25: Data analytics challenges in genomics

Planetary ecology

Perhaps: “genomic observatories” continuously monitoring environmental DNA

Streaming, real-time analysis important

Page 26: Data analytics challenges in genomics

Big goals / visions

• Precision medicine
– Genomic medicine
– Personalized medicine
– Individualized treatments

• Understanding natural diversity
– Discovering new organisms
– Mapping ecological niches

• Understanding complex diseases
– Molecular definitions of diseases
– Lifestyle and epigenetics

Page 27: Data analytics challenges in genomics

Complex diseases

• Cardiovascular disease
• Autoimmune disease
– Rheumatism
– Multiple sclerosis
– Psoriasis
– …
• Diabetes
(etc.)

No simple genetic explanation.

Lifestyle & environment factors likely important.

Page 28: Data analytics challenges in genomics

Data integration and correlative analysis

http://techcrunch.com/2012/03/29/cloud-will-cure-cancer/

“Collecting comprehensive profiles of every tumor for every patient provides a dataset to build models that learn normal cellular function from cancerous deviations.

Diagnostics and treatment companies/hospitals/physicians can then use the models to deliver therapy.

If we imagine a world where every tumor is comprehensively profiled, it quickly becomes clear that not only will the data sets be very large but also involve different domains of expertise required for quality control, model building, and interpretation.”

Cancer – not one disease

Page 29: Data analytics challenges in genomics

Genes – Epigenetics – Lifestyle - Environment

Understanding the interplay of lifestyle (including environment) and genes through the “interface layer”, epigenetics.

Massive correlational analyses …

Epigenetics and lifestyle

epigenetics – changes in gene expression that are not due to base sequence changes (and that can be passed on to daughter cells during cell division)

Page 30: Data analytics challenges in genomics

Gigantic clinical sequencing projects

Genomics England / NHS will sequence 100,000 genomes of patients in the next 5 years

… BGI aims for a million

But are we ready to interpret genomes?

Page 31: Data analytics challenges in genomics

3. Applications and challenges of data mining in genomics

Page 32: Data analytics challenges in genomics

Storage and transfer

“European Bioinformatics Institute (EBI) stores 20 PB of data, of which 2 PB is genomic”

“Single human genome ~140 GB”

“… downloading the data is time-consuming, and researchers must be sure that their computational infrastructure and software tools are up to the task. ‘If I could, I would routinely look at all sequenced cancer genomes,’ says [Arend] Sidow. ‘With the current infrastructure, that's impossible.’”

Cloud solutions:
Embassy Cloud – EBI + CSC in Espoo
easyGenomics – BGI Hong Kong
DNAnexus – commercial service, Silicon Valley
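To get a feel for these numbers, here is a back-of-the-envelope calculation. It takes the quoted sizes as petabytes and gigabytes in decimal units, and the 1 Gbit/s link speed is a purely hypothetical figure chosen for illustration.

```python
GB = 10**9    # bytes, decimal units
PB = 10**15

genome_size = 140 * GB           # "single human genome" figure quoted above
ebi_genomic = 2 * PB             # genomic share of EBI storage quoted above
link_bits_per_second = 10**9     # hypothetical 1 Gbit/s connection (assumption)

# How many genomes of that size fit into the genomic share of EBI storage?
genomes_in_ebi = ebi_genomic / genome_size

# Ideal transfer time for one genome, ignoring protocol overhead.
transfer_minutes = genome_size * 8 / link_bits_per_second / 60

# Storage for a 100,000-genome cohort at the same per-genome size.
cohort_pb = 100_000 * genome_size / PB

print(f"{genomes_in_ebi:.0f} genomes fit in 2 PB")
print(f"{transfer_minutes:.1f} minutes to move one genome at 1 Gbit/s")
print(f"{cohort_pb:.0f} PB for 100,000 genomes")
```

Under these assumptions, a single 100,000-genome project would need several times the genomic storage quoted for EBI, which is one reason moving the analysis to the data (cloud solutions) is attractive.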

Page 33: Data analytics challenges in genomics

Analysis challenges

Dealing with the size of raw data

Growth in sequencing capacity has outstripped Moore’s law

Need to throw away data

Tailored streaming / approximate algorithms

The Economist
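One example of what a streaming / approximate algorithm can look like is a Count-Min sketch that counts k-mers (short substrings of reads) in fixed memory while reads stream past. This is a generic textbook structure, not a method named in the lecture; the class, the toy reads and the parameter values are all illustrative assumptions.

```python
import hashlib


class CountMinSketch:
    """Approximate counter with fixed memory: never under-counts, may over-count."""

    def __init__(self, width: int = 2048, depth: int = 4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _slots(self, item: str):
        # One independent hash per row, derived by salting BLAKE2b with the row index.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=bytes([row])
            ).digest()
            yield row, int.from_bytes(digest, "big") % self.width

    def add(self, item: str) -> None:
        for row, col in self._slots(item):
            self.rows[row][col] += 1

    def estimate(self, item: str) -> int:
        # Collisions only inflate counters, so the minimum is the best estimate.
        return min(self.rows[row][col] for row, col in self._slots(item))


def kmers(read: str, k: int):
    """Yield all overlapping substrings of length k."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def stream_reads():
    """Stand-in for a stream of sequencing reads (toy data)."""
    yield "ACGTACGTAC"
    yield "CGTACGTACG"
    yield "TTTTGGGGCC"


sketch = CountMinSketch()
for read in stream_reads():          # each read is seen once and then discarded
    for kmer in kmers(read, k=4):
        sketch.add(kmer)

# The true count of "ACGT" in the toy reads is 3; the estimate is at least that.
print(sketch.estimate("ACGT"))
```

The memory used depends only on `width` and `depth`, not on how many reads pass through, which is the trade-off the slide points at: accept a small, one-sided error in exchange for not having to keep the raw data.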

Page 34: Data analytics challenges in genomics

Shape of data

“Commercial” big data:

(e.g. purchase data, movie ratings, “likes”, cell phone locations, tweets)
- Typically cheap to collect examples (data points) -> many observations
- Usually low-dimensional (few features)
- Data are informative only in aggregate (each data point is almost meaningless)

Biomedical big data:

(e.g. DNA sequencing, fMRI etc.)
- Typically expensive to collect data points -> few observations
- Usually very high-dimensional (e.g. ~20,000 gene measurements)
- Underpowered for modelling: many more features than observations

So, biological data often seem to be “transposed” relative to other types (“large p, small n”)
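Why “large p, small n” is a problem can be shown with pure noise. The sketch below generates features that have no relation to the labels at all and then picks the one that correlates best with them; the sample and feature counts are arbitrary choices for illustration.

```python
import random

random.seed(0)

N_SAMPLES = 20       # small n: few patients
N_FEATURES = 5000    # large p: many measured "genes", here pure noise

labels = [0] * 10 + [1] * 10   # e.g. 10 controls and 10 cases


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


best = 0.0
for _ in range(N_FEATURES):
    feature = [random.gauss(0, 1) for _ in range(N_SAMPLES)]
    best = max(best, abs(pearson(feature, labels)))

print(f"strongest |correlation| among {N_FEATURES} noise features: {best:.2f}")
```

With so few observations, some noise feature will look like a strong “biomarker” purely by chance, so feature selection has to be validated on data that were not used for the selection.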

Page 35: Data analytics challenges in genomics

The shape of (raw and processed) data

10-250 million such entries for one sample in an experiment

Gene expression: 20,000-row x 125-column matrix

Genetic variants: perhaps 3 million rows

Page 36: Data analytics challenges in genomics

Examples of data mining applications in genomics

• Classification
– Diseases and disease subtypes
– Biomarkers for disease
– Predicting disease presence or subtype from gene expression
• Clustering and visualization
– Defining cell types
– Molecular definitions of disease
• Association rules
– Text analysis
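As a minimal illustration of the classification item above, here is a nearest-centroid classifier that predicts a label from a gene expression profile. The three-gene profiles are invented toy values, and nearest-centroid is just one simple choice of method, not one singled out in the lecture.

```python
def centroid(rows):
    """Per-gene mean over a list of expression profiles."""
    return [sum(column) / len(rows) for column in zip(*rows)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Toy training data: expression of 3 hypothetical genes per sample.
training = {
    "disease": [[8.0, 1.0, 5.0], [7.5, 1.5, 5.2], [8.2, 0.8, 4.9]],
    "healthy": [[2.0, 6.0, 5.1], [2.5, 5.5, 5.0], [1.8, 6.2, 4.8]],
}

centroids = {label: centroid(rows) for label, rows in training.items()}


def predict(sample):
    """Assign the label whose class centroid is closest to the sample."""
    return min(centroids, key=lambda label: squared_distance(sample, centroids[label]))


print(predict([7.8, 1.2, 5.0]))  # disease
print(predict([2.2, 5.8, 5.0]))  # healthy
```

In real data the profile would have thousands of genes rather than three, which is where the “large p, small n” issue comes in.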

Page 37: Data analytics challenges in genomics

Electronic health records

Mining electronic health records: towards better research applications and clinical care

Peter B. Jensen, Lars J. Jensen & Søren Brunak

Nature Reviews Genetics 13, 395-405 (June 2012)

Unstructured and structured text
Medication history
Test results
Demographics
(etc.)

Page 38: Data analytics challenges in genomics

Genome interpretation

Page 39: Data analytics challenges in genomics

Sugino et al, Molecular taxonomy of major neuronal classes in the adult mouse forebrain, Nature Neuroscience 9, 99-107 (2005)

Gene expression patterns and neuronal cell types

Cell types

Genes

Gene expression

Shape and behavior of neurons

Page 40: Data analytics challenges in genomics

Genetics of multiple sclerosis

• Gene expression data on ~120 patients and 70 controls
• Medication, lifestyle, specific diagnosis
• Environment important – sunlight, tobacco etc.

Gene expression

Medication, diagnosis etc

Page 41: Data analytics challenges in genomics

Predictive analysis contests

Page 42: Data analytics challenges in genomics

Predictive analysis contests

Page 43: Data analytics challenges in genomics

Science-oriented

Page 44: Data analytics challenges in genomics

• Build predictive models for classifying gene expression signatures for:
– Psoriasis
– Multiple sclerosis
– COPD
– Lung cancer

• Training set was public data; the secret test set was proprietary

SBV Improver Challenge #1

Page 46: Data analytics challenges in genomics

SBV Improver Challenge #1

• Psoriasis easy
• Lung cancer hard
• MS diagnostic, COPD somewhere in the middle
• MS subtype: no statistically significant submissions!

https://www.sbvimprover.com/sbv-improver-symposium-2012-presentations
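One common way to decide whether a submission is “statistically significant” is a label permutation test: shuffle the true labels many times and see how often chance alone matches the observed accuracy. The sketch below is a generic illustration with invented predictions; it is an assumption, not a description of how the SBV Improver challenge was actually scored.

```python
import random


def accuracy(predicted, truth):
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)


def permutation_p_value(predicted, truth, n_permutations=2000, seed=0):
    """Fraction of label shuffles that score at least as well as the real labels."""
    rng = random.Random(seed)
    observed = accuracy(predicted, truth)
    shuffled = list(truth)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if accuracy(predicted, shuffled) >= observed:
            hits += 1
    # The +1 terms count the observed labelling itself, so p is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


# Toy hidden test set: 10 cases (1) and 10 controls (0).
truth = [1] * 10 + [0] * 10

good_submission = list(truth)
good_submission[0] = 0            # two mistakes: 18 of 20 correct
good_submission[19] = 1

poor_submission = [1, 0] * 10     # 10 of 20 correct, no better than guessing

p_good = permutation_p_value(good_submission, truth)
p_poor = permutation_p_value(poor_submission, truth)
print(f"good submission: p = {p_good:.4f}")
print(f"poor submission: p = {p_poor:.4f}")
```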

Page 47: Data analytics challenges in genomics

Species translation challenge

- Can the perturbations of signaling pathways in one species predict the response to a given stimulus in another species?

- Which computational methods are most effective for inferring gene, phosphorylation and pathway responses from one species to another?

Page 48: Data analytics challenges in genomics

CAMDA 2013 challenges

Question 1: Can we replace the animal study with an in vitro assay? Current safety assessment relies largely on animal models, which is time-consuming, labor-intensive, and not in line with animal rights concerns. There is a paradigm shift in toxicology towards exploring the possibility of replacing the animal model with in vitro assays coupled with toxicogenomics. The TGP data contain both in vitro and animal data, which is essential to address this question.

Question 2: Can we predict the liver injury in humans using toxicogenomics data from animals?

Available data:

Drug Information (Excel table) – the basic information about individual drugs from DrugBank

Pathology Data (Excel table) – pathology and clinical chemistry data for each rat

Array Metadata (csv format) – metadata (e.g., dose, time, sacrifice time, etc.)

“toxicogenomics”

Page 49: Data analytics challenges in genomics

Fully open code that runs on the server to generate predictions. Can build on others’ results