Slide 1/19 Big data versus the big C Saurabh Bagchi Paper: "Bioinformatics: Big Data Versus the Big C", Neil Savage, Nature, May 28, 2014

Slide 1/19

Big data versus the big C

Saurabh Bagchi

Paper: "Bioinformatics: Big Data Versus the Big C", Neil Savage, Nature, May 28, 2014

Slide 2/19

The Cell

© 1997-2005 Coriell Institute for Medical Research

cell, nucleus, cytoplasm, mitochondrion

Slide 3/19

How many?

Cells in the human body: ~10^14 (100 trillion)

~10^15 bacterial cells!

Slide 4/19

Chromosomes

histone, nucleosome, chromatin, chromosome, centromere, telomere

[Figure: chromosome structure. Labels: DNA, histone H1, histones H2A, H2B, H3, H4, nucleosome (~146 bp), chromatin, centromere, telomere]

Slide 5/19

How many?

Chromosomes in a human cell: 46 (2 x 22 autosomes + 2 sex chromosomes, XX or XY)

Slide 6/19

Nucleotide

deoxyribose, nucleotide, base, A, C, G, T, purine, pyrimidine, 3’, 5’

[Figure: structure of a deoxyribonucleotide, showing the deoxyribose sugar, the phosphate group, the 5’ and 3’ ends, and the bonds to the base, to the previous nucleotide, and to the next nucleotide]

Purines: Adenine (A), Guanine (G)

Pyrimidines: Cytosine (C), Thymine (T)

Let’s write “AGACC”!

Slide 7/19

“AGACC” (backbone)

Slide 8/19

“AGACC” (DNA)

deoxyribonucleic acid (DNA)

[Figure: “AGACC” drawn as DNA, with the 5’ and 3’ ends labeled]

Slide 9/19

DNA is double stranded

[Figure: the two strands of DNA run antiparallel, one 5’ to 3’ and the other 3’ to 5’]

DNA is always written 5’ to 3’

AGACC or GGTCT

strand, reverse complement
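
The "AGACC or GGTCT" equivalence can be checked with a few lines of Python. This is a minimal sketch, assuming the strand contains only the letters A, C, G, and T; the name `reverse_complement` is chosen here for illustration and does not come from the slides.

```python
# Watson-Crick base pairing: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(strand: str) -> str:
    """Return the opposite strand, also written 5' to 3'.

    The opposite strand runs in the other direction, so the input is
    reversed and each base is replaced by its pairing partner.
    """
    return "".join(COMPLEMENT[base] for base in reversed(strand.upper()))


print(reverse_complement("AGACC"))  # GGTCT
```

Applying the function twice returns the original strand, which is why either spelling identifies the same double-stranded molecule.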

Slide 10/19

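RNA

ribose, ribonucleotide, U

[Figure: structure of a ribonucleotide, showing the ribose sugar with its additional OH group, the phosphate group, the 5’ and 3’ ends, and the bonds to the base, to the previous ribonucleotide, and to the next ribonucleotide]

Purines: Adenine (A), Guanine (G)

Pyrimidines: Cytosine (C), Uracil (U)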

Slide 11/19

How many?

Nucleotides in the human genome: ~3 billion

Slide 12/19

Genes & Proteins

Double-stranded DNA

5’ TAGGATCGACTATATGGGATTACAAAGCATTTAGGGA...TCACCCTCTCTAGACTAGCATCTATATAAAACAGAA 3’

3’ ATCCTAGCTGATATACCCTAATGTTTCGTAAATCCCT...AGTGGGAGAGATCTGATCGTAGATATATTTTGTCTT 5’

(transcription)

Single-stranded RNA

AUGGGAUUACAAAGCAUUUAGGGA...UCACCCUCUCUAGACUAGCAUCUAUAUAA

(translation)

protein

gene, transcription, translation, protein
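
The transcription step on this slide can be mimicked in Python. This is a simplified sketch, not a model of real transcription: it assumes the input is the 5’-to-3’ strand that reads like the RNA, and it starts copying at the first ATG only because the RNA on the slide begins with AUG. The function name `transcribe` is illustrative, and only the part of the slide's sequence before the elided "..." is used.

```python
def transcribe(coding_strand: str) -> str:
    """Copy a 5'-to-3' DNA strand into RNA, starting at the first ATG.

    The RNA has the same letters as the DNA strand, with uracil (U)
    in place of thymine (T).
    """
    start = coding_strand.find("ATG")
    if start == -1:
        raise ValueError("no ATG found in the strand")
    return coding_strand[start:].replace("T", "U")


# Beginning of the top DNA strand from the slide, up to the elided part.
dna_prefix = "TAGGATCGACTATATGGGATTACAAAGCATTTAGGGA"
print(transcribe(dna_prefix))  # AUGGGAUUACAAAGCAUUUAGGGA
```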

Slide 13/19

How many?

Genes in the human genome: ~20,000 – 25,000

Slide 14/19

Genes and cancer
• Tumor suppressor genes
• Oncogenes: An oncogene is a gene that has the potential to cause cancer. In tumor cells, oncogenes are often mutated or expressed at high levels.
• Observation (2013, Stephen Elledge of Harvard)
– Aneuploidy: a condition in which the number of chromosomes in the nucleus of a cell is not an exact multiple of the monoploid number of a particular species. An extra or missing chromosome is a common cause of genetic disorders, including human birth defects.
– Aneuploidy was correlated with a high rate of cancer
– Finding: Aneuploidy resulted in missing tumor-suppressor genes or extra copies of oncogenes
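
The definition of aneuploidy above is a divisibility test, which a short Python sketch can make concrete. It assumes the human monoploid number of 23 (22 autosomes plus one sex chromosome, consistent with the 46 chromosomes counted earlier); the function name `is_aneuploid` is illustrative.

```python
# One complete human chromosome set: 22 autosomes + 1 sex chromosome.
HUMAN_MONOPLOID_NUMBER = 23


def is_aneuploid(chromosome_count: int,
                 monoploid_number: int = HUMAN_MONOPLOID_NUMBER) -> bool:
    """True if the count is not an exact multiple of the monoploid number."""
    return chromosome_count % monoploid_number != 0


for count in (46, 47, 45, 69):
    print(count, is_aneuploid(count))
# 46 False  (normal diploid cell)
# 47 True   (one extra chromosome)
# 45 True   (one missing chromosome)
# 69 False  (three full sets, so not aneuploid under this definition)
```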

Slide 15/19

Data items
• Lots of data to mine for patterns of cancer
– Genome of tumor cell, genome of normal cell
– Medical history
– Lifestyle history
– CT scan, MRI scan
• Find out correlations with cancer
• Possible treatments (experimental today)
– Gene therapy
– Drug targets
• Existing tool: Tumor Suppressor and Oncogene Explorer
– Mine large data sets: roughly 8,000 tissue samples for 29 different kinds of tumors
– Apply statistical classification to identify tumor suppressor genes and oncogenes: the counts rose from 70 to 320 and from 50 to 200, respectively
– Distinguishing features include: mutation rate, ratio of benign mutations to those that cause a gene to stop functioning
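
The two distinguishing features named above can be computed from per-gene mutation counts. The sketch below is only an illustration of that arithmetic, not the tool's actual method: the class `GeneMutationCounts`, the per-base-pair-per-sample definition of mutation rate, the gene name, and every count are hypothetical. Only the sample total of 8,000 comes from the slide.

```python
from dataclasses import dataclass


@dataclass
class GeneMutationCounts:
    """Hypothetical per-gene tally of mutations across all samples."""
    gene: str
    length_bp: int          # gene length in base pairs
    benign: int             # mutations that leave the gene functional
    loss_of_function: int   # mutations that stop the gene from functioning


def mutation_rate(counts: GeneMutationCounts, n_samples: int) -> float:
    """Mutations per base pair per sample (an assumed definition)."""
    total = counts.benign + counts.loss_of_function
    return total / (counts.length_bp * n_samples)


def benign_to_lof_ratio(counts: GeneMutationCounts) -> float:
    """Ratio of benign mutations to loss-of-function mutations."""
    if counts.loss_of_function == 0:
        return float("inf")
    return counts.benign / counts.loss_of_function


# Made-up numbers for a made-up gene; 8,000 samples as on the slide.
example = GeneMutationCounts("GENE_A", length_bp=3000, benign=40, loss_of_function=20)
print(mutation_rate(example, n_samples=8000))
print(benign_to_lof_ratio(example))
```

A statistical classifier would take features like these as inputs; the choice of classifier and thresholds is not stated on the slide and is left out here.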

Slide 16/19

Databases
• We can mine these large databases for various kinds of tumors
1. Cancer Genome Atlas, from NCI (US)
2. Catalogue of Somatic Mutations in Cancer, from Wellcome Trust Sanger Institute (UK)
3. Galaxy
4. ENCODE
5. Roadmap Epigenomics Project

Slide 17/19

Tools and Discoveries
• Large project: Bionimbus
– Cloud-based, open-source platform for sharing and analyzing genomic data from the Cancer Genome Atlas
• An example finding
– By Megan McNerney (U of Chicago, Spring 2013)
– Identified a gene that contributes to the development of acute myeloid leukemia (AML)
– Data mining indicated that the CUX1 gene was the most significantly differentially expressed gene in cells that had lost chromosome 7; this gene encodes a tumor-suppressor protein
– The researchers also identified a CUX1 fusion transcript, in other words, part of CUX1 fused to another gene.
– They hypothesized, and then verified, that this disruption in CUX1 may contribute to the growth of abnormal blood cells, a hallmark of AML.

Reference: A Fernald, RJ Bergerson, J Wang, ME McNerney, T Karrison, J Anastasi, ... “Retroviral Insertional Mutagenesis In Egr1+/- Mice, Haploinsufficient For a Human Del(5q) Myeloid Leukemia Gene, Develop Myeloid Neoplasms With Proviral Insertions In Genes Syntenic To Human 5q.” Blood 122 (21)

Slide 18/19

How much storage do I need?
• Cancer and normal genome of a human: 1 terabyte (10^12 bytes)
• 1 M genomes = 1 exabyte (10^18 bytes); cost of US $100M/year
• Further sources of data: Electronic health records
– Includes diagnoses and notes on treatment
• Data mining also points to a relation between drug dosage and patient factors such as age
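
The storage figures above follow from simple arithmetic, sketched here in Python. It assumes that each of the 1 million people contributes one terabyte (tumor plus normal genome) and that the US $100M/year covers the whole exabyte; the function names are illustrative.

```python
BYTES_PER_TERABYTE = 10**12
BYTES_PER_EXABYTE = 10**18


def total_storage_bytes(n_people: int, bytes_per_person: int = BYTES_PER_TERABYTE) -> int:
    """Total bytes needed if every person contributes the same amount of data."""
    return n_people * bytes_per_person


def cost_per_terabyte_year(total_cost_usd: float, total_bytes: int) -> float:
    """Yearly cost in US dollars for each terabyte stored."""
    return total_cost_usd / (total_bytes / BYTES_PER_TERABYTE)


total = total_storage_bytes(1_000_000)
print(total == BYTES_PER_EXABYTE)                  # True: 10^6 x 10^12 = 10^18
print(cost_per_terabyte_year(100_000_000, total))  # 100.0
```

Under these assumptions the yearly cost works out to about US $100 per terabyte, that is, per person.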

Slide 19/19

Example Influential Paper
• In Usenix Security 2014: “Privacy in Pharmacogenetics: An End-to-End Case Study of Personalized Warfarin Dosing,” Matthew Fredrikson, Eric Lantz, Somesh Jha, Simon Lin, David Page, Thomas Ristenpart
• A case study of warfarin dosing, a popular target for pharmacogenetic modeling
• Warfarin is an anticoagulant widely used to help prevent strokes in patients suffering from atrial fibrillation (a type of irregular heartbeat)
• However, it is known to exhibit a complex dose-response relationship affected by multiple genetic markers, with improper dosing leading to increased risk of stroke or uncontrolled bleeding.
• A long line of work has sought pharmacogenetic models that can accurately predict proper dosage based on patient clinical history, demographics, and genotype
• Their study used a dataset collected by the International Warfarin Pharmacogenetics Consortium (IWPC), to date the most expansive such database, containing demographic information, genetic markers, and clinical histories for thousands of patients from around the world