TRANSCRIPT
Slide 1/19
Big data versus the big C
Saurabh Bagchi
Paper: "Bioinformatics: Big Data Versus the Big C", Neil Savage, Nature, May 28, 2014
The Cell
© 1997-2005 Coriell Institute for Medical Research
cell, nucleus, cytoplasm, mitochondrion
How many?
Cells in the human body: ~10^14 (100 trillion)
~10^15 bacterial cells!
Chromosomes
histone, nucleosome, chromatin, chromosome, centromere, telomere
[Figure labels: H1; DNA; H2A, H2B, H3, H4; ~146 bp; telomere; centromere; nucleosome; chromatin]
How many?
Chromosomes in a human cell: 46 (2 x 22 autosomes + 2 sex chromosomes, XX or XY)
Nucleotide
[Figure: chemical structure of a nucleotide (sugar ring and phosphate group), with bonds labeled "to next nucleotide", "to previous nucleotide", and "to base"]
deoxyribose, nucleotide, base, A, C, G, T, purine, pyrimidine, 3’, 5’
3’
5’
Adenine (A)
Cytosine (C)
Guanine (G)
Thymine (T)
Let’s write “AGACC”!
pyrimidines
purines
“AGACC” (backbone)
“AGACC” (DNA)
deoxyribonucleic acid (DNA)
[Figure labels: strand ends marked 5’ and 3’]
DNA is double stranded
[Figure: the two strands run in opposite directions, one 5’ to 3’ and the other 3’ to 5’]
DNA is always written 5’ to 3’
AGACC or GGTCT
strand, reverse complement
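The pairing rule behind "AGACC or GGTCT" can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the function name `reverse_complement` is an assumption, and the input is assumed to contain only the letters A, C, G, and T.

```python
# Watson-Crick pairing: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(strand: str) -> str:
    """Return the opposite strand, also written 5' to 3'.

    The opposite strand runs in the other direction, so we walk the
    input backwards and swap each base for its pairing partner.
    """
    return "".join(COMPLEMENT[base] for base in reversed(strand))


print(reverse_complement("AGACC"))  # GGTCT
print(reverse_complement("GGTCT"))  # AGACC
```

Applying the function twice returns the original strand, which is why either spelling names the same piece of double-stranded DNA.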
RNA
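[Figure: chemical structure of a ribonucleotide, with an OH group on the sugar where the deoxyribonucleotide has H; bonds labeled "to next ribonucleotide", "to previous ribonucleotide", and "to base"]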
ribose, ribonucleotide, U
3’
5’
Adenine (A)
Cytosine (C)
Guanine (G)
Uracil (U)
pyrimidines
purines
How many?
Nucleotides in the human genome: ~3 billion
Genes & Proteins
Double-stranded DNA
5’ TAGGATCGACTATATGGGATTACAAAGCATTTAGGGA...TCACCCTCTCTAGACTAGCATCTATATAAAACAGAA 3’
3’ ATCCTAGCTGATATACCCTAATGTTTCGTAAATCCCT...AGTGGGAGAGATCTGATCGTAGATATATTTTGTCTT 5’
(transcription)
Single-stranded RNA
AUGGGAUUACAAAGCAUUUAGGGA...UCACCCUCUCUAGACUAGCAUCUAUAUAA
(translation)
protein
gene, transcription, translation, protein
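As a rough sketch of the two steps on this slide, the Python below transcribes the part of the top DNA strand that matches the slide's RNA and then translates it with the standard genetic code. The function names are assumptions made for illustration. The sketch simplifies the biology: it copies the coding strand with T replaced by U (the cell actually reads the complementary template strand), starts at the first AUG, and ignores splicing and other processing.

```python
# Standard genetic code, written compactly: codons are enumerated with
# bases in the order U, C, A, G; '*' marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def transcribe(coding_strand: str) -> str:
    """DNA coding strand (5' to 3') -> mRNA: same sequence with U for T."""
    return coding_strand.replace("T", "U")


def translate(mrna: str) -> str:
    """Translate from the first AUG until a stop codon (one-letter amino acids)."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon found
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # stop codon ends the protein
            break
        protein.append(amino_acid)
    return "".join(protein)


# Fragment of the slide's top strand, starting at ATG.
dna = "ATGGGATTACAAAGCATTTAGGGA"
mrna = transcribe(dna)
print(mrna)             # AUGGGAUUACAAAGCAUUUAGGGA
print(translate(mrna))  # MGLQSI
```

The fragment happens to contain a stop codon (UAG) after six codons, so the sketch yields a six-residue peptide: Met, Gly, Leu, Gln, Ser, Ile.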
How many?
Genes in the human genome: ~20,000 to 25,000
Slide 14/19
Genes and cancer
• Tumor suppressor genes
• Oncogenes: An oncogene is a gene that has the potential to cause cancer. In tumor cells, oncogenes are often mutated or expressed at high levels.
• Observation (2013, Stephen Elledge of Harvard)
  – Aneuploidy: a condition in which the number of chromosomes in the nucleus of a cell is not an exact multiple of the monoploid number of a particular species. An extra or missing chromosome is a common cause of genetic disorders, including human birth defects.
  – Aneuploidy was correlated with a high rate of cancer
  – Finding: Aneuploidy resulted in missing tumor-suppressor genes or extra copies of oncogenes
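The definition of aneuploidy above reduces to a divisibility check. A minimal sketch, assuming the human monoploid number of 23 (one complete set of chromosomes); the function name `is_aneuploid` is chosen here for illustration:

```python
HUMAN_MONOPLOID_NUMBER = 23  # chromosomes in one complete human set


def is_aneuploid(chromosome_count: int,
                 monoploid_number: int = HUMAN_MONOPLOID_NUMBER) -> bool:
    """True if the count is not an exact multiple of the monoploid number."""
    if chromosome_count <= 0:
        raise ValueError("chromosome_count must be positive")
    return chromosome_count % monoploid_number != 0


print(is_aneuploid(46))  # False: the normal human count, 2 x 23
print(is_aneuploid(47))  # True: one extra chromosome
print(is_aneuploid(45))  # True: one missing chromosome
```

Note that this only counts chromosomes. It cannot tell which chromosome is extra or missing, which is what matters for the tumor-suppressor and oncogene finding.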
Slide 15/19
Data items
• Lots of data to mine for patterns of cancer
  – Genome of tumor cell, genome of normal cell
  – Medical history
  – Lifestyle history
  – CT scan, MRI scan
• Find correlations with cancer
• Possible treatments (experimental today)
  – Gene therapy
  – Drug targets
• Existing tool: Tumor Suppressor and Oncogene Explorer
  – Mine large data sets: roughly 8,000 tissue samples for 29 different kinds of tumors
  – Apply statistical classification to identify tumor suppressor genes and oncogenes: from 70 to 320 and from 50 to 200, respectively
  – Distinguishing features include: mutation rate, ratio of benign mutations to those that cause a gene to stop functioning
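To make the two distinguishing features concrete, here is a hypothetical sketch of how they could be computed for one gene. It is not the tool's actual method: the labels `"benign"` and `"loss_of_function"`, the function name, and the definition of mutation rate as mutations per base pair are all assumptions, and the input values are made up purely to exercise the code.

```python
from collections import Counter


def mutation_features(mutation_labels, gene_length_bp):
    """Compute two per-gene features from a list of mutation labels.

    mutation_labels: list of "benign" or "loss_of_function" strings
                     (hypothetical labeling scheme).
    gene_length_bp:  gene length in base pairs, used to normalize the count.
    """
    counts = Counter(mutation_labels)
    benign = counts["benign"]
    loss_of_function = counts["loss_of_function"]
    return {
        # Assumed definition: observed mutations per base pair of gene.
        "mutation_rate": len(mutation_labels) / gene_length_bp,
        # Ratio named on the slide; infinite if no loss-of-function mutations.
        "benign_to_lof_ratio": (benign / loss_of_function
                                if loss_of_function else float("inf")),
    }


# Made-up toy input, not real data.
labels = ["benign", "loss_of_function", "loss_of_function",
          "benign", "loss_of_function"]
print(mutation_features(labels, gene_length_bp=1000))
```

A statistical classifier would then take such feature values for many genes as input; the classifier itself is outside this sketch.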
Slide 16/19
Databases
• We can mine these large databases for various kinds of tumors
  1. Cancer Genome Atlas, from NCI (US)
  2. Catalog of Somatic Mutations in Cancer, from Wellcome Trust Sanger Institute (UK)
  3. Galaxy
  4. ENCODE
  5. Roadmap Epigenomics Project
Slide 17/19
Tools and Discoveries
• Large project: Bionimbus
  – Cloud-based, open-source platform for sharing and analyzing genomic data from the Cancer Genome Atlas
• An example finding
  – By Megan McNerney (U of Chicago, Spring 2013)
  – Identified a gene that contributes to the development of acute myeloid leukemia (AML)
  – Data mining indicated that the CUX1 gene was the most significantly differentially expressed gene in cells that had lost chromosome 7; this gene encodes a tumor-suppressor protein
  – The researchers also identified a CUX1 fusion transcript, in other words, part of CUX1 fused to another gene.
  – They hypothesized, and verified, that this disruption in CUX1 may contribute to the growth of abnormal blood cells, a hallmark of AML.
"Retroviral Insertional Mutagenesis In Egr1+/- mice, Haploinsufficient For a Human Del (5q) Myeloid Leukemia Gene, Develop Myeloid Neoplasms With Proviral Insertions In Genes Syntenic To Human 5q," A Fernald, RJ Bergerson, J Wang, ME McNerney, T Karrison, J Anastasi, ..., Blood 122 (21)
Slide 18/19
How much storage do I need?
• Cancer and normal genome of a human: 1 terabyte (10^12 bytes)
• 1 M genomes = 1 exabyte (10^18 bytes); cost of US $100M/year
• Further sources of data: Electronic health records
  – Includes diagnoses and notes on treatment
• Data mining also points to a relation between drug dosage and patient factors, such as age
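The storage figures on this slide follow from simple multiplication, sketched below. The per-record cost is derived here by dividing the slide's two numbers; it is not a figure from the talk, and the function names are assumptions.

```python
TERABYTE = 10**12  # bytes
EXABYTE = 10**18   # bytes


def total_storage_bytes(n_records: int, bytes_per_record: int = TERABYTE) -> int:
    """Storage for n_records, each holding one person's cancer and normal genome."""
    return n_records * bytes_per_record


def yearly_cost_per_record(total_cost_usd: float, n_records: int) -> float:
    """Average yearly cost per stored record (derived, not from the slide)."""
    return total_cost_usd / n_records


one_million = 10**6
total = total_storage_bytes(one_million)
print(total == EXABYTE)                                      # True
print(yearly_cost_per_record(100_000_000, one_million))      # 100.0
```

So at the slide's figures, one million 1 TB records fill exactly one exabyte, at roughly US $100 per record per year.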
Slide 19/19
Example Influential Paper
• In Usenix Security 2014: “Privacy in Pharmacogenetics: An End-to-End Case Study of Personalized Warfarin Dosing,” Matthew Fredrikson, Eric Lantz, Somesh Jha, Simon Lin, David Page, Thomas Ristenpart
• A case study of warfarin dosing, a popular target for pharmacogenetic modeling
• Warfarin is an anticoagulant widely used to help prevent strokes in patients suffering from atrial fibrillation (a type of irregular heartbeat)
• However, it is known to exhibit a complex dose-response relationship affected by multiple genetic markers, with improper dosing leading to increased risk of stroke or uncontrolled bleeding.
• A long line of work has sought pharmacogenetic models that can accurately predict proper dosage based on patient clinical history, demographics, and genotype
• Their study used a dataset collected by the International Warfarin Pharmacogenetics Consortium (IWPC), to date the most expansive such database, containing demographic information, genetic markers, and clinical histories for thousands of patients from around the world