Data and Python in Biology at PyData NYC 2015
How big data is transforming biology and how we are using Python to make sense of it all
Maria Nattestad
Computational biology PhD student
PyData NYC 2015
Overview
Genome sequencing
Using Python to study cancer
Personal genomics
Your genome
46 strings of A, T, C, and G for a total of about 6 billion characters
[Figure: the 46 chromosomes of a male genome]
Mutations in the genome can lead to cancer and other diseases
Over 20,000 genes are scattered all over the genome.
The genome is the instruction manual for creating a living thing.
Some changes in the genome do nothing or encode normal variation like hair color, while others can cause disease.
Illumina = “Next-generation sequencing”
Sanger = The original
Human Genome Project publishes first draft
Big Data
3000 Rice Genomes Project
Sequencing by the numbers
• Human genome is 6 billion letters [ATCG]
• No technology exists that can read an entire chromosome from end to end
• Illumina sequencing produces reads of 100 letters each
• If the genome were random, this would be enough
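The claim that 100 letters would be enough in a random genome can be checked with a little arithmetic. This is a minimal sketch that assumes every letter is independent and equally likely, which is an idealization and not how real genomes behave; the function names are illustrative only.

```python
GENOME_SIZE = 6_000_000_000  # letters, from the slide
READ_LENGTH = 100            # letters per Illumina read, from the slide
ALPHABET = 4                 # A, T, C, G


def expected_chance_matches(genome_size, read_length, alphabet=ALPHABET):
    """Expected number of positions that match a read purely by chance,
    assuming a uniformly random genome (an idealization)."""
    return genome_size / alphabet ** read_length


def min_unique_length(genome_size, alphabet=ALPHABET):
    """Smallest read length with more possible sequences than genome positions."""
    length = 1
    while alphabet ** length <= genome_size:
        length += 1
    return length


print(expected_chance_matches(GENOME_SIZE, READ_LENGTH))  # vanishingly small
print(min_unique_length(GENOME_SIZE))                     # 17
```

Under this assumption, even reads far shorter than 100 letters would almost always land in exactly one place.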
The genome is not random
ATCGATCAT?ATCGATCATA
repeats
Because of this the human genome STILL has gaps
Repeats make it harder to assemble the genome puzzle
[Figure: segments A, B, C, and D joined by copies of a repeat R]
If a repeat is longer than the reads
Long-read DNA Sequencing
Pacific Biosciences
Oxford Nanopore MinION
>10X as expensive as next-generation (Illumina) sequencing
>100X read length
Resolving repeats with long-read sequencing
[Figure: segments A, B, C, and D and repeat R, with long reads spanning the repeat copies]
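The difference between short and long reads around a repeat can be illustrated on a toy sequence. This is a sketch under stated assumptions: the genome string, the repeat, and the exact-match search are all made up for illustration, and real aligners tolerate sequencing errors.

```python
def find_all(genome, read):
    """Return every start position where the read matches the genome exactly."""
    positions = []
    start = genome.find(read)
    while start != -1:
        positions.append(start)
        start = genome.find(read, start + 1)
    return positions


REPEAT = "ATCGATCAT"  # toy repeat, present three times below
genome = "GG" + REPEAT + "CCTA" + REPEAT + "TTGA" + REPEAT + "AC"

short_read = "CGATC"        # lies entirely inside the repeat
long_read = genome[11:26]   # spans one repeat copy plus unique sequence on both sides

print(find_all(genome, short_read))  # [4, 17, 30]: three equally good placements
print(find_all(genome, long_read))   # [11]: a single placement
```

The short read cannot say which repeat copy it came from, while the long read carries unique sequence on both sides of the repeat and so fits in only one place.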
Overview
Genome sequencing
Using Python to study cancer
Personal genomics
How the human genome changes during cancer
Normal human genome
(Davidson et al, 2000)
80 chromosomes instead of 46
Cancer genome
Cell line from a woman with metastatic breast cancer in 1971. The tumor cells have been grown and studied in the lab ever since.
Split-read variant calling
chromosome 1
chromosome 2
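A split read is a single read whose two parts align to different places, here chromosome 1 and chromosome 2. The following is a toy sketch of the idea only: the sequences, function names, and exact-match search are assumptions for illustration, not the method used by any particular variant caller.

```python
def locate(segment, references):
    """Return (chromosome, position) of the first exact match, or None."""
    for chrom, seq in references.items():
        pos = seq.find(segment)
        if pos != -1:
            return chrom, pos
    return None


def call_split_read(read, references, min_part=4):
    """Find a split point where both halves of the read align somewhere."""
    if locate(read, references) is not None:
        return None  # the whole read aligns end to end, so it is not split
    for split in range(min_part, len(read) - min_part + 1):
        left = locate(read[:split], references)
        right = locate(read[split:], references)
        if left is not None and right is not None:
            return {"left": left, "right": right, "split": split}
    return None


# Made-up reference sequences for illustration.
references = {"chr1": "GATTACAGGCTTAACG", "chr2": "TTCCGGAAGTCTCACA"}
# A read that starts in chr1 and continues in chr2.
read = references["chr1"][2:10] + references["chr2"][6:14]

print(call_split_read(read, references))
```

A read like this is evidence that the two chromosomes are joined at that point in the sequenced genome.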
A simple gene fusion
[Figure: Gene1 and Gene2 brought together as Gene1 Gene2]
A complex gene fusion
[Figure: Gene1 and Gene2 brought together as Gene1 Gene2]
SplitThreader: A new Python graph library for representing rearranged genomes
[Figure: CHR 1 (ATCGCCTA) and CHR 2 (GTCCATAG), then the same chromosomes split into nodes (ATCG, CCGA; ATAG, GTCC) joined by edges labeled 8, 10, and 2]
Class structure of SplitThreader
[Figure: class diagram with Graph, Node, Edge, and Port]
Once you enter a node, you must exit out the other side like a tunnel.
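The tunnel rule can be sketched with four small classes. This is a hypothetical illustration that borrows the class names from the slide; the attributes and methods (`exit_port`, `far_end`, `connect`, `next_ports`) are assumptions and may not match SplitThreader's actual API.

```python
class Port:
    """One end of a node. Edges attach to ports, not to nodes."""
    def __init__(self, node, side):
        self.node = node
        self.side = side  # "start" or "end"
        self.edges = []


class Node:
    """A stretch of sequence with one port at each end."""
    def __init__(self, name):
        self.name = name
        self.ports = {"start": Port(self, "start"), "end": Port(self, "end")}

    def exit_port(self, entry_port):
        """Tunnel rule: entering through one port means leaving by the other."""
        other_side = "end" if entry_port.side == "start" else "start"
        return self.ports[other_side]


class Edge:
    """A connection between two ports, such as a rearrangement breakpoint."""
    def __init__(self, port_a, port_b):
        self.ports = (port_a, port_b)
        port_a.edges.append(self)
        port_b.edges.append(self)

    def far_end(self, port):
        return self.ports[1] if port is self.ports[0] else self.ports[0]


class Graph:
    def __init__(self):
        self.nodes = {}
        self.edges = []

    def add_node(self, name):
        self.nodes[name] = Node(name)

    def connect(self, name_a, side_a, name_b, side_b):
        edge = Edge(self.nodes[name_a].ports[side_a],
                    self.nodes[name_b].ports[side_b])
        self.edges.append(edge)

    def next_ports(self, entry_port):
        """Ports that can be entered next after entering a node at entry_port."""
        exit_port = entry_port.node.exit_port(entry_port)
        return [edge.far_end(exit_port) for edge in exit_port.edges]


graph = Graph()
for name in ("chr1_left", "chr1_right", "chr2_left", "chr2_right"):
    graph.add_node(name)
graph.connect("chr1_left", "end", "chr1_right", "start")  # normal adjacency
graph.connect("chr2_left", "end", "chr2_right", "start")  # normal adjacency
graph.connect("chr1_left", "end", "chr2_right", "start")  # rearrangement

entry = graph.nodes["chr1_left"].ports["start"]
print(sorted(p.node.name for p in graph.next_ports(entry)))
```

Attaching edges to ports instead of nodes is what keeps a path from entering a node and turning back out the same side.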
Biological insights from SplitThreader
Depth-first search or breadth-first search
Gene fusion finding
History of mutations
Using SplitThreader to find a gene fusion
CYTH1
EIF3H
Goal is to find a path like this:
CYTH1 EIF3H
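Finding such a path can be sketched as a breadth-first search over (node, entry side) states, so that the tunnel rule is respected. This is a minimal illustration under assumptions: the node names, the toy edges, and the helper functions are hypothetical and are not taken from SplitThreader or from real coordinates of CYTH1 and EIF3H.

```python
from collections import deque


def other_side(side):
    return "end" if side == "start" else "start"


def add_edge(edges, port_a, port_b):
    """Record an undirected edge between two ports, each a (node, side) pair."""
    edges.setdefault(port_a, []).append(port_b)
    edges.setdefault(port_b, []).append(port_a)


def find_path(edges, start_node, goal_node):
    """Breadth-first search; returns a list of node names, or None."""
    queue = deque()
    seen = set()
    for side in ("start", "end"):  # the path may leave the start node either way
        state = (start_node, side)
        seen.add(state)
        queue.append((state, [start_node]))
    while queue:
        (node, entry_side), path = queue.popleft()
        if node == goal_node:
            return path
        # Tunnel rule: leave through the port opposite the one we entered.
        exit_port = (node, other_side(entry_side))
        for next_port in edges.get(exit_port, []):
            if next_port not in seen:
                seen.add(next_port)
                queue.append((next_port, path + [next_port[0]]))
    return None


# Toy rearranged genome: hypothetical segments, not real data.
edges = {}
add_edge(edges, ("CYTH1_region", "end"), ("middle", "start"))
add_edge(edges, ("middle", "end"), ("EIF3H_region", "start"))
add_edge(edges, ("CYTH1_region", "end"), ("other", "end"))

print(find_path(edges, "CYTH1_region", "EIF3H_region"))
print(find_path(edges, "other", "EIF3H_region"))
```

The second search finds nothing even though `other` touches `CYTH1_region`: it enters that node through its end port and must leave through the start port, which has no edges.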
Too many copies of Her2 contribute to making cancer worse
Sequencing
Actual genome
Her2
Too much Her2
Too much signal to divide
Too many cell divisions
Cancer grows
About 40 copies of Her2 gene scattered around the genome
[Figure: copies of the Her2 gene; labels: Chr 17: 83 Mb, 8 Mb, Chromosome 17, Chr 17, Chr 8]
1. Healthy chromosome 17
2. Sequence copied into chromosome 8
3. Subsequence copied within chromosome 8
4. Complex variant and inverted duplication within chromosome 8
5. Subsequence copied within chromosome 8
SplitThreader is open source on GitHub
https://github.com/marianattestad/splitthreader
Visualization with D3.js is underway!
Contributions are very welcome
Overview
Genome sequencing
Using Python to study cancer
Personal genomics
Personal genomics
SNP chip
• 23andMe
• About $100
• Captures tiny mutations scientists already know to look for
Sequencing
• Illumina, SureGenomics
• About $1,000
• Captures large and small mutations even if completely novel and unexpected
Personal genomics debates
• Should the government allow these companies to give people their genomic data?
– How about interpreting the health risks?
• Is sharing your own genome breaking your family’s privacy?
THANK YOU