Data and Python in Biology at PyData NYC 2015

How big data is transforming biology and how we are using Python to make sense of it all. Maria Nattestad, Computational biology PhD student, PyData NYC 2015

Upload: maria-nattestad

Post on 18-Feb-2017

TRANSCRIPT

Page 1: Data and Python in Biology at PyData NYC 2015

How big data is transforming biology and how we are using Python to make sense of it all

Maria Nattestad

Computational biology PhD student

PyData NYC 2015

Page 2: Data and Python in Biology at PyData NYC 2015

Overview

Genome sequencing

Using Python to study cancer

Personal genomics

Page 4: Data and Python in Biology at PyData NYC 2015

Your genome

46 strings of A, T, C, and G for a total of about 6 billion characters

male

Page 5: Data and Python in Biology at PyData NYC 2015

Mutations in the genome can lead to cancer and other diseases

Over 20,000 genes are scattered all over the genome.

The genome is the instruction manual for creating a living thing.

Some changes in the genome do nothing or encode normal variation like hair color, while others can cause disease.

Page 6: Data and Python in Biology at PyData NYC 2015

Illumina = “Next-generation sequencing”

Sanger = The original

Human Genome Project publishes first draft

Page 7: Data and Python in Biology at PyData NYC 2015

Big Data

3000 Rice Genomes Project

Page 8: Data and Python in Biology at PyData NYC 2015

Sequencing by the numbers

• Human genome is 6 billion letters [ATCG]

• No technology exists that can read an entire chromosome from end to end

• Illumina sequencing produces reads of 100 letters of sequence

• If the genome were random, this would be enough
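A quick back-of-the-envelope check of the last bullet, as a sketch only. It assumes an idealized genome in which every letter is drawn uniformly at random, and the function name is made up for illustration. Under that assumption, a read is expected to be unique once there are more possible strings of its length than positions in the genome.

```python
GENOME_LENGTH = 6_000_000_000  # letters, the figure quoted on the slide
ALPHABET_SIZE = 4              # A, T, C, G


def min_unique_length(genome_length, alphabet_size=ALPHABET_SIZE):
    """Smallest k for which the number of possible k-letter strings
    exceeds the number of positions in the genome (idealized random model)."""
    k = 1
    while alphabet_size ** k <= genome_length:
        k += 1
    return k


k = min_unique_length(GENOME_LENGTH)
print(k)  # 17

# 100-letter reads are far beyond that threshold in the random model.
print(ALPHABET_SIZE ** 100 > GENOME_LENGTH)  # True
```

In this idealized model about 17 letters would already suffice, so 100 letters leaves a very wide margin.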

Page 9: Data and Python in Biology at PyData NYC 2015

The genome is not random

ATCGATCAT?ATCGATCATA

repeats

Because of this, the human genome STILL has gaps
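A minimal sketch of why repeats break the random-genome argument. The toy genome and the helper `find_all` are invented for illustration, and exact string matching stands in for real read alignment. A read that lies entirely inside a repeat matches more than one place, while a read that extends past the repeat can be placed uniquely.

```python
def find_all(genome, read):
    """Return every start position where `read` matches `genome` exactly."""
    positions = []
    start = genome.find(read)
    while start != -1:
        positions.append(start)
        start = genome.find(read, start + 1)
    return positions


# Toy genome: the same repeat occurs twice, followed by different letters.
REPEAT = "ATCGATCAT"
genome = "GG" + REPEAT + "A" + "CCTT" + REPEAT + "G" + "TT"

short_read = "ATCGATCAT"   # lies entirely inside the repeat
long_read = "ATCGATCATG"   # extends one letter past the repeat

print(find_all(genome, short_read))  # [2, 16]  ambiguous placement
print(find_all(genome, long_read))   # [16]     unique placement
```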

Page 10: Data and Python in Biology at PyData NYC 2015

Repeats make it harder to assemble the genome puzzle

[Figure: genome puzzle with unique segments A, B, C, and D joined through copies of a repeat R]

If a repeat is longer than the reads

Page 11: Data and Python in Biology at PyData NYC 2015

Long-read DNA Sequencing

Pacific Biosciences

Oxford Nanopore MinION

>10X as expensive as next-generation (Illumina) sequencing

>100X read length

Page 12: Data and Python in Biology at PyData NYC 2015

Resolving repeats with long-read sequencing

[Figure: segments A, B, C, and D and repeat R placed in a single order using long reads]

Page 13: Data and Python in Biology at PyData NYC 2015

Overview

Genome sequencing

Using Python to study cancer

Personal genomics

Page 14: Data and Python in Biology at PyData NYC 2015

How the human genome changes during cancer

Normal human genome

Page 15: Data and Python in Biology at PyData NYC 2015

How the human genome changes during cancer

(Davidson et al, 2000)

80 chromosomes instead of 46

Cancer genome

Cell line from a woman with metastatic breast cancer in 1971; the tumor cells have been grown and studied in the lab ever since.

Page 16: Data and Python in Biology at PyData NYC 2015

Split-read variant calling

chromosome 1

chromosome 2
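The slide only names the technique, so the following is a hedged sketch of the general idea rather than the method used in the talk. A split read is one whose first part matches one chromosome and whose remainder matches another, which points to a junction between the two. The function `find_split`, its `min_anchor` parameter, and the toy sequences are assumptions for illustration. Real aligners tolerate sequencing errors, while this sketch uses exact matching.

```python
def find_split(read, chromosomes, min_anchor=4):
    """Look for a breakpoint i such that read[:i] matches one chromosome
    and read[i:] matches a different one (exact matching only).

    Returns (left_chrom, left_pos, right_chrom, right_pos, i) or None.
    """
    for i in range(min_anchor, len(read) - min_anchor + 1):
        left, right = read[:i], read[i:]
        for left_name, left_seq in chromosomes.items():
            left_pos = left_seq.find(left)
            if left_pos == -1:
                continue
            for right_name, right_seq in chromosomes.items():
                if right_name == left_name:
                    continue
                right_pos = right_seq.find(right)
                if right_pos != -1:
                    return (left_name, left_pos, right_name, right_pos, i)
    return None


# Toy sequences, far shorter than anything real.
chromosomes = {
    "chromosome 1": "ATCGCCTA",
    "chromosome 2": "GTCCATAG",
}

split_read = "ATCG" + "ATAG"   # first half from chromosome 1, second from chromosome 2
normal_read = "ATCGCCTA"       # lies entirely on chromosome 1

print(find_split(split_read, chromosomes))
# ('chromosome 1', 0, 'chromosome 2', 4, 4)
print(find_split(normal_read, chromosomes))
# None
```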

Page 17: Data and Python in Biology at PyData NYC 2015

A simple gene fusion

Gene1

Gene2

Gene1 Gene2

Page 18: Data and Python in Biology at PyData NYC 2015

A complex gene fusion

Gene1

Gene2

Gene1 Gene2

Page 19: Data and Python in Biology at PyData NYC 2015

SplitThreader: A new Python graph library for representing rearranged genomes

[Figure: CHR 1 (ATCGCCTA) and CHR 2 (GTCCATAG), then the same chromosomes split into segments (ATCG, CCGA; ATAG, GTCC) and drawn as a graph with edges labeled 10, 2, and 8]

Page 20: Data and Python in Biology at PyData NYC 2015

Class structure of SplitThreader

[Figure: class diagram showing Graph, Node, Port, and Edge]

Once you enter a node, you must exit out the other side like a tunnel.
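The class names Graph, Node, Port, and Edge come from the slide, but everything else below is an assumption. This is a minimal sketch of how such a structure could enforce the tunnel rule, not SplitThreader's actual API. Each node is given two ports, edges attach to ports, and a traversal that enters through one port can only continue through the edges on the opposite port.

```python
class Port:
    """One end of a node. Edges attach to ports, not to nodes directly."""

    def __init__(self, node, side):
        self.node = node
        self.side = side      # "start" or "end"
        self.edges = []

    def __repr__(self):
        return f"{self.node.name}:{self.side}"


class Node:
    """A stretch of sequence with exactly two ports (hypothetical layout)."""

    def __init__(self, name):
        self.name = name
        self.ports = {"start": Port(self, "start"), "end": Port(self, "end")}

    def opposite(self, port):
        return self.ports["end" if port.side == "start" else "start"]


class Edge:
    """A connection between two ports, such as a rearrangement junction."""

    def __init__(self, port_a, port_b):
        self.ports = (port_a, port_b)
        port_a.edges.append(self)
        port_b.edges.append(self)

    def other_end(self, port):
        a, b = self.ports
        return b if port is a else a


class Graph:
    def __init__(self):
        self.nodes = {}
        self.edges = []

    def add_node(self, name):
        self.nodes[name] = Node(name)
        return self.nodes[name]

    def connect(self, name_a, side_a, name_b, side_b):
        edge = Edge(self.nodes[name_a].ports[side_a],
                    self.nodes[name_b].ports[side_b])
        self.edges.append(edge)
        return edge

    def next_ports(self, entry_port):
        """Tunnel rule: having entered a node at `entry_port`, leave through
        the opposite port and return the ports reachable from there."""
        exit_port = entry_port.node.opposite(entry_port)
        return [edge.other_end(exit_port) for edge in exit_port.edges]


g = Graph()
for name in ("A", "B", "C"):
    g.add_node(name)
g.connect("A", "end", "B", "start")
g.connect("A", "start", "C", "start")

# Entering A at its start, the only way out is A's end, which leads to B.
print(g.next_ports(g.nodes["A"].ports["start"]))  # [B:start]
```

Putting the edges on ports rather than on nodes is what rules out U-turns: the edge to C sits on the port we entered through, so it is not offered as a next step.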

Page 21: Data and Python in Biology at PyData NYC 2015

Biological insights from SplitThreader

Depth first search or breadth first search

Gene fusion finding

History of mutations

Page 22: Data and Python in Biology at PyData NYC 2015

Using SplitThreader to find a gene fusion

CYTH1

EIF3H

Goal is to find a path like this:

CYTH1 EIF3H
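A hedged sketch of how a breadth first search could look for such a path. The gene names CYTH1 and EIF3H are from the slide. The intermediate segments, the junction list, and the functions `build_edges` and `find_path` are hypothetical and do not reflect SplitThreader's real interface or the real rearrangements in this genome. Ports are written as `(node, side)` tuples, and the search respects the tunnel rule by always continuing from the port opposite the one it entered.

```python
from collections import deque


def opposite(port):
    node, side = port
    return (node, "end" if side == "start" else "start")


def build_edges(junctions):
    """Turn a list of (port, port) junctions into a symmetric lookup table."""
    edges = {}
    for a, b in junctions:
        edges.setdefault(a, []).append(b)
        edges.setdefault(b, []).append(a)
    return edges


def find_path(edges, start_node, goal_node):
    """Breadth first search over ports. Returns the list of nodes on the
    first path found from start_node to goal_node, or None."""
    # The path may leave the start node through either of its ports.
    queue = deque(((start_node, side), [start_node]) for side in ("start", "end"))
    seen = set()
    while queue:
        exit_port, path = queue.popleft()
        if exit_port in seen:
            continue
        seen.add(exit_port)
        for entry_port in edges.get(exit_port, []):
            node = entry_port[0]
            new_path = path + [node]
            if node == goal_node:
                return new_path
            # Tunnel rule: continue from the port opposite the entry port.
            queue.append((opposite(entry_port), new_path))
    return None


# Hypothetical junctions between genome segments.
junctions = [
    (("CYTH1", "end"), ("segment_x", "start")),
    (("segment_x", "end"), ("EIF3H", "start")),
    (("CYTH1", "start"), ("segment_y", "end")),
]
edges = build_edges(junctions)
print(find_path(edges, "CYTH1", "EIF3H"))
# ['CYTH1', 'segment_x', 'EIF3H']
```

Breadth first search returns a path with the fewest junctions first, which is a reasonable default when looking for the simplest explanation of a fusion. Depth first search would visit the same graph in a different order.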

Page 23: Data and Python in Biology at PyData NYC 2015

Too many copies of Her2 contribute to making cancer worse

Sequencing

Actual genome

Her2

Too much Her2

Too much signal to divide

Too many cell divisions

Cancer grows

Page 24: Data and Python in Biology at PyData NYC 2015

About 40 copies of Her2 gene scattered around the genome

Her2 gene
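The slides do not say how the figure of about 40 copies was estimated. One common approach, shown here purely as an assumption, is to compare sequencing read depth over a region with the depth over a baseline region of known copy number. The function name, the baseline of 2 copies, and the depth values below are made up for illustration.

```python
def estimate_copy_number(region_depth, baseline_depth, baseline_copies=2):
    """Scale the baseline copy number by the ratio of read depths.

    Assumes read depth is proportional to copy number, which ignores
    biases that real analyses have to correct for.
    """
    if baseline_depth <= 0:
        raise ValueError("baseline_depth must be positive")
    return baseline_copies * region_depth / baseline_depth


# Made-up depths: a region covered 20 times more deeply than a 2-copy baseline.
print(estimate_copy_number(region_depth=600, baseline_depth=30))  # 40.0
```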

Page 25: Data and Python in Biology at PyData NYC 2015

Her2

Chr 17: 83 Mb

8 Mb

Page 26: Data and Python in Biology at PyData NYC 2015

Her2

Her2

Page 27: Data and Python in Biology at PyData NYC 2015

Her2

8 Mb

Chromosome 17

Page 28: Data and Python in Biology at PyData NYC 2015

Her2

Chr 17

Chr 8

1. Healthy chromosome 17

2. Sequence copied into chromosome 8

3. Subsequence copied within chromosome 8

4. Complex variant and inverted duplication within chromosome 8

5. Subsequence copied within chromosome 8

Page 29: Data and Python in Biology at PyData NYC 2015

SplitThreader is open source on GitHub

[Figure: SplitThreader graph of CHR 1 and CHR 2 segments (ATCG, CCGA; ATAG, GTCC) joined by edges labeled 10, 2, and 8]

https://github.com/marianattestad/splitthreader

Visualization with D3.js is underway!

Contributions are very welcome

Page 30: Data and Python in Biology at PyData NYC 2015

Overview

Genome sequencing

Using Python to study cancer

Personal genomics

Page 31: Data and Python in Biology at PyData NYC 2015

Personal genomics

SNP chip

• 23andMe

• About $100

• Captures tiny mutations scientists already know to look for

Sequencing

• Illumina, SureGenomics

• About $1,000

• Captures large and small mutations even if completely novel and unexpected

Page 32: Data and Python in Biology at PyData NYC 2015

Personal genomics debates

• Should the government allow these companies to give people their genomic data?

– How about interpreting the health risks?

• Is sharing your own genome breaking your family’s privacy?

Page 33: Data and Python in Biology at PyData NYC 2015

THANK YOU