Crash. Burn. Roast the Marshmallows.

Yaniv Erlich, Whitehead Institute, @erlichya

Upload: yaniv-erlich

Posted on 22-Jun-2015

DESCRIPTION

Most of my group's scientific projects had a crash-and-burn moment. The trick was to use the flames to roast the marshmallows.

TRANSCRIPT

Page 1: Crash. Burn. Roast the Marshmallows

Yaniv Erlich @erlichya 9/13/14

Yaniv Erlich Whitehead Institute @erlichya

Crash. Burn. Roast the Marshmallows.

Page 2: Crash. Burn. Roast the Marshmallows

My last retreat

Pages 3-8: Crash. Burn. Roast the Marshmallows

A moment of reflection

[Figure: number of individuals (log(x+1) scale, ticks from 1 to 100,000,000) versus talk date (2009 to 2014, plus the job talk).]

Page 9: Crash. Burn. Roast the Marshmallows

Life cycle of a project

1. Preliminary data

2. Giving a talk at the retreat

3. Project crashes & burns

4. Use the flames! Repurpose data or methods for something completely different.

5. Publish

6. Doing the next “logical thing”

Page 10: Crash. Burn. Roast the Marshmallows

Example: challenge in technology

Job talk (Winter, 2010)

Find rare variations in large cohorts using combinatorial pooling

Crash and Burn moment

[1 year] Liquid handling robots suck.

Develop a reliable and cheap alternative for complex pipetting experiments

Page 11: Crash. Burn. Roast the Marshmallows

Example: challenge in technology

Roasted Marshmallows

Page 12: Crash. Burn. Roast the Marshmallows

Example: challenge in biology

First retreat (2010)

Finding discordant variation in whole genome sequencing of monozygotic twins

Crash and Burn moment

[7 months] Could not find any discordant point mutation.

Develop algorithms for repetitive elements

Method

lobSTR: A short tandem repeat profiler for personal genomes

Melissa Gymrek,1,2 David Golan,2,3 Saharon Rosset,3 and Yaniv Erlich2,4

1Harvard–MIT Division of Health Sciences and Technology, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA; 2Whitehead Institute for Biomedical Research, Cambridge, Massachusetts 02142, USA; 3Department of Statistics and Operations Research, Tel Aviv University, Tel Aviv 69978, Israel

Short tandem repeats (STRs) have a wide range of applications, including medical genetics, forensics, and genetic genealogy. High-throughput sequencing (HTS) has the potential to profile hundreds of thousands of STR loci. However, mainstream bioinformatics pipelines are inadequate for the task. These pipelines treat STR mapping as gapped alignment, which results in cumbersome processing times and a biased sampling of STR alleles. Here, we present lobSTR, a novel method for profiling STRs in personal genomes. lobSTR harnesses concepts from signal processing and statistical learning to avoid gapped alignment and to address the specific noise patterns in STR calling. The speed and reliability of lobSTR exceed the performance of current mainstream algorithms for STR profiling. We validated lobSTR’s accuracy by measuring its consistency in calling STRs from whole-genome sequencing of two biological replicates from the same individual, by tracing Mendelian inheritance patterns in STR alleles in whole-genome sequencing of a HapMap trio, and by comparing lobSTR results to traditional molecular techniques. Encouraged by the speed and accuracy of lobSTR, we used the algorithm to conduct a comprehensive survey of STR variations in a deeply sequenced personal genome. We traced the mutation dynamics of close to 100,000 STR loci and observed more than 50,000 STR variations in a single genome. lobSTR’s implementation is an end-to-end solution. The package accepts raw sequencing reads and provides the user with the genotyping results. It is written in C/C++, includes multi-threading capabilities, and is compatible with the BAM format.

[Supplemental material is available for this article.]

Short tandem repeats (STRs), also known as microsatellites, are a class of genetic variations with repetitive elements of 2–6 nucleotides (nt) that consist of approximately a quarter million loci in the human genome (Benson 1999). The repetitive structure of those loci creates unusual secondary DNA conformations that are prone to replication slippage events and result in high variability in the number of repeat elements (Mirkin 2007). The spontaneous mutation rate of STRs exceeds that of any other type of known genetic variation and can reach 1/500 mutations per locus per generation (Walsh 2001; Ballantyne et al. 2010), 200-fold higher than the rate of spontaneous copy number variations (CNV) (Lupski 2007) and 200,000-fold higher than the rate of de novo SNPs (Conrad et al. 2011).

STR variations have been instrumental in wide-ranging areas of human genetics. STR expansions are implicated in the etiology of a variety of genetic disorders, such as Huntington’s Disease and Fragile-X Syndrome (Pearson et al. 2005; Mirkin 2007). Forensics DNA fingerprinting relies on profiling autosomal STR markers and Y-chromosome STR (Y-STR) loci (Kayser and de Knijff 2011). STRs have been extensively used in genetic anthropology, where their high mutation rates create a unique capability to link recent historical events to DNA variations, including the well-known Cohen Modal Haplotype that segregates in patrilineal lines of Jewish priests (Skorecki et al. 1997; Zhivotovsky et al. 2004). Another relatively recent application of STR analysis is tracing cell lineages in cancer samples (Frumkin et al. 2008).

Despite the plurality of applications, STR variations are not routinely analyzed in whole-genome sequencing studies, mainly due to a lack of adequate tools (Treangen and Salzberg 2011). STRs pose a remarkable challenge to mainstream HTS analysis pipelines. First, not all reads that align to an STR locus are informative (Supplemental Fig. 1A). If a single or paired-end read partially encompasses an STR locus, it provides only a lower bound on the number of repeats. Only reads that fully encompass an STR can be used for exact STR allelotyping. Second, mainstream aligners, such as BWA, generally exhibit a trade-off between run time and tolerance to insertions/deletions (indels) (Li and Homer 2010). Thus, profiling STR variations, even for an expansion of three repeats in a trinucleotide STR, would require a cumbersome gapped alignment step and lengthy processing times (Supplemental Fig. 1B). Third, PCR amplification of an STR locus can create stutter noise, in which the DNA amplicons show false repeat lengths due to successive slippage events of DNA polymerase during amplification (Supplemental Fig. 1C; Hauge and Litt 1993; Ellegren 2004). Since PCR amplification is a standard step in library preparation for whole-genome sequencing, an STR profiler should explicitly model and attempt to remove this noise to enhance accuracy.
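The first challenge above (fully spanning reads give an exact repeat count, partially overlapping reads give only a lower bound) can be illustrated with a small sketch. This is not lobSTR code: the function names, the half-open reference coordinates, and the requirement of at least one flanking base on each side are assumptions made for illustration, and indels in the read are ignored.

```python
def classify_read(read_start, read_end, str_start, str_end):
    """Classify a read against an STR locus (hypothetical helper).

    All coordinates are half-open reference intervals [start, end).
    Assumption: a read is "spanning" only if it covers at least one
    flanking base on each side of the STR.
    """
    if read_end <= str_start or read_start >= str_end:
        return "no_overlap"
    if read_start < str_start and read_end > str_end:
        return "spanning"
    return "partial"


def repeat_lower_bound(read_start, read_end, str_start, str_end, period):
    """Lower bound on the repeat count implied by the overlapping bases.

    The allele is at least as long as the overlap, so it holds at least
    overlap // period whole repeat units.
    """
    overlap = min(read_end, str_end) - max(read_start, str_start)
    return max(overlap, 0) // period
```

For instance, under these assumptions a read covering positions 110 to 150 against a trinucleotide STR at 100 to 130 is classified as partial and bounds the allele from below, while a read covering 90 to 150 is spanning and could be used for exact allelotyping.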

Here, we present lobSTR, a rapid and accurate algorithm for STR profiling in whole-genome sequencing data sets (Fig. 1). Briefly, the algorithm has three steps. The first step is sensing: lobSTR swiftly scans genomic libraries, flags informative reads that fully encompass STR loci, and characterizes their STR sequence. This ab initio procedure relies on a signal processing approach that uses rapid entropy measurements to find informative STR reads followed by a Fast Fourier Transform to characterize the repeat sequence. The second step is alignment: lobSTR uses a divide-and-conquer strategy that anchors the nonrepetitive flanking regions
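The idea behind the sensing step can be sketched in a few lines. This is a toy illustration, not the lobSTR implementation: the function names and the choice of overlapping 2-mers are assumptions, and the period is found by a simple self-match scan over lags rather than the Fast Fourier Transform named in the excerpt.

```python
import math
from collections import Counter


def kmer_entropy(seq, k=2):
    """Shannon entropy (bits) of the overlapping k-mer distribution in seq.

    Repetitive sequence reuses a few k-mers, so its entropy is low.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(kmers)
    total = len(kmers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def best_period(seq, max_period=6):
    """Smallest lag in 2..max_period with the highest self-match fraction.

    A simpler stand-in for the FFT step. Assumes len(seq) > max_period.
    """
    def match_fraction(lag):
        pairs = len(seq) - lag
        return sum(seq[i] == seq[i + lag] for i in range(pairs)) / pairs

    # max() returns the first maximal element, so a true period of 3
    # wins over its multiple 6 when both match perfectly.
    return max(range(2, max_period + 1), key=match_fraction)
```

In this sketch a window such as `"CAG" * 10` scores a much lower entropy than a non-repetitive sequence of similar length, and `best_period` reports its repeat unit length as 3. A real profiler would need thresholds tuned on data and tolerance for sequencing errors, which are left out here.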

4Corresponding author. Article published online before print. Article, supplemental material, and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.135780.111.

Genome Research, published online April 20, 2012, in advance of the print journal. doi: 10.1101/gr.135780.111. © 2012 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/12; www.genome.org

Page 13: Crash. Burn. Roast the Marshmallows

Example: challenge in biology

Roasted Marshmallows

[Figure: observed p-value (y-axis) versus expected p-value (x-axis); diagram labels: gene, STR, TF.]

Page 14: Crash. Burn. Roast the Marshmallows

Example: challenge in peer-review

Second retreat (2011)

Surname inference from whole genome seq data

Crash and Burn moment

Submit to Science

Page 15: Crash. Burn. Roast the Marshmallows

Example: challenge in peer-review

Roasted Marshmallows

Page 16: Crash. Burn. Roast the Marshmallows

Summary

“Hartman: Does the gardener not labor more on the thorns than on the flowers?”

(S.Y. Agnon, Different Faces, 1932)

"הרטמן: תמיה אני אם לא יגע הגנן בקוצים יותר משיגע בפרחים?"

ש"י עגנון, פנים אחרות, 1932

Page 17: Crash. Burn. Roast the Marshmallows

Thank you for a wonderful time!

Funding

Andria and Paul Heafy

Burroughs Wellcome Career Award

National Human Genome Research Institute (NHGRI) R21

Broad Institute SPARC Award

Page 18: Crash. Burn. Roast the Marshmallows

“Yaneev Sandwich”

302 calories

Only $1.37!!!

Bialy

Tomato

American Cheese

1 Fried egg

Sriracha [Advanced users]