SMBE 2015: Expression STRs

Expression STRs Yaniv Erlich @erlichya

Upload: yaniv-erlich

Post on 13-Aug-2015

TRANSCRIPT

Page 1: SMBE 2015: Expression STRs

Yaniv Erlich 7/15/15 @erlichya

Expression STRs

Yaniv Erlich @erlichya

Page 2: SMBE 2015: Expression STRs

We know quite a lot about genetic variations…

Intro Genotyping STRs eSTRs eSTR or eSNPs? eSTRs in diseases

Expression STRs (eSTRs)

Page 3: SMBE 2015: Expression STRs

What about Short Tandem Repeats (STRs)?

CTCAATACAAGTCTAACAGCAGCAGCAGCAGCAGCAGCAGCAGTTGATGAAC
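As a rough illustration of what a tandem repeat looks like computationally, the sketch below finds the longest uninterrupted run of a motif in a sequence. The function name `longest_motif_run` and the choice of `CAG` as the motif are assumptions made for this example, not something stated in the talk.

```python
def longest_motif_run(sequence: str, motif: str) -> tuple[int, int]:
    """Return (start_index, copies) of the longest uninterrupted run of `motif`.

    Returns (-1, 0) when the motif does not occur at all.
    """
    if not motif:
        raise ValueError("motif must be a non-empty string")
    best_start, best_count = -1, 0
    for i in range(len(sequence) - len(motif) + 1):
        count, j = 0, i
        # Extend the run while back-to-back copies of the motif continue.
        while sequence.startswith(motif, j):
            count += 1
            j += len(motif)
        if count > best_count:
            best_start, best_count = i, count
    return best_start, best_count


# Sequence copied from the slide; the CAG motif is an assumption of this sketch.
slide_sequence = "CTCAATACAAGTCTAACAGCAGCAGCAGCAGCAGCAGCAGCAGTTGATGAAC"
start, copies = longest_motif_run(slide_sequence, "CAG")
print(f"CAG run starts at index {start} with {copies} copies")
```

Every start position is tried, so the scan is quadratic in the worst case; that is fine for read-length strings but not meant for whole genomes.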

Page 4: SMBE 2015: Expression STRs

Short Tandem Repeats

•  1% of the human genome
•  Fast mutation rates
•  Multiple Mendelian diseases: Huntington, Fragile X, OPMD, Synpolydactyly, Ataxia (10 types), HFG syndrome, Holoprosencephaly, Pseudoachondroplasia, Myotonic dystrophy, Cleidocranial Dysplasia, ALS-FTD
•  Evolvability

Page 5: SMBE 2015: Expression STRs

lobSTR – Whole genome solution for STR genotyping

Page 6: SMBE 2015: Expression STRs

lobSTR: An STR profiler for WGS

Method

lobSTR: A short tandem repeat profiler for personal genomes

Melissa Gymrek,1,2 David Golan,2,3 Saharon Rosset,3 and Yaniv Erlich2,4

1Harvard–MIT Division of Health Sciences and Technology, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA; 2Whitehead Institute for Biomedical Research, Cambridge, Massachusetts 02142, USA; 3Department of Statistics and Operations Research, Tel Aviv University, Tel Aviv 69978, Israel

Short tandem repeats (STRs) have a wide range of applications, including medical genetics, forensics, and genetic genealogy. High-throughput sequencing (HTS) has the potential to profile hundreds of thousands of STR loci. However, mainstream bioinformatics pipelines are inadequate for the task. These pipelines treat STR mapping as gapped alignment, which results in cumbersome processing times and a biased sampling of STR alleles. Here, we present lobSTR, a novel method for profiling STRs in personal genomes. lobSTR harnesses concepts from signal processing and statistical learning to avoid gapped alignment and to address the specific noise patterns in STR calling. The speed and reliability of lobSTR exceed the performance of current mainstream algorithms for STR profiling. We validated lobSTR’s accuracy by measuring its consistency in calling STRs from whole-genome sequencing of two biological replicates from the same individual, by tracing Mendelian inheritance patterns in STR alleles in whole-genome sequencing of a HapMap trio, and by comparing lobSTR results to traditional molecular techniques. Encouraged by the speed and accuracy of lobSTR, we used the algorithm to conduct a comprehensive survey of STR variations in a deeply sequenced personal genome. We traced the mutation dynamics of close to 100,000 STR loci and observed more than 50,000 STR variations in a single genome. lobSTR’s implementation is an end-to-end solution. The package accepts raw sequencing reads and provides the user with the genotyping results. It is written in C/C++, includes multi-threading capabilities, and is compatible with the BAM format.

[Supplemental material is available for this article.]

Short tandem repeats (STRs), also known as microsatellites, are a class of genetic variations with repetitive elements of 2–6 nucleotides (nt) that consist of approximately a quarter million loci in the human genome (Benson 1999). The repetitive structure of those loci creates unusual secondary DNA conformations that are prone to replication slippage events and result in high variability in the number of repeat elements (Mirkin 2007). The spontaneous mutation rate of STRs exceeds that of any other type of known genetic variation and can reach 1/500 mutations per locus per generation (Walsh 2001; Ballantyne et al. 2010), 200-fold higher than the rate of spontaneous copy number variations (CNV) (Lupski 2007) and 200,000-fold higher than the rate of de novo SNPs (Conrad et al. 2011).
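The fold differences quoted in this paragraph imply approximate per-generation rates for the other variant classes. The short calculation below only derives them from the figures in the excerpt (1/500, 200-fold, 200,000-fold); the variable names are illustrative.

```python
from fractions import Fraction

# Figures quoted in the excerpt above.
str_rate = Fraction(1, 500)      # upper-range STR mutation rate per locus per generation
cnv_rate = str_rate / 200        # STR rate is "200-fold higher" than the CNV rate
snp_rate = str_rate / 200_000    # STR rate is "200,000-fold higher" than the de novo SNP rate

# Implied rates: 2e-3 (STR), 1e-5 (CNV), 1e-8 (SNP).
print(float(str_rate), float(cnv_rate), float(snp_rate))
```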

STR variations have been instrumental in wide-ranging areas of human genetics. STR expansions are implicated in the etiology of a variety of genetic disorders, such as Huntington’s Disease and Fragile-X Syndrome (Pearson et al. 2005; Mirkin 2007). Forensics DNA fingerprinting relies on profiling autosomal STR markers and Y-chromosome STR (Y-STR) loci (Kayser and de Knijff 2011). STRs have been extensively used in genetic anthropology, where their high mutation rates create a unique capability to link recent historical events to DNA variations, including the well-known Cohen Modal Haplotype that segregates in patrilineal lines of Jewish priests (Skorecki et al. 1997; Zhivotovsky et al. 2004). Another relatively recent application of STR analysis is tracing cell lineages in cancer samples (Frumkin et al. 2008).

Despite the plurality of applications, STR variations are not routinely analyzed in whole-genome sequencing studies, mainly due to a lack of adequate tools (Treangen and Salzberg 2011). STRs pose a remarkable challenge to mainstream HTS analysis pipelines. First, not all reads that align to an STR locus are informative (Supplemental Fig. 1A). If a single or paired-end read partially encompasses an STR locus, it provides only a lower bound on the number of repeats. Only reads that fully encompass an STR can be used for exact STR allelotyping. Second, mainstream aligners, such as BWA, generally exhibit a trade-off between run time and tolerance to insertions/deletions (indels) (Li and Homer 2010). Thus, profiling STR variations, even for an expansion of three repeats in a trinucleotide STR, would require a cumbersome gapped alignment step and lengthy processing times (Supplemental Fig. 1B). Third, PCR amplification of an STR locus can create stutter noise, in which the DNA amplicons show false repeat lengths due to successive slippage events of DNA polymerase during amplification (Supplemental Fig. 1C; Hauge and Litt 1993; Ellegren 2004). Since PCR amplification is a standard step in library preparation for whole-genome sequencing, an STR profiler should explicitly model and attempt to remove this noise to enhance accuracy.
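The first challenge, that only reads fully encompassing an STR give an exact allele length, can be sketched as a simple coordinate check. The function `classify_read`, the half-open coordinate convention, and the `min_flank` parameter are all assumptions for illustration; the excerpt does not specify how lobSTR implements this test.

```python
def classify_read(read_start: int, read_end: int,
                  str_start: int, str_end: int,
                  min_flank: int = 5) -> str:
    """Classify a read against an STR locus using half-open [start, end) coordinates.

    "spanning": covers the whole STR plus `min_flank` bases on both sides,
                so the repeat count can be read off exactly.
    "partial":  overlaps the STR but runs out inside it or too close to its
                edge, so it gives only a lower bound on the repeat count.
    """
    overlaps = read_start < str_end and read_end > str_start
    if not overlaps:
        return "no_overlap"
    spans = (read_start <= str_start - min_flank
             and read_end >= str_end + min_flank)
    return "spanning" if spans else "partial"


# Hypothetical locus at [100, 127) with three example reads.
for read in [(90, 150), (110, 150), (0, 50)]:
    print(read, classify_read(*read, str_start=100, str_end=127))
```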

Here, we present lobSTR, a rapid and accurate algorithm for STR profiling in whole-genome sequencing data sets (Fig. 1). Briefly, the algorithm has three steps. The first step is sensing: lobSTR swiftly scans genomic libraries, flags informative reads that fully encompass STR loci, and characterizes their STR sequence. This ab initio procedure relies on a signal processing approach that uses rapid entropy measurements to find informative STR reads followed by a Fast Fourier Transform to characterize the repeat sequence. The second step is alignment: lobSTR uses a divide-and-conquer strategy that anchors the nonrepetitive flanking regions
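The sensing idea in this excerpt (low sequence entropy flags a repetitive read, and a Fourier transform reveals the repeat period) can be illustrated with a toy version. This is a minimal sketch under stated assumptions, not lobSTR's implementation: the function names are hypothetical, entropy is computed over dinucleotides, and a naive discrete Fourier transform stands in for the FFT.

```python
import cmath
import math
from collections import Counter


def kmer_entropy(read: str, k: int = 2) -> float:
    """Shannon entropy (bits) of the k-mer composition of a read."""
    kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
    total = len(kmers)
    counts = Counter(kmers)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def dominant_period(read: str) -> int:
    """Estimate the repeat period from the strongest non-zero DFT frequency."""
    n = len(read)
    best_k, best_power = 1, -1.0
    for k in range(1, n // 2 + 1):
        power = 0.0
        for base in "ACGT":
            # DFT coefficient of the 0/1 indicator sequence of this base.
            coeff = sum(cmath.exp(-2j * math.pi * k * pos / n)
                        for pos, b in enumerate(read) if b == base)
            power += abs(coeff) ** 2
        if power > best_power:
            best_k, best_power = k, power
    return round(n / best_k)


repetitive = "CAG" * 10
flanking = "CTCAATACAAGTCTAATTGATGAAC"
print(kmer_entropy(repetitive), kmer_entropy(flanking))  # repetitive read scores lower
print(dominant_period(repetitive))
```

Picking the single strongest frequency is deliberately simplistic: motifs with an internal sub-period (for example AGAT, where A recurs every two bases) can yield a harmonic instead of the true period, so a real profiler needs more care than this.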

Source: Melissa Gymrek, David Golan, Saharon Rosset, et al. lobSTR: A short tandem repeat profiler for personal genomes. Genome Research, published online April 20, 2012 in advance of the print journal. doi:10.1101/gr.135780.111. Copyright © 2012 by Cold Spring Harbor Laboratory Press.

Page 7: SMBE 2015: Expression STRs

New: HipSTR

•  HipSTR: Haplotype-based imputation, phasing and genotyping of STRs

•  Major improvements:
   – Learns locus-specific stutter models
   – Physical phasing
   – Imputes missing STRs or gives priors based on SNPs
   – Reports not only the length of an STR but also its sequence

Page 8: SMBE 2015: Expression STRs

HipSTR: solving homoplasy

•  Can now correctly detect STRs with identical lengths but different sequences (homoplasy)
•  Real example:
   – Length-based genotype: -4/-4
   – HipSTR genotype: (AGAT)8(ACAT)9 / (AGAT)10(ACAT)7
•  HipSTR available at https://github.com/tfwillems/HipSTR
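The homoplasy example on this slide can be checked directly: the two alleles have the same total length, so a length-only genotype cannot tell them apart, yet their sequences differ. The snippet simply expands the repeat notation from the slide; the variable names are illustrative.

```python
# Expand the repeat-unit notation given on the slide.
allele_1 = "AGAT" * 8 + "ACAT" * 9     # (AGAT)8(ACAT)9
allele_2 = "AGAT" * 10 + "ACAT" * 7    # (AGAT)10(ACAT)7

same_length = len(allele_1) == len(allele_2)
same_sequence = allele_1 == allele_2

# Both alleles are 68 bp (17 units of 4 bp), yet the sequences differ.
print(len(allele_1), len(allele_2), same_length, same_sequence)  # 68 68 True False
```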

Page 9: SMBE 2015: Expression STRs

Capillary-based validation

•  Simons Genome Diversity Project sequenced 280 individuals to 30x
•  For 105 of these samples, ~300 Marshfield STRs were genotyped using capillary electrophoresis
•  Compare the length of the STR genotypes to the capillary PCR products to assess accuracy

R² = 0.987
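A concordance figure like the R² above can be computed as the squared Pearson correlation between the two sets of allele lengths. The sketch below assumes that definition (the slide does not say how R² was calculated), and the toy numbers are made up purely to exercise the function; they are not the validation data.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy ** 2 / (sxx * syy)


# Hypothetical allele lengths (bp difference from reference), one discordant call.
capillary_lengths = [-8, -4, 0, 4, 8, 12]
sequencing_lengths = [-8, -4, 0, 4, 4, 12]
print(round(r_squared(capillary_lengths, sequencing_lengths), 3))
```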

Page 10: SMBE 2015: Expression STRs

Analyzing 1000Genomes STRs

Good allele frequency spectrum for 90% of the STRs in the genome.

Page 11: SMBE 2015: Expression STRs

A catalog of STR variations

strcat.teamerlich.org

Page 12: SMBE 2015: Expression STRs

Summary of the 1000 Genomes analysis

About 100,000 of the STRs in your genome are different from those of the person next to you…

Page 13: SMBE 2015: Expression STRs

Summary of the 1000 Genomes analysis

About 100,000 of the STRs in your genome are different from those of the person next to you…

But do normal STR variations have phenotypic consequences?

Page 14: SMBE 2015: Expression STRs

Paper

Expression STRs (eSTRs)

Page 15: SMBE 2015: Expression STRs

[Figure: four single-gene examples of expression versus number of repeats. Panels: PIG3 (Contente et al., 2002), NOS2A (Warpeha et al., 1999), EGFR (Gebhardt et al., 1999), MMP9 (Shimajiri et al., 1999). Axes: # of repeats (ticks 12 to 15), expression.]

Expression STRs (eSTRs): single gene studies in human

But we want a genome wide analysis!

Page 16: SMBE 2015: Expression STRs

Analysis pipeline

X: STR calls
Y: expression (RNA-seq)
384 samples

Regression tests ±100 kb from transcripts
H0: effect = 0
H1: effect ≠ 0

~190,000 tests for [genes × STR] + negative controls

[Figure: expression plotted against STR genotype]
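The per-pair test on this slide (H0: effect = 0 for expression regressed on STR genotype) can be sketched as an ordinary least-squares fit. This is an illustrative stand-in rather than the pipeline used in the talk: the function name is hypothetical, no covariates are included, and the p-value uses a normal approximation to the t statistic, which is only reasonable for large samples such as the 384 mentioned here. The toy data are made up.

```python
from statistics import NormalDist, mean


def str_expression_test(dosage, expression):
    """Regress expression on STR dosage; return (slope, t_statistic, p_value).

    The two-sided p-value uses a normal approximation to the t distribution.
    Requires some residual noise (a perfect fit gives a zero standard error).
    """
    n = len(dosage)
    mean_x, mean_y = mean(dosage), mean(expression)
    sxx = sum((x - mean_x) ** 2 for x in dosage)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosage, expression))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((y - (intercept + slope * x)) ** 2
              for x, y in zip(dosage, expression))
    std_err = (rss / (n - 2) / sxx) ** 0.5
    t_stat = slope / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))
    return slope, t_stat, p_value


# Hypothetical data: summed repeat counts of both alleles vs. normalized expression.
dosage = [24, 25, 26, 27, 28, 29, 30, 31]
expression = [1.0, 1.3, 1.1, 1.6, 1.5, 1.9, 1.8, 2.2]
slope, t_stat, p_value = str_expression_test(dosage, expression)
print(slope, t_stat, p_value)
```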

Page 17: SMBE 2015: Expression STRs

Genome-wide survey of eSTRs in human

[Figure: QQ plot. x-axis: expected p-value under the null [-log10]; y-axis: observed p-value [-log10]. Annotations: Signal; 2060 eSTRs; negative controls follow the null.]
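The coordinates of a QQ plot like this one can be derived from a list of p-values: sort the observed values and pair each with the quantile expected under a uniform null. The sketch below uses the common rank/(m+1) convention, which is an assumption; the slide does not say which plotting positions were used.

```python
import math


def qq_points(pvalues):
    """Return (expected, observed) pairs on the -log10 scale for a QQ plot.

    Expected quantiles use rank / (m + 1); p-values must be greater than 0.
    """
    m = len(pvalues)
    points = []
    for rank, p in enumerate(sorted(pvalues), start=1):
        expected = rank / (m + 1)
        points.append((-math.log10(expected), -math.log10(p)))
    return points


# Uniformly spaced p-values fall on the diagonal, as negative controls should.
null_like = [i / 10 for i in range(1, 10)]
for expected, observed in qq_points(null_like):
    print(round(expected, 3), round(observed, 3))
```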

Page 18: SMBE 2015: Expression STRs

Replication

Orthogonal populations; orthogonal expression assay (array)+

83% of eSTRs showed the same direction of effect (N=822; p<10^-93).
The effects were also highly correlated (R=0.73; p<10^-140).

[Figure: effect in RNA-seq vs. effect in array]

+Data from Stranger et al., PLoS Genetics, 2012

Most of the eSTRs are replicable

Page 19: SMBE 2015: Expression STRs

SNPs or STRs?

gene STR TF

Causality

Tagging

Biologically, not very interesting.

SNP

Page 20: SMBE 2015: Expression STRs

Decomposing variations

Y ~ X_STR + X_B

Simulations of negative controls (no eSTR contribution):

[Figure: two panels, each labeled "Simulated SNP-eQTL"; axes: h²_STR and h²_b]

Take home message: LMM is calibrated.
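One common way to write a variance-decomposition model of the form Y ~ X_STR + X_B is shown below. The slide gives only the model shorthand and the two labels h²_STR and h²_b, so everything else here is an assumption made for illustration: treating the STR as a fixed effect, modeling the background variants through a random effect u with kinship-like matrix K (scaled so its mean diagonal is 1), and the symbols β_STR, σ_b², and σ_e².

```latex
\begin{aligned}
y &= x_{\mathrm{STR}}\,\beta_{\mathrm{STR}} + u + \varepsilon, \\
u &\sim \mathcal{N}\!\left(0,\ \sigma_b^2 K\right), \qquad
\varepsilon \sim \mathcal{N}\!\left(0,\ \sigma_e^2 I\right), \\
h^2_{\mathrm{STR}} &= \frac{\operatorname{Var}\!\left(x_{\mathrm{STR}}\,\beta_{\mathrm{STR}}\right)}{\operatorname{Var}(y)}, \qquad
h^2_{b} = \frac{\sigma_b^2}{\operatorname{Var}(y)}.
\end{aligned}
```

Under this reading, a calibrated model should report h²_STR near zero in the negative-control simulations, where expression is driven only by a simulated SNP-eQTL.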

Page 21: SMBE 2015: Expression STRs

LMM results of real data

Linear mixed model (LMM) for variance decomposition for all genes:
eSTR vs. all common variants on the haplotype

eSTRs contribute 10%-15% of the cis-heritability of gene expression explained by common variants.

Page 22: SMBE 2015: Expression STRs

Regressing conditioned on best SNP

[Figure: expression vs. mean STR allele, stratified by best-SNP genotype (AA, AB, BB)]

Null hypothesis: random slopes

[Diagram: gene, STR, TF; Causality vs. Tagging]

Page 23: SMBE 2015: Expression STRs

Regressing conditioned on best SNP

[Figure: expression vs. mean STR allele, stratified by best-SNP genotype (AA, AB, BB)]

Alternative hypothesis: slopes in the same direction as the original association

[Diagram: gene, STR, TF; Causality? vs. Tagging]

Page 24: SMBE 2015: Expression STRs

Regressing conditioned on best SNP

75% of conditioned effects were in the same direction (p<10^-108)

[Figure: conditioned effect vs. unconditioned effect]
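The conditioning idea from the last three slides can be sketched as follows: group samples by their genotype at the best SNP, fit the STR-versus-expression slope inside each group, and check whether the slopes keep the sign of the original association. This is a simplified illustration with hypothetical function names and made-up data, not the analysis code behind the slides.

```python
from collections import defaultdict


def ols_slope(xs, ys):
    """Least-squares slope of ys on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx


def slopes_within_snp_genotype(snp_genotypes, str_means, expression):
    """Fit the STR-expression slope separately within each SNP genotype class."""
    groups = defaultdict(lambda: ([], []))
    for genotype, x, y in zip(snp_genotypes, str_means, expression):
        groups[genotype][0].append(x)
        groups[genotype][1].append(y)
    slopes = {}
    for genotype, (xs, ys) in groups.items():
        if len(set(xs)) >= 2:  # a slope needs STR variation within the class
            slopes[genotype] = ols_slope(xs, ys)
    return slopes


# Hypothetical samples: best-SNP genotype, mean STR allele, expression.
snp = ["AA", "AA", "AA", "AB", "AB", "AB", "BB"]
str_mean = [10, 11, 12, 10, 11, 12, 11]
expr = [1.0, 1.2, 1.4, 2.0, 2.1, 2.3, 3.0]
print(slopes_within_snp_genotype(snp, str_mean, expr))
```

If the STR were only tagging the SNP, the within-genotype slopes would be expected to scatter around zero with random signs; consistently signed slopes are what the alternative hypothesis predicts.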

Page 25: SMBE 2015: Expression STRs

Evidence for function of eSTRs

Conservation

[Figure: PhyloP (×10^-3, axis 0 to 2) by window size (bp): ±1000, ±500, ±250, ±100, ±50. p-value annotations in the order listed on the slide: p<7%, p<3%, p<0.1%, p<0.1%, p<1%.]

eSTRs are significantly enriched in more conserved regions

Page 26: SMBE 2015: Expression STRs

Co-localization with functional elements

Peak shift: eSTRs co-localize with histone signatures: p<0.01

But maybe these signatures are created by nearby causal variants?

Data: ENCODE LCL

Null (peak shifting): Trynka, bioRxiv, 2015

Page 27: SMBE 2015: Expression STRs

A potential role of eSTRs in human diseases

Associating the 2060 eSTRs × 31 phenotypes of ~1300 individuals in the UK10K

FDR<10%

Diastolic blood pressure: CLCC1, DIP2B
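An FDR threshold such as the 10% used here is commonly applied with the Benjamini-Hochberg step-up procedure. The sketch below assumes that procedure; the slide does not name the FDR method, and the function name and example p-values are illustrative.

```python
def benjamini_hochberg(pvalues, fdr=0.10):
    """Return the indices of tests declared significant at the given FDR.

    Step-up rule: find the largest rank r with p_(r) <= fdr * r / m and
    reject the r smallest p-values.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff_rank = 0
    for rank, index in enumerate(order, start=1):
        if pvalues[index] <= fdr * rank / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


# Hypothetical p-values for five eSTR-phenotype pairs.
example_pvalues = [0.001, 0.02, 0.04, 0.5, 0.9]
print(benjamini_hochberg(example_pvalues, fdr=0.10))  # [0, 1, 2]
```

With 2060 eSTRs and 31 phenotypes, m would be 63,860 tests if every pair were tested.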

Page 28: SMBE 2015: Expression STRs

A potential role of eSTRs in human diseases

Name          Symbol    P value   Phenotype       Class
4:9955416     SLC2A9    3.49E-08  Uric_Acid       Metabolic function
10:27124545   Abi1      4.61E-07  Phosphate       Metabolic function
17:44048491   KIAA1267  6.86E-06  FEV1.FVC_Ratio  Pulmonary function
16:473880     DECR2     2.51E-05  ApoA1           Metabolic function
1:109393265   CLCC1     2.89E-05  Diastolic_BP    Blood Pressure
6:20195837    MBOAT1    3.26E-05  Albumin         Metabolic function
1:110516300   FAM40A    5.07E-05  Urea            Metabolic function
12:51036810   DIP2B     1.02E-04  Diastolic_BP    Blood Pressure

Page 29: SMBE 2015: Expression STRs

Summary

The first genome-wide expression STR analysis.

1.  Over 2,000 eSTRs in the discovery set.
2.  Replication in independent platforms/populations.
3.  eSTRs account for 10-15% of cis-heritability by common variants.
4.  Functional evidence.
5.  eSTRs are associated with human phenotypes.

How much heritability is missing from GWAS studies because repetitive elements are not analyzed?

Page 30: SMBE 2015: Expression STRs

Team eSTR: Melissa Gymrek, Thomas Willems, Dina Zielinski, Stoyan Georgiev, Barak Marcus, Alkes Price, Mark Daly, Jonathan Pritchard

Acknowledgements

Funding: Burroughs Wellcome Career Award; National Institute of Justice

Page 31: SMBE 2015: Expression STRs

Outline

Yaniv Erlich 7/12/12 Towards a population scale map of STR variations

lobSTR: Profiling STR variations from WGS data

STR variations across 2,500 datasets: Preliminary results

[Figure: STR heterozygosity (axis 0.0 to 1.0) by population: All, CEU, GBR, FIN, IBS, YRI, LWK, ACB, ASW, CHB, CDX, CHS, JPT, KHV]

The End