SMBE 2015: Expression STRs
TRANSCRIPT
Yaniv Erlich 7/15/15 @erlichya
Expression STRs
Yaniv Erlich @erlichya
We know quite a lot about genetic variations…
Intro Genotyping STRs eSTRs eSTR or eSNPs? eSTRs in diseases
Expression STRs (eSTRs)
What about Short Tandem Repeats (STRs)?
CTCAATACAAGTCTAACAGCAGCAGCAGCAGCAGCAGCAGCAGTTGATGAAC
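The repeat in the sequence above can be located with a few lines of code. This is only an illustrative sketch using a regular expression; the function name `longest_str` and the 2-6 bp motif range with at least three copies are assumptions made for the example, not the method used by any tool mentioned in the talk.

```python
import re

# Sequence shown on the slide.
seq = "CTCAATACAAGTCTAACAGCAGCAGCAGCAGCAGCAGCAGCAGTTGATGAAC"

def longest_str(sequence, min_copies=3):
    """Return (motif, copies, start) of the longest perfect tandem repeat.

    Motifs of 2-6 bp are considered (an assumption for this sketch).
    Returns None when no repeat with at least `min_copies` copies exists.
    """
    # A lazily matched 2-6 bp unit followed by at least (min_copies - 1) more copies.
    pattern = re.compile(r"([ACGT]{2,6}?)\1{%d,}" % (min_copies - 1))
    best = max(pattern.finditer(sequence),
               key=lambda m: len(m.group(0)), default=None)
    if best is None:
        return None
    motif = best.group(1)
    return motif, len(best.group(0)) // len(motif), best.start()

motif, copies, start = longest_str(seq)
print(motif, copies, start)
```

A regular expression only finds perfect repeats; real reads with sequencing errors or interrupted repeats need a more tolerant approach.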
Short Tandem Repeats
• 1% of the human genome
• Fast mutation rates
• Multiple Mendelian diseases
• Evolvability
Diseases: Huntington, Fragile X, OPMD, Synpolydactyly, Ataxia (10 types), HFG syndrome, Holoprosencephaly, Pseudoachondroplasia, Myotonic dystrophy, Cleidocranial Dysplasia, ALS-FTD
lobSTR – Whole genome solution for STR genotyping
lobSTR: An STR profiler for WGS
lobSTR: A short tandem repeat profiler for personal genomes
Melissa Gymrek,1,2 David Golan,2,3 Saharon Rosset,3 and Yaniv Erlich2,4
1 Harvard–MIT Division of Health Sciences and Technology, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA; 2 Whitehead Institute for Biomedical Research, Cambridge, Massachusetts 02142, USA; 3 Department of Statistics and Operations Research, Tel Aviv University, Tel Aviv 69978, Israel
Short tandem repeats (STRs) have a wide range of applications, including medical genetics, forensics, and genetic genealogy. High-throughput sequencing (HTS) has the potential to profile hundreds of thousands of STR loci. However, mainstream bioinformatics pipelines are inadequate for the task. These pipelines treat STR mapping as gapped alignment, which results in cumbersome processing times and a biased sampling of STR alleles. Here, we present lobSTR, a novel method for profiling STRs in personal genomes. lobSTR harnesses concepts from signal processing and statistical learning to avoid gapped alignment and to address the specific noise patterns in STR calling. The speed and reliability of lobSTR exceed the performance of current mainstream algorithms for STR profiling. We validated lobSTR's accuracy by measuring its consistency in calling STRs from whole-genome sequencing of two biological replicates from the same individual, by tracing Mendelian inheritance patterns in STR alleles in whole-genome sequencing of a HapMap trio, and by comparing lobSTR results to traditional molecular techniques. Encouraged by the speed and accuracy of lobSTR, we used the algorithm to conduct a comprehensive survey of STR variations in a deeply sequenced personal genome. We traced the mutation dynamics of close to 100,000 STR loci and observed more than 50,000 STR variations in a single genome. lobSTR's implementation is an end-to-end solution. The package accepts raw sequencing reads and provides the user with the genotyping results. It is written in C/C++, includes multi-threading capabilities, and is compatible with the BAM format.
[Supplemental material is available for this article.]
Short tandem repeats (STRs), also known as microsatellites, are a class of genetic variations with repetitive elements of 2–6 nucleotides (nt) that consist of approximately a quarter million loci in the human genome (Benson 1999). The repetitive structure of those loci creates unusual secondary DNA conformations that are prone to replication slippage events and result in high variability in the number of repeat elements (Mirkin 2007). The spontaneous mutation rate of STRs exceeds that of any other type of known genetic variation and can reach 1/500 mutations per locus per generation (Walsh 2001; Ballantyne et al. 2010), 200-fold higher than the rate of spontaneous copy number variations (CNV) (Lupski 2007) and 200,000-fold higher than the rate of de novo SNPs (Conrad et al. 2011).
STR variations have been instrumental in wide-ranging areas of human genetics. STR expansions are implicated in the etiology of a variety of genetic disorders, such as Huntington's Disease and Fragile-X Syndrome (Pearson et al. 2005; Mirkin 2007). Forensics DNA fingerprinting relies on profiling autosomal STR markers and Y-chromosome STR (Y-STR) loci (Kayser and de Knijff 2011). STRs have been extensively used in genetic anthropology, where their high mutation rates create a unique capability to link recent historical events to DNA variations, including the well-known Cohen Modal Haplotype that segregates in patrilineal lines of Jewish priests (Skorecki et al. 1997; Zhivotovsky et al. 2004). Another relatively recent application of STR analysis is tracing cell lineages in cancer samples (Frumkin et al. 2008).
Despite the plurality of applications, STR variations are not routinely analyzed in whole-genome sequencing studies, mainly due to a lack of adequate tools (Treangen and Salzberg 2011). STRs pose a remarkable challenge to mainstream HTS analysis pipelines. First, not all reads that align to an STR locus are informative (Supplemental Fig. 1A). If a single or paired-end read partially encompasses an STR locus, it provides only a lower bound on the number of repeats. Only reads that fully encompass an STR can be used for exact STR allelotyping. Second, mainstream aligners, such as BWA, generally exhibit a trade-off between run time and tolerance to insertions/deletions (indels) (Li and Homer 2010). Thus, profiling STR variations, even for an expansion of three repeats in a trinucleotide STR, would require a cumbersome gapped alignment step and lengthy processing times (Supplemental Fig. 1B). Third, PCR amplification of an STR locus can create stutter noise, in which the DNA amplicons show false repeat lengths due to successive slippage events of DNA polymerase during amplification (Supplemental Fig. 1C; Hauge and Litt 1993; Ellegren 2004). Since PCR amplification is a standard step in library preparation for whole-genome sequencing, an STR profiler should explicitly model and attempt to remove this noise to enhance accuracy.
Here, we present lobSTR, a rapid and accurate algorithm for STR profiling in whole-genome sequencing data sets (Fig. 1). Briefly, the algorithm has three steps. The first step is sensing: lobSTR swiftly scans genomic libraries, flags informative reads that fully encompass STR loci, and characterizes their STR sequence. This ab initio procedure relies on a signal processing approach that uses rapid entropy measurements to find informative STR reads followed by a Fast Fourier Transform to characterize the repeat sequence. The second step is alignment: lobSTR uses a divide-and-conquer strategy that anchors the nonrepetitive flanking regions
Gymrek M, Golan D, Rosset S, Erlich Y. lobSTR: A short tandem repeat profiler for personal genomes. Genome Res. Published online April 20, 2012, in advance of the print journal. doi: 10.1101/gr.135780.111 (http://www.genome.org/cgi/doi/10.1101/gr.135780.111)
New: HipSTR
• HipSTR: Haplotype-based imputation, phasing and genotyping of STRs
• Major improvements:
  – Learns locus-specific stutter models
  – Physical phasing
  – Imputes missing STRs or gives priors based on SNPs
  – Reports not only the length of an STR but also its sequence
HipSTR: solving homoplasy
• Can now correctly detect STRs with identical lengths but different sequences (homoplasy)
• Real example:
  – Length-based genotype: -4/-4
  – HipSTR genotype: (AGAT)8(ACAT)9 / (AGAT)10(ACAT)7
• HipSTR available at https://github.com/tfwillems/HipSTR
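The homoplasy example above can be checked directly: both alleles have the same total length, so a length-only genotype cannot tell them apart. The sketch below simply expands the two alleles from the slide; the `(motif, copies)` representation and the name `expand` are assumptions for illustration and say nothing about how HipSTR represents alleles internally.

```python
def expand(allele):
    """Expand a list of (motif, copies) blocks into a nucleotide string."""
    return "".join(motif * copies for motif, copies in allele)

# The two alleles from the slide's real example.
allele_a = [("AGAT", 8), ("ACAT", 9)]
allele_b = [("AGAT", 10), ("ACAT", 7)]

seq_a, seq_b = expand(allele_a), expand(allele_b)

same_length = len(seq_a) == len(seq_b)   # length-based view: indistinguishable
same_sequence = seq_a == seq_b           # sequence-based view: different alleles

print(len(seq_a), len(seq_b), same_length, same_sequence)
```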
Capillary-based validation
• Simons Genome Diversity Project sequenced 280 individuals to 30x
• For 105 of these samples, ~300 Marshfield STRs were genotyped using capillary electrophoresis
• Compare the length of the STR genotypes to the capillary PCR products to assess accuracy
• R² = 0.987
Analyzing 1000Genomes STRs
Good allele frequency spectrum for 90% of the STRs in the genome.
A catalog of STR variations
strcat.teamerlich.org
Summary of the 1000 Genomes analysis
About 100,000 of the STRs in your genome are different from those of the person next to you…
But do normal STR variations have phenotypic consequences?
[Figure: expression vs. number of repeats (axis ticks 12 to 15 shown) from single-gene studies: PIG3 (Contente et al., 2002), NOS2A (Warpeha et al., 1999), EGFR (Gebhardt et al., 1999), MMP9 (Shimajiri et al., 1999)]
Expression STRs (eSTRs): single gene studies in human
But we want a genome wide analysis!
Analysis pipeline
• 384 samples: STR calls (X) and RNA-seq expression (Y)
• Regression tests for STRs within +/-100kb of transcripts
• H0: effect = 0; H1: effect ≠ 0
• ~190,000 tests for [genes x STR], plus negative controls
[Figure: expression vs. STR]
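A single gene x STR test from the pipeline above might look like the following sketch. Everything here is a simplifying assumption: the data are simulated toy values, the effect is a standardized least-squares slope, significance comes from a permutation test, and the negative control is built by shuffling STR calls across samples. The slides do not specify how the actual tests or negative controls were implemented.

```python
import numpy as np

def str_effect(dosage, expression):
    """Standardized least-squares slope of expression on STR dosage."""
    x = (dosage - dosage.mean()) / dosage.std()
    y = (expression - expression.mean()) / expression.std()
    return float(np.dot(x, y) / len(x))

def permutation_pvalue(dosage, expression, n_perm=2000, seed=0):
    """Two-sided permutation p-value for H0: effect = 0."""
    rng = np.random.default_rng(seed)
    observed = abs(str_effect(dosage, expression))
    null = np.array([abs(str_effect(rng.permutation(dosage), expression))
                     for _ in range(n_perm)])
    return float((1 + np.sum(null >= observed)) / (n_perm + 1))

# Toy data (hypothetical): 384 samples, STR dosage = sum of two allele lengths.
rng = np.random.default_rng(1)
n_samples = 384
dosage = (rng.integers(10, 16, size=n_samples)
          + rng.integers(10, 16, size=n_samples)).astype(float)
expression = 0.5 * (dosage - dosage.mean()) + rng.normal(size=n_samples)

p_signal = permutation_pvalue(dosage, expression)
# Negative control (assumed design): break the STR-expression link by shuffling.
p_control = permutation_pvalue(rng.permutation(dosage), expression)
print(p_signal, p_control)
```

In a genome-wide setting the same test would be repeated for every STR within the window around each transcript, which is where the roughly 190,000 tests come from.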
Genome-wide survey of eSTRs in human
[Figure: QQ plot of observed p-value [-log10] vs. expected p-value under the null [-log10]; labeled series: Signal]
• 2060 eSTRs
• Negative controls follow the null
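The observed-versus-expected comparison behind this kind of QQ plot can be sketched as follows. The helper name `qq_points` and the choice of plotting positions (i - 0.5)/n are assumptions for illustration; the slide does not state which convention was used.

```python
import numpy as np

def qq_points(pvalues):
    """Return (expected, observed) -log10 p-values for a QQ plot.

    Under the null, sorted p-values should follow the uniform quantiles,
    so points lie on the diagonal; true signal lifts the upper tail.
    """
    observed = np.sort(np.asarray(pvalues, dtype=float))
    n = len(observed)
    expected = (np.arange(1, n + 1) - 0.5) / n   # uniform plotting positions
    return -np.log10(expected), -np.log10(observed)

expected, observed = qq_points([0.5, 0.01, 0.2])
print(expected, observed)
```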
Replication
Orthogonal populations, orthogonal expression assay (array)+
• 83% of eSTRs showed the same direction of effect (N=822; p<10^-93)
• The effects were also highly correlated (R=0.73; p<10^-140)
[Figure: effect in RNA-seq vs. effect in array]
+Data from Stranger et al., PLoS Genetics, 2012
Most of the eSTRs are replicable
SNPs or STRs?
[Diagram: gene, STR, SNP, TF]
• Causality
• Tagging: biologically, not very interesting.
Decomposing variations
Model: Y ~ X_STR + X_B, with variance components h²_STR and h²_b
Simulations of negative controls (no eSTR contribution): simulated SNP-eQTL
[Figure: estimated h²_STR and h²_b for the simulated SNP-eQTLs]
Take-home message: LMM is calibrated.
LMM results of real data
Linear mixed model (LMM) for variance decomposition for all genes: eSTR vs. all common variants on the haplotype
eSTRs contribute 10%-15% of the cis-heritability of gene expression explained by common variants.
Regressing conditioned on best SNP
Null hypothesis: random slopes
[Figure: expression vs. mean STR allele within each best-SNP genotype class (AA, AB, BB); diagram: gene, STR, TF; causality vs. tagging]
Regressing conditioned on best SNP
Alternative hypothesis: slopes in the same direction as the original association
[Figure: expression vs. mean STR allele within each best-SNP genotype class (AA, AB, BB); diagram: gene, STR, TF; causality? vs. tagging]
Regressing conditioned on best SNP
75% of conditioned effects were in the same direction (p<10^-108)
[Figure: conditioned effect vs. unconditioned effect]
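The conditioning idea on these slides (fit the STR effect separately within each genotype class of the best SNP, then compare slope directions with the original association) can be sketched as below. The data are simulated toy values, the genotype coding 0/1/2 for AA/AB/BB and the function names are assumptions for illustration, and the actual analysis may have differed in its details.

```python
import numpy as np

def slope(x, y):
    """Ordinary least-squares slope of y on x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    return float(np.dot(xc, y - y.mean()) / np.dot(xc, xc))

def conditioned_slopes(str_dosage, snp_genotype, expression, min_samples=3):
    """STR slope within each SNP genotype class (0=AA, 1=AB, 2=BB)."""
    slopes = {}
    for g in np.unique(snp_genotype):
        mask = snp_genotype == g
        x = str_dosage[mask]
        # Skip classes that are too small or have no STR variation.
        if mask.sum() >= min_samples and x.max() > x.min():
            slopes[int(g)] = slope(x, expression[mask])
    return slopes

# Toy data (hypothetical): the STR is causal and correlated with the SNP.
rng = np.random.default_rng(2)
n_samples = 384
snp = rng.binomial(2, 0.5, size=n_samples)
str_dosage = (rng.integers(20, 31, size=n_samples) + snp).astype(float)
expression = 0.5 * str_dosage + rng.normal(size=n_samples)

unconditioned = slope(str_dosage, expression)
conditioned = conditioned_slopes(str_dosage, snp, expression)
same_direction = all(np.sign(s) == np.sign(unconditioned)
                     for s in conditioned.values())
print(unconditioned, conditioned, same_direction)
```

If the STR were merely tagging the SNP, the within-class slopes would be expected to scatter around zero with random signs, which corresponds to the null hypothesis of random slopes.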
Evidence for function of eSTRs
Conservation
[Figure: PhyloP (×10^-3, axis 0 to 2) vs. window size (bp): ±1000, ±500, ±250, ±100, ±50; p-values shown: p<7%, p<3%, p<0.1%, p<0.1%, p<1%]
eSTRs are significantly enriched in more conserved regions
Co-localization with functional elements
Peak shift: eSTR co-localization with histone signatures: p<0.01
But maybe these signatures are created by nearby causal variants?
Data: ENCODE LCL
Null (peak shifting): Trynka, bioRxiv, 2015
A potential role of eSTRs in human diseases
Associating the 2060 eSTRs with 31 phenotypes of ~1300 individuals in the UK10K
FDR<10%
Diastolic blood pressure: CLCC1, DIP2B
A potential role of eSTRs in human diseases
Name          Symbol     P value    Phenotype        Class
4:9955416     SLC2A9     3.49E-08   Uric_Acid        Metabolic function
10:27124545   Abi1       4.61E-07   Phosphate        Metabolic function
17:44048491   KIAA1267   6.86E-06   FEV1.FVC_Ratio   Pulmonary function
16:473880     DECR2      2.51E-05   ApoA1            Metabolic function
1:109393265   CLCC1      2.89E-05   Diastolic_BP     Blood Pressure
6:20195837    MBOAT1     3.26E-05   Albumin          Metabolic function
1:110516300   FAM40A     5.07E-05   Urea             Metabolic function
12:51036810   DIP2B      1.02E-04   Diastolic_BP     Blood Pressure
Summary
The first genome-wide expression STR analysis.
1. Over 2,000 eSTRs in the discovery set.
2. Replication in independent platforms/populations.
3. eSTRs account for 10-15% of cis-heritability by common variants.
4. Functional evidence.
5. eSTRs are associated with human phenotypes.
How much heritability is missing in GWAS studies by not analyzing repetitive elements?
Acknowledgements
Team eSTR: Melissa Gymrek, Thomas Willems, Dina Zielinski, Stoyan Georgiev, Barak Marcus, Alkes Price, Mark Daly, Jonathan Pritchard
Funding: Burroughs Wellcome Career Award, National Institute of Justice
Outline
Yaniv Erlich 7/12/12 Towards a population scale map of STR variations
lobSTR: Profiling STR variations from WGS data
STR variations across 2,500 datasets: Preliminary results
[Figure: STR heterozygosity (axis 0.0 to 1.0) by population: All, CEU, GBR, FIN, IBS, YRI, LWK, ACB, ASW, CHB, CDX, CHS, JPT, KHV]
The End