usda-ars cornell university  · 2014-10-31 · maize has more molecular diversity than humans and...

Edward Buckler USDA-ARS Cornell University http://www.maizegenetics.net Why can GBS be complicated? Tools for filtering, error correction and imputation.


Page 1: USDA-ARS Cornell University  · 2014-10-31 · Maize has more molecular diversity than humans and apes combined Silent Diversity (Zhao PNAS 2000; Tenallion et al, PNAS 2001) 1.34%

Edward Buckler

USDA-ARS, Cornell University

http://www.maizegenetics.net

Why can GBS be complicated? Tools for filtering, error correction and imputation.

Page 2:

Many Organisms Are Diverse

Humans are at the lower end of diversity, so most genomics and bioinformatics tools are optimized for this situation.

What is diverse: grapes, flies, Arabidopsis, and most dominant species on the planet.

Page 3:

Maize has more molecular diversity than humans and apes combined

Silent Diversity (Zhao PNAS 2000; Tenaillon et al., PNAS 2001)

[Figure: silent diversity values 1.34%, 0.09%, and 1.42%]

Page 4:

Maize genetic variation has been evolving for 5 million years

[Timeline figure: axis 5 mya, 4 mya, 3 mya, 2 mya, 1 mya; climate bands: Warm Pliocene, Cold Pleistocene]

Maize labels: Modern Variation Begins Evolving; Sister Genus Diverges; Zea species begin diverging; Maize domesticated

Human labels: Divergence from Chimps; Ardipithecus; Australopithecus; Homo erectus; Modern Humans; Modern Variation Begins

Page 5:

Features of TASSEL GBS Bioinformatic Pipeline

• Designed for large numbers of samples
• Modest computing requirements
• Species- or genus-level SNP calling & filtering
• Filtering to deal with extensive paralogy

Glaubitz et al. 2014, PLoS ONE. Available in TASSEL at maizegenetics.net

Page 6:

Reference-based GBS bioinformatics pipeline

[Flow diagram with two tracks]

Discovery: FASTQ; Tag Counts; Tags on Physical Map; Tags by Taxa; SNP Caller; Genotypes; Filtered Genotypes

Production: FASTQ; Tags on Physical Map; Genotypes

Page 7:

What are our expectations with GBS?

Page 8:

High Diversity Ensures High Return on Sequencing

• Proportion of informative markers
– Highly repetitive: 15% not easily informative
– Half the genome is not shared between two maize lines
  • Potentially all of these are informative with a large enough database
– Low copy shared proportion (1% diversity)
  • Bi-parental information = 1 - (1 - 0.01)^64 (64 bp tag) ≈ 47% informative
  • Association information = 1 - (1 - 0.05)^64 (64 bp tag) ≈ 96% informative
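The two percentages above follow from asking how likely a 64 bp tag is to contain at least one variable site. A minimal sketch, assuming sites vary independently at the stated per-base diversity (the function name `informative_fraction` is made up for this illustration):

```python
def informative_fraction(diversity: float, tag_length: int = 64) -> float:
    """Probability that a tag of `tag_length` bases holds at least one variant.

    Assumes every base varies independently with probability `diversity`.
    """
    return 1.0 - (1.0 - diversity) ** tag_length


# Per-base diversity values taken from the slide: 1% between two parents,
# 5% across an association panel.
for label, diversity in (("bi-parental", 0.01), ("association", 0.05)):
    print(f"{label}: {informative_fraction(diversity):.1%} of 64 bp tags informative")
```

Under these assumptions the results land within a percentage point of the figures quoted on the slide.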

Page 9:

Expectation of marker distribution

Category | Biparental population | Across the species
Biallelic | 17% | (not shown)
Multiallelic | (not shown) | 34%
Too Repetitive | 15% | 15%
Non-polymorphic | 18% | 1%
Presence/Absence | 50% | 50%
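The slice sizes appear to follow from the previous slide's numbers: 15% of tags are too repetitive, 50% are presence/absence, and the remaining low-copy shared tags split into polymorphic and non-polymorphic. A rough sketch of that bookkeeping, assuming independent sites and 64 bp tags (the function and key names are illustrative, not from the talk):

```python
def marker_classes(diversity, repetitive=0.15, presence_absence=0.50, tag_length=64):
    """Split tags into the four classes shown in the pie charts."""
    low_copy_shared = 1.0 - repetitive - presence_absence
    # Fraction of low-copy shared tags carrying at least one variant.
    polymorphic = low_copy_shared * (1.0 - (1.0 - diversity) ** tag_length)
    return {
        "polymorphic": polymorphic,
        "non_polymorphic": low_copy_shared - polymorphic,
        "too_repetitive": repetitive,
        "presence_absence": presence_absence,
    }


for label, diversity in (("biparental", 0.01), ("across species", 0.05)):
    classes = marker_classes(diversity)
    print(label, {name: f"{value:.0%}" for name, value in classes.items()})
```

With 1% and 5% diversity this reproduces the 17%/18% and 34%/1% splits in the charts after rounding.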

Page 10:

Sequencing Error

Page 11:

Illumina Basic Error Rate is ~1%

• Error rates are associated with distance from start of sequence
– Bad: GBS puts these all at the same position
– Good: Reverse reads can correct
– Good: Errors are consistent and modelable

Page 12:

Reads with errors

• Perfect sequences: 0.99^64 ≈ 52.5% of the 64bp sequences are perfect
• 47.5% are NOT perfect

The errors are autocorrelated, so the proportion of perfect sequences is a little higher than this independent-error estimate, and the proportion with 2 or more errors is also higher.
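The independent-error baseline can be written out as a binomial distribution over the number of errors in a read. This is only the baseline the slide starts from; as noted, real errors are autocorrelated, which this sketch does not model (the function name is illustrative):

```python
from math import comb


def error_count_probabilities(read_length=64, error_rate=0.01, max_errors=2):
    """P(exactly k errors) for k = 0..max_errors, assuming independent errors."""
    return [
        comb(read_length, k) * error_rate**k * (1 - error_rate) ** (read_length - k)
        for k in range(max_errors + 1)
    ]


p0, p1, p2 = error_count_probabilities()
print(f"perfect reads:       {p0:.1%}")
print(f"exactly one error:   {p1:.1%}")
print(f"two or more errors:  {1 - p0 - p1:.1%}")
```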

Page 13:

Do we see these errors?

• Assume 10,000 lines genotyped at 0.5X coverage

Base | Type | Read # (no SNP) | Read # (w/ SNP)
A | Major | 4950 | 4900
C | Minor | 17 | 67 (50 real)
G | Error | 17 | 17
T | Error | 17 | 17
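The read counts in the table can be reproduced with simple expected-value arithmetic. A sketch under stated assumptions: a 1% error rate split evenly over the three wrong bases, and a real minor allele frequency of 1% (inferred here from the "50 real" reads out of 5,000; the slide does not state it). Errors falling on minor-allele reads are ignored, as in the table.

```python
def expected_read_counts(n_lines=10_000, coverage=0.5, error_rate=0.01,
                         minor_allele_freq=0.0):
    """Expected reads per base at one site, rounded to whole reads."""
    total_reads = n_lines * coverage
    error_reads = total_reads * error_rate
    per_wrong_base = error_reads / 3          # errors spread over 3 other bases
    real_minor = total_reads * minor_allele_freq
    return {
        "A (major)": round(total_reads - error_reads - real_minor),
        "C (minor)": round(real_minor + per_wrong_base),
        "G (error)": round(per_wrong_base),
        "T (error)": round(per_wrong_base),
    }


print("no SNP:  ", expected_read_counts())
print("with SNP:", expected_read_counts(minor_allele_freq=0.01))
```

The point of the comparison is that a rare real allele (67 reads) is only a few-fold above the error background (17 reads), so read counts alone separate them weakly.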

Page 14:

Do Errors Matter?

• Yes: Imputation, haplotype reconstruction
• Maybe: GWAS for low frequency SNPs
• No: GS, genetic distance, mapping on biparental populations

Page 15:

Expectations of Real SNPs

• Vast majority are biallelic
• Homozygosity is predicted by inbreeding coefficient
• Allele frequency is constrained in structured populations
• In linkage disequilibrium with neighboring SNPs

Page 16:

Studying Errors in Biparental populations

Limited range of alleles, expected allele frequencies, high LD

Page 17:

Maize RIL population expectations

• Allele frequency 0% or 50%
• Nearby sites should be in very high LD (r² > 50%)
• Most sites can be tested if multiple populations are available
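The allele-frequency expectation can be turned into a simple per-site check. This is a hypothetical sketch, not a TASSEL filter: the function name, class labels, and the 0.1 tolerance are all assumptions chosen for illustration, and the LD expectation is not modeled.

```python
def classify_ril_site(minor_allele_freq: float, tolerance: float = 0.1) -> str:
    """Compare a site's minor allele frequency with RIL expectations.

    In a RIL family a site should be fixed (frequency near 0) or
    segregating 1:1 (frequency near 0.5); anything else is flagged.
    """
    if minor_allele_freq <= tolerance:
        return "non-segregating"
    if abs(minor_allele_freq - 0.5) <= tolerance:
        return "segregating"
    return "suspect"


for freq in (0.0, 0.02, 0.25, 0.48):
    print(f"MAF {freq:.2f}: {classify_ril_site(freq)}")
```

Sites flagged as suspect are candidates for error or non-Mendelian segregation and would need a closer look, ideally across several populations.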

Page 18:

Bi-parental populations allow identification of error and non-Mendelian segregation

[Figure labels: Error; Non-segregating; Segregating]

Page 19:

Bi-parental populations allow identification of error, and non-Mendelian segregation

Error

Page 20:

Median error rate is 0, but there is a long tail of some high error sites

Median

Page 21:

GBS error rates vs. Maize 50K SNP Chip

• 7,254 SNPs in common
• 279 maize inbreds in common (“Maize282” panel)

Filtered GBS genotypes compared to 50K SNPs | Mean Error Rate (per SNP) | Median Error Rate (per SNP)
All genotypic comparisons | 1.18% | 0.93%
Homozygotes only | 0.58% | 0.42%

Internal GBS recall error rate with 1X coverage is ~0.2%; half of the errors relative to the 50K chip are paralogy issues between the systems.
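A per-SNP comparison like the one in the table can be sketched as follows. The data layout (a dict mapping SNP name to a list of genotype strings, one per inbred, with `None` for missing calls) is an assumption made for this example and is not the format used in the study.

```python
from statistics import mean, median


def per_snp_error_rates(gbs, chip):
    """Discordance rate per shared SNP, skipping calls missing in either set."""
    rates = []
    for snp in gbs.keys() & chip.keys():
        pairs = [(g, c) for g, c in zip(gbs[snp], chip[snp])
                 if g is not None and c is not None]
        if pairs:
            rates.append(sum(g != c for g, c in pairs) / len(pairs))
    return rates


# Toy data: two SNPs scored on four inbreds.
gbs_calls = {"s1": ["AA", "AA", "CC", None], "s2": ["GG", "GT", "TT", "TT"]}
chip_calls = {"s1": ["AA", "AA", "AA", "CC"], "s2": ["GG", "GG", "TT", "TT"]}
rates = per_snp_error_rates(gbs_calls, chip_calls)
print(f"mean {mean(rates):.3f}, median {median(rates):.3f}")
```

Restricting the pairs to homozygous calls before counting would give the "homozygotes only" row.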

Page 22:

MergeDuplicateSNPsPlugin

• When restriction sites are less than 128bp apart, we may read a SNP from both directions (strands)
• ~13% of all sites
• Fusing increases coverage
• Fixes errors
• -misMat = set maximum mismatch rate
• -callHets = mismatch set to hets or not
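The idea behind merging can be illustrated with a small sketch. This is not the plugin's implementation: the function, its data layout (one single-letter homozygous call or `None` per taxon), and the default threshold are assumptions; only the two options it mimics, a maximum mismatch rate and whether disagreements become hets, come from the slide.

```python
def merge_duplicate_calls(forward, reverse, max_mismatch_rate=0.05, call_hets=False):
    """Fuse calls for one SNP read from both strands.

    Returns the merged calls, or None when the two reads of the site
    disagree too often to be treated as the same SNP.
    """
    compared = [(f, r) for f, r in zip(forward, reverse)
                if f is not None and r is not None]
    mismatches = sum(f != r for f, r in compared)
    if compared and mismatches / len(compared) > max_mismatch_rate:
        return None

    merged = []
    for f, r in zip(forward, reverse):
        if f is None or r is None:
            merged.append(f if f is not None else r)   # fill in from the other strand
        elif f == r:
            merged.append(f)
        else:
            merged.append("/".join(sorted((f, r))) if call_hets else None)
    return merged


fwd = ["A", "A", None, "C", "A"]
rev = ["A", None, "C", "C", "C"]
print(merge_duplicate_calls(fwd, rev, max_mismatch_rate=0.5, call_hets=True))
```

Filling missing calls from the other strand is where the coverage gain comes from; the mismatch threshold guards against fusing two sites that are not really the same SNP.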

Page 23:

Use the biology of your system to filter SNPs

• Hardy-Weinberg disequilibrium
• For inbreeding crops, use the expected inbreeding coefficient to filter SNPs
• How to deal with the low coverage:
– Many individuals have only 1X coverage, but 27% of the samples will have 2X or more. Use these individuals to evaluate the quality of segregation.
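Both ideas on this slide reduce to short calculations. The sketch below assumes read depth at a site is Poisson-distributed around the mean coverage, and uses the standard estimate F = 1 - H_obs / H_exp for the inbreeding coefficient; the function names are illustrative.

```python
from math import exp, factorial


def prob_depth_at_least(k: int, mean_coverage: float) -> float:
    """P(depth >= k) when depth is Poisson with the given mean."""
    below = sum(exp(-mean_coverage) * mean_coverage**i / factorial(i)
                for i in range(k))
    return 1.0 - below


def inbreeding_coefficient(n_het: int, n_called: int, minor_allele_freq: float) -> float:
    """F = 1 - observed heterozygosity / Hardy-Weinberg expected heterozygosity."""
    expected_het = 2 * minor_allele_freq * (1 - minor_allele_freq)
    if n_called == 0 or expected_het == 0:
        raise ValueError("site has no calls or is monomorphic")
    return 1.0 - (n_het / n_called) / expected_het


print(f"P(depth >= 2 | mean 1X) = {prob_depth_at_least(2, 1.0):.1%}")
print(f"F for 5 hets in 100 calls at MAF 0.5 = {inbreeding_coefficient(5, 100, 0.5):.2f}")
```

Under the Poisson assumption about 26% of samples reach 2X at a mean of 1X, close to the slide's 27%. A SNP whose estimated F falls far below the value expected for an inbred crop is a candidate for removal, since excess heterozygosity often signals paralogs.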

Page 24:

Product of Filtering

• After filters, in maize we find a 0.0018 error rate
– AA<>aa = < 0.0018
– AA<>Aa = 0.8 at low coverage
• SNPs in wrong location <~1%. Lower in other species.

Page 25:

Only a limited proportion of the best alignments are shared between aligners

[Venn diagrams; the assignment of values to regions is not recoverable from the transcript]

Bowtie2, BWA: 51.5%, 31.2%, 17.4%

Bowtie2, BWA, BWA-MEM: 45.4%, 16.9%, 13.2%, 14.4%, 2.5%, 4.7%, 3.0%

Bowtie2, BWA, Blast, BWA-MEM: 12.4%, 9.5%, 12.8%, 10.9%, 1.4%, 1.3%, 6.5%, 1.2%, 0.8%, 1.6%, 0.6%, 3.9%, 3.3%, 0.8%, 32.9%

Which alignment is real?

Page 26:

Genetic Mapping of All Unique Tags

• Test for association between taxa with tags and an anchor map
• Apply in both an association and linkage context
• Trillions of tests, so computational speed is key

Fei Lu, in review
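To make the idea concrete, here is a toy version of testing one tag against anchor markers. It is not the method under review: it swaps in a plain 2x2 chi-square statistic, treats every taxon without the tag as a true absence (which low coverage makes optimistic), and uses made-up names and a 0/1 marker coding.

```python
def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square statistic for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    margins = (a + b, c + d, a + c, b + d)
    if 0 in margins:
        return 0.0
    return n * (a * d - b * c) ** 2 / (margins[0] * margins[1] * margins[2] * margins[3])


def best_anchor(tag_present, anchor_genotypes):
    """Return (marker, statistic) for the anchor most associated with the tag."""
    best = None
    for marker, alleles in anchor_genotypes.items():
        pairs = list(zip(tag_present, alleles))
        a = sum(1 for t, g in pairs if t and g == 1)
        b = sum(1 for t, g in pairs if t and g == 0)
        c = sum(1 for t, g in pairs if not t and g == 1)
        d = sum(1 for t, g in pairs if not t and g == 0)
        stat = chi_square_2x2(a, b, c, d)
        if best is None or stat > best[1]:
            best = (marker, stat)
    return best


tag = [True, True, True, False, False, False]
anchors = {"m1": [1, 1, 1, 0, 0, 0], "m2": [1, 0, 1, 0, 1, 0]}
print(best_anchor(tag, anchors))
```

Each tag-by-marker test is cheap, but with millions of tags and a dense anchor map the number of tests explodes, which is why the slide stresses computational speed.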

Page 27:

What do YOU want to filter on?

• MAF
• Mapping accuracy
• Support from Paired-End
• Support from multiple aligners
• Heterozygosity
• Coverage
• Agreement with WGS
• Allele frequency within certain populations
• LD with other SNPs
• Stats within a particular population

Page 28:

Finding the Good SNPs

[Flow diagram boxes: Discovery TOPM; SNP & Features; Train Filter; Filter TOPM; Production TOPM]

Page 29:

Unstable Genomes

Page 30:

Using the Presence/Absence Variants

• In species like maize, this is the majority of the data
• Less subject to sequencing error
• Need imputation methods to differentiate between missing from sampling and biologically missing
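Why imputation is needed shows up in a simple Bayes calculation for a single sample with no reads for a tag. The sketch assumes Poisson read sampling and a prior of 0.5 that the tag is biologically absent (a placeholder loosely motivated by half the genome not being shared); it ignores sequencing error and is not an imputation method.

```python
from math import exp


def prob_biologically_absent(mean_coverage: float, prior_absent: float = 0.5,
                             reads_observed: int = 0) -> float:
    """Posterior probability that a tag is truly absent from a sample."""
    if reads_observed > 0:
        return 0.0                      # any read means the tag is present
    p_unsampled_if_present = exp(-mean_coverage)   # Poisson P(0 reads)
    evidence = prior_absent + (1 - prior_absent) * p_unsampled_if_present
    return prior_absent / evidence


for coverage in (0.5, 1.0, 3.0):
    print(f"{coverage}X: P(absent | no reads) = {prob_biologically_absent(coverage):.2f}")
```

At 0.5X a missing tag barely shifts the odds, so information has to be borrowed from related samples or neighboring markers to tell the two kinds of missingness apart.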

Page 31:

Only 50% of the maize genome is shared between two varieties

Fu & Dooner 2002, Morgante et al. 2005, Brunner et al. 2005

Numerous PAVs and CNVs: Springer, Lai, Schnable in 2010

[Figure: Maize (Plant 1, Plant 2, Plant 3), 50%; Humans (Person 1, Person 2, Person 3), 99%]

Page 32:

[Flow diagram, developed by Fei Lu. Boxes: Tags; GWAS and Joint linkage mapping (GWAS, Joint Linkage); decision nodes "If Map" and "If Align" with Y/N branches; PAVs; Read depth (B73 vs Non-B73); Identify structural variations. Chromosome schematic comparing B73 and Non-B73 across Bin 1, Bin 2, and Bin 3, labeled PAV (Deletion), CNV (Duplication), and Insertion.]

Page 33:

Model training for mapping accuracy

• Dependent variable: distance, abs(genetic position - physical position)
• Attributes: P-value, likelihood ratio, tag count, etc.
• Machine learning models: decision tree, SVM, M5Rules, etc.

[Diagram: 26M tags; sets G (GWAS), J (Joint linkage), and GJ with sizes 6.4 M, 0.5 M, and 8.6 M (assignment of sizes to sets not recoverable); model training and prediction using M5Rules; 4.5 M mapped tags]

Page 34:

High-resolution mapped tags

4.5 million mapped tags in all maize lines

[Figure: mapping resolution of the mapped tags; resolution axis running from 100 bp to 10 Gb; 50% (median resolution) and 95% levels marked]

Framework of maize pan genome

Page 35:

Paired-End Sequencing

• Most of the tags are less than 500bp in size.

• Full contigs can be developed by sequencing them on a MiSeq with paired-end 250bp reads

• Strategies – physically bridge from the mapped end, genetically anchor one end.

Page 36:

Phylogenetic (Cross Species) Analysis

• When divergence is <10% between taxa, GBS makes sense, as 40% of the cut site pairs are preserved. If focused on genic regions, divergence can go higher.

• Currently the pipeline tools are not great for very high levels of divergence. Standard nextgen or Paired-End approaches might be a good strategy.

Page 37:

Future

• Clean output and input into machine learning algorithms
• Developing a community to explore and develop filters

Page 38:

Become a power user:

• Play with the code to add your features
– Documentation at maizegenetics.net
– Source code at SourceForge
• Guide to programming TASSEL:
– http://bit.ly/KZLi8l
• Google Group
– https://groups.google.com/forum/#!forum/tassel