Genotype Imputation for African Americans and Hispanics in WHI using reference haplotypes from the 1000 Genomes Project. Presented by Qing Duan, Dr. Yun Li group, UNC at Chapel Hill, 09-13-2012

Page 1

Genotype Imputation for African Americans and Hispanics in WHI using reference haplotypes from the 1000 Genomes Project

Presented by Qing Duan
Dr. Yun Li group

UNC at Chapel Hill
09-13-2012

Page 2

Outline
• Imputation
  – Study samples: WHI African American and Hispanic samples
  – Reference haplotypes: 1000 Genomes Project (version 3, March 2012 release)
    • Number of markers in reference haplotypes: ~38M
• Post-imputation quality assessment
  – Evaluation of imputation quality by comparing with actual genotypes from Metabochip genotyping
  – Estimation of the total number of QC+ markers and the number of QC+ indels

Page 3

QC on WHI Genotypes
• QC was performed within the African American and Hispanic samples separately, for autosomes and chromosome X.
• We excluded markers that failed any of the following:
  – Hardy-Weinberg equilibrium (HWE p-value < 1e-6)
  – Genotype completeness (< 90%)
  – Minor allele frequency
    • Chromosomes 1-22: MAF < 1%
    • Chromosome X: singleton or monomorphic markers
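The exclusion criteria above can be sketched as a per-marker filter. This is a minimal illustration, not the pipeline used for WHI: the `MarkerStats` container, the marker names, and the example values are hypothetical, and it assumes the HWE and completeness filters apply to chromosome X in the same way as to the autosomes.

```python
from dataclasses import dataclass

# Thresholds taken from the slide.
HWE_P_MIN = 1e-6          # exclude markers with HWE p-value below this
CALL_RATE_MIN = 0.90      # exclude markers with genotype completeness below this
AUTOSOME_MAF_MIN = 0.01   # chromosomes 1-22: exclude markers with MAF below this


@dataclass
class MarkerStats:
    """Per-marker summary statistics (hypothetical container for illustration)."""
    name: str
    chrom: str               # "1".."22" or "X"
    hwe_p: float
    call_rate: float
    maf: float
    minor_allele_count: int  # only consulted for chromosome X


def passes_qc(m: MarkerStats) -> bool:
    """Return True if the marker survives all exclusion criteria."""
    if m.hwe_p < HWE_P_MIN:
        return False
    if m.call_rate < CALL_RATE_MIN:
        return False
    if m.chrom == "X":
        # Monomorphic = 0 minor-allele copies, singleton = 1 copy.
        return m.minor_allele_count > 1
    return m.maf >= AUTOSOME_MAF_MIN


# Made-up markers, one per failure mode plus two that pass.
markers = [
    MarkerStats("rsA", "1", 0.20, 0.99, 0.150, 2500),   # passes
    MarkerStats("rsB", "1", 1e-9, 0.99, 0.150, 2500),   # fails HWE
    MarkerStats("rsC", "2", 0.50, 0.85, 0.200, 3000),   # fails completeness
    MarkerStats("rsD", "3", 0.50, 0.99, 0.004, 60),     # fails autosomal MAF
    MarkerStats("rsE", "X", 0.50, 0.99, 0.0001, 1),     # singleton on chrX
    MarkerStats("rsF", "X", 0.50, 0.99, 0.0005, 5),     # passes
]
qc_plus = [m.name for m in markers if passes_qc(m)]
print(qc_plus)
```

Because QC was run separately per cohort, a filter like this would be applied once to the African American statistics and once to the Hispanic statistics.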

With thanks to Eric Yi Liu

Page 4

Summary of samples and GWAS QC+ markers
• Number of individuals
  – WHI_AA: 8,421 / WHI_HA: 3,587
• Number of markers

                 Chr1-22                  ChrX
           WHI_AA     WHI_HA       WHI_AA     WHI_HA
  Total         860,510                 36,889
  QC+      829,370    834,826      35,411     35,035

Note: chromosome X is currently under imputation, so the results on chromosome X will be available soon.

Page 5

Reference Haplotypes
• The complete set of 1000G Phase I Integrated Release version 3 haplotypes in VCF format (March 2012 release)
  – A total of 2,184 haplotypes
  – A total of ~38M markers
    • including singleton and monomorphic sites
  – About 1.4M markers are short indels and large deletions; the rest are SNPs.

Page 6

Note on reference haplotypes
• A more recent, reduced set of reference haplotypes, with singletons and monomorphic markers removed, is also available.
  – Number of markers: ~30M
  – Every marker in the reduced set is included in the complete set of reference haplotypes.
  – We expect little influence on imputation quality from singleton and monomorphic markers, because:
    • Phasing of the reference haplotypes was performed with the singleton and monomorphic markers included
    • Our previous evaluation shows little effect of singletons on the quality of imputation (Liu, EY, et al., Genetic Epidemiology, 2012, 36:107-117).

Page 7

Two-step genotype imputation -- Procedure
• Step 1: Pre-phasing (MaCH1)
  – WHI African American and Hispanic samples were phased separately.
• Step 2: Genotype imputation (minimac)
  – WHI African American and Hispanic samples were imputed separately.
  – Haplotype-to-haplotype imputation: the pre-phased haplotypes from step 1 are imputed against the complete set of reference haplotypes from the 1000 Genomes Project.

Page 8

Two-step genotype imputation -- Computational costs
• Phasing and imputation strategy
  – Split chromosomes into segments
  – Phase / impute each segment
  – Ligate segments back to chromosomes

Computational costs                       WHI_AA                        WHI_HA
Phasing
  Split strategy (sample genotypes)            Core region: 3000 markers, flanking: 500 markers each
  # segments after splitting              277                           278
  Median run time                         ~245 hours (~10 days)         ~63 hours (~3 days)
Imputation
  Split strategy (reference haplotypes)   Core region: 5 Mb,            Core region: 20 Mb,
                                          flanking: 500 Kb each         flanking: 500 Kb each
  # segments after splitting              520                           150
  Median run time                         ~41 hours (~2 days)           ~71 hours (~3 days)
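The split / process / ligate strategy can be sketched with marker indices. This is an illustration under stated assumptions rather than the scripts used here: the function names are hypothetical, the defaults mirror the phasing split (3000-marker core, 500-marker flanks), and "ligation" is reduced to keeping each segment's core results. Ligating phased haplotypes in practice also requires making the phase consistent across the overlapping flanks, which this sketch does not attempt.

```python
def split_into_segments(n_markers, core=3000, flank=500):
    """Return (start, core_start, core_end, end) half-open index tuples.

    Each segment covers a core region plus up to `flank` markers on each
    side, clipped at the ends of the chromosome.
    """
    segments = []
    for core_start in range(0, n_markers, core):
        core_end = min(core_start + core, n_markers)
        start = max(core_start - flank, 0)
        end = min(core_end + flank, n_markers)
        segments.append((start, core_start, core_end, end))
    return segments


def ligate(segments, per_segment_results):
    """Concatenate the core-region part of each segment's result list."""
    merged = []
    for (start, core_start, core_end, _), result in zip(segments, per_segment_results):
        # Drop the flanking markers; they only provide context at the edges.
        merged.extend(result[core_start - start:core_end - start])
    return merged


# Toy chromosome with 7000 markers.
segs = split_into_segments(7000)

# Stand-in for the per-segment phasing or imputation step: each segment
# simply returns the marker indices it covers.
results = [list(range(start, end)) for start, _, _, end in segs]

chromosome = ligate(segs, results)
print(len(segs), len(chromosome))
```

The flanks give each core region context at its boundaries, at the cost of processing the overlapping markers twice.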

Page 9

Summary of imputation results -- Before QC

                                        WHI_AA        WHI_HA
  Number of individuals                 8,421         3,587
  Total number of imputed markers            38,050,692
  Number of imputed indels                   1,380,758
  File size (all files gz compressed)   170 G         71 G

Note: Markers with a missing quality filter in the 1000G reference haplotypes are excluded from imputation. We found that all excluded markers are of type "MERGED_DEL".

Page 10

Evaluation of imputation quality -- Introduction
• Main idea
  – Compare imputed dosages with actual genotypes
• Quality metric
  – Dosage r2: squared correlation coefficient between imputed dosages (continuous values ranging between 0 and 2) and actual genotypes (coded as 0, 1 and 2)
    • True imputation accuracy (range 0 ~ 1)
  – Rsq: estimated dosage r2
    • Estimated imputation accuracy
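As a concrete reading of the dosage r2 definition, the sketch below computes the squared Pearson correlation for one marker across individuals. The function name and example values are made up for illustration. Rsq is not computed here, since it is the imputation software's own estimate of this quantity and needs no true genotypes.

```python
def dosage_r2(dosages, genotypes):
    """Squared Pearson correlation between imputed dosages and true genotypes.

    dosages   -- imputed allele dosages, continuous values in [0, 2]
    genotypes -- actual genotypes coded as 0, 1 or 2
    Returns NaN when either vector has no variance (correlation undefined).
    """
    n = len(dosages)
    if n != len(genotypes) or n < 2:
        raise ValueError("need two equal-length vectors with at least 2 values")
    mean_d = sum(dosages) / n
    mean_g = sum(genotypes) / n
    cov = sum((d - mean_d) * (g - mean_g) for d, g in zip(dosages, genotypes))
    var_d = sum((d - mean_d) ** 2 for d in dosages)
    var_g = sum((g - mean_g) ** 2 for g in genotypes)
    if var_d == 0 or var_g == 0:
        return float("nan")
    return cov * cov / (var_d * var_g)


# One marker, six individuals (made-up values).
imputed = [1.9, 1.1, 0.2, 0.9, 0.0, 2.0]
actual = [2, 1, 0, 1, 0, 2]
print(round(dosage_r2(imputed, actual), 3))
```

A value near 1 means the imputed dosages track the genotyped calls closely. Rare markers with almost no variance in the evaluation sample give unstable or undefined values, which is one reason to summarize results by MAF category.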

Page 11

Evaluation of imputation quality -- Study design

[Figure: a matrix of imputed dosages is compared with a matrix of actual genotypes (Metabochip) to calculate dosage r2]

• Individuals used in evaluation
  – 1,962 WHI African American samples
• Markers used in evaluation
  – Overlapping markers between 1000G and Metabochip but not on Affymetrix 6.0 (all 22 autosomes)
  – Minor allele frequency (MAF) is defined within the 1,962 individuals

Page 12

Estimation of imputation quality -- Results

Page 13

Estimation of imputation quality -- Summary
• We recommend QC thresholds of 0.7, 0.6 and 0.3 for the MAF 0.1~0.5%, 0.5~1%, and >1% categories, respectively
  – The thresholds are chosen such that an average Rsq greater than 0.8 in each MAF category is achieved (Liu, EY, et al., Genetic Epidemiology, 2012, 36:107-117).
• Estimation based on imputation quality assessment
  – Total number of markers passing QC
  – Total number of indels passing QC
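The MAF-dependent thresholds can be applied as a simple lookup. This is a sketch under assumptions the slides do not settle: a marker passes when its Rsq is greater than or equal to the cutoff, a MAF of exactly 0.5% falls into the 0.5~1% category, and markers with MAF below 0.1% (for which no threshold is given) are treated as failing. The function names are hypothetical.

```python
def rsq_threshold(maf):
    """Recommended Rsq cutoff for a marker's MAF (given as a fraction).

    Returns None for MAF < 0.1%, where the slides give no recommendation.
    """
    if maf > 0.01:        # MAF > 1%
        return 0.3
    if maf >= 0.005:      # MAF 0.5~1% (boundary handling is an assumption)
        return 0.6
    if maf >= 0.001:      # MAF 0.1~0.5%
        return 0.7
    return None


def passes_rsq_qc(maf, rsq):
    """True if the marker meets the cutoff for its MAF category."""
    cutoff = rsq_threshold(maf)
    # Assumption: markers without a recommended cutoff are dropped.
    return cutoff is not None and rsq >= cutoff


# (MAF, Rsq) pairs with made-up values.
examples = [(0.20, 0.35), (0.003, 0.65), (0.003, 0.75), (0.007, 0.65), (0.0005, 0.99)]
print([passes_rsq_qc(maf, rsq) for maf, rsq in examples])
```

The stricter cutoffs for rarer variants reflect that a given Rsq corresponds to lower true accuracy at low MAF, so more is required of the estimate before a marker is kept.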

Page 14

Estimation based on imputation quality assessment -- Note
• The values are estimated because:
  – Estimated Rsq cutoffs
    • Evaluation is based on markers on Metabochip
  – Estimated MAF
    • MAF of imputed markers is calculated based on imputed dosages
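For the "Estimated MAF" point, one common way to derive an allele frequency from dosages is to take the mean dosage divided by two and fold it to the minor allele. The sketch below assumes that convention; the slides do not spell out the exact formula, and the function name is hypothetical.

```python
def maf_from_dosages(dosages):
    """Estimate minor allele frequency from imputed dosages in [0, 2].

    The mean dosage divided by 2 estimates the frequency of the counted
    allele; folding with 1 - freq gives the minor allele frequency.
    """
    if not dosages:
        raise ValueError("no dosages supplied")
    allele_freq = sum(dosages) / (2 * len(dosages))
    return min(allele_freq, 1.0 - allele_freq)


# Four individuals with made-up dosages for one marker.
print(maf_from_dosages([0.0, 0.1, 0.0, 1.9]))
```

Because dosages are expected values rather than called genotypes, this estimate carries the imputation uncertainty with it, which is why the resulting marker counts are described as estimates.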

Page 15

Estimation based on imputation quality assessment -- Note (cont'd)
• The values are estimated because:
  – Estimated QC thresholds for WHI Hispanic samples
    • We assumed WHI Hispanics have a similar Rsq cutoff in each MAF category to WHI African Americans
    • We will do a similar quality assessment in the Hispanic samples once we have their QC+ Metabochip data
  – Estimated QC thresholds for indels
    • Rsq is set based on evaluation on SNPs. We assumed indels have a similar Rsq cutoff in each MAF category to SNPs

Page 16

Estimation based on imputation quality assessment -- Total number of markers passing QC

Note: Markers include both SNPs and indels

Page 17

Estimation based on imputation quality assessment -- Number of indels passing QC

Page 18

Summary
• We conducted genotype imputation for 8,421 African American and 3,587 Hispanic samples in the Women's Health Initiative (WHI) study using reference haplotypes from the 1000 Genomes Project (version 3, March 2012 release)
• Summary of imputation results before and after QC

                                        WHI_AA                      WHI_HA
                                        Before QC     After QC      Before QC     After QC
  Number of individuals                 8,421         8,421         3,587         3,587
  Total number of markers               38,050,692    18,940,103    38,050,692    15,214,231
  Number of indels                      1,380,758     1,219,538     1,380,758     1,126,704
  File size (all files gz compressed)   170 G         102 G         71 G          33 G
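To put the before/after counts in proportion, the short calculation below derives the percentage of markers and indels retained after QC in each cohort. Only the counts come from the summary table; the variable and function names are illustrative.

```python
# Counts copied from the summary table.
before = {"markers": 38_050_692, "indels": 1_380_758}
after = {
    "WHI_AA": {"markers": 18_940_103, "indels": 1_219_538},
    "WHI_HA": {"markers": 15_214_231, "indels": 1_126_704},
}


def percent_retained(kept, total):
    """Percentage of items kept after QC."""
    return 100.0 * kept / total


retained = {
    cohort: {kind: round(percent_retained(count, before[kind]), 1)
             for kind, count in counts.items()}
    for cohort, counts in after.items()
}
for cohort, values in retained.items():
    print(cohort, values)
# WHI_AA: markers 49.8, indels 88.3
# WHI_HA: markers 40.0, indels 81.6
```

On these figures about half of all imputed markers pass QC in WHI_AA and about 40% in WHI_HA, while indels are retained at a higher rate than markers overall in both cohorts.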