The HAP webserver: Tools for the Discovery of Genetic Basis of Human Disease

HYUN MIN KANG
Computer Science and Engineering
University of California, San Diego

1. Introduction

Understanding the structure of human variation is important for understanding the genetic basis of human diseases. Recent advances in high-throughput genotyping technology are generating a tremendous amount of high-density single nucleotide polymorphism (SNP) data, which holds great promise for discovering genetic risk factors associated with disease. To identify associations between disease and variations in an individual’s chromosomes, the genotype data must first be phased into haplotypes. Built on HAP, a very efficient tool for haplotype resolution based on imperfect phylogeny, the HAP webserver provides an integrated method to reconstruct haplotype structure and to identify genetic variants associated with complex phenotypes, which can give insight into the genetic factors of complex diseases. Our methods leverage the interplay between genotype phasing, haplotype phylogeny, association analysis, and functional SNP prediction, together with new insights into the structure of human variation that allow us to observe phenotype associations directly from genotype and phenotype data. We demonstrate our methods via an analysis of two genes implicated in hypertension. The methods are easily accessible via the webserver, which provides complete results of association analysis, including graphical visualizations. We expect that our methods will facilitate current association studies.

NOAH ZAITLEN
Bioinformatics Program
University of California, San Diego

TAURIN TAN-ATICHAT
Electrical and Computer Engineering
University of California, San Diego

EDWARD SHYU
Computer Science and Engineering
University of California, San Diego

GRACE SHAW
Computer Science and Engineering
University of California, San Diego

DAFNA BITTON
Computer Science and Engineering
University of California, San Diego

ELAD HAZAN
Department of Computer Science
Princeton University

ERAN HALPERIN
International Computer Science Institute, Berkeley

ELEAZAR ESKIN
Computer Science and Engineering
University of California, San Diego

2. HAP – haplotype resolution

HAP is a haplotype analysis system aimed at helping geneticists perform disease association studies. The main feature of HAP is a phasing method based on the assumption of imperfect phylogeny. The phasing method is very efficient, which allows HAP to work with very large data sets and to perform other operations, such as partitioning the region into blocks of limited diversity or performing association tests on each of these blocks. HAP takes as input a set of genotypes over a region, taken from a population, and returns the haplotype phase of each individual’s genotype. In our studies, we observed that HAP is very accurate when the sample contains at least a couple of dozen individuals. In addition to phasing, HAP produces a partition of the region into blocks of correlated SNPs, chosen so that it minimizes the number of tag SNPs. HAP leverages a new insight into the underlying structure of haplotypes, which shows that SNPs are organized in highly correlated “blocks” (Daly et al. 01, Patil et al. 01). HAP has been shown to have accuracy competitive with state-of-the-art software (such as PHASE and HAPLOTYPER) while being extremely fast, so it can be used on very large datasets. Recently, HAP was successfully used to reveal whole-genome haplotype structure (Hinds et al. 05).
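To illustrate why phasing is needed, the following minimal sketch enumerates every haplotype pair that is consistent with a single unphased genotype. It is not the HAP algorithm: HAP resolves this ambiguity using imperfect phylogeny, whereas the sketch only shows the size of the ambiguity. The genotype encoding (0 and 1 for the two homozygous states, 2 for heterozygous) and the function name are assumptions made for this example.

```python
from itertools import product


def compatible_haplotype_pairs(genotype):
    """Enumerate unordered haplotype pairs consistent with an unphased genotype.

    Assumed encoding (hypothetical, chosen for this sketch):
    0 = homozygous for allele 0, 1 = homozygous for allele 1, 2 = heterozygous.
    """
    het_sites = [i for i, g in enumerate(genotype) if g == 2]
    pairs = set()
    # Each heterozygous site can place allele 0 on either chromosome.
    for assignment in product((0, 1), repeat=len(het_sites)):
        h1 = list(genotype)
        h2 = list(genotype)
        for site, allele in zip(het_sites, assignment):
            h1[site] = allele
            h2[site] = 1 - allele
        # Sort the two haplotypes so that (h1, h2) and (h2, h1) count once.
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)


# A genotype with three heterozygous sites has 2**(3 - 1) = 4 possible phasings.
for pair in compatible_haplotype_pairs([2, 0, 2, 1, 2]):
    print(pair)
```

The number of candidate phasings doubles with each additional heterozygous site, which is why a phasing method needs population-level structure to choose among them.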

Figure 2 Predicted CHGA phylogeny. Each symbol denotes a haplotype variant of the CHGA promoter. Each haplotype variant is classified into one of three groups: ancestral, common, or recent haplotype. A solid line denotes mutation, and a dashed line denotes recombination. This figure is automatically generated by our webserver.

CHGA          NUCLEOTIDE AT POSITION                           STATISTICAL TESTS
HAPLOTYPE ID  -1106  -1018  -1014  -988  -462  -415  -89  -57  Linear Regression  Unpaired t-test  Mann-Whitney  Jonckheere-Terpstra
A             G      A      T      T     G     T     C    C    .948 (+)           .963 (+)         .969 (+)      .963 (+)
B             A      A      T      T     G     T     C    C    .977 ()            .999 ()          .999 ()       .996 ()
C             G      A      C      G     A     T     A    C    .175 ()            .209 ()          .505 ()       .485 ()
D             G      A      T      T     G     C     C    C    .999 ()            .990 (+)         .983 (+)      .997 (+)
E             G      T      T      T     G     C     C    T    .004 (+)**         .004 (+)**       .011 (+)*     .011 (+)*
F             G      A      C      G     A     T     C    C    .836 ()            .836 ()          .978 ()       .986 ()

Table 1 Haplotype analysis between CHGA promoter region and CHGA284-301 plasma levels. Statistical p-values for the association between the haplotypes in the CHGA promoter region and CHGA284-301 plasma levels in 221 African Americans, over various statistical tests. Each haplotype ID and its sequence are identical to those in Figure 2. The p-values are evaluated by permutation tests with 10^5 random shufflings of the phenotypes. The p-values are also adjusted for multiple comparisons, so no further conservative adjustment is required. The plus or minus sign next to each p-value denotes whether the haplotype variant shows a positive or negative effect on the phenotype under each statistical test. Single and double asterisks denote p-values less than 0.05 and 0.01, respectively. This table is automatically generated by our webserver.
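The permutation procedure described in the caption can be sketched as follows. This is an illustration under stated assumptions, not the webserver's implementation: the test statistic (absolute Pearson correlation between haplotype copy number and phenotype), the single-step max-statistic adjustment for multiple comparisons, the function names, and the toy data are all choices made for this example. The source does not say which adjustment the webserver applies.

```python
import random
from statistics import mean


def abs_correlation(x, y):
    """Absolute Pearson correlation; returns 0.0 if either input is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / (sxx * syy) ** 0.5


def adjusted_permutation_pvalues(copies_by_haplotype, phenotype, n_perm=2000, seed=0):
    """Permutation p-values adjusted across haplotypes with the max statistic.

    copies_by_haplotype maps a haplotype ID to the number of copies (0, 1, 2)
    carried by each individual; phenotype holds one value per individual.
    """
    rng = random.Random(seed)
    observed = {h: abs_correlation(c, phenotype) for h, c in copies_by_haplotype.items()}
    exceed = {h: 0 for h in copies_by_haplotype}
    shuffled = list(phenotype)
    for _ in range(n_perm):
        # Shuffling phenotypes breaks any genotype-phenotype link.
        rng.shuffle(shuffled)
        max_stat = max(abs_correlation(c, shuffled) for c in copies_by_haplotype.values())
        # Comparing against the maximum over all haplotypes adjusts for testing several.
        for h in exceed:
            if max_stat >= observed[h]:
                exceed[h] += 1
    # Add-one correction keeps estimated p-values strictly positive.
    return {h: (exceed[h] + 1) / (n_perm + 1) for h in exceed}


# Toy data (hypothetical): 12 individuals, two haplotypes.
copies = {
    "E": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2],
    "F": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
}
phenotype = [1.0, 1.2, 0.9, 1.1, 1.0, 1.3, 2.0, 2.1, 1.9, 2.2, 3.0, 3.1]
pvals = adjusted_permutation_pvalues(copies, phenotype)
print(pvals)
```

In the toy data the phenotype rises with the number of copies of "E", so its adjusted p-value is small, while "F" is unrelated to the phenotype and receives a large one.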

Figure 4 CHGA association visualization. A histogram of CHGA284-301 levels grouped by the number of copies of haplotype E in Table 1. The x-axis represents plasma levels, and the y-axis represents the fraction of individuals with a given plasma level. The histogram shows that haplotype E is significantly associated with increased plasma levels. This figure is automatically generated by our webserver.

Figure 5 CHGA functional SNPs prediction. Results of predicting how each SNP contributes to the association identified in Table 1. The y-axis is a score that represents the degree of functional contribution. The SNP at position -89 makes the highest functional contribution, and those at positions -1014, -988, and -462 share the second highest score. This result is consistent with previously published in vitro experiments. This figure is automatically generated by our webserver.

3. Inferring Phylogenetic Relationships between Haplotypes

Recent studies have shown that within short regions there is limited genetic variability, and only a small number of haplotypes account for most of the population. In a typical region of 20kb, three or four common haplotypes account for 80% of the population. Furthermore, most rare variants appear to be minor variants of common haplotypes. Using these results, phylogeny is inferred by identifying the most likely ancestor of each rare haplotype among the frequent ones. Then, ancestral haplotypes are found by searching for similar common variants.
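The first step above, linking each rare haplotype to its most likely common ancestor, can be illustrated with a minimal sketch. It assumes that "most likely" means the common haplotype with the fewest differing sites (Hamming distance), with ties broken in favor of the more frequent haplotype. The frequency threshold, the function names, and the toy haplotypes are hypothetical, and recombination is ignored, so this is a simplification rather than the method used by the webserver.

```python
def hamming(a, b):
    """Number of sites at which two equal-length haplotypes differ."""
    if len(a) != len(b):
        raise ValueError("haplotypes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def assign_parents(frequencies, common_threshold=0.1):
    """Map each rare haplotype to its closest common haplotype.

    frequencies maps a haplotype string to its population frequency.
    common_threshold is an assumed cutoff separating common from rare.
    """
    common = [h for h, f in frequencies.items() if f >= common_threshold]
    if not common:
        raise ValueError("no haplotype reaches the common threshold")
    parents = {}
    for hap, freq in frequencies.items():
        if freq < common_threshold:
            # Fewest mismatches first; prefer the more frequent haplotype on ties.
            parents[hap] = min(common, key=lambda c: (hamming(hap, c), -frequencies[c]))
    return parents


# Toy haplotypes over five SNPs (hypothetical frequencies).
frequencies = {
    "00000": 0.50,
    "11100": 0.30,
    "00001": 0.08,
    "11110": 0.07,
    "10100": 0.05,
}
parents = assign_parents(frequencies)
for rare, parent in parents.items():
    print(rare, "derives from", parent)
```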

Figure 1 HAP webserver. (a) HAP was used to reveal whole-genome haplotype structure. The article “Whole-Genome Patterns of Common DNA Variation in Three Human Populations” was featured on the cover of Science. (b) A screenshot of the HAP webserver main page, available at http://research.calit2.net/hap

4. Identifying Association via Statistical Tests

Leveraging haplotype structure

Quantitative phenotypes & Dose-effects

Nonparametric Tests

Covariates

Figure 3 Linkage disequilibrium plot. Results of running the HAP webserver with linkage disequilibrium data. The example data is available via the webserver. The axes represent SNP positions. Red regions indicate high disequilibrium, while blue regions indicate low disequilibrium.
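A pairwise disequilibrium value of the kind shown in such a plot can be computed from phased haplotypes as in the sketch below. The caption does not say which disequilibrium measure the webserver displays, so the choice of r^2 is an assumption, as are the 0/1 allele encoding, the function names, and the toy haplotypes.

```python
def ld_r2(haplotypes, i, j):
    """Squared correlation (r^2) between SNPs i and j over phased haplotypes.

    haplotypes is a list of equal-length sequences of 0/1 alleles (assumed encoding).
    """
    n = len(haplotypes)
    p_i = sum(h[i] for h in haplotypes) / n
    p_j = sum(h[j] for h in haplotypes) / n
    p_ij = sum(h[i] * h[j] for h in haplotypes) / n
    # D measures how far the joint frequency departs from independence.
    d = p_ij - p_i * p_j
    denom = p_i * (1 - p_i) * p_j * (1 - p_j)
    return 0.0 if denom == 0 else d * d / denom


def ld_matrix(haplotypes):
    """Full pairwise r^2 matrix, one row and one column per SNP."""
    m = len(haplotypes[0])
    return [[ld_r2(haplotypes, i, j) for j in range(m)] for i in range(m)]


# Toy phased haplotypes (hypothetical): SNPs 0 and 1 always co-occur,
# while SNP 2 is independent of both.
haps = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
for row in ld_matrix(haps):
    print(row)
```

In this toy example the first two SNPs would be drawn in the high-disequilibrium color and the third in the low-disequilibrium color.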

5. Functional SNPs Prediction

Once associated haplotypes are identified using rigorous statistical tests, our methods estimate the likelihood that each SNP contributes to the association. To make this prediction, we iterate over several groupings of the haplotypes to attempt to isolate the functional SNPs. The outcome of this second step is a score distribution over the SNPs, estimating how likely each SNP is to be functional.

6. Whole Genome Association Studies with HDL Mouse Phenome Database

Figure 6 HDL phenotype. The association test results for the level of HDL cholesterol in the different mouse strains.

Figure 7 Random. The association test results for the randomly permuted HDL phenotype of Figure 6.