SNP maps: more markers needed?

An American scientist has predicted that 500 000 single nucleotide polymorphisms (SNPs) will be required for a whole-genome SNP map that will be useful for finding genes involved in common diseases [1]; a number that exceeds the aims of ongoing SNP mapping projects. Leonid Kruglyak (Fred Hutchinson Cancer Research Centre, Seattle, WA, USA) used computer simulations to estimate the extent of linkage disequilibrium (LD; Box 1) between SNPs in the genome. Although his conclusions are somewhat pessimistic in the light of current SNP mapping targets, the study provides an impetus for experimental studies of LD and highlights the factors involved in a whole-genome LD scan using SNPs.

Finding the genetic variants underlying common diseases such as heart disease, cancer, diabetes and psychiatric illness is a major challenge to human geneticists. Although these diseases aggregate in families (implying a genetic component), they usually skip generations. Consequently, family-based linkage methods using small, nuclear families, so successful for Mendelian diseases such as cystic fibrosis, are proving to be of limited use. Common diseases are the result of combinations of different genetic and environmental factors. It has been hypothesized that the most powerful methods for finding many genes, each with a relatively small effect, are based on associations of marker alleles with the disease. This is done by statistically comparing the marker allele frequencies in individuals affected with the disease of interest (cases) and individuals without the disease (controls). Whole-genome association studies employ neutral markers spaced throughout the genome. Theoretically, any neutral allele (of the marker) found more frequently in cases than in controls must be in LD with, and therefore physically close to, a functional cause of the disease.
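The case-control comparison described above can be illustrated with a Pearson chi-square test on a 2x2 table of allele counts. This is a minimal sketch, not the method of any study mentioned in the article: the function name `allelic_chi_square` and the allele counts are hypothetical, and the p-value uses the standard identity for a chi-square distribution with one degree of freedom.

```python
import math


def allelic_chi_square(case_a, case_b, control_a, control_b):
    """Pearson chi-square (1 df) for a 2x2 table of marker allele counts.

    Rows are cases and controls; columns are the two alleles (a, b)
    of a bi-allelic marker such as a SNP.
    """
    n = case_a + case_b + control_a + control_b
    row_cases = case_a + case_b
    row_controls = control_a + control_b
    col_a = case_a + control_a
    col_b = case_b + control_b
    if 0 in (row_cases, row_controls, col_a, col_b):
        raise ValueError("every row and column needs at least one allele")
    chi2 = n * (case_a * control_b - case_b * control_a) ** 2 / (
        row_cases * row_controls * col_a * col_b
    )
    # Survival function of chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts: allele a is seen on 120 of 200 case chromosomes
# but only 90 of 200 control chromosomes.
chi2, p_value = allelic_chi_square(120, 80, 90, 110)
print(f"chi-square = {chi2:.2f}, p = {p_value:.4f}")
```

In a whole-genome scan the same test would be repeated for every marker, so the significance threshold would need to be corrected for the number of markers tested.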
SNPs are a good choice of marker for these studies because of their low mutation rate, high incidence throughout the genome and bi-allelic nature (making them amenable to automated detection techniques). How many SNPs to use for this method depends on how far the LD extends from the SNP, and it is this question that Kruglyak has addressed in his simulation study. ‘There was no clear guidance for the necessary scale of such projects from either data or theory’, he explains, and so ‘I decided to try and compute the extent of LD based on the best available information.’ The known factors that govern LD are complex and include population size, population history, disease allele frequency and increase in risk that a variant confers. Kruglyak simulated estimates of LD using simplified scenarios of population history, making a number of assumptions about the markers and what constitutes a useful amount of LD. His results suggest that useful levels of LD are ‘unlikely to extend beyond an average distance of roughly 3 kb in the general population’, implying that approximately 500 000 SNPs will be required for whole-genome association studies in samples drawn from large, out-bred populations. No LD was seen over 30 kb. As Kruglyak comments in the paper – everything is dependent on the assumptions made and he hopes that these results will ‘lead to some careful and extensive pilot studies to empirically characterize patterns of LD’. To date, there are few experimental studies that can confirm or refute his findings. Anecdotal reports suggest LD can extend up to 500 kb, and most human geneticists are expecting to see a range of distances. 
‘It is almost certain that the rates of erosion of association will be highly chromosome-region specific and that useful levels of LD for SNP markers will extend for many intervals well beyond the average predicted distance of 3 kb’, explains Ian Craig, Professor of Molecular Genetics at the Institute of Psychiatry (London, UK), where researchers are investigating the underlying genetic causes of human behavioural and psychiatric traits.

The need for more SNPs over longer physically characterized and/or sequenced genomic regions is being realized and, because SNPs are expected to facilitate large-scale genetic association studies, there is currently great interest in SNP discovery and detection. Kruglyak’s report comes two months after the announcement of The SNP Consortium (TSC), a collaboration between five publicly funded genome centres, ten pharmaceutical companies and the Wellcome Trust that aims to produce a map of 300 000 SNPs within the next two years. Although the target of TSC is below Kruglyak’s prediction, TSC is not alone in trying to identify, and put into the public domain, SNPs for the gene mappers. David Bentley, head of human genetics at the Sanger Centre (Hinxton, UK), a TSC member, explains: ‘We see the identification of SNPs and the development of an informative SNP map as an evolving process, in which a very useful set of 300 000 or more SNPs will be obtained to enable us to answer the unresolved controversy of how many SNPs are needed’. He anticipates that the emergence of the whole genomic sequence will impact on SNP discovery, and over a million SNPs will be found by TSC and other publicly-minded SNP-finding programmes. For example, genomics communities in the US and Japan are aiming to provide 100 000 SNPs each within the next two years. It is commonly accepted that there is at least 1 SNP per kb in the human genome, which extrapolates to 3 million in the whole genome (assuming it is 3 000 000 000 bp long).
If Kruglyak’s estimations are correct, there should be sufficient markers available to attain his predicted necessary map density.

Presently, there are just 6388 SNPs deposited in dbSNP – a database established by the US National Center for Biotechnology Information (NCBI; http://www.ncbi.nlm.nih.gov/SNP/), which serves as a central repository for SNPs in the public domain. However, it is not just the density of SNPs that is currently the limiting factor for successful whole-genome LD studies. For maximum benefit, the SNPs should be mapped relative to each other (and the genomic sequence) so that haplotypes (the patterns of

MOLECULAR MEDICINE TODAY, OCTOBER 1999 (VOL. 5)
1357-4310/99/$ - see front matter © 1999 Elsevier Science Ltd. All rights reserved.

Box 1. Linkage disequilibrium mapping

Linkage disequilibrium (LD) is a measure of the amount of recombination that has occurred between two regions in the genome (loci), and is representative of how far apart they are. Consider two loci – a disease locus and a SNP marker, for example – each having two alleles with equal (50%) frequency. D and d are the alleles at the disease locus, and a and b are the alleles at the marker locus. If the disease locus and SNP are at opposite ends of the chromosome, then throughout the generations in which they have both existed there will be recombination between the two loci, resulting in equal numbers of chromosomes carrying the four possible haplotypes (aD, ad, bD and bd). If, however, the two loci are close together, the recombination between them will not occur so frequently and one allele of the SNP will be associated with one allele of the disease locus. This association can be measured by genotyping samples of patients and controls for the SNP. The associated allele will have a higher than expected frequency in the patients.
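The two-locus example in Box 1 can be made concrete with the standard LD coefficient, which compares the observed frequency of one haplotype with the product of its allele frequencies, and with the derived r-squared statistic. The sketch below is illustrative only: the function name `ld_stats` and the haplotype frequencies are hypothetical, and the coefficient is named `d_coeff` to avoid confusion with the disease allele D.

```python
def ld_stats(p_aD, p_ad, p_bD, p_bd):
    """Return (d_coeff, r_squared) from the four haplotype frequencies.

    a/b are the marker alleles and D/d the disease-locus alleles,
    following the labels used in Box 1.
    """
    total = p_aD + p_ad + p_bD + p_bd
    if abs(total - 1.0) > 1e-9:
        raise ValueError("haplotype frequencies must sum to 1")
    p_a = p_aD + p_ad  # frequency of marker allele a
    p_D = p_aD + p_bD  # frequency of disease allele D
    # Excess of the aD haplotype over what independent loci would give.
    d_coeff = p_aD - p_a * p_D
    denom = p_a * (1.0 - p_a) * p_D * (1.0 - p_D)
    r_squared = d_coeff ** 2 / denom if denom > 0 else 0.0
    return d_coeff, r_squared


# Loci far apart: all four haplotypes equally common, so no association.
print(ld_stats(0.25, 0.25, 0.25, 0.25))
# Loci very close: allele a always travels with D (complete association).
print(ld_stats(0.5, 0.0, 0.0, 0.5))
# Hypothetical intermediate case: partial association.
print(ld_stats(0.4, 0.1, 0.1, 0.4))
```

With equal allele frequencies, as in Box 1, the coefficient ranges from 0 (four equally common haplotypes) to 0.25 (only two haplotypes present).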




Now that the task of sequencing the genome is well under way (the first draft of the human genome sequence will be made available in the first few months of next year), the focus of research will switch to the protein level, relating sequences and, where possible, structures to protein function. In June, Amos Bairoch (Swiss Institute of Bioinformatics, Switzerland), the curator of the SwissProt protein sequence database (http://www.expasy.ch/sprot) [1], announced a project to ensure that high-quality human protein sequence data are released into the public domain as soon as practicable – the Human Proteomics Initiative (http://www.expasy.ch/sprot/hpi).

SwissProt is the most widely used protein sequence database. Its value lies at least as much in the annotations and cross-links of each entry as in the sequence information itself. A typical entry for a human gene contains detailed descriptions of its known polymorphisms and any associated diseases, and cross-references to databases including Medline, Online Mendelian Inheritance in Man (OMIM) and the gene sequence databases. Bairoch described the ‘golden goals’ for SwissProt: that it should be fully annotated, complete, non-redundant, highly cross-referenced, and available through a wide range of servers and analysis tools. He acknowledged, however, that ‘there is a dichotomy between the first two aims, as annotation is time consuming and expensive’. New investment in SwissProt, resulting from the fees now charged to industrial users of the database, is now being devoted to the Human Proteomics Initiative. Its aim is nothing less than to catch up with the gene sequencers, by releasing a fully annotated set of human protein sequences and their variants into the public domain.

The information in the human proteome goes far beyond coding region sequences. Bairoch and his colleagues will collect and annotate information on mammalian orthologues of human genes, post-translational modifications and polymorphisms. Over 150 different post-translational modifications of human proteins (such as cleavage sites, crosslinks and bonds to non-protein molecules) have been described in the literature. Knowledge of the complete set of human polymorphisms is necessary for investigating variations in predisposition to common diseases.

The Initiative will have two phases. The aim of the first phase, which should last until Spring 2000, is to annotate the protein products of all known human genes. The second is a long-term commitment to continue releasing well-annotated sequences of protein variants for as long as researchers produce new data. The initiative is bound to make an important contribution to our understanding of the genetic basis of disease, and it cannot succeed without collaboration. Bairoch appealed for help to the whole medical research community: ‘We need you to submit all types of information to speed up the comprehensive annotating of the human proteome.’

1 Bairoch, A. and Apweiler, R. (1999) The SWISS-PROT protein sequence data bank and its supplement TrEMBL in 1999, Nucleic Acids Res. 27, 49–54

Clare Sansom
Freelance science writer

different marker alleles seen on a particular chromosome) can be determined. Additionally, SNP genotyping is currently 3–4 orders of magnitude too expensive to make high-density SNP mapping practical. For the pending SNP maps to be useful for association studies, efficient technologies are needed for genotyping hundreds of thousands of SNPs. Kruglyak sees this as a bottleneck in the SNP genome scan and hopes that his simulations will ‘focus attention on development of technologies to genotype SNPs on this scale’. A further concern is the computational ability to analyse data from 500 000 SNPs.
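The marker counts quoted in this article can be checked with simple arithmetic. The sketch below assumes the 3 000 000 000 bp genome length used in the article and, as a deliberate idealization, treats SNPs as evenly spaced; the helper names `snp_count` and `mean_spacing_bp` are hypothetical.

```python
GENOME_BP = 3_000_000_000  # genome length assumed in the article


def snp_count(snps_per_kb, genome_bp=GENOME_BP):
    """Number of SNPs expected genome-wide at a given density per kb."""
    return int(snps_per_kb * genome_bp / 1000)


def mean_spacing_bp(n_snps, genome_bp=GENOME_BP):
    """Average distance between neighbouring markers if evenly spaced."""
    return genome_bp / n_snps


# At least 1 SNP per kb extrapolates to about 3 million SNPs.
print("SNPs at 1 per kb:", snp_count(1))

# Mean spacing for the map sizes mentioned in the article.
for label, n_snps in [
    ("Kruglyak's estimate", 500_000),
    ("TSC target", 300_000),
    ("dbSNP in 1999", 6388),
]:
    print(f"{label}: {n_snps} SNPs, one every {mean_spacing_bp(n_snps):,.0f} bp")
```

With 500 000 evenly spaced markers, no position would lie more than 3 kb from the nearest SNP, which is consistent with the 3 kb extent of useful LD quoted from the simulations, although the article does not spell out this derivation. Real SNPs are not evenly spaced, so these figures are averages only.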

Given that the current density of mapped SNPs is not sufficient for economical whole-genome LD scans, researchers are looking to other factors that influence LD, such as size and type of population, for the design of their gene-finding experiments. Larger study populations (thousands as opposed to hundreds) enable LD studies to be done with fewer markers, but such population sizes may be restrictive for individual research groups. To overcome this problem, a number of consortia have formed to pool samples and increase their power to detect associations. It has also been argued that LD studies can be carried out with fewer markers in recently founded genetic isolates, such as the Icelandic population, because in such populations recombination has had less time to whittle down LD around risk-conferring variants. Such populations have attracted commercial interest: deCode Genetics (Reykjavik, Iceland) is studying the population of Iceland and its extensive genealogy to find genes for common diseases. However, Kruglyak’s simulations reveal that the extent of LD is similar in isolated populations and out-bred populations ‘unless the founding bottleneck is very narrow (10–100 individuals) or the frequency of the variant is low (<5%).’ He also predicts that even isolated populations may have more than one genetic cause for a common disease. ‘We certainly see that in Iceland’, agrees Jeff Gulcher, vice president of research and development at deCode. However, the use of extensive genealogy means that the genetics can be simplified: ‘We can choose to study only those patients who are descendants of one or a few ancestors’, explains Gulcher.
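The argument that recombination has had less time to erode LD in recently founded populations can be illustrated with the textbook decay relation, in which the LD coefficient shrinks by a factor of (1 - r) each generation for a recombination fraction r between the two loci. This is a deliberately simplified sketch that ignores drift, selection, mutation and population history, all of which Kruglyak's simulations attempt to account for; the function names and the values of r are hypothetical.

```python
import math


def ld_after(d0, r, generations):
    """LD coefficient after a number of generations of random mating.

    d0 is the starting LD coefficient and r the per-generation
    recombination fraction between the two loci (0 < r <= 0.5).
    """
    if not 0.0 < r <= 0.5:
        raise ValueError("recombination fraction must be in (0, 0.5]")
    return d0 * (1.0 - r) ** generations


def generations_to_halve(r):
    """Generations needed for LD to fall to half its starting value."""
    if not 0.0 < r <= 0.5:
        raise ValueError("recombination fraction must be in (0, 0.5]")
    return math.log(0.5) / math.log(1.0 - r)


# Hypothetical comparison of a tightly linked and a loosely linked pair.
for r in (0.0001, 0.01):
    print(f"r = {r}: LD halves in about {generations_to_halve(r):.0f} generations")

# LD remaining after 50 generations, starting from complete association.
print(ld_after(0.25, 0.01, 50))
```

Under this simple model, the younger the population, the more of the original association survives at a given distance, which is the intuition behind studying genetic isolates; the simulations reported in the article suggest the real advantage is smaller than this intuition implies.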

Until SNP maps are dense enough and SNP genotyping becomes orders of magnitude less expensive, most researchers will continue to use simple sequence repeats as the markers in their hunt for genetic variants underlying common diseases. ‘Genome scans based on simple repeat sequence markers will continue to have a highly significant role in screening for quantitative trait loci in multi-factorial disorders. We should not ignore the wealth of information yielded by the sequencing projects, which will allow us to produce much denser arrays of simple sequence repeat markers’, says Ian Craig.

1 Kruglyak, L. (1999) Prospects for whole-genome linkage disequilibrium mapping of common disease genes, Nat. Genet. 22, 139–144

Elisabeth Dawson PhD
Sanger Centre, Hinxton, UK


Making sense of the human proteome

2-D polyacrylamide gel electrophoresis of colorectal adenoma cells. Identified proteins are highlighted in red.