
The ENCODE Project: ENCyclopedia Of DNA Elements

Overview

Consortium Membership

Data Release Policy

Accessing ENCODE Data

Common Consortium Resources

Target Selection Process and Target Regions

Comparative Sequence Analysis

Coordination with HapMap

Meeting Reports

Request for Application (RFA)

Press Releases and Publications

Program Staff

ENCODE Overview

The National Human Genome Research Institute (NHGRI) launched a public research consortium named ENCODE, the Encyclopedia Of DNA Elements, in September 2003, to carry out a project to identify all functional elements in the human genome sequence. The project is being conducted in three phases: a pilot project phase, a technology development phase and a planned production phase.

The pilot phase is testing and comparing existing methods to rigorously analyze a defined portion of the human genome sequence. It is organized as an open consortium (See: ENCODE Participants and Projects) and brings together investigators with diverse backgrounds and expertise to evaluate the relative merits of each of a diverse set of techniques, technologies and strategies. The concurrent technology development phase of the project aims to develop new high throughput methods to identify functional elements. The goal of the first two phases of the ENCODE project is to identify a suite of approaches that will allow the comprehensive identification of all the functional elements in the human genome. Through the ENCODE pilot, NHGRI expects to assess the abilities of different approaches to be scaled up for an effort to analyze the entire human genome and to find gaps in our ability to identify functional elements in genomic sequence.

The ENCODE Pilot Project process involves close interactions between computational and experimental scientists to evaluate a number of methods for annotating the human genome. A set of regions (See: Target Selection Process and Target Regions) representing approximately 1 percent (30 Mb) of the human genome has been selected as the target for this pilot project and is currently being analyzed by all ENCODE Consortium investigators. All data generated by ENCODE participants on these regions will be rapidly released into public databases (See: Accessing ENCODE Data). By initially concentrating on a limited portion of the human genome, the NHGRI hopes that all of those who have experience and insight into the problem will be willing to participate, whether or not their approaches are proprietary or have already generated proprietary data. The ENCODE Consortium is open to all academic, government and private sector scientists interested in participating in an open process to facilitate the comprehensive interpretation of the human genome sequence and who agree to the criteria for participation (See: Criteria for Participation) for the project. The activities of the ENCODE Consortium will be influential in helping to guide the planning for a complete public elucidation of functional elements within the entire human genome.

Read about the ENCODE Project's Background

ENCODE Consortium Membership

The ENCODE Consortium is composed primarily of scientists who were funded under RFAs released by NHGRI to initiate the pilot and technology development phases of the ENCODE Project. Other participants have been identified and brought into the Consortium as appropriate. The Consortium is open to any investigator willing to abide by the criteria for participation established for the ENCODE Project by NHGRI. The ENCODE External Consultants Panel oversees the activities of the Consortium and provides advice and feedback on the Consortium's goals, progress and membership.

Those interested in applying for membership to the ENCODE Consortium should review the criteria for participation and contact Elise Feingold, Ph.D., or Peter Good, Ph.D. (See: Program Staff).

Criteria for Participation

External Consultants Panel

ENCODE Participants and Projects

ENCODE Data Release Policy

The NHGRI is committed to the principle of rapid data release to the scientific community. This principle was initially implemented during the Human Genome Project and has been recognized as leading to one of the most effective ways of promoting the use of the human genome sequence to advance scientific knowledge.

ENCODE Data Release Policy

ENCODE as a Data User

Accessing ENCODE Data

The data produced by ENCODE Consortium members are deposited in public databases and are available for all to use without restriction. Data linked to the genomic sequence are stored and visualized on the University of California, Santa Cruz (UCSC) genome browser at ENCODE Project at UCSC [genome.ucsc.edu]. Other, non-sequence-based data, such as those from microarray studies, are available in public databases such as the Gene Expression Omnibus (GEO) [ncbi.nlm.nih.gov] and ArrayExpress [ebi.ac.uk]. The NHGRI Division of Intramural Research will be developing a "portal" that will function as a single point of entry from which users can search and retrieve data from the ENCODE Project. Data users should abide by the ENCODE Data Release Policy when accessing data produced by ENCODE Consortium members.

ENCODE Common Consortium Resources

Common reagents and resources have been identified by the Consortium to aid in the comparison of data produced by ENCODE participants using different platforms and experimental approaches. Common resources for ENCODE include the pilot project target sequences, BAC Clones for ENCODE targets, cell lines, and antibodies to DNA-binding proteins.

ENCODE Common Consortium Resources

ENCODE Target Selection Process and Regions

For use in the ENCODE Pilot Project, defined regions of the human genome - corresponding to 30Mb, roughly 1 percent of the total human genome - have been selected. These regions serve as the foundation on which to test and evaluate the effectiveness and efficiency of a diverse set of methods and technologies for finding various functional elements in human DNA.

ENCODE Project Target Selection Process and Target Regions

ENCODE Comparative Sequence Analysis

A component of ENCODE data production involves the generation of sequencing information from a number of different genomes in order to extract the maximum amount of information about the human genome through comparative analyses. Efforts are already underway at the NHGRI, the University of British Columbia and the NIH Intramural Sequencing Center to identify, map and sequence, respectively, BAC clones for regions syntenic to the human ENCODE targets in additional mammalian species. In addition to these ENCODE-directed efforts, sequence data generated through whole genome sequencing projects will be used in comparative analyses to help scientists better understand the human sequence. ENCODE Participants intend to abide by the Fort Lauderdale recommendations on "Sharing Data from Large-scale Biological Research Projects" when using unpublished sequence data in Project analyses.

View the ENCODE regions in multiple species [ensembl.org]

Link to NISC ENCODE comparative sequencing site [nisc.nih.gov]

ENCODE as a Data User

ENCODE-HapMap Coordination

The International HapMap Project has decided to focus on 10 of the ENCODE random regions for comprehensive genotyping as part of an in-depth study of human genetic variation. The regions were chosen to represent a range of conservation with the mouse genome and of gene density according to the strata identified during the ENCODE target selection process.

The 10 HapMap-ENCODE regions were resequenced in 48 unrelated individuals (16 Yoruba, 16 CEPH, 8 Han Chinese, and 8 Japanese) using a PCR-based method. A total of 30,000 single nucleotide polymorphisms (SNPs) were identified in the HapMap-ENCODE regions. Some of these were already represented in dbSNP, a database of SNP data that is managed by the National Center for Biotechnology Information (NCBI), while others were discovered during the resequencing. The newly discovered SNPs were added to dbSNP, and the sequence data from the 48 individuals are stored in NCBI's Trace Archive.

Of the 30,000 SNPs identified in the HapMap-ENCODE regions, 10,000 were not analyzed because of failed design or failed genotyping. Genotype data were obtained from the remaining 20,000 SNPs in the HapMap-ENCODE regions of all 270 samples used for the HapMap Project (90 CEPH, 90 Yoruba, 45 Han Chinese, and 45 Japanese). This genotyping was done at the Broad Institute of Harvard and MIT, Illumina, Baylor College of Medicine, McGill University & Genome Quebec Innovation Centre, and the University of California, San Francisco.

The ENCODE-HapMap genotyping data set is considered to be a "gold standard" data set because of the high density of SNP coverage. The genotype data from these regions will be used to determine the best way to choose tag SNPs and to assess the adequacy of the entire HapMap for many analyses, such as coverage, linkage disequilibrium (LD) measures, and haplotype inference.
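As a hedged illustration of the kind of LD measure referred to above, the following Python sketch computes the squared correlation (r^2) between two biallelic SNPs from phased haplotypes. The function name `r_squared` and the toy haplotypes are assumptions made for this example; they are not HapMap data or HapMap software.

```python
def r_squared(haplotypes):
    """Return the LD measure r^2 for two biallelic SNPs.

    haplotypes: list of (a, b) pairs, one per phased chromosome, where
    a and b are the alleles at SNP A and SNP B, coded 0 or 1.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n          # frequency of allele 1 at SNP A
    p_b = sum(b for _, b in haplotypes) / n          # frequency of allele 1 at SNP B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                             # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when either SNP is monomorphic")
    return d * d / denom


# Toy (invented) haplotypes: the two SNPs always travel together,
# so either one could serve as a tag for the other.
linked = [(0, 0), (0, 0), (1, 1), (1, 1)]
# Toy (invented) haplotypes: the two SNPs are statistically independent.
unlinked = [(0, 0), (0, 1), (1, 0), (1, 1)]

print(r_squared(linked), r_squared(unlinked))
```

A pair with r^2 near 1 carries largely redundant information, which is the intuition behind choosing a subset of tag SNPs from a densely genotyped region.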

For more information on coordination between the HapMap and ENCODE Projects, please visit http://www.hapmap.org/downloads/encode1.html.en.

ENCODE Meeting Reports

A workshop was held to discuss a proposal to create a highly interactive public research consortium to carry out a pilot project for testing and comparing existing and new methods to identify functional sequences in DNA. The workshop participants resoundingly supported the concept of a pilot project and made a number of recommendations about the project's goals, organization and implementation.

ENCODE Pilot Project Launch Meeting March 7, 2003

July 23-24, 2002: Workshop on the Comprehensive Extraction of Biological Information From Genomic Sequence

ENCODE Project Request For Application (RFA)

New ENCODE RFAs

RFA-HG-07-010 [grants1.nih.gov]: A Data Analysis Center for the Encyclopedia of DNA Elements (ENCODE) Project (U01). Application Receipt Date(s): September 06, 2007

Past ENCODE RFAs

The pilot and technology development phases of the ENCODE project were initiated simultaneously in 2003 when NHGRI released Requests For Application (RFAs) for each of these phases. The first RFA for the pilot phase, RFA HG-03-003, entitled Determination of all functional elements in human DNA, solicited applications from those interested in participating in a research network to conduct a pilot project to test and compare existing methods for identifying all of the functional elements in a limited (~1%) region of the human genome. The second RFA, RFA HG-03-004, entitled Technologies to find functional elements in DNA, solicited applications to develop new and improved technologies for the efficient, comprehensive, high-throughput identification and verification of all types of sequence-based functional elements, particularly those other than coding sequences, for which adequate methods do not currently exist.

NHGRI re-released the technology development RFA in 2004 and 2006. RFA HG-04-001, issued in 2004, solicited additional applications with an added emphasis on high-risk, high-payoff projects and on technologies that might be applied to model organism genomes. RFAs HG-07-028 and HG-07-029, issued in 2006, had an added emphasis on methods to identify functional elements in repetitive sequences and on methods that can be used to validate the identity of functional elements using approaches independent of the primary mode of detection.

As the initial phase of the ENCODE Project will be completed in September 2007, NHGRI issued RFAs in November 2006 to solicit applications for research projects to continue the ENCODE-based analysis of the human genome at both pilot and whole-genome scales. RFA HG-07-030, entitled Creating the Encyclopedia of DNA Elements (ENCODE) in the Human Genome (U01 and U54), solicited applications for research projects to identify functional elements in the entire human genome sequence (for whole-genome scale projects) or in the 1% of the genome targeted during the ENCODE pilot phase (for pilot-scale projects). RFA HG-07-031, entitled A Data Coordination Center for the Encyclopedia of DNA Elements (ENCODE) Project (U41), solicited applications to develop, house, and maintain databases to track, store, and provide access to the different types of data generated as part of the ENCODE Project.

In 2006, NHGRI released RFAs to begin an ENCODE-like project in the model organisms Caenorhabditis elegans and Drosophila melanogaster. This effort, called modENCODE, was initiated through the funding of grants submitted in response to RFA HG-06-006, entitled Identification of all functional elements in selected model organism genomes and RFA HG-06-007, entitled A Data Coordination Center for the model organism ENCODE Project (modENCODE). The modENCODE Project will exploit the experimental advantages of working with the genomes of these two well-studied model organisms both to identify sequence-based functional elements and to promote an understanding of the functional elements on the basis of experiments that might not be possible to do for those working with the human genome.

RFA HG-03-003 [grants.nih.gov]: Determination of All Functional Elements in Human DNA (Expired)

RFA HG-03-004 [grants.nih.gov]: Technologies to Find Functional Elements in Genomic DNA (Expired)

RFA-HG-04-001 [grants1.nih.gov]: Technologies to Find Functional Elements in Genomic DNA (Expired)

RFA-HG-07-028 [grants.nih.gov]: Technology Development for the Comprehensive Determination of Functional Elements in Eukaryotic Genomes (R21) (Expired)

RFA HG-07-029 [grants.nih.gov]: Technology Development for the Comprehensive Determination of Functional Elements in Eukaryotic Genomes (R01) (Expired)

RFA-HG-07-030 [grants.nih.gov]: Creating the Encyclopedia of DNA Elements (ENCODE) in the Human Genome (U01 and U54) (Expired)

NOT-07-007: Clarification and Additional Information to HG-07-030 and HG-07-031

Slides from Applicant Information Meeting - HG-07-030

RFA-HG-07-031 [grants.nih.gov]: A Data Coordination Center for the Encyclopedia of DNA Elements (ENCODE) Project (U41) (Expired)

NOT-07-007: Clarification and Additional Information to HG-07-030 and HG-07-031

Slides from Applicant Information Meeting - HG-07-031

RFA-HG-06-006 [grants.nih.gov]: Identification of All Functional Elements in Selected Model Organism Genomes (Expired)

RFA HG-06-007 [grants.nih.gov]: A Data Coordination Center for the Model Organism ENCODE Project (modENCODE) (Expired)

ENCODE Press Releases and Publications

ENCODE Press Releases

Researchers Expand Efforts to Explore Functional Landscape of the Human Genome October 9, 2007

New Findings Challenge Established Views on Human Genome June 13, 2007

ENCODE Consortium Publishes Scientific Strategy October 21, 2004

Beyond Genes: Scientists Venture Deeper Into the Human Genome: ENCODE Project Seeks to Identify All Functional Elements in Human DNA October 9, 2003

Launch of Pilot Project to Identify All Functional Elements in Human DNA March 4, 2003

ENCODE Publications

Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project Nature, June 13, 2007

ENCODE Web Focus Related articles on ENCODE from Nature, June 2007

Special Issue on ENCODE from Genome Research June 2007

The ENCODE (ENCyclopedia Of DNA Elements) Project. [sciencemag.org] (Full Text) Science, Vol. 306, Issue 5696, 636-640, 22 October 2004.

ENCODE Program Staff

Program Directors

Elise Feingold, Ph.D.
E-mail: [email protected]

Peter Good, Ph.D.
E-mail: [email protected]

Program Analysts

Laura Liefer
E-mail: [email protected]

Wetterstrand, MS
E-mail: [email protected]

National Human Genome Research Institute
National Institutes of Health
5635 Fishers Lane
Suite 4076, MSC 9305
Bethesda, MD 20892-9305

Phone: (301) 496-7531
Fax: (301) 480-2770

The HapMap ENCODE resequencing and genotyping project aims to produce a dense set of genotypes across large genomic regions. Ten 500-kilobase regions of the genome were resequenced in 48 unrelated DNA samples (16 Yoruba, 8 Japanese, 8 Han Chinese, and 16 CEPH). All SNPs identified, along with SNPs in dbSNP, were genotyped in the 269 HapMap DNA samples (90 Yoruba, 44 Japanese, 45 Han Chinese, and 90 CEPH). The new SNPs discovered were deposited in dbSNP and all genotype data were sent to the HapMap Data Coordination Center. In addition, Perlegen will genotype all SNPs in the remaining 34 ENCODE regions in all of the HapMap DNA samples as part of its genotyping for the HapMap Project. This study will provide dense genotype data to allow the development and assessment of methods of analysis.

A second plate of samples was collected from each population in order to allow studies of how general the results from the first plate of samples are. Of the 16 Yoruba samples resequenced, 8 are on the first plate and 8 are on the second plate; of the 8 Han Chinese samples resequenced, 7 are on the first plate and 1 is on the second plate; the 8 Japanese samples and 16 CEPH samples that were resequenced are on the first plates. A complete list of the sample IDs can be found here.

ENCODE Regions SNP Information

Region name | Chromosome band | Genomic interval (NCBI B36) | Genotype SNPs, CEU | Genotype SNPs, JPT+CHB | Genotype SNPs, YRI | Genotyping group
ENr112 | 2p16.3 | Chr2:51512208..52012208 | 2,601 | 2,573 | 2,608 | McGill-GQIC, Perlegen
ENr131 | 2q37.1 | Chr2:234156563..234656627 | 2,214 | 2,107 | 2,129 | McGill-GQIC, Perlegen
ENr113 | 4q26 | Chr4:118466103..118966103 | 2,538 | 2,401 | 2,405 | Broad, Perlegen
ENm010 | 7p15.2 | Chr7:26924045..27424045 | 1,830 | 1,787 | 1,742 | UCSF-WU, Perlegen
ENm013 (500Kb) | 7q21.13 | Chr7:89621624..90121624 | 1,770 | 1,678 | 1,680 | Broad, Perlegen
ENm014 (500Kb) | 7q31.33 | Chr7:126368183..126865324 | 3,343 | 3,239 | 3,232 | Broad, Perlegen
ENr321 | 8q24.11 | Chr8:118882220..119382220 | 2,128 | 2,100 | 2,092 | Illumina, Perlegen
ENr232 | 9q34.11 | Chr9:130725122..131225122 | 1,909 | 1,828 | 1,808 | Illumina, Perlegen
ENr123 | 12q12 | Chr12:38626477..39126476 | 2,189 | 2,181 | 2,035 | BCM, Perlegen
ENr213 | 18q12.1 | Chr18:23719231..24219231 | 1,990 | 1,969 | 1,966 | Illumina, Perlegen
Total | | | 22,512 | 21,863 | 21,697 |

Population descriptors:

YRI : Yoruba in Ibadan, Nigeria

JPT+CHB : Japanese in Tokyo, Japan + Han Chinese in Beijing, China (combined on one plate)

CEU : CEPH (Utah residents with ancestry from northern and western Europe)
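As a rough illustration of how the figures in the ENCODE Regions SNP Information table might be used, the following Python sketch recomputes the per-population totals and a simple SNP density for each region. The density definition (genotyped SNPs divided by `end - start`, expressed per kilobase) and the names `REGIONS`, `column_totals` and `snps_per_kb` are assumptions made for this example; they are not part of the HapMap data release.

```python
# Values copied from the ENCODE Regions SNP Information table:
# (region, chromosome, start, end, CEU SNPs, JPT+CHB SNPs, YRI SNPs)
REGIONS = [
    ("ENr112", "Chr2", 51512208, 52012208, 2601, 2573, 2608),
    ("ENr131", "Chr2", 234156563, 234656627, 2214, 2107, 2129),
    ("ENr113", "Chr4", 118466103, 118966103, 2538, 2401, 2405),
    ("ENm010", "Chr7", 26924045, 27424045, 1830, 1787, 1742),
    ("ENm013", "Chr7", 89621624, 90121624, 1770, 1678, 1680),
    ("ENm014", "Chr7", 126368183, 126865324, 3343, 3239, 3232),
    ("ENr321", "Chr8", 118882220, 119382220, 2128, 2100, 2092),
    ("ENr232", "Chr9", 130725122, 131225122, 1909, 1828, 1808),
    ("ENr123", "Chr12", 38626477, 39126476, 2189, 2181, 2035),
    ("ENr213", "Chr18", 23719231, 24219231, 1990, 1969, 1966),
]


def column_totals(rows):
    """Sum the CEU, JPT+CHB and YRI SNP counts over all regions."""
    return tuple(sum(row[i] for row in rows) for i in (4, 5, 6))


def snps_per_kb(start, end, snp_count):
    """SNP density per kilobase, taking the interval length as end - start
    (an assumption about how the interval coordinates should be read)."""
    return snp_count / ((end - start) / 1000.0)


for name, chrom, start, end, ceu, jpt_chb, yri in REGIONS:
    print(name, chrom, round(snps_per_kb(start, end, ceu), 2))

print(column_totals(REGIONS))
```

The recomputed totals can be compared against the Total row of the table as a basic consistency check on the transcribed values.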

Generated Fri Apr 13 13:44:05 EDT 2007

HapMap ENCODE Resequencing Project

Groups

David Altshuler and Stacey Gabriel, Broad Institute of Harvard and MIT

Richard Gibbs and George Weinstock, Baylor College of Medicine

ENCODE Regions

Each group resequenced five 500kb regions.

These regions were chosen by the Analysis Group from among the ENCODE regions; they include a range of chromosomes, recombination rates, gene density, and values of non-transcribed conservation with mouse. For more information on the ENCODE Project see http://www.genome.gov/10005107.

Samples

Resequencing was done for 16 CEPH, 16 Yoruba, 8 Japanese, and 8 Han Chinese samples. Please click here to view the Coriell Catalog ID for each DNA sample.

The samples are currently available and may be ordered from the Coriell Institute.

Strategy

PCR-based sequencing across the regions for each sample.

Data Release

All SNPs found were deposited in dbSNP (http://www.ncbi.nlm.nih.gov/projects/SNP/).

HapMap ENCODE Genotyping Project

Groups

David Altshuler and Stacey Gabriel, Broad Institute of Harvard and MIT

Mark Chee, Illumina

Richard Gibbs and John Belmont, Baylor College of Medicine

Thomas Hudson, McGill University & Génome Québec Innovation Centre

Pui Kwok, UCSF

ENCODE Regions

The regions are the same ten regions being resequenced.

Each group genotyped all known SNPs (with rs# in dbSNP) and newly discovered SNPs in the 500kb ENCODE regions in their assigned chromosomes.

Samples

The samples genotyped are the same 270 (plus the 5 duplicates for each plate) used for the HapMap Project:

90 CEPH samples: including the 16 that were resequenced.

90 Yoruba samples: including the 8 that were resequenced.

44 Japanese samples: including the 8 that were resequenced.

45 Han Chinese samples: including the 7 that were resequenced.

All of the samples listed above may be ordered from the Coriell Institute.

The 8 Yoruba samples and 1 Chinese sample that were not genotyped on the first plates are included on the second plates.

Strategy

Initially the SNPs currently in dbSNP build 121 were genotyped.

All SNPs found from the resequencing project and other sources were also genotyped.

Data Release

The genotype data were sent to the DCC and distributed in the same way as the other HapMap genotype data.

Perlegen Genotyping Component

Groups

Perlegen

ENCODE Regions

Initially, the ten 500kb regions.

Perlegen will genotype all SNPs in the remaining 34 ENCODE regions.

Samples

90 HapMap CEPH samples (plus 5 duplicates) in the ten 500kb regions.

All samples (90 Yoruba, 45 Japanese, 45 Han Chinese, and 90 CEPH samples) in the remaining 34 ENCODE regions as part of its genotyping for the HapMap Project.

Strategy

In the CEPH samples, Perlegen genotyped the ten 500kb regions for all the SNPs in dbSNP and for the SNPs it had.

Perlegen will genotype all SNPs in the remaining 34 ENCODE regions in all 270 samples.

Data Release

Perlegen sent its SNPs in the ten 500kb regions to dbSNP (as new SNPs or as validation of ones in dbSNP).

The genotype data were sent to the DCC and distributed in the same way as the other HapMap genotype data.

The data for the remaining 34 ENCODE regions will be sent to the DCC when they become available.

ENCODE - Project Information

Following the completion of the human genome sequence, the next major task is to understand the information contained therein. Although considerable progress has been made in identifying the genes that code for functional proteins (for instance, see Collins et al 2003, 2004), identifying elements in DNA sequence which control gene expression and DNA replication at a genome-wide level is far from trivial. Therefore NHGRI has established a pilot project (ENCODE) to explore computational and experimental methods to develop an encyclopedia of DNA elements in the human genome. Initially the pilot project has funded a collection of different groups who will target 1% of the genome chosen according to the criteria outlined in the ENCODE RFA.

The Sanger Institute has two groups involved in the ENCODE project.

Detecting Human Functional Sequences with Microarrays

Ian Dunham PI

David Vetrie Co-PI

Nigel Carter Co-PI

We were inspired by recent work in our laboratories using microarrays to study DNA copy number (Fiegler et al 2003), replication timing and chromatin modifications in a variety of genomic situations, from 400bp resolution in a ~200 kb pilot region, through ~75 kb resolution across the q arm of chromosome 22, to 1Mb resolution across the human genome. We aim to contribute microarray-based approaches to the ENCODE consortium to provide experimental evidence of DNA elements involved in gene regulation and replication, as well as the status of chromatin, across the pilot 1% of the genome. Specifically we are:

1. Developing two sets of genomic microarrays covering the 1% of the genome targeted in the ENCODE project. The first is a low-resolution microarray based on genomic clones (predominantly BACs, but also PACs, cosmids and fosmids) from the genomic sequence tile path. The second is an array of 22,000 1.25kb PCR fragments designed from the DNA sequence covering ~85% of the targeted regions - viewable here.

2. Using these microarrays to assay DNA samples enriched for sequences involved in specific biological processes and functions by methods including flow-sorting, pulse-labeling and chromatin immunoprecipitation (ChIP), so as to develop high-resolution maps, at genomic clone and 1.25kb resolution, of the following:

Replication timing,

Replication origins,

DNA methylation,

Modified histones/active and inactive chromatin,

Transcription factor binding sites.

We will correlate these maps with genomic DNA features including C+G content, genes/exons, repeat elements, and SNP density. In addition we will correlate the elements we map with regions of conserved DNA sequence identified by comparative sequencing across multiple species being undertaken in the laboratory of Eric Green and maps of transcriptional activity as part of the consortium.

Refs

Reevaluating human gene annotation: a second-generation analysis of chromosome 22. Collins JE, Goward ME, Cole CG, Smink LJ, Huckle EJ, Knowles S, Bye JM, Beare DM, Dunham I. Genome Res. 2003;13;27-36. PMID: 12529303

A genome annotation-driven approach to cloning the human ORFeome. Collins JE, Wright CL, Edwards CA, Davis MP, Grinham JA, Cole CG, Goward ME, Aguado B, Mallya M, Mokrab Y, Huckle EJ, Beare DM, Dunham I. Genome Biol. 2004;5;R84. PMID: 15461802

DNA microarrays for comparative genomic hybridization based on DOP-PCR amplification of BAC and PAC clones. Fiegler H, Carr P, Douglas EJ, Burford DC, Hunt S, Scott CE, Smith J, Vetrie D, Gorman P, Tomlinson IP, Carter NP. Genes Chromosomes Cancer. 2003;36;361-74. PMID: 12619160

Identification of functionally variable regulatory regions in the human genome

Manolis Dermitzakis PI

Panos Deloukas Co-PI

Stylianos E. Antonarakis, University of Geneva Co-PI

Andrew G. Clark, Cornell University Co-PI

One of the main reasons to annotate the human genome is to interpret the phenotypic consequences of genetic variation within functional genomic regions. We are using a novel approach for the selective identification of functionally variable regulatory sequences of the human genome. We are detecting correlations between variation in gene expression and nucleotide polymorphisms near those genes to identify regulatory regions and their variants that contribute to gene expression variation. This approach uses naturally occurring genomic variation (nucleotide polymorphism) and phenotypic variation (transcript levels) to detect significant associations (Figure 1). Polymorphisms associated with phenotypic variation will likely be in linkage disequilibrium with functional regulatory polymorphisms nearby, thereby identifying segments of the genome containing sequences that regulate gene expression.
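The association idea described above can be sketched, under simplifying assumptions, as a linear fit of transcript level on genotype at a nearby SNP. In this Python sketch genotypes are coded as the number of copies of one allele (0, 1 or 2); the function name `linear_fit` and the toy values are hypothetical and are not the project's data or analysis pipeline, which is not described in detail here.

```python
def linear_fit(genotypes, expression):
    """Least-squares fit of expression on genotype.

    Returns (slope, r): the change in expression per allele copy and the
    Pearson correlation between genotype and expression.
    """
    n = len(genotypes)
    if n != len(expression) or n < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mean_x = sum(genotypes) / n
    mean_y = sum(expression) / n
    sxx = sum((x - mean_x) ** 2 for x in genotypes)
    syy = sum((y - mean_y) ** 2 for y in expression)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(genotypes, expression))
    if sxx == 0 or syy == 0:
        raise ValueError("no variation in genotype or expression")
    return sxy / sxx, sxy / (sxx * syy) ** 0.5


# Invented example: six individuals, one SNP near one gene.
toy_genotypes = [0, 0, 1, 1, 2, 2]
toy_expression = [1.0, 1.2, 1.9, 2.1, 3.0, 3.2]

slope, r = linear_fit(toy_genotypes, toy_expression)
print(slope, r)
```

In a real screen such a fit would be repeated over many SNP-gene pairs, so a significance threshold corrected for multiple testing would be needed before calling an association; that step is omitted from this sketch.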

Our experimental design is to use the Illumina technology to screen for gene expression variation as well as to genotype relevant SNPs for the association analysis. We have designed an Illumina bead array that contains approximately 350 genes from the ENCODE regions, all the human chromosome 21 genes and 100 genes from a 10 Mb genomic region of human chromosome 20. An example of a hybridized array is shown in Figure 2. The technology is highly sensitive and accurate. In Figure 3a we show the regression of two replicates from the same RNA pool and in Figure 3b the regression of two different individuals. Note the wider spread in Figure 3b as a result of differences in transcript levels between the two individuals.

We view this project as readily scalable to a whole human genome screen for gene expression variation and association with nucleotide polymorphism.

It will provide 4 different types of information:

1. Genomic regions that contain variable regulatory polymorphisms;

2. Structure of regulatory variation in the human genome and determination of how it is associated with disease susceptibility;

3. Large dataset of genes that exhibit variation of expression within populations, in a manner similar to the way the HapMap project will provide the haplotype structure of the human genome.

Last Modified Wed Apr 4 14:51:33 2007

International HapMap Project

The International HapMap Project is a partnership of scientists and funding agencies from Canada, China, Japan, Nigeria, the United Kingdom and the United States to develop a public resource that will help researchers find genes associated with human disease and response to pharmaceuticals. See "About the International HapMap Project" for more information.

News

2007-08-13: Phased haplotypes in NCBI b36 coordinates

Phased haplotypes for release #22 have begun to be released for bulk download. Data will be made available as it is processed and revised.

2007-07-17: HapMap Tutorial: Working with the HapMap Website

ASHG Annual Meeting, San Diego, California, USA

October 25th, 2007 at 18:30 PST, San Diego Marriott Hotel and Marina, Rancho Las Palmas, 4th Level, South Tower

The HapMap Data Coordination Center is pleased to present a one-hour tutorial during the 2007 ASHG Annual Meeting.

The tutorial will provide an overview of the International HapMap Project, a comprehensive tour of the HapMap website, a live demo of new tools and resources, and a Q&A session.

Registration is limited to the first 40 people. You must register for this tutorial by October 14th, 2007, in order to participate. There is no additional fee, but participants must first be registered for the ASHG Meeting.

To register or make inquiries, please e-mail [email protected]

2006-06-14: Wellcome Trust Course: Working with the HapMap

Wellcome Trust Genome Campus, Hinxton, Cambridge, UK

November 16-19th, 2007

The Wellcome Trust Course: Working with the HapMap will be held on November 16-19th, 2007 at the Wellcome Trust Genome Campus, Hinxton, Cambridge, UK. The deadline for application is 10th August 2007. Further information can be found at: http://www.wellcome.ac.uk/doc_WTX038039.html with details of how to apply.

2007-06-04: Predicted OMIM associations available in GBrowse

The OMIM associations track presents data from the MutaGeneSys database, which links genotype data from HapMap and whole genome association studies with the known disease variants reported by the OMIM database. Example of a region with multiple OMIM associations: Chr1:194923128..194933127

2007-05-29: Newly phased haplotypes available for non-par segment of ChrX

Genotyping data for phase I+II (rel #21) was rephased for the non-pseudoautosomal (non-par) region of chromosome X. Data is currently available for bulk download.

Developing a Haplotype Map of the Human Genome for Finding Genes Related to Health and Disease

Washington, D.C., July 18-19, 2001

Introduction

So far about 2.4 million DNA sequence variants (single nucleotide polymorphisms or SNPs) have been discovered in the human genome, and millions more exist. These variants will be most useful for discovering genes related to health and disease if their organization along chromosomes, the haplotype structure, is known. Technology is just reaching the point that haplotype maps of blocks of SNPs along chromosomes can be developed.

On July 18-19, 2001, the National Institutes of Health (NIH) held a meeting in Washington, D.C., to discuss how haplotype maps could be used for finding genes contributing to disease; the methods for constructing such maps; the data about haplotype structure in populations; the types of populations and samples that might be considered for a map; the ethical issues, including those related to studying genetic variation in identified populations; and how such a project could be organized. The goal was to resolve some issues and to set up procedures for resolving others.

There were 165 attendees, including human geneticists, population geneticists, anthropologists, pharmaceutical and biotech industry scientists, social scientists, ethicists, representatives from various communities and disease groups, administrators from many NIH institutes and international funding agencies and journalists.

Background: Genetic Variation and Its Use for Mapping Genes Contributing to Disease

Recently technology has become available to study the extent and pattern of human genetic variation on a large scale, and to use this variation to find the genes that contribute to disease. The information summarized here was not presented at the meeting but provides the background for understanding the importance and use of a haplotype map.

Rationale for finding genes contributing to disease

The goal of much genetic research is to find genes that contribute to disease. Finding these genes should allow an understanding of the disease process, so that methods for preventing and treating the disease can be developed. For diseases with a relatively straightforward genetic basis, the single-gene disorders, current methods are usually sufficient to find the genes involved. Most people, however, do not have single-gene disorders, but develop common diseases such as heart disease, stroke, diabetes, cancers or psychiatric disorders, which are affected by many genes and environmental factors. The genetic contribution to these diseases is not clear, but many researchers consider common variants to be important; this view is known as the Common-Disease/Common-Variant theory.

Definition of a Single Nucleotide Polymorphism, SNP

A SNP is a site in the DNA where different chromosomes differ in the base they have. For example, 30 percent of the chromosomes may have an A, and 70 percent may have a G. These two forms, A and G, are called variants or alleles of that SNP. An individual may have a genotype for that SNP that is AA, AG, or GG.
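As a minimal sketch of how the allele frequencies in this definition relate to individual genotypes, the snippet below counts alleles over a small sample. The sample of 10 individuals and the function name `allele_frequencies` are hypothetical, chosen only so that the result matches the 30 percent A and 70 percent G figures used above.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Return the frequency of each allele among the sampled chromosomes."""
    counts = Counter()
    for genotype in genotypes:
        # Each individual carries two chromosomes, so a genotype such as
        # "AG" contributes one A and one G to the tally.
        counts.update(genotype)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


# Hypothetical sample: 1 AA, 4 AG and 5 GG individuals (20 chromosomes).
sample = ["AA"] * 1 + ["AG"] * 4 + ["GG"] * 5
freqs = allele_frequencies(sample)
print(freqs)
```

Here 6 of the 20 chromosomes carry A and 14 carry G, which gives the 30 and 70 percent frequencies.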

Number of SNPs

When chromosomes from two random people are compared, they differ at about one in 1000 DNA sites. Thus when two random haploid genomes are compared, or all the paired chromosomes of one person are compared, there are about three million differences. When more people are considered, they will differ at additional sites. The number of DNA sites that are variable (SNPs) in humans is unknown, but there are probably between 10 and 30 million SNPs, about one every 100 to 300 bases. Of these SNPs, perhaps four million are common SNPs, with both alleles of each SNP having a frequency above 20 percent.
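The figures in this paragraph follow from simple arithmetic once a genome length is fixed. The sketch below assumes a haploid genome of about three billion bases; that length is an assumption added here for illustration and is not stated in the paragraph itself.

```python
# Assumed haploid genome length in bases (an illustrative round number).
GENOME_SIZE_BP = 3_000_000_000

# Two random chromosomes differ at about one site in 1000.
differences_between_two_genomes = GENOME_SIZE_BP // 1000

# One SNP every 300 bases (low estimate) to one every 100 bases (high estimate).
snps_low = GENOME_SIZE_BP // 300
snps_high = GENOME_SIZE_BP // 100

print(differences_between_two_genomes)  # 3000000
print(snps_low, snps_high)              # 10000000 30000000
```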

How SNPs are used to find genes contributing to disease

Some SNP alleles are the actual functional variants that contribute to the risk of getting a disease. Individuals with such a SNP allele have a higher risk for that disease than do individuals without that SNP allele. Most SNPs are not these functional variants, but are useful as markers for finding them. To find the regions with genes that contribute to a disease, the frequencies of many SNP alleles are compared in individuals with and without the disease. When a particular region has SNP alleles that are more frequent in individuals with the disease than in individuals without the disease, those SNPs and their alleles are associated with the disease. These associations between a SNP and a disease indicate that there may be genes in that region that contribute to the disease.
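One common way to compare allele frequencies between individuals with and without a disease is a Pearson chi-square test on a 2x2 table of allele counts. The report does not name a specific test, so the sketch below is only one possible illustration; the function name and the counts are hypothetical.

```python
def allele_chi_square(case_counts, control_counts):
    """Pearson chi-square statistic for a 2x2 table of allele counts.

    Each argument is a pair (count of allele 1, count of allele 2).
    """
    table = [list(case_counts), list(control_counts)]
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Count expected if allele frequency were unrelated to disease status.
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2


# Hypothetical counts of the A and G alleles on 200 chromosomes per group.
cases = (120, 80)
controls = (90, 110)
statistic = allele_chi_square(cases, controls)
print(round(statistic, 2))  # 9.02
```

With one degree of freedom, a statistic above about 3.84 corresponds to p < 0.05 for a single test. A scan over many SNPs would need a stricter threshold to account for multiple testing.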

The use of haplotypes

A haplotype is the set of SNP alleles along a region of a chromosome. Theoretically there could be many haplotypes in a chromosome region, but recent studies are typically finding only a few common haplotypes. Consider the example below, of a region where six SNPs have been studied; the DNA bases that are the same in all individuals are not shown. The three common haplotypes are shown, along with their frequencies in the population. The first SNP has alleles A and G; the second SNP has alleles C and T. The four possible haplotypes for these two SNPs are AC, AT, GC, and GT. However, only AC and GT are common; these SNPs are said to be highly associated with each other.

The cost of genotyping is currently too high for whole-genome association studies that would look at millions of SNPs across the entire genome to see which SNPs are associated with disease. If a region has only a few haplotypes, then only a few SNPs need to be typed to determine which haplotype a chromosome has and whether the region is associated with a disease. In the example below, typing two SNPs is all that is needed to distinguish among the three common haplotypes. The two SNPs indicated by arrows are one pair of several possible pairs of SNPs that distinguish among the three common haplotypes.
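The idea that a few SNPs can identify which common haplotype a chromosome carries can be sketched as a small search. The three six-SNP haplotypes below are hypothetical stand-ins, since the figure with the real example is not reproduced here, and the brute-force search is only an illustration rather than a method described at the meeting.

```python
from itertools import combinations


def smallest_tag_sets(haplotypes):
    """Return all smallest sets of SNP positions that tell every haplotype apart."""
    n_snps = len(haplotypes[0])
    for size in range(1, n_snps + 1):
        found = []
        for positions in combinations(range(n_snps), size):
            # The alleles seen at the chosen positions form a signature;
            # the set works if every haplotype has a distinct signature.
            signatures = {tuple(h[p] for p in positions) for h in haplotypes}
            if len(signatures) == len(haplotypes):
                found.append(positions)
        if found:
            return found
    return []


# Hypothetical common haplotypes; the first two SNPs occur only as AC or GT.
common = ["ACTGCA", "GTTGCA", "GTCATG"]
tag_sets = smallest_tag_sets(common)
print(tag_sets)
```

Because each SNP has only two alleles, no single SNP can separate three haplotypes. In this toy data any pair made of one of the first two SNPs and one of the last four is enough, so several equivalent pairs exist.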

Most SNP variation is within all groups

For most SNPs, any population has individuals of all possible genotypes for a SNP, but populations differ in the frequencies of individuals with each of the different genotypes. About 85 percent of human SNP variation is within all populations, and about 15 percent is between populations, as shown in the figure below. Thus two random individuals within a village are almost as different in their SNP alleles as any two random individuals from anywhere in the world. Although a small proportion of SNPs have alleles that are common in some groups but rare in others, most SNP alleles that are common in one group will be common in other groups. Under the Common-Disease/Common-Variant theory, common variants that contribute to a disease in one group will also contribute to the disease in other groups, although the amount of the contribution may vary.

The Meeting

At the meeting there was discussion of recent data related to haplotype maps, and what a haplotype map might look like. Since haplotypes and associations of SNPs with disease are population phenomena, some of the discussion focused on complexities in sampling human populations. Much of the discussion concerned the information that could be gained by identifying the populations contributing samples for a haplotype map, and the risks and benefits to populations of such identification. The meeting ended with discussion of various aspects of a haplotype map project, including what issues would need more discussion in working groups after the meeting. The main points of the discussions are summarized here.

The Pattern of Genetic Variation and Association Among Genes

Factors that affect the frequencies of alleles and haplotypes in populations:

Biological factors: Haplotype and allele frequencies are affected by cellular-level processes such as mutation, recombination, and gene conversion, as well as by population-level processes such as natural selection against alleles that contribute to disease. When genes are close together and associated, then selection that changes the frequency of an allele at one gene results in similar changes in the frequencies of alleles at other genes on the same haplotype.

Recombination is the major process that breaks down the associations between SNPs. It is unclear whether haplotype block boundaries are due to recombination hotspots, or are simply the result of recombination events that happened to occur there. If the blocks are due to hotspots, then perhaps they will be common across populations. If the blocks are due to regular recombination events, then populations may or may not share them, depending on how long ago the recombination events occurred. When large chromosomal regions are examined, the regions with high association have less recombination and less genetic variation.

Demographic and social factors: Haplotype and allele frequencies are also affected by population history factors such as population size, bottlenecks or expansions of population size, founder effects, isolation of a population or admixture between populations, and patterns of mate choice.

Large variances: The many influences on haplotype frequencies result in large statistical variances for associations among different SNPs; all such studies find that associations vary a lot around the mean. Neighboring SNPs may not be associated, while distant SNPs may be associated, despite the average association declining with longer distances between SNPs. This variance means that more SNPs are needed to study associations than simply counting the blocks might indicate.

Extent of association among SNPs differs by chromosome region and by allele frequency

Some studies show that a measure of association, D', falls to half its possible maximum value at a distance between SNPs of about 50 to 80 kb, averaged over gene regions in European-derived populations. Some regions have strong associations over as much as one megabase. Among different chromosome regions there is about a fourfold range in the extent of associations. Rare SNP alleles are generally of more recent origin than common SNP alleles; recombination has had less time to break down associations around them so that rare alleles generally have associations over longer distances than do common alleles.
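The report does not define D', so the following sketch uses the standard normalized linkage disequilibrium coefficient for two SNPs: D is the observed haplotype frequency minus the product of the allele frequencies, and D' is the absolute value of D divided by its maximum possible value given those allele frequencies. The function name and the haplotype counts are hypothetical.

```python
def d_prime(haplotype_counts, allele_a, allele_b):
    """Compute |D'| between two SNPs from counts of two-SNP haplotypes.

    haplotype_counts maps strings such as "AC" to the number of chromosomes
    carrying that haplotype; allele_a and allele_b name one allele at each SNP.
    """
    total = sum(haplotype_counts.values())
    p_ab = haplotype_counts.get(allele_a + allele_b, 0) / total
    p_a = sum(n for h, n in haplotype_counts.items() if h[0] == allele_a) / total
    p_b = sum(n for h, n in haplotype_counts.items() if h[1] == allele_b) / total

    d = p_ab - p_a * p_b  # departure from independent assortment
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return abs(d) / d_max if d_max > 0 else 0.0


# Only two of the four possible haplotypes occur: complete association.
complete = d_prime({"AC": 30, "GT": 70}, "A", "C")

# All four haplotypes are equally common: no association.
none = d_prime({"AC": 25, "AT": 25, "GC": 25, "GT": 25}, "A", "C")
print(complete, none)
```

A value near 1 means the two SNPs are almost always inherited together, and a value near 0 means their alleles combine at random.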

Extent of association among SNPs differs by population

Many studies show that the chromosomal distances that SNP associations extend are generally shorter for African populations, intermediate for European and Asian populations, and longer for American Indian populations, although there is variation among populations in the same geographic region. When groups of people from populations that differ in some allele frequencies marry and reproduce with each other, as has often happened with African-Americans and with Hispanics in the United States, associations are generated over longer chromosomal distances in the admixed group than in either parental group. Recently formed populations such as the Mennonites and Acadians also may have associations over longer chromosomal distances.

Common haplotypes are in all populations

The pattern of variation within and among populations for haplotype structure is just starting to be studied on a large scale. Recent studies show that the common haplotypes are found in all populations studied, and that the population-specific haplotypes are generally rare. African populations generally have more haplotypes than other populations, which generally have subsets of the African ones, due to the origins of other populations from ones that spread out of Africa.

Haplotype Block Structure as the Rationale for a Haplotype Map

Block pattern of haplotypes

Some recent studies found that haplotypes occur in a block pattern: the chromosome region of a block has just a few common haplotypes, followed by another block region also with just a few common haplotypes, with the longer-distance haplotypes showing a mixing of the haplotypes in the two blocks. Another description of this pattern is that the SNPs in a block are strongly associated with each other, but much less associated with other SNPs. Blocks range in size from about three kb to more than 150 kb. The majority of SNPs are organized in these blocks. Some recent data show that the blocks in a Yoruban population from Nigeria are generally the same ones as, but shorter than, those in two European-derived populations, although the data are limited and these conclusions are preliminary.

Using haplotype blocks to find chromosome regions associated with disease

Where blocks exist, they can be tested for association with a disease, using just a few SNPs per block. If the blocks are large, then a few SNPs in a region will indicate whether that region has genes related to a disease. If the blocks are small, then many SNPs will be needed to cover a region. Typing more SNPs than needed is a waste of resources; typing too few SNPs means that a disease association could easily be missed.

A haplotype map

A haplotype map would show the haplotype blocks and the SNPs that define them. A haplotype map thus would serve as a resource to increase the efficiency and comprehensiveness of the many other studies that will be done to relate genes to diseases.

Haplotype maps of different populations

To the extent that populations differ in their haplotype structure, it may be useful to study different populations during different stages of the process of finding disease genes. Studying populations with large haplotype blocks will be useful for initial association studies over the entire genome to find chromosome regions affecting a disease. Once these chromosome regions have been found, they can be studied in populations with small haplotype blocks in those regions, so the particular genes can be found more easily by being localized in small regions.

Sampling Human Haplotype Variation

Possible schemes to sample human haplotype variation

Population sampling: Samples are chosen from particular identified populations, defined by ethnicity and geography.

Grid sampling: Samples are chosen from particular geographic regions on a world grid.

Proportional sampling: Samples are chosen from identified populations so that the entire sample has a known distribution, but the population identities of the individual samples are not kept. This scheme was used for the DNA Polymorphism Discovery Resource.

Studying one or multiple populations

Studying just one population would reveal the common haplotypes that are in all populations, and so the resulting haplotype map would be useful for all populations. Including only one, non-disadvantaged, population would also avoid some of the ethical issues raised by identifying populations. However, this approach would raise serious issues of justice, since only that population could receive the population-specific advantages of the haplotype map. There are also scientific reasons to include more than one population in a haplotype map: to add haplotypes that are not as common or that are more variable in frequency among populations; and to reveal regions that are similar or different in haplotype structure among populations. After the first few populations are included in a haplotype map, additional populations could still be added. The haplotype map should be developed so that it would be useful for mapping genes in any population.

Designating which population an individual belongs to, when choosing which individuals to sample

There are many ways that individuals could define which populations they belong to, such as their cultural affiliations or the geographic origins of their grandparents. Most populations have blurred boundaries, some more than others. Some people define themselves as members of several populations. Individuals or communities may emphasize some aspects of ancestry more than others, based on factors such as pride, shame, history of discrimination, or extent of knowledge. For a haplotype map, the purpose of designating an individual as belonging to a particular population is simply to make sure that most of that person's bi-parental lineages come from that particular population. Occasional differences between the population designations of individuals and their actual lineages would have little effect on a haplotype map, since the blocks are defined by the common haplotypes in a population contributing samples. The complexity in designating individuals as belonging to particular populations underscores the need for involving experts in the social sciences when developing a haplotype map.

Population consultation

Only a few populations would need to be included for the haplotype map to become a useful resource for individuals in all populations; there is no reason why any particular population should have to participate for the project to succeed. For any population that might be included in a haplotype map, there must be a process of community consultation to explain the purpose of the map and identify issues of concern to that population. This process would take time but would be necessary to educate both the population and the researchers. Particular populations may be sensitive to being exploited or to being left out. Issues may arise that require modifications to the consent process, the research protocol, the procedures of the sample repository, or the database. American Indian and Alaskan Native tribes are sovereign nations and have procedures for formally granting or withholding consent to research, which by law must be followed. Other populations are less well organized, making formal population consent unobtainable, but community consultation would still be needed, keeping in mind the multiple geographic scales and other complexities that characterize many populations. Procedures for consulting communities are outlined in an emerging literature and are under discussion at NIH.

Issues Associated with Identifying Samples by Population

Risks in identifying populations

Identifying the populations that contribute samples for a haplotype map could raise ethical risks. One risk is that any racial or ethnic identifiers used for the map would come to be reified as biological constructs, fostering a genetic essentialism in the way the map is interpreted and the categories understood. This essentialism could obscure the fluid nature of the "boundaries" between groups and the common genetic variation within all groups. Although the haplotype map would not have any individual medical information, another risk to the groups that participate could arise from later studies that use the haplotype map to find genes contributing to diseases; the participating groups could become more intensely studied, leading to the perception that their members are at high risk for diseases.

Benefits in identifying populations

Identifying individual samples as contributed by members of particular populations would be most useful scientifically. For each population, it would allow multiple sources of biomedical information to be combined. The contributing populations would gain the general benefits of the haplotype map as well as any additional benefits from studies of those particular populations. However, it is an open question how much less useful a haplotype map would be if population identifiers were omitted. Additional studies of population differences in haplotypes are needed to resolve this issue.

Describing the contributing populations

Regardless of whether the individual samples would be identified by their population of origin, any populations that contribute to a haplotype map must be described in a way that does not reinforce the mistaken perception that populations are genetically distinct, well-defined groups. Because people take in information most readily when it confirms their stereotypes, terms related to race and ethnicity must be used with precision, sensitivity, and care. Populations should be described as specifically as possible; for example, if a group of Chinese-Americans in Hawaii were studied, the population should not be labeled simply "Chinese." This specificity of description is crucial to minimize the risk of essentialist definitions of race, which assume that all individuals of a race are genetically similar.

Other Issues Raised by a Haplotype Map Project

Health priorities

Some communities lack even basic health care, so a haplotype map may be a low priority for them. Groups may feel that even if they participate in a haplotype project, not much attention would be paid to genetic diseases that primarily affect some of them but not members of other groups.

Sampling in the developed vs. the developing world

Including samples from developed countries and regions, such as the United States, Canada, Europe and Japan, might raise fewer human subjects concerns than would samples from developing countries without good IRB systems for overseeing research or strong biomedical research infrastructures.

Why not obtain the medical phenotypes of the sampled individuals when developing a haplotype map?

No phenotypic data, such as medical information, would be collected along with the samples. The haplotype map would be a resource for researchers trying to relate genetic variation to a wide range of disorders and traits. Only about 50 samples from each population would be needed to develop a haplotype map. Such a small sample would not be adequate to evaluate the many genetic and environmental factors that affect a disease. However, if the haplotype structure of the genome and the identifying SNPs were known, then researchers could use those SNPs in studies of individuals affected and not affected by a disease, matched to control for environmental factors, to track down the genes that contribute to the disease.

Elements of a Research Plan

The goal of a haplotype map should be medical

A haplotype map could be set up in many ways, to support various types of medical and biological research. It should be set up to best facilitate its use for relating genetic variation to disease.

How should a haplotype block be defined?

To compare studies, it will be important to develop a standard definition of a block, including the minimum frequencies of alleles for SNPs used, how similar haplotypes must be to be considered the "same" haplotype when figuring out which haplotypes are common, and how much of a drop in association defines the boundaries of blocks. Descriptions of the block structure of the genome would include distributions of the lengths of blocks, measures of the variability of blocks, the amount of coverage of the genome by blocks, and the proportion of haplotypes in common blocks; there are tradeoffs among these measures depending on the values of the defining parameters. Care will be needed when comparing studies using SNPs with different allele frequencies, with SNPs ascertained in different ways, and with different sample sizes for estimating associations.

Pilot projects, to help decide whether population identifiers are needed

Population differentiation for allele frequency is about 7 to 10 percent. However, currently little is known about how populations differ in their haplotypes. It will be important to find out whether the haplotype blocks are the same in different populations. Are differences in the extent of associations due to differences in block lengths, or to differences in the associations among neighboring blocks? How different are different populations from the same geographic region? How much information would be lost by removing population identifiers from samples? The first step would be to get more data, by sampling a small set of populations with different geographic origins. If these populations were similar, then it may be possible not to identify populations and still get a haplotype map that can be broadly useful. If the populations were different enough, then it might be necessary to identify the populations that contribute the samples. Projects already underway might be used to answer this question, or some pilot projects might be needed.

Number of populations

It was suggested that about 3 to 6 populations would be included in a haplotype map. The goal is to produce a tool that is broadly useful.

Samples from real populations

To obtain the most representative samples, it is important not to use samples of convenience, but to choose samples from real populations. The populations that contribute samples should be chosen based on the goals of the haplotype map, and the samples should be collected with appropriate population consultation and informed consent.

Common samples

Having a common set of samples that could be used by all research groups would allow comparisons among the results of different studies. Combining information across studies produces much more informative results than simply the sum of the results of separate studies.

SNP allele frequencies

Inclusion of SNPs spanning the range of SNP allele frequencies would be important. The length of associations among alleles may differ depending on the frequencies of the alleles. Also, a SNP allele provides the most power for an association study when its frequency matches that of a nearby allele contributing to a disease, and such alleles can be expected to span the range of frequencies.

A hierarchical approach for SNP density

It would be needlessly expensive to genotype all the individuals in a sample with a dense set of SNPs. A hierarchical approach makes more sense: start with a density of SNPs of perhaps one every 50 kb. For regions where such adjacent markers are strongly associated, these SNPs are sufficient and should be able to define blocks. For other regions, SNPs with a density of perhaps one every 10 kb could be examined, and so on until only regions with no block structure are left. Another type of hierarchical approach would be to start in the regions around genes.
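The hierarchical approach can be sketched as a recursive refinement. Everything named below is hypothetical: `strongly_associated` stands in for the genotyping and association measurement that would decide whether the markers bounding an interval are strongly associated, and the toy block boundaries and the factor-of-five step (50 kb, then 10 kb, then 2 kb) are assumptions made only for illustration.

```python
def hierarchical_typing(start_kb, end_kb, spacing_kb, strongly_associated,
                        min_spacing_kb=2, factor=5):
    """Split a region into intervals resolved at coarse or finer SNP spacing.

    Returns (resolved, unresolved): resolved holds (start, end, spacing) for
    intervals whose markers are strongly associated at that spacing, and
    unresolved holds (start, end) for intervals with no block structure
    even at the finest spacing tried.
    """
    resolved, unresolved = [], []
    pos = start_kb
    while pos < end_kb:
        nxt = min(pos + spacing_kb, end_kb)
        if strongly_associated(pos, nxt):
            resolved.append((pos, nxt, spacing_kb))
        elif spacing_kb // factor >= min_spacing_kb:
            # Markers are not associated: examine this interval more densely.
            finer = hierarchical_typing(pos, nxt, spacing_kb // factor,
                                        strongly_associated,
                                        min_spacing_kb, factor)
            resolved.extend(finer[0])
            unresolved.extend(finer[1])
        else:
            unresolved.append((pos, nxt))
        pos = nxt
    return resolved, unresolved


# Toy stand-in for real data: hypothetical block boundaries in kb.
blocks = [(0, 100), (100, 120), (130, 150)]


def in_one_block(start, end):
    return any(b0 <= start and end <= b1 for b0, b1 in blocks)


resolved, unresolved = hierarchical_typing(0, 150, 50, in_one_block)
print(resolved)
print(unresolved)
```

In this toy region the first 100 kb are settled at the 50 kb spacing, the smaller blocks need the 10 kb spacing, and the stretch from 120 to 130 kb is left as a region with no block structure.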

What would define the endpoint of the project?

The goal of a haplotype map is to have sufficient SNPs so that researchers doing association studies could be sure that regions containing disease alleles have been found, and that regions not containing disease alleles can be excluded from further consideration. The map could be considered complete when more SNPs provide no more information about block structure, or when all common SNPs are included in the map or are highly associated with ones included.

Methods

Many technical issues still need to be worked out: the method for determining haplotypes; the types of samples needed, such as single chromosomes, individuals or families with a certain number of children; the number of samples for each population; and quality measures. The processes for consulting with populations and obtaining informed consent need to be developed.

Data analysis

Dealing with thousands to millions of SNPs, haplotypes, and haplotype blocks requires the development of better statistical methods of analysis to delineate blocks and to associate them with diseases. Better analytical methods are needed to model and understand the chromosomal and population processes that lead to the block structure observed.

Open data-sharing policy

Just as was done for the sequence produced by the Human Genome Project, providing rapid and complete data release to appropriate public databases would allow maximal benefit to be gained from haplotype data by allowing all researchers quick access to the data.

Coordination of data producers

A haplotype project would need coordination among the data producers, both large and small. The project should be international and open to all interested researchers.

Process for Planning

International project

An international steering committee should be formed. So far there is interest from the United States, Canada, the United Kingdom, France, Germany and Japan.

Two working groups

Some issues could be considered by one group; others could be considered by both.

Population and ELSI Group: To identify the risks associated with a haplotype map project, including those associated with identifying populations; to consider how to minimize those risks; and to consider which types of populations should be considered for inclusion in a haplotype map.

Methods Group: To consider the types of samples needed and how to create a haplotype map.

Name of the project

The public would have a hard time understanding and supporting a project named anything like The Haplotype Linkage Disequilibrium Association Map. A more understandable name is needed, as well as better ways to explain the project. It will also be important to communicate clearly what is understood about the complex relationships among genetics, culture, race and ethnicity.

Haplotype mapping 20/3/03. By Richard Twyman

Haplotypes, groups of closely linked alleles that tend to be inherited together, can be used to map human disease genes very accurately.

All our chromosomes come in pairs, one in each pair inherited from each parent. While each chromosome of a pair contains the same genes in the same order, the sequences are not identical. For example, there are single nucleotide polymorphisms (SNPs) approximately every 1000 nucleotides. It is therefore possible to distinguish sequence variants that come from our mother and our father. These are termed maternal and paternal alleles.

The ability to distinguish between maternal and paternal alleles allows human disease genes to be mapped by linkage analysis. In germ cells, which produce eggs or sperm, the maternal and paternal chromosomes pair up and exchange segments of DNA, a process called recombination. After recombination, the chromosomes contain a mixture of alleles from each parent. Recombination will occur frequently between DNA sequences that are a long way apart but only rarely between sequences that are close together. Therefore, by measuring the frequency of recombination between the disease gene and other DNA sequences whose location is already known, the position of the disease gene can be established.
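The measurement described above can be sketched in a few lines of Python. This is only an illustration: the marker names and offspring counts are hypothetical, and the function simply divides the number of recombinant offspring by the total number scored. The marker with the lowest recombination frequency is taken to lie closest to the disease gene.

```python
def recombination_fraction(recombinant, total):
    """Fraction of scored offspring in which marker and disease gene were separated."""
    if total <= 0:
        raise ValueError("total must be positive")
    return recombinant / total


# Hypothetical counts: (recombinant offspring, total offspring scored) per marker.
marker_counts = {"M1": (18, 100), "M2": (4, 100), "M3": (31, 100)}

fractions = {
    name: recombination_fraction(rec, total)
    for name, (rec, total) in marker_counts.items()
}

# The marker that recombines least often with the disease gene is the closest one.
closest_marker = min(fractions, key=fractions.get)
print(closest_marker, fractions[closest_marker])
```

In practice, linkage analysis uses statistical scoring over many families rather than a single raw ratio, so this shows only the underlying idea.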

Figure: Haplotype mapping. A new mutation (X) arises in the proximity of six single nucleotide polymorphisms, with the ancestral haplotype signature TATCAT. Over several generations, the haplotype signature may be eroded by recombination. For example, contemporary haplotype 1 was produced by recombination between the first and second SNPs. The new alleles are shown in pink. However, the smallest conserved haplotype signature in all patients carrying the disease allele places the disease between SNPs 3 and 4. This technique provides a candidate region of about 10 000 bp, which is smaller than most human genes.

Because recombination only rarely separates sequences that lie close together, blocks of sequence on the same chromosome tend to be inherited together, a phenomenon known as linkage disequilibrium. Such groups of alleles, which are rarely separated by recombination, are known as haplotypes. In the human genome, haplotypes tend to be approximately 60 000 bp in size and, at roughly one SNP per 1000 nucleotides, contain up to 60 SNPs that travel as a group.
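Linkage disequilibrium between two SNPs is commonly quantified with the coefficient D and the squared correlation r². These measures are not named in this article, so the sketch below is an assumption about how the idea might be put into numbers. It assumes biallelic SNPs and phased haplotypes written as strings, and the sample haplotypes are hypothetical.

```python
def ld_stats(haplotypes, i, j):
    """Return (D, r2) for SNP positions i and j across phased haplotype strings.

    Assumes each SNP has exactly two alleles; the alleles seen on the first
    haplotype are used as the reference alleles.
    """
    n = len(haplotypes)
    a, b = haplotypes[0][i], haplotypes[0][j]
    p_a = sum(h[i] == a for h in haplotypes) / n
    p_b = sum(h[j] == b for h in haplotypes) / n
    p_ab = sum(h[i] == a and h[j] == b for h in haplotypes) / n
    d = p_ab - p_a * p_b  # zero when the two alleles associate at random
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else float("nan")
    return d, r2


# Hypothetical two-SNP haplotypes.
linked = ["AG", "AG", "CT", "CT"]       # alleles always travel together
shuffled = ["AG", "AT", "CG", "CT"]     # every combination equally common

print(ld_stats(linked, 0, 1))
print(ld_stats(shuffled, 0, 1))
```

An r² close to 1 means the two SNPs are almost always inherited together, which is the pattern expected within a haplotype.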

Haplotypes can be exploited for the fine mapping of disease genes. The principle of haplotype mapping is shown in the Figure. A new mutation responsible for a genetic disease always enters the population within an existing haplotype, which is termed the ancestral haplotype.

Over several generations, recombination events may occur within the haplotype but the disease allele and the closest SNPs still tend to be inherited as a group. If this haplotype can be identified in a group of patients with the disease, typing the alleles within the haplotype allows a conserved region to be identified, which pinpoints the mutation responsible for the disease. Due to the abundance of SNPs, this technique has the potential to map genes very accurately. There is therefore much interest in developing a haplotype map of the entire human genome.
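The search for a conserved region can be illustrated with a short Python sketch. The ancestral signature TATCAT comes from the Figure, and the first patient haplotype follows its description of a recombination between the first and second SNPs. The other patient haplotypes and the function name are hypothetical, chosen only to reproduce a conserved region spanning SNPs 3 and 4.

```python
def conserved_interval(ancestral, patient_haplotypes):
    """Return (first, last) 1-based SNP indices of the longest run of SNPs at
    which every patient haplotype still carries the ancestral allele, or None
    if no SNP is shared by all patients."""
    shared = [
        all(h[i] == allele for h in patient_haplotypes)
        for i, allele in enumerate(ancestral)
    ]
    best = None
    start = None
    # A trailing False closes any run that reaches the last SNP.
    for i, ok in enumerate(shared + [False]):
        if ok and start is None:
            start = i
        elif not ok and start is not None:
            if best is None or (i - start) > (best[1] - best[0] + 1):
                best = (start + 1, i)
            start = None
    return best


ancestral = "TATCAT"
# Hypothetical haplotypes from patients carrying the disease allele.
patients = ["CATCAT", "TATCGC", "CGTCAC"]

print(conserved_interval(ancestral, patients))
```

With these inputs, only SNPs 3 and 4 keep the ancestral alleles in every patient, so the mutation is placed between them.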
