Efficient Algorithms for Genome-wide TagSNP Selection across Populations via the Linkage Disequilibrium Criterion
Authors: Lan Liu, Yonghui Wu, Stefano Lonardi and Tao Jiang
Outline
Introduction
The MCTS Model
Our Algorithms
Experimental Results
Motivation
With the rapid development of genotyping technologies, the dbSNP database now contains more than 10 million verified single-nucleotide polymorphisms (SNPs).
We aim to select a subset of informative SNPs (i.e., tagSNPs) in order to:
  save the cost of genotyping all SNPs;
  perform disease association mapping.
TagSNP Selection
Haplotype-based methods: require the information of the phased multilocus haplotypes.
Haplotype-free methods: do not require haplotype information.
TagSNP selection via r^2 linkage disequilibrium statistics.
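r^2 Linkage Disequilibrium Statistics
Given a pair of genetic markers 1 and 2, with alleles A/a at marker 1 and B/b at marker 2, the haplotype frequencies and their marginals are:

                      marker 2
                      B        b
  marker 1    A     p_AB     p_Ab     p_A.
              a     p_aB     p_ab     p_a.
                    p_.B     p_.b

The r^2 statistic:

$$ r^2 = \frac{(p_{AB} - p_{A\cdot}\, p_{\cdot B})^2}{p_{A\cdot}(1 - p_{A\cdot})\; p_{\cdot B}(1 - p_{\cdot B})} $$

If r^2 is no less than a given threshold r0, marker 1 (or marker 2) can tag marker 2 (or marker 1, respectively).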
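As a minimal sketch of the r^2 statistic defined on this slide, the function below computes r^2 from the four haplotype frequencies of a marker pair. The function names and the example frequencies are illustrative assumptions, not taken from the talk.

```python
def r_squared(p_AB, p_Ab, p_aB, p_ab):
    """r^2 between two biallelic markers from their four haplotype frequencies."""
    total = p_AB + p_Ab + p_aB + p_ab
    if abs(total - 1.0) > 1e-9:
        raise ValueError("haplotype frequencies must sum to 1")
    p_A = p_AB + p_Ab  # marginal frequency of allele A at marker 1 (p_A.)
    p_B = p_AB + p_aB  # marginal frequency of allele B at marker 2 (p_.B)
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    if denom == 0:
        raise ValueError("r^2 is undefined when a marker is monomorphic")
    return (p_AB - p_A * p_B) ** 2 / denom


def can_tag(r2, r0):
    """Two markers can tag each other when r^2 is no less than the threshold r0."""
    return r2 >= r0


# Hypothetical frequencies, chosen only for illustration.
example = r_squared(0.4, 0.1, 0.1, 0.4)
```

With these hypothetical frequencies both marginals are 0.5, so r^2 = (0.4 - 0.25)^2 / 0.0625 = 0.36, which would not pass a threshold of r0 = 0.5.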
The TagSNP Selection Problem
Instance: a set V of SNP markers and LD patterns E = {(v_j1, v_j2) | v_j1 and v_j2 are in V, and r^2(v_j1, v_j2) is no less than a given threshold r0}.
Feasible solution: a subset V' of V such that, given any v in V, there exists a v' in V' where r^2(v, v') is no less than r0.
Objective: minimize |V'|.
If we define G = (V, E), a tagSNP set is equivalent to a dominating set on G.
[Figure: (a) SNP markers 1-6 and their LD patterns in a population; (b) tagSNPs for the population.]
This model was introduced by Carlson et al. (2004). It is a simple and popular tagging method.
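The equivalence with dominating sets can be sketched as follows: build the LD graph by thresholding pairwise r^2 values, then check whether a candidate set tags every marker. The data layout (a dict keyed by marker pairs), the helper names and the r^2 values are assumptions made for illustration.

```python
def build_ld_graph(markers, r2, r0):
    """Map each marker to the set of markers it can tag, itself included."""
    nbrs = {v: {v} for v in markers}
    for (u, v), value in r2.items():
        if value >= r0:  # the pair is an edge of G when r^2 >= r0
            nbrs[u].add(v)
            nbrs[v].add(u)
    return nbrs


def is_tag_set(nbrs, tags):
    """True if every marker is a tagSNP or in LD with one (a dominating set)."""
    tagged = set()
    for t in tags:
        tagged |= nbrs[t]
    return tagged == set(nbrs)


# Hypothetical pairwise r^2 values for four markers.
r2_values = {(1, 2): 0.9, (2, 3): 0.85, (3, 4): 0.3}
graph = build_ld_graph([1, 2, 3, 4], r2_values, r0=0.8)
```

Here marker 2 tags markers 1 and 3, while marker 4 has no LD partner at this threshold, so `{2, 4}` is a valid tagSNP set and `{2}` alone is not.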
r^2 Statistics in Single and Admixed Populations
SNP 1: alleles A, a. SNP 2: alleles B, b.

Population 1 (r^2 = 0)
            B         b
  A      0.9025    0.0475    0.95
  a      0.0475    0.0025    0.05
         0.95      0.05

Population 2 (r^2 = 0)
            B         b
  A      0.0025    0.0475    0.05
  a      0.0475    0.9025    0.95
         0.05      0.95

Admixed population: 50% population 1, 50% population 2 (r^2 = 0.6561)
            B         b
  A      0.4525    0.0475    0.5
  a      0.0475    0.4525    0.5
         0.5       0.5
TagSNP Selection across Populations
A pair of SNPs that have remarkably different marker frequencies and very weak LD in two populations with different evolutionary histories may show strong LD in the admixed population.
TagSNPs picked from the admixed population or from one of the populations might not be sufficient to capture the variations in all populations.
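The admixture effect can be reproduced numerically. The sketch below uses the haplotype frequencies from the two-population example in this talk and mixes them 50/50; the function names and the dict layout are assumptions.

```python
def r_squared(hap):
    """r^2 from haplotype frequencies keyed by 'AB', 'Ab', 'aB', 'ab'."""
    p_A = hap["AB"] + hap["Ab"]  # frequency of allele A at SNP 1
    p_B = hap["AB"] + hap["aB"]  # frequency of allele B at SNP 2
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    return (hap["AB"] - p_A * p_B) ** 2 / denom


def admix(pops, weights):
    """Haplotype frequencies of a mixture of populations with the given weights."""
    return {h: sum(w * p[h] for p, w in zip(pops, weights)) for h in pops[0]}


pop1 = {"AB": 0.9025, "Ab": 0.0475, "aB": 0.0475, "ab": 0.0025}
pop2 = {"AB": 0.0025, "Ab": 0.0475, "aB": 0.0475, "ab": 0.9025}
mixed = admix([pop1, pop2], [0.5, 0.5])
```

Each population alone gives r^2 = 0 (up to floating-point rounding), while the mixture gives (0.4525 - 0.25)^2 / 0.0625 = 0.6561, so a tagSNP chosen from the admixed sample would be relying on LD that exists in neither population.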
The MCTS Model
Given a set of SNP markers and their LD patterns in multiple populations, we want to find a minimum common tagSNP set, i.e., a smallest set of markers that is a valid tagSNP set in each of the populations.
This problem is called the minimum common tagSNP selection (MCTS) problem.
[Figure: (a) SNP markers 1-6 and their LD patterns in two populations (Population 1, Population 2); (b) the minimum tagSNP set for these two populations.]
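To make the MCTS objective concrete, here is a brute-force sketch that is only practical for tiny instances. Each population is assumed to be a dict mapping a marker to the set of markers it is in LD with; the function names and the toy LD patterns are hypothetical and are not the ones in the figure.

```python
from itertools import combinations


def is_common_tag_set(populations, tags):
    """True if `tags` tags every marker occurrence in every population."""
    for nbrs in populations:
        tagged = set()
        for t in tags:
            if t in nbrs:  # a tagSNP only helps where it occurs
                tagged |= nbrs[t] | {t}
        if not set(nbrs) <= tagged:
            return False
    return True


def min_common_tag_set(populations):
    """Smallest common tagSNP set, found by exhaustive search (exponential time)."""
    markers = sorted(set().union(*populations))
    for size in range(len(markers) + 1):
        for cand in combinations(markers, size):
            if is_common_tag_set(populations, cand):
                return set(cand)


# Hypothetical LD patterns: 1-2-3 is a path in population 1;
# in population 2 only markers 1 and 3 are in LD.
pop1 = {1: {2}, 2: {1, 3}, 3: {2}}
pop2 = {1: {3}, 2: set(), 3: {1}}
best = min_common_tag_set([pop1, pop2])
```

In this toy instance marker 2 alone suffices for population 1, but it tags nothing else in population 2, so the common set needs a second marker.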
Our Algorithms
The MCTS problem can be easily formulated as an integer linear program.
We first apply some data reduction rules, then use one of the following algorithms:
  a greedy algorithm: GreedyTag;
  a Lagrangian relaxation algorithm: LRTag.
We calculate:
  the upper bound: the number of tagSNPs obtained by our algorithms;
  the lower bound: a bound on the minimum number of tagSNPs needed (GreedyTag_lb, LRTag_lb).
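One way to write the integer linear program is the standard set-cover style formulation below. This is a sketch, not necessarily the exact program used in the talk. The notation is an assumption: V_i is the set of markers occurring in population i, N_i[v] is the set containing v and every marker u with r^2(u, v) no less than r0 in population i, and x_v = 1 means marker v is picked as a tagSNP.

```latex
\begin{aligned}
\min \quad & \sum_{v \in V} x_v \\
\text{s.t.} \quad & \sum_{u \in N_i[v]} x_u \ge 1
  && \text{for every population } i \text{ and every marker } v \in V_i, \\
& x_v \in \{0, 1\} && \text{for every } v \in V.
\end{aligned}
```

Each constraint corresponds to one occurrence of a marker in one population. Any feasible solution gives an upper bound on the optimum, and relaxing the constraints (for example with Lagrangian multipliers) gives a lower bound.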
Data Reduction Rules
Pick all irreplaceable markers. Example: marker 7.
Remove less informative markers. Example: among markers 1, 2 and 6, remove markers 1 and 2.
Remove less stringent occurrences. Example: between the occurrences of markers 4 and 5 in population 2, remove the occurrence of marker 4.
[Figures: SNP markers 1-7 and their LD patterns in Population 1 and Population 2, one panel per rule.]
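The slides do not define the rules formally. As a hedged sketch of one natural reading of the first rule, a marker is treated as irreplaceable when, in some population, no other marker is in LD with it, so only the marker itself can tag that occurrence. The data layout (marker mapped to its set of LD partners), the function name and the toy data are assumptions.

```python
def irreplaceable_markers(populations):
    """Markers that have an occurrence which no other marker can tag."""
    forced = set()
    for nbrs in populations:
        for v, partners in nbrs.items():
            if not (partners - {v}):  # no LD partner in this population
                forced.add(v)
    return forced


# Hypothetical LD patterns: marker 7 has no LD partner in population 1.
pop1 = {5: {6}, 6: {5}, 7: set()}
pop2 = {5: {6}, 6: {5, 7}, 7: {6}}
forced = irreplaceable_markers([pop1, pop2])
```

Such markers must appear in every feasible solution, so picking them up front shrinks the instance before GreedyTag or LRTag runs. The other two rules would need a precise definition of "less informative" and "less stringent", which the slides do not give.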
A Greedy Algorithm
Apply data reduction rules.
While there is an un-tagged occurrence: pick the marker which tags the most of the remaining occurrences as a tagSNP.
When no un-tagged occurrence remains: output the tagSNPs.
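The greedy loop can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' GreedyTag implementation: the data reduction step is omitted, ties are broken by marker order, and an occurrence is modeled as a (population index, marker) pair. Each population is assumed to be a dict from a marker to its set of LD partners.

```python
def greedy_tag(populations):
    """Repeatedly pick the marker that tags the most un-tagged occurrences."""
    covers = {}  # marker -> set of (population index, marker) occurrences it tags
    for i, nbrs in enumerate(populations):
        for v, partners in nbrs.items():
            covers.setdefault(v, set()).add((i, v))  # a marker tags itself
            for u in partners:
                covers[v].add((i, u))
    untagged = {(i, v) for i, nbrs in enumerate(populations) for v in nbrs}
    tags = []
    while untagged:
        # sorted() makes tie-breaking deterministic (smallest marker first)
        best = max(sorted(covers), key=lambda v: len(covers[v] & untagged))
        tags.append(best)
        untagged -= covers[best]
    return tags


# Hypothetical LD patterns in two populations.
pop1 = {1: {2}, 2: {1, 3}, 3: {2}}
pop2 = {1: {3}, 2: set(), 3: {1}}
picked = greedy_tag([pop1, pop2])
```

The loop always terminates because every un-tagged occurrence is tagged by at least its own marker, so each iteration removes at least one occurrence.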
A Lagrangian Relaxation Algorithm
Introduce the Lagrangian multipliers λ.
Obtain the relaxed integer program.
Initialize λ.
Obtain the tagSNP set based on λ.
iteration := 0
While iteration++ < max_iter:
  Update λ towards the subgradient direction.
  Update the tagSNP set based on λ.
Output the tagSNPs.
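The flow above can be illustrated with a generic subgradient scheme for a covering program. This is a sketch under assumptions and not the authors' LRTag: the multipliers start at zero, the step size shrinks as step / (iteration + 1), and a greedy repair plus pruning heuristic turns the relaxed solution into a feasible tagSNP set. All names and the toy data are hypothetical.

```python
def _repair(covers, occurrences, chosen):
    """Greedily extend `chosen` to a feasible tagSNP set, then drop redundant tags."""
    tags = set(chosen)
    untagged = set(occurrences)
    for v in tags:
        untagged -= covers[v]
    while untagged:
        best = max(sorted(covers), key=lambda v: len(covers[v] & untagged))
        tags.add(best)
        untagged -= covers[best]
    for v in sorted(tags):
        others = set().union(*(covers[u] for u in tags if u != v))
        if set(occurrences) <= others:  # v is redundant given the other tags
            tags.discard(v)
    return tags


def lr_tag(populations, max_iter=100, step=1.0):
    """Return (tagSNP set, Lagrangian lower bound on the minimum number of tagSNPs)."""
    covers = {}  # marker -> set of (population index, marker) occurrences it tags
    for i, nbrs in enumerate(populations):
        for v, partners in nbrs.items():
            covers.setdefault(v, set()).add((i, v))
            for u in partners:
                covers[v].add((i, u))
    occurrences = sorted({o for occ in covers.values() for o in occ})
    lam = {o: 0.0 for o in occurrences}  # one multiplier per covering constraint
    best_tags, best_lb = None, 0.0
    for it in range(max_iter):
        # Reduced cost of a marker: its cost 1 minus the multipliers it collects.
        reduced = {v: 1.0 - sum(lam[o] for o in covers[v]) for v in covers}
        chosen = {v for v, c in reduced.items() if c < 0}
        # Value of the relaxed program: a valid lower bound for any lam >= 0.
        lb = sum(lam.values()) + sum(c for c in reduced.values() if c < 0)
        best_lb = max(best_lb, lb)
        tags = _repair(covers, occurrences, chosen)
        if best_tags is None or len(tags) < len(best_tags):
            best_tags = tags
        for o in occurrences:
            # Subgradient component: 1 minus how often the relaxed solution tags o.
            g = 1 - sum(1 for v in chosen if o in covers[v])
            lam[o] = max(0.0, lam[o] + step / (it + 1) * g)
    return best_tags, best_lb


# Hypothetical LD patterns in two populations.
pop1 = {1: {2}, 2: {1, 3}, 3: {2}}
pop2 = {1: {3}, 2: set(), 3: {1}}
tags, lower_bound = lr_tag([pop1, pop2])
```

The size of the returned set plays the role of the upper bound and `lower_bound` the role of the lower bound, so the gap between the two indicates how far from optimal the solution can be.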
Experimental Results
We apply our algorithms to real HapMap data (release #19, NCBI build 34, October 2005).
There are four populations in the HapMap data:
  CEU: European descendants.
  CHB: Chinese, Beijing.
  JPT: Japanese, Tokyo.
  YRI: Yoruba people of Ibadan, Nigeria.
We get tagSNPs for the following two datasets:
  ENCODE regions: all 10 ENCODE regions, 10,859 markers.
  Human genome: chromosomes 1-22, 2,862,454 markers.
Experimental Results for ENCODE Regions
We compare our GreedyTag and LRTag with MultiPop-TagSelect (MPS).
MultiPop-TagSelect first generates the tagSNPs for each single population, then combines the obtained tagSNPs for multiple populations.
The gap between LRTag_lb and LRTag:
  r^2 = 0.5: at most two for each region, six in total for all regions.
  r^2 = 0.8: there is no gap.
Experimental Results for Human Genome
The numbers of tagSNPs selected by our algorithms are almost optimal.
The gap between LRTag_lb and LRTag for the whole genome (2,862,454 SNPs in total):
  r^2 = 0.5: 1,061
  r^2 = 0.8: 142
Running Time of Our Algorithms
Running environment: a 32-processor SGI Altix 4700 supercomputer system with 1.6 GHz CPUs and 64 GB of shared memory, running 15 threads in parallel.
Running time for r^2 = 0.5:
  ENCODE regions: < 7 seconds for each region, < 1 minute for all regions.
  Human genome: < 12 minutes for each chromosome, < 1 hour for the genome.
For r^2 > 0.5, our algorithms run faster than the times above.
Thanks for your time and attention!