Efficient Algorithms for Genome-wide TagSNP Selection across Populations via the Linkage Disequilibrium Criterion

Authors: Lan Liu, Yonghui Wu, Stefano Lonardi and Tao Jiang

Outline

- Introduction
- The MCTS Model
- Our Algorithms
- Experimental Results

Motivation

With the rapid development of genotyping technologies, the dbSNP database now holds more than 10 million verified single-nucleotide polymorphisms (SNPs).

We aim to select a subset of informative SNPs (i.e., tagSNPs) in order to:

- save the cost of genotyping all SNPs;
- perform disease association mapping.

TagSNP Selection

- Haplotype-based methods: require the information of the phased multilocus haplotypes.
- Haplotype-free methods: do not require haplotype information. Example: tagSNP selection via the r2 linkage disequilibrium (LD) statistic.

r2 Linkage Disequilibrium Statistics

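Given a pair of genetic markers, marker 1 with alleles A/a and marker 2 with alleles B/b, the haplotype frequencies and their marginal allele frequencies are:

                  marker 2
                  B       b
marker 1   A      pAB     pAb     pA.
           a      paB     pab     pa.
                  p.B     p.b

The r2 statistic is

$$ r^2 = \frac{(p_{AB} - p_{A\cdot}\, p_{\cdot B})^2}{p_{A\cdot}(1 - p_{A\cdot})\, p_{\cdot B}(1 - p_{\cdot B})} $$

If r2 is no less than a given threshold r0, marker 1 can tag marker 2, and marker 2 can tag marker 1.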

The TagSNP Selection Problem

- Instance: a set V of SNP markers and LD patterns E = {(vj1, vj2) | vj1 and vj2 are in V, and r2(vj1, vj2) is no less than a given threshold r0}.
- Feasible solution: a subset V' of V such that for every v in V there exists a v' in V' with r2(v, v') no less than r0.
- Objective: minimize |V'|.

If we define the graph G = (V, E), a tagSNP set is equivalent to a dominating set on G.

[Figure: (a) SNP markers 1-6 and their LD patterns in a population; (b) tagSNPs for the population.]

This model was introduced by Carlson et al. (2004). It is a simple and popular tagging method.
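The equivalence with dominating sets can be checked mechanically. The sketch below is a minimal illustration, not code from the talk: the helper names `build_graph` and `is_tag_set` and the six-marker LD pattern are assumptions made up for the example (the pattern is not the one in the figure).

```python
from typing import Dict, Iterable, Set, Tuple


def build_graph(markers: Iterable[int],
                edges: Iterable[Tuple[int, int]]) -> Dict[int, Set[int]]:
    """Closed neighbourhoods: each marker tags itself and its LD partners."""
    nbrs: Dict[int, Set[int]] = {v: {v} for v in markers}
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    return nbrs


def is_tag_set(candidate: Set[int], nbrs: Dict[int, Set[int]]) -> bool:
    """True if every marker is in the candidate set or in LD with a member,
    i.e. the candidate set is a dominating set of G."""
    return all(nbrs[v] & candidate for v in nbrs)


# Hypothetical LD pattern: an edge means r^2 >= r0 for that pair.
markers = range(1, 7)
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (5, 6)]
G = build_graph(markers, edges)

print(is_tag_set({3, 5}, G))  # markers 3 and 5 together tag everything
print(is_tag_set({3}, G))     # marker 3 alone leaves markers 5 and 6 untagged
```

Storing closed neighbourhoods (each marker included in its own set) keeps the feasibility test to a single set intersection per marker.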



r2 Statistics in Single and Admixed Populations

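SNP 1 has alleles A and a; SNP 2 has alleles B and b. Each table lists the haplotype frequencies, with the allele frequencies in the margins.

Population 1 (r2 = 0):

         B        b
A        0.9025   0.0475   0.95
a        0.0475   0.0025   0.05
         0.95     0.05

Population 2 (r2 = 0):

         B        b
A        0.0025   0.0475   0.05
a        0.0475   0.9025   0.95
         0.05     0.95

Admixed population, 50% population 1 and 50% population 2 (r2 = 0.6561):

         B        b
A        0.4525   0.0475   0.5
a        0.0475   0.4525   0.5
         0.5      0.5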
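The three r2 values can be reproduced from the frequencies on this slide. The sketch below is only an illustration: the function name `r_squared` and the tuple layout (pAB, pA., p.B) are assumptions, and mixing the populations 50/50 is modelled by averaging their frequencies.

```python
def r_squared(p_AB: float, p_A: float, p_B: float) -> float:
    """r^2 from the A-B haplotype frequency and the two allele frequencies."""
    d = p_AB - p_A * p_B
    return d * d / (p_A * (1 - p_A) * p_B * (1 - p_B))


# (pAB, pA., p.B) for each population, taken from the slide.
pop1 = (0.9025, 0.95, 0.95)
pop2 = (0.0025, 0.05, 0.05)

# A 50/50 admixture averages every frequency of the two populations.
admixed = tuple(0.5 * a + 0.5 * b for a, b in zip(pop1, pop2))

for name, freqs in [("population 1", pop1),
                    ("population 2", pop2),
                    ("admixed", admixed)]:
    print(f"{name}: r2 = {r_squared(*freqs):.4f}")
```

Within each population the haplotype frequency equals the product of the allele frequencies, so r2 vanishes (up to rounding); in the mixture the allele frequencies become 0.5 while pAB stays at 0.4525, which gives the 0.6561 stated on the slide.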

TagSNP Selection across Populations

A pair of SNPs that has remarkably different marker frequencies and very weak LD in two populations with different evolutionary histories may show strong LD in the admixed population.

TagSNPs picked from the admixed population, or from only one of the populations, might therefore not be sufficient to capture the variation in all populations.

The MCTS Model

Given a set of SNP markers and their LD patterns in multiple populations, we want to find a minimum set of tagSNPs that is a valid tagSNP set in each of the populations.

This problem is called the minimum common tagSNP selection (MCTS) problem.

[Figure: (a) SNP markers 1-6 and their LD patterns in two populations (Population 1, Population 2); (b) the minimum tagSNP set for these two populations.]
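For very small instances, the MCTS definition can be made concrete by exhaustive search. The sketch below is an assumption-laden illustration rather than the authors' method: the names `tags_all` and `brute_force_mcts` are invented, the two four-marker populations are hypothetical, and the search is exponential in the number of markers.

```python
from itertools import combinations
from typing import Dict, List, Set

# marker -> other markers in LD with it (r^2 >= r0) in that population
Population = Dict[int, Set[int]]


def tags_all(tags: Set[int], populations: List[Population]) -> bool:
    """True if, in every population, each marker is a tag or in LD with one."""
    return all((nbrs | {v}) & tags
               for pop in populations for v, nbrs in pop.items())


def brute_force_mcts(populations: List[Population]) -> Set[int]:
    """Smallest common tag set by exhaustive search (tiny instances only)."""
    markers = sorted(set().union(*populations))
    for k in range(1, len(markers) + 1):
        for combo in combinations(markers, k):
            if tags_all(set(combo), populations):
                return set(combo)
    return set(markers)


# Hypothetical LD patterns (not the ones in the figure).
pop1 = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}    # path 1-2-3-4
pop2 = {1: {2}, 2: {1}, 3: set(), 4: set()}      # only 1-2 in LD

print(tags_all({1, 3}, [pop1]))          # enough for population 1 alone
print(tags_all({1, 3}, [pop1, pop2]))    # not enough once population 2 is added
print(brute_force_mcts([pop1, pop2]))
```

The example shows the point of the model: a set that is optimal for one population can fail in another, so the common set has to be chosen against all populations at once.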



Our Algorithms

The MCTS problem can be easily formulated as an integer linear program.
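The slides do not show the program itself. One natural set-cover style formulation, given here as an assumption, uses a binary variable x_j for each marker j (x_j = 1 if j is chosen as a tagSNP). For population k with marker set V_k, let N_k(i) be the set containing marker i and every marker j with r2(i, j) no less than r0 in population k. The symbols x_j, V_k and N_k(i) are introduced for this sketch only.

```latex
\begin{aligned}
\min \quad & \sum_{j \in V} x_j \\
\text{s.t.} \quad & \sum_{j \in N_k(i)} x_j \;\ge\; 1
    && \text{for every population } k \text{ and every marker } i \in V_k, \\
& x_j \in \{0, 1\} && \text{for every marker } j \in V .
\end{aligned}
```

Each constraint corresponds to one occurrence of a marker in one population and states that at least one chosen tagSNP tags that occurrence.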

We first apply some data reduction rules, then use one of the following algorithms:

- a greedy algorithm: GreedyTag;
- a Lagrangian relaxation algorithm: LRTag.

For each algorithm we calculate:

- the upper bound: the number of tagSNPs obtained by our algorithms (GreedyTag, LRTag);
- the lower bound: a lower bound on the minimum number of tagSNPs needed (GreedyTag_lb, LRTag_lb).

Data Reduction Rules

- Pick all irreplaceable markers. Example: marker 7.
- Remove less informative markers. Example: among markers 1, 2 and 6, remove markers 1 and 2.
- Remove less stringent occurrences. Example: between the occurrences of markers 4 and 5 in population 2, remove the occurrence of marker 4.

[Figures: LD patterns among markers 1-7 illustrating each rule.]
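The first rule can be sketched in a few lines. The slides do not define "irreplaceable", so the code below assumes one plausible reading: a marker is irreplaceable if, in at least one population, no other marker is in LD with it, so only the marker itself can tag that occurrence. The function name and the toy data are hypothetical.

```python
from typing import Dict, List, Set

# marker -> other markers in LD with it (r^2 >= r0) in that population
Population = Dict[int, Set[int]]


def irreplaceable_markers(populations: List[Population]) -> Set[int]:
    """Markers that, in at least one population, no other marker can tag.

    ASSUMPTION: this reading of "irreplaceable" is ours, not the slides'.
    """
    return {v
            for pop in populations
            for v, nbrs in pop.items()
            if not (nbrs - {v})}


# Hypothetical data: marker 7 has no LD partner in population 1,
# so it must be chosen even though marker 1 tags it in population 2.
pop1 = {1: {2}, 2: {1}, 7: set()}
pop2 = {1: {2, 7}, 2: {1}, 7: {1}}

forced = irreplaceable_markers([pop1, pop2])
print(forced)
```

Picking such markers up front shrinks the instance before the greedy or Lagrangian step runs, since every occurrence they tag can be dropped from further consideration.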

A Greedy Algorithm

1. Apply the data reduction rules.
2. If there is an un-tagged occurrence, pick the marker that tags the most of the remaining occurrences as a tagSNP, then repeat this step.
3. Otherwise, output the tagSNPs.
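The greedy loop can be sketched as follows. This is a minimal illustration under stated assumptions, not the GreedyTag implementation: it skips the data reduction step, treats every (population, marker) pair as an occurrence, breaks ties by the smallest marker number, and uses hypothetical names and data throughout.

```python
from typing import Dict, List, Set, Tuple

# marker -> other markers in LD with it (r^2 >= r0) in that population
Population = Dict[int, Set[int]]
Occurrence = Tuple[int, int]  # (population index, marker)


def greedy_tag(populations: List[Population]) -> Set[int]:
    """Repeatedly pick the marker that tags the most un-tagged occurrences."""
    # covers[m]: occurrences that marker m would tag if it were chosen.
    covers: Dict[int, Set[Occurrence]] = {}
    for k, pop in enumerate(populations):
        for v, nbrs in pop.items():
            for m in nbrs | {v}:
                covers.setdefault(m, set()).add((k, v))

    untagged: Set[Occurrence] = {(k, v)
                                 for k, pop in enumerate(populations)
                                 for v in pop}
    tags: Set[int] = set()
    while untagged:
        # Ties go to the smallest marker number (an arbitrary choice).
        best = max(sorted(covers), key=lambda m: len(covers[m] & untagged))
        tags.add(best)
        untagged -= covers[best]
    return tags


# Hypothetical LD patterns for two populations.
pop1 = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
pop2 = {1: {2}, 2: {1}, 3: set(), 4: set()}

print(greedy_tag([pop1, pop2]))
```

The loop always terminates because each occurrence is tagged by its own marker, so the best candidate tags at least one remaining occurrence in every round.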

A Lagrangian Relaxation Algorithm

1. Introduce the Lagrangian multipliers λ and obtain the relaxed integer program.
2. Initialize λ and obtain the tagSNP set based on λ.
3. Set iteration := 0.
4. While iteration++ < max_iter: update λ towards the subgradient direction, then update the tagSNP set based on λ.
5. Output the tagSNPs.
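The flow above follows the usual subgradient scheme for covering problems. The sketch below is a generic textbook version and should not be read as LRTag itself: relaxing one covering constraint per occurrence, the diminishing step size `step / (it + 1)`, the greedy repair of the relaxed solution, and all names and data are assumptions.

```python
from typing import Dict, List, Set, Tuple

Population = Dict[int, Set[int]]  # marker -> other markers with r^2 >= r0
Occurrence = Tuple[int, int]      # (population index, marker)


def lagrangian_tag(populations: List[Population],
                   max_iter: int = 100,
                   step: float = 1.0) -> Tuple[Set[int], float]:
    """Return (feasible tag set, Lagrangian lower bound on the optimum)."""
    # taggers[o]: markers able to tag occurrence o (itself plus LD partners).
    taggers: Dict[Occurrence, Set[int]] = {
        (k, v): nbrs | {v}
        for k, pop in enumerate(populations) for v, nbrs in pop.items()}
    markers = sorted({m for ms in taggers.values() for m in ms})
    lam = {o: 0.0 for o in taggers}          # one multiplier per occurrence
    best_tags, best_lb = set(markers), 0.0

    for it in range(max_iter):
        # Relaxed problem: choose marker m iff its reduced cost is negative.
        cost = {m: 1.0 for m in markers}
        for o, ms in taggers.items():
            for m in ms:
                cost[m] -= lam[o]
        chosen = {m for m in markers if cost[m] < 0}
        best_lb = max(best_lb,
                      sum(lam.values()) + sum(cost[m] for m in chosen))

        # Repair: greedily add markers until every occurrence is tagged.
        tags = set(chosen)
        untagged = {o for o, ms in taggers.items() if not ms & tags}
        while untagged:
            m = max(markers,
                    key=lambda c: sum(1 for o in untagged if c in taggers[o]))
            tags.add(m)
            untagged = {o for o in untagged if m not in taggers[o]}
        if len(tags) < len(best_tags):
            best_tags = tags

        # Subgradient step: raise lam for untagged occurrences, lower it for
        # over-tagged ones, and keep every multiplier non-negative.
        t = step / (it + 1)
        for o, ms in taggers.items():
            lam[o] = max(0.0, lam[o] + t * (1 - len(ms & chosen)))
    return best_tags, best_lb


# Hypothetical LD patterns for two populations.
pop1 = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
pop2 = {1: {2}, 2: {1}, 3: set(), 4: set()}
tags, lower_bound = lagrangian_tag([pop1, pop2])
print(sorted(tags), lower_bound)
```

For any non-negative λ, the relaxed optimum is a valid lower bound on the minimum number of tagSNPs, while the repaired set gives an upper bound; the difference between the two is the kind of gap reported for LRTag_lb and LRTag.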



Experimental Results

We apply our algorithms to real HapMap data (release #19, NCBI build 34, October 2005). There are four populations in the HapMap data:

- CEU: European descendants.
- CHB: Chinese, Beijing.
- JPT: Japanese, Tokyo.
- YRI: Yoruba people of Ibadan, Nigeria.

We select tagSNPs for the following two datasets:

- ENCODE regions: all 10 ENCODE regions, 10,859 markers.
- Human genome: chromosomes 1-22, 2,862,454 markers.

Experimental Results for ENCODE Regions

We compare our GreedyTag and LRTag with MultiPop-TagSelect (MPS). MultiPop-TagSelect first generates the tagSNPs for each single population, then combines the obtained tagSNPs for multiple populations.

The gap between LRTag_lb and LRTag:

- r2 = 0.5: at most two for each region, six in total for all regions.
- r2 = 0.8: there is no gap.

Experimental Results for Human Genome

The numbers of tagSNPs selected by our algorithms are almost optimal.

The gap between LRTag_lb and LRTag for the whole genome (2,862,454 SNPs in total):

- r2 = 0.5: 1,061.
- r2 = 0.8: 142.

Running Time of Our Algorithms

Running environment:

- a 32-processor SGI Altix 4700 supercomputer system;
- 1.6 GHz CPUs;
- 64 GB shared memory;
- 15 threads in parallel.

Running time for r2 = 0.5:

- ENCODE regions: < 7 seconds for each region, < 1 minute for all regions.
- Human genome: < 12 minutes for each chromosome, < 1 hour for the genome.

For r2 > 0.5, our algorithms run faster than the times above.



Thanks for your time and attention!
