Efficient Algorithms for Genome-wide TagSNP Selection across Populations via the Linkage Disequilibrium Criterion

Authors: Lan Liu, Yonghui Wu, Stefano Lonardi and Tao Jiang

Outline

Introduction
The MCTS Model
Our Algorithms
Experimental Result

Motivation

With the rapid development of genotyping technologies, there are more than 10 million verified single-nucleotide polymorphisms (SNPs) in the dbSNP database.

We aim to select a subset of informative SNPs (i.e., tagSNPs) in order to
  save the cost of genotyping all SNPs;
  perform disease association mapping.

TagSNP Selection

Haplotype-based methods
  Require the information of the phased multilocus haplotypes.

Haplotype-free methods
  Do not require haplotype information.
  TagSNP selection via r^2 linkage disequilibrium statistics.

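r^2 Linkage Disequilibrium Statistics

Given a pair of genetic markers 1 and 2.

r^2 statistic:

\[
r^2 = \frac{\left(p_{AB} - p_{A\cdot}\, p_{\cdot B}\right)^2}{p_{A\cdot}\,(1 - p_{A\cdot})\; p_{\cdot B}\,(1 - p_{\cdot B})}
\]

If r^2 is no less than a given threshold r_0, marker 1 (or marker 2) can tag marker 2 (or marker 1, respectively).

Haplotype and allele frequencies (rows: alleles of marker 1; columns: alleles of marker 2):

                    marker 2
                    B       b
marker 1    A     p_AB    p_Ab    p_A.
            a     p_aB    p_ab    p_a.
                  p_.B    p_.b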
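The r^2 definition above can be sketched in a few lines of Python. This is only an illustration: the function names `r_squared` and `can_tag` and the argument names are assumptions made for this example and do not come from the slides.

```python
def r_squared(p_AB: float, p_A: float, p_B: float) -> float:
    """r^2 for two biallelic markers.

    p_AB is the frequency of haplotype AB, p_A the frequency of allele A
    at marker 1 (p_A. in the table) and p_B the frequency of allele B at
    marker 2 (p_.B in the table).
    """
    denominator = p_A * (1.0 - p_A) * p_B * (1.0 - p_B)
    if denominator == 0.0:
        raise ValueError("r^2 is undefined when a marker is monomorphic")
    d = p_AB - p_A * p_B  # deviation from independence
    return d * d / denominator


def can_tag(p_AB: float, p_A: float, p_B: float, r0: float) -> bool:
    """True if the two markers can tag each other at threshold r0."""
    return r_squared(p_AB, p_A, p_B) >= r0


# Perfect LD: only haplotypes AB and ab occur, each with frequency 0.5.
print(r_squared(0.5, 0.5, 0.5))      # 1.0
# Independent markers: p_AB equals p_A * p_B.
print(can_tag(0.25, 0.5, 0.5, 0.8))  # False
```

Because r^2 is symmetric in the two markers, a single test decides tagging in both directions.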

The TagSNP Selection Problem

Instance: a set V of SNP markers and LD patterns E = {(v_j1, v_j2) | r^2(v_j1, v_j2) is no less than a given threshold r_0; v_j1 and v_j2 are in V}.

Feasible solution: a subset V' of V such that, for any v in V, there exists a v' in V' where r^2(v, v') is no less than r_0.

Objective: minimize |V'|.

If we define G = (V, E), a tagSNP set is equivalent to a dominating set on G.

[Figure: (a) SNP markers 1 to 6 and their LD patterns in a population; (b) tagSNPs for the population.]

This model was introduced by Carlson et al. (2004). It is a simple and popular tagging method.
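The dominating-set view can be illustrated with a small Python sketch. The edge list below is hypothetical (the LD patterns of the figure are not recoverable from the transcript), the function names are invented for this example, and the exhaustive search is only meant for toy inputs; it is not the method used in the talk.

```python
from itertools import combinations


def closed_neighbourhoods(markers, ld_edges):
    """Map each marker to itself plus every marker it is in LD with."""
    nbrs = {v: {v} for v in markers}
    for u, v in ld_edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    return nbrs


def is_tag_set(candidate, markers, ld_edges):
    """True if every marker is in the candidate set or in LD with a member."""
    nbrs = closed_neighbourhoods(markers, ld_edges)
    chosen = set(candidate)
    return all(nbrs[v] & chosen for v in markers)


def minimum_tag_set(markers, ld_edges):
    """Smallest tagSNP set by exhaustive search (exponential time)."""
    for size in range(1, len(markers) + 1):
        for candidate in combinations(markers, size):
            if is_tag_set(candidate, markers, ld_edges):
                return set(candidate)
    return set(markers)


# Hypothetical LD graph on six markers.
markers = [1, 2, 3, 4, 5, 6]
ld_edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (5, 6)]
print(minimum_tag_set(markers, ld_edges))
```

In this toy graph no single marker is in LD with all others, so at least two tagSNPs are needed.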

r^2 Statistics in Single and Admixed Populations

SNP 1: A, a
SNP 2: B, b

Population 1 (r^2 = 0)

          B         b
A      0.9025    0.0475    0.95
a      0.0475    0.0025    0.05
       0.95      0.05

Population 2 (r^2 = 0)

          B         b
A      0.0025    0.0475    0.05
a      0.0475    0.9025    0.95
       0.05      0.95

Admixed population: 50% population 1, 50% population 2 (r^2 = 0.6561)

          B         b
A      0.4525    0.0475    0.5
a      0.0475    0.4525    0.5
       0.5       0.5
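The three r^2 values on this slide can be re-derived from the haplotype frequencies. The sketch below is only a check of the arithmetic; the dictionary layout and the helper names (`r2_from_table`, `admix`) are assumptions made for this example.

```python
def r_squared(p_AB, p_A, p_B):
    """r^2 from the AB haplotype frequency and the A and B allele frequencies."""
    d = p_AB - p_A * p_B
    return d * d / (p_A * (1.0 - p_A) * p_B * (1.0 - p_B))


def r2_from_table(table):
    """r^2 from a 2x2 table of haplotype frequencies keyed by haplotype."""
    p_A = table["AB"] + table["Ab"]
    p_B = table["AB"] + table["aB"]
    return r_squared(table["AB"], p_A, p_B)


def admix(table1, table2, weight1=0.5):
    """Haplotype frequencies of a mixture of two populations."""
    return {h: weight1 * table1[h] + (1.0 - weight1) * table2[h] for h in table1}


pop1 = {"AB": 0.9025, "Ab": 0.0475, "aB": 0.0475, "ab": 0.0025}
pop2 = {"AB": 0.0025, "Ab": 0.0475, "aB": 0.0475, "ab": 0.9025}
mixed = admix(pop1, pop2)

for name, table in [("population 1", pop1), ("population 2", pop2), ("admixed", mixed)]:
    print(f"{name}: r^2 = {r2_from_table(table):.4f}")
# population 1: r^2 = 0.0000
# population 2: r^2 = 0.0000
# admixed: r^2 = 0.6561
```

Within each population the haplotype frequencies are exactly the products of the allele frequencies, so r^2 vanishes; mixing the two populations creates the association.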

TagSNP Selection across Populations

A pair of SNPs can have remarkably different marker frequencies and very weak LD in two populations with different evolutionary histories, yet show strong LD in the admixed population.

TagSNPs picked from the admixed population or from one of the populations might not be sufficient to capture the variations in all populations.

The MCTS Model

Given a set of SNP markers and LD patterns in multiple populations, we want to find a minimum common tagSNP set, i.e., a smallest set of markers that is a tagSNP set in each of the populations.

This problem is called the minimum common tagSNP selection (MCTS) problem.

[Figure: (a) SNP markers 1 to 6 and their LD patterns in two populations (Population 1, Population 2); (b) the minimum tagSNP set for these two populations.]
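A toy instance helps to see why a common tagSNP set can be larger than the tagSNP set of any single population. The sketch below uses hypothetical LD edges (the figure's edges are not recoverable from the transcript), invented function names, and an exhaustive search that is only practical for a handful of markers.

```python
from itertools import combinations


def closed_neighbourhoods(markers, ld_edges):
    """Map each marker to itself plus every marker it is in LD with."""
    nbrs = {v: {v} for v in markers}
    for u, v in ld_edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    return nbrs


def is_common_tag_set(candidate, populations):
    """True if the candidate tags every marker in every population."""
    chosen = set(candidate)
    for markers, ld_edges in populations.values():
        nbrs = closed_neighbourhoods(markers, ld_edges)
        if not all(nbrs[v] & chosen for v in markers):
            return False
    return True


def minimum_common_tag_set(populations):
    """Smallest common tagSNP set by exhaustive search (exponential time)."""
    all_markers = sorted({v for markers, _ in populations.values() for v in markers})
    for size in range(1, len(all_markers) + 1):
        for candidate in combinations(all_markers, size):
            if is_common_tag_set(candidate, populations):
                return set(candidate)
    return set(all_markers)


# Hypothetical LD patterns for the same six markers in two populations.
populations = {
    "population 1": ([1, 2, 3, 4, 5, 6], [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (5, 6)]),
    "population 2": ([1, 2, 3, 4, 5, 6], [(1, 2), (2, 3), (4, 5), (4, 6)]),
}
print(minimum_common_tag_set(populations))
```

In this instance each population on its own can be tagged with two markers (for example {3, 5} and {2, 4}), but no pair works for both populations, so the minimum common set has three markers, which is still fewer than the four obtained by merging the two per-population solutions.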

Our Algorithms

The MCTS problem can be easily formulated as an integer linear program.
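The slides do not show the integer program itself. One natural formulation, written here as an assumption in set-cover style, uses a binary variable x_v for each marker v, a population index i, the marker set V_i of population i, and the closed LD neighbourhood N_i[v] of v in population i (all of this notation is introduced for the sketch):

```latex
\begin{align*}
\text{minimize}   \quad & \sum_{v \in V} x_v \\
\text{subject to} \quad & \sum_{u \in N_i[v]} x_u \ge 1
                          && \text{for every population } i \text{ and every } v \in V_i, \\
                        & x_v \in \{0, 1\} && \text{for every } v \in V,
\end{align*}
\qquad\text{where } N_i[v] = \{v\} \cup \{\, u \in V_i : r_i^2(u, v) \ge r_0 \,\}.
```

Each constraint corresponds to one occurrence of a marker in one population, and x_v = 1 means that marker v is selected as a tagSNP.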

We first apply some data reduction rules, then use one of the following algorithms:
  a greedy algorithm: GreedyTag;
  a Lagrangian relaxation algorithm: LRTag.

We calculate
  the upper bound: the number of tagSNPs obtained by our algorithms;
  the lower bound (GreedyTag_lb, LRTag_lb): a bound on the minimum number of tagSNPs needed.

Data Reduction Rules

Pick all irreplaceable markers.
  Example: marker 7.

Remove less informative markers.
  Example: among markers 1, 2 and 6, remove markers 1 and 2.

Remove less stringent occurrences.
  Example: between the occurrences of markers 4 and 5 in population 2, remove the occurrence of marker 4.

[Figure: markers 1 to 7 and their LD patterns in Population 1 and Population 2, one panel per rule.]
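The slides name the rules without defining them, so the sketch below is only one plausible reading in set-cover terms: a marker is treated as irreplaceable if some occurrence can be tagged by that marker alone, and as less informative if everything it tags is also tagged by another single marker. The instance, the function names and these definitions are assumptions; the third rule (the analogous comparison between occurrences) is not shown.

```python
def build_cover(populations):
    """Map each marker to the occurrences (population, marker) it can tag."""
    cover = {}
    for pop, (markers, _) in populations.items():
        for v in markers:
            cover.setdefault(v, set()).add((pop, v))
    for pop, (_, ld_edges) in populations.items():
        for u, v in ld_edges:
            cover[u].add((pop, v))
            cover[v].add((pop, u))
    return cover


def irreplaceable_markers(cover):
    """Markers that are the only possible tag for at least one occurrence."""
    taggers = {}
    for marker, occurrences in cover.items():
        for occ in occurrences:
            taggers.setdefault(occ, set()).add(marker)
    return {next(iter(ms)) for ms in taggers.values() if len(ms) == 1}


def less_informative_markers(cover):
    """Markers whose tagged occurrences are all tagged by one other kept marker."""
    removed = set()
    for u in cover:
        for v in cover:
            if v != u and v not in removed and cover[u] <= cover[v]:
                removed.add(u)
                break
    return removed


# Hypothetical instance: marker 7 is in LD with nothing in either population.
populations = {
    "population 1": ([1, 2, 3, 4, 5, 6, 7], [(1, 2), (1, 6), (2, 6), (3, 6), (4, 5)]),
    "population 2": ([1, 2, 3, 4, 5, 6, 7], [(1, 6), (2, 6), (3, 4), (4, 5)]),
}
cover = build_cover(populations)
print(irreplaceable_markers(cover))     # {7}
print(less_informative_markers(cover))  # {1, 2, 5}
```

Under this reading, markers 1 and 2 are dropped because marker 6 tags everything they tag, which mirrors the example on the slide.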

A Greedy Algorithm

1. Apply the data reduction rules.
2. If there is an un-tagged occurrence, pick the marker which tags the most of the remaining occurrences as a tagSNP, and repeat step 2.
3. Otherwise, output the tagSNPs.
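The greedy loop can be sketched as follows. This is a minimal illustration and not the authors' GreedyTag implementation: it skips the data reduction step, breaks ties by insertion order, and runs on a hypothetical two-population instance; `build_cover` and `greedy_tag` are names invented for this example.

```python
def build_cover(populations):
    """Map each marker to the occurrences (population, marker) it can tag."""
    cover = {}
    for pop, (markers, _) in populations.items():
        for v in markers:
            cover.setdefault(v, set()).add((pop, v))
    for pop, (_, ld_edges) in populations.items():
        for u, v in ld_edges:
            cover[u].add((pop, v))
            cover[v].add((pop, u))
    return cover


def greedy_tag(populations):
    """Repeatedly pick the marker that tags the most un-tagged occurrences."""
    cover = build_cover(populations)
    untagged = set().union(*cover.values())
    tags = []
    while untagged:
        best = max(cover, key=lambda m: len(cover[m] & untagged))
        tags.append(best)
        untagged -= cover[best]
    return tags


# Hypothetical LD patterns for the same six markers in two populations.
populations = {
    "population 1": ([1, 2, 3, 4, 5, 6], [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (5, 6)]),
    "population 2": ([1, 2, 3, 4, 5, 6], [(1, 2), (2, 3), (4, 5), (4, 6)]),
}
print(greedy_tag(populations))  # [2, 4, 5]
```

Every occurrence can at least be tagged by its own marker, so each round tags something new and the loop always terminates.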

A Lagrangian Relaxation Algorithm

1. Introduce the Lagrangian multipliers λ and obtain the relaxed integer program.
2. Initialize λ and obtain the tagSNP set based on λ.
3. Set iteration := 0.
4. While iteration++ < max_iter: update λ towards the subgradient direction, then update the tagSNP set based on λ.
5. Output the tagSNPs.
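The flow above matches the textbook subgradient scheme for set cover, which the following sketch implements under several assumptions: one multiplier per occurrence, a diminishing step size, a simple repair-and-prune heuristic to turn the relaxed solution into a tagSNP set, and a hypothetical two-population instance. None of these details, nor the name `lr_tag`, are taken from the slides, so this should not be read as the authors' LRTag.

```python
import math


def build_cover(populations):
    """Map each marker to the occurrences (population, marker) it can tag."""
    cover = {}
    for pop, (markers, _) in populations.items():
        for v in markers:
            cover.setdefault(v, set()).add((pop, v))
    for pop, (_, ld_edges) in populations.items():
        for u, v in ld_edges:
            cover[u].add((pop, v))
            cover[v].add((pop, u))
    return cover


def lr_tag(populations, max_iter=200, step0=1.0):
    """Return (tagSNP set, lower bound) from a subgradient Lagrangian scheme."""
    cover = build_cover(populations)
    occs = set().union(*cover.values())
    taggers = {o: [m for m in cover if o in cover[m]] for o in occs}
    lam = {o: 1.0 / len(taggers[o]) for o in occs}  # assumed initialisation
    best_tags, best_lb = set(cover), 0.0
    for it in range(max_iter):
        # Relaxed problem: a marker is chosen iff its reduced cost is negative.
        reduced = {m: 1.0 - sum(lam[o] for o in cover[m]) for m in cover}
        x = {m for m in cover if reduced[m] < 0}
        lb = sum(lam.values()) + sum(reduced[m] for m in x)
        best_lb = max(best_lb, lb)
        # Repair: tag every un-tagged occurrence with its cheapest tagger.
        tags = set(x)
        covered = set().union(*(cover[m] for m in tags))
        for o in sorted(occs):
            if o not in covered:
                m = min(taggers[o], key=lambda t: reduced[t])
                tags.add(m)
                covered |= cover[m]
        # Prune: drop markers that are not needed any more.
        for m in sorted(tags, key=lambda t: -reduced[t]):
            rest = tags - {m}
            if occs <= set().union(*(cover[t] for t in rest)):
                tags = rest
        if len(tags) < len(best_tags):
            best_tags = tags
        # Subgradient step on the multipliers, projected onto lam >= 0.
        step = step0 / (1 + it)
        for o in occs:
            g = 1 - sum(1 for m in taggers[o] if m in x)
            lam[o] = max(0.0, lam[o] + step * g)
    return best_tags, max(0, math.ceil(best_lb - 1e-9))


# Hypothetical LD patterns for the same six markers in two populations.
populations = {
    "population 1": ([1, 2, 3, 4, 5, 6], [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (5, 6)]),
    "population 2": ([1, 2, 3, 4, 5, 6], [(1, 2), (2, 3), (4, 5), (4, 6)]),
}
tags, lower_bound = lr_tag(populations)
print(sorted(tags), lower_bound)
```

The value of the relaxed problem is a lower bound on the optimum for any non-negative λ, which is how a scheme of this kind yields both a tagSNP set (upper bound) and a bound such as LRTag_lb.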

Experimental Result

We apply our algorithms to real HapMap data (release #19, NCBI build 34, October 2005).

There are four populations in the HapMap data:
  CEU: European descendants.
  CHB: Chinese, Beijing.
  JPT: Japanese, Tokyo.
  YRI: Yoruba people of Ibadan, Nigeria.

We get tagSNPs for the following two datasets:
  ENCODE regions: all 10 ENCODE regions, 10,859 markers.
  Human genome: chromosomes 1-22, 2,862,454 markers.

Experimental Result for ENCODE Regions

We compare our GreedyTag and LRTag with MultiPop-TagSelect (MPS).

MultiPop-TagSelect first generates the tagSNPs for each single population, then combines the obtained tagSNPs for the multiple populations.

The gap between LRTag_lb and LRTag:
  r^2 = 0.5: at most two for each region, six in total for all regions.
  r^2 = 0.8: there is no gap.

Experimental Result for the Human Genome

The numbers of tagSNPs selected by our algorithms are almost optimal.

The gap between LRTag_lb and LRTag for the whole genome (2,862,454 SNPs in total):
  r^2 = 0.5: 1,061
  r^2 = 0.8: 142

Running Time of Our Algorithms

Running environment: a 32-processor SGI Altix 4700 supercomputer system with 1.6 GHz CPUs and 64 GB shared memory, running 15 threads in parallel.

Running time for r^2 = 0.5:
  ENCODE regions: < 7 seconds for each region, < 1 minute for all regions.
  Human genome: < 12 minutes for each chromosome, < 1 hour for the genome.

For r^2 > 0.5, our algorithms run faster than the times above.

Thanks for your time and attention!