
Post on 21-Dec-2015


;    1                 10                  20        25
;    |                 |                   |         |
GM01 A A A A A A A A A A A A A A A A B B B B B B B B B
GM02 A A A A A A A A A A A A A A A B B B B B B B B B B
GM03 A A A A A A A A A A A A A B B B B B B B B B B B B
GM04 A A A A A A A A A A A B B B B B B B B B B B B B B
GM05 A A A A A A A A A A B B B B B B B B B B B B B B B
GM06 A A A A A A A A A B B B B B B B B B B B B B B B B
GM07 A A A A A A A A A B B B B B B B B B B B B B B A A
GM08 A A A A A A A A A B B B B B B B B B B B B B A A A
GM09 A A A A A A A A A B B B B B B B B B B B A A A A A
GM10 B A A A A A A A A A B B B B B B B B B A A A A A A
GM11 B B A A A A A A A A B B B B B B B B A A A A A A A
GM12 B B B A A A A A A A B B B B B B B A A A A A A A A
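The locus file layout shown above can be read with a few lines of Python. This is only a sketch, not code from the MadMapper suite: the function name `parse_locus` is hypothetical, and it assumes that lines starting with ';' are comments and that every other line is a marker ID followed by whitespace-separated scores.

```python
def parse_locus(text):
    """Parse locus-file text into {marker_id: [score, ...]}.

    Assumptions (not taken from MadMapper itself): lines starting with ';'
    are comment/ruler lines; all other non-empty lines hold a marker ID
    followed by one score per recombinant inbred line.
    """
    markers = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(";"):
            continue
        marker_id, *scores = line.split()
        markers[marker_id] = scores
    return markers


# Two rows of the example locus file (GM01 and GM12), rebuilt from run lengths.
example = "\n".join([
    "; 1 10 20 25",
    "; | | | |",
    "GM01 " + " ".join("A" * 16 + "B" * 9),
    "GM12 " + " ".join("B" * 3 + "A" * 7 + "B" * 7 + "A" * 8),
])

loci = parse_locus(example)
for marker_id, scores in loci.items():
    print(marker_id, len(scores), "".join(scores))
```

Each marker ends up with one score per line of the population (25 in this example), which is the form the later sketches work on.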

Locus file

Mapping using Recombinant Inbred Lines

Genetic Cross

Genotyping

Raw Marker Scores

Mapping – Inference of linear order of markers using raw scores

MadMapper_RECBIT – Quality control of genetic markers and group analysis

MadMapper_XDELTA – Inference of linear order of markers on linkage groups

CheckMatrix (py_matrix_2D_V248_RECBIT.py) – Visualization and validation of genetic maps using two-dimensional heat-plots and graphical genotyping

MadMapper and CheckMatrix are multi-platform Python programs that can be used on UNIX, Windows, and Mac OS X.

Detailed analysis (quality control and clustering) can be done on a set of ~2,000 markers.

Map construction works in a reasonable timeframe with up to ~500 markers.

Large images (up to 10,000 x 10,000 pixels) can visualize up to ~2 million pairwise scores simultaneously.

MadMapper_RECBIT input and output files

Group Info [ *.group_info ]: one file per iteration; 16 iterations with different cutoff values

Adjacency List [ *.adj_list ]: one file per iteration; 16 iterations with different cutoff values

Recombination Distance Scores [ *.pairs_all ]:

...................
GM01 GM07 0.36
GM01 GM08 0.40
GM01 GM09 0.48
GM01 GM10 0.52
GM01 GM11 0.60
GM01 GM12 0.68
GM02 GM01 0.04
GM02 GM02 0.00
GM02 GM03 0.08
GM02 GM04 0.16
GM02 GM05 0.20
GM02 GM06 0.24
...................
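For the A/B-only example locus file, the pairwise scores listed above are consistent with a plain mismatch fraction: the number of lines on which two markers differ, divided by the 25 lines scored. The sketch below reproduces a few of the listed values under that assumption. The function name `rec_fraction` is hypothetical, and the handling of missing or heterozygous scores is deliberately left out.

```python
def rec_fraction(scores_a, scores_b):
    """Fraction of positions at which two markers carry different scores.

    Assumes both markers are fully scored with 'A'/'B' only.
    """
    mismatches = sum(1 for a, b in zip(scores_a, scores_b) if a != b)
    return mismatches / len(scores_a)


# Rows of the example locus file, written as run lengths (25 lines each).
GM01 = "A" * 16 + "B" * 9
GM02 = "A" * 15 + "B" * 10
GM03 = "A" * 13 + "B" * 12
GM07 = "A" * 9 + "B" * 14 + "A" * 2
GM12 = "B" * 3 + "A" * 7 + "B" * 7 + "A" * 8

for name_a, a, name_b, b in [
    ("GM01", GM01, "GM07", GM07),
    ("GM01", GM01, "GM12", GM12),
    ("GM02", GM02, "GM01", GM01),
    ("GM02", GM02, "GM03", GM03),
]:
    print(name_a, name_b, f"{rec_fraction(a, b):.2f}")
```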

Group Info Summary [ *.x_tree_clust ]: summary for clustering results for all 16 iterations. Distinct linkage groups can be inferred by analysis of this clustering / grouping information.

Non-Redundant Marker Scores [ *.z_nr_scores.loc ]: locus file with non-redundant raw marker scores


INPUT: Locus file

Python_MadMapper_V248_RECBIT_012.py

Marker Summary [ *.z_marker_sum ]: for each marker, a ‘quality class’ is assigned, which is useful for selection of ‘core’ markers

Marker Scores Info [ *.x_scores_stat ]: detailed information about scores and linkage

Trio Analysis [ *.z_trio_good ] [ *.z_trio_best ] [ *.z_trios_bad ]: analysis of all trios (triplets) for the non-redundant set of markers

LOG file ( *.x_log_file ): information about run parameters

one input file - locus file with raw marker scores

82 output files

MadMapper BIT scoring system is used as an alternative to LOD scores to quantify linkage confidence between markers

Arabidopsis Genetic Map (Dean and Lister), five linkage groups: Comparison of Different Scoring Systems

Panels: JoinMap LOD scores; JoinMap REC scores; MadMapper BIT scores; MadMapper REC scores

MadMapper_RECBIT Clustering: Group Info Summary [ *.x_tree_clust file ] provides information about marker grouping, i.e. which linkage group each marker belongs to

MadMapper_RECBIT BIN Analysis distinguishes true bins from linked groups

M_1 A A A B B B A A A A B B B B A A - A A B B B B A B B A A B A A A B B B B

M_2 A A A B B B A - A A B B B B A A A A A B B B B A B B A A B A A A B B B B

M_3 A A A B B B A A A A B B - B A A A A A B B B B A B B A A B A A A B B B B

M_4 A A A B B B A A A A B B A B A A A A A B B - B A B B A A B A A A B B B B

Graph nodes: M_1, M_2, M_3, M_4

Legend: Linked Group; Saturated Node; Diluted Node

Example of Complete Graph: all nodes are ‘saturated’

MadMapper_RECBIT Marker Summary [ *.z_marker_sum file ] provides info about redundancy of scores, marker qualities, and allele distortion

MARKER_Flank_1 REC1 BIT1 D_FR1 MARKER_Middle REC2 BIT2 D_FR2 MARKER_Flank_2 REC_Flank BIT_F D_FR_F D_REC D_REC_Flank

COR47 0.1 336 0.6931 CAT3 0.0857 348 0.6931 G2395 *** 0.0253 450 0.7822 *** 5 + 0

G2395 0.0857 348 0.6931 CAT3 0.1 336 0.6931 COR47 *** 0.0253 450 0.7822 *** 5 + 0

LK141 0.1522 192 0.4554 GUT15 0.193 210 0.5644 MI238 *** 0.1231 294 0.6436 *** 4 + 0

MI238 0.193 210 0.5644 GUT15 0.1522 192 0.4554 LK141 *** 0.1231 294 0.6436 *** 4 + 0

MI204 0.051 528 0.9703 MI51 0.0806 312 0.6139 SGCSNP41 *** 0.0161 360 0.6139 *** 4 + 0

SGCSNP41 0.0806 312 0.6139 MI51 0.051 528 0.9703 MI204 *** 0.0161 360 0.6139 *** 4 + 0

M336 0.0494 438 0.802 COR15 0.0385 432 0.7723 VE018 *** 0.0879 450 0.901 *** 0 + 0

VE018 0.0385 432 0.7723 COR15 0.0494 438 0.802 M336 *** 0.0879 450 0.901 *** 0 + 0

ARR7 0.0115 510 0.8614 COR47 0 282 0.4653 F15571 *** 0.0179 324 0.5545 *** 0 + 0

F15571 0 282 0.4653 COR47 0.0115 510 0.8614 ARR7 *** 0.0179 324 0.5545 *** 0 + 0

PAP3 0.0227 504 0.8713 COR78 0.0577 276 0.5149 PDC2 *** 0.0588 270 0.505 *** 0 + 0

PDC2 0.0577 276 0.5149 COR78 0.0227 504 0.8713 PAP3 *** 0.0588 270 0.505 *** 0 + 0

Bad Trios

Good Trios

MadMapper_RECBIT Trio (Triplet) Analysis

Number of double crossovers should be low for ‘good’ trios

Number of double crossovers is high for ‘bad’ trios

M_1 A A A B B B A A A A B B B B A A - A A B B B B A B B A A B A A A B B B B
            X                     X           X                         X
M_M A A A B A B A - A A B B B B A B A A A B B A B A B B A A B A A A B B A B
            X                     X           X                         X
M_2 A A A B B B A A A A B B - B A A A A A B B B B A B B A A B A A A B B B B

M_1: flanking marker 1

M_M: ‘middle’ marker

M_2: flanking marker 2
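The trio check can be illustrated with a small counter. The sketch below assumes that a double crossover is a position where both flanking markers carry the same allele and the middle marker carries the other one, and that positions with a missing score ('-') in any of the three markers are skipped. The function name `double_crossovers` is hypothetical, and MadMapper's own criteria for ‘good’ and ‘bad’ trios are not reproduced here.

```python
def double_crossovers(flank1, middle, flank2, missing="-"):
    """Return 1-based positions where the middle marker disagrees with
    two flanking markers that agree with each other."""
    positions = []
    for i, (f1, m, f2) in enumerate(zip(flank1, middle, flank2), start=1):
        if missing in (f1, m, f2):
            continue  # an unscored line cannot support or refute a crossover
        if f1 == f2 and m != f1:
            positions.append(i)
    return positions


# The three markers from the trio example (36 lines each).
M_1 = "A A A B B B A A A A B B B B A A - A A B B B B A B B A A B A A A B B B B".split()
M_M = "A A A B A B A - A A B B B B A B A A A B B A B A B B A A B A A A B B A B".split()
M_2 = "A A A B B B A A A A B B - B A A A A A B B B B A B B A A B A A A B B B B".split()

hits = double_crossovers(M_1, M_M, M_2)
print(len(hits), hits)  # 4 [5, 16, 22, 35]
```

The four positions found correspond to the four X marks in the example, which is why this ‘middle’ marker looks suspicious between these two flanking markers.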

MadMapper_XDELTA Usage:

MadMapper_XDELTA takes three files as input:

1. Matrix (pairwise distances between markers)

2. List of ‘frame’ markers

3. List of markers to map

First step: finding the best map for ‘frame’ markers by checking all possible combinations (up to 10 markers); optionally: unlimited list of ‘frame’ markers with a fixed order

Best-Fit extension

Take one marker from the list of markers to map and insert it into the 2-dimensional matrix of the current best map. Check all possible positions. Calculate ‘delta’ and find the map with the lowest ‘delta’ value (lowest ‘entropy’).

Move to the next marker to map until all markers are mapped. Optional shuffling (ripple) after several steps.

Visual Explanation of Minimum Entropy Approach to Infer Linear Order Using MadMapper_XDELTA program

CheckMatrix 2D plot: random order (high ‘entropy’); partially wrong order; right order (low ‘entropy’)

MadMapper_XDELTA analyzes two-dimensional matrices of all pairwise scores and finds the best map that has a minimum total sum of differences between adjacent cells (map with the lowest ‘entropy’).
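One way to read "total sum of differences between adjacent cells" is to write the pairwise-distance matrix with rows and columns in map order and add up the absolute differences between horizontally neighbouring cells in every row. The sketch below is an interpretation, not code from MadMapper_XDELTA; the function name `delta` and the dictionary layout are hypothetical. The three distances are mismatch fractions derived from the example locus file; only GM02-GM06 appears in the *.pairs_all excerpt.

```python
def delta(order, dist):
    """Sum of |d(row, left) - d(row, right)| over all rows and all pairs
    of neighbouring columns, with rows and columns in map order."""
    total = 0.0
    for row in order:
        for left, right in zip(order, order[1:]):
            total += abs(dist[row][left] - dist[row][right])
    return total


def symmetric(pairs, markers):
    """Expand {(a, b): d} into a full nested dict with a zero diagonal."""
    dist = {m: {n: 0.0 for n in markers} for m in markers}
    for (a, b), d in pairs.items():
        dist[a][b] = dist[b][a] = d
    return dist


markers = ["GM02", "GM06", "GM10"]
dist = symmetric(
    {("GM02", "GM06"): 0.24, ("GM02", "GM10"): 0.48, ("GM06", "GM10"): 0.32},
    markers,
)

for order in (["GM02", "GM06", "GM10"],
              ["GM02", "GM10", "GM06"],
              ["GM06", "GM02", "GM10"]):
    d = delta(order, dist)
    print(" ".join(order), round(d, 2), round(d / len(order), 4))
```

Under this reading the three orders score 1.52, 1.92 and 1.68, and dividing by the number of markers gives 0.5067, 0.64 and 0.56, the same figures that appear in the first block of the example XDELTA output.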

Visualization of numerical data using CheckMatrix

=============================================

MATRIX (ALL PAIRS) : madmapper_test_small.out.pairs_all

MARKERS TO MAP : madmapper_test_small.list

FRAME MARKERS LIST : madmapper_test_small.frame

OUTPUT MAP FILE : madmapper_test_small.xdelta

MAX FRAME LENGTH : 12

FIXED FRAME ORDER : FALSE

LINKAGE GROUP ID : LG

DUMMY DEBUG : TRUE

=============================================

=======

GM02 GM06 GM10 *** 1.52 *** 0.5067 *** 1

GM02 GM10 GM06 *** 1.92 *** 0.64 *** 2

GM06 GM02 GM10 *** 1.68 *** 0.56 *** 3

=======

GM03 GM02 GM06 GM10 *** 2.16 *** 0.54 *** 1

GM02 GM03 GM06 GM10 *** 2.0 *** 0.5 *** 2

GM02 GM06 GM03 GM10 *** 2.64 *** 0.66 *** 3

GM02 GM06 GM10 GM03 *** 3.2 *** 0.8 *** 4

=======

GM08 GM02 GM03 GM06 GM10 *** 3.64 *** 0.728 *** 1

GM02 GM08 GM03 GM06 GM10 *** 4.32 *** 0.864 *** 2

GM02 GM03 GM08 GM06 GM10 *** 3.28 *** 0.656 *** 3

GM02 GM03 GM06 GM08 GM10 *** 2.56 *** 0.512 *** 4

GM02 GM03 GM06 GM10 GM08 *** 3.16 *** 0.632 *** 5

=======

GM09 GM02 GM03 GM06 GM08 GM10 *** 4.8 *** 0.8 *** 1

GM02 GM09 GM03 GM06 GM08 GM10 *** 5.92 *** 0.9867 *** 2

GM02 GM03 GM09 GM06 GM08 GM10 *** 4.72 *** 0.7867 *** 3

GM02 GM03 GM06 GM09 GM08 GM10 *** 3.76 *** 0.6267 *** 4

GM02 GM03 GM06 GM08 GM09 GM10 *** 3.12 *** 0.52 *** 5

GM02 GM03 GM06 GM08 GM10 GM09 *** 3.52 *** 0.5867 *** 6

Example of the construction of a framework map and Best-Fit Extension for the remaining markers:

map calculated by checking all possible combinations

marker GM03 was inserted

marker GM08 was inserted

marker GM09 was inserted
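The run shown above can be replayed in a few lines. The sketch below is an illustration under stated assumptions rather than the MadMapper_XDELTA implementation: distances are plain mismatch fractions computed from the example locus rows, ‘delta’ is taken as the sum of absolute differences between neighbouring cells in each row of the ordered distance matrix, and an order and its reverse are treated as the same map. All function names are hypothetical.

```python
from itertools import permutations

# Rows of the example locus file, written as run lengths (25 lines each).
LOCUS = {
    "GM02": "A" * 15 + "B" * 10,
    "GM03": "A" * 13 + "B" * 12,
    "GM06": "A" * 9 + "B" * 16,
    "GM08": "A" * 9 + "B" * 13 + "A" * 3,
    "GM09": "A" * 9 + "B" * 11 + "A" * 5,
    "GM10": "B" + "A" * 9 + "B" * 9 + "A" * 6,
}


def rec(a, b):
    """Mismatch fraction between two markers (A/B scores only)."""
    return sum(x != y for x, y in zip(LOCUS[a], LOCUS[b])) / len(LOCUS[a])


def delta(order):
    """Total difference between neighbouring cells, row by row."""
    return sum(abs(rec(row, left) - rec(row, right))
               for row in order
               for left, right in zip(order, order[1:]))


def best_frame(frame):
    """Check every order of the frame markers, skipping mirror images."""
    candidates = (p for p in permutations(frame) if p[0] < p[-1])
    return list(min(candidates, key=delta))


def best_fit_insert(order, marker):
    """Try the new marker at every position and keep the lowest delta."""
    trials = [order[:i] + [marker] + order[i:] for i in range(len(order) + 1)]
    return min(trials, key=delta)


order = best_frame(["GM02", "GM06", "GM10"])
print("frame", order, round(delta(order), 2))
for marker in ["GM03", "GM08", "GM09"]:
    order = best_fit_insert(order, marker)
    print(marker, order, round(delta(order), 2))
```

With these assumptions the frame comes out as GM02 GM06 GM10, and the three insertions give totals of 2.0, 2.56 and 3.12, matching the lowest-scoring line of each block in the example output. The optional shuffle (ripple) step is not sketched here.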

LG MARKER POS #1# DST1 #2# DST2 #3# DST3 #S# SUMM #D# DIFF STATUS CLASS

2 G4553 0 #1# 0 #2# NNNNNN #3# NNNNNN #S# NNNNNN #D# NNNNNN NNNNNN NNNNN

2 M246 1 #1# 0.043 #2# 0.0213 #3# 0.0778 #S# 0.0643 #D# -0.0135 GOOD __0__

2 MI320 2 #1# 0.0213 #2# 0 #3# 0.0211 #S# 0.0213 #D# 0.0002 GOOD __0__

.. … .. .. … .. … .. … .. … .. … .. …

2 NGA1126 26 #1# 0.0225 #2# 0.0645 #3# 0.0702 #S# 0.087 #D# 0.0168 GOOD __0__

2 SGCSNP135 27 #1# 0.0645 #2# 0.0645 #3# 0.0842 #S# 0.129 #D# 0.0448 GOOD __0__

2 MI54 28 #1# 0.0645 #2# 0.0211 #3# 0.0968 #S# 0.0856 #D# -0.0112 GOOD __0__

2 VE014 29 #1# 0.0211 #2# 0.0532 #3# 0.0745 #S# 0.0743 #D# -0.0002 GOOD __0__

2 M283 30 #1# 0.0532 #2# 0.1803 #3# 0.1833 #S# 0.2335 #D# 0.0502 GOOD __1__

2 SGCSNP333 31 #1# 0.1803 #2# 0.1167 #3# 0.0968 #S# 0.297 #D# 0.2002 GOOD LARGE

2 SGCSNP210 32 #1# 0.1167 #2# 0 #3# 0.1154 #S# 0.1167 #D# 0.0013 GOOD __0__

2 COP1 33 #1# 0 #2# 0.0196 #3# 0.0263 #S# 0.0196 #D# -0.0067 GOOD __0__

2 SPL3 34 #1# 0.0196 #2# 0.04 #3# 0.0339 #S# 0.0596 #D# 0.0257 GOOD __0__

2 C4H 35 #1# 0.04 #2# 0.0227 #3# 0.0303 #S# 0.0627 #D# 0.0324 GOOD __0__

.. … .. .. … .. … .. … .. … .. … .. …

2 M336 54 #1# 0.0519 #2# 0.0625 #3# 0 #S# 0.1144 #D# 0.1144 GOOD __X__

2 UBIQUE 55 #1# 0.0625 #2# 0.0526 #3# 0.0619 #S# 0.1151 #D# 0.0532 GOOD __1__

2 MI79A 56 #1# 0.0526 #2# 0.0781 #3# 0.0645 #S# 0.1307 #D# 0.0662 GOOD __1__

2 ATHB7 57 #1# 0.0781 #2# 0.1579 #3# 0.1698 #S# 0.236 #D# 0.0662 GOOD __1__

2 SGCSNP214 58 #1# 0.1579 #2# 0.1429 #3# 0.1667 #S# 0.3008 #D# 0.1341 GOOD __X__

2 SGCSNP198 59 #1# 0.1429 #2# NNNNNN #3# NNNNNN #S# NNNNNN #D# NNNNNN NNNNNN NNNNN

MadMapper_XDELTA Map Output: text tab-delimited file with ordered markers and detailed info about adjacent recombination scores

A – marker above; B – middle marker; C – marker below

DST1: Distance[A-B]

DST2: Distance[B-C]

DST3: Distance[A-C]

SUMM: [A-B] + [B-C]

DIFF: ([A-B] + [B-C]) - [A-C]
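The SUMM and DIFF columns of the map output can be recomputed from the three distances of each row. The sketch below checks this against rows M246 and SGCSNP333 of the example map; the function name `summ_and_diff` is hypothetical, rounding to four decimals is an assumption based on the printed values, and the thresholds behind the CLASS column are not stated in the source, so they are not modelled.

```python
def summ_and_diff(dst1, dst2, dst3):
    """SUMM = [A-B] + [B-C]; DIFF = SUMM - [A-C].

    A DIFF close to zero means the two adjacent intervals add up to the
    distance spanning them, as expected for a well-placed middle marker.
    """
    summ = dst1 + dst2
    return round(summ, 4), round(summ - dst3, 4)


# DST1, DST2, DST3 taken from the example map output.
print("M246     ", summ_and_diff(0.043, 0.0213, 0.0778))
print("SGCSNP333", summ_and_diff(0.1803, 0.1167, 0.0968))
```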

##################################################################
## EXAMPLES OF SCORING:
##
##
## POSITIVE LINKAGE:
##
## AAAAAAAAAAAAAAAAAAAA   BIT SCORE = 6*20 = 120
## AAAAAAAAAAAAAAAAAAAA   REC SCORE = 0 (0.0)
##                   ..
## AAAAAAAAAAAAAAAAAAAA   BIT SCORE = 6*18 - 6*2 = 96
## AAAAAAAAAAAAAAAAAABB   REC SCORE = 2 (2/20 = 0.1)
##
## AAAAAAAAAABBBBBBBBBB   BIT SCORE = 6*10 + 6*10 = 120
## AAAAAAAAAABBBBBBBBBB   REC SCORE = 0 (0.0)
##          ..
## AAAAAAAAABABBBBBBBBB   BIT SCORE = 6*18 - 6*2 = 96
## AAAAAAAAAABBBBBBBBBB   REC SCORE = 2 (2/20 = 0.1)
##
##
## NO LINKAGE:
##           ..........
## AAAAAAAAAAAAAAAAAAAA   BIT SCORE = 6*10 - 6*10 = 0
## AAAAAAAAAABBBBBBBBBB   REC SCORE = 10 (10/20 = 0.5)
##  . . . . . . . . . .
## BBBAABBAAAAAAABAABBB   BIT SCORE = 6*10 - 6*10 = 0
## BABBAABBABABABBBAABA   REC SCORE = 10 (10/20 = 0.5)
##
##
## NEGATIVE LINKAGE:
##   ..................
## AAAAAAAAAAAAAAAAAAAA   BIT SCORE = 6*2 - 6*18 = -96
## AABBBBBBBBBBBBBBBBBB   REC SCORE = 18 (18/20 = 0.9)
##   ..................
## ABABABABABABABABABAB   BIT SCORE = 6*2 - 6*18 = -96
## ABBABABABABABABABABA   REC SCORE = 18 (18/20 = 0.9)
##
##################################################################

##################################################################
##                     +-------+  GENOTYPES:
##                     |  BIT  |  A – 1st; B – 2nd
##    SCORING SYSTEM:  |       |  C - NOT A ( H or B )
##                     |  REC  |  D - NOT B ( H or A )
##                     +-------+  H - A and B
##
##          +-------+-------+-------+-------+-------+-------+
##          |       |       |       |       |       |       |
##          |   A   |   B   |   C   |   D   |   H   |   -   |
##          |       |       |       |       |       |       |
##  +-------*-------+-------+-------+-------+-------+-------+
##  |       |   6   |  -6   |  -4   |   4   |  -2   |   0   |
##  |   A   |       |       |       |       |       |       |
##  |       |   0   |   1   |   1   |   0   |  0.5  |   0   |
##  +-------+-------*-------+-------+-------+-------+-------+
##  |       |  -6   |   6   |   4   |  -4   |  -2   |   0   |
##  |   B   |       |       |       |       |       |       |
##  |       |   1   |   0   |   0   |   1   |  0.5  |   0   |
##  +-------+-------+-------*-------+-------+-------+-------+
##  |       |  -4   |   4   |   4   |  -4   |   0   |   0   |
##  |   C   |       |       |       |       |       |       |
##  |       |   1   |   0   |   0   |   1   |   0   |   0   |
##  +-------+-------+-------+-------*-------+-------+-------+
##  |       |   4   |  -4   |  -4   |   4   |   0   |   0   |
##  |   D   |       |       |       |       |       |       |
##  |       |   0   |   1   |   1   |   0   |   0   |   0   |
##  +-------+-------+-------+-------+-------*-------+-------+
##  |       |  -2   |  -2   |   0   |   0   |   2   |   0   |
##  |   H   |       |       |       |       |       |       |
##  |       |  0.5  |  0.5  |   0   |   0   |   0   |   0   |
##  +-------+-------+-------+-------+-------+-------*-------+
##  |       |   0   |   0   |   0   |   0   |   0   |   0   |
##  |   -   |       |       |       |       |       |       |
##  |       |   0   |   0   |   0   |   0   |   0   |   0   |
##  +-------+-------+-------+-------+-------+-------+-------*
##
##################################################################
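The scoring matrix translates directly into two lookup tables. The sketch below sums the BIT and REC entries over all positions of two markers and reproduces the scoring examples. It is an illustration, not the MadMapper_RECBIT source: the function name `score_pair` is hypothetical, and the REC fraction is divided by the total number of positions as in the examples (2/20), since the source does not say how missing scores affect that denominator.

```python
GENOTYPES = "ABCDH-"  # column order of the scoring matrix

BIT = {
    "A": [6, -6, -4, 4, -2, 0],
    "B": [-6, 6, 4, -4, -2, 0],
    "C": [-4, 4, 4, -4, 0, 0],
    "D": [4, -4, -4, 4, 0, 0],
    "H": [-2, -2, 0, 0, 2, 0],
    "-": [0, 0, 0, 0, 0, 0],
}
REC = {
    "A": [0, 1, 1, 0, 0.5, 0],
    "B": [1, 0, 0, 1, 0.5, 0],
    "C": [1, 0, 0, 1, 0, 0],
    "D": [0, 1, 1, 0, 0, 0],
    "H": [0.5, 0.5, 0, 0, 0, 0],
    "-": [0, 0, 0, 0, 0, 0],
}


def score_pair(marker_a, marker_b):
    """Return (BIT score, REC score, REC fraction) for two markers."""
    bit = sum(BIT[a][GENOTYPES.index(b)] for a, b in zip(marker_a, marker_b))
    rec = sum(REC[a][GENOTYPES.index(b)] for a, b in zip(marker_a, marker_b))
    return bit, rec, rec / len(marker_a)


print(score_pair("A" * 20, "A" * 18 + "BB"))        # positive linkage
print(score_pair("A" * 20, "A" * 10 + "B" * 10))    # no linkage
print(score_pair("A" * 20, "AA" + "B" * 18))        # negative linkage
```

Unlike the REC fraction, the BIT score grows with the number of informative lines, and strongly negative values flag markers in negative linkage, such as markers scored in the opposite phase.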

MadMapper_RECBIT Dataflow: Input and Output files

Genetic Map visualization using CheckMatrix: Two dimensional heat plot of recombination scores between all pairs of markers

detection of problematic marker

Inference of linear order of markers using MadMapper_XDELTA

MadMapper_RECBIT, MadMapper_XDELTA and CheckMatrix: Python programs to infer orders of genetic markers and for visualization and validation of genetic maps and haplotypes (detailed description of dataflow)

http://cgpdb.ucdavis.edu/XLinkage/MadMapper/

Alexander Kozik and Richard Michelmore. UC Davis Genome Center

General procedure to construct a genetic map using the MadMapper suite:

1 – Grouping of markers using MadMapper_RECBIT

2 – Selection of up to ten core markers per linkage group

3 – Construction of frame map using core markers by checking all possible combinations

4 – Best-fit extension for remaining markers (the optional shuffle/ripple function can dramatically improve map quality; however, it increases the time for map construction)

5 – Visualization of constructed map using CheckMatrix

6 – Examination of MadMapper_XDELTA text output files

7 – Attempt to re-map markers (if required) that do not fit well into major framework

8 – Construction, visualization and examination of final map

Once the large framework map is constructed, adding new markers does not require changing the order of core markers and can be done relatively fast. In this case, the framework map is used with a fixed order to find the best positions for new markers.

Analysis of MadMapper_RECBIT text output files provides:

1 – assignment of markers to particular linkage groups

2 – sorting of markers into different quality groups

3 – detection and discrimination of mis-scored markers

4 – selection of high quality markers to build core map

5 – creation of non-redundant set of markers for further map construction

True Bin

Trio-analysis helps reveal markers that were most likely mis-scored and should be dropped from further analysis

Side-by-side comparison of scores (JoinMap LOD, JoinMap recombination, MadMapper BIT and MadMapper haplotype distances – REC)

Best-Fit Extension: On each iteration of the best-fit extension, the proper position for the newly added marker corresponds to the two-dimensional matrix with the lowest entropy

Building of framework map: The number of comparisons that have to be done to check all possible orders of markers:

# of markers – # of comparisons
3 markers – 3
4 markers – 12
5 markers – 60
6 markers – 360
7 markers – 2,520
8 markers – 20,160
9 markers – 181,440
10 markers – 1,814,400
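The counts in this table equal n!/2, which is what one would expect if an order and its mirror image are counted as the same map. The sketch below checks that formula against the table and confirms it by direct enumeration for small n; both function names are hypothetical.

```python
from itertools import permutations
from math import factorial


def unique_orders(n):
    """Number of distinct marker orders when reversed orders are equivalent."""
    return factorial(n) // 2


def count_by_enumeration(n):
    """Count permutations, keeping one representative of each mirror pair."""
    return sum(1 for p in permutations(range(n)) if p[0] < p[-1])


for n in range(3, 11):
    print(f"{n} markers: {unique_orders(n):,}")
```

The factorial growth explains the limit of about ten markers for the exhaustive frame search: each extra marker multiplies the work by its position in the sequence, so 11 markers would already need close to 20 million comparisons.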

Locus file with raw marker scores is used as initial input for MadMapper_RECBIT program

Input files for MadMapper_XDELTA are usually output files from MadMapper_RECBIT

Iterations of clustering with incremental cutoff values

Arbitrary group ID after each round of clustering

Allele composition/distortion [ excess of ‘B’ alleles in this particular case ]

Marker ID

Marker map position or relative order [ order in this particular case ]

High density of markers

Low density of markers

lowest score for the best order of markers is highlighted in red

confidence class for correct marker position [ LARGE is bad ]

small absolute difference is good, large is bad

separation of markers into two distinct linkage groups

information about framework markers

missing scores may create some problems when defining BINs

MadMapper BIT Scoring Matrix

examples of BIT scoring

pairwise distance matrix

… continue until all markers are inserted and ordered

Linkage group labels: LG_1 LG_2 LG_3 LG_4 LG_5

flanking marker 1

flanking marker 2

middle marker

MadMapper_XDELTA works with non-redundant set of scores

framework markers are highlighted in red

negative linkage between markers

Locus file with raw marker scores: each allele is scored as ‘A’ or ‘B’

Marker ID

Generation of segregating population: Collection (set) of Recombinant Inbred Lines after several steps of self-pollination

Genotyping – assignment of a particular allele score to each marker

It is a long process from obtaining a set of recombinant inbred lines (RILs) to its genotyping with a thousand markers or more. Management, data processing, and genetic mapping of thousands of markers simultaneously is not a trivial task. The MadMapper suite and CheckMatrix programs simplify genetic marker data manipulation and analysis. The suite has some features other genetic programs may lack. MadMapper and CheckMatrix perform well on large scale sets of genotyping data, such as data derived from SFP (single feature polymorphism) microarray analysis. Only one input file is required to accomplish map construction: the locus file with raw marker scores. However, there are several major steps and dozens of output files in the MadMapper pipeline. Understanding of the purpose of each step and output file is required for successful genetic mapping. This poster describes details of the dataflow.

Data source: http://elp.ucdavis.edu/

West MA, van Leeuwen H, Kozik A, Kliebenstein DJ, Doerge RW, St Clair DA, Michelmore RW. High-density haplotyping with microarray-based expression and single feature polymorphism markers in Arabidopsis. Genome Res. 2006 Jun;16(6):787-795. [ PubMed:16702412 ]

Example Project: Construction of high-density genetic map of Arabidopsis thaliana linkage group 1 based on Affymetrix microarray SFP genotyping data using MadMapper

STEPS 1-2: Marker grouping and selection of framework markers

STEPS 3-4-5: Map construction and visualization with CheckMatrix

Comparison of inferred order with physical location of genes on Arabidopsis genome:

Graphical genotyping:

Graphical genotyping - RILs are grouped and sorted according to their haplotype patterns:

Framework markers are highlighted in red

sorting and grouping of RILs