stickleback seg dup analysis


Upload: brilliant

Post on 01-Feb-2016



TRANSCRIPT

Page 1: Stickleback Seg Dup Analysis

Stickleback Seg Dup Analysis

1. Genome

2. Parameters for Pipeline

3. Analysis

4. Files and images are at http://eichlerlab.gs.washington.edu/help/linchen/stickleback/sticklebackwgac.html

5. The data are in directory http://eichlerlab.gs.washington.edu/help/linchen/stickleback/data/

Page 2: Stickleback Seg Dup Analysis

Stickleback Genome

• The genome (v1.0) was downloaded from UCSC.

• Total length is 463,354,448 bp, which includes a chrUn of 62,550,211 bp.

• A total of 29,101 gene annotations from the Ensembl gene annotation were downloaded from UCSC.

Page 3: Stickleback Seg Dup Analysis

Seg Dup detection pipelines

• WGAC detects Seg Dups in a genomic assembly by looking for homologous pairs (>1 kb in length, >90% identity).

• WSSD detects Seg Dups in given sequences based on the depth of coverage of WGS (whole-genome shotgun) reads: depth of coverage > average + 3 SD.
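The two criteria can be sketched as simple predicates. This is only an illustration of the thresholds quoted on the slide; the `Alignment` record and the function names are hypothetical and are not taken from the actual pipelines.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    """Hypothetical record for one pairwise alignment between two segments."""
    length_bp: int    # aligned length in base pairs
    identity: float   # fraction of identical bases, 0.0 to 1.0


def is_wgac_pair(aln, min_length_bp=1000, min_identity=0.90):
    """WGAC criterion from the slide: longer than 1 kb and more than 90% identical."""
    return aln.length_bp > min_length_bp and aln.identity > min_identity


def is_wssd_window(read_depth, mean_depth, sd_depth):
    """WSSD criterion from the slide: read depth above average + 3 SD."""
    return read_depth > mean_depth + 3 * sd_depth
```

For example, `is_wgac_pair(Alignment(1500, 0.95))` is true, while an 800 bp alignment fails the length cutoff regardless of its identity.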

Page 4: Stickleback Seg Dup Analysis

Parameters and Notes for WGAC pipeline

• Repeats

– Standard repeat coordinates were reverse-generated from the soft-masked data.

– The secondary RepeatMasker run was done using two repeat libraries, ab_initio_lib.txt and supplemental_lib.txt.

– RepeatMasker results for all three libraries were combined and sorted, then used for both pipelines.

• BLAST parsing seeds in the WGAC pipeline

– The seed size is 500 bp.

Page 5: Stickleback Seg Dup Analysis

Result from WGAC Pipeline

• Total pairs of SD detected (>1 kb and >90% identity): 152,272

• Inter-chromosomal pairs: 63,744

• Intra-chromosomal pairs: 88,528

• chrUn intra: 81,641

• chrUn inter and intra: 123,278

• Total NR: 40,573,574 bp

Notes:

• In general, the number of WGAC pairs is too high (10%) for the stickleback genome, which is only about 400 Mb.

• 92% of the intra-chromosomal WGAC pairs and 81% of all pairs have at least one sequence of the pair on chrUn. This is expected, since chrUn contains a high percentage of redundant, poorly assembled sequences.

• Our analysis also suggests that potential repeats not covered by the repeat libraries may also be detected as WGAC pairs (see the next slide).
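The percentages in the notes follow directly from the counts on this slide. The short check below reproduces them; the variable names are illustrative, and the genome length is the total quoted on the Stickleback Genome slide.

```python
# Counts as reported on the slide.
total_pairs = 152272
inter_pairs = 63744
intra_pairs = 88528
chrun_intra = 81641      # intra-chromosomal pairs on chrUn
chrun_any = 123278       # pairs (inter or intra) involving chrUn
total_nr_bp = 40573574   # non-redundant WGAC bases
genome_bp = 463354448    # total genome length

# Inter and intra pairs add up to the total.
assert inter_pairs + intra_pairs == total_pairs

intra_on_chrun = chrun_intra / intra_pairs      # about 0.92
pairs_touching_chrun = chrun_any / total_pairs  # about 0.81
nr_fraction = total_nr_bp / genome_bp           # about 0.088

print(f"intra pairs on chrUn:      {intra_on_chrun:.1%}")
print(f"pairs involving chrUn:     {pairs_touching_chrun:.1%}")
print(f"WGAC NR fraction of genome: {nr_fraction:.1%}")
```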

Page 6: Stickleback Seg Dup Analysis

Repeats?

• Since repeats might be an issue, I set up a filter to determine how many WGAC pairs may be affected. With >20 hits, 400 bp on the boundary, and hit length <10 kb, 30% of WGAC pairs are affected. With >10 hits, 400 bp boundary overlap, and hit length <10 kb, 60% of WGAC pairs are affected.

• I then generated the NR space of these hits. They total 7,481,640 bp from 103,157 pairs, out of the full WGAC set (152,272 pairs, 40,573,574 bp NR). That is about 2/3 of the hits but only about 1/5 of the total NR space.

• I think this is reasonable, because the high proportion of WGAC pairs affects only a small proportion of the NR space.

• These sequence intervals should also be detected by WSSD if they are repeats.

• However, I have not yet removed them from AllDup (the merge of WGAC and WSSD), because many of them have high-frequency hits on chrUn. At this stage we do not know whether they are redundant sequences or real Seg Dups, but we can pull them out at any time based on their coordinates.

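The "NR space" of a set of hits is the number of bases covered after overlapping intervals are merged, which is why many pairs can collapse into a small footprint. A minimal sketch of that computation follows, assuming half-open `(chrom, start, end)` tuples; the function names and the toy coordinates are made up for illustration and are not the pipeline's own code or data.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open (chrom, start, end) intervals."""
    merged = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Same chromosome and overlapping the previous interval: extend it.
            merged[-1][2] = max(merged[-1][2], end)
        else:
            merged.append([chrom, start, end])
    return [tuple(m) for m in merged]


def nr_bases(intervals):
    """Total non-redundant bases covered by the intervals."""
    return sum(end - start for _, start, end in merge_intervals(intervals))


# Toy example: four hits collapse into three non-redundant intervals.
hits = [
    ("chrI", 100, 500),
    ("chrI", 400, 900),
    ("chrI", 2000, 2500),
    ("chrUn", 0, 300),
]
print(merge_intervals(hits))
print(nr_bases(hits))
```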

Page 7: Stickleback Seg Dup Analysis

General analysis of WGAC length and identity distribution

[Figure: length distribution. x-axis: length (1 kb to 50 kb); y-axis: total (bp); series: inter, intra]

[Figure: identity distribution. x-axis: identity; y-axis: total (bp); series: inter, intra]

1. The length distribution peaks at <3 kb; intra > inter, with 92% of intra on chrUn.

2. The identity distribution peaks at 96%. Few pairs are higher than 99%.

Page 8: Stickleback Seg Dup Analysis

General analysis: NR distribution on chromosomes. High SD in chrUn

[Figure: NR length on chromosome. x-axis: chromosome (chrI to chrXXI, plus chrUn); y-axis: total (bp); series: inter, intra, both]

[Figure: percentage of Dup NR relative to chromosome. x-axis: chromosome (chrI to chrXXI, plus chrUn); y-axis: percent; series: inter, intra, both]

Page 9: Stickleback Seg Dup Analysis

General view showing all WGAC on all chromosomes

Concentration of SD on smaller supercontigs on chrUn

Page 10: Stickleback Seg Dup Analysis

Global image showing the inter- and intra-chromosomal pairs of 5 kb and above 90% identity, without chrUn. Red indicates inter-chromosomal pairs and blue indicates intra-chromosomal pairs.

Page 11: Stickleback Seg Dup Analysis

Global image showing the inter- and intra-chromosomal pairs of 10 kb and 90% identity, without chrUn. Red indicates inter-chromosomal pairs and blue indicates intra-chromosomal pairs.

Page 12: Stickleback Seg Dup Analysis

Global image showing the inter- and intra-chromosomal WGAC pairs of 10 kb and 90% identity. chrUn is also included. Red indicates inter-chromosomal pairs and blue indicates intra-chromosomal pairs.

[Figure label: chrUn]

Page 13: Stickleback Seg Dup Analysis

WSSD analysis

• Downloaded the WGS reads (about 6 million).

• Downloaded the finished stickleback BACs. These BACs are used to determine the threshold for WGS depth of coverage. For a 5 kb window, the average number of reads is 78, with an SD of 27. The threshold is 125 for a 5 kb window and 25 for a 1 kb window (average + 3 SD).

• Repeat-masked the stickleback genome. I used the standard library, ab_initio_lib.txt, and supplemental_lib.txt. In addition, I added the potential repeats detected in the WGAC process, i.e. regions showing more than 20 hit pairs.
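The thresholding step can be sketched as follows: read counts per window over control sequence give a mean and SD, and any genomic window whose count exceeds mean + 3 SD is flagged. This is a minimal sketch under assumptions: the counts below are made up and are not the stickleback values, the function names are hypothetical, and the sample standard deviation is used here although the slides do not say which estimator the pipeline applies.

```python
from statistics import mean, stdev


def depth_threshold(control_counts, n_sd=3.0):
    """Threshold = mean + n_sd * SD of read counts in control windows."""
    return mean(control_counts) + n_sd * stdev(control_counts)


def windows_above(counts, threshold):
    """Indices of windows whose read count exceeds the threshold."""
    return [i for i, count in enumerate(counts) if count > threshold]


# Made-up read counts per window over control (unique) sequence.
control = [70, 80, 75, 85, 90]
threshold = depth_threshold(control)

# Made-up read counts per window along a genomic region.
genome_windows = [60, 110, 95, 150]
print(round(threshold, 1))
print(windows_above(genome_windows, threshold))
```

In practice the flagged windows would then be merged into regions and filtered by a minimum size, as with the 10 kb cutoff used for wssdGE10K_nogap.tab.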

Page 14: Stickleback Seg Dup Analysis

WSSD result

http://eichlerlab.gs.washington.edu/help/linchen/stickleback/data/wssd/

• A total of 729 regions covering 22,324,144 bp were found in wssdGE10K_nogap.tab (which has a 10 kb cutoff); 251 of them are on chrUn.

• wssd.tab contains 850 regions with 23,116,317 total bases. It has 125 more regions and less than 1 Mb of extra sequence compared with the 10 kb hits.

• A summary table of the WGAC intersect with WSSD is at http://eichlerlab.gs.washington.edu/help/linchen/stickleback/data/wgacCMPwssd.xls

Page 15: Stickleback Seg Dup Analysis

Union of WSSD and WGAC; gene intersect with Seg Dups

• First, a non-redundant union of WGAC and WSSD was generated (AllDup.tab).

• Genes were then intersected with AllDup to identify genes that overlap the Dup space in the genome. 3135 Ensembl genes were identified.

• Both data sets are at http://eichlerlab.gs.washington.edu/help/linchen/stickleback/data/
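The union, the intersect, and the gene overlap are all plain interval operations. The sketch below shows one way to express them, assuming half-open `(chrom, start, end)` tuples and `(name, chrom, start, end)` gene records; the function names and the toy coordinates are hypothetical, and the quadratic loops are chosen for readability rather than speed.

```python
def merge(intervals):
    """Merge overlapping or touching (chrom, start, end) intervals."""
    out = []
    for chrom, start, end in sorted(intervals):
        if out and out[-1][0] == chrom and start <= out[-1][2]:
            out[-1] = (chrom, out[-1][1], max(out[-1][2], end))
        else:
            out.append((chrom, start, end))
    return out


def union(a, b):
    """Non-redundant union of two interval sets."""
    return merge(a + b)


def intersect(a, b):
    """Segments covered by both interval sets."""
    out = []
    for chrom_a, start_a, end_a in merge(a):
        for chrom_b, start_b, end_b in merge(b):
            if chrom_a == chrom_b:
                start, end = max(start_a, start_b), min(end_a, end_b)
                if start < end:
                    out.append((chrom_a, start, end))
    return out


def genes_in(genes, dups):
    """Names of genes that overlap at least one duplicated interval."""
    return [name for name, chrom, start, end in genes
            if any(chrom == c and start < d_end and d_start < end
                   for c, d_start, d_end in dups)]


# Toy data, not taken from the stickleback results.
wgac = [("chrI", 100, 500), ("chrII", 0, 200)]
wssd = [("chrI", 300, 800)]
genes = [("g1", "chrI", 450, 600), ("g2", "chrI", 900, 1000), ("g3", "chrII", 150, 250)]

all_dup = union(wgac, wssd)
shared = intersect(wgac, wssd)
print(all_dup, genes_in(genes, all_dup))
print(shared, genes_in(genes, shared))
```

The union gives the larger gene list and the intersect the more conservative one, which mirrors the two gene sets reported for this analysis.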

Page 16: Stickleback Seg Dup Analysis

The general view of WGAC and WSSD on chromosome

WSSD: black, above the chromosome line

WGAC 5k 94%: black, below the chromosome line

WGAC 10k: brown, below the chromosome line

Page 17: Stickleback Seg Dup Analysis

Summary table 1

                 total       chrN        chrUn      No. nr interval   file
wssd (bp)        22324144    13574716    8749428    729               wssdGE10K_nogap.tab
wgac (bp)        40573574    21017679    19555895   7387              data/wgac/NRspace
AllDup (bp)      45608440    24390195    21218245   5934              data/allDup.tab
Genome (bp)      463354448   400804237   62550211
repeats ? (bp)   7481640     1741266     5740374                      data/repeathitMerge

Page 18: Stickleback Seg Dup Analysis

The intersect between WSSD and WGAC

chrom size allWGAC gt94WGAC_ge10K WSSD Shared gt94WGAC_ge10K_WGAConly WSSDonly <=94%WGAC <=94%WGAC+shared

chrI 28185914 1275120 315356 709840 195481 119875 514359 193013 388494

chrII 23295652 713095 144114 234515 77007 67107 157508 72943 149950

chrIII 16798506 1041842 435522 821969 389684 45838 432285 108184 497868

chrIV 32632948 2093860 476191 1589484 379805 96386 1209679 306309 686114

chrIX 20249479 1389579 610360 1004524 490770 119590 513754 100388 591158

chrUn 62550211 19483869 10809499 8749428 4789618 6019881 3959810 630260 5419878

chrV 12251397 591969 178851 393826 166869 11982 226957 50079 216948

chrVI 17083675 621495 177632 245111 128778 48854 116333 87014 215792

chrVII 27937443 1480355 521853 861056 469264 52589 391792 175038 644302

chrVIII 19368704 824600 245027 274801 119937 125090 154864 62353 182290

chrX 15657440 1274186 735451 1039477 611552 123899 427925 79609 691161

chrXI 16706052 1336848 499828 1152246 474664 25164 677582 149606 624270

chrXII 18401067 1002589 455231 721761 436954 18277 284807 91092 528046

chrXIII 20083130 1001618 315089 508170 174381 140708 333789 93504 267885

chrXIV 15246461 472042 95357 221539 60401 34956 161138 53894 114295

chrXIX 20240660 918086 240950 635973 212904 28046 423069 83718 296622

chrXV 16198764 578995 173468 303978 101413 72055 202565 64444 165857

chrXVI 18115788 1216252 462619 810223 375762 86857 434461 165325 541087

chrXVII 14603141 278408 54942 45597 24201 30741 21396 21509 45710

chrXVIII 16282716 827757 320585 572537 273969 46616 298568 78890 352859

chrXX 19732071 916472 277129 556990 193012 84117 363978 147507 340519

chrXXI 11717487 1062424 717376 871099 665531 51845 205568 58839 724370

total 463354448 40417128 18278097 22324144 10811957 7466140 11512187 2873518 13685475
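The last columns of this table appear to be derived from the earlier ones: WGAConly and WSSDonly are the respective totals minus the shared bases, and the final column adds the shared bases to the <=94% WGAC bases. These relations are inferred from the numbers rather than stated on the slide; the sketch below checks them against the chrI and total rows, with a hypothetical helper name.

```python
def derived_columns(gt94_wgac_ge10k, wssd, shared, le94_wgac):
    """Recompute the derived columns from the four base columns (all in bp)."""
    return {
        "gt94WGAC_ge10K_WGAConly": gt94_wgac_ge10k - shared,
        "WSSDonly": wssd - shared,
        "<=94%WGAC+shared": le94_wgac + shared,
    }


# Base columns copied from the chrI row and the total row of the table.
chrI = derived_columns(315356, 709840, 195481, 193013)
total = derived_columns(18278097, 22324144, 10811957, 2873518)

print(chrI)
print(total)
```

Both rows reproduce the values in the table, which supports reading "Shared" as the bases found by both the >94%, >=10 kb WGAC set and WSSD.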

Page 19: Stickleback Seg Dup Analysis

Summary

• Stickleback Seg Dups have been detected using two independent pipelines, WGAC and WSSD. Since each pipeline is based on its own mechanism, we expect the majority of the intervals to be consistent, with some variation. From the results of the two pipelines, two sets of genomic intervals were generated for Seg Dups.

– The first set consists of the genomic intervals detected by both WGAC and WSSD, i.e. the intersect of the WGAC and WSSD intervals. This set represents the most conservative estimate of Seg Dups in the genome. http://eichlerlab.gs.washington.edu/help/linchen/stickleback/data/wssd_wgac_intersect

– The second set is the union of the WGAC and WSSD intervals (AllDup.tab), which represents the largest estimate of Seg Dups in the genome. http://eichlerlab.gs.washington.edu/help/linchen/stickleback/data/allDup.tab

– A list of genes intersecting each set was also generated.

• With AllDup (the union of WGAC and WSSD), there are 3153 genes in total. http://eichlerlab.gs.washington.edu/help/linchen/stickleback/data/gene_in_alldup

• With the intersect of WGAC and WSSD, there are 1267 genes in total. http://eichlerlab.gs.washington.edu/help/linchen/stickleback/data/gene_in_wssd_wgac_intersect

• A list of intervals with the potential to be repeats was also generated. These are regions with a high frequency of hits within a defined boundary (>10 hits, <400 bp at the boundary, <10 kb in length). They account for >60% of total WGAC pairs and 1/5 of the WGAC NR intervals. http://eichlerlab.gs.washington.edu/help/linchen/stickleback/data/repeathitMerge

• ChrUn contigs contribute a great deal to the total SD in both WGAC and WSSD. The identity distribution analysis shows that the identity of most pairs is less than 99%, suggesting that they may contain true SDs that are hard to assemble. How many of them are true SDs remains to be determined.