encode pseudogene call summary

ENCODE Pseudogene Call Summary. Mark Gerstein, 2005,10.27 11:00 EDT (Draft for G&T call on 2005,10.28 10:00 EDT)

Upload: ceana

Post on 09-Jan-2016



TRANSCRIPT

Page 1: ENCODE  Pseudogene Call Summary

ENCODE Pseudogene Call Summary

Mark Gerstein

2005,10.27 11:00 EDT

(Draft for G&T call on 2005,10.28 10:00 EDT)

Page 2: ENCODE  Pseudogene Call Summary

Pseudogene group

Core people: Jennifer Harrow <[email protected]>, WEI Chia-Lin <[email protected]>, Adam Frankish <[email protected]>, "Dike, Sujit" <[email protected]>, Robert Baertsch <[email protected]>, [email protected], Deyou Zheng <[email protected]>, Yontao Lu <[email protected]>, [email protected], [email protected]

Others: "Hoyem, Tara L" <[email protected]>, Roderic Guigo Serra <[email protected]>, "Gingeras, Tom" <[email protected]>, [email protected], Suganthi Balasubramanian <[email protected]>

6 Calls: Sept. 15, 22; Oct. 6, 13, 20, 27

Page 3: ENCODE  Pseudogene Call Summary

Refresher: many repetitions of the below “Venn analysis”

[Venn diagram of three pseudogene sets]

Havana-Gencode: 165 pseudogenes (167 - 2)
Yale: 167 pseudogenes (164 + 3)
UCSC retrogenes: 146 not expressed

Region counts as extracted from the diagram: 81 (34); 16 (7); 33 (1); 15 (1); 17 (2); 16 (0); 54 (2)

7 Havana agrees to be added (8, 11, 40, 59, 139, 152, 169).
4 at coding loci. [Yale agrees to delete]
1 with weak sequence identity.*
5 with “non-real” proteins.*

9 Havana agrees to be added.
2 at coding loci. [Yale agrees to delete]
1 with weak sequence identity.*
2 with “non-real” proteins.*

* Solved by consistent protein set & threshold

Numbers according to Adam’s note

Page 4: ENCODE  Pseudogene Call Summary

A proposal for qualified union with uniform criteria for boundaries

1. Identify a “good” set of human proteins – HAVANA set?
2. Remove pseudogenes (from all 4 groups) overlapping with current GENCODE exons (does GENCODE have an updated version?).
3. Create a union of the remaining pseudogenes.
4. Find the “best” matching proteins for each pseudogene, remove entries without a BLAST hit (e-value cutoff issue?).
5. Realign each pseudogene to its parent protein to produce a uniform alignment and to define the start and end coordinates.
6. Apply a threshold to sequence identity and coverage? (No.)
7. Classify pseudogenes into processed and non-processed (how?)

Overall 222 pseudogenes
Application of above recipe gives 198 Consensus
Intersection set of above is 81 (proc) + 49 (non-proc)

on browser + ENCODE wiki + http://pseudogene.org/ENCODE

From Deyou Z. + Robert B.
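Steps 2 and 3 of the proposal (drop calls that overlap exons, then take a union of what remains) can be sketched roughly as below. This is only an illustration under stated assumptions: the interval representation, the function names `remove_exon_overlaps` and `union_of_calls`, the rule that any shared base counts as an overlap, and the toy coordinates are all hypothetical and are not taken from the groups' actual pipelines.

```python
from typing import List, Tuple

# Assumed representation: (chromosome, start, end) with half-open coordinates.
Interval = Tuple[str, int, int]


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two half-open intervals on the same chromosome share a base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def remove_exon_overlaps(calls: List[Interval], exons: List[Interval]) -> List[Interval]:
    """Step 2 (sketch): drop any pseudogene call that overlaps an annotated exon."""
    return [c for c in calls if not any(overlaps(c, e) for e in exons)]


def union_of_calls(calls: List[Interval]) -> List[Interval]:
    """Step 3 (sketch): merge overlapping calls from all groups into one locus each."""
    merged: List[Interval] = []
    for chrom, start, end in sorted(calls):
        if merged and merged[-1][0] == chrom and start < merged[-1][2]:
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


# Hypothetical toy data, not real ENCODE coordinates.
exons = [("chr1", 100, 200)]
group_a = [("chr1", 150, 400), ("chr1", 1000, 1500)]
group_b = [("chr1", 1200, 1800), ("chr2", 50, 90)]

kept = remove_exon_overlaps(group_a + group_b, exons)
consensus_loci = union_of_calls(kept)
print(consensus_loci)
```

In this toy case the call overlapping the exon is dropped and the two overlapping chr1 calls collapse into one locus. The boundary refinement in step 5 (realignment to the parent protein) is not modelled here.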

Page 5: ENCODE  Pseudogene Call Summary


Insertion into processed pseudogene

heterogeneous nuclear ribonucleoprotein A1 (HNRPA1) pseudogene (parent on Chr12)

NADH dehydrogenase 2 (MTND2) pseudogene (parent mitochondrial)

NADH dehydrogenase 4 (MTND4) pseudogene (parent mitochondrial)

cytochrome b (CYTB) pseudogene (parent mitochondrial)

First insertion event

Remnant of a second, mitochondrial insertion event (has post-insertion deletions)

Protein evidence

From Adam F.

Page 6: ENCODE  Pseudogene Call Summary


Rearranged exon order in unprocessed pseudogene

Protein evidence

adaptor-related protein complex 1, beta 1 subunit (AP1B1) pseudogenes

Exon 6 Exon 3

Splice sites same as parent gene

Dot plot protein evidence vs genome

Following duplication of the AP1B1 locus, rearrangements/duplications have produced two unprocessed pseudogenes corresponding to exons 6 and 3 of the parent gene

From Adam F.

Page 7: ENCODE  Pseudogene Call Summary


Rearrangement of processed pseudogene

pseudogene similar to part of ribosomal protein L3 (RPL3)

Protein dot plot

mRNA dot plot

Following insertion, one end of the RPL3 pseudogene has been flipped onto the opposite strand (with some loss of internal sequence)

From Adam F.

Page 8: ENCODE  Pseudogene Call Summary

Transcription among 198 consensus pseudogenes

- Nb overlapped by interrogated regions (Affy arrays): 180 (90.9%)
- Nb overlapped by Yale TARs or Affy transfrags (union): 106 (53.5% of all; 58.9% of interrogated)
=> There is evidence of transcription (from TARs or transfrags) of the pseudogene or the parent gene (if cross-hybridization) for 53.5% of the consensus pseudogenes
- Nb overlapped by CAGE tags: 11 (5.5%)
- Nb overlapped by ditag tags: 1 (0.5%) (83 (41.9%) are overlapped by full-length ditags)

From France D.
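The two kinds of percentage quoted on this slide ("of all" uses the 198 consensus pseudogenes as denominator, "of interrogated" uses the 180 covered by the arrays) can be recomputed with a few lines. The helper name `pct` and the choice of rounding to one decimal place are assumptions made for this sketch; the slide does not say how the figures were rounded.

```python
def pct(count: int, total: int) -> float:
    """Percentage of `count` in `total`, rounded to one decimal place."""
    return round(100.0 * count / total, 1)


CONSENSUS = 198          # all consensus pseudogenes
INTERROGATED = 180       # overlapped by interrogated regions
TAR_OR_TRANSFRAG = 106   # overlapped by Yale TARs or Affy transfrags (union)

print(pct(INTERROGATED, CONSENSUS))         # share of all that is interrogated
print(pct(TAR_OR_TRANSFRAG, CONSENSUS))     # "% of all"
print(pct(TAR_OR_TRANSFRAG, INTERROGATED))  # "% of interrogated"
```

With this rounding the three values match the slide (90.9, 53.5, 58.9). The CAGE line (11 of 198) comes out at 5.6 rather than the 5.5 shown, which would be consistent with that figure having been truncated instead of rounded.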

Page 9: ENCODE  Pseudogene Call Summary


Pseudogene overlapped by tars/transfrags and ditags: ENCODE_consensus_187

93% similar to parent

From France D.

Page 10: ENCODE  Pseudogene Call Summary

Overlaps by TAR/transfrag subset

- Nb overlapped by interrogated regions (Affy arrays): 180 (90.9%)
- Nb overlapped by Yale TARs or Affy transfrags (union): 106 (53.5% of all; 58.9% of interrogated)
- Nb overlapped by Yale TARs (union): 84 (42.4% of all; 46.7% of interrogated)
- Nb overlapped by Affy transfrags (union): 102 (51.5% of all; 56.7% of interrogated)
- Nb overlapped by polyA+ TARs/transfrags (union): 105 (53% of all; 58.3% of interrogated)
- Nb overlapped by total RNA TARs (union): 61 (30.8% of all; 33.9% of interrogated)

From France D.
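The counts above give the Yale and Affy sets and their union, so the number of pseudogenes supported by both follows from inclusion-exclusion. The sketch below assumes that the 106 "union" figure is exactly the union of the 84 Yale and 102 Affy pseudogenes; the function name `overlap_of_two` is hypothetical.

```python
def overlap_of_two(n_a: int, n_b: int, n_union: int) -> int:
    """Inclusion-exclusion for two sets: |A and B| = |A| + |B| - |A or B|."""
    return n_a + n_b - n_union


YALE_TARS = 84
AFFY_TRANSFRAGS = 102
UNION = 106  # assumed to be the union of exactly these two sets

both = overlap_of_two(YALE_TARS, AFFY_TRANSFRAGS, UNION)
yale_only = YALE_TARS - both
affy_only = AFFY_TRANSFRAGS - both
print(both, yale_only, affy_only)  # 80 4 22
```

Under that assumption, 80 pseudogenes are overlapped by both data sets, 4 by Yale TARs only and 22 by Affy transfrags only.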

Page 11: ENCODE  Pseudogene Call Summary

ENCODE pseudogenes expression

• ENCODE pseudogenes from the intersection part of consensus set
– 49 non-processed, 125 processed
• Designed oligos (25mer, Tm 70°C)
– Either specific to pseudogene or shared between parental gene and pseudogene

From Alex R.

Page 12: ENCODE  Pseudogene Call Summary

ENCODE pseudogenes expression #2

• 5’RACE in 12 human tissues
– Brain, heart, kidney, spleen, liver, colon, sm. intestine, muscle, lung, stomach, testis, placenta
– First 96 pseudogenes 5’RACEs done in 12 tissues
– Last 78 will be done next week
• To do: pool multiple RACEs, send to Santa Clara and hybridize to Affymetrix ENCODE 20 nucleotide resolution arrays

Stylianos Antonarakis, Robert Baertsch, Jorg Drenkow, Tom Gingeras, Charlotte Henrichsen, Philipp Kapranov, Catherine Ucla, Alexandre Reymond

Affymetrix, UCSC, University of Geneva, University of Lausanne

From Alex R.

Page 13: ENCODE  Pseudogene Call Summary

Expression from pseudogene locus (1) – putative novel transcript

HAVANA sialyltransferase pseudogene (RP3-477O4.5) supported by protein evidence

Putative novel transcript supported by a single EST which has a polyA site and signal

Supporting EST (100% ID)

Aligned proteins (column collapsed)

polyA site and signal

Appears to be some transcription from this locus which is supported at the 3’ end by a single EST

From Adam F.

Page 14: ENCODE  Pseudogene Call Summary

Expression from pseudogene locus (2) – 5’ UTR of known gene

Frameshift

LILRA3

LILR pseudogene

Upstream pseudogene corresponds to exons 1-3 of LILR family genes; the 3’ exons have been lost. EST evidence supports expression from the pseudogene locus extending to known gene LILRA3.

From Adam F.

Page 15: ENCODE  Pseudogene Call Summary

Intersect Consensus Pseudogenes with ChIP-chip Hits

Factors            E2F      H3K4me3 (0h)  H3K4me3 (30h)  Sp3       STAT1
Group              UCDavis  UCSD          UCSD           Stanford  Yale
Total Hits         400      1000          1000           400       400
Known Genes (405)  145      149           154            86        15
Pseudogenes (198)  4        25            24             3         7

From Deyou Z.

Page 16: ENCODE  Pseudogene Call Summary

Consensus Pseudogenes with ≥2 ChIP-chip Hits

Pgene-ID  Pgene-type     E2F  H3K4me3 (0h)  H3K4me3 (30h)  Sp3  STAT1
13        Processed      0    1             1              0    0
45        Processed      0    1             1              0    0
47        Processed      0    1             1              0    0
77        Processed      1    1             1              0    0
126       Processed      0    1             1              0    0
149       Processed      1    1             1              0    0
174       Non-Processed  0    1             1              0    0
[ 177 ]   Non-Processed  1    1             1              0    0
187       Processed      0    1             1              0    0
193       Processed      0    0             0              1    1

Has transcriptional evidence (intersects Gencode transcript)

From Deyou Z.
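The "≥2 hits" selection behind this table amounts to summing the per-factor flags for each pseudogene and keeping those at or above a threshold. The sketch below re-encodes the ten rows shown on the slide; the dictionary layout, the factor labels used as keys and the function name `with_min_hits` are assumptions made for illustration, not the original analysis code.

```python
FACTORS = ("E2F", "H3K4me3_0h", "H3K4me3_30h", "Sp3", "STAT1")

# Pgene-ID -> (type, hit flags in FACTORS order), transcribed from the slide.
ROWS = {
    13: ("Processed", (0, 1, 1, 0, 0)),
    45: ("Processed", (0, 1, 1, 0, 0)),
    47: ("Processed", (0, 1, 1, 0, 0)),
    77: ("Processed", (1, 1, 1, 0, 0)),
    126: ("Processed", (0, 1, 1, 0, 0)),
    149: ("Processed", (1, 1, 1, 0, 0)),
    174: ("Non-Processed", (0, 1, 1, 0, 0)),
    177: ("Non-Processed", (1, 1, 1, 0, 0)),
    187: ("Processed", (0, 1, 1, 0, 0)),
    193: ("Processed", (0, 0, 0, 1, 1)),
}


def with_min_hits(rows, min_hits=2):
    """Return sorted pseudogene IDs whose number of ChIP-chip hits is >= min_hits."""
    return sorted(pid for pid, (_, hits) in rows.items() if sum(hits) >= min_hits)


# Number of listed pseudogenes hit by each factor.
per_factor = {
    name: sum(hits[i] for _, hits in ROWS.values())
    for i, name in enumerate(FACTORS)
}

print(with_min_hits(ROWS, 2))  # all ten listed pseudogenes
print(with_min_hits(ROWS, 3))  # [77, 149, 177]
print(per_factor)
```

Raising the threshold to three hits leaves pseudogenes 77, 149 and 177, the ones with an E2F hit in addition to both H3K4me3 time points.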

Page 17: ENCODE  Pseudogene Call Summary


Example Pseudogene with Binding Hits (#177)

From Deyou Z.