Computational Analysis of Transcript
Identification Using GenBank
Slides by Terry Clark
Differentiation of hematopoietic cells
Pluripotent stem cell
Myeloid: Erythrocyte, Platelet, Monocyte, Neutrophil, Eosinophil, Basophil
Lymphoid: B cell, T cell
Genome-wide gene expression
[Figure: number of expressed genes by level of expression. Expression levels shown: < 5, 5 to 50, and > 500 mRNA / cell. Gene counts shown: 100, 9,000, and 900.]
SAGE (Serial Analysis of Gene Expression)
Figure 1. Schematic illustration of the SAGE process (Jes Stollberg et al. Genome Res. 2000; 10: 1241-1248)
Steps shown: mRNA/cDNA; isolate SAGE tags; link tags together and sequence; gene identification
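The tag-isolation step can be mimicked in software when building a reference tag set from known sequences. The following is a minimal sketch, assuming the common SAGE convention that the tag is the 10 bases immediately downstream of the 3'-most CATG (NlaIII) site. The slides do not state the anchoring enzyme, and the function name `extract_sage_tag` is hypothetical.

```python
from typing import Optional


def extract_sage_tag(cdna: str, anchor: str = "CATG", tag_len: int = 10) -> Optional[str]:
    """Return the tag that follows the 3'-most anchor site, or None.

    Assumption (not from the slides): the anchoring site is CATG and the
    tag is the tag_len bases directly downstream of its last occurrence.
    """
    seq = cdna.upper()
    pos = seq.rfind(anchor)          # 3'-most occurrence of the anchor site
    if pos == -1:
        return None                  # no anchor site, so no tag
    start = pos + len(anchor)
    tag = seq[start:start + tag_len]
    # Reject truncated tags when the site is too close to the 3' end.
    return tag if len(tag) == tag_len else None


if __name__ == "__main__":
    # Toy sequence built around a tag that appears in the slides' table.
    print(extract_sage_tag("GGCATGAAACATGCCTGTAATCCAAAAAA"))
```

Only the last anchor site is used, so a sequence with several CATG sites still yields a single tag.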
SAGE & GLGI Overview
mRNA
SAGE: identify most of the expressed genes; quantitative analysis of expressed genes by collecting tags
GLGI: extend tags into longer 3' cDNAs
SPGI: collect cDNA clones
Gene identification against GenBank: multi-match, single-match, no match
What is the chance of duplicate tags?
• As a first model, assume tags are drawn uniformly at random from the set of all sequences of the given tag length over the 4-letter alphabet (A, C, G, T)
• This is the same problem as having unique overlaps in the contig matching problem for shotgun sequencing
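Under the uniform random model described above, the chance of duplicate tags is the classic birthday problem over a space of 4^k possible tags. The following is a minimal sketch of that calculation. The function names are hypothetical, and the uniformity assumption is exactly the simplification the model makes.

```python
import math


def tag_space(k: int) -> int:
    """Number of distinct k-base sequences over the alphabet A, C, G, T."""
    return 4 ** k


def prob_any_duplicate(n: int, k: int) -> float:
    """Probability that n uniformly drawn k-base tags contain a repeat."""
    m = tag_space(k)
    if n > m:
        return 1.0  # pigeonhole: more draws than possible tags
    # log of prod_{i=0}^{n-1} (1 - i/m), summed in log space for stability
    log_all_distinct = sum(math.log1p(-i / m) for i in range(n))
    return 1.0 - math.exp(log_all_distinct)


def expected_distinct(n: int, k: int) -> float:
    """Expected number of distinct tags seen after n uniform draws."""
    m = tag_space(k)
    return m * (1.0 - (1.0 - 1.0 / m) ** n)


if __name__ == "__main__":
    k = 10
    print("possible tags:", tag_space(k))
    for n in (100, 1_000, 10_000):
        print(n, prob_any_duplicate(n, k), expected_distinct(n, k))
```

For 10-base tags the space holds 4^10 = 1,048,576 sequences, so a repeat becomes likely once the number of draws approaches the square root of that space, roughly a thousand tags.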
Random Model
Random model does not reflect biological process
• Genes evolve by duplication as well as point mutation
• Many motifs are repeated
• Function widgets at work?
• Result is a strong bias in observed biological sequences, not a uniform distribution as the simple model hopes
• Here are some numbers ...
SAGE tags match to many genes (Tags from Hashimoto S, et al. Blood 94:837, 1999)
Tag  Number of matched genes  Matched genes (up to 10 shown)
CCTGTAATCC  405  Hs.267557, Hs.240615, Hs.231705, Hs.283045, Hs.236713, Hs.232277, Hs.181553, Hs.262716, Hs.181392, Hs.220696
GTGAAACCCC  305  Hs.282868, Hs.170225, Hs.184220, Hs.194021, Hs.231625, Hs.171830, Hs.270571, Hs.270572, Hs.272193, Hs.283921
CCACTGCACT  174  Hs.118778, Hs.256868, Hs.96023, Hs.31575, Hs.47517, Hs.200451, Hs.271222, Hs.253240, Hs.270018, Hs.270415
ACTTTTTCAA  44  Hs.16426, Hs.10669, Hs.75155, Hs.28166, Hs.13975, Hs.79136, Hs.111334, Hs.133430, Hs.79356, Hs.239100
TTGGGGTTTC  9  Hs.231375, Hs.273127, Hs.275603, Hs.175173, Hs.276612, Hs.224773, Hs.62954, Hs.182771, Hs.276326
TGCACGTTTT  8  Hs.199160, Hs.279943, Hs.36927, Hs.5338, Hs.169793, Hs.83450, Hs.173902, Hs.183506
TGTGTTGAGA  5  Hs.284136, Hs.275865, Hs.275221, Hs.274466, Hs.181165
CCCGTCCGGA  5  Hs.276353, Hs.277498, Hs.277573, Hs.276350, Hs.180842
TTGGTCCTCT  4  Hs.12328, Hs.108124, Hs.9739, Hs.112845
CTGACCTGTG  3  Hs.277477, Hs.181244, Hs.77961
TACCTGCAGA  3  Hs.100000, Hs.256957, Hs.253884
AGGCTACGGA  3  Hs.119122, Hs.211582, Hs.183297
GGGCTGGGGT  3  Hs.183698, Hs.118757, Hs.90436
CCCTGGGTTC  2  Hs.52891, Hs.111334
CACAAACGGT  2  Hs.2043, Hs.195453
GTGAAGGCAG  2  Hs.4221, Hs.77039
GGGCATCTCT  2  Hs.75061, Hs.76807
ATGGCTGGTA  2  Hs.254246, Hs.182426
CGCCGCCGGC  2  Hs.182825, Hs.132753
AGGGCTTCCA  2  Hs.29797, Hs.276544
TTGGTGAAGG  2  Hs.278674, Hs.75968
GTGGCCACGG  1  Hs.112405
GTTCACATTA  1  Hs.84298
TGGTGTTGAG  1  Hs.275865
CCCATCGTCC  1  Hs.151604
GTTGTGGTTA  1  Hs.75415
TTGTAATCGT  1  Hs.125078
CCCACAACCT  1  Hs.252136
GAGGGAGTTT  1  Hs.76064
CCAGAACAGA  1  Hs.111222
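The multi-match, single-match, and no-match outcomes can be illustrated with a small lookup structure. This is a sketch only, assuming a reference set of (tag, UniGene cluster) pairs is already available. The names `build_tag_index` and `classify_tag` are hypothetical, and the sample pairs are taken from the table above.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def build_tag_index(pairs: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Map each tag to the set of cluster IDs in which it occurs."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for tag, cluster in pairs:
        index[tag].add(cluster)
    return index


def classify_tag(tag: str, index: Dict[str, Set[str]]) -> str:
    """Label a tag by how many reference clusters it matches."""
    n = len(index.get(tag, ()))  # .get avoids creating empty entries
    if n == 0:
        return "no match"
    return "single-match" if n == 1 else "multi-match"


if __name__ == "__main__":
    reference = [
        ("CCCTGGGTTC", "Hs.52891"),
        ("CCCTGGGTTC", "Hs.111334"),
        ("GTGGCCACGG", "Hs.112405"),
    ]
    index = build_tag_index(reference)
    for tag in ("CCCTGGGTTC", "GTGGCCACGG", "AAAAAAAAAA"):
        print(tag, classify_tag(tag, index))
```

Storing clusters in a set means a tag that appears several times within the same cluster still counts as a single match.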
Tag Frequency Groups for 10-base Tag Set Containing 878,938 Tags for UniGene Human
Unique Tags among 878,938 EST Derived Tags
Unique Tags among 32,851 Gene Derived Tags
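Frequency groups and unique-tag counts of the kind these slides chart could be tallied as follows. This is a sketch under an assumed definition: a tag's frequency is the number of distinct source sequences it was derived from, and a tag is "unique" when that number is 1. The slides do not spell out their definition, and the function names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Set, Tuple


def frequency_groups(occurrences: Iterable[Tuple[str, str]]) -> Counter:
    """Return a Counter mapping frequency -> number of tags with that frequency.

    occurrences is an iterable of (tag, source_sequence_id) pairs.
    """
    sources: Dict[str, Set[str]] = defaultdict(set)
    for tag, seq_id in occurrences:
        sources[tag].add(seq_id)
    return Counter(len(ids) for ids in sources.values())


def unique_fraction(groups: Counter) -> float:
    """Fraction of distinct tags that come from exactly one source."""
    total = sum(groups.values())
    return groups[1] / total if total else 0.0


if __name__ == "__main__":
    sample = [
        ("CCCTGGGTTC", "Hs.52891"),
        ("CCCTGGGTTC", "Hs.111334"),
        ("GTGGCCACGG", "Hs.112405"),
        ("GTTCACATTA", "Hs.84298"),
    ]
    groups = frequency_groups(sample)
    print(dict(groups), unique_fraction(groups))
```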
Converting tag into longer 3' sequence
[Figure: a transcript drawn from 5' end to 3' end, with the SAGE tag near the 3' end and the 3' longer sequence derived from it.]
Generation of Longer 3'cDNA for Gene Identification (GLGI)
[Figure: GLGI schematic. The SAGE tag (NNNNNNNNNN, labeled "10 bases") is extended into a longer 3' cDNA (labeled "hundred bases") by sense extension and antisense extension. Sequences shown flanking the tag: TAAAAAAAAAAACTCGCCGGCGAA, ATTTTTTTTTTTGAGCGGCCGCTT, and TGAGCGGCCGCTT.]
UniGene Human 3’ Part Length Distribution
Myeloid Tag Matches with UniGene Human SAGE Tag Reference Database
SAGE Tag Processing with GIST
k-mer tree
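One way to picture a k-mer tree is as a trie in which each level consumes one base, so a 10-base tag is resolved in 10 steps regardless of how many reference tags are stored. The following is an illustrative sketch only, not GIST's actual data structure, which the slides do not describe. The class and method names are hypothetical.

```python
from typing import Dict, Set


class KmerTree:
    """A dictionary-based trie over fixed-length k-mers (illustrative only)."""

    _LEAF = "$"  # key holding the label set; cannot collide with a base letter

    def __init__(self) -> None:
        self.root: Dict = {}

    def insert(self, kmer: str, label: str) -> None:
        """Store label (for example a cluster ID) under the given k-mer."""
        node = self.root
        for base in kmer.upper():
            node = node.setdefault(base, {})
        node.setdefault(self._LEAF, set()).add(label)

    def lookup(self, kmer: str) -> Set[str]:
        """Return all labels stored under kmer, or an empty set."""
        node = self.root
        for base in kmer.upper():
            node = node.get(base)
            if node is None:
                return set()
        return set(node.get(self._LEAF, ()))


if __name__ == "__main__":
    tree = KmerTree()
    tree.insert("CCCTGGGTTC", "Hs.52891")
    tree.insert("CCCTGGGTTC", "Hs.111334")
    tree.insert("GTGGCCACGG", "Hs.112405")
    print(sorted(tree.lookup("CCCTGGGTTC")))
    print(tree.lookup("AAAAAAAAAA"))
```

A nested-dictionary trie favors clarity over memory efficiency. A production tool would more likely use a compact array layout with four children per node.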
GIST Performance with Improved IO
Conspirators
Sanggyu Lee, Janet D. Rowley, San Ming Wang
Terry Clark, Andrew Huntwork, Josef Jurek, L. Ridgway Scott