SUCEST: the sugarcane genome project
Felipe Rodrigues da Silva
Embrapa Recursos Genéticos e Biotecnologia
GenBank growth
[Figure: volume of publicly available data, in millions of base pairs, by year (1980-2001), on a linear axis from 0 to 18,000. Data labels: 652; 1,160; 2,009; 3,841; 11,101; 14,396.]
GenBank growth (logarithmic scale)
[Figure: volume of publicly available data, in millions of base pairs, by year (1980-2001), on a logarithmic axis from 0.1 to 100,000. Data labels: 0.680338; 652; 1,160; 2,009; 3,841; 11,101; 14,396.]
http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
Complete genomes of organisms
[Figure: complete genomes per year (non-cumulative totals), 1995-2000: 2, 3, 7, 7, 9, 18. One axis counts genomes (0-18); the other shows millions of bases (Mb) on a logarithmic scale from 1 to 10,000. One data label reads 1.3 Gb.]
http://wit.integratedgenomics.com/GOLD/
Alphabet soup...
[Figure: a jumble of scattered A, T, G and C letters.]
Sugarcane
• Grown in more than 90 countries
• Occupies about 20 million hectares
• Grass family (Poaceae)
http://apps.fao.org
Sugarcane in Brazil
• 25% of world production
• 300 million tons
• 5 million hectares planted
• 14.5 million tons of sugar
• 15.3 billion liters of ethanol
• 350 industrial plants
• 50,000 producers
• 1.4 million direct jobs
• 3.6 million indirect jobs
Origin and size
• Saccharum officinarum (2n = 80) X Saccharum spontaneum (2n = 64 or 2n = 112); 10-25%
• S. barberi, S. sinense, S. robustum
• 2C = 7,440 Mbp; 2n = 100-130
• Non-redundant set = 930 Mbp (sorghum = 760 Mbp; rice = 430 Mbp)
D'Hont, A. and Glaszmann, J. C. 2001. Proc Int Soc Sugarcane Technol 24: 556-559.
Genome project
• Structural: complete genome sequencing
– Genic and intergenic regions
• Functional: EST (Expressed Sequence Tag)
– Protein-coding regions (genes)
Complete sequencing
Genomic DNA: ...ATGTTGGGCCACAGTTGACCATTGAAACTGGTTGACCATTGAAACTGACCTTGACGTAACGTGGTA....
• BAC library
• Physical map
• BAC to be sequenced
• Shotgun clones
• Sequence
• Assembly: ...ATGTTGGGCCACAGTTGACCATTGAAACTGACCTTGACGTAACGTGGTA...
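The last step of the shotgun approach, assembly, joins overlapping reads back into one sequence. The following is a minimal Python sketch of that idea only. It assumes error-free reads and exact suffix-prefix overlaps, and the three reads are hypothetical fragments cut from the assembled sequence shown on the slide. The function names and the `min_len` parameter are illustrative; real assemblers tolerate sequencing errors and are far more elaborate.

```python
def overlap(a: str, b: str, min_len: int = 5) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def merge(a: str, b: str, min_len: int = 5) -> str:
    """Append `b` to `a`, writing the shared overlap only once."""
    n = overlap(a, b, min_len)
    if n == 0:
        raise ValueError("reads do not overlap")
    return a + b[n:]


# Hypothetical shotgun reads (not from the talk), listed left to right.
reads = [
    "ATGTTGGGCCACAGTTGACC",
    "GTTGACCATTGAAACTGACC",
    "AACTGACCTTGACGTAACGTGGTA",
]

contig = reads[0]
for read in reads[1:]:
    contig = merge(contig, read)

print(contig)
```

With these three reads the loop rebuilds the 49-base assembled sequence from the slide. Ordering the reads in advance is a simplification; choosing which reads to join is the hard part of real assembly.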
EST: Expressed Sequence Tag
Central dogma of biology: DNA (gene A, gene B) → mRNA (5´ to 3´) → protein (NH2 to COOH)
DNA: ACCTGATGGCATTTCCATCAAGCTGACCTGGAAATCGTTGGCC
EST route: mRNA → cDNA → insertion into vector → cloning in E. coli → sequencing
GenBank dbEST, March 1998
• Total entries: 1,528,715
• Homo sapiens: 967,015 (63.4%)
• Plants (total): 73,087 (4.8%)
• Mus musculus + domesticus (mouse): 306,544
• Caenorhabditis elegans: 72,521
• Arabidopsis thaliana: 36,173
• Drosophila melanogaster: 27,625
• Oryza sativa (rice): 25,844
• Rattus sp. (rat): 20,311
• Brugia malayi (parasitic nematode): 13,641
• Toxoplasma gondii: 10,671
• Emericella nidulans: 5,787
• Schistosoma mansoni: 3,659
• Trypanosoma brucei rhodesiense: 3,519
• Danio rerio (zebrafish): 3,373
• Saccharomyces cerevisiae: 3,042
• Zea mays (maize): 1,783
• Leishmania major: 1,692
• Saccharum sp.: 495
• Others: ~20,000
http://www.ncbi.nlm.nih.gov/dbEST/dbEST_summary.html
Objectives of the SUCEST project
• Identify 50,000 unique genes (or sequence 300,000 ESTs)
• Develop a database for sugarcane
• Make this database available to data mining groups
• Functional analysis of the ESTs
The schedule
Date       Goal
Jul/1999   Distribution of the first clones
Dec/1999   20,000 ESTs
Jul/2000   60,000 ESTs
Dec/2000   100,000 ESTs
Jul/2001   140,000 ESTs
Dec/2001   180,000 ESTs
Jul/2002   220,000 ESTs
Dec/2002   260,000 ESTs
Jul/2003   300,000 ESTs
The cDNA libraries
Tissues / organs
– Root
– Meristem
– Stem
– Seeds
– Flowers
– Leaf roll
– Leaf-root transition zone
– Lateral bud
– Calli
– Immature plantlets
– Plantlets infected with Herbaspirillum rubrisubalbicans
– Plantlets infected with Gluconacetobacter diazotrophicus
Varieties
– SP80-3280
– SP70-1143
– SP80-87432
– RB 845298
– RB 805028
– PB5211 X P57150-4
The sequencing laboratories
• USP (SP) (3)
• UNESP (BT) (2)
• ESALQ (PI) (3)
• USP (RP) (1)
• UNAERP (RP) (1)
• UMC (MC) (1)
• UNIVAP (SJ) (1)
• UNESP (JB) (2)
• UFSCAR (AR) (1)
• UFSCAR (SC) (1)
• USP (SC) (1)
• UNESP (RC) (1)
• IAC (CA) (1)
• IAC (CO) (1)
• UNICAMP (CA) (1)
• Bioinformatics: UNICAMP (CA)
Other states on the map: Rio de Janeiro, Pernambuco, Alagoas
ABI 377-96
EST: Expressed Sequence Tag
Central dogma of biology: DNA (gene A, gene B) → mRNA (5´ to 3´) → protein (NH2 to COOH)
DNA: ACCTGATGGCATTTCCATCAAGCTGACCTGGAAATCGTTGGCC
EST route: mRNA → cDNA → insertion into vector → cloning in E. coli → sequencing
291,689 reads; 260,352 clones
266,016 clones
Sequence cleaning
• removal of ribosomal sequences
• removal of vector sequences
• removal of the poly(A) region
• quality trimming
• elimination of slippage
Poly(A)
AGGGGAGAATTTATGATCCCCTAGTACACCCGGCAGGACCGGTCCGGAATTCCCCGGTCGACCCAC GCGTCCGCTACAACAACAGCAGCAGCTTCCATTTACCTTGTCGGCTGTTGCAACCGCTGCTGCCTA CCACCAGCAACTACAGCTGCTACCAGTTAACCCATTGGCACTGGCTAACCCATTGGCTGCTGCCTT CCTGCAGCAGCAACAATTGCTGCCATTCAACCAGATGTCTTTGATGAACCCTGCCTTGTCGTGGTA GCAACCCATCGTTGGAGGTGCCATCTTCTAGAATACAAATGAGTTGTACTTGATAACAATGTTCTT GTGTCGGCGTGTGCAACTTCCCAGAAATAATCAATACATTGATTGAGATTTANAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAATATAATTAAAATAAAAAAATTTATAAAAAAAAAAAAATAATT TTTTTTTATAAAAAATAAATATAAAATAAAAAGGGGGGGCCGTTTTAAAGGAACAAAGTTTAAGAC CGGGGGTATGAAAGGGAAAATTTTTTTATATAGGGCCCCAAAATTAAATACATGGGCCGGTGTTAA CAACGGCGGGAGGGAAAAAACCTGGGGGTTACCAATTTAAAGCCGTGGAAAAAATCCCTTTTTTCA AGTGGGGTAAAAAGAAAAGGCCCCACCCATCGCCCTTCCAAAAATTGCCCCCCTTAAAGGAAAAAG GACACCCCCTTTTGGGCGCATATAACCGGGGGGGTGGGGGTACCCCCAAGGGAACTTATATTTTTC AGGCCTCATAGCCCTTTTTTTTTTTTTTTTTTTTTTTTTCAAGGTAGCGGGTTTCCCAGGAAAATT AAAAGGGGGGTCCTTTTGGGTAATAATGTTTTN
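In the read above, the sequence after the long run of A's is low-complexity noise, so a cleaning step can cut the read where the poly(A) tail starts. The following is a minimal Python sketch of that cut, not the project's actual procedure. The function name `trim_poly_a` and the minimum run length of 12 are assumptions made for illustration, and the example is a shortened, hypothetical fragment modeled on the read.

```python
import re


def trim_poly_a(seq: str, min_run: int = 12) -> str:
    """Cut `seq` at the first run of at least `min_run` consecutive A's.

    Everything from the start of that run onward is discarded. If no such
    run exists, the sequence is returned unchanged.
    """
    match = re.search("A{%d,}" % min_run, seq)
    return seq if match is None else seq[:match.start()]


# Hypothetical fragment: a short 3' end followed by a poly(A) tail and noise.
fragment = "GATTGAGATTTAN" + "A" * 37 + "TATAATTAAAATAAAAAAATTTAT"
print(trim_poly_a(fragment))
```

Isolated A's and short A-rich stretches are left alone because only a run of `min_run` or more triggers the cut. A stricter or looser value changes how much 3' sequence survives.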
Quality trimming
CGGAAGACTGGAGTCGTCGCTGCGGCACCGGTCCGGAATTCCCGGGTCGACCCACGCGTCCGGCCG CCGCCACCGCATCCCTTGCAGCCCCAATCCCCCACGGCGACCATGGCCGGCGCGCAGGAGTCCCTG TCCCTGGTGGGCACGATGCGTGGCCACAACGGCGAGGTGACGGCGATCGCCACCCCGATCGACAAC TCGCCGTTCATCGTCTCCTCCTCCCGCGACAAGTCCGTGCTGGTGTGGGACCTGCAAAACCCGGTC CACTCCACCCCGGAATCCGGCGCCACCGCCGACTACGGCGTCCCCTTCCGCCGCCTCACCGGCCAC TCCCACTTCGTCCAGGACGTCGTCCTCAGCTCCGACGGCCAGTTCGCCCTCTCCGGCTCCTGGGAC GGCGAGCTCCGCCTCTGGGACCTCTCCACCGGCGTCACCACCCGCCGCTTCGTCGGCCACGAGAAG GACGTCCTCTCCGTCGCCTTCTCCGTCGACAACCGCCAGATCGTCTCCGCGTCCCGCGACAAGACC ATCAAGCTCTGGAACACCCTCGGTGAGTGCAAGTACACCATTGGTGGCGACCTCGGCGGCGGGGAG GGCCACAACGGGTGGGTCTCCTGCGTCAGGTTCTTCCCCAACACCTTTCAGGCCACCATTGTCTCC GGATTCTGGGACCGCACCGTCAGGTCTGGAACCTTACCAACTGCAAGCTGCGATGCACTCTCGATG CCCACGCGGCTATGTTAACGCCGTCGCC
Quality classes (phred score per base): <10; >=10 and <15; >=15 and <20; >=20 and <25; >=25 and <30; >=30
754 bases
Quality trimming
CGGAAGACTGGAGTCGTCGCTGCGGCACCGGTCCGGAATTCCCGGGTCGACCCACGCGTCCGGCCG CCGCCACCGCATCCCTTGCAGCCCCAATCCCCCACGGCGACCATGGCCGGCGCGCAGGAGTCCCTG TCCCTGGTGGGCACGATGCGTGGCCACAACGGCGAGGTGACGGCGATCGCCACCCCGATCGACAAC TCGCCGTTCATCGTCTCCTCCTCCCGCGACAAGTCCGTGCTGGTGTGGGACCTGCAAAACCCGGTC CACTCCACCCCGGAATCCGGCGCCACCGCCGACTACGGCGTCCCCTTCCGCCGCCTCACCGGCCAC TCCCACTTCGTCCAGGACGTCGTCCTCAGCTCCGACGGCCAGTTCGCCCTCTCCGGCTCCTGGGAC GGCGAGCTCCGCCTCTGGGACCTCTCCACCGGCGTCACCACCCGCCGCTTCGTCGGCCACGAGAAG GACGTCCTCTCCGTCGCCTTCTCCGTCGACAACCGCCAGATCGTCTCCGCGTCCCGCGACAAGACC ATCAAGCTCTGGAACACCCTCGGTGAGTGCAAGTACACCATTGGTGGCGACCTCGGCGGCGGGGAG GGCCACAACGGGTGGGTCTCCTGCGTCAGGTTCTTCCCCAACACCTTTCAGGCCACCATTGTCTCC GGATTCTGGGACCGCACCGTCAGGTCTGGAACCTTACCAACTGCAAGCTGCGATGCACTCTCGATG CCCACGCGGCTATGTTAACGCCGTCGC
Quality classes (phred score per base): <10; >=10 and <15; >=15 and <20; >=20 and <25; >=25 and <30; >=30
753 bases
Quality trimming
CGGAAGACTGGAGTCGTCGCTGCGGCACCGGTCCGGAATTCCCGGGTCGACCCACGCGTCCGGCCG CCGCCACCGCATCCCTTGCAGCCCCAATCCCCCACGGCGACCATGGCCGGCGCGCAGGAGTCCCTG TCCCTGGTGGGCACGATGCGTGGCCACAACGGCGAGGTGACGGCGATCGCCACCCCGATCGACAAC TCGCCGTTCATCGTCTCCTCCTCCCGCGACAAGTCCGTGCTGGTGTGGGACCTGCAAAACCCGGTC CACTCCACCCCGGAATCCGGCGCCACCGCCGACTACGGCGTCCCCTTCCGCCGCCTCACCGGCCAC TCCCACTTCGTCCAGGACGTCGTCCTCAGCTCCGACGGCCAGTTCGCCCTCTCCGGCTCCTGGGAC GGCGAGCTCCGCCTCTGGGACCTCTCCACCGGCGTCACCACCCGCCGCTTCGTCGGCCACGAGAAG GACGTCCTCTCCGTCGCCTTCTCCGTCGACAACCGCCAGATCGTCTCCGCGTCCCGCGACAAGACC ATCAAGCTCTGGAACACCCTCGGTGAGTGCAAGTACACCATTGGTGGCGACCTCGGCGGCGGGGAG GGCCACAACGGGTGGGTCTCCTGCGT
Quality classes (phred score per base): <10; >=10 and <15; >=15 and <20; >=20 and <25; >=25 and <30; >=30
618 bases
blastX result (trimmed read)
>gi|1346109|sp|P49027|GBLP_ORYSA GUANINE NUCLEOTIDE-BINDING PROTEIN BETA SUBUNIT-LIKE PROTEIN (GPB-LR) (RWD)
pir||T03764 protein RWD - rice
dbj|BAA07404.1| (D38231) RWD [Oryza sativa]
Length = 334
Score = 315 bits (798), Expect = 4e-85
Identities = 150/170 (88%), Positives = 156/170 (91%)
Frame = +1
Query: 109 MAGAQESLSLVGTMRGHNGEVTAIATPIDNSPFIVSSSRDKSVLVWDLQNPVHSTPESGA 288
           MAGAQESL L G M GHN  VTAIATPIDNSPFIVSSSRDKS+LVWDL NPV +  E
Sbjct: 1   MAGAQESLVLAGVMHGHNDVVTAIATPIDNSPFIVSSSRDKSLLVWDLTNPVQNVGEGAG 60
Query: 289 TADYGVPFRRLTGHSHFVQDVVLSSDGQFALSGSWDGELRLWDLSTGVTTRRFVGHEKDV 468
            ++YGVPFRRLTGHSHFVQDVVLSSDGQFALSGSWDGELRLWDLSTGVTTRRFVGH+KDV
Sbjct: 61  ASEYGVPFRRLTGHSHFVQDVVLSSDGQFALSGSWDGELRLWDLSTGVTTRRFVGHDKDV 120
Query: 469 LSVAFSVDNRQIVSASRDKTIKLWNTLGECKYTIGGDLGGGEGHNGWVSC 618
           LSVAFSVDNRQIVSASRD+TIKLWNTLGECKYTIGGDLGGGEGHNGWVSC
Sbjct: 121 LSVAFSVDNRQIVSASRDRTIKLWNTLGECKYTIGGDLGGGEGHNGWVSC 170
blastX result (whole read)
>gi|1346109|sp|P49027|GBLP_ORYSA GUANINE NUCLEOTIDE-BINDING PROTEIN BETA SUBUNIT-LIKE PROTEIN (GPB-LR) (RWD)
pir||T03764 protein RWD - rice
dbj|BAA07404.1| (D38231) RWD [Oryza sativa]
Length = 334
Score = 352 bits (893), Expect(2) = e-100
Identities = 168/192 (87%), Positives = 175/192 (90%)
Frame = +1
Query: 109 MAGAQESLSLVGTMRGHNGEVTAIATPIDNSPFIVSSSRDKSVLVWDLQNPVHSTPESGA 288
           MAGAQESL L G M GHN  VTAIATPIDNSPFIVSSSRDKS+LVWDL NPV +  E
Sbjct: 1   MAGAQESLVLAGVMHGHNDVVTAIATPIDNSPFIVSSSRDKSLLVWDLTNPVQNVGEGAG 60
Query: 289 TADYGVPFRRLTGHSHFVQDVVLSSDGQFALSGSWDGELRLWDLSTGVTTRRFVGHEKDV 468
            ++YGVPFRRLTGHSHFVQDVVLSSDGQFALSGSWDGELRLWDLSTGVTTRRFVGH+KDV
Sbjct: 61  ASEYGVPFRRLTGHSHFVQDVVLSSDGQFALSGSWDGELRLWDLSTGVTTRRFVGHDKDV 120
Query: 469 LSVAFSVDNRQIVSASRDKTIKLWNTLGECKYTIGGDLGGGEGHNGWVSCVRFFPNTFQA 648
           LSVAFSVDNRQIVSASRD+TIKLWNTLGECKYTIGGDLGGGEGHNGWVSCVRF PNTFQ
Sbjct: 121 LSVAFSVDNRQIVSASRDRTIKLWNTLGECKYTIGGDLGGGEGHNGWVSCVRFSPNTFQP 180
Query: 649 TIVSGFWDRTVR 684
           TIVSG WDRTV+
Sbjct: 181 TIVSGSWDRTVK 192
Determining the quality threshold
[Figure: quality window parameters. Horizontal axis: bases below threshold (0 to 16). Vertical axis: distance from BLAST hit end (-300 to 200). One curve per quality threshold: 9, 10, 12, 14, 16, 18, 20.]
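The two parameters explored in the plot, the quality threshold and the number of bases below it that a window may contain, suggest a sliding-window cut. The following is a minimal Python sketch of such a cut, offered as an illustration rather than the trimming program used in the project. The function name `trim_point`, the window size and every default value are assumptions.

```python
def trim_point(quals, threshold=20, window=20, max_bad=4):
    """Return how many leading bases to keep.

    The window slides from the 5' end. At the first window holding more than
    `max_bad` bases with quality below `threshold`, the read is cut at the
    first low-quality base of that window. A read with no such window is
    kept whole.
    """
    for start in range(len(quals) - window + 1):
        bad = [i for i, q in enumerate(quals[start:start + window]) if q < threshold]
        if len(bad) > max_bad:
            return start + bad[0]
    return len(quals)


# Hypothetical phred scores: 50 good bases followed by 30 poor ones.
scores = [30] * 50 + [8] * 30
print(trim_point(scores, threshold=20, window=10, max_bad=3))
```

A single poor base inside a good stretch never fills a window with more than `max_bad` low scores, so it does not trigger a cut. Raising `threshold` or lowering `max_bad` makes the cut more aggressive, which is the trade-off the plot measures against the end of the BLAST hit.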
Quality trimming
CGGAAGACTGGAGTCGTCGCTGCGGCACCGGTCCGGAATTCCCGGGTCGACCCACGCGTCCGGCCG CCGCCACCGCATCCCTTGCAGCCCCAATCCCCCACGGCGACCATGGCCGGCGCGCAGGAGTCCCTG TCCCTGGTGGGCACGATGCGTGGCCACAACGGCGAGGTGACGGCGATCGCCACCCCGATCGACAAC TCGCCGTTCATCGTCTCCTCCTCCCGCGACAAGTCCGTGCTGGTGTGGGACCTGCAAAACCCGGTC CACTCCACCCCGGAATCCGGCGCCACCGCCGACTACGGCGTCCCCTTCCGCCGCCTCACCGGCCAC TCCCACTTCGTCCAGGACGTCGTCCTCAGCTCCGACGGCCAGTTCGCCCTCTCCGGCTCCTGGGAC GGCGAGCTCCGCCTCTGGGACCTCTCCACCGGCGTCACCACCCGCCGCTTCGTCGGCCACGAGAAG GACGTCCTCTCCGTCGCCTTCTCCGTCGACAACCGCCAGATCGTCTCCGCGTCCCGCGACAAGACC ATCAAGCTCTGGAACACCCTCGGTGAGTGCAAGTACACCATTGGTGGCGACCTCGGCGGCGGGGAG GGCCACAACGGGTGGGTCTCCTGCGTCAGGTTCTTCCCCAACACCTTTCAGGCCACCATTGTCTCC GGATTCTGGGACCGCACCGTCAGGTCTGGAACCTTACCAACTGCAAGCTGCGATGCACTCTCGATG CCCACGCGGCTATGTTAACGCCGTCGCC
Quality classes (phred score per base): <10; >=10 and <15; >=15 and <20; >=20 and <25; >=25 and <30; >=30
base 684
Quality trimming
CGGAAGACTGGAGTCGTCGCTGCGGCACCGGTCCGGAATTCCCGGGTCGACCCACGCGTCCGGCCG CCGCCACCGCATCCCTTGCAGCCCCAATCCCCCACGGCGACCATGGCCGGCGCGCAGGAGTCCCTG TCCCTGGTGGGCACGATGCGTGGCCACAACGGCGAGGTGACGGCGATCGCCACCCCGATCGACAAC TCGCCGTTCATCGTCTCCTCCTCCCGCGACAAGTCCGTGCTGGTGTGGGACCTGCAAAACCCGGTC CACTCCACCCCGGAATCCGGCGCCACCGCCGACTACGGCGTCCCCTTCCGCCGCCTCACCGGCCAC TCCCACTTCGTCCAGGACGTCGTCCTCAGCTCCGACGGCCAGTTCGCCCTCTCCGGCTCCTGGGAC GGCGAGCTCCGCCTCTGGGACCTCTCCACCGGCGTCACCACCCGCCGCTTCGTCGGCCACGAGAAG GACGTCCTCTCCGTCGCCTTCTCCGTCGACAACCGCCAGATCGTCTCCGCGTCCCGCGACAAGACC ATCAAGCTCTGGAACACCCTCGGTGAGTGCAAGTACACCATTGGTGGCGACCTCGGCGGCGGGGAG GGCCACAACGGGTGGGTCTCCTGCGTCAGGTTCTTCCCCAACACCTTTCAGGCCACCATTGTCTCC GGATTCTGGGACCGCACCGTCAGGTCTGGAACCTTACCAACTGCAAGCTGCGATGCACTCTCGATG CCCACGCGGCTATGTTAACGCCGTCGCC
719 bases
before   diff.   homology   diff.   after
618      -66     684        +35     719
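The numbers in this small table can be checked against the blastX coordinates shown earlier in the talk. In the Python sketch below, the constants come from the slides while the function names are illustrative. It recomputes the two offsets from the end of the homology and the number of aligned amino acids implied by the query coordinates (three bases per residue).

```python
QUERY_START = 109   # first aligned base of the read in both blastX results
HIT_END = 684       # last aligned base in the whole-read blastX result
TRIM_BEFORE = 618   # read length in the "before" column
TRIM_AFTER = 719    # read length in the "after" column


def offset_from_hit_end(trim_end, hit_end=HIT_END):
    """Signed distance between a trimming point and the end of the homology."""
    return trim_end - hit_end


def aligned_residues(start, end):
    """Amino acids covered by a translated alignment spanning bases start..end."""
    return (end - start + 1) // 3


print(offset_from_hit_end(TRIM_BEFORE))        # bases missing from the homology
print(offset_from_hit_end(TRIM_AFTER))         # bases kept beyond the homology
print(aligned_residues(QUERY_START, TRIM_BEFORE))
print(aligned_residues(QUERY_START, HIT_END))
```

The residue counts agree with the identity denominators reported by blastX for the trimmed read and for the whole read.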
Example of slippage
All reads: 291,689 reads (mean length 864.5 ± 186.3; mean number of bases >= 20 per read 399.5 ± 161.3)
Reads remaining after each step:
• removal of ribosomal sequences: 283,216 reads
• vector search: 283,216 reads
• trimming of vector + poly(A): 275,436 reads
• quality trimming: 273,728 reads
• trimming of vectors at the ends: 273,728 reads
• trimming of slippage: 258,107 reads
• trimming of poly(A) at the ends: 256,101 reads
• removal of low-quality sequences: 237,954 reads
Trimmed reads: 237,954 reads (average read size 642.6 ± 139.8; average bases >= 20 per read 397.8 ± 120.1)
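The read counts on this slide make it easy to see where reads are lost. The Python sketch below assumes that each count on the slide is the number of reads remaining after the step labeled next to it, which is a reading of the slide layout rather than something the slide states. The names `PIPELINE`, `losses` and `success_index` are illustrative.

```python
# (step, reads remaining after the step), in slide order.
PIPELINE = [
    ("all reads", 291_689),
    ("removal of ribosomal sequences", 283_216),
    ("vector search", 283_216),
    ("trimming of vector + poly(A)", 275_436),
    ("quality trimming", 273_728),
    ("trimming of vectors at the ends", 273_728),
    ("trimming of slippage", 258_107),
    ("trimming of poly(A) at the ends", 256_101),
    ("removal of low-quality sequences", 237_954),
]


def losses(pipeline):
    """Reads discarded by every step after the first entry."""
    return [(name, prev - cur)
            for (_, prev), (name, cur) in zip(pipeline, pipeline[1:])]


def success_index(pipeline):
    """Percentage of the initial reads that survive the whole pipeline."""
    return round(100 * pipeline[-1][1] / pipeline[0][1], 1)


for name, lost in losses(PIPELINE):
    print(f"{name:35s} {lost:>7,d} reads discarded")
print("success index (%):", success_index(PIPELINE))
```

Under that assumption the surviving fraction matches the success index of 81.6% quoted in the totals later in the talk.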
Columns: cluster size (reads); HS; HS X phrap; phrap; phrap X CAP3; CAP3; CAP3 X HS; total common
1 32202 13731 18535 11634 16838 14296 10744
2 12440 5617 9207 4869 7665 4852 3792
3 6752 2402 5192 2151 4193 1984 1441
4 4225 1239 3329 1145 2709 992 697
5 2856 676 2360 700 1872 521 344
6 2098 442 1806 482 1452 354 231
7 1582 288 1362 317 1115 220 144
8 1245 202 1091 242 862 153 99
9 974 156 913 186 720 113 72
10 776 105 752 143 634 74 44
11 639 76 607 99 511 54 30
12 492 71 547 99 429 46 32
13 437 47 454 90 400 40 25
14 366 42 391 40 341 26 13
15 306 31 390 50 295 18 11
16 273 25 279 35 275 18 8
17 225 15 273 23 235 11 4
18 177 11 227 15 191 5 2
19 124 6 177 18 176 5 3
>=20 1192 40 1814 87 2228 23 12
total 69381 25222 49706 22425 43141 23805 17748
Internal discrepancy
Internal consistency test
[Figure: % of discrepant reads (vertical axis, 0 to 9) against % of discrepancy in reads (horizontal axis, 0 to 80). Series, added over three slides: hs (204k), old-trim (219k), phrap (218k), cap3 (221k).]
External consistency test
[Figure: % possible matches (vertical axis, 0.00000 to 0.00020) against % identity (horizontal axis, 75 to 100). Series, added over three slides: hs, old-trim, phrap, cap3.]
Total numbers
Total sequences: 291,689
cDNA clones sequenced (5’ or 3’): 260,352
5’ end sequences: 259,325
3’ end sequences: 32,364
Total high-quality sequences: 237,954
Success index (%): 81.6
Average insert size (bp): 1,250
Average sequence size (bp): 864 / 642
Bases with phred quality >= 20 per read: 399
Total numbers
Total sequences analyzed 237,954
Number of contigs 26,803
Number of singletons 16,338
Number of sugarcane assembled sequences (SAS) 43,141
Number of assembled sequences matching to known genes 27,833 (64.5%)
Number of clones with full length inserts 14,409 (
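Two of the figures above follow from the others, which gives a quick consistency check. The short Python sketch below uses the counts from the slide; the constant and function names are illustrative.

```python
CONTIGS = 26_803
SINGLETONS = 16_338
SAS = 43_141        # sugarcane assembled sequences
MATCHED = 27_833    # SAS matching known genes


def percent(part, whole):
    """Share of `whole` represented by `part`, rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Every SAS is either a contig or a singleton.
print(CONTIGS + SINGLETONS == SAS)
print(percent(MATCHED, SAS))
```

The sum of contigs and singletons equals the SAS count, and the share of SAS matching known genes reproduces the 64.5% on the slide.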
Specific contribution by library
Columns: library; number of ESTs; SAS (contigs and singletons); singletons; % contribution
AD1 8,137 1,474 1,200 3.4
AM1 5,991 841 664 1.9
AM2 6,629 982 705 2.3
CL6 3,511 595 467 1.4
FL1 8,412 1,753 1,465 4.1
FL3 5,714 840 667 1.9
FL4 7,289 1,082 886 2.5
FL5 5,115 861 744 2.0
FL8 3,362 378 337 0.9
HR1 5,070 717 519 1.7
LB1 3,699 459 369 1.1
LB2 5,402 790 650 1.8
LR1 6,653 984 819 2.3
LR2 2,329 299 254 0.7
LV1 3,068 384 327 0.9
RT1 4,227 569 484 1.3
RT2 5,819 942 728 2.2
RT3 4,356 614 478 1.4
RZ1 2,012 205 175 0.5
RZ2 3,177 385 301 0.9
RZ3 6,528 929 752 2.1
SB1 7,407 1,313 1,132 3.0
SD1 4,459 792 642 1.8
SD2 4,099 857 632 2.0
ST1 4,359 645 523 1.5
ST3 4,519 507 418 1.2
• 47% of the SAS are formed by reads from a single library
• 38% of the tissue-specific SAS are singletons
Functional classification
Percentage by organ
Functional category AD/HR AM/LB CL FL LV RT SD ST mean±sd
Amino acid metabolism 6.0 4.0 5.9 4.2 3.9 7.3 4.6 4.9 5.1±1.2
Bioenergetics 17.3 12.6 17.5 12.6 12.0 21.9 16.9 15.8 15.8±3.3
Cellular communication/Signal transduction 10.7 11.1 9.8 11.0 10.1 10.8 7.9 11.8 10.4±1.2
Cellular dynamics 18.5 19.7 15.0 19.1 18.6 18.3 15.1 20.1 18.1±2.0
DNA metabolism 2.5 3.8 2.4 3.8 4.8 2.2 1.8 1.7 2.9±1.1
Lipid, fatty-acid and isoprenoid metabolism 4.0 3.5 3.2 3.8 3.7 4.4 4.6 3.5 3.8±0.5
Mobile genetic elements 1.5 1.3 2.0 1.4 1.0 1.1 0.7 1.1 1.3±0.4
Nitrogen, sulphur and phosphate metabolism 0.9 0.7 0.8 0.6 0.7 1.7 0.7 0.7 0.9±0.3
Nucleotide metabolism 2.5 2.5 3.1 2.5 2.2 2.4 2.2 2.6 2.5±0.3
Plant growth and development 3.8 3.4 4.5 3.5 3.4 4.0 2.9 4.1 3.7±0.5
Protein metabolism 16.9 18.7 17.2 18.4 21.7 16.5 20.5 16.5 18.3±1.9
RNA metabolism and transcription 6.7 7.2 7.1 6.8 6.8 5.6 5.0 6.4 6.5±0.8
Secondary metabolism 6.1 3.9 5.2 4.5 4.0 8.4 5.5 6.8 5.5±1.5
Storage proteins 0.2 0.1 0.2 0.1 0.1 0.3 5.7 0.2 0.9±1.9
Stress response 14.7 12.7 14.4 12.8 13.4 17.6 17.8 15.7 14.9±2.0
Transport 8.5 6.7 6.3 7.1 6.7 8.0 9.2 7.8 7.5±1.0
Putative proteins 15.2 16.2 15.8 16.8 15.1 14.0 12.7 15.8 15.2±1.3
Unable to classify 6.3 6.1 6.4 6.7 6.3 5.8 5.3 6.3 6.1±0.4
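The last column of the table can be recomputed from the eight organ values of each row. The Python sketch below assumes the column is the arithmetic mean followed by the sample standard deviation (n - 1 in the denominator), which is the convention that reproduces the printed values for the rows tried here. The function name `summarize` is illustrative.

```python
from statistics import mean, stdev


def summarize(values):
    """Format a row as mean±sd with one decimal, using the sample standard deviation."""
    return f"{mean(values):.1f}±{stdev(values):.1f}"


# Organ percentages in column order: AD/HR, AM/LB, CL, FL, LV, RT, SD, ST.
amino_acid_metabolism = [6.0, 4.0, 5.9, 4.2, 3.9, 7.3, 4.6, 4.9]
dna_metabolism = [2.5, 3.8, 2.4, 3.8, 4.8, 2.2, 1.8, 1.7]

print("Amino acid metabolism:", summarize(amino_acid_metabolism))
print("DNA metabolism:", summarize(dna_metabolism))
```

For the DNA metabolism row the eight values give a mean of 2.9 with a standard deviation of 1.1.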
Tissue-specific SAS
Columns: number of ESTs; best hit; library
360 (Y17556) alpha kafirin [Sorghum bicolor] SD
103 (A23207) zein zA1 [Zea mays] SD
42 (AF232008) beta-glucosidase aggregating factor precursor [Zea mays] RT
24 (AC007789) putative low molecular early light-inducible protein [Oryza sativa] SD
22 (AP002820) putative peroxidase [Oryza sativa] RT
19 (X56337) alpha-amylase [Oryza sativa] CL
18 (AP000374) cyclopropane fatty acid synthase [Arabidopsis thaliana] FL
GenBank dbEST, March 2001
• Total entries: 7,692,809
• Homo sapiens: 3,369,459 (43.8%)
• Plants (total): 1,099,102 (14.3%)
• Glycine max (soybean): 160,500
• Arabidopsis thaliana: 113,000
• Medicago truncatula (barrel medic): 112,458
• Lycopersicon esculentum (tomato): 107,226
• Zea mays (maize): 86,999
• Oryza sativa (rice): 72,657
• Hordeum vulgare (barley): 68,480
• Chlamydomonas reinhardtii: 64,973
• Sorghum bicolor: 62,642
• Triticum aestivum (wheat): 58,141
• Pinus taeda (loblolly pine): 34,896
• Lotus japonicus: 27,078
• Solanum tuberosum (potato): 26,177
• Gossypium arboreum: 20,978
• Sorghum propinquum: 17,974
• Mesembryanthemum (ice plant): 14,033
• Gossypium hirsutum (cotton): 9,438
• Secale cereale: 8,123
• Saccharum sp.: 495
• Other plants (67 spp.): 32,834
GenBank dbEST, September 2002
• Total entries: 12,845,578
• Homo sapiens: 4,691,979 (36.5%)
• Plants (total): 2,279,170 (17.4%)
• Glycine max (soybean): 284,714
• Triticum aestivum (wheat): 256,593
• Hordeum vulgare (barley): 240,882
• Zea mays (maize): 180,587
• Arabidopsis thaliana: 174,624
• Medicago truncatula (barrel medic): 170,500
• Lycopersicon esculentum (tomato): 148,346
• Chlamydomonas reinhardtii: 130,324
• Oryza sativa (rice): 108,429
• Solanum tuberosum (potato): 94,420
• Sorghum bicolor: 84,712
• Lactuca sativa (lettuce): 68,188
• Pinus taeda (loblolly pine): 60,226
• Physcomitrella patens: 50,250
• Helianthus annuus (sunflower): 44,961
• Gossypium arboreum (cotton): 38,894
• Lotus japonicus: 32,096
• Sorghum propinquum: 21,387
• Saccharum sp.: 495
• Other plants (78 spp.): 88,542
http://www.ncbi.nlm.nih.gov/dbEST/dbEST_summary.html
Genetics and Molecular Biology
1. The libraries that made SUCEST
2. Bioinformatics of the sugarcane EST project
3. Trimming and clustering sugarcane ESTs
4. The sugarcane signal transduction (SUCAST) catalogue: prospecting signal transduction in sugarcane
5. In silico characterization and expression analyses of sugarcane putative sucrose non-fermenting-1 (SNF1) related kinases
6. Identification of 14-3-3-like protein in sugarcane (Saccharum officinarum)
7. A search for homologues of plant photoreceptor genes and their signaling partners in the sugarcane expressed sequence tag (Sucest) database
8. Phylogenetic relationships between Arabidopsis and sugarcane bZIP transcriptional regulatory factors
9. Identification of sugarcane cDNAs encoding components of the cell cycle machinery
10. Dissecting the sugarcane expressed sequence tag (SUCEST) database: unraveling flower-specific genes
11. Molecular chaperone genes in the sugarcane expressed sequence database (SUCEST)
12. Oxidative stress response in sugarcane
13. In silico differential display of defense-related expressed sequence tags from sugarcane tissues infected with diazotrophic endophytes
14. Mechanisms of sugarcane response to herbivory
15. Base excision repair in sugarcane
16. DNA repair-related genes in sugarcane expressed sequence tags (ESTs)
17. Distribution of DNA repair-related ESTs in sugarcane
18. Survey of transposable elements in sugarcane expressed sequence tags (ESTs)
19. Preliminary analysis of microsatellite markers derived from sugarcane expressed sequence tags (ESTs)
20. Sequence polymorphism from EST data in sugarcane: a fine analysis of 6-phosphogluconate dehydrogenase genes
21. A search for markers of sugarcane evolution
22. Sugarcane genes related to mitochondrial function
23. Mitochondrial and chloroplast localization of FtsH-like proteins in sugarcane based on their phylogenetic profile
24. Patterns of expression of cell wall related genes in sugarcane
25. Expression of sugarcane genes induced by inoculation with Gluconacetobacter diazotrophicus and Herbaspirillum rubrisubalbicans
26. Identifying sugarcane expressed sequences associated with nutrient transporters and peptide metal chelators
27. Prospecting sugarcane genes involved in aluminum tolerance
28. N-glycosylation in sugarcane
29. Sugarcane expressed sequences tags (ESTs) encoding enzymes involved in lignin biosynthesis pathways
30. Biosynthesis of secondary metabolites in sugarcane
31. Identification of sugarcane genes involved in the purine synthesis pathway
32. A new member of the chalcone synthase (CHS) family in sugarcane
33. Classification, expression pattern and comparative analysis of sugarcane expressed sequences tags (ESTs) encoding glycine-rich proteins (GRPs)
34. Identification, classification and expression pattern analysis of sugarcane cysteine proteinases
35. Identification of metalloprotease gene families in sugarcane
36. Sugarcane phytocystatins: Identification, classification and expression pattern analysis
http://www.sbg.org.br/revista24_index.htm
The SUCEST group
Part of the LBI
The trimmers
Grupo Genoma - CBMEG
[email protected] http://www.lbi.ic.unicamp.br/
www.laerte.com.br