![Page 1: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/1.jpg)
Introduction to KEGG
Susumu Goto, Masahiro Hattori, Wataru Honda, Junko Yabuzaki
Kyoto University, Bioinformatics Center
Systems Biology and the Omics Cascade, Karolinska Institutet, 10 June 2008
![Page 2: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/2.jpg)
Today’s Menu
• Brief history and overview of KEGG and GenomeNet (Goto) • KEGG resources related to genomic information (Goto) • KEGG resources related to systems information (Honda) • KEGG resources related to chemical information (Hattori)
Morning session
• Searching genomic information in KEGG (Yabuzaki) • Assembling cDNA sequences and annotating functions (Goto/Yabuzaki)
• Handling microarray data for mapping KEGG pathways (Goto/Honda)
• Searching and computing pathways and chemical information in KEGG (Honda/Hattori)
Afternoon session (laboratory work)
![Page 3: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/3.jpg)
Genome
Transcriptome Proteome
Chemical information Metabolism Glycans Lipids
Metabolome Glycome
Molecular information
Sequences Structures Molecular interaction
Physical interaction Co-expression
Background
Various types of omics data are now available
![Page 4: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/4.jpg)
Overview
• Kyoto Encyclopedia of Genes and Genomes • Integrated database of biological systems, genetic building blocks and chemical building blocks
KEGG (http://www.genome.jp/kegg/)
• Kanehisa, M., et al. KEGG for linking genomes to life and the environment. Nucleic Acids Res. 36, D480 (2008)
• Kanehisa, M., et al. From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res. 34, D354 (2006)
References
![Page 5: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/5.jpg)
Four Major Components of KEGG
Reconstructing biological phenomenon with various omics data and researcher’s knowledge
Chemical Information KEGG LIGAND
Genomic Information KEGG GENES
Hierarchical classification KEGG BRITE
Systems Information KEGG PATHWAY Tools
EGassembler KAAS GENIES KegArray
Tools e-zyme pathcomp KegArray
![Page 6: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/6.jpg)
KEGG history with ID system
Release Database Object identifier
1995 KEGG PATHWAY map number KEGG GENOME organism code (T number) KEGG GENES locus_tag / NCBI GeneID KEGG ENZYME EC number KEGG COMPOUND C number 2001 KEGG REACTION R number 2002 KEGG ORTHOLOGY K number 2003 KEGG GLYCAN G number 2004 KEGG RPAIR A number 2005 KEGG BRITE br number KEGG DRUG D number 2008 KEGG MODULE M number KEGG DISEASE H number
Kanehisa, M., et al. Nucleic Acids Res. 36, D480 (2008)
![Page 7: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/7.jpg)
Databases in GenomeNet
Established in 1991 under the Japanese Human Genome Project http://www.genome.jp/
A network of databases and computational services for genome research and related research areas in biomedical sciences
>30 databases including KEGG, GenBank, UniProt, PDB, …
>500 databases linking from/to GenomeNet
Unique ID for each entry database:entry e.g. compound:C00002 for ATP
![Page 8: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/8.jpg)
Four Major Components of KEGG
Reconstructing biological phenomenon with various omics data and researcher’s knowledge
Chemical Information KEGG LIGAND
Genomic Information KEGG GENES
Hierarchical classification KEGG BRITE
Systems Information KEGG PATHWAY
![Page 9: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/9.jpg)
KEGG GENES Database Molecular building blocks of life
in the genomic space
![Page 10: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/10.jpg)
A collection of gene catalogs for all complete genomes and some partial genomes generated from publicly available resources, mostly NCBI RefSeq
http://www.genome.jp/kegg/genes.html
KEGG GENES
![Page 11: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/11.jpg)
KEGG GENES main components
KEGG Orthology (KO) Ortholog groups linked to PATHWAY and BRITE
GENES Gene catalogs of complete genomes with manual functional annotation
DGENES Gene catalogs of draft genomes with automatic functional annotation
EGENES Consensus contigs of EST data with automatic functional annotation
SSDB Sequence similarity with best-hit information for identifying ortholog/paralogs
GENES, DGENES, EGENES: Genomic information KEGG Orthology, SSDB: Relationship
![Page 12: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/12.jpg)
Organisms in KEGG GENES
http://www.genome.jp/kegg/catalog/org_list.html
![Page 13: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/13.jpg)
141 Eukaryotes
43 Animals (Complete / Draft Genomes) 54 Plants (Mostly EST Consensus Contigs) 27 Fungi (Mostly Complete Genomes) 17 Protists (Mostly Complete Genomes)
633 Eubacteria (All Complete Genomes)
52 Archea (All Complete Genomes)
Total: 826 organisms (30 May 2008)
Organisms in KEGG GENES
![Page 14: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/14.jpg)
• Hierarchical classification of organisms is based on Taxonomy database in NCBI
• 3-4 letter code / T number
Homo sapiens Escherichia coli K-12 MG1655 Escherichia coli K-12 W3110
Fugu rubripes Hordeum vulgare (barley) (EST)
Organisms in KEGG GENES
⇒ hsa ⇒ eco ⇒ ecj ⇒ dfru ⇒ ehvu
GENES
DGENES EGENES
![Page 15: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/15.jpg)
Entry of GENES
![Page 16: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/16.jpg)
Entry field Entry ID Entry types (CDS, RNA, Contig, etc. …) Organism name
![Page 17: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/17.jpg)
Gene name field Names and synonyms of geens and/or proteins
![Page 18: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/18.jpg)
Definition field Functional annotation assigned by original genome project
![Page 19: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/19.jpg)
Orthology field Ortholog annotation assigned by KEGG project
![Page 20: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/20.jpg)
Pathway field Links to pathway maps including this gene
![Page 21: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/21.jpg)
![Page 22: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/22.jpg)
Class field Link to BRITE functional categories
![Page 23: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/23.jpg)
Another Example of BRITE
![Page 24: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/24.jpg)
SSDB field Link to SSDB for obtaining • Orthologs:
• BBH pairs • Paralogs:
• homologs in the same organism
• Conserved gene clusters
![Page 25: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/25.jpg)
![Page 26: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/26.jpg)
Motif field Domains and motifs found in the protein sequence
![Page 27: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/27.jpg)
Other DBs field
Links to other databases
LinkDB field
![Page 28: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/28.jpg)
Structure field
![Page 29: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/29.jpg)
Position field Locus on the genome sequence contained in the KEGG GENOME database
![Page 30: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/30.jpg)
AA seq field
Amino acid and nucleotide sequence in FASTA format and link to BLAST search in GenomeNet
NT seq field
![Page 31: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/31.jpg)
KEGG GENES main components
KEGG Orthology (KO) Ortholog groups linked to PATHWAY and BRITE
GENES Gene catalogs of complete genomes with manual functional annotation
DGENES Gene catalogs of draft genomes with automatic functional annotation
EGENES Consensus contigs of EST data with automatic functional annotation
SSDB Sequence similarity with best-hit information for identifying ortholog/paralogs
GENES, DGENES, EGENES: Genomic information KEGG Orthology, SSDB: Relationship
![Page 32: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/32.jpg)
KEGG Orthology Database Ortholog groups bridging the genomic space
and systems space
![Page 33: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/33.jpg)
Linking GENES to PATHWAY
Reconstructing biological phenomenon with various omics data and researcher’s knowledge
Chemical Information KEGG LIGAND
Genomic Information KEGG GENES
Hierarchical classification KEGG BRITE
Systems Information KEGG PATHWAY
KEGG Orthology
![Page 34: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/34.jpg)
• KEGG PATHWAY database • Important component of the KEGG databases • Golden standard of metabolic pathways for bioinformatics
Node: genes and compounds Edge: reactions and interactions
Linking GENES to PATHWAY
![Page 35: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/35.jpg)
Compound nodes in the pathway networks
Linking GENES to PATHWAY
![Page 36: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/36.jpg)
Gene/protein nodes in the pahtway networks
Linking GENES to PATHWAY
![Page 37: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/37.jpg)
Thus, grouping by the orthologous relationships
Linking GENES to PATHWAY
hsa:5211 hsa:5213 hsa:5214
sce:YGR240C sce:YMR205C
eco:b1723 eco:b3916
KEGG Orthology
H. sapiens S. cerevisiae E. coli
K00850
![Page 38: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/38.jpg)
Nodes in the reference pathways Reference nodes for genes of all organisms
KEGG Orthology
hsa:5211 hsa:5213 hsa:5214
sce:YGR240C sce:YMR205C
eco:b1723 eco:b3916
H. sapiens S. cerevisiae E. coli
K00850
GENES
KO
Pathways in
Reference pathway
Genes and compounds
KOs and compounds
![Page 39: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/39.jpg)
KEGG Orthology
GENES
Annotated
Unannotated
10,890 ortholog groups contain 28% of KEGG GENES entries
E. coli B. subtilis S. cerevisiae H. sapiens A. thaliana
![Page 40: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/40.jpg)
DGENES
Draft genome
GENES
Complete genome
EGENES
Transcriptome EGassembler
EST contigs
Annotation via KEGG Orthology
SW score BLASTP BLASTX
Homology & BBH
Automatic annotation
KAAS
Curation
KO, PATHWAY & BRITE
![Page 41: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/41.jpg)
KEGG Automatic Annotation Server Ortholog assignment and pathway mapping
![Page 42: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/42.jpg)
• KEGG Automatic Annotation Server • http://www.genome.jp/kegg/kaas/
• Automatic annotation system for KO • Using GENES as a template set • More than 90% accuracy
• Reconstruct PATHWAY by using your own data set
KAAS
![Page 43: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/43.jpg)
Functional Annotation in KAAS 1. Query gene
2. Homologs
3. Ortholog candidates
4. KEGG Orthology groups
5. Ranking of KEGG Orthology
BLAST
Cut off by bi-directional best hit rate
Grouping by KEGG Orthology
Scoring by probability and heuristics
Bi-directional best hit rate BHRab = Rf × Rr Genome A Genome B
Gene a Gene a’ Gene b’
Gene b S
S’: best hit
Rf = S / S’
Moriya, Y. et al. Nucl. Acids Res. 2007 35:W182-W185
![Page 44: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/44.jpg)
Your query genome sequences KEGG GENES entries with KO
> seq 1 > seq 2 > seq 3 > seq 4 > seq 5 > seq 6 > seq 7 ⋯ ⋯ ⋯
> Your query sequences 1
> Your query sequences 2
> Your query sequences 3
> Your query sequences 4
> Your query sequences 5
⋯
⋯
⋯
Ortholog3
KO2
Score 2
Homolog
SKO = Sh - log2(mn) - log2(ΣxCkpk(1-p)x-k) k=N x
Ortholog1
KO3
Score 3
Ortholog2
KO1
Score 1
1. The highest score
1.
2. Normalization by the sequence lengths
2.
3. Weighting factor of the number of ortholog candidates
3.
Scoring and Ranking of KO
![Page 45: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/45.jpg)
Moriya, Y. et al. Nucl. Acids Res. 2007 35:W182-W185
Pathway mapping in KAAS
![Page 46: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/46.jpg)
EGassembler Easy assembling nucleotide sequences
for pathway reconstruction
![Page 47: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/47.jpg)
EGassembler
1. EST (GSS, cDNA, gDNA) sequences
2. Sequence cleaning
3. Repeat masking
4. Vector masking
5. Organelle masking
6. Sequence assembling (CAP3)
PolyA/PolyT, Low-complexity, Low-quality filtering Short sequence removal (seqclean)
RepBase, TIGR, TREP, User’s database (RepeatMasker)
UniVec, emvec, User’s database (CrossMatch)
NCBI organelle database
Automatic all-in-one pipeline Masoudi-Nejad, A. et al. Nucl. Acids Res. 2006 34:W459-W462
http://egassembler.hgc.jp/
![Page 48: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/48.jpg)
1. EST Raw Data
![Page 49: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/49.jpg)
PolyA / PolyT tail
Low complex sequence
Low quality sequence
(Shorter than 100) Short sequence
2. Sequence Cleaning
![Page 50: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/50.jpg)
Removed
![Page 51: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/51.jpg)
Repetitive sequences
Libraries • Human • Your custom repeats library • EGrep repeats library (Our custom repeatlibrary) • RepBase repeats library • TREP repeats library • TIGR repeats library
3. Repeat Masking
![Page 52: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/52.jpg)
Masked
Masked as “XXXXXX…”
![Page 53: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/53.jpg)
Vector sequences
Libraries • Your custom vector library • EGvec vector library (Our custom vectors library) • NCBI‘s vector library CoreRedundant • EMBL vector library
4. Vector Masking
![Page 54: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/54.jpg)
Masked
Masked as “XXXXXX…”
![Page 55: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/55.jpg)
Organelle sequences
Libraries • Your custom organelle library • Plastids library (44 species including Arabidopsis, Chlamydomonas, Lotus, Zea mays, etc.)
• Mitochondria library (Fungi, Metazoa, Plants, Plasmid, etc.)
5. Organelle Masking
![Page 56: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/56.jpg)
Masked
Masked as “XXXXXX…”
![Page 57: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/57.jpg)
Huang, X. and Madan, A. Genome Res. (1999) 9:868-877
6. Assembly by CAP3
![Page 58: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/58.jpg)
Potential overlap determination by BLAST-like technique
![Page 59: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/59.jpg)
Local alignment by Smith-Waterman algorithm
+ Singletons
![Page 60: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/60.jpg)
Assembly
+ Singletons
contig1 contig2
contig3 contig4
contig5 contig6 contig7
![Page 61: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/61.jpg)
EST consensus contigs and Singletons in FASTA format
>contig1
>contig2
>contig3
>contig4
>contig5
>contig6
>contig7
+ Singletons >1
>2
>3
>4
![Page 62: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/62.jpg)
Solanum lycopersicum (Tomato) : Taxonomy TAX: 4081
Number of accepted ESTs at NCBI : 260126 (2007/5/22) clean
Number of cleaned ESTs : 246118 (~96% left)
Assembled sequences: Contigs 19479 + Singletons 18010 (Total : 37507)
assembly
EGENES Example
![Page 63: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/63.jpg)
Alignment results An example of an alignment result Cleaned sequences
Three ESTs
![Page 64: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/64.jpg)
>Contig1 GGCACGAGGAAAAAAAAAAAAAAAGAATGGATTCCAAAGCATTTCTATTTTTTGGTCTTT TTTTGGCTATTTTCCTAATGATAAGCTCTGAGATTTTAGCTACTGAGTTGGCTGAGAACT CTAAGAAATCTGAAAACAAGAATGAAGTACATGAAGCCCAATACGGTGGATATCCTGGTG GTGGTGGTGGATATGGACGCGGTGGTGGTGGTGGATATGGACGCGGTGGTGGTGGAGGAT ATGGACGTGGTGGTGGATACGGACATGGTGGTGGTGGTGGATATGGACATGGTGGTGGTG GTGGATATGGACACGGTGGTGGTGGATACGGACACCGTGGTGGCGGTGGTGGTGGACGAC GTGGTGGATACTGCCAGTATGGTTGCTGTGGCCATGGTGACAATGGTTGCTATAGGTGTT GCTCCTATAAAGGTGAGGCAATGGACAAAGTTACTCAAGCTAAGCCACACAATTAATTAA TTATGTGTGGAGTACGTAGTATATTATATTAAAACTTTTGTAATGGCAATTATGTAATAT TATTAGCAATGCTCTTTCTACTTTAGAGCTTGCTATAATATTACTAAAAATGTATTGAAT AAAAGCCATGTTGTAGTAATTTTATTATCAATATTTTATCATGATATTCATATTTACTGT ATTTCACTAATTATACCAAAAAGTTTTAGTGC …
>Contig13317 GATTTGGGAAGACCATATGGTAGAGTTAGTAGAAAGAAGCTGAAGTTGAAATTGTTGAAT AGTTGTGTTGAGAACAGAGTGAAGTTTTATAAAGCTAAGGTTTGGAAAGTGGAACATGAA GAATTTGAGTCTTCAATTGTTTGTGATGATGGTAAGAAGATAAGAGGTAGTTTGGTTGTG GATGCAAGTGGTTTTGCTAGTGATTTTATAGAGTATGACAGGCCAAGAAACCATGGTTAT CAAATTGCTCATGGGGTTTTAGTAGAAGTTGATAATCATCCATTTGATTTGGATAAAATG GTGCTTATGGATTGGAGGGATTCTCATTTGGGTAATGAGCCATATTTAAGGGTGAATAAT GCTAAAGAACCAACATTCTTGTATGCAATGCCATTTGATAGAGATTTGGTTTTCTTGGAA GAGACTTCTTTGGTGAGTCGTCCTGTTTTATCGTATATGGAAGTAAAAAGAAGGATGGTG GCAAGATTAAGGCATTTGGGGATCAAAGTGAAAAGTGTTATTGAGGAAGAGAAATGTGTG ATCCCTATGGGAGGACCACTTCCGCGGATTCCTCAAAATGTTATGGCTATTGGTGGGAAT TCAGGGATAGTTCATCCATCAACAGGGTACATGGTGGCTAGGAGCATGGCTTTAGCACCA GTACTAGCTGAAGCCATCGTCGAGGGGCTTGGCTCAACAAGAATGATAAGAGGGTCTCAA CTTTACCATAGAGTTTGGAATGGTTTGTGGCCTTTGGATAGAAGATGTGTTAGAGAATGT TATTCATTTGGGATGGAGACATTGTTGAAGCTTGATTTGAAAGGGACTAGGAGATTGTTT GACGCTTTCTTTGATCTTGATCCTAAATACTGGCAAGGGTTCCTTTCTTCAAGATTGTCT GTCAAAGAACTTGGTTTACTCAGCTTGTGTCTTTTCGGACATGGCTCAAACATGACTAGG TTGGATATTGTTACAAAATGTCCTCTTCCTTTGGTTAGACTGATTGGCAATCTAGCAATA GAGAGCCTTTGAATGTGAAAAGTTTGAATCATTTTCTTCATTTTAAA
…
>Contig19479 … Total : 19479 contigs
![Page 65: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/65.jpg)
>gi|56121773|gb|CV970760.1|CV970760 SlE8 SlE Solanum lycopersicum cDNA, mRNA sequence TGATAACCTAGCTTTGTGATCATAATCATGGGTTCTTCTTCTCAAAAGAGGGTTGTAGGG TTACTTCTTGTCTTGAGTATTTTCTTGGAATTGAGTGCCATCACTTTTGGTGATGATAAG TTTGAAGAGTCGAGGTGGGGTAATGATTATGGTTGTGGACGATTTGGTAGAAGGGGATGT AGTGGACGCGATCAAGGTGGTGGTAGAGGTGTTGGAGGAGGGTTTGGAGGCGGAGCTGGT GGAGGAGGAGGTC >gi|56121770|gb|CV970757.1|CV970757 SlE5 SlE Solanum lycopersicum cDNA, mRNA sequence CTTAATTCCAAAGAGTCTTGTTGTACATCCCCCAAATTTGAGGAACTGGTGGTTGGTGCA TTATCCCTGCTAGTAATGAGGCTTTTTGCTGGCGGCCCCGGCCGGGGTCGTCATTATTTG CCACATATCTTAGTGGCCATTTTGGCTCTTGTTGATATTGTTTCTGCTGACCCTTACATA TATGCCTCTCCACCACCACCGTATGTGTACAAATCCCCACCACCTCCTTCTCCTTCTCCA CCACCACCATACGTGTACAAGTCCCCGCCACCTCCTTCTCCCTCTCCTCCACCACCGTAC GTGTACAAGT >gi|117725616|gb|DB705376.1|DB705376 DB705376 Solanum lycopersicum cv. Micro-Tom leaf Solanum ly copersicum cDNA clone LEFL1088BH06 5', mRNA sequence TTTTAAAATGAATGCATATAGTGGCGAAGCATGTTCTGTAGTTAATAGGGCTTGTTGTTG TATTATAGGTATAAGAAAGATACATTTTTGCACTTAGATCCACTATGATGTGAGTTATTC AACTTGGATTTGAGTGTAAAGTGTATATATAGTTGAGGTCTTCAATCTATTACAATGTTA CGACAAATTGGATCTACAGTGGGTTCTTGGACTCACAAGAAGATCGTTGATCCTCTTCTC CAAATCCTTCGTAGGGGTGCAGAGCCAAAACAATTGGCATTCTCTGGGGCTCTTGGTGCT ACATTGGGTCTCTTTCCCATCTGTGGGGTTGCTGTGTTTCTATGTGGCGTAGCTATTGTA GTACTTGGATCCTTGTGTCATGCACCAACTGTGTTGTTGGTCAACTTCATTGTTACTCCC ATTGAGCTGAGTTTGGTGATTCCTTTCCTACGTTTAGGTGAATATGTGAGTGGTGGACCT CATTTTGCTTTGACCTCAGATGCATTAAAGAGGGTCTTCACTGGTAAAGCTTCGTGGGAA GTCTTGCTGAGCATTTACCATGCGTTGCTGGGCTGGCTTGTTGCTGTACCATTCATC …
+ 18010 singletons
![Page 66: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/66.jpg)
EGENES Example
![Page 67: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/67.jpg)
![Page 68: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/68.jpg)
EGENES Example
![Page 69: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/69.jpg)
![Page 70: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/70.jpg)
KegArray Mapping microarray expression data
to KEGG pathway data
![Page 71: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/71.jpg)
• Java application for handling expression data • http://www.genome.jp/download/
• Microarray data with KEGG Gene IDs • Coloring based on expression levels • Mapping the coloring to pathway data • ID conversion available: NCBI-GeneID, IPI, …
• Metabolome data can be also mapped
KegArray
![Page 72: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/72.jpg)
• GECS: Gene Expression to Chemical Structure • Microarray data -> Carbohydrate structures • http://www.genome.jp/tools/gecs/
• GENIES • Gene Network Inference Engine based on Supervised Analysis
• Integration of multiple omics data to infer gene functions
• Using pathway data for supervised data • http://www.genome.jp/tools/genies/
Other tools
![Page 73: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/73.jpg)
• KEGG is 13 years old database for genomic, chemical, and systems information.
• Genomic information (GENES section) includes complete genomes, draft genomes and EST contigs
• KEGG Orthology plays a key role in connecting genomic and systems information
• Users can input their own sequences (genomes or EST collections) for reconstructing pathway data using KAAS and EGassembler
• KegArray is a Java application for expression data • GECS and GENIES can be used to infer glycan structures and gene functions, respectively
Summary
![Page 74: Introduction to KEGG - metabolomics.se Biology/Lectures/KEGG.pdf• Gene Network Inference Engine based on Supervised Analysis • Integration of multiple omics data to infer gene](https://reader031.vdocuments.site/reader031/viewer/2022030504/5ab0dc197f8b9ac3348ba977/html5/thumbnails/74.jpg)
Microarray
Yeast two-hybrid
Subcellular localizaton
Phylogenetic profile
Network inference
Similarity matrix
Network inference from multiple data