![Page 1: CS276B Text Information Retrieval, Mining, and Exploitation Lecture 16 Bioinformatics II March 13, 2003 (includes slides borrowed from J. Chang, R. Altman,](https://reader035.vdocuments.site/reader035/viewer/2022062422/56649f005503460f94c15f9b/html5/thumbnails/1.jpg)
CS276B
Text Information Retrieval, Mining, and Exploitation
Lecture 16
Bioinformatics II
March 13, 2003
(includes slides borrowed from J. Chang, R. Altman, L. Hirschman, A. Yeh, S. Raychaudhuri)
Bioinformatics Topics
Last week
  • Basic biology
  • Why text about biology is special
  • Text mining case studies
    • Microarray analysis, Abbreviation mining
Today
  • Combined text mining and data mining I
    • Text-enhanced homology search
  • Text mining in biological databases
  • KDD cup: Information extraction for bio-journals
  • Combining text mining and data mining II
Text-Enhanced Homology Search
(Chang, Raychaudhuri, Altman)
Sequence Homology Detection
Obtaining sequence information is easy; characterizing sequences is hard.
Organisms share a common basis of genes and pathways.
Information can be predicted for a novel sequence based on sequence similarity:
  • Function
  • Cellular role
  • Structure
PSI-BLAST
Used to detect protein sequence homology. (Iterated version of the universally used BLAST program.)
Searches a database for sequences with high sequence similarity to a query sequence.
Creates a profile from similar sequences and iterates the search to improve sensitivity.
PSI-BLAST Problem: Profile Drift
At each iteration, the search could find non-homologous (false positive) proteins.
False positives create a poor profile, leading to more false positives.
Addressing Profile Drift
PROBLEM: Sequence similarity is only one indicator of homology.
More clues, e.g. protein functional role, exist in the literature.
SOLUTION: we incorporate MEDLINE text into PSI-BLAST.
Modification to PSI-BLAST
Before including a sequence, measure the similarity of its literature. Throw away the sequences with the least similar literature to avoid drift.
Literature is obtained by following SWISS-PROT gene annotations to MEDLINE (text, keywords).
Define domain-specific “stop” words: words associated with fewer than 3 sequences or more than 85,000 sequences. This marks 80,479 of 147,639 words as stop words.
Use a similarity metric between literatures (for genes) based on word-vector cosine.
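The filtering step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: each sequence's literature is assumed to be already tokenized into a word list, similarity is assumed to be measured against the query's literature using raw term counts (the slides do not say whether a weighting such as TF-IDF was used), and the names `build_vectors`, `cosine`, and `filter_hits` are hypothetical. The `min_df` and `max_df` arguments play the role of the stop-word thresholds (3 and 85,000 sequences on the slide).

```python
import math
from collections import Counter


def build_vectors(literatures, min_df, max_df):
    """Turn {sequence_id: [word, ...]} into term-count vectors.

    Words whose literature is attached to fewer than min_df or more than
    max_df sequences are treated as domain-specific stop words and dropped.
    """
    # Count, for each word, how many sequences have it in their literature.
    df = Counter()
    for words in literatures.values():
        df.update(set(words))
    keep = {w for w, n in df.items() if min_df <= n <= max_df}
    return {seq: Counter(w for w in words if w in keep)
            for seq, words in literatures.items()}


def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * v[word] for word, count in u.items() if word in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_hits(query, hits, vectors, drop_fraction):
    """Keep hits, dropping the fraction with the least similar literature.

    Assumption: similarity is measured against the query's literature.
    """
    ranked = sorted(hits,
                    key=lambda h: cosine(vectors[query], vectors[h]),
                    reverse=True)
    n_drop = int(len(ranked) * drop_fraction)
    return ranked[:len(ranked) - n_drop]
```

In a text-enhanced search loop, a function like `filter_hits` would be applied to the candidate sequences of each iteration before the profile is rebuilt, so that sequences with unrelated literature do not contaminate the profile.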
Evaluation
Created families of homologous proteins based on SCOP (gold standard site for homologous proteins: http://scop.berkeley.edu/)
Select one sequence per protein family:
  • Families must have at least five members
  • Associated with at least four references
  • Select the sequence with the worst performance on a non-iterated BLAST search
Evaluation
Compared homology search results from original and our modified PSI-BLAST.
Dropped the lowest 5%, 10%, and 20% of genes, ranked by literature similarity, during PSI-BLAST iterations.
Results
46/54 families had identical performance
2 families suffered from PSI-BLAST drift, avoided with text-PSI-BLAST.
3 families did not converge for PSI-BLAST, but converged well with text-PSI-BLAST
2 families converged for both, with slightly better performance by regular PSI-BLAST.
Discussion
Profile drift is rare in this test set and can sometimes be alleviated when it occurs.
Overall PSI-BLAST precision can be increased using text information.
Mining Text in Biological Databases
Where is the Information? What is the Data?
GenBank – genetic sequences
Swiss-prot – protein sequences
DNA chips / microarrays
Metabolic pathways
Signaling pathways / regulatory networks
Medline – biomedical literature
Taxonomies / Ontologies
Genetic Information in GenBank
[Figure: growth of GenBank, 1983 to 1998, on a logarithmic vertical axis from 1 to 100,000,000,000. Series: Base Pairs, Sequences.]
• Numbers are for all species.
• Biology is fundamentally an information science.
Species represented in GENBANK
| Entries | Bases | Species |
|---|---|---|
| 4323294 | 7028540140 | Homo sapiens |
| 2595599 | 1385749133 | Mus musculus |
| 166778 | 488340565 | Drosophila melanogaster |
| 182124 | 247830592 | Arabidopsis thaliana |
| 114669 | 203787073 | Caenorhabditis elegans |
| 189000 | 165542107 | Tetraodon nigroviridis |
| 159412 | 136005048 | Oryza sativa |
| 219183 | 107771966 | Rattus norvegicus |
| 166688 | 75404535 | Bos taurus |
| 155647 | 68679866 | Glycine max |
| 109941 | 56390403 | Lycopersicon esculentum |
| 70448 | 51527034 | Hordeum vulgare |
| 104773 | 51202716 | Medicago truncatula |
| 91352 | 50512383 | Trypanosoma brucei |
| 56416 | 49410018 | Giardia intestinalis |
| 77536 | 47598841 | Strongylocentrotus purpuratus |
| 49939 | 44524589 | Entamoeba histolytica |
| 86706 | 42479448 | Danio rerio |
| 79696 | 37899117 | Zea mays |
| 71318 | 37381894 | Xenopus laevis |
Complete Genomes
Aquifex aeolicus
Archaeoglobus fulgidus
Bacillus subtilis
Borrelia burgdorferi
Chlamydia trachomatis
Escherichia coli
Haemophilus influenzae
Methanobacterium thermoautotrophicum
Caulobacter crescentus
Helicobacter pylori
Methanococcus jannaschii
Mycobacterium tuberculosis
Mycoplasma genitalium
Mycoplasma pneumoniae
Pyrococcus horikoshii
Treponema pallidum
Saccharomyces cerevisiae
Drosophila melanogaster
Arabidopsis thaliana
Homo sapiens
Protein Sequences
Swiss-prot (as of 3/03)
  • 122,564 sequences
  • Almost 45,000,000 total amino acids
  • 103,486 references
http://www.expasy.ch/sprot/
Three-Dimensional Structures
Protein three-dimensional structures
Protein Data Bank (PDB), as of March 27, 2001
  • 13,158 proteins
  • 939 nucleic acids
  • 616 protein/nucleic acid complexes
  • 18 carbohydrates
http://www.rcsb.org/pdb/
Complete yeast genome (6000 genes) on a chip.
Online access to DNA chip Data
http://genome-www4.stanford.edu/MicroArray/SMD/
O(10) data sets available from Stanford site
10,000 to 40,000 genes per chip
Each set of experiments involves 3 to 40 “conditions”
Each data set is therefore near 1 million data points.
People gearing up for these measurements everywhere…
A Reaction in EcoCYC
KEGG
Signaling Pathways
Where’s the Information? Medical Literature Online
Online database of published literature since 1966 = Medline = PubMED resource
4,000 journals
10,000,000+ articles (most with abstracts)
www.ncbi.nlm.nih.gov/PubMed/
PubMed
SwissProt
103,000 references
100s of Mb of text
100,000s of unique words
Abstracts Referenced in SP37
Number of abstracts associated with sequences in Swiss Prot.
(# sequences truncated at 100)
(as of 2001)
MESH = Medical Subject Headings
Controlled vocabulary for indexing biomedical articles.
19,000 “main headings” organized hierarchically
Browser at http://www.nlm.nih.gov/mesh/MBrowser.html
MESH
UMLS: Semantic Model of Biomedical Language
Representing more of semantics of words and more relationships.
UMLS = Unified Medical Language System
http://www.nlm.nih.gov/research/umls/
UMLS Elements
Semantic concepts (475K) = specific terms connected to semantic categories (e.g. Munchausen syndrome linked to Behavioral-Dysfunction)
Concept maps (1,000K) = mapping from a terminology to a semantic concept (e.g. ICD-9 Billing code to Munchausen syndrome)
Categorizations = relate semantic concepts
Conceptual links (7K) = relate two semantic concepts with a semantic relationship
Gene Ontology
(http://www.geneontology.org/)
A controlled listing of three types of function:
  • Molecular Function
  • Biological Process
  • Cellular Component
Vision: universal language for molecular biology across species
Molecular Function
<molecular_function ; GO:0003674
 %anti-toxin ; GO:0015643
  %lipoprotein anti-toxin ; GO:0015644
 %anticoagulant ; GO:0008435
 %antifreeze ; GO:0016172
  %ice nucleation inhibitor ; GO:0016173
 %antioxidant ; GO:0016209
  %glutathione reductase (NADPH) ; GO:0004362 ; EC:1.6.4.2 % flavin-containing electron transporter ; GO:0015933 % oxidoreductase\, acting on NADH or NADPH\, disulfide as acceptor ; GO:0016654
  %thioredoxin reductase (NADPH) ; GO:0004791 ; EC:1.6.4.5 % flavin-containing electron transporter ; GO:0015933 % oxidoreductase\, acting on NADH or NADPH\, disulfide as acceptor ; GO:0016654
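One entry of the listing above can be taken apart with a few lines of Python. This is only a sketch based on what is visible in the excerpt: it assumes that the leading character (`<` or `%`) marks the relationship to the enclosing term, that fields are separated by ` ; `, that further `%` chunks on the same line name additional parents, and that `\,` is an escaped comma. The function names `parse_go_line` and `parse_chunk` are hypothetical.

```python
import re


def parse_chunk(chunk):
    """Split 'name ; GO:id ; EC:id' into a name and a list of identifiers."""
    fields = [f.strip() for f in chunk.split(" ; ")]
    # Undo the escaped commas used inside term names.
    name = fields[0].replace("\\,", ",")
    return {"name": name, "ids": fields[1:]}


def parse_go_line(line):
    """Parse one line of the hierarchy listing shown on the slide.

    Assumption: the first non-blank character is a relationship marker,
    and later ' % ' separated chunks list additional parent terms.
    """
    text = line.strip()
    relation, body = text[0], text[1:]
    chunks = [c.strip() for c in re.split(r"\s%\s*", body)]
    term = parse_chunk(chunks[0])
    term["relation"] = relation
    term["other_parents"] = [parse_chunk(c) for c in chunks[1:]]
    return term
```

For the glutathione reductase line, this yields the term name, its GO and EC identifiers, and two additional parents, which shows that the listing encodes a graph in which a term can have several parents rather than a strict tree.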
Current Genome Annotations
http://www.geneontology.org
KDD Cup 2002: Information Extraction for Biological Text
Task Background: FlyBase
FlyBase project
  • Curates biomedical publications on the fruitfly
  • Uses GO (gene ontology) as ontology
  • Fruitfly (Drosophila melanogaster) is one of the key “model organisms”
FlyBase goals
  • Distillation of literature on the fruitfly
  • Table of contents function
  • Support search of literature
Current methodology: Manual curation
  • Curators read the literature and manually update FlyBase
Goal of KDD Cup 2002: Can this be (partially) automated?
FlyBase: Example of Data Curation
Curators Cannot Keep Up with the Literature!
FlyBase References By Year
Task Rationale and Description
FlyBase provided the
  • Data annotation (plus biological expertise)
  • Input on the task formulation
    • What can be useful to the curators
Start fairly simple. Try to help automate part of what one group of FlyBase curators needs to do:
  • Determine which papers need to be curated for fruit fly gene expression information
  • Want to curate those papers containing experimental results on gene products (RNA transcripts and proteins)
Some Data (Text) Preparation Challenges
Abstracts are not enough, need the full papers
  • E.g., for one paper on Appl proteins (PubMed ID #8764652), FlyBase lists 19 “when-where” pairs for Appl protein expression
    • A “when-where” pair indicates when in the life cycle and where in the body some transcript or protein is found
    • “When-where” pair example: adult-brain
  • Only 2 of the 19 pairs (11%) are mentioned in the abstract. The rest are only mentioned in the body of the full paper
  • So need full papers in electronic form
Some Data (Text) Preparation Challenges
Full papers are copyrighted by publishers
  • For the contest, only use “free” papers
As a result of all these complications, out of the ~7100 papers in FlyBase that were of interest only ~1100 were used
Some Data (Text) Preparation Challenges (Continued)
Plain text is not enough, also need things like superscripts, subscripts, italics, Greek letters (in English text)
  • E.g., represent alleles (variants of a gene) with superscripts
    • Some Appl gene alleles: Appl^d, Appl^s, Appl^sd (the part after ^ is a superscript)
  • If the superscripts are lost, these appear as: Appld, Appls, Applsd
    • This would make it harder to determine that these refer to the same gene
    • Need to know what suffixes to remove before trying to match
Some Data (Text) Preparation Challenges (Continued)
FlyBase has certain conventions to represent superscripts, etc. in ASCII
  • E.g., represent those alleles as Appl[d], Appl[s], Appl[sd]
In general, gene and protein names are already hard to match because they often have a complicated word structure (morphology)
  • One needs to know what morphological transformations (like prefix or suffix removal) to perform before attempting to match the names
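A small Python sketch shows why the bracket convention helps with matching. It assumes allele names follow the `Gene[allele]` pattern from the Appl example; the pattern and the names `split_allele` and `same_gene` are illustrative and do not cover the full FlyBase nomenclature.

```python
import re

# Assumed pattern: a gene symbol followed by one bracketed allele
# designation, as in the Appl[d] example.
ALLELE = re.compile(r"^(?P<gene>[^\[\]]+)\[(?P<allele>[^\[\]]+)\]$")


def split_allele(name):
    """Split 'Appl[sd]' into ('Appl', 'sd'); a plain name gives (name, None)."""
    match = ALLELE.match(name)
    if match:
        return match.group("gene"), match.group("allele")
    return name, None


def same_gene(a, b):
    """True when two names share a gene symbol once the allele part is removed."""
    return split_allele(a)[0] == split_allele(b)[0]
```

With the brackets preserved, `same_gene("Appl[d]", "Appl[s]")` is true. With the superscripts flattened to `Appld` and `Appls`, the same check fails, because nothing marks where the gene symbol ends.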
Information Extraction Task
Given for each paper
  • The full text of that paper
  • A list of the genes mentioned in that paper
Determine for each paper
  • For each gene mentioned in the paper, does that paper have experimental results for
    • Transcript(s) of that gene (Yes/No)?
    • Protein(s) of that gene (Yes/No)?
Task is Harder Than It First Appears
Interested in results applicable to “regular” (found in the wild) flies, not mutants
Genes have multiple names (synonyms)
  • Given a list of the known synonyms
  • But list may be incomplete
Some names can refer to multiple genes
  • E.g., “Clk” is a symbol for one gene (Clock) and is also a synonym for another gene (period, symbol is “per”)
Contestants given evidence of experimental results found in the training data,
  • But only in the form that is recorded in the FlyBase database
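The synonym ambiguity can be made concrete with a name index that maps every known name to the set of gene symbols it may denote. The sketch below is in Python and uses only the Clk/per example from the slide; the record layout and the function names `build_name_index` and `candidates` are hypothetical.

```python
from collections import defaultdict


def build_name_index(genes):
    """Map each symbol, full name, and synonym to the gene symbols it may denote.

    Matching is exact and case-sensitive here, on the assumption that
    capitalization can be meaningful in gene symbols.
    """
    index = defaultdict(set)
    for symbol, record in genes.items():
        for name in [symbol, record["full_name"], *record["synonyms"]]:
            index[name].add(symbol)
    return index


def candidates(index, name):
    """Return all gene symbols a name could refer to (empty set if unknown)."""
    return set(index.get(name, set()))


# Toy records built from the example on the slide.
genes = {
    "Clk": {"full_name": "Clock", "synonyms": []},
    "per": {"full_name": "period", "synonyms": ["Clk"]},
}
index = build_name_index(genes)
```

Here `candidates(index, "Clk")` contains both `Clk` and `per`, so a system must use context to decide which gene a mention refers to. A name missing from the index returns an empty set, which corresponds to the incomplete synonym list mentioned above.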
Training Data in FlyBase
Database (DB) records what evidence is found in a training paper, but not where in that paper
The evidence is often recorded in a “normalized” form and domain knowledge is needed to find the corresponding text, e.g.,
  • DB: Assay mode: “immunolocalization”
  • Text (PubMed ID#9006979): “Figure 12. …Whole-mount tissue staining using an affinity-purified anti-PHM antibody in the CNS … This view displays only a portion of the CNS”
  • Term “immunolocalization” is not in the text
  • Instead, text describes the process of performing an immunolocalization
Typical NLP Training Data: More Detailed
These systems assume every mention of an entity or relation of interest in the text is annotated
  • So anything not annotated is not a mention
E.g., Annotations to train a “Northern blot” detector:
Paper #7540168: ... transcripts on Northern analyses, raising questions whether @norpA@ ... @Northern Blots@ ... Northern blots were carried out as described by Zhu @et al.@(1993) ... @Northern Analysis of Adult RNA@ ... Figure 3: Northern blot analysis of @norpA@ transcripts in adult ...
This paper has a total of 19 mentions.
Task Details
The task has 3 sub-tasks that contribute equally to the overall score
1. Ranked-list of papers (curatable before non-curatable)
2. Yes/No decisions on the papers being curatable (having any results of interest)
3. Yes/No decisions for having results for each type of product (transcript, protein) for each gene mentioned in a paper
Some Numbers
Training set: 862 articles
Test set: 213 articles (non-public!)
Time allowed:
Release training set, wait ~6 weeks
Release test set, results due ~2 weeks later
18 teams submitted 32 entries
Entries from 7 “countries”: Japan, Taiwan, Singapore, India, UK, Portugal, USA
About equal numbers of universities and companies
Evaluation measure: F measure
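For reference, a short sketch of how an F measure is computed from precision and recall. The slide does not say which weighting was used, so the balanced setting `beta=1.0` is an assumption, and the counts in the example are made up for illustration.

```python
def f_measure(true_pos, false_pos, false_neg, beta=1.0):
    """F measure from raw counts; beta=1 is the harmonic mean of
    precision and recall (an assumed default, not from the slides)."""
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts for a yes/no "curatable" decision:
# precision = 60/80 = 0.75, recall = 60/100 = 0.60
score = f_measure(true_pos=60, false_pos=20, false_neg=40)
print(round(score, 3))
```

Because it is a harmonic mean, the F measure stays low unless both precision and recall are reasonably high.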
![Page 63: CS276B Text Information Retrieval, Mining, and Exploitation Lecture 16 Bioinformatics II March 13, 2003 (includes slides borrowed from J. Chang, R. Altman,](https://reader035.vdocuments.site/reader035/viewer/2022062422/56649f005503460f94c15f9b/html5/thumbnails/63.jpg)
Results
Winner: a team from ClearForest and Celera
Used manually generated rules and patterns to perform information extraction
Also had the best score in each of the 3 sub-tasks

| Sub-task | Best | Median |
| --- | --- | --- |
| Ranked-list | 84% | 69% |
| Yes/No curate paper | 78% | 58% |
| Yes/No gene products | 67% | 35% |
![Page 64: CS276B Text Information Retrieval, Mining, and Exploitation Lecture 16 Bioinformatics II March 13, 2003 (includes slides borrowed from J. Chang, R. Altman,](https://reader035.vdocuments.site/reader035/viewer/2022062422/56649f005503460f94c15f9b/html5/thumbnails/64.jpg)
Summary
Reliance on partial annotations is key.
“Information retrieval” task easiest to solve and immediately useful.
Electronic availability of full-text is a big issue.
Mundane format problems (subscripts etc.) are a big issue.
Best results were 67% for information extraction.
![Page 65: CS276B Text Information Retrieval, Mining, and Exploitation Lecture 16 Bioinformatics II March 13, 2003 (includes slides borrowed from J. Chang, R. Altman,](https://reader035.vdocuments.site/reader035/viewer/2022062422/56649f005503460f94c15f9b/html5/thumbnails/65.jpg)
Curated Databases
Flybase is an example of a curated database.
A lot of biological research is organized around such databases (cf. building and publishing software packages in CS)
There are hundreds (thousands?) of curated databases.
13 important databases just for one area: nuclear receptors.
Maintaining curated databases is labor-intensive.
![Page 66: CS276B Text Information Retrieval, Mining, and Exploitation Lecture 16 Bioinformatics II March 13, 2003 (includes slides borrowed from J. Chang, R. Altman,](https://reader035.vdocuments.site/reader035/viewer/2022062422/56649f005503460f94c15f9b/html5/thumbnails/66.jpg)
Curated Databases
Text mining can be used for:
Cost savings
Time savings
Consistency
Freshness
![Page 67: CS276B Text Information Retrieval, Mining, and Exploitation Lecture 16 Bioinformatics II March 13, 2003 (includes slides borrowed from J. Chang, R. Altman,](https://reader035.vdocuments.site/reader035/viewer/2022062422/56649f005503460f94c15f9b/html5/thumbnails/67.jpg)
Curated Databases: Uses
Protein-protein interactions: Which proteins interact with X?
Support information retrieval: Find all transcription factors that are involved in cell death
Interpretation of data-intensive experiments: Microarray case study presented last week
In silico biology
![Page 68: CS276B Text Information Retrieval, Mining, and Exploitation Lecture 16 Bioinformatics II March 13, 2003 (includes slides borrowed from J. Chang, R. Altman,](https://reader035.vdocuments.site/reader035/viewer/2022062422/56649f005503460f94c15f9b/html5/thumbnails/68.jpg)
E-Cell (http://e-cell.org/)
![Page 69: CS276B Text Information Retrieval, Mining, and Exploitation Lecture 16 Bioinformatics II March 13, 2003 (includes slides borrowed from J. Chang, R. Altman,](https://reader035.vdocuments.site/reader035/viewer/2022062422/56649f005503460f94c15f9b/html5/thumbnails/69.jpg)
Curated Databases: Uses (cont.)
Summary/selection of what is known
Support search
Knowledge discovery
Contradictory findings
Nobel Prize
He/She who points out a critical gene-disease link first wins the Nobel Prize.
You had better do a thorough literature search.
![Page 70: CS276B Text Information Retrieval, Mining, and Exploitation Lecture 16 Bioinformatics II March 13, 2003 (includes slides borrowed from J. Chang, R. Altman,](https://reader035.vdocuments.site/reader035/viewer/2022062422/56649f005503460f94c15f9b/html5/thumbnails/70.jpg)
Combining Text Mining and Data Mining
![Page 71: CS276B Text Information Retrieval, Mining, and Exploitation Lecture 16 Bioinformatics II March 13, 2003 (includes slides borrowed from J. Chang, R. Altman,](https://reader035.vdocuments.site/reader035/viewer/2022062422/56649f005503460f94c15f9b/html5/thumbnails/71.jpg)
Combining Text and Links
Recall: Classifying a web document based on
The text it contains
The categories of other pages pointing to it
The categories of other pages it is pointing to
Also: usage information (Pitkow et al.)
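One simple way to picture combining these evidence sources is a weighted vote. This is a sketch only: the function `combine_evidence`, the weights, and the toy categories are assumptions for illustration, not the method from the earlier lecture.

```python
from collections import Counter


def combine_evidence(text_scores, inlink_cats, outlink_cats,
                     w_text=0.5, w_in=0.25, w_out=0.25):
    """Blend a text classifier's per-category scores with the category
    distribution of in-linking and out-linked pages (hypothetical weights)."""
    in_counts, out_counts = Counter(inlink_cats), Counter(outlink_cats)
    categories = set(text_scores) | set(in_counts) | set(out_counts)

    def fraction(counts, cat):
        total = sum(counts.values())
        return counts[cat] / total if total else 0.0

    combined = {
        cat: (w_text * text_scores.get(cat, 0.0)
              + w_in * fraction(in_counts, cat)
              + w_out * fraction(out_counts, cat))
        for cat in categories
    }
    return max(combined, key=combined.get), combined


# Toy example: the text alone slightly favors "sports",
# but the neighboring pages are mostly labeled "biology".
best, combined = combine_evidence(
    text_scores={"biology": 0.4, "sports": 0.6},
    inlink_cats=["biology", "biology", "biology"],
    outlink_cats=["biology", "sports"],
)
print(best, combined)
```

In the toy example the link evidence overrides the weak text signal, which is the intuition behind using neighbors' categories.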
![Page 72: CS276B Text Information Retrieval, Mining, and Exploitation Lecture 16 Bioinformatics II March 13, 2003 (includes slides borrowed from J. Chang, R. Altman,](https://reader035.vdocuments.site/reader035/viewer/2022062422/56649f005503460f94c15f9b/html5/thumbnails/72.jpg)
Clustering: Example (Eisen et al.)
![Page 73: CS276B Text Information Retrieval, Mining, and Exploitation Lecture 16 Bioinformatics II March 13, 2003 (includes slides borrowed from J. Chang, R. Altman,](https://reader035.vdocuments.site/reader035/viewer/2022062422/56649f005503460f94c15f9b/html5/thumbnails/73.jpg)
Combining Gene Expression & Text
Clustering of genes in a microarray experiment
Last week: clustering based on text only, or clustering based on gene expression only
What about combining the two?
There is a large number of “good clusterings” for a particular problem
Use literature to guide clustering
![Page 74: CS276B Text Information Retrieval, Mining, and Exploitation Lecture 16 Bioinformatics II March 13, 2003 (includes slides borrowed from J. Chang, R. Altman,](https://reader035.vdocuments.site/reader035/viewer/2022062422/56649f005503460f94c15f9b/html5/thumbnails/74.jpg)
Comments
Yeast: genes were grouped by expression.
Functional labels guided us to find key subgroups.
Once key subgroups are identified, supervised approaches can refine the identification process.
Cancer: cell lines were grouped by semantic category (hypoxia versus normoxia).
Used supervised approaches to refine the identification process.
![Page 75: CS276B Text Information Retrieval, Mining, and Exploitation Lecture 16 Bioinformatics II March 13, 2003 (includes slides borrowed from J. Chang, R. Altman,](https://reader035.vdocuments.site/reader035/viewer/2022062422/56649f005503460f94c15f9b/html5/thumbnails/75.jpg)
Literature as a guide
Free text documentation is widely available
Patient records to describe pathological specimens
~20,000 documents describing specific yeast genes
May have the information to guide us in searching for similarities in genes and expression
![Page 76: CS276B Text Information Retrieval, Mining, and Exploitation Lecture 16 Bioinformatics II March 13, 2003 (includes slides borrowed from J. Chang, R. Altman,](https://reader035.vdocuments.site/reader035/viewer/2022062422/56649f005503460f94c15f9b/html5/thumbnails/76.jpg)
Goal of algorithm
To identify subgroups of genes with commonalities in gene expression and in biological function.
Literature is the means by which we identify functional commonalities
![Page 77: CS276B Text Information Retrieval, Mining, and Exploitation Lecture 16 Bioinformatics II March 13, 2003 (includes slides borrowed from J. Chang, R. Altman,](https://reader035.vdocuments.site/reader035/viewer/2022062422/56649f005503460f94c15f9b/html5/thumbnails/77.jpg)
Projections in Linear Discriminant Analysis
A normal distribution is estimated for the features of each population of the training set.
Each distribution is centered at the mean of the population
Linear discriminant analysis assumes a pooled covariance matrix.
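A small sketch of the projection these assumptions lead to, for two populations with two features each. With a pooled covariance matrix S and class means m_a and m_b, the discriminant direction is w = S^-1 (m_a - m_b). The data points and helper names below are made up for illustration, and the code is restricted to two features to keep the matrix inverse explicit.

```python
def mean(rows):
    """Per-feature mean of a list of 2-feature samples."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(2)]


def scatter(rows, mu):
    """2x2 sum of outer products of deviations from the class mean."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - mu[0], r[1] - mu[1]]
        for a in range(2):
            for b in range(2):
                s[a][b] += d[a] * d[b]
    return s


def lda_direction(class_a, class_b):
    """Return w = pooled_cov^-1 (mean_a - mean_b) for 2-feature data."""
    mu_a, mu_b = mean(class_a), mean(class_b)
    sa, sb = scatter(class_a, mu_a), scatter(class_b, mu_b)
    dof = len(class_a) + len(class_b) - 2
    # Pooled covariance: both classes share one covariance estimate.
    p = [[(sa[i][j] + sb[i][j]) / dof for j in range(2)] for i in range(2)]
    det = p[0][0] * p[1][1] - p[0][1] * p[1][0]
    inv = [[p[1][1] / det, -p[0][1] / det],
           [-p[1][0] / det, p[0][0] / det]]
    diff = [mu_a[0] - mu_b[0], mu_a[1] - mu_b[1]]
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]


def project(w, x):
    """Project a sample onto the discriminant direction."""
    return w[0] * x[0] + w[1] * x[1]


# Made-up training populations (e.g. two groups of genes, two features each).
class_a = [(2.0, 1.0), (3.0, 2.0), (4.0, 1.5)]
class_b = [(-1.0, 0.5), (0.0, 1.5), (1.0, 1.0)]
w = lda_direction(class_a, class_b)
proj_a = [project(w, x) for x in class_a]
proj_b = [project(w, x) for x in class_b]
print(w, proj_a, proj_b)
```

On this toy data every projected point of the first population lies above every projected point of the second, so a single threshold on the projection separates them.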
![Page 78: CS276B Text Information Retrieval, Mining, and Exploitation Lecture 16 Bioinformatics II March 13, 2003 (includes slides borrowed from J. Chang, R. Altman,](https://reader035.vdocuments.site/reader035/viewer/2022062422/56649f005503460f94c15f9b/html5/thumbnails/78.jpg)
Our approach
Look for projections that separate specific groups of genes
In a good projection, the separated genes have some functional commonalities
These commonalities should be evident in the gene literature
![Page 79: CS276B Text Information Retrieval, Mining, and Exploitation Lecture 16 Bioinformatics II March 13, 2003 (includes slides borrowed from J. Chang, R. Altman,](https://reader035.vdocuments.site/reader035/viewer/2022062422/56649f005503460f94c15f9b/html5/thumbnails/79.jpg)
Challenges
C1 : Can we identify biologically meaningful concepts from simple text representations?
C2 : In a group of genes with some biological similarity, can we detect that similarity in the literature?
C3 : Can we then find projections in the expression data that group genes appropriately?
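To make C2 concrete, here is one very simple way to score whether a group of genes "looks similar" in the literature: average pairwise cosine similarity of bag-of-words vectors. This is an illustrative stand-in, not the measure used by the authors; the gene names and document texts are invented placeholders.

```python
import math
from collections import Counter
from itertools import combinations


def bag_of_words(text):
    """Simple text representation: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def group_coherence(gene_docs, genes):
    """Average pairwise text similarity within a gene group."""
    vecs = [bag_of_words(gene_docs[g]) for g in genes]
    pairs = list(combinations(vecs, 2))
    if not pairs:
        return 0.0
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


# Placeholder genes and texts, invented for this example.
gene_docs = {
    "geneA": "ribosome biogenesis rrna processing",
    "geneB": "ribosome assembly rrna processing",
    "geneC": "glucose transport membrane",
    "geneD": "dna repair nuclease",
}
coherent = group_coherence(gene_docs, ["geneA", "geneB"])
mixed = group_coherence(gene_docs, ["geneA", "geneC", "geneD"])
print(coherent, mixed)
```

A score like this could then be used to rank candidate projections (C3) by how coherent the separated genes are in the literature, under the assumption that word overlap is a usable proxy for shared function.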
![Page 80: CS276B Text Information Retrieval, Mining, and Exploitation Lecture 16 Bioinformatics II March 13, 2003 (includes slides borrowed from J. Chang, R. Altman,](https://reader035.vdocuments.site/reader035/viewer/2022062422/56649f005503460f94c15f9b/html5/thumbnails/80.jpg)
Resources
NLP sessions of PSB: psb.stanford.edu
www.bionlp.com
bioperl.org, biopython.org
National Library of Medicine: www.nlm.nih.gov
http://www.ai.ucsd.edu/rik/annblast/ab-bm.html (out of date, but still comprehensive)
![Page 81: CS276B Text Information Retrieval, Mining, and Exploitation Lecture 16 Bioinformatics II March 13, 2003 (includes slides borrowed from J. Chang, R. Altman,](https://reader035.vdocuments.site/reader035/viewer/2022062422/56649f005503460f94c15f9b/html5/thumbnails/81.jpg)
Links to Today’s Topics
http://www.smi.stanford.edu/projects/helix/psb01/chang.pdf
Pac Symp Biocomput. 2001;:374-83. PMID: 11262956
Blast: http://www.ncbi.nlm.nih.gov/BLAST/
http://www-smi.stanford.edu/projects/helix/psb03
Genome Res 2002 Oct;12(10):1582-90. Using text analysis to identify functionally coherent gene groups. Raychaudhuri S, Schutze H, Altman RB
www.biostat.wisc.edu/~craven/kddcup/
http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?db=Genome (complete genomes)
![Page 82: CS276B Text Information Retrieval, Mining, and Exploitation Lecture 16 Bioinformatics II March 13, 2003 (includes slides borrowed from J. Chang, R. Altman,](https://reader035.vdocuments.site/reader035/viewer/2022062422/56649f005503460f94c15f9b/html5/thumbnails/82.jpg)
Links to Today’s Topics
http://www.nlm.nih.gov/mesh/meshhome.html