![Page 1: CSCI6904 Genomics and Biological Computing](https://reader033.vdocuments.site/reader033/viewer/2022051401/56814f2a550346895dbcb6ef/html5/thumbnails/1.jpg)
CSCI6904
Genomics and Biological Computing
Lecture 3 – Conceptual Biology
Cells, Gene circuits
Conceptual Biology
![Page 2: CSCI6904 Genomics and Biological Computing](https://reader033.vdocuments.site/reader033/viewer/2022051401/56814f2a550346895dbcb6ef/html5/thumbnails/2.jpg)
Overview
Computing in biological systems: Cells compute information and react programmatically to various situations. We will have a brief look at what a cell is and how cells “compute”.
Evolutionary emergence of networks: These circuits of gene products arise in a stochastic manner. We will have a quick look at how this random walk results in a combinatorial strategy to evolve solutions.
Investigating networks: None of these networks is visible, so investigating the relationships in the physical world is a resource-consuming operation.
Building knowledge models of cells using text mining: We present a test case called GeneWays.
![Page 3: CSCI6904 Genomics and Biological Computing](https://reader033.vdocuments.site/reader033/viewer/2022051401/56814f2a550346895dbcb6ef/html5/thumbnails/3.jpg)
Cells
![Page 4: CSCI6904 Genomics and Biological Computing](https://reader033.vdocuments.site/reader033/viewer/2022051401/56814f2a550346895dbcb6ef/html5/thumbnails/4.jpg)
Scope of molecular Biology
Molecular biology tries to organize a stochastically evolved system comprising hundreds of thousands of components.
None of these components can be seen, even under the most powerful microscopes.
They are usually present at the 10^-8 to 10^-12 gram scale.
They degrade in a matter of seconds to hours.
The bottom line is:
Everything we know about this system comes from fragments of information.
Many of these are going to be refuted over time.
![Page 5: CSCI6904 Genomics and Biological Computing](https://reader033.vdocuments.site/reader033/viewer/2022051401/56814f2a550346895dbcb6ef/html5/thumbnails/5.jpg)
Cells as processors
![Page 6: CSCI6904 Genomics and Biological Computing](https://reader033.vdocuments.site/reader033/viewer/2022051401/56814f2a550346895dbcb6ef/html5/thumbnails/6.jpg)
Scope of Biological research
Research is usually structured such that individual contributions can be pieced together into a “pathway”.
![Page 7: CSCI6904 Genomics and Biological Computing](https://reader033.vdocuments.site/reader033/viewer/2022051401/56814f2a550346895dbcb6ef/html5/thumbnails/7.jpg)
Scope of Biological research
Research is usually structured such that individual contributions can be pieced together into a “pathway”.
Sugar
Essential oils (plants)
Vitamin K
Bile
Eye Pigments
Sexual Hormones
Amino-Acids
![Page 8: CSCI6904 Genomics and Biological Computing](https://reader033.vdocuments.site/reader033/viewer/2022051401/56814f2a550346895dbcb6ef/html5/thumbnails/8.jpg)
Networks
How do they come into being? Combinatorial assembly during a stochastic process.
What is done to understand the main pathways? Grasping even the smallest fact about one edge in the graph is a feat.
![Page 9: CSCI6904 Genomics and Biological Computing](https://reader033.vdocuments.site/reader033/viewer/2022051401/56814f2a550346895dbcb6ef/html5/thumbnails/9.jpg)
Evolutionary Quandary
Intelligent design opposition to evolution of complex systems
A -> B -> C -> D
![Page 10: CSCI6904 Genomics and Biological Computing](https://reader033.vdocuments.site/reader033/viewer/2022051401/56814f2a550346895dbcb6ef/html5/thumbnails/10.jpg)
Evolutionary Quandary
Intelligent design opposition to evolution of complex systems
A -> B -> C -> D
Useless metabolites
![Page 11: CSCI6904 Genomics and Biological Computing](https://reader033.vdocuments.site/reader033/viewer/2022051401/56814f2a550346895dbcb6ef/html5/thumbnails/11.jpg)
Evolutionary Quandary
Intelligent design opposition to evolution of complex systems
A -> D
Impossible
![Page 12: CSCI6904 Genomics and Biological Computing](https://reader033.vdocuments.site/reader033/viewer/2022051401/56814f2a550346895dbcb6ef/html5/thumbnails/12.jpg)
Evolutionary Quandary
Intelligent design opposition to evolution of complex systems
A -> B -> C -> D
Therefore, the pathway A -> D had to be designed by an intelligent entity which had the knowledge of the intended purpose of the pathway!
![Page 13: CSCI6904 Genomics and Biological Computing](https://reader033.vdocuments.site/reader033/viewer/2022051401/56814f2a550346895dbcb6ef/html5/thumbnails/13.jpg)
Closer look at high-level gene organization
A modular system: Proteins can be broken down into domains.
A combinatorial effect: Domains can assemble in a combinatorial fashion to try out a vast array of potential biological activities.
![Page 14: CSCI6904 Genomics and Biological Computing](https://reader033.vdocuments.site/reader033/viewer/2022051401/56814f2a550346895dbcb6ef/html5/thumbnails/14.jpg)
Proteins are made of domains
Proteins are organized into domains
Transcription factor eF1eF1/ (PDB: 1IJF)
http://www.ncbi.nlm.nih.gov
![Page 15: CSCI6904 Genomics and Biological Computing](https://reader033.vdocuments.site/reader033/viewer/2022051401/56814f2a550346895dbcb6ef/html5/thumbnails/15.jpg)
Proteins are made of domains
Domains have several interesting properties.
Transcription factor eF1eF1/ (PDB: 1IJF)
http://www.ncbi.nlm.nih.gov
![Page 16: CSCI6904 Genomics and Biological Computing](https://reader033.vdocuments.site/reader033/viewer/2022051401/56814f2a550346895dbcb6ef/html5/thumbnails/16.jpg)
Proteins are made of domains
Domains fold onto themselves, so it is possible to express them separately (in most cases).
They are small relative to whole proteins, which may make it easier for them to fold rapidly into the right conformation.
Transcription factor eF1eF1/ (PDB: 1IJF)
![Page 17: CSCI6904 Genomics and Biological Computing](https://reader033.vdocuments.site/reader033/viewer/2022051401/56814f2a550346895dbcb6ef/html5/thumbnails/17.jpg)
Proteins are made of domains
They usually provide a biological function through binding or catalysis.
Transcription factor eF1eF1/ (PDB: 1IJF)
![Page 18: CSCI6904 Genomics and Biological Computing](https://reader033.vdocuments.site/reader033/viewer/2022051401/56814f2a550346895dbcb6ef/html5/thumbnails/18.jpg)
A stochastic process
![Page 19: CSCI6904 Genomics and Biological Computing](https://reader033.vdocuments.site/reader033/viewer/2022051401/56814f2a550346895dbcb6ef/html5/thumbnails/19.jpg)
A molecular network
= An interaction
![Page 20: CSCI6904 Genomics and Biological Computing](https://reader033.vdocuments.site/reader033/viewer/2022051401/56814f2a550346895dbcb6ef/html5/thumbnails/20.jpg)
Interfaces are expensive to evolve
Transcription factor eF1/ (PDB: 1IJF)
Interfaces are very sensitive to mutation as they must provide a perfect match.
![Page 21: CSCI6904 Genomics and Biological Computing](https://reader033.vdocuments.site/reader033/viewer/2022051401/56814f2a550346895dbcb6ef/html5/thumbnails/21.jpg)
Network of Metabolites
Metabolites essentially form a network with a scale-free property, which parallels the stochastic assembly of domains.
At least, this appears to be true with the data available so far.
Rzhetsky and Gomez, 2001. Bioinformatics, 17:988-996
http://www.genego.com/about/products.shtml
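One common way to see how a purely stochastic growth process can produce a scale-free network is preferential attachment: each new node links to an existing node with probability proportional to that node's current degree. The sketch below is a generic illustration only, not the model of Rzhetsky and Gomez; the function names, the node count and the random seed are assumptions made for the example.

```python
import random
from collections import Counter


def preferential_attachment(n_nodes, seed=0):
    """Grow a network one node at a time (illustrative, not the cited model)."""
    rng = random.Random(seed)
    edges = [(0, 1)]
    # Each node appears in this list once per edge it touches, so a uniform
    # draw from the list picks a node with probability proportional to degree.
    endpoints = [0, 1]
    for new_node in range(2, n_nodes):
        target = rng.choice(endpoints)
        edges.append((new_node, target))
        endpoints.extend([new_node, target])
    return edges


def degree_counts(edges):
    """Count how many edges touch each node."""
    degrees = Counter()
    for a, b in edges:
        degrees[a] += 1
        degrees[b] += 1
    return degrees


edges = preferential_attachment(500)
degrees = degree_counts(edges)
# Histogram: how many nodes have degree 1, 2, 3, ...
histogram = Counter(degrees.values())
for degree in sorted(histogram)[:5]:
    print(degree, histogram[degree])
```

In a run of this kind most nodes keep a single link while a few early nodes collect many, which is the qualitative signature of a heavy-tailed degree distribution.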
![Page 22: CSCI6904 Genomics and Biological Computing](https://reader033.vdocuments.site/reader033/viewer/2022051401/56814f2a550346895dbcb6ef/html5/thumbnails/22.jpg)
Evolutionary Quandary
Back to our A to D problem.
A -> B -> C -> D
An observed pathway is therefore simply a path connecting an input molecule to a required output. Each edge can be seen as a gene product (protein).
Overall, the pathway offers some kind of advantage to the host organism.
With positive selection, the pathway gets better and looks as if it had been designed for a specific purpose.
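The idea that a pathway is just a path through an existing network can be sketched with a breadth-first search. In this minimal example the reaction table, the enzyme names and the extra metabolite X are all hypothetical; each edge stands for one gene product, as on the slide.

```python
from collections import deque


def find_pathway(reactions, source, goal):
    """Breadth-first search over reactions.

    reactions maps (substrate, product) -> gene product catalyzing that step.
    Returns the list of (substrate, enzyme, product) steps, or None.
    """
    neighbors = {}
    for (substrate, product), enzyme in reactions.items():
        neighbors.setdefault(substrate, []).append((product, enzyme))

    queue = deque([(source, [])])
    seen = {source}
    while queue:
        molecule, steps = queue.popleft()
        if molecule == goal:
            return steps
        for product, enzyme in neighbors.get(molecule, []):
            if product not in seen:
                seen.add(product)
                queue.append((product, steps + [(molecule, enzyme, product)]))
    return None


# Hypothetical network: the A-to-D route exists alongside a side branch to X.
REACTIONS = {
    ("A", "B"): "enzyme1",
    ("B", "C"): "enzyme2",
    ("C", "D"): "enzyme3",
    ("B", "X"): "enzyme4",
}

print(find_pathway(REACTIONS, "A", "D"))
```

Nothing in the search knows about a "purpose"; the A to D route is simply one of the paths the network happens to contain.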
![Page 23: CSCI6904 Genomics and Biological Computing](https://reader033.vdocuments.site/reader033/viewer/2022051401/56814f2a550346895dbcb6ef/html5/thumbnails/23.jpg)
Scope of Biological research
Density of knowledge-generating statements per article with respect to source journals
![Page 24: CSCI6904 Genomics and Biological Computing](https://reader033.vdocuments.site/reader033/viewer/2022051401/56814f2a550346895dbcb6ef/html5/thumbnails/24.jpg)
Where it becomes a bioinformatics problem:
Nature of the problem: Building a global model from plain English text sources.
Size
Complexity
What is done in the GeneWays project: The workflow of their integrated system.
What I think it really means in the long run: The relationship between research and researchers.
(The right information system will be the next big thing)
![Page 25: CSCI6904 Genomics and Biological Computing](https://reader033.vdocuments.site/reader033/viewer/2022051401/56814f2a550346895dbcb6ef/html5/thumbnails/25.jpg)
Motivation
Human limitations and data-heavy and knowledge-heavy disciplines
Synthesizing
Hypothesis building
Visualizing
Records keeping
Modeling Knowledge Streamlining
Structuring (Directing)
(Changing the way research is communicated?)
![Page 26: CSCI6904 Genomics and Biological Computing](https://reader033.vdocuments.site/reader033/viewer/2022051401/56814f2a550346895dbcb6ef/html5/thumbnails/26.jpg)
Motivation
In knowledge-intensive fields, the connection between investigators and background information is thinning.
Data
Hypothesis
Experiment
Information (data, concepts)
Knowledge
This arrow does not scale up as quickly as the others
Bioinformatics
Computational Biology
![Page 27: CSCI6904 Genomics and Biological Computing](https://reader033.vdocuments.site/reader033/viewer/2022051401/56814f2a550346895dbcb6ef/html5/thumbnails/27.jpg)
Scope of GeneWays
Build a model for molecular biology from plain-English publications.
Allow a more holistic approach to hypothesis formulation.
![Page 28: CSCI6904 Genomics and Biological Computing](https://reader033.vdocuments.site/reader033/viewer/2022051401/56814f2a550346895dbcb6ef/html5/thumbnails/28.jpg)
Scope of GeneWays
~3 million statements
150,000 full-text articles
![Page 29: CSCI6904 Genomics and Biological Computing](https://reader033.vdocuments.site/reader033/viewer/2022051401/56814f2a550346895dbcb6ef/html5/thumbnails/29.jpg)
Scope of GeneWays
What are we looking for, ultimately?
protein A binds gene B
gene B regulates gene C
gene C expresses protein D
protein D inactivates protein A
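The four statements on this slide can be stored as a small directed graph, and the feedback loop they form can then be found mechanically. This is a minimal sketch: the tuple layout (agent, verb, target) and the function names are assumptions for illustration, not the GeneWays data model.

```python
STATEMENTS = [
    ("protein A", "binds", "gene B"),
    ("gene B", "regulates", "gene C"),
    ("gene C", "expresses", "protein D"),
    ("protein D", "inactivates", "protein A"),
]


def build_graph(statements):
    """Index statements by their agent: agent -> list of (verb, target)."""
    graph = {}
    for agent, verb, target in statements:
        graph.setdefault(agent, []).append((verb, target))
    return graph


def find_loop(graph, start):
    """Depth-first walk from start; return the statements leading back to it."""
    stack = [(start, [])]
    visited = set()
    while stack:
        node, path = stack.pop()
        for verb, target in graph.get(node, []):
            step = path + [(node, verb, target)]
            if target == start:
                return step
            if target not in visited:
                visited.add(target)
                stack.append((target, step))
    return None


loop = find_loop(build_graph(STATEMENTS), "protein A")
for agent, verb, target in loop:
    print(agent, verb, target)
```

Once statements are in this form, questions such as "does protein A end up regulating itself?" become graph queries rather than reading exercises.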
![Page 30: CSCI6904 Genomics and Biological Computing](https://reader033.vdocuments.site/reader033/viewer/2022051401/56814f2a550346895dbcb6ef/html5/thumbnails/30.jpg)
Scope of GeneWays
Doc Sorting
Terms identification
Disambiguation
Information extraction
Ontology
Visualization
![Page 31: CSCI6904 Genomics and Biological Computing](https://reader033.vdocuments.site/reader033/viewer/2022051401/56814f2a550346895dbcb6ef/html5/thumbnails/31.jpg)
Details of GeneWays
Doc Sorting
From abstracts, using either clustering (unsupervised) or Naïve Bayes.
The system uses a mixture of methods to achieve the binary classification: relevant / irrelevant.
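To make the Naïve Bayes option concrete, here is a minimal multinomial Naïve Bayes sketch with add-one smoothing for the relevant / irrelevant decision. The training sentences are invented toy data, and nothing here reflects the features or implementation actually used in GeneWays.

```python
import math
from collections import Counter


def train_naive_bayes(docs):
    """docs is a list of (text, label) pairs; returns the fitted model."""
    word_counts = {}
    label_counts = Counter()
    vocab = set()
    for text, label in docs:
        words = text.lower().split()
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(words)
        vocab.update(words)
    return word_counts, label_counts, vocab


def classify(text, model):
    """Return the label with the highest log posterior."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        counts = word_counts[label]
        total_words = sum(counts.values())
        score = math.log(label_counts[label] / total_docs)
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids log(0) for unseen pairs.
            score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy abstracts, for illustration only.
TOY_ABSTRACTS = [
    ("kinase phosphorylates the receptor protein", "relevant"),
    ("the protein binds the gene promoter", "relevant"),
    ("survey of hospital staffing costs", "irrelevant"),
    ("costs of clinical trial recruitment", "irrelevant"),
]

model = train_naive_bayes(TOY_ABSTRACTS)
print(classify("the kinase binds the protein", model))
print(classify("hospital costs survey", model))
```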
![Page 32: CSCI6904 Genomics and Biological Computing](https://reader033.vdocuments.site/reader033/viewer/2022051401/56814f2a550346895dbcb6ef/html5/thumbnails/32.jpg)
Details of GeneWays
Tagging terms
Especially hard in biology(?)
Morphological rules
Grammatical rules
Rules/dictionary methods
SVM
HMM
Naïve Bayes
Decision trees
Recall in the 70% to 80% range
![Page 33: CSCI6904 Genomics and Biological Computing](https://reader033.vdocuments.site/reader033/viewer/2022051401/56814f2a550346895dbcb6ef/html5/thumbnails/33.jpg)
Details of GeneWays
Tagging terms
HTML -> XML-like format
![Page 34: CSCI6904 Genomics and Biological Computing](https://reader033.vdocuments.site/reader033/viewer/2022051401/56814f2a550346895dbcb6ef/html5/thumbnails/34.jpg)
Details of GeneWays
Tagging terms
Vertices:
Gene
Protein
Gene or protein
Process
Small molecules
Species
Complex
Disease
Domain (protein)
![Page 35: CSCI6904 Genomics and Biological Computing](https://reader033.vdocuments.site/reader033/viewer/2022051401/56814f2a550346895dbcb6ef/html5/thumbnails/35.jpg)
Details of GeneWays
Tagging terms
Edges:
N-acylate
Acetylate
N-glycosylate
O-glycosylate
Bind
Degrade
(De-)methylate
(De-)phosphorylate
[Make|break] bond
Express
Transcribe
Release
Interact
Substitute
… n = 125 (2001)
![Page 36: CSCI6904 Genomics and Biological Computing](https://reader033.vdocuments.site/reader033/viewer/2022051401/56814f2a550346895dbcb6ef/html5/thumbnails/36.jpg)
Details of GeneWays
Learning new verbs:
AVAD system
χ² statistics of the occurrence of terms before and after tagged items.
Log-likelihood test based on frequency of occurrence in corpus-specific literature.
“Co-localize” and “synergize” were discovered using AVAD.
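As an illustration of the χ² idea, the sketch below computes the standard Pearson χ² statistic for a 2x2 table that asks whether a candidate verb occurs next to tagged items more often than elsewhere. The table layout and the counts are assumptions; the slides do not give the exact formulation used by AVAD.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for the 2x2 table [[a, b], [c, d]].

    Assumed layout (hypothetical):
      a = candidate verb, next to a tagged item
      b = candidate verb, elsewhere
      c = other words, next to a tagged item
      d = other words, elsewhere
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator


# Made-up counts: a verb seen mostly next to tagged gene or protein names.
print(chi_square_2x2(30, 5, 200, 765))
```

A large value suggests the word's position relative to tagged terms is not random, which is what would make it a candidate interaction verb.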
![Page 37: CSCI6904 Genomics and Biological Computing](https://reader033.vdocuments.site/reader033/viewer/2022051401/56814f2a550346895dbcb6ef/html5/thumbnails/37.jpg)
Nomenclature
There are obscure ways to agree:
Protein kinase A phosphorylates protein B
Is the same as:
B + ATP -> B-P + ADP (catalyzed by A)
![Page 38: CSCI6904 Genomics and Biological Computing](https://reader033.vdocuments.site/reader033/viewer/2022051401/56814f2a550346895dbcb6ef/html5/thumbnails/38.jpg)
Nomenclature
There are obscure ways, period:
Genes named:
“Forever Young” in Arabidopsis thaliana (mustard family)
“Mothers against decapentaplegic” in the fruit fly
![Page 39: CSCI6904 Genomics and Biological Computing](https://reader033.vdocuments.site/reader033/viewer/2022051401/56814f2a550346895dbcb6ef/html5/thumbnails/39.jpg)
Never mind the jargon!
Fight fire with fire:
They developed a method that uses BLAST, a popular sequence database search algorithm, to mine for biological terms.
(Krauthammer et al., 2000. Gene. 259:245-252)
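The core trick of searching text with a sequence tool is to re-encode characters as nucleotides, so that a term becomes a "sequence" and a document becomes a "database". The sketch below shows only that encoding idea with an exact search; it does not call BLAST, the codeword table is invented for the example, and it should not be read as the encoding used in the cited paper. ASCII input is assumed.

```python
from itertools import product

# Hypothetical table: each of the 256 possible 4-letter nucleotide words
# stands for one character code (assumes ASCII input).
CODEWORDS = ["".join(p) for p in product("ACGT", repeat=4)]


def encode(text):
    """Translate ASCII text into a nucleotide string, 4 letters per character."""
    return "".join(CODEWORDS[ord(ch)] for ch in text.lower())


def find_term(term, text):
    """Return character offsets where the encoded term occurs in the text."""
    needle, haystack = encode(term), encode(text)
    hits = []
    start = haystack.find(needle)
    while start != -1:
        if start % 4 == 0:  # keep only matches aligned on codeword boundaries
            hits.append(start // 4)
        start = haystack.find(needle, start + 1)
    return hits


print(encode("il2"))
print(find_term("HEPES", "buffered with HEPES at pH 7"))
```

A real sequence aligner would replace the exact `find` with approximate matching, which is what lets spelling variants of a term still score as hits.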
![Page 40: CSCI6904 Genomics and Biological Computing](https://reader033.vdocuments.site/reader033/viewer/2022051401/56814f2a550346895dbcb6ef/html5/thumbnails/40.jpg)
Never mind the jargon!
Fight fire with fire:
N-(2-Hydroxyethyl)piperazine-N'-(2-ethanesulfonic acid) (HEPES)
2-(N-Morpholino)ethanesulfonic acid (MES)
3-(N-Morpholino)propanesulfonic acid (MOPS)
N-tris[Hydroxymethyl]methyl-3-aminopropanesulfonic acid (TAPS)
tris(Hydroxymethyl)aminomethane (TRIS)
![Page 41: CSCI6904 Genomics and Biological Computing](https://reader033.vdocuments.site/reader033/viewer/2022051401/56814f2a550346895dbcb6ef/html5/thumbnails/41.jpg)
Details of GeneWays
Disambiguation
il2 and interleukin-2 can both be used to refer to the gene, the protein, or the mRNA.
![Page 42: CSCI6904 Genomics and Biological Computing](https://reader033.vdocuments.site/reader033/viewer/2022051401/56814f2a550346895dbcb6ef/html5/thumbnails/42.jpg)
Details of GeneWays
Disambiguation
Use canonical name as much as possible.
Learn Semantic classes
![Page 43: CSCI6904 Genomics and Biological Computing](https://reader033.vdocuments.site/reader033/viewer/2022051401/56814f2a550346895dbcb6ef/html5/thumbnails/43.jpg)
Details of GeneWays
Information extraction
Correlation methods
HMM
Formal grammar (lexicon)
GeneWays uses the GENIES NLP system.
It attempts complete parsing, then defaults to segmenting and partial parsing.
![Page 44: CSCI6904 Genomics and Biological Computing](https://reader033.vdocuments.site/reader033/viewer/2022051401/56814f2a550346895dbcb6ef/html5/thumbnails/44.jpg)
Details of the NLP system
GENIES (GENomics Information Extraction System)
Based on MedLEE (medical NLP system)
Term tagging component uses rules and external knowledge
Handles nested relationships, as well as nominalized and agentive forms of verbs (inhibit, inhibition, and inhibitor).
![Page 45: CSCI6904 Genomics and Biological Computing](https://reader033.vdocuments.site/reader033/viewer/2022051401/56814f2a550346895dbcb6ef/html5/thumbnails/45.jpg)
Details of GeneWays
Information simplification
Convert nested relationships into a collection of binary statements.
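One possible way to turn a nested relationship into binary statements is to give every inner statement an identifier and let the outer statement point to it. The scheme, the tuple layout and the identifier format below are assumptions for illustration; the slides do not describe how the GeneWays Simplifier does this.

```python
def flatten(statement):
    """Flatten a nested (agent, verb, target) tuple.

    Returns a list of (id, agent, verb, target) rows in which agent and
    target are either plain terms or the id of an earlier row.
    """
    rows = []

    def visit(node):
        # A plain string is already a term; a tuple is a nested statement.
        if isinstance(node, str):
            return node
        agent, verb, target = node
        agent_ref = visit(agent)
        target_ref = visit(target)
        row_id = f"S{len(rows) + 1}"
        rows.append((row_id, agent_ref, verb, target_ref))
        return row_id

    visit(statement)
    return rows


# Hypothetical nested statement:
# "protein A activates the phosphorylation of protein C by protein B".
NESTED = ("protein A", "activate", ("protein B", "phosphorylate", "protein C"))

for row in flatten(NESTED):
    print(row)
```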
![Page 46: CSCI6904 Genomics and Biological Computing](https://reader033.vdocuments.site/reader033/viewer/2022051401/56814f2a550346895dbcb6ef/html5/thumbnails/46.jpg)
Details of GeneWays
Ontology
Knowledge Models
![Page 47: CSCI6904 Genomics and Biological Computing](https://reader033.vdocuments.site/reader033/viewer/2022051401/56814f2a550346895dbcb6ef/html5/thumbnails/47.jpg)
Uses for GeneWays
Visualization
Synthesis and querying facility
The only filter described at the time of publication is based on the number of statements supporting an edge.
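A support-count filter of the kind described here is straightforward to sketch: count how many extracted statements back each edge and keep only edges at or above a threshold. The statements, the threshold and the function name below are hypothetical.

```python
from collections import Counter


def supported_edges(statements, min_support):
    """Keep edges backed by at least min_support extracted statements."""
    support = Counter(statements)
    return {edge: n for edge, n in support.items() if n >= min_support}


# Hypothetical extracted statements; duplicates come from different articles.
EXTRACTED = [
    ("protein A", "bind", "gene B"),
    ("protein A", "bind", "gene B"),
    ("protein A", "bind", "gene B"),
    ("gene B", "regulate", "gene C"),
]

print(supported_edges(EXTRACTED, min_support=2))
```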
![Page 48: CSCI6904 Genomics and Biological Computing](https://reader033.vdocuments.site/reader033/viewer/2022051401/56814f2a550346895dbcb6ef/html5/thumbnails/48.jpg)
Uses for GeneWays
Visualization
Synthesis and querying facility
![Page 49: CSCI6904 Genomics and Biological Computing](https://reader033.vdocuments.site/reader033/viewer/2022051401/56814f2a550346895dbcb6ef/html5/thumbnails/49.jpg)
Validation of GeneWays
Expert Review
125 of 2500 statements were erroneous or “phantoms”.
Of these 125:
- 100 were due to term identification.
- 12 were NLP errors.
- 5 were Simplifier errors.
- 8 were actually correct!
System’s precision: 95%
Expert’s precision: 93.5%
Such a system should be seen as a means to enrich
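The 95% figure follows directly from the counts on this slide, as the short check below shows. The variable and function names are illustrative; only the numbers come from the slide.

```python
def precision(total_statements, erroneous):
    """Fraction of extracted statements that were not flagged as errors."""
    return (total_statements - erroneous) / total_statements


# Breakdown of the 125 flagged statements, as listed on the slide.
FLAGGED = {
    "term identification": 100,
    "NLP errors": 12,
    "Simplifier errors": 5,
    "actually correct": 8,
}

total_flagged = sum(FLAGGED.values())
print(total_flagged)
print(round(precision(2500, total_flagged), 4))
```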
![Page 50: CSCI6904 Genomics and Biological Computing](https://reader033.vdocuments.site/reader033/viewer/2022051401/56814f2a550346895dbcb6ef/html5/thumbnails/50.jpg)
Validation of GeneWays
Redundancy
Redundant statements are not necessarily “more true”.
Redundancy due to indirect relationships.
![Page 51: CSCI6904 Genomics and Biological Computing](https://reader033.vdocuments.site/reader033/viewer/2022051401/56814f2a550346895dbcb6ef/html5/thumbnails/51.jpg)
Validation of GeneWays
A parser’s nightmare:
Statement: “mitogen-activated protein kinase kinase kinase (MAPKKK) phosphorylates protein B”
Interpretations:
1. Protein kinase [protein] is activated by the mitogen [complex]
2. MAPK [protein] phosphorylates MAPKK [protein]
3. MAPKK [protein] phosphorylates MAPKKK [protein]
4. MAPKKK [protein] phosphorylates B [protein]
Potential historical artifacts:
1. B [protein] is activated by the mitogen [complex]
2. MAPKK [wrongly thought to be MAPK] phosphorylates B [protein]
3. …
![Page 52: CSCI6904 Genomics and Biological Computing](https://reader033.vdocuments.site/reader033/viewer/2022051401/56814f2a550346895dbcb6ef/html5/thumbnails/52.jpg)
Perspective
References
Main: Rzhetsky et al., 2004. GeneWays: a system for extracting, analysing, visualizing, and integrating molecular pathway data. J. Biomed. Informatics, 37:43-53.
Learning verbs: Hatzivassiloglou, V., Weng, W. Learning Anchor Verbs for Biological Interaction Patterns from Published Text Articles. www.cs.columbia.edu/nlp/papers/2002/hatzivassiloglou_weng_02.pdf
NLP processor: Friedman C, Kra P, Yu H, Krauthammer M, Rzhetsky A. 2001. GENIES: a natural-language processing system for the extraction of molecular pathways from journal articles. Bioinformatics, 17:S74-S82.
Acknowledgement: Aditya Aggarwal, the student who dug out this paper to present in CSCI 6904 (2004)