ontology learning from text
Ontology Learning From Text?
Robert Stevens
BioHealth Informatics Group
School of Computer Science
University of Manchester
Introduction
• Can we use ontology learning to build ontologies?
• Not text-mining research, but ontology research
• What is ontology learning from text?
• The questions we posed
• The experiment we performed
• The results we obtained
• The conclusions we made
Ontology learning
• Text2Onto: http://ontoware.org/projects/text2onto/
• “The erythrocytes are the blood cells that carry oxygen to other cells in the body”
• “Lymphocytes, leukocytes, monocytes, phagocytes and granulocytes are all kinds of white blood cell”
• “These experiments show that the individual hemopoietic stem cell is a multipotent cell and can give rise to the complete range of blood cell types, both myeloid and lymphoid, as well as new stem cells like itself.”
Ontology Learning
[Figure: ontology learnt from the sentences above. Classes shown: Blood Cell, Erythrocyte, White Blood Cell, Monocyte, Leukocyte, Lymphocyte, Phagocyte, Granulocyte, Multipotent Stem Cell, Hemopoietic Stem Cell. Relation shown: arise from.]
Text to Ontology “Workflow”
Corpus
Tokenising / Sentence splitting
Part-Of-Speech (POS) tagging
Lemmatizing / Stemming
JAPE transducer annotates corpus
Text2Onto algorithms for extracting modelling primitives
Text2Onto meta-ontology
Promotion to OWL ontology
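The first steps of this workflow can be illustrated with a minimal sketch. It is not the GATE or Text2Onto implementation: the function names are hypothetical, the sentence splitter and lemmatiser are crude regular-expression stand-ins, and POS tagging is left out because it needs a trained tagger.

```python
import re

def split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace (a crude heuristic)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenise(sentence):
    """Return word tokens, keeping internal hyphens (e.g. 'CFU-S')."""
    return re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", sentence)

def naive_lemma(token):
    """Lower-case and strip a plural 's'; a stand-in for real lemmatising."""
    word = token.lower()
    if word.isalpha() and len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

text = "Lymphocytes are white blood cells. CFU-S is a blood stem cell."
for sentence in split_sentences(text):
    print([naive_lemma(t) for t in tokenise(sentence)])
```

In a real pipeline each of these functions would be replaced by the corresponding GATE processing resource; the sketch only shows how the stages chain together.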
Extracting Patterns from Text
Sentence:
“CFU-S is a blood stem cell”
Part of Speech (POS) Tagging:
CFU-S[NNP] is[VBN] a[DT] blood[NN] stem[NN] cell[NN]
Key: NN=noun; DT=determiner; NNP=proper noun; VBN=verb past participle.
Pseudo JAPE rule:
Any series of nouns (A) followed by the string “ is a ” followed by a series of nouns (B)
Ontological assertions:
A and B are concepts, A is a subclass of B
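The pseudo rule can be sketched in Python over tokens written in the slide's word[TAG] notation. This is an illustration of the pattern, not JAPE syntax and not the rule Text2Onto ships; the function names and the set of tags treated as nouns are assumptions.

```python
import re

# Tagged tokens are written word[TAG], as in the POS-tagged sentence above.
TAGGED = re.compile(r"(\S+)\[([A-Z]+)\]")
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}  # assumption: all count as "nouns"

def parse_tagged(text):
    """Turn 'word[TAG] word[TAG] ...' into a list of (word, tag) pairs."""
    return TAGGED.findall(text)

def extract_is_a(tagged):
    """Return (A, B) for 'nouns(A) is a nouns(B)', or None if no match."""
    for i, (word, _) in enumerate(tagged):
        if word.lower() != "is" or i + 1 >= len(tagged):
            continue
        if tagged[i + 1][0].lower() != "a":
            continue
        # Walk left from "is" over the run of nouns (A).
        start = i
        while start > 0 and tagged[start - 1][1] in NOUN_TAGS:
            start -= 1
        # Walk right from "a" over the run of nouns (B).
        end = i + 2
        while end < len(tagged) and tagged[end][1] in NOUN_TAGS:
            end += 1
        if start < i and end > i + 2:
            a = " ".join(w for w, _ in tagged[start:i])
            b = " ".join(w for w, _ in tagged[i + 2:end])
            return a, b
    return None

tagged = parse_tagged("CFU-S[NNP] is[VBN] a[DT] blood[NN] stem[NN] cell[NN]")
print(extract_is_a(tagged))  # A is asserted to be a subclass of B
```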
Text2Onto meta-ontology
Some Text2Onto Instances
• Instance: Astrocyte_c
– TypeOf: Concept that
– Fact: confidence VALUE 1.0
• Instance: AstrocyteNerveCell
– TypeOf: Subclass that
– Fact: domain VALUE NerveCell and
– Fact: range VALUE Astrocyte and
– Fact: confidence VALUE 1.0
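One way to picture these instances is as small records that carry a confidence value, which a later step can filter before promotion to OWL. The sketch below is an assumption for illustration only: the class names, field names, the 0.5 threshold, and the low-confidence example are hypothetical and are not Text2Onto's API or output.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Concept:
    label: str
    confidence: float

@dataclass(frozen=True)
class Subclass:
    domain: str       # mirrors the "domain" fact on the slide
    range_: str       # mirrors the "range" fact; underscore avoids the builtin name
    confidence: float

def promote(instances, threshold=0.5):
    """Keep only instances whose confidence meets the (hypothetical) threshold."""
    return [x for x in instances if x.confidence >= threshold]

instances = [
    Concept("Astrocyte", 1.0),
    Subclass("NerveCell", "Astrocyte", 1.0),
    Concept("contains cell", 0.2),  # made-up low-confidence false positive
]
for item in promote(instances):
    print(item)
```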
The Questions We Asked
• Can we press the button and get a good ontology?
• If not, can we get something useful?
• Can we do it without having to write too many rules?
• Does the end-point act as a donor or recipient ontology?
Strategy
• Collect corpus
• Manually mark up text for cells: definitive list of terms
• Process corpus through T2O
• Analyse output of T2O for recall and precision of terms and hierarchy
• Iterate the previous two steps with variants in rules
• Evaluation against CTO gold standard
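The recall and precision analysis in this strategy compares the automatically extracted terms with the manually marked-up list. A minimal sketch, assuming exact string matching between the two sets (the talk does not say how terms were matched) and using made-up toy terms rather than the study's data:

```python
def precision_recall(extracted, gold):
    """Precision = correct / extracted; recall = correct / gold."""
    extracted, gold = set(extracted), set(gold)
    correct = extracted & gold
    precision = len(correct) / len(extracted) if extracted else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    return precision, recall

# Toy example only; these are not the terms or counts from the experiment.
gold = {"erythrocyte", "leukocyte", "monocyte", "spermatogonia"}
extracted = {"erythrocyte", "leukocyte", "contains cell"}

precision, recall = precision_recall(extracted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```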
The Experimental Conditions
• Default T2O
• T2O plus cell-specific JAPE rules and all algorithms
• Only cell-specific JAPE rules, /EntropyExtraction Algorithm and some “hierarchy spotting” based on term composition
• Same as 3, but with VerticalRelationsConceptClassification to include our simple JAPE rules
• Same as 4, but with WordConceptClassification for additional hierarchy
Rules for Extracting Cell Types
• Words ending in ‘cyte’, ‘blast’, ‘cell’, ‘glia’, ‘glium’, ‘cell type’, ‘cell line’ and ‘cell lineage’ (together with their plurals)
• Zero or more adjectives followed by zero or more nouns or proper nouns followed by a ‘cell word’ (together with its plural), e.g. ‘Renshaw cell’, ‘Muller cell’, ‘immature blood cell’, etc.
• Any stem cell term is a stem cell
• Any term ending with ‘progenitor cell’ is a Progenitor Cell.
• Any term ending with ‘precursor cell’ is a Precursor Cell.
• Any term ending in ‘blast’ is a Blast Cell.
• Any term ending with ‘cyte’ or ‘cell’ is a Differentiated Cell.
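The suffix-based rules above can be sketched with regular expressions and string tests. This is an illustration, not the JAPE grammar used in the experiment: the function names are hypothetical, plural handling is a crude trailing-'s' strip, the adjective and noun rule is omitted because it needs POS tags, and applying the hierarchy rules in slide order with the first match winning is an assumption.

```python
import re

# Endings from the first rule, with an optional plural "s".
CELL_WORD = re.compile(
    r"(?:cyte|blast|cell|glia|glium|cell type|cell line|cell lineage)s?$",
    re.IGNORECASE,
)

def is_cell_term(term):
    """True if the term ends in one of the 'cell word' endings."""
    return bool(CELL_WORD.search(term.strip()))

def classify(term):
    """Return the parent class suggested by the term's composition, or None."""
    t = re.sub(r"s$", "", term.strip().lower())  # crude plural handling
    if "stem cell" in t:
        return "Stem Cell"
    if t.endswith("progenitor cell"):
        return "Progenitor Cell"
    if t.endswith("precursor cell"):
        return "Precursor Cell"
    if t.endswith("blast"):
        return "Blast Cell"
    if t.endswith(("cyte", "cell")):
        return "Differentiated Cell"
    return None

for term in ["hemopoietic stem cell", "Erythroblasts", "astrocyte", "spermatogonia"]:
    print(term, "->", classify(term))
```

Note that a purely lexical rule like this also accepts strings such as ‘contains cell’, which is the kind of false positive such rules are prone to.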
Evaluation Strategy
• Extraction performance
• Ontology evaluation
• Domain coverage
• Expert evaluation
Term Recognition
• 1,277 terms in our definitive list
• 16,384 terms from whole corpus; 625 relevant
• Increase to 17,851 and 916 respectively
• All 118 CTO terms in corpus recalled
• Corpus has anatomical bias
• Simple rules exploit regularity of language
• Many false positives from adjective noun rule
Cell Terms
• Morphology: Stellate cell; columnar cell
• Ploidy: Tetraploid cell; multiploid cell
• Maturity
• Potentiality: Totipotent stem cell; multipotent cell
• Species origin: Animal cell; human cell
• Anatomical location
• Developmental stage: Mitotic cell; S-phase cell
• Lineage: Mesoderm cell
Common errors
Manually extracted from corpus | Automatically extracted from corpus | Comments
- | +t - cell | Symbols not handled very well
- | contains cell | False-positive cell type
- | Foam cell | New cell type extracted
leukocyte | leucocyte | Spelling errors in corpus
naïve cell | nave cell | Character encoding problem
Spermatogonia | - | No rule to extract
Term Recall and Precision
Default learnt ontology
Final learnt ontology
Still not perfect!
Ontology evaluation
Learnt Ontology under CTO
Discussion
• Exploiting poor performance to focus learning
• Exploiting regularity of language
• Never really going to find CTO domain general layer
• Terms highly compositional and conflate axes
• Ask the question “is it useful?” not “is it good?”
• Is CTO a good standard?
• The extracted hierarchy was not bad from a cell biology and ontological point of view
Nascent Methodology
• Form corpus that includes, but is not limited to scope of target ontology
• Extract terms from corpus
• Filter and massage list of terms to find those of ontological interest
• Use ontology learning to see what happens
• Inspect and augment rules to recognise and incorporate into hierarchy
• Iterate
• Use as donor ontology to transfer useful bits to recipient ontology
Conclusions
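• Can we press the button and get a good ontology? No
• If not, can we get something useful? Yes
• Can we do it without having to write too many rules? Yes
• Donor or recipient ontology? Donor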
Acknowledgements
• Simon Jupp has done the work
• Jaclyn Bibby MSc Project prototype
• Johanna Volker for help with Text2Onto
• David Shotton for knowledge about cell biology