Semantic annotation of biomedical data

Semantic annotation of biomedical data Clement Jonquet [email protected] INRIA - EXMO seminar - March 24th, 2010

Upload: clement-jonquet

Post on 10-May-2015


DESCRIPTION

Presentation about semantic annotation of biomedical data. Presented at LIRMM, INRIA and other venues between 2008 and 2010.

TRANSCRIPT

  • 1. Semantic annotation of biomedical data
    Clement Jonquet
    [email protected]
    INRIA - EXMO seminar - March 24th, 2010

2. Speech overview

  • Introduction: semantic annotation, semantic web, biomedical context, the challenge

3. Ontology-based annotation workflow: concept recognition, semantic expansion, why it's hard
4. Annotation services: the NCBO Annotator web service, the NCBO biomedical resources index
5. Users & use cases
6. Conclusion and future work
7. Annotation & semantic web

  • Part of the vision for the semantic web

8. Web content must be semantically described using ontologies
9. Semantic annotations help to structure the web
10. Annotation is not an easy task
11. Automatic vs. manual
12. Lack of annotation tools (convenient, simple to use and easily integrated into automatic processes)
13. Today's web content (& public data available through the web) is mainly composed of unstructured text
14. Annotation is not a common practice

  • High number of ontologies

15. Getting access to all of them is hard: formats, locations, APIs
16. Lack of tools that easily access all ontologies (domain)
17. Users do not always know the structure of an ontology's content or how to use it to do the annotations themselves
18. Lack of tools to do the annotations automatically
19. Boring additional task without immediate reward for the user
20. Biomedical context

  • Explosion of publicly available biomedical data

21. Very diverse, growing very fast
22. Most of the data are unstructured and rarely described with the ontology concepts available in the domain
23. Hard for biomedical researchers to find the data they need
24. Data integration problem
25. Translational discoveries are prevented
26. Good examples of the use of ontologies and terminologies for annotation
27. Gene Ontology annotations
28. PubMed (biomedical literature) indexed with MeSH headings
29. Limitations
30. UMLS only, almost nothing for OBO & OWL ontologies
31. Manual approaches, curators (scalability?)
32. Automatic approaches (usability & accuracy?)
33. The challenge

  • Automatically process a piece of raw text to annotate it with relevant ontologies

34. Large scale, to scale up for many resources and ontologies
35. Automatic, to keep precision and accuracy
36. Easy to use and to access, to prevent the biomedical community from getting lost
37. Customizable, to fit very specific needs
38. Smart, to leverage the knowledge contained in ontologies
39. Vocabulary

  • Element = a collection of observations resulting from a biomedical experiment or study

40. a dataset, clinical-trial description, research article, imaging study
41. Text metadata = the set of free text that describes or annotates an element
42. Resource = a collection of elements
43. GEO, PubMed, ClinicalTrials.gov, Guideline.gov, ArrayExpress
44. Concept = a unique entity (class) in a specific ontology (has a URI)
45. UMLS CUI or NCBO URI, e.g., C0025202, DOID:1909
46. Term = a string that identifies a given concept (name, synonyms)
47. Melanoma, Melanomas, Malignant melanoma
48. Annotation = meta-information on a piece of data: this data deals with this concept
49. PMID17984116 deals with C0025202
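The vocabulary above can be read as a small data model. The following is only a sketch under assumptions: the class and field names (`Concept`, `Term`, `Annotation`, `ontology`, `resource`, and so on) are invented for illustration and are not the NCBO schema; only the identifiers and strings come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    """A unique entity (class) in one specific ontology."""
    ontology: str    # assumed label for the source, e.g. "UMLS"
    concept_id: str  # e.g. the UMLS CUI "C0025202"


@dataclass(frozen=True)
class Term:
    """A string (name or synonym) that identifies a given concept."""
    text: str
    concept: Concept


@dataclass(frozen=True)
class Annotation:
    """Meta-information: this element of a resource deals with this concept."""
    resource: str    # e.g. "PubMed"
    element_id: str  # e.g. "PMID17984116"
    concept: Concept


# Values taken from the slide; the structure around them is hypothetical.
melanoma = Concept("UMLS", "C0025202")
terms = [Term(t, melanoma) for t in ("Melanoma", "Melanomas", "Malignant melanoma")]
annotation = Annotation("PubMed", "PMID17984116", melanoma)

print(f"{annotation.element_id} deals with {annotation.concept.concept_id}")
```

Keeping terms separate from concepts mirrors the slide's distinction: several strings may point to one concept, while an annotation always refers to the concept itself and not to a string.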
50. Why use ontologies?
They structure the knowledge from a domain
They specify terms that can be used by natural language processing algorithms to process text
They uniquely identify concepts (URI)
They specify relations between concepts that can be used for computing concept similarity
They define hierarchies allowing abstraction of type
They play the role of common denominator for various data from a domain
51. Why use ontologies?
52. Why is it a hard problem? (1/2)

  • Identifying concepts from text is a hard task

53. May involve NLP, stemming, spell-checking, or recognition of morphological variants
54. Concept disambiguation
55. Scalability issues
56. We want to deal with millions of concepts (~4M)
57. 200+ ontologies in several formats, spread out
58. Huge biomedical resources, e.g., PubMed with 17M citations
59. What to do with annotations when the ontologies and the resources evolve over time?
60. e.g., elements in resources are added
61. e.g., concepts in ontologies are removed
62. Why is it a hard problem? (2/2)
How to leverage the knowledge contained in ontologies?
Process the transitive closure for relations (not trivial for ontologies with 300k concepts)
Execute semantic distance algorithms to determine similarity
Compute mappings between ontologies to connect ontologies to one another
Keep all of this up to date when ontologies evolve
e.g., new GO version every day
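To make the closure and distance points concrete, here is a minimal sketch of a path-length distance over an is_a graph. The talk does not say which semantic distance algorithms are meant, so this is just one of the simplest possible choices. The function names and the toy hierarchy labels are made up for illustration.

```python
from collections import deque


def ancestor_distances(concept, parents):
    """Breadth-first walk up is_a edges: {ancestor_or_self: number of steps}."""
    distances = {concept: 0}
    queue = deque([concept])
    while queue:
        node = queue.popleft()
        for parent in parents.get(node, ()):
            if parent not in distances:
                distances[parent] = distances[node] + 1
                queue.append(parent)
    return distances


def path_distance(a, b, parents):
    """Length of the shortest is_a path joining a and b via a common ancestor.

    Returns None when the two concepts share no ancestor.
    """
    da = ancestor_distances(a, parents)
    db = ancestor_distances(b, parents)
    common = da.keys() & db.keys()
    return min((da[c] + db[c] for c in common), default=None)


# Toy hierarchy with made-up labels (not taken from any real ontology).
parents = {
    "leaf_a": ["neoplasm"],
    "leaf_b": ["neoplasm"],
    "neoplasm": ["disease"],
    "infection": ["disease"],
    "disease": [],
}

print(path_distance("leaf_a", "leaf_b", parents))     # siblings
print(path_distance("leaf_a", "infection", parents))  # joined only at the root
```

`ancestor_distances` is the transitive closure of is_a for a single concept. Running it for every concept of a 300k-concept ontology and keeping the result current as versions change is where the cost mentioned on the slide comes from.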
63. Ontology-based annotation workflow
First, direct annotations are created by recognizing concepts in raw text.
Second, annotations are semantically expanded using knowledge of the ontologies.
Third, all annotations are scored according to the context in which they have been created.
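The third step can be sketched as a weighted sum per concept. Everything numeric here is an assumption: the talk only says that annotations are scored according to their context, so the weights, the decay for is_a levels and the function names below are hypothetical placeholders, not the values used by the actual service.

```python
# Hypothetical weights: none of these numbers come from the talk.
CONTEXT_WEIGHTS = {"direct": 10, "mapping": 7}


def is_a_weight(level):
    """Assumed decay: ancestors further up the hierarchy count less."""
    return max(1, 6 - level)


def rank(annotations):
    """Sum weights per concept and sort concepts by decreasing score.

    Each annotation is a (concept, context, level) tuple, where level is
    only meaningful for the "is_a" context.
    """
    totals = {}
    for concept, context, level in annotations:
        if context == "is_a":
            weight = is_a_weight(level)
        else:
            weight = CONTEXT_WEIGHTS[context]
        totals[concept] = totals.get(concept, 0) + weight
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


annotations = [
    ("DOID:1909", "direct", 0),    # concept recognized in the text
    ("DOID:191", "is_a", 1),       # direct parent
    ("DOID:0000818", "is_a", 2),   # grandparent
    ("FMA/C0025201", "mapping", 0),
]

for concept, score in rank(annotations):
    print(concept, score)
```

The point of the design is only that a concept reached by several routes accumulates evidence, while annotations produced by expansion weigh less than direct ones.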
64. Concept recognition (step 1)

  • Uses a dictionary: a list of strings that identifies ontology concepts

65. 220 ontologies, ~4.2M concepts & ~7.9M terms
Uses NCIBI Mgrep, a syntactic concept recognizer
High degree of accuracy
Fast, scalable
Domain independent
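As an illustration of dictionary-based recognition, here is a naive longest-match lookup. It is not Mgrep and makes no claim about how Mgrep works internally. The function names are invented, the dictionary is a five-entry toy, and the plural "Melanocytes" was added by hand because this sketch does no morphological processing.

```python
import re


def build_dictionary(terms):
    """Index terms by their lower-cased token tuple."""
    return {tuple(term.lower().split()): concept for term, concept in terms}


def recognize(text, dictionary):
    """Greedy longest-match lookup; returns (concept, start, end) tuples."""
    tokens = [(m.group().lower(), m.start(), m.end())
              for m in re.finditer(r"\w+", text)]
    max_len = max((len(key) for key in dictionary), default=0)
    found = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(token for token, _, _ in tokens[i:i + n])
            if key in dictionary:
                found.append((dictionary[key], tokens[i][1], tokens[i + n - 1][2]))
                i += n
                break
        else:
            i += 1  # no term starts at this token
    return found


terms = [
    ("Melanoma", "DOID:1909"),
    ("Melanomas", "DOID:1909"),
    ("Malignant melanoma", "DOID:1909"),
    ("Melanocyte", "NCI/C0025201"),
    ("Melanocytes", "NCI/C0025201"),  # plural added by hand for this sketch
]
text = ("Melanoma is a malignant tumor of melanocytes which are found "
        "predominantly in skin but also in the bowel and the eye.")

for concept, start, end in recognize(text, build_dictionary(terms)):
    print(concept, repr(text[start:end]))
```

A real recognizer working over millions of terms would need a far more compact index than a Python dict of tuples, plus the handling of variants listed on the earlier slide.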
66. Semantic expansion (step 2)

  • Uses is_a hierarchies defined by original ontologies

67. Uses mappings in UMLS Metathesaurus and NCBO BioPortal
68. Uses semantic-similarity algorithms based on the is_a graph (ongoing work)
69. Components available as web services
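A minimal sketch of the expansion step, assuming the is_a hierarchy and the mappings are available as plain dictionaries. The function name `expand`, the `max_level` parameter and the tuple layout are illustrative assumptions; the concept identifiers are those of the melanoma example in this talk.

```python
def expand(direct, parents, mappings, max_level=2):
    """Expand direct annotations using is_a parents and concept mappings.

    Returns (new_concept, provenance, source_concept) tuples.
    """
    expanded = []
    for concept in direct:
        # is_a expansion: climb level by level, remembering the level.
        frontier, seen, level = [concept], {concept}, 0
        while frontier and level < max_level:
            level += 1
            next_frontier = []
            for node in frontier:
                for parent in parents.get(node, ()):
                    if parent not in seen:
                        seen.add(parent)
                        next_frontier.append(parent)
                        expanded.append((parent, f"is_a level {level}", concept))
            frontier = next_frontier
        # Mapping expansion: follow mappings to concepts in other ontologies.
        for mapped in mappings.get(concept, ()):
            expanded.append((mapped, "mapping", concept))
    return expanded


parents = {
    "DOID:1909": ["DOID:191"],     # Melanoma is_a Melanocytic neoplasm
    "DOID:191": ["DOID:0000818"],  # ... is_a cell proliferation disease
}
mappings = {"NCI/C0025201": ["FMA/C0025201"]}  # Melanocyte, NCI to FMA

for row in expand(["NCI/C0025201", "DOID:1909"], parents, mappings):
    print(row)
```

Recording the provenance (which relation, which level, which source concept) alongside each expanded annotation is what lets a later scoring step treat it differently from a direct one.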
70. An example

  • Melanoma is a malignant tumor of melanocytes which are found predominantly in skin but also in the bowel and the eye.

71. NCI/C0025201, Melanocyte in NCI Thesaurus
72. 39228/DOID:1909, Melanoma in Human Disease
73. Is_a closure expansion
74. 39228/DOID:191, Melanocytic neoplasm, direct parent of Melanoma in Human Disease
75. 39228/DOID:0000818, cell proliferation disease, grandparent of Melanoma in Human Disease
76. Mapping expansion
77. FMA/C0025201, Melanocyte in Foundational Model of Anatomy, concept mapped to NCI/C0025201 in UMLS.

  • Melanoma is a malignant tumor of melanocytes which are found predominantly in skin but also in the bowel and the eye.