Inducing Ontologies from Folksonomies using Natural Language Understanding
Marta Tatu, Dan Moldovan
Lymba Corporation
Presenter: Chris Irwin Davis
Overview
LREC 2010 May 19th, 2010
NLP
Folksonomy
• typographical errors, spelling variations
• singular/plural forms, lower case
• space/punctuation used as delimiters
• same tag in different contexts
• tag synonymy
Ontology
• lexical normalization of tags
• semantic consistency
• tag-tag relations
Folksonomy-based applications
• social annotations (author vs. user)
• browse/search bookmarks
• resource discovery (recommendations)
• collaborative tagging (across folksonomies)
Reasoning applications
Semantic Approach
1. Folksonomy semantic representation
2. Tag understanding
o Lexical: language identification, tokenization and spelling corrections, capitalization restoration
o Syntactic: part-of-speech tagging, syntactic parsing
o Semantic: acronym understanding, word sense disambiguation, named entity recognition, semantic parsing
3. Deriving the ontological structure
o Semantic relations between tags
• Sources of information
o Tag text semantics
o Social bookmarking annotations
o Machine understanding of bookmark content
Representing Folksonomies
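Raw tag → semantic representation (word[part-of-speech]sense, with the relation between the words where there is one):
• knowledge → knowledge[NN]1
• advertisign → advertising[NN]1
• americanhistory → American[JJ]1 history[NN]2 (TOPIC)
• read-now → now[RB]3 read[VB]1 (TEMPORAL)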
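The tokenization step behind examples such as americanhistory and read-now can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `segment_tag` and `tokenize_tag`, the toy vocabulary, and the "fewest words wins" heuristic are all assumptions. Tags that cannot be segmented (such as the misspelled advertisign) are returned unchanged, on the assumption that a later spelling-correction step handles them.

```python
import re
from typing import List, Optional, Set


def segment_tag(text: str, vocabulary: Set[str]) -> Optional[List[str]]:
    """Split a delimiter-free string into vocabulary words (fewest words wins)."""
    # best[i] holds the shortest word list that covers text[:i], or None.
    best: List[Optional[List[str]]] = [None] * (len(text) + 1)
    best[0] = []
    for end in range(1, len(text) + 1):
        for start in range(end):
            prefix = best[start]
            word = text[start:end]
            if prefix is not None and word in vocabulary:
                candidate = prefix + [word]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[len(text)]


def tokenize_tag(tag: str, vocabulary: Set[str]) -> List[str]:
    """Split on explicit delimiters first, then try dictionary segmentation."""
    parts = [p for p in re.split(r"[-_.+\s]+", tag.lower()) if p]
    tokens: List[str] = []
    for part in parts:
        # Fall back to the unsplit part when no segmentation exists.
        tokens.extend(segment_tag(part, vocabulary) or [part])
    return tokens


# Hypothetical toy vocabulary; a real system would use a full lexicon.
VOCABULARY = {"american", "history", "read", "now", "knowledge", "advertising"}
```

With this vocabulary, `tokenize_tag("americanhistory", VOCABULARY)` yields the two words `american` and `history`, while `tokenize_tag("advertisign", VOCABULARY)` leaves the tag whole.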
Representing Folksonomies
SYNONYMY cluster (WN synsetId = 20729): knowledge, cognition
• knowledge|NN|1
• cognition|NN|1
Folksonomic tags and their associated (user, document) pairs, as listed on the slide:
• (axlape, www.wolframalpha.com/), (nicksoni, www.curatingthecity.org/map.jsp), (pilx, www.wolframalpha.com/), ...
• (bernsnarok, www.wolframalpha.com/), (_tarea_, academicearth.org/), (_tarea_, www.howstuffworks.com/), ...
• (omnamoprabhu, www.goertzel.org/dynapsyc/dynacon.html), (MikeMolto, cvcl.mit.edu/), (latrippi, nymag.com/news/features/56793/), ...
Representing Folksonomies
[Figure: fragment of the semantic representation graph. Recoverable nodes and edge labels:]
• Synonym sets: cognition|NN|1; knowledge|NN|1 / module|NN|1; faculty|NN|1 / organization|NN|1; organisation|NN|2 / pattern|NN|1; form|NN|3 / organization|NN|1; governance|NN|1
• Other nodes: cognitive|JJ|1, perception|NN|1, design|NN|2, calendar|NN|1, PDA|NN|1 (Personal Digital Assistant)
• Compound tags: adaptive|JJ|1 design|NN|2 (PAH), instructional|JJ|1 design|NN|2 (AGT)
• Edge labels: ISA, PERTAIN, PW, SIM
System Architecture
[Figure: system architecture. Recoverable components:]
• Social Annotations: user, document, tag
• Document Cache
• Document NLP: Doc-2-Text; language identification; NLP of EN documents (tokenization, part-of-speech tagging, sentence boundary detection, named entity recognition, syntactic parsing, word sense disambiguation, semantic parsing)
• NLP processing
• Repository
• Social Tag-Tag & Tag-Doc Associations
• Lexical Processing of Tags
• Syntactic Processing of Tags
• Semantic Processing of Tags
• Semantic Representation of Tags
• Tag Classification Rules
• Ontology generation (Tag-Tag relations)
• Induced Ontology
• Applications: search, browse, visualize; recommendations; collaborative tagging
Tag Understanding
Sources used to understand tags (table columns): Tag text, Social bookmarking data, Document content
Lexical
• Language identification: X X X
• Tokenization and spell checking: X X X
• Capitalization restoration: X X
Syntactic
• Part-of-speech tagging: X X
• Syntactic parsing: X
Semantic
• Abbreviation and acronym expansion: X X X
• Word sense disambiguation (+ NER): X X X
• Semantic parsing: X
Acronym/Abbreviation Understanding
• Abbreviation dictionary: (abbreviation - expansion - domain of usage)
o 118,055 distinct abbreviations
o 137 domains: Law, Music, TV/Radio Stations, Countries, Airport, Domain Names, Chat, Emoticons, etc.
o 25% of the abbreviations have more than one definition
• (unambiguous) Zip codes – (76012 : Arlington, TX)
• (ambiguous) SS : 192 definitions in 66 domains
o Social Security – Business and US Government, Screen Saver – File Extensions, Stainless Steel – Housing and Products, Subtropical Storm – Meteorology, Style Sheet – Software
• Check whether the tag is in the abbreviation dictionary
• Use lexical chains to link document content to the abbreviation's domain
• Use co-occurring tags to identify the correct expansion
• Use text alignment to find new abbreviation definitions within document content
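The dictionary lookup and the use of co-occurring tags and document content can be illustrated with a small sketch. It is a simplification under stated assumptions: the `Expansion` record, the `expand_abbreviation` function, the scoring (exact phrase counts plus tag-word overlap), and the domain labels "Business" and "Media" are hypothetical; the lexical-chain and text-alignment steps from the slide are not modelled.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Expansion:
    abbreviation: str
    expansion: str
    domain: str  # domain of usage


# Toy dictionary; the expansions come from the "PR" example in the slides,
# the domain labels "Business" and "Media" are made up for illustration.
ABBREVIATIONS: List[Expansion] = [
    Expansion("PR", "public relations", "Business"),
    Expansion("PR", "press release", "Media"),
    Expansion("PR", "Puerto Rico", "Countries"),
]


def expand_abbreviation(
    tag: str,
    co_tags: Iterable[str],
    document_text: str,
    dictionary: List[Expansion] = ABBREVIATIONS,
) -> Optional[Expansion]:
    """Pick the expansion best supported by the document and co-occurring tags."""
    candidates = [e for e in dictionary if e.abbreviation.lower() == tag.lower()]
    text = document_text.lower()
    tag_words = {t.lower() for t in co_tags}

    def score(entry: Expansion) -> int:
        phrase = entry.expansion.lower()
        in_document = text.count(phrase)  # occurrences in the page text
        in_tags = sum(1 for word in phrase.split() if word in tag_words)
        return in_document + in_tags

    if not candidates:
        return None
    best = max(candidates, key=score)
    # With no supporting evidence, leave the abbreviation unresolved.
    return best if score(best) > 0 else None
```

Returning `None` when nothing supports any candidate is a design choice of this sketch; it mirrors the evaluation remark that a tag which cannot be identified within the document is the main source of errors.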
Acronym/Abbreviation Understanding
• “PR” ~ 1,409 documents
• 87 definitions for PR
o Press Release, Public Relations, Puerto Rico, Page Rank, Public Radio, Permanent Resident/Residency, etc.
• http://prsarahevans.com/2009/06/do-you-have-a-strategy-for-online-comments
o “PR” = “public relations” (6 times in document content)
o Other tags of the bookmark: “public”, “relations”, “media”, “strategy”
• http://www.bbc.co.uk/pressoffice/pressreleases/category/new_media_index.shtml
o “PR” = “press releases” (in document content)
• http://escape.topuertorico.com
o “PR” = “Puerto Rico” (in document content)
Evaluation
• Experimental data
o ~ 150,000 (user, document, tag) triples from del.icio.us
• 8,460 tags; 83,827 documents; 58,198 users
• Main error source: tag cannot be identified within document
o Lack of document content (images, non-EN content, etc.)
• Errors propagate from initial processing steps to later ones
o Bad capitalization leads to bad named entity recognition
Ontological Tag-Tag Relations
• EQUALITY relations
o Same lemma, part-of-speech, and sense number
o EQ(activity, activities), EQ(after-effects, AfterEffects), EQ(opinion, Opnion), etc.
• SYNONYMY clusters
o Same synset id
o SYN(OS, operating.system), SYN(LA, losangeles), SYN(nyt, nytimes)
• ISA relations between named entities and type tags
o ISA(OracleCorporation, organization), ISA(davidfosterwallace, person)
• WordNet relations between tags
o ISA(vegan, vegetarian), ANTONYMY(peace, war), PART_WHOLE(Businesses, markets), ENTAIL(proofreading, +read), SIMILARITY(important, general), DOMAIN(light, physics)
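The EQUALITY and SYNONYMY definitions above amount to grouping analysed tags by a key. The sketch below assumes each tag has already been lemmatized and sense-disambiguated; the `TagAnalysis` record and both function names are hypothetical, and the sense number and synset id given for activity are placeholders (only synset id 20729 for knowledge/cognition appears in the slides).

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class TagAnalysis(NamedTuple):
    tag: str        # raw folksonomy tag
    lemma: str
    pos: str        # part-of-speech label, e.g. "NN"
    sense: int      # sense number
    synset_id: int  # WordNet synset identifier


def equality_clusters(
    analyses: Iterable[TagAnalysis],
) -> Dict[Tuple[str, str, int], List[str]]:
    """EQ: tags that share lemma, part-of-speech, and sense number."""
    clusters: Dict[Tuple[str, str, int], List[str]] = defaultdict(list)
    for a in analyses:
        clusters[(a.lemma, a.pos, a.sense)].append(a.tag)
    return dict(clusters)


def synonymy_clusters(analyses: Iterable[TagAnalysis]) -> Dict[int, List[str]]:
    """SYN: tags whose disambiguated senses fall in the same synset."""
    clusters: Dict[int, List[str]] = defaultdict(list)
    for a in analyses:
        clusters[a.synset_id].append(a.tag)
    return dict(clusters)


SAMPLE = [
    TagAnalysis("activity", "activity", "NN", 1, 1001),    # placeholder ids
    TagAnalysis("activities", "activity", "NN", 1, 1001),  # placeholder ids
    TagAnalysis("knowledge", "knowledge", "NN", 1, 20729),
    TagAnalysis("cognition", "cognition", "NN", 1, 20729),
]
```

Because the grouping key depends entirely on the upstream analysis, an incorrect sense assignment places a tag in the wrong cluster.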
Ontological Tag-Tag Relations
• Lexical chains of size 2 and semantic calculus
– tag1 rel1 synset rel2 tag2
• rel1 & rel2 ⇒ rel3
• rel3(tag1, tag2) is added to the ontology
– ISA(integration, events,) ⇐ ISA(integration, group_action/NN/1) and ISA(group_action/NN/1, events,)
– PART_WHOLE(lobby, hotels) ⇐ PART_WHOLE(lobby, building/NN/1) and ISA(building/NN/1, hotels)
• ISA relations between “modifier head” and “head” tags
– ISA(book-cover, covers)
– ISA(theoryofmind, theory)
– ISA(photoshoptutorials, tutorials,)
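The chain pattern "tag1 rel1 synset rel2 tag2" can be sketched as a lookup in a composition table. The table below contains only the two combinations illustrated on this slide; the actual semantic calculus rule set is not given in the slides, and the names `COMPOSITION` and `compose` are hypothetical.

```python
from typing import Dict, Optional, Tuple

# A relation instance: (relation name, source, target).
Relation = Tuple[str, str, str]

# (rel1, rel2) -> rel3. Only the two patterns shown on the slide are listed;
# the full rule set is an open assumption.
COMPOSITION: Dict[Tuple[str, str], str] = {
    ("ISA", "ISA"): "ISA",
    ("PART_WHOLE", "ISA"): "PART_WHOLE",
}


def compose(
    first: Relation,
    second: Relation,
    rules: Dict[Tuple[str, str], str] = COMPOSITION,
) -> Optional[Relation]:
    """Combine rel1(tag1, synset) and rel2(synset, tag2) into rel3(tag1, tag2)."""
    name1, source, middle1 = first
    name2, middle2, target = second
    if middle1 != middle2:
        return None  # the two links do not share the intermediate synset
    name3 = rules.get((name1, name2))
    return (name3, source, target) if name3 is not None else None
```

For the slide's first example, composing `("ISA", "integration", "group_action/NN/1")` with `("ISA", "group_action/NN/1", "events,")` gives `("ISA", "integration", "events,")`, which would then be added to the ontology.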
Ontological Tag-Tag Relations
• Relations between “modifier_i head_i” tags (i = 1, 2)
– ISA(build-solar-panel, create-solar-panel)
– SIMILARITY(socialnetworks, socialweb)
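[Figure: rule diagram for deriving ISA between two “modifier head” tags. Three alternative premises, joined by OR, lead to (⇒) an ISA relation between the tags:]
• ISA between modifier1 and modifier2 & ISA between head1 and head2
• ISA between modifier1 and modifier2 & SYN between head1 and head2
• SYN between modifier1 and modifier2 & ISA between head1 and head2
Each alternative is combined (&) with a REL edge between the modifier and the head of the tag.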
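One reading of the modifier/head rule on this slide can be sketched as follows. Several points are assumptions rather than facts from the slides: that both tags must share the same modifier-head relation REL, that identical strings count as synonymous, the direction ISA(part1, part2), and the names `Compound`, `related`, and `compound_isa`. The relation label "THEME" and the input fact ISA(build, create) are placeholders chosen to reproduce the slide's build-solar-panel example.

```python
from typing import NamedTuple, Set, Tuple

Pair = Tuple[str, str]


class Compound(NamedTuple):
    modifier: str
    head: str
    rel: str  # relation holding between modifier and head


def related(a: str, b: str, syn: Set[Pair]) -> bool:
    """Treat identical strings and listed pairs (either order) as synonymous."""
    return a == b or (a, b) in syn or (b, a) in syn


def compound_isa(t1: Compound, t2: Compound, isa: Set[Pair], syn: Set[Pair]) -> bool:
    """Return True when ISA(t1, t2) follows from the three patterns of the rule."""
    if t1.rel != t2.rel:
        return False  # assumed: the modifier-head relation must match
    mod_isa = (t1.modifier, t2.modifier) in isa
    head_isa = (t1.head, t2.head) in isa
    mod_syn = related(t1.modifier, t2.modifier, syn)
    head_syn = related(t1.head, t2.head, syn)
    return (mod_isa and head_isa) or (mod_isa and head_syn) or (mod_syn and head_isa)


# Placeholder analysis of the slide's example tags.
BUILD = Compound(modifier="solar-panel", head="build", rel="THEME")
CREATE = Compound(modifier="solar-panel", head="create", rel="THEME")
ISA_FACTS: Set[Pair] = {("build", "create")}
```

With these inputs, `compound_isa(BUILD, CREATE, ISA_FACTS, set())` holds while the reverse direction does not, matching ISA(build-solar-panel, create-solar-panel).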
Evaluation
• 9,820 EQ clusters for the 8,460 unique tags
o Same abbreviation expanded to different definitions
o EQ: tutorial, tutorials, tutorials,
• 8,801 SYN clusters
o Largest cluster (133 bookmarks): car, automobiles, auto, autos, cars, automobile
• 17% of tags placed into an incorrect SYN cluster
o Errors caused by imperfect word sense disambiguation
• 5,439 ontological tag-tag relations
o 3,869 ISA, 601 SIMILARITY, 429 PART_WHOLE, etc.
o 1,778 relations derived using WordNet’s lexical chains and Lymba’s semantic calculus rules
Folksonomic Ontology
• Portion of ontology generated from experimental folksonomy
Thank you!
For questions: email [email protected]