semiautomatic domain model building from text-data

SMAP 2011, Vigo, Spain, December 1-2, 2011


Petr Šaloun, Petr Klimánek, Zdenek Velart

Page 1: Semiautomatic domain model building from text-data

SMAP 2011, Vigo, Spain, December 1-2, 2011

Page 2: Semiautomatic domain model building from text-data

Introduction and goals

The basic tasks in creating a domain model:
‐ selection of domain and scope
‐ consideration of reusability
‐ finding important terms
‐ defining classes and class hierarchy
‐ defining properties of classes and constraints
‐ creation of instances of classes

Goals:
‐ designing a method for semiautomatic domain creation
‐ different input documents
‐ different languages
‐ design and implementation of a tool

Page 3: Semiautomatic domain model building from text-data

Algorithm and tasks work with domain model
different document formats
different languages
domain model
‐ concepts, relations
domain model creation = time consuming
‐ manual creation
‐ automatic creation
‐ semiautomatic creation

Page 4: Semiautomatic domain model building from text-data

natural language processing – NLP
Stanford NLP
‐ Stanford Parser
‐ Stanford POS tagger
‐ Stanford Named Entity Recognizer
multi-language environment – Google Translate
WordNet (synsets)
Tool – Java, SWING, XML, jTidy, JAWS, SNLP, JUNG

Page 5: Semiautomatic domain model building from text-data

An/DT integer/NN character/NN constant/NN has/VBZ type/NN int/NN ./.

<html><body><p>An integer character constant has type int.</p></body></html>

Page 6: Semiautomatic domain model building from text-data

input TXT, HTML, PDF
removal of occurrences of special characters using regular expressions
‐ numeric designation of chapters and references
‐ removal of single letter prepositions: (\\s+[^Aa\\s\\.]{1})+\\s+
‐ parentheses, dashes, and other
translation into English – the tools work only with English text
‐ Google Translate
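
The single-letter removal step can be sketched in Java, the language of the tool. This is a minimal sketch, not the authors' implementation: the pattern is the one quoted on the slide (written there as a Java string literal), while the class name `PrepositionFilter`, the method name, and the choice to replace each match with a single space are assumptions made for illustration.

```java
import java.util.regex.Pattern;

class PrepositionFilter {
    // Pattern taken from the slide: one or more standalone single characters
    // (anything except 'A', 'a', whitespace, or '.') surrounded by whitespace.
    private static final Pattern SINGLE_LETTER =
            Pattern.compile("(\\s+[^Aa\\s\\.]{1})+\\s+");

    // Assumption: each match is replaced by one space so neighbouring words stay separated.
    static String removeSingleLetterWords(String text) {
        return SINGLE_LETTER.matcher(text).replaceAll(" ");
    }

    public static void main(String[] args) {
        System.out.println(removeSingleLetterWords("alpha v beta"));
        System.out.println(removeSingleLetterWords("this is a test"));
    }
}
```

The character class excludes `A` and `a`, so the English article is left in place, while a standalone letter such as `v` between two words is dropped.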

Page 7: Semiautomatic domain model building from text-data

Stanford CoreNLP
‐ Stanford Parser, Stanford POS tagger, Stanford Named Entity Recognizer
‐ machine learning over large data, statistical model of maximum entropy
‐ learned models included

Activities
‐ tokenization
‐ sentence splitting
‐ POS tagging - Part-of-speech
‐ lemmatization
‐ NER - Named Entity Recognition

Page 8: Semiautomatic domain model building from text-data

<html><body><p>An integer character constant has type int.</p></body></html>

An/DT integer/NN character/NN constant/NN has/VBZ type/NN int/NN ./.

Page 9: Semiautomatic domain model building from text-data

tokens marked by POS tagger as nouns are first concept candidates
‐ one-word or multi-word nouns
identifying token as concept by disambiguation from WordNet
‐ assigning synset – automatic, manual
‐ using domain term for searching
‐ possible selection of incorrect synset – with other meaning
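
The first step above, picking noun tokens as concept candidates, can be sketched over the `word/TAG` output shown on the previous slide. This is only an illustration under stated assumptions: the class `ConceptCandidates` and its methods are hypothetical, a tag starting with `NN` is taken to mean "noun", and a multi-word candidate is assumed to be a maximal run of adjacent nouns. The WordNet disambiguation step is not covered.

```java
import java.util.ArrayList;
import java.util.List;

class ConceptCandidates {
    // Returns every noun token, plus each run of two or more adjacent nouns
    // joined into one multi-word candidate (added when the run ends).
    static List<String> extract(String tagged) {
        List<String> candidates = new ArrayList<>();
        List<String> run = new ArrayList<>();
        for (String token : tagged.trim().split("\\s+")) {
            int slash = token.lastIndexOf('/');
            boolean noun = slash > 0 && token.substring(slash + 1).startsWith("NN");
            if (noun) {
                String word = token.substring(0, slash);
                candidates.add(word);
                run.add(word);
            } else {
                closeRun(run, candidates);
            }
        }
        closeRun(run, candidates);
        return candidates;
    }

    // Emits the current run as a multi-word candidate if it has at least two nouns.
    private static void closeRun(List<String> run, List<String> candidates) {
        if (run.size() > 1) {
            candidates.add(String.join(" ", run));
        }
        run.clear();
    }

    public static void main(String[] args) {
        System.out.println(extract(
                "An/DT integer/NN character/NN constant/NN has/VBZ type/NN int/NN ./."));
    }
}
```

For the slide's sentence this yields the single nouns together with the multi-word candidates `integer character constant` and `type int`.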

Page 10: Semiautomatic domain model building from text-data

unoriented / oriented
unnamed / named
WordNet – concept must have synset
‐ hyperonyms and hyponyms – IsA relations
‐ holonyms and meronyms – partOf relations
‐ relation orientation based on concept order
only direct relations from text
lexical-syntactic patterns
decomposition of multi-word terms – right part of term corresponds to existing concept
‐ assignment expression
‐ assignment expression IsA expression
sentence syntax analysis – amod parser (adjectival modifier), adjective followed by noun
‐ integral type IsA type
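
The decomposition of multi-word terms can be sketched as follows. This is a hedged illustration, not the tool's code: the class `IsAProposals`, the method `propose`, and the preference for the longest matching right part are assumptions; the slide only states that the right part of the term must correspond to an existing concept.

```java
import java.util.Arrays;
import java.util.Optional;
import java.util.Set;

class IsAProposals {
    // Proposes "term IsA suffix" when a proper right part of the term is a known concept.
    // Assumption: longer right parts are tried first.
    static Optional<String> propose(String term, Set<String> concepts) {
        String[] words = term.trim().split("\\s+");
        for (int start = 1; start < words.length; start++) {
            String suffix = String.join(" ", Arrays.copyOfRange(words, start, words.length));
            if (concepts.contains(suffix)) {
                return Optional.of(String.join(" ", words) + " IsA " + suffix);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Set<String> concepts = Set.of("expression", "type");
        System.out.println(propose("assignment expression", concepts).orElse("no proposal"));
        System.out.println(propose("integral type", concepts).orElse("no proposal"));
    }
}
```

With `expression` and `type` already in the model, the two terms from the slide produce `assignment expression IsA expression` and `integral type IsA type`; a one-word term produces no proposal.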

Page 11: Semiautomatic domain model building from text-data
Page 12: Semiautomatic domain model building from text-data

ANSI/ISO C language
comparison with existing manually created ontology
2 experiments
‐ all concept candidates
‐ only first 200 candidates
3 variants of experiment
‐ only candidates
‐ candidates and IsA proposals
‐ candidates and IsA proposals and NER entities
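
The comparison with the manually created ontology can be pictured as a set intersection over the candidate list, optionally cut to the first 200 entries. The slides do not say how candidates were matched against ontology concepts, so the case-insensitive name match below is an assumption, as are the class `OntologyComparison` and its method.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class OntologyComparison {
    // Counts distinct candidates among the first `limit` entries that also occur
    // in the reference ontology. Matching by lower-cased name is an assumption.
    static int countFound(List<String> candidates, Set<String> ontology, int limit) {
        Set<String> reference = new HashSet<>();
        for (String concept : ontology) {
            reference.add(concept.toLowerCase(Locale.ROOT));
        }
        Set<String> found = new HashSet<>();
        int n = Math.max(0, Math.min(limit, candidates.size()));
        for (String candidate : candidates.subList(0, n)) {
            String key = candidate.toLowerCase(Locale.ROOT);
            if (reference.contains(key)) {
                found.add(key);
            }
        }
        return found.size();
    }

    public static void main(String[] args) {
        List<String> candidates = List.of("type", "Value", "banana", "pointer");
        Set<String> ontology = Set.of("Type", "value", "pointer");
        System.out.println(countFound(candidates, ontology, 2));
        System.out.println(countFound(candidates, ontology, Integer.MAX_VALUE));
    }
}
```

Passing 200 as `limit` corresponds to the "only first 200 candidates" experiment; passing a large value corresponds to "all concept candidates".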

Page 13: Semiautomatic domain model building from text-data

type        645    argument     182    Behavior        149
Value       571    member       180    result          148
Character   529    String       180    Return          135
function    447    Stream       172    Macro           127
Pointer     329    Array        160    Declaration     119
Object      322    Sequence     160    Implementation  118
Expression  304    char         158    Conversion      111
Identifier  220    Operator     155    Integer         105
int         195    Number       155    File            102
operand     184    Description  155    Reference       100
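
The numbers next to each term appear to be occurrence counts, listed in descending order down each column. Under that assumption, a ranking like this one, and the "first 200 candidates" cut used in the experiments, could be produced as sketched below. The class `CandidateRanking`, the method name, and the alphabetical tie-break are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CandidateRanking {
    // Counts how often each candidate occurs and returns the n most frequent terms.
    // Ties are broken alphabetically (an assumption, only to keep the order stable).
    static List<String> topCandidates(List<String> occurrences, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (String term : occurrences) {
            counts.merge(term, 1, Integer::sum);
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> top = new ArrayList<>();
        int limit = Math.max(0, Math.min(n, entries.size()));
        for (Map.Entry<String, Integer> entry : entries.subList(0, limit)) {
            top.add(entry.getKey());
        }
        return top;
    }

    public static void main(String[] args) {
        List<String> occurrences = List.of("type", "value", "type", "int", "value", "type");
        System.out.println(topCandidates(occurrences, 2));
    }
}
```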

Page 14: Semiautomatic domain model building from text-data

Variant  Added      Items in model  Found concepts  Found / Items  Found / total in ontology  Found / can be found
All      -          3137            395             13 %           38 %                       73 %
All      IsA        4519            450             10 %           43 %                       84 %
All      IsA + NER  4558            465             10 %           45 %                       86 %
200      -          200             98              49 %           9 %                        18 %
200      IsA        1802            152             8 %            15 %                       28 %
200      IsA + NER  1962            318             16 %           31 %                       59 %
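
The Found / Items column is the only one that can be recomputed from numbers printed on the slide, since the ontology totals behind the other two ratios are not given. A small check, with a hypothetical class `FoundRatio` and rounding to whole percent assumed:

```java
class FoundRatio {
    // Found concepts as a percentage of items in the model, rounded to a whole percent.
    static long percent(int found, int items) {
        return Math.round(100.0 * found / items);
    }

    public static void main(String[] args) {
        int[][] rows = {
            {395, 3137}, {450, 4519}, {465, 4558},  // all candidates: -, IsA, IsA + NER
            {98, 200}, {152, 1802}, {318, 1962}     // first 200 candidates: -, IsA, IsA + NER
        };
        for (int[] row : rows) {
            System.out.println(row[0] + " / " + row[1] + " = " + percent(row[0], row[1]) + " %");
        }
    }
}
```

For example, 395 found concepts out of 3137 items is about 12.6 %, which rounds to the 13 % in the first row.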

Page 15: Semiautomatic domain model building from text-data

Variant of experiment without IsA relations only with NER entities

Variant    Items  Found  Concepts / Items  Concepts / total  Concepts / can be found
All + NER  3204   444    13.9 %            42.8 %            82.4 %
200 + NER  360    265    73.6 %            25.5 %            49.2 %

Page 16: Semiautomatic domain model building from text-data

concepts => lightweight ontology
enables better automatic relations mining

Page 17: Semiautomatic domain model building from text-data

Petr Šaloun, FEECS, VSB–Technical University of [email protected]

Petr Klimánek (was: Faculty of Science, University of Ostrava) [email protected]

Zdenek Velart, FEECS, VSB–Technical University of [email protected]