SEMANTIC SEARCH AND SEMANTIC NETWORKS

Lecture support for the course Theoretical Aspects of Artificial Intelligence, KA 16

RNDr. Michal Laclavík, PhD.

11.4.2013, Hradec Králové

Primary Research Team & Capabilities
Dept. of Parallel and Distributed Computing (PDC)

Research and Development Areas:
– Large-scale HPCN, Grid and MapReduce applications
– Intelligent and Knowledge oriented Technologies

Experience from IST:
– 3 projects in FP5: ANFAS, CrossGrid, Pellucid
– 6 projects in FP6: EGEE II, K-Wf Grid, DEGREE (coordinator), EGEE, int.eu.grid, MEDIGRID
– 4 projects in FP7: Commius, Admire, Secricom, EGEE III

Several National Projects (SPVV, VEGA, APVT)

IKT Group Focus:
– Information Processing (Large Scale)
– Graph Processing
– Information Extraction and Retrieval
– Semantic Web
– Knowledge oriented Technologies
– Parallel and Distributed Information Processing

Solutions:
– SGDB: Simple Graph Database
– gSemSearch: Graph based Semantic Search
– Ontea: Pattern-based Semantic Annotation
– ACoMA: KM tool in Email
– EMBET: Recommendation System
– Experts on MapReduce and IR (Nutch, Solr, Lucene)

Director & leader of PDC: Dr. Ladislav Hluchý

URL: http://ikt.ui.sav.sk


Contents

• Google Knowledge Graph
• Facebook Graph Search
• SemSets
• Semantic networks
• gSemSearch
• IBM Watson
• Information extraction

Underlined items are methods developed at ÚI SAV (Institute of Informatics, Slovak Academy of Sciences)

Google Knowledge Graph
• Wikipedia
• Freebase
• Confirmed human knowledge

[Ulanoff]

Facebook Graph Search
• User-generated content
• Links to the web

[facebook13]

Semantic search: SemSets
• Answers to list-type queries, e.g. "astronauts who walked on the Moon"
• Wikipedia used both as text and as a graph
• Text: ranking with Lucene
• Graph/network: spreading activation and SemSets
• Winning solution at the Semantic Search Challenge

Results for "astronauts who walked on the Moon":
1. Eugene_Cernan
2. Alan_Bean
3. David_Scott
4. John_Young_(astronaut)
5. Neil_Armstrong
6. Pete_Conrad
7. Harrison_Schmitt
8. Alan_Shepard
9. Charles_Duke
10. Buzz_Aldrin
11. James_Irwin
12. Edgar_Mitchell

[SemSets]

Discovering relations in large graph data
• Motivation
– Graphs and networks are ubiquitous: social networks, the web, LinkedData, transactions, communication (email, phones).
– Text can also be converted into a graph.
– Linking graph data and searching for relations in it is important.
• Approach
– Building semantic trees and graphs from text, the web, communication, databases and LinkedData
– User interaction with this data, so that sources can be integrated better and the data cleaned and corrected
– Users will do this only if it pays off, that is, if it immediately improves the search results

Semantic networks
• Social networks: friends and other artifacts such as messages, statuses, photos and the like.
• Emails: a social network plus other objects such as companies, organizations, documents, links, time and the like.
• Telecommunications: a network of people communicating with each other through calls and SMS, with additional metadata such as time or place.
• Internet: a network of links and connections.
• Wikipedia: a network of links and hierarchies among topic pages as well as their language versions
• LinkedData

Random network and network with a power-law distribution

[Figure: a network with a power-law degree distribution and a network with a binomial degree distribution]

Source: http://geza.kzoo.edu/bionet/html/scalefree.html

[Slide borrowed from Marek Ciglan]

Small-world networks
• Small-world networks often contain cliques or near-cliques
• The effect: "my friends in a social network are often friends with each other"
• Mathematically, this can be captured by the clustering coefficient
• Local clustering coefficient: the fraction of pairs of a vertex's neighbours that are themselves connected

Source: http://en.wikipedia.org/wiki/Clustering_coefficient

[Slide borrowed from Marek Ciglan]
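
The local clustering coefficient from this slide can be sketched in a few lines of Java. This is only an illustration for an undirected graph: the class name `ClusteringSketch`, the adjacency-set representation and the toy vertex names are assumptions made for the example, not part of the original slides.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative sketch: local clustering coefficient on an undirected graph. */
final class ClusteringSketch {

    /** Adds an undirected edge a-b to an adjacency-set representation. */
    static void addEdge(Map<String, Set<String>> g, String a, String b) {
        g.computeIfAbsent(a, k -> new HashSet<>()).add(b);
        g.computeIfAbsent(b, k -> new HashSet<>()).add(a);
    }

    /** C(v) = 2 * (edges among neighbours of v) / (k * (k - 1)), where k is the degree of v. */
    static double localClustering(Map<String, Set<String>> g, String v) {
        List<String> nbrs = new ArrayList<>(g.getOrDefault(v, Collections.emptySet()));
        int k = nbrs.size();
        if (k < 2) return 0.0; // fewer than two neighbours: reported as 0 by convention
        int links = 0;
        for (int i = 0; i < k; i++) {
            for (int j = i + 1; j < k; j++) {
                if (g.get(nbrs.get(i)).contains(nbrs.get(j))) links++;
            }
        }
        return 2.0 * links / (k * (k - 1));
    }

    public static void main(String[] args) {
        Map<String, Set<String>> g = new HashMap<>();
        addEdge(g, "ann", "bob");
        addEdge(g, "ann", "eva");
        addEdge(g, "ann", "jan");
        addEdge(g, "bob", "eva"); // only 1 of ann's 3 neighbour pairs is linked, so C = 1/3
        System.out.println(localClustering(g, "ann"));
    }
}
```

A value close to 1 means the vertex sits in a near-clique, which is the "my friends are friends with each other" effect named above.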

Properties of selected graphs/networks

[Figure panels: Enron, DBPedia, DSK, LinkedIn, BBC, Events, ACM, Gorila]

Datasets:
• DBPedia
• Web
• BBC, LinkedIn, DSK
• Gorila – document
• Events – agent simulation event graph
• ACM – publications, LinkedData

Network      Vertices    Edges        Avg. clustering coeff.  Assortativity coeff.  Avg. shortest path
Enron Full   8 269 278   20 383 709   0.29                    -0.02                 6.58
Enron5       160 387     630 330      0.30                    -0.04                 6.64
LinkedIn     1 564 698   6 094 634    0.36                    0.13                  6.48
BBC          1 725 900   6 839 358    0.34                    -0.05                 7.55
DSK          21 518      98 952       0.31                    0.39                  5.79
DSK3         2 857       8 754        0.36                    -0.14                 5.46
Gorila       5 959       23 724       0.31                    0.03                  6.25
Events       25 478      539 328      0.38                    -0.25                 2.47
ACM          941 322     2 198 001    0.34                    -0.06                 7.30

Entity extraction, trees and networks
• Information Extraction (entity identification)
– We used Ontea, but other tools such as GATE or Stanford NER can be used
– Ontea's advantage: it forms entity trees
• Trees
• Graphs/Networks


Ontea: an information extraction tool
• Regular expressions (patterns)
• Gazetteers (dictionaries)
• Results: annotations
– Key-value pairs
– Semantic trees
– Graphs and networks
• Transformations, configuration
• Automatic loading of extractors
• Visual annotation tool
• Integration with other technologies
– GATE, Stanford NER, Hadoop …
• Tests with various languages
– English, Slovak, Spanish, Italian

http://ontea.sf.net

[ontea_email]

gSemSearch: discovering relations in graphs and networks
• Improved search for relations in semantic graphs
• Scalability
• Aimed at linking
– structured data (relational data, LinkedData)
– unstructured data (text, documents, communication)

[gSemSearch]

Navigation in a simplified LinkedData graph
• Conversion of ACM LinkedData into a simple graph for gSemSearch
– Experiment on relation search and navigation
– Converting to a simpler graph discards the relation types, which is sometimes a problem


Graph theory: spreading activation

• Fast algorithm
• Takes graph topology into account
• Breadth first
• Ends after it has visited a certain number of nodes (set to 10,000 experimentally)

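// Breadth-first spreading activation over the graph g.
// Relies on fields of the enclosing class: g (graph), visited (Set<Entity>),
// threshold (double) and visitNodeCount (int, 10,000 in the experiments).
// Entity is assumed to be a subtype of Result, since entities are used as map keys.
public Map<Result, Double> relatedBreadthFirst(Set<Entity> startNodes) {
    Map<Result, Double> rM = new HashMap<Result, Double>();  // node -> relevance (R)
    LinkedList<Entity> rLL = new LinkedList<Entity>();       // BFS queue (P)
    int count = visitNodeCount;                              // remaining activation budget (n)
    int sizeInit = startNodes.size();
    for (Entity start : startNodes) {
        rLL.addLast(start);
        visited.add(start);                                  // V = S at the beginning
        rM.put(start, (double) count / (double) sizeInit);   // initial relevance n/k
    }
    while (!rLL.isEmpty() && count > 0) {
        Entity r = rLL.removeFirst();
        int nCount = g.getNeighborCount(r);
        if (nCount == 0) continue;                           // isolated node, nothing to spread
        double v = rM.get(r) / (double) nCount;              // q_b = q_p / |N_p|
        if (v < threshold) continue;
        if (nCount <= count) {
            Collection<Entity> rC = g.getNeighbors(r);
            for (Entity entity : rC) {
                if (!visited.contains(entity)) {
                    rLL.addLast(entity);                     // enqueue each node only once
                    visited.add(entity);
                }
                double val = v;
                if (rM.containsKey(entity)) val += rM.get(entity);
                rM.put(entity, val);                         // accumulate relevance
            }
            count -= nCount;                                 // n = n - |N_p|
        }
    }
    return rM;
}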

In our algorithm, activation starts from a set of nodes $S = \{v_1, v_2, \dots, v_k\}$. The activation value is a constant ($n = 10{,}000$) determined experimentally; it is also the maximum number of visited nodes. Visited nodes are stored in the set $V$, which initially contains the starting nodes ($V = S$). The starting nodes are put into the queue $P = (v_1, v_2, \dots, v_k)$. $R$ is the set of nodes with assigned relevance; each starting node gets the relevance $n/k$: $R = \{(v_1, n/k), (v_2, n/k), \dots, (v_k, n/k)\}$.

1. Because we traverse the graph breadth first, when the queue is $P = (p_1, p_2, \dots, p_l)$ we take out its first node for processing: $p = p_1$, $P = (p_2, \dots, p_l)$. The queue is processed while $P \neq \emptyset \land n > 0$. For each $p$, its neighbors form the set $N_p$: $\exists\, e_{pb} = (p, b) \Rightarrow b \in N_p$.

2. For each $b_i \in N_p$ we compute the new relevance value $q_b = q_p / |N_p|$. The value $q_p$ of node $p$ is known because $(p, q_p) \in R$. We process the neighbors of $p$ only if $q_b > \mathit{threshold} \land n > |N_p|$; otherwise the next node from $P$ is processed.

3. Each $b_i$ is added to the queue, $P = (p_2, \dots, p_l, b_i)$, but only if it does not already belong to the set of visited nodes $V$. After processing, $b_i$ is added to $V$.

4. If $(b_i, q) \in R$, it is replaced by $(b_i, q + q_b)$; otherwise $(b_i, q_b)$ is added to $R$.

5. When all $b_i$ are processed, $n$ is decreased by the neighbor count of node $p$: $n = n - |N_p|$.

6. Then we process the next node $p \in P$ from the queue, going back to the first step.

When the algorithm finishes, the set $R$ contains the nodes relevant to the set of starting nodes $S$, with assigned relevance values $q_i$, including the starting nodes: $R = \{(r_1, q_1), (r_2, q_2), \dots, (r_m, q_m)\}$.

We also define OR and AND operations over the starting nodes. The OR operation is exactly what is described above: activation starts from multiple nodes at once. For the AND operation, we run the algorithm independently for each starting node. For example, running AND for two nodes gives the result sets $R_1 = \{(r_1, q_1), (r_2, q_2), \dots, (r_m, q_m)\}$ and $R_2 = \{(s_1, g_1), (s_2, g_2), \dots, (s_t, g_t)\}$. The final result set for AND includes only the nodes that appear in both sets, and the relevance value is computed by multiplying the relevancies, $q = q_i g_j$:

$(r_i, q_i g_j) \in R \iff (r_i, q_i) \in R_1 \land (s_j, g_j) \in R_2 \land r_i = s_j$
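
The AND operation can be illustrated with a short Java sketch: intersect two relevance maps and multiply the values of the shared nodes. Using `String` node identifiers and the class name `AndCombineSketch` are assumptions for this example; the original code works with `Entity` and `Result` objects, and the relevance numbers below are made up.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch of the AND operation over two spreading-activation results. */
final class AndCombineSketch {

    /** Keeps only nodes present in both result sets; the relevance is the product q_i * g_j. */
    static Map<String, Double> and(Map<String, Double> r1, Map<String, Double> r2) {
        Map<String, Double> out = new HashMap<>();
        for (Map.Entry<String, Double> e : r1.entrySet()) {
            Double g = r2.get(e.getKey());
            if (g != null) {
                out.put(e.getKey(), e.getValue() * g);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Double> r1 = new HashMap<>();
        r1.put("a", 2.0);
        r1.put("b", 0.5);
        Map<String, Double> r2 = new HashMap<>();
        r2.put("b", 4.0);
        r2.put("c", 1.0);
        System.out.println(and(r1, r2)); // prints {b=2.0}
    }
}
```

For more than two starting nodes the same function could be folded over the per-node result maps, since multiplication is associative.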

SGDB: Simple Graph Database

• Storage for graph structures
• Optimized for graph traversal
• Faster than Neo4j at traversal
• Supports the Blueprints API
• https://simplegdb.svn.sourceforge.net/svnroot/simplegdb/Sgdb3
• Graph Database Benchmark
– Benchmark for graph traversal operations
– http://ups.savba.sk/~marek/gbench.html
– Blueprints API: any database that supports this API can be tested

[gBench]

Processing large-scale text and graph data

Technologies
• Data crawling
– Nutch + plugins
• Indexing and full-text search
– Lucene, Solr
• Information extraction
– Ontea, GATE
• All of the above on large-scale data
– Hadoop, S4
• Processing and querying graph data
– Simple Graph Database (SGDB)
– gSemSearch
– Neo4j
– Blueprints

Underlined items are technologies developed at ÚI SAV

[uiWeb]

IBM Watson


[Perrone11]

Machine Learning and (Training) Data
• Log files (users)
• Wikipedia, DBPedia (111 languages)
• Tags (YouTube, Delicious, ...)
• LinkedData

[Zaragoza]

Information Extraction: OpenNLP
• NLP tasks
– tokenization
– sentence segmentation
– part-of-speech tagging
– named entity extraction
– chunking
– parsing
– coreference resolution
• Machine learning models
– maximum entropy
– perceptron model

[TamingText, OpenNLP]

• Experiments: http://vi.ikt.ui.sav.sk/
– Extraction of person names
– Extraction of locations
– Sentence detection for Slovak
– Sentence detection for English

Information Extraction: Features

Example word features:
– identity of word
– is in all caps
– ends in "-ski"
– is part of a noun phrase
– is in a list of city names
– is under node X in WordNet or Cyc
– is in bold font
– is in hyperlink anchor
– features of past & future
– last person name was female
– next two words are "and Associates"

begins-with-number, begins-with-ordinal, begins-with-punctuation, begins-with-question-word, begins-with-subject, blank, contains-alphanum, contains-bracketed-number, contains-http, contains-non-space, contains-number, contains-pipe

contains-question-mark, contains-question-word, ends-with-question-mark, first-alpha-is-capitalized, indented, indented-1-to-4, indented-5-to-10, more-than-one-third-space, only-punctuation, prev-is-blank, prev-begins-with-ordinal, shorter-than-30

[Nigam]


Word Features
– Lists of job titles
– Lists of prefixes
– Lists of suffixes
– 350 informative phrases

HTML/Formatting Features
– {begin, end, in} x {<b>, <i>, <a>, <hN>} x {lengths 1, 2, 3, 4, or longer}
– {begin, end} of line

Token shape features
– Is Capitalized
– Is Mixed Caps
– Is All Caps
– Initial Cap
– Contains Digit
– All lowercase
– Is Initial
– Punctuation
– Period
– Comma
– Apostrophe
– Dash
– Preceded by HTML tag

List and classifier features
– Character n-gram classifier says string is a person name (80% accurate)
– In stopword list (the, of, their, etc.)
– In honorific list (Mr, Mrs, Dr, Sen, etc.)
– In person suffix list (Jr, Sr, PhD, etc.)
– In name particle list (de, la, van, der, etc.)
– In Census lastname list; segmented by P(name)
– In Census firstname list; segmented by P(name)
– In locations lists (states, cities, countries)
– In company name list ("J. C. Penny")
– In list of company suffixes (Inc, & Associates, Foundation)
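
A handful of the token features listed above can be computed as in the following Java sketch. It is only an illustration: the class name `WordFeatureSketch`, the feature names and the tiny stand-in word lists are assumptions, and real systems would load much larger lists (Census names, locations, company suffixes).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Illustrative sketch: turning one token into a few boolean features. */
final class WordFeatureSketch {

    // Tiny stand-in lists taken from the examples on the slide.
    static final Set<String> HONORIFICS = new HashSet<>(Arrays.asList("Mr", "Mrs", "Dr", "Sen"));
    static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList("the", "of", "their"));

    static Map<String, Boolean> features(String token) {
        Map<String, Boolean> f = new LinkedHashMap<>();
        boolean hasLetter = token.chars().anyMatch(Character::isLetter);
        f.put("is-capitalized", !token.isEmpty() && Character.isUpperCase(token.charAt(0)));
        f.put("is-all-caps", hasLetter && token.equals(token.toUpperCase(Locale.ROOT)));
        f.put("all-lowercase", hasLetter && token.equals(token.toLowerCase(Locale.ROOT)));
        f.put("contains-digit", token.chars().anyMatch(Character::isDigit));
        // A trailing period is ignored so that "Dr." matches the list entry "Dr".
        f.put("in-honorific-list", HONORIFICS.contains(token.replace(".", "")));
        f.put("in-stopword-list", STOPWORDS.contains(token.toLowerCase(Locale.ROOT)));
        return f;
    }

    public static void main(String[] args) {
        for (String token : new String[] {"Dr.", "IBM", "their", "B2B"}) {
            System.out.println(token + " -> " + features(token));
        }
    }
}
```

Each token becomes a sparse boolean vector; a classifier such as maximum entropy then learns which combinations indicate a person, organization or location.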

Machine learning experiments at ÚI SAV
• Extraction using simple regular expressions
– First_name Surname
– Ing. First_name Surname
– Best regards, Name …
– Street NUMBER, POSTAL_CODE City_name
– Anything capitalized (type-less entity)
• These methods work, but not always
– A human cannot define the rules well
– Given training data, ML can tell which pattern works when
• Training data from user interaction
– Delete, annotate, change type
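
The simple patterns above could be written as Java regular expressions roughly as follows. This is a hedged sketch, not Ontea's actual rule set: the class name `RegexExtractSketch`, the annotation keys and the assumed "NNN NN" postal-code format are illustrative choices.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative sketch: pattern-based extraction producing key-value annotations. */
final class RegexExtractSketch {

    static final Map<String, Pattern> PATTERNS = new LinkedHashMap<>();
    static {
        // "Ing. First_name Surname": academic title followed by two capitalized words
        PATTERNS.put("person", Pattern.compile("\\bIng\\.\\s+(\\p{Lu}\\p{L}+\\s+\\p{Lu}\\p{L}+)"));
        // "Best regards, Name": capitalized word after an email sign-off
        PATTERNS.put("sender", Pattern.compile("Best regards,\\s+(\\p{Lu}\\p{L}+)"));
        // "Street NUMBER, POSTAL_CODE City": assumes a postal code written as "NNN NN"
        PATTERNS.put("address",
                Pattern.compile("(\\p{Lu}\\p{L}+\\s+\\d+,\\s+\\d{3}\\s?\\d{2}\\s+\\p{Lu}\\p{L}+)"));
    }

    /** Applies every pattern to the text and returns "key => value" annotations. */
    static List<String> extract(String text) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Pattern> p : PATTERNS.entrySet()) {
            Matcher m = p.getValue().matcher(text);
            while (m.find()) {
                out.add(p.getKey() + " => " + m.group(1));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String email = "Contact Ing. Jan Novak at Dubravska 9, 845 07 Bratislava.\nBest regards, Peter";
        for (String annotation : extract(email)) {
            System.out.println(annotation);
        }
    }
}
```

Rules like these are precise on the cases they were written for and miss everything else, which is the motivation on this slide for learning from user corrections which pattern works when.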


Annotowatch

Š. Dlugolinský, P. Krammer, M. Ciglan, M. Laclavík

MSM 2013 Challenge

http://oak.dcs.shef.ac.uk/msm2013/challenge.html

Named Entity Recognition (NER) tools used

1. ANNIE (GATE)
2. Apache OpenNLP
3. Illinois NER
4. Illinois Wikifier
5. LingPipe
6. Open Calais
7. Stanford NER
8. WikiMiner
9. Miscinator*

Most of these tools are intended for relatively long, news-like texts rather than for microposts.

* Miscinator is our specialized tool designed to detect entities of the MISC type as defined in the MSM'13 challenge, that is, entertainment/award events, sports events, movies, TV shows, political events or programming languages; it uses Google Sets.


[Figure: bar chart per tool (Annie, Apache OpenNLP, Illinois NER, Illinois Wikifier, LingPipe, Open Calais, Stanford NER, Wikiminer); series PS, RS, F1S, PL, RL, F1L, PA, RA, F1A; value axis 0.00 to 1.00]

Average Performance of all tools*

* on MSM’13 training dataset v1.5

[Figure: P, R and F1 per entity type (LOC, MISC, ORG, PER); value axis 0.00 to 1.00]

Some tools are better suited to particular entity types, as the differing performance on LOC and MISC shows.


[Figure: P, R and F1 per entity type (LOC, MISC, ORG, PER); value axis 0.00 to 1.00]

Different tools produce diverse results; when combined, they achieve higher recall than the best individual tool.
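
The recall argument can be made concrete with a small Java sketch that scores two tools separately and then their union. The annotation strings, the two toy tool outputs and the class name `CombineToolsSketch` are made up for illustration; they are not MSM'13 data or Annotowatch code.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Illustrative sketch: precision, recall and F1 for single tools and for their union. */
final class CombineToolsSketch {

    /** Returns {precision, recall, F1} of the predicted annotations against the gold set. */
    static double[] prf(Set<String> predicted, Set<String> gold) {
        Set<String> hits = new HashSet<>(predicted);
        hits.retainAll(gold);
        double p = predicted.isEmpty() ? 0.0 : (double) hits.size() / predicted.size();
        double r = gold.isEmpty() ? 0.0 : (double) hits.size() / gold.size();
        double f1 = (p + r == 0.0) ? 0.0 : 2 * p * r / (p + r);
        return new double[] {p, r, f1};
    }

    /** Union of the annotations produced by two tools. */
    static Set<String> union(Set<String> a, Set<String> b) {
        Set<String> all = new HashSet<>(a);
        all.addAll(b);
        return all;
    }

    public static void main(String[] args) {
        // Toy data: annotations are encoded as "text/TYPE".
        Set<String> gold = new HashSet<>(Arrays.asList("Bratislava/LOC", "SAV/ORG", "Oscars/MISC", "Watson/PER"));
        Set<String> toolA = new HashSet<>(Arrays.asList("Bratislava/LOC", "SAV/ORG"));
        Set<String> toolB = new HashSet<>(Arrays.asList("Oscars/MISC", "SAV/ORG", "Moon/LOC"));

        System.out.println("A:     " + Arrays.toString(prf(toolA, gold)));
        System.out.println("B:     " + Arrays.toString(prf(toolB, gold)));
        System.out.println("A u B: " + Arrays.toString(prf(union(toolA, toolB), gold)));
    }
}
```

In this toy case each tool alone recalls 2 of the 4 gold annotations, while the union recalls 3. The union also inherits every tool's false positives, which is why a learned classifier is then needed to filter the combined candidates.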

Features for machine learning
– Example of feature vector computation for a MISC annotation:

[Figure: example feature vector for a MISC annotation]

Sample part of the generated pruned tree:
...
... IllinoisNER.MISC.AScore.aiir <= 0.7273
... | ApacheOpenNLP.ORG.AScore.aiir <= 0.2059
... | | Wikiminer.MISC.AScore.ail <= 16
... | | | Ann.type = LOC
... | | | | LingPipe.LOC.AScore.aiir <= 0.5882: LOC (21.0/1.0)
... | | | | LingPipe.LOC.AScore.aiir > 0.5882: NULL (371.0/11.0)
... | | | Ann.type = MISC
... | | | | Wikiminer.MISC.AScore.aiir <= 0.5172
... | | | | | IllinoisWikifier.MISC.AScore.aiia <= 0.5: MISC (22.0)
... | | | | | IllinoisWikifier.MISC.AScore.aiia > 0.5: NULL (95.0/5.0)
... | | | | Wikiminer.MISC.AScore.aiir > 0.5172: NULL (682.0/12.0)
... | | | Ann.type = NP: NULL (7624.0/83.0)
... | | | Ann.type = ORG


Annotations found in a sample tweet by all tools: "2,000 fetuses found hidden at Thai Buddhist temple _URL_ via _Mention_"


Annotowatch

Our solution, Annotowatch, ranked among the top 6 of the 17 teams that competed in the MSM 2013 Challenge.

Conclusion
• Semantic networks built from structured and unstructured data
– They have interesting properties
– They allow graph algorithms and infrastructure to be optimized
• Semantic search in semantic networks
– The user searches, interacts and corrects, and thereby generates a training set
– Machine learning techniques improve both the network model built from unstructured data and the search itself

References

• [Ulanoff] Lance Ulanoff: Google Knowledge Graph Could Change Search Forever, 2012, http://mashable.com/2012/02/13/google-knowledge-graph-change-search/
• [facebook13] Sean Gallagher: Knowing the score: How Facebook's Graph Search knows what you want, 2013, http://arstechnica.com/information-technology/2013/03/knowing-the-score-how-facebooks-graph-search-knows-what-you-want/
• [Perrone11] Michael Perrone: What is Watson – An Overview, 2011, http://static.usenix.org/event/lisa11/tech/slides/perrone.pdf
• [WatsonJr] Tony Pearson: IBM Watson - How to build your own "Watson Jr." in your basement, 2012, https://www.ibm.com/developerworks/mydeveloperworks/blogs/InsideSystemStorage/entry/ibm_watson_how_to_build_your_own_watson_jr_in_your_basement7?lang=en
• [OpenNLP] OpenNLP: http://www.slideshare.net/gagan1667/opennlp-demo
• [TamingText] Ingersoll, G., Morton, T., & Farris, L. (2012). Taming Text: How to find, organize and manipulate it.
• [Zaragoza] Hugo Zaragoza: Machine Learning and Information Retrieval, ESSIR 2009 Lecture
• [Nigam] Kamal Nigam: Generative Models for Text Classification and Information Extraction, http://www.cs.cmu.edu/~knigam/15-505/ie-lecture.ppt


References (continued)

• [SemSets] CIGLAN, Marek - NØRVÅG, Kjetil - HLUCHÝ, Ladislav. The SemSets model for ad-hoc semantic list search. In WWW'12 Proceedings of the 21st International Conference on World Wide Web. New York : ACM, 2012, p. 131-140. ISBN 978-1-4503-1229-5. SCOPUS. http://www2012.wwwconference.org/proceedings/proceedings/p131.pdf
• [gSemSearch] LACLAVÍK, Michal - DLUGOLINSKÝ, Štefan - ŠELENG, Martin - CIGLAN, Marek - HLUCHÝ, Ladislav. Emails as graph: relation discovery in email archive. In WWW'12 Companion Proceedings of the 21st International Conference companion on World Wide Web. New York : ACM, 2012, p. 841-846. ISBN 978-1-4503-1230-1. SCOPUS. http://www2012.wwwconference.org/proceedings/companion/p841.pdf
• [gBench] CIGLAN, Marek - AVERBUCH, Alex - HLUCHÝ, Ladislav. Benchmarking traversal operations over graph databases. In 2012 IEEE 28th International Conference on Data Engineering Workshops : proceedings. Los Alamitos : IEEE Computer Society, 2012, p. 186-189. ISBN 978-1-4673-1640-8. SCOPUS
• [ontea_email] LACLAVÍK, Michal - DLUGOLINSKÝ, Štefan - ŠELENG, Martin - KVASSAY, Marcel - GATIAL, Emil - BALOGH, Zoltán - HLUCHÝ, Ladislav. Email analysis and information extraction for enterprise benefit. In Computing and Informatics, 2011, vol. 30, no. 1, p. 57-87. (0.356 - IF2010). ISSN 0232-0274.
• [uiWeb] DLUGOLINSKÝ, Štefan - ŠELENG, Martin - LACLAVÍK, Michal - HLUCHÝ, Ladislav. Distributed Web-scale Infrastructure for Crawling, Indexing and Search with Semantic Support. In Computer Science Journal, 13 (4)

