jist2015-data challenge
![Page 1: JIST2015-data challenge](https://reader036.vdocuments.site/reader036/viewer/2022082123/58edf0b81a28ab49698b464f/html5/thumbnails/1.jpg)
An Ensemble Approach for Entity Type Prediction over Linked Data
Guangyuan Piao, John G. Breslin
Insight Centre for Data Analytics @NUI Galway, Ireland
Unit for Social Software
The Data Challenge at the 5th Joint International Semantic Technology Conference
Yichang, China, 11/11/2015
![Page 2: JIST2015-data challenge](https://reader036.vdocuments.site/reader036/viewer/2022082123/58edf0b81a28ab49698b464f/html5/thumbnails/2.jpg)
Contents
• Introduction of the Data Challenge
• Overall Approach
• Results
![Page 3: JIST2015-data challenge](https://reader036.vdocuments.site/reader036/viewer/2022082123/58edf0b81a28ab49698b464f/html5/thumbnails/3.jpg)
Introduction of the Data Challenge
• The main task of the challenge is to predict the labels of entities/resources in Zhishi.me [1].
• 1,897 entity URLs are provided, 1,397 of them with label information.
• Information related to the entities:
  • abstracts of entities
  • infobox properties
  • external links
  • related pages
• 13 participating teams
[1] http://zhishi.apexlab.org/
![Page 4: JIST2015-data challenge](https://reader036.vdocuments.site/reader036/viewer/2022082123/58edf0b81a28ab49698b464f/html5/thumbnails/4.jpg)
Overall Approach
• Features for predicting entity types:
  1. all distinct properties of entities in the dataset
  2. semantic similarities between the entity and all labels (e.g., insect, novel)
  3. a bag of Named Entities (NEs) created from all abstracts of entities in the dataset
• Feature selection:
  • in total, there were 1,888 features
  • irrelevant features were filtered out using the GainRatioAttributeEval method in Weka [1] (1,888 → 458 features)
• Prediction strategy:
  • Random Forest as the classification method (using 100 trees)
[1] http://www.cs.waikato.ac.nz/ml/weka/
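The feature-selection step above ranks features by gain ratio, that is, the information gain of a feature about the label divided by the entropy of the feature itself. The following is a minimal pure-Python sketch of that idea, not Weka's implementation; the toy feature names, the labels, and the `threshold` parameter are assumptions made for illustration only.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in Counter(values).values())

def gain_ratio(feature, labels):
    """Information gain of `feature` about `labels`, divided by the feature's own entropy."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    split_info = entropy(feature)
    # A constant feature has zero split information; treat it as uninformative.
    return info_gain / split_info if split_info > 0 else 0.0

def select_features(columns, labels, threshold=0.0):
    """Keep the names of feature columns whose gain ratio exceeds `threshold` (assumed cutoff)."""
    return [name for name, col in columns.items() if gain_ratio(col, labels) > threshold]

# Toy data: hypothetical labels and binary features, not taken from the challenge dataset.
labels = ["insect", "insect", "novel", "novel"]
columns = {
    "has_author": [0, 0, 1, 1],  # separates the two labels perfectly
    "has_image": [1, 0, 1, 0],   # carries no information about the label
    "has_name": [1, 1, 1, 1],    # constant feature
}
selected = select_features(columns, labels)
print(selected)
```

In this toy case only the feature that separates the labels survives the filter; on the real data the same kind of ranking reduced 1,888 features to 458.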
![Page 5: JIST2015-data challenge](https://reader036.vdocuments.site/reader036/viewer/2022082123/58edf0b81a28ab49698b464f/html5/thumbnails/5.jpg)
Overall Approach - Features
1. all distinct properties of entities in the dataset
  • the value of a property feature is 1 if the entity has the property, and 0 otherwise
2. semantic similarities between the entity and all labels (e.g., insect, novel)
  • RESIM(e_i, e_j) [1]: a measure of the semantic similarity between two entities in the context of a Linked Data graph
  • |l_u|: the total number of entities with label l_u
[1] Piao et al., "Computing the Semantic Similarity of Resources in DBpedia for Recommendation Purposes", JIST 2015
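The two feature groups above can be sketched as follows. This is an illustration under stated assumptions: the entity-to-label similarity is assumed to be the sum of pairwise similarities to the entities of a label divided by |l_u|, and `stand_in_similarity` is a Jaccard placeholder, not the RESIM measure. All entity names and properties are hypothetical.

```python
def property_features(entity_props, all_props):
    """Binary indicators: 1 if the entity has the property, 0 otherwise."""
    return [1 if p in entity_props else 0 for p in all_props]

def stand_in_similarity(props_a, props_b):
    """Jaccard overlap of two property sets; a placeholder, NOT the RESIM measure."""
    union = props_a | props_b
    return len(props_a & props_b) / len(union) if union else 0.0

def label_similarity(entity_props, member_props, sim=stand_in_similarity):
    """Assumed aggregation: summed similarity to a label's entities, divided by |l_u|."""
    if not member_props:
        return 0.0
    return sum(sim(entity_props, m) for m in member_props) / len(member_props)

# Hypothetical toy entities and their properties.
props = {
    "e1": {"author", "publisher"},
    "e2": {"author", "isbn"},
    "e3": {"wingspan", "habitat"},
    "q": {"author", "publisher", "isbn"},  # entity whose type is to be predicted
}
labelled = {"novel": ["e1", "e2"], "insect": ["e3"]}

all_props = sorted(set().union(*props.values()))
binary = property_features(props["q"], all_props)
sims = {
    label: label_similarity(props["q"], [props[m] for m in members])
    for label, members in labelled.items()
}
# Final feature vector: property indicators followed by one similarity per label.
vector = binary + [sims[label] for label in sorted(labelled)]
print(vector)
```

Passing a different `sim` function is enough to swap the placeholder for a graph-based measure without touching the rest of the sketch.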
![Page 6: JIST2015-data challenge](https://reader036.vdocuments.site/reader036/viewer/2022082123/58edf0b81a28ab49698b464f/html5/thumbnails/6.jpg)
Overall Approach - Features
3. a bag of Named Entities (NEs) created from all abstracts of entities in the dataset
  • NEs that appear near the beginning of an abstract and appear frequently in it receive higher weights
  • pipeline: abstract → Stanford NER [1] → segmented NEs
  • ne_i: an NE that appears more than 10 times
  • pos(ne_i, a): the position of ne_i in abstract a
  • n: the number of NEs in a
[1] http://nlp.stanford.edu/software/CRF-NER.shtml
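One way to turn the quantities above into a weight is sketched below. The formula is hypothetical and chosen only to illustrate the stated behaviour (earlier and more frequent NEs score higher): each mention at 1-based position pos among the n NEs of an abstract contributes (n - pos + 1) / n, and contributions are summed per NE. The NE strings and the `vocabulary` set are likewise assumptions.

```python
from collections import defaultdict

def ne_weights(nes, vocabulary):
    """Weight each vocabulary NE found in one abstract.

    HYPOTHETICAL formula: a mention at 1-based position `pos` among `n` NEs
    contributes (n - pos + 1) / n; contributions are summed per NE, so early
    and repeated mentions both raise the weight.
    """
    n = len(nes)
    weights = defaultdict(float)
    for pos, ne in enumerate(nes, start=1):
        if ne in vocabulary:
            weights[ne] += (n - pos + 1) / n
    return dict(weights)

# NEs of one abstract in order of appearance (hypothetical NER output).
nes = ["Galway", "Ireland", "Galway", "Dublin"]
# NEs assumed to have passed the "more than 10 times" frequency filter.
vocabulary = {"Galway", "Ireland"}

weights = ne_weights(nes, vocabulary)
print(weights)
```

NEs outside the vocabulary still count towards n and towards positions, so filtering rare NEs does not shift the weights of the remaining ones.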
![Page 7: JIST2015-data challenge](https://reader036.vdocuments.site/reader036/viewer/2022082123/58edf0b81a28ab49698b464f/html5/thumbnails/7.jpg)
Results
• Performance on the provided training set (10-fold cross-validation)

| Classifier | Precision | Recall | F-score |
|---|---|---|---|
| Decision Tree | 0.942 | 0.942 | 0.942 |
| SVM | 0.920 | 0.910 | 0.912 |
| Random Forest | 0.970 | 0.969 | 0.969 |
| Stacking | 0.949 | 0.948 | 0.948 |

• Performance on the provided test set (4th among 13 teams)

| Team | Precision | Recall | F-score |
|---|---|---|---|
| 4PKUICL | 0.985439 | 0.987461 | 0.986449 |
| 1CBrain | 0.983633 | 0.985069 | 0.98435 |
| 3FRDC_ML | 0.978096 | 0.977348 | 0.977722 |
| 6pgy | 0.977442 | 0.977247 | 0.977344 |
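Some of the reported F-scores are not the harmonic mean of the precision and recall listed next to them (for SVM, 0.920 and 0.910 would give about 0.915 rather than 0.912). That pattern is what one would see if each figure is a per-class score averaged with class-frequency weights. The sketch below computes such weighted averages. That the results were produced this way is an assumption, and the toy labels are hypothetical.

```python
from collections import Counter

def weighted_prf(y_true, y_pred):
    """Per-class precision, recall and F1, averaged with weights proportional to class support."""
    support = Counter(y_true)
    total = len(y_true)
    precision = recall = f_score = 0.0
    for label, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)
        p_c = tp / predicted if predicted else 0.0
        r_c = tp / count
        f_c = 2 * p_c * r_c / (p_c + r_c) if p_c + r_c else 0.0
        weight = count / total
        precision += weight * p_c
        recall += weight * r_c
        f_score += weight * f_c
    return precision, recall, f_score

# Hypothetical gold labels and predictions for four entities.
y_true = ["insect", "insect", "insect", "novel"]
y_pred = ["insect", "insect", "novel", "novel"]

p, r, f = weighted_prf(y_true, y_pred)
harmonic = 2 * p * r / (p + r)
print(round(p, 3), round(r, 3), round(f, 3), round(harmonic, 3))
```

Here the weighted F-score and the harmonic mean of the weighted precision and recall differ, which mirrors the small gaps visible in the tables.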
![Page 8: JIST2015-data challenge](https://reader036.vdocuments.site/reader036/viewer/2022082123/58edf0b81a28ab49698b464f/html5/thumbnails/8.jpg)