jist2015-data challenge

An Ensemble Approach for Entity Type Prediction over Linked Data

Guangyuan Piao, John G. Breslin

Insight Centre for Data Analytics @NUI Galway, Ireland

Unit for Social Software

The Data Challenge at the 5th Joint International Semantic Technology Conference

Yichang, China, 11/11/2015

Contents

• Introduction of the Data Challenge

• Overall Approach

• Results

Introduction of the Data Challenge

• The main task of the challenge is to predict the labels of entities/resources in Zhishi.me¹

• 1,897 entity URLs are provided, and 1,397 of them come with label information. Information related to the entities:
  • abstracts of entities
  • infobox properties
  • external links
  • related pages

• 13 teams participated

1. http://zhishi.apexlab.org/

Overall Approach

• Features for predicting entity types:
  1. all distinct properties of entities in the dataset
  2. semantic similarities between the entity and all labels (e.g., insect, novel)
  3. a bag of Named Entities (NEs) created from all abstracts of entities in the dataset

• Feature selection:
  • in total, there were 1,888 features
  • irrelevant features were filtered out using the GainRatioAttributeEval method in Weka¹ (1,888 → 458 features)

• Prediction strategy:
  • Random Forest as the classification method (using 100 trees)

1. http://www.cs.waikato.ac.nz/ml/weka/
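The binary property features and the gain-ratio filter can be illustrated with a small sketch. This is not Weka's GainRatioAttributeEval itself, only a plain-Python version of the textbook gain ratio (information gain divided by the entropy of the feature). The toy entities, the property names, and the 0.5 cut-off are hypothetical; the slides do not state how the 458 features were chosen from the ranking.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    total = len(values)
    counts = Counter(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def gain_ratio(feature_column, labels):
    """Gain ratio of one discrete feature with respect to the class labels."""
    split_info = entropy(feature_column)
    if split_info == 0.0:  # a constant feature carries no information
        return 0.0
    total = len(labels)
    # Conditional entropy H(label | feature), weighted by partition size.
    conditional = 0.0
    for value in set(feature_column):
        subset = [lab for f, lab in zip(feature_column, labels) if f == value]
        conditional += (len(subset) / total) * entropy(subset)
    info_gain = entropy(labels) - conditional
    return info_gain / split_info


def binary_property_matrix(entity_properties, all_properties):
    """Encode each entity as a 0/1 vector over the distinct properties."""
    return [[1 if p in props else 0 for p in all_properties]
            for props in entity_properties]


# Hypothetical toy data: property sets and labels for four entities.
entity_properties = [{"wingspan", "habitat"}, {"wingspan"},
                     {"author", "isbn"}, {"author"}]
labels = ["insect", "insect", "novel", "novel"]
all_properties = sorted(set().union(*entity_properties))

matrix = binary_property_matrix(entity_properties, all_properties)
scores = {
    prop: gain_ratio([row[i] for row in matrix], labels)
    for i, prop in enumerate(all_properties)
}
# Keep features whose score exceeds a hypothetical threshold.
selected = [p for p, s in scores.items() if s > 0.5]
```

In this toy case the properties that split the two labels perfectly ("author", "wingspan") score 1.0 and are kept, while properties held by a single entity score lower and are dropped.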

Overall Approach - Features

1. All distinct properties of entities in the dataset
  • the value of the property is 1 if the entity has the property, 0 if not

2. Semantic similarities between the entity and all labels (e.g., insect, novel)
  • RESIM(e_i, e_j)¹: a measure for calculating the semantic similarity between two entities in the context of a Linked Data graph
  • |l_u|: the total # of entities of label l_u

1. Computing the Semantic Similarity of Resources in DBpedia for Recommendation Purposes, Piao et al., JIST2015
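The formula combining RESIM and |l_u| did not survive on this slide. One plausible reading, stated here as an assumption, is that the feature for a label is the average pairwise similarity between the entity and the known entities of that label. The sketch below uses a Jaccard overlap of property sets as a stand-in similarity; it is not RESIM, and the toy data are hypothetical.

```python
def label_similarity(entity, label_members, sim):
    """Average similarity between `entity` and the known entities of one label.

    Assumption: sum of pairwise similarities divided by the label size.
    """
    if not label_members:
        return 0.0
    return sum(sim(entity, other) for other in label_members) / len(label_members)


def jaccard(a, b):
    """Stand-in similarity (NOT RESIM): Jaccard overlap of two property sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical toy data: each entity is represented by a set of properties.
labelled = {
    "insect": [{"wingspan", "habitat"}, {"wingspan", "diet"}],
    "novel": [{"author", "isbn"}],
}
query = {"wingspan", "habitat", "diet"}

# One feature per label, as in feature group 2.
features = {label: label_similarity(query, members, jaccard)
            for label, members in labelled.items()}
```

Passing the similarity function as a parameter keeps the averaging step independent of the measure, so a real RESIM implementation could be substituted without changing the rest.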

Overall Approach - Features

3. A bag of Named Entities (NEs) created from all abstracts of entities in the dataset
  • entities that appear at the beginning of an abstract and appear frequently in the abstract can have higher weights

abstract → Stanford NER¹ → segmented NEs

  • ne_i: a NE that appears more than 10 times
  • pos(ne_i, a): the position of ne_i in a
  • n: the # of NEs in a

1. http://nlp.stanford.edu/software/CRF-NER.shtml
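The exact weighting formula is not recoverable from this slide. The sketch below shows one hypothetical weighting that matches the stated intent: each occurrence of a NE at position pos among the n NEs of an abstract contributes (n - pos + 1) / n, and contributions accumulate over repeated occurrences. The function name, the formula, and the toy data are assumptions, not the authors' definition.

```python
from collections import defaultdict


def ne_weights(named_entities, vocabulary):
    """Weight each vocabulary NE in one abstract by position and frequency.

    Hypothetical scheme: an occurrence at 1-based position `position`
    among n NEs contributes (n - position + 1) / n.
    """
    n = len(named_entities)
    weights = defaultdict(float)
    for position, ne in enumerate(named_entities, start=1):
        if ne in vocabulary:
            # Earlier occurrences contribute more; repeated ones accumulate.
            weights[ne] += (n - position + 1) / n
    return dict(weights)


# Hypothetical NE sequence from one abstract, in order of appearance.
abstract_nes = ["Galway", "Ireland", "Galway", "Atlantic"]
# Hypothetical vocabulary: NEs assumed to pass the "more than 10 times" cut-off.
vocabulary = {"Galway", "Ireland"}

weights = ne_weights(abstract_nes, vocabulary)
```

Here "Galway" appears first and twice, so it receives the highest weight, while "Atlantic" is ignored because it is outside the vocabulary.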

Results

• Performance on the provided training set (10-fold cross-validation)

Classifier      Precision   Recall   F-score
Decision Tree   0.942       0.942    0.942
SVM             0.920       0.910    0.912
Random Forest   0.970       0.969    0.969
Stacking        0.949       0.948    0.948

• Performance on the provided test set (4th among 13 teams)

Team       Precision   Recall     F-score
4PKUICL    0.985439    0.987461   0.986449
1CBrain    0.983633    0.985069   0.98435
3FRDC_ML   0.978096    0.977348   0.977722
6pgy       0.977442    0.977247   0.977344
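As a quick arithmetic check, the test-set F-scores appear consistent with the harmonic mean of the reported precision and recall. The sketch below recomputes them from the table values; the team identifiers are copied as they appear in the table, and the tolerance is an arbitrary choice.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-score) copied from the test-set table.
test_results = {
    "4PKUICL": (0.985439, 0.987461, 0.986449),
    "1CBrain": (0.983633, 0.985069, 0.98435),
    "3FRDC_ML": (0.978096, 0.977348, 0.977722),
    "6pgy": (0.977442, 0.977247, 0.977344),
}

# Largest absolute difference between recomputed and reported F-scores.
max_gap = max(abs(f_score(p, r) - f) for p, r, f in test_results.values())
```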
