Mining temporal footprints from Wikipedia
DESCRIPTION
Discovery of temporal information is key for organising knowledge, and therefore the task of extracting and representing temporal information from texts has received increasing interest. In this paper we focus on the discovery of temporal footprints from encyclopaedic descriptions. Temporal footprints are time-line periods that are associated with the existence of specific concepts. Our approach relies on the extraction of date mentions and prediction of lower and upper boundaries that define temporal footprints. We report on several experiments on persons’ pages from Wikipedia in order to illustrate the feasibility of the proposed methods.
TRANSCRIPT
[email protected] of Computer Science
Dublin, 23/08/2014
presentation 1st AHA! Workshop, COLING 2014
Mining temporal footprints from Wikipedia
Michele Filannino, Goran Nenadic
introduction
■ Temporal information is crucial for organising
structured and unstructured data
■ Several temporal information extraction (TIE)
systems are now available
● thanks to the TempEval challenge series
URL: http://www.cs.man.ac.uk/~filannim/mantime.html
ManTIME
Test with long text
Immanuel Kant, Paul Guyer, and Allen W Wood. 1998. Critique of pure reason. Cambridge University Press.
temporal footprint
A temporal footprint is a
continuous period on the time-
line that temporally defines the
existence of a particular concept.
problem
■ input: textual description of a concept
■ output: prediction of a temporal
interval
Can we predict temporal footprints from
encyclopaedic descriptions of concepts?
Examples of temporal footprints
Web
Cellphone
Computer
Car
Richard Feynman
Bicycle
Carl Friedrich Gauss
French revolution
Age of Enlightenment
Galileo Galilei
Leonardo Da Vinci
Christopher Columbus
Renaissance
Arming sword
High Middle Ages
Genghis Khan
(Figure: the footprints of the items above plotted on a time-line from 1000 to 2000; legend: object, person, historical period.)
methodology
1. date mention extraction
2. outlier filtering
3. normal distribution fitting
4. prediction
date mentions extraction
(Figure: histogram of the extracted date mentions; x-axis: time in years, 1360 to 1810; y-axis: freq, 0.000 to 0.050.)
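As a rough illustration of the date mention extraction step, the sketch below pulls four-digit years out of a text with a regular expression and turns them into the relative frequencies plotted in the histogram. The pattern, the function names and the sample sentence are assumptions made for this example; the slides do not show the regular expression actually used.

```python
import re
from collections import Counter

# Hypothetical pattern: standalone four-digit years from 1000 to 2099.
YEAR_PATTERN = re.compile(r"\b(1\d{3}|20\d{2})\b")


def extract_years(text):
    """Return every four-digit year mention found in the text, in order."""
    return [int(match) for match in YEAR_PATTERN.findall(text)]


def relative_frequencies(years):
    """Map each year to its share of all extracted mentions."""
    counts = Counter(years)
    total = sum(counts.values())
    return {year: count / total for year, count in sorted(counts.items())}


# Illustrative sentence, not taken from Wikipedia.
sample = ("Born in 1564, he published in 1610 and in 1632, "
          "and died in 1642. The 1610 book was widely read.")

years = extract_years(sample)
print(years)
print(relative_frequencies(years))
```

A regular expression of this kind is cheap but blind to context: any four-digit number in the range counts as a date mention, which is one reason a later filtering step is useful.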
outlier filtering
The gamma (γ) parameter controls the boundaries of the outlier region.
(Figure: two histograms of date mentions; x-axis: time in years, 1360 to 1810; y-axis: freq, 0.000 to 0.050; the γ parameter is marked.)
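One way to picture the role of γ is a filter that keeps only the years lying within γ robust deviations of the centre of the distribution. The sketch below assumes a median absolute deviation (MAD) rule; the slides only state that γ controls the outlier region's boundaries, so the exact rule, the default value of gamma and the function name are assumptions.

```python
from statistics import median


def filter_outliers(years, gamma=3.0):
    """Keep the years within gamma * MAD of the median (assumed rule)."""
    if not years:
        return []
    centre = median(years)
    # Median absolute deviation: a spread estimate that ignores extremes.
    mad = median([abs(year - centre) for year in years])
    return [year for year in years if abs(year - centre) <= gamma * mad]


# Illustrative date mentions: five close together and one far away.
mentions = [1564, 1589, 1610, 1632, 1642, 2014]
print(filter_outliers(mentions))
```

With this rule, a larger γ widens the accepted region and a smaller γ discards more mentions.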
normal distribution fitting
The alpha (α) and beta (β) parameters control the size and offset of the Gaussian bell.
(Figures: frequency of date mentions over time in years, 1360 to 1810, freq from 0.000 to 0.050; one plot illustrates the α parameter, the other the β parameter.)
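The fitting and prediction steps can be sketched as follows: fit a normal distribution to the filtered years, then read the footprint off as an interval around its mean. How α and β enter the boundaries is an assumption here (α scales the half-width in standard deviations, β shifts the centre by a fraction of the standard deviation); the slides only say that the two parameters control the size and offset of the bell.

```python
from statistics import mean, pstdev


def predict_footprint(years, alpha=1.0, beta=0.0):
    """Predict (lower, upper) years from a normal fit (assumed formula)."""
    mu = mean(years)
    sigma = pstdev(years)
    centre = mu + beta * sigma   # beta: offset of the bell (assumption)
    half_width = alpha * sigma   # alpha: size of the bell (assumption)
    return round(centre - half_width), round(centre + half_width)


# Illustrative filtered date mentions.
filtered = [1564, 1589, 1610, 1632, 1642]
print(predict_footprint(filtered))
print(predict_footprint(filtered, alpha=2.0))
print(predict_footprint(filtered, alpha=1.0, beta=0.5))
```

Under these assumptions, raising α widens the predicted interval symmetrically, while a positive β moves the whole interval later in time.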
Fatima De Carvalho. 1996. Histogrammes et indices de proximité en analyse données symboliques. Actes de l’école d’été sur l’analyse des données symboliques. LISE-CEREMADE, Université de Paris IX Dauphine, pages 101–127.
error measure
(Figure: the gold interval and the predicted interval on a time-line, with their overlap and their union marked.)
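The figure suggests an error based on how much of the union of the two intervals is covered by their overlap. The sketch below assumes error = 1 - overlap / union, with interval lengths measured as plain year differences; the slides do not spell the formula out, but this reading reproduces the three per-person errors reported later in the deck. The function name and the tuple representation are choices made for this example.

```python
def distance_error(gold, prediction):
    """Return 1 - overlap / union for two (start, end) year intervals."""
    gold_start, gold_end = gold
    pred_start, pred_end = prediction
    # Length of the shared part; zero when the intervals are disjoint.
    overlap = max(0, min(gold_end, pred_end) - max(gold_start, pred_start))
    # Total length covered by either interval.
    union = (gold_end - gold_start) + (pred_end - pred_start) - overlap
    if union == 0:
        # Both intervals are single points: perfect match or total miss.
        return 0.0 if gold == prediction else 1.0
    return 1.0 - overlap / union


print(round(distance_error((1564, 1642), (1556, 1654)), 3))  # 0.204
print(round(distance_error((1951, 2014), (1953, 2006)), 3))  # 0.159
print(round(distance_error((1451, 1506), (1366, 2057)), 2))  # 0.92
```

An error of 0 means the prediction coincides with the gold interval, and an error of 1 means the two intervals do not overlap at all.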
strategies
A. RegEx
B. RegEx + Filtering
C. RegEx + Filtering + Gaussian fitting
D. HeidelTime + Filtering + Gaussian fitting
evaluation
■ subject: people
■ lived from 1000 AD to 2014
● text from Wikipedia web pages
● year of birth and death from DBpedia
■ 228,824 people collected
■ simple definition of temporal footprint
● birth and death dates
People per textual length
(Figure: histogram; x-axis: #words, 0 to 3750; y-axis: #people, 0 to 500.)
aggregate results
Strategy                                     Mean Distance Error   Standard Deviation
RegEx                                        0.2636                0.3409
RegEx + Filtering                            0.2596                0.3090
RegEx + Filtering + Gaussian fitting         0.3503                0.2430
HeidelTime + Filtering + Gaussian fitting    0.5980                0.2470
results
(Figure: MDE, from 0.0 to 1.0, against page length in #words, from 1112 to 27804, for the four strategies: RegEx; RegEx + Filtering; RegEx + Filtering + Gaussian fitting; HeidelTime + Filtering + Gaussian fitting.)
results
■ Galileo Galilei (1564-1642), prediction: 1556-1654, E: 0.204
results
■ Robin Williams (1951-2014), prediction: 1953-2006, E: 0.159
other types of temporal footprint?
■ Christopher Columbus will die in 2057?!
Prediction: 1366-2057 (actual: 1451-1506), E: 0.92
AHA!
physical existence vs. social coverage
■ Anne Frank’s footprint is shifted into the future
conclusions
■ How does the methodology behave on different
languages? And on different sources?
■ oracle-like side-effect behaviour:
• Apple Inc. will be closed down this year
• Stanford University will be closed down in 2029
■ Future work
• mixture of normal distributions
Thank you.