Extracting Linked Data from statistic spreadsheets. Tien-Duc Cao, Ioana Manolescu, Xavier Tannier. Semantic Big Data workshop, Chicago, May 19th, 2017


Page 1:

Extracting Linked Data from statistic spreadsheets

Tien-Duc Cao (tien-duc.cao@inria.fr), Ioana Manolescu, Xavier Tannier

Semantic Big Data workshop, Chicago, May 19th, 2017

Page 2:

Agenda

1. Context: data journalism and journalistic fact-checking

2. Research problem: extracting linked open data from spreadsheets

3. Approach

4. Results

5. Future work

Tien-Duc Cao, Ioana Manolescu, Xavier Tannier, "Extracting linked data from statistic spreadsheets", 19/05/2017

Page 3:

1. Fact-checking is a content management problem

Diagram components:
• Claim to be checked (text or data)
• Media content
• Media context
• Reference information sources 1, 2, …, n
• Human actors (journalists, experts, crowd workers)
• Verification tool (query, match, source search…)
• Analysis result: « True / rather true / rather false / false. See sources: http://dataref.com… »

Page 4:

1. Fact-checking is a content management problem

The same diagram as on the previous slide, with these additional labels:
• Claim extraction
• Social network analysis
• Reconciliation, reputation
• Source search / source selection
• Reference source construction, refinement, integration
• Reference information source n+1

Page 5:

1. Context

• Which data sources can help us fact-check a statistical claim from the media?
• E.g.: “The unemployment rate in France last year was 50%?”
• This work is part of the ContentCheck¹ project

¹ https://team.inria.fr/cedar/contentcheck/

Page 6:

2. Research problem: high-quality reference data

• National statistics institutes such as INSEE¹, France's economic and societal statistics institute, are often valuable data providers

¹ https://insee.fr/

Figure (source: http://abonnes.lemonde.fr/les-decodeurs/portfolio/2017/04/18/les-fractures-francaises-1-5-le-logement-les-raisons-de-la-crise_5112859_4355770.html), with series labeled: existing house price index, available revenue per head, rent index, consumer price index.

Page 7:

2. The road to high-quality data…

Unfortunately, most of the data published by INSEE looks like this (our text coloring):

Page 8:

2. The road to high-quality data…

Sometimes there is more than one table per sheet

Page 9:

3. Extraction approach

Image sources:
https://www.iconfinder.com/icons/7661/excel_microsoft_word_xls_icon#size=128
https://www.w3.org/RDF/icons/rdf_w3c.svg

Page 10:

3. Extraction approach

Page 11:

3. Approach: finding table boundaries

Page 12:

3. Extraction approach

Page 13:

3. Approach: table extractor

• Header cells mostly contain text
• They are positioned at:
  • the top (header rows) of the table
  • the left (header columns) of the table
• Having more than one header row/column indicates data aggregation
• Data cells mostly contain numeric values
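The two observations above (header cells are mostly text, data cells are mostly numeric) can be illustrated with a simple majority vote over the cells of a row. This is only a minimal sketch, not the authors' implementation: the function names `cell_kind` and `row_kind` and the tie-breaking rule are assumptions made for illustration.

```python
from typing import List, Union

Cell = Union[str, int, float, None]


def cell_kind(value: Cell) -> str:
    """Return 'empty', 'number' or 'text' for a single cell value."""
    if value is None or (isinstance(value, str) and not value.strip()):
        return "empty"
    # bool is a subclass of int in Python; treat it as text, not as a number.
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return "number"
    return "text"


def row_kind(row: List[Cell]) -> str:
    """Guess whether a row is a header row or a data row (hypothetical rule).

    A row with strictly more text cells than numeric cells is called a
    header row. Ties go to 'data', because a data row usually starts with
    one text cell that belongs to a header column.
    """
    kinds = [cell_kind(v) for v in row]
    texts = kinds.count("text")
    numbers = kinds.count("number")
    if texts == 0 and numbers == 0:
        return "empty"
    return "header" if texts > numbers else "data"


print(row_kind(["Region", "2015", "2016"]))  # a row made of labels
print(row_kind(["Paris", 10.5, 12]))         # one label, then values
```

Applying the same vote to columns instead of rows would give a first guess at the header columns on the left of the table.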


Page 14:

3. Approach: table extractor

1. We distinguish header rows/columns from data rows/columns using:
  • the data types of their cells (text, number, a special value indicating a missing value, null for an empty cell)
  • the formatting information of their cells: cell borders, cells belonging to a merged cell
  • the types of their neighboring rows/columns

2. Based on these, we identify the exact structure of each table
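As an illustration of step 2, the sketch below recovers a table's structure from cell data types alone and attaches to every data cell the path of headers above it and to its left. It is a simplified sketch under stated assumptions, not the system described in the talk: it ignores borders and neighbor-based rules, assumes a rectangular grid holding a single table, approximates merged header cells by propagating a label to the empty cells on its right, and uses a made-up set of missing-value markers (`MISSING_MARKERS`). All function names are hypothetical.

```python
MISSING_MARKERS = {"nd", "n.d.", "-", "..."}  # assumption: example markers only


def kind(value):
    """Classify a cell as 'empty', 'number', 'missing' or 'text'."""
    if value is None:
        return "empty"
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return "number"
    if isinstance(value, str):
        stripped = value.strip().lower()
        if not stripped:
            return "empty"
        if stripped in MISSING_MARKERS:
            return "missing"
    return "text"


def is_data_like(cells):
    """True if numbers/missing markers are present and at least as frequent as text."""
    kinds = [kind(c) for c in cells]
    data = sum(k in ("number", "missing") for k in kinds)
    return data > 0 and data >= kinds.count("text")


def count_leading_headers(lines):
    """Count leading rows (or columns) that do not look like data."""
    n = 0
    for line in lines:
        if is_data_like(line):
            break
        n += 1
    return n


def fill_merged(cells):
    """Propagate each header label to the empty cells on its right."""
    filled, last = [], None
    for c in cells:
        if kind(c) != "empty":
            last = c
        filled.append(last)
    return filled


def extract(grid):
    """Return (header_rows, header_cols, records) for a rectangular grid.

    Each record is (row_header_path, column_header_path, value).
    """
    hr = count_leading_headers(grid)
    hc = count_leading_headers(list(zip(*grid[hr:])))
    top = [fill_merged(row) for row in grid[:hr]]
    records = []
    for row in grid[hr:]:
        row_path = tuple(row[:hc])
        for j in range(hc, len(row)):
            col_path = tuple(header[j] for header in top)
            records.append((row_path, col_path, row[j]))
    return hr, hc, records


grid = [
    [None, "Men", None, "Women", None],
    [None, "2015", "2016", "2015", "2016"],
    ["Paris", 1.0, 2.0, 3.0, 4.0],
    ["Lyon", 5.0, "nd", 7.0, 8.0],
]
hr, hc, records = extract(grid)
print(hr, hc, records[0])
```

In this toy grid the years are stored as text. Header years stored as numbers would be mistaken for data by this type-only rule, which is one reason formatting information and neighboring rows/columns are useful additional signals.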


Page 15:

3. Conceptual data model

Page 16:

4. Results

• Collected 16011 Excel spreadsheets, extracted 74117 tables.
• Accuracy evaluation:
  • We randomly selected 100 Excel files, containing 2432 tables
  • We visually identified the header cells, data cells and header hierarchy, and then compared them with those obtained from our system.
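The comparison between manual annotations and system output could be automated along the following lines. This is a hedged sketch only: the slides do not say how accuracy was computed, so the table-level exact-match criterion, the dictionary layout (`header_cells`, `data_cells`, `hierarchy`) and the function names are all assumptions, and the toy data is made up for the example.

```python
def table_matches(gold, predicted):
    """True if the system output agrees with the manual annotation on
    header cells, data cells and header hierarchy (exact match)."""
    return all(
        gold[key] == predicted[key]
        for key in ("header_cells", "data_cells", "hierarchy")
    )


def accuracy(gold_tables, predicted_tables):
    """Share of tables whose extracted structure matches the annotation."""
    if len(gold_tables) != len(predicted_tables):
        raise ValueError("both lists must describe the same tables")
    if not gold_tables:
        raise ValueError("no tables to evaluate")
    correct = sum(
        table_matches(g, p) for g, p in zip(gold_tables, predicted_tables)
    )
    return correct / len(gold_tables)


# Toy annotations: cells are (row, column) coordinates, the hierarchy maps
# a header cell to its parent header cell.
gold = [
    {"header_cells": {(0, 1), (1, 1)}, "data_cells": {(2, 1)},
     "hierarchy": {(1, 1): (0, 1)}},
    {"header_cells": {(0, 0)}, "data_cells": {(1, 0)}, "hierarchy": {}},
]
predicted = [
    {"header_cells": {(0, 1), (1, 1)}, "data_cells": {(2, 1)},
     "hierarchy": {(1, 1): (0, 1)}},
    {"header_cells": set(), "data_cells": {(0, 0), (1, 0)}, "hierarchy": {}},
]
print(accuracy(gold, predicted))
```

Exact match per table is a strict criterion. A cell-level score would be more lenient, and the choice between the two is a design decision the slides leave open.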


Page 17:

4. Sample extracted RDF
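The RDF sample shown on this slide is not preserved in the transcript. To give an idea of what linked data for one data cell might look like, here is a minimal sketch that serializes a cell and its header labels as N-Triples. The namespace `http://example.org/stat/` and the property names (`inTable`, `rowHeader`, `colHeader`, `value`) are hypothetical placeholders, not the vocabulary used by the authors, and the table identifier is assumed to be safe to embed in an IRI.

```python
XSD_DOUBLE = "http://www.w3.org/2001/XMLSchema#double"
EX = "http://example.org/stat/"  # hypothetical namespace, for illustration only


def literal(text):
    """Serialize a Python string as an N-Triples string literal."""
    escaped = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )
    return f'"{escaped}"'


def cell_triples(table_id, row, col, row_headers, col_headers, value):
    """Return N-Triples lines describing one numeric data cell."""
    cell = f"<{EX}table/{table_id}/cell/{row}-{col}>"
    triples = [f"{cell} <{EX}inTable> <{EX}table/{table_id}> ."]
    for label in row_headers:
        triples.append(f"{cell} <{EX}rowHeader> {literal(label)} .")
    for label in col_headers:
        triples.append(f"{cell} <{EX}colHeader> {literal(label)} .")
    triples.append(f'{cell} <{EX}value> "{value}"^^<{XSD_DOUBLE}> .')
    return triples


for line in cell_triples("t1", 2, 1, ["Paris"], ["Men", "2015"], 1.5):
    print(line)
```

Keeping the header labels attached to each cell is what would later allow a query such as "value for Paris, men, 2015" to be answered over the extracted data.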

Page 18:

5. Future work

Diagram components (excerpt of the fact-checking diagram):
• Reference information sources 1, 2, …, n
• Verification tool (query, match, source search…)
• Source search / source selection
• Reference source construction, refinement, integration

Page 19:

Thanks / questions?

Excel files and extracted RDF files (10.5 GB, will expire on May 29th, 2017):
https://goo.gl/4Y5Dtv

Source code (no expiration date :)):
https://gitlab.inria.fr/cedar/insee-crawler
https://gitlab.inria.fr/cedar/excel-extractor