Data Curation @ SpazioDati - NEXA Lunch Seminar
![Page 1: Data Curation @ SpazioDati - NEXA Lunch Seminar](https://reader030.vdocuments.site/reader030/viewer/2022032623/55d07c64bb61eb97198b4583/html5/thumbnails/1.jpg)
Matteo Brunati @dagoneye
22/07/2015
Data Curation @SpazioDati
33° Nexa Lunch Seminar
http://nexa.polito.it/lunch-33
![Page 3: Data Curation @ SpazioDati - NEXA Lunch Seminar](https://reader030.vdocuments.site/reader030/viewer/2022032623/55d07c64bb61eb97198b4583/html5/thumbnails/3.jpg)
a lot of European projects
http://www.spaziodati.eu/en/#research
![Page 4: Data Curation @ SpazioDati - NEXA Lunch Seminar](https://reader030.vdocuments.site/reader030/viewer/2022032623/55d07c64bb61eb97198b4583/html5/thumbnails/4.jpg)
Data Curation?
https://www.google.com/search?q=data+curation&ie=utf-8&oe=utf-8
![Page 5: Data Curation @ SpazioDati - NEXA Lunch Seminar](https://reader030.vdocuments.site/reader030/viewer/2022032623/55d07c64bb61eb97198b4583/html5/thumbnails/5.jpg)
Data curation is the process of turning independently created data sources (structured and semi-structured data) into unified data sets ready for analytics, using domain experts to guide the process.
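As a minimal sketch of what "turning independently created data sources into unified data sets" can mean in practice, the snippet below merges two toy sources on a shared key. The records, field names and the choice of the VAT number as the join key are assumptions made for illustration; they are not taken from the talk.

```python
from collections import defaultdict

# Hypothetical records from two independently created sources.
REGISTRY = [{"vat": "IT 01234567890", "name": "ACME S.r.l."}]
WEB_CRAWL = [{"partita_iva": "01234567890", "website": "http://acme.example"}]


def normalize_vat(raw: str) -> str:
    """Keep only the digits so both sources share one key format."""
    return "".join(ch for ch in raw if ch.isdigit())


def unify(registry, web_crawl):
    """Merge records that share the same normalized VAT number."""
    unified = defaultdict(dict)
    for row in registry:
        unified[normalize_vat(row["vat"])]["name"] = row["name"]
    for row in web_crawl:
        unified[normalize_vat(row["partita_iva"])]["website"] = row["website"]
    return dict(unified)


unified = unify(REGISTRY, WEB_CRAWL)
```

In a real pipeline the normalization rules are where domain experts come in: they decide which fields identify the same entity and how conflicts are resolved.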
http://strataconf.com/stratany2014/public/schedule/detail/36021
![Page 6: Data Curation @ SpazioDati - NEXA Lunch Seminar](https://reader030.vdocuments.site/reader030/viewer/2022032623/55d07c64bb61eb97198b4583/html5/thumbnails/6.jpg)
a lot of things involved
* ETL (Extract-Transform-Load) tools
* Data Science tools
* Linked Data tools
* Big Data tools
* Domain Knowledge
![Page 7: Data Curation @ SpazioDati - NEXA Lunch Seminar](https://reader030.vdocuments.site/reader030/viewer/2022032623/55d07c64bb61eb97198b4583/html5/thumbnails/7.jpg)
why do we need a data curation process?
![Page 8: Data Curation @ SpazioDati - NEXA Lunch Seminar](https://reader030.vdocuments.site/reader030/viewer/2022032623/55d07c64bb61eb97198b4583/html5/thumbnails/8.jpg)
it’s our mantra: ALL YOU NEED IS DATA
![Page 9: Data Curation @ SpazioDati - NEXA Lunch Seminar](https://reader030.vdocuments.site/reader030/viewer/2022032623/55d07c64bb61eb97198b4583/html5/thumbnails/9.jpg)
it’s our mantra: ALL YOU NEED IS DATA accessible for everyone :)
lat 00° 00’ 00” -> GPS -> Smartphones -> UI iPhone / Android
![Page 10: Data Curation @ SpazioDati - NEXA Lunch Seminar](https://reader030.vdocuments.site/reader030/viewer/2022032623/55d07c64bb61eb97198b4583/html5/thumbnails/10.jpg)
we are building two products
![Page 13: Data Curation @ SpazioDati - NEXA Lunch Seminar](https://reader030.vdocuments.site/reader030/viewer/2022032623/55d07c64bb61eb97198b4583/html5/thumbnails/13.jpg)
<Powered by SpazioDati> codename
2014
2015
data platform
![Page 14: Data Curation @ SpazioDati - NEXA Lunch Seminar](https://reader030.vdocuments.site/reader030/viewer/2022032623/55d07c64bb61eb97198b4583/html5/thumbnails/14.jpg)
Why a knowledge graph?
![Page 15: Data Curation @ SpazioDati - NEXA Lunch Seminar](https://reader030.vdocuments.site/reader030/viewer/2022032623/55d07c64bb61eb97198b4583/html5/thumbnails/15.jpg)
Our Entity Extraction API is based on a graph
[Figure: entity graph linking Brussels, Paris, Berlin, Eiffel Tower, 2009 World Championships in Athletics, King Baudouin Stadium and Champ de Mars; edges carry weights between 0.42 and 0.80]
https://dandelion.eu/docs/api/datatxt/nex/v1/
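One common way a weighted entity graph can support entity extraction is disambiguation: among several candidate entities, prefer the one most related to the entities already recognised in the text. The sketch below illustrates that idea only; it is not a description of how the API works internally, and the scores and the node pairs they are attached to are made up for the example rather than read from the slide.

```python
# Hypothetical symmetric relatedness scores in [0, 1]; not the slide's values.
RELATEDNESS = {
    frozenset({"Paris", "Eiffel Tower"}): 0.80,
    frozenset({"Paris", "Champ de Mars"}): 0.63,
    frozenset({"Eiffel Tower", "Champ de Mars"}): 0.59,
    frozenset({"Brussels", "King Baudouin Stadium"}): 0.53,
    frozenset({"Berlin", "2009 World Championships in Athletics"}): 0.53,
}


def relatedness(a: str, b: str) -> float:
    """Look up the edge weight between two entities (0.0 if not linked)."""
    return RELATEDNESS.get(frozenset({a, b}), 0.0)


def best_candidate(candidates, context):
    """Pick the candidate with the highest total relatedness to the context."""
    return max(candidates, key=lambda c: sum(relatedness(c, e) for e in context))


choice = best_candidate(
    ["Brussels", "Paris", "Berlin"],
    ["Eiffel Tower", "Champ de Mars"],
)
```

With this toy data, a text that already mentions the Eiffel Tower and the Champ de Mars resolves an ambiguous city mention to Paris.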
![Page 16: Data Curation @ SpazioDati - NEXA Lunch Seminar](https://reader030.vdocuments.site/reader030/viewer/2022032623/55d07c64bb61eb97198b4583/html5/thumbnails/16.jpg)
CONTEXTUAL DATA
![Page 17: Data Curation @ SpazioDati - NEXA Lunch Seminar](https://reader030.vdocuments.site/reader030/viewer/2022032623/55d07c64bb61eb97198b4583/html5/thumbnails/17.jpg)
why a knowledge graph
* different sources, different semantics: companies, people, Wikipedia topics, POI…
* simple to query on traversals
* global statistics
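The points about traversals and global statistics can be illustrated with a tiny in-memory graph of typed nodes. This is a sketch under stated assumptions: the node identifiers, the edges and the plain-Python representation are hypothetical, and a production system would delegate both operations to a graph database.

```python
from collections import Counter, deque

# Hypothetical typed nodes coming from different sources.
NODES = {
    "acme": "Company",
    "mario": "Person",
    "acme.example": "Website",
    "trento": "Administrative Division",
}
# Hypothetical undirected edges between those nodes.
EDGES = [("mario", "acme"), ("acme", "acme.example"), ("acme", "trento")]


def neighbours(node):
    """Return every node directly connected to `node`."""
    found = set()
    for a, b in EDGES:
        if a == node:
            found.add(b)
        elif b == node:
            found.add(a)
    return found


def within_hops(start, max_hops):
    """Breadth-first traversal: nodes reachable in at most `max_hops` steps."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == max_hops:
            continue
        for nxt in neighbours(node):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return seen - {start}


# A simple global statistic: how many nodes of each type the graph holds.
type_counts = Counter(NODES.values())
```

A question such as "which websites and places are two steps away from this person?" becomes `within_hops("mario", 2)` filtered by node type.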
![Page 18: Data Curation @ SpazioDati - NEXA Lunch Seminar](https://reader030.vdocuments.site/reader030/viewer/2022032623/55d07c64bb61eb97198b4583/html5/thumbnails/18.jpg)
let’s start with some details on the “Powered by SpazioDati” data platform…
![Page 19: Data Curation @ SpazioDati - NEXA Lunch Seminar](https://reader030.vdocuments.site/reader030/viewer/2022032623/55d07c64bb61eb97198b4583/html5/thumbnails/19.jpg)
http://blog.spaziodati.eu/en/2014/10/21/spaziodati-at-iswc-2014-visit-our-booth-research-plans-available/
“Powered by SpazioDati” data platform backstage
PWR-BY-SD
![Page 20: Data Curation @ SpazioDati - NEXA Lunch Seminar](https://reader030.vdocuments.site/reader030/viewer/2022032623/55d07c64bb61eb97198b4583/html5/thumbnails/20.jpg)
Tools involved
* OpenRefine
* Azkaban Open Source Workflow Manager (https://azkaban.github.io/)
* Silk, The Linked Data Integration Framework
* Titan graph db
* Apache Cassandra
![Page 21: Data Curation @ SpazioDati - NEXA Lunch Seminar](https://reader030.vdocuments.site/reader030/viewer/2022032623/55d07c64bb61eb97198b4583/html5/thumbnails/21.jpg)
http://blog.spaziodati.eu/en/2014/07/24/using-openrefine-to-perform-text-mining-on-your-data-food-for-thoughts/
starting from OpenRefine to clean up the data easily, for example
* reconcile and clean up the data
* align the data model to our internal ontologies, using RDF skeletons
* export the RDF modelled using our rules
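To make the last step concrete, here is a rough sketch of turning one cleaned row into RDF, serialised as N-Triples. It does not reproduce OpenRefine's RDF skeleton mechanism or the internal ontology: the namespaces, the `Company` class, the `name` property and the row layout are all placeholders chosen for illustration.

```python
# Hypothetical namespaces; the real internal ontology is not shown here.
ONT = "http://example.org/ontology/"
BASE = "http://example.org/company/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def escape_literal(value: str) -> str:
    """Escape backslashes, double quotes and newlines for an N-Triples literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def row_to_ntriples(row: dict) -> list:
    """Map one cleaned row to a type triple and a name triple."""
    subject = "<" + BASE + row["vat"] + ">"
    name = escape_literal(row["name"].strip())
    return [
        subject + " <" + RDF_TYPE + "> <" + ONT + "Company> .",
        subject + " <" + ONT + 'name> "' + name + '" .',
    ]


triples = row_to_ntriples({"vat": "01234567890", "name": ' ACME "Italia" '})
```

The fixed mapping inside `row_to_ntriples` plays the role of the alignment rules: every row is projected onto the same small set of classes and properties.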
![Page 22: Data Curation @ SpazioDati - NEXA Lunch Seminar](https://reader030.vdocuments.site/reader030/viewer/2022032623/55d07c64bb61eb97198b4583/html5/thumbnails/22.jpg)
in other words…
Rexster: JSON-based REST interface to Titan
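As a hedged illustration of what a JSON-based REST interface to the graph looks like from a client, the helpers below only build request URLs following Rexster's `/graphs/<graph>/vertices/...` path convention as I understand it. The host, port, graph name and the `vat` property key are assumptions; check the Rexster documentation for the exact routes supported by a given version.

```python
from urllib.parse import quote, urlencode


def vertex_url(base, graph, vertex_id, direction=None):
    """URL for one vertex, or for its adjacent vertices in a given direction."""
    url = f"{base}/graphs/{quote(graph)}/vertices/{quote(str(vertex_id))}"
    if direction in ("out", "in", "both"):
        url += f"/{direction}"
    return url


def lookup_url(base, graph, key, value):
    """URL for a key/value lookup over the vertices of a graph."""
    query = urlencode({"key": key, "value": value})
    return f"{base}/graphs/{quote(graph)}/vertices?{query}"


# Hypothetical local server and graph name.
BASE_URL = "http://localhost:8182"
adjacent = vertex_url(BASE_URL, "graph", 42, "out")
by_vat = lookup_url(BASE_URL, "graph", "vat", "01234567890")
```

Fetching either URL with any HTTP client and decoding the body as JSON would then give the vertex data to the application, with no Titan-specific driver on the client side.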
![Page 23: Data Curation @ SpazioDati - NEXA Lunch Seminar](https://reader030.vdocuments.site/reader030/viewer/2022032623/55d07c64bb61eb97198b4583/html5/thumbnails/23.jpg)
Our internal ontology: a sample
![Page 25: Data Curation @ SpazioDati - NEXA Lunch Seminar](https://reader030.vdocuments.site/reader030/viewer/2022032623/55d07c64bb61eb97198b4583/html5/thumbnails/25.jpg)
★ ~5.9 MLN companies
★ >10 MLN persons
900k
updated weekly
★ Weekly web crawl of the Italian corporate
![Page 26: Data Curation @ SpazioDati - NEXA Lunch Seminar](https://reader030.vdocuments.site/reader030/viewer/2022032623/55d07c64bb61eb97198b4583/html5/thumbnails/26.jpg)
★ Real-time data collection from company social accounts
★ ~1600 online & offline newspapers (updated daily)
updated weekly
![Page 28: Data Curation @ SpazioDati - NEXA Lunch Seminar](https://reader030.vdocuments.site/reader030/viewer/2022032623/55d07c64bb61eb97198b4583/html5/thumbnails/28.jpg)
Search: how it works
1. Direct search of one particular company through its name or “partita IVA” (VAT number)
2. Content search into company websites [*]
3. Keyword search among extracted and refined entities from company resources [*]
[*] Dandelion API is the extraction engine
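The three search modes can be sketched over a toy in-memory index. Everything here is hypothetical (the companies, the field names, the substring matching); the only fact relied on is that an Italian partita IVA is an 11-digit number. A real deployment would use a search engine and entities produced by the extraction step.

```python
import re

# Hypothetical index entries: registry data, crawled text, extracted entities.
COMPANIES = [
    {
        "name": "ACME S.r.l.",
        "vat": "01234567890",
        "website_text": "Industrial robots and automation lines",
        "entities": ["Robotics", "Automation"],
    },
    {
        "name": "Cantina Rossi S.p.A.",
        "vat": "09876543210",
        "website_text": "Family winery producing red wine since 1950",
        "entities": ["Wine", "Viticulture"],
    },
]

VAT_RE = re.compile(r"^\d{11}$")  # an Italian partita IVA has 11 digits


def direct_search(query):
    """Mode 1: exact VAT number match, otherwise name substring match."""
    q = query.strip()
    if VAT_RE.match(q):
        return [c["name"] for c in COMPANIES if c["vat"] == q]
    return [c["name"] for c in COMPANIES if q.lower() in c["name"].lower()]


def content_search(query):
    """Mode 2: substring match inside the crawled website text."""
    q = query.strip().lower()
    return [c["name"] for c in COMPANIES if q in c["website_text"].lower()]


def entity_search(query):
    """Mode 3: exact match against the entities extracted for each company."""
    q = query.strip().lower()
    return [c["name"] for c in COMPANIES if q in {e.lower() for e in c["entities"]}]
```

Entity search differs from content search in that it matches normalised concepts rather than raw words, so a page that never contains the literal keyword can still be found if the extractor linked it to that entity.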
![Page 29: Data Curation @ SpazioDati - NEXA Lunch Seminar](https://reader030.vdocuments.site/reader030/viewer/2022032623/55d07c64bb61eb97198b4583/html5/thumbnails/29.jpg)
Corporate page
atoka.io
![Page 30: Data Curation @ SpazioDati - NEXA Lunch Seminar](https://reader030.vdocuments.site/reader030/viewer/2022032623/55d07c64bb61eb97198b4583/html5/thumbnails/30.jpg)
Some details on
• Five main “types”:
  – Company
  – Person
  – Site
  – Administrative Division
  – Website
![Page 31: Data Curation @ SpazioDati - NEXA Lunch Seminar](https://reader030.vdocuments.site/reader030/viewer/2022032623/55d07c64bb61eb97198b4583/html5/thumbnails/31.jpg)
our infrastructure to crawl the Web for ATOKA
![Page 32: Data Curation @ SpazioDati - NEXA Lunch Seminar](https://reader030.vdocuments.site/reader030/viewer/2022032623/55d07c64bb61eb97198b4583/html5/thumbnails/32.jpg)
other details
Cerved: Company, People, Site, Position+Share
ISTAT: AdminDiv
ES
DBpedia: Company
cluster computing
![Page 33: Data Curation @ SpazioDati - NEXA Lunch Seminar](https://reader030.vdocuments.site/reader030/viewer/2022032623/55d07c64bb61eb97198b4583/html5/thumbnails/33.jpg)
something really interesting on OpenRefine
![Page 34: Data Curation @ SpazioDati - NEXA Lunch Seminar](https://reader030.vdocuments.site/reader030/viewer/2022032623/55d07c64bb61eb97198b4583/html5/thumbnails/34.jpg)
OpenRefine as usual
![Page 35: Data Curation @ SpazioDati - NEXA Lunch Seminar](https://reader030.vdocuments.site/reader030/viewer/2022032623/55d07c64bb61eb97198b4583/html5/thumbnails/35.jpg)
OpenRefine on Spark
![Page 36: Data Curation @ SpazioDati - NEXA Lunch Seminar](https://reader030.vdocuments.site/reader030/viewer/2022032623/55d07c64bb61eb97198b4583/html5/thumbnails/36.jpg)
it rocks! :)
more background details on http://blog.spaziodati.eu/wp-content/uploads/2015/07/RefineOnSpark.pdf
![Page 38: Data Curation @ SpazioDati - NEXA Lunch Seminar](https://reader030.vdocuments.site/reader030/viewer/2022032623/55d07c64bb61eb97198b4583/html5/thumbnails/38.jpg)
References
1) From raw data to dataGEMs for developers - http://ceur-ws.org/Vol-1268/paper1.pdf
2) Knowledge Graph ovunque - http://www.slideshare.net/dagoneye/knowledge-graphs-ovunque-un-quadro-di-insieme-e-le-implicazioni-per-uno-sviluppo-condiviso-del-web-of-data
3) Linking Enterprise Data - https://www.springer.com/it/book/9781441976642
4) Using OpenRefine - https://www.packtpub.com/big-data-and-business-intelligence/using-openrefine
5) Why Your Business Needs A Customer Data Knowledge Graph - http://www.dataversity.net/business-needs-customer-data-knowledge-graph/
6) Enabling parallel processing for OpenRefine: Spark vs Akka - http://refinepro.com/blog/enabling-parallel-processing-for-openrefine-spark-vs-akka/