Data Quality In Real Estate Dimitris Kontokostas, Andy van der Hoeven, Samur Araujo Amsterdam, Sep 14th 2017, LDQ Workshop, SEMANTiCS Conference

Page 1: Data quality in Real Estate

Data Quality in Real Estate

Dimitris Kontokostas, Andy van der Hoeven, Samur Araujo

Amsterdam, Sep 14th 2017, LDQ Workshop, SEMANTiCS Conference

Page 2: Data quality in Real Estate

About Geophy

● Goal: map all buildings in the world

● Provide a quality score for each building

○ Based on location, building status, history, environmental metrics, etc.

● Semantic platform

○ RDF eases the data integration process

● Team of 45, aiming to double by next year

Page 3: Data quality in Real Estate

Real Estate is a very complex domain

Really!

Page 4: Data quality in Real Estate

Possible constraints on addresses?

● An address will start with, or at least include, a building number.

● When there is a building number, it will be all-numeric.

● No buildings are numbered zero

● Well, at the very least no buildings have negative numbers

● A building number will only be used once per street

● A building will only have one number

● A building name won't also be a number

● [...] https://www.mjt.me.uk/posts/falsehoods-programmers-believe-about-addresses
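
As a minimal sketch of why such constraints are risky, the Python below encodes the first few assumptions as one naive check and runs it on made-up addresses. The function name, the regular expression and the sample strings are all hypothetical; none of them come from the slides or from Geophy data.

```python
import re

# Naive rule: an address starts with a positive, all-numeric building number
# followed by a space and the rest of the address.
NAIVE_ADDRESS = re.compile(r"^[1-9][0-9]* \S")


def passes_naive_check(address: str) -> bool:
    """Return True when the address starts with a positive integer number."""
    return NAIVE_ADDRESS.match(address) is not None


# Made-up examples; every rejected one is still a plausible address shape.
samples = [
    "12 Example Street",           # fits the naive rule
    "12A Example Street",          # number with a letter suffix
    "Rose Cottage, Example Lane",  # named building, no number
    "0 Example Road",              # building numbered zero
    "10-12 Example Avenue",        # one building, a range of numbers
]

# Addresses the naive rule would wrongly flag as invalid.
rejected = [address for address in samples if not passes_naive_check(address)]
```

A constraint this strict reports most of the samples as violations, which is the cost the slide warns about: each rule has to be weighed against the real-world exceptions it will misreport.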

Page 5: Data quality in Real Estate

Geophy [set of] ontologies

● 13 ontologies (+ 9 external)

● 125 Classes

○ Buildings

○ Addresses

○ Companies

○ [...]

● 720 properties

○ 500 datatype

○ 160 relation properties

● Growing...

Page 6: Data quality in Real Estate

Quality is expensive

● Quality of source data

○ Free, open, closed data sources, etc.

● Data clean up process

○ Violations, deduplication, precision, etc.

○ How much time and effort can one afford?

How much quality is good enough?

⇒ Fitness for use

Page 7: Data quality in Real Estate

Quality of ...

● Source data

○ Accuracy of the source

● Translation of source data

○ RDF mappings, RML, D2RQ, scripts, etc.

● Model design

○ Modelling quality

○ Data fitting on schema

● Model definition

○ Mapping of model on RDFS, OWL, ShEx|SHACL Shapes, etc.

○ Semantics, i.e. RDFS, OWL DL/RL/Full, etc.

Page 8: Data quality in Real Estate

Evolution & quality

● Data evolves

● so do ontologies

● so do RDF mappings

● so does code

● so do SPARQL queries

● so do constraints

http://aligned-project.eu

Page 9: Data quality in Real Estate

Scaling quality ...

● Thousands of triples

● Millions of triples

● Billions of triples

● ?

Try to move validation into the K (thousands of triples) range when possible

Page 10: Data quality in Real Estate

Validate closer to the source

● Validate the model

● Validate the RDF mappings

● Validate RDF mapping excerpts

● Validate instance data
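
One way to read "validate RDF mapping excerpts" is to run the mapping on a small slice of the source and check only the triples it produces. The sketch below assumes that reading. The mapping function, the `ex:` property names and the sample rows are hypothetical stand-ins, not Geophy's actual mappings or tooling.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List, Tuple

Triple = Tuple[str, str, str]


def map_row(row: Dict[str, str]) -> Iterator[Triple]:
    """Hypothetical mapping: turn one source row into triples."""
    subject = f"ex:building/{row['id']}"
    yield (subject, "rdf:type", "ex:Building")
    yield (subject, "ex:floorCount", row["floors"])


def check_triple(triple: Triple) -> List[str]:
    """Return violation messages for one triple (empty list if it is fine)."""
    subject, predicate, obj = triple
    if predicate == "ex:floorCount" and not obj.isdigit():
        return [f"{subject}: ex:floorCount is not a non-negative integer: {obj!r}"]
    return []


def validate_excerpt(rows: Iterable[Dict[str, str]], limit: int = 1000) -> List[str]:
    """Map only the first `limit` rows and validate the resulting triples."""
    violations: List[str] = []
    for row in islice(rows, limit):
        for triple in map_row(row):
            violations.extend(check_triple(triple))
    return violations


# Made-up source rows; the second one has a floor count the mapping cannot use.
source_rows = [
    {"id": "1", "floors": "4"},
    {"id": "2", "floors": "four"},
]
excerpt_violations = validate_excerpt(source_rows, limit=1000)
```

Because only an excerpt is mapped, the check stays in the thousands-of-triples range and can run on every change to the mapping, well before the full dataset is generated.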

Page 11: Data quality in Real Estate

Automate, automate & automate

Can you spot the error?

rdfs:label ⇒ rdf:langString

● :foo rdfs:label "foo @en" .

Page 12: Data quality in Real Estate

Automate, automate & automate

Can you spot the error?

rdfs:label ⇒ rdf:langString

● :foo rdfs:label "foo @en" . (wrong: the language tag is inside the literal)

● :foo rdfs:label "foo"@en . (correct)
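
An error like this is easy to miss by eye but simple to catch automatically. As a rough sketch, assume each label has already been parsed into a pair of lexical form and optional language tag. That representation, the function name and the regular expression are assumptions made for illustration; the regular expression is only a heuristic for the hint, not a full language-tag parser.

```python
import re
from typing import Optional, Tuple

# A parsed literal: (lexical form, language tag or None when there is no tag).
Literal = Tuple[str, Optional[str]]

# Heuristic: the lexical form ends in something shaped like "@en" or "@en-GB".
EMBEDDED_TAG = re.compile(r"@[A-Za-z]{2,3}(-[A-Za-z0-9]+)*$")


def label_violation(literal: Literal) -> Optional[str]:
    """Return a message when an rdfs:label value is not language-tagged."""
    lexical, language = literal
    if language is not None:
        return None  # language-tagged string, as the constraint requires
    if EMBEDDED_TAG.search(lexical):
        return f"language tag looks embedded in the literal: {lexical!r}"
    return f"label has no language tag: {lexical!r}"


wrong = ("foo @en", None)  # :foo rdfs:label "foo @en" .
right = ("foo", "en")      # :foo rdfs:label "foo"@en .
```

Here `label_violation(wrong)` returns a message and `label_violation(right)` returns `None`. In practice the same rule would more likely be written once as a SHACL or ShEx constraint on `rdfs:label` and checked by a validator.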

Page 13: Data quality in Real Estate

CI/CD is your buddy

● Integrate validation with your CI/CD

○ Choose tools & technologies wisely

○ Jenkins, Travis, GitLab, TeamCity

● Fail the build until data issues are fixed

● Data integration validation checks

○ Standalone datasets can each pass CI, so also validate the integrated data
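
Failing the build usually comes down to a validation step that returns a non-zero exit code, which all of the CI servers listed above treat as a failure. The sketch below shows that pattern. The check functions and their messages are hypothetical placeholders; real checks would query the datasets.

```python
import sys
from typing import Callable, Iterable, List

Check = Callable[[], List[str]]


def run_checks(checks: Iterable[Check]) -> int:
    """Run every check and return a process exit code (0 = pass, 1 = fail)."""
    violations: List[str] = []
    for check in checks:
        violations.extend(check())
    for message in violations:
        print(f"VIOLATION: {message}", file=sys.stderr)
    return 1 if violations else 0


def check_labels_have_language() -> List[str]:
    """Hypothetical per-dataset check; this one finds nothing wrong."""
    return []


def check_cross_dataset_links() -> List[str]:
    """Hypothetical integration check that runs on the merged data."""
    return ["ex:building/2 refers to an address missing from the address dataset"]


exit_code = run_checks([check_labels_have_language, check_cross_dataset_links])
# A CI entry point would finish with: sys.exit(exit_code)
```

The per-dataset check passes on its own, while the integration check fails the run. This illustrates why integration validation is listed as a separate step.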

Page 14: Data quality in Real Estate

Thank you for your attention

Questions?