friday lunchtime lecture: the beauty of quality data
TRANSCRIPT
-
8/18/2019 Friday lunchtime lecture: The beauty of quality data
The Beauty of Quality Data
How we can improve open data quality, and why we must.
John Griffin
-
OK, but we need to ask questions:
* Is there any missing data?
* How was the data collected?
* How are terms defined? E.g. “SME”
* What is the license: how am I allowed to use it?
-
What do we mean by data quality?
-
This is why we can’t have nice things
-
CSV Files
* Open format
* Simple
* Everyone gets them
* We’re already using them
* But: you can put any old crap in them
-
1. Full dataset spread over multiple files

Sector     Size   Year  Sales (£M)
Education  SME    2000  3
Education  SME    2001  5
Education  SME    2002  19
Education  SME    2003  42

2000
Sector     Size   Sales (£M)
Education  SME    3
Education  Large  5
Health     SME    7
Health     Large  11

2001
Sector     Size   Sales (£M)
Education  SME    5
Education  Large  9
Health     SME    13
Health     Large  18
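One way to undo this split, sketched here with Python's standard library, is to read each per-year file and add the year back as an explicit column. The `FILES` dictionary and the `combine` helper are assumptions made for illustration; in practice the year would more likely come from a file name or sheet title.

```python
import csv
import io

# Hypothetical per-year files, mirroring the 2000 and 2001 tables above.
FILES = {
    2000: "Sector,Size,Sales (£M)\nEducation,SME,3\nEducation,Large,5\n"
          "Health,SME,7\nHealth,Large,11\n",
    2001: "Sector,Size,Sales (£M)\nEducation,SME,5\nEducation,Large,9\n"
          "Health,SME,13\nHealth,Large,18\n",
}


def combine(files):
    """Merge per-year CSV texts into one list of rows with a Year column."""
    rows = []
    for year, text in sorted(files.items()):
        for row in csv.DictReader(io.StringIO(text)):
            rows.append({
                "Sector": row["Sector"],
                "Size": row["Size"],
                "Year": year,  # the year lived outside the file; make it data
                "Sales (£M)": int(row["Sales (£M)"]),
            })
    return rows


combined = combine(FILES)
```

Once the year is a column, the whole dataset can be published as one file and filtered by year, sector or size without any knowledge of how the files were originally named.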
-
2. Character Encoding Issues
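A small sketch of how this kind of garbling arises and one cautious way to read such files. The `decode_csv_bytes` helper and its list of candidate encodings are assumptions for illustration, not a general-purpose detector: it simply tries UTF-8 first and falls back to Windows-1252.

```python
def decode_csv_bytes(raw, encodings=("utf-8", "cp1252")):
    """Try each candidate encoding in order; return (text, encoding_used)."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# A header saved as UTF-8 but read as Latin-1 picks up a stray character.
header_utf8 = "Sales (£M)".encode("utf-8")
garbled = header_utf8.decode("latin-1")

# Decoding with the right encoding recovers the original text.
text_utf8, used_utf8 = decode_csv_bytes(header_utf8)

# A file saved in a legacy Windows encoding is not valid UTF-8,
# so the helper falls back to cp1252.
header_legacy = "Sales (£M)".encode("cp1252")
text_legacy, used_legacy = decode_csv_bytes(header_legacy)
```

Publishing as UTF-8 and saying so alongside the file avoids the guesswork entirely.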
-
3. Non-normalised schema

Sector     Size   Year  Sales (£M)
Education  SME    2000  3
Education  SME    2001  5
Education  SME    2002  19
Education  SME    2003  42

Sector     Size   2000  2001  2002  2003
Education  SME    3     5     19    42
Education  Large  5     10    23    45
Health     SME    7     18    29    67
Health     Large  11    28    36    80
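The wide table, with one column per year, can be reshaped into the one-row-per-observation form with a few lines of standard-library Python. This is a minimal sketch: the `wide_to_long` helper is hypothetical and assumes every column other than Sector and Size is a year.

```python
import csv
import io

# The wide table above, written out as CSV text.
WIDE = """Sector,Size,2000,2001,2002,2003
Education,SME,3,5,19,42
Education,Large,5,10,23,45
Health,SME,7,18,29,67
Health,Large,11,28,36,80
"""


def wide_to_long(text, id_columns=("Sector", "Size")):
    """Turn one-column-per-year rows into one row per (Sector, Size, Year)."""
    reader = csv.DictReader(io.StringIO(text))
    # Assumption: any column that is not an identifier holds a year's sales.
    year_columns = [c for c in reader.fieldnames if c not in id_columns]
    long_rows = []
    for row in reader:
        for year in year_columns:
            record = {column: row[column] for column in id_columns}
            record["Year"] = int(year)
            record["Sales (£M)"] = int(row[year])
            long_rows.append(record)
    return long_rows


long_rows = wide_to_long(WIDE)
```

In the long form, adding another year means adding rows rather than changing the header, so anything reading the file keeps working.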
-
Introductory Text Before Header

This dataset is subject to the Open Government Licence
Sector     Size   Year  Sales (£M)
Education  SME    2000  3
Education  SME    2001  5
Education  SME    2002  19
Education  SME    2003  42
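A consumer faced with a file like this has to find the real header before parsing. The sketch below assumes the expected column names are known in advance; `read_after_preamble` is a hypothetical helper, not part of any library.

```python
import csv
import io

# Assumed layout: one line of licence text, then the real header and data.
RAW = """This dataset is subject to the Open Government Licence
Sector,Size,Year,Sales (£M)
Education,SME,2000,3
Education,SME,2001,5
"""

EXPECTED_HEADER = ["Sector", "Size", "Year", "Sales (£M)"]


def read_after_preamble(text, expected_header):
    """Skip lines until the expected header is found, then return data rows."""
    lines = list(csv.reader(io.StringIO(text)))
    for index, fields in enumerate(lines):
        if fields == expected_header:
            data = lines[index + 1:]
            # Blank lines parse as empty lists; drop them.
            return [dict(zip(expected_header, row)) for row in data if row]
    raise ValueError("header row not found")


rows = read_after_preamble(RAW, EXPECTED_HEADER)
```

The cleaner fix is on the publishing side: keep the licence statement in metadata or an accompanying document so that the first line of the file is always the header.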
-
Empty cells where there should be a value

Sector     Size   Year  Sales (£M)
Education  SME    2000  3
Education         2001  5
Education  SME    2002  19
Education  SME    2003  42
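Gaps like the missing Size above are easy to report automatically before publishing. A minimal sketch, assuming every column is required; `find_empty_cells` is a hypothetical helper written for this example.

```python
import csv
import io

# The table above as CSV text, with the Size cell missing on the second row.
SAMPLE = """Sector,Size,Year,Sales (£M)
Education,SME,2000,3
Education,,2001,5
Education,SME,2002,19
Education,SME,2003,42
"""


def find_empty_cells(text):
    """Return (line_number, column) for every empty or missing cell."""
    problems = []
    reader = csv.DictReader(io.StringIO(text))
    # Line 1 is the header, so data rows start at line 2.
    for line_number, row in enumerate(reader, start=2):
        for column, value in row.items():
            if column is None:
                continue  # surplus fields on a row; a separate problem
            if value is None or value.strip() == "":
                problems.append((line_number, column))
    return problems


empty_cells = find_empty_cells(SAMPLE)
```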
-
Duplicate misspelt category terms

Sector     Size   Year  Sales (£M)
Education  SME    2000  3
education  SME    2001  5
Education  SME    2002  19
EDU        SME    2003  42
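Variants that differ only in case or surrounding whitespace can be found mechanically; abbreviations such as EDU need a human-maintained mapping. The sketch below is an illustration only: `cluster_variants`, `canonical` and the `CANONICAL` table are hypothetical names, and the mapping is an assumption about what the publisher intended.

```python
from collections import defaultdict

# The Sector column from the table above.
sectors = ["Education", "education", "Education", "EDU"]


def cluster_variants(values):
    """Group values that are identical after trimming and case-folding."""
    clusters = defaultdict(set)
    for value in values:
        clusters[value.strip().casefold()].add(value)
    # Only keys with more than one distinct spelling are worth reporting.
    return {key: sorted(found) for key, found in clusters.items()
            if len(found) > 1}


# Hypothetical mapping maintained by hand; abbreviations cannot be guessed.
CANONICAL = {"education": "Education", "edu": "Education"}


def canonical(value, mapping=CANONICAL):
    """Return the agreed spelling, or the value unchanged if it is unknown."""
    return mapping.get(value.strip().casefold(), value)


variants = cluster_variants(sectors)
cleaned = [canonical(value) for value in sectors]
```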
-
OpenRefine can help if you’re in this situation
-
What can we do to improve machine-readability?
* Schemas - csvlint.io
  (Also see: CSV On the Web W3C Primer)
* Registers
* ODI Certificates
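To make the idea of a schema concrete, here is a minimal sketch of checking a CSV against a declared set of columns and rules. The `SCHEMA` dictionary and `validate` function are invented for this example; they are not the schema format used by csvlint.io or CSV on the Web, and the allowed values are assumptions based on the sample tables.

```python
import csv
import io

# Hypothetical schema: column name -> rule that a cell value must satisfy.
SCHEMA = {
    "Sector": lambda v: v in {"Education", "Health"},
    "Size": lambda v: v in {"SME", "Large"},
    "Year": lambda v: v.isdigit() and len(v) == 4,
    "Sales (£M)": lambda v: v.isdigit(),
}


def validate(text, schema=SCHEMA):
    """Return a list of (line_number, column, value) rule violations."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != list(schema):
        return [(1, "header", reader.fieldnames)]
    errors = []
    for line_number, row in enumerate(reader, start=2):
        for column, check in schema.items():
            value = row[column] or ""  # short rows yield None
            if not check(value):
                errors.append((line_number, column, value))
    return errors


BAD = """Sector,Size,Year,Sales (£M)
Education,SME,2000,3
education,SME,2001,5
Education,,2002,19
"""

errors = validate(BAD)
```

A published schema lets both the publisher and the consumer run the same check, so problems are caught before the file goes out rather than after.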
-
Why Bother?
This Code is issued to meet the Government’s desire to place
more power into citizens’ hands to increase democratic
accountability and make it easier for local people to
contribute to the local decision making process and help
shape public services.
Local Government Transparency Code 2015
-
Low-hanging fruit
-
John Griffin
Principal Consultant
twitter.com/johngriffin
atchai.com
getdataseed.com