(open) data analysis for decision support: challenges and essentials - with examples from open...
DESCRIPTION
This slideset was presented at the 2014 RENA Summer School on Good Government and Open Citizenship. It uses examples from an open dataset on EU funding in Italy to show essentials and challenges in using open data to support decisions.
TRANSCRIPT
(Open) data analysis for decision support: challenges and essentials
Antonio Vetrò, Technische Universität München, Germany
01 September 2014, Matera (Italy), RENA Summer School
@phisaz
With examples from Open Coesione
With material from a joint work with:
Lorenzo Canova, Marco Torchiano (PoliTO - Politecnico di Torino)
Federico Morando, Raimondo Iemma (NEXA Center for Internet and Society - PoliTO)
Aline Pennisi (Ministero dell’Economia e delle Finanze)
Feedback from:
Andrea Milan (United Nations University)
Daniel Méndez Fernández (Technische Universität München)
RENA Summer School 2014
Deciding and implementing together
Monitoring together
Planning together
Outline
• Data analysis: a philosophical perspective, empiricism
• Data analysis challenges: examples with Open Data
Data
Source: Klaus Mainzer, Modern Aspects of Philosophy of Science in Informatics and its Applications, Lehrstuhl für Philosophie und Wissenschaftstheorie, Carl von Linde-Akademie, Munich Center for Technology in Society, Technische Universität München
Knowledge Representation: World, Model, and Formal Theory
World Model Theory
observation simulation deduction
approximation: {good, sufficient, insufficient}
interpretation: {true, false}
Figure: techrepublic.com
Data analysis: a philosophical perspective, empiricism
Observations / Evaluations
Questions / Hypotheses
Theory / System of theories
Pattern building
Falsification / support
Theory building
Study population
Deductive logic
Inductive logic
See also: Runeson et al., Case Study Research in Software Engineering: Guidelines and Examples
• Each empirical method…
  • has a specific purpose
  • relies on a specific data type
  • has a specific setting
Purpose
• Exploratory
• Descriptive
• Explanatory / confirmatory
• Improving
Data Type
• Qualitative
• Quantitative
Data analysis: a philosophical perspective
Observations / Evaluations
(Tentative) Hypotheses
Theory / System of theories
Pattern building
Falsification / support
Theory building
Study population
Deductive logic
Inductive logic
Formal / conceptual analysis
Grounded theory
Exploratory
– case/field studies
– experiments
– data analysis
Survey / interview research
Confirmatory
– case studies
– experiments
– data analysis
– …
Ethnographic studies
See also: Vessey et al., A unified classification system for research in the computing disciplines
Opportunities
Mike Lemansky, Open Data
& challenges
Open Coesione
Dataset “progetti” - FSE 2007/2013
Snapshot 31/12/2013
Focus on funding and dates, 22/74 columns
> colnames(subsetProgetti)
 [1] "FINANZ_UE"                        "FINANZ_STATO_FONDO_DI_ROTAZIONE"
 [3] "FINANZ_STATO_FSC"                 "FINANZ_STATO_PAC"
 [5] "FINANZ_STATO_ALTRI_PROVVEDIMENTI" "FINANZ_REGIONE"
 [7] "FINANZ_PROVINCIA"                 "FINANZ_COMUNE"
 [9] "FINANZ_ALTRO_PUBBLICO"            "FINANZ_STATO_ESTERO"
[11] "FINANZ_PRIVATO"                   "FINANZ_DA_REPERIRE"
[13] "FINANZ_TOTALE_PUBBLICO"           "DPS_DATA_INIZIO_PREVISTA"
[15] "DPS_DATA_FINE_PREVISTA"           "DPS_DATA_INIZIO_EFFETTIVA"
[17] "DPS_DATA_FINE_EFFETTIVA"          "DPS_FLAG_CUP"
[19] "DPS_FLAG_PRESENZA_DATE"           "DPS_FLAG_COERENZA_DATE_PREV"
[21] "DPS_FLAG_COERENZA_DATE_EFF"       "DATA_AGGIORNAMENTO"
Milepost5 850 NE 81st Ave Portland, OR 97213 http://milepost5.net/galleries/
Gallery of challenges: Guided Tour
Challenge #1: Errors in data
43 !
Errors can be introduced by:
- the source (observation, sensor)
- manual insertion
- ETL*
Be careful before claiming errors:
they might be “just” accuracy problems
* extraction, transformation, and loading
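A minimal sketch of how candidate errors could be surfaced for inspection rather than claimed outright: it flags projects whose actual end date precedes the actual start date. The two date column names come from the dataset subset listed earlier; the `id` field, the ISO date format, and all values are assumptions made up for illustration.

```python
from datetime import date

# Toy records: column names follow the dataset, values and "id" are invented.
projects = [
    {"id": "P1", "DPS_DATA_INIZIO_EFFETTIVA": "2010-03-01", "DPS_DATA_FINE_EFFETTIVA": "2011-02-28"},
    {"id": "P2", "DPS_DATA_INIZIO_EFFETTIVA": "2012-06-15", "DPS_DATA_FINE_EFFETTIVA": "2012-01-10"},
    {"id": "P3", "DPS_DATA_INIZIO_EFFETTIVA": "2013-09-01", "DPS_DATA_FINE_EFFETTIVA": ""},
]


def parse_iso(text):
    """Return a date for 'YYYY-MM-DD' text (assumed format), or None for an empty cell."""
    return date.fromisoformat(text) if text else None


def suspicious_dates(rows):
    """Return the ids of rows whose actual end date is earlier than the actual start date."""
    flagged = []
    for row in rows:
        start = parse_iso(row["DPS_DATA_INIZIO_EFFETTIVA"])
        end = parse_iso(row["DPS_DATA_FINE_EFFETTIVA"])
        # Rows with a missing date are skipped: they are a completeness issue, not an error.
        if start is not None and end is not None and end < start:
            flagged.append(row["id"])
    return flagged


candidates = suspicious_dates(projects)
print(candidates)
```

The flagged rows are only candidates: whether they come from the source, manual insertion, or ETL still has to be checked against the raw data.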
Challenge #2: accuracy
43 !
Unione Europea      22.50
Fondo di Rotazione  21.76
Regione              0.74
-------------------------
                    45.00
» Always refer to raw data
» If that is not possible, estimate the accuracy of your analysis (e.g., about 5% in the example above)
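The accuracy estimate above can be sketched as a comparison between the sum of the funding components and the aggregate figure. This assumes that the highlighted 43 is the (rounded) total displayed for the project; the slides do not state this explicitly, so `displayed_total` is a labeled assumption.

```python
# Funding components as listed on the slide.
components = {
    "Unione Europea": 22.50,
    "Fondo di Rotazione": 21.76,
    "Regione": 0.74,
}

# Assumption: the highlighted "43" is the total shown for the same project.
displayed_total = 43.0


def relative_gap(parts, displayed):
    """Relative difference between the sum of the parts and the displayed total."""
    exact = sum(parts.values())
    return abs(exact - displayed) / exact


exact_total = round(sum(components.values()), 2)
gap = relative_gap(components, displayed_total)
print(f"exact total: {exact_total:.2f}, relative gap: {gap:.1%}")
```

Under that assumption the gap is 2 out of 45, roughly 4.4%, which is in line with the "about 5%" quoted above.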
Challenge #3: missing data
Completeness
pcc = percentage of complete cells
pcrp = percentage of complete rows
sub: analysis on 22 attributes
int: analysis on the whole dataset
general: both sub and int
NA = NAs belong to the domain
Figure annotations (y-axis: Value):
- No dates
- NA in “finanziamenti”
- NA in “finanziamenti”, “pagamenti”, missing values in Ateco and other columns
- Codes and descriptions, Ateco + other descriptions
- No rows are complete
- In 89% of projects dates are present
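The two completeness metrics can be sketched as follows. This is an illustrative reading of pcc and pcrp, not the authors' implementation: the missing-value markers (`None`, empty string, `"NA"`), the `na_in_domain` switch for the "NAs belong to the domain" variant, and the toy rows are all assumptions.

```python
def is_missing(cell, na_in_domain=False):
    """A cell is missing if empty; 'NA' counts as missing unless it belongs to the domain."""
    if cell is None or cell == "":
        return True
    return cell == "NA" and not na_in_domain


def completeness(rows, columns, na_in_domain=False):
    """Return (pcc, pcrp): percentage of complete cells and of complete rows."""
    if not rows or not columns:
        raise ValueError("need at least one row and one column")
    complete_cells = 0
    complete_rows = 0
    for row in rows:
        present = [not is_missing(row.get(col), na_in_domain) for col in columns]
        complete_cells += sum(present)
        complete_rows += all(present)
    pcc = 100.0 * complete_cells / (len(rows) * len(columns))
    pcrp = 100.0 * complete_rows / len(rows)
    return pcc, pcrp


# Toy rows: column names follow the dataset, values are invented.
columns = ["FINANZ_UE", "FINANZ_REGIONE", "DPS_DATA_INIZIO_PREVISTA"]
rows = [
    {"FINANZ_UE": "100.0", "FINANZ_REGIONE": "NA", "DPS_DATA_INIZIO_PREVISTA": "2010-01-01"},
    {"FINANZ_UE": "50.0", "FINANZ_REGIONE": "10.0", "DPS_DATA_INIZIO_PREVISTA": ""},
    {"FINANZ_UE": "75.0", "FINANZ_REGIONE": "5.0", "DPS_DATA_INIZIO_PREVISTA": "2011-05-01"},
    {"FINANZ_UE": "20.0", "FINANZ_REGIONE": "NA", "DPS_DATA_INIZIO_PREVISTA": "2012-03-01"},
]

pcc, pcrp = completeness(rows, columns)
pcc_na, pcrp_na = completeness(rows, columns, na_in_domain=True)
print(pcc, pcrp, pcc_na, pcrp_na)
```

Running both variants side by side shows how much the result depends on whether NA is treated as a legitimate value, which is the first question to settle with domain knowledge.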
What to do with missing data
1. Understand the domain, e.g., NA or 0?
2. Find the motivation (e.g., a missing start date is o.k. if the project hasn’t started yet)
3. Understand how much they impact your analysis
4. You might also:
– exclude rows with missing values
– use imputation techniques:
  – mean substitution
  – regression substitution
  – group mean substitution
  – hot deck imputation
  – multiple imputation
Source: A. Mockus, Missing data in software engineering, Guide to Advanced Empirical Software Engineering, 2008
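Two of the imputation techniques listed above, mean substitution and group mean substitution, can be sketched in a few lines. The funding values and the region grouping are invented for illustration, and `None` is assumed to mark a missing value; the other techniques (regression, hot deck, multiple imputation) are not covered here.

```python
from statistics import mean


def mean_substitution(values):
    """Replace each missing value (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def group_mean_substitution(values, groups):
    """Replace each missing value with the mean of the observed values in its group."""
    observed_by_group = {}
    for value, group in zip(values, groups):
        if value is not None:
            observed_by_group.setdefault(group, []).append(value)
    # Fall back to the overall mean for groups with no observed value at all.
    overall = mean([v for v in values if v is not None])
    filled = []
    for value, group in zip(values, groups):
        if value is None:
            members = observed_by_group.get(group)
            value = mean(members) if members else overall
        filled.append(value)
    return filled


# Invented toy data: funding amounts with gaps, grouped by a hypothetical region label.
funding = [10.0, None, 30.0, 100.0, None, 200.0]
regions = ["A", "A", "A", "B", "B", "B"]

by_mean = mean_substitution(funding)
by_group = group_mean_substitution(funding, regions)
print(by_mean)
print(by_group)
```

The two results differ noticeably on the same gaps, which is one reason to check how much the imputation choice affects the analysis before relying on it.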
Challenge #4: outliers
» Outliers can point to interesting facts
» … or to something which deserves a second look
pcvc = percentage of cells with correct value
Figure annotations (y-axis: Value):
- ca. 50000 fundings < 1€
- ca. 210000 < 5€
- ca. 360000 < 55€
- ca. 430000 < 89€
What to do with outliers
1. Retention
– Check the distribution of the data: if it is heavy-tailed, keep them, but don’t apply techniques which require normality
2. Exclusion
– Remove them if you think they are measurement errors or exceptional cases
3. Sensitivity analysis
– compare results with and without outliers
– reason on the motivations
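The sensitivity analysis in step 3 can be sketched by computing the same statistics with and without the suspected outliers. The 1.5 × IQR rule used here to mark outliers is a common convention chosen for this sketch, not a rule taken from the slides, and the funding values are invented.

```python
from statistics import mean, median, quantiles


def iqr_bounds(values, k=1.5):
    """Lower and upper fences from the interquartile range (assumed outlier rule)."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread


def sensitivity(values):
    """Compare mean and median computed on all values and on values inside the fences."""
    low, high = iqr_bounds(values)
    kept = [v for v in values if low <= v <= high]
    return {
        "removed": len(values) - len(kept),
        "mean_all": mean(values),
        "mean_kept": mean(kept),
        "median_all": median(values),
        "median_kept": median(kept),
    }


# Invented funding amounts in euro, with one very large value.
funding = [0.5, 3.0, 40.0, 60.0, 80.0, 90.0, 120.0, 150.0, 5000.0]

result = sensitivity(funding)
for name, value in result.items():
    print(name, value)
```

If the conclusions change substantially between the two runs, the outliers deserve the second look mentioned above before deciding between retention and exclusion.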
Challenge #5: Drawing proper conclusions
» Knowledge is more than statistical significance
» Context and domain knowledge are fundamental
» Consider both qualitative and quantitative aspects
» Triangulate data with other sources
Summing up and additional suggestions
Challenges: watch out for
- Errors
- Data accuracy
- Missing data
- Outliers
- Drawing proper conclusions
Verify data collection
- sampling
- reference time and context
- appropriateness to goals
- transformations
Check first what the data looks like
Most programs (Excel, SPSS, STATA, R, …) offer predefined functions
Keep track of:
- modifications and reasons
- different versions
- raw data
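As a rough illustration of checking first what the data looks like, the sketch below computes a small descriptive summary for one numeric column. It only mimics the kind of output that the predefined functions of the tools named above provide; the function name `describe`, the use of `None` for missing values, and the sample values are assumptions.

```python
from statistics import mean, median


def describe(values):
    """Summarize a numeric column: size, missing count, and basic location statistics."""
    observed = sorted(v for v in values if v is not None)
    if not observed:
        raise ValueError("no observed values to summarize")
    return {
        "n": len(values),
        "missing": len(values) - len(observed),
        "min": observed[0],
        "median": median(observed),
        "mean": mean(observed),
        "max": observed[-1],
    }


# Invented sample with one missing value and one very large amount.
summary = describe([0.5, 3.0, None, 40.0, 5000.0])
for name, value in summary.items():
    print(name, value)
```

A large distance between mean and median, as in this sample, is an early hint of the outlier and heavy-tail issues discussed in the challenges.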
Interesting readings
Gallery of challenges: End of Guided Tour