PREDICTING THE FUNCTIONAL STATE
OF TANZANIAN WATER PUMPS
A CASE STUDY IN DATA MINING
Word count: 19596
Jacob Benoot
Student number: 01170804
Supervisor: Els Clarysse
Master’s Dissertation submitted to obtain the degree of:
Master of Science in Business Administration
Academic year: 2016 - 2017
PERMISSION
I declare that the content of this Master’s Dissertation can be consulted and/or reproduced if the
sources are mentioned.
Name student: Jacob Benoot
Signature:
ABSTRACT
Data mining is ubiquitous in today's digital world. The phenomenon goes by many synonyms, such as
data science, data analytics, big data analytics and business analytics. All of them pursue the
same goal: extracting knowledge from data. Many people come into contact with it daily without
knowing it; think of services such as Facebook, Google, Uber and Amazon.
This work places data mining in its context and applies data mining principles to a case that
predicts the status of water pumps in Tanzania. The CRISP-DM framework is used as a guideline.
Each step of this framework is discussed and worked out. This structured approach enables us
to (1) select the strongest model and the best approach, (2) build a model whose predictive power
and derived insights into the variables used can improve the Tanzanian government's effectiveness
in combating water scarcity, and (3) study the impact of data cleaning on the accuracy of a model.
PREFACE
This thesis is in line with my personal interests and professional ambitions and therefore I enjoyed
every step of the way. Finalizing this work after countless hours of data manipulation, R debugging,
troubleshooting, waiting on R to finish computations, finding the best visualizations to support my
message, fills me with a great sense of accomplishment. As is customary, I would like to take the time
to thank everyone who helped me along the way. My greatest friend in dark times was Google, who
always made time to listen to my problems and suggest further action. Also a huge thanks to all the
people who take time to answer questions on forums like Stack Overflow: you are the real
heroes. This thesis, in its current form, was only possible because of Dirk van den Poel, Matthijs
Meire and by extension Han-Thijs de Senerpont Domis, from whom and with whom I discovered
the world of data science. Furthermore, a big thanks to my brother Stijn Benoot, who lent me an
extra 8GB of RAM. That act of generosity uncorked my computational bottleneck and sped up the
whole process. Thanks.
TABLE OF CONTENTS
PAINTING THE PICTURE
1 Introduction
1.1 The fuzz
2 Literature review
2.1 What is data mining?
2.1.1 How is data mining done?
2.2 CRISP-DM
CASE-STUDY ELABORATION
3 Case Study
3.1 Abstract
3.1.1 Research question
3.2 Business understanding
3.3 Data Understanding & Data Preparation
3.3.1 Data Read-in
3.3.2 Data Exploration, preparation and validation methodology
3.3.3 Data Exploration, preparation and validation
3.3.4 Summary of Data Understanding / Data preparation
3.4 Modelling & Modelling Evaluation
3.4.1 Modelling introduction
3.4.2 Three way classification approach
3.4.3 Modelling evaluation
3.4.4 Modelling Approach: Cross-validation
3.4.5 Modelling case
3.5 Evaluation
3.5.1 Variable importances
3.5.2 Partial dependence
3.5.3 Data cleaning evaluation
3.6 Deployment
CONCLUSION
4 Conclusion
5 Bibliography
APPENDIX
6 Appendix
6.1 Appendix: Data understanding / preparation stage elaboration
6.1.1 Packages used
6.1.2 Amount_tsh
6.1.3 Date Recorded
6.1.4 Funder & Installer
6.1.5 GPS Height
6.1.6 Longitude and Latitude
6.1.7 Public meeting
6.1.8 Permit
6.1.9 ConstructionYear
6.1.10 Collection of other variables
6.1.11 Population
6.2 Appendix: Modelling stage elaboration
6.2.1 Packages used
6.2.2 Delong test: ROC curve comparison
6.3 Appendix: Evaluation stage elaboration
6.3.1 Packages used
6.3.2 Variable importances
6.3.3 Partial dependence
6.3.4 Data cleaning evaluation
ABBREVIATIONS
CRISP-DM – Cross industry standard process for data mining
ANOVA – Analysis of variance
ROC – Receiver operating characteristics
AUC – Area under curve
TABLES AND FIGURES
Figures
Figure 1 3 V's of Big Data (Sagiroglu & Sinanc, Big data: A review, 2013)
Figure 2 Financial value across sectors through the use of Big Data (Evans & Lidner, 2012)
Figure 3 Data mining versus the use of data mining results (Provost & Fawcett, 2013)
Figure 4 CRISP data mining process (Provost & Fawcett, 2013)
Figure 5 Generic tasks and outputs of the CRISP-DM reference model (Chapman, et al., 2000)
Figure 6 Taarifa geographic mapping of water pumps and their status (Taarifa, 2016)
Figure 7 Percentage of functional water pumps (left) & Population coverage (right) per region (Taarifa, 2016)
Figure 8 Classification of data quality problems in data sources (Rahm & Hai Do, 2009)
Figure 9 Distribution of status group over Public Meeting values
Figure 10 Boxplot of GPS_Height over status_group
Figure 11 Distribution of Status_group variable
Figure 12 Boxplot of population
Figure 13 Public meeting
Figure 14 Permit
Figure 15 Construction year as factor
Figure 16 Distribution of Extraction_type_class variable
Figure 17 Distribution of water pumps over Management_group
Figure 18 Distribution of water pumps over payment variable
Figure 19 Distribution of Quality_group
Figure 20 Quantity
Figure 21 Waterpoint type
Figure 22 Basin
Figure 23 Region
Figure 24 Attributes and target attribute representation, inspired by (Provost & Fawcett, 2013)
Figure 25 A Classification tree representation (Based on total population: 59400)
Figure 26 Bagging as presented in course material of Advanced Predictive Analytics (personal correspondence with Dirk van den Poel)
Figure 27 Linear and logistic regression
Figure 28 Sequential Boosting (personal correspondence with Dirk van den Poel)
Figure 29 An example of a ROC curve
Figure 30 Cumulative response curve and lift
Figure 31 Modelling approach (personal correspondence with Dirk van den Poel)
Figure 32 An illustration of cross-validation (Provost & Fawcett, 2013)
Figure 33 Variable importances
Figure 34 Partial dependence plot of Quantity and Payment
Figure 35 Partial Dependence Plot of Installer and Funder variables
Figure 36 Partial Dependence Plot of ConstructionYearFactor variable
Figure 37 Distribution of funder and installer variables
Figure 38 GPS Height distribution per status group
Figure 39 Distribution of latitude and longitude per status group
Figure 40 Distribution of status_group of Public Meeting
Figure 41 Distribution of status_group over Permit values
Figure 42 Boxplots of Constructionyear over status_group
Figure 43 Boxplots of population over status_group
Figure 44 Variable importances
Figure 45 Variable importances (AUC)
Figure 46 Mean decrease in Gini comparison
Figure 47 Mean decrease in AUC comparison
Tables
Table 1 Different sources of data (Vale, 2013)
Table 2 Different types of analytics (Evans & Lidner, 2012)
Table 3 Available data about water pumps in Tanzania
Table 4 Proportional Crosstab of status_group and public_meeting
Table 5 Summary of dependency tests
Table 6 Seasons in Tanzania
Table 7 Newly created categories for variable funder
Table 8 GPS Height: Manual look-up of missing data
Table 9 GPS Height: The process of imputing missing values
Table 10 Longitude and Latitude validation
Table 11 Latitude & Longitude: The process of imputing missing values
Table 12 Granularity of extraction type
Table 13 Granularity of Management
Table 14 SchemeGroup and Scheme management levels
Table 15 Quantity Crosstab
Table 16 Granularity of Source
Table 17 Summary of data handling
Table 18 One vs all classification results using a 5-fold crossvalidation
Table 19 All-in-one classification results using 5-fold crossvalidation
Table 20 Data exploration and preparation packages
Table 21 Crosstab of season and status_group
Table 22 Chi-square test of season
Table 23 Categorisation method for Installer and Funder variable
Table 24 Chi-square evaluation of Funder and Installer
Table 25 One-way ANOVA of GPS Height
Table 26 Ad Hoc comparison using TukeyHSD for GPS Height
Table 27 one-way ANOVA of Latitude
Table 28 TukeyHSD multiple comparison test of Latitude
Table 29 one-way ANOVA of Longitude
Table 30 TukeyHSD multiple comparison of Longitude
Table 31 Chi-square results for public meeting
Table 32 Chi-Square test of Permit
Table 33 one way ANOVA of Constructionyear
Table 34 TukeyHSD test of construction year
Table 35 Chi-square test of construction year as a factor
Table 36 Summary of Chi-square tests over other variables
Table 37 One-way ANOVA of population
Table 38 TukeyHSD ad hoc comparison for Population
Table 39 Modelling packages used
Table 40 Delong test
Table 41 Evaluation packages used
Table 42 Partial dependence plots with interpretation
Table 43 Delong test for ROC curve comparison
1 Introduction
The capture and analysis of data is a hot topic. It is a promising ‘new’ frontier on which businesses
compete to gain an advantage (Davenport & Harris, Competing on analytics, 2007). Capturing and
analyzing data is not new, but the way it is done is changing drastically. New developments and
trends facilitate the capture and analysis of data. There has been an enormous growth in available
data. More data is being captured, but also less traditional sources of data can now be handled.
This has led to a number of success stories that are able to dazzle our minds. Companies like
Amazon, Facebook and Google created their business model by thoroughly analyzing their data and
are able to generate great value by doing so. Although these examples illustrate the huge
opportunities data analytics can yield, there is still a lack of maturity in this field. Where a few
projects succeed, many more are doomed to fail (Demirkan & Dal, 2014). It is therefore
imperative that some general approach can be used to deliver such projects. CRISP-DM is such a
framework, and this thesis explores how to apply it to a real-life case.
The case study aims to guide the reader through a data mining competition in predicting the
functional state of water pumps, explaining all the steps along the way. The goal is to combine the
practical approach with the theory behind it, so the reader understands what is happening in each
phase. For that reason literature findings and practical issues are combined within the elaboration of
the case study.
1.1 The fuzz
Data mining, data science, (business) analytics, knowledge discovery etc. are all closely related terms
that relate to analyzing data in order to gain knowledge. It is not a new phenomenon, as this practice
is as old as the field of statistics which has been around since the 18th century (Agarwal & Dhar,
2014). Over the past two decades, however, it has become increasingly important (Chen, Roger, &
Storey, 2012). Nowadays, the collection of data is nurtured by the internet, with the rapid pace at
which economic and social transactions are moving online. The opportunities of this field are also
expanded by the availability of ‘Big Data’ and advancements in the field of machine learning. The
arrival of Big Data at the scene is claimed to be the most significant tech disruption since the
internet and digital economy (Agarwal & Dhar, 2014).
Big Data can be described as data that is too big, too fast, or too hard for existing tools to process.
This relates directly to the 3 V’s of Big Data: volume, variety and velocity, its three main characteristics
(Madden, 2012).
Five exabytes (1 exabyte = 10^18 bytes) of data were created by humans until 2003. Nowadays, that
much is created in two days. 10 billion text messages are sent every day. By 2050, 50 billion devices
will be connected to the internet. Facebook has 955 million monthly active users, every day 30 billion
pieces of content are posted and 2.7 billion likes and comments are made. 571 new websites are created every
minute (Sagiroglu & Sinanc, Big data: A review, 2013). An enormous amount of data is available and
it is being generated at an increasing pace. The size of this data is getting large, sometimes reaching
petabytes. This is called the volume aspect of big data.
The United Nations Economic Commission for Europe (UNECE) classifies different sources of
data in 3 domains, displayed in table 1. Firstly, there is a lot of data concerning human experiences.
‘Social networks’ (human-sourced information) is the source of data coming from blogs, comments,
pictures, videos, internet searches etc. Secondly, ‘traditional business systems’ leave a trace of doing
business like medical records, transaction information and stock records. Lastly, ‘the Internet of
Things’ covers all data coming from sensors or computer systems (Vale, 2013).
Table 1 Different sources of data (Vale, 2013)
Social networks (human-sourced information): social networks (Facebook, Twitter etc.); blogs and comments; personal documents; pictures (Instagram, Flickr); videos (Youtube); internet searches; mobile data, text messages; user-generated maps; e-mail
Traditional business systems (process-mediated data): medical records; commercial transactions; banking/stock records; e-commerce; credit cards
Internet of Things (machine-generated data): sensor data (home automation, weather sensors, traffic sensors); mobile sensor data (location, cars, satellite images); (web) logs
There is a huge variety to be found in these data sources. Traditionally, data sources are structured,
like commercial data, which is often stored in a database or data warehouse. Semi-structured data has,
as the name implies, some main structure. Think of Twitter messages: the
messages posted by users do not have any structure, as users can write whatever they like, but the data
generated from those messages also comes with metadata, which is structured. Date and time,
locations, IP addresses and so on are also being captured. Besides these types, data can also
be completely unstructured, like video or audio data. Velocity, the third V of Big Data, captures the
increasing speed at which data is coming at us, for example through the real-time capture of sensor
data or of clickstreams generated on websites (Sagiroglu & Sinanc, Big data: A review, 2013). These
3 V’s are presented in figure 1.
Figure 1 3 V's of Big Data (Sagiroglu & Sinanc, Big data: A review, 2013)
Big data solutions can thus analyze and interpret data that was previously assumed too difficult to
handle in terms of volume, variety and velocity. Overcoming these technical
challenges clears the way for new opportunities and applications. The potential value derived from
these new opportunities is estimated to be huge. Global consultancy company McKinsey claims
enormous gains in a wide variety of sectors. The potential value from data analysis in the US health
care sector alone could reach up to $300 billion. For the European public sector, this would be
€250 billion (Evans & Lidner, 2012). This and other examples are illustrated in figure 2. On top of
that, Davenport & Harris (2007) suggest that top performing organizations are three times more
likely to be sophisticated analytics users than lower performers, implying a clear beneficial result of
using advanced analytics.
These projections of economic value trigger organizations to invest in data analysis, but
doing so requires people who can handle their data. The job of data scientist is being hyped as
the sexiest job of the century. There is already a shortage of data scientists, which is becoming a
serious constraint in some sectors (Davenport & Patil, Data Scientist: The Sexiest Job of the 21st
Century, 2012).
Based on these findings it might be beneficial to obtain some knowledge on how to handle data.
Even if you don’t have the ambition to become a data scientist yourself, knowing how one thinks
might already be beneficial in dealing with and understanding one. This thesis covers data mining
theory and concepts and applies those to a case ranging from data gathering to the interpretation of
the results to help the reader gain some practical understanding of data analysis.
Figure 2 Financial value across sectors through the use of Big Data (Evans & Lidner, 2012)
2 Literature review
2.1 What is data mining?
Data mining is the process of analyzing data from different angles and extracting useful information.
This is sometimes called knowledge discovery (Satinderpal, Sheilly, & Kaur, 2012). The goal is thus
to transform raw data into useful information or knowledge, which is then used to improve decision
making, derive value or gain a competitive edge (Provost & Fawcett, 2013; de Tré, 2007).
Transforming data into knowledge can be done in several ways. The method varies depending on
the question that needs answering. Descriptive analytics looks at what has happened and why, by
viewing the data from different angles and summarizing it in charts and reports. It
helps to understand and analyze business performance. Such summaries are useful if it is roughly known
what to look for, but there can also be hidden patterns that require more complex methods to
surface. These hidden patterns can answer more complex questions and lead to more interesting and
actionable insight (de Tré, 2007). This is the domain of predictive analytics, which answers questions
like ‘what will happen’. Those more complex methods rely on advanced analytical techniques and
are often called data mining techniques (de Tré, 2007). Prescriptive analytics tries to optimize a
certain situation, for example, to minimize costs or maximize profit (Evans & Lidner, 2012). Table 2
offers an overview of these different types of analytics.
Table 2 Different types of analytics (Evans & Lidner, 2012)
Type | Question | Examples
Descriptive analytics | What has happened? | Reporting, visualization, dashboards
Predictive analytics | What will happen? | Detect hidden patterns, data mining
Prescriptive analytics | What should happen? | Optimization, revenue management, what-if analysis
As already hinted in the previous paragraph, data mining is often associated with the predictive
analytics type. Predictive modelling can be seen as one of the main topics of data mining (Provost &
Fawcett, 2013). Evans & Lidner (2012) describe data mining as a focus on understanding
characteristics and patterns among variables in large databases using a variety of statistical and
analytical tools. Using these statistical and mathematical principles, it investigates
historical data to detect patterns and relationships (Satinderpal, Sheilly, & Kaur, 2012; Evans &
Lidner, 2012). Data mining tries to learn those patterns and apply them to new data to predict
its behavior (de Tré, 2007).
2.1.1 How is data mining done?
In the realm of data mining there are some distinctions to be made. On the one hand, there are
parametric methods. These require some assumptions about reality; for example, we can assume a
linear relation. On the other hand, there are non-parametric methods, which make no such assumptions
about the form of the relationship. Because of this, non-parametric methods can fit the true
relationship more closely. But this comes with a downside as well: a very large number of
observations is required to obtain an accurate estimate. Another distinction can be found in the
definition of a data mining problem. If we know what we want to predict and have, for example, a
target variable to focus on, it is called supervised learning. Unsupervised learning has no response
variable to predict. In those cases, it is not known what to look for (James, Witten, Hastie, & Tibshirani, 2013).
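The difference between a parametric and a non-parametric method can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy data, the function names `fit_line` and `fit_knn` and the choice of k are hypothetical and are not taken from the thesis, which uses R. The parametric model assumes a straight line, while the nearest-neighbour model assumes nothing about the shape of the relationship.

```python
from statistics import mean

def fit_line(xs, ys):
    """Parametric: assume y = a + b * x and estimate a and b by least squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum((x - x_bar) ** 2 for x in xs)
    a = y_bar - b * x_bar
    return lambda x: a + b * x

def fit_knn(xs, ys, k=2):
    """Non-parametric: predict the mean target of the k nearest training points."""
    def predict(x):
        nearest = sorted(zip(xs, ys), key=lambda point: abs(point[0] - x))[:k]
        return mean(y for _, y in nearest)
    return predict

# Hypothetical toy data that follows a curved relation, y = x ** 2.
xs = [0, 1, 2, 3, 4, 5, 6]
ys = [x ** 2 for x in xs]

line = fit_line(xs, ys)
knn = fit_knn(xs, ys, k=2)

# The true value at x = 2.5 is 6.25; the linear assumption is too rigid here.
print(line(2.5), knn(2.5))
```

On this curved toy data the nearest-neighbour prediction lies closer to the true value, but with fewer observations its estimate would quickly become unreliable, which is the downside mentioned above.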
As already mentioned in the previous section, data mining tries to learn patterns and apply them to
new data (de Tré, 2007). This is exactly what figure 3 illustrates. The top part represents the model
extraction from data that is available, the historic data. The bottom part illustrates the application of
this model in predicting the class of new data points.
Which algorithms are used to extract a model depends on the type of problem. The case study covers
a supervised learning problem, and more precisely a classification problem. The approach taken to
tackle this specific problem, like the type of algorithms that can be used for classification, is
elaborated further in the case study.
Figure 3 Data mining versus the use of data mining results (Provost & Fawcett, 2013)
2.2 CRISP-DM
Evans & Lidner (2012) claim organizations are overwhelmed by this abundance of data and struggle
to understand how to use it to achieve business results. Without any framework, the success of a
data mining project depends on the skill of the person or team in question, which severely
limits the reproducibility of their efforts. A standard approach like the ‘Cross Industry Standard
Process for Data Mining’ (CRISP-DM) aims to make data mining projects less costly, more reliable,
repeatable, manageable and faster. It does so by helping to translate business problems into data mining
tasks, suggesting appropriate data transformation steps and modelling techniques, and providing a way to
evaluate the results and a standard way to document the whole process (Wirth & Jochen, 2000). The
CRISP-DM process model for data mining consists of 6 phases, which are visualized in figure
4. The arrows represent the most important dependencies between phases. The large outer circle
indicates the iterative nature of this framework: going back and forth between steps is often needed,
as findings along the way trigger new questions (Shearer, 2000). A more detailed overview is given in
figure 5. Per phase, generic tasks and desired outputs are given.
Figure 4 CRISP data mining process (Provost & Fawcett, 2013)
Figure 5 Generic tasks and outputs of the CRISP-DM reference model (Chapman, et al., 2000)
Business understanding
In this phase, the focus is on the business problem at hand. What is it that we are trying to
accomplish, what question do we want answered? This phase determines the project objectives and
requirements and how they can be translated into a data mining problem.
The business understanding phase basically consists of drawing a clear picture of the road ahead by
framing the whole data mining endeavour into a clearly defined project plan (Chapman, et al., 2000).
Data understanding
The second step, data understanding, involves the collection and exploration of the data. This
requires investigating and describing different attributes to understand and document their meaning.
Validation of the data can uncover hidden mistakes and provide ways to deal with those. In that
way, a clear picture of the data is drawn. This often goes hand in hand with some visualization
(Chapman, et al., 2000).
Data preparation
In the data preparation stage, the final data set is constructed on which the analysis will be done.
This is done by selecting relevant attributes and by transforming and cleaning the data. Often, if there
are different data sources, these need to be integrated by merging them (Chapman, et al., 2000).
Modelling
Appropriate modelling techniques are evaluated and applied. As figure 4 shows, there is
a lot of going back and forth between the data preparation stage and the modelling stage, because
different modelling techniques sometimes have different requirements of the data. This stage also
evaluates and compares the performance of the different models by checking their performance on a
separate test set (Chapman, et al., 2000).
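Checking performance on a separate test set can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy data, the helper names `train_test_split` and `accuracy`, the 25% test fraction and the simple threshold "model" do not come from the thesis. The point is only that the model is learned from the training rows and scored on rows it has never seen.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=42):
    """Shuffle a copy of the rows and hold out a fraction as test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(predicted, actual):
    """Share of predictions that match the actual class."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

# Hypothetical toy data: (feature, class) pairs.
rows = [(x, "high" if x >= 50 else "low") for x in range(100)]
train, test = train_test_split(rows)

# "Model": a threshold halfway between the two class means, learned from the training set only.
low_values = [x for x, label in train if label == "low"]
high_values = [x for x, label in train if label == "high"]
threshold = (sum(low_values) / len(low_values) + sum(high_values) / len(high_values)) / 2

# Evaluation uses only the held-out test rows.
predictions = ["high" if x >= threshold else "low" for x, _ in test]
test_accuracy = accuracy(predictions, [label for _, label in test])
print(len(train), len(test), test_accuracy)
```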
Evaluation
It is important to evaluate the model(s) created. Does the model really achieve the business
objectives, or did we make mistakes along the way? If the model is satisfactory, the
deployment phase can be initiated. If that is not the case, the previous phases should be revisited and
adjustments made in order to achieve the desired results (Chapman, et al., 2000).
Deployment
The next step is to ‘make use’ of the model, to deploy it throughout the organization. This can range
from generating a simple report that presents the findings to implementing a repeatable data mining
process across the organization (Chapman, et al., 2000).
3 Case Study
3.1 Abstract
In this case study, the principles of data mining were applied following the CRISP-DM framework
to solve the problem of faulty Tanzanian water pumps. By using data mining algorithms for
classification, the class of water pumps was predicted: is a water pump functional, functional but in
need of repairs, or non-functional? The best results in solving this multiclass classification problem
were obtained by a one-vs-all approach using the Random Forest algorithm, which yielded an AUC
of 0.91 and a classification rate of 0.8209. All data processing and modelling is done in the statistical
programming language R, for which the entire code can be found in the appendix of the electronic
version of this thesis.
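As a brief illustration of the one-vs-all idea, the Python sketch below splits one three-class target into three binary targets and picks the class with the highest score. It is a sketch under assumptions, not the thesis code (which is written in R): the function names, the label spellings and the scores are hypothetical, and no Random Forest is trained here.

```python
def one_vs_all_targets(labels):
    """Turn one multiclass target into one binary target (1 = this class, 0 = rest) per class."""
    return {c: [1 if y == c else 0 for y in labels] for c in sorted(set(labels))}

def predict_class(scores):
    """Pick the class whose binary model returns the highest score."""
    return max(scores, key=scores.get)

# Hypothetical labels for four pumps.
labels = ["functional", "non-functional", "needs repair", "functional"]
targets = one_vs_all_targets(labels)

# Hypothetical scores, as three binary classifiers might return them for a single pump.
predicted = predict_class({"functional": 0.35, "non-functional": 0.55, "needs repair": 0.10})
print(targets["functional"], predicted)
```

Each binary target can then be modelled by its own classifier, which is what makes a per-class measure such as the AUC straightforward to compute.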
The case study aims to guide the reader through a data mining competition in predicting the
functional state of water pumps, explaining all the steps along the way. The goal is to combine the
practical approach with the theory behind it, so the reader understands what is happening in each
phase. For that reason literature findings and practical issues are combined within the elaboration of
the case study. To increase readability and avoid repetition, all repetitive or space consuming tasks
are provided in the appendix and not in text, although they are a vital part of this work.
3.1.1 Research question
This thesis tries to answer three questions. The first and main question is: “Can data mining provide an
added value for the Tanzanian government in battling water scarcity?”. The main purpose of analysing data is
to extract value and of course we would like to know if we succeeded. The second question, “What is
the best way to predict the functional state of Tanzanian water pumps?”, focuses on the data mining approach,
in which different algorithms are evaluated to determine what works best. The third question,
“Does data preparation improve the predictive capabilities of an algorithm?”, reflects on the statement that
80% of all data analysis time goes into data cleaning, which was also the case in this thesis.
3.2 Business understanding
Tanzania is the largest country in East Africa, with a population of 52 million people. But of those
52 million people, 23 million have no choice but to drink dirty water from unsafe sources. 44 million
do not have access to adequate sanitation and 4000 children die from preventable diseases due to
unsafe water. Safe water is scarce, and often women and children have to spend two to seven hours
to collect clean water (WaterAid, 2016). This is quite the predicament. Water is a basic need and
right for all human beings. The Tanzanian ministry of water agrees and together with Taarifa, they
aim to improve sanitation conditions in their country.
The Taarifa platform is an open source web application for information collection, visualisation and
interactive mapping, created by a global network of volunteers. It enables citizens to report
sanitation problems such as broken public toilets or broken water pumps in their neighbourhood
through SMS, Twitter or the Taarifa mobile app. These issues are gathered and organized in the platform
and in this way communicated to the responsible governments and decision makers
(Taarifa, 2016).
Next to this data collection, Taarifa also visualized these issues and created an interactive map of them.
The interactive map indicates the location of water points and their status; an illustration can be
found in figure 6. Possible states are functional (blue), non-functional (red) or in need of
maintenance (orange). The work done by the Taarifa organisation helps to draw a clear image of
what is happening in Tanzania regarding sanitation. Visually, the geographic representation helps to
pinpoint problem areas where water tends to become scarce. Together with the created dashboards,
it provides a powerful tool for the local authorities to manage and follow up the situation (Taarifa,
2016).
Figure 6 Taarifa geographic mapping of waterpumps and their status (Taarifa, 2016)
Additional information from Taarifa shows that only 54.30% of the water pumps are functional and
an average of 17.24% of the population is covered. Figure 7 is an extract from the Taarifa water
points dashboard. The map on the left tells us something about the percentage of functional pumps
per region. The colour-coding ranges from dark green, when all water pumps in the region are
functional (100%), to dark red, when none of the water pumps are functional (0%). The map clearly
indicates that there are still a lot of water pumps not working. The map on the right, which uses the
same colour-coding, represents the population coverage of clean water. Availability of water to the
population seems to be almost non-existent in some regions (Taarifa, 2016).
Figure 7 Percentage of functional water pumps (left) & Population coverage (right) per region (Taarifa, 2016)
Enabling better communication between citizens and local authorities regarding sanitation has
already helped a great deal to tackle the Tanzanian problems, but much could still
be improved. One next step towards a better world involves the use of data mining. The principles
of data mining could enable a thorough analysis, in this case with the use of a predictive model,
which would bring a twofold benefit. A first benefit a predictive model would bring to
the table is the capacity to act and schedule maintenance before a water pump actually breaks down.
Next to that, characteristics of faulty pumps can be analysed to help explain why pumps fail
(Provost & Fawcett, 2013). Both of these benefits could lead to improved decision making regarding
sanitation.
3.3 Data Understanding & Data Preparation
3.3.1 Data Read-in
The Taarifa platform gathers data from citizens and combines these with data from the Tanzanian
ministry of water. Next to the functional status of the water pump, there is information available
about the location (in terms of longitude, latitude, region, city etc.), the water itself (quality, capacity,
source, extraction type) and the management (operator, funder, payment info). The complete list is
presented in table 3 with a description and an example. This dataset can be downloaded from the
DrivenData website (see footnote 1), which hosts data science competitions ‘to save the world’. It contains
59400 observations and 40 variables excluding the functional state.
Table 3 Available data about water pumps in Tanzania
Variable | Description | Example
amount_tsh | Amount of water available to the water point | 300
date_recorded | Date entered | 2013-02-26
funder | Who funded the well | Germany Republi
gps_height | Altitude of the well | 1335
installer | Organization that installed the well | CES
longitude | GPS coordinate | 37.2029845
latitude | GPS coordinate | -3.22870286
wpt_name | Name of the waterpoint if there is one | Kwaa Hassan Ismail
num_private | Unknown | 0
basin | Geographic water basin | Pangani
subvillage | Geographic location | Bwani
region | Geographic location | Kilimanjaro
region_code | Geographic location (coded) | 3
1 https://www.drivendata.org/competitions/7/
district_code | Geographic location (coded) | 5
lga | Geographic location | Hai
ward | Geographic location | Machame Uroki
population | Population around the well | 25
public_meeting | True/False | True
recorded_by | Group entering this row of data | GeoData Consultants Ltd
scheme_management | Who operates the water point | Water Board
scheme_name | Who operates the water point | Uroki-Bomang'ombe water sup
permit | If the water point is permitted | True
construction_year | Year the water point was constructed | 1995
extraction_type | The kind of extraction the water point uses | gravity
extraction_type_group | The kind of extraction the water point uses | gravity
extraction_type_class | The kind of extraction the water point uses | gravity
management | How the water point is managed | water board
management_group | How the water point is managed | user-group
payment | What the water costs | other
payment_type | What the water costs | other
water_quality | The quality of the water | soft
quality_group | The quality of the water | good
quantity | The quantity of water | enough
quantity_group | The quantity of water | enough
source | The source of the water | spring
source_type | The source of the water | spring
source_class | The source of the water | groundwater
waterpoint_type | The kind of water point | communal standpipe
waterpoint_type_group | The kind of water point | communal standpipe
status_group | The functional state: functional, in need of repairs or non-functional |
3.3.2 Data Exploration, preparation and validation methodology
Without the use of advanced analytics, this stage explores the data and gives a feel of what we have
to work with. Looking at this set of data, it is sometimes still unclear what variables mean or, in some
cases, whether and how some variables differ from each other. It is important to know what variables
represent, so we can ask ourselves if they are necessary or if the data received makes any sense. This
takes up a large chunk of time, and it is often claimed that 80% of the data analysis work goes into
cleaning the data (Wickham, 2014).
3.3.2.1 Common data cleaning problems
Erroneous data
Rahm and Hai Do (2009) provide an overview of possible data cleaning problems. Their overview is
presented in figure 8. Our case, which only uses one data source, shifts the attention to the
‘Single-Source problems’, the left side of that figure. Most errors are data entry errors, for example
misspellings when data entries allow open text as input, or letters in a numbers-only field. In a
previous project, some people refused to provide their telephone number and answered ‘no’ instead,
which is also a perfect example. There is also the issue of clearly impossible values, for example
creation dates that are in the future or negative numbers entered as age. A problem with uniqueness
could be a duplicate in what was supposed to be a unique identifier, for example a family and first
name combination that is entered twice. Or, more applied to this case, a water point name that is
entered twice. If, on top of that, different GPS location data were entered for these 2 water points,
we encounter contradictory values.
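The three kinds of single-source problems described above can be detected with simple checks. The Python sketch below is illustrative only: the four records are hypothetical, the column names are borrowed from table 3, and the function names are assumptions rather than part of the thesis code (which is in R).

```python
from collections import defaultdict
from datetime import date

# Hypothetical records; only the column names come from table 3.
records = [
    {"wpt_name": "Pump A", "construction_year": 1995, "population": 25, "longitude": 37.20, "latitude": -3.22},
    {"wpt_name": "Pump B", "construction_year": 2090, "population": 10, "longitude": 36.10, "latitude": -4.50},
    {"wpt_name": "Pump C", "construction_year": 2001, "population": -4, "longitude": 35.70, "latitude": -5.10},
    {"wpt_name": "Pump A", "construction_year": 1995, "population": 25, "longitude": 35.10, "latitude": -6.80},
]

def impossible_values(rows, today=None):
    """Row indices with a construction year in the future or a negative population."""
    current_year = (today or date.today()).year
    return [i for i, r in enumerate(rows)
            if r["construction_year"] > current_year or r["population"] < 0]

def duplicate_names(rows):
    """Water point names that occur more than once, with the row indices involved."""
    positions = defaultdict(list)
    for i, r in enumerate(rows):
        positions[r["wpt_name"]].append(i)
    return {name: idx for name, idx in positions.items() if len(idx) > 1}

def contradictory_locations(rows):
    """Duplicated names whose rows disagree on the GPS coordinates."""
    return [name for name, idx in duplicate_names(rows).items()
            if len({(rows[i]["longitude"], rows[i]["latitude"]) for i in idx}) > 1]

print(impossible_values(records, today=date(2017, 1, 1)))
print(duplicate_names(records))
print(contradictory_locations(records))
```

Such checks only flag suspicious rows; deciding whether to correct, drop or keep them remains a judgement per variable.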
Figure 8 Classification of data quality problems in data sources (Rahm & Hai Do , 2009)
Missing data
Another common problem is the presence of missing data. Often people are reluctant, or simply not
able, to provide information. In the case of water pumps, missing values are mostly due to the
information not being available. There is no single right way of handling missing data: each variable
with missing values needs to be evaluated on its own. Gilbert A. Churchill and Dawn
Iacobucci (2005) provide 3 basic ways of handling missing data:
1. When confronted with missing data, it is possible to take into account only complete cases,
where none of the variables have missing values. This method discards data that could still
be useful, so it is advised to use it only as a last resort (Churchill & Iacobucci, 2005)
(Gelman & Hill, 2007).
2. Where possible, missing data should be filled in or imputed in order to save instances for
analysis. There are different ways to do this. The easiest way would be to impute the missing
value with the mean, median or mode, but this could distort the distribution of the data and
thus also the relationship between variables. Sometimes, it is also possible to use information
from related variables in order to derive the right value or make an educated guess.
Advanced methods are also possible (Gelman & Hill, 2007) (Churchill & Iacobucci, 2005).
3. Another way is to leave the missing values in, but tag them as missing and put them in a
separate category (Churchill & Iacobucci, 2005).
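The three basic ways listed above can be sketched as follows. This is a minimal Python illustration under assumptions: the four rows are hypothetical, the column names `population` and `permit` are borrowed from table 3, and the helper names are invented for this example.

```python
from statistics import mean, mode

# Hypothetical rows; None marks a missing value.
rows = [
    {"population": 25, "permit": "True"},
    {"population": None, "permit": "False"},
    {"population": 75, "permit": None},
    {"population": 50, "permit": "True"},
]

def complete_cases(rows):
    """Option 1: keep only rows without any missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]

def impute(rows, column, statistic):
    """Option 2: fill missing values of one column with a summary statistic (mean, median, mode)."""
    fill = statistic([r[column] for r in rows if r[column] is not None])
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]

def tag_missing(rows, column, label="unknown"):
    """Option 3: keep the row and put the missing value in its own category."""
    return [{**r, column: label if r[column] is None else r[column]} for r in rows]

print(len(complete_cases(rows)))
print(impute(rows, "population", mean)[1])
print(impute(rows, "permit", mode)[2])
print(tag_missing(rows, "permit")[2])
```

Note how option 1 halves this small data set, which shows why it is advised only as a last resort.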
An advanced method of imputing consists of fitting a model (for example a regression model)
on the non-missing variables that are available to predict a sensible value for the missing items. This
only works under the assumption that just one variable has missing data, because to create a valid
regression model the predictor variables need to be present. To solve the problem of multiple variables
having missing entries, multivariate imputation is used. One way to execute this multivariate approach is
to iteratively assess each variable. A model is created to predict a certain missing variable, and if any
of the predictor variables have a missing value, it is imputed in a cruder manner as discussed
above. This is then repeated until all variables are dealt with. A more complete overview on
how to deal with missing data can be found in the work of Gelman and Hill (2007).
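The single-variable case of model-based imputation can be sketched in Python as follows. It is a simplified illustration, assuming one numeric predictor and a straight-line model; the column names `x` and `y`, the toy values and the function names are hypothetical and do not reflect the thesis implementation.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y = a + b * x."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum((x - x_bar) ** 2 for x in xs)
    return y_bar - b * x_bar, b

def impute_by_regression(rows, target, predictor):
    """Fill missing target values with predictions from a line fitted on the complete rows."""
    complete = [r for r in rows if r[target] is not None and r[predictor] is not None]
    a, b = fit_line([r[predictor] for r in complete], [r[target] for r in complete])
    return [
        {**r, target: a + b * r[predictor]}
        if r[target] is None and r[predictor] is not None else r
        for r in rows
    ]

# Hypothetical data in which y = 2 * x + 1; the last row misses its y value.
rows = [{"x": 1, "y": 3}, {"x": 2, "y": 5}, {"x": 3, "y": 7}, {"x": 4, "y": None}]
imputed = impute_by_regression(rows, target="y", predictor="x")
print(imputed[3])
```

A row that also misses its predictor is left untouched here; in the iterative multivariate approach it would first receive a cruder fill-in value.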
3.3.2.2 Principles of Tidy Data
The principles of Tidy data propose a standard way to organise and structure data, in order to
facilitate its analysis. A dataset is called tidy when each variable forms a column, each observation
forms a row and each type of observational unit forms a table (Wickham, 2014).
In our case, the observational unit is a water pump, which has its own table. In that table, each row
represents one specific pump and each column a specific characteristic of that pump. Because these
criteria are met, we can call our dataset tidy.
3.3.2.3 Assessing statistical relevance
Chi-square
In the data exploration stage, the variables are tested on their dependency with the functional state
of the water pumps (functional, in need of repairs or broken). The functional state is represented in
this data set by the variable status_group. When the variable under investigation is categorical, the
Chi-square test is used. The Chi-square test’s null hypothesis claims there is no association between the
two categorical variables (Churchill & Iacobucci, 2005). A crosstab between these two variables is
used as basis for this test. This leads to the following null hypothesis:
H0: the row variable is independent of the column variable
As an example, table 4 represents a proportional crosstab of the public_meeting variable. The
categories of public_meeting can be found in the rows, the categories of status_group or the functional
state in the columns. The 4th row, GLOBAL, is the global distribution. Of all water pumps 54% are
functional, 7% are in need of repairs and 38% are broken. If the variables public_meeting and
status_group were independent, the same distribution of the status_group would occur over all
public_meeting categories. But this does not seem to be the case. The global percentage of functional
cases is 54%, but if there was no public meeting, only 43% are functional. The distribution is
visualised in figure 9, which makes clear that there are some differences among these groups. The
Chi-square test evaluates whether the measured distribution differs from the distribution that would be
expected if the variables were independent (the global distribution). If the p-value is smaller than 0.05,
the null hypothesis is rejected, which indicates the variables are not independent. The Chi-square test
regarding public_meeting and status_group results in a p-value that rounds to 0.00, which leads us to
conclude the variables are not independent
from each other. The same approach will be used for other categorical variables in this dataset.
Table 4 Proportional Crosstab of status_group and public_meeting
PUBLIC MEETING | FUNCTIONAL | REPAIR | BROKEN | TOTAL
FALSE | 43% | 9% | 48% | 100%
TRUE | 56% | 7% | 37% | 100%
UNKNOWN | 50% | 5% | 45% | 100%
GLOBAL | 54% | 7% | 38% | 100%
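The mechanics of the Chi-square test can be made concrete with a short Python sketch. The counts below are hypothetical: they mimic the FALSE and TRUE proportions of table 4 for an assumed 1000 pumps per row, because the thesis reports percentages only. The function name and the small table of critical values (standard 5% values for 1, 2 and 4 degrees of freedom) are likewise added for illustration.

```python
def chi_square(table):
    """Chi-square statistic and degrees of freedom for a crosstab of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof

# Hypothetical counts. Rows: public_meeting FALSE, TRUE. Columns: functional, repair, broken.
counts = [[430, 90, 480], [560, 70, 370]]
statistic, dof = chi_square(counts)

# Critical values of the chi-square distribution at the 5% level.
CRITICAL_5PCT = {1: 3.841, 2: 5.991, 4: 9.488}
reject_h0 = statistic > CRITICAL_5PCT[dof]
print(round(statistic, 2), dof, reject_h0)
```

With these assumed counts the statistic lies far above the critical value, so H0 would be rejected, in line with the conclusion drawn in the text. The same proportions on a much smaller sample would give a smaller statistic, which is why the test needs counts and not percentages.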
Figure 9 Distribution of status group over Public Meeting values
One-way ANOVA
If the focal variable is numeric, a one-way ANOVA is used. In simple terms, an ANOVA test or
ANalysis Of VAriance test checks if a numeric variable changes due to the effect of a ‘treatment’.
The treatment variable in our case is the status_group. For example, when evaluating the GPS_height
variable, a boxplot over the different functional states reveals that non-functional water pumps may
have a lower GPS_height value (figure 10). The ANOVA test investigates whether this difference is
statistically significant, and if it is, it can be claimed that status_group and GPS_height are not
independent. Table 5 gives a short summary of which methods are used in this thesis to investigate
dependencies, when they are used and which question they answer.
Table 5 Summary of dependency tests
| Statistical test | Dependency of | Description |
|---|---|---|
| Chi-square | 2 categorical variables | Is the relationship between the variables the same as what would be expected of independent variables? |
| One-way ANOVA | A numeric and a categorical variable | Does the numerical variable differ statistically between the 3 functional states? |
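To make the one-way ANOVA more concrete, the Python sketch below computes the F statistic, the ratio of between-group to within-group variance. The height values are invented purely for illustration and the function name is an assumption; turning F into a p-value requires the F distribution, which a statistics package would supply.

```python
def one_way_anova_f(groups):
    """Return the F statistic for a list of numeric groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Variation of the observations around their own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, group_means) for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical GPS heights per functional state (not taken from the dataset).
heights = {
    "functional": [1200, 1350, 900, 1500],
    "needs repair": [1100, 1000, 1250, 950],
    "non functional": [400, 650, 300, 700],
}
print(round(one_way_anova_f(list(heights.values())), 2))
```

A large F value means the group means differ by more than the spread within the groups would suggest.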
3.3.3 Data Exploration, preparation and validation
Now it’s time to get our hands dirty and identify if there are issues present in our case.
Functional state
Figure 10 Boxplot of GPS_Height over status_group
First of all, an evaluation of the functional state. Most of the water pumps are functional (32259 or 54.31%, represented in blue in figure 11). Still, a large portion is non-functional (22824 or 38.42%, in red) and a smaller set is in need of repairs (4317 or 7.27%, in orange). This is the value that we want to predict in a later stage, but in this data exploration phase we can already find out whether there are correlations or relationships between the functional state and other variables.
Amount_tsh
The data description identifies amount_tsh as the amount of water available to the water point. In more pump-technical
terms, it is the ‘total static head’ (expressed in meters). Pumpfundamentals.com claims that “head is a
very useful and practical term to use when evaluating a pump’s capacity to do a job”. Total static
head indicates the height to which the pump can raise water, or again in technical terms: “the
elevation between the surface of the reservoir and the point of discharge into the receiving tank”.
For this reason, a total static head of 0 is impossible, because a pump would then not be needed
(Chaurette, 2016). In most cases (70.10%), however, the value is 0. A possible
explanation is that missing values are represented by 0. A share of 70% missing values is too large to
impute without introducing bias, which is why this variable is excluded from the analysis.
Date_recorded
Almost all water points were recorded between 2010 and 2013; only 31 were not. The date on which the water
pump was entered in the system should not influence its functional state, but
the time of year could play an influential role². Tanzania has two rainy seasons and two dry
seasons. The main rainy season or the ‘long rains’ happens during March, April and May. This is
followed by the long dry season in June, July, August, September and October. In November and
December, there’s a smaller rainy season or the ‘short rains’. January and February are called the
‘short dry season’ (ExpertAfrica, n.d.). This is summarized in table 6. Of the water pumps recorded
during the ‘long rains’ season, 60% were functional, whereas only 50% of those recorded in the ‘short dry’ season
were. Because the season appears to influence our focal variable, it is captured in the newly created
RecordingSeason variable.
² User Dipetkov on the DrivenData forums inspired this approach: https://github.com/dipetkov/DrivenData-PumpItUp/blob/master/transform-data.R
Figure 11 Distribution of Status_group variable
Table 6 Seasons in Tanzania
| Season | Months |
|---|---|
| Short dry season | January, February |
| Long rains | March, April, May |
| Long dry season | June, July, August, September, October |
| Short rains | November, December |
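The mapping in table 6 can be sketched as a simple lookup from the recording month to a season. The snippet is an illustration in Python rather than the R code used for the thesis; the function name and the season labels are assumptions.

```python
from datetime import date

# Month number -> season, following table 6.
SEASON_BY_MONTH = {
    1: "short dry", 2: "short dry",
    3: "long rains", 4: "long rains", 5: "long rains",
    6: "long dry", 7: "long dry", 8: "long dry", 9: "long dry", 10: "long dry",
    11: "short rains", 12: "short rains",
}

def recording_season(recorded):
    """Return the Tanzanian season in which a water point was recorded."""
    return SEASON_BY_MONTH[recorded.month]

print(recording_season(date(2011, 3, 14)))
```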
Funder & Installer
The variables Funder and Installer contain 1898 and 2146 unique values, respectively. After careful
investigation, it seems apparent that those can be grouped into 8 categories. This is represented in table
7, which also gives a quick description and an indication of their size. The greatest hurdle is the lack
of structure in the data entries. There are a lot of typos (Oxfam vs ‘oxfarm’, world bank vs ‘wourld
bak’) and different spellings, which require intensive manual investigation to classify. Most entries
for Funder and Installer are organization names or acronyms, but if a Google search does not reveal
what an entry refers to, it is not possible to classify it, as there are no other means to gain this domain
knowledge. Those unclassifiable cases were placed in the ‘other’ category.
Table 7 Newly created categories for variable funder
| Category | Characteristics |
|---|---|
| Other | When not belonging to any other group, or unable to identify where it should belong |
| Government | All things related to Tanzanian authorities |
| International Investment | Partnerships between Tanzanian and foreign governments |
| Aid | Organizations like Red Cross, Oxfam, Unicef, World Bank… |
| Unknown | Entries like ‘0’, ‘ ’, ‘no’, ‘not known’ or responses that only have 1 character |
| Religious initiatives | Initiatives originating from a religious organisation, for example the Lutheran church in Tanzania |
| Private | Private companies/individuals |
| Community/local efforts | Local NGOs, schools, community funding |
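The classification in this thesis was done manually. As a hedged illustration of how part of the typo problem could be reduced beforehand, the sketch below uses approximate string matching from Python's standard library to map misspelled entries onto a short list of known names. The list of names, the cutoff of 0.8 and the function name are assumptions chosen for the example.

```python
import difflib

# Hypothetical reference spellings; a real list would be much longer.
KNOWN_NAMES = ["oxfam", "world bank", "unicef", "red cross"]

def normalise_name(raw, known=KNOWN_NAMES, cutoff=0.8):
    """Return the closest known spelling, or None when nothing is similar enough."""
    cleaned = " ".join(raw.lower().split())  # lower-case and collapse whitespace
    matches = difflib.get_close_matches(cleaned, known, n=1, cutoff=cutoff)
    return matches[0] if matches else None

for entry in ["oxfarm", "Wourld  Bak", "xyz"]:
    print(entry, "->", normalise_name(entry))
```

Entries that return None would still need the manual investigation described above.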
GPS Height
GPS Height ranges from -90 to 2270. The Nations Encyclopedia states that the lowest point in
Tanzania is around sea level, or 0 meters (Nations Encyclopedia, n.d.), so we would expect a minimum
value of 0 meters. As it is unclear how exactly this variable was measured or obtained, we assume
it was by some sort of GPS, whose altitude readings are inherently imprecise according to gpsinformation.net
(Mehaffrey, 2001). Even though the GPS height information may not be accurate, it still provides an
indication of height and can still be useful.
34.4% of observations have a GPS Height of zero. This indicates that, as encountered before, missing
values are recorded as zero. One way of dealing with missing values is imputing with the mean,
but imputing 34.4% of all observations with a global mean would severely distort the distribution
of GPS Height and its relationships with other variables. A more appropriate approach is to
look at water points that are nearby and derive a more local value to use as
imputation value.
Wikipedia shows the administrative subdivision of Tanzania (Wikipedia, 2016): there are 30 regions, which are
divided into districts and divisions, composed of wards that consist of villages. The variables region,
district, wards and subvillage are present in the dataset. On top of that, there is a variable named
LGA or Local Government Authority, which also groups villages together based on proximity. To
impute a missing value, we start by looking at water pumps in the same subvillage. If there are several
other water pumps in the same subvillage, we can average their GPS height values to impute the
missing one. This is still an imputation by the mean, but a more sophisticated one,
as it is based on nearby water pumps. If there are no other water pumps in the same subvillage, or all of
them also have missing values, the range is broadened to water
pumps in the same ward. If there are still missing values left after this, LGAs and districts are
included as well. After going through this process, 16 missing values remain. These are situated in
only 2 districts (Mpwapwa and Kishapu). Further investigation reveals that those 16 water pumps
are spread over only 3 wards: ‘Gode Gode’, ‘Matomondo’ and ‘Masanga’. To impute the GPS height of these last 16
water pumps, the elevationmap.net tool³ helps to identify the altitude in these wards. An
overview of the missing districts and wards with their altitude values found on elevationmap.net is
provided in table 8. The entire process, with the decrease of missing values at each step, is shown in
table 9.
Table 8 GPS Height: Manual look-up of missing data
| District | Ward | Altitude |
|---|---|---|
| Mpwapwa | Gode Gode | 837 m |
| Mpwapwa | Matomondo | 1091 m |
| Kishapu | Masanga | 1173 m |
Table 9 GPS Height: The process of imputing missing values
| Estimation method | Missing values | Units without data |
|---|---|---|
| Original situation | 20438 (34.41%) | 20438 water pumps |
| Subvillage mean | 15419 (25.96%) | 7265 villages |
| Ward mean | 14735 (24.81%) | 751 wards |
| LGA mean | 13885 (23.38%) | 41 LGAs |
| District mean | 16 (0%) | 2 districts |
| Manual look-up | 0 (0%) | 0 water pumps |
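The stepwise imputation of table 9 can be sketched as follows. This is a minimal Python illustration, not the R code behind the thesis: it assumes that missing heights have already been recoded from 0 to None, that each water pump is a dictionary with the hypothetical keys shown, and that area names are unique enough to be compared directly.

```python
from statistics import mean

# Area levels from narrowest to widest, mirroring the steps of table 9.
LEVELS = ["subvillage", "ward", "lga", "district"]

def impute_gps_height(pumps, levels=LEVELS):
    """Fill a missing gps_height with the mean of the narrowest area that has data."""
    # Only originally observed heights are used as donors.
    known = [p for p in pumps if p["gps_height"] is not None]
    result = []
    for pump in pumps:
        value = pump["gps_height"]
        if value is None:
            for level in levels:
                donors = [k["gps_height"] for k in known if k[level] == pump[level]]
                if donors:
                    value = mean(donors)
                    break  # stop at the narrowest level that has data
        result.append({**pump, "gps_height": value})
    return result

# Hypothetical records for illustration.
pumps = [
    {"subvillage": "A", "ward": "W1", "lga": "L1", "district": "D1", "gps_height": 100},
    {"subvillage": "A", "ward": "W1", "lga": "L1", "district": "D1", "gps_height": None},
    {"subvillage": "B", "ward": "W1", "lga": "L1", "district": "D1", "gps_height": None},
    {"subvillage": "C", "ward": "W1", "lga": "L1", "district": "D1", "gps_height": 200},
    {"subvillage": "Z", "ward": "W9", "lga": "L9", "district": "D9", "gps_height": None},
]
print([p["gps_height"] for p in impute_gps_height(pumps)])
```

Water pumps for which no level has data keep a missing value, which corresponds to the cases that needed a manual look-up.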
Longitude and Latitude
Google Maps⁴ was used to retrieve location data of the Tanzanian border, in order to evaluate whether the
values in the variables longitude and latitude are valid. The border locations used are
‘Lake Tanganyika’ on the left (west) side, ‘Mavago’ on the bottom (south) side, ‘Mtwara’ on the right (east) side and
‘Lake Victoria’ on the top (north) side. The locations, together with their values, are presented in table 10.
³ http://elevationmap.net/#menu2
⁴ https://www.google.be/maps/
Table 10 Longitude and Latitude validation
| Location | Position | Latitude | Longitude |
|---|---|---|---|
| Lake Tanganyika | Left side | -6.091258 | 29.495719 |
| Mavago | Bottom side | -11.732787 | 36.548942 |
| Mtwara | Right side | -10.374224 | 40.372184 |
| Lake Victoria | Top side | -0.961119 | 32.374137 |
These location data show that valid entries should have a latitude between -11.73 and -0.96 and a
longitude between 29.50 and 40.37. The data provided show a latitude between -11.65 and -0.96
and a longitude between 29.61 and 40.35. For both variables, however, zero is again used for missing
values. A longitude of zero is impossible, as it is not situated in Tanzania. The latitude
values of zero all seem to be related to the regions ‘Mwanza’ and ‘Shinyanga’, which definitely
do not have latitude values close to zero. The same method as for imputing the GPS height missing values is
used here and summarized in table 11. At the end of the process, 268 water pumps remain with
missing location data, all situated in the same LGA: ‘Geita’. A manual look-up helps to
identify the right location data to use for imputation.
Table 11 Latitude & Longitude: The process of imputing missing values
| Estimation method | Missing values | Units without data |
|---|---|---|
| Original situation | 1812 (3.05%) | 1812 water pumps |
| Subvillage mean | 1142 (1.92%) | 720 villages |
| Ward mean | 921 (1.55%) | 57 wards |
| LGA mean | 268 (0.45%) | 1 LGA |
| Manual look-up | 0 (0%) | 0 water pumps |
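The validity check on the coordinates amounts to a bounding-box test. The Python sketch below uses the latitude and longitude limits derived from the border locations above; the function name is an assumption, and a bounding box is of course only a rough approximation of the actual border.

```python
# Limits derived from the border locations (rounded as in the text).
LAT_RANGE = (-11.73, -0.96)
LON_RANGE = (29.50, 40.37)

def is_valid_location(latitude, longitude):
    """True when the coordinates fall inside the bounding box around Tanzania."""
    lat_ok = LAT_RANGE[0] <= latitude <= LAT_RANGE[1]
    lon_ok = LON_RANGE[0] <= longitude <= LON_RANGE[1]
    return lat_ok and lon_ok

print(is_valid_location(-6.09, 35.0))  # a point inside the box
print(is_valid_location(0.0, 0.0))     # zeros, i.e. missing values
```

Records that fail this test are the ones treated as missing and imputed with the stepwise procedure.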
Population
36% of water points have a population of 0 to serve, which can be interpreted as a missing value. That is a large portion, but the variable may still aid in the model creation. Population is a highly skewed variable. This skewness explains the unusual box plot in figure 12. Most values are close to zero: excluding the zero entries, the minimum is 1, the mean 281, the median 150 and the maximum 30500. Following the statistical method to identify outliers (values more than 1.5 times the interquartile range beyond the quartiles), there are 7682 outliers, or 13% of all water points. That is a minority, but still a very large portion. However, it is quite possible that a small number of water pumps serve a very large population, if we think about the urban versus countryside population. For this reason, these outliers might be justified and removing them would not reflect reality.
Deleting the variable altogether may be a little crude, so imputation may be useful. The imputation is
done by using the median instead of the mean, because the variable is highly skewed.
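Both steps described above, flagging outliers with the 1.5 × IQR rule and imputing missing values with the median, can be sketched in a few lines of Python. The function names and the small example lists are assumptions; note that quartile conventions differ between tools, so outlier counts obtained with another quartile method may differ slightly.

```python
from statistics import median, quantiles

def iqr_outliers(values):
    """Return the values lying more than 1.5 * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    spread = 1.5 * (q3 - q1)
    return [v for v in values if v < q1 - spread or v > q3 + spread]

def impute_zero_with_median(values):
    """Replace zeros (treated as missing) by the median of the non-zero values."""
    fill = median(v for v in values if v != 0)
    return [fill if v == 0 else v for v in values]

# Hypothetical population values for illustration.
print(iqr_outliers([1, 2, 3, 4, 5, 6, 7, 8, 100]))
print(impute_zero_with_median([0, 1, 150, 200, 0, 30500]))
```

In the second example the single very large value barely moves the median, which is the reason for preferring it over the mean for a skewed variable.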
Public meeting
The largest portion of water pumps was approved by a public meeting (51011 or 85.88%). 5055 (or 8.51%) were not, and for 3334 (5.61%) water pumps this variable was missing. This can be seen in figure 13.
Figure 13 Public meeting
Figure 12 Boxplot of population
Permit
Permit has 3 possible values: true (38852), false (17492) or unknown (3056).
Figure 14 Permit
Construction year
The construction year of the water points ranges between 1960 and 2013. 35% have a missing value. In
order to deal with this large portion of missing values, we could opt to impute them with some sort
of mean. Alternatively, we could categorize them as missing, which is done here. 7 buckets are formed. Six of them
each summarize a period of 10 years: 60’s, 70’s, 80’s, 90’s, 00’s and 10’s. A separate category is created to catch
all missing entries and is named accordingly: “missing”. The distribution over the years is shown in
figure 15.
Figure 15 Construction year as factor
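A possible way to derive such a factor is sketched below in Python. It assumes that a missing construction year is stored as 0 or as an empty value, which the text does not state explicitly, and the function name is hypothetical.

```python
def construction_decade(year):
    """Map a construction year to a decade label; 0 or None counts as missing."""
    if not year:
        return "missing"
    decade = (year // 10) * 10
    return f"{decade % 100:02d}'s"

for year in (1960, 1987, 2005, 2013, 0):
    print(year, "->", construction_decade(year))
```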
Extraction type
There are 3 variables regarding extraction type, which display different levels of granularity.
Extraction_type has 21 unique values, some of which are only represented by a very small number of
water points. We could opt to reduce the number of values and group some together, but this has
already been done for us in the Extraction_type_group variable, which has 13 levels. The variable
Extraction_type_class has 7 levels, of which the distribution can be found in figure 16.
The different levels and their differences can be scrutinized in table 12, in
which the different groups are accompanied by their relative size. Some categories contain a very small
portion of the water pumps, and it is clear that the extra division the Extraction_type variable makes is one
too many, splitting already small chunks into even smaller pieces. On top of that, some
summarization can be performed in the Extraction_type_group variable as well. As ‘india mark III’
is so small, we could merge it with ‘india mark II’. The category ‘motor pump’ should also
not be split and is kept as a single category in the Extraction_type_group variable.
Figure 16 Distribution of Extraction_type_class variable
Table 12 Granularity of extraction type
| Class | Group | Type |
|---|---|---|
| Gravity (45%) | | |
| Hand pump (28%) | Afridev (3%) | |
| | India Mark II (4%) | |
| | India Mark III (0.2%) | |
| | Nira/tanira (14%) | |
| | Swn 80 (6%) | |
| | Other Handpump (0.6%) | Other – play pump (0.1%) |
| | | Other – Swn 81 (0.4%) |
| | | Walimi (0.1%) |
| | | Other – mkulima / shinyanga (0.0%) |
| Submersible (10%) | Submersible (10%) | Submersible (8%) |
| | | KSB (2%) |
| Motor pump (5%) | Other motor pump (0.2%) | Climax (0.05%) |
| | | Cemo (0.15%) |
| | Mono (5%) | |
| Rope pump (0.8%) | | |
| Wind powered (0.2%) | Wind powered (0.2%) | Windmill (0.2%) |
Management & Management group
The Management variable contains 12 levels, whereas the Management_group variable has 5. The
distribution of water pumps over the different levels of Management_group is shown in figure 17.
Figure 17 Distribution of water pumps over Management_group
Table 13 Granularity of Management
| Management_group | Management |
|---|---|
| User-group (88%) | VWC (68%) |
| | Water board (5%) |
| | Wua (4%) |
| | Wug (11%) |
| Commercial (6%) | Company (1%) |
| | Private Operator (3%) |
| | Trust (0.1%) |
| | Water authority (2%) |
| Parastatal (3%) | |
| Other (2%) | Other (1.4%) |
| | Other – school (0.2%) |
| Unknown (1%) | |
Based on table 13, I would opt to summarize some of the factors in the Management variable. The
subdivision of ‘Other’ is too small and the ‘Company’, ‘Private Operator’ and ‘Trust’ can be grouped
together.
Scheme management
We encounter 2697 unique scheme names in the scheme_name variable, but they are conveniently
grouped into the scheme_management variable, which only has 13 unique values. The values
encountered in scheme_management are the same as for the variable management discussed in table 13.
That variable has a related variable called management_group, which summarizes management into 5
groups. For the Scheme_management variable, we can create a similar grouping variable. This
newly created variable is called SchemeGroup and its levels are displayed in table 14.
Table 14 SchemeGroup and Scheme management levels
| SchemeGroup | Scheme management |
|---|---|
| User-group (80%) | VWC (62%) |
| | Water board (5%) |
| | Wua (5%) |
| | Wug (9%) |
| Commercial (9%) | Company (2%) |
| | Private Operator (2%) |
| | Trust (0.1%) |
| | Water authority (5%) |
| Parastatal (3%) | |
| Other (1%) | Other (1.3%) |
| | SWC (0.1%) |
| Unknown (7%) | |
Payment & Payment type
These variables keep track of the way payments are made. The variables payment and payment_type are
almost exactly the same, the only difference being the naming of one level: ‘pay when scheme fails’ versus ‘pay
on failure’, which should mean the same. For that reason we only continue with the
payment (and not the payment_type) variable.
Figure 18 Distribution of water pumps over payment variable
Quality group
The quality_group variable describes the quality of the water. For most water points
(86%), the quality is ‘good’, as can be seen at a glance in figure 19. There are no particular
issues with this variable, so it will be left untouched.
Figure 19 Distribution of Quality_group
Quantity
There are 5 different levels in the quantity variable. The variable Quantity_group can be deleted as it is
an exact duplicate. Most water pumps have the label “enough”, as can be seen in figure 20. The
question that immediately arises when looking at these levels is whether water pumps with the
quantity ‘dry’ are related to non-functional water pumps. Looking at the crosstab with
status_group, this is indeed the case (table 15).
Figure 20 Quantity
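Table 15 Quantity Crosstab
| QUANTITY | FUNCTIONAL | REPAIR | BROKEN | TOTAL |
|---|---|---|---|---|
| DRY | 3% | 1% | 97% | 100% |
| ENOUGH | 65% | 7% | 27% | 100% |
| INSUFFICIENT | 52% | 10% | 38% | 100% |
| SEASONAL | 57% | 10% | 32% | 100% |
| UNKNOWN | 27% | 2% | 71% | 100% |
| GLOBAL | 54% | 7% | 38% | 100% |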
Table 15 shows the crosstab of quantity and status_group. It displays the distribution of the
status_group variable over the quantity levels. If water pumps are ‘dry’ or the quantity
is unknown, there is a high chance that the water point is broken (97% and 71%
of water pumps, respectively). On the other hand, if the quantity level is ‘enough’, there is a higher chance that the
water point is functional (65% of water pumps).
Source
There are 3 granularities: source, source_type and source_class, having 10, 7 and 3 levels respectively. This
can be seen in table 16. The percentages displayed relate to the total number of water pumps,
which is 59400. The main difference between source and source_type is the split of the ‘borehole’
source type, of which only a very small portion is further identified as ‘hand dtw’, accounting
for just 1% of all water pumps. For this reason, the source variable can be ignored.
Table 16 Granularity of Source
| Class | Type | Source |
|---|---|---|
| Ground water (77%) | Spring (29%) | |
| | Borehole (20%) | Machine dbh (19%), hand dtw (1%) |
| | Shallow well (28%) | Shallow well |
| Surface (22%) | Rainwater harvesting (4%) | |
| | River/lake (17%) | |
| | Dam (1%) | |
| Unknown (0.5%) | Other (0.5%) | Unknown (0.3%), Other (0.4%) |
Water point type
There are 2 variables related to the type of water point: waterpoint_type and waterpoint_type_group. The
first one has 7 levels and the second one has 6. The only difference lies in the subdivision of
‘communal standpipe’ (from the waterpoint_type_group variable) into 2 categories, depending on whether
there is 1 standpipe or more. As this subdivision separates a substantial part of the water pumps (6103
or 10%), the most detailed variable (waterpoint_type) is kept. The distribution is shown in figure 21.
Figure 21 Waterpoint type
Location data
There are several indicators of location in this dataset. Some are related to coordinates, but there are also some
factor variables like Basin, Region and District code. District codes are only provided as numbers and are
thus hard to interpret when doing an analysis. Regions differ more from each other
than basins do. The percentage of functional water pumps ranges from 30% in the Lindi and
Mtwara regions to 68% in the Arusha region, whereas per basin it
ranges from 41% (Lake Rukwa) to 65% (Lake Nyasa). The distribution of these variables can be
found in figures 22 and 23.
Figure 22 Basin
Figure 23 Region
3.3.4 Summary of Data Understanding / Data preparation
It is a long read to go through the data understanding / data preparation stage, so a summary
helps to review what has happened. Due to the hard-to-interpret nature of amount_tsh and the fact
that 70% of its values are missing, this variable was removed from the analysis. A strong manual effort made it
possible to use the funder and installer variables, grouped in 8 categories. Missing values for
GPS_height, latitude and longitude were imputed by looking at nearby locations and deriving a sensible
mean to impute them with. A new variable (RecordingSeason) was created that contains the season in
which the water pump was recorded. For a couple of subjects, several variables with different
levels of granularity were available. Each of those divisions was investigated to check whether it makes
sense, deleting variables with too many unique levels and summarizing low-frequency values into
groups. Population has 36% of its values missing, which were imputed by the median. Table 17 shows this
summary in tabular form. All variables seem to have a statistically significant relationship with the focal
variable, the status of the water pumps. This was tested by a chi-square or an ANOVA test,
depending on the variable type.
Table 17 Summary of data handling
| Variable Name | Description of cleaning | Relevance |
|---|---|---|
| Funder | Grouped into 8 categories | |
| Installer | Grouped into 8 categories | |
| GPS height | Missing data imputed with mean of nearby available data | |
| Longitude | Missing data imputed with mean of nearby available data | |
| Latitude | Missing data imputed with mean of nearby available data | |
| Region | Recoding of empty values as “Unknown” | |
| District Code | Read as factor, not as a number | |
| Public Meeting | Recoding of empty values as “Unknown” | |
| Population | Many missing values. As it is highly skewed, impute with median instead of mean. | |
| Scheme & SchemeGroup | Recoding of empty values as “Unknown”. Deletion of scheme_name variable: too detailed. Creation of summarizing variable SchemeGroup. | |
| Permit | Recoding of empty values as “Unknown” | |
| ConstructionYearFactor | Placed in 6 buckets and made categorical, missing values in separate bucket | |
| Extraction Type Group & Extraction Type Class | In three levels with different granularity: type, group and class. Restructuring of Group categories. Deletion of type variable. | |
| Management & Management Group | Recoding of empty values as “Unknown”. Restructuring of Management categories. | |
| Payment | Recoding of empty values as “Unknown”. Deletion of payment_type: duplicates. | |
| Water Quality | Recoding of empty values as “Unknown” | |
| Quantity | Recoding of empty values as “Unknown”. Quantity_group can be deleted as it’s an exact duplicate. | |
| Source | Recoding of empty values as “Unknown” | |
| Water point type | Recoding of empty values as “Unknown” | |
| RecordingSeason | Derived from the Date_recorded variable | |
3.4 Modelling & Modelling Evaluation
3.4.1 Modelling introduction
The aim of this business problem is to increase the efficiency on how to deal with water scarcity in
Tanzania. This was translated into the data mining problem to predict the functional state of a water
pump. For every water pump, the correct class or the probability that an instance can belong to a
class needs to be predicted. In data mining terminology this is called a classification, a class
probability estimation (James, Witten, Hastie, & Tibshirani, 2013) or a supervised segmentation
problem (Provost & Fawcett, 2013). All these terms make sense. It is a supervised problem because
it has a target attribute and training data where the value for the target attribute is known. It is a
segmentation/classification problem because the aim is to segment the data into different groups or
classes. Figure 24 was inspired by the representation of Provost & Fawcett (2013) and gives a peek at how
the data received is structured. Each row represents an instance, in this case a water point. Each
water point has several attributes or characteristics, like Quantity, Region, Quality, etc. The attributes
are to be found in the columns. In a supervised segmentation or classification problem there’s
always a target attribute, in this case the functional state of the water point. In the next couple of
paragraphs, some approaches on how to predict this target attribute will be covered.
Figure 24 Attributes and target attribute representation, inspired by (Provost & Fawcett, 2013)
3.4.1.1 Classification trees
A first approach to predict a target attribute is the use of classification trees. Classification trees
predict a qualitative response, a class to which an instance belongs, by using recursive binary
splitting. This means that in every ‘node’ of a tree, the dataset is split into 2 separate groups
based on the values of a certain variable. The criterion used for the binary split could be the
classification error rate. We assign an observation to the most commonly occurring class among the
training observations in its node; the classification error rate is then the portion of training observations in that node that do not
belong to that class. An alternative is the Gini index as a measure of node purity, in which
small values indicate more purity (James, Witten, Hastie, & Tibshirani, 2013).
Figure 25 is an example of how trees work, applied to our case study. To keep the tree simple,
the functional states ‘in need of repair’ and ‘non-functional’ were merged. The blocks
represent a state. In the initial state, 55% of all water pumps are functional. That is the number
displayed in each block: the percentage of functional pumps. Recall that the possible values for the Quantity
variable are “Enough”, “Insufficient”, “Seasonal”, “Dry” and “Unknown”. If we only look at the
water pumps that are “Dry” or “Unknown”, we notice that only 5% of those water pumps are
functional (right-side branch of the tree). Because 95% of these pumps are non-functional, the tree model
predicts (= assigns the label of the majority) that water pumps with those characteristics are
non-functional, and the node is therefore coloured red. On the other hand, if the Quantity variable is “Enough”,
“Insufficient” or “Seasonal”, the share of functional pumps rises to 61%
(left-side branch). The majority is functional, hence the predicted state is functional
(and the node is therefore coloured blue). The ultimate aim is to arrive at end nodes that are as pure as
possible. Predicting “functional” on a subset of data of which only 61% is functional is not very
accurate. Another binary split can be made using the water point type variable, which helps to obtain
purer end nodes (68% functional vs 30% functional). We can add more and more variables
until we are satisfied with the end node purity.
Figure 25 A Classification tree representation (Based on total population: 59400)
To create a classification model in R, the package ‘rpart’ is used (Therneau, Atkinson, & Ripley,
2015). Its name refers to recursive partitioning, the way classification trees are built. By default, the rpart package
uses the Gini index to select splits.
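The two node purity measures mentioned above can be written down in a few lines. The Python sketch below is only an illustration (the thesis relies on rpart in R); the function names are assumptions, and the class proportions are the percentages quoted in the tree example, with ‘in need of repair’ and ‘non-functional’ merged.

```python
def gini_index(proportions):
    """Gini index of a node: sum of p * (1 - p) over the class proportions."""
    return sum(p * (1 - p) for p in proportions)

def classification_error(proportions):
    """Share of observations that do not belong to the majority class."""
    return 1 - max(proportions)

def weighted_impurity(nodes, impurity):
    """Impurity of a split: child impurities weighted by child size.

    nodes is a list of (number_of_observations, class_proportions) pairs.
    """
    total = sum(n for n, _ in nodes)
    return sum(n / total * impurity(props) for n, props in nodes)

# (functional, not functional) proportions quoted in the tree example.
for name, props in [("root", (0.55, 0.45)), ("dry/unknown", (0.05, 0.95)), ("other", (0.61, 0.39))]:
    print(name, round(gini_index(props), 3), round(classification_error(props), 2))
```

A split is attractive when the weighted impurity of its child nodes is lower than the impurity of the parent node.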
Random Forest
The Random Forest algorithm is a combination of a lot of trees (= a forest) with random feature
selection, hence the name. It is considered one of the top performing techniques. Randomness is
induced in two ways. First, at each split a random sample of predictors is chosen as split candidates, which
ensures that the constructed trees do not resemble each other and are less correlated. Second,
randomness is created by ‘Bootstrap Aggregation’, or simply the bagging principle. Bagging
creates different bags or ‘boots’ by randomly sampling with replacement. Each boot creates its own
model, which leads to a decision regarding a certain instance, and these decisions are then combined
by averaging or a majority vote (James, Witten, Hastie, & Tibshirani, 2013). This process is also
presented in figure 26. Several boots (B) are extracted from the data (through random sampling
with replacement), each of which has its own model. The decisions or predictions resulting from those
models (D) are then combined.
Figure 26 Bagging as presented in course material of Advanced Predictive Analytics (personal correspondence with Dirk van den Poel)
To execute this in R, the randomForest package is used (Liaw & Wiener, 2015). It is based on
the work of Leo Breiman, UC Berkeley professor and creator of the Random Forest approach.
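The bagging principle can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the randomForest package: the data are hypothetical, the ‘model’ is a toy that always predicts the most common label in its boot, and the random feature selection at each split is left out.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) items from data at random, with replacement."""
    return [rng.choice(data) for _ in data]


def majority_vote(predictions):
    """Return the label predicted most often."""
    return Counter(predictions).most_common(1)[0][0]


def fit_majority_model(sample):
    """Toy 'model': always predicts the most common label in its sample."""
    label = majority_vote([y for _, y in sample])
    return lambda x: label


rng = random.Random(42)
# Hypothetical training data: (feature, label) pairs
data = [(i, "functional" if i % 3 else "non-functional") for i in range(30)]

# One model per boot, then combine the individual decisions by majority vote
models = [fit_majority_model(bootstrap_sample(data, rng)) for _ in range(25)]
votes = [model(0) for model in models]
bagged_prediction = majority_vote(votes)
```

Because every boot is a different random sample, the individual models differ slightly, and the vote smooths out their individual mistakes.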
3.4.1.2 Logistic Regression
A logistic regression can be seen as a regression with a categorical dependent variable. Instead
of predicting a numeric value, it predicts the probability that a certain instance belongs to a class.
Linear regression would try to draw a straight line through the observations, but as they only take a
value of 0 or 1, a straight line fits them poorly. It also allows for predictions below 0 or above 1, which are
impossible when talking about probabilities. This is illustrated in the left graph of figure 27. The
logistic model ensures that the predicted probability ranges between 0 and 1 by using the logistic function (the inverse of the logit), as shown
in the right part of figure 27 (James, Witten, Hastie, & Tibshirani, 2013). The R implementation of
logistic regression, and by extension of generalized linear models, does not need any external R
packages.
Figure 27 Linear and logistic regression
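The difference between the two graphs can be reproduced numerically. The Python sketch below uses hypothetical coefficients, chosen only for illustration, and compares the raw linear prediction with the same value passed through the logistic function.

```python
import math


def logistic(z):
    """Map any real number to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def linear_prediction(x, intercept, slope):
    """Straight-line prediction, which is not bounded to [0, 1]."""
    return intercept + slope * x


# Hypothetical coefficients, chosen only for illustration
intercept, slope = -0.2, 0.3

for x in (-5, 0, 2, 5):
    z = linear_prediction(x, intercept, slope)
    print(f"x={x:>2}  linear={z:+.2f}  logistic={logistic(z):.3f}")
```

The linear column leaves the 0 to 1 range at both ends, while the logistic column always stays inside it.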
Boosting with Logistic regression
Boosting is also one of the top performing techniques. Like Random Forest, it combines several models,
but this time they are built sequentially: each model depends on the previous one. Misclassified
instances get a higher weight in the next iteration and, in the end, the predictions of all created models are
combined by averaging or a majority vote into one prediction (James, Witten, Hastie, & Tibshirani,
2013). Figure 28 captures this method. From the data, a first model (T1) is created with equal
weights assigned to all instances (a), which leads to a first prediction (D1). Misclassified instances
influence the weights used in the next iteration of the model (a2), which again outputs a prediction.
This process is repeated until the result is satisfactory. All the models’ predictions are then combined through
averaging or majority voting into a final decision.
Figure 28 Sequential Boosting (personal correspondence with Dirk van den Poel)
The boosting approach can be used with different algorithms. In this case, we chose to combine it with
logistic regression, to build further upon the algorithms that are already explained. The ‘ada’
package was used to execute this. It is based on Additive Logistic Regression: A Statistical View of
Boosting by Friedman, et al. (2000) (Culp, Johnson, & Michailidis, 2016).
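The reweighting step can be illustrated with a short Python sketch. It follows the classic discrete AdaBoost update, which is an assumption made here for illustration; the exact scheme used inside the ada package may differ. The five instances and the single misclassification are hypothetical.

```python
import math


def reweight(weights, misclassified):
    """One discrete-AdaBoost round: raise the weight of misclassified instances.

    weights       -- current instance weights, summing to 1
    misclassified -- list of booleans, True where the current model was wrong
    Returns the model's vote weight (alpha) and the normalised new weights.
    """
    error = sum(w for w, wrong in zip(weights, misclassified) if wrong)
    if not 0.0 < error < 1.0:
        raise ValueError("weighted error must lie strictly between 0 and 1")
    alpha = 0.5 * math.log((1.0 - error) / error)
    updated = [w * math.exp(alpha if wrong else -alpha)
               for w, wrong in zip(weights, misclassified)]
    total = sum(updated)
    return alpha, [w / total for w in updated]


# Five instances with equal starting weights; the first model gets one wrong
weights = [0.2] * 5
alpha, new_weights = reweight(weights, [True, False, False, False, False])
print(round(alpha, 3), [round(w, 3) for w in new_weights])
```

After the update, the misclassified instance carries half of the total weight, so the next model in the sequence is pushed to get that instance right.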
3.4.2 Three way classification approach
One vs all classification
The value to predict is the state of the water pump, which can be ‘functional’, ‘functional but in need
of repairs’ or ‘non-functional’. Traditional classification focuses on binary problems, but this binary
approach can be used creatively to address this problem as well. Probability estimates for each class
can be obtained using binary classification. Once this is done, these estimates are compared and the
class for which the highest probability is found will be chosen as predicted value (Lin, Weng, & Wu,
2004).
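As a minimal sketch of this comparison step, the Python snippet below takes the probability estimates of three separate binary models for one water pump and picks the class with the highest estimate. The probabilities are hypothetical. Because they come from separate models, they do not need to sum to 1.

```python
def one_vs_all_predict(class_probabilities):
    """Pick the class whose binary model returned the highest probability."""
    return max(class_probabilities, key=class_probabilities.get)


# Hypothetical outputs of three binary models for a single water pump
pump = {
    "functional": 0.35,
    "functional needs repair": 0.15,
    "non functional": 0.60,
}
predicted_state = one_vs_all_predict(pump)
print(predicted_state)
```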
All-in-one classification
Some algorithms are able to handle the predictions of multiple classes. In this thesis, it is unofficially
called all-in-one classification in contrast with the one vs all approach. Among those are also the
tree-based methods (decision tree and Random Forest), support vector machine (Angulo, Xavier, &
Catala, 2003) algorithms and linear discriminant analysis (Li, Zhu, & Ogihara, 2006).
3.4.3 Modelling evaluation
In evaluating a classification model’s performance, the most common approach for assessing the
accuracy is the error rate, or the proportion of mistakes that are made (James, Witten, Hastie, &
Tibshirani, 2013). This is the complement of the percentage correctly classified (also called accuracy or
classification rate), which is used in the DrivenData competition to evaluate performance. On its own, this
measure is usually too simplistic to evaluate the total performance of a model (Provost & Fawcett,
2013).
ROC & AUC
Figure 29 An example of a ROC curve
A Receiver Operating Characteristics (ROC) graph plots the true positive rate against the false
positive rate. It represents the relative trade-offs the model makes. The true positive rate
is sometimes called the hit rate: the percentage of actual positives the model gets right. The false positive rate is
also called the false alarm rate: the percentage of actual negatives the classifier gets wrong. If the
classifier is doing well, the true positive rate (hit rate) will increase rapidly and the area under the
curve will be large. Thus, this representation also takes the types of successes and errors into
account. Figure 29 provides an example of some ROC curves, with the sensitivity or true positive rate on the
y-axis and the false positive rate (1 - specificity) on the x-axis. The ROC graph is a stepwise graph. All
observations are ranked according to their predicted probability of belonging to the positive class (highest probability
observations come first). Following this ranking, the observations are evaluated one by one to check
whether they actually belong to that class. Starting from the bottom left of the
graph, the ROC curve goes up if the observation is indeed a positive. If the model ranks it as
a positive but in reality it is not, this counts as a false alarm and the curve ‘grows’ to the right.
This goes on until all observations are evaluated. The grey diagonal line represents a random
classifier (Provost & Fawcett, 2013).
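The stepwise construction described above can be sketched in Python as follows. The scores and true classes are hypothetical, and the sketch makes two simplifying assumptions: both classes are present and no two observations share the same score.

```python
def roc_points(scores, labels):
    """Return the (false positive rate, true positive rate) points of a ROC curve.

    scores -- predicted probability of the positive class per observation
    labels -- 1 for an actual positive, 0 for an actual negative
    Assumes both classes are present; tied scores are not treated specially.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in ranked:
        if label == 1:
            tp += 1   # hit: the curve moves up
        else:
            fp += 1   # false alarm: the curve moves to the right
        points.append((fp / negatives, tp / positives))
    return points


# Hypothetical scores and true classes
scores = [0.9, 0.8, 0.7, 0.6]
labels = [1, 0, 1, 0]
curve = roc_points(scores, labels)
print(curve)
```

The curve always starts in the bottom left corner and ends in the top right corner. What distinguishes a good classifier is how quickly it climbs before it moves to the right.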
When comparing different models, it is desirable to have a single figure to evaluate them on
(Bradley, 1997). The area under the ROC curve (AUC) is such a single measure, used as a summary
of the performance of a model (Provost & Fawcett, 2013). The area under the curve will be large
(closer to 1) if a good classifier is used. If the classifier is no better than random guessing, the area
under the curve will be close to 0.5. The AUC represents the probability that a randomly chosen
positive example receives a higher predicted probability of belonging to the positive class than a randomly chosen
negative example (Bradley, 1997). Thus the AUC can be seen as a more sophisticated way of
evaluating a model through a single value.
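This probabilistic reading of the AUC can be checked directly by comparing every positive with every negative. The Python sketch below does so on hypothetical scores. Counting ties as half a pair is a common convention that is assumed here.

```python
def auc_by_pairs(scores, labels):
    """Fraction of (positive, negative) pairs in which the positive scores higher.

    Ties count as half a correctly ordered pair.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    ordered = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                ordered += 1.0
            elif p == n:
                ordered += 0.5
    return ordered / (len(positives) * len(negatives))


# Hypothetical scores and true classes
scores = [0.9, 0.8, 0.7, 0.6]
labels = [1, 0, 1, 0]
print(auc_by_pairs(scores, labels))  # 3 of the 4 pairs are ordered correctly
```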
Lift
In simple terms, the lift evaluates how many times better the model predicts than random. Figure
30 helps to grasp this concept. The left graph is a cumulative response graph. It plots the percentage
of positives that are correctly identified against the percentage of observations evaluated. Again, the straight
diagonal represents random selection: if 40% of all instances are picked at random, about 40%
of the positives will have been found. This way of looking at a classifier allows us to see how it is doing
compared to a random classifier. To express how many times better the model does, the lift measure
is used, shown in the right graph of figure 30. It takes the value of the cumulative response curve
and divides it by the value on the random diagonal. Because the
observations are ranked by probability, the observations with the highest certainty of belonging to a
class get evaluated first, resulting in the highest lift values (Provost & Fawcett, 2013).
Figure 30 Cumulative response curve and lift
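Both curves can be computed from a ranked list of true classes. The Python sketch below is an illustration on hypothetical data, in which 3 of the 10 observations are positives and the list is already sorted from the highest to the lowest predicted probability.

```python
def cumulative_response(ranked_labels, fraction):
    """Share of all positives found in the top `fraction` of the ranking."""
    top = int(round(fraction * len(ranked_labels)))
    return sum(ranked_labels[:top]) / sum(ranked_labels)


def lift(ranked_labels, fraction):
    """How many times better than random the model does at this depth."""
    return cumulative_response(ranked_labels, fraction) / fraction


# Hypothetical true classes (1 = positive), already sorted from the highest
# to the lowest predicted probability
ranked_labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

for fraction in (0.2, 0.5, 1.0):
    print(fraction,
          round(cumulative_response(ranked_labels, fraction), 3),
          round(lift(ranked_labels, fraction), 3))
```

In this example the top 20% of the ranking already contains two of the three positives, which gives a lift of about 3.3. Once all observations are evaluated, the lift is 1 by construction.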
3.4.4 Modelling Approach: Cross-validation
The practical performance or generalization capability of a model is only measured in its
performance on previously unseen data points. Therefore, all evaluation metrics to compare models
should be calculated on a test set or holdout set, rather than on the training set used (James, Witten,
Hastie, & Tibshirani, 2013).
This case study was approached following the outline of figure 31. The data obtained consists of a
training set of 59400 observations and a test set of 14800 observations. A
model is created on one part of the training data, and a separate validation part is used to apply this model and evaluate its
performance (using cross-validation). This is done for several models to identify which one
achieves the best results. This best model is then recreated using all available training data and
applied to the test set.
Figure 31 Modelling approach (personal correspondence with Dirk van den Poel)
In this case study, a 5-fold cross-validation is used, as displayed in figure 32. This means that all
observations are divided into 5 groups (or folds). Each time, a different group is used as test or
holdout set, while the others are used as a training set. This is repeated 5 times. Following such an
approach determines how well a certain technique can be expected to perform on independent data
(James, Witten, Hastie, & Tibshirani, 2013). Several predictions are better than one, if only to make
sure that a single prediction was not just a lucky case. With several predictions, this can be smoothed
out: they can be compared, their average can be computed and their variability assessed (Provost &
Fawcett, 2013).
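The fold mechanics can be sketched in Python as below. The function names and the fixed random seed are assumptions made for illustration; only the number of observations (59400) and the number of folds (5) come from this case study.

```python
import random


def k_fold_indices(n_observations, k, seed=0):
    """Split the indices 0..n-1 into k shuffled folds of near-equal size."""
    indices = list(range(n_observations))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validation_splits(n_observations, k, seed=0):
    """Yield (training indices, holdout indices) once for every fold."""
    folds = k_fold_indices(n_observations, k, seed)
    for i, holdout in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, holdout


# 59400 is the size of the training set in this case study
for training, holdout in cross_validation_splits(59400, k=5):
    print(len(training), len(holdout))
```

Each observation ends up in the holdout set exactly once, so every model is evaluated five times on data it has not seen during training.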
3.4.5 Modelling case
The approach, algorithms and evaluation metrics discussed earlier will now be applied to the case.
The results for the one vs all classification approach are shown in the following table (table 18). The
different models are compared on their AUC. Remember, the one vs all approach predicts all
different classes separately and then combines their results. The metrics were calculated on a 5-fold
cross-validation basis and averaged.
The one vs all models evaluated are the logistic regression, the boosted logistic regression
(adaboost), a classification tree, an ensemble method of bagged trees (RandomForest) and a
variation of this called the RotationForest. The all-in-one classification approach was performed
using classification trees and RandomForest. Other viable options would be a support vector
machine algorithm or a linear discriminant analysis.
Figure 32 An illustration of cross-validation (Provost & Fawcett, 2013)
The one vs all approach has an AUC for each class and the total AUC is calculated as their mean. A
classification rate is also shown, as this is the objective to maximize in the data mining competition.
The Random Forest algorithm is clearly the winner in this case, reaching an AUC of 0.907. The
other approach (all-in-one) lets algorithms figure it out on their own and predicts all classes at once.
Therefore, only the total AUC is provided.
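As a small worked example of how the total AUC in the one vs all approach is obtained, the Python snippet below averages the three per-class AUC values of the logistic regression model reported in table 18. The function name is chosen for illustration.

```python
def macro_average(per_class_auc):
    """Unweighted mean of the per-class AUC values."""
    return sum(per_class_auc.values()) / len(per_class_auc)


# Per-class AUC of the logistic regression model, as reported in table 18
logistic_regression_auc = {"Functional": 0.835, "Repair": 0.794, "Broken": 0.853}
print(round(macro_average(logistic_regression_auc), 3))  # 0.827
```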
The performance of the all-in-one Random Forest model is almost exactly the same as
that of the Random Forest model used in the one vs all approach. To check whether the difference in
performance between the two is statistically significant, the DeLong test is used (Delong, Delong, &
Clarke-Pearson, 1988). This test compares the ROC curves of both models; in this case that
means 3 different tests need to be performed, one for each functional state. The null hypothesis of
this test states that the two ROC curves (and thus the AUCs) are the same. The
DeLong test indicates a statistically significant difference in the prediction of the Functional and
Non-Functional categories between the two approaches using the Random Forest algorithm (p-values of
0.03 and 0.01 respectively). There is no evidence of a difference in the Repair category, as its p-value is
0.51. This test supports the claim that, in this case, the one vs all approach is the winning
approach.
Possible values for an AUC range from 0.5, meaning the model does no better than random guessing, to 1, where
all observations are ranked perfectly. On that scale, an AUC of 0.91 is very good. As this case is
based on an online competition, we can have a look at how other participants are doing and compare.
There are no AUC metrics to compare, but the maximum classification rate that fellow data science
enthusiasts obtained was 0.8285. Our 0.812 is thus very close.
Table 18 One vs all classification results using a 5-fold cross-validation (AUC per class, average AUC and classification rate)

| One vs all classification | Functional | Repair | Broken | Average | Classification Rate |
| --- | --- | --- | --- | --- | --- |
| Logistic regression | 0.835 | 0.794 | 0.853 | 0.827 | 0.744 |
| AdaBoost | 0.850 | 0.842 | 0.873 | 0.855 | 0.756 |
| Tree | 0.749 | 0.500 | 0.775 | 0.675 | 0.718 |
| RandomForest | 0.907 | 0.876 | 0.836 | 0.903 | 0.812 |
| RotationForest | 0.703 | 0.500 | 0.746 | 0.650 | 0.686 |
Table 19 All-in-one classification results using a 5-fold cross-validation

| All-in-one classification | Evaluation (AUC) | Classification Rate |
| --- | --- | --- |
| Random Forest | 0.905 | 0.813 |
| Tree | 0.712 | 0.707 |
3.5 Evaluation
We have created and compared several models. The Random Forest model came out as the winner in
terms of the AUC and classification rate. Besides making predictions, the model can also help to gain
insight into why a water pump is more likely to belong to a certain class. It helps us understand what is
important and what the relationships between the variables are.
3.5.1 Variable importances
Earlier in this thesis, the variables were checked on their statistical relevance to the functional
state of a water pump by a chi-square or one-way ANOVA test. All variables used passed this first
test. Now we want to see which variables were the most influential in the assignment of probabilities
in the Random Forest model.
To do this, a variable importance plot is used. It is created by evaluating the accuracy, the GINI
coefficient or the AUC of the model. The mean decrease in accuracy measures the drop
in accuracy when the information in a variable is removed from the analysis, which the randomForest
package does by randomly permuting the values of that variable (Liaw & Wiener, 2015).
If the accuracy of the model drops severely, this indicates that the variable is a
very important factor in obtaining accurate predictions. The same idea applies to the evaluation
using the GINI coefficient, but here the decrease in GINI is used instead of the accuracy of the model.
The GINI coefficient measures the (im)purity of the nodes
(when thinking of a tree) (James, Witten, Hastie, & Tibshirani, 2013). We are looking at the total
decrease in node impurity from splitting on that particular variable. The more a particular variable
can reduce impurity by splitting on it, the stronger it is (Liaw & Wiener, 2015). Next to the GINI
and accuracy measurements, the AUC can also be used in the same way (Ballings & Van den Poel,
2016). As described in the modelling evaluation part of this case study, the AUC provides a more
balanced view of the performance of a model. For this reason, the focus lies on the variable
importance ranking using the AUC.
Figure 33 shows the top 10 most important variables for these measurements. All three agree on quantity as
the most important variable. The complete list can be found in the appendix.
Figure 33 Variable importances
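The idea behind the mean decrease in accuracy can be sketched in Python by shuffling one column and measuring how much the accuracy drops. This is a simplified illustration, not the randomForest implementation, which works on out-of-bag data per tree. The data and the stand-in model are hypothetical: column 0 fully determines the label and column 1 is pure noise.

```python
import random


def accuracy(model, rows, labels):
    """Proportion of rows for which the model predicts the true label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(model, rows, labels, column, rng):
    """Drop in accuracy after shuffling the values of one column."""
    baseline = accuracy(model, rows, labels)
    shuffled = [r[column] for r in rows]
    rng.shuffle(shuffled)
    permuted = [r[:column] + [v] + r[column + 1:] for r, v in zip(rows, shuffled)]
    return baseline - accuracy(model, permuted, labels)


rng = random.Random(1)
# Hypothetical data: column 0 decides the label, column 1 is pure noise
rows = [[i % 2, rng.random()] for i in range(200)]
labels = [r[0] for r in rows]


def model(row):
    """Stand-in for a fitted classifier; it only looks at column 0."""
    return row[0]


importance_signal = permutation_importance(model, rows, labels, 0, rng)
importance_noise = permutation_importance(model, rows, labels, 1, rng)
print(importance_signal, importance_noise)
```

Shuffling the informative column destroys roughly half of the accuracy, while shuffling the noise column changes nothing, which is exactly the contrast a variable importance plot is meant to reveal.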
3.5.2 Partial dependence
Next to the importance, the effect of a variable on the predicted probabilities can also be
investigated. To do this, a partial dependence plot is created. It shows the marginal effect of a
particular variable on the class probability (Liaw & Wiener, 2015). Partial dependence calculations
can be compared to the coefficients obtained in linear regression, which also give an indication of
importance and of the direction of the relationships between variables. They also help to understand how the
variables contribute to the prediction-making process (Pearson, 2016).
The values plotted using the interpretR R package developed by Ballings and Van den Poel
(2016) are obtained by this formula: $\mathrm{mean}(0.5 \cdot \mathrm{logit}(P_1))$. It looks at the mean
probability of belonging to class one ($P_1$) per value of a variable. The values revolve around 0. If a value is
larger than 0, that factor level or particular value has a positive effect on the probabilities
produced by the model. If the value is smaller than 0, that
variable value has a negative influence on the probability of belonging to class one. Creating partial dependence plots
currently only works for binary approaches (Ballings & Van den Poel, 2016). In this evaluation of
partial dependence, the class ‘functional’ is evaluated versus the other two classes.
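To show how the formula behaves, the Python sketch below applies it to hypothetical predicted probabilities for two factor levels. It only illustrates the formula as quoted above and is not a reimplementation of the interpretR package, whose internal procedure for obtaining the probabilities is not reproduced here.

```python
import math


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def dependence_value(probabilities):
    """mean(0.5 * logit(P1)) over the predicted probabilities of class one."""
    return sum(0.5 * logit(p) for p in probabilities) / len(probabilities)


# Hypothetical predicted probabilities of 'functional' per factor level
predictions = {
    "enough": [0.70, 0.65, 0.80],
    "dry": [0.05, 0.10, 0.02],
}
for level, probabilities in predictions.items():
    print(level, round(dependence_value(probabilities), 2))
```

A probability of exactly 0.5 maps to 0, probabilities above 0.5 give positive values and probabilities below 0.5 give negative values, which is why the plotted values revolve around 0.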
The partial dependence plots come in two types. There’s the ‘factor-type’ plot, which has
horizontal bars, with the possible factor values listed along the vertical axis for easier readability (figure 34). In the
middle of the plot, there’s the zero point. If a bar is close to the zero point, that factor level of
the variable does not influence the probabilities that much. If the bar extends to the left side, that level tends to be
associated with lower probabilities for class one. In other words, it tends to characterize
non-functional water pumps. If the bar shows up at the right side, the level tends to be associated with higher
probabilities of being functional. Also pay attention to the range of values, as it reflects the importance
of that variable. If the range is small, the overall importance is small. The other type of partial
dependence plot is designed for numeric variables. It has the variable range on the x-axis and the
partial dependence values on the y-axis.
An overview of all variables used in the model, their partial dependence plots and an interpretation
can be found in the appendix. Here, we’re going to cover some of the most notable variables.
Quantity
The variable quantity is identified by all different types of measures as the most important variable in
determining the functionality of a water pump: by the variable importance measures based on accuracy,
GINI and the AUC, and also by the classification tree, which selected it as the most valuable variable
to split on. That split is in accordance with the results of the left partial dependence plot in figure 34.
If a water pump is dry or the quantity is unknown, the pump tends not to be functional, which is
straightforward.
Figure 34 Partial dependence plot of Quantity and Payment
Payment
The partial dependence plot of the payment variable (figure 34, right side) shows us it is important
to have some payment conditions. In the absence of a payment method, or when it is not known, water
pumps tend to be faulty. The relationship between the payment types and the tendency to be
functional is now uncovered. The next step would be to figure out why. One might argue that the
distinction between payment and non-payment also translates into the relationship with and behaviour
towards a water pump. If people have to pay for something, they might be more invested and treat it
with more care and respect than if it were a public good offered for free.
Funder & Installer
One curious thing that was uncovered was the influence of the government being the installer or
funder of a water pump. Figure 35 shows that government-sponsored water points are much more likely
to be in a bad state than those of any other sponsor. This might again be an indication of investment in
water points. Community-sponsored water points might receive more care as the community itself
reaps the benefits of a well-functioning water pump. Privately owned water pumps could receive
more attention as they are seen as the owner’s property and a failure would lead to a loss of income in commercial
scenarios. Religious initiatives and aid initiatives are bound by their noble motives to deliver quality
in their work. I would not exactly claim the government is doing a bad job. Most likely the
government has to provide and maintain water pumps in the most difficult circumstances, where
there are no willing alternative sponsors, in the name of countrywide public access and availability of
water.
Figure 35 Partial Dependence Plot of Installer and Funder variables
Construction Year
I expected a more linear relationship between the age of the water pump and its tendency to break
down. Figure 36 does confirm that the most recently installed water pumps, those constructed
after 2000, are less likely to fail than those installed before. But surprisingly, water pumps built in the
1960s seem more durable than those built in the 1970s and 1980s. “In my time, things were built to last”, my
grandfather might say. There might be a number of reasons why this occurs. Maybe different
materials were popular in different eras, or the oldest water pumps were already prioritized for
renovation projects.
Figure 36 Partial Dependence Plot of ConstructionYearFactor variable
3.5.3 Data cleaning evaluation
Some claim 80% of data analysis work goes into data cleaning (Wickham, 2014). I know we have
done our fair share of work on that in this case study. But was it worth it? The performance gain
seems rather marginal. The best performing model using a thoroughly cleaned dataset reached an
AUC of 0.907. Without any data cleaning, an AUC of 0.894 is obtained. The DeLong test (Delong,
Delong, & Clarke-Pearson, 1988) indicates this difference is statistically significant for the Functional
and Broken categories. For the Repair category, there is no statistically significant difference in
performance between a ‘dirty’ and a ‘cleaned’ data set. The test statistics are available in the appendix.
Even though the performance gain is partially statistically significant, it is still only a small gain.
It is most curious that the extensive data cleaning phase is not very rewarding. The supporting
graphs in the appendix go over each variable to figure this out. The graphs compare the importances
given by the mean decrease in GINI and AUC in order to find anomalies that might
explain the phenomenon at hand. This method is not very stable, as the importances depend on all
other variables included in the model creation, which differ between the two data sets. Thus, this
analysis is only a superficial endeavour.
Eyeballing these graphs indicates that grouping the construction_year values into ConstructionYearFactor,
which was one of the data cleaning steps taken, might not have been a good move. In the GINI
importance ranking without cleaning, construction_year (the deleted variable) is even a clear number
one. Some of the importance lost by not including construction_year was recovered by including
ConstructionYearFactor, but the difference still looks substantial. Another striking indication is that
imputing values has no convincing positive effect on the importance of the variables concerned. In fact, when
looking at the AUC importances, it seems their importance decreased through pre-processing.
Again, this approach is rather superficial. A better way to investigate this further is to run the
model with different combinations of variables. For example, create a number of models with all
possible combinations of construction_year and ConstructionYearFactor. Then compare the AUC of the
different models and see if they are statistically different and which combination works best. The
same approach can be applied to the evaluation of different imputation techniques. This iterative
approach is quite time-consuming and computationally heavy and therefore beyond the scope of
this thesis.
3.6 Deployment
The final stage in the CRISP-DM procedure is the deployment phase. It is the phase in which the
analysis is applied for actual use. The result from this analysis is twofold. On the one hand, a model
is created which could be applied in categorizing water pumps. For example, a ranking could be
made of the water pumps most likely to be broken, which can be used to create a priority list of repairs
for local governments to act upon. On the other hand, insight was generated into the characteristics of
water pumps and how they relate to functionality. As I am neither a water pump engineer nor a
Tanzania expert, additional insight could be extracted by adding some domain knowledge.
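Such a priority list could be derived from the model output along the lines of the Python sketch below. The pump identifiers and probabilities are hypothetical, and the function name is chosen for illustration.

```python
def repair_priority(broken_probability, top_n=None):
    """Pump ids sorted from most to least likely to be broken."""
    ranked = sorted(broken_probability, key=broken_probability.get, reverse=True)
    return ranked if top_n is None else ranked[:top_n]


# Hypothetical pump ids with model probabilities of being non-functional
broken_probability = {"pump_a": 0.12, "pump_b": 0.91, "pump_c": 0.47}
print(repair_priority(broken_probability))
```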
The inspiration for this case study came from the DrivenData competition, so the deployment
would come from their end, together with the Taarifa platform. The competition allows for
submissions, which are evaluated on the classification rate. With the model obtained in this
thesis, I was able to get a classification rate of 0.8209, which put me 77th in the ranking. The winning
score has a classification rate of 0.8285.
4 Conclusion
Data mining is a hot topic and it is relevant to our everyday life. Data mining skills are in high
demand and, as this case study illustrates, the field can be very interesting and insightful. This case study
aims to guide the reader through a data mining competition in predicting the functional state of
water pumps, explaining all the steps along the way. The goal is to combine the practical approach
with the theory behind it, so the reader understands what is happening in each phase. For that
reason, literature findings and practical issues are combined within the elaboration of the case study.
The CRISP-DM framework is used as foundation. The business problem is described and
then translated into a data mining problem. Common problems in working with erroneous data, and
their solutions, are covered in the data understanding and data preparation stages. In the modelling
phase, some common simple and advanced algorithms are explained and applied. A common
modelling approach is presented, along with some ways of comparing different models. This
thesis also covers the evaluation phase, in which methods of extracting insight from a model are
explained and applied.
Data mining can indeed be of service in aiding the Tanzanian water activists. Next to providing
practical insight by applying theory to real-life data problems, this thesis also succeeds in creating a
powerful model and a comprehensive variable insight report. Several models were constructed for
this case. The RandomForest algorithm turned out to be the best classifier with an AUC of 0.91 and
a classification rate of 0.8209. This catapulted me to 77th place in the DrivenData competition on
which this thesis is based.
This thesis also studied the effect of the data cleaning process on predictive performance. For this
case, the cleaning efforts improved the AUC to a statistically significant degree for only some of the classes. In absolute
terms, we’re talking about an increase in AUC of roughly 0.01 (from 0.894 to 0.907). Our efforts to investigate the underlying reasons
are superficial and not based on scientific literature; further research could shed more light on this
matter.
The quality of this thesis could be increased if there was more expert input in terms of domain
knowledge. This would have helped in the data understanding and data preparation stages, for instance in evaluating
which values make sense and which are impossible, or in determining whether certain groupings make sense.
For example, the funder and installer variables needed a lot of googling before it was possible to group
their values. Domain knowledge would also be valuable in the evaluation phase. Evaluating regions
or water pump types on their tendency to be functional is hard without knowledge of those regions
or types of water pumps. Apart from this domain knowledge, improvements could have been made
in terms of diversity in modelling algorithms. Comments on the DrivenData forum show the
XGBoost algorithm was very popular and successful, but I was unable to make it work for my
purposes.
5 Bibliography
Agarwal, R., & Dhar, V. (2014). Big Data, Data Science, and Analytics: The Opportunity and
Challenge for IS Research. Information Systems Research, 25(3), 443-448.
Angulo, C., Xavier, P., & Catala, A. (2003). K-SVCR. A support vector machine for multi-class
classification. Neurocomputing, 57-77.
Ballings, M., & Van den Poel, D. (2016, 03 19). interpretR. Retrieved from The Comprehensive R
Archive Network: https://cran.r-project.org/web/packages/interpretR/
Bradley, A. P. (1997). The use of the area under the ROC curve in the evaluation of machine
learning algorithms. Pattern recognition, 30, 1145-1169.
Chapman, P., Clinton, J., Kerber, R., Khabaza, T., Reinartz, T., Shearer, C., & Wirth, R. (2000).
CRISP-DM 1.0 Step-by-step data mining guide. SPSS.
Chaurette, J. (2016). Pressure or head. Retrieved from Pumpfundamentals:
http://pumpfundamentals.com
Chen, H., Chiang, R. H. L., & Storey, V. C. (2012). Business Intelligence and Analytics: From Big Data
to Big Impact. MIS Quarterly.
Churchill, G. A., & Iacobucci, D. (2005). Marketing Research: Methodological Foundations (9th ed.).
South-Western, Thomson.
Culp, M., Johnson, K., & Michailidis, G. (2016). ada: The R Package Ada for Stochastic Boosting.
Retrieved from The Comprehensive R Archive Network:
https://cran.r-project.org/web/packages/ada/index.html
Davenport, T. H., & Harris, J. G. (2007). Competing on analytics. Boston: Harvard Business School.
Davenport, T. H., & Patil, D. (2012). Data Scientist: The Sexiest Job of the 21st Century. Harvard
Business Review.
de Tré, G. (2007). Principes van databases. Pearson Education.
Delong, E. R., Delong, D., & Clarke-Pearson, D. (1988). Comparing the Areas Under Two or More
Correlated Receiver Operating Characteristic Curves: A Nonparametric Approach. Biometrics.
Demirkan, H., & Dal, B. (2014). The Data Economy: Why do so many analytics projects fail?
Analytics Magazine.
Evans, J. R., & Lindner, C. H. (2012). Business analytics: the next frontier for decision sciences.
Decision Line, 43(2), 4-6.
ExpertAfrica. (n.d.). Tanzania Weather and Climate. Retrieved from ExpertAfrica:
https://www.expertafrica.com/tanzania/info/tanzania-weather-and-climate
Gelman, A., & Hill, J. (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models.
Cambridge University Press.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning.
Springer.
Li, T., Zhu, S., & Ogihara, M. (2006). Using discriminant analysis for multi-class classification: an
experimental investigation. Knowledge and information systems, 453-472.
Liaw, A., & Wiener, M. (2015, 10 06). Package 'randomForest'. Retrieved from The Comprehensive R
Archive Network: https://cran.r-project.org/web/packages/randomForest/randomForest.pdf
Lin, C.-J., Weng, R., & Wu, T. (2004). Probability estimates for multi-class classification by pairwise
coupling. Journal of Machine Learning Research, 975-1005.
Madden, S. (2012). From databases to Big Data. IEEE Internet Computing.
Mehaffrey, J. (2001, 10 02). GPS Altitude Readout: How Accurate? Retrieved 12 12, 2016, from
GPSinformation: http://gpsinformation.net/main/altitude.htm
Nations Encyclopedia. (n.d.). Tanzania. Retrieved 12 12, 2016, from Nations Encyclopedia:
http://www.nationsencyclopedia.com/geography/Slovenia-to-Zimbabwe-Cumulative-Index/Tanzania.html
Pearson, R. (2016, 11 23). Interpreting Predictive Models Using Partial Dependence Plots. Retrieved from
The Comprehensive R Archive Network:
https://cran.r-project.org/web/packages/datarobot/vignettes/PartialDependence.html
Provost, F., & Fawcett, T. (2013). Data Science for Business: What you need to know about data mining and
data-analytic thinking. O'Reilly Media, Inc.
Rahm, E., & Hai Do, H. (2009). Data Cleaning: Problems and Current Approaches. University of Leipzig.
Robin, X., Turck, N., Hainard, A., Tiberti, N., & Lisacek, F. (2015). pROC: Display and Analyze ROC
Curves. Retrieved from The Comprehensive R Archive Network:
https://cran.r-project.org/web/packages/pROC/index.html
Sagiroglu, S., & Sinanc, D. (2013). Big data: A review. Collaboration Technologies and Systems (CTS) (pp.
42-47). IEEE.
Satinderpal, S. E., Sheilly, P. E., & Kaur, J. E. (2012). A new insight into data mining. International
Journal of Engineering Research and Applications (IJERA), 586-589.
Shearer, C. (2000). The CRISP-DM Model: The New Blueprint for Data Mining. Journal of data
warehousing, 13-22.
Taarifa. (2016, 09 23). Tanzania water points. Retrieved from Taarifa:
http://dashboard.taarifa.org/#/dashboard
Taarifa. (2016, 09 23). What is Taarifa. Retrieved from Taarifa: http://taarifa.org/
Therneau, T., Atkinson, B., & Ripley, B. (2015). rpart. Retrieved from The Comprehensive R
Archive Network: https://cran.r-project.org/web/packages/rpart/
Tsoumakas, G. &. (2006). Multi-label classification: An overview. Dept. of Informatics, Aristotle
University of Thessaloniki, Greece.
Vale, S. (2013). Classification of Types of Big Data. UNECE.
WaterAid. (2016, 09 23). Tanzania. Retrieved from WaterAid:
http://www.wateraid.org/where-we-work/page/tanzania
Watson, H. J. (2010). BI-based Organizations. Business Intelligence Journal(15), 4-6.
Wickham, H. (2014). Tidy Data. Journal of Statistical Software.
Wikipedia. (2016, October 19). Subdivisions of Tanzania. Retrieved from Wikipedia:
https://en.wikipedia.org/wiki/Subdivisions_of_Tanzania
Wirth, R., & Hipp, J. (2000). CRISP-DM: Towards a standard process model for data mining.
Proceedings of the 4th international conference on the practical applications of knowledge discovery and data
mining, 29-39.
6 Appendix
6.1 Appendix: Data understanding / preparation stage elaboration
6.1.1 Packages used
The data understanding and preparation part of the case study mainly involves reading in and
manipulating data. The data was provided as a CSV file. Data manipulation was done using parts
of the data.table, dplyr and tidyr packages. Graphs, mainly histograms, bar plots and box plots, were
produced with the ggplot2 package. An overview of these packages and their sources is provided
in table 20.
Table 20 Data exploration and preparation packages

| Package name | Use | Source |
|---|---|---|
| data.table | Data manipulation | https://cran.r-project.org/web/packages/data.table/index.html |
| dplyr | Data manipulation | https://cran.r-project.org/web/packages/dplyr/index.html |
| tidyr | Data manipulation | https://cran.r-project.org/web/packages/tidyr/index.html |
| ggplot2 | Graph-making | https://cran.r-project.org/web/packages/ggplot2/index.html |
6.1.2 Amount_tsh
The description identifies amount_tsh as the amount of water available to the water point. In more
pump-technical terms, this is the ‘total static head’ (expressed in meters). Pumpfundamentals.com states that
“head is a very useful and practical term to use when evaluating a pump’s capacity to do a job”. Total static
head indicates the height to which the pump can raise water, or again in technical terms: “the
elevation between the surface of the reservoir and the point of discharge into the receiving tank”.
A total static head of ‘0’ should therefore not occur, because a pump would then not be needed.
In most cases (70.10%), however, the value is 0. A possible explanation is
that missing values are represented by 0.
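If one accepts the explanation that a 0 stands for a missing reading, the share of zeros can be measured and the zeros recoded before further analysis. The following is a minimal Python sketch of that idea, not the code used in this thesis (which relied on R); the readings and both function names are hypothetical.

```python
def share_of_zeros(values):
    """Fraction of entries that are exactly 0."""
    return sum(1 for v in values if v == 0) / len(values)


def zeros_to_missing(values):
    """Replace 0 by None so later steps can treat it as a missing value."""
    return [None if v == 0 else v for v in values]


# Hypothetical amount_tsh readings, not taken from the competition data
amount_tsh = [0, 0, 50.0, 0, 500.0, 0, 0, 20.0, 0, 0]

print(share_of_zeros(amount_tsh))    # 0.7
print(zeros_to_missing(amount_tsh))
```

Recoding to an explicit missing marker keeps the impossible value 0 from being read as a real measurement by a model.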
6.1.3 Date Recorded
The recording date should be unrelated to the functional state of the water pump, but it allows us to derive
seasonal effects. The recording dates were grouped into 4 seasons: LongDry, LongRains, ShortDry,
ShortRains. Cross-tabulating season with functional state shows some interaction (table 21). Recordings in
the LongRains season show a share of functional water pumps up to 10 percentage points higher than in the
other seasons. The chi-square test in table 22 indicates that the 2 variables are not independent.
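The grouping of recording dates into seasons can be sketched as a simple month lookup. This is an illustrative Python sketch rather than the R code used for the thesis. The month boundaries are an assumption, since the exact cut-offs are not stated here: long rains March to May, long dry June to October, short rains November to December, short dry January to February.

```python
from datetime import date

# Assumed month boundaries (not stated in the text above)
SEASON_BY_MONTH = {
    1: "ShortDry", 2: "ShortDry",
    3: "LongRains", 4: "LongRains", 5: "LongRains",
    6: "LongDry", 7: "LongDry", 8: "LongDry", 9: "LongDry", 10: "LongDry",
    11: "ShortRains", 12: "ShortRains",
}


def recording_season(recorded: date) -> str:
    """Map a recording date to one of the four season labels."""
    return SEASON_BY_MONTH[recorded.month]


print(recording_season(date(2013, 4, 17)))  # LongRains
```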
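Table 21 Crosstab of season and status_group

| Season | Functional | Repair | Broken | Total |
|---|---|---|---|---|
| ShortDry | 50% | 9% | 40% | 100% |
| LongRain | 60% | 6% | 34% | 100% |
| LongDry | 51% | 7% | 42% | 100% |
| ShortRain | 52% | 5% | 43% | 100% |
| Global | 54% | 7% | 38% | 100% |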
Table 22 Chi-square test of season

| Test | X-squared | Degrees of freedom | p-value | Relevance |
|---|---|---|---|---|
| Chi-squared | 553 | 6 | 0.00 | |
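The chi-square statistic and its degrees of freedom follow directly from a table of counts. The Python sketch below illustrates the calculation; it is not the thesis code, and the counts are hypothetical, since only row percentages are reported in table 21. With 4 seasons and 3 status groups the degrees of freedom are (4 - 1) x (3 - 1) = 6, which matches table 22.

```python
def chi_square(table):
    """Return (statistic, degrees_of_freedom) for a contingency table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# Hypothetical counts: rows = seasons, columns = functional / repair / broken
counts = [
    [500, 90, 400],
    [600, 60, 340],
    [510, 70, 420],
    [520, 50, 430],
]
stat, dof = chi_square(counts)
print(round(stat, 2), dof)
```

The p-value is then read from the chi-square distribution with `dof` degrees of freedom, which a statistics package would normally provide.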
6.1.4 Funder & Installer
6.1.4.1 Categorization
The variables Funder and Installer contain 1898 and 2146 unique values respectively. After careful
investigation, these can be grouped into 8 categories. This is represented in table
4, which also gives a short description and an indication of their size. The greatest hurdle is the lack
of structure in the data entries. There are a lot of typos (Oxfam vs ‘oxfarm’, world bank vs ‘wourld
bak’) and different spellings, which require intensive manual investigation to classify. Most entries
for Funder and Installer are organization names or acronyms, but if a Google search does not reveal
what an entry refers to, it is not possible to classify it. Those unresolvable cases were placed in the ‘other’
category.
The categorization was executed based on heuristics (entries containing certain words) and manual
look-up of frequently encountered organisations. The summary in table 23 shows which
words triggered a certain group and which organization acronyms belong to a certain group. Figure 37
then illustrates the distribution of the Funder and Installer variables over these newly
formed groups.
Table 23 Categorisation method for Installer and Funder variable

| Category | Manual look-up | Heuristic |
|---|---|---|
| Other | | When not belonging to any other group, or unable to identify where it should belong |
| Government | LGA: Local government authority; DWE: District Water Engineering; DWSP: Domestic Water Supply; TASAF: Tanzanian social action fund; RWSSP: Rural water supply and sanitation programme; WSDP: Water sector development programme; DMDD: Diocese of Mbulu Development Department | Every entry containing "gov", "government", "council", "ministry", "goverm", "agency", "district water depar", "department", "tanzania" |
| International Investment | HIFAB: Swedish project management consultants; NORAD: Norwegian agency for development; HESAWA: Swedish-Tanzanian cooperation; DANIDA: Danish-Tanzanian cooperation; RUDEP: rural development programme, Norwegian initiative; CES(GMBH): Consulting Engineers Salzgitter GmbH (CES); JICA/JAICA: Japan international cooperation agency | Every entry containing "italian", "japan", "german", "korea", "niger", "frankfurt", "british", "netherlands", "embassy", "u.s.a", "european union", "holland", "international", "africa", "finland", "unesco", "irish", "Greec", "swisland", "imf", "china", "swedish" |
| Aid | Red cross, Oxfam, unicef, world bank, world vision; ADB: African Development bank; AMREF: Amref flying doctors; ADRA: ngo of italy; ACRA: Community development and emergency relief | Every entry containing "aid" |
| Unknown | | Entries like ‘0’, ‘ ‘, ‘no’, ‘not known’ or responses that only have 1 character |
| Religious initiatives | KKKT: Kanisa la Kiinjili la Kilutheri Tanzania, Lutheran church in Tanzania; TCRS: Tanganyika Christian Refugee Service | Entries containing "church", "catholic", "muslim", "missionary" |
| Private | | Entries containing "Private", "private company", "private individual" |
| Community/local efforts | SHIPO: ngo in Tanzania; TWESA: ngo in Tanzania; SEMA: ngo in Tanzania | Entries containing "village", "municipal", "local", "community" |
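The keyword heuristics of table 23 can be sketched as an ordered list of rules. The Python sketch below is only an illustration, not the thesis's R implementation: it uses a subset of the keywords, omits the manual acronym look-up, and the order in which the rules are tried is an assumption, because the table does not say which category wins when several keywords match.

```python
# Subset of the keywords from table 23; the rule order is an assumption
RULES = [
    ("Government", ["gov", "council", "ministry", "department", "tanzania"]),
    ("Religious initiatives", ["church", "catholic", "muslim", "missionary"]),
    ("Private", ["private"]),
    ("Community/local efforts", ["village", "municipal", "local", "community"]),
    ("Aid", ["aid"]),
]
UNKNOWN_VALUES = {"0", "", "no", "not known"}


def categorise(entry: str) -> str:
    """Assign a funder or installer entry to a category by keyword matching."""
    text = entry.strip().lower()
    if text in UNKNOWN_VALUES or len(text) == 1:
        return "Unknown"
    for category, keywords in RULES:
        if any(keyword in text for keyword in keywords):
            return category
    return "Other"


for name in ["Government of Tanzania", "Roman Catholic Church", "0", "Kiliwater"]:
    print(name, "->", categorise(name))
```

Substring matching also catches truncated spellings such as "goverm", which is why short stems like "gov" are used; the price is an occasional false match, so a manual check of the largest groups remains useful.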
Figure 37 Distribution of funder and installer variables
6.1.4.2 Statistical relevance
As installer and funder are categorical variables, the chi-square test is used. Both tests indicate that
these variables are relevant, in the sense that they are not independent of the status_group variable
(table 24).
Table 24 Chi-square evaluation of Funder and Installer

| Test | X-squared | Degrees of freedom | p-value | Relevance |
|---|---|---|---|---|
| Funder | 1240.6 | 14 | 0.00 | |
| Installer | 738.36 | 14 | 0.00 | |
6.1.5 GPS Height
The imputation of missing values was covered in the main part of this work. All that remains is to
check the statistical relevance of this variable in relation to the focal variable status_group. To do this,
a one-way ANOVA is used, which checks whether the mean GPS Height differs between the
functional states of a water pump. The 3 boxplots in figure 38 portray these distributions. The GPS
Height values of non-functional water points seem to be a little lower than those of functional water
pumps, but is this difference statistically significant? Table 25 presents the test statistics. A p-value of 0.00
leads us to reject the null hypothesis that the mean GPS Height is the same for all functional states. A
post hoc comparison between groups was executed to investigate this relationship further. As table
26 shows, only the difference between water pumps in need of repair and functional pumps is not
statistically significant. So we can claim that the GPS Height of non-functional water pumps is
statistically significantly lower than that of functional water pumps (either fully or partially functional).
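Table 25 One-way ANOVA of GPS Height

| Source | Degrees of freedom | Sum of squares (SSE) | Mean square (MSE) | F | p-value | Relevance |
|---|---|---|---|---|---|---|
| Status group | 2 | 263963882 | 131981941 | 490.4 | 0.00 | |
| Residuals | 59397 | 15985897658 | 269136 | | | |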
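The F statistic in table 25 can be re-derived from the sums of squares and degrees of freedom in the same table: each mean square is a sum of squares divided by its degrees of freedom, and F is the ratio of the two mean squares. The short Python sketch below only checks this arithmetic; the function name is hypothetical and it is not the thesis's R code.

```python
def anova_f(ss_between, df_between, ss_within, df_within):
    """Return (mean square between, mean square within, F) for a one-way ANOVA."""
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return ms_between, ms_within, ms_between / ms_within


# Values taken from table 25 (status group and residuals rows)
ms_b, ms_w, f = anova_f(263963882, 2, 15985897658, 59397)
print(round(ms_b), round(ms_w), round(f, 1))  # 131981941 269136 490.4
```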
Table 26 Post hoc comparison using TukeyHSD for GPS Height

| Comparison | Difference | p-value | Relevance |
|---|---|---|---|
| Functional needs repair - Functional | -4.352513 | 0.86 | |
| Non Functional - Functional | -137.542611 | 0.00 | |
| Non Functional - Functional needs repair | -133.190099 | 0.00 | |
Figure 38 GPS Height distribution per status group
6.1.6 Longitude and Latitude
The handling of these variables is similar to that of GPS Height. The same method of imputing missing
values with the mean based on proximity is applied here, as described in the main text. The same
procedure is again used to assess relevance. A visual investigation is supported by
the boxplots of figure 39. To verify whether we can indeed claim there is a difference between the
groups, an ANOVA test needs to be conducted. The results of this test can be found in the
following tables.
The ANOVA tests for both latitude and longitude indicate a difference between status groups (table
27 and table 29). Looking at latitude, all status groups are
significantly different from each other (table 28). For longitude, only the difference between
functional and non functional is not significant (table 30).
Figure 39 Distribution of latitude and longitude per status group
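Table 27 One-way ANOVA of Latitude

| Source | Degrees of freedom | Sum of squares (SSE) | Mean square (MSE) | F | p-value | Relevance |
|---|---|---|---|---|---|---|
| Status group | 2 | 660 | 330.1 | 41.89 | 0.00 | |
| Residuals | 59397 | 468049 | 7.9 | | | |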
Table 28 TukeyHSD multiple comparison test of Latitude

| Comparison | Difference | p-value | Relevance |
|---|---|---|---|
| Functional needs repair - Functional | 0.3166800 | 0.00 | |
| Non Functional - Functional | -0.1033620 | 0.00 | |
| Non Functional - Functional needs repair | -0.4200421 | 0.00 | |
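Table 29 One-way ANOVA of Longitude

| Source | Degrees of freedom | Sum of squares (SSE) | Mean square (MSE) | F | p-value | Relevance |
|---|---|---|---|---|---|---|
| Status group | 2 | 3048 | 1524.2 | 229.7 | 0.00 | |
| Residuals | 59397 | 394129 | 6.6 | | | |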
Table 30 TukeyHSD multiple comparison of Longitude

| Comparison | Difference | p-value | Relevance |
|---|---|---|---|
| Functional needs repair - Functional | -0.84798889 | 0.00 | |
| Non Functional - Functional | 0.04855053 | 0.07 | |
| Non Functional - Functional needs repair | 0.89653942 | 0.00 | |
6.1.7 Public meeting
Figure 40 visualises the relation between Public Meeting and Status Group.
It shows that the distribution differs over the classes of Public Meeting. To investigate this further,
a chi-square test is used. The result of the test (table 31) gives us reason to believe public meeting
and status group are not independent.
Table 31 Chi-square results for public meeting

| Test | X-squared | Degrees of freedom | p-value | Relevance |
|---|---|---|---|---|
| Chi-squared | 384 | 4 | 0.00 | |
Figure 40 Distribution of status_group of Public Meeting
6.1.8 Permit
Permit has 3 possible values: true (38852), false (17492) or unknown (3056). The visual
representation of its cross tab is shown in figure 41. A chi-square test, of which the values are
displayed in table 32, shows that the differences encountered are statistically significant and thus the
variables status_group and permit are not deemed independent.
Figure 41 Distribution of status_group over Permit values
Table 32 Chi-square test of Permit

| Test | X-squared | Degrees of freedom | p-value | Relevance |
|---|---|---|---|---|
| Chi-squared | 104.18 | 4 | 0.00 | |
6.1.9 ConstructionYear
Figure 42 Boxplots of Constructionyear over status_group
The relevance of construction year can be tested in 2 ways: either in its original form, where the years
are numeric, or in the categorized form, in which the years are placed in
buckets and a separate bucket contains all missing values. The boxplot of figure 42
visually reveals what was to be expected: the older the water pumps, the more often they are broken or in
need of repairs.
Table 33 One-way ANOVA of Constructionyear

| Source | Degrees of freedom | Sum of squares (SSE) | Mean square (MSE) | F | p-value | Relevance |
|---|---|---|---|---|---|---|
| Status group | 2 | 500055 | 250028 | 1753 | 0.00 | |
| Residuals | 38688 | 5518248 | 143 | | | |
Table 34 TukeyHSD test of construction year

| Comparison | Difference | p-value | Relevance |
|---|---|---|---|
| Functional needs repair - Functional | -4.680764 | 0.00 | |
| Non Functional - Functional | -7.541137 | 0.07 | |
| Non Functional - Functional needs repair | -2.860374 | 0.00 | |
Table 35 Chi-square test of construction year as a factor

| Test | X-squared | Degrees of freedom | p-value | Relevance |
|---|---|---|---|---|
| Chi-squared | 3245.4 | 12 | 0.00 | |
As both the one-way ANOVA test (for construction year as a numeric variable, tables 33 and 34) and
the chi-square test (when categorized, table 35) show, the difference between functional states is
statistically significant and it can be claimed that older water pumps encounter more trouble.
6.1.10 Collection of other variables
Table 36 Summary of Chi-square tests over other variables

| Variable | X-squared | Degrees of freedom | p-value | Relevance |
|---|---|---|---|---|
| Water point | 7450 | 12 | 0.00 | |
| Basin | 1921 | 16 | 0.00 | |
| Region | 4795 | 40 | 0.00 | |
| District Code | 1674 | 38 | 0.00 | |
| Source | 2624 | 18 | 0.00 | |
| Source_type | 1907 | 12 | 0.00 | |
| Source_class | 590 | 4 | 0.00 | |
| Quantity | 11361 | 8 | 0.00 | |
| Quality_group | 2100.1 | 10 | 0.00 | |
| Payment | 3965.6 | 12 | 0.00 | |
| Management | 2081.1 | 22 | 0.00 | |
| Management_group | 287.7 | 8 | 0.00 | |
| Extraction_type | 7365.6 | 34 | 0.00 | |
| Extraction_type_group | 7265.8 | 24 | 0.00 | |
| Extraction_type_class | 6931.2 | 12 | 0.00 | |
| Scheme management | 1990.4 | 22 | 0.00 | |
6.1.11 Population
Although there are a lot of missing values, we can still investigate whether the non-missing values relate to
status_group. Because most entries for population are close to zero, this variable is highly skewed. The
ANOVA test indicates a dependence between the variables (table 37). Only for the difference
between Repair and Functional is there no statistical support (table 38). Figure 43 supports this
visually.
Figure 43 Boxplots of population over status_group
Table 37 One-way ANOVA of population

| Source | Degrees of freedom | Sum of squares (SSE) | Mean square (MSE) | F | p-value | Relevance |
|---|---|---|---|---|---|---|
| Status group | 2 | 4342098 | 2171049 | 7 | 0.00 | |
| Residuals | 38016 | 12118539409 | 318775 | | | |
Table 38 TukeyHSD post hoc comparison for Population

| Comparison | Difference | p-value | Relevance |
|---|---|---|---|
| Functional needs repair - Functional | 9.05 | 0.73 | |
| Non Functional - Functional | -20.55 | 0.00 | |
| Non Functional - Functional needs repair | -29.61 | 0.04 | |
6.2 Appendix: Modelling stage elaboration
6.2.1 Packages used
The modelling stage involves creating and evaluating models.
Table 39 Modelling packages used

| Package name | Use | Source |
|---|---|---|
| AUC | Calculate the ROC and AUC metric | https://cran.r-project.org/web/packages/AUC/index.html |
| lift | Calculate the lift evaluation metric | https://cran.r-project.org/web/packages/lift/index.html |
| randomForest | Random forest algorithm | https://cran.r-project.org/web/packages/randomForest/index.html |
| ada | Adaboost package, boosting with logistic regression | https://cran.r-project.org/web/packages/ada/index.html |
| xgboost | Boosting with trees | https://cran.r-project.org/web/packages/xgboost/index.html |
| rotationForest | Variation in classification trees | https://cran.r-project.org/web/packages/rotationForest/index.html |
| pROC | Perform a Delong test | https://cran.r-project.org/web/packages/pROC/index.html |
6.2.2 Delong test: ROC curve comparison
Table 40 Delong test

| Category | AUC one vs all | AUC all in one | p-value | Relevance |
|---|---|---|---|---|
| Functional | 0.9085 | 0.9013 | 0.034 | |
| Repair | 0.8683 | 0.8734 | 0.509 | |
| Non Functional | 0.9304 | 0.9217 | 0.005 | |
The Delong test indicates statistically significant differences between the two approaches in the
prediction of the Functional and Non Functional categories, using the Random Forest algorithm
(p-values of 0.034 and 0.005 respectively). There is no evidence of a difference in the Repair category, as
its p-value is 0.509. This test supports the claim that, in this case, the one vs all approach is
the better approach.
This test was not conducted with 5-fold cross-validation, as this is computationally too
burdensome. For that reason the AUCs displayed here are not exactly the same as the
cross-validated averages shown before.
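The per-category AUC values in table 40 can be read as treating each status group in turn as the positive class. The Python sketch below illustrates how such a per-class AUC can be computed as the probability that a positive case is scored above a negative one. It does not implement the Delong test itself, for which the thesis relied on the pROC package, and the labels and scores are hypothetical.

```python
def auc(labels, scores):
    """AUC as P(score of a positive > score of a negative); ties count 0.5."""
    pos = [s for l, s in zip(labels, scores) if l]
    neg = [s for l, s in zip(labels, scores) if not l]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def per_class_auc(true_classes, class_scores):
    """class_scores maps each class to its list of predicted probabilities."""
    return {
        cls: auc([t == cls for t in true_classes], scores)
        for cls, scores in class_scores.items()
    }


# Hypothetical labels and predicted probabilities for six water points
truth = ["functional", "repair", "functional", "non functional", "non functional", "functional"]
scores = {
    "functional":     [0.8, 0.5, 0.3, 0.1, 0.4, 0.7],
    "repair":         [0.1, 0.3, 0.4, 0.1, 0.2, 0.1],
    "non functional": [0.1, 0.2, 0.3, 0.8, 0.4, 0.2],
}
print(per_class_auc(truth, scores))
```

The pairwise comparison is quadratic in the number of observations, so for a data set of this size a rank-based implementation from a statistics package would be the practical choice.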
6.3 Appendix: Evaluation stage elaboration
6.3.1 Packages used
Table 41 Evaluation packages used

| Package name | Use | Source |
|---|---|---|
| randomForest | Extract insight from constructed model; functions varImpPlot() and partialPlot() | https://cran.r-project.org/web/packages/randomForest/index.html |
| ggplot2 | Graph styling | https://cran.r-project.org/web/packages/ggplot2/index.html |
| interpretR | Extract insight from constructed model; functions variableImportance() and parDepPlot() | https://cran.r-project.org/web/packages/interpretR/ |
6.3.2 Variable importances
Figure 44 is the standard output of the varImpPlot() function applied to a Random Forest object
in R. It shows the variable importance in terms of the mean decrease in accuracy and in Gini. Figure
45 does this by using the AUC measure.
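The idea behind the mean decrease in accuracy is permutation importance: shuffle one variable and measure how much the accuracy drops. The Python sketch below is a simplified illustration of that idea and not the randomForest implementation, which works per tree on out-of-bag samples. The toy model, the data and all function names are hypothetical.

```python
import random


def accuracy(model, rows, labels):
    """Share of rows for which the model predicts the given label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(model, rows, labels, column, seed=0):
    """Accuracy drop after shuffling the values of one column across rows."""
    rng = random.Random(seed)
    shuffled = [r[column] for r in rows]
    rng.shuffle(shuffled)
    permuted = [r[:column] + [v] + r[column + 1:] for r, v in zip(rows, shuffled)]
    return accuracy(model, rows, labels) - accuracy(model, permuted, labels)


# Hypothetical stand-in for a fitted classifier: it only looks at column 0 (quantity)
def toy_model(row):
    return "non functional" if row[0] == "dry" else "functional"


rows = [["dry", "never pay"], ["enough", "monthly"],
        ["dry", "monthly"], ["enough", "never pay"]] * 25
labels = [toy_model(r) for r in rows]

print(permutation_importance(toy_model, rows, labels, column=0))  # positive drop
print(permutation_importance(toy_model, rows, labels, column=1))  # 0.0, column is ignored
```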
Figure 44 Variable importances
Figure 45 Variable importances (AUC)
6.3.3 Partial dependence
Partial dependence plots are used to look at the effect of a variable on the output probability of a
model. Table 42 gathers all variables with their corresponding partial dependence plot. The results
are then interpreted.
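A partial dependence value for one variable can be sketched as follows: force that variable to a given value for every observation, predict, and average the predicted probabilities. The Python sketch below illustrates this under stated assumptions; the scoring function stands in for a fitted model, and the data and names are hypothetical rather than taken from the thesis's R code.

```python
def partial_dependence(predict_proba, rows, column, grid):
    """Average predicted probability when `column` is forced to each grid value."""
    result = {}
    for value in grid:
        forced = [r[:column] + [value] + r[column + 1:] for r in rows]
        result[value] = sum(predict_proba(r) for r in forced) / len(forced)
    return result


# Hypothetical scoring function standing in for a fitted model:
# returns the probability that a water pump is functional
def toy_proba(row):
    quantity, payment = row
    base = 0.2 if quantity == "dry" else 0.7
    return base + (0.1 if payment != "never pay" else 0.0)


rows = [["dry", "never pay"], ["enough", "monthly"],
        ["enough", "never pay"], ["dry", "monthly"]]
pd_quantity = partial_dependence(toy_proba, rows, column=0, grid=["dry", "enough"])
print(pd_quantity)
```

Because the other variables keep their observed values, the result shows the average effect of the chosen variable over the data rather than its effect for a single water pump.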
Table 42 Partial dependence plots with interpretation

The quantity variable is seen by all measurements as the most important variable. If the quantity value is dry, the water pump is more likely to be non-functional, whereas the enough value is more associated with working water pumps.

The payment variable captures the payment method associated with the water pump. Water pumps with a payment method are positively associated with working water pumps, whereas water pumps with no payments or an unknown payment method tend to be non-working.

Communal standpipes, hand pumps, improved springs and cattle troughs are more likely to be functional, in decreasing order of probability. Water points of the type communal standpipe multiple or ‘other’ tend to be non-functional more often.

The older a water pump is, the more flaws it has. This could be seen as a general rule, but the relationship is not absolute. Water pumps built in the 70’s or 80’s are more often associated with non-working water pumps than those built in the 60’s, for example.
Numeric variables are not that easy to interpret, but a large drop in functionality can be identified between a latitude of -11 and -9. These values refer to the southern part of Tanzania. Further investigation reveals that 20% of water pumps in that area are dry, against only 8% in the rest of Tanzania.

Most non-working water pumps tend to be situated in the eastern part of the country. Surprisingly, that is the part closest to the sea. This part has 10% more non-working water pumps than the rest of the country.

Are water pumps that are situated higher less likely to be broken? Only a small portion of all water pumps is situated that high, but the sample only becomes small at a height of around 1750, which is well after the large spike. Indeed, water pumps at a height of more than 1500 have 15% more working water pumps.

Another interesting conclusion here is that government funding is associated with the presence of faulty water pumps.
36% of all water pumps had a missing value for this variable. Those were imputed by the median, which was around 150. It is interesting to see that water pumps serving a smaller population that is not close to zero tend to be more functional. Could this be because of a sense of shared responsibility in a smaller group?

A histogram of the DaysRecorded variable reveals 2 large clusters, one around 1400 and one around 2000. Only the parts of the plot around these values can be trusted. This reveals that the second ‘round’ of water pump recording efforts yielded more functional water pumps.

Analysis of the extraction_type_group variable holds no surprises. If no special mechanism is needed and gravity can do all the work, there is less chance the water pump will be broken.

The Iringa region seems to do well in terms of functional water pumps. It also has the second largest GNP per capita of Tanzania. The Lindi region has the worst reputation. It is situated in the south-eastern corner of Tanzania and is the least densely populated region of Tanzania. This is in accordance with the lat and lon findings.
There are no district names given, so an interpretation in terms of districts is hard.

This confirms the findings of the funder variable. The government bodies of Tanzania seem to be associated with faulty water pumps.

The interpretation of this variable is the same as that of its related variable extraction_type_group.
The source type shows that there are indeed differences between the types of source.
The basin the water comes from also has an influence.
In terms of management, it seems VWC has
a lot of explaining to do.
The RecordingSeason variable was created to see if the season in which the water pump data was entered into the system plays some role in determining functionality. It seems that it does: during the LongRains, it is more likely to find a functional water pump.
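A derived variable such as RecordingSeason can be built from the recording date with a simple month lookup. The sketch below is a guess at the idea, not the definition used in the study: the month boundaries (long rains in March to May, short rains in October to December) and the level names `ShortRains` and `Dry` are assumptions; only the `LongRains` level is named in the text.

```python
from datetime import date


def recording_season(recorded: date) -> str:
    """Map a recording date to a season label (assumed month boundaries)."""
    month = recorded.month
    if 3 <= month <= 5:
        return "LongRains"   # level named in the text
    if 10 <= month <= 12:
        return "ShortRains"  # hypothetical level name
    return "Dry"             # hypothetical level name


print(recording_season(date(2013, 4, 2)))
```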
No surprises here: if there was no public meeting, there is probably no ‘support’ from the locals in maintaining the water pump.
Water quality and quality group are closely related. For some reason, some levels do not agree. Pumps with milky water first tended to be functional more often, but here it is the other way around.
I expected to find the same results as with the permit, but to my surprise, this is not the case. We can only conclude that without a permit, there is a higher chance of a broken pump.
As I have no domain knowledge, it is hard to interpret everything I encounter. In the source_class variable, there is not much variation that contributes to an interpretation.
6.3.5 Data cleaning evaluation
Table 43 DeLong test for ROC curve comparison

| Category | AUC ‘dirty’ | AUC ‘clean’ | p-value | Relevance |
|---|---|---|---|---|
| Functional | 0.8976 | 0.9052 | 0.013 | |
| Repair | 0.8652 | 0.8690 | 0.584 | |
| Non Functional | 0.9183 | 0.9256 | 0.009 | |

The best performing model using a thoroughly cleaned data set reached an AUC of 0.907. Without any data cleaning, an AUC of 0.894 is obtained. The DeLong test (table 43) indicates that this difference is statistically significant for the Functional and Non Functional categories (p = 0.013 and p = 0.009). For the Repair category, there is no statistically significant difference in performance between a ‘dirty’ and a ‘cleaned’ data set (p = 0.584).
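For readers unfamiliar with the DeLong test, the sketch below shows the computation for one category treated as a binary problem (1 = the category, 0 = any other category). It is a plain Python illustration written for this text, not the implementation used to produce table 43; the function names and the toy scores are assumptions. It needs at least two positive and two negative cases.

```python
import math
from statistics import mean, variance


def _psi(pos_score, neg_score):
    """Kernel of the Mann-Whitney statistic: 1 if correctly ordered, 0.5 on ties."""
    if pos_score > neg_score:
        return 1.0
    if pos_score == neg_score:
        return 0.5
    return 0.0


def _placements(scores, labels):
    """Structural components V10 (per positive) and V01 (per negative)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    v10 = [mean(_psi(p, q) for q in neg) for p in pos]
    v01 = [mean(_psi(p, q) for p in pos) for q in neg]
    return v10, v01


def delong_test(labels, scores_a, scores_b):
    """Return (auc_a, auc_b, z, two-sided p) for two models scored on the same cases."""
    v10_a, v01_a = _placements(scores_a, labels)
    v10_b, v01_b = _placements(scores_b, labels)
    auc_a, auc_b = mean(v10_a), mean(v10_b)
    # Variance of the AUC difference, computed from the paired component differences.
    d10 = [a - b for a, b in zip(v10_a, v10_b)]
    d01 = [a - b for a, b in zip(v01_a, v01_b)]
    var_diff = variance(d10) / len(d10) + variance(d01) / len(d01)
    if var_diff == 0:
        # Degenerate case (e.g. identical scores): reported as "no difference"
        # by convention in this sketch.
        return auc_a, auc_b, 0.0, 1.0
    z = (auc_a - auc_b) / math.sqrt(var_diff)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return auc_a, auc_b, z, p_value


# Toy example: 4 positives followed by 4 negatives, scored by two models.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
clean = [0.9, 0.8, 0.7, 0.3, 0.6, 0.4, 0.2, 0.1]
dirty = [0.9, 0.6, 0.5, 0.2, 0.7, 0.4, 0.3, 0.1]
print(delong_test(labels, clean, dirty))
```

The test is paired: both models must be scored on the same cases, which is why the variance is taken over the differences of the components rather than over each model separately.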
To look at the differences variable by variable, figures 46 and 47 were constructed. They compare the Gini and AUC importance measures of a Random Forest algorithm on a cleaned and an uncleaned data set. The black dots represent the ‘uncleaned’ importances, the green ones indicate the ‘cleaned’ importances. The graphs show the importances as the mean decrease in Gini and in AUC, in order to find anomalies that might explain the difference in performance. This method is not very stable, as the importances depend on all other variables included in the model, which differ between the two data sets. Thus, this analysis is only a superficial endeavour.
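As a reminder of what the mean decrease in Gini measures: each split in a tree reduces the Gini impurity of a node, and a variable's importance accumulates these reductions over all splits that use it, averaged over the trees. The sketch below computes the reduction for a single split. The function names and the toy labels are assumptions for illustration, and the weighting and averaging a Random Forest applies across nodes and trees is left out.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity 1 - sum(p_k^2) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(parent, left, right):
    """Impurity of the parent node minus the size-weighted impurity of its children."""
    n = len(parent)
    children = (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)
    return gini_impurity(parent) - children


# A perfect split of a balanced two-class node removes all impurity.
parent = ["functional", "functional", "non functional", "non functional"]
print(gini_decrease(parent, parent[:2], parent[2:]))
```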
The uncleaned set is simply the data as it was read in, excluding the variables amount_tsh, id, wpt_name, funder, installer, date_recorded, scheme_name, ward, lga and subvillage, as those have too many distinct factor levels to handle. The cleaned data set contains six more variables: SchemeGroup, RecordingSeason, Installer, Funder, DaysRecorded and ConstructionYearFactor. Those are placed at the top of the graphs to get them somewhat out of the way. Some highly correlated variables were also deleted. For example, the quantity variable is rated much higher in importance in the cleaned data set because in the uncleaned one, its importance is divided over the quantity and quantity_group variables.
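One simple way to spot such redundant pairs of categorical variables is to check whether one column is fully determined by the other; if so, keeping both only splits the importance between them. The sketch below is an assumption about how such a check could look, not the procedure used here, and the toy values are illustrative rather than taken from the data set.

```python
def is_function_of(fine, coarse):
    """True if every value of `fine` always co-occurs with the same value of `coarse`."""
    mapping = {}
    for f, c in zip(fine, coarse):
        # setdefault stores the first pairing seen; a later conflict means
        # `coarse` is not determined by `fine`.
        if mapping.setdefault(f, c) != c:
            return False
    return True


# Toy columns standing in for quantity and quantity_group.
quantity = ["enough", "dry", "enough", "seasonal", "dry"]
quantity_group = ["enough", "dry", "enough", "seasonal", "dry"]
print(is_function_of(quantity, quantity_group))
```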
Eyeballing these graphs indicates that grouping the construction_year values into ConstructionYearFactor might not have been a good move. In the Gini importance ranking without cleaning, construction_year is even a clear number one. Some of the importance lost by not including construction_year was recovered by including ConstructionYearFactor, but the difference still looks substantial. Another striking observation is that imputation had no convincing positive effect on the importance of the variables with imputed values. In fact, judging by the AUC importances, their importance seems to have decreased through pre-processing.
Figure 46 Mean decrease in Gini comparison
Figure 47 Mean decrease in AUC comparison