PREDICTING THE FUNCTIONAL STATE

OF TANZANIAN WATER PUMPS: A CASE STUDY IN DATA MINING

Word count: 19596

Jacob Benoot
Student number: 01170804

Supervisor: Els Clarysse

Master’s Dissertation submitted to obtain the degree of:

Master of Science in Business Administration (Handelswetenschappen)

Academic year: 2016 - 2017

PERMISSION

I declare that the content of this Master’s Dissertation can be consulted and/or reproduced if the

sources are mentioned.

Name student: Jacob Benoot

Signature:

ABSTRACT

Data mining is ubiquitous in today's digital world. The phenomenon has numerous synonyms, such as data science, data analytics, big data analytics and business analytics. All of these aim to achieve the same thing: extracting knowledge from data. Many people come into contact with it daily without knowing it; think of services such as Facebook, Google, Uber, Amazon etc.

This work places data mining in its context and applies data mining principles to a case that predicts the status of water pumps in Tanzania. The CRISP-DM framework is used as a guideline. Each step of this framework is discussed and worked out. This structured approach enables us to (1) select the strongest model and the best approach, (2) build a model that, through its predictive power and the insights derived about the variables used, can improve the Tanzanian government's ability to combat water scarcity and (3) study the impact of data cleaning on the accuracy of a model.

PREFACE

This thesis is in line with my personal interests and professional ambitions and therefore I enjoyed

every step of the way. Finalizing this work after countless hours of data manipulation, R debugging,

troubleshooting, waiting on R to finish computations, finding the best visualizations to support my

message, fills me with a great sense of accomplishment. As is customary, I would like to take the time

to thank everyone that helped me along the way. My greatest friend in dark times was Google, who

always made time to listen to my problems and suggest further action. Also a huge thanks to all the

people who take the time to answer questions on forums like Stack Overflow; you are the real

heroes. This thesis, in its current form, was only possible because of Dirk van den Poel, Matthijs

Meire and by extension Han-Thijs de Senerpont Domis, from whom and with whom I discovered

the world of data science. Furthermore, a big thanks to my brother Stijn Benoot, who lent me an

extra 8GB of RAM. That act of generosity uncorked my computational bottleneck and sped up the

whole process. Thanks.

TABLE OF CONTENTS

PAINTING THE PICTURE

1 Introduction
1.1 The fuzz
2 Literature review
2.1 What is data mining?
2.1.1 How is data mining done?
2.2 CRISP-DM

CASE-STUDY ELABORATION

3 Case Study
3.1 Abstract
3.1.1 Research question
3.2 Business understanding
3.3 Data Understanding & Data Preparation
3.3.1 Data Read-in
3.3.2 Data Exploration, preparation and validation methodology
3.3.3 Data Exploration, preparation and validation
3.3.4 Summary of Data Understanding / Data preparation
3.4 Modelling & Modelling Evaluation
3.4.1 Modelling introduction
3.4.2 Three way classification approach
3.4.3 Modelling evaluation
3.4.4 Modelling Approach: Cross-validation
3.4.5 Modelling case
3.5 Evaluation
3.5.1 Variable importances
3.5.2 Partial dependence
3.5.3 Data cleaning evaluation
3.6 Deployment

CONCLUSION

4 Conclusion
5 Bibliography

APPENDIX

6 Appendix
6.1 Appendix: Data understanding / preparation stage elaboration
6.1.1 Packages used
6.1.2 Amount_tsh
6.1.3 Date Recorded
6.1.4 Funder & Installer
6.1.5 GPS Height
6.1.6 Longitude and Latitude
6.1.7 Public meeting
6.1.8 Permit
6.1.9 ConstructionYear
6.1.10 Collection of other variables
6.1.11 Population
6.2 Appendix: Modelling stage elaboration
6.2.1 Packages used
6.2.2 Delong test: ROC curve comparison
6.3 Appendix: Evaluation stage elaboration
6.3.1 Packages used
6.3.2 Variable importances
6.3.3 Partial dependence
6.3.4 Data cleaning evaluation

ABBREVIATIONS

CRISP-DM – Cross industry standard process for data mining

ANOVA – Analysis of variance

ROC – Receiver operating characteristics

AUC – Area under curve

TABLES AND FIGURES

Figures

Figure 1 3 V's of Big Data (Sagiroglu & Sinanc, Big data: A review, 2013)
Figure 2 Financial value across sectors through the use of Big Data (Evans & Lidner, 2012)
Figure 3 Data mining versus the use of data mining results (Provost & Fawcett, 2013)
Figure 4 CRISP data mining process (Provost & Fawcett, 2013)
Figure 5 Generic tasks and outputs of the CRISP-DM reference model (Chapman, et al., 2000)
Figure 6 Taarifa geographic mapping of waterpumps and their status (Taarifa, 2016)
Figure 7 Percentage of functional water pumps (left) & Population coverage (right) per region (Taarifa, 2016)
Figure 8 Classification of data quality problems in data sources (Rahm & Hai Do, 2009)
Figure 9 Distribution of status group over Public Meeting values
Figure 10 Boxplot of GPS_Height over status_group
Figure 11 Distribution of Status_group variable
Figure 12 Boxplot of population
Figure 13 Public meeting
Figure 14 Permit
Figure 15 Construction year as factor
Figure 16 Distribution of Extraction_type_class variable
Figure 17 Distribution of water pumps over Management_group
Figure 18 Distribution of water pumps over payment variable
Figure 19 Distribution of Quality_group
Figure 20 Quantity
Figure 21 Waterpoint type
Figure 22 Basin
Figure 23 Region
Figure 24 Attributes and target attribute representation, inspired by (Provost & Fawcett, 2013)
Figure 25 A Classification tree representation (Based on total population: 59400)
Figure 26 Bagging as presented in course material of Advanced Predictive Analytics (personal correspondence with Dirk van den Poel)
Figure 27 Linear and logistic regression
Figure 28 Sequential Boosting (personal correspondence with Dirk van den Poel)
Figure 29 An example of a ROC curve
Figure 30 Cumulative response curve and lift
Figure 31 Modelling approach (personal correspondence with Dirk van den Poel)
Figure 32 An illustration of cross-validation (Provost & Fawcett, 2013)
Figure 33 Variable importances
Figure 34 Partial dependence plot of Quantity and Payment
Figure 35 Partial Dependence Plot of Installer and Funder variables
Figure 36 Partial Dependence Plot of ConstructionYearFactor variable
Figure 37 Distribution of funder and installer variables
Figure 38 GPS Height distribution per status group
Figure 39 Distribution of latitude and longitude per status group
Figure 40 Distribution of status_group of Public Meeting
Figure 41 Distribution of status_group over Permit values
Figure 42 Boxplots of Constructionyear over status_group
Figure 43 Boxplots of population over status_group
Figure 44 Variable importances
Figure 45 Variable importances (AUC)
Figure 46 Mean decrease in Gini comparison
Figure 47 Mean decrease in AUC comparison

Tables

Table 1 Different sources of data (Vale, 2013)
Table 2 Different types of analytics (Evans & Lidner, 2012)
Table 3 Available data about water pumps in Tanzania
Table 4 Proportional Crosstab of status_group and public_meeting
Table 5 Summary of dependency tests
Table 6 Seasons in Tanzania
Table 7 Newly created categories for variable funder
Table 8 GPS Height: Manual look-up of missing data
Table 9 GPS Height: The process of imputing missing values
Table 10 Longitude and Latitude validation
Table 11 Latitude & Longitude: The process of imputing missing values
Table 12 Granularity of extraction type
Table 13 Granularity of Management
Table 14 SchemeGroup and Scheme management levels
Table 15 Quantity Crosstab
Table 16 Granularity of Source
Table 17 Summary of data handling
Table 18 One vs all classification results using a 5-fold crossvalidation
Table 19 All-in-one classification results using 5-fold crossvalidation
Table 20 Data exploration and preparation packages
Table 21 Crosstab of season and status_group
Table 22 Chi-square test of season
Table 23 Categorisation method for Installer and Funder variable
Table 24 Chi-square evaluation of Funder and Installer
Table 25 One-way ANOVA of GPS Height
Table 26 Ad Hoc comparison using TukeyHSD for GPS Height
Table 27 one-way ANOVA of Latitude
Table 28 TukeyHSD multiple comparison test of Latitude
Table 29 one-way ANOVA of Longitude
Table 30 TukeyHSD multiple comparison of Longitude
Table 31 Chi-square results for public meeting
Table 32 Chi-Square test of Permit
Table 33 one way ANOVA of Constructionyear
Table 34 TukeyHSD test of construction year
Table 35 Chi-square test of construction year as a factor
Table 36 Summary of Chi-square tests over other variables
Table 37 One-way ANOVA of population
Table 38 TukeyHSD ad hoc comparison for Population
Table 39 Modelling packages used
Table 40 Delong test
Table 41 Evaluation packages used
Table 42 Partial dependence plots with interpretation
Table 43 Delong test for ROC curve comparison

1 Introduction

The capture and analysis of data is a hot topic. It is a promising ‘new’ frontier on which businesses

compete to gain an advantage (Davenport & Harris, Competing on analytics, 2007). Capturing and

analyzing data is not new, but the way it is done is changing drastically. New developments and

trends facilitate the capture and analysis of data. There has been an enormous growth in available

data. More data is being captured, but also less traditional sources of data can now be handled.

This has led to a number of success stories that are able to dazzle our minds. Companies like

Amazon, Facebook and Google created their business model by thoroughly analyzing their data and

are able to generate great value by doing so. Although these examples illustrate the huge

opportunities data analytics can yield, there is still a lack of maturity in this field. While a few projects

succeed, many more are doomed to fail (Demirkan & Dal, 2014). It is therefore imperative that a

general approach is available to deliver such projects. CRISP-DM is such a framework. Following it,

this thesis explores how to apply CRISP-DM to a real-life case.

The case study aims to guide the reader through a data mining competition in predicting the

functional state of water pumps, explaining all the steps along the way. The goal is to combine the

practical approach with the theory behind it, so the reader understands what is happening in each

phase. For that reason literature findings and practical issues are combined within the elaboration of

the case study.

1.1 The fuzz

Data mining, data science, (business) analytics, knowledge discovery etc. are all closely related terms

that relate to analyzing data in order to gain knowledge. It is not a new phenomenon, as this practice

is as old as the field of statistics which has been around since the 18th century (Agarwal & Dhar,

2014). Over the past two decades, however, it has become increasingly important (Chen, Roger, &

Storey, 2012). Nowadays, the collection of data is fuelled by the internet, given the rapid pace at

which economic and social transactions are moving online. The opportunities in this field are also

expanded by the availability of ‘Big Data’ and advancements in the field of machine learning. The

arrival of Big Data on the scene is claimed to be the most significant tech disruption since the

internet and digital economy (Agarwal & Dhar, 2014).

Big Data can be described as data that is too big, too fast, or too hard for existing tools to process.

This relates directly to the 3 V’s of Big Data (volume, variety and velocity), its 3 main characteristics

(Madden, 2012).

Five exabytes of data (1 exabyte = 10^18 bytes) were created by humans up to 2003. Nowadays, this amount is created in two

days. 10 billion text messages are sent every day. By 2050, 50 billion devices will be connected to the

internet. Facebook has 955 million monthly active users, and every day 30 billion pieces of content are

posted along with 2.7 billion likes and comments. 571 new websites are created every

minute (Sagiroglu & Sinanc, Big data: A review, 2013). An enormous amount of data is available and

it is being generated at an increasing pace. The size of this data is getting large, sometimes reaching

petabytes. This is called the volume aspect of big data.

The United Nations Economic Commission for Europe (UNECE) classifies different sources of

data in 3 domains, displayed in table 1. Firstly, there is a lot of data concerning human experiences.

‘Social networks’ (human-sourced information) is the source of data coming from blogs, comments,

pictures, videos, internet searches etc. Secondly, ‘traditional business systems’ leave a trace of doing

business like medical records, transaction information and stock records. Lastly, ‘the Internet of

Things’ covers all data coming from sensors or computer systems (Vale, 2013).

Table 1 Different sources of data (Vale, 2013)

Social Networks (human-sourced information)
    Social networks (Facebook, Twitter etc.)
    Blogs and comments
    Personal documents
    Pictures (Instagram, Flickr)
    Videos (YouTube)
    Internet searches
    Mobile data, text messages
    User-generated maps
    E-mail

Traditional Business systems (process-mediated data)
    Medical records
    Commercial transactions
    Banking/stock records
    E-commerce
    Credit cards

Internet of Things (machine-generated data)
    Sensor data (home automation, weather sensors, traffic sensors)
    Mobile sensor data (location, cars, satellite images)
    (Web) logs

There is a huge variety to be found in these data sources. Traditionally data sources are structured,

like commercial data which is often stored in a database or data warehouse. Alternatively, semi-

structured data has, like the name implies, some main structure. Think of Twitter messages: the

messages posted by users do not have any structure, it could be whatever they like, but the data

generated from those messages also comes with metadata, which is structured. Date and time,

locations, IP-addresses and so on are also being captured. Next to those types of data, data can also

be completely unstructured, like video or audio data. Velocity, the third V of Big Data, captures the

increasing speed at which data arrives, for example via the real-time capture of data through

sensors or clickstreams generated on websites (Sagiroglu & Sinanc, Big data: A review, 2013). These

3 V’s are presented in figure 1.

Figure 1 3 V's of Big Data (Sagiroglu & Sinanc, Big data: A review, 2013)
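
The difference between the structured and the unstructured part of a semi-structured record can be sketched in a few lines of Python. The field names and values below are hypothetical and do not reproduce Twitter's actual data format; the snippet only illustrates that the metadata follows a predictable schema while the message text does not.

```python
import json

# Hypothetical tweet-like record; field names and values are illustrative only.
raw = (
    '{"created_at": "2017-03-01T12:30:00", '
    '"location": "Dar es Salaam", '
    '"text": "Pump near the market is broken again!"}'
)

record = json.loads(raw)

# Structured part: metadata with a fixed, predictable set of fields.
metadata = {key: value for key, value in record.items() if key != "text"}

# Unstructured part: free text that users can fill with whatever they like.
free_text = record["text"]

print("metadata fields:", sorted(metadata))
print("free text:", free_text)
```

The metadata can be loaded straight into a database table, whereas the free text would need further processing (for example text mining) before it can be analyzed.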

Big data solutions can thus analyze and interpret data that was previously assumed too difficult to

handle in terms of volume, variety and velocity. Overcoming these technical

challenges clears the way for new opportunities and applications. The potential value derived from

these new opportunities is estimated to be huge. Global consultancy company McKinsey claims

enormous gains in a wide variety of sectors. The potential value from data analysis in the US health

care sector alone could reach up to $300 billion. For the European public sector, this would be

€250 billion (Evans & Lidner, 2012). This and other examples are illustrated in figure 2. On top of

that, Davenport & Harris (2007) suggest that top performing organizations are three times more

likely to be sophisticated analytics users than lower performers, implying a clear beneficial result of

using advanced analytics.

These projections of economic wealth and riches trigger organizations to invest in data analysis, but

to do that, people are needed that can handle their data. The job of data scientist is being hyped as

the sexiest job of the century. There is already a shortage of data scientists which is becoming a

serious constraint in some sectors (Davenport & Patil, Data Scientist: The Sexiest Job of the 21st

Century, 2012).

Based on these findings it might be beneficial to obtain some knowledge on how to handle data.

Even if you don’t have the ambition to become a data scientist yourself, knowing how one thinks

might already be beneficial in dealing with and understanding one. This thesis covers data mining

theory and concepts and applies those to a case ranging from data gathering to the interpretation of

the results to help the reader gain some practical understanding of data analysis.

Figure 2 Financial value across sectors through the use of Big Data (Evans & Lidner, 2012)

2 Literature review

2.1 What is data mining?

Data mining is the process of analyzing data from different angles and extracting useful information.

This is sometimes called knowledge discovery (Satinderpal, Sheilly, & Kaur, 2012). The goal is thus

to transform raw data into useful information or knowledge which is then used to improve decision

making, derive value or gain a competitive edge (Provost & Fawcett, 2013) (de Tré, 2007).

Transforming data into knowledge can be done in several ways. The method varies depending on

the question that needs answering. Descriptive analytics looks at what has happened and why, by

looking at the data from different angles through summarizing the data in charts and reports. It

helps to understand and analyze business performance. Such summaries are useful when it is roughly known

what to look for, but there can also be hidden patterns that require more complex methods to

surface. These hidden patterns can answer more complex questions and lead to more interesting and

actionable insight (de Tré, 2007). This is the domain of predictive analytics which answers questions

like ‘what will happen’. Those more complex methods rely on advanced analytical techniques and

are often called data mining techniques (de Tré, 2007). Prescriptive analytics tries to optimize a

certain situation, for example, to minimize costs or maximize profit (Evans & Lidner, 2012). Table 2

offers an overview of these different types of analytics.

Table 2 Different types of analytics (Evans & Lidner, 2012)

Type                     Question               Examples
Descriptive analytics    What has happened?     Reporting, visualization, dashboards
Predictive analytics     What will happen?      Detect hidden patterns, data mining
Prescriptive analytics   What should happen?    Optimization, revenue management, what-if analysis
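
The three types of analytics can be contrasted on a toy example. The sketch below is only an illustration: the monthly repair counts, the crew capacity and the cost figure are invented, and the naive linear trend stands in for the far richer predictive techniques discussed in this thesis.

```python
# Hypothetical monthly counts of pump repairs (invented numbers).
repairs = [10, 12, 14, 16]

# Descriptive analytics: what has happened? Summarize the past.
average = sum(repairs) / len(repairs)

# Predictive analytics: what will happen? Extrapolate a naive linear trend.
step = (repairs[-1] - repairs[0]) / (len(repairs) - 1)
forecast = repairs[-1] + step

# Prescriptive analytics: what should happen? Choose the cheapest number of
# repair crews that can still cover the forecast (capacity and cost are assumptions).
crew_capacity = 5      # repairs one crew can handle per month
cost_per_crew = 100    # cost of one crew per month
feasible = [crews for crews in range(1, 7) if crews * crew_capacity >= forecast]
best_crews = min(feasible, key=lambda crews: crews * cost_per_crew)

print("average repairs per month:", average)
print("forecast for next month:", forecast)
print("crews to schedule:", best_crews)
```

Each step builds on the previous one: the summary describes the history, the forecast extends it, and the optimization turns the forecast into a decision.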

As already hinted in the previous paragraph, data mining is often associated with the predictive

analytics type. Predictive modelling can be seen as one of the main topics of data mining (Provost &

Fawcett, 2013). Evans & Lidner (2012) describe data mining as a focus on understanding

characteristics and patterns among variables in large databases using a variety of statistical and

analytical tools. Through the use of these statistical and mathematical principles it investigates

historical data to detect patterns and relationships (Satinderpal, Sheilly, & Kaur, 2012) (Evans &

Lidner, 2012). Data mining tries to learn those patterns and apply them onto new data to predict

their behavior (de Tré, 2007).

2.1.1 How is data mining done?

In the realm of data mining there are some distinctions to be made. On the one hand, there are

parametric methods. These require some assumptions about reality, for example, we can assume a

linear relation. On the other hand, there are non-parametric methods that do not make assumptions.

As there are no assumptions about the relationship, non-parametric methods have an opportunity to

better fit what is required. But this comes with a downside as well: it requires a very large number of

observations to obtain an accurate estimate. Another distinction can be found in the definition of a

data mining problem. If we know what we want to predict and have for example a target variable to

focus on, it’s called supervised learning. Unsupervised learning has no response variable to predict.

In those cases, it is not known what to look for (James, Witten, Hastie, & Tibshirani, 2013).

As already mentioned in the previous section, data mining tries to learn patterns and apply them on

new data (de Tré, 2007). This is exactly what figure 3 illustrates. The top part represents the model

extraction from data that is available, the historic data. The bottom part illustrates the application of

this model in predicting the class of new data points.

Which algorithms are used to extract a model depends on the type of problem. The case study covers

a supervised learning problem, and more precisely a classification problem. The approach taken to

tackle this specific problem, like the type of algorithms that can be used for a classification, is

elaborated further in the case-study.

Figure 3 Data mining versus the use of data mining results (Provost & Fawcett, 2013)

2.2 CRISP-DM

Evans & Lidner (2012) claim organizations are overwhelmed by this abundance of data and struggle

to understand how to use it to achieve business results. Without any framework, the success of a

data mining project is dependent on the skill of the person or team in question. This is a great

restraint on the reproducibility of their efforts. A standard approach like the ‘Cross Industry Standard

Process for Data Mining’ (CRISP-DM) aims to make data mining projects less costly, more reliable, repeatable,

manageable and faster by providing help in the translation of business problems into data mining

tasks, suggesting appropriate data transformation steps and modelling techniques, providing a way to

evaluate the results and a standard way to document the whole process (Wirth & Jochen, 2000). The

CRISP-DM process model for data mining consists of six phases, which are visualized in figure

4. The arrows represent the most important dependencies between phases. The large outer circle

indicates the iterative nature of this framework: going back and forth between steps is often needed,

as findings along the way trigger new questions (Shearer, 2000). A more detailed overview is given in

figure 5. Per phase, generic tasks and desired outputs are given.

Figure 4 CRISP data mining process (Provost & Fawcett, 2013)

Figure 5 Generic tasks and outputs of the CRISP-DM reference model (Chapman, et al., 2000)

Business understanding

In this phase, the focus is on the business problem at hand. What is it that we are trying to

accomplish, what question do we want answered? Determine the project objectives and

requirements and how this can be translated into a data mining problem.

The business understanding phase basically consists of designing a clear picture of the road ahead by

framing the whole data mining endeavour into a clearly defined project plan (Chapman, et al., 2000).

Data understanding

The second step, data understanding, involves the collection and exploration of the data. This

requires investigating and describing different attributes to understand and document their meaning.

Validation of the data can uncover hidden mistakes and provide ways to deal with those. In that

way, a clear picture of the data is drawn. This often goes hand in hand with some visualization

(Chapman, et al., 2000).

Data preparation

In the data preparation stage, the final data set is constructed on which the analysis will be done.

This is done by selecting relevant attributes and by transforming and cleaning the data. Often, if there are

different data sources, integration is necessary by merging these sources (Chapman, et al., 2000).

Modelling

Appropriate modelling techniques are evaluated and applied. As can be identified in figure 4, there’s

a lot of going back and forth between the data preparation stage and the modelling stage. Different

modelling techniques sometimes have different requirements of the data. This stage also evaluates

and compares the performance of different models used by checking performance on a separate test

set (Chapman, et al., 2000).

Evaluation

It is important to evaluate the model(s) created. Does the model really achieve the business

objectives or did we make any possible mistakes along the way? If the model is satisfactory, the

deployment phase can be initiated. If that’s not the case, the previous phases should be revised and

adjustments made in order to achieve the desired results (Chapman, et al., 2000).

Deployment

The next step is to ‘make use’ of the model, to deploy it throughout the organization. This can go

from a simple report generation to present findings to implementing a repeatable data mining

process across the organization (Chapman, et al., 2000).

3 Case Study

3.1 Abstract

In this case study, the principles of data mining were applied following the CRISP-DM framework

to solve the problem of the Tanzanian faulty water pumps. By using data mining algorithms for

classification, the class of water pumps was predicted: is a water pump functional, functional but in

need of repairs, or non-functional? The best results in solving this multiclass-classification problem

were obtained by a one-vs-all approach using the Random Forest algorithm, which yielded an AUC

of 0.91 and a classification rate of 0.8209. All data processing and modelling is done in the statistical

programming language R, for which the entire code can be found in the appendix of the electronic

version of this thesis.

The case study aims to guide the reader through a data mining competition in predicting the

functional state of water pumps, explaining all the steps along the way. The goal is to combine the

practical approach with the theory behind it, so the reader understands what is happening in each

phase. For that reason literature findings and practical issues are combined within the elaboration of

the case study. To increase readability and avoid repetition, all repetitive or space consuming tasks

are provided in the appendix and not in text, although they are a vital part of this work.

3.1.1 Research question

This thesis tries to answer three questions. The first and main question is: “Can data mining provide an

added value for the Tanzanian government in battling water scarcity?”. The main purpose of analysing data is

to extract value and of course we would like to know if we succeeded. The second question, “What is

the best way to predict the functional state of Tanzanian water pumps?”, focuses on the data mining approach

in which different algorithms are evaluated to determine what works best. The third question

ponders upon the statement that 80% of all data analysis time goes into data cleaning, which was

also the case in this thesis. “Does data preparation improve the predictive capabilities of an algorithm?”.

3.2 Business understanding

Tanzania is the largest country in East Africa, with a population of 52 million people. But of those

52 million people, 23 million have no choice but to drink dirty water from unsafe sources. 44 million

do not have access to adequate sanitation and 4000 children die from preventable diseases due to

unsafe water. Safe water is scarce, and often women and children have to spend two to seven hours

to collect clean water (WaterAid, 2016). This is quite the predicament. Water is a basic need and

right for all human beings. The Tanzanian ministry of water agrees and together with Taarifa, they

aim to improve sanitation conditions in their country.

The Taarifa platform is an open source web application for information collection, visualisation and

interactive mapping, created by a global network of volunteers. It enables citizens to report

sanitation problems such as broken public toilets or broken water pumps in their neighbourhood

through SMS, Twitter or their mobile app. These issues are gathered and organized in the platform

and in this way communicated to the responsible authorities and decision makers (Taarifa, 2016).

Next to this data collection, Taarifa also visualized these issues and created an interactive map of it.

The interactive map indicates the location of water points and their status; an illustration can be

found in figure 6. Possible states can be functional (blue), non-functional (red) or in need of

maintenance (orange). The work done by the Taarifa organisation helps to draw a clear image of

what is happening in Tanzania regarding sanitation. Visually, the geographic representation helps to

pinpoint problem areas where water tends to become scarce. Together with the created dashboards,

it provides a powerful tool for the local authorities to manage and follow up the situation (Taarifa, 2016).

Figure 6 Taarifa geographic mapping of waterpumps and their status (Taarifa, 2016)

Additional information from Taarifa shows that only 54.30% of the water pumps are functional and

an average of 17.24% of the population is covered. Figure 7 is an extract from the Taarifa water

points dashboard. The map on the left tells us something about the percentage of functional pumps

per region. The colour-coding ranges from dark green, when all water pumps in the region are

functional (100%), to dark red, when none of the water pumps are functional (0%). The map clearly

indicates that there are still a lot of water pumps not working. The map on the right, which uses the

same colour-coding, represents the population coverage of clean water. Availability of water to the

population seems to be almost non-existent in some regions (Taarifa, 2016).

Figure 7 Percentage of functional water pumps (left) & Population coverage (right) per region (Taarifa, 2016)

Enabling better communication between citizens and local authorities regarding sanitation has

already helped a great deal to tackle the Tanzanian problems. But there is still a great deal that could

be improved. One next step towards a better world involves the use of data mining. The principles

of data mining could enable a thorough analysis, in this case with the use of a predictive model,

which would constitute a benefit that is twofold. A first benefit a predictive model would bring to

the table, is the capacity to act and schedule maintenance before a water pump actually breaks down.

Next to that, characteristics of faulty pumps can be analysed and help explain why it happens

(Provost & Fawcett, 2013). Both of these benefits could lead to improved decision making regarding

sanitation.

3.3 Data Understanding & Data Preparation

3.3.1 Data Read-in

The Taarifa platform gathers data from citizens and combines these with data from the Tanzanian

ministry of water. Next to the functional status of the water pump, there is information available

about the location (in terms of longitude, latitude, region, city etc.), the water itself (quality, capacity,

source, extraction type) and the management (operator, funder, payment info). The complete list is

presented in table 3 with a description and an example. This dataset can be downloaded from the

DrivenData website (footnote 1), which hosts data science competitions ‘to save the world’. It contains 59400

observations and 40 variables excluding the functional state.

Table 3 Available data about water pumps in Tanzania

Variable Description Example

amount_tsh Amount of water available to waterpoint 300

date_recorded Date entered 2013-02-26

funder Who funded the well Germany Republi

gps_height Altitude of the well 1335

installer Organization that installed the well CES

longitude GPS coordinate 37.2029845

latitude GPS coordinate -3.22870286

wpt_name Name of the waterpoint if there is one Kwaa Hassan Ismail

num_private Unknown 0

basin Geographic water basin Pangani

subvillage Geographic location Bwani

region Geographic location Kilimanjaro

region_code Geographic location (coded) 3

1 https://www.drivendata.org/competitions/7/

district_code Geographic location (coded) 5

lga Geographic location Hai

ward Geographic location Machame Uroki

population Population around the well 25

public_meeting True/False True

recorded_by Group entering this row of data GeoData Consultants Ltd

scheme_management Who operates the water point Water Board

scheme_name Who operates the water point Uroki-Bomang'ombe water sup

permit If the water point is permitted True

construction_year Year the water point was constructed 1995

extraction_type The kind of extraction the water point uses gravity

extraction_type_group The kind of extraction the water point uses gravity

extraction_type_class The kind of extraction the water point uses gravity

management How the water point is managed water board

management_group How the water point is managed user-group

payment What the water costs other

payment_type What the water costs other

water_quality The quality of the water soft

quality_group The quality of the water good

quantity The quantity of water enough

quantity_group The quantity of water enough

source The source of the water spring

source_type The source of the water spring

source_class The source of the water groundwater

waterpoint_type The kind of water point communal standpipe

waterpoint_type_group The kind of water point communal standpipe

status_group The functional state: Functional, in need of repairs or non-functional

3.3.2 Data Exploration, preparation and validation methodology

Without the use of advanced analytics, this stage explores the data and gives a feel of what we have

to work with. Looking at this set of data, it is sometimes still unclear what variables mean or, in some

cases, whether and how certain variables differ from each other. It’s important to know what variables

represent, so we can ask ourselves if they are necessary or if the data received makes any sense. This

takes up a large chunk of time and often it is claimed that 80% of the data analysis work goes into

cleaning the data (Wickham, 2014).

3.3.2.1 Common data cleaning problems

Erroneous data

Rahm and Hai Do (2009) provide an overview of possible data cleaning problems. Their overview is

presented in figure 8. Our case, which only uses one data source, shifts the attention to the ‘Single-

Source problems’, the left side of that figure. Most errors are data entry errors. For example,

misspellings when data entries allow open text as input or letters in a numbers-only field. In a

previous project, some people refused to provide their telephone number and answered ‘no’ instead,

which is also a perfect example. There’s also the issue of clearly impossible values, for example,

creation dates that are in the future or negative numbers entered as age. A problem with uniqueness

could be a duplicate in what was supposed to be a unique identifier. For example, a family and first

name combination that is twice entered. Or, more applied to this case, a water point name that is

entered twice. If on top of that, different GPS location data were entered for these 2 water points,

we encounter the presence of contradictory values.

Figure 8 Classification of data quality problems in data sources (Rahm & Hai Do, 2009)

Missing data

Another common problem is the presence of missing data. Often people are reluctant, or simply not

able to provide information. In the case of water pumps, missing values are mostly due to the

information not being available. There is no single right way to handle missing data; each variable that

contains missing values needs to be evaluated on its own. Gilbert A. Churchill and Dawn

Iacobucci (2005) describe three basic ways of handling missing data:

(1) When confronted with missing data, it’s possible to only take into account complete cases, where none of the variables have missing values. This method discards data that could still be useful, so it’s advised to use this only as a last resort (Churchill & Iacobucci, 2005) (Gelman & Hill, 2007).

(2) Where possible, missing data should be filled in or imputed in order to save instances for analysis. There are different ways to do this. The easiest way would be to impute the missing value with the mean, median or mode, but this could distort the distribution of the data and thus also the relationship between variables. Sometimes, it is also possible to use information from related variables in order to derive the right value or make an educated guess. Advanced methods are also possible (Gelman & Hill, 2007) (Churchill & Iacobucci, 2005).

(3) Another way could be to leave in missing values but tag them as missing and put them in a separate category (Churchill & Iacobucci, 2005).

An advanced method of imputing consists of deploying a model (for example a regression model)

on the non-missing variables that are available to predict a sensible value for the missing items. This

only works under the assumption that only one variable has missing data. To create a valid regression

model, the predictor variables need to be present. To solve the problem of multiple variables having

missing entries, multivariate imputation is used. One way to execute this multivariate approach is

to iteratively assess each variable. A model is created to predict a certain missing variable and if any

of the predictor variables have a missing value, it is imputed in a more crude manner as discussed

above. This is then done iteratively until all variables are dealt with. A more complete overview on

how to deal with missing data can be found in the work of Gelman and Hill (2007).
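The three basic ways of handling missing data listed above can be sketched in a few lines. The thesis itself works in R; the following is an independent Python sketch on a made-up toy sample, in which the records, the use of `None` as missing marker and all function names are assumptions for illustration only.

```python
from statistics import median

# Hypothetical toy records; None marks a missing value.
pumps = [
    {"population": 120, "permit": "True"},
    {"population": None, "permit": "False"},
    {"population": 300, "permit": None},
    {"population": 80, "permit": "True"},
]

def complete_cases(rows):
    """Way 1: keep only the rows without any missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]

def impute_median(rows, column):
    """Way 2: replace missing numeric values by the median of the known ones."""
    fill = median(r[column] for r in rows if r[column] is not None)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]

def tag_missing(rows, column, label="unknown"):
    """Way 3: keep every row and put missing values in a separate category."""
    return [{**r, column: label if r[column] is None else r[column]} for r in rows]

print(len(complete_cases(pumps)))
print([r["population"] for r in impute_median(pumps, "population")])
print([r["permit"] for r in tag_missing(pumps, "permit")])
```

Note how the first way halves this toy sample, which is why it is advised only as a last resort.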

3.3.2.2 Principles of Tidy Data

The principles of Tidy data propose a standard way to organise and structure data, in order to

facilitate its analysis. A dataset is called tidy when each variable forms a column, each observation

forms a row and each type of observational unit forms a table (Wickham, 2014).

In our case, the observational unit is a water pump, which has its own table. In that table, each row

represents 1 specific pump and each column a specific characteristic of that pump. Because of these

criteria, we can call our dataset tidy.

3.3.2.3 Assessing statistical relevance

Chi-square

In the data exploration stage, the variables are tested on their dependency with the functional state

of the water pumps (functional, in need of repairs or broken). The functional state is represented in

this data set with the variable status_group. When confronted with a categorical variable, the Chi-

square test is used. The Chi-square test’s null hypothesis claims there is no association between the

two categorical variables (Churchill & Iacobucci, 2005). A crosstab between these two variables is

used as basis for this test. This leads to the following null hypothesis:

H0: the row variable is independent of the column variable

As an example, table 4 represents a proportional crosstab of the public_meeting variable. The

categories of public_meeting can be found in the rows, the categories of status_group or the functional

state in the columns. The 4th row, GLOBAL, is the global distribution. Of all water pumps 54% are

functional, 7% are in need of repairs and 38% are broken. If the variables public_meeting and

status_group were independent, the same distribution of the status_group would occur over all

public_meeting categories. But this does not seem to be the case. The global percentage of functional

cases is 54%, but if there was no public meeting, only 43% are functional. The distribution is

visualised in figure 9; it’s clear that there are some differences among these groups. The Chi-square

evaluates if the measured distribution is different from the expected distribution if they were

independent (global distribution). If the p-value is smaller than 0.05, the null hypothesis is rejected,

which indicates the variables are not independent. The chi-square test regarding public meeting and

status_group results in a p-value of 0.00, which leads us to conclude the variables are not independent

from each other. The same approach will be used for other categorical variables in this dataset.

Table 4 Proportional Crosstab of status_group and public_meeting

PUBLIC MEETING   FUNCTIONAL   REPAIR   BROKEN   TOTAL
FALSE            43%          9%       48%      100%
TRUE             56%          7%       37%      100%
UNKNOWN          50%          5%       45%      100%
GLOBAL           54%          7%       38%      100%
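To make the mechanics of the Chi-square test concrete, the sketch below computes the test statistic from a crosstab of counts by comparing each observed cell with the count expected under independence. It is an independent Python illustration, not the R code used in this thesis. The counts in `toy_counts` are invented (they only mimic the proportions of table 4), the function names are hypothetical, and the closed-form p-value is valid only for an even number of degrees of freedom, which holds for a 3x3 table.

```python
import math

def chi_square_independence(table):
    """Return (statistic, degrees of freedom) for a crosstab of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

def chi_square_p_value(stat, dof):
    """Upper-tail probability; closed form for even degrees of freedom only."""
    if dof % 2 != 0:
        raise ValueError("closed form implemented for even degrees of freedom only")
    half = stat / 2.0
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(dof // 2))

# Invented counts: rows FALSE / TRUE / UNKNOWN, columns functional / repair / broken.
toy_counts = [[430, 90, 480], [5600, 700, 3700], [500, 50, 450]]
stat, dof = chi_square_independence(toy_counts)
print(stat, dof, chi_square_p_value(stat, dof))
```

A crosstab in which every row has exactly the same distribution yields a statistic of zero and a p-value of one, so the null hypothesis is not rejected.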

Figure 9 Distribution of status group over Public Meeting values

One-way ANOVA

If the focal variable is numeric, a one-way ANOVA is used. In simple terms, an ANOVA-test or

ANalysis Of Variance-test checks if a numeric variable changes due to the effect of a ‘treatment’.

The treatment variable in our case is the status_group. For example, when evaluating the GPS_height

variable, a boxplot over the different functional states reveals that non-functional water pumps may

have a lower GPS_height value (figure 10). The ANOVA-test investigates if this is statistically so, and

if it is, it can be claimed that status_group and GPS_height are not independent. Table 5 gives a little

summary of which methods are used in this thesis to investigate dependencies, when they are used

and which question each of them answers.

Table 5 Summary of dependency tests

Statistical test   Dependency of                          Description
Chi-square         Two categorical variables              Is the relationship between the variables the same as what would be expected of independent variables?
One-way ANOVA      A numeric and a categorical variable   Does the numerical variable differ statistically between the 3 functional states?
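The idea behind the one-way ANOVA can also be shown with a small computation: the F statistic compares the variation between the group means with the variation within the groups. The following Python sketch is an independent illustration (the thesis uses R); the height values are invented and the function name is hypothetical. It returns only the F statistic; turning it into a p-value requires the F distribution, which is left out here.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return the F statistic for a list of numeric samples, one list per group."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    group_means = [mean(g) for g in groups]
    k, n = len(groups), len(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Variation of the observations around their own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, group_means) for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Invented GPS heights per functional state, for illustration only.
toy_heights = [
    [1200, 1300, 1250],  # functional
    [1100, 1150, 1050],  # in need of repairs
    [600, 700, 650],     # non-functional
]
print(one_way_anova_f(toy_heights))
```

A large F value indicates that the group means differ more than the spread within the groups would suggest.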

3.3.3 Data Exploration, preparation and validation

Now it’s time to get our hands dirty and identify if there are issues present in our case.

Functional state

Figure 10 Boxplot of GPS_Height over status_group

First of all, an evaluation of the functional state. Most of the water pumps are functional (32259 or 54.31%, represented in blue in Figure 11). Still, a large portion is non-functional (22824 or 38.42%, in red) and a smaller set is in need of repairs (4317 or 7.27%, in orange). This is the value that we want to predict in a later stage, but in this data exploration phase, we can already find out if there are correlations or relationships between this functional state and other variables.

Amount_tsh

The description identifies amount_tsh as the amount available to the water point. In more pump-

technical terms: ‘total static head’ (expressed in meters). Pumpfundamentals.com claims “head is a

very useful and practical term to use when evaluating a pump’s capacity to do a job”. Total static

head indicates the height at which the pump can raise up water, or again in technical terms: “the

elevation between the surface of the reservoir and the point of discharge into the receiving tank”.

For this reason, a total static head of ‘0’ is impossible, because otherwise a pump

would not be needed (Chaurette, 2016). In most cases (70.10%), however, this is 0. A possible

explanation could be that missing values are represented by 0. A share of 70% missing values is too large to

impute without introducing bias, which is why this variable is excluded from the analysis.

Date_recorded

Almost all water points were recorded between 2010 and 2013, only 31 were not. The date the water

pump was entered in the system should not influence the functional state of the water pump. But

maybe the time of year could play an influential role (footnote 2). Tanzania has two rainy seasons and two dry

seasons. The main rainy season or the ‘long rains’ happens during March, April and May. This is

followed by the long dry season in June, July, August, September and October. In November and

December, there’s a smaller rainy season or the ‘short rains’. January and February are called the

‘short dry season’ (ExpertAfrica, sd). This is summarized in table 6. If the water pump was recorded

during the ‘Long rains’ season 60% was functional, whereas recordings in the ‘Short Dry’ season

2 User Dipetkov on the DrivenData forums inspired this approach: https://github.com/dipetkov/DrivenData-PumpItUp/blob/master/transform-data.R

Figure 11 Distribution of Status_group variable

only have a functional percentage of 50%. The seasons are a relevant factor that influences our

focal variable, which is why they are captured in the newly created RecordingSeason variable.

Table 6 Seasons in Tanzania

Season Months

Short dry season January, February

Long rains March, April, May

Long dry season June, July, August, September, October

Short rains November, December
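Deriving the RecordingSeason variable from date_recorded amounts to a lookup on the month, following table 6. A minimal Python sketch of that mapping is given below (the thesis's own implementation is in R); the names `SEASON_BY_MONTH` and `recording_season` are hypothetical, and the date format is assumed to be the ISO format shown in table 3.

```python
from datetime import date

# Month number -> season, following table 6.
SEASON_BY_MONTH = {
    1: "Short dry", 2: "Short dry",
    3: "Long rains", 4: "Long rains", 5: "Long rains",
    6: "Long dry", 7: "Long dry", 8: "Long dry", 9: "Long dry", 10: "Long dry",
    11: "Short rains", 12: "Short rains",
}

def recording_season(date_recorded):
    """Map an ISO date string such as '2013-02-26' to its season."""
    return SEASON_BY_MONTH[date.fromisoformat(date_recorded).month]

print(recording_season("2013-02-26"))
```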

Funder & Installer

The variables Funder and Installer contain 1898 and 2146 unique values respectively. After careful

investigation, it seems apparent those can be grouped into 8 categories. This is represented in table

7, which also gives a quick description and an indication of their size. The greatest hurdle is the lack

of structure in data entries. There are a lot of typos (Oxfam vs ‘oxfarm’, World Bank vs ‘wourld

bak’) and different spellings, which require intensive manual investigation to classify. Most entries

for Funder and Installer are organization names or acronyms, but if a Google search does not reveal

what it refers to, it is not possible to classify them as there are no other means to gain this domain

knowledge. Those unclassifiable cases were placed in the ‘Other’ category.

Table 7 Newly created categories for variable funder

Category                  Characteristics
Other                     When not belonging to any other group, or unable to identify where it should belong
Government                All things related to Tanzanian authorities
International Investment  Partnerships between Tanzanian and foreign governments
Aid                       Organizations like Red Cross, Oxfam, Unicef, World Bank…
Unknown                   Entries like ‘0’, ‘ ’, ‘no’, ‘not known’ or responses that only have 1 character
Religious initiatives     Initiatives originating from a religious organisation, for example the Lutheran church in Tanzania
Private                   Private companies/individuals
Community/local efforts   Local NGOs, schools, community funding
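The grouping in this thesis was done by manual investigation. As an illustration of how part of that work could be supported programmatically, the sketch below normalises an entry, applies the ‘Unknown’ rule from table 7 and then uses approximate string matching to absorb typos. It is an independent Python sketch, not the method used here: the tiny `KNOWN` lookup, the cutoff of 0.8 and the function name `categorize` are all assumptions.

```python
import difflib

# Hypothetical lookup of cleaned names; a real mapping would be far larger.
KNOWN = {
    "oxfam": "Aid",
    "world bank": "Aid",
    "unicef": "Aid",
    "red cross": "Aid",
    "government of tanzania": "Government",
}
UNKNOWN_TOKENS = {"0", "", "no", "not known"}

def categorize(raw):
    """Assign a raw funder or installer entry to one of the new categories."""
    name = raw.strip().lower()
    if name in UNKNOWN_TOKENS or len(name) == 1:
        return "Unknown"
    # Closest known name by similarity ratio; the cutoff is an arbitrary choice.
    match = difflib.get_close_matches(name, list(KNOWN), n=1, cutoff=0.8)
    return KNOWN[match[0]] if match else "Other"

for entry in ["oxfarm", "Wourld Bak", "0", "Acme Drilling"]:
    print(entry, "->", categorize(entry))
```

Such matching only helps once a reference list of names exists; entries it cannot match still need manual inspection.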

GPS Height

GPS Height ranges from -90 to 2270. The Nations Encyclopedia claims the lowest point in

Tanzania is around sea level or 0 meter (Nations Encyclopedia, sd), so we would expect a minimum

value of 0 meter. As it is unclear how this variable was exactly measured or obtained, we can assume

it was by some sort of GPS, which is inherently imprecise according to gpsinformation.net

(Mehaffrey, 2001). Even though the GPS height information may not be accurate, it still provides an

indication of height and can still be useful.

34.4% of observations have a GPS Height of zero. This suggests that, as encountered before, missing

values are recorded as zero. One way of dealing with missing values is by imputing with the mean,

but imputing 34.4% of all observations with a global mean would severely influence the distribution

and relationships between GPS Height and other variables. A more appropriate way to impute

would be by looking at water points that are nearby and derive a more focused value to use as

imputation value.

Wikipedia shows us the subdivision of Tanzania (Wikipedia, 2016): there are 30 regions, which are

divided into districts and divisions, composed of wards that consist of villages. The variables region,

district, ward and subvillage are present in the dataset. On top of that there’s also a variable named

LGA or Local Government Authority, which also groups villages together based on proximity. To

impute a missing value, we start by looking at water pumps in the same subvillage. If there are several

other water pumps in the same subvillage, we can average their GPS height values to impute the

missing one. In that way, there’s an imputation by the mean, but it’s a more sophisticated imputation

as it is based on nearby water pumps. In case there are no water pumps in the same village or all the

other water pumps in the village also have missing values, the range should be broadened to water

pumps in the same ward. If there are still missing values left after this, LGA’s and districts can also be

included. After going through this process, there are still 16 missing values left. These are situated in

only 2 districts (Mpwapwa and Kishapu). Further investigation reveals that those 16 water pumps

are only spread around 3 wards: ‘Gode Gode’, ‘Matomondo’ and ‘Masanga’. To impute these last 16

water pumps’ GPS Height, the elevationmap.net tool (footnote 3) helps to identify the altitude in these wards. An

overview of the missing districts and wards with their altitude values found on elevationmap.net is

provided in table 8. The entire process with the decrease of missing values at each step is shown in

table 9.

Table 8 GPS Height: Manual look-up of missing data

District Ward Altitude

Mpwapwa Gode Gode 837m

Mpwapwa Matomondo 1091m

Kishapu Masanga 1173m

Table 9 GPS Height: The process of imputing missing values

Estimation method    Missing values    Units without data
Original situation   20438 (34.41%)    20438 water pumps
Subvillage mean      15419 (25.96%)    7265 villages
Ward mean            14735 (24.81%)    751 wards
LGA mean             13885 (23.38%)    41 LGAs
District mean        16 (0.03%)        2 districts
Manual look-up       0 (0%)            0 water pumps
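The stepwise imputation described above (subvillage first, then ward, LGA and district) can be sketched as follows. This is an independent Python illustration rather than the R code of this thesis. It assumes the zeros have already been recoded to `None`, it computes every mean from originally observed values only, and the toy rows, the `LEVELS` list and the function name are hypothetical.

```python
from statistics import mean

LEVELS = ["subvillage", "ward", "lga", "district_code"]

def impute_by_hierarchy(rows, column="gps_height", levels=LEVELS):
    """Fill missing values with the mean of the smallest unit that has data."""
    result = [dict(r) for r in rows]
    for level in levels:
        # Collect the originally observed values per unit at this level.
        observed = {}
        for r in rows:
            if r[column] is not None:
                observed.setdefault(r[level], []).append(r[column])
        for r in result:
            if r[column] is None and r[level] in observed:
                r[column] = mean(observed[r[level]])
    return result

def pump(subvillage, ward, lga, district_code, gps_height):
    return {"subvillage": subvillage, "ward": ward, "lga": lga,
            "district_code": district_code, "gps_height": gps_height}

toy = [
    pump("A", "W1", "L1", 1, 1000),
    pump("A", "W1", "L1", 1, 1200),
    pump("A", "W1", "L1", 1, None),  # filled at subvillage level
    pump("B", "W1", "L1", 1, None),  # filled at ward level
    pump("C", "W2", "L1", 1, 800),
    pump("D", "W3", "L1", 1, None),  # filled at LGA level
    pump("E", "W4", "L2", 2, None),  # no data at any level: manual look-up
]
print([r["gps_height"] for r in impute_by_hierarchy(toy)])
```

Values that remain `None` after the last level correspond to the cases that required a manual look-up.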

Longitude and Latitude

Google Maps (footnote 4) was used to retrieve location data of the Tanzanian border, in order to evaluate if the

data received in the variables longitude and latitude are valid. The border location cases used are

‘Lake Tanganyika’ on the left side, ‘Mavago’ on the bottom side, ‘Mtwara’ on the right side and

‘Lake Victoria’ on the top side. The locations, together with their values, are presented in table 10.

3 http://elevationmap.net/#menu2

4 https://www.google.be/maps/

Table 10 Longitude and Latitude validation

Location          Position      Latitude      Longitude
Lake Tanganyika   Left side     -6.091258     29.495719
Mavago            Bottom side   -11.732787    36.548942
Mtwara            Right side    -10.374224    40.372184
Lake Victoria     Top side      -0.961119     32.374137

This location data shows that valid entries should have a latitude between -11.73 and -0.96 and a

longitude between 29.50 and 40.37. The data provided shows a latitude between -11.65 and -0.96

and a longitude between 29.61 and 40.35. But, for both variables, zero is again used for missing

values. Values of zero for longitude are impossible, as such a location is not situated in Tanzania. The values for

latitude that are zero seem to be all related to the regions ‘Mwanza’ and ‘Shinyanga’, which definitely

do not have latitude values close to zero. The same method for imputing GPS height missing values is

used here and summarized in table 11. At the end of the process 268 water pumps remain with

missing location data. But those are all situated in the same LGA: ‘Geita’. A manual look-up helps to

identify the right location data to use for imputation.
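The validity check described above can be sketched in a few lines of Python (not the R code used in the thesis). The bounds are the rounded border values derived from table 10; the names `LAT_RANGE`, `LON_RANGE` and `is_valid_location` are hypothetical.

```python
# Rounded bounding box of Tanzania, taken from the border points in table 10.
LAT_RANGE = (-11.73, -0.96)
LON_RANGE = (29.50, 40.37)


def is_valid_location(latitude, longitude):
    """Return True when both coordinates fall inside the bounding box."""
    lat_ok = LAT_RANGE[0] <= latitude <= LAT_RANGE[1]
    lon_ok = LON_RANGE[0] <= longitude <= LON_RANGE[1]
    return lat_ok and lon_ok
```

Because zero lies outside both ranges, records that use zero as a placeholder are flagged as invalid by the same check and can then be passed on to the imputation steps.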

Table 11 Latitude & Longitude: The process of imputing missing values

Estimation method | Missing values | Units without data
Original situation | 1812 (3.05%) | 1812 water pumps
Subvillage mean | 1142 (1.92%) | 720 villages
Ward mean | 921 (1.55%) | 57 wards
LGA mean | 268 (0.45%) | 1 LGA
Manual look-up | 0 (0%) | 0 water pumps


Population

36% of water points have a population of 0 to serve, which can be interpreted as a missing value. That is a large portion, but the variable may still aid in the model creation. Population is a highly skewed variable, which produces the box plot in figure 12. Most values are close to zero, with a minimum of 1, a mean of 281, a median of 150 and a maximum of 30500. Following the statistical rule for identifying outliers (values lying more than 1.5 times the interquartile range beyond the quartiles), there are 7682 outliers or 13% of all water points. That is a minority but still a very large portion. However, it is plausible that a small number of water pumps serve a very large population, if we think about urban versus countryside populations. For this reason, these outliers might be justified and removing them would not reflect reality. Deleting the variable altogether may be a little crude, so imputation may be useful. The imputation is done by using the median instead of the mean, because the variable is highly skewed.
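As an illustration of the two steps above, the following Python sketch flags outliers with the 1.5 times interquartile range rule and replaces zeros by the median of the observed values. It is not the thesis's R code; the function names are hypothetical, and the quartiles come from `statistics.quantiles` with its default method, which may differ slightly from the quartile definition used in R.

```python
from statistics import median, quantiles


def iqr_outliers(values):
    """Return the values lying more than 1.5 * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]


def impute_zero_with_median(values):
    """Treat zeros as missing and replace them by the median of the rest."""
    observed = [v for v in values if v != 0]
    fill = median(observed)
    return [fill if v == 0 else v for v in values]
```

The median is insensitive to a few very large populations, which is why it is the safer fill value for a skewed variable.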

Public meeting

The largest portion of water pumps were approved by a public meeting (51011 or 85.88%). 5055 (or 8.51%) were not and for 3334 (5.61%) water pumps this variable was missing. This can be seen in figure 13.

Figure 13 Public meeting

Figure 12 Boxplot of population


Permit

Permit has 3 possible values: true (38852), false (17492) or unknown (3056).

Figure 14 Permit

Construction year

The construction year of the water points ranges between 1960 and 2013. 35% have a missing value. In order to deal with this large portion of missing values, we could opt to impute them with some sort of mean. Alternatively, we could categorize them as missing, which is done here. Seven buckets are formed: six decades (60’s, 70’s, 80’s, 90’s, 00’s and 10’s) and a separate category that catches all missing entries and is named accordingly: “missing”. The distribution over the years is shown in figure 15.
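The bucketing of construction years into decades can be sketched as follows in Python (the thesis used R). The function name `construction_decade` is hypothetical, and treating a year of 0 or `None` as missing is an assumption about how missing years are coded.

```python
def construction_decade(year):
    """Map a construction year to a decade label such as "90's".

    Assumption: a year of 0 (or None) marks a missing value.
    """
    if not year:
        return "missing"
    decade = (year // 10) * 10          # 1987 -> 1980
    return f"{decade % 100:02d}'s"      # 1980 -> "80's", 2000 -> "00's"
```

Turning the year into a categorical variable this way lets the missing entries keep their own level instead of being imputed.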

Figure 15 Construction year as factor


Extraction type

There are 3 variables regarding extraction type, each displaying a different level of granularity. Extraction_type has 21 unique values, some of which are represented by only a very small number of water points. We could opt to reduce the number of values and group some together, but this has already been done for us in the Extraction_type_group variable, which has 13 levels. The variable Extraction_type_class has 7 levels, of which the distribution can be found in figure 16.

The different levels and their differences can be scrutinized in table 12, in which the different groups are accompanied by their relative size. Some categories contain only a very small portion of water pumps, and it is clear that the extra division made by the extraction_type variable is one too many: it splits already small chunks into even smaller pieces. On top of that, some summarization can be performed in the Extraction_type_group variable as well. As the ‘india mark III’ group is so small, we could merge it with ‘india mark II’. Likewise, the category ‘motor pump’ should not be split and should be kept as one level in the Extraction_type_group variable.

Figure 16 Distribution of Extraction_type_class variable


Table 12 Granularity of extraction type

Class | Group | Type
Gravity (45%) | |
Hand pump (28%) | Afridev (3%) |
 | India Mark II (4%) |
 | India Mark III (0.2%) |
 | Nira/tanira (14%) |
 | Swn 80 (6%) |
 | Other Handpump (0.6%) | Other – play pump (0.1%)
 | | Other – Swn 81 (0.4%)
 | | Walimi (0.1%)
 | | Other – mkulima / shinyanga (0.0%)
Submersible (10%) | Submersible (10%) | Submersible (8%)
 | | KSB (2%)
Motor pump (5%) | Other motor pump (0.2%) | Climax (0.05%)
 | | Cemo (0.15%)
 | Mono (5%) |
Rope pump (0.8%) | |
Wind powered (0.2%) | Wind powered (0.2%) | Windmill (0.2%)

Management & Management group

The Management variable contains 12 levels, whereas the Management_group variable has 5. The

distribution of water pumps over the different levels of Management_group is shown in figure 17.

Figure 17 Distribution of water pumps over Management_group


Table 13 Granularity of Management

Management_group | Management
User-group (88%) | VWC (68%)
 | Water board (5%)
 | Wua (4%)
 | Wug (11%)
Commercial (6%) | Company (1%)
 | Private Operator (3%)
 | Trust (0.1%)
 | Water authority (2%)
Parastatal (3%) |
Other (2%) | Other (1.4%)
 | Other – school (0.2%)
Unknown (1%) |

Based on table 13, I would opt to merge some of the levels in the Management variable. The subdivisions of ‘Other’ are too small, and ‘Company’, ‘Private Operator’ and ‘Trust’ can be grouped together.

Scheme management

We encounter 2697 unique scheme names in the scheme_name variable, but they are conveniently grouped into the scheme_management variable, which has only 13 unique values. The values encountered in scheme_management are the same as for the variable management discussed in table 13. That variable has a related variable, management_group, which summarizes management into 5 groups. For the scheme_management variable, we can create a similar grouping variable. This newly created variable is called SchemeGroup and its levels are displayed in table 14.


Table 14 SchemeGroup and Scheme management levels

SchemeGroup | Scheme management
User-group (80%) | VWC (62%)
 | Water board (5%)
 | Wua (5%)
 | Wug (9%)
Commercial (9%) | Company (2%)
 | Private Operator (2%)
 | Trust (0.1%)
 | Water authority (5%)
Parastatal (3%) |
Other (1%) | Other (1.3%)
 | SWC (0.1%)
Unknown (7%) |

Payment & Payment type

These variables keep track of the way payments are done. The variables payment and payment_type are almost exactly the same, the only difference being the naming of one level: ‘pay when scheme fails’ versus ‘pay on failure’, which by their naming should mean the same. For that reason we only continue with the payment (and not the payment_type) variable.

Figure 18 Distribution of water pumps over payment variable


Quality group

The quality_group variable describes the quality of the water. For most water points (86%), the quality is ‘good’, as can be seen in figure 19. There are no particular issues with this variable, so it is left untouched.

Figure 19 Distribution of Quality_group

Quantity

There are 5 different levels in the quantity variable. The variable Quantity_group can be deleted as it is an exact duplicate. Most water pumps have the label “enough”, as can be seen in figure 20. The question that immediately arises when looking at these levels is whether water pumps with the quantity ‘dry’ are associated with non-functional water pumps. The crosstab with status_group shows that this is indeed the case (table 15).


Figure 20 Quantity

Table 15 Quantity Crosstab

QUANTITY | FUNCTIONAL | REPAIR | BROKEN | TOTAL
DRY | 3% | 1% | 97% | 100%
ENOUGH | 65% | 7% | 27% | 100%
INSUFFICIENT | 52% | 10% | 38% | 100%
SEASONAL | 57% | 10% | 32% | 100%
UNKNOWN | 27% | 2% | 71% | 100%
GLOBAL | 54% | 7% | 38% | 100%

Table 15 shows the crosstab of quantity and status_group: for each quantity level, it displays the distribution of the status_group variable. If a water pump is ‘dry’ or its quantity is unknown, there is a high chance that the water point is broken (97% and 71% of water pumps, respectively). On the other hand, if the quantity level is ‘enough’, there is a higher chance that the water point is functional (65% of water pumps).
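A row-percentage crosstab such as table 15 can be computed with a short Python sketch (not the R code used in the thesis). The function name `row_percentages` and the input format, a list of (quantity, status) pairs, are assumptions for illustration.

```python
from collections import Counter, defaultdict


def row_percentages(pairs):
    """Build a crosstab in which every row sums to 100%.

    pairs : iterable of (row_level, column_level) tuples, one per water pump
    """
    counts = defaultdict(Counter)
    for row_level, col_level in pairs:
        counts[row_level][col_level] += 1
    table = {}
    for row_level, cols in counts.items():
        total = sum(cols.values())
        table[row_level] = {c: 100 * n / total for c, n in cols.items()}
    return table
```

Reading the table by row answers the question posed in the text: given a quantity level, how are the pumps distributed over the functional states?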

Source

There are 3 levels of granularity: source, source_type and source_class, having 10, 7 and 3 levels respectively. These are shown in table 16. The percentages displayed relate to the total number of water pumps, which is 59400. The main difference between source and source_type is the split of the ‘borehole’ source type, in which only a very small portion is further identified as ‘hand dtw’, accounting for just 1% of all water pumps. For this reason, the source variable can be ignored.


Table 16 Granularity of Source

Class | Type | Source
Ground water (77%) | Spring (29%) |
 | Borehole (20%) | Machine dbh (19%), hand dtw (1%)
 | Shallow well (28%) | Shallow well
Surface (22%) | Rainwater harvesting (4%) |
 | River/lake (17%) |
 | Dam (1%) |
Unknown (0.5%) | Other (0.5%) | Unknown (0.3%), Other (0.4%)

Water point type

There are 2 variables related to the type of water point: waterpoint_type and waterpoint_type_group. The first one has 7 levels and the second one has 6. The only difference lies in the subdivision of ‘communal standpipe’ (from the waterpoint_type_group variable) into 2 categories, depending on whether there is one standpipe or several. As this subdivision separates a substantial part of the water pumps (6103 or 10%), the most detailed variable (waterpoint_type) is kept. The distribution is shown in figure 21.

Figure 21 Waterpoint type


Location data

There are several indicators of location in this dataset. Some are coordinates, others are factor variables like Basin, Region and District code. District codes are only provided as numbers and are thus hard to interpret in an analysis. There are clearly larger differences between regions than between basins. The percentage of functional water pumps ranges from 30% in the Lindi and Mtwara regions to 68% in the Arusha region. Per basin, the percentage of functional water pumps ranges from 41% (Lake Rukwa) to 65% (Lake Nyasa). The distribution of these variables can be found in figures 22 and 23.

Figure 22 Basin

Figure 23 Region


3.3.4 Summary of Data Understanding / Data preparation

The data understanding / data preparation stage is a long read, so a summary helps to review what has happened. Due to the hard-to-understand nature of amount_tsh and the fact that 70% of it is missing, this variable was deleted from the analysis. A strong manual effort made it possible to use the funder and installer variables, grouped in 8 categories. Missing values for GPS_height, latitude and longitude were imputed by looking at nearby locations and deriving a sensible mean to impute them with. A new variable (RecordingSeason) was created that contains the season in which the water pump was recorded. For a couple of subjects, different variables displaying different levels of granularity were available. Each of those divisions was investigated to check whether it makes sense, deleting variables with too many unique levels and summarizing low-frequency values into groups. Population has 36% of its values missing, which were imputed by the median. Table 17 shows this summary in tabular form. All variables seem to have a statistically significant association with the focal variable, the status of the water pumps. This was tested by a Chi-square or an ANOVA test, depending on the variable type.

Table 17 Summary of data handling

Variable Name | Description of cleaning | Relevance
Funder | Grouped into 8 categories |
Installer | Grouped into 8 categories |
GPS height | Missing data imputed with mean of nearby available data |
Longitude | Missing data imputed with mean of nearby available data |
Latitude | Missing data imputed with mean of nearby available data |
Region | Recoding of empty values as “Unknown” |
District Code | Read as factor, not as a number |
Public Meeting | Recoding of empty values as “Unknown” |
Population | Many missing values. As it is highly skewed, impute with median instead of mean. |
Scheme & SchemeGroup | Recoding of empty values as “Unknown”. Deletion of scheme_name variable: too detailed. Creation of summarizing variable SchemeGroup. |
Permit | Recoding of empty values as “Unknown” |
ConstructionYearFactor | Placed in 6 buckets and made categorical, missing values in separate bucket |
Extraction Type Group & Extraction Type Class | In three levels with different granularity: type, group and class. Restructuring of Group categories. Deletion of type variable. |
Management & Management Group | Recoding of empty values as “Unknown”. Restructuring of Management categories. |
Payment | Recoding of empty values as “Unknown”. Deletion of payment_type: duplicates. |
Water Quality | Recoding of empty values as “Unknown” |
Quantity | Recoding of empty values as “Unknown”. Quantity_group can be deleted as it’s an exact duplicate. |
Source | Recoding of empty values as “Unknown” |
Water point type | Recoding of empty values as “Unknown” |
RecordingSeason | Derived from the Date_recorded variable |

3.4 Modelling & Modelling Evaluation

3.4.1 Modelling introduction

The aim of this business problem is to deal more efficiently with water scarcity in Tanzania. This was translated into the data mining problem of predicting the functional state of a water pump. For every water pump, the correct class, or the probability that the instance belongs to a class, needs to be predicted. In data mining terminology this is called a classification problem, a class probability estimation problem (James, Witten, Hastie, & Tibshirani, 2013) or a supervised segmentation problem (Provost & Fawcett, 2013). All these terms make sense. It is a supervised problem because it has a target attribute and training data where the value for the target attribute is known. It is a segmentation/classification problem because the aim is to segment the data into different groups or classes. Figure 24 was inspired by the representation of Provost & Fawcett (2013) and shows how the data received is structured. Each row represents an instance, in this case a water point. Each water point has several attributes or characteristics, like Quantity, Region, Quality etc. The attributes are found in the columns. In a supervised segmentation or classification problem there is always a target attribute, in this case the functional state of the water point. In the next couple of paragraphs, some approaches to predicting this target attribute are covered.


Figure 24 Attributes and target attribute representation, inspired by (Provost & Fawcett, 2013)

3.4.1.1 Classification trees

A first approach to predict a target attribute is the use of classification trees. Classification trees predict a qualitative response, a class to which an instance belongs, by using recursive binary splitting. This means that in every ‘node’ of a tree, the dataset is split into 2 separate groups based on the values of a certain variable. The criterion used in the binary split could be the classification error rate. We assign an observation to the most commonly occurring class that is encountered in its node. The classification error rate is the proportion of training observations in that node that do not belong to that class. An alternative is the Gini index as a measure of node purity, in which small values indicate more purity (James, Witten, Hastie, & Tibshirani, 2013).
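Both split criteria can be written down in a few lines. The Python sketch below (the thesis itself relies on R) computes the classification error rate and the Gini index of a node from its class labels, using the definition G = sum over classes of p(1 - p) given by James et al. (2013). The function names, including the helper `weighted_impurity` for scoring a candidate split, are hypothetical.

```python
from collections import Counter


def class_proportions(labels):
    """Proportion of observations in the node that belong to each class."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]


def classification_error(labels):
    """Share of observations that do not belong to the majority class."""
    return 1 - max(class_proportions(labels))


def gini_index(labels):
    """Gini index: sum of p * (1 - p) over classes; 0 means a pure node."""
    return sum(p * (1 - p) for p in class_proportions(labels))


def weighted_impurity(left, right, impurity=gini_index):
    """Impurity of a binary split, weighting each child by its size."""
    n = len(left) + len(right)
    return len(left) / n * impurity(left) + len(right) / n * impurity(right)
```

A tree-growing procedure would evaluate `weighted_impurity` for each candidate split and keep the split with the lowest value.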

Figure 25 is an example of how trees work, applied to our case study. To keep the tree simple, the functional states ‘in need of repair’ and ‘non-functional’ were merged. The blocks represent a state. The number displayed in each block is the percentage of functional pumps: in the initial state, 55% of all water pumps are functional. Recall that the possible values for the Quantity variable are “Enough”, “Insufficient”, “Seasonal”, “Dry” and “Unknown”. If we only look at the water pumps that are “Dry” or “Unknown”, we notice that only 5% of those water pumps are functional (right-side branch of the tree). As 95% of these pumps are non-functional, the tree model predicts (= assigns the label of the majority) that water pumps with those characteristics are non-functional, and therefore the block is coloured red. On the other hand, if the Quantity variable is “Enough”, “Insufficient” or “Seasonal”, the share of functional pumps rises to 61% (left-side branch). The majority is functional, hence the predicted state is functional (and the block is therefore coloured blue). The ultimate aim is to arrive at end nodes that are as pure as possible. Predicting “functional” on a subset of data of which only 61% is functional is not very accurate. Another binary split can be made using the water point type variable, which helps to obtain purer end nodes (68% functional vs 30% functional). We can add more and more variables until we are satisfied with the end node purity.

Figure 25 A Classification tree representation (Based on total population: 59400)

To create a classification tree in R, the package ‘rpart’ is used (Therneau, Atkinson, & Ripley, 2015). Its name refers to recursive partitioning, the way classification trees are built. By default, the rpart package uses the Gini index to choose splits.

Random Forest

The Random Forest algorithm is a combination of a lot of trees (= a forest) with random feature selection, hence the name. It is considered one of the top performing techniques. Randomness is induced in two ways. First, at each split a random sample of predictors is chosen as split candidates, which ensures that the constructed trees do not look alike and are less correlated. Second, randomness is created by ‘Bootstrap Aggregation’, or simply bagging. Bagging creates different bags or boots by randomly sampling the data (with replacement). Each boot creates its own model, which leads to a decision regarding a certain instance, and these decisions are then combined by averaging or a majority vote (James, Witten, Hastie, & Tibshirani, 2013). This process is presented in figure 26. Several ‘boots’ (B) are extracted from the data (through random sampling with replacement), each with its own model. The decisions or predictions resulting from those models (D) are then combined.
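The bagging principle can be sketched generically in Python (this is not the randomForest implementation, and the random feature selection at each split is left out). The names `bootstrap_sample`, `majority_vote` and `bagged_predict` are hypothetical, and `fit` stands for any function that takes a training sample and returns a prediction function.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) observations at random, with replacement (one 'boot')."""
    return [rng.choice(data) for _ in data]


def majority_vote(predictions):
    """Return the most frequently predicted class."""
    return Counter(predictions).most_common(1)[0][0]


def bagged_predict(data, fit, instance, n_bags=25, seed=0):
    """Fit one model per bootstrap sample and combine them by majority vote."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_bags):
        model = fit(bootstrap_sample(data, rng))  # model trained on one boot
        votes.append(model(instance))             # its decision for the instance
    return majority_vote(votes)
```

For class probabilities instead of labels, the vote would be replaced by an average over the models' predicted probabilities.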

Figure 26 Bagging as presented in course material of Advanced Predictive Analytics (personal correspondence with Dirk van den Poel)

To execute this in R code, the randomForest package is used (Liaw & Wiener, 2015). It is based on the work of Leo Breiman, UC Berkeley professor and creator of the Random Forest approach.

3.4.1.2 Logistic Regression

A logistic regression can be seen as a regression with a dependent variable that is categorical. Instead of predicting a numeric value, it predicts the probability that a certain instance belongs to a class. Linear regression would try to draw a straight line through the observations, but as they only have a value of 0 or 1, a straight line misses the point. It also produces values below 0 or above 1, which are impossible when talking about probabilities. This is illustrated in the left graph of figure 27. The logistic model ensures that the predicted probability ranges between 0 and 1 by using the logistic function (the inverse of the logit), as shown in the right part of figure 27 (James, Witten, Hastie, & Tibshirani, 2013). The R implementation of logistic regression, and by extension of generalized linear models, does not need any external R packages.
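The contrast between the two models can be made concrete with a small Python sketch. The coefficients passed to the functions are arbitrary illustrative values, not estimates from the thesis, and the function names are hypothetical.

```python
import math


def linear_prediction(x, intercept, slope):
    """Straight-line prediction: unbounded, so it can leave the [0, 1] range."""
    return intercept + slope * x


def logistic_prediction(x, intercept, slope):
    """Logistic function applied to the same linear predictor: always in (0, 1)."""
    return 1 / (1 + math.exp(-(intercept + slope * x)))
```

For example, with an intercept of 0.5 and a slope of 0.1, the linear prediction at x = -10 is -0.5, which cannot be a probability, whereas the logistic prediction for the same input stays between 0 and 1.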

Figure 27 Linear and logistic regression


Boosting with Logistic regression

Boosting is also one of the top performing techniques. It is likewise a combination of several models, but this time they are built sequentially: each model depends on the previous one. Misclassified instances get a higher weight in the next iteration and, in the end, the predictions of all models are combined into one prediction by averaging or a majority vote (James, Witten, Hastie, & Tibshirani, 2013). Figure 28 captures this method. From the data a first model (T1) is created, with equal weights assigned to all instances (a), which leads to a first prediction (D1). Misclassified instances influence the weights used in the next iteration of the model (a2), which again outputs a prediction. This process is repeated until a stopping point is reached. All the models’ predictions are then combined through averaging or majority voting into a final decision.
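The reweighting step can be illustrated with the textbook discrete AdaBoost update, sketched here in Python. This is an assumption about the general technique, not a description of what the R package used in the thesis does internally; the function name `adaboost_round` is hypothetical.

```python
import math


def adaboost_round(weights, misclassified):
    """One reweighting step of discrete AdaBoost.

    weights       : current instance weights, summing to 1
    misclassified : booleans, True where the current model was wrong
    Returns the new normalized weights and the model's vote weight (alpha).
    Assumes the weighted error lies strictly between 0 and 1.
    """
    err = sum(w for w, wrong in zip(weights, misclassified) if wrong)
    alpha = 0.5 * math.log((1 - err) / err)
    # Misclassified instances are scaled up, correct ones scaled down.
    updated = [
        w * math.exp(alpha if wrong else -alpha)
        for w, wrong in zip(weights, misclassified)
    ]
    total = sum(updated)
    return [w / total for w in updated], alpha
```

After the update, the misclassified instances together carry half of the total weight, which forces the next model to concentrate on them.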

Figure 28 Sequential Boosting (personal correspondence with Dirk van den Poel)

The boosting approach can be used with different algorithms. In this case, we chose to combine it with logistic regression, to build further upon the algorithms that are already explained. The ‘ada’ package was used to execute this (Culp, Johnson, & Michailidis, 2016). It is based on Additive Logistic Regression: A Statistical View of Boosting by Friedman, et al. (2000).

3.4.2 Three way classification approach

One vs all classification

The value to predict is the state of the water pump, which can be ‘functional’, ‘functional but in need of repairs’ or ‘non-functional’. Traditional classification focuses on binary problems, but the binary approach can be used creatively to address this three-class problem as well. A binary classifier is built for each class (that class versus all others), which yields a probability estimate per class. These estimates are then compared, and the class with the highest probability is chosen as the predicted value (Lin, Weng, & Wu, 2004).
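The combination step of the one vs all approach is a simple comparison of scores, sketched below in Python. The function name `one_vs_all_predict` is hypothetical, and `binary_models` is assumed to map each class label to a fitted binary model that returns the probability of that class.

```python
def one_vs_all_predict(binary_models, instance):
    """Score the instance with every binary model and pick the top class.

    binary_models : dict mapping a class label to a function that returns
                    the estimated probability of that class for an instance
    """
    scores = {label: model(instance) for label, model in binary_models.items()}
    predicted = max(scores, key=scores.get)
    return predicted, scores
```

Because each binary model is fitted separately, the per-class probabilities do not necessarily sum to one; only their ranking is used for the prediction.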

All-in-one classification

Some algorithms are able to predict multiple classes directly. In this thesis, this is informally called all-in-one classification, in contrast with the one vs all approach. Among those are the tree-based methods (decision tree and Random Forest), support vector machine algorithms (Angulo, Xavier, & Catala, 2003) and linear discriminant analysis (Li, Zhu, & Ogihara, 2006).

3.4.3 Modelling evaluation

In evaluating a classification model’s performance, the most common approach is the error rate, or the proportion of mistakes that are made (James, Witten, Hastie, & Tibshirani, 2013). This is the complement of the percentage correctly classified (also called accuracy or classification rate), which is used in the DrivenData competition to evaluate performance. This measure is usually too simplistic to evaluate the total performance of a model (Provost & Fawcett, 2013).

ROC & AUC

Figure 29 An example of a ROC curve

A Receiver Operating Characteristic (ROC) graph plots the true positive rate against the false positive rate. It represents the relative trade-offs the model makes. For simplicity, the true positive rate is sometimes called the hit rate: the percentage of actual positives the model gets right. The false positive rate is called the false alarm rate: the percentage of actual negatives the classifier gets wrong. If the classifier is doing well, the true positive rate (hit rate) increases rapidly and the area under the curve is large. Thus, this representation also takes the types of successes and errors into account. Figure 29 provides an example of some ROC curves, with the sensitivity or true positive rate on the y-axis and the false positive rate (1 - specificity) on the x-axis. The ROC graph is a stepwise graph. All observations are ranked according to their predicted probability of belonging to the positive class (highest-probability observations come first). Following this ranking, the observations are evaluated one by one to check whether they actually belong to that class. Starting from the bottom left of the graph, the ROC curve goes up if the observation is an actual positive. If the observation does not belong to the class in reality, it counts towards the ‘false alarm rate’ and the curve ‘grows’ to the right. This goes on until all observations are evaluated. The grey diagonal line represents a random classifier (Provost & Fawcett, 2013).

When comparing different models it is desirable to have a single measure to evaluate on (Bradley, 1997). The area under the ROC curve (AUC) is such a single measure, used as a summary of the performance of a model (Provost & Fawcett, 2013). The area under the curve will be large (closer to 1) if a good classifier is used. If the classifier is no better than random guessing, the area under the curve will be close to 0.5. The AUC represents the probability that a randomly chosen positive example receives a higher predicted probability of belonging to the positive class than a randomly chosen negative example (Bradley, 1997). Thus the AUC can be seen as a more sophisticated way of evaluating a model through a single value.
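Both the stepwise construction of the ROC curve and the probabilistic reading of the AUC can be sketched in Python. The function names `roc_points` and `auc` are hypothetical; ties between scores are processed one observation at a time in `roc_points`, which is a simplification, and counted as half a win in `auc`.

```python
def roc_points(scores, labels):
    """Walk through the observations from highest to lowest score.

    labels : True for actual positives, False for actual negatives
    Returns (false positive rate, true positive rate) points.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, is_positive in ranked:
        if is_positive:
            tp += 1   # actual positive: the curve steps up
        else:
            fp += 1   # actual negative: the curve steps to the right
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(scores, labels):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise comparison in `auc` is quadratic in the number of observations, so it is meant as an illustration of the definition rather than as an efficient implementation.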

Lift

In simple terms, the lift evaluates how many times better than random the model predicts. Figure 30 helps to grasp this concept. The left graph is a cumulative response graph. It plots the percentage of positives captured against the percentage of observations evaluated. Again, the straight diagonal represents a random classifier: if 40% of all instances are evaluated in random order, 40% of the positives will have been captured. This way of looking at a classifier allows us to see how it is doing compared to a random classifier. To look at how many times better the model does, the lift measure is used, shown in the right graph of figure 30. It takes the value of the cumulative response curve and divides it by the value on the random diagonal. Because the observations are ranked by probability, the observations with the highest certainty of belonging to a class are evaluated first, resulting in the highest lift values (Provost & Fawcett, 2013).
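The relation between the two graphs can be sketched in Python as follows. The function names `cumulative_response` and `lift_curve` are hypothetical; the lift at each point is the share of positives captured divided by the share of observations evaluated, which is the height of the random diagonal at that point.

```python
def cumulative_response(scores, labels):
    """Share of all positives captured after evaluating the top-ranked cases.

    Returns (fraction of observations evaluated, fraction of positives found).
    """
    ranked = [y for _, y in sorted(zip(scores, labels),
                                   key=lambda pair: pair[0], reverse=True)]
    n_pos = sum(ranked)
    found = 0
    curve = []
    for i, is_positive in enumerate(ranked, start=1):
        found += is_positive
        curve.append((i / len(ranked), found / n_pos))
    return curve


def lift_curve(scores, labels):
    """Cumulative response divided by the random baseline at the same depth."""
    return [(frac, captured / frac)
            for frac, captured in cumulative_response(scores, labels)]
```

A lift of 2 at a depth of 25% means that the model finds twice as many positives in the top quarter of its ranking as a random ordering would; at 100% depth the lift is always 1.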


Figure 30 Cumulative response curve and lift

3.4.4 Modelling Approach: Cross-validation

The practical performance or generalization capability of a model can only be measured by its performance on previously unseen data points. Therefore, all evaluation metrics used to compare models should be calculated on a test set or holdout set, rather than on the training set (James, Witten, Hastie, & Tibshirani, 2013).

This case study was approached following the outline of figure 31. The data obtained consists of a training set of 59400 observations and a test set of 14800 observations. A model is created from one part of the training set, and a separate validation part is used to apply this model and evaluate its performance (using cross-validation). This is done for several models to identify which one achieves the best results. The best model is then recreated using all available training data and applied to the test set.

Figure 31 Modelling approach (personal correspondence with Dirk van den Poel)

In this case study, a 5-fold cross-validation is used, as displayed in figure 32. This means that all observations are divided into 5 groups (or folds). Each time, a different group is used as test or holdout set, while the others are used as the training set. This is repeated 5 times. Following such an approach determines how well a certain technique can be expected to perform on independent data (James, Witten, Hastie, & Tibshirani, 2013). Several performance estimates are better than one, if only to make sure that a single estimate was not just a lucky case. With several estimates, their average can be computed and their variability assessed (Provost & Fawcett, 2013).
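The fold construction can be sketched in Python as follows. The function name `k_fold_indices` and the random shuffling with a fixed seed are assumptions for illustration, not a description of how the folds were drawn in the thesis.

```python
import random


def k_fold_indices(n_obs, k=5, seed=0):
    """Yield (train, holdout) index lists for k-fold cross-validation."""
    indices = list(range(n_obs))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of (nearly) equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, holdout in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, holdout
```

Each observation ends up in the holdout set exactly once, so the k performance estimates together cover the whole training set and can be averaged.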

3.4.5 Modelling case

The approach, algorithms and evaluation metrics discussed earlier will now be applied to the case. The results for the one vs all classification approach are shown in table 18. The different models are compared on their AUC. Remember, the one vs all approach predicts each class separately and then combines the results. The metrics were calculated on a 5-fold cross-validation basis and averaged.

The one vs all models evaluated are the logistic regression, the boosted logistic regression (AdaBoost), a classification tree, an ensemble method of bagged trees (RandomForest) and a variation of this called the RotationForest. The all-in-one classification approach was performed using classification trees and RandomForest. Other viable options would be a support vector machine algorithm or a linear discriminant analysis.

Figure 32 An illustration of cross-validation (Provost & Fawcett, 2013)

The one vs all approach has an AUC for each class, and the total AUC is calculated as their mean. A classification rate is also shown, as this is the objective to maximize in the data mining competition. The Random Forest algorithm is clearly the winner in this case, reaching an average AUC of 0.903 (0.907 for the Functional class). The other approach (all-in-one) lets the algorithm predict all classes at once. Therefore, only the total AUC is provided (table 19).

The performance of the all-in-one Random Forest model is almost exactly the same as that of the Random Forest model in the one vs all approach. To check whether the difference in performance between the two is statistically significant, the DeLong test is used (Delong, Delong, & Clarke-Pearson, 1988). This test compares the ROC curves of both models; in this case that means 3 different tests need to be performed, one for each functional state. The null hypothesis of this test states that the two ROC curves (and thus the AUCs) are the same. Using the Random Forest algorithm, the DeLong test indicates a statistically significant difference between the two approaches in the prediction of the Functional and Non Functional categories (p-values of 0.03 and 0.01, respectively). There is no evidence of a difference in the Repair category, as its p-value is 0.51. This test supports the claim that in this case, the one vs all approach is the winning approach.

In practice, AUC values range from 0.5, meaning the model does no better than random guessing, to 1, where all observations are ranked perfectly. On that scale, an AUC of 0.91 is very good. As this case is based on an online competition, we can compare our results with those of other participants. There are no AUC metrics to compare, but the maximum classification rate obtained by fellow data science enthusiasts was 0.8285. Our 0.812 is thus very close.

Table 18 One vs all classification results using a 5-fold crossvalidation

One vs all classification (AUC) | Functional | Repair | Broken | Average | Classification Rate
Logistic regression | 0.835 | 0.794 | 0.853 | 0.827 | 0.744
AdaBoost | 0.850 | 0.842 | 0.873 | 0.855 | 0.756
Tree | 0.749 | 0.500 | 0.775 | 0.675 | 0.718
RandomForest | 0.907 | 0.876 | 0.836 | 0.903 | 0.812
RotationForest | 0.703 | 0.500 | 0.746 | 0.650 | 0.686

Table 19 All-in-one classification results using 5-fold crossvalidation

All-in-one classification | Evaluation (AUC) | Classification Rate
Random Forest | 0.905 | 0.813
Tree | 0.712 | 0.707

3.5 Evaluation

We have created and compared several models. The Random Forest model came out as the winner in terms of the AUC and classification rate. Next to the predictions, the model can also help to gain insight into why a water pump is more likely to belong to a certain class. It helps us understand what is important and what the relationship is between the variables.

3.5.1 Variable importances

Earlier in this thesis, the variables were checked for a statistically significant relationship with the functional state of a water pump by a chi-square or one-way ANOVA test. All variables used passed this first test. Now we want to see which variables were the most influential in the assignment of probabilities in the Random Forest model.

To do this, a variable importance plot is used. It is created by evaluating the accuracy, the GINI coefficient or the AUC of the model. The mean decrease in accuracy measures the drop in accuracy when the information in a variable is removed from the analysis; in the randomForest package this is done by randomly permuting the values of that variable in the out-of-bag data (Liaw & Wiener, 2015). If the accuracy of the model drops severely, the variable is a very important factor in obtaining accurate predictions. The same reasoning applies to the evaluation using the GINI coefficient, but instead of the accuracy of the model, the decrease in GINI is used. The GINI coefficient measures the (im)purity of the nodes of a tree (James, Witten, Hastie, & Tibshirani, 2013). We look at the total decrease in node impurity from splitting on that particular variable. The more a variable reduces impurity when it is split on, the stronger it is (Liaw & Wiener, 2015). Next to the GINI and accuracy measurements, the AUC can be used in the same way (Ballings & Van den Poel, 2016). As described in the modelling evaluation part of this case study, the AUC provides a more balanced view on the performance of a model. For this reason, the focus lies on the variable importance ranking using the AUC.
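The decrease in node impurity described above can be illustrated for a single split. The sketch below is a simplified Python illustration (the thesis used the R randomForest package, which accumulates such decreases over all splits on a variable and averages them over the trees); the function names and the toy labels are assumptions.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a node: 1 minus the sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(parent, left, right):
    """Impurity of the parent node minus the size-weighted impurity of its children."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted


# Hypothetical node with four functional and four broken pumps.
parent = ["functional"] * 4 + ["broken"] * 4

# A split that separates the classes completely (for example on a dry water point).
pure_split = gini_decrease(parent, ["broken"] * 4, ["functional"] * 4)

# A split that leaves both children as mixed as the parent.
useless_split = gini_decrease(parent, parent[::2], parent[1::2])
```

The first split removes all impurity of the parent node, while the second removes none, which is why a variable that often produces splits of the first kind ends up high in the GINI ranking.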

Figure 33 shows the top 10 most important variables for these measurements. All three measures agree that quantity is the most important variable. The complete list can be found in the appendix.

Figure 33 Variable importances

3.5.2 Partial dependence

Next to the importance, the effect of a variable on the predicted probabilities can also be investigated. To do this, a partial dependence plot is created. It shows the marginal effect of a particular variable on the class probability (Liaw & Wiener, 2015). Partial dependence calculations can be compared to the coefficients obtained in linear regression, which also give an indication of importance and of the way relationships hold between variables. They also help to understand how the variables contribute to the prediction-making process (Pearson, 2016).

The values plotted by the interpretR R package developed by Ballings and Van den Poel (2016) are obtained by this formula: $\operatorname{mean}(0.5 \cdot \operatorname{logit}(P_1))$. It looks at the mean encountered probability of belonging to class one ($P_1$) per value of a variable. The values revolve around 0. If a value is larger than 0, that factor level or particular value has a positive effect on the probabilities produced by the model. If a value is smaller than 0, that variable value has a negative influence on the probability of belonging to class one. Creating partial dependence plots currently only works for binary approaches (Ballings & Van den Poel, 2016). In this evaluation of partial dependence, the class ‘functional’ is evaluated versus the other two classes.
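The formula can be made tangible with a small sketch that averages 0.5 times the logit of the class-one probability per factor level. This Python snippet only mirrors the formula as quoted in the text; it is not the interpretR implementation, and the function names, the clipping constant and the toy probabilities are assumptions made for illustration.

```python
import math
from collections import defaultdict


def logit(p, eps=1e-6):
    """Log-odds of p, clipped away from 0 and 1 to keep the logarithm finite (assumed safeguard)."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def level_effects(levels, p1):
    """Mean of 0.5 * logit(P1) for every level of a categorical variable."""
    grouped = defaultdict(list)
    for level, prob in zip(levels, p1):
        grouped[level].append(0.5 * logit(prob))
    return {level: sum(values) / len(values) for level, values in grouped.items()}


# Hypothetical predicted probabilities of being functional (class one).
levels = ["enough", "enough", "dry", "dry"]
p1 = [0.8, 0.7, 0.1, 0.2]
effects = level_effects(levels, p1)
```

With these toy numbers the value for "enough" is positive and the value for "dry" is negative, matching the reading of the plots: values above 0 push towards class one, values below 0 away from it, and a probability of exactly 0.5 maps to 0.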

The partial dependence plots are created in two types. There is the ‘factor-type’ plot, which has horizontal bars with the possible factor values on the x-axis for easier readability (figure 34). In the middle of the plot is the zero point. If a bar is close to the zero point, that factor level of the variable does not influence the probabilities much. If the bar goes to the left side, that level tends to be associated with lower probabilities for class one. In other words, it tends to characterize non-functional water pumps. If the bar shows up at the right side, that level tends to be associated with higher probabilities of being functional. Also pay attention to the range of values, which displays the importance of the variable: if the range is small, the overall importance is small. The other type of partial dependence plot is designed for numeric variables. It has the variable range on the x-axis and the probability dependence values on the y-axis.

An overview of all variables used in the model, their partial dependence plots and an interpretation can be found in the appendix. Here, we cover some of the most notable variables.

Quantity

The variable quantity is identified by all measures as the most important variable in determining the functionality of a water pump. This holds for the variable importance measures accuracy, GINI and AUC, and quantity was also identified by the classification tree as the most valuable variable to split on. That split is in accordance with the results of the left partial dependence plot in figure 34. If a water pump is dry or the quantity is unknown, the pump tends not to be functional, which is straightforward.

Figure 34 Partial dependence plot of Quantity and Payment

Payment

The partial dependence plot of the payment variable (figure 34, right side) shows us it is important to have some payment conditions. In the absence of a payment method, or when it is not known, water pumps tend to be faulty. The relationship between the payment types and the tendency to be functional is now uncovered. The next step would be to figure out why. One might argue that the distinction between payment and non-payment also translates into the relationship and behaviour towards a water pump. If people have to pay for something, they might be more invested and treat it with more care and respect than if it was a public good offered for free.

Funder & Installer

One curious finding was the influence of the government being the installer or funder of a water pump. Figure 35 shows that government-sponsored water points are much more likely to be in a bad state than those of any other sponsor. This might again be an indication of investment in water points. Community-sponsored water points might receive more care as the community itself reaps the benefits of a well-functioning water pump. Privately owned water pumps could receive more attention as they are seen as own property and a failure would lead to a loss of income in commercial scenarios. Religious initiatives and aid initiatives are bound by their noble motives to deliver quality in their work. I would not exactly claim the government is doing a bad job. Most likely the government has to provide and maintain water pumps in the most difficult circumstances, where there are no willing alternative sponsors, in the name of countrywide public access and availability of water.

Figure 35 Partial Dependence Plot of Installer and Funder variables

Construction Year

I expected a more linear relationship between the age of the water pump and its tendency to break down. Figure 36 does confirm that the most recently installed water pumps, those constructed after 2000, are less likely to fail than those installed before. But surprisingly, water pumps built in the 1960s seem more durable than those built in the 1970s and 1980s. “In my time, things were built to last”, my grandfather might say. There might be a number of reasons why this occurs. Maybe different materials were popular in different eras, or the oldest water pumps were already prioritized for renovation projects.

Figure 36 Partial Dependence Plot of ConstructionYearFactor variable

3.5.3 Data cleaning evaluation

Some claim 80% of data analysis work goes into data cleaning (Wickham, 2014). We have certainly done our fair share of that work in this case study. But was it worth it? The performance gain seems rather marginal. The best performing model using a thoroughly cleaned dataset reached an AUC of 0.907. Without any data cleaning, an AUC of 0.894 is obtained. The DeLong test (Delong, Delong, & Clarke-Pearson, 1988) indicates this difference is statistically relevant for the Functional and Broken categories. For the Repair category, there is no statistically notable difference in performance between a ‘dirty’ and a ‘cleaned’ data set. The test statistics are available in the appendix. Even though the performance gain is partially statistically relevant, it is still only a small gain.

It is most curious that the extensive data cleaning phase is not very rewarding. The supporting graphs in the appendix go over each variable to figure this out. The graphs look at the importances portrayed by the mean decrease in GINI and AUC in order to find anomalies that might explain the phenomenon at hand. This method is not very stable, as the importances depend on all other variables included in the model creation, which differ between the two data sets. Thus, this analysis is only a superficial endeavour.

Eyeballing these graphs indicates that grouping the construction_year values into ConstructionYearFactor, which was one of the data cleaning steps taken, might not have been a good move. In the GINI importance ranking without cleaning, construction_year (the deleted variable) is even a clear number one. Some of the importance lost by not including construction_year was recovered by including ConstructionYearFactor, but the difference still looks considerable. Another striking indication is that imputing values had no convincing positive effect on the importance of the affected variables. In fact, when looking at the AUC importances, it seems their importance decreased through pre-processing.

Again, this approach is rather superficial. A better way to investigate this further is to run the model with different combinations of variables. For example, create a number of models with all possible combinations of construction_year and ConstructionYearFactor. Then compare the AUC of the different models to see whether they are statistically different and which combination works best. The same approach can be applied to the evaluation of different imputation techniques. This iterative approach is quite time-consuming and computationally heavy and therefore beyond the scope of this thesis.

3.6 Deployment

The final stage in the CRISP-DM procedure is the deployment phase. It is the phase in which the analysis is applied for actual use. The result from this analysis is twofold. On the one hand, a model is created which could be applied in categorizing water pumps. For example, a ranking could be made of the water pumps most likely to be broken, which can be used to create a priority list of repairs for local governments to act upon. On the other hand, insight was generated into the characteristics of water pumps and how they relate to functionality. As I am neither a water pump engineer nor a Tanzania expert, additional insight could be extracted by adding some domain knowledge.
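The priority list mentioned above could be derived directly from the predicted probabilities. The following Python sketch is purely illustrative: the pump identifiers, the probabilities and the function name are hypothetical, and the thesis does not prescribe a particular implementation.

```python
def repair_priority(broken_probability, top_n=3):
    """Return the identifiers of the pumps most likely to be broken, most urgent first."""
    ranked = sorted(broken_probability.items(), key=lambda item: item[1], reverse=True)
    return [pump_id for pump_id, _ in ranked[:top_n]]


# Hypothetical model output: predicted probability that each pump is broken.
predicted = {"pump_a": 0.12, "pump_b": 0.91, "pump_c": 0.55, "pump_d": 0.78}
priority = repair_priority(predicted, top_n=2)
```

Here the two pumps with the highest predicted probability of being broken end up at the top of the list, which is the kind of ordering a local government could act upon.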

The inspiration for this case study came from the DrivenData competition, so the deployment would come from their end, together with the Taarifa platform. The competition allows for submissions, which are evaluated on the classification rate. With the model obtained in this thesis, I was able to get a classification rate of 0.8209, which put me 77th on the ranking. The winning score has a classification rate of 0.8285.

4 Conclusion

Data mining is a hot topic and is relevant in our everyday life. Data mining skills are in high demand and, as this case study illustrates, the work can be very interesting and insightful. This case study aims to guide the reader through a data mining competition in predicting the functional state of water pumps, explaining all the steps along the way. The goal is to combine the practical approach with the theory behind it, so the reader understands what is happening in each phase. For that reason, literature findings and practical issues are combined within the elaboration of the case study.

The CRISP-DM framework is used as the foundation. The business problem is properly described and then translated into a data mining problem. Common problems in working with erroneous data, and their solutions, are covered in the data understanding and data preparation stages. In the modelling phase, some common simple and advanced algorithms are explained and applied. A common modelling approach is presented, along with some ways of comparing different models. This thesis also covers the evaluation phase, in which methods of extracting insight from a model are covered and applied.

Data mining can indeed be of service in aiding the Tanzanian water activists. Next to providing practical insight by applying theory to real-life data problems, this thesis also succeeds in creating a powerful model and a comprehensive variable insight report. Several models were constructed for this case. The RandomForest algorithm turned out to be the best classifier with an AUC of 0.91 and a classification rate of 0.8209. This catapulted me to the 77th place in the DrivenData competition on which this thesis was based.

This thesis also studied the effect of the data cleaning process on predictive performance. For this case, the cleaning efforts only partially improved the AUC on a statistically relevant level. In absolute terms, the AUC of the best model rose from 0.894 to 0.907, an increase of roughly 0.013. Our efforts to investigate the underlying reasons are superficial and not based on scientific literature. Further research could shed more light on this matter.

The quality of this thesis could be increased with more expert input in terms of domain knowledge. This would have helped in the data understanding and data preparation stages, for instance in evaluating which values make sense and which are impossible, or in determining whether certain groupings make sense. For example, the funder and installer variables needed a lot of googling before it was possible to group their values. Domain knowledge would also be valuable in the evaluation phase. Evaluating regions or water pump types on their tendency to be functional is hard without knowledge of those regions or types of water pumps. Apart from this domain knowledge, improvements could have been made in terms of diversity in modelling algorithms. Comments on the DrivenData forum show the XGBoost algorithm was very popular and successful, but I was unable to make it work for my purposes.

5 Bibliography

Agarwal, R., & Dhar, V. (2014). Big Data, Data Science, and Analytics: The Opportunity and

Challenge for IS Research. Information Systems Research, 25(3), 443-448.

Angulo, C., Xavier, P., & Catala, A. (2003). K-SVCR. A support vector machine for multi-class

classification. Neurocomputing, 57-77.

Ballings, M., & Van den Poel, D. (2016, 03 19). interpretR. Retrieved from The Comprehensive R

Archive Network: https://cran.r-project.org/web/packages/interpretR/

Bradley, A. P. (1997). The use of the area under the ROC curve in the evaluation of machine

learning algorithms. Pattern recognition, 30, 1145-1169.

Chapman, P., Clinton, J., Kerber, R., Khabaza, T., Reinartz, T., Shearer, C., & Wirth, R. (2000).

CRISP-DM 1.0 Step-by-step data mining guide. SPSS.

Chaurette, J. (2016). Pressure or head. Retrieved from Pumpfundamentals:

http://pumpfundamentals.com

Chen, H., Chiang, R. H. L., & Storey, V. C. (2012). Business Intelligence and Analytics: From Big Data

to Big Impact. MIS Quarterly.

Churchill, G. A., & Iacobucci, D. (2005). Marketing Research: Methodological Foundations (9th ed.). South-Western, Thomson.

Culp, M., Johnson, K., & Michailidis, G. (2016). ada: The R Package Ada for Stochastic Boosting. Retrieved from The Comprehensive R Archive Network: https://cran.r-project.org/web/packages/ada/index.html

Davenport, T. H., & Harris, J. G. (2007). Competing on analytics. Boston: Harvard Business School.

Davenport, T. H., & Patil, D. (2012). Data Scientist: The Sexiest Job of the 21st Century. Harvard Business Review.

de Tré, G. (2007). Principes van databases. Pearson Education.

Delong, E. R., Delong, D., & Clarke-Pearson, D. (1988). Comparing the Areas Under Two or More

Correlated Receiver Operating Characteristic Curves: A Nonparametric Approach. Biometrics.

Demirkan, H., & Dal, B. (2014). The Data Economy: Why do so many analytics projects fail?

Analytics Magazine.

Evans, J. R., & Lindner, C. H. (2012). Business analytics: the next frontier for decision sciences.

Decision Line, 43(2), 4-6.

ExpertAfrica. (n.d.). Tanzania Weather and Climate. Retrieved from ExpertAfrica:

https://www.expertafrica.com/tanzania/info/tanzania-weather-and-climate

Gelman, A., & Hill, J. (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models.

Cambridge University Press.

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning.

Springer.

Li, T., Zhu, S., & Ogihara, M. (2006). Using discriminant analysis for multi-class classification: an

experimental investigation. Knowledge and information systems, 453-472.

Liaw, A., & Wiener, M. (2015, 10 06). The Comprehensive R Archive Network. Retrieved from Package

'randomForest': https://cran.r-project.org/web/packages/randomForest/randomForest.pdf

Lin, C.-J., Weng, R., & Wu, T. (2004). Probability estimates for multi-class classification by pairwise

coupling. Journal of Machine Learning Research, 975-1005.

Madden, S. (2012). From databases to Big Data. IEEE Internet Computing.

Mehaffrey, J. (2001, 10 02). GPS Altitude Readout: How Accurate? Retrieved 12 12, 2016, from

GPSinformation: http://gpsinformation.net/main/altitude.htm

Nations Encyclopedia. (n.d.). Tanzania. Retrieved 12 12, 2016, from Nations Encyclopedia: http://www.nationsencyclopedia.com/geography/Slovenia-to-Zimbabwe-Cumulative-Index/Tanzania.html

Pearson, R. (2016, 11 23). Interpreting Predictive Models Using Partial Dependence Plots. Retrieved from The Comprehensive R Archive Network: https://cran.r-project.org/web/packages/datarobot/vignettes/PartialDependence.html

Provost, F., & Fawcett, T. (2013). Data Science for Business: What you need to know about data mining and

data-analytic thinking. O'Reilly Media, Inc.

Rahm, E., & Hai Do, H. (2009). Data Cleaning: Problems and Current Approaches. University of Leipzig.

Robin, X., Turck, N., Hainard, A., Tiberti, N., & Lisacek, F. (2015). pROC: Display and Analyze ROC Curves. Retrieved from The Comprehensive R Archive Network: https://cran.r-project.org/web/packages/pROC/index.html

Sagiroglu, S., & Sinanc, D. (2013). Big data: A review. Collaboration Technologies and Systems (CTS) (pp.

42-47). IEEE.

Satinderpal, S. E., Sheilly, P. E., & Kaur, J. E. (2012). A new insight into data mining. International

Journal of Engineering Research and Applications (IJERA), 586-589.

Shearer, C. (2000). The CRISP-DM Model: The New Blueprint for Data Mining. Journal of data

warehousing, 13-22.

Taarifa. (2016, 09 23). Tanzania water points. Retrieved from Taarifa:

http://dashboard.taarifa.org/#/dashboard

Taarifa. (2016, 09 23). What is Taarifa. Retrieved from Taarifa: http://taarifa.org/

Therneau, T., Atkinson, B., & Ripley, B. (2015). rpart. Retrieved from The Comprehensive R

Archive Network: https://cran.r-project.org/web/packages/rpart/

Tsoumakas, G., & Katakis, I. (2006). Multi-label classification: An overview. Dept. of Informatics, Aristotle

University of Thessaloniki, Greece.

Vale, S. (2013). Classification of Types of Big Data. UNECE.

WaterAid. (2016, 09 23). Tanzania. Retrieved from WaterAid: http://www.wateraid.org/where-we-work/page/tanzania

Watson, H. J. (2010). BI-based Organizations. Business Intelligence Journal(15), 4-6.

Wickham, H. (2014). Tidy Data. Journal of Statistical Software.

Wikipedia. (2016, October 19). Subdivisions of Tanzania. Retrieved from Wikipedia:

https://en.wikipedia.org/wiki/Subdivisions_of_Tanzania

Wirth, R., & Hipp, J. (2000). CRISP-DM: Towards a standard process model for data mining.

Proceedings of the 4th international conference on the practical applications of knowledge discovery and data

mining, 29-39.

6 Appendix

6.1 Appendix: Data understanding / preparation stage elaboration

6.1.1 Packages used

The data understanding and preparation part of the case study mainly involves reading in and manipulating data. The data was provided as a CSV file. Data manipulation was done using parts of the data.table, dplyr and tidyr packages. Graphs, mainly histograms, bar plots and box plots, were created using the ggplot2 package. An overview of these packages and their sources is provided in table 20.

Table 20 Data exploration and preparation packages

Package name   Use                 Source
data.table     Data manipulation   https://cran.r-project.org/web/packages/data.table/index.html
dplyr          Data manipulation   https://cran.r-project.org/web/packages/dplyr/index.html
tidyr          Data manipulation   https://cran.r-project.org/web/packages/tidyr/index.html
ggplot2        Graph-making        https://cran.r-project.org/web/packages/ggplot2/index.html

6.1.2 Amount_tsh

The description identifies amount_tsh as the amount available to the water point. In more pump-technical terms, this is the ‘total static head’ (expressed in meters). Pumpfundamentals.com claims “head is a very useful and practical term to use when evaluating a pump’s capacity to do a job”. Total static head indicates the height to which the pump can raise water, or again in technical terms: “the elevation between the surface of the reservoir and the point of discharge into the receiving tank”. For this reason, a total static head of 0 is impossible, because a pump would then not be needed. In most cases (70.10%), however, the value is 0. A possible explanation is that missing values are represented by 0.

6.1.3 Date Recorded

The date recorded should be unrelated to the functional state of the water pump, but we can derive seasonal effects from it. The recording dates were grouped into 4 seasons: LongDry, LongRains, ShortDry and ShortRains. Cross-tabulating this with the functional state shows some interaction (table 21). Recordings in the LongRains season show a share of functional water pumps that is roughly 10 percentage points higher than in the other seasons. The chi-square test in table 22 shows that we can assume some association between the 2 variables.

Table 21 Crosstab of season and status_group

Season      Functional   Repair   Broken   Total
ShortDry    50%          9%       40%      100%
LongRain    60%          6%       34%      100%
LongDry     51%          7%       42%      100%
ShortRain   52%          5%       43%      100%
Global      54%          7%       38%      100%

Table 22 Chi-square test of season

Test          X-squared   Degrees of freedom   p-value   Relevance
Chi-squared   553         6                    0.00
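The chi-square test of independence used here and in the following sections compares the observed cell counts of a crosstab with the counts expected under independence. The Python sketch below shows that computation on hypothetical counts (100 recordings per season, loosely following the shares in table 21); the thesis ran the test in R on the real counts, so the resulting statistic is not comparable to the 553 in table 22.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Hypothetical counts: rows are seasons, columns are functional / repair / broken.
counts = [
    [50, 9, 41],   # ShortDry
    [60, 6, 34],   # LongRain
    [51, 7, 42],   # LongDry
    [52, 5, 43],   # ShortRain
]
statistic, dof = chi_square(counts)
```

With 4 seasons and 3 status groups the test has (4 - 1) x (3 - 1) = 6 degrees of freedom, the same number as reported in table 22. A table whose rows all have identical proportions yields a statistic of 0.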

6.1.4 Funder & Installer

6.1.4.1 Categorization

The variables Funder and Installer contain 1898 and 2146 unique values respectively. After careful investigation, it seems apparent that those can be grouped into 8 categories. This is represented in table 4, which also gives a quick description and an indication of their size. The greatest hurdle is the lack of structure in the data entries. There are a lot of typos (Oxfam vs ‘oxfarm’, world bank vs ‘wourld bak’) and different spellings, which require intensive manual investigation to classify. Most entries for Funder and Installer are organization names or acronyms, but if a Google search does not reveal what an entry refers to, it is not possible to classify it. Those impossible cases were placed in the ‘other’ category.

The categorization was executed based on heuristics (entries containing certain words) and manual look-up of frequently encountered organisations. The summary in table 23 shows which words triggered a certain group and which organization acronyms belong to a certain group. Figure 37 then illustrates the distribution of the Funder and Installer variables over these newly formed groups.

Table 23 Categorisation method for Installer and Funder variable

Other
  Rule: when not belonging to any other group, or unable to identify where it should belong

Government
  Manual look-up:
    LGA: Local government authority
    DWE: District Water Engineering
    DWSP: Domestic Water Supply
    TASAF: Tanzanian social action fund
    RWSSP: Rural water supply and sanitation programme
    WSDP: Water sector development programme
    DMDD: Diocese of Mbulu Development Department
  Heuristic: every entry containing "gov", "government", "council", "ministry", "goverm", "agency", "district water depar", "department", "tanzania"

International Investment
  Manual look-up:
    HIFAB: Swedish project management consultants
    NORAD: Norwegian agency for development
    HESAWA: Swedish-Tanzanian cooperation
    DANIDA: Danish-Tanzanian cooperation
    RUDEP: Rural development programme, Norwegian initiative
    CES(GMBH): Consulting Engineers Salzgitter GmbH (CES)
    JICA/JAICA: Japan international cooperation agency
  Heuristic: every entry containing "italian", "japan", "german", "korea", "niger", "frankfurt", "british", "netherlands", "embassy", "u.s.a", "european union", "holland", "international", "africa", "finland", "unesco", "irish", "Greec", "swisland", "imf", "china", "swedish"

Aid
  Manual look-up:
    Red cross, Oxfam, unicef, world bank, world vision
    ADB: African Development bank
    AMREF: Amref flying doctors
    ADRA: NGO of Italy
    ACRA: Community development and emergency relief
  Heuristic: every entry containing "aid"

Unknown
  Rule: entries like '0', ' ', 'no', 'not known' or responses that only have 1 character

Religious initiatives
  Manual look-up:
    KKKT: Kanisa la Kiinjili la Kilutheri Tanzania, Lutheran church in Tanzania
    TCRS: Tanganyika Christian Refugee Service
  Heuristic: entries containing "church", "catholic", "muslim", "missionary"

Private
  Heuristic: entries containing "Private", "private company", "private individual"

Community/local efforts
  Manual look-up:
    SHIPO: NGO in Tanzania
    TWESA: NGO in Tanzania
    SEMA: NGO in Tanzania
  Heuristic: entries containing "village", "municipal", "local", "community"
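The heuristic part of this categorisation amounts to simple keyword matching on the cleaned entry. The Python sketch below illustrates the idea with a subset of the keywords from table 23. The order in which the rules are checked, the function name and the example entries are assumptions, and the manual look-up of acronyms is left out; the thesis performed this step in R.

```python
# Keyword rules, checked in this (assumed) order; the first match wins.
RULES = [
    ("Government", ["gov", "council", "ministry", "agency", "department", "tanzania"]),
    ("Religious initiatives", ["church", "catholic", "muslim", "missionary"]),
    ("Private", ["private"]),
    ("Community/local efforts", ["village", "municipal", "local", "community"]),
    ("Aid", ["aid"]),
]
UNKNOWN_ENTRIES = {"0", "", "no", "not known"}


def categorize(entry):
    """Assign a funder or installer entry to a category using keyword heuristics."""
    text = entry.strip().lower()
    if text in UNKNOWN_ENTRIES or len(text) == 1:
        return "Unknown"
    for category, keywords in RULES:
        if any(keyword in text for keyword in keywords):
            return category
    return "Other"


examples = ["Government of Tanzania", "Roman Catholic Church", "Private individual",
            "Village community", "Water Aid", "0", "xyz"]
categories = [categorize(entry) for entry in examples]
```

Substring matching is crude: a keyword such as "aid" also matches unrelated words that happen to contain it, which is one reason the manual look-up described in the text remains necessary.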

Figure 37 Distribution of funder and installer variables

6.1.4.2 Statistical relevance

As installer and funder are categorical variables, the chi-square test is used. Both tests indicate that these variables are relevant, in the sense that they are not independent of the status_group variable (table 24).

Table 24 Chi-square evaluation of Funder and Installer

Variable    X-squared   Degrees of freedom   p-value
Funder      1240.6      14                   0.00
Installer   738.36      14                   0.00

6.1.5 GPS Height

The imputation of missing values was covered in the main part of this work. All that remains is to check the statistical relevance of this variable in relation to the focal variable status_group. To do this, a one-way ANOVA is used. This checks whether the mean GPS Height differs between the functional states of a water pump. In figure 38, the 3 boxplots portray these distributions. The GPS Height values of non-functional water points seem to be a little lower than for functional water pumps, but is this difference statistically relevant? Table 25 presents the test statistics. A p-value of 0.00 leads us to reject the null hypothesis that the mean GPS Height is the same for all functional states. A post hoc comparison between groups was executed to investigate this relationship further. As table 26 shows, only the difference between water pumps in need of repair and functional pumps is not statistically relevant. So we can claim that the GPS Height of non-functional water pumps is statistically significantly lower than that of functional water pumps (either fully or partially functional).

Table 25 One-way ANOVA of GPS Height

GPS Height     Degrees of freedom   Sum of squares   Mean square   F       p-value   Relevance
Status group   2                    263963882        131981941     490.4   0.00
Residuals      59397                15985897658      269136
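The F statistic of a one-way ANOVA is the ratio of the between-group mean square to the residual mean square, each being a sum of squares divided by its degrees of freedom. As a quick consistency check, the Python lines below redo that arithmetic with the sums of squares and degrees of freedom reported in table 25; the variable names are illustrative and the thesis obtained these values from R.

```python
# Values taken from table 25 (one-way ANOVA of GPS Height).
ss_between, df_between = 263963882, 2        # status group
ss_within, df_within = 15985897658, 59397    # residuals

ms_between = ss_between / df_between   # mean square between groups
ms_within = ss_within / df_within      # residual mean square
f_statistic = ms_between / ms_within   # F = MS between / MS within
```

Rounded to one decimal, the ratio reproduces the F value of 490.4 listed in the table.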

Table 26 Post hoc comparison using TukeyHSD for GPS Height

Comparison                                 Difference    p-value   Relevance
Functional needs repair - Functional       -4.352513     0.86
Non Functional - Functional                -137.542611   0.00
Non Functional - Functional needs repair   -133.190099   0.00

Figure 38 GPS Height distribution per status group

6.1.6 Longitude and Latitude

The handling of these variables is similar to GPS Height. The same method for imputing missing values with the mean based on proximity is applied here, as described in the main text. The same procedure is used to assess their relevance. A visual investigation is supported by the boxplots of figure 39. To verify whether there is indeed a difference between the groups, an ANOVA test is conducted. The results of this test can be found in the following tables.

The ANOVA tests for both latitude and longitude indicate a difference between status groups (tables 27 and 29). Looking at latitude, all status groups are significantly different from each other (table 28). For longitude, only the difference between functional and non functional is not significant (table 30).

Figure 39 Distribution of latitude and longitude per status group

Table 27 One-way ANOVA of Latitude

Latitude       Degrees of freedom   Sum of squares   Mean square   F       p-value   Relevance
Status group   2                    660              330.1         41.89   0.00
Residuals      59397                468049           7.9

Table 28 TukeyHSD multiple comparison test of Latitude

Latitude Comparison Difference p-value Relevance


Functional 0.3166800 0.00



-0.4200421 0.00

Table 29 One-way ANOVA of Longitude

Longitude      Degrees of freedom   Sum of squares   Mean square   F       p-value   Relevance
Status group   2                    3048             1524.2        229.7   0.00
Residuals      59397                394129           6.6

Table 30 TukeyHSD multiple comparison of Longitude

Comparison                                 Difference    p-value   Relevance
Functional needs repair - Functional       -0.84798889   0.00
Non Functional - Functional                0.04855053    0.07
Non Functional - Functional needs repair   0.89653942    0.00

6.1.7 Public meeting

To visually identify the relation between Public Meeting and Status Group, Figure 40 can be consulted. It shows a difference in distribution over the different classes of Public Meeting. To investigate further, a chi-square test is used. The result of the test (table 31) gives us reason to believe that public meeting and status group are not independent.

Table 31 Chi-square results for public meeting

Test          X-squared   Degrees of freedom   p-value
Chi-squared   384         4                    0.00

Figure 40 Distribution of status_group of Public Meeting

6.1.8 Permit

Permit has 3 possible values: true (38852), false (17492) or unknown (3056). The visual representation of its cross tab is shown in figure 41. A chi-square test, of which the values are displayed in table 32, shows that the differences encountered are statistically relevant and thus the variables status_group and permit are not deemed independent.

Figure 41 Distribution of status_group over Permit values

Table 32 Chi-square test of Permit

Test          X-squared   Degrees of freedom   p-value
Chi-squared   104.18      4                    0.00

6.1.9 ConstructionYear

Figure 42 Boxplots of Constructionyear over status_group

The relevance of construction year can be tested in 2 ways: either in its original form, where the years are presented in numeric form, or in the categorized form, in which the years are placed in buckets and a separate bucket was created to contain all missing values. The boxplot of figure 42 visually reveals what was to be expected: the older the water pumps, the more often they are broken or in need of repairs.

Table 33 One-way ANOVA of Constructionyear

Construction Year   Degrees of freedom   Sum of squares   Mean square   F      p-value   Relevance
Status group        2                    500055           250028        1753   0.00
Residuals           38688                5518248          143

Table 34 TukeyHSD test of construction year

Construction Year Comparison Difference p-value Relevance


Functional -4.680764 0.00



-2.860374 0.00

Table 35 Chi-square test of construction year as a factor

Test          X-squared   Degrees of freedom   p-value
Chi-squared   3245.4      12                   0.00

As both the one-way ANOVA test (for construction year as a numeric variable, tables 33 and 34) and the chi-square test (when categorized, table 35) show, the difference between functional states is statistically relevant and it can be claimed that older water pumps encounter more troubles.

6.1.10 Collection of other variables

Table 36 Summary of Chi-square tests over other variables

Variable                X-squared   Degrees of freedom   p-value   Relevance
Water point             7450        12                   0.00
Basin                   1921        16                   0.00
Region                  4795        40                   0.00
District Code           1674        38                   0.00
Source                  2624        18                   0.00
Source_type             1907        12                   0.00
Source_class            590         4                    0.00
Quantity                11361       8                    0.00
Quality_group           2100.1      10                   0.00
Payment                 3965.6      12                   0.00
Management              2081.1      22                   0.00
Management_group        287.7       8                    0.00
Extraction_type         7365.6      34                   0.00
Extraction_type_group   7265.8      24                   0.00
Extraction_type_class   6931.2      12                   0.00
Scheme management       1990.4      22                   0.00

6.1.11 Population

Although there are a lot of missing values, we can still investigate whether the non-missing values relate to status_group. Because most entries for population are close to zero, this variable is highly skewed. The ANOVA test indicates a dependence between the variables (table 37). Only for the difference between Repair and Functional is there no statistical support (table 38). Figure 43 supports this visually.

Figure 43 Boxplots of population over status_group

Table 37 One-way ANOVA of population

Population     Degrees of freedom   SSE           MSE       F   p-value   Relevance
Status group   2                    4342098       2171049   7   0.00
Residuals      38016                12118539409   318775

Table 38 TukeyHSD post hoc comparison for Population

Population Comparison Difference p-value Relevance


Functional 9.05 0.73



-29.61 0.04

6.2 Appendix: Modelling stage elaboration

6.2.1 Packages used

The modelling stage involves creating and evaluating models.

Table 39 Modelling packages used


AUC              Calculate the ROC and AUC metric                      https://cran.r-project.org/web/packages/AUC/index.html
lift             Calculate the lift evaluation metric                  https://cran.r-project.org/web/packages/lift/index.html
randomForest     Random forest algorithm                               https://cran.r-project.org/web/packages/randomForest/index.html
ada              Adaboost package, boosting with logistic regression   https://cran.r-project.org/web/packages/ada/index.html
XGBoost          Boosting with trees                                   https://cran.r-project.org/web/packages/xgboost/index.html
RotationForest   Variation in classification trees                     https://cran.r-project.org/web/packages/rotationForest/index.html
pROC             Perform a Delong test                                 https://cran.r-project.org/web/packages/pROC/index.html




6.2.2 Delong test: ROC curve comparison

Table 40 Delong test

Category         AUC one vs all   AUC all in one   p-value   Relevance
Functional       0.9085           0.9013           0.034
Repair           0.8683           0.8734           0.509
Non Functional   0.9304           0.9217           0.005

The Delong test indicates statistically significant differences between the two approaches in the prediction of the Functional and Non Functional categories when using the Random Forest algorithm (p-values of 0.034 and 0.005 respectively). There is no evidence of a difference in the Repair category, as its p-value is 0.509. This test supports the claim that, in this case, the one vs all approach is the winning approach.

This test was not conducted in a 5-fold cross validation approach, as this is computationally too burdensome. For that reason, the AUCs displayed here are not exactly the same as the cross-validated averages shown before.
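A comparison of this kind can be sketched with the pROC package listed in table 39. The example below is only an illustration under assumptions: the labels and the two score vectors are simulated, and the names `score_a` and `score_b` are hypothetical stand-ins for the predicted probabilities of the two approaches for a single category.

```r
library(pROC)

# Simulated binary target for one category (1 = Functional, 0 = other).
set.seed(1)
n <- 500
is_functional <- rbinom(n, 1, 0.5)

# Hypothetical scores from two competing models on the same observations.
score_a <- is_functional + rnorm(n, sd = 0.8)
score_b <- is_functional + rnorm(n, sd = 1.0)

# Fix levels and direction explicitly so both curves are built the same way.
roc_a <- roc(is_functional, score_a, levels = c(0, 1), direction = "<")
roc_b <- roc(is_functional, score_b, levels = c(0, 1), direction = "<")

# Delong test for the difference between two correlated AUCs.
delong <- roc.test(roc_a, roc_b, method = "delong")

print(auc(roc_a))
print(auc(roc_b))
print(delong$p.value)
```

Because both curves are computed on the same observations, the test treats them as paired. For a three-class target, the comparison would be repeated once per category, as in table 40.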

6.3 Appendix: Evaluation stage elaboration

6.3.1 Packages used

Table 41 Evaluation packages used


randomForest   Extract insight from constructed model; functions varImpPlot() and partialPlot()
ggplot2        Make the plots shine                                                               https://cran.r-project.org/web/packages/ggplot2/index.html
interpretR     Extract insight from constructed model; functions variableImportance() and parDepPlot()   https://cran.r-project.org/web/packages/interpretR/

6.3.2 Variable importances

Figure 44 is the standard output when using the varImpPlot() function on a Random Forest object in R. It shows the variable importance by looking at the mean decrease in accuracy and in Gini. Figure 45 does this by using the AUC measure.
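For reference, output of this kind can be produced roughly as follows. This is a minimal sketch on the built-in iris data, which is only a stand-in for the water pump data; the number of trees and the seed are arbitrary assumptions.

```r
library(randomForest)

set.seed(123)
# importance = TRUE is needed to obtain the mean decrease in accuracy.
rf <- randomForest(Species ~ ., data = iris, ntree = 200, importance = TRUE)

# Matrix with per-class columns plus MeanDecreaseAccuracy and MeanDecreaseGini.
imp <- importance(rf)
print(round(imp[, c("MeanDecreaseAccuracy", "MeanDecreaseGini")], 2))

# Dot charts of both measures side by side.
varImpPlot(rf, main = "Variable importances (stand-in data)")
```

The AUC-based importance of figure 45 comes from the interpretR package and is not reproduced in this sketch.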

Figure 44 Variable importances


Figure 45 Variable importances (AUC)

6.3.3 Partial dependence

Partial dependence plots are used to look at the effect of a variable on the output probability of a

model. Table 42 gathers all variables with their corresponding partial dependence plot. The results

are then interpreted.
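A single partial dependence plot can be sketched with partialPlot() from the randomForest package. The example uses the built-in iris data as a stand-in, and the chosen variable and class are assumptions for illustration only.

```r
library(randomForest)

set.seed(123)
rf <- randomForest(Species ~ ., data = iris, ntree = 200)

# Partial dependence of the "versicolor" class on Petal.Width.
# The function draws the plot and invisibly returns the x and y values.
pd <- partialPlot(rf, pred.data = iris, x.var = "Petal.Width",
                  which.class = "versicolor",
                  main = "Partial dependence on Petal.Width")

str(pd)
```

Note that, according to the randomForest documentation, the vertical axis for classification is on a logit-like scale of the class vote fractions rather than a raw probability, so the shape of the curve is what should be interpreted.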


Table 42 Partial dependence plots with interpretation

The quantity variable is ranked as the most important variable by all importance measures. If the quantity value is dry, the water pump is more likely to be non-functional, whereas the enough value is more associated with working water pumps.

The payment variable captures the payment method associated with the water pump. Water pumps with a payment method are positively associated with working water pumps, whereas pumps with no payment or an unknown payment method tend to be non-working.

Communal standpipes, hand pumps, improved springs and cattle troughs are more likely to be functional, in decreasing order of probability. Water pumps of the communal standpipe multiple type or of type ‘other’ tend to be non-working more often.

The older a water pump is, the more flaws it has. This could be seen as a global rule, but the relationship is not absolute. For example, water pumps built in the 70’s or 80’s are more often associated with non-working water pumps than those built in the 60’s.


Numeric variables are not that easy to interpret, but a large drop in functionality can be identified between a latitude of -11 and -9. These values refer to the southern part of Tanzania. Further investigation reveals that in that area 20% of water pumps are dry, whereas for the rest of Tanzania it is only 8%.

Most non-working water pumps tend to be situated in the eastern part of the country. Surprisingly, that is the part closest to the sea. This part has 10% more non-working water pumps than the rest of the country.

Are water pumps that are situated higher less likely to be broken? Only a small portion of all water pumps is situated that high, and this small sample only occurs at a height of around 1750, which is well after the large spike. Indeed, among water pumps at a height of more than 1500 there are 15% more working water pumps.

Another interesting conclusion here is that

the government has a hand in the presence

of faulty water pumps.


36% of all water pumps had a missing value for this variable. Those were imputed by the median, which was around 150. It is interesting to see that water pumps serving a smaller population that is not close to zero tend to be more functional. Could this be because of a sense of shared responsibility in a smaller group?

A histogram of the DaysRecorded variable reveals that there are 2 large clusters, one centred around 1400 and one around 2000. Only the parts of the plot that are around these values can be trusted. This reveals that the second ‘round’ of water pump recording efforts yielded more functional water pumps.

Analysis of the extraction_type_group variable holds no surprises. If no special mechanism is needed and gravity can do all the work, there is less chance the water pump will be broken.

The Iringa region seems to do well in terms of functional water pumps. It also has the second largest GNP per capita of Tanzania. The Lindi region has the worst reputation. It is situated in the bottom right (south-eastern) corner of Tanzania, and is the least densely populated region of Tanzania. This is in accordance with the lat and lon findings.


There are no district names given, so an interpretation in terms of districts is hard.

This confirms the findings of the funder variable. The government agencies of Tanzania seem to be associated with faulty water pumps.

The interpretation of this variable is the

same as its related variable

extraction_type_group.


The source type shows us that there are indeed differences between the types of source.

The basin the water comes from also has an influence.

In terms of management, it seems VWC has

a lot of explaining to do.


In terms of management, it seems VWC has

a lot of explaining to do.

The RecordingSeason variable was created to see if the season in which the water pump data was entered into the system plays some role in determining functionality. It seems that it does: during the LongRains, it is more likely to find a functional water pump.

No surprises here, if there was no public

meeting, there is probably no ‘support’ from

the locals in maintaining the water pump.


Water quality and quality group are closely related. For some reason, some levels do not conform. Pumps with milky water quality first tended to be more functional, but here it is the other way around.

I expected to find the same results as with the permit, but to my surprise, this is not the case. We can only conclude that without a permit, there is more chance of a broken pump.


As I have no domain knowledge, it’s hard to interpret everything I encounter.

As I have no domain knowledge, it’s hard to interpret everything I encounter. In the source_class variable, there’s not much variation contributing to an interpretation.

As I have no domain knowledge, it’s hard to interpret everything I encounter.


6.3.5 Data cleaning evaluation

Table 43 Delong test for ROC curve comparison

Category         AUC ‘dirty’   AUC ‘clean’   p-value   Relevance
Functional       0.8976        0.9052        0.013
Repair           0.8652        0.8690        0.584
Non Functional   0.9183        0.9256        0.009

The best performing model using a thoroughly cleaned dataset reached an AUC of 0.907. Without any data cleaning, an AUC of 0.894 is obtained. The Delong test (table 43) indicates this difference is statistically significant for the Functional and Non Functional categories. For the Repair category, there is no statistically notable difference in performance between a ‘dirty’ and a ‘cleaned’ data set.

To look at the differences variable by variable, figures 46 and 47 were constructed. They compare the Gini and AUC importance measures of a Random Forest algorithm on a cleaned and an uncleaned data set. The black dots represent the ‘uncleaned’ importances, the green ones the ‘cleaned’ importances. The aim is to find anomalies in the mean decrease in Gini and AUC that might explain the difference in performance. This method is not very stable, as the importances depend on all other variables included in the model creation, which are different in both data sets. Thus, this analysis is only a superficial endeavour.

The uncleaned set is just the data as it was read in, excluding the variables amount_tsh, id, wpt_name, funder, installer, date_recorded, scheme_name, ward, lga and subvillage, as those have too many distinct factor levels to handle. The cleaned data set contains six more variables: SchemeGroup, RecordingSeason, Installer, Funder, DaysRecorded and ConstructionYearFactor. Those are placed at the top of the graphs to get them somewhat out of the way. Some highly correlated variables were also deleted. For example, the reason the quantity variable is rated much higher in importance in the cleaned data set is that in the uncleaned one, its importance is divided over the quantity and quantity_group variables.
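The effect of a redundant variable on the Gini importance can be explored with a small experiment. The sketch below is an illustration on the built-in iris data, not on the water pump data: the column `Petal.Width.copy` is a hypothetical exact duplicate, playing the role that quantity_group plays next to quantity.

```r
library(randomForest)

set.seed(123)

# Model without a redundant variable.
rf_single <- randomForest(Species ~ ., data = iris, ntree = 500)

# Same data with an exact copy of one predictor added.
dup <- iris
dup$Petal.Width.copy <- dup$Petal.Width
rf_dup <- randomForest(Species ~ ., data = dup, ntree = 500)

# Mean decrease in Gini per variable for both models.
gini_single <- importance(rf_single)[, "MeanDecreaseGini"]
gini_dup <- importance(rf_dup)[, "MeanDecreaseGini"]

print(round(gini_single, 2))
print(round(gini_dup, 2))
```

Comparing the two printed vectors shows how the importance attributed to a variable changes once a duplicate competes with it for the same splits. The exact numbers depend on the seed and the number of trees.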

Eyeballing these graphs indicates that grouping the construction_year values into ConstructionYearFactor might not have been a good move. In the Gini importance ranking without cleaning, construction_year is even a clear number one. Some of the importance lost by not including construction_year was recovered by including ConstructionYearFactor, but the difference still looks substantial. Another striking observation is that imputing missing values has no convincing positive effect on the importance of the affected variables. In fact, when looking at the AUC importances, it seems their importance decreased by pre-processing.

Figure 46 Mean decrease in Gini comparison


Figure 47 Mean decrease in AUC comparison
