abstract instructor: natalia sizova world data: exploring kaggle data...

WORLD DATA: EXPLORING KAGGLE DATA SETS

ABSTRACT Introduction to Data Sets on Kaggle.com. The guide is prepared by undergraduate students majoring in Economics at Rice University for undergraduate students majoring in the Social Sciences.

Instructor: Natalia Sizova, Rice University

World Data Sample Analysis: Overview

On Kaggle, we can find many useful datasets under the “Datasets” section. Here, we will take a look at “World Development Indicators.” This dataset is provided by Ben Hamner, cofounder and CTO of Kaggle. Click on the title and you will see the following webpage:

On the web page, there are two links in blue. One leads to a list of the available indicators; the other leads to a list of the available countries. Now let’s take a closer look at the indicators. There are 1,345 indicators in total, spread across 10 broad topics: Economics, Education, Environment, Financial Sector, Health, Infrastructure, Poverty, Private Sector & Trade, Public Sector, and Social Protections & Labor. Each broad topic contains narrower subtopics that describe the indicators more specifically, such as CO2 emissions under Environment and government debt under Public Sector. In total, there are 5,656,458 observations for 247 countries across 56 years, and we can explore many interesting topics with this comprehensive dataset, such as progress on gender equality over the years, the debts and income levels of different countries, or even the number of endangered species on each continent. In this section, we will use this dataset for some simple analysis to help students understand how to use it in their own research.

The first step is to download the dataset. You can download the “.csv” version of the data by clicking the “Download Data” button. Then, open the data in STATA. (Note: some of the files may be too large to open in the student version of STATA.) Besides the “Download Data” button, the dataset page has three other sections: Description, Kernels (formerly Scripts), and Discussion (formerly Forum). In the “Description” section, you can find a list of the available countries and indicators in the dataset, as well as a “life expectancy at birth” example. In the “Kernels” section, you can find sample analyses and code submitted by other users, such as “gender equality.” In the “Discussion” section, you can participate in discussions initiated by other users or post your own questions.

Example 1: Overview of the G7 Data

Under the “Kernels” section (formerly “Scripts”), we can find some sample analyses

submitted by other Kaggle users. We will first take a look at their analysis to gain a basic

understanding of how to use the dataset. The example used here is “Overview of G7 Data”

by Hiraku Shibuya. In his analysis, Hiraku compared the populations, GDPs, and

unemployment rates of G7 countries. He also included code in his analysis, which we can

directly run on Kaggle. Click the blue “Fork Script” button (or “Fork Kernel”), and you

will see a web page like this:

On the left is the code written by the user, and you can click the “Run” button to run the

code. Results will be shown on the right half of the web page. If there are any pictures or

graphs generated, you can find them in the “Output” tab. From the data and code, we obtain

the graphs of population trends, population growth rate, GDP per capita, GDP per capita

growth rate, and unemployment rate for G7 countries. Below are sample population graphs

extracted from the results:

Figure 1 Source: Kaggle.com

From the population graphs, we can see that population growth appears to have slowed in most of the G7 countries. Germany and Japan even had negative population growth rates in recent years. The kernel also generates the GDP growth rate and unemployment rate graphs, which are available for students to explore.

Example 2: Percentage of Renewable Energy in Total Energy Consumption

Now we want to do our own analysis with the World Development Indicators dataset. With increasing concerns about the greenhouse effect and potential future energy shortages, countries are paying more and more attention to renewable energy as an alternative to traditional fossil fuels. Governments and companies have invested large amounts of resources in exploring viable and efficient alternative sources of energy, and we want to compare how various countries are doing in this area. One important indicator is the percentage of renewable energy in total energy consumption, and we will use this indicator to compare the levels of renewable energy use among countries.

After downloading the data from the website, we import the “Indicators.csv” file into STATA. This file contains all the necessary information, organized by country, indicator, year, and value. We don’t need to merge it with the other files in the folder, as they only describe the variables, such as indicator names and country codes. Once “Indicators.csv” is imported, we can take a look at the data in the Data Editor. There are six columns: countryname, countrycode, indicatorname, indicatorcode, year, and value. Each observation is a row containing these six columns. The dataset is originally sorted by countryname, but in this case we want to compare different countries under the same indicator. Hence, we need to filter the dataset and keep only the observations for the desired indicator.
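As a quick sanity check on the figures quoted earlier (number of observations, countries, indicators, and years), the file can also be summarized outside STATA. The following is a minimal Python sketch, not part of the original guide. It assumes the raw CSV header uses the capitalized names `CountryCode`, `IndicatorCode`, and `Year` (STATA lowercases them on import), and the function names are hypothetical.

```python
import csv
from typing import Iterable, Mapping


def summarize(rows: Iterable[Mapping[str, str]]) -> dict:
    """Count rows and the distinct countries, indicators, and years they cover."""
    countries, indicators, years = set(), set(), set()
    n_rows = 0
    for row in rows:
        n_rows += 1
        countries.add(row["CountryCode"])      # assumed column name
        indicators.add(row["IndicatorCode"])   # assumed column name
        years.add(int(row["Year"]))
    return {
        "observations": n_rows,
        "countries": len(countries),
        "indicators": len(indicators),
        "first_year": min(years) if years else None,
        "last_year": max(years) if years else None,
    }


def summarize_file(path: str) -> dict:
    """Stream the CSV from disk so the whole file never sits in memory."""
    with open(path, newline="", encoding="utf-8") as f:
        return summarize(csv.DictReader(f))
```

Calling `summarize_file("Indicators.csv")` on the downloaded file should return counts that can be compared with the totals stated in the overview. Reading row by row avoids loading all of the observations at once.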

To select the needed observations, we first find the desired “indicatorname” in the dataset. The “indicatorname” for the percentage of renewable energy in total energy consumption is “Renewable energy consumption (% of total final energy consumption).” The corresponding “indicatorcode” is EG.FEC.RNEW.ZS. We only need observations with this indicator code. Hence, we drop the rest of the dataset by typing

keep if indicatorcode == "EG.FEC.RNEW.ZS"

in STATA. This command removes every row that has a different indicator code. We are then left with 4,867 observations, which is far more manageable than the original total of more than 5 million. Our dataset now covers 221 countries with data on renewable energy use over 23 years, from 1990 to 2012. To compare the trends for different countries, we plot a graph. We export the filtered dataset to a .csv file and do the following analysis in Excel. The graph below shows the share of renewable energy in total energy consumption for ten different countries across 23 years.
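The filter-and-export step described above can also be sketched in Python for readers without access to STATA. This is an illustrative sketch only, not the guide's method. It assumes the raw CSV header contains a column named `IndicatorCode`, and the function names `keep_indicator` and `export_subset` are hypothetical.

```python
import csv
from typing import Iterable, Iterator, Mapping

# Indicator code quoted in the text for the renewable energy share.
RENEWABLE_SHARE = "EG.FEC.RNEW.ZS"


def keep_indicator(rows: Iterable[Mapping[str, str]],
                   code: str = RENEWABLE_SHARE) -> Iterator[Mapping[str, str]]:
    """Yield only the rows whose IndicatorCode equals `code`.

    This mirrors the effect of the `keep if indicatorcode == ...` command.
    """
    for row in rows:
        if row["IndicatorCode"] == code:  # assumed column name
            yield row


def export_subset(src_path: str, dst_path: str,
                  code: str = RENEWABLE_SHARE) -> int:
    """Copy the matching rows to a new CSV file and return how many were written."""
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames or [])
        writer.writeheader()
        written = 0
        for row in keep_indicator(reader, code):
            writer.writerow(row)
            written += 1
    return written
```

For example, `export_subset("Indicators.csv", "renewable.csv")` would write a smaller file that Excel can open. The output file name here is only a placeholder.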

Figure 2

There are many interesting observations to be made from the graph. First, we look at the trends for India, China, and Mexico. Unlike the other countries, these three have downward-sloping trends, indicating that their shares of renewable energy have decreased over the years. This finding is a bit counterintuitive, as we generally assume that the share of renewable energy in total energy consumption increases with technological advancement. However, all three are developing countries, and they need to consume more and more energy as they develop. Compared to renewable energy, fossil fuels are more readily available and much cheaper. Hence, the increase in energy consumption mainly takes the form of fossil fuels, which reduces the share of renewable energy. Yet China and India have much higher shares of renewable energy in total energy consumption than the other countries, which is the opposite of what we might expect based on the news and other sources. The data may be inaccurate for some countries, but we will not explore this issue further in our simple sample analysis.

Among the countries with increased shares of renewable energy, Germany experienced the largest change in its share, from about 2% in 1990 to more than 12% in 2012. However, there is a dip in Germany’s share of renewable energy around 2008, coinciding with the global financial crisis. A possible explanation is that switching to renewable energy is costly, and companies did not have sufficient resources to do so. Hence, they increased their fossil fuel use or reduced their use of renewable energy during that time. Nonetheless, of the 10 countries considered above, Germany is the only one that experienced such a dip around 2008. Why the other countries did not encounter this issue is a question left to students to explore.
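The comparison of rising and falling trends discussed above can be made more systematic by computing, for each country, the change in the share between its first and last available year. The sketch below is one possible way to do this in Python and is not taken from the original analysis. It assumes rows already filtered to a single indicator, with the assumed column names `CountryName`, `Year`, and `Value`, and the function names are hypothetical.

```python
from typing import Dict, Iterable, List, Mapping


def share_change(rows: Iterable[Mapping[str, str]]) -> Dict[str, dict]:
    """For each country, report the first year, last year, and change in value."""
    series: Dict[str, Dict[int, float]] = {}
    for row in rows:
        if row["Value"] == "":
            continue  # skip rows without a recorded value
        by_year = series.setdefault(row["CountryName"], {})
        by_year[int(row["Year"])] = float(row["Value"])

    result = {}
    for country, by_year in series.items():
        first, last = min(by_year), max(by_year)
        result[country] = {
            "first_year": first,
            "last_year": last,
            # Change is in percentage points, since the values are shares in %.
            "change": by_year[last] - by_year[first],
        }
    return result


def declining(changes: Mapping[str, dict]) -> List[str]:
    """Return, in alphabetical order, the countries whose share fell."""
    return sorted(c for c, info in changes.items() if info["change"] < 0)
```

Applying `declining(share_change(rows))` to the filtered renewable energy data would list the countries with downward trends. A first-to-last difference hides intermediate movements such as a temporary dip, so a plot is still needed to see those.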

We have shown a simple example of how to perform an analysis using the World Development Indicators dataset. One can explore more interesting relationships among indicators and countries using the other parts of the dataset. A large dataset like World Development Indicators may seem daunting and confusing at first. In analyzing such datasets, it helps to identify the useful information and remove irrelevant data. We did this by filtering the dataset until we had a succinct subset containing only the relevant observations. Then we could perform our analysis on the new dataset in Excel and STATA. We made comparisons across countries, but we could also run regressions to explore relationships between indicators, and so on.
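To give a sense of what a regression between two indicators involves, the following Python sketch pairs the values of two indicators by country and year and fits a simple least-squares line. It is a minimal illustration under stated assumptions, not a replacement for STATA's regression tools. The column names `CountryCode`, `IndicatorCode`, `Year`, and `Value` are assumed, the function names are hypothetical, and the two indicator codes are supplied by the reader.

```python
from typing import Iterable, List, Mapping, Tuple


def paired_values(rows: Iterable[Mapping[str, str]],
                  code_x: str, code_y: str) -> List[Tuple[float, float]]:
    """Return (x, y) pairs for every country-year that reports both indicators."""
    xs, ys = {}, {}
    for row in rows:
        key = (row["CountryCode"], int(row["Year"]))
        if row["IndicatorCode"] == code_x:
            xs[key] = float(row["Value"])
        elif row["IndicatorCode"] == code_y:
            ys[key] = float(row["Value"])
    # Keep only the country-years present in both series.
    return [(xs[k], ys[k]) for k in sorted(xs.keys() & ys.keys())]


def ols(pairs: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two observations")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    if sxx == 0:
        raise ValueError("x has no variation")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope
```

This pooled fit treats every country-year as an independent observation, which is a strong simplification. A proper analysis would consider country effects and standard errors, which STATA handles directly.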

Hint: We can also list all of the indicators present in the dataset using a provided kernel (a piece of code) without downloading the dataset. The code is shown below, and you can also find it on Kaggle.com by typing “Indicators in Data” in the search window. This kernel was written by Ben Hamner and can also be found under the “Kernels” section.

```{r include=FALSE}
# Wrap library() so that package startup messages are suppressed
library.warn <- library
library <- function(package, help, pos = 2, lib.loc = NULL, character.only = FALSE,
                    logical.return = FALSE, warn.conflicts = TRUE, quietly = FALSE,
                    verbose = getOption("verbose")) {
  if (!character.only) {
    package <- as.character(substitute(package))
  }
  suppressPackageStartupMessages(library.warn(
    package, help, pos, lib.loc, character.only = TRUE,
    logical.return, warn.conflicts, quietly, verbose))
}

library(xtable)

# Print a table as HTML inside a scrollable <div>
print.table <- function(table) {
  html <- print(xtable(table), type="html", print.results=FALSE, include.rownames=FALSE)
  cat(paste0("<div style='width:800; overflow:auto; border-width: 2;'><style>td {padding: 3px;} th {padding: 3px;}</style>", html, "</div>"))
}
```

```{r results="asis"}
library(dplyr)
library(readr)

indicators <- read_csv("../input/Indicators.csv")

# For each indicator, count the countries and years it covers
counts <- indicators %>%
  group_by(IndicatorCode, IndicatorName) %>%
  summarise(NumCountries = n_distinct(CountryName),
            NumYears     = n_distinct(Year),
            FirstYear    = min(Year),
            LastYear     = max(Year))

# Replace "$" in indicator names with the word "dollar"
counts$IndicatorName <- gsub("\\$", "dollar", counts$IndicatorName)

print.table(counts)
```

This kernel can be run on Kaggle. Use the “New Kernel” button on the World Development Indicators page, paste the above code, choose any kernel title, select RMarkdown from the dropdown list of environments, and then press “Run.” This will produce a table containing all indicators and their time spans in the dataset. The first several rows of the table appear as follows:

| IndicatorCode | IndicatorName | NumCountries | NumYears | FirstYear | LastYear |
|---|---|---|---|---|---|
| AG.AGR.TRAC.NO | Agricultural machinery, tractors | 219 | 49 | 1961 | 2009 |
| AG.CON.FERT.PT.ZS | Fertilizer consumption (% of fertilizer production) | 118 | 12 | 2002 | 2013 |
| AG.CON.FERT.ZS | Fertilizer consumption (kilograms per hectare of arable land) | 188 | 12 | 2002 | 2013 |
| AG.LND.AGRI.K2 | Agricultural land (sq. km) | 242 | 53 | 1961 | 2013 |
| AG.LND.AGRI.ZS | Agricultural land (% of land area) | 241 | 53 | 1961 | 2013 |
| AG.LND.ARBL.HA | Arable land (hectares) | 207 | 53 | 1961 | 2013 |

Now you have a table containing all indicator names without downloading the whole

dataset!
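For readers who have downloaded the file and want to reproduce this table locally, the grouping logic of the R kernel can be sketched in plain Python. This is an approximate translation rather than the kernel's own code. It uses the column names that appear in the R code (`IndicatorCode`, `IndicatorName`, `CountryName`, `Year`), and the function name `indicator_counts` is hypothetical.

```python
from typing import Iterable, List, Mapping


def indicator_counts(rows: Iterable[Mapping[str, str]]) -> List[dict]:
    """Per indicator: number of distinct countries and years, first and last year."""
    groups = {}
    for row in rows:
        key = (row["IndicatorCode"], row["IndicatorName"])
        countries, years = groups.setdefault(key, (set(), set()))
        countries.add(row["CountryName"])
        years.add(int(row["Year"]))

    table = []
    for code, name in sorted(groups):  # sort by indicator code, then name
        countries, years = groups[(code, name)]
        table.append({
            "IndicatorCode": code,
            "IndicatorName": name,
            "NumCountries": len(countries),
            "NumYears": len(years),
            "FirstYear": min(years),
            "LastYear": max(years),
        })
    return table
```

Passing `csv.DictReader` rows from “Indicators.csv” to `indicator_counts` should give one dictionary per indicator, with the same columns as the table above.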
