WORLD DATA: EXPLORING KAGGLE DATA SETS
ABSTRACT
Introduction to Data Sets on Kaggle.com. The guide is
prepared by undergraduate students majoring in
Economics at Rice University for undergraduate
students majoring in the Social Sciences.
Instructor: Natalia Sizova, Rice University
World Data Sample Analysis: Overview
On Kaggle, we can find many useful datasets under the “Datasets” section. Here, we will take
a look at “World Development Indicators.” This dataset is provided by Ben Hamner,
cofounder and CTO of Kaggle. Click on the title and you will see the following webpage:
On the web page, there are two links shown in blue. One leads to a list of the
available indicators; the other leads to a list of the available countries. Now let’s take a closer
look at the indicators. There are 1,345 indicators in total, spread across 10 broad topics:
Economics, Education, Environment, Financial Sector, Health, Infrastructure, Poverty,
Private Sector & Trade, Public Sector, and Social Protections & Labor. Each broad topic
contains narrower topics that describe the indicators more specifically, such as
CO2 emissions under Environment and government debt under Public
Sector. In total, there are 5,656,458 observations covering 247 countries across 56 years,
and we can explore many interesting topics with this comprehensive dataset, such as
changes in gender inequality over the years, the debts and income levels of different
countries, or even the number of endangered species on each continent. In this section, we
will use this dataset for some simple analysis to help students gain a better
understanding of how to use it in their own research.
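The totals quoted above also hint at how complete the data are. The short Python sketch below is only a back-of-the-envelope check: it assumes that the 1,345 indicators, 247 countries, and 56 years describe the full extent of the dataset, and it compares the number of observations with the size of a fully populated grid.

```python
# Figures quoted in the text above.
n_indicators = 1345
n_countries = 247
n_years = 56
n_observations = 5_656_458

# Size of a fully populated indicator x country x year grid
# (an assumption: every combination could in principle hold a value).
full_grid = n_indicators * n_countries * n_years

# Share of that grid that actually contains an observation.
coverage = n_observations / full_grid

print(f"full grid: {full_grid:,} cells")
print(f"share of cells with a value: {coverage:.1%}")
```

Under these assumptions, well under half of the possible country-year-indicator combinations are filled, so students should expect gaps: many indicators are recorded only for some countries or some years.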
The first step is to download the dataset. You can download the “.csv” version of the data
by clicking on the “Download Data” button. Then, open the data in STATA. (Note: some
of the files may be too large to open in the student version of STATA.)
Besides the “Download Data” button, this dataset's page has three other sections:
Description, Kernels (formerly Scripts), and Discussion (formerly Forum). In the
“Description” section you can find a list of the available countries and indicators in the
dataset, as well as a “life expectancy at birth” example. In the “Kernels” section, you can
find sample analyses and code submitted by other users, such as “gender equality.”
In the “Discussion” section, you can participate in discussions started by other users or
post your own questions.
Example 1: Overview of the G7 Data
Under the “Kernels” section (formerly “Scripts”), we can find some sample analyses
submitted by other Kaggle users. We will first take a look at their analysis to gain a basic
understanding of how to use the dataset. The example used here is “Overview of G7 Data”
by Hiraku Shibuya. In his analysis, Hiraku compared the populations, GDPs, and
unemployment rates of G7 countries. He also included code in his analysis, which we can
directly run on Kaggle. Click the blue “Fork Script” button (or “Fork Kernel”), and you
will see a web page like this:
On the left is the code written by the user, and you can click the “Run” button to run the
code. Results will be shown on the right half of the web page. If there are any pictures or
graphs generated, you can find them in the “Output” tab. From the data and code, we obtain
the graphs of population trends, population growth rate, GDP per capita, GDP per capita
growth rate, and unemployment rate for G7 countries. Below are sample population graphs
extracted from the results:
Figure 1 Source: Kaggle.com
From the population graphs, we can see that population growth appears to be slowing
in most of the G7 countries. Germany and Japan even had negative population growth
rates in recent years. The kernel also generates the GDP growth rate and unemployment
rate graphs, which are available for students to explore.
Example 2: Percentage of Renewable Energy in Total Energy Consumption
Now we want to do our own analysis with the World Development Indicators dataset. With
increasing concerns about the greenhouse effect and potential future energy shortages,
countries are paying more and more attention to renewable energy as an alternative to traditional
fossil fuels. Governments and companies have invested large amounts of resources
in exploring viable and efficient alternative sources of energy, and we want to compare how
various countries are doing in this area. One important indicator is the
percentage of renewable energy in total energy consumption, and we will use this indicator
to compare the levels of renewable energy use among countries.
After downloading the data from the website, we import the “Indicators.csv” file into
STATA. The file “Indicators.csv” contains all the necessary information, organized by
country, indicator, year, and value. We don’t need to merge it with the other files in the folder,
as those files only describe the variables, such as indicator names and country
codes. Once the file “Indicators.csv” is imported, we can take a look at the data in the Data
Editor. There are six columns: countryname, countrycode, indicatorname, indicatorcode,
year, and value. The dataset is originally sorted by countryname, but in this case, we want
to compare different countries under the same indicator. Hence, we need to filter
the dataset. Each observation is a row containing these six columns, and we want
to keep only the observations for the desired indicator.
To pick out the observations we need, we first find the desired “indicatorname” in the dataset.
The “indicatorname” for the percentage of renewable energy in total energy
consumption is “Renewable energy consumption (% of total final energy consumption)”.
The corresponding “indicatorcode” is EG.FEC.RNEW.ZS. We only need observations
with this indicator code. Hence, we drop the rest of the dataset by typing
keep if indicatorcode == "EG.FEC.RNEW.ZS"
in STATA. This command drops every row that has a different indicator code.
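For readers without STATA, the same filter can be sketched in Python using only the standard library. This is an illustrative sketch, not the procedure the guide itself follows: the column names follow the header of “Indicators.csv” as used in the kernel code in this guide (CountryName, IndicatorCode, and so on), and the three data rows are made-up placeholders rather than real values.

```python
import csv
import io

# Toy stand-in for Indicators.csv; the values are placeholders, not real data.
SAMPLE_CSV = """CountryName,CountryCode,IndicatorName,IndicatorCode,Year,Value
Country A,AAA,Renewable energy consumption (% of total final energy consumption),EG.FEC.RNEW.ZS,1990,10.0
Country A,AAA,"Population, total",SP.POP.TOTL,1990,1000
Country B,BBB,Renewable energy consumption (% of total final energy consumption),EG.FEC.RNEW.ZS,1990,20.0
"""


def keep_indicator(rows, code):
    """Return only the rows whose IndicatorCode equals `code`."""
    return [row for row in rows if row["IndicatorCode"] == code]


# Read every row into a dict keyed by the column names in the header.
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# Equivalent in spirit to: keep if indicatorcode == "EG.FEC.RNEW.ZS"
renewable = keep_indicator(rows, "EG.FEC.RNEW.ZS")

print(len(rows), "rows before,", len(renewable), "rows after")
```

To run this on the real file, the `io.StringIO(SAMPLE_CSV)` argument would be replaced by an open file handle for the downloaded “Indicators.csv”.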
We are then left with 4,867 observations, which is far more manageable than the original 5.6
million. The remaining data cover the use of renewable energy in 221 countries
over 23 years, from 1990 to 2012. To compare the trends for different countries, we plot a
graph. We export the filtered dataset to a .csv file and carry out the rest of the analysis in
Excel. The graph below shows the share of renewable energy in total energy
consumption for ten countries over these 23 years.
Figure 2
There are many interesting observations to be made from the graph. First, consider the trends for India, China, and Mexico. Unlike the other countries, these three show downward-sloping trends, indicating that their shares of renewable energy have decreased over the years. This finding is somewhat counterintuitive, as we generally expect the share of renewable energy in total energy consumption to increase with technological advancement. However, all three are developing countries, and they consume more and more energy as they develop. Compared to renewable energy, fossil fuels are more readily available and much cheaper. Hence, the increase in energy consumption comes mainly from fossil fuels, which reduces the share of renewable energy. Yet China and India have much higher shares of renewable energy in total energy consumption than the other countries, which is the opposite of what we might expect based on the news and other sources. The records for some countries may be inaccurate, which would affect such comparisons. However, in our simple sample analysis, we will not explore this issue further.
Among the countries with increased shares of renewable energy, Germany has experienced the largest change in percentage, from about 2% in 1990 to more than 12% in 2012. However, there is a dip in Germany’s share of green energy around 2008, coinciding with the global financial crisis. A possible explanation is that switching to renewable energy is costly, and companies did not have sufficient resources to do so. Hence, they increased their fossil fuel use or reduced their use of renewable energy during that time. Nonetheless, of the 10 countries considered above, Germany is the only one that experienced such a dip around 2008. Why the other countries did not encounter this issue remains an open question, which is left to students to explore.
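Comparisons like these come down to looking at each country's first and last recorded values. The Python sketch below shows one way to compute that change once the filtered rows are available. The function name `change_by_country` and the rows for “Country A” and “Country B” are hypothetical placeholders chosen for illustration; they are not values from the dataset.

```python
from collections import defaultdict

# Placeholder (country, year, share in percent) rows; not real WDI values.
toy_rows = [
    ("Country A", 1990, 30.0), ("Country A", 2001, 25.0), ("Country A", 2012, 20.0),
    ("Country B", 1990, 2.0), ("Country B", 2001, 6.0), ("Country B", 2012, 12.0),
]


def change_by_country(rows):
    """Map each country to (first value, last value, change in percentage points)."""
    series = defaultdict(dict)
    for country, year, value in rows:
        series[country][year] = value
    result = {}
    for country, by_year in series.items():
        # Use the earliest and latest years that actually have a value.
        first, last = by_year[min(by_year)], by_year[max(by_year)]
        result[country] = (first, last, last - first)
    return result


changes = change_by_country(toy_rows)
for country, (first, last, delta) in sorted(changes.items()):
    direction = "down" if delta < 0 else "up"
    print(f"{country}: {first:.1f}% -> {last:.1f}% ({direction} {abs(delta):.1f} points)")
```

A negative change corresponds to a downward-sloping trend in the graph, and a positive one to an upward-sloping trend. Note that the first-to-last difference hides anything that happens in between, such as a temporary dip.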
We have shown a simple example of how to perform an analysis using the World Development
Indicators dataset. One can explore more interesting relationships among indicators and countries
using the other parts of the dataset. A large dataset like World Development Indicators may
seem daunting and confusing at first. In analyzing such datasets, it helps to identify
the useful information and remove irrelevant data. We did this by filtering the dataset
until we were left with a compact subset containing only the
relevant observations. Then, we could perform our analysis on the new dataset in Excel
and STATA. We made comparisons across countries, but we could also run regressions to
explore relationships between indicators, and so on.
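As a rough illustration of the regression idea, the Python sketch below matches two indicators by country and year and fits a straight line by ordinary least squares. Everything in it is an assumption made for illustration: the helper `ols_fit`, the dictionaries `indicator_x` and `indicator_y`, and the placeholder values are not part of the dataset, and a real analysis would more likely use the regression commands in STATA.

```python
def ols_fit(xs, ys):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Placeholder values keyed by (country, year); not real WDI data.
indicator_x = {("Country A", 2010): 1.0, ("Country B", 2010): 2.0,
               ("Country C", 2010): 3.0}
indicator_y = {("Country A", 2010): 5.0, ("Country B", 2010): 7.0,
               ("Country C", 2010): 9.0, ("Country D", 2010): 1.0}

# Keep only the country-year pairs that are present in both indicators.
shared = sorted(indicator_x.keys() & indicator_y.keys())
xs = [indicator_x[key] for key in shared]
ys = [indicator_y[key] for key in shared]

intercept, slope = ols_fit(xs, ys)
print(f"y = {intercept:.2f} + {slope:.2f} * x over {len(shared)} matched pairs")
```

The matching step matters because the indicators do not all cover the same countries and years; here “Country D” is dropped because it has no value for the first indicator.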
Hint: We can also list all of the indicators present in the dataset using a provided kernel (a
piece of code) without downloading the dataset. This code is shown below, and you can
also find it on Kaggle.com by typing “Indicators in Data” in the search window. This kernel
was written by Ben Hamner and can also be found under the “Kernels” section.
```{r include=FALSE}
# Wrap library() so that package startup messages are suppressed.
library.warn <- library
library <- function(package, help, pos = 2, lib.loc = NULL, character.only = FALSE,
                    logical.return = FALSE, warn.conflicts = TRUE, quietly = FALSE,
                    verbose = getOption("verbose")) {
  if (!character.only) { package <- as.character(substitute(package)) }
  suppressPackageStartupMessages(library.warn(
    package, help, pos, lib.loc, character.only = TRUE,
    logical.return, warn.conflicts, quietly, verbose))
}
library(xtable)
# Render a data frame as an HTML table inside a scrollable <div>.
print.table <- function(table) {
  html <- print(xtable(table), type="html", print.results=FALSE, include.rownames=FALSE)
  cat(paste0("<div style='width:800; overflow:auto; border-width: 2;'><style>td {padding: 3px;} th {padding: 3px;}</style>", html, "</div>"))
}
```
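```{r results="asis"}
library(dplyr)
library(readr)
# On Kaggle, the dataset files are available under ../input/.
indicators <- read_csv("../input/Indicators.csv")
# For each indicator, count the countries and years covered and find the time span.
counts <- indicators %>%
  group_by(IndicatorCode, IndicatorName) %>%
  summarise(NumCountries = n_distinct(CountryName),
            NumYears     = n_distinct(Year),
            FirstYear    = min(Year),
            LastYear     = max(Year))
# Replace "$" in indicator names with the word "dollar" before printing.
counts$IndicatorName <- gsub("\\$", "dollar", counts$IndicatorName)
print.table(counts)
```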
This kernel can be run on Kaggle. Use the “New Kernel” button on the World Development
Indicators page, paste the above code, choose any kernel title, select RMarkdown from the
dropdown list of environments, and then press “Run”. This will produce a table containing
all indicators and their time spans in the dataset. The first several rows of the table appear
as follows:
| IndicatorCode | IndicatorName | NumCountries | NumYears | FirstYear | LastYear |
|---|---|---|---|---|---|
| AG.AGR.TRAC.NO | Agricultural machinery, tractors | 219 | 49 | 1961 | 2009 |
| AG.CON.FERT.PT.ZS | Fertilizer consumption (% of fertilizer production) | 118 | 12 | 2002 | 2013 |
| AG.CON.FERT.ZS | Fertilizer consumption (kilograms per hectare of arable land) | 188 | 12 | 2002 | 2013 |
| AG.LND.AGRI.K2 | Agricultural land (sq. km) | 242 | 53 | 1961 | 2013 |
| AG.LND.AGRI.ZS | Agricultural land (% of land area) | 241 | 53 | 1961 | 2013 |
| AG.LND.ARBL.HA | Arable land (hectares) | 207 | 53 | 1961 | 2013 |
Now you have a table containing all indicator names without downloading the whole
dataset!
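For students who do download “Indicators.csv”, the same kind of summary can be sketched in Python with only the standard library. This is a rough equivalent of the grouping step in the kernel above, not a replacement for it. The function name `summarise_indicators` is hypothetical, and the sample rows are placeholders: the indicator codes and names are taken from the table above, but the countries and values are made up.

```python
import csv
import io
from collections import defaultdict

# Toy stand-in for Indicators.csv; countries and values are placeholders.
SAMPLE_CSV = """CountryName,CountryCode,IndicatorName,IndicatorCode,Year,Value
Country A,AAA,Arable land (hectares),AG.LND.ARBL.HA,1961,1.0
Country A,AAA,Arable land (hectares),AG.LND.ARBL.HA,1962,1.0
Country B,BBB,Arable land (hectares),AG.LND.ARBL.HA,1962,2.0
Country A,AAA,Agricultural land (sq. km),AG.LND.AGRI.K2,1961,3.0
"""


def summarise_indicators(rows):
    """Per indicator: distinct countries, distinct years, first and last year."""
    countries = defaultdict(set)
    years = defaultdict(set)
    for row in rows:
        key = (row["IndicatorCode"], row["IndicatorName"])
        countries[key].add(row["CountryName"])
        years[key].add(int(row["Year"]))
    return {
        key: {
            "NumCountries": len(countries[key]),
            "NumYears": len(years[key]),
            "FirstYear": min(years[key]),
            "LastYear": max(years[key]),
        }
        for key in sorted(countries)
    }


summary = summarise_indicators(csv.DictReader(io.StringIO(SAMPLE_CSV)))
for (code, name), stats in summary.items():
    print(code, name, stats)
```

Sets are used so that each country and each year is counted once per indicator, which mirrors what `n_distinct` does in the R code.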