clusteRing the woRld: basic machine leaRning with R, woRkshop V3.0, beRnaRdus aRi kuncoRo


Post on 27-Jul-2020


Page 1: clusteRing the woRld · share an underlying philosophy and common APIs. Install the complete tidyverse with install.packages("tidyverse") ... Source: Nurandi (DSI Bootcamp Slide)

clusteRing the woRld

basic machine leaRning with R

beRnaRdus aRi kuncoRo

V3.0

woRkshop

Page 2:

about mE

Hi, I am Ari!

Work

School

Page 3:

quiZ

How many letters "R" are written on my _____ slide?

A. 3
B. 6
C. 9
D. 12

Page 4:

downloaD

http://arikuncoro.xyz/fff

Page 5:

agendA

Quick intro to Dataset and R

Data Exploration

Data Preparation

Clustering

Use case

Page 6:

quotE

“Learning from data is virtually universally useful. Master it and you will be welcomed anywhere.”

John Elder, Elder Research

Page 7:

evidencE

“In 2017 more than 10 companies approached me to work as a Data Scientist.”

Page 8:

quick intro to dataset & R

Part - 1

Page 9:

dataseT

World Happiness Report: https://www.kaggle.com/unsdsn/world-happiness

Download

Page 10:

1-2-3 in R

Install R & RStudio

Install Packages

Code

Page 11:

What is R?

R is a language and environment for statistical computing and graphics.

R was created by Ross Ihaka and Robert Gentleman at Univ of Auckland.

Runs on a wide variety of UNIX platforms (Linux, macOS, FreeBSD) and on Windows.

Source: https://www.r-project.org/about.html
Source: https://edu.kpfu.ru/mod/page/view.php?id=35064

Page 12:

How to Install?

• Install R
  • Download the R installer: https://cran.r-project.org
  • Choose a version (Windows, Mac, and Linux are available)
  • Run the R installer. Leave all default settings in the installation options.

• Install RStudio as the integrated development environment (IDE) for R
  • Download RStudio from https://www.rstudio.com/products/rstudio/download (choose a version and install it)
  • Leave all default settings in the installation options

Page 13:

RStudio

Source: http://dss.princeton.edu/training/RStudio101.pdf
Learn more: https://www.rstudio.com/online-learning/

Page 14:

Some Useful R - links

Find out the link that suits you! ☺

• Official page: www.r-project.org
• CRAN download page: https://cran.r-project.org
• Microsoft R Open official page: https://mran.revolutionanalytics.com/open
• RStudio IDE: www.rstudio.com
• Streams of news and articles: http://www.r-bloggers.com
• Question and answer: http://stackoverflow.com/tags/r
• Some useful R links: http://stats.stackexchange.com/questions/138/free-resources-for-learning-r
• ... need more? Google it

Online courses from:
- Datacamp
- Coursera
- edX
- Udemy
- And many more…

Page 15:

R packages for data science

The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying philosophy and common APIs.

Install the complete tidyverse with

install.packages("tidyverse")

::tidyversE

Please also install these packages for our clustering work: "corrplot", "plotly", "cluster", "fpc"
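As a minimal sketch, the snippet below checks which of the packages named on this slide are still missing and prints the install command to run. The helper name missing_packages() is hypothetical (it is not part of the slides); installed.packages() and setdiff() are base R.

```r
# Packages named on this slide
pkgs <- c("tidyverse", "corrplot", "plotly", "cluster", "fpc")

# Hypothetical helper (not from the slides): return the packages
# from `pkgs` that are not installed in the current R library.
missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

to_install <- missing_packages(pkgs)

if (length(to_install) > 0) {
  # Print the command instead of running it, so nothing is downloaded by surprise
  message("Run: install.packages(c(",
          paste0('"', to_install, '"', collapse = ", "), "))")
} else {
  message("All workshop packages are already installed.")
}
```

Printing the command rather than calling install.packages() directly keeps the check safe to run offline.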

Page 16:

Demo: Getting Started

Page 17:

Basic R Syntax & Data Types

Variable Assignment: <- or =

Type "Hello World!" in the console, then hit Enter. Or, assign it to a variable:

> my.string <- "Hello World!"
or
> my.string = "Hello World!"

Display the value:
> print(my.string)
or simply
> my.string
[1] "Hello World!"

Arithmetic in R
Addition: +
Subtraction: -
Multiplication: *
Division: /
Exponentiation: ^
Modulo: %%

Data Types: class()
• Decimal values like 5.6 are called numerics.
• Natural numbers like 7 are called integers. Integers are also numerics. (In R, a plain 7 is stored as a numeric; write 7L to get class integer.)
• Boolean values (TRUE or FALSE) are called logical.
• Text (or string) values are called characters.
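The operators and data types above can be tried directly in the console. This is a small illustration with arbitrary numbers (7 and 2 are not from the slides); the comments show what base R returns.

```r
# Arithmetic operators
7 + 2    # addition: 9
7 - 2    # subtraction: 5
7 * 2    # multiplication: 14
7 / 2    # division: 3.5
7 ^ 2    # exponentiation: 49
7 %% 2   # modulo (remainder): 1

# Data types reported by class()
class(5.6)     # "numeric"
class(7)       # "numeric" (a plain 7 is stored as a double)
class(7L)      # "integer" (the L suffix requests an integer)
class(TRUE)    # "logical"
class("R")     # "character"

# Integers also count as numerics
is.numeric(7L) # TRUE
```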

Page 18:

Special Value

Value: NA
Description: Stands for not available. NA is a placeholder for a missing value.
Example:
> num <- c(NA, 5, 4, 3)
> mean(num)
> mean(num, na.rm = TRUE)

Value: NULL
Description: Empty set. It has no class (its class is NULL) and does not take up any space in a vector.
Example:
> num <- c(NULL, 1, 2, 3)
> mean(num)

Value: Inf
Description: Stands for infinity and only applies to vectors of class numeric.
Example:
> num <- 1/0
> num

Value: NaN
Description: Stands for not a number. This is generally the result of a calculation of which the result is unknown, but it is surely not a number.
Example:
> 0/0
> Inf - Inf

Source: Nurandi (DSI Bootcamp Slide)
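For reference, here is what the examples for the special values return when run in base R. The variable names num_na and num_null are chosen for this sketch only.

```r
# NA: a missing value makes the mean missing unless it is removed
num_na <- c(NA, 5, 4, 3)
mean(num_na)                 # NA
mean(num_na, na.rm = TRUE)   # 4

# NULL: takes up no space in a vector
num_null <- c(NULL, 1, 2, 3)
length(num_null)             # 3
mean(num_null)               # 2

# Inf and NaN
1 / 0                        # Inf
0 / 0                        # NaN
Inf - Inf                    # NaN

# NaN counts as missing, but NA is not NaN
is.na(NaN)                   # TRUE
is.nan(NA)                   # FALSE
```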

Page 19:

Exercise #1

Using the Console in RStudio,

#1 Calculate 3 + 4

#2 Assign “aku cinta Indonesia” as my_char

#3 Find out the data type of my_char

#4 Assign FALSE as my_logical

#5 Find out the data type of my_logical
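One possible way to work through the five steps, as a sketch to compare against after trying the exercise yourself:

```r
# 1: calculate 3 + 4
3 + 4                 # 7

# 2: assign the text to my_char
my_char <- "aku cinta Indonesia"

# 3: data type of my_char
class(my_char)        # "character"

# 4: assign FALSE to my_logical
my_logical <- FALSE

# 5: data type of my_logical
class(my_logical)     # "logical"
```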


Page 20:

Data Structure

• Vector: one-dimensional array of the same mode

• Matrix: two-dimensional array

• Array: multi-dimensional (more than two dimensions) array

• Data Frame: tabular data objects

• List: collection of objects, where the elements can be of different types

• *Factor: vector along with the distinct values of the elements

• *Function: objects to make specific operations

Source: Nurandi (DSI Bootcamp Slide)

Page 21:

How to Create Vector, Matrix, Array?

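# Vectors
numbers <- 1:11                                 # integer vector 1, 2, ..., 11
colors <- c("red", "green", "blue")             # character vector
height <- c(leo = 170, jon = 167, alex = 185)   # named numeric vector
colors[1]            # first element
colors[2:3]          # second and third elements
colors[-3]           # everything except the third element
height["alex"]       # element selected by name
c(numbers, height)   # combine two vectors into one

# Matrix
my.matrix <- matrix(c(1:12), nrow = 3)   # 3 x 4 matrix, filled column by column
my.matrix[2, ]       # row 2
my.matrix[, 3:4]     # columns 3 and 4
my.matrix[2, 4]      # element in row 2, column 4
my.matrix[2:4]       # elements 2 to 4, counted down the columns

# Array
my.array <- array(c(1:24), dim = c(4,3,2))   # 4 x 3 x 2 array
my.array
my.array[2, ,]       # row 2 of every layer
my.array[2:3, ,]     # rows 2 and 3 of every layer
my.array[1:10]       # first 10 elements, counted down the columns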

Source: Nurandi (DSI Bootcamp Slide)

Page 22:

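How to Create a Data Frame?

bmi <- data.frame(
  gender = c("Female", "Male", "Female"),
  single = c(FALSE, FALSE, TRUE),
  height = c(155, 170, 165.5),
  weight = c(64, 65, 48.5),
  age = c(42, 38, 26)
)
bmi                          # print the whole data frame
bmi[1, ]                     # first row
bmi[, 3]                     # third column (height) as a vector
bmi[2, 3]                    # value in row 2, column 3
bmi$gender                   # column selected by name
bmi[c("height", "weight")]   # two columns, returned as a data frame
# rows with age >= 30, keeping only height and weight
newdata <- subset(bmi, age >= 30, select = c(height, weight))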

Source: Nurandi (DSI Bootcamp Slide)

Page 23:

List, Factor, & Function

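# List: elements can have different types
my.list <- list(colors, bmi, my.matrix, 4)
my.list[[1]]              # first element (the colors vector)
my.list[[1]][2]           # second value of the first element
my.list[[2]]["single"]    # column "single" of the second element (bmi)

# Factor: SD, SMP, SMA are Indonesian school levels (elementary, junior high, senior high)
edu <- rep(c("SD", "SMP", "SMA"), 3)
edu <- factor(edu)
edu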

Source: Nurandi (DSI Bootcamp Slide)

# Define ratio() function

ratio <- function(x, y) {

x/y

}

# Call ratio() with arguments 3 and 4

ratio(3,4)

Page 24:

Exercise #2

#1 Create a vector that contains the following strings, assign it as name_vector!

"Ahmad" "Hani" "Andri" "Dian"

#2 Create a vector that contains the following integers, assign it as nim_vector!

1506772460 1506699232 1506772561 1506699453

#3 Construct a matrix with 6 rows that contain the numbers 1 up to 36, assign it as my_array!

#4 Construct a data frame from two vectors: name_vector and nim_vector! Assign it as data_mahasiswa!

#5 Rename the columns using names() function into: “Nama” and “NIM”!

#6 Write a function selisih() that takes arguments x and y and returns their difference, x - y!
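One possible solution sketch for the six steps, using only base R. The call selisih(10, 4) at the end uses arbitrary values to show the function at work.

```r
# 1: character vector of names
name_vector <- c("Ahmad", "Hani", "Andri", "Dian")

# 2: vector of student numbers (stored as numeric; add an L suffix
#    to each value if class "integer" is required)
nim_vector <- c(1506772460, 1506699232, 1506772561, 1506699453)

# 3: matrix with 6 rows holding 1 to 36 (named my_array as the exercise asks)
my_array <- matrix(1:36, nrow = 6)

# 4: data frame built from the two vectors
data_mahasiswa <- data.frame(name_vector, nim_vector, stringsAsFactors = FALSE)

# 5: rename the columns
names(data_mahasiswa) <- c("Nama", "NIM")

# 6: function returning the difference x - y
selisih <- function(x, y) {
  x - y
}
selisih(10, 4)   # 6
```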


Page 25:

data exploration

Part - 2

Page 26:

explorE

To do, with my recommendation of what to use:

1. Name your code title, load the packages and the dataset.
   (Optional) Code title: World Happiness Exploration
   Package name: tidyverse
   Function: read.csv()

2. Know the data: number of files, dimensions, column names, merging data frames, statistical summary of the data, which country was the happiest in 2017, etc.
   Function: dim(), colnames(), summary()

3. Know the rank changes.
   Function: mutate()

4. Correlation between variables.
   Function: corrplot(), cor()

5. Visualize the data with a boxplot.
   Function: boxplot()

Page 27:

Demo: Data Exploration

Page 28:

explorE

# load packages
library(tidyverse)
library(corrplot)
library(plotly)

# set the working directory
setwd("D:/data/world-happiness-report")

# import data
whr_2015 <- read.csv("2015.csv", stringsAsFactors = FALSE)
whr_2016 <- read.csv("2016.csv", stringsAsFactors = FALSE)
whr_2017 <- read.csv("2017.csv", stringsAsFactors = FALSE)

# check the data
head(whr_2015)
head(whr_2016)
head(whr_2017)

Page 29:

explorE

# check the data dimensions (you can also see them in the Environment pane at the top right)
dim(whr_2017)

# check the column names
colnames(whr_2015)
colnames(whr_2016)
colnames(whr_2017)

# check the distribution of the Happiness Score for WHR 2015
summary(whr_2015$Happiness.Score)

Page 30:

explorE

# Question:

From 2016 to 2017, which country's happiness rank rose the most?

By how many places did it rise?

From which rank to which rank?

Page 31:

explorE

# Find the change in rank from 2016 to 2017

# 1) join the two data frames by Country
whr_all <- merge(whr_2016[, c(1, 3)], whr_2017[, c(1, 2)], by.x = "Country", by.y = "Country")

# 2) name the columns of whr_all, then create the Rank Change column with mutate()
colnames(whr_all) <- c("Country", "Happiness Rank 2016", "Happiness Rank 2017")
whr_all1 <- whr_all %>% mutate(`Rank Change` = `Happiness Rank 2016` - `Happiness Rank 2017`)

# 3) show which country has the largest rise in rank
whr_all1[whr_all1$`Rank Change` == max(whr_all1$`Rank Change`), ][, 1]

# Answer: Bulgaria, up 24 places from 129 to 105.
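The same join-then-subtract idea can be sketched on a tiny made-up table, without the CSV files or the tidyverse. The countries "A", "B", "C" and their ranks are invented for illustration; merge() and which.max() are base R.

```r
# Made-up ranks for two years (not real data)
rank_2016 <- data.frame(Country = c("A", "B", "C"), Rank = c(1, 2, 3))
rank_2017 <- data.frame(Country = c("A", "B", "C"), Rank = c(2, 3, 1))

# Join by Country; the suffixes keep the two Rank columns apart
both <- merge(rank_2016, rank_2017, by = "Country",
              suffixes = c(".2016", ".2017"))

# Positive change = moved up the ranking (a smaller rank number is better)
both$Rank.Change <- both$Rank.2016 - both$Rank.2017

# Row with the largest rise
both[which.max(both$Rank.Change), ]
```

Here country "C" moves from rank 3 to rank 1, so its Rank.Change is 2, the largest in the table.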

Page 32:

explorE

# Correlation between variables (columns 1 and 2, Country and Happiness.Rank, are left out)
test3 <- cor(as.matrix(whr_2017[, -c(1, 2)]))

corrplot::corrplot(test3, type = "upper", method = "square", mar = c(0, 0, 1, 0))
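To see what cor() hands to corrplot(), here is a small sketch on synthetic data (the columns x, y, z are generated here and are not from the dataset). It assumes nothing beyond base R.

```r
set.seed(1)

# Synthetic data: y depends strongly on x, z is independent noise
x <- rnorm(100)
toy <- data.frame(
  x = x,
  y = 2 * x + rnorm(100, sd = 0.1),
  z = rnorm(100)
)

# Pearson correlation matrix: symmetric, with 1 on the diagonal
cm <- cor(toy)
round(cm, 2)
```

The entry for x and y should be close to 1, while the entries involving z should be close to 0. corrplot() draws this matrix as colored squares.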

2017

Page 33:

explorE

# The 2017 data has no Region column, so we join by Country to bring in Region from the 2016 data.
whr_2017_new <- whr_2017 %>% left_join(whr_2016[, c(1, 2)], by = "Country")

plot_ly(whr_2017_new, x = ~Region, y = ~Happiness.Score, type = "box", boxpoints = "all", pointpos = -1.8, color = ~Region) %>%
  layout(xaxis = list(showticklabels = FALSE),
         margin = list(b = 100))

Page 34:

data preparation

Part - 3

Page 35:

preparE

Goal: Cluster analysis of the World Happiness Report 2017

01 Select features

02 Normalize dataset

03 Cluster data

Page 36:

Demo: Data Preparation

Page 37:

preparE

# Your goal is to cluster based on the variables that make up the Happiness Score. Suppose we choose 6 variables:
# GDP, Family, Life Expectancy, Freedom, Trust, Generosity
clean_data <- whr_2017[, c(1, 6:11)]

# Look at the distribution of each variable
par(mfrow = c(1, 6))
for (i in 2:7) {
  boxplot(clean_data[, i], main = names(clean_data)[i])
}

Page 38:

preparE

# remove rows with missing (NA) values
clean_data <- na.omit(clean_data) # listwise deletion of missing values
clean_data$Country <- as.character(clean_data$Country)

# normalize data (optional, because the boxplots show the spreads are already fairly even)
clean_data_2 <- scale(clean_data[, 2:7]) # standardize variables
a <- data.frame(clean_data_2)
clean_data_3 <- cbind(Country = clean_data$Country, a)

# look at the boxplots after scaling/normalizing (this normalized result is not used; we use clean_data)
par(mfrow = c(1, 6))
for (i in 2:7) {
  boxplot(clean_data_3[, i], main = names(clean_data)[i])
}
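What scale() does with its default arguments can be checked by hand on a tiny matrix: each column has its mean subtracted and is then divided by its standard deviation. The matrix toy below is made up for this sketch.

```r
# Made-up matrix with two columns on very different scales
toy <- matrix(c(1, 2, 3, 10, 20, 30), ncol = 2,
              dimnames = list(NULL, c("a", "b")))

# Built-in standardization (center = TRUE, scale = TRUE by default)
scaled <- scale(toy)

# The same computation by hand: (value - column mean) / column sd
manual <- apply(toy, 2, function(col) (col - mean(col)) / sd(col))

all.equal(as.vector(scaled), as.vector(manual))   # TRUE
colMeans(scaled)                                  # 0 for every column
apply(scaled, 2, sd)                              # 1 for every column
```

After scaling, both columns become -1, 0, 1, so neither dominates a Euclidean distance just because of its units.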

Page 39:

clustering

Part - 4

Page 40:

To do, with my recommendation of what to use:

1. Use two clustering algorithms:
   - Partitioning: K-Means
   - Hierarchical Clustering
   Function: kmeans(), hclust(), cutree()

2. Plot the cluster solution:
   # vary parameters for most readable graph
   library(cluster)
   clusplot(mydata, fit$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)

   # Centroid Plot against 1st 2 discriminant functions
   library(fpc)
   plotcluster(mydata, fit$cluster)

https://www.statmethods.net/advstats/cluster.html
https://datascienceplus.com/k-means-clustering-in-r/
https://datascienceplus.com/hierarchical-clustering-in-r/

Page 41:

Demo: Clustering

Page 42:

clusterinG

# K-Means Clustering

# mydata must exist before this step: the six numeric variables, with Country as row names
rownames(clean_data) <- clean_data[, 1]
mydata <- clean_data[, -1]

# Determine number of clusters
wss <- (nrow(mydata) - 1) * sum(apply(mydata, 2, var))
for (i in 2:15) wss[i] <- sum(kmeans(mydata, centers = i)$withinss)
plot(1:15, wss, type = "b", xlab = "Number of Clusters", ylab = "Within groups sum of squares")

# K-Means Cluster Analysis
fit1 <- kmeans(mydata, 3) # 3 cluster solution
# get cluster means
aggregate(mydata, by = list(fit1$cluster), FUN = mean)
# append cluster assignment
mydata_kmeans <- data.frame(clean_data, fit1$cluster)
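The elbow approach above can be tried on synthetic data where the right answer is known. This is a sketch under stated assumptions: the three blobs in toy are generated here, and set.seed() and nstart = 10 are additions for reproducibility and stability that the slide code does not use.

```r
set.seed(42)

# Synthetic data: three well-separated blobs of 30 points each
toy <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 6, sd = 0.3), ncol = 2)
)

# Total within-cluster sum of squares for k = 1..6
# (nstart = 10 keeps the best of 10 random starts for each k)
wss <- sapply(1:6, function(k) {
  kmeans(toy, centers = k, nstart = 10)$tot.withinss
})

# Elbow plot: look for the k where the curve stops dropping sharply
plot(1:6, wss, type = "b",
     xlab = "Number of Clusters",
     ylab = "Within groups sum of squares")

# Fit the chosen solution and count the members of each cluster
fit <- kmeans(toy, centers = 3, nstart = 10)
table(fit$cluster)
```

With blobs this far apart, the curve should bend clearly at k = 3. On real data such as the happiness variables the elbow is usually less sharp, so the choice of k remains a judgment call.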

Page 43:

clusterinG

# Ward Hierarchical Clustering
rownames(clean_data) <- clean_data[, 1]
mydata <- clean_data[, -1]

d <- dist(mydata, method = "euclidean") # distance matrix
fit2 <- hclust(d, method = "ward.D")
plot(fit2) # display dendrogram
# draw dendrogram with red borders around the 3 clusters
rect.hclust(fit2, k = 3, border = "red")
groups <- cutree(fit2, k = 3) # cut tree into 3 clusters
mydata_hclust <- data.frame(clean_data, groups)
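A self-contained sketch of the same hierarchical workflow on synthetic data. The two blobs in toy are made up. The comparison with "ward.D2" is an addition to the slide code: according to the hclust() documentation, "ward.D2" implements Ward's criterion on unsquared Euclidean distances, while "ward.D" expects squared distances to do so.

```r
set.seed(7)

# Synthetic data: two well-separated blobs of 10 points each
toy <- rbind(
  matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(20, mean = 5, sd = 0.3), ncol = 2)
)
rownames(toy) <- paste0("obs", 1:20)

# Pairwise Euclidean distances between observations
d <- dist(toy, method = "euclidean")

# Two Ward variants offered by hclust()
fit_d  <- hclust(d, method = "ward.D")
fit_d2 <- hclust(d, method = "ward.D2")

# Cut each tree into 2 clusters
groups_d  <- cutree(fit_d,  k = 2)
groups_d2 <- cutree(fit_d2, k = 2)

# Cross-tabulate the two assignments to see whether they agree
table(groups_d, groups_d2)
```

On clearly separated data both variants should recover the same two groups; on real data it may be worth checking whether the choice changes the clusters.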

Page 44:

clusterinG

# Cluster Plot against 1st 2 principal components, and vary parameters for most readable graph
library(cluster)
clusplot(mydata, fit1$cluster, color = TRUE, shade = TRUE, labels = 2, lines = 0)

# Centroid Plot against 1st 2 discriminant functions
library(fpc)
plotcluster(mydata, fit1$cluster)

Page 45:

q&A

Page 46:

Thank You