
HSD Hochschule Düsseldorf

University of Applied Sciences

Fachbereich Wirtschaftswissenschaften

Faculty of Business Studies

IT Applications in Business Analytics

Business Analytics (M.Sc.)

IT in Business Analytics

SS2016 / Lecture 04 – The R Programming Language

Thomas Zeutschler


Thomas Zeutschler

Associate Lecturer

Let’s get started…


Introduction

The R Programming Language

R is a statistical programming language developed by Ross Ihaka and Robert Gentleman and introduced in 1993.

R provides a wide variety of statistical and graphical techniques (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, …).

R is open source, highly extensible and runs on all major platforms.

Today, R is one of the most widely used software ecosystems for statistical analysis.

The R Programming Language

The R system contains two major components:

1. Base system: contains the R language software and the high-priority add-on packages.

2. User-contributed add-on packages.

R includes:

an effective data handling and storage facility,

a suite of operators for calculations on arrays, in particular matrices,

a large collection of intermediate tools for data analysis,

graphical facilities for data analysis and display, either on-screen or on hardcopy, and

a simple and effective programming language which includes conditionals, loops, user-defined recursive functions and input and output facilities.
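A few of the language features listed above can be illustrated with a minimal sketch. The names `m`, `factorial_rec` and `results` are chosen here for illustration only and do not come from the lecture.

```r
# Operators work on whole arrays and matrices at once
m <- matrix(1:4, nrow = 2)   # 2 x 2 matrix, filled column by column
elementwise <- m * m         # element-by-element product
product <- m %*% m           # matrix product
transposed <- t(m)           # transpose

# A user-defined recursive function with a conditional
factorial_rec <- function(n) {
  if (n <= 1) {
    return(1)
  }
  n * factorial_rec(n - 1)
}

# A loop that collects the results of the function
results <- numeric(0)
for (i in 1:5) {
  results <- c(results, factorial_rec(i))
}

# Output facility: print to the console
print(product)
print(results)
```

Note that `*` and `%*%` are different operators: the first multiplies element by element, the second performs matrix multiplication.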

Who uses the R Language?

Data scientists and analysts, statisticians, mathematicians.

All scientists, researchers and (product) developers who deal with data, especially in the natural sciences (medicine, biology) and the social sciences.

R is used especially often in developing countries, because it gives universal free access to state-of-the-art tools for statistical data analysis.

R is widely used for teaching statistics to undergraduate and graduate students, because it is free of charge.

Who uses the R Language?

Many software vendors integrate R to provide advanced statistical capabilities from within their products.

Statistical software: SAS, SPSS, Statistica, KNIME, RapidMiner, Mathematica etc.

Relational databases: Oracle, SAP HANA, Microsoft SQL Server, IBM DB2 etc.

Big data and NoSQL databases: Hadoop, MongoDB, Cassandra etc.

LOB (line-of-business) applications, e.g. ERP, CRM: SAP, Microsoft etc.

R Popularity

[Figure: Google Trends chart, 04.2016]

R Eco-System

R Eco System

Packages are collections of R functions, data and/or compiled code.

CRAN, “The Comprehensive R Archive Network”, is the central repository for publicly available R packages.

https://cran.r-project.org/

About 8,300 different packages were available as of April 2016.

R Eco System – Packages

[Table: some R packages ordered by date of creation]

Many packages are constantly updated and very reliable.

The community is the reason for the success of R.

R Eco System – Packages (most popular 2015)

1. Rcpp: Seamless R and C++ Integration (693,288 downloads)
2. ggplot2: An Implementation of the Grammar of Graphics
3. stringr: Simple, Consistent Wrappers for Common String Operations
4. plyr: Tools for Splitting, Applying and Combining Data
5. digest: Create Cryptographic Hash Digests of R Objects
6. reshape2: Flexibly Reshape Data: A Reboot of the Reshape Package
7. colorspace: Color Space Manipulation
8. RColorBrewer: ColorBrewer Palettes
9. manipulate: Interactive Plots for RStudio
10. scales: Scale Functions for Visualization
11. labeling: Axis Labeling
12. proto: Prototype Object-Based Programming
13. munsell: Munsell Colour System
14. gtable: Arrange Grobs in Tables
15. dichromat: Color Schemes for Dichromats
16. mime: Map Filenames to MIME Types
17. RCurl: General Network (HTTP/FTP/...) Client Interface for R
18. bitops: Bitwise Operations
19. zoo: S3 Infrastructure for Regular and Irregular Time Series
20. knitr: A General-Purpose Package for Dynamic Report Generation in R (295,528 downloads)

RStudio

RStudio

Native R is a console application; RStudio is a wrapper around it (an integrated development environment) for convenience…

R Basics

R Basics

http://www.ats.ucla.edu/stat/r/seminars/intro.htm

Variables

# Declaration and usage of variables
A <- 2
B <- 3
x <- seq(0, 2*pi, 0.1)
y <- sin(x)

Simple Mathematics

# Attention: R is case sensitive (sin() works, Sin() does not exist)
1 + 2
sin(2*3)

Charting

# Plot the sine curve with a title, subtitle and axis labels
plot(x, y, main="Sine Plot",
     sub="made with R",
     xlab="x-axis",
     ylab="y-axis")

R Basics – Install and use packages

Using Packages

Installing Packages (remove the #)

Automatic Load and (if required) Installation of a Package
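The three steps named above can be sketched as follows. This is a minimal sketch, not the code from the slide: the package names `lattice` and `ggplot2` are picked as examples because they appear later in the lecture, and the helper `load_or_install` is a hypothetical name introduced here.

```r
# Using a package: attach an installed package to the current session
library(lattice)

# Installing a package from CRAN (remove the # to run; needs internet access)
# install.packages("ggplot2")

# Automatic load and (if required) installation of a package.
# load_or_install is a hypothetical helper, not a built-in R function.
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  library(pkg, character.only = TRUE)
}

load_or_install("lattice")
```

`requireNamespace()` only checks whether the package can be loaded, so the installation step runs only when the package is missing.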

R Basics

Loading Data

Assign Data to Objects

Accessing Data
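A minimal sketch of these three steps, assuming a small CSV file. The file contents, the object name `scores` and its column names are placeholders made up for this illustration; they are not the lecture's data.

```r
# Create a small CSV file so the example runs without internet access
# (placeholder values, for illustration only)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,read,write", "1,57,52", "2,68,59", "3,44,33"), csv_path)

# Loading data and assigning it to an object
scores <- read.csv(csv_path)

# Accessing data
names(scores)                             # column names
scores$read                               # one column by name
scores[2, ]                               # second row, all columns
scores[, c("read", "write")]              # several columns by name
high_readers <- scores[scores$read >= 50, ]  # rows that satisfy a condition
print(high_readers)
```

Data frames are indexed as `[rows, columns]`; leaving one side empty selects all rows or all columns.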

R Basics

Accessing Data continued / Saving Data
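As a hedged illustration of subsetting and saving, the sketch below uses R's built-in `mtcars` data set instead of the lecture's data. The object and file names (`cars_subset`, `cars_subset.csv`, `cars_subset.rds`) are assumptions made for this example.

```r
# Accessing data: select rows by condition and columns by name
cars_subset <- mtcars[mtcars$mpg > 25, c("mpg", "cyl", "hp")]
head(cars_subset, 3)

# Saving data as CSV and reading it back
csv_path <- file.path(tempdir(), "cars_subset.csv")
write.csv(cars_subset, csv_path, row.names = TRUE)
restored_csv <- read.csv(csv_path, row.names = 1)

# Saving a single R object in R's native format and reading it back
rds_path <- file.path(tempdir(), "cars_subset.rds")
saveRDS(cars_subset, rds_path)
restored_rds <- readRDS(rds_path)
```

CSV is readable by other tools, while the RDS format restores the object exactly as it was saved, including column types and row names.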

R Basics

Simple Data Analysis

# filter() used below comes from the dplyr package, so load it first
library(dplyr)

d <- read.csv("http://www.ats.ucla.edu/stat/data/hsb2.csv")

# return the number of observations (rows) and variables (columns) in d
dim(d)

# get the structure of d, including the class (type) of all variables
str(d)

# return the distributional summaries of variables in the dataset
summary(d)

# return a summary of the dataset for all rows where variable 'read' >= 60
summary(filter(d, read >= 60))

R Basics

Charting

# load the lattice charting package
library(lattice)

# draw a simple scatter plot
xyplot(read ~ write, data = d)

# conditioned scatter plot
xyplot(read ~ write | prog, data = d)

# box and whisker plots
bwplot(read ~ factor(prog), data = d)

More Charting (ggplot2 package)

# load ggplot2, plus GGally, which provides ggpairs()
library(ggplot2)
library(GGally)

# draw a kernel density plot
ggplot(d, aes(x = write)) + geom_density()

# draw a kernel density plot per prog
# (the + must end the line, otherwise R treats the next line as a new expression)
ggplot(d, aes(x = write)) + geom_density() +
  facet_wrap(~ prog)

# inspect univariate and bivariate
# relationships using a scatter plot matrix
ggpairs(d[, 7:11])

Exercise in R

First Exercise in R

"Sleep in Mammals: Ecological and Constitutional Correlates" by Allison, T. and Cicchetti, D. (1976)

https://www.stat.auckland.ac.nz/~stats330/datasets.dir/sleep.txt

…/sleep.csv

Source: https://www.stat.auckland.ac.nz/~stats330/datasets.dir/

Training Video: https://www.youtube.com/watch?v=Uo1C7Iligw0

First Exercise in R

Data Import… …/sleep.csv

First Exercise in R

"Sleep in Mammals: Ecological and Constitutional Correlates" by Allison, T. and Cicchetti, D. (1976)

1. What is the average lifespan of the animals?

2. Which species lives longest?

3. Can we have a histogram of lifespan?

4. What is the correlation between the lifespan and the size of an animal?

5. Can we have a full correlation matrix of all variables (see figure 1)?

6. Can we have a scatter plot of species size vs. danger factor (see figure 2)?

[Figure 1: correlation matrix of all variables]

[Figure 2: scatter plot of species size vs. danger factor]
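One possible way to approach the six questions above is sketched below. It is not the lecturer's solution. The data frame holds placeholder values and hypothetical column names (`species`, `body_wt`, `life_span`, `danger`) so that the code runs on its own; the real sleep data set will need its own column names substituted.

```r
# Placeholder data: values and column names are illustrative only,
# NOT taken from the Allison and Cicchetti data set
sleep <- data.frame(
  species   = c("A", "B", "C", "D", "E"),
  body_wt   = c(0.5, 3, 60, 250, 2500),
  life_span = c(3, 12, 20, 30, 60),
  danger    = c(4, 3, 2, 5, 1)
)

# 1. Average lifespan (na.rm = TRUE skips missing values)
mean_life <- mean(sleep$life_span, na.rm = TRUE)

# 2. Species with the longest lifespan
oldest <- sleep$species[which.max(sleep$life_span)]

# 3. Histogram of lifespan
hist(sleep$life_span, main = "Lifespan", xlab = "lifespan")

# 4. Correlation between lifespan and body weight
r_life_size <- cor(sleep$life_span, sleep$body_wt, use = "complete.obs")

# 5. Correlation matrix of all numeric variables
numeric_cols <- sleep[sapply(sleep, is.numeric)]
cor_matrix <- cor(numeric_cols, use = "pairwise.complete.obs")

# 6. Scatter plot of body weight vs. danger factor
plot(sleep$body_wt, sleep$danger, log = "x",
     xlab = "body weight (log scale)", ylab = "danger factor")
```

The `na.rm` and `use` arguments matter with real data, where some measurements may be missing; without them `mean()` and `cor()` return `NA` as soon as one value is missing.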

Lecture Summary & Homework

Lessons Learned

CRISP-DM is a widely adopted, standardized process for data mining projects.

Defining success criteria ex ante is essential for successful projects.

Data understanding and preparation are typically the most costly and time-consuming (~80%) phases in CRISP-DM.

CRISP-DM is an iterative approach. Certain phases, such as modelling and evaluation, are likely to be passed through multiple times.


Resources

Learn R

Interactive Web Training: http://tryr.codeschool.com/

Learn R in R with Swirl: http://swirlstats.com/students.html

Swirl Courses: https://github.com/swirldev/swirl_courses#swirl-courses

Tips & Tricks

Tips & Tricks: https://www.stat.wisc.edu/network-skills/learnR#guide

R by example: http://www.mayin.org/ajayshah/KB/R/

R Tutorials: https://ww2.coastal.edu/kingw/statistics/R-tutorials/

Blogs

#1 R blog to subscribe: http://www.r-bloggers.com/

Get Prepared (Homework)

Take the full course: retype and execute each command. Enjoy…

http://www.ats.ucla.edu/stat/r/seminars/intro.htm

Get prepared for the next lesson: install KNIME on your PC/laptop.

Any Questions?