Page 1: Hochschule Düsseldorf Fachbereich

HSD Hochschule Düsseldorf

University of Applied Sciences

Fachbereich Wirtschaftswissenschaften

Faculty of Business Studies

IT Applications in Business Analytics

Business Analytics (M.Sc.)

IT in Business Analytics

SS2016 / Lecture 14 – Wrap Up

Thomas Zeutschler

SS 2016 - IT Applications in Business Analytics - 14. Wrap Up 1

Page 2: Hochschule Düsseldorf Fachbereich

HSD Faculty of Business Studies

Thomas Zeutschler

Associate Lecturer

Let’s get started…

SS 2016 - IT Applications in Business Analytics - 14. Wrap Up 2

Page 3: Hochschule Düsseldorf Fachbereich

Targets of Module and Lectures

SS 2016 - IT Applications in Business Analytics - 1. Introduction 3

German

“Die Studierenden erlernen die Anwendung praxisrelevanter IT-Werkzeuge (für Business Analytics) anhand von Fallstudien.” (Translation: “Students learn to apply practice-relevant IT tools for business analytics by working through case studies.”)

English

“Students will learn to apply analytical tools to business problems.”

American English

“We’ll try to make you a Bruce Willis in analytics.”

Page 4: Hochschule Düsseldorf Fachbereich

Scope of Module and Lectures

SS 2016 - IT Applications in Business Analytics - 1. Introduction 4

Advanced Analytics

“Advanced Analytics is the autonomous or semi-autonomous examination of data or content using sophisticated techniques and tools, typically beyond those of traditional business intelligence (BI), to discover deeper insights, make predictions, or generate recommendations.”

Source: http://www.gartner.com/it-glossary/

Page 5: Hochschule Düsseldorf Fachbereich

In Scope / Out of Scope

SS 2016 - IT Applications in Business Analytics - 1. Introduction 5

Data Science

Data Mining, Text Mining

Predictive Analytics, Simulation, Machine Learning

Database Technologies

Information Retrieval

Data Analysis

Text Analysis, Semantic Web, XML

Data Warehouse, Data Mart, ETL

In Memory Technologies

Reporting, OLAP

Data and Decision Modelling

Data Visualization

Data Quality Management, Data Protection

Specific Business Applications ► Case Studies

Page 6: Hochschule Düsseldorf Fachbereich

Sequence of Lectures

SS 2016 - IT Applications in Business Analytics - 1. Introduction 6

1 Introduction (1st April 2016)
2 Methodology and process model for analytics (CRISP DM)
3 Tools, technologies and data sources
4 The R Programming Language
5 KNIME
6 Case Study 1
9 Case Study 2
12 Case Study 3
15 Wrap Up (8th July 2016)

Theory / Tools Training / Hands On Case Studies

Page 7: Hochschule Düsseldorf Fachbereich

CRISP DM

SS 2016 - IT Applications in Business Analytics - 2. CRISP DM 7

Page 8: Hochschule Düsseldorf Fachbereich

The Data Mining Process

SS 2016 - IT Applications in Business Analytics - 2. CRISP DM 8

CRISP-DM: CRoss-Industry Standard Process for Data Mining

A methodology covering the typical phases of an analytical project, the tasks involved in each phase, and an explanation of the relationships between these tasks.

As a process model, CRISP-DM provides an overview of the data mining life cycle.

CRISP-DM was conceived in 1996 and first published in 1999 by SPSS, NCR and Daimler-Benz (Mercedes), and is reported as the leading methodology for data mining and predictive analytics projects.

In 2015 IBM released a new implementation method for data mining and predictive analytics projects, the Analytics Solutions Unified Method for Data Mining & Predictive Analytics (ASUM-DM), which is a refined and extended CRISP-DM. But it is a little too complex to start with…

Page 9: Hochschule Düsseldorf Fachbereich

Introduction

SS 2016 - IT Applications in Business Analytics - 2. CRISP DM 9

“The process of knowledge discovery in data mining has to be reproducible and reliable, especially for people who have no background in data science.”

Page 11: Hochschule Düsseldorf Fachbereich

CRISP DM – Current Industry Standard

SS 2016 - IT Applications in Business Analytics - 2. CRISP DM 11

Source: http://www.kdnuggets.com/2014/10/crisp-dm-top-methodology-analytics-data-mining-data-science-projects.html

Other approaches:

KDD: “Knowledge Discovery in Databases”, developed by Usama Fayyad (Microsoft Research, 1996), describes methods and technologies to assist humans in extracting useful information (knowledge) from the rapidly growing volumes of digital data.

SEMMA: an acronym that stands for Sample, Explore, Modify, Model and Assess. It is a list of sequential steps developed by SAS Institute in 2009.

Criticism: SEMMA mainly focuses on the modeling tasks of data mining projects, leaving the business aspects out. It is focused on the usage of SAS products.

Page 12: Hochschule Düsseldorf Fachbereich

CRISP DM – Objectives and Benefits

SS 2016 - IT Applications in Business Analytics - 2. CRISP DM 12

Ensure quality of knowledge discovery project results
Reduce skills required for knowledge discovery
Reduce costs and time
General purpose (i.e., stable across varying applications)
Robust (i.e., insensitive to changes in the environment)
Tool and technique independent
Tool supportable
Support documentation of projects
Capture experience for reuse
Support knowledge transfer and training

Page 13: Hochschule Düsseldorf Fachbereich

CRISP DM – Phases and Tasks

SS 2016 - IT Applications in Business Analytics - 2. CRISP DM 13

Business Understanding
  Determine Business Objectives: Background. Business Objectives. Business Success Criteria.
  Assess Situation: Inventory of Resources, Requirements, Assumptions and Constraints. Risks and Contingencies. Terminology. Costs and Benefits.
  Determine Data Mining Goals: Data Mining Goals. Data Mining Success Criteria.
  Produce Project Plan: Project Plan. Initial Assessment of Tools and Techniques.

Data Understanding
  Collect Initial Data: Initial Data Collection Report.
  Describe Data: Data Description Report.
  Explore Data: Data Exploration Report.
  Verify Data Quality: Data Quality Report.

Data Preparation
  Select Data: Rationale for Inclusion / Exclusion.
  Clean Data: Data Cleaning Report.
  Construct Data: Derived Attributes. Generated Records.
  Integrate Data: Merged Data.
  Format Data: Reformatted Data.
  Dataset: Dataset Description.

Modelling
  Select Modelling Technique: Modelling Technique. Modelling Assumptions.
  Generate Test Design: Test Design.
  Build Model: Parameter Settings. Models. Model Description.
  Assess Model: Model Assessment. Revised Parameter Settings.

Evaluation
  Evaluate Results: Assessment of Data Mining Results w.r.t. Business Success Criteria. Approved Models.
  Review Process: Review of Process.
  Determine Next Steps: List of Possible Actions. Decision.

Deployment
  Plan Deployment: Deployment Plan.
  Plan Monitoring and Maintenance: Monitoring and Maintenance Plan.
  Produce Final Report: Final Report. Final Presentation.
  Review Project: Experience Documentation.

Page 14: Hochschule Düsseldorf Fachbereich

CRISP DM – Objectives and Benefits

Typical Effort per CRISP DM Phase in %

SS 2016 - IT Applications in Business Analytics - 2. CRISP DM 14

[Bar chart. Y-axis: Effort (10%, 20%, 30%). X-axis: Business Understanding, Data Understanding, Data Preparation, Modelling, Evaluation, Deployment.]

Page 15: Hochschule Düsseldorf Fachbereich

CRISP DM – 1 Business Understanding

SS 2016 - IT Applications in Business Analytics - 2. CRISP DM 15

1.1 Determine Business Objectives: Background. Business Objectives. Business Success Criteria.

1.2 Assess Situation: Inventory of Resources, Requirements, Assumptions and Constraints. Risks and Contingencies. Terminology. Costs and Benefits.

1.3 Determine Data Mining Goals: Data Mining Goals. Data Mining Success Criteria.

1.4 Produce Project Plan: Project Plan. Initial Assessment of Tools and Techniques.

Page 16: Hochschule Düsseldorf Fachbereich

CRISP DM – 2 Data Understanding

SS 2016 - IT Applications in Business Analytics - 2. CRISP DM 16

2.1 Collect Initial Data: Initial Data Collection Report.

2.2 Describe Data: Data Description Report.

2.3 Explore Data: Data Exploration Report.

2.4 Verify Data Quality: Data Quality Report.

Page 17: Hochschule Düsseldorf Fachbereich

CRISP DM – 3 Data Preparation

SS 2016 - IT Applications in Business Analytics - 2. CRISP DM 17

3.1 Select Data: Rationale for Inclusion / Exclusion.

3.2 Clean Data: Data Cleaning Report.

3.3 Construct Data: Derived Attributes. Generated Records.

3.4 Integrate Data: Merged Data.

3.5 Format Data: Reformatted Data.

3.6 Dataset: Dataset Description.
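
The preparation tasks listed on this slide can be sketched in base R. This is only a minimal illustration on two made-up tables: the data frames `customers` and `orders`, their columns, and the choice of median imputation are assumptions for the example and not part of CRISP-DM itself.

```r
# Hypothetical raw inputs (made up for this sketch)
customers <- data.frame(
  id     = 1:4,
  age    = c(34, NA, 51, 28),
  region = c("north", "south", "south", "north"),
  note   = c("a", "b", "c", "d")
)
orders <- data.frame(id = c(1L, 1L, 2L, 4L), amount = c(20, 35, 15, 50))

# 3.1 Select data: keep only the attributes needed for the analysis
cust <- customers[, c("id", "age", "region")]

# 3.2 Clean data: replace the missing age by the median age
cust$age[is.na(cust$age)] <- median(cust$age, na.rm = TRUE)

# 3.3 Construct data: derive the total order amount per customer
totals <- aggregate(amount ~ id, data = orders, FUN = sum)

# 3.4 Integrate data: merge both tables, keeping customers without orders
dataset <- merge(cust, totals, by = "id", all.x = TRUE)
dataset$amount[is.na(dataset$amount)] <- 0

# 3.5 Format data: store the region as a factor for later modelling
dataset$region <- factor(dataset$region)

# 3.6 Dataset: inspect the structure of the final dataset
str(dataset)
```

Each step maps to one task of the phase, which also makes it easy to document what was done and why.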

Page 18: Hochschule Düsseldorf Fachbereich

CRISP DM – 4 Modelling

SS 2016 - IT Applications in Business Analytics - 2. CRISP DM 18

4.1 Select Modelling Technique: Modelling Technique. Modelling Assumptions.

4.2 Generate Test Design: Test Design.

4.3 Build Model: Parameter Settings. Models. Model Description.

4.4 Assess Model: Model Assessment. Revised Parameter Settings.
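
As a rough illustration of tasks 4.1 to 4.4, the following base R sketch fits a linear regression on simulated data. The simulated relationship, the 70/30 split and the use of RMSE as the assessment measure are assumptions chosen for the example; a real project would take them from the test design.

```r
set.seed(123)

# Simulated data: y depends linearly on x plus noise (assumption for this sketch)
n <- 200
x <- runif(n, 0, 10)
y <- 3 + 2 * x + rnorm(n, sd = 1)
d <- data.frame(x = x, y = y)

# 4.2 Generate test design: 70% training data, 30% test data
train_idx <- sample(n, size = round(0.7 * n))
train <- d[train_idx, ]
test  <- d[-train_idx, ]

# 4.1 / 4.3 Select technique and build model: ordinary linear regression
fit <- lm(y ~ x, data = train)

# 4.4 Assess model: root mean squared error on the held-out test data
pred <- predict(fit, newdata = test)
rmse <- sqrt(mean((test$y - pred)^2))

print(coef(fit))
print(rmse)
```

Assessing the model on data it has not seen during training is what makes the assessment meaningful for the evaluation phase that follows.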

Page 19: Hochschule Düsseldorf Fachbereich

CRISP DM – 5 Evaluation

SS 2016 - IT Applications in Business Analytics - 2. CRISP DM 19

5.1 Evaluate Results: Assessment of Data Mining Results with respect to Business Success Criteria. Approved Models.

5.2 Review Process: Review of Process.

5.3 Determine Next Steps: List of Possible Actions. Decision.

Page 20: Hochschule Düsseldorf Fachbereich

CRISP DM – 6 Deployment

SS 2016 - IT Applications in Business Analytics - 2. CRISP DM 20

6.1 Plan Deployment: Deployment Plan.

6.2 Plan Monitoring and Maintenance: Monitoring and Maintenance Plan.

6.3 Produce Final Report: Final Report. Final Presentation.

6.4 Review Project: Experience Documentation.

Page 21: Hochschule Düsseldorf Fachbereich

Tools

SS 2016 - IT Applications in Business Analytics - 14. Wrap Up 21

Page 22: Hochschule Düsseldorf Fachbereich

Basic Concept – SQL

SS 2016 - IT Applications in Business Analytics - 3. Tools, Technologies and Data Sources 22

SQL (Structured Query Language) is a special-purpose programming language designed for managing data held in a relational database management system (RDBMS).

SQL is based upon relational algebra and tuple relational calculus.

https://en.wikipedia.org/wiki/Relational_algebra

SQL defines 3 language aspects:

data definition language (DDL) … to define database schemas

data manipulation language (DML) … to select, insert, delete and update data

data control language (DCL) … to control access rights in databases

Page 23: Hochschule Düsseldorf Fachbereich

Database System Classification

SS 2016 - IT Applications in Business Analytics - 3. Tools, Technologies and Data Sources 23

SQL Databases

Predefined schema
Standard definition and interface language
Tight consistency
Well-defined semantics

NoSQL Databases

No predefined schema
Per-product definition and interface language
Getting an answer quickly is more important than getting a correct answer

Page 24: Hochschule Düsseldorf Fachbereich

Big Data Framework

SS 2016 - IT Applications in Business Analytics - 3. Tools, Technologies and Data Sources 24

A pre-customized and pre-compiled collection of tools and technologies required for big data processing based on Hadoop.

Page 25: Hochschule Düsseldorf Fachbereich

The R Programming Language

SS 2016 - IT Applications in Business Analytics - 4. The R Programming Language 25

The R system contains two major components:

1. Base System – contains the R language software and the high priority add-on packages.

2. User contributed add-on packages.

R includes…

an effective data handling and storage facility,

a suite of operators for calculations on arrays, in particular matrices,

a large collection of intermediate tools for data analysis,

graphical facilities for data analysis and display either on-screen or on hardcopy, and

a simple and effective programming language which includes conditionals, loops, user-defined recursive functions and input and output facilities.

Page 26: Hochschule Düsseldorf Fachbereich

RStudio

SS 2016 - IT Applications in Business Analytics - 4. The R Programming Language 26

Native R is a console application; RStudio is a wrapper for convenience…

Page 27: Hochschule Düsseldorf Fachbereich

R Basics (http://www.ats.ucla.edu/stat/r/seminars/intro.htm)

SS 2016 - IT Applications in Business Analytics - 4. The R Programming Language 27

Variables

# Declaration and usage of variables
A <- 2
B <- 3
x <- seq(0, 2*pi, 0.1)
y <- sin(x)

Simple Mathematics

# Attention: R is case sensitive, so sin() works but Sin() does not exist
1 + 2
sin(2*3)

Charting

# Plot y against x with a title, a subtitle and axis labels
plot(x, y, main="Sinus Plot",
     sub="made with R",
     xlab="x-axis",
     ylab="y-axis")

Page 28: Hochschule Düsseldorf Fachbereich

R Basics – Install and use packages (http://www.ats.ucla.edu/stat/r/seminars/intro.htm)

SS 2016 - IT Applications in Business Analytics - 4. The R Programming Language 28

Using Packages

Installing Packages (remove the #)

Automatic Load and (if required) Installation of a Package
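
The three steps named on this slide could look roughly as follows. This is a sketch, not the code shown in the lecture: the helper name `load_or_install` is hypothetical, and the base package `stats` is used in the call only so that the example runs without network access.

```r
# Using packages: attach an installed package to the current session
library(stats)

# Installing packages (remove the # to run; needs internet access)
# install.packages("ggplot2")

# Automatic load and (if required) installation of a package.
# load_or_install is a hypothetical helper written for this sketch.
load_or_install <- function(pkg) {
  # requireNamespace() returns FALSE if the package is not installed
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # character.only = TRUE lets library() accept the name as a string
  library(pkg, character.only = TRUE)
  invisible(TRUE)
}

load_or_install("stats")
```

Checking with `requireNamespace()` first avoids reinstalling a package on every run of a script.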

Page 29: Hochschule Düsseldorf Fachbereich

R Basics (http://www.ats.ucla.edu/stat/r/seminars/intro.htm)

SS 2016 - IT Applications in Business Analytics - 4. The R Programming Language 29

Loading Data

Assign Data to Objects

Accessing Data
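
A minimal sketch of these three steps in base R follows. The small CSV text and its column names are made up so that the example runs offline; with a real file, a path or URL would be passed to `read.csv()` instead of the `text` argument.

```r
# Loading data: read CSV content (here from a string instead of a file)
csv_text <- "id,read,write\n1,57,52\n2,68,59\n3,44,33"

# Assign data to objects: the result is stored in the data frame d
d <- read.csv(text = csv_text)

# Accessing data
reads     <- d$read                           # one column as a vector
first_row <- d[1, ]                           # first row, all columns
cell      <- d[2, "write"]                    # single cell by row and column name
high      <- d[d$read >= 50, c("id", "read")] # rows matching a condition

print(high)
```

Square brackets always take `[rows, columns]`; leaving one side empty selects everything along that dimension.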

Page 30: Hochschule Düsseldorf Fachbereich

R Basics (http://www.ats.ucla.edu/stat/r/seminars/intro.htm)

SS 2016 - IT Applications in Business Analytics - 4. The R Programming Language 30

Accessing Data continued / Saving Data
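
As a hedged sketch of what further data access and saving can look like in base R (the data frame and the temporary file paths are made up for the example):

```r
# Small example data frame (assumption for this sketch)
d <- data.frame(id = 1:3, read = c(57, 68, 44), write = c(52, 59, 33))

# Accessing data continued
head(d, 2)                         # first two rows
sorted <- d[order(-d$read), ]      # sort by read, descending
d$total <- d$read + d$write        # add a derived column

# Saving data as CSV and reading it back
csv_path <- tempfile(fileext = ".csv")
write.csv(d, csv_path, row.names = FALSE)
d2 <- read.csv(csv_path)

# Saving data in R's native format keeps column types exactly
rds_path <- tempfile(fileext = ".rds")
saveRDS(d, rds_path)
d3 <- readRDS(rds_path)
```

CSV is the better choice for exchanging data with other tools, while RDS preserves R-specific details such as factors.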

Page 31: Hochschule Düsseldorf Fachbereich

R Basics (http://www.ats.ucla.edu/stat/r/seminars/intro.htm)

SS 2016 - IT Applications in Business Analytics - 4. The R Programming Language 31

Simple Data Analysis

# filter() used below comes from the dplyr package
library(dplyr)

d <- read.csv("http://www.ats.ucla.edu/stat/data/hsb2.csv")

# return the number of observations (rows) and variables (columns) in d
dim(d)

# get the structure of d, including the class (type) of all variables
str(d)

# return the distributional summaries of variables in the dataset
summary(d)

# return a summary of the dataset for all rows where variable 'read' >= 60
summary(filter(d, read >= 60))

Page 32: Hochschule Düsseldorf Fachbereich

R Basics (http://www.ats.ucla.edu/stat/r/seminars/intro.htm)

SS 2016 - IT Applications in Business Analytics - 4. The R Programming Language 32

Charting

# load the lattice charting package

require(lattice)

# draw a simple scatter plot

xyplot(read ~ write, data = d)

# conditioned scatter plot

xyplot(read ~ write | prog, data = d)

# box and whisker plots

bwplot(read ~ factor(prog), data = d)

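More Charting (ggplot2 package)

# load ggplot2 for ggplot() and GGally for ggpairs()
library(ggplot2)
library(GGally)

# draw a kernel density plot
ggplot(d, aes(x = write)) + geom_density()

# draw a kernel density plot per prog
# (the "+" must end a line; a line starting with "+" is not appended to the plot)
ggplot(d, aes(x = write)) + geom_density() +
  facet_wrap(~ prog)

# inspect univariate and bivariate
# relationships using a scatter plot matrix
ggpairs(d[, 7:11])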

Page 33: Hochschule Düsseldorf Fachbereich

Analytics Data Processing – Sample: Knime

SS 2016 - IT Applications in Business Analytics - 3. Tools, Technologies and Data Sources 33

www.knime.org

Page 34: Hochschule Düsseldorf Fachbereich

Knime - Essential Nodes

SS 2016 - IT Applications in Business Analytics - 5. Introduction to KNIME 34

Data Preparation

The input table is split into two partitions (i.e. row-wise), e.g. train and test data. The two partitions are available at the two output ports.

This node helps handle missing values found in cells of the input table.

The node allows for row / column filtering according to certain criteria.
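
For comparison, the same three preparation steps can be approximated in base R. This is only a rough analogue of what the KNIME nodes do, not their implementation; the example table, the 70/30 split and mean imputation are assumptions for the sketch.

```r
set.seed(42)

# Example input table with one missing value (made up for this sketch)
d <- data.frame(
  x = c(1, NA, 3, 4, 5, 6, 7, 8, 9, 10),
  g = rep(c("a", "b"), 5)
)

# Missing values: replace NA in column x by the column mean
d$x[is.na(d$x)] <- mean(d$x, na.rm = TRUE)

# Partitioning: split row-wise into 70% train and 30% test data
train_idx <- sample(nrow(d), size = round(0.7 * nrow(d)))
train <- d[train_idx, ]
test  <- d[-train_idx, ]

# Row filter: keep only rows that match a criterion
big <- d[d$x > 5, ]

# Column filter: keep only selected columns
only_x <- d[, "x", drop = FALSE]
```

Handling missing values before partitioning, as done here, is a simplification; in a strict test design the imputation value would be computed on the training partition only.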

Page 35: Hochschule Düsseldorf Fachbereich

Knime - Essential Nodes

SS 2016 - IT Applications in Business Analytics - 5. Introduction to KNIME 35

First Statistical Data Analysis

Calculates statistical moments such as minimum, maximum, mean, standard deviation, variance, median, overall sum, number of missing values and row count across all numeric columns, and counts all nominal values together with their occurrences.

Creates a cross table (also referred to as a contingency table or cross tab). It can be used to analyze the relation of two columns with categorical data and displays the frequency distribution of the categorical variables in a table.
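
The same first look at a dataset can be sketched in base R. The example data frame and its columns are made up; the code is an analogue of the two node descriptions above, not the KNIME implementation.

```r
# Example data (assumption for this sketch)
d <- data.frame(
  score  = c(52, 61, 47, 70, 58, NA),
  gender = c("f", "m", "f", "m", "f", "m"),
  prog   = c("academic", "general", "academic",
             "academic", "general", "general")
)

# Statistics for a numeric column, ignoring missing values
stats <- c(
  min     = min(d$score, na.rm = TRUE),
  max     = max(d$score, na.rm = TRUE),
  mean    = mean(d$score, na.rm = TRUE),
  sd      = sd(d$score, na.rm = TRUE),
  var     = var(d$score, na.rm = TRUE),
  median  = median(d$score, na.rm = TRUE),
  sum     = sum(d$score, na.rm = TRUE),
  missing = sum(is.na(d$score)),
  rows    = nrow(d)
)
print(stats)

# Occurrences of the values of a nominal column
print(table(d$prog))

# Cross table (contingency table) of two categorical columns
ct <- table(d$gender, d$prog)
print(ct)
```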

Page 36: Hochschule Düsseldorf Fachbereich

Knime – Data Mining Cheating…

SS 2016 - IT Applications in Business Analytics - 6. Analytical Use Case 1 36

Algorithm / Pros / Cons / Good at

Linear regression
Pros: Very fast (runs in constant time). Easy to understand the model. Less prone to overfitting.
Cons: Unable to model complex relationships. Unable to capture nonlinear relationships without first transforming the inputs.
Good at: The first look at a dataset. Numerical data with lots of features.

Decision trees
Pros: Fast. Robust to noise and missing values. Accurate.
Cons: Complex trees are hard to interpret. Duplication within the same sub-tree is possible.
Good at: Star classification. Medical diagnosis. Credit risk analysis.

Neural networks
Pros: Extremely powerful. Can model even very complex relationships. No need to understand the underlying data. Almost works by “magic”.
Cons: Prone to overfitting. Long training time. Requires significant computing power for large datasets. Model is essentially unreadable.
Good at: Images. Video. “Human-intelligence” type tasks like driving or flying. Robotics.

Support Vector Machines
Pros: Can model complex, nonlinear relationships. Robust to noise (because they maximize margins).
Cons: Need to select a good kernel function. Model parameters are difficult to interpret. Sometimes numerical stability problems. Requires significant memory and processing power.
Good at: Classifying proteins. Text classification. Image classification. Handwriting recognition.

K-Nearest Neighbors
Pros: Simple. Powerful. No training involved (“lazy”). Naturally handles multiclass classification and regression.
Cons: Expensive and slow to predict new instances. Must define a meaningful distance function. Performs poorly on high-dimensionality datasets.
Good at: Low-dimensional datasets. Computer security: intrusion detection. Fault detection in semiconductor manufacturing. Video content retrieval. Gene expression. Protein-protein interaction.

Page 37: Hochschule Düsseldorf Fachbereich

Knime – Data Mining Cheating…

SS 2016 - IT Applications in Business Analytics - 6. Analytical Use Case 1 37

http://www.kdnuggets.com/2015/07/good-data-science-machine-learning-cheat-sheets.html

https://github.com/soulmachine/machine-learning-cheat-sheet/raw/master/machine-learning-cheat-sheet.pdf

https://azure.microsoft.com/en-us/documentation/articles/machine-learning-algorithm-cheat-sheet/

Page 38: Hochschule Düsseldorf Fachbereich

Time Series

SS 2016 - IT Applications in Business Analytics - 14. Wrap Up 38

Page 39: Hochschule Düsseldorf Fachbereich

Time Series

SS 2016 - IT Applications in Business Analytics - 10. Time Series 39

A time series is a sequence of often equally spaced observations in chronological order over a continuous time interval.

Examples from science and business

Meteorology: weather data, e.g. temperature, pressure, wind.

Economy and finance: economic factors, financial indexes, exchange rates.

Business: sales, production or any other business activity.

Industry: electric load, power consumption, sensors.

Medicine: physiological signals (EEG), heart rate, patient temperature.

Web: views, clicks, logs.

Page 40: Hochschule Düsseldorf Fachbereich

Time Series

SS 2016 - IT Applications in Business Analytics - 10. Time Series 40

Time-Series Decomposition

Decompose the variation of a series into 3 main parts…

A. Trend: a long-term change in the mean level, e.g. an increasing trend.

B. Seasonal effect: many time series exhibit variation which is seasonal (e.g. annual) in period. Measuring and removing such variation is called deseasonalizing the data.

C. Irregular fluctuations: after trend and seasonal (cyclic) variations have been removed from a set of data, there remains a series of residuals, which may (or may not) be completely random.

Seasonal Trend Decomposition using LOESS (STL)

STL Method, 1990: http://www.wessa.net/download/stl.pdf

Page 41: Hochschule Düsseldorf Fachbereich

Time Series in R

SS 2016 - IT Applications in Business Analytics - 10. Time Series 41

# load data

births <- scan("http://robjhyndman.com/tsdldata/data/nybirths.dat")

# convert to time series

birthstimeseries <- ts(births, frequency = 12, start = c(1946,1))

# Seasonal trend decomposition using Loess algorithm (STL)

births.stl = stl(birthstimeseries, s.window = "periodic")

# plot trend decomposition

plot(births.stl)

Seasonal Trend Decomposition using LOESS*

*LOcal regrESSion
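
To make the three parts of the decomposition visible as numbers, the following sketch extracts them from an STL fit and checks that they add up to the original series. It uses `nottem`, a monthly temperature series shipped with R, instead of the births data, purely so that the example runs without internet access; that substitution is an assumption of this sketch.

```r
# Monthly temperature series from the datasets package (frequency = 12)
fit <- stl(nottem, s.window = "periodic")

# The fitted components are stored as columns of a multivariate time series
parts     <- fit$time.series
seasonal  <- parts[, "seasonal"]
trend     <- parts[, "trend"]
remainder <- parts[, "remainder"]

# STL is an additive decomposition: seasonal + trend + remainder = data
reconstructed <- seasonal + trend + remainder
print(all.equal(as.numeric(reconstructed), as.numeric(nottem)))

# Deseasonalized series: the original data minus the seasonal effect
deseasonalized <- nottem - seasonal
```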

Page 42: Hochschule Düsseldorf Fachbereich

Time Series – First Example

SS 2016 - IT Applications in Business Analytics - 10. Time Series 42

# load data

births <- scan("http://robjhyndman.com/tsdldata/data/nybirths.dat")

# convert to time series

birthstimeseries <- ts(births, frequency = 12, start = c(1946,1))

# build seasonal ARIMA model: non-seasonal order (1,0,0), seasonal order (2,1,0) with period 12
birthsmodel <- arima(birthstimeseries, order = c(1,0,0), seasonal = list(order = c(2,1,0), period = 12))

# 24 month forecast based on the model
birthsforecast <- predict(birthsmodel, n.ahead = 24)

# calculate approximate bounds for the 95% confidence level (forecast +/- 2 standard errors)
U <- birthsforecast$pred + 2 * birthsforecast$se
L <- birthsforecast$pred - 2 * birthsforecast$se

# plot time series, prediction and confidence interval
ts.plot(birthstimeseries, birthsforecast$pred, U, L, col = c(1,2,4,4), lty = c(1,1,2,2))

# add legend to plot
legend("topleft", c("Actual", "Forecast", "Error Bounds (95% Confidence)"),
       col = c(1,2,4), lty = c(1,1,2))

Forecasting using ARIMA model

Page 43: Hochschule Düsseldorf Fachbereich

Decision Tree Learning

SS 2016 - IT Applications in Business Analytics - 6. Analytical Use Case 1 43

Page 44: Hochschule Düsseldorf Fachbereich

Classification Method Comparison

SS 2016 - IT Applications in Business Analytics - 6. Analytical Use Case 1 44

Try to understand the pattern of the data...

…by applying visual data analysis

…by applying pairwise comparison of attributes

Is your data linearly separable?

Yes: Logistic Regression, Neural Networks… be cautious with Decision Tree or Random Forest

No: Random Forest or SVM

???: Random Forest… a good balance of generalization and accuracy, and its computational cost is relatively low

But: Neural Networks can (but need not) be the best solution… it is not easy to tune them to deliver good results (many parameters).

Page 45: Hochschule Düsseldorf Fachbereich

Decision Tree

SS 2016 - IT Applications in Business Analytics - 6. Analytical Use Case 1 45

Decision Tree (partial) for Bike Sales Sample

A supervised learning method.

Purpose: Predict the value of an item (record) based on observations from other items.

If the target value is from a finite set of values, we call it a classification tree.

Leaves represent class labels (e.g. Region), whereas branches represent conjunctions of features that lead to those class labels.
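
A classification tree of this kind can be sketched in R with the `rpart` package, which is normally shipped with R as a recommended package. The bike sales data from the slide is not available here, so the built-in `iris` dataset stands in for it; the 100/50 split is likewise an assumption of the example.

```r
library(rpart)

set.seed(1)

# Split the data into 100 training and 50 test records
idx   <- sample(nrow(iris), 100)
train <- iris[idx, ]
test  <- iris[-idx, ]

# Build a classification tree: the target Species has a finite set of values
fit <- rpart(Species ~ ., data = train, method = "class")

# Printing the tree shows the branches (conditions on features)
# and the leaves (predicted class labels)
print(fit)

# Predict class labels for unseen records and measure the accuracy
pred <- predict(fit, newdata = test, type = "class")
acc  <- mean(pred == test$Species)
print(acc)
```

Reading the printed tree from the root to a leaf gives exactly the conjunction of feature conditions that leads to that leaf's class label.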

Page 46: Hochschule Düsseldorf Fachbereich

Outlier Detection

SS 2016 - IT Applications in Business Analytics - 14. Wrap Up 46

Page 47: Hochschule Düsseldorf Fachbereich

Outliers – Where are they?

SS 2016 - IT Applications in Business Analytics - 8. Outlier Detection 47

Article: Anomaly Detection with Score functions based on Nearest Neighbour Graphs

https://arxiv.org/pdf/0910.5461.pdf

Page 48: Hochschule Düsseldorf Fachbereich

Outlier Detection – Introduction

SS 2016 - IT Applications in Business Analytics - 8. Outlier Detection 48

“An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism.” D. M. Hawkins, 1980

Two reasons for outliers:

Bad data, e.g. measurement errors, typos

Correct data, e.g. random variation of data, heavy-tailed distribution of data

LOF - Local Outlier Factor

Page 49: Hochschule Düsseldorf Fachbereich

Outliers – Core Problem

SS 2016 - IT Applications in Business Analytics - 8. Outlier Detection 49

Find them…

How to detect outliers?

Keep them (or not)…

Do we need to keep them?
They are the main subject of interest (e.g. in fraud detection).
They are an integral part of the statistical case.

Do we need to remove them?
For more robust statistics.
For clean data (remove bad data).

Treat them…

What action needs to be done? Business purpose >>> outlier treatment

Page 50: Hochschule Düsseldorf Fachbereich

Outliers – Core Problem

SS 2016 - IT Applications in Business Analytics - 8. Outlier Detection 50

Outlier Labeling

Flag potential outliers for further investigation (i.e., are the potential outliers erroneous data, indicative of an inappropriate distributional model, and so on).

Outlier Accommodation

Use robust statistical techniques that will not be unduly affected by outliers. That is, if we cannot determine that potential outliers are erroneous observations, do we need to modify our statistical analysis to more appropriately account for these observations?

Outlier Identification

Formally test whether observations are outliers.

Boris Iglewicz and David Hoaglin (1993), "Volume 16: How to Detect and Handle Outliers", The ASQC Basic References in Quality Control: Statistical Techniques, Edward F. Mykytka, Ph.D., Editor.
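
Outlier labeling can be sketched in base R with two common rules: the modified z-score based on the median absolute deviation (with the cut-off 3.5 recommended by Iglewicz and Hoaglin) and the boxplot rule. The measurement vector is made up for the example, and both rules only flag candidates for further investigation; they do not decide whether a value is bad data.

```r
# Example measurements with one suspicious value (made up for this sketch)
x <- c(9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 15.0)

# Modified z-score: uses median and MAD, which are robust against outliers
med     <- median(x)
mad_raw <- mad(x, constant = 1)          # unscaled median absolute deviation
mz      <- 0.6745 * (x - med) / mad_raw
flag_mz <- abs(mz) > 3.5                 # label potential outliers

print(which(flag_mz))

# Boxplot rule: values more than 1.5 * IQR beyond the hinges
out_box <- boxplot.stats(x)$out
print(out_box)

# Outlier accommodation: the median is barely affected, the mean is
print(c(mean = mean(x), median = median(x)))
```

Comparing mean and median at the end illustrates why robust statistics are an alternative to removing the flagged values.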

Page 51: Hochschule Düsseldorf Fachbereich

SS 2016 - IT Applications in Business Analytics - 8. Outlier Detection 51

Excursus – PCA

Analysis of environmental controls on tsunami deposit texture

a) PCA loading plot of variables along components 1 and 2 (accounting for 58% of total variance), showing the spatial relationship of the variables along these dimensions.

b) Score plot, showing individual data points plotted in coordinate space along components 1 and 2
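
The quantities named in the two captions (loadings, scores and the share of variance covered by components 1 and 2) can be reproduced in base R with `prcomp()`. The tsunami data is not available here, so the built-in `USArrests` dataset is used as a stand-in; that choice is an assumption of this sketch.

```r
# PCA on standardized variables (50 observations, 4 variables)
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Share of the total variance explained by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))

# Variance accounted for by components 1 and 2 together
print(sum(var_explained[1:2]))

# Loadings: how each variable contributes to components 1 and 2
loadings <- pca$rotation[, 1:2]
print(loadings)

# Scores: coordinates of each observation along components 1 and 2
scores <- pca$x[, 1:2]
print(head(scores))

# A combined loading and score plot can be drawn with:
# biplot(pca)
```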

Page 52: Hochschule Düsseldorf Fachbereich

Other Topics

SS 2016 - IT Applications in Business Analytics - 6. Analytical Use Case 1 52

Page 53: Hochschule Düsseldorf Fachbereich

Information Gathering

SS 2016 - IT Applications in Business Analytics - 12. Data Acquisition 53

Page 54: Hochschule Düsseldorf Fachbereich

S.U.C.C.E.S.S.

SS 2016 - IT Applications in Business Analytics - 13. Information Design 54

SAY Deliver messages: Reports and presentations serve to convey messages to readers and listeners.

UNIFY Standardize content: Reports and presentations are more easily understood when the content displayed adheres to a uniform concept of meaning.

CONDENSE Concentrate information: Reports and presentations are better understood when the contents have a high level of information density.

CHECK Ensure quality: Reports and presentations are credible when the conveyed content is based on correct, appropriate, and current data.

ENABLE Implement concept: Organizational, personnel-related, and technical requirements must be met in order to implement the rules.

SIMPLIFY Avoid complication: Reports and presentations are better understood when noise and redundancy are avoided.

STRUCTURE Group content: Reports and presentations should adhere to the requirements for homogeneous, mutually exclusive, and exhaustive structures.

Source: http://www.hichert.com

Page 55: Hochschule Düsseldorf Fachbereich

S.U.C.C.E.S.S.

SS 2016 - IT Applications in Business Analytics - 13. Information Design 55

Page 56: Hochschule Düsseldorf Fachbereich

#TheEnd

SS 2016 - IT Applications in Business Analytics - 14. Wrap Up 56

Page 57: Hochschule Düsseldorf Fachbereich

Any Questions?

SS 2016 - IT Applications in Business Analytics - 14. Wrap Up 57