
HSD Hochschule Düsseldorf

University of Applied Sciences

Fachbereich Wirtschaftswissenschaften

Faculty of Business Studies

IT Applications in Business Analytics

Business Analytics (M.Sc.)

IT in Business Analytics

SS2016 / Lecture 14 – Wrap Up

Thomas Zeutschler

SS 2016 - IT Applications in Business Analytics - 14. Wrap Up 1


Thomas Zeutschler

Associate Lecturer

Let’s get started…




Targets of Module and Lectures

SS 2016 - IT Applications in Business Analytics - 1. Introduction 3

German

“Die Studierenden erlernen die Anwendung praxisrelevanter IT-Werkzeuge (für Business Analytics) anhand von Fallstudien.”

English

“Students will learn to apply analytical tools on business problems.”

American English

“We’ll try to make you a Bruce Willis in Analytics.”



Scope of Module and Lectures


Advanced Analytics

“Advanced Analytics is the autonomous or semi-autonomous examination of data or content using sophisticated techniques and tools, typically beyond those of traditional business intelligence (BI), to discover deeper insights, make predictions, or generate recommendations.”

Source: http://www.gartner.com/it-glossary/



In Scope / Out of Scope


Data Science

Data Mining, Text Mining

Predictive Analytics, Simulation, Machine Learning

Database Technologies

Information Retrieval

Data Analysis

Text Analysis, Semantic Web, XML

Data Warehouse, Data Mart, ETL

In Memory Technologies

Reporting, OLAP

Data and Decision Modelling

Data Visualization

Data Quality Management, Data Protection

Specific Business Applications ► Case Studies



Sequence of Lectures


1   Introduction (1st April 2016)
2   Methodology and process model for analytics (CRISP DM)
3   Tools, technologies and data sources
4   The R Programming Language
5   KNIME
6   Case Study 1
9   Case Study 2
12  Case Study 3
15  Wrap Up (8th July 2016)

Legend: Theory, Tools Training, Hands On Case Studies



CRISP DM

SS 2016 - IT Applications in Business Analytics - 2. CRISP DM 7



The Data Mining Process


CRISP-DM: CRoss-Industry Standard Process for Data Mining

A methodology covering the typical phases of an analytical project, the tasks involved with each phase, and an explanation of the relationships between these tasks.

As a process model, CRISP-DM provides an overview of the data mining life cycle.

CRISP-DM was conceived in 1996 and first published in 1999 by SPSS, NCR and Daimler (Mercedes), and is reported as the leading methodology for data mining/predictive analytics projects.

In 2015, IBM released a new implementation method for data mining/predictive analytics projects called Analytics Solutions Unified Method for Data Mining & Predictive Analytics (ASUM-DM), which is a refined and extended CRISP-DM. But it’s a little bit too complex to start with…



Introduction


“The process of knowledge discovery in data mining has to be reproducible and reliable. Especially for people who have no background in data science.”





CRISP DM – Current Industry Standard


Source: http://www.kdnuggets.com/2014/10/crisp-dm-top-methodology-analytics-data-mining-data-science-projects.html

Other approaches:

KDD: “Knowledge Discovery in Databases”, developed by Usama Fayyad (Microsoft Research, 1996), describes methods and technologies to assist humans in extracting useful information (knowledge) from the rapidly growing volumes of digital data.

SEMMA: SEMMA is an acronym that stands for Sample, Explore, Modify, Model and Assess. It is a list of sequential steps developed by SAS Institute in 2009.

Criticism: SEMMA mainly focuses on the modeling tasks of data mining projects, leaving the business aspects out. It is focused on the usage of SAS products.



CRISP DM – Objectives and Benefits

Ensure quality of knowledge discovery project results

Reduce skills required for knowledge discovery

Reduce costs and time

General purpose (i.e., stable across varying applications)

Robust (i.e., insensitive to changes in the environment)

Tool and technique independent

Tool supportable


Support documentation of projects

Capture experience for reuse

Support knowledge transfer and training



CRISP DM – Phases and Tasks


Business Understanding
  Determine Business Objectives: Background. Business Objectives. Business Success Criteria.
  Assess Situation: Inventory of Resources. Requirements, Assumptions and Constraints. Risks and Contingencies. Terminology. Costs and Benefits.
  Determine Data Mining Goals: Data Mining Goals. Data Mining Success Criteria.
  Produce Project Plan: Project Plan. Initial Assessment of Tools and Techniques.

Data Understanding
  Collect Initial Data: Initial Data Collection Report.
  Describe Data: Data Description Report.
  Explore Data: Data Exploration Report.
  Verify Data Quality: Data Quality Report.

Data Preparation
  Select Data: Rationale for Inclusion/Exclusion.
  Clean Data: Data Cleaning Report.
  Construct Data: Derived Attributes. Generated Records.
  Integrate Data: Merged Data.
  Format Data: Reformatted Data.
  Dataset: Dataset Description.

Modelling
  Select Modelling Technique: Modelling Technique. Modelling Assumptions.
  Generate Test Design: Test Design.
  Build Model: Parameter Settings. Models. Model Description.
  Assess Model: Model Assessment. Revised Parameter Settings.

Evaluation
  Evaluate Results: Assessment of Data Mining Results w.r.t. Business Success Criteria. Approved Models.
  Review Process: Review of Process.
  Determine Next Steps: List of Possible Actions. Decision.

Deployment
  Plan Deployment: Deployment Plan.
  Plan Monitoring and Maintenance: Monitoring and Maintenance Plan.
  Produce Final Report: Final Report. Final Presentation.
  Review Project: Experience Documentation.



CRISP DM – Objectives and Benefits

Typical Effort per CRISP DM Phase in %

[Figure: bar chart of effort per phase (Business Understanding, Data Understanding, Data Preparation, Modelling, Evaluation, Deployment); effort axis marked at 10%, 20%, 30%]



CRISP DM – 1 Business Understanding


1.1 Determine Business Objectives
Background.
Business Objectives.
Business Success Criteria.

1.2 Assess Situation
Inventory of Resources.
Requirements, Assumptions and Constraints.
Risks and Contingencies.
Terminology.
Costs and Benefits.

1.3 Determine Data Mining Goals
Data Mining Goals.
Data Mining Success Criteria.

1.4 Produce Project Plan
Project Plan.
Initial Assessment of Tools and Techniques.



CRISP DM – 2 Data Understanding


2.1 Collect Initial Data
Initial Data Collection Report.

2.2 Describe Data
Data Description Report.

2.3 Explore Data
Data Exploration Report.

2.4 Verify Data Quality
Data Quality Report.



CRISP DM – 3 Data Preparation


3.1 Select Data
Rationale for Inclusion / Exclusion.

3.2 Clean Data
Data Cleaning Report.

3.3 Construct Data
Derived Attributes.
Generated Records.

3.4 Integrate Data
Merged Data.

3.5 Format Data
Reformatted Data.

3.6 Dataset
Dataset Description.



CRISP DM – 4 Modelling


4.1 Select Modelling Technique
Modelling Technique.
Modelling Assumptions.

4.2 Generate Test Design
Test Design.

4.3 Build Model
Parameter Settings.
Models.
Model Description.

4.4 Assess Model
Model Assessment.
Revised Parameter Settings.



CRISP DM – 5 Evaluation


5.1 Evaluate Results
Assessment of Data Mining Results with respect to Business Success Criteria.
Approved Models.

5.2 Review Process
Review of Process.

5.3 Determine Next Steps
List of Possible Actions.
Decision.



CRISP DM – 6 Deployment


6.1 Plan Deployment
Deployment Plan.

6.2 Plan Monitoring and Maintenance
Monitoring and Maintenance Plan.

6.3 Produce Final Report
Final Report.
Final Presentation.

6.4 Review Project
Experience Documentation.



Tools




Basic Concept – SQL

SS 2016 - IT Applications in Business Analytics - 3. Tools, Technologies and Data Sources 22

SQL: Structured Query Language is a special-purpose programming language designed for managing data held in a relational database management system (RDBMS).

SQL is based upon relational algebra and tuple relational calculus.

https://en.wikipedia.org/wiki/Relational_algebra

SQL defines 3 language aspects:

Data definition language (DDL) …to define database schemas

Data manipulation language (DML) …to select, insert, delete and update data

Data control language (DCL) …to control access rights in databases



Database System Classification

SS 2016 - IT Applications in Business Analytics - 3. Tools, Technologies and Data Sources

SQL Databases

Predefined schema

Standard definition and interface language

Tight consistency

Well defined semantics

NoSQL Databases

No predefined schema

Per-product definition and interface language

Getting an answer quickly is more important than getting a correct answer



Big Data Framework


A pre-customized and pre-compiled collection of tools and technologies required for big data processing based on Hadoop.



The R Programming Language

SS 2016 - IT Applications in Business Analytics - 4. The R Programming Language 25

The R system contains two major components:

1. Base System – contains the R language software and the high priority add-on packages.

2. User contributed add-on Packages.

R includes…

an effective data handling and storage facility,

a suite of operators for calculations on arrays, in particular matrices,

a large collection of intermediate tools for data analysis,

graphical facilities for data analysis and display either on-screen or on hardcopy, and

a simple and effective programming language which includes conditionals, loops, user-defined recursive functions and input and output facilities.



RStudio


Native R is a console application; RStudio is a wrapper for convenience…



R Basics

Source: http://www.ats.ucla.edu/stat/r/seminars/intro.htm

Variables

# Declaration and usage of variables
# Attention: R is case sensitive (A and a are different names)
A <- 2
B <- 3
x <- seq(0, 2*pi, 0.1)
y <- sin(x)

Simple Mathematics

# Function names are case sensitive too: sin(), not Sin()
1 + 2
sin(2*3)

Charting

# Plot y against x with title, subtitle and axis labels
plot(x, y, main = "Sine Plot",
     sub = "made with R",
     xlab = "x-axis",
     ylab = "y-axis")



R Basics – Install and use packages

Source: http://www.ats.ucla.edu/stat/r/seminars/intro.htm


Using Packages

Installing Packages (remove the #)

Automatic Load and (if required) Installation of a Package
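The code for these three steps is not part of the extracted slide text. As a minimal sketch, assuming the lattice package as the example (it ships with standard R distributions) and a helper name `load_or_install` that is made up for illustration, the pattern might look like this:

```r
# Installing a package (remove the # to run; needs internet access)
# install.packages("lattice")

# Using a package: attach it to the current session
library(lattice)

# Automatic load and (if required) installation of a package.
# load_or_install is a hypothetical helper name, not part of R.
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  library(pkg, character.only = TRUE)
  invisible(TRUE)
}

load_or_install("lattice")
```

`requireNamespace()` only checks whether the package is available, so the install step runs at most once per machine.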





Loading Data

Assign Data to Objects

Accessing Data
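The slide content for these steps is not part of the extracted text. The following is only a sketch of what loading, assigning and accessing data can look like in base R; the inline CSV text and the name `d_small` are made-up stand-ins for a real file or URL:

```r
# Loading data: read CSV text into a data frame.
# The inline text stands in for a file path or URL (made-up sample values).
csv_text <- "id,read,write
1,57,52
2,68,59
3,44,33"
d_small <- read.csv(text = csv_text)

# Assigning data to objects
scores    <- d_small$read       # one column as a vector
first_row <- d_small[1, ]       # one row as a data frame

# Accessing data
d_small[2, "write"]             # single cell: row 2, column "write"
d_small[d_small$read >= 50, ]   # rows that satisfy a condition
head(d_small, 2)                # first two rows
```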


SS 2016 - IT Applications in Business Analytics - 4. The R Programming Language 30

Accessing Data continued / Saving Data
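The original code for this slide is not in the extracted text either. As a hedged sketch in base R, with a made-up data frame and temporary files so that nothing in the working directory is overwritten, deriving columns and saving data could look like this:

```r
# Small made-up data frame standing in for the lecture data
d_small <- data.frame(id = 1:3, read = c(57, 68, 44), write = c(52, 59, 33))

# Accessing data, continued: add a derived column and sort by it
d_small$total <- d_small$read + d_small$write
d_sorted <- d_small[order(-d_small$total), ]

# Saving data as CSV (tempfile() avoids overwriting existing files)
csv_path <- tempfile(fileext = ".csv")
write.csv(d_sorted, csv_path, row.names = FALSE)

# Saving data in R's native format and reading it back
rds_path <- tempfile(fileext = ".rds")
saveRDS(d_sorted, rds_path)
d_restored <- readRDS(rds_path)
```

CSV is the portable choice for exchange with other tools; the RDS format keeps column types and row names exactly as they were.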



Simple Data Analysis

# load dplyr, which provides filter()
library(dplyr)

d <- read.csv("http://www.ats.ucla.edu/stat/data/hsb2.csv")

# return the number of observations (rows) and variables (columns) in d
dim(d)

# get the structure of d, including the class (type) of all variables
str(d)

# return the distributional summaries of variables in the dataset
summary(d)

# return a summary of the dataset for all rows where variable 'read' >= 60
# note that filter() is in the dplyr package
summary(filter(d, read >= 60))



Charting

# load the lattice charting package
library(lattice)

# draw a simple scatter plot
xyplot(read ~ write, data = d)

# conditioned scatter plot
xyplot(read ~ write | prog, data = d)

# box and whisker plots
bwplot(read ~ factor(prog), data = d)

More Charting (ggplot2 package)

# load ggplot2 for ggplot() and GGally for ggpairs()
library(ggplot2)
library(GGally)

# draw a kernel density plot
ggplot(d, aes(x = write)) + geom_density()

# draw a kernel density plot per prog
# (the + must end the line, otherwise R stops parsing after geom_density())
ggplot(d, aes(x = write)) + geom_density() +
  facet_wrap(~ prog)

# inspect univariate and bivariate
# relationships using a scatter plot matrix
ggpairs(d[, 7:11])



Analytics Data Processing – Sample: Knime


www.knime.org



Knime - Essential Nodes

SS 2016 - IT Applications in Business Analytics - 5. Introduction to KNIME 34

Data Preparation

Partitioning: The input table is split into two partitions (i.e. row-wise), e.g. train and test data. The two partitions are available at the two output ports.

Missing Value: This node helps handle missing values found in cells of the input table.

Row Filter / Column Filter: The node allows for row / column filtering according to certain criteria.
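For comparison with the R part of the course, the three data preparation steps described above (handling missing values, filtering rows, splitting row-wise into train and test data) can be sketched in base R. The table, the 70/30 ratio and the mean imputation are assumptions chosen for illustration, not settings taken from the lecture:

```r
set.seed(42)  # make the random split reproducible

# Made-up example table with one missing value
tbl <- data.frame(
  age    = c(23, 35, NA, 41, 29, 52, 38, 47, 31, 26),
  income = c(28, 45, 39, 61, 33, 72, 50, 66, 41, 30)
)

# Missing value handling: replace missing ages by the column mean
tbl$age[is.na(tbl$age)] <- mean(tbl$age, na.rm = TRUE)

# Row filtering: keep only rows that match a criterion
tbl_filtered <- tbl[tbl$income >= 30, ]

# Partitioning: row-wise 70/30 split into train and test data
train_idx <- sample(nrow(tbl_filtered), size = floor(0.7 * nrow(tbl_filtered)))
train <- tbl_filtered[train_idx, ]
test  <- tbl_filtered[-train_idx, ]
```

In a workflow tool the same steps are wired together as nodes; in R the order of the statements plays that role.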



Knime - Essential Nodes

SS 2016 - IT Applications in Business Analytics - 5. Introduction to KNIME 35

First Statistical Data Analysis

Statistics: Calculates statistical moments such as minimum, maximum, mean, standard deviation, variance, median, overall sum, number of missing values and row count across all numeric columns, and counts all nominal values together with their occurrences.

Crosstab: Creates a cross table (also referred to as a contingency table or cross tab). It can be used to analyze the relation of two columns with categorical data and displays the frequency distribution of the categorical variables in a table.
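The same first look at a dataset can be sketched in base R. The example table below is made up; only base functions (`min`, `max`, `mean`, `sd`, `median`, `sum`, `table`) are used:

```r
# Made-up example data
tbl <- data.frame(
  region  = c("North", "South", "North", "West", "South", "North"),
  product = c("Bike", "Bike", "Helmet", "Bike", "Helmet", "Bike"),
  sales   = c(120, 95, NA, 150, 60, 110)
)

# Descriptive statistics for a numeric column (missing values are skipped)
sales_stats <- c(
  min     = min(tbl$sales, na.rm = TRUE),
  max     = max(tbl$sales, na.rm = TRUE),
  mean    = mean(tbl$sales, na.rm = TRUE),
  sd      = sd(tbl$sales, na.rm = TRUE),
  median  = median(tbl$sales, na.rm = TRUE),
  sum     = sum(tbl$sales, na.rm = TRUE),
  missing = sum(is.na(tbl$sales))
)

# Nominal values together with their occurrences
region_counts <- table(tbl$region)

# Cross table: frequency distribution of two categorical columns
cross <- table(tbl$region, tbl$product)
```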



Knime – Data Mining Cheating…

SS 2016 - IT Applications in Business Analytics - 6. Analytical Use Case 1 36

Algorithm: Pros / Cons / Good at

Linear regression
  Pros:
  - Very fast
  - Easy to understand the model
  - Less prone to overfitting
  Cons:
  - Unable to model complex relationships
  - Unable to capture nonlinear relationships without first transforming the inputs
  Good at:
  - The first look at a dataset
  - Numerical data with lots of features

Decision trees
  Pros:
  - Fast
  - Robust to noise and missing values
  - Accurate
  Cons:
  - Complex trees are hard to interpret
  - Duplication within the same sub-tree is possible
  Good at:
  - Star classification
  - Medical diagnosis
  - Credit risk analysis

Neural networks
  Pros:
  - Extremely powerful
  - Can model even very complex relationships
  - No need to understand the underlying data
  - Almost works by “magic”
  Cons:
  - Prone to overfitting
  - Long training time
  - Requires significant computing power for large datasets
  - Model is essentially unreadable
  Good at:
  - Images
  - Video
  - “Human-intelligence” type tasks like driving or flying
  - Robotics

Support Vector Machines
  Pros:
  - Can model complex, nonlinear relationships
  - Robust to noise (because they maximize margins)
  Cons:
  - Need to select a good kernel function
  - Model parameters are difficult to interpret
  - Sometimes numerical stability problems
  - Requires significant memory and processing power
  Good at:
  - Classifying proteins
  - Text classification
  - Image classification
  - Handwriting recognition

K-Nearest Neighbors
  Pros:
  - Simple
  - Powerful
  - No training involved (“lazy”)
  - Naturally handles multiclass classification and regression
  Cons:
  - Expensive and slow to predict new instances
  - Must define a meaningful distance function
  - Performs poorly on high-dimensionality datasets
  Good at:
  - Low-dimensional datasets
  - Computer security: intrusion detection
  - Fault detection in semiconductor manufacturing
  - Video content retrieval
  - Gene expression
  - Protein-protein interaction



Knime – Data Mining Cheating…


http://www.kdnuggets.com/2015/07/good-data-science-machine-learning-cheat-sheets.html

https://github.com/soulmachine/machine-learning-cheat-sheet/raw/master/machine-learning-cheat-sheet.pdf

https://azure.microsoft.com/en-us/documentation/articles/machine-learning-algorithm-cheat-sheet/



Time Series




Time Series

SS 2016 - IT Applications in Business Analytics - 10. Time Series 39

A time series is a sequence of often equally spaced observations in chronological order over a continuous time interval.

Examples from science and business

Meteorology: weather data, e.g. temperature, pressure, wind.

Economy and finance: economic factors, financial indexes, exchange rates.

Business: sales, production or any other business activity.

Industry: electric load, power consumption, sensors.

Medicine: physiological signals (EEG), heart rate, patient temperature.

Web: views, clicks, logs.



Time Series


Time-Series Decomposition

Decompose the variation of a series into 3 main parts…

A. Trend: This is a long-term change in the mean level, e.g. an increasing trend.

B. Seasonal effect: Many time series exhibit variation which is seasonal (e.g. annual) in period. The measurement and removal of such variation is called deseasonalizing the data.

C. Irregular fluctuations: After trend and cyclic variations have been removed from a set of data, there is a series of residuals, which may (or may not) be completely random.

Seasonal Trend Decomposition using LOESS (STL)

STL Method, 1990: http://www.wessa.net/download/stl.pdf



Time Series in R


Seasonal Trend Decomposition using LOESS (LOcal regrESSion)

# load data
births <- scan("http://robjhyndman.com/tsdldata/data/nybirths.dat")

# convert to a monthly time series starting in January 1946
birthstimeseries <- ts(births, frequency = 12, start = c(1946, 1))

# Seasonal trend decomposition using Loess algorithm (STL)
births.stl <- stl(birthstimeseries, s.window = "periodic")

# plot trend decomposition
plot(births.stl)



Time Series – First Example


Forecasting using an ARIMA model

# load data
births <- scan("http://robjhyndman.com/tsdldata/data/nybirths.dat")

# convert to a monthly time series starting in January 1946
birthstimeseries <- ts(births, frequency = 12, start = c(1946, 1))

# build a seasonal ARIMA model: non-seasonal part (1,0,0), seasonal part (2,1,0) with period 12
birthsmodel <- arima(birthstimeseries, order = c(1, 0, 0),
                     seasonal = list(order = c(2, 1, 0), period = 12))

# 24 month forecast based on the model
birthsforecast <- predict(birthsmodel, n.ahead = 24)

# calculate approximate bounds for the 95% confidence level (+/- 2 standard errors)
U <- birthsforecast$pred + 2 * birthsforecast$se
L <- birthsforecast$pred - 2 * birthsforecast$se

# plot time series, prediction and confidence interval
ts.plot(birthstimeseries, birthsforecast$pred, U, L, col = c(1, 2, 4, 4), lty = c(1, 1, 2, 2))

# add legend to plot
legend("topleft", c("Actual", "Forecast", "Error Bounds (95% Confidence)"),
       col = c(1, 2, 4), lty = c(1, 1, 2))



Decision Tree Learning




Classification Method Comparison


Try to understand the pattern of the data...

…by applying visual data analysis

…by applying pairwise comparison of attributes

Is your data linearly separable?

Yes: Logistic Regression, Neural Networks…be cautious with Decision Tree or Random Forest

No: Random Forest or SVM

???: Random Forest…good balance of generalization and accuracy, and its computational cost is relatively low

But: Neural Networks can (not must) be the best solution…but it’s not easy to tune them to deliver good results (many parameters).



Decision Tree


[Figure: Decision Tree (partial) for Bike Sales Sample]

A supervised learning method.

Purpose: Predict the value of an item (record) based on observations from other items.

If the target value is from a finite set of values, then we call it a classification tree.

Leaves represent class labels (e.g. Region), whereas branches represent conjunctions of features that lead to those class labels.
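A classification tree of this kind can be sketched in R. The Bike Sales data from the lecture is not part of this text, so the built-in `iris` dataset is used as a stand-in, together with the rpart package (a recommended package that ships with standard R distributions); the choice of rpart is an assumption, not necessarily the tool used in class:

```r
library(rpart)  # recommended package, normally installed together with R

# Fit a classification tree: the target (Species) comes from a finite set of values
fit <- rpart(Species ~ ., data = iris, method = "class")

# Print the tree: inner nodes are feature tests, leaves carry class labels
print(fit)

# Predict class labels and compare them with the known labels
pred <- predict(fit, newdata = iris, type = "class")
accuracy <- mean(pred == iris$Species)
```

The accuracy here is measured on the training data and is therefore optimistic; a train/test split gives a fairer estimate.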



Outlier Detection




Outliers – Where are they?

SS 2016 - IT Applications in Business Analytics - 8. Outlier Detection 47

Article: Anomaly Detection with Score functions based on Nearest Neighbour Graphs

https://arxiv.org/pdf/0910.5461.pdf




Outlier Detection – Introduction


“An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism.”

D. M. Hawkins, 1980

Two reasons for outliers:

Bad data: e.g. measurement errors, typos

Correct data: e.g. random variation of data, heavy-tailed distribution of data

[Figure: LOF - Local Outlier Factor]



Outliers – Core Problem


Find them…

How to detect outliers?

Keep them (or not)…

Do we need to keep them?
  They are the main subject of interest (e.g. in fraud detection).
  They are an integral part of the statistical case.

Do we need to remove them?
  For more robust statistics.
  For clean data (remove bad data).

Treat them…

What action needs to be done?
  Business purpose >>> outlier treatment



Outliers – Core Problem


Outlier Labeling

Flag potential outliers for further investigation

(i.e., are the potential outliers erroneous data, indicative of an

inappropriate distributional model, and so on).

Outlier Accommodation

Use robust statistical techniques that will not be unduly affected by outliers. That is, if we cannot determine that potential outliers are erroneous observations, do we need to modify our statistical analysis to more appropriately account for these observations?

Outlier Identification

Formally test whether observations are outliers.

Boris Iglewicz and David Hoaglin (1993),

"Volume 16: How to Detect and Handle Outliers",

The ASQC Basic References in Quality Control:

Statistical Techniques, Edward F. Mykytka, Ph.D., Editor.
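One labeling rule from the reference above is the modified z-score, which replaces mean and standard deviation by the median and the median absolute deviation (MAD); Iglewicz and Hoaglin suggest flagging values whose absolute modified z-score exceeds 3.5. The following is a minimal sketch in base R with made-up measurements; the function name `modified_z` is chosen here for illustration:

```r
# Modified z-score: M_i = 0.6745 * (x_i - median(x)) / MAD,
# where MAD is the median absolute deviation from the median.
# constant = 1 makes mad() return the raw (unscaled) MAD.
modified_z <- function(x) {
  0.6745 * (x - median(x)) / mad(x, constant = 1)
}

# Made-up measurements with one suspicious value
x <- c(2.1, 2.3, 1.9, 2.0, 2.2, 9.5)

scores  <- modified_z(x)
flagged <- which(abs(scores) > 3.5)  # label for investigation, do not delete

x[flagged]
```

The sketch does not handle the case where the MAD is zero (more than half of the values identical), which would need a fallback.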


SS 2016 - IT Applications in Business Analytics - 8. Outlier Detection 51

Excursus – PCA

Example: Analysis of environmental controls on tsunami deposit texture

a) PCA loading plot of variables along components 1 and 2 (accounting for 58% of total variance), showing the spatial relationship of the variables along these dimensions.

b) Score plot, showing individual data points plotted in coordinate space along components 1 and 2.
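The loadings and scores behind such plots can be computed with `prcomp()` from base R. The tsunami data is not available here, so the built-in `USArrests` dataset serves as a stand-in; this is a sketch of the technique, not a reproduction of the figure:

```r
# PCA on a built-in dataset; scaling puts all variables on a comparable footing
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Share of total variance explained by each component
var_share <- pca$sdev^2 / sum(pca$sdev^2)
sum(var_share[1:2])   # variance accounted for by components 1 and 2

# Data behind the loading plot (variables) and the score plot (observations)
loadings <- pca$rotation[, 1:2]
scores   <- pca$x[, 1:2]

# biplot() overlays both views in one chart
biplot(pca)
```

Points that lie far away from the bulk of the data in the score plot are candidates for outlier labeling.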



Other Topics




Information Gathering

SS 2016 - IT Applications in Business Analytics - 12. Data Acquisition 53



S.U.C.C.E.S.S.

SS 2016 - IT Applications in Business Analytics - 13. Information Design 54

SAY: Deliver messages. Reports and presentations serve to convey messages to readers and listeners.

UNIFY: Standardize content. Reports and presentations are more easily understood when the content displayed adheres to a uniform concept of meaning.

CONDENSE: Concentrate information. Reports and presentations are better understood when the contents have a high level of information density.

CHECK: Ensure quality. Reports and presentations are credible when the conveyed content is based on correct, appropriate, and current data.

ENABLE: Implement concept. Organizational, personnel-related, and technical requirements must be met in order to implement the rules.

SIMPLIFY: Avoid complication. Reports and presentations are better understood when noise and redundancy are avoided.

STRUCTURE: Group content. Reports and presentations should adhere to the requirements for homogeneous, mutually exclusive, and exhaustive structures.

Source: http://www.hichert.com





#TheEnd




Any Questions?

