chapter 1 introduction - kocwcontents.kocw.net/kocw/document/2014/hanyang/huhseon/1.pdfexplore,...

Hanyang University Quest Lab.

Chapter 1 Introduction

Spring, 2015
Sun Hur
Department of IME, Hanyang University

1.1 What is DM?

• Definitions of DM (data mining):

“Extracting useful information from large datasets.”

“Data mining is the process of exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns and rules.”

“Data mining is the process of discovering meaningful new correlations, patterns, and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques.”

What is Data Mining?

Data flood!

Data Mining: The process of extracting useful information from large datasets through the use of any relevant data analysis techniques developed to help people make better decisions.


1.2 Where is DM used?

• Military: accuracy of bombs
• Intelligence agencies: analyzing intercepted communications
• Security: examining network packets
• Medical: predicting cancer relapse
• Business: classifying customers (Respond? Fraud? Loan default? Service abandonment?)
• Analysis of SNS, and much more …


1.3 Origins of DM

• Classical statistics: linear regression, logistic regression, discriminant analysis, principal component analysis
- computation was difficult and data were scarce
- the same data are used both to compute an estimate and to determine its reliability


• DM:
- data and computing power are plentiful
- scale
- speed
- simplicity (in the logic of the inference, not of the algorithm)
- machine learning techniques (ex. neural nets)
- but vulnerable to the danger of overfitting: the model captures not only the structural characteristics of the data but also its chance characteristics (in addition to “signal”, “noise” is also included)


1.4 Rapid growth of DM

• 빅데이터_산업지각변동의 진원.pdf (“Big Data: The Epicenter of an Industrial Upheaval”)
• 빅데이터분석과 활용.pdf (“Big Data Analysis and Its Use”)

• Growth of data! (ex. Wal-Mart: 20 million transactions per day in 2003, a 10 TB DB); the units keep growing: peta, exa, zetta
• POS, GPS, Internet, ... (OLTP: On-line Transaction Processing)
• DW (data warehouse): a large integrated data storage facility that ties together the DSS of an enterprise (OLAP: On-line Analytical Processing; MySQL, SQL Server, DB2, Oracle DB, etc.)
• Computational power and the cost of data storage and retrieval

What can DM do?

• Supervised learning
- Classification
- Prediction
• Unsupervised learning
- Association discovery
- Clustering

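Regression

➢ The regression model for i = 1, …, n observations:

$Y_i = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon_i$

• Parameters: $\beta_0$ and $\beta_j$ for j = 1, 2, …, k
• Random error: $\varepsilon_i$ has mean 0 and variance $\sigma^2$
⇒ $Y_i$ is a random variable (r.v.)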
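The regression model can be made concrete with a small sketch for the single-predictor case (k = 1), where the least-squares estimates of the intercept and slope have a closed form. The function name `fit_simple_regression` and the data are made up for illustration; the data lie exactly on a line (error term set to zero) so the estimates can be checked by eye.

```python
def fit_simple_regression(x, y):
    """Least-squares estimates (b0, b1) for the model y = b0 + b1*x + error."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Made-up data lying exactly on y = 1 + 2x
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]
b0, b1 = fit_simple_regression(x, y)
print(b0, b1)
```

With real data the error term is not zero, so the estimates only approximate the underlying parameters.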

Classification

Learn a method for predicting the class of an instance from pre-labeled (classified) instances.

Classification Models

Classification problem

• Features (variables): width, length, lightness, etc.

[Figure: salmon and bass plotted by lightness and width]

Classification Problem

Email Spam Filtering

• One example of a rule in the filter:

If (%free > yes) & (%remove > yes) then Spam
Else E-mail
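A rule of this kind can be sketched as a tiny function. The slide does not give the cut-off values, so the thresholds below (5% each) are purely illustrative assumptions, as are the function names `word_percentage` and `classify_email`.

```python
def word_percentage(text, word):
    """Percentage of the words in `text` that equal `word` (case-insensitive)."""
    words = text.lower().split()
    if not words:
        return 0.0
    return 100.0 * words.count(word) / len(words)

def classify_email(text, free_threshold=5.0, remove_threshold=5.0):
    """Apply the two-condition rule; both thresholds are assumed values."""
    if (word_percentage(text, "free") > free_threshold
            and word_percentage(text, "remove") > remove_threshold):
        return "Spam"
    return "E-mail"

print(classify_email("free offer click remove free gift remove now"))
print(classify_email("meeting moved to friday please confirm"))
```

In a real filter the rule and its thresholds would be learned from pre-labeled messages rather than written by hand.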

Clustering

Find “natural” groupings in un-labeled data.

[Figures: Clustering (Race), Clustering (Gender), Clustering (Air Quality)]

Bioinformatics - Microarray

• Clustering problem: partition the genes into groups (clusters) based on their expression patterns.
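To illustrate partitioning genes by expression pattern, here is a minimal sketch using k-means, one common clustering technique (the slides do not name a specific method at this point). The expression profiles are hypothetical, and the first k points are used as initial centers only to keep the example deterministic.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def k_means(points, k, iterations=20):
    """Plain k-means; the first k points serve as initial centers (arbitrary choice)."""
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        labels = [min(range(k), key=lambda c: squared_distance(p, centers[c]))
                  for p in points]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Hypothetical expression profiles: 4 genes measured under 3 conditions.
profiles = [
    [1.0, 1.1, 0.9],   # gene A
    [5.0, 5.2, 4.9],   # gene B
    [0.9, 1.0, 1.2],   # gene C
    [5.1, 4.8, 5.0],   # gene D
]
labels, centers = k_means(profiles, k=2)
print(labels)
```

Genes A and C end up in one cluster and genes B and D in the other, without any labels having been supplied.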

Association Problem

Amazon.com

• The Elements of Statistical Learning (Hardcover) by T. Hastie, R. Tibshirani, J. H. Friedman

Customers who bought this item also bought:
- Principles of Data Mining (Adaptive Computation and Machine Learning) by David J. Hand
- Pattern Classification (2nd Edition) by Richard O. Duda

Association Problem

• Walmart
• Rule 1: If a customer buys the blanket, he/she buys the lamp with probability 0.6
• Customer relationship management (CRM)
- Display (location)
- Brochure
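The probability attached to a rule such as "blanket → lamp" is its confidence: among the baskets that contain the blanket, the share that also contain the lamp. The sketch below estimates it from a list of baskets; the baskets are invented so that the result matches the 0.6 quoted in the rule, and the function name `confidence` is an illustrative choice.

```python
def confidence(transactions, antecedent, consequent):
    """Estimate Pr(consequent bought | antecedent bought) from baskets."""
    antecedent = set(antecedent)
    consequent = set(consequent)
    with_antecedent = [t for t in transactions if antecedent <= set(t)]
    if not with_antecedent:
        return None  # the rule cannot be evaluated on these data
    with_both = [t for t in with_antecedent if consequent <= set(t)]
    return len(with_both) / len(with_antecedent)

# Made-up baskets: 5 contain a blanket, and 3 of those also contain a lamp.
baskets = [
    {"blanket", "lamp"},
    {"blanket", "lamp", "pillow"},
    {"blanket", "lamp"},
    {"blanket"},
    {"blanket", "pillow"},
    {"lamp"},
]
rule_confidence = confidence(baskets, {"blanket"}, {"lamp"})
print(rule_confidence)
```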

The parable of the beer and diapers

On Friday afternoons, young American males who buy diapers (nappies) also have a predisposition to buy beer. No one had predicted that result, so no one would ever have even asked the question in the first place. Hence, this is an excellent example of the difference between data mining and querying.

The story goes on that, once the correlation was uncovered, it was easy to back extrapolate from the effect to the cause.
- Young American males frequently indulge in ritualised carousing behaviour with friends of Friday nights.
- Carousing usually involves the consumption of beer.
- Most young American males only buy diapers after they have fathered offspring.
- Offspring acquisition is a known carousing inhibitor.

So the proud new father is walking around the store on Friday afternoon. He knows there is no way that he is going to get out of the house to join his mates at the bar. However, there is nothing to stop him from drinking beer at home. All he needs is to be reminded of that fact. After seeing the results of the data mining, Wal-Mart moved the beer next to the diapers and beer sales went up.


1.5 Why are there so many different methods?

Each method has its advantages and disadvantages

Apply several different methods and select the most useful one


1.6 Terminology and notation

• Algorithm
• Attribute (X, Feature = Input var = Independent var = Field = Predictor)
• Case (= Observation)
• Confidence: Pr(C will be purchased | A and B are purchased)
• Confidence interval
• Dependent var (Y, Outcome var = Output var = Target var = Response)
• Estimation (= Prediction)
• Holdout sample (= Validation set, Test set)
• Model
• Observation (= Record)
• Pattern
• Prediction
• Predictor
• Response
• Score
• Success class
• Supervised learning
• Test data: data used to assess the final model
• Training data: data used to fit a model
• Unsupervised learning: learning something about the data rather than predicting
• Validation data


1.7 Road maps to this book



| | Continuous Y | Categorical Y | No Y |
| Continuous X | Linear regression, Neural nets, k-NN | Logistic regression, Neural nets, Discriminant analysis, k-NN | Principal components, Cluster analysis |
| Categorical X | Linear regression, Neural nets, Regression trees | Neural nets, Classification trees, Logistic regression, Naive Bayes | Association rules |



Part I: Overview

Part II: Data exploration and dimension reduction

Part III: Performance evaluation (comparison of supervised methods)

Part IV: Various supervised learning methods

Part V: Unsupervised learning

Part VI: Forecasting time series

Part VII: Cases


Software

• Use of XLMiner® Software


Chapter 2 Overview of the DM Process



2.1 Introduction

• General steps of the data modeling process
• OLAP (On-line Analytical Processing) and SQL are not covered
- they are descriptive and involve no statistical modeling
• This book focuses on “predictive analytics”
- classification / prediction / affinity analysis


2.2 Core ideas in DM

Classification
• The most basic form of data analysis
• Rules are discovered from similar data whose classification is known, and then applied to data whose classification is unknown

Prediction
• Classification predicts a class; prediction predicts a value

Association rules
• = affinity analysis, “what goes with what” (ex. the recommenders of Amazon.com and Netflix.com)


Data visualization
• Graphical analysis
• Numerical variables → histograms, boxplots
• Categorical variables → bar charts, pie charts
• Relationships between variables, detecting outliers → scatterplots

Data reduction
• Grouping data into a smaller number of groups

Data exploration
• Review and examine the data to understand it fully; aggregate similar variables and records


2.3 Supervised and unsupervised learning

Supervised learning algorithms

• Process of building a model using both the input (X) and output (Y) variables.
• Goal is to predict unknown output values based on the input variables.
• The model “learns” (is “trained”) from the training data, is then applied to the validation data to test it, and is compared with other models.
• It is good practice to hold back some data as a third set, the test data, so that the comparison can also be made there.

ex. simple linear regression: Y is the outcome variable and X is the predictor variable.


Unsupervised learning algorithms

• Exploratory analysis to learn something about the data.
• The aim is not to predict the values of an output variable.
• No learning from cases where the outcome is known
• Goal is to describe associations or identify inherent patterns in the dataset.

ex. association rules, data reduction methods, clustering techniques.


2.4 The Steps in DM

Step 1: Understand the purpose of the DM project or application
→ One-shot answer? On-going procedure?

Step 2: Obtain the data set (random sampling, aggregation of databases, ...)
→ Random sampling, pulling data together from DBs

Step 3: Explore, clean, and preprocess the data
→ How to handle missing data? Are the data reasonable? Outliers? Consistency of fields?


Step 4: Reduce and separate the data into training / validation / test datasets
• Training set: data set used for creating models (or classifiers)
• Validation set: data set used for validating the classifiers obtained from the training set


Step 5: Determine the DM task: Classification? Prediction? Clustering? etc.

Step 6: Choose the DM technique to be used (regression, neural nets, clustering, etc.)

Step 7: Use algorithms to perform the task
- iteratively
- with multiple variants (different variables, refined settings, etc.)


Step 8: Interpret the results
→ choose the best algorithm to deploy
→ test on the test data

Step 9: Deploy the model (integrate the model into operational systems, run it on real records to produce actions)
Ex. “include in the mailing if the predicted amount of purchase is > $10”

cf. SEMMA (Sample, Explore, Modify, Model, Assess), the methodology of SAS Enterprise Miner

cf. CRISP-DM (CRoss-Industry Standard Process for Data Mining), the methodology of SPSS Clementine


2.5 Preliminary steps

Organization of datasets

• Variables are in columns and records are in rows
• One of these variables is the outcome variable (CAT.MEDV, at the end)

Sampling from a database

- Usually we want to run the algorithm on only a subset of the records
- Algorithms have limits on the number of records or variables they can handle
- An accurate model can be built with as few as several hundred records



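Oversampling rare events

• Applies when the event of interest is very rare
• Ordinary sampling would pick up a large number of non-rare events that contribute no information for building the model
• So the event of interest is overweighted in the sample
• Finding the rare events also carries a cost
• Balance the cost of misclassifying a nonresponder as a responder against the cost of finding responders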
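One simple way to overweight a rare event is to keep every rare record and draw only as many common records as needed to reach a target mix. The sketch below does this; the function name `oversample_rare`, the 50/50 target, the seed, and the mailing data are all illustrative assumptions rather than a prescribed procedure.

```python
import random

def oversample_rare(records, is_rare, rare_fraction=0.5, seed=0):
    """Build a sample in which rare records make up about `rare_fraction`.

    Keeps every rare record and draws common records at random
    (without replacement) to fill the remainder.
    """
    rng = random.Random(seed)
    rare = [r for r in records if is_rare(r)]
    common = [r for r in records if not is_rare(r)]
    n_common = round(len(rare) * (1 - rare_fraction) / rare_fraction)
    n_common = min(n_common, len(common))
    sample = rare + rng.sample(common, n_common)
    rng.shuffle(sample)
    return sample

# Hypothetical mailing data: 10 responders (1) among 1,000 customers.
customers = [{"id": i, "responded": 1 if i < 10 else 0} for i in range(1000)]
training = oversample_rare(customers, lambda r: r["responded"] == 1)
print(len(training), sum(r["responded"] for r in training))
```

Because the sample no longer reflects the true response rate, predicted probabilities from a model fitted to it need to be adjusted back before they are interpreted.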



Preprocessing and cleaning the data

① Types of variables
• numerical or text (character); continuous, integer, or categorical
• categorical: numerical or text; unordered (“nominal”: Asia, Europe, North America) or ordered (“ordinal”: high, medium, low)
• Each method may restrict the types of variables it accepts (ex. Naïve Bayes uses categorical variables)



② Handling categorical variables
• may be treated as a continuous variable, if ordered
• may be decomposed into several binary (yes/no) dummy variables
ex. categories “Student”, “Unemployed”, “Employed”, “Retired”
→ split into four binary variables, like “Student: yes/no”, ...
→ Only three of them are needed: if three are “no”, the fourth must be “yes”.
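The dummy coding of the four-category example can be sketched as follows. The function name `to_dummies` and the `_yes` naming of the new variables are illustrative; the last category is dropped because it is implied when all the others are 0.

```python
def to_dummies(values, categories, drop_last=True):
    """Encode categorical values as 0/1 dummy variables."""
    kept = categories[:-1] if drop_last else categories
    return [{f"{c}_yes": int(v == c) for c in kept} for v in values]

categories = ["Student", "Unemployed", "Employed", "Retired"]
rows = to_dummies(["Student", "Retired", "Employed"], categories)
for row in rows:
    print(row)
```

The "Retired" record is coded as all zeros, which is why three variables are enough for four categories.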




③ Variable selection

- More is not necessarily better

- Other things being equal, parsimony (economy) or compactness is better




④ Overfitting
- The more variables, the greater the risk of overfitting
- What is overfitting? A model that fits the data it was built on perfectly:
- No error (residual)
- Not accurate
- Not useful

• A simple straight line might do a better job of predicting future outcomes
• The overfitted model has mislabeled noise as if it were signal
• Adding more input variables (predictors) might improve the performance of the model, but it probably brings in spurious “explanations” (ex. height vs. donation amount, a chance correlation)
• The dataset should be much larger than the number of predictor variables, so that the model does not depend on just a few cases
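The contrast between a zero-residual fit and a straight line can be checked numerically. The sketch below fits both to made-up data that roughly follow y = x with small noise, then compares their sums of squared errors on the fitting data and on two new points. The interpolating polynomial (Lagrange form) is used here only as a convenient example of a model with no residuals; all names and numbers are illustrative.

```python
def fit_line(x, y):
    """Least-squares straight line; returns a prediction function."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = (sum((a - mx) * (b - my) for a, b in zip(x, y))
          / sum((a - mx) ** 2 for a in x))
    b0 = my - b1 * mx
    return lambda t: b0 + b1 * t

def fit_interpolating_polynomial(x, y):
    """Polynomial through every training point (Lagrange form): zero residuals."""
    def predict(t):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(x, y)):
            term = yi
            for j, xj in enumerate(x):
                if j != i:
                    term *= (t - xj) / (xi - xj)
            total += term
        return total
    return predict

def sse(model, x, y):
    """Sum of squared prediction errors of `model` on (x, y)."""
    return sum((yi - model(xi)) ** 2 for xi, yi in zip(x, y))

# Made-up data: roughly y = x, plus small "noise".
train_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [0.3, 0.8, 2.4, 2.7, 4.3, 4.8]
new_x = [6.0, 7.0]
new_y = [6.1, 6.9]

line = fit_line(train_x, train_y)
curve = fit_interpolating_polynomial(train_x, train_y)
print("fitting data:", sse(line, train_x, train_y), sse(curve, train_x, train_y))
print("new data:    ", sse(line, new_x, new_y), sse(curve, new_x, new_y))
```

The curve has no error on the data it was fitted to but is far off on the new points, while the line is slightly off on both.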



⑤ How many variables and how much data?
• Rule of thumb: 10 records per predictor variable
• 6×m×p records (m: number of outcome classes, p: number of variables)
• x-y plots for all variable combinations → if two variables fall on a straight line, delete one of them
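The two rules of thumb translate directly into arithmetic. A minimal sketch, with the function name `minimum_records` chosen for illustration:

```python
def minimum_records(num_predictors, num_classes=None):
    """Rule-of-thumb lower bound on the number of records.

    With num_classes given, uses 6 * m * p; otherwise 10 records per predictor.
    """
    if num_classes is None:
        return 10 * num_predictors
    return 6 * num_classes * num_predictors

print(minimum_records(13))      # 10 records for each of 13 predictors
print(minimum_records(13, 2))   # 6 x 2 classes x 13 predictors
```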




⑥ Outliers
• The more data, the greater the chance of erroneous values
• Rule of thumb: “anything over 3σ away from the mean”
• Use domain knowledge (use common sense)
• Sort each column to look for outliers, or review the maximum and minimum values
• Simply discard them? Or something else?
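The 3σ rule of thumb can be sketched with the standard library. The room counts below are made up, with one implausible entry; `flag_outliers` is an illustrative name, and the population standard deviation is an arbitrary choice here.

```python
from statistics import mean, pstdev

def flag_outliers(values, num_sd=3.0):
    """Return the values lying more than `num_sd` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) > num_sd * sigma]

# Made-up numbers of rooms; the last entry looks like a data-entry error.
rooms = [6.58, 6.42, 7.19, 7.00, 7.15, 6.43, 6.01, 6.17, 5.63, 6.00,
         6.2, 6.5, 6.8, 6.1, 6.3, 6.6, 6.9, 6.4, 6.7, 79.29]
print(flag_outliers(rooms))
```

Note that an extreme value inflates the standard deviation itself, so with very few records even a gross error may stay within 3σ; sorting the column or checking the maximum and minimum remains a useful complement.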




⑦ Missing values
• Delete the affected records if the number of records with missing values is small
• But with many variables, even a small proportion of missing values affects many records (ex. 30 variables with 5% of values missing → about 80% of records are deleted: 1 − 0.95^30 = 0.785)
• Another option is to substitute the mean for each missing value
→ this understates the variability of the dataset, but
→ variability and performance can still be measured using the validation data
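Both points can be checked with a few lines. The first function reproduces the 30-variable calculation, assuming values are missing independently across variables; the second performs mean substitution on a single column, with `None` marking a missing value. Function names and the sample column are illustrative.

```python
def fraction_of_records_lost(num_variables, missing_rate):
    """Share of records with at least one missing value, assuming values are
    missing independently across variables."""
    return 1 - (1 - missing_rate) ** num_variables

def impute_mean(column):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in column]

lost = fraction_of_records_lost(30, 0.05)
filled = impute_mean([2.0, None, 4.0, 6.0])
print(round(lost, 3))
print(filled)
```

Every imputed entry sits exactly at the mean, which is why the spread of the column is understated after substitution.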




⑧ Normalizing (standardizing) the data
• z = (X – μ) / σ : the z-score, the “number of standard deviations away from the mean”
• Resolves differences in units (variables measured in large numbers could otherwise dominate)
(See Prob. 2.9)
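A minimal sketch of the z-score transformation for one column. The values are made up, and the population standard deviation is used here; a software package may use the sample version instead.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Standardize values to mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

# Made-up values on a large scale (e.g. a tax rate column).
tax = [296, 242, 242, 222, 222]
z = z_scores(tax)
print([round(v, 2) for v in z])
```

After the transformation every variable is expressed in the same unit (standard deviations), so no variable dominates merely because of its scale.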




Use and creation of partitions

• Divide (partition) the data and develop the model using only one of the partitions
• After modeling, try the model on another partition to see how it performs (ex. classification model, prediction model)
• Two or three partitions: training set, validation set, test set
(split randomly in predetermined proportions, or split into old/new records)



① Training partition
• typically the largest partition
• many models are fitted to the training partition
② Validation partition
• used to assess the performance of each model, compare them, and pick the best
③ Test partition
• “holdout or evaluation partition”
• guards against the overfitting problem (the selected model may have fit the validation data well just by chance)



Three data partitions and their role in the data mining process
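A random split into the three partitions can be sketched as below. The 50/30/20 proportions, the seed value, and the function name `partition` are illustrative assumptions; fixing the seed makes the same split reproducible.

```python
import random

def partition(records, fractions=(0.5, 0.3, 0.2), seed=12345):
    """Randomly split records into training, validation and test partitions."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = int(fractions[0] * n)
    n_valid = int(fractions[1] * n)
    training = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]   # whatever remains
    return training, validation, test

training, validation, test = partition(range(100))
print(len(training), len(validation), len(test))
```

Models are fitted on `training`, compared on `validation`, and the chosen model is assessed once on `test`.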



2.6 Building a model: Example with linear regression

Multiple linear regression is used to illustrate the steps followed in most DM projects

Ex. Boston housing data



CRIM   ZN    INDUS  CHAS  NOX   RM    AGE   DIS   RAD  TAX  PTRATIO  B    LSTAT  MEDV  CAT.MEDV
0.006  18    2.31   0     0.54  6.58  65.2  4.09  1    296  15.3     397  5      24    0
0.027  0     7.07   0     0.47  6.42  78.9  4.97  2    242  17.8     397  9      21.6  0
0.027  0     7.07   0     0.47  7.19  61.1  4.97  2    242  17.8     393  4      34.7  1
0.032  0     2.18   0     0.46  7.00  45.8  6.06  3    222  18.7     395  3      33.4  1
0.069  0     2.18   0     0.46  7.15  54.2  6.06  3    222  18.7     397  5      36.2  1
0.030  0     2.18   0     0.46  6.43  58.7  6.06  3    222  18.7     394  5      28.7  0
0.088  12.5  7.87   0     0.52  6.01  66.6  5.56  5    311  15.2     396  12     22.9  0
0.145  12.5  7.87   0     0.52  6.17  96.1  5.95  5    311  15.2     397  19     27.1  0
0.211  12.5  7.87   0     0.52  5.63  100   6.08  5    311  15.2     387  30     16.5  0
0.170  12.5  7.87   0     0.52  6.00  85.9  6.59  5    311  15.2     387  17     18.9  0



The modeling process

Step 1: Understand the purpose of the DM project or application
→ to predict the “median house value”

Step 2: Obtain the data set
→ given
→ public data portal sites:
Shared Resources Portal (공유자원포털) http://www.data.go.kr
National Center for Standard Reference Data (국가참조표준센터) http://www.srd.re.kr
UK public data portal http://www.data.gov.uk
US public data portal http://data.gov



Step 3: Explore, clean, and preprocess the data
→ understand the descriptions of the variables
→ think about the meaning of each variable and whether to include it
(ex. Are both TAX and home value needed? CAT.MEDV is meant for high/low classification, so it is not needed here)
→ check for outliers that might be errors
(ex. Number of rooms is 79.29!! → correct to 7.929)




Step 4: Reduce and separate the data into training/validation/test datasets
→ only 13 variables, so no data reduction (ex. PCA) is required
→ training set (to build the model) and validation set (to see how well it performs)
→ the random partition has an option to specify the seed for randomization




Step 5: Determine the DM task
→ to predict the “median house value”

Step 6: Choose the DM technique to be used
→ multiple linear regression

Step 7: Use algorithms to perform the task
→ SAS, MATLAB, SPSS, C/C++, etc.
→ MEDV is the output variable
→ all other variables are inputs
→ CAT.MEDV is unused




Predictions for the training data




Predictions for the validation data




→ compare the prediction errors on the training and validation data
→ Three measures of prediction error:
- average error: the average of the residuals
- total SSE (sum of squared errors)
- RMS (root mean squared) error: the square root of the average squared error
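The three measures can be computed from actual and predicted values as in the sketch below. The actual values are four MEDV entries from the sample table; the predicted values are invented for illustration and do not come from a fitted model.

```python
from math import sqrt

def prediction_errors(actual, predicted):
    """Return (average error, total SSE, RMS error) for paired values."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    n = len(residuals)
    average_error = sum(residuals) / n
    total_sse = sum(r ** 2 for r in residuals)
    rms_error = sqrt(total_sse / n)
    return average_error, total_sse, rms_error

actual = [24.0, 21.6, 34.7, 33.4]      # MEDV values from the sample rows
predicted = [26.0, 20.6, 31.7, 33.4]   # made-up predictions
avg, total_sse, rms = prediction_errors(actual, predicted)
print(round(avg, 3), round(total_sse, 3), round(rms, 3))
```

Positive and negative residuals cancel in the average error, so it indicates bias rather than accuracy; SSE and RMS error do not cancel. Computing the same measures on both the training and the validation partition shows how much performance drops on data the model has not seen.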




Step 8: Interpret the results

→ Choose the best model → to be covered later

Step 9: Deploy the model



2.7 Using Excel for DM

• Database management systems:
IBM DB2 Intelligent Miner, Microsoft SQL Server 2005, Oracle Data Mining, Teradata Warehouse Miner, ...
• Standalone DM tools:
KXEN, RuleQuest Research C5.0, Salford Systems CART, MARS, TreeNet, ...
SAS Enterprise Miner, SPSS Clementine, Insightful Miner, ...
