chapter 1 introduction - kocwcontents.kocw.net/kocw/document/2014/hanyang/huhseon/1.pdfexplore,...
Hanyang
University
Quest Lab.
Chapter 1 Introduction
Spring, 2015 Sun Hur
Department of IME, Hanyang University
1.1 What is DM?
• Definitions of DM (data mining):
“Extracting useful information from large datasets.”
“Data mining is the process of exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns and rules.”
“Data mining is the process of discovering meaningful new correlations, patterns, and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques.”
What is Data Mining?
Data flood!
Data mining: the process of extracting useful information from large datasets through the use of any relevant data analysis techniques developed to help people make better decisions.
1.2 Where is DM used?
• Military: accuracy of bombs
• Intelligence agencies: analyzing intercepted communications
• Security: examining network packets
• Medical: predicting cancer relapse
• Business: classifying customers (Will they respond? Commit fraud? Default on a loan? Abandon the service?)
• Analysis of SNS and much more …
1.3 Origins of DM
• Classical statistics:
difficult computation, scarce data
the same data is used both to compute an estimate and to determine its reliability
• Linear regression, logistic regression, discriminant analysis, principal component analysis
1.3 Origins of DM
• Data and computing power are plentiful
• Scale
• Speed
• Simplicity (in the logic of the inference, not of the algorithm)
• Machine learning techniques (ex. neural nets)
• But DM is vulnerable to the danger of overfitting: the model captures not only the structural characteristics of the data but also its chance characteristics (in addition to “signal”, “noise” is also included)
1.4 Rapid growth of DM
• 빅데이터_산업지각변동의 진원.pdf (Big Data: The Epicenter of Industrial Upheaval)
• 빅데이터분석과 활용.pdf (Big Data Analysis and Applications)
• Growth of data! (ex. Wal-Mart: 20 million transactions per day in 2003, a 10 TB DB); peta, exa, zetta
• POS, GPS, Internet, ... (OLTP: On-line Transaction Processing)
• DW (data warehouse): large integrated data storage facility that ties together the DSS of an enterprise (OLAP: On-line Analytical Processing; MySQL, SQL Server, DB2, Oracle DB, etc.)
• Computational power and cost of data storage and retrieval
What can DM do?
• Supervised learning
  - Classification
  - Prediction
• Unsupervised learning
  - Association discovery
  - Clustering
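Regression
• The regression model for i = 1, …, n observations:
$$Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \varepsilon_i$$
• Parameters: $\beta_j$ for $j = 1, 2, \ldots, k$, and $\beta_0$
• Random error: $\varepsilon_i$ has mean 0 and variance $\sigma^2$
⇒ $Y_i$ is a random variable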
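As a rough illustration of the regression model, the sketch below fits the special case with a single predictor (k = 1) by ordinary least squares. The function name `fit_simple_regression` and the four data points are made up for this example and do not come from the slides.

```python
def fit_simple_regression(x, y):
    """Least-squares estimates (b0, b1) for the model y = b0 + b1 * x + error."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need at least two paired observations")
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    if s_xx == 0:
        raise ValueError("x values must not all be equal")
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx           # slope estimate
    b0 = y_bar - b1 * x_bar    # intercept estimate
    return b0, b1


# Illustrative data only (not from the slides).
x = [0.0, 1.0, 2.0, 3.0]
y = [1.1, 2.9, 5.2, 6.8]
b0, b1 = fit_simple_regression(x, y)

# Residuals play the role of the random error term in the fitted model.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
```

With an intercept in the model, the least-squares residuals sum to zero, which is one quick sanity check on the fit.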
Classification
Learn a method for predicting the instance class from pre-labeled (classified) instances.
Classification problem
• Features (variables): width, length, lightness, etc.
• Classes: salmon, bass
[Figure: salmon and bass plotted by lightness and width]
Email Spam Filtering
• One example of a rule in the filter:
If (%free > yes) & (%remove > yes) then Spam
Else E-mail
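A rule of this kind can be sketched in a few lines. The slide does not give usable cut-off values, so the two thresholds below are assumptions, as is the reading of %free and %remove as the percentage of words in a message equal to "free" and "remove".

```python
# Hypothetical thresholds: the slide's rule does not state numeric cut-offs.
FREE_THRESHOLD = 0.5     # assumed: percent of words equal to "free"
REMOVE_THRESHOLD = 0.1   # assumed: percent of words equal to "remove"


def word_percentage(message, word):
    """Return the percentage of whitespace-separated tokens equal to `word`."""
    tokens = message.lower().split()
    if not tokens:
        return 0.0
    return 100.0 * tokens.count(word) / len(tokens)


def classify(message):
    """Label a message Spam only if both word percentages exceed their thresholds."""
    pct_free = word_percentage(message, "free")
    pct_remove = word_percentage(message, "remove")
    if pct_free > FREE_THRESHOLD and pct_remove > REMOVE_THRESHOLD:
        return "Spam"
    return "E-mail"


spam_label = classify("free offer click to remove your name free")
ham_label = classify("meeting notes attached for tomorrow")
```

In a real filter the thresholds would be learned from labeled messages rather than fixed by hand.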
Clustering
Find “natural” groupings of data given unlabeled data.
[Figures: clustering by race, by gender, and by air quality]
Bioinformatics - Microarray
• Clustering problem: partition the genes into groups or clusters based on their expression patterns.
Association Problem
Amazon.com
• The Elements of Statistical Learning (Hardcover) by T. Hastie, R. Tibshirani, J. H. Friedman
Customers who bought this item also bought:
- Principles of Data Mining (Adaptive Computation and Machine Learning) by David J. Hand
- Pattern Classification (2nd Edition) by Richard O. Duda
Association Problem
• Walmart
• Rule 1: If the customer buys the blanket, he/she buys the lamp with a probability of 0.6
• Customer relationship management (CRM)
- Display (location)
- Brochure
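The probability attached to a rule such as "blanket → lamp" is its confidence: among transactions containing the blanket, the share that also contain the lamp. A minimal sketch follows; the six transactions are invented so that the result matches the 0.6 quoted in the rule.

```python
# Invented transactions for illustration; each is the set of items in one basket.
transactions = [
    {"blanket", "lamp"},
    {"blanket", "lamp", "pillow"},
    {"blanket"},
    {"lamp"},
    {"blanket", "lamp"},
    {"blanket", "pillow"},
]


def support_count(itemset, baskets):
    """Number of baskets that contain every item in `itemset`."""
    return sum(1 for basket in baskets if itemset <= basket)


def confidence(antecedent, consequent, baskets):
    """Estimate Pr(consequent is purchased | antecedent is purchased)."""
    n_antecedent = support_count(antecedent, baskets)
    if n_antecedent == 0:
        raise ValueError("antecedent never occurs in the transactions")
    return support_count(antecedent | consequent, baskets) / n_antecedent


conf_blanket_lamp = confidence({"blanket"}, {"lamp"}, transactions)  # 3 of 5 baskets
```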
The parable of the beer and diapers
On Friday afternoons, young American males who buy diapers (nappies) also have a predisposition to buy beer. No one had predicted that result, so no one would ever have even asked the question in the first place. Hence, this is an excellent example of the difference between data mining and querying.
The story goes on that, once the correlation was uncovered, it was easy to extrapolate back from the effect to the cause.
- Young American males frequently indulge in ritualised carousing behaviour with friends on Friday nights.
- Carousing usually involves the consumption of beer.
- Most young American males only buy diapers after they have fathered offspring.
- Offspring acquisition is a known carousing inhibitor.
So the proud new father is walking around the store on Friday afternoon. He knows there is no way that he is going to get out of the house to join his mates at the bar. However, there is nothing to stop him from drinking beer at home. All he needs is to be reminded of that fact. After seeing the results of the data mining, Wal-Mart moved the beer next to the diapers and beer sales went up.
1.5 Why are there so many different methods?
Each method has advantages and disadvantages
Apply several different methods and select the most useful one
1.6 Terminology and notation
• Algorithm
• Attribute (X, Feature = Input var = Independent var = Field = Predictor)
• Case (= Observation)
• Confidence: Pr(C will be purchased | A and B are purchased)
• Confidence interval
• Dependent var (Y, Outcome var = Output var = Target var = Response)
• Estimation (= Prediction)
• Holdout sample (= Validation set, Test set)
• Model
• Observation (= Record)
• Pattern
• Prediction
• Predictor
• Response
• Score
• Success class
• Supervised learning
• Test data: data used for assessing the final model
• Training data: data used to fit a model
• Unsupervised learning: learning something about the data rather than predicting
• Validation data
1.7 Road maps to this book
              | Continuous Y                                  | Categorical Y                                                   | No Y
Continuous X  | Linear regression, Neural nets, k-NN          | Logistic regression, Neural nets, Discriminant analysis, k-NN   | Principal components, Cluster analysis
Categorical X | Linear regression, Neural nets, Regression trees | Neural nets, Classification trees, Logistic regression, Naive Bayes | Association rules
1.7 Road maps to this book
Part I: Overview
Part II: Data exploration and dimension reduction
Part III: Performance evaluation (comparison of supervised methods)
Part IV: Various supervised learning methods
Part V: Unsupervised learning
Part VI: Forecasting time series
Part VII: Cases
Software
• Use of XLMiner® Software
Chapter 2
Overview of the DM Process
Spring, 2015 Sun Hur
Department of IME, Hanyang University
2.1 Introduction
• General steps of the data modeling process:
• OLAP(On-line analytical processing), SQL are not covered
- they are descriptive and no statistical modeling is involved
• Focus on the “predictive analytics” in this book
- classification / prediction / affinity analysis
2.2 Core ideas in DM
Classification
• Most basic form of data analysis
• Discover rules from similar data whose classification is known, then apply them to data whose classification is unknown
Prediction
• Classification predicts a class; prediction predicts a value
Association rules
• = affinity analysis (ex. Amazon.com and Netflix.com’s recommenders): “what goes with what”
2.2 Core ideas in DM
Data reduction
• Grouping data into a smaller number of groups
Data exploration
• Review, examine, full understanding, aggregation of similar variables and records
Data visualization
• Graphical analysis
• Numerical variables → histograms, boxplots
• Categorical variables → bar charts, pie charts
• Relationships between variables, detecting outliers → scatterplots
2.3 Supervised and unsupervised learning
Supervised learning algorithms
• Process of building a model using both the input (X) and output (Y) variables.
• Goal is to predict unknown output values based on the input variables.
• The model “learns” or is “trained” from the training data, then is applied to the validation data to test it and compare it with other models
• It is good practice to hold back some data so that the comparison can also be made on a third set, the test data
ex. simple linear regression: Y is the outcome variable and X is the predictor variable.
Unsupervised learning algorithms
• Exploratory analysis to learn something about the data.
• The goal is not to predict the values of an output variable.
• No learning from cases where the outcome is known
• Goal is to describe associations or identify inherent patterns in the dataset.
ex. association rules, data reduction methods, clustering techniques.
2.4 The Steps in DM
Step 1: Understand the purpose of the DM project or application
- One-shot answer? On-going procedure?
Step 2: Obtain the data set (random sampling, aggregation of databases, ...)
- Random sampling, pulling data together from DBs
Step 3: Explore, clean, and preprocess the data
- How to handle missing data? Are the data reasonable? Outliers? Consistency of fields?
2.4 The Steps in DM
Step 4: Reduce and separate the data into training/validation/test datasets
- Training set: data set used for creating models (or classifiers)
- Validation set: data set used for validating the classifiers obtained from the training set
2.4 The Steps in DM
Step 5: Determine the DM task: Classification? Prediction? Clustering? etc.
Step 6: Choose the DM technique to be used (regression, neural nets, clustering, etc.)
Step 7: Use algorithms to perform the task:
- iteratively
- with multiple variants (different variables, refined settings, etc.)
2.4 The Steps in DM
Step 8: Interpret the results
→ choose the best algorithm to deploy
→ test on the test data
Step 9: Deploy the model (integrate the model into operational systems, run it on real records to produce actions)
Ex. “include in the mailing if the predicted amount of purchase is > $10”
cf. SEMMA (Sample, Explore, Modify, Model, Assess), methodology of SAS Enterprise Miner
cf. CRISP-DM (CRoss-Industry Standard Process for DM), methodology of SPSS Clementine
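The deployment example (“mail if the predicted purchase exceeds $10”) amounts to scoring each new record and applying a threshold. The sketch below assumes a stand-in scoring function, `toy_predict`, and made-up customer records; a real deployment would call the fitted model instead.

```python
def select_for_mailing(records, predict, threshold=10.0):
    """Return the records whose predicted purchase amount exceeds `threshold` dollars."""
    return [record for record in records if predict(record) > threshold]


def toy_predict(record):
    """Hypothetical stand-in for a fitted model: 5% of last year's spending."""
    return 0.05 * record["last_year_spend"]


# Made-up customer records for illustration.
customers = [
    {"id": 1, "last_year_spend": 120.0},   # predicted 6.0
    {"id": 2, "last_year_spend": 400.0},   # predicted 20.0
    {"id": 3, "last_year_spend": 160.0},   # predicted 8.0
]

mailing_ids = [c["id"] for c in select_for_mailing(customers, toy_predict)]
```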
2.5 Preliminary steps
Organization of datasets
• Variables are in columns and records are in rows
• One of these variables is the outcome variable (Cat.MEDV, at the end)
Sampling from a database
- Usually we want to run the algorithm on only a subset of the records
- Algorithms have limitations on the number of records or variables
- An accurate model can be built with as few as a few hundred records
2.5 Preliminary steps
Oversampling rare events
• Applies when the event of interest is very rare
• Plain sampling would include a large number of non-rare events that give no information for building the model
• So the event of interest is overweighted in the sample
• Finding rare events also has a cost
• Balance the cost of misclassifying a nonresponder as a responder against the cost of finding a responder
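One simple way to give rare events more weight is to keep every rare record and draw only as many common records as needed to reach a target mix. This is a sketch of that idea, not necessarily the procedure used in the course; the function name, the 50% target, and the synthetic records are assumptions.

```python
import random


def oversample_rare(records, is_rare, rare_fraction=0.5, seed=0):
    """Keep all rare records and sample common ones so rare records
    make up roughly `rare_fraction` of the returned sample."""
    if not 0 < rare_fraction <= 1:
        raise ValueError("rare_fraction must be in (0, 1]")
    rng = random.Random(seed)
    rare = [r for r in records if is_rare(r)]
    common = [r for r in records if not is_rare(r)]
    # Number of common records needed to hit the target mix, capped by availability.
    wanted = round(len(rare) * (1 - rare_fraction) / rare_fraction)
    n_common = min(len(common), wanted)
    sample = rare + rng.sample(common, n_common)
    rng.shuffle(sample)
    return sample


# Synthetic data: 20 responders among 1000 records (a 2% response rate).
records = [{"id": i, "responder": i % 50 == 0} for i in range(1000)]
sample = oversample_rare(records, lambda r: r["responder"])
n_rare = sum(1 for r in sample if r["responder"])
```

Because the sample no longer reflects the original 2% rate, predicted probabilities from a model built on it need to be adjusted back before they are interpreted as real-world rates.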
2.5 Preliminary steps
Preprocessing and cleaning the data
① Types of variables
• numerical or text (character); continuous, integer, or categorical
• categorical: numerical or text; unordered (“nominal”: Asia, Europe, North America) or ordered (“ordinal”: high, medium, low)
• Each method may restrict the types of variables it can use (ex. Naïve Bayes uses categorical variables)
2.5 Preliminary steps
Preprocessing and cleaning the data
② Handling categorical variables
• may be regarded as a continuous variable, if ordered
• may be decomposed into several dummy binary (yes/no) variables
ex. categories “Student”, “Unemployed”, “Employed”, “Retired”
-> could be split into four binary variables, like “Student – yes/no”, ...
-> but only three are needed, because the fourth is determined by the other three
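Dummy coding with one category left out as the reference can be sketched as follows. Treating the last category ("Retired") as the reference is an arbitrary choice made for this illustration.

```python
CATEGORIES = ["Student", "Unemployed", "Employed", "Retired"]


def dummy_code(value, categories=CATEGORIES):
    """Encode `value` as len(categories) - 1 binary indicators.

    The last category is the reference level and maps to all zeros.
    """
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories[:-1]]


student = dummy_code("Student")    # indicators for Student, Unemployed, Employed
employed = dummy_code("Employed")
retired = dummy_code("Retired")    # reference level: every indicator is 0
```

Dropping one indicator avoids redundancy: with all four, each column would be fully determined by the other three.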
2.5 Preliminary steps
Preprocessing and cleaning the data
③ Variable selection
- More is not necessarily better
- All else being equal, parsimony or compactness is better
2.5 Preliminary steps
Preprocessing and cleaning the data
④ Overfitting
- The more variables, the greater the risk of overfitting
- What is overfitting?
2.5 Preliminary steps
Preprocessing and cleaning the data
An overfitted model:
- No error (residual) on the data used to fit it
- Not accurate
- Not useful
2.5 Preliminary steps
• A simple straight line might do a better job of predicting future outcomes
• The overfitted model mislabels noise as if it were signal
• Adding more input variables (predictors) might improve the apparent performance of the model, but it probably brings in spurious “explanations” (ex. height vs. donation amount, a coincidental correlation)
• The dataset should be much larger than the number of predictor variables, so that the model does not depend on just a few cases
2.5 Preliminary steps
Preprocessing and cleaning the data
⑤ How many variables and how much data?
• Rule of thumb: 10 records per predictor variable
• Alternative rule: 6×m×p records (m: number of outcome classes, p: number of variables)
• Make x-y plots for all pairs of variables → if two variables fall on a straight line, delete one of them
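Both rules of thumb are simple arithmetic. The sketch below evaluates them for an assumed problem with 13 predictors and 2 outcome classes; those two numbers are chosen only for illustration.

```python
def min_records_per_predictor(p, per_predictor=10):
    """Rule of thumb: 10 records per predictor variable."""
    return per_predictor * p


def min_records_classification(m, p):
    """Rule of thumb: 6 * m * p records for m outcome classes and p variables."""
    return 6 * m * p


# Assumed example: 13 predictors, 2 outcome classes.
needed_simple = min_records_per_predictor(13)
needed_classification = min_records_classification(2, 13)
```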
2.5 Preliminary steps
Preprocessing and cleaning the data
⑥ Outliers
• The more data, the greater the chance of erroneous values
• Rule of thumb: “anything over 3σ away from the mean”
• Use domain knowledge (use common sense)
• Sort each column to look for outliers, or review the max and min values
• Simply discard them? Or do something else?
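The 3σ rule of thumb can be sketched as below. The list of room counts is illustrative, with one value (79.29) deliberately mistyped, and the use of the sample standard deviation is an assumption. Note that with ten or fewer values no single point can lie more than 3 sample standard deviations from the mean, so the rule only makes sense on larger columns.

```python
from statistics import mean, stdev


def three_sigma_outliers(values, k=3.0):
    """Return the values lying more than k sample standard deviations from the mean."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) > k * sigma]


# Illustrative room counts; 79.29 stands in for a data-entry error.
rooms = [6.58, 6.42, 7.19, 7.00, 7.15, 6.43, 6.01, 6.17, 5.63, 6.00,
         6.38, 5.89, 5.95, 6.10, 5.93, 5.83, 5.94, 5.99, 5.46, 5.73, 79.29]

outliers = three_sigma_outliers(rooms)
```

Flagged values still need a human decision: correct them, drop them, or keep them if domain knowledge says they are genuine.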
2.5 Preliminary steps
Preprocessing and cleaning the data
⑦ Missing values
• Delete the records if the number of records with missing values is small
• However, with many variables even a small proportion of missing values affects many records (ex. 30 variables, 5% of values are missing → almost 80% of records are deleted: 1 − 0.95^30 = 0.785)
• Another approach is to substitute the mean for the missing values
→ in this case, the variability of the dataset is understated, but
→ variability and performance can still be measured using the validation data
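The 0.785 figure follows if each value is missing independently with probability 0.05, which is the assumption built into the sketch below. The second function shows mean substitution on a small made-up column, with `None` standing for a missing value.

```python
def fraction_of_records_lost(n_variables, missing_rate):
    """Share of records with at least one missing value,
    assuming values are missing independently across variables."""
    return 1 - (1 - missing_rate) ** n_variables


lost = fraction_of_records_lost(30, 0.05)


def impute_mean(column):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in column if v is not None]
    if not observed:
        raise ValueError("column has no observed values")
    column_mean = sum(observed) / len(observed)
    return [column_mean if v is None else v for v in column]


filled = impute_mean([4.0, None, 6.0, 8.0])
```

Every imputed entry sits exactly at the mean, which is why this approach understates the variability of the column.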
2.5 Preliminary steps
Preprocessing and cleaning the data
⑧ Normalizing (standardizing) the data
• z = (X – μ) / σ : z-score, the “number of standard deviations away from the mean”
• Resolves differences in units between variables (otherwise variables measured in large numbers may dominate)
(See Prob 2.9)
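A minimal sketch of z-score standardization for one column follows. It uses the population standard deviation (`pstdev`); whether σ should be the population or sample version is not stated on the slide, so that choice is an assumption. The input values are made up.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize values to z = (x - mean) / standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("cannot standardize a constant column")
    return [(v - mu) / sigma for v in values]


# Made-up column with mean 5 and population standard deviation 2.
column = [2, 4, 4, 4, 5, 5, 7, 9]
z = z_scores(column)
```

After standardization every column has mean 0 and standard deviation 1, so no variable dominates simply because of its unit.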
2.5 Preliminary steps
Use and creation of partitions
• Divide (partition) the data and develop the model using only one of the partitions
• After modeling, try the model on another partition to see how it performs (ex. classification model, prediction model)
• Two or three partitions: training set, validation set, test set
(split randomly according to predetermined proportions, or split into old/new records)
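A random three-way partition with a fixed seed might look like the sketch below. The 50/30/20 proportions and the seed value are assumptions chosen for illustration; fixing the seed makes the split reproducible.

```python
import random


def partition(records, fractions=(0.5, 0.3, 0.2), seed=12345):
    """Randomly split records into training, validation, and test lists."""
    if len(fractions) != 3 or abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must be three numbers summing to 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)   # seeded so the split can be repeated
    n = len(shuffled)
    n_train = round(n * fractions[0])
    n_valid = round(n * fractions[1])
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]     # whatever remains goes to the test set
    return train, valid, test


train, valid, test = partition(range(100))
```

Calling `partition` again with the same seed returns the same three lists, which matters when models built at different times must be compared on identical data.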
2.5 Preliminary steps
Use and creation of partitions
① Training partition
• the largest partition
• many models are fitted on the training partition
② Validation partition
• used to assess the performance of each model, compare them, and pick the best
2.5 Preliminary steps
Use and creation of partitions
③ Test partition
• “holdout or evaluation partition”
• used to avoid the overfitting problem (the selected model may have fit the validation data well by chance)
2.5 Preliminary steps
[Figure: Three data partitions and their role in the data mining process]
2.6 Building a model: Example with linear regression
Illustrates the steps used in most DM projects, using multiple linear regression
Ex. Boston housing data
2.6 Building a model: Example with linear regression
CRIM  | ZN   | INDUS | CHAS | NOX  | RM   | AGE  | DIS  | RAD | TAX | PTRATIO | B   | LSTAT | MEDV | CAT.MEDV
0.006 | 18   | 2.31  | 0    | 0.54 | 6.58 | 65.2 | 4.09 | 1   | 296 | 15.3    | 397 | 5     | 24   | 0
0.027 | 0    | 7.07  | 0    | 0.47 | 6.42 | 78.9 | 4.97 | 2   | 242 | 17.8    | 397 | 9     | 21.6 | 0
0.027 | 0    | 7.07  | 0    | 0.47 | 7.19 | 61.1 | 4.97 | 2   | 242 | 17.8    | 393 | 4     | 34.7 | 1
0.032 | 0    | 2.18  | 0    | 0.46 | 7.00 | 45.8 | 6.06 | 3   | 222 | 18.7    | 395 | 3     | 33.4 | 1
0.069 | 0    | 2.18  | 0    | 0.46 | 7.15 | 54.2 | 6.06 | 3   | 222 | 18.7    | 397 | 5     | 36.2 | 1
0.030 | 0    | 2.18  | 0    | 0.46 | 6.43 | 58.7 | 6.06 | 3   | 222 | 18.7    | 394 | 5     | 28.7 | 0
0.088 | 12.5 | 7.87  | 0    | 0.52 | 6.01 | 66.6 | 5.56 | 5   | 311 | 15.2    | 396 | 12    | 22.9 | 0
0.145 | 12.5 | 7.87  | 0    | 0.52 | 6.17 | 96.1 | 5.95 | 5   | 311 | 15.2    | 397 | 19    | 27.1 | 0
0.211 | 12.5 | 7.87  | 0    | 0.52 | 5.63 | 100  | 6.08 | 5   | 311 | 15.2    | 387 | 30    | 16.5 | 0
0.170 | 12.5 | 7.87  | 0    | 0.52 | 6.00 | 85.9 | 6.59 | 5   | 311 | 15.2    | 387 | 17    | 18.9 | 0
2.6 Building a model: Example with linear regression
The modeling process
Step 1: Understand the purpose of the DM project or application
→ to predict the “median house value”
Step 2: Obtain the data set
→ given
→ public data portal sites:
Shared resources portal (Korea): http://www.data.go.kr
National Center for Standard Reference Data (Korea): http://www.srd.re.kr
UK public data portal: http://www.data.gov.uk
US public data portal: http://data.gov
2.6 Building a model: Example with linear regression
The modeling process
Step 3: Explore, clean, and preprocess the data
→ understand the descriptions of the variables
→ think about the meaning of each variable and whether to include it
(ex. Are both TAX and home value needed? CAT.MEDV exists for high/low classification, so it is not needed for predicting MEDV)
→ check for outliers that might be errors
(ex. Number of rooms is 79.29!! → modify to 7.929)
2.6 Building a model: Example with linear regression
The modeling process
Step 4: Reduce and separate the data into training/validation/test datasets
→ there are only 13 variables, so no data reduction (ex. PCA) is required
→ training set (to build the model) and validation set (to see how well it performs)
→ the random partition has an option to specify a seed for randomization
2.6 Building a model: Example with linear regression
The modeling process
Step 5: Determine the DM task
→ to predict the “median house value”
Step 6: Choose the DM technique to be used
→ multiple linear regression
Step 7: Use algorithms to perform the task
→ SAS, MATLAB, SPSS, C/C++, etc.
→ MEDV is the output variable
→ all others are input variables
→ CAT.MEDV is unused
2.6 Building a model: Example with linear regression
The modeling process
[Figure: Predictions for the training data]
[Figure: Predictions for the validation data]
2.6 Building a model: Example with linear regression
The modeling process
→ prediction error comparison between training and validation data
→ Three measures of prediction error:
- average error: average of the residuals
- total SSE (sum of squared errors)
- RMS (root mean squared) error: square root of the average squared error
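The three error measures can be computed directly from actual and predicted values. In this sketch the residual is taken as actual minus predicted, and the four value pairs are made up for illustration rather than taken from the model output.

```python
from math import sqrt


def error_measures(actual, predicted):
    """Return average error, total SSE, and RMS error for paired values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and equal in length")
    residuals = [a - p for a, p in zip(actual, predicted)]
    n = len(residuals)
    sse = sum(e * e for e in residuals)
    return {
        "average_error": sum(residuals) / n,   # positive and negative errors can cancel
        "total_sse": sse,
        "rms_error": sqrt(sse / n),
    }


# Made-up values for illustration.
actual = [24.0, 22.0, 35.0, 33.0]
predicted = [26.0, 22.0, 31.0, 35.0]
measures = error_measures(actual, predicted)
```

Here the average error is 0 even though individual predictions miss by up to 4, which is why SSE or RMS error is usually reported alongside it. Computing the same measures on both the training and validation partitions shows how much performance degrades on data the model has not seen.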
2.6 Building a model: Example with linear regression
The modeling process
Step 8: Interpret the results
→ choose the best model → to be covered later
Step 9: Deploy the model
2.7 Using Excel for DM
• Database management systems:
IBM DB2 Intelligent Miner, Microsoft SQL Server 2005, Oracle Data Mining, Teradata Warehouse Miner, ...
• Standalone DM tools:
KXEN, RuleQuest Research C5.0, Salford Systems CART, MARS, TreeNet, ...
SAS Enterprise Miner, SPSS Clementine, Insightful Miner, ...