Predictive Modeling and Classification Techniques - kaistkseworkshop.kaist.ac.kr/2014/material/2014kse-2.pdf

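Predictive Modeling and Classification Techniques, February 27, 2014

Department of Knowledge Service Engineering, KAIST

Jae-Gil Lee (이재길)

The 2nd Knowledge Service Workshop: Data Science for Business, 2014-02-27

Contents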

Chapter 3. Introduction to Predictive Modeling: From Correlation to Supervised Segmentation

Chapter 4. Fitting a Model to Data

Chapter 5. Overfitting and Its Avoidance


Predictive Modeling

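Supervised segmentation: dividing a population into groups according to a specific target value of interest

Examples:

Which customers will leave when their contracts expire?

Which prospective customers will fail to pay their credit card bills?

Which customers will make a purchase when sent a coupon?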


Terminology

Model: a formula for predicting an unknown value

Example, Attribute, Target attribute

Class: a group of examples that share one target attribute value; different classes have different target attribute values

Example


For Fun

Target attribute: real singer, impersonator


Supervised Segmentation

How can the figures below be segmented according to the target attribute?

Attributes:

- head-shape: square, circular
- body-shape: rectangular, oval
- body-color: gray, white

Target attribute:

- write-off: Yes, No


Informative Attributes

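Information gain is the most widely used criterion for choosing which attribute to split on

Which attribute, when used to split the data, makes the target attribute values within each group the most pure (homogeneous)? → this calls for a purity measure, the most common being entropy

Entropy

$$\text{entropy} = -\sum_{i} p_i \log_2 p_i$$

p_i: the probability that the attribute takes value i (the proportion of examples in the set with that value)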
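The entropy measure can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard Shannon entropy with base-2 logarithms; the function name `entropy` and the toy label lists are made up for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    # Sum -p_i * log2(p_i) over the classes that actually occur.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Hypothetical label sets, for illustration only.
pure = entropy(["Yes"] * 10)                # every example in one class
mixed = entropy(["Yes"] * 5 + ["No"] * 5)   # evenly split between two classes
```

A set containing a single class has entropy 0, and an even two-class split has entropy 1, which is the maximum for two classes.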


Entropy


Information Gain

Information gain: how much does entropy decrease after splitting on the values of an attribute?

The larger the decrease, i.e., the larger the information gain, the better the attribute is for classification
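As a rough sketch, information gain is usually computed as the entropy of the parent set minus the size-weighted average entropy of the child sets produced by the split. The function names and the 12-example dataset below are assumptions made for illustration, not data from the slides.

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical data: 12 examples split into two groups by some attribute.
parent = ["Yes"] * 6 + ["No"] * 6
left = ["Yes"] * 5 + ["No"] * 1
right = ["Yes"] * 1 + ["No"] * 5
gain = information_gain(parent, [left, right])
```

A split that leaves the class mix unchanged yields a gain of 0, while a split into perfectly pure groups recovers the full parent entropy.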


Computing Information Gain

Example:

Left entropy, right entropy


Tree-Structured Model

Classification tree, or decision tree

An internal node represents a test on an attribute

A leaf node represents a class

Example:


An example decision tree for classifying the person-shaped figures shown earlier


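Computing Probabilities

Labeling each leaf of a decision tree with the probability of its target attribute value makes the tree more informative

To reflect confidence, the estimate is usually smoothed with the formula below rather than taken as the raw frequency (Laplace correction)

$$p(c) = \frac{n + 1}{n + m + 2}$$

where n is the number of examples in the leaf that belong to class c and m is the number that do not

Example: 2 out of 2 examples are + vs. 20 out of 20 examples are + (p = 0.75 vs. p ≈ 0.95)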
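The two numbers in the example are consistent with the two-class Laplace correction p = (n + 1) / (n + m + 2), where n counts the leaf's examples in the class and m the rest. A minimal sketch, with a hypothetical function name:

```python
def laplace_estimate(n_class, n_other):
    """Smoothed probability (n + 1) / (n + m + 2) for a two-class leaf."""
    return (n_class + 1) / (n_class + n_other + 2)

small_leaf = laplace_estimate(2, 0)    # 3 / 4 = 0.75
large_leaf = laplace_estimate(20, 0)   # 21 / 22, about 0.95
```

Both leaves are 100% positive by raw frequency, but the leaf backed by 20 examples receives the higher estimate.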


Building a Decision Tree

Recursively select the attribute with the highest information gain and split on it

Example:

Information gain of each attribute


Why does the order in which the attributes were selected differ from the information gain ranking above?
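The recursive procedure can be sketched roughly as follows. This is a simplified illustration, not the exact algorithm behind the slides: it handles only categorical attributes, stops when a node is pure or no attributes remain, and uses a made-up six-row dataset that borrows the attribute names from the earlier slide.

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def gain(rows, attr, target):
    """Information gain of splitting `rows` on `attr`."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attr], []).append(row[target])
    weighted = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - weighted

def build_tree(rows, attrs, target):
    labels = [row[target] for row in rows]
    majority = Counter(labels).most_common(1)[0][0]
    # Stop when the node is pure or no attributes remain: return a leaf.
    if len(set(labels)) == 1 or not attrs:
        return majority
    # Gains are recomputed on the rows that reached this node.
    best = max(attrs, key=lambda a: gain(rows, a, target))
    remaining = [a for a in attrs if a != best]
    branches = {}
    for value in {row[best] for row in rows}:
        subset = [row for row in rows if row[best] == value]
        branches[value] = build_tree(subset, remaining, target)
    return {best: branches}

# Hypothetical toy data, for illustration only.
rows = [
    {"body-shape": "rectangular", "body-color": "gray",  "write-off": "Yes"},
    {"body-shape": "rectangular", "body-color": "white", "write-off": "Yes"},
    {"body-shape": "rectangular", "body-color": "white", "write-off": "Yes"},
    {"body-shape": "oval",        "body-color": "gray",  "write-off": "Yes"},
    {"body-shape": "oval",        "body-color": "white", "write-off": "No"},
    {"body-shape": "oval",        "body-color": "white", "write-off": "No"},
]
tree = build_tree(rows, ["body-shape", "body-color"], "write-off")
```

One thing the sketch makes visible: information gain is recomputed on the subset of rows that reaches each node, so the order in which attributes are chosen deeper in the tree need not follow a ranking computed once on the full dataset.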


Model Fitting

Given the data, how should a model be built?

Besides decision trees, what other model-building methods are widely used?


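Classification Using Mathematical Functions

Linear discriminant function


In general, it is defined as follows:

$$f(x) = w_0 + w_1 x_1 + w_2 x_2 + \cdots$$

Example:

If f(x) is positive, classify the example as the + class; if negative, as the − class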
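A minimal sketch of this decision rule, assuming the usual weighted-sum form f(x) = w0 + w1·x1 + w2·x2 + ...; the weights and inputs are hypothetical values chosen only to show both outcomes.

```python
def linear_discriminant(weights, bias, x):
    """f(x) = w0 + w1*x1 + w2*x2 + ..., with `bias` playing the role of w0."""
    return bias + sum(w * xi for w, xi in zip(weights, x))

def classify(weights, bias, x):
    # Positive f(x) -> "+", otherwise "-".
    # Treating f(x) == 0 as "-" is an arbitrary choice made for this sketch.
    return "+" if linear_discriminant(weights, bias, x) > 0 else "-"

# Hypothetical weights, chosen only for illustration.
weights, bias = [1.0, -1.5], 60.0
label_a = classify(weights, bias, [50.0, 20.0])   # f = 60 + 50 - 30 = 80
label_b = classify(weights, bias, [10.0, 80.0])   # f = 60 + 10 - 120 = -50
```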


Optimizing an Objective Function

Which of the following linear boundaries is optimal? By what criterion is this decided?


About the Example Dataset

Iris dataset from the UCI Machine Learning Repository

Target attribute: iris species (3 species)

Attributes: petal length, petal width, sepal length, sepal width


Iris Versicolor

Iris Setosa


Support Vector Machine (SVM)

Tries to maximize the margin between the boundary and the actual data points


In practice the two classes usually cannot be separated cleanly, so a penalty is assigned to each misclassified point

Examples of loss functions
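The slide's figure is not reproduced here, so the following is only an assumption about which losses it showed: hinge loss, the loss commonly associated with SVMs, next to zero-one loss for comparison. Labels are taken to be +1 or −1 and `fx` stands for the value of the decision function f(x).

```python
def hinge_loss(y, fx):
    """Hinge loss for a label y in {-1, +1} and a decision value f(x)."""
    return max(0.0, 1.0 - y * fx)

def zero_one_loss(y, fx):
    """1 if the sign of f(x) disagrees with y (or f(x) is 0), else 0."""
    return 0 if y * fx > 0 else 1

# Hypothetical decision values for a positive example.
outside_margin = hinge_loss(+1, 2.0)    # correct and beyond the margin: 0.0
inside_margin = hinge_loss(+1, 0.5)     # correct but inside the margin: 0.5
wrong_side = hinge_loss(+1, -1.0)       # misclassified: 2.0
```

Unlike zero-one loss, hinge loss grows with the distance of a misclassified point from the boundary, and it also penalizes correctly classified points that fall inside the margin.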


Logistic Regression

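For simplicity, explained here in terms of linear regression

The logistic function is defined as follows, where f(x) is the linear function and p_+(x) is the estimated probability of the + class:

$$p_+(x) = \frac{1}{1 + e^{-f(x)}}$$


For every example in the training data, compute the value g(x, w) below, and adjust the parameters w0, w1, w2, ... so that the total sum is maximized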
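The definition of g(x, w) did not survive in the text, so the sketch below assumes the common choice: g is the model's estimated probability of the example's actual class, i.e. p_+(x) for a + example and 1 − p_+(x) for a − example, with p_+(x) = 1 / (1 + e^(−f(x))). The dataset and candidate weights are hypothetical.

```python
import math

def logistic(z):
    """Logistic function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def g(x, label, w):
    """Estimated probability of the example's actual class (label is '+' or '-')."""
    f = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    p_plus = logistic(f)
    return p_plus if label == "+" else 1.0 - p_plus

def total_g(data, w):
    # The quantity to be maximized over w, following the description in the text.
    return sum(g(x, label, w) for x, label in data)

# Hypothetical one-dimensional training set and two candidate weight vectors.
data = [([2.0], "+"), ([1.0], "+"), ([-1.0], "-"), ([-2.0], "-")]
good = total_g(data, [0.0, 2.0])    # boundary at 0, oriented correctly
bad = total_g(data, [0.0, -2.0])    # same boundary, classes flipped
```

The correctly oriented weights give the larger total. Practical implementations typically maximize the sum of log g (the log-likelihood) rather than the plain sum, usually with a numerical optimizer.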


Non-Linear Functions

In real data, classes are rarely separable by a linear boundary → non-linear boundary

In the example below, the kernel trick of the SVM produces a non-linear boundary that separates the two classes perfectly
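The slide does not say which kernel was used. As an illustration of the idea behind the kernel trick, the sketch below assumes a degree-2 polynomial kernel on 2-D inputs and checks that it equals an ordinary dot product in an explicit 3-D feature space, so a linear boundary in that space corresponds to a non-linear boundary in the original one.

```python
import math

def poly_kernel(x, z):
    """Degree-2 polynomial kernel k(x, z) = (x . z)^2 for 2-D inputs."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def feature_map(x):
    """Explicit 3-D feature map whose dot product equals poly_kernel."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

# Hypothetical points, for illustration only.
x, z = (1.0, 2.0), (3.0, -1.0)
via_kernel = poly_kernel(x, z)                       # computed in the original 2-D space
via_features = dot(feature_map(x), feature_map(z))   # computed in the mapped 3-D space
```

The kernel gives the same value without ever constructing the mapped vectors, which is what makes the approach practical for high-dimensional feature spaces.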


Logistic Regression vs. Tree Induction

Wisconsin Breast Cancer Dataset

Model                  Accuracy
Logistic Regression    98.9%
Decision Tree          99.1%


Overfitting

Generalization vs. Overfitting?

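A model describes the generalizable characteristics found in the given data

Is a model that explains the given data perfectly (with 100% accuracy) necessarily a good one?

Can the given data be regarded as a perfect representative of all the data?

Can the current model handle new data that it has never seen before?

Fitting a model too closely to the details of the data is called overfitting, and it should be avoided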


Holdout: build the model on one part of the given data and test it on the remaining part

As model complexity increases, the error on the training data decreases, but the error on the holdout data does not necessarily do so → overfitting
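A holdout split can be sketched as below. The 30% holdout fraction, the fixed seed, and the function name are assumptions for the example; the slides do not prescribe particular values.

```python
import random

def holdout_split(examples, holdout_fraction=0.3, seed=0):
    """Shuffle a copy of the data and split it into (training, holdout) lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]

# Hypothetical dataset of 10 example ids.
train, holdout = holdout_split(range(10))
```

The model is then fit on `train` only, and its error on `holdout` serves as the estimate of how it behaves on unseen data.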


A typical fitting curve for tree induction


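Overfitting in Mathematical Functions

Adding too many x_i terms to the f(x) below

$$f(x) = w_0 + w_1 x_1 + w_2 x_2 + \cdots$$

The more dimensions (in other words, attributes or features) are added, the more closely the given data can be fit

Modelers sometimes remove some attributes in advance to prevent overfitting → feature selection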


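An Example of Overfitting

To classify a single point correctly, the logistic regression boundary has shifted considerably ← is the shifted boundary really better?


To classify a single point (o) correctly, the logistic regression boundary has shifted considerably ← is the shifted boundary really better?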


Cross-Validation

A widely used method for testing a fitted model more systematically
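The slide does not spell out the procedure, so this sketch assumes the common k-fold variant: the data is divided into k folds, and each fold in turn serves as the test set while the other k − 1 folds are used for training. The function name and fold-assignment scheme are illustrative choices.

```python
def k_fold_indices(n_examples, k):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_examples))
    # Fold i takes every k-th index starting at i, so fold sizes differ by at most one.
    # In practice the data would normally be shuffled before folds are assigned.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

# Hypothetical setting: 10 examples, 5 folds.
splits = list(k_fold_indices(10, 5))
```

Every example is used for testing exactly once, and the k test results can be averaged to give a more stable estimate than a single holdout split.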


Avoiding Overfitting

For tree induction, two strategies are possible

1. Stop growing the tree before it becomes too complex

Example: stop once the number of leaves reaches a preset limit

2. Grow the tree until it is complex, then prune it back to simplify it


Systematic testing through cross-validation

Automatic feature selection

Add a complexity penalty to the objective function

This accounts for the trade-off between model complexity and accuracy
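One common way to write such a penalized objective is shown below. The symbols are assumptions introduced for illustration: fit(x, w) measures how well the weights w fit the training data, penalty(w) measures model complexity (for example the sum of squared weights), and λ controls the trade-off between the two.

```latex
\hat{w} = \arg\max_{w} \left[ \operatorname{fit}(x, w) - \lambda \cdot \operatorname{penalty}(w) \right],
\qquad \text{e.g. } \operatorname{penalty}(w) = \sum_{j} w_j^2
```

A larger λ favors simpler models at the cost of training accuracy; a suitable value is typically chosen by cross-validation.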


Closing Remarks

Organization          What is Predicted
Facebook              Friendship
Allstate              Bodily harm from car crashes
Researchers           HIV progression
New South Wales       Travel time vis-à-vis traffic
Univ. Melbourne       Awarding of grants
Hewlett Foundation    Student grades
Ford Motor Co.        Driver inattentiveness
CareerBuilder         Job applications


No single method is always the best → try several methods and combine their judgments (ensemble)

Netflix contest: "That 20 minutes was worth a million dollars."
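In its simplest form, combining several models can be sketched as a majority vote over their predictions. This is only one of many ensemble schemes and is not necessarily the one used in the contest mentioned above; the predictions are hypothetical.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by the most models.

    On a tie, the label that appears first in `predictions` wins.
    """
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical predictions from three different classifiers for one example.
combined = majority_vote(["Yes", "No", "Yes"])
```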

Data mining tools are easy to use → tools such as R and Weka provide most classification techniques


Thank You! Any Questions?

Phone: 042-350-1617 E-mail: [email protected]

