
Classification

Purpose

• The goal of classification is to assign a new document or set of words to one of a number of predefined groups, in such a way that it fits the members of that group best
– the groups are defined in advance
– supervised learning
– it produces an assignment rule


Introduction to Classification Applications

• Classification = to learn a function that classifies the data into a set of predefined classes.
– predicts categorical class labels (i.e., discrete labels)
– classifies data (constructs a model) based on the training set and on the values (class labels) in a classifying attribute, and then uses the model to classify new database entries

Example: A bank might want to learn a function that determines whether a customer should get a loan or not. Decision trees and Bayesian classifiers are examples of classification algorithms. This is called Credit Scoring.

Other applications: Credit approval; Target marketing; Medical diagnosis; Outcome (e.g., Treatment) analysis.


Classification - a 2-Step Process

• Model Construction (Description): describing a set of predetermined classes = Build the Model.
– Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
– The set of tuples used for model construction = the training set
– The model is represented by classification rules, decision trees, or mathematical formulae

• Model Usage (Prediction): for classifying future or unknown objects, or for predicting missing values = Apply the Model.
– It is important to estimate the accuracy of the model:
• The known label of each test sample is compared with the classified result from the model
• The accuracy rate is the percentage of test set samples that are correctly classified by the model
• The test set is chosen completely independently of the training set, otherwise over-fitting will occur
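The two-step process can be sketched in a few lines of Python. The snippet below is not part of the original lecture; it assumes scikit-learn and its bundled iris data purely for illustration.

# A minimal sketch (assumed code, not from the slides) of the two-step
# process: build the model on a training set, then estimate its accuracy
# on an independently chosen test set.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)              # any labelled dataset will do

# The test set is held out, i.e. chosen independently of the training set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)   # Step 1: build the model

# Step 2: apply the model and compare predicted vs. known test labels.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Accuracy rate on the test set: {accuracy:.2%}")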


When to use Classification Applications?

• If you do not know the types of objects stored in your database, then you should begin with a Clustering algorithm, to find the various clusters (classes) of objects within the DB. This is Unsupervised Learning.

• If you already know the classes of objects in your database, then you should apply Classification algorithms, to classify all remaining (or newly added) objects in the database using the known objects as a training set. This is Supervised Learning.

• If you are still learning about the properties of known objects in the database, then this is Semi-Supervised Learning, which may involve Neural Network techniques.

Document classification

• The document is assigned to one of the classes (or a group of classes) known in advance
• Word set → Category
• The mapping is done with statistical methods based on a training sample
– Bayes
– Decision tree
– K nearest neighbors
– SVM


Issues in Classification - 1

• Data Preparation:
– Data cleaning
• Preprocess data in order to reduce noise and handle missing values
– Relevance analysis (feature selection)
• The “interestingness problem”
• Remove the irrelevant or redundant attributes


Issues in Classification - 3

• Robustness:
– Handling noise and missing values
• Speed and scalability of model
– time to construct the model
– time to use the model
• Scalability of implementation
– ability to handle ever-growing databases
• Interpretability:
– understanding and insight provided by the model
• Goodness of rules
– decision tree size
– compactness of classification rules
• Predictive accuracy


Issues in Classification - 4

• Overfitting
– Definition: If your classifier (machine learning model) fits noise (i.e., pays attention to parts of the data that are irrelevant), then it is overfitting.

[Figure: two fits, one labelled GOOD and one labelled BAD (the overfit one).]

BAYES

Bayesian Methods

• Learning and classification methods based on probability theory (see spelling / POS)
• Bayes theorem plays a critical role
• Build a generative model that approximates how the data is produced
• Uses the prior probability of each category given no information about an item
• Categorization produces a posterior probability distribution over the possible categories given a description of an item


Bayesian Classifiers

• Bayes Theorem: P(C|X) = P(X|C) P(C) / P(X), which states …

posterior = (likelihood x prior) / evidence

• P(C) = prior probability = the probability that any given sample data is in class C, estimated before we have measured the sample data.
• We wish to determine the posterior probability P(C|X) that estimates whether C is the correct class for a given set of sample data X.


Estimating Bayesian Classifiers

• P(C|X) = P(X|C) P(C) / P(X) …
– Estimate P(Cj) by counting the frequency of occurrence of each class Cj in the training data set.*
– Estimate P(Xk) by counting the frequency of occurrence of each attribute value Xk in the data.*
– Estimate P(Xk | Cj) by counting how often the attribute value Xk occurs in class Cj in the training data set.*
– Calculate the desired end result P(Cj | Xk), which is the classification = the probability that Cj is the correct class for a data item having attribute Xk.

(*Estimating these probabilities can be computationally very expensive for very large data sets.)


Example of Bayes Classification

• Show sample database
• Show application of Bayes theorem:
– Use sample database as the “set of priors”
– Use Bayes results to classify new data


Example of Bayesian Classification:

• Suppose that you have a database D that contains characteristics of a large number of different kinds of cars that are sorted according to each car’s manufacturer = the car’s classification C.
• Suppose one of the attributes X in D is the car’s “color”.
• Measure P(C) from the frequency of different manufacturers in D.
• Measure P(X) from the frequency of different colors among the cars in D. (This estimate is made independent of manufacturer.)
• Measure P(X|C) from the frequency of cars with color X made by manufacturer C.
• Okay, now you see a red car flying down the beltway. What is the car’s make (manufacturer)? You can estimate the likelihood that the car is from a given manufacturer C by calculating P(C|X) via Bayes Theorem:
– P(C|X) = P(X|C) P(C) / P(X) (The class is “C” when P(C|X) is a maximum.)
• With only one attribute, this is a trivial result, and not very informative. However, using a larger set of attributes (e.g., two-door, with sun roof) leads to a much better classification estimator: an example of a Bayes Belief Network.


Sample Database for Bayes Classification Example

x = car color
C = class of car (manufacturer)

Car Database:

Tuple   x       C
1       red     honda
2       blue    honda
3       white   honda
4       red     chevy
5       blue    chevy
6       white   chevy
7       red     toyota
8       white   toyota
9       white   toyota
10      red     chevy
11      white   ford
12      white   ford
13      blue    ford
14      red     chevy
15      red     dodge

Some statistical results:
x1 = red     P(x1) = 6/15
x2 = white   P(x2) = 6/15
x3 = blue    P(x3) = 3/15

C1 = chevy   P(C1) = 5/15
C2 = honda   P(C2) = 3/15
C3 = toyota  P(C3) = 3/15
C4 = ford    P(C4) = 3/15
C5 = dodge   P(C5) = 1/15


Application #1 of Bayes Theorem

• Recall the theorem: P(C|X) = P(X|C) P(C) / P(X)
• From the last slide, we know P(C) and P(X). Calculate P(X|C) and then we can perform the classification.

Example #1: We see a red car. What type of car is it?

P(C | red) = P(red | C) * P(C) / P(red)
P(red | chevy) = 3/5
P(red | honda) = 1/3
P(red | toyota) = 1/3
P(red | ford) = 0/3
P(red | dodge) = 1/1

Therefore ...

P(chevy | red) = 3/5 * 5/15 * 15/6 = 3/6 = 50%
P(honda | red) = 1/3 * 3/15 * 15/6 = 1/6 = 17%
P(toyota | red) = 1/3 * 3/15 * 15/6 = 1/6 = 17%
P(ford | red) = 0
P(dodge | red) = 1/1 * 1/15 * 15/6 = 1/6 = 17%
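A minimal Python sketch (not from the slides) that reproduces these numbers by plain counting over the car database shown above:

# Bayes' theorem by counting: P(C|X) = P(X|C) * P(C) / P(X)
from collections import Counter

cars = [("red", "honda"), ("blue", "honda"), ("white", "honda"),
        ("red", "chevy"), ("blue", "chevy"), ("white", "chevy"),
        ("red", "toyota"), ("white", "toyota"), ("white", "toyota"),
        ("red", "chevy"), ("white", "ford"), ("white", "ford"),
        ("blue", "ford"), ("red", "chevy"), ("red", "dodge")]

n = len(cars)
color_counts = Counter(color for color, _ in cars)      # numerators for P(X)
class_counts = Counter(make for _, make in cars)        # numerators for P(C)

def posterior(make, color):
    """P(C|X), with every probability estimated as a relative frequency."""
    p_x_given_c = sum(1 for c, m in cars if c == color and m == make) / class_counts[make]
    p_c = class_counts[make] / n
    p_x = color_counts[color] / n
    return p_x_given_c * p_c / p_x

for make in class_counts:
    print(f"P({make} | red) = {posterior(make, 'red'):.2f}")
# chevy comes out at 0.50; honda, toyota and dodge at 0.17; ford at 0.00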


Results from Bayes Example #1

• Therefore, the red car is most likely a Chevy (maybe a Camaro or Corvette?).
• The red car is unlikely to be a Ford.
• We choose the most probable class as the classification of the new data item (red car): therefore, Classification = C1 (Chevy).


Application #2 of Bayes Theorem

• Recall the theorem: P(C|X) = P(X|C) P(C) / P(X)

Example #2: We see a white car. What type of car is it?

P(C | white) = P(white | C) * P(C) / P(white)
P(white | chevy) = 1/5
P(white | honda) = 1/3
P(white | toyota) = 2/3
P(white | ford) = 2/3
P(white | dodge) = 0/1

Therefore ...

P(chevy | white) = 1/5 * 5/15 * 15/6 = 1/6 = 17%
P(honda | white) = 1/3 * 3/15 * 15/6 = 1/6 = 17%
P(toyota | white) = 2/3 * 3/15 * 15/6 = 2/6 = 33%
P(ford | white) = 2/3 * 3/15 * 15/6 = 2/6 = 33%
P(dodge | white) = 0


Results from Bayes Example #2

• Therefore, the white car is equally likely to be a Ford or a Toyota.
• The white car is unlikely to be a Dodge.
• If we choose the most probable class as the classification, we have a tie. You can either pick one of the two classes randomly (if you must pick), or else weight each class 0.50 in the output classification (C3, C4), if a probabilistic classification is permitted.


Why Use Bayesian Classification?

• Probabilistic Learning: Allows you to calculate explicit probabilities for a hypothesis -- “learn as you go”. This is among the most practical approaches to certain types of learning problems (e.g., e-mail Spam detection).

• Incremental: Each training example can incrementally increase/decrease the probability that a hypothesis is correct.

• Data-Driven: Prior knowledge can be combined with observed data.

• Probabilistic Prediction: Allows you to predict multiple hypotheses, each weighted by their own probabilities.

• The Standard: Bayesian methods provide a standard of optimal decision-making against which other methods can be compared.


Naïve Bayesian Classification

• Naïve Bayesian Classification assumes that, within each class C, all attributes are independent of one another.

• Naïve Bayes assumption: attribute independence

P(x1,…,xk|C) = P(x1|C)·…·P(xk|C)

(= a simple product of probabilities)

• P(xi|C) is estimated as the relative frequency of samples in class C whose attribute “i” has the value “xi”.

• This assumes that there is no correlation in the attribute values x1,…,xk (attribute independence)
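As an illustration only (not the lecture's code), a tiny naive Bayes scorer in Python that multiplies the per-attribute relative frequencies; the two-attribute toy data (color, body style) is hypothetical.

# Naive Bayes assumption: P(x1,...,xk | C) = P(x1|C) * ... * P(xk|C)
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """rows: list of attribute tuples, labels: list of class labels."""
    priors = Counter(labels)
    cond = defaultdict(Counter)                 # (class, attribute index) -> value counts
    for row, c in zip(rows, labels):
        for i, value in enumerate(row):
            cond[(c, i)][value] += 1
    return priors, cond, len(labels)

def score(priors, cond, n, row, c):
    """Unnormalised posterior: P(C) * product over i of P(x_i | C)."""
    p = priors[c] / n
    for i, value in enumerate(row):
        p *= cond[(c, i)][value] / priors[c]    # relative frequency within class C
    return p

rows   = [("red", "2door"), ("red", "4door"), ("white", "4door"), ("red", "2door")]
labels = ["chevy", "ford", "ford", "chevy"]
priors, cond, n = train_naive_bayes(rows, labels)
best = max(priors, key=lambda c: score(priors, cond, n, ("red", "2door"), c))
print(best)   # -> chevy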


The Independence Hypothesis…

• … makes the computation possible (tractable)

• … yields optimal classifiers when satisfied

• … but is seldom satisfied in practice, as attributes (variables) are often correlated.

• Some approaches to overcome this limitation:
– Bayesian networks, that combine Bayesian reasoning with causal relationships between attributes
– Decision trees, that reason on one attribute at a time, considering most important attributes first

DECISION TREES

Decision Tree Based Classification

• Advantages:
– Inexpensive to construct
– Extremely fast at classifying unknown records
– Easy to interpret for small-sized trees
– Accuracy is comparable to other classification techniques for many simple data sets


Decision trees

• Decision trees are popular for pattern recognition because the models they produce are easier to understand.

[Figure: a decision tree growing from a root node.
A. Nodes of the tree
B. Leaves (terminal nodes) of the tree
C. Branches (decision points) of the tree]


Weather Data: Play or not Play?

Outlook    Temperature    Humidity    Windy    Play?

sunny hot high false No

sunny hot high true No

overcast hot high false Yes

rain mild high false Yes

rain cool normal false Yes

rain cool normal true No

overcast cool normal true Yes

sunny mild high false No

sunny cool normal false Yes

rain mild normal false Yes

sunny mild normal true Yes

overcast mild high true Yes

overcast hot normal false Yes

rain mild high true No

Note: Outlook is the forecast, no relation to the Microsoft email program


Example Tree for “Play?”

[Figure: a decision tree with Outlook at the root. sunny → Humidity (high → No, normal → Yes); overcast → Yes; rain → Windy (true → No, false → Yes).]


Building Decision Tree [Q93]

• Top-down tree construction
– At start, all training examples are at the root.
– Partition the examples recursively by choosing one attribute each time.
• Bottom-up tree pruning
– Remove subtrees or branches, in a bottom-up manner, to improve the estimated accuracy on new cases.


Choosing the Splitting Attribute

• At each node, available attributes are evaluated on the basis of separating the classes of the training examples. A Goodness function is used for this purpose.

• Typical goodness functions:
– information gain (ID3/C4.5)
– information gain ratio
– Gini index


Which attribute to select?


A criterion for attribute selection

• Which is the best attribute?
– The one which will result in the smallest tree
– Heuristic: choose the attribute that produces the “purest” nodes
• Popular impurity criterion: information gain
– Information gain increases with the average purity of the subsets that an attribute produces
• Strategy: choose the attribute that results in the greatest information gain


Computing information

• Information is measured in bits
– Given a probability distribution, the info required to predict an event is the distribution’s entropy
– Entropy gives the information required in bits (this can involve fractions of bits!)
• Formula for computing the entropy:

entropy(p1, p2, ..., pn) = -p1 log p1 - p2 log p2 - ... - pn log pn


Alternative Splitting Criteria based on INFO

• Entropy at a given node t:

Entropy(t) = - Σ_j p(j | t) log p(j | t)

(NOTE: p(j | t) is the relative frequency of class j at node t.)

– Measures the homogeneity of a node.
• Maximum (log nc) when records are equally distributed among all classes, implying least information
• Minimum (0.0) when all records belong to one class, implying most information
– Entropy-based computations are similar to the GINI index computations

Examples for computing Entropy

Entropy(t) = - Σ_j p(j | t) log2 p(j | t)

C1 = 0, C2 = 6:
P(C1) = 0/6 = 0   P(C2) = 6/6 = 1
Entropy = - 0 log 0 - 1 log 1 = - 0 - 0 = 0

C1 = 1, C2 = 5:
P(C1) = 1/6   P(C2) = 5/6
Entropy = - (1/6) log2 (1/6) - (5/6) log2 (5/6) = 0.65

C1 = 2, C2 = 4:
P(C1) = 2/6   P(C2) = 4/6
Entropy = - (2/6) log2 (2/6) - (4/6) log2 (4/6) = 0.92
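A small Python helper (a sketch, not from the slides) reproduces the three entropy values above:

# Entropy of a class-count distribution, log base 2, with 0*log(0) taken as 0.
import math

def entropy(counts):
    total = sum(counts)
    probs = [c / total for c in counts]
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0, 6]))   # 0.0
print(entropy([1, 5]))   # ~0.65
print(entropy([2, 4]))   # ~0.92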


Example: attribute “Outlook”, 1


Outlook Temperature Humidity Windy Play?

sunny hot high false No

sunny hot high true No

overcast hot high false Yes

rain mild high false Yes

rain cool normal false Yes

rain cool normal true No

overcast cool normal true Yes

sunny mild high false No

sunny cool normal false Yes

rain mild normal false Yes

sunny mild normal true Yes

overcast mild high true Yes

overcast hot normal false Yes

rain mild high true No


Example: attribute “Outlook”, 2

• “Outlook” = “Sunny”:

info([2,3]) = entropy(2/5, 3/5) = -2/5 log(2/5) - 3/5 log(3/5) = 0.971 bits

• “Outlook” = “Overcast”:

info([4,0]) = entropy(1, 0) = -1 log(1) - 0 log(0) = 0 bits

• “Outlook” = “Rainy”:

info([3,2]) = entropy(3/5, 2/5) = -3/5 log(3/5) - 2/5 log(2/5) = 0.971 bits

Note: log(0) is not defined, but we evaluate 0*log(0) as zero

• Expected information for the attribute:

info([2,3], [4,0], [3,2]) = (5/14) × 0.971 + (4/14) × 0 + (5/14) × 0.971 = 0.693 bits


Computing the information gain

• Information gain: (information before split) - (information after split)

gain("Outlook") = info([9,5]) - info([2,3], [4,0], [3,2]) = 0.940 - 0.693 = 0.247 bits

• Compute for attribute “Humidity”


Example: attribute “Humidity”

• “Humidity” = “High”:

info([3,4]) = entropy(3/7, 4/7) = -3/7 log(3/7) - 4/7 log(4/7) = 0.985 bits

• “Humidity” = “Normal”:

info([6,1]) = entropy(6/7, 1/7) = -6/7 log(6/7) - 1/7 log(1/7) = 0.592 bits

• Expected information for the attribute:

info([3,4], [6,1]) = (7/14) × 0.985 + (7/14) × 0.592 = 0.788 bits

• Information Gain:

gain("Humidity") = info([9,5]) - info([3,4], [6,1]) = 0.940 - 0.788 = 0.152 bits


Computing the information gain

• Information gain: (information before split) - (information after split)

gain("Outlook") = info([9,5]) - info([2,3], [4,0], [3,2]) = 0.940 - 0.693 = 0.247 bits

• Information gain for the attributes from the weather data:

gain("Outlook") = 0.247 bits
gain("Temperature") = 0.029 bits
gain("Humidity") = 0.152 bits
gain("Windy") = 0.048 bits
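The sketch below (assumed code, not the lecture's) recomputes these gains from the weather table with a generic info_gain helper:

# Information gain = entropy before the split - weighted entropy after the split.
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def info_gain(rows, labels, attr_index):
    before = entropy(labels)
    after = 0.0
    for value in set(row[attr_index] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attr_index] == value]
        after += len(subset) / len(labels) * entropy(subset)
    return before - after

# rows follow the weather table: (Outlook, Temperature, Humidity, Windy)
rows = [("sunny","hot","high",False), ("sunny","hot","high",True),
        ("overcast","hot","high",False), ("rain","mild","high",False),
        ("rain","cool","normal",False), ("rain","cool","normal",True),
        ("overcast","cool","normal",True), ("sunny","mild","high",False),
        ("sunny","cool","normal",False), ("rain","mild","normal",False),
        ("sunny","mild","normal",True), ("overcast","mild","high",True),
        ("overcast","hot","normal",False), ("rain","mild","high",True)]
play = ["No","No","Yes","Yes","Yes","No","Yes","No","Yes","Yes","Yes","Yes","Yes","No"]

for name, i in [("Outlook", 0), ("Temperature", 1), ("Humidity", 2), ("Windy", 3)]:
    print(f'gain("{name}") = {info_gain(rows, play, i):.3f} bits')
# -> roughly 0.247, 0.029, 0.152, 0.048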


Continuing to split

Within the “sunny” branch:

gain("Temperature") = 0.571 bits
gain("Humidity") = 0.971 bits
gain("Windy") = 0.020 bits


The final decision tree

• Note: not all leaves need to be pure; sometimes identical instances have different classes
• Splitting stops when the data can’t be split any further


Highly-branching attributes

• Problematic: attributes with a large number of values (extreme case: ID code)

• Subsets are more likely to be pure if there is a large number of values
⇒ Information gain is biased towards choosing attributes with a large number of values
⇒ This may result in overfitting (selection of an attribute that is non-optimal for prediction)


Weather Data with ID code

ID    Outlook    Temperature    Humidity    Windy    Play?

A sunny hot high false No

B sunny hot high true No

C overcast hot high false Yes

D rain mild high false Yes

E rain cool normal false Yes

F rain cool normal true No

G overcast cool normal true Yes

H sunny mild high false No

I sunny cool normal false Yes

J rain mild normal false Yes

K sunny mild normal true Yes

L overcast mild high true Yes

M overcast hot normal false Yes

N rain mild high true No


Split for ID Code Attribute

Entropy of the split = 0 (since each leaf node is “pure”, having only one case).

Information gain is maximal for ID code.


Gain ratio

• Gain ratio: a modification of the information gain that reduces its bias towards high-branch attributes
• Gain ratio should be
– Large when data is evenly spread
– Small when all data belong to one branch
• Gain ratio takes the number and size of branches into account when choosing an attribute
– It corrects the information gain by taking the intrinsic information of a split into account (i.e., how much info do we need to tell which branch an instance belongs to)


Gain Ratio and Intrinsic Info.

• Intrinsic information: entropy of the distribution of instances into branches:

IntrinsicInfo(S, A) = - Σ_i (|S_i| / |S|) log2 (|S_i| / |S|)

• Gain ratio (Quinlan ’86) normalizes the info gain by the intrinsic information:

GainRatio(S, A) = Gain(S, A) / IntrinsicInfo(S, A)


Computing the gain ratio

• Example: intrinsic information for ID code:

info([1,1,...,1]) = 14 × (-(1/14) log (1/14)) = 3.807 bits

• The importance of an attribute decreases as its intrinsic information gets larger
• Example of gain ratio:

gain_ratio("Attribute") = gain("Attribute") / intrinsic_info("Attribute")

• Example:

gain_ratio("ID_code") = 0.940 bits / 3.807 bits = 0.246
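A short Python sketch (assumed helper names, not the lecture's code) of the gain-ratio correction, reproducing the ID-code numbers above:

# Gain ratio = information gain divided by the intrinsic information of the split.
import math

def intrinsic_info(subset_sizes):
    total = sum(subset_sizes)
    return -sum(s / total * math.log2(s / total) for s in subset_sizes if s)

def gain_ratio(gain, subset_sizes):
    return gain / intrinsic_info(subset_sizes)

# ID code splits the 14 instances into 14 one-element branches:
print(intrinsic_info([1] * 14))          # ~3.807 bits
print(gain_ratio(0.940, [1] * 14))       # ~0.247
# Outlook splits them into branches of sizes 5, 4 and 5:
print(gain_ratio(0.247, [5, 4, 5]))      # ~0.157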


More on the gain ratio

• “Outlook” still comes out top
• However: “ID code” has greater gain ratio
– Standard fix: an ad hoc test to prevent splitting on that type of attribute
• Problem with gain ratio: it may overcompensate
– May choose an attribute just because its intrinsic information is very low
– Standard fix:
• First, only consider attributes with greater than average information gain
• Then, compare them on gain ratio


*CART Splitting Criteria: Gini Index

• If a data set T contains examples from n classes, the Gini index gini(T) is defined as

gini(T) = 1 - Σ_{j=1..n} (p_j)²

where p_j is the relative frequency of class j in T.

gini(T) is minimized if the classes in T are skewed.
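A minimal sketch of the Gini index as defined above (not part of the original slides):

# gini(T) = 1 - sum over classes of p_j^2, computed from class counts.
def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

print(gini([6, 0]))   # 0.0 -> perfectly skewed (pure) node
print(gini([3, 3]))   # 0.5 -> evenly split node (maximum for two classes)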


Discussion

• Algorithm for top-down induction of decision trees (“ID3”) was developed by Ross Quinlan
– Gain ratio just one modification of this basic algorithm
– Led to development of C4.5, which can deal with numeric attributes, missing values, and noisy data
• Similar approach: CART (to be covered later)
• There are many other attribute selection criteria! (But almost no difference in accuracy of result.)


C4.5 History

• ID3, CHAID – 1960s
• C4.5 innovations (Quinlan):
– permit numeric attributes
– deal sensibly with missing values
– pruning to deal with noisy data
• C4.5 - one of the best-known and most widely-used learning algorithms
– Last research version: C4.8, implemented in Weka as J4.8 (Java)
– Commercial successor: C5.0 (available from Rulequest)

How to Address Overfitting

• Pre-Pruning (Early Stopping Rule)
– Stop the algorithm before it becomes a fully-grown tree
– Typical stopping conditions for a node:
• Stop if all instances belong to the same class
• Stop if all the attribute values are the same
– Based on statistical significance test
– Stop growing the tree when there is no statistically significant association between any attribute and the class at a particular node
– More restrictive conditions:
• Stop if the number of instances is less than some user-specified threshold
• Stop if the class distribution of instances is independent of the available features (e.g., using a χ² test)
• Stop if expanding the current node does not improve impurity measures (e.g., Gini or information gain)

How to Address Overfitting…

• Post-pruning
– Grow the decision tree to its entirety
– Trim the nodes of the decision tree in a bottom-up fashion
– If generalization error improves after trimming, replace the sub-tree by a leaf node
– The class label of the leaf node is determined from the majority class of instances in the sub-tree
– Post-pruning is preferred in practice: pre-pruning can “stop too early”


Subtree replacement

• Bottom-up
• Consider replacing a tree only after considering all its subtrees


Estimating error rates

• Prune only if it reduces the estimated error
• Error on the training data is NOT a useful estimator (Q: why would it result in very little pruning?)
• Use a hold-out set for pruning (“reduced-error pruning”)
• C4.5’s method
– Derive a confidence interval from the training data
– Use a heuristic limit, derived from this, for pruning
– Standard Bernoulli-process-based method
– Shaky statistical assumptions (based on training data)


Extracting Classification Rules from Trees

• Represent the knowledge in the form of IF-THEN rules
• One rule is created for each path from the root to a leaf
• Each attribute-value pair along a path forms a conjunction
• The leaf node holds the class prediction
• Rules are easier for humans to understand
• Example:

IF age = “<=30” AND student = “no” THEN buys_computer = “no”
IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
IF age = “31…40” THEN buys_computer = “yes”
IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “yes”
IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “no”

K NEAREST NEIGHBOR

K Nearest Neighbor (KNN):

• The training set includes the classes.
• Examine the K items nearest to the item to be classified.
• The new item is placed in the class with the greatest number of close items.
• O(q) for each tuple to be classified. (Here q is the size of the training set.)

The k-Nearest Neighbor Algorithm

• All instances correspond to points in the n-D space.
• The nearest neighbors are defined in terms of Euclidean distance.
• The target function could be discrete- or real-valued.
• For discrete-valued targets, k-NN returns the most common value among the k training examples nearest to xq.
• Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples.


Discussion on the k-NN Algorithm

• The k-NN algorithm for continuous-valued target functions
– Calculate the mean values of the k nearest neighbors
• Distance-weighted nearest neighbor algorithm
– Weight the contribution of each of the k neighbors according to their distance to the query point xq
• giving greater weight to closer neighbors
– Similarly, for real-valued target functions
• Robust to noisy data by averaging the k nearest neighbors
• Curse of dimensionality: the distance between neighbors could be dominated by irrelevant attributes
– To overcome it, stretch the axes or eliminate the least relevant attributes

w ≡ 1 / d(xq, xi)²
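A possible Python rendering (an assumption, not the lecture's code) of distance-weighted voting with w = 1/d²:

# Each of the k neighbours contributes a weight 1/d^2 to its class label.
from collections import defaultdict

def weighted_vote(neighbors):
    """neighbors: list of (distance, class_label) pairs for the k nearest items."""
    votes = defaultdict(float)
    for d, label in neighbors:
        votes[label] += 1.0 / (d ** 2 + 1e-12)   # epsilon guards against d == 0
    return max(votes, key=votes.get)

print(weighted_vote([(0.5, "+"), (1.0, "-"), (2.0, "-")]))   # -> "+"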


K Nearest Neighbors

• K Nearest Neighbors
– Advantages
• Nonparametric architecture
• Simple
• Powerful
• Requires no training time
– Disadvantages
• Memory intensive
• Classification/estimation is slow


K Nearest Neighbors

• The key issues involved in training this model include setting
– the variable K
• Validation techniques (e.g., cross validation)
– the type of distance metric
• Euclidean measure

Dist(X, Y) = Σ_{i=1..D} (Xi - Yi)²


Figure: K Nearest Neighbors Example

[Figure: stored training set patterns, with X marking the input pattern for classification; dashed lines show the Euclidean distance measure to the nearest three patterns.]


Store all input data in the training set.

For each pattern in the test set:

– Search for the K nearest patterns to the input pattern using a Euclidean distance measure.
– For classification, compute the confidence for each class as Ci / K (where Ci is the number of patterns among the K nearest patterns belonging to class i).
– The classification for the input pattern is the class with the highest confidence.
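A compact Python sketch of this procedure (assumed code; the tiny training set is hypothetical):

# KNN: find the K nearest training patterns and report class confidences Ci / K.
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """train: list of (feature_vector, class_label); query: feature vector."""
    dist = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    confidences = {c: n / k for c, n in Counter(label for _, label in nearest).items()}
    return max(confidences, key=confidences.get), confidences

train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((5.0, 5.0), "B"), ((4.8, 5.2), "B")]
print(knn_classify(train, (1.1, 1.0), k=3))   # -> ('A', {'A': 0.67, 'B': 0.33})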


Training parameters and typical settings

• Number of nearest neighbors
– The number of nearest neighbors (K) should be chosen based on cross validation over a number of K settings.
– K = 1 is a good baseline model to benchmark against.
– A good rule of thumb is that K should be less than the square root of the total number of training patterns.


Training parameters and typical settings

• Input compression
– Since KNN is very storage intensive, we may want to compress data patterns as a preprocessing step before classification.
– Using input compression will result in slightly worse performance.
– Sometimes using compression will improve performance because it performs automatic normalization of the data, which can equalize the effect of each input in the Euclidean distance measure.

SVM

Support Vector Machines (SVM)

[Figure: support vectors; maximize the margin]

• The SVM maximizes the margin between the separating hyperplanes.
• The decision function is completely determined by a subset of the training data, the support vectors.
• A quadratic programming problem
• Many consider it the most successful text classification method

The SVM Method

• In the basic case the space is split with a linear shape, such that the separating element best divides the objects belonging to the different classes
• Typical application: two-class problems, e.g. spam filtering; linearly separable cases
• Input data:
– the objects together with their class membership data (xi, yi)
– the goal is to determine the hyperplane that gives the best separation
• Measuring the quality of the separation:
– the size of the distance between the separating margins
• The condition for separation is that items with different class values end up on opposite sides

Example of a linearly non-separable case

Let us look for a hyperplane that penalizes the points lying on the “wrong side”.

Penalizing overlapping points

For every point, define its distance from the separator ax + by = c as
(ax + by) - c for red points,
c - (ax + by) for blue points.

For overlapping points this distance will be negative.

Classification with an SVM

• Given a new point (x1, x2), determine its projection onto the normal of the hyperplane:
– Compute: score = w · x + b
– In two dimensions: score = w1·x1 + w2·x2 + b
– Choose a confidence threshold t.

Score > t: yes
Score < -t: no
Otherwise: we don’t know
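A minimal sketch of this decision rule in Python; the weights w, bias b and threshold t below are hypothetical placeholders for values learned by an SVM:

# Score a new point against a learned linear separator: score = w . x + b,
# with a confidence threshold t deciding yes / no / don't know.
def svm_decision(w, b, x, t):
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    if score > t:
        return "yes"
    if score < -t:
        return "no"
    return "don't know"

w, b, t = (0.8, -0.5), 0.1, 0.25          # hypothetical trained parameters
print(svm_decision(w, b, (1.0, 0.2), t))  # score = 0.8 - 0.1 + 0.1 = 0.8 -> "yes"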
