![Page 1: ppt](https://reader033.vdocuments.site/reader033/viewer/2022061215/547f355cb4af9fab7c8b47f2/html5/thumbnails/1.jpg)
Statistische Methoden in der Computerlinguistik (Statistical Methods in Computational Linguistics)
7. Machine Learning
Jonas Kuhn
Universität Potsdam, 2007
![Page 2: ppt](https://reader033.vdocuments.site/reader033/viewer/2022061215/547f355cb4af9fab7c8b47f2/html5/thumbnails/2.jpg)
Addition: Good-Turing Smoothing
- Use discounted estimates (r*) only up to a threshold k.
- For types that occurred more often than k times, relative frequency estimates are used (Jurafsky/Martin, p. 216).
- With threshold k:
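The thresholded scheme can be sketched as follows; the function name and the fallback to the raw count when no types of frequency r+1 were observed are assumptions for this sketch, not from the slides:

```python
from collections import Counter

def gt_counts(counts, k=5):
    """Good-Turing discounted counts r* = (r+1) * N_{r+1} / N_r for r <= k.

    Above the threshold k (or when N_{r+1} = 0, an assumed fallback) the raw
    count r is kept, i.e. the relative-frequency estimate is used.
    """
    # N_r = number of types observed exactly r times ("frequency of frequency")
    freq_of_freq = Counter(counts.values())
    adjusted = {}
    for t, r in counts.items():
        if r <= k and freq_of_freq.get(r + 1, 0) > 0:
            adjusted[t] = (r + 1) * freq_of_freq[r + 1] / freq_of_freq[r]
        else:
            adjusted[t] = r
    return adjusted
```

For example, in `Counter("abracadabra")` the type `c` occurs once, and two types occur twice, so its discounted count is (1+1) · N₂/N₁ = 2 · 2/2 = 2.0.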
![Page 3: ppt](https://reader033.vdocuments.site/reader033/viewer/2022061215/547f355cb4af9fab7c8b47f2/html5/thumbnails/3.jpg)
Machine Learning Overview
Tom Mitchell (1997): Machine Learning, McGraw Hill.
Slides mainly based on slides by Joakim Nivre and Tom Mitchell
![Page 4: ppt](https://reader033.vdocuments.site/reader033/viewer/2022061215/547f355cb4af9fab7c8b47f2/html5/thumbnails/4.jpg)
Machine Learning
Idea: Synthesize computer programs by learning from representative examples of input (and output) data.

Rationale:
1. For many problems, there is no known method for computing the desired output from a set of inputs.
2. For other problems, computation according to the known correct method may be too expensive.
![Page 5: ppt](https://reader033.vdocuments.site/reader033/viewer/2022061215/547f355cb4af9fab7c8b47f2/html5/thumbnails/5.jpg)
Well-Posed Learning Problems
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
Examples:
1. Learning to classify chemical compounds
2. Learning to drive an autonomous vehicle
3. Learning to play bridge
4. Learning to parse natural language sentences
![Page 6: ppt](https://reader033.vdocuments.site/reader033/viewer/2022061215/547f355cb4af9fab7c8b47f2/html5/thumbnails/6.jpg)
Designing a Learning System
In designing a learning system, we have to deal with (at least) the following issues:
1. Training experience
2. Target function
3. Learned function
4. Learning algorithm
Example: Consider the task T of parsing English sentences, using the performance measure P of labeled precision and recall in a given test corpus (gold standard).
![Page 7: ppt](https://reader033.vdocuments.site/reader033/viewer/2022061215/547f355cb4af9fab7c8b47f2/html5/thumbnails/7.jpg)
Training Experience
Issues concerning the training experience:
1. Direct or indirect evidence (supervised or unsupervised).
2. Controlled or uncontrolled sequence of training examples.
3. Representativity of training data in relation to test data.

Training data for a syntactic parser:
1. Treebank versus raw text corpus.
2. Constructed test suite versus random sample.
3. Training and test data from the same/similar/different sources with the same/similar/different annotations.
![Page 8: ppt](https://reader033.vdocuments.site/reader033/viewer/2022061215/547f355cb4af9fab7c8b47f2/html5/thumbnails/8.jpg)
Target Function and Learned Function
The problem of improving performance can often be reduced to the problem of learning some particular target function. A shift-reduce parser can be trained by learning a transition function f : C → C, where C is the set of possible parser configurations.

In many cases we can only hope to acquire some approximation to the ideal target function. The transition function f can be approximated by a function f̂ : Symbol → Action from stack (top) symbols to parse actions.
![Page 9: ppt](https://reader033.vdocuments.site/reader033/viewer/2022061215/547f355cb4af9fab7c8b47f2/html5/thumbnails/9.jpg)
Learning Algorithm
In order to learn the (approximated) target function we require:
1. A set of training examples (input arguments)
2. A rule for estimating the value corresponding to each training example (if this is not directly available)
3. An algorithm for choosing the function that best fits the training data

Given a treebank on which we can simulate the shift-reduce parser, we may decide to choose the function that maps each stack symbol to the action that occurs most frequently when that symbol is on top of the stack.
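A minimal sketch of this most-frequent-action rule; the event list and the symbol/action names are invented for illustration, standing in for pairs collected by simulating the parser on a treebank:

```python
from collections import Counter, defaultdict

# Hypothetical (stack-top symbol, parser action) pairs, as they might be
# recorded while simulating a shift-reduce parser on a treebank.
events = [("NP", "reduce"), ("NP", "reduce"), ("NP", "shift"),
          ("DT", "shift"), ("DT", "shift")]

counts = defaultdict(Counter)
for symbol, action in events:
    counts[symbol][action] += 1

# Approximate the transition function: map each stack symbol to the action
# observed most frequently when that symbol is on top of the stack.
f_hat = {symbol: c.most_common(1)[0][0] for symbol, c in counts.items()}
```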
![Page 10: ppt](https://reader033.vdocuments.site/reader033/viewer/2022061215/547f355cb4af9fab7c8b47f2/html5/thumbnails/10.jpg)
Supervised Learning
Let X and Y be the sets of possible inputs and outputs, respectively.
1. Target function: Function f from X to Y.
2. Training data: Finite sequence D of pairs <x, f(x)> (x ∈ X).
3. Hypothesis space: Subset H of functions from X to Y.
4. Learning algorithm: Function A mapping a training set D to a hypothesis h ∈ H.
If Y is a subset of the real numbers, we have a regression problem; otherwise we have a classification problem.
![Page 11: ppt](https://reader033.vdocuments.site/reader033/viewer/2022061215/547f355cb4af9fab7c8b47f2/html5/thumbnails/11.jpg)
Variations of Machine Learning
Unsupervised learning: Learning without output values (data exploration, e.g. clustering).
Query learning: Learning where the learner can query the environment about the output associated with a particular input.
Reinforcement learning: Learning where the learner has a range of actions which it can take to attempt to move towards states where it can expect high rewards.
Batch vs. online learning: All training examples at once or one at a time (with estimate and update after each example).
![Page 12: ppt](https://reader033.vdocuments.site/reader033/viewer/2022061215/547f355cb4af9fab7c8b47f2/html5/thumbnails/12.jpg)
Learning and Generalization
Any hypothesis that correctly classifies all the training examples is said to be consistent. However:
1. The training data may be noisy, so that there is no consistent hypothesis at all.
2. The real target function may be outside the hypothesis space and has to be approximated.
3. A rote learner, which simply outputs y for every x such that <x, y> ∈ D, is consistent but fails to classify any x not in D.

A better criterion of success is generalization, the ability to correctly classify instances not represented in the training data.
![Page 13: ppt](https://reader033.vdocuments.site/reader033/viewer/2022061215/547f355cb4af9fab7c8b47f2/html5/thumbnails/13.jpg)
Approaches to Machine Learning
- Decision trees
- Artificial neural networks
- Bayesian learning
- Instance-based learning (cf. memory-based learning/MBL)
- Genetic algorithms
- Relational learning (cf. inductive logic programming/ILP)

First focus: Naïve Bayes classifier
![Page 14: ppt](https://reader033.vdocuments.site/reader033/viewer/2022061215/547f355cb4af9fab7c8b47f2/html5/thumbnails/14.jpg)
Decision Tree Example: Name Recognition
[Figure: decision tree with root test "Capitalized?" and a second test "Sentence-Initial?", branches labeled Yes/No, leaves labeled 1 (name) and 0 (not a name)]
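Read as code, such a tree is just nested attribute tests. The leaf labels below are one plausible reading of the (partly garbled) figure, not a confirmed reconstruction:

```python
def is_name(capitalized: bool, sentence_initial: bool) -> int:
    """Classify a token as name (1) or non-name (0) with two attribute tests."""
    if not capitalized:
        return 0  # lower-case tokens are not names
    if sentence_initial:
        return 0  # capitalization is uninformative sentence-initially (assumed leaf label)
    return 1      # capitalized and not sentence-initial: classified as a name
```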
![Page 15: ppt](https://reader033.vdocuments.site/reader033/viewer/2022061215/547f355cb4af9fab7c8b47f2/html5/thumbnails/15.jpg)
Decision Tree Example: Name Recognition
![Page 16: ppt](https://reader033.vdocuments.site/reader033/viewer/2022061215/547f355cb4af9fab7c8b47f2/html5/thumbnails/16.jpg)
Bayesian Learning
Two reasons for studying Bayesian learning methods:
1. Efficient learning algorithms for certain kinds of problems
2. Analysis framework for other kinds of learning algorithms

Features of Bayesian learning methods:
1. Assign probabilities to hypotheses (not accept or reject)
2. Combine prior knowledge with observed data
3. Permit hypotheses that make probabilistic predictions
4. Permit predictions based on multiple hypotheses, weighted by their probabilities
![Page 17: ppt](https://reader033.vdocuments.site/reader033/viewer/2022061215/547f355cb4af9fab7c8b47f2/html5/thumbnails/17.jpg)
Towards Naive Bayes Classification
Let H be a hypothesis space defined over the instance space X, where the task is to learn a target function f : X → Y, where Y is a finite set of classes used to classify instances in X, and where a1, …, an are the attributes used to represent an instance x ∈ X.

Bayes' Theorem:

P(y | a1, a2, …, an) = P(a1, a2, …, an | y) P(y) / P(a1, a2, …, an)

Maximize the numerator to find the most probable y for a given x (with its attribute representation a1, …, an).

The numerator is equivalent to the joint probability P(y, a1, a2, …, an).
![Page 18: ppt](https://reader033.vdocuments.site/reader033/viewer/2022061215/547f355cb4af9fab7c8b47f2/html5/thumbnails/18.jpg)
Towards Naive Bayes Classification
Reformulate the numerator (using the chain rule):

P(y, a1, a2, …, an) = P(y) P(a1, a2, …, an | y)
                    = P(y) P(a1 | y) P(a2, …, an | y, a1)
                    = P(y) P(a1 | y) P(a2 | y, a1) P(a3, …, an | y, a1, a2)
                    = …
                    = P(y) P(a1 | y) P(a2 | y, a1) … P(an | y, a1, …, an−1)

(Naïve) conditional independence assumption: for any attributes ai, aj (i ≠ j):

P(ai | y, aj) = P(ai | y)

Hence:

P(y, a1, a2, …, an) = P(y) P(a1 | y) P(a2 | y) … P(an | y)
![Page 19: ppt](https://reader033.vdocuments.site/reader033/viewer/2022061215/547f355cb4af9fab7c8b47f2/html5/thumbnails/19.jpg)
Naive Bayes Classifier
The naive Bayes classification of a new instance is:

argmax_{y ∈ Y} P(y) ∏i P(ai | y)
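A direct rendering of this decision rule, computed in log space for numerical stability; the data-structure layout and the toy probability tables in the test are invented for illustration:

```python
import math

def nb_classify(attrs, priors, cond):
    """Return argmax_y P(y) * prod_i P(a_i | y).

    priors: {class: P(y)}
    cond:   {class: [{attr_value: P(a_i | y)}, ...]}, one dict per attribute.
    Sums of log probabilities replace the product to avoid underflow.
    """
    def score(y):
        return math.log(priors[y]) + sum(
            math.log(cond[y][i][a]) for i, a in enumerate(attrs))
    return max(priors, key=score)
```

With a single capitalization attribute and, say, priors {"name": 0.3, "other": 0.7}, the rule picks whichever class has the larger prior-times-likelihood product.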
![Page 20: ppt](https://reader033.vdocuments.site/reader033/viewer/2022061215/547f355cb4af9fab7c8b47f2/html5/thumbnails/20.jpg)
Learning a Naive Bayes Classifier
Estimate probabilities from training data using maximum-likelihood (ML) estimation:

P_ML(y) = |{x ∈ D | f(x) = y}| / |D|

P_ML(ai | y) = |{x ∈ D | f(x) = y, ai(x) = ai}| / |{x ∈ D | f(x) = y}|

Smooth the probability estimates to compensate for sparse data, e.g. using an m-estimate:

P̂(y) = (|{x ∈ D | f(x) = y}| + mp) / (|D| + m)

P̂(ai | y) = (|{x ∈ D | f(x) = y, ai(x) = ai}| + mp) / (|{x ∈ D | f(x) = y}| + m)

where m is a constant called the equivalent sample size and p is a prior probability (usually assumed to be uniform).
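A sketch of both estimators; setting m = 0 recovers the ML estimate. The accessor functions `f` (class of an example) and `attr` (attribute value of an example) are assumptions of this sketch, not slide notation:

```python
def p_hat_class(D, f, y, m=0.0, p=0.5):
    """m-estimate of P(y): (|{x in D : f(x) = y}| + m*p) / (|D| + m)."""
    n_y = sum(1 for x in D if f(x) == y)
    return (n_y + m * p) / (len(D) + m)

def p_hat_attr(D, f, y, attr, value, m=0.0, p=0.5):
    """m-estimate of P(a_i | y), counting only the examples of class y."""
    in_class = [x for x in D if f(x) == y]
    n_ay = sum(1 for x in in_class if attr(x) == value)
    return (n_ay + m * p) / (len(in_class) + m)
```

The m-estimate behaves as if m additional "virtual" examples, distributed according to p, had been observed, which keeps unseen attribute values away from probability zero.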
![Page 21: ppt](https://reader033.vdocuments.site/reader033/viewer/2022061215/547f355cb4af9fab7c8b47f2/html5/thumbnails/21.jpg)
Naive Bayes
Naive Bayes classifiers work surprisingly well for text classification tasks
![Page 22: ppt](https://reader033.vdocuments.site/reader033/viewer/2022061215/547f355cb4af9fab7c8b47f2/html5/thumbnails/22.jpg)
Decision Tree Learning
Decision trees classify instances by sorting them down the tree from the root to some leaf node, where:
1. Each internal node specifies a test of some attribute.
2. Each branch corresponds to a value for the tested attribute.
3. Each leaf node provides a classification for the instance.

Decision trees represent a disjunction of conjunctions of constraints on the attribute values of instances.
1. Each path from root to leaf specifies a conjunction of tests.
2. The tree itself represents the disjunction of all paths.
![Page 23: ppt](https://reader033.vdocuments.site/reader033/viewer/2022061215/547f355cb4af9fab7c8b47f2/html5/thumbnails/23.jpg)
Example: Name Recognition
[Figure: decision tree testing "Capitalized?" and "Sentence-Initial?", branches labeled Yes/No, leaves labeled 1 (name) and 0 (not a name)]
![Page 24: ppt](https://reader033.vdocuments.site/reader033/viewer/2022061215/547f355cb4af9fab7c8b47f2/html5/thumbnails/24.jpg)
PlayTennis example
![Page 25: ppt](https://reader033.vdocuments.site/reader033/viewer/2022061215/547f355cb4af9fab7c8b47f2/html5/thumbnails/25.jpg)
Appropriate Problems for Decision Tree Learning

- Instances are represented by attribute-value pairs.
- The target function has discrete output values.
- Disjunctive descriptions may be required.
- The training data may contain errors.
- The training data may contain missing attribute values.
![Page 26: ppt](https://reader033.vdocuments.site/reader033/viewer/2022061215/547f355cb4af9fab7c8b47f2/html5/thumbnails/26.jpg)
The ID3 Learning Algorithm
ID3(X = instances, Y = classes, A = attributes):
1. Create a root node R for the tree.
2. If all instances in X are in class y, return R with label y.
3. Else let the decision attribute for R be the attribute a ∈ A that best classifies X, and for each value vi of a:
   (a) Add a branch below R for the test a = vi.
   (b) Let Xi be the subset of X that has a = vi. If Xi is empty, then add a leaf labeled with the most common class in X; else add the subtree ID3(Xi, Y, A − a).
4. Return R.
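The recursion can be sketched as follows, using information gain (defined on the next slide) to pick the best attribute. Since branches are only created for attribute values actually observed in X, the empty-subset case of step 3(b) does not arise in this sketch; the demo attribute and class names are invented:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def id3(X, y, attrs):
    """X: list of dicts mapping attribute -> value; y: parallel class labels.
    Returns a leaf label, or a pair (attribute, {value: subtree})."""
    if len(set(y)) == 1:              # all instances in one class: leaf
        return y[0]
    if not attrs:                     # no attributes left: most common class
        return Counter(y).most_common(1)[0][0]

    def gain(a):                      # IG(X, a) = Entropy(X) - expected entropy
        rem = 0.0
        for v in set(x[a] for x in X):
            sub = [lbl for x, lbl in zip(X, y) if x[a] == v]
            rem += len(sub) / len(X) * entropy(sub)
        return entropy(y) - rem

    best = max(attrs, key=gain)       # decision attribute for this node
    branches = {}
    for v in set(x[best] for x in X):
        Xv = [(x, lbl) for x, lbl in zip(X, y) if x[best] == v]
        branches[v] = id3([x for x, _ in Xv], [lbl for _, lbl in Xv],
                          [a for a in attrs if a != best])
    return (best, branches)
```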
![Page 27: ppt](https://reader033.vdocuments.site/reader033/viewer/2022061215/547f355cb4af9fab7c8b47f2/html5/thumbnails/27.jpg)
Selecting the Best Attribute
ID3 uses the measure Information Gain (IG) to decide which attribute a best classifies a set of examples X:

IG(X, a) = Entropy(X) − Σ_{v ∈ Va} (|Xv| / |X|) Entropy(Xv)

where Va is the set of possible values for a, Xv is the subset of X for which a = v, and Entropy(X) is defined as follows:

Entropy(X) = − Σ_{y ∈ Y} P(y) log2 P(y)

An alternative measure is Gain Ratio (GR):

GR(X, a) = IG(X, a) / Entropy(Va)
![Page 28: ppt](https://reader033.vdocuments.site/reader033/viewer/2022061215/547f355cb4af9fab7c8b47f2/html5/thumbnails/28.jpg)
Information Gain
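These definitions are easy to check numerically. The 9-positive/5-negative split below is an illustrative example (the familiar one from Mitchell's textbook), not data recovered from the slides:

```python
import math

def entropy(pos, neg):
    """Entropy of a set with pos positive and neg negative examples."""
    total = pos + neg
    e = 0.0
    for c in (pos, neg):
        if c:
            p = c / total
            e -= p * math.log2(p)
    return e

def info_gain(parent, splits):
    """IG(X, a) = Entropy(X) - sum_v |X_v|/|X| * Entropy(X_v).

    parent: (pos, neg) counts for X; splits: one (pos, neg) pair per value of a.
    """
    n = sum(parent)
    return entropy(*parent) - sum(
        (p + q) / n * entropy(p, q) for p, q in splits)
```

For instance, a 9+/5− set has entropy ≈ 0.940, and an attribute splitting it into 6+/2− and 3+/3− subsets yields an information gain of ≈ 0.048.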
![Page 29: ppt](https://reader033.vdocuments.site/reader033/viewer/2022061215/547f355cb4af9fab7c8b47f2/html5/thumbnails/29.jpg)
Training examples
![Page 30: ppt](https://reader033.vdocuments.site/reader033/viewer/2022061215/547f355cb4af9fab7c8b47f2/html5/thumbnails/30.jpg)
Selecting the next attribute
![Page 31: ppt](https://reader033.vdocuments.site/reader033/viewer/2022061215/547f355cb4af9fab7c8b47f2/html5/thumbnails/31.jpg)
![Page 32: ppt](https://reader033.vdocuments.site/reader033/viewer/2022061215/547f355cb4af9fab7c8b47f2/html5/thumbnails/32.jpg)
Hypothesis Space Search and Inductive Bias

Characteristics of ID3:
1. Searches a complete hypothesis space of target functions
2. Maintains a single current hypothesis throughout the search
3. Performs a hill-climbing search (susceptible to local optima)
4. Uses all training examples at each step in the search

Inductive bias:
1. Prefers shorter trees over longer ones
2. Prefers trees with informative attributes close to the root
3. Preference bias (incomplete search of a complete space)
![Page 33: ppt](https://reader033.vdocuments.site/reader033/viewer/2022061215/547f355cb4af9fab7c8b47f2/html5/thumbnails/33.jpg)
Overfitting
The problem of overfitting: A hypothesis h is overfitted to the training data if there exists an alternative hypothesis h′ with higher training error but lower test error.

Two approaches for avoiding overfitting in decision tree learning:
1. Stop growing the tree before it overfits the training data.
2. Allow overfitting and then post-prune the tree.

Both methods require a stopping criterion and can be validated using held-out data.