©2012 Paula Matuszek. GATE information based on http://gate.ac.uk/sale/tao/splitch18.html
CSC 9010: Text Mining Applications:
GATE Machine Learning
Dr. Paula Matuszek
(610) 647-9789
Information Extraction in GATE
In the last assignment we looked at how to set up GATE to do information extraction “by hand”:
– Gazetteers
– JAPE rules
We modified the system to add UPenn and Villanova to universities to be extracted
Not hard for two entities, but this can get tricky. Is there a better way?
Machine Learning
We would like to:
– give GATE examples of universities
– let it figure out for itself how to identify them
This is basically a classification problem: given an entity, is it a university?
This sounds familiar! The classification methods we looked at in NLTK can be considered forms of supervised machine learning.
Machine Learning in GATE
GATE has machine learning processing resources: the CREOLE Learning plugin
– Same classification algorithms we saw in NLTK
– But both the features to be used and the class to be learned can be richer, using PRs like ANNIE
University of Sheffield NLP
gate.ac.uk/sale/talks/gate-course-july09/ml.ppt
Machine Learning
We have data items comprising labels and features
– E.g. an instance of “cat” has features “whiskers=1”, “fur=1”. A “stone” has “whiskers=0” and “fur=0”
The machine learning algorithm learns a relationship between the features and the labels
– E.g. “if whiskers=1 then cat”
This is used to label new data
– We have a new instance with features “whiskers=1” and “fur=1” -- is it a cat or not?
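The cat/stone example can be sketched in plain Python. This is a minimal, hypothetical one-feature rule learner for illustration only, not the algorithm GATE uses:

```python
# Toy training data from the slide: feature dicts plus labels.
training = [
    ({"whiskers": 1, "fur": 1}, "cat"),
    ({"whiskers": 1, "fur": 1}, "cat"),
    ({"whiskers": 0, "fur": 0}, "stone"),
    ({"whiskers": 0, "fur": 0}, "stone"),
]

def learn_rules(data):
    """For each (feature, value) pair, remember the majority label seen with it."""
    counts = {}
    for features, label in data:
        for name, value in features.items():
            key = (name, value)
            counts.setdefault(key, {})
            counts[key][label] = counts[key].get(label, 0) + 1
    return {key: max(c, key=c.get) for key, c in counts.items()}

def classify(rules, features):
    """Label new data by voting over the learned per-feature rules."""
    votes = {}
    for name, value in features.items():
        label = rules.get((name, value))
        if label:
            votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get) if votes else None

rules = learn_rules(training)
print(classify(rules, {"whiskers": 1, "fur": 1}))  # cat
```

The learned rules amount to the slide's “if whiskers=1 then cat”.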
ML in Information Extraction
We have annotations (classes)
We have features (words, context, word features, etc.)
Can we learn how features map to classes using ML?
Once trained, the ML model can do our annotation for us based on the features in the text
– Pre-annotation
– Automated systems
Possibly a good alternative to knowledge engineering approaches
– No need to write the rules
– However, we need to prepare training data
ML in Information Extraction
Central to ML work is evaluation
– Need to try different methods and different parameters to obtain good results
Precision: how many of the annotations we identified are correct?
Recall: how many of the annotations we should have identified did we?
F-Score:
F = 2 · (precision · recall) / (precision + recall)
Testing requires an unseen test set
– Hold out a test set: simple approach, but data may be scarce
Cross-validation
– Split the training data into e.g. 10 sections
– Take turns using each “fold” as the test set
– Average the score across the 10 folds
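The evaluation measures above are just arithmetic on counts, and a 10-fold split is a few lines of list slicing. A sketch (the example counts are made up):

```python
def precision_recall_f(true_positives, false_positives, false_negatives):
    """Precision, recall and F-score from raw counts, per the slide's formulas."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f

# Say we proposed 10 annotations, 8 of them correct, and missed 2 gold ones:
p, r, f = precision_recall_f(true_positives=8, false_positives=2, false_negatives=2)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.8 0.8 0.8

def folds(items, k=10):
    """k-fold split: each fold takes a turn as the test set, the rest is training."""
    return [(items[i::k], [x for j in range(k) if j != i for x in items[j::k]])
            for i in range(k)]

test_set, train_set = folds(list(range(100)))[0]
print(len(test_set), len(train_set))  # 10 90
```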
More on SVMs
The primary machine learning engine in GATE is the Support Vector Machine
– Handles very large feature sets well
– Has been shown empirically to be effective in a variety of unstructured text learning tasks
We covered this briefly earlier; more detail will help with using them in GATE.
Basic Idea Underlying SVMs
Find a line, or a plane, or a hyperplane, that separates our classes cleanly, by finding the greatest margin separating them
– This is the same concept as we have seen in regression
Borrowed heavily from Andrew Moore's tutorials: http://www.cs.cmu.edu/~awm/tutorials. Copyright © 2001, 2003, Andrew W. Moore
Linear Classifiers
[Scatter plot: data points labeled +1 and -1, with several candidate separating lines]
How would you classify this data?
Any of these lines would be fine...
...but which is best?
Classifier Margin
[Scatter plot: data points labeled +1 and -1, with a separating line and its margin]
Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.
Maximum Margin
[Scatter plot: the maximum-margin separator between the +1 and -1 points]
The maximum margin linear classifier is the linear classifier with the maximum margin.
This is called the Linear Support Vector Machine (SVM).
Support vectors are those datapoints that the margin pushes up against.
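In one dimension the maximum-margin idea can be computed directly: with separable classes, the widest margin sits midway between the closest pair of opposite-class points, and those two points are the support vectors. A small illustrative sketch (not GATE's SVM implementation):

```python
def max_margin_1d(pos, neg):
    """Maximum-margin boundary for 1-D separable data with pos to the right of neg.

    The closest opposite-class points are the support vectors; the boundary
    sits halfway between them, and the margin is half the gap.
    """
    sv_pos, sv_neg = min(pos), max(neg)       # the two support vectors
    assert sv_neg < sv_pos, "classes must be separable, pos on the right"
    boundary = (sv_pos + sv_neg) / 2
    margin = (sv_pos - sv_neg) / 2
    return boundary, margin, (sv_pos, sv_neg)

boundary, margin, svs = max_margin_1d(pos=[4.0, 5.5, 7.0], neg=[-2.0, 0.0, 1.0])
print(boundary, margin, svs)  # 2.5 1.5 (4.0, 1.0)
```

Note that moving any non-support-vector point would not change the boundary at all, which is why only the support vectors matter.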
Messy Data
This is all good so far. Suppose our data aren't that neat:
Soft Margins
Intuitively, it still looks like we can make a decent separation here
– Can't make a clean margin
– But can almost do so, if we allow some errors
A soft margin is one which lets us make some errors, in order to get a wider margin
Tradeoff between a wide margin and classification errors
Messy Data
Slack Variables and Cost
In order to find a soft margin, we allow slack variables, which measure the degree of misclassification
– Takes into account the number of misclassified instances and their distance from the margin
We then modify this by a cost (C) for these misclassified instances
– High cost: narrow margins, won't generalize well
– Low cost: broader margins, but misclassifies more data
How much we want it to cost to misclassify instances depends on our domain -- what we are trying to do
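The tradeoff can be made concrete with the standard soft-margin objective, 0.5·||w||² + C·Σξᵢ (margin width is 2/||w||, so smaller ||w|| means a wider margin). With a high C the zero-error narrow margin wins; with a low C the wider margin with a little slack wins. The candidate numbers below are invented for illustration:

```python
def objective(w_norm, slacks, C):
    """Soft-margin SVM objective: 0.5 * ||w||^2 + C * (sum of slack variables)."""
    return 0.5 * w_norm ** 2 + C * sum(slacks)

# Candidate A: narrow margin (large ||w||), no misclassifications.
# Candidate B: wide margin (small ||w||), one point with slack 1.5.
for C in (10.0, 0.1):
    narrow = objective(w_norm=4.0, slacks=[], C=C)
    wide = objective(w_norm=1.0, slacks=[1.5], C=C)
    print(f"C={C}: narrow={narrow:.2f} wide={wide:.2f} ->",
          "narrow wins" if narrow < wide else "wide wins")
# C=10.0: the error is expensive, so the narrow zero-error margin wins.
# C=0.1: the error is cheap, so the wide margin with some slack wins.
```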
Non-Linearly-Separable Data
Suppose even with a soft margin we can't do a good linear separation of our data?
Allowing non-linearity will give us much better modeling of many data sets.
In SVMs, we do this by using a kernel. A kernel is a function which maps our data into a higher-order feature space where we can find a separating hyperplane.
Hard 1-dimensional dataset
[Plot: datapoints of both classes interleaved along a line around x=0; no single threshold separates them]
What can be done about this?
The Kernel Trick!
If we can't do a linear separation
– add higher-order features
– and combinations of features
Combinations of features may allow separation where either feature alone will not
This is known as the kernel trick
Typical kernels are polynomial and RBF (Radial Basis Function, i.e. Gaussian)
GATE supports linear and polynomial kernels.
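The classic fix for a hard 1-D dataset is exactly such a higher-order feature: map each x to (x, x²). Points near zero and points far from zero are interleaved on the line, but in the mapped space the x² coordinate separates them with a horizontal line. A sketch with made-up data:

```python
# 1-D data: class -1 near the origin, class +1 further out -- no single
# threshold on x separates them.
data = [(-3.0, +1), (-2.0, +1), (-0.5, -1), (0.5, -1), (2.0, +1), (3.0, +1)]

# Map x -> (x, x**2): the quadratic feature lifts the outer points up.
mapped = [((x, x * x), label) for x, label in data]

def classify(point, threshold=2.0):
    """Linear separator in the mapped space: the line x2 = threshold."""
    _, x_squared = point
    return +1 if x_squared > threshold else -1

separable = all(classify(p) == label for p, label in mapped)
print(separable)  # True
```

A polynomial kernel lets the SVM use such products of features implicitly, without ever building the mapped vectors.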
What If We Want Three Classes?
Suppose our task involves more than two classes: universities, cities, businesses?
Reduce the multiple-class problem to multiple binary-class problems
– one-versus-all (one-vs-others)
– one-versus-one (one-vs-another)
GATE will do this automatically if there are more than two classes.
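The one-versus-all reduction can be sketched with any binary scorer; here a hypothetical centroid-based scorer on 1-D data stands in for the per-class binary SVMs (an illustration of the reduction, not GATE's code):

```python
def centroid(points):
    return sum(points) / len(points)

def train_one_vs_rest(data):
    """Fit one binary scorer per class: this-class centroid vs rest centroid."""
    labels = {label for _, label in data}
    return {label: (centroid([x for x, l in data if l == label]),
                    centroid([x for x, l in data if l != label]))
            for label in labels}

def classify(models, x):
    """Run every 'this class vs the rest' scorer; pick the most confident."""
    def score(model):
        this_c, rest_c = model
        return abs(x - rest_c) - abs(x - this_c)  # higher = more this-class
    return max(models, key=lambda label: score(models[label]))

# Three classes as regions on a line (purely illustrative data).
data = [(0, "university"), (1, "university"), (5, "city"), (6, "city"),
        (10, "business"), (11, "business")]
models = train_one_vs_rest(data)
print(classify(models, 0.5))   # university
print(classify(models, 10.5))  # business
```

One-versus-one works the same way except that each binary problem pits a pair of classes against each other and the winners vote.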
Doing this in GATE
GATE has a Learning plugin available in CREOLE. The PR available from it is Batch Learning
– Newest machine learning PR
– Focused on chunk recognition, text classification, relation extraction
– Primary algorithm is an SVM
– Can also interface to WEKA for Naive Bayes, KNN and decision trees
There is also an older Machine_Learning plugin, which contains wrappers for several different external learning systems.
Using the Batch Learning PR
The Batch Learning PR is basically learning new annotations (the class) from previous annotations (the features)
Need three things:
– annotated training documents
– possibly preprocessed to get the annotations we want to learn from
– an XML configuration file (which is external to the IDE)
The PR treats the parent directory of the config file as its working directory.
ML applications in GATE
• Batch Learning PR modes: Evaluation, Training, Application
• Runs after all other PRs -- must be the last PR
• Configured via an XML file
• A single directory holds the generated features, models, and config file
Instances, attributes, classes
California Governor Arnold Schwarzenegger proposes deep cuts.
[Annotations: a Sentence spanning the text; a Token on each word; Entity.type=Location on “California”; Entity.type=Person on “Arnold Schwarzenegger”]
Instances: any annotation
– Tokens are often convenient
Attributes: any annotation feature relative to instances
– Token.string
– Token.category (POS)
– Sentence.length
Class: the thing we want to learn
– A feature on an annotation
Surround mode
California Governor Arnold Schwarzenegger proposes deep cuts.
[Entity.type=Person spans two Token instances: “Arnold” and “Schwarzenegger”]
• This learned class covers more than one instance...
• Begin / end boundary learning
• Dealt with by the API -- surround mode
• Transparent to the user
Multi class to binary
California Governor Arnold Schwarzenegger proposes deep cuts.
[Entity.type=Location on “California”; Entity.type=Person on “Arnold Schwarzenegger”]
• Three classes, including null
• Many algorithms are binary classifiers
• One against all (one against others):
  LOC vs PERS+NULL / PERS vs LOC+NULL / NULL vs LOC+PERS
• One against one (one against another one):
  LOC vs PERS / LOC vs NULL / PERS vs NULL
• Dealt with by the API -- multClassification2Binary
• Transparent to the user
The XML File
Root element is ML-CONFIG. Contains two required elements:
– DATASET
– ENGINE
And some optional settings
– Described in the manual; the most important are covered next.
The configuration file
• Verbosity: 0,1,2
• Surround mode: set true for entities, false for relations
• Filtering: e.g. remove instances distant from the hyperplane
<?xml version="1.0"?>
<ML-CONFIG>
  <VERBOSITY level="1"/>
  <SURROUND value="true"/>
  <FILTERING ratio="0.0" dis="near"/>
Thresholds
• Control selection of boundaries and classes in post processing
• The defaults we give will work
• Experiment
• See the documentation
<PARAMETER name="thresholdProbabilityEntity" value="0.3"/>
<PARAMETER name="thresholdProbabilityBoundary" value="0.5"/>
<PARAMETER name="thresholdProbabilityClassification" value="0.5"/>
Multiclass and evaluation
• Multi-class: one-vs-others or one-vs-another
• Evaluation: kfold -- runs gives the number of folds; holdout -- ratio gives the training/test split

<multiClassification2Binary method="one-vs-others"/>
<EVALUATION method="kfold" runs="10"/>
The learning engine
• Learning algorithm and implementation specific
• SVM: Java implementation of LibSVM
– Uneven margins set with -tau

<ENGINE nickname="SVM" implementationName="SVMLibSvmJava"
        options="-c 0.7 -t 1 -d 3 -m 100 -tau 0.6"/>
<ENGINE nickname="NB" implementationName="NaiveBayesWeka"/>
<ENGINE nickname="C45" implementationName="C4.5Weka"/>
The dataset
• Defines:
– the instance annotation
– the class
– the annotation feature to instance attribute mapping
<DATASET>
</DATASET>
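For concreteness, a filled-in DATASET block along the lines of the examples in the GATE user guide might look like the following. Treat the exact element names and values here as an unverified sketch to check against the documentation for your GATE version:

```xml
<DATASET>
  <!-- Instances are Token annotations -->
  <INSTANCE-TYPE>Token</INSTANCE-TYPE>
  <!-- Attribute: the POS category of the current token -->
  <ATTRIBUTE>
    <NAME>POS</NAME>
    <SEMTYPE>NOMINAL</SEMTYPE>
    <TYPE>Token</TYPE>
    <FEATURE>category</FEATURE>
    <POSITION>0</POSITION>
  </ATTRIBUTE>
  <!-- Class: the type feature of a Mention annotation, marked with CLASS -->
  <ATTRIBUTE>
    <NAME>Class</NAME>
    <SEMTYPE>NOMINAL</SEMTYPE>
    <TYPE>Mention</TYPE>
    <FEATURE>type</FEATURE>
    <POSITION>0</POSITION>
    <CLASS/>
  </ATTRIBUTE>
</DATASET>
```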
Parameters in IDE
Path to the config file: given when you create the new PR
Run-time parameters:
– corpus
– input annotation set name: contains the features and the class
– output annotation set name: where results go; same as input when evaluating
– learningMode
LearningMode
Run-time parameter. Default is TRAINING. Primary modes are:
– TRAINING: PR learns from the data and saves models into learnedModels.save
– APPLICATION: reads the model from learnedModels.save and applies it to the data
– EVALUATION: k-fold or hold-out evaluation. Output to GATE Developer.
Additional modes exist for incremental (adaptive) training and for displaying outcomes.
Running the Batch Learning PR
Multiple PRs:
– Training and evaluation modes update the model and other data files: don't run more than one in the same working directory
Processing order:
– For training and evaluation modes: needs to go last. If there's post-processing to be done, make a separate application.
– For application mode: controlled by BATCH-APP-INTERVAL
  – set to 1: acts like a normal PR
  – set higher: must go last (may be more efficient)
Summary
GATE supports a very rich set of classifier-type machine learning capabilities
– Current: the Learning plugin has the Batch Learner
– Older: the Machine_Learning plugin has wrappers for various ML systems
Both the class to be learned and the features to learn from are document annotations
Same algorithms as we saw in NLTK, but the use can be very different
– classifying chunks of text within documents
– an alternative to engineering JAPE rules by hand