©2012 Paula Matuszek. GATE information based on http://gate.ac.uk/sale/tao/splitch18.html
CSC 9010: Text Mining Applications:
GATE Machine Learning
Dr. Paula Matuszek
(610) 647-9789
Information Extraction in GATE
In the last assignment we looked at how to set up GATE to do information extraction “by hand”:
– Gazetteers
– JAPE rules
We modified the system to add UPenn and Villanova to universities to be extracted
Not hard for two entities, but this can get tricky. Is there a better way?
Machine Learning
We would like to:
– give GATE examples of universities
– let it figure out for itself how to identify them
This is basically a classification problem: given an entity, is it a university?
This sounds familiar! The classification methods we looked at in NLTK can be considered forms of supervised machine learning.
Machine Learning in GATE
GATE has machine learning processing resources: the CREOLE Learning plugin
– Same classification algorithms we saw in NLTK
– But both the features to be used and the class to be learned can be richer, using PRs like ANNIE
University of Sheffield NLP
gate.ac.uk/sale/talks/gate-course-july09/ml.ppt
Machine Learning
We have data items comprising labels and features
– E.g. an instance of “cat” has features “whiskers=1”, “fur=1”. A “stone” has “whiskers=0” and “fur=0”
The machine learning algorithm learns a relationship between the features and the labels
– E.g. “if whiskers=1 then cat”
This is used to label new data
– We have a new instance with features “whiskers=1” and “fur=1” -- is it a cat or not?
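The cat/stone example can be sketched in plain Python. This is a minimal, hypothetical one-feature rule learner for illustration only, not the algorithm GATE uses:

```python
# Toy training data from the slide: feature dicts plus labels.
training = [
    ({"whiskers": 1, "fur": 1}, "cat"),
    ({"whiskers": 1, "fur": 1}, "cat"),
    ({"whiskers": 0, "fur": 0}, "stone"),
    ({"whiskers": 0, "fur": 0}, "stone"),
]

def learn_rules(data):
    """For each (feature, value) pair, remember the majority label seen with it."""
    counts = {}
    for features, label in data:
        for name, value in features.items():
            key = (name, value)
            counts.setdefault(key, {})
            counts[key][label] = counts[key].get(label, 0) + 1
    return {key: max(c, key=c.get) for key, c in counts.items()}

def classify(rules, features):
    """Label new data by voting over the learned per-feature rules."""
    votes = {}
    for name, value in features.items():
        label = rules.get((name, value))
        if label:
            votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get) if votes else None

rules = learn_rules(training)
print(classify(rules, {"whiskers": 1, "fur": 1}))  # cat
```

The learned rules amount to the slide's “if whiskers=1 then cat”.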
ML in Information Extraction
We have annotations (classes)
We have features (words, context, word features, etc.)
Can we learn how features map to classes using ML?
Once trained, the ML model can do our annotation for us based on the features in the text
– Pre-annotation
– Automated systems
Possibly a good alternative to knowledge engineering approaches
– No need to write the rules
– However, we need to prepare training data
ML in Information Extraction
Central to ML work is evaluation
– Need to try different methods and different parameters to obtain good results
Precision: how many of the annotations we identified are correct?
Recall: how many of the annotations we should have identified did we?
F-Score:
F = 2 · (precision · recall) / (precision + recall)
Testing requires an unseen test set
– Hold out a test set: simple approach, but data may be scarce
Cross-validation
– Split the training data into e.g. 10 sections
– Take turns using each “fold” as the test set
– Average the score across the 10 folds
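The evaluation measures above are just arithmetic on counts, and a 10-fold split is a few lines of list slicing. A sketch (the example counts are made up):

```python
def precision_recall_f(true_positives, false_positives, false_negatives):
    """Precision, recall and F-score from raw counts, per the slide's formulas."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f

# Say we proposed 10 annotations, 8 of them correct, and missed 2 gold ones:
p, r, f = precision_recall_f(true_positives=8, false_positives=2, false_negatives=2)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.8 0.8 0.8

def folds(items, k=10):
    """k-fold split: each fold takes a turn as the test set, the rest is training."""
    return [(items[i::k], [x for j in range(k) if j != i for x in items[j::k]])
            for i in range(k)]

test_set, train_set = folds(list(range(100)))[0]
print(len(test_set), len(train_set))  # 10 90
```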
More on SVMs
The primary machine learning engine in GATE is the Support Vector Machine
– Handles very large feature sets well
– Has been shown empirically to be effective in a variety of unstructured text learning tasks
We covered this briefly earlier; more detail will help with using them in GATE.
Basic Idea Underlying SVMs
Find a line, or a plane, or a hyperplane, that separates our classes cleanly, by finding the greatest margin separating them
– This is the same concept as we have seen in regression
Borrowed heavily from Andrew Moore's tutorials: http://www.cs.cmu.edu/~awm/tutorials. Copyright © 2001, 2003, Andrew W. Moore
Linear Classifiers
[Scatter plot: data points labeled +1 and -1, with several candidate separating lines]
How would you classify this data?
Any of these lines would be fine...
...but which is best?
Classifier Margin
[Scatter plot: data points labeled +1 and -1, with a separating line and its margin]
Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.
Maximum Margin
[Scatter plot: the maximum-margin separator between the +1 and -1 points]
The maximum margin linear classifier is the linear classifier with the maximum margin.
This is called the Linear Support Vector Machine (SVM).
Support vectors are those datapoints that the margin pushes up against.
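In one dimension the maximum-margin idea can be computed directly: with separable classes, the widest margin sits midway between the closest pair of opposite-class points, and those two points are the support vectors. A small illustrative sketch (not GATE's SVM implementation):

```python
def max_margin_1d(pos, neg):
    """Maximum-margin boundary for 1-D separable data with pos to the right of neg.

    The closest opposite-class points are the support vectors; the boundary
    sits halfway between them, and the margin is half the gap.
    """
    sv_pos, sv_neg = min(pos), max(neg)       # the two support vectors
    assert sv_neg < sv_pos, "classes must be separable, pos on the right"
    boundary = (sv_pos + sv_neg) / 2
    margin = (sv_pos - sv_neg) / 2
    return boundary, margin, (sv_pos, sv_neg)

boundary, margin, svs = max_margin_1d(pos=[4.0, 5.5, 7.0], neg=[-2.0, 0.0, 1.0])
print(boundary, margin, svs)  # 2.5 1.5 (4.0, 1.0)
```

Note that moving any non-support-vector point would not change the boundary at all, which is why only the support vectors matter.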
Messy Data
This is all good so far. Suppose our data aren't that neat:
Soft Margins
Intuitively, it still looks like we can make a decent separation here
– Can't make a clean margin
– But can almost do so, if we allow some errors
A soft margin is one which lets us make some errors, in order to get a wider margin
Tradeoff between a wide margin and classification errors
Messy Data
Slack Variables and Cost
In order to find a soft margin, we allow slack variables, which measure the degree of misclassification
– Takes into account the number of misclassified instances and their distance from the margin
We then modify this by a cost (C) for these misclassified instances
– High cost: narrow margins, won't generalize well
– Low cost: broader margins, but misclassifies more data
How much we want it to cost to misclassify instances depends on our domain -- what we are trying to do
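The tradeoff can be made concrete with the standard soft-margin objective, 0.5·||w||² + C·Σξᵢ (margin width is 2/||w||, so smaller ||w|| means a wider margin). With a high C the zero-error narrow margin wins; with a low C the wider margin with a little slack wins. The candidate numbers below are invented for illustration:

```python
def objective(w_norm, slacks, C):
    """Soft-margin SVM objective: 0.5 * ||w||^2 + C * (sum of slack variables)."""
    return 0.5 * w_norm ** 2 + C * sum(slacks)

# Candidate A: narrow margin (large ||w||), no misclassifications.
# Candidate B: wide margin (small ||w||), one point with slack 1.5.
for C in (10.0, 0.1):
    narrow = objective(w_norm=4.0, slacks=[], C=C)
    wide = objective(w_norm=1.0, slacks=[1.5], C=C)
    print(f"C={C}: narrow={narrow:.2f} wide={wide:.2f} ->",
          "narrow wins" if narrow < wide else "wide wins")
# C=10.0: the error is expensive, so the narrow zero-error margin wins.
# C=0.1: the error is cheap, so the wide margin with some slack wins.
```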
Non-Linearly-Separable Data
Suppose even with a soft margin we can't do a good linear separation of our data?
Allowing non-linearity will give us much better modeling of many data sets.
In SVMs, we do this by using a kernel. A kernel is a function which maps our data into a higher-order feature space where we can find a separating hyperplane.
Hard 1-dimensional dataset
[Plot: datapoints of both classes interleaved along a line around x=0; no single threshold separates them]
What can be done about this?
The Kernel Trick!
If we can't do a linear separation
– add higher-order features
– and combinations of features
Combinations of features may allow separation where either feature alone will not
This is known as the kernel trick
Typical kernels are polynomial and RBF (Radial Basis Function, i.e. Gaussian)
GATE supports linear and polynomial kernels.
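The classic fix for a hard 1-D dataset is exactly such a higher-order feature: map each x to (x, x²). Points near zero and points far from zero are interleaved on the line, but in the mapped space the x² coordinate separates them with a horizontal line. A sketch with made-up data:

```python
# 1-D data: class -1 near the origin, class +1 further out -- no single
# threshold on x separates them.
data = [(-3.0, +1), (-2.0, +1), (-0.5, -1), (0.5, -1), (2.0, +1), (3.0, +1)]

# Map x -> (x, x**2): the quadratic feature lifts the outer points up.
mapped = [((x, x * x), label) for x, label in data]

def classify(point, threshold=2.0):
    """Linear separator in the mapped space: the line x2 = threshold."""
    _, x_squared = point
    return +1 if x_squared > threshold else -1

separable = all(classify(p) == label for p, label in mapped)
print(separable)  # True
```

A polynomial kernel lets the SVM use such products of features implicitly, without ever building the mapped vectors.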
What If We Want Three Classes?
Suppose our task involves more than two classes: universities, cities, businesses?
Reduce the multiple-class problem to multiple binary-class problems
– one-versus-all (one-vs-others)
– one-versus-one (one-vs-another)
GATE will do this automatically if there are more than two classes.
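The one-versus-all reduction can be sketched with any binary scorer; here a hypothetical centroid-based scorer on 1-D data stands in for the per-class binary SVMs (an illustration of the reduction, not GATE's code):

```python
def centroid(points):
    return sum(points) / len(points)

def train_one_vs_rest(data):
    """Fit one binary scorer per class: this-class centroid vs rest centroid."""
    labels = {label for _, label in data}
    return {label: (centroid([x for x, l in data if l == label]),
                    centroid([x for x, l in data if l != label]))
            for label in labels}

def classify(models, x):
    """Run every 'this class vs the rest' scorer; pick the most confident."""
    def score(model):
        this_c, rest_c = model
        return abs(x - rest_c) - abs(x - this_c)  # higher = more this-class
    return max(models, key=lambda label: score(models[label]))

# Three classes as regions on a line (purely illustrative data).
data = [(0, "university"), (1, "university"), (5, "city"), (6, "city"),
        (10, "business"), (11, "business")]
models = train_one_vs_rest(data)
print(classify(models, 0.5))   # university
print(classify(models, 10.5))  # business
```

One-versus-one works the same way except that each binary problem pits a pair of classes against each other and the winners vote.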
Doing this in GATE
GATE has a Learning plugin available in CREOLE. The PR available from it is Batch Learning
– Newest machine learning PR
– Focused on chunk recognition, text classification, relation extraction
– Primary algorithm is an SVM
– Can also interface to WEKA for Naive Bayes, KNN and decision trees
There is also an older Machine_Learning plugin, which contains wrappers for several different external learning systems.
Using the Batch Learning PR
The Batch Learning PR is basically learning new annotations (the class) from previous annotations (the features)
Need three things:
– annotated training documents
– possibly preprocessed to get the annotations we want to learn from
– an XML configuration file (which is external to the IDE)
The PR treats the parent directory of the config file as its working directory.
ML applications in GATE
• Batch Learning PR modes: Evaluation, Training, Application
• Runs after all other PRs -- must be the last PR
• Configured via an XML file
• A single directory holds the generated features, models, and config file
Instances, attributes, classes
California Governor Arnold Schwarzenegger proposes deep cuts.
[Annotations: a Sentence spanning the text; a Token on each word; Entity.type=Location on “California”; Entity.type=Person on “Arnold Schwarzenegger”]
Instances: any annotation
– Tokens are often convenient
Attributes: any annotation feature relative to instances
– Token.string
– Token.category (POS)
– Sentence.length
Class: the thing we want to learn
– A feature on an annotation
Surround mode
California Governor Arnold Schwarzenegger proposes deep cuts.
[Entity.type=Person spans two Token instances: “Arnold” and “Schwarzenegger”]
• This learned class covers more than one instance...
• Begin / end boundary learning
• Dealt with by the API -- surround mode
• Transparent to the user
Multi class to binary
California Governor Arnold Schwarzenegger proposes deep cuts.
[Entity.type=Location on “California”; Entity.type=Person on “Arnold Schwarzenegger”]
• Three classes, including null
• Many algorithms are binary classifiers
• One against all (one against others):
  LOC vs PERS+NULL / PERS vs LOC+NULL / NULL vs LOC+PERS
• One against one (one against another one):
  LOC vs PERS / LOC vs NULL / PERS vs NULL
• Dealt with by the API -- multClassification2Binary
• Transparent to the user
The XML File
Root element is ML-CONFIG. Contains two required elements:
– DATASET
– ENGINE
And some optional settings
– Described in the manual; the most important are covered next.
The configuration file
• Verbosity: 0,1,2
• Surround mode: set true for entities, false for relations
• Filtering: e.g. remove instances distant from the hyperplane
<?xml version="1.0"?>
<ML-CONFIG>
  <VERBOSITY level="1"/>
  <SURROUND value="true"/>
  <FILTERING ratio="0.0" dis="near"/>
Thresholds
• Control selection of boundaries and classes in post processing
• The defaults we give will work
• Experiment
• See the documentation
<PARAMETER name="thresholdProbabilityEntity" value="0.3"/>
<PARAMETER name="thresholdProbabilityBoundary" value="0.5"/>
<PARAMETER name="thresholdProbabilityClassification" value="0.5"/>
Multiclass and evaluation
• Multi-class: one-vs-others or one-vs-another
• Evaluation: kfold -- runs gives the number of folds; holdout -- ratio gives the training/test split

<multiClassification2Binary method="one-vs-others"/>
<EVALUATION method="kfold" runs="10"/>
The learning engine
• Learning algorithm and implementation specific
• SVM: Java implementation of LibSVM
– Uneven margins set with -tau

<ENGINE nickname="SVM" implementationName="SVMLibSvmJava"
        options="-c 0.7 -t 1 -d 3 -m 100 -tau 0.6"/>
<ENGINE nickname="NB" implementationName="NaiveBayesWeka"/>
<ENGINE nickname="C45" implementationName="C4.5Weka"/>
The dataset
• Defines:
– the instance annotation
– the class
– the annotation feature to instance attribute mapping
<DATASET>
</DATASET>
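For concreteness, a filled-in DATASET block along the lines of the examples in the GATE user guide might look like the following. Treat the exact element names and values here as an unverified sketch to check against the documentation for your GATE version:

```xml
<DATASET>
  <!-- Instances are Token annotations -->
  <INSTANCE-TYPE>Token</INSTANCE-TYPE>
  <!-- Attribute: the POS category of the current token -->
  <ATTRIBUTE>
    <NAME>POS</NAME>
    <SEMTYPE>NOMINAL</SEMTYPE>
    <TYPE>Token</TYPE>
    <FEATURE>category</FEATURE>
    <POSITION>0</POSITION>
  </ATTRIBUTE>
  <!-- Class: the type feature of a Mention annotation, marked with CLASS -->
  <ATTRIBUTE>
    <NAME>Class</NAME>
    <SEMTYPE>NOMINAL</SEMTYPE>
    <TYPE>Mention</TYPE>
    <FEATURE>type</FEATURE>
    <POSITION>0</POSITION>
    <CLASS/>
  </ATTRIBUTE>
</DATASET>
```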
Parameters in IDE
Path to the config file: given when you create the new PR
Run-time parameters:
– corpus
– input annotation set name: contains the features and the class
– output annotation set name: where results go; same as input when evaluating
– learningMode
LearningMode
Run-time parameter. Default is TRAINING. Primary modes are:
– TRAINING: PR learns from the data and saves models into learnedModels.save
– APPLICATION: reads the model from learnedModels.save and applies it to the data
– EVALUATION: k-fold or hold-out evaluation. Output to GATE Developer.
Additional modes exist for incremental (adaptive) training and for displaying outcomes.
Running the Batch Learning PR
Multiple PRs:
– Training and evaluation modes update the model and other data files: don't run more than one in the same working directory
Processing order:
– For training and evaluation modes: needs to go last. If there's post-processing to be done, make a separate application.
– For application mode: controlled by BATCH-APP-INTERVAL
  – set to 1: acts like a normal PR
  – set higher: must go last (may be more efficient)
Summary
GATE supports a very rich set of classifier-type machine learning capabilities
– Current: the Learning plugin has the Batch Learner
– Older: the Machine_Learning plugin has wrappers for various ML systems
Both the class to be learned and the features to learn from are document annotations
Same algorithms as we saw in NLTK, but the use can be very different
– classifying chunks of text within documents
– an alternative to engineering JAPE rules by hand