
Page 1: Lecture 16

SIMS 290-2: Applied Natural Language Processing

Marti Hearst, October 30, 2006

Some slides by Preslav Nakov and Eibe Frank 

 

Page 2: Lecture 16

Today

The 20 Newsgroups Text Collection

WEKA: Explorer

WEKA: Experimenter

Python Interface to WEKA

Page 3: Lecture 16

20 Newsgroups Data Set
http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/

Source: originally collected by Ken Lang

Content and structure:
approximately 20,000 newsgroup documents
– 19,997 originally
– 18,828 without duplicates
partitioned evenly across 20 different newsgroups
we are only using a subset (6 newsgroups)

Some categories are strongly related (and thus hard to discriminate), e.g. the “computers” groups:

comp.graphics
comp.os.ms-windows.misc
comp.sys.ibm.pc.hardware
comp.sys.mac.hardware
comp.windows.x

rec.autos
rec.motorcycles
rec.sport.baseball
rec.sport.hockey

sci.crypt
sci.electronics
sci.med
sci.space

misc.forsale
talk.politics.misc
talk.politics.guns
talk.politics.mideast

talk.religion.misc
alt.atheism
soc.religion.christian

Page 4: Lecture 16

Sample Posting: “talk.politics.guns”

From: [email protected] (C. D. Tavares)
Subject: Re: Congress to review ATF's status

In article <[email protected]>, [email protected] (Larry Cipriani) writes:

> WASHINGTON (UPI) -- As part of its investigation of the deadly
> confrontation with a Texas cult, Congress will consider whether the
> Bureau of Alcohol, Tobacco and Firearms should be moved from the
> Treasury Department to the Justice Department, senators said Wednesday.
> The idea will be considered because of the violent and fatal events
> at the beginning and end of the agency's confrontation with the Branch
> Davidian cult.

Of course. When the catbox begines to smell, simply transfer its
contents into the potted plant in the foyer.

"Why Hillary! Your government smells so... FRESH!"
--
[email protected] --If you believe that I speak for my company,
OR [email protected] write today for my special Investors' Packet...

Annotations on the slide mark the parts of the posting (from, subject, the "... writes:" attribution line, the quoted reply, and the signature), which need special handling during feature extraction.

Page 5: Lecture 16

The 20 Newsgroups Text Collection

WEKA: Explorer

WEKA: Experimenter

Python Interface to WEKA

WEKA: Real-time Demo

Page 6: Lecture 16

Slide adapted from Eibe Frank's

WEKA: The Bird

Copyright: Martin Kramer ([email protected]), University of Waikato, New Zealand

Page 7: Lecture 16

Slide by Eibe Frank

WEKA: The Software

Waikato Environment for Knowledge Analysis

Collection of state-of-the-art machine learning algorithms and data processing tools implemented in Java

Released under the GPL

Support for the whole process of experimental data mining: preparation of input data, statistical evaluation of learning schemes, visualization of input data and the results of learning

Used for education, research and applications

Complements “Data Mining” by Witten & Frank

Page 8: Lecture 16

Slide by Eibe Frank

Main Features

49 data preprocessing tools
76 classification/regression algorithms
8 clustering algorithms
15 attribute/subset evaluators + 10 search algorithms for feature selection
3 algorithms for finding association rules
3 graphical user interfaces:
“The Explorer” (exploratory data analysis)
“The Experimenter” (experimental environment)
“The KnowledgeFlow” (new process-model-inspired interface)

Page 9: Lecture 16

Slide by Eibe Frank

Projects based on WEKA

Incorporate/wrap WEKA:
GRB Tool Shed - a tool to aid gamma ray burst research
YALE - facility for large-scale ML experiments
GATE - NLP workbench with a WEKA interface
Judge - document clustering and classification

Extend/modify WEKA:
BioWeka - extension library for knowledge discovery in biology
WekaMetal - meta-learning extension to WEKA
Weka-Parallel - parallel processing for WEKA
Grid Weka - grid computing using WEKA
Weka-CG - computational genetics tool library

Page 10: Lecture 16

Slide by Eibe Frank

The WEKA Project Today (2006)

Funding for the next two years
Goal of the project remains the same

People:
6 staff
2 postdocs
3 PhD students
3 MSc students
2 research programmers

Page 11: Lecture 16

Slide adapted from Eibe Frank's

WEKA: The Software Toolkit

Machine learning/data mining software in Java
GNU GPL license
Used for research, education and applications
Complements “Data Mining” by Witten & Frank

Main features:
data pre-processing tools
learning algorithms
evaluation methods
graphical interface (incl. data visualization)
environment for comparing learning algorithms

http://www.cs.waikato.ac.nz/ml/weka

http://sourceforge.net/projects/weka/

Page 12: Lecture 16

WEKA: Terminology

Some synonyms/explanations for the terms used by WEKA, which may differ from what we use:

Attribute: feature
Relation: collection of examples (a dataset)
Instance: a single example
Class: category

Page 13: Lecture 16

Slide adapted from Eibe Frank's

WEKA GUI Chooser

Launch with: java -Xmx1000M -jar weka.jar

Page 14: Lecture 16

Slide adapted from Eibe Frank's

Our Toy Example

We demonstrate WEKA on a simple example:

3 categories from “Newsgroups”:
– misc.forsale
– rec.sport.hockey
– comp.graphics

20 documents per category

Features:
– words converted to lowercase
– frequency 2 or more required
– stopwords removed

Page 15: Lecture 16

Slide adapted from Eibe Frank's

Explorer: Pre-Processing The Data

WEKA can import data from:
files: ARFF, CSV, C4.5, binary
URL
SQL database (using JDBC)

Pre-processing tools (filters) are used for:
discretization, normalization, resampling, attribute selection, transforming and combining attributes, etc.

Page 16: Lecture 16

The Preprocessing Tab

Screenshot callouts:
List of attributes (last: class variable)
Frequency and categories for the selected attribute
Statistics about the values of the selected attribute
Filter selection
Manual attribute selection
Statistical attribute selection
Preprocessing / Classification tabs

Page 17: Lecture 16

Slide adapted from Eibe Frank's

Explorer: Building “Classifiers”

Classifiers in WEKA are models for:
classification (predict a nominal class)
regression (predict a numerical quantity)

Learning algorithms:
Naïve Bayes, decision trees, kNN, support vector machines, multi-layer perceptron, logistic regression, etc.

Meta-classifiers:
cannot be used alone
always combined with a learning algorithm
examples: boosting, bagging, etc.

Page 18: Lecture 16

The Classification Tab

Screenshot callouts:
Choice of classifier.
The class attribute: the attribute whose value is to be predicted from the values of the remaining ones. The default is the last attribute; here (in our toy example) it is named “class”.
Cross-validation: split the data into e.g. 10 folds, then 10 times train on 9 folds and test on the remaining one.
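As a concrete illustration of the cross-validation option, here is a minimal sketch in plain Python (not how WEKA implements it); train_and_score is a placeholder callback you would supply:

    import random

    def cross_validate(examples, labels, train_and_score, k=10, seed=1):
        """Split the data into k folds; k times, train on k-1 folds and
        test on the held-out fold; return the mean score."""
        idx = list(range(len(examples)))
        random.Random(seed).shuffle(idx)
        folds = [idx[i::k] for i in range(k)]   # k roughly equal folds
        scores = []
        for held_out in folds:
            train = [i for i in idx if i not in held_out]
            scores.append(train_and_score(
                [examples[i] for i in train], [labels[i] for i in train],
                [examples[i] for i in held_out], [labels[i] for i in held_out]))
        return sum(scores) / k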

Page 19: Lecture 16


Choosing a classifier

Page 20: Lecture 16


Page 21: Lecture 16

Classifier options (screenshot callouts):
False: Gaussian; True: kernels (better)
numeric-to-nominal conversion by discretization
displays synopsis and options
outputs additional information

Page 22: Lecture 16


Page 23: Lecture 16


Page 24: Lecture 16

Classifier output (screenshot callouts): accuracy; a different/easy class; the confusion matrix, from which all other numbers can be obtained.

Page 25: Lecture 16

Confusion matrix

Contains information about the actual and the predicted classification.

              predicted
                –    +
  true    –    a    b
          +    c    d

All measures can be derived from it:
accuracy: (a+d)/(a+b+c+d)
recall: d/(c+d) => R
precision: d/(b+d) => P
F-measure: 2PR/(P+R)
false positive (FP) rate: b/(a+b)
true negative (TN) rate: a/(a+b)
false negative (FN) rate: c/(c+d)

These extend to more than 2 classes: see previous lecture slides for details.
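A small Python sketch of these formulas (my own illustration, not WEKA code), using the a, b, c, d cells defined above; the counts in the example call are made up:

    def binary_metrics(a, b, c, d):
        """a: true -/predicted -,  b: true -/predicted +,
           c: true +/predicted -,  d: true +/predicted +"""
        precision = d / (b + d)                    # P
        recall    = d / (c + d)                    # R
        return {
            "accuracy":  (a + d) / (a + b + c + d),
            "precision": precision,
            "recall":    recall,
            "F-measure": 2 * precision * recall / (precision + recall),
            "FP rate":   b / (a + b),
            "TN rate":   a / (a + b),
            "FN rate":   c / (c + d),
        }

    print(binary_metrics(a=50, b=10, c=5, d=35))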

Page 26: Lecture 16

Predictions Output

Screenshot callout: outputs the probability distribution for each example.

Page 27: Lecture 16

Predictions Output

Screenshot callout: the probability distribution for a wrongly classified example (predicted class 1 instead of class 3).

Naïve Bayes makes incorrect conditional independence assumptions and is typically over-confident in its predictions, whether they are correct or not.

Page 28: Lecture 16


Error Visualization

Page 29: Lecture 16

Error Visualization

Screenshot callouts: little squares designate errors; axes show example number.

Page 30: Lecture 16


Running on Test Set

Page 31: Lecture 16

Slide adapted from Eibe Frank's

Explorer: Attribute Selection

Find which attributes are the most predictive ones.

Two parts:
search method – best-first, forward selection, random, exhaustive, genetic algorithm, ranking
evaluation method – information gain, chi-squared, etc.

Very flexible: WEKA allows (almost) arbitrary combinations of these two.
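To make the evaluation side concrete, here is a rough Python sketch of information gain for a single discrete feature (an illustration of the measure, not WEKA's implementation; the toy data are invented):

    import math
    from collections import Counter

    def entropy(labels):
        """Entropy (in bits) of a list of class labels."""
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def information_gain(feature_values, labels):
        """H(class) minus the weighted entropy of the class within each
        group of examples sharing the same feature value."""
        n = len(labels)
        groups = {}
        for v, y in zip(feature_values, labels):
            groups.setdefault(v, []).append(y)
        remainder = sum(len(g) / n * entropy(g) for g in groups.values())
        return entropy(labels) - remainder

    # Toy data: does the word "hockey" occur (1) or not (0) in each document?
    has_hockey = [1, 1, 0, 0, 0, 1]
    classes = ["rec.sport.hockey", "rec.sport.hockey", "misc.forsale",
               "misc.forsale", "comp.graphics", "rec.sport.hockey"]
    print(information_gain(has_hockey, classes))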

Page 32: Lecture 16


Individual Features Ranking

Page 33: Lecture 16

Individual Features Ranking

Screenshot callouts mark features associated with misc.forsale, comp.graphics, and rec.sport.hockey.

Page 34: Lecture 16

Individual Features Ranking

Screenshot callouts mark features associated with misc.forsale, comp.graphics, and rec.sport.hockey; one surprising feature (???); and the random number seed option.

Page 35: Lecture 16

Saving the Selected Features

All we can do from this tab is save the buffer to a text file. Not very useful...

But we can also perform feature selection during the pre-processing step (see the following slides).

Page 36: Lecture 16

Feature Selection in Preprocessing

Page 37: Lecture 16

Feature Selection in Preprocessing

Page 38: Lecture 16

Feature Selection in Preprocessing

679 attributes: 678 + 1 (for the class)

Page 39: Lecture 16

Feature Selection in Preprocessing

Just 22 attributes remain: 21 + 1 (for the class).

Page 40: Lecture 16

Run Naïve Bayes With the 21 Features

Screenshot callouts: 21 attributes; higher accuracy.

Page 41: Lecture 16

(AGAIN) Naïve Bayes With All Features

ALL 679 attributes (repeated slide).

Screenshot callouts: accuracy; a different/easy class.

Page 42: Lecture 16

Some Important Algorithms

WEKA has weird naming for some algorithms. Here are some translations:
Naïve Bayes: weka.classifiers.bayes.NaiveBayes
Perceptron: weka.classifiers.functions.VotedPerceptron
Decision tree: weka.classifiers.trees.J48
Support vector machines: weka.classifiers.functions.SMO
k nearest neighbor: weka.classifiers.lazy.IBk

Some of these are more sophisticated versions of the classic algorithms; e.g. the classic Naïve Bayes seems to be missing. A good alternative is the Multinomial Naïve Bayes model (weka.classifiers.bayes.NaiveBayesMultinomial).
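These class names can also be used outside the GUI. Below is a rough sketch of driving WEKA from Python by shelling out to its command-line interface; the file names and the location of weka.jar are assumptions, while -t (training file) and -T (test file) are standard WEKA classifier options:

    import subprocess

    classifier = "weka.classifiers.bayes.NaiveBayesMultinomial"
    cmd = ["java", "-Xmx1000M", "-cp", "weka.jar", classifier,
           "-t", "train.arff",   # training data (assumed file name)
           "-T", "test.arff"]    # held-out test data (assumed file name)
    result = subprocess.run(cmd, capture_output=True, text=True)
    print(result.stdout)         # WEKA prints accuracy and the confusion matrix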

Page 43: Lecture 16


The 20 Newsgroups Text Collection

WEKA: Explorer

WEKA: Experimenter

Python Interface to WEKA

WEKA: Real-time Demo

Page 44: Lecture 16

Slide adapted from Eibe Frank's

Performing Experiments

The Experimenter makes it easy to compare the performance of different learning schemes.

Problems: classification, regression

Results: written to a file or database

Evaluation options: cross-validation, learning curve, hold-out

Can also iterate over different parameter settings.

Significance testing built in!
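As a simplified illustration of the idea behind the built-in significance testing (WEKA itself uses a corrected resampled t-test; the per-fold accuracies below are invented), a paired t-test over two classifiers' fold accuracies looks like this:

    from scipy.stats import ttest_rel

    svm_acc = [0.92, 0.88, 0.90, 0.95, 0.89, 0.91, 0.93, 0.90, 0.92, 0.88]
    nb_acc  = [0.85, 0.84, 0.88, 0.90, 0.83, 0.86, 0.87, 0.85, 0.88, 0.84]

    t_stat, p_value = ttest_rel(svm_acc, nb_acc)  # paired: same folds for both
    print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
    # A small p-value suggests the accuracy difference is unlikely to be chance.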

Page 45: Lecture 16


Experiments Setup

Page 46: Lecture 16


Experiments Setup

Page 47: Lecture 16

Experiments Setup

Screenshot callouts: results CSV file (can be opened in Excel); datasets; algorithms.

Page 48: Lecture 16


Experiments Setup

Page 49: Lecture 16


Experiments Setup

Page 50: Lecture 16


Experiments Setup

Page 51: Lecture 16


Experiments Setup

Page 52: Lecture 16

Experiments Setup

Screenshot callouts (accuracy results):
SVM is the best
Decision tree is the worst
SVM is statistically better than Naïve Bayes
Decision tree is statistically worse than Naïve Bayes

Page 53: Lecture 16

Experiments: Excel

Results are output into a CSV file, which can be read in Excel!

Page 54: Lecture 16

The 20 Newsgroups Text Collection

WEKA: Explorer

WEKA: Experimenter

Python Interface to WEKA

Page 55: Lecture 16

Slide adapted from Eibe Frank's

WEKA File Format: ARFF

@relation heart-disease-simplified

@attribute age numeric
@attribute sex { female, male }
@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina }
@attribute cholesterol numeric
@attribute exercise_induced_angina { no, yes }
@attribute class { present, not_present }

@data
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present
...

Callouts: age and cholesterol are numerical attributes; sex etc. are nominal attributes; "?" marks a missing value. Other attribute types: string, date.

Page 56: Lecture 16

WEKA File Format: Sparse ARFF

Value 0 is not represented explicitly.
Same header (i.e. @relation and @attribute tags); the @data section is different.

Instead of

@data
0, X, 0, Y, "class A"
0, 0, W, 0, "class B"

we have

@data
{1 X, 3 Y, 4 "class A"}
{2 W, 4 "class B"}

This is especially useful for textual data (why?).
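Why is the sparse format useful for text? Because bag-of-words vectors are almost entirely zeros, so writing only the non-zero entries saves a great deal of space. Here is a small sketch (my own, not the course script) of turning a dense row into a sparse @data line:

    def fmt(value):
        # Quote non-numeric values so strings with spaces stay intact.
        return value if isinstance(value, (int, float)) else f'"{value}"'

    def to_sparse_arff_line(dense_row):
        """dense_row: list of attribute values (0-based indices), mostly 0."""
        pairs = [f"{i} {fmt(v)}" for i, v in enumerate(dense_row) if v != 0]
        return "{" + ", ".join(pairs) + "}"

    print(to_sparse_arff_line([0, "X", 0, "Y", "class A"]))
    # -> {1 "X", 3 "Y", 4 "class A"}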

Page 57: Lecture 16

Python Interface to WEKA

This is just to get you started. It assumes the newsgroups collection and extracts simple features:
currently just single-word features
– uses a simple tokenizer which removes punctuation
– uses a stoplist
– lowercases the words

Includes filtering code; currently eliminates numbers.

Features are weighted by frequency within the document.

Produces a sparse ARFF file to be used by WEKA. (A rough sketch of this pipeline follows.)
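A rough, hypothetical sketch of that pipeline (the real course script and its function names may differ): tokenize, lowercase, drop stopwords and numbers, and count term frequencies per document:

    import re
    from collections import Counter

    STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that", "are"}

    def extract_features(text):
        """Return a term-frequency Counter for one document."""
        tokens = re.findall(r"[a-z]+", text.lower())   # drops punctuation and numbers
        return Counter(t for t in tokens if t not in STOPWORDS)

    doc = "For sale: hockey skates, size 10. The skates are in great condition."
    print(extract_features(doc))
    # Counter({'skates': 2, 'for': 1, 'sale': 1, 'hockey': 1, ...})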

Page 58: Lecture 16

Python Interface to WEKA

Allows you to specify:
which directory to read files from
which newsgroups to use
the number of documents for training each newsgroup
the number of features to retain

Page 59: Lecture 16

Python Interface to WEKA

Things to (optionally) add or change:
an option to not use stopwords
an option to retain capitalization
a regular expression pattern a feature should match
other non-word-based features
morphological normalization
a minimum threshold for the number of times a term occurs before it can be counted as a feature
tf.idf weighting on terms
your idea goes here

Page 60: Lecture 16

Python Interface to WEKA

TF.IDF weighting: tf_ij × log(N / n_i)

TF:
– tf_ij: frequency of term i in document j
– this is how features are currently weighted

IDF: log(N / n_i)
– n_i: number of documents containing term i
– N: total number of documents
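A minimal sketch of tf.idf weighting (again my own illustration, not the course script), applied to per-document term-frequency Counters like the ones in the earlier sketch; the example documents are made up:

    import math
    from collections import Counter

    def tfidf_weight(tf_counters):
        """tf_counters: one Counter of term frequencies per document.
        Returns one {term: tf * log(N / n_i)} dict per document."""
        N = len(tf_counters)
        doc_freq = Counter()
        for tf in tf_counters:
            doc_freq.update(tf.keys())              # n_i: documents containing term i
        return [{t: f * math.log(N / doc_freq[t]) for t, f in tf.items()}
                for tf in tf_counters]

    docs = [Counter({"hockey": 3, "goal": 1}),
            Counter({"sale": 2, "goal": 1}),
            Counter({"graphics": 2})]
    for weights in tfidf_weight(docs):
        print(weights)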

Pages 61-68: Lecture 16

Python Weka Code

(Code screenshots spanning eight slides; the script itself is not transcribed here.)

Pages 69-70: Lecture 16

ARFF file

(Screenshots of the generated ARFF file; contents not transcribed here.)

Page 71: Lecture 16

Assignment

Due November 13. Work individually on this one.

The objective is to use the training set to get the best features and learning model you can. FREEZE this. Then run one time only on the test set. This is a realistic way to see how well your algorithm does on unseen data.

Page 72: Lecture 16


Next Time

Machine learning algorithms