Logistics
• Course reviews
• Project report deadline: March 16
• Poster session guidelines:
– 2.5 minutes per poster (3 hrs / 55 posters, minus overhead)
– presentations will be videotaped
– food will be provided
Task: Named-Entity Recognition in a new corpus
Named-Entity Recognition
• Fragment of an example sentence:
Julian Assange accused the United
PER PER Other Other LOC
NER as Machine Learning
• Fragment of an example sentence:
Julian Assange accused the United
PER PER Other Other LOC
Yi: word label, one of {Other, LOC, PER, ORG}
Xi: some feature representation of the word
Feature Vector: Three Choices
• Words:
– current word
• Context:
– current word, previous word, next word
• Features:
– current word, previous word, next word
– is the word capitalized?
– "word shape" (compact summary of orthographic information, like internal digits and punctuation)
– prefixes up to length 5, suffixes up to length 5
– any word in a +/- six word window (*not* differentiated by position the way previous word and next word are)
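A minimal sketch of how the third ("Features") representation could be extracted per token; the function names and feature-string conventions here are illustrative, not from the slides:

```python
import re

def word_shape(token):
    """Compact orthographic summary, e.g. 'DC-10' -> 'X-d', 'Assange' -> 'Xx'."""
    shape = re.sub(r"[A-Z]+", "X", token)
    shape = re.sub(r"[a-z]+", "x", shape)
    return re.sub(r"[0-9]+", "d", shape)

def extract_features(tokens, i, window=6):
    """Feature dict for token i: current/prev/next word, capitalization,
    word shape, prefixes/suffixes up to length 5, and an unordered +/-6 word window."""
    tok = tokens[i]
    feats = {
        "word=" + tok: 1,
        "prev=" + (tokens[i - 1] if i > 0 else "<S>"): 1,
        "next=" + (tokens[i + 1] if i + 1 < len(tokens) else "</S>"): 1,
        "capitalized": int(tok[:1].isupper()),
        "shape=" + word_shape(tok): 1,
    }
    for k in range(1, min(5, len(tok)) + 1):
        feats["prefix=" + tok[:k]] = 1
        feats["suffix=" + tok[-k:]] = 1
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        if j != i:
            feats["window=" + tokens[j]] = 1  # deliberately not position-specific
    return feats

print(extract_features("Julian Assange accused the United".split(), 1))
```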
Generative vs Discriminative I
[Diagram: two models of the word "Assange" with features Capitalized=1, Previous=Julian, POS=noun. In the generative model (Naive Bayes) the label Y is the parent of the features X; in the discriminative model (logistic regression) the features X are the parents of the label Y.]
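To make the contrast concrete, here are the standard forms of the two models (spelled out here, not transcribed from the slides):

```latex
% Generative (Naive Bayes): model the joint distribution, classify via Bayes' rule
P(Y, X) = P(Y) \prod_j P(X_j \mid Y), \qquad
\hat{y} = \arg\max_y \; P(y) \prod_j P(x_j \mid y)

% Discriminative (logistic regression): model the conditional distribution directly
P(Y = y \mid X = x) = \frac{\exp(w_y \cdot \phi(x))}{\sum_{y'} \exp(w_{y'} \cdot \phi(x))}
```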
[Bar chart: F1 of Naive Bayes (NB) vs logistic regression (LR) for each feature set]
• 10K training words from CoNLL (British newswire), looking only for PERSON
• Metric: F1

             NB     LR
Words       51.3   52.8
Context     59.1   65.5
Features    70.8   81.5
Do More Features Always Help?
• How do we evaluate multiple feature sets?
– On validation set, not test set!
• Detecting underfitting
– Train & test performance similar and low
• Detecting overfitting
– Train performance high, test performance low
• The same holds every time we want to consider models of varying complexity!
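A minimal sketch of that protocol, assuming a held-out validation split and scikit-learn style estimators (LogisticRegression and f1_score are real calls; the overall harness and names are illustrative):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def pick_feature_set(feature_sets, X_train, y_train, X_val, y_val):
    """Compare feature sets on the validation split only; the test set is touched once, at the very end."""
    best_name, best_f1 = None, -1.0
    for name, featurize in feature_sets.items():  # e.g. {"words": ..., "context": ..., "features": ...}
        model = LogisticRegression(max_iter=1000)
        model.fit(featurize(X_train), y_train)
        f1 = f1_score(y_val, model.predict(featurize(X_val)))
        if f1 > best_f1:
            best_name, best_f1 = name, f1
    return best_name  # retrain with this feature set, then report F1 on the test set once
```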
Sequential Modeling
• Fragment of an example sentence:
Julian Assange accused the United
PER PER Other Other LOC
Yi: random variable with domain {Other, LOC, PER, ORG}
Xi: random variable for a vector of features about the word
Hidden Markov Model (HMM)
[Diagram: a chain of label variables Y1 -> Y2 -> Y3 -> Y4 -> Y5, each emitting an observation Xi; the observations are the words "Julian Assange accused the United", with X2 (Assange) annotated with the features Capitalized=1, Previous=Julian, POS=noun]
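A minimal Viterbi decoder for such a chain, assuming transition and emission scores have already been estimated and stored in log space (all array names are illustrative):

```python
import numpy as np

def viterbi(log_pi, log_A, log_B):
    """MAP label sequence for an HMM chain.
    log_pi[s]   : log P(Y1 = s)
    log_A[s, t] : log P(Y_{i+1} = t | Y_i = s)
    log_B[i, s] : log P(X_i | Y_i = s) for each position i."""
    n, S = log_B.shape
    delta = log_pi + log_B[0]            # best log score of any path ending in each state
    back = np.zeros((n, S), dtype=int)   # backpointers
    for i in range(1, n):
        scores = delta[:, None] + log_A  # S x S: score of moving prev state -> current state
        back[i] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_B[i]
    path = [int(delta.argmax())]
    for i in range(n - 1, 0, -1):        # follow backpointers from the end
        path.append(int(back[i, path[-1]]))
    return path[::-1]                    # label indices for positions 1..n
```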
Advantage of Sequential Modeling
[Bar chart: F1 of NB vs HMM for each feature set]

             NB     HMM
Words       51.3   57.4
Context     59.1   61.8
Features    70.8   70.8
Reminder: Plain logistic regression gives us 81.5!
Max Entropy Markov Model (MEMM)
• Markov chain over the Yi's
• Each Yi has a logistic regression CPD given Yi-1 and Xi
[Diagram: chain Y1 -> Y2 -> Y3 -> Y4 -> Y5, with each observed Xi also a parent of Yi; the words are "Julian Assange accused the United", with X2 (Assange) annotated Capitalized=1, Previous=Julian, POS=noun]
Max Entropy Markov Model (MEMM)
• Pro: uses features in a powerful way
• Con: downstream evidence doesn't help because of v-structures
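The corresponding factorization, in the standard MEMM form consistent with the CPDs above (taking Y_0 to be a fixed start symbol):

```latex
P(Y_{1:n} \mid X_{1:n}) = \prod_{i=1}^{n} P(Y_i \mid Y_{i-1}, X_i), \qquad
P(Y_i = y \mid Y_{i-1}, X_i) =
  \frac{\exp\bigl(w \cdot f(y, Y_{i-1}, X_i)\bigr)}
       {\sum_{y'} \exp\bigl(w \cdot f(y', Y_{i-1}, X_i)\bigr)}
```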
MEMM vs HMM vs NB
[Bar chart: F1 of NB vs HMM vs MEMM for each feature set]

             NB     HMM    MEMM
Words       51.3   57.4   59.1
Context     59.1   61.8   68.3
Features    70.8   70.8   84.6
Finally beat logistic regression!
Conditional Random Field (CRF)
[Diagram: linear-chain CRF with undirected edges between adjacent labels Yi and Yi+1, and between each Yi and its observation Xi, over the words "Julian Assange accused the United"]
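For contrast with the MEMM: the linear-chain CRF keeps the same features but normalizes globally over whole label sequences rather than per position (standard form, not transcribed from the slides):

```latex
P(Y_{1:n} \mid X_{1:n}) =
  \frac{1}{Z(X)} \prod_{i=1}^{n} \exp\bigl(w \cdot f(Y_i, Y_{i-1}, X, i)\bigr),
\qquad
Z(X) = \sum_{y_{1:n}} \prod_{i=1}^{n} \exp\bigl(w \cdot f(y_i, y_{i-1}, X, i)\bigr)
```

Global normalization is what removes the v-structure problem: evidence anywhere in X can influence any Yi.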
Comparison: Sequence Models
[Bar chart: F1 of HMM vs MEMM vs CRF for each feature set]

             HMM    MEMM   CRF
Words       57.4   59.1   59.6
Context     61.8   68.3   70.2
Features    70.8   84.6   85.8
Tradeoffs in Learning I
• HMM
– Simple closed-form solution
• MEMM
– Gradient ascent for the parameters of the logistic CPD P(Yi | Yi-1, Xi)
– But no inference required for learning
• CRF
– Gradient ascent for all parameters
– Inference over the entire graph required at each iteration
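A sketch of what "simple closed-form solution" means for the HMM: maximum-likelihood transitions and emissions are just normalized counts (the add-alpha smoothing and all names here are illustrative):

```python
from collections import Counter

def hmm_mle(tagged_sentences, alpha=0.1, vocab_size=10_000):
    """Closed-form HMM estimation: count label bigrams and (label, word) pairs, then normalize."""
    trans, emit, label_counts = Counter(), Counter(), Counter()
    for sent in tagged_sentences:          # sent: list of (word, label) pairs
        prev = "<S>"
        for word, label in sent:
            trans[(prev, label)] += 1
            emit[(label, word)] += 1
            label_counts[label] += 1
            prev = label
    labels = sorted(label_counts)

    def p_trans(prev, label):              # P(label | prev) with add-alpha smoothing
        total = sum(trans[(prev, l)] for l in labels)
        return (trans[(prev, label)] + alpha) / (total + alpha * len(labels))

    def p_emit(label, word):               # P(word | label) with add-alpha smoothing
        return (emit[(label, word)] + alpha) / (label_counts[label] + alpha * vocab_size)

    return p_trans, p_emit
```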
Tradeoffs in Learning II
• Can we learn from unsupervised data?
• HMM
– Yes, using EM
• MEMM/CRF
– No
• Discriminative objective: maximize log P(Y | X)
– But if Y is not observed, we can't maximize its probability
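To spell out the last bullet (a standard argument, not on the slide): with labels hidden, the generative objective still has something to optimize, while the discriminative one does not:

```latex
% Generative: hidden labels can be marginalized out, so EM can climb this
\ell_{\mathrm{gen}}(\theta) = \log P_\theta(X) = \log \sum_y P_\theta(X, Y = y)

% Discriminative: the objective itself requires an observed Y
\ell_{\mathrm{disc}}(\theta) = \log P_\theta(Y \mid X)
```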
PGMs and ML
• PGMs deal well with predictions of structured objects (sequences, graphs, trees)
– Exploit correlations between multiple parts of the prediction task
• Can easily incorporate prior knowledge into the model
• Learned model can often be used for multiple prediction tasks
• Useful framework for knowledge discovery
Inference
• Exact marginals?
– Clique tree calibration gives all marginals
– Final labeling might not be jointly consistent
• Approximate marginals?
– Doesn't make sense in this context
• MAP?
– Gives a single coherent solution
– Hard to get ROC curves (tradeoff of precision & recall)
Mismatch of Objectives
• MAP inference optimizes LL = log P(Y | X)
• Actual performance metric is usually different (e.g., F1)
• Performance is best if we can get these two metrics to be relatively well-aligned
– If the MAP assignment gets significantly lower F1 than the ground truth, the model needs to be adjusted
• Very useful for debugging approximate MAP (with y* the ground truth and yMAP the inferred assignment):
– If LL(y*) >> LL(yMAP): the algorithm found a local optimum
– If LL(y*) << LL(yMAP): LL is a bad surrogate for the objective
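A sketch of that debugging check, assuming a model object exposing a conditional log-likelihood scorer (all names hypothetical):

```python
def debug_map(model, x, y_true, y_map, margin=1.0):
    """Compare log P(y | x) of the ground truth against the approximate MAP assignment.
    `model.log_prob` is a hypothetical conditional log-likelihood scorer."""
    ll_true = model.log_prob(y_true, x)   # LL(y*)
    ll_map = model.log_prob(y_map, x)     # LL(yMAP)
    if ll_true > ll_map + margin:
        return "search problem: approximate MAP got stuck in a local optimum"
    if ll_true < ll_map - margin:
        return "modeling problem: LL is a bad surrogate for the real objective (e.g., F1)"
    return "LL and inference look consistent"
```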
Richer Models
[Diagram: two sentence fragments far apart in the document: "Julian Assange accused the United ..." (labels Y1-Y5 over words X1-X5) and "... said Stephen, Assange's lawyer, to ..." (labels Y101-Y105 over words X101-X105), suggesting dependencies between repeated mentions of the same name]
Summary
• Foundation I: Probabilistic model
– Coherent treatment of uncertainty
– Declarative representation:
• separates model and inference
• separates inference and learning
• Foundation II: Graphical model
– Encode and exploit structure for compact representation and efficient inference
– Allows modularity in updating the model