Maximum Entropy Models/Logistic Regression
CMSC 678, UMBC
February 7th, 2018
Announcements: Assignment 2
Due 11:59 AM, Monday 3/5
Topic: classification and neural models
Out: by Friday 2/9
Announcements: Course Project
Teams of 1-4
Three “check-ins”:
Proposal: 3/14
Update: 4/9
Final submission: 5/23
Some novel aspect is needed.
Ex 1: reimplement an existing technique and apply it to a new domain
Ex 2: explore a novel technique on an existing problem
Ex 3: explore a novel task
If you run into compute issues, talk to course staff.
Options exist, e.g.,
https://aws.amazon.com/education/awseducate/
Recap from last time…
A Simple Linear Regression Model: From ERM
loss function: $\ell = (y_i - \hat{y}_i)^2$
$\arg\min_{\mathbf{w}} \sum_i (y_i - \mathbf{w}^T x_i - b)^2 = \min_{\mathbf{w}} (\mathbf{y} - \mathbf{X}\mathbf{w} - \mathbf{b})^T (\mathbf{y} - \mathbf{X}\mathbf{w} - \mathbf{b})$
least squares estimation
ESL, Fig 3.1
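As a minimal sketch of least-squares estimation (the data here is made up for illustration), the bias b can be absorbed into w by appending a constant-1 column to X, and the minimizer solved directly:

```python
import numpy as np

# Hypothetical noise-free data for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w, true_b = np.array([2.0, -1.0, 0.5]), 0.25
y = X @ true_w + true_b

# Absorb b into w via a constant-1 feature, then solve
# min_w ||y - Xb w||^2 by least squares.
Xb = np.hstack([X, np.ones((100, 1))])
w_hat, *_ = np.linalg.lstsq(Xb, y, rcond=None)
```

With noise-free data, `w_hat` recovers the true weights and bias exactly.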
A Simple Linear Model for Classification
$\hat{y}_i = \mathbf{w}^T \mathbf{x}_i$
data point $x_i$, as a vector of features
label $y_i$ (WLOG, a binary {−1, +1} value, so $y_i \mathbf{w}^T \mathbf{x}_i > 0$ exactly when the prediction has the right sign)
vector w of weights
loss function: $\ell = \begin{cases} 1, & y_i \mathbf{w}^T \mathbf{x}_i < 0 \\ 0, & y_i \mathbf{w}^T \mathbf{x}_i \geq 0 \end{cases} = 1[y_i \mathbf{w}^T \mathbf{x}_i < 0]$
[Figure: decision boundary $\mathbf{w}^T \mathbf{x} = 0$ (with b = 0); predict 1 on one side, predict 0 on the other]
Loss Function Example: 0-1 Loss
$\ell(y, \hat{y}) = \begin{cases} 0, & \text{if } y = \hat{y} \\ 1, & \text{if } y \neq \hat{y} \end{cases}$
Problem: not differentiable wrt $\hat{y}$ (or θ)
Solution 1: is h(x) a conditional distribution p(y | x)? Use MAP
Solution 2: use a surrogate loss that approximates 0-1
Solution 3: is the data linearly separable? The perceptron can work
Classification Evaluation: Accuracy, Precision, and Recall
Accuracy: % of items correct
Precision: % of selected items that are correct
Recall: % of correct items that are selected
                          Actually Correct       Actually Incorrect
Selected/guessed          True Positive (TP)     False Positive (FP)
Not selected/not guessed  False Negative (FN)    True Negative (TN)

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Accuracy = (TP + TN) / (TP + FP + FN + TN)
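A minimal sketch of these three metrics computed from gold and predicted binary labels (the labels below are illustrative, not from the lecture):

```python
def confusion_counts(gold, pred):
    # Tally the four cells of the confusion table above.
    tp = sum(g == 1 and p == 1 for g, p in zip(gold, pred))
    fp = sum(g == 0 and p == 1 for g, p in zip(gold, pred))
    fn = sum(g == 1 and p == 0 for g, p in zip(gold, pred))
    tn = sum(g == 0 and p == 0 for g, p in zip(gold, pred))
    return tp, fp, fn, tn

gold = [1, 1, 0, 0, 1, 0]
pred = [1, 0, 0, 1, 1, 0]
tp, fp, fn, tn = confusion_counts(gold, pred)
precision = tp / (tp + fp)                    # % of selected items that are correct
recall    = tp / (tp + fn)                    # % of correct items that are selected
accuracy  = (tp + tn) / (tp + fp + fn + tn)   # % of items correct
```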
Maximum Entropy (Log-linear) Models
$p(x \mid y) \propto \exp(\theta \cdot f(x, y))$: classify in one go
Document Classification
Observed document → Label
“Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.” → ATTACK
Document Classification
ATTACK template:
• # killed:
• Type:
• Perp:
“shot” → ATTACK
We need to score the different combinations.
Score and Combine Our Possibilities
score1(fatally shot, ATTACK)
score2(seriously wounded, ATTACK)
score3(Shining Path, ATTACK)
…
scorek(department, ATTACK)
COMBINE → posterior probability of ATTACK
(are all of these uncorrelated?)
Q: What are the score and combine functions for Naïve Bayes?
Scoring Our Possibilities
score(doc, ATTACK) = score1(fatally shot, ATTACK) + score2(seriously wounded, ATTACK) + score3(Shining Path, ATTACK) + …
p(ATTACK | doc) ∝ SNAP(score(doc, ATTACK))
(doc = the observed Shining Path document above)
Maxent Modeling
What function…
operates on any real number?
is never less than 0?
f(x) = exp(x)
p(ATTACK | doc) ∝ exp(score(doc, ATTACK))
Maxent Modeling
p(ATTACK | doc) ∝ exp(score1(fatally shot, ATTACK) + score2(seriously wounded, ATTACK) + score3(Shining Path, ATTACK) + …)
Maxent Modeling
Learn the scores (but we’ll declare what combinations should be looked at):
p(ATTACK | doc) ∝ exp(weight1 · applies1(fatally shot, ATTACK) + weight2 · applies2(seriously wounded, ATTACK) + weight3 · applies3(Shining Path, ATTACK) + …)
Maxent Modeling
p(ATTACK | doc) = (1/Z) exp(weight1 · applies1(fatally shot, ATTACK) + weight2 · applies2(seriously wounded, ATTACK) + weight3 · applies3(Shining Path, ATTACK) + …)
Q: How do we define Z?
Normalization for Classification
Z = Σ_{label x} exp(weight1 · applies1(fatally shot, x) + weight2 · applies2(seriously wounded, x) + weight3 · applies3(Shining Path, x) + …)
$p(x \mid y) \propto \exp(\theta \cdot f(x, y))$: classify doc y with label x in one go
Q: What if none of our features apply?
Guiding Principle for Log-Linear Models
“[The log-linear estimate]is the least biased estimate possible on the given information; i.e., it is maximally noncommittal with regard to missing information.”Edwin T. Jaynes, 1957
If no features apply, f = 0, and exp(θ · 0) = 1 for every label: the model is maximally noncommittal (uniform).
exp(θ1 · f1(fatally shot, ATTACK) + θ2 · f2(seriously wounded, ATTACK) + θ3 · f3(Shining Path, ATTACK) + …)
Easier-to-write form
exp(θ1 · f1(fatally shot, ATTACK) + θ2 · f2(seriously wounded, ATTACK) + θ3 · f3(Shining Path, ATTACK) + …)
With K weights and K features, this is just a dot product:
exp(θ · f(doc, ATTACK))
K-dimensional weight vector; K-dimensional feature vector.
Log-Linear Models
$p_\theta(x \mid y) \propto \exp(\theta^T f(x, y))$
f(x, y): feature function(s) / sufficient statistics / “strength” function(s)
θ: feature weights / natural parameters / distribution parameters
Log-Linear Models
$p_\theta(x \mid y) = \dfrac{\exp(\theta^T f(x, y))}{\sum_{x'} \exp(\theta^T f(x', y))}$
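The normalized model above can be sketched in a few lines. Everything here is an illustrative assumption: the toy label set, the indicator features (“phrase appears AND label is y”), and the hand-set weights are not the lecture’s actual feature set.

```python
import math

# Hypothetical weights theta, indexed by (phrase, label) feature.
weights = {
    ("fatally shot", "ATTACK"): 2.0,
    ("shining path", "ATTACK"): 1.5,
    ("mayor", "ELECTION"): 1.0,
}

def features(doc, label):
    # f(x, y): indicator features that fire when the phrase appears in doc y
    # and the candidate label is x.
    return [(phrase, label)
            for phrase in ["fatally shot", "shining path", "mayor"]
            if phrase in doc.lower()]

def p_label_given_doc(doc, labels=("ATTACK", "ELECTION")):
    # p_theta(x | y) = exp(theta . f(x, y)) / sum_x' exp(theta . f(x', y))
    scores = {x: sum(weights.get(f, 0.0) for f in features(doc, x))
              for x in labels}
    z = sum(math.exp(s) for s in scores.values())
    return {x: math.exp(scores[x]) / z for x in labels}

doc = "Three people have been fatally shot in a Shining Path attack."
probs = p_label_given_doc(doc)
```

Note that Z sums over candidate labels x′, not over documents, exactly as in the formula above.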
https://www.csee.umbc.edu/courses/graduate/678/spring18/loglin-tutorial
https://goo.gl/SsXQS2
Lesson 1
Connections to Other Techniques
Log-Linear Models
Connections to Other Techniques
Log-Linear Models:
• as statistical regression: (multinomial) logistic regression / softmax regression
“Solution” 1: A Simple Probabilistic (Linear*) Classifier
turn responses into probabilities
loss function: $\ell = 1[y_i \, p(\hat{y}_i = 1 \mid x_i) < 0]$
minimize posterior 0-1 loss:
$\min_{\mathbf{w}} \sum_i \mathbb{E}_{\hat{y}_i}\left[1[y_i \, p(\hat{y}_i = 1 \mid x_i) < 0]\right] = \max_{\mathbf{w}} \sum_i p(\hat{y}_i = y_i \mid x_i)$
why MAP classifiers are reasonable
decision rule: $\hat{y}_i = \begin{cases} 0, & \sigma(\mathbf{w}^T \mathbf{x}_i + b) < 0.5 \\ 1, & \sigma(\mathbf{w}^T \mathbf{x}_i + b) \geq 0.5 \end{cases}$
Plot from https://towardsdatascience.com/multi-layer-neural-networks-with-sigmoid-function-deep-learning-for-rookies-2-bf464f09eb7f (*linear not strictly required)
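The sigmoid decision rule above can be sketched directly (the weight vector here is an illustrative assumption):

```python
import math

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^{-z}), mapping any real score into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x, b=0.0):
    # y_hat = 1 iff sigma(w.x + b) >= 0.5, which holds iff w.x + b >= 0
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if sigmoid(z) >= 0.5 else 0

w = [1.0, -2.0]  # hypothetical learned weights
```

Since σ is monotone with σ(0) = 0.5, thresholding the probability at 0.5 is the same as thresholding the linear score at 0.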
Connections to Other Techniques
Log-Linear Models:
• as statistical regression: (multinomial) logistic regression / softmax regression
• based in information theory: Maximum Entropy models (MaxEnt)
• a form of Generalized Linear Model
• viewed as discriminative Naïve Bayes
• very shallow (sigmoidal) neural nets (to be cool today :) )
Objective = Full Likelihood?
These values can have very small magnitude → underflow
Differentiating this product could be a pain
$\prod_i p_\theta(x_i \mid y_i) \propto \prod_i \exp(\theta^T f(x_i, y_i))$
Log-Likelihood
Wide range of (negative) numbers; sums are more stable
Differentiating this becomes nicer (even though Z depends on θ)
$\log \prod_i p_\theta(x_i \mid y_i) = \sum_i \log p_\theta(x_i \mid y_i) = \sum_i \left[\theta^T f(x_i, y_i) - \log Z(y_i)\right]$
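A minimal sketch of this log-likelihood, with a log-sum-exp trick for the log Z(y_i) term to avoid the underflow/overflow issue just mentioned. The tiny label set and feature function at the bottom are illustrative assumptions:

```python
import math

def log_sum_exp(scores):
    # Numerically stable log(sum_j exp(scores_j)).
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))

def log_likelihood(theta, data, labels, f):
    # data: list of (doc y_i, gold label x_i); f(label, doc) -> feature vector.
    # Returns sum_i [theta . f(x_i, y_i) - log Z(y_i)].
    total = 0.0
    for y, x in data:
        score = lambda lab: sum(t * fi for t, fi in zip(theta, f(lab, y)))
        total += score(x) - log_sum_exp([score(xp) for xp in labels])
    return total

# Toy check: with theta = 0, each example contributes -log(#labels).
labels = ["A", "B"]
f_toy = lambda lab, y: [1.0 if lab == "A" else 0.0]
ll = log_likelihood([0.0], [("doc", "A")], labels, f_toy)
```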
Log-Likelihood Gradient
Each component k is the difference between:
the total value of feature fk in the training data
and
the total value the current model pθ expects for feature fk
“Moment Matching” (A1 Q2 & A1 Q3, Eq-1: what were the feature functions?)
$\nabla_\theta \mathcal{L} = \sum_i f(x_i, y_i) - \sum_i \mathbb{E}_{p_\theta(x' \mid y_i)}[f(x', y_i)]$
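The moment-matching gradient can be sketched as follows: for each component k, add the empirical feature value and subtract the model's expected feature value. The toy labels and feature function mirror the earlier sketches and are illustrative assumptions:

```python
import math

def gradient(theta, data, labels, f):
    # data: list of (doc y_i, gold label x_i); f(label, doc) -> feature vector.
    # Returns sum_i f(x_i, y_i) - sum_i E_{x' ~ p_theta(.|y_i)}[f(x', y_i)].
    grad = [0.0] * len(theta)
    for y, x in data:
        # Empirical term: observed feature values for the gold label.
        for j, v in enumerate(f(x, y)):
            grad[j] += v
        # Expected term: features under the model's posterior over labels.
        scores = [sum(t * fi for t, fi in zip(theta, f(xp, y))) for xp in labels]
        m = max(scores)
        z = sum(math.exp(s - m) for s in scores)
        for xp, s in zip(labels, scores):
            p = math.exp(s - m) / z  # p_theta(x' | y_i)
            for j, v in enumerate(f(xp, y)):
                grad[j] -= p * v
    return grad

# Toy check: with theta = 0 and two labels, the expectation is 0.5,
# so the gradient component is 1.0 - 0.5.
labels = ["A", "B"]
f_toy = lambda lab, y: [1.0 if lab == "A" else 0.0]
g = gradient([0.0], [("doc", "A")], labels, f_toy)
```

At a maximum of the log-likelihood this gradient is zero, i.e. empirical and expected feature totals match, which is where the name “moment matching” comes from.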
https://www.csee.umbc.edu/courses/graduate/678/spring18/loglin-tutorial
https://goo.gl/SsXQS2
Lesson 6
Log-Likelihood Gradient Derivation
The $\sum_i \theta^T f(x_i, y_i)$ term differentiates directly to $\sum_i f(x_i, y_i)$. The $\sum_i \log Z(y_i)$ term also depends on θ:
$Z(y_i) = \sum_{x'} \exp(\theta \cdot f(x', y_i))$
Use the (calculus) chain rule: $\frac{\partial}{\partial \theta} \log g(h(\theta)) = \frac{\partial g}{\partial h(\theta)} \frac{\partial h}{\partial \theta}$
$\frac{\partial}{\partial \theta} \sum_i \log Z(y_i) = \sum_i \sum_{x'} \frac{\exp(\theta^T f(x', y_i))}{Z(y_i)} f(x', y_i) = \sum_i \sum_{x'} p_\theta(x' \mid y_i) f(x', y_i)$
(scalar $p_\theta(x' \mid y_i)$ times a vector of functions)
Do we want these to fully match?
What does it mean if they do?
What if we have missing values in our data?
https://www.csee.umbc.edu/courses/graduate/678/spring18/loglin-tutorial
https://goo.gl/SsXQS2
Lesson 8: Regularization
Understanding Conditioning
$p(x \mid y) \propto \exp(\theta \cdot f(y))$
Is this a good posterior classifier? (no: the features ignore the label x, so every label gets the same score and the posterior is uniform)
https://www.csee.umbc.edu/courses/graduate/678/spring18/loglin-tutorial
https://goo.gl/SsXQS2
Lesson 11