Maximum Entropy Models/Logistic Regression
CMSC 678, UMBC
February 7th, 2018
Announcements: Assignment 2
Due 11:59 AM, Monday 3/5
Topic: classification and neural models
Out: by Friday 2/9
Announcements: Course Project
Teams of 1-4
Three “check-ins”:
Proposal: 3/14
Update: 4/9
Final submission: 5/23
Some novel aspect is needed.
Ex 1: reimplement an existing technique and apply it to a new domain
Ex 2: explore a novel technique on an existing problem
Ex 3: explore a novel task
If you run into compute issues, talk to course staff.
Options exist, e.g.,
https://aws.amazon.com/education/awseducate/
Recap from last time…
A Simple Linear Regression Model: From ERM
loss function: $\ell = (y_i - \hat{y}_i)^2$
$\arg\min_{\mathbf{w}} \sum_i (y_i - \mathbf{w}^T x_i - b)^2 = \min_{\mathbf{w}} (\mathbf{y} - \mathbf{X}\mathbf{w} - \mathbf{b})^T (\mathbf{y} - \mathbf{X}\mathbf{w} - \mathbf{b})$
least squares estimation
ESL, Fig 3.1
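As a minimal sketch of least-squares estimation (the data here is made up for illustration), the bias b can be absorbed into w by appending a constant-1 column to X, and the minimizer solved directly:

```python
import numpy as np

# Hypothetical noise-free data for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w, true_b = np.array([2.0, -1.0, 0.5]), 0.25
y = X @ true_w + true_b

# Absorb b into w via a constant-1 feature, then solve
# min_w ||y - Xb w||^2 by least squares.
Xb = np.hstack([X, np.ones((100, 1))])
w_hat, *_ = np.linalg.lstsq(Xb, y, rcond=None)
```

With noise-free data, `w_hat` recovers the true weights and bias exactly.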
A Simple Linear Model for Classification
$\hat{y}_i = \mathbf{w}^T \mathbf{x}_i$
data point $x_i$, as a vector of features
label $y_i$ (WLOG, a binary {−1, +1} value, so $y_i \mathbf{w}^T \mathbf{x}_i > 0$ exactly when the prediction has the right sign)
vector w of weights
loss function: $\ell = \begin{cases} 1, & y_i \mathbf{w}^T \mathbf{x}_i < 0 \\ 0, & y_i \mathbf{w}^T \mathbf{x}_i \geq 0 \end{cases} = 1[y_i \mathbf{w}^T \mathbf{x}_i < 0]$
[Figure: decision boundary $\mathbf{w}^T \mathbf{x} = 0$ (with b = 0); predict 1 on one side, predict 0 on the other]
Loss Function Example: 0-1 Loss
$\ell(y, \hat{y}) = \begin{cases} 0, & \text{if } y = \hat{y} \\ 1, & \text{if } y \neq \hat{y} \end{cases}$
Problem: not differentiable wrt $\hat{y}$ (or θ)
Solution 1: is h(x) a conditional distribution p(y | x)? Use MAP
Solution 2: use a surrogate loss that approximates 0-1
Solution 3: is the data linearly separable? The perceptron can work
Classification Evaluation: Accuracy, Precision, and Recall
Accuracy: % of items correct
Precision: % of selected items that are correct
Recall: % of correct items that are selected
                          Actually Correct       Actually Incorrect
Selected/guessed          True Positive (TP)     False Positive (FP)
Not selected/not guessed  False Negative (FN)    True Negative (TN)

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Accuracy = (TP + TN) / (TP + FP + FN + TN)
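A minimal sketch of these three metrics computed from gold and predicted binary labels (the labels below are illustrative, not from the lecture):

```python
def confusion_counts(gold, pred):
    # Tally the four cells of the confusion table above.
    tp = sum(g == 1 and p == 1 for g, p in zip(gold, pred))
    fp = sum(g == 0 and p == 1 for g, p in zip(gold, pred))
    fn = sum(g == 1 and p == 0 for g, p in zip(gold, pred))
    tn = sum(g == 0 and p == 0 for g, p in zip(gold, pred))
    return tp, fp, fn, tn

gold = [1, 1, 0, 0, 1, 0]
pred = [1, 0, 0, 1, 1, 0]
tp, fp, fn, tn = confusion_counts(gold, pred)
precision = tp / (tp + fp)                    # % of selected items that are correct
recall    = tp / (tp + fn)                    # % of correct items that are selected
accuracy  = (tp + tn) / (tp + fp + fn + tn)   # % of items correct
```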
Maximum Entropy (Log-linear) Models
$p(x \mid y) \propto \exp(\theta \cdot f(x, y))$: classify in one go
Document Classification
Observed document → Label
“Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.” → ATTACK
Document Classification
ATTACK template:
• # killed:
• Type:
• Perp:
“shot” → ATTACK
We need to score the different combinations.
Score and Combine Our Possibilities
score1(fatally shot, ATTACK)
score2(seriously wounded, ATTACK)
score3(Shining Path, ATTACK)
…
scorek(department, ATTACK)
COMBINE → posterior probability of ATTACK
(are all of these uncorrelated?)
Q: What are the score and combine functions for Naïve Bayes?
Scoring Our Possibilities
score(doc, ATTACK) = score1(fatally shot, ATTACK) + score2(seriously wounded, ATTACK) + score3(Shining Path, ATTACK) + …
p(ATTACK | doc) ∝ SNAP(score(doc, ATTACK))
(doc = the observed Shining Path document above)
Maxent Modeling
What function…
operates on any real number?
is never less than 0?
f(x) = exp(x)
p(ATTACK | doc) ∝ exp(score(doc, ATTACK))
Maxent Modeling
p(ATTACK | doc) ∝ exp(score1(fatally shot, ATTACK) + score2(seriously wounded, ATTACK) + score3(Shining Path, ATTACK) + …)
Maxent Modeling
Learn the scores (but we’ll declare what combinations should be looked at):
p(ATTACK | doc) ∝ exp(weight1 · applies1(fatally shot, ATTACK) + weight2 · applies2(seriously wounded, ATTACK) + weight3 · applies3(Shining Path, ATTACK) + …)
Maxent Modeling
p(ATTACK | doc) = (1/Z) exp(weight1 · applies1(fatally shot, ATTACK) + weight2 · applies2(seriously wounded, ATTACK) + weight3 · applies3(Shining Path, ATTACK) + …)
Q: How do we define Z?
Normalization for Classification
Z = Σ_{label x} exp(weight1 · applies1(fatally shot, x) + weight2 · applies2(seriously wounded, x) + weight3 · applies3(Shining Path, x) + …)
$p(x \mid y) \propto \exp(\theta \cdot f(x, y))$: classify doc y with label x in one go
Q: What if none of our features apply?
Guiding Principle for Log-Linear Models
“[The log-linear estimate]is the least biased estimate possible on the given information; i.e., it is maximally noncommittal with regard to missing information.”Edwin T. Jaynes, 1957
If no features apply, f = 0, and exp(θ · 0) = 1 for every label: the model is maximally noncommittal (uniform).
exp(θ1 · f1(fatally shot, ATTACK) + θ2 · f2(seriously wounded, ATTACK) + θ3 · f3(Shining Path, ATTACK) + …)
Easier-to-write form
exp(θ1 · f1(fatally shot, ATTACK) + θ2 · f2(seriously wounded, ATTACK) + θ3 · f3(Shining Path, ATTACK) + …)
With K weights and K features, this is just a dot product:
exp(θ · f(doc, ATTACK))
K-dimensional weight vector; K-dimensional feature vector.
Log-Linear Models
$p_\theta(x \mid y) \propto \exp(\theta^T f(x, y))$
f(x, y): feature function(s) / sufficient statistics / “strength” function(s)
θ: feature weights / natural parameters / distribution parameters
Log-Linear Models
$p_\theta(x \mid y) = \dfrac{\exp(\theta^T f(x, y))}{\sum_{x'} \exp(\theta^T f(x', y))}$
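The normalized model above can be sketched in a few lines. Everything here is an illustrative assumption: the toy label set, the indicator features (“phrase appears AND label is y”), and the hand-set weights are not the lecture’s actual feature set.

```python
import math

# Hypothetical weights theta, indexed by (phrase, label) feature.
weights = {
    ("fatally shot", "ATTACK"): 2.0,
    ("shining path", "ATTACK"): 1.5,
    ("mayor", "ELECTION"): 1.0,
}

def features(doc, label):
    # f(x, y): indicator features that fire when the phrase appears in doc y
    # and the candidate label is x.
    return [(phrase, label)
            for phrase in ["fatally shot", "shining path", "mayor"]
            if phrase in doc.lower()]

def p_label_given_doc(doc, labels=("ATTACK", "ELECTION")):
    # p_theta(x | y) = exp(theta . f(x, y)) / sum_x' exp(theta . f(x', y))
    scores = {x: sum(weights.get(f, 0.0) for f in features(doc, x))
              for x in labels}
    z = sum(math.exp(s) for s in scores.values())
    return {x: math.exp(scores[x]) / z for x in labels}

doc = "Three people have been fatally shot in a Shining Path attack."
probs = p_label_given_doc(doc)
```

Note that Z sums over candidate labels x′, not over documents, exactly as in the formula above.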
https://www.csee.umbc.edu/courses/graduate/678/spring18/loglin-tutorial
https://goo.gl/SsXQS2
Lesson 1
Connections to Other Techniques
Log-Linear Models
Connections to Other Techniques
Log-Linear Models:
• as statistical regression: (multinomial) logistic regression / softmax regression
“Solution” 1: A Simple Probabilistic (Linear*) Classifier
turn responses into probabilities
loss function: $\ell = 1[y_i \, p(\hat{y}_i = 1 \mid x_i) < 0]$
minimize posterior 0-1 loss:
$\min_{\mathbf{w}} \sum_i \mathbb{E}_{\hat{y}_i}\left[1[y_i \, p(\hat{y}_i = 1 \mid x_i) < 0]\right] = \max_{\mathbf{w}} \sum_i p(\hat{y}_i = y_i \mid x_i)$
why MAP classifiers are reasonable
decision rule: $\hat{y}_i = \begin{cases} 0, & \sigma(\mathbf{w}^T \mathbf{x}_i + b) < 0.5 \\ 1, & \sigma(\mathbf{w}^T \mathbf{x}_i + b) \geq 0.5 \end{cases}$
Plot from https://towardsdatascience.com/multi-layer-neural-networks-with-sigmoid-function-deep-learning-for-rookies-2-bf464f09eb7f (*linear not strictly required)
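The sigmoid decision rule above can be sketched directly (the weight vector here is an illustrative assumption):

```python
import math

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^{-z}), mapping any real score into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x, b=0.0):
    # y_hat = 1 iff sigma(w.x + b) >= 0.5, which holds iff w.x + b >= 0
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if sigmoid(z) >= 0.5 else 0

w = [1.0, -2.0]  # hypothetical learned weights
```

Since σ is monotone with σ(0) = 0.5, thresholding the probability at 0.5 is the same as thresholding the linear score at 0.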
Connections to Other Techniques
Log-Linear Models:
• as statistical regression: (multinomial) logistic regression / softmax regression
• based in information theory: Maximum Entropy models (MaxEnt)
• a form of Generalized Linear Model
• viewed as discriminative Naïve Bayes
• very shallow (sigmoidal) neural nets (to be cool today :) )
Objective = Full Likelihood?
These values can have very small magnitude → underflow
Differentiating this product could be a pain
$\prod_i p_\theta(x_i \mid y_i) \propto \prod_i \exp(\theta^T f(x_i, y_i))$
Log-Likelihood
Wide range of (negative) numbers; sums are more stable
Differentiating this becomes nicer (even though Z depends on θ)
$\log \prod_i p_\theta(x_i \mid y_i) = \sum_i \log p_\theta(x_i \mid y_i) = \sum_i \left[\theta^T f(x_i, y_i) - \log Z(y_i)\right]$
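A minimal sketch of this log-likelihood, with a log-sum-exp trick for the log Z(y_i) term to avoid the underflow/overflow issue just mentioned. The tiny label set and feature function at the bottom are illustrative assumptions:

```python
import math

def log_sum_exp(scores):
    # Numerically stable log(sum_j exp(scores_j)).
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))

def log_likelihood(theta, data, labels, f):
    # data: list of (doc y_i, gold label x_i); f(label, doc) -> feature vector.
    # Returns sum_i [theta . f(x_i, y_i) - log Z(y_i)].
    total = 0.0
    for y, x in data:
        score = lambda lab: sum(t * fi for t, fi in zip(theta, f(lab, y)))
        total += score(x) - log_sum_exp([score(xp) for xp in labels])
    return total

# Toy check: with theta = 0, each example contributes -log(#labels).
labels = ["A", "B"]
f_toy = lambda lab, y: [1.0 if lab == "A" else 0.0]
ll = log_likelihood([0.0], [("doc", "A")], labels, f_toy)
```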
Log-Likelihood Gradient
Each component k is the difference between:
the total value of feature fk in the training data
and
the total value the current model pθ expects for feature fk
“Moment Matching” (A1 Q2 & A1 Q3, Eq-1: what were the feature functions?)
$\nabla_\theta \mathcal{L} = \sum_i f(x_i, y_i) - \sum_i \mathbb{E}_{p_\theta(x' \mid y_i)}[f(x', y_i)]$
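The moment-matching gradient can be sketched as follows: for each component k, add the empirical feature value and subtract the model's expected feature value. The toy labels and feature function mirror the earlier sketches and are illustrative assumptions:

```python
import math

def gradient(theta, data, labels, f):
    # data: list of (doc y_i, gold label x_i); f(label, doc) -> feature vector.
    # Returns sum_i f(x_i, y_i) - sum_i E_{x' ~ p_theta(.|y_i)}[f(x', y_i)].
    grad = [0.0] * len(theta)
    for y, x in data:
        # Empirical term: observed feature values for the gold label.
        for j, v in enumerate(f(x, y)):
            grad[j] += v
        # Expected term: features under the model's posterior over labels.
        scores = [sum(t * fi for t, fi in zip(theta, f(xp, y))) for xp in labels]
        m = max(scores)
        z = sum(math.exp(s - m) for s in scores)
        for xp, s in zip(labels, scores):
            p = math.exp(s - m) / z  # p_theta(x' | y_i)
            for j, v in enumerate(f(xp, y)):
                grad[j] -= p * v
    return grad

# Toy check: with theta = 0 and two labels, the expectation is 0.5,
# so the gradient component is 1.0 - 0.5.
labels = ["A", "B"]
f_toy = lambda lab, y: [1.0 if lab == "A" else 0.0]
g = gradient([0.0], [("doc", "A")], labels, f_toy)
```

At a maximum of the log-likelihood this gradient is zero, i.e. empirical and expected feature totals match, which is where the name “moment matching” comes from.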
https://www.csee.umbc.edu/courses/graduate/678/spring18/loglin-tutorial
https://goo.gl/SsXQS2
Lesson 6
Log-Likelihood Gradient Derivation
The $\sum_i \theta^T f(x_i, y_i)$ term differentiates directly to $\sum_i f(x_i, y_i)$. The $\sum_i \log Z(y_i)$ term also depends on θ:
$Z(y_i) = \sum_{x'} \exp(\theta \cdot f(x', y_i))$
Use the (calculus) chain rule: $\frac{\partial}{\partial \theta} \log g(h(\theta)) = \frac{\partial g}{\partial h(\theta)} \frac{\partial h}{\partial \theta}$
$\frac{\partial}{\partial \theta} \sum_i \log Z(y_i) = \sum_i \sum_{x'} \frac{\exp(\theta^T f(x', y_i))}{Z(y_i)} f(x', y_i) = \sum_i \sum_{x'} p_\theta(x' \mid y_i) f(x', y_i)$
(scalar $p_\theta(x' \mid y_i)$ times a vector of functions)
Do we want these to fully match?
What does it mean if they do?
What if we have missing values in our data?
https://www.csee.umbc.edu/courses/graduate/678/spring18/loglin-tutorial
https://goo.gl/SsXQS2
Lesson 8: Regularization
Understanding Conditioning
$p(x \mid y) \propto \exp(\theta \cdot f(y))$
Is this a good posterior classifier? (no: the features ignore the label x, so every label gets the same score and the posterior is uniform)
https://www.csee.umbc.edu/courses/graduate/678/spring18/loglin-tutorial
https://goo.gl/SsXQS2
Lesson 11