Bayes’ Theorem and Logistic Regression
TRANSCRIPT
BAYES’ THEOREM
Bayes’ theorem gives the relationship between the probabilities of A and B, P(A) and P(B), and the conditional probabilities of A given B and B given A, P(A|B) and P(B|A). In its most common form:
P(A|B) = P(B|A) P(A) / P(B)
In the Bayesian interpretation, probability measures a degree of belief. Bayes’ theorem links the belief in a proposition before and after accounting for evidence.
For proposition A and evidence B:
P(A), the prior, is the initial degree of belief in A
P(A|B), the posterior, is the degree of belief having accounted for B
The quotient P(B|A)/P(B) represents the support B provides for A
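As a worked example, the update can be computed directly. The numbers below (a test with 95% sensitivity, a 5% false-positive rate, and 1% prevalence) are made up for illustration:

```python
# Hypothetical disease-test example: A = "patient has the disease",
# B = "test is positive". All numbers are illustrative.
def bayes_posterior(prior, likelihood, evidence):
    """P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood * prior / evidence

p_a = 0.01              # P(A): prior belief (prevalence)
p_b_given_a = 0.95      # P(B|A): test sensitivity
p_b_given_not_a = 0.05  # P(B|~A): false-positive rate

# Total probability: P(B) = P(B|A)P(A) + P(B|~A)P(~A)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

posterior = bayes_posterior(p_a, p_b_given_a, p_b)  # P(A|B) ~ 0.161
support = p_b_given_a / p_b                         # P(B|A)/P(B) ~ 16.1
```

Even with a positive test, the posterior stays low because the prior is small; the support ratio shows the evidence raised the belief about sixteen-fold.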
NAÏVE BAYES – PROBABILISTIC MODEL
The probability model for a classifier is a conditional model p(C|F1, …, Fn) over a dependent class variable C with a small number of outcomes (classes), conditioned on several feature variables F1 through Fn.
Problem – a large number of features, or features that can take a large number of values, makes the probability tables infeasible.
Using Bayes’ theorem, p(C|F1, …, Fn) = p(C) p(F1, …, Fn|C) / p(F1, …, Fn), or in plain English:
posterior = (prior × likelihood) / evidence
The denominator does not depend on C, and the values of the features Fi are given, so the denominator is effectively constant. The numerator is equivalent to the joint probability model p(C, F1, …, Fn).
Using the chain rule for repeated applications of the definition of conditional probability:
p(C, F1, …, Fn) = p(C) p(F1|C) p(F2|C, F1) … p(Fn|C, F1, …, Fn-1)
Role of the naïve condition: assume that each feature Fi is conditionally independent of every other feature Fj (j ≠ i) given the class C, i.e. p(Fi|C, Fj) = p(Fi|C).
The joint model can then be represented as
p(C, F1, …, Fn) = p(C) p(F1|C) p(F2|C) … p(Fn|C) = p(C) ∏ p(Fi|C)
Under this assumption, the conditional distribution over the class variable is
p(C|F1, …, Fn) = (1/Z) p(C) ∏ p(Fi|C)
where Z = p(F1, …, Fn) is a scaling factor that depends only on the features.
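A small numeric sketch of this normalization, with made-up priors and likelihoods for two classes and two binary features:

```python
# Illustrative (made-up) numbers for two classes and two binary features.
priors = {"spam": 0.4, "ham": 0.6}          # p(C)
likelihoods = {                              # p(Fi = 1 | C) for F1, F2
    "spam": [0.8, 0.3],
    "ham":  [0.1, 0.5],
}

# Unnormalized scores: p(C) * prod_i p(Fi|C)
scores = {c: priors[c] * likelihoods[c][0] * likelihoods[c][1]
          for c in priors}

Z = sum(scores.values())                     # scaling factor p(F1, F2)
posterior = {c: scores[c] / Z for c in scores}
```

Dividing by Z makes the class probabilities sum to one without ever modeling p(F1, F2) directly.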
CONSTRUCTING A CLASSIFIER FROM THE PROBABILITY MODEL
The naïve Bayes classifier combines this model with a decision rule. The most common rule is to pick the hypothesis that is most probable, known as the maximum a posteriori (MAP) rule:
classify(f1, …, fn) = argmax over c of p(C = c) ∏ p(Fi = fi|C = c)
For text classification, the probability of a document with terms F1, …, Fn being in class c is computed as
p(c|F1, …, Fn) ∝ p(c) ∏ p(Fi|c)
P(Fi|c) is the conditional probability of term Fi occurring in a document of class c. It is a measure of how much evidence Fi contributes that c is the correct class.
P(c) is the prior probability of a document occurring in class c.
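The pieces above can be combined into a minimal multinomial naïve Bayes text classifier. The toy training documents and the add-one (Laplace) smoothing below are illustrative choices, not part of the original description:

```python
import math
from collections import Counter

# Toy training set (made up for illustration).
train = [
    ("buy cheap pills now", "spam"),
    ("cheap meds buy now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch meeting on monday", "ham"),
]

class_docs = Counter(c for _, c in train)           # documents per class
word_counts = {c: Counter() for c in class_docs}    # term counts per class
for text, c in train:
    word_counts[c].update(text.split())
vocab = {w for counts in word_counts.values() for w in counts}

def classify(text):
    """MAP rule: argmax_c  log p(c) + sum_i log p(Fi|c)."""
    best_c, best_score = None, float("-inf")
    for c in class_docs:
        score = math.log(class_docs[c] / len(train))      # log prior
        total = sum(word_counts[c].values())
        for w in text.split():
            # Add-one smoothing avoids zero probabilities for unseen terms.
            score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_c, best_score = c, score
    return best_c
```

Working in log space replaces the product with a sum and avoids numerical underflow for long documents.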
LOGISTIC REGRESSION
A statistical classification model that predicts a binary response, i.e. the outcome of a categorical dependent variable, from one or more predictor variables. Logistic regression measures the relationship between a categorical dependent variable and one or more independent variables.
Applications in medical and social science fields:
The Trauma and Injury Severity Score (TRISS), used to predict mortality in injured patients
Predicting whether a patient has diabetes based on observed characteristics like age, gender, and BMI
Predicting whether a person will vote for Congress or BJP based on age, income, gender, race, and state of residency
Classification:
Binomial or binary logistic regression deals with variables in which the observed outcome has two possible types, e.g. dead or alive. The outcome is coded as 0 or 1, which gives a straightforward interpretation.
Multinomial logistic regression deals with situations where there are three or more possible outcomes.
Logistic regression is used for predicting binary outcomes rather than continuous ones. It models the natural logarithm of the odds via the logit transformation:
logit(p) = ln(p / (1 - p))
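A minimal sketch of the logit and its inverse, the sigmoid, applied to an illustrative diabetes-style model over age and BMI. The coefficients are invented for illustration, not fitted to any data:

```python
import math

def logit(p):
    """Log-odds: ln(p / (1 - p))."""
    return math.log(p / (1 - p))

def sigmoid(z):
    """Inverse of the logit; maps log-odds back to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-z))

# Made-up coefficients: log-odds of a positive outcome = b0 + b1*age + b2*bmi
b0, b1, b2 = -6.0, 0.05, 0.1

def predict_proba(age, bmi):
    return sigmoid(b0 + b1 * age + b2 * bmi)

p = predict_proba(age=50, bmi=30)   # log-odds = -6 + 2.5 + 3.0 = -0.5
```

The linear part of the model lives on the unbounded log-odds scale; the sigmoid squashes it back to a valid probability.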
FEATURE SELECTION
Feature selection selects a subset of the terms occurring in the training set and uses this subset as features in text classification.
It serves two main purposes: first, it makes training and applying a classifier more efficient by decreasing the size of the effective vocabulary; second, it increases accuracy by eliminating noise features.
A noise feature is one which, when added to the document representation, increases the classification error on new data.
Feature selection replaces the complex classifier (using all features) with a simpler one (using a subset of features)
FEATURE SELECTION METHODS
Mutual information measures how much information the presence or absence of a term contributes to making the correct classification decision on class c.
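Mutual information between a term and a class can be computed from a 2×2 contingency table of document counts. The helper below is a sketch of that computation; the counts used to exercise it are made up:

```python
import math

def mutual_information(n11, n10, n01, n00):
    """MI of term presence vs. class membership from document counts.

    n11: docs in the class that contain the term
    n10: docs outside the class that contain the term
    n01: docs in the class without the term
    n00: docs outside the class without the term
    """
    n = n11 + n10 + n01 + n00
    mi = 0.0
    # Each cell contributes (N_cell/N) * log2(N * N_cell / (row_total * col_total)).
    for n_cell, row, col in [
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ]:
        if n_cell > 0:  # empty cells contribute 0 in the limit
            mi += (n_cell / n) * math.log2(n * n_cell / (row * col))
    return mi
```

A term distributed independently of the class yields MI of 0; a term concentrated in one class scores high and would be kept by the selection step.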
χ² (chi-square) feature selection tests the independence of two events, here the occurrence of the term and the occurrence of the class.
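For the same kind of 2×2 contingency table, the χ² statistic has a standard shortcut form; the sketch below uses made-up counts in its checks:

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 term/class contingency table.

    Shortcut formula: N * (AD - BC)^2 / ((A+B)(C+D)(A+C)(B+D)),
    with A=n11, B=n10, C=n01, D=n00 (same cell meanings as above:
    term present/absent crossed with in-class/out-of-class).
    """
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    return num / den
```

A value of 0 means the term and the class look independent; large values (e.g. above the critical value 10.83 at p = 0.001 for one degree of freedom) signal a dependent, and therefore useful, feature.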
Frequency-based feature selection selects the terms that are most common in the class. Frequency can be defined either as
document frequency – the number of documents in class c that contain the term t, or
collection frequency – the number of tokens of t that occur in documents in c.
Document frequency suits the Bernoulli model; collection frequency suits the multinomial model.
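A sketch of the document-frequency variant on toy documents; the documents and the helper name are invented for illustration:

```python
from collections import Counter

# Toy documents belonging to one class (made up).
docs_in_class = [
    "shares rise as market rallies",
    "market falls on rate fears",
    "shares in the market dip",
]

def top_k_by_document_frequency(docs, k):
    """Keep the k terms contained in the most documents of the class."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.split()))  # set(): count each document at most once
    return [term for term, _ in df.most_common(k)]
```

Dropping `set()` and counting every token instead would give the collection-frequency variant mentioned above.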
Feature selection for multiple classifiers selects a single set of features instead of a different one for each classifier.