Bayes’ Theorem and Logistic Regression
TRANSCRIPT
BAYES’ THEOREM
Bayes’ theorem gives the relationship between the probabilities of A and B, P(A) and P(B), and the conditional probabilities of A given B and B given A, P(A|B) and P(B|A). In its most common form:
P(A|B) = P(B|A) P(A) / P(B)
In the Bayesian interpretation, probability measures a degree of belief. Bayes’ theorem links the belief in a proposition before and after accounting for evidence.
For proposition A and evidence B:
P(A), the prior, is the initial degree of belief in A
P(A|B), the posterior, is the degree of belief having accounted for B
The quotient P(B|A)/P(B) represents the support B provides for A
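As a worked example, the update can be computed directly. The numbers below (a test with 95% sensitivity, a 5% false-positive rate, and 1% prevalence) are made up for illustration:

```python
# Hypothetical disease-test example: A = "patient has the disease",
# B = "test is positive". All numbers are illustrative.
def bayes_posterior(prior, likelihood, evidence):
    """P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood * prior / evidence

p_a = 0.01              # P(A): prior belief (prevalence)
p_b_given_a = 0.95      # P(B|A): test sensitivity
p_b_given_not_a = 0.05  # P(B|~A): false-positive rate

# Total probability: P(B) = P(B|A)P(A) + P(B|~A)P(~A)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

posterior = bayes_posterior(p_a, p_b_given_a, p_b)  # P(A|B) ~ 0.161
support = p_b_given_a / p_b                         # P(B|A)/P(B) ~ 16.1
```

Even with a positive test, the posterior stays low because the prior is small; the support ratio shows the evidence raised the belief about sixteen-fold.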
NAÏVE BAYES – PROBABILISTIC MODEL
The probability model for a classifier is a conditional model p(C|F1, …, Fn) over a dependent class variable C with a small number of outcomes (classes), conditioned on several feature variables F1 through Fn.
Problem – a large number of features, or features that can take a large number of values, makes the probability tables infeasible.
Using Bayes’ theorem, p(C|F1, …, Fn) = p(C) p(F1, …, Fn|C) / p(F1, …, Fn), or in plain English:
posterior = (prior × likelihood) / evidence
The denominator does not depend on C, and the values of the features Fi are given, so the denominator is effectively constant. The numerator is equivalent to the joint probability model p(C, F1, …, Fn).
Using the chain rule for repeated applications of the definition of conditional probability:
p(C, F1, …, Fn) = p(C) p(F1|C) p(F2|C, F1) … p(Fn|C, F1, …, Fn-1)
Role of the naïve condition: assume that each feature Fi is conditionally independent of every other feature Fj (j ≠ i) given the class C, i.e. p(Fi|C, Fj) = p(Fi|C).
The joint model can then be represented as
p(C, F1, …, Fn) = p(C) p(F1|C) p(F2|C) … p(Fn|C) = p(C) ∏ p(Fi|C)
Under this assumption, the conditional distribution over the class variable is
p(C|F1, …, Fn) = (1/Z) p(C) ∏ p(Fi|C)
where Z = p(F1, …, Fn) is a scaling factor that depends only on the features.
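A small numeric sketch of this normalization, with made-up priors and likelihoods for two classes and two binary features:

```python
# Illustrative (made-up) numbers for two classes and two binary features.
priors = {"spam": 0.4, "ham": 0.6}          # p(C)
likelihoods = {                              # p(Fi = 1 | C) for F1, F2
    "spam": [0.8, 0.3],
    "ham":  [0.1, 0.5],
}

# Unnormalized scores: p(C) * prod_i p(Fi|C)
scores = {c: priors[c] * likelihoods[c][0] * likelihoods[c][1]
          for c in priors}

Z = sum(scores.values())                     # scaling factor p(F1, F2)
posterior = {c: scores[c] / Z for c in scores}
```

Dividing by Z makes the class probabilities sum to one without ever modeling p(F1, F2) directly.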
CONSTRUCTING A CLASSIFIER FROM THE PROBABILITY MODEL
The naïve Bayes classifier combines this model with a decision rule. The most common rule is to pick the hypothesis that is most probable, known as the maximum a posteriori (MAP) rule:
classify(f1, …, fn) = argmax over c of p(C = c) ∏ p(Fi = fi|C = c)
For text classification, the probability of a document with terms F1, …, Fn being in class c is computed as
p(c|F1, …, Fn) ∝ p(c) ∏ p(Fi|c)
P(Fi|c) is the conditional probability of term Fi occurring in a document of class c. It is a measure of how much evidence Fi contributes that c is the correct class.
P(c) is the prior probability of a document occurring in class c.
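The pieces above can be combined into a minimal multinomial naïve Bayes text classifier. The toy training documents and the add-one (Laplace) smoothing below are illustrative choices, not part of the original description:

```python
import math
from collections import Counter

# Toy training set (made up for illustration).
train = [
    ("buy cheap pills now", "spam"),
    ("cheap meds buy now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch meeting on monday", "ham"),
]

class_docs = Counter(c for _, c in train)           # documents per class
word_counts = {c: Counter() for c in class_docs}    # term counts per class
for text, c in train:
    word_counts[c].update(text.split())
vocab = {w for counts in word_counts.values() for w in counts}

def classify(text):
    """MAP rule: argmax_c  log p(c) + sum_i log p(Fi|c)."""
    best_c, best_score = None, float("-inf")
    for c in class_docs:
        score = math.log(class_docs[c] / len(train))      # log prior
        total = sum(word_counts[c].values())
        for w in text.split():
            # Add-one smoothing avoids zero probabilities for unseen terms.
            score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_c, best_score = c, score
    return best_c
```

Working in log space replaces the product with a sum and avoids numerical underflow for long documents.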
LOGISTIC REGRESSION
A statistical classification model that predicts a binary response, i.e. the outcome of a categorical dependent variable, from one or more predictor variables. Logistic regression measures the relationship between a categorical dependent variable and one or more independent variables.
Applications in medical and social science fields:
The Trauma and Injury Severity Score (TRISS), used to predict mortality in injured patients
Predicting whether a patient has diabetes based on observed characteristics like age, gender, and BMI
Predicting whether a person will vote for Congress or BJP based on age, income, gender, race, and state of residency
Classification:
Binomial or binary logistic regression deals with variables in which the observed outcome has two possible types, e.g. dead or alive. The outcome is coded as 0 or 1, which gives a straightforward interpretation.
Multinomial logistic regression deals with situations where there are three or more possible outcomes.
Logistic regression is used for predicting binary outcomes rather than continuous ones. It models the natural logarithm of the odds via the logit transformation:
logit(p) = ln(p / (1 - p))
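A minimal sketch of the logit and its inverse, the sigmoid, applied to an illustrative diabetes-style model over age and BMI. The coefficients are invented for illustration, not fitted to any data:

```python
import math

def logit(p):
    """Log-odds: ln(p / (1 - p))."""
    return math.log(p / (1 - p))

def sigmoid(z):
    """Inverse of the logit; maps log-odds back to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-z))

# Made-up coefficients: log-odds of a positive outcome = b0 + b1*age + b2*bmi
b0, b1, b2 = -6.0, 0.05, 0.1

def predict_proba(age, bmi):
    return sigmoid(b0 + b1 * age + b2 * bmi)

p = predict_proba(age=50, bmi=30)   # log-odds = -6 + 2.5 + 3.0 = -0.5
```

The linear part of the model lives on the unbounded log-odds scale; the sigmoid squashes it back to a valid probability.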
FEATURE SELECTION
Feature selection selects a subset of the terms occurring in the training set and uses this subset as features in text classification.
It serves two main purposes: first, it makes training and applying a classifier more efficient by decreasing the size of the effective vocabulary; second, it increases accuracy by eliminating noise features.
A noise feature is one which, when added to the document representation, increases the classification error on new data.
Feature selection replaces the complex classifier (using all features) with a simpler one (using a subset of features)
FEATURE SELECTION METHODS
Mutual information measures how much information the presence or absence of a term contributes to making the correct classification decision on class c.
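Mutual information between a term and a class can be computed from a 2×2 contingency table of document counts. The helper below is a sketch of that computation; the counts used to exercise it are made up:

```python
import math

def mutual_information(n11, n10, n01, n00):
    """MI of term presence vs. class membership from document counts.

    n11: docs in the class that contain the term
    n10: docs outside the class that contain the term
    n01: docs in the class without the term
    n00: docs outside the class without the term
    """
    n = n11 + n10 + n01 + n00
    mi = 0.0
    # Each cell contributes (N_cell/N) * log2(N * N_cell / (row_total * col_total)).
    for n_cell, row, col in [
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ]:
        if n_cell > 0:  # empty cells contribute 0 in the limit
            mi += (n_cell / n) * math.log2(n * n_cell / (row * col))
    return mi
```

A term distributed independently of the class yields MI of 0; a term concentrated in one class scores high and would be kept by the selection step.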
χ² (chi-square) feature selection tests the independence of two events, here the occurrence of the term and the occurrence of the class.
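For the same kind of 2×2 contingency table, the χ² statistic has a standard shortcut form; the sketch below uses made-up counts in its checks:

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 term/class contingency table.

    Shortcut formula: N * (AD - BC)^2 / ((A+B)(C+D)(A+C)(B+D)),
    with A=n11, B=n10, C=n01, D=n00 (same cell meanings as above:
    term present/absent crossed with in-class/out-of-class).
    """
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    return num / den
```

A value of 0 means the term and the class look independent; large values (e.g. above the critical value 10.83 at p = 0.001 for one degree of freedom) signal a dependent, and therefore useful, feature.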
Frequency-based feature selection selects the terms that are most common in the class. Frequency can be defined either as
document frequency – the number of documents in class c that contain the term t, or
collection frequency – the number of tokens of t that occur in documents in c.
Document frequency suits the Bernoulli model; collection frequency suits the multinomial model.
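A sketch of the document-frequency variant on toy documents; the documents and the helper name are invented for illustration:

```python
from collections import Counter

# Toy documents belonging to one class (made up).
docs_in_class = [
    "shares rise as market rallies",
    "market falls on rate fears",
    "shares in the market dip",
]

def top_k_by_document_frequency(docs, k):
    """Keep the k terms contained in the most documents of the class."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.split()))  # set(): count each document at most once
    return [term for term, _ in df.most_common(k)]
```

Dropping `set()` and counting every token instead would give the collection-frequency variant mentioned above.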
Feature selection for multiple classifiers selects a single set of features instead of a different one for each classifier.