Elaborating Logistic Regression

dinu thomas
7 min read · Jan 4, 2022


When I was a kid, my mother used to tell me to make friends with good people and avoid bad groups. It was not an easy task to decide which group was good and which was not. Most of the time, I fell into exactly those groups my mom didn’t want me to be in. (I am still trying to figure out whether I was one of those outliers who might have turned a good group into a mischievous one…)

After listening to Prof. Patrick Winston’s lecture on Decision Trees, where he classifies whether a ‘bat’ is a vampire or not, I am sure there must be some algorithm to identify good friends rather than fall into a mischievous gang.

Let’s come straight to business: I would like to discuss Classification methods in Machine Learning here.

Classification, in simple terms, categorizes a given set of data into two or more groups or classes. The categories are often called “class labels” or just “labels”. In machine learning terminology, this is also called “Supervised Learning”, where historical labelled data is already available for the machine to learn from, understand and differentiate.

A suitable algorithm can identify the hidden pattern in the known data samples. In other words, a mathematical function can be derived from this sample which clearly separates the data into different planes/regions that demarcate each class. Such a function can generalize to any similar data; it is called a “model” and can be used to predict the class of an unseen query or unseen data.
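
To make the idea of learning from labelled data concrete, here is a minimal sketch (assuming scikit-learn is available, with a made-up toy dataset and feature names of my own) that fits a classifier on labelled samples and predicts the class of an unseen point:

```python
from sklearn.tree import DecisionTreeClassifier

# Made-up toy data: each row is [hours_of_study, hours_of_mischief] (hypothetical features)
X = [[8, 1], [7, 2], [2, 9], [1, 8], [6, 3], [3, 7]]
y = ["good", "good", "bad", "bad", "good", "bad"]   # known class labels

model = DecisionTreeClassifier()
model.fit(X, y)                      # learn the pattern from the labelled samples

print(model.predict([[5, 4]]))       # predict the label of an unseen data point
```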

There are many algorithms that perform classification. Let’s introduce one of the classic ones here.

Binomial Logistic Regression.

Binomial Logistic Regression predicts the probability of 2 possible outcomes. A few examples:

  1. Whether a patient is diabetic or non-diabetic,
  2. Whether a given person is a loan defaulter or non-defaulter,
  3. Whether an email is spam or not.

As the name implies, in the binomial case there are only 2 labels to predict. Let’s represent these 2 labels as 1 and 0.
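
As a quick, hedged sketch (again assuming scikit-learn, with toy numbers of my own), a binomial logistic regression outputs a probability for each of the two labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data: one feature, labels 1 or 0
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()
model.fit(X, y)

print(model.predict_proba([[3.5]]))  # [[P(label = 0), P(label = 1)]] for an unseen point
print(model.predict([[3.5]]))        # the predicted label, 0 or 1
```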

The prediction is done based on a set of data points represented in a multi-dimensional feature space. The class or label of each of these data points is available (Yi (actual) = 1 or 0), as it has already happened in the past or we have prior knowledge about it.

Conversely, it can be concluded that the features (represented as ‘x’ from here on) occur in some magnitude which can be weighted (w) and expressed as a mathematical equation that determines the probability value Yi (predicted) ∈ [0, 1]. So, the “probability of an event happening or not” will depend on the values of x with some coefficients w,

i.e. P(Yi) = f(x, w),

where,

$$P(Y_i) = f(x, w), \qquad 0 \le f(x, w) \le 1 \qquad \text{(eq:1)}$$

If you look at the above function, for any value of x and any chosen w the output must lie between 0 and 1, and this must hold for any value of (x, w) ranging over (-∞, ∞). This special behaviour can be represented using a Sigmoid function, which also ranges between 0 and 1:

$$P(Y_i) = \frac{1}{1 + e^{-wx_i}} \qquad \text{(eq:2)}$$
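
A small sketch in plain NumPy (the function name and the sample values of w and x are mine) showing that the sigmoid squashes any real-valued w·x into the interval (0, 1):

```python
import numpy as np

def sigmoid(z):
    # 1 / (1 + e^{-z}): maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

w = 0.8                                      # an arbitrary, assumed weight
x = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print(sigmoid(w * x))
# close to 0 for very negative w*x, exactly 0.5 at 0, close to 1 for very positive w*x
```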

Sometimes probability is represented as odds, the ratio of an event happening against the same event not happening:

$$\text{odds} = \frac{P(Y_i)}{1 - P(Y_i)} \qquad \text{(eq:3)}$$

From equations 1 and 3, we can rewrite the odds ratio as,

$$\text{odds} = \frac{f(x, w)}{1 - f(x, w)} \qquad \text{(eq:4)}$$

By substituting eq:2 into eq:4,

$$\text{odds} = \frac{1/(1 + e^{-wx_i})}{e^{-wx_i}/(1 + e^{-wx_i})} = e^{wx_i} \qquad \text{(eq:5)}$$

The above function is not linear. Let’s transform it into a linear function, since a linear function is easier to solve. We will apply a log transformation here to make it linear:

$$\log(\text{odds}) = \log\left(e^{wx_i}\right) = wx_i \qquad \text{(eq:6)}$$

This function is called the log-odds function. Please note that log-odds = wX, which is the equation of a line. The slope of this line can be identified through different line-fitting methods (e.g. gradient descent). The historical data points can be fitted to find the best values of w, the values which minimize the error between the actual labels (Yi, actual) and what the line outputs (Yi, predicted).
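
A quick numerical sanity check of eq:3 through eq:6, as a sketch with an arbitrary weight and feature value chosen purely for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, x = 0.7, 1.3              # arbitrary, assumed values
p = sigmoid(w * x)           # P(Yi) from eq:2

odds = p / (1.0 - p)         # eq:3
print(odds, np.exp(w * x))   # eq:5 — both print the same value (~2.4843)
print(np.log(odds), w * x)   # eq:6 — the log-odds equals w*x (0.91)
```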

So we are trying to predict the probability of an event happening or not through the predictor variables (X) available to us and the weights of these predictor variables (w) learned from the historical data.

Optimization Techniques:

Since the output of the classification is 1 or 0 (class 1 or class 2), the Maximum Likelihood Estimation technique is used to optimize Logistic Regression and calculate the best weights ‘w’.

The likelihood function is the joint probability density function of the sample. In the Binomial Logistic Regression case, the labels follow a Bernoulli distribution, whose probability distribution is defined as below,

$$P(Y_i) = P_i^{Y_i}\,(1 - P_i)^{1 - Y_i} \qquad \text{(eq:7)}$$

Yi can take either 0 or 1. Pi is the probability that Yi = 1.

Each of the past event samples is independent. With this condition, let’s introduce a function ‘L’, the product of the probabilities of all the independent events. This function is also known as the “Likelihood Function”,

$$L(w) = \prod_{i=1}^{n} P(Y_i) \qquad \text{(eq:8)}$$

What is actually happening here? We are trying to find the likelihood of predicting the events correctly across the sample space. If our algorithm can predict a 1 or a 0 with very high probability (near 1) that matches the actual label, then the likelihood of that sample will be high. The likelihood of predicting all the samples correctly will be the product of the individual sample probabilities, as the events are independent.

$$L(w) = \prod_{i=1}^{n} P_i^{Y_i}\,(1 - P_i)^{1 - Y_i} \qquad \text{(eq:9)}$$
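
As a small sketch (NumPy, with toy labels and predicted probabilities that I made up), the likelihood of a labelled sample is just the product of the per-sample Bernoulli probabilities from eq:7:

```python
import numpy as np

# Made-up actual labels and the model's predicted P(Yi = 1) for each sample
y_actual = np.array([1, 0, 1, 1, 0])
p_pred   = np.array([0.9, 0.2, 0.8, 0.7, 0.1])

per_sample = p_pred**y_actual * (1 - p_pred)**(1 - y_actual)   # eq:7
likelihood = np.prod(per_sample)                               # eq:8 / eq:9

print(per_sample)   # [0.9 0.8 0.8 0.7 0.9]
print(likelihood)   # ≈ 0.363 — it would be higher if every prediction were closer to its label
```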

Before going further, let’s simplify the probability of an event by considering the labels as 1 and -1 instead of 1 and 0. This is done purely for mathematical convenience.

When Yi = 1, referring to (equation 2), P(Yi) can be written as,

$$P(Y_i = 1) = \frac{1}{1 + e^{-wx_i}} \qquad \text{(eq:10)}$$

Similarly, when Yi = -1, deriving from (equations 1 and 2), P(Yi) can be rewritten as,

$$P(Y_i = -1) = 1 - \frac{1}{1 + e^{-wx_i}} = \frac{1}{1 + e^{wx_i}} \qquad \text{(eq:11)}$$

From equations 10 and 11, we can generalize the probability function to the form below,

$$P(Y_i) = \frac{1}{1 + e^{-Y_i wx_i}} \qquad \text{(eq:12)}$$
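
A two-line check that the single expression in eq:12 reproduces both eq:10 and eq:11 (arbitrary w and x of my choosing):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, x = 0.7, 1.3                                   # arbitrary, assumed values
print(sigmoid(+1 * w * x), sigmoid(w * x))        # Yi = +1: same as eq:10
print(sigmoid(-1 * w * x), 1 - sigmoid(w * x))    # Yi = -1: same as eq:11
```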

Our aim is to maximize this Likelihood function ‘L’ so that the model always predicts as accurately as possible. The optimization method chosen should find the best parameters ‘w’ which maximize this likelihood:

$$w^{*} = \arg\max_{w} L(w) = \arg\max_{w} \prod_{i=1}^{n} P(Y_i) \qquad \text{(eq:13)}$$

For mathematical convenience we will apply a transformation and maximize the log of the Likelihood Function,

$$w^{*} = \arg\max_{w} \log L(w) = \arg\max_{w} \log \prod_{i=1}^{n} P(Y_i) \qquad \text{(eq:14)}$$
$$\log \prod_{i=1}^{n} P(Y_i) = \sum_{i=1}^{n} \log P(Y_i) \qquad \text{(eq:15)}$$

Note: when you take the log, the product becomes a summation.

So (equation 14) can be written as below,

$$w^{*} = \arg\max_{w} \sum_{i=1}^{n} \log P(Y_i) \qquad \text{(eq:16)}$$

From equations 12 and 16, we can simplify,

$$w^{*} = \arg\max_{w} \sum_{i=1}^{n} \log \frac{1}{1 + e^{-Y_i wx_i}} = \arg\max_{w} \left( -\sum_{i=1}^{n} \log\left(1 + e^{-Y_i wx_i}\right) \right) \qquad \text{(eq:17)}$$

which is the same as finding the argmin of the following,

$$w^{*} = \arg\min_{w} \sum_{i=1}^{n} \log\left(1 + e^{-Y_i wx_i}\right) \qquad \text{(eq:18)}$$

The above function is convex in the parameters ‘w’ and is differentiable. Normally, gradient descent is used to find the best values of ‘w’.
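
Here is a minimal sketch of what that gradient descent could look like in plain NumPy, assuming a made-up one-dimensional dataset, ±1 labels as in eq:12, a single weight with no bias term, and a hand-picked learning rate:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up 1-D data; labels are +1 / -1 as in eq:12
x = np.array([-2.0, -1.5, -0.5, 0.5, 1.5, 2.0])
y = np.array([-1, -1, -1, 1, 1, 1])

w, lr = 0.0, 0.1
for _ in range(1000):
    # loss (eq:18): sum of log(1 + exp(-y * w * x))
    # gradient of that loss w.r.t. w: sum of -y * x * sigmoid(-y * w * x)
    grad = np.sum(-y * x * sigmoid(-y * w * x))
    w -= lr * grad                       # step downhill on the convex loss

print(w)                 # the learned weight
print(sigmoid(w * x))    # P(Yi = 1): low for the negative-label points, high for the positive ones
```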

Comparing with a Naive Bayesian Model:

In this section, we will spend some effort comparing Logistic Regression with another basic model, Naive Bayes. Both models are based on conditional probability. However, Naive Bayes is a Generative Model whereas Logistic Regression is a Discriminative Model.

Naive Bayes models the joint distribution of the features X and the target Y, and then derives the posterior probability P(y|x) from it.

Logistic Regression directly models the posterior probability P(y|x) by learning the input-to-output mapping, optimizing the error as we have seen in the previous section.

This often gives the Logistic model an edge over the Naive Bayesian one.

Experiments on the same dataset show that the learned model parameters are better for the Logistic model compared to the Bayesian one.

[Figures: results of the Bayesian model and the Logistic model on the same dataset]
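
For readers who want to reproduce a comparison like this, here is a minimal sketch (assuming scikit-learn, with a synthetic dataset standing in for the article’s data) that fits both models on the same split and compares their test accuracy:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data; the article's original dataset is not specified
X, y = make_classification(n_samples=500, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

for name, model in [("Bayesian (GaussianNB)", GaussianNB()),
                    ("Logistic Regression", LogisticRegression())]:
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))   # test-set accuracy of each model
```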

Conclusion:

  1. In the first part, we saw how we can create a mathematical model that predicts a future/unknown event based on past data samples and their features.
  2. In the second part, we discussed the method through which we can find the best weights for the features used for prediction, and how this optimization technique gives an edge over other similar probabilistic models.

Similar Reads from Dinu Thomas:

  1. Singular Value Decomposition and LSI
