Linear regression, formalised in the late 19th century by Francis Galton and Karl Pearson, became widely used in the early 20th century for modelling continuous relationships between variables. It is the go-to tool for gauging the relationship between two variables and predicting a continuous value, such as height from age.
However, a new question appears:
What if we want to predict the probability of something happening? For example, whether someone will develop a disease or not, or whether a product will sell well on the market?
Of course, people initially tried using ordinary linear regression for this case. But the results were often disappointing: the predicted probabilities could end up being negative or even greater than 1, values that clearly make no sense from a probabilistic standpoint. 😶‍🌫️
As ever, researchers refused to accept defeat. Hence, in this article, we will do a deep dive into logistic regression, how it came to be, how it works, its assumptions, and how to implement it from scratch. While the math can be a bit dense, I will try to accompany it with interesting and interactive visuals so you can get the intuition behind it.
Due to its likeness to linear regression, logistic regression is often seen as a simple, uninteresting algorithm. However, I beg to differ. Understanding the math and intuition behind LR can certainly help you build good intuition in the realm of machine learning: it isn’t overly complex, yet it carries a lot of fundamentals that will pay off later on.
With that in mind, let’s get going, shall we? 😋
Why can’t we use linear regression?
At first glance, it might seem tempting to use ordinary linear regression to model probabilities. After all, it has been a reliable workhorse in statistics since the early 20th century, used to model continuous relationships like height versus age or income versus education level. So, when faced with a question like “What is the probability that a person will develop diabetes based on their age?”, some researchers naturally tried to apply the same linear approach:
Let’s assume we train the model and it gives us the (hypothetical, illustrative) fit:

$\hat{p} = -0.5 + 0.02 \cdot \text{age}$

Now try plugging in values for age:

Age (x) | Prediction |
---|---|
20 | -0.10 |
40 | 0.30 |
50 | 0.50 |
60 | 0.70 |
80 | 1.10 |
Notice how for ages like 20 or 80, the prediction is below 0 or above 1, which makes no sense for a probability.
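To see this concretely, here is a minimal Python sketch of the hypothetical fit above (the coefficients are illustrative, not from real data):

```python
# Hypothetical linear fit from the table above: p_hat = -0.5 + 0.02 * age
def linear_prob(age):
    return -0.5 + 0.02 * age

for age in [20, 40, 50, 60, 80]:
    print(f"age {age}: predicted 'probability' = {linear_prob(age):.2f}")
# age 20 -> -0.10 and age 80 -> 1.10: not valid probabilities
```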
Aside from unbounded output, classification tasks often violate a basic assumption of linear regression: homoscedasticity (errors have the same variance, regardless of the value of $x$). For example, if we try to predict whether someone has diabetes or not, that is $y = 1$ or $y = 0$, we can think of it as a Bernoulli distribution.

For a Bernoulli random variable, the variance is:

$\text{Var}(y) = p\,(1 - p)$
Let’s try plugging in some numbers:
$p$ | Variance $p(1-p)$ |
---|---|
0.1 | 0.09 |
0.5 | 0.25 |
0.9 | 0.09 |
So, the variance depends on the predicted value itself. This means that the spread of the errors is not constant!
Last but not least, the loss function in linear regression, Mean Squared Error (MSE), doesn’t align well with classification: combined with a sigmoid it becomes non-convex, and it penalises confident-but-wrong predictions only mildly. Cross-entropy, by contrast, penalises confident-but-wrong predictions much more appropriately.
A Quick Recap of Probabilities
Before we continue, we need to understand what odds are.
If the probability of an event is:

$p$

Then, the odds are defined as:

$\text{odds} = \frac{p}{1 - p}$
Now, bear with me for a second. It’s not that complicated. Odds are just a ratio between “How likely something is to happen” compared to “how likely it is not to happen.”
Look at the examples below:
Probability | Odds | Spoken As |
---|---|---|
0.5 | 1 | “1 to 1” (even odds) |
0.75 | 3 | “3 to 1 in favour” |
0.9 | 9 | “9 to 1 in favour” |
0.01 | 0.0101 | “1 to 99 against” |
0.0 | 0 | Impossible |
1.0 | ∞ | Certain |
So, odds can range from 0 to ∞, but probability only goes from 0 to 1.
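If you’d like to verify the numbers in the table yourself, here is a minimal sketch:

```python
def odds(p):
    """Odds: how likely an event is to happen vs. not happen."""
    return p / (1 - p)

for p in [0.5, 0.75, 0.9, 0.01]:
    print(f"p = {p}: odds = {odds(p):.4f}")
# 0.5 -> 1.0000, 0.75 -> 3.0000, 0.9 -> 9.0000, 0.01 -> 0.0101
```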
Log-odds, a.k.a. logits

Now that we understand odds, let’s take it one step further. Log-odds (also called logits) are simply the natural logarithm of the odds:

$\text{logit}(p) = \ln\!\left(\frac{p}{1 - p}\right)$

Unlike odds, which grow sharply as the probability approaches 1, the logit is approximately linear over mid-range probabilities. Moreover, it ranges from $-\infty$ to $+\infty$, making it easier to interpret and work with.

But don’t just take my word for it; try it yourself by playing with the interactive sliders below.

As you can see, logits behave much more nicely when plotted on a linear scale, while still preserving the underlying information. This very property allows us to model log-odds as a linear function of the features, that is:

$\text{logit}(p) = \ln\!\left(\frac{p}{1 - p}\right) = w^\top x + b$
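As a quick check of that near-linear, symmetric behaviour, here is a small sketch:

```python
import math

def logit(p):
    """Log-odds: maps probabilities in (0, 1) to (-inf, +inf)."""
    return math.log(p / (1 - p))

for p in [0.1, 0.25, 0.5, 0.75, 0.9]:
    print(f"p = {p:.2f}: logit = {logit(p):+.3f}")
# -2.197, -1.099, +0.000, +1.099, +2.197 -- symmetric around p = 0.5
```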
Now, the next question is: how do we invert it back to probability? Congratulations, you just asked the million-dollar question.
Introducing the sigmoid function
In the nineteenth century, Pierre François Verhulst introduced a function to model population growth: the logistic function (sometimes called the sigmoid due to its S shape):

$\sigma(z) = \frac{1}{1 + e^{-z}}$

Now, you might ask, what’s so special about it? Well, the interesting thing is that, thanks to the properties of $e^{-z}$, no matter how big or small $z$ is, $\sigma(z)$ always gets mapped between 0 and 1.

Try it yourself! Play around with different values of $z$ and see how the sigmoid function behaves:

Please note that while I capped $z$ between $-20$ and $20$ due to floating-point precision issues, in reality $z$ can be anything, meaning $z \in (-\infty, \infty)$. The sigmoid function has horizontal asymptotes at 0 and 1, meaning it approaches them but never truly reaches them.
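That precision caveat is real in code, too: a naive `1 / (1 + exp(-z))` overflows for very negative $z$. Here is one common numerically stable formulation (a sketch, not tied to any particular library):

```python
import math

def sigmoid(z):
    """Numerically stable sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, exp(-z) can overflow, so rewrite in terms of exp(z).
    ez = math.exp(z)
    return ez / (1.0 + ez)

print(sigmoid(0.0))     # 0.5
print(sigmoid(20.0))    # ~0.99999999794
print(sigmoid(-500.0))  # tiny but finite -- no overflow
```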
If it’s not obvious to you yet, the sigmoid curve resembles the logit curve with its axes flipped, doesn’t it? In fact, we can invert logits back to probabilities with the sigmoid function! Not convinced? Let’s prove it together.
We start with the logit function:

$\ln\!\left(\frac{p}{1 - p}\right) = z$

Exponentiate both sides:

$\frac{p}{1 - p} = e^z$

Multiply both sides by $(1 - p)$:

$p = e^z (1 - p)$

Distribute the RHS:

$p = e^z - p\,e^z$

Move all $p$ terms to one side:

$p + p\,e^z = e^z$

Solve for $p$:

$p = \frac{e^z}{1 + e^z}$

Divide numerator and denominator by $e^z$ to simplify:

$p = \frac{1}{1 + e^{-z}} = \sigma(z)$
We now have the sigmoid function. Neat, right?
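We can also confirm the inversion numerically. A self-contained sketch:

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Round trip: sigmoid undoes logit (up to floating-point error)
for p in [0.05, 0.3, 0.5, 0.8, 0.99]:
    print(f"p = {p}: sigmoid(logit(p)) = {sigmoid(logit(p)):.6f}")
```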
Maximum Likelihood Estimation (MLE)
In this article, we will assume we are doing binary classification. Usually, we assume our labels follow the Bernoulli distribution. So for each data point $(x_i, y_i)$, the label $y_i \in \{0, 1\}$.

That means we assume:

$y_i \sim \text{Bernoulli}(p_i)$

With probability:

$p_i = P(y_i = 1 \mid x_i)$

The probability is predicted using the sigmoid function applied to a linear model:

$p_i = \sigma(w^\top x_i + b)$

The likelihood of observing label $y_i$ given the input $x_i$ and parameters $(w, b)$ is:

$P(y_i \mid x_i; w, b) = p_i^{\,y_i} (1 - p_i)^{\,1 - y_i}$

Because if $y_i = 1$ it becomes $p_i$, whereas if $y_i = 0$ it becomes $1 - p_i$. As you can see, the compact expression above fits both cases.

For $n$ independent and identically distributed data points, the joint likelihood is:

$L(w, b) = \prod_{i=1}^{n} p_i^{\,y_i} (1 - p_i)^{\,1 - y_i}$

However, computing that product directly is quite difficult for computers due to the risk of underflow. Thankfully, we can exploit the properties of the logarithm to compute the log-likelihood instead:

$\ell(w, b) = \sum_{i=1}^{n} \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right]$

MLE says that we need to choose the parameters that maximise that log-likelihood:

$\hat{w}, \hat{b} = \arg\max_{w,\,b}\ \ell(w, b)$

In modern ML libraries, optimisers typically minimise objectives, so we minimise the negative log-likelihood $-\ell(w, b)$ instead of directly maximising the log-likelihood.
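As a sketch, here is what evaluating the negative log-likelihood looks like in NumPy (the clipping constant is a common trick, not a canonical value):

```python
import numpy as np

def neg_log_likelihood(y, p, eps=1e-12):
    """Average negative log-likelihood for Bernoulli labels.

    Clipping keeps log() finite when predictions saturate at 0 or 1.
    """
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1, 0, 1, 1])
p = np.array([0.9, 0.2, 0.7, 0.6])
print(neg_log_likelihood(y, p))  # ~0.2990
```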
Loss Function
In practice, we usually call the loss function derived above by its popular name: Binary Cross-Entropy (BCE). Averaged over $n$ samples, it reads:

$\mathcal{L}_{\text{BCE}} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right]$
This function is convex, making its global minimum easy to obtain using gradient descent. What is a convex function, you may ask?
Notice how, in a convex function (left), gradient descent always reaches the global minimum regardless of the starting point. Meanwhile, in a non-convex function (right), the algorithm can get stuck in a suboptimal local minimum.
We will prove that the Binary Cross-Entropy (BCE) loss in logistic regression is convex with respect to the weights $w$.
We define:

- $x$: input feature vector
- $w$: weights
- $b$: bias
- $y \in \{0, 1\}$: true label
- $z = w^\top x + b$: logit (linear output)
- $\sigma(z) = \frac{1}{1 + e^{-z}}$: sigmoid
- $\hat{y} = \sigma(z)$: predicted probability
The loss for one sample is:

$\mathcal{L} = -\left[ y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \right]$

Substitute $\hat{y} = \sigma(z)$ and use the identities:

$\log \sigma(z) = -\log(1 + e^{-z}), \qquad \log(1 - \sigma(z)) = -\log(1 + e^{z})$

With some simplification, we get the compact form:

$\mathcal{L}(z) = \log(1 + e^{z}) - yz$
This loss is a composition of two functions plus a linear term:

- $z = w^\top x + b$: an affine function of $w$
- $\log(1 + e^{z})$: a convex and increasing function
- $-yz$: linear (does not affect convexity)
By the standard composition rule:
The composition of a convex, increasing function with an affine function is convex. Adding a linear function preserves convexity.
Therefore, the BCE loss is convex in $w$.
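If you’d rather trust a quick numerical check than the algebra, here is one:

```python
import numpy as np

# Verify: log(1 + e^z) - y*z equals BCE applied to sigmoid(z)
z = np.linspace(-5, 5, 11)
for y in (0, 1):
    p = 1 / (1 + np.exp(-z))
    bce = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    compact = np.log1p(np.exp(z)) - y * z
    assert np.allclose(bce, compact)
print("compact form matches BCE")
```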
However, we might want to prove it more formally.
Let $\hat{y} = \sigma(z)$. The derivative of the loss w.r.t. $z$ is:

$\frac{\partial \mathcal{L}}{\partial z} = \hat{y} - y$

Using $\frac{\partial z}{\partial w} = x$ and $\frac{\partial z}{\partial b} = 1$, we obtain the gradients:

$\nabla_w \mathcal{L} = (\hat{y} - y)\,x, \qquad \frac{\partial \mathcal{L}}{\partial b} = \hat{y} - y$

Take the derivative again to get the Hessian:

$\nabla_w^2 \mathcal{L} = \sigma(z)\,(1 - \sigma(z))\, x x^\top$

This is a scalar times an outer product:

- $\sigma(z)(1 - \sigma(z)) \geq 0$ is a non-negative scalar
- $x x^\top$ is positive semi-definite

Since the Hessian is positive semi-definite:

The BCE loss in logistic regression is convex in $w$.
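A quick numerical sanity check of that claim, with an arbitrary sample and logit value:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)            # one sample, three features
z = 0.7                           # an arbitrary logit value
s = 1 / (1 + np.exp(-z))          # sigmoid(z)
H = s * (1 - s) * np.outer(x, x)  # per-sample Hessian

print(np.linalg.eigvalsh(H))      # eigenvalues >= 0 (up to FP error)
```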
We need to differentiate between convex and strictly convex. To put it simply, a strictly convex function has a single unique minimum, whereas a merely convex function may have multiple points that minimise the function. To understand it better, take a look at the visualisation below.
Example: $f(w_1, w_2) = w_1^2$ is convex but not strictly convex: flat directions exist (along $w_2$), so there are multiple minima along a line.
Strict convexity may not hold in cases of perfect separation or redundant features, but is generally satisfied in real-world data with enough variability.
See Logistic Regression in Action
Before doing another deep dive into the math, let’s tinker for a while with logistic regression. See it, feel it, and build a good mental model along the way as to how the training process works[1].
Observe how the decision boundary (black line) evolves from a random starting point toward an optimal separation between classes. The loss curve consistently decreases, thanks to the convex nature of the loss function. At the same time, accuracy improves as the model learns to better distinguish between the two classes.
How does the training process actually work?
After seeing the interactive example above, I hope you already have some intuition about how the training process works. However, we still need to understand how it actually works mathematically[2].

Interestingly enough, the training process of LR isn’t that different compared to deep learning. Fundamentally, both try to minimise a loss by using gradients to update weights and biases. The only difference is that in LR, the loss function is convex and we use only a single layer. Let’s recap what we have so far.
We already defined that LR uses a linear function followed by a sigmoid:

$\hat{y} = \sigma(w^\top x + b)$

We have also defined BCE as our loss function:

$\mathcal{L} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right]$

Now, the gradient for a single sample can be calculated with:

$\nabla_w \mathcal{L}_i = (\hat{y}_i - y_i)\,x_i, \qquad \frac{\partial \mathcal{L}_i}{\partial b} = \hat{y}_i - y_i$

Update the weights and bias by scaling the gradient with our learning rate $\eta$:

$w \leftarrow w - \eta\,\nabla_w \mathcal{L}, \qquad b \leftarrow b - \eta\,\frac{\partial \mathcal{L}}{\partial b}$
Those equations may still feel abstract, so let’s try them with real numbers.
Assume that we have a learning rate of $\eta = 0.1$ with initial $w = 0$ and $b = 0$.

Our dataset would be a tiny toy set with one feature:

$x$ | $y$ |
---|---|
1 | 0 |
2 | 0 |
3 | 1 |
4 | 1 |
Epoch 1:

At first, since our initial weight ($w = 0$) and bias ($b = 0$) are zero, the sigmoid function will produce a value of 0.5 for every sample:

$\hat{y}_i = \sigma(0 \cdot x_i + 0) = 0.5$

Compute the gradients for each sample, $(\hat{y}_i - y_i)\,x_i$:

- $(0.5 - 0)(1) = 0.5$
- $(0.5 - 0)(2) = 1.0$
- $(0.5 - 1)(3) = -1.5$
- $(0.5 - 1)(4) = -2.0$

Average the gradients:

$\frac{0.5 + 1.0 - 1.5 - 2.0}{4} = -0.5$

Update the weights:

$w = 0 - 0.1 \times (-0.5) = 0.05$

Now, compute the bias gradients, $\hat{y}_i - y_i$:

$0.5,\ 0.5,\ -0.5,\ -0.5$

Average the bias gradients:

$\frac{0.5 + 0.5 - 0.5 - 0.5}{4} = 0$

Update the bias:

$b = 0 - 0.1 \times 0 = 0$
Epoch 2:

We continue with the updated weight $w = 0.05$ and bias $b = 0$.

Compute the predictions, $\hat{y}_i = \sigma(0.05\,x_i)$:

- $\sigma(0.05) \approx 0.5125$
- $\sigma(0.10) \approx 0.5250$
- $\sigma(0.15) \approx 0.5374$
- $\sigma(0.20) \approx 0.5498$

Compute the gradients for each sample:

- $(0.5125 - 0)(1) = 0.5125$
- $(0.5250 - 0)(2) = 1.0500$
- $(0.5374 - 1)(3) = -1.3878$
- $(0.5498 - 1)(4) = -1.8008$

Average the gradients:

$\frac{0.5125 + 1.0500 - 1.3878 - 1.8008}{4} \approx -0.4065$

Update the weights:

$w = 0.05 - 0.1 \times (-0.4065) \approx 0.0906$

Now for the bias:

$0.5125,\ 0.5250,\ -0.4626,\ -0.4502$

Average the bias gradients:

$\frac{0.5125 + 0.5250 - 0.4626 - 0.4502}{4} \approx 0.0312$

Update the bias:

$b = 0 - 0.1 \times 0.0312 \approx -0.0031$
You can continue with further epochs, but hopefully the example above has given you enough intuition on how to do it yourself.
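In fact, the whole worked example fits in a few lines of NumPy. Here is a minimal batch-gradient-descent sketch that reproduces the numbers above and keeps going:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Same toy dataset and settings as the worked example
X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([0.0, 0.0, 1.0, 1.0])
w, b, lr = 0.0, 0.0, 0.1

for epoch in range(1, 101):
    p = sigmoid(w * X + b)         # predictions
    grad_w = np.mean((p - y) * X)  # average weight gradient
    grad_b = np.mean(p - y)        # average bias gradient
    w -= lr * grad_w
    b -= lr * grad_b
    if epoch <= 2:
        print(f"epoch {epoch}: w = {w:.4f}, b = {b:.4f}")
# epoch 1: w = 0.0500, b = 0.0000
# epoch 2: w = 0.0906, b = -0.0031
```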
Evaluating Model Performance with AUC-ROC Analysis
Once we’ve trained our logistic regression model, how do we know how good it is? There are many metrics one can use for classification tasks. However, one of the most powerful tools for evaluating binary classification models is the Receiver Operating Characteristic (ROC) curve. This visualization shows us the trade-off between the true positive rate (sensitivity) and the false positive rate (1 − specificity) at different classification thresholds.
The interactive visualization below lets you explore how different classification thresholds affect various performance metrics.
Confusion Matrix
Performance Metrics
This kind of analysis is crucial in real-world applications where the cost of false positives and false negatives might be very different (e.g., medical diagnosis, fraud detection).
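To make this less abstract, here is a compact way to compute AUC from scratch using the rank (Mann–Whitney) formulation; it assumes no tied scores, and the labels and scores below are made up for illustration:

```python
import numpy as np

def roc_auc(y_true, scores):
    """AUC = probability a random positive is scored above a random negative."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = int(y_true.sum())
    n_neg = len(y_true) - n_pos
    # Mann-Whitney U from the rank sum of the positive class
    u = ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)

y = np.array([0, 0, 1, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.9])
print(roc_auc(y, scores))  # 0.666... (6 of 9 positive-negative pairs ranked correctly)
```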
Regularisation
In the real world, we want models that don’t just memorise the training data. They should work well on unseen data. Overfitting happens when:
- The model tries too hard to fit every data point
- Especially in high-dimensional space (many features)
- So it ends up learning noise instead of patterns
Regularisation is our way of telling the model:
“Please fit the data… but don’t go wild with big weights.”
There are many regularisation techniques to prevent overfitting, but we will keep it minimal here and cover only what is directly relevant to logistic regression. Regularisation deserves its own dedicated article, which will be published later.

Let’s recall our loss function (BCE):

$\mathcal{L}_{\text{BCE}} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right]$

We are trying to minimise this loss, i.e., find the weights that make the predicted probabilities close to the actual labels. However, sometimes just minimising BCE isn’t enough: the model might push some weights to extreme values to get a better fit, leading to overfitting.
To penalise complexity, we can add a regularisation term to the loss. The two most common types are:
L1 Regularisation (Lasso):

$\lambda \sum_{j} |w_j|$

L2 Regularisation (Ridge):

$\lambda \sum_{j} w_j^2$

Regularised loss becomes:

$\mathcal{L}_{\text{reg}} = \mathcal{L}_{\text{BCE}} + \text{penalty}$
This discourages large weights and keeps the model smoother.
Visualizing Regularisation Penalties
Let’s explore how different regularisation techniques penalize weights. You can toggle between L1 (Lasso) and L2 (Ridge) regularisation to see their different behaviors:
Key differences
- L2 (Ridge): Penalty grows quadratically (λw²) hence large weights get punished much more severely
- L1 (Lasso): Penalty grows linearly (λ|w|) which creates a constant penalty rate regardless of weight magnitude
The quadratic growth of L2 makes it excellent at shrinking large weights toward zero, avoiding exploding weights, whereas L1’s linear penalty can actually drive weights to exactly zero, performing some kind of “feature selection”.
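To make the difference tangible, here is a sketch of how each penalty modifies the gradient step, using the $\lambda w^2$ and $\lambda |w|$ conventions from above (the bias is typically left unregularised):

```python
import numpy as np

def regularised_grad(w, grad_bce, lam, kind="l2"):
    """BCE gradient plus the penalty's (sub)gradient."""
    if kind == "l2":
        return grad_bce + 2 * lam * w     # d/dw of lam * w^2
    return grad_bce + lam * np.sign(w)    # subgradient of lam * |w|

w = np.array([2.0, -0.5, 0.0])
grad_bce = np.array([0.1, 0.1, 0.1])
print(regularised_grad(w, grad_bce, lam=0.01, kind="l2"))  # [0.14 0.09 0.1]
print(regularised_grad(w, grad_bce, lam=0.01, kind="l1"))  # [0.11 0.09 0.1]
```

Notice how the L2 correction scales with the weight’s magnitude, while the L1 correction is a constant-size push toward zero, which is exactly why L1 can zero weights out entirely.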
Conclusion
Logistic Regression may seem deceptively simple: just a linear model passed through a sigmoid. But beneath that surface lies a foundational concept in machine learning. While it may not compete with complex models in raw power, its interpretability, theoretical grounding, and mathematical tractability make it an essential tool. Mastering logistic regression also helps you build the intuition needed for neural networks. Hopefully this article can help you understand more about logistic regression in depth.
Footnotes
[1] In the worked example, we average gradients over the whole dataset (batch gradient descent) and, for classification, a default decision threshold of 0.5 is assumed when converting probabilities to labels. In online or streaming training you’d typically use stochastic (or mini‑batch) gradient descent, and the decision threshold is a tunable choice—use the AUC‑ROC analysis above to pick a threshold that matches your precision–recall trade‑offs. ↩
[2] In classical statistics, logistic regression was fitted with Newton–Raphson (IRLS), and many statistical libraries still do so. Modern general-purpose optimisers often use quasi-Newton methods like BFGS or its limited-memory variant L-BFGS, which converge faster than plain gradient descent on smaller datasets. For very large datasets, we usually switch to first-order methods (SGD) that scale more easily. ↩
Acknowledgments
I would like to thank the open-source community for providing the tools and libraries that made the interactive visualizations in this article possible. Special appreciation goes to the developers of D3.js, React, and the broader JavaScript ecosystem that enables rich, educational content on the web. I would also like to thank Distill as my main source of inspiration to bring ML education to the broader masses.
Author Contributions
This article was researched, written, and developed by the author. All interactive visualizations were custom-built using React and D3.js to provide hands-on learning experiences.
License
Diagrams and text are licensed under Creative Commons Attribution CC-BY 4.0 with the source available on GitHub, unless noted otherwise.