🎅 Naughty by Numbers:
Classifications at Christmas
First Time Here? (click to expand)

We're excited to announce Santa's new partnership with Amazon, promising maximum efficiency at scale! Children who upgrade to Santa Pro will still receive the same individual care and attention as before, whereas those in our Free tier will receive "personalized" analytics and delivery service, albeit in a completely impersonal way.

He's using a list he's already got...

We've correlated our historical data of who Santa judged to be Naughty 😈 or Nice 😇 in the past with data harvested from social media, fitness trackers, microphones in TV remotes, Alexa-powered Elves on the Shelves, cell phone tower connections, warrantless police drone surveillance, and other data streams we can't tell you (and you're better off not knowing) about.

Gonna assign who's naughty or not...

This allowed us to distill a key (proprietary) combination of variables into one score, dubbed the Santa-Amazon Numerical Trustworthiness Assessment (SANTA), that we use to predict -- nay, assign -- classifications of who is worthy to receive presents this year. We're proud to offer an Artificial Intelligence system called "Logistic Regression," or LR for short, that can predict Naughtiness or Niceness with 95% accuracy for white males.*

Amazon is coming to town!

But that's not all: LR also stands for Lounging Reindeer, because delivery of presents will be ensured by Amazon's massive fleet of delivery drones!

In the following interactive graphical demo, we'd love to show you how this revolutionary LR AI system works.

*Other demographics may be more likely to receive False Naughty ratings.

Optional: Math Discussion (click to expand)

The following is intended for students of Machine Learning who've already learned some basics of Logistic Regression and want to go deeper.

Here we address the question, "Why do I have to use Binary Cross-Entropy (BCE, or negative log-likelihood) loss when doing Logistic Regression -- why can't I just use Mean Squared Error (MSE) loss?"

One answer, as we'll see below, is "You could use MSE, but BCE is a lot more effective."

Some answers to this question involve concepts and motivations from probability theory (e.g., expectations about distributions of noisy data) which I consider to be "outside the scope" for students interested in machine learning who have a strong grasp of coding and basic calculus but perhaps are a bit weak on statistics. So, to get straight to the point, we'll only use a couple of derivatives, a graph, and some algebra.

Logistic regression is commonly used to solve binary classification problems, where data points \( y \) from one class are given a value of 0, and those belonging to the other class are denoted by a value of 1. Then we produce predicted data points \( \hat{y}\) using an S-shaped "logistic sigmoid" function \( \hat{y}(z) \) that spans between these values, $$ \tag{1} \hat{y}(z) = {1\ \over 1 + e^{-z}},$$ where \( z = (x-x_0)/\delta \) is a linear function of the inputs \( x \); \( x_0 \) is a parameter of the fit that defines the midpoint of the sigmoid curve; and \( \delta \) is another fit parameter describing the "width" of the S-shape.

(It's more common to see \(z\) parameterized as \( z = wx+b \), but for the interactive tutorial the way I'm doing it is more intuitive. And yes, the predictions \( \hat{y}(z) \) can be interpreted as probabilities, but this is not going to affect the discussion that follows.)


A defining property of the logistic sigmoid is that its derivative satisfies the relation $$ \tag{2} \hat{y}^\prime(z) \equiv {\partial \hat{y} \over \partial z} = (1-\hat{y})\hat{y}.$$ A graph of the logistic sigmoid and its derivative can be seen below. Note that the derivative is small everywhere except in the "hump" region near the midpoint. This will be relevant later.


[Interactive graph: the logistic sigmoid \( \hat{y}(z) \) and its derivative \( \hat{y}^\prime(z) \), plotted against \(z\)]


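If you'd like to check Equations (1) and (2) for yourself, here's a quick NumPy sketch (Python rather than the demo's JavaScript, just for compactness); it's a standalone toy example, not code from the demo:

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, Eq. (1)."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-6, 6, 121)
y_hat = sigmoid(z)

# Eq. (2): the analytic derivative is (1 - y_hat) * y_hat ...
analytic = (1.0 - y_hat) * y_hat

# ... and it matches a centered finite-difference estimate of d(y_hat)/dz.
h = 1e-5
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2.0 * h)

print(np.max(np.abs(analytic - numeric)))  # tiny (~1e-11): the identity holds
# Note that the derivative is nearly zero away from z = 0 -- the "hump" lives near the midpoint.
```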
To have the system "learn" the best fit parameters to use, we can start with some initial guess, define some loss (or cost) function and do gradient descent. Let's consider the BCE gradient first and then we'll come back to compare it to the MSE gradient. Typically one sees the BCE (or negative log-likelihood) loss written as $$\tag{3} L_{\rm BCE} = -{1\over N}\sum_i \left( y_i\log(\hat{y_i}) + (1-y_i)\log(1-\hat{y_i}) \right), $$ where we average over the number of data points \(N\) to keep the value at roughly the same scale regardless of how many data points there are, \(y_i\) are the true values and \(\hat{y_i}\) are the model's output predictions at discrete input data points \( x_i \). The gradients with respect to the fit parameters \( x_0 \) and \(\delta\) both contain a factor of \(\partial L_{\rm BCE} / \partial z\), so let's look at that:
$$\tag{4} L^\prime_{\rm BCE} = {\partial L_{\rm BCE} \over \partial z} = -{1\over N}\sum_i \left( y_i{\hat{y_i}^\prime\over \hat{y_i}} + (1-y_i){-\hat{y_i}^\prime\over 1-\hat{y_i}} \right), $$ where by \( \hat{y_i}^\prime \) we mean \( \hat{y}^\prime \) evaluated at \(z = z_i = (x_i - x_0)/\delta \).

Let's pause to consider something important in Equation (4), so we can "get" it conceptually: The denominator terms result from the logarithms in the BCE loss itself, which "blow up" toward \(\pm\infty\) when the model is "certainly wrong," i.e. as \(\hat{y}\rightarrow (1-y)\). If we stopped with that superficial understanding, we might get the wrong idea, but the Chain Rule tells us we need to keep going when evaluating this derivative: the numerators come from the sigmoid function, whose derivative tends to zero in these same ("certain") regimes. Watch what happens next, because it's kind of amazing...

Substituting Equation (2) into (4) gives $$ L^\prime_{\rm BCE} = -{1\over N}\sum_i \left( y_i{ (1-\hat{y_i})\hat{y_i} \over \hat{y_i}} + (1-y_i){-(1-\hat{y_i})\hat{y_i}\over 1-\hat{y_i}} \right). $$ So the derivative of the sigmoid (in the numerators) contains exactly the right terms to cancel with each denominator, leaving us with $$ L^\prime_{\rm BCE} = -{1\over N}\sum_i \left( y_i (1-\hat{y_i}) - (1-y_i)\hat{y_i} \right) $$ which simplifies to $$\tag{5} L^\prime_{\rm BCE} = {1\over N}\sum_i (\hat{y_i}- y_i).$$ Bam! After all that work, isn't this result remarkably simple? This derivative is nothing more than the (mean of the) difference between predicted values and the true values. Let's be clear: there is no "vanishing gradient problem" for sigmoid activations when you use BCE loss; but as we'll see below, there is such a problem if you use MSE with a sigmoid. Equation (5) is not the full gradient; to work out the updates with respect to \( \delta \) and \( x_0 \) we need to do more, but this is sufficient for comparison against the MSE.
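If you'd like to verify Equation (5) numerically, here's a small NumPy sketch (with made-up logits and labels), interpreting (5) per data point, i.e. \( \partial L/\partial z_i = (\hat{y_i}-y_i)/N \):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(z, y):
    """BCE loss, Eq. (3), written in terms of the logits z."""
    y_hat = sigmoid(z)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

rng = np.random.default_rng(0)
z = rng.normal(size=8)                    # made-up logits z_i
y = (rng.random(8) < 0.5).astype(float)   # made-up 0/1 labels

# Eq. (5), per point: dL/dz_i = (y_hat_i - y_i) / N
analytic = (sigmoid(z) - y) / len(z)

# Finite-difference check, perturbing one z_i at a time
h, numeric = 1e-6, np.zeros_like(z)
for i in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[i] += h
    zm[i] -= h
    numeric[i] = (bce(zp, y) - bce(zm, y)) / (2 * h)

print(np.max(np.abs(analytic - numeric)))  # tiny: the cancellation really does happen
```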

The MSE loss appears much simpler (and easier to grasp!) than the BCE loss: $$ \tag{6} L_{\rm MSE} = {1\over N}\sum_i \left( \hat{y_i} - y_i\right)^2, $$ but it turns out that its derivative has a problematic extra factor on the end, shown in red: $$\tag{7} L^\prime_{\rm MSE} = {\partial L_{\rm MSE} \over \partial z} = {2\over N}\sum_i \left( \hat{y_i} - y_i\right) {\color{red}\hat{y_i}^\prime}. $$ Compare Equations (5) and (7), ignoring the factor of 2 in the latter. That factor of the sigmoid derivative \( \hat{y_i}^\prime \) (in red) on the end of (7) will keep the gradient small everywhere except in the middle (as shown in the graph above). This means that the gradients obtained by using MSE loss will tend to be much smaller than the gradients from BCE -- this is a "vanishing gradient problem" -- and hence (we haven't gotten to these yet but you can imagine) the updates for the fit parameters \(x_0\) and \(\delta\) will be smaller for MSE than for BCE (for comparable learning rates), and thus the MSE model will train more slowly than the BCE model. It also means that only points near the decision boundary in the middle will contribute to the model's training.
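To make the comparison concrete, here's a tiny sketch (made-up numbers, not the demo's data) of the per-point gradient factors from Equations (5) and (7) for a single "confidently wrong" prediction:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One "certainly wrong" point: the true label is 1, but z sits far on the wrong side
z, y = -8.0, 1.0
y_hat = sigmoid(z)                      # ~0.00034
dyhat_dz = (1 - y_hat) * y_hat          # the red factor in Eq. (7); tiny out here

grad_bce = y_hat - y                    # Eq. (5), per point: ~ -1.0, a strong push
grad_mse = 2 * (y_hat - y) * dyhat_dz   # Eq. (7), per point: ~ -0.0007, nearly vanished

print(grad_bce, grad_mse)
```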

So, in summary: MSE will work, but BCE is a more effective choice. Have fun with the demo!


Two Analytics Li'l Helpers, Molasses Slow Elf 🧝 and Brilliant Celerity Elf 🧝🏽‍♀️, are tasked with creating predictive models to map a child's SANTA score into a verdict of Naughty (lower half of the graph) or Nice (upper half) by fitting an S-shaped curve to scores for children judged Naughty 😈 or Nice 😇 in the past.
Let's see how their models compare...

Controls (click to expand)

Data points are sampled from an S-shaped distribution for the probability of the point being in class "Nice"... no wait, I mean our trove of historical data on real children.*

Number of points: 50
Midpoint: 0.5
Width: 0.05
Animation delay: 10 ms

*Gathered without their knowledge but whose consent was implied through their use of the internet; except in EU countries where GDPR Article 22 requires explicit consent since some people don't want EU children to have any fun.

REGULATORS: This is all a cynical parody and a complete work of fiction for the purposes of entertainment and (it is hoped) education. No actual data on children was gathered or is being used.

Data & predictions: [interactive plot; horizontal axis: SANTA score ⟶]

Losses: [interactive plot; horizontal axis: Training iteration ⟶]


Key things to observe:
  1. BCE's model settles on a solution before MSE's.
  2. The two models often give slightly different answers, even after they've been trained for a while.
  3. The models' answers tend to be closer when more data points are used.
Questions for discussion:
  1. Where the models differ, which one is "right"?
  2. How would a human (e.g., you) classify new points given the data, and how might that produce results similar to or different from the computation-based outcomes you observed above?
  3. If these models were deployed as classifiers in society, what societal values would they (perhaps unintentionally) encode?
  4. Given those values, which model is better?
  5. Besides these models, what other models or (non-LR) methods could you use, and how might their results differ?
  6. Why might plotting both losses on the same graph be misleading?
  7. In this post's subtitle, why might "Xmas" be more accurate (albeit less alliterative) than "Christmas"?
Appendix: Careful Coding BCE

Most people will never run into the following issue because (a) they probably use libraries that already take this into account, and/or (b) it's only the gradients that matter when you're coding your own solver, and the gradients are pretty mundane, so problems with the loss itself have little effect. In my case, I was doing something "custom" with my own Binary Cross-Entropy loss in the Keras package, which was then affecting the gradients via autograd, and I couldn't figure out why my code kept crashing... and then I saw NaNs too while writing this JavaScript demo!

If you just code up the BCE loss as it's typically written (Eq. 3 in the optional math section above), you can get NaNs resulting from catastrophic loss of precision inside the log functions. To avert this problem, you might try replacing the \(\log(1-\hat{y_i})\) term with log1p(\(-\hat{y_i}\)), or adding some kind of "fudge factor" \( \epsilon \) inside the logs. Neither of these will avert disaster. You could instead clip the inputs or add some kind of piecewise bounds-checking, which is what some libraries do, but... nah. Don't do that.

Instead, it's better to use a "logits trick" similar to the LogSumExp trick for the softmax function, where you hang on to the "logit" arguments \(z_i\) of the sigmoid function. Then it can be shown that you can write the BCE loss as $$\tag{8} L = {1\over N}\sum_i \left( {\rm max}(0, z_i) - z_iy_i + \log(1+e^{-|z_i|}) \right). $$ This is resistant to over/underflow problems, and it's what appears under the hood in major libraries like PyTorch. And sure, you could use log1p() on that last term too (I do), but the main problem has already been averted. Using (8) instead of (3) in the "custom" part of my Keras code solved my NaN problem.🎉
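If you'd like to see the difference in action, here's a minimal NumPy sketch of Equation (8) alongside a direct coding of Equation (3), using made-up extreme logits; it's an illustration of the idea, not the actual Keras or PyTorch internals:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_naive(z, y):
    """Eq. (3) coded directly: a saturated sigmoid puts log(0) inside the loss."""
    y_hat = sigmoid(z)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def bce_from_logits(z, y):
    """Eq. (8): the numerically stable 'logits trick' form."""
    return np.mean(np.maximum(0.0, z) - z * y + np.log1p(np.exp(-np.abs(z))))

z = np.array([-800.0, 0.0, 800.0])   # extreme logits: sigmoid saturates to exactly 0 or 1
y = np.array([1.0, 1.0, 0.0])

print(bce_naive(z, y))        # inf, plus overflow / log(0) warnings (other inputs give NaN)
print(bce_from_logits(z, y))  # ~533.6: finite, and as large as it should be
```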

Further reading/listening:
Bonus: Your Rating for This Year

by Scott H. Hawley, Dec 11, 2020