Loss functions
When we train a neural network, we need a way to tell if she’s getting better or worse. That’s the job of a loss function. It gives us a number. Lower is better. If we can shrink that number step by step, she learns.
Mean squared error
Mean squared error (MSE) is the plainest kind. We take the difference between her prediction and the real answer, square it, then average across all samples. Large mistakes look huge after squaring. Small mistakes fade.
Use MSE when outputs are continuous—like predicting house prices. Avoid it when the answer is a yes/no category: on probabilities the squared penalty tops out, so a confidently wrong guess barely costs more than a mildly wrong one, and she learns slowly.
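Here's a minimal sketch in NumPy of what that averaging looks like. The prices and predictions are made-up numbers, just to show how one big miss dominates the total after squaring.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: average of the squared differences."""
    return np.mean((y_true - y_pred) ** 2)

# Toy house-price example (values in thousands; purely illustrative).
prices      = np.array([300.0, 450.0, 120.0])
predictions = np.array([310.0, 400.0, 125.0])

print(mse(prices, predictions))  # 875.0 — the single 50k miss contributes 2500 of the 2625 total
```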
Cross entropy
Cross entropy is for classification. It compares her predicted probability spread with the true label (which is just a one-hot vector: one “1,” the rest “0”). If she puts most of her weight on the right class, loss is small. If not, loss is big.
It’s the go-to choice for tasks like “Is this a cat or a dog?” because the math penalizes confident wrong answers far harder than tentative ones.
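A small sketch, again with made-up probabilities, shows that penalty curve: a confident right answer costs almost nothing, a shrug costs a bit, and a confident wrong answer costs a lot.

```python
import numpy as np

def cross_entropy(one_hot, probs, eps=1e-12):
    """Cross entropy against a one-hot label: minus the log of the
    probability she assigned to the true class (eps avoids log(0))."""
    return -np.sum(one_hot * np.log(probs + eps))

label = np.array([1.0, 0.0])             # "cat" as a one-hot vector

confident_right = np.array([0.9, 0.1])
tentative       = np.array([0.5, 0.5])
confident_wrong = np.array([0.1, 0.9])

print(cross_entropy(label, confident_right))  # ~0.105  small loss
print(cross_entropy(label, tentative))        # ~0.693  medium loss
print(cross_entropy(label, confident_wrong))  # ~2.303  big loss
```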
How she learns
Loss is just the starting signal. Gradient descent takes the slope of that signal with respect to each weight and nudges the weights in whichever direction shrinks the loss. Over thousands of steps, she shifts from random noise to useful output.
We don’t need to love the equations. Think of it like hot-and-cold feedback. Each step says “warmer” or “colder.” She adjusts.
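For the curious, here's a toy sketch of that warmer-colder loop: one weight, a made-up straight-line dataset, and plain gradient descent shrinking the MSE step by step. The numbers and the learning rate are arbitrary choices for illustration, not from any real setup.

```python
import numpy as np

# Fit y ≈ w * x with a single weight and MSE loss.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])      # the true relationship is y = 2x

w  = 0.0                            # start from a bad guess
lr = 0.05                           # learning rate: how big each step is

for step in range(200):
    pred = w * x
    grad = np.mean(2 * (pred - y) * x)  # derivative of MSE with respect to w
    w -= lr * grad                      # step "warmer": move w to reduce the loss

print(w)  # close to 2.0 after enough steps
```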
Our takeaway
Loss functions aren’t magic. They’re yardsticks. Choose MSE for numbers, cross entropy for categories. The rest is grind: repeat, adjust, repeat.
And somewhere between debugging and re-running, we realize the whole thing feels less like rocket science and more like playing a game of warmer-colder with a very stubborn friend.