OVERFITTING: TRAINING vs REAL LIFE
Like athletes who shine in training but underperform in actual games, overfitting happens when a model performs well on the training data but does not generalise properly to real life. The estimator (model) is then said to have high variance. On the other hand, we have under-fitting when the estimator has high bias. This article focuses on the problem of overfitting: training vs real life.
Once mindful of this problem, we just need to reduce both bias and variance, and that’s it! Well, it is not that easy!… Why? Because there is always a trade-off between variance and bias, in other words, between overfitting and under-fitting.
Theory aside, let’s say we have selected a model (any kind, from linear regression to a complex Neural Network). The first step to avoid overfitting is to split the data into training and test sets. If the error associated with training the model is significantly different from the error on the unseen (test) data, be careful! Something is going wrong.
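As a minimal sketch of this check, assuming scikit-learn and a synthetic dataset (the model, data and split ratio below are arbitrary examples, not a prescription):

```python
# Minimal sketch: compare the training error with the error on unseen (test) data.
# Assumes scikit-learn is available; the data here is synthetic.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                                   # 200 samples, 5 features
y = X @ rng.normal(size=5) + rng.normal(scale=0.5, size=200)

# Hold out part of the data as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LinearRegression().fit(X_train, y_train)

train_error = mean_squared_error(y_train, model.predict(X_train))
test_error = mean_squared_error(y_test, model.predict(X_test))
print(f"train MSE: {train_error:.3f}  test MSE: {test_error:.3f}")
# A test error much larger than the training error is a warning sign of overfitting.
```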
Here we share three good approaches to help you avoid overfitting problems in your project:
MORE DATA, PLEASE
Imagine trying to guess a person’s height using the following features: their foot size and their weight. Now, pick the first two friends that come to mind and gather this information from them. Do you think you can create a good model out of this dataset alone?
The answer is NO. When the number of data samples (in this case, two friends) and the number of features (in this case, 2 features: foot size and weight) are of similar size, the result is an overfitted model. But how much larger does the number of data samples need to be compared to the number of features? This is an interesting and necessary concept to know.
The answer lies in the curse of dimensionality. In short, as the number of features increases, the amount of data needed to obtain a valid, generalised model increases exponentially. You must always keep in mind the dimensionality of your model and collect enough data accordingly.
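To get a feel for why, here is a rough back-of-the-envelope sketch (an illustration under simplifying assumptions, not a formal result): if each feature needs roughly ten representative values in the data, the number of regions to cover grows exponentially with the number of features.

```python
# Back-of-the-envelope illustration of the curse of dimensionality.
# Assumption: we want ~10 distinct value ranges ("bins") per feature
# to be represented in the data.
bins_per_feature = 10
for n_features in (1, 2, 5, 10):
    cells = bins_per_feature ** n_features   # regions of feature space to cover
    print(f"{n_features:2d} features -> ~{cells:,} regions to cover with data")
# 1 feature -> ~10 regions; 10 features -> ~10,000,000,000 regions
```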
X-VALIDATION
Good data scientists not only select a fit-for-purpose model but also optimal values for the algorithm’s hyperparameters (number of neurons, layers…) before even training the model to find the right parameters (weights and biases).
For hyperparameter selection, cross-validation is a powerful method to avoid overfitting.
The idea consists of splitting the training data into N groups (called folds). Then, you fix a set of hyperparameters and iteratively train the model, each time using a different fold for evaluation and the remaining folds to fit the model. Finally, the selected set of hyperparameters is the one corresponding to the model with the lowest average error on the held-out (validation) folds.
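A minimal sketch of K-fold cross-validation for hyperparameter selection, assuming scikit-learn; the Ridge model and the candidate alpha values below are arbitrary examples:

```python
# Minimal sketch: pick a hyperparameter (alpha) by 5-fold cross-validation.
# Assumes scikit-learn; the data is synthetic and the candidates are arbitrary.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 8))
y = X @ rng.normal(size=8) + rng.normal(scale=0.5, size=150)

folds = KFold(n_splits=5, shuffle=True, random_state=0)

best_alpha, best_score = None, -np.inf
for alpha in (0.01, 0.1, 1.0, 10.0):            # candidate hyperparameter values
    # Fit on 4 folds, evaluate on the held-out fold, 5 times in total
    scores = cross_val_score(Ridge(alpha=alpha), X, y, cv=folds,
                             scoring="neg_mean_squared_error")
    if scores.mean() > best_score:
        best_alpha, best_score = alpha, scores.mean()

print(f"selected alpha: {best_alpha} (average CV MSE: {-best_score:.3f})")
```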
THE SIMPLER, THE BETTER
This article about Occam’s Razor addressed the importance of interpretability and simplicity in different aspects of Machine Learning. We concluded that simplicity is a must.
You can understand simplifying a model as reducing its variance by constraining it. One way to do that is regularisation.
The idea is to penalise complexity by modifying the function in charge of finding the model parameters. In exchange, we can expect regularisation to negatively impact the model’s bias, but there is No Free Lunch…
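As a minimal sketch, assuming scikit-learn: Ridge regression adds an L2 penalty on the weights to the least-squares objective, which constrains (shrinks) the parameters and therefore reduces variance at the cost of some extra bias. The data and the alpha value below are arbitrary examples.

```python
# Minimal sketch of regularisation with Ridge regression (L2 penalty).
# Assumes scikit-learn; synthetic data, arbitrary alpha.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 20))                  # few samples, many features
y = X[:, 0] + rng.normal(scale=0.5, size=30)   # only the first feature matters

plain = LinearRegression().fit(X, y)
regularised = Ridge(alpha=5.0).fit(X, y)       # loss = ||y - Xw||^2 + alpha * ||w||^2

print("sum of |weights|, plain:      ", np.abs(plain.coef_).sum().round(2))
print("sum of |weights|, regularised:", np.abs(regularised.coef_).sum().round(2))
# The regularised model keeps the weights smaller, i.e. a simpler, lower-variance model.
```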