Why (and how) you should create a baseline model before you train your final model


So you’ve collected your data. You’ve outlined the business case, decided on a candidate model (e.g. Random Forest), set up your development environment, and your hands are at the keyboard. You’re ready to build and train your time series model.

Hold up — don’t start just yet. Before you train and test your Random Forest model, you should first train a baseline model.

A baseline model is a simple model used to create a benchmark, or point of reference, against which you will evaluate your final, more complex machine learning model.

Data scientists create baseline models because:

  • Baseline models can give you a good idea of how a more complex model will perform.
  • If a baseline model performs badly, it could be a sign of a data quality issue that needs addressing.
  • If a baseline model performs better than the final model, it could indicate issues with the final model’s algorithm, features, hyperparameters, or data preprocessing.
  • If the baseline and the complex model perform roughly the same, this could indicate that the complex model needs more fine-tuning (of its features, architecture, or hyperparameters). It could also show that a more complex model isn’t necessary and a simpler model will suffice.

Typically, a baseline model is a statistical model, such as a moving average model. Alternatively, it is a simpler version of the target model — for example, if you will be training a Random Forest model, you can first train a Decision Tree model as a baseline.
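To make this concrete, here is a minimal sketch of a moving average baseline, assuming your observations live in a pandas Series ordered in time. The series values and window size below are illustrative assumptions, not from any particular dataset:

```python
# A minimal sketch of a moving average baseline, assuming a pandas Series
# of observations in time order. The values and window size are illustrative.
import pandas as pd

y = pd.Series([112, 118, 132, 129, 121, 135, 148, 148, 136, 119])

# Forecast each point as the mean of the previous `window` observations.
# The shift(1) ensures the forecast only uses values available before that point.
window = 3
moving_avg_forecast = y.rolling(window).mean().shift(1)

print(moving_avg_forecast)
```

The same idea scales up: swap the moving average for a Decision Tree trained on lagged features if you want a simpler version of your eventual Random Forest as the benchmark.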

For time series data, there are a couple of popular options for baseline models that I’d like to share with you. Both work well because they respect the temporal order of the data and make forecasts based on its patterns.

Naive forecast

The naive forecast is the simplest: it assumes that the next value will be the same as the last observed value.
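As a quick illustration, here is a minimal sketch of a naive forecast and one way you might score it, assuming a pandas Series `y` ordered in time. The values and the choice of mean absolute error are illustrative assumptions:

```python
# A minimal sketch of a naive forecast baseline on a pandas Series
# ordered in time. The data and the metric are illustrative.
import pandas as pd
from sklearn.metrics import mean_absolute_error

y = pd.Series([112, 118, 132, 129, 121, 135, 148, 148, 136, 119])

# Naive forecast: each prediction is simply the previous observed value.
naive_forecast = y.shift(1)

# Score the baseline on the points where a forecast exists (skip the first).
mae = mean_absolute_error(y[1:], naive_forecast[1:])
print(f"Naive baseline MAE: {mae:.2f}")
```

Whatever error your final model produces, you can now compare it against this number: if the Random Forest can’t beat the naive forecast, that is a signal worth investigating before shipping anything.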