
Every year on my daughter’s birthday, I bake her a cake. My daughter’s favourite cake is chocolate, writes Luba Orlovsky, Analytical Research Lead at Earnix.

I found a recipe online a few years ago, and it made my life easier! Two eggs, 200 grams of flour, 200 ml of milk. I just follow the recipe instructions; no need to think too much about the next steps in advance.

This year, it struck me that baking a cake with a known recipe is not too different from using a Generalised Linear Model (GLM) regression to make a prediction.

To make a cake, you must gather the ingredients: eggs, milk, flour and so on. Cake ingredients are like the variables in a GLM. For example, variables for agriculture or weather monitoring might be the number of rain events, the amount of rainfall per event, and the total rainfall per year. To fit or apply a model for insurance products, you also need variables: age, geography, price and so on.

To make the cake, you need to know the quantities of the ingredients and understand how they affect one another. The same applies to modelling: each variable has its own coefficient that defines its impact on the prediction.

When fitting a GLM, you select a link function to control the “shape” of the outcome – for example, identity or logit – just as you decide the shape of your cake. Will it be square, round, or a dozen cupcakes?
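
To make the analogy concrete, here is a minimal sketch of fitting a GLM in Python with statsmodels. The data, the column names (age, region, claim_count) and the choice of a Poisson family with its log link are illustrative assumptions, not a recommendation.

    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    # Illustrative data: each row is a policy with a few rating factors (the "ingredients").
    df = pd.DataFrame({
        "age":         [23, 35, 47, 52, 61, 29, 44, 38],
        "region":      ["north", "south", "south", "north", "east", "east", "north", "south"],
        "claim_count": [1, 0, 0, 1, 2, 0, 1, 0],
    })

    # The family and its link function shape the outcome, much like choosing the cake tin;
    # a Poisson family with a log link keeps predicted claim counts positive.
    model = smf.glm("claim_count ~ age + C(region)", data=df,
                    family=sm.families.Poisson()).fit()

    # Each fitted coefficient plays the role of an ingredient quantity.
    print(model.params)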

In the end, your cake might turn out different from what you expected. The same goes for a model: you will always see some difference between the prediction and the actual result.

Why GLM?

GLMs are widely used in the banking and insurance industries: regulators and practitioners are familiar with them, they are relatively easy to describe, and they are transparent, since a change in a GLM’s prediction is directly explained by a change in its predictors.

Is it hard to build valid GLMs?

Designing an effective GLM is a combination of art and science, and you need a skilful artist to build it. While GLMs are easy to understand, you still need to think carefully about the factors that go into them. If your grandma’s recipe always bakes perfectly, you do not need to change anything. But what if someone has an allergy? What if you cannot use flour or eggs? Suddenly, the recipe does not work. Now it’s time to be creative.

The same thing happens with a GLM. If you have a prediction model that considers multiple factors and works well, there is nothing to worry about until something changes. Then you need to develop a new model, and here comes the hard part: you need to re-evaluate which variables to use, which interactions to include, and how to bin and group the variables. In other words, you are back to the most challenging, time-consuming stage – feature selection and engineering.

Automatic GLM

Automatic GLM (AGLM) automates the hardest modelling steps by applying different machine learning techniques behind the scenes, with little manual intervention. The following tasks are automated (rough code sketches illustrating each step appear after the list):

  • Variable selection

Initial variable ranking is done by fitting a boosting model. The model itself is not used; only its variable importances are exposed to the user, who can decide which variables to keep.

  • Binning of continuous variables

Continuous variables are split into many bins and the algorithm merges them through lasso regularisation to find the best binning for that model.

  • Interactions finding

The algorithm looks for two-way interactions by fitting and analysing boosting models. This step is optional, and the underlying model can be tuned manually.

  • Categorical variables with lots of categories

AGLM can handle categorical variables with many categories through target encoding. Users can choose to apply it when they know the number of categories is large, or stay with one-hot encoding. In addition, AGLM allows the regression coefficients to be updated interactively and supports hierarchical data structures.
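
As a rough illustration of the variable-selection step, the sketch below ranks candidate variables by the feature importances of a gradient-boosting model fitted with scikit-learn. The data and column names are invented, and AGLM’s own implementation may differ in the details.

    import pandas as pd
    from sklearn.ensemble import GradientBoostingRegressor

    # Invented policy data with a handful of candidate rating factors.
    X = pd.DataFrame({
        "age":     [23, 35, 47, 52, 61, 29, 44, 38],
        "veh_age": [1, 7, 3, 10, 2, 5, 8, 4],
        "mileage": [8000, 12000, 5000, 20000, 9000, 15000, 11000, 7000],
    })
    y = [1, 0, 0, 1, 2, 0, 1, 0]  # observed claim counts

    # Fit a boosting model purely to rank the variables; the model itself is discarded.
    booster = GradientBoostingRegressor(random_state=0).fit(X, y)
    ranking = pd.Series(booster.feature_importances_, index=X.columns)
    print(ranking.sort_values(ascending=False))  # the user decides which variables to keep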
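
For the binning step, one common way to let a lasso penalty merge bins is to over-bin the variable with step indicators I(x ≥ c) at many candidate breakpoints; wherever the penalty shrinks a step to zero, the two neighbouring bins are effectively merged. The sketch below is a simplified stand-in for AGLM’s binning, with simulated data and an arbitrary penalty strength.

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    age = rng.uniform(18, 80, size=500)
    # Simulated effect: higher claim frequency for young drivers, a small bump for older ones.
    y = 0.5 + 0.3 * (age < 25) + 0.1 * (age > 65) + rng.normal(0, 0.05, size=500)

    # Over-bin: one step indicator I(age >= c) per candidate breakpoint.
    breakpoints = np.arange(20, 80, 2)
    X_steps = np.column_stack([(age >= c).astype(float) for c in breakpoints])

    # The lasso zeroes out steps where nothing changes, which merges neighbouring bins;
    # the surviving breakpoints define the binning handed to the GLM.
    lasso = Lasso(alpha=0.005).fit(X_steps, y)
    print("suggested breakpoints:", breakpoints[np.abs(lasso.coef_) > 1e-6])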
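
For the interaction-finding step, a crude but transparent stand-in for the boosting-based search is to try each candidate pair directly in a small GLM and see how much the interaction term improves the fit (here measured by AIC). The variables, data and Poisson assumption below are made up for illustration.

    import itertools
    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(0)
    n = 400
    df = pd.DataFrame({
        "age":     rng.uniform(18, 80, n),
        "mileage": rng.uniform(2000, 25000, n),
        "veh_age": rng.uniform(0, 15, n),
    })
    # Simulated claim counts with a genuine age x mileage interaction baked in.
    df["claims"] = rng.poisson(np.exp(-3 + 0.01 * df["age"] + 0.00005 * df["mileage"]
                                      + 0.000001 * df["age"] * df["mileage"]))

    # Score every two-way interaction by the AIC improvement it brings to the GLM.
    for a, b in itertools.combinations(["age", "mileage", "veh_age"], 2):
        base = smf.glm(f"claims ~ {a} + {b}", df, family=sm.families.Poisson()).fit()
        both = smf.glm(f"claims ~ {a} * {b}", df, family=sm.families.Poisson()).fit()
        print(f"{a} x {b}: AIC improvement = {base.aic - both.aic:.1f}")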
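
For categorical variables with many levels, the sketch below shows plain smoothed target encoding: each category is replaced by a blend of its own average outcome and the overall average, so the GLM sees one numeric column instead of hundreds of dummies. The column names and smoothing constant are assumptions for illustration; AGLM’s encoder may differ.

    import pandas as pd

    df = pd.DataFrame({
        "postcode": ["A1", "A1", "B2", "B2", "B2", "C3", "C3", "D4"],
        "claim":    [1, 0, 0, 1, 1, 0, 0, 1],
    })

    global_mean = df["claim"].mean()
    k = 5.0  # smoothing: categories with few rows are pulled towards the global mean

    stats = df.groupby("postcode")["claim"].agg(["mean", "count"])
    encoding = (stats["count"] * stats["mean"] + k * global_mean) / (stats["count"] + k)

    # One numeric column replaces what would otherwise be one dummy per postcode.
    df["postcode_te"] = df["postcode"].map(encoding)
    print(df)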

In Summary

Traditional GLMs can be hard and time-consuming to fit due to the large number of potential variables, transformations and interactions. However, AGLM automates the difficult part of the process, and the result is an effective, predictive and transparent GLM that can be used as-is or serve as a basis for further work.

Just like with baking a cake, it’s the selection of ingredients that dictates the outcome. The key is to make sure the correct factors are used.

About the author

Luba Orlovsky, Analytical Research Lead at Earnix, has a BSc in Industrial Engineering and an MSc in Operations Research from the Israel Institute of Technology.