
Full Tutorial

This is a full tutorial of Akur8; it will also be used as a translation reference for Prismy.

Written by Diego Sanchez
Updated over 4 months ago

  1. Introduction

The goal of this tutorial is to give a quick overview of the features contained in the Akur8 insurance modelling software.

This tutorial is split into two main parts.

In the first part, we will build a full model from scratch on the demo dataset. This model will include the elements of a state-of-the-art insurance model, such as:

  • Generalised Additive Model

  • Detection and fitting of interactions

  • Geographic smoothing

  • Inclusion of Regional and External variables

In the second part, we will show some advanced features that target more complex use cases, or automate some manual processes in order to increase productivity and reduce friction.

Texts in the grey boxes (like this one) will give you the step-by-step instructions to proceed with the tutorial: you can follow them to create your first model with Akur8.

Everywhere on the Akur8 platform, you will notice a chat icon in the bottom right. This chat allows you to contact Akur8’s team of Actuaries and Data Scientists, who are available to answer any question you have about the software.

Do not hesitate to contact us if you get stuck anywhere in this tutorial: the team will be happy to answer your questions quickly.

  2. Log-in and Workspace selection

To access Akur8, enter https://beta.akur8-tech.com/ in your Chrome browser.

Log in using your Akur8 email and Okta.

Enter your email address and password, then click LOG IN.

If this is your first login or you don’t know your password, you can click on Don’t remember your password? to receive a new one.

After logging in, you reach the workspace selection page, which lets you choose the workspace you will work in. Workspaces are dedicated to teams working on similar models, sharing their Databases and modeling projects. If you only have access to one workspace, you will enter it directly without going through this stage.

For the purpose of this tutorial, you can enter a workspace previously created for you under your name: FIRST_NAME LAST_NAME.

  3. The Data Section

The Akur8 software currently has 5 sections, including:

Data: to upload Databases, inspect them, and prepare the data before starting the creation of models.

Risk: to build models once the Data is ready (cf. Risk Section).

Demand: to aggregate frequency, severity and propensity models into a single risk estimate and run loss-ratio studies.

The pictograms on the left of the screen indicate which section the user is currently in. After entering the project, the user is in the Data section.

The Data section contains a list with all the Databases previously loaded in Akur8.

When entering your personal workspace for the first time, this list can be empty.

  4. Creating a new Database

NB. A tutorial Database might already be uploaded in your environment; if that’s the case, you can directly move to the next paragraph, View Database Summary.

A new Database can be created by clicking on the upload database button in the middle of the screen (or on the + button at the bottom of the databases list):

Enter the name of the new Database, select a file to upload and tick the checkbox. The data can then be uploaded by clicking on the UPLOAD button; the upload progress is displayed on the screen.

A database can be uploaded in several formats: CSV, Parquet, zipped CSV (for fast upload of large databases) or SAS.

Once the Database is uploaded, it must be pre-processed, to compute distributions and statistics. This pre-processing may take a few minutes.

The progress bar showing the pre-processing status is displayed at the top-right of the screen.

As the pre-processing is running on the cloud, it is possible to upload other datasets while the process is running.

Download the CSV files from Demo_Italy.

Click on the upload database button and upload the tuto_DB_Italy.csv file with the name Tuto_DB. Also upload the database Tuto_Zip_Italy.csv with the name Tuto_Zip.

  5. View Database summary

Once a Database has been uploaded and pre-processed, its summary is displayed.

This summary gives the main information on the Database: name of the original CSV file, creation date, and some statistics on the number of rows and columns.

It is possible to share information about the database in the Documentation note. This information will be available to all users granted access to the workspace, and will be included in the final documentation of the models (see section Exporting your models & documentation). It is also possible to see who created and who last modified the data source.

Select the Tuto_DB database and start inspecting it in detail by clicking on the VIEW VARIABLES button in the top-right corner.

  6. Inspect a Database

The list on the left contains all the variables present in the database. We can select each variable for a detailed inspection.

It is possible to search for a particular variable by clicking on the search icon at the top of the list. For example, let’s type age in the search box to select the driver_age variable.

  7. Histogram, statistics and documentation

The top of the screen shows the distribution of the selected variable. Missing values are highlighted in red. In the center, you can find some statistics describing the variable (some, such as the median and the mean, are available only for numerical variables). A documentation field, giving information about the selected variable, is also available.

  8. Variables Type & Usage

The variable type is defined at the right of the screen:

Ordinal variable: if the variable follows a specific order. All the numeric variables are considered ordered by default, but sometimes non-numeric variables should be set as ordered as well (for instance a variable containing levels “A”,”B”,”C”… might have a relevant order).

Categorical variable: if the variable contains categories that have no specific order. All the non-numeric variables are considered categorical by default, but sometimes numeric variables can be categorical (for instance if you have 3 groups “1”, “2” and “3” with no specific order).

The type of a variable (ordinal / categorical) has a strong impact on the models created: if the variable is ordered, the models will be smooth along it; otherwise each level will be treated independently.
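For intuition, here is how the same distinction could be expressed upstream in pandas (a minimal sketch; the region_code variable is hypothetical, studies_level appears in the tutorial dataset):

    import pandas as pd

    df = pd.DataFrame({
        "region_code": ["3", "1", "2", "3"],     # numeric-looking, but no real order
        "studies_level": ["B", "A", "C", "A"],   # non-numeric, but has a natural order
    })

    # Treat a numeric-looking variable as categorical (levels are independent):
    df["region_code"] = pd.Categorical(df["region_code"], ordered=False)

    # Treat a non-numeric variable as ordinal (models can be smooth along it):
    df["studies_level"] = pd.Categorical(
        df["studies_level"], categories=["A", "B", "C"], ordered=True
    )

    print(df["studies_level"].cat.ordered)  # True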

Missing values are handled by the Akur8 solution: the corresponding coefficient will be set to the average value by default, or will follow the observed data if the signal is significant (the missing-value level won’t be grouped with its neighbours).

Go to variables: additional_drivers, nb_vehicles, pets, policy_type, usage_km, vehicle_age, vehicle_anti_polution_norm and switch them from Categorical to Ordinal.

This process is key for the performance of the modeling, but it can be time-consuming. The good news is that it can be reapplied automatically across different data sources; see Advanced Usage - Reuse Format.

There are keyboard shortcuts to quickly review all the variables and change their types when needed: Alt + ↑ or Alt + ↓ to select the previous or next variable, and Alt + O or Alt + C to set a variable’s type to Ordinal or Categorical respectively.

A summary of these shortcuts can be found in the help section.

  9. Variable Modification

It is possible to modify the data of each variable (grouping, reordering and management of missing values) by clicking on MODIFY VARIABLE > MODIFY LEVELS.

Note: Go to variable driver_age. Make sure that this variable is Ordinal. If in your dataset you have the level “Company”, then:

  • Go to MODIFY VARIABLE-> MODIFY LEVELS

  • Select the level “Company” in the list on the right of the window (it becomes green).

  • Click on the NA button at the bottom of the levels list: the value becomes red in the graph.

We cover in detail the data transformation part in the Advanced Usage section.

  10. The Risk Section

Once you have uploaded, checked and modified your data, click on the Risk section on the left.

The list on the left of the screen contains all the modeling projects already created (when entering for the first time, this list should be empty).

In this section, our goal will be to model the frequency of an MTPL Property Damage cover. The modeling will consist of different steps that will walk us through most of Akur8’s modeling process.

  1. We will set up the context of our modelling in the Create a new modelling Project section.

  2. We will build a first GLM using only declarative information: no regional or external variables will be included at this very first step in the Build Base GLM section.

  3. We will enhance our first GLM baseline by adding interactions in the Add Interactions section.

  4. We will use the policies location to apply geographic smoothing, in the Geographic Modeling section.

  5. We will also include the impact of regional and external variables in the Introducing External Variables: Add More Variables section.

  6. We will simplify the models by combining the effects of the zip code and those of the external variables into a single zoning variable in the Reduce variables section.

  7. Finally, we will add some actuarial knowledge by manually adjusting the impact of some variables in segments with low statistical significance in the Edit Coefficient section.

  11. Create a new modeling Project

You can create a new modeling project by clicking on the Create Project button in the middle of the screen (or the + button at the bottom of the list if other projects have already been created).

Click on Create Project, create a new project with the name Tuto_project and select Frequency as Type from the drop-down list.

Next, click on SELECT A DATABASE and select the Tuto_DB by clicking on its name. Finally click on CREATE RISK PROJECT.

Project Summary

Once the project has been created a summary of the modeling Project is shown on the screen with some of its features: name, type, creation date and database on which the modeling will be performed.

In order to create models, the project needs to be set up: you must set its goals and define its variables.

Set Goals

Click on SET GOALS.

The first screen allows you to set the goals of the modeling project:

  • The variable to be predicted (Target variable).

  • The exposure variable – usually the duration of a policy or the number of claims (Exposure).

  • The time information variable – it will be very useful to ensure the time consistency of the models created (Date).

  • The stratification information variable (for validation purposes) – it allows data points to be explicitly assigned to the hold-out sample. The hold-out can be used to make sure the chosen model remains stable when tested on new data, as well as to rigorously benchmark it against other modeling techniques (Stratification).
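A stratification column like this is typically prepared upstream, before uploading the data. A minimal sketch (assuming pandas; the 80/20 split ratio is illustrative):

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(42)                    # fixed seed: reproducible split
    df = pd.DataFrame({"policy_id": range(100_000)})   # stand-in for the real database

    # Assign ~80% of rows to "Modeling" and ~20% to the hold-out "Validation" sample.
    df["strat_variable"] = np.where(rng.random(len(df)) < 0.8, "Modeling", "Validation")

    print(df["strat_variable"].value_counts(normalize=True))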

Set the following in the SET GOALS screen:

  • Target: target__claims_PD → this column contains the number of Third Party Liability Property Damage claims.

  • Exposure: contract_duration → this column contains the duration of policies.

  • Date: year → this column contains an accident year.

  • Stratification: strat_variable → this column contains 2 different values, Modeling and Validation, to indicate which rows will be used in the modeling and which ones will be used in the validation.

Click on CREATE MODELS.

Define Variables

This step allows the user to define which variables cannot be used in the modeling. Akur8 will then detect and select the optimal subset of variables depending on the constraints defined by the user.

There are different reasons to omit variables:

A-posteriori variables, not available when the underwriting takes place: the target and exposure variables are obvious examples, but some other variables, modified during the claims management process, may also enter this category.

To ease the detection of those a-posteriori variables, which are often very strongly correlated with the target, we show a Variable Importance score, which indicates the individual predictive power of each variable.

For instance, the cost of a Property Damage claim (zero if the client has no claims) is naturally strongly predictive of the actual number of claims: it is of course an a-posteriori variable.

Unreliable variables, whose quality is too low or whose origin is unknown.

Unavailable variables, which can’t be accessed in the production environment (while they could be fetched, this would require significant changes in the underwriting process and cannot be done for practical reasons).

Variables that shouldn’t be used in the modeling can be removed with the blue switch on the left of their name. The Target, Exposure and Date variables are unselected by default.

Click on the search icon at the top of the variables list, enter target_, open the SELECT menu at the bottom of the list and click Unselect all, thus removing the a-posteriori variables.

Do the same for all the regional_ variables (we remove all the regional variables now and will integrate them at a later stage).

Finally, remove the zip variable (it will be integrated during geographic modeling).

Click on the GENERATE MODELS button and go to the next section.

Once this has been done in one model, it is possible to reuse the variable definitions: this process is described in the Reuse Variables definition annex of this tutorial.

Other modeling options, impacting the behaviour of the variables, can be set at this stage. These more advanced options (including variable offsetting and constraints) are described in the Force a variable behaviour annex.

Now we are ready to build the models.

  12. Build Base GLM

  12.1. Models Generation

This step allows us to build a batch of models (a grid search).

You must define:

  • The number of variables selected in the models (Define variable range): the models created by our machine-learning algorithm will have a number of variables that lie within the chosen interval. Select a range between 5 and 30 variables to generate models from simple (5 variables) to medium complexity (30 variables).

  • The number of Parsimony steps: each parsimony value will correspond to a group of models having the same number of variables. The default value of 5 will build five groups of models, each group having a specific number of variables within the interval chosen above.

  • The number of Smoothness steps: this indicates the number of levels of smoothness. For instance, when selecting 7, the tool will generate, for each level of parsimony, 7 different models. These models will range from very sensitive models (watch out for over-fitting) to very robust ones (watch out for under-fitting). An intuitive explanation of the concept of smoothness can be found in the presentation slides of the Akur8 solution.

The numbers chosen for parsimony and smoothness define the number of models created: larger values will mean more precise choices in the models’ creation, but also longer computation time. After selecting the variables range, you can leave the number of steps at their default values.

For each combination of smoothness and parsimony, 5 different models are created:

  • 4 models form a 4-fold cross-validation over the modeling subset of the Database;

  • 1 model is trained on the whole modeling subset.

The use of cross-validation provides a reliable estimate of the performance of the models created.

Later on, the scores (Gini, Deviance…) displayed for the created models will be the average scores over the 4 folds. Error bars will indicate the performance of the worst-performing and best-performing folds (if you are not familiar with k-fold cross-validation, you can find information on Wikipedia).
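To make the bookkeeping concrete, here is the arithmetic behind the grid search and the fold aggregation (our own illustration with made-up fold scores, not Akur8 internals):

    import numpy as np

    parsimony_steps, smoothness_steps = 5, 7
    fits_per_spec = 4 + 1                      # 4 cross-validation folds + 1 full fit
    n_specs = parsimony_steps * smoothness_steps
    print(n_specs, n_specs * fits_per_spec)    # 35 model specifications, 175 fitted models

    # The displayed score is the average over the 4 folds; the error bar spans
    # the worst- and best-performing folds.
    fold_ginis = np.array([0.31, 0.29, 0.33, 0.30])   # hypothetical per-fold Gini scores
    print(fold_ginis.mean(), fold_ginis.min(), fold_ginis.max())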

Select a range between 5 and 30 variables to generate models from simple to medium complexity

You can keep the default values (5 and 7) for the parsimony and smoothness steps.

Click on the RUN button.

You will see the progress bar appearing and will be redirected to an information screen.

The model creation will take approximately 25 minutes on the tutorial database. Note that this computation time is needed to build 175 models: one model for each combination of smoothness, parsimony and fold. As the computations run in the cloud, it is still possible to fully use the solution in the meantime, for example to upload other databases or to launch computations in other modeling projects.

  12.2. Visualize the results

Once the process is finished, we can see the models created in a graph, where each point indicates a different model. Since we requested 5 levels of parsimony and 7 levels of smoothness, 35 different solutions to our modeling problem have been built, with different numbers of variables and levels of smoothness.

In the graph:

  • The horizontal axis represents the number of variables in the model.

  • The vertical axis represents the out-of-sample performance of the model created (estimated by k-fold cross-validation over the modeling dataset). The default metric is the Gini coefficient. Other metrics, such as EDR, can also be selected from the list at the top left of the graph.

  • The error bars represent the variations of performance between the folds (and thus the reliability of the performance measured).

This graph gives a clear visualisation of the trade-off between model complexity and performance:

  • A model on the right-hand side of the graph will contain more variables and be harder to interpret than a model on the left-hand side of the graph.

  • At equal performance, simpler models (on the left) should be preferred.

  • A model higher on the graph performs better (on the k-fold cross-validation) than a model lower on the graph.

  • At equal complexity, better-performing models should be preferred.

Click on the best-performing model with 15 variables.

  12.3. Inspect a model

Model overview

After selecting a model by clicking on it, the Model Overview screen opens.

The goal of the model overview is to give more insights about the performance of a model (on the cross validation), together with a quick overview of its properties (variables selected along with their importance).

On the left, the importance of the selected variables is displayed: the length of each bar indicates the importance of each variable, through the spread of its coefficients.

A definition of the variables spread is provided in the Variables Spread appendix.

On the right-hand side, an in-depth inspection of the metrics is possible by navigating three tabs:

  • Segmentation

  • Residuals

  • Statistics

A description of some of these KPIs is given in the Technical Appendix of this tutorial.

Variable inspection

The variables in the model can be inspected by clicking on the VARIABLES tab.

All variables are displayed in the list on the left, ordered by importance in the model. The selected variables are identified with a blue point next to their name.

A graph for the variable selected is displayed on the right. This graph contains three lines and a histogram:

  • The exposure, for each level, is represented by the blue histogram.

  • The observed average (the average observed value of the target) is represented by the purple line.

  • The fitted average (the average predicted value of the target) is represented by the orange line.

  • The coefficients of the selected variable are represented by the green line.

Each line can be shown / hidden in the graph by clicking on its name in the legend.

It is possible to zoom in by drawing a box around the region of interest, or by selecting an interval while moving the mouse vertically or horizontally.

To zoom out, double-click on the graph.

  12.4. Choosing the best smoothness level

When choosing a model we can find ourselves in different situations:

  • The model’s coefficients are too robust (as we can see in the High smoothness figure below, where the model is underfitting),

  • The model’s coefficients seem to capture noise (as we can see in the Low smoothness figure below, where the model is clearly overfitting).

Akur8 lets you navigate all the different models to select the best trade-off between these two extreme scenarios.

Next to Smoothness, at the top right of the screen, use the dedicated buttons to increase or decrease smoothness.

When reviewing the model’s variables, as soon as coefficients suspected of over-fitting are noticed, a smoother model should be selected.

For instance, in the tutorial Database, the variable vehicle_max_speed may show a tendency to over-fit (its coefficients capture dubious signal): if this is the case, use the smoothness buttons to select a more robust model.

Note that the smoothing applied is consistent across all the variables in a model: clicking on the smoothing buttons will select another model, with a similar number of variables and a weaker/stronger smoothing.

High smoothness: in the previous figure, the Akur8 engine segmented the variable into just three segments. While this choice is robust, some relevant signal is not captured on the young drivers segment.

Low Smoothness: in the previous figure, the smoothing applied is insufficient as the modeled trends are not consistent. The model is clearly overfitting as it is capturing noise.

Optimal Smoothness: the previous figure shows an optimal tradeoff between robustness of segmentation and signal included in the model.

Click the < button next to Variables, at the top right of the screen, to move to a model with fewer variables.

  12.5. Tagging a model

When you are comfortable with the model obtained, you can reference it by giving it a name: click on the TAG MODEL button on the top right corner.

Give a name to the model that allows for easy identification.

Within the Model Tree on the left of your screen, the model you just tagged will appear under the name of your modeling project.

  13. Add Interactions

The addition of interactions has two main parts:

  • Detecting and investigating interactions, both automatically via the Akur8 engine and manually through user-chosen interactions.

  • Fitting the interactions, taken from the list provided by both the user and the algorithm.

  13.1. Interaction detection

On the Model visualisation page of a tagged model, you can click ENRICH and then the ADD INTERACTION button to add an interaction modeling step.

Name the subproject (for example “interactions”).

The interactions considered can be chosen in two different ways:

  • Automatic suggestion: if enabled, the Akur8 engine will test all the possible interactions between the variables selected in the univariate GLM model. It will keep the interactions most correlated with the target (20 by default).

  • Custom Interactions: if you want to inspect specific interactions regardless of their importance, you can add them manually.

Add a custom interaction between vehicle_make and vehicle_age.

Click DISPLAY AND SELECT at the top right to launch the computation to identify the interactions and their correlation with the target.

Once the computations are done, Akur8 will show the list of interactions sorted by their statistical significance (a score below 100% means the interaction’s significance is below the significance threshold used for the GLM model creation).

In the example above, the interaction between driver_gender and driver_age is the most significant. The graph shows the observed and predicted values as a function of age, split by gender. By inspecting the young segment (the leftmost levels of each distribution), we can see an interaction: young females’ risk is overestimated (the orange Fitted line is significantly above the purple Observed line), while young males’ risk is underestimated (the lines are in the opposite position). The Akur8 engine properly detected this (well-known) effect and assigned it a high interaction score.

At this stage, the user should inspect the interactions proposed by the tool and unselect those that, even if they appear significant, have no value in an actuarial sense.

Launch the fit by clicking on the GENERATE MODELS button.

Choose the maximum number of interactions in the popup, for instance 4.

The Akur8 engine will build models starting from one interaction up to the maximum number chosen.

  13.2. Visualise the results

As before, Akur8 displays a grid search graph showing the performance of the new interaction models. The initial GLM model (in orange) is also shown for comparison purposes.

Click on the desired model to inspect the results.

To visualise the interactions, click on the VARIABLES tab: in the variables list, interaction variables are highlighted with a magenta dot.

There are three possible visualisations:

  • 2 visualisations of the classical graph grouped by each of the two univariate variables (first two tabs).

  • A heatmap: the colour of the dots in the heatmap indicates the impact of the coefficient (according to the scale on the right of the graph). The size of each dot is proportional to the exposure of the corresponding level of the interaction variable.

TAG the desired model after careful inspection.

  14. Geographic modeling

  14.1. Create a geographic model

Upload a geography database

In order to create a geographic model, you will need to use a geographic database containing at least 3 columns:

  • Zip code (or an equivalent geographic key variable that exists in the modeling database and will allow a merge operation)

  • GPS location: longitude and latitude

NOTE: The zip code is by far the most common geographic variable encountered in data sets. However, any variable to which a latitude/longitude can be attached can be used just as easily. This means that if the exact location of the client is available, the Akur8 solution is able to compute a Geography even with millions of distinct geographic points.

If only projected coordinates are available, it is still possible to apply geography. However the visualisation capabilities will be limited as it will not be possible to place the locations on a map.

The Akur8 engine tolerates zip codes without a location, up to a quality threshold: to avoid building meaningless models, no more than 10% of the zip codes in the database should lack a location.
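For reference, the expected layout and this 10% coverage rule can be checked upstream with pandas (a sketch reusing the tutorial file and column names):

    import pandas as pd

    geo = pd.read_csv("Tuto_Zip_Italy.csv")   # expected columns: ZIP, lon, lat
    db = pd.read_csv("tuto_DB_Italy.csv")     # the modeling database, with its zip column

    located = geo.loc[geo[["lon", "lat"]].notna().all(axis=1), "ZIP"]
    zips = db["zip"].drop_duplicates()
    missing_rate = (~zips.isin(located)).mean()

    print(f"zip codes without a location: {missing_rate:.1%}")  # should stay <= 10%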

In the data section, we have already uploaded this geographic database. If you haven’t done it yet, please follow the instructions provided in the Creating a new Database section.

Create a geographic model

In the model page, click ENRICH and ADD GEOGRAPHY.

Give a name to your geographic modeling (for example geography).

Select the Tuto_Zip database (it is possible to filter the Databases by name).

In the Geographic Model Preparation screen, select the columns of the zip location file.

Make sure Coordinate Type is set to Lat/Long.

Enter the relevant information regarding the location Database:

  • Zip-code variable: ZIP → the column containing the names of the geographic zones in the geographic Database.

  • Long and Lat: lon and lat → the columns containing the longitude & latitude of each geographic zone.

On the right-hand side of the screen, in the Zip-code variable field, select the zip column in your main modeling database.

Finally, if the geography coefficients should be grouped into a predefined number of levels, it is possible to enable the Quantify geography coefficients option and specify the number of levels.

Click on GENERATE MODELS and choose the number of smoothness steps for your geographic grid search.

The number of steps defines the number of geographic models generated, all with different levels of smoothness for their coefficients. To have a larger choice of geographic smoothness, you can choose 15 different smoothness steps.

Press RUN; the progress bar will appear in the top-right corner of your screen.

  14.2. Visualize the results

The results are presented in the same manner as for the GLM grid search.

The grid search plot shows two colors: blue for the geographic grid search, and orange (with a label pointing at it) for the GLM model being enriched.

  • The horizontal axis represents a measure of spatial autocorrelation (see Moran Index). Higher values, on the right, correspond to “smoother” geographies; low values correspond to granular and noisy models. The value for the original GLM model has no particular meaning: it is placed arbitrarily to the right of the geographic curve for comparison purposes.

  • The vertical axis represents the out-of-sample performance of the model created (estimated by k-fold cross-validation over the modeling dataset).

  • The error bars represent the variations of performance between the folds (and thus the reliability of the performance measured).
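For the curious, Moran's I (the spatial autocorrelation measure on the horizontal axis) can be computed as follows; this is the textbook formula, shown for illustration, not Akur8's exact implementation:

    import numpy as np

    def morans_i(x, w):
        """x: one coefficient per zone; w: spatial weight matrix,
        e.g. w[i, j] = 1 if zones i and j are neighbours, else 0."""
        z = np.asarray(x, dtype=float) - np.mean(x)
        return len(z) / w.sum() * (w * np.outer(z, z)).sum() / (z ** 2).sum()

    # Four zones in a line: neighbouring zones have similar coefficients,
    # so the index is positive (a "smooth" geography).
    w = np.array([[0, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [0, 0, 1, 0]])
    print(morans_i([0.8, 0.9, 1.1, 1.2], w))   # 0.4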

Click on the desired model to inspect the results.

To visualize the geography, click on the VARIABLES tab: in the variables list, the geography variable is highlighted with a green dot.

You can see the value of the coefficients by hovering the mouse over them. The map can be zoomed with the mouse wheel in order to better inspect the results by zone:

We can navigate through different values of the (geographic) smoothness to be able to choose the optimal tradeoff between low and high smoothness.

Once we choose the best tradeoff (in the example below, in the middle) we can tag the model (TAG MODEL).

  15. Introducing External Variables: Add More Variables

When we created the first GLM model (in the section Build Base GLM), we decided to exclude all regional and external variables from those considered for the model creation. This was done in order to decorrelate the pure regional effect (determined only by the position of the zip) from the external variables.

After capturing the geographic signal, we can add external or regional variables, knowing that the geographic effect determined by the policy's location is already included in the model.

Press the ENRICH button on the top right of the screen, and click the ADD MORE VARIABLES option. Give a name to the modeling subproject (for example External variables).

The variable selection page will open, as during the first creation of the GLM. By default, all the variables present in the geo model are selected. This allows their coefficients to be refitted during the next fitting step if necessary. If you want to fully offset the existing coefficients and fit the external variables on top of the offset model, you can unselect all of those variables by opening the SELECT menu and clicking Unselect All.

For the purpose of this tutorial, we will keep the variables selected by default. On top of this, we can now add all the regional variables:

Click on the SEARCH icon at the top of the variables list and type regional_. The list will be filtered to the regional variables. Now click on the SELECT button (at the bottom of the variable list), and Select all.

Click on the GENERATE MODELS button to start the models creation.

The popup that appears differs slightly from the one we saw at the first step of the model creation.

  • Use all variables: forces the inclusion of all the variables that were selected.

  • Use previous smoothness: the level of smoothness corresponds to the threshold between noise and signal in a particular dataset. We chose this threshold during the first modeling phase. If we want to build models with the same level of significance throughout the whole modeling process, we should keep this switched on.

For the purpose of this demo, we will also try to include a smaller number of new variables:

Set the slider from 1 to 10, and set the number of parsimony steps to 8; we will use the previous smoothness (as suggested by default).

When these parameters are set, you can press the RUN button.

After the computations, the grid search will show these results:

We can see that, in this demo database, only a few external variables bring value; all the others are redundant.

Select the model offering the best tradeoff between number of variables and performance, and tag it.

  16. Geo Classification

Usually, regional variables are known at the same level of granularity as the geographic variable (zip). For example, regional_number_hospitals indicates the number of hospitals in a given zip code. We can take advantage of the GAM structure of the models to reduce the number of variables by folding the additional information brought by the regional variables back onto the zip variable, as sketched below.
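The idea relies on the additive structure of the model (on the link scale): every variable whose value depends only on the zip contributes a coefficient that can be summed, zip by zip, into a single zoning coefficient. A minimal sketch with hypothetical coefficient values:

    # Pure geographic effect and two zip-level regional effects (hypothetical values):
    geo_coef = {"00100": 0.10, "20121": -0.05}
    hospitals_coef = {"00100": 0.02, "20121": 0.01}   # regional_number_hospitals, keyed by zip
    rural_coef = {"00100": -0.03, "20121": 0.00}      # regional_rural_area, keyed by zip

    # One zoning coefficient per zip replaces the three original variables:
    zoning = {z: geo_coef[z] + hospitals_coef[z] + rural_coef[z] for z in geo_coef}
    print(zoning)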

Press ENRICH > ADD CLASSIFICATION > GEO CLASSIFICATION.

Give a name to the submodel project (for example Classification).

Select the variable onto which you want to reduce (in this case the variable zip). Click on Create.

An inclusion score will be computed for all the variables in the model. This score indicates how suitable each variable is for reduction onto the zip variable. For example, the variable regional_rural_area has a score of 100%: for a given zip code, regional_rural_area always has the same level in the database.
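A 100% score therefore means the variable is constant within each zip code, which can be verified directly in pandas (a sketch with the tutorial file and column names):

    import pandas as pd

    df = pd.read_csv("tuto_DB_Italy.csv")
    constant_within_zip = df.groupby("zip")["regional_rural_area"].nunique().max() == 1
    print(constant_within_zip)   # True: the variable can be folded into zip losslessly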

Press BUILD CLASSIFICATION > USE CURRENT DATABASE.

Give a name to the model (for example Reduced model).

Click on RUN.

The new grid search with the reduced model appears. If all the reduced variables have an inclusion score of 100%, the Reduced model (in blue) will have the same performance as the previous model (in orange) but fewer variables.

  17. Vehicle Classification

Classification can be performed for Vehicle models as well.

Press ENRICH > ADD CLASSIFICATION > VEHICLE CLASSIFICATION.

Give a name to the submodel project (for example VehicleClassification).

Click on NEXT.

Select the variable onto which you want to reduce (in this case the variable vehicle_model).

Click on NEXT.

Just like before with the Geo Classification, an inclusion score will be computed for all the variables in the model, indicating how suitable each variable is for reduction onto the vehicle_model variable.

Once you have selected the vehicle variables, Akur8 will reduce those variables onto the variable vehicle_model:

Press BUILD CLASSIFICATION > USE CURRENT DATABASE.

Give a name to the model.

Click on RUN.

The new grid search with the new reduced model appears. If all the reduced variables have an inclusion score of 100%, the Reduced model will have the same performance as the previous model but fewer variables.

  18. Edit Coefficient

The Akur8 engine’s automatic grouping is based on the significance of the signal found in the data. This is a robust, data-oriented best practice. However, it may not be desirable for certain segments of variables that have low exposure but a known effect.

This is for instance the case of drivers with more than 1 accident last year, as shown in the nb_claims_last_year variable. The interpretable nature of Akur8 models allows you to manually adjust the impact of coefficients via the EDIT COEFFICIENTS button.

Click on EDIT COEFFICIENTS.

The Edit coefficients window will appear.

Using your mouse, draw a selection box around the coefficients that you want to modify.


We can move the coefficients using the graphical interface, or set the values manually by clicking on the Values tab. For the purpose of this demo, we will use the “magic wand” button, which matches Observed and Fitted (the last button in the Align block).

We can also modify other variables if needed by selecting them on the left side of the screen.

Once we have done all the modifications, click APPLY.

Once the menu opens, select APPLY MODIFICATIONS.

Three options are available:

  • APPLY MODIFICATIONS: only the changes that we have made to certain coefficients are considered and we re-evaluate the model.

  • APPLY MODIFICATIONS & REFIT INTERCEPT: the manual changes are applied and the intercept is refitted.

  • APPLY MODIFICATIONS & REFIT MODEL: the impact of manually changed coefficients on other coefficients fits is also considered. The values of other coefficients that we have not modified manually may also be changed to refit the model.

WARNING: the goal of the Edit coefficients screen is to manually adjust segments with very low exposure but a behaviour known to the expert user. It is by no means meant to regroup noisy segments that may appear in the models. In a data-driven approach, the best practice is to first switch to a smoother model until the noisy behaviours stop appearing (using the smoothness buttons at the top right of the screen).

For the purpose of this tutorial, we selected APPLY MODIFICATIONS: as the exposure of the modified coefficients is extremely low, the change will have no impact on other variables. After the computations are done, the changes will be applied and will be highlighted throughout the modeling process, allowing complete governance and traceability.

This model will be the model that we choose as the final result of our modeling exercise. In the next step you will export its documentation.

  19. Exporting your models & documentation

You may want to export the models created or the associated documentation. First, the model needs to have a name (this can be done by clicking on the TAG MODEL button at the top right of the screen).

It is possible to export a model in different formats:

Click on the EXPORT button and select the desired export format.

Many formats are available:

  • Predictions: a CSV file containing all the predictions made by the model, together with some extra columns that allow further investigation of the results in external software.

  • Legacy format: a CSV file output compatible with other insurance modeling software.

  • Json: a JSON output allowing faster parsing, useful to plug the model into in-house scripts.

  • POJO: a Java file with the model details, easily embeddable in any Java environment.

  • PMML: an XML file containing the specification of the model. This universal file format is compatible with many pricing engines and external insurance software.

  • Documentation: fully automated documentation, in a PowerPoint file.

The PowerPoint documentation contains:

  • information about the data, with statistics on the target variable;

  • information about the metrics of the model, now also assessed on the validation dataset, which was never displayed during the modeling within Akur8 (hence this is a true validation score of the model);

  • all the documentation notes that were written throughout the data preparation or modeling, all in one centralized place.

  20. Advanced Usage

We have seen an in-depth methodology to build state-of-the-art models using the Akur8 solution.

On top of this, Akur8 provides some additional functionalities that can make your modeling even more productive and rigorous. In this section we will focus on some practical use cases:

→ Data section:

  • Reuse Format: link datasets to allow for faster parameter definition.

  • Modify Levels: apply grouping, reordering and capping to a variable.

→ Modeling section:

  • Import Model: Import existing models and perform updates.

  • Time Consistency: Visualise time consistency of models.

  • Model Comparison: Compare different models.

  20.1. Reuse Format

As we have seen in the Data section, after uploading a database it is necessary to look at the different variables in the dataset and determine whether they should be treated as Categorical or Ordinal. This is a necessary modeling step that will greatly improve performance if done correctly.

If we upload a new version of the same database, or a filtered copy, we would have to perform the same repetitive work twice. According to the Akur8 philosophy, this task should be automated: the software allows you to define relationships between the columns of different datasets, which reduces the number of manual iterations required to obtain the best setup for modeling.

To link an updated version of the database with a pre-existing one, select the new database, called Tuto_DB_updated in the screen below:

Select Tuto_DB_updated, and click the REUSE FORMAT button at the top right of the screen.

In the Reuse Format window, select the (old) Tuto_DB database as reference database.

A list of the columns in the updated Database will appear, together with the matched columns of the reference database, if an equivalent column has been found. A score assessing the quality of this match is also displayed.

As we can see, it is possible to rapidly spot the differences between the databases: a new column was introduced (New External Variable, in the first row), and the Matching score of the variable year is highlighted in orange, meaning that a new level (a new year) was introduced. It is possible to manually change the mapping of a column by clicking on its name, if the column name was changed but the relationship still holds true.

If all the new variables are correctly mapped to their older version, you can click the APPLY button.

All the Categorical / Ordinal properties are copied from the reference version of the database to the updated one, along with the comments and the data preparation work done on the reference database.

In the modeling section, it will also be possible to apply, between mapped databases, the same variable selection as was applied for the initial reference database.

  20.2. Reuse Variables definition

When creating a model, it is possible to reuse the work done in another project in the Define Variable tab.

In the Define Variables section, click the SELECT button at the bottom of the variables list and choose the option Use selection from project.

A popup will appear, allowing you to choose the selected variables from other modeling projects.

To reuse the variables selected in another project (based on a database mapped with the one currently used), just select the reference modeling project and click the SET VARIABLES button.

  20.3. Modify Levels

Akur8 allows data processing inside the solution in order to avoid expensive iterations between the data processing software (for example, SAS) and Akur8. Akur8 by itself is not a fully featured data warehouse, meaning, for example, that it is not possible to merge different data sources.

Akur8 offers modification of column levels via grouping and reordering, and missing values management.

Consider for example the granular variable Make. It may be relevant to group such a variable by country. In the Data section, head to the View Variables page and select the vehicle_make variable.

To modify a variable, you can click on the MODIFY VARIABLE button, and press the GROUPING AND REORDERING button to open the variable modification window.

The Modify levels screen will appear, showing the distribution of the variable together with all of its levels in the right panel. On the left is the full list of variables. Since the statistics are recomputed after each modification, it is strongly advised to make all the modifications to the variables before applying them.

Grouping is done by simply selecting the levels to group. For instance:

  • Select all the German auto makers.

  • Click on the GROUP button at the bottom on the list

  • Enter a new name for the group that you are creating (German)

  • Press SAVE to create the new group.

The changes can be seen in real time on the graph.

Hovering over the Group symbol next to the new group’s name will also allow you to see all the old levels contained in it.

If you group all the makes by country, the updated variable Make will look like the picture below:

It is also possible to give a different order to the levels (this is only meaningful for ordinal variables) by simply dragging the level names to the appropriate positions:

For instance, the studies_level variable needs to be reordered: by default it follows alphabetical order, while ordering by the duration of the studies would be much more logical.

  • Select the studies_level variable

  • Drag and drop the different levels to order them correctly.

Finally, this window allows the user to manage missing values.

For instance, we can manage the missing values in the nb_children variable:

  • Select the nb_children variable in the list on the left of the window.

  • Select the unknown level in the list on the right of the window (it becomes green).

  • Click on the NA button at the bottom of the levels list: the value becomes red in the graph.

The missing values will be treated separately from the other levels in the machine-learning algorithm: when smoothing ordered levels, the missing values will be ignored (not smoothed) and:

  • if they contain no significant signal, their coefficient and their prediction will be set to the average value.

  • if they contain significant signals, their coefficients will be changed but they won’t impact the rest of the distribution.

  20.4. Import Model

In the context of a tariff update, it is often key to gradually update the tariff from a pre-existing one already in production. Throughout the tutorial we showed a modeling exercise done completely from scratch; however, Akur8 also allows you to optimally update a pre-existing model. Importing an existing model is easy: right after the modeling project creation, on the Set Goals screen, there is an IMPORT MODEL button at the top right of the screen.

  • Click on the IMPORT MODEL button.

  • Give a name to the import model and choose the model you want to import.

  • Press the IMPORT MODEL button.

Imports via the Legacy format or Json files are supported. In some EU countries, it may be important to define a Custom Format for the separator of the CSV files.

It is necessary to check the mapping of the variable names in the model against those contained in the dataset. All the variables which do not show a matching score of 100% (in the example below, the variable year) must be inspected. When inspecting the variable year, we can see that the year 2020 is highlighted in red, meaning that 2020 is present in the Database but not specified in the model, which is logical since we are importing an old tariff that predates this year. By default Akur8 will try to optimally infer such a value (in this case, it is grouped with 2019). The actuary, however, has full control over the coefficient of this new level through the edit coefficients button displayed below the variable’s distribution.

Once all variables are correctly mapped, click on APPLY.

The user can then access the imported model, ImportedAutoGlmKModel, and proceed with the standard modeling process to update it.

  • Press ENRICH > ADD MORE VARIABLES.

  • Leave the variables of the model selected and select new variables from the database that should be considered in the new model.

  • Press GENERATE MODELS, and select the parsimony steps and smoothness steps as desired.

  20.5. Force a variable behavior

Once we have defined the scope of the modeling project, we can apply some custom actions for a particular variable.

  • In the Define Variables section of the model, select a variable from the list.

  • Below the graph, there are two options: OFFSET and CONSTRAINTS.

We can force a behaviour on this variable:

  • Offset: below the graph, you can click on + Offset. This allows you to set offset coefficients, either via the user interface or by modifying the values in a spreadsheet-like interface.

The offset can be defined graphically, by selecting coefficients and pressing the edit buttons to the right of the graph, or numerically, by entering exact values in the VALUES table.

  • Constraints: the flexible nature of Akur8 allows specific constraints to be introduced for some variables. For example, it is possible to force an increasing or decreasing trend. Note that, if you do not have explicit regulatory concerns, this option should not be activated: if a trend naturally arises from the data, Akur8 will be able to detect and fit it.

  20.6. Time Consistency

In the visualisation of variables of a model, change the Graph type (below the graph of the variable) to Time Consistency.

Guaranteeing the time consistency of a model is key for correct modeling. Akur8 allows intuitive and clear detection of significant biases in the modeling across years.

Checking the time consistency of a variable is done via the interaction between the variable and, for example, the year. If this interaction is significant, the model is not consistent across years.

The Akur8 solution fits interactions up to a significance level determined by the smoothness of the considered model. When visualizing each variable, a time consistency button is displayed below its graph. The time consistency graph shows the interaction coefficients between time and the variable, fitted on top of the pre-existing model. This makes it possible to immediately spot time biases while avoiding noisy graphs (which are very hard to read on granular variables).

In the graph above, we can see that the vehicle ranking shows some instabilities, particularly in the 16-26 segment, where the risk in 2018 is significantly higher than the one recorded in 2019. Whether to accept this time instability in the model or to remove this variable from the modeling context is a choice left to the modeller.

In this figure, we can see that the claim_history variable is stable across time, as no significant interaction was detected.
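A rough external version of this check, for intuition (our own diagnostic on predictions exported from the model, not the Akur8 implementation): compare observed/fitted ratios per level and per year; ratios drifting across years flag a time-inconsistent level.

    import pandas as pd

    # Hypothetical export: one row per segment with the target, the model
    # prediction ("fitted") and the variable under inspection.
    df = pd.DataFrame({
        "year":       [2018, 2018, 2019, 2019],
        "driver_age": ["18-25", "26+", "18-25", "26+"],
        "target":     [0.30, 0.10, 0.18, 0.11],
        "fitted":     [0.24, 0.10, 0.24, 0.10],
    })

    agg = df.groupby(["driver_age", "year"])[["target", "fitted"]].sum()
    ratio = (agg["target"] / agg["fitted"]).unstack("year")
    print(ratio)   # the 18-25 ratio drops from 1.25 to 0.75: a time instability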

  20.7. Model Comparison

In the same modeling project, many different models can be tagged and inspected: for example, to see the differences between a simple model with few variables and a complex one. It is then key to be able to quickly inspect the differences between them. Akur8 enables fast and intuitive comparison thanks to its model comparer.

  • Inspect one of the models and select the COMPARE button at the top of the page.

  • Select from the list of tagged models the model you want to compare with.

  • Press the COMPARE MODELS button.

You will be able to compare all the graphs available in the tool: spread, Lorenz curve, lift curve and variables.

TECHNICAL APPENDIX

VARIABLES SPREAD

The spread of a variable is computed from its coefficients:

  • The total bar length (Spread 100/0%) corresponds to the relative difference between the maximum and minimum coefficients: Spread = Max(Coefficients) / Min(Coefficients) − 1

If a variable is not selected in the model, all its coefficients are equal and the spread equals 0.

  • The dark-green bar length (Spread 95/5%) corresponds to the same measure, after removing the coefficients corresponding to the highest and lowest 5% of the dataset. It represents a more robust vision of the variable’s impact, excluding outliers.

For instance, in the graph below, the total (100/0) spread of the driver_age variable is 150%, meaning that the riskiest age segment is about two and a half times as risky as the least risky one. However, the robust (95/5) spread is 45%. This means that a small segment drives the risk of the variable (in this case, young people).

This can be computed from the coefficients of the variable itself:

  • The highest coefficient is +142.5%

  • The lowest coefficient is -4%

→ Spread 100/0 = (1 + 142.5%) / (1 − 4%) − 1 = 2.425 / 0.96 − 1 ≈ 150%

  • After removing the coefficients corresponding to the 5% of profiles with the highest risk, the highest remaining coefficient is +40.9%.

  • After removing the coefficients corresponding to the 5% of profiles with the lowest risk, the lowest remaining coefficient is -4% (it doesn’t change, as a very significant share of the population sits at this risk level).

→ Spread 95/5 = 1.409 / 0.96 − 1 ≈ 45%
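The same computation can be scripted; a sketch assuming multiplicative coefficients (relativities) and an exposure weight per level (values hypothetical, chosen to reproduce the figures above up to rounding):

    import numpy as np

    def spread(coefs, weights, trim=0.0):
        """trim = 0.05 removes the levels entirely inside the top/bottom 5% of exposure."""
        order = np.argsort(coefs)
        c = np.asarray(coefs, dtype=float)[order]
        w = np.asarray(weights, dtype=float)[order]
        upper = np.cumsum(w) / w.sum()    # cumulative exposure up to each level
        lower = upper - w / w.sum()
        keep = (upper > trim) & (lower < 1 - trim)
        return c[keep].max() / c[keep].min() - 1

    coefs = [0.96, 1.00, 1.409, 2.425]    # relativities: -4%, 0%, +40.9%, +142.5%
    weights = [0.40, 0.45, 0.11, 0.04]    # exposure share of each level
    print(f"Spread 100/0: {spread(coefs, weights):.0%}")         # ~153%
    print(f"Spread 95/5:  {spread(coefs, weights, 0.05):.0%}")   # ~47%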

READING VARIABLES GRAPHS

The fitted average and the observed average should, of course, follow each other. However, if the exposure is low for some levels, the average observed values might vary in non-significant ways. In this case, the coefficients will be grouped together and the model predictions (which are robust) will not follow the observed values.

The coefficients represent the impact of the selected variable, while the average predicted value represents the total impact of all the variables present in the model.

LIFT & LORENZ CURVES

The global models metrics can be visualized in the Model Overview screen.

  1. On the right of the model overview screen, the Lorenz curve is displayed. The Lorenz curve is drawn by ordering all the observations from the highest estimated frequency to the lowest (on the x axis) and computing, for each one, the cumulated number of observed claims (on the y axis). Note that this x axis is ordered in reverse with respect to the usual definition of the Gini index used in economics.

  2. A random prediction of risk would follow the diagonal, while perfect predictions would lead to a perfect ordering of the claims (the Lorenz curve would then rise very sharply, reach 100%, and stay there).

  3. You can notice that 5 different curves are drawn, corresponding to the performance of each one of the 4 different folds, plus their average.

  4. The Gini coefficient represents twice the area between the average curve and the diagonal. A larger value indicates better predictions, while a value close to 0 indicates that the predictions are comparable to random ones.

  5. In addition to the Lorenz curve, you can display the lift curve by clicking on the Lift Curve tab.

The lift curve is built by ordering the observations of the validation set from the lowest prediction to the highest, and aggregating them into bins each containing 5% of the Database. For each bin, the average predicted and observed values are represented.

This curve indicates the discriminatory power of the model (difference between the extreme predicted and observed risk).

If they follow each other closely, the model fits the data well, while a more chaotic curve means the predictions are not tracking the data well.
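Both curves follow standard constructions; here is an illustrative computation on simulated data (not Akur8's exact code):

    import numpy as np

    rng = np.random.default_rng(0)
    pred = rng.gamma(2.0, 0.05, size=10_000)   # predicted claim frequencies
    obs = rng.poisson(pred)                    # simulated observed claim counts

    # Lorenz curve: sort from highest to lowest prediction, cumulate observed claims.
    order = np.argsort(-pred)
    x = np.arange(1, len(obs) + 1) / len(obs)
    lorenz = np.cumsum(obs[order]) / obs.sum()

    # Gini: twice the area between the Lorenz curve and the diagonal.
    gini = 2 * np.mean(lorenz - x)
    print(f"Gini: {gini:.3f}")

    # Lift curve: sort from lowest to highest prediction, 20 bins of 5% each,
    # then compare the average predicted vs observed value per bin.
    bins = np.array_split(np.argsort(pred), 20)
    lift = [(pred[b].mean(), obs[b].mean()) for b in bins]
    print(lift[0], lift[-1])                   # first and last bins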
