Overview
This section explains how to include Geography correctly in the modeling process with Akur8.
Unlike other solutions, Akur8 allows one to compute coefficients for every zip code present in the geo database. This computation is done in the same pricing solution and the result can be inspected in a zoomable map.
This documentation analyzes the impact of the smoothing factor in the geographical context and shows the importance of building the geo coefficient before introducing regional variables.
Setting up a geo enrichment
Geo-data format
All we need to create a Geography project in Akur8 is a 'LOCATION DATA' database. The 'LOCATION DATA' can be the same as the 'DATA SOURCE'. This database must contain the following information:
zip codes (or other location identifier)
GPS Coordinates (Latitude, Longitude) or projections
The zip code will serve as a link between the 'LOCATION DATA' database and the 'DATA SOURCE' database used to build the GLM model.
In order to comply with the requirements of our tool, a proper 'LOCATION DATA' database will contain:
A numerical format type for the coordinates
GPS or projected coordinates
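The format requirements above can be checked before the file is uploaded. The following is a minimal sketch and not part of Akur8: the column names zip_code, latitude and longitude and the helper find_invalid_rows are hypothetical, and the range check assumes Latitude/Longitude coordinates rather than projections.

```python
import math

def find_invalid_rows(rows):
    """Return (zip_code, reason) pairs for rows whose coordinates are unusable.

    Assumption: rows are dicts with the hypothetical keys 'zip_code',
    'latitude' and 'longitude', holding GPS (not projected) coordinates.
    """
    problems = []
    for row in rows:
        try:
            lat = float(row["latitude"])
            lon = float(row["longitude"])
        except (KeyError, TypeError, ValueError):
            problems.append((row.get("zip_code"), "missing or non-numeric coordinate"))
            continue
        if math.isnan(lat) or math.isnan(lon):
            problems.append((row.get("zip_code"), "coordinate is NaN"))
        elif not (-90 <= lat <= 90 and -180 <= lon <= 180):
            problems.append((row.get("zip_code"), "coordinate out of GPS range"))
    return problems

# Made-up sample rows, for illustration only.
location_data = [
    {"zip_code": "75001", "latitude": "48.8626", "longitude": "2.3363"},
    {"zip_code": "69001", "latitude": "n/a", "longitude": "4.8357"},
    {"zip_code": "13001", "latitude": "143.2965", "longitude": "5.3810"},
]
invalid = find_invalid_rows(location_data)
for zip_code, reason in invalid:
    print(zip_code, reason)
```

Rows flagged here would need to be corrected before upload, since zip codes with invalid coordinates cannot be placed on the map.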
Akur8 supports two types of coordinate systems: GPS coordinates (Latitude and Longitude) and projected coordinates. Latitude and Longitude is the recommended format.
With this choice, the coefficient values for each location in 'LOCATION DATA' are displayed on interactive geographical maps, which gives a better user experience and makes it possible to explore the map.
Projected coordinates can be used instead by selecting the 'projected' option. In that case the result is a scatter plot that contains only the projected points.
Missing zip codes
Missing zip codes refer to levels of the geographic variable that either:
are present in the 'DATA SOURCE' but not in the 'LOCATION DATA', or
are present in the 'LOCATION DATA' but have invalid coordinates.
Without toggling the “Allow missing zip codes” button, all the zip codes in the 'DATA SOURCE' must be matched to a zip code in the 'LOCATION DATA' to be able to launch a grid search.
When the “Allow missing zip codes” button is toggled, missing zip codes are treated as independent categories, which allows generating models even if not all the zip codes are matched. This is allowed as long as the number of missing zip codes does not exceed 10% of the total number of zip codes in the ‘DATA SOURCE’.
A coefficient for each of these zip codes will be fitted to the data, without using geographic information. These coefficients cannot be accessed in the platform, but are included in the file when exporting the model.
The information on the percentage of unassigned geographical risk in the 'DATA SOURCE' is available after the Grid Search on the Statistics tab of any model.
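The matching rule described above can be approximated outside the platform before launching a grid search. This is a minimal sketch under stated assumptions: the helper missing_zip_codes and the sample data are hypothetical, and the share is computed over distinct zip codes, which is one reading of the 10% rule (the platform may count differently).

```python
MAX_MISSING_SHARE = 0.10  # limit stated in the text above

def missing_zip_codes(source_zips, valid_location_zips):
    """Return the missing zip codes and their share among distinct source zip codes.

    valid_location_zips should hold only the 'LOCATION DATA' zip codes that
    have valid coordinates, so both kinds of missing zip codes are covered.
    """
    source_levels = set(source_zips)
    if not source_levels:
        raise ValueError("the data source contains no zip codes")
    missing = sorted(source_levels - set(valid_location_zips))
    share = len(missing) / len(source_levels)
    return missing, share

# Made-up sample data: 11 distinct zip codes in the source, one unmatched.
source_zips = ["75001", "75001", "69001", "13001", "33000", "31000",
               "44000", "59000", "67000", "06000", "34000", "99999"]
valid_location_zips = ["75001", "69001", "13001", "33000", "31000",
                       "44000", "59000", "67000", "06000", "34000", "35000"]

missing, share = missing_zip_codes(source_zips, valid_location_zips)
within_limit = share <= MAX_MISSING_SHARE
print(missing, round(share, 4), within_limit)
```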
The smoothness parameter
The geographic signal is treated as continuous, based on the actual relative positions of the zip codes. Hence, we can expect this variable to change smoothly between nearby points. When modeling the geography coefficients, two unknown factors need to be taken into account:
the reliability of the observations in the 'DATA SOURCE' database,
the rate at which close or far locations can influence each other.
The smoothness parameter of the geography reflects precisely the trade-off between these two factors. Different values of the 'smoothness' hyperparameter produce models whose geographic variable is more or less sensitive to the signal observed in the database.
The graphs below are ordered by decreasing level of smoothness. A high level of smoothness produces coefficients with a smaller overall spread that vary little over small distances.
Conversely, a low level of smoothness produces coefficients with a larger overall spread that can change greatly over small distances.
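The effect of smoothness on the spread of the coefficients can be illustrated with a toy example. The sketch below is not Akur8's algorithm: it uses a simple Gaussian kernel average along a line, with a hypothetical bandwidth argument standing in for the smoothness level, and made-up observations.

```python
import math

def kernel_smooth(positions, values, bandwidth):
    """Gaussian kernel-weighted average of the values at each position.

    A larger bandwidth lets distant points influence each other more,
    which plays the role of a higher smoothness in this toy setting.
    """
    smoothed = []
    for p in positions:
        weights = [math.exp(-((p - q) ** 2) / (2 * bandwidth ** 2)) for q in positions]
        total = sum(weights)
        smoothed.append(sum(w * v for w, v in zip(weights, values)) / total)
    return smoothed

def spread(values):
    """Difference between the largest and smallest value."""
    return max(values) - min(values)

# Five locations on a line with noisy observed signals (made-up numbers).
positions = [0.0, 1.0, 2.0, 3.0, 4.0]
observed = [0.0, 1.0, -0.5, 0.8, -0.2]

high_smoothness = kernel_smooth(positions, observed, bandwidth=5.0)
low_smoothness = kernel_smooth(positions, observed, bandwidth=0.1)

print(spread(high_smoothness), spread(low_smoothness))
```

With the large bandwidth every location is pulled toward the overall average, so the spread shrinks. With the small bandwidth each location keeps essentially its own observed value.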
Quantization
By default, the geography coefficients are computed so that each zip code is assigned its own coefficient.
However, it is possible to group the coefficients automatically, which is suitable for a zonal approach, by toggling on the QUANTIZATION button.
This results in a geographic variable with only the predefined number of distinct coefficients (3 in the example image).
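The idea of reducing per-zip coefficients to a fixed number of levels can be sketched as follows. The grouping method used by Akur8 is not described here, so this example assumes a simple equal-frequency grouping in which each zip code receives the mean of its group. The function quantize and the sample coefficients are hypothetical.

```python
def quantize(coefficients, n_groups):
    """Group zip codes by sorted coefficient and assign each the group mean.

    Assumption: equal-frequency groups; the platform may group differently.
    """
    ordered = sorted(coefficients, key=coefficients.get)
    n = len(ordered)
    result = {}
    for g in range(n_groups):
        members = ordered[g * n // n_groups:(g + 1) * n // n_groups]
        if not members:
            continue  # more groups requested than zip codes available
        group_mean = sum(coefficients[z] for z in members) / len(members)
        for z in members:
            result[z] = group_mean
    return result

# Made-up per-zip coefficients.
coefficients = {"A": -0.30, "B": -0.25, "C": -0.02, "D": 0.01, "E": 0.22, "F": 0.31}
quantized = quantize(coefficients, n_groups=3)
print(sorted(set(quantized.values())))
```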
Geography partitioning
As explained above, Akur8’s geographic models follow the principle that nearby areas should have a similar risk level.
The Partition geography feature allows the user to break this behaviour by declaring sub-regions (subsets of zip-codes); zip-codes belonging to different sub-regions are not required to have similar risk levels, no matter how close they are. In other words, the risk observations in a sub-region do not influence the risk estimation of the other sub-regions. In Akur8's standard geographic modeling, distant regions can influence each other (for example, with geographic databases that include remote islands). This feature can be used to ensure that no such long-distance influence impacts the fit. Another use case is when regulations demand that risks from a region do not drive up the prices of neighboring regions. Compartmentalizing the modeling with this feature ensures compliance with such regulations.
To use this feature, the LOCATION DATA must contain an additional column, the “partition geography variable”, which assigns each zip-code to a sub-region. This partition variable must have fewer than 256 levels, and each sub-region it creates must contain at least 4 zip-codes. In the geographic model preparation, toggle the “Geography partition” button and enter the name of the partition variable in the field.
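The two constraints on the partition variable can be checked on the location file beforehand. This is a minimal sketch: the helper check_partition, the mapping layout and the sample values are hypothetical, while the limits (fewer than 256 levels, at least 4 zip-codes per sub-region) come from the text above.

```python
from collections import Counter

MAX_LEVELS_EXCLUSIVE = 256      # the partition variable must have fewer levels than this
MIN_ZIPS_PER_SUBREGION = 4      # each sub-region needs at least this many zip-codes

def check_partition(zip_to_partition):
    """Return a list of human-readable violations (empty if the partition is valid)."""
    sizes = Counter(zip_to_partition.values())
    errors = []
    if len(sizes) >= MAX_LEVELS_EXCLUSIVE:
        errors.append(f"{len(sizes)} levels found, must be fewer than {MAX_LEVELS_EXCLUSIVE}")
    for level, count in sorted(sizes.items()):
        if count < MIN_ZIPS_PER_SUBREGION:
            errors.append(f"sub-region {level!r} has only {count} zip-codes")
    return errors

# Made-up mapping from zip-code to sub-region.
zip_to_partition = {
    "10001": "mainland", "10002": "mainland", "10003": "mainland",
    "10004": "mainland", "10005": "mainland",
    "90001": "island", "90002": "island",
}
errors = check_partition(zip_to_partition)
print(errors)
```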
This will result in a geographic variable that shows clear differences between the various sub-regions.
By default, the average of the coefficients for each sub-region will be equal to 0. If “Fit partition base level” is toggled, an intercept will be added for each subset. In the following picture, an intercept was fitted for the model on the left, while the model on the right has an intercept equal to zero.
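The relationship between per-sub-region base levels and zero-average coefficients can be illustrated with simple arithmetic. This sketch shows only that decomposition, not how the model is fitted. The helper split_base_level and the sample values are hypothetical, and an unweighted average is assumed (the platform may weight the average differently).

```python
from collections import defaultdict

def split_base_level(coefficients, zip_to_partition):
    """Split coefficients into a per-sub-region base level and a centered remainder."""
    groups = defaultdict(list)
    for zip_code, coef in coefficients.items():
        groups[zip_to_partition[zip_code]].append(coef)
    base = {part: sum(vals) / len(vals) for part, vals in groups.items()}
    centered = {z: c - base[zip_to_partition[z]] for z, c in coefficients.items()}
    return base, centered

# Made-up coefficients and sub-regions.
coefficients = {"a": 0.10, "b": 0.30, "c": -0.40, "d": -0.20}
zip_to_partition = {"a": "north", "b": "north", "c": "south", "d": "south"}

base, centered = split_base_level(coefficients, zip_to_partition)
print(base)
print(centered)
```

Within each sub-region the centered values average to zero, and adding the base level back recovers the original coefficients.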
When looking at the coefficients map, it is possible to use the “Filter by partitions” dropdown list to select which sub-regions should be displayed.
If a geographic variable was fitted using Partition geography and a geographic classification model is built on it, the filtering is still available.
Geo grid search
As in the previous model generation computations, the result is displayed in a grid search to assess the trade-off between performance and complexity. As before, the performance is shown in terms of the GINI but can be changed to a number of other metrics. The complexity is displayed in terms of the Moran's Index, which is explained in the next subsection.
Moran distance
The Moran distance corresponds to the radius of a circular region containing points for which the correlation score between relative distances and coefficients is greater than 50%.
Intuitively, for an observer placed in a given zip code, the Moran distance gives the average distance one has to travel to observe a significant change in the risk coefficient associated with the geographical variable.
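The exact formula behind this score is not given here. As a point of reference, the sketch below computes the standard global Moran's I statistic with binary distance-band weights, where two points are neighbors when they lie within a given radius. How Akur8 derives the Moran distance from such a score is not specified in this document. The function morans_i and the sample data are hypothetical.

```python
import math

def morans_i(points, values, radius):
    """Global Moran's I with weight 1 for distinct points within `radius`, else 0."""
    n = len(values)
    mean = sum(values) / n
    deviations = [v - mean for v in values]
    denominator = sum(d * d for d in deviations)
    numerator = 0.0
    weight_total = 0
    for i in range(n):
        for j in range(n):
            if i != j and math.dist(points[i], points[j]) <= radius:
                numerator += deviations[i] * deviations[j]
                weight_total += 1
    if weight_total == 0 or denominator == 0:
        raise ValueError("Moran's I is undefined: no neighbors or constant values")
    return (n / weight_total) * (numerator / denominator)

# Six locations on a line, one unit apart (made-up data).
points = [(float(x), 0.0) for x in range(6)]
smooth_coefficients = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]         # changes gradually
alternating_coefficients = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]  # flips at every step

i_smooth = morans_i(points, smooth_coefficients, radius=1.0)
i_alternating = morans_i(points, alternating_coefficients, radius=1.0)
print(i_smooth, i_alternating)
```

The gradually changing coefficients give a positive score (about 0.6 here), while the alternating ones give a score of about -1, which matches the intuition that a smooth geographic variable has neighbors with similar values.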












