The use and valorization of enriched geographical data to enhance predictive modeling and segmentation
Nowadays, the single word “data” has spread in all its variations across every segment of industry. To carve out a path in an increasingly competitive environment, every company needs to act rather than react, which is a condition not only for growth but for survival. The advent of new techniques in data science opens exciting opportunities in the world of IT and has given rise to fresh buzzwords such as Big Data, Machine Learning and Artificial Intelligence.
Managers are now aware of the many possibilities that data, in its broadest sense, can bring to their companies. However, the variety and volume of information they possess are so large that simply finding a starting point is a major difficulty.
In other words, they face two main questions: what can I do with my data, and what answers can I get from it? As data scientists, we need to address both the questions and the answers related to them, and this is only the tip of the iceberg.
In practice, a retail company has access to all historical sales and transaction data recorded across its locations throughout the country. The business goal is obvious: market growth and increased profits. From that goal, key performance indicators can be built and matched with the customer database. This is where predictive analytics comes into play: forecasting future sales and transactions from its own data helps the company understand its business, act instead of react and, even better, anticipate.
Big Data Blending
The business data, the starting line of the race
Building a model with our own data is only the tip of the iceberg. It is like having a beautiful car without knowing anything about what is under the hood. If the engine is bad, you will not go far despite the beauty of your machine… This brings two aspects to the table: the data and the modeling techniques.
First, data quality is a key point in predictive analytics, simply summarized as “garbage in, garbage out”. Are my data sources reliable? Do I have missing data? How should I handle it? These are all important questions that need to be addressed. Second, it is important to validate the assumptions that the selected algorithms depend on (data normality if applicable, data distribution and type). Even good data can lead to false results and interpretations if this second aspect is neglected.
The last thing to consider is the choice of model predictors. Beyond being statistically significant, every predictor must be available, or reliably estimable, for the future period being forecast; this is a modeling prerequisite. Back to our retail company example, weather conditions and hazards (rain, thunderstorms, heavy snowfall, tornadoes) and unpredictable external events (road works, strikes, outages) have a significant impact on sales. However, these parameters cannot be included in a forecasting model because their future occurrences cannot be known a priori. Altogether, the modeling part still represents a small proportion of the whole process.
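To make these two checks concrete, here is a minimal sketch in plain Python. The `sales` values, the use of `None` for missing entries and the helper names are all assumptions made for illustration; a real project would typically rely on a statistics library and a formal normality test rather than a bare skewness figure.

```python
# Hypothetical daily sales records; None marks a missing value (assumption).
sales = [120.0, 135.5, None, 128.0, 940.0, 131.2, None, 125.4]

def missing_share(values):
    """Fraction of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)

def skewness(values):
    """Population skewness of the non-missing values (0 for symmetric data)."""
    xs = [v for v in values if v is not None]
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in xs) / n  # third central moment
    return m3 / m2 ** 1.5

print(f"missing share: {missing_share(sales):.2f}")
print(f"skewness: {skewness(sales):.2f}")
```

A large share of missing values calls for a decision on how to handle them, and a strongly skewed variable is a hint that an algorithm assuming normality may need a transformation or a different method.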
The added value of geographical data enrichment
Once an initial model is built, the first step is done. Now let us tune the engine of our vehicle and put the best parts together to get the best fit and maximum performance from it. This means enriching the existing elements with additional information, such as geographical data, that can raise the level of business understanding and make predictive analytics more efficient.
By geographical data, we mean data with a geographical component that can be X/Y coordinates, a spatial type (point, line, polygon) or a link to a geographical entity (postal code, region, province). Sources range from well-known government census data (StatsCan) to geographic data vendors producing traffic data (HERE Traffic Analytics), POIs and other business drivers (HERE, TomTom, Pitney Bowes, Environics).
In parallel, the advent of social data also opens new horizons in this area. The wide collection of datasets brings a considerable number of variables, ranging from the general (average income, level of education, housing type) to the very specific (monthly median amount spent on junk food).
Looking for additional enrichment datasets? Korem experts can help you select the best dataset from our large data portfolio.
Putting a face to your customers
In practice, our retail company can blend sales and transactions with its customers, identified by name and address. The next level is to answer three questions: who are my customers, where are they, and how can I reach them (see picture below)? The answers will open new perspectives on the strategies the company has in place and help it target its marketing efforts, moving from predictive to prescriptive analytics. This is only possible by enriching the existing business data, blending geographical and non-geographical data using spatial tools (decay analysis, spatial matching or processing, clustering, etc.).
Concretely, this procedure can involve several operations, starting with geocoding the addresses to get X/Y coordinates that can then be blended with drive-time polygons defined around each retail facility. This allows matching the address points (your customers) with both the drive-time polygons that make up the store trade area and the socio-demographic variables that profile your customer database. By comparing with the rest of the zone of interest, it becomes possible to build a market analysis, pinpoint underperforming areas and find ways to address them.
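The matching step described above can be sketched as a simple point-in-polygon test. This is only an illustration under stated assumptions: the trade areas are hypothetical rectangles in planar coordinates, the customer points are assumed to be already geocoded, and all names are invented for the example. Real drive-time polygons would come from a routing or geospatial tool.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon.

    polygon is a list of (x, y) vertices; the last vertex connects to the first.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# Hypothetical trade areas (simplified to rectangles) and geocoded customers.
trade_areas = {
    "store_A": [(0, 0), (4, 0), (4, 3), (0, 3)],
    "store_B": [(5, 0), (9, 0), (9, 3), (5, 3)],
}
customers = {"c1": (1.0, 1.0), "c2": (6.5, 2.0), "c3": (10.0, 10.0)}

def assign_trade_area(customers, trade_areas):
    """Map each customer to the first trade area containing it, or None."""
    result = {}
    for cid, (x, y) in customers.items():
        result[cid] = next(
            (name for name, poly in trade_areas.items()
             if point_in_polygon(x, y, poly)),
            None,
        )
    return result

assignment = assign_trade_area(customers, trade_areas)
print(assignment)
```

Customers that fall outside every trade area are mapped to `None`, which is one simple way to start comparing a store's trade area with the rest of the zone of interest.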
In the same example, suppose the retail company has a risk management department, for which working at the dissemination area level is not precise enough. Companies working in that sector would like to better segment their risks by considering the geographical component of all their business points. Clustering techniques such as k-means make it possible to draw suitable distinctions, improving the efficiency of risk determination as well as the granularity and precision of the analysis. The example below depicts an enhanced risk assessment related to each centroid derived from address points.
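Independently of that illustration, the idea of deriving centroids from address points can be sketched with a bare-bones k-means (Lloyd's algorithm). The points, the starting centroids and the function names below are assumptions for illustration only, and the coordinates are assumed to be projected so that Euclidean distance is meaningful.

```python
import math

def nearest_index(point, centroids):
    """Index of the centroid closest to the point (Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))

def kmeans(points, centroids, iterations=20):
    """Plain Lloyd's k-means on 2-D points, starting from given centroids."""
    centroids = list(centroids)
    for _ in range(iterations):
        # Assignment step: group each point with its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest_index(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its cluster;
        # an empty cluster keeps its previous centroid
        updated = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
            if c else old
            for old, c in zip(centroids, clusters)
        ]
        if updated == centroids:
            break  # converged
        centroids = updated
    labels = [nearest_index(p, centroids) for p in points]
    return centroids, labels

# Hypothetical address points in projected coordinates (assumption).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(points, [points[0], points[3]])
print(centroids)
print(labels)
```

Each label gives the cluster an address point belongs to, so a risk indicator can then be summarized per centroid rather than per administrative area. In practice the number of clusters and the starting centroids need to be chosen with care.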
Overall data integration and modeling, the best recipe of the mix for the best result
The main challenge in bringing geographical data into predictive analytics is to make realistic assumptions about how such data is implemented and used. In particular, the methods used to blend raw business data with spatial information are crucial in terms of scale (geographical level), dimension (centroid definition, polygon generalization) and operations on spatial fields (buffer and drive-time assessment, proximity, non-overlapping procedures).
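As a small illustration of a proximity operation, the sketch below selects the customers that fall within a circular buffer around a store, using the haversine great-circle distance. The coordinates, identifiers and the 10 km radius are hypothetical, and a circular buffer is a deliberate simplification compared with a drive-time polygon.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def within_buffer(store, customers, radius_km):
    """Ids of customers located within radius_km of the store."""
    lat0, lon0 = store
    return [cid for cid, (lat, lon) in customers.items()
            if haversine_km(lat0, lon0, lat, lon) <= radius_km]

# Hypothetical store and customer coordinates (latitude, longitude).
store = (46.8139, -71.2080)
customers = {
    "near": (46.8200, -71.2200),
    "far": (45.5017, -73.5673),
}

print(within_buffer(store, customers, radius_km=10.0))
```

The choice of buffer radius is exactly the kind of assumption that has to be realistic: too small and real customers are excluded, too large and neighbouring trade areas overlap.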
That being said, the return on investment of geographical data enrichment far outweighs these difficulties, opening unexplored paths for all kinds of businesses and allowing them to stay ahead of their own future. For instance, geocoding efficiency is crucial, economically speaking, in flood risk management and its related fields. Other use cases showed an ROI of up to three times that of the initial cases after the implementation of solutions involving geodata enrichment. Korem has a large portfolio of enrichment datasets, geospatial tools and the expertise to help you integrate them into your model.
Building the best possible predictive models means applying best practices, such as stepwise procedures embedded in an implementation of the CRISP-DM methodology, that take into account all the underlying hypotheses (variable selection, multicollinearity, correlation vs. causation).
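One of these checks, multicollinearity, can be screened in a simple way by looking at pairwise correlations between candidate predictors. The sketch below is an assumption-laden illustration: the variable names, the values and the 0.9 threshold are hypothetical, and a pairwise screen is weaker than a full diagnostic such as the variance inflation factor.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def collinear_pairs(predictors, threshold=0.9):
    """Pairs of predictors whose absolute correlation reaches the threshold."""
    names = sorted(predictors)
    flagged = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(predictors[a], predictors[b])
            if abs(r) >= threshold:
                flagged.append((a, b, round(r, 3)))
    return flagged

# Hypothetical enrichment variables for five trade areas (assumption).
predictors = {
    "median_income": [40, 55, 62, 48, 70],
    "avg_home_value": [200, 270, 310, 240, 350],
    "population_density": [3.1, 1.2, 2.8, 0.9, 2.0],
}

flagged = collinear_pairs(predictors)
print(flagged)
```

A flagged pair suggests keeping only one of the two variables, or combining them, before running a stepwise selection. A high correlation says nothing about causation, which still has to be judged from business knowledge.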
The challenges linked with predictive analytics in a geographical context are huge, and the price to pay is surely high, literally as well as figuratively.