The use and valorization of enriched geographical data to enhance predictive modeling and segmentation
Nowadays, the word “data” has spread, in all its forms, to every segment of industry. In an increasingly competitive environment, every business needs to act rather than react, which is a condition not only for growth but for survival. The advent of new data science techniques opens exciting opportunities in the world of IT, along with buzzwords such as Big Data, Machine Learning and Artificial Intelligence.
Managers are now aware of the many possibilities that data, in its broadest sense, can bring to their companies. However, the variety and volume of information they hold are so large that simply finding a starting point is a major difficulty.
In other words, they face two main questions: what can I do with my data, and what answers can I get from it? As data scientists, we need to address both the questions and the answers, and this is only the visible part of the iceberg.
For example, a retail company has access to all the historical sales and transaction data recorded at its locations across the country. The business goal is obvious: market growth and increased profits. From that, key performance indicators can be built and matched with the customer database. This is where predictive analytics comes into play: forecasting future sales and transactions from this data helps the company understand its business and act instead of react, or better still, anticipate.
Big Data Blending
Business data, the starting line of the race
Building a model with your own data is only the tip of the iceberg. It is like owning a beautiful car without knowing anything about what is under the hood. If the engine is bad, you will not go far, however beautiful the machine. This brings two aspects to the table: the data and the modeling techniques.
First, data quality is a key point in predictive analytics, which can be summarized as “garbage in, garbage out”. Are my data sources reliable? Are there gaps in my data? How should I handle them? These are all important questions that need to be addressed. Second, it is important to validate the assumptions that the selected algorithms depend on (data normality if applicable, data distribution and type). Even with good data, neglecting this second aspect can lead to false results and interpretations.
The last thing to consider is the choice of model predictors. Being able to assess every predictor in the future is not only important but a prerequisite for forecasting. Back to our retail company example, weather conditions and hazards (rain, thunderstorms, heavy snowfall, tornadoes) and unpredictable external events (road works, strikes, outages) have a significant impact on sales. However, these parameters cannot be used in a forecasting model, because their future occurrences cannot be known in advance. Altogether, the modeling part still represents a small proportion of the whole process.
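As an illustration of the questions above, here is a minimal sketch of a first data quality check: it counts the gaps in a numeric column and computes the sample skewness as a rough hint about normality. The function name `summarize_quality` and the `daily_sales` values are hypothetical, and a real project would likely use a dedicated statistics library and formal tests.

```python
import math
import statistics


def summarize_quality(values):
    """Count missing entries and compute the skewness of the remaining ones."""
    # Treat None and NaN as gaps in the data.
    present = [v for v in values if v is not None and not math.isnan(v)]
    if not present:
        raise ValueError("no usable values in this column")
    missing = len(values) - len(present)
    mean = statistics.fmean(present)
    sd = statistics.pstdev(present)
    # Population skewness: close to 0 for a symmetric distribution.
    if sd > 0:
        skew = sum(((v - mean) / sd) ** 3 for v in present) / len(present)
    else:
        skew = 0.0
    return {"missing": missing, "n": len(present), "mean": mean, "skewness": skew}


# Hypothetical daily sales with two gaps and one suspicious outlier.
daily_sales = [120.0, 135.0, None, 128.0, 131.0, float("nan"), 900.0]
report = summarize_quality(daily_sales)
print(report)
```

A strongly positive skewness, as the outlier produces here, suggests checking the data source before relying on any algorithm that assumes normality.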
The add-on value of geographical data enrichment
Once an initial model is built, the first step is done. Now let us tune the engine of our vehicle and put the best parts together to get the best fit and the maximum performance from it. We do this by enriching the existing elements with additional information, such as geographical data, that can raise the level of business understanding and make predictive analytics more efficient.
By geographical data, we mean data with a geographical component that can be X/Y coordinates, a spatial type (point, line, polygon) or a link to a geographical entity (postal code, region, state/province). This data is available from sources ranging from well-known government census providers (StatsCan) to geographic data vendors that produce traffic data (HERE Traffic Analytics), points of interest (POIs) and other business drivers (HERE, TomTom, Pitney Bowes, Environics Analytics).
In parallel, the advent of social data has also unveiled new perspectives. The wide collection of datasets implies a considerable number of variables, ranging from the general (average income, level of education, housing type) to the specific, such as the monthly median amount spent on junk food.
Looking for additional enrichment datasets? Korem experts can help you select the best dataset from our large data portfolio.
Putting a face to your customers
In practice, our retail company can blend sales and transactions with its customers, defined by their name and address. The next level is to answer three questions: who are my customers, where are they, and how can I reach them (see picture below)? The answers will open new perspectives on the strategies the company has in place and help it target its marketing efforts, moving from predictive to prescriptive analytics. This is only possible by enriching the existing business data, blending geographical and non-geographical data using spatial tools (decay analysis, spatial matching or processing, clustering, etc.).
This procedure can involve several operations, starting with geocoding the address points to get X/Y coordinates that can then be blended with drive-time polygons defined around each retail facility. This allows matching the point addresses (your customers) with both the drive-time polygons that define the store trade area and the socio-demographic variables that profile your customer database. By comparing with the rest of the zone of interest, it is now possible to build market analyses, pinpoint underperforming areas and find ways to address them.
In the same example, suppose the retail company has a risk management department, for which the dissemination area level is not precise enough. Companies working in that sector would like to segment these risks better by considering the geographical component of all their business points. Clustering techniques such as k-means make it possible to draw suitable distinctions, improving the efficiency of risk determination as well as the granularity and precision of the analysis. The example below depicts an enhanced risk assessment related to each centroid derived from address points.
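The matching step described above can be sketched with a simple point-in-polygon test (ray casting). This is only an illustration under stated assumptions: the coordinates are hypothetical planar X/Y values, the square `trade_area` stands in for a real drive-time polygon, and geocoding is assumed to have been done already. A production workflow would typically rely on a geospatial library or spatial database.

```python
def point_in_polygon(x, y, polygon):
    """Return True if (x, y) lies inside the polygon, using ray casting.

    `polygon` is a list of (x, y) vertices, in order, without repeating
    the first vertex at the end.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y matter.
        if (y1 > y) != (y2 > y):
            # X coordinate where this edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical drive-time polygon around one store (planar coordinates).
trade_area = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]

# Hypothetical geocoded customers: id -> (x, y).
customers = {"A": (1.0, 1.0), "B": (5.0, 2.0), "C": (3.0, 3.5)}

in_trade_area = {
    cid: point_in_polygon(x, y, trade_area) for cid, (x, y) in customers.items()
}
print(in_trade_area)
```

Customers flagged as inside the trade area can then be joined to the socio-demographic variables attached to that area.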
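To make the clustering idea concrete, here is a minimal k-means sketch (Lloyd's algorithm) written with the standard library only. The address coordinates, the number of clusters and the seed are all hypothetical, and this is not the tool or configuration used for the example mentioned above.

```python
import math
import random


def kmeans(points, k, iterations=50, seed=0):
    """Cluster 2D points into k groups; return (centroids, assignments)."""
    rng = random.Random(seed)
    # Start from k distinct input points chosen at random.
    centroids = rng.sample(points, k)
    assignments = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                # Keep an empty cluster's centroid where it was.
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, assignments


# Hypothetical geocoded address points forming two separate groups.
addresses = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
             (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = kmeans(addresses, k=2)
print(centroids, labels)
```

Each resulting centroid can then carry its own risk score, which is finer-grained than a score assigned to a whole administrative area.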
Overall data integration and modeling, the best recipe of the mix for the best result
The main challenge in enabling geographical data in predictive analytics is to make realistic assumptions about how such data is implemented and used. In particular, the methods used to blend raw business data with spatial information are crucial in terms of scale (geographical level), dimension (centroid definition, polygon generalization) and operations on spatial fields (buffer and drive-time assessment, proximity, non-overlapping procedures).
Building the best possible predictive models requires following best practices, such as stepwise procedures embedded in an implementation of the CRISP-DM methodology that takes into account all the associated hypotheses (variable selection, multicollinearity, correlation vs. causation).
The challenges linked with predictive analytics in a geographical context are huge, and the price to pay is surely high, literally as well as figuratively.
That being said, the return on investment of geographical data enrichment is of much greater value, opening the way to unexplored paths for all kinds of businesses and allowing them to stay ahead of their own future. For instance, geocoding efficiency is economically crucial in flood risk management and its related fields. Other use cases showed an ROI of up to three times that of the initial cases after the implementation of solutions involving geodata enrichment. Korem has a large portfolio of enrichment datasets, geospatial tools and expertise to help you integrate them into your model.
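As a small illustration of the buffer and proximity operations mentioned above, the sketch below computes great-circle distances with the haversine formula and keeps the customers that fall within a circular buffer around a store. The coordinates, names and 10 km radius are hypothetical, and a circular buffer is a simplification: a drive-time polygon would follow the road network instead.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_buffer(store, customers, radius_km):
    """Return the ids of customers located within radius_km of the store."""
    return [
        cid for cid, (lat, lon) in customers.items()
        if haversine_km(store[0], store[1], lat, lon) <= radius_km
    ]


# Hypothetical store and customer locations as (latitude, longitude).
store = (46.8139, -71.2080)
customers = {
    "cust_near": (46.8300, -71.2500),
    "cust_far": (45.5017, -73.5673),
}
nearby = within_buffer(store, customers, radius_km=10.0)
print(nearby)
```

The choice of radius, like the choice of geographical level, is one of the assumptions that needs to be validated against the business context.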
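One of the hypotheses listed above, multicollinearity, can be screened with a simple pairwise correlation check before variable selection. The sketch below is a rough first pass under stated assumptions: the predictor names and values are hypothetical, the 0.9 threshold is arbitrary, and pairwise correlation does not replace a fuller diagnostic such as variance inflation factors.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def collinear_pairs(predictors, threshold=0.9):
    """List predictor pairs whose absolute correlation reaches the threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(predictors.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, round(r, 3)))
    return flagged


# Hypothetical enrichment variables for five trade areas.
predictors = {
    "median_income": [40, 55, 62, 70, 85],
    "avg_home_value": [200, 270, 310, 350, 430],
    "drive_time_min": [12, 5, 20, 8, 15],
}
flagged = collinear_pairs(predictors)
print(flagged)
```

When two enrichment variables are flagged this way, keeping only one of them usually makes the model's coefficients easier to interpret.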