Data has become a crucial tool for innovation within organizations, providing them with a competitive advantage in their market. Analytics teams and the infrastructure to support them have seen exponential growth in the past decade, yet many organizations still struggle to procure the right, high-quality data sets for making confident, data-driven decisions. Market research firm Frost & Sullivan predicts that “big data analytics market growth is forecasted to grow at a Compound Annual Growth Rate (CAGR) of 29.7% to $40.6 billion by 2023.”
When the pandemic hit, many organizations scrambled to find and purchase the right data to face new business challenges. A year later, many haven’t made much progress. According to a report from Deloitte, apart from the few “data mature” organizations that have successfully embedded a data strategy in their practice, most organizations still struggle to navigate a complex and rapidly evolving vendor market. They also struggle to negotiate contracts, respect the terms and conditions (T&Cs), and manage relationships with vendors. Keeping up with new offerings and knowing what is available on the market can be a full-time job: you need to keep abreast of innovations and run continuous data benchmarks and comparative analyses.
Since its founding, Korem has accumulated vast experience guiding customers through the data acquisition process, from product selection, contract advisory, and procurement to data integration services. Based on this experience, we have compiled the seven most common pitfalls that organizations overlook during the data selection and buying process.
Summary of Pitfalls and Our Advice
| Pitfall | Our advice |
| --- | --- |
| Having unclear problems to solve and business objectives. | Conduct a data exploratory engagement to confirm which data sets are available and suitable to answer specific business questions. |
| Thinking that data sets can be easily compared by looking only at a list of attributes or the vendor’s documentation. | Ask for a sample of data in a geographic area with which you are familiar, and engage a trusted advisor with expertise across multiple vendors. |
| Not understanding where the data comes from. | Ask an expert about the source of the data, the method of acquisition, the integration process, and the update cycle. |
| Thinking the more data you have, the better. | More data may also mean more false positives or outdated data. The number of records shouldn’t be the sole decision factor when buying data sets; it can even be a limiting factor for your infrastructure. |
| Thinking that open-source data will be free. | Before choosing open-source data, assess whether these products truly fulfill your needs. What looks like free data can require extensive and costly work in the long run. |
| Underestimating data integration complexity. | Get support from a data specialist and look at customized data delivery options to facilitate data integration. |
| Not factoring in terms and conditions before the end of the buying process. | T&Cs should be addressed early in the process, as they may change the price of the data or prevent you from using it as intended in certain use cases. |
1. The Importance of Clear Business Objectives
One of the most common pitfalls when shopping for geospatial data is vaguely defined business problems and objectives. When faced with an urgent problem, teams go on a wild goose chase looking for data. But what data is really needed? What problems need to be solved? Commercial data sets can be a powerful decision tool, but unclear goals will likely lead to unrealistic objectives and missed opportunities. Third-party data is many things, but it is not a complete answer to every challenge.
Before selecting data vendors, users should have a clear business objective and a good idea of how the data will be leveraged. In turn, this will help to determine the level of accuracy, attribution, and completeness needed. The challenge is that without proper knowledge of which type of data is available on the market, it is hard to define which business questions you can answer with enough confidence.
This chicken-and-egg situation is one of the reasons why our data experts often conduct data exploratory engagements with customers who are looking for guidance and trying to confirm which data sets are available and viable to answer specific business questions.
2. The Devil Is in the Details
Sometimes, commoditized data may be good enough. When data drives decisions, however, quality becomes paramount: the most complete coverage, the best accuracy, and detailed attributes are needed to achieve trusted insight. This applies to common data sets such as street data, points of interest (POI), business points, parcel data, and building footprints, and even more so to newer value-added data sets such as mobile-trace data, vehicular traffic, or footfall data. Sourcing data externally, whether open or commercial, is also very different from dealing only with internal data. While your internal business data may suffer from severe quality issues, it is collected in a way that can be controlled or at least accounted for during analysis. With third-party data, each data set may come with only limited documentation, metadata, and source information, and its quality and validation processes may be unknown.
While some data vendors do provide extensive documentation, the reference guide may still be insufficient to uncover some of the nuances. Even with a thorough review, we recommend that you acquire a good sample of data in a geographic area with which you are familiar. This will provide the necessary “ground truth” that may otherwise be missing. At a larger scale, developing a methodology to compare and determine the “best data for your needs” can be very challenging and require experience working with these types of data.
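As a rough illustration of the “ground truth” check described above, the sketch below compares a vendor sample against a hand-verified POI list for a neighborhood you know well. All names, addresses, and metrics here are invented for illustration; a real comparison would also handle fuzzy name and address matching.

```python
# Hand-verified POIs for a neighborhood you know well (the "ground truth").
truth = {
    ("Main St Bakery", "12 Main St"),
    ("City Hall", "1 Civic Plaza"),
    ("Oak Park Cafe", "44 Oak Ave"),
    ("Riverside Gym", "7 River Rd"),
}

# The vendor's sample for the same area (one stale record, one missing one).
vendor = {
    ("Main St Bakery", "12 Main St"),
    ("City Hall", "1 Civic Plaza"),
    ("Closed Diner", "99 Elm St"),
    ("Riverside Gym", "7 River Rd"),
}

matched = truth & vendor
coverage = len(matched) / len(truth)         # how much of reality the vendor captures
suspect = len(vendor - truth) / len(vendor)  # records with no ground-truth counterpart

print(f"coverage vs. ground truth: {coverage:.0%}")  # 75%
print(f"suspect records in sample: {suspect:.0%}")   # 25%
```

Even a toy check like this surfaces the two failure modes that vendor documentation rarely quantifies: places that exist but are missing, and records that no longer exist on the ground.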
Some considerations and privileged insights can only be uncovered through extensive knowledge of the data landscape and by talking to each data vendor, or to a trusted broker that has strong partnerships with multiple vendors. As analytics and data science teams progress in their data literacy journey, this is a paradigm shift that requires some adaptation.
3. Understanding the Origin of the Data
The data vendor landscape is becoming very complex with many mergers and acquisitions. In addition, an ecosystem of third-party data vendors has evolved through partnerships and sometimes convoluted agreements.
Some vendors are first-party data producers that will give you access to unique data that cannot be found elsewhere. There are also third-party data consolidators that combine data from multiple vendors as well as open data to offer a best-of-breed product. For example, Precisely’s Address Fabric achieves best-in-class address point coverage and completeness by conflating several address point sources.
Understanding the source of the data, the method of acquisition, the integration process, and the update cycle is paramount to determining how unique situations are handled. These may include a missing new residential address, a recently decommissioned business, or a postal code that has changed.
4. Sometimes, Less Is More
While having more data to analyze can sometimes be a good thing, it can also mean the data contains false positives or outdated records that should have been filtered out.
Using building footprint data as an example, users may have to choose between high-accuracy lidar-based data, which has more limited coverage but rich attributes (e.g., building height), and data collected from satellite imagery and processed with AI. The AI/deep-learning technique may cover more territory, but it captures only the building outline and may contain false positives due to misclassified landscape features.
So, should users sacrifice coverage or accuracy? Again, it depends on the business problem, but if both are needed, sometimes the solution is to combine multiple, complementary data sets from multiple vendors. However, this may involve more processing challenges and more expertise.
In all cases, the number of records should never be the sole decision factor when considering the acquisition of third-party data.
5. Commercial, Open-Source or Both?
In recent years, the availability of commercial data and open data products has expanded considerably. Some commercial data products have been commoditized. Many of these are now available through free government data or open-source data initiatives. Alternatively, other commercial, value-added products have emerged.
Many companies tend to go straight either to commercial data for reliability or to open data for cost (or lack thereof). But choosing between a commercial and an open-source data product is usually not as easy a decision as it might first appear.
If you are looking for parcel or address data for a single city, a government-sponsored open data initiative may be sufficient for your purposes. The bigger challenge is acquiring data for areas the size of states, provinces, or countries, where the data may come from several sources, not all of which may be authoritative. Be aware that free data isn’t always free once you factor in the ongoing cost of data integration, validation, and updates, and the loss of focus on your core business.
At the rate at which commercial data vendors are investing in capturing geospatial data, companies will find it difficult to develop and maintain comparable data products on their own without building significant in-house expertise. It all comes down to one debate: build or buy?
Whenever you are hesitating between free, low-cost, standard, or premium geospatial data, we recommend calculating the cost and impact of data inaccuracy and incompleteness before making what could be a bad decision. The return on investment of building versus buying will feed your business case.
6. Underestimating the Data Integration Complexity
Attributes, coverage, and quality are not the only things to consider when purchasing data. Prospective buyers also need to consider how the data will integrate with the existing target infrastructure and database, whether that is a spatial database such as an ArcGIS geodatabase, a BI tool, or another enterprise platform. Ingesting these data into a spatial database requires an understanding of proprietary data formats, update frequencies, and delivery methods so that they can be integrated with in-house data. Making sure that the precision of internal data matches that of commercial data is also critical: users should consider whether to match or aggregate data at the address, ZIP code, or block group level. This matters because users may buy a very precise geospatial data set, aggregate it up to coarser levels of geography to match internal data, and thus lose the spatial resolution they paid for.
New types of mobile-trace data, such as Foursquare’s footfall data, allow users to answer new kinds of business questions. However, raw footfall data can run to terabytes and can be updated monthly, weekly, or daily. This volume of data is challenging for traditional GIS, BI, data science, or even ETL tools.
For customers that do not have the tools or expertise to consume this volume of spatial data, Korem offers a custom data delivery service that provides ready-to-use data sets based on the customer’s preferences. For example, Korem can extract a specific county from the national coverage, pre-combine it with complementary socio-demographic segmentation through a geo-enrichment process, and deliver monthly updates in Esri FGDB format. This sort of data service can greatly ease the integration process and ensure that you get immediate value from the data.
There are further options for consuming this type of data through real-time data APIs and API-based extraction. However, these often require integration expertise beyond simple web service calls, along with established best practices. For example, HERE Traffic Analytics can extract historical traffic patterns for a specific region and timeframe, but the result must still be joined with the raw street segment data.
Regardless of the data delivery model, getting support from a data specialist can help you extract the most value.
7. T&Cs Shouldn’t Be the Last Thing You Validate
Now that you have selected a data set that fits your business need, have you chosen the right pricing model and validated the license terms and conditions (T&Cs)?
These validations often happen too late in the process. How the data is used can affect the price, and the T&Cs may even prevent you from using the data for certain purposes or in certain systems.
If you are a data scientist, you may be satisfied with a user-based data license for your modeling. But once you are ready to operationalize the process, then it may be necessary to opt for a server-based, department or enterprise-wide license. Moreover, the data license and T&Cs may contain restrictions related to storing geocoded output or creating derived content from the data.
We recommend addressing these topics early in the process and reaching out to trusted experts who can provide vendor-agnostic recommendations and data contract advisory services.