EP15 – Why Entity Resolution Is a Key Tool in Location Intelligence
In this episode of On Point with Korem, I sat down with Jeff Jonas, an entrepreneur and a former distinguished engineer with IBM who is now the founder and CEO of Senzing. Senzing is a company that does entity resolution. What’s entity resolution? Let’s say your favorite hotel chain has three different loyalty club account numbers for you. How do they know all three are just one person? Or the government is trying to prevent a known terrorist from entering the country, and the watch list has a slight name variation, so someone has to resolve whether it’s the same person or not. Entity resolution, then, is recognizing when two observations relate to the same entity. So, how much of this uses geospatial technology? All of it. Does this sound a little boring? Maybe, but Jeff is anything but boring. I promise you’ll learn something, so let’s hear from Jeff in this episode of On Point with Korem.
Joe Francica: Great to see you. Thanks, really appreciate you taking the time. Let’s start off for people who are listening to this, who may be puzzled about why we should be talking about entity resolution. And I know you can provide an easy answer and make that link between geospatial technology and entity resolution.
Jeff Jonas: You know, it’s interesting, this term ”entity resolution” is really coming of age. It’s had a lot of different words used for it over the years. I mean, identity resolution, data deduplication, record linkage, match/merge, profile unification, patient record matching, all these things are forms of entity resolution. So, just to maybe define that first, entity resolution is recognizing when two observations that look like they are about two different entities, like two different people or two different companies, are the same, even though they were described differently.
One of the records is Jeffrey James Jonas, one is Jeff Jonas. Maybe the month and day in the date of birth are transposed, or the addresses are messy. Maybe one of the records has an email and a name and the other record has a name and a date of birth. In which scenarios can you figure out when two records are about the same person or related? Sometimes, you can be confident that they’re not the same people, but you know they’re related because they have the same home address and same home phone. So, entity resolution is about finding duplicates within a data set and combining data across data sets. And it’s primarily the latter that becomes super important to geospatial because not all records have a lat/long or an address.
So, you know, if your records are name and email address, or you got a phone number and a name, or you got an account number and a something, something, but you don’t have an address and you don’t have a lat/long, then you can’t place it in space. And so, what entity resolution allows you to do is combine data. And what you can do then is inherit an address and a lat/long onto data that’s otherwise unlocatable in space and time.
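Editor’s note: the idea of records without coordinates inheriting a location from records they resolve to can be sketched in a few lines of Python. This is only a toy illustration, not how Senzing works: the records, the field names, and the rule that an exact shared email or phone means "same entity" are all assumptions made for the example.

```python
# Hypothetical records: only the first one carries an address and a lat/long.
records = [
    {"id": "crm-1", "name": "Jeffrey James Jonas", "email": "jj@example.com",
     "address": "123 Main St", "latlon": (36.17, -115.14)},
    {"id": "web-7", "name": "Jeff Jonas", "email": "jj@example.com"},
    {"id": "app-3", "name": "J. Jonas", "phone": "555-0100"},
    {"id": "call-9", "name": "Jeff Jonas", "phone": "555-0100",
     "email": "jj@example.com"},
    {"id": "web-8", "name": "Sue Smith", "email": "sue@example.com"},
]


def resolve(records, keys=("email", "phone")):
    """Group records that share an exact email or phone (a naive, assumed rule)."""
    parent = {r["id"]: r["id"] for r in records}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_seen = {}  # (field, value) -> id of the first record that carried it
    for r in records:
        for k in keys:
            if k in r:
                token = (k, r[k].lower())
                if token in first_seen:
                    parent[find(r["id"])] = find(first_seen[token])
                else:
                    first_seen[token] = r["id"]
    return {r["id"]: find(r["id"]) for r in records}


def inherit_location(records):
    """Give each record the lat/long known for its resolved entity, if any."""
    entity_of = resolve(records)
    known = {}
    for r in records:
        if "latlon" in r:
            known.setdefault(entity_of[r["id"]], r["latlon"])
    return {r["id"]: r.get("latlon", known.get(entity_of[r["id"]]))
            for r in records}


located = inherit_location(records)
print(located)
```

In this sketch the phone-only record "app-3" picks up a lat/long through the record that shares both its phone and the located record’s email, while "web-8" stays unlocated because nothing connects it to a record with coordinates.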
JF: So, you know, I’ve seen some of the other presentations you’ve given, and it seems like what you’re referring to as context computing is: I’ve got to put things more in context. Geography is one of those elements that’s context. I’ve heard the term ambient geography, you know, applied to that, like what’s going on around me. And if you could explain that a little bit more, because I think what you’re saying is you get a record and it looks like this record, but it’s different, but it all matters, context matters in that case. Is that a good way to explain it?
JJ: Yeah, maybe, I’d start by saying the way, you know… when I think about context, when I was at IBM, I was the chief scientist of context computing, so this is a topic that’s near and dear to my heart. But better understanding something by taking into account the things around it, that’s context. So, you know, a really simple example is the word bat. Like you can see the word ”bat” but maybe you’re not sure what kind of bat it is. A fun example is: ”I ducked as the bat flew just over my head.” Now, in your head, you might be associating it with the flying bat, but then I’m like: ”Man, I never knew baseball games were so dangerous.”
If that’s the second sentence, it redefines ”bat.” And so when I think about information and context, I often think about it like puzzle pieces. Like: ”this pile of data is the red puzzle pieces, and that pile of data is the blue puzzle pieces.” Some of the piles of data are very geospatially rich; those are the gold puzzle pieces. By the way, I think geospatial data is analytics superfood. Like if you have one piece of data to wish for, it’s ”when” and ”where”, how things move… It’s incredible.
But what context computing does, and entity resolution plays a key role in this, is associate all these puzzle pieces. It’s like the difference between a pile of puzzle pieces and then taking the puzzle pieces to the table and seeing how they fit. Putting the red and white pieces kind of near each other, it gives you a little more context, actually connecting them to each other brings out an image and the better of a job that you… By the way, this has a second parallel. If you only studied the red puzzle pieces, you can’t get that much learning. You want to have a diversified collection of information. You want to widen your observation space, and then, when you put it all together, you get these really rich views, and then, you make higher quality predictions.
JF: So, I know you started your business a long time ago and a lot of your success was due to detecting fraud at casinos. Can you put that in perspective? Like how did that happen? In that context, you know… I don’t know if people had a name or an address, but detecting fraud at a casino had to be related to the contextual component of an action at a casino.
JJ: Well, it starts with the actors. It starts with knowing who is who. And when we set out on that project, it became known as NORA, Non-Obvious Relationship Awareness, and its purpose was to help the casinos better understand who they’re doing business with. So, let’s start with the observation space. You got people making hotel reservations. You got people checking into the hotel without even reservations, they just show up last minute. You have casino credit. You lend money to people so they can play. There’s a loyalty club. You have your employees, and you have people applying for jobs. They’re trying to climb over the fence and get in. You’ve got vendors. You’ve got people you already arrested. And then you’ve got people on a watch list published by the gaming regulators. It says: ”if you do business with these people, called the exclusionary list, we’re going to take away your gaming license or heavily fine you.”
And so, at the volumes that the casinos had, they really do need to know who they’re doing business with, whether it’s for marketing, an upsell or cross-sell, or compliance where your gaming license is at risk. And bringing all of that together on who’s who, who’s connected to who, is really a first step in assessing good news or bad news, green light, red light, yellow light, right? And a subtlety, by the way, that is missed is that clever bad people don’t use the same name, address, and phone on every record; only the idiots do that. And so, you have to see through people obfuscating who’s who, and when you net it all out – I dusted off some charts from the old days when I ran it the first time – you find things like a vendor and an accounts payable manager that aren’t just geospatially near each other. I mean, they’re not just in a big polygon, they live at the same address, right? And you want to know in a case like that. It’s smoke. You can’t just call it fire, maybe it’s been disclosed, but yeah.
JF: So, one of the things you also mentioned in one of your other videos, and I want to get more of an explanation, is you had something like, regarding context computing: data finds data, relevance finds you. In the context of big data, isn’t that exactly what we’re trying to do, sift through the chaff of all of this data stuff? And I just find it, you know, how much data do we really need? You always said you need more data, even bad data.
JJ: Well, there’s about three things to unpack in all of this. One is things that are rare, like anomalies in big data, things that are rare… There are a million rare things a day that happen. Like, how often do you and I hang out on a Zoom? So, just being rare in big data, there’s a million rare things a day. So that’s actually not enough signal. You have to look at other surrounding things about the kind of rare that it is.
Another key thing is, you know, I had this… I’ll call it a discovery, ’cause it came to me. I was talking to a counter-terrorism intelligence analyst at an organization you’d expect working on stuff like that. And I asked this person, I go: ”What do you wish you could have? If you could have anything?” And she looks at me and she says: ”I wish I could get answers to my questions faster.” That sounds very reasonable. Who doesn’t want that? But you know, this is like just post-9/11, right? And it struck me so funny and I looked at her and it just popped into my head. I go: ”What if it’s not a smart question today? What if you have to wait four or five days or four weeks until more data arrived? Then it’s a smart question. What if you’ve asked the question too early.” And she looked at me and said: ”That could happen.” And then I freaked out. I said: ”But you can’t ask every smart question every day. Like you can’t come up with all your smart questions, asking them all every 30 seconds.” She goes: ”You’re right, I can’t apathetically.” And I’m thinking to myself: ”We’re all gonna die.”
Like her job is to protect the country from bad things, and that’s when the words popped into my head. Every piece of data that arrives is the question. It’s kind of like, think of sense-making: as fast as the data is arriving, you’re seeing how it fits in the puzzle, and you can have a computer making sense of it. Humans can’t do this at that speed, but with the machine, the puzzle piece lands, it lands in the puzzle, it changes the shape of the puzzle, it connects some other parts of the puzzle you didn’t know were connected. Maybe it’s important, maybe it’s not, but if it is important, to whom? Tell them! That’s data finds data, relevance finds you. And that’s really about helping focus human attention.
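Editor’s note: "every piece of data that arrives is the question" can be pictured as a store that runs a lookup on each incoming record and raises a flag when the new record connects things that were previously separate. The sketch below is a minimal, assumed illustration: the class name, the feature tuples, and the exact-match rule are hypothetical and are not taken from any product described in the interview.

```python
class ContextStore:
    """Toy incremental store: every arriving record is treated as a query."""

    def __init__(self):
        self.entity_of = {}   # feature -> entity id
        self.features = {}    # entity id -> set of features
        self._next_id = 0

    def observe(self, record_features):
        """Place a new record; report which existing entities it touched."""
        record_features = set(record_features)
        touched = sorted({self.entity_of[f] for f in record_features
                          if f in self.entity_of})
        if touched:
            target = touched[0]
        else:
            target = self._next_id
            self._next_id += 1
            self.features[target] = set()
        # The new puzzle piece may join entities that were separate until now.
        for other in touched[1:]:
            record_features |= self.features.pop(other)
        for f in record_features:
            self.entity_of[f] = target
        self.features[target] |= record_features
        # "Relevance finds you": alert only when separate entities were joined.
        return {"entity": target, "connected": touched,
                "alert": len(touched) > 1}


store = ContextStore()
first = store.observe({("email", "a@example.com")})
second = store.observe({("phone", "555-0100")})
third = store.observe({("email", "a@example.com"), ("phone", "555-0100")})
print(third)
```

The first two records land as unrelated entities. The third carries both the email and the phone, so it merges them and is the only arrival that sets the alert flag, which is the moment a person would be told.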
JF: So, does the technology we have today, allow us to do that as data finds more information? I always get confused with, you know: ”Oh, this artificial intelligence and machine learning is going to find you the answer really, really fast.” But it’s not going to find it really, really fast if you don’t have the right data.
JJ: Well, you have to have a lot of data for training and most people don’t have that kind of data. Then you have to look for other people to create training models, hoping they have the data you have so they can share their models with you. And that’s one reason why it’s a bigger lift. You can’t just use those words casually and think you’re just going to get gain. And by the way, it’s kind of like self-driving cars. There was all that momentum and all that gaining. We thought overnight, we were going to see self-driving cars. Guess what? Closing the last inch, not trivial! So, you know, some companies that have all the data in the world have the advantage of a sufficient amount of data for training.
JF: So are we kind of fooling ourselves in that, you know: ”All we need is more data. I’m going to buy location data from, I don’t know, somebody providing footfall traffic, and that’s going to tell me everything I need to know to do my marketing.” Or: ”I’ve got a global data set of addresses, and I’m going to be able to link addresses to people.” Are we just kidding ourselves with all this?
JJ: I actually think everything you just mentioned is useful. It depends on which use case, but if you’re trying to understand commercial real estate, and one of your data points is foot traffic to an area, that’s great! Now, you might want to overlay it. It might be that somebody’s holding large events in the parking lot weekly, and they’re not going into stores, and at the same time, there’s these large afternoon music gatherings. So, maybe that would be a false positive. You’re like: ”Wow! A ton of people go there and shop.” But maybe they’re just going to the parking lot and going to a concert, never going to stores. So, overlaying secondary data just improves the quality of your prediction.
You know, foot flow count into businesses is extremely valuable, like if you’re a hedge fund, if you’re in commercial real estate, if you’re doing lending,… Lots of markets for that. It’s really about what’s the use case, what’s your mission, what’s the right – well, I called it fantasy analytics, man. I can’t tell you the number of times I go to somewhere and say: ”What do you want to do with analytics?” And they go: ”I want to ABC.” And I go: ”Well, what kind of data do you have?” And they go: ”You know, I got blue puzzle pieces right over there.” I go: ”Not even a divine being could use those blue puzzle pieces and find that.”
So sometimes, organizations have goals in their analytics. Like: ”let’s reduce fraud by 80%.” The question is, in many cases, you need to widen your observation space. Maybe you need to buy some reference data from corporate hierarchies and beneficial owners in banking, right?
JF: Do you think corporations have data and they just don’t know what they have?
JJ: Yeah, well, that’s always the place to start. That’s another funny thing: a lot of places always start with: ”We just need more data from outside,” but you look, and they’ve got the blue puzzle pieces trapped in a silo. They got the red puzzle pieces downstairs in a vault. They got the yellow puzzle pieces up in the attic in a paper box. And you’re like: ”Well, maybe you should take better advantage of what you already know.”
And so, that’s always the place to start because you don’t have to buy, you don’t have to get it, it’s already your data. And by the way, if you have it, you probably have the responsibility to connect it. There’s nothing like an organization that finds out they’ve been negligent. They found out they had the blue puzzle piece in one hand and the red puzzle piece in the other hand. They just never noticed, like an enterprise amnesia.
JF: Right. So, in some of the things that Senzing is doing and creating models, I think I’ve seen on your website, you’re able to share models.
JJ: We’re not. I wouldn’t even say we’re creating models. I’m going to change that website. You watch. Bada boom, bada bing! Then a blank. You could say there are models, but the Senzing technology is the same whether you’re doing marketing, bad guy hunting, risk assessment, supply chain integrity, voter registration… It’s all the same. So, it doesn’t have any models about good news or bad news. It doesn’t score things like, you know, red light, green light. None of that.
We used to do that as part of what we did, but the thing that we are uniquely good at is that entity resolution piece, are these people the same, possibly the same or related, and then building that graph. Think of it like a scaffolding or a skeleton, and then on that you drape events and transactions and locations and prices and blah, and that’s context. And so what we do is we make the… it’s a challenge for people that build software when they’re building software, very… Do you know all the spellings of Muhammad or Dick, Nicki, Cardo? We make that easy for developers.
JF: But you don’t do anything like a Neo4j, or you wouldn’t compare yourself as a competitor to that kind of a graph technology.
JJ: We are to Neo4j what peanut butter is to jelly.
JF: Is it a good complement? I love peanut butter and jelly.
JJ: But the only thing better than peanut butter is chocolate. Maybe I should have said that. The only thing… so Neo4j is, you know, you could say the most popular graph database. It builds a visual graph of how nodes are connected. And sometimes those nodes go from Billy to Bill, to William, to Will. But those are all the same. And then it goes to Sue. So, it looks like you’re four or five hops from Sue, but if you entity resolve, you realize it’s one hop.
Entity resolution helps compress or consolidate synonym nodes in a graph database. So, they work beautifully together. In fact, we have an open-source project that takes our output and publishes it in real-time into people’s Neo4j graphs. There was just a company called data economy that just announced the work they’re doing with Senzing. You know, we build transmissions, people sell cars, we sell transmissions. And Neo4j would be another car part, you know, but an important car part.
So, this company, data economy, doing PPP loan fraud products, has combined Senzing and a graph database, the peanut butter and jelly, and has a great product that helps companies, you know, helps banks find people getting two or three loans for the same company, or companies that are no longer in business. You add some third-party data like OpenCorporates, which is a registry of when business licenses are valid. Suddenly you see you’re giving loans to businesses that are out of business. But anyway, that’s what happens when you take a graph database, entity resolution, and add some data: then you get innovative products that find things more accurately.
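Editor’s note: Jeff’s point about synonym nodes inflating the hop count can be checked with a small breadth-first search. The sketch below is an assumption-laden toy: the edge list, the `resolved` mapping (standing in for the output of some entity resolution step), and the entity label "E1" are all invented for illustration, and no graph database is involved.

```python
from collections import deque


def hops(edges, start, goal):
    """Shortest number of hops between two nodes in an undirected graph."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # no path found


def collapse(edges, resolved):
    """Replace synonym nodes by their resolved entity and drop self-loops."""
    out = set()
    for a, b in edges:
        a, b = resolved.get(a, a), resolved.get(b, b)
        if a != b:
            out.add((a, b))
    return sorted(out)


# Raw graph: four name variants chained together, then a link to Sue.
edges = [("Billy", "Bill"), ("Bill", "William"),
         ("William", "Will"), ("Will", "Sue")]

# Assumed entity resolution output: all four variants are one person, "E1".
resolved = {"Billy": "E1", "Bill": "E1", "William": "E1", "Will": "E1"}

before = hops(edges, "Billy", "Sue")
after = hops(collapse(edges, resolved), "E1", "Sue")
print(before, after)
```

On the raw graph Billy is four hops from Sue; once the variants are consolidated into one node, the same relationship is a single hop.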
JF: So where does something like Databricks environment fit with Senzing? It sounds almost too good to be true with what Databricks is doing, right? I mean, they’re trying to solve the world’s cloud problems and desilo information.
JJ: Yeah. You know, Databricks, Snowflake, who else would come to mind? I’ll just use those two. These are organizations that are helping you pick up data, transform data. The cameras may be another good example but pick up data, transform data, and land it in a more useful form. In that workflow, you need entity resolution. And by the way, there’s a couple reasons. One is if you’re doing your own machine learning, what do you think is going to learn better? What if you tell the machine learning model it’s three people, each with one transaction, but what if it’s really one person with three transactions?
How different do you think the training will be? I’m telling you, hugely different, night and day. So, entity resolution is important too as a preparation, a part of the preparation process. And by the way, around geolocation, now you’re adding geolocation to records that didn’t have geolocation. So now your models can take advantage of geo. Now you can have a data set that’s email address only, but suddenly because of entity resolution, and other data, there are lats/longs on there too.
Well, now your machine learning models can, you know, produce geospatially aware models. And then when you’re onboarding new vendors or customers, you can do better real-time know-your-customer due diligence. You can do continuous vetting on your supply chain, employees, or workforce with entity resolution. So anyway, long story short, entity resolution is just another, you know, stool leg on the stool of the kinds of things you do when you’re doing things like Databricks or Snowflake.
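Editor’s note: the "three people with one transaction each versus one person with three transactions" point is easy to see in the features a model would be trained on. The following Python sketch uses made-up transactions and an assumed `resolved` mapping from record IDs to an entity ID; it only illustrates how the training rows change, not any particular pipeline on Databricks or Snowflake.

```python
from collections import Counter

# Hypothetical transactions, each tied to a separate source record.
transactions = [
    {"record_id": "A", "amount": 120.0},
    {"record_id": "B", "amount": 80.0},
    {"record_id": "C", "amount": 200.0},
]

# Assumed entity resolution output: all three records are the same person.
resolved = {"A": "person-1", "B": "person-1", "C": "person-1"}


def features(transactions, entity_of=None):
    """Per-customer transaction count and total spend, keyed by entity."""
    entity_of = entity_of or {}
    count, total = Counter(), Counter()
    for t in transactions:
        # Fall back to the raw record ID when no resolution is available.
        key = entity_of.get(t["record_id"], t["record_id"])
        count[key] += 1
        total[key] += t["amount"]
    return {k: {"n_tx": count[k], "total": total[k]} for k in count}


without_er = features(transactions)
with_er = features(transactions, resolved)
print(without_er)
print(with_er)
```

Without resolution the model sees three one-off customers; with it, the same data becomes one repeat customer with three transactions, which is a very different training example.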
JF: Just one last thing. I want to pick up on something you just said, because it relates to some of the things that insurance companies would do in terms of fraud detection, where you don’t want to resell an insurance policy to somebody that fraudulently, you know, did something fraudulent in a past life, right? And you’re onboarding and it’s like: ”Oh, this is a great-looking guy,” but without context, you wouldn’t know who this guy was.
JJ: We’ve seen that; our customers report that… you see a banking customer fired from a big bank in Asia, and then they turn back up at, let’s say, Hong Kong, and then they turn back up in India. It’s the exact same one you fired. There’s one letter different in their long last name and there’s a different passport. And you just think you just got yourself a great customer. In fact, once bad guys get really used to how to operate in your system and you finally detect them after five years, they’ll just wiggle right back. They’ve already studied you so much that they just wiggle right back in.
We do something really unique called entity-centric learning. It’s a very different style of entity resolution. It’s a lot more expensive to do in terms of time. We try to make it fast. I mean, it is fast. We do thousands of things a second, but it’s a lot of extra work others don’t tend to do and it helps you find those people that are obfuscating their identity to be somebody else. They don’t want you to connect this geospatial record with this record over there, they wouldn’t want you to know that was them.
We have created a way to do that in real-time, in big data. And we just, by the way, Joe, we just let people download our entire product for free and try it, like literally. You know, maybe you would have seen it on my website. It’s like click here, try it for free. Yeah, but it’s for developers, it’s for people that are trying to solve this problem in their product or in their workflow.
JF: Well, Jeff, we’ll leave it there. I can’t thank you enough. Always enjoy the conversation with you and another time, I will continue to be educated and so my questions will get better and better whenever I talk to you.
JJ: This has been great! What are you talking about? Hey, reach out anytime. And by the way, I’m firstname.lastname@example.org, I answer every email, you know. I mean, of course I answer because we’ve known each other now forever and a day, but to any of your listeners, I just, I answer emails. It can be about work or about triathlons or travel or whatever, but I’m very accessible.
JF: Cool. Yeah. Thanks again, Jeff. Really appreciate it.
JJ: Thanks, Joe.
JF: You bet. Thanks again for joining us on another On Point with Korem. And if you like today’s podcast, please leave a comment in the comment box where this podcast is posted, which could be Apple Podcasts, Google Podcasts, Spotify, or YouTube. I hope you’ll join us next time for another On Point with Korem where we’ll get on point.