The Naughty Bits of Location Intelligence Data

ignacio.barrios
10 min read · Sep 17, 2021


“Bombarding people with facts and exposing their individual ignorance is likely to backfire. People don’t like too many facts, and they certainly don’t like to feel stupid”[1].

I will start with this sentence from Harari’s “21 Lessons for the 21st Century”. Most of the content I’m sharing in this article might be controversial from an industry point of view, but I consider it necessary to clarify these points.

Some background info

Location intelligence services have been a fast-growing sector over the last few years, mostly due to the massive adoption of smartphones and, hence, the exponential growth of digital advertising.

Most of these companies base their data on SDKs (software development kits). These SDKs were initially conceived to enhance aspects such as functionality, connectivity, or compatibility for a given app or program. Since one of the purposes of these SDKs was to gather individual information and device performance data to improve customer experience and/or app performance, app owners started to gather immense amounts of individual, highly valuable data.

App publishers discovered a way to increase their revenues: they either kept this data proprietary (e.g. WhatsApp, Facebook) or opened their apps to third parties, who could incorporate new functionalities within those apps and monetize them. Some of these functionalities include app monitoring, behavioral analytics, marketing automation, advertising, payments, and others.

In this context, it’s easy to understand that most of the available location data comes from the ad tech ecosystem, with bid streams/ad exchanges being the main source of this data. These exchanges provide a set of opportunities to dynamically target potential customers when they open an app or visit a webpage. This process is called Real-Time Bidding (RTB), and it generates “bid requests”.

Each bid request has certain parameters associated with it, so the advertiser can decide whether or not to bid for that opportunity (a simplified example follows the list). These parameters might include:

  • Timestamp
  • iOS Identifier for Advertisers (IDFA) or Google Advertising ID (AAID)
  • Location (LAT/LON)
  • And others
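
To make this concrete, here is a minimal sketch of what such a bid request payload might look like. The field names and values are illustrative, loosely inspired by OpenRTB conventions, and not the exact schema of any particular exchange.

```python
# Illustrative only: a simplified, OpenRTB-flavoured bid request payload.
# Field names and values are assumptions made for this example, not the
# exact schema of any specific ad exchange.
sample_bid_request = {
    "timestamp": "2021-09-17T10:32:05Z",   # when the ad opportunity was generated
    "device": {
        "ifa": "6D92078A-8246-4BA4-AE5B-76104861E7DC",  # IDFA/AAID-style advertising ID
        "os": "iOS",
    },
    "geo": {
        "lat": 40.7580,    # latitude reported by the device or inferred
        "lon": -73.9855,   # longitude
        "accuracy_m": 30,  # claimed horizontal accuracy, in meters
    },
    "app": {"bundle": "com.example.weather"},  # hypothetical publisher app
}

# An advertiser (or a location data company) inspects these fields to decide
# whether to bid; as a side effect, the location can be logged and resold.
print(sample_bid_request["geo"])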

As you can imagine, this generates an important (and valuable) dataset, utilized not only by publishers and advertisers but also by other companies providing location information. If you want to know how many SDKs are installed in one of your apps, you can find a guide at this link.

Data Marketing

Now we have a general idea of how data is gathered. This is not an exhaustive analysis, and there are other ways to access location data (e.g. a direct API from app publishers), but it can be considered a good proxy.

Everything that has to do with data refers to the four Vs: volume, variety, velocity, and veracity. Location data companies utilize different proxies to dress their value proposition (yes, to dress, not to address), but the most common are:

  • Active Devices/Users
  • Mobile Apps

This information can be made readily available:

Source: https://www.placer.ai

It can be hidden or masked with other statistics, so you must do the math:

Source: https://www.olvin.com
Source: https://near.co/data/

Important to note: 5B events per day on 1.6B users is roughly 3 events per user per day. We will cover this point later.

Or it can simply not be made accessible at all, to prevent any unwanted queries:

Source: https://knowledge.near.co
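
Before moving on, the arithmetic behind the note above (5B events per day on 1.6B users) is worth writing down. The figures are the vendor’s own headline numbers, and dividing daily events by monthly users is already a generous reading:

```python
# Vendor headline figures quoted above; the division is the whole point.
events_per_day = 5e9     # "5B events per day"
users          = 1.6e9   # "1.6B users"

print(f"{events_per_day / users:.1f} events per user per day")  # ~3.1
```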

Data crunching

Some companies provide more detailed data, which also requires some effort to interpret properly. I’m going to do some math here:

Source: https://docs.safegraph.com/docs/patterns-summary-statistics

In this image we see the median devices seen daily. For those not familiar with basic statistics, this means that if you order the daily device counts (day 1: 100 devices, day 2: 120, day 3: 130, etc.), you pick the middle value (roughly the 15th day in a 30-day month). Is this a reliable proxy? As always, it depends. In a normal distribution (Gauss’ bell), the median is equal to the mean[2], but in a power law[3], the mean is far more elusive. Interestingly, aspects related to human behavior, such as population growth, income distribution, or the diffusion of innovation, follow power laws.
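
A minimal simulation, with illustrative parameters, shows why the distinction matters: in a Gaussian sample the median tracks the mean closely, while in a heavy-tailed (power-law-like) sample the mean is dragged far away from the median by a handful of extreme values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Normal ("Gauss' bell") sample: median and mean essentially coincide.
normal = rng.normal(loc=100, scale=15, size=100_000)

# Heavy-tailed (Pareto) sample: the mean is dragged far above the median
# by a handful of extreme values. The tail exponent is illustrative.
heavy = (rng.pareto(a=1.2, size=100_000) + 1) * 10

for name, sample in [("normal", normal), ("power law", heavy)]:
    print(f"{name:10s} median = {np.median(sample):8.1f}   mean = {np.mean(sample):8.1f}")
```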

Let’s move on. Now we find two new concepts: monthly visits and monthly visitors.

Source: https://docs.safegraph.com/docs/patterns-summary-statistics

Bearing in mind the previous numbers for July 2021 (≈16M median devices seen daily), and having 38M monthly visitors (I assume here that one device represents one visitor), each device/visitor is visible on just 16/38 ≈ 42% of the days (on average). That is, a single user will be visible on about 12 days each month, with no data for the other 18 days. So, 2 out of every 5 days.

Since this visibility index will most probably follow a power law, we assume here (and will see later) that most of the devices are almost invisible for most of the month.

If we now briefly pay attention to the visits, we can perform a quick calculation: 1,000M/40M = 25 visits per visitor… each month. This already yields a conclusion: each person visits (is visible in) less than one place (POI) per day.
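
A short back-of-envelope script reproduces these numbers, using the figures as quoted above (the article rounds 38M visitors up to 40M in its own division, hence 25 rather than ~26):

```python
# Figures quoted above from the SafeGraph summary statistics for July 2021
# (rounded; the article rounds 38M visitors up to 40M in its own division).
median_devices_daily = 16e6      # median devices seen on a given day
monthly_visitors     = 38e6      # unique visitors in the month
monthly_visits       = 1_000e6   # total visits in the month (~1B)
days_in_month        = 30

visible_share   = median_devices_daily / monthly_visitors   # ~42%
visible_days    = visible_share * days_in_month              # ~12-13 days
visits_per_user = monthly_visits / monthly_visitors          # ~26 (article: 25)

print(f"share of visitors visible on a typical day: {visible_share:.0%}")
print(f"days of visibility per visitor per month:   {visible_days:.0f}")
print(f"visits per visitor per month:               {visits_per_user:.0f}")
print(f"visits per visitor per day:                 {visits_per_user / days_in_month:.2f}")
```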

I raise a few questions here:

  1. How many different places do you visit every month?
  2. And how many of those places do you repeat at least twice? Think about it.
  3. Would you consider this data as reliable?

Data accuracy

An additional component of location data USPs is related to GPS precision/accuracy. I’m not going to extend myself too much on this point, since I prefer visual examples, but when we talk about 30-meter precision[4], it represents a circle with a 30-meter radius. That kind of accuracy is perfectly achievable in open areas with no buildings, after a few minutes. But we face two issues here:

  1. Most of the shops in densely populated urban areas are less than 30 meters apart from each other:
Figure 1. Times Square (NY): 7th Avenue is more than 30 meters wide

2. GPS rarely works as expected in densely populated urban areas:

Figure 2: https://support.strava.com/hc/en-us/articles/216917707-Bad-GPS-Data

3. And even St. Google creates weird scenarios:

Figure 3. Screenshots taken by our CTO Alberto Hernando on a trip between Munich and Augsburg. The distance between Google’s best guess and the rail tracks is approximately 120 meters after 3 minutes

Even industry advocates such as the Mobile Marketing Association acknowledge that “…up to 60% of ad requests contain some form of location data. Of these requests, less than 1/3 is accurate within 50–100 meters of the stated location”.[5] Note: this report is 5 years old; if anyone finds a more up-to-date analysis, I would really appreciate it.
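
Putting the two MMA figures together gives an upper bound that is worth spelling out explicitly:

```python
# Combining the two MMA figures quoted above: at most 60% of ad requests carry
# any location at all, and fewer than one third of those are accurate to
# within 50-100 m of the stated location.
share_with_location     = 0.60
share_accurate_of_those = 1 / 3   # upper bound ("less than 1/3")

print(f"at most {share_with_location * share_accurate_of_those:.0%} of all ad "
      f"requests carry location accurate to within 50-100 m")  # under 20%
```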

Data clustering

Finally, location data companies’ USPs include clustering as one of the most important features of their data, based essentially on supposedly accurate individual information (but discontinuous, as we have seen) and on accurate GPS precision (no comment). I agree that the ideal world, when it comes to clustering, would look like this:

Source: www.unacast.com

Unfortunately, reality looks more like this:

Source: Kido Dynamics based on SDK third party data

This plot is based on one month of data from 60M users in the largest city in Brazil. It’s important to bear in mind that for each of these points, it would be necessary to add a 10-meter radius buffer.

If you see any cluster or underlying pattern here, bear in mind two aspects (a small simulation after this list illustrates the first one):

  • Even totally random data can gather in clusters[6]
  • Apophenia is a well-known cognitive bias[7]
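
As promised, a quick simulation makes the first point tangible: scatter points uniformly at random and some grid cells will still end up noticeably denser than the average, purely by chance. The grid size and point count below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(42)

# 10,000 points scattered uniformly at random over a unit square:
# no underlying structure whatsoever, by construction.
points = rng.uniform(0.0, 1.0, size=(10_000, 2))

# Count how many points fall into each cell of a 20x20 grid.
counts, _, _ = np.histogram2d(points[:, 0], points[:, 1], bins=20)

print(f"average points per cell: {counts.mean():.1f}")  # 25 by construction
print(f"emptiest cell:           {counts.min():.0f}")
print(f"densest cell:            {counts.max():.0f}")    # well above the average

# The densest cells would look like "clusters" on a map, yet the data is pure noise.
```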

Alternatively, some companies use different simplifications or aggregations to reach a significant enough number of users (also known as “critical mass”). To that end, they aggregate data within a certain geographic area, such as in the example below:

Source: https://www.placekey.io

In this case, any point of interest (POI) falling within one of those hexagons will report the exact same number of unique visitors, visits, frequency, etc.
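
The consequence is easy to reproduce. The sketch below uses crude square cells instead of Placekey’s hexagons (to keep it dependency-free), and the POIs and visit counts are invented, but the effect is the same: every POI sharing a cell inherits identical aggregate numbers.

```python
from collections import defaultdict

# Hypothetical POIs (name, lat, lon). Coordinates are invented for the example.
pois = [
    ("coffee shop", 40.75801, -73.98551),
    ("pharmacy",    40.75812, -73.98547),   # a few meters from the coffee shop
    ("bookstore",   40.76110, -73.97990),
]

def cell_id(lat: float, lon: float, size_deg: float = 0.005) -> tuple:
    """Crude square-cell stand-in for a hexagonal tiling such as H3/Placekey."""
    return (round(lat / size_deg), round(lon / size_deg))

# Aggregate (invented) visit counts per cell rather than per POI.
visits_per_cell = defaultdict(int)
for name, lat, lon in pois:
    visits_per_cell[cell_id(lat, lon)] += 100   # pretend each POI contributed 100 visits

# Every POI in the same cell now reports the identical aggregate figure.
for name, lat, lon in pois:
    print(f"{name:12s} -> cell {cell_id(lat, lon)} reports "
          f"{visits_per_cell[cell_id(lat, lon)]} visits")
```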

This approach is neither good nor bad, but it is misleading to a certain extent if the data is shared as if it were specific to a given POI, as in this case:

Source: www.safegraph.com

Data clarity and representativeness

In a remarkable exercise of transparency, Quadrant (www.quadrant.io) shared, until February 2020, a comprehensive analysis they call the “Data Quality Dashboard”. It looks like this:

For this month, they include 2 additional metrics: daily active users (DAU) and monthly active users (MAU). DAU equals 59.6M and MAU equals 400M, which gives a DAU/MAU ratio of 14.9%.

If we pay attention to the graphics above, we can extract some very quick conclusions (a quick arithmetic check follows the list):

  • 60% of the devices have fewer than 10 events per day, that is, less than one event per hour
  • 70% of the devices have <20 meters accuracy, which is pretty good
  • 59% of the devices are visible just ONE DAY a month, and only 0.77% are visible every day (this represents 3.08M of those 400M)
  • 52% of the devices are visible just ONE HOUR a day, and only 0.79% are visible 24 hours a day
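
As announced above, the arithmetic behind those figures is straightforward (values taken from the dashboard as quoted; the rest is a simple check):

```python
# Figures from Quadrant's Data Quality Dashboard as quoted above.
dau = 59.6e6   # daily active users
mau = 400e6    # monthly active users

print(f"DAU/MAU ratio:             {dau / mau:.1%}")          # 14.9%

share_visible_every_day = 0.0077
print(f"devices visible every day: {share_visible_every_day * mau / 1e6:.2f}M "
      f"out of {mau / 1e6:.0f}M")                              # ~3.08M
```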

This example raises again the question of “Data Representativeness” or “Sample Size” that we discussed previously, and that some companies simply express as a percentage of the total population:

Source: https://shop.safegraph.com/

We encounter here a more complex and less straightforward question: since human behavior, internet activity, and hence app visibility all follow a power law (as plotted in the previous chart), estimates of population parameters are likely to be biased unless the size and scale limitations of such data are properly taken into account. Moreover, increasing the sample size will not solve this problem, since power laws are scale invariant.
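
A small simulation, using an illustrative Pareto tail exponent, shows why simply adding more devices does not buy representative estimates: the sample mean of a heavy-tailed quantity keeps fluctuating even for very large samples.

```python
import numpy as np

rng = np.random.default_rng(7)

# Heavy-tailed "activity per device" with an illustrative tail exponent.
# For a tail exponent this close to 1 the mean exists but the variance does
# not, so sample means converge extremely slowly and remain unstable.
def sample_mean(n: int) -> float:
    return float(np.mean(rng.pareto(a=1.1, size=n) + 1))

for n in (1_000, 10_000, 100_000, 1_000_000):
    print(f"n = {n:>9,d}   sample mean ≈ {sample_mean(n):6.1f}")
# Unlike a Gaussian sample, the estimates keep drifting instead of settling,
# so "we track N million devices" says little about representativeness.
```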

(Un) Timely data

One final point regarding location intelligence is about “reaching the right person, at the right time, in the right place”.

We have seen so far that part of the value proposition is simply nonexistent: the level of randomness associated with users’ visibility across multiple apps, in multiple locations, over multiple days makes it impossible to assert this principle of timeliness.

Or not quite: you might remember that most digital ads companies talk about CPM (cost per mille, i.e. cost per thousand impressions) and similar parameters when it comes to digital campaigns. Well, that’s the answer: if you can release a big enough number of advertisements, you can be sure that a significant number will hit your expected cohort.

But this is not due to machine learning, deep learning, neural networks, or any other fancy word: it’s called the “law of large numbers”. Do you remember why all these companies were so keen on millions of devices and billions of events?
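
A back-of-envelope with invented campaign figures makes the point: even if only a small fraction of impressions reaches the intended cohort, a large enough campaign still hits it a large number of times, with a relative spread that shrinks as the campaign grows.

```python
# Invented campaign figures, purely to illustrate the law of large numbers.
impressions  = 10_000_000   # ads served ("a big enough number of advertisements")
cohort_share = 0.02         # assumed chance that a single impression reaches
                            # someone in the target cohort

expected_hits = impressions * cohort_share
# For independent impressions the hit count is roughly Binomial(n, p), and its
# relative spread shrinks like 1/sqrt(n): the law of large numbers at work.
std_dev = (impressions * cohort_share * (1 - cohort_share)) ** 0.5

print(f"expected cohort hits: {expected_hits:,.0f} ± {std_dev:,.0f}")
```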

Data conclusions

So why such secrecy and opacity regarding data accuracy, reliability, representativeness, etc.? The answer is clear: data quality is incredibly poor, and the only way to somewhat hide this reality is by fooling people with big numbers.

Google, Facebook, and Amazon have created an unnecessary obsession with micro-targeting and hyper-profiling, but they also play with cohorts and the law of large numbers. People are incredibly random in their behavior, and we can only provide a best guess or a confidence margin when it comes to forecasting what people will do. These companies are simply much better than others because they have much more data, and to some extent more relevant data, but this will definitively not apply to other app publishers.

“Bombarding people with facts and exposing their individual ignorance is likely to backfire. People don’t like too many facts, and they certainly don’t like to feel stupid”[8].

High resolution in space is useless without enough resolution in time. Why? Time gives the context, the motivations, and the story behind why we go where we go. Being very precise only in location misses the insights that make this information useful. Just like the famous mathematician joke:

An engineer and a physicist are in a hot-air balloon.

After a few hours they lose track of where they are and descend to get directions.

They yell to a jogger, “Hey, can you tell us where we’re at?”

After a few moments the jogger responds, “You’re in a hot-air balloon.”

The engineer says, “You must be a mathematician.”

The jogger, shocked, responds, “Yeah, how did you know I was a mathematician?”

“Because it took you far too long to come up with your answer, it was 100% correct, and it was completely useless.”

Even mathematicians will agree with us, since there is a mathematical relationship between the precision in space and the precision in time needed to get meaningful results. Indeed, if we are interested in visits to certain locations and their frequency or popularity, the precision with which we define a location and the precision in the frequency of visits are related and limited by the resolution in space AND in time, much like Heisenberg’s uncertainty relation. (We will discuss the mathematical grounds of this relation in a future technical paper.)

Today, the only source of data able to guarantee this equilibrium in the precision of both location and frequency is mobile phone network signaling data. It sacrifices some resolution in space but offers high-frequency data in time (50 to 500 events per day, on average). To be specific, mobile phone data makes 80% of people “visible” at least 8 hours a day, 70% visible at least 16 hours a day, and 50% of users visible on ALL DAYS of each month.

This richness (and broadness and depth) is essential when it comes to making sound decisions in areas such as traffic modelling, retail expansion, or tourism evaluation. It’s necessary to evaluate the quality of the data we are using before we make our decisions, and to be critical if we do not feel comfortable with that data.

Next time you acquire a data set to improve your operations, performance, or strategy, be sure you don’t feel stupid afterwards.


ignacio.barrios

Engineer and entrepreneur, I try to democratise big data and boost a data-driven economy.