In iOS 14.3, Apple added their new app privacy details to App Store listings. App privacy details, which are sometimes compared to the nutritional labels on foodstuff, are details about the data an app collects and the purposes and use of such data. What can we learn by analysing this data?
From the 14th of December 2020, all new apps and app updates have to provide information on the data the app collects. This is used to power the app privacy details labelling. On Twitter, videos scrolling through the privacy listing for Facebook circulated immediately after the 14.3 release.
This system is somewhat flawed, because app developers can, at least in theory, lie about the data they collect. Some apps that profess to collect no data turn out to collect a fair amount if you read their privacy policy. However, the punishment for being caught lying, removal from the App Store, is a strong deterrent, and it’s reasonable to assume most developers have been truthful in their accounts.
An interesting side effect of this is that Apple has now made available, as simple structured data that can be parsed and analysed, the same information found in terse and hard-to-parse privacy policies. In this post I will do just that: collect and analyse the privacy details for thousands of the most popular apps on the App Store.
Collecting the Data
If you just want to read the juicy details feel free to skip to the analysis.
Apple makes the privacy labelling data available for each app on the App Store via an API used by the App Store apps. By reverse engineering the App Store apps I’ve figured out how to make the API divulge this data on a per app basis.
This only gets me the privacy data for a single app, but I want to analyse popular apps. A good source of popular apps is the charts the App Store provides for each app category. An example of this is “Top Free” apps in “Education”. These listings contain up to 200 apps per category and price point (i.e. free or paid).
On the UK store, which is the store I’ve used for all this analysis, there are 25 categories, each of which has top charts with up to 200 paid and 200 free apps. This means the theoretical total number of apps is 10,000 (25 categories × 2 price points × 200 apps). However, because some apps occupy chart positions in multiple categories and because the charts also contain app bundles, the actual number is lower.
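The arithmetic behind the theoretical maximum, and the reason the real number is lower, can be sketched as follows. The chart contents and app ids here are made-up placeholders, not data from the App Store, and `unique_app_ids` is a hypothetical helper.

```python
CATEGORIES = 25           # categories on the UK store
PRICE_POINTS = 2          # free and paid
MAX_APPS_PER_CHART = 200  # upper bound per chart

theoretical_total = CATEGORIES * PRICE_POINTS * MAX_APPS_PER_CHART  # 10000


def unique_app_ids(charts):
    """Return the set of distinct app ids across all charts."""
    return {app_id for app_ids in charts.values() for app_id in app_ids}


# Hypothetical charts: app 2 holds a position in two different charts,
# so it is only counted once.
example_charts = {
    ("Education", "free"): [1, 2, 3],
    ("Games", "free"): [2, 4],
}
unique = unique_app_ids(example_charts)
```

Five chart positions collapse to four unique apps in this toy example, which is the same effect that keeps the real total below 10,000.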
The full list of categories is:
Category |
---|
Book |
Business |
Developer Tools |
Education |
Entertainment |
Finance |
Food & Drink |
Games |
Graphics & Design |
Health & Fitness |
Lifestyle |
Magazines & Newspapers |
Medical |
Music |
Navigation |
News |
Photo & Video |
Productivity |
Reference |
Shopping |
Social Networking |
Sports |
Travel |
Utilities |
Weather |
Structure of the Data
If you don’t care about the exact details and structure of the data feel free to skip to the analysis.
The structure of the data returned by the App Store API is
The `<privacy-types>` section of this document is the important bit. It’s an array where each item has the following structure.
The `<string-identifier>` is one of
Identifier |
---|
DATA_LINKED_TO_YOU |
DATA_NOT_COLLECTED |
DATA_NOT_LINKED_TO_YOU |
DATA_USED_TO_TRACK_YOU |
`DATA_NOT_COLLECTED` is used as a marker, in which case `dataCategories` and `purposes` are both empty and this is the only element in the `privacyDetails` array.
`DATA_USED_TO_TRACK_YOU` contains details on data used to track you across websites and apps owned by other companies. Apple’s description is “The following data may be used to track you across apps and websites owned by other companies:”. For this entry `purposes` will be empty and `dataCategories` contains the different data types that are tracked across apps and websites owned by other companies.
`DATA_LINKED_TO_YOU` and `DATA_NOT_LINKED_TO_YOU` both contain data types with purpose-specific granularity. This means that `dataCategories` will be empty and the different data types are in `purposes`. Apple’s descriptions for `DATA_LINKED_TO_YOU` and `DATA_NOT_LINKED_TO_YOU` are “The following data, which may be collected and linked to your identity, may be used for the following purposes:” and “The following data, which may be collected but is not linked to your identity, may be used for the following purposes:” respectively.
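To make the rules above concrete, here is a minimal sketch of how one app’s privacy types could be walked. Only `dataCategories` and `purposes` are key names taken from the description above; the `identifier` and `dataTypes` keys, the helper names, and the example entries are assumptions for illustration and may not match the real API response.

```python
def data_types(categories):
    """Flatten category objects into (category, data type) pairs.

    Assumes each category looks like
    {"identifier": "...", "dataTypes": ["...", ...]} (hypothetical keys).
    """
    return [(c["identifier"], t) for c in categories for t in c["dataTypes"]]


def summarise(privacy_types):
    """Summarise the privacy types of a single app."""
    summary = {"collects_nothing": False, "tracked": [],
               "linked": [], "not_linked": []}
    for entry in privacy_types:
        ident = entry["identifier"]
        if ident == "DATA_NOT_COLLECTED":
            # Marker entry: dataCategories and purposes are both empty.
            summary["collects_nothing"] = True
        elif ident == "DATA_USED_TO_TRACK_YOU":
            # Data types live directly in dataCategories; purposes is empty.
            summary["tracked"] = data_types(entry["dataCategories"])
        elif ident in ("DATA_LINKED_TO_YOU", "DATA_NOT_LINKED_TO_YOU"):
            # Data types are nested under each purpose.
            key = "linked" if ident == "DATA_LINKED_TO_YOU" else "not_linked"
            summary[key] = [
                (p["identifier"], category, data_type)
                for p in entry["purposes"]
                for category, data_type in data_types(p["dataCategories"])
            ]
    return summary


# Made-up entries for a fictional app.
example = [
    {"identifier": "DATA_USED_TO_TRACK_YOU", "purposes": [],
     "dataCategories": [
         {"identifier": "FINANCIAL_INFO", "dataTypes": ["Payment Info"]}]},
    {"identifier": "DATA_LINKED_TO_YOU", "dataCategories": [],
     "purposes": [
         {"identifier": "ANALYTICS", "dataCategories": [
             {"identifier": "USER_CONTENT",
              "dataTypes": ["Audio Data", "Customer Support"]}]}]},
]
summary = summarise(example)
```

Flattening everything into `(purpose, category, data type)` tuples is a design choice that keeps later counting simple, since the analysis cares about the combination of data type and purpose.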
`<data-purposes>` is an array of purposes with the following structure:
The different values for `<purpose-identifier>` are:
Purpose |
---|
ANALYTICS |
APP_FUNCTIONALITY |
DEVELOPERS_ADVERTISING |
OTHER_PURPOSES |
PRODUCT_PERSONALIZATION |
THIRD_PARTY_ADVERTISING |
These are described by Apple in their documentation, but I’ve added them here for completeness.
Purpose | Definition |
---|---|
Third-Party Advertising | Such as displaying third-party ads in your app, or sharing data with entities who display third-party ads |
Developer’s Advertising or Marketing | Such as displaying first-party ads in your app, sending marketing communications directly to your users, or sharing data with entities who will display your ads |
Analytics | Using data to evaluate user behavior, including to understand the effectiveness of existing product features, plan new features, or measure audience size or characteristics |
Product Personalization | Customizing what the user sees, such as a list of recommended products, posts, or suggestions |
App Functionality | Such as to authenticate the user, enable features, prevent fraud, implement security measures, ensure server up-time, minimize app crashes, improve scalability and performance, or perform customer support |
Other Purposes | Any other purposes not listed |
Lastly, `<data-categories>` is an array of objects with the following structure:
The full list of data types and the categories they belong to is:
Let’s do some analysis of this data
Analysis
Last Updated: 7th of January 2021. Added data for Games, which was previously missing, and extended the analysis of all apps to a larger data set.
The data set I’ve collected contains 9477 combinations of apps and a position in a given category chart. In total there are 9435 unique apps in this data set.
Most charts contain 200 or nearly 200 apps. However, Graphics & Design(Paid), Developer Tools(Paid), and Magazines & Newspapers(Paid) all have fewer than 90 apps, so I’m dropping them from further analysis.
Because the privacy details have only been required for new apps and updates since mid-December, not all apps have privacy details. After removing those that lack them, 3370 apps remain in the data set. Breaking this down by chart, several charts have fewer than 25 apps, so I am dropping them from further analysis too. This leaves 3233 apps in the data set.
In total the following charts have been dropped:
- Education(Paid)
- Navigation(Paid)
- Sports(Paid)
- Business(Paid)
- Food & Drink(Paid)
- Shopping(Paid)
- Medical(Paid)
- Magazines & Newspapers(Paid)
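The filtering steps described above can be sketched like this. It is a minimal illustration, assuming each chart entry is a `(chart, app_id, has_privacy_details)` tuple; that shape, the function name, and the example rows are hypothetical, while the two default thresholds come from the text.

```python
from collections import Counter


def filter_charts(rows, min_chart_size=90, min_labelled=25):
    """Apply the two chart filters to (chart, app_id, has_details) rows."""
    rows = list(rows)

    # Step 1: drop charts that contain too few apps overall.
    sizes = Counter(chart for chart, _, _ in rows)
    rows = [r for r in rows if sizes[r[0]] >= min_chart_size]

    # Step 2: keep only apps that have published privacy details.
    labelled = [r for r in rows if r[2]]

    # Step 3: drop charts left with too few labelled apps.
    labelled_sizes = Counter(chart for chart, _, _ in labelled)
    return [r for r in labelled if labelled_sizes[r[0]] >= min_labelled]


# Tiny made-up example with lowered thresholds so the effect is visible.
example_rows = [
    ("A", "a1", True), ("A", "a2", True), ("A", "a3", False),
    ("B", "b1", True),
    ("C", "c1", True), ("C", "c2", False), ("C", "c3", False),
]
kept = filter_charts(example_rows, min_chart_size=2, min_labelled=2)
```

In the toy data, chart B is dropped for being too small and chart C is dropped because only one of its apps has privacy details.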
For the analysis there are a few different data points that are interesting:
- Apps that collect data that is linked to the user, and how many such data types they collect.*
- Apps that collect no data.
- Third-party tracking, i.e. tracking users across apps and websites owned by other companies, and how many (max 32) such data types they collect.
* Data that is linked to the user for the purpose of supporting app functionality, that is the `APP_FUNCTIONALITY` purpose, is legitimate and will be excluded from the following analysis. This leaves 160 combinations of data type and purpose (32 data types across 5 purposes).
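As a rough sketch of the exclusion described in the footnote, the snippet below derives the 160 figure and counts an app’s linked data types outside app functionality. The `(purpose, data type)` pair representation and the example declaration are assumptions for illustration.

```python
PURPOSES = [
    "ANALYTICS",
    "APP_FUNCTIONALITY",
    "DEVELOPERS_ADVERTISING",
    "OTHER_PURPOSES",
    "PRODUCT_PERSONALIZATION",
    "THIRD_PARTY_ADVERTISING",
]
DATA_TYPES_PER_PURPOSE = 32

# Five purposes remain once APP_FUNCTIONALITY is excluded.
non_functional = [p for p in PURPOSES if p != "APP_FUNCTIONALITY"]
max_combinations = len(non_functional) * DATA_TYPES_PER_PURPOSE  # 160


def count_non_functional(pairs):
    """Count distinct (purpose, data type) pairs, ignoring APP_FUNCTIONALITY."""
    return len({(p, d) for p, d in pairs if p != "APP_FUNCTIONALITY"})


# Made-up declaration for a fictional app.
example_pairs = [
    ("APP_FUNCTIONALITY", "Payment Info"),
    ("ANALYTICS", "Payment Info"),
    ("THIRD_PARTY_ADVERTISING", "Audio Data"),
]
score = count_non_functional(example_pairs)  # 2
```

Note that the same data type counts once per purpose, so `Payment Info` would count twice if it were declared for two non-functional purposes.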
I am excluding data that is collected but not linked to the user, in part to keep down the length of the analysis and in part because it’s the least interesting. I’ll probably do a follow up post on it later.
The questions I’ll be looking at for this analysis are:
- Do free apps collect more data?
- Which are the worst charts?
- Which apps in the whole data set are the worst?
- Which apps lie subtly about the nature of data they collect?
But first let’s have a quick look at the data set.
As we can see here, most apps collect no data outside of that which supports the app’s functionality. To get a better view of the apps that do collect data, let’s remove the majority of apps that don’t.
Still, the amount of data collected is fairly low, but there’s a curious set of outliers somewhere around 120 data types collected. All of those outliers have something in common; see if you can figure it out before I reveal the answer later in the post.
How about third party tracking?
Again most apps don’t collect any data types for third party tracking. Let’s repeat the process from above by removing those that do no tracking.
Now that we have an overview of the data let’s move on to answer the questions posed above.
Free vs Paid
A fairly common meme in the discourse around free apps is: “if you’re not paying for the product, you are the product”. Facebook is probably the quintessential example of this business model. Facebook makes money not from users paying them, but from advertisers paying to show hyper-targeted ads to Facebook’s users. So is there truth to the meme? Do free apps track more than paid ones? I asked my Twitter followers this question, and most people thought so.
As previously established we’ll look at a few different data points to determine this. First of all, do free apps collect more data types that are linked to the user for non-app functionality purposes?
Yes they certainly do. The median number of such data types for free apps is 3 and for paid apps it’s 0. The mean is impacted by outliers in the free category and is ~8.1 for free apps and ~0.5 for paid apps.
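To see why a handful of outliers moves the mean but not the median, consider this small sketch. The counts are invented for illustration and are not taken from the data set.

```python
from statistics import mean, median

# Hypothetical per-app counts of linked, non-functional data types:
# mostly small values plus one extreme outlier.
counts = [0, 0, 1, 3, 3, 4, 128]

typical = median(counts)  # 3, the middle value of the sorted list
average = mean(counts)    # pulled well above the median by the single 128
```

This is why both statistics are reported: the median describes a typical app, while the mean shows how heavy the tail is.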
If we look at the number of apps that don’t collect data, as a percentage, it’s also clear that paid apps are much more likely to collect no data.
Type | Percentage | # Apps | # Apps that don’t collect |
---|---|---|---|
Free | ~9.1% | 2628 | 240 |
Paid | ~53.9% | 605 | 326 |
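The percentages in the table follow directly from the two count columns. A quick check, using only the numbers from the table (the function name is my own):

```python
def share_not_collecting(total_apps, not_collecting):
    """Percentage of apps in a group that declare no data collection."""
    return 100 * not_collecting / total_apps


free_share = share_not_collecting(2628, 240)
paid_share = share_not_collecting(605, 326)

# The two groups together make up the 3233 apps left after filtering.
total_apps = 2628 + 605
```

Rounded to one decimal place these give the ~9.1% and ~53.9% shown above.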
Lastly, do free apps collect more data types that are used to track users across other apps and websites, i.e. data categories with the identifier `DATA_USED_TO_TRACK_YOU`?
Yes, they do. The median number of data types used to track users across other apps and websites is 1 for free apps and 0 for paid apps. The mean is ~2.3 for free apps, but only ~0.2 for paid apps.
For all three metrics considered, it turns out that my Twitter followers were correct. Free apps do collect more data than paid ones.
Worst Charts
Of the 40 remaining charts in the data set which are the worst? Let’s again start by considering the number of data types collected and linked to the user for non-app functionality purposes.
Chart | Mean | Median |
---|---|---|
Games(Free) | 13.668639 | 8.0 |
Shopping(Free) | 11.938931 | 8.0 |
Travel(Free) | 10.486486 | 6.0 |
Sports(Free) | 9.410000 | 5.5 |
Business(Free) | 11.558559 | 5.0 |
Here Games(Free) is the clear winner, perhaps because of the number of free games that are financed entirely by third-party ads, combined with the high cost of making games. Shopping(Free) is a close second, presumably due to analytics data collected to optimise purchases and checkout experiences.
Another interesting observation here is that the worst 24 charts, sorted by median, are all free. The first paid chart is Games(Paid) at position 25.
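A ranking like the one above can be sketched as follows. The chart names and counts are made up, and breaking ties on the median by the mean is my assumption; it is consistent with the table but not something the data confirms.

```python
from statistics import mean, median


def rank_charts(counts_by_chart):
    """Return (chart, mean, median) tuples, worst chart first.

    Charts are ordered by median, with the mean as a tie-breaker.
    """
    stats = [(chart, mean(v), median(v)) for chart, v in counts_by_chart.items()]
    return sorted(stats, key=lambda s: (s[2], s[1]), reverse=True)


# Hypothetical per-app counts for three fictional charts.
example_counts = {
    "C(Paid)": [0, 0, 1],
    "B(Free)": [8, 8, 10],
    "A(Free)": [8, 8, 20],
}
ranking = rank_charts(example_counts)
```

Both free charts share a median of 8, so the higher mean puts `A(Free)` first, with the paid chart last.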
When considering the percentage of apps in each chart that don’t collect any data, there’s commonality with the above. Games(Free), Shopping(Free), and Travel(Free) all show up again.
Chart | Percentage | #Apps | #Apps that don’t collect |
---|---|---|---|
Games(Free) | ~0.6% | 169 | 1 |
Health & Fitness(Free) | ~2% | 151 | 3 |
News(Free) | ~2.8% | 108 | 3 |
Shopping(Free) | ~3.1% | 131 | 4 |
Travel(Free) | ~3.6% | 111 | 4 |
The same divide between paid and free apps occurs again here. The first paid chart shows up only at position 24 (it’s Games again, Games(Paid) at ~24.3%).
When considering tracking across other apps and websites this is the result:
Chart | Mean | Median |
---|---|---|
Games(Free) | 6.088757 | 6 |
News(Free) | 2.731481 | 3 |
Shopping(Free) | 3.488550 | 2 |
Sports(Free) | 3.020000 | 2 |
Entertainment(Free) | 2.390000 | 2 |
Health & Fitness(Free) | 2.125828 | 2 |
Again Games(Free) is the worst. As I speculated above, the reason is most likely the amount of third-party advertising used in games for monetisation.
The trend with paid vs free apps repeats again: the first paid chart is, you guessed it, Games(Paid) at position 24.
Worst Apps
Let’s now focus on individual apps: which are the absolute worst? While I was fetching the data, I tweeted some preliminary results based on a shallow analysis of response size. By this metric, all of Facebook’s apps were extremely data hungry. Let’s see if that conclusion holds up to more rigorous analysis. This data is based on a larger data set of 5274 apps collected from both the UK and US stores, rather than the smaller data set used for the analysis so far.
Let’s start by again considering data collected and linked to the user for non-app functionality purposes. Here are the top 25 apps by this measure.
App | #Data Types Collected |
---|---|
Facebook Gaming | 128.0 |
(app name missing) | 128.0 |
Facebook Business Suite | 128.0 |
(app name missing) | 128.0 |
Messenger | 128.0 |
Oculus | 128.0 |
Portal from Facebook | 128.0 |
Boomerang from Instagram | 128.0 |
Facebook Adverts Manager | 128.0 |
Threads from Instagram | 128.0 |
Layout from Instagram | 128.0 |
Creator Studio from Facebook | 128.0 |
LinkedIn: Job Search & News | 91.0 |
Scrabble® GO - New Word Game | 60.0 |
Football Index - Bet & Trade | 56.0 |
Klarna \| Shop now. Pay later. | 56.0 |
Draw a Line: Tricky Brain Test | 55.0 |
The Telegraph News | 55.0 |
Ovia Parenting & Baby Tracker | 54.0 |
Ovia Pregnancy Tracker | 54.0 |
Ovia Fertility & Cycle Tracker | 54.0 |
Nectar: Shop & Collect Points | 53.0 |
Full Guide for Cyberpunk 2077 | 53.0 |
NFL | 52.0 |
Ring - Always Home | 51.0 |
Can you spot the pattern? The top 12 apps are all from the same company, Facebook. All of Facebook’s apps collect an ungodly amount of data; the nearest other app is LinkedIn, which collects 37 fewer data types. Keep in mind that the maximum number of data types an app can collect and link to the user outside of app functionality is 160 (32 data types across 5 purposes). Facebook manages to get almost all the way there at 128. All of Facebook’s apps declare the same set of data types collected. It’s easier to look at the data Facebook does not collect than the data they do collect.
Data types not collected by Facebook:
Purpose | Category | Data Type |
---|---|---|
ANALYTICS | FINANCIAL_INFO | Credit Info |
ANALYTICS | USER_CONTENT | Emails or Text Messages |
DEVELOPERS_ADVERTISING | FINANCIAL_INFO | Credit Info |
DEVELOPERS_ADVERTISING | FINANCIAL_INFO | Payment Info |
DEVELOPERS_ADVERTISING | HEALTH_AND_FITNESS | Fitness |
DEVELOPERS_ADVERTISING | HEALTH_AND_FITNESS | Health |
DEVELOPERS_ADVERTISING | SENSITIVE_INFO | Sensitive Info |
DEVELOPERS_ADVERTISING | USER_CONTENT | Audio Data |
DEVELOPERS_ADVERTISING | USER_CONTENT | Customer Support |
DEVELOPERS_ADVERTISING | USER_CONTENT | Emails or Text Messages |
OTHER_PURPOSES | FINANCIAL_INFO | Credit Info |
OTHER_PURPOSES | FINANCIAL_INFO | Payment Info |
OTHER_PURPOSES | HEALTH_AND_FITNESS | Fitness |
OTHER_PURPOSES | HEALTH_AND_FITNESS | Health |
OTHER_PURPOSES | SENSITIVE_INFO | Sensitive Info |
OTHER_PURPOSES | USER_CONTENT | Audio Data |
OTHER_PURPOSES | USER_CONTENT | Emails or Text Messages |
PRODUCT_PERSONALIZATION | FINANCIAL_INFO | Credit Info |
PRODUCT_PERSONALIZATION | FINANCIAL_INFO | Payment Info |
PRODUCT_PERSONALIZATION | HEALTH_AND_FITNESS | Fitness |
PRODUCT_PERSONALIZATION | HEALTH_AND_FITNESS | Health |
PRODUCT_PERSONALIZATION | USER_CONTENT | Audio Data |
PRODUCT_PERSONALIZATION | USER_CONTENT | Customer Support |
PRODUCT_PERSONALIZATION | USER_CONTENT | Emails or Text Messages |
THIRD_PARTY_ADVERTISING | FINANCIAL_INFO | Credit Info |
THIRD_PARTY_ADVERTISING | FINANCIAL_INFO | Payment Info |
THIRD_PARTY_ADVERTISING | HEALTH_AND_FITNESS | Fitness |
THIRD_PARTY_ADVERTISING | HEALTH_AND_FITNESS | Health |
THIRD_PARTY_ADVERTISING | SENSITIVE_INFO | Sensitive Info |
THIRD_PARTY_ADVERTISING | USER_CONTENT | Audio Data |
THIRD_PARTY_ADVERTISING | USER_CONTENT | Customer Support |
THIRD_PARTY_ADVERTISING | USER_CONTENT | Emails or Text Messages |
Data types show up several times here, but remember we care about the combination of data type and purpose.
The complete set of data they collect and link to users for non-app functionality purposes is scarier. Yikes.
This is how the data breaks down for Facebook’s apps:
Purpose/Category | #Data Types |
---|---|
Used to Track(3rd Party) | 7 |
Linked To You(Analytics) | 30.0 |
Linked To You(Developer Advertising) | 24.0 |
Linked To You(Other Purposes) | 25.0 |
Linked To You(Product Personalisation) | 25.0 |
Linked To You(Third Party Advertising) | 24.0 |
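The breakdown above is consistent with the earlier figures, which can be checked with a few lines of arithmetic. The counts are copied from the table; the variable names are my own.

```python
DATA_TYPES_PER_PURPOSE = 32

# Linked-to-you counts per purpose, from the breakdown table.
linked_counts = {
    "ANALYTICS": 30,
    "DEVELOPERS_ADVERTISING": 24,
    "OTHER_PURPOSES": 25,
    "PRODUCT_PERSONALIZATION": 25,
    "THIRD_PARTY_ADVERTISING": 24,
}

total_collected = sum(linked_counts.values())  # 128

# Data types per purpose that Facebook does NOT declare.
not_collected = {
    purpose: DATA_TYPES_PER_PURPOSE - count
    for purpose, count in linked_counts.items()
}
total_not_collected = sum(not_collected.values())  # 32
```

The 32 missing combinations match the 32 rows in the list of data types not collected by Facebook, and 128 + 32 gives the maximum of 160.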
What about apps that track you across apps and websites owned by other companies?
App | #Data Types Tracked |
---|---|
Priceline - Hotel, Car, Flight | 23 |
Paxful Bitcoin Wallet | 23 |
Chime - Mobile Banking | 21 |
Nordstrom Rack | 21 |
Draw Coliseum | 20 |
Nordstrom | 20 |
M&S - Fashion, Food & Homeware | 19 |
Bubble Pop! Puzzle Game Legend | 18 |
Block! Triangle puzzle:Tangram | 18 |
The Bellingham Herald News | 17 |
Fresno Bee News | 17 |
Football Index - Bet & Trade | 17 |
The State News | 17 |
Miami Herald News | 17 |
Bradenton Herald News | 17 |
The Charlotte Observer News | 17 |
The Raleigh News & Observer | 17 |
Lexington Herald-Leader News | 17 |
The Telegraph News | 17 |
Kansas City Star News | 17 |
Fort Worth Star-Telegram News | 17 |
Yelp: Local Food & Services | 17 |
MLB Ballpark | 16 |
onX Backcountry GPS Trail Maps | 16 |
onX Hunt: GPS Tracking Tools | 16 |
Here Facebook’s apps don’t show up; after all, they are the people who facilitate the tracking across apps and websites. Unsurprisingly, all of the above apps are free.
Oxymorons
Astute readers will have noticed that some of the data types are linked to the user by their very nature, and as such the combination of the data category Data Not Linked to You with these data types is an oxymoron.
For example your phone number, email address, name, or physical address are always linked to you even if the data isn’t attached to an identifier. Further, as numerous news outlets, including The New York Times, have reported, users can be re-identified from precise location data with fair ease by picking out locations such as homes and offices. Recovering a user’s identity from the contents of their text messages and emails is also trivial.
Let’s look at apps that collect one of the above data types for the category Data Not Linked to You.
In this data set there are 740 apps that collect such oxymoron data types. The worst three offenders, by count, are KFC: Online food delivery (20), myCricket App (12), and FootballNet QPR (12), although none of them collects precise location, which is reassuring. The Weather Network, OpenSnow, Yanosik, imo video calls and chat, and MyRadar Weather Radar are examples of apps that collect precise location for third-party advertising purposes. In fact, imo video calls and chat and MyRadar Weather Radar collect precise location for every single one of the six different purposes.
Taimi: LGBTQ+ Dating, Chat is an LGBTQ+ dating app that collects precise location for analytics purposes. This is of particular note because in many countries LGBTQ+ people face significant risk if their status is exposed.
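A check for such oxymorons could look roughly like this. It is a sketch under assumptions: the data type labels in `IDENTIFYING` are my guess at how the types named above are spelled in the data, and the `(purpose, data type)` input shape and example declaration are hypothetical.

```python
# Data types that identify a user by their very nature (assumed labels).
IDENTIFYING = frozenset({
    "Name",
    "Email Address",
    "Phone Number",
    "Physical Address",
    "Precise Location",
    "Emails or Text Messages",
})


def oxymorons(not_linked):
    """Return the (purpose, data type) pairs declared under
    DATA_NOT_LINKED_TO_YOU whose data type is inherently identifying."""
    return [(p, d) for p, d in not_linked if d in IDENTIFYING]


# Made-up "not linked to you" declaration for a fictional app.
example_not_linked = [
    ("ANALYTICS", "Precise Location"),
    ("ANALYTICS", "Audio Data"),
    ("THIRD_PARTY_ADVERTISING", "Email Address"),
]
flagged = oxymorons(example_not_linked)
```

Counting the flagged pairs per app gives a per-app score of the kind quoted in brackets for the worst offenders above.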
Conclusion
We’ve learned that a fairly large number of apps collect little or no data that is linked to the user for non-app functionality purposes. However, there are extreme outliers, not the least of which is Facebook.
Free apps collect significantly more data than paid ones and anyone who cares about their privacy should opt to pay for the apps they use.
Games are especially bad as a category, and they stand out in particular for third-party tracking.
The data at this stage is sparse, because only about 1/3 of the apps in the data set have added privacy details. In the future this will reach close to 100% and I will redo this analysis.
If you have other questions about the data that you’d like me to cover please DM me on Twitter and I’ll try to add them to this post, or if there are a lot of questions I’ll do a follow up post.
Now, if you’ll excuse me, I’m off to uninstall Instagram.