How FLoC incentivises more, not less tracking
Google recently published a new standard proposal called “Federated Learning of Cohorts”, FLoC for short. Google would like to see this standard adopted across the wider internet to keep its ad/tracking business afloat now that basically every browser, soon including Google’s own Chrome and Chromium browsers, blocks 3rd-party cookies.1
How does FLoC work?
FLoC is designed to group web users into so-called “cohorts”, each represented by an abstract number that, according to the standard proposal, is calculated from the user’s 7-day2 browsing history along with content from those pages.
The calculation in the Chromium browser is done using a function called “SimHash”, which generates an abstract number (the hash) that is supposed to be non-reversible and indicates how similar the inputs to the algorithm are. To reduce the privacy impact, the output cohort IDs are supposed to be generated in a way that each of them represents thousands of people, ensuring that no individual person is identified by a single cohort ID.
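To make the grouping idea concrete, here is a minimal SimHash sketch in Python. It is an assumed simplification, not Chromium’s actual implementation: the bit width, the feature extraction and the use of MD5 are all illustrative choices. The core property is that each input feature votes on every bit of the output, so overlapping browsing histories tend to land on nearby hashes.

```python
import hashlib

HASH_BITS = 16  # illustrative; the real cohort ID space is larger

def simhash(features):
    # Running weight per output bit: each feature's hash votes
    # +1 if it has the bit set, -1 otherwise.
    weights = [0] * HASH_BITS
    for feature in features:
        digest = int(hashlib.md5(feature.encode()).hexdigest(), 16)
        for bit in range(HASH_BITS):
            weights[bit] += 1 if (digest >> bit) & 1 else -1
    # The sign of each accumulated weight becomes the output bit.
    return sum(1 << bit for bit, w in enumerate(weights) if w > 0)

def hamming(a, b):
    # Number of differing bits between two hashes.
    return bin(a ^ b).count("1")

history_a = ["news.example", "shop.example", "mail.example", "blog.example"]
history_b = ["news.example", "shop.example", "mail.example", "forum.example"]
history_c = ["cars.example", "games.example", "bank.example", "travel.example"]

print(hamming(simhash(history_a), simhash(history_b)))
print(hamming(simhash(history_a), simhash(history_c)))
```

Identical histories always produce the same hash, and mostly-overlapping histories tend to produce a smaller Hamming distance than disjoint ones, which is exactly what lets a hash value stand in for a group of similar users.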
Additionally, the cohort ID should be cached for a week, which ensures that visiting a new website doesn’t immediately change the cohort ID, which might leak information about the browsing history.3
Further, browser histories that cover too short a time span or contain too few entries should result in an invalid cohort ID.3
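The caching and validity rules from the two paragraphs above can be sketched as follows. This is assumed pseudologic, not Chromium’s actual code, and the minimum-entries threshold is a made-up value:

```python
from datetime import datetime, timedelta

MIN_HISTORY_ENTRIES = 7          # assumed threshold, illustrative only
CACHE_LIFETIME = timedelta(days=7)

def cohort_id_for(history, cached_id, cached_at, now, compute):
    # Serve the cached ID while it is fresh, so a newly visited site
    # cannot immediately shift the cohort and leak a history change.
    if cached_id is not None and now - cached_at < CACHE_LIFETIME:
        return cached_id
    # Too little history: refuse to emit a valid cohort ID at all.
    if len(history) < MIN_HISTORY_ENTRIES:
        return None
    return compute(history)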
And, in case this wasn’t clear yet, of course everything is calculated locally in the browser and only the cohort ID is exposed to a website that wants to show ads.
Why is FLoC still a problem?
Even with all these safeguards, FLoC is by no means really privacy-friendly. For example, individual users can still be tracked by monitoring changes in their cohort ID over time,4 5 or by abusing potential errors in the cohort ID assignment that result in too-small groups, since the current number of potential cohort IDs according to the standard is 2^32, or roughly 4 billion.
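The tracking-over-time attack from footnote 4 is easy to sketch. In this hedged example, all account names and cohort IDs are made up; the point is that two sites which each see a pseudonymous account alongside the weekly cohort ID can link those accounts by matching the sequence of IDs:

```python
# Weekly cohort IDs observed per account (all values made up):
news_site = {"John Smith": [1423, 88210, 5071]}      # real-name account
cancer_forum = {"person123": [1423, 88210, 5071],    # same browser
                "otheruser": [977, 41432, 12]}

def link_accounts(site_a, site_b):
    # Identical week-by-week cohort ID sequences strongly suggest the
    # same browser, and therefore the same person, behind both accounts.
    return [(a, b) for a, ids_a in site_a.items()
                   for b, ids_b in site_b.items() if ids_a == ids_b]

print(link_accounts(news_site, cancer_forum))
# → [('John Smith', 'person123')]
```

With enough weeks of observation, the chance of two different browsers accidentally sharing the whole sequence becomes negligible, which is why cohort IDs that are individually “anonymous” still enable re-identification.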
Besides that, it also strengthens Google’s quasi-monopoly position in the ad industry even further, since Google’s browser, which generates the cohort IDs, holds the majority of the user base. This gives Google an unfair competitive advantage: it can test new algorithms for cohort IDs earlier, correlate cohort IDs to interests better thanks to browser history synchronisation,6 and break competitors’ cohort ID correlations by strategically modifying parameters of the SimHash algorithm to “avoid sensitive information”.7
On the topic of “sensitive information”, which has its own section in the standard,8 the idea is that certain sensitive topics are not considered for the cohort IDs, or that cohort IDs of groups interested in sensitive topics aren’t exposed. Obviously there is no exact definition of what “sensitive topics” are in the standard proposal, as this is left to the implementation to decide.9 This can easily become a quite complicated debate. Is homosexuality considered a sensitive topic in countries that punish it with the death penalty? Does this mean a wrong browser setting can prevent you from being informed by ads about certain health campaigns in your country? Does this mean governments might start investigating people that fit into certain cohorts in combination with some other factors?10 11
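The two filtering strategies the standard hints at can be sketched like this. The logic and all domain names and IDs are assumptions for illustration, not the real implementation:

```python
SENSITIVE_DOMAINS = {"clinic.example", "support-group.example"}  # made up
BLOCKED_COHORTS = {4242}  # cohorts deemed to correlate with sensitive topics

def filtered_history(history):
    # Strategy 1: sensitive pages never enter the cohort calculation.
    return [site for site in history if site not in SENSITIVE_DOMAINS]

def exposed_cohort(cohort_id):
    # Strategy 2: sensitive cohort IDs are never exposed to websites.
    return None if cohort_id in BLOCKED_COHORTS else cohort_id
```

Either way, someone has to maintain the list of what counts as “sensitive”, which is exactly where the definitional debate above comes in.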
Legal concerns also exist in other areas. The European GDPR, for example, has a principle called “privacy by design” that should be applied to all existing and newly introduced products. But FLoC is opt-out, not opt-in, making it rather questionable whether it will ever be fully compliant with the GDPR. These concerns could be the reason why all test subjects for this proposed new standard were selected outside of the EU.
How FLoC incentivises more tracking, not less
By design, cohort IDs are meant to be meaningless. But that’s not how targeted advertising works. So in order to continue existing business practices, ad companies will try to “figure out” what kind of people are in a cohort in order to sell their products. To do that, they need to use various tracking mechanisms and identify people across multiple sites to guess their interests, age group, sex/gender, … all the attributes that companies currently use to target ads.
Currently they do this without cohort IDs, using third-party cookies instead. But those will vanish due to missing browser support. Current trends show that CNAME cloaking and first-party ad integration are the ad industry’s “solution” to that, and this is also how I expect Google to handle it. Google is able to correlate cohort IDs to all those target groups using the information collected through YouTube, uploaded browser histories6 and search results.
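For reference, CNAME cloaking works by making a third-party tracker look like a first-party subdomain via DNS. A hypothetical zone file entry (all names made up) could look like this:

```
; "First-party" subdomain of news.example that is really just an alias
; for a third-party tracker, dodging third-party cookie blocking:
metrics.news.example.   3600  IN  CNAME  collect.tracking-company.net.
```

Requests to metrics.news.example carry first-party cookies, even though they end up at the tracking company’s servers.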
With this first-party tracking data, it even becomes easier to track interests that were previously harder to track, since cohort IDs give more insight into people’s browsing history than existing 3rd-party cookies, which were easy to block.
While Google claims that this proposal could remove the ad industry’s need for 3rd-party cookies and help improve privacy, in reality the main benefit of this proposal lies with Google: it outsources its cohort ID calculation to the browser instead of doing it server-side with 3rd-party cookies, disables 3rd-party cookies to hurt the competition, and, if FLoC becomes a commonly implemented web standard, hurts those who already blocked third-party cookies for privacy reasons, as the replacement method suddenly makes them trackable again.
All in all, with this proposal there is nothing in it for consumers. There is nothing in it for advertisers. But there is a lot in it for Google. And the only reason for a browser vendor to implement this standard is being Google, or being interested in improving business relationships with Google.
PS: I had some fun with FLoC in the past 2 weeks, both implementing a warning banner and an extension for Firefox that “implements the standard proposal”. I recommend checking out my Mastodon thread about the whole topic.
Photo by Pop & Zebra on Unsplash
At least according to Justin Schuh, director of engineering for Chrome: https://www.theverge.com/2020/1/14/21064698/google-third-party-cookies-chrome-two-years-privacy-safari-firefox ↩
Claimed by EFF’s https://amifloced.org/ ↩
As recommended by the proposed standard: https://wicg.github.io/floc/#recovering-the-browsing-history-from-cohorts ↩ ↩2
Think of two accounts on two platforms. One is a news magazine where one uses their real name “John Smith”, the other a forum about cancer where one uses a random username like “person123”. An advertiser with access to both usernames as part of their tracking is able to correlate the weekly cohort ID changes over time, making it easy to re-identify a person uniquely. ↩
This was pointed out in this GitHub issue: https://github.com/WICG/floc/issues/100 ↩
In the so-called “My Activity” view, you can see what Google knows about you, including your search, browsing, map and app history, unless you have explicitly disabled it. Wonder how to do that? There you go: https://support.google.com/accounts/answer/465 ↩ ↩2
A quite interesting point there is that Google already uses cohort-based systems, but generated server-side using third-party cookies, as mentioned in an issue: https://github.com/WICG/floc/issues/104#issuecomment-822535497 ↩
In standard: https://wicg.github.io/floc/#sensitive-information ↩
Google reference to what is considered “sensitive” during the PoC phase: https://support.google.com/adspolicy/answer/143465?hl=en ↩
A proposal to define sensitive categories on GitHub: https://github.com/WICG/floc/issues/5 ↩
General finding from the “PING audit”: https://github.com/WICG/floc/issues/71 ↩