The Taiwan Banker

Data clean rooms can prevent AI from learning too much

2024.09 The Taiwan Banker NO.177 / By David Stinson

For a decade, privacy regulators have implored internet platforms to share less information. The EU's General Data Protection Regulation (GDPR) remains the gold standard for privacy regulation, even as copycats have emerged in countries around the world and at the US state level in California. GDPR has helped motivate entire fields of research that can now demonstrate the harm of privacy breaches in detail. In the past several years, however, due in part to the solutions to those privacy concerns, as well as perhaps some philosophical changes in the US, anti-trust regulation has emerged as a more powerful force acting on the tech industry. It takes a seemingly opposite position: platforms cannot hold their data solely for their own use, and must allow other industry players to enjoy its benefits.

Where does this leave the tech industry? The ability to square this conflicting set of mandates seems to come down not to law, but to math: what useful conclusions can be preserved after stripping out private information? The answer will help shape the future of the consumer-facing internet, including the corner that makes up finance. One solution being considered is the data "clean room," which allows data to be manipulated after personal information has been removed.

Adversaries may learn information theory

The core problem of privacy is the real possibility of adversarial attacks. Consider a reconstruction attack on a database that has removed personal identifiers in order to model sensitive information (Fig. 1). Depending on the characteristics of the features corresponding to each (unnamed) individual, it might be possible to match personal information contained in a separate database against that research database.

[Fig. 1: Reconstruction attack scheme. The grey squares indicate obfuscated data.]

In practice, the data curator will also have obfuscated the data to make such associations less obvious, while still retaining the statistical properties of the full dataset, which might be useful for causal inference or AI model training. If the data involved financial transactions, for instance, the amounts could be adjusted within a fixed percentage range, a parameter that could also be disclosed to the downstream user, without affecting overall averages. A reconstruction attack covers the cases where such protections fail.

Real multivariate data, unfortunately, is more complex than that simple case. Many attacks on data involve what one could call creativity, which is simply to say that the attacker makes full use of the information the data contains, even when it does not take an obvious form. Information theory uses an expansive definition of information, covering interactions between variables as well as their relationships with the real world. The field of differential privacy has therefore emerged to provide mathematical guarantees about data anonymization, grounded in information theory.

Through its mathematical foundations, differential privacy takes a somewhat different view of privacy risk than the GDPR. The GDPR usually defines data "anonymization" as deleting the original database entirely after any obfuscation process; without doing so, user permission would still be required to manipulate the resulting data. Differential privacy makes no assumptions about whether a second, identifying database is known, or even exists, at the time the first one is being used. Its definition of privacy is more aligned with the possibility of privately sharing data, where deleting the original data would be impractical.
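To make that guarantee concrete, the sketch below applies the Laplace mechanism, the textbook building block of differential privacy, to a simple counting query. The transaction amounts, the NT$50,000 threshold, and the epsilon values are illustrative assumptions, not parameters drawn from any regulation or product.

```python
# A minimal sketch of the Laplace mechanism over hypothetical transaction data.
import numpy as np

rng = np.random.default_rng(0)

def dp_count(values, predicate, epsilon):
    """Return an epsilon-differentially-private count of matching records.

    A counting query has sensitivity 1: adding or removing one individual
    changes the true answer by at most 1, so Laplace noise with scale
    1/epsilon is enough to satisfy the differential privacy guarantee.
    """
    true_count = sum(1 for v in values if predicate(v))
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Hypothetical transaction amounts, in NT$ thousands.
transactions = [12, 85, 40, 230, 7, 56, 91, 18, 300, 64]

# How many transactions exceed NT$50,000? A smaller epsilon means stronger
# privacy but a noisier answer; this is the utility trade-off at issue.
for eps in (0.1, 1.0, 10.0):
    noisy = dp_count(transactions, lambda v: v > 50, eps)
    print(f"epsilon={eps}: noisy count = {noisy:.1f}")
```

The noisy answers stay unbiased on average, but their variance grows as epsilon shrinks, which is one source of the loss of statistical power discussed below.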
Google at the center of the internet

Many of these debates revolve specifically around Google, which helped pioneer the ad revenue model. Google's Chrome was the last major browser to allow third-party cookies, the object of the GDPR's most visible impact on user experience. In July, Google announced that it was abandoning prior plans to phase them out, apparently due to the impact on the advertiser ecosystem. For several years, Google has been working on an alternative called Privacy Sandbox, which implements the aforementioned differential privacy directly in users' browsers.

Data from Privacy Sandbox have not been able to fetch the same valuation as less private information, which illustrates a difficulty with differential privacy: even when the noise is added correctly and does not bias a model or the conclusions drawn from the data, it can still degrade model performance and statistical power. At the same time, Privacy Sandbox has received regulatory attention for giving Google too much control over user data. The UK's Competition and Markets Authority has questioned the relationship between Privacy Sandbox and Google Ad Manager, a major platform in the industry for ad purchasing.

Meanwhile, as of mid-September an anti-trust trial involving Google Ad Manager is ongoing in the US. Prosecution witnesses have described the inability of ad buyers to obtain "log-level data" from Google on user behavior, which could be used to independently calculate the value of ad placements. If the suit succeeds in splitting ad sales from pricing, personal data could start to be traded at a greater pace. Before the trial, Google claimed that a finding against it would "slow innovation, raise advertising fees, and make it harder for thousands of small businesses and publishers to grow."

Moving toward the privacy end of the spectrum has not kept Google out of trouble over privacy itself. In April, the UK's Information Commissioner's Office raised concerns about Privacy Sandbox precisely on those grounds, underscoring the delicate balance Google must maintain.

Privacy can become a value driver

These changes are largely taking place outside the financial services sector, but with the trend of embedded finance, financial institutions will need to be better integrated with tech platforms, and may need to access their data as consumers. Also, because private information gathered for one purpose may not be used for another without additional consent, financial institutions are sometimes affected by privacy barriers to information sharing even within the same corporate structure. Whether from the perspective of a data producer or a data consumer, then, it helps to understand how data can be safely exchanged.

Clean rooms evoke the highly controlled sterile environments used for physical processes like semiconductor manufacturing. In a data clean room, several parties' datasets can be joined and analyzed in a neutral, controlled environment, with only aggregated or otherwise privacy-protected results allowed to leave it. Several cloud service providers offer the service, alongside more specialized vendors. The arrangement is typically not cheap, but it avoids the data control issues of Privacy Sandbox and may point to a way out of this regulatory minefield.
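As a rough sketch of the control such an environment enforces, the example below joins two hypothetical datasets on a hashed identifier and releases only aggregates that cover a minimum number of customers. The table layout, the join key, and the threshold of three matched users are assumptions made for illustration, not the interface of any particular clean room provider.

```python
# A minimal sketch of the aggregation rule at the heart of a data clean room:
# records are matched inside the controlled environment, and only aggregates
# covering at least K_MIN users may leave it.
import pandas as pd

K_MIN = 3  # suppress any group smaller than this

# Hypothetical inputs: a bank's customer spend and a platform's ad exposures,
# matched on a hashed identifier that is only visible inside the clean room.
bank = pd.DataFrame({
    "hashed_id": ["a1", "b2", "c3", "d4", "e5", "f6"],
    "spend":     [120,   80,   45,  300,   60,   95],
})
platform = pd.DataFrame({
    "hashed_id": ["a1", "b2", "c3", "d4", "e5", "f6"],
    "campaign":  ["X",  "X",  "X",  "Y",  "Y",  "X"],
})

def clean_room_query(bank_df, platform_df, k_min=K_MIN):
    """Join the two datasets and return aggregate spend per campaign,
    suppressing campaigns matched to fewer than k_min users."""
    joined = bank_df.merge(platform_df, on="hashed_id")
    agg = joined.groupby("campaign")["spend"].agg(users="count", avg_spend="mean")
    return agg[agg["users"] >= k_min]  # row-level data never leaves the room

print(clean_room_query(bank, platform))  # campaign Y (2 users) is suppressed
```

Commercial offerings typically layer further controls on top, such as query approval workflows or noise injection, but the underlying principle is the same: row-level data stays inside the room.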
Not every regulatory framework is as strict as GDPR, and not every piece of private information is equally sensitive, but the possibility of reconstruction and similar attacks will only grow as threats become more sophisticated. The role of creativity in cybersecurity and fraud is particularly apparent in the social engineering aspect of attacks. Information that a victim would reasonably expect only a trusted institution to hold can lead the victim to place further trust in an attacker. That such information may be probabilistic in nature matters less if attacks can be automated and failed attempts cost little or nothing. Educated guesses can also be combined with stolen genuine information for further effect.

Ironing out these issues will allow financial institutions to realize the full value of the cloud. The promise of modern AI is to integrate information that is as granular as possible, from as wide a variety of sources as possible, which is often easier when the data is collected in one place. Given the insatiable curiosity of these models, it is far easier to keep secrets out of their training data in the first place than to deal with the consequences afterward.