Machine unlearning gets a practical privacy upgrade
Machine learning models are everywhere now, from chatbots to credit scoring tools, and they carry traces of the data they were trained on. When someone asks to have their personal data erased under laws like the GDPR, their data also needs to be wiped from the machine learning models that learned from it.
Retraining a model from scratch every time a deletion request comes in isn’t feasible in most production settings. Machine unlearning, which refers to strategies for removing the influence of specific training data from a model, has emerged to fill the gap. But until now, most approaches have either been slow and costly or fast but lacking formal guarantees.
A new framework called Efficient Unlearning with Privacy Guarantees (EUPG) tries to solve both problems at once. Developed by researchers at the Universitat Rovira i Virgili in Catalonia, EUPG offers a practical way to forget data in machine learning models with provable privacy protections and a lower computational cost.
Rather than wait for a deletion request and then scramble to rework a model, EUPG starts by preparing the model for unlearning from the beginning. The idea is to first train on a version of the dataset that has been transformed using a formal privacy model, either k-anonymity or differential privacy. This “privacy-protected” model doesn’t memorize individual records, but still captures useful patterns. To recover some utility, the model is then fine-tuned on the full original dataset.
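To make the two-stage setup concrete, here is a minimal Python sketch of what the preparation phase could look like. It is not the authors' code: the Laplace-noise step is a crude stand-in for the formal k-anonymity or differential privacy mechanisms EUPG actually uses, and the synthetic data, the privacy_protect helper, and the fine_tune helper are all illustrative assumptions.

```python
import copy
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for the real training set.
X = rng.normal(size=(1000, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

def privacy_protect(X, scale=1.0):
    # Crude stand-in for a formal privacy model (k-anonymity or differential
    # privacy): perturb the features with Laplace noise.
    return X + rng.laplace(scale=scale, size=X.shape)

def fine_tune(base_model, X, y, epochs=5):
    model = copy.deepcopy(base_model)  # keep the privacy-protected checkpoint intact
    for _ in range(epochs):
        model.partial_fit(X, y)
    return model

# Step 1: train a base model that only ever sees privacy-protected data.
base = SGDClassifier(loss="log_loss", random_state=0)
base.partial_fit(privacy_protect(X), y, classes=np.unique(y))

# Step 2: recover utility by fine-tuning a copy on the full original data.
deployed = fine_tune(base, X, y)
```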
If a user later asks for their data to be deleted, the system falls back on the initial privacy-protected model and fine-tunes it again, this time on a version of the dataset with the user’s data removed. Because the data were anonymized up front and the model didn’t rely on any one item too heavily, the influence of the deleted records can be removed efficiently.
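Continuing the sketch, handling a deletion request then amounts to retrieving the stored privacy-protected checkpoint and fine-tuning it on the retained records only (again, illustrative names rather than the paper's implementation):

```python
# A deletion request arrives for some training records (indices chosen arbitrarily here).
forget_idx = np.array([3, 42, 77])
keep = np.setdiff1d(np.arange(len(X)), forget_idx)

# Fall back to the privacy-protected checkpoint and fine-tune on the remaining data;
# no retraining from scratch is needed.
deployed = fine_tune(base, X[keep], y[keep])
```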
The approach appears to work. The authors tested EUPG on a mix of tabular and image data, comparing it with both retraining from scratch and SISA (a benchmark method that offers formal guarantees but requires heavy computation). On most datasets, EUPG matched or beat these alternatives on utility, while also reducing vulnerability to membership inference attacks, a common way to test if a model still “remembers” a deleted data point.
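To see what such an attack probes, the snippet below extends the sketch with a crude loss-threshold membership test, one of the simplest MIA variants and not the evaluation protocol from the paper: if the unlearned model's loss on the deleted records looks no different from its loss on records it never saw, the attacker has little to go on. The held-out X_test and y_test are assumed here for comparison.

```python
from sklearn.metrics import log_loss

# Held-out records the model never trained on (assumed, for comparison).
X_test = rng.normal(size=(200, 8))
y_test = (X_test[:, 0] + X_test[:, 1] > 0).astype(int)

def per_example_loss(model, X, y):
    proba = model.predict_proba(X)
    return np.array([log_loss([yi], [pi], labels=model.classes_)
                     for yi, pi in zip(y, proba)])

loss_forgotten = per_example_loss(deployed, X[forget_idx], y[forget_idx])
loss_unseen = per_example_loss(deployed, X_test, y_test)
print(loss_forgotten.mean(), loss_unseen.mean())  # similar values suggest little memorization
```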
That said, EUPG isn’t designed for every type of ML pipeline. It assumes a one-time training stage with optional fine-tuning, rather than continuous learning where models are updated as new data comes in. When asked whether EUPG could extend to such cases, co-author Josep Domingo-Ferrer explained: “The problem here is how to enforce a privacy model on continuously increasing training data. This amounts to the problem of continuous data anonymization, which is notoriously difficult. Some heuristics for it, mainly oriented to k-anonymity-like privacy models, can be found in the literature. They would be applicable to EUPG.”
Another open question is how unlearning interacts with fairness and bias, especially if deletion requests disproportionately come from certain groups. If data from an underrepresented population is removed more often, it could skew the model’s behavior in unintended ways.
Domingo-Ferrer said the impact would mostly depend on the chosen privacy model: “The potential impact of EUPG on bias amounts to the impact of the selected privacy model on the bias of the training data. Hence, the answer would be the literature on the impact of differential privacy on bias, k-anonymity on bias, etc. In principle, enforcing a privacy model and fighting bias are ‘orthogonal’ problems, without an obvious connection. However, there are works in the literature that show that both k-anonymity and differential privacy can be used in clever ways to mitigate bias.”
The researchers acknowledge that extending EUPG to large language models and other foundation models will require further work, especially given the scale of the data and the complexity of the architectures involved. They suggest that for such systems, it may be more practical to apply privacy models directly to the model parameters during training, rather than to the data beforehand.
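One established way to apply differential privacy at the parameter level is DP-SGD, which clips each example's gradient and adds noise to every update so that the learned weights themselves carry the guarantee. The sketch below uses the Opacus library to illustrate that general idea; it is not something the EUPG paper implements, and the toy model and data are placeholders.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

# Toy model and data; in practice this would be the foundation model and its corpus.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
loader = DataLoader(TensorDataset(torch.randn(1000, 8),
                                  torch.randint(0, 2, (1000,))), batch_size=64)

# Wrap model, optimizer, and loader so every update is clipped and noised (DP-SGD).
model, optimizer, loader = PrivacyEngine().make_private(
    module=model, optimizer=optimizer, data_loader=loader,
    noise_multiplier=1.1, max_grad_norm=1.0,
)

loss_fn = nn.CrossEntropyLoss()
for xb, yb in loader:  # one epoch of private training
    optimizer.zero_grad()
    loss_fn(model(xb), yb).backward()
    optimizer.step()
```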
Still, the core idea of preparing a model to forget before it needs to could help make machine unlearning faster, cheaper, and more compliant with privacy law. It’s a step toward making the right to be forgotten enforceable not just on paper, but in practice.