If, whilst watching a courtroom drama, you’ve ever wondered whether the jury is truly going to ‘disregard that last remark’ as per the judge’s instructions, you already understand the problem of data pervasiveness. If you’ve ever given a company permission to share your personal or contact information with third parties, and have found that cancelling the original agreement does not cancel it with an ever-growing fractal spiral of third parties who retain your data (the original source of which is now irrelevant or possibly even non-existent), then you understand the practical issues in getting networks to ‘forget’ information. And if you’ve ever flagged a genuine email as ‘spam’ in Gmail, you’re part of the problem.

The problem lies in getting machines to forget what they already know when the source data is malicious or misclassified, or when the data’s originator (you, for instance) wants to annul the fact that it was ever shared.

It’s a problem that the growth of Big Data is likely to compound by orders of magnitude in the near future unless data provenance, auditing and compliance processes gain access to more effective ways of removing not only the source data from systems but also the iterative ripples that the data caused when it entered.

To this end, Towards Making Systems Forget with Machine Unlearning [PDF], presented earlier this year at the IEEE Symposium on Security and Privacy 2015, proposes fundamental modifications to the way that user data enters and interacts with analysis systems and data streams.

The paper outlines the problem of permissions provenance in a data-hungry network environment where you only have to invite the vampire in once to lose control of the situation:

Today’s systems produce a rapidly exploding amount of data, ranging from personal photos and office documents to logs of user clicks on a website or mobile device. From this data, the systems perform a myriad of computations to derive even more data. For instance, backup systems copy data from one place (e.g., a mobile device) to another (e.g., the cloud). Photo storage systems re-encode a photo into different formats and sizes. Analytics systems aggregate raw data such as click logs into insightful statistics. Machine learning systems extract models and properties (e.g., the similarities of movies) from training data (e.g., historical movie ratings) using advanced algorithms. This derived data can recursively derive more data, such as a recommendation system predicting a user’s rating of a movie based on movie similarities. In short, a piece of raw data in today’s systems often goes through a series of computations, “creeping” into many places and appearing in many forms. The data, computations, and derived data together form a complex data propagation network that we call the data’s lineage.

Summation tokenises the way data exists in a stream, with the ‘lenses’ protecting data from being directly embroiled in replication and analytical derivations. Towards Making Systems Forget with Machine Unlearning, Yinzhi Cao and Junfeng Yang, Columbia University, http://arxiv.org/pdf/1509.05251v1.pdf

The researchers present a system which prevents cloned data, both source and ‘calculated’ (i.e. derived by machine learning algorithms), from becoming orphaned from its associated permissions by interposing a ‘summation model’ between it and the systems which gain access to it. The learning systems which utilize the data do so via these proxies; if the proxies are amended or deleted, the data itself is no longer available, and its replicated iterations in other systems are neither identifiable nor viable.

Additionally, the summation technique does not require re-imagining extant systems from scratch; the researchers successfully retrofitted it onto several real-world data analysis systems, modifying between 20 and 300 lines of code, or less than 1% of the existing system.

The systems tested and retrofitted with this approach were the recommender system LensKit; Microsoft’s JavaScript malware detection system Zozzle; and PJScan, a command-line utility for detecting PDF files that contain malicious JavaScript.
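
One way to read the summation idea is that the learner never touches individual records directly; it only reads a handful of running totals, so forgetting a record means subtracting its contribution from those totals. The following is a minimal sketch under that assumption, not the authors' implementation: the class `SummationSpamModel`, its method names and the toy mails are all hypothetical, and the scoring formula is a simplified naive Bayes estimate that assumes equal priors for spam and genuine mail.

```python
from collections import Counter


class SummationSpamModel:
    """Toy spam scorer whose predictions depend only on running sums."""

    def __init__(self):
        # Hypothetical summations: mails seen per label, and per-label word counts.
        self.mail_counts = Counter()
        self.word_counts = {"spam": Counter(), "ham": Counter()}

    def learn(self, words, label):
        """Add one mail's contribution to the sums."""
        self.mail_counts[label] += 1
        self.word_counts[label].update(set(words))

    def unlearn(self, words, label):
        """Subtract one mail's contribution; no retraining from scratch."""
        self.mail_counts[label] -= 1
        self.word_counts[label].subtract(set(words))

    def spam_probability(self, word):
        """Estimate P(spam | word) from the sums, with add-one smoothing."""
        p_spam = (self.word_counts["spam"][word] + 1) / (self.mail_counts["spam"] + 2)
        p_ham = (self.word_counts["ham"][word] + 1) / (self.mail_counts["ham"] + 2)
        return p_spam / (p_spam + p_ham)


model = SummationSpamModel()
model.learn(["cheap", "pills"], "spam")
model.learn(["meeting", "agenda"], "ham")
model.learn(["meeting", "invoice"], "spam")    # a genuine mail wrongly flagged
polluted = model.spam_probability("meeting")   # score while the bad label is in place
model.unlearn(["meeting", "invoice"], "spam")  # forget the misclassified mail
repaired = model.spam_probability("meeting")
```

Because the score is computed only from the sums, the model after `unlearn` is indistinguishable from one that never saw the mislabelled mail, which is the property that makes this kind of forgetting cheaper than retraining.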

Inferring user interest based on history – model inversion attacks

In the case of LensKit, the researchers address one particular model for which an attack vector has already been identified and used. Item-item collaborative filtering (IICF) is recognisable to most of us at least via Amazon’s oft-lauded system of ‘recommendations’, which takes into account the user’s own purchases and correlates that data with analogous input branching back from the product itself to other users’ choices.
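
The mechanics of item-item collaborative filtering can be sketched in a few lines. This is an illustrative toy and not LensKit's or Amazon's algorithm: the function names, the cosine similarity over sets of buyers and the sample purchase data are all assumptions made for the example.

```python
from collections import defaultdict
from math import sqrt


def item_similarities(purchases):
    """Map each ordered item pair to the cosine similarity of their buyer sets.

    `purchases` is a dict of user -> set of items bought (hypothetical data shape).
    """
    buyers = defaultdict(set)
    for user, items in purchases.items():
        for item in items:
            buyers[item].add(user)
    sims = {}
    for a in buyers:
        for b in buyers:
            if a != b:
                overlap = len(buyers[a] & buyers[b])
                if overlap:
                    sims[(a, b)] = overlap / sqrt(len(buyers[a]) * len(buyers[b]))
    return sims


def recommend(user, purchases, sims, top_n=3):
    """Rank items the user does not own by similarity to the items they do own."""
    owned = purchases[user]
    scores = defaultdict(float)
    for (a, b), similarity in sims.items():
        if a in owned and b not in owned:
            scores[b] += similarity
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_n]


purchases = {
    "ann": {"book", "lamp"},
    "bob": {"book", "lamp", "desk"},
    "cat": {"book"},
}
sims = item_similarities(purchases)
suggestions = recommend("cat", purchases, sims)
```

Note that every similarity score is derived from individual purchase records, so a single user's purchase shifts the scores that other users can observe.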

The authors note Joseph A. Calandrino’s efforts to outline the ways in which attackers can exploit a recommendation algorithm by reversing its logic to reveal what the user actually bought, which is very private information by default.

In the case of online book cataloguing service LibraryThing, the researchers found that an attack on its own IICF algorithm succeeded in identifying six book purchases per user, averaging 90% accuracy over a database of one million end-users – and this in a climate where book choices have long been of interest to the FBI and other government agencies.

Using available data to obtain accurate obscured data in this manner is called ‘model inversion’, and the paper notes potential for more significant instances of its misuse. In Privacy in Pharmacogenetics: An End-to-End Case Study of Personalized Warfarin Dosing, researchers contend that one need only combine some very basic demographic data with a dosing model in order to deduce a patient’s genetic markers with a 75% accuracy rate.
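
The principle behind model inversion can be illustrated with a toy example. Everything here is hypothetical: the linear dosing formula, its coefficients and the function names are invented for illustration and are not taken from the warfarin study. The sketch assumes the attacker can query the model, knows the patient's age and weight, and has seen the recommended dose.

```python
def predict_dose(age, weight_kg, marker):
    """Hypothetical dosing model; the coefficients are invented for this sketch."""
    return 5.0 - 0.03 * age + 0.02 * weight_kg - 1.5 * marker


def invert_marker(age, weight_kg, observed_dose, candidates=(0, 1, 2)):
    """Guess the hidden marker value whose predicted dose is closest to the observed one."""
    return min(
        candidates,
        key=lambda m: abs(predict_dose(age, weight_kg, m) - observed_dose),
    )


# The attacker knows age and weight, sees a dose of roughly 2.0,
# and recovers the marker value the model must have been given.
guessed_marker = invert_marker(age=60, weight_kg=80, observed_dose=2.0)
```

The attacker never sees the sensitive attribute; it is recovered by running the model forwards over each plausible value and keeping the best match.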

Disinformation – false positives and false negatives obscuring defence systems

Anomaly detection systems, such as pattern-based antivirus and anti-malware systems, rely on good information in order to remain effective, in much the same way that spam-filtering systems do. Anyone who has ever found a relevant mail in their spam folder, possibly long after receiving it, or found that their spam filter sometimes lets obvious spam into their inbox, understands the problem of maintaining the integrity of a system’s signature sets and of having an effective mechanism to remove false positives and false negatives.

Polluting a system’s valid criteria with invalid data assignments is an active aim of attackers and, much as with other highly-iterated pieces of information, the propagation of false information can make removal problematic because of the lack of provenance and the orphaning of authority for that data.

Google’s apparently impossible task of amnesia

Currently Google is wrestling with a European privacy ruling which insists that the end-user’s ‘right to be forgotten’ should extend beyond Europe itself. The ruling directly touches on these issues of the ad hoc propagation and orphaning of permissions in user data, which currently seems to lose all provenance at the borders of each country that it rapidly passes through. If creating this kind of ‘gatekeeper lens’ via summation data is seen to be unworkable, and if more granular and motile data is deemed necessary for some or all machine learning systems, or for realistic commercial applications, then the problem may prove more easily addressable via blockchain-based personal data protection [PDF].