Modeling discussions: Data deletions
A decade(!) ago I wrote that you should avoid soft deletes. Today I ran into a question on the mailing list and remembered writing about this. It turns out that there was quite a discussion on the topic at the time:
- My post on avoiding soft deletes
- Richard’s post on the trouble data deletion causes
- Udi’s post about never deleting data
- Dare’s post summarizing most of the discussion
The context of the discussion at the time was deleting data from relational systems, but the same principles apply here. The question I just fielded asked how you can translate a Delete() operation inside the RavenDB client into a soft delete (IsDeleted = true) operation. The RavenDB client API offers a few ways to customize how it talks to the underlying database, including some pretty interesting hooks deep into the pipeline.
What it doesn’t offer, though, is a way to turn a Delete() operation into an update (or an update into a delete). We do have facilities in place that allow you to detect (and abort) invalid operations. For example, invoices should never be deleted. You can tell the RavenDB client API that it should throw whenever an invoice is about to be deleted, but you have no way of saying that we should take the Delete(invoice) and turn that into a soft delete operation.
This is intentional, by design.
Having a way to transform basic operations (like delete –> update) is a good way to end up pretty confused about what is actually going on in the system. It is better to let the user enforce the required behavior (invoices cannot be deleted) and let the calling code handle this differently.
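To make the "detect and abort" idea concrete, here is a minimal sketch in Python. The `Session` class, its `on_before_delete` hook, and the `DeleteNotAllowed` exception are hypothetical stand-ins invented for illustration; this is not the RavenDB client API, only the shape of the behavior described above.

```python
class DeleteNotAllowed(Exception):
    """Raised when a guard vetoes a delete."""


class Invoice:
    def __init__(self, number):
        self.number = number


class Note:
    def __init__(self, text):
        self.text = text


class Session:
    """Hypothetical stand-in for a client session (not the RavenDB API)."""

    def __init__(self):
        self._before_delete = []  # guards that may veto a delete
        self.deleted = []         # entities that were actually deleted

    def on_before_delete(self, guard):
        self._before_delete.append(guard)

    def delete(self, entity):
        # Every guard runs first; any of them may raise to abort the delete.
        for guard in self._before_delete:
            guard(entity)
        self.deleted.append(entity)


def forbid_invoice_deletes(entity):
    # The guard only rejects; it never rewrites the delete into an update.
    if isinstance(entity, Invoice):
        raise DeleteNotAllowed("Invoices are never deleted; cancel them instead.")


session = Session()
session.on_before_delete(forbid_invoice_deletes)

session.delete(Note("call back on Monday"))  # an ordinary delete goes through

try:
    session.delete(Invoice("INV-1"))
    invoice_delete_blocked = False
except DeleteNotAllowed:
    invoice_delete_blocked = True
```

The guard fails loudly instead of silently changing what the operation means, so the calling code is forced to decide what should happen to an invoice.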
The natural response here, of course, is that this places a burden on the calling code. Surely we want to be able to follow DRY and not write conditionals when the user clicks on the delete button. But this isn’t a case of extra, duplicated code. These are different business operations:
- An invoice is never deleted, it is cancelled. There are tax implications to that, and you need to get it right.
- A payment is never removed, it is refunded.
You absolutely want to block deletions of those types of documents, and you need to treat them (very) differently in code.
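As a rough sketch of what "treating them differently in code" might look like, the Python below models cancellation and refund as explicit operations rather than as deletes. The field names (`status`, `cancellation_reason`, `refunded_cents`) and the validation rules are assumptions made up for this example, not a prescription for real invoicing or tax logic.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Invoice:
    number: str
    status: str = "issued"
    cancellation_reason: Optional[str] = None
    cancelled_at: Optional[datetime] = None

    def cancel(self, reason: str) -> None:
        # Cancelling keeps the invoice and records why and when it happened.
        if self.status == "cancelled":
            raise ValueError(f"Invoice {self.number} is already cancelled")
        self.status = "cancelled"
        self.cancellation_reason = reason
        self.cancelled_at = datetime.now(timezone.utc)


@dataclass
class Payment:
    amount_cents: int
    refunded_cents: int = 0

    def refund(self, amount_cents: int) -> None:
        # A refund is a new fact about the payment, not its removal.
        remaining = self.amount_cents - self.refunded_cents
        if amount_cents <= 0 or amount_cents > remaining:
            raise ValueError("Refund must be positive and at most the unrefunded amount")
        self.refunded_cents += amount_cents


invoice = Invoice("INV-1")
invoice.cancel("issued to the wrong customer")

payment = Payment(amount_cents=10_000)
payment.refund(2_500)
```

Neither operation fits an IsDeleted flag: each one carries its own data and its own rules, which is why a generic delete-to-update translation would hide more than it helps.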
In the ensuing decade since the blog posts at the top of this post were written, there have been a number of changes. Some of them are architecturally minor, such as the database technology of choice or the guiding principles for maintainable software development. Some of them are pretty significant.
One such change is the GDPR.
“Huh?!” I can imagine you thinking. How does the GDPR apply to an architectural discussion of soft deletes vs. business operations? It turns out that it is very relevant. One of the things that the GDPR mandates (and there are similar laws elsewhere, such as the CCPA) is the right to be forgotten. So if you are using soft deletes, you might run into real problems down the line: “I asked to be deleted, they told me they did it, but they secretly kept my data!” The one thing that I keep hearing about the GDPR is that no one ever found it humorous. Not with the kind of penalties that are attached to it.
So when thinking about deletes in your system, you need to consider quite a few factors:
- Does it make sense, from a business perspective, to actually lose that data? Deleting a note from a customer’s record is probably just fine. Removing the record of the customer at all? Probably not.
- Do I need to keep this data? Invoices are one thing that comes to mind.
- Do I need to forget this data? That is the opposite requirement, and what you can forget, and how, can be really complex.
At any rate, for all but the simplest scenarios, just marking IsDeleted = true is likely not going to be sufficient. And all the other arguments that have been raised (which I’m not going to repeat, read the posts, they are good ones) are still in effect.
Comments
".. enthusing decade .." should perhaps be ".. ensuing decade .." ?
One thing about the GDPR right to be forgotten is that it often collides with reality - e.g. imagine a forum. I reply in a thread and other people respond to me - how can I delete that without making the entire thread unreadable? There is an ongoing discussion in such cases that the right is overridden by the majority, but it is an open question how our courts will weigh this.
With regard to the forum post in relation to GDPR, you would not delete the forum post at all, but rather you would somehow obscure the author of the forum post. What the "right to be forgotten" actually provides for is the removal of P.I.I. (Personally Identifiable Information). This means the text of your forum post can remain (meaning the entire thread continues to make sense) but your post would be changed to be attributed to "Anonymous" or some other non-personally identifiable name.
The general approach to adhere to GDPR, specifically in more event-driven or event-sourced systems, is not to delete any data at all but to use Crypto Shredding to prevent decryption of previously encrypted data.
Stuart, thanks, fixed.
Christian, Looking at Reddit, where such things (due to moderation, not GDPR) happen quite a lot, that is workable. But people have found ways to avoid it. For example, you have bots that create a copy of a post to avoid later modifications, deletes, etc.
This gets really interesting (for a legal question, at least): what do you do with someone else's post that quotes you? Or another's post that mentions you by name?
Craig, The issue in this case is that some of the data has never been encrypted at all. And what do you do about PII references from other users to the user who wants to be deleted?
And how does it work if I include personal information in the post - such as saying here that my last name is Bale and my first name is Stuart - now if someone quotes this post, how would the site be able to remove my personal details from the text of this message?
Oren - Unencrypted data can always be encrypted in retrospect.
Oren / Stuart - Re: PII references and PII inside posts. Those are very good points, and I don't really know the answer to how to deal with users who would, for example, put their own identifiable information inside the text of their forum post, short of ensuring that all "quoted" posts never copy the raw text, only a reference to the original post, and trying hard to remove or obfuscate personal data (an almost impossible task - much like a profanity filter).
That said, what we're getting to here are the fundamental flaws with GDPR and the so-called "right to be forgotten". Even if we did delete the entire forum post, what happens when I've already read the forum post prior to you deciding to have it removed under GDPR? What happens if half the world reads that post before it's deleted? How do you make me (or anyone else who read your post) forget what we've already read?
Stuart, No good options here, I'm guessing. Very likely not, and can cause problems down the road. I assume that telling the user that they need to find all instances of their name for the site to remove isn't valid. And given your name, what do you do if you have another Stuart that talks about moving bales of hay? No real good answer here.
Craig, As I understand it, the whole point of crypto shredding is to prepare in advance to be able to "forget" stuff. If all your data are encrypted with your key, I can forget everything about you by deleting the key. However, if I haven't done that, you have the problem of finding out what data belongs to whom. After all, deleting the data or encrypting it is pretty much the same thing.
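A minimal sketch of that idea in Python, under loud assumptions: `ShreddableStore` and its methods are hypothetical names, and the cipher is a toy SHA-256 keystream used only so the example runs with the standard library. It is not real cryptography; a real system would use a vetted, authenticated cipher and proper key management.

```python
import hashlib
import secrets


def _keystream_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Toy stream cipher (SHA-256 in counter mode). Illustration only."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        pad = hashlib.sha256(key + nonce + offset.to_bytes(8, "big")).digest()
        chunk = data[offset:offset + 32]
        out.extend(b ^ p for b, p in zip(chunk, pad))
    return bytes(out)


class ShreddableStore:
    """Hypothetical store: one key per user, data kept only as ciphertext."""

    def __init__(self):
        self._keys = {}         # user id -> encryption key
        self._records = []      # (user id, nonce, ciphertext)
        self._shredded = set()  # users whose key has been destroyed

    def put(self, user_id: str, text: str) -> None:
        if user_id in self._shredded:
            raise ValueError("This user has been forgotten")
        key = self._keys.setdefault(user_id, secrets.token_bytes(32))
        nonce = secrets.token_bytes(16)
        ciphertext = _keystream_xor(key, nonce, text.encode("utf-8"))
        self._records.append((user_id, nonce, ciphertext))

    def read_all(self, user_id: str) -> list:
        key = self._keys.get(user_id)
        if key is None:
            return []  # no key, so nothing can be decrypted
        return [
            _keystream_xor(key, nonce, ciphertext).decode("utf-8")
            for uid, nonce, ciphertext in self._records
            if uid == user_id
        ]

    def forget(self, user_id: str) -> None:
        # The ciphertext stays where it is; only the key is destroyed.
        self._keys.pop(user_id, None)
        self._shredded.add(user_id)


store = ShreddableStore()
store.put("stuart", "my last name is Bale")
before = store.read_all("stuart")
store.forget("stuart")
after = store.read_all("stuart")
```

This only works because every record was encrypted under the user's key from the start, which is exactly the "prepare in advance" caveat: data that was written in the clear, or that mentions the user inside someone else's record, is not covered.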
Another factor here is what happens when you have conflicting requirements. For example, imagine that a user comes to you and wants to be forgotten. You remove the data, then you get a subpoena from a court because this user is involved in a lawsuit and these posts are evidence.