The Myth Of Data Integrity
In his most recent interview (Ruby on Rails Chases Simplicity in Programming), David Heinemeier Hansson mentions how he had to “slaughter the holy cows” of computer science. He then goes on to explain how he and his partners in crime proceeded to do it: “We took a pretty radical stand: Stored procedures and all things that make your database clever are evil.” And he added: “If you tell a lot of IT shops that, they’ll be majorly offended, because that’s just the way they do things.”
David is, of course, right. I don’t have to go any further than point out the reaction to my posts on the same subject, of which the most notoriously acute was Should Database Manage The Meaning? As one of the programmers (Valentino Volonghi) who indulged in the ensuing discussion had pointed out, “This is totally nonsense to me.”
The palpable confusion that is aggregating around the ’sacred cow’ (or, as David likes to call it, the ‘holy cow’!) of the relational database boils down to practically one thing — data integrity. Of course, as with many other things surrounding the lore of software development, data integrity is a myth. How do I know that? Well, I’ve examined it, and here is what I found out:
Data integrity is one of those idealistic formal criteria that sound and look great on the drawing board, but tend to lose their street appeal as soon as they get implemented and the rubber meets the road, so to speak. Like that shopping mall that looks fabulous on the computer-generated screen, but gives us the chills of repulsion once it gets built and is twice as hideous as anyone could’ve imagined, the concept of data integrity offers a rather laughable sense of false security that only people who spend nights at the computer screen and never go out can buy into.
Still, what does the term ‘data integrity’ mean in real life? In a nutshell, what it really means is that, as the data travels, it remains faithful to its source.
The data stored in a vault is meaningless. Only if the data can go places do we ever find any use for it, any value in it.
But the problem is, data is like furniture. You can move it around, but sooner or later, something’s gonna break.
Data integrity thus means that when things break, we can recover them. If I travel to a foreign country, and if at the checkout stand my Visa card shows that I’m over my limit, while I know that I’m not, I’m dealing with the broken data. My data travelled with me, and got broken in the process.
In order to restore my credit, I need to rely on data integrity. There must be an authoritative source somewhere that will confirm my story that I’m not anywhere near my credit limit.
As is obvious from this real-life example, data integrity has absolutely nothing to do with the database. Data inside the database is just useless deadwood, good for nothing. Only when it packs its bags and goes on a trip do we get to realize its worth.
Let’s now look at some of the axioms of data integrity that are so dear to the hearts of the ‘database geeks’. Firstly, every relational jockey will insist that data integrity basically means that each row in the table is uniquely identified. This integrity, then, ensures that there are no duplicate rows in a table.
Sounds logical, doesn’t it? I mean, what could be wrong with the rule that says that you can’t have two different rows describing the same patient, or the same customer? Isn’t this rule the only sane bulwark against the madness of having corrupt information stored in our database?
Well, not so fast. Let’s first look into how does that integrity rule get implemented in real life. Typically, the RDBMS will enforce this rule of uniqueness by implementing a bailout concept known as the surrogate key. What that means is that, indeed, each row in the Patient table will be unique, thanks to the unique surrogate key. But all the other information in the rows of that table could be absolutely identical, while giving the false impression of ‘uniqueness’.
Another myth of data integrity is the so-called ‘orphanage prevention’. This special case of data integrity, known as referential integrity, has much more applicability in the non-electronic storage systems than in the electronic ones. The reason for that is the fact that electronic orphaned rows are fairly innocuous, since there is no simple and obvious way to traverse the collection in order to reach them. In real life, non-electronic systems, such records are much more likely to stumble upon.
And so on, there are numerous misconceptions about the data integrity that became the ’sacred cows’ of the corporate IT practices. Rails pretty much shatters all these to pieces.
In summary, correctness, accuracy and consistency of the data cannot be enforced by the database. If the databases were completely bullet-proof, that wouldn’t mean a thing, because the data is absolutely useless while it is dormant inside the database. As we’ve already mentioned, the data comes to life only once it leaves the database, at which point none of the database rules and regulations can reach it. Being out in the wilderness, exposed to all kinds of fraud and abuse, the data must rely on the application code for its cleanliness and sanity.
This is why database-enforced data integrity is one of the biggest urban myths of corporate computing.
Leave a Reply
You must be logged in to post a comment.