More on Data Integrity
In response to my post on The Myth of Data Integrity, Will submitted a very interesting comment. I’ve decided to use it in order to clarify some of the confusion that my initial thoughts seem to have caused. Here is what Will had to say, on November 3, 2005:
So let’s say I define a constraint in my database that doesn’t allow you to post a transaction to a closed accounting period.
This prevents any application that uses my database from posting data that will damage the correctness of the company accounts.
Furthermore, the database logs any changes to the status of an accounting period, so no matter which application changes the status, we know who reopened the accounting period and when they did it.
Can you please explain under what circumstances you need to be so “agile” that you would want to circumvent these rules?
Firstly, I would never want to circumvent the above rules (if they are indeed valid business rules for the system under study). Striving for agility does not mean circumventing rules. It means stopping to determine which rules are absolutely necessary, and then using the least possible amount of effort to ensure that the rules will be properly applied.
I’m not sure what it is in my original post that led you to believe that I am advocating the abandonment of business rules. Granted, I was angling toward abandoning the idea of expressing the rules inside the RDBMS, but that does not automatically mean that I was insisting that the rules must be ruled out (no pun intended) for good.
Data and Behavior
Data without associated behavior is difficult to deal with. Sure, the human mind is awesome in its capability to parse through the unordered sludge of textual information, fishing out subtle patterns and meanings. But mechanical attempts at emulating the human mind fall pathetically short when it comes to doing similar parsing.
Because of that, we are forced to associate behavior with the data once it enters the information processing system. Only by doing that can we enable our mechanical system to know that an employee cannot be hired if that person happens to be only two years old, etc.
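As a minimal sketch of what attaching behavior to data might look like, here is the rule about the two-year-old written in Python. The names `Employee` and `MINIMUM_HIRING_AGE`, and the threshold of 16, are assumptions made up for illustration; they do not come from any real system.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical threshold, chosen only for illustration.
MINIMUM_HIRING_AGE = 16


@dataclass
class Employee:
    name: str
    date_of_birth: date

    def age_on(self, day: date) -> int:
        """Return the person's age in whole years on the given day."""
        years = day.year - self.date_of_birth.year
        # Subtract one year if the birthday has not yet occurred in day.year.
        if (day.month, day.day) < (self.date_of_birth.month, self.date_of_birth.day):
            years -= 1
        return years

    def can_be_hired_on(self, day: date) -> bool:
        """The behavior attached to the data: a minimum-age rule."""
        return self.age_on(day) >= MINIMUM_HIRING_AGE
```

Without the `can_be_hired_on` method, the date of birth is just a stored value; with it, the system can refuse the hire on its own.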
In the early days of information processing, all the behavior applicable to the data was defined and stored in a monolithic fashion (read: inside one giant cloud of COBOL code). The advancements in computing, especially in the areas of architecture and programming, led us toward favouring modular over monolithic. Thus we’ve ended up separating data from the behavior.
Traditionally, data is viewed as a pile of deadwood, sitting somewhere on the auxiliary storage at a certain addressable location. Software code then gets layered on top of that data, and its purpose is to lend meaning to the otherwise illegible pile of bytes.
An RDBMS is one such layer of software code that sits on top of the pile of otherwise meaningless bytes (meaningless from the machine’s point of view, of course). It is possible to bypass the RDBMS layer and go straight to the underlying flat file system, where the data actually resides. Hardly anyone ever does that, because it’s too much work for no benefit. But it is possible.
What the RDBMS gives us is a nice abstraction layer that simplifies dealing with the underlying flat storage of the bytes.
This is all fine and dandy, but the religious wars begin when we open the debate on where the domain-specific behavior should reside. In the RDBMS? Or at a higher level of abstraction, in the actual programming code?
Domain-specific Behavior
As we’ve seen, an RDBMS specializes, out of the box, in general-purpose behavior. For example, data that is tagged by the RDBMS as numeric cannot contain alpha characters. And so on.
But when it comes to domain-specific behavior (such as ‘a customer who had already used up two promotional coupons does not qualify for any further discounts’), arguably such behavior does not belong inside the RDBMS. It is none of the RDBMS’s business.
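To make the coupon example concrete, here is one way such a rule might look when it lives in application code rather than in the RDBMS. This is only a sketch: `Customer`, `MAX_PROMOTIONAL_COUPONS`, and the method names are hypothetical, and the limit of two is taken from the example rule above.

```python
from dataclasses import dataclass

# The limit from the example rule: two promotional coupons per customer.
MAX_PROMOTIONAL_COUPONS = 2


@dataclass
class Customer:
    name: str
    coupons_used: int = 0

    def qualifies_for_discount(self) -> bool:
        """A customer who has used up the limit gets no further discounts."""
        return self.coupons_used < MAX_PROMOTIONAL_COUPONS

    def redeem_coupon(self) -> None:
        """Count one redeemed coupon, refusing once the limit is reached."""
        if not self.qualifies_for_discount():
            raise ValueError(
                f"{self.name} has already used {self.coupons_used} coupons"
            )
        self.coupons_used += 1
```

The rule reads much like the sentence that describes it, and it can be unit tested without a database in sight.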
So now we have an interesting situation: just as the RDBMS sits on top of the flat file structure and breathes a certain meaning into the low-level logic of the bytes, the application code sits on top of the RDBMS and breathes more finesse, more sophisticated logic, into its admittedly low-grade generic behavior.
I think everybody will agree with the above. If so, where is the problem, then?
To Centralize or to Decentralize?
The problem lies in people’s understanding of the nature of control. Sure, we all (agile as well as ponderous and slow-paced) agree that the rules must exist. But what we disagree on is how those rules are to be implemented.
There is a group of people who prefer heavy centralization. The top-down military chain of command. And then there is another group of people who prefer self-regulation. This is the so-called democratic crowd.
Based on one’s personal preferences, one will gravitate toward either a heavily centralized implementation (i.e. shove all the business rules inside the good old RDBMS), or toward a more decentralized implementation (let the players police themselves; no one is guilty until proven so).
The top-down crowd insists that everyone is a priori guilty (in that, they resemble Stalin’s reign of terror).
Finally, My Answer
With apologies for the digression, I think I’m now ready to offer my answer to Will’s question:
I think it’s not the best practice to rely on the RDBMS to nurse the set of rules you’ve mentioned above. The reasons are manifold, and are well known (e.g. lack of expressiveness in the RDBMS, lack of version control, no comprehensive testing and refactoring, etc.).
It is much better to set things up in such a way that a layer of agile, human-friendly code sits on top of RDBMS, in the same way that RDBMS sits on top of the underlying constellation of flat files.
Just as no one should be allowed to go directly to the underlying flat files (that is to say, whoever wants to touch the data must go only through the RDBMS instance), neither should anyone be allowed to go straight to the RDBMS. This is because the RDBMS is too low-level for the purposes of business development. Better to have a more comprehensive system acting as a proxy to the underlying RDBMS. That system is much more maintainable, and comes with versioning, refactoring, comprehensive testing, you name it.
If the rules are enforced at this level, the subsequent maintenance would be much easier.
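Here is a rough sketch of how Will’s two rules might be enforced at this level, using Python with the standard `sqlite3` module standing in for the RDBMS. The `Ledger` class, the `ClosedPeriodError` exception, and the table layout are all hypothetical names invented for this illustration. The sketch assumes that every application reaches the tables only through this layer.

```python
import sqlite3
from datetime import datetime, timezone


class ClosedPeriodError(Exception):
    """Raised when a transaction targets a period that is not open."""


class Ledger:
    """Hypothetical proxy layer; the only sanctioned route to the tables."""

    def __init__(self, connection: sqlite3.Connection) -> None:
        self.db = connection
        self.db.executescript(
            """
            CREATE TABLE IF NOT EXISTS periods (
                name TEXT PRIMARY KEY, is_open INTEGER NOT NULL);
            CREATE TABLE IF NOT EXISTS transactions (
                period TEXT NOT NULL, amount REAL NOT NULL);
            CREATE TABLE IF NOT EXISTS period_log (
                period TEXT, is_open INTEGER, changed_by TEXT, changed_at TEXT);
            """
        )

    def set_period_status(self, period: str, is_open: bool, user: str) -> None:
        # Record who changed the status and when, whichever caller asked.
        with self.db:  # commits on success, rolls back on error
            self.db.execute(
                "INSERT OR REPLACE INTO periods (name, is_open) VALUES (?, ?)",
                (period, int(is_open)),
            )
            self.db.execute(
                "INSERT INTO period_log VALUES (?, ?, ?, ?)",
                (period, int(is_open), user,
                 datetime.now(timezone.utc).isoformat()),
            )

    def post_transaction(self, period: str, amount: float) -> None:
        row = self.db.execute(
            "SELECT is_open FROM periods WHERE name = ?", (period,)
        ).fetchone()
        # Unknown periods are treated the same as closed ones.
        if row is None or not row[0]:
            raise ClosedPeriodError(f"period {period!r} is not open for posting")
        with self.db:
            self.db.execute(
                "INSERT INTO transactions (period, amount) VALUES (?, ?)",
                (period, amount),
            )
```

A caller would open a period with `ledger.set_period_status("2005-10", True, "alice")` and post to it with `ledger.post_transaction("2005-10", 100.0)`; once someone closes the period, further postings raise `ClosedPeriodError`, and `period_log` shows who changed the status and when.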
That’s basically what us agile folks are talking about.