In college, as part of my philosophy degree, I took a course on metaphysics. I can still vividly remember the first class. The professor presented us with a simple enough scenario:
Consider a wooden boat, whose construction has just finished. Now fast-forward several months. The boat has been well used, and one of the planks that makes up its hull has gotten worn down and needs to be replaced. After the plank is replaced, is it still the same boat?
The class almost unanimously agreed that it was, indeed, the same boat. Replacement of a single plank certainly wouldn’t change the boat as a whole, would it? But this wasn’t the end of the professor’s line of questioning. What if, he asked, over time every plank was eventually replaced? Would the boat in this state–being comprised of an entirely different set of planks than it originally was–still be the same boat as the one we had considered to begin with?
This question lead to quite a bit more debate than the first, but the end result was still more or less unanimous, albeit with a different conclusion: We agreed that something didn’t quite feel right about saying that it was the same boat, if it was composed of entirely different materials than the original boat. So it probably was not the same boat anymore.
But this conclusion begged one final question, the real question: If the boat wasn’t the same boat after having all of its planks replaced, was there some instant within the period during which the planks were replaced that it ceased being the same boat? Did the change happen right as the final plank was replaced? Or perhaps, was the boat suddenly no longer the same when exactly half of its planks were swapped out? Or was the change gradual? And can a thing gradually cease to be the same as it was?
The class again fell into debate, and eventually we came to the conclusion that we had been wrong to begin with; the boat wasn’t the same boat after the first plank was swapped out. But that was already too late, because it wasn’t even the same boat before that replacement was made. It wasn’t the same boat even the moment after it was created; the microscopic state of the wood and nails had already changed somewhat due to airflow and moisture.
The real “aha” moment was realization that there is a difference between identity and state. Up to and including the period after all of its original planks were swapped out, we could still point to the boat and call it by its original name–even while acknowledging that it wasn’t the same boat, physically, it once was. Its state had changed, but it was still identified the same way.
So–who cares?
People who regularly work with databases are already quite familiar with the differences between identity and state. The idea of an immutable primary key–considered a best practice by many database designers–is a perfect example of this. We want to be able to identify our entity instances even as their state changes. We’re also lucky enough to generally track a very limited number of attributes. Unlike in the boat analogy, where any number of possible factors down to the submicroscopic level can be argued to change the state of the boat, in our databases we have a fixed set of criteria. It’s easy to get enough information to confidently say “this instance is no longer in the same state as it was previously.”
And that brings me around to the actual topic of this post: programmatic concurrency control. Every time I give my talk on designing highly concurrent database applications–in which I discuss pessimistic, optimistic, and multivalue concurrency schemes–I seem to get the same question: “What about updates to different columns? In that situation do we have a collision?”
Neither in the talk, nor in Expert SQL Server 2005 Development (the talk is based on material from Chapter 8 of the book) do I address this topic. It’s simply not something I thought was an issue with these or other concurrency schemes, and so I didn’t cover it. But the proof is in what readers and attendees actually want to learn, and I keep getting this question–so apparently it is an issue for some people.
For the purpose of this post, I’ll start with a summary and work backward from there: The general answer I’ve given to this question, and will continue to give, is that updates of the same row, but to different columns, should be treated as a collision in almost every case.
To begin thinking about why this must be true, we should start with what actually defines an entity/type instance in a database (or elsewhere). An instance is really nothing more than a specific collection of values corresponding to the attributes defined by the entity. Note the key word, “specific.” Any other collection of values is a different instance, or at least a change to the instance. Each instance happens to be uniquely identified based on a certain subset of these attributes (i.e., its primary key), but the instance cannot be defined based solely on this key. As an example, a car’s VIN uniquely identifies the car, but doesn’t tell you what color it is or whether it has a crack in the rear windshield.
And that brings us around to concurrency control itself. What is the purpose of concurrency control solutions, other than to serialize changes made to any given instance? The point is to reduce concurrency on a given instance, not to increase it. This helps us avoid logical traps that otherwise might be extremely difficult to detect.
Take, for example, a table of addresses used by a credit card firm. The customers of the firm send in address update cards when they move or need to make changes to their address on file, and these cards wind up in the hands of data entry clerks whose job it is to input the changes. Today is a special day, because one customer wasn’t sure if he’d already sent a card in, so he sent two. Alas, these cards wound up in the hands of two different data entry clerks simultaneously. Let’s watch what happens if updates to different columns are not considered to be a collision.
The customer’s initial address on file is:
235 Main Street, Springland, OR 97999
As it happens, 235 Main Street is actually an apartment complex, and the customer hasn’t been getting his bills because the postal carrier doesn’t know his name and no apartment number is listed on the mail. The customer’s update cards both contain requests that his apartment number–Apartment 2–get added to his address on file.
The schema for this table includes an ApartmentNumber column, and the first data entry clerk uses it, setting its value to “2”. The second clerk, alas, is new to the job and doesn’t notice the field on the data entry user interface. So he updates the AddressLine1 column, setting its value to “235 Main Street, Apartment 2”.
No collision is detected, but the customer still doesn’t get his bills–his address is now rendered by the mailing system as:
235 Main Street Apartment 2 2, Springland, OR 97999
And now the identity thief who happens to live in Apartment 22 is getting the bills. Oops!
While this is certainly a contrived scenario, it should serve to illustrate the difficulty of coming up with a proper way to avoid collisions without locking an entire row (or instance). Note also that a programmatic concurrency control scheme’s primary job is to block any possibility that such collisions can happen, but its other purpose is to give the user enough information to help avoid problems to begin with. In this case, had either of the data entry clerks seen the other’s update, the invalid data would never have hit the system. A detected collision must do more than just keep bad data out–it also should return information about the nature of the collision, in order to help the user to better do his job.
Concurrency control at the instance level may not be right for every application, but I’ve yet to see a great example of where implementation of a column-based scheme is truly the right choice from a cost-benefit perspective. A column-based scheme will be much more complex to implement, may leave logical holes as shown here, and in the vast majority of cases will not sufficiently improve scalability–really, the only possible argument in its favor–in order to be warranted.
Like the boat, an instance in a database changes with time. And like the boat, even the smallest change to an instance effects a new version, regardless of whether we can still identify it as the same.
well you could say the boat is a slowly changing dimesnsion and is indeed the same boat. 😉
In the future we will run into this as humans. If you replace a heart, a leg and and a knee with artificial ones are you still human? What if you replace the brain? Does it depend on what you have replaced which will qualify you as a human or not?
We already run into this as humans–our bodies are always shedding cells and creating new ones. Also, think about cutting yourself and having the skin grow back, or getting a haircut… Change is a constant. Luckily, we don’t generally need to track things at such a level of detail in database systems 🙂
We need to distinguish between the ubiquitous logical concept of a boat and its transient physical implementation.
ML: That sounds like Plato’s "forms" in "The Republic"… But I don’t think that applies to to this discussion as I’m referring to handling of actual instances rather than types. So the question is not, "is the QE2 still a boat after she gets a new paint job," but rather, "is the QE2 still the same QE2 after she gets a new paint job?"
Well to answer that we just have to answer this
Is Pamela Anderson still the same Pamela Anderson after <list all jobs here> 😉
First we need to figure out if you’re still human with all of that plastic in you… <VBG>
QE2 is QE2 regardless of any of its attributes. If we attach wheels to it will it cease to be the entity that it used to be?
This is why we need to differentiate between identity and being. To name something is not to define it or describe its current state.
We name things to help us distinguish between different instances. Then we rename them to confuse innocent by-standers. 😉