Common cause and Special cause

05 Feb, 2025

I want to riff on the ideas of common cause variation and special cause variation from the work of W. Edwards Deming, and how they can be applied to programming. Wikipedia also lists some synonyms for these:

common cause -> non-assignable cause
special cause -> assignable cause

I believe that assignable cause and non-assignable cause are bad synonyms, because the assignable/non-assignable distinction only holds from a particular point of view. What looks non-assignable from one point of view becomes assignable from another, and our job when trying to solve non-assignable cause problems is to find a point of view from which they are assignable.

Car troubles

My car had an oil leak, so I took it to a service shop, where they "fixed" the issue by replacing a leaky rubber seal. A few months later, the leak returned. I went to another shop, and they did the same—replacing yet another leaky seal. But after a few months, the oil started leaking again, so I took it to a third car service. At the third service center, they recognized that the issue wasn’t a special cause problem but a common cause one. The seals weren’t the root cause—the real problem was excessive pressure in the system due to a faulty valve. No matter which seal the previous shops replaced, one would always end up leaking under the high pressure. By focusing solely on the rubber seals, they couldn't find a real solution because the issue was distributed and couldn’t be traced to any single seal. Even if all the seals were in perfect condition, the oil would still have leaked somewhere. Shifting to a more abstract perspective—considering pressure as a factor—allowed them to identify high pressure as the root cause, investigate its origin, and ultimately resolve the problem.

The straw that broke the camel's back

How can we make sure not to break the camel's back? Was it really that last piece of straw? Or the bag of flour? Or the bag of dates? Or was it some other good being transported? Asking these questions leads us nowhere, because we are looking at the system from the wrong point of view. We need a more abstract view, using the concept of weight. We can gather the weight of each of the goods, add the weights up, and keep the total below a certain threshold. That solves our problem.
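As a minimal sketch of that shift in viewpoint, the Go snippet below ignores what each item is and only tracks total weight against a limit. The `Item` type, the function names, and the 200 kg limit are all assumptions made up for illustration.

```go
package main

import "fmt"

// Item is a hypothetical piece of cargo; only its weight matters here.
type Item struct {
	Name     string
	WeightKg float64
}

// TotalWeight adds up the weight of every item in the load.
func TotalWeight(load []Item) float64 {
	total := 0.0
	for _, it := range load {
		total += it.WeightKg
	}
	return total
}

// CanCarry reports whether adding next keeps the load at or below the limit.
// It never asks which item is "the" culprit, only what the sum is.
func CanCarry(load []Item, next Item, limitKg float64) bool {
	return TotalWeight(load)+next.WeightKg <= limitKg
}

func main() {
	const limitKg = 200.0 // assumed threshold, not a real figure
	load := []Item{
		{Name: "flour", WeightKg: 90},
		{Name: "dates", WeightKg: 80},
		{Name: "cloth", WeightKg: 29.5},
	}
	straw := Item{Name: "straw", WeightKg: 1}

	fmt.Println("current load:", TotalWeight(load))
	fmt.Println("can add straw:", CanCarry(load, straw, limitKg))
}
```

The check is the same for straw, flour, or dates, which is the point: once weight is the unit of account, no single item needs to be blamed.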

Applying it to programming

In the programming industry we do what the first two car repair services did: we try to solve common cause problems as if they were special cause problems. I think the reason is that we are missing the underlying concepts that would let us unify a pattern of problems into one that we can manage (pressure in the car example, weight in the camel example).

Null pointer exceptions

The dreaded null pointer is called the billion dollar mistake. The problem here is a non-assignable one. Well, each crash can be assigned to a missed nil check, but there is an underlying systemic problem in how developers think about their APIs. Having APIs return null creates a huge surface for mistakes. There are two mainstream solutions to this problem:

put guard rails manually and be disciplined
have the compiler automatically put guard rails for you (the Maybe pattern, found mainly in functional languages)

We know that relying on discipline is not the best idea, but the automatic version isn't a lot better: it just automates bad design and pollutes usage code with a lot of pattern matches. What we are doing with manual or automatic nil checking is raising the supply of nil checks to meet the demand from all the nil values. But there are a lot more usage sites than production sites, so what if we reduced the demand instead, by not producing nil values? Just design your APIs to return zero values instead of nil, and make the zero value special. This way nil checking is not needed at all, and your code can flow nicely without a bunch of guard rails. We don't really need a new concept here, just reverse thinking: instead of raising supply, reduce demand.

There are multiple posts on the internet about this pattern, and I also talked about it in Modeling - data first:

Data structures and invariants
Guarantee valid reads - Ryan Fleury
Special Case - Martin Fowler
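To make the "reduce demand" idea concrete, here is a small Go sketch in which a lookup returns the zero value of a struct instead of a nil pointer. The `User` and `Directory` types and the "guest" fallback are hypothetical names chosen for this example, not taken from any of the posts listed above.

```go
package main

import "fmt"

// User is a hypothetical record. Its zero value (ID 0, empty Name)
// is deliberately treated as the "no such user" special case.
type User struct {
	ID   int
	Name string
}

// DisplayName works on every User, including the zero value,
// so callers never need to check for a missing user first.
func (u User) DisplayName() string {
	if u.Name == "" {
		return "guest"
	}
	return u.Name
}

// Directory is a hypothetical in-memory store keyed by user ID.
type Directory map[int]User

// Find returns the stored User, or the zero User when id is unknown.
// A Go map lookup on a missing key already yields the zero value.
func (d Directory) Find(id int) User {
	return d[id]
}

func main() {
	dir := Directory{1: {ID: 1, Name: "Ada"}}

	// Both calls have the same shape; there is no nil check at the usage site.
	fmt.Println(dir.Find(1).DisplayName())
	fmt.Println(dir.Find(42).DisplayName())
}
```

The one decision about what "missing" means lives in `DisplayName`, on the production side, instead of being repeated at every usage site.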

Phantom-traffic-like effects

Phantom-traffic-like effects in distributed systems, such as cascading failures and thundering herds, are a common cause (non-assignable cause) problem. If a car crashes, it is understandable that there will be traffic. But in the phantom traffic video all the cars are fine, yet traffic still occurs, because the emergence of traffic is designed into the system. In a distributed system, all the servers can work correctly, with no failure, but a small slowdown on one server can trigger timeouts and cause a thundering herd. These kinds of phantom-traffic-like problems are designed into the architecture of the systems we build.

What concept do we need to understand what is happening, and what can we do differently so that these problems don't happen? I don't have a fully developed concept yet, but I will try to explain it. We need a global structure for the whole, so that the components have shared information to talk about.

As I explained with the Datomic example in The law of conservation of coupling post, they got rid of variation based on dependent data by creating space in the interface for temporary ids. In an SQL database, a transaction with data dependencies needs multiple requests and responses, so the total transaction time depends on the length of the dependency chain and can easily hit timeouts. In Datomic the transaction time is independent of the length of the data dependency chain, so it can stay within its timeout more consistently.

With the concept of The law of conservation of coupling we can redesign the system to minimize these emergent effects. Minimizing null pointer exceptions with zero values, as mentioned above, also reduces these emergent effects.
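The temporary id idea can be illustrated with a toy in-memory store written in Go. This is only a sketch under stated assumptions: `Store`, `Row`, `InsertOne`, and `InsertBatch` are hypothetical names invented here, and the code does not reproduce Datomic's or any SQL database's actual API. It only counts requests to show that a chain of dependent rows costs one request per link when the client must wait for each real id, and one request in total when the interface leaves room for temporary ids.

```go
package main

import "fmt"

// Row is a hypothetical insert. TempID names the row inside a batch;
// ParentTempID optionally refers to another row of the same batch.
type Row struct {
	TempID       string
	ParentTempID string
	Value        string
}

// Stored is a row after the store has assigned real ids.
type Stored struct {
	ID       int
	ParentID int // 0 means "no parent"
	Value    string
}

// Store is a toy in-memory database that counts client requests.
type Store struct {
	nextID   int
	Requests int
	Rows     []Stored
}

// InsertOne models an insert where the caller must already know the
// parent's real id, so n dependent rows cost n requests.
func (s *Store) InsertOne(parentID int, value string) int {
	s.Requests++
	s.nextID++
	s.Rows = append(s.Rows, Stored{ID: s.nextID, ParentID: parentID, Value: value})
	return s.nextID
}

// InsertBatch resolves temporary ids inside the store, so the whole
// chain costs one request. Nothing is committed if the batch is invalid.
func (s *Store) InsertBatch(rows []Row) (map[string]int, error) {
	s.Requests++
	next := s.nextID
	ids := make(map[string]int, len(rows))
	// First pass: give every temporary id a real id.
	for _, r := range rows {
		if _, dup := ids[r.TempID]; dup {
			return nil, fmt.Errorf("duplicate temp id %q", r.TempID)
		}
		next++
		ids[r.TempID] = next
	}
	// Second pass: replace parent references with real ids.
	staged := make([]Stored, 0, len(rows))
	for _, r := range rows {
		parentID := 0
		if r.ParentTempID != "" {
			id, ok := ids[r.ParentTempID]
			if !ok {
				return nil, fmt.Errorf("unknown temp id %q", r.ParentTempID)
			}
			parentID = id
		}
		staged = append(staged, Stored{ID: ids[r.TempID], ParentID: parentID, Value: r.Value})
	}
	s.nextID = next
	s.Rows = append(s.Rows, staged...)
	return ids, nil
}

func main() {
	seq := &Store{}
	order := seq.InsertOne(0, "order")
	line := seq.InsertOne(order, "line")
	seq.InsertOne(line, "note")
	fmt.Println("sequential requests:", seq.Requests)

	batch := &Store{}
	_, err := batch.InsertBatch([]Row{
		{TempID: "order", Value: "order"},
		{TempID: "line", ParentTempID: "order", Value: "line"},
		{TempID: "note", ParentTempID: "line", Value: "note"},
	})
	if err != nil {
		fmt.Println("batch failed:", err)
		return
	}
	fmt.Println("batch requests:", batch.Requests)
}
```

In the sequential version the number of round trips grows with the dependency chain, which is where the variation in total time comes from. In the batched version the chain length no longer shows up in the request count.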

The cost of solving common cause problems

Because common cause problems are designed into the system, solving them requires a redesign, which is costly in the short term. However, continuing to sweep the problems under the rug is more expensive in the long term.

Conclusion

The important thing in solving common cause problems is finding the abstract concept that gives us a handle on the problem. Without the abstract concept of pressure in the leaking car case, or weight in the camel case, we can't solve the actual problem, and we spend resources treating its symptoms as special causes. In programming we have to build in these abstract concepts, or, if we don't know about them up front, refactor with them in mind later.