“The truth is, the Science of Nature has been already too long made only a work of the Brain and the Fancy. It is now time that it should return to the plainness and soundness of Observations on material and obvious things.”
-Robert Hooke, Micrographia
Root Cause or Root Curse?
“We have failed systems found by the end users in vehicles, causing a big warranty problem.”
“How long has this been going on?” I ask. “Almost two years.”
He drones on, trying to get us to take some of the heat for such an old problem. It doesn’t work, as we are too old for that, but it gives me a chance to get my feet up onto the workbench in my office.
“The team traced the electrical failure to cracks in the circuit board. They are pretty sure they have traced the root cause of cracks to temperature in wave soldering.
The heat cracks the boards, but not quite enough to interrupt the circuit, which opens later, causing the failure and then the warranty claim. The team can crack more of them by aggravating the x in a couple of ways (his term, not mine) but can’t seem to eliminate all of them. Can you help these guys?”
We have completed several projects over the years for him. He is getting the hang of good problem solving, and knows teams are subject to poor strategic choices and tactical moves when pressured from above. I am glad he calls us for help. He trusts us to make progress fast in spite of the pressure.
He was alerted by a team ready to fall into the all-too-common trap of looking for a root cause and then gaining a false sense of security by proving it with a statistical test. The test showed they could make things BETTER than the CURRENT process, but they could not reduce failures to zero or provide a full causal explanation. He knows and feels that the absence of a causal explanation creates a false sense of security, whereas a causal explanation is key to competitive advantage.
The team, which is not directly under his supervision but knows to call him when in a jam, started by asking the default question, “What’s wrong?”, which will lead to an incomplete answer. I suspect they were also influenced by leadership’s demand for an action plan, the longer the better, with lots of action items and daily reviews, creating the illusion of progress.
The team asked, “What’s wrong?” and got an answer: the Root Cause. In a few cases, it might merit being called the root curse.
We have been helping clients solve tough performance and reliability problems for years. We formed the New Science of Fixing Things because we had learned so much in those years. We learned that there were a few important principles that demanded research and codification, principles that are a simplification of the process of solving tough problems. It was not easy. It took years of work, testing, and rework to get it just right. A few of our clients knew we had something special and were patient with us as we worked on it. They saw we knew how to solve tough problems, but needed some time to get the principles down. It all started when we recognized the important difference between a root cause and a causal explanation.
There is beauty in the sophisticated simplicity of what we have created and the power, we think and hope, it has to reshape the world of professional problem solving.
We have gained special insight into problem solving few others have. Gaining insight, seeing more, and applying the lessons learned to make better products is what we are about, products which run better and last longer, within an approach strategically and tactically constrained by principles of good science.
In Micrographia, Robert Hooke said, “The truth is, the Science of Nature has been already too long made only a work of the Brain and the Fancy. It is now time that it should return to the plainness and soundness of Observations on Material and Obvious things.” Well, now really is the time.
The Search for a Causal Explanation
Our approach is one of “plainness and soundness of Observations on Material and Obvious things.” Examining performance and behavior to construct a causal explanation is central to our strategy. The search for a root cause is an answer to a weak question. A causal explanation is powerful and key to competitive advantage.
The search for causal explanation must be the starting point of every project because it changes the initial question from “What’s wrong?” asked by those seeking a root cause, to the far more powerful question, “What’s happening?”
In order to find the answer, there are four strategies that establish the starting point for every project. The first three are central to Convergent Diagnostics for Variation Reduction taught in seminars and workshops.
The fourth, and most applicable to machine performance and reliability, is taught in our lab-based seminar How Stuff Really Works.
Choosing the right model, and getting the initial question into the best form, depends on matching the problem to the strategy based on how it will reveal its nature. And there has not been a problem that does not fit one of the four!
For any project, you need to know how the process will give up clues to reveal its nature. Many serial processes will readily tell you, if you know how to ask, if the physics driving a problem lives in the input or the function. Isolation, separating inputs from function, one of the three strategies taught in Convergent Diagnostics for Variation Reduction, is a way to begin the search for a causal explanation in the cracked board case.
Think about this for a second. Isn’t it simpler (and easier) to ask this kind of a question and get an answer in an hour or a day, to converge on the causal explanation, rather than to try to identify one variable, if properly controlled, that will solve a problem?
Professional Problem Solving
A professional problem solver must consciously have an effective model in the forefront of his thought process that drives his thinking and actions, right from the start. Every bit of evidence must be collected and evaluated in terms of that model. The model is not only a guide to gaining insight, but acts as a constraint, preventing divergence. The model sets the strategy. Our models work.
A Better Start for Cracked Boards?
The team would have been better served by asking if the problem was in the input to the wave solder or in the wave solder function. This is relatively simple, gets a fast answer, and fulfills our requirement to learn one thing every day.
Staying completely away from any discussion of variables, or testing of variables, is important. Variables act in lumps based on how a process is organized, and can be tested as such. This is a subtle and important point in problem solving. Testing root cause variables is risky, in that it leaves off the rest of the process, regardless of what you might think about the protection inherent in factorial tests. Temperature does matter in board cracking. However, it is not nearly the whole story. It does not cause the problem; it releases the energy contained in the boards, energy that was put in prior to wave solder. The leverage is not wave solder temperature, but the process that charged the boards with the energy that the temperature (heat) releases. The simple and effective application of the Isolation Strategy would reveal it, but only to the problem solver who is seeking insight to understand what’s really happening!
Once he discovered the problem was in the input, not the wave soldering process where people had been looking (that’s where the defect showed up), I got an interesting email.
I did a little research over the weekend and found this material is hygroscopic before the final curing stage.
I love to read things like that. We do research in order to help gain insight and understanding, to gain confidence that we really do have a full causal explanation and to convey a clear message about what is really happening. When I read his email, I saw the enthusiasm of an engineer who understood how important it is to have a causal explanation, because those who do research do it to help gain insight. No one ever called to tell me they did research to prove they had a root cause.
Thinking Fast and Slow
Daniel Kahneman has studied the psychology of diagnosis in his book Thinking, Fast and Slow. He wrote that diagnostics of highly diverse systems (slow thinking) have one thing in common: they require attention and are disrupted when attention is drawn away. Focused attention is not required for fast thinking. He further states that everyone is aware of the limited capacity of attention, and makes allowance. I’m not so sure. I think we exacerbate it with email and text messages.
Kahneman puts forth another interesting argument that supports the strategy we developed at The New Science of Fixing Things. He writes that when failing to get an answer to a question, people often change the question to a simpler one. I think they seek refuge in changing the question. Our experience is that the importance of properly framing the starting question is overlooked, making the flaw of changing the question an easy but deadly trap to fall into. Kahneman submits that freewheeling impulses (brainstorming) must be resisted for effective diagnostics. Freewheeling, he argues, is an organized system for changing the question.
It is not unusual at all that the cracked board team missed the best strategic question. When The New Science of Fixing Things gets involved with any project, we want to know what question you are trying to answer, such as, “input or function?” We want to know if the question is strategically sound (a good question) and tactically effective (can get to the answer fast). What we often get is a blank stare, meaning there is no question, just a search for a root cause.
What is your model? Can you describe how you use it to help you? How does it constrain your behavior, and keep you on track? What is the track? Where did it come from? What principle is it founded upon? Who influenced the development? What are alternative models?
If strategy is all about asking the right question, then tactics are about executing in such a way as to get the answer to that question. Keeping strategy and tactics in phase is about mastery and disciplined thinking. Strategy is all about getting the question right!
The Black Box Model
The quality tools central to Root Cause Analysis are based on the Black Box model, and thus on probabilistic decomposition of the behavior of systems and machines. The objective is to identify a single term in the equation Y = η + ε, expanded to reveal the power of individual x’s and their interactions with others to influence a single response.
The Black Box approach is to fit mathematical approximations, variously called Surrogate Models, Response Surface Models, Behavioural Models, Metamodels, or Emulators, that mimic the behaviour of the physical process. No attempt is made to understand the physical mechanisms; only the relationships between several explanatory variables and one or more response variables are considered important. The method was introduced by G. E. P. Box and K. B. Wilson in 1951. They acknowledged that such models are only approximate fits of equations to data, but claimed they are easy to estimate and apply, even when little is known about the process. Models can provide good fits due to a combination of Pareto and the Square Root of the Sum of the Squares axiom, otherwise known as the Sparsity of Effects Principle.
The Black Box model is expressed as Y = η + ε, where η = a + bxn and ε = ‘noise’, as shown in the graphic below. This approach is limited to exposing the effects of variables on a single response (Y), and only as the levels of the chosen input variables (X’s) are manipulated. The output is a term that defines the slope for each input variable (highly sensitive to the chosen inputs), so that slopes, or power to affect the Y, can be compared.
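As a minimal sketch of what a Black Box fit produces, assume the simplest case of one input x and a straight line, η = a + b·x. The linear form, the simulated data, and the names fit_line and simulate_black_box are assumptions made for illustration; they are not taken from any project described here.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x. Returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx              # slope: the "power" of x to move Y
    a = mean_y - b * mean_x    # intercept
    return a, b


def simulate_black_box(xs, a, b, noise_sd, seed=0):
    """Generate Y = (a + b*x) + noise for each chosen level of x."""
    rng = random.Random(seed)
    return [a + b * x + rng.gauss(0.0, noise_sd) for x in xs]


# Hypothetical input levels chosen by the experimenter.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = simulate_black_box(xs, a=2.0, b=0.5, noise_sd=0.1)
a_hat, b_hat = fit_line(xs, ys)
print(f"intercept ~ {a_hat:.2f}, slope ~ {b_hat:.2f}")
```

The fit returns an intercept and a slope for the chosen levels of x, and nothing about the mechanism behind them.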
That is not the best way to observe machine behavior. One risk is that the emphasis is on analysis of factors, not behavior. Behavior can be observed passively, without manipulating anything, if you use our models. It is also simple and easier once you get the hang of it.
It is insight and understanding that provides options for making effective changes based on the principles of good science.
The most important overall goal in diagnosing performance and reliability is the development of a causal explanation of what is happening. That raises the first question for any and every machine performance problem: “What’s happening?”, not “What’s wrong?”
A model is simply an idea that allows us to create an explanation of what we think is happening, or should be happening, in a machine or part of a machine, and organizes our thoughts into something that can be articulated to others. Models are particularly helpful in visualizing things that are very complex, very large, very small or invisible. Models guide our perception of what processes or mechanisms are involved.
The Source-Load-Impedance model is simply the idea that machine behavior can be explained in terms of supplied energy being contained, released and dissipated, and results from 3 properties and 7 functions. The Black-Box model is the idea that a process performs arithmetic operations on a few input variables and adds a stochastic quantity to establish the value of output variables. Various representations of models — block diagrams, formulae, sketches and physical replicas — are essential tools for communicating and conversing about the scientific models underlying them, but they are not the models themselves.
For diagnosis, the differences between models that are most important relate to how they influence and constrain the kinds of questions we ask about what is happening, and the type of evidence we seek in order to converge on the functions dominating behavior.
The Case of the Door that Would Not Close
An automotive company had a problem with electrically operated doors on a van. Their description was a bit off the mark, in that they called it an “issue,” and the doors actually did close most of the way after you pushed the button, then opened. This one had been around longer than the cracked boards. Incredibly, American engineers had made over 100 changes to “address the issue,” none claiming to fix it.
A team had been assigned. They got a good one and a bad one, then started swapping parts around to answer that infamous question, “What’s different?” Now don’t get the wrong idea. It’s a good question, but if you can’t get the answer in a few minutes based on experience, then you need another question. This problem had all the signs of being ugly right from the start. An offending door might fail only a few times out of a hundred. That’s a lot of button pushing to find out if a swap is meaningful. That’s when they asked for help.
When you get confused, maybe you just need a new question. “How is this supposed to work?” might be needed before we even ask, “What’s happening?” No one knew. I wanted the logic circuit and a cartoon, like the one below. Cartoons are one of the most powerful diagnostic tools known to man. Verbal responses, when you don’t know an answer, are easy, untested, and quickly forgotten. Making a picture (draw it, no photos) forces you to look when you wonder, “How does this look?” and you don’t know.
Here is the end of the story. The root cause was the torque applied to the tensioner nut. Want more? Perhaps you need a causal explanation.
The door has a logic circuit. Doesn’t everything these days? The logic was as follows:
If the door is closed and is given a command to open, translate the door and count the number of pulses with the rotary sensor from closed to full open, say 200 counts. When the driver pushes the button to close, begin to translate from the open position. If there is a change (increase) in the current (your child is in the doorway) anywhere from 200 down to 10 counts, then reverse the door. When 10 counts remain, turn off the safety interlock (a small finger can’t get caught). Allow the power to increase so as to latch the door and squeeze the door seal.
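The closing logic above can be sketched in a few lines. This is an illustration under stated assumptions, not the vehicle’s actual controller: the function name close_door, the one-current-sample-per-pulse format, and the obstruction_rise threshold are all hypothetical. Only the 200-count travel and the 10-count interlock cutoff come from the description.

```python
def close_door(currents, open_count=200, interlock_off_count=10,
               obstruction_rise=1.0):
    """Simulate one closing stroke, one current sample per sensor pulse.

    currents[i] is the motor current at the i-th pulse of the stroke.
    Returns (outcome, counts_remaining), where outcome is "reversed",
    "latched", or "incomplete".
    """
    remaining = open_count
    # Assumption: the first sample stands in for the normal running current.
    baseline = currents[0] if currents else 0.0
    for current in currents:
        # The safety interlock is active until 10 counts remain.
        interlock_active = remaining > interlock_off_count
        if interlock_active and current - baseline > obstruction_rise:
            return "reversed", remaining   # something is in the doorway
        remaining -= 1
        if remaining == 0:
            return "latched", 0            # seal squeezed, door latched
    return "incomplete", remaining


# A clean stroke, an obstruction mid-travel, and the normal current rise
# while latching (inside the last 10 counts, interlock off).
print(close_door([1.0] * 200))
print(close_door([1.0] * 50 + [3.0] + [1.0] * 149))
print(close_door([1.0] * 190 + [5.0] * 10))
```

Note that this logic trusts the count completely: it decides where the door is, and whether the interlock applies, from the number of pulses alone.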
Knowing the logic, we data logged an offending vehicle and pasted it into Excel. The idea was to see each pulse and the time between pulses, thus the velocity throughout the translation of the door. There were 200 pulses to open and more than 200 to close. “What’s happening?” When the door started to move, the motor gave it a bit of a yank, it bounced and oscillated. As it oscillated, the motion sensor actually was going backwards, but adding pulses. Thus, the opening sequence did not match the closing, and the logic didn’t like it. If the mismatch was over the programmed limit, it opened.
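The same look at the data can be sketched in code rather than a spreadsheet. The sketch assumes the log is a list of pulse timestamps in seconds for each stroke; that format, the function names, and the mismatch_limit parameter are hypothetical, since the real log layout and the programmed limit are not given here.

```python
def pulse_intervals(timestamps):
    """Time between consecutive pulses, in seconds."""
    return [t1 - t0 for t0, t1 in zip(timestamps, timestamps[1:])]


def pulse_rates(timestamps):
    """Pulses per second between consecutive pulses, a proxy for door speed."""
    return [1.0 / dt for dt in pulse_intervals(timestamps) if dt > 0]


def count_mismatch(open_timestamps, close_timestamps):
    """Closing count minus opening count. Zero means a spot-on match."""
    return len(close_timestamps) - len(open_timestamps)


def door_at_risk(open_timestamps, close_timestamps, mismatch_limit=0):
    """True when the count mismatch exceeds the (assumed) limit."""
    mismatch = count_mismatch(open_timestamps, close_timestamps)
    return abs(mismatch) > mismatch_limit


# Made-up example: 200 pulses to open, 204 to close. The extra pulses stand
# in for the door oscillating while the sensor keeps adding counts.
open_log = [i * 0.01 for i in range(200)]
close_log = [i * 0.01 for i in range(204)]
print(count_mismatch(open_log, close_log))
print(door_at_risk(open_log, close_log))
```

Because a sensor that only counts pulses cannot tell forward motion from backward motion, an oscillation shows up as unevenly spaced intervals and as a closing count that is larger than the opening count.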
Here is what makes problems like this easy: the evidence was there for every single offending door on every open-close cycle. No, the logic might not reject every time, but a close look showed that every one was at risk, with a mismatch in the opening versus closing count. The answer? Tighten the screw. Why? To change the rate at which the energy is contained and released in the door opening and closing cycle.
How do we know it’s fixed? Select good ones and then run a probabilistic test, claiming an answer with a 5% chance of getting tricked if you really don’t have the answer? No. Just make sure the door thinks it opens as far as it closes and the count is a spot-on match.
A causal explanation is a rich description of how stuff works. It requires starting with a search leading to functional understanding of how a system is supposed to work, then how it really is behaving. The clever problem solver will find the search for a causal explanation a fulfilling experience, leading to new knowledge and confidence in one’s ability to solve tough problems. Once you understand and practice it, your peers will look to you as a master problem solver.