Knowing when a piece of equipment is going to fail (predictive maintenance) is much more difficult than making it last longer (proactive maintenance). Even more complex is root cause analysis (RCA), which is performed postmortem, like an autopsy. Still, reliability professionals are increasingly stressing the importance of performing RCAs following all failures of critical machinery. As odd as it sounds, it is more productive to study failures than successes. After all, an apparent success may actually be a failure in disguise: a problem waiting to happen. Studying failures teaches valuable lessons for developing predictive and proactive maintenance strategies.


Figure 1. Suspect Lineup

Root cause failure analysis is a process of working backward through a sequence of events or steps that led to functional failure of the machine. This process is often referred to as “Asking the Repetitive Why” or “the Five Whys”. The first “why” is intended to reveal the obvious and more immediate cause, sometimes referred to as the direct cause. This is the suspect that first, and most often, bears the blame. However, by continuing the series of questions, one can often expose hidden causes that include contributing causes (partners in crime) and intermediate causal agents. With a little luck, your interrogation will lead you to the root cause. Keep in mind there may be multiple root causes.

Fishbone diagrams (also known as Ishikawa diagrams) are designed to guide a process of elimination from an evolving list of possible causes that answer repetitive “why” queries. To be successful, one needs not only the knowledge to identify all the possible causes, but also the savvy to eliminate the right ones from consideration. If you are a skillful forensic pathologist, for instance, you might be good at figuring out whether the subject died by poison or by natural causes and which poison or natural cause actually occurred. For the rest of us who do not perform autopsies for a living, figuring out the cause of death would be pretty much “mission impossible”.

Many machine and lubricant failures are equally complex, enough so to confound even the most sophisticated failure investigator. I’ve seen RCAs heading wildly down wrong pathways and scrawny fishbone diagrams with all bones leading to dead ends. To help avoid such problems, the RCA process should be aided, where possible, by researching the histories of similar failures, using fault trees and following published troubleshooting guides. Better yet, consider hiring a machinery forensic pathologist.

Why the Bearing Failed
Some things are best illustrated by example: A bearing failed on a turbine generator train due to lubricant starvation (direct cause) from deposits that plugged the orifices through which the oil flows. In a postmortem study, the oil analysis lab found that the lubricant had oxidized, leading to the deposit formation. The lubricant supplier was blamed for allegedly delivering a defective or poor-quality product. The actual sequence of events is described in the following list (see also the Sequence of Events chart in Figure 2):


Figure 2. Sequence of Events

  1. Reacting to a directive by the company’s CEO to quickly improve financial performance, plant management took many cost-reduction measures, including the purchase of economy-grade turbine oil and filters. Additionally, the oil analysis sampling interval was extended from monthly to semiannual. The company also suspended all training for maintenance personnel.

  2. The cheap filters allowed a high population of environmental particles to build up in the circulating oil system.

  3. Wear caused by the dirty oil produced an increasing concentration of metal debris as well. The poor capture efficiency of the filters allowed the metal particles to circulate unchecked, causing even more wear.

  4. The particle contamination led to seal damage and leakage, permitting the ingress of steam which later emulsified in the oil. The emulsification was further aided by the presence of the particles in the oil (polar emulsifying agents).

  5. The combination of emulsified water and particle contamination weakened the air release properties of the oil, causing a rising air/oil ratio. Entrained air reduced heat-transfer (cooling) properties and decreased the flow-rate efficiency of the oil pumps, among several other problems.

  6. The cheap turbine oil had a short oxidative life compared to premium lubricants, owing to the selected base oil and additives in the formulation. The catalytic effects of water contamination and metal particles further shortened the oxidative service life (antioxidant additive depletion and base oil oxidation). Oxygen-carrying entrained air and rising heat further fueled the problem.

  7. The conditions that led to the onset of oil oxidation went unnoticed by maintenance and operations personnel due to lack of training and infrequent use of oil analysis.

  8. Soon, insoluble oxidation products began laying down varnish and deposits on critical machine surfaces including orifices, grooves and glands within the bearings.

  9. Eventually, oil flow into the bearings became restricted causing impaired lubrication, increased friction and rising heat. The varnish compounded the problem by insulating the bearing surfaces from efficient heat transfer.

  10. The elevated oil temperature, combined with entrained air, metal particles and emulsified water, accelerated the rate of oil oxidation and deposit formation. Lack of training and infrequent oil analysis allowed the impending bearing failure to go undetected.

  11. With only a trickle of flow now reaching one of the bearings, the hydrodynamic oil film was disrupted and the bearing failed completely.
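
For readers who like to see the "repetitive why" written down, the sequence above can be sketched as a small cause map in Python. This is only an illustration under stated assumptions: the event names paraphrase the numbered steps, the links between them are one reading of the narrative, and the function names (`trace_causes`, `root_causes`) are hypothetical, not part of any RCA tool. The heat-and-oxidation feedback loop in steps 9 and 10 is left out to keep the map short.

```python
from collections import deque

# Each event maps to the events that directly caused it (the answers to "why?").
# Names and links are an assumed paraphrase of the numbered sequence of events.
CAUSES = {
    "bearing failure": ["oil starvation"],
    "oil starvation": ["varnish deposits"],
    "varnish deposits": ["oil oxidation", "oxidation undetected"],
    "oil oxidation": ["economy-grade oil", "water emulsion",
                      "metal wear debris", "entrained air"],
    "oxidation undetected": ["training suspended", "semiannual oil analysis"],
    "entrained air": ["water emulsion", "particle contamination"],
    "water emulsion": ["seal damage"],
    "seal damage": ["particle contamination"],
    "metal wear debris": ["particle contamination"],
    "particle contamination": ["economy-grade filters"],
    "economy-grade oil": ["cost-cutting directive"],
    "economy-grade filters": ["cost-cutting directive"],
    "training suspended": ["cost-cutting directive"],
    "semiannual oil analysis": ["cost-cutting directive"],
}


def trace_causes(failure, causes):
    """Walk backward from a failure; return {event: fewest whys needed to reach it}."""
    depth = {failure: 0}
    queue = deque([failure])
    while queue:
        event = queue.popleft()
        for cause in causes.get(event, []):
            if cause not in depth:  # visit each event once, so loops cannot hang
                depth[cause] = depth[event] + 1
                queue.append(cause)
    return depth


def root_causes(failure, causes):
    """Events reachable from the failure that have no recorded cause of their own."""
    return sorted(e for e in trace_causes(failure, causes) if not causes.get(e))


whys = trace_causes("bearing failure", CAUSES)
print("Direct cause:", CAUSES["bearing failure"])
print("Root cause(s):", root_causes("bearing failure", CAUSES))
print("Whys to reach the directive:", whys["cost-cutting directive"])
```

In this reading, the first "why" lands on oil starvation, the suspect that would normally take the blame, while the only event with no recorded cause of its own is the cost-cutting directive. A different investigator might draw the links differently or stop at a different depth, and a map with several uncaused events would report several root causes.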

Many have said that RCA is more art than science. Indeed, it seems to draw from a range of skills, talents, experience and knowledge. Some investigators seem to have a special knack for it while others toil through the process. But even if an RCA is unsuccessful at uncovering the root cause, the process usually brings forth new knowledge and greater awareness of reliability risk factors to the team. This new knowledge can then be rolled into criticality studies, such as failure modes and effects analysis (FMEA), leading to an overall improvement in machine reliability. It’s always wise to ask why.