Failures that give little or no warning are the worst kind. Think of a tire. It can wear out slowly over thousands of miles of driving, or it can rupture suddenly at full highway speed from a random piece of road debris. You can monitor tread loss over time and conveniently schedule a tire change. But who could predict the sudden appearance of a sharp piece of iron?

Fault bubbles are sudden-death conditions in waiting. They haven’t ruptured yet, but they are about to. Like a tire, fault bubbles can burst instantly. Unlike the tire, most fault bubbles in industrial machinery can be revealed by condition monitoring, including careful examination by a skilled inspector. Once a fault bubble is detected, its root cause can be arrested or at least mitigated.

In past columns, I’ve mentioned the P-F interval. As a review, “P” is the point at which a failure in progress is first detected, while “F” is the point of functional failure, where the machine can no longer perform its intended function. Although the P-F interval is a useful theoretical concept, it is rarely applied to real-world machines, because the real world is full of variable events that distort its predictability.

Simply stated, the P-F interval is not well-behaved. It is a time interval that is influenced by detection skill and frequency. It is also influenced by multiple operational factors that determine the failure development period. These include:

  • Multiple components on a single machine or drive train, each with its own P-F tendencies
  • Multiple failure modes for any single component
  • Variable duty cycle (speeds, loads, shock, temperature, etc.)
  • Remaining useful life (RUL) varies with age. For any given fault mode, the P-F interval shrinks as the machine ages.
  • Failure detection methodology and effectiveness vary (ability to detect faults early).
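Because each component and failure mode on a machine carries its own P-F tendency, the shortest interval effectively sets the pace for detection. A minimal sketch of that idea follows. The `FaultMode` class, the example components and their P-F values are all hypothetical, and the `fraction=0.5` default assumes the common reliability-centered maintenance rule of thumb of inspecting at no more than half the P-F interval; that rule is not stated in this column.

```python
from dataclasses import dataclass


@dataclass
class FaultMode:
    component: str   # hypothetical component name
    mode: str        # hypothetical failure mode
    pf_days: float   # assumed P-F interval estimate, in days


def governing_mode(modes):
    """Return the fault mode with the shortest P-F interval."""
    return min(modes, key=lambda m: m.pf_days)


def max_inspection_interval(modes, fraction=0.5):
    """Longest inspection interval (days), assuming the rule of thumb
    interval <= fraction * shortest P-F interval on the machine."""
    return fraction * governing_mode(modes).pf_days


# Hypothetical P-F estimates for one pump drive train
modes = [
    FaultMode("pump bearing", "surface fatigue", 60.0),
    FaultMode("pump bearing", "lubricant starvation", 2.0),
    FaultMode("coupling", "misalignment", 14.0),
]

worst = governing_mode(modes)
print(worst.component, "-", worst.mode)   # pump bearing - lubricant starvation
print(max_inspection_interval(modes))     # 1.0
```

In practice, duty cycle, machine age and detection skill all move these numbers, so any estimate like this is a starting point rather than a fixed schedule.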

The following are a few examples of short-P-F and sudden-death failure modes and fault bubbles. What root-cause-focused intervention strategy would you apply to detect and neutralize these threatening conditions?

  • Oil filter rupture
  • Negligent introduction of a wrong oil
  • Fish-bowl conditions (disturbed and mobilized bottom sediment)
  • Sudden and severe shaft misalignment
  • Stiction/silt lock of hydraulic valves (motion impediment)
  • Grease “soap lock” starvation of an auto-lube system
  • Impaired or complete loss of oil supply to a bearing or gear
  • Heavy fuel dilution from defective fuel injectors or seal leakage
  • Accidental introduction of chemical contamination
  • Gross seawater contamination of a shipboard hydraulic fluid
  • Shock loading of a large thrust bearing

When it comes to fault bubbles, the best defense is a good offense. Don’t be reactive; time is not on your side. Instead, be proactive. How wise is it to wait until your first heart attack to make needed lifestyle changes? Even if you are stricken with heart disease, there are many options for early detection and treatment. The following outlines a good proactive strategy to avoid or control fault bubbles.


Root Causes Relate to Reliability Culture and Education

Most fault bubble root causes can be traced to human agency. This could relate to skill, attitude or the general reliability culture. Just as it’s never a good strategy to “inspect in” quality, it is never sufficient to “condition monitor in” reliability. Don’t get me wrong: I’m a strong believer in condition monitoring and the value of measurement. However, the biggest bang for the buck comes from building reliability teams supported by education and culture.

Stop celebrating rapid repair and start celebrating the failure “non-event,” that is, the failure or fault bubble that never occurred. A positive, nurturing maintenance culture is a critical plant asset. When people do good work, they feel good about themselves and their jobs. When people do bad work, they feel bad about themselves and their jobs, and that bad feeling becomes a serious morale problem that multiplies and spreads. The simple solution is to enable people to do good work through culture and relentless education. And good work should be recognized and celebrated.

This is both the problem and the solution. Culture drives behavior. Behavior influences quality of work. Quality work is fundamental to plant reliability and the cost of reliability. Good culture has inertia, too. It fuels a chain of reinforcing successes. Small successes beget larger and more sustainable successes. Creating a good culture starts and ends at the top, at the leadership level. When good leaders are in charge, everyone wins. When bad leaders are in charge, the culture becomes negative, hostile or stagnant, and everyone loses. Good culture also emerges from management’s aspiration for improvement and the inherent desire to do good work. It relates to skills, tools, work plans and machine readiness.

* The condition monitoring methods do not read across to the failure modes on the same line. Instead, they designate the ability to detect the root causes at the top of the table.

Focus on Detecting Root Causes, Not Just Symptoms

Work backward from sudden-death failure to develop your condition monitoring game plan. The table below illustrates this in simplified form for an imaginary pump bearing. Rank the main failure modes by likelihood and severity down the left side of the table, making sure any critical fault bubbles are included. In this example, the failure modes are mechanical wear (abrasion, scuffing, etc.), corrosion, surface fatigue and oil seal failure (causing lubricant starvation).

Next, list root causes across the top of the table. I’ve included misalignment, particle contamination, water contamination and wrong/degraded oil. In the blue area of the table, put X marks under each root cause associated with each failure mode on the left. One X mark is for a root cause that has a minor contribution to a failure mode. Two marks are for a moderate contribution, and three are for a major one.

On the right side of the table, list the condition monitoring options. I have identified heat gun/thermography, inspection, vibration and oil analysis. In the pink area of the table, put O marks under each root cause that is detectable by the condition monitoring method. One O is for slightly effective, two are for moderate, and three are for highly effective.

Finally, tally the X’s and O’s in the columns below the root causes. For instance, under water contamination, there are seven X’s and six O’s. Ideally, the O’s should outnumber the X’s, or at least come close, and there should never be fewer than four O’s for a given root cause. The purpose of this exercise is to align the condition monitoring strategy with the ranked failure modes and their root causes.
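The tally step lends itself to a short script. The sketch below totals the X’s and O’s per root cause and applies the two checks described above: O’s should match or exceed X’s, and there should never be fewer than four O’s. All mark values are hypothetical placeholders, not the figures from the author’s table, and deciding whether trailing O’s are “close enough” is left to the reader.

```python
# Hypothetical marks for an imaginary pump bearing (not the article's table).
ROOT_CAUSES = ["misalignment", "particles", "water", "wrong/degraded oil"]

# X marks per failure mode: 0 = none, 1 = minor, 2 = moderate, 3 = major
x_marks = {
    "mechanical wear":  [1, 3, 1, 2],
    "corrosion":        [0, 0, 3, 2],
    "surface fatigue":  [3, 2, 2, 1],
    "oil seal failure": [2, 1, 1, 1],
}

# O marks per monitoring method: 0 = none, 1 = slightly, 2 = moderately,
# 3 = highly effective at detecting that root cause
o_marks = {
    "heat gun/thermography": [2, 0, 0, 1],
    "inspection":            [1, 2, 2, 2],
    "vibration":             [3, 1, 0, 0],
    "oil analysis":          [0, 3, 3, 3],
}

MIN_O = 4  # floor from the text: never fewer than four O's per root cause


def column_totals(marks):
    """Sum each root-cause column across all rows of the table."""
    return [sum(col) for col in zip(*marks.values())]


def audit(x_table, o_table):
    """Compare X and O totals per root cause and flag coverage gaps."""
    report = {}
    totals = zip(ROOT_CAUSES, column_totals(x_table), column_totals(o_table))
    for cause, x, o in totals:
        if o < MIN_O:
            status = "too few O's"
        elif o < x:
            status = "O's trail X's"
        else:
            status = "ok"
        report[cause] = (x, o, status)
    return report


for cause, (x, o, status) in audit(x_marks, o_marks).items():
    print(f"{cause:20s} X={x:2d} O={o:2d}  {status}")
```

With these placeholder marks, water contamination is the one root cause flagged, because its detection coverage trails its contribution to the failure modes.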

Inspection and the Power of Frequency

The best countermeasure for short P-F intervals is constant inspection and measurement. For many machines, especially high-speed, high-risk ones, real-time monitoring with embedded sensors is justified, because it enables instant detection of certain root causes and symptoms. However, this is not practical for the vast majority of common plant machines. For these, a simpler, more realistic approach to early detection is needed, and there is no better option than skillful, motivated daily inspection aided by inspection windows and tools.

Often the simplest solution is best. How do you get the optimum level of reliability at the lowest possible cost? Inspection presents some benefits and advantages that are difficult, if not impossible, to duplicate with other condition monitoring options. These include:

  • Inexpensive, simple, lasting deployment
  • Operator-driven
  • More emphasis on examination skills, less on technology
  • Root-cause-oriented to avoid developing fault bubbles; more proactive, less reactive
  • Early fault detection; more predictive; fewer misses and “just-in-time” saves
  • The power of frequency and the one-minute daily inspection
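To see why frequency carries so much weight against short P-F intervals, consider a simple probability model. It assumes periodic inspections, a fault onset (point P) that falls at a random moment relative to the inspection schedule, and a fixed per-inspection chance of spotting the fault as a rough stand-in for inspector skill. The function name, its parameters and the three-day P-F interval in the example are hypothetical illustrations, not measured values.

```python
import math


def detection_probability(pf_days, interval_days, p_detect=1.0):
    """Chance that at least one inspection catches a fault before it fails.

    Assumptions (hypothetical model, not from the column):
      - inspections occur every `interval_days`,
      - point P falls at a uniformly random moment relative to that schedule,
      - each inspection inside the P-F window independently detects the
        fault with probability `p_detect` (a proxy for skill and method).
    """
    if pf_days <= 0 or interval_days <= 0:
        raise ValueError("intervals must be positive")
    if not 0.0 <= p_detect <= 1.0:
        raise ValueError("p_detect must be between 0 and 1")
    ratio = pf_days / interval_days
    k = math.floor(ratio)   # inspections that always land in the window
    frac = ratio - k        # probability that one extra inspection lands in it
    miss = (1 - frac) * (1 - p_detect) ** k + frac * (1 - p_detect) ** (k + 1)
    return 1 - miss


# Hypothetical 3-day P-F fault; inspector spots it 70% of the time per look
for label, every in [("daily", 1), ("weekly", 7), ("monthly", 30)]:
    print(f"{label:8s} {detection_probability(3, every, 0.7):.3f}")
```

Under these assumptions, a one-minute daily check catches the hypothetical three-day fault about 97% of the time, while weekly and monthly rounds catch it only about 30% and 7% of the time. The model ignores correlated misses (for example, an inspector who never looks at a particular sight glass), so real-world odds may well be worse.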

We all seek more for less, and no one likes the pain and frustration that often come with overly complex solutions to simple problems. KISS (keep it simple, stupid!) solutions should always be your first choice, and they are at the core of inspection. For a large number of condition monitoring tasks, no array of sensors and computer intelligence can outperform a human inspector.

Ballooning and Escalating

The unique nature of fault bubbles must not be misunderstood or understated. These are rapidly ballooning and escalating events. Many lie in the shadows unnoticed, then suddenly burst and do their damage. The belief that all failures are like a tire’s tread wear (slow and progressive) must be discarded. A full understanding of, and respect for, the threat that fault bubbles pose must be matched by a vigilant reliability culture of education, condition monitoring and frequent inspection. This is my main message.