Understanding Failure Modes Before Sampling Oil

Noria Corporation

There is little doubt that oil analysis is a vital tool in helping to ensure that critical oil-lubricated assets remain reliable. Used properly, oil analysis can give early warning of impending mechanical failure and provide vital feedback to help ensure contamination levels are below the danger level, all the while validating that the physical and chemical properties of the oil have not changed. By now, you’ve heard about the importance of taking a representative sample from the right location using the right methodology.

But what about sampling frequency? How often should we take samples? All too often, sampling frequency is driven not so much by sound reliability or lubrication engineering logic, but rather by the cost of the analysis. Believing that we’re spending too much on analysis for a specific machine or class of machines, the temptation is to back off and sample less frequently to save money.

This approach is like canceling your life insurance policy because you haven’t cashed in recently! Personally, I’d be quite happy paying my life insurance premiums for a long, long time! In fact, I hope I never cash in, though logic might suggest otherwise.

Of course, cost is a factor. So, how do you strike a balance between sampling too frequently and wasting money, vs. sampling too little and potentially missing a problem? The key is to understand the likely failure modes of each machine, then design the program to address the most common failures. In doing so, it should be recognized that each machine or component will likely have several different possible failure modes. For this reason, our program – both sampling frequency and test slates – needs to be designed to address those failure modes most likely to occur.

Potential vs. Functional Failure
To explain the concept of failure modes, reliability engineers often talk about the P-F interval (Figure 1). The P-F interval is simply the time between a potential failure becoming detectable (Point P) and the point at which a functional failure (Point F) occurs. It is important to note that a functional failure doesn’t necessarily mean the machine has failed catastrophically, but rather that it is no longer functioning in its designed state (for example, speed, quality, capacity, etc.). Some failures, such as a light bulb burning out, are a step function with essentially no warning, while others take some time to propagate to the point of functional failure, as shown in Figure 1.

Figure 1. The P-F Interval

Most reliability engineers agree that in order to achieve sufficient early warning of a problem condition, samples should be taken three to five times within the estimated P-F interval. For example, sample every 18 to 30 days if the P-F interval is determined to be 90 days.
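The arithmetic behind that rule of thumb can be sketched in a few lines of Python. This is only an illustration: the function name sampling_window and its parameters are hypothetical, and the three-to-five figure is the guideline quoted above, not a fixed standard.

```python
def sampling_window(pf_interval_days, min_samples=3, max_samples=5):
    """Return (shortest, longest) days between samples for a P-F interval.

    Assumes the rule of thumb of three to five samples per P-F interval.
    """
    if pf_interval_days <= 0:
        raise ValueError("P-F interval must be positive")
    if not 0 < min_samples <= max_samples:
        raise ValueError("need 0 < min_samples <= max_samples")
    # More samples per interval means fewer days between samples.
    shortest = pf_interval_days / max_samples
    longest = pf_interval_days / min_samples
    return shortest, longest


# The 90-day example from the text: sample every 18 to 30 days.
print(sampling_window(90))  # (18.0, 30.0)
```

A shorter estimated P-F interval pulls both ends of the window in proportionally, which is why the estimate itself deserves more attention than the analysis budget.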

It is also important to understand that each machine can and will have multiple failure modes. For example, consider a simple gear reducer driving a conveyor. The gearbox will have a number of possible ways in which it might fail, including (though not limited to) the following:

Overload
Misalignment
Wrong oil
Low oil level
Age-induced fatigue
Lubricant degradation
Contamination (water or particles)

Depending on application and circumstance, some of these failure modes will be more likely than others. For this reason, we need to realize that each mode, and the methodology deployed to find the problem, will have a different P-F interval and will thus require a different sampling or inspection frequency.

Low Oil Level
To illustrate the point, consider two of these modes: low oil level and particle contamination. Depending on circumstance, the degree to which the oil is low, and whether the gearbox is splash-lubricated vs. force-fed, a low oil level could induce a lubrication-related failure in a matter of hours or, at most, days. This is the reason why daily (or at a minimum, weekly) level checks are strongly recommended for wet sump applications.

Conversely, while it’s safe to say that the life expectancy of a gearbox with heavy particle contamination will be shorter than one in which the oil is kept clean, it will typically take many months before a contamination-induced failure occurs on a large gear reducer. Of course, the same could not be said for a critical servo-controlled hydraulic system, where excessive particle contamination might induce failure in only a few days or weeks.

So, how do we maintain fiscal responsibility without compromising detectability? The key is to realize that you don’t have to perform all tests simultaneously on each and every sample! Consider, for example, the servo-controlled hydraulic system. If we deem that the P-F interval for fluid degradation (monitored by acid number trending) is six months, then acid number testing every two months, or three tests per interval, may be acceptable.

Conversely, if the P-F interval for contamination-induced failure is estimated to be just 30 days, maybe we need to perform weekly particle counting, or perhaps purchase an on-site particle counter or patch test kit to enable more cost-effective and more frequent sampling.
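To make the per-test idea concrete, here is a minimal sketch that works out a separate sampling window for each test on the servo-controlled hydraulic system. The names PF_INTERVALS_DAYS, sampling_window and build_schedule are hypothetical, six months is approximated as 180 days, and the three-to-five-samples guideline is assumed to apply to every test.

```python
# Estimated P-F intervals from the hydraulic example, in days.
# Assumption: "six months" is approximated as 180 days.
PF_INTERVALS_DAYS = {
    "acid number": 180,    # fluid degradation
    "particle count": 30,  # contamination-induced failure
}


def sampling_window(pf_interval_days, min_samples=3, max_samples=5):
    """Return (shortest, longest) days between samples for a P-F interval."""
    if pf_interval_days <= 0:
        raise ValueError("P-F interval must be positive")
    return pf_interval_days / max_samples, pf_interval_days / min_samples


def build_schedule(pf_intervals_days):
    """Map each test to its own sampling window instead of one shared frequency."""
    return {test: sampling_window(pf) for test, pf in pf_intervals_days.items()}


schedule = build_schedule(PF_INTERVALS_DAYS)
for test, (shortest, longest) in schedule.items():
    print(f"{test}: sample every {shortest:.0f} to {longest:.0f} days")
# acid number: sample every 36 to 60 days
# particle count: sample every 6 to 10 days
```

Under these assumptions, acid number testing every two months sits at the long end of its window, while weekly particle counting falls inside the 6-to-10-day window.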

As always, this is my opinion. I’m interested in yours.