The Role of Lubrication in Root Cause and Failure Analysis

Wes Cash, Noria Corporation

It’s a scenario most people are all too familiar with. Despite our best efforts, a component failed in the plant and brought production to a standstill. These failures are the bane of reliability and are often analyzed to determine failure modes, to learn lessons that prevent future failures, and, in some cases, to redesign the machine or its environment to maximize equipment uptime.

Terms like Root Cause Analysis (RCA) and Failure Reporting, Analysis and Corrective Action System (FRACAS) get thrown around, and teams are assembled to look at the failure from the perspectives of operations, maintenance, the operating environment, and more. Ultimately, the goal is to determine what specifically led to the machine failure so the same thing doesn’t occur time and again.

But the question remains: what role should lubrication play in all this?

Lubricant analysis exists in many forms within a facility, from front-line staff performing inspections all the way up to real-time sensors that report on various parameters of the component. All of this data is significant and should be integrated. Each condition-monitoring technology has its own strengths and weaknesses, but blending methods can provide a better snapshot of what may actually be going wrong with a piece of equipment.

When it comes to lubrication analysis, the main data sources should include:

There are many different strategies for performing a root cause analysis, but they all share some similarities. With the same end goal of determining what happened and how to keep it from happening again, many methods are in use, including the five whys, fishbone diagrams, fault tree analysis, and scatter plot analysis, to name a few.

The Phases of Proper Failure Analysis

All these tools are valuable in supporting whichever failure-analysis method makes the most sense for your facility or organization. To simplify, here are the five main phases I recommend following:

1. Data Collection – This includes fact-finding, interviewing witnesses to the event, and determining whether other events occurred in sequence with the failure. During the data-collection phase, it is important to preserve as much evidence as possible. This includes documenting final running conditions, taking photographs of the equipment and components, and securing data samples like those mentioned above. Diligence is key to preserving the integrity of the data gathered during this step.

2. Assessment – During the assessment phase, analytical methods such as the five whys may be employed. The overall goal of this step is to analyze the data and determine whether it reveals the root cause of the failure. Oftentimes, root causes are grouped into categories such as the following:

a. Equipment/Material Problems

b. Design Problems

c. Procedural Problems

d. Human Error

e. Training Deficiency

f. Management Problems

This is not an exhaustive list, and a single failure may have several causes that together led to a catastrophic outcome. For instance, the bearing wasn’t lubricated properly because the scheduled PM frequency was too long. Some technicians may just chalk this up to a lubrication issue and not look at everything else that was occurring.

3. Corrective Action – This is the remediation plan to fix the issue and stop it from occurring again. Oftentimes, this plan will involve various departments such as maintenance, reliability, engineering, and operations. Depending on the complexity of the corrective action, a complete redesign or rebuild of the equipment, or of the environment that houses it, may be the most prudent course. These cases are rare but do occur.

4. Inform – The actions to prevent recurrence must be reported to the parties that will be responsible for implementing them. It is also good practice to share the information with the departments that have an impact on the future operation of the asset. Sometimes, this may involve planners when a preventive maintenance task (PM) or bill of materials (BOM) needs to be updated to reflect the changes stemming from this process.

5. Follow-up – As with any process, a verification step is often employed to ensure that the corrective action plan was put into place. This may also include more detailed analysis going forward, such as increasing the frequency of lubricant sampling, inspections, and equipment testing.
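The assessment phase can be sketched in a few lines of Python. This is only an illustration, not a prescribed tool: the class names (`CauseCategory`, `Why`, `FiveWhys`) are hypothetical, and the category assigned to each answer in the bearing example is an assumption made for demonstration. The point is that recording each "why" alongside a category makes it harder to stop at "lubrication issue" when a procedural cause sits underneath it.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class CauseCategory(Enum):
    """Root-cause categories taken from the assessment phase list."""
    EQUIPMENT_MATERIAL = "Equipment/Material Problems"
    DESIGN = "Design Problems"
    PROCEDURAL = "Procedural Problems"
    HUMAN_ERROR = "Human Error"
    TRAINING = "Training Deficiency"
    MANAGEMENT = "Management Problems"


@dataclass
class Why:
    """One answer in a five-whys chain, tagged with a cause category."""
    answer: str
    category: CauseCategory


@dataclass
class FiveWhys:
    """Hypothetical record of a five-whys analysis for one failure."""
    failure: str
    chain: List[Why] = field(default_factory=list)

    def ask_why(self, answer: str, category: CauseCategory) -> "FiveWhys":
        # Append the next answer; returning self allows chained calls.
        self.chain.append(Why(answer, category))
        return self

    def deepest_cause(self) -> Optional[Why]:
        # The last answer recorded is the deepest cause found so far.
        return self.chain[-1] if self.chain else None

    def categories(self) -> List[CauseCategory]:
        # Distinct categories touched by the chain, in order of appearance.
        seen: List[CauseCategory] = []
        for why in self.chain:
            if why.category not in seen:
                seen.append(why.category)
        return seen


# Bearing example from the text; the category assignments are assumptions.
analysis = (
    FiveWhys("Bearing failed")
    .ask_why("Bearing was not lubricated properly",
             CauseCategory.EQUIPMENT_MATERIAL)
    .ask_why("Scheduled PM frequency was too long",
             CauseCategory.PROCEDURAL)
)

print(analysis.deepest_cause().answer)
print([c.value for c in analysis.categories()])
```

A chain that touches more than one category is a hint that the corrective action will likely need to involve more than one department.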

Different types of failures call for different depths of analysis. For instance, a single-point failure, in which a single component fails, might be solved in a matter of minutes and not require a regimented RCA process. With multi-point or sequential failures, the true root cause can be more difficult to determine, so they require more focus and investigation to get to the real culprit.

Deciding when and where to deploy your RCA process can be based on many criteria. Usually, full RCAs are reserved for those failures that are serious, complex, and repeating. If this isn’t the case, a simplified model of RCA can be used effectively without tremendous risk to the organization.
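As a rough sketch, that screening rule could be written down so it is applied the same way every time. The names `FailureEvent` and `rca_level` are hypothetical, and reading "serious, complex, and repeating" as requiring all three at once is an assumption; a facility could just as reasonably trigger a full RCA on any one of them.

```python
from dataclasses import dataclass


@dataclass
class FailureEvent:
    """Hypothetical summary of a failure used for RCA screening."""
    description: str
    is_serious: bool
    is_complex: bool
    is_repeating: bool


def rca_level(event: FailureEvent) -> str:
    """Return 'full' or 'simplified'.

    Assumption: a full RCA is reserved for failures that meet all
    three criteria; anything else gets the simplified process.
    """
    if event.is_serious and event.is_complex and event.is_repeating:
        return "full"
    return "simplified"


# Illustrative events; the descriptions and flags are made up.
repeat_gearbox = FailureEvent(
    "Gearbox failure, third occurrence, multiple components involved",
    is_serious=True, is_complex=True, is_repeating=True,
)
single_seal = FailureEvent(
    "One-off seal leak on a non-critical pump",
    is_serious=False, is_complex=False, is_repeating=False,
)

print(rca_level(repeat_gearbox))
print(rca_level(single_seal))
```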

Don’t get discouraged if the process is hard or if the root cause is elusive. Be diligent and stick with it. Over time, you’ll find you’ve become adept at solving the tricky case of machine failure.