It’s a scenario most people are all too familiar with. Despite our best efforts, a component failed in the plant and brought production to a standstill. These failures are the bane of reliability and are often analyzed to determine failure modes, learn lessons to avoid future failures, and, in some cases, redesign the machine or environment to maximize uptime of the equipment.
Terms like Root Cause Analysis (RCA) and Failure Reporting, Analysis and Corrective Action System (FRACAS) get thrown around, and teams are assembled to look at the failure from the perspectives of operations, maintenance, the operating environment, and more. Ultimately, the goal is to determine what specifically led to the machine failure so the same thing doesn’t occur time and again.
But the question remains, what role should lubrication play in all this?
Lubricant analysis exists in many forms within a facility, from front-line staff performing inspections all the way up to real-time sensors that report on various parameters of the component. All of this data is significant and should be integrated. Each condition-monitoring technology has its own strengths and weaknesses, but blending methods can provide a better snapshot of what may actually be going wrong with a piece of equipment.
When it comes to lubrication analysis, the main data sources should include:
- Inspection Results – The technicians and operators who perform the daily rounds at a facility hold a tremendous amount of information and experience. Reviewing notes on sight glass inspections (level, foam, emulsions, deposits, etc.) can help point toward a lubricant failure or perhaps even lubricant starvation. The same is true for audible inspections that noted any abnormal noises.
- Oil Sampling Data – This represents one of the largest areas of routine data gathered to be trended and analyzed for early warnings or symptoms of machine failure. Looking back at such test results around fluid health, contamination, and machine wear debris can often point toward a likely failure mechanism.
- Grease Sampling Data – Like oil data, grease holds valuable information. While not as common as oil sampling, grease sampling can provide insight into how well the grease is maintaining its properties and whether significant amounts of contamination and wear debris are present.
- Filter Analysis – Often overlooked, the filter is a vault of information, especially as it relates to failures of lubricated equipment. Wear debris often becomes larger in size and concentration as failure progresses. Analyzing the shape and metallurgy of the particles in the filter can provide useful information about specific components that may be wearing, while also potentially shedding some light on the wear mechanism.
- Sensor Data – Sensors come in a variety of forms, each revealing different aspects of the lubricant or machine. With many sensors providing data at a high frequency, information can be analyzed to determine if there were any abnormal trends leading up to the failure event. If none are found, it may present a reason to reexamine where the sensors are installed and the specific parameters they are monitoring.
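To make the idea of looking for abnormal trends concrete, here is a minimal Python sketch of one possible screening approach. It assumes a simple statistical alarm: the first few samples set a baseline, and later readings are flagged when they exceed the baseline mean plus three standard deviations. The function name, the iron readings, and the three-sigma limit are all illustrative assumptions, not values from any particular lab or standard.

```python
from statistics import mean, stdev


def flag_abnormal(readings, baseline_count=5, sigma=3.0):
    """Return indices of readings above baseline mean + sigma * stdev.

    The first `baseline_count` readings define "normal"; only later
    readings are checked against the limit.
    """
    baseline = readings[:baseline_count]
    limit = mean(baseline) + sigma * stdev(baseline)
    return [
        i
        for i, value in enumerate(readings)
        if i >= baseline_count and value > limit
    ]


# Hypothetical iron wear-metal readings (ppm) from successive oil samples.
iron_ppm = [12, 14, 13, 15, 13, 14, 16, 22, 35, 61]

print(flag_abnormal(iron_ppm))  # prints [7, 8, 9]
```

In this made-up history, the last three samples stand out well before the final reading, which is the kind of early warning worth looking for when reviewing data after a failure. The same check could be run on sensor data, although high-frequency signals would likely need smoothing first.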
There are many different strategies for performing a root cause analysis, but they all share the same end goal: determining what happened and how to prevent it from happening again. Common methods include the five whys, fishbone diagrams, fault tree analysis, and scatter plot analysis, to name a few.
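Of these methods, fault tree analysis is the easiest to illustrate with numbers. The Python sketch below combines basic events through OR and AND gates to estimate the likelihood of a top-level failure. It assumes the basic events are independent, and every event name and probability in it is hypothetical and chosen only for illustration.

```python
from math import prod


def or_gate(probabilities):
    """Output occurs if any input occurs (inputs assumed independent)."""
    return 1.0 - prod(1.0 - p for p in probabilities)


def and_gate(probabilities):
    """Output occurs only if every input occurs (inputs assumed independent)."""
    return prod(probabilities)


# Hypothetical annual probabilities, for illustration only.
p_missed_regrease = 0.10
p_wrong_grease = 0.05
p_seal_failure = 0.08
p_washdown_exposure = 0.50

# Either lubrication mistake is enough to cause a lubrication fault.
p_lubrication_fault = or_gate([p_missed_regrease, p_wrong_grease])

# Contamination needs both a failed seal and exposure to washdown water.
p_contamination = and_gate([p_seal_failure, p_washdown_exposure])

# The bearing fails if either branch occurs.
p_bearing_failure = or_gate([p_lubrication_fault, p_contamination])

print(round(p_bearing_failure, 4))  # prints 0.1792
```

Even with rough estimates, a tree like this shows which branch contributes most to the top event, which helps decide where the investigation should focus.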
The Phases of Proper Failure Analysis
All these tools are valuable, and you should use the failure-analysis method that makes the most sense for your facility or organization. To simplify, here are the five main phases I recommend following:
1. Data Collection – This includes fact-finding, interviewing witnesses of the event, and determining whether other events occurred in sequence with the failure. During the data-collection phase, it is important to preserve as much evidence as possible. This includes documenting final running conditions, taking photographs of the equipment and components, and securing samples and data such as those described above. Diligence is key to protecting the integrity of the data gathered during this step.
2. Assessment – During the assessment phase, analytical methods such as the five whys may be employed. The overall goal of this step is to analyze the data and determine whether it reveals the root cause of the failure. Root causes are often grouped into one or more of the following categories:
a. Equipment/Material Problems
b. Design Problems
c. Procedural Problems
d. Human Error
e. Training Deficiency
f. Management Problems
This is not an exhaustive list, and a single failure may have multiple causes that allowed it to become catastrophic. For instance, the bearing wasn’t lubricated properly because the scheduled preventive maintenance (PM) frequency was too long. Some technicians may just chalk this up to a lubrication issue and not look at everything else that was occurring.
3. Corrective Action – This represents the plan of remediation to fix the issue and stop it from occurring again. Oftentimes, this plan will involve various departments such as maintenance, reliability, engineering, and operations. Depending on the complexity of the corrective action, a complete redesign or rebuild of the equipment, or of the environment that houses it, may be the most prudent course. These cases are rare but do occur.
4. Inform – The actions to prevent recurrence must be reported to the parties who will be responsible for implementing them. It is also good practice to share the information with the departments that have an impact on the future operation of the asset. Sometimes, this may involve planners when a PM or bill of materials (BOM) needs to be updated to reflect the changes stemming from this process.
5. Follow-up – As with any process, a verification step is often employed to ensure that the corrective action plan was put into place. This may also include more detailed analysis moving forward, such as increasing the rate of lubricant sampling, inspections, and testing of the equipment.
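For teams that track investigations in software, the five phases and the cause categories above could be captured in a small record like the Python sketch below. The class and field names are my own and hypothetical, as is the asset name, and filing the overly long PM interval under Procedural Problems is an assumption made for the example. The point is simply that one failure can carry findings in more than one category.

```python
from dataclasses import dataclass, field
from enum import Enum


class Phase(Enum):
    DATA_COLLECTION = 1
    ASSESSMENT = 2
    CORRECTIVE_ACTION = 3
    INFORM = 4
    FOLLOW_UP = 5


class CauseCategory(Enum):
    EQUIPMENT_MATERIAL = "Equipment/Material Problems"
    DESIGN = "Design Problems"
    PROCEDURAL = "Procedural Problems"
    HUMAN_ERROR = "Human Error"
    TRAINING = "Training Deficiency"
    MANAGEMENT = "Management Problems"


@dataclass
class RcaRecord:
    asset: str
    phase: Phase = Phase.DATA_COLLECTION
    # Maps each CauseCategory to the list of findings filed under it.
    causes: dict = field(default_factory=dict)

    def add_cause(self, category, finding):
        """File a finding under a category; a failure may span several."""
        self.causes.setdefault(category, []).append(finding)

    def advance(self):
        """Move to the next phase; stay at FOLLOW_UP once it is reached."""
        if self.phase is not Phase.FOLLOW_UP:
            self.phase = Phase(self.phase.value + 1)


# Hypothetical asset name, using the bearing example from the text.
record = RcaRecord(asset="Conveyor drive bearing")
record.advance()  # data collection finished, now in assessment
record.add_cause(CauseCategory.EQUIPMENT_MATERIAL,
                 "Bearing was not lubricated properly")
record.add_cause(CauseCategory.PROCEDURAL,
                 "Scheduled PM frequency was too long")

print(record.phase.name, len(record.causes))  # prints ASSESSMENT 2
```

Keeping every contributing cause on the record makes it harder to close an investigation as "a lubrication issue" when the PM schedule also needs to change.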
Different types of failures may require different depths of analysis. For instance, a single-point failure, in which a single component fails, might be solved in a matter of minutes and not require a regimented RCA process. With multi-point or sequential failures, the true root cause can be more difficult to determine, so they require more focus and investigation to get to the real culprit.
Deciding when and where to deploy your RCA process can be based on many criteria. Usually, full RCAs are reserved for failures that are serious, complex, and recurring. If this isn’t the case, a simplified model of RCA can be used effectively without tremendous risk to the organization.
Don’t get discouraged if the process is hard or if the root cause is elusive. Be diligent and stick with it. Over time, you’ll find you’ve become adept at solving the tricky case of machine failure.