Root Cause Analysis Explained

Jonathan Trout, Noria Corporation

What Is Root Cause Analysis?

Root cause analysis (RCA) is defined as a systematic process for identifying the root causes of problems or events and an action plan for responding to them. Many organizations tend to focus on or single out one factor when trying to identify a cause, which leads to an incomplete resolution. Root cause analysis helps avoid this tendency and looks at the event as a whole. Another common occurrence is for companies to treat the symptoms rather than the actual underlying problems contributing to the issue, leading to recurrence.

Using root cause analysis to analyze problems or events should help you tackle the primary goal of determining:

What happened
How it happened
Why it happened
Actions for preventing recurrence of the issues

In the end, root cause analysis boils down to three goals. The first goal is just as the name implies: to discover the root cause of a problem or event. The second goal is to understand how to fix, compensate for or learn from issues derived from the root cause. The third and most important goal is to apply what you learn from the analysis to prevent issues in the future.

How to Conduct a Root Cause Analysis (RCA)

Root cause analysis can be used in a variety of settings across multiple industries. Each industry might conduct the analysis in a slightly different way, but most follow the same general five-step process when investigating issues involving heavy machinery. This process was laid out by the United States Department of Energy (DOE-NE-STD-1004-92) back in 1992. Root cause analysis is commonly referred to as detective work at its finest. You’ll see similarities between how a detective works to solve a case and how manufacturers can figure out the root cause of an issue in the five-step process.

Phase 1 - Data Collection

Just as detectives preserve a crime scene and meticulously collect evidence for review, you should treat data collection as probably the most important step in the root cause analysis process. It’s best practice to collect data immediately after a failure happens or, if possible, while the failure is occurring. In addition to data, be sure to note any physical evidence of the failure.

Examples of data you should collect include conditions before, during and after the occurrence; employee involvement (actions taken); and any environmental factors. When machinery is involved, collect data and samples on things like lubrication systems, filters and separators, byproduct deposits (gums, varnish or sludge), oil analysis, and tank and sump conditions.

Phase 2 - Assessment

During the assessment phase, analyze all collected data to identify possible causal factors until one (or more) root causes are determined. According to the DOE’s process, the assessment phase incorporates four steps:

Identify the problem.
Determine the significance of the problem.
Identify the causes (conditions or actions) immediately preceding and surrounding the problem.
Identify the reasons why the causes in the previous step exist, working backward to determine the root cause. The root cause is the reason (or reasons) that, if corrected, will keep these and similar failures from happening around the facility. Identifying the root cause is the stopping point in the assessment phase.

Common assessment conclusions for manufacturers include things like contaminated lubricant, using the wrong lubricant, using too much or too little lubricant, and abnormal wear debris.

Later we will discuss common root cause analysis methods and tools to help with the assessment phase of this process. Common methods include Pareto charts, determining the “5 Whys,” fishbone diagrams and more.

Phase 3 - Corrective Action

Implementing corrective action once a root cause has been established lets you improve your process and make it more reliable. First, identify the corrective action for each cause. Then, ask these five questions or criteria laid out by the DOE and apply them to your corrective actions to make sure they are practical.

Will this corrective action prevent recurrence?
Is this corrective action feasible?
Does this corrective action prevent recurrence and still allow for the meeting of production objectives?
Are new risks introduced with this corrective action? Are all assumed risks clearly stated? Keep in mind that corrective action(s) should not degrade the safety of other systems.
Were immediate actions appropriate and effective?

Before taking corrective action, your company as a whole should discuss and weigh the pros and cons of implementing these actions. Consider the cost of carrying out these changes. The costs may include training, engineering, risk-based and operational expenses, among others. Weigh the benefit of eliminating the failure(s) against these costs and the probability that the corrective action(s) will work. In addition to cost, your team should discuss questions like:

Will the outlined corrective actions address all causes?
Will the corrective actions cause negative effects?
What are the consequences of implementing the corrective actions?
Will training be required?
How long will it take to implement these corrective actions?
What resources are required for implementation?
What impact will implementing these corrective actions have on other departments?
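
One way to put rough numbers on the cost discussion above is an expected net benefit calculation. The sketch below is only an illustration: the function name, the formula (avoided failure cost times the probability of success, minus the implementation cost) and the dollar figures are assumptions, not part of the DOE process.

```python
def expected_net_benefit(failure_cost_avoided, success_probability, implementation_cost):
    """Expected savings from a corrective action, minus what it costs to implement.

    All three inputs are estimates your team supplies; none come from the DOE standard.
    """
    if not 0 <= success_probability <= 1:
        raise ValueError("success_probability must be between 0 and 1")
    return failure_cost_avoided * success_probability - implementation_cost


# Hypothetical figures for two candidate corrective actions.
candidates = {
    "install intake filter": expected_net_benefit(100_000, 0.8, 30_000),
    "replace entire pump": expected_net_benefit(100_000, 0.9, 70_000),
}
for action, value in sorted(candidates.items(), key=lambda item: item[1], reverse=True):
    print(f"{action}: {value:,.0f}")
```

A number like this is a conversation starter, not a decision. The questions above about side effects, training and other departments still need answers.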

Phase 4 - Inform

Communication is key. Ensure all affected parties are informed of the pending correction or implementation. In the manufacturing setting, these parties may include supervisors, managers, engineers, and operations and maintenance staff. It’s also a good idea to communicate any corrective actions with suppliers, consultants and subcontractors. Many companies inform all departments of any changes so they can be aware and determine if or how the changes apply to their unique situation as it relates to the overall manufacturing process.

Phase 5 - Follow-up

The follow-up phase is where you establish if your corrective action is effective in resolving the issues.

Track corrective actions to confirm that they were implemented properly and are working as intended.
Periodically review the new corrective action tracking system to verify that it is being implemented effectively.
Analyze any recurrence of the same event and determine why the corrective action(s) were not effective. Be sure to note any new occurrences and analyze those symptoms.

Following up regularly lets you see how well your corrective actions are working and helps you identify new issues that could lead to future failures. For a more detailed look at how to conduct root cause analysis specifically for lubrication professionals and manufacturers, check out “Root Cause Analysis Techniques for the Lubrication Professional.”

Root Cause Analysis (RCA) Tools and Methods

As discussed earlier, the data collection and assessment phases in the RCA process are perhaps the two most important aspects when it comes to properly determining the root cause of a particular failure. There are many root cause analysis tools to choose from when you’re assessing data. Each one can be used to evaluate different information or provide another way to look at similar data. Below are eight common root cause analysis tools and methods:

Pareto Charts: A Pareto chart combines both bar and line graphs, with bars representing individual values (lengths or costs) shown in descending order and lines used to illustrate the cumulative total. In quality control, a Pareto chart can highlight the most common sources of defects or the type of defect that occurs most frequently. When should you use a Pareto chart for root cause analysis?
- When looking at data on how often problems occur or the causes in a process
- When you want to weed out other problems and focus on the most significant
- When looking at broad or general causes by analyzing their specific components
- As a good communicative tool
Read more about how you can create a Pareto chart in eight easy steps.
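
The numbers behind a Pareto chart are simple to compute: count each category, sort in descending order and keep a running cumulative percentage. Here is a minimal sketch in Python. The function name `pareto_table` and the defect log are hypothetical, made up purely for illustration.

```python
from collections import Counter


def pareto_table(observations):
    """Return (category, count, cumulative_percent) rows, largest count first."""
    counts = Counter(observations)
    total = sum(counts.values())
    rows = []
    running = 0
    for category, count in counts.most_common():
        running += count
        rows.append((category, count, round(100 * running / total, 1)))
    return rows


# Hypothetical defect log, invented for illustration only.
defect_log = ["contamination"] * 5 + ["wrong lubricant"] * 3 + ["overfill"] * 2
for row in pareto_table(defect_log):
    print(row)
```

The counts are the bars and the cumulative percentages are the line. In this made-up log, the first two categories account for 80 percent of the defects, which is the kind of concentration a Pareto chart is meant to expose.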
5 Whys: You can think of the 5 Whys method like a curious child continuously asking “why” until he or she receives a satisfactory answer. Each time you ask “why,” the answer produces another “why” question. It is a simple tool, so you shouldn’t rely on it alone to analyze complex problems. However, it can be useful for digging into the results from other methods like a Pareto chart. An example of using the 5 Whys might look like the following:
- Why did Machine A stop working? The circuit overloaded causing a fuse to blow.
- Why is the circuit overloaded? The bearings locked up due to insufficient lubrication.
- Why was there insufficient lubrication on the bearings? Machine A’s oil pump isn’t circulating enough oil.
- Why is the pump not circulating enough oil? The pump’s intake is clogged with particulate.
- Why is the intake clogged? There is no filter on the pump.
You may need more or fewer than five questions to get to the root of your problem. The more your questions keep peeling away surface issues, the more likely you are to uncover your root cause.
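
If you want to keep a record of a 5 Whys session, the chain can be stored as an ordered list of question-and-answer pairs. The sketch below uses the Machine A example. The names `why_chain` and `root_cause` are illustrative, and treating the last answer as the candidate root cause is a simplifying assumption.

```python
# Each pair is (question, answer), taken from the Machine A example.
why_chain = [
    ("Why did Machine A stop working?",
     "The circuit overloaded causing a fuse to blow."),
    ("Why is the circuit overloaded?",
     "The bearings locked up due to insufficient lubrication."),
    ("Why was there insufficient lubrication on the bearings?",
     "Machine A’s oil pump isn’t circulating enough oil."),
    ("Why is the pump not circulating enough oil?",
     "The pump’s intake is clogged with particulate."),
    ("Why is the intake clogged?",
     "There is no filter on the pump."),
]


def root_cause(chain):
    """Treat the last answer in the chain as the candidate root cause."""
    if not chain:
        raise ValueError("the chain needs at least one question")
    return chain[-1][1]


for depth, (question, answer) in enumerate(why_chain, start=1):
    print(f"{depth}. {question} -> {answer}")
print("Candidate root cause:", root_cause(why_chain))
```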
Fishbone Diagrams: Sometimes called a cause-and-effect diagram, a fishbone diagram is helpful for sorting possible causes into multiple categories which all branch off from the original problem. The main categories addressed in this diagram are the six “Ms” — man, material, method, machine, measurement and Mother Nature (environment). A fishbone diagram can also have numerous sub-causes originating from each main category. When should you use a fishbone diagram?
- To identify possible causes for an issue.
- When your team’s thinking and brainstorming tends to get stuck or stagnate.
Work the diagram right to left, having your team brainstorm possible causes of the problem and placing each idea in the appropriate category. Once the team is done brainstorming, rate the potential causes by level of importance and likelihood of contributing to the problem. From here, select which causes to investigate further.

In the example above, the fishbone diagram includes a main problem, six factors contributing to the main problem and potential causes of those factors branching off.
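
The rating step can be sketched in a few lines of Python. This is an illustration under stated assumptions: the 1-to-5 scales, the score (importance multiplied by likelihood) and the sample causes are all hypothetical, since the method itself only says to rate causes by importance and likelihood.

```python
SIX_MS = ("man", "material", "method", "machine", "measurement", "mother nature")


def rank_causes(causes):
    """Sort brainstormed causes by importance x likelihood, highest score first.

    Each cause is a tuple: (category, description, importance, likelihood).
    """
    for category, _description, _importance, _likelihood in causes:
        if category not in SIX_MS:
            raise ValueError(f"unknown category: {category}")
    return sorted(causes, key=lambda cause: cause[2] * cause[3], reverse=True)


# Hypothetical brainstorming output with made-up 1-5 ratings.
brainstormed = [
    ("method", "relubrication interval too long", 3, 3),
    ("machine", "no filter on pump intake", 5, 4),
    ("material", "wrong lubricant grade", 4, 2),
]
for category, description, importance, likelihood in rank_causes(brainstormed):
    print(f"{importance * likelihood:>2}  {category}: {description}")
```

The causes at the top of the ranking are the ones to investigate further.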
Scatter Plot Diagrams: A scatter plot diagram is used to show the relationship between two variables by using pairs of data points. One variable is placed on the x-axis and the other on the y-axis. Once you plot your data points, if the variables are correlated, the points will form a curve or a line. The more tightly the points cluster around that line or curve, the stronger the correlation. As a quantitative method for determining correlation, these diagrams can be used with other methods, such as to test potential causes identified in your fishbone diagram. When should you use a scatter plot diagram?
- When you have paired numerical data.
- When trying to verify whether two variables are related.
- When attempting to determine if two related effects are from the same cause.
- After brainstorming with a fishbone diagram.
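
To put a number on what the scatter plot shows, you can compute a correlation coefficient for the paired data. The sketch below implements the Pearson coefficient by hand. Note the assumptions: Pearson's r only measures straight-line relationships (it can miss a curved one), and the readings are invented for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient for paired data, between -1 and 1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when one variable never changes")
    return covariance / sqrt(spread_x * spread_y)


# Hypothetical paired readings: oil particle count vs. bearing temperature.
particle_count = [120, 180, 240, 310, 400]
bearing_temp_c = [61, 64, 70, 73, 80]
print(f"r = {pearson_r(particle_count, bearing_temp_c):.2f}")
```

A value near 1 or -1 suggests the two variables move together, but it does not prove that one causes the other. That still takes investigation.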
Failure Mode and Effects Analysis (FMEA): FMEA is used to analyze and determine potential risks, failures and causes. The process looks at ways in which failures such as errors or defects might occur and then studies or analyzes those failures. When should you use FMEA?
- During the design or redesign of a process, product or service.
- When applying an existing process, product or service in a new way.
- Before coming up with control plans for a new or modified process.
- When planning improvement goals for existing processes.
- When looking into failures of an existing process.
You can think of FMEA as more of a proactive tool rather than a reactive tool.
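
A common FMEA convention, not described in this article, is to rank failure modes by a risk priority number (RPN): severity times occurrence times detection, each typically scored from 1 to 10. The sketch below assumes that convention. The failure modes and scores are hypothetical.

```python
def risk_priority_number(severity, occurrence, detection):
    """RPN = severity x occurrence x detection, each scored from 1 to 10."""
    for score in (severity, occurrence, detection):
        if not 1 <= score <= 10:
            raise ValueError("each score must be between 1 and 10")
    return severity * occurrence * detection


# Hypothetical failure modes: name -> (severity, occurrence, detection).
failure_modes = {
    "wrong lubricant added": (7, 3, 5),
    "pump intake clogs": (8, 6, 4),
}
ranked = sorted(
    failure_modes,
    key=lambda name: risk_priority_number(*failure_modes[name]),
    reverse=True,
)
for name in ranked:
    print(name, risk_priority_number(*failure_modes[name]))
```

Higher numbers point to the failure modes that deserve attention first.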
Fault Tree Analysis: Similar to FMEA, fault tree analysis helps identify potential risks in a system or process before they happen. Sometimes called a “top-down approach,” this deductive process starts with a general conclusion and attempts to figure out the causes of the conclusion by making a logic diagram called a fault tree. The diagram utilizes shapes called “gates” to denote various interactions among contributing failure events. The two most common gates are the “and” and “or” gates. Each gate connects two kinds of events: input events, which can lead to another event, and the output event they lead to. If either of the input events causes the output event to occur, connect these events with an “or” gate. If both input events must happen for the output event to occur, connect them using an “and” gate, as shown below.
A fault tree can be used to build a safety program, discover what went wrong in a process or determine why employees may not be meeting company standards. For example, you can take a hypothetical incident like a lubrication spill, break down the contributing factors and see the chain of events or failures along the way. You can then choose safety procedures that help minimize these outcomes.
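
The gate logic maps directly onto Boolean operations, so a small fault tree can be sketched in code. The tree below for a lubrication spill is hypothetical: the input events and the way they are wired together are assumptions made for illustration.

```python
def or_gate(*inputs):
    """Output event occurs if ANY input event occurs."""
    return any(inputs)


def and_gate(*inputs):
    """Output event occurs only if ALL input events occur."""
    return all(inputs)


def lubricant_spill(hose_rupture, drain_valve_left_open, containment_missing):
    """Hypothetical tree: a leak (either source) AND no containment gives a spill."""
    leak = or_gate(hose_rupture, drain_valve_left_open)
    return and_gate(leak, containment_missing)


print(lubricant_spill(hose_rupture=True, drain_valve_left_open=False,
                      containment_missing=True))
print(lubricant_spill(hose_rupture=True, drain_valve_left_open=True,
                      containment_missing=False))
```

In this sketch, keeping containment in place blocks the top event no matter which leak occurs, which is the kind of insight a fault tree is meant to surface.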
Barrier Analysis: Barrier analysis is a tool used with other methods to understand why a failure happened and how it can be prevented. The main idea behind it is that a failure or problem can be prevented by having set barriers to control hazards. The three basic elements of barrier analysis are the target, the hazard and the barrier. The target is generally a person. The hazard is something that can cause harm to the target, such as rotating parts or electricity. Barriers can be physical, procedural or actions, and are intended to protect the target.
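
The three elements can be captured in a simple record so you can list which barriers did not hold during an event. This is a minimal sketch. The class name, its fields and the example barriers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Barrier:
    name: str
    kind: str   # "physical", "procedural" or "action"
    held: bool  # did the barrier work during the event?


def failed_barriers(barriers):
    """Return the names of barriers that did not protect the target."""
    return [barrier.name for barrier in barriers if not barrier.held]


# Hypothetical event: target = operator, hazard = rotating parts.
barriers = [
    Barrier("machine guard", "physical", held=False),
    Barrier("lockout procedure", "procedural", held=False),
    Barrier("emergency stop pressed", "action", held=True),
]
print(failed_barriers(barriers))
```

Each failed barrier becomes a question for the rest of your analysis: why didn't it hold?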
Change Analysis: Change analysis is another tool that can be used with other methods to help define a problem. This process examines an event while considering it with and without a particular problem and then compares the two situations, taking note of the differences. It then analyzes the differences and identifies consequences of the differences. Change analysis usually is employed in tandem with another RCA method to distinguish a specific cause instead of the root cause.
For example, let’s say you have an abnormally good sales day and want to figure out why so you can replicate it. You’d start by considering every possible internal and external factor, such as whether a new sales training was implemented the day before or if it was the last day of the month and people were trying to hit their goals. Next, examine each event to see if it was an unrelated factor, contributing factor, correlated factor or the probable root cause. This is where all your analysis is done and where you can loop in other methods like the 5 Whys. Finally, see how the cause can be replicated.
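
The comparison step of change analysis amounts to a diff between two situations. The sketch below compares a typical day with the unusually good sales day. The factor names and values are hypothetical, and deciding which differences actually matter remains a judgment call for your team.

```python
def changed_factors(baseline, event):
    """Return {factor: (baseline_value, event_value)} for factors that differ."""
    differences = {}
    for factor in sorted(set(baseline) | set(event)):
        before, after = baseline.get(factor), event.get(factor)
        if before != after:
            differences[factor] = (before, after)
    return differences


# Hypothetical factors for a typical day vs. the unusually good sales day.
typical_day = {"sales_training_day_before": False, "last_day_of_month": False,
               "staff_on_shift": 6}
good_day = {"sales_training_day_before": True, "last_day_of_month": True,
            "staff_on_shift": 6}
for factor, (before, after) in changed_factors(typical_day, good_day).items():
    print(f"{factor}: {before} -> {after}")
```

Factors that did not change, such as staffing in this made-up case, drop out of the list, leaving a shorter set of differences to examine.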

Root Cause Analysis FAQs

How do you decide when to conduct a root cause analysis?
You can perform root cause analysis to help solve day-to-day problems using brainstorming techniques or the 5 Whys. Employ RCA routinely as a proactive tool to analyze safety and environmental data, evaluate asset utilization, and identify trends that point to chronic losses or systematic defects. High-level RCAs are costly, so you need a process to help decide when one is appropriate. If you’re considering a high-level RCA, you’ll want to define triggers that determine the point at which a formal RCA should be conducted. Below are some ideas for forming trigger criteria:
- Equipment damage or failure
- Operating performance
- Quality
- Economic performance
- Safety performance
- Regulatory compliance
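
Trigger criteria are easier to apply consistently when they are written down as explicit thresholds. The sketch below shows one possible shape. Every field name and limit is a hypothetical placeholder, and each site would need to set its own.

```python
# Hypothetical thresholds; each site would define its own fields and limits.
TRIGGER_LIMITS = {
    "equipment_damage_cost": 25_000,  # equipment damage or failure
    "downtime_hours": 8,              # operating performance
    "scrap_units": 500,               # quality
    "recordable_injuries": 1,         # safety performance
}


def tripped_triggers(event, limits=TRIGGER_LIMITS):
    """Return the names of all triggers the event meets or exceeds."""
    return [name for name, limit in limits.items() if event.get(name, 0) >= limit]


def needs_formal_rca(event, limits=TRIGGER_LIMITS):
    """A formal RCA is warranted if at least one trigger is tripped."""
    return bool(tripped_triggers(event, limits))


event = {"downtime_hours": 12, "scrap_units": 40}
print(tripped_triggers(event), needs_formal_rca(event))
```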

How do you prepare for a root cause analysis?
It’s important to spend time preparing for a root cause analysis by doing some initial investigation, identifying the appropriate personnel and anticipating problems that could arise during the RCA meeting. A common example of preparing for an RCA is that of a puzzle builder. Even the most experienced puzzle builder, who may know tips and tricks for efficient puzzle-building, can’t be successful if a puzzle piece is missing or there is no place to build the puzzle.

Likewise, a team can’t complete a root cause analysis if it is missing important evidence, team members are absent, or the facilities are dysfunctional. So, make sure you collect evidence, identify key team members and prepare for the unexpected prior to your RCA meeting.

What is the difference between proactive and reactive root cause analysis?
In most cases, RCA is used after an event or failure has occurred. The goal is to move from being reactive to being proactive over time.
- Proactive root cause analysis consists of the actions, behaviors or controls implemented to prevent a failure from occurring.
- Reactive root cause analysis encompasses the actions, behaviors or controls implemented to mitigate or lessen the severity of a failure that has already occurred.

How long does a root cause analysis take?
The time required for a root cause analysis will depend on certain factors, such as the complexity of the incident, the availability of employees to be interviewed, whether regulators are involved and how far you want to dig into the causes. Most RCAs can be completed within a couple of weeks to a few months.

What are some examples of internal and external factors that could contribute to failures uncovered in a root cause analysis?
Examining internal and external factors in the weeks and months leading up to a failure event can help you obtain a snapshot of what happened. Let’s say you want to find out why revenue dipped last quarter in your food-processing company. Examples of internal and external factors might include:
- Severe weather reduced rice, corn and wheat production (external).
- The cost of sugar has risen (external).
- Trade restrictions have been implemented in some of your partner countries (external).
- Your processing plant experienced more frequent shutdowns (internal).
- New shift managers were hired in the processing plant (internal).