Pumps & Systems, September 2007
When I was first asked to define MTBF, MTBR, and MTBPM, I wasn't sure why. Of all the myriad reliability metrics employed, I had to ask myself, "Why were these singled out?" It wasn't until I ran across the following definition that I understood. Process Industry Practices (PIP) defines Mean Time Between Repairs as: "The most common measure of operating reliability typically stated as the average operating calendar time between required repairs for a particular piece of machinery, type of machinery, class of machinery, operating unit or plant. MTBR is not Mean Time Between: (a) Failures, (b) Planned Maintenance, or (c) any other categorization of shutdowns. MTBR calculations include Repairs due to (a) Failures, (b) Planned Maintenance, or (c) any other categorization of Repair events."
I was surprised to find all three reliability metrics of interest mentioned here. As I thought more about this PIP definition, I began to realize why these metrics are so important and why they need to be better understood.
Dealing with Dirty Data
Before I define the reliability terms in question, I want to provide some perspective on the people the PIP standard was written for: maintenance personnel. I have worked in maintenance organizations for over 20 years, so I feel somewhat qualified to present the maintenance perspective on maintenance data analysis.
Let's first consider a hypothetical pump timeline (see Figure 1).
We can see that this timeline is composed of various event types, i.e. failures, repairs, and PM events. Ideally, we would like to know how long a new or refurbished pump lasts before it fails. But there is always a trade-off between theory and practice. In reality, you are usually only able to determine the average, or mean, time between failures or repairs. Reliability theory tends to deal with failure data, while maintenance organizations deal with maintenance events. "But aren't failures and maintenance events the same thing?" you ask. My response is, "Not at all." Maintenance events fall into many categories, such as:
- Repairs to restore pumps to serviceable conditions
- Regular internal pump inspections
- Preventative maintenance events, such as oil changeouts
- Predictive maintenance events, such as data collections
- Preemptive repairs that are done before a pump actually fails
- Maintenance activities that are not associated with a pump but are credited to a pump's functional location due to the proximity of the work
Only the first category actually pertains to a known pump failure. It should be noted that the second category is deemed to be a repair by PIP if the inspection uncovers a failed or failing component.
To make matters more complicated, defining failures in real-world environments can sometimes be challenging. If you are running tests on light bulbs, it's easy to know when failure occurs. However, here are a few examples demonstrating the difficulties in determining what is and what is not a failure:
- You discover a seal has a one drip per hour leak. Is this a seal failure? When did it start leaking?
- During a planned pump inspection, you find the impeller has lost 50 percent of its thickness. Is it a failure? If so, when did it pass the threshold from acceptable to unacceptable?
- A pump's vibration level jumps from 0.11-ips to 0.25-ips from one pump inspection to the next. Management wants to repair the pump before things get worse. Is this a failure? When do you say it failed? Can you say it was 80 percent failed when it was removed?
One thing maintenance folks (and accountants) know for certain is when they have performed maintenance on a pump. In addition, they store their maintenance data to the point of information overload. Ask any maintenance engineer or specialist for pump maintenance data and he or she will present you with reams of it. The problem is that it's usually in a form we call "dirty data." Dirty data is an aggregate of predictive maintenance, preventative maintenance, repair and extraneous data that must be carefully culled before it is usable.
Let's look at some sample pump data in Table 1.
We have run a hypothetical report of completed work orders for Pump 101 over a 15-month period. Over that time, we see there have been 14 completed work orders (a completed work order is any work order that has been created and closed). Does this mean we have experienced 14 failures or have completed 14 repairs? Certainly not.
Any experienced maintenance person can look at Table 1 and determine which work orders represented real repairs, which ones were preventative or predictive maintenance activities, and which ones were unrelated to the pump, such as the leaking suction valve. (By the way, the lack of detail in Table 1 is typical of real-world data. We never have all the details required to make a fully informed decision on the true nature of maintenance events.)
Note that I have highlighted two work orders in green that I believe represent actual repairs. So, instead of 14 repairs, we really have only two repairs that were required to return this pump to operating condition.
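For plants that keep this history in a CMMS export or spreadsheet, here is a minimal sketch of that culling step in Python. The records, field names, and event-type labels below are my own illustrations and are not taken from Table 1 or from any particular system; the only point is that nothing but the events judged to be true repairs should feed a repair count:

```python
from datetime import date

# Hypothetical completed work orders for one pump (illustrative only; these are
# not the Table 1 entries, and the field names are not from any CMMS standard).
work_orders = [
    {"closed": date(2006, 2, 3),  "event_type": "repair",    "text": "Replaced mechanical seal"},
    {"closed": date(2006, 4, 11), "event_type": "pm",        "text": "Oil changeout"},
    {"closed": date(2006, 6, 20), "event_type": "pdm",       "text": "Vibration data collection"},
    {"closed": date(2006, 8, 5),  "event_type": "unrelated", "text": "Repacked leaking suction valve"},
    {"closed": date(2006, 11, 9), "event_type": "repair",    "text": "Replaced bearings and coupling"},
]

# Cull the dirty data: only the true repairs count toward a repair metric.
repairs = [wo for wo in work_orders if wo["event_type"] == "repair"]

print(f"{len(work_orders)} completed work orders, {len(repairs)} actual repairs")
```

Applied to a listing like Table 1, a filter of this kind is what takes 14 completed work orders down to two true repairs. The hard part, of course, is assigning the event types in the first place, and that is where the subjective judgment comes in.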
In summary, we can state that maintenance organizations:
- Have plenty of "dirty" work order data at their disposal.
- Don't always know when a pump has failed, or when it would have failed had it been removed from service prematurely.
- Must subjectively cull their data to arrive at a usable listing of repair data.
The Definitions
Now that we better understand the data analysis perspective of maintenance organizations, let's talk about the three definitions of MTBF, MTBR, and MTBPM.
MTBF (Mean Time Between Failures): The mean number of life units during which all parts of the item perform within their specified limits, during a particular measurement interval under stated conditions. When we say "all parts of the item perform within specified limits," we mean that, on average, no parts fail until the end of the mean life. The following equation is used to determine MTBF:
MTBF = N/F
where N is the number of machines in the population and F is the number of failures in the measurement period.
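As a quick illustration with made-up numbers (mine, not PIP's): a unit running 40 pumps that logs 10 failures over a one-year measurement period has MTBF = 40/10 = 4, and because the measurement interval is one year, that works out to an MTBF of 4 years. The result always comes out in multiples of whatever measurement period is used.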
MTBR (Mean Time Between Repairs): The mean number of life units between repair activities required to bring all parts of the item back to within their specified limits, during a particular measurement interval under stated conditions. MTBR is similar to MTBF, but uses repair events instead of failure events. The following equation is used to determine MTBR:
MTBR = N/R
where N is the number of machines in the population and R is the number of repairs in the measurement period.
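Continuing the same made-up example: if those 40 pumps needed the 10 failure repairs plus 10 more repairs driven by inspection findings and preemptive pulls during the same year, then R = 20 and MTBR = 40/20 = 2 years. Assuming every failure actually gets repaired within the period, R can never be smaller than F, so MTBR calculated this way will always come out at or below the corresponding MTBF.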
MTBPM (Mean Time Between Planned Maintenance): The mean number of life units between planned maintenance activities, during a particular measurement interval under stated conditions. Planned maintenance activities only count as failures or repairs if work is required to restore the component. Here are a few examples of planned maintenance activities that are not considered repairs (an analogous equation follows the list):
- Oil replacement
- Packing adjustment
- Work planned and then cancelled
- Periodic internal pump inspections due to known corrosion or erosion concerns
- Periodic alignment checks due to foundation settling
- Journal bearing inspection due to contamination concerns
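PIP's text quoted earlier does not pair this definition with an equation, but by direct analogy with the two formulas above (my extrapolation, not a quoted standard), it would take the form MTBPM = N/P, where N is again the number of machines in the population and P is the number of planned maintenance events completed during the measurement period.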
Of these three metrics, MTBR is probably the most widely used for evaluating pump reliability, despite the following limitations and caveats:
- It includes the mean pump life along with the mean time for the organization to identify, plan, and repair the pump, which tends to inflate the value of MTBR.
- The MTBR metric is an amalgamation of repair data for all pumps, running and idle, that are included in the population. This also tends to greatly increase its value compared to the true mean time to failure. (Some will argue that even idle pumps are subject to failure.)
- As stated before, the repair data supplied for the calculation is subject to interpretation and is therefore prone to errors and inconsistency.
- Wide fluctuations in the MTBR values can occur in small pump populations or when short reporting periods are used, due to the significant effect of small variations in repair numbers used in the calculations. This situation may require a 6-month or 12-month rolling average to smooth out these fluctuations. (I welcome statisticians to offer their recommendations on the minimum number of repairs or timeframe required for a given pump population to obtain a reliable MTBR value.)
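For those who want to see the arithmetic behind such a rolling average, here is a minimal sketch in Python. The pump count, monthly repair totals, and window length are all assumed, illustrative inputs rather than plant data; a real report would pull the monthly repair counts from the culled work order history described earlier:

```python
# Minimal 12-month rolling MTBR sketch (illustrative numbers, not plant data).
pump_count = 200                  # pumps in the population, assumed constant
monthly_repairs = [9, 14, 7, 11, 16, 8, 10, 13, 6, 12, 9, 15, 11, 10, 8]  # culled repairs per month

window = 12                       # rolling window length, in months

for month in range(window - 1, len(monthly_repairs)):
    repairs_in_window = sum(monthly_repairs[month - window + 1 : month + 1])
    # MTBR = N / R over a one-year window, expressed in months as 12 * N / R.
    mtbr_months = window * pump_count / repairs_in_window
    print(f"Month {month + 1}: {repairs_in_window} repairs, rolling MTBR = {mtbr_months:.1f} months")
```

Lengthening the window smooths the trend further at the cost of responsiveness, which is exactly the trade-off behind choosing between a 6-month and a 12-month average.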
While MTBR does not equate to the true pump failure rate, it can be trended to determine whether progress is being made and how shop resources are being impacted by pump repairs. PIP recommends that MTBR metrics be calculated for different classes of pumps, such as hot pumps, reciprocating pumps, etc., as a means of improving their usefulness.
Working with External Suppliers
Machinery reliability metrics have been receiving increasing attention recently because of the growing interest in improving operating plant profits through reliability improvement programs. This interest first began internally, but it has since extended outside the process plant's confines. Now external suppliers are offering reliability improvement programs at a fixed annual cost. I am aware of several mechanical seal programs that have delivered on their promises to improve plantwide reliability at a fixed fee.
For these types of programs to be successful, participating operating plants must be willing and able to provide meaningful reliability metrics to external suppliers of seals, new pumps, pump repairs, etc. To my knowledge, pump MTBR calculations are the only practical means available to track machinery performance in a plant environment.
As key players in reliability joint ventures, external suppliers need to understand the limitations of the maintenance data collection process and the inherent inaccuracies in the MTBR value provided to them. While this metric may be considered by some to be flawed and somewhat unscientific, if calculated consistently, it can provide a reasonable benchmark for gauging reliability gains.
Both pump owners and external suppliers need to continue working together to refine the PIP MTBR standard so that it is better understood and becomes widely accepted. The present standard is not perfect, but it's a good start!