Predictive Equipment Failure

Evolution of Maintenance Strategies


1) Introduction

2) What is Predictive Equipment Failure?

3) Source of Data

4) Business Requirements

5) Machine Learning Problem

6) Existing approaches

7) My EDA (Exploratory Data Analysis) and Data Pre-processing

8) First Cut Solution

9) Custom Model and Architecture Explanation

10) Comparison of Models

11) Future Work

12) Profile

13) References


Think about all the machines you use during a year, from the toaster every morning to the airplane on every summer holiday. Now imagine that, from now on, one of them failed every day. What impact would that have? The truth is that we are surrounded by machines that make our lives easier, but we also grow more and more dependent on them. The quality of a machine therefore rests not only on how useful and efficient it is, but also on how reliable it is. And with reliability comes maintenance: even an excellent, extremely useful machine is of little value unless it is reliable, and reliability is maintained by understanding the machine's condition and repairing, replacing, or servicing it at the right time.

As the saying goes, PREVENTION IS BETTER THAN CURE. The same holds for machines: repair and maintenance work must be performed at the right time, before the machine/equipment fails. And this is what my project is all about.


Predictive equipment failure means predicting/anticipating upcoming failures in machines from patterns in their data. By anticipating a failure, we can stop the machinery in time and prevent unnecessary breakdowns.


This project is taken from here, a Kaggle problem.

The data comes from ConocoPhillips, a company primarily engaged in hydrocarbon production, which includes extracting oil from oil wells. The data provided is from oil wells, stripper wells to be specific. A stripper well, or marginal well, is an oil or gas well that is nearing the end of its useful life; in simple terms, such wells can no longer produce the expected amount of oil per day. 85% of the wells in the United States are now stripper wells, and together they are responsible for a significant amount of oil production.

The data set provided documents failure events that occurred on surface equipment and down-hole equipment. For each failure event, data has been collected from over 107 sensors (attached to the mechanical equipment) that capture a variety of physical measurements from the surface (on-surface equipment) and from below the ground (down-hole equipment).

Oil extracting equipment setup

As you can see in the image above, mechanical equipment is placed both on the surface, i.e. on the ground, and below the ground to extract oil.



Stripper wells are attractive for the company due to their low capital intensity and low operational costs. Due to these reasons, the profit margin given by these stripper wells is comparatively large. This cash is then used by the company to fund operations that require more money.

For example, ConocoPhillips uses the funds from its West Texas conventional operations (so called because the main export business in Texas is petroleum, coal products, and oil) as a cash flow to fund more expensive projects in the Delaware Basin and other unconventional plays across the United States.

And so, the BUSINESS PROBLEM is to maintain this steady cash flow from the stripper wells while keeping their maintenance cost as low as possible, to increase the profit margin and the funds available for the rest of the unconventional projects in the United States.

But, as with all mechanical equipment, things break, and when things break money is lost in the form of repairs and lost oil production. We need to prevent these failures in order to maintain a steady cash flow from the stripper wells to the other operations.


1) Prevent the failure of equipment

2) Reduce maintenance/repairing cost

3) Cut unwanted downtime


1) No low latency Constraints

This is because we want to predict upcoming failures, not detect failures in real time. So, we can use larger or more complex models, like ensemble models.

2) High Precision and Recall

High Precision because we want to correctly identify whether a failure is on the surface or down-hole (this is a 2-class classification problem). It should not happen that the true/actual failure was in on-surface equipment while the crew sent a workover rig to fix the down-hole equipment because the model predicted a down-hole failure. That would increase unwanted downtime, since we would be looking for the failure in the wrong place.

High Recall because we want to catch all the failures and NOT miss a single one. Missing even a single failure would increase the maintenance (repair) cost, increase unwanted downtime, and cause losses in the form of oil and cash.


Our job is to predict the failures (on-surface failure and down-hole/downhole failure). This information can be used to send crews to the well location to fix the equipment on the surface or, send a workover rig to the well to pull the down-hole equipment (equipment present below the ground level which is extracting the oil) and address the failure.


The dataset we have is highly imbalanced: there are many more data points for ON-SURFACE FAILURE (Class 0) than for DOWNHOLE FAILURE (Class 1). As we want to give EQUAL IMPORTANCE to both class labels, the performance metric used is the MACRO F1 SCORE.



The dataset provided has 2 types of data: the “MEASURES DATA” (each column stores a single measurement from a sensor) and the “HISTOGRAM DATA” (data recorded from the sensors over 10 time steps). The dataset also has a large number of missing values, i.e. “NAN values”.

To give a general overview of what others have tried: most of them imputed the missing values with 0, the mean, or the median. Some analyzed the data and concluded that the “Histogram data” is NOT useful, so they simply discarded it. Others performed feature engineering and came up with new features from the “Measures data”.

Since, as noted in the Business Constraints, there are no low-latency requirements, almost all of them used EXISTING ENSEMBLE MODELS like Random Forest, XGBoost, AdaBoost, CatBoost, etc., and obtained good results.

But my approach is very different from the existing ones! Read on to see how I dealt with the missing values, imputed them, and built CUSTOM ENSEMBLE MODELS.


Data Imbalance Overview

SURFACE FAILURE = 98.33% (blue portion)

DOWNHOLE FAILURE = 1.67% (orange portion)

From the above image, it is evident that we have a highly imbalanced dataset where we have more data points for Surface Failure and less data points for Downhole Failure.

We first separate the Measures columns (which makes the Measures dataset) and the Histogram columns (which makes the Histogram dataset) from the input data. We will analyze and pre-process both of them separately.

NOTE: The analysis and pre-processing is done on the training data. Equivalent pre-processing will be done on the test data using the pre-trained data pre-processing models (models which are trained on the training data).

Measures Data Sample
Histogram Data Sample


I will give you an overview of the analysis and pre-processing of the Measures data. Using the “describe()” function, I found that most of the columns have values that are either close to 0 or very large, as is evident from the image below.

Overview of Measures Data

So there was a chance that not all the features are useful, and I used RECURSIVE FEATURE ELIMINATION WITH CROSS-VALIDATION (RFECV) to determine the actual number of useful features in the Measures dataset.

Code for RFECV

I arrived at the hyper-parameters cited above after doing the required analysis. RFECV showed that only 63 of the 94 features in the Measures data are useful for predicting the output.
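The original RFECV code was shown as an image, so the sketch below is an assumption: the estimator, fold count, and tree settings are illustrative stand-ins, and a synthetic dataset replaces the Measures data. Only the `f1_macro` scoring choice comes from the post itself.

```python
# Hedged sketch of the RFECV step; estimator and hyper-parameters are
# illustrative assumptions, not the post's tuned values.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

# Toy stand-in for the Measures training data (the real set has 94 features).
X, y = make_classification(n_samples=400, n_features=20, n_informative=8,
                           random_state=42)

selector = RFECV(
    estimator=RandomForestClassifier(n_estimators=30, random_state=42),
    step=1,                            # drop one feature per elimination round
    cv=StratifiedKFold(n_splits=3),
    scoring="f1_macro",                # the metric used throughout the post
    n_jobs=-1,
)
selector.fit(X, y)

print("Useful features:", selector.n_features_)  # count of retained features
X_reduced = selector.transform(X)                # keep only selected columns
```

The fitted selector can then be reused on the test data via `selector.transform`, which matches the post's rule of fitting pre-processing models on training data only.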

To perform FEATURE ENGINEERING, I decided to extract the top 10 features by using Recursive Feature Elimination (RFE). Feature engineering is basically done in order to underline the existing data distribution as this helps the model to learn the data well. By doing the required analysis on these top 10 features, I came up with the following 5 engineered features. They are:-

1) Sensor12_measure × sensor13_measure

2) Sensor12_measure − sensor13_measure

3) Sensor17_measure − (its 75th percentile value for class 0)

4) Sensor35_measure × sensor17_measure

5) Sensor81_measure × sensor82_measure
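The five engineered features above can be sketched as follows, reading the products and differences literally. The dataframe and sensor column names here are toy stand-ins for the real Measures data, so treat the snippet as an interpretation rather than the post's exact code.

```python
# Sketch of the five engineered features; column names and data are
# assumptions mirroring the post's naming scheme.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({                     # toy stand-in for the Measures data
    "sensor12_measure": rng.normal(size=100),
    "sensor13_measure": rng.normal(size=100),
    "sensor17_measure": rng.normal(size=100),
    "sensor35_measure": rng.normal(size=100),
    "sensor81_measure": rng.normal(size=100),
    "sensor82_measure": rng.normal(size=100),
    "target": rng.integers(0, 2, size=100),
})

# 75th percentile of sensor17, computed on class-0 rows only (fit on train)
p75_class0 = df.loc[df["target"] == 0, "sensor17_measure"].quantile(0.75)

df["s12_x_s13"] = df["sensor12_measure"] * df["sensor13_measure"]
df["s12_minus_s13"] = df["sensor12_measure"] - df["sensor13_measure"]
df["s17_minus_p75"] = df["sensor17_measure"] - p75_class0
df["s35_x_s17"] = df["sensor35_measure"] * df["sensor17_measure"]
df["s81_x_s82"] = df["sensor81_measure"] * df["sensor82_measure"]
```

Note that the class-0 percentile must be computed on the training data and stored, so the same constant is subtracted when transforming the test data.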

I checked the Pearson Correlation Coefficient (PCC) of these engineered features against the target variable and found that NO feature was highly correlated (either positively or negatively). But a low correlation with the target only tells us that there is NO LINEAR RELATIONSHIP between the engineered features and the target; it does NOT mean the engineered features are NOT useful.

Correlation between the Engineered features

From the image above you can see that NO engineered feature is highly correlated with any other, which is a good sign: we are NOT underlining the same data distribution twice and pushing the model towards overfitting.
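The two correlation checks described above can be sketched with pandas. The feature names and data here are placeholders; the point is the pair of calls: `corrwith` for feature-vs-target PCC, and `corr` for the pairwise heatmap among engineered features.

```python
# Sketch of the Pearson correlation checks; feature names are placeholders.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
feats = pd.DataFrame(rng.normal(size=(200, 3)),
                     columns=["s12_x_s13", "s12_minus_s13", "s81_x_s82"])
target = pd.Series(rng.integers(0, 2, size=200), name="target")

# PCC of each engineered feature with the target variable
print(feats.corrwith(target))

# Pairwise PCC among the engineered features (what the heatmap visualizes)
corr_matrix = feats.corr(method="pearson")
print(corr_matrix)
```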

To check whether the engineered features are actually useful, I ran RFECV on 68 features (the previous top 63 features + these 5 engineered features). I got 29 features that were of utmost importance for the model's predictions, and they included 4 of my engineered features. The useful engineered features are:-

1) Sensor12_measure × sensor13_measure

2) Sensor12_measure − sensor13_measure

3) Sensor35_measure × sensor17_measure

4) Sensor81_measure × sensor82_measure

These top 29 features are responsible for representing the Measures data. And from now on, we will use these 29 features for further data pre-processing and model training.

As the Measures data has values that are either close to 0 or very large, it is necessary to perform DATA STANDARDIZATION using StandardScaler.
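The standardization step is a single fit/transform pair; the key detail, consistent with the post's NOTE on pre-processing, is that the scaler is fitted on the training data only and then reused on the test data. The array below is a made-up stand-in.

```python
# Standardizing the Measures features: fit on train, reuse the same
# fitted scaler on test. The data here is a toy stand-in.
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1e4, 0.001],
                    [2e4, 0.003],
                    [3e4, 0.002]])

scaler = StandardScaler().fit(X_train)   # learn per-column mean and std
X_train_std = scaler.transform(X_train)  # zero mean, unit variance per column
# later: X_test_std = scaler.transform(X_test)  # same fitted scaler
```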


As we have many NAN values (missing values) in our Measures dataset, I split it into 3 datasets for data imputation, based on the percentage of NAN values in each column, as follows: -

a) Dataset D1 has columns having less than 5% NAN values

b) Dataset D2 has columns having 5% to 30% of NAN values

c) Dataset D3 has columns having 30% to 75% of NAN values

The columns having more than 75% of NAN values were earlier discarded due to insufficient amount of data for imputation.
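The column split described above can be sketched directly from the per-column NaN percentages. The thresholds follow the post; the dataframe is a small fabricated example whose NaN counts are chosen to land one column in each bucket.

```python
# Splitting columns into D1/D2/D3 by missing-value percentage, with the
# thresholds from the post; the dataframe itself is a toy stand-in.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
df = pd.DataFrame(rng.normal(size=(100, 4)), columns=list("abcd"))
df.loc[:2, "b"] = np.nan    # 3% NaN  -> belongs in D1
df.loc[:19, "c"] = np.nan   # 20% NaN -> belongs in D2
df.loc[:49, "d"] = np.nan   # 50% NaN -> belongs in D3

nan_pct = df.isna().mean() * 100   # per-column NaN percentage

d1 = df[nan_pct[nan_pct < 5].index]                        # < 5% NaN
d2 = df[nan_pct[(nan_pct >= 5) & (nan_pct < 30)].index]    # 5% to 30% NaN
d3 = df[nan_pct[(nan_pct >= 30) & (nan_pct <= 75)].index]  # 30% to 75% NaN
# columns with more than 75% NaN were dropped earlier in the pipeline
```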

The reason behind this approach is that the dataset is of the type MISSING AT RANDOM (MAR): the values are missing due to some technical error or glitch. Also, since the data comes from different pieces of mechanical equipment, there is a very high chance that the missing values are CORRELATED with other features in the dataset. To read about the other types of missingness, please click here.

a) Imputation Strategy for D1

As this dataset's columns have fewer than 5% NAN values, we simply remove every row containing even a single NAN value. This keeps the data in a pure form.

b) Imputation Strategy for D2

As this dataset's columns have between 5% and 30% NAN values, we have a sufficient amount of data for imputation, so we use an ITERATIVE IMPUTER with ExtraTreesRegressor as the estimator. ExtraTreesRegressor is similar to RandomForestRegressor but much faster. As the estimator is an ensemble model with a sufficient amount of data, it can impute the missing values with good precision. The hyper-parameters in the image below were selected after tuning the estimator.

Code for Iterative Imputer with ExtraTreesRegressor as the estimator

c) Imputation Strategy for D3

As this dataset's columns have between 30% and 75% NAN values, we do NOT have enough data to use an ensemble estimator inside the Iterative Imputer. So we use the Iterative Imputer with Ridge regression (regression with L2 regularization) as the estimator. Ridge regression works well here because the amount of usable data is small and its generalization capability is better than that of more complex models. The hyper-parameters in the image below were selected after tuning the estimator.

Code for Iterative Imputer with RidgeRegression as the estimator
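The two imputer configurations described above (ensemble estimator for D2, Ridge for D3) were shown as images in the original post, so the hyper-parameter values below are assumptions; only the choice of estimators comes from the text. Note the `enable_iterative_imputer` import, which IterativeImputer requires.

```python
# Hedged sketch of the two IterativeImputer configurations; tuned
# hyper-parameters are not reproduced here, so these values are assumptions.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.linear_model import Ridge

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))
X[rng.random(X.shape) < 0.2] = np.nan      # inject ~20% missing values

# D2 (5%-30% NaN): ensemble estimator, precise but data-hungry
imp_d2 = IterativeImputer(
    estimator=ExtraTreesRegressor(n_estimators=30, random_state=0),
    max_iter=5, random_state=0)
X_d2 = imp_d2.fit_transform(X)

# D3 (30%-75% NaN): Ridge (L2-regularized regression), generalizes
# better when little observed data is available
imp_d3 = IterativeImputer(estimator=Ridge(alpha=1.0),
                          max_iter=5, random_state=0)
X_d3 = imp_d3.fit_transform(X)
```

As with the scalers, both imputers are fitted on training data and the fitted objects are reused to transform the test data.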

Now that all the Measures sub-datasets have been imputed, we combine the 3 datasets (D1, D2, D3) on their index values to create the FINAL MEASURES DATASET.
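The index-based recombination can be sketched with `pandas.concat`. An inner join on the index matters here because D1's imputation strategy dropped rows, so only indices present in all three pieces survive; the tiny dataframes below are fabricated to show that.

```python
# Recombining the imputed pieces on their index values; the data is a toy
# example where D1 has lost a row to NaN-dropping.
import pandas as pd

d1 = pd.DataFrame({"a": [1.0, 2.0]}, index=[0, 2])          # row 1 was dropped
d2 = pd.DataFrame({"c": [0.1, 0.2, 0.3]}, index=[0, 1, 2])
d3 = pd.DataFrame({"d": [9.0, 8.0, 7.0]}, index=[0, 1, 2])

# inner join keeps only the indices present in every piece, so the row
# counts line up and there is no shape mismatch downstream
final_measures = pd.concat([d1, d2, d3], axis=1, join="inner")
print(final_measures)
```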


I will give you an overview of the analysis and pre-processing of the Histogram data. After checking the number of NAN values in the Histogram dataset, I found that fewer than 3% of the data points contain NAN values, so I simply removed those rows. I then used the “describe()” function to get a brief idea of the dataset.

Overview of Histogram Data

As the difference between the 75th percentile value and the max value is very large, there is no obvious way to engineer features from this dataset that underline the existing data distribution. Dropping the Histogram data was one option, but to check its usefulness I decided to tune a model on it. The data was first STANDARDIZED using StandardScaler before being passed to the model.

Code for tuning XGBoost Model on Histogram data

Hyper-parameter tuning results: n_estimators=900, Macro F1 Score=0.8519
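The tuning loop behind this result was shown as an image; the sketch below is an assumption of its shape: sweep `n_estimators`, score each candidate with cross-validated macro F1, and keep the best. The post tuned `xgboost.XGBClassifier`; scikit-learn's `GradientBoostingClassifier` stands in here so the sketch runs with scikit-learn alone, and the swept values and dataset are toy stand-ins.

```python
# Hedged sketch of the n_estimators sweep scored with macro F1.
# GradientBoostingClassifier is a stand-in for the post's XGBClassifier.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# toy imbalanced dataset in place of the standardized Histogram data
X, y = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=5)

best_n, best_f1 = None, -np.inf
for n in (50, 100, 200):                 # the post's sweep reached 900
    f1 = cross_val_score(
        GradientBoostingClassifier(n_estimators=n, random_state=5),
        X, y, cv=3, scoring="f1_macro").mean()
    if f1 > best_f1:
        best_n, best_f1 = n, f1

print("best n_estimators:", best_n, "macro F1:", round(best_f1, 4))
```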

Corresponding Precision Matrix
Corresponding Recall Matrix

From the above Precision and Recall matrices we can say with confidence that this data is certainly useful, so we will include the Histogram data in our final dataset. The only pre-processing done on the Histogram data is standardization.


NOTE: While pre-processing the Measures data and the Histogram data, it is necessary to store the index values, as we have to combine the datasets at the end based on these index values to prevent any shape mismatch.


The first-cut solution consists of using existing model libraries and checking the model's behavior/performance. Out of the many models tried, I will give a detailed explanation of the best model I got (using existing libraries) and a summary of all the models I tried.

As we have a highly imbalanced dataset, a model that predicts every data point as class 0, i.e. SURFACE FAILURE, will automatically get a low MACRO F1 SCORE, which is exactly why this metric suits our problem. To know more about the MACRO F1 SCORE, please click here.
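A tiny demonstration of that property: on a 98/2 class split like ours, the all-class-0 predictor scores near 0.5 on macro F1 (class 1's F1 is 0, and macro averaging does not weight by class frequency), even though its plain accuracy would be 98%. The counts below are illustrative.

```python
# Why macro F1 punishes the always-majority-class model on imbalanced data.
from sklearn.metrics import f1_score

y_true = [0] * 98 + [1] * 2   # ~98% surface failures, like our dataset
y_pred = [0] * 100            # a model that always predicts class 0

macro_f1 = f1_score(y_true, y_pred, average="macro")
print(macro_f1)               # class-0 F1 ~0.99, class-1 F1 = 0 -> ~0.49
```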

After tuning an unweighted XGBoost model, the best hyper-parameter I got is “n_estimators = 700”, giving a MACRO F1 SCORE of 0.89399. (The code is similar to “Tuning XGBoost model on Histogram data”, with equivalent changes.)

Corresponding Precision Matrix
Corresponding Recall Matrix

From the above Macro F1 score and the Precision and Recall matrices, we can say that the model performs very well on the final dataset and could potentially be used as a production model.

Other Models that I have tried are: -

· XGBoost weighted (giving equal weightage to both the classes)

· RandomForest (un-weighted and weighted)

· CATBoost (un-weighted and weighted)

· AdaBoost

· Training the Model on SMOTE (Synthetic Minority Oversampling technique) data and testing it on the original test data. This method made the model overfit and perform badly on the test data.


To see whether performance could be improved further on the existing data, I decided to train Custom Ensemble Models. To give a rough idea, the BEST MODEL I got among all the models I tried (custom as well as existing libraries) is a Custom Ensemble Model with 500 Decision Trees as the base estimators and XGBoost with “n_estimators = 100” as the Meta Classifier. I will now discuss the architecture of this best model below.



· I split the training data into 2 datasets (D1 and D2), each having an equal number of rows and the same columns.

· On D1, I randomly did row and column sampling with replacement so as to train the 500 Decision Trees which are acting as the base estimators (sampling is done in order to avoid the model from overfitting).

· Then for each decision tree, I stored the corresponding column names as we would require it during testing and training the meta classifier.

· Now the base estimators i.e. 500 Decision Trees are trained.

Architecture while training (PART-1)


· Take dataset D2.

· Pass this dataset through the TRAINED BASE ESTIMATORS by performing appropriate column sampling.

· Store the output from each trained base estimator and create a new dataset for the Meta Classifier i.e. XGBoost with “n_estimators = 100”.

· Now, use this new dataset to train the Meta Classifier.

· The Meta Classifier is now trained.

Architecture while training (Part-2)

TESTING PROCEDURE (this explanation is for a single data point, the same can be used for a set of data points as well): -

· Take a test data point.

· Sample the data point based on column names before passing it to the corresponding trained base estimators.

· Collect the output predicted from each base-estimator and combine them all to make a new data point having shape (1,500).

· Pass this new data point through the trained meta classifier.

· Obtain an output for that data point.

Architecture while TESTING
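The training and testing procedures above can be sketched end to end as follows. This is a scaled-down illustration, not the post's code: 50 base trees instead of 500, a synthetic dataset, and scikit-learn's `GradientBoostingClassifier` standing in for the XGBoost meta classifier so the sketch needs only scikit-learn.

```python
# Hedged sketch of the custom stacked ensemble: decision trees trained on
# row/column samples of D1, a boosted meta-classifier trained on their D2
# predictions. Sizes and the meta model are stand-ins.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(7)
X, y = make_classification(n_samples=600, n_features=20, random_state=7)

# Split the training data into two equal halves, D1 and D2.
half = len(X) // 2
X1, y1, X2, y2 = X[:half], y[:half], X[half:], y[half:]

n_base, n_cols = 50, 10        # the post uses 500 trees
base_models, base_columns = [], []

# PART 1: each base tree sees a bootstrap row sample and a random column
# sample of D1; the sampled columns are stored for reuse at test time.
for _ in range(n_base):
    rows = rng.integers(0, len(X1), size=len(X1))           # rows, with replacement
    cols = rng.choice(X.shape[1], size=n_cols, replace=False)
    tree = DecisionTreeClassifier(random_state=0).fit(X1[rows][:, cols], y1[rows])
    base_models.append(tree)
    base_columns.append(cols)

def base_predictions(X_in):
    """Column-sample X_in per tree and stack the 50 predictions as features."""
    return np.column_stack([m.predict(X_in[:, c])
                            for m, c in zip(base_models, base_columns)])

# PART 2: the base predictions on D2 become the meta-classifier's training set.
meta = GradientBoostingClassifier(n_estimators=100, random_state=0)
meta.fit(base_predictions(X2), y2)

# TESTING: pass points through the base trees, then through the meta model.
y_hat = meta.predict(base_predictions(X2))
print("accuracy on D2:", (y_hat == y2).mean())
```

Sampling rows and columns per tree mirrors bagging and keeps the base learners decorrelated; storing each tree's column indices is what lets a test point be sliced identically before prediction, as the testing procedure above requires.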

The MACRO F1 SCORE I got with this model is 0.9053, and only 0.37443% of the test data points are misclassified.

Custom Model Precision Matrix
Custom Model Recall Matrix

CONCLUSION: The Custom Model discussed above is the best-suited model for this problem. Its specification: 500 Decision Trees as the base estimators and XGBoost with “n_estimators = 100” as the Meta Classifier.


Using existing model libraries

Using custom models

To see how my model works when productionized and how it handles various cases, refer to the Profile section and click on “Model Productionization Demo Video.”


From the Histogram data, we were not able to extract good insights or engineer features, so using Auto-Encoders on the Histogram data could be a good option. As the Histogram data is time-based, we can also check whether LSTMs (a type of Recurrent Neural Network) improve overall model performance. And with a deeper analysis of the Measures data, there is a good chance we can come up with more ingenious features.


a) Please click here for accessing my GitHub Repository which contains the code for final data pipeline and the pre-trained models.

b) Please click here to see the video which describes the working of my Model.

c) LinkedIn Profile












Tejas Deo
