Failure Prognostic of Turbofan Engines with Uncertainty Quantification and Explainable AI (XAI)

Deep learning is quickly becoming essential to the human ecosystem. However, the opacity of certain deep learning models poses a legal barrier to their adoption for wider purposes. Explainable AI (XAI) is a recent paradigm intended to tackle this issue. It explains the prediction mechanism of black box AI models, making it extremely practical for decision making where safety, security or financial stakes are high. In another aspect, most deep learning studies are based on point estimate prediction with no measure of uncertainty, which is vital for decision making. Such works are therefore poorly suited to real world applications. This paper presents a Remaining Useful Life (RUL) estimation problem for turbofan engines equipped with prognostic explainability and uncertainty quantification. A single input, multi-output probabilistic Long Short-Term Memory (LSTM) is employed to predict the RUL distributions of the turbofans, and the SHapley Additive exPlanations (SHAP) approach is applied to explain the prognostics made. The explainable probabilistic LSTM is thus able to express its confidence in predicting and to explain the produced estimation. The performance of the proposed method is comparable to several other published works.


Introduction
Each year, industries around the world spend massively to operate in a safe and sustainable way. Specifically, organizations depend on the reliability of their industrial assets to fulfil their dedicated functions. These assets are mostly complex engineered systems whose downtime would paralyze the whole production process. More seriously, downtime can also threaten the safety of personnel, and in certain extreme cases has unfortunately resulted in fatalities. Carefully maintained assets ensure the continuation of safe operation, enabling profitability and, in return, guaranteeing the livelihood of millions of workers. Reliability of engineered systems is one of the topics where researchers and industrial players work hand in hand. Cooperation flourishes to facilitate the exchange of ideas between both milieus. Several efforts originating from academia are actively being pursued in the industry. Domains such as Multi State System Reliability (MSS) (Chao-Hui & Chun Ho, 2019; Zhao et al., 2019) and Human Reliability Analysis (HRA) (Zwirglmaier et al., 2016; Growth et al., 2019) are some of the dedicated research branches in engineered system reliability. In recent decades, Prognostic and Health Management (PHM) has also emerged as a strong contributor in providing frameworks to ensure the well-being of industrial assets (Shin et al., 2018; Gan, 2020; Baur & Monno, 2020).
PHM is mainly used as a decision support tool for safeguarding the health of engineered systems. It facilitates maintenance cost reduction (Scanff et al., 2007), just-in-time maintenance (Sun et al., 2012), liberating load (Atamuradov et al., 2017; Ding Feng et al., 2017) and minimizing accident risks (Kwok et al., 2015; Pham et al., 2012). In PHM, three essential activities are carried out: prognostic, anomaly detection and diagnostic. Prognostic is the act of defining the Remaining Useful Life (RUL), or the remaining operational time of an industrial asset before failure (Elattar et al., 2018; Akpudo & Hur, 2020). Anomaly detection refers to the identification of unusual patterns that deviate from the normal behaviour of operational parameter measurements (Gurkan & Burak, 2020; Liu & Gryllias, 2020). Diagnostic, on the other hand, is the action of discovering the root cause of failure and, if possible, matching the features concerned with known failure signatures (Zhou et al., 2020; Benedetti et al., 2018).

Black-Box AI
While there are various approaches in PHM, methods based on artificial intelligence have gained considerable attention in the past decades. In this domain, machine learning, and specifically deep learning, has reigned supreme thanks to its inherent advantages. Deep learning is powerful in modelling nonlinear relationships. Additionally, it is simple to apply and does not require a deep understanding of the underlying physical interactions of the system. Although deep learning is popular in the research domain, its adoption in the real world is currently hindered by its black box nature (Kim et al., 2020; Grezmak et al., 2018; Kraus et al., 2019). The opaqueness of deep learning prevents users from understanding why a certain prediction is made. Naturally, this poses a challenge, especially in areas where safety, security and investment amplify the need for comprehension. Moreover, this obscurity could result in ethical issues where black box applications risk discriminating against users on the basis of race or gender. To date, the only legal directive affecting the use of AI is the European General Data Protection Regulation (GDPR), whose interpretation regarding the obligation of logical explanation in automated processing is currently being debated between AI experts and practitioners (Hacker et al., 2020; Chazette & Schneider, 2020; Bussmann et al., 2020). In terms of ethical guidelines, the European Commission's High-Level Expert Group on AI presented the Ethics Guidelines for Trustworthy Artificial Intelligence in 2019, whose key requirements correspond directly or indirectly to the use of XAI (Bussmann et al., 2020).

Uncertainty in Deep Learning
While XAI helps to understand the decisions made by deep learning models, it is imperative for the user to evaluate the confidence of the model when predicting, especially in real life applications. Uncertainty estimation is an indicator of the quality of a deep learning prediction. Most deep learning models only produce point estimate predictions, in which any notion of uncertainty is completely absent.
Aleatoric uncertainty is the uncertainty linked to the quality of input data (Kendall & Yarin, 2017; Prado et al., 2019; Li et al., 2020). This uncertainty is characteristic of real world applications, where noise, data acquisition error or stochasticity can be present in the input data.
This paper presents a work that combines the strengths of AI explainability and deep learning uncertainty quantification, in which a turbofan engine life prognostic problem is investigated. A single input, multi-output probabilistic LSTM with SHAP explainability is employed to predict and explain the RUL distributions of the engines. This paper is believed to be the first of its kind to harness these abilities in failure prognostic research. This work is vital because, in real-world applications, uncertainty and explainability are the indicators by which a prediction is assessed for accurate decision making.

Related Literature
AI explainability has been used in various PHM research.
Class Activation Mapping (CAM)-based explanation approaches have been employed together with Convolutional Neural Networks (CNN) in many works to evaluate the focus of the CNN. In (Kim et al., 2020), fault classification of a linear motion guide based on CNN and Grad-CAM explainability in the frequency domain (FG-CAM) is done to analyze which frequencies have a significant impact on the fault conditions. The same technique is applied in (Chen & Lee, 2020) for bearing fault diagnosis. In (Zhao et al., 2020), a DecouplEd Feature-Temporal CNN (DEFT-CNN) with Grad-CAM is proposed to provide separate explanations of features and temporal information. An automatic vision diagnostic technique for a base-excited cantilever beam and a water pump system using a combination of CNN and CAM is presented in (Sun et al., 2020). While the CNN detects faults, the CAM localizes them. Additionally, CAM provides the diagnostic explainability, making the method a white box model.
Layer-wise Relevance Propagation (LRP) is an explainability technique that traces back the contribution of the input to the prediction by propagating relevance measures backward from the output layer to the input layer through the nodes of the model. In (Felsberger et al., 2020), LRP is applied to explain the failure prognostic of the Proton Synchrotron Booster (PSB) of the CERN particle accelerator. The task is to predict and explain the ten most frequent priority 3 fault types of the PSB accelerator power converters. Elsewhere, CNN is employed with LRP to explain the diagnosis of gearbox failure. Again, in (Grezmak et al., 2020), CNN and LRP are utilized for fault classification and explanation of an induction motor. In this work, the vibration time series data used as input is transformed into a time-frequency image using the Continuous Wavelet Transform (CWT) with a Morlet wavelet. Specifically, the wavelet coefficients of the vibration data calculated from the CWT are converted to time-frequency plots that are used by the CNN.
Logical Analysis of Data (LAD) is an explainable diagnostic method based on variable analysis. Fault diagnosis in an industrial chemical plant and a black liquor recovery boiler based on LAD is proposed in (Ragab et al., 2017). In (Ragab et al., 2019), LAD is used to enrich the Fault Tree Analysis (FTA) of industrial clean steam and hot water production. The same technique is used in the same context in (Waghen & Ouali, 2019), where the diagnosis of an actuator system is explained.
Local Interpretable Model-Agnostic Explanations (LIME) and SHAP are both popular model agnostic explanation approaches that can be used to explain any type of machine learning model. In (Onchis & Gillich, 2021), a feed forward neural network together with SHAP and LIME is employed to predict and explain the damage of a prismatic cantilever steel beam. In (Karn et al., 2021), SHAP and LIME are employed to explain crypto mining malware detection in a cloud network.

LSTM Architecture with Probabilistic Layer
The model employed for the RUL prediction is a single input, multi-output LSTM with a probabilistic layer. This probabilistic LSTM maps the sensor data to the Health Index (HI) target of the turbofan. The model has 2 output layers. In the first output, the model predicts a sequence of HI distributions corresponding to the complete health state of the studied turbofan engine. This layer incorporates the probabilistic element: it transforms its input into a Gaussian distribution with variable standard deviation, as suggested in (Kendall & Yarin, 2017). The model thus "forces" the prediction into a form suitable for uncertainty management. In the second output, the model extracts only the HI point estimate, which is the mean of the first HI distribution in the sequence obtained before. This output corresponds to the initial HI, or initial RUL, of the turbofan concerned, on which the predictive performance of the model (i.e., the RMSE calculation) is based. The two outputs are concatenated into a single vector for the SHAP analysis. This is necessary because the SHAP library only accepts a single vector output. The model is trained to minimize the loss on the first output only, as the model favors the HI distribution sequence over the point estimate. The hyperparameters used in this model are optimized via Bayesian hyperparameter optimization.
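As an illustration of how a Gaussian output with variable standard deviation can be trained, the following is a minimal sketch of a Gaussian negative log-likelihood loss. It assumes that the layer emits a mean and an unconstrained scale per time step and that a softplus maps the scale to a positive standard deviation; the function names, the softplus choice and the toy HI values are assumptions for illustration, not the implementation used in this work.

```python
import numpy as np

def softplus(x):
    # Numerically stable log(1 + exp(x)); keeps the standard deviation positive.
    return np.logaddexp(0.0, x)

def gaussian_nll(y_true, mean, raw_scale, eps=1e-6):
    """Mean negative log-likelihood of y_true under N(mean, std^2), per time step."""
    std = softplus(raw_scale) + eps
    z = (y_true - mean) / std
    nll = 0.5 * np.log(2.0 * np.pi) + np.log(std) + 0.5 * z ** 2
    return float(np.mean(nll))

# Hypothetical HI sequence and a prediction with small errors (assumed values).
hi_true = np.array([1.0, 1.0, 0.8, 0.5, 0.2, 0.0])
hi_mean = np.array([1.0, 0.9, 0.8, 0.6, 0.2, 0.1])

# Same means, two different levels of predicted uncertainty.
loss_confident = gaussian_nll(hi_true, hi_mean, np.full(6, -3.0))  # small std
loss_cautious = gaussian_nll(hi_true, hi_mean, np.full(6, 1.0))    # large std
```

With this loss, a prediction whose errors are small relative to its stated standard deviation is rewarded, while an overly wide distribution is penalized through the log-std term, which is what lets the predicted standard deviation act as a confidence measure.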

SHapley Additive exPlanations (SHAP)
SHAP is a model agnostic, game theoretic approach to explain the output of any machine learning model. Model-agnostic explanation works post hoc, by analyzing the inputs and outputs of a black box AI model after it has been trained (Ribeiro et al., 2016). SHAP specifies the explanation as:
$$g(z') = \phi_0 + \sum_{j=1}^{M} \phi_j z'_j \quad (1)$$
Here, $g$ is the explanation model. $z' \in \{0,1\}^M$ are the simplified features that describe the presence of the feature of interest in a feature combination, with $z'_j = 0$ meaning that the feature is absent from the combination and $z'_j = 1$ signifying that it is present. $M$ is the maximum coalition size and $\phi_j \in \mathbb{R}$ is the Shapley value for a feature $j$. The formula for the Shapley value is:
$$\phi_j(val_x) = \sum_{S \subseteq \{1,\dots,p\} \setminus \{j\}} \frac{|S|!\,(p-|S|-1)!}{p!} \left( val_x(S \cup \{j\}) - val_x(S) \right) \quad (2)$$
where $S$ is a subset of the features used in the model, $x$ is the vector of feature values of the instance to be explained and $p$ is the number of features. $val_x(S)$ is the prediction for the feature values in set $S$, marginalized over the features that are not included in set $S$:
$$val_x(S) = \int \hat{f}(x_1,\dots,x_p)\, d\mathbb{P}_{x \notin S} - E_X(\hat{f}(X)) \quad (3)$$
$E_X(\hat{f}(X))$ is the average predicted value (Lundberg & Lee, 2017).
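To make the Shapley value definition concrete, the sketch below computes exact Shapley values for a small toy model by enumerating every feature subset. It assumes that absent features are marginalized by averaging over a small background sample; the toy model, the background rows and all function names are hypothetical. This brute-force enumeration is only feasible for a handful of features and is not the approximation scheme used inside the SHAP library.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, background):
    """Exact Shapley values of predict at x; absent features are averaged over background rows."""
    p = len(x)

    def value(subset):
        # Features in `subset` take the values of x; the rest come from each background row.
        total = 0.0
        for row in background:
            z = [x[k] if k in subset else row[k] for k in range(p)]
            total += predict(z)
        return total / len(background)

    phi = []
    for j in range(p):
        others = [k for k in range(p) if k != j]
        contrib = 0.0
        for size in range(p):
            # Weight |S|! (p - |S| - 1)! / p! for subsets S of this size.
            weight = factorial(size) * factorial(p - size - 1) / factorial(p)
            for subset in combinations(others, size):
                s = set(subset)
                contrib += weight * (value(s | {j}) - value(s))
        phi.append(contrib)
    return phi

# Hypothetical three-feature model with one interaction term.
def toy_model(z):
    return 2.0 * z[0] - 1.0 * z[1] + 0.5 * z[0] * z[2]

background = [[0.0, 0.0, 0.0], [1.0, 2.0, 1.0], [2.0, 1.0, 3.0]]
x = [1.5, 0.5, 2.0]
phi = shapley_values(toy_model, x, background)
base_value = sum(toy_model(row) for row in background) / len(background)
```

The values in `phi` add up to the difference between the prediction for `x` and `base_value`, which is the additive property that force plots visualize.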

RMSE & Scoring Function
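The performance of the model is assessed by calculating the Root Mean Squared Error (RMSE) and the Scoring Function, $S$, of the obtained RUL prediction as respectively shown in Eq. (4) and Eq. (5), (6) and (7) (Li et al., 2018).
$$RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( \widehat{RUL}_i - RUL_i \right)^2} \quad (4)$$
$$S = \sum_{i=1}^{N} s_i \quad (5)$$
$$s_i = \begin{cases} e^{-d_i/13} - 1, & d_i < 0 \\ e^{d_i/10} - 1, & d_i \geq 0 \end{cases} \quad (6)$$
$$d_i = \widehat{RUL}_i - RUL_i \quad (7)$$
Here, $RUL_i$ is the ground truth RUL for turbofan $i$, $\widehat{RUL}_i$ is the predicted RUL for turbofan $i$, and $N$ is the total number of turbofans. The scoring function is asymmetric: for the same absolute error, a late prediction (predicted RUL greater than the ground truth) receives a higher score than an early one. It thus penalizes late predictions more heavily than early ones, and a lower score is better.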
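The two metrics can be sketched in a few lines. The snippet assumes the asymmetric exponential score commonly used with the CMAPSS benchmark, with constants 13 for early and 10 for late predictions; the function names and the example RUL values are illustrative assumptions.

```python
import math

def rmse(rul_true, rul_pred):
    """Root mean squared error between ground truth and predicted RULs."""
    n = len(rul_true)
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(rul_true, rul_pred)) / n)

def score(rul_true, rul_pred):
    """Asymmetric score: late predictions (d >= 0) are penalized more than early ones."""
    total = 0.0
    for t, p in zip(rul_true, rul_pred):
        d = p - t  # positive d means the predicted RUL is too long (late prediction)
        if d < 0:
            total += math.exp(-d / 13.0) - 1.0
        else:
            total += math.exp(d / 10.0) - 1.0
    return total

# Hypothetical example: the same 10-cycle error, once early and once late.
early = score([100.0], [90.0])
late = score([100.0], [110.0])
example_rmse = rmse([100.0, 50.0], [90.0, 60.0])
```

In this example the late prediction scores higher (worse) than the early one even though both have the same absolute error, while the RMSE treats the two errors identically.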

SHAP Explainability & Uncertainty Quantification
Uncertainty quantification will be evaluated via the rolling standard deviation plot of the HI distribution sequence. An increasing trend indicates growing uncertainty in the prediction, while a decreasing trend signifies that the model is increasingly confident in its estimation. As for explainability, the SHAP library will be used to analyze the predicted HI distribution sequence of each turbofan.
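A rolling standard deviation and its trend line can be sketched as follows. The snippet assumes that the rolling standard deviation is taken over a fixed window of a per-cycle series (for instance the predicted HI values or the predicted standard deviations) and that the trend is the slope of a least-squares line of best fit; the window length, function names and synthetic series are assumptions.

```python
import numpy as np

def rolling_std(values, window):
    """Standard deviation over each consecutive window of the series."""
    values = np.asarray(values, dtype=float)
    windows = np.lib.stride_tricks.sliding_window_view(values, window)
    return windows.std(axis=1)

def trend_slope(series):
    """Slope of the least-squares line of best fit through the series."""
    t = np.arange(len(series))
    slope, _intercept = np.polyfit(t, series, deg=1)
    return float(slope)

# Synthetic series: oscillations whose amplitude grows (or shrinks) over time.
t = np.arange(60)
sign = np.where(t % 2 == 0, 1.0, -1.0)
growing = (t / 60.0) * sign
shrinking = growing[::-1]

slope_growing = trend_slope(rolling_std(growing, window=10))
slope_shrinking = trend_slope(rolling_std(shrinking, window=10))
```

A positive slope corresponds to the growing-uncertainty case and a negative slope to the case where the model becomes more confident.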

Case Study: CMAPSS Turbofan Dataset
The CMAPSS (Commercial Modular Aero-Propulsion System Simulation) turbofan run-to-failure datasets consist of 4 complete sets of training data, testing data and ground truth RUL for numerous turbofan engines, published by the NASA Prognostics Center of Excellence (PCoE) at Ames Research Center and denoted as FD001, FD002, FD003 and FD004 (Ramasso & Saxena, 2014). This data was produced by adjusting the operational conditions and injecting faults of varying degradation degree into the simulated turbofan system using the CMAPSS software (Saxena et al., 2008).
The FD002 data is chosen in this study. This data consists of recorded turbofan degradations in which the health condition deteriorates after a certain cycle, as shown in Table 1. Each turbofan is associated with a time series sequence comprising Time (Cycle), 3 Operating Conditions (OC) and 21 sensor measurements corresponding to temperature, pressure, various ratios, and bleed enthalpy of the system. The OC refer to different operating regimes combining altitude (0-42K ft.), throttle resolver angle (20-100), and Mach number (0-0.84). High levels of noise are incorporated, and the faults encountered are hidden by the effect of the various operational conditions (Saxena et al., 2008).
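The record layout described above can be sketched as a small parser. It assumes the commonly distributed text format of the dataset, in which each row is whitespace-separated and starts with an engine unit number followed by the cycle, the 3 operating conditions and the 21 sensor readings; the column names and the sample row are hypothetical.

```python
# Assumed column layout: unit id, cycle, 3 operating conditions, 21 sensors.
COLUMNS = ["unit", "cycle", "oc1", "oc2", "oc3"] + [f"s{k}" for k in range(1, 22)]

def parse_row(line):
    """Turn one whitespace-separated row into a dictionary keyed by column name."""
    values = line.split()
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(values)}")
    record = {name: float(value) for name, value in zip(COLUMNS, values)}
    record["unit"] = int(record["unit"])
    record["cycle"] = int(record["cycle"])
    return record

# Hypothetical row with made-up values, used only to exercise the parser.
sample_line = " ".join(["1", "1", "34.9983", "0.84", "100.0"]
                       + [str(500.0 + k) for k in range(21)])
record = parse_row(sample_line)
```

Grouping the parsed records by `unit` then yields one time series sequence per turbofan.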

HI Target Calculation
To obtain the RUL target for the model's training, a piece-wise linear degradation model is assumed (Li et al., 2018). The health of each turbofan is thus considered stable in the beginning until the failure start point, which initiates a linear degradation until failure. Each time series sequence corresponds to the total operational duration of a turbofan, and the last cycle indicates the final instance before failure. Thus, initially, the RUL of a turbofan is assumed to be equal to the value of the last cycle and to degrade linearly to 0, as shown in Figure 1(a). In this example, the training data of turbofan 1 has been recorded over a total of 192 cycles. The failure start point for each sensor is calculated using the Cumulative Sum (CUSUM) anomaly detection technique, which returns the first index at which the upper or lower cumulative sum of the sensor's measurements has drifted more than 5 standard deviations beyond the target mean, indicating the starting point of degradation (MATLAB, MathWorks). This index is thus equal to the cycle in which the degradation appears. The mean of these indexes over all sensors is taken as the failure start point. Combining the linear degradation obtained earlier with the failure start point gives the transformed RUL sequence presented in Figure 1(b).
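The target construction can be sketched as below. The CUSUM routine is a simplified re-implementation that assumes the target mean and standard deviation are estimated from the first 25 samples and that the minimum detectable shift is one standard deviation; these defaults, the function names and the synthetic sensor signals are assumptions. Holding the RUL constant before the failure start point is one common reading of the piece-wise linear model and is also an assumption here.

```python
import numpy as np

def cusum_first_alarm(x, climit=5.0, mshift=1.0, n_ref=25):
    """First index where the upper or lower cumulative sum exceeds climit standard deviations."""
    x = np.asarray(x, dtype=float)
    tmean = x[:n_ref].mean()        # target mean from the reference samples
    tstd = x[:n_ref].std(ddof=1)    # target standard deviation
    upper = lower = 0.0
    for i, xi in enumerate(x):
        upper = max(0.0, upper + xi - tmean - mshift * tstd / 2.0)
        lower = min(0.0, lower + xi - tmean + mshift * tstd / 2.0)
        if upper > climit * tstd or lower < -climit * tstd:
            return i
    return None  # no drift detected

def piecewise_rul(n_cycles, start):
    """RUL held constant before the failure start point, then decreasing linearly to 0."""
    linear = np.arange(n_cycles - 1, -1, -1, dtype=float)  # n_cycles-1, ..., 1, 0
    return np.minimum(linear, linear[start])

# Two synthetic sensors over 192 cycles: one drifts up from cycle 100, one down from 110.
t = np.arange(192)
sensor_a = 0.1 * np.sin(t) + np.where(t >= 100, 0.05 * (t - 100), 0.0)
sensor_b = 0.1 * np.cos(t) - np.where(t >= 110, 0.05 * (t - 110), 0.0)

alarms = [cusum_first_alarm(s) for s in (sensor_a, sensor_b)]
start = int(round(np.mean([a for a in alarms if a is not None])))
rul = piecewise_rul(len(t), start)
```

The mean of the per-sensor alarm indexes plays the role of the failure start point, and `rul` is the resulting training target for one engine.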
The HI measurements are calculated from the RULs as indicated below, with $RUL_t$ as the RUL at time $t$, $RUL_{max}$ as the healthy state RUL and $RUL_{min}$ equal to 0, the failure state.
$$HI_t = \frac{RUL_t - RUL_{min}}{RUL_{max} - RUL_{min}}$$
Table 2: RMSE and Score Comparison

RMSE & Scoring Results
As shown in Table 2, the performance of the proposed method is comparable to known published works in both RMSE and score results. The best results in both metrics are highlighted.

Sequence Prediction and Uncertainty
To analyze the prediction and the associated uncertainty, results for turbofans 1 and 2 are presented as examples.
As seen in Figure 2, the prediction for turbofan 1 is not accurate compared to the ground truth HI sequence. This is reflected in the growing trend of the standard deviation plot, shown by the line of best fit in Figure 3, which indicates that the model is increasingly less confident in its prediction. Additionally, successive strong oscillations in the standard deviation can be seen in the Zoom 2 area in Figure 3, corresponding to the Zoom 1 area in Figure 2. This area, as shown in Figure 2, relates to the linear deterioration prediction. These extreme movements show that the model is very uncertain of its deterioration prediction.
As for turbofan 2, the HI prediction is quite accurate compared to the ground truth HI, as illustrated in Figure 4. The model is confident in its prediction, and this is expressed in the standard deviation plot in Figure 5, where the line of best fit shows a decreasing trend.

Prognostic Explainability
Local explainability is presented for turbofan 2, as the model is more confident in this case than for turbofan 1, as shown before. SHAP force plots are employed to visualize the local explainability.
The HI prediction, as shown in Figure 6, starts to deteriorate at cycle 57, corresponding to the predicted failure start point, or anomaly detection. It is thus interesting to see the explanation before and after this point, until the prediction stabilizes at cycle 80, indicating failure (HI = 0). From the force plots in Figure 6, a pattern in the features influencing the prediction can be noted before and after the failure start point. Before the model predicts the start of degradation, various features can be seen influencing the healthy state of the turbofan, as presented in Figure 6(a) and 6(b). However, after cycle 57, similar patterns of features are repeatedly shown. These features are S9, S2, S8, OC1, OC2, S11, S13, S14, OC3 and S4, as illustrated in Figure 6(d), 6(e) and 6(f). These features start to appear gradually from cycle 57, as seen in Figure 6(c). The features in red push the prediction in the positive direction, while the features in blue push it in the negative direction. Omitting the operating condition features (OCs), the description of the features influencing the failure state of the turbofan is presented in Table 3.

Sensor    Influence    Description
S8        Negative     Physical fan speed
S2        Negative     Total temperature at LPC outlet
S9        Negative     Physical core speed
S11       Positive     Static pressure at HPC outlet
S13       Positive     Corrected fan speed
S14       Positive     Corrected core speed
S4        Positive     Total temperature at LPT outlet

Conclusion
In this paper, a probabilistic, single input, multi-output LSTM with explainability that can express its prediction uncertainty is presented. The ability of this model is demonstrated in an RUL prognostic problem involving turbofan engines. The SHAP model agnostic explainability approach is employed to explain the regression task, while the probabilistic layer produces the HI distribution prediction characterizing the aleatoric uncertainty. The model is thus able to express its prediction confidence and explain its sequential outputs. These indicators are very valuable, as users depend on them for correct decision making in real world AI applications. The performance of the model is also comparable to other known methods in published works.

Acknowledgements
The authors would like to thank Universiti Teknologi Petronas Foundation (YUTP) for financing this research.