**Process fault prediction and prognosis using a hybrid model**

**Abstract**

Prediction and management of process faults could save billions of dollars per year. This study proposes a hybrid approach to predict and prognose process faults. The hybrid approach comprises a Hidden Markov Model (HMM) and a Bayesian Network (BN). The HMM predicts abnormalities using historical process data, while the BN uses process knowledge to prognose the fault. In the off-line component, an HMM trained with Normal Operating Condition (NOC) data is used to determine the log-likelihood (LL) of each historical process data string. The generated LL values are then used to develop the Conditional Probability Tables of the BN, while the structure of the BN is developed from process knowledge with the help of Sign Directed Graphs. In addition, a separate historical data set with known faults is used to generate a database of LL values with respect to the same HMM trained with NOC data. In the online component, the trained HMM continuously evaluates the LL values of incoming data strings and compares them with the historical LL database. Based on the comparison, the system decides the most likely condition of the process a number of seconds ahead. The predicted time of the abnormality and the probabilities of being in specific operational conditions at that time are used to generate likelihood evidence for the BN. The BN, updated with this likelihood evidence, is then used to prognose the cause. The performance of the proposed approach is tested using published data of the Tennessee Eastman Process. The system is capable of predicting all ten selected faults while accurately prognosing eight of them.

**Keywords:** Process fault prognosis; Hidden Markov model; Bayesian network; Process fault prediction

**1. Introduction and review of the relevant literature**

Many machine learning approaches are available for fault prediction. This section reviews machine learning algorithms that can be used in process fault detection and prediction. As illustrated in Figure 1, machine learning approaches can be divided into two basic categories: supervised learning and unsupervised learning. In supervised learning, the algorithms assess the input data and the corresponding outputs to learn the mapping function from input to output. In unsupervised learning, the algorithms identify hidden structures in the data in order to learn more about the data. Classification and regression come under supervised learning, while clustering comes under unsupervised learning. Examples of each category are also illustrated in Figure 1 (The Mathworks, 2016).

Figure 1: Different machine learning techniques (The Mathworks, 2016)

As further mentioned in (The Mathworks, 2016), supervised learning needs training of a model on known input and output data, so that it can predict future outputs. It builds a model, with the available known inputs and respective outputs, which can be used to predict the output of an unknown input. In classification, the incoming data can be classified into predefined groups. For example, it can be used to detect whether an e-mail is spam or not. On the other hand, regression techniques can predict continuous changes of quantities such as changes in temperature or pressure of a polymer melt. They are widely used in electricity load forecasting.

Logistic regression is a suitable first choice for binary classification problems. It predicts the probability of a binary response by fitting a suitable model, and it is most effective when the data can be clearly separated by a single linear boundary. k Nearest Neighbor (kNN) is a simple algorithm that is useful when memory usage and prediction speed are of lesser concern. The primary assumption in kNN is that objects near each other are similar. It categorizes objects based on the classes of their nearest neighbours in the data set, and distance metrics are used to locate the nearest neighbours. Further, neural networks can be used for modelling highly nonlinear systems. They can also be updated continuously as new data become available. Inspired by the human brain, a neural network consists of highly connected networks of neurons that relate the inputs to the desired outputs.
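The kNN idea described above can be sketched in a few lines. This is only an illustration under stated assumptions: the Euclidean distance is chosen as the distance metric, and the function name `knn_predict` and the two-class data set are hypothetical, not taken from the cited source.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance is assumed here; other distance metrics could be used.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical data: "normal" readings near (1, 1), "fault" readings near (5, 5).
points = [(1.0, 1.2), (0.8, 1.0), (1.1, 0.9), (5.0, 5.1), (4.8, 5.3), (5.2, 4.9)]
labels = ["normal", "normal", "normal", "fault", "fault", "fault"]

prediction = knn_predict(points, labels, query=(4.7, 5.0), k=3)
```

Because every prediction scans the whole training set, the sketch also shows why kNN suits cases where memory usage and prediction speed are of lesser concern.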

Naïve Bayes, on the other hand, classifies incoming data by assessing the probability of belonging to a particular class. It assumes that the presence of a feature in a class is unrelated to the presence of any other feature. It performs well for small datasets containing many parameters. Discriminant analysis finds linear combinations of features to classify data. Its primary assumption is that different classes generate data based on Gaussian distributions. If a simple model that is easy to interpret is required, discriminant analysis is a good option. Models created using discriminant analysis are fast to predict, and memory usage can be optimized during the training process. A decision tree is another technique with low memory usage. It does not give high predictive accuracy, but it is easy to interpret and fast to fit. When individual decision trees perform poorly, several of them are combined into an ensemble, known as Bagged and Boosted Decision Trees. Each tree is trained independently. This version of decision trees is more suitable when the time taken to train a model is not critical.

Linear regression is easy to interpret and train; therefore, it is often the first model fitted to a given data set. It describes a continuous response variable as a linear function of one or more predictor variables. Nonlinear regression, on the other hand, is used to describe nonlinear relationships. It is effective when the data have strong nonlinear trends and cannot be easily transformed into a linear space. Gaussian process regression (GPR) models are used for predicting the value of a continuous response variable. GPRs are non-parametric models that are widely used in spatial analysis, where they interpolate in the presence of uncertainty. Support Vector Machine Regression (SVMR) operates similarly to SVM classification, but it is developed specifically to predict continuous responses. This technique is effective for high-dimensional data.

Further, when the response variables have non-normal distributions, the Generalized Linear Model, a particular case of nonlinear models that uses linear methods, can be used successfully. Similar to the decision trees used in classification, the decision trees developed for regression are called Regression Trees. These are modified to predict continuous responses and can be used successfully when the predictors are discrete or behave nonlinearly.

Unsupervised learning finds hidden patterns or natural structures in input data. Clustering is the most common unsupervised learning technique. It is used in exploratory data analysis to find hidden patterns or groupings in data. Clustering can be divided into two basic categories, namely hard and soft. Among the hard clustering examples, k-means and k-medoids are closely related. The only difference is that the cluster centers of the latter coincide with data points, while those of the former need not. In hierarchical clustering, the data are grouped into a binary hierarchical tree, while the self-organizing map is a neural-network-based clustering technique that transforms a data set into a 2D plot. Fuzzy C-means, a soft clustering technique, can be used when data points belong to more than one cluster. It is widely used in pattern recognition. Similar to Fuzzy C-means, the Gaussian Mixture Model (GMM) is a partition-based clustering technique in which data points come from different multivariate normal distributions (Chelly & Denis, 2001a).

In addition, there are three commonly used dimensionality reduction techniques, namely Principal Component Analysis (PCA), factor analysis, and nonnegative matrix factorization. In PCA, a linear transformation is performed on the data so that a few principal components can represent a high-dimensional data set. The strength of this method is that the first few principal components capture most of the variance, or information, of the entire data set. Factor analysis identifies the relationships between variables in a given data set and represents them in terms of a smaller number of latent, or common, factors. Nonnegative matrix factorization is used when model terms must represent nonnegative quantities ($\mathbb{R}_0^{+}$), such as physical quantities (Chelly & Denis, 2001b).

Figure 2: Different techniques in Clustering

Machine learning techniques used as prediction tools in a variety of applications are discussed under this topic. There are numerous examples available in the literature for both individual and combined use of machine learning techniques.

A method for predicting failures of a partially observable system is presented by (M. J. Kim, Jiang, Makis, & Lee, 2011). They model the system behaviour as a three-state, continuous-time homogeneous hidden Markov process. States 0 and 1 are not observable and represent the good and warning conditions, respectively. The only observable state is the failure state, 2. Model parameters are estimated using the EM algorithm.

Further, a cost-optimal Bayesian fault prediction scheme is also employed. A comparison with other prediction methods is given. It clearly illustrates the effectiveness of the proposed approach.

Logistic Regression Classifier (LRC) is a powerful tool for predicting linearly separable classes. It is a commonly used analytical model for classification problems. When a training feature matrix $X$ is provided along with the corresponding target vector $Y$, a logistic regression model can be trained to predict $Y$ even for unseen instances of $X$. The input has two main components, namely data and parameters. The data component includes the training and test datasets. The training dataset requires the feature values and their corresponding target values, while the test dataset only requires the feature values in order to predict their unknown target values (Predix, 2016a).
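As a minimal sketch of this train-then-predict workflow, the following fits a logistic regression model by batch gradient descent on the log-loss. It does not reproduce the cited Predix service: the function names, the learning rate, the number of epochs, and the one-feature data set are all assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(X, Y, learning_rate=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n_features = len(X[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(X, Y):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            error = p - y  # derivative of the log-loss with respect to the logit
            for i, xi in enumerate(x):
                grad_w[i] += error * xi
            grad_b += error
        weights = [w - learning_rate * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(X)
    return weights, bias


def predict_proba(x, weights, bias):
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


# Hypothetical training data: feature matrix X with target vector Y.
X_train = [[0.5], [1.0], [1.5], [3.5], [4.0], [4.5]]
Y_train = [0, 0, 0, 1, 1, 1]
weights, bias = train_logistic_regression(X_train, Y_train)

# The test set carries feature values only; the targets are predicted.
X_test = [[0.8], [4.2]]
Y_pred = [int(predict_proba(x, weights, bias) >= 0.5) for x in X_test]
```

The 0.5 probability threshold used to turn probabilities into class labels is a common default, not a requirement of the method.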

As further mentioned in (Predix, 2016b), Random Decision Forests (RDF) is an ensemble learning method for tasks such as classification and regression. It constructs a set of decision trees at training time and outputs the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. Each decision tree is trained on a bootstrap sample drawn from the training set with replacement. In addition, the algorithm draws a random subset of features for training the individual trees, which makes the trees more independent than in the conventional approach. In the case of classification, the majority rule is applied to the trained decision trees to classify new data: each decision tree comes up with a decision, and the final prediction is selected based on the number of votes. Random Forests provide better predictive performance (due to a better bias-variance trade-off) and are faster because each tree learns only from a subset of features. Random Forest is practically tuning-free; therefore, no parameter tuning is required to find the optimal model.
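The three ingredients named above (bootstrap sampling, random feature selection, and majority voting) can be illustrated with a deliberately simplified forest. As an assumption made for brevity, each "tree" here is a one-split decision stump trained on a single randomly chosen feature; the function names and data are hypothetical.

```python
import random
from collections import Counter


def fit_stump(X, y, feature):
    """Find the threshold on one feature that misclassifies the fewest samples."""
    values = sorted({row[feature] for row in X})
    majority = Counter(y).most_common(1)[0][0]
    # Fallback: a constant predictor, used when the sample has one distinct value.
    best = (len(y) + 1, values[0], majority, majority)
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [label for row, label in zip(X, y) if row[feature] <= threshold]
        right = [label for row, label in zip(X, y) if row[feature] > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0]
        errors = sum(l != left_label for l in left) + sum(r != right_label for r in right)
        if errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return feature, threshold, left_label, right_label


def fit_forest(X, y, n_trees=25, seed=0):
    rng = random.Random(seed)
    n_samples, n_features = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        # Bootstrap sample: draw n_samples rows with replacement.
        idx = [rng.randrange(n_samples) for _ in range(n_samples)]
        # Random feature subset; for a stump this is a single feature.
        feature = rng.randrange(n_features)
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feature))
    return forest


def predict_forest(forest, row):
    """Majority vote over the individual stump decisions."""
    votes = [left if row[f] <= t else right for f, t, left, right in forest]
    return Counter(votes).most_common(1)[0][0]


# Hypothetical two-feature data set.
X = [[1.0, 1.1], [1.2, 0.9], [1.5, 1.3], [1.1, 1.4],
     [5.0, 5.2], [5.5, 4.8], [6.0, 5.1], [5.2, 5.6]]
y = ["normal"] * 4 + ["fault"] * 4
forest = fit_forest(X, y)
```

For regression, the vote in `predict_forest` would be replaced by the mean of the individual predictions.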

Two variants of Genetic Programming (GP) approaches for intelligent online performance monitoring of electronic circuits and systems are introduced by (Hamerly & Elkan, 2005a). A stressor-susceptibility interaction model is introduced to assess the reliability of circuit systems. When a stressor exceeds the susceptibility limit, the system is identified as failed. Direct measurements through sensors are used to generate the validated stressor vectors, which are then fed to the GP model. The results are compared with ANN outputs, and the approach is found to perform well.

Further, (Hamerly & Elkan, 2005b) propose a technique to predict hard disk failures, which are rare but costly. Two Bayesian methods are introduced, namely a mixture model of Naïve Bayes sub-models and a Naïve Bayes Classifier. The former uses the EM algorithm, while the latter is a supervised learning approach. Both techniques show good prediction accuracy on real-world data. Also, (Murray, 2003) compares the performance of support vector machines (SVMs), unsupervised clustering, and non-parametric statistical tests (rank-sum and reverse arrangements) and confirms that the rank-sum method outperforms the others. Two improved versions of the SMART (Self-Monitoring and Reporting Technology) failure prediction system are proposed by (Hughes, Murray, Kreutz-Delgado, & Elkan, 2002). The proposed techniques give three to four times higher prediction accuracy than error thresholds on hard disk drives that are prone to fail, at a 0.2% false alarm rate.

Bearing failure prediction was successfully done by (Fulufhelo V. Nelwamondo, 2006) using HMM and GMM. They introduce feature extraction methodologies that can facilitate early detection of faults. Time domain vibration signals of faulty and normal bearings in rotating machinery are used for feature extraction. HMM and GMM are then used to classify faults based on the extracted features. In terms of classification performance, HMM is superior to GMM. They further mention that HMM has the disadvantage of being computationally expensive.

Euclidean distance-based feature selection is proposed by (J. Kim, Han, & Lee, 2016) for a fault detection and prediction model in the semiconductor manufacturing process. As the first step, the features of the semiconductor manufacturing process are characterized in terms of Mean Absolute Value (MAV) and Standard Deviation (SD). After that, the most appropriate features are selected using the Euclidean distance and the classification model. Finally, a neural network is trained with the selected features to generate a fault prediction model. The proposed method performs well in semiconductor manufacturing process fault prediction.

Neural network modelling is used by (Bekat, Erdogan, Inal, & Genc, 2012) to predict the amount of bottom ash accumulated in a pulverized coal-fired power plant. Operating data collected throughout the one-year period and the properties of the coal processed are used in the prediction process. Feed-forward type network architecture with back-propagation learning is used with three layers. The sigmoid function is used as the activation function. The authors have determined the ideal parameters for accurate predictions along with the most useful parameters to be monitored based on a sensitivity analysis.

To minimize the error and reduce the maintenance cost, prediction of an incipient fault in transformer oil is essential. A methodology based on an artificial neural network (ANN) and particle swarm optimization (PSO) is presented by (Illias, Chai, Bakar, & Mokhlis, 2015) to predict the incipient transformer fault. The proposed ANN-PSO method is found to yield the highest percentage of correct identification among the reported works.

On the other hand, (Jiang, Wey, & Fan, 1988) propose an algorithm to predict faults in analog circuits. The central concept is the continuous monitoring of component values, which are evaluated from consecutive voltage measurements. These measurements are taken at the accessible test points at each periodic maintenance. This approach makes it possible to locate the faulty components as well as components that are predicted to fail soon.

Based on a multi-PCA model, (Ma & Xu, 2015) present a methodology for multiple mode process fault detection, fault estimation, and fault prediction. First, a multi-PCA model is used to detect faults in a steady state process operated under different conditions. For the transition process, a weighted algorithm is used. The fault amplitude is estimated using a consistent estimation algorithm, and finally, SVM is used to predict the trend of the fault amplitude. The performance of the method is demonstrated on Tennessee Eastman process data.

A prediction technique is introduced by (Gao & Liu, 2017), which is based on an improved kernel principal component analysis (KPCA) method. It uses the concepts of indiscernibility and eigenvectors. The application area is fault prediction in the distillation column process. The improved version of KPCA can remove variables that have almost no correlation with the fault being monitored. It can also reduce the number of data strings used several-fold. The proposed methodology performs better than the traditional technique. By applying the method to a distillation column scenario, the authors show that the KPCA method can predict process failures caused by small disturbances.

Weighted least squares support vector machine regression is used by (Gao & Liu, 2017) to develop a Hammerstein model to predict the dynamic behaviour and the possible faults in an Imperial Smelting Furnace (ISF). The proposed model is capable of accurate fault predictions for the ISF. Further, (Ramana, Sapthagiri, & Srinivas, 2017) introduce a machine learning methodology for predicting the quality of injection moulding products, which has shown a prediction accuracy of 95%. It is an effective method to increase productivity by eliminating defective products during production. To build the data mining models, Decision Tree and k-NN techniques are trained using a training data set developed from actual production data of a given product. The prediction accuracy is then tested using a testing dataset developed similarly.

The use of Computer Aided Engineering (CAE) tools for fault prediction in a copper processing line is studied by (Jahani & Razavi, 2016). The outcome of the study is highly useful in the predictive maintenance of critical equipment such as slurry pumps and hydro-cyclones. Further, the simulations can suggest the best parameters to measure, the measuring intervals, and the measuring locations.

Prediction of emerging faults in dynamic industrial processes is achieved by (Hu, Deng, Cao, & Tian, 2017), using an approach based on Canonical Variable Trend Analysis (CVTA). The canonical variables are the main information carried forward to make the predictions. They are the uncorrelated latent features extracted through the analysis of process dynamics. The initial analysis is done using the Canonical Variate Analysis (CVA) algorithm, while SVM is adopted to model the relationship between historical and future values. This facilitates the development of a time series prediction model for the canonical variables. An overall monitoring statistic is used to forecast the change of the process status based on the predicted canonical variables. The authors demonstrate the effectiveness of the proposed methodology by applying it in a simulation of a Continuous Stirred Tank Reactor (CSTR) system.

An intelligent algorithm for fault prediction of the turbine pitch system is proposed by (Deng, 2018), based on Least Squares Support Vector Machines (LS-SVM) parameter optimization. Initially, the data of the SCADA system are analyzed. Through this analysis, four kinds of parameters that are closely related to turbine pitch system faults are selected as the input of the model. Then the minimum output coding (MOC) is introduced to construct a multi-class LS-SVM that performs the multi-class classification of pitch faults. Later, particle swarm optimization is implemented to select the optimal feature parameters for the multi-class LS-SVM classifiers. The proposed methodology is applied to a wind farm pitch fault prediction scenario. Its performance is compared with the back propagation neural network algorithm and the standard SVM algorithm, and the proposed method is found to be superior.

A method for data-based line trip fault prediction in power systems using long short-term memory (LSTM) networks and SVM is proposed by (Zhang, Wang, Liu, & Bao, 2017). As LSTM networks perform well in extracting the features of time series over a long time span, they are used to capture temporal features of multisource data. SVM is then used for classification to obtain the final prediction results. The actual data for experiments is obtained from the Wanjiang substation in the China Southern Power Grid. The improved performance of the combined LSTM and SVM approach is evident in comparison with current data mining methods.

A systematic approach for fault prediction of power converters in power conversion systems is presented by (Di et al., 2018). Two data-driven methods with novel techniques are used, namely decision tree and SVM, which take into account working condition variances and data imbalance, respectively. The approach was validated with an industrial application and found to be useful in predicting power converter failures.

A prediction method for cement rotary kiln process is proposed by (Sadeghian & Fatehi, 2011), using a nonlinear system identification method. First, the suitable inputs and outputs were selected, and an input-output model is identified for the plant. Locally Linear Neuro-Fuzzy (LLNF) model is used to identify various operating points of the kiln. The model is trained with an incremental tree-structure algorithm. The methodology is used to develop three models, one for normal operating conditions and the other two for two faulty situations. The result of the fault detection algorithm performance indicates that the proposed technique can predict the fault occurrence 7 minutes in advance.

A new particle predictor-based method is proposed by (Chen, Zhou, & Liu, 2005) to calculate the probability of fault prediction for nonlinear time-varying systems. As illustrated by simulations, the proposed methodology is capable of giving an early alarm before the fault happens. Although the conventional Particle Filter cannot handle unknown time-varying parameters, the Particle Predictor can. As further explained in (Chen et al., 2005), it is almost impossible to predict the abrupt faults of a system. Nevertheless, slowly developing faults can be predicted with the use of an online monitoring system. Because the fault develops slowly, the estimated time-varying parameter can be assumed not to change over a short future time interval. This assumption can then be used to predict forward and determine whether the state falls outside its normal range.

A solution for the fault prediction for the nonlinear stochastic system with incipient faults is proposed by (Ding & Fang, 2017). This fault estimation algorithm is developed based on the particle filter and the reasonable assumption about the incipient faults. A new fault detection strategy called ‘intuitive fault detection’ is introduced. When the incipient faults are detected, nonlinear regression is used to identify the respective parameters. The potential future fault signal is predicted based on the estimated parameters. Further, the effectiveness of the proposed method is verified by a standard simulation.

Determining the remaining useful life (RUL) of a wind turbine is essential for maintaining the reliability of the system. Based on the adaptive neuro-fuzzy inference system (ANFIS) and particle filtering (PF) approaches, (Cheng, Qu, & Qiao, 2017) propose a method for fault prognosis and RUL prediction of the gearbox. The fault features are extracted from the stator current of the generator coupled with the gearbox. The extracted fault features are used to train the ANFIS, and the PF predicts the RUL, also making use of new information on the fault features. The proposed method is found to be effective based on the experimental results. A similar problem is successfully solved by (Zhao et al., 2017). It has been shown that wind turbine generator RUL can be predicted with about 80% accuracy 18 days ahead, and that generator faults can be diagnosed with 94% accuracy when they occur. The benefit of the proposed system is that it does not require any additional hardware installation. The already available SCADA system serves as the source of information, which makes the approach cost-efficient. Based on a detailed analysis, the authors select SVM as the most suitable classification technique for this particular application among ANN, Bayes classifier, and k-NN classifiers.

Foreign exchange rate prediction is also a widely researched area. A nonlinear ensemble forecasting model is proposed by (Yu, Wang, & Lai, 2005) and recommended as an alternative tool for exchange rate forecasting. It integrates generalized linear auto-regression (GLAR) with artificial neural networks (ANN) for accurate predictions. The combined model is compared with the two individual forecasting tools (i.e. GLAR and ANN). According to (Yu et al., 2005), the integrated approach is more accurate than either individual system. Similar to foreign exchange prediction, stock market forecasting is also a well-researched area. An HMM-based tool is developed by (M. Rafiul Hassan & Nath, 2005). A trained HMM is used to search the past data set for patterns that match the current behaviour of the variable of interest. Forecasting is done by interpolating the neighbouring log-likelihood values of these data sets. The results show the high potential of the HMM approach in predicting stock prices.

On the other hand, (Md Rafiul Hassan, 2009) proposes a combined model of HMM and fuzzy models. The HMM identifies hidden data patterns, and fuzzy logic is used to generate a forecast value. The entire data space is partitioned based on the log-likelihood of each data string, and the partitions are used to generate the fuzzy rules. The proposed system clearly outperforms ANN and ARIMA. As a further improvement, (Cao, Li, Coleman, Belatreche, & McGinnity, 2015) propose an addition to the methodology introduced by (Nguyen, 2016). In addition to the historical process data, they use the likelihoods of the model for the most recent data set.

Further, they use the developed models to predict stock closing prices of Apple, Google, and Facebook using single observation data and multiple observation data. It is concluded that multiple observation data give better stock price predictions. Also, (Gupta & Dhingra, 2012) present the Maximum a Posteriori HMM approach for forecasting stock values for the coming day given historical data. They utilize the fractional change in stock value and the high and low values of the stock in a given day to train the continuous HMM. The trained HMM is then used to make a Maximum a Posteriori decision over all the possible stock values for the next day. This approach also shows the excellent potential of HMM in stock price prediction.

Real-time supervision of batch operations is very useful for quality control in bioprocesses. A novel and efficient modelling and supervision technique based on multiway partial least squares (MPLS) is presented by (Ündey, Tatara, & Çinar, 2004). The technique can predict the quality of the batch at the end of the growth. Process monitoring, quality estimation, and fault diagnosis activities are automated and supervised by integrating them into a real-time knowledge-based system (RTKBS). The authors validate the performance of the methodology using a fed-batch penicillin production benchmark process simulator.

A data-driven fault prediction and abnormality degree measurement method based on probability density estimation is proposed by (Wang, Cai, Fu, Wu, & Wei, 2018). The method can be used to monitor the state of a complex system quantitatively. They first define an index to quantify the degree of abnormality. Next, a single slack factor multiple kernel SVM probability density estimation model is employed to improve the computational efficiency. In addition to that, this improves the data mapping performance. The resulting model is capable of providing a higher estimation precision and speed. The degree of abnormality is found to be accurately measurable by the abnormality index.

Before a fault actually occurs, there is a period during which the monitored sensor signals show abnormal values or unusual trends. Nevertheless, extracting those features for fault prediction is challenging. A novel study by (Baek & Kim, 2018) addresses this issue. They define the terms symptom pattern and symptom period, and then present a symptom pattern extraction method that collects all evidence of potential fault occurrence from multiple sensor signals. The study is based on the postulate that, given the time markers of fault occurrences, a symptom period precedes each occurrence of a fault. The procedure is tested on the early detection of abnormal cylinder temperature in a marine diesel engine and of knocking in an automotive gasoline engine.

Process fault prediction and prognosis is currently a highly important research area, as it can reduce most of the risks, including financial, health, and reputational risks. Numerous studies have been carried out over the years to solve this problem in applications such as software, electronic circuits, cement manufacturing, distillation columns, computer disk drives, mechanical bearings, chemical and biological processes, metal smelting, semiconductors, stock market prediction, and foreign exchange. Among the techniques used, ANN, GP, Bayesian Methods, Naïve Bayes sub-models, Naïve Bayes Classifier, HMM, GMM, PCA, KPCA, Decision Tree, k-NN, CAE, CVTA, LLNF, Particle Predictor, GLAR, Fuzzy Models, and MPLS can be found in most of the applications. It is clear that combined HMM-BN fault prediction and prognosis has not been explored, and that it has excellent potential to outperform the available techniques. Therefore, the present work proposes a novel hybrid approach to fault prediction and prognosis. The approach comprises two robust techniques: the Hidden Markov Model and the Bayesian Network. The remainder of the paper is organized as follows. Section 2 provides the fundamentals of the two techniques and how they are integrated. Section 3 details the methodology and its testing. Section 4 discusses the results, while Section 5 presents the conclusions.

**2. Preliminaries**

*2.1* *Hidden Markov Model (HMM)*

A Markov chain is a random process of discrete-valued variables that involves several states. These states are linked by possible transitions, each with an associated probability, and each state has an associated observation. The state transition depends only on the current state and not on former states. The actual sequence of states is not observable, hence the name Hidden Markov Model. The compact representation of an HMM with a discrete output probability distribution is $\lambda = \{A, B, \pi\}$, where $\lambda$ is the model. Further, $A = \{a_{ij}\}$, $B = \{b_j(k)\}$, and $\pi = \{\pi_i\}$ stand for the transition probability distribution, the observation probability distribution, and the initial state distribution, respectively.

For a given state $S_i$, these parameters are defined as

$$a_{ij} = P(q_{t+1} = S_j \mid q_t = S_i), \quad 1 \le i, j \le N$$

$$b_j(k) = P(O_k \mid q_t = S_j), \quad 1 \le j \le N, \; 1 \le k \le M$$

$$\pi_i = P(q_1 = S_i), \quad 1 \le i \le N$$

Here, $q_t$, $N$, $O_k$, and $M$ stand for the state at time $t$, the number of states, the $k^{th}$ observation, and the number of distinct observations, respectively. The probability of the observation sequence $O$ being generated by the model $\lambda$ is calculated using equation [1], based on $b_j(k)$.

$$P(O \mid \lambda) = \sum_{\text{all } S} \pi_{S_0} \prod_{t=0}^{T-1} a_{S_t S_{t+1}} \, b_{S_{t+1}}(O_{t+1}) \qquad [1]$$
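To make equation [1] concrete, the sketch below evaluates it literally by summing over every state path, and then computes the same quantity with the forward recursion, which avoids the exponential number of paths. The two-state, two-symbol parameters are hypothetical values chosen for illustration, and the convention assumed is that of equation [1]: the initial state $S_0$ emits nothing, and the observation at step $t+1$ is emitted by $S_{t+1}$.

```python
import math
from itertools import product


def likelihood_brute_force(pi, A, B, observations):
    """Sum over all state paths S_0..S_T, following equation [1] term by term."""
    n_states = len(pi)
    T = len(observations)
    total = 0.0
    for path in product(range(n_states), repeat=T + 1):
        prob = pi[path[0]]
        for t in range(T):
            # Transition S_t -> S_{t+1}, then emission of the observation by S_{t+1}.
            prob *= A[path[t]][path[t + 1]] * B[path[t + 1]][observations[t]]
        total += prob
    return total


def likelihood_forward(pi, A, B, observations):
    """Same quantity via the forward recursion, in O(N^2 T) operations."""
    n_states = len(pi)
    alpha = list(pi)  # state probabilities at t = 0, before any emission
    for obs in observations:
        alpha = [
            sum(alpha[i] * A[i][j] for i in range(n_states)) * B[j][obs]
            for j in range(n_states)
        ]
    return sum(alpha)


# Hypothetical model: 2 hidden states, 2 observation symbols.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
observations = [0, 1, 0]

p_brute = likelihood_brute_force(pi, A, B, observations)
p_forward = likelihood_forward(pi, A, B, observations)
log_likelihood = math.log(p_forward)
```

The logarithm of this probability is the log-likelihood (LL) of a data string. For long sequences, implementations usually rescale `alpha` or work in log space to avoid numerical underflow.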

Several key algorithms are used with HMMs, namely the k-means algorithm, the Expectation Maximization (EM) algorithm, and the Viterbi algorithm.

k-means is a hard-assignment (essentially binary) algorithm: each data point belongs to exactly one cluster. It finds the cluster centers and converges to a local minimum. The number of cluster centers, $k$, must be known and supplied as an input. As initialization, the positions of the $k$ cluster centers are guessed at random. In the repeating step, each data point is assigned to its nearest cluster center, and each center is then updated using its assigned data points. If a cluster becomes empty, its center is restarted at a random point. The steps repeat until the assignments no longer change.
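The k-means loop described above can be sketched in a few lines of plain Python. This is an illustrative version only: the function name, the `seed` and `max_iter` parameters, and the six example points are assumptions, and squared Euclidean distance is assumed as the measure of nearness.

```python
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Plain k-means on a list of equal-length tuples; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # random initial centers
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k),
                key=lambda c: sum((x - m) ** 2 for x, m in zip(p, centers[c])))
            for p in points
        ]
        if new_labels == labels:  # assignments unchanged: converged
            break
        labels = new_labels
        # Update step: move each center to the mean of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
            else:
                centers[c] = rng.choice(points)  # empty cluster: restart at random
    return centers, labels

# Hypothetical data: two well-separated groups of three points each.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centers, labels = kmeans(points, 2)
```

Because the result depends on the random initial centers, it is common to run the procedure several times and keep the solution with the smallest total distance.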

The EM algorithm, on the other hand, uses soft assignments based on probability distributions. EM is a probabilistic generalization of k-means that also finds the cluster centers. It adjusts not only the locations of the clusters but also their shapes, through the covariance matrices. The method is probabilistically sound, and it can be proven to converge in log-likelihood space. Like k-means, it converges to a local optimum, and the number of cluster centers, $k$, must be known in advance.

The mixture density is

$$P(x) = \sum_{i=1}^{k} P(C = i)\, P(x \mid C = i)$$

where $P(C = i) = \pi_i$ is the prior probability of cluster center $i$, and $P(x \mid C = i)$ is the Gaussian density of the $i$-th component, with parameters $\mu_i$ and $\Sigma_i$ for $i = 1, 2, 3, \dots$. The algorithm can be divided into two parts, namely the Expectation Step (E-Step) and the Maximization Step (M-Step).

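In the E-Step, it is assumed that $\pi_i$, $\mu_i$, and $\Sigma_i$ are known. If $e_{ij}$ is the probability that the $j$-th data point corresponds to cluster center $i$, then

$$e_{ij} = \pi_i\, (2\pi)^{-M/2}\, |\Sigma_i|^{-1/2} \exp\left(-\frac{1}{2}(x_j - \mu_i)^T \Sigma_i^{-1} (x_j - \mu_i)\right) \qquad [2]$$

where $\pi_i$ stands for the prior probability of cluster center $i$, $(2\pi)^{-M/2} |\Sigma_i|^{-1/2}$ is the normalizer (with $M$ here denoting the dimension of $x_j$), and the exponential term is the Gaussian expression. The values $e_{ij}$ are then normalized over $i$ so that they sum to one for each data point.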

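In the M-Step, the parameters are re-estimated from the $e_{ij}$ (with $M$ here denoting the number of data points):

$$\pi_i = \frac{\sum_j e_{ij}}{M}, \qquad \mu_i = \frac{\sum_j e_{ij}\, x_j}{\sum_j e_{ij}}, \qquad \Sigma_i = \frac{\sum_j e_{ij}\, (x_j - \mu_i)(x_j - \mu_i)^T}{\sum_j e_{ij}}$$

Here, $e_{ij}$ acts as a soft correspondence between data point $j$ and cluster $i$, serving as the weight of that point in each calculation.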
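The E-Step and M-Step can be illustrated with a short one-dimensional example, in which each covariance matrix reduces to a scalar variance. The sketch below is not the implementation used in this work. The function name, the uniform initial priors, the unit initial variances, the variance floor, and the sample data are all assumptions, and the data are assumed to lie close enough to the initial means that no soft correspondence underflows to zero.

```python
import math

def em_gmm_1d(xs, means, n_iter=50):
    """EM for a one-dimensional Gaussian mixture.

    xs    : list of scalar data points x_j
    means : initial guesses for the cluster means mu_i (this fixes k)
    Returns (priors pi_i, means mu_i, variances).
    """
    k, m = len(means), len(xs)
    means = list(means)
    priors = [1.0 / k] * k
    variances = [1.0] * k
    for _ in range(n_iter):
        # E-step: soft correspondence e[i][j] of point j to cluster i.
        e = [[priors[i] * math.exp(-0.5 * (x - means[i]) ** 2 / variances[i])
              / math.sqrt(2 * math.pi * variances[i]) for x in xs]
             for i in range(k)]
        for j in range(m):
            norm = sum(e[i][j] for i in range(k))
            for i in range(k):
                e[i][j] /= norm  # each point's weights now sum to one
        # M-step: re-estimate the parameters with e[i][j] as weights.
        for i in range(k):
            weight = sum(e[i])
            priors[i] = weight / m
            means[i] = sum(w * x for w, x in zip(e[i], xs)) / weight
            variances[i] = max(
                sum(w * (x - means[i]) ** 2 for w, x in zip(e[i], xs)) / weight,
                1e-6)  # floor keeps a collapsing cluster from dividing by zero
    return priors, means, variances

# Hypothetical data drawn around 1.0 and 5.0.
xs = [0.8, 0.9, 1.0, 1.1, 1.2, 4.8, 4.9, 5.0, 5.1, 5.2]
priors, means, variances = em_gmm_1d(xs, [0.0, 6.0])
```

Unlike k-means, every point contributes to every cluster here, with a weight given by its soft correspondence.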

Defining the value of $k$ is the next problem to be solved. This number is not known in real-world applications; however, it is assumed to be constant. In practice, we guess the value of $k$ and minimize the following expression, a penalized negative log-likelihood that is referred to here as the log-likelihood ($LL$):

$$LL = -\sum_j \log P(x_j \mid \sigma_1 \Sigma_1 k) + COST \times k \qquad [3]$$

Here, $LL$ is the log-likelihood and $COST$ is a constant penalty. Minimizing the first term maximizes the likelihood of the data, $P(x_j \mid \sigma_1 \Sigma_1 k)$. As the equation shows, increasing the number of clusters increases the penalty term. The expression therefore typically reaches its minimum at a specific value of $k$.

In the Bayesian network expressions of equations [4] and [5], $V_1, V_2, \dots, V_k$ represent the variables, $\nu$ is the normal node, which facilitates the expression of the conditional probability, and Parent($V_i$) denotes the parent nodes of $V_i$.
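To make the trade-off in equation [3] concrete, the sketch below evaluates the penalized score of a one-dimensional Gaussian mixture for two candidate values of $k$ and keeps the smaller score. The function name, the data, the hand-set mixture parameters (assumed to have been fitted beforehand, for example by EM), and the penalty `cost=1.0` are all hypothetical.

```python
import math

def penalized_score(xs, priors, means, variances, cost):
    """Equation [3] for a 1-D Gaussian mixture: -sum_j log P(x_j) + cost * k."""
    k = len(priors)
    neg_log_lik = 0.0
    for x in xs:
        density = sum(
            p * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2 * math.pi * v)
            for p, m, v in zip(priors, means, variances)
        )
        neg_log_lik -= math.log(density)
    return neg_log_lik + cost * k

# Hypothetical data and hand-set mixture parameters for k = 1 and k = 2.
xs = [0.8, 0.9, 1.0, 1.1, 1.2, 4.8, 4.9, 5.0, 5.1, 5.2]
candidates = {
    1: ([1.0], [3.0], [4.02]),
    2: ([0.5, 0.5], [1.0, 5.0], [0.02, 0.02]),
}
scores = {k: penalized_score(xs, *params, cost=1.0)
          for k, params in candidates.items()}
best_k = min(scores, key=scores.get)
```

A larger `cost` favours fewer clusters, so the selected $k$ depends on how strongly additional clusters are penalized.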

*2.2* *Bayesian Networks (BNs)*

BNs belong to the family of probabilistic graphical models (GMs) (Ruggeri, Faltin, & Kenett, 2007). These graphical structures are used to represent knowledge about an uncertain domain. Each node represents a random variable, and the arrows between nodes represent probabilistic causal relationships. Known statistical and computational methods are used to estimate the conditional dependencies in the graph. Hence, BNs combine principles from graph theory, probability theory, computer science, and statistics. Further, as noted by Neapolitan (2010), a BN model can be used to study the structures of gene regulatory networks, since it can integrate information from both prior knowledge and experimental data.

As mentioned in (Wu, Zhou, Xu, & Wu, 2017), the basic principles of BNs are conditional independence and the joint probability distribution:

$$P(V_1, V_2, \dots, V_k \mid \nu) = \prod_{i=1}^{k} P(V_i \mid \nu) \qquad [4]$$

$$P(V_1, V_2, \dots, V_k \mid \nu) = \prod_{i=1}^{k} P(V_i \mid \mathrm{Parent}(V_i)) \qquad [5]$$
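The factorization in equation [5] can be illustrated with a small hand-built network. The sketch below is a hypothetical example: the three nodes, their states, and every probability in the conditional probability tables are invented for illustration and do not come from the Tennessee Eastman case study.

```python
# Hypothetical three-node network: Fault -> Temperature, Fault -> Pressure.
parents = {"Fault": (), "Temperature": ("Fault",), "Pressure": ("Fault",)}

# cpt[node][parent_values][value] = P(node = value | parents = parent_values)
cpt = {
    "Fault": {(): {True: 0.1, False: 0.9}},
    "Temperature": {(True,): {"high": 0.8, "normal": 0.2},
                    (False,): {"high": 0.1, "normal": 0.9}},
    "Pressure": {(True,): {"high": 0.7, "normal": 0.3},
                 (False,): {"high": 0.2, "normal": 0.8}},
}

def joint_probability(assignment):
    """Equation [5]: multiply P(V_i | Parent(V_i)) over all nodes."""
    p = 1.0
    for node, node_parents in parents.items():
        parent_values = tuple(assignment[q] for q in node_parents)
        p *= cpt[node][parent_values][assignment[node]]
    return p

def posterior_fault(temperature, pressure):
    """P(Fault = True | evidence), obtained by enumerating both values of Fault."""
    joint = {
        f: joint_probability(
            {"Fault": f, "Temperature": temperature, "Pressure": pressure})
        for f in (True, False)
    }
    return joint[True] / (joint[True] + joint[False])

p_joint = joint_probability(
    {"Fault": True, "Temperature": "high", "Pressure": "high"})
p_fault = posterior_fault("high", "high")
```

The second function shows how the same factorization supports diagnosis: observing the child nodes updates the belief in the parent node through Bayes' rule.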

If the requirement is to classify a given sequence into one of $k$ classes, $k$ HMMs must be trained, one per class, and the log-likelihood (i.e. *loglik*) that each model assigns to the test sequence is then computed. If the $i$-th model gives the highest log-likelihood, the sequence is declared to belong to class $i$. BNs can be considered a powerful tool for fault diagnosis; they have shown remarkable performance in fault detection and diagnosis (FDD) in the work presented by Amin, Imtiaz, and Khan (2018). Therefore, a fault predicted by the HMM can be prognosed by the BN.
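The classification rule described above (one HMM per class, choose the class whose model gives the highest log-likelihood) can be sketched as follows. The log-likelihood is computed with a scaled forward recursion that uses the same convention as equation [1]: the initial state emits nothing and each transition is followed by one emission. The function names and both example models are assumptions for illustration; in practice the models would be trained on data from each class.

```python
import math

def log_likelihood(model, obs):
    """Scaled forward recursion; returns log P(obs | model).

    model is a (pi, A, B) triple of initial, transition, and emission
    probabilities stored as nested lists.
    """
    pi, A, B = model
    n = len(pi)
    alpha = list(pi)
    loglik = 0.0
    for symbol in obs:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][symbol]
                 for j in range(n)]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # the model cannot produce this sequence
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
        loglik += math.log(scale)
    return loglik

def classify(models, obs):
    """Return the label of the model with the highest log-likelihood."""
    return max(models, key=lambda label: log_likelihood(models[label], obs))

# Hypothetical models, one per class (values for illustration only).
models = {
    "normal": ([0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]], [[0.9, 0.1], [0.8, 0.2]]),
    "fault": ([0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]], [[0.2, 0.8], [0.1, 0.9]]),
}
label = classify(models, [1, 1, 0, 1, 1])
```

Working with log-likelihoods and rescaling at every step keeps the computation numerically stable for long observation sequences.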

A fault diagnosis and prediction system was proposed by C. Li, Wei, Wang, and Zhou (2017). It is made up of three parts: data preprocessing, degradation state detection, and fault diagnosis. A wavelet transform correlation filter is employed to extract features for complex system fault diagnosis and prediction. To enhance the performance of the HMM, they further propose an HMM-based semi-nonparametric method that uses the probabilistic transition frequency profile matrix and the average probabilistic emission matrix. The methodology was validated as capable of identifying the system operating state, which in turn helps to predict the system's behaviour in the near future.

BN-based rare event prediction has been studied by Cheon, Kim, Lee, and Lee (2009) and by Cózar, Puerta, and Gámez (2017). A BN has both causal and probabilistic semantics, and it is therefore identified as an ideal way to combine background knowledge with actual process historical data. For situations in which few details are available about all possible values, Cózar et al. (2017) proposed a general-purpose decision support system tool. It consists of two BN models: one represents the failure-free behaviour of the system, and the other represents abnormal behaviours. This system is a robust tool that can be used for health management in industrial environments. The core of the system is a probabilistic expert system based on dynamic BNs. Fault detection is based on both conflict analysis and the likelihood-ratio test.

**4. References**

Amin, M. T., Imtiaz, S., & Khan, F. (2018). Process system fault detection and diagnosis using a hybrid technique. *Chemical Engineering Science*, *189*, 191–211. https://doi.org/10.1016/j.ces.2018.05.045

Baek, S., & Kim, D. Y. (2018). Fault Prediction via Symptom Pattern Extraction Using the Discretized State Vectors of Multi-Sensor Signals. *IEEE Transactions on Industrial Informatics*, *3203*(c). https://doi.org/10.1109/TII.2018.2828856

Bekat, T., Erdogan, M., Inal, F., & Genc, A. (2012). Prediction of the bottom ash formed in a coal-fired power plant using artificial neural networks. *Energy*, *45*(1), 882–887. https://doi.org/10.1016/j.energy.2012.06.075

Cao, Y., Li, Y., Coleman, S., Belatreche, A., & McGinnity, T. M. (2015). Adaptive hidden Markov model with anomaly states for price manipulation detection. *IEEE Transactions on Neural Networks and Learning Systems*, *26*(2), 318–330. https://doi.org/10.1109/TNNLS.2014.2315042

Chelly, S. M., & Denis, C. (2001a). Applying Unsupervised Learning. *Medicine and Science in Sports and Exercise*, *33*(2), 326–333. https://doi.org/10.1111/j.2041-210X.2010.00056.x

Chelly, S. M., & Denis, C. (2001b). Applying Unsupervised Learning. *Medicine and Science in Sports and Exercise*, *33*(2), 326–333. https://doi.org/10.1111/j.2041-210X.2010.00056.x

Chen, M. Z., Zhou, D. H., & Liu, G. P. (2005). A new particle predictor for fault prediction of nonlinear time-varying systems. *Developments in Chemical Engineering and Mineral Processing*, *13*(3–4).

Cheng, F., Qu, L., & Qiao, W. (2017). Fault Prognosis and Remaining Useful Life Prediction of Wind Turbine Gearboxes Using Current Signal Analysis. *IEEE Transactions on Sustainable Energy*, *PP*(99), 1. https://doi.org/10.1109/TSTE.2017.2719626

Cheon, S.-P., Kim, S., Lee, S.-Y., & Lee, C.-B. (2009). Bayesian networks based rare event prediction with sensor data. *Knowledge-Based Systems*. https://doi.org/10.1016/j.knosys.2009.02.004

Cózar, J., Puerta, J. M., & Gámez, J. A. (2017). An application of dynamic Bayesian networks to condition monitoring and fault prediction in a sensor system: A case study. *International Journal of Computational Intelligence Systems*, *10*(1), 176–195.

Deng, Z. (2018). *Proceedings of 2017 Chinese Intelligent Systems Conference* (Vol. 460). https://doi.org/10.1007/978-981-10-6499-9

Di, Y., Jin, C., Bagheri, B., Shi, Z., Ardakani, H. D., Tang, Z., & Lee, J. (2018). Fault prediction of power electronics modules and systems under complex working conditions. *Computers in Industry*, *97*, 1–9. https://doi.org/10.1016/j.compind.2018.01.011

Ding, B., & Fang, H. (2017). Fault prediction for a nonlinear stochastic system with incipient faults based on particle filter and nonlinear regression. *ISA Transactions*, *68*, 327–334. https://doi.org/10.1016/j.isatra.2017.03.018

Fulufhelo V. Nelwamondo, T. M. and U. M. (2006). Early Classifications of Bearing Faults Using Hidden. *Information and Control*, *2*(6), 1281–1299.

Gao, Q., & Liu, W. (2017). Research and Application of the Distillation Column Process Fault Prediction based on the Improved KPCA, 247–251.

Gupta, A., & Dhingra, B. (2012). Stock market prediction using Hidden Markov Models. *2012 Students Conference on Engineering and Systems, SCES 2012*. https://doi.org/10.1109/SCES.2012.6199099

Hamerly, G., & Elkan, C. (2005a). Genetic Programming Approach for Fault Modeling of Electronic Hardware, *2*(1), 1563–1569. https://doi.org/10.1109/CEC.2005.1554875

Hamerly, G., & Elkan, C. (2005b). Genetic Programming Approach for Fault Modeling of Electronic Hardware, *2*(1), 1563–1569. https://doi.org/10.1109/CEC.2005.1554875

Hassan, M. R. (2009). A combination of hidden Markov model and fuzzy model for stock market forecasting. *Neurocomputing*, *72*(16–18), 3439–3446. https://doi.org/10.1016/j.neucom.2008.09.029

Hassan, M. R., & Nath, B. (2005). Stock market forecasting using a hidden Markov model: a new approach. *5th International Conference on Intelligent Systems Design and Applications (ISDA’05)*, 192–196. https://doi.org/10.1109/ISDA.2005.85

Hu, Y., Deng, X., Cao, Y., & Tian, X. (2017). Dynamic process fault prediction using canonical variable trend analysis. *2017 Chinese Automation Congress (CAC)*, 2015–2020. https://doi.org/10.1109/CAC.2017.8243102

Hughes, G. F., Murray, J. F., Kreutz-Delgado, K., & Elkan, C. (2002). Improved disk-drive failure warnings. *IEEE Transactions on Reliability*, *51*(3), 350–357. https://doi.org/10.1109/TR.2002.802886

Illias, H. A., Chai, X. R., Bakar, A. H. A., & Mokhlis, H. (2015). Transformer early fault prediction using combined artificial neural network and various particle swarm optimization techniques. *PLoS ONE*, *10*(6), 1–16. https://doi.org/10.1371/journal.pone.0129363

Jahani, K., & Razavi, J. (2016). Application of Computer Aided Engineering Tools in Performance Prediction and Fault Detection of Mechanical Equipment of Mining Process Line, *10*(7), 1295–1299.

Jiang, B.-L., Wey, C.-L., & Fan, L.-J. (1988). FAULT PREDICTION FOR ANALOG CIRCUITS *, *7*(1).

Kim, J., Han, Y., & Lee, J. (2016). Euclidean Distance Based Feature Selection for Fault Detection Prediction Model in Semiconductor Manufacturing Process, *133*, 85–89.

Kim, M. J., Jiang, R., Makis, V., & Lee, C. G. (2011). Optimal Bayesian fault prediction scheme for a partially observable system subject to random failure. *European Journal of Operational Research*, *214*(2), 331–339. https://doi.org/10.1016/j.ejor.2011.04.023

Li, C., Wei, F., Wang, C., & Zhou, S. (2017). Fault diagnosis and prediction of a complex system based on Hidden Markov model. *Journal of Intelligent & Fuzzy Systems*, *33*(5), 2937–2944. https://doi.org/10.3233/JIFS-169344

Li, L., Wang, Z., Liu, Z., & Bu, S. (2014). Trend prognosis of aero-engine abrupt failure based on affinity propagation clustering. In W. Jinsong (Ed.), *Proceedings of the First Symposium on Aviation Maintenance and Management* (Vol. II, pp. 13–22). Berlin, Heidelberg: Springer.

Ma, J., & Xu, J. (2015). Fault Prediction Algorithm for Multiple Mode Process Based on Reconstruction Technique, *2015*. https://doi.org/10.1155/2015/348729

Murphy, K. (2005). Hidden Markov Model (HMM) Toolbox for Matlab. Retrieved July 19, 2018, from https://www.cs.ubc.ca/~murphyk/Software/HMM/hmm.html

Murray, J. (2003). Hard drive failure prediction using non-parametric statistical methods. *Proceedings of ICANN/ …*, (1). Retrieved from http://hebb.mit.edu/people/jfmurray/publications/Murray2003.pdf

Neapolitan, R. E. (2010). Learning Bayesian Networks. *International Journal of Data Mining and Bioinformatics*, *4*, 505–519. https://doi.org/10.1016/j.jbi.2010.03.005

Nguyen, N. (2016). Stock Price Prediction using Hidden Markov Model, 1–20. Retrieved from https://editorialexpress.com/cgi-bin/conference/download.cgi?db_name=SILC2016&paper_id=38

Pathmika, M., & Khan, F. (2018). *Dynamic process fault detection and diagnosis based on a combined approach of hidden Markov and Bayesian network model*. St. John’s.

Predix. (2016a). Machine Learning. Retrieved from https://www.predix.io/analytics

Predix. (2016b). Machine Learning.

Ramana, E. V., Sapthagiri, S., & Srinivas, P. (2017). Data Mining Approach for Quality Prediction and Fault Diagnosis of Injection Molding Process. *Indian Journal of Science and Technology*, *10*(17), 1–7. https://doi.org/10.17485/ijst/2017/v10i17/112580

Ruggeri, F., Faltin, F., & Kenett, R. (2007). Bayesian Networks. *Encyclopedia of Statistics in Quality & Reliability*, *1*(1), 4. https://doi.org/10.1002/wics.48

Sadeghian, M., & Fatehi, A. (2011). Identification, prediction and detection of the process fault in a cement rotary kiln by locally linear neuro-fuzzy technique. *Journal of Process Control*, *21*(2), 302–308. https://doi.org/10.1016/j.jprocont.2010.10.009

The Mathworks. (2016). *Introducing Machine Learning What is Machine Learning*. *Perspectives on Ontology Learning*.

Ündey, C., Tatara, E., & Çinar, A. (2004). Intelligent real-time performance monitoring and quality prediction for batch/fed-batch cultivations. *Journal of Biotechnology*, *108*(1), 61–77. https://doi.org/10.1016/j.jbiotec.2003.10.004

Wang, H. Q., Cai, Y. N., Fu, G. Y., Wu, M., & Wei, Z. H. (2018). Data-driven fault prediction and anomaly measurement for complex systems using support vector probability density estimation. *Engineering Applications of Artificial Intelligence*, *67*(December 2016), 1–13. https://doi.org/10.1016/j.engappai.2017.09.008

Wu, J., Zhou, R., Xu, S., & Wu, Z. (2017). Probabilistic analysis of natural gas pipeline network accident based on the Bayesian network. *Journal of Loss Prevention in the Process Industries*, *46*, 126–136. https://doi.org/10.1016/j.jlp.2017.01.025

Yu, L., Wang, S., & Lai, K. K. (2005). A novel nonlinear ensemble forecasting model incorporating GLAR and ANN for foreign exchange rates. *Computers and Operations Research*, *32*(10), 2523–2541. https://doi.org/10.1016/j.cor.2004.06.024

Zhang, S., Wang, Y., Liu, M., & Bao, Z. (2017). Data-Based Line Trip Fault Prediction in Power Systems Using LSTM Networks and SVM. *IEEE Access*, *6*, 7675–7686. https://doi.org/10.1109/ACCESS.2017.2785763

Zhao, Y., Li, D., Dong, A., Kang, D., Lv, Q., & Shang, L. (2017). Fault prediction and diagnosis of wind turbine generators using SCADA data. *Energies*, *10*(8), 1–17. https://doi.org/10.3390/en10081210