Ensemble RBF-EKF model: using AdaBoost as a bridge
(Improving the ensemble RBF-EKF prediction model: using the AdaBoost technique as a bridge)
 Introduction
The focus of this research is to use the extended Kalman filter (EKF) to train radial basis function (RBF) networks, and to use AdaBoost as a method for creating a committee of classifiers and outputting the final classifier. It is our goal to accomplish the following objectives:
- RBF issue – Application of the Kalman filter to train radial basis function networks with an alternative form of generator function [8].
- Kalman filter issue:
  - To improve the convergence of the Kalman filter by more intelligently initializing the training process [8] [11].
  - To determine the tuning parameters (P, Q and R) of the KF [8] [11] more effectively.
- AdaBoost – To use AdaBoost as an ensemble algorithm to bridge trained RBF–KF models and obtain the RBFKF–AdaBoost model.
- Simulation and analysis – To test and analyse the performance of the model (RBF–KF–AdaBoost) using the IRIS, Cancer and Geophysical datasets.
 Radial Basis Function
Radial Basis Function – The RBF network can be viewed as an alternative to the MLP neural network for nonlinear modelling [1]. It uses radial basis functions as its activation functions and can be trained in many ways, unlike MLPs, which are typically trained with back-propagation algorithms [2]. It has a similar structure and configuration to the multilayer perceptron network (MLPN). The RBF network is a feedforward neural network with a three-layer structure [3]; unlike the MLPN, it has only one hidden layer, which uses radial basis functions as activation functions [3]. The three layers are the input layer, the hidden layer and the output layer. It can therefore be viewed as a type of artificial neural network and can be used for supervised learning problems such as regression and classification. RBF networks are widely used in science and engineering tasks such as function approximation, curve fitting, time series prediction and classification. The neurons in the RBF hidden layer contain Gaussian transfer functions, and the activation of the hidden units is a nonlinear function of the distance between the input vector and the weight vector.
A review of the literature shows that the Kalman filter has been used extensively in training neural network models, with both challenges [4] [5] [6] [7] and promising results. Despite the efforts made in training RBF networks with the Kalman filter, there are a number of issues that are yet to be resolved [8] [9] [10]. These involve the need to focus on the application of the Kalman filter to train RBF networks with alternative forms of generator function (instead of the current randomized vectors and weight matrices initialized to zero); the need to improve the convergence of the Kalman filter by intelligently initializing the training process; and the effective determination of the Kalman filter tuning parameters.
Firstly, this research is motivated by the need to address some of the current issues related to RBF-EKF models; secondly, by the aim of using AdaBoost as a technique for combining the RBF predictions obtained when the network is trained with the EKF. The overall objective of this research is therefore to improve the performance of RBF-KF models and to propose a new algorithm: RBFKFBoost. The algorithm will be based on the existing RBF-KF model, but AdaBoost will serve as a bridge that combines multiple predictions to obtain the final classifier.
Review of RBF Network
The RBF network has three layers: the input layer, the hidden layer, and the output layer. The neurons at the hidden layer are activated by a radial basis function. The hidden layer consists of an array of computing units called the hidden nodes, and each hidden node contains a centre vector $c$ that is a parameter vector of the same dimension as the input $x$. The activation of the hidden units is given by a nonlinear function of the distance between the input vector and a weight vector. RBF networks use a two-stage training procedure, which has been shown to be faster than the methods used in training multilayer perceptrons [1]. During the first stage of training, the parameters of the basis functions are set so that they model the unconditional data density. The second stage determines the weights in the output layer, which can be posed as a quadratic optimization problem solvable with linear algebra methods.
Theory and architecture of RBF
In this section, we discuss the basic theory and architecture of RBF networks and the relevant equations to be implemented. A radial basis function network is an artificial neural network that uses radial basis functions as its activation functions. The output of the RBF network is a linear combination of the radial basis functions of the inputs and the neuron parameters. Consider, for instance, an RBF network in which an $m$-dimensional input $x$ is passed directly to a hidden layer consisting of $c$ neurons. Each of the $c$ neurons in the hidden layer applies an activation function that is a function of the Euclidean distance between the input and an $m$-dimensional prototype vector. Each hidden neuron contains its own prototype vector as a parameter, and the output of each hidden neuron is weighted and passed to the output layer. The outputs of the network therefore consist of sums of the weighted hidden-layer activations.
The input of an RBF network can be modelled as a vector of real numbers $x \in \mathbb{R}^n$. Similarly, the output of the network can be modelled as a scalar function of the input vector, $y : \mathbb{R}^n \rightarrow \mathbb{R}$. The RBF network mapping can therefore be expressed in the following form:

$$y(x) = \sum_{j=1}^{M} w_j\, \phi\!\left(\lVert x - c_j \rVert;\ \sigma\right) \qquad (1)$$
where,
- $M$ is the number of neurons in the hidden layer,
- $w_j$ is the weight of neuron $j$ in the linear output neuron,
- $c_j$ is the centre vector for neuron $j$.
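The mapping of Eq. 1 can be made concrete with a minimal sketch in Python with NumPy. The function name and the Gaussian choice of $\phi$ here are illustrative assumptions, not part of the cited models:

```python
import numpy as np

def rbf_forward(x, centres, weights, sigma):
    """Evaluate y(x) = sum_j w_j * phi(||x - c_j||) as in Eq. 1,
    with a Gaussian phi of common width sigma."""
    # Squared Euclidean distances between the input and every centre c_j
    d2 = np.sum((centres - x) ** 2, axis=1)
    # Gaussian activations of the M hidden neurons
    phi = np.exp(-d2 / (2.0 * sigma ** 2))
    # Linear combination at the output neuron
    return weights @ phi
```

An input that coincides with a centre activates that neuron fully (its activation is 1), which matches the locality property discussed below.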
In a typical RBF network, all inputs are connected to each hidden neuron. The norm used is typically the Euclidean distance, and the radial basis function is normally taken to be a Gaussian, such that:

$$\phi\!\left(\lVert x - c_j \rVert\right) = \exp\!\left[-\beta\, \lVert x - c_j \rVert^2\right] \qquad (2)$$

$$\phi(r) = e^{-r^2/\delta^2} \qquad (3)$$
The Gaussian basis function is local to the centre vector such that:
$$\lim_{\lVert x \rVert \to \infty} \phi\!\left(\lVert x - c_j \rVert\right) = 0 \qquad (4)$$
Therefore, changing the parameters of one neuron has only a small effect for input values that are far away from the centre of that neuron. The parameters of the RBF network, i.e. $w_j$, $c_j$ and $\beta_j$, are determined in such a way that they optimize the fit between $\phi$ and the data. Figure 1 shows a schematic architecture of an RBF network.
Figure 1 Typical architecture of radial basis function network
Exact Interpolation vs Function Approximation
Exact interpolation – Given a set of $N$ different $d$-dimensional input vectors $x^n$ and a corresponding set of one-dimensional targets $t^n$, it is possible to find a continuous function $h(x)$ such that:

$$h(x^n) = t^n, \qquad n = 1, 2, \ldots, N \qquad (5)$$

This can be achieved by adopting a radial basis function approach in which a set of $N$ basis functions is chosen, centred at the $N$ data points, using Eq. 1. The problem then reduces to solving a set of $N$ linear equations for the unknown weights:

$$\begin{pmatrix} \phi_{11} & \phi_{12} & \cdots & \phi_{1N} \\ \phi_{21} & \phi_{22} & \cdots & \phi_{2N} \\ \vdots & \vdots & & \vdots \\ \phi_{N1} & \phi_{N2} & \cdots & \phi_{NN} \end{pmatrix} \begin{pmatrix} w_1 \\ w_2 \\ \vdots \\ w_N \end{pmatrix} = \begin{pmatrix} t^1 \\ t^2 \\ \vdots \\ t^N \end{pmatrix} \qquad (6)$$
Such that
$$\phi_{ij} = \phi\!\left(\lVert x^i - x^j \rVert\right), \qquad i, j = 1, 2, \ldots, N \qquad (7)$$
Therefore, Eq. 6 can be represented in the more compact form:

$$\Phi w = t \qquad (8)$$
It has been shown that the interpolation matrix in the above equation is non-singular: there exists a large class of functions $\phi$, including the Gaussian, the inverse multiquadric and thin-plate splines, for which the interpolation matrix $\Phi$ is non-singular provided the data points are distinct [12]. The weights can therefore be obtained using the inverse of the matrix $\Phi$ as:

$$w = \Phi^{-1} t \qquad (9)$$
Exact interpolation is achieved by substituting the weights obtained from Eq. 9 into Eq. 1: the function $y(x)$ then represents a continuously differentiable surface passing through each data point. In a similar way, the generalization to a multidimensional target space is a mapping from the $d$-dimensional input space $x$ to a $k$-dimensional target space, and is given by:

$$h_k(x^n) = t_k^n, \qquad n = 1, 2, \ldots, N \qquad (10)$$
where $t_k^n$ are the components of the output vector $t^n$, and the $h_k(x)$ are obtained by linear superposition of the $N$ basis functions, as in the one-dimensional output case:

$$h_k(x) = \sum_{n} w_{kn}\, \phi\!\left(\lVert x - x^n \rVert\right) \qquad (11)$$

The weight parameters are obtained in the form:

$$w_{kn} = \sum_{j} \left(\Phi^{-1}\right)_{nj} t_k^j \qquad (12)$$

In Eq. 12, the same inverse matrix $\Phi^{-1}$ is used for each of the output functions.
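The exact-interpolation computation of Eqs. 6–9 can be sketched as follows. This is an illustrative implementation assuming Gaussian basis functions of width sigma; the function name is hypothetical:

```python
import numpy as np

def exact_interpolation_weights(X, t, sigma):
    """Solve Phi w = t (Eq. 8) for the exact-interpolation weights (Eq. 9).

    X : (N, d) data points, used both as inputs and as basis centres.
    t : (N,) one-dimensional targets.
    """
    # Pairwise squared distances ||x^i - x^j||^2 (Eq. 7)
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    Phi = np.exp(-d2 / (2.0 * sigma ** 2))   # Gaussian interpolation matrix
    return np.linalg.solve(Phi, t)           # w = Phi^{-1} t
```

Because the Gaussian interpolation matrix is non-singular for distinct points, the resulting surface passes through every training target exactly, which is precisely the overfitting behaviour the next section sets out to avoid.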
RBF and Function Approximation
In practice, there is no need to perform strict interpolation: with noisy data, interpolating every data point will lead to overfitting and poor generalization. This is simply because interpolation implies that the number of basis functions required equals the number of patterns in the training dataset; in addition, it would be costly to map a large dataset. To avoid this, an RBF model for function approximation and generalization is obtained by modifying the exact interpolation procedure [12]. These modifications give a smoother fit to the data using a reduced number of basis functions that depends on the complexity of the mapping function rather than on the size of the data. The network mapping can therefore be expressed as:
$$y_k(x) = \sum_{j=1}^{M} w_{kj}\, \phi_j(x) + w_{k0} \qquad (13)$$

where the $\phi_j$ are the basis functions and the $w_{kj}$ are the output-layer weights. The bias weights can be absorbed into the summation by including an extra basis function $\phi_0$ whose activation is set to unity, so that Eq. 13 can be expressed as:

$$y_k(x) = \sum_{j=0}^{M} w_{kj}\, \phi_j(x) \qquad (14)$$
In matrix form this can be expressed as:

$$y(x) = W\phi \qquad (15)$$

A Gaussian basis function can be expressed as:

$$\phi_j(x) = \exp\!\left(-\frac{\lVert x - \mu_j \rVert^2}{2\sigma_j^2}\right) \qquad (16)$$
Training RBF Parameters
The design and training of an RBF network as shown in Eq. 1 involve the appropriate selection of the following parameters:
- the type of basis function $\phi$,
- the associated widths $\sigma$,
- the number of functions $M$,
- the centre locations $\mu_j$, and
- the weights $w_j$.
In many cases, Gaussians or other bell-shaped functions with compact support are used. Thin-plate splines have also been used successfully in other function approximation problems. In many scenarios the number of functions and their type are preselected, so training an RBF network involves determining three main sets of parameters, namely the centres, the widths and the weights, so as to minimize a suitable cost function, which in most cases is a non-convex optimization problem. The training of an RBF network can be either supervised, where the prediction is compared with the expectation, or unsupervised, as in two-stage training.
Supervised Training
It is possible to use the gradient descent technique to train the network by minimizing the cost function while updating the training parameters; however, other unsupervised training algorithms can also be used.
The sum of squares of the cost function to minimize can be expressed as
E=∑nEn  (17) 
Such that:
En = 12∑ktkn ykxn2  (18) 
Where:
 tknis the target value of output unitkand
xnare the input vectors
If Gaussian basis functions are used, then minimizing the cost function yields the following parameter updates, obtained from Eq. 16:

$$\Delta w_{kj} = \eta_1 \left(t_k^n - y_k(x^n)\right) \phi_j(x^n) \qquad (19)$$

$$\Delta \mu_j = \eta_2\, \phi_j(x^n)\, \frac{x^n - \mu_j}{\sigma_j^2} \sum_{k} \left(t_k^n - y_k(x^n)\right) w_{kj} \qquad (20)$$

$$\Delta \sigma_j = \eta_3\, \phi_j(x^n)\, \frac{\lVert x^n - \mu_j \rVert^2}{\sigma_j^3} \sum_{k} \left(t_k^n - y_k(x^n)\right) w_{kj} \qquad (21)$$

where $\eta_1$, $\eta_2$ and $\eta_3$ are the learning rates for the weights, the centre locations and the widths respectively.
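The pattern-by-pattern updates of Eqs. 19–21 can be sketched as follows. This is an illustrative single-output-pattern step with assumed array shapes, not the thesis's final training routine:

```python
import numpy as np

def gradient_updates(x, t, W, mu, sigma, eta1, eta2, eta3):
    """One pattern update of weights, centres and widths (Eq. 19-21).

    x : (d,) input, t : (n,) target, W : (n, M) weights,
    mu : (M, d) centres, sigma : (M,) widths.
    """
    d2 = np.sum((x - mu) ** 2, axis=1)           # ||x - mu_j||^2 per hidden unit
    phi = np.exp(-d2 / (2.0 * sigma ** 2))       # Gaussian activations (Eq. 16)
    y = W @ phi                                  # network outputs y_k
    err = t - y                                  # (t_k - y_k)
    dW = eta1 * np.outer(err, phi)               # Eq. 19
    back = W.T @ err                             # sum_k (t_k - y_k) w_kj, per j
    dmu = eta2 * (phi * back / sigma ** 2)[:, None] * (x - mu)   # Eq. 20
    dsigma = eta3 * phi * back * d2 / sigma ** 3                 # Eq. 21
    return W + dW, mu + dmu, sigma + dsigma
```

With sufficiently small learning rates, one such step reduces the pattern error $E^n$ of Eq. 18.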
Two-stage training
It is possible to update the three parameter sets of the RBF network in Eqs. 19, 20 and 21 simultaneously. However, this is mostly suitable for non-stationary environments or online settings. In most cases involving static mappings, a better estimate of the parameters is obtained by decoupling the problem into two stages [1] [12]. It has been shown that this method of training offers an efficient batch-mode solution that improves the quality of the final results compared with that obtained when the parameters are trained simultaneously.
This method involves:
- A first stage that determines the values of $\mu_j$ and $\sigma_j$. In this stage only the input values $\{x^n\}$ are used in determining the centres and widths, so the learning is unsupervised.
- A second, supervised stage that uses the values of $\mu_j$ and $\sigma_j$ from stage 1 to determine the weights to the output units.
Unsupervised training of RBF centres and widths
The locations and widths of localized basis functions can be determined by viewing them as representing the input data density. The following methods can be used to determine the centres and widths:
- Random subset selection
- Clustering algorithms
- Mixture models
- Width determination
Batch training of the output layer weights
The transformation between the inputs and the corresponding outputs of the hidden units is fixed once the basis parameters have been determined. The network is then equivalent to a single-layer network with linear output units, and the minimization of the error in Eq. 17 yields:

$$W^{T} = \Phi^{\dagger} T \qquad (22)$$

where $(T)_{nk} = t_k^n$, $(\Phi)_{nj} = \phi_j(x^n)$, and $\Phi^{\dagger} = \left(\Phi^{T}\Phi\right)^{-1}\Phi^{T}$ stands for the pseudo-inverse of $\Phi$. Here $\Phi$ is the design matrix and $A = \Phi^{T}\Phi$ is the corresponding (co)variance matrix. In practice, to avoid possible problems due to ill-conditioning of the matrix $\Phi$, singular value decomposition is used to obtain the weights rather than direct batch-mode matrix inversion.
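The batch solution of Eq. 22 can be sketched as below; `numpy.linalg.lstsq` uses an SVD internally, matching the ill-conditioning remark above. The function name is illustrative:

```python
import numpy as np

def output_weights(Phi, T):
    """Least-squares output-layer weights W^T = Phi^+ T (Eq. 22).

    Phi : (N, M) design matrix, Phi[n, j] = phi_j(x^n)
    T   : (N, K) target matrix, T[n, k] = t_k^n
    """
    # lstsq solves via SVD, which guards against ill-conditioning of Phi
    WT, *_ = np.linalg.lstsq(Phi, T, rcond=None)
    return WT    # shape (M, K); the network weight matrix is W = WT.T
```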
TwoStage Training vs Supervised Training
Unsupervised learning leads to suboptimal choices of the parameters, whereas supervised learning leads to optimal estimation of the centres and widths. However, the gradient descent used in supervised learning is a nonlinear optimization technique and is computationally expensive. It has also been observed that if the basis functions are well localized, then only a few basis functions generate a significant activation for any given input.
Model Selection
The primary goal of any network modelling is to model the statistical process that is responsible for generating the data, rather than to find an exact fit to the data. The emphasis is therefore on generalization, i.e. the performance of the network on data outside the training dataset. This can only be achieved as a trade-off between bias and variance, and the generalization error is a combination of the two. If a model is too simple, this leads to high bias, i.e. the model on average differs significantly from the desired result. Likewise, if a model is too complex, this leads to low bias but high variance, and the results become sensitive to specific features of the training set. There is therefore a need for a balance between the bias and variance errors, which can be achieved by finding the right number of free parameters.
RBF and Regularization
Regularization is used in machine learning as a penalty added to the original cost function, providing additional information to solve a problem while preventing overfitting. A variety of penalties have been studied and have been linked to weight decay [12]. Some of the regularization methods in use are, among others:
 Projection Matrix
 Crossvalidation
 Ridge Regression
 Local Ridge Regression
Normalized RBFNs
When a normalizing factor is added to the basis function, it gives a normalized radial basis function:

$$\phi_i(x) = \frac{\phi\!\left(\lVert x - \mu_i \rVert\right)}{\sum_{j=1}^{M} \phi\!\left(\lVert x - \mu_j \rVert\right)} \qquad (23)$$

where $M$ is the total number of kernels. This form of equation is obtained in various settings, such as noisy data interpolation and regression. Since the basis function activations are bounded between 0 and 1, they can be interpreted as probability values, especially in classification tasks.
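The normalization of Eq. 23 can be sketched as follows (an illustrative helper assuming Gaussian kernels of common width); the activations sum to one, which underpins the probabilistic interpretation above:

```python
import numpy as np

def normalized_rbf(x, centres, sigma):
    """Normalized basis activations of Eq. 23; they sum to one."""
    d2 = np.sum((centres - x) ** 2, axis=1)
    phi = np.exp(-d2 / (2.0 * sigma ** 2))   # raw Gaussian activations
    return phi / np.sum(phi)                 # divide by the normalizing factor
```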
Classification using radial basis function networks
Empirical studies show that RBF networks have powerful function approximation capabilities, good local structure and efficient training algorithms. They have therefore been used in a variety of medical, scientific and engineering classification tasks, including chaotic time series prediction, speech pattern classification, image processing, medical diagnosis, nonlinear system identification, adaptive equalization in communication systems and nonlinear feature extraction. In an RBF network, data are separated into classes by placing localized kernels around each group, instead of the hyperplanes used in other algorithms such as the MLPNN, SVM and AdaBoost. Further, it has been noted that RBF networks share properties with other modelling methods such as function approximation, noisy interpolation and kernel regression [1]. It is therefore possible to model the conditional densities of each class with an RBF network such that the sum of the basis functions forms a representation of the unconditional probability density of the input data. The class-conditional probability densities can be represented as:
$$p(x \mid C_k) = \sum_{j=1}^{M} p(x \mid j)\, P(j \mid C_k), \qquad k = 1, 2, \ldots, c \qquad (24)$$

where $M$ is the number of density functions and $j$ is the index label. The unconditional density is obtained by summing over all classes in Eq. 24, such that:

$$p(x) = \sum_{k=1}^{c} p(x \mid C_k)\, P(C_k) \qquad (25)$$

$$\phantom{p(x)} = \sum_{j=1}^{M} p(x \mid j)\, P(j) \qquad (26)$$

where

$$P(j) = \sum_{k=1}^{c} P(j \mid C_k)\, P(C_k) \qquad (27)$$
Applying Bayes' theorem to Eq. 24 and Eq. 26 gives the posterior probabilities, which lead to a normalized RBFN:

$$P(C_k \mid x) = \sum_{j=1}^{M} w_{kj}\, \phi_j(x) \qquad (28)$$

The basis functions $\phi_j$ are given by:

$$\phi_j(x) = \frac{p(x \mid j)\, P(j)}{\sum_{i=1}^{M} p(x \mid i)\, P(i)} = P(j \mid x) \qquad (29)$$

The second-layer weights (i.e. the hidden-to-output weights) are given by:

$$w_{kj} = \frac{P(j \mid C_k)\, P(C_k)}{P(j)} = P(C_k \mid j) \qquad (30)$$
 Kalman Filter Algorithm and RBF Optimization
The Kalman filter is a mathematical method that estimates the state of a dynamic system from a series of noisy measurements and other inaccuracies affecting the modelled system. It minimizes the mean squared error and can be used to estimate the past, present and future states of a system based on the known component of measurement noise and the known component of disturbance to the system. It uses Bayesian inference to estimate a joint probability distribution over the variables of interest. The concept of the Kalman filter can be represented in a block diagram, as shown in Figure 2 below:
Figure 2 Architecture of a Kalman Filter Algorithm
In the case of modelling a linear dynamic system, the state $x_t$ can be represented mathematically as:

$$x_t = F_t x_{t-1} + B_t u_t + w_t \qquad (31)$$

where $x_t$ is the state vector of the process at time $t$, $u_t$ is the vector containing the control inputs, $F_t$ is the state transition matrix applied to the system state at time $t-1$, $B_t$ is the control input matrix applying the effect of the input, and $w_t$ is the vector containing the process noise terms for the state vector. The measurements $y_t$ can be modelled in the following form:

$$y_t = H_t x_t + v_t \qquad (32)$$

where $y_t$ is the vector of actual measurements of $x$ at time $t$, $H_t$ is the transformation matrix between the state vector and the measurement vector, and $v_t$ is the vector containing the associated measurement noise.
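One predict/update cycle for the linear model of Eqs. 31–32 can be sketched as follows. This is a textbook Kalman filter step written for illustration; the function name and argument layout are assumptions:

```python
import numpy as np

def kf_step(x, P, y, F, B, u, H, Q, R):
    """One predict/update cycle of the linear Kalman filter (Eq. 31-32)."""
    # Predict: propagate state and covariance through the system model
    x_pred = F @ x + B @ u
    P_pred = F @ P @ F.T + Q
    # Update: correct the prediction with the measurement y
    S = H @ P_pred @ H.T + R             # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)  # Kalman gain
    x_new = x_pred + K @ (y - H @ x_pred)
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new
```

The update moves the state estimate part-way toward the measurement, with the gain K weighing the measurement noise R against the predicted uncertainty.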
Kalman Filter as an optimization algorithm
A recent review shows that the Kalman filter algorithm has been used in training neural network models [4] [5] [6] [7] and in other system estimation tasks with promising results, yet with various challenges [8]. Similarly, unscented Kalman filter models have been used in estimating the dynamic states of various systems [13] [14] [15] [16]. It is important to note that the main purpose of the Kalman filter algorithm is to minimize the mean squared error between the actual and estimated data, as shown in Eq. 18. The Kalman filter algorithm is analogous to the chi-square merit function used in least-squares fitting [17] problems. In this section, the emphasis is on how the extended Kalman filter algorithm can be applied to minimize the errors in training a radial basis function network. The derivation and review of the EKF are widely available in the literature [18] [19] [20] [21].
The Kalman filter can be used to optimize the weight matrix and the centres of an RBF network as a least-squares minimization problem. For a nonlinear finite-dimensional discrete-time system, the state and measurements can be modelled as:

$$\theta_{k+1} = f(\theta_k) + \omega_k \qquad (33)$$

$$y_k = h(\theta_k) + v_k \qquad (34)$$

where the vector $\theta_k$ is the state of the system at time $k$, $\omega_k$ is the process noise, $y_k$ is the observation vector, $v_k$ is the observation noise, and $f(\theta_k)$ and $h(\theta_k)$ are the nonlinear vector functions of the state and measurement respectively.
The system dynamic models $f(\theta_k)$ and $h(\theta_k)$ are assumed known, so the EKF can be used as the standard method to achieve a recursive, approximate maximum likelihood estimate of the state $\theta_k$ [18]. The state and output white-noise processes $\omega_k$ and $v_k$ are uncorrelated (i.e. zero-mean, normally distributed variables) with covariance matrices $Q$ and $R$ respectively. Assuming that the covariances of the two noise models are stationary over time, they can be modelled as:

$$Q = E\!\left[\omega_k \omega_k^{T}\right] \qquad (35)$$

$$R = E\!\left[v_k v_k^{T}\right] \qquad (36)$$

$$\mathrm{MSE} = E\!\left[e_k e_k^{T}\right] = P_k \qquad (37)$$

where $P_k$ is the error covariance matrix at time $k$.
Assuming that the nonlinear functions in Eq. 33 and Eq. 34 are sufficiently smooth, we can expand them around the estimate $\hat{\theta}_k$ using a first-order Taylor series, such that:

$$f(\theta_k) \approx f(\hat{\theta}_k) + F_k \left(\theta_k - \hat{\theta}_k\right) \qquad (38)$$

$$h(\theta_k) \approx h(\hat{\theta}_k) + H_k^{T} \left(\theta_k - \hat{\theta}_k\right) \qquad (39)$$

where

$$F_k = \left.\frac{\partial f(\theta)}{\partial \theta}\right|_{\theta = \hat{\theta}_k} \qquad (40)$$

$$H_k^{T} = \left.\frac{\partial h(\theta)}{\partial \theta}\right|_{\theta = \hat{\theta}_k} \qquad (41)$$
Neglecting the higher-order terms of the Taylor series and substituting Eq. 38 and Eq. 39 into Eq. 33 and Eq. 34 respectively, the system can be approximated as:

$$\theta_{k+1} = F_k \theta_k + \omega_k + \phi_k \qquad (42)$$

$$y_k = H_k^{T} \theta_k + v_k + \varphi_k \qquad (43)$$
The desired value $\hat{\theta}_k$ can be estimated through the recursion in [8] [21], such that:

$$\hat{\theta}_k = f(\hat{\theta}_{k-1}) + K_k \left[y_k - h(\hat{\theta}_{k-1})\right] \qquad (44)$$

$$K_k = P_k H_k \left(R + H_k^{T} P_k H_k\right)^{-1} \qquad (45)$$

$$P_{k+1} = F_k \left(P_k - K_k H_k^{T} P_k\right) F_k^{T} + Q \qquad (46)$$

where $K_k$ is the Kalman gain and $P_k$ is the covariance matrix of the estimation error.
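The recursion of Eqs. 44–46 can be sketched as a single step. This is an illustrative implementation that follows the document's convention in which $H_k$ maps state dimension to measurement dimension (so $H_k^T$ appears in the measurement equation); the function name and argument layout are assumptions:

```python
import numpy as np

def ekf_step(theta, P, y, f, h, F, H, Q, R):
    """One EKF recursion (Eq. 44-46); F and H are Jacobians at theta."""
    K = P @ H @ np.linalg.inv(R + H.T @ P @ H)   # Eq. 45, Kalman gain
    theta_new = f(theta) + K @ (y - h(theta))    # Eq. 44, state estimate
    P_new = F @ (P - K @ H.T @ P) @ F.T + Q      # Eq. 46, covariance update
    return theta_new, P_new
```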
Optimization of RBF using Kalman Filter
Just as the EKF has been used to train neural networks and other algorithms, it can also be used to train the radial basis function network. In this section we describe the derivative functions of the RBF network, some of its properties, and how these can be integrated with the EKF for optimization purposes.
Derivatives of Radial Based Function
In the case where the basis functions are fixed and the weights are adaptable, as shown in Fig. 1, the derivative of the network is a linear combination of the derivatives of the radial basis functions [22]. It has been shown that there are several main ways in which the function $g(\cdot)$ at the hidden layer can be represented [8] [22]. The common choices are:

- The multiquadric function
$$g(v) = \left(v^2 + \beta^2\right)^{1/2} \qquad (47)$$

- The inverse multiquadric function
$$g(v) = \left(v^2 + \beta^2\right)^{-1/2} \qquad (48)$$

- The Gaussian function
$$g(v) = \exp\!\left(-\frac{v}{\beta^2}\right) \qquad (49)$$

- The thin-plate spline function
$$g(v) = v \log v \qquad (50)$$

where $\beta$ is a real constant.
The spline in Eq. 50, among others, can be represented as a thin-plate spline, a spline with tension or a regularized spline [xxx]. Karayiannis [23] observed that, since RBF prototypes are generally interpreted as the centres of the receptive fields, the hidden-layer functions have the following properties:
- The response of a hidden neuron is always positive;
- The response of a hidden neuron becomes stronger as the input approaches the prototype;
- The response of a hidden neuron becomes more sensitive to the input as the input approaches the prototype.
Taking the above properties into consideration, the RBF hidden-layer function can be expressed as [8] [23]:

$$g(v) = \left[g_0(v)\right]^{\frac{1}{1-p}} \qquad (51)$$

where $p$ is a real number and $g_0(v)$ is a generator function.
It has been shown that if $p$ is greater than 1, then the generator function $g_0(v)$ should satisfy a number of conditions, as detailed in [8]. Karayiannis [23], in Reformulated Radial Basis Neural Networks Trained by Gradient Descent, shows that a generator function satisfying these conditions is the linear function:

$$g_0(v) = a v + b \qquad (52)$$

where $a > 0$ and $b \geq 0$. It has been observed that if $a = 1$ and $p = 3$, then the hidden-layer function reduces to the inverse multiquadric function of Eq. 48.
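The generator-function construction of Eqs. 51–52 can be sketched in one line. The function name and default parameters are illustrative; with $a = 1$, $p = 3$ and $v$ interpreted as the squared distance, the response is the inverse multiquadric $(v + b)^{-1/2}$ with $b = \beta^2$:

```python
def hidden_response(v, a=1.0, b=1.0, p=3.0):
    """g(v) = [g0(v)]^{1/(1-p)} with linear generator g0(v) = a*v + b
    (Eq. 51-52). For a = 1, p = 3 this is the inverse multiquadric
    (v + b)^{-1/2} evaluated at the squared distance v."""
    return (a * v + b) ** (1.0 / (1.0 - p))
```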
Optimization of RBF based on derivatives
The RBF architecture represented in Fig. 1, with a hidden-layer function $g(\cdot)$ of the form of Eq. 51, can be expressed in matrix form as follows:

$$\hat{y} = \begin{pmatrix} w_{10} & w_{11} & \cdots & w_{1c} \\ w_{20} & w_{21} & \cdots & w_{2c} \\ \vdots & \vdots & & \vdots \\ w_{n0} & w_{n1} & \cdots & w_{nc} \end{pmatrix} \begin{pmatrix} 1 \\ g\!\left(\lVert x - v_1 \rVert^2\right) \\ \vdots \\ g\!\left(\lVert x - v_c \rVert^2\right) \end{pmatrix} \qquad (53)$$

If we denote

$$\begin{pmatrix} w_{10} & w_{11} & \cdots & w_{1c} \\ w_{20} & w_{21} & \cdots & w_{2c} \\ \vdots & \vdots & & \vdots \\ w_{n0} & w_{n1} & \cdots & w_{nc} \end{pmatrix} = \begin{pmatrix} w_1^{T} \\ w_2^{T} \\ \vdots \\ w_n^{T} \end{pmatrix} = W \qquad (54)$$

and consider a training set of $M$ desired input–output responses $\{x^i, y^i\}$, for $i = 1, \ldots, M$, then we can write:

$$\left(\hat{y}^1 \;\cdots\; \hat{y}^M\right) = W \begin{pmatrix} 1 & \cdots & 1 \\ g\!\left(\lVert x^1 - v_1 \rVert^2\right) & \cdots & g\!\left(\lVert x^M - v_1 \rVert^2\right) \\ \vdots & & \vdots \\ g\!\left(\lVert x^1 - v_c \rVert^2\right) & \cdots & g\!\left(\lVert x^M - v_c \rVert^2\right) \end{pmatrix} \qquad (55)$$

Representing the right-hand factor of Eq. 55 with the following notation:

$$h_{0k} = 1 \quad \text{for } k = 1, \ldots, M \qquad (56)$$

$$h_{jk} = g\!\left(\lVert x^k - v_j \rVert^2\right) \quad \text{for } k = 1, \ldots, M,\; j = 1, \ldots, c \qquad (57)$$

we can write:

$$\begin{pmatrix} h_{01} & \cdots & h_{0M} \\ h_{11} & \cdots & h_{1M} \\ \vdots & & \vdots \\ h_{c1} & \cdots & h_{cM} \end{pmatrix} = \left(h^1 \;\cdots\; h^M\right) = H \qquad (58)$$
Therefore, Eq. 55 can be represented as:

$$\hat{Y} = WH \qquad (59)$$
When using gradient descent to minimize the training error, the error function can be defined as:

$$E = \frac{1}{2} \left\lVert Y - \hat{Y} \right\rVert_F^2 \qquad (60)$$

where $Y$ is the matrix of the expected values for the RBF network and $\lVert \cdot \rVert_F^2$ is the square of the Frobenius norm of a matrix, which is equal to the sum of the squares of its elements.
The derivatives of $E$ with respect to the weights and the prototypes [23] can be expressed as:

$$\frac{\partial E}{\partial w_i} = \sum_{k=1}^{M} \left(\hat{y}_{ik} - y_{ik}\right) h^k \quad \text{for } i = 1, \ldots, n \qquad (61)$$

$$\frac{\partial E}{\partial v_j} = \sum_{k=1}^{M} 2\, g'\!\left(\lVert x^k - v_j \rVert^2\right) \left(x^k - v_j\right) \sum_{i=1}^{n} \left(y_{ik} - \hat{y}_{ik}\right) w_{ij} \quad \text{for } j = 1, \ldots, c \qquad (62)$$

where $\hat{y}_{ik}$ is the element in the $i$th row and $k$th column of the matrix $\hat{Y}$ in Eq. 59, and $y_{ik}$ is the corresponding element of the matrix $Y$.
To optimize the RBF network with respect to the rows of the weight matrix $W$ and the prototypes $v_j$, we iteratively compute the partial derivatives in Eqs. 61 and 62 and perform the following updates:

$$w_i \leftarrow w_i - \eta \frac{\partial E}{\partial w_i} \quad \text{for } i = 1, \ldots, n \qquad (63)$$

$$v_j \leftarrow v_j - \eta \frac{\partial E}{\partial v_j} \quad \text{for } j = 1, \ldots, c \qquad (64)$$

where $\eta$ is the step size of the gradient descent method; the optimization stops when $w_i$ and $v_j$ reach a local minimum.
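The gradient computations of Eqs. 58–62 can be sketched as below. The array layout (inputs and prototypes as columns) and the function name are assumptions made for illustration:

```python
import numpy as np

def gradients(X, Y, W, V, g, gprime):
    """Partials of E = 0.5 ||Y - WH||_F^2 w.r.t. rows of W and prototypes v_j
    (Eq. 61-62).

    X : (m, M) inputs as columns, Y : (n, M) targets,
    W : (n, c+1) weights, V : (m, c) prototypes as columns,
    g, gprime : hidden-layer function and its derivative.
    """
    M = X.shape[1]
    # H matrix of Eq. 58: row of ones, then g(||x^k - v_j||^2)
    d2 = np.sum((X[:, None, :] - V[:, :, None]) ** 2, axis=0)   # (c, M)
    H = np.vstack([np.ones((1, M)), g(d2)])
    Yhat = W @ H                                  # Eq. 59
    grad_W = (Yhat - Y) @ H.T                     # Eq. 61, rows are dE/dw_i
    # Eq. 62: dE/dv_j = sum_k 2 g'(.)(x^k - v_j) sum_i (y_ik - yhat_ik) w_ij
    back = W[:, 1:].T @ (Y - Yhat)                # (c, M)
    grad_V = np.zeros_like(V)
    for j in range(V.shape[1]):
        diff = X - V[:, [j]]                      # (m, M): x^k - v_j
        grad_V[:, j] = (2 * gprime(d2[j]) * back[j] * diff).sum(axis=1)
    return grad_W, grad_V
```

A quick sanity check: when the targets coincide with the network outputs, both gradients vanish, as expected at a minimum of Eq. 60.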
Optimization of RBF using Extended Kalman Filter
Applying a similar approach to [8] [11] [24], we can view the optimization of the RBF weights $W$ and prototypes $v_j$ as a weighted least-squares minimization problem, in which the error vector is the difference between the RBF outputs and the expected target values. Using the RBF network of Fig. 1 with $m$ inputs, $c$ prototypes and $n$ outputs, let $y$ represent the target vector for the RBF outputs and $h(\hat{\theta}_k)$ denote the actual outputs at the $k$th iteration of the optimization algorithm. Then $y$ and $h(\hat{\theta}_k)$ can be represented as Eq. 65 and Eq. 66 respectively:

$$y = \left[y_{11} \cdots y_{1M} \;\cdots\; y_{n1} \cdots y_{nM}\right]^{T} \qquad (65)$$

$$h(\hat{\theta}_k) = \left[\hat{y}_{11} \cdots \hat{y}_{1M} \;\cdots\; \hat{y}_{n1} \cdots \hat{y}_{nM}\right]_k^{T} \qquad (66)$$
where $n$ is the dimension of the RBF output and $M$ is the number of training samples.
The RBF optimization problem can be posed in Kalman filter form by letting the elements of the weight matrix $W$ and of the prototypes $v_j$ represent the state of a nonlinear system whose output is the output of the RBF network. The state of the system can be represented as:

$$\theta = \left[w_1^{T} \cdots w_n^{T} \;\; v_1^{T} \cdots v_c^{T}\right]^{T} \qquad (67)$$

In Eq. 67 the vector $\theta$ consists of all $n(c+1) + mc$ RBF parameters arranged in a linear array, and the nonlinear system to which the Kalman filter can be applied is represented by Eq. 68 and Eq. 69 as:
$$\theta_{k+1} = \theta_k \qquad (68)$$

$$y_k = h(\theta_k) \qquad (69)$$

where $h(\theta_k)$ is the RBF network's nonlinear mapping between its parameters and its output. However, to make the filter algorithm stable, artificial process noise $\varphi_k$ and measurement noise $v_k$ need to be added to the model, as in [19] [8] [xxx more citations]. Therefore, Eq. 68 and Eq. 69 become:

$$\theta_{k+1} = \theta_k + \varphi_k \qquad (70)$$

$$y_k = h(\theta_k) + v_k \qquad (71)$$
In this form, we can then apply Eqs. 44–46 to Eq. 70 and Eq. 71, where:
- $f(\cdot)$ is the identity mapping;
- $y_k$ is the target output of the RBF network;
- $h(\hat{\theta}_k)$ is the actual output of the RBF network given the RBF parameters at the $k$th iteration of the Kalman recursion;
- $H_k$ is the matrix of partial derivatives of the RBF output with respect to the RBF network parameters;
- the $Q$ and $R$ matrices are the tuning parameters, i.e. the covariance matrices of the artificial noise processes $\omega_k$ and $v_k$ respectively.
It can be shown that the partial derivative of the radial basis function output with respect to its parameters [8] [25] [22] can be represented as:

$$H_k = \begin{pmatrix} H_w \\ H_v \end{pmatrix} \qquad (72)$$

where

$$H_w = \begin{pmatrix} H & 0 & \cdots & 0 \\ 0 & H & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & H \end{pmatrix} \qquad (73)$$

and $H_v$ collects the partial derivatives of the outputs with respect to the prototype elements, with typical block element:

$$H_v = \left[\, 2\, w_{ij}\, g'\!\left(\lVert x^k - v_j \rVert^2\right) \left(x^k - v_j\right) \,\right]_{\substack{j = 1, \ldots, c \\ i = 1, \ldots, n;\; k = 1, \ldots, M}} \qquad (74)$$

where:
- $H$ is the $(c+1) \times M$ matrix of Eq. 58;
- $w_{ij}$ is the element in the $i$th row and $j$th column of the weight matrix $W$;
- $x^k$ is the $k$th input vector;
- $v_j$ is the $j$th prototype vector;
- $H_w$ is an $n(c+1) \times nM$ matrix (as in Eq. 73);
- $H_v$ is an $mc \times nM$ matrix (as in Eq. 74);
- $H_k$ is an $\left[n(c+1) + mc\right] \times nM$ matrix.
As in [8] [11], it is now possible to execute the extended Kalman filter recursion of Eqs. 44–46 to determine the weight matrix $W$ and the prototypes $v_j$.
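The overall EKF training loop of Eqs. 70–71 with the recursion of Eqs. 44–46 can be sketched as below. This is a hedged illustration for a single-output network: a finite-difference Jacobian stands in for the analytic $H_k$ of Eqs. 72–74, $f$ is the identity, and the function name, initialization and noise levels are assumptions:

```python
import numpy as np

def train_rbf_ekf(X, Y, c, iters=50, q=1e-4, r=1e-1, eps=1e-5):
    """EKF training of an RBF network per Eq. 70-71.

    X : (M, m) inputs, Y : (M,) targets; the state theta packs the
    output weights [w_0..w_c] and the c prototypes (Eq. 67).
    """
    M, m = X.shape
    rng = np.random.default_rng(0)
    theta = rng.normal(scale=0.1, size=(c + 1) + m * c)

    def h(th):
        # Unpack weights and prototypes, then evaluate Gaussian hidden units
        w, V = th[:c + 1], th[c + 1:].reshape(c, m)
        d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(-1)   # (M, c)
        return w[0] + np.exp(-d2) @ w[1:]

    P = np.eye(theta.size)
    Q, R = q * np.eye(theta.size), r * np.eye(M)
    for _ in range(iters):
        # Finite-difference Jacobian (state-dim x M), stand-in for Eq. 72-74
        H = np.stack([(h(theta + eps * e) - h(theta)) / eps
                      for e in np.eye(theta.size)])
        K = P @ H @ np.linalg.inv(R + H.T @ P @ H)            # Eq. 45
        theta = theta + K @ (Y - h(theta))                    # Eq. 44, f = identity
        P = (P - K @ H.T @ P) + Q                             # Eq. 46 with F = I
    return theta, h
```

On a smooth one-dimensional target the recursion reduces the training error within a few iterations, which is the convergence behaviour the cited RBF-EKF studies report.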
AdaBoost as an ensemble technique: for combining trained classifiers
AdaBoost is an ensemble or meta-learning method that can be used with other learning algorithms to generate a strong classifier out of weak classifiers. The concept behind AdaBoost is that a better algorithm can be built by combining multiple instances of a simple algorithm, where each instance is trained on the same training data but with different weights assigned to each case. AdaBoost iteratively trains several base classifiers, each paying more attention to the data misclassified in the previous round. At each iteration AdaBoost calls a simple learning algorithm that returns a classifier and assigns a weight coefficient to it: base classifiers with smaller error receive larger weights, and those with larger error receive smaller weights. AdaBoost combines the learned classifiers linearly into a weighted sum as the final output of the boosted classifier. Like other algorithms, AdaBoost has its drawbacks: it is sensitive to noisy data and outliers. However, it can be less susceptible to overfitting than other learning algorithms such as neural networks and SVMs. The literature shows that many variants of the algorithm have been introduced over the past decades to address one problem or another; in this research, however, we consider only the original AdaBoost algorithm.
Brief description of AdaBoost
The derivation and theory of AdaBoost have been presented extensively in [26] [27] [28]. The description here follows Schapire [29]: assume we are given a set of labelled training examples $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, where $x_i \in \mathbb{R}^m$ and the labels $y_i \in \{-1, +1\}$. On each iteration $t = 1, \ldots, T$, a distribution $D_t$ is computed over the $n$ training examples. A given weak learner is applied to find a weak hypothesis $h_t : \mathbb{R}^m \rightarrow \{-1, +1\}$. The aim of the weak learner is to find a weak hypothesis with low weighted error $\varepsilon_t$ relative to $D_t$.
The number of iterations determines the number of weak classifiers produced during training. The final classifier $H(x)$, as shown in Figs. 3 and 4, is computed as a weighted majority vote of the weak hypotheses $h_t$, where each hypothesis is assigned a weight $\alpha_t$. The final classifier is given by:

$$H(x) = \operatorname{sign}\!\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right) \qquad (75)$$

The accuracy of a hypothesis is measured by its weighted error:

$$\varepsilon_t = \Pr_{i \sim D_t}\!\left[h_t(x_i) \neq y_i\right] \qquad (76)$$

The weight assigned to a hypothesis in the final linear combination of all participating hypotheses is given by:

$$\alpha_t = \frac{1}{2} \ln\!\left(\frac{1 - \varepsilon_t}{\varepsilon_t}\right) \qquad (77)$$
The distribution vector $D_t$ is updated as:

$$D_{t+1}(i) = \frac{D_t(i) \exp\!\left(-\alpha_t y_i h_t(x_i)\right)}{Z_t} \qquad (78)$$

where $Z_t$ is a normalization factor chosen so that the weights sum to 1, which makes $D_{t+1}$ a probability distribution. The pseudocode for AdaBoost is shown in Figure 3.
During the training process there is always a deviation between the predicted and expected values; this expected error is measured as a sum of squared errors. The expected error over the committee can be expressed as in Eq. 79:

$$E_{err} = \frac{1}{N} \sum_{i=1}^{N} \left(y_i - h(x_i)\right)^2 \qquad (79)$$
The Cost function/Weighting
Weighting – AdaBoost uses weighting function that enables new classifier to focus on the erroneous classification. During iterations AdaBoost sequentially trains several new classifiers and assigns it an output weight that in principle is equivalent to the error made the classifier. This enables new classifiers to pay more attention to data that are misclassified by previous classifiers.
Training set selection – After each round AdaBoost increases the weights of the misclassified examples, so that examples with higher weights are retrained with more emphasis in the next iteration. The equation for the output weight update is shown in Eq. 77. After computing this weight, AdaBoost updates the training example weights using Eq. 78, which shows how the weight of the i-th training example is updated for the next iteration, where D_t is a vector of weights with one weight for each training example. The equation is evaluated for each training sample; the exponential factor in the pseudocode scales each weight up (if the example was misclassified) or down (if it was classified correctly).
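As a concrete numerical illustration of this scaling (the error value ε_t = 0.2 is assumed, not taken from the experiments): Eq. 77 gives α_t = ½ ln 4 ≈ 0.693, so a misclassified example's weight is multiplied by e^{α_t} = 2 while a correctly classified one is multiplied by e^{−α_t} = 0.5, before normalisation by Z_t.

```python
import math

eps = 0.2                                    # assumed weighted error for one round
alpha = 0.5 * math.log((1 - eps) / eps)      # Eq. 77: alpha = 0.5 * ln(4) ~ 0.693
scale_wrong = math.exp(alpha)                # misclassified example: weight doubled
scale_right = math.exp(-alpha)               # correct example: weight halved
```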
Exponential loss function – AdaBoost attempts to minimize an exponential loss function, which is an upper bound on the average training error, by performing a greedy coordinate descent on H(x_i) [28] [30]. The exponential loss function that AdaBoost attempts to minimize can be expressed as:
L(H) = \sum_{i=1}^{m}\exp\left(-y_i H(x_i)\right)  (80)
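The upper-bound property holds because exp(−y_i H(x_i)) ≥ 1 whenever H misclassifies x_i, i.e. whenever the margin y_i H(x_i) is non-positive. A small numerical check, with the margin values chosen arbitrarily for illustration:

```python
import math

def zero_one_error(margins):
    """Fraction of examples with non-positive margin y_i * H(x_i)."""
    return sum(1 for m in margins if m <= 0) / len(margins)

def mean_exp_loss(margins):
    """Average exponential loss, i.e. Eq. 80 divided by m."""
    return sum(math.exp(-m) for m in margins) / len(margins)

margins = [2.0, 0.5, -0.3, 1.2, -1.0]        # assumed margins y_i * H(x_i)
```

For these margins the 0/1 training error is 2/5, while the average exponential loss is larger, as the bound requires.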
Figure 3 The AdaBoost pseudocode
Figure 4 A typical ensemble model showing a committee of neural networks. Each classifier h_i has an associated contribution α_i
Using ensemble AdaBoost to bridge EKF-trained RBF networks
Even though RBF and other neural networks have proved to be effective tools in many applications, there are situations in which several networks are required to produce accurate results on complex tasks. One way of achieving this is to combine the predictions of several network models. AdaBoost, as a technique for training learning algorithms and combining their outputs, serves this purpose. The AdaBoost algorithm is therefore used to create an ensemble of RBF-EKF networks. This produces a stronger classifier output from the committee of EKF-trained RBF networks and enables the model to handle classification problems that a single RBF network cannot handle effectively.
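Each committee member in this scheme is an RBF network trained by the EKF. A minimal sketch of one member's forward pass is shown below, assuming Gaussian hidden units with already-trained centres, widths and output weights (the function name `rbf_classify` and the thresholding to {−1, +1} are illustrative choices, made so that the network can serve as one weak hypothesis h_t in the vote):

```python
import math

def rbf_classify(x, centres, widths, weights, bias=0.0):
    """Forward pass of a Gaussian RBF network, thresholded to {-1, +1}
    so a trained network can act as one AdaBoost committee member."""
    # hidden-unit activations: Gaussian of the distance to each centre
    hidden = [
        math.exp(-sum((xj - cj) ** 2 for xj, cj in zip(x, c)) / (2.0 * s ** 2))
        for c, s in zip(centres, widths)
    ]
    # linear output layer, then sign threshold
    out = sum(w * h for w, h in zip(weights, hidden)) + bias
    return 1 if out >= 0 else -1
```

An input close to a centre with a positive output weight is labelled +1, and one close to a negatively weighted centre is labelled −1, matching the distance-based activation described for the RBF hidden layer.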
 Diagrammatic representation of the proposed model
 Simulation results reproducing Simon's RBF-EKF experiments [8] (Matlab IDE)
 Simulation results on the same datasets using AdaBoost with decision stump and RBF base learners (Matlab IDE)
 Simulation: AdaBoost + decision stump (NN, RBF), Matlab IDE
 We intend to use the PSO algorithm to determine and optimize the initial parameters of the RBF neural network in order to reduce the error of the RBF.
 Then use AdaBoost to learn the RBF network and train committees of RBF weak predictors.
 Linearly combine the committee of weak predictors to produce a strong predictor with better classification performance.
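The planned PSO step could be sketched as follows. Everything here is an assumption for illustration, not the final design: the function name `pso_minimise`, the inertia and acceleration constants, and the idea of encoding the candidate RBF initial parameters (e.g. the centres) as a flat particle position evaluated by a user-supplied loss.

```python
import random

def pso_minimise(loss, dim, n_particles=12, iters=60, w=0.7, c1=1.5, c2=1.5, seed=1):
    """Minimal particle swarm optimiser: 'loss' maps a flat parameter vector
    (e.g. candidate RBF centres) to a scalar training error to be minimised."""
    rng = random.Random(seed)
    pos = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                   # each particle's best position
    pbest_val = [loss(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]  # swarm's best position so far
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            v = loss(pos[i])
            if v < pbest_val[i]:                  # update personal best
                pbest[i], pbest_val[i] = pos[i][:], v
                if v < gbest_val:                 # and possibly the global best
                    gbest, gbest_val = pos[i][:], v
    return gbest, gbest_val
```

In the intended use, `loss` would evaluate the RBF training error for a candidate initialisation, so the returned position seeds the EKF training with a lower starting error.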
References
[1]  I. Nabney, NETLAB: Algorithms for Pattern Recognition, M. Singh, Ed., London: Springer, 2002. 
[2]  F. Schwenker, H. A. Kestler and G. Palm, “Three learning phases for radial-basis-function networks,” Neural Networks, vol. 14, pp. 439–458, 2000. 
[3]  R. Kruse, C. Borgelt, F. Klawonn, C. Moewes, M. Steinbrecher and P. Held, “Radial Basis Function Networks,” in Computational Intelligence (Texts in Computer Science), pp. 83–103. 
[4]  A. Shareef, Y. Zhu, M. Musavi and B. Shen, “Comparison of MLP Neural Networks and Kalman Filter for Localization in Wireless Sensor Networks,” in 19th IASTED International Conference: Parallel and Distributed Computing and Systems, Cambridge, MA, 2007. 
[5]  J. Sum, C.-S. Leung, G. Young and W.-K. Kan, “On the Kalman Filtering Method in Neural Network Training and Pruning,” IEEE Transactions on Neural Networks, vol. 10, no. 1, pp. 161–166, 1999. 
[6]  “EKF Learning for Feedforward Neural Networks,” in 2003 European Control Conference (ECC), Cambridge, UK, 2003. 
[7]  A. Krok, “The Development of Kalman Filter Learning Technique for Artificial Neural Networks,” Journal of Telecommunications and Information Technology, pp. 16–21, 2013. 
[8]  D. Simon, “Training Radial Basis Neural Networks with the Extended Kalman Filter,” Neurocomputing, vol. 48, pp. 455–457, 2002. 
[9]  H. Kamath, A. Goswami, A. Kumar, R. Aithal and P. Singh, “RBF and BPNN Combi Model Based Filter Application for Maximum Power Point Tracker of PV Cell,” in International MultiConference of Engineers and Computer Scientists, Hong Kong, 2011. 
[10]  T. Kurban and E. Beşdok, “A Comparison of RBF Neural Network Training Algorithms for Inertial Sensor Based Terrain Classification,” vol. 9, pp. 6312–6329, 12 August 2009. 
[11]  A. N. Chernodub, “Training Neural Networks for Classification Using the Extended Kalman Filter: A Comparative Study,” Optical Memory and Neural Networks, vol. 23, no. 2, pp. 96–103, 2014. 
[12]  J. Ghosh and A. Nag, “Radial Basis Function Networks 2 – New Advances in Design,” Physica-Verlag, 2001. 
[13]  S. Papp, K. Gyorgy, A. Kelemen and L. Jakab-Farkas, “Applying the Extended and Unscented Kalman Filters for Nonlinear State Estimation,” in The 6th Edition of the Interdisciplinarity in Engineering International Conference, 2012. 
[14]  A. Attarian, J. Batzel, B. Matzuka and H. Tran, “Application of the Unscented Kalman Filtering to Parameter Estimation,” in Mathematical Modeling and Validation in Physiology, vol. 2064, pp. 75–88, 11 September 2012. 
[15]  S. Ramadurai, S. Kosari, H. H. King, H. J. Chizeck and B. Hannaford, “Application of Unscented Kalman Filter to a Cable Driven Surgical Robot: A Simulation Study,” in IEEE International Conference on Robotics and Automation, Saint Paul, 2012. 
[16]  K. Dróżdż and K. Szabat, “Application of Unscented Kalman Filter in Adaptive Control Structure of Two-Mass System,” in Power Electronics and Motion Control Conference (PEMC), 2016. 
[17]  T. Lacey, “Tutorial: The Kalman Filter,” [Online]. Available: http://web.mit.edu/kirtley/kirtley/binlustuff/literature/control/Kalman%20filter.pdf. [Accessed 17 April 2017]. 
[18]  E. Wan and R. van der Merwe, “The Unscented Kalman Filter for Nonlinear Estimation,” in Adaptive Systems for Signal Processing, Communications, and Control Symposium, pp. 153–158, 2000. 
[19]  S. Haykin, Adaptive Filter Theory, 3rd ed., Prentice-Hall, 1996. 
[20]  M. Ribeiro, “Kalman and Extended Kalman Filters: Concept, Derivation and Properties,” CiteSeer, 2004. 
[21]  B. Anderson and J. Moore, Optimal Filtering, Englewood Cliffs, NJ: Prentice-Hall, 1979. 
[22]  N. Mai-Duy and T. Tran-Cong, “Approximation of function and its derivatives using radial basis function networks,” Applied Mathematical Modelling, vol. 27, pp. 197–220, 2003. 
[23]  N. Karayiannis, “Reformulated radial basis neural networks trained by gradient descent,” IEEE Transactions on Neural Networks, vol. 10, no. 3, pp. 657–671, 1999. 
[24]  G. Puskorius and L. Feldkamp, “Neurocontrol of nonlinear dynamical systems with Kalman filter trained recurrent networks,” IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 279–297, 1994. 
[25]  R. Schaback, “A Practical Guide to Radial Basis Functions,” [Online]. Available: http://num.math.uni-goettingen.de/schaback/teaching/sc.pdf. [Accessed 18 April 2017]. 
[26]  Y. Freund and R. Schapire, “A decision-theoretic generalization of on-line learning and an application to boosting,” Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119–139, 1997. 
[27]  T. Hastie, R. Tibshirani and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, 2001. 
[28]  R. Schapire and Y. Freund, Boosting: Foundations and Algorithms, MIT Press, 2014. 
[29]  R. Schapire, “Explaining AdaBoost,” in Empirical Inference, Springer, 2013, pp. 37–52. 
[30]  J. Friedman, T. Hastie and R. Tibshirani, “Additive logistic regression: a statistical view of boosting,” The Annals of Statistics, vol. 28, no. 2, pp. 337–407, 2000. 