Exploring Diagnosis and Imaging Biomarkers of Parkinson’s Disease via Iterative Canonical Correlation Analysis Based Feature Selection

Abstract

Parkinson’s disease (PD) is a neurodegenerative disorder that progressively impairs brain function and leads to various motor and non-motor symptoms. However, it is difficult to attain early-stage PD diagnosis based on the subjective judgment of physicians in clinical routines. Therefore, automatic and accurate diagnosis of PD is highly demanded, so that the corresponding treatment can be implemented more appropriately. In this paper, we focus on finding the most discriminative features from different brain regions in PD through T1-weighted MR images, which can help the subsequent PD diagnosis. Specifically, we propose a novel iterative canonical correlation analysis (ICCA) feature selection method, aiming to exploit MR images in a more comprehensive manner and fuse features of different views into a common space. Stated succinctly, we first extract the feature vectors from the *gray matter* and the *white matter* tissues separately, represented as views of two different anatomical feature spaces for the subject’s brain. The ICCA feature selection method aims at iteratively finding the optimal feature subset from two views of features that have inherently high correlation with each other. In experiments we have conducted thorough investigations on the optimal feature set extracted by our ICCA method. We also demonstrate that, using the proposed feature selection method, the PD diagnosis performance is further improved and outperforms that of many state-of-the-art methods.

**Keywords**

Iterative canonical correlation analysis, feature selection, imaging biomarkers, diagnosis, Parkinson’s disease

## 1 Introduction

Parkinson’s disease (PD) is a neurodegenerative disorder that typically begins with age-related deterioration after the age of 40 and progresses slowly. PD affects patients’ movement, balance, and muscle control. The cause of PD remains unclear, and effective treatments are limited. Moreover, the clinical diagnosis of PD is usually prone to subjective biases, since it mostly relies on evaluating substantial symptoms of the patients [1]. As an alternative, machine learning techniques can be used to analyze medical data intelligently and produce the diagnosis automatically. A surge of interest has therefore been drawn to computer-assisted diagnosis solutions for PD, which can take advantage of all available measures to identify biomarkers related to PD.

There are several studies in the literature aiming at distinguishing PD from normal control (NC) or other similar neural disorders. Goebel *et al.* [2] analyzed the single-photon emission computed tomography (SPECT) images, and developed an observer-independent method to classify PD from multiple system atrophy Parkinson (MSA-P), progressive supranuclear palsy (PSP), and normal control automatically. Tsanas *et al.*[3] proposed a novel speech signal processing algorithm to compute dysphonia measures, which contribute to better differentiating PD patients from healthy controls. Wenning *et al.* [4] evaluated various clinical features (including response to levodopa, motor fluctuation, rigidity, dementia, speech) for distinguishing multiple system atrophy (MSA) from PD. Singh *et al.* [5] proposed a novel synergetic paradigm that integrates Kohonen self-organizing map (KSOM) to extract features from magnetic resonance (MR) images for individual-level clinical diagnosis of neurodegenerative diseases. Highly accurate diagnosis is then achieved with the least-square support vector machine (LS-SVM). Chen *et al.* [6] proposed an effective and efficient system using fuzzy *k*-nearest neighbor (FKNN) for PD diagnosis on biomedical voice measurement data. Their experimental results demonstrate that the FKNN system greatly outperforms the SVM-based approaches and other methods in the literature.

Most of the existing works for computer-assisted PD diagnosis study clinical data or voice measurements, rather than neuroimaging data. Few of them utilize sophisticated machine learning tools to select features and build models for PD diagnosis and biomarker identification. Our goal in this paper is to find a non-invasive and low-cost solution to early PD screening using T1-weighted MR images, which is especially in demand in less-developed areas with limited access to healthcare resources. Our main contribution is to propose a novel iterative canonical correlation analysis (ICCA) based feature selection method to optimize the diagnosis process. Specifically, we commence by extracting features from the T1-weighted MR images, which are associated with individual regions-of-interest (ROIs). The ROIs are divided into two sets, corresponding to white matter (WM) and gray matter (GM), respectively. Therefore, the extracted features can be naturally separated into two groups, corresponding to the descriptions of the patients from the WM and GM perspectives. Next, we adopt canonical correlation analysis (CCA) and propose the iterative CCA (ICCA) method, in order to transform the WM and GM features into an optimal common feature space. The WM and GM features are transformed and then selected in a common space with ICCA. Finally, the selected optimal features are used for establishing a robust linear discriminant analysis (RLDA) model, which classifies PD patients from normal subjects for disease diagnosis.

Feature selection has proven its pivotal importance in tackling problems in translational medical studies. For example, the sparse logistic regression technique has been applied to find an optimal set of features in the early diagnosis of Alzheimer’s disease (AD) [7]. A multi-kernel based framework for feature selection and imbalanced data learning was proposed for lung nodule computer aided detection [8]. Besides, [9] reviews the least absolute shrinkage and selection operator (LASSO) as well as other state-of-the-art sparse models, which are widely applied to biomedical data for biomarker selection, biological network construction, and magnetic resonance imaging. A feature selection scheme based on support vector machine (SVM) is also used to help train the rotation forest (RF) ensemble classifiers in [10] for improving the diagnosis of PD. Similar works can be found in [11] and [12], where principal component analysis (PCA) and canonical correlation analysis (CCA) are used, respectively. Besides, minimal-redundancy-maximal-relevance (mRMR) [13] is proven in previous work to be an efficient feature selection method. Nie et al. [14] proposed a robust feature selection method by minimizing joint l2,1-norm-based metrics over both the loss function and the regularization.

In this paper we further explore the CCA technique, which is capable of establishing the relationship between two high-dimensional vectors of features, and implementing a linear mapping to transform them to a common feature space [15]. Note that the two feature vectors in our study are extracted from WM and GM regions respectively, representing two views of different anatomical feature spaces for each patient/normal subject. These two feature vectors thus need to be transformed to a common space, where all features can be evaluated jointly and fairly according to their contributions towards PD classification. Specifically, after linearly transforming the two views of features to a common space estimated by CCA, we learn a regression model to fit the PD/NC labels of the training subjects based on the transformed feature representations. The regression model not only predicts the unknown labels of new test subjects, but also helps identify the importance of relevant features for PD classification. The identified features are often perceived as important biomarkers, which can potentially help researchers better understand the mechanism and evolution of the disease.

In addition, we argue that only a few brain ROIs, as well as their associated features, are relevant for computer-assisted PD diagnosis. The redundancy within the initially extracted features can be high, making it difficult to estimate the common feature space by a one-shot CCA transformation. An inaccurate estimation of the CCA common space is also hazardous to the selection of a limited number of optimal features and the subsequent learning of the regression model. To this end, we develop the ICCA method for feature selection, in which we iteratively optimize the estimation of the common space and gradually discard the irrelevant features [16]. Specifically, in the first iteration of ICCA, we transform the features of the WM/GM views into a tentative common space, and build the PD regression model accordingly. The regression coefficients of the model measure the contributions of individual features, and guide us to eliminate the most irrelevant features for PD classification. The subsequent iterations update the common space based on the tentatively preserved features, which are further selected once the common space is updated. In the end, the two feature vectors of the WM/GM views are transformed to the common space, while only a limited number of the optimal features is preserved.

The ICCA-based feature selection reveals the optimal common feature space for multi-view learning of neuroimaging data. We then utilize RLDA to complete the PD classification task based on the selected features. Note that, with the linear transformation of the features in CCA, our ICCA provides locally linear feature transformation capabilities that contribute to sophisticated feature selection across two different anatomical views.

The remaining parts of the paper are organized as follows. In Section 2, we review related work on feature selection as well as robust regression, since the RLDA classifier used in this paper is an extension of robust regression for classification. In Section 3, we present the details of our classification framework, especially the ICCA feature selection method. The experimental results of the proposed method, and comparisons with state-of-the-art feature selection approaches, are provided in Section 4, where the top features selected by the proposed method for PD classification are also discussed. Finally, we conclude the paper in Section 5.

## 2 Related Work

- Feature Selection Methods

A major group of feature selection techniques relies on the features only, without considering the class labels of the subjects. These techniques in general belong to the unsupervised feature selection category, including PCA, the t-test, etc. These methods are simple and hence widely used in many applications. For instance, Song et al. [17] exploited the eigenvectors of the PCA covariance matrix to evaluate the significance of each individual feature component in the original data. Based on this evaluation, their method takes only a few eigenvectors into account. Their experimental results on face recognition show that the method can select an appropriate number of features without jeopardizing recognition accuracy. Zhou and Wang [18] modified the t-test ranking measure and used it to discover significant single-nucleotide polymorphisms (SNPs).

In addition to the above, there are some popular filter-based feature selection methods, which rank features by various criteria and then select them accordingly. The criteria include information gain [19], Fisher score [20], ReliefF [21], Laplacian score [22], and Trace Ratio [23], all of which have been extensively studied in different fields. Azhagusundari and Thanamani [18] proposed an algorithm that combines the discernibility matrix and information gain to select features. They demonstrated that better results, in terms of the number of selected features and classification accuracy, can be achieved than by applying the two criteria separately. Gu et al. [20] presented a generalized Fisher score by reformulating the feature selection problem as quadratically constrained linear programming. They overcame the shortcoming of the conventional Fisher score feature selection method, in which each feature is considered independently. The selected suboptimal subset of features is shown to outperform the conventional Fisher score and many state-of-the-art feature selection methods, including the Laplacian score, the Hilbert-Schmidt independence criterion, and the Trace Ratio criterion. In [21], the authors thoroughly investigated Relief and ReliefF, and demonstrated their robustness as well as their tolerance to noise.

The class labels of the training subjects can contribute to supervised feature selection methods. In this way, the selection of features is perceived as part of learning upon the training data, which can usually be modeled by optimizing an objective function. Some popular examples are sparse learning [24] and Least Angle Regression (LARS) [25]. In [26], the Least Absolute Shrinkage and Selection Operator (LASSO) is used to perform feature selection. LASSO minimizes the deviation of the class label regressed from the features from its ground truth, while the regression coefficients measuring the contributions of individual features are penalized by the l1-norm. The l1-norm penalization encourages many coefficients to vanish, so that only the most useful features contribute to the regression of the class labels. Wang et al. [27] proposed a novel Sparse Multi-tAsk Regression and feaTure selection (SMART) model, in which both l1-norm and l2,1-norm regularizations are included for selecting features. In [14], Nie et al. proposed to measure both the loss function and the regularization of coefficients with the l2,1-norm jointly. The l2,1-norm based loss function is robust to outlier subjects, while the l2,1-norm regularization selects features across all subjects with joint sparsity. In [28], Armanfard et al. proposed a novel localized feature selection (LFS) method, where an optimal feature subset is associated with each region of the sample space. In [29], Li et al. proposed a granular feature selection method for multi-label learning, with a maximal-correlation minimal-redundancy criterion based on mutual information, to select a more relevant and compact feature subset and to explore label dependency.

The performance of feature selection can be limited especially when dealing with complex data, in which features are highly correlated and thus redundant. For instance, Peng et al. [13] presented a two-stage feature selection algorithm based on maximal statistical dependency. By combining minimal-redundancy-maximal-relevance with other sophisticated feature selectors, they demonstrated promising results over several classifiers and datasets. In [30], Senawi et al. proposed a maximum relevance-minimum multicollinearity (MRmMC) method, in which correlation characteristics based on conditional variance are used to measure feature relevance, and redundancy elimination is performed by multiple-correlation assessment through an orthogonal projection scheme. In [31], Naghibi et al. proposed a parallel search strategy based on semidefinite programming, which can search through the subset space in polynomial time, with the mutual information between features and class labels serving as the measure function. In [15], Zhu et al. proposed a novel canonical feature selection method, which efficiently integrates the correlation information between structural and functional neuroimaging data into a sparse multi-task learning framework. They assumed that the structural and functional feature descriptors of a single subject are highly redundant. By projecting the features of these two modalities (and thus two views) into a common space with CCA, they were able to maximize the correlation between features and also conduct task-related feature selection.

- Robust Regression

In this study, we use RLDA to classify subjects with the selected features. RLDA can be perceived as a special case of robust regression [32]. As also confirmed in the literature, regression methods can be extended for challenging classification tasks [33-35]. Nevertheless, the lack of robustness against noise is one of the major drawbacks of most existing regression approaches, especially when outliers affect the normal distribution of subjects within the high-dimensional feature space. Therefore, robust regression methods [36, 37] have been developed over the past decades. In [35], Huber introduced M-estimation for regression, which is robust against outlier subjects. Rousseeuw and Leroy [36] developed Least Trimmed Squares, which finds a data subset that minimizes the sum of squared residuals. However, these methods can only remove outliers among subjects, and therefore cannot deal with outliers within subjects, in other words, noise in the feature values. Hence, Error-In-Variable (EIV) approaches [37] have been proposed to deal with noise in the features, even though the existing EIV approaches rely on strong assumptions about classification errors.

Robust regression methods are further extended for robust classification applications. A robust extension of Linear Discriminant Analysis (LDA) is proposed in [38], in which authors substituted their robust counterparts for the empirical estimation of the class mean vectors and covariance matrices. In [39, 40], authors proposed a worst-case Fisher Discriminant Analysis/Linear Discriminant Analysis (FDA/LDA) to increase the separating ability between classes in unbalanced sampling by minimizing the upper bounds of LDA cost function. It is worth noting that these classifiers are only robust to outliers among the subjects, yet not robust to the intrinsic noises in extracted features. To deal with this problem, Fidler and Leonardis [41] modified LDA and made it robust to outliers within subjects through application of PCA on the training data. A robustly estimated basis is computed to replace the minor PCA components, after which they combined these two bases into a new one, and then the data was projected onto the combined basis. Finally, the LDA is calculated in the training phase. In the testing phase, the coefficients of the test data on the recombined basis are estimated, and the class label(s) for the test data are determined by mapping the learned LDA on the estimated coefficients. This method can suppress the outliers outside the PCA subspace, but it cannot deal with the problem of learning LDA with outliers in the PCA subspace of the training data. Zhu and Martinez [42] extended the SVM algorithm, denoted as Partial Support Vector Machine (PSVM), to make it applicable for the cases with missing features in the subjects. They have indicated that their method is also robust to some levels of noises. Of note, the robust LDA [32] used in this study is an extension of the robust regression study, which is inspired by existing work on robust PCA [43]. 
It enjoys the following advantages compared to the aforementioned methods: 1) it is a convex approach; 2) apart from sparsity, no assumption is imposed on the noise in the data; and 3) it automatically cleans the noise in the features while learning a classifier. More implementation details about the robust LDA classifier are provided in Section 3.

## 3 Method

Figure 1 illustrates the pipeline for PD diagnosis in this paper. For each subject, we extract the WM and GM features from the T1 MR image. The features are then fed into the ICCA feature selection framework. The WM/GM feature vectors are transformed onto a common feature space using CCA, where their canonical representations are computed. Next, we build a regression model to fit the PD/NC labels of the training subjects based on the canonical representations of their features. The regression model yields the coefficients that describe the contributions of the canonical representations, from which the importance of the original WM/GM features can be computed. We then conduct conservative selection upon the WM/GM features, by eliminating only the least important features. The remaining WM/GM features are fed into the next iteration of ICCA feature selection, where they are transformed to the new common space updated by CCA and then selected accordingly. This iterative procedure ends when only a small set of optimal features remains. Finally, we build a robust classifier based on the selected features, and apply it to diagnose new test subjects with unknown PD/NC labels.

Figure 1. Pipeline of the automatic diagnosis system with the proposed ICCA based feature selection. We extract ROI based gray matter (GM) and white matter (WM) tissue volumes as features. The GM and WM feature vectors per subject (or matrices for the dataset) are transformed into the canonical space by CCA. Then, we use the regression model to map the canonical representations, corresponding to the features transformed into the canonical space, toward the PD/NC labels. Finally, the regression coefficients are transformed back into the original GM and WM feature spaces, where feature selection is conducted accordingly. The above procedure is iterated, as the finally selected features are used for classification-based PD diagnosis.

- Canonical Correlation Analysis (CCA)

CCA is a method to exploit the correlated relationship between two multi-dimensional feature sets. Specifically, CCA intends to find a pair of basis vectors, such that the correlation between the two sets of feature vectors is maximized after they are transformed following the basis vectors, respectively. The transformed features are highly correlated and thus perceived to be within the common feature space [15].

Consider two multivariate feature vectors of the form $(x^1, x^2)$. Suppose we are given $n$ subjects $\{(x_1^1, x_1^2), \ldots, (x_n^1, x_n^2)\}$. Let $X_1$ denote $[x_1^1, \ldots, x_n^1]'$ and $X_2$ denote $[x_1^2, \ldots, x_n^2]'$, with one subject per row. Define a transform matrix $B_1$ to project $X_1$, and $B_2$ to project $X_2$. The projected $X_1$ and $X_2$ are computed as $Z_1 = X_1 B_1$ and $Z_2 = X_2 B_2$. The new representations $Z_1$ and $Z_2$ are called the canonical representations of $X_1$ and $X_2$, respectively. In CCA particularly, we determine the pair of $B_1$ and $B_2$ that maximizes the correlation between the canonical representations $Z_1$ and $Z_2$, which can be formulated as follows:

$$\rho = \max_{B_1, B_2} \operatorname{corr}(X_1 B_1, X_2 B_2) = \max_{B_1, B_2} \frac{\langle X_1 B_1, X_2 B_2 \rangle}{\left\| X_1 B_1 \right\| \left\| X_2 B_2 \right\|} \tag{1}$$

If we use $\hat{E}[f(x^1, x^2)]$ to denote the empirical expectation of the function $f(x^1, x^2)$, i.e., $\hat{E}[f(x^1, x^2)] = \frac{1}{n} \sum_{i=1}^{n} f(x_i^1, x_i^2)$, then the correlation expression can be rewritten as:

$$\rho = \max_{B_1, B_2} \frac{\hat{E}\left[\langle B_1, x^1 \rangle \langle B_2, x^2 \rangle\right]}{\sqrt{\hat{E}\left[\langle B_1, x^1 \rangle^2\right] \hat{E}\left[\langle B_2, x^2 \rangle^2\right]}} = \max_{B_1, B_2} \frac{\hat{E}\left[B_1' x^1 x^{2\prime} B_2\right]}{\sqrt{\hat{E}\left[B_1' x^1 x^{1\prime} B_1\right] \hat{E}\left[B_2' x^2 x^{2\prime} B_2\right]}} = \max_{B_1, B_2} \frac{B_1' \hat{E}\left[x^1 x^{2\prime}\right] B_2}{\sqrt{B_1' \hat{E}\left[x^1 x^{1\prime}\right] B_1 \; B_2' \hat{E}\left[x^2 x^{2\prime}\right] B_2}} \tag{2}$$

Now note that the covariance matrix of $(x^1, x^2)$ is:

$$C(x^1, x^2) = \hat{E}\left[\begin{pmatrix} x^1 \\ x^2 \end{pmatrix} \begin{pmatrix} x^1 \\ x^2 \end{pmatrix}'\right] = \begin{pmatrix} C_{x^1 x^1} & C_{x^1 x^2} \\ C_{x^2 x^1} & C_{x^2 x^2} \end{pmatrix} = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} \tag{3}$$

Hence, we can rewrite the function $\rho$ as:

$$\rho = \max_{B_1, B_2} \frac{B_1' \Sigma_{12} B_2}{\sqrt{B_1' \Sigma_{11} B_1 \; B_2' \Sigma_{22} B_2}} \tag{4}$$

The maximal canonical correlation coefficient is, equivalently, the maximum of $\rho$ with respect to $B_1$ and $B_2$.

Given $n$ subjects, their $d$-dimensional feature vectors are stored as individual rows in $X_1 \in \mathbb{R}^{n \times d}$ and $X_2 \in \mathbb{R}^{n \times d}$, which correspond to the views of WM and GM, respectively, in our application. The class labels of the subjects are stored in $y \in \mathbb{R}^{n \times 1}$, where each entry is either “1” (PD) or “0” (NC) to indicate the class that a subject is associated with. Let $X = [X_1, X_2] \in \mathbb{R}^{n \times 2d}$, and $\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}$ be its covariance matrix. CCA finds the basis matrices $B_1 \in \mathbb{R}^{d \times d}$ and $B_2 \in \mathbb{R}^{d \times d}$ that maximize the correlation between the transformed $X_1$ and $X_2$. The two basis matrices $B_1$ and $B_2$ can be estimated by solving the following optimization problem:

$$(B_1, B_2) = \arg\max_{B_1, B_2} \frac{B_1' \Sigma_{12} B_2}{\sqrt{B_1' \Sigma_{11} B_1 \; B_2' \Sigma_{22} B_2}} \tag{5}$$

Note that the generalized eigen-decomposition in [9] gives the optimal solution of $(B_1, B_2)$. Therefore, the canonical representations of all features in the common space are obtained by $Z_m = X_m B_m$, $m \in \{1, 2\}$.
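Assuming subjects are stored as rows of $X_1, X_2 \in \mathbb{R}^{n \times d}$, the CCA bases can be computed by whitening each view and taking an SVD of the cross-covariance, which is equivalent to the generalized eigen-decomposition above. The following NumPy sketch is our own illustration; the ridge term `reg` is an assumption added for numerical stability and is not part of the paper's formulation:

```python
import numpy as np

def cca(X1, X2, reg=1e-6):
    """Estimate CCA bases B1, B2 and canonical correlations for two views.

    X1, X2 : (n, d) matrices with one subject per row (e.g., WM and GM views).
    reg    : small ridge term keeping the covariance blocks invertible
             (our own addition, not from the paper).
    """
    X1 = X1 - X1.mean(axis=0)
    X2 = X2 - X2.mean(axis=0)
    n = X1.shape[0]
    S11 = X1.T @ X1 / n + reg * np.eye(X1.shape[1])
    S22 = X2.T @ X2 / n + reg * np.eye(X2.shape[1])
    S12 = X1.T @ X2 / n

    def inv_sqrt(S):
        # S^{-1/2} for a symmetric positive-definite matrix
        vals, vecs = np.linalg.eigh(S)
        return vecs @ np.diag(vals ** -0.5) @ vecs.T

    W1, W2 = inv_sqrt(S11), inv_sqrt(S22)
    U, s, Vt = np.linalg.svd(W1 @ S12 @ W2)
    return W1 @ U, W2 @ Vt.T, s   # B1, B2, canonical correlations

# toy usage: two views sharing a 3-dimensional latent signal
rng = np.random.default_rng(0)
Lat = rng.standard_normal((60, 3))
X1 = Lat @ rng.standard_normal((3, 8)) + 0.1 * rng.standard_normal((60, 8))
X2 = Lat @ rng.standard_normal((3, 8)) + 0.1 * rng.standard_normal((60, 8))
B1, B2, s = cca(X1, X2)
```

By construction, the correlation between the first pair of canonical variates equals the leading canonical correlation coefficient, and the coefficients are returned in descending order.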

Next, we build a sparse linear regression model based on the canonical representations, aiming to fit the class labels with the latest canonical representations. Specifically, non-zero weights are assigned to a limited number of the canonical representations by solving:

$$w = \arg\min_{w} \frac{1}{2} \left\| y - Z w \right\|_F^2 + \beta \left\| w \right\|_1 + \gamma \left\| w \right\|_{CCA}^2 \tag{6}$$

Here, $Z = [Z_1, Z_2] \in \mathbb{R}^{n \times 2d}$ is the canonical representation matrix, $w = [w^1; w^2] \in \mathbb{R}^{2d \times 1}$ is the regression coefficient vector, and $\beta$ and $\gamma$ are trade-off scalar parameters. The $l_1$-norm penalty $\| w \|_1$ tends to yield sparse coefficients for the canonical representations, and $\| w \|_{CCA}$ denotes the following canonical regularizer [15]:

$$\left\| w \right\|_{CCA}^2 = \sum_{i=1}^{d} \frac{1 - \lambda_i}{\lambda_i} \left( w_i^2 + w_{i+d}^2 \right) \tag{7}$$

Note that $\{\lambda_1, \ldots, \lambda_d\}$ denotes the set of canonical correlation coefficients, and $w_i$ and $w_{i+d}$ are the weights corresponding to the same feature index (and thus ROI) of the two views (i.e., GM and WM). In the feature selection process, canonical representations that are more correlated across the two views, i.e., with larger canonical correlation coefficients, tend to be selected, and vice versa.
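The objective of Eq. (6) has a smooth part (the squared loss plus the quadratic canonical regularizer) with an easily bounded Lipschitz constant, plus an $l_1$ term, so a simple proximal-gradient (ISTA) loop can solve it. The sketch below is our own illustration, not the authors' implementation; the step size, iteration count, and all parameter values are assumptions:

```python
import numpy as np

def ista_canonical(Z, y, lam, beta=0.1, gamma=0.1, n_iter=500):
    """Proximal-gradient (ISTA) sketch for the objective of Eq. (6).

    Z   : (n, 2d) canonical representation matrix [Z1, Z2]
    y   : (n,) class-label vector
    lam : (d,) canonical correlation coefficients lambda_i in (0, 1]
    """
    d = Z.shape[1] // 2
    # regularizer weights (1 - lambda_i) / lambda_i, shared by w_i and w_{i+d}
    c = np.concatenate([(1.0 - lam) / lam, (1.0 - lam) / lam])
    # Lipschitz constant of the smooth part: ||Z||_2^2 + 2*gamma*max(c)
    L = np.linalg.norm(Z, 2) ** 2 + 2.0 * gamma * c.max()
    w = np.zeros(2 * d)
    for _ in range(n_iter):
        grad = Z.T @ (Z @ w - y) + 2.0 * gamma * c * w     # gradient of smooth part
        v = w - grad / L
        w = np.sign(v) * np.maximum(np.abs(v) - beta / L, 0.0)  # l1 prox (soft threshold)
    return w

# hypothetical toy problem: d = 5 canonical pairs, n = 40 subjects
rng = np.random.default_rng(1)
Z = rng.standard_normal((40, 10))
y = np.sign(Z[:, 0] + 0.5 * Z[:, 5])
lam = np.full(5, 0.8)          # assumed canonical correlations
w = ista_canonical(Z, y, lam, beta=0.5, gamma=0.1)
```

With step size $1/L$, each ISTA iteration is guaranteed not to increase the objective, so the returned coefficients attain a lower objective value than the all-zero initialization.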

It is worth noting that the canonical regularizer favors canonical representations that are highly correlated across the GM/WM feature views. Besides, a higher $\lambda_i$ comes with increases of $w_i$ and $w_{i+d}$ after optimization, since the penalty weight $(1-\lambda_i)/\lambda_i$ on those coefficients becomes smaller. In conventional methods, the feature matrix in the original space is generally used to build the regression model; the relationship between the feature matrix and the output is identified, and this relationship is used to select features. These methods are able to select significant features and remove redundant ones. However, when there are complex relationships among the features, it is hard to discard redundant features. CCA-based feature selection instead uses the canonical representations as regressors, which are highly correlated across the two views but mutually independent within each view. Therefore, it is more convenient to identify redundant information in the canonical space, and CCA-based feature selection is thus more predictive than the traditional sparse feature selection methods [15].

- ICCA-based Feature Selection

As stated in Section 3.2, CCA-based feature selection is more powerful in predicting the response variables (i.e., the PD/NC labels) than conventional sparse feature selection methods, since CCA considers the correlation between the two views of features, as well as the correlation between features and response variables. Despite these advantages, its performance might still be limited by the inaccurate one-shot estimation of the common feature space. As all features are linearly transformed to the common space and selected there, a single transformation might be too rough to identify and preserve the optimal features. Therefore, we propose a novel ICCA feature selection method, in which we iteratively update the estimation of the common feature space and discard the most irrelevant features. In this way, we are able to acquire the best features by approximating a locally linear operation upon the input feature vectors.

From Eq. (6), we obtain the regression coefficient vector $w = [w^1; w^2]$ in the tentatively estimated common (canonical) space, which encodes the weights of the canonical representations $Z = [Z_1, Z_2]$. Since $Z_m = X_m B_m$, $m \in \{1, 2\}$, the canonical representations $Z$ are linear combinations of the original WM/GM features $X = [X_1, X_2]$, warped by $B = [B_1, B_2]$. Therefore, the weights in $w$ learned from Eq. (6) are also linearly related to the importance of the original features $X$. The importance, or weights, of the original WM/GM features can then be computed by $\tilde{w}^m = (B_m)^{-1} w^m$, $m \in \{1, 2\}$, where $\tilde{w}^m$ records the weights of the $m$-th view of the original features for discriminating PD from NC. Given the estimated $\tilde{w}^1$ and $\tilde{w}^2$, the least important WM and GM features in the original feature space can be eliminated, respectively. With the updated WM/GM features in the original space, CCA is applied to update the common feature space and transform the updated features into refined canonical representations accordingly. This transforming-eliminating scheme is executed iteratively, until the number of iterations exceeds a pre-defined threshold or the desired number of features is obtained.

In conclusion, our proposed ICCA-based feature selection method consists of the following steps:

1. Compute the basis matrices $B_1, B_2$ using CCA via Eq. (5);
2. Transform the original feature matrices $X_1, X_2$ into the canonical space ($Z_1, Z_2$) by $Z_m = X_m B_m$, $m \in \{1, 2\}$;
3. Compute the coefficient vectors $w^1, w^2$ in the canonical space, stored in $w = [w^1; w^2]$, using Eq. (6);
4. Compute the feature importance vectors in the original feature space (i.e., $\tilde{w}^1$ and $\tilde{w}^2$) by $\tilde{w}^m = (B_m)^{-1} w^m$;
5. Discard the features with the lowest importance in $\tilde{w}^1$ and $\tilde{w}^2$, respectively, and update the feature matrices $X_1, X_2$;
6. Repeat Steps 1 to 5 until the stopping criterion is attained.
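The transform-eliminate loop above can be sketched end-to-end as follows. This is a simplified illustration, not the paper's implementation: ridge regression stands in for the sparse model of Eq. (6), and the compact `_cca` helper, the per-iteration drop schedule, and all parameter values are our own assumptions:

```python
import numpy as np

def _cca(X1, X2, reg=1e-4):
    # compact CCA: whiten each centered view, then SVD the cross-covariance
    X1 = X1 - X1.mean(0); X2 = X2 - X2.mean(0); n = len(X1)
    def isqrt(S):
        vals, vecs = np.linalg.eigh(S)
        return vecs @ np.diag(vals ** -0.5) @ vecs.T
    W1 = isqrt(X1.T @ X1 / n + reg * np.eye(X1.shape[1]))
    W2 = isqrt(X2.T @ X2 / n + reg * np.eye(X2.shape[1]))
    U, s, Vt = np.linalg.svd(W1 @ (X1.T @ X2 / n) @ W2)
    return W1 @ U, W2 @ Vt.T, s

def icca_select(X1, X2, y, n_keep=5, drop_per_iter=3, reg=1e-4):
    """Transform-eliminate loop of ICCA (ridge stands in for Eq. (6))."""
    idx1, idx2 = np.arange(X1.shape[1]), np.arange(X2.shape[1])
    yc = y - y.mean()
    while len(idx1) > n_keep:
        V1 = X1[:, idx1] - X1[:, idx1].mean(0)
        V2 = X2[:, idx2] - X2[:, idx2].mean(0)
        B1, B2, _ = _cca(V1, V2, reg)                        # Step 1
        Z = np.hstack([V1 @ B1, V2 @ B2])                    # Step 2
        w = np.linalg.solve(Z.T @ Z + reg * np.eye(Z.shape[1]), Z.T @ yc)  # Step 3
        d = len(idx1)
        imp1 = np.abs(np.linalg.solve(B1, w[:d]))            # Step 4: (B1)^{-1} w1
        imp2 = np.abs(np.linalg.solve(B2, w[d:]))
        k = min(drop_per_iter, len(idx1) - n_keep)
        idx1 = np.sort(idx1[np.argsort(imp1)[k:]])           # Step 5: drop least important
        idx2 = np.sort(idx2[np.argsort(imp2)[k:]])
    return idx1, idx2

# toy run with random features (sizes are illustrative only)
rng = np.random.default_rng(2)
X1 = rng.standard_normal((80, 15))
X2 = rng.standard_normal((80, 15))
y = (X1[:, 0] + X2[:, 0] > 0).astype(float)
sel1, sel2 = icca_select(X1, X2, y)
```

Dropping only a few features per iteration before re-estimating the common space mirrors the conservative elimination described above.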

- Robust Linear Discriminant Analysis Classification

As previously stated in Section 2.2, in this paper we incorporate the RLDA technique for the PD/NC classification task, using the optimal feature subset selected by the proposed ICCA method. RLDA can be seen as a special case of the robust regression (RR) method [32], since binary classification can be treated as a special case of regression.

Let $\tilde{X} \in \mathbb{R}^{n \times \tilde{d}}$ be the input matrix, which consists of $\tilde{d}$-dimensional samples with noise. LDA learns a linear transformation that maximizes the inter-class variance while minimizing the intra-class variance. However, when learning upon high-dimensional data, LDA cannot overcome the small-sample-size problem. To address this, LDA can be formulated as a least-squares (LS) problem [32], in which $\tilde{X}$ is directly mapped to the class labels represented by an indicator vector. LS-LDA thus reduces to:

$$\arg\min_{t} \frac{\eta}{2} \left\| (y^T y)^{-\frac{1}{2}} (y - \tilde{X} t) \right\|_2^2 \tag{8}$$

The normalization factor $(y^T y)^{-1/2}$ compensates for the different sample sizes per class, and $t \in \mathbb{R}^{\tilde{d} \times 1}$ represents a regression vector. However, when $\tilde{X}$ is corrupted by noise, LS-LDA suffers from a biased estimation of $t$. Robust LDA (RLDA) partially resolves this problem by explicitly decomposing the data matrix into a noise-free data matrix and an error matrix.

In RLDA, the data is modeled as $\tilde{X} = D + E$, where $D$ is the underlying noise-free (clean) component and $E$ contains the noise or error in the data, so that the noise can be partially accounted for and removed. The formulation of RLDA is given in Eq. (9), where $H = I_n - \mathbf{1}\mathbf{1}^T / n$ is a centering matrix and $t \in \mathbb{R}^{\tilde{d} \times 1}$ is the mapping vector learned from the centered clean data $HD$ only. From Eq. (9), we can see that in RLDA $t$ maps $\tilde{X}$ to fit the class labels in $y \in \mathbb{R}^{n \times 1}$, while $\tilde{X}$ is decomposed into $D$ and $E$; the mapping $t$ is thus computed using the noise-free data $D$, which yields the desired effect of partially decreasing the influence of noise in $\tilde{X}$:

$$\arg\min_{t, D, E} \frac{\eta}{2} \left\| (y^T y)^{-\frac{1}{2}} (y - H D t) \right\|_2^2 + \left\| D \right\|_* + \lambda \left\| E \right\|_1 \quad \text{s.t. } \tilde{X} = D + E \tag{9}$$

Since $\tilde{X}$ is decomposed into the clean (noise-free) data $D$ and the noise $E$, and the mapping vector $t$ is learned only from the clean data, RLDA, unlike traditional methods (e.g., LDA, LS-LDA), avoids projecting the noise $E$ during estimation, and thus leads to a relatively unbiased, noise-free estimation. After $t$ is learned in the training phase, a test sample $\tilde{x}_{test}$ is projected by $t$, and a $k$-nearest-neighbor ($k$-NN) algorithm is used to determine its class label. The above RLDA formulation can be solved by the augmented Lagrangian multiplier method; for more details, please refer to [32].

## 4 Experiments

In this section, we apply the proposed ICCA method to PD/NC diagnosis to demonstrate its validity. We employ the Parkinson’s Progression Markers Initiative (PPMI) dataset for the evaluations. The T1-weighted MR images were acquired using a 3T SIEMENS MAGNETOM TrioTim Syngo scanner. There are 112 subjects (56 PD and 56 NC) randomly selected from the PPMI dataset for this study. The acquisition parameters are as follows: acquisition matrix = 256×256 mm^{2}, 176 slices, voxel size = 1×1×1 mm^{3}, echo time (TE) = 2.98 ms, repetition time (TR) = 2300 ms, inversion time (TI) = 900 ms, and flip angle = 9°.

All T1-weighted MR images are pre-processed by skull stripping [44], cerebellum removal, and tissue segmentation [45] into WM and GM. Then, the automated anatomical labeling (AAL) atlas with 90 pre-defined ROIs is registered to the native space of each subject using FLIRT [46], followed by HAMMER [47]. For each ROI, we estimate the WM/GM tissue volumes as the WM/GM features, respectively. In this way, we extract 90 WM and 90 GM features for each subject. The features are naturally grouped into two vectors, to be handled by the ICCA based feature selection and then the RLDA based classification.
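As an illustration, the per-ROI volume features can be computed as below, assuming a binary tissue mask (WM or GM) and an integer AAL label map already registered to the subject's native space; the function and argument names are hypothetical:

```python
import numpy as np

def roi_tissue_volumes(tissue_mask, atlas_labels, n_rois=90, voxel_volume=1.0):
    """Compute per-ROI tissue volume features.
    tissue_mask: boolean array marking WM (or GM) voxels from segmentation.
    atlas_labels: integer array of the same shape with AAL ROI labels
    1..n_rois (0 = background), assumed aligned to the subject space."""
    feats = np.zeros(n_rois)
    for r in range(1, n_rois + 1):
        # volume = number of tissue voxels inside ROI r, scaled by voxel size
        feats[r - 1] = np.count_nonzero(tissue_mask & (atlas_labels == r)) * voxel_volume
    return feats
```

With a 1×1×1 mm^{3} voxel size the volume equals the voxel count, so each subject yields one 90-dimensional vector per tissue type.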

### 4.1 Evaluation Protocol

In order to evaluate our proposed method, we compare it with state-of-the-art feature selection methods, including PCA [11], FSASL (unsupervised feature selection with adaptive structure learning) [48], CCA [15], sparse feature selection [7], Elastic Net [49], and the robust feature selection method [14]. To this end, we use the same nested cross-validation to evaluate the performance of each method in classifying PD patients from NC. Specifically, in the outer layer of the nested cross-validation, an 8-fold scheme is adopted to partition the dataset into 7 training folds and 1 testing fold. In the inner layer, we further conduct a 10-fold analysis on the 7 training folds. The inner 10-fold cross-validation is used to tune the parameters automatically. With a fixed subset of features selected in the inner 10-fold cross-validation, the diagnostic ability is tested in the outer 8-fold cross-validation. Note that with 12 CPU cores (2.76GHz) and 48 GB of memory, it takes about 2 seconds for each inner loop, 1 hour for the training stage with parallel computing, and about 25 seconds to test a new subject. The implementations are all done in MATLAB.
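The nested cross-validation protocol above can be sketched as follows, with a toy nearest-centroid classifier standing in for the RLDA + k-NN pipeline and a hypothetical "number of kept features" as the tuned parameter:

```python
import numpy as np

def nearest_centroid_predict(Xtr, ytr, Xte):
    # toy stand-in for the RLDA + k-NN classifier used in the paper
    c0 = Xtr[ytr == 0].mean(axis=0)
    c1 = Xtr[ytr == 1].mean(axis=0)
    d0 = np.linalg.norm(Xte - c0, axis=1)
    d1 = np.linalg.norm(Xte - c1, axis=1)
    return (d1 < d0).astype(int)

def nested_cv_accuracy(X, y, outer_k=8, inner_k=10, params=(1, 5, 10), seed=0):
    # outer loop estimates accuracy; inner loop tunes the parameter
    # on the training folds only, so the test fold is never touched
    rng = np.random.default_rng(seed)
    outer_folds = np.array_split(rng.permutation(len(y)), outer_k)
    accs = []
    for i in range(outer_k):
        test = outer_folds[i]
        train = np.concatenate([f for j, f in enumerate(outer_folds) if j != i])
        best_p, best_acc = params[0], -1.0
        inner_folds = np.array_split(train, inner_k)
        for p in params:  # hypothetical "number of kept features"
            inner = []
            for m in range(inner_k):
                val = inner_folds[m]
                tr = np.concatenate([f for j, f in enumerate(inner_folds) if j != m])
                pred = nearest_centroid_predict(X[tr][:, :p], y[tr], X[val][:, :p])
                inner.append(np.mean(pred == y[val]))
            if np.mean(inner) > best_acc:
                best_acc, best_p = np.mean(inner), p
        pred = nearest_centroid_predict(X[train][:, :best_p], y[train], X[test][:, :best_p])
        accs.append(np.mean(pred == y[test]))
    return float(np.mean(accs))
```

The key design point is that the parameter chosen inside each inner loop is applied only to the corresponding held-out outer fold, which keeps the reported accuracy unbiased.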

### 4.2 Comparisons with the State-of-the-Art Feature Selection Methods

In this study, we compare the PD/NC classification performance of all the configurations along with the available alternative methods. Specifically, we compare our ICCA-based feature selection with several popular feature selection or feature reduction methods, including PCA [11], FSASL (unsupervised feature selection with adaptive structure learning) [48], CCA [15], sparse feature selection [7], Elastic Net [49], and the robust feature selection method [14]. No feature selection (NoFS) is also compared, where all features are input to the RLDA classifier without any feature selection. Elastic Net is implemented with the MATLAB function “lasso”. For the sparse feature selection method, we use the l2,1-norm regularization instead of the l1-norm regularization of the traditional lasso regression model. In PCA-based feature selection, only the top 1% of principal components are preserved, which results in much fewer yet more powerful features. The details of the one-shot CCA feature selection are described in [15]. The details of the FSASL method can be found in [48], where the code is also available online.

Alternatively, we also measure the performance when selecting upon the canonical representations rather than the original WM/GM features. Stated succinctly, in every iteration we directly eliminate the least important canonical representations, instead of applying the inverse-transform and elimination steps to the original WM/GM features. The remaining canonical representations are used for re-estimating the new common space by CCA and are then further selected. We refer to this configuration as “cascaded-CCA”, in which several CCA-based feature selection steps are cascaded rather than iteratively executed.
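For reference, the CCA mapping to the common space used by both configurations can be sketched as a small regularized CCA on centered views (the `reg` stabilizer and the function name are our assumptions, not the paper's implementation):

```python
import numpy as np

def cca(X, Y, reg=1e-6):
    """Minimal regularized CCA sketch. X, Y: centered views (n x dx, n x dy).
    Returns projection matrices Wx, Wy and the canonical correlations."""
    n = X.shape[0]
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n
    # whiten each view via Cholesky factors, then SVD the cross-covariance
    Lx = np.linalg.cholesky(Cxx)
    Ly = np.linalg.cholesky(Cyy)
    M = np.linalg.solve(Lx, Cxy) @ np.linalg.inv(Ly).T
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    Wx = np.linalg.solve(Lx.T, U)      # canonical directions for view X
    Wy = np.linalg.solve(Ly.T, Vt.T)   # canonical directions for view Y
    return Wx, Wy, s
```

Projecting the WM and GM feature vectors with Wx and Wy gives the canonical representations discussed above; their pairwise correlations are the entries of s.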

The classification results are presented in Table 1, which shows that the proposed ICCA based feature selection method achieves better performance than the competing methods. In particular, our method improves the baseline (NoFS) by 11.5% in ACC and 16% in AUC. Moreover, compared to feature selection through Elastic Net, PCA, SFS, CCA, FSASL, and RFS, our method achieves 6.2%, 5.3%, 5.4%, 4.4%, 5.2% and 4.4% accuracy improvements, respectively. It also achieves 6.5%, 11.3%, 6.6%, 6.7%, 4.7%, and 7.1% AUC improvements over Elastic Net, PCA, SFS, CCA, FSASL, and RFS, respectively. It can further be observed from the last two rows of Table 1 that the proposed ICCA based feature selection outperforms the cascaded-CCA scheme. We attribute this to the fact that the CCA mapping to the common space is unsupervised and that, after the CCA mapping, the two feature vectors become highly correlated in the canonical space. This leads to redundancy in the canonical representations, which can mislead the feature selection and thus degrade the subsequent classification. This limitation is resolved in the ICCA strategy by inversely transforming the representations back to the original feature space before conducting feature selection.

**Table 1.** PD/NC classification comparison (ACC: Accuracy; AUC: Area Under ROC Curve; SEN: Sensitivity; SPE: Specificity).

| Method | ACC(%) | AUC(%) | SEN(%) | SPE(%) |
| --- | --- | --- | --- | --- |
| NoFS | 58.0 | 55.1 | 32.5 | 82.5 |
| Elastic Net | 64.3 | 64.6 | 54.1 | 87.8 |
| SFS (with l2,1-norm) | 65.1 | 64.5 | 57.1 | 73.2 |
| PCA | 65.2 | 59.8 | 48.2 | 82.1 |
| FSASL | 65.3 | 66.4 | 56.2 | 77.6 |
| CCA | 66.1 | 64.4 | 53.6 | 78.6 |
| RFS | 66.1 | 63.9 | 53.6 | 89.3 |
| MRMR | 66.7 | 65.6 | 52.9 | 61.7 |
| Cascaded-CCA (Proposed) | 68.8 | 69.3 | 62.5 | 71.6 |
| ICCA (Proposed) | 70.5 | 71.1 | 62.5 | 78.6 |

### 4.3 The Most Discriminant Regions

Recently, many studies have investigated the WM/GM changes in PD patients and identified several significant brain regions that are most affected by PD. For example, WM lesions are associated with certain motor and cognitive deficits in PD [50]. Significantly increased WM density in the occipital lobes, posterior cingulate gyrus, and paracentral lobule has been found in PD with olfactory dysfunction compared with normal controls (NC) [51]. GM volumes have been found to be significantly decreased in patients with PD compared with NC [52], and reduced GM volumes in PD patients are observed in the temporal lobe, occipital lobe, right frontal lobe, left parietal lobe, and some subcortical regions [53].

In this section, we evaluate the optimal feature subsets selected by ICCA, which are expected to serve as biomarkers of PD corresponding to their specific WM/GM regions. Note that we further retain the most discriminative regional features, i.e., those selected in at least 60% of the cross-validation folds. Figure 2 presents these selected regional features, which include the ‘precuneus’, ‘thalamus’, ‘hippocampus’, ‘superior temporal pole’, ‘postcentral gyrus’, ‘middle frontal gyrus’, and ‘medial frontal gyrus’.

It is worth noting that the most discriminative regional features shown in Figure 2 fit with the PD pathology, as also proven by previous clinical studies. As reported in [54, 55], PD patients reveal under-activation in the precuneus and temporal cortex, and the pathology in the thalamus also contributes to the abnormal neural activity characteristic of PD. In addition, temporal pole and hippocampal atrophy are reported as early signs of PD in [53, 56, 57]. In [58], bilateral areas of atrophy in the middle frontal gyrus and bilateral GM loss in the medial-superior frontal cortex are reported in PD patients. It is found in [59] that the areas of reduction in the absolute amount of GM are bilateral in the medial frontal gyrus, right precuneus, inferior parietal lobule, and so on. In [60], GM densities are found to be significantly decreased in the bilateral temporal, right postcentral, and some other brain regions in PD patients with dementia.

We also conduct t-tests for the statistical analysis of the GM/WM volumes computed from the 7 significant ROIs between PD and NC, with a p-value of 0.05 as the threshold for statistical significance. One interesting finding is that only three regions show significant differences between PD and NC, namely the ‘precuneus’ (p=0.0065), ‘hippocampus’ (p=0.0164), and ‘middle frontal gyrus’ (p=0.0429). This implies that the GM/WM volumes extracted from individual brain regions may not suffice for prognosis on their own. Instead, when combining them and applying efficient machine learning tools for the analysis, we can achieve high prognostic capability.
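The underlying two-sample statistic can be sketched in a few lines (equal-variance pooled form; in practice a library routine such as scipy.stats.ttest_ind would also return the p-value):

```python
import numpy as np

def two_sample_t(a, b):
    """Pooled two-sample t statistic for comparing ROI volumes
    between two groups (equal-variance form)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    n1, n2 = len(a), len(b)
    # pooled variance across the two groups
    sp2 = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
    return (a.mean() - b.mean()) / np.sqrt(sp2 * (1.0 / n1 + 1.0 / n2))
```

The statistic is then compared against the t distribution with n1 + n2 − 2 degrees of freedom to obtain the p-value thresholded at 0.05.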

Fig. 2. Visualization of the most discriminative ROIs used for automatic diagnosis of PD.

## 5 Conclusion

In this paper, we present a novel ICCA based feature selection method for PD diagnosis. Extending the CCA based feature selection approach, our proposed ICCA based feature selection method is not only capable of locally linear mapping, but is also able to explore the relationships among features and the underlying structure deeper in the feature space in an iterative manner. It is also worth noting that the two views of feature vectors in the canonical space (i.e., the canonical representations) become closer and closer as the iterations of the ICCA feature selection framework proceed. This property also results in a decreasing number of selected features. We conduct extensive PD/NC classification experiments to compare the proposed ICCA strategy with state-of-the-art approaches. The results show that our method achieves the highest accuracy, which demonstrates its validity in improving PD diagnosis using T1-weighted MR images.

As previously stated in Section 3, in each iteration of the ICCA method, we discard the least important features from GM and WM, respectively. Note that the number of discarded features is set conservatively in order to avoid missing features that may be important for PD/NC classification. A smarter feature-dropping scheme could be designed to optimize the whole feature selection framework in our future work. Besides, another limitation of our proposed ICCA based feature selection method is that it can only explore the relationships between two views of features at a time. Thus, we will also seek to extend the proposed method to handle more than two views of features simultaneously, which would make it more widely applicable in the feature selection domain.

Beyond further optimizing the efficiency and performance of the proposed ICCA based feature selection method, another direction of future work is improving the classification method for PD diagnosis, which is currently implemented with the RLDA classifier in this paper. Although RLDA achieves favorable classification performance, as validated in Section 4, it cannot model the nonlinear relationship between features and labels since it is a linear classifier. Therefore, using a nonlinear classifier would allow the proposed method to utilize the feature information in a different way, which may yield performance at least on par with, and possibly better than, the linear classifier. Finally, only the structural information of T1 MR images is used in this study. Thus, we will also turn to other modalities for PD diagnosis apart from T1-weighted MR images, such as diffusion tensor imaging (DTI) and functional MRI (fMRI). The integrated feature information from multi-modality images can provide more comprehensive details, which can be utilized by the proposed framework for better classification performance.