Multi-view Clustering: A Survey

**Abstrac****t**— In big data era, data are generated from different sources or observed from different views. These data are referred to as multi-view data. How to unlock the power of knowledge from multi-view data is very important in big data mining and analysis. This calls for advanced techniques that can consider the diversity of different views and meanwhile fuse these data. Aiming to exploit the complementary and consensus information across multiple views, Multi-view Clustering (MvC) has attracted increasing attention in recent years. This paper summarizes a large number of multi-view clustering algorithms. It provides a taxonomy according to the mechanisms and principles involved, classifying these algorithms into four categories: co-training style algorithms, multi-kernel learning, multi-view graph clustering, and multi-view subspace clustering. Thereinto, multi-view graph clustering is organized as graph-based, network-based, and spectral-based methods. Multi-view subspace clustering methods are further divided into subspace learning-based, and nonnegative matrix factorization-based methods. It also gives an insight into a high-level category, named Multi-task Multi-view Clustering (M2vC). This paper does not only introduce the mechanisms for each category of methods, but also give specific examples in which these techniques are used. In addition, it lists some publicly available multi-view datasets. Overall, this paper serves as an introductory text and survey to multi-view clustering.

- INTRODUCTION

N many real-world applications of big data mining and analysis, data are collected from diverse data collection sources or obtained from various feature extractors. For instance, pictures shared on websites often have corre- sponding textual tags and descriptions, the same news is reported by multiple news organizations, the same semantic meaning, e.g., hello, can be represented with multilingual forms, sensor signals are decomposed in time domain and frequency domain, an image is described by different types of features, e.g., SIFT, LBP and HOG etc. All these are referred to as *multi-view**data*. These data exhibit heteroge- neous properties but hold potentially connection. That is to say, in such data, each individual view has its specific property for a particular knowledge discovery task, but dif- ferent views often contain complementary information that should be exploited. So, how to exploit these information to uncover the potential values of multi-view data is very significant in big data research. Applications in real-life data analysis are also calls for advanced technologies that can deal with data objects with multiple views to bring data

I

mining and knowledge discovery to new heights.

In the past decade, a great many machine learning methods have been proposed to deal with multi-view data. Good surveys on multi-view learning are covered in [1], [2], [3]. Besides, Zheng [4] gave an overview on methodologies for multi-view (cross-domain) data fusion, in which some specific applications were discussed. Existing multi-view learning technologies are roughly divided into supervised learning and unsupervised learning. We denote that this paper focuses on one of the unsupervised learning tech-

niques, i.e., clustering. Clustering has emerged as an alterna- tive powerful meta-learning tool to explore the underlying structure of data [5], [6], especially in the era of big data [7]. The basic idea of clustering algorithms is to partition a set of data objects according to some criteria such that similar objects are grouped into the same cluster, and dissimilar objects are separated into different clusters.

As statistical machine learning matures, many advanced clustering algorithms have been investigated in the last few decades. Although these algorithms have been very successful to some extent, most of them are only suitable for dealing with single view data. Even concatenating all views into a single view and then adopting state-of-the- art clustering algorithms on this concatenated features may not improve the clustering performance as this concatena- tion is not physically meaningful because each view has a specific statistical property. In comparison, Multi-view Clustering (MvC) performs effectively on multi-view data by considering the diversity and complementarity of differ- ent views. The early studies on MvC began around 2003, such as reinforcement clustering for multi-type interrelated data [8], multi-view version of DBSCAN [9], and two-view version of EM-based and agglomerative algorithms [10], etc. As an advanced clustering paradigm, MvC has caught an increasing attention in recent years. So far, four workshops [11], [12], [13], [14] and a mini-symposium [15] have been held in conjunction with related international conferences. In the context of MvC, an inherent problem (is also the goal) that all methods have to deal with elaborately is how to maximize the within-view clustering quality, meanwhile, the clustering consistency across different views should be also taken into consideration. Besides, incomplete multi- view data, where some data objects could be missing their observation on one view (i.e., missing objects) or could be available only for their partial features on that view (i.e., missing feature), also brings challenges to MvC.

In this paper, we review a number of representative MvC methods. According to the mechanisms and principles

TABLE 1

Notations and Definitions

which these methods base on, we organize and summarize __ __

them with four categories:

Notations Definitions

*k* user-specified number of clusters the number of views

*Co-training**style**algorithms*: This category of methods

treats multi-view data with co-training strategy. It

bootstraps the clusterings of different views by using information (e.g., prior or learning knowledge) from one another. By iteratively applying this strategy, the

*x*

*m*

*n* t

*n**v* t*d**v*

*X*

*X**v*

*v*

clusterings of all views tend to each other, which *j*

*v*

*x*

*ij*

results in a broadest consensus across all views.

*Multi-kernel**learnin**g*: This category of methods uses predefined kernels corresponding to different views and then combines these kernels either linearly or non-linearly to improve clustering performance.*Multi-view**graph**clusterin**g*: This category of methods seeks to find a fusion graph (or network) across all views and then uses graph-cut algorithms or other technologies (e.g., spectral clustering) on the fusion graph to produce the clustering result.*Multi-view**subspace**clustering*: This category learns a unified feature representation (then be fed into a model for clustering) from all views’ feature sub- space by assuming that all views share this repre- sentation. Subspace learning and nonnegative matrix factorization (NMF) are typical models.

Besides, we give an insight into a high-level technologies, named Multi-task Multi-view Clustering (M2vC), which integrates multiple related tasks with multi-view data. For each category, we will provide specific introduction and example in the following sections. To help researchers in this field, we also list some widely used multi-view datasets.

The rest of this paper is organized as follows. In Section 2, we illustrate related principles, which ensure MvC suc- ceed. In Section 3, we focus on discussing existing research on MvC, where we give an overview of earlier and more recent MvC methods, and introduce specific examples for each category. Section 4 focuses on M2vC. Some publicly available datasets are covered in Section 5. In Section 6, we conclude this paper and discuss some challenges and future directions of MvC.

*Notations and Definition**s*: We begin with a description of the notations used in this paper. We state that matrices and vectors throughout this paper are written in uppercase and lowercase letters, respectively. The common notations and corresponding definitions are summarized in Table 1.

- PRINCIPLES ON MVC

This section is devoted to analyzing two significant princi- ples on MvC, i.e., *complementary*and *consensus*principles.

*I* i

__ ____1____ ____ __

*†*If multi-view data is complete, then *n*= *n**v*.

Fig. 1. Illustration of complementary and consensus principles [16].

(described by part B) of the object are shared by both views, i.e., the consensus between two views.

Next, we further give analyses on these two principles as follows:

*Complementary**principle*: It states that multiple views should be employed to describe the data objects more comprehensively and accurately. In the context of multi- view data, each single view is sufficient for a particular knowledge discover task, but different views often contain complementary information to each other. For example, in image processing, each image is described by different types of features, such as SIFT, LBP, and HOG, where SIFT is robust to image rotation, noise, illumination, LBP is a powerful texture feature, and HOG is sensitive to marginal information. Therefore, it is intuitive to exploit these mutu- ally complementary information underlying multiple views to describe those data objects, and provide deeper insights into the internal clusterings.

*Consensus**principl**e*: It aims to maximize the consistency across multiple distinct views. Based on probably approx- imately correct, Dasgupta et al. [17] gave a generalization error analysis for the consensus principle. Given a multi-

These two principles can partially answer why MvC is

view dataset *X*, which has two views *X*1

and *X*

2. Under

useful, what the underlying assumptions are, and above all how we should model and perform MvC.

Referring to [16], we give an illustration on these two principles as shown in Fig. 1. Suppose an data object has two views and is mapped into a latent space, it can be observed that: 1) some ingredients exist in the individual view, such as part A in view 1 and part C in view 2, i.e., the complementarity of two views, and 2) some ingredients

some mild assumptions, they demonstrated that the con-

nection between the consensus of two hypotheses on two views respectively and their error rates, formulated as the following inequality

*P*(*f*1 I= *f*2) ≥ *maxP**e**r**r*(*f*1)*,**P**e**r**r*(*f*2)*.* (1)

From this inequality, we conclude that the probability of a disagreement of two independent hypotheses is an

upper bound on the error rate of either hypothesis. Thus by minimizing the disagreement rate of the two hypotheses, the error rate of each hypothesis will be minimized. In other words, by maximizing the agreement (or consistence) rate of the two hypotheses, the error rate of each hypothesis will be minimized. This is called as *maximizing-consistence**policy*. Landmark technology is co-training [18], which is one of the earliest schemes to multi-view learning in semi-supervised learning setting. Co-training algorithm trains alternately to maximize the mutual consistency on two distinct views of the unlabeled data by using the learning or providing labeled data from one another. In terms of clustering, De Sa [19] pioneered a two-view spectral clustering algorithm, which creates a bipartite graph based on the minimizing- disagreement (the same concept to maximizing-consistency) idea. There are also many MvC algorithms in co-training fashion (summarized in Section 3.1).

To sum up, both complementary and consensus prin-

ciples play important roles in addressing the problem of MvC, and both of them should be kept in mind to take full advantage of multi-view data.

- MULTI-VIEW CLUSTERING

**3.1** Co-training style algorithms

Co-training style algorithms are investigated under the con- sideration of the consensus of multiple views. This category of methods intends to maximize the agreement across all views and arrive at a broadest consensus of them. The gen- eral procedure of the conventional co-training algorithms is shown in Fig. 2. It trains alternately to maximize the mutual agreement on two distinct views of the unlabeled data by using the prior information or learning knowledge from each other. The success of co-training mainly relies on three assumptions: 1) Sufficiency: each view is sufficient for the learning task on its own, 2) Compatibility: the target func- tions in both views predict the same labels for co-occurring features with high probability, and 3) Conditional inde- pendence: the views are conditionally independent given the learning label. While, the conditional independence assumption is usually too strong to satisfy in practice, and thus several weaker alternatives such as weak conditional dependence assumption [20], much weaker ”expansion” assumption [21], and the difference assumption [22] have been investigated. In addition, several extended versions of co-training were studied, co-EM [23], co-regularization [24], co-clustering [25], to name a few.

Scheffer [10] first studied MvC with co-training idea, and proposed two kinds of MvC algorithms for text data. One is a multi-view EM algorithm that alternates between the views. Another one is an agglomerative algorithm which is inspired by the co-training algorithm. As a result, Bickel and Scheffer [10] concluded that EM-based multi-view algo- rithm significantly outperforms the single-view algorithm while the agglomerative algorithm leads to negative results. They further studied estimation of mixture models with co- EM for multi-view data analysis [26], which is contributive to an understanding of mixture model estimation for multi- ple views by showing that co-EM algorithm is a special case of mixture model estimation. Besides, Tzortzis and Likas [27] put forward a weighted multi-view convex mixture model that assigns the weights to the views automatically via EM. Assuming that the true underlying clustering would assign a point to the same cluster irrespective of the view, Kumar and Daume´ [28] proposed a co-training approach for multi-view spectral clustering, where the clusterings of dif- ferent views are bootstrapped using information from one another. Kumar et al. [29] further proposed a co-regularized approach for multi-view spectral clustering, where they constructed an objective function that consists of the graph Laplacians from all views, and made regularizations on the eigenvectors of the Laplacians such that the resulting cluster structures would be consistent. Inspired by [29], Ye et al. [30] discussed co-regularized kernel k-means for MvC. This method automatically learns the weights of different view from data. In addition, a multiple view aware method with co-training strategy was investigated to cluster process executions (traces) [31], in which it considers that traces of an event log can be represented by multiple trace profiles, and adapts an iterative co-training strategy to the process mining setting. A co-regularized probabilistic latent seman- tic analysis (PLSA) model for MvC was developed in [32]. The central idea behind it is that the sample similarities in the topic space from one view should agree with those from another view. To address the challenge of partial map- ping between the views (i.e., incomplete views), multi-view constrained clustering via co-EM with pairwise constraints propagation were studied in [33], [34]. Given a set of pair- wise constraints in each view, it uses co-EM to iteratively estimate the propagation within each view, transfer the constraints across views, and update the clustering model, finally learns a unified model for all views.

MvC based on co-clustering (simultaneously cluster the objects and features) has also been investigated. For in- stance, Meng et al. [35] put forward a heterogeneous data

co-clustering approach, which not only extends fusion from

Data

View 1

View 2

Trainer 1

Cooperation

Trainer 2

Knowledge

Knowledge

two views to multiple views but also weights the features of multiple data sources. Based on matrix decomposition, Sun et al. [36] presented a proximal alternating linearized minimization algorithm, which can simultaneously decom- pose multiple data matrices into sparse row and column vectors, and link different views of data with a binary vector that is used to enforce the row clusters from all views to

Fig. 2. The general procedure of co-training

Most of the above-mentioned methods are designed for multi-view data in a semi-supervised learning setting. In an unsupervised learning (i.e., clustering) setting, Bickel and

be consistent. Simultaneously building similarity matrices, rather than set of clusters, between rows and columns of a data matrix, an architecture to learn co-similarities from multi-view datasets was designed in [37], which was paral- lelized afterwards in [38]. Assuming that similarity values

generated from the individual data are transferred from one view to the others resulting in a better clusterings of the data, Hussain and Bashir [39] extended the co-similarity

View 1

Kernel 1

Linear or non-linear Kernel

…

Unified

based architecture to handle multiple datasets with two

kinds of integration schemes (i.e., intermediate integration

Data

View 2 Kernel 2

learning

kernel

and late integration). In addition, several collaborative MvC approaches were studied in [40], [41]. This type of ap-

…

View m

Kernel m

proaches consist of two phases: local and collaboration. The local phase would apply a clustering algorithm on each view. The collaboration phase collaborates each view with clustering findings associated to other views obtained from the local phase.

*Example**1*: As illustrated in Fig. 3 (using two views for brevity), Kumar and Daume´ [28] first applied the idea of co-training to the problem of multi-view spectral clustering. There is no labeled data in unsupervised learning problems unlike semi-supervised learning, so the prototypical co- training algorithms are not available for MvC directly. How- ever, the motivation of co-training still remains the same as in unsupervised learning problems: to limit the search to hypotheses (clusterings) that agree with those in other views. Assuming that the true underlying clustering would assign a point to the same cluster irrespective of the views as done in most of those co-training based MVC approaches, Kumar and Daume´ [28] took the spectral embedding from one view to constrain the similarity graph used for the other view. By iteratively applying this process, the clusterings of the two views tend to each other.

1: Solve spectral clustering on individual graphs to get the discriminative eigenvectors in each view, say *U*1 and *U*2.

2: Cluster data points using *U*1 and use this clustering to modify the graph structure in view 2.

3: Cluster data points using *U*2 and use this clustering to modify the graph structure in view 1.

4: Go to Step 1 and repeat for a number of iterations.

Fig. 3. A co-training approach for multi-view spectral clustering

**3.2** Multi-kernel learning

Multi-kernel learning was originally developed to boost the search space capacity of possible kernel functions, (e.g., Linear kernel, Polynomial kernel, and Gaussian kernels) to achieve good generalization. As kernels in multi-kernel learning naturally correspond to different views, it has been widely applied to deal with multi-view data. The general procedure of multi-kernel learning approaches is shown in Fig. 4, where it uses different predefined kernels to deal with different views, and then combines these kernels either linearly or non-linearly to arrive at a unified kernel. In MvC setting, multi-kernel learning based MvC intends to optimally combine a group of predefined kernels to improve clustering performance. While, in such methods, an essen- tial problem is how to choose the suitable kernel functions and combine these kernels optimally.

In single view scenario, based on maximum margin clustering [42], Zhao et al. [43] presented a multiple kernel clustering algorithm, which can simultaneously find the

Fig. 4. The general procedure of multi-kernel learning

maximum margin hyperplane, the best clustering labels and the optimal kernels. Du et al. [44] performed a robust K- means (with *l*2*,*1-norm) on kernel space, and further pro- posed a multiple kernel K-means algorithm, which is able to simultaneously find the best clustering labels, the cluster membership and the optimal combination of multiple ker- nels. It is worth stressing that this type of above mentioned algorithms is available to deal with multi-view data under the framework shown in Fig. 4. In multi-view scenario, de Sa et al. [45] constructed a custom kernel combination method based on minimizing-disagreement algorithm [46], [47]. Specifically, they generated a multi-partite graph (with the number of parts given by the number of views) to induce a kernel, which is then used for spectral clustering. In fact, this method can be seen as a generalization of co-clustering, spectral clustering, and a relative of kernel cannonical correlation analysis. Besides, Yu et al. [48] ex- tended the classical K-means clustering into Hilbert space, where multi-view data are represented as kernel matrices and combined automatically for data fusion. Similar work has also been done in [49], with the difference that the kernels are combined in a localized way to better capture the sample-adaptive characteristics of data. Rather than extending the existing clustering algorithms with a multi- kernel learning setting, Lu et al. [50] studied multiple kernel clustering based on centered kernel alignment, an effective kernel evaluation measure, which is employed to unify two tasks of clustering and multi-kernel learning into a single optimization framework.

The methods with a weighted combination of the kernels have also been studied under considering the difference of views (or kernels). For instances, kernel based weighted MvC was investigated in [51], where weights assigned to kernels are indicative of the quality of the corresponding views’ information. A systemic MvC approach was pro- posed to automatically select weights for learning the kernel matrix on each view through an optimization process in [52], where the kernel matrix learning is based on the kernel alignment to measure the similarity between two kernel matrices. In addition, Liu et al. [53] showed a weighted multiple kernel K-means clustering with matrix-induced regularization that can reduce redundant kernels and en- hance the diversity of the selected kernels, and Zhao et al. [54] further provided a weighted MvC method via low- rank and matrix-induced regularization. Zhang et al. [55] also gave a weighted MvC algorithm based on improved Gaussian kernels with variable weights.

However, in many applications, it is common that data on some views is not or partially available, leading to incomplete multi-view data, as we mentioned in the intro- duction section. To address this issue, Trivedi and Daum

[56] presented a general approach that allows those MvC in complete view settings to be applicable in this scenario where only one view is complete and the auxiliary views are incomplete. They took the kernel CCA based MvC as an example to illustrate their idea. de Sa et al. [45] (mentioned above) also stated that a significant benefit of their algorithm is that it can calculate affinities for samples with missing views. In setting of there no one view is complete, Shao et al. [57] proposed an approach called collective kernel learning to infer hidden sample similarity. The idea behind this approach is to collectively complete the kernel matrices of incomplete views by optimizing the alignment of shared instances of these views. In addition, different from some

View 1

View 2

View m

…

KPCA

a

existing approaches where incomplete kernels are firstly im- puted and then a standard multi-kernel clustering algorithm is applied to the inputting kernels, Liu et al. [58] integrated the kernel imputation and clustering into a unified learning procedure for incomplete MvC.

*Example**2*: One challenge of multi-kernel learning is to choose appropriate kernel functions (e.g., Linear ker-

Fig. 5. The flow chart of auto-weighted multi-kernel MvC

As *K*(*x,**y*) = *φ*(*x*) · *φ*(*y*), plugging the designed Gaus- sian kernel into Eq. 3, then Eq. 3 is rewritten as

*m** ** **n* *k*

min ， ， *ω**i,v*， 2*δ**ij*I(1 + *R*)*p*− *K*(*x**v* *v*

nel, Polynomial kernel, and Gaussian kernel), which map

*v*=1 *i*=1

*j*=1

*i*− *c**j*)l

(4)

the original low-dimensional space to a high-dimensional space. The general method for multi-view data is using a linear combination of several kernel functions. While, the weights of different kernels should be took into consider- ation. Besides, the weights of different views are also an important factor for MvC. To these ends, Zhang et al. [55] developed an auto-weighted multi-kernel MvC algorithm, which weights the views and the kernels simultaneously. Fig. 5 gives an illustration for the proposed algorithm. It first employs Kernel Principal Component Analysis (KPCA) on each view to reduce the dimension of the original data, resulting in a low-dimensional multi-view data. Then, it applies the designed weighted Gaussian kernel on the low- dimensional multi-view data. This step drives the weight of each view, and the cluster centers. After finite iterations, it arrives at the final clustering result. It is worth noting that the designed weighted Gaussian kernel integrates the advantages of Gaussian kernel and Polynomial kernel. The designed weighted Gaussian kernel is formulated as

*s.t.**ω**i,v*≥ 0*,*n*v**ω**i,v*= 1*.*

Note that Eq. 4 inherits the properties of K-means and kernel, and the designed kernel integrates the advantages of Gaussian kernel and Polynomial kernel.

**3.3** Multi-view graph clustering

Graphs (or networks) are widely used in representing re- lationships between objects, where each node corresponds to a data object and each edge depicts the relationship between a pair of objects. In practice, the relationship is often denoted by the similarity or affinity relationship, i.e., the input graph matrix is generated from a data similarity matrix. In multi-view scenario, data objects are captured by multiple graphs. It is often assumed that each individual graph captures the partial information but they all admit the same underlying clustering structure of the data. Thus, these graphs can mutually reinforce each other by consolidating the correlation between two objects collectively. In general,

*K*(*x,**y*) =

## exp

/ *x*− *y*2 l*p*

− 2*σ*2 + *R*

the procedure of graph-based fusion for multi-view data is similar to Fig. 6. Multi-view graph clustering aims to find a fusion graph across all views and then uses graph-cut

= ， / *p* *R**p*−*q**exp*/

*p*

−

## q

*q*=0

*p*

__q____x____−__* y*2

2*σ*2

2

(2)

algorithms or other technologies (e.g., spectral clustering) on the fusion graph to produce the final clustering result. It is worth noting that spectral clustering based MvC also

=*R**p*+ ， / *p* *R**p*−*q**exp*/

−

## q

__x____−____y__*.*

2*σ*2*/q*

goes into this category, as spectral clustering has a close

relationship to graph.

*q*=0

Given a multi-view data with *m*views, *n*samples and *k*clusters, the objective function based on the K-means and the designed kernel is formulated as

*s.t.**ω**i,v*≥ 0*,*n*v**ω**i,v*= 1

where *c**v*is the cluster center, and *δ**ij*is the indicator variable with *δ**ij ** *= 1 if *x**i*∈ *c**j*, otherwise *δ**ij ** *= 0.

*j*

View m

Fig. 6. The general procedure of graph-based clustering

The review of the literatures in this category is organized with three parts: graph-based MvC, network-based MvC, and spectral-based MvC.

- Graph-based MvC

In [59], Tang et al. discussed a general problem of clustering based on multiple similarity graphs in both unsupervised and semi-supervised settings. One major contribution of this paper is that the proposed linked matrix factorization method can extract common factors from multiple graphs, leading to various graph-based clustering methods are nat- urally applied on multi-view data. Hussain et al [60] put forward a multi-view document clustering algorithm. It first applies existing clustering algorithms on the data matrix of each view to generate multiple partitions. Then, these parti- tions are used to generate a set of three different similarity matrices, i.e., cluster based similarity matrix, affinity matrix and pair-wise dissimilarity matrix. Finally, it employs an ensemble technique to aggregate these matrices and forms a new similarity matrix, which is available for the clustering task. Besides, the impact of different similarity measures (e.g., Pearson, and Spearman correlations, Euclidean, and Canberra Distances etc.) on MvC has been studied in [61]. Later on, Xue et al. [62] proposed a group-aware multi-view fusion approach, which assigns different weights to evaluate the pairwise similarity between different groups. Further- more, even though some MvC methods learn a weight for each graph, such methods have additional parameters. To address these challenges, Nie et al. [63] developed an auto- weighted multiple graph learning framework to learn a set of weights automatically for all graphs. This process does not need any parameters. In addition, unsupervised feature selection for multi-view data have also been studied, where these selected features are then used for specific tasks. For instance, Wei et al. [64] proposed a method, named cross diffused matrix alignment based feature selection, to select features for each view by performing alignment on a cross diffused matrix. Then they applied the co-regularized spectral clustering [29] on these selected features to produce the final results. Besides, as traditional approaches like [63] characterize the similarity by fixed or predefined graph Laplacian in each view separately and ignore the underlying common structures across different views, Hou et al. [65] presented an algorithm named multi-view unsupervised feature selection with adaptive similarity and view weight. This method attempts to use three kinds of information, i.e., data cluster structure, data similarity and the correlations between different views for feature selection.

On the other side, the methods with a combination of

the nearest neighbor techniques have also been studied. For instance, Hamzaoui et al. [66] designed a multi-source shared nearest neighbors (SNN) scheme for multi-modal image clustering. The central idea is that they first extended the existing SNN-based similarity measures to the case of multiple sources and then introduced an original automatic source selection step to build candidate clusters. Consider- ing generative and manifold structure of the data, Wang et al. developed a generative model with ensemble manifold regularization for MvC in [67], [68]. Specifically, a nearest neighbor graph for each view is constructed to encode the corresponding manifold information. A multiple graph

ensemble regularization framework is designed to learn the optimal intrinsic manifold. Then, the manifold regulariza- tion term is incorporated into a multi-view topic model based on PLSA, resulting in a unified objective function. Different from the above two methods, Zhang and Mao [69] focused on the problem of efficiently selecting class- consistent neighbors for graph-based MvC. The proposed approach employs jointly sparse learning to filter unreliable neighbors in the union of view-specific neighborhoods by representing each object in a weighted sum of its neighbors under each view. The learned jointly sparse weights are used to generate a similarity neighborhood graph, and the new graph is further utilized for MvC and view-specific graph preconditioning. In addition, Nie et al. [70] intro- duced a novel multi-view learning model with adaptive neighbors. This model performs clustering, semi-supervised classification, and local manifold structure learning simul- taneously, modifying similarity matrix during each itera- tion until reach to the optimal one. Moreover, it allocates the weight coefficient for each view automatically without penalty parameters.

*Example**3*: It should be noted that most of the exist-

ing methods basically adopt a globally uniform similarity measure over the whole data space. But for the real-world images, the visual distribution is complicated and different images have different visual appearance. It is difficult to de- scribe the similarity accurately by using a globally uniform measure. To solve this problem, Xue et al. [62] presented a group-aware multi-view fusion approach for image clus- tering. The key point behind this approach is that images are partitioned into groups with more compact visual co- hesiveness, and the images within group share the same fusion weights. Compared with the global fusion methods, this group-aware fusion approach provides a more flexi- ble fusion strategy and more effective similarity measures among images. The framework of this method is shown in Fig. 7. First multiple features such as LBP, GIST, and Centrist are extracted from images, then a graph is constructed for each feature (view). Next, the images are partitioned into different groups, and the fused graph is constructed by the proposed group-aware fusion strategy, i.e., the images belong to different groups are assigned with different fusion weights. Finally, the fusion weights are learned by solving the formulated optimization function, and the clustering results are obtained by conducting spectral clustering on the fused graph.

- Network-based MvC

Despite the previous success, most of the graph-based MvC approaches usually assume that different views are avail- able for the same set of data objects. Thus, data objects in different views are treated as having one-to-one rela- tionship. However, in many real-life applications, such as social networks, biology interaction networks, and literature citation networks, data objects are collected from different domains and an object in one domain may correspond to multiple objects in another domain, resulting in many- to-many mapping relationships. Representing these rela- tionships with networks rather than graphs may be more appropriate, which is the main reason that we distinguish network-based MvC and graph-based MvC in this paper.

Fig. 7. The framework of group-aware multi-view fusion approach [62].

Relate work on network-based MVC starts from [71], in which it developed a network based multi-view graph clus- tering framework, termed co-regularized graph clustering, based on NMF. This framework illustrates several key prop- erties, i.e., many-to-many mapping relationship, mapping associated with weights, and partial mapping. However, different networks may have different data distributions, where the assumption like [71] that all networks share a single common clustering structure does not hold. To relax this assumption, Ni et al. [72] presented a flexible and robust framework that allows multiple underlying clustering struc- tures across different networks. It models domain similarity as a main network where main nodes represent domain- specific networks, and formulates the clustering problem via NMF on the designed network of networks setting as a two phase regularized optimization problem. Similar work

does not hold. Fig. 8 shows an example with six domains

{*A,**B**,**C**,**D**,**E**,**F*}, each of which corresponds to a network. Domains {*A,**B**,**C*} are similar to each other and so are domains {*D**,**E**,**F*}. But these two sets of domains are not similar to each other. To address the above mentioned two

challenges, Ni et al. [72] modeled each domain similarity as a network, which was utilized to regularize the clustering structures in different networks. With these networks, they developed a flexible and robust multi-network clustering framework that allows many-to-many relationship and mul- tiple underlying clustering structures.

A

1

13

11 4

B 9 10 C

has also been done in [73], where the network grouping and common cluster detection are coupled and mutually enhanced during the learning process. Furthermore, Liu et

1 12

11 4 2

10

5 3 7

11

1

9 4

3 10

al [74] stated that existing methods tend to focus on network clustering, but ignore any associations that may be exhibited between clusters from different domains. Given this, they

9 5 7

3 7 2

D E

offered a robust clustering approach, which can detect net- 1 9 1

work clusters in multiple domains and their cross-domain 2 2 3

associations. In addition, Yu et al. [75] studied community 3 7 4 7

F

detection in multiple social networks, and attempted to

find the communities for multiple networks involving both anchor and non-anchor users simultaneously. Wang et al. [76] proposed a MvC algorithm, named multi-view affinity propagation, based on max-product belief propagation. The key point is to establish a MvC model consisting of two components, which measures the within-view quality and the explicit clustering consistency across different view, respectively.

*Example**4*: Most of the existing multi-view graph clus- tering methods usually assume that different views are available for the same set of instances. Therefore, instances in different views can be treated as having strict one-to-one relationship. In many actual applications, data is collected from diverse domains (views), where the cross-domain rela- tionship is many-to-many rather than one-to-one. That is to say, an instance in one domain may correspond to multiple instances in another domain. Moreover, different domains have different data distributions, thus another assumption that all views share a single common clustering structure

4 6 6 5

1 8

2 3

4 5

6

Fig. 8. An example of NoN. The main network is represented by dashed nodes and edges. The domain-specific networks are represented by solid nodes and edges [72].

Note that the similarity among different domains can be also modeled as a network (represented by dashed lines in Fig. 8). The structure shown in Fig. 8 is defined as a *network**of**networks*(NoN), where the dashed network represents the

*main**network*among six domains {*A,**B**,**C**,**D**,**E**,**F*}, and

each node in the main network corresponds to a *domain-*

*specific**network*represented by solid lines. The nodes in the main network and domain-specific networks are referred to as the *main**nodes*and *domain**nodes*respectively. Similarly, the clusters in the main network and domain-specific networks are refereed to as the *main**clusters*and *domain**clusters*

respectively. Given these concepts, they sought to partition the NoN by using a two-phase approach. It first partitions the main network, then incorporates the learning cluster information to cluster domain-specific networks. In prac- tice, they adopted the SymNMF [77] to partition the main network, formulated as the following objective function

*O**M*= *G*− *HH**T** ** *2 (5)

*F*

where · 2 is the *F**r**obenius**norm*, *G*∈ R*g*×*g*(*g*is the number of nodes in the main network) is the main network, and *H*∈ R+*g*×*k ** *is the factor matrix of the main network.

*F*

The domain-specific network clustering is formulated as

follows

*g*

2

on this bipartite graph. Zhou and Burges [78] studied multi- view spectral clustering via generalizing the normalized cut from a single view to multiple views by considering how to find a clustering which is close to optimal for all graphs, and further built multi-view transductive inference on the basis of multi-view spectral clustering. Similar work has been done in [79], where it also intended to find a balance cut that can separate all similarity graphs well. In addition, Long et al. [80] developed a general model for multi-view unsuper- vised learning under a distributed framework, aiming to learn hidden patterns individually from each representation of a multi-view data and then seek for the optimal hidden patterns from those multiple patterns. This general model introduces the concept of mapping function to make the

# min

∀*i,U**i*≥0

∀*j**,V**j*≥0

*O**S*= ， *A**i*− *U**i*(*U**i*)*T ** **F*

*i*=1

‘ 气卢 J

*Domain-specific**network**clustering*

different patterns from various pattern spaces comparable

and hence an optimal pattern is achieved from the multiple patterns of multiple representations. Instead of committing

*g* *k* (6)

+ *α*， ， *h**ij ** **U**i*− *V**i*2

*F*

*i*=1 *j*=1 __ __

‘ 气卢 J

*Main**cluster**guided**r**egularization*

where, *A**i*is the domain-specific network corresponding to node *i*(*i *= 1*,**…,**g*) in the main network *G*, *U**i*is the factor matrix of *A**i*, *V**j*is the *j**th*hidden factor matrix, *h**ij*(an element of *H*) indicates to which degree main node *i*belongs to the *j**th*main cluster, and *α*is a regularization parameter.

In real-life, the domain-specific networks may have dif- ferent node sets and sizes. Based on the above proposed model, they further designed a general model that allows partially aligned domain-specific networks to have different sizes and number of clusters. This model is formulated as the following function:

*g* *g* *k*

to one clustering solution, Niu et al. [81] proposed a method

providing several non-redundant clustering solutions. This method simultaneously learns non-redundant subspaces that provide multiple views, and finds a clustering solution in each view. To address the issue that data may have con- siderable noise, Xia et al. [82] investigated the Markov chain for a robust multi-view spectral clustering. This approach has a flavor of low-rank and sparse decomposition, where it first constructs a transition probability matrix from each single view, and then uses these matrices to recover a shared low-rank transition probability matrix as a crucial input to the standard Markov chain method for clustering. To handle large-scale problem and improve computational efficiency, Li et al. [83] offered a large-scale multi-view spectral cluster- ing approach based on the bipartite graph. This method uses local manifold fusion to integrate heterogeneous features, and bipartite graphs to approximate the similarity graphs. Moreover, Chikhi [84] presented a multi-view normalized cuts approach, a parameter free multi-view spectral cluster-

# min

*O**S*= ， *J**A*+ *α*， ， *J**R* (7)

ing algorithm, via spectral partitioning and local refinement.

where

∀*i,U**i*≥0

∀*j**,V**j*≥0

*i*=1

2

*i*=1 *j*=1

Lu et al. [85] studied the convex sparse spectral clustering with sparse regularization for single view data, and further proposed a pairwise sparse spectral clustering to handle

*J**A*= *A**i*− *U**i*(*U**i*)*T ** **F*

multi-view data. However, as the number of views grows, it

2

*J**R*= (*Q**ij**U**i*)(*Q**ij**U**i*)*T*− (*P**ij**V**j*)(*P**ij**V**j*)*T ** **F*

where *P**ij*is the mapping matrix between *U**i*and *V**j*, and *Q**ij*is denoted by *Q**ij*= *P**ij*(*P**ij*)*T*. In the end, the cluster- ing results are driven on the factor matrices *U**i**,**…,**U**g*.

- Spectral-based MvC

Spectral clustering is a classic paradigm for data clustering. The basic idea is to form a pairwise affinity matrix between all pairs of objects, normalize it, and compute eigenvectors of this normalized affinity matrix (i.e., graph Laplacin). It has been shown that the second eigenvector of the normal- ized graph Laplacian is a relaxation of a binary vector solu- tion that minimizes the normalized cut on a graph, which is the relation to graph. In [19], De Sa developed a spectral clustering algorithm on two independent views, each of which could be used for clustering. This spectral-based MvC algorithm creates a bipartite graph with the minimizing- disagreement criterion [46], [47] to connect the two-view fea- ture, and then uses standard spectral clustering algorithms

is almost impossible to avoid dependencies among views,

and such dependencies often delude correct estimations. Son et al. [86] extended traditional spectral clustering to deal with multi-view data and dependencies among views. The method given in this paper forces the information of each view to be shared by using the brainstorming process.

By combining spectral clustering with other technolo- gies, several MvC methods have also been studied. For instance, Huang et al. [87] developed an affinity aggregation spectral clustering algorithm which extends spectral cluster- ing to a setting with multiple affinities available. Shao et al. [88] designed a multi-source MvC framework based on col- lective spectral clustering with a discrepancy penalty across sources. Note that this method is applicable to incomplete multi-view data. Besides, Feng et al. [89] introduced a multi- view spectral clustering via roust local subspace learning by considering that all the views are derived from a robust unified subspace and noisy. Wang et al. [90] proposed an iterative low-rank based structured optimization method for multi-view spectral clustering, which encodes the local data

manifold structure from each view-dependent feature space, and arrives at the multi-view agreement via an iterative process. Zhao et al. [91] discussed the semi-supervised MvC,

local results and the global result. To further uncover the inter-view relation, Eq. 8 is rewritten as

and presented a multi-view matrix completion method with

pairwise similarity matrix to utilize side information, i.e.,

must-link and cannot-link.

In addition, there are also spectral-based MvC methods

where *L*= ?*m*

*i*

for multi-type relational data [92], multi-modal image data [93], social media data [94], as well as some applications with spectral-based MvC, such as multimodal brain net-

*i*(*α**i*)*r**L**i*. Denote that *L*is regraded as the

local manifold fusion of all the views. Eq. 10 is solved by

iterative optimization techniques. The total computational complexity is about *O*(*T**k**n*2 + ?*m**mn*(*d**i*)2), where *T*is

*i*

works inference [95], social circle detection [96], and human microbiome data analysis [97].

*Example**5*: In general, spectral clustering methods in- volve two time consuming steps. The first step, construct- ing the affinity (or similarity) graph, takes *O*(*n*2*d*) time, while the second step, computing the eigen-decomposition, takes *O*(*k**n*2) time. Besides, another drawback of spectral clustering methods is that they usually do not provide natural extension to handle the out-of-sample problem. To overcome the above two issues, Li et al. [83] proposed a

the number of iterations.

**3.4** Multi-view subspace clustering

Multi-view subspace clustering, i.e., learning a new and unified representation for all view data from multiple sub- spaces or a latent space that make it easier to deal with high-dimensional data when building clustering models, has become a hot direction in the field of MvC. The general procedure of multi-view subspace clustering is illustrated as Fig. 9, where it obtains such a unified feature repre-

large-scale multi-view spectral clustering approach based

sentation with two ways: 01

learn a unified representation

on the bipartite graph. The designed algorithm uses local

from multiple subspace directly, or 02

first learn a latent

manifold fusion to integrate heterogeneous features, and

approximates the similarity graphs by bipartite graphs to improve efficiency. It is also easily extended to handle the out-of-sample problem. In summary, it first generates a few consensus salient points for all views. These salient points play an important role in capturing the manifold of the original views. Then it constructs a bipartite graph between raw data points and these salient points for each view. The graphs of all the views are combined together with a

space then arrive at this unified representation. Finally, this unified representation is fed into an off-the-shelf clustering model to produce the clustering results. After reviewing the literatures on MvC, we further divide multi-view subspace clustering methods into two major types, i.e., subspace learning-based and NMF-based (a special case in subspace learning) methods.

local manifold fusion method. Finally, it performs spectral

clustering algorithm on the resulting fused graph. It also

outputs cluster indicator of the salient points, so as to handle

the out-of-sample problem efficiently.

Here, two important questions needed to be stressed are how to reach consensus across all views, and how to express the relationship of them. With the local manifold learning,

Data

it formulates the above mentioned two questions with the following function

Fig. 9. The general procedure of multi-view subspace clustering

# min

*F**T**F*=*I**,α**v*

*m*

，(*α**i*)*r**T**r*(*F**T**L**i**F*)

*i*=1

*s.t.*?*i**α**i*= 1*,**α**i*≥ 0

(8)

- Subspace learning-based MvC

Subspace learning-based MvC seeks to find a latent space from multiple low-dimensional subspaces by assuming that data points are drawn from this latent subspaces. Here

where *α**i*is the nonnegative normalized weight factor for the *i*-th view, *T**r*(·) is the trace of a matrix, *r *is a scalar to control the distribution of different weights among different views, *F*∈ R*n*×*k*is the class indicator matrix, *L**i*is the normalized graph Laplacian matrix for the *i*-th view. The

normalized graph Laplacian matrix is defined as

*L*= *I*− *D*−1*/*2*W**D*−1*/*2 (9)

where *W*∈ R*n*×*n*is the adjacent matrix of the graph, *D*∈

R*b*×*n ** *is the degree matrix whose *i*-th diagonal element is

we extend this concept to make its role more general. In this paper, the technologies involved in subspace learning- based MvC include subspace learning, subspace clustering, subspace projection, low-rank approximation, and tensor decomposition.

Under the assumption that the views are conditionally

uncorrelated given the clustering label, Chaudhuri et al. [98] presented a subspace learning method based on canonical correlation analysis, which provides auxiliary results for Gaussian mixtures and log concave distribution mixtures. Guo [99] proposed a convex subspace representation learn-

*d**ii*= ?*n*

*j*=1

*w**ij *.

ing method for MvC. The key idea is to identify a common

In Eq. 8, it aims to find a consensus result *F ** *among

all the views. This unique consensus eliminates the need for computing the local results for each view, and the computation cost of communicating back and forth between

intrinsic subspace representation of the data across multiple views and then perform standard clustering algorithms on this shared representation. Zhao et al. [100] developed a co- training framework for multi-view subspace clustering. This

framework combines the simplicity of K-means clustering and linear discriminant analysis within a co-training scheme which exploits labels learned automatically in one view to get discriminative subspaces in another. Deng et al. [101] put forward a locally adaptive feature weighting method based on subspace learning for multi-view data, where it locally adapts the feature weighting of each cluster automatically according to the tightness of views to improve the clustering performance. Besides, Cao et al. [102] stated that exploiting the specific independently constructed matrices is insuffi- cient, and exploring the underlying complementarity is of great importance, for the success of MvC. They designed a MvC framework, called diversity-induced multi-view sub- space clustering for this task. Specially, they extended the existing subspace clustering into the multi-view domain, and utilized the Hilbert Schmidt Independence Criterion as a diversity term to explore the complementarity of multi- view representations. However, many work usually focused on the combination on information rather than improving the feature representation capability of each view. To solve this problem, Wang et al. [103] presented a framework with extreme learning machine, and implemented three algorithms on this framework. Unlike the methods in [99], [103] which performs subspace clustering on a common view, Gao et al. [104] performed subspace clustering on each view simultaneously, meanwhile guaranteed the con- sistence of the clustering structure among different views by adopting a common indicator. In addition, Xu et al. [105], [106] proposed a MvC method called discriminatingly embedded K-means, which embeds the synchronous learn- ing of multiple discriminative subspaces into multi-view K-means clustering to construct a unified framework, and adaptively control the inter coordinations between these subspaces simultaneously. To effectively exploit data cor- relation consensus among multi-views, subspace clustering with similarity matrix for multi-view data was studied in [107], [108], where the authors intended to find a correlation or similarity consensus among all views, which was inspired by the idea that the data objects within the same subspace should encode a large similarity and small similarity for data objects within the distinct subspaces for each view. Rather than using similarity matrix, Fan et al. [109] draw global low-rank constraint and local cross topology preserv- ing constraints into subspace clustering to character data correlations. There are also several methods investigated by combing with other technologies, sparse subspace cluster- ing [110], [111], low-rank approximation [112], and tensor

decomposition [113], [114], [115], [116], to name a few.

Different from the above mentioned methods or frame- works, which output just a single clustering, Cui et al. [117], [118] gave pioneering work to find alternative and multiple clustering solutions based subspace learning. They designed a MvC framework to find all non-redundant clus-

distributions in subspace projections. Besides, Muller et al. [120] presented a short tutorial on this topic.

In semi-supervised clustering setting, Gu¨ nnemann et

al. [121] developed a new Bayesian framework for semi- supervised MvC based on their previous work [119], which also seeks to the detection of multiple and alternative clus- terings. This new framework treads the different clustering views with several multivariate mixture distributions lo- cated in subspace projections, and handles prior knowledge in the form of instance-level constraints indicating which objects should or should not be grouped together. What’s more, Yin et al. [122] presented a pairwise sparse subspace representation model for MvC. The designed model har- nesses the prior information to obtain the view specific sparse representation, and meanwhile utilizes the corre- lation between different views. Besides, Cao et al. [123] put forward a constrained multi-view video face cluster- ing method, which considers both the video face pairwise constraints as well as the multi-view consistence simultane- ously. Unlike most existing video face clustering methods which only employ these constraints in the clustering step, this method strengthens the pairwise constraints through the whole framework, both in sparse subspace representa- tion and spectral clustering.

Incomplete MvC based on subspace learning has also

been investigated. For instance, Yin et al. [124], [125] pro- posed an incomplete MvC method, which incorporated fea- ture selection, subspace learning, and inter-view and intra- view similarity preserving into a unified objective. It learns a latent representation and projection matrices for incomplete multi-view data, where the latent representation serves as an approximation of the normalized indicator matrix. Xu et al. [126] suggested that the key to handle incomplete view problem is to exploit the connections between multiple views, enabling incomplete views to be restored with the help of the complete views. Assuming that different views are generated from a shared subspace, they studied esti- mating the incomplete views by integrating the information from the other observed views through this subspace.

*Example**6*: Multi-view data is often incomplete, namely

data objects have incomplete feature sets. Based on subspace learning, Yin et al. [124], [125] studied incomplete multi- view learning for incomplete and unlabeled multi-view data. Fig. 10 shows the presented subspace learning model, which learns unified latent representations and projection matrices for the incomplete multi-view data. This model directly optimizes the class indicator matrix, which estab- lishes a bridge for incomplete feature sets. Besides, feature selection is considered to deal with high dimensional and noisy features. Furthermore, the inter-view and intra-view data structure are preserved to enhance its performance. To this ends, an objective is developed along with an efficient optimization strategy.

in orthogonal subspaces. The difference is that the former

seeks the orthogonality in the cluster space, while the latter

represents the data matrix in the *v*-th view for the complete

instances and the partial instances, respectively. Similarly,

*v* *v*

in the feature space. Similar work has also been done in [119], where it provides multiple generalizations of the data by using multiple mixture models. Each mixture describes a specific view on the data by using a mixture of Beta

*Y*= [*Y**v**,**Y*ˆ *v*] ∈ R(*n**c*+*n*ˆ )×*k*represent the class indicator of

the *v*-th view. To learn the class indicator matrix, it drives a

*c*

projection matrix *U**v ** *∈ R*d**v*×*k ** *for each view to project their

original spaces to a unified space. The objective function is

Fig. 10. The overview of the proposed model with two views, i.e., text and image [125]. For the incomplete multi-view dataset, it uses projection matrix to project the original features to the class indicator matrix, which explicitly captures the clustering structure and severs as the latent space. Besides, group sparsity is imposed on the projection matrices for feature selection. Furthermore, the inter-view and intra-view data similarities are preserved to enhance the model. Finally, it is applied for clustering and retrieval tasks.

- NMF-based MvC

NMF, which was originally investigated as a dimensional- ity reduction technique [127], has emerged as an effective latent factor learning method. The nonnegative constraint leads to the parts-based representation of objects, which accords with the cognitive process of human brain from the psychological and physiological evidences. Given an input

nonnegative data matrix *X ** *∈ R*d*×*n*, each column of *X*is

an object of vector. The NMF aims to find two nonnegative

matrices *W*∈ R*d*×*p*and *H*∈ R*p*×*n*, whose product can well

approximate the original matrix *X*. Here, the former matrix

*W*is termed as a basis matrix (basic space), while the latter matrix *H*represents the coefficient matrix (representation feature), and *p*(in general, *p*< *min*{*n,**d*}) denotes the desired reduced dimension. The reconstruction processes

can be formulated as a *F**r**obenius**norm*optimization problem, defined as

NMF was studied in [134]. The presented approach takes clustering results generated independently on each avail- able view, constructs an intermediate matrix representation of those results, and applies a NMF procedure to this representation to reconcile the groups arising from the in- dividual views. Unlike [134] plugging the clustering results into NMF, Liu et al. [135] developed a new NMF-based MvC framework, which feeds the data into NMF directly, and drives a fused representation. The contributions are formulating a joint matrix factorization with normalization strategy that pushes the representing result of each view towards a common consensus, and giving a new insight into applying NMF to MVC. Inspired by this framework, some improved work has also been investigated. For instance, a semi-supervised MvC algorithm based on NMF with weight for each view was studied in [16], [136], where it discovers a partially shared latent representation. With this learned representation of multi-view data, a robust sparse regression model was introduced to predict the clustering results. Embedding the similarity matrices of the data points into NMF, He et al. [137] also intended to learn a shared latent representation for MvC. Taking a new regularization term to advocate the structural incoherence between the repre- senting result of each view,