That is to say, there are only 0. Another challenge is the curse of dimensionality [22]. It has been proven [23] that, for the vast majority of data distributions and distance functions, the distances or similarities between pairs of elements in a high-dimensional tensor are almost the same. Therefore, most existing clustering methods cannot be applied directly to sparse, high-dimensional heterogeneous information networks. To solve the problem of clustering heterogeneous information networks with general network schemas, or even without network schema information, we model a heterogeneous information network as a multiway array, i.e., a tensor.
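The distance-concentration effect behind the curse of dimensionality can be illustrated with a small experiment. The following is only a sketch under assumptions that are not taken from the paper: points are drawn uniformly from the unit hypercube, the distance is Euclidean, and the helper name `relative_contrast` is hypothetical. It measures how much the farthest and nearest neighbours of a query point differ, relative to the nearest one.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """Return (max - min) / min over Euclidean distances from one
    random query point to n_points random points in [0, 1]^dim."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    query = [rng.random() for _ in range(dim)]
    dists = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)

low = relative_contrast(2)      # low-dimensional case
high = relative_contrast(1000)  # high-dimensional case
print(f"relative contrast in 2 dimensions:    {low:.3f}")
print(f"relative contrast in 1000 dimensions: {high:.3f}")
```

In the high-dimensional case the contrast shrinks sharply, so the nearest and farthest points become hard to tell apart by distance alone, which is why distance-based clustering degrades in this setting.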
Each object type maps onto one mode of the tensor, and the relations between objects of different types map onto the elements of the tensor. The main contributions of this paper are as follows: We propose a novel clustering framework based on sparse tensor factorization, namely STFClus, which can cluster heterogeneous information networks with general network schemas or even without network schema information. Another advantage is that STFClus can cluster all types of objects simultaneously in a single pass.
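The tensor model can be sketched as follows. This is a minimal illustration and not the authors' implementation: the toy object types, the object names, and the helper `build_sparse_tensor` are all hypothetical, and we assume that each relation instance links exactly one object of every type. Only the nonzero elements are stored, keyed by their tensor indices.

```python
from collections import defaultdict

# Hypothetical toy network: three object types, one tensor mode each.
objects = {
    "author": ["a0", "a1", "a2"],
    "paper": ["p0", "p1"],
    "venue": ["v0", "v1"],
}
modes = list(objects)  # fixed mode order: author, paper, venue
index = {t: {name: i for i, name in enumerate(objs)}
         for t, objs in objects.items()}

# Assumption: each relation instance names one object per type.
relations = [
    ("a0", "p0", "v0"),
    ("a1", "p0", "v0"),
    ("a2", "p1", "v1"),
]

def build_sparse_tensor(relations):
    """Return {index tuple: weight} holding only the nonzero elements."""
    tensor = defaultdict(float)
    for rel in relations:
        key = tuple(index[t][name] for t, name in zip(modes, rel))
        tensor[key] += 1.0  # repeated instances accumulate as weight
    return dict(tensor)

tensor = build_sparse_tensor(relations)
shape = tuple(len(objects[t]) for t in modes)

total = 1
for dim in shape:
    total *= dim
density = len(tensor) / total  # fraction of nonzero elements
print(shape, len(tensor), density)
```

Storing index tuples rather than a dense array keeps memory proportional to the number of observed relations, which matters because the dense tensor grows with the product of the type sizes.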
The clustering problem based on tensor factorization is modeled as an optimization problem, which is similar to the well-known Tucker decomposition [24, 25]. In STFClus, only the nonzero tensor elements and their corresponding tensor indices are handled, and a non-distance function for similarity measurement between pairs of objects is needed. We discuss the implementation bottleneck of STFClus and propose a performance improvement method that avoids calculating large-scale intermediate results. We also propose a feasible initialization method to start STFClus. STFClus is tested on both synthetic and real-world networks. Experimental results show that STFClus outperforms the state-of-the-art baselines in terms of key performance indicators such as accuracy and efficiency.
Methods
Preliminaries
First, we introduce some related concepts and the tensor notation used in this paper. More details about tensor algebra can be found in [27–29]. A tensor is a multidimensional array. The order of a tensor is its number of dimensions, also known as ways or modes. We follow the convention used in [27] to denote scalars by lowercase letters. Elements of a matrix or a tensor are denoted by lowercase letters with subscripts. Some common definitions for tensors are set out below, as used in [28].
Definition 1 Matricization [28]. Matricization transforms an Nth-order tensor into a matrix by arranging its elements in a particular order.
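For example, the matricization of a tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ along the $n$th mode is denoted as $\mathbf{X}_{(n)}$. A special case of matricization is vectorization, which transforms a tensor into a vector. The vectorization of a tensor $\mathcal{X}$ is denoted by $\mathrm{vec}(\mathcal{X})$.
Definition 2 Hadamard product [28]. The Hadamard product of two tensors with the same dimensions is also known as the elementwise product. For $\mathcal{X}, \mathcal{Y} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$, their Hadamard product is denoted by $\mathcal{X} * \mathcal{Y}$, and its elements are given by $(\mathcal{X} * \mathcal{Y})_{i_1 i_2 \cdots i_N} = x_{i_1 i_2 \cdots i_N} \, y_{i_1 i_2 \cdots i_N}$.
Definition 3 Kronecker product [28]. The Kronecker product of two matrices $\mathbf{A} \in \mathbb{R}^{I \times J}$ and $\mathbf{B} \in \mathbb{R}^{K \times L}$ is denoted by $\mathbf{A} \otimes \mathbf{B}$; it is the $IK \times JL$ block matrix whose $(i, j)$th block is $a_{ij} \mathbf{B}$.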
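Definitions 1 to 3 can be made concrete with a short sketch. It rests on assumptions not stated in the source: sparse tensors are stored as `{index tuple: value}` dictionaries, modes are numbered from 0, matrices are lists of rows, and the column ordering of the matricization lets the earlier remaining modes vary fastest (a common textbook convention). All function names are hypothetical.

```python
def unfold_index(idx, shape, n):
    """Map a tensor index to (row, col) in the mode-n matricization."""
    col, stride = 0, 1
    for k, (i, dim) in enumerate(zip(idx, shape)):
        if k == n:
            continue  # mode n supplies the row, not the column
        col += i * stride
        stride *= dim
    return idx[n], col

def matricize(tensor, shape, n):
    """Mode-n matricization of a sparse tensor, as {(row, col): value}."""
    return {unfold_index(idx, shape, n): v for idx, v in tensor.items()}

def vectorize(tensor, shape):
    """Vectorization: flatten every index, first mode varying fastest."""
    out = {}
    for idx, v in tensor.items():
        pos, stride = 0, 1
        for i, dim in zip(idx, shape):
            pos += i * stride
            stride *= dim
        out[pos] = v
    return out

def hadamard(x, y):
    """Elementwise product; only indices nonzero in both operands remain."""
    return {idx: v * y[idx] for idx, v in x.items() if idx in y}

def kronecker(a, b):
    """Kronecker product of two dense matrices given as lists of rows."""
    return [[p * q for p in row_a for q in row_b]
            for row_a in a for row_b in b]

shape = (3, 2, 2)
x = {(0, 0, 0): 1.0, (1, 0, 1): 2.0, (2, 1, 1): 3.0}
y = {(1, 0, 1): 4.0, (2, 1, 0): 5.0}

x1 = matricize(x, shape, 1)   # a 2 x 6 matrix in sparse form
xv = vectorize(x, shape)      # a vector of length 12 in sparse form
xy = hadamard(x, y)
k = kronecker([[1, 2]], [[1, 0], [0, 1]])
```

Because every operation works directly on the stored indices, none of them needs to touch the zero elements of the tensor.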
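Definition 4 Inner product [28]. The inner product of two tensors with the same dimensions, $\mathcal{X}, \mathcal{Y} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$, is denoted by $\langle \mathcal{X}, \mathcal{Y} \rangle$. The result of the inner product is the sum of all elements in their Hadamard product, and is defined as $\langle \mathcal{X}, \mathcal{Y} \rangle = \sum_{i_1=1}^{I_1} \sum_{i_2=1}^{I_2} \cdots \sum_{i_N=1}^{I_N} x_{i_1 i_2 \cdots i_N} \, y_{i_1 i_2 \cdots i_N}$.
Definition 5 Frobenius norm [28]. The Frobenius norm of a tensor $\mathcal{X}$ is defined as $\|\mathcal{X}\|_F = \sqrt{\langle \mathcal{X}, \mathcal{X} \rangle}$.
Definition 6 Mode-n matrix product [28]. The mode-$n$ matrix product of a tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ with a matrix $\mathbf{U} \in \mathbb{R}^{J \times I_n}$ is denoted by $\mathcal{X} \times_n \mathbf{U}$. Its elements are given by $(\mathcal{X} \times_n \mathbf{U})_{i_1 \cdots i_{n-1} \, j \, i_{n+1} \cdots i_N} = \sum_{i_n=1}^{I_n} x_{i_1 i_2 \cdots i_N} \, u_{j i_n}$. The mode-$n$ matrix product of a tensor with a matrix is equivalent to first matricizing the tensor along the $n$th mode, then multiplying $\mathbf{X}_{(n)}$ by $\mathbf{U}$, and finally folding the result back into a tensor.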
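The inner product, Frobenius norm, and mode-n matrix product can be checked numerically with the sketch below. It is illustrative only and assumes conventions not fixed by the source: sparse tensors are `{index tuple: value}` dictionaries, modes are numbered from 0, and the matrix U has one row per output index and one column per index of mode n. The function names are hypothetical. The last lines compare the mode-n product against the matricize-multiply-fold description.

```python
import math

def inner(x, y):
    """Inner product: sum of the elementwise products."""
    return sum(v * y[idx] for idx, v in x.items() if idx in y)

def frobenius(x):
    """Frobenius norm: square root of the inner product with itself."""
    return math.sqrt(inner(x, x))

def mode_n_product(x, u, n):
    """Mode-n product of sparse tensor x with dense matrix u (J x I_n)."""
    out = {}
    for idx, v in x.items():
        for j, row in enumerate(u):
            w = row[idx[n]] * v
            if w != 0.0:
                key = idx[:n] + (j,) + idx[n + 1:]
                out[key] = out.get(key, 0.0) + w
    return out

def unfold(tensor, shape, n):
    """Dense mode-n matricization, returned as a list of rows."""
    cols = 1
    for k, dim in enumerate(shape):
        if k != n:
            cols *= dim
    mat = [[0.0] * cols for _ in range(shape[n])]
    for idx, v in tensor.items():
        col, stride = 0, 1
        for k, (i, dim) in enumerate(zip(idx, shape)):
            if k != n:
                col += i * stride
                stride *= dim
        mat[idx[n]][col] = v
    return mat

def matmul(a, b):
    """Plain dense matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

shape = (3, 2, 2)
x = {(0, 0, 0): 1.0, (1, 0, 1): 2.0, (2, 1, 1): 3.0}
y = {(1, 0, 1): 4.0, (2, 1, 0): 5.0}
u = [[1.0, 0.0, 1.0],
     [0.0, 2.0, 0.0]]  # J = 2 rows, I_0 = 3 columns

ip = inner(x, y)
norm = frobenius(x)
z = mode_n_product(x, u, 0)          # result has shape (2, 2, 2)
lhs = unfold(z, (2, 2, 2), 0)        # matricization of the product
rhs = matmul(u, unfold(x, shape, 0)) # U times the matricization of x
print(ip, norm, lhs == rhs)
```

The sparse mode-n product loops only over stored elements, so its cost scales with the number of nonzeros times the number of rows of U rather than with the full tensor size.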
In traditional Tucker decomposition, the factor matrices are assumed to be orthogonal. We now give the definition of an information network, which is based on work by Y.
Definition 7 Information network [3]. An information network is a weighted graph defined on a set of objects belonging to $T$ types, denoted by $V = \{V_t\}_{t=1}^{T}$, a set of binary relations on $V$, denoted by $E$, and a weight mapping function, denoted by $W: E \rightarrow \mathbb{R}^{+}$. The information network is denoted by $G = (V, E, W)$. We denote each object of type $V_t$ as $v_n^t$, where $n = 1, 2, \ldots, N_t$ and $N_t$ is the number of objects in type $V_t$, i.e., $N_t = |V_t|$.
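Definition 7 can be mirrored by a small data structure. The sketch below is an assumption-laden illustration rather than anything prescribed by the paper: the class `InformationNetwork`, its method names, and the toy objects are hypothetical, and calling a network heterogeneous when it has more than one object type follows the usual convention in the literature.

```python
from dataclasses import dataclass, field

@dataclass
class InformationNetwork:
    """Weighted graph over typed objects with weighted binary relations."""
    objects: dict = field(default_factory=dict)  # type name -> object ids
    weights: dict = field(default_factory=dict)  # (u, v) -> positive weight

    def add_object(self, obj_type, obj_id):
        self.objects.setdefault(obj_type, []).append(obj_id)

    def add_relation(self, u, v, weight=1.0):
        known = {o for objs in self.objects.values() for o in objs}
        if u not in known or v not in known:
            raise KeyError("both endpoints must be registered objects")
        if weight <= 0:
            raise ValueError("weights must be positive")
        self.weights[(u, v)] = self.weights.get((u, v), 0.0) + weight

    def type_sizes(self):
        """Number of objects of each type."""
        return {t: len(objs) for t, objs in self.objects.items()}

    def is_heterogeneous(self):
        """True when the network contains more than one object type."""
        return len(self.objects) > 1

net = InformationNetwork()
for obj_type, obj_id in [("author", "a0"), ("author", "a1"),
                         ("paper", "p0"), ("venue", "v0")]:
    net.add_object(obj_type, obj_id)
net.add_relation("a0", "p0")
net.add_relation("a1", "p0")
net.add_relation("p0", "v0", weight=2.0)
print(net.type_sizes(), net.is_heterogeneous())
```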
