

Research Article

A Novel Text Clustering Approach Using Deep-Learning Vocabulary Network

Junkai Yi,1,2 Yacong Zhang,1 Xianghui Zhao,2 and Jing Wan1

1 College of Information Science and Technology, Beijing University of Chemical Technology, Beijing 100029, China
2 China Information Technology Security Evaluation Center, Beijing 100085, China

Correspondence should be addressed to Xianghui Zhao; zxhitsec@sina.com

Received 9 October 2016; Revised 1 February 2017; Accepted 16 February 2017; Published 15 March 2017

Academic Editor: Nazrul Islam

Copyright © 2017 Junkai Yi et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Text clustering is an effective approach to collect and organize text documents into meaningful groups for mining valuable information on the Internet. However, there exist some issues to tackle, such as feature extraction and data dimension reduction. To overcome these problems, we present a novel approach named deep-learning vocabulary network. The vocabulary network is constructed based on a related-word set, which contains the "cooccurrence" relations of words or terms. We replace term frequency in feature vectors with the "importance" of words in terms of the vocabulary network and PageRank, which can generate more precise feature vectors to represent the meaning of text documents. Furthermore, a sparse-group deep belief network is proposed to reduce the dimensionality of feature vectors, and we introduce coverage rate for similarity measure in Single-Pass clustering. To verify the effectiveness of our work, we compare the approach with representative algorithms, and experimental results show that feature vectors in terms of the deep-learning vocabulary network have better clustering performance.

1. Introduction

Webpages, microblogs, and social networks provide much useful information for us, and text clustering is an important text mining method to collect valuable information on the Internet. Text clustering helps us to group an enormous amount of text documents into small meaningful clusters, which have been used in many research fields such as sentiment analysis (opinion mining) [1-3], text classification [4-6], text summarization [7], and event tracking and topic detection [8-10].

The process of text clustering is usually divided into two phases: the preprocessing phase and the clustering phase. Before the preprocessing phase, there are some basic steps (including tokenization, stop-word removal, and word stemming) needed to process text documents; these steps split sentences into words and remove useless words or terms.

The first phase is the preprocessing of text, and the second phase is the clustering of text documents. The preprocessing phase mainly transforms text documents into structured data that can be processed by clustering algorithms. This phase contains two parts: feature extraction and feature selection.

In the existing scientific literature, there are two categories of feature extraction methods: term frequency-based methods and semantic web-based methods. A term frequency-based method counts the occurrences of words, whereas a semantic web-based method structures the knowledge of a certain domain into an ontology, which contains words and their relations.

Term-document vectors are extracted from text documents in the process of feature extraction. Most term frequency-based methods employ the vector space model (VSM) to represent text documents, and each entry of the VSM is the frequency of a word or term. The most representative method based on term frequency is the term frequency-inverse document frequency (tf-idf) algorithm. For its simplicity and high efficiency, researchers have proposed many improved tf-idf algorithms [11, 12].

However, the relations of words (or word order) are lost when text documents are transformed into term-document vectors. Many researchers find that words or terms have a lexical "cooccurrence" phenomenon [13], which means some words or terms have a high probability of occurrence in a text document. Researchers think that the "cooccurrence" relations of words or terms can generate more precise feature vectors to represent the meaning of text documents.

The objective of feature selection is to remove redundant information and reduce the dimensionality of term-document vectors. The methods of feature selection are categorized as corpus-based methods, Latent Semantic Indexing (LSI), and subspace-based clustering. The corpus-based method merges synonyms together to reduce the dimensionality of features, which depends on large corpora such as WordNet and HowNet. Traditional LSI decomposes the term-document matrix by singular value decomposition (SVD). Subspace-based clustering groups text documents in a low-dimensional subspace.

In our paper, we propose a novel approach to address two issues: one is the loss of word relations in the process of feature extraction, and the other is to retain the word relations in dimension reduction. Considering that the relations of words and terms are lost in term frequency-based methods, we construct a vocabulary network to retain the "cooccurrence" relations of words or terms. Term frequency is replaced with the "importance" of words or terms in VSM. Furthermore, traditional feature selection methods can lose some information that affects the performance of clustering [14], so we introduce deep learning for dimension reduction.

The main contribution of our paper is that we present a novel graph-based approach for text clustering, called deep-learning vocabulary network (DLVN). We employ the edges of the vocabulary network to represent the relations between words or terms and extract features of text documents in terms of the related-word set. The related-word set is a set of words in the same class, and we utilize association rule learning to obtain relations between words. In addition, high-dimensional and sparse features of text have a big influence on clustering algorithms, so we employ deep learning for dimensionality reduction. Accordingly, an improved deep-learning Single-Pass (DL-SP) is used in the process of clustering. To verify the effectiveness of the approach, we provide an experimental evaluation based on Chinese corpora.

The rest of this paper is organized as follows. Section 2 reviews related work in previous literature. Section 3 introduces the theoretical foundation related to this paper. Section 4 describes the proposed DLVN approach. Section 5 presents the experimental analysis. Section 6 concludes our work.

2. Related Work

Text clustering groups text documents of similar content (so-called topics) into a cluster. In this section, we use three subsections to review related literature.

2.1. Feature Extraction. The term frequency-based method is an important method to extract features. In term frequency-based methods, text documents are represented as a VSM, and each document is transformed into a vector whose entries are the frequencies of words or terms. Most term frequency-based methods aim to improve tf-idf.

The semantic web structures knowledge into an ontology. As researchers find that the relations between words contribute to understanding the meaning of text, they construct a semantic network in terms of concepts, events, and their relations. Yue et al. [15] constructed a domain-specific ontology to describe the hazards related to dairy products and translated the term-document vectors (namely, feature vectors of text) into a concept space. Wei et al. [16] exploited an ontology hierarchical structure for word sense disambiguation to assess the similarity of words. The experimental results showed better clustering performance for ontology-based methods considering the semantic relations between words. Bing et al. [17] proposed an adaptive concept resolution (ACR) model for the characteristics of text documents, and ACR was an ontology-based method of text representation. However, the efficiency of semantic web analysis is a challenge for researchers, and the large scale of text corpora has a great influence on the algorithms [18].

For retaining the relations of words and terms, some researchers proposed to employ graph-based models in text clustering [19, 20]. Mousavi et al. [21] proposed a weighted-graph representation of text to extract semantic relations in terms of parse trees of sentences. In our work, we introduce frequent itemsets to construct the related-word set and use each itemset of the related-word set to represent the relations between words. Language is always changing, and new words are appearing every day. The related-word set can capture the change of language by mining frequent itemsets.

2.2. Feature Selection. Feature selection is a feature construction method to transform a high-dimensional feature space into a low-dimensional feature space. SVD is a representative method using mathematical theory for dimension reduction. Jun et al. [22] combined SVD and principal component analysis (PCA) for dimensionality reduction. Zhu and Allen [23] proposed a latent semantic indexing subspace signature model (LSISSM) based on LSI and transformed term-document vectors into a low-rank approximation for dimensionality reduction. However, LSI selects a new feature subset to construct a semantic space, which loses some important features and suffers from irrelevant features.

Due to the sparsity and high dimensionality of text features, the performance of subspace-based clustering is better than that of traditional clustering algorithms [24, 25]. Moreover, some researchers integrate several related theories for dimensionality reduction. Bharti and Singh [26] proposed a hybrid intelligent algorithm, which integrated binary particle swarm optimization, chaotic maps, dynamic inertia weight, and mutation for feature selection.

2.3. Clustering Algorithm. Clustering is an unsupervised approach of machine learning, and it groups similar objects into a cluster. The most representative clustering algorithm is partitional clustering, such as k-means and k-medoids [27], and each cluster has a center called the centroid in partitional clustering. Mei and Chen [28] proposed clustering around weighted prototypes (CAWP) based on a new cluster representation method, where each cluster was represented by multiple objects with various weights. Tunali et al. [29] improved spherical k-means (SKM) and proposed multicluster spherical k-means (MCSKM), which allowed documents to be assigned to more than one cluster. Li et al. [30] introduced a concept of neighbors and proposed a parallel k-means based on neighbors (PKBN).

Another representative clustering algorithm is hierarchical clustering, which contains divisive hierarchical clustering and agglomerative hierarchical clustering [31]. Peng and Liu [32] proposed an incremental hierarchical text clustering approach, which represented a cluster hierarchy using a CFu-tree. In addition, Chen et al. [33] proposed an improved density clustering algorithm based on density-based spatial clustering of applications with noise (DBSCAN). Because DBSCAN is sensitive to the choice of parameters, the authors combined k-means to estimate the parameters.

Ensemble clustering is another clustering algorithm. Ensemble clustering combines the multiple results of different clustering algorithms to obtain final results. Multiview clustering is an extension of ensemble clustering and combines different data that have different properties and views [34, 35].

Matrix factorization-based clustering is an important clustering approach [36]. Lu et al. [37] proposed a semisupervised concept factorization (SSCF), which contained nonnegative matrix factorization and concept factorization for text clustering. SSCF integrated penalized and reward terms by pairwise constraints, must-link constraints $C_{ML}$ and cannot-link constraints $C_{CL}$, which implied two documents belonging to the same cluster or to different clusters.

Topic-based text clustering is an effective text clustering approach, in which text documents are projected into a topic space. Latent Dirichlet allocation (LDA) is a common topic model. Yau et al. [38] separated scientific publications into several clusters based on LDA. Ma et al. [39] employed the topic model of LDA to represent the centroids of clusters and combined the k-means++ algorithm for document clustering.

In some literature, additional information is introduced for text clustering, such as side-information [40] and privileged information [41]. What is more, several global optimization algorithms are utilized for text clustering, such as the particle swarm optimization (PSO) algorithm [42, 43] and the bee colony optimization (BCO) algorithm [44, 45].

Similarity measure is also an important issue in text clustering algorithms. Computing the similarity between a text document and a cluster is a fundamental problem in clustering algorithms. The most common similarity measures are distance metrics such as Euclidean distance, Cosine distance, and Generalized Mahalanobis distance [46]. There exist other similarity measure methods such as IT-Sim (an information-theoretic measure) [47]. Besides similarity measures, measurement of discrimination information (MDI) is an opposite concept used to compute the relations of text documents [48-50].

3. Theoretical Foundation

In this section, we describe some theories related to our work. This section contains three subsections, which cover frequent pattern maximal (FPMAX), PageRank, and the deep belief network (DBN).

Procedure FPMAX(T)
Input: T (an FP-tree)
Global:
    MFIT: an MFI-tree
    Head: a linked list of items
Output: the MFIT that contains all MFIs
Method:
(1) if T only contains a single path P
(2)     insert Head ∪ P into MFIT
(3) else for each i in Header-table of T
(4)     append i to Head
(5)     construct the Head-pattern base
(6)     Tail = {frequent items in base}
(7)     subset_checking(Head ∪ Tail)
(8)     if Head ∪ Tail is not in MFIT
(9)         construct the FP-tree T_Head
(10)        call FPMAX(T_Head)
(11)    remove i from Head

Algorithm 1: FPMAX.

3.1. FPMAX. FPMAX is a depth-first and recursive algorithm for mining maximal frequent itemsets (MFIs) in a given dataset [51]. Before FPMAX is called, a frequent pattern tree (FP-tree) is constructed to store frequent itemsets, and each branch of the FP-tree is a representation of a frequent itemset. The FP-tree includes a linked-list header, which contains all items of the dataset. A maximal frequent itemset tree (MFI-tree) is introduced to store all MFIs in FPMAX. The procedure of FPMAX is described in Algorithm 1.
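To make the notion of maximal frequent itemsets concrete, the following is a minimal Python sketch that mines them by brute-force support counting over small word-group transactions. It only illustrates the definition and is not the FP-tree-based FPMAX algorithm; the example transactions are hypothetical.

```python
from itertools import combinations

def maximal_frequent_itemsets(transactions, min_support):
    """Brute-force MFI mining for small datasets (illustration of the definition).

    transactions: list of sets of items (e.g., word-group codes per document)
    min_support : minimum number of transactions an itemset must occur in
    """
    items = sorted(set().union(*transactions))
    frequent = []
    for size in range(1, len(items) + 1):
        level = [frozenset(c) for c in combinations(items, size)
                 if sum(1 for t in transactions if set(c) <= t) >= min_support]
        if not level:
            break
        frequent.extend(level)
    # an itemset is maximal if no frequent proper superset exists
    return [s for s in frequent if not any(s < t for t in frequent)]

docs = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "b", "c"}, {"c", "d"}]
print(maximal_frequent_itemsets(docs, min_support=2))
# two maximal frequent itemsets: {a, b, c} and {d}
```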

3.2. PageRank. PageRank is a link-based ranking algorithm, which is used in the Google search engine. Most webpages on the Internet are connected with hyperlinks, which carry important information. Hence, some webpages pointed to by many webpages are considered to include quality information.

Webpages and hyperlinks in PageRank are structured into a directed graph $G = (V, E)$, where $V$ is the set of webpages and $E$ is the set of hyperlinks. Let $n$ be the total number of webpages. The PageRank score of webpage $i$ is defined by

$$P(i) = \sum_{(i,j) \in E} \frac{P(j)}{O_j}, \quad (1)$$

where $O_j$ is the number of webpages that page $j$ points out to. Let $P$ be a vector representing all PageRank scores:

$$P = (P(1), P(2), \ldots, P(n))^{T}. \quad (2)$$

Let $A$ be the adjacency matrix of the graph $G$ with

$$A_{ij} = \begin{cases} \dfrac{1}{O_i}, & \text{if } (i,j) \in E, \\ 0, & \text{if } (i,j) \notin E. \end{cases} \quad (3)$$

Hence, (1) can be written as the system of equations

$$P = A^{T} P. \quad (4)$$


Figure 1: The structure of DBN (a visible layer, hidden layers 1 to k, and an output layer, with weights W and bias vectors a and b between adjacent layers).

PageRank models web surfing as a stochastic process, and the theory of Markov chains can be applied. However, the web graph does not meet the conditions of a stochastic process, which requires $A$ to be stochastic, irreducible, and aperiodic. After adjusting $A$ to fix this problem, we obtain an improved model with

$$P = \left( (1-d) \frac{E}{n} + d A^{T} \right) P, \quad (5)$$

where $E$ is $ee^{T}$ ($e$ is a column vector of all 1's), and thus $E$ is an $n \times n$ matrix with all 1's, and $d$ is a parameter called the damping factor. After scaling, we obtain

$$P = (1-d) e + d A^{T} P. \quad (6)$$

Equation (6) can also be transformed as follows:

$$P(i) = (1-d) + d \sum_{(i,j) \in E} \frac{P(j)}{O_j}. \quad (7)$$

The computation of the PageRank score is an iterative process: given an initial value of $P$, the iteration ends when the PageRank scores no longer change or the change is less than a threshold.
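As an illustration of the iteration described above, here is a small Python sketch of PageRank by power iteration over a toy adjacency matrix; the damping factor, tolerance, and toy graph are assumptions chosen for the example.

```python
import numpy as np

def pagerank(adj, d=0.85, tol=1e-8, max_iter=100):
    """PageRank by power iteration following Eq. (7) (illustrative sketch).

    adj[i, j] = 1 if node i links to node j.
    """
    n = adj.shape[0]
    out_deg = adj.sum(axis=1)
    out_deg[out_deg == 0] = 1                 # guard against dangling nodes
    A = adj / out_deg[:, None]                # A_ij = 1/O_i if (i, j) in E
    p = np.ones(n)
    for _ in range(max_iter):
        p_new = (1 - d) + d * A.T @ p         # the update of Eq. (7)
        if np.abs(p_new - p).sum() < tol:     # stop when scores stabilize
            p = p_new
            break
        p = p_new
    return p / p.sum()                        # normalized scores

# toy vocabulary network with 4 nodes
adj = np.array([[0, 1, 1, 0],
                [0, 0, 1, 0],
                [1, 0, 0, 1],
                [0, 0, 1, 0]], dtype=float)
print(pagerank(adj))
```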

3.3. Deep Belief Network (DBN). DBN is a model of deep learning composed of multilayer restricted Boltzmann machines (RBMs). DBN contains an input layer (visible layer), hidden layers, and an output layer. There are connections between a layer and its adjacent layer, but no connections among units within each layer. The structure of DBN is shown in Figure 1.

As shown in Figure 1, an RBM consists of two adjacent layers. The training of DBN includes two steps, pretraining and fine-tuning. An RBM contains a visible layer $v^{(k)}$ and a hidden layer $h^{(k)}$. The parameters of an RBM are $(W^{(k)}, a^{(k)}, b^{(k)})$, where $W^{(k)}$ are the weights of connections between the visible layer and the hidden layer, and $(a^{(k)}, b^{(k)})$ are the bias vectors of the visible units and the hidden units. Giving an initial value to $W^{(k)}$, the parameters are updated with

$$W^{(k)}_{ij} = W^{(k)}_{ij} + \eta \nabla W^{(k)}_{ij}, \quad (8)$$

where $\eta$ is the learning rate, and $(a^{(k)}, b^{(k)})$ are updated similarly to $W^{(k)}$. The gradient $\nabla W^{(k)}$ is obtained by Gibbs sampling:

$$\nabla W^{(k)}_{ij} = E\left[v^{(k)}_i h^{(k)}_j\right]_{\text{data}} - E\left[v^{(k)}_i h^{(k)}_j\right]_{\text{Gibbs}}, \quad (9)$$

where $E[\cdot]_{\text{data}}$ and $E[\cdot]_{\text{Gibbs}}$ are the expectations over data samples and over samples from Gibbs sampling, and $(\nabla a^{(k)}, \nabla b^{(k)})$ are computed similarly to $\nabla W^{(k)}_{ij}$.

DBN is fine-tuned with a set of labeled inputs in terms of error back propagation after the pretraining of DBN. The parameters are updated by

$$W^{(k)}_{ij} = W^{(k)}_{ij} + \eta \nabla W^{(k)}_{ij}, \quad (10)$$

where $\nabla W^{(k)}_{ij} = h^{(k-1)}_i \delta^{(k)}_j$ and $\delta^{(k)}_j$ is an error vector.
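For readers who want to see how the pretraining update in (8)-(9) is usually approximated in practice, below is a minimal numpy sketch of one contrastive-divergence (CD-1) step for a binary RBM. The layer sizes, learning rate, and random data are assumptions for illustration; the paper does not prescribe this particular implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, a, b, lr=0.1):
    """One contrastive-divergence (CD-1) step for a binary RBM (illustrative sketch).

    v0: batch of visible vectors, shape (batch, n_visible)
    W : weights, shape (n_visible, n_hidden); a, b: visible/hidden biases
    """
    # positive phase: E[v h]_data
    ph0 = sigmoid(v0 @ W + b)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # one step of Gibbs sampling: E[v h]_Gibbs
    pv1 = sigmoid(h0 @ W.T + a)
    ph1 = sigmoid(pv1 @ W + b)
    # gradients of Eq. (9), averaged over the batch
    grad_W = (v0.T @ ph0 - pv1.T @ ph1) / len(v0)
    grad_a = (v0 - pv1).mean(axis=0)
    grad_b = (ph0 - ph1).mean(axis=0)
    # parameter update of Eq. (8)
    return W + lr * grad_W, a + lr * grad_a, b + lr * grad_b

n_visible, n_hidden = 6, 4
W = 0.01 * rng.standard_normal((n_visible, n_hidden))
a, b = np.zeros(n_visible), np.zeros(n_hidden)
batch = rng.integers(0, 2, size=(8, n_visible)).astype(float)
W, a, b = cd1_update(batch, W, a, b)
```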


Figure 2: The procedure of DLVN. TongYiCi CiLin and the related-word set (RS) are used to build the vocabulary network, PageRank produces the feature vectors (feature extraction), the sparse-group DBN performs feature selection, and DL-SP performs clustering.

Figure 3: The structure of TongYiCi CiLin, a five-layer hierarchy of major classes, middle classes, small classes, word groups, and atomic word groups, where each atomic word group is mapped to a code (e.g., Af03A06).

4. Deep-Learning Vocabulary Network

In this section, we propose an approach called deep-learning vocabulary network (DLVN) for text clustering. The first step of DLVN is the construction of the vocabulary network. The cooccurrence of words or terms is useful information for text clustering. We use the nodes of the vocabulary network to represent words or terms and the edges of the vocabulary network to represent the relations between words or terms. In our work, there are two methods to obtain the cooccurrence relations of words: the related-word set and TongYiCi CiLin. Frequent itemsets are used to discover the relations of items in a database. We create the related-word set by frequent itemsets, and each itemset of the related-word set is a set of words with cooccurrence relations. PageRank is employed to obtain the "importance" of nodes (feature vectors) instead of the term frequency in VSM. Then, an improved DBN (called sparse-group DBN) is proposed for dimensionality reduction. In the clustering step, we present DL-SP, in which coverage rate is used for similarity measure. The procedure of DLVN is shown in Figure 2.

4.1. Related-Word Set. The relations of words or terms are important information in text documents. Usually, natural language has fixed collocations and corresponding contexts, which means some words or terms have a high probability of occurrence in a text document. Thus, the relations between words are important to represent the meaning of text documents. In our paper, we use frequent itemsets to obtain the cooccurrence relations between words or terms.

Definition 1 (related-word set). Let $D = \{\text{word}_1, \text{word}_2, \ldots, \text{word}_n\}$ be the words of text documents from the same topic and $\sup[\cdot]$ be the support of itemsets. Given a minimum support $\sup_{ms}$, $X = \{\text{word}_i, \text{word}_j, \ldots, \text{word}_k\}$ is defined as an itemset of the related-word set, where $\sup[X] > \sup_{ms}$.

FPMAX is a depth-first and recursive algorithm for mining MFIs, and it is based on an FP-tree to store frequent itemsets. When a database has a large scale, all itemsets of the MFI-tree are detected in the subset checking of FPMAX, which has a big influence on the efficiency of FPMAX. For improving the efficiency of FPMAX, we use TongYiCi CiLin and string matching to compress the FP-tree.

TongYiCi CiLin is a Chinese semantic dictionary of synonyms and related words, which organizes all words as a five-layer hierarchical tree. It contains 77,343 words, which are divided into 12 major classes, 94 middle classes, and 1438 small classes. The fourth layer and the fifth layer are further divided into word groups and atomic word groups. We use Figure 3 to illustrate the structure of TongYiCi CiLin.

TongYiCi CiLin maps an atomic word group into a code: the first layer and the fourth layer are capital letters, the second layer is a lowercase letter, and the third layer and the fifth layer are integers. For example, the code "Aa01A02" stands for the atomic word group {man, mankind, human}. We replace the words or terms with the codes of word groups in MFI mining, which contains 4223 nodes. We randomly select 10 documents from the same topic, and the frequent items (words) are listed in Table 1. As some words belong to the same word group, the number of words is compressed largely.
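The word-to-code compression can be pictured with a few lines of Python. The dictionary below is a tiny hypothetical fragment of a CiLin-style mapping, chosen to match the codes in Table 1, not the real 77,343-word resource.

```python
# hypothetical fragment of a TongYiCi CiLin-style word -> word-group-code mapping
cilin_code = {
    "vehicle": "Bo21A01", "car": "Bo21A01", "automobile": "Bo21A01",
    "truck": "Bo21A27", "lorry": "Bo21A27", "jeep": "Bo21A27",
    "engine": "Bo01A06", "power": "Dd14B36", "quality": "Dd12A01",
}

def compress(document_words):
    """Replace words by their word-group codes before building the FP-tree."""
    return {cilin_code[w] for w in document_words if w in cilin_code}

print(compress(["vehicle", "car", "engine", "quality"]))
# {'Bo21A01', 'Bo01A06', 'Dd12A01'} -- synonyms collapse into one node
```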

The structures of the FP-trees created based on words and word groups are shown in Figure 4. Figure 4(a) is the FP-tree of words, and the FP-tree of word groups is shown in Figure 4(b). The nodes of the FP-tree based on word groups are fewer than the nodes of the FP-tree based on words.

The MFIs in Figure 4(b) have redundant items. For example, the MFIs of Figure 4(b) are listed in Table 2.


Figure 4: The structures of the FP-trees: (a) the FP-tree built from words; (b) the FP-tree built from word groups.

Table 1: The comparison of words and word groups.

Words | Word groups
a(vehicle), b(car), c(engine), d(quality) | a(Bo21A01), b(Bo01A06), e(Dd12A01)
e(automobile), f(truck) | a(Bo21A01), c(Bo21A27)
b(car), g(engine), h(power) | a(Bo21A01), b(Bo01A06), d(Dd14B36)
a(vehicle), i(lorry), c(engine), h(power) | a(Bo21A01), b(Bo01A06), c(Bo21A27), d(Dd14B36)
a(vehicle), j(jeep) | a(Bo21A01), c(Bo21A27)
k(ground), l(situation) | f(Cb08A01), l(Da21A07)
m(area), n(store), o(construction), p(environment) | f(Cb08A01), i(Dm04A01), j(Bn01A01), k(Da21D01)
q(shop), r(building), p(environment) | i(Dm04A01), k(Da21D01), j(Bn01A01)
s(location), q(shop), t(condition) | f(Cb08A01), i(Dm04A01), k(Da21D01)
d(quality), n(store), o(construction), p(environment) | e(Dd12A01), i(Dm04A01), j(Bn01A01), k(Da21D01)

Table 2: MFIs of the FP-tree based on word groups.

MFIs
1 | a(Bo21A01), b(Bo01A06), d(Dd14B36)
2 | a(Bo21A01), c(Bo21A27)
3 | k(Da21D01), i(Dm04A01), f(Cb08A01)
4 | k(Da21D01), i(Dm04A01), j(Bn01A01)

The MFIs in Table 2 include two categories of word groups: the word groups {a(Bo21A01), b(Bo01A06), d(Dd14B36)} and {a(Bo21A01), c(Bo21A27)} are closely related, and the word groups {k(Da21D01), i(Dm04A01), f(Cb08A01)} and {k(Da21D01), i(Dm04A01), j(Bn01A01)} are closely related. In fact, the aim of the related-word set is to mine the "cooccurrence" of words, and we assume that the relations of words have transitivity. Therefore, we utilize string matching and the shared items to combine MFIs.

Definition 2 (combination of MFIs). Let MFIS = $\{\text{MFI}_1, \text{MFI}_2, \ldots, \text{MFI}_m\}$ be the set of MFIs obtained from text documents and $\text{cov}(\cdot)$ be the number of the same items in two MFIs. Suppose that $\text{cov}(\text{MFI}_1, \text{MFI}_2) > \text{cov}_{\min}$, where $\text{cov}_{\min}$ is the minimum number of the same items. Then $\text{MFI}_1$ and $\text{MFI}_2$ are removed from MFIS, and the combination $\text{MFI}_1 \cup \text{MFI}_2$ is added to MFIS.

MFIs are inserted into the MFI-tree in terms of $\text{cov}_{\min}$. For example, given $\text{MFI}_1 = \{a, b, c, d, e, f\}$, $\text{MFI}_2 = \{a, b, c, d, e, h\}$, $\text{MFI}_3 = \{e, f, g, h, i, j, k\}$, and $\text{cov}_{\min} = 0.7$, the combination of MFIs is $\text{MFI}_1 \cup \text{MFI}_2 = \{a, b, c, d, e, f, h\}$. The new MFI-tree only has two paths, $\text{MFI}_1 \cup \text{MFI}_2 = \{a, b, c, d, e, f, h\}$ and $\text{MFI}_3 = \{e, f, g, h, i, j, k\}$. The scale of the MFI-tree is simplified, and we integrate FPMAX with the combination of MFIs to propose an algorithm named FPMAX with related-word set (FPMAX-RS). The steps of FPMAX-RS are listed in Algorithm 2.

4.2. The Construction of the Vocabulary Network. In this section, the vocabulary network is constructed to represent text documents, and the vocabulary network contains the relations between words or terms. We employ the "importance" of nodes instead of term frequency in VSM.

4.2.1. The Selection of Vocabulary Network Nodes. The word groups in TongYiCi CiLin are used as nodes instead of words in the vocabulary network. The number of word groups is much smaller than the number of words. In addition, we choose the word groups whose frequency is higher than a specified minimal frequency $f_{\min}$.

4.2.2. The Construction of Edges in the Vocabulary Network. Edges of a complex network are an important carrier of information, and the edges of the vocabulary network are used in calculating the "importance" of nodes. Considering the semantic and related information among words or terms, an edge is added to the vocabulary network in terms of the similarity of nodes. Therefore, we add an edge to the vocabulary network if two word groups have a closer position in TongYiCi CiLin.


Procedure FPMAX-RS(T)
Input: T (an FP-tree), cov_min
Global:
    MFIT: an MFI-tree
    Head: a linked list of items
Output: the MFIT that contains all MFIs
Method:
(1) if T only contains a single path P
(2)     if cov(Head ∪ P, MFI) > cov_min
(3)         combine the MFI-tree with this path
(4)     else
(5)         insert Head ∪ P into MFIT
(6) else for each i in Header-table of T
(7)     append i to Head
(8)     construct the Head-pattern base
(9)     Tail = {frequent items in base}
(10)    subset_checking(Head ∪ Tail)
(11)    if Head ∪ Tail is not in the MFI-tree
(12)        construct the FP-tree T_Head
(13)        call FPMAX-RS(T_Head)
(14)    remove i from Head

Algorithm 2: FPMAX-RS.

The semantic similarity of word groups $\text{sim}(i, j)$ is defined as

$$\text{sim}(i, j) = \frac{\text{depth}(i, j)}{l} \times \frac{\text{TN} - \text{Dis}(i, j) + 1}{\text{TN}}, \quad (11)$$

where $\text{depth}(i, j)$ is the depth of the first common father node, $l$ is the depth of $i$ and $j$, TN is the total number of word groups, and $\text{Dis}(i, j)$ denotes the distance between $i$ and $j$. For example, consider the two words car and wheel, whose word-group codes are Bo21A and Bo25B. Because the two nodes are in the fourth layer, the first common father node is Bo, which is in the second layer. In addition, the fourth layer contains 4223 word groups, and $\text{Dis}(i, j)$ of Bo21A and Bo25B is 14. Therefore, $\text{sim}(\text{Bo21A}, \text{Bo25B})$ is calculated as follows:

$$\text{sim}(\text{Bo21A}, \text{Bo25B}) = \frac{2}{4} \times \frac{4223 - 14 + 1}{4223}. \quad (12)$$

The nodes in the vocabulary network are traversed, and an edge between $i$ and $j$ is added when $\text{sim}(i, j) > \text{sim}_{\min}$ (a specified threshold).
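The computation in (11)-(12) can be written as a one-line Python function. The threshold value below is a hypothetical choice for illustration; in practice Dis(i, j) would be looked up in TongYiCi CiLin.

```python
def word_group_similarity(depth_common, l, dis, tn=4223):
    """Semantic similarity of two word groups, following Eq. (11).

    depth_common: depth of the first common father node in TongYiCi CiLin
    l           : depth (layer) of the two word groups being compared
    dis         : distance Dis(i, j) between the two groups within the layer
    tn          : total number of word groups in the layer
    """
    return (depth_common / l) * (tn - dis + 1) / tn

# the example of Eq. (12): Bo21A and Bo25B share the father node "Bo" (layer 2),
# both lie in layer 4, and their distance in that layer is 14
sim = word_group_similarity(depth_common=2, l=4, dis=14)
print(round(sim, 4))        # 0.4985

sim_min = 0.4               # hypothetical threshold
print(sim > sim_min)        # True, so an edge would be added to the network
```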

In addition, we add an edge between two nodes if an MFI in the related-word set includes the two words, since each MFI in the related-word set is a word set with cooccurrence relations. In fact, the meanings of words in an MFI are not necessarily similar; an MFI includes a group of words cooccurring in documents of the same topic. When a text document contains the words in an MFI, the text document has a high probability of belonging to a certain topic. Therefore, we add an edge into the vocabulary network with the low-frequency word pointing to the high-frequency word.

4.2.3. The Extraction of Feature Vectors. In the vocabulary network, the number and the direction of edges reflect the importance of nodes, which is similar to evaluating the importance of webpages. Thus, PageRank is utilized to obtain the importance of nodes, and the initial value $\text{PR}_i$ of a node is defined by

$$\text{PR}_i = \frac{f_i}{\sum_{j=1}^{N} f_j}, \quad (13)$$

where $f_i$ is the frequency of word group $i$. After iterative computation and normalization of $\text{PR}_i$, we use the PageRank scores of the nodes as the feature vectors of text documents instead of term frequency in this paper.

4.3. Deep-Learning Single-Pass (DL-SP). In this paper, a sparse-group DBN is proposed for dimensionality reduction of feature vectors. DBN is a model of deep learning. Luo et al. [52] found that the units of hidden layers exhibited statistical dependencies and proposed a regularization constant to restrict the relations in hidden layers. Due to the sparsity of feature vectors, we combine the word dependencies and DBN to propose a sparse-group DBN for dimensionality reduction. In addition, coverage rate (CoR) is proposed for similarity measure among feature vectors in DL-SP.

4.3.1. Sparse-Group DBN. Deep learning simulates the process of human thinking, and the result of deep learning is a distributed representation of an input vector. By analyzing the feature vectors extracted from the vocabulary network, we find that there exist statistical dependencies between entries of feature vectors, which means certain entries of feature vectors tend to cooccur. The word dependency is also mentioned by many researchers in previous literature [5, 18, 53]. Cooccurrence relations are typically collected in feature vectors, which means a unique word commonly refers to a "target word," and the word dependency is quantified to measure word similarity in text clustering. We provide an example, which is part of a feature vector, in Table 3.

Because the documents in the same topic usually include related words, a part of the units in the visible layer is active simultaneously, and accordingly the documents in different topics usually activate different parts of the units. Based on this observation, we add a regularization constant to the log-likelihood of the training data to retain these relations. In the experiments, we use documents of different topics to train the sparse-group DBN. The sequence of units in the output layer is adjusted accordingly, and the cooccurring units are divided into one group. In other words, the feature vectors of documents of different topics can activate different groups of units in the output layer. The structure of the sparse-group DBN is shown in Figure 5.

The sparse-group DBN is comprised of several RBMs, and two adjacent layers form an RBM. For retaining the dependency of the units in the output layer, we define the activation probability of each group. Given a group $S = \{y_1, y_2, \ldots, y_s\}$ and a training sample $v^{(k)}$, the group probability $P_S(\cdot)$ is given by

$$P_S\left(v^{(k)}\right) = \sqrt{\sum_{s \in S} P\left(y_s = 1 \mid v^{(k)}\right)^2}. \quad (14)$$


Table 3: The word dependencies of a feature vector.

Hj47A01 | Hg19C01 | Ba03A18 | Ba08A07 | Dm05A01 | Hg01A01 | Ae13B01
0.32 | 0.17 | 0.12 | 0.04 | 0 | 0.02 | 0
0.17 | 0.21 | 0.07 | 0.11 | 0 | 0 | 0
0.14 | 0.23 | 0.17 | 0.09 | 0 | 0 | 0
0.23 | 0.12 | 0.06 | 0.14 | 0 | 0.01 | 0
0 | 0 | 0 | 0 | 0.12 | 0.34 | 0.20
0 | 0 | 0 | 0 | 0.24 | 0.21 | 0.10
0 | 0 | 0 | 0 | 0.13 | 0.14 | 0.09

Figure 5: The structure of the sparse-group DBN (a visible layer, a hidden layer, and an output layer whose units are divided into K groups).

The output layer of the sparse-group DBN is divided into $K$ groups, and the probability of the output layer $P_{\text{ol}}(\cdot)$ is defined by

$$P_{\text{ol}}\left(v^{(k)}\right) = \sum_{k=1}^{K} \sqrt{\sum_{s \in S} P\left(y_s = 1 \mid v^{(k)}\right)^2}. \quad (15)$$

We add a regularization constant $\lambda$ and $P_{\text{ol}}(v^{(k)})$ to the optimization function, which is the maximum likelihood estimate of the energy function of an RBM. The optimization function is defined by

$$\max_{W, b, c} \sum \log P\left(v^{(k)}\right) - \lambda \sum_{k=1}^{K} \sqrt{\sum P\left(y_s = 1 \mid v^{(k)}\right)^2}. \quad (16)$$

Equation (9) is improved to (17) accordingly, and $\nabla W^{(k)}_{ij}$ is defined by

$$\nabla W^{(k)}_{ij} = E\left[v^{(k)}_i h^{(k)}_j\right]_{\text{data}} - E\left[v^{(k)}_i h^{(k)}_j\right]_{\text{Gibbs}} - \lambda \cdot \alpha, \quad (17)$$

where $\alpha = \dfrac{\partial}{\partial W^{(k)}_{ij}} P_S(v^{(k)}) = \dfrac{P(y_s = 1 \mid v^{(k)})}{P_S(v^{(k)})} \cdot \dfrac{\partial}{\partial W^{(k)}_{ij}} P(y_s = 1 \mid v^{(k)}) = \dfrac{P(y_s = 1 \mid v^{(k)})^2 \cdot P(y_s = 0 \mid v^{(k)}) \cdot v^{(k)}}{P_S(v^{(k)})}$. Accordingly, the gradients $(\nabla a^{(k)}, \nabla b^{(k)})$ are defined by

$$\nabla a^{(k)}_i = E\left[v^{(k)}_i\right]_{\text{data}} - E\left[v^{(k)}_i\right]_{\text{Gibbs}} - \lambda \cdot \alpha,$$
$$\nabla b^{(k)}_j = E\left[h^{(k)}_j\right]_{\text{data}} - E\left[h^{(k)}_j\right]_{\text{Gibbs}} - \lambda \cdot \alpha. \quad (18)$$

4.3.2. Similarity Measure of DL-SP. Single-Pass is a partitional clustering algorithm. The first document is treated as the first cluster in Single-Pass, and similarity is computed between each new document and the existing clusters, which decides whether the new document joins an existing cluster or creates a new cluster in terms of a specified threshold. The output of the sparse-group DBN is binary, so Euclidean distance and Cosine angle distance are not suitable for similarity measure in DL-SP. Therefore, we use coverage rate (CoR) for similarity measure, and CoR is defined by

$$\text{CoR}(C, d) = \frac{|C \cap d|}{|C|}, \quad (19)$$

where $C = (c_1, c_2, \ldots, c_n)$ is the feature vector of a cluster (named the topic feature vector) and $d = (d_1, d_2, \ldots, d_n)$ is the feature vector of a new document.
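A direct reading of (19) for binary codes can be sketched in a few lines of Python; the toy vectors are hypothetical.

```python
import numpy as np

def coverage_rate(topic_vector, doc_vector):
    """Coverage rate of Eq. (19) for binary feature vectors (illustrative sketch)."""
    c = np.asarray(topic_vector, dtype=bool)
    d = np.asarray(doc_vector, dtype=bool)
    if c.sum() == 0:
        return 0.0
    return np.logical_and(c, d).sum() / c.sum()

topic = [1, 0, 1, 1, 0, 1]
doc   = [1, 0, 1, 0, 0, 1]
print(coverage_rate(topic, doc))   # 3 shared active units / 4 active in the topic = 0.75
```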

Moreover, the addition of many text documents to clusters has an influence on the topic feature vector. In our work, we introduce an optional topic feature vector $C' = (c'_1, c'_2, \ldots, c'_n)$ and the weight of the feature vector to solve this problem. We provide an example of an optional topic feature vector in Figure 6.

When the weight of the optional topic feature vector is greater than a specified threshold in each time interval, we replace the topic feature vector with the optional topic feature vector as the new cluster center. The weight of the topic feature vector is defined by

$$w_{C'} = \frac{\sum_{C} f(c_i)}{\sum_{C} f(c_i) + \sum_{C'} f(c'_j)} - \lambda e^{k(t - t_0)}, \quad (20)$$

where $\lambda e^{k(t - t_0)}$ is a time damping function and $f(c_i)$ is a frequency function.
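Putting the pieces together, the sketch below runs Single-Pass over binary codes with CoR as the similarity measure. It keeps the first document's code as the topic feature vector of each cluster and omits the optional topic feature vector and the weighting of (20), so it is a simplified illustration of DL-SP rather than the full algorithm; the threshold and toy codes are assumptions.

```python
import numpy as np

def dl_sp(doc_vectors, threshold=0.6):
    """Single-Pass clustering over binary codes with CoR similarity (sketch)."""
    clusters = []                      # list of (topic_vector, [doc indices])
    for idx, d in enumerate(doc_vectors):
        d = np.asarray(d, dtype=bool)
        best, best_cor = None, 0.0
        for c_id, (topic, members) in enumerate(clusters):
            cor = np.logical_and(topic, d).sum() / max(topic.sum(), 1)  # Eq. (19)
            if cor > best_cor:
                best, best_cor = c_id, cor
        if best is not None and best_cor >= threshold:
            clusters[best][1].append(idx)      # join the most similar cluster
        else:
            clusters.append((d, [idx]))        # otherwise start a new cluster
    return [members for _, members in clusters]

codes = [[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 1], [0, 1, 1, 1]]
print(dl_sp(codes, threshold=0.5))     # [[0, 1], [2, 3]]
```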

5. Experimental Analysis

In this section, we conduct three sets of experiments to validate the effectiveness of the proposed approach, including the efficiency of FPMAX-RS in related-word set mining, the comparison of feature vectors, and the comparison of DL-SP efficiency. In this work, three Chinese text corpora, TanCorpV1.0, Encyclopedia of China, and Sogou Corpus, are used as the experimental datasets.

5.1. The Efficiency of FPMAX-RS in Related-Word Set Mining. This section compares the running time of FPMAX and FPMAX-RS in related-word set mining.


Figure 6: An example of an optional topic feature vector: (a) the text feature vector $d_i$, (b) the topic feature vector $c_i$, (c) the similarity between them, and (d) the optional topic feature vector $(c_i, c'_i)$.

We choose seven categories (museum, property, education, military, car, sport, and health) of text documents from the datasets, and each category has 50 articles. The result of the experiment is shown in Figure 7.

FPMAX generates a larger number of maximal frequent itemsets and traverses all MFI-trees for subset checking, which has an influence on the running time of FPMAX. Compared with FPMAX, FPMAX-RS has higher efficiency when $\sup_{\min}$ is smaller.

52 The Comparison of Feature Vectors In this work wecompare the distance among the feature vectors based on

tf-idf FC-VSM [12] and DLVN We randomly choose twodocuments from the categorymuseum and one document inother categories including property education and militaryThe aim of feature extraction is to extract the feature vectorsthat can represent the meaning of text documents In otherwords feature vectors in different categories have longerdistance Therefore we compute the Euclidean distanceof feature vectors in different categories based on tf-idfFC-VSM and DLVN Table 4 shows the results in differentcategories of text documents

In the following experiment feature vectors are extractedbased on tf-idf FC-VSM and DLVN Then k-means isapplied for clustering We evaluate clustering performance


Table 4: Euclidean distance comparison.

Category | Documents | Distance (tf-idf) | Distance (FC-VSM) | Distance (DLVN)
museum | museum1 - museum2 | 1302 | 1049 | 917
property | property - museum1 | 1285 | 1347 | 1359
property | property - museum2 | 1593 | 1586 | 1687
education | education - museum1 | 1468 | 1461 | 1472
education | education - museum2 | 1139 | 1133 | 1207
military | military - museum1 | 1556 | 1649 | 1658
military | military - museum2 | 1369 | 1403 | 1841

Figure 7: The comparison of running time (ms) of FPMAX and FPMAX-RS for sup_min ranging from 0.01 to 0.06.

Let $D = \{d_1, d_2, \ldots, d_n\}$ be the clustering result and $D^* = \{d^*_1, d^*_2, \ldots, d^*_n\}$ be the standard dataset. F_measure is defined by

$$F\_\text{measure} = \frac{2 \times P(D, D^*) \times R(D, D^*)}{P(D, D^*) + R(D, D^*)}, \quad (21)$$

where $P(D, D^*) = |D \cap D^*| / |D|$ is precision and $R(D, D^*) = |D \cap D^*| / |D^*|$ is recall.
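A direct implementation of these definitions, treating the clustering result and the reference class as sets, might look like the following sketch; the toy sets are hypothetical.

```python
def f_measure(result, standard):
    """Set-based precision, recall, and F-measure of Eq. (21) (illustrative sketch).

    result, standard: sets of items assigned to a cluster and to the reference
    class, respectively.
    """
    result, standard = set(result), set(standard)
    overlap = len(result & standard)
    precision = overlap / len(result) if result else 0.0
    recall = overlap / len(standard) if standard else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f_measure({1, 2, 3, 4}, {2, 3, 4, 5, 6}))   # precision 0.75, recall 0.6 -> 0.667
```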

Because seven categories of text documents are chosen in our experiment, the specified number of clusters $k$ is 7. Figure 8 illustrates that feature vectors based on DLVN have better performance.

5.3. The Comparison of DL-SP Efficiency. In this experiment, we choose text documents from the datasets, and the number of documents in each category is listed in Table 5.

The aim of the experiment is to compare DL-SP with LSI and Single-Pass. The sparse-group DBN has 3 layers, and the numbers of units in the layers are 4223, 3500, and 3000. In addition, the group number $K$ of the top layer is 200. The structure of the sparse-group DBN is shown in Figure 9.

Figure 8: The comparison of F_measure for tf-idf, FC-VSM, and DLVN over the categories sport, military, museum, health, property, education, and car.

Table 5: The datasets of the experiment.

Category | Number of text documents
sport | 1300
military | 1500
health | 1400
property | 900
education | 800
car | 500

The experimental result is shown in Figure 10. DL-SP has better performance than LSI and Single-Pass in sport, military, property, education, and health. However, the F_measure of DL-SP is lower than that of LSI and Single-Pass in the category car, because the smaller number of documents does not train the sparse-group DBN effectively.


Figure 9: The structure of the sparse-group DBN used in the experiment (4223 visible units, 3500 hidden units, and 3000 output units divided into K = 200 groups).

Figure 10: The comparison of F_measure for LSI, Single-Pass, and DL-SP over the categories sport, military, health, property, education, and car.

Table 6: The running time of DL-SP and Single-Pass.

Method | Dimensionality of feature vectors | Running time (s)
Single-Pass | 4223 | 3866
DL-SP | 3000 | 1084

In this subsection, we also compare the running time of DL-SP and Single-Pass, and the result is listed in Table 6.

6. Conclusions

In this paper, we propose an approach called DLVN for text clustering. The existing term frequency-based methods only count the occurrences of words, and the relations of words are not considered in feature extraction. The proposed approach constructs a vocabulary network to mine the importance of words using a related-word set, which contains the "cooccurrence" relations of words. Therefore, the text features of documents in the same category have shorter distance, and feature vectors have longer distance among different categories. Moreover, we employ a sparse-group DBN to reduce the dimensionality of feature vectors in terms of the group relations of words. Thus, the sparse-group DBN can retain the word dependency in dimensionality reduction. In the experiments, we compare the approach with well-known methods to verify our work, and the results show the performance of DLVN.

In the current work, we verify the approach using Chinese corpora. We will use English text to prove the effectiveness of the approach in future work. Moreover, in the process of dimension reduction, we need to train the sparse-group DBN using a large amount of text documents to improve its performance.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work has been supported by Projects U1536116 and U1636208 funded by the National Natural Science Foundation of China (NSFC).

References

[1] A. Trabelsi and O. R. Zaïane, "Extraction and clustering of arguing expressions in contentious text," Data and Knowledge Engineering, vol. 100, pp. 226–239, 2015.

[2] K. Schouten and F. Frasincar, "Survey on aspect-level sentiment analysis," IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 3, pp. 813–830, 2016.

[3] M. Tsytsarau and T. Palpanas, "Survey on mining subjective data on the web," Data Mining and Knowledge Discovery, vol. 24, no. 3, pp. 478–514, 2012.

[4] S.-J. Lee and J.-Y. Jiang, "Multilabel text categorization based on fuzzy relevance clustering," IEEE Transactions on Fuzzy Systems, vol. 22, no. 6, pp. 1457–1471, 2014.

[5] P. Wang, B. Xu, J. Xu, G. Tian, C.-L. Liu, and H. Hao, "Semantic expansion using word embedding clustering and convolutional neural network for improving short text classification," Neurocomputing, vol. 174, pp. 806–814, 2016.

[6] W. Zhang, X. Tang, and T. Yoshida, "TESC: an approach to TExt classification using Semi-supervised Clustering," Knowledge-Based Systems, vol. 75, pp. 152–160, 2015.

[7] A. B. Al-Saleh and M. E. B. Menai, "Automatic Arabic text summarization: a survey," Artificial Intelligence Review, vol. 45, no. 2, pp. 203–234, 2016.

[8] F. Atefeh and W. Khreich, "A survey of techniques for event detection in Twitter," Computational Intelligence, vol. 31, no. 1, pp. 132–164, 2015.

[9] G. Stilo and P. Velardi, "Efficient temporal mining of micro-blog texts and its application to event discovery," Data Mining and Knowledge Discovery, vol. 30, no. 2, pp. 372–402, 2016.

[10] G. Huang, J. He, Y. Zhang et al., "Mining streams of short text for analysis of world-wide event evolutions," World Wide Web, vol. 18, no. 5, pp. 1201–1217, 2014.

[11] U. Erra, S. Senatore, F. Minnella, and G. Caggianese, "Approximate TF-IDF based on topic extraction from massive message stream using the GPU," Information Sciences, vol. 292, pp. 143–161, 2015.

[12] C. Qimin, G. Qiao, W. Yongliang, and W. Xianghua, "Text clustering using VSM with feature clusters," Neural Computing and Applications, vol. 26, no. 4, pp. 995–1003, 2015.

[13] J. Martinez-Gil, "An overview of textual semantic similarity measures based on web intelligence," Artificial Intelligence Review, vol. 42, no. 4, pp. 935–943, 2012.

[14] K. K. Bharti and P. K. Singh, "Hybrid dimension reduction by integrating feature selection with feature extraction method for text clustering," Expert Systems with Applications, vol. 42, no. 6, pp. 3105–3114, 2015.

[15] L. Yue, W. Zuo, T. Peng, Y. Wang, and X. Han, "A fuzzy document clustering approach based on domain-specified ontology," Data and Knowledge Engineering, vol. 100, pp. 148–166, 2015.

[16] T. Wei, Y. Lu, H. Chang, Q. Zhou, and X. Bao, "A semantic approach for text clustering using WordNet and lexical chains," Expert Systems with Applications, vol. 42, no. 4, pp. 2264–2275, 2015.

[17] L. Bing, S. Jiang, W. Lam, Y. Zhang, and S. Jameel, "Adaptive concept resolution for document representation and its applications in text mining," Knowledge-Based Systems, vol. 74, no. 1, pp. 1–13, 2015.

[18] R. Irfan, C. K. King, D. Grages et al., "A survey on text mining in social networks," Knowledge Engineering Review, vol. 30, no. 2, pp. 157–170, 2015.

[19] N. Indurkhya, "Emerging directions in predictive text mining," Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 5, no. 4, pp. 155–164, 2015.

[20] M. T. Mills and N. G. Bourbakis, "Graph-based methods for natural language processing and understanding: a survey and analysis," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 44, no. 1, pp. 59–71, 2014.

[21] H. Mousavi, D. Kerr, M. Iseli, and C. Zaniolo, "Mining semantic structures from syntactic structures in free text documents," in Proceedings of the 8th IEEE International Conference on Semantic Computing (ICSC '14), pp. 84–91, IEEE, Newport Beach, Calif, USA, June 2014.

[22] S. Jun, S.-S. Park, and D.-S. Jang, "Document clustering method using dimension reduction and support vector clustering to overcome sparseness," Expert Systems with Applications, vol. 41, no. 7, pp. 3204–3212, 2014.

[23] W. Z. Zhu and R. B. Allen, "Document clustering using the LSI subspace signature model," Journal of the American Society for Information Science and Technology, vol. 64, no. 4, pp. 844–860, 2013.

[24] X. Wu, X. Chen, X. Li, L. Zhou, and J. Lai, "Adaptive subspace learning: an iterative approach for document clustering," Neural Computing & Applications, vol. 25, no. 2, pp. 333–342, 2014.

[25] H. Kriegel and E. Ntoutsi, "Clustering high dimensional data," ACM SIGKDD Explorations Newsletter, vol. 15, no. 2, pp. 1–8, 2014.

[26] K. K. Bharti and P. K. Singh, "Opposition chaotic fitness mutation based adaptive inertia weight BPSO for feature selection in text clustering," Applied Soft Computing Journal, vol. 43, pp. 20–34, 2016.

[27] M. C. N. Barioni, H. Razente, A. M. R. Marcelino, A. J. M. Traina, and C. Traina, "Open issues for partitioning clustering methods: an overview," Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 4, no. 3, pp. 161–177, 2014.

[28] J.-P. Mei and L. Chen, "Proximity-based k-partitions clustering with ranking for document categorization and analysis," Expert Systems with Applications, vol. 41, no. 16, pp. 7095–7105, 2014.

[29] V. Tunali, T. Bilgin, and A. Camurcu, "An improved clustering algorithm for text mining: multi-cluster spherical K-means," International Arab Journal of Information Technology, vol. 13, no. 1, pp. 12–19, 2016.

[30] Y. Li, C. Luo, and S. M. Chung, "A parallel text document clustering algorithm based on neighbors," Cluster Computing, vol. 18, no. 2, pp. 933–948, 2015.

[31] F. Murtagh and P. Contreras, "Algorithms for hierarchical clustering: an overview," Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 2, no. 1, pp. 86–97, 2012.

[32] T. Peng and L. Liu, "A novel incremental conceptual hierarchical text clustering method using CFu-tree," Applied Soft Computing, vol. 27, pp. 269–278, 2015.

[33] Q. Chen, J. F. Lu, and H. Zhang, "A text mining model based on improved density clustering algorithm," in Proceedings of the 4th IEEE International Conference on Electronics Information and Emergency Communication (ICEIEC '13), Beijing, China, November 2013.

[34] S. F. Hussain, M. Mushtaq, and Z. Halim, "Multi-view document clustering via ensemble method," Journal of Intelligent Information Systems, vol. 43, no. 1, pp. 81–99, 2014.

[35] A. Wahid, X. Gao, and P. Andreae, "Multi-view clustering of web documents using multi-objective genetic algorithm," in Proceedings of the IEEE Congress on Evolutionary Computation (CEC '14), pp. 2625–2632, Beijing, China, July 2014.

[36] X. Pei, T. Wu, and C. Chen, "Automated graph regularized projective nonnegative matrix factorization for document clustering," IEEE Transactions on Cybernetics, vol. 44, no. 10, pp. 1821–1831, 2014.

[37] M. Lu, X.-J. Zhao, L. Zhang, and F.-Z. Li, "Semi-supervised concept factorization for document clustering," Information Sciences, vol. 331, pp. 86–98, 2016.

[38] C.-K. Yau, A. Porter, N. Newman, and A. Suominen, "Clustering scientific documents with topic modeling," Scientometrics, vol. 100, no. 3, pp. 767–786, 2014.

[39] Y. Ma, Y. Wang, and B. Jin, "A three-phase approach to document clustering based on topic significance degree," Expert Systems with Applications, vol. 41, no. 18, pp. 8203–8210, 2014.

[40] C. C. Aggarwal, Y. Zhao, and P. S. Yu, "On the use of side information for mining text data," IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 6, pp. 1415–1429, 2014.

[41] R. M. Marcacini, M. A. Domingues, E. R. Hruschka, and S. O. Rezende, "Privileged information for hierarchical document clustering: a metric learning approach," in Proceedings of the 22nd International Conference on Pattern Recognition (ICPR '14), pp. 3636–3641, August 2014.

[42] L. Cagnina, M. Errecalde, D. Ingaramo, and P. Rosso, "An efficient particle swarm optimization approach to cluster short texts," Information Sciences, vol. 265, pp. 36–49, 2014.

[43] W. Song, Y. Qiao, S. C. Park, and X. Qian, "A hybrid evolutionary computation approach with its application for optimizing text document clustering," Expert Systems with Applications, vol. 42, no. 5, pp. 2517–2524, 2015.

[44] R. Forsati, A. Keikha, and M. Shamsfard, "An improved bee colony optimization algorithm with an application to document clustering," Neurocomputing, vol. 159, no. 1, pp. 9–26, 2015.

[45] K. K. Bharti and P. K. Singh, "Chaotic gradient artificial bee colony for text clustering," Soft Computing, vol. 20, no. 3, pp. 1113–1126, 2016.

[46] F. Wang and J. Sun, "Survey on distance metric learning and dimensionality reduction in data mining," Data Mining and Knowledge Discovery, vol. 29, no. 2, pp. 534–564, 2014.

[47] Y.-S. Lin, J.-Y. Jiang, and S.-J. Lee, "A similarity measure for text classification and clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 7, pp. 1575–1590, 2014.

[48] M. T. Hassan, A. Karim, J.-B. Kim, and M. Jeon, "CDIM: document clustering by discrimination information maximization," Information Sciences, vol. 316, pp. 87–106, 2015.

[49] M. T. Hassan and A. Karim, "Clustering and understanding documents via discrimination information maximization," in Proceedings of the Pacific-Asia Conference on Advances in Knowledge Discovery & Data Mining (PAKDD '12), Kuala Lumpur, Malaysia, May 2012.

[50] D. Cai and C. J. van Rijsbergen, "Learning semantic relatedness from term discrimination information," Expert Systems with Applications, vol. 36, no. 2, pp. 1860–1875, 2009.

[51] G. Grahne and J. Zhu, "High performance mining of maximal frequent itemsets," in Proceedings of the SIAM Workshop on High Performance Data Mining: Pervasive and Data Stream Mining (HPDMPDS '03), San Francisco, Calif, USA, May 2003.

[52] H. Luo, R. Shen, and C. Niu, "Sparse group restricted Boltzmann machines," in Proceedings of the 25th AAAI Conference on Artificial Intelligence (AAAI '11), San Francisco, Calif, USA, August 2011.

[53] S. Padó and M. Lapata, "Dependency-based construction of semantic space models," Computational Linguistics, vol. 33, no. 2, pp. 161–199, 2007.



words or terms have a high probability of occurrence in a text document. Researchers think that the "cooccurrence" relations of words or terms can generate more precise feature vectors to represent the meaning of text documents.

The objective of feature selection is to remove redundant information and reduce the dimensionality of term-document vectors. The methods of feature selection are categorized as corpus-based method, Latent Semantic Indexing (LSI), and subspace-based clustering. The corpus-based method merges synonyms together to reduce the dimensionality of features, which depends on large corpora such as WordNet and HowNet. Traditional LSI decomposes a term-document vector into a term-space matrix by singular value decomposition (SVD). Subspace-based clustering groups text documents in a low-dimensional subspace.

In our paper, we propose a novel approach to address two issues: one is the loss of word relations in the process of feature extraction, and the other is to retain the word relations in dimension reduction. Considering that the relations of words and terms are lost in term frequency-based methods, we construct a vocabulary network to retain the "cooccurrence" relations of words or terms. Term frequency is replaced with the "importance" of words or terms in VSM. Furthermore, traditional feature selection methods can lose some information that affects the performance of clustering [14], and we introduce deep learning for dimension reduction.

The main contributions of our paper are that we present a novel graph-based approach for text clustering, called deep-learning vocabulary network (DLVN). We employ the edges of the vocabulary network to represent the relations between words or terms and extract features of text documents in terms of the related-word set. The related-word set is a set of words in the same class, and we utilize association rules learning to obtain relations between words. In addition, high-dimensional and sparse features of text have a big influence on clustering algorithms, and we employ deep learning for dimensionality reduction. Accordingly, an improved deep-learning Single-Pass (DL-SP) is used in the process of clustering. To verify the effectiveness of the approach, we provide our experimental evaluation based on Chinese corpora.

The rest of this paper is organized as follows. Section 2 reviews related work in previous literature. Section 3 introduces the theoretical foundation related to this paper. Section 4 describes the approach of DLVN we propose. Section 5 is experimental analysis. Section 6 is the conclusion of our work.

2. Related Work

Text clustering groups text documents of similar content (a so-called topic) into a cluster. In this section, we use three subsections to review the related literature.

2.1. Feature Extraction. Term frequency-based method is an important method to extract features. In term frequency-based methods, text documents are represented as VSM, and each document is transformed into a vector whose entries are the frequencies of words or terms. Most term frequency-based methods aim to improve tf-idf.

Semantic web is to structure knowledge into an ontology. As researchers find that the relations between words contribute to understanding the meaning of text, they construct a semantic network in terms of concepts, events, and their relations. Yue et al. [15] constructed a domain-specific ontology to describe the hazards related to dairy products and translated the term-document vectors (namely, feature vectors of text) into a concept space. Wei et al. [16] exploited an ontology hierarchical structure for word sense disambiguation to assess the similarity of words. The experiment results showed better clustering performance for ontology-based methods considering the semantic relations between words. Bing et al. [17] proposed an adaptive concept resolution (ACR) model for the characteristics of text documents, and ACR was an ontology-based method of text representation. However, the efficiency of semantic web analysis is a challenge for researchers, and the large scale of text corpora has a great influence on algorithms [18].

For retaining the relations of words and terms, some researchers proposed to employ graph-based models in text clustering [19, 20]. Mousavi et al. [21] proposed a weighted-graph representation of text to extract semantic relations in terms of parse trees of sentences. In our work, we introduce frequent itemsets to construct the related-word set and use each itemset of the related-word set to represent the relations between words. Language is always changing, and new words are appearing every day. The related-word set can capture the change of language by mining frequent itemsets.

2.2. Feature Selection. Feature selection is a feature construction method to transform a high-dimensional feature space into a low-dimensional feature space. SVD is a representative method using mathematical theory for dimension reduction. Jun et al. [22] combined SVD and principal-component analysis (PCA) for dimensionality reduction. Zhu and Allen [23] proposed a latent semantic indexing subspace signature model (LSISSM) based on LSI and transformed term-document vectors into a low-rank approximation for dimensionality reduction. However, LSI selects a new feature subset to construct a semantic space, which loses some important features and suffers from the irrelevant features.

Due to the sparsity and high dimensionality of text features, the performance of subspace-based clustering is better than traditional clustering algorithms [24, 25]. Moreover, some researchers integrate many related theories for dimensionality reduction. Bharti and Singh [26] proposed a hybrid intelligent algorithm, which integrated binary particle swarm optimization, chaotic map, dynamic inertia weight, and mutation for feature selection.

2.3. Clustering Algorithm. Clustering is an unsupervised approach of machine learning, and it groups similar objects into a cluster. The most representative clustering algorithm is partitional clustering, such as k-means and k-medoids [27], and each cluster has a center called centroid in partitional clustering. Mei and Chen [28] proposed a clustering around weighted prototypes (CAWP) based on a new cluster representation method, where each cluster was represented by multiple objects with various weights. Tunali et al. [29] improved spherical k-means (SKM) and proposed a multicluster spherical k-means (MCSKM), which allowed documents to be assigned to more than one cluster.


Li et al. [30] introduced a concept of neighbor and proposed a parallel k-means based on neighbors (PKBN).

Another representative clustering algorithm is hierarchical clustering, which contains divisive hierarchical clustering and agglomerative hierarchical clustering [31]. Peng and Liu [32] proposed an incremental hierarchical text clustering approach, which represented a cluster hierarchy using a CFu-tree. In addition, Chen et al. [33] proposed an improved density clustering algorithm based on density-based spatial clustering of applications with noise (DBSCAN). Because DBSCAN is sensitive to the choice of parameters, the authors combined k-means to estimate the parameters.

Ensemble clustering is another clustering algorithm. Ensemble clustering combines the multiple results of different clustering algorithms to obtain final results. Multiview clustering is an extension of ensemble clustering and combines different data that have different properties and views [34, 35].

Matrix factorization-based clustering is an important clustering approach [36]. Lu et al. [37] proposed a semisupervised concept factorization (SSCF), which contained nonnegative matrix factorization and concept factorization for text clustering. SSCF integrated penalized and reward terms by pairwise constraints, must-link constraints $C_{ML}$ and cannot-link constraints $C_{CL}$, which implied two documents belonging to the same cluster or to different clusters.

Topic-based text clustering is an effective text clustering approach, in which text documents are projected into a topic space. Latent Dirichlet allocation (LDA) is a common topic model. Yau et al. [38] separated scientific publications into several clusters based on LDA. Ma et al. [39] employed the topic model of LDA to represent the centroids of clusters and combined the k-means++ algorithm for document clustering.

In some literatures, additional information is introduced for text clustering, such as side-information [40] and privileged information [41]. What is more, several global optimization algorithms are utilized for text clustering, such as particle swarm optimization (PSO) algorithms [42, 43] and bee colony optimization (BCO) algorithms [44, 45].

Similarity measure is also an important issue in text clustering algorithms. To compute the similarity between a text document and a cluster is a fundamental problem in clustering algorithms. The most common similarity measure is a distance metric, such as Euclidean distance, Cosine distance, and Generalized Mahalanobis distance [46]. There exist other similarity measure methods, such as IT-Sim (an information-theoretic measure) [47]. Besides similarity measure, measurement of discrimination information (MDI) is an opposite concept to compute the relations of text documents [48-50].

3. Theoretical Foundation

In this section, we describe some theories related to our work. This section contains three subsections, which are frequent pattern maximal (FPMAX), PageRank, and deep belief network (DBN).

Procedure FPMAX(T)
Input: T (an FP-tree)
Global:
    MFIT: an MFI-tree
    Head: a linked list of items
Output: the MFIT that contains all MFIs
Method:
(1) if T only contains a single path P
(2)     insert Head ∪ P into MFIT
(3) else for each i in Header-table of T
(4)     append i to Head
(5)     construct the Head-pattern base
(6)     Tail = frequent items in base
(7)     subset_checking(Head ∪ Tail)
(8)     if Head ∪ Tail is not in MFIT
(9)         construct the FP-tree T_Head
(10)        call FPMAX(T_Head)
(11)    remove i from Head

Algorithm 1: FPMAX.

3.1. FPMAX. FPMAX is a depth-first and recursive algorithm for mining maximal frequent itemsets (MFIs) in a given dataset [51]. Before FPMAX is called, a frequent pattern tree (FP-tree) is structured to store frequent itemsets, and each branch of the FP-tree is a representation of a frequent itemset. The FP-tree includes a linked list Head, which contains all items of the dataset. A maximal frequent itemset tree (MFI-tree) is introduced to store all MFIs in FPMAX. The procedure of FPMAX is described in Algorithm 1.
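
To make the notion of a maximal frequent itemset concrete, the following is a minimal, brute-force sketch on a toy transaction set. It is not the FPMAX algorithm itself (FPMAX relies on the FP-tree and MFI-tree above for efficiency); the transactions and the minimum support below are illustrative only.

```python
# Brute-force illustration of maximal frequent itemsets (MFIs).
# NOT FPMAX: it enumerates all candidate itemsets, which FPMAX avoids.
from itertools import combinations

def maximal_frequent_itemsets(transactions, min_support):
    """Return all frequent itemsets that have no frequent superset."""
    items = sorted({item for t in transactions for item in t})
    frequent = []
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            c = set(candidate)
            support = sum(1 for t in transactions if c <= t)
            if support >= min_support:
                frequent.append(frozenset(c))
    # Keep only itemsets not contained in a larger frequent itemset.
    return [f for f in frequent if not any(f < g for g in frequent)]

if __name__ == "__main__":
    docs = [{"a", "b", "c"}, {"a", "b", "c", "d"}, {"a", "b"}, {"c", "d"}]
    for mfi in maximal_frequent_itemsets(docs, min_support=2):
        print(sorted(mfi))   # {a, b, c} and {c, d} are maximal here
```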

3.2. PageRank. PageRank is a link-based ranking algorithm, which is used in the Google search engine. Most webpages on the Internet are connected with hyperlinks, which carry important information. Hence, some webpages pointed to by many webpages are considered to include quality information.

Webpages and hyperlinks in PageRank are structured as a directed graph $G = (V, E)$, where $V$ is the set of webpages and $E$ is the set of hyperlinks. Let $n$ be the total number of webpages. The PageRank score of webpage $i$ is defined by

    $P(i) = \sum_{(i,j) \in E} \frac{P(j)}{O_j}$,    (1)

where $O_j$ is the number of webpages that page $j$ points out to. Let $P$ be a vector representing all PageRank scores:

    $P = (P(1), P(2), \ldots, P(n))^T$.    (2)

Let $A$ be the adjacency matrix of the graph $G$ with

    $A_{ij} = \begin{cases} 1/O_i, & \text{if } (i,j) \in E \\ 0, & \text{if } (i,j) \notin E \end{cases}$.    (3)

Hence, (1) can be written as the system of equations

    $P = A^T P$.    (4)


Figure 1: The structure of DBN (a visible layer, hidden layers 1 to k, and an output layer y_1, ..., y_n).

PageRank models web surfing as a stochastic process, and the theory of Markov chains can be applied. However, the web graph does not meet the conditions of a stochastic process, which requires $A$ to be stochastic, irreducible, and aperiodic. After the adjustment of $A$ to fix this problem, we obtain an improved model with

    $P = \left((1 - d)\frac{E}{n} + d A^T\right) P$,    (5)

where $E$ is $e e^T$ ($e$ is a column vector of all 1's), and thus $E$ is an $n \times n$ matrix of all 1's, and $d$ is a parameter called the damping factor. After scaling, we obtain

    $P = (1 - d)\, e + d A^T P$.    (6)

Equation (6) can also be transformed as follows:

    $P(i) = (1 - d) + d \sum_{(i,j) \in E} \frac{P(j)}{O_j}$.    (7)

The computation of the PageRank score is a process of iteration. Given an initial value of $P$, the iteration ends when the PageRank scores do not change or the change is less than a threshold.
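
The following is a minimal sketch of the PageRank iteration of (6)-(7), assuming the graph is given as an adjacency list; the damping factor, tolerance, and toy graph are illustrative values, not settings from this paper.

```python
# Power iteration for PageRank following (7):
# P(i) = (1 - d) + d * sum over pages j linking to i of P(j)/O_j.
def pagerank(graph, d=0.85, tol=1e-6, max_iter=100):
    nodes = list(graph)
    scores = {v: 1.0 for v in nodes}          # initial value of P
    for _ in range(max_iter):
        new_scores = {}
        for i in nodes:
            incoming = sum(scores[j] / len(graph[j])
                           for j in nodes if graph[j] and i in graph[j])
            new_scores[i] = (1 - d) + d * incoming
        if max(abs(new_scores[v] - scores[v]) for v in nodes) < tol:
            return new_scores                  # change below the threshold
        scores = new_scores
    return scores

if __name__ == "__main__":
    toy_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
    print(pagerank(toy_graph))
```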

3.3. Deep Belief Network (DBN). DBN is a model of deep learning composed of multilayer restricted Boltzmann machines (RBMs). DBN contains the input layer (visible layer), the hidden layers, and the output layer. There are connections between a layer and its adjacent layer, but no connections among units within each layer. The structure of DBN is shown in Figure 1.

As shown in Figure 1, an RBM consists of two adjacent layers. The training of DBN includes two steps, pretraining and fine-tuning. An RBM contains a visible layer $v^{(k)}$ and a hidden layer $h^{(k)}$. The parameters of the RBM are $(W^{(k)}, a^{(k)}, b^{(k)})$, where $W^{(k)}$ are the weights of connections between the visible layer and the hidden layer, and $(a^{(k)}, b^{(k)})$ are the bias vectors of the visible units and the hidden units. Giving an initial value to $W^{(k)}$, the parameters are updated with

    $W_{ij}^{(k)} = W_{ij}^{(k)} + \eta \nabla W_{ij}^{(k)}$,    (8)

where $\eta$ is the learning rate, and $(a^{(k)}, b^{(k)})$ are updated similarly to $W^{(k)}$. The gradient $\nabla W^{(k)}$ is obtained by Gibbs sampling:

    $\nabla W_{ij}^{(k)} = E[v_i^{(k)} h_j^{(k)}]_{\text{data}} - E[v_i^{(k)} h_j^{(k)}]_{\text{Gibbs}}$,    (9)

where $E[\cdot]_{\text{data}}$ and $E[\cdot]_{\text{Gibbs}}$ are the expectations over data samples and over samples from Gibbs sampling, and $(\nabla a^{(k)}, \nabla b^{(k)})$ are similar to $\nabla W_{ij}^{(k)}$.

DBN is fine-tuned with a set of labeled inputs in terms of error back propagation after the pretraining of DBN. The parameters are updated by

    $W_{ij}^{(k)} = W_{ij}^{(k)} + \eta \nabla W_{ij}^{(k)}$,    (10)

where $\nabla W_{ij}^{(k)} = h_i^{(k-1)} \delta_j^{(k)}$ and $\delta_j^{(k)}$ is an error vector.
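
As an illustration of the updates in (8)-(9), the following is a minimal one-step contrastive-divergence (CD-1) sketch for a single RBM; the shapes, learning rate, and random data are illustrative assumptions, and this is not the authors' training code.

```python
# One CD-1 update for an RBM: positive phase from the data, negative phase
# from one Gibbs step, then the parameter updates of (8) with gradient (9).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, a, b, eta=0.1, rng=np.random.default_rng(0)):
    """v0: batch of binary visible vectors, shape (batch, n_visible)."""
    ph0 = sigmoid(v0 @ W + b)              # P(h=1 | v0)
    pos = v0.T @ ph0                       # E[v h]_data (unnormalized)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    pv1 = sigmoid(h0 @ W.T + a)            # reconstruction of v
    ph1 = sigmoid(pv1 @ W + b)
    neg = pv1.T @ ph1                      # E[v h]_Gibbs (unnormalized)
    batch = v0.shape[0]
    W += eta * (pos - neg) / batch         # equation (8) with gradient (9)
    a += eta * (v0 - pv1).mean(axis=0)
    b += eta * (ph0 - ph1).mean(axis=0)
    return W, a, b

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    v = (rng.random((4, 6)) < 0.5).astype(float)   # toy binary batch
    W = 0.01 * rng.standard_normal((6, 3))
    a, b = np.zeros(6), np.zeros(3)
    cd1_update(v, W, a, b)
```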


Figure 2: The procedure of DLVN: feature extraction (TongYiCi CiLin, related-word set RS, vocabulary network, and PageRank), feature selection (sparse-group DBN), and clustering (DL-SP).

Figure 3: The structure of TongYiCi CiLin: major classes, middle classes, small classes, word groups, and atomic word groups, with each atomic word group mapped to a code such as Af03A06.

4. Deep-Learning Vocabulary Network

In this section, we propose an approach called deep-learning vocabulary network (DLVN) for text clustering. The first step of DLVN is the construction of the vocabulary network. The cooccurrence of words or terms is useful information for text clustering. We use the nodes of the vocabulary network to represent words or terms and the edges of the vocabulary network to represent the relations between words or terms. In our work, there are two methods to obtain the cooccurrence relations of words: related-word set and TongYiCi CiLin. Frequent itemsets are used to discover the relations of items in a database. We create the related-word set by frequent itemsets, and each itemset of the related-word set is a set of words with a cooccurrence relation. PageRank is employed to obtain the "importance" of nodes (feature vectors) instead of the term frequency in VSM. Then, an improved DBN (called sparse-group DBN) is proposed for dimensionality reduction. In the process of the clustering algorithm, we present DL-SP for clustering, in which coverage rate is used for similarity measure. The procedure of DLVN is shown in Figure 2.

4.1. Related-Word Set. The relations of words or terms are important information in text documents. Usually, natural language has fixed collocations and corresponding contexts, which means some words or terms have a high probability of occurrence in a text document. Thus, the relations between words are important to represent the meaning of text documents. In our paper, we use frequent itemsets to obtain cooccurrence relations between words or terms.

Definition 1 (related-word set). Let $D = \{word_1, word_2, \ldots, word_n\}$ be the words of text documents from the same topic and $\sup[\cdot]$ be the support of itemsets. Given a minimum support $\sup_{ms}$, $X = \{word_i, word_j, \ldots, word_k\}$ is defined as an itemset of the related-word set, where $\sup[X] > \sup_{ms}$.

FPMAX is a depth-first and recursive algorithm for mining MFIs, and it is based on the FP-tree to store frequent itemsets. When a database has a large scale, all itemsets of the MFI-tree are detected in subset checking of FPMAX, which has a big influence on the efficiency of FPMAX. For improving the efficiency of FPMAX, we use TongYiCi CiLin and string matching to compress the FP-tree.

TongYiCi CiLin is a Chinese semantic dictionary of synonyms and related words, which organizes all words as a five-layer hierarchical tree. It contains 77,343 words, which are divided into 12 major classes, 94 middle classes, and 1438 small classes. The fourth layer and the fifth layer are further divided into word groups and atomic word groups. We use Figure 3 to illustrate the structure of TongYiCi CiLin.

TongYiCi CiLin maps an atomic word group into a code: the first layer and the fourth layer are capital letters, the second layer is a lowercase letter, and the third layer and the fifth layer are integers. For example, code "Aa01A02" stands for the atomic word group {man, mankind, human}. We replace the words or terms with the codes of word groups in MFI mining, which contains 4223 nodes. We randomly select 10 documents from the same topic, and the frequent items (words) are listed in Table 1. As some words belong to the same word group, the number of words is compressed largely.

The structures of the FP-trees that are created based on words and word groups are shown in Figure 4. Figure 4(a) is the FP-tree of words, and the FP-tree of word groups is shown in Figure 4(b). The nodes of the FP-tree based on the word groups are fewer than the nodes of the FP-tree based on the words.

The MFIs have redundant items in Figure 4(b). For example, the MFIs of Figure 4(b) are listed in Table 2.

MFIs include two categories of word groups in Table 2. The word groups {a(Bo21A01), b(Bo01A06), d(Dd14B36)} and {a(Bo21A01), c(Bo21A27)} are closely related, and the word groups {k(Da21D01), i(Dm04A01), f(Cb08A01)} and {k(Da21D01), i(Dm04A01), j(Bn01A01)} are closely related.


Figure 4: The structures of FP-trees: (a) the FP-tree based on words; (b) the FP-tree based on word groups.

Table 1: The comparison of words and word groups.

Words | Word groups
a(vehicle), b(car), c(engine), d(quality) | a(Bo21A01), b(Bo01A06), e(Dd12A01)
e(automobile), f(truck) | a(Bo21A01), c(Bo21A27)
b(car), g(engine), h(power) | a(Bo21A01), b(Bo01A06), d(Dd14B36)
a(vehicle), i(lorry), c(engine), h(power) | a(Bo21A01), b(Bo01A06), c(Bo21A27), d(Dd14B36)
a(vehicle), j(jeep) | a(Bo21A01), c(Bo21A27)
k(ground), l(situation) | f(Cb08A01), l(Da21A07)
m(area), n(store), o(construction), p(environment) | f(Cb08A01), i(Dm04A01), j(Bn01A01), k(Da21D01)
q(shop), r(building), p(environment) | i(Dm04A01), k(Da21D01), j(Bn01A01)
s(location), q(shop), t(condition) | f(Cb08A01), i(Dm04A01), k(Da21D01)
d(quality), n(store), o(construction), p(environment) | e(Dd12A01), i(Dm04A01), j(Bn01A01), k(Da21D01)

Table 2: MFIs of the FP-tree based on word groups.

MFIs
1 | {a(Bo21A01), b(Bo01A06), d(Dd14B36)}
2 | {a(Bo21A01), c(Bo21A27)}
3 | {k(Da21D01), i(Dm04A01), f(Cb08A01)}
4 | {k(Da21D01), i(Dm04A01), j(Bn01A01)}

In fact, the aim of the related-word set is to mine the "cooccurrence" of words, and we assume that the relations of words have transitivity. Therefore, we utilize string matching and the same items to combine MFIs.

Definition 2 (combination of MFIs). Let MFIS = $\{MFI_1, MFI_2, \ldots, MFI_m\}$ be the set of MFIs obtained from text documents and $cov(\cdot)$ be the number of the same items in two MFIs. Suppose that $cov(MFI_1, MFI_2) > cov_{min}$, where $cov_{min}$ is the minimum number of the same items. Then $MFI_1$ and $MFI_2$ are removed from MFIS, and the combination $MFI_1 \cup MFI_2$ is added to MFIS.

MFIs are inserted into the MFI-tree in terms of $cov_{min}$. For example, given $MFI_1 = \{a, b, c, d, e, f\}$, $MFI_2 = \{a, b, c, d, e, h\}$, $MFI_3 = \{e, f, g, h, i, j, k\}$, and $cov_{min} = 0.7$, the combination of MFIs is $MFI_1 \cup MFI_2 = \{a, b, c, d, e, f, h\}$. The new MFI-tree only has two paths, $MFI_1 \cup MFI_2 = \{a, b, c, d, e, f, h\}$ and $MFI_3 = \{e, f, g, h, i, j, k\}$. The scale of the MFI-tree is simplified, and we integrate FPMAX with the combination of MFIs to propose an algorithm named FPMAX with related-word set (FPMAX-RS). The steps of FPMAX-RS are listed in Algorithm 2.
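
The following is a minimal sketch of the combination step of Definition 2. Since the text gives $cov_{min} = 0.7$, the overlap measure here normalizes the number of shared items by the size of the smaller itemset; that normalization is our assumption, not a statement from the paper.

```python
# Repeatedly merge MFIs whose overlap exceeds cov_min (Definition 2).
def combine_mfis(mfis, cov_min=0.7):
    mfis = [set(m) for m in mfis]
    merged = True
    while merged:
        merged = False
        for i in range(len(mfis)):
            for j in range(i + 1, len(mfis)):
                shared = len(mfis[i] & mfis[j])
                cov = shared / min(len(mfis[i]), len(mfis[j]))  # assumption
                if cov > cov_min:
                    union = mfis[i] | mfis[j]
                    mfis = [m for k, m in enumerate(mfis) if k not in (i, j)]
                    mfis.append(union)        # replace the pair by its union
                    merged = True
                    break
            if merged:
                break
    return mfis

if __name__ == "__main__":
    mfi1 = {"a", "b", "c", "d", "e", "f"}
    mfi2 = {"a", "b", "c", "d", "e", "h"}
    mfi3 = {"e", "f", "g", "h", "i", "j", "k"}
    print(combine_mfis([mfi1, mfi2, mfi3]))   # mfi1 and mfi2 merge, mfi3 stays
```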

4.2. The Construction of Vocabulary Network. In this section, the vocabulary network is constructed to represent text documents, and the vocabulary network contains the relations between words or terms. We employ the "importance" of nodes instead of term frequency in VSM.

4.2.1. The Selection of Vocabulary Network Nodes. The word groups in TongYiCi CiLin are used as nodes instead of words in the vocabulary network. The number of word groups is much smaller than the number of words. In addition, we choose the word groups whose frequency is higher than a specified minimal frequency $f_{min}$.

4.2.2. The Construction of Edges in the Vocabulary Network. Edges of a complex network are the important carrier of information, and the edges of the vocabulary network are used in calculating the "importance" of nodes. Considering the semantic and related information among words or terms, an edge is added to the vocabulary network in terms of the similarity of nodes. Therefore, we add an edge to the vocabulary network if two word groups have a closer position in TongYiCi CiLin.


Procedure FPMAX-RS(T)
Input: T (an FP-tree), cov_min
Global:
    MFIT: an MFI-tree
    Head: a linked list of items
Output: the MFIT that contains all MFIs
Method:
(1) if T only contains a single path P
(2)     if cov(Head ∪ P, MFI) > cov_min
(3)         combine MFI-tree to this path
(4)     else
(5)         insert Head ∪ P into MFIT
(6) else for each i in Header-table of T
(7)     append i to Head
(8)     construct the Head-pattern base
(9)     Tail = frequent items in base
(10)    subset_checking(Head ∪ Tail)
(11)    if Head ∪ Tail is not in MFI-tree
(12)        construct the FP-tree T_Head
(13)        call FPMAX-RS(T_Head)
(14)    remove i from Head

Algorithm 2: FPMAX-RS.

The semantic similarity of word groups $sim(i, j)$ is defined as

    $sim(i, j) = \frac{depth(i, j)}{l} \times \frac{TN - Dis(i, j) + 1}{TN}$,    (11)

where $depth(i, j)$ is the depth of the first common father node, $l$ is the depth of $i$ and $j$, $TN$ is the total number of word groups, and $Dis(i, j)$ denotes the distance between $i$ and $j$. For example, consider the two words car and wheel, whose word group codes are Bo21A and Bo25B. Because the two nodes are in the fourth layer, the first common father node is Bo, which is in the second layer. In addition, the fourth layer contains 4223 word groups, and $Dis(i, j)$ of Bo21A and Bo25B is 14. Therefore, $sim(Bo21A, Bo25B)$ is calculated as follows:

    $sim(Bo21A, Bo25B) = \frac{2}{4} \times \frac{4223 - 14 + 1}{4223}$.    (12)

The nodes in the vocabulary network are traversed, and an edge between $i$ and $j$ is added when $sim(i, j) > sim_{min}$ (the specified threshold).
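
The following is a minimal sketch of the similarity in (11), assuming the depth of the first common father node, the layer l, the code distance Dis(i, j), and TN are already known; the example reproduces (12).

```python
# Word-group similarity of (11): sim = (depth/l) * (TN - Dis + 1) / TN.
def group_similarity(depth_common, l, dis, tn=4223):
    return (depth_common / l) * (tn - dis + 1) / tn

if __name__ == "__main__":
    # The paper's example for Bo21A and Bo25B: common father Bo at layer 2,
    # both codes at layer 4, Dis = 14, TN = 4223, as in equation (12).
    print(group_similarity(depth_common=2, l=4, dis=14, tn=4223))  # ~0.4985
```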

In addition, we add an edge between two nodes if an MFI in the related-word set includes the words, and each MFI in the related-word set is a word set with cooccurrence relations. In fact, the meanings of the words in an MFI are not similar; an MFI includes a group of words cooccurring in the same topic documents. When a text document has the words in an MFI, the text document has a high probability of belonging to a certain topic. Therefore, we add an edge into the vocabulary network with the low-frequency word pointing to the high-frequency word.

4.2.3. The Extraction of Feature Vectors. In the vocabulary network, the number and the direction of edges reflect the importance of nodes, which is similar to evaluating the importance of webpages. Thus, PageRank is utilized to obtain the importance of nodes, and the initial value $PR_i$ of a node is defined by

    $PR_i = \frac{f_i}{\sum_{j=1}^{N} f_j}$,    (13)

where $f_i$ is the frequency of the word group. After iterative computation and normalization of $PR_i$, we use the PageRank scores of nodes as the feature vectors of text documents instead of term frequency in this paper.
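
A minimal sketch of the initialization in (13), with illustrative word-group frequencies; the resulting values seed the PageRank iteration described in Section 3.2.

```python
# Initial PageRank values of (13): word-group frequency over total frequency.
def initial_pagerank(freq):
    total = sum(freq.values())
    return {node: f / total for node, f in freq.items()}

if __name__ == "__main__":
    freqs = {"Bo21A01": 12, "Bo01A06": 7, "Dd14B36": 3}   # illustrative counts
    print(initial_pagerank(freqs))   # these PR_i values seed the iteration
```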

4.3. Deep-Learning Single-Pass (DL-SP). In this paper, sparse-group DBN is proposed for dimensionality reduction of feature vectors. DBN is a model of deep learning. Luo et al. [52] found that the units of hidden layers exhibited statistical dependencies and proposed a regularization constant to restrict the relations in hidden layers. Due to the sparsity of feature vectors, we combine the word dependencies and DBN to propose a sparse-group DBN for dimensionality reduction. In addition, coverage rate (CoR) is proposed for similarity measure among feature vectors in DL-SP.

4.3.1. Sparse-Group DBN. Deep learning simulates the process of human thinking, and the result of deep learning is the distributed representation of an input vector. By analyzing feature vectors extracted from the vocabulary network, we find that there exists statistical dependency between entries of feature vectors, which means the entries of feature vectors will cooccur in the part of feature vectors. The word dependency is also mentioned by many researchers in previous literatures [5, 18, 53]. Cooccurrence relations are typically collected in feature vectors, which means a unique word commonly referring to a "target word", and the word dependency is quantified to measure word similarity in text clustering. We provide an example, which is the part of a feature vector, in Table 3.

Because the documents in the same topic usually include related words, a part of the units in the visible layer is active simultaneously, and accordingly the documents in different topics usually activate different parts of units. Based on this observation, we add a regularization constant to the log-likelihood of training data to retain these relations. In experiments, we use different topic documents to train the sparse-group DBN. The sequence of units in the output layer is adjusted accordingly, and the cooccurring units are divided into one group. In other words, the feature vectors of different topic documents can activate different groups of units in the output layer. The structure of sparse-group DBN is shown in Figure 5.

Sparse-group DBN is comprised of several RBMs, and every two adjacent layers form an RBM. For retaining the dependency of the units in the output layer, we define the activation probability of each group. Given a group $S = \{y_1, y_2, \ldots, y_s\}$ and a training sample $v^{(k)}$, the group probability $P_S(\cdot)$ is given by

    $P_S(v^{(k)}) = \sqrt{\sum_{s \in S} P(y_s = 1 \mid v^{(k)})^2}$.    (14)


Table 3: The word dependencies of a feature vector.

Hj47A01 | Hg19C01 | Ba03A18 | Ba08A07 | Dm05A01 | Hg01A01 | Ae13B01
0.32 | 0.17 | 0.12 | 0.04 | 0 | 0.02 | 0
0.17 | 0.21 | 0.07 | 0.11 | 0 | 0 | 0
0.14 | 0.23 | 0.17 | 0.09 | 0 | 0 | 0
0.23 | 0.12 | 0.06 | 0.14 | 0 | 0.01 | 0
0 | 0 | 0 | 0 | 0.12 | 0.34 | 0.20
0 | 0 | 0 | 0 | 0.24 | 0.21 | 0.10
0 | 0 | 0 | 0 | 0.13 | 0.14 | 0.09

Figure 5: The structure of sparse-group DBN (a visible layer, a hidden layer, and an output layer whose units are divided into K groups).

The output layer of the sparse-group DBN is divided into $K$ groups, and the probability of the output layer $P_{ol}(\cdot)$ is defined by

    $P_{ol}(v^{(k)}) = \sum_{k=1}^{K} \sqrt{\sum_{s \in S} P(y_s = 1 \mid v^{(k)})^2}$.    (15)

We add a regularization constant $\lambda$ and $P_{ol}(v^{(k)})$ to the optimization function, which is the maximum likelihood estimate of the energy function of an RBM. The optimization function is defined by

    $\max_{W,b,c} \sum \log P(v^{(k)}) - \lambda \sum_{k=1}^{K} \sqrt{\sum P(y_s = 1 \mid v^{(k)})^2}$.    (16)

Equation (9) is improved to (17) accordingly, and $\nabla W_{ij}^{(k)}$ is defined by

    $\nabla W_{ij}^{(k)} = E[v_i^{(k)} h_j^{(k)}]_{\text{data}} - E[v_i^{(k)} h_j^{(k)}]_{\text{Gibbs}} - \lambda \cdot \alpha$,    (17)

where $\alpha = \frac{\partial P_S(v^{(k)})}{\partial W_{ij}^{(k)}} = \frac{P(y_s = 1 \mid v^{(k)})}{P_S(v^{(k)})} \cdot \frac{\partial P(y_s = 1 \mid v^{(k)})}{\partial W_{ij}^{(k)}} = \frac{P(y_s = 1 \mid v^{(k)})^2 \cdot P(y_s = 0 \mid v^{(k)}) \cdot v^{(k)}}{P_S(v^{(k)})}$. Accordingly, the gradients $(\nabla a^{(k)}, \nabla b^{(k)})$ are defined by

    $\nabla a_i^{(k)} = E[v_i^{(k)}]_{\text{data}} - E[v_i^{(k)}]_{\text{Gibbs}} - \lambda \cdot \alpha$,
    $\nabla b_j^{(k)} = E[h_j^{(k)}]_{\text{data}} - E[h_j^{(k)}]_{\text{Gibbs}} - \lambda \cdot \alpha$.    (18)
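
The following is a minimal sketch of the group-sparsity terms in (14) and (17) for the top RBM: the output units are split into groups, the group probability is the root-sum-of-squares in (14), and the resulting penalty gradient λ·α is subtracted from the CD gradient of (9). The grouping, λ, and shapes are illustrative assumptions, not the authors' implementation.

```python
# Gradient of the group-sparsity penalty for one visible vector v.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def group_penalty_gradient(v, W, b, groups, lam=0.1):
    """v: one visible vector (n_visible,); groups: list of index arrays."""
    p = sigmoid(v @ W + b)                       # P(y_s = 1 | v), shape (n_out,)
    grad_W = np.zeros_like(W)
    for idx in groups:
        group_norm = np.sqrt(np.sum(p[idx] ** 2))    # P_S(v) in (14)
        if group_norm == 0:
            continue
        for s in idx:
            # alpha = P(y_s=1|v)^2 * P(y_s=0|v) * v / P_S(v), as in (17).
            alpha = (p[s] ** 2) * (1 - p[s]) * v / group_norm
            grad_W[:, s] += alpha
    return lam * grad_W                          # subtract this from (9)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    v = (rng.random(8) < 0.5).astype(float)
    W, b = 0.01 * rng.standard_normal((8, 6)), np.zeros(6)
    groups = [np.array([0, 1, 2]), np.array([3, 4, 5])]   # K = 2 groups
    print(group_penalty_gradient(v, W, b, groups).shape)  # (8, 6)
```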

4.3.2. Similarity Measure of DL-SP. Single-Pass is a partitional clustering algorithm. The first document is treated as the first cluster in Single-Pass, and similarity is computed between a new document and the existing clusters, which decides whether the new document joins an existing cluster or creates a new cluster in terms of a specified threshold. The output of sparse-group DBN is binary, and Euclidean distance and Cosine angle distance are not suitable for similarity measure in DL-SP. Therefore, we use coverage rate (CoR) for similarity measure, and CoR is defined by

    $CoR(C, d) = \frac{|C \cap d|}{|C|}$,    (19)

where $C = (c_1, c_2, \ldots, c_n)$ is the feature vector of a cluster (named topic feature vector) and $d = (d_1, d_2, \ldots, d_n)$ is the feature vector of a new document.
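
A minimal sketch of CoR in (19) for binary feature vectors, reading |C ∩ d| as the number of positions active in both vectors and |C| as the number of active entries of the topic feature vector; this set-style reading of the binary vectors is our assumption.

```python
# Coverage rate of (19) for 0/1 feature vectors.
def coverage_rate(c, d):
    active_c = [i for i, x in enumerate(c) if x == 1]
    if not active_c:
        return 0.0
    shared = sum(1 for i in active_c if d[i] == 1)
    return shared / len(active_c)

if __name__ == "__main__":
    c = [1, 0, 1, 1, 0, 1]
    d = [1, 0, 1, 0, 0, 1]
    print(coverage_rate(c, d))   # 3 of the 4 active entries of C match: 0.75
```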

Moreover, the addition of many text documents to clusters has an influence on the topic feature vector. In our work, we introduce an optional topic feature vector $C' = (c'_1, c'_2, \ldots, c'_n)$ and the weight of the feature vector to solve this problem. We provide an example of the optional topic feature vector in Figure 6.

When the weight of the optional topic feature vector is greater than a specified threshold in each time interval, we replace the topic feature vector with the optional topic feature vector as the new cluster center. The weight of the topic feature vector is defined by

    $w_{C'} = \frac{\sum_{C} f(c_i)}{\sum_{C} f(c_i) + \sum_{C'} f(c'_j)} - \lambda e^{k(t - t_0)}$,    (20)

where $\lambda e^{k(t - t_0)}$ is a time damping function and $f(c_i)$ is a frequency function.
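
Putting the pieces together, the following is a minimal sketch of the Single-Pass loop in DL-SP using CoR; the similarity threshold and the elementwise-OR update of the topic feature vector are illustrative assumptions (the paper maintains cluster centers through the optional topic feature vector and the weight in (20)).

```python
# Single-Pass clustering over binary feature vectors with CoR as similarity.
def coverage_rate(c, d):
    active = [i for i, x in enumerate(c) if x == 1]
    return sum(1 for i in active if d[i] == 1) / len(active) if active else 0.0

def dl_sp(documents, threshold=0.6):
    clusters = []                      # list of (topic_vector, member_indices)
    for doc_id, d in enumerate(documents):
        best, best_cor = None, 0.0
        for k, (c, _members) in enumerate(clusters):
            cor = coverage_rate(c, d)  # CoR from (19)
            if cor > best_cor:
                best, best_cor = k, cor
        if best is not None and best_cor >= threshold:
            c, members = clusters[best]
            members.append(doc_id)
            # Update the topic feature vector (here: elementwise OR, an assumption).
            clusters[best] = ([max(x, y) for x, y in zip(c, d)], members)
        else:
            clusters.append((list(d), [doc_id]))   # start a new cluster
    return clusters

if __name__ == "__main__":
    docs = [[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 1]]
    for topic, members in dl_sp(docs):
        print(topic, members)
```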

5. Experimental Analysis

In this section, we conduct three sets of experiments to validate the effectiveness of the proposed approach, including the efficiency of FPMAX-RS in related-word set mining, the comparison of feature vectors, and the comparison of DL-SP efficiency. In this work, three Chinese text corpora, TanCorpV1.0, Encyclopedia of China, and Sogou Corpus, are used as the experimental datasets.

5.1. The Efficiency of FPMAX-RS in Related-Word Set Mining. This section compares the running time of FPMAX and FPMAX-RS in related-word set mining.


Figure 6: An example of optional topic feature vector: (a) the text feature d_i; (b) the topic feature c_i; (c) the similarity; (d) the optional topic feature (c_i, c'_i).

We choose seven categories (museum, property, education, military, car, sport, and health) of text documents from the datasets, and each category has 50 articles. The results of the experiment are shown in Figure 7.

FPMAX generates a larger amount of maximal frequent itemsets and traverses all MFI-trees for subset checking, which has an influence on the running time of FPMAX. Compared with FPMAX, FPMAX-RS has higher efficiency when $\sup_{min}$ is smaller.

5.2. The Comparison of Feature Vectors. In this work, we compare the distance among the feature vectors based on tf-idf, FC-VSM [12], and DLVN. We randomly choose two documents from the category museum and one document in other categories, including property, education, and military. The aim of feature extraction is to extract the feature vectors that can represent the meaning of text documents. In other words, feature vectors in different categories have longer distance. Therefore, we compute the Euclidean distance of feature vectors in different categories based on tf-idf, FC-VSM, and DLVN. Table 4 shows the results in different categories of text documents.

In the following experiment, feature vectors are extracted based on tf-idf, FC-VSM, and DLVN. Then, k-means is applied for clustering. We evaluate clustering performance with F_measure.


Table 4: Euclidean distance comparison.

Category | Documents | Distance (tf-idf) | Distance (FC-VSM) | Distance (DLVN)
museum | museum1 - museum2 | 1302 | 1049 | 917
property | property - museum1 | 1285 | 1347 | 1359
property | property - museum2 | 1593 | 1586 | 1687
education | education - museum1 | 1468 | 1461 | 1472
education | education - museum2 | 1139 | 1133 | 1207
military | military - museum1 | 1556 | 1649 | 1658
military | military - museum2 | 1369 | 1403 | 1841

Figure 7: The comparison of running time (ms) of FPMAX and FPMAX-RS as sup_min varies from 0.01 to 0.06.

Let $D = \{d_1, d_2, \ldots, d_n\}$ be the clustering result and $D^* = \{d^*_1, d^*_2, \ldots, d^*_n\}$ be the standard dataset. F_measure is defined by

    $F\_measure = \frac{2 \times P(D, D^*) \times R(D, D^*)}{P(D, D^*) + R(D, D^*)}$,    (21)

where $P(D, D^*) = |D \cap D^*| / |D|$ is precision and $R(D, D^*) = |D \cap D^*| / |D^*|$ is recall.
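
A minimal sketch of the evaluation in (21), treating D and D* as sets of (document, label) assignments; this set-based reading is one plausible implementation, not the authors' exact evaluation script.

```python
# Precision, recall, and F_measure of (21) over assignment sets.
def f_measure(result, standard):
    result, standard = set(result), set(standard)
    overlap = len(result & standard)
    if overlap == 0:
        return 0.0
    precision = overlap / len(result)
    recall = overlap / len(standard)
    return 2 * precision * recall / (precision + recall)

if __name__ == "__main__":
    d = {("doc1", "sport"), ("doc2", "sport"), ("doc3", "car")}
    d_star = {("doc1", "sport"), ("doc2", "sport"), ("doc3", "health")}
    print(f_measure(d, d_star))   # 2/3 overlap on both sides -> about 0.667
```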

Because seven categories of text documents are chosen in our experiment, the specified number of clusters $k$ is 7. Figure 8 illustrates that feature vectors based on DLVN have better performance.

5.3. The Comparison of DL-SP Efficiency. In this experiment, we choose text documents from the datasets, and the number of each category is listed in Table 5.

The aim of the experiment is to compare DL-SP with LSI and Single-Pass. The sparse-group DBN has 3 layers, and the numbers of units in each layer are 4223, 3500, and 3000.

Figure 8: The comparison of F_measure for tf-idf, FC-VSM, and DLVN across the categories sport, military, museum, health, property, education, and car.

Table 5: The datasets of the experiment.

Category | Number of text documents
sport | 1300
military | 1500
health | 1400
property | 900
education | 800
car | 500

In addition, the group number $K$ of the top layer is 200. The structure of the sparse-group DBN is shown in Figure 9.

The experimental result is shown in Figure 10. DL-SP has better performance than LSI and Single-Pass in sport, military, property, education, and health. However, F_measure of DL-SP is lower than LSI and Single-Pass in the category car, because the smaller number of documents does not train the sparse-group DBN effectively.


Figure 9: The structure of the sparse-group DBN used in the experiment (a visible layer with 4223 units, a hidden layer with 3500 units, and an output layer with 3000 units divided into K = 200 groups).

Figure 10: The comparison of F_measure for DL-SP, LSI, and Single-Pass across the categories sport, military, health, property, education, and car.

Table 6: The running time of DL-SP and Single-Pass.

Method | Dimensionality of feature vectors | Running time (s)
Single-Pass | 4223 | 3866
DL-SP | 3000 | 1084

In this subsection, we compare the running time of DL-SP and Single-Pass, and the result is listed in Table 6.

6. Conclusions

In this paper, we propose an approach DLVN for text clustering. The existing term frequency-based methods only calculate the number of words, but the relations of words are not considered in feature extraction. The approach constructs a vocabulary network to mine the importance of words using the related-word set, which contains "cooccurrence" relations of words. Therefore, the text features of documents in the same category have shorter distance, and feature vectors have longer distance among different categories. Moreover, we employ sparse-group DBN to reduce the dimensionality of feature vectors in terms of the group relations of words. Thus, sparse-group DBN can retain the word dependency in dimensionality reduction. In the experiments, we compare the approach with well-known methods to verify our work, and the results show the performance of DLVN.

In the current work, we verify the approach using Chinese corpora. We will use English text to prove the effectiveness of the approach in future work. Moreover, in the process of dimension reduction, we need to train the sparse-group DBN using a large amount of text documents to improve its performance.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work has been supported by Projects U1536116 and U1636208 funded by National Natural Science Foundation of China (NSFC).

References

[1] A. Trabelsi and O. R. Zaïane, "Extraction and clustering of arguing expressions in contentious text," Data and Knowledge Engineering, vol. 100, pp. 226–239, 2015.
[2] K. Schouten and F. Frasincar, "Survey on aspect-level sentiment analysis," IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 3, pp. 813–830, 2016.
[3] M. Tsytsarau and T. Palpanas, "Survey on mining subjective data on the web," Data Mining and Knowledge Discovery, vol. 24, no. 3, pp. 478–514, 2012.
[4] S.-J. Lee and J.-Y. Jiang, "Multilabel text categorization based on fuzzy relevance clustering," IEEE Transactions on Fuzzy Systems, vol. 22, no. 6, pp. 1457–1471, 2014.
[5] P. Wang, B. Xu, J. Xu, G. Tian, C.-L. Liu, and H. Hao, "Semantic expansion using word embedding clustering and convolutional neural network for improving short text classification," Neurocomputing, vol. 174, pp. 806–814, 2016.
[6] W. Zhang, X. Tang, and T. Yoshida, "TESC: an approach to TExt classification using Semi-supervised Clustering," Knowledge-Based Systems, vol. 75, pp. 152–160, 2015.
[7] A. B. Al-Saleh and M. E. B. Menai, "Automatic Arabic text summarization: a survey," Artificial Intelligence Review, vol. 45, no. 2, pp. 203–234, 2016.
[8] F. Atefeh and W. Khreich, "A survey of techniques for event detection in Twitter," Computational Intelligence, vol. 31, no. 1, pp. 132–164, 2015.
[9] G. Stilo and P. Velardi, "Efficient temporal mining of micro-blog texts and its application to event discovery," Data Mining and Knowledge Discovery, vol. 30, no. 2, pp. 372–402, 2016.
[10] G. Huang, J. He, Y. Zhang et al., "Mining streams of short text for analysis of world-wide event evolutions," World Wide Web, vol. 18, no. 5, pp. 1201–1217, 2014.
[11] U. Erra, S. Senatore, F. Minnella, and G. Caggianese, "Approximate TF-IDF based on topic extraction from massive message stream using the GPU," Information Sciences, vol. 292, pp. 143–161, 2015.
[12] C. Qimin, G. Qiao, W. Yongliang, and W. Xianghua, "Text clustering using VSM with feature clusters," Neural Computing and Applications, vol. 26, no. 4, pp. 995–1003, 2015.
[13] J. Martinez-Gil, "An overview of textual semantic similarity measures based on web intelligence," Artificial Intelligence Review, vol. 42, no. 4, pp. 935–943, 2012.
[14] K. K. Bharti and P. K. Singh, "Hybrid dimension reduction by integrating feature selection with feature extraction method for text clustering," Expert Systems with Applications, vol. 42, no. 6, pp. 3105–3114, 2015.
[15] L. Yue, W. Zuo, T. Peng, Y. Wang, and X. Han, "A fuzzy document clustering approach based on domain-specified ontology," Data and Knowledge Engineering, vol. 100, pp. 148–166, 2015.
[16] T. Wei, Y. Lu, H. Chang, Q. Zhou, and X. Bao, "A semantic approach for text clustering using WordNet and lexical chains," Expert Systems with Applications, vol. 42, no. 4, pp. 2264–2275, 2015.
[17] L. Bing, S. Jiang, W. Lam, Y. Zhang, and S. Jameel, "Adaptive concept resolution for document representation and its applications in text mining," Knowledge-Based Systems, vol. 74, no. 1, pp. 1–13, 2015.
[18] R. Irfan, C. K. King, D. Grages et al., "A survey on text mining in social networks," Knowledge Engineering Review, vol. 30, no. 2, pp. 157–170, 2015.
[19] N. Indurkhya, "Emerging directions in predictive text mining," Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 5, no. 4, pp. 155–164, 2015.
[20] M. T. Mills and N. G. Bourbakis, "Graph-based methods for natural language processing and understanding—a survey and analysis," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 44, no. 1, pp. 59–71, 2014.
[21] H. Mousavi, D. Kerr, M. Iseli, and C. Zaniolo, "Mining semantic structures from syntactic structures in free text documents," in Proceedings of the 8th IEEE International Conference on Semantic Computing (ICSC '14), pp. 84–91, IEEE, Newport Beach, Calif, USA, June 2014.
[22] S. Jun, S.-S. Park, and D.-S. Jang, "Document clustering method using dimension reduction and support vector clustering to overcome sparseness," Expert Systems with Applications, vol. 41, no. 7, pp. 3204–3212, 2014.
[23] W. Z. Zhu and R. B. Allen, "Document clustering using the LSI subspace signature model," Journal of the American Society for Information Science and Technology, vol. 64, no. 4, pp. 844–860, 2013.
[24] X. Wu, X. Chen, X. Li, L. Zhou, and J. Lai, "Adaptive subspace learning: an iterative approach for document clustering," Neural Computing & Applications, vol. 25, no. 2, pp. 333–342, 2014.
[25] H. Kriegel and E. Ntoutsi, "Clustering high dimensional data," ACM SIGKDD Explorations Newsletter, vol. 15, no. 2, pp. 1–8, 2014.
[26] K. K. Bharti and P. K. Singh, "Opposition chaotic fitness mutation based adaptive inertia weight BPSO for feature selection in text clustering," Applied Soft Computing Journal, vol. 43, pp. 20–34, 2016.
[27] M. C. N. Barioni, H. Razente, A. M. R. Marcelino, A. J. M. Traina, and C. Traina, "Open issues for partitioning clustering methods: an overview," Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 4, no. 3, pp. 161–177, 2014.
[28] J.-P. Mei and L. Chen, "Proximity-based k-partitions clustering with ranking for document categorization and analysis," Expert Systems with Applications, vol. 41, no. 16, pp. 7095–7105, 2014.
[29] V. Tunali, T. Bilgin, and A. Camurcu, "An improved clustering algorithm for text mining: multi-cluster spherical K-means," International Arab Journal of Information Technology, vol. 13, no. 1, pp. 12–19, 2016.
[30] Y. Li, C. Luo, and S. M. Chung, "A parallel text document clustering algorithm based on neighbors," Cluster Computing, vol. 18, no. 2, pp. 933–948, 2015.
[31] F. Murtagh and P. Contreras, "Algorithms for hierarchical clustering: an overview," Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 2, no. 1, pp. 86–97, 2012.
[32] T. Peng and L. Liu, "A novel incremental conceptual hierarchical text clustering method using CFu-tree," Applied Soft Computing, vol. 27, pp. 269–278, 2015.
[33] Q. Chen, J. F. Lu, and H. Zhang, "A text mining model based on improved density clustering algorithm," in Proceedings of the 4th IEEE International Conference on Electronics Information and Emergency Communication (ICEIEC '13), Beijing, China, November 2013.
[34] S. F. Hussain, M. Mushtaq, and Z. Halim, "Multi-view document clustering via ensemble method," Journal of Intelligent Information Systems, vol. 43, no. 1, pp. 81–99, 2014.
[35] A. Wahid, X. Gao, and P. Andreae, "Multi-view clustering of web documents using multi-objective genetic algorithm," in Proceedings of the IEEE Congress on Evolutionary Computation (CEC '14), pp. 2625–2632, Beijing, China, July 2014.
[36] X. Pei, T. Wu, and C. Chen, "Automated graph regularized projective nonnegative matrix factorization for document clustering," IEEE Transactions on Cybernetics, vol. 44, no. 10, pp. 1821–1831, 2014.
[37] M. Lu, X.-J. Zhao, L. Zhang, and F.-Z. Li, "Semi-supervised concept factorization for document clustering," Information Sciences, vol. 331, pp. 86–98, 2016.
[38] C.-K. Yau, A. Porter, N. Newman, and A. Suominen, "Clustering scientific documents with topic modeling," Scientometrics, vol. 100, no. 3, pp. 767–786, 2014.
[39] Y. Ma, Y. Wang, and B. Jin, "A three-phase approach to document clustering based on topic significance degree," Expert Systems with Applications, vol. 41, no. 18, pp. 8203–8210, 2014.
[40] C. C. Aggarwal, Y. Zhao, and P. S. Yu, "On the use of side information for mining text data," IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 6, pp. 1415–1429, 2014.
[41] R. M. Marcacini, M. A. Domingues, E. R. Hruschka, and S. O. Rezende, "Privileged information for hierarchical document clustering: a metric learning approach," in Proceedings of the 22nd International Conference on Pattern Recognition (ICPR '14), pp. 3636–3641, August 2014.
[42] L. Cagnina, M. Errecalde, D. Ingaramo, and P. Rosso, "An efficient particle swarm optimization approach to cluster short texts," Information Sciences, vol. 265, pp. 36–49, 2014.
[43] W. Song, Y. Qiao, S. C. Park, and X. Qian, "A hybrid evolutionary computation approach with its application for optimizing text document clustering," Expert Systems with Applications, vol. 42, no. 5, pp. 2517–2524, 2015.
[44] R. Forsati, A. Keikha, and M. Shamsfard, "An improved bee colony optimization algorithm with an application to document clustering," Neurocomputing, vol. 159, no. 1, pp. 9–26, 2015.
[45] K. K. Bharti and P. K. Singh, "Chaotic gradient artificial bee colony for text clustering," Soft Computing, vol. 20, no. 3, pp. 1113–1126, 2016.
[46] F. Wang and J. Sun, "Survey on distance metric learning and dimensionality reduction in data mining," Data Mining and Knowledge Discovery, vol. 29, no. 2, pp. 534–564, 2014.
[47] Y.-S. Lin, J.-Y. Jiang, and S.-J. Lee, "A similarity measure for text classification and clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 7, pp. 1575–1590, 2014.
[48] M. T. Hassan, A. Karim, J.-B. Kim, and M. Jeon, "CDIM: document clustering by discrimination information maximization," Information Sciences, vol. 316, pp. 87–106, 2015.
[49] M. T. Hassan and A. Karim, "Clustering and understanding documents via discrimination information maximization," in Proceedings of the Pacific-Asia Conference on Advances in Knowledge Discovery & Data Mining (PAKDD '12), Kuala Lumpur, Malaysia, May 2012.
[50] D. Cai and C. J. van Rijsbergen, "Learning semantic relatedness from term discrimination information," Expert Systems with Applications, vol. 36, no. 2, pp. 1860–1875, 2009.
[51] G. Grahne and J. Zhu, "High performance mining of maximal frequent itemsets," in Proceedings of the SIAM Workshop on High Performance Data Mining: Pervasive and Data Stream Mining (HPDM:PDS '03), San Francisco, Calif, USA, May 2003.
[52] H. Luo, R. Shen, and C. Niu, "Sparse group restricted Boltzmann machines," in Proceedings of the 25th AAAI Conference on Artificial Intelligence (AAAI '11), San Francisco, Calif, USA, August 2011.
[53] S. Pado and M. Lapata, "Dependency-based construction of semantic space models," Computational Linguistics, vol. 33, no. 2, pp. 161–199, 2007.


Page 3: A Novel Text Clustering Approach Using Deep-Learning ...downloads.hindawi.com/journals/mpe/2017/8310934.pdf · A Novel Text Clustering Approach Using Deep-Learning Vocabulary Network

Mathematical Problems in Engineering 3

spherical k-means (SKM) and proposed a multicluster spher-ical k-means (MCSKM) which allowed documents to beassigned more than one cluster Li et al [30] introduced aconcept of neighbor and proposed a parallel k-means basedon neighbors (PKBN)

Another representative clustering algorithm is hierarchi-cal clustering which contains divisive hierarchical clusteringand agglomerative hierarchical clustering [31] Peng and Liu[32] proposed an incremental hierarchical text clusteringapproach which represented a cluster hierarchy using CFu-tree In addition Chen et al [33] proposed an improveddensity clustering algorithm named density-based spatialclustering of applications with noise (DBSCAN) DBSCANwas sensitive to choosing parameters the authors combinedk-means to estimate the parameters

Ensemble clustering is another clustering algorithmEnsemble clustering combines themultiple results of differentclustering algorithms to obtain final results Multiview clus-tering is an extension of ensemble clustering and combinesdifferent data that have different properties and views [34 35]

Matrix factorization-based clustering is an importantclustering approach [36] Lu et al [37] proposed a semisu-pervised concept factorization (SSCF) which containednonnegative matrix factorization and concept factorizationfor text clustering SSCF integrated penalized and rewardterms by pairwise constraints must-link constraints 119862ML andcannot-link constraints 119862CL which implied two documentsbelonging to the same cluster or different clusters

Topic-based text clustering is an effective text clusteringapproach in which text documents are projected into a topicspace Latent Dirichlet allocation (LDA) is a common topicmodel Yau et al [38] separated scientific publications intoseveral clusters based on LDA Ma et al [39] employed thetopic model of LDA to represent the centroids of clusters andcombined k-means++ algorithm for document clustering

In some literatures additional information is introducedfor text clustering such as side-information [40] and priv-ileged information [41] What is more several global opti-mization algorithms are utilized for text clustering such asparticle swarm optimization (PSO) algorithm [42 43] andbee colony optimization (BCO) algorithm [44 45]

Similarity measure is also an important issue in textclustering algorithms To compute the similarity between atext document and a cluster is a fundamental problem in clus-tering algorithms The most common similarity measure isdistance metric such as Euclidean distance Cosine distanceandGeneralizedMahalanobis distance [46]There exist othersimilarity measure methods such as IT-Sim (an information-theoretic measure) [47] Besides similarity measure mea-surement of discrimination information (MDI) is an oppositeconcept to compute the relations of text documents [48ndash50]

3 Theoretical Foundation

In this section we describe some theories related to ourwork This section contains three subsections which arefrequent pattern maximal (FPMAX) PageRank and deepbelief network (DBN)

Procedure FPMAX(119879)Input 119879 (an FP-tree)Global

MFIT an MFI-treeHead a linked list of items

Output The MFIT that contains all MFIrsquosMethod(1) if 119879 only contains a single path 119875(2) insert Head cup 119875 into MFIT(3) else for each 119894 in Header-table of 119879(4) append 119894 to Head(5) construct the Head-pattern base(6) Tail = frequent items in base(7) subset_checking(Head cup Tail)(8) if Head cup Tail is not in MFIT(9) construct the FP-tree 119879Head(10) call FPMAX(119879Head)(11) remove 119894 from Head

Algorithm 1 FPMAX

31 FPMAX FPMAX is a depth-first and recursive algorithmforminingmaximal frequent itemsets (MFIs) in given dataset[51] Before FPMAX is called frequent pattern tree (FP-tree)is structured to store frequent itemsets and each branch ofthe FP-tree is a representation of a frequent itemset FP-tree includes a linked list head which contains all itemsof the dataset Maximal frequent itemset tree (MFI-tree) isintroduced to store all MFIs in FPMAX The procedure ofFPMAX is described Algorithm 1

3.2. PageRank. PageRank is a link-based ranking algorithm which is used in the Google search engine. Most webpages on the Internet are connected with hyperlinks, which carry important information. Hence, some webpages pointed to by many webpages are considered to include quality information.

Webpages and hyperlinks in PageRank are structured as a directed graph $G = (V, E)$, where $V$ is the set of webpages and $E$ is the set of hyperlinks. Let $n$ be the total number of webpages. The PageRank score of webpage $i$ is defined by

$$P(i) = \sum_{(j,i) \in E} \frac{P(j)}{O_j}, \quad (1)$$

where $O_j$ is the number of outgoing links of page $j$ to other webpages. Let $P$ be a vector representing all PageRank scores:

$$P = (P(1), P(2), \ldots, P(n))^{T}. \quad (2)$$

Let $A$ be the adjacency matrix of the graph $G$ with

$$A_{ij} = \begin{cases} \dfrac{1}{O_i}, & \text{if } (i,j) \in E \\ 0, & \text{if } (i,j) \notin E. \end{cases} \quad (3)$$

Hence, (1) can be written as a system of equations:

$$P = A^{T} P. \quad (4)$$


Figure 1: The structure of DBN (a visible layer, hidden layers 1 through k, and an output layer $y_1, y_2, \ldots, y_n$).

PageRank models web surfing as a stochastic process, and the theory of Markov chains can be applied. However, the web graph does not meet the conditions of a stochastic process, which requires $A$ to be stochastic, irreducible, and aperiodic. After the adjustment of $A$ to fix this problem, we obtain an improved model with

$$P = \left((1 - d)\frac{E}{n} + dA^{T}\right) P, \quad (5)$$

where $E$ is $ee^{T}$ ($e$ is a column vector of all 1's) and thus $E$ is an $n \times n$ matrix of all 1's, and $d$ is a parameter called the damping factor. After scaling, we obtain

$$P = (1 - d) e + dA^{T} P. \quad (6)$$

Equation (6) can also be transformed as follows:

$$P(i) = (1 - d) + d \sum_{(j,i) \in E} \frac{P(j)}{O_j}. \quad (7)$$

The computation of the PageRank score is a process of iteration. Given an initial value of $P$, the iteration ends when the PageRank scores no longer change or the change is less than a threshold.
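As a minimal illustration of this iteration, the following Python sketch applies equation (7) to a small toy graph until the scores stop changing; the adjacency matrix and damping factor are invented for the example, and dangling pages (with no out-links) simply contribute nothing.

import numpy as np

def pagerank(adj, d=0.85, tol=1e-6, max_iter=100):
    """Iterate equation (7) on a dense adjacency matrix (adj[i, j] = 1 if page i links to page j)."""
    n = adj.shape[0]
    out_deg = adj.sum(axis=1)
    # A[i, j] = 1 / O_i for edges, 0 otherwise, as in equation (3)
    A = np.divide(adj, out_deg[:, None],
                  out=np.zeros_like(adj, dtype=float), where=out_deg[:, None] > 0)
    p = np.ones(n)                       # initial value of P
    for _ in range(max_iter):
        p_new = (1 - d) + d * (A.T @ p)  # equation (7) for all pages at once
        if np.abs(p_new - p).sum() < tol:
            break
        p = p_new
    return p

# toy 4-page web graph; the structure is illustrative only
adj = np.array([[0, 1, 1, 0],
                [0, 0, 1, 0],
                [1, 0, 0, 1],
                [0, 0, 1, 0]], dtype=float)
print(pagerank(adj))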

3.3. Deep Belief Network (DBN). DBN is a model of deep learning composed of multilayer restricted Boltzmann machines (RBMs). DBN contains the input layer (visible layer), the hidden layers, and the output layer. There are connections between a layer and the adjacent layer, but no connections among units within each layer. The structure of DBN is shown in Figure 1.

As shown in Figure 1, an RBM consists of two adjacent layers. The training of DBN includes two steps, pretraining and fine-tuning. An RBM contains a visible layer $v^{(k)}$ and a hidden layer $h^{(k)}$. The parameters of the RBM are $(W^{(k)}, a^{(k)}, b^{(k)})$, where $W^{(k)}$ are the weights of connections between the visible layer and the hidden layer, and $(a^{(k)}, b^{(k)})$ are the bias vectors of the visible units and the hidden units. Giving an initial value to $W^{(k)}$, the parameters are updated with

$$W_{ij}^{(k)} = W_{ij}^{(k)} + \eta \nabla W_{ij}^{(k)}, \quad (8)$$

where $\eta$ is the learning rate, and the updates of $(a^{(k)}, b^{(k)})$ are similar to $W^{(k)}$. The gradient $\nabla W^{(k)}$ is obtained by Gibbs sampling:

$$\nabla W_{ij}^{(k)} = E\left[v_i^{(k)} h_j^{(k)}\right]_{\mathrm{data}} - E\left[v_i^{(k)} h_j^{(k)}\right]_{\mathrm{Gibbs}}, \quad (9)$$

where $E[\cdot]_{\mathrm{data}}$ and $E[\cdot]_{\mathrm{Gibbs}}$ are the expectations over data samples and over samples from Gibbs sampling, and $(\nabla a^{(k)}, \nabla b^{(k)})$ are similar to $\nabla W_{ij}^{(k)}$.

DBN is fine-tuned with a set of labeled inputs in terms of error back propagation after the pretraining of DBN. The parameters are updated by

$$W_{ij}^{(k)} = W_{ij}^{(k)} + \eta \nabla W_{ij}^{(k)}, \quad (10)$$

where $\nabla W_{ij}^{(k)} = h_i^{(k-1)} \delta_j^{(k)}$ and $\delta_j^{(k)}$ is an error vector.
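For readers who prefer code to formulas, the following is a minimal numpy sketch of one contrastive-divergence (CD-1) update of a single RBM, which approximates the Gibbs expectation in equation (9) with one reconstruction step; the layer sizes, learning rate, and random data are all illustrative rather than values used in this paper.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, a, b, eta=0.1):
    """One CD-1 step: positive phase from data, negative phase from one Gibbs step."""
    # positive phase: contribution to E[v h]_data
    ph0 = sigmoid(v0 @ W + b)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # one Gibbs step: reconstruct the visible layer, then recompute hidden probabilities
    pv1 = sigmoid(h0 @ W.T + a)
    ph1 = sigmoid(pv1 @ W + b)
    # gradients of equation (9), approximated by CD-1
    dW = v0[:, None] * ph0[None, :] - pv1[:, None] * ph1[None, :]
    da = v0 - pv1
    db = ph0 - ph1
    # parameter updates as in equation (8)
    return W + eta * dW, a + eta * da, b + eta * db

# toy sizes: 6 visible units, 4 hidden units (illustrative only)
W = rng.normal(scale=0.01, size=(6, 4))
a = np.zeros(6)
b = np.zeros(4)
v = rng.integers(0, 2, size=6).astype(float)
W, a, b = cd1_update(v, W, a, b)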


Figure 2: The procedure of DLVN: feature extraction (TongYiCi CiLin, related-word set RS, vocabulary network, PageRank, feature vectors), feature selection (sparse-group DBN), and clustering (DL-SP).

Figure 3: The structure of TongYiCi CiLin: five layers consisting of major classes, middle classes, small classes, word groups, and atomic word groups, each mapped to a code (e.g., Af03A06).

4. Deep-Learning Vocabulary Network

In this section, we propose an approach called deep-learning vocabulary network (DLVN) for text clustering. The first step of DLVN is the construction of the vocabulary network. The cooccurrence of words or terms is useful information for text clustering. We use the nodes of the vocabulary network to represent words or terms and the edges of the vocabulary network to represent the relations between words or terms. In our work, there are two methods to obtain the cooccurrence relations of words: related-word set and TongYiCi CiLin. Frequent itemsets are used to discover the relations of items in a database. We create the related-word set by frequent itemsets, and each itemset of the related-word set is a set of words with a cooccurrence relation. PageRank is employed to obtain the "importance" of nodes (feature vectors) instead of the term frequency in VSM. Then, an improved DBN (called sparse-group DBN) is proposed for dimensionality reduction. In the process of the clustering algorithm, we present DL-SP for clustering, in which coverage rate is used for similarity measure. The procedure of DLVN is shown in Figure 2.

4.1. Related-Word Set. The relations of words or terms are important information in text documents. Usually, natural language has fixed collocations and corresponding contexts, which means some words or terms have a high probability of occurrence in a text document. Thus, the relations between words are important to represent the meaning of text documents. In our paper, we use frequent itemsets to obtain cooccurrence relations between words or terms.

Definition 1 (related-word set). Let $D = \{word_1, word_2, \ldots, word_n\}$ be the words of text documents from the same topic and $\sup[\cdot]$ be the support of itemsets. Given a minimum support $\sup_{ms}$, $X = \{word_i, word_j, \ldots, word_k\}$ is defined as an itemset of the related-word set, where $\sup[X] > \sup_{ms}$.

FPMAX is a depth-first and recursive algorithm for mining MFIs, and it is based on the FP-tree to store frequent itemsets. When a database has a large scale, all itemsets of the MFI-tree are detected in the subset checking of FPMAX, which has a big influence on the efficiency of FPMAX. To improve the efficiency of FPMAX, we use TongYiCi CiLin and string matching to compress the FP-tree.

TongYiCi CiLin is a Chinese semantic dictionary of synonyms and related words, which organizes all words as a five-layer hierarchical tree. It contains 77,343 words, which are divided into 12 major classes, 94 middle classes, and 1,438 small classes. The fourth layer and the fifth layer are further divided into word groups and atomic word groups. We use Figure 3 to illustrate the structure of TongYiCi CiLin.

TongYiCi CiLin maps an atomic word group into a code: the first layer and the fourth layer are capital letters, the second layer is a lowercase letter, and the third layer and the fifth layer are integers. For example, code "Aa01A02" stands for the atomic word group {man, mankind, human}. We replace the words or terms with the codes of their word groups in MFI mining (the fourth layer contains 4,223 word groups). We randomly select 10 documents from the same topic, and the frequent items (words) are listed in Table 1. As some words belong to the same word group, the number of words is compressed largely.
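A minimal sketch of this compression step follows, assuming a tiny hand-made word-to-code dictionary; the word/code pairs below are invented for illustration, and a real mapping would be loaded from TongYiCi CiLin.

# A toy fragment of a word -> word-group-code dictionary (pairs invented for illustration).
word_to_group = {
    "vehicle": "Bo21A01", "automobile": "Bo21A01", "jeep": "Bo21A01",
    "car": "Bo01A06", "engine": "Bo01A06",
    "truck": "Bo21A27", "lorry": "Bo21A27",
}

def compress(doc_words):
    """Replace words by their word-group codes and drop duplicates, keeping order."""
    seen, groups = set(), []
    for w in doc_words:
        code = word_to_group.get(w)
        if code and code not in seen:
            seen.add(code)
            groups.append(code)
    return groups

print(compress(["vehicle", "car", "engine", "truck"]))  # ['Bo21A01', 'Bo01A06', 'Bo21A27']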

The structures of the FP-trees that are created based on words and word groups are shown in Figure 4. Figure 4(a) is the FP-tree of words, and the FP-tree of word groups is shown in Figure 4(b). The nodes of the FP-tree based on the word groups are fewer than the nodes of the FP-tree based on the words.

The MFIs have redundant items in Figure 4(b). For example, the MFIs of Figure 4(b) are listed in Table 2.

MFIs include two categories of word groups in Table 2. The word groups a(Bo21A01), b(Bo01A06), d(Dd14B36) and a(Bo21A01), c(Bo21A27) are closely related, and the word groups k(Da21D01), i(Dm04A01), f(Cb08A01) and


Figure 4: The structures of FP-trees: (a) the FP-tree built from words; (b) the FP-tree built from word groups.

Table 1: The comparison of words and word groups.

Words                                                  | Word groups
a(vehicle), b(car), c(engine), d(quality)              | a(Bo21A01), b(Bo01A06), e(Dd12A01)
e(automobile), f(truck)                                | a(Bo21A01), c(Bo21A27)
b(car), g(engine), h(power)                            | a(Bo21A01), b(Bo01A06), d(Dd14B36)
a(vehicle), i(lorry), c(engine), h(power)              | a(Bo21A01), b(Bo01A06), c(Bo21A27), d(Dd14B36)
a(vehicle), j(jeep)                                    | a(Bo21A01), c(Bo21A27)
k(ground), l(situation)                                | f(Cb08A01), l(Da21A07)
m(area), n(store), o(construction), p(environment)     | f(Cb08A01), i(Dm04A01), j(Bn01A01), k(Da21D01)
q(shop), r(building), p(environment)                   | i(Dm04A01), k(Da21D01), j(Bn01A01)
s(location), q(shop), t(condition)                     | f(Cb08A01), i(Dm04A01), k(Da21D01)
d(quality), n(store), o(construction), p(environment)  | e(Dd12A01), i(Dm04A01), j(Bn01A01), k(Da21D01)

Table 2: MFIs of the FP-tree based on word groups.

MFIs
1: a(Bo21A01), b(Bo01A06), d(Dd14B36)
2: a(Bo21A01), c(Bo21A27)
3: k(Da21D01), i(Dm04A01), f(Cb08A01)
4: k(Da21D01), i(Dm04A01), j(Bn01A01)

k(Da21D01), i(Dm04A01), j(Bn01A01) are closely related. In fact, the aim of the related-word set is to mine the "cooccurrence" of words, and we assume that the relations of words have transitivity. Therefore, we utilize string matching and the same items to combine MFIs.

Definition 2 (combination of MFIs). Let MFIS = $\{MFI_1, MFI_2, \ldots, MFI_m\}$ be the set of MFIs obtained from text documents and $\mathrm{cov}(\cdot)$ be the number of the same items in two MFIs. Suppose that $\mathrm{cov}(MFI_1, MFI_2) > \mathrm{cov}_{\min}$, where $\mathrm{cov}_{\min}$ is the minimum number of the same items; then $MFI_1$ and $MFI_2$ are removed from MFIS, and the combination $MFI_1 \cup MFI_2$ is added to MFIS.

MFIs are inserted into the MFI-tree in terms of $\mathrm{cov}_{\min}$. For example, given $MFI_1 = \{a, b, c, d, e, f\}$, $MFI_2 = \{a, b, c, d, e, h\}$, $MFI_3 = \{e, f, g, h, i, j, k\}$, and $\mathrm{cov}_{\min} = 0.7$, the combination of MFIs is $MFI_1 \cup MFI_2 = \{a, b, c, d, e, f, h\}$. The new MFI-tree only has two paths, $MFI_1 \cup MFI_2 = \{a, b, c, d, e, f, h\}$ and $MFI_3 = \{e, f, g, h, i, j, k\}$. The scale of the MFI-tree is simplified, and we integrate FPMAX with the combination of MFIs to propose an algorithm named FPMAX with related-word set (FPMAX-RS). The steps of FPMAX-RS are listed in Algorithm 2.
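A small Python sketch of the combination step in Definition 2 follows. Because the worked example uses cov_min = 0.7, the overlap is treated here as a ratio normalized by the size of the smaller MFI; this normalization is an assumption rather than the authors' exact rule, and the sets are taken from the example above.

def combine_mfis(mfis, cov_min=0.7):
    """Greedily combine MFIs whose overlap ratio exceeds cov_min (Definition 2).

    cov(X, Y) is taken as |X & Y| / min(|X|, |Y|); treating cov_min as a ratio is an
    assumption based on the paper's example (cov_min = 0.7).
    """
    mfis = [set(m) for m in mfis]
    merged = True
    while merged:
        merged = False
        for i in range(len(mfis)):
            for j in range(i + 1, len(mfis)):
                overlap = len(mfis[i] & mfis[j]) / min(len(mfis[i]), len(mfis[j]))
                if overlap > cov_min:
                    combined = mfis[i] | mfis[j]
                    mfis = [m for k, m in enumerate(mfis) if k not in (i, j)]
                    mfis.append(combined)
                    merged = True
                    break
            if merged:
                break
    return mfis

mfi1 = {"a", "b", "c", "d", "e", "f"}
mfi2 = {"a", "b", "c", "d", "e", "h"}
mfi3 = {"e", "f", "g", "h", "i", "j", "k"}
print(combine_mfis([mfi1, mfi2, mfi3]))  # mfi1 and mfi2 merge; mfi3 stays separate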

4.2. The Construction of Vocabulary Network. In this section, the vocabulary network is constructed to represent text documents, and the vocabulary network contains the relations between words or terms. We employ the "importance" of nodes instead of term frequency in VSM.

4.2.1. The Selection of Vocabulary Network Nodes. The word groups in TongYiCi CiLin are used as nodes instead of words in the vocabulary network. The number of word groups is much smaller than the number of words. In addition, we choose the word groups whose frequency is higher than a specified minimal frequency $f_{\min}$.

4.2.2. The Construction of Edges in the Vocabulary Network. Edges of a complex network are the important carrier of information, and the edges of the vocabulary network are used in calculating the "importance" of nodes. Considering the semantic and related information among words or terms, an edge is added to the vocabulary network in terms of the similarity of nodes. Therefore, we add an edge to the vocabulary network if two word groups have closer positions


Procedure FPMAX-RS(T)
Input: T (an FP-tree), cov_min
Global:
    MFIT: an MFI-tree
    Head: a linked list of items
Output: the MFIT that contains all MFI's
Method:
(1) if T only contains a single path P
(2)     if cov(Head ∪ P, MFI) > cov_min
(3)         combine MFI-tree to this path
(4)     else
(5)         insert Head ∪ P into MFIT
(6) else for each i in Header-table of T
(7)     append i to Head
(8)     construct the Head-pattern base
(9)     Tail = frequent items in base
(10)    subset_checking(Head ∪ Tail)
(11)    if Head ∪ Tail is not in MFI-tree
(12)        construct the FP-tree T_Head
(13)        call FPMAX-RS(T_Head)
(14)    remove i from Head

Algorithm 2: FPMAX-RS.

in TongYiCi CiLin. The semantic similarity of word groups $\mathrm{sim}(i, j)$ is defined as

$$\mathrm{sim}(i, j) = \frac{\mathrm{depth}(i, j)}{l} \times \frac{TN - \mathrm{Dis}(i, j) + 1}{TN}, \quad (11)$$

where $\mathrm{depth}(i, j)$ is the depth of the first common father node, $l$ is the depth of $i$ and $j$, $TN$ is the total number of word groups, and $\mathrm{Dis}(i, j)$ denotes the distance between $i$ and $j$. For example, consider the two words car and wheel, whose word group codes are Bo21A and Bo25B. Because the two nodes are in the fourth layer, the first common father node is Bo, which is in the second layer. In addition, the fourth layer contains 4,223 word groups, and $\mathrm{Dis}(i, j)$ of Bo21A and Bo25B is 14. Therefore, $\mathrm{sim}(\mathrm{Bo21A}, \mathrm{Bo25B})$ is calculated as follows:

$$\mathrm{sim}(\mathrm{Bo21A}, \mathrm{Bo25B}) = \frac{2}{4} \times \frac{4223 - 14 + 1}{4223}. \quad (12)$$

The nodes in the vocabulary network are traversed, and an edge between $i$ and $j$ is added when $\mathrm{sim}(i, j) > \mathrm{sim}_{\min}$ (the specified threshold).
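Equation (11) translates directly into a small function. Because the paper does not spell out how Dis(i, j) is computed from the codes, the depth of the common father, the node depth, and the distance are passed in explicitly here; the call reproduces the worked example of equation (12).

def cilin_similarity(depth_common, l, dis, tn=4223):
    """Semantic similarity of two word groups, equation (11):
    sim(i, j) = (depth(i, j) / l) * (TN - Dis(i, j) + 1) / TN
    """
    return (depth_common / l) * (tn - dis + 1) / tn

# Reproduces equation (12): sim(Bo21A, Bo25B) with common father "Bo" at depth 2,
# both nodes at depth 4, and distance 14 among the 4223 word groups.
print(cilin_similarity(depth_common=2, l=4, dis=14))  # ≈ 0.498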

In addition, we add an edge between two nodes if an MFI in the related-word set includes the words, as each MFI in the related-word set is a word set with cooccurrence relations. In fact, the meanings of the words in an MFI are not necessarily similar; an MFI includes a group of words cooccurring in the documents of the same topic. When a text document has the words in an MFI, the text document has a high probability of belonging to a certain topic. Therefore, we add an edge into the vocabulary network with the low-frequency word pointing to the high-frequency word.

4.2.3. The Extraction of Feature Vectors. In the vocabulary network, the number and the direction of edges reflect the importance of nodes, which is similar to evaluating the importance of webpages. Thus, PageRank is utilized to obtain the importance of nodes, and the initial value $PR_i$ of the nodes is defined by

$$PR_i = \frac{f_i}{\sum_{j=1}^{N} f_j}, \quad (13)$$

where $f_i$ is the frequency of word group $i$. After iterative computation and normalization of $PR_i$, we use the PageRank scores of the nodes as the feature vectors of text documents instead of term frequency in this paper.
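As a sketch of how such feature vectors might be computed in practice, the following snippet builds a small directed vocabulary network with the networkx library and uses the PageRank scores of its nodes as one document's feature vector; the word-group codes, frequencies, and edges are invented for illustration, and the starting values mirror equation (13).

import networkx as nx

def pagerank_features(edges, freq, vocab):
    """Build a directed vocabulary network and use PageRank scores as the feature vector."""
    g = nx.DiGraph()
    g.add_nodes_from(vocab)
    g.add_edges_from(edges)                             # e.g., low-frequency group -> high-frequency group
    total = sum(freq.get(v, 0) for v in vocab) or 1
    init = {v: freq.get(v, 0) / total for v in vocab}   # PR_i = f_i / sum_j f_j, equation (13)
    scores = nx.pagerank(g, alpha=0.85, nstart=init)
    return [scores[v] for v in vocab]                   # fixed node ordering = the feature vector

# toy vocabulary of word-group codes and toy frequencies (illustrative only)
vocab = ["Bo21A01", "Bo01A06", "Bo21A27", "Dd14B36"]
freq = {"Bo21A01": 5, "Bo01A06": 3, "Bo21A27": 2, "Dd14B36": 1}
edges = [("Dd14B36", "Bo21A01"), ("Bo21A27", "Bo21A01"), ("Bo01A06", "Bo21A01")]
print(pagerank_features(edges, freq, vocab))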

4.3. Deep-Learning Single-Pass (DL-SP). In this paper, sparse-group DBN is proposed for dimensionality reduction of feature vectors. DBN is a model of deep learning. Luo et al. [52] found that the units of hidden layers exhibited statistical dependencies and proposed a regularization constant to restrict the relations in hidden layers. Due to the sparsity of feature vectors, we combine the word dependencies and DBN to propose a sparse-group DBN for dimensionality reduction. In addition, coverage rate (CoR) is proposed for similarity measure among feature vectors in DL-SP.

4.3.1. Sparse-Group DBN. Deep learning simulates the process of human thinking, and the result of deep learning is a distributed representation of an input vector. By analyzing the feature vectors extracted from the vocabulary network, we find that there exist statistical dependencies between entries of feature vectors, which means the entries of feature vectors will cooccur in parts of the feature vectors. The word dependency is also mentioned by many researchers in previous literature [5, 18, 53]. Cooccurrence relations are typically collected in feature vectors, in which a unique word is commonly referred to as the "target word", and the word dependency is quantified to measure word similarity in text clustering. We provide an example, which is part of a feature vector, in Table 3.

Because the documents in the same topic usually include related words, a part of the units in the visible layer is active simultaneously, and accordingly the documents in different topics usually activate different parts of the units. Based on this observation, we add a regularization constant to the log-likelihood of the training data to retain these relations. In the experiments, we use documents of different topics to train the sparse-group DBN. The sequence of units in the output layer is adjusted accordingly, and the cooccurring units are divided into one group. In other words, the feature vectors of different topic documents can activate different groups of units in the output layer. The structure of sparse-group DBN is shown in Figure 5.

Sparse-group DBN is composed of several RBMs, and every two adjacent layers form an RBM. For retaining the dependency of the units in the output layer, we define the activation probability of each group. Given a group $S = \{y_1, y_2, \ldots, y_s\}$ and a training sample $v^{(k)}$, the group probability $P_S(\cdot)$ is given by

$$P_S(v^{(k)}) = \sqrt{\sum_{s \in S} P(y_s = 1 \mid v^{(k)})^2}. \quad (14)$$


Table 3: The word dependencies of a feature vector.

Hj47A01 | Hg19C01 | Ba03A18 | Ba08A07 | Dm05A01 | Hg01A01 | Ae13B01
0.32    | 0.17    | 0.12    | 0.04    | 0       | 0.02    | 0
0.17    | 0.21    | 0.07    | 0.11    | 0       | 0       | 0
0.14    | 0.23    | 0.17    | 0.09    | 0       | 0       | 0
0.23    | 0.12    | 0.06    | 0.14    | 0       | 0.01    | 0
0       | 0       | 0       | 0       | 0.12    | 0.34    | 0.20
0       | 0       | 0       | 0       | 0.24    | 0.21    | 0.10
0       | 0       | 0       | 0       | 0.13    | 0.14    | 0.09

Figure 5: The structure of sparse-group DBN (visible layer, hidden layer, and output layer whose units $y_1, y_2, \ldots, y_s$ are divided into $K$ groups).

The output layer of the sparse-group DBN is divided into $K$ groups, and the probability of the output layer $P_{ol}(\cdot)$ is defined by

$$P_{ol}(v^{(k)}) = \sum_{k=1}^{K} \sqrt{\sum_{s \in S} P(y_s = 1 \mid v^{(k)})^2}. \quad (15)$$

We add a regularization constant $\lambda$ and $P_{ol}(v^{(k)})$ to the optimization function, which is the maximum likelihood estimate of the energy function of an RBM. The optimization function is defined by

$$\max_{W, b, c} \sum \log P(v^{(k)}) - \lambda \sum_{k=1}^{K} \sqrt{\sum P(y_s = 1 \mid v^{(k)})^2}. \quad (16)$$

Equation (9) is improved to (17) accordingly, and $\nabla W_{ij}^{(k)}$ is defined by

$$\nabla W_{ij}^{(k)} = E\left[v_i^{(k)} h_j^{(k)}\right]_{\mathrm{data}} - E\left[v_i^{(k)} h_j^{(k)}\right]_{\mathrm{Gibbs}} - \lambda \cdot \alpha, \quad (17)$$

where
$$\alpha = \frac{\partial}{\partial W_{ij}^{(k)}} P_S(v^{(k)}) = \frac{P(y_s = 1 \mid v^{(k)})}{P_S(v^{(k)})} \cdot \frac{\partial}{\partial W_{ij}^{(k)}} P(y_s = 1 \mid v^{(k)}) = \frac{P(y_s = 1 \mid v^{(k)})^2 \cdot P(y_s = 0 \mid v^{(k)}) \cdot v^{(k)}}{P_S(v^{(k)})}.$$
Accordingly, the gradients $(\nabla a^{(k)}, \nabla b^{(k)})$ are defined by

$$\nabla a_i^{(k)} = E\left[v_i^{(k)}\right]_{\mathrm{data}} - E\left[v_i^{(k)}\right]_{\mathrm{Gibbs}} - \lambda \cdot \alpha,$$
$$\nabla b_j^{(k)} = E\left[h_j^{(k)}\right]_{\mathrm{data}} - E\left[h_j^{(k)}\right]_{\mathrm{Gibbs}} - \lambda \cdot \alpha. \quad (18)$$
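The following numpy sketch computes the group penalty of equations (14)-(15) and the extra gradient factor $\alpha$ of equation (17) from given activation probabilities; the group layout, probabilities, and visible vector are toy values, and the per-unit handling of $\alpha$ reflects one reading of the formula rather than the authors' exact implementation.

import numpy as np

def group_penalty_and_alpha(p_active, groups, v):
    """Group-sparsity penalty (eq. (15)) and the extra gradient term alpha (eq. (17)).

    p_active[s] = P(y_s = 1 | v); `groups` lists the unit indices of each group;
    `v` is the (binary) visible vector. All sizes and values here are illustrative.
    """
    penalty = 0.0
    alpha = np.zeros((v.size, p_active.size))        # same shape as the weight matrix
    for group in groups:
        p_g = p_active[group]
        norm = np.sqrt(np.sum(p_g ** 2))              # P_S(v), equation (14)
        penalty += norm
        if norm == 0:
            continue
        # per-unit factor P(y_s=1)^2 * P(y_s=0) / P_S(v), multiplied by v_i (equation (17))
        factor = (p_g ** 2) * (1 - p_g) / norm
        alpha[:, group] = np.outer(v, factor)
    return penalty, alpha

p = np.array([0.9, 0.8, 0.1, 0.05, 0.2, 0.7])         # toy activation probabilities
groups = [[0, 1, 2], [3, 4, 5]]                        # K = 2 groups of output units
v = np.array([1.0, 0.0, 1.0, 1.0])                     # toy visible vector
penalty, alpha = group_penalty_and_alpha(p, groups, v)
print(penalty, alpha.shape)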

4.3.2. Similarity Measure of DL-SP. Single-Pass is a partitional clustering algorithm. The first document is treated as the first cluster in Single-Pass, and similarity is computed between a new document and the existing clusters, which decides whether the new document joins an existing cluster or creates a new cluster in terms of a specified threshold. The output of the sparse-group DBN is binary, and Euclidean distance and Cosine angle distance are not suitable for similarity measure in DL-SP. Therefore, we use coverage rate (CoR) for similarity measure, and CoR is defined by

$$\mathrm{CoR}(C, d) = \frac{|C \cap d|}{|C|}, \quad (19)$$

where $C = (c_1, c_2, \ldots, c_n)$ is the feature vector of a cluster (named the topic feature vector) and $d = (d_1, d_2, \ldots, d_n)$ is the feature vector of the new document.
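Since the output of the sparse-group DBN is binary, CoR reduces to counting shared active positions; a minimal sketch with toy vectors illustrating equation (19) is given below.

import numpy as np

def coverage_rate(c, d):
    """CoR(C, d) = |C ∩ d| / |C| for binary feature vectors (equation (19))."""
    c = np.asarray(c, dtype=bool)
    d = np.asarray(d, dtype=bool)
    active = c.sum()
    return 0.0 if active == 0 else float(np.logical_and(c, d).sum()) / float(active)

cluster = [1, 1, 0, 1, 0, 1]   # topic feature vector (toy)
doc     = [1, 0, 0, 1, 1, 1]   # new document feature vector (toy)
print(coverage_rate(cluster, doc))  # 3 shared active units / 4 active in the cluster = 0.75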

Moreover, the addition of many text documents to clusters has an influence on the topic feature vector. In our work, we introduce an optional topic feature vector $C' = (c'_1, c'_2, \ldots, c'_n)$ and the weight of the feature vector to solve this problem. We provide an example of the optional topic feature vector in Figure 6.

When the weight of the optional topic feature vector is greater than a specified threshold in each time interval, we replace the topic feature vector with the optional topic feature vector as the new cluster center. The weight of the topic feature vector is defined by

$$w_{C'} = \frac{\sum_{C} f(c_i)}{\sum_{C} f(c_i) + \sum_{C'} f(c'_j)} - \lambda e^{k(t - t_0)}, \quad (20)$$

where $\lambda e^{k(t - t_0)}$ is the time damping function and $f(c_i)$ is the frequency function.

5. Experimental Analysis

In this section, we conduct three sets of experiments to validate the effectiveness of the proposed approach, including the efficiency of FPMAX-RS in related-word set mining, the comparison of feature vectors, and the comparison of DL-SP efficiency. In this work, three Chinese text corpora, TanCorpV1.0, Encyclopedia of China, and Sogou Corpus, are used as the experimental datasets.

5.1. The Efficiency of FPMAX-RS in Related-Word Set Mining. This section compares the running time of FPMAX and


Figure 6: An example of optional topic feature vector: (a) text feature $d_i$; (b) topic feature $c_i$; (c) similarity; (d) optional topic feature $(c_i, c'_i)$.

FPMAX-RS in related-word set mining. We choose seven categories (museum, property, education, military, car, sport, and health) of text documents from the datasets, and each category has 50 articles. The result of the experiment is shown in Figure 7.

FPMAX generates a larger number of maximal frequent itemsets and traverses all MFI-trees for subset checking, which has an influence on the running time of FPMAX. Compared with FPMAX, FPMAX-RS has higher efficiency when $\sup_{\min}$ is smaller.

5.2. The Comparison of Feature Vectors. In this work, we compare the distance among the feature vectors based on tf-idf, FC-VSM [12], and DLVN. We randomly choose two documents from the category museum and one document in other categories, including property, education, and military. The aim of feature extraction is to extract the feature vectors that can represent the meaning of text documents. In other words, feature vectors in different categories should have longer distance. Therefore, we compute the Euclidean distance of feature vectors in different categories based on tf-idf, FC-VSM, and DLVN. Table 4 shows the results in different categories of text documents.

In the following experiment, feature vectors are extracted based on tf-idf, FC-VSM, and DLVN. Then, k-means is applied for clustering. We evaluate the clustering performance


Table 4: Euclidean distance comparison.

Category  | Documents           | tf-idf | FC-VSM | DLVN
museum    | museum1 - museum2   | 1302   | 1049   | 917
property  | property - museum1  | 1285   | 1347   | 1359
property  | property - museum2  | 1593   | 1586   | 1687
education | education - museum1 | 1468   | 1461   | 1472
education | education - museum2 | 1139   | 1133   | 1207
military  | military - museum1  | 1556   | 1649   | 1658
military  | military - museum2  | 1369   | 1403   | 1841

Figure 7: The comparison of running time (ms) of FPMAX and FPMAX-RS for $\sup_{\min}$ ranging from 0.01 to 0.06.

with F_measure. Let $\mathcal{D} = \{d_1, d_2, \ldots, d_n\}$ be the clustering result and $\mathcal{D}^* = \{d_1^*, d_2^*, \ldots, d_n^*\}$ be the standard dataset. F_measure is defined by

$$\mathrm{F\_measure} = \frac{2 \times \mathcal{P}(\mathcal{D}, \mathcal{D}^*) \times \mathcal{R}(\mathcal{D}, \mathcal{D}^*)}{\mathcal{P}(\mathcal{D}, \mathcal{D}^*) + \mathcal{R}(\mathcal{D}, \mathcal{D}^*)}, \quad (21)$$

where $\mathcal{P}(\mathcal{D}, \mathcal{D}^*) = |\mathcal{D} \cap \mathcal{D}^*| / |\mathcal{D}|$ is precision and $\mathcal{R}(\mathcal{D}, \mathcal{D}^*) = |\mathcal{D} \cap \mathcal{D}^*| / |\mathcal{D}^*|$ is recall.
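A minimal sketch of equation (21) for a single cluster against its reference set follows; the document ids are toy values.

def f_measure(found, reference):
    """F_measure of equation (21) for one cluster, treating both as sets of document ids."""
    found, reference = set(found), set(reference)
    if not found or not reference:
        return 0.0
    overlap = len(found & reference)
    precision = overlap / len(found)
    recall = overlap / len(reference)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f_measure(found=[1, 2, 3, 5], reference=[1, 2, 3, 4]))  # P = 0.75, R = 0.75, F = 0.75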

Because seven categories of text documents are chosen in our experiment, the specified number of clusters $k$ is 7. Figure 8 illustrates that feature vectors based on DLVN have better performance.

5.3. The Comparison of DL-SP Efficiency. In this experiment, we choose text documents from the datasets, and the number of documents in each category is listed in Table 5.

The aim of the experiment is to compare DL-SP with LSI and Single-Pass. The sparse-group DBN has 3 layers, and the

Figure 8: The comparison of F_measure of tf-idf, FC-VSM, and DLVN over the categories sport, military, museum, health, property, education, and car.

Table 5: The datasets of the experiment.

Category  | Number of text documents
sport     | 1300
military  | 1500
health    | 1400
property  | 900
education | 800
car       | 500

numbers of units in each layer are 4223, 3500, and 3000. In addition, the group number $K$ of the top layer is 200. The structure of the sparse-group DBN is shown in Figure 9.

The experimental result is shown in Figure 10. DL-SP has better performance than LSI and Single-Pass in sport, military, property, education, and health. However, the F_measure of DL-SP is lower than LSI and Single-Pass in the category car due


Figure 9: The structure of the sparse-group DBN used in the experiment (a visible layer of 4223 units, a hidden layer of 3500 units, and an output layer of 3000 units divided into $K$ = 200 groups).

Figure 10: The comparison of F_measure of Single-Pass, LSI, and DL-SP over the categories sport, military, health, property, education, and car.

Table 6: The running time of DL-SP and Single-Pass.

Method      | Dimensionality of feature vectors | Running time (s)
Single-Pass | 4223                              | 3866
DL-SP       | 3000                              | 1084

to the smaller number of documents, which does not train the sparse-group DBN effectively.

In this subsection, we compare the running time of DL-SP and Single-Pass, and the result is listed in Table 6.

6. Conclusions

In this paper, we propose an approach called DLVN for text clustering. The existing term frequency-based methods only calculate the number of words, but the relations of words are not considered in feature extraction. The approach constructs a vocabulary network to mine the importance of words using the related-word set, which contains the "cooccurrence" relations of words. Therefore, the text features of documents in the same category have shorter distance, and feature vectors have longer distance among different categories. Moreover, we employ sparse-group DBN to reduce the dimensionality of feature vectors in terms of the group relations of words. Thus, sparse-group DBN can retain the word dependency in dimensionality reduction. In the experiments, we compare the approach with well-known methods to verify our work, and the results show the performance of DLVN.

In the current work, we verify the approach using Chinese corpora. We will use English text to prove the effectiveness of the approach in future work. Moreover, in the process of dimension reduction, we need to train the sparse-group DBN using a large amount of text documents to improve its performance.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work has been supported by Projects U1536116 and U1636208 funded by the National Natural Science Foundation of China (NSFC).

References

[1] A. Trabelsi and O. R. Zaïane, "Extraction and clustering of arguing expressions in contentious text," Data and Knowledge Engineering, vol. 100, pp. 226–239, 2015.
[2] K. Schouten and F. Frasincar, "Survey on aspect-level sentiment analysis," IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 3, pp. 813–830, 2016.
[3] M. Tsytsarau and T. Palpanas, "Survey on mining subjective data on the web," Data Mining and Knowledge Discovery, vol. 24, no. 3, pp. 478–514, 2012.


[4] S.-J. Lee and J.-Y. Jiang, "Multilabel text categorization based on fuzzy relevance clustering," IEEE Transactions on Fuzzy Systems, vol. 22, no. 6, pp. 1457–1471, 2014.
[5] P. Wang, B. Xu, J. Xu, G. Tian, C.-L. Liu, and H. Hao, "Semantic expansion using word embedding clustering and convolutional neural network for improving short text classification," Neurocomputing, vol. 174, pp. 806–814, 2016.
[6] W. Zhang, X. Tang, and T. Yoshida, "TESC: an approach to TExt classification using Semi-supervised Clustering," Knowledge-Based Systems, vol. 75, pp. 152–160, 2015.
[7] A. B. Al-Saleh and M. E. B. Menai, "Automatic Arabic text summarization: a survey," Artificial Intelligence Review, vol. 45, no. 2, pp. 203–234, 2016.
[8] F. Atefeh and W. Khreich, "A survey of techniques for event detection in Twitter," Computational Intelligence, vol. 31, no. 1, pp. 132–164, 2015.
[9] G. Stilo and P. Velardi, "Efficient temporal mining of micro-blog texts and its application to event discovery," Data Mining and Knowledge Discovery, vol. 30, no. 2, pp. 372–402, 2016.
[10] G. Huang, J. He, Y. Zhang et al., "Mining streams of short text for analysis of world-wide event evolutions," World Wide Web, vol. 18, no. 5, pp. 1201–1217, 2014.
[11] U. Erra, S. Senatore, F. Minnella, and G. Caggianese, "Approximate TF-IDF based on topic extraction from massive message stream using the GPU," Information Sciences, vol. 292, pp. 143–161, 2015.
[12] C. Qimin, G. Qiao, W. Yongliang, and W. Xianghua, "Text clustering using VSM with feature clusters," Neural Computing and Applications, vol. 26, no. 4, pp. 995–1003, 2015.
[13] J. Martinez-Gil, "An overview of textual semantic similarity measures based on web intelligence," Artificial Intelligence Review, vol. 42, no. 4, pp. 935–943, 2012.
[14] K. K. Bharti and P. K. Singh, "Hybrid dimension reduction by integrating feature selection with feature extraction method for text clustering," Expert Systems with Applications, vol. 42, no. 6, pp. 3105–3114, 2015.
[15] L. Yue, W. Zuo, T. Peng, Y. Wang, and X. Han, "A fuzzy document clustering approach based on domain-specified ontology," Data and Knowledge Engineering, vol. 100, pp. 148–166, 2015.
[16] T. Wei, Y. Lu, H. Chang, Q. Zhou, and X. Bao, "A semantic approach for text clustering using WordNet and lexical chains," Expert Systems with Applications, vol. 42, no. 4, pp. 2264–2275, 2015.
[17] L. Bing, S. Jiang, W. Lam, Y. Zhang, and S. Jameel, "Adaptive concept resolution for document representation and its applications in text mining," Knowledge-Based Systems, vol. 74, no. 1, pp. 1–13, 2015.
[18] R. Irfan, C. K. King, D. Grages et al., "A survey on text mining in social networks," Knowledge Engineering Review, vol. 30, no. 2, pp. 157–170, 2015.
[19] N. Indurkhya, "Emerging directions in predictive text mining," Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 5, no. 4, pp. 155–164, 2015.
[20] M. T. Mills and N. G. Bourbakis, "Graph-based methods for natural language processing and understanding - a survey and analysis," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 44, no. 1, pp. 59–71, 2014.

[21] H. Mousavi, D. Kerr, M. Iseli, and C. Zaniolo, "Mining semantic structures from syntactic structures in free text documents," in Proceedings of the 8th IEEE International Conference on Semantic Computing (ICSC '14), pp. 84–91, IEEE, Newport Beach, Calif, USA, June 2014.
[22] S. Jun, S.-S. Park, and D.-S. Jang, "Document clustering method using dimension reduction and support vector clustering to overcome sparseness," Expert Systems with Applications, vol. 41, no. 7, pp. 3204–3212, 2014.
[23] W. Z. Zhu and R. B. Allen, "Document clustering using the LSI subspace signature model," Journal of the American Society for Information Science and Technology, vol. 64, no. 4, pp. 844–860, 2013.
[24] X. Wu, X. Chen, X. Li, L. Zhou, and J. Lai, "Adaptive subspace learning: an iterative approach for document clustering," Neural Computing & Applications, vol. 25, no. 2, pp. 333–342, 2014.
[25] H. Kriegel and E. Ntoutsi, "Clustering high dimensional data," ACM SIGKDD Explorations Newsletter, vol. 15, no. 2, pp. 1–8, 2014.
[26] K. K. Bharti and P. K. Singh, "Opposition chaotic fitness mutation based adaptive inertia weight BPSO for feature selection in text clustering," Applied Soft Computing Journal, vol. 43, pp. 20–34, 2016.
[27] M. C. N. Barioni, H. Razente, A. M. R. Marcelino, A. J. M. Traina, and C. Traina, "Open issues for partitioning clustering methods: an overview," Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 4, no. 3, pp. 161–177, 2014.
[28] J.-P. Mei and L. Chen, "Proximity-based k-partitions clustering with ranking for document categorization and analysis," Expert Systems with Applications, vol. 41, no. 16, pp. 7095–7105, 2014.
[29] V. Tunali, T. Bilgin, and A. Camurcu, "An improved clustering algorithm for text mining: multi-cluster spherical K-means," International Arab Journal of Information Technology, vol. 13, no. 1, pp. 12–19, 2016.
[30] Y. Li, C. Luo, and S. M. Chung, "A parallel text document clustering algorithm based on neighbors," Cluster Computing, vol. 18, no. 2, pp. 933–948, 2015.
[31] F. Murtagh and P. Contreras, "Algorithms for hierarchical clustering: an overview," Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 2, no. 1, pp. 86–97, 2012.
[32] T. Peng and L. Liu, "A novel incremental conceptual hierarchical text clustering method using CFu-tree," Applied Soft Computing, vol. 27, pp. 269–278, 2015.
[33] Q. Chen, J. F. Lu, and H. Zhang, "A text mining model based on improved density clustering algorithm," in Proceedings of the 4th IEEE International Conference on Electronics Information and Emergency Communication (ICEIEC '13), Beijing, China, November 2013.
[34] S. F. Hussain, M. Mushtaq, and Z. Halim, "Multi-view document clustering via ensemble method," Journal of Intelligent Information Systems, vol. 43, no. 1, pp. 81–99, 2014.
[35] A. Wahid, X. Gao, and P. Andreae, "Multi-view clustering of web documents using multi-objective genetic algorithm," in Proceedings of the IEEE Congress on Evolutionary Computation (CEC '14), pp. 2625–2632, Beijing, China, July 2014.
[36] X. Pei, T. Wu, and C. Chen, "Automated graph regularized projective nonnegative matrix factorization for document clustering," IEEE Transactions on Cybernetics, vol. 44, no. 10, pp. 1821–1831, 2014.
[37] M. Lu, X.-J. Zhao, L. Zhang, and F.-Z. Li, "Semi-supervised concept factorization for document clustering," Information Sciences, vol. 331, pp. 86–98, 2016.
[38] C.-K. Yau, A. Porter, N. Newman, and A. Suominen, "Clustering scientific documents with topic modeling," Scientometrics, vol. 100, no. 3, pp. 767–786, 2014.


[39] Y. Ma, Y. Wang, and B. Jin, "A three-phase approach to document clustering based on topic significance degree," Expert Systems with Applications, vol. 41, no. 18, pp. 8203–8210, 2014.
[40] C. C. Aggarwal, Y. Zhao, and P. S. Yu, "On the use of side information for mining text data," IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 6, pp. 1415–1429, 2014.
[41] R. M. Marcacini, M. A. Domingues, E. R. Hruschka, and S. O. Rezende, "Privileged information for hierarchical document clustering: a metric learning approach," in Proceedings of the 22nd International Conference on Pattern Recognition (ICPR '14), pp. 3636–3641, August 2014.
[42] L. Cagnina, M. Errecalde, D. Ingaramo, and P. Rosso, "An efficient particle swarm optimization approach to cluster short texts," Information Sciences, vol. 265, pp. 36–49, 2014.
[43] W. Song, Y. Qiao, S. C. Park, and X. Qian, "A hybrid evolutionary computation approach with its application for optimizing text document clustering," Expert Systems with Applications, vol. 42, no. 5, pp. 2517–2524, 2015.
[44] R. Forsati, A. Keikha, and M. Shamsfard, "An improved bee colony optimization algorithm with an application to document clustering," Neurocomputing, vol. 159, no. 1, pp. 9–26, 2015.
[45] K. K. Bharti and P. K. Singh, "Chaotic gradient artificial bee colony for text clustering," Soft Computing, vol. 20, no. 3, pp. 1113–1126, 2016.
[46] F. Wang and J. Sun, "Survey on distance metric learning and dimensionality reduction in data mining," Data Mining and Knowledge Discovery, vol. 29, no. 2, pp. 534–564, 2014.
[47] Y.-S. Lin, J.-Y. Jiang, and S.-J. Lee, "A similarity measure for text classification and clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 7, pp. 1575–1590, 2014.
[48] M. T. Hassan, A. Karim, J.-B. Kim, and M. Jeon, "CDIM: document clustering by discrimination information maximization," Information Sciences, vol. 316, pp. 87–106, 2015.
[49] M. T. Hassan and A. Karim, "Clustering and understanding documents via discrimination information maximization," in Proceedings of the Pacific-Asia Conference on Advances in Knowledge Discovery & Data Mining (PAKDD '12), Kuala Lumpur, Malaysia, May 2012.
[50] D. Cai and C. J. van Rijsbergen, "Learning semantic relatedness from term discrimination information," Expert Systems with Applications, vol. 36, no. 2, pp. 1860–1875, 2009.
[51] G. Grahne and J. Zhu, "High performance mining of maximal frequent itemsets," in Proceedings of the SIAM Workshop on High Performance Data Mining: Pervasive and Data Stream Mining (HPDM:PDS '03), San Francisco, Calif, USA, May 2003.
[52] H. Luo, R. Shen, and C. Niu, "Sparse group restricted Boltzmann machines," in Proceedings of the 25th AAAI Conference on Artificial Intelligence (AAAI '11), San Francisco, Calif, USA, August 2011.
[53] S. Padó and M. Lapata, "Dependency-based construction of semantic space models," Computational Linguistics, vol. 33, no. 2, pp. 161–199, 2007.

Submit your manuscripts athttpswwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 4: A Novel Text Clustering Approach Using Deep-Learning ...downloads.hindawi.com/journals/mpe/2017/8310934.pdf · A Novel Text Clustering Approach Using Deep-Learning Vocabulary Network

4 Mathematical Problems in Engineering

Output layer

Hidden layer k

Hidden layer 2

Hidden layer 1

Visible layer

middot middot middot

middot middot middot

middot middot middotmiddot middot middotmiddot middot middotmiddot middot middotmiddot middot middot

middot middot middot

middot middot middot

middot middot middot

y1 y2 y3 yn

hk1 hk

2 hkn푘

h21 h2

2 h23 h2

n2

h11 h1

2 h13 h1

4 h1n1

b1 b2 b3 b4 bn1

a1 a2 a3 a4 a5 am

1 2 3 4 5 m

Wh1

Figure 1 The structure of DBN

PageRankmodels web surfing as a stochastic process andthe theory of Markov chain can be applied However theweb graph does notmeet the conditions of stochastic processwhich requires 119860 to be stochastic irreducible and aperiodicAfter the adjustment of 119860 to fix this problem we obtain animproved model with

119875 = ((1 minus 119889) 119864119899 + 119889119860119879)119875 (5)

where 119864 is 119890119890119879 (119890 is a column vector of all 1rsquos) and thus 119864 is an119899 times 119899 matrix with all 1rsquos and 119889 is a parameter called dampingfactor After scaling we obtain

119875 = (1 minus 119889) 119890 + 119889119860119879119875 (6)

Equation (6) is also transformed as follows

119875 (119894) = (1 minus 119889) + 119889 sum(119894119895isin119864)

119875 (119895)119874119895 (7)

The computation of PageRank score is a process of iterationGiven an initial value of 119875 the iteration ends when the scoreof PageRank does not change or the change is less than athreshold

33 Deep Belief Network (DBN) DBN is a model of deepleaning and composed of multilayer restricted Boltzmannmachines (RBMs) DBN contains the input layer (visiblelayer) the hidden layers and the output layer There areconnections between a layer and adjacent layer but no

connections among units in each layerThe structure of DBNis shown in Figure 1

As shown in Figure 1 an RBM consists of two adjacentlayers The training of DBN includes two steps pretrainingand fine-tuning RBM contains a visible layer V(119896) and ahidden layer ℎ(119896) The parameters of RBM are (119882(119896) 119886(119896)119887(119896)) (119882(119896)) are the weights of connections between thevisible layer and the hidden layer and (119886(119896) 119887(119896)) are the biasvectors of the visible units and the hidden units Giving aninitial value to119882(119896) the parameters are updated with

119882(119896)119894119895 = 119882(119896)119894119895 + 120578nabla119882(119896)119894119895 (8)

where 120578 is learning rate and (119886(119896) 119887(119896)) are similar to119882(119896)The gradient of nabla119882(119896) is obtained by Gibbs Sampling

nabla119882(119896)119894119895 = 119864 [V(119896)119894 ℎ(119896)119895 ]data

minus 119864 [V(119896)119894 ℎ(119896)119895 ]Gibbs

(9)

where 119864[sdot]data and 119864[sdot]Gibbs are the expectations of datasamples and samples fromGibbs Sampling and (nabla119886(119896) nabla119887(119896))are similar to nabla119882(119896)119894119895

DBN is fine-tuned with a set of labeled inputs in termsof error back propagation after the pretraining of DBN Theparameters are updated by

119882(119896)119894119895 = 119882(119896)119894119895 + 120578nabla119882(119896)119894119895 (10)

where nabla119882(119896)119894119895 = ℎ(119896minus1)119894 120575(119896)119895 and 120575(119896)119895 is an error vector

Mathematical Problems in Engineering 5

TongYiCi CiLin

RS

Vocabularynetwork

FeatureVectors

Sparse-groupDBN

DL-SPPageRank

Feature extraction Feature selection Clustering

Figure 2 The procedure of DLVN

LMajor classes

Middle classes

Small classes

Word groups

Atomic word groups

A

a

01

A

01

B

f n

02 1103

KB C D

2202 03 04 05 06 07 Af03A06

Af03A06

Af03A06

Af03A06

Af03A06

Codemiddot middot middot

middot middot middot middot middot middot

middot middot middot

middot middot middot

middot middot middot

Figure 3 The structure of TongYiCi CiLin

4 Deep-Learning Vocabulary Network

In this section we propose an approach called deep-learningvocabulary network (DLVN) for text clusteringThe first stepof DLVN is the construction of vocabulary network Thecooccurrence of words or terms is useful information for textclustering We use the nodes of the vocabulary network torepresent words or terms and the edges of the vocabularynetwork to represent the relations betweenwords or terms Inour work there are two methods to obtain the cooccurrencerelations of words related-word set and TongYiCi CiLinFrequent itemsets are used to discover the relations of itemsin database We create related-word set by frequent itemsetsand each itemset of related-word set is a set of words withcooccurrence relation PageRank is employed to obtain theldquoimportancerdquo of nodes (feature vectors) instead of the termfrequency in VSM Then an improved DBN (called sparse-group DBN) is proposed for dimensionality reduction Inthe process of clustering algorithm we present DL-SP forclustering in which coverage rate is used for similaritymeasure The procedure of DLVN is shown in Figure 2

41 Related-Word Set The relations of words or terms areimportant information in text documents Usually naturallanguage has the fixed collocation and corresponding con-texts which means some words or terms have a high prob-ability of occurrence in a text document Thus the relationsbetweenwords are important to represent themeaning of textdocuments In our paper we use frequent itemsets to obtaincooccurrence relations between words or terms

Definition 1 (related-word set) Let 119863 = word1word2 word119899 be the words of text documents from the same topicand sup[sdot] be the support of itemsets Given a minimumsupport supms 119883 = word119894word119895 word119896 is defined asan itemset of related-word set where sup[119883] gt supms

FPMAX is a depth-first and recursive algorithm formining MFIs and it is based on FP-tree to store frequentitemsets When a database has a large scale all itemsetsof MFI-tree are detected in subset checking of FPMAXwhich has a big influence on the efficiency of FPMAX Forimproving the efficiency of FPMAX we use TongYiCi CiLinand string match to compress the FP-tree

TongYiCi CiLin is a Chinese semantic dictionary ofsynonyms and related words which organizes all words asa five-layer hierarchical tree It contains 77343 words whichare divided into 12 major classes 94 middle classes and 1438small classes The fourth layer and the fifth layer are furtherdivided into word groups and atomic word groups We useFigure 3 to illustrate the structure of TongYiCi CiLin

TongYiCi CiLin maps an atomic word group into a codethe first layer and the fourth layer are capital letters thesecond layer is a lowercase letter and the third layer andthe fifth layer are integers For example code ldquoAa01A02rdquostands for the atomic word group man mankind humanWe replace the words or terms with the code of word groupsin MFIrsquos mining which contains 4223 nodes We randomlyselect 10 documents from the same topic and the frequentitems (words) are listed in Table 1 As some words belong tothe same word group the number of words is compressedlargely

The structures of FP-trees that are created based onwordsand word groups are shown in Figure 4 Figure 4(a) is FP-treeof words and FP-tree of word groups is shown in Figure 4(b)The nodes of FP-tree based on the word groups are fewer thanthe nodes of FP-tree based on the words

The MFIs have redundant items in Figure 4(b) Forexample the MFIs of Figure 4(b) are listed in Table 2

MFIs include two categories of word groups in Table 2The word groups a(Bo21A01) b(Bo01A06) d(Dd14B36)and a(Bo21A01) c(Bo21A27) are closely related and theword groups k(Da21D01) i(Dm04A01) f(Cb08A01) and

6 Mathematical Problems in Engineering

Root

a3 e1 b1 p3k1 q1

j1 c1 b1 f1 h1 l1 d1 n1 q1 s1

h1 c1 g1 n1 o1 r1 t1

i1 d1 o1 m1

abcd

ef g

h

ijkl

m

no

p

q

rst

Item Head of node-linksHeader table

(a)

Root

a4 f1 i4

d1

b3c2 l1 k4

j2e1 c1 f2

e1d1 j1

a

bc

de

f

i

j

k

l

Item Head of node-linksHeader table

(b)

Figure 4 The structures of FP-trees

Table 1 The comparison of words and word groups

Words Word groupsa(vehicle)b(car)c(engine)d(quality) a(Bo21A01)b(Bo01A06)e(Dd12A01)e(automobile)f(truck) a(Bo21A01)c(Bo21A27)b(car)g(engine)h(power) a(Bo21A01)b(Bo01A06)d(Dd14B36)a(vehicle)i(lorry)c(engine)h(power) a(Bo21A01)b(Bo01A06)c(Bo21A27)d(Dd14B36)a(vehicle)j(jeep) a(Bo21A01)c(Bo21A27)k(ground)l(situation) f(Cb08A01)l(Da21A07)m(area)n(store)o(construction)p(environment) f(Cb08A01)i(Dm04A01)j(Bn01A01)k(Da21D01)q(shop)r(building)p(environment) i(Dm04A01)k(Da21D01)j(Bn01A01)s(location)q(shop)t(condition) f(Cb08A01)i(Dm04A01)k(Da21D01)d(quality)n(store)o(construction)p(environment) e(Dd12A01)i(Dm04A01)j(Bn01A01)k(Da21D01)

Table 2 MFIs of FP-tree based on word groups

MFIs1 a(Bo21A01)b(Bo01A06)d(Dd14B36)2 a(Bo21A01)c(Bo21A27)3 k(Da21D01)i(Dm04A01)f(Cb08A01)4 k(Da21D01)i(Dm04A01)j(Bn01A01)

k(Da21D01) i(Dm04A01) j(Bn01A01) are closely related Infact the aim of related-word set is tomine the ldquocooccurrencerdquoof words and we assume that the relations of words havetransitivity Therefore we utilize string matching and thesame items to combine MFIs

Definition 2 (combination of MFIs) Let MFIS = MFI1MFI2 MFI119898 be the MFIrsquos set obtained from text doc-uments and cov(sdot) be the number of the same items in twoMFIs Suppose that cov(MFI1MFI2) gt covmin where covminis minimum number of the same items MFI1 and MFI2 areremoved fromMFIS and the combination of MFI1 cupMFI2 isadd to MFIS

MFIs are inserted into MFI-tree in terms of covmin Forexample given MFI1 = 119886 119887 119888 119889 119890 119891 MFI2 = 119886 119887 119888 119889119890 ℎ MFI3 = 119890 119891 119892 ℎ 119894 119895 119896 and covmin = 07 the combi-nation of MFIs is MFI1 cup MFI2 = 119886 119887 119888 119889 119890 119891 ℎ The new

MFI-tree only has two paths (MFI1 cup MFI2) = 119886 119887 119888 119889 119890119891 ℎ and MFI3 = 119890 119891 119892 ℎ 119894 119895 119896 The scale of MFI-treeis simplified and we integrate FPMAX with combination ofMFIs to propose an algorithm named FPMAX with related-word set (FPMAX-RS) The step of FPMAX-RS is listed inAlgorithm 2

42 The Construction of Vocabulary Network In this sectionvocabulary network is constructed to represent text docu-ments and the vocabulary network contains the relationsbetween words or terms We employ the ldquoimportancerdquo ofnodes instead of term frequency in VSM

421 The Selection of Vocabulary Network Nodes The wordgroups in TongYiCi CiLin are used as nodes instead of wordsin vocabulary network The number of word groups is muchfewer than the number of words In addition we choosethe word groups whose frequency is higher than specifiedminimal frequency 119891min

422 The Construction of Edges in the Vocabulary NetworkEdges of complex network are the important carrier ofinformation and the edges of the vocabulary network areused in calculating the ldquoimportancerdquo of nodes Consideringthe semantic and related information among words of termsan edge is add to the vocabulary network in terms ofthe similarity of nodes Therefore we add an edge to thevocabulary network if word groups have a closer position

Mathematical Problems in Engineering 7

Procedure FPMAX-RS(T)Input T (an FP-tree) covminGlobal

MFIT an MFI-treeHead a linked list of items

Output The MFIT that contains all MFIrsquosMethod(1) if 119879 only contains a single path P(2) if cov(Head cup 119875MFI) gt covmin(3) combine MFI-tree to this path(4) else(5) insert Head cup 119875 into MFIT(6) else for each 119894 in Header-table of T(7) append 119894 to Head(8) construct the Head-pattern base(9) Tail = frequent items in base(10) subset_checking (Head cup Tail)(11) if Head cup Tail is not in MFI-tree(12) construct the FP-tree 119879Head(13) call FPMAX-RS(119879Head)(14) remove 119894 from Head

Algorithm 2 FPMAX-RS

in TongYiCi CiLin The semantic similarity of word groupssim(119894 119895) is defined as

sim (119894 119895) = depth (119894 119895)119897 times TN minus Dis (119894 119895) + 1

TN (11)

where depth(119894 119895) is the depth of the first common father node119897 is the depth of 119894 and 119895 TN is the total number of wordgroups and Dis(119894 119895) denotes the distance between 119894 and 119895For example there are two words 119888119886119903 119908ℎ119890119890119897 and the wordgroup codes of 119888119886119903 119908ℎ119890119890119897 are Bo21A Bo25119861 Because twonodes are in fourth layer the first common father node is119861119900 which is in the second layer In addition the fourth layercontains 4223 word groups and Dis(119894 119895) of Bo21A Bo25119861 is14 Therefore sim(Bo21A Bo25B) is calculated as follows

sim(11986111990021119860 11986111990025119861) = 24 times 4223 minus 14 + 14223 (12)

The nodes in the vocabulary network are traversed and anedge between 119894 and 119895 is added when sim(119894 119895) gt simmin (thespecified threshold)

In addition we add an edge between two nodes if anMFI in related-word set includes the words and each MFIin related-word set is a word set with cooccurrence relationsIn fact the meaning of words in an MFI is not similarand an MFI includes a group of words cooccurring in thesame topic documents When a text document has the wordsin an MFI the text document has a high probability ofbelonging to certain topic Therefore we add an edge intothe vocabulary network with low-frequency word pointing tohigh-frequency word

423 The Extraction of Feature Vectors In the vocabularynetwork the number and the direction of edges reflect the

importance of nodes which is similar to evaluating theimportance of webpagesThus PageRank is utilized to obtainthe importance of nodes and the initial value PR119894 of nodes isdefined by

PR119894 = 119891119894sum119873119895=1 119891119895 (13)

where 119891119894 is the frequency of word groups After iterativecomputation and normalization of PR119894 we use the PageRankscores of nodes as the feature vectors of text documentsinstead of term frequency in this paper

43 Deep-Learning Single-Pass (DL-SP) In this paper sparse-group DBN is proposed for dimensionality reduction offeature vectors DBN is a model of deep learning Luo et al[52] found that the units of hidden layers exhibited statisticaldependencies and proposed a regularization constant torestrict the relations in hidden layers Due to the sparsity offeature vectors we combine the word dependencies andDBNto propose a sparse-groupDBN for dimensionality reductionIn addition coverage rate (CoR) is proposed for similaritymeasure among feature vectors in DL-SP

431 Sparse-Group DBN Deep learning simulates the pro-cess of human thinking and the result of deep learning is thedistributed representation of an input vector By analyzingfeature vectors extracted from the vocabulary network wefind that there exists statistical dependency between entries offeature vectors whichmeans the entries of feature vectorswillcooccur in the part of feature vectors The word dependencyis also mentioned by many researchers in previous literatures[5 18 53] Cooccurrence relations are typically collectedin feature vectors which means a unique word commonlyreferring to ldquotarget wordrdquo and the word dependency isquantified to measure words similarity in text clustering Weprovide an example which is the part of a feature vector inTable 3

Because the documents in the same topic usually includerelated words a part of units in visible layer is activesimultaneously and accordingly the documents in differenttopics usually activate different part of units Based onthis observation we add a regularization constant to thelog-likelihood of training data to retain these relations Inexperiments we use different topic documents to train thesparse-group DBN The sequence of units in output layer isadjusted accordingly and the cooccurring units are dividedinto one group In other words the feature vectors of differenttopic documents can activate different group of units inoutput layer The structure of sparse-group DBN is shown inFigure 5

Sparse-group DBN is comprised of several RBMs andtwo adjacent layers are an RBMs For retaining the depen-dency of the units in output layer we define the activationprobability of each group Given a group 119878 = 1199101 1199102 119910119904and training sample V(119896) the group probability 119875119878(sdot) is givenby

119875119878 (V(119896)) = radicsum119904isin119878

119875 (119910119904 = 1 | V(119896))2 (14)

8 Mathematical Problems in Engineering

Table 3: The word dependencies of a feature vector.

Hj47A01  Hg19C01  Ba03A18  Ba08A07  Dm05A01  Hg01A01  Ae13B01
0.32     0.17     0.12     0.04     0        0.02     0
0.17     0.21     0.07     0.11     0        0        0
0.14     0.23     0.17     0.09     0        0        0
0.23     0.12     0.06     0.14     0        0.01     0
0        0        0        0        0.12     0.34     0.20
0        0        0        0        0.24     0.21     0.10
0        0        0        0        0.13     0.14     0.09

[Figure 5: The structure of sparse-group DBN, showing a visible layer (units 1..i with biases a_1..a_i), a hidden layer (units h_1..h_j with biases b_1..b_j and weights W_h), and an output layer whose units y_1..y_s are divided into groups 1..K.]

The output layer of the sparse-group DBN is divided into K groups, and the probability of the output layer, P_ol(·), is defined by

$$P_{\mathrm{ol}}(v^{(k)}) = \sum_{k=1}^{K} \sqrt{\sum_{s \in S_k} P(y_s = 1 \mid v^{(k)})^2} \quad (15)$$

We add a regularization constant λ and P_ol(v^(k)) to the optimization function, which is the maximum likelihood estimate of the energy function of an RBM. The optimization function is defined by

$$\max_{W,b,c} \; \sum \log P(v^{(k)}) - \lambda \sum_{k=1}^{K} \sqrt{\sum_{s \in S_k} P(y_s = 1 \mid v^{(k)})^2} \quad (16)$$

The standard RBM gradient update is modified accordingly, and ∇W_ij^(k) is defined by

$$\nabla W_{ij}^{(k)} = E\left[v_i^{(k)} h_j^{(k)}\right]_{\mathrm{data}} - E\left[v_i^{(k)} h_j^{(k)}\right]_{\mathrm{Gibbs}} - \lambda \cdot \alpha \quad (17)$$

where
$$\alpha = \frac{\partial}{\partial W_{ij}^{(k)}} P_S(v^{(k)}) = \frac{P(y_s = 1 \mid v^{(k)})}{P_S(v^{(k)})} \cdot \frac{\partial}{\partial W_{ij}^{(k)}} P(y_s = 1 \mid v^{(k)}) = \frac{P(y_s = 1 \mid v^{(k)})^2 \cdot P(y_s = 0 \mid v^{(k)}) \cdot v^{(k)}}{P_S(v^{(k)})}.$$
Accordingly, the gradients ∇a^(k) and ∇b^(k) are defined by

$$\nabla a_i^{(k)} = E\left[v_i^{(k)}\right]_{\mathrm{data}} - E\left[v_i^{(k)}\right]_{\mathrm{Gibbs}} - \lambda \cdot \alpha, \qquad \nabla b_j^{(k)} = E\left[h_j^{(k)}\right]_{\mathrm{data}} - E\left[h_j^{(k)}\right]_{\mathrm{Gibbs}} - \lambda \cdot \alpha \quad (18)$$
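To illustrate how the group penalty enters a contrastive-divergence-style update, here is a rough sketch of one training step for a single RBM with grouped hidden units, written from the equations above; the one-step Gibbs sampling, learning rate, group assignments, and the choice to apply the penalty only to W and the hidden bias are our own simplifying assumptions, not the paper's code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sparse_group_rbm_step(W, a, b, v, groups, lam=0.01, lr=0.05, rng=np.random):
    """One contrastive-divergence step with a group-sparsity penalty (sketch).
    W: (n_visible, n_hidden), a: visible bias, b: hidden bias,
    v: one binary input vector, groups: list of index arrays over hidden units."""
    # Positive phase.
    p_h = sigmoid(v @ W + b)
    h = (rng.random(p_h.shape) < p_h).astype(float)
    # One step of Gibbs sampling (negative phase).
    p_v = sigmoid(h @ W.T + a)
    v_neg = (rng.random(p_v.shape) < p_v).astype(float)
    p_h_neg = sigmoid(v_neg @ W + b)
    # Standard CD gradients: E[v h]_data - E[v h]_Gibbs.
    dW = np.outer(v, p_h) - np.outer(v_neg, p_h_neg)
    da = v - v_neg
    db = p_h - p_h_neg
    # Group penalty: alpha follows (17), with P_S(v) = sqrt(sum_s P(y_s=1|v)^2).
    alpha = np.zeros_like(p_h)
    for idx in groups:
        P_S = np.sqrt(np.sum(p_h[idx] ** 2)) + 1e-12
        alpha[idx] = (p_h[idx] ** 2) * (1.0 - p_h[idx]) / P_S
    dW -= lam * np.outer(v, alpha)
    db -= lam * alpha
    # Gradient ascent on the penalized log-likelihood.
    W += lr * dW
    a += lr * da
    b += lr * db
    return W, a, b

# Toy usage with assumed sizes: 6 visible units, 4 hidden units in 2 groups.
rng = np.random.default_rng(0)
W = 0.01 * rng.standard_normal((6, 4))
a, b = np.zeros(6), np.zeros(4)
groups = [np.array([0, 1]), np.array([2, 3])]
v = np.array([1., 0., 1., 1., 0., 0.])
W, a, b = sparse_group_rbm_step(W, a, b, v, groups, rng=rng)
```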

4.3.2. Similarity Measure of DL-SP. Single-Pass is a partitional clustering algorithm. In Single-Pass, the first document is treated as the first cluster, and the similarity between each new document and the existing clusters is computed to decide, according to a specified threshold, whether the new document joins an existing cluster or creates a new one. The output of the sparse-group DBN is binary, and Euclidean distance and cosine distance are not suitable for similarity measurement in DL-SP. Therefore, we use coverage rate (CoR) as the similarity measure, and CoR is defined by

$$\mathrm{CoR}(C, d) = \frac{|C \cap d|}{|C|} \quad (19)$$

where C = (c_1, c_2, ..., c_n) is the feature vector of a cluster (named the topic feature vector) and d = (d_1, d_2, ..., d_n) is the feature vector of the new document.
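A small sketch of Single-Pass clustering driven by CoR over binary vectors is given below; the threshold value and the simplification of updating a cluster's topic feature vector by an element-wise OR are assumptions (the paper instead maintains the topic feature vector through the optional topic feature vector described next).

```python
import numpy as np

def cor(C, d):
    """Coverage rate (19): |C ∩ d| / |C| for binary vectors."""
    size_C = C.sum()
    return 0.0 if size_C == 0 else np.logical_and(C, d).sum() / size_C

def single_pass_cor(docs, threshold=0.5):
    """docs: iterable of binary numpy vectors (sparse-group DBN outputs)."""
    clusters = []   # each cluster: {"vector": binary topic feature vector, "members": [...]}
    for i, d in enumerate(docs):
        if clusters:
            sims = [cor(c["vector"], d) for c in clusters]
            best = int(np.argmax(sims))
        if not clusters or sims[best] < threshold:
            clusters.append({"vector": d.copy(), "members": [i]})
        else:
            clusters[best]["members"].append(i)
            # Simplification: merge the document into the topic vector by OR.
            clusters[best]["vector"] = np.logical_or(clusters[best]["vector"], d).astype(int)
    return clusters

docs = [np.array([1, 1, 0, 0, 1]), np.array([1, 1, 0, 1, 1]), np.array([0, 0, 1, 1, 0])]
print([c["members"] for c in single_pass_cor(docs, threshold=0.5)])
```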

Moreover, adding many text documents to clusters has an influence on the topic feature vector. In our work, we introduce an optional topic feature vector C' = (c'_1, c'_2, ..., c'_n) and the weight of the feature vector to solve this problem. We provide an example of an optional topic feature vector in Figure 6.

When the weight of the optional topic feature vector is greater than a specified threshold in a given time interval, we replace the topic feature vector with the optional topic feature vector as the new cluster center. The weight of the topic feature vector is defined by

$$w_{C'} = \frac{\sum_{C} f(c_i)}{\sum_{C} f(c_i) + \sum_{C'} f(c'_j)} - \lambda e^{k(t - t_0)} \quad (20)$$

where λ e^{k(t - t_0)} is the time damping function and f(c_i) is the frequency function.
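As a sketch of this replacement rule, the snippet below evaluates the weight in (20) for an optional topic feature vector and checks it against a threshold; the frequency values, the damping parameters k and t_0, λ, and the threshold are all made-up illustrative inputs.

```python
import math

def optional_topic_weight(freq_C, freq_C_opt, t, t0=0.0, lam=0.1, k=0.05):
    """Weight from (20): sum_C f(c_i) / (sum_C f(c_i) + sum_C' f(c'_j)) - lam * e^{k (t - t0)}.
    freq_C / freq_C_opt: frequencies f(.) of the entries of the topic and optional topic vectors."""
    s_c = sum(freq_C)
    s_opt = sum(freq_C_opt)
    return s_c / (s_c + s_opt) - lam * math.exp(k * (t - t0))

# Illustrative use: swap in the optional topic feature vector when the weight crosses a threshold.
w = optional_topic_weight(freq_C=[5, 3, 2], freq_C_opt=[7, 6, 4], t=3.0)
print(w > 0.4)   # replace the topic feature vector with C' when True (0.4 is an assumed threshold)
```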

5. Experimental Analysis

In this section, we conduct three sets of experiments to validate the effectiveness of the proposed approach, including the efficiency of FPMAX-RS in related-word set mining, the comparison of feature vectors, and the comparison of DL-SP efficiency. In this work, three Chinese text corpora, TanCorpV1.0, Encyclopedia of China, and Sogou Corpus, are used as the experimental datasets.

[Figure 6: An example of an optional topic feature vector: (a) the binary text feature d_i, (b) the topic feature c_i, (c) their similarity (Sim), and (d) the optional topic feature (c_i, c'_i).]

5.1. The Efficiency of FPMAX-RS in Related-Word Set Mining. This section compares the running time of FPMAX and FPMAX-RS in related-word set mining. We choose seven categories (museum, property, education, military, car, sport, and health) of text documents from the datasets, and each category has 50 articles. The result of the experiment is shown in Figure 7.

FPMAX generates a larger number of maximal frequent itemsets and traverses all MFI-trees for subset checking, which increases the running time of FPMAX. Compared with FPMAX, FPMAX-RS has higher efficiency when sup_min is smaller.

5.2. The Comparison of Feature Vectors. In this work, we compare the distances among feature vectors based on tf-idf, FC-VSM [12], and DLVN. We randomly choose two documents from the category museum and one document from each of the other categories, including property, education, and military. The aim of feature extraction is to extract feature vectors that can represent the meaning of text documents; in other words, feature vectors in different categories should have longer distances. Therefore, we compute the Euclidean distance of feature vectors in different categories based on tf-idf, FC-VSM, and DLVN. Table 4 shows the results for different categories of text documents.
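As a trivial illustration of the distance computation behind Table 4 (with made-up vectors standing in for the tf-idf, FC-VSM, or DLVN representations):

```python
import numpy as np

# Hypothetical feature vectors for two documents under one representation.
museum1 = np.array([0.12, 0.40, 0.03, 0.22])
property1 = np.array([0.55, 0.05, 0.31, 0.02])
print(np.linalg.norm(museum1 - property1))   # Euclidean distance as reported in Table 4
```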

Table 4: Euclidean distance comparison (columns give the distance under tf-idf, FC-VSM, and DLVN).

Category     Documents              tf-idf   FC-VSM   DLVN
museum       museum1 - museum2      1302     1049     917
property     property - museum1     1285     1347     1359
property     property - museum2     1593     1586     1687
education    education - museum1    1468     1461     1472
education    education - museum2    1139     1133     1207
military     military - museum1     1556     1649     1658
military     military - museum2     1369     1403     1841

[Figure 7: The comparison of running time (ms) of FPMAX and FPMAX-RS for sup_min ranging from 0.01 to 0.06.]

In the following experiment, feature vectors are extracted based on tf-idf, FC-VSM, and DLVN, and then k-means is applied for clustering. We evaluate clustering performance with F_measure. Let D = {d_1, d_2, ..., d_n} be the clustering result and D* = {d*_1, d*_2, ..., d*_n} be the standard dataset. F_measure is defined by

$$\mathrm{F\_measure} = \frac{2 \times P(D, D^*) \times R(D, D^*)}{P(D, D^*) + R(D, D^*)} \quad (21)$$

where P(D, D*) = |D ∩ D*| / |D| is precision and R(D, D*) = |D ∩ D*| / |D*| is recall.
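For clarity, here is a short sketch of the evaluation in (21), treating a clustering result and the reference category as sets of document indices; the toy inputs are invented.

```python
def f_measure(result, standard):
    """Set-based precision/recall/F_measure as in (21)."""
    result, standard = set(result), set(standard)
    overlap = len(result & standard)
    precision = overlap / len(result)
    recall = overlap / len(standard)
    return 2 * precision * recall / (precision + recall) if overlap else 0.0

# Toy example: a cluster of documents vs. the reference category.
print(f_measure(result={1, 2, 3, 5, 8}, standard={1, 2, 3, 4, 6, 8}))
```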

Because seven categories of text documents are chosen in our experiment, the specified number of clusters k is 7. Figure 8 illustrates that feature vectors based on DLVN have better performance.

[Figure 8: The comparison of F_measure for tf-idf, FC-VSM, and DLVN over the categories sport, military, museum, health, property, education, and car.]

5.3. The Comparison of DL-SP Efficiency. In this experiment, we choose text documents from the datasets, and the number of documents in each category is listed in Table 5.

Table 5: The datasets of the experiment.

Category     Number of text documents
sport        1300
military     1500
health       1400
property     900
education    800
car          500

The aim of the experiment is to compare DL-SP with LSI and Single-Pass. The sparse-group DBN has 3 layers, and the numbers of units in each layer are 4223, 3500, and 3000. In addition, the group number K of the top layer is 200. The structure of the sparse-group DBN is shown in Figure 9.

[Figure 9: The structure of the sparse-group DBN used in the experiments: a visible layer with 4223 units, a hidden layer with 3500 units, and an output layer with 3000 units divided into K = 200 groups.]

[Figure 10: The comparison of F_measure for DL-SP, LSI, and Single-Pass over the categories sport, military, health, property, education, and car.]

The experimental result is shown in Figure 10. DL-SP has better performance than LSI and Single-Pass in sport, military, property, education, and health. However, the F_measure of DL-SP is lower than that of LSI and Single-Pass in the category car, because the smaller number of documents does not train the sparse-group DBN effectively.

In this subsection, we also compare the running time of DL-SP and Single-Pass, and the result is listed in Table 6.

Table 6: The running time of DL-SP and Single-Pass.

Method        Dimensionality of feature vectors    Running time (s)
Single-Pass   4223                                 3866
DL-SP         3000                                 1084

6. Conclusions

In this paper, we propose an approach named DLVN for text clustering. The existing term frequency-based methods only count the number of words, and the relations of words are not considered in feature extraction. Our approach constructs a vocabulary network to mine the importance of words using the related-word set, which contains the "cooccurrence" relations of words. Therefore, the text features of documents in the same category have shorter distances, and feature vectors have longer distances among different categories. Moreover, we employ the sparse-group DBN to reduce the dimensionality of feature vectors in terms of the group relations of words. Thus, the sparse-group DBN can retain the word dependency during dimensionality reduction. In the experiments, we compare the approach with well-known methods to verify our work, and the results show the performance of DLVN.

In the current work, we verify the approach using Chinese corpora. We will use English text to prove the effectiveness of the approach in future work. Moreover, in the process of dimension reduction, we need to train the sparse-group DBN using a large number of text documents to improve its performance.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work has been supported by Projects U1536116 and U1636208 funded by the National Natural Science Foundation of China (NSFC).

References

[1] A. Trabelsi and O. R. Zaïane, "Extraction and clustering of arguing expressions in contentious text," Data and Knowledge Engineering, vol. 100, pp. 226-239, 2015.
[2] K. Schouten and F. Frasincar, "Survey on aspect-level sentiment analysis," IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 3, pp. 813-830, 2016.
[3] M. Tsytsarau and T. Palpanas, "Survey on mining subjective data on the web," Data Mining and Knowledge Discovery, vol. 24, no. 3, pp. 478-514, 2012.
[4] S.-J. Lee and J.-Y. Jiang, "Multilabel text categorization based on fuzzy relevance clustering," IEEE Transactions on Fuzzy Systems, vol. 22, no. 6, pp. 1457-1471, 2014.
[5] P. Wang, B. Xu, J. Xu, G. Tian, C.-L. Liu, and H. Hao, "Semantic expansion using word embedding clustering and convolutional neural network for improving short text classification," Neurocomputing, vol. 174, pp. 806-814, 2016.
[6] W. Zhang, X. Tang, and T. Yoshida, "TESC: an approach to TExt classification using Semi-supervised Clustering," Knowledge-Based Systems, vol. 75, pp. 152-160, 2015.
[7] A. B. Al-Saleh and M. E. B. Menai, "Automatic Arabic text summarization: a survey," Artificial Intelligence Review, vol. 45, no. 2, pp. 203-234, 2016.
[8] F. Atefeh and W. Khreich, "A survey of techniques for event detection in Twitter," Computational Intelligence, vol. 31, no. 1, pp. 132-164, 2015.
[9] G. Stilo and P. Velardi, "Efficient temporal mining of micro-blog texts and its application to event discovery," Data Mining and Knowledge Discovery, vol. 30, no. 2, pp. 372-402, 2016.
[10] G. Huang, J. He, Y. Zhang et al., "Mining streams of short text for analysis of world-wide event evolutions," World Wide Web, vol. 18, no. 5, pp. 1201-1217, 2014.
[11] U. Erra, S. Senatore, F. Minnella, and G. Caggianese, "Approximate TF-IDF based on topic extraction from massive message stream using the GPU," Information Sciences, vol. 292, pp. 143-161, 2015.
[12] C. Qimin, G. Qiao, W. Yongliang, and W. Xianghua, "Text clustering using VSM with feature clusters," Neural Computing and Applications, vol. 26, no. 4, pp. 995-1003, 2015.
[13] J. Martinez-Gil, "An overview of textual semantic similarity measures based on web intelligence," Artificial Intelligence Review, vol. 42, no. 4, pp. 935-943, 2012.
[14] K. K. Bharti and P. K. Singh, "Hybrid dimension reduction by integrating feature selection with feature extraction method for text clustering," Expert Systems with Applications, vol. 42, no. 6, pp. 3105-3114, 2015.
[15] L. Yue, W. Zuo, T. Peng, Y. Wang, and X. Han, "A fuzzy document clustering approach based on domain-specified ontology," Data and Knowledge Engineering, vol. 100, pp. 148-166, 2015.
[16] T. Wei, Y. Lu, H. Chang, Q. Zhou, and X. Bao, "A semantic approach for text clustering using WordNet and lexical chains," Expert Systems with Applications, vol. 42, no. 4, pp. 2264-2275, 2015.
[17] L. Bing, S. Jiang, W. Lam, Y. Zhang, and S. Jameel, "Adaptive concept resolution for document representation and its applications in text mining," Knowledge-Based Systems, vol. 74, no. 1, pp. 1-13, 2015.
[18] R. Irfan, C. K. King, D. Grages et al., "A survey on text mining in social networks," Knowledge Engineering Review, vol. 30, no. 2, pp. 157-170, 2015.
[19] N. Indurkhya, "Emerging directions in predictive text mining," Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 5, no. 4, pp. 155-164, 2015.
[20] M. T. Mills and N. G. Bourbakis, "Graph-based methods for natural language processing and understanding: a survey and analysis," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 44, no. 1, pp. 59-71, 2014.
[21] H. Mousavi, D. Kerr, M. Iseli, and C. Zaniolo, "Mining semantic structures from syntactic structures in free text documents," in Proceedings of the 8th IEEE International Conference on Semantic Computing (ICSC '14), pp. 84-91, IEEE, Newport Beach, Calif, USA, June 2014.
[22] S. Jun, S.-S. Park, and D.-S. Jang, "Document clustering method using dimension reduction and support vector clustering to overcome sparseness," Expert Systems with Applications, vol. 41, no. 7, pp. 3204-3212, 2014.
[23] W. Z. Zhu and R. B. Allen, "Document clustering using the LSI subspace signature model," Journal of the American Society for Information Science and Technology, vol. 64, no. 4, pp. 844-860, 2013.
[24] X. Wu, X. Chen, X. Li, L. Zhou, and J. Lai, "Adaptive subspace learning: an iterative approach for document clustering," Neural Computing & Applications, vol. 25, no. 2, pp. 333-342, 2014.
[25] H. Kriegel and E. Ntoutsi, "Clustering high dimensional data," ACM SIGKDD Explorations Newsletter, vol. 15, no. 2, pp. 1-8, 2014.
[26] K. K. Bharti and P. K. Singh, "Opposition chaotic fitness mutation based adaptive inertia weight BPSO for feature selection in text clustering," Applied Soft Computing Journal, vol. 43, pp. 20-34, 2016.
[27] M. C. N. Barioni, H. Razente, A. M. R. Marcelino, A. J. M. Traina, and C. Traina, "Open issues for partitioning clustering methods: an overview," Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 4, no. 3, pp. 161-177, 2014.
[28] J.-P. Mei and L. Chen, "Proximity-based k-partitions clustering with ranking for document categorization and analysis," Expert Systems with Applications, vol. 41, no. 16, pp. 7095-7105, 2014.
[29] V. Tunali, T. Bilgin, and A. Camurcu, "An improved clustering algorithm for text mining: multi-cluster spherical K-means," International Arab Journal of Information Technology, vol. 13, no. 1, pp. 12-19, 2016.
[30] Y. Li, C. Luo, and S. M. Chung, "A parallel text document clustering algorithm based on neighbors," Cluster Computing, vol. 18, no. 2, pp. 933-948, 2015.
[31] F. Murtagh and P. Contreras, "Algorithms for hierarchical clustering: an overview," Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 2, no. 1, pp. 86-97, 2012.
[32] T. Peng and L. Liu, "A novel incremental conceptual hierarchical text clustering method using CFu-tree," Applied Soft Computing, vol. 27, pp. 269-278, 2015.
[33] Q. Chen, J. F. Lu, and H. Zhang, "A text mining model based on improved density clustering algorithm," in Proceedings of the 4th IEEE International Conference on Electronics Information and Emergency Communication (ICEIEC '13), Beijing, China, November 2013.
[34] S. F. Hussain, M. Mushtaq, and Z. Halim, "Multi-view document clustering via ensemble method," Journal of Intelligent Information Systems, vol. 43, no. 1, pp. 81-99, 2014.
[35] A. Wahid, X. Gao, and P. Andreae, "Multi-view clustering of web documents using multi-objective genetic algorithm," in Proceedings of the IEEE Congress on Evolutionary Computation (CEC '14), pp. 2625-2632, Beijing, China, July 2014.
[36] X. Pei, T. Wu, and C. Chen, "Automated graph regularized projective nonnegative matrix factorization for document clustering," IEEE Transactions on Cybernetics, vol. 44, no. 10, pp. 1821-1831, 2014.
[37] M. Lu, X.-J. Zhao, L. Zhang, and F.-Z. Li, "Semi-supervised concept factorization for document clustering," Information Sciences, vol. 331, pp. 86-98, 2016.
[38] C.-K. Yau, A. Porter, N. Newman, and A. Suominen, "Clustering scientific documents with topic modeling," Scientometrics, vol. 100, no. 3, pp. 767-786, 2014.
[39] Y. Ma, Y. Wang, and B. Jin, "A three-phase approach to document clustering based on topic significance degree," Expert Systems with Applications, vol. 41, no. 18, pp. 8203-8210, 2014.
[40] C. C. Aggarwal, Y. Zhao, and P. S. Yu, "On the use of side information for mining text data," IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 6, pp. 1415-1429, 2014.
[41] R. M. Marcacini, M. A. Domingues, E. R. Hruschka, and S. O. Rezende, "Privileged information for hierarchical document clustering: a metric learning approach," in Proceedings of the 22nd International Conference on Pattern Recognition (ICPR '14), pp. 3636-3641, August 2014.
[42] L. Cagnina, M. Errecalde, D. Ingaramo, and P. Rosso, "An efficient particle swarm optimization approach to cluster short texts," Information Sciences, vol. 265, pp. 36-49, 2014.
[43] W. Song, Y. Qiao, S. C. Park, and X. Qian, "A hybrid evolutionary computation approach with its application for optimizing text document clustering," Expert Systems with Applications, vol. 42, no. 5, pp. 2517-2524, 2015.
[44] R. Forsati, A. Keikha, and M. Shamsfard, "An improved bee colony optimization algorithm with an application to document clustering," Neurocomputing, vol. 159, no. 1, pp. 9-26, 2015.
[45] K. K. Bharti and P. K. Singh, "Chaotic gradient artificial bee colony for text clustering," Soft Computing, vol. 20, no. 3, pp. 1113-1126, 2016.
[46] F. Wang and J. Sun, "Survey on distance metric learning and dimensionality reduction in data mining," Data Mining and Knowledge Discovery, vol. 29, no. 2, pp. 534-564, 2014.
[47] Y.-S. Lin, J.-Y. Jiang, and S.-J. Lee, "A similarity measure for text classification and clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 7, pp. 1575-1590, 2014.
[48] M. T. Hassan, A. Karim, J.-B. Kim, and M. Jeon, "CDIM: document clustering by discrimination information maximization," Information Sciences, vol. 316, pp. 87-106, 2015.
[49] M. T. Hassan and A. Karim, "Clustering and understanding documents via discrimination information maximization," in Proceedings of the Pacific-Asia Conference on Advances in Knowledge Discovery & Data Mining (PAKDD '12), Kuala Lumpur, Malaysia, May 2012.
[50] D. Cai and C. J. van Rijsbergen, "Learning semantic relatedness from term discrimination information," Expert Systems with Applications, vol. 36, no. 2, pp. 1860-1875, 2009.
[51] G. Grahne and J. Zhu, "High performance mining of maximal frequent itemsets," in Proceedings of the SIAM Workshop on High Performance Data Mining: Pervasive and Data Stream Mining (HPDM: PDS '03), San Francisco, Calif, USA, May 2003.
[52] H. Luo, R. Shen, and C. Niu, "Sparse group restricted Boltzmann machines," in Proceedings of the 25th AAAI Conference on Artificial Intelligence (AAAI '11), San Francisco, Calif, USA, August 2011.
[53] S. Pado and M. Lapata, "Dependency-based construction of semantic space models," Computational Linguistics, vol. 33, no. 2, pp. 161-199, 2007.

Submit your manuscripts athttpswwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 5: A Novel Text Clustering Approach Using Deep-Learning ...downloads.hindawi.com/journals/mpe/2017/8310934.pdf · A Novel Text Clustering Approach Using Deep-Learning Vocabulary Network

Mathematical Problems in Engineering 5

TongYiCi CiLin

RS

Vocabularynetwork

FeatureVectors

Sparse-groupDBN

DL-SPPageRank

Feature extraction Feature selection Clustering

Figure 2 The procedure of DLVN

LMajor classes

Middle classes

Small classes

Word groups

Atomic word groups

A

a

01

A

01

B

f n

02 1103

KB C D

2202 03 04 05 06 07 Af03A06

Af03A06

Af03A06

Af03A06

Af03A06

Codemiddot middot middot

middot middot middot middot middot middot

middot middot middot

middot middot middot

middot middot middot

Figure 3 The structure of TongYiCi CiLin

4 Deep-Learning Vocabulary Network

In this section we propose an approach called deep-learningvocabulary network (DLVN) for text clusteringThe first stepof DLVN is the construction of vocabulary network Thecooccurrence of words or terms is useful information for textclustering We use the nodes of the vocabulary network torepresent words or terms and the edges of the vocabularynetwork to represent the relations betweenwords or terms Inour work there are two methods to obtain the cooccurrencerelations of words related-word set and TongYiCi CiLinFrequent itemsets are used to discover the relations of itemsin database We create related-word set by frequent itemsetsand each itemset of related-word set is a set of words withcooccurrence relation PageRank is employed to obtain theldquoimportancerdquo of nodes (feature vectors) instead of the termfrequency in VSM Then an improved DBN (called sparse-group DBN) is proposed for dimensionality reduction Inthe process of clustering algorithm we present DL-SP forclustering in which coverage rate is used for similaritymeasure The procedure of DLVN is shown in Figure 2

41 Related-Word Set The relations of words or terms areimportant information in text documents Usually naturallanguage has the fixed collocation and corresponding con-texts which means some words or terms have a high prob-ability of occurrence in a text document Thus the relationsbetweenwords are important to represent themeaning of textdocuments In our paper we use frequent itemsets to obtaincooccurrence relations between words or terms

Definition 1 (related-word set) Let 119863 = word1word2 word119899 be the words of text documents from the same topicand sup[sdot] be the support of itemsets Given a minimumsupport supms 119883 = word119894word119895 word119896 is defined asan itemset of related-word set where sup[119883] gt supms

FPMAX is a depth-first and recursive algorithm formining MFIs and it is based on FP-tree to store frequentitemsets When a database has a large scale all itemsetsof MFI-tree are detected in subset checking of FPMAXwhich has a big influence on the efficiency of FPMAX Forimproving the efficiency of FPMAX we use TongYiCi CiLinand string match to compress the FP-tree

TongYiCi CiLin is a Chinese semantic dictionary ofsynonyms and related words which organizes all words asa five-layer hierarchical tree It contains 77343 words whichare divided into 12 major classes 94 middle classes and 1438small classes The fourth layer and the fifth layer are furtherdivided into word groups and atomic word groups We useFigure 3 to illustrate the structure of TongYiCi CiLin

TongYiCi CiLin maps an atomic word group into a codethe first layer and the fourth layer are capital letters thesecond layer is a lowercase letter and the third layer andthe fifth layer are integers For example code ldquoAa01A02rdquostands for the atomic word group man mankind humanWe replace the words or terms with the code of word groupsin MFIrsquos mining which contains 4223 nodes We randomlyselect 10 documents from the same topic and the frequentitems (words) are listed in Table 1 As some words belong tothe same word group the number of words is compressedlargely

The structures of FP-trees that are created based onwordsand word groups are shown in Figure 4 Figure 4(a) is FP-treeof words and FP-tree of word groups is shown in Figure 4(b)The nodes of FP-tree based on the word groups are fewer thanthe nodes of FP-tree based on the words

The MFIs have redundant items in Figure 4(b) Forexample the MFIs of Figure 4(b) are listed in Table 2

MFIs include two categories of word groups in Table 2The word groups a(Bo21A01) b(Bo01A06) d(Dd14B36)and a(Bo21A01) c(Bo21A27) are closely related and theword groups k(Da21D01) i(Dm04A01) f(Cb08A01) and

6 Mathematical Problems in Engineering

Root

a3 e1 b1 p3k1 q1

j1 c1 b1 f1 h1 l1 d1 n1 q1 s1

h1 c1 g1 n1 o1 r1 t1

i1 d1 o1 m1

abcd

ef g

h

ijkl

m

no

p

q

rst

Item Head of node-linksHeader table

(a)

Root

a4 f1 i4

d1

b3c2 l1 k4

j2e1 c1 f2

e1d1 j1

a

bc

de

f

i

j

k

l

Item Head of node-linksHeader table

(b)

Figure 4 The structures of FP-trees

Table 1 The comparison of words and word groups

Words Word groupsa(vehicle)b(car)c(engine)d(quality) a(Bo21A01)b(Bo01A06)e(Dd12A01)e(automobile)f(truck) a(Bo21A01)c(Bo21A27)b(car)g(engine)h(power) a(Bo21A01)b(Bo01A06)d(Dd14B36)a(vehicle)i(lorry)c(engine)h(power) a(Bo21A01)b(Bo01A06)c(Bo21A27)d(Dd14B36)a(vehicle)j(jeep) a(Bo21A01)c(Bo21A27)k(ground)l(situation) f(Cb08A01)l(Da21A07)m(area)n(store)o(construction)p(environment) f(Cb08A01)i(Dm04A01)j(Bn01A01)k(Da21D01)q(shop)r(building)p(environment) i(Dm04A01)k(Da21D01)j(Bn01A01)s(location)q(shop)t(condition) f(Cb08A01)i(Dm04A01)k(Da21D01)d(quality)n(store)o(construction)p(environment) e(Dd12A01)i(Dm04A01)j(Bn01A01)k(Da21D01)

Table 2 MFIs of FP-tree based on word groups

MFIs1 a(Bo21A01)b(Bo01A06)d(Dd14B36)2 a(Bo21A01)c(Bo21A27)3 k(Da21D01)i(Dm04A01)f(Cb08A01)4 k(Da21D01)i(Dm04A01)j(Bn01A01)

k(Da21D01) i(Dm04A01) j(Bn01A01) are closely related Infact the aim of related-word set is tomine the ldquocooccurrencerdquoof words and we assume that the relations of words havetransitivity Therefore we utilize string matching and thesame items to combine MFIs

Definition 2 (combination of MFIs) Let MFIS = MFI1MFI2 MFI119898 be the MFIrsquos set obtained from text doc-uments and cov(sdot) be the number of the same items in twoMFIs Suppose that cov(MFI1MFI2) gt covmin where covminis minimum number of the same items MFI1 and MFI2 areremoved fromMFIS and the combination of MFI1 cupMFI2 isadd to MFIS

MFIs are inserted into MFI-tree in terms of covmin Forexample given MFI1 = 119886 119887 119888 119889 119890 119891 MFI2 = 119886 119887 119888 119889119890 ℎ MFI3 = 119890 119891 119892 ℎ 119894 119895 119896 and covmin = 07 the combi-nation of MFIs is MFI1 cup MFI2 = 119886 119887 119888 119889 119890 119891 ℎ The new

MFI-tree only has two paths (MFI1 cup MFI2) = 119886 119887 119888 119889 119890119891 ℎ and MFI3 = 119890 119891 119892 ℎ 119894 119895 119896 The scale of MFI-treeis simplified and we integrate FPMAX with combination ofMFIs to propose an algorithm named FPMAX with related-word set (FPMAX-RS) The step of FPMAX-RS is listed inAlgorithm 2

42 The Construction of Vocabulary Network In this sectionvocabulary network is constructed to represent text docu-ments and the vocabulary network contains the relationsbetween words or terms We employ the ldquoimportancerdquo ofnodes instead of term frequency in VSM

421 The Selection of Vocabulary Network Nodes The wordgroups in TongYiCi CiLin are used as nodes instead of wordsin vocabulary network The number of word groups is muchfewer than the number of words In addition we choosethe word groups whose frequency is higher than specifiedminimal frequency 119891min

422 The Construction of Edges in the Vocabulary NetworkEdges of complex network are the important carrier ofinformation and the edges of the vocabulary network areused in calculating the ldquoimportancerdquo of nodes Consideringthe semantic and related information among words of termsan edge is add to the vocabulary network in terms ofthe similarity of nodes Therefore we add an edge to thevocabulary network if word groups have a closer position

Mathematical Problems in Engineering 7

Procedure FPMAX-RS(T)Input T (an FP-tree) covminGlobal

MFIT an MFI-treeHead a linked list of items

Output The MFIT that contains all MFIrsquosMethod(1) if 119879 only contains a single path P(2) if cov(Head cup 119875MFI) gt covmin(3) combine MFI-tree to this path(4) else(5) insert Head cup 119875 into MFIT(6) else for each 119894 in Header-table of T(7) append 119894 to Head(8) construct the Head-pattern base(9) Tail = frequent items in base(10) subset_checking (Head cup Tail)(11) if Head cup Tail is not in MFI-tree(12) construct the FP-tree 119879Head(13) call FPMAX-RS(119879Head)(14) remove 119894 from Head

Algorithm 2 FPMAX-RS

in TongYiCi CiLin The semantic similarity of word groupssim(119894 119895) is defined as

sim (119894 119895) = depth (119894 119895)119897 times TN minus Dis (119894 119895) + 1

TN (11)

where depth(119894 119895) is the depth of the first common father node119897 is the depth of 119894 and 119895 TN is the total number of wordgroups and Dis(119894 119895) denotes the distance between 119894 and 119895For example there are two words 119888119886119903 119908ℎ119890119890119897 and the wordgroup codes of 119888119886119903 119908ℎ119890119890119897 are Bo21A Bo25119861 Because twonodes are in fourth layer the first common father node is119861119900 which is in the second layer In addition the fourth layercontains 4223 word groups and Dis(119894 119895) of Bo21A Bo25119861 is14 Therefore sim(Bo21A Bo25B) is calculated as follows

sim(11986111990021119860 11986111990025119861) = 24 times 4223 minus 14 + 14223 (12)

The nodes in the vocabulary network are traversed and anedge between 119894 and 119895 is added when sim(119894 119895) gt simmin (thespecified threshold)

In addition we add an edge between two nodes if anMFI in related-word set includes the words and each MFIin related-word set is a word set with cooccurrence relationsIn fact the meaning of words in an MFI is not similarand an MFI includes a group of words cooccurring in thesame topic documents When a text document has the wordsin an MFI the text document has a high probability ofbelonging to certain topic Therefore we add an edge intothe vocabulary network with low-frequency word pointing tohigh-frequency word

423 The Extraction of Feature Vectors In the vocabularynetwork the number and the direction of edges reflect the

importance of nodes which is similar to evaluating theimportance of webpagesThus PageRank is utilized to obtainthe importance of nodes and the initial value PR119894 of nodes isdefined by

PR119894 = 119891119894sum119873119895=1 119891119895 (13)

where 119891119894 is the frequency of word groups After iterativecomputation and normalization of PR119894 we use the PageRankscores of nodes as the feature vectors of text documentsinstead of term frequency in this paper

43 Deep-Learning Single-Pass (DL-SP) In this paper sparse-group DBN is proposed for dimensionality reduction offeature vectors DBN is a model of deep learning Luo et al[52] found that the units of hidden layers exhibited statisticaldependencies and proposed a regularization constant torestrict the relations in hidden layers Due to the sparsity offeature vectors we combine the word dependencies andDBNto propose a sparse-groupDBN for dimensionality reductionIn addition coverage rate (CoR) is proposed for similaritymeasure among feature vectors in DL-SP

431 Sparse-Group DBN Deep learning simulates the pro-cess of human thinking and the result of deep learning is thedistributed representation of an input vector By analyzingfeature vectors extracted from the vocabulary network wefind that there exists statistical dependency between entries offeature vectors whichmeans the entries of feature vectorswillcooccur in the part of feature vectors The word dependencyis also mentioned by many researchers in previous literatures[5 18 53] Cooccurrence relations are typically collectedin feature vectors which means a unique word commonlyreferring to ldquotarget wordrdquo and the word dependency isquantified to measure words similarity in text clustering Weprovide an example which is the part of a feature vector inTable 3

Because the documents in the same topic usually includerelated words a part of units in visible layer is activesimultaneously and accordingly the documents in differenttopics usually activate different part of units Based onthis observation we add a regularization constant to thelog-likelihood of training data to retain these relations Inexperiments we use different topic documents to train thesparse-group DBN The sequence of units in output layer isadjusted accordingly and the cooccurring units are dividedinto one group In other words the feature vectors of differenttopic documents can activate different group of units inoutput layer The structure of sparse-group DBN is shown inFigure 5

Sparse-group DBN is comprised of several RBMs andtwo adjacent layers are an RBMs For retaining the depen-dency of the units in output layer we define the activationprobability of each group Given a group 119878 = 1199101 1199102 119910119904and training sample V(119896) the group probability 119875119878(sdot) is givenby

119875119878 (V(119896)) = radicsum119904isin119878

119875 (119910119904 = 1 | V(119896))2 (14)

8 Mathematical Problems in Engineering

Table 3 The word dependencies of a feature vector

Hj47A01 Hg19C01 Ba03A18 Ba08A07 Dm05A01 Hg01A01 Ae13B01032 017 012 004 0 002 0017 021 007 011 0 0 0014 023 017 009 0 0 0023 012 006 014 0 001 00 0 0 0 012 034 0200 0 0 0 024 021 0100 0 0 0 013 014 009

Visible layer

Hidden layer

Output layery1 y2 ys y1 y2 ys y1 y2 ysmiddot middot middot middot middot middot middot middot middot middot middot middot

middot middot middot

middot middot middot

middot middot middot

h1 h2 h3 hj

Wh

b1 b2 b3 bj

a1 a2 a3 a4 ai1 2 3 4 i

1 2 K

Figure 5 The structure of sparse-group DBN

The output layer of the sparse-group DBN is divided into119870 groups and the probability of output layer 119875ol(sdot) is definedby

119875ol (V(119896)) = 119870sum119896=1

radicsum119904isin119878

119875 (119910119904 = 1 | V(119896))2 (15)

We add a regularization constant 120582 and 119875ol(V(119896)) to opti-mization function which is maximum likelihood estimateof energy function of an RBM The optimization function isdefined by

max119882119887119888

sum log119875 (V(119896)) minus 120582 119870sum119896=1

radicsum119875(119910119904 = 1 | V(119896))2 (16)

Equation (11) is improved to (21) accordingly and nabla119882(119896)119894119895 isdefined by

nabla119882(119896)119894119895 = 119864 [V(119896)119894 ℎ(119896)119895 ]data

minus 119864 [V(119896)119894 ℎ(119896)119895 ]Gibbs

minus 120582 sdot 120572 (17)

where 120572 = (120597120597(119882(119896)119894119895 ))119875119878(V(119896)) = 119875(119910119904 = 1 | V(119896))119875119878(V(119896)) sdot (120597120597(119882(119896)119894119895 ))119875(119910119904 = 1 | V(119896)) = (119875(119910119904 = 1 | V(119896))2 sdot 119875(119910119904 = 0 |V(119896)) sdot V(119896))119875119878(V(119896))Accordingly the gradient of (nabla119886(119896) nabla119887(119896))is defined by

nabla119886(119896)119894 = 119864 [V(119896)119894 ]data minus 119864 [V(119896)119894 ]Gibbs minus 120582 sdot 120572nabla119887(119896)119895 = 119864 [ℎ(119896)119895 ]

dataminus 119864 [ℎ(119896)119895 ]

Gibbsminus 120582 sdot 120572 (18)

432 Similarity Measure of DL-SP Single-Pass is a parti-tional clustering algorithm The first document is treated asthe first cluster in Single-Pass and similarity is computedbetween new document and existing clusters which decidesnew document to join the existing cluster or to create anew cluster in terms of specified threshold The output ofsparse-group DBN is binary and Euclidean distance andCosine angle distance are not suitable for similarity measureinDL-SPTherefore we use coverage rate (CoR) for similaritymeasure and CoR is defined by

CoR (119862 119889) = |119862 cap 119889|119862 (19)

where 119862 = (1198881 1198882 119888119899) is the feature vector of a cluster(named topic feature vector) and 119889 = (1198891 1198892 119889119899) is thefeature vector of new document

Moreover the addition of many text documents to clus-ters has an influence on topic feature vector In our work weintroduce optional topic feature vector 1198621015840 = (11988810158401 11988810158402 1198881015840119899)and the weight of feature vector to solve this problemWe provide an example of optional topic feature vector inFigure 6

When theweight of optional topic feature vector is greaterthan a specified threshold in each time interval we replacetopic feature vector with optional topic feature vector as newcluster center The weight of topic feature vector is definedby

1199081198621015840 = sum119862119891 (119888119894)sum119862119891 (119888119894) + sum1198621015840 119891 (1198881015840119895) minus 120582119890119896(119905minus1199050) (20)

where 120582119890119896(119905minus1199050) is time damping function and 119891(119888119894) is fre-quency function

5 Experimental Analysis

In this section we conduct three sets of experiments tovalidate the effectiveness of the proposed approach includingthe efficiency of FPMAX-RS in related-word set miningthe comparison of feature vectors and the comparison ofDL-SP efficiency In this work three Chinese text corporaTanCorpV10 Encyclopedia of China and Sogou Corpus areused as the experimental datasets

51 The Efficiency of FPMAX-RS in Related-Word Set MiningThis section is to compare running time of FPMAX and

Mathematical Problems in Engineering 9

Text feature

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 001 1 1 1 11 1 1 1 1 1 1 1 1 1 1 1

di

(a)

Topic feature

00000000000000000000000

ci

c6 c8 c9 c10 c18 c21 c23 c24 c25 c27 c32 c33 c34 c35 c36 c37

(b)

Sim

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 01 1 1 1 1 1 1 1 1 1 1 1 1 1

(c)

Optional topic feature

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

(ci c㰀i )

c6 c8 c9 c10 c21 c22 c23 c24 c25 c27 c32 c33 c34 c35 c36 c37c13 c15

(d)

Figure 6 An example of optional topic feature vector

FPMAX-RS in related-word set mining We choose sevencategories (museum property education military car sportand health) of text documents from the datasets and eachcategory has 50 articles The result of experiment is listed inFigure 7

FPMAX generates a larger amount of maximal frequentitemsets and traverses all MFI-trees for subset checkingwhich has an influence on the running time of FPMAXCompared with FPMAX FPMAX-RS has higher efficiencywhen supmin is smaller

52 The Comparison of Feature Vectors In this work wecompare the distance among the feature vectors based on

tf-idf FC-VSM [12] and DLVN We randomly choose twodocuments from the categorymuseum and one document inother categories including property education and militaryThe aim of feature extraction is to extract the feature vectorsthat can represent the meaning of text documents In otherwords feature vectors in different categories have longerdistance Therefore we compute the Euclidean distanceof feature vectors in different categories based on tf-idfFC-VSM and DLVN Table 4 shows the results in differentcategories of text documents

In the following experiment feature vectors are extractedbased on tf-idf FC-VSM and DLVN Then k-means isapplied for clustering We evaluate clustering performance

10 Mathematical Problems in Engineering

Table 4 Euclidean distance comparison

Category Documents Distancetf-idf FC-VSM DLVN

museum museum1- museum2 1302 1049 917

property property - museum1 1285 1347 1359property - museum2 1593 1586 1687

education education - museum1 1468 1461 1472education - museum2 1139 1133 1207

military military - museum1 1556 1649 1658military - museum2 1369 1403 1841

FPMAXFPMAX-RS

0

1000

2000

3000

4000

5000

6000

7000

8000

9000

10000

Tim

e (m

s)

002 003 004 005 006001supmin

Figure 7 The comparison of running time

withF_measure LetD = 1198891 1198892 119889119899 be clustering resultand Dlowast = 119889lowast1 119889lowast2 119889lowast119899 be standard dataset F_measureis defined by

F_measure = 2 timesP (DDlowast) timesR (DDlowast)P (DDlowast) +R (DDlowast) (21)

where P(DDlowast) = |D cap Dlowast||D| is precision and R(DDlowast) = |D capDlowast||Dlowast| is recall

Because seven categories of text documents are chosenin our experiment the specified number of clusters 119896 is 7Figure 8 illustrates that feature vectors based on DLVN havebetter performance

53 The Comparison of DL-SP Efficiency In this experimentwe choose text documents from the datasets and the numberof each category is listed in Table 5

The aim of the experiment is to compare DL-SP with LSIand Single-Pass The sparse-group DBN has 3 layers and the

Spor

t

Mili

tary

Mus

eum

Hea

lth

Prop

erty

Educ

atio

n

Car

DLVN FC-VSM

ℱ_m

easu

re

0

01

02

03

04

05

06

07

08

09

10

tf-idf

Figure 8 The comparison ofF_measure

Table 5 The datasets of experiment

Category The number of text documentssport 1300military 1500health 1400property 900education 800car 500

number of each layer is 4223 3500 and 3000 In addition thegroup number 119870 of top layer is 200 The structure of sparse-group DBN is shown in Figure 9

The experimental result is shown in Figure 10 DL-SP hasbetter performance than LSI and Single-Pass in sport mili-tary property education and health HoweverF_measure ofDL-SP is lower than LSI and Single-Pass in category car due

Mathematical Problems in Engineering 11

Visible layer

Hidden layer

Output layery1 y2 y15 y16 y17 y30 y2986 y2987 y3000middot middot middot middot middot middot middot middot middot middot middot middot

middot middot middotmiddot middot middot

middot middot middot middot middot middot

h1 h2 h3 hj h3500

Wh

b1 b2 b3 bj b3500

a1 a2 a3 a4 ai1 2 3 4 i

a42234223

K = 1 K = 2 K = 200

Figure 9 The structure of sparse-group DBN

02

04

06

08

10

00

Spor

t

Mili

tary

Hea

lth

Prop

erty

Educ

atio

n

Car

Single-Pass

DL-SPLSI

ℱ_m

easu

re

Figure 10 The comparison of DL-SP

Table 6 The running time of DL-SP and Single-Pass

The dimensionality offeature vectors

Running time (s)

Single-Pass 4223 3866DL-SP 3000 1084

to the smaller number of documents not training the sparse-group DBN effectively

In this subsection we compare the running time of DL-SP and Single-Pass and the result is listed in Table 6

6 Conclusions

In this paper we propose an approach DLVN for textclustering The existing term frequency-based methods onlycalculate the number of words but the relations of words arenot considered in feature extractionThe approach constructsvocabulary network to mine the importance of words usingrelated-word set which contains ldquocooccurrencerdquo relationsof words Therefore the text features of documents in thesame category have shorter distance and feature vectorshave longer distance among different categories Moreoverwe employ sparse-group DBN to reduce the dimensionalityof feature vectors in terms of the group relations of wordsThus sparse-group DBN can retain the word dependency indimensionality reduction In the experiments we comparethe approach with well-known methods to verify our workand the results show the performance of DLVN

In current work we verify the approach using Chinesecorpora We will use English text to prove the approacheffectiveness in the future work Moreover in the processof dimension reduction we need to train the sparse-groupDBN using a large amount of text documents to improve itsperformance

Conflicts of Interest

The authors declare that they have no conflicts of interest

Acknowledgments

This work has been supported by Projects U1536116 andU1636208 funded by National Natural Science Foundation ofChina (NSFC)

References

[1] A Trabelsi and O R Zaıane ldquoExtraction and clustering ofarguing expressions in contentious textrdquo Data and KnowledgeEngineering vol 100 pp 226ndash239 2015

[2] K Schouten and F Frasincar ldquoSurvey on aspect-level sentimentanalysisrdquo IEEE Transactions on Knowledge and Data Engineer-ing vol 28 no 3 pp 813ndash830 2016

[3] MTsytsarau andT Palpanas ldquoSurvey onmining subjective dataon the webrdquo Data Mining and Knowledge Discovery vol 24 no3 pp 478ndash514 2012

12 Mathematical Problems in Engineering

[4] S-J Lee and J-Y Jiang ldquoMultilabel text categorization based onfuzzy relevance clusteringrdquo IEEETransactions on Fuzzy Systemsvol 22 no 6 pp 1457ndash1471 2014

[5] P Wang B Xu J Xu G Tian C-L Liu and H Hao ldquoSemanticexpansion using word embedding clustering and convolutionalneural network for improving short text classificationrdquo Neuro-computing vol 174 pp 806ndash814 2016

[6] W Zhang X Tang and T Yoshida ldquoTESC an approach to TExtclassification using Semi-supervised Clusteringrdquo Knowledge-Based Systems vol 75 pp 152ndash160 2015

[7] A B Al-Saleh and M E B Menai ldquoAutomatic Arabic textsummarization a surveyrdquo Artificial Intelligence Review vol 45no 2 pp 203ndash234 2016

[8] F Atefeh and W Khreich ldquoA survey of techniques for eventdetection in Twitterrdquo Computational Intelligence vol 31 no 1pp 132ndash164 2015

[9] G Stilo and P Velardi ldquoEfficient temporalmining ofmicro-blogtexts and its application to event discoveryrdquo Data Mining andKnowledge Discovery vol 30 no 2 pp 372ndash402 2016

[10] G Huang J He Y Zhang et al ldquoMining streams of short textfor analysis of world-wide event evolutionsrdquo World Wide Webvol 18 no 5 pp 1201ndash1217 2014

[11] U Erra S Senatore F Minnella and G Caggianese ldquoApproxi-mate TF-IDF based on topic extraction from massive messagestream using the GPUrdquo Information Sciences vol 292 pp 143ndash161 2015

[12] C Qimin G Qiao W Yongliang andW Xianghua ldquoText clus-tering using VSM with feature clustersrdquo Neural Computing andApplications vol 26 no 4 pp 995ndash1003 2015

[13] J Martinez-Gil ldquoAn overview of textual semantic similaritymeasures based on web intelligencerdquo Artificial IntelligenceReview vol 42 no 4 pp 935ndash943 2012

[14] K K Bharti and P K Singh ldquoHybrid dimension reduction byintegrating feature selection with feature extraction method fortext clusteringrdquo Expert Systems with Applications vol 42 no 6pp 3105ndash3114 2015

[15] L Yue W Zuo T Peng Y Wang and X Han ldquoA fuzzy docu-ment clustering approach based on domain-specified ontologyrdquoData and Knowledge Engineering vol 100 pp 148ndash166 2015

[16] T Wei Y Lu H Chang Q Zhou and X Bao ldquoA semanticapproach for text clustering using WordNet and lexical chainsrdquoExpert Systems with Applications vol 42 no 4 pp 2264ndash22752015

[17] L Bing S Jiang W Lam Y Zhang and S Jameel ldquoAdaptiveconcept resolution for document representation and its appli-cations in text miningrdquo Knowledge-Based Systems vol 74 no 1pp 1ndash13 2015

[18] R Irfan C K King D Grages et al ldquoA survey on text miningin social networksrdquo Knowledge Engineering Review vol 30 no2 pp 157ndash170 2015

[19] N Indurkhya ldquoEmerging directions in predictive text miningrdquoWiley Interdisciplinary Reviews Data Mining and KnowledgeDiscovery vol 5 no 4 pp 155ndash164 2015

[20] M T Mills and N G Bourbakis ldquoGraph-based methods fornatural language processing and understandingmdasha survey andanalysisrdquo IEEE Transactions on Systems Man and CyberneticsPart C Applications and Reviews vol 44 no 1 pp 59ndash71 2014

[21] HMousavi D Kerr M Iseli and C Zaniolo ldquoMining semanticstructures from syntactic structures in free text documentsrdquoin Proceedings of the 8th IEEE International Conference onSemantic Computing (ICSC rsquo14) pp 84ndash91 IEEE NewportBeach Calif USA June 2014

[22] S Jun S-S Park and D-S Jang ldquoDocument clusteringmethodusing dimension reduction and support vector clustering toovercome sparsenessrdquo Expert Systems with Applications vol 41no 7 pp 3204ndash3212 2014

[23] W Z Zhu and R B Allen ldquoDocument clustering using the LSIsubspace signature modelrdquo Journal of the American Society forInformation Science and Technology vol 64 no 4 pp 844ndash8602013

[24] X Wu X Chen X Li L Zhou and J Lai ldquoAdaptive subspacelearning an iterative approach for document clusteringrdquoNeuralComputing amp Applications vol 25 no 2 pp 333ndash342 2014

[25] H Kriegel and E Ntoutsi ldquoClustering high dimensional datardquoACM SIGKDD Explorations Newsletter vol 15 no 2 pp 1ndash82014

[26] K K Bharti and P K Singh ldquoOpposition chaotic fitness muta-tion based adaptive inertia weight BPSO for feature selectionin text clusteringrdquo Applied Soft Computing Journal vol 43 pp20ndash34 2016

[27] M C N Barioni H Razente A M R Marcelino A J MTraina and C Traina ldquoOpen issues for partitioning clusteringmethods an overviewrdquo Wiley Interdisciplinary Reviews DataMining and Knowledge Discovery vol 4 no 3 pp 161ndash177 2014

[28] J-P Mei and L Chen ldquoProximity-based k-partitions clusteringwith ranking for document categorization and analysisrdquo ExpertSystems with Applications vol 41 no 16 pp 7095ndash7105 2014

[29] V Tunali T Bilgin and A Camurcu ldquoAn improved clusteringalgorithm for text mining multi-cluster spherical K-meansrdquoInternational Arab Journal of Information Technology vol 13 no1 pp 12ndash19 2016

[30] Y Li C Luo and S M Chung ldquoA parallel text documentclustering algorithm based on neighborsrdquo Cluster Computingvol 18 no 2 pp 933ndash948 2015

[31] F Murtagh and P Contreras ldquoAlgorithms for hierarchicalclustering an overviewrdquo Wiley Interdisciplinary Reviews DataMining and Knowledge Discovery vol 2 no 1 pp 86ndash97 2012

[32] T Peng and L Liu ldquoA novel incremental conceptual hierarchicaltext clusteringmethod usingCFu-treerdquoApplied SoftComputingvol 27 pp 269ndash278 2015

[33] Q Chen J F Lu and H Zhang ldquoA text mining model basedon improved density clustering algorithmrdquo in Proceedings of the4th IEEE International Conference on Electronics Informationand Emergency Communication (ICEIEC rsquo13) Beijing ChinaNovember 2013

[34] S FHussainMMushtaq andZHalim ldquoMulti-viewdocumentclustering via ensemble methodrdquo Journal of Intelligent Informa-tion Systems vol 43 no 1 pp 81ndash99 2014

[35] A Wahid X Gao and P Andreae ldquoMulti-view clustering ofweb documents using multi-objective genetic algorithmrdquo inProceedings of the IEEE Congress on Evolutionary Computation(CEC rsquo14) pp 2625ndash2632 Beijing China July 2014

[36] X Pei T Wu and C Chen ldquoAutomated graph regularizedprojective nonnegative matrix factorization for document clus-teringrdquo IEEE Transactions on Cybernetics vol 44 no 10 pp1821ndash1831 2014


[Figure 4: The structures of FP-trees. (a) The FP-tree and header table built from the words of Table 1 (items a-t); (b) the FP-tree and header table built from the corresponding word groups (items a-l).]

Table 1: The comparison of words and word groups.

Words                                              | Word groups
a(vehicle) b(car) c(engine) d(quality)             | a(Bo21A01) b(Bo01A06) e(Dd12A01)
e(automobile) f(truck)                             | a(Bo21A01) c(Bo21A27)
b(car) g(engine) h(power)                          | a(Bo21A01) b(Bo01A06) d(Dd14B36)
a(vehicle) i(lorry) c(engine) h(power)             | a(Bo21A01) b(Bo01A06) c(Bo21A27) d(Dd14B36)
a(vehicle) j(jeep)                                 | a(Bo21A01) c(Bo21A27)
k(ground) l(situation)                             | f(Cb08A01) l(Da21A07)
m(area) n(store) o(construction) p(environment)    | f(Cb08A01) i(Dm04A01) j(Bn01A01) k(Da21D01)
q(shop) r(building) p(environment)                 | i(Dm04A01) k(Da21D01) j(Bn01A01)
s(location) q(shop) t(condition)                   | f(Cb08A01) i(Dm04A01) k(Da21D01)
d(quality) n(store) o(construction) p(environment) | e(Dd12A01) i(Dm04A01) j(Bn01A01) k(Da21D01)

Table 2: MFIs of the FP-tree based on word groups.

No. | MFIs
1   | a(Bo21A01) b(Bo01A06) d(Dd14B36)
2   | a(Bo21A01) c(Bo21A27)
3   | k(Da21D01) i(Dm04A01) f(Cb08A01)
4   | k(Da21D01) i(Dm04A01) j(Bn01A01)

k(Da21D01), i(Dm04A01), and j(Bn01A01) are closely related. In fact, the aim of the related-word set is to mine the "cooccurrence" of words, and we assume that the relations of words have transitivity. Therefore, we utilize string matching and the number of same items to combine MFIs.

Definition 2 (combination of MFIs). Let MFIS = {MFI_1, MFI_2, ..., MFI_m} be the set of MFIs obtained from text documents, and let cov(·) be the number of the same items in two MFIs. Suppose that cov(MFI_1, MFI_2) > cov_min, where cov_min is the minimum number of the same items. Then MFI_1 and MFI_2 are removed from MFIS, and the combination MFI_1 ∪ MFI_2 is added to MFIS.

MFIs are inserted into the MFI-tree in terms of cov_min. For example, given MFI_1 = {a, b, c, d, e, f}, MFI_2 = {a, b, c, d, e, h}, MFI_3 = {e, f, g, h, i, j, k}, and cov_min = 0.7, the combination of MFIs is MFI_1 ∪ MFI_2 = {a, b, c, d, e, f, h}. The new MFI-tree only has two paths: MFI_1 ∪ MFI_2 = {a, b, c, d, e, f, h} and MFI_3 = {e, f, g, h, i, j, k}. The scale of the MFI-tree is thus simplified, and we integrate FPMAX with the combination of MFIs to propose an algorithm named FPMAX with related-word set (FPMAX-RS). The steps of FPMAX-RS are listed in Algorithm 2.
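For illustration, a minimal Python sketch of this combination step is given below. Since the example above uses cov_min = 0.7, the sketch treats cov(·) as the fraction of shared items relative to the smaller MFI; that normalization, and the function names, are assumptions of the sketch rather than details fixed by the paper.

def cov(mfi_a: frozenset, mfi_b: frozenset) -> float:
    # Fraction of items shared by the two MFIs (assumed normalization).
    return len(mfi_a & mfi_b) / min(len(mfi_a), len(mfi_b))

def combine_mfis(mfis: list, cov_min: float) -> list:
    """Repeatedly merge any two MFIs whose overlap exceeds cov_min."""
    result = [frozenset(m) for m in mfis]
    merged = True
    while merged:
        merged = False
        for i in range(len(result)):
            for j in range(i + 1, len(result)):
                if cov(result[i], result[j]) > cov_min:
                    union = result[i] | result[j]
                    result = [m for k, m in enumerate(result) if k not in (i, j)]
                    result.append(union)
                    merged = True
                    break
            if merged:
                break
    return [set(m) for m in result]

# Worked example from the text: MFI1 and MFI2 share 5 of 6 items (about 0.83 > 0.7),
# so they merge, while MFI3 stays as its own path in the new MFI-tree.
mfis = [{"a", "b", "c", "d", "e", "f"},
        {"a", "b", "c", "d", "e", "h"},
        {"e", "f", "g", "h", "i", "j", "k"}]
print(combine_mfis(mfis, cov_min=0.7))

Running the sketch on the example merges MFI_1 and MFI_2 into {a, b, c, d, e, f, h} and leaves MFI_3 untouched, matching the two paths of the new MFI-tree described above.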

4.2. The Construction of Vocabulary Network. In this section, the vocabulary network is constructed to represent text documents, and the vocabulary network contains the relations between words or terms. We employ the "importance" of nodes instead of term frequency in VSM.

4.2.1. The Selection of Vocabulary Network Nodes. The word groups in TongYiCi CiLin are used as nodes instead of words in the vocabulary network. The number of word groups is much smaller than the number of words. In addition, we choose only the word groups whose frequency is higher than a specified minimal frequency f_min.

4.2.2. The Construction of Edges in the Vocabulary Network. Edges of a complex network are the important carrier of information, and the edges of the vocabulary network are used in calculating the "importance" of nodes. Considering the semantic and related information among words or terms, an edge is added to the vocabulary network in terms of the similarity of nodes. Therefore, we add an edge to the vocabulary network if two word groups have a close position in TongYiCi CiLin.


Procedure FPMAX-RS(T)
Input: T (an FP-tree), cov_min
Global:
    MFIT: an MFI-tree
    Head: a linked list of items
Output: the MFIT that contains all MFIs
Method:
(1)  if T only contains a single path P
(2)      if cov(Head ∪ P, MFI) > cov_min
(3)          combine the MFI-tree to this path
(4)      else
(5)          insert Head ∪ P into MFIT
(6)  else for each i in the Header-table of T
(7)      append i to Head
(8)      construct the Head-pattern base
(9)      Tail = frequent items in base
(10)     subset_checking(Head ∪ Tail)
(11)     if Head ∪ Tail is not in the MFI-tree
(12)         construct the FP-tree T_Head
(13)         call FPMAX-RS(T_Head)
(14)     remove i from Head

Algorithm 2: FPMAX-RS.

The semantic similarity of word groups sim(i, j) is defined as

\mathrm{sim}(i, j) = \frac{\mathrm{depth}(i, j)}{l} \times \frac{TN - \mathrm{Dis}(i, j) + 1}{TN},    (11)

where depth(i, j) is the depth of the first common parent node, l is the depth of i and j, TN is the total number of word groups, and Dis(i, j) denotes the distance between i and j. For example, consider the two words car and wheel, whose word group codes are Bo21A and Bo25B. Because the two nodes are in the fourth layer, the first common parent node is Bo, which is in the second layer. In addition, the fourth layer contains 4223 word groups, and Dis(i, j) of Bo21A and Bo25B is 14. Therefore, sim(Bo21A, Bo25B) is calculated as follows:

\mathrm{sim}(\mathrm{Bo21A}, \mathrm{Bo25B}) = \frac{2}{4} \times \frac{4223 - 14 + 1}{4223}.    (12)

The nodes in the vocabulary network are traversed, and an edge between i and j is added when sim(i, j) > sim_min (the specified threshold).
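As a rough illustration, the following Python sketch evaluates (11) for the worked example above. It assumes the depth of the first common parent, the shared layer depth, the layer size TN, and Dis(i, j) are already known for the two codes; SIM_MIN is a hypothetical threshold, since the paper does not fix sim_min.

def group_similarity(depth_common: int, layer: int, tn: int, dis: int) -> float:
    """sim(i, j) = (depth(i, j) / l) * (TN - Dis(i, j) + 1) / TN, as in Eq. (11)."""
    return (depth_common / layer) * (tn - dis + 1) / tn

# Worked example from the text: Bo21A vs. Bo25B.
# Common parent "Bo" is in the 2nd layer, both codes sit in the 4th layer,
# the 4th layer holds 4223 word groups, and Dis(Bo21A, Bo25B) = 14.
sim = group_similarity(depth_common=2, layer=4, tn=4223, dis=14)
print(f"sim(Bo21A, Bo25B) = {sim:.3f}")   # about 0.498

SIM_MIN = 0.4  # hypothetical threshold
if sim > SIM_MIN:
    pass  # an edge between the two word-group nodes would be added here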

In addition, we add an edge between two nodes if an MFI in the related-word set includes the two words, because each MFI in the related-word set is a word set with cooccurrence relations. In fact, the meanings of the words in an MFI are not necessarily similar; an MFI includes a group of words cooccurring in documents on the same topic. When a text document contains the words in an MFI, the text document has a high probability of belonging to a certain topic. Therefore, we add a directed edge into the vocabulary network pointing from the low-frequency word to the high-frequency word.

4.2.3. The Extraction of Feature Vectors. In the vocabulary network, the number and the direction of edges reflect the importance of nodes, which is similar to evaluating the importance of webpages. Thus, PageRank is utilized to obtain the importance of nodes, and the initial value PR_i of nodes is defined by

\mathrm{PR}_i = \frac{f_i}{\sum_{j=1}^{N} f_j},    (13)

where f_i is the frequency of word group i. After iterative computation and normalization of PR_i, we use the PageRank scores of nodes as the feature vectors of text documents instead of term frequency in this paper.
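A minimal sketch of this feature extraction step, assuming the vocabulary network is available as a directed adjacency dictionary and word-group frequencies are known, might look as follows; the damping factor and iteration count are conventional PageRank choices rather than values given in the paper, and dangling-node mass is simply ignored for brevity.

import numpy as np

def pagerank_features(vocab_graph: dict, freqs: dict,
                      damping: float = 0.85, iters: int = 100) -> dict:
    nodes = list(vocab_graph)
    n = len(nodes)
    idx = {u: k for k, u in enumerate(nodes)}
    total = sum(freqs[u] for u in nodes)
    # Initial value PR_i = f_i / sum_j f_j, as in Eq. (13).
    pr = np.array([freqs[u] / total for u in nodes])
    for _ in range(iters):
        new = np.full(n, (1 - damping) / n)
        for u in nodes:
            out = vocab_graph[u]
            if out:  # nodes without out-edges lose their mass in this simplified sketch
                share = damping * pr[idx[u]] / len(out)
                for v in out:
                    new[idx[v]] += share
        pr = new
    pr /= pr.sum()  # normalization before use as feature entries
    return {u: pr[idx[u]] for u in nodes}

# Toy usage: three word-group nodes connected by cooccurrence edges.
g = {"Bo21A01": ["Bo01A06"], "Bo01A06": ["Dd14B36"], "Dd14B36": ["Bo21A01"]}
f = {"Bo21A01": 5, "Bo01A06": 3, "Dd14B36": 2}
print(pagerank_features(g, f))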

4.3. Deep-Learning Single-Pass (DL-SP). In this paper, sparse-group DBN is proposed for dimensionality reduction of feature vectors. DBN is a model of deep learning. Luo et al. [52] found that the units of hidden layers exhibit statistical dependencies and proposed a regularization constant to restrict the relations in hidden layers. Due to the sparsity of feature vectors, we combine the word dependencies and DBN to propose a sparse-group DBN for dimensionality reduction. In addition, coverage rate (CoR) is proposed for similarity measure among feature vectors in DL-SP.

4.3.1. Sparse-Group DBN. Deep learning simulates the process of human thinking, and the result of deep learning is a distributed representation of an input vector. By analyzing the feature vectors extracted from the vocabulary network, we find that there exists statistical dependency between entries of feature vectors, which means that certain entries of feature vectors tend to cooccur. The word dependency has also been noted by many researchers in previous literature [5, 18, 53]. Cooccurrence relations are typically collected in feature vectors, with a unique word commonly referring to a "target word", and the word dependency is quantified to measure word similarity in text clustering. We provide an example, which is part of a feature vector, in Table 3.

Because the documents in the same topic usually include related words, a part of the units in the visible layer is active simultaneously, and accordingly the documents in different topics usually activate different parts of the units. Based on this observation, we add a regularization constant to the log-likelihood of the training data to retain these relations. In the experiments, we use documents from different topics to train the sparse-group DBN. The sequence of units in the output layer is adjusted accordingly, and the cooccurring units are divided into one group. In other words, the feature vectors of different topic documents activate different groups of units in the output layer. The structure of the sparse-group DBN is shown in Figure 5.

The sparse-group DBN is comprised of several RBMs, and every two adjacent layers form an RBM. To retain the dependency of the units in the output layer, we define the activation probability of each group. Given a group S = {y_1, y_2, ..., y_s} and a training sample v^{(k)}, the group probability P_S(·) is given by

P_S(v^{(k)}) = \sqrt{\sum_{s \in S} P(y_s = 1 \mid v^{(k)})^2}.    (14)


Table 3: The word dependencies of a feature vector.

Hj47A01 | Hg19C01 | Ba03A18 | Ba08A07 | Dm05A01 | Hg01A01 | Ae13B01
0.32    | 0.17    | 0.12    | 0.04    | 0       | 0.02    | 0
0.17    | 0.21    | 0.07    | 0.11    | 0       | 0       | 0
0.14    | 0.23    | 0.17    | 0.09    | 0       | 0       | 0
0.23    | 0.12    | 0.06    | 0.14    | 0       | 0.01    | 0
0       | 0       | 0       | 0       | 0.12    | 0.34    | 0.20
0       | 0       | 0       | 0       | 0.24    | 0.21    | 0.10
0       | 0       | 0       | 0       | 0.13    | 0.14    | 0.09

[Figure 5: The structure of sparse-group DBN: a visible layer, a hidden layer, and an output layer whose units y_1, ..., y_s are divided into K groups.]

The output layer of the sparse-group DBN is divided into K groups, and the probability of the output layer P_ol(·) is defined by

P_{\mathrm{ol}}(v^{(k)}) = \sum_{k=1}^{K} \sqrt{\sum_{s \in S} P(y_s = 1 \mid v^{(k)})^2}.    (15)

We add a regularization constant λ and P_ol(v^{(k)}) to the optimization function, which is the maximum likelihood estimate of the energy function of an RBM. The optimization function is defined by

\max_{W, b, c} \sum \log P(v^{(k)}) - \lambda \sum_{k=1}^{K} \sqrt{\sum_{s} P(y_s = 1 \mid v^{(k)})^2}.    (16)

The gradient of the RBM weights is modified accordingly, and ∇W_{ij}^{(k)} is defined by

\nabla W_{ij}^{(k)} = E[v_i^{(k)} h_j^{(k)}]_{\mathrm{data}} - E[v_i^{(k)} h_j^{(k)}]_{\mathrm{Gibbs}} - \lambda \cdot \alpha,    (17)

where

\alpha = \frac{\partial P_S(v^{(k)})}{\partial W_{ij}^{(k)}} = \frac{P(y_s = 1 \mid v^{(k)})}{P_S(v^{(k)})} \cdot \frac{\partial P(y_s = 1 \mid v^{(k)})}{\partial W_{ij}^{(k)}} = \frac{P(y_s = 1 \mid v^{(k)})^2 \cdot P(y_s = 0 \mid v^{(k)}) \cdot v^{(k)}}{P_S(v^{(k)})}.

Accordingly, the gradients (∇a^{(k)}, ∇b^{(k)}) are defined by

\nabla a_i^{(k)} = E[v_i^{(k)}]_{\mathrm{data}} - E[v_i^{(k)}]_{\mathrm{Gibbs}} - \lambda \cdot \alpha,
\nabla b_j^{(k)} = E[h_j^{(k)}]_{\mathrm{data}} - E[h_j^{(k)}]_{\mathrm{Gibbs}} - \lambda \cdot \alpha.    (18)
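For intuition, the following NumPy sketch computes the group probabilities of (14) and the resulting penalty term of (16) for one sample, assuming the hidden activation probabilities P(y_s = 1 | v^(k)) have already been computed; the contiguous grouping of units and the value of λ are illustrative assumptions.

import numpy as np

def group_probabilities(p_hidden: np.ndarray, n_groups: int) -> np.ndarray:
    """P_S(v) = sqrt(sum_{s in S} P(y_s = 1 | v)^2) for each of the K groups."""
    groups = np.array_split(p_hidden, n_groups)   # contiguous groups of output units
    return np.array([np.sqrt(np.sum(g ** 2)) for g in groups])

def sparse_group_penalty(p_hidden: np.ndarray, n_groups: int, lam: float) -> float:
    """lambda * sum_k P_S(v), the term subtracted from the log-likelihood in Eq. (16)."""
    return lam * group_probabilities(p_hidden, n_groups).sum()

# Toy usage: 3000 output units split into K = 200 groups, as in Section 5.3.
p = np.random.rand(3000)           # stand-in for P(y_s = 1 | v^(k))
print(sparse_group_penalty(p, n_groups=200, lam=0.1))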

4.3.2. Similarity Measure of DL-SP. Single-Pass is a partitional clustering algorithm. The first document is treated as the first cluster in Single-Pass, and similarity is computed between a new document and the existing clusters, which decides whether the new document joins an existing cluster or creates a new cluster in terms of a specified threshold. The output of the sparse-group DBN is binary, so Euclidean distance and cosine angle distance are not suitable for similarity measure in DL-SP. Therefore, we use coverage rate (CoR) for similarity measure, and CoR is defined by

\mathrm{CoR}(C, d) = \frac{|C \cap d|}{|C|},    (19)

where C = (c_1, c_2, ..., c_n) is the feature vector of a cluster (named the topic feature vector) and d = (d_1, d_2, ..., d_n) is the feature vector of the new document.
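A minimal sketch of this coverage-rate computation for binary vectors is shown below; the vector names and the decision threshold are illustrative.

import numpy as np

def coverage_rate(topic: np.ndarray, doc: np.ndarray) -> float:
    """CoR(C, d) = |C intersect d| / |C| for binary feature vectors, as in Eq. (19)."""
    on = topic.astype(bool)
    if not on.any():
        return 0.0
    return float(np.logical_and(on, doc.astype(bool)).sum()) / float(on.sum())

# Toy usage inside Single-Pass: join the closest cluster if CoR exceeds a threshold.
c = np.array([1, 0, 1, 1, 0, 1])
d = np.array([1, 0, 0, 1, 0, 1])
print(coverage_rate(c, d))   # 0.75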

Moreover, the addition of many text documents to clusters has an influence on the topic feature vector. In our work, we introduce an optional topic feature vector C' = (c'_1, c'_2, ..., c'_n) and the weight of the feature vector to solve this problem. We provide an example of the optional topic feature vector in Figure 6.

When the weight of the optional topic feature vector is greater than a specified threshold in each time interval, we replace the topic feature vector with the optional topic feature vector as the new cluster center. The weight of the topic feature vector is defined by

w_{C'} = \frac{\sum_{C} f(c_i)}{\sum_{C} f(c_i) + \sum_{C'} f(c'_j)} - \lambda e^{k(t - t_0)},    (20)

where \lambda e^{k(t - t_0)} is the time damping function and f(c_i) is the frequency function.
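The following sketch evaluates (20) directly, assuming the frequency function f(·) simply returns observed counts for the entries of C and C'; the parameter names lam, k, t, and t0 are placeholders for the unspecified constants of the time damping term.

import math

def topic_vector_weight(freq_c, freq_c_opt, lam, k, t, t0):
    """w_C' = sum_C f(c_i) / (sum_C f(c_i) + sum_C' f(c'_j)) - lam * exp(k * (t - t0))."""
    total_c = sum(freq_c)
    total_opt = sum(freq_c_opt)
    return total_c / (total_c + total_opt) - lam * math.exp(k * (t - t0))

# Toy usage: the cluster center is replaced when this weight crosses the specified threshold.
print(topic_vector_weight([5, 3, 2], [4, 4, 1], lam=0.1, k=0.01, t=10, t0=0))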

5. Experimental Analysis

In this section, we conduct three sets of experiments to validate the effectiveness of the proposed approach, including the efficiency of FPMAX-RS in related-word set mining, the comparison of feature vectors, and the comparison of DL-SP efficiency. In this work, three Chinese text corpora, TanCorpV1.0, Encyclopedia of China, and Sogou Corpus, are used as the experimental datasets.

5.1. The Efficiency of FPMAX-RS in Related-Word Set Mining. This section compares the running time of FPMAX and FPMAX-RS in related-word set mining.


[Figure 6: An example of optional topic feature vector: (a) the binary text feature d_i; (b) the topic feature c_i; (c) their similarity; (d) the optional topic feature (c_i, c'_i).]

We choose seven categories (museum, property, education, military, car, sport, and health) of text documents from the datasets, and each category has 50 articles. The result of the experiment is shown in Figure 7.

FPMAX generates a larger number of maximal frequent itemsets and traverses all MFI-trees for subset checking, which increases its running time. Compared with FPMAX, FPMAX-RS has higher efficiency when sup_min is smaller.

5.2. The Comparison of Feature Vectors. In this work, we compare the distance among the feature vectors based on tf-idf, FC-VSM [12], and DLVN. We randomly choose two documents from the category museum and one document from each of the other categories, including property, education, and military. The aim of feature extraction is to extract feature vectors that can represent the meaning of text documents; in other words, feature vectors in different categories should have longer distance. Therefore, we compute the Euclidean distance of feature vectors in different categories based on tf-idf, FC-VSM, and DLVN. Table 4 shows the results for different categories of text documents.

In the following experiment, feature vectors are extracted based on tf-idf, FC-VSM, and DLVN, and then k-means is applied for clustering.


Table 4: Euclidean distance comparison.

Category  | Documents            | tf-idf | FC-VSM | DLVN
museum    | museum1 - museum2    | 1302   | 1049   | 917
property  | property - museum1   | 1285   | 1347   | 1359
property  | property - museum2   | 1593   | 1586   | 1687
education | education - museum1  | 1468   | 1461   | 1472
education | education - museum2  | 1139   | 1133   | 1207
military  | military - museum1   | 1556   | 1649   | 1658
military  | military - museum2   | 1369   | 1403   | 1841

[Figure 7: The comparison of running time (ms) of FPMAX and FPMAX-RS as sup_min varies from 0.01 to 0.06.]

We evaluate clustering performance with F_measure. Let D = {d_1, d_2, ..., d_n} be the clustering result and D* = {d*_1, d*_2, ..., d*_n} be the standard dataset. F_measure is defined by

F\_measure = \frac{2 \times P(D, D^*) \times R(D, D^*)}{P(D, D^*) + R(D, D^*)},    (21)

where P(D, D^*) = |D \cap D^*| / |D| is precision and R(D, D^*) = |D \cap D^*| / |D^*| is recall.
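For clarity, a small Python sketch of (21) is given below; it assumes the overlap |D ∩ D*| is available as a count of correctly grouped documents, which is one reasonable reading of the set notation above.

def f_measure(n_overlap: int, n_result: int, n_standard: int) -> float:
    """F = 2PR / (P + R) with P = overlap/|D| and R = overlap/|D*|, as in Eq. (21)."""
    precision = n_overlap / n_result
    recall = n_overlap / n_standard
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy usage: 280 correctly grouped documents out of 350 clustered and 400 labeled.
print(f_measure(n_overlap=280, n_result=350, n_standard=400))   # about 0.747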

Because seven categories of text documents are chosen in our experiment, the specified number of clusters k is 7. Figure 8 illustrates that feature vectors based on DLVN have better performance.

5.3. The Comparison of DL-SP Efficiency. In this experiment, we choose text documents from the datasets, and the number of documents in each category is listed in Table 5.

The aim of the experiment is to compare DL-SP with LSI and Single-Pass.

[Figure 8: The comparison of F_measure of tf-idf, FC-VSM, and DLVN over the categories sport, military, museum, health, property, education, and car.]

Table 5: The datasets of the experiment.

Category  | Number of text documents
sport     | 1300
military  | 1500
health    | 1400
property  | 900
education | 800
car       | 500

The sparse-group DBN has 3 layers, and the number of units in each layer is 4223, 3500, and 3000, respectively. In addition, the group number K of the top layer is 200. The structure of the sparse-group DBN is shown in Figure 9.

The experimental result is shown in Figure 10. DL-SP has better performance than LSI and Single-Pass in the categories sport, military, property, education, and health. However, the F_measure of DL-SP is lower than that of LSI and Single-Pass in the category car.


[Figure 9: The structure of the sparse-group DBN used in the experiment: a visible layer of 4223 units, a hidden layer of 3500 units, and an output layer of 3000 units divided into K = 200 groups.]

[Figure 10: The comparison of F_measure of Single-Pass, LSI, and DL-SP over the categories sport, military, health, property, education, and car.]

Table 6: The running time of DL-SP and Single-Pass.

Method      | Dimensionality of feature vectors | Running time (s)
Single-Pass | 4223                              | 3866
DL-SP       | 3000                              | 1084

This is because the smaller number of documents in that category does not train the sparse-group DBN effectively.

In this subsection, we also compare the running time of DL-SP and Single-Pass, and the result is listed in Table 6.

6. Conclusions

In this paper, we propose an approach named DLVN for text clustering. The existing term frequency-based methods only count the number of words, and the relations of words are not considered in feature extraction. Our approach constructs a vocabulary network to mine the importance of words using the related-word set, which contains the "cooccurrence" relations of words. Therefore, the text features of documents in the same category have shorter distance, and feature vectors have longer distance among different categories. Moreover, we employ sparse-group DBN to reduce the dimensionality of feature vectors in terms of the group relations of words. Thus, sparse-group DBN can retain the word dependency in dimensionality reduction. In the experiments, we compare the approach with well-known methods to verify our work, and the results show the performance of DLVN.

In the current work, we verify the approach using Chinese corpora. We will use English text to prove the effectiveness of the approach in future work. Moreover, in the process of dimension reduction, we need to train the sparse-group DBN using a large amount of text documents to improve its performance.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work has been supported by Projects U1536116 and U1636208 funded by the National Natural Science Foundation of China (NSFC).

References

[1] A. Trabelsi and O. R. Zaïane, "Extraction and clustering of arguing expressions in contentious text," Data and Knowledge Engineering, vol. 100, pp. 226–239, 2015.

[2] K. Schouten and F. Frasincar, "Survey on aspect-level sentiment analysis," IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 3, pp. 813–830, 2016.

[3] M. Tsytsarau and T. Palpanas, "Survey on mining subjective data on the web," Data Mining and Knowledge Discovery, vol. 24, no. 3, pp. 478–514, 2012.


[4] S.-J. Lee and J.-Y. Jiang, "Multilabel text categorization based on fuzzy relevance clustering," IEEE Transactions on Fuzzy Systems, vol. 22, no. 6, pp. 1457–1471, 2014.

[5] P. Wang, B. Xu, J. Xu, G. Tian, C.-L. Liu, and H. Hao, "Semantic expansion using word embedding clustering and convolutional neural network for improving short text classification," Neurocomputing, vol. 174, pp. 806–814, 2016.

[6] W. Zhang, X. Tang, and T. Yoshida, "TESC: an approach to TExt classification using Semi-supervised Clustering," Knowledge-Based Systems, vol. 75, pp. 152–160, 2015.

[7] A. B. Al-Saleh and M. E. B. Menai, "Automatic Arabic text summarization: a survey," Artificial Intelligence Review, vol. 45, no. 2, pp. 203–234, 2016.

[8] F. Atefeh and W. Khreich, "A survey of techniques for event detection in Twitter," Computational Intelligence, vol. 31, no. 1, pp. 132–164, 2015.

[9] G. Stilo and P. Velardi, "Efficient temporal mining of micro-blog texts and its application to event discovery," Data Mining and Knowledge Discovery, vol. 30, no. 2, pp. 372–402, 2016.

[10] G. Huang, J. He, Y. Zhang et al., "Mining streams of short text for analysis of world-wide event evolutions," World Wide Web, vol. 18, no. 5, pp. 1201–1217, 2014.

[11] U. Erra, S. Senatore, F. Minnella, and G. Caggianese, "Approximate TF-IDF based on topic extraction from massive message stream using the GPU," Information Sciences, vol. 292, pp. 143–161, 2015.

[12] C. Qimin, G. Qiao, W. Yongliang, and W. Xianghua, "Text clustering using VSM with feature clusters," Neural Computing and Applications, vol. 26, no. 4, pp. 995–1003, 2015.

[13] J. Martinez-Gil, "An overview of textual semantic similarity measures based on web intelligence," Artificial Intelligence Review, vol. 42, no. 4, pp. 935–943, 2012.

[14] K. K. Bharti and P. K. Singh, "Hybrid dimension reduction by integrating feature selection with feature extraction method for text clustering," Expert Systems with Applications, vol. 42, no. 6, pp. 3105–3114, 2015.

[15] L. Yue, W. Zuo, T. Peng, Y. Wang, and X. Han, "A fuzzy document clustering approach based on domain-specified ontology," Data and Knowledge Engineering, vol. 100, pp. 148–166, 2015.

[16] T. Wei, Y. Lu, H. Chang, Q. Zhou, and X. Bao, "A semantic approach for text clustering using WordNet and lexical chains," Expert Systems with Applications, vol. 42, no. 4, pp. 2264–2275, 2015.

[17] L. Bing, S. Jiang, W. Lam, Y. Zhang, and S. Jameel, "Adaptive concept resolution for document representation and its applications in text mining," Knowledge-Based Systems, vol. 74, no. 1, pp. 1–13, 2015.

[18] R. Irfan, C. K. King, D. Grages et al., "A survey on text mining in social networks," Knowledge Engineering Review, vol. 30, no. 2, pp. 157–170, 2015.

[19] N. Indurkhya, "Emerging directions in predictive text mining," Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 5, no. 4, pp. 155–164, 2015.

[20] M. T. Mills and N. G. Bourbakis, "Graph-based methods for natural language processing and understanding: a survey and analysis," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 44, no. 1, pp. 59–71, 2014.

[21] H. Mousavi, D. Kerr, M. Iseli, and C. Zaniolo, "Mining semantic structures from syntactic structures in free text documents," in Proceedings of the 8th IEEE International Conference on Semantic Computing (ICSC '14), pp. 84–91, IEEE, Newport Beach, Calif, USA, June 2014.

[22] S. Jun, S.-S. Park, and D.-S. Jang, "Document clustering method using dimension reduction and support vector clustering to overcome sparseness," Expert Systems with Applications, vol. 41, no. 7, pp. 3204–3212, 2014.

[23] W. Z. Zhu and R. B. Allen, "Document clustering using the LSI subspace signature model," Journal of the American Society for Information Science and Technology, vol. 64, no. 4, pp. 844–860, 2013.

[24] X. Wu, X. Chen, X. Li, L. Zhou, and J. Lai, "Adaptive subspace learning: an iterative approach for document clustering," Neural Computing & Applications, vol. 25, no. 2, pp. 333–342, 2014.

[25] H. Kriegel and E. Ntoutsi, "Clustering high dimensional data," ACM SIGKDD Explorations Newsletter, vol. 15, no. 2, pp. 1–8, 2014.

[26] K. K. Bharti and P. K. Singh, "Opposition chaotic fitness mutation based adaptive inertia weight BPSO for feature selection in text clustering," Applied Soft Computing Journal, vol. 43, pp. 20–34, 2016.

[27] M. C. N. Barioni, H. Razente, A. M. R. Marcelino, A. J. M. Traina, and C. Traina, "Open issues for partitioning clustering methods: an overview," Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 4, no. 3, pp. 161–177, 2014.

[28] J.-P. Mei and L. Chen, "Proximity-based k-partitions clustering with ranking for document categorization and analysis," Expert Systems with Applications, vol. 41, no. 16, pp. 7095–7105, 2014.

[29] V. Tunali, T. Bilgin, and A. Camurcu, "An improved clustering algorithm for text mining: multi-cluster spherical K-means," International Arab Journal of Information Technology, vol. 13, no. 1, pp. 12–19, 2016.

[30] Y. Li, C. Luo, and S. M. Chung, "A parallel text document clustering algorithm based on neighbors," Cluster Computing, vol. 18, no. 2, pp. 933–948, 2015.

[31] F. Murtagh and P. Contreras, "Algorithms for hierarchical clustering: an overview," Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 2, no. 1, pp. 86–97, 2012.

[32] T. Peng and L. Liu, "A novel incremental conceptual hierarchical text clustering method using CFu-tree," Applied Soft Computing, vol. 27, pp. 269–278, 2015.

[33] Q. Chen, J. F. Lu, and H. Zhang, "A text mining model based on improved density clustering algorithm," in Proceedings of the 4th IEEE International Conference on Electronics Information and Emergency Communication (ICEIEC '13), Beijing, China, November 2013.

[34] S. F. Hussain, M. Mushtaq, and Z. Halim, "Multi-view document clustering via ensemble method," Journal of Intelligent Information Systems, vol. 43, no. 1, pp. 81–99, 2014.

[35] A. Wahid, X. Gao, and P. Andreae, "Multi-view clustering of web documents using multi-objective genetic algorithm," in Proceedings of the IEEE Congress on Evolutionary Computation (CEC '14), pp. 2625–2632, Beijing, China, July 2014.

[36] X. Pei, T. Wu, and C. Chen, "Automated graph regularized projective nonnegative matrix factorization for document clustering," IEEE Transactions on Cybernetics, vol. 44, no. 10, pp. 1821–1831, 2014.

[37] M. Lu, X.-J. Zhao, L. Zhang, and F.-Z. Li, "Semi-supervised concept factorization for document clustering," Information Sciences, vol. 331, pp. 86–98, 2016.

[38] C.-K. Yau, A. Porter, N. Newman, and A. Suominen, "Clustering scientific documents with topic modeling," Scientometrics, vol. 100, no. 3, pp. 767–786, 2014.


[39] Y. Ma, Y. Wang, and B. Jin, "A three-phase approach to document clustering based on topic significance degree," Expert Systems with Applications, vol. 41, no. 18, pp. 8203–8210, 2014.

[40] C. C. Aggarwal, Y. Zhao, and P. S. Yu, "On the use of side information for mining text data," IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 6, pp. 1415–1429, 2014.

[41] R. M. Marcacini, M. A. Domingues, E. R. Hruschka, and S. O. Rezende, "Privileged information for hierarchical document clustering: a metric learning approach," in Proceedings of the 22nd International Conference on Pattern Recognition (ICPR '14), pp. 3636–3641, August 2014.

[42] L. Cagnina, M. Errecalde, D. Ingaramo, and P. Rosso, "An efficient particle swarm optimization approach to cluster short texts," Information Sciences, vol. 265, pp. 36–49, 2014.

[43] W. Song, Y. Qiao, S. C. Park, and X. Qian, "A hybrid evolutionary computation approach with its application for optimizing text document clustering," Expert Systems with Applications, vol. 42, no. 5, pp. 2517–2524, 2015.

[44] R. Forsati, A. Keikha, and M. Shamsfard, "An improved bee colony optimization algorithm with an application to document clustering," Neurocomputing, vol. 159, no. 1, pp. 9–26, 2015.

[45] K. K. Bharti and P. K. Singh, "Chaotic gradient artificial bee colony for text clustering," Soft Computing, vol. 20, no. 3, pp. 1113–1126, 2016.

[46] F. Wang and J. Sun, "Survey on distance metric learning and dimensionality reduction in data mining," Data Mining and Knowledge Discovery, vol. 29, no. 2, pp. 534–564, 2014.

[47] Y.-S. Lin, J.-Y. Jiang, and S.-J. Lee, "A similarity measure for text classification and clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 7, pp. 1575–1590, 2014.

[48] M. T. Hassan, A. Karim, J.-B. Kim, and M. Jeon, "CDIM: document clustering by discrimination information maximization," Information Sciences, vol. 316, pp. 87–106, 2015.

[49] M. T. Hassan and A. Karim, "Clustering and understanding documents via discrimination information maximization," in Proceedings of the Pacific-Asia Conference on Advances in Knowledge Discovery & Data Mining (PAKDD '12), Kuala Lumpur, Malaysia, May 2012.

[50] D. Cai and C. J. van Rijsbergen, "Learning semantic relatedness from term discrimination information," Expert Systems with Applications, vol. 36, no. 2, pp. 1860–1875, 2009.

[51] G. Grahne and J. Zhu, "High performance mining of maximal frequent itemsets," in Proceedings of the SIAM Workshop on High Performance Data Mining: Pervasive and Data Stream Mining (HPDM:PDS '03), San Francisco, Calif, USA, May 2003.

[52] H. Luo, R. Shen, and C. Niu, "Sparse group restricted Boltzmann machines," in Proceedings of the 25th AAAI Conference on Artificial Intelligence (AAAI '11), San Francisco, Calif, USA, August 2011.

[53] S. Pado and M. Lapata, "Dependency-based construction of semantic space models," Computational Linguistics, vol. 33, no. 2, pp. 161–199, 2007.

Submit your manuscripts athttpswwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 7: A Novel Text Clustering Approach Using Deep-Learning ...downloads.hindawi.com/journals/mpe/2017/8310934.pdf · A Novel Text Clustering Approach Using Deep-Learning Vocabulary Network

Mathematical Problems in Engineering 7

Procedure FPMAX-RS(T)Input T (an FP-tree) covminGlobal

MFIT an MFI-treeHead a linked list of items

Output The MFIT that contains all MFIrsquosMethod(1) if 119879 only contains a single path P(2) if cov(Head cup 119875MFI) gt covmin(3) combine MFI-tree to this path(4) else(5) insert Head cup 119875 into MFIT(6) else for each 119894 in Header-table of T(7) append 119894 to Head(8) construct the Head-pattern base(9) Tail = frequent items in base(10) subset_checking (Head cup Tail)(11) if Head cup Tail is not in MFI-tree(12) construct the FP-tree 119879Head(13) call FPMAX-RS(119879Head)(14) remove 119894 from Head

Algorithm 2 FPMAX-RS

in TongYiCi CiLin The semantic similarity of word groupssim(119894 119895) is defined as

sim (119894 119895) = depth (119894 119895)119897 times TN minus Dis (119894 119895) + 1

TN (11)

where depth(119894 119895) is the depth of the first common father node119897 is the depth of 119894 and 119895 TN is the total number of wordgroups and Dis(119894 119895) denotes the distance between 119894 and 119895For example there are two words 119888119886119903 119908ℎ119890119890119897 and the wordgroup codes of 119888119886119903 119908ℎ119890119890119897 are Bo21A Bo25119861 Because twonodes are in fourth layer the first common father node is119861119900 which is in the second layer In addition the fourth layercontains 4223 word groups and Dis(119894 119895) of Bo21A Bo25119861 is14 Therefore sim(Bo21A Bo25B) is calculated as follows

sim(11986111990021119860 11986111990025119861) = 24 times 4223 minus 14 + 14223 (12)

The nodes in the vocabulary network are traversed and anedge between 119894 and 119895 is added when sim(119894 119895) gt simmin (thespecified threshold)

In addition we add an edge between two nodes if anMFI in related-word set includes the words and each MFIin related-word set is a word set with cooccurrence relationsIn fact the meaning of words in an MFI is not similarand an MFI includes a group of words cooccurring in thesame topic documents When a text document has the wordsin an MFI the text document has a high probability ofbelonging to certain topic Therefore we add an edge intothe vocabulary network with low-frequency word pointing tohigh-frequency word

423 The Extraction of Feature Vectors In the vocabularynetwork the number and the direction of edges reflect the

importance of nodes which is similar to evaluating theimportance of webpagesThus PageRank is utilized to obtainthe importance of nodes and the initial value PR119894 of nodes isdefined by

PR119894 = 119891119894sum119873119895=1 119891119895 (13)

where 119891119894 is the frequency of word groups After iterativecomputation and normalization of PR119894 we use the PageRankscores of nodes as the feature vectors of text documentsinstead of term frequency in this paper

43 Deep-Learning Single-Pass (DL-SP) In this paper sparse-group DBN is proposed for dimensionality reduction offeature vectors DBN is a model of deep learning Luo et al[52] found that the units of hidden layers exhibited statisticaldependencies and proposed a regularization constant torestrict the relations in hidden layers Due to the sparsity offeature vectors we combine the word dependencies andDBNto propose a sparse-groupDBN for dimensionality reductionIn addition coverage rate (CoR) is proposed for similaritymeasure among feature vectors in DL-SP

431 Sparse-Group DBN Deep learning simulates the pro-cess of human thinking and the result of deep learning is thedistributed representation of an input vector By analyzingfeature vectors extracted from the vocabulary network wefind that there exists statistical dependency between entries offeature vectors whichmeans the entries of feature vectorswillcooccur in the part of feature vectors The word dependencyis also mentioned by many researchers in previous literatures[5 18 53] Cooccurrence relations are typically collectedin feature vectors which means a unique word commonlyreferring to ldquotarget wordrdquo and the word dependency isquantified to measure words similarity in text clustering Weprovide an example which is the part of a feature vector inTable 3

Because the documents in the same topic usually includerelated words a part of units in visible layer is activesimultaneously and accordingly the documents in differenttopics usually activate different part of units Based onthis observation we add a regularization constant to thelog-likelihood of training data to retain these relations Inexperiments we use different topic documents to train thesparse-group DBN The sequence of units in output layer isadjusted accordingly and the cooccurring units are dividedinto one group In other words the feature vectors of differenttopic documents can activate different group of units inoutput layer The structure of sparse-group DBN is shown inFigure 5

Sparse-group DBN is comprised of several RBMs andtwo adjacent layers are an RBMs For retaining the depen-dency of the units in output layer we define the activationprobability of each group Given a group 119878 = 1199101 1199102 119910119904and training sample V(119896) the group probability 119875119878(sdot) is givenby

119875119878 (V(119896)) = radicsum119904isin119878

119875 (119910119904 = 1 | V(119896))2 (14)

8 Mathematical Problems in Engineering

Table 3 The word dependencies of a feature vector

Hj47A01 Hg19C01 Ba03A18 Ba08A07 Dm05A01 Hg01A01 Ae13B01032 017 012 004 0 002 0017 021 007 011 0 0 0014 023 017 009 0 0 0023 012 006 014 0 001 00 0 0 0 012 034 0200 0 0 0 024 021 0100 0 0 0 013 014 009

Visible layer

Hidden layer

Output layery1 y2 ys y1 y2 ys y1 y2 ysmiddot middot middot middot middot middot middot middot middot middot middot middot

middot middot middot

middot middot middot

middot middot middot

h1 h2 h3 hj

Wh

b1 b2 b3 bj

a1 a2 a3 a4 ai1 2 3 4 i

1 2 K

Figure 5 The structure of sparse-group DBN

The output layer of the sparse-group DBN is divided into119870 groups and the probability of output layer 119875ol(sdot) is definedby

119875ol (V(119896)) = 119870sum119896=1

radicsum119904isin119878

119875 (119910119904 = 1 | V(119896))2 (15)

We add a regularization constant 120582 and 119875ol(V(119896)) to opti-mization function which is maximum likelihood estimateof energy function of an RBM The optimization function isdefined by

max119882119887119888

sum log119875 (V(119896)) minus 120582 119870sum119896=1

radicsum119875(119910119904 = 1 | V(119896))2 (16)

Equation (11) is improved to (21) accordingly and nabla119882(119896)119894119895 isdefined by

nabla119882(119896)119894119895 = 119864 [V(119896)119894 ℎ(119896)119895 ]data

minus 119864 [V(119896)119894 ℎ(119896)119895 ]Gibbs

minus 120582 sdot 120572 (17)

where 120572 = (120597120597(119882(119896)119894119895 ))119875119878(V(119896)) = 119875(119910119904 = 1 | V(119896))119875119878(V(119896)) sdot (120597120597(119882(119896)119894119895 ))119875(119910119904 = 1 | V(119896)) = (119875(119910119904 = 1 | V(119896))2 sdot 119875(119910119904 = 0 |V(119896)) sdot V(119896))119875119878(V(119896))Accordingly the gradient of (nabla119886(119896) nabla119887(119896))is defined by

nabla119886(119896)119894 = 119864 [V(119896)119894 ]data minus 119864 [V(119896)119894 ]Gibbs minus 120582 sdot 120572nabla119887(119896)119895 = 119864 [ℎ(119896)119895 ]

dataminus 119864 [ℎ(119896)119895 ]

Gibbsminus 120582 sdot 120572 (18)

432 Similarity Measure of DL-SP Single-Pass is a parti-tional clustering algorithm The first document is treated asthe first cluster in Single-Pass and similarity is computedbetween new document and existing clusters which decidesnew document to join the existing cluster or to create anew cluster in terms of specified threshold The output ofsparse-group DBN is binary and Euclidean distance andCosine angle distance are not suitable for similarity measureinDL-SPTherefore we use coverage rate (CoR) for similaritymeasure and CoR is defined by

CoR (119862 119889) = |119862 cap 119889|119862 (19)

where 119862 = (1198881 1198882 119888119899) is the feature vector of a cluster(named topic feature vector) and 119889 = (1198891 1198892 119889119899) is thefeature vector of new document

Moreover the addition of many text documents to clus-ters has an influence on topic feature vector In our work weintroduce optional topic feature vector 1198621015840 = (11988810158401 11988810158402 1198881015840119899)and the weight of feature vector to solve this problemWe provide an example of optional topic feature vector inFigure 6

When theweight of optional topic feature vector is greaterthan a specified threshold in each time interval we replacetopic feature vector with optional topic feature vector as newcluster center The weight of topic feature vector is definedby

1199081198621015840 = sum119862119891 (119888119894)sum119862119891 (119888119894) + sum1198621015840 119891 (1198881015840119895) minus 120582119890119896(119905minus1199050) (20)

where 120582119890119896(119905minus1199050) is time damping function and 119891(119888119894) is fre-quency function

5 Experimental Analysis

In this section we conduct three sets of experiments tovalidate the effectiveness of the proposed approach includingthe efficiency of FPMAX-RS in related-word set miningthe comparison of feature vectors and the comparison ofDL-SP efficiency In this work three Chinese text corporaTanCorpV10 Encyclopedia of China and Sogou Corpus areused as the experimental datasets

51 The Efficiency of FPMAX-RS in Related-Word Set MiningThis section is to compare running time of FPMAX and

Mathematical Problems in Engineering 9

Text feature

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 001 1 1 1 11 1 1 1 1 1 1 1 1 1 1 1

di

(a)

Topic feature

00000000000000000000000

ci

c6 c8 c9 c10 c18 c21 c23 c24 c25 c27 c32 c33 c34 c35 c36 c37

(b)

Sim

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 01 1 1 1 1 1 1 1 1 1 1 1 1 1

(c)

Optional topic feature

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

(ci c㰀i )

c6 c8 c9 c10 c21 c22 c23 c24 c25 c27 c32 c33 c34 c35 c36 c37c13 c15

(d)

Figure 6 An example of optional topic feature vector

FPMAX-RS in related-word set mining We choose sevencategories (museum property education military car sportand health) of text documents from the datasets and eachcategory has 50 articles The result of experiment is listed inFigure 7

FPMAX generates a larger amount of maximal frequentitemsets and traverses all MFI-trees for subset checkingwhich has an influence on the running time of FPMAXCompared with FPMAX FPMAX-RS has higher efficiencywhen supmin is smaller

52 The Comparison of Feature Vectors In this work wecompare the distance among the feature vectors based on

tf-idf FC-VSM [12] and DLVN We randomly choose twodocuments from the categorymuseum and one document inother categories including property education and militaryThe aim of feature extraction is to extract the feature vectorsthat can represent the meaning of text documents In otherwords feature vectors in different categories have longerdistance Therefore we compute the Euclidean distanceof feature vectors in different categories based on tf-idfFC-VSM and DLVN Table 4 shows the results in differentcategories of text documents

In the following experiment feature vectors are extractedbased on tf-idf FC-VSM and DLVN Then k-means isapplied for clustering We evaluate clustering performance

10 Mathematical Problems in Engineering

Table 4 Euclidean distance comparison

Category Documents Distancetf-idf FC-VSM DLVN

museum museum1- museum2 1302 1049 917

property property - museum1 1285 1347 1359property - museum2 1593 1586 1687

education education - museum1 1468 1461 1472education - museum2 1139 1133 1207

military military - museum1 1556 1649 1658military - museum2 1369 1403 1841

FPMAXFPMAX-RS

0

1000

2000

3000

4000

5000

6000

7000

8000

9000

10000

Tim

e (m

s)

002 003 004 005 006001supmin

Figure 7 The comparison of running time

withF_measure LetD = 1198891 1198892 119889119899 be clustering resultand Dlowast = 119889lowast1 119889lowast2 119889lowast119899 be standard dataset F_measureis defined by

F_measure = 2 timesP (DDlowast) timesR (DDlowast)P (DDlowast) +R (DDlowast) (21)

where P(DDlowast) = |D cap Dlowast||D| is precision and R(DDlowast) = |D capDlowast||Dlowast| is recall

Because seven categories of text documents are chosenin our experiment the specified number of clusters 119896 is 7Figure 8 illustrates that feature vectors based on DLVN havebetter performance

53 The Comparison of DL-SP Efficiency In this experimentwe choose text documents from the datasets and the numberof each category is listed in Table 5

The aim of the experiment is to compare DL-SP with LSIand Single-Pass The sparse-group DBN has 3 layers and the

Spor

t

Mili

tary

Mus

eum

Hea

lth

Prop

erty

Educ

atio

n

Car

DLVN FC-VSM

ℱ_m

easu

re

0

01

02

03

04

05

06

07

08

09

10

tf-idf

Figure 8 The comparison ofF_measure

Table 5 The datasets of experiment

Category The number of text documentssport 1300military 1500health 1400property 900education 800car 500

number of each layer is 4223 3500 and 3000 In addition thegroup number 119870 of top layer is 200 The structure of sparse-group DBN is shown in Figure 9

The experimental result is shown in Figure 10 DL-SP hasbetter performance than LSI and Single-Pass in sport mili-tary property education and health HoweverF_measure ofDL-SP is lower than LSI and Single-Pass in category car due

Mathematical Problems in Engineering 11

Visible layer

Hidden layer

Output layery1 y2 y15 y16 y17 y30 y2986 y2987 y3000middot middot middot middot middot middot middot middot middot middot middot middot

middot middot middotmiddot middot middot

middot middot middot middot middot middot

h1 h2 h3 hj h3500

Wh

b1 b2 b3 bj b3500

a1 a2 a3 a4 ai1 2 3 4 i

a42234223

K = 1 K = 2 K = 200

Figure 9 The structure of sparse-group DBN

02

04

06

08

10

00

Spor

t

Mili

tary

Hea

lth

Prop

erty

Educ

atio

n

Car

Single-Pass

DL-SPLSI

ℱ_m

easu

re

Figure 10 The comparison of DL-SP

Table 6 The running time of DL-SP and Single-Pass

The dimensionality offeature vectors

Running time (s)

Single-Pass 4223 3866DL-SP 3000 1084

to the smaller number of documents not training the sparse-group DBN effectively

In this subsection we compare the running time of DL-SP and Single-Pass and the result is listed in Table 6

6 Conclusions

In this paper we propose an approach DLVN for textclustering The existing term frequency-based methods onlycalculate the number of words but the relations of words arenot considered in feature extractionThe approach constructsvocabulary network to mine the importance of words usingrelated-word set which contains ldquocooccurrencerdquo relationsof words Therefore the text features of documents in thesame category have shorter distance and feature vectorshave longer distance among different categories Moreoverwe employ sparse-group DBN to reduce the dimensionalityof feature vectors in terms of the group relations of wordsThus sparse-group DBN can retain the word dependency indimensionality reduction In the experiments we comparethe approach with well-known methods to verify our workand the results show the performance of DLVN

In current work we verify the approach using Chinesecorpora We will use English text to prove the approacheffectiveness in the future work Moreover in the processof dimension reduction we need to train the sparse-groupDBN using a large amount of text documents to improve itsperformance

Conflicts of Interest

The authors declare that they have no conflicts of interest

Acknowledgments

This work has been supported by Projects U1536116 andU1636208 funded by National Natural Science Foundation ofChina (NSFC)

References

[1] A Trabelsi and O R Zaıane ldquoExtraction and clustering ofarguing expressions in contentious textrdquo Data and KnowledgeEngineering vol 100 pp 226ndash239 2015

[2] K Schouten and F Frasincar ldquoSurvey on aspect-level sentimentanalysisrdquo IEEE Transactions on Knowledge and Data Engineer-ing vol 28 no 3 pp 813ndash830 2016

[3] MTsytsarau andT Palpanas ldquoSurvey onmining subjective dataon the webrdquo Data Mining and Knowledge Discovery vol 24 no3 pp 478ndash514 2012

12 Mathematical Problems in Engineering

[4] S-J Lee and J-Y Jiang ldquoMultilabel text categorization based onfuzzy relevance clusteringrdquo IEEETransactions on Fuzzy Systemsvol 22 no 6 pp 1457ndash1471 2014

[5] P Wang B Xu J Xu G Tian C-L Liu and H Hao ldquoSemanticexpansion using word embedding clustering and convolutionalneural network for improving short text classificationrdquo Neuro-computing vol 174 pp 806ndash814 2016

[6] W Zhang X Tang and T Yoshida ldquoTESC an approach to TExtclassification using Semi-supervised Clusteringrdquo Knowledge-Based Systems vol 75 pp 152ndash160 2015

[7] A B Al-Saleh and M E B Menai ldquoAutomatic Arabic textsummarization a surveyrdquo Artificial Intelligence Review vol 45no 2 pp 203ndash234 2016

[8] F Atefeh and W Khreich ldquoA survey of techniques for eventdetection in Twitterrdquo Computational Intelligence vol 31 no 1pp 132ndash164 2015

[9] G Stilo and P Velardi ldquoEfficient temporalmining ofmicro-blogtexts and its application to event discoveryrdquo Data Mining andKnowledge Discovery vol 30 no 2 pp 372ndash402 2016

[10] G Huang J He Y Zhang et al ldquoMining streams of short textfor analysis of world-wide event evolutionsrdquo World Wide Webvol 18 no 5 pp 1201ndash1217 2014

[11] U Erra S Senatore F Minnella and G Caggianese ldquoApproxi-mate TF-IDF based on topic extraction from massive messagestream using the GPUrdquo Information Sciences vol 292 pp 143ndash161 2015

[12] C Qimin G Qiao W Yongliang andW Xianghua ldquoText clus-tering using VSM with feature clustersrdquo Neural Computing andApplications vol 26 no 4 pp 995ndash1003 2015

[13] J Martinez-Gil ldquoAn overview of textual semantic similaritymeasures based on web intelligencerdquo Artificial IntelligenceReview vol 42 no 4 pp 935ndash943 2012

[14] K K Bharti and P K Singh ldquoHybrid dimension reduction byintegrating feature selection with feature extraction method fortext clusteringrdquo Expert Systems with Applications vol 42 no 6pp 3105ndash3114 2015

[15] L Yue W Zuo T Peng Y Wang and X Han ldquoA fuzzy docu-ment clustering approach based on domain-specified ontologyrdquoData and Knowledge Engineering vol 100 pp 148ndash166 2015

[16] T Wei Y Lu H Chang Q Zhou and X Bao ldquoA semanticapproach for text clustering using WordNet and lexical chainsrdquoExpert Systems with Applications vol 42 no 4 pp 2264ndash22752015

[17] L Bing S Jiang W Lam Y Zhang and S Jameel ldquoAdaptiveconcept resolution for document representation and its appli-cations in text miningrdquo Knowledge-Based Systems vol 74 no 1pp 1ndash13 2015

[18] R Irfan C K King D Grages et al ldquoA survey on text miningin social networksrdquo Knowledge Engineering Review vol 30 no2 pp 157ndash170 2015

[19] N Indurkhya ldquoEmerging directions in predictive text miningrdquoWiley Interdisciplinary Reviews Data Mining and KnowledgeDiscovery vol 5 no 4 pp 155ndash164 2015

[20] M T Mills and N G Bourbakis ldquoGraph-based methods fornatural language processing and understandingmdasha survey andanalysisrdquo IEEE Transactions on Systems Man and CyberneticsPart C Applications and Reviews vol 44 no 1 pp 59ndash71 2014

[21] HMousavi D Kerr M Iseli and C Zaniolo ldquoMining semanticstructures from syntactic structures in free text documentsrdquoin Proceedings of the 8th IEEE International Conference onSemantic Computing (ICSC rsquo14) pp 84ndash91 IEEE NewportBeach Calif USA June 2014

[22] S Jun S-S Park and D-S Jang ldquoDocument clusteringmethodusing dimension reduction and support vector clustering toovercome sparsenessrdquo Expert Systems with Applications vol 41no 7 pp 3204ndash3212 2014

[23] W Z Zhu and R B Allen ldquoDocument clustering using the LSIsubspace signature modelrdquo Journal of the American Society forInformation Science and Technology vol 64 no 4 pp 844ndash8602013

[24] X Wu X Chen X Li L Zhou and J Lai ldquoAdaptive subspacelearning an iterative approach for document clusteringrdquoNeuralComputing amp Applications vol 25 no 2 pp 333ndash342 2014

[25] H Kriegel and E Ntoutsi ldquoClustering high dimensional datardquoACM SIGKDD Explorations Newsletter vol 15 no 2 pp 1ndash82014

[26] K K Bharti and P K Singh ldquoOpposition chaotic fitness muta-tion based adaptive inertia weight BPSO for feature selectionin text clusteringrdquo Applied Soft Computing Journal vol 43 pp20ndash34 2016

[27] M C N Barioni H Razente A M R Marcelino A J MTraina and C Traina ldquoOpen issues for partitioning clusteringmethods an overviewrdquo Wiley Interdisciplinary Reviews DataMining and Knowledge Discovery vol 4 no 3 pp 161ndash177 2014

[28] J-P Mei and L Chen ldquoProximity-based k-partitions clusteringwith ranking for document categorization and analysisrdquo ExpertSystems with Applications vol 41 no 16 pp 7095ndash7105 2014

[29] V Tunali T Bilgin and A Camurcu ldquoAn improved clusteringalgorithm for text mining multi-cluster spherical K-meansrdquoInternational Arab Journal of Information Technology vol 13 no1 pp 12ndash19 2016

[30] Y Li C Luo and S M Chung ldquoA parallel text documentclustering algorithm based on neighborsrdquo Cluster Computingvol 18 no 2 pp 933ndash948 2015

[31] F Murtagh and P Contreras ldquoAlgorithms for hierarchicalclustering an overviewrdquo Wiley Interdisciplinary Reviews DataMining and Knowledge Discovery vol 2 no 1 pp 86ndash97 2012

[32] T Peng and L Liu ldquoA novel incremental conceptual hierarchicaltext clusteringmethod usingCFu-treerdquoApplied SoftComputingvol 27 pp 269ndash278 2015

[33] Q Chen J F Lu and H Zhang ldquoA text mining model basedon improved density clustering algorithmrdquo in Proceedings of the4th IEEE International Conference on Electronics Informationand Emergency Communication (ICEIEC rsquo13) Beijing ChinaNovember 2013

[34] S FHussainMMushtaq andZHalim ldquoMulti-viewdocumentclustering via ensemble methodrdquo Journal of Intelligent Informa-tion Systems vol 43 no 1 pp 81ndash99 2014

[35] A Wahid X Gao and P Andreae ldquoMulti-view clustering ofweb documents using multi-objective genetic algorithmrdquo inProceedings of the IEEE Congress on Evolutionary Computation(CEC rsquo14) pp 2625ndash2632 Beijing China July 2014

[36] X Pei T Wu and C Chen ldquoAutomated graph regularizedprojective nonnegative matrix factorization for document clus-teringrdquo IEEE Transactions on Cybernetics vol 44 no 10 pp1821ndash1831 2014

[37] M Lu X-J Zhao L Zhang and F-Z Li ldquoSemi-supervisedconcept factorization for document clusteringrdquo InformationSciences vol 331 pp 86ndash98 2016

[38] C-K Yau A Porter NNewman andA Suominen ldquoClusteringscientific documents with topic modelingrdquo Scientometrics vol100 no 3 pp 767ndash786 2014

Mathematical Problems in Engineering 13

[39] Y. Ma, Y. Wang, and B. Jin, “A three-phase approach to document clustering based on topic significance degree,” Expert Systems with Applications, vol. 41, no. 18, pp. 8203–8210, 2014.

[40] C. C. Aggarwal, Y. Zhao, and P. S. Yu, “On the use of side information for mining text data,” IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 6, pp. 1415–1429, 2014.

[41] R. M. Marcacini, M. A. Domingues, E. R. Hruschka, and S. O. Rezende, “Privileged information for hierarchical document clustering: a metric learning approach,” in Proceedings of the 22nd International Conference on Pattern Recognition (ICPR ’14), pp. 3636–3641, August 2014.

[42] L. Cagnina, M. Errecalde, D. Ingaramo, and P. Rosso, “An efficient particle swarm optimization approach to cluster short texts,” Information Sciences, vol. 265, pp. 36–49, 2014.

[43] W. Song, Y. Qiao, S. C. Park, and X. Qian, “A hybrid evolutionary computation approach with its application for optimizing text document clustering,” Expert Systems with Applications, vol. 42, no. 5, pp. 2517–2524, 2015.

[44] R. Forsati, A. Keikha, and M. Shamsfard, “An improved bee colony optimization algorithm with an application to document clustering,” Neurocomputing, vol. 159, no. 1, pp. 9–26, 2015.

[45] K. K. Bharti and P. K. Singh, “Chaotic gradient artificial bee colony for text clustering,” Soft Computing, vol. 20, no. 3, pp. 1113–1126, 2016.

[46] F. Wang and J. Sun, “Survey on distance metric learning and dimensionality reduction in data mining,” Data Mining and Knowledge Discovery, vol. 29, no. 2, pp. 534–564, 2014.

[47] Y.-S. Lin, J.-Y. Jiang, and S.-J. Lee, “A similarity measure for text classification and clustering,” IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 7, pp. 1575–1590, 2014.

[48] M. T. Hassan, A. Karim, J.-B. Kim, and M. Jeon, “CDIM: document clustering by discrimination information maximization,” Information Sciences, vol. 316, pp. 87–106, 2015.

[49] M. T. Hassan and A. Karim, “Clustering and understanding documents via discrimination information maximization,” in Proceedings of the Pacific-Asia Conference on Advances in Knowledge Discovery & Data Mining (PAKDD ’12), Kuala Lumpur, Malaysia, May 2012.

[50] D. Cai and C. J. van Rijsbergen, “Learning semantic relatedness from term discrimination information,” Expert Systems with Applications, vol. 36, no. 2, pp. 1860–1875, 2009.

[51] G. Grahne and J. Zhu, “High performance mining of maximal frequent itemsets,” in Proceedings of the SIAM Workshop on High Performance Data Mining: Pervasive and Data Stream Mining (HPDM:PDS ’03), San Francisco, Calif, USA, May 2003.

[52] H. Luo, R. Shen, and C. Niu, “Sparse group restricted Boltzmann machines,” in Proceedings of the 25th AAAI Conference on Artificial Intelligence (AAAI ’11), San Francisco, Calif, USA, August 2011.

[53] S. Pado and M. Lapata, “Dependency-based construction of semantic space models,” Computational Linguistics, vol. 33, no. 2, pp. 161–199, 2007.

Page 10: A Novel Text Clustering Approach Using Deep-Learning ...downloads.hindawi.com/journals/mpe/2017/8310934.pdf · A Novel Text Clustering Approach Using Deep-Learning Vocabulary Network

10 Mathematical Problems in Engineering

Table 4 Euclidean distance comparison

Category Documents Distancetf-idf FC-VSM DLVN

museum museum1- museum2 1302 1049 917

property property - museum1 1285 1347 1359property - museum2 1593 1586 1687

education education - museum1 1468 1461 1472education - museum2 1139 1133 1207

military military - museum1 1556 1649 1658military - museum2 1369 1403 1841

FPMAXFPMAX-RS

0

1000

2000

3000

4000

5000

6000

7000

8000

9000

10000

Tim

e (m

s)

002 003 004 005 006001supmin

Figure 7 The comparison of running time

withF_measure LetD = 1198891 1198892 119889119899 be clustering resultand Dlowast = 119889lowast1 119889lowast2 119889lowast119899 be standard dataset F_measureis defined by

F_measure = 2 timesP (DDlowast) timesR (DDlowast)P (DDlowast) +R (DDlowast) (21)

where P(DDlowast) = |D cap Dlowast||D| is precision and R(DDlowast) = |D capDlowast||Dlowast| is recall

Because seven categories of text documents are chosenin our experiment the specified number of clusters 119896 is 7Figure 8 illustrates that feature vectors based on DLVN havebetter performance

53 The Comparison of DL-SP Efficiency In this experimentwe choose text documents from the datasets and the numberof each category is listed in Table 5

The aim of the experiment is to compare DL-SP with LSIand Single-Pass The sparse-group DBN has 3 layers and the

Spor

t

Mili

tary

Mus

eum

Hea

lth

Prop

erty

Educ

atio

n

Car

DLVN FC-VSM

ℱ_m

easu

re

0

01

02

03

04

05

06

07

08

09

10

tf-idf

Figure 8 The comparison ofF_measure

Table 5 The datasets of experiment

Category The number of text documentssport 1300military 1500health 1400property 900education 800car 500

number of each layer is 4223 3500 and 3000 In addition thegroup number 119870 of top layer is 200 The structure of sparse-group DBN is shown in Figure 9

The experimental result is shown in Figure 10 DL-SP hasbetter performance than LSI and Single-Pass in sport mili-tary property education and health HoweverF_measure ofDL-SP is lower than LSI and Single-Pass in category car due

Mathematical Problems in Engineering 11

Visible layer

Hidden layer

Output layery1 y2 y15 y16 y17 y30 y2986 y2987 y3000middot middot middot middot middot middot middot middot middot middot middot middot

middot middot middotmiddot middot middot

middot middot middot middot middot middot

h1 h2 h3 hj h3500

Wh

b1 b2 b3 bj b3500

a1 a2 a3 a4 ai1 2 3 4 i

a42234223

K = 1 K = 2 K = 200

Figure 9 The structure of sparse-group DBN

02

04

06

08

10

00

Spor

t

Mili

tary

Hea

lth

Prop

erty

Educ

atio

n

Car

Single-Pass

DL-SPLSI

ℱ_m

easu

re

Figure 10 The comparison of DL-SP

Table 6 The running time of DL-SP and Single-Pass

The dimensionality offeature vectors

Running time (s)

Single-Pass 4223 3866DL-SP 3000 1084

to the smaller number of documents not training the sparse-group DBN effectively

In this subsection we compare the running time of DL-SP and Single-Pass and the result is listed in Table 6

6 Conclusions

In this paper we propose an approach DLVN for textclustering The existing term frequency-based methods onlycalculate the number of words but the relations of words arenot considered in feature extractionThe approach constructsvocabulary network to mine the importance of words usingrelated-word set which contains ldquocooccurrencerdquo relationsof words Therefore the text features of documents in thesame category have shorter distance and feature vectorshave longer distance among different categories Moreoverwe employ sparse-group DBN to reduce the dimensionalityof feature vectors in terms of the group relations of wordsThus sparse-group DBN can retain the word dependency indimensionality reduction In the experiments we comparethe approach with well-known methods to verify our workand the results show the performance of DLVN

In current work we verify the approach using Chinesecorpora We will use English text to prove the approacheffectiveness in the future work Moreover in the processof dimension reduction we need to train the sparse-groupDBN using a large amount of text documents to improve itsperformance

Conflicts of Interest

The authors declare that they have no conflicts of interest

Acknowledgments

This work has been supported by Projects U1536116 andU1636208 funded by National Natural Science Foundation ofChina (NSFC)

References

[1] A Trabelsi and O R Zaıane ldquoExtraction and clustering ofarguing expressions in contentious textrdquo Data and KnowledgeEngineering vol 100 pp 226ndash239 2015

[2] K Schouten and F Frasincar ldquoSurvey on aspect-level sentimentanalysisrdquo IEEE Transactions on Knowledge and Data Engineer-ing vol 28 no 3 pp 813ndash830 2016

[3] MTsytsarau andT Palpanas ldquoSurvey onmining subjective dataon the webrdquo Data Mining and Knowledge Discovery vol 24 no3 pp 478ndash514 2012

12 Mathematical Problems in Engineering

[4] S-J Lee and J-Y Jiang ldquoMultilabel text categorization based onfuzzy relevance clusteringrdquo IEEETransactions on Fuzzy Systemsvol 22 no 6 pp 1457ndash1471 2014

[5] P Wang B Xu J Xu G Tian C-L Liu and H Hao ldquoSemanticexpansion using word embedding clustering and convolutionalneural network for improving short text classificationrdquo Neuro-computing vol 174 pp 806ndash814 2016

[6] W Zhang X Tang and T Yoshida ldquoTESC an approach to TExtclassification using Semi-supervised Clusteringrdquo Knowledge-Based Systems vol 75 pp 152ndash160 2015

[7] A B Al-Saleh and M E B Menai ldquoAutomatic Arabic textsummarization a surveyrdquo Artificial Intelligence Review vol 45no 2 pp 203ndash234 2016

[8] F Atefeh and W Khreich ldquoA survey of techniques for eventdetection in Twitterrdquo Computational Intelligence vol 31 no 1pp 132ndash164 2015

[9] G Stilo and P Velardi ldquoEfficient temporalmining ofmicro-blogtexts and its application to event discoveryrdquo Data Mining andKnowledge Discovery vol 30 no 2 pp 372ndash402 2016

[10] G Huang J He Y Zhang et al ldquoMining streams of short textfor analysis of world-wide event evolutionsrdquo World Wide Webvol 18 no 5 pp 1201ndash1217 2014

[11] U Erra S Senatore F Minnella and G Caggianese ldquoApproxi-mate TF-IDF based on topic extraction from massive messagestream using the GPUrdquo Information Sciences vol 292 pp 143ndash161 2015

[12] C Qimin G Qiao W Yongliang andW Xianghua ldquoText clus-tering using VSM with feature clustersrdquo Neural Computing andApplications vol 26 no 4 pp 995ndash1003 2015

[13] J Martinez-Gil ldquoAn overview of textual semantic similaritymeasures based on web intelligencerdquo Artificial IntelligenceReview vol 42 no 4 pp 935ndash943 2012

[14] K K Bharti and P K Singh ldquoHybrid dimension reduction byintegrating feature selection with feature extraction method fortext clusteringrdquo Expert Systems with Applications vol 42 no 6pp 3105ndash3114 2015

[15] L Yue W Zuo T Peng Y Wang and X Han ldquoA fuzzy docu-ment clustering approach based on domain-specified ontologyrdquoData and Knowledge Engineering vol 100 pp 148ndash166 2015

[16] T Wei Y Lu H Chang Q Zhou and X Bao ldquoA semanticapproach for text clustering using WordNet and lexical chainsrdquoExpert Systems with Applications vol 42 no 4 pp 2264ndash22752015

[17] L Bing S Jiang W Lam Y Zhang and S Jameel ldquoAdaptiveconcept resolution for document representation and its appli-cations in text miningrdquo Knowledge-Based Systems vol 74 no 1pp 1ndash13 2015

[18] R Irfan C K King D Grages et al ldquoA survey on text miningin social networksrdquo Knowledge Engineering Review vol 30 no2 pp 157ndash170 2015

[19] N Indurkhya ldquoEmerging directions in predictive text miningrdquoWiley Interdisciplinary Reviews Data Mining and KnowledgeDiscovery vol 5 no 4 pp 155ndash164 2015

[20] M T Mills and N G Bourbakis ldquoGraph-based methods fornatural language processing and understandingmdasha survey andanalysisrdquo IEEE Transactions on Systems Man and CyberneticsPart C Applications and Reviews vol 44 no 1 pp 59ndash71 2014

[21] HMousavi D Kerr M Iseli and C Zaniolo ldquoMining semanticstructures from syntactic structures in free text documentsrdquoin Proceedings of the 8th IEEE International Conference onSemantic Computing (ICSC rsquo14) pp 84ndash91 IEEE NewportBeach Calif USA June 2014

[22] S Jun S-S Park and D-S Jang ldquoDocument clusteringmethodusing dimension reduction and support vector clustering toovercome sparsenessrdquo Expert Systems with Applications vol 41no 7 pp 3204ndash3212 2014

[23] W Z Zhu and R B Allen ldquoDocument clustering using the LSIsubspace signature modelrdquo Journal of the American Society forInformation Science and Technology vol 64 no 4 pp 844ndash8602013

[24] X Wu X Chen X Li L Zhou and J Lai ldquoAdaptive subspacelearning an iterative approach for document clusteringrdquoNeuralComputing amp Applications vol 25 no 2 pp 333ndash342 2014

[25] H Kriegel and E Ntoutsi ldquoClustering high dimensional datardquoACM SIGKDD Explorations Newsletter vol 15 no 2 pp 1ndash82014

[26] K K Bharti and P K Singh ldquoOpposition chaotic fitness muta-tion based adaptive inertia weight BPSO for feature selectionin text clusteringrdquo Applied Soft Computing Journal vol 43 pp20ndash34 2016

[27] M C N Barioni H Razente A M R Marcelino A J MTraina and C Traina ldquoOpen issues for partitioning clusteringmethods an overviewrdquo Wiley Interdisciplinary Reviews DataMining and Knowledge Discovery vol 4 no 3 pp 161ndash177 2014

[28] J-P Mei and L Chen ldquoProximity-based k-partitions clusteringwith ranking for document categorization and analysisrdquo ExpertSystems with Applications vol 41 no 16 pp 7095ndash7105 2014

[29] V Tunali T Bilgin and A Camurcu ldquoAn improved clusteringalgorithm for text mining multi-cluster spherical K-meansrdquoInternational Arab Journal of Information Technology vol 13 no1 pp 12ndash19 2016

[30] Y Li C Luo and S M Chung ldquoA parallel text documentclustering algorithm based on neighborsrdquo Cluster Computingvol 18 no 2 pp 933ndash948 2015

[31] F Murtagh and P Contreras ldquoAlgorithms for hierarchicalclustering an overviewrdquo Wiley Interdisciplinary Reviews DataMining and Knowledge Discovery vol 2 no 1 pp 86ndash97 2012

[32] T Peng and L Liu ldquoA novel incremental conceptual hierarchicaltext clusteringmethod usingCFu-treerdquoApplied SoftComputingvol 27 pp 269ndash278 2015

[33] Q Chen J F Lu and H Zhang ldquoA text mining model basedon improved density clustering algorithmrdquo in Proceedings of the4th IEEE International Conference on Electronics Informationand Emergency Communication (ICEIEC rsquo13) Beijing ChinaNovember 2013

[34] S FHussainMMushtaq andZHalim ldquoMulti-viewdocumentclustering via ensemble methodrdquo Journal of Intelligent Informa-tion Systems vol 43 no 1 pp 81ndash99 2014

[35] A Wahid X Gao and P Andreae ldquoMulti-view clustering ofweb documents using multi-objective genetic algorithmrdquo inProceedings of the IEEE Congress on Evolutionary Computation(CEC rsquo14) pp 2625ndash2632 Beijing China July 2014

[36] X Pei T Wu and C Chen ldquoAutomated graph regularizedprojective nonnegative matrix factorization for document clus-teringrdquo IEEE Transactions on Cybernetics vol 44 no 10 pp1821ndash1831 2014

[37] M Lu X-J Zhao L Zhang and F-Z Li ldquoSemi-supervisedconcept factorization for document clusteringrdquo InformationSciences vol 331 pp 86ndash98 2016

[38] C-K Yau A Porter NNewman andA Suominen ldquoClusteringscientific documents with topic modelingrdquo Scientometrics vol100 no 3 pp 767ndash786 2014


[39] Y. Ma, Y. Wang, and B. Jin, “A three-phase approach to document clustering based on topic significance degree,” Expert Systems with Applications, vol. 41, no. 18, pp. 8203–8210, 2014.

[40] C. C. Aggarwal, Y. Zhao, and P. S. Yu, “On the use of side information for mining text data,” IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 6, pp. 1415–1429, 2014.

[41] R. M. Marcacini, M. A. Domingues, E. R. Hruschka, and S. O. Rezende, “Privileged information for hierarchical document clustering: a metric learning approach,” in Proceedings of the 22nd International Conference on Pattern Recognition (ICPR ’14), pp. 3636–3641, August 2014.

[42] L. Cagnina, M. Errecalde, D. Ingaramo, and P. Rosso, “An efficient particle swarm optimization approach to cluster short texts,” Information Sciences, vol. 265, pp. 36–49, 2014.

[43] W. Song, Y. Qiao, S. C. Park, and X. Qian, “A hybrid evolutionary computation approach with its application for optimizing text document clustering,” Expert Systems with Applications, vol. 42, no. 5, pp. 2517–2524, 2015.

[44] R. Forsati, A. Keikha, and M. Shamsfard, “An improved bee colony optimization algorithm with an application to document clustering,” Neurocomputing, vol. 159, no. 1, pp. 9–26, 2015.

[45] K. K. Bharti and P. K. Singh, “Chaotic gradient artificial bee colony for text clustering,” Soft Computing, vol. 20, no. 3, pp. 1113–1126, 2016.

[46] F. Wang and J. Sun, “Survey on distance metric learning and dimensionality reduction in data mining,” Data Mining and Knowledge Discovery, vol. 29, no. 2, pp. 534–564, 2014.

[47] Y.-S. Lin, J.-Y. Jiang, and S.-J. Lee, “A similarity measure for text classification and clustering,” IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 7, pp. 1575–1590, 2014.

[48] M. T. Hassan, A. Karim, J.-B. Kim, and M. Jeon, “CDIM: document clustering by discrimination information maximization,” Information Sciences, vol. 316, pp. 87–106, 2015.

[49] M. T. Hassan and A. Karim, “Clustering and understanding documents via discrimination information maximization,” in Proceedings of the Pacific-Asia Conference on Advances in Knowledge Discovery & Data Mining (PAKDD ’12), Kuala Lumpur, Malaysia, May 2012.

[50] D. Cai and C. J. van Rijsbergen, “Learning semantic relatedness from term discrimination information,” Expert Systems with Applications, vol. 36, no. 2, pp. 1860–1875, 2009.

[51] G. Grahne and J. Zhu, “High performance mining of maximal frequent itemsets,” in Proceedings of the SIAM Workshop on High Performance Data Mining: Pervasive and Data Stream Mining (HPDM:PDS ’03), San Francisco, Calif, USA, May 2003.

[52] H. Luo, R. Shen, and C. Niu, “Sparse group restricted Boltzmann machines,” in Proceedings of the 25th AAAI Conference on Artificial Intelligence (AAAI ’11), San Francisco, Calif, USA, August 2011.

[53] S. Pado and M. Lapata, “Dependency-based construction of semantic space models,” Computational Linguistics, vol. 33, no. 2, pp. 161–199, 2007.



