CN102929894A - Online clustering visualization method of text - Google Patents

Online clustering visualization method of text

Info

Publication number
CN102929894A
Authority
CN
China
Prior art keywords
text
vocabulary
online
cluster
mark
Prior art date
Legal status
Pending
Application number
CN2011102309783A
Other languages
Chinese (zh)
Inventor
金烨
徐诗恒
Current Assignee
FIFTY SEVENTH RESEARCH INSTITUTE OF CHINESE PEOPLE'S LIBERATION ARMY GENERAL STAFF
Original Assignee
FIFTY SEVENTH RESEARCH INSTITUTE OF CHINESE PEOPLE'S LIBERATION ARMY GENERAL STAFF
Priority date
Filing date
Publication date
Application filed by FIFTY SEVENTH RESEARCH INSTITUTE OF CHINESE PEOPLE'S LIBERATION ARMY GENERAL STAFF
Priority to CN2011102309783A
Publication of CN102929894A
Legal status: Pending

Abstract

The invention provides an online clustering visualization method for text, belonging to the field of intelligent information processing in computer science. The method introduces category feature-word annotations supplied by the user to constrain and optimize the clustering process, improving the clarity and intelligibility of the text clustering structure. An online text clustering technique is designed to cluster a text data stream incrementally, keep the overall clustering structure stable, and update the model adaptively. The invention also designs an online dimensionality-reduction and layout method for high-dimensional data, suited to large-scale data or data-stream environments; dimensionality reduction and layout are applied to the clustered text category distribution vectors, realizing incremental visualization of text data and displaying the text data and category structure in a two- or three-dimensional Euclidean space.

Description

An online clustering visualization method for text
Technical field
The invention belongs to intelligent text information processing within computer science, and specifically relates to an online text clustering visualization method.
Background art
Text is one of the most important information carriers, and browsing and processing textual information is a common working scenario. With the surge in information volume, users urgently need new computer techniques that automatically classify and manage continually arriving data, so that it can be browsed and queried by category. As the data volume grows further, traditional text lists are no longer adequate for displaying textual information; the clustering results must instead be shown intuitively as two- or three-dimensional visual graphics, so that users can more easily understand the distribution of the information and retrieve it accurately.
Among text clustering algorithms, the Latent Dirichlet Allocation (LDA) model proposed by David M. Blei is a widely used generative model. Starting from a feature analysis of the text, it explores the common distributions shared by the features of different data items, and uses Bayesian analysis to compute the parameters these common distributions obey, thereby modeling the text and clustering it according to the model parameters. Chinese patent CN101968798A discloses an online algorithm for the LDA model and a method of applying it to community recommendation. For new data, that method performs text classification with online updating: a model is obtained by clustering initial data, the fixed model is then used to classify new data, and the new data is in turn used to retrain the model; it is therefore not an online clustering algorithm. Moreover, that method does not incorporate the user's prior information, so the resulting clustering structure often fails to match the user's prior expectations about the categories.
In text visualization, Laurens van der Maaten and Geoffrey Hinton proposed the t-SNE algorithm. It assumes that similarities in the high-dimensional text feature space follow a Gaussian distribution, while the corresponding coordinate points in the low-dimensional Euclidean space after dimensionality reduction follow a t-distribution. The algorithm uses a KL divergence function to assess the difference between the high- and low-dimensional distributions and, by minimizing this function, searches for a set of low-dimensional coordinates that preserves the distributional structure of the high-dimensional data as far as possible. However, t-SNE can only process batch data; its capacity for data is rather limited, and it cannot process a text data stream online.
Summary of the invention
The main purposes of the present invention are: to improve the LDA model so that it can accept prior information supplied by the user in the form of vocabulary annotations, thereby making the clustering structure more useful to the user; to propose an online clustering method that can cluster a text data stream online and update the model automatically; and to propose an online text visualization method that can incrementally reduce, lay out and display the clustering structure.
The object of the invention is achieved through the following technical solutions:
An online text clustering visualization method comprises vocabulary-annotation-based online text clustering, and online high-dimensional data dimensionality reduction and visualization.
The steps of the vocabulary-annotation-based online text clustering are:
Step a, clustering task setup: the user sets the number of clusters K according to the task; if the user has clearly defined categories, the user may supply a small number of feature words (typically 5–20 words) to indicate each category;
Step b, text preprocessing: for each text in the set D, count the occurrence frequency of each word (Chinese data must first be segmented into words). Let d_i denote the i-th text in the set, w_j the j-th word of the vocabulary W formed from all words in the set, and n(d_i, w_j) the frequency of word w_j in text d_i; let N denote the total number of texts in the set, M the total number of vocabulary words, and Z the category set;
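The word-frequency statistics of step b can be sketched in a few lines of Python; the toy token lists below are hypothetical stand-ins for segmented Chinese texts (the patent's own pipeline uses ICTCLAS for segmentation):

```python
from collections import Counter

def build_counts(docs):
    """Build the statistics of step b from tokenized documents.

    docs: list of token lists, docs[i] standing in for text d_i.
    Returns (vocab, counts) where vocab is the word list W and
    counts[i] is a Counter mapping word w_j -> n(d_i, w_j).
    """
    counts = [Counter(tokens) for tokens in docs]
    vocab = sorted({w for c in counts for w in c})
    return vocab, counts

# Hypothetical toy corpus standing in for the segmented texts.
docs = [["sports", "ball", "game", "ball"],
        ["finance", "stock", "market"],
        ["sports", "game", "market"]]
vocab, counts = build_counts(docs)
N, M = len(docs), len(vocab)   # number of texts, vocabulary size
print(N, M)                    # → 3 6
print(counts[0]["ball"])       # n(d_0, "ball") → 2
```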
Step c, model the texts in the set with the LDA model, constrain and optimize the model with the category feature words, and solve the model with Gibbs sampling to realize text clustering. The detailed procedure is as follows:
Step c1, random initialization: for each word w (w ∈ W) of each text d in D, randomly assign a category label z (z ∈ Z); then compute the statistics: n(d_i, z_k), the total frequency of words in text d_i labeled with category k; n(d_i), the total number of word tokens in text d_i (counting repeats); n(w_j, z_k), the total frequency with which word w_j is labeled with category k over all texts; and n(z_k), the total frequency with which any word is labeled with category k;
Step c2, annotation-constraint initialization: the annotated words are used to revise the initialized model statistics. For each feature word w_j annotated for category z_k:

n(w_j, z_k) ← n(w_j, z_k) + C

where C is the annotation strength coefficient, usually an integer between 50 and 5000, and n(z_k) is revised accordingly:

n(z_k) ← n(z_k) + C;
Step c3, sampling: Gibbs sampling is used to randomly resample the category of each word in the texts according to

P(z = k | z_¬, w_j, d_i) ∝ (n¬(w_j, z_k) + β) / (n¬(z_k) + W·β) · (n¬(d_i, z_k) + α)

where n¬(·) denotes the residual frequency after the current label of the sampled point (d_i, w_j) has been removed from the statistic, and α and β are the Dirichlet prior distribution parameters;
Step c4, model parameter calculation: after Gibbs sampling reaches its stopping condition, the distribution probability θ_d of text d over the categories is computed as follows, and the category sampling frequencies n(w, z) of all words are saved:

θ_{d_i}(z_k) = P(z_k | d_i) = (n(d_i, z_k) + α) / (Σ_{k'=1..K} n(d_i, z_{k'}) + K·α)
Step c5, text category decision: for text d, the category of the largest component of θ_d is taken as the decided category of d, i.e. z(d) = argmax_k θ_d(z_k);
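Steps c1–c5 amount to a collapsed Gibbs sampler for LDA whose counts are boosted by the annotated feature words. The following is a minimal sketch under that reading, not the patented implementation; the toy corpus, the seed sets, the small C, and the short chain length are all assumptions for demonstration:

```python
import random
from collections import defaultdict

def constrained_lda(docs, K, seeds, C=50, alpha=0.1, beta=0.01,
                    iters=200, rng=None):
    """Collapsed Gibbs sampling for LDA with seed-word count boosting.

    docs: list of token lists; K: number of categories;
    seeds: dict category k -> set of feature words (step a);
    C: annotation strength coefficient (step c2).
    Returns (theta, nwz) where theta[i][k] ~ P(z_k | d_i).
    """
    rng = rng or random.Random(0)
    vocab = sorted({w for d in docs for w in d})
    W = len(vocab)
    # Step c1: random initialization and count statistics.
    z = [[rng.randrange(K) for _ in d] for d in docs]
    ndz = [[0] * K for _ in docs]
    nwz = defaultdict(lambda: [0] * K)
    nz = [0] * K
    for i, d in enumerate(docs):
        for j, w in enumerate(d):
            k = z[i][j]
            ndz[i][k] += 1; nwz[w][k] += 1; nz[k] += 1
    # Step c2: boost the counts of annotated feature words by C.
    for k, words in seeds.items():
        for w in words:
            nwz[w][k] += C; nz[k] += C
    # Step c3: Gibbs sweeps over every word token.
    for _ in range(iters):
        for i, d in enumerate(docs):
            for j, w in enumerate(d):
                k = z[i][j]
                ndz[i][k] -= 1; nwz[w][k] -= 1; nz[k] -= 1
                p = [(nwz[w][kk] + beta) / (nz[kk] + W * beta)
                     * (ndz[i][kk] + alpha) for kk in range(K)]
                r = rng.random() * sum(p)
                for kk in range(K):
                    r -= p[kk]
                    if r <= 0:
                        break
                z[i][j] = kk
                ndz[i][kk] += 1; nwz[w][kk] += 1; nz[kk] += 1
    # Step c4: document-category distributions theta.
    theta = [[(ndz[i][k] + alpha) / (len(docs[i]) + K * alpha)
              for k in range(K)] for i in range(len(docs))]
    return theta, nwz

# Hypothetical toy data: two themes, each seeded by one feature word.
docs = [["ball", "game", "team"], ["stock", "market", "bank"],
        ["ball", "team", "match"], ["market", "bank", "fund"]]
theta, nwz = constrained_lda(docs, K=2, seeds={0: {"ball"}, 1: {"market"}})
labels = [max(range(2), key=lambda k: t[k]) for t in theta]  # step c5
print(labels)
```

The boost of step c2 stays baked into the counts during sampling, which is what steers the seeded words (and their co-occurring neighbors) toward the intended categories.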
Step d, online clustering of new data: the word–category distribution frequencies n^(0)(w, z) of the existing model are used to cluster new text data incrementally, and n^(0)(w, z) is updated automatically. The detailed procedure is as follows:
Step d1, preprocess the new text data, then perform random initialization as in step c1;
Step d2, for each word w in the new text data: if w exists in the original model, revise its category distribution frequency using the frequencies stored in the original model,

n(w, z_k) ← n(w, z_k) + n^(0)(w, z_k)

otherwise leave it unchanged;
Step d3, cluster the new text data according to steps c3–c5;
Step d4, for the new word–category distribution frequencies n(w, z): if w is a newly appearing word, add n(w, z) to the original model; if w is not new, replace the original distribution frequency n^(0)(w, z) with n(w, z).
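Step d4's model update can be sketched as a simple dictionary merge, under the assumption that the frequency tables are plain dicts of per-category count lists (the names and toy counts are hypothetical):

```python
def update_model(old_nwz, new_nwz):
    """Step d4: merge new word-category frequencies into the model.

    old_nwz: word -> list of K frequencies from the existing model (n^(0)).
    new_nwz: word -> list of K frequencies obtained on the new batch.
    New words are added; existing words are replaced by the new counts.
    """
    model = dict(old_nwz)
    for w, freqs in new_nwz.items():
        model[w] = list(freqs)  # replace if known, add if new
    return model

old = {"ball": [5, 0], "market": [0, 7]}
new = {"ball": [9, 1], "fund": [0, 3]}   # hypothetical batch statistics
model = update_model(old, new)
print(model)  # → {'ball': [9, 1], 'market': [0, 7], 'fund': [0, 3]}
```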
The steps of the online high-dimensional data dimensionality reduction and visualization are:
Step e, for the high-dimensional category distribution vectors x_1, x_2, ..., x_n obtained by text clustering (where x_i is the category distribution vector θ_{d_i} obtained by clustering text d_i), compute the similarity p_ij between any two vectors x_i, x_j; at the same time randomly generate corresponding low-dimensional initial vectors y_1, y_2, ..., y_n and compute the similarity q_ij between any two vectors y_i, y_j. The similarities are computed as follows:

p_ij = (p_{j|i} + p_{i|j}) / (2n)

q_ij = (1 + ||y_i − y_j||²)^(−1) / Σ_{k≠l} (1 + ||y_k − y_l||²)^(−1)

where

p_{j|i} = exp(−||x_i − x_j||² / 2σ_i²) / Σ_{k≠i} exp(−||x_i − x_k||² / 2σ_i²)

and σ_i² is the variance of the Gaussian distribution fitted to the Euclidean distances from x_i to the x_k (k ≠ i);
Step f, measure the difference between {p_ij} and {q_ij} with the KL distance, defined as:

D_KL = Σ_i Σ_j p_ij log(p_ij / q_ij)
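Steps e and f can be sketched in pure Python: Gaussian conditional similarities are symmetrized into p_ij, t-distribution similarities give q_ij, and the KL distance measures their mismatch. A fixed σ (rather than per-point variances) and the toy vectors are assumptions for illustration:

```python
import math

def p_matrix(xs, sigma=1.0):
    """Symmetric high-dimensional similarities p_ij of step e (fixed sigma)."""
    n = len(xs)
    d2 = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b))
    pcond = [[0.0] * n for _ in range(n)]
    for i in range(n):
        denom = sum(math.exp(-d2(xs[i], xs[k]) / (2 * sigma ** 2))
                    for k in range(n) if k != i)
        for j in range(n):
            if j != i:
                pcond[i][j] = math.exp(-d2(xs[i], xs[j]) / (2 * sigma ** 2)) / denom
    # p_ij = (p_{j|i} + p_{i|j}) / (2n): symmetrize and normalize.
    return [[(pcond[i][j] + pcond[j][i]) / (2 * n) for j in range(n)]
            for i in range(n)]

def q_matrix(ys):
    """Low-dimensional t-distribution similarities q_ij of step e."""
    n = len(ys)
    d2 = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b))
    w = [[1.0 / (1.0 + d2(ys[i], ys[j])) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    s = sum(w[k][l] for k in range(n) for l in range(n) if k != l)
    return [[w[i][j] / s for j in range(n)] for i in range(n)]

def kl(p, q):
    """Step f: D_KL = sum_ij p_ij log(p_ij / q_ij)."""
    return sum(p[i][j] * math.log(p[i][j] / q[i][j])
               for i in range(len(p)) for j in range(len(p))
               if i != j and p[i][j] > 0)

xs = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9)]  # toy category distribution vectors
ys = [(0.0, 0.0), (0.1, 0.0), (2.0, 0.0)]  # candidate 2-D layout
print(kl(p_matrix(xs), q_matrix(ys)))      # D_KL of this layout
```

A layout that keeps similar texts close yields a lower D_KL than one that scrambles them, which is the quantity step g minimizes.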
Step g, search for the minimum of D_KL by an optimization method (gradient descent), continually updating the low-dimensional vectors y_1, y_2, ..., y_n. Gradient descent searches for the optimum step by step along the negative gradient direction of D_KL; taking the partial derivative of D_KL with respect to y_i gives

∂D_KL/∂y_i = 4 Σ_j (p_ij − q_ij)(y_i − y_j)(1 + ||y_i − y_j||²)^(−1)

Writing Y = (y_1, y_2, ..., y_n) and Y^(t) = (y_1^(t), y_2^(t), ..., y_n^(t)) for the solution after t iterations, gradient descent updates Y by

Y^(t) = Y^(t−1) + η ∂D_KL/∂Y + α(t)(Y^(t−1) − Y^(t−2))

where η is the step size, usually an integer between 50 and 500, and α(t) is a momentum factor taking values in 0–1. After several iterations, low-dimensional vectors y_1, y_2, ..., y_n meeting the prescribed error requirement are obtained, and a visualization tool is used to visualize them.
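The gradient update of step g, with its momentum term, can be sketched as follows. This is a minimal illustration, not the patented implementation: it uses the conventional descending sign (y ← y − η·grad), a small fixed step size rather than the η of 50–500 suggested above, and a hypothetical target similarity matrix:

```python
import math, random

def d2(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

def q_and_w(ys):
    """Student-t similarities q_ij and the unnormalized weights w_ij."""
    n = len(ys)
    w = [[1.0 / (1.0 + d2(ys[i], ys[j])) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    s = sum(w[i][j] for i in range(n) for j in range(n) if i != j)
    return [[w[i][j] / s for j in range(n)] for i in range(n)], w

def kl(p, q):
    n = len(p)
    return sum(p[i][j] * math.log(p[i][j] / q[i][j])
               for i in range(n) for j in range(n) if i != j and p[i][j] > 0)

def layout(p, dim=2, eta=0.1, momentum=0.5, iters=300, rng=None):
    """Step g: minimize D_KL over low-dimensional points by gradient descent.

    grad_i = 4 * sum_j (p_ij - q_ij)(y_i - y_j)(1 + ||y_i - y_j||^2)^(-1)
    y(t)   = y(t-1) - eta * grad + momentum * (y(t-1) - y(t-2))
    """
    rng = rng or random.Random(0)
    n = len(p)
    ys = [[rng.gauss(0, 1e-2) for _ in range(dim)] for _ in range(n)]
    prev = [list(y) for y in ys]
    for _ in range(iters):
        q, w = q_and_w(ys)
        grads = []
        for i in range(n):
            g = [0.0] * dim
            for j in range(n):
                if j != i:
                    c = 4.0 * (p[i][j] - q[i][j]) * w[i][j]
                    for t in range(dim):
                        g[t] += c * (ys[i][t] - ys[j][t])
            grads.append(g)
        nxt = [[ys[i][t] - eta * grads[i][t] + momentum * (ys[i][t] - prev[i][t])
                for t in range(dim)] for i in range(n)]
        prev, ys = ys, nxt
    return ys

# Hypothetical target similarities: points 0 and 1 similar, 2 an outlier.
p = [[0.0, 0.35, 0.075], [0.35, 0.0, 0.075], [0.075, 0.075, 0.0]]
ys = layout(p)
print(ys)
```

After optimization the pair with high p_ij ends up closer together in the layout than either is to the outlier.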
Step h, during online dimensionality reduction, randomly select already-produced low-dimensional vectors β_1, β_2, ..., β_m, whose corresponding high-dimensional vectors are α_1, α_2, ..., α_m. Suppose x_1, x_2, ..., x_n are the high-dimensional vectors to be processed incrementally, with corresponding low-dimensional vectors y_1, y_2, ..., y_n. For any x_i (i = 1, 2, ..., n), its similarity p_{j|i} to α_j (j = 1, 2, ..., m) is defined as:

p_{j|i} = exp(−||x_i − α_j||² / 2σ_i²) / Σ_{k=1..m} exp(−||x_i − α_k||² / 2σ_i²)

where σ_i² is the variance of the Gaussian distribution fitted to the Euclidean distances from x_i to the α_k (k = 1, 2, ..., m). The similarity q_{j|i} between the corresponding low-dimensional vectors y_i and β_j is defined as:

q_{j|i} = (1 + ||y_i − β_j||²)^(−1) / Σ_{k=1..m} (1 + ||y_i − β_k||²)^(−1)

The KL distance D_KL between {p_{j|i}} and {q_{j|i}} is defined as:

D_KL = Σ_{j=1..m} p_{j|i} log(p_{j|i} / q_{j|i})

Gradient descent is likewise used to obtain y_1, y_2, ..., y_n, which are then added to the visualization graph.
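Step h can be sketched as follows: a new high-dimensional point is placed by matching its similarities to a fixed set of reference (landmark) pairs, while the landmarks themselves stay frozen. The landmark choice, constants and toy data are assumptions for illustration, and the update again uses the conventional descending sign:

```python
import math, random

def d2(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

def embed_new_point(x, alphas, betas, sigma=1.0, eta=0.1, momentum=0.5,
                    iters=300, rng=None):
    """Step h: embed one new high-dimensional vector x against frozen landmarks.

    alphas: reference high-dimensional vectors; betas: their low-dim coords.
    Minimizes D_KL({p_{j|i}} || {q_{j|i}}) over the new low-dim point y only.
    """
    rng = rng or random.Random(0)
    m = len(alphas)
    # p_{j|i}: Gaussian similarities of x to the reference high-dim vectors.
    e = [math.exp(-d2(x, a) / (2 * sigma ** 2)) for a in alphas]
    p = [v / sum(e) for v in e]
    dim = len(betas[0])
    y = [rng.gauss(0, 1e-2) for _ in range(dim)]
    prev = list(y)
    for _ in range(iters):
        w = [1.0 / (1.0 + d2(y, b)) for b in betas]
        q = [v / sum(w) for v in w]
        # grad = 2 * sum_j (p_j - q_j)(y - beta_j)(1 + ||y - beta_j||^2)^(-1)
        g = [2 * sum((p[j] - q[j]) * (y[t] - betas[j][t]) * w[j]
                     for j in range(m)) for t in range(dim)]
        nxt = [y[t] - eta * g[t] + momentum * (y[t] - prev[t])
               for t in range(dim)]
        prev, y = y, nxt
    return y

# Hypothetical landmarks: two clusters already laid out far apart in 2-D.
alphas = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9), (0.2, 0.8)]
betas = [(-2.0, 0.0), (-1.8, 0.1), (2.0, 0.0), (1.8, -0.1)]
y = embed_new_point((0.85, 0.15), alphas, betas)  # new point near cluster 1
print(y)
```

Because the landmark coordinates never move, previously rendered points stay put and the overall layout remains stable as new texts stream in, which is the point of the incremental scheme.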
Description of drawings
Fig. 1 is a flow chart of the online text clustering visualization technique.
Fig. 2 is a flow chart of the vocabulary-annotation-based online text clustering.
Fig. 3 shows how the online text clustering performance varies with the vocabulary annotation coefficient.
Fig. 4 is a flow chart of the online text layout visualization technique.
Fig. 5 is a two-dimensional rendering produced by the online text clustering visualization technique.
Fig. 6 is a three-dimensional rendering produced by the online text clustering visualization technique.
Embodiment
To make the objects and advantages of the present invention clearer, the invention is further described below with reference to the drawings and specific embodiments.
Fig. 1 is a schematic diagram of a system that uses the present invention to cluster and visualize text online. The system first collects some historical data as the initial text data; these data need no category annotations. The system clusters the initial data to obtain the text category distribution vectors and the word–category distribution frequencies of the initial text data. The former serves as the data source for the initial layout of the high-dimensional dimensionality-reduction layout method, from which the layout model parameters are computed; the latter serves as the model data, used as constraint parameters for online text clustering. When processing online text data, the system clusters each batch of acquired data online; the resulting text category distribution vectors again pass through the dimensionality-reduction layout to compute low-dimensional coordinates, which are finally displayed in two or three dimensions according to the user's requirements.
The experimental data in this example all come from the Fudan University Chinese text corpus. The data are divided into 10 classes with about 2800 texts in total; the largest class has about 500 texts and the smallest about 200.
The present invention requires the user to set the number of cluster classes in advance, usually choosing a reasonably suitable value based on the user's familiarity with the data. For categories with clear prior information, the user may supply several keywords for each; the number of keywords is usually between 5 and 20 but may be larger. These words should describe the subject matter of the category well, and the same word is allowed to appear in different categories.
The Chinese text preprocessing of the present invention requires word segmentation software to segment Chinese text. We mainly use the ICTCLAS software provided by the Institute of Computing Technology of the Chinese Academy of Sciences, and from its part-of-speech tags we select nouns and verbs as the content words of a text. These steps are outside the scope of the invention.
The online text clustering process is described in detail below.
As shown in Fig. 2, the online text clustering process is divided into initial clustering and online clustering. Initial clustering uses the vocabulary annotation information to constrain the LDA model, clusters the initial data, and obtains the model parameters; online clustering preserves the category structure of the initial clustering, clusters the text data stream online, and updates the model parameters in real time. Fig. 3 compares the performance of clustering with vocabulary annotations against clustering without annotations.
The detailed procedure of the initial clustering is as follows:
Step 101, initial data preprocessing: segment the Chinese data and select words by part of speech, further delete words that occur in only one text, and convert each text into a word-frequency vector indexed by the vocabulary;
After preprocessing, the clustering data are defined as follows: D denotes the text collection, with N texts in total, d_i being the i-th text; W denotes the vocabulary generated from the words in D, with M words in total, w_j being the j-th word; Z denotes the category set, with K categories in total, z_k being the k-th category; n(d_i, w_j) denotes the frequency of w_j in d_i;
Step 102, parameter initialization: first, for each word of each text vector, randomly choose a category from Z as its label, and compute the statistics: n(d_i, z_k), the total sampling frequency of all words in text d_i for category z_k; n(d_i), the total number of word tokens in text d_i counting repeats; n(w_j, z_k), the total sampling frequency of word w_j for category z_k; and n(z_k), the total sampling frequency of category z_k over all words. Second, for the categories in Z annotated with the user's words, revise n(w_j, z_k) for each annotated word:

n(w_j, z_k) ← n(w_j, z_k) + C

where C is the annotation strength coefficient, usually an integer between 50 and 5000, with n(z_k) revised accordingly (n(z_k) ← n(z_k) + C);
Step 103, iterative sampling: we usually adopt single-chain sampling with a chain length of 1000–2000. One sampling pass proceeds as follows:
Step 103a, for each text subscript i from 1 to N:
Step 103b, for each vocabulary subscript j from 1 to M:
Step 103c, if w_j does not occur in d_i, go to step 103b; otherwise perform extraction: let z be the current category label of the word, and decrement each of n(d_i, z), n(w_j, z) and n(z) by 1;
Step 103d, for each category subscript k from 1 to K, compute the transition probability of the word from its current category z to category k:

P(z = k | z_¬, w_j, d_i) ∝ (n¬(w_j, z_k) + β) / (n¬(z_k) + W·β) · (n¬(d_i, z_k) + α)

where α and β are the Dirichlet prior distribution parameters; we usually fix α = 0.1 and β = 0.01;
Step 103e, perform random sampling in proportion to the probabilities P(z = k | z_¬, w_j, d_i) computed in the previous step, take the resulting category z′ as the new sampling state of the current word, and increment each of n(d_i, z′), n(w_j, z′) and n(z′) by 1;
Step 104, during the last 100 sampling passes, after each pass compute the model parameters from the current sampling state:

θ_{d_i}^(n)(z_k) = P(z_k | d_i) = (n(d_i, z_k) + α) / (Σ_{k'=1..K} n(d_i, z_{k'}) + K·α)

Mod^(n)(w_j, z_k) = n(w_j, z_k)

where θ_{d_i}^(n) denotes the category distribution probability of text d_i and Mod(w_j) the category sampling frequency of word w_j.
Step 105, after all sampling passes are finished, take the mean of the parameters computed in the last 100 states as the final model parameters, i.e.

θ_{d_i}(z_k) = E_n[θ_{d_i}^(n)(z_k)], Mod(w_j, z_k) = E_n[Mod^(n)(w_j, z_k)];

take the category of the largest component of θ_{d_i} as the decided category of d_i, i.e. z(d_i) = argmax_k θ_{d_i}(z_k).
The detailed procedure of the online clustering is as follows:
Step 106, preprocess the text as in step 101, randomly initialize the words of the new texts as in step 102, and compute n(d, z), n(d), n(w, z) and n(z) for the new texts. For each word occurring in a new text: if it occurs in the original model, revise n(w, z) and n(z), otherwise leave them unchanged; the revision is

n(w_j, z_k) ← n(w_j, z_k) + Mod(w_j, z_k)

with n(z_k) increased by the same amount;
Step 107, carry out the sampling iterations for the new texts according to steps 103 and 104, and decide the categories of the new texts;
Step 108, update the original model with the word–category distribution frequencies Mod(w, z) obtained for the new texts: if word w does not occur in the original model, add the frequency distribution of w to it; if w already exists in the original model, replace its frequency distribution with the newly obtained Mod(w, z).
The online high-dimensional data dimensionality reduction and visualization process, shown in Fig. 4, is described in detail below.
Step 101, input the high-dimensional category distribution vectors x_1, x_2, ..., x_n produced by the text clustering process (where x_i is the category distribution vector θ_{d_i} obtained by clustering text d_i); if this is the first dimensionality reduction and visualization, go to step 102, otherwise go to step 106;
Step 102, compute the similarity p_ij between any two high-dimensional vectors x_i, x_j:

p_ij = (p_{j|i} + p_{i|j}) / (2n)

where

p_{j|i} = exp(−||x_i − x_j||² / 2σ_i²) / Σ_{k≠i} exp(−||x_i − x_k||² / 2σ_i²)

and σ_i² is the variance of the Gaussian distribution fitted to the Euclidean distances from x_i to the x_k (k ≠ i);
Step 103, randomly generate low-dimensional initial vectors y_1, y_2, ..., y_n corresponding to x_1, x_2, ..., x_n, and compute the similarity q_ij between any two low-dimensional vectors y_i, y_j:

q_ij = (1 + ||y_i − y_j||²)^(−1) / Σ_{k≠l} (1 + ||y_k − y_l||²)^(−1)
Step 104, measure the difference between {p_ij} and {q_ij} with the KL distance:

D_KL = Σ_i Σ_j p_ij log(p_ij / q_ij)

Step 105, iteratively search for the minimum of D_KL by gradient descent, updating the low-dimensional vectors y_1, y_2, ..., y_n as follows:
Step 105a, search step by step along the negative gradient direction of D_KL; the partial derivative of D_KL with respect to y_i is

∂D_KL/∂y_i = 4 Σ_j (p_ij − q_ij)(y_i − y_j)(1 + ||y_i − y_j||²)^(−1)

Step 105b, writing Y = (y_1, y_2, ..., y_n) and Y^(t) = (y_1^(t), y_2^(t), ..., y_n^(t)) for the solution after t iterations, update Y by

Y^(t) = Y^(t−1) + η ∂D_KL/∂Y + α(t)(Y^(t−1) − Y^(t−2))

where η is the step size, usually an integer between 50 and 500, and α(t) is a momentum factor taking values in 0–1; after several iterations, low-dimensional vectors y_1, y_2, ..., y_n meeting the prescribed error requirement are obtained; then go to step 112;
Step 106, randomly select a group of low-dimensional vectors β_1, β_2, ..., β_m produced by the first dimensionality reduction and visualization as the reference low-dimensional vectors, with the corresponding high-dimensional vectors α_1, α_2, ..., α_m as the reference high-dimensional vectors;
Step 107, check whether any of x_1, x_2, ..., x_n remains unprocessed; if so go to step 108, otherwise go to step 112;
Step 108, take the first unprocessed vector x_i and compute its similarity p_{j|i} to each reference high-dimensional vector α_j (j = 1, 2, ..., m):

p_{j|i} = exp(−||x_i − α_j||² / 2σ_i²) / Σ_{k=1..m} exp(−||x_i − α_k||² / 2σ_i²)

where σ_i² is the variance of the Gaussian distribution fitted to the Euclidean distances from x_i to the α_k (k = 1, 2, ..., m);
Step 109, randomly generate the low-dimensional initial vector y_i corresponding to x_i, and compute the similarity q_{j|i} between y_i and each reference low-dimensional vector β_j (j = 1, 2, ..., m):

q_{j|i} = (1 + ||y_i − β_j||²)^(−1) / Σ_{k=1..m} (1 + ||y_i − β_k||²)^(−1)

Step 110, measure the difference between {p_{j|i}} and {q_{j|i}} with the KL distance:

D_KL = Σ_{j=1..m} p_{j|i} log(p_{j|i} / q_{j|i})
Step 111, iteratively search for the minimum of D_KL by gradient descent, updating the low-dimensional vector y_i as follows:
Step 111a, search step by step along the negative gradient direction of D_KL; the partial derivative of D_KL with respect to y_i is

∂D_KL/∂y_i = 2 Σ_j (p_{j|i} − q_{j|i})(y_i − β_j)(1 + ||y_i − β_j||²)^(−1)

Step 111b, writing y_i^(t) for the solution after t iterations, update y_i by

y_i^(t) = y_i^(t−1) + η ∂D_KL/∂y_i + α(t)(y_i^(t−1) − y_i^(t−2))

where η is the step size, usually an integer between 50 and 500, and α(t) is a momentum factor taking values in 0–1; after several iterations, a low-dimensional vector y_i meeting the prescribed error requirement is obtained;
Step 112, visually display the reduced low-dimensional vectors y_i. In practice we often reduce the text category distribution vectors into a two-dimensional Euclidean space, so that each y_i is a two-dimensional coordinate. For two-dimensional visualization, the coordinates of all reduced texts can be output directly, as shown in Fig. 5; for three-dimensional visualization, we often take the largest component of the text's category distribution vector as the third coordinate and display it in three dimensions together with y_i, as shown in Fig. 6.
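The three-dimensional display described above can be sketched directly: the two-dimensional layout coordinates are extended with the largest component of each text's category distribution vector as the third axis. The toy vectors are assumptions for illustration:

```python
def to_3d(thetas, ys):
    """Step 112 (3-D case): extend 2-D layout coords with a confidence axis.

    thetas: text category distribution vectors; ys: their 2-D layout coords.
    The third coordinate is the largest component of theta, so confidently
    classified texts sit higher than ambiguous ones.
    """
    return [(y[0], y[1], max(theta)) for theta, y in zip(thetas, ys)]

thetas = [(0.9, 0.05, 0.05), (0.4, 0.35, 0.25)]  # confident vs ambiguous text
ys = [(1.0, 2.0), (-0.5, 0.3)]
print(to_3d(thetas, ys))  # → [(1.0, 2.0, 0.9), (-0.5, 0.3, 0.4)]
```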
The embodiments described above are preferred embodiments of the present invention, but the embodiments of the present invention are not limited to them; any change, modification, substitution, combination or simplification made without departing from the spirit and principles of the present invention shall be an equivalent substitution and shall fall within the protection scope of the present invention.

Claims (11)

1. An online text clustering visualization method, characterized by comprising two major steps: vocabulary-annotation-based online text clustering, and online high-dimensional data dimensionality reduction and visualization:
The steps of the vocabulary-annotation-based online text clustering are:
Step a, the user sets the number of clusters and supplies some feature words for some or all of the categories;
Step b, compute the word-frequency statistics of the original text collection, model the data with the LDA model, constrain the LDA model with the annotated category feature words, and solve the model parameters with the Gibbs sampling technique;
Step c, the document category distribution θ among the model parameters is used to predict text categories, and the word–category distribution frequency n(w, z) among the model parameters serves as a constraint parameter for the incremental clustering process;
Step d, during online clustering, new text data is initialized on the basis of the existing model parameters n(w, z), then modeled according to steps b and c; after the computation finishes, the new texts have been clustered incrementally and the model parameters are updated automatically;
The steps of the online high-dimensional data dimensionality reduction and visualization are:
Step e, for the high-dimensional category distribution vectors obtained by text clustering, compute the similarity between any two vectors; at the same time randomly generate corresponding low-dimensional initial vectors and compute the similarity between any two low-dimensional vectors;
Step f, measure the difference between the set of high-dimensional similarities and the set of low-dimensional similarities with the KL distance (Kullback–Leibler divergence);
Step g, iteratively search for the minimum of the difference between the similarity sets of step f by an optimization method, continually updating the low-dimensional vectors; stop iterating when the prescribed error range is reached, and visualize the low-dimensional vectors with a visualization tool;
Step h, during online processing, the dimensionality reduction of newly arriving high-dimensional vectors makes use of the low-dimensional vectors already produced; the already-produced low-dimensional vectors are no longer updated during the iterative search, and only the newly arriving high-dimensional vectors are processed incrementally according to steps e, f and g.
2. The vocabulary-annotation-based online text clustering according to claim 1, characterized in that in step a, of the K classes set by the user, the user may select any subset to annotate; for the selected categories, the user need only provide a small number of feature words, and may also provide annotated texts.
3. The vocabulary-annotation-based online text clustering according to claim 1, characterized in that in step b, the frequency of word w_j in text d_i is n(d_i, w_j), the total sampling frequency of word w_j for category z_k is n(w_j, z_k), and the total sampling frequency of all words in text d_i for category z_k is n(d_i, z_k).
4. The vocabulary-annotation-based online text clustering according to claim 1, characterized in that in step b, the annotated words are used to revise the model initialization: for each feature word w_j annotated for category z_k,

n(w_j, z_k) ← n(w_j, z_k) + C

where C is the annotation strength coefficient.
5. The vocabulary-annotation-based online text clustering according to claim 1, characterized in that in step b, the Gibbs sampling formula is:

P(z = k | z_¬, w_j, d_i) ∝ (n¬(w_j, z_k) + β) / (n¬(z_k) + W·β) · (n¬(d_i, z_k) + α)

where n¬(·) denotes the residual frequency after the current label of the sampled point (d_i, w_j) has been removed from the statistic, and α and β are the Dirichlet prior distribution parameters.
6. The online text clustering based on vocabulary marking according to claim 1, wherein in step c, the probability distribution θ of document d_i over the classes is computed as follows:

$$\theta_{d_i}(z_k) = P(z_k \mid d_i) = \frac{n(d_i, z_k) + \alpha}{\sum_{k'=1}^{K} n(d_i, z_{k'}) + K\cdot\alpha}$$
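The θ of claim 6 is a Dirichlet-smoothed normalization of the per-document class counts; a new text is assigned to the class with the largest θ (as in claim 8). A hedged one-function sketch, with made-up count values in the usage below:

```python
def theta(n_dz_i, alpha):
    """Smoothed class distribution of one document: (n + α) / (Σ n' + K·α)."""
    K = len(n_dz_i)
    denom = sum(n_dz_i) + K * alpha
    return [(n + alpha) / denom for n in n_dz_i]
```

For example, counts [8, 1, 1] with α = 0.5 yield a distribution dominated by the first class, which would be the cluster assigned to that document.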
7. The online text clustering based on vocabulary marking according to claim 1, wherein in step d, new data are initialized on the basis of the original model parameters, implemented as follows: first, the words in the new data are randomly assigned class marks; then the vocabulary mark frequencies n(w_j, z_k) and n(d_i, z_k) of the new data are counted; after marking is finished, the vocabulary distribution of the new data is corrected using the vocabulary-class frequencies of the original model, the correction formula being:

(formula given as image FSA00000555789600025 in the original publication)

where n^(0)(w_j, z_k) denotes the vocabulary-class frequency in the original model.
8. The online text clustering based on vocabulary marking according to claim 1, wherein in step d, the incremental clustering of new texts may follow the solution method of the standard LDA model without fixing the original model parameters; after Gibbs Sampling reaches the stop condition, the class of a new text is determined by computing θ, and the model parameters n(w_j, z_k) are corrected automatically at the same time.
9. The online high-dimensional data dimension reduction and visualization according to claim 1, wherein in step e, for the high-dimensional class-distribution vectors x_1, x_2, ..., x_n obtained from text clustering, the similarity p_ij between x_i and x_j is defined as:

(formula given as image FSA00000555789600026 in the original publication)

where

(formula given as image FSA00000555789600027 in the original publication)

and σ_i² is the variance of the Gaussian distribution as which the Euclidean distances from x_i to x_k (k ≠ i) are regarded; denoting by y_1, y_2, ..., y_n the low-dimensional data corresponding to x_1, x_2, ..., x_n, the similarity q_ij between y_i and y_j is defined as:

$$q_{ij} = \frac{(1 + \lVert y_i - y_j \rVert^2)^{-1}}{\sum_{k \neq l} (1 + \lVert y_k - y_l \rVert^2)^{-1}}$$
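The q_ij of claim 9 is the heavy-tailed Student-t similarity familiar from t-SNE, normalized over all ordered pairs. A small sketch (pure-Python, illustrative data) of computing the low-dimensional similarity matrix:

```python
def q_matrix(Y):
    """Student-t similarities: q_ij = (1+||y_i-y_j||²)⁻¹ / Σ_{k≠l}(1+||y_k-y_l||²)⁻¹."""
    n = len(Y)
    inv = [[0.0] * n for _ in range(n)]
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(Y[i], Y[j]))
                inv[i][j] = 1.0 / (1.0 + d2)  # heavy tail: falls off polynomially
                total += inv[i][j]
    # Normalize over all ordered pairs so the q_ij sum to 1.
    return [[inv[i][j] / total for j in range(n)] for i in range(n)]
```

The heavy tail is what lets moderately dissimilar clusters spread apart in the 2-D or 3-D display without collapsing onto each other.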
10. The online high-dimensional data dimension reduction and visualization according to claim 1, wherein in step f, the KL distance D_KL between the high-dimensional similarity set {p_ij} and the low-dimensional similarity set {q_ij} is defined as:

$$D_{KL} = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}$$
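The D_KL of claim 10 is the objective that step g minimizes by iterative search. A direct sketch of the definition (skipping zero p_ij entries, where the summand is conventionally taken as zero):

```python
import math

def kl_distance(P, Q):
    """D_KL = Σ_i Σ_j p_ij · log(p_ij / q_ij), treating p_ij = 0 terms as 0."""
    return sum(p * math.log(p / q)
               for prow, qrow in zip(P, Q)
               for p, q in zip(prow, qrow) if p > 0.0)
```

D_KL is zero exactly when the low-dimensional similarities reproduce the high-dimensional ones, which is the stopping target of the iteration.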
11. The online high-dimensional data dimension reduction and visualization according to claim 1, wherein in step h, denoting by β_1, β_2, ..., β_m the low-dimensional vectors already produced, with corresponding high-dimensional vectors α_1, α_2, ..., α_m, and supposing x_1, x_2, ..., x_n are the high-dimensional vectors requiring online processing, with corresponding low-dimensional vectors y_1, y_2, ..., y_n, then for any x_i (i = 1, 2, ..., n), its similarity p_{j|i} to α_j (j = 1, 2, ..., m) is defined as:

$$p_{j|i} = \frac{\exp(-\lVert x_i - \alpha_j \rVert^2 / 2\sigma_i^2)}{\sum_{k=1}^{m} \exp(-\lVert x_i - \alpha_k \rVert^2 / 2\sigma_i^2)}$$

where σ_i² is the variance of the Gaussian distribution as which the Euclidean distances from x_i to α_k (k = 1, 2, ..., m) are regarded; the similarity q_{j|i} between the low-dimensional vectors y_i and β_j corresponding to x_i and α_j is defined as:

$$q_{j|i} = \frac{(1 + \lVert y_i - \beta_j \rVert^2)^{-1}}{\sum_{k=1}^{m} (1 + \lVert y_i - \beta_k \rVert^2)^{-1}}$$

and the KL distance D_KL between {p_{j|i}} and {q_{j|i}} is defined as:

$$D_{KL} = \sum_{j=1}^{m} p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}$$
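Claim 11's online step places each new point independently: the anchor pairs (α_j, β_j) are frozen and only y_i is moved to minimize D_KL({p_{j|i}}, {q_{j|i}}). A minimal numerical-gradient sketch under assumed fixed σ_i², step size, and toy anchor data (none of these values are from the patent):

```python
import math

def embed_point(x, anchors_hi, anchors_lo, sigma2=1.0, lr=0.5, iters=200):
    """Place one new point in 2-D by minimizing KL(p_{j|i} || q_{j|i})
    against fixed high/low-dimensional anchor pairs (the online step h)."""
    def p_cond():
        # Gaussian similarities of x to the high-dimensional anchors.
        w = [math.exp(-sum((a - b) ** 2 for a, b in zip(x, aj)) / (2 * sigma2))
             for aj in anchors_hi]
        s = sum(w)
        return [v / s for v in w]

    def kl(y):
        # Student-t similarities of candidate y to the low-dimensional anchors.
        w = [1.0 / (1.0 + sum((a - b) ** 2 for a, b in zip(y, bj)))
             for bj in anchors_lo]
        s = sum(w)
        q = [v / s for v in w]
        return sum(pj * math.log(pj / qj) for pj, qj in zip(p, q) if pj > 0)

    p = p_cond()
    y, eps = [0.0, 0.0], 1e-5
    for _ in range(iters):
        # Numerical gradient of the KL objective with respect to y.
        g = []
        for d in range(2):
            yp = list(y); yp[d] += eps
            ym = list(y); ym[d] -= eps
            g.append((kl(yp) - kl(ym)) / (2 * eps))
        y = [yd - lr * gd for yd, gd in zip(y, g)]
    return y
```

Because the anchors never move, each new vector can be embedded as it arrives without disturbing the layout already shown to the user, which is the point of keeping the produced low-dimensional vectors fixed in step h.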
CN2011102309783A 2011-08-12 2011-08-12 Online clustering visualization method of text Pending CN102929894A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011102309783A CN102929894A (en) 2011-08-12 2011-08-12 Online clustering visualization method of text


Publications (1)

Publication Number Publication Date
CN102929894A true CN102929894A (en) 2013-02-13

Family

ID=47644693

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011102309783A Pending CN102929894A (en) 2011-08-12 2011-08-12 Online clustering visualization method of text

Country Status (1)

Country Link
CN (1) CN102929894A (en)

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103823848A (en) * 2014-02-11 2014-05-28 浙江大学 LDA (latent dirichlet allocation) and VSM (vector space model) based similar Chinese herb literature recommendation method
CN105205145A (en) * 2015-09-18 2015-12-30 中国科学院自动化研究所 Track modeling and searching method
CN106372051A (en) * 2016-10-20 2017-02-01 长城计算机软件与系统有限公司 Patent map visualization method and system
CN106897572A (en) * 2017-03-08 2017-06-27 山东大学 Lung neoplasm case matching assisted detection system and its method of work based on manifold learning
CN107085581A (en) * 2016-02-16 2017-08-22 腾讯科技(深圳)有限公司 Short text classification method and device
CN107368506A (en) * 2015-05-11 2017-11-21 斯图飞腾公司 Unstructured data analysis system and method
CN107688870A (en) * 2017-08-15 2018-02-13 中国科学院软件研究所 A kind of the classification factor visual analysis method and device of the deep neural network based on text flow input
CN108052560A (en) * 2017-12-04 2018-05-18 四川理工学院 A kind of data analysis processing method of data analysis processing method and employment trend data based on colleges and universities' data
CN108268469A (en) * 2016-12-30 2018-07-10 广东精点数据科技股份有限公司 A kind of Algorithm of documents categorization based on mixing multinomial distribution
CN108845560A (en) * 2018-05-30 2018-11-20 国网浙江省电力有限公司宁波供电公司 A kind of power scheduling log Fault Classification
CN108959344A (en) * 2018-04-10 2018-12-07 天津大学 One kind being directed to the dynamic analysis method of vocational education
CN109165359A (en) * 2017-11-06 2019-01-08 徐海飞 A kind of Methods of Dimensionality Reduction in High-dimensional Data and system based on web service platform
CN109388711A (en) * 2018-09-05 2019-02-26 广州视源电子科技股份有限公司 The method and apparatus of log stream cluster
CN109657087A (en) * 2018-11-30 2019-04-19 平安科技(深圳)有限公司 A kind of batch data mask method, device and computer readable storage medium
CN111027599A (en) * 2019-11-25 2020-04-17 中国建设银行股份有限公司 Clustering visualization method and device based on random sampling
CN111061880A (en) * 2019-12-24 2020-04-24 成都迪普曼林信息技术有限公司 Method for rapidly clustering massive text data
CN111324737A (en) * 2020-03-23 2020-06-23 中国电子科技集团公司第三十研究所 Bag-of-words model-based distributed text clustering method, storage medium and computing device
WO2021017736A1 (en) * 2019-07-31 2021-02-04 平安科技(深圳)有限公司 Image analysis apparatus
CN112948583A (en) * 2021-02-26 2021-06-11 中国光大银行股份有限公司 Data classification method and device, storage medium and electronic device
CN113421632A (en) * 2021-07-09 2021-09-21 中国人民大学 Psychological disease type diagnosis system based on time series

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6735589B2 (en) * 2001-06-07 2004-05-11 Microsoft Corporation Method of reducing dimensionality of a set of attributes used to characterize a sparse data set
CN101853239A (en) * 2010-05-06 2010-10-06 复旦大学 Nonnegative matrix factorization-based dimensionality reducing method used for clustering
CN101968798A (en) * 2010-09-10 2011-02-09 中国科学技术大学 Community recommendation method based on on-line soft constraint LDA algorithm



Similar Documents

Publication Publication Date Title
CN102929894A (en) Online clustering visualization method of text
CN106919689B (en) Professional domain knowledge mapping dynamic fixing method based on definitions blocks of knowledge
CN105243129B (en) Item property Feature words clustering method
CN105045812B (en) The classification method and system of text subject
CN105718528B (en) Academic map methods of exhibiting based on adduction relationship between paper
CN103207899B (en) Text recommends method and system
Huang et al. Dirichlet process mixture model for document clustering with feature partition
CN106649272B (en) A kind of name entity recognition method based on mixed model
CN106709754A (en) Power user grouping method based on text mining
CN106845717A (en) A kind of energy efficiency evaluation method based on multi-model convergence strategy
CN101159064A (en) Image generation system and method for generating image
CN105354593B (en) A kind of threedimensional model sorting technique based on NMF
CN103207913A (en) Method and system for acquiring commodity fine-grained semantic relation
CN109408641A (en) It is a kind of based on have supervision topic model file classification method and system
CN104573070B (en) A kind of Text Clustering Method for mixing length text set
CN104077417A (en) Figure tag recommendation method and system in social network
CN104572631A (en) Training method and system for language model
CN104572614A (en) Training method and system for language model
CN104462408A (en) Topic modeling based multi-granularity sentiment analysis method
CN108427756A (en) Personalized query word completion recommendation method and device based on same-class user model
CN109241442B (en) Project recommendation method based on predictive value filling, readable storage medium and terminal
CN114997288A (en) Design resource association method
CN104572915B (en) One kind is based on the enhanced customer incident relatedness computation method of content environment
CN106202515A (en) A kind of Mobile solution based on sequence study recommends method and commending system thereof
CN110019563B (en) Portrait modeling method and device based on multi-dimensional data

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20130213