CN102929894A - Online clustering visualization method of text - Google Patents

Online clustering visualization method of text

Info

Publication number
CN102929894A
Authority
CN
China
Prior art keywords
text
vocabulary
online
cluster
mark
Prior art date
Legal status
Pending
Application number
CN2011102309783A
Other languages
Chinese (zh)
Inventor
金烨
徐诗恒
Current Assignee
FIFTY SEVENTH RESEARCH INSTITUTE OF CHINESE PEOPLE'S LIBERATION ARMY GENERAL STAFF
Original Assignee
FIFTY SEVENTH RESEARCH INSTITUTE OF CHINESE PEOPLE'S LIBERATION ARMY GENERAL STAFF
Priority date
Filing date
Publication date
Application filed by FIFTY SEVENTH RESEARCH INSTITUTE OF CHINESE PEOPLE'S LIBERATION ARMY GENERAL STAFF
Priority to CN2011102309783A
Publication of CN102929894A
Legal status: Pending

Abstract

The invention provides an online clustering visualization method for text, belonging to the field of intelligent information processing in computer science. The method introduces category feature-word annotations supplied by the user to constrain and optimize the clustering process, improving the clarity and intelligibility of the text clustering structure. An online text clustering technique is designed to cluster a text data stream incrementally, keep the overall clustering structure stable, and update the model adaptively. The invention also designs an online dimensionality-reduction and layout method for high-dimensional data, suited to large-scale data or data-stream environments; dimensionality reduction and layout are applied to the clustered text category distribution vectors, realizing incremental visualization of text data and displaying the text data and category structure in a two- or three-dimensional Euclidean space.

Description

An online clustering visualization method for text
Technical field
The invention belongs to intelligent text information processing within computer science, and specifically relates to an online text clustering visualization method.
Background art
Text is one of the most important information carriers, and browsing and processing textual information is a common working scenario. With the surge in information volume, users urgently need new computer techniques that automatically classify and manage continually arriving data, so that it can be browsed and queried by category. As the data volume grows further, traditional text lists are no longer adequate for displaying textual information; the clustering results must instead be shown intuitively as two- or three-dimensional visual graphics, so that users can more easily understand the distribution of the information and retrieve it accurately.
Among text clustering algorithms, the Latent Dirichlet Allocation (LDA) model proposed by David M. Blei is a widely used generative model. Starting from a feature analysis of the text, it explores the common distributions shared by the features of different data items, and uses Bayesian analysis to compute the parameters these common distributions obey, thereby modeling the text and clustering it according to the model parameters. Chinese patent CN101968798A discloses an online algorithm for the LDA model and a method of applying it to community recommendation. For new data, that method performs text classification with online updating: a model is obtained by clustering initial data, the fixed model is then used to classify new data, and the new data is in turn used to retrain the model; it is therefore not an online clustering algorithm. Moreover, that method does not incorporate the user's prior information, so the resulting clustering structure often fails to match the user's prior expectations about the categories.
In text visualization, Laurens van der Maaten and Geoffrey Hinton proposed the t-SNE algorithm. It assumes that similarities in the high-dimensional text feature space follow a Gaussian distribution, while the corresponding coordinate points in the low-dimensional Euclidean space after dimensionality reduction follow a t-distribution. The algorithm uses a KL divergence function to assess the difference between the high- and low-dimensional distributions and, by minimizing this function, searches for a set of low-dimensional coordinates that preserves the distributional structure of the high-dimensional data as far as possible. However, t-SNE can only process batch data; its capacity for data is rather limited, and it cannot process a text data stream online.
Summary of the invention
The main purposes of the present invention are: to improve the LDA model so that it can accept prior information supplied by the user in the form of vocabulary annotations, thereby making the clustering structure more useful to the user; to propose an online clustering method that can cluster a text data stream online and update the model automatically; and to propose an online text visualization method that can incrementally reduce, lay out and display the clustering structure.
The object of the invention is achieved through the following technical solutions:
An online text clustering visualization method comprises vocabulary-annotation-based online text clustering, and online high-dimensional data dimensionality reduction and visualization.
The steps of the vocabulary-annotation-based online text clustering are:
Step a, clustering task setup: the user sets the number of clusters K according to the task; if the user has clearly defined categories, the user may supply a small number of feature words (typically 5–20 words) to indicate each category;
Step b, text preprocessing: for each text in the set D, count the occurrence frequency of each word (Chinese data must first be segmented into words). Let d_i denote the i-th text in the set, w_j the j-th word of the vocabulary W formed from all words in the set, and n(d_i, w_j) the frequency of word w_j in text d_i; let N denote the total number of texts in the set, M the total number of vocabulary words, and Z the category set;
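The word-frequency statistics of step b can be sketched in a few lines of Python; the toy token lists below are hypothetical stand-ins for segmented Chinese texts (the patent's own pipeline uses ICTCLAS for segmentation):

```python
from collections import Counter

def build_counts(docs):
    """Build the statistics of step b from tokenized documents.

    docs: list of token lists, docs[i] standing in for text d_i.
    Returns (vocab, counts) where vocab is the word list W and
    counts[i] is a Counter mapping word w_j -> n(d_i, w_j).
    """
    counts = [Counter(tokens) for tokens in docs]
    vocab = sorted({w for c in counts for w in c})
    return vocab, counts

# Hypothetical toy corpus standing in for the segmented texts.
docs = [["sports", "ball", "game", "ball"],
        ["finance", "stock", "market"],
        ["sports", "game", "market"]]
vocab, counts = build_counts(docs)
N, M = len(docs), len(vocab)   # number of texts, vocabulary size
print(N, M)                    # → 3 6
print(counts[0]["ball"])       # n(d_0, "ball") → 2
```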
Step c, model the texts in the set with the LDA model, constrain and optimize the model with the category feature words, and solve the model with Gibbs sampling to realize text clustering. The detailed procedure is as follows:
Step c1, random initialization: for each word w (w ∈ W) of each text d in D, randomly assign a category label z (z ∈ Z); then compute the statistics: n(d_i, z_k), the total frequency of words in text d_i labeled with category k; n(d_i), the total number of word tokens in text d_i (counting repeats); n(w_j, z_k), the total frequency with which word w_j is labeled with category k over all texts; and n(z_k), the total frequency with which any word is labeled with category k;
Step c2, annotation-constraint initialization: the annotated words are used to revise the initialized model statistics. For each feature word w_j annotated for category z_k:

n(w_j, z_k) ← n(w_j, z_k) + C

where C is the annotation strength coefficient, usually an integer between 50 and 5000, and n(z_k) is revised accordingly:

n(z_k) ← n(z_k) + C;
Step c3, sampling: Gibbs sampling is used to randomly resample the category of each word in the texts according to

P(z = k | z_¬, w_j, d_i) ∝ (n¬(w_j, z_k) + β) / (n¬(z_k) + W·β) · (n¬(d_i, z_k) + α)

where n¬(·) denotes the residual frequency after the current label of the sampled point (d_i, w_j) has been removed from the statistic, and α and β are the Dirichlet prior distribution parameters;
Step c4, model parameter calculation: after Gibbs sampling reaches its stopping condition, the distribution probability θ_d of text d over the categories is computed as follows, and the category sampling frequencies n(w, z) of all words are saved:

θ_{d_i}(z_k) = P(z_k | d_i) = (n(d_i, z_k) + α) / (Σ_{k'=1..K} n(d_i, z_{k'}) + K·α)
Step c5, text category decision: for text d, the category of the largest component of θ_d is taken as the decided category of d, i.e. z(d) = argmax_k θ_d(z_k);
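Steps c1–c5 amount to a collapsed Gibbs sampler for LDA whose counts are boosted by the annotated feature words. The following is a minimal sketch under that reading, not the patented implementation; the toy corpus, the seed sets, the small C, and the short chain length are all assumptions for demonstration:

```python
import random
from collections import defaultdict

def constrained_lda(docs, K, seeds, C=50, alpha=0.1, beta=0.01,
                    iters=200, rng=None):
    """Collapsed Gibbs sampling for LDA with seed-word count boosting.

    docs: list of token lists; K: number of categories;
    seeds: dict category k -> set of feature words (step a);
    C: annotation strength coefficient (step c2).
    Returns (theta, nwz) where theta[i][k] ~ P(z_k | d_i).
    """
    rng = rng or random.Random(0)
    vocab = sorted({w for d in docs for w in d})
    W = len(vocab)
    # Step c1: random initialization and count statistics.
    z = [[rng.randrange(K) for _ in d] for d in docs]
    ndz = [[0] * K for _ in docs]
    nwz = defaultdict(lambda: [0] * K)
    nz = [0] * K
    for i, d in enumerate(docs):
        for j, w in enumerate(d):
            k = z[i][j]
            ndz[i][k] += 1; nwz[w][k] += 1; nz[k] += 1
    # Step c2: boost the counts of annotated feature words by C.
    for k, words in seeds.items():
        for w in words:
            nwz[w][k] += C; nz[k] += C
    # Step c3: Gibbs sweeps over every word token.
    for _ in range(iters):
        for i, d in enumerate(docs):
            for j, w in enumerate(d):
                k = z[i][j]
                ndz[i][k] -= 1; nwz[w][k] -= 1; nz[k] -= 1
                p = [(nwz[w][kk] + beta) / (nz[kk] + W * beta)
                     * (ndz[i][kk] + alpha) for kk in range(K)]
                r = rng.random() * sum(p)
                for kk in range(K):
                    r -= p[kk]
                    if r <= 0:
                        break
                z[i][j] = kk
                ndz[i][kk] += 1; nwz[w][kk] += 1; nz[kk] += 1
    # Step c4: document-category distributions theta.
    theta = [[(ndz[i][k] + alpha) / (len(docs[i]) + K * alpha)
              for k in range(K)] for i in range(len(docs))]
    return theta, nwz

# Hypothetical toy data: two themes, each seeded by one feature word.
docs = [["ball", "game", "team"], ["stock", "market", "bank"],
        ["ball", "team", "match"], ["market", "bank", "fund"]]
theta, nwz = constrained_lda(docs, K=2, seeds={0: {"ball"}, 1: {"market"}})
labels = [max(range(2), key=lambda k: t[k]) for t in theta]  # step c5
print(labels)
```

The boost of step c2 stays baked into the counts during sampling, which is what steers the seeded words (and their co-occurring neighbors) toward the intended categories.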
Step d, online clustering of new data: the word–category distribution frequencies n^(0)(w, z) of the existing model are used to cluster new text data incrementally, and n^(0)(w, z) is updated automatically. The detailed procedure is as follows:
Step d1, preprocess the new text data, then perform random initialization as in step c1;
Step d2, for each word w in the new text data: if w exists in the original model, revise its category distribution frequency using the frequencies stored in the original model,

n(w, z_k) ← n(w, z_k) + n^(0)(w, z_k)

otherwise leave it unchanged;
Step d3, cluster the new text data according to steps c3–c5;
Step d4, for the new word–category distribution frequencies n(w, z): if w is a newly appearing word, add n(w, z) to the original model; if w is not new, replace the original distribution frequency n^(0)(w, z) with n(w, z).
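Step d4's model update can be sketched as a simple dictionary merge, under the assumption that the frequency tables are plain dicts of per-category count lists (the names and toy counts are hypothetical):

```python
def update_model(old_nwz, new_nwz):
    """Step d4: merge new word-category frequencies into the model.

    old_nwz: word -> list of K frequencies from the existing model (n^(0)).
    new_nwz: word -> list of K frequencies obtained on the new batch.
    New words are added; existing words are replaced by the new counts.
    """
    model = dict(old_nwz)
    for w, freqs in new_nwz.items():
        model[w] = list(freqs)  # replace if known, add if new
    return model

old = {"ball": [5, 0], "market": [0, 7]}
new = {"ball": [9, 1], "fund": [0, 3]}   # hypothetical batch statistics
model = update_model(old, new)
print(model)  # → {'ball': [9, 1], 'market': [0, 7], 'fund': [0, 3]}
```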
The steps of the online high-dimensional data dimensionality reduction and visualization are:
Step e, for the high-dimensional category distribution vectors x_1, x_2, ..., x_n obtained by text clustering (where x_i is the category distribution vector θ_{d_i} obtained by clustering text d_i), compute the similarity p_ij between any two vectors x_i, x_j; at the same time randomly generate corresponding low-dimensional initial vectors y_1, y_2, ..., y_n and compute the similarity q_ij between any two vectors y_i, y_j. The similarities are computed as follows:

p_ij = (p_{j|i} + p_{i|j}) / (2n)

q_ij = (1 + ||y_i − y_j||²)^(−1) / Σ_{k≠l} (1 + ||y_k − y_l||²)^(−1)

where

p_{j|i} = exp(−||x_i − x_j||² / 2σ_i²) / Σ_{k≠i} exp(−||x_i − x_k||² / 2σ_i²)

and σ_i² is the variance of the Gaussian distribution fitted to the Euclidean distances from x_i to the x_k (k ≠ i);
Step f, measure the difference between {p_ij} and {q_ij} with the KL distance, defined as:

D_KL = Σ_i Σ_j p_ij log(p_ij / q_ij)
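Steps e and f can be sketched in pure Python: Gaussian conditional similarities are symmetrized into p_ij, t-distribution similarities give q_ij, and the KL distance measures their mismatch. A fixed σ (rather than per-point variances) and the toy vectors are assumptions for illustration:

```python
import math

def p_matrix(xs, sigma=1.0):
    """Symmetric high-dimensional similarities p_ij of step e (fixed sigma)."""
    n = len(xs)
    d2 = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b))
    pcond = [[0.0] * n for _ in range(n)]
    for i in range(n):
        denom = sum(math.exp(-d2(xs[i], xs[k]) / (2 * sigma ** 2))
                    for k in range(n) if k != i)
        for j in range(n):
            if j != i:
                pcond[i][j] = math.exp(-d2(xs[i], xs[j]) / (2 * sigma ** 2)) / denom
    # p_ij = (p_{j|i} + p_{i|j}) / (2n): symmetrize and normalize.
    return [[(pcond[i][j] + pcond[j][i]) / (2 * n) for j in range(n)]
            for i in range(n)]

def q_matrix(ys):
    """Low-dimensional t-distribution similarities q_ij of step e."""
    n = len(ys)
    d2 = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b))
    w = [[1.0 / (1.0 + d2(ys[i], ys[j])) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    s = sum(w[k][l] for k in range(n) for l in range(n) if k != l)
    return [[w[i][j] / s for j in range(n)] for i in range(n)]

def kl(p, q):
    """Step f: D_KL = sum_ij p_ij log(p_ij / q_ij)."""
    return sum(p[i][j] * math.log(p[i][j] / q[i][j])
               for i in range(len(p)) for j in range(len(p))
               if i != j and p[i][j] > 0)

xs = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9)]  # toy category distribution vectors
ys = [(0.0, 0.0), (0.1, 0.0), (2.0, 0.0)]  # candidate 2-D layout
print(kl(p_matrix(xs), q_matrix(ys)))      # D_KL of this layout
```

A layout that keeps similar texts close yields a lower D_KL than one that scrambles them, which is the quantity step g minimizes.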
Step g, search for the minimum of D_KL by an optimization method (gradient descent), continually updating the low-dimensional vectors y_1, y_2, ..., y_n. Gradient descent searches for the optimum step by step along the negative gradient direction of D_KL; taking the partial derivative of D_KL with respect to y_i gives

∂D_KL/∂y_i = 4 Σ_j (p_ij − q_ij)(y_i − y_j)(1 + ||y_i − y_j||²)^(−1)

Writing Y = (y_1, y_2, ..., y_n) and Y^(t) = (y_1^(t), y_2^(t), ..., y_n^(t)) for the solution after t iterations, gradient descent updates Y by

Y^(t) = Y^(t−1) + η ∂D_KL/∂Y + α(t)(Y^(t−1) − Y^(t−2))

where η is the step size, usually an integer between 50 and 500, and α(t) is a momentum factor taking values in 0–1. After several iterations, low-dimensional vectors y_1, y_2, ..., y_n meeting the prescribed error requirement are obtained, and a visualization tool is used to visualize them.
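The gradient update of step g, with its momentum term, can be sketched as follows. This is a minimal illustration, not the patented implementation: it uses the conventional descending sign (y ← y − η·grad), a small fixed step size rather than the η of 50–500 suggested above, and a hypothetical target similarity matrix:

```python
import math, random

def d2(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

def q_and_w(ys):
    """Student-t similarities q_ij and the unnormalized weights w_ij."""
    n = len(ys)
    w = [[1.0 / (1.0 + d2(ys[i], ys[j])) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    s = sum(w[i][j] for i in range(n) for j in range(n) if i != j)
    return [[w[i][j] / s for j in range(n)] for i in range(n)], w

def kl(p, q):
    n = len(p)
    return sum(p[i][j] * math.log(p[i][j] / q[i][j])
               for i in range(n) for j in range(n) if i != j and p[i][j] > 0)

def layout(p, dim=2, eta=0.1, momentum=0.5, iters=300, rng=None):
    """Step g: minimize D_KL over low-dimensional points by gradient descent.

    grad_i = 4 * sum_j (p_ij - q_ij)(y_i - y_j)(1 + ||y_i - y_j||^2)^(-1)
    y(t)   = y(t-1) - eta * grad + momentum * (y(t-1) - y(t-2))
    """
    rng = rng or random.Random(0)
    n = len(p)
    ys = [[rng.gauss(0, 1e-2) for _ in range(dim)] for _ in range(n)]
    prev = [list(y) for y in ys]
    for _ in range(iters):
        q, w = q_and_w(ys)
        grads = []
        for i in range(n):
            g = [0.0] * dim
            for j in range(n):
                if j != i:
                    c = 4.0 * (p[i][j] - q[i][j]) * w[i][j]
                    for t in range(dim):
                        g[t] += c * (ys[i][t] - ys[j][t])
            grads.append(g)
        nxt = [[ys[i][t] - eta * grads[i][t] + momentum * (ys[i][t] - prev[i][t])
                for t in range(dim)] for i in range(n)]
        prev, ys = ys, nxt
    return ys

# Hypothetical target similarities: points 0 and 1 similar, 2 an outlier.
p = [[0.0, 0.35, 0.075], [0.35, 0.0, 0.075], [0.075, 0.075, 0.0]]
ys = layout(p)
print(ys)
```

After optimization the pair with high p_ij ends up closer together in the layout than either is to the outlier.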
Step h, during online dimensionality reduction, randomly select already-produced low-dimensional vectors β_1, β_2, ..., β_m, whose corresponding high-dimensional vectors are α_1, α_2, ..., α_m. Suppose x_1, x_2, ..., x_n are the high-dimensional vectors to be processed incrementally, with corresponding low-dimensional vectors y_1, y_2, ..., y_n. For any x_i (i = 1, 2, ..., n), its similarity p_{j|i} to α_j (j = 1, 2, ..., m) is defined as:

p_{j|i} = exp(−||x_i − α_j||² / 2σ_i²) / Σ_{k=1..m} exp(−||x_i − α_k||² / 2σ_i²)

where σ_i² is the variance of the Gaussian distribution fitted to the Euclidean distances from x_i to the α_k (k = 1, 2, ..., m). The similarity q_{j|i} between the corresponding low-dimensional vectors y_i and β_j is defined as:

q_{j|i} = (1 + ||y_i − β_j||²)^(−1) / Σ_{k=1..m} (1 + ||y_i − β_k||²)^(−1)

The KL distance D_KL between {p_{j|i}} and {q_{j|i}} is defined as:

D_KL = Σ_{j=1..m} p_{j|i} log(p_{j|i} / q_{j|i})

Gradient descent is likewise used to obtain y_1, y_2, ..., y_n, which are then added to the visualization graph.
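Step h can be sketched as follows: a new high-dimensional point is placed by matching its similarities to a fixed set of reference (landmark) pairs, while the landmarks themselves stay frozen. The landmark choice, constants and toy data are assumptions for illustration, and the update again uses the conventional descending sign:

```python
import math, random

def d2(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

def embed_new_point(x, alphas, betas, sigma=1.0, eta=0.1, momentum=0.5,
                    iters=300, rng=None):
    """Step h: embed one new high-dimensional vector x against frozen landmarks.

    alphas: reference high-dimensional vectors; betas: their low-dim coords.
    Minimizes D_KL({p_{j|i}} || {q_{j|i}}) over the new low-dim point y only.
    """
    rng = rng or random.Random(0)
    m = len(alphas)
    # p_{j|i}: Gaussian similarities of x to the reference high-dim vectors.
    e = [math.exp(-d2(x, a) / (2 * sigma ** 2)) for a in alphas]
    p = [v / sum(e) for v in e]
    dim = len(betas[0])
    y = [rng.gauss(0, 1e-2) for _ in range(dim)]
    prev = list(y)
    for _ in range(iters):
        w = [1.0 / (1.0 + d2(y, b)) for b in betas]
        q = [v / sum(w) for v in w]
        # grad = 2 * sum_j (p_j - q_j)(y - beta_j)(1 + ||y - beta_j||^2)^(-1)
        g = [2 * sum((p[j] - q[j]) * (y[t] - betas[j][t]) * w[j]
                     for j in range(m)) for t in range(dim)]
        nxt = [y[t] - eta * g[t] + momentum * (y[t] - prev[t])
               for t in range(dim)]
        prev, y = y, nxt
    return y

# Hypothetical landmarks: two clusters already laid out far apart in 2-D.
alphas = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9), (0.2, 0.8)]
betas = [(-2.0, 0.0), (-1.8, 0.1), (2.0, 0.0), (1.8, -0.1)]
y = embed_new_point((0.85, 0.15), alphas, betas)  # new point near cluster 1
print(y)
```

Because the landmark coordinates never move, previously rendered points stay put and the overall layout remains stable as new texts stream in, which is the point of the incremental scheme.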
Description of drawings
Fig. 1 is a flow chart of the online text clustering visualization technique.
Fig. 2 is a flow chart of the vocabulary-annotation-based online text clustering.
Fig. 3 shows how the online text clustering performance varies with the vocabulary annotation coefficient.
Fig. 4 is a flow chart of the online text layout visualization technique.
Fig. 5 is a two-dimensional rendering produced by the online text clustering visualization technique.
Fig. 6 is a three-dimensional rendering produced by the online text clustering visualization technique.
Embodiment
To make the objects and advantages of the present invention clearer, the invention is further described below with reference to the drawings and specific embodiments.
Fig. 1 is a schematic diagram of a system that uses the present invention to cluster and visualize text online. The system first collects some historical data as the initial text data; these data need no category annotations. The system clusters the initial data to obtain the text category distribution vectors and the word–category distribution frequencies of the initial text data. The former serves as the data source for the initial layout of the high-dimensional dimensionality-reduction layout method, from which the layout model parameters are computed; the latter serves as the model data, used as constraint parameters for online text clustering. When processing online text data, the system clusters each batch of acquired data online; the resulting text category distribution vectors again pass through the dimensionality-reduction layout to compute low-dimensional coordinates, which are finally displayed in two or three dimensions according to the user's requirements.
The experimental data in this example all come from the Fudan University Chinese text corpus. The data are divided into 10 classes with about 2800 texts in total; the largest class has about 500 texts and the smallest about 200.
The present invention requires the user to set the number of cluster classes in advance, usually choosing a reasonably suitable value based on the user's familiarity with the data. For categories with clear prior information, the user may supply several keywords for each; the number of keywords is usually between 5 and 20 but may be larger. These words should describe the subject matter of the category well, and the same word is allowed to appear in different categories.
The Chinese text preprocessing of the present invention requires word segmentation software to segment Chinese text. We mainly use the ICTCLAS software provided by the Institute of Computing Technology of the Chinese Academy of Sciences, and from its part-of-speech tags we select nouns and verbs as the content words of a text. These steps are outside the scope of the invention.
The online text clustering process is described in detail below.
As shown in Fig. 2, the online text clustering process is divided into initial clustering and online clustering. Initial clustering uses the vocabulary annotation information to constrain the LDA model, clusters the initial data, and obtains the model parameters; online clustering preserves the category structure of the initial clustering, clusters the text data stream online, and updates the model parameters in real time. Fig. 3 compares the performance of clustering with vocabulary annotations against clustering without annotations.
The detailed procedure of the initial clustering is as follows:
Step 101, initial data preprocessing: segment the Chinese data and select words by part of speech, further delete words that occur in only one text, and convert each text into a word-frequency vector indexed by the vocabulary;
After preprocessing, the clustering data are defined as follows: D denotes the text collection, with N texts in total, d_i being the i-th text; W denotes the vocabulary generated from the words in D, with M words in total, w_j being the j-th word; Z denotes the category set, with K categories in total, z_k being the k-th category; n(d_i, w_j) denotes the frequency of w_j in d_i;
Step 102, parameter initialization: first, for each word of each text vector, randomly choose a category from Z as its label, and compute the statistics: n(d_i, z_k), the total sampling frequency of all words in text d_i for category z_k; n(d_i), the total number of word tokens in text d_i counting repeats; n(w_j, z_k), the total sampling frequency of word w_j for category z_k; and n(z_k), the total sampling frequency of category z_k over all words. Second, for the categories in Z annotated with the user's words, revise n(w_j, z_k) for each annotated word:

n(w_j, z_k) ← n(w_j, z_k) + C

where C is the annotation strength coefficient, usually an integer between 50 and 5000, with n(z_k) revised accordingly (n(z_k) ← n(z_k) + C);
Step 103, iterative sampling: we usually adopt single-chain sampling with a chain length of 1000–2000. One sampling pass proceeds as follows:
Step 103a, for each text subscript i from 1 to N:
Step 103b, for each vocabulary subscript j from 1 to M:
Step 103c, if w_j does not occur in d_i, go to step 103b; otherwise perform extraction: let z be the current category label of the word, and decrement each of n(d_i, z), n(w_j, z) and n(z) by 1;
Step 103d, for each category subscript k from 1 to K, compute the transition probability of the word from its current category z to category k:

P(z = k | z_¬, w_j, d_i) ∝ (n¬(w_j, z_k) + β) / (n¬(z_k) + W·β) · (n¬(d_i, z_k) + α)

where α and β are the Dirichlet prior distribution parameters; we usually fix α = 0.1 and β = 0.01;
Step 103e, perform random sampling in proportion to the probabilities P(z = k | z_¬, w_j, d_i) computed in the previous step, take the resulting category z′ as the new sampling state of the current word, and increment each of n(d_i, z′), n(w_j, z′) and n(z′) by 1;
Step 104, during the last 100 sampling passes, after each pass compute the model parameters from the current sampling state:

θ_{d_i}^(n)(z_k) = P(z_k | d_i) = (n(d_i, z_k) + α) / (Σ_{k'=1..K} n(d_i, z_{k'}) + K·α)

Mod^(n)(w_j, z_k) = n(w_j, z_k)

where θ_{d_i}^(n) denotes the category distribution probability of text d_i and Mod(w_j) the category sampling frequency of word w_j.
Step 105, after all sampling passes are finished, take the mean of the parameters computed in the last 100 states as the final model parameters, i.e.

θ_{d_i}(z_k) = E_n[θ_{d_i}^(n)(z_k)], Mod(w_j, z_k) = E_n[Mod^(n)(w_j, z_k)];

take the category of the largest component of θ_{d_i} as the decided category of d_i, i.e. z(d_i) = argmax_k θ_{d_i}(z_k).
The detailed procedure of the online clustering is as follows:
Step 106, preprocess the text as in step 101, randomly initialize the words of the new texts as in step 102, and compute n(d, z), n(d), n(w, z) and n(z) for the new texts. For each word occurring in a new text: if it occurs in the original model, revise n(w, z) and n(z), otherwise leave them unchanged; the revision is

n(w_j, z_k) ← n(w_j, z_k) + Mod(w_j, z_k)

with n(z_k) increased by the same amount;
Step 107, carry out the sampling iterations for the new texts according to steps 103 and 104, and decide the categories of the new texts;
Step 108, update the original model with the word–category distribution frequencies Mod(w, z) obtained for the new texts: if word w does not occur in the original model, add the frequency distribution of w to it; if w already exists in the original model, replace its frequency distribution with the newly obtained Mod(w, z).
The online high-dimensional data dimensionality reduction and visualization process, shown in Fig. 4, is described in detail below.
Step 101, input the high-dimensional category distribution vectors x_1, x_2, ..., x_n produced by the text clustering process (where x_i is the category distribution vector θ_{d_i} obtained by clustering text d_i); if this is the first dimensionality reduction and visualization, go to step 102, otherwise go to step 106;
Step 102, compute the similarity p_ij between any two high-dimensional vectors x_i, x_j:

p_ij = (p_{j|i} + p_{i|j}) / (2n)

where

p_{j|i} = exp(−||x_i − x_j||² / 2σ_i²) / Σ_{k≠i} exp(−||x_i − x_k||² / 2σ_i²)

and σ_i² is the variance of the Gaussian distribution fitted to the Euclidean distances from x_i to the x_k (k ≠ i);
Step 103, randomly generate low-dimensional initial vectors y_1, y_2, ..., y_n corresponding to x_1, x_2, ..., x_n, and compute the similarity q_ij between any two low-dimensional vectors y_i, y_j:

q_ij = (1 + ||y_i − y_j||²)^(−1) / Σ_{k≠l} (1 + ||y_k − y_l||²)^(−1)
Step 104, measure the difference between {p_ij} and {q_ij} with the KL distance:

D_KL = Σ_i Σ_j p_ij log(p_ij / q_ij)

Step 105, iteratively search for the minimum of D_KL by gradient descent, updating the low-dimensional vectors y_1, y_2, ..., y_n as follows:
Step 105a, search step by step along the negative gradient direction of D_KL; the partial derivative of D_KL with respect to y_i is

∂D_KL/∂y_i = 4 Σ_j (p_ij − q_ij)(y_i − y_j)(1 + ||y_i − y_j||²)^(−1)

Step 105b, writing Y = (y_1, y_2, ..., y_n) and Y^(t) = (y_1^(t), y_2^(t), ..., y_n^(t)) for the solution after t iterations, update Y by

Y^(t) = Y^(t−1) + η ∂D_KL/∂Y + α(t)(Y^(t−1) − Y^(t−2))

where η is the step size, usually an integer between 50 and 500, and α(t) is a momentum factor taking values in 0–1; after several iterations, low-dimensional vectors y_1, y_2, ..., y_n meeting the prescribed error requirement are obtained; then go to step 112;
Step 106, randomly select a group of low-dimensional vectors β_1, β_2, ..., β_m produced by the first dimensionality reduction and visualization as the reference low-dimensional vectors, with the corresponding high-dimensional vectors α_1, α_2, ..., α_m as the reference high-dimensional vectors;
Step 107, check whether any of x_1, x_2, ..., x_n remains unprocessed; if so go to step 108, otherwise go to step 112;
Step 108, take the first unprocessed vector x_i and compute its similarity p_{j|i} to each reference high-dimensional vector α_j (j = 1, 2, ..., m):

p_{j|i} = exp(−||x_i − α_j||² / 2σ_i²) / Σ_{k=1..m} exp(−||x_i − α_k||² / 2σ_i²)

where σ_i² is the variance of the Gaussian distribution fitted to the Euclidean distances from x_i to the α_k (k = 1, 2, ..., m);
Step 109, randomly generate the low-dimensional initial vector y_i corresponding to x_i, and compute the similarity q_{j|i} between y_i and each reference low-dimensional vector β_j (j = 1, 2, ..., m):

q_{j|i} = (1 + ||y_i − β_j||²)^(−1) / Σ_{k=1..m} (1 + ||y_i − β_k||²)^(−1)

Step 110, measure the difference between {p_{j|i}} and {q_{j|i}} with the KL distance:

D_KL = Σ_{j=1..m} p_{j|i} log(p_{j|i} / q_{j|i})
Step 111, iteratively search for the minimum of D_KL by gradient descent, updating the low-dimensional vector y_i as follows:
Step 111a, search step by step along the negative gradient direction of D_KL; the partial derivative of D_KL with respect to y_i is

∂D_KL/∂y_i = 2 Σ_j (p_{j|i} − q_{j|i})(y_i − β_j)(1 + ||y_i − β_j||²)^(−1)

Step 111b, writing y_i^(t) for the solution after t iterations, update y_i by

y_i^(t) = y_i^(t−1) + η ∂D_KL/∂y_i + α(t)(y_i^(t−1) − y_i^(t−2))

where η is the step size, usually an integer between 50 and 500, and α(t) is a momentum factor taking values in 0–1; after several iterations, a low-dimensional vector y_i meeting the prescribed error requirement is obtained;
Step 112, visually display the reduced low-dimensional vectors y_i. In practice we often reduce the text category distribution vectors into a two-dimensional Euclidean space, so that each y_i is a two-dimensional coordinate. For two-dimensional visualization, the coordinates of all reduced texts can be output directly, as shown in Fig. 5; for three-dimensional visualization, we often take the largest component of the text's category distribution vector as the third coordinate and display it in three dimensions together with y_i, as shown in Fig. 6.
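The three-dimensional display described above can be sketched directly: the two-dimensional layout coordinates are extended with the largest component of each text's category distribution vector as the third axis. The toy vectors are assumptions for illustration:

```python
def to_3d(thetas, ys):
    """Step 112 (3-D case): extend 2-D layout coords with a confidence axis.

    thetas: text category distribution vectors; ys: their 2-D layout coords.
    The third coordinate is the largest component of theta, so confidently
    classified texts sit higher than ambiguous ones.
    """
    return [(y[0], y[1], max(theta)) for theta, y in zip(thetas, ys)]

thetas = [(0.9, 0.05, 0.05), (0.4, 0.35, 0.25)]  # confident vs ambiguous text
ys = [(1.0, 2.0), (-0.5, 0.3)]
print(to_3d(thetas, ys))  # → [(1.0, 2.0, 0.9), (-0.5, 0.3, 0.4)]
```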
The embodiments described above are preferred embodiments of the present invention, but the embodiments of the present invention are not limited to them; any change, modification, substitution, combination or simplification made without departing from the spirit and principles of the present invention shall be an equivalent substitution and shall fall within the protection scope of the present invention.

Claims (11)

1. An online text clustering visualization method, characterized by comprising two major steps: vocabulary-annotation-based online text clustering, and online high-dimensional data dimensionality reduction and visualization:
The steps of the vocabulary-annotation-based online text clustering are:
Step a, the user sets the number of clusters and supplies some feature words for some or all of the categories;
Step b, compute the word-frequency statistics of the original text collection, model the data with the LDA model, constrain the LDA model with the annotated category feature words, and solve the model parameters with the Gibbs sampling technique;
Step c, the document category distribution θ among the model parameters is used to predict text categories, and the word–category distribution frequency n(w, z) among the model parameters serves as a constraint parameter for the incremental clustering process;
Step d, during online clustering, new text data is initialized on the basis of the existing model parameters n(w, z), then modeled according to steps b and c; after the computation finishes, the new texts have been clustered incrementally and the model parameters are updated automatically;
The steps of the online high-dimensional data dimensionality reduction and visualization are:
Step e, for the high-dimensional category distribution vectors obtained by text clustering, compute the similarity between any two vectors; at the same time randomly generate corresponding low-dimensional initial vectors and compute the similarity between any two low-dimensional vectors;
Step f, measure the difference between the set of high-dimensional similarities and the set of low-dimensional similarities with the KL distance (Kullback–Leibler divergence);
Step g, iteratively search for the minimum of the difference between the similarity sets of step f by an optimization method, continually updating the low-dimensional vectors; stop iterating when the prescribed error range is reached, and visualize the low-dimensional vectors with a visualization tool;
Step h, during online processing, the dimensionality reduction of newly arriving high-dimensional vectors makes use of the low-dimensional vectors already produced; the already-produced low-dimensional vectors are no longer updated during the iterative search, and only the newly arriving high-dimensional vectors are processed incrementally according to steps e, f and g.
2. The vocabulary-annotation-based online text clustering according to claim 1, characterized in that in step a, of the K classes set by the user, the user may select any subset to annotate; for the selected categories, the user need only provide a small number of feature words, and may also provide annotated texts.
3. The vocabulary-annotation-based online text clustering according to claim 1, characterized in that in step b, the frequency of word w_j in text d_i is n(d_i, w_j), the total sampling frequency of word w_j for category z_k is n(w_j, z_k), and the total sampling frequency of all words in text d_i for category z_k is n(d_i, z_k).
4. The vocabulary-annotation-based online text clustering according to claim 1, characterized in that in step b, the annotated words are used to revise the model initialization: for each feature word w_j annotated for category z_k,

n(w_j, z_k) ← n(w_j, z_k) + C

where C is the annotation strength coefficient.
5. The vocabulary-annotation-based online text clustering according to claim 1, characterized in that in step b, the Gibbs sampling formula is:

P(z = k | z_¬, w_j, d_i) ∝ (n¬(w_j, z_k) + β) / (n¬(z_k) + W·β) · (n¬(d_i, z_k) + α)

where n¬(·) denotes the residual frequency after the current label of the sampled point (d_i, w_j) has been removed from the statistic, and α and β are the Dirichlet prior distribution parameters.
6. The online text clustering based on vocabulary marking according to claim 1, wherein in step c, the probability distribution θ of document d_i over the classes is computed as follows:

$$\theta_{d_i}(z_k) = P(z_k \mid d_i) = \frac{n(d_i, z_k) + \alpha}{\sum_{k'=1}^{K} n(d_i, z_{k'}) + K\cdot\alpha}$$
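The θ of claim 6 is a Dirichlet-smoothed normalization of the per-document class counts; a new text is assigned to the class with the largest θ (as in claim 8). A hedged one-function sketch, with made-up count values in the usage below:

```python
def theta(n_dz_i, alpha):
    """Smoothed class distribution of one document: (n + α) / (Σ n' + K·α)."""
    K = len(n_dz_i)
    denom = sum(n_dz_i) + K * alpha
    return [(n + alpha) / denom for n in n_dz_i]
```

For example, counts [8, 1, 1] with α = 0.5 yield a distribution dominated by the first class, which would be the cluster assigned to that document.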
7. The online text clustering based on vocabulary marking according to claim 1, wherein in step d, new data are initialized on the basis of the original model parameters, implemented as follows: first, the words in the new data are randomly assigned class marks; then the vocabulary mark frequencies n(w_j, z_k) and n(d_i, z_k) of the new data are counted; after marking is finished, the vocabulary distribution of the new data is corrected using the vocabulary-class frequencies of the original model, the correction formula being:

(formula given as image FSA00000555789600025 in the original publication)

where n^(0)(w_j, z_k) denotes the vocabulary-class frequency in the original model.
8. The online text clustering based on vocabulary marking according to claim 1, wherein in step d, the incremental clustering of new texts may follow the solution method of the standard LDA model without fixing the original model parameters; after Gibbs Sampling reaches the stop condition, the class of a new text is determined by computing θ, and the model parameters n(w_j, z_k) are corrected automatically at the same time.
9. The online high-dimensional data dimension reduction and visualization according to claim 1, wherein in step e, for the high-dimensional class-distribution vectors x_1, x_2, ..., x_n obtained from text clustering, the similarity p_ij between x_i and x_j is defined as:

(formula given as image FSA00000555789600026 in the original publication)

where

(formula given as image FSA00000555789600027 in the original publication)

and σ_i² is the variance of the Gaussian distribution as which the Euclidean distances from x_i to x_k (k ≠ i) are regarded; denoting by y_1, y_2, ..., y_n the low-dimensional data corresponding to x_1, x_2, ..., x_n, the similarity q_ij between y_i and y_j is defined as:

$$q_{ij} = \frac{(1 + \lVert y_i - y_j \rVert^2)^{-1}}{\sum_{k \neq l} (1 + \lVert y_k - y_l \rVert^2)^{-1}}$$
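The q_ij of claim 9 is the heavy-tailed Student-t similarity familiar from t-SNE, normalized over all ordered pairs. A small sketch (pure-Python, illustrative data) of computing the low-dimensional similarity matrix:

```python
def q_matrix(Y):
    """Student-t similarities: q_ij = (1+||y_i-y_j||²)⁻¹ / Σ_{k≠l}(1+||y_k-y_l||²)⁻¹."""
    n = len(Y)
    inv = [[0.0] * n for _ in range(n)]
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(Y[i], Y[j]))
                inv[i][j] = 1.0 / (1.0 + d2)  # heavy tail: falls off polynomially
                total += inv[i][j]
    # Normalize over all ordered pairs so the q_ij sum to 1.
    return [[inv[i][j] / total for j in range(n)] for i in range(n)]
```

The heavy tail is what lets moderately dissimilar clusters spread apart in the 2-D or 3-D display without collapsing onto each other.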
10. The online high-dimensional data dimension reduction and visualization according to claim 1, wherein in step f, the KL distance D_KL between the high-dimensional similarity set {p_ij} and the low-dimensional similarity set {q_ij} is defined as:

$$D_{KL} = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}$$
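The D_KL of claim 10 is the objective that step g minimizes by iterative search. A direct sketch of the definition (skipping zero p_ij entries, where the summand is conventionally taken as zero):

```python
import math

def kl_distance(P, Q):
    """D_KL = Σ_i Σ_j p_ij · log(p_ij / q_ij), treating p_ij = 0 terms as 0."""
    return sum(p * math.log(p / q)
               for prow, qrow in zip(P, Q)
               for p, q in zip(prow, qrow) if p > 0.0)
```

D_KL is zero exactly when the low-dimensional similarities reproduce the high-dimensional ones, which is the stopping target of the iteration.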
11. The online high-dimensional data dimension reduction and visualization according to claim 1, wherein in step h, denoting by β_1, β_2, ..., β_m the low-dimensional vectors already produced, with corresponding high-dimensional vectors α_1, α_2, ..., α_m, and supposing x_1, x_2, ..., x_n are the high-dimensional vectors requiring online processing, with corresponding low-dimensional vectors y_1, y_2, ..., y_n, then for any x_i (i = 1, 2, ..., n), its similarity p_{j|i} to α_j (j = 1, 2, ..., m) is defined as:

$$p_{j|i} = \frac{\exp(-\lVert x_i - \alpha_j \rVert^2 / 2\sigma_i^2)}{\sum_{k=1}^{m} \exp(-\lVert x_i - \alpha_k \rVert^2 / 2\sigma_i^2)}$$

where σ_i² is the variance of the Gaussian distribution as which the Euclidean distances from x_i to α_k (k = 1, 2, ..., m) are regarded; the similarity q_{j|i} between the low-dimensional vectors y_i and β_j corresponding to x_i and α_j is defined as:

$$q_{j|i} = \frac{(1 + \lVert y_i - \beta_j \rVert^2)^{-1}}{\sum_{k=1}^{m} (1 + \lVert y_i - \beta_k \rVert^2)^{-1}}$$

and the KL distance D_KL between {p_{j|i}} and {q_{j|i}} is defined as:

$$D_{KL} = \sum_{j=1}^{m} p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}$$
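Claim 11's online step places each new point independently: the anchor pairs (α_j, β_j) are frozen and only y_i is moved to minimize D_KL({p_{j|i}}, {q_{j|i}}). A minimal numerical-gradient sketch under assumed fixed σ_i², step size, and toy anchor data (none of these values are from the patent):

```python
import math

def embed_point(x, anchors_hi, anchors_lo, sigma2=1.0, lr=0.5, iters=200):
    """Place one new point in 2-D by minimizing KL(p_{j|i} || q_{j|i})
    against fixed high/low-dimensional anchor pairs (the online step h)."""
    def p_cond():
        # Gaussian similarities of x to the high-dimensional anchors.
        w = [math.exp(-sum((a - b) ** 2 for a, b in zip(x, aj)) / (2 * sigma2))
             for aj in anchors_hi]
        s = sum(w)
        return [v / s for v in w]

    def kl(y):
        # Student-t similarities of candidate y to the low-dimensional anchors.
        w = [1.0 / (1.0 + sum((a - b) ** 2 for a, b in zip(y, bj)))
             for bj in anchors_lo]
        s = sum(w)
        q = [v / s for v in w]
        return sum(pj * math.log(pj / qj) for pj, qj in zip(p, q) if pj > 0)

    p = p_cond()
    y, eps = [0.0, 0.0], 1e-5
    for _ in range(iters):
        # Numerical gradient of the KL objective with respect to y.
        g = []
        for d in range(2):
            yp = list(y); yp[d] += eps
            ym = list(y); ym[d] -= eps
            g.append((kl(yp) - kl(ym)) / (2 * eps))
        y = [yd - lr * gd for yd, gd in zip(y, g)]
    return y
```

Because the anchors never move, each new vector can be embedded as it arrives without disturbing the layout already shown to the user, which is the point of keeping the produced low-dimensional vectors fixed in step h.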
CN2011102309783A 2011-08-12 2011-08-12 Online clustering visualization method of text Pending CN102929894A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011102309783A CN102929894A (en) 2011-08-12 2011-08-12 Online clustering visualization method of text


Publications (1)

Publication Number Publication Date
CN102929894A true CN102929894A (en) 2013-02-13

Family

ID=47644693

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011102309783A Pending CN102929894A (en) 2011-08-12 2011-08-12 Online clustering visualization method of text

Country Status (1)

Country Link
CN (1) CN102929894A (en)

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103823848A (en) * 2014-02-11 2014-05-28 浙江大学 LDA (latent dirichlet allocation) and VSM (vector space model) based similar Chinese herb literature recommendation method
CN105205145A (en) * 2015-09-18 2015-12-30 中国科学院自动化研究所 Track modeling and searching method
CN106372051A (en) * 2016-10-20 2017-02-01 长城计算机软件与系统有限公司 Patent map visualization method and system
CN106897572A (en) * 2017-03-08 2017-06-27 山东大学 Lung neoplasm case matching assisted detection system and its method of work based on manifold learning
CN107085581A (en) * 2016-02-16 2017-08-22 腾讯科技(深圳)有限公司 Short text classification method and device
CN107368506A (en) * 2015-05-11 2017-11-21 斯图飞腾公司 Unstructured data analysis system and method
CN107688870A (en) * 2017-08-15 2018-02-13 中国科学院软件研究所 A kind of the classification factor visual analysis method and device of the deep neural network based on text flow input
CN108052560A (en) * 2017-12-04 2018-05-18 四川理工学院 A kind of data analysis processing method of data analysis processing method and employment trend data based on colleges and universities' data
CN108268469A (en) * 2016-12-30 2018-07-10 广东精点数据科技股份有限公司 A kind of Algorithm of documents categorization based on mixing multinomial distribution
CN108845560A (en) * 2018-05-30 2018-11-20 国网浙江省电力有限公司宁波供电公司 A kind of power scheduling log Fault Classification
CN108959344A (en) * 2018-04-10 2018-12-07 天津大学 One kind being directed to the dynamic analysis method of vocational education
CN109165359A (en) * 2017-11-06 2019-01-08 徐海飞 A kind of Methods of Dimensionality Reduction in High-dimensional Data and system based on web service platform
CN109388711A (en) * 2018-09-05 2019-02-26 广州视源电子科技股份有限公司 The method and apparatus of log stream cluster
CN109657087A (en) * 2018-11-30 2019-04-19 平安科技(深圳)有限公司 A kind of batch data mask method, device and computer readable storage medium
CN111027599A (en) * 2019-11-25 2020-04-17 中国建设银行股份有限公司 Clustering visualization method and device based on random sampling
CN111061880A (en) * 2019-12-24 2020-04-24 成都迪普曼林信息技术有限公司 Method for rapidly clustering massive text data
CN111324737A (en) * 2020-03-23 2020-06-23 中国电子科技集团公司第三十研究所 Bag-of-words model-based distributed text clustering method, storage medium and computing device
WO2021017736A1 (en) * 2019-07-31 2021-02-04 平安科技(深圳)有限公司 Image analysis apparatus
CN112948583A (en) * 2021-02-26 2021-06-11 中国光大银行股份有限公司 Data classification method and device, storage medium and electronic device
CN113421632A (en) * 2021-07-09 2021-09-21 中国人民大学 Psychological disease type diagnosis system based on time series

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6735589B2 (en) * 2001-06-07 2004-05-11 Microsoft Corporation Method of reducing dimensionality of a set of attributes used to characterize a sparse data set
CN101853239A (en) * 2010-05-06 2010-10-06 复旦大学 Nonnegative matrix factorization-based dimensionality reducing method used for clustering
CN101968798A (en) * 2010-09-10 2011-02-09 中国科学技术大学 Community recommendation method based on on-line soft constraint LDA algorithm



Similar Documents

Publication Publication Date Title
CN102929894A (en) Online clustering visualization method of text
CN106919689B (en) Professional domain knowledge mapping dynamic fixing method based on definitions blocks of knowledge
CN105243129B (en) Item property Feature words clustering method
CN105045812B (en) The classification method and system of text subject
CN105718528B (en) Academic map methods of exhibiting based on adduction relationship between paper
CN103207899B (en) Text recommends method and system
Huang et al. Dirichlet process mixture model for document clustering with feature partition
CN106649272B (en) A kind of name entity recognition method based on mixed model
CN106709754A (en) Power user grouping method based on text mining
CN106845717A (en) A kind of energy efficiency evaluation method based on multi-model convergence strategy
CN101159064A (en) Image generation system and method for generating image
CN105354593B (en) A kind of threedimensional model sorting technique based on NMF
CN103207913A (en) Method and system for acquiring commodity fine-grained semantic relation
CN109408641A (en) It is a kind of based on have supervision topic model file classification method and system
CN104573070B (en) A kind of Text Clustering Method for mixing length text set
CN104077417A (en) Figure tag recommendation method and system in social network
CN104572631A (en) Training method and system for language model
CN104572614A (en) Training method and system for language model
CN104462408A (en) Topic modeling based multi-granularity sentiment analysis method
CN108427756A (en) Personalized query word completion recommendation method and device based on same-class user model
CN109241442B (en) Project recommendation method based on predictive value filling, readable storage medium and terminal
CN114997288A (en) Design resource association method
CN104572915B (en) One kind is based on the enhanced customer incident relatedness computation method of content environment
CN106202515A (en) A kind of Mobile solution based on sequence study recommends method and commending system thereof
CN110019563B (en) Portrait modeling method and device based on multi-dimensional data

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20130213