CN100495405C - Hierarchy clustering method of successive dichotomy for document in large scale - Google Patents


Info

Publication number
CN100495405C
CN100495405C · CNB2007100363096A · CN200710036309A
Authority
CN
China
Prior art keywords
text
vector
article
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CNB2007100363096A
Other languages
Chinese (zh)
Other versions
CN101004761A (en)
Inventor
黄萱菁
赵林
钱线
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University
Original Assignee
Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fudan University filed Critical Fudan University
Priority to CNB2007100363096A priority Critical patent/CN100495405C/en
Publication of CN101004761A publication Critical patent/CN101004761A/en
Application granted
Publication of CN100495405C publication Critical patent/CN100495405C/en

Abstract

A method for clustering a large collection of texts: each text is represented in a vector space; pairwise similarities between texts are computed; the resulting graph is embedded into a one-dimensional space; the K-means algorithm splits the texts into two classes; and this bisection is applied successively until the stopping requirement is satisfied and no subgraph is divided further.

Description

A hierarchical clustering method of successive bisection for large-scale documents
Technical field
The invention belongs to the field of text information technology and specifically relates to a clustering method for large-scale text collections.
Background technology
With the spread of the Internet, more and more people use the network as a medium for voicing their opinions. Forums, blogs, and chatrooms all provide rich public-opinion information, and automatically analyzing this information by computer has become an important problem. Text clustering is a technique that uses computers to sort text automatically: after clustering, articles belonging to the same topic are grouped into the same class, which makes it convenient for users to find and read them. The main text clustering methods at present are the following:
1. K-means is a fast clustering algorithm based on an optimization criterion. The algorithm first picks k initial class centers at random, then assigns each text to the class whose center is nearest to it; once every text has been assigned, the center of each class is recomputed. This iteration continues until the class centers no longer change appreciably. The advantage of this method is speed, but the clustering result may be far from ideal, and the number of classes must be specified manually in advance.
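For reference, the K-means procedure described above can be sketched in NumPy as follows. This is a generic illustration, not the patented method; the deterministic initialization (first k samples) and all variable names are assumptions for the sketch, whereas the description above uses random initial centers.

```python
import numpy as np

def kmeans(X, k, iters=100):
    """Minimal K-means; X has shape (n_samples, n_features)."""
    centers = X[:k].astype(float).copy()   # simple init (random in practice)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # assign each sample to its nearest class center
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # recompute each class center as the mean of its members
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):      # stop when centers no longer change
            break
        centers = new
    return labels
```

On two well-separated groups of points this converges in a couple of iterations; with unlucky initial centers it can still settle on a poor split, which is the weakness noted above.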
2. Hierarchical clustering starts by treating each text as its own class, then repeatedly merges the two most similar classes until only one class remains. The similarity between two classes is taken to be the similarity of the two most similar texts, one from each class. The advantage of this method is that the number of classes need not be specified at the start: the successive merges build a tree, from which the user can derive a classification at whatever granularity is needed. Its drawback is poor clustering quality.
3. Spectral clustering computes the pairwise similarities between texts, so that n texts form an undirected graph with n nodes in which the weight of the edge between two nodes is the similarity of the corresponding texts. The algorithm then tries to embed this graph into a low-dimensional space so that edges with large weights are preserved as much as possible while edges with small weights can be ignored. Once the low-dimensional representation of each text is obtained, either of the two algorithms above can be used for clustering. The advantage of this algorithm is better clustering quality; its drawback is that it is very slow.
Summary of the invention
The purpose of the present invention is to propose a clustering algorithm for large-scale text that is both effective and computationally fast.
The clustering method for large-scale text proposed by the present invention is an improved algorithm that draws on the advantages of the second and third algorithms above. Its core consists of two parts: graph embedding and clustering. The first part is similar to spectral clustering in that the graph is embedded into a low-dimensional space; the difference is that here the dimension of that space is fixed at one, which in effect imposes an ordering on all the texts. The texts are then gathered into two classes with K-means or hierarchical clustering; that is, the graph is bisected. "Successive bisection" means that each resulting subgraph is bisected again, until every subgraph obtained is tight enough, at which point splitting stops.
The concrete steps are: represent the texts in a vector space; compute the pairwise similarities between texts; embed the graph into the one-dimensional space and use K-means or hierarchical clustering to split the graph into two classes; then bisect successively until the stopping requirement is met and no subgraph is split further.
The advantages of the present invention are as follows:
Since the "successive bisection" hierarchical clustering algorithm is an improvement over existing algorithms, its advantages can be seen by comparison with them.
1. The number of classes need not be specified in advance. Unlike the K-means algorithm, which requires the number of classes to be given manually, the "successive bisection" hierarchical clustering algorithm only needs a threshold fixed in advance as the stopping condition, just as in hierarchical clustering. This is very convenient in practice.
2. Good clustering quality. Because the "successive bisection" algorithm uses graph embedding, its clustering results are very close to those of spectral clustering and markedly better than those of K-means and hierarchical clustering.
3. Speed. In spectral clustering, the dimensionality reduction consumes a great deal of time; in particular, when the number of classes is large, the dimension of the reduced space grows correspondingly, and the time cost becomes very high. Likewise, the per-iteration cost of K-means is proportional to the number of classes, so it too becomes expensive when there are many classes. In the "successive bisection" method, the texts are split into only two classes at a time, so the reduced dimension is the minimum of one, and obtaining k classes requires only on the order of log k levels of bisection. Compared with spectral clustering and K-means, the advantage is obvious.
In summary, the "successive bisection" hierarchical clustering algorithm is automatic, effective, and fast, and is a superior text clustering algorithm.
Embodiment
The basic procedure: after the texts are represented as vectors in space, compute the pairwise similarities between texts to obtain a graph, then cluster it with the "successive bisection" hierarchical clustering algorithm.
1. Vector space representation of the texts.
Suppose there are n articles containing m distinct words in total. Each article is then represented by an m-dimensional vector, and the n articles form an m × n matrix, denoted M. M_ij is the tf-idf value of the i-th word in the j-th article:

M_ij = tf_ij × log(n / df_i)

where tf_ij is the frequency with which the i-th word occurs in the j-th article, and df_i is the number of articles containing the i-th word. To eliminate differences in text length, each vector is normalized after the texts are represented as vectors, i.e. each vector is divided by its norm:

X_ij = M_ij / sqrt(Σ_{i=1}^{m} M_ij²)

In this way each text is represented as a vector of unit length in the space.
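The tf-idf weighting and column normalization above can be sketched as follows. The toy term-frequency counts are invented for illustration; only the formulas come from the description.

```python
import numpy as np

# Toy corpus: m = 3 words (rows) x n = 3 articles (columns); counts are invented.
tf = np.array([[2.0, 0.0, 1.0],
               [0.0, 3.0, 0.0],
               [1.0, 1.0, 1.0]])
n = tf.shape[1]
df = (tf > 0).sum(axis=1)            # df_i: number of articles containing word i
M = tf * np.log(n / df)[:, None]     # M_ij = tf_ij * log(n / df_i)
X = M / np.linalg.norm(M, axis=0)    # divide each column by its norm -> unit length
```

Note that a word occurring in every article (the third row) gets idf log(n/n) = 0 and therefore contributes nothing, which matches the intent of the weighting.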
2. Construction of the graph.
Calculate the similarity between the text in twos.Similarity between two texts.I.e. included angle cosine between the vector of two texts.N piece of writing text has constituted the non-directed graph that contains n node, and the weight on limit is exactly the similarity between these two texts between the node.Its similarity matrix is represented with S.
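Because the document vectors are already unit length, the full cosine-similarity matrix is just a matrix product, as this small sketch shows (the example vectors are invented):

```python
import numpy as np

# Three unit-length document vectors as the columns of X (toy values).
X = np.array([[1.0, 0.0, 0.6],
              [0.0, 1.0, 0.8]])

# Since each column has norm 1, the dot product of two columns
# is exactly the cosine of the angle between them.
S = X.T @ X        # S[i, j] = cosine similarity of documents i and j
```

S is symmetric with ones on the diagonal, i.e. every text is maximally similar to itself.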
3. Splitting:
a) Embed the graph into the one-dimensional space: compute the matrix L = D·S·D, where D is the diagonal matrix with D_ii = 1/sqrt(Σ_{j=1}^{n} S_ij). Then compute the eigenvector y of L corresponding to the second-largest eigenvalue. The component y_i represents the position of the i-th document in the one-dimensional space.
b) Split: compute the mean y′ of the vector y; if y_i > y′, assign the i-th document to the first class, otherwise to the second class.
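Steps 3a and 3b can be sketched as a single bisection routine. The normalization D_ii = 1/sqrt(Σ_j S_ij) is a reconstruction consistent with standard symmetric spectral normalization, so treat the exact form as an assumption rather than a quote of the patent:

```python
import numpy as np

def bisect(S):
    """One split: embed the graph into 1-D, then divide around the mean."""
    d = 1.0 / np.sqrt(S.sum(axis=1))     # D_ii = 1 / sqrt(row sum of S)
    L = S * np.outer(d, d)               # L = D @ S @ D for diagonal D
    w, V = np.linalg.eigh(L)             # eigenvalues in ascending order
    y = V[:, -2]                         # eigenvector of the 2nd-largest eigenvalue
    return (y > y.mean()).astype(int)    # split around the mean of y
```

On a graph with two tightly connected groups joined by weak edges, the second eigenvector changes sign between the groups, so the mean split recovers them.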
4. Stopping test:
If the minimum edge weight within a resulting subgraph exceeds a value given in advance, all documents in that subgraph are considered to belong to the same topic and it is not split further; otherwise, return to step 3.
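Steps 3 and 4 together form the successive-bisection loop, sketched below. The bisection helper is repeated here so the sketch is self-contained; the normalization, the mean-split rule, and the threshold value are illustrative assumptions, not quotes of the patent:

```python
import numpy as np

def bisect(S):
    """One spectral bisection: embed into 1-D, split around the mean."""
    d = 1.0 / np.sqrt(S.sum(axis=1))
    L = S * np.outer(d, d)                  # L = D @ S @ D, D diagonal
    y = np.linalg.eigh(L)[1][:, -2]         # 2nd-largest eigenvalue's eigenvector
    return (y > y.mean()).astype(int)

def successive_bisection(S, threshold, ids=None):
    """Recursively bisect until every subgraph's minimum edge weight
    exceeds `threshold`; returns a list of clusters (lists of doc indices)."""
    if ids is None:
        ids = list(range(len(S)))
    if len(ids) <= 1:                       # too small to split
        return [ids]
    # stopping rule: subgraph already tight enough -> same topic
    if S[~np.eye(len(ids), dtype=bool)].min() > threshold:
        return [ids]
    labels = bisect(S)
    clusters = []
    for c in (0, 1):
        idx = np.where(labels == c)[0]
        if len(idx) == 0:                   # degenerate split: stop here
            return [ids]
        sub = S[np.ix_(idx, idx)]           # similarity matrix of the subgraph
        clusters += successive_bisection(sub, threshold, [ids[i] for i in idx])
    return clusters
```

Each recursion level halves the problem, which is where the roughly log k levels of bisection claimed in the summary come from.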

Claims (1)

1. A hierarchical clustering method of successive bisection for large-scale documents, characterized in that the concrete steps are as follows: represent the texts in a vector space; compute the pairwise similarities between texts; embed the graph into the one-dimensional space and use K-means or hierarchical clustering to split the graph into two classes; then bisect successively until the requirement is met and the graph is not split further, wherein:
(1) Vector space representation of the texts:
Suppose there are n articles containing m distinct words in total; each article is represented by an m-dimensional vector, and the n articles form an m × n matrix, denoted M. M_ij is the tf-idf value of the i-th word in the j-th article: M_ij = tf_ij × log(n / df_i), where tf_ij is the frequency with which the i-th word occurs in the j-th article and df_i is the number of articles containing the i-th word; after the texts are represented as vectors, they are normalized:

X_ij = M_ij / sqrt(Σ_{i=1}^{m} M_ij²)
(2) Construction of the graph:
Compute the pairwise similarities between texts; the similarity matrix is denoted S; the similarity between two texts is the cosine of the angle between their vectors; the n texts form the n nodes of an undirected graph, and the weight of the edge between two nodes is the similarity between the two texts;
(3) Splitting:
a) Embed the graph into the one-dimensional space: compute the matrix L = D × S × D, where D is the diagonal matrix with D_ii = 1/sqrt(Σ_{j=1}^{n} S_ij); then compute the eigenvector y of L corresponding to the second-largest eigenvalue, whose component y_i represents the position of the i-th document in the one-dimensional space;
b) Split: compute the mean y′ of the vector y; if y_i > y′, assign the i-th document to the first class, otherwise to the second class;
(4) Stopping test:
If the minimum edge weight within a subgraph obtained by splitting exceeds a value given in advance, all documents in that subgraph are considered to belong to the same topic and it is not split further; otherwise, return to step (3).
CNB2007100363096A 2007-01-10 2007-01-10 Hierarchy clustering method of successive dichotomy for document in large scale Expired - Fee Related CN100495405C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2007100363096A CN100495405C (en) 2007-01-10 2007-01-10 Hierarchy clustering method of successive dichotomy for document in large scale

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB2007100363096A CN100495405C (en) 2007-01-10 2007-01-10 Hierarchy clustering method of successive dichotomy for document in large scale

Publications (2)

Publication Number Publication Date
CN101004761A CN101004761A (en) 2007-07-25
CN100495405C true CN100495405C (en) 2009-06-03

Family

ID=38703898

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2007100363096A Expired - Fee Related CN100495405C (en) 2007-01-10 2007-01-10 Hierarchy clustering method of successive dichotomy for document in large scale

Country Status (1)

Country Link
CN (1) CN100495405C (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101178703B (en) * 2007-11-23 2010-05-19 西安交通大学 Failure diagnosis chart clustering method based on network dividing
US8055693B2 (en) * 2008-02-25 2011-11-08 Mitsubishi Electric Research Laboratories, Inc. Method for retrieving items represented by particles from an information database
US8370278B2 (en) * 2010-03-08 2013-02-05 Microsoft Corporation Ontological categorization of question concepts from document summaries
CN103049569A (en) * 2012-12-31 2013-04-17 武汉传神信息技术有限公司 Text similarity matching method on basis of vector space model
CN103365999A (en) * 2013-07-16 2013-10-23 盐城工学院 Text clustering integrated method based on similarity degree matrix spectral factorization
CN104102726A (en) * 2014-07-22 2014-10-15 南昌航空大学 Modified K-means clustering algorithm based on hierarchical clustering
CN107291760A (en) * 2016-04-05 2017-10-24 阿里巴巴集团控股有限公司 Unsupervised feature selection approach, device
CN106815310B (en) * 2016-12-20 2020-04-21 华南师范大学 Hierarchical clustering method and system for massive document sets
CN108664538B (en) * 2017-11-30 2022-02-01 全球能源互联网研究院有限公司 Automatic identification method and system for suspected familial defects of power transmission and transformation equipment
CN108170840B (en) * 2018-01-15 2019-11-19 浙江大学 A kind of domain classification relationship Auto-learning Method of text-oriented
CN109376381A (en) * 2018-09-10 2019-02-22 平安科技(深圳)有限公司 Method for detecting abnormality, device, computer equipment and storage medium are submitted an expense account in medical insurance
CN110032606B (en) * 2019-03-29 2021-05-14 创新先进技术有限公司 Sample clustering method and device
CN111310467B (en) * 2020-03-23 2023-12-12 应豪 Topic extraction method and system combining semantic inference in long text
CN113449108B (en) * 2021-06-30 2022-10-21 南京理工大学 Financial news stream burst detection method based on hierarchical clustering
CN114328922B (en) * 2021-12-28 2022-08-02 盐城工学院 Selective text clustering integration method based on spectrogram theory

Non-Patent Citations (6)

Title
Improvement of the k-means algorithm in Web document clustering. Wang Zixing, Feng Zhiyong. Microcomputer & Its Applications, No. 4, 2004. *
Research on a hierarchical text clustering algorithm based on K-Means. Wei Jinghui, He Pilian, Sun Yueheng. Computer Applications, Vol. 25, No. 10, 2005. *
A hierarchical clustering algorithm based on distribution models. Ye Mao, Chen Yong. Journal of University of Electronic Science and Technology of China, Vol. 33, No. 2, 2004. *

Also Published As

Publication number Publication date
CN101004761A (en) 2007-07-25

Similar Documents

Publication Publication Date Title
CN100495405C (en) Hierarchy clustering method of successive dichotomy for document in large scale
CN107609121B (en) News text classification method based on LDA and word2vec algorithm
CN104199972B (en) A kind of name entity relation extraction and construction method based on deep learning
CN103514183B (en) Information search method and system based on interactive document clustering
Van Laere et al. Spatially aware term selection for geotagging
CN103020293B (en) A kind of construction method and system of the ontology library of mobile application
CN103425777B (en) A kind of based on the short message intelligent classification and the searching method that improve Bayes's classification
CN103514181B (en) A kind of searching method and device
CN105389341B (en) A kind of service calls repeat the text cluster and analysis method of incoming call work order
CN103823893A (en) User comment-based product search method and system
CN103324666A (en) Topic tracing method and device based on micro-blog data
CN103294817A (en) Text feature extraction method based on categorical distribution probability
CN109165294A (en) Short text classification method based on Bayesian classification
CN104866572A (en) Method for clustering network-based short texts
CN103365924A (en) Method, device and terminal for searching information
CN101369279A (en) Detection method for academic dissertation similarity based on computer searching system
CN108228612B (en) Method and device for extracting network event keywords and emotional tendency
CN104050556A (en) Feature selection method and detection method of junk mails
CN105512333A (en) Product comment theme searching method based on emotional tendency
Zhang et al. A cue-based hub-authority approach for multi-document text summarization
Li et al. Netnews bursty hot topic detection based on bursty features
Jahnavi et al. FPST: a new term weighting algorithm for long running and short lived events
Al-Radaideh et al. An approach for Arabic text categorization using association rule mining
CN102799666B (en) Method for automatically categorizing texts of network news based on frequent term set
Annam et al. Entropy based informative content density approach for efficient web content extraction

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20090603

Termination date: 20130110