CN103324712A - Extraction method for non-redundancy plot rule - Google Patents

Publication number
CN103324712A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2013102446012A
Other languages
Chinese (zh)
Inventor
尤涛
杜承烈
徐伟
赵湑
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University
Priority to CN2013102446012A
Publication of CN103324712A

Abstract

The invention relates to a method for extracting non-redundant episode ("plot") rules. Using a more precisely described confidence interval for the frequent closed episodes, obtained from the concept of minimal occurrence, the method takes each frequent closed episode that has not yet been determined to be a generator and compares its support with the support of its proper sub-episodes, in order to decide whether a proper sub-episode is a non-derivable generator. This avoids redundant checks of episode generators, and because the non-redundant episode rules are produced directly from the frequent closed episodes and their generators, it improves both the quality and the efficiency of rule generation.

Description

Method for extracting non-redundant episode rules
Technical field
The invention belongs to the field of episode rule extraction from data streams in data mining. It relates to a method for extracting non-redundant episode rules that improves the quality and the efficiency of episode rule generation.
Background art
Historical stream data contains a large amount of information. Studying the latent regularities of historical stream data and using them to discover latent episode rules can provide important decision support for many real-world applications.
For example, the Web server of a library records the sequence in which many readers read many documents, such as <A, A, B, D, A, C, B, B, E, A, B, A, C, E, F>. From such reading sequences one wants to discover the readers' reading behavior, which helps librarians find associations between documents and offer readers a personalized recommendation service. For the sequence <A, A, B, D, A, C, B, B, E, A, B, A, C, E, F>, the mined frequent episodes are listed in Table 1 and the frequent closed episodes in Table 2. Although the frequent closed episodes are far fewer than the frequent episodes, generating rules directly from the set of frequent closed episodes still yields a very large number of episode rules, many of which are redundant.
For sequences of this kind, how to find non-redundant episodes is therefore an important problem.
Several approaches to finding non-redundant rules exist. Li et al. adopted an algorithm that produces a non-redundant association rule base directly from frequent closed itemsets and their generators; their algorithm, DPMiner, uses a depth-first search strategy and an inverted FP-tree to find the frequent closed itemsets and their generators. To extend the use of generators to sequential pattern mining, Lo et al. introduced the concepts of sequence database equivalence classes and sequential pattern generators and, exploiting the properties of these equivalence classes, proposed the sequential generator mining algorithm GenMiner. The algorithm MANEPI, which uses a depth-first search strategy and a support definition based on minimal and non-overlapping occurrences, was proposed to find the frequent episodes of a given event sequence. In addition, Zhu et al. proposed the algorithm Extractor, which uses the Apriori property of non-generator sub-episodes to avoid redundant checks of episode generators and produces episode rules directly from frequent closed episodes and their generators, improving the quality and the efficiency of rule generation.
In the episode rule extraction algorithms above, even the best closed episodes and generators currently available can still produce redundant episode rules during rule generation. Pruning techniques are used to filter out the redundant rules, but this late-stage pruning increases the time cost of the algorithm.
Summary of the invention
Technical problem to be solved
To overcome the shortcomings of the prior art, the invention proposes a method for extracting episode rules that produces only non-redundant episode rules.
Technical solution
A method for extracting non-redundant episode rules, characterized by the following steps:
Step 1: An event sequence on the data stream is defined as a sequence of events ordered by their time of occurrence,
Figure BDA00003371010300021
where t_i < t_j (1 ≤ i < j ≤ s). Scan the event sequence ES, generate its vertical representation, and specify the minimum support min_sup and the minimum confidence.
Step 2: If the support of an episode α is greater than or equal to min_sup, then α is a frequent episode. For the event sequence ES, find all frequent episodes. Here min_sup is the support threshold.
Step 3: For the data in the sliding window, if an episode α is frequent and no proper super-episode of α has the same support as α, then α is a frequent closed episode. Mine the closed episodes in a single scan with the Apriori closed-episode mining algorithm, and store the closed episodes whose support exceeds the threshold in a global frequent closed episode tree.
Step 4: For a frequent closed episode α, for each
Figure BDA00003371010300022
compute the bound ul_x(α). When |α/x| is odd, ul_x(α) is an upper bound of α.sup, written u_x(α); the minimum of the u_x(α) is the least upper bound of α.sup, written mu(α). When |α/x| is even, ul_x(α) is a lower bound of α.sup, written l_x(α); the maximum of the l_x(α) is the greatest lower bound of α.sup, written ml(α). If α.sup = ml(α) = mu(α), α is called derivable; otherwise α is called non-derivable. Using the property that, for a given episode, if one of its proper sub-episodes has the same support as the episode itself, then this episode and every other episode that has it as a prefix are derivable episodes, the non-derivable generators of the closed episodes can be mined quickly.
Step 5: Let
Figure BDA00003371010300031
and let j be the position at which the first occurrence of episode β in episode α ends. The episode that remains after deleting the 1st through j-th event types from α is called the projection of β on α, written project(α, β). Given episodes α = <E_1 E_2 … E_m> and β = <E_1' E_2' … E_k'>, the episode <E_1 E_2 … E_m E_1' E_2' … E_k'> is called the concatenation of α and β, written
Figure BDA00003371010300032
For a given closed episode and its non-derivable generators, compute the projection of each non-derivable generator on the closed episode; then, for each non-derivable generator and its projection on the closed episode, compute the concatenation of the generator with that projection.
Step 6: Compare the ratio of the support of the frequent closed episode to the support of its non-derivable generator. If the ratio is greater than the minimum confidence, add the episode rule generated from the frequent closed episode and its non-derivable generator to the episode rule set.
Beneficial effects
The method for extracting non-redundant episode rules proposed by the invention describes more precisely the confidence interval of the frequent closed episodes obtained from the concept of minimal occurrence. For each frequent closed episode that has not yet been determined to be a generator, it compares the support of the closed episode with that of its proper sub-episodes to decide whether a proper sub-episode is a non-derivable generator. This avoids redundant checks of episode generators. Because the non-redundant episode rules are produced directly from the frequent closed episodes and their generators, the quality and the efficiency of rule generation are improved.
Description of drawings
Fig. 1 is the flow chart of the method for extracting non-redundant episode rules according to the invention.
Embodiment
The invention is now further described with reference to an embodiment and the accompanying drawing:
According to step 1: The reading sequence of the library is defined as a sequence of events ordered by their time of occurrence. Let the given event sequence be ES = <A, A, B, D, A, C, B, B, E, A, B, A, C, E, F>, let the minimum support min_sup be 2, and let the minimum confidence min_conf be 0.
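Step 1 also calls for a vertical representation of the event sequence. A minimal Python sketch of that idea follows. It assumes that the timestamp of each event is simply its 1-based position in ES (the text gives no explicit timestamps), and the helper name `vertical_representation` is hypothetical.

```python
from collections import defaultdict

def vertical_representation(events):
    """Map each event type to the ordered list of positions where it occurs.

    Assumption: the timestamp of an event is its 1-based position in the sequence.
    """
    positions = defaultdict(list)
    for t, event in enumerate(events, start=1):
        positions[event].append(t)
    return dict(positions)

# Event sequence of the embodiment, one character per event type.
ES = list("AABDACBBEABACEF")
vertical = vertical_representation(ES)
print(vertical["A"])  # positions of event type A
print(vertical["B"])  # positions of event type B
```

Storing positions per event type lets later steps look up occurrences of an event without rescanning the whole sequence.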
According to step 2: By the definition of a frequent episode, given a support threshold min_sup, an episode α is frequent if its support is greater than or equal to min_sup. For the event sequence ES and the given threshold min_sup, all frequent episodes are obtained, as listed in Table 1.
Table 1
Figure BDA00003371010300041
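The frequency test of step 2 can be sketched in Python as below. The text does not spell out how support is counted, so this sketch assumes one common measure for serial episodes: a greedy left-to-right count of non-overlapping occurrences. The function names are hypothetical, and the counts it produces are not claimed to match Table 1.

```python
def non_overlapping_support(sequence, episode):
    """Greedy left-to-right count of non-overlapping occurrences of a serial episode.

    Assumption: this is one possible support measure; the patent text does not
    define the measure it uses.
    """
    if not episode:
        return 0
    count = 0
    k = 0  # index of the next episode event we are waiting for
    for event in sequence:
        if event == episode[k]:
            k += 1
            if k == len(episode):  # one full occurrence completed
                count += 1
                k = 0
    return count

def frequent_episodes(sequence, candidates, min_sup):
    """Keep the candidate episodes whose support is at least min_sup."""
    supports = {ep: non_overlapping_support(sequence, ep) for ep in candidates}
    return {ep: s for ep, s in supports.items() if s >= min_sup}

ES = "AABDACBBEABACEF"
frequent = frequent_episodes(ES, ["A", "AB", "ACE", "F", "EF"], min_sup=2)
print(frequent)
```

With min_sup = 2, the candidates F and EF are dropped because each occurs only once under this measure.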
According to step 3: By the definition of a frequent closed episode, α is a frequent closed episode if α is frequent and no proper super-episode of α has the same support as α. All frequent closed episodes obtained in this way are listed in Table 2. The data are represented in vertical format: the vertical representation of a data set is a database of 2-tuples (item, tidlist), in which each item i is associated with the list TS(i) of identifiers of the transactions that contain it.
Table 2
Figure BDA00003371010300042
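The closedness test of step 3 can be sketched as follows. The sketch assumes serial episodes written as strings and treats β as a super-episode of α when α is a subsequence of β. The support values are toy numbers chosen for illustration; they are not the contents of Table 1 or Table 2, which are not reproduced here. The function names are hypothetical.

```python
def is_subepisode(a, b):
    """True if episode a is a subsequence of episode b (order preserved)."""
    it = iter(b)
    return all(event in it for event in a)

def closed_episodes(supports):
    """Keep episodes that have no proper super-episode with the same support."""
    closed = {}
    for ep, sup in supports.items():
        has_equal_super = any(
            len(other) > len(ep) and other_sup == sup and is_subepisode(ep, other)
            for other, other_sup in supports.items()
        )
        if not has_equal_super:
            closed[ep] = sup
    return closed

# Toy supports of some frequent episodes (illustrative values only).
supports = {"A": 5, "B": 4, "AB": 3, "AC": 2, "CE": 2, "ACE": 2}
closed = closed_episodes(supports)
print(closed)
```

Here AC and CE are absorbed by ACE, which has the same support, so only A, B, AB and ACE remain closed.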
According to step 4: The frequent closed episodes obtained in step 3 are stored in vertical format as the data set D. Scan D, and let C_L denote the set of episodes of length L. Then
C_1 = <A, 8, (1,2,2,3,4,4,6,7)>, <B, 7, (2,3,4,5,6,7,7)>, <C, 1, (4)>, <E, 1, (4)>.
First, C and E are removed because their support is below the minimum support min_sup. For the remaining episodes A and B: A.sup ≥ l_φ(A) = 1 and A.sup ≤ u_A(A) = 5, so A is a non-derivable generator. Likewise, computing the least upper bound and the greatest lower bound of B gives B.sup ≥ l_φ(B) = 1 and B.sup ≤ u_B(B) = 5, so B is also a non-derivable generator. A and B are now put into the set Gen, and the candidate set C_2 is generated from Gen = {A, B}:
C_2 = <AB, 5, (2,2,3,4,7)>, <BA, 3, (4,6,7)>, <AA, 2, (2,4)>, <BB, 1, (7)>.
The episode BB is removed because its support is below the minimum support. For the three episodes AB, BA and AA, the least upper bound and the greatest lower bound are computed:
$$\text{AB.sup} \le \begin{cases} u_A(AB) = 7 \\ u_B(AB) = 8 \end{cases} \Rightarrow \text{AB.sup} \le 7, \qquad \text{AB.sup} \ge \begin{cases} l_{AB}(AB) = 0 \\ l_{\phi}(AB) = 5 \end{cases} \Rightarrow \text{AB.sup} \ge 5$$
$$\text{BA.sup} \le \begin{cases} u_A(BA) = 7 \\ u_B(BA) = 8 \end{cases} \Rightarrow \text{BA.sup} \le 7, \qquad \text{BA.sup} \ge \begin{cases} l_{BA}(BA) = 0 \\ l_{\phi}(BA) = 3 \end{cases} \Rightarrow \text{BA.sup} \ge 3$$
$$\text{AA.sup} \le \begin{cases} u_A(AA) = 8 \\ u_A(AA) = 8 \end{cases} \Rightarrow \text{AA.sup} \le 8, \qquad \text{AA.sup} \ge \begin{cases} l_{AA}(AA) = 0 \\ l_{\phi}(AA) = 2 \end{cases} \Rightarrow \text{AA.sup} \ge 2$$
Hence AB, BA and AA are all non-derivable generators and are put into the set Gen. The set C_3 generated by
Figure BDA00003371010300057
is empty, so the algorithm terminates. The final set of non-derivable generators is {A, B, AB, BA, AA}.
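The derivability test of step 4 can be illustrated with a short Python sketch. It does not compute the individual bounds (the formula for them is not reproduced in the text) but takes them as given input, here the four bound values stated above for the episode AB. It assumes that |α/x| can be read as the difference in length between α and x; the function name `classify` is hypothetical.

```python
def classify(alpha, sup, bounds):
    """Split the bounds by parity and test derivability.

    bounds: list of (x, value) pairs, one per sub-episode x of alpha.
    Assumption: |alpha/x| is taken to be len(alpha) - len(x); odd means an
    upper bound, even means a lower bound. Both groups must be non-empty.
    """
    uppers = [v for x, v in bounds if (len(alpha) - len(x)) % 2 == 1]
    lowers = [v for x, v in bounds if (len(alpha) - len(x)) % 2 == 0]
    mu = min(uppers)  # least upper bound of alpha.sup
    ml = max(lowers)  # greatest lower bound of alpha.sup
    derivable = (sup == ml == mu)
    return mu, ml, derivable

# Bound values for AB as given in the embodiment; "" stands for the empty episode.
bounds_AB = [("A", 7), ("B", 8), ("AB", 0), ("", 5)]
mu, ml, derivable = classify("AB", 5, bounds_AB)
print(mu, ml, derivable)
```

Because the least upper bound (7) and the greatest lower bound (5) differ, the support of AB cannot be deduced from its sub-episodes, which is why AB is kept as a non-derivable generator.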
According to step 5: Episode rules are produced from the set of frequent closed episodes {A, AAB, AB, ABACE, B, BA, BAB} and the set of non-derivable generators {A, B, AB, BA, AA}. The projections of the non-derivable generators on the frequent closed episodes are as follows:
the projection of the non-derivable generator A on the frequent closed episodes is r_A = {AB, B, BACE};
the projection of the non-derivable generator B on the frequent closed episodes is r_B = {AB, A};
the projection of the non-derivable generator AB on the frequent closed episodes is r_AB = {ACE};
the projection of the non-derivable generator BA on the frequent closed episodes is r_BA = {CE, B};
the projection of the non-derivable generator AA on the frequent closed episodes is r_AA = {CE, B}.
Table 3
Figure BDA00003371010300061
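The projection and concatenation operations of step 5 can be sketched in Python as follows. The sketch assumes serial episodes written as strings and reads "first occurrence of β in α" as the earliest match of β as a subsequence of α. The function names mirror the notation project(α, β) and concat(g, r) used in the text; returning None when β does not occur in α is a choice made for this sketch.

```python
def project(alpha, beta):
    """Return what remains of alpha after the first occurrence of beta ends.

    Returns None if beta does not occur in alpha as a subsequence.
    """
    if not beta:
        return alpha
    k = 0  # index of the next event of beta still to be matched
    for j, event in enumerate(alpha):
        if event == beta[k]:
            k += 1
            if k == len(beta):
                return alpha[j + 1:]  # drop events 1..j of alpha
    return None

def concat(alpha, beta):
    """Concatenation of two serial episodes."""
    return alpha + beta

print(project("ABACE", "BA"))  # projection of BA on ABACE
print(project("AAB", "A"))     # projection of A on AAB
print(concat("AB", "ACE"))     # concatenation of AB and ACE
```

Matching greedily from the left finds the earliest position at which an occurrence of β can end, which is the position j used in the definition.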
According to step 6: For a non-derivable generator g and the projection r of g on a frequent closed episode f, compute the concatenation α = concat(g, r) of g and r. The projection r serves as the antecedent of the episode rule and α as its consequent. Compare the ratio of the supports of α and g: if α.sup/g.sup ≥ min_conf, add (g, r, α.sup, α.sup/g.sup, α.w) to the episode rule set R. The final result is shown in Table 4.
Table 4
Figure BDA00003371010300062
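The rule generation of step 6 can be sketched in Python as below. The supports are the values listed for A, B, AB and BA in C_1 and C_2 of the embodiment; the threshold 0.5 is a hypothetical value chosen so that the filter has a visible effect. The field α.w of the rule tuple is omitted because the text does not define it, and the function names are hypothetical.

```python
def project(alpha, beta):
    """Remainder of alpha after the first occurrence of beta (None if absent)."""
    k = 0
    for j, event in enumerate(alpha):
        if event == beta[k]:
            k += 1
            if k == len(beta):
                return alpha[j + 1:]
    return None

def extract_rules(closed, generators, supports, min_conf):
    """Build tuples (g, r, alpha.sup, confidence) with alpha = concat(g, r)."""
    rules, seen = [], set()
    for g in generators:
        for f in closed:
            r = project(f, g)
            if not r:            # g absent from f, or nothing left after it
                continue
            alpha = g + r        # concatenation of generator and projection
            if alpha not in supports or (g, r) in seen:
                continue
            conf = supports[alpha] / supports[g]
            if conf >= min_conf:
                rules.append((g, r, supports[alpha], conf))
                seen.add((g, r))
    return rules

supports = {"A": 8, "B": 7, "AB": 5, "BA": 3}  # from C_1 and C_2 above
rules = extract_rules(["AB", "BA"], ["A", "B"], supports, min_conf=0.5)
print(rules)
```

With these inputs the pair (A, B) passes with confidence 5/8, while the pair (B, A) is rejected because 3/7 is below the assumed threshold.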

Claims (1)

1. A method for extracting non-redundant episode rules, characterized by the following steps:
Step 1: An event sequence on the data stream is defined as a sequence of events ordered by their time of occurrence,
Figure FDA00003371010200011
where t_i < t_j (1 ≤ i < j ≤ s). Scan the event sequence ES, generate its vertical representation, and specify the minimum support min_sup and the minimum confidence.
Step 2: If the support of an episode α is greater than or equal to min_sup, then α is a frequent episode. For the event sequence ES, find all frequent episodes. Here min_sup is the support threshold.
Step 3: For the data in the sliding window, if an episode α is frequent and no proper super-episode of α has the same support as α, then α is a frequent closed episode. Mine the closed episodes in a single scan with the Apriori closed-episode mining algorithm, and store the closed episodes whose support exceeds the threshold in a global frequent closed episode tree.
Step 4: For a frequent closed episode α, for each
Figure FDA00003371010200013
compute the bound ul_x(α). When |α/x| is odd, ul_x(α) is an upper bound of α.sup, written u_x(α); the minimum of the u_x(α) is the least upper bound of α.sup, written mu(α). When |α/x| is even, ul_x(α) is a lower bound of α.sup, written l_x(α); the maximum of the l_x(α) is the greatest lower bound of α.sup, written ml(α). If α.sup = ml(α) = mu(α), α is called derivable; otherwise α is called non-derivable. Using the property that, for a given episode, if one of its proper sub-episodes has the same support as the episode itself, then this episode and every other episode that has it as a prefix are derivable episodes, the non-derivable generators of the closed episodes can be mined quickly.
Step 5: Let
Figure FDA00003371010200014
and let j be the position at which the first occurrence of episode β in episode α ends. The episode that remains after deleting the 1st through j-th event types from α is called the projection of β on α, written project(α, β). Given episodes α = <E_1 E_2 … E_m> and β = <E_1' E_2' … E_k'>, the episode <E_1 E_2 … E_m E_1' E_2' … E_k'> is called the concatenation of α and β, written
Figure FDA00003371010200012
For a given closed episode and its non-derivable generators, compute the projection of each non-derivable generator on the closed episode; then, for each non-derivable generator and its projection on the closed episode, compute the concatenation of the generator with that projection.
Step 6: Compare the ratio of the support of the frequent closed episode to the support of its non-derivable generator. If the ratio is greater than the minimum confidence, add the episode rule generated from the frequent closed episode and its non-derivable generator to the episode rule set.
CN2013102446012A 2013-06-19 2013-06-19 Extraction method for non-redundancy plot rule Pending CN103324712A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2013102446012A CN103324712A (en) 2013-06-19 2013-06-19 Extraction method for non-redundancy plot rule

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2013102446012A CN103324712A (en) 2013-06-19 2013-06-19 Extraction method for non-redundancy plot rule

Publications (1)

Publication Number Publication Date
CN103324712A true CN103324712A (en) 2013-09-25

Family

ID=49193455

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2013102446012A Pending CN103324712A (en) 2013-06-19 2013-06-19 Extraction method for non-redundancy plot rule

Country Status (1)

Country Link
CN (1) CN103324712A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6044366A (en) * 1998-03-16 2000-03-28 Microsoft Corporation Use of the UNPIVOT relational operator in the efficient gathering of sufficient statistics for data mining
CN101799810A (en) * 2009-02-06 2010-08-11 中国移动通信集团公司 Association rule mining method and system thereof
CN101937447A (en) * 2010-06-07 2011-01-05 华为技术有限公司 Alarm association rule mining method, and rule mining engine and system
CN102622447A (en) * 2012-03-19 2012-08-01 南京大学 Hadoop-based frequent closed itemset mining method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
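Zhu Huisheng (朱辉生): "Research on data stream prediction based on episode rule matching" (情节规则匹配的数据流预测研究), China Doctoral Dissertations Full-text Database, Information Science and Technology (《中国博士学位论文全文数据库信息科技辑》), no. 12, 15 December 2011, pages 138 - 18 *
Zhu Huisheng (朱辉生) et al.: "Data stream prediction based on episode rule matching" (基于情节规则匹配的数据流预测), Journal of Software (《软件学报》), vol. 23, no. 5, 31 May 2012, pages 1183 - 1194 *
Zhu Huisheng (朱辉生) et al.: "Rule extraction based on frequent closed episodes and their generators" (基于频繁闭情节及其生成子的规则抽取), Chinese Journal of Computers (《计算机学报》), vol. 35, no. 1, 31 January 2012, pages 53 - 64 *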

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104008185A (en) * 2014-06-11 2014-08-27 西北工业大学 Frequent close scenario mining method based on same node table and scenario tree
CN104765847A (en) * 2015-04-20 2015-07-08 西北工业大学 Frequent closed item set mining method based on order-preserving characteristic and preamble tree
CN106469170A (en) * 2015-08-18 2017-03-01 阿里巴巴集团控股有限公司 The treating method and apparatus of text data
CN106469170B (en) * 2015-08-18 2019-09-10 阿里巴巴集团控股有限公司 The treating method and apparatus of text data

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20130925