CN103324712A - Method for extracting non-redundant episode rules

Info

Publication number: CN103324712A
Application number: CN2013102446012A
Authority: CN
Inventors: 尤涛; 杜承烈; 徐伟; 赵湑
Original assignee: Northwestern Polytechnical University
Current assignee: Northwestern Polytechnical University
Priority date: 2013-06-19
Filing date: 2013-06-19
Publication date: 2013-09-25

Abstract

The invention relates to a method for extracting non-redundant episode rules. Using the more precisely described confidence interval of frequent closed episodes obtained from the concept of minimal occurrence, the method takes each frequent closed episode that has not yet been determined to be a generator and compares the support of the closed episode with that of its proper sub-episodes, so as to determine whether those proper sub-episodes are non-derivable generators. The method avoids redundant episode-generator checks and, because the non-redundant episode rules are generated directly from the frequent closed episodes and their generators, improves both the quality and the efficiency of episode rule generation.

Description

A method for extracting non-redundant episode rules

Technical field

The invention belongs to the field of episode rule extraction over data streams in data mining technology, and relates to a method for extracting non-redundant episode rules that improves the quality and efficiency of episode rule generation.

Background art

Historical stream data contains a large amount of information. Studying the latent regularities in historical stream data and using them to discover potential episode rules can provide important decision support for many real-world applications.

For example, a library's Web server records the reading sequences of multiple readers over multiple documents, such as <A, A, B, D, A, C, B, B, E, A, B, A, C, E, F>. From these reading sequences, the readers' reading behaviour needs to be identified, so as to help librarians discover associations between documents and offer personalized recommendation services to readers. For the sequence <A, A, B, D, A, C, B, B, E, A, B, A, C, E, F>, the mined frequent episodes are shown in Table 1 and the frequent closed episodes in Table 2. Although the number of frequent closed episodes is far smaller than the number of frequent episodes, generating rules directly from the set of frequent closed episodes still yields a large number of episode rules, many of which are redundant.

For sequences of this kind, finding the non-redundant episodes in the sequence is a very important problem.

At present, on the problem of finding potential non-redundant episode rules, Li et al. adopted an algorithm that generates a non-redundant association rule base directly from frequent closed itemsets and their generators; the proposed algorithm, DPMiner, uses a depth-first search strategy and an inverted FP-tree to find frequent closed itemsets and their generators. To extend the use of generators to sequential pattern mining, Lo et al. introduced the concepts of sequence database equivalence classes and sequential pattern generators and, exploiting the properties of these equivalence classes, proposed the sequential pattern generator mining algorithm GenMiner. The algorithm MANEPI, which adopts a depth-first search strategy and a support definition based on minimal and non-overlapping occurrences, was proposed to find the frequent episodes in a given event sequence. In addition, Zhu Qiusheng et al. proposed the algorithm Extractor, which uses the Apriori property of non-generator episodes to avoid redundant episode-generator checks and produces episode rules directly from frequent closed episodes and their generators, improving the quality and efficiency of episode rule generation.

In the above episode rule extraction algorithms, even the currently best closed episodes and generators can still produce redundant episode rules during rule generation. Although some pruning techniques are used to screen out redundant episode rules, this late-stage pruning increases the time cost of the algorithm.

Summary of the invention

Technical problem to be solved

To overcome the deficiencies of the prior art, the present invention proposes a method for extracting non-redundant episode rules, which produces non-redundant episode rules.

Technical solution

A method for extracting non-redundant episode rules, characterized in that the steps are as follows:

Step 1: an event sequence on the data stream is defined as a sequence of events ordered by occurrence time,

where t_i < t_j (1 ≤ i ≤ j ≤ s); scan the event sequence ES, generate the vertical representation of the event sequence, and specify the minimum support min_sup and the minimum confidence;

Step 2: if the support of an episode α is greater than or equal to min_sup, then α is a frequent episode; find all frequent episodes for the event sequence ES, where min_sup is the support threshold;

Step 3: for the data in the sliding window, if an episode α is frequent and no proper super-episode of α has the same support as α, then α is a frequent closed episode; use the Apriori closed-episode mining algorithm to mine in a single pass, and store the closed episodes whose support exceeds the threshold in the global frequent closed episode tree;

Step 4: for a frequent closed episode α, for each

bound ul_x(α): when |α/x| is odd, ul_x(α) is an upper bound of α.sup, denoted u_x(α), and the minimum of the u_x(α) is called the least upper bound of α.sup, denoted mu(α); when |α/x| is even, ul_x(α) is a lower bound of α.sup, denoted l_x(α), and the maximum of the l_x(α) is called the greatest lower bound of α.sup, denoted ml(α). If α.sup = ml(α) = mu(α), α is called derivable; otherwise α is called non-derivable. By the property that, if a proper sub-episode of a given episode has the same support as that episode, then the episode and every episode having it as a prefix are derivable, the non-derivable generators of the closed episodes can be mined quickly;

Step 5: let

j be the position at which the first occurrence of episode β in episode α ends; the episode that remains after deleting the 1st through j-th event types from α is called the projection of β on α, denoted project(α, β). Given episodes α = <E_1 E_2 ... E_m> and β = <E_1' E_2' ... E_k'>, the episode <E_1 E_2 ... E_m E_1' E_2' ... E_k'> is called the concatenation of α and β, written concat(α, β) in the embodiment.

For a given closed episode and its non-derivable generators, obtain the projection of each non-derivable generator on the closed episode; then, for each non-derivable generator and its projection on the closed episode, obtain the concatenation of the generator and that projection;

Step 6: compare the ratio of the support of the frequent closed episode to that of its non-derivable generator; if the ratio is greater than the minimum confidence, add the episode rule generated from the frequent closed episode and its non-derivable generator to the episode rule set.

Beneficial effects

The method for extracting non-redundant episode rules proposed by the present invention uses the more precisely described confidence interval of frequent closed episodes obtained from the concept of minimal occurrence. For each frequent closed episode not yet determined to be a generator, it compares the support of the closed episode with that of its proper sub-episodes to determine whether those proper sub-episodes are non-derivable generators. This avoids redundant episode-generator checks; non-redundant episode rules are produced directly from the frequent closed episodes and their generators, which improves the quality and efficiency of episode rule generation.

Description of drawings

Fig. 1 is a flow chart of the method for extracting non-redundant episode rules of the present invention.

Embodiment

The invention is further described below with reference to the embodiment and the accompanying drawing:

According to step 1: the reading sequence in the library is defined as a sequence of events ordered by occurrence time. Let the given event sequence be ES = <A, A, B, D, A, C, B, B, E, A, B, A, C, E, F>, with minimum support min_sup = 2 and minimum confidence min_conf = 0.

According to step 2: by the definition of a frequent episode, given the support threshold min_sup, if the support of an episode α is greater than or equal to min_sup, then α is a frequent episode. For the event sequence ES and the given support threshold min_sup, all frequent episodes obtained are shown in Table 1.

Table 1
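The frequency test of step 2 can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function name `frequent_episodes` is hypothetical, the id lists are the length-1 entries quoted in step 4 of this embodiment, and the sketch assumes that the support of an episode equals the length of its id list (which agrees with the quoted entries, e.g. <A, 8, (1,2,2,3,4,4,6,7)>).

```python
# Vertical entries for the length-1 episodes, copied from step 4 of this
# embodiment: episode -> list of ids (a repeated id means repeated occurrences).
vertical = {
    "A": [1, 2, 2, 3, 4, 4, 6, 7],
    "B": [2, 3, 4, 5, 6, 7, 7],
    "C": [4],
    "E": [4],
}


def frequent_episodes(vertical, min_sup):
    """Return {episode: support} for episodes whose support >= min_sup.

    Assumption: support is taken to be the length of the id list.
    """
    supports = {episode: len(ids) for episode, ids in vertical.items()}
    return {ep: sup for ep, sup in supports.items() if sup >= min_sup}


frequent = frequent_episodes(vertical, min_sup=2)
print(frequent)  # {'A': 8, 'B': 7}
```

With min_sup = 2, C and E are discarded and A and B remain, matching the pruning described for C_1 in step 4.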

According to step 3: by the definition of a frequent closed episode, if an episode α is frequent and no proper super-episode of α has the same support as α, then α is a frequent closed episode. All frequent closed episodes obtained are shown in Table 2. The data are represented in the vertical data format: the vertical representation of a data set is a database expressed as two-tuples (item, tidlist), that is, each item i is associated with the list TS(i) of ids of the transactions that contain it.

Table 2
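The closedness condition of step 3 can be illustrated as follows. This is a sketch under stated assumptions: episodes are written as strings of single-letter event types, a super-episode is taken to be any longer episode that contains the given one as an order-preserving subsequence, and the supports in `toy_supports` are invented for illustration only (they are not the values of Table 1 or Table 2). The helper names are hypothetical.

```python
def is_subepisode(beta, alpha):
    """True if beta occurs in alpha as an order-preserving subsequence."""
    remaining = iter(alpha)
    # `event in remaining` consumes the iterator up to the first match,
    # so later events of beta must be found after earlier ones.
    return all(event in remaining for event in beta)


def closed_episodes(supports):
    """Keep the episodes that have no proper super-episode of equal support."""
    closed = {}
    for episode, sup in supports.items():
        has_equal_super = any(
            len(other) > len(episode)
            and is_subepisode(episode, other)
            and other_sup == sup
            for other, other_sup in supports.items()
        )
        if not has_equal_super:
            closed[episode] = sup
    return closed


# Hypothetical supports, for illustration only.
toy_supports = {"A": 4, "B": 3, "AB": 3, "BA": 2}
print(closed_episodes(toy_supports))  # {'A': 4, 'AB': 3, 'BA': 2}
```

In the toy data B is not closed because its super-episode AB has the same support, while A, AB and BA are kept.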

According to step 4: the frequent closed episodes obtained in step 3 are stored in the vertical data format as D. Scanning the data set D, and letting C_l denote the set of episodes of length l, we have

C_1 = {<A, 8, (1,2,2,3,4,4,6,7)>, <B, 7, (2,3,4,5,6,7,7)>, <C, 1, (4)>, <E, 1, (4)>}. First, C and E are removed according to the minimum support min_sup. For the remaining A and B: A.sup ≥ l_φ(A) = 1 and A.sup ≤ u_A(A) = 5, so A is a non-derivable generator. Likewise, computing the least upper bound and the greatest lower bound of B gives B.sup ≥ l_φ(B) = 1 and B.sup ≤ u_B(B) = 5, so B is also a non-derivable generator. A and B are then put into the set Gen, and the superset C_2 generated from Gen = {A, B} is

C_2 = {<AB, 5, (2,2,3,4,7)>, <BA, 3, (4,6,7)>, <AA, 2, (2,4)>, <BB, 1, (7)>}. According to the minimum support, episode BB is removed; for the three episodes AB, BA and AA, the least upper bound and the greatest lower bound are computed respectively:

$$AB.\sup \leq \begin{cases} u_{A}(AB) = 7 \\ u_{B}(AB) = 8 \end{cases} \Rightarrow AB.\sup \leq 7,$$

$$AB.\sup \geq \begin{cases} l_{AB}(AB) = 0 \\ l_{\varphi}(AB) = 5 \end{cases} \Rightarrow AB.\sup \geq 5$$

$$BA.\sup \leq \begin{cases} u_{A}(BA) = 7 \\ u_{B}(BA) = 8 \end{cases} \Rightarrow BA.\sup \leq 7,$$

$$BA.\sup \geq \begin{cases} l_{BA}(BA) = 0 \\ l_{\varphi}(BA) = 3 \end{cases} \Rightarrow BA.\sup \geq 3$$

$$AA.\sup \leq \begin{cases} u_{A}(AA) = 8 \\ u_{A}(AA) = 8 \end{cases} \Rightarrow AA.\sup \leq 8,$$

$$AA.\sup \geq \begin{cases} l_{AA}(AA) = 0 \\ l_{\varphi}(AA) = 2 \end{cases} \Rightarrow AA.\sup \geq 2$$

It follows that the three episodes AB, BA and AA are non-derivable generators, and they are put into the set Gen. From

the generated C_3 is empty, so the algorithm terminates, and the final set of non-derivable generators is {A, B, AB, BA, AA}.
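The derivability test applied to AB, BA and AA above can be checked mechanically. The sketch below takes the individual upper and lower bounds as given (the values are those listed in this embodiment; how each bound is computed is not reproduced here) and applies the rule from step 4 that an episode is derivable only when α.sup = ml(α) = mu(α). The function name `classify` is hypothetical.

```python
def classify(sup, upper_bounds, lower_bounds):
    """Return (mu, ml, derivable) for one episode.

    mu is the least upper bound, ml the greatest lower bound; the episode
    is derivable only when sup == ml == mu.
    """
    mu = min(upper_bounds)
    ml = max(lower_bounds)
    return mu, ml, sup == ml == mu


# episode -> (support, upper bounds, lower bounds), as listed in the embodiment
candidates = {
    "AB": (5, [7, 8], [0, 5]),
    "BA": (3, [7, 8], [0, 3]),
    "AA": (2, [8, 8], [0, 2]),
}

non_derivable = [
    episode
    for episode, (sup, ubs, lbs) in candidates.items()
    if not classify(sup, ubs, lbs)[2]
]
print(non_derivable)  # ['AB', 'BA', 'AA']
```

For each of the three episodes the greatest lower bound equals the support but the least upper bound is strictly larger, so none of them is derivable.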

According to step 5: episode rules are produced from the set of frequent closed episodes {A, AAB, AB, ABACE, B, BA, BAB} and the set of non-derivable generators {A, B, AB, BA, AA}. The projections of the non-derivable generators on the frequent closed episodes are as follows: the projection of the non-derivable generator A on the frequent closed episodes is r_A = {AB, B, BACE}; the projection of B is r_B = {AB, A}; the projection of AB is r_AB = {ACE}; the projection of BA is r_BA = {CE, B}; and the projection of AA is r_AA = {CE, B}.

Table 3
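The projection and concatenation operations of step 5 can be sketched as below. Episodes are represented as strings of single-letter event types; `project` locates the first occurrence of β in α by greedy left-to-right matching and returns what remains after position j. The helper `projections`, and the choice to drop empty projections and closed episodes that do not contain the generator, are assumptions made for this illustration.

```python
def project(alpha, beta):
    """Projection of beta on alpha, or None if beta does not occur in alpha."""
    pos = 0
    for event in beta:
        pos = alpha.find(event, pos)  # earliest match at or after pos
        if pos == -1:
            return None
        pos += 1  # continue searching after the matched event
    return alpha[pos:]


def concat(alpha, beta):
    """Concatenation of two episodes."""
    return alpha + beta


def projections(generator, closed):
    """Set of non-empty projections of a generator on the closed episodes."""
    results = (project(f, generator) for f in closed)
    return {r for r in results if r}


closed = ["A", "AAB", "AB", "ABACE", "B", "BA", "BAB"]
print(project("ABACE", "BA"))  # CE
print(concat("BA", "CE"))      # BACE
```

Applied to the closed episodes of this embodiment, `projections("A", closed)` yields {AB, B, BACE} and `projections("BA", closed)` yields {CE, B}, in agreement with r_A and r_BA above.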

According to step 6: for a non-derivable generator g and the projection r of g on a frequent closed episode f, obtain the concatenation α = concat(g, r) of g and r. The projection r serves as the antecedent of the episode rule and α as the consequent. The supports of α and g are compared: if α.sup/g.sup ≥ min_conf, then (g, r, α.sup, α.sup/g.sup, α.w) is added to the episode rule set R. The final result is shown in Table 4.

Table 4
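The confidence filter of step 6 can be sketched as follows. The supports of A, B, AB and BA are those quoted in step 4 of this embodiment; the projection map covers only two generator/projection pairs, and min_conf = 0.5 is a hypothetical threshold chosen so that the filter has a visible effect (the embodiment itself sets min_conf to 0). The field α.w is omitted because the text does not define it, and the function name `extract_rules` is hypothetical.

```python
def extract_rules(projection_map, supports, min_conf):
    """Build (g, r, alpha.sup, confidence) tuples with confidence >= min_conf."""
    rules = []
    for g, rs in projection_map.items():
        for r in rs:
            alpha = g + r  # concat(g, r)
            confidence = supports[alpha] / supports[g]
            if confidence >= min_conf:
                rules.append((g, r, supports[alpha], confidence))
    return rules


# Supports quoted in step 4 of the embodiment.
supports = {"A": 8, "B": 7, "AB": 5, "BA": 3}
# Two generator -> projection pairs, enough to show the filter.
projection_map = {"A": ["B"], "B": ["A"]}

rules = extract_rules(projection_map, supports, min_conf=0.5)
print(rules)  # [('A', 'B', 5, 0.625)]
```

The pair (A, B) is kept because AB.sup/A.sup = 5/8, while (B, A) is dropped because BA.sup/B.sup = 3/7 falls below the assumed threshold.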

Claims

1. A method for extracting non-redundant episode rules, characterized in that the steps are as follows:

Step 4: for a frequent closed episode α, for each

Step 5: let