CN102622447A

CN102622447A - Hadoop-based frequent closed itemset mining method

Info

Publication number: CN102622447A
Application number: CN2012100725242A
Authority: CN
Inventors: 高阳; 杨育彬; 陈光鹏; 商琳
Original assignee: JIANGYIN INSTITUTE OF INFORMATION TECHNOLOGY OF NANJING UNIVERSITY; Nanjing University
Current assignee: JIANGYIN INSTITUTE OF INFORMATION TECHNOLOGY OF NANJING UNIVERSITY; Nanjing University
Priority date: 2012-03-19
Filing date: 2012-03-19
Publication date: 2012-08-01
Anticipated expiration: 2032-03-19
Also published as: CN102622447B

Abstract

The invention discloses a Hadoop-based frequent closed itemset mining method. The method comprises the following steps: parallel counting, in which the database is scanned once in parallel and the frequency of each data item is counted; construction of a global frequency list (F-List) and group list (G-List); and parallel mining of local frequent closed itemsets, in which the database is scanned again, local frequent closed itemsets are mined on each node using a first algorithm, and only the global frequent closed itemsets are kept. The method distributes computation tasks by Group, so the computational load is allocated evenly; the method is also simple, completing the mining task in only three steps (two Map-Reduce passes).

Description

A Hadoop-based frequent closed itemset mining method

Technical field

The present invention relates to a Hadoop-based frequent closed itemset mining method.

Background technology

Mining frequent closed itemsets in large databases is a key research topic in data mining. It is widely used to mine association rules in data, for example in market basket analysis and collaborative filtering. It is often superior to other data mining algorithms in applicability, mining efficiency, accuracy, and interpretability.

Many frequent closed itemset mining methods exist today, but essentially all of them run on a single machine. Faced with massive data, single-machine algorithms for mining frequent closed itemsets often prove inadequate.

Summary of the invention

Object of the invention: in view of the problems and shortcomings of the prior art described above, the object of the invention is to provide a Hadoop-based frequent closed itemset mining method that realizes parallel computation.

Technical solution: to achieve the above object, the technical solution adopted by the invention is a Hadoop-based frequent closed itemset mining method comprising the following steps:

(1) parallel counting: scan the database once in parallel and count the number of times each data item occurs in the database (its frequency);

(2) construct the global F-List and G-List:

1) taking the output of the parallel counting in step (1) as input, construct the global F-List;

2) on the basis of this global F-List, construct the G-List;

(3) mine local frequent closed itemsets in parallel: scan the database again, mine local frequent closed itemsets on each node using a first algorithm, and keep only the global frequent closed itemsets.

In said step 1): the output of the parallel counting in step (1) is taken as input, the items satisfying the minimum support min_sup are selected and sorted in descending order of frequency, and the result is stored in the F-List.
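
The F-List construction described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the patented implementation: the function name `build_f_list`, the treatment of min_sup as an absolute count, and the alphabetical tie-break between items of equal frequency are all assumptions made for the example.

```python
from typing import Dict, List, Tuple


def build_f_list(counts: Dict[str, int], min_sup: int) -> List[Tuple[str, int]]:
    """Keep the items whose count reaches min_sup, sorted by descending count.

    Assumption: min_sup is an absolute occurrence count, not a ratio.
    """
    frequent = [(item, n) for item, n in counts.items() if n >= min_sup]
    # Assumption: ties are broken by item name so the order is deterministic.
    frequent.sort(key=lambda pair: (-pair[1], pair[0]))
    return frequent


# Hypothetical per-item counts, as produced by the parallel counting step.
counts = {"a": 5, "b": 3, "c": 1, "d": 3}
f_list = build_f_list(counts, min_sup=2)
print(f_list)
```

Item "c" is dropped because its count is below min_sup; the remaining items keep their counts so that later steps can group items of equal support.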

The first algorithm is preferably the AFOPT-Closed algorithm (AFOPT: Ascending Frequency Ordered Prefix Tree).

Beneficial effects: the method of the invention uses Hadoop to realize parallel computation and distributes computation tasks by Group, so that the computational load is distributed more evenly and efficiency is improved. The method is also more concise: the mining task is completed in only three steps (two Map-Reduce passes).
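
To make the term "frequent closed itemset" concrete, the following brute-force Python sketch enumerates them on a tiny database. It is deliberately not AFOPT-Closed: it only applies the standard definition (an itemset is frequent if its support reaches min_sup, and closed if no proper superset has the same support). The function names and the sample transactions are hypothetical, and the exhaustive enumeration is exponential, so it is suitable for illustration only.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List


def support(itemset: FrozenSet[str], transactions: List[FrozenSet[str]]) -> int:
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def frequent_closed_itemsets(
    transactions: List[FrozenSet[str]], min_sup: int
) -> Dict[FrozenSet[str], int]:
    """Brute-force enumeration by definition (not AFOPT-Closed)."""
    items = sorted(set().union(*transactions))
    frequent: Dict[FrozenSet[str], int] = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            n = support(candidate, transactions)
            if n >= min_sup:
                frequent[candidate] = n
    # A superset with equal support is itself frequent, so it is enough
    # to compare each frequent itemset against the other frequent ones.
    return {
        s: n
        for s, n in frequent.items()
        if not any(s < other and n == m for other, m in frequent.items())
    }


transactions = [frozenset("abc"), frozenset("ab"), frozenset("ac"), frozenset("a")]
closed = frequent_closed_itemsets(transactions, min_sup=2)
for itemset, n in closed.items():
    print(sorted(itemset), n)
```

In this example {b} is frequent but not closed, because {a, b} occurs in exactly the same two transactions; only the closed itemsets need to be stored, since the support of every other frequent itemset can be recovered from them.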

Description of drawings

Fig. 1 is a flowchart of the method of the invention.

Embodiment

The invention is further illustrated below with reference to the accompanying drawing and a specific embodiment. It should be understood that the embodiment serves only to illustrate the invention and not to limit its scope; after reading this disclosure, modifications of various equivalent forms made by those skilled in the art all fall within the scope defined by the claims appended to this application.

As shown in Fig. 1, the method of the invention comprises the following steps:

Step 1, parallel counting. Count the number of times each data item (abbreviated "item") occurs in the database. This step scans the database once in parallel, reading it line by line, and counts the frequency of each item.
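
Step 1 has the shape of a classic Map-Reduce counting job. The sketch below simulates it in plain Python rather than on Hadoop: `count_mapper`, `count_reducer`, and `run_counting` are hypothetical names, the in-memory dictionary stands in for Hadoop's shuffle phase, and the input format (one transaction per line, items separated by whitespace, each item counted at most once per transaction) is an assumption.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def count_mapper(line: str) -> Iterator[Tuple[str, int]]:
    """Emit (item, 1) for each distinct item in one transaction line."""
    # Assumption: items are whitespace-separated and counted once per line.
    for item in set(line.split()):
        yield item, 1


def count_reducer(item: str, ones: Iterable[int]) -> Tuple[str, int]:
    """Sum the partial counts received for one item."""
    return item, sum(ones)


def run_counting(lines: Iterable[str]) -> Dict[str, int]:
    """Single-process stand-in for the map, shuffle, and reduce phases."""
    shuffled: Dict[str, List[int]] = defaultdict(list)
    for line in lines:
        for item, one in count_mapper(line):
            shuffled[item].append(one)
    return dict(count_reducer(item, ones) for item, ones in shuffled.items())


database = ["a b c", "a b", "a c", "a"]
counts = run_counting(database)
print(counts)
```

On a cluster each mapper would process its own split of the database, and the per-item sums would be computed by the reducers; the resulting counts are the input to Step 2.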

Step 2, construct the global F-List (Frequency-List) and G-List (Groups-List). This step takes the output of the parallel counting in the previous step as input, selects the items that satisfy the minimum support min_sup, sorts them in descending order of frequency, and stores the result in the global F-List. The G-List is then constructed on the basis of this global F-List so that, subject to itemsets with the same support falling in the same Group, the computational load of each Group is as balanced as possible.
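
The text does not give the exact grouping rule, so the following Python sketch shows just one possible way to build a G-List that satisfies the two stated constraints. Everything specific here is an assumption: the function name `build_g_list`, the cost estimate (an item's 1-based position in the F-List, used as a rough proxy for the size of its conditional data), and the greedy strategy of assigning the most expensive bucket to the currently least-loaded Group.

```python
import heapq
from itertools import groupby
from typing import Dict, List, Tuple


def build_g_list(f_list: List[Tuple[str, int]], num_groups: int) -> Dict[str, int]:
    """Map each frequent item to a Group id.

    f_list must be sorted by descending count, as the F-List is.
    Items with equal support always land in the same Group.
    """
    buckets: List[Tuple[int, List[str]]] = []
    # Consecutive F-List entries with the same count form one bucket.
    for _, members in groupby(enumerate(f_list), key=lambda e: e[1][1]):
        members = list(members)
        items = [item for _, (item, _) in members]
        # Assumed cost model: sum of 1-based F-List positions.
        cost = sum(pos + 1 for pos, _ in members)
        buckets.append((cost, items))
    buckets.sort(key=lambda b: -b[0])  # most expensive bucket first

    heap = [(0, gid) for gid in range(num_groups)]  # (load, Group id)
    heapq.heapify(heap)
    g_list: Dict[str, int] = {}
    for cost, items in buckets:
        load, gid = heapq.heappop(heap)  # least-loaded Group so far
        for item in items:
            g_list[item] = gid
        heapq.heappush(heap, (load + cost, gid))
    return g_list


f_list = [("a", 5), ("b", 3), ("d", 3), ("e", 2)]
g_list = build_g_list(f_list, num_groups=2)
print(g_list)
```

Items "b" and "d" have the same support and therefore share a Group; the remaining buckets are spread so that the estimated loads of the two Groups end up equal in this small example.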

Step 3, mine local frequent closed itemsets in parallel based on Groups. This step is a parallel operation: the database is scanned again, local frequent closed itemsets are mined on each node with the AFOPT algorithm, and only the global frequent closed itemsets are kept. The Mapper reads the data line by line, each line being one transaction. The data items in the transaction are re-sorted in the order of the global F-List. Then, proceeding from right to left through the re-sorted transaction, the Mapper emits for each data item a pair whose key is the id of the Group that the item belongs to in the G-List and whose value is the set of data items appearing to the left of that item. In this way, all transactions relevant to the itemsets of one Group are gathered together. The Reducer builds an AFOPT tree from all the transactions it receives. It then uses the frequent closed itemset mining algorithm AFOPT-Closed to recursively mine frequent closed itemsets in this tree. The frequent closed itemsets obtained at this point are all local frequent closed itemsets conditioned on the itemsets of the same Group. These frequent closed itemsets are sorted according to the global F-List; those whose leftmost item has the Group-id of the current group are global closed itemsets.
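
The Mapper side of Step 3 can be sketched as follows; the Reducer (AFOPT tree construction and AFOPT-Closed mining) is not shown. This is a plain-Python illustration under stated assumptions, not the patented code: `group_mapper`, `f_rank`, and the sample G-List are hypothetical; infrequent items are assumed to be dropped before sorting; the emitted value is assumed to include the item itself together with the items to its left; and each Group is assumed to be emitted at most once per transaction.

```python
from typing import Dict, Iterator, List, Tuple


def group_mapper(
    line: str, f_rank: Dict[str, int], g_list: Dict[str, int]
) -> Iterator[Tuple[int, List[str]]]:
    """Emit (Group id, prefix of the re-sorted transaction) pairs.

    f_rank maps each frequent item to its position in the global F-List.
    """
    # Assumption: items missing from the F-List are infrequent and dropped.
    items = sorted(
        {item for item in line.split() if item in f_rank},
        key=f_rank.__getitem__,
    )
    seen_groups = set()
    # Walk the re-sorted transaction from right to left.
    for j in range(len(items) - 1, -1, -1):
        gid = g_list[items[j]]
        # Assumption: one emission per Group and transaction; the longest
        # prefix already contains every shorter prefix for that Group.
        if gid not in seen_groups:
            seen_groups.add(gid)
            yield gid, items[: j + 1]


f_rank = {"a": 0, "b": 1, "d": 2, "e": 3}       # positions in the F-List
g_list = {"a": 1, "b": 0, "d": 0, "e": 1}       # hypothetical G-List
pairs = list(group_mapper("e d a x", f_rank, g_list))
print(pairs)
```

After the shuffle, every Reducer receives all prefixes keyed by its own Group id, which is what allows it to mine the itemsets conditioned on that Group's items without seeing the rest of the database.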

Claims

1. A Hadoop-based frequent closed itemset mining method, comprising the following steps:

(1) parallel counting: scan the database once in parallel and count the number of times each data item occurs in the database (its frequency);

(2) construct the global F-List and G-List:

2) on the basis of this global F-List, construct the G-List;

2. The Hadoop-based frequent closed itemset mining method according to claim 1, characterized in that, in said step 1): the output of the parallel counting in step (1) is taken as input, the items satisfying the minimum support min_sup are selected and sorted in descending order of frequency, and the result is stored in the F-List.

3. The Hadoop-based frequent closed itemset mining method according to claim 1, characterized in that said first algorithm is the AFOPT-Closed algorithm.