CN102622447A - Hadoop-based frequent closed itemset mining method - Google Patents

Info

Publication number
CN102622447A
Authority
CN
China
Prior art keywords
list
frequent
frequent closed
hadoop
itemset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2012100725242A
Other languages
Chinese (zh)
Other versions
CN102622447B (en)
Inventor
高阳
杨育彬
陈光鹏
商琳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
JIANGYIN INSTITUTE OF INFORMATION TECHNOLOGY OF NANJING UNIVERSITY
Nanjing University
Original Assignee
JIANGYIN INSTITUTE OF INFORMATION TECHNOLOGY OF NANJING UNIVERSITY
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by JIANGYIN INSTITUTE OF INFORMATION TECHNOLOGY OF NANJING UNIVERSITY and Nanjing University
Priority to CN201210072524.2A
Publication of CN102622447A
Application granted
Publication of CN102622447B
Legal status: Active
Anticipated expiration

Abstract

The invention discloses a Hadoop-based frequent closed itemset mining method. The method comprises the following steps: parallel counting, in which the database is scanned once in parallel and the frequency of each data item in the database is counted; constructing a global frequency list (F-List) and group list (G-List); and mining local frequent closed itemsets in parallel, in which the database is scanned again, local frequent closed itemsets are mined at each node using a first algorithm, and only the global frequent closed itemsets are kept. The method distributes computing tasks by Group, so that the computational load is allocated evenly; the method is also simple, completing the mining task in only three steps (two Map-Reduce passes).

Description

A Hadoop-based frequent closed itemset mining method
Technical field
The present invention relates to a Hadoop-based frequent closed itemset mining method.
Background art
Mining frequent closed itemsets in large databases is an important research topic in the field of data mining. It is widely used to mine association rules in data, for example in market basket analysis and collaborative filtering. It often outperforms other data mining algorithms in applicability, mining efficiency, accuracy, and interpretability.
Existing frequent closed itemset mining methods are varied, but almost all of them run on a single machine. Faced with massive data, single-machine algorithms for mining frequent closed itemsets often fall short.
Summary of the invention
Object of the invention: in view of the problems and deficiencies of the prior art described above, the object of the invention is to provide a Hadoop-based frequent closed itemset mining method that realizes parallel computation.
Technical solution: to achieve the above object, the invention adopts a Hadoop-based frequent closed itemset mining method comprising the following steps:
(1) parallel counting: scan the database once in parallel and count the number of occurrences (i.e., the frequency) of each data item in the database;
(2) construct the global F-List and G-List:
1) taking the output of the parallel counting in step (1) as input, construct the global F-List;
2) on the basis of this global F-List, construct the G-List;
(3) mine local frequent closed itemsets in parallel: scan the database again, mine local frequent closed itemsets at each node using a first algorithm, and keep only the global frequent closed itemsets.
In step 1), the output of the parallel counting in step (1) is taken as input; the items satisfying the minimum support min_sup are selected and sorted in descending order of frequency, and the result is stored in the F-List.
The first algorithm is preferably the AFOPT-Closed algorithm (AFOPT: Ascending Frequency Ordered Prefix Tree).
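To make the target of the mining concrete: an itemset is closed when no proper superset has the same support, and it is a frequent closed itemset when its support also reaches min_sup. The following is a minimal brute-force sketch of that definition only. It is not the AFOPT-Closed algorithm, it is exponential in the number of items, and the function name, the set-based transaction representation, and the treatment of min_sup as an absolute count are assumptions made for illustration.

```python
from itertools import combinations

def frequent_closed_itemsets(transactions, min_sup):
    """Brute-force reference: return {itemset: support} for frequent closed itemsets.

    transactions is a list of sets of items; min_sup is an absolute count (assumption).
    """
    items = sorted(set().union(*transactions))
    support = {}
    # Enumerate every candidate itemset and keep the frequent ones.
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            n = sum(1 for t in transactions if set(cand) <= t)
            if n >= min_sup:
                support[frozenset(cand)] = n
    # An itemset is closed if no proper superset has the same support.
    # A superset with equal support is itself frequent, so checking
    # the frequent itemsets is sufficient.
    closed = {}
    for s, n in support.items():
        if not any(s < other and n == m for other, m in support.items()):
            closed[s] = n
    return closed

example_db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"a"}]
example_closed = frequent_closed_itemsets(example_db, min_sup=2)
```

In this toy database, {b} is frequent but not closed because {a, b} has the same support, whereas {a}, {a, b}, and {a, c} are closed.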
Beneficial effects: the method uses Hadoop to realize parallel computation and distributes computing tasks by Group, which balances the computational load more evenly and improves efficiency. The method is also more concise: the mining task is completed in just three steps (two Map-Reduce passes).
Description of the drawings
Fig. 1 is a flowchart of the method of the invention.
Detailed description of the embodiments
The invention is further illustrated below with reference to the drawing and a specific embodiment. It should be understood that this embodiment only illustrates the invention and does not limit its scope; after reading this disclosure, modifications of the invention in various equivalent forms made by those skilled in the art all fall within the scope defined by the appended claims.
As shown in Fig. 1, the method of the invention comprises the following steps:
Step 1, parallel counting. Count the number of occurrences of each data item ("item" for short) in the database. This step scans the database once in parallel, reading it row by row, and counts the frequency of each item.
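The counting step can be sketched as a map phase and a reduce phase emulated in a single process. This is only an illustration under stated assumptions, not Hadoop code: the function names are hypothetical, each row is assumed to be a whitespace-separated list of items, and an item is counted at most once per transaction.

```python
from collections import Counter
from itertools import chain

def count_map(row):
    """Map phase: emit (item, 1) for every distinct item in one transaction row."""
    return [(item, 1) for item in set(row.split())]

def count_reduce(pairs):
    """Reduce phase: sum the counts emitted for each item."""
    totals = Counter()
    for item, n in pairs:
        totals[item] += n
    return dict(totals)

# Toy database: one transaction per row (hypothetical data).
rows = ["a b c", "a b", "a c d", "b c", "a b c"]
item_counts = count_reduce(chain.from_iterable(count_map(r) for r in rows))
```

On a real cluster the rows would be split across mappers and the pairs for each item routed to one reducer; the arithmetic is the same.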
Step 2, construct the global F-List (Frequency-List) and G-List (Groups-List). This step takes the output of the parallel counting in the previous step as input, selects the items satisfying the minimum support min_sup, sorts them in descending order of frequency, and stores the result in the global F-List. On the basis of this global F-List, the G-List is constructed so that, subject to itemsets with the same support being placed in the same Group, the computational load of each Group is as balanced as possible.
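One possible way to build the two lists is sketched below. The text does not specify how the load of a Group is estimated or how the balancing is done, so the greedy assignment and the load estimate (support times number of items) are assumptions, as are the function names, the reading of the same-support constraint at the level of single items, and the treatment of min_sup as an absolute count.

```python
from itertools import groupby

def build_f_list(item_counts, min_sup):
    """Keep items with count >= min_sup, sorted by descending frequency."""
    frequent = [(item, n) for item, n in item_counts.items() if n >= min_sup]
    # Ties are broken by item name only to make the order deterministic.
    frequent.sort(key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in frequent]

def build_g_list(f_list, item_counts, num_groups):
    """Map each item to a group id, keeping equal-support items together."""
    loads = [0] * num_groups
    g_list = {}
    # f_list is sorted by support, so equal-support items are adjacent.
    for support, block in groupby(f_list, key=lambda item: item_counts[item]):
        block = list(block)
        gid = loads.index(min(loads))       # least-loaded group so far
        for item in block:
            g_list[item] = gid
        loads[gid] += support * len(block)  # hypothetical load estimate
    return g_list

example_counts = {"a": 5, "b": 4, "c": 4, "d": 3, "e": 1}
example_f_list = build_f_list(example_counts, min_sup=2)
example_g_list = build_g_list(example_f_list, example_counts, num_groups=2)
```

Here the infrequent item e is dropped, b and c share a Group because they have the same support, and both Groups end with an estimated load of 8.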
Step 3, mine local frequent closed itemsets in parallel based on Groups. This step is a parallel operation: the database is scanned again, local frequent closed itemsets are mined at each node with the AFOPT algorithm, and only the global frequent closed itemsets are kept. The Mapper reads the data row by row; each row is one transaction. The data items in the transaction are re-sorted in the order of the global F-List. Then, proceeding from right to left over the data items of the re-sorted transaction, the id of the Group to which each data item belongs in the G-List is emitted as the key, and the set of data items appearing to the left of that data item is emitted as the value. In this way, all transactions relevant to the itemsets of one Group are gathered together. The Reducer builds an AFOPT from all the transactions it receives, then recursively mines frequent closed itemsets in this tree using the AFOPT-Closed algorithm. The frequent closed itemsets obtained at this point are all local frequent closed itemsets conditioned on the itemsets of the same Group. These frequent closed itemsets are sorted according to the global F-List; those whose leftmost item has the Group-id of this group are global closed itemsets.
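The Mapper side of this step can be sketched as follows. The sketch covers only the re-sorting and the right-to-left emission; building the AFOPT and running AFOPT-Closed in the Reducer are not shown. Two details are assumptions borrowed from the Parallel FP-Growth convention rather than stated in the text: the emitted value includes the item itself along with the items to its left, and each Group receives at most one prefix per transaction. The function names and the in-process shuffle are hypothetical.

```python
from collections import defaultdict

def map_transaction(row, f_list, g_list):
    """Emit (group_id, prefix) pairs for one transaction row."""
    rank = {item: i for i, item in enumerate(f_list)}
    # Drop infrequent items and re-sort the rest in F-List order.
    items = sorted((it for it in set(row.split()) if it in rank), key=rank.get)
    emitted = set()
    out = []
    # Scan from right to left; the first hit for a group carries the
    # longest prefix, so later hits for the same group are skipped.
    for j in range(len(items) - 1, -1, -1):
        gid = g_list[items[j]]
        if gid not in emitted:
            emitted.add(gid)
            out.append((gid, tuple(items[: j + 1])))
    return out

def shuffle(rows, f_list, g_list):
    """Group the emitted prefixes by group id, as the shuffle phase would."""
    by_group = defaultdict(list)
    for row in rows:
        for gid, prefix in map_transaction(row, f_list, g_list):
            by_group[gid].append(prefix)
    return dict(by_group)

demo_f_list = ["a", "b", "c", "d"]
demo_g_list = {"a": 0, "b": 1, "c": 1, "d": 0}
demo_pairs = map_transaction("d c a x", demo_f_list, demo_g_list)
demo_groups = shuffle(["d c a x", "b a"], demo_f_list, demo_g_list)
```

For the row "d c a x", the infrequent item x is dropped, the sorted transaction is (a, c, d), Group 0 receives (a, c, d) through item d, and Group 1 receives (a, c) through item c.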

Claims (3)

1. A Hadoop-based frequent closed itemset mining method, comprising the following steps:
(1) parallel counting: scanning the database once in parallel and counting the number of occurrences (the frequency) of each data item in the database;
(2) constructing the global F-List and G-List:
1) taking the output of the parallel counting in step (1) as input, constructing the global F-List;
2) on the basis of this global F-List, constructing the G-List;
(3) mining local frequent closed itemsets in parallel: scanning the database again, mining local frequent closed itemsets at each node using a first algorithm, and keeping only the global frequent closed itemsets.
2. The Hadoop-based frequent closed itemset mining method according to claim 1, characterized in that in step 1), the output of the parallel counting in step (1) is taken as input, the items satisfying the minimum support min_sup are selected and sorted in descending order of frequency, and the result is stored in the F-List.
3. The Hadoop-based frequent closed itemset mining method according to claim 1, characterized in that the first algorithm is the AFOPT-Closed algorithm.
CN201210072524.2A 2012-03-19 2012-03-19 Hadoop-based frequent closed itemset mining method Active CN102622447B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210072524.2A CN102622447B (en) 2012-03-19 2012-03-19 Hadoop-based frequent closed itemset mining method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210072524.2A CN102622447B (en) 2012-03-19 2012-03-19 Hadoop-based frequent closed itemset mining method

Publications (2)

Publication Number Publication Date
CN102622447A (en) 2012-08-01
CN102622447B (en) 2014-03-05

Family

ID=46562366

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210072524.2A Active CN102622447B (en) 2012-03-19 2012-03-19 Hadoop-based frequent closed itemset mining method

Country Status (1)

Country Link
CN (1) CN102622447B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103324712A (en) * 2013-06-19 2013-09-25 西北工业大学 Extraction method for non-redundancy plot rule
CN103714009A (en) * 2013-12-20 2014-04-09 华中科技大学 MapReduce realizing method based on unified management of internal memory on GPU
CN104008185A (en) * 2014-06-11 2014-08-27 西北工业大学 Frequent close scenario mining method based on same node table and scenario tree
CN104834709A (en) * 2015-04-29 2015-08-12 南京理工大学 Parallel cosine mode mining method based on load balancing

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104765847A (en) * 2015-04-20 2015-07-08 西北工业大学 Frequent closed item set mining method based on order-preserving characteristic and preamble tree

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6044366A (en) * 1998-03-16 2000-03-28 Microsoft Corporation Use of the UNPIVOT relational operator in the efficient gathering of sufficient statistics for data mining
CN101446978A (en) * 2008-12-11 2009-06-03 南京大学 Core node discovery method based on frequent itemset mining
CN101799810A (en) * 2009-02-06 2010-08-11 中国移动通信集团公司 Association rule mining method and system thereof
CN101996102A (en) * 2009-08-31 2011-03-30 中国移动通信集团公司 Method and system for mining data association rule

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
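缪裕青 (Miao Yuqing): "Research on parallel mining algorithms for frequent closed itemsets" (频繁闭合项目集的并行挖掘算法研究), 《计算机科学》 (Computer Science) *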

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103324712A (en) * 2013-06-19 2013-09-25 西北工业大学 Extraction method for non-redundancy plot rule
CN103714009A (en) * 2013-12-20 2014-04-09 华中科技大学 MapReduce realizing method based on unified management of internal memory on GPU
CN104008185A (en) * 2014-06-11 2014-08-27 西北工业大学 Frequent close scenario mining method based on same node table and scenario tree
CN104834709A (en) * 2015-04-29 2015-08-12 南京理工大学 Parallel cosine mode mining method based on load balancing
CN104834709B (en) * 2015-04-29 2018-07-31 南京理工大学 A kind of parallel cosine mode method for digging based on load balancing

Also Published As

Publication number Publication date
CN102622447B (en) 2014-03-05

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant