Fast Frequent Item Mining from Big Data using Map Reduce and Bit Vectors

Article History: Received: 10 November 2020; Revised: 12 January 2021; Accepted: 27 January 2021; Published online: 05 April 2021
Abstract: Big data has become a major research focus, and mining frequent patterns from it is an active, steadily evolving area that has attracted considerable attention from the research community. Frequent patterns are generally mined with Apriori-based, tree-based and hash-based algorithms, but most of these existing algorithms suffer from several limitations. This paper proposes a new method that addresses the most common problems related to speed, memory consumption and search space. The algorithm, named Dual Mine, employs a binary vector representation and a vertical data representation within MapReduce to discover the frequent patterns in large data sets. Dual Mine is compared with several existing algorithms, and the experimental results indicate that it outperforms them by a wide margin with respect to speed and memory.


Introduction
The main purpose of data mining is to unearth previously unknown patterns hidden in raw data [1]. One of its most popular tasks is frequent pattern mining, in which the most frequently occurring items are found (for example, market basket analysis, commodities frequently purchased by consumers, or frequently visited pages of a website). The pioneering work in frequent itemset mining was carried out by Agrawal and Srikant, who proposed the Apriori algorithm [2]. Apriori follows a generate-and-test approach and discovers frequent patterns level by level. Its most important drawback is the excessive generation of candidate itemsets, especially 2-itemset candidates, which increases the cost in execution time and memory usage.
The FP-growth algorithm [3] is another popular frequent pattern mining algorithm; it employs a tree-based structure to find frequently co-occurring itemsets in the raw data. Its main advantage is that it scans the database only twice, unlike Apriori, which scans it k times, where k is the maximum cardinality of the discovered frequent patterns.
Data mining has been defined [Witten and Frank, 2000] as the process of discovering hidden, previously unknown and potentially useful information in a given large raw dataset. It is one of the most exciting information-based developments for easing decision making. Data mining has become an essential service that turns the patterns concealed in raw data into human-readable, understandable information for wider use. It is widely applied in marketing, bioengineering, gene technology, finance and engineering.
According to Hand, Mannila and Smyth [4], data mining is "the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner."

Big Data
Big data is a term for immense data sets with a large, varied and complex structure that are difficult to store, analyze and visualize for further processing or results. Big data combines data from email, social media, text documents, images, audio, video files and many other sources that are absent from traditional relational databases. Hence big data tends to be unstructured and irregular, which makes its analysis harder. The main characteristics of big data are summarized by the 5 V's. Veracity emphasizes the quality of the raw data (e.g., its uncertainty, messiness and validity). Velocity is the speed at which data are collected or made available (speed and promptness). Value is the usefulness of the data (merit and worth). Variety covers the different types, contents or formats of data (classes, categories). Volume concerns the amount of data (size and quantity). Because of these 5 V's, new kinds of algorithms are required for managing, querying and processing big data so as to enable improved decision making, insight and process optimization. This drives research and practice in data science, which aims to develop systematic or quantitative data analytic algorithms to analyze (e.g., inspect, clean, transform and model) and mine big data.

Background of the Paper
The knowledge discovery process involves several steps leading from raw data to some form of meaningful and useful knowledge. The sheer volume of data in a database often exceeds the ability to analyze and mine it efficiently, which causes a lag in understanding the data fully in terms of business needs such as profit, product penetration in the market and promotions.
Frequent itemset mining has gained tremendous importance in the research community, particularly since businesses have become globalized. It is essential for businesses to exploit the available data resources fully and in good time in order to promote their products globally. Numerous association-rule-based frequent itemset mining algorithms have been developed and proposed by researchers to improve and support the volume of business transactions across the globe.

Scope of the Paper
The proposed work focuses on discovering frequent patterns in very large databases using bit vectors and MapReduce, and on reducing the time complexity, which is the main factor degrading mining speed. The proposed algorithm uses a new pruning technique to avoid heavy computation while finding the frequent patterns present in very large databases. The primary scope of this paper is to simplify the complicated computations of the existing state-of-the-art algorithms and to offer a simple approach that discovers frequent patterns without large computational cost and overhead.

Challenges in the Paper
Finding interesting frequent itemsets in raw big data is a hard task, because the whole process involves many complicated computations related to frequency counting, pruning of unpromising items and memory-related overheads, all of which add cost. Many authors have proposed methods to find frequent patterns, but most of the existing methods suffer from the serious problem of producing an enormous number of candidates, and this limitation reduces mining performance in terms of speed and memory space. The main challenge addressed in this paper is to discover the frequent itemsets without compromising on speed, memory and search space.

Motivation
The motivation for the proposed method rests on the observation that, for each kind of data and each kind of user preference, an approach is needed that produces optimal results while limiting the computational cost in time and memory usage. The primary motivation for this work is to give users an approach that handles very big data efficiently and eases the cumbersome computations required during itemset generation. Many researchers are drawn to this area because an abundant volume of online data is readily available across the globe, and most firms are adopting new methods to gain an advantage and to attract and retain customers.

Map Reduce
MapReduce [16] is a parallel and scalable programming framework for data-intensive applications and scientific analysis. MapReduce operates only on key/value pairs. A MapReduce job has two phases, referred to as the Map phase and the Reduce phase. The input is split into several fragments for the Map phase. Each Map task receives a key/value pair and produces a list of intermediate key/value pairs. The underlying MapReduce runtime then merges and groups all the intermediate values that share the same intermediate key and sends them to the reducer. Each reducer receives all the intermediate records associated with a particular key and produces a final key/value pair.
The proposed approach involves several steps, and the important procedures are described below using a sample database, shown in Table 1. The sample data is first processed to find the unique items present in the transactions, and the data is then represented in binary format as shown in Table 3. This binary table is the bit vector representation, and it reduces memory usage considerably. The MapReduce phases are then used to reduce the running-time and memory overheads that are the major drawback of the existing algorithms.
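The map, group-by-key and reduce phases described above can be sketched in a single process as follows. This is a minimal illustration rather than Hadoop code or the paper's implementation: `run_mapreduce`, `map_items` and `reduce_count` are hypothetical names, and the six transactions are a toy dataset constructed to agree with the item counts quoted later in the text; they are not claimed to be the paper's Table 1.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Run one in-memory map -> shuffle -> reduce round (hypothetical helper)."""
    # Map phase: every input (key, value) pair yields intermediate pairs.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))
    # Shuffle phase: group the intermediate values by their intermediate key.
    groups = defaultdict(list)
    for k, v in intermediate:
        groups[k].append(v)
    # Reduce phase: each key and its grouped values give one final pair.
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

def map_items(tid, items):
    # Emit (item, 1) for every item of the transaction.
    return [(item, 1) for item in items]

def reduce_count(item, ones):
    # Sum the 1's collected for one item.
    return item, sum(ones)

# Assumed toy transactions: (transaction id, items as single letters).
transactions = [(1, "NOQ"), (2, "MNOQ"), (3, "MOQ"),
                (4, "NO"), (5, "MNPQ"), (6, "MO")]

counts = run_mapreduce(transactions, map_items, reduce_count)
print(sorted(counts.items()))
```

In a real cluster the shuffle step is performed by the framework across machines; here it is only a dictionary of lists.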

Procedure Find Unique
The unique items present in Table 1 are found using the procedure shown in Figure 2.
Figure 2. Pseudo code to discover the unique items

Procedure Discover Uniques (Input Data D)
INPUT: Input Data D
The procedure shown in Figure 2 outputs the unique items with their respective counts, as shown in Table 2. Each unique item is then fetched from Table 2 and checked against the sample transaction table: if the item is present in a transaction row it is marked "1", otherwise it is marked "0", which gives the final binary table shown in Table 3.
Table 3. Binary table created using the Create Binary Table procedure
Consider the first row, shown below. Since the item "M" is present in the transaction rows T2, T3, T5 and T6, these are marked with 1's and the remaining transaction rows are marked with 0's.
ITEMS  T1  T2  T3  T4  T5  T6  Count
M      0   1   1   0   1   1   4
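The two preprocessing steps described above, finding the unique items with their counts and building the binary table, can be sketched as follows. The function names `discover_uniques` and `create_binary_table` are hypothetical, and the transactions are an assumed toy dataset chosen so that item M appears in T2, T3, T5 and T6 as stated in the text; it is not claimed to reproduce the paper's Table 1.

```python
# Assumed toy transactions (not the paper's Table 1).
transactions = {
    "T1": {"N", "O", "Q"},
    "T2": {"M", "N", "O", "Q"},
    "T3": {"M", "O", "Q"},
    "T4": {"N", "O"},
    "T5": {"M", "N", "P", "Q"},
    "T6": {"M", "O"},
}

def discover_uniques(data):
    """Return each distinct item with the number of transactions containing it."""
    counts = {}
    for items in data.values():
        for item in items:
            counts[item] = counts.get(item, 0) + 1
    return dict(sorted(counts.items()))

def create_binary_table(data, uniques):
    """Return one 0/1 row per item, with one column per transaction."""
    tids = list(data)  # column order follows the insertion order T1..T6
    return {item: [1 if item in data[t] else 0 for t in tids]
            for item in uniques}

uniques = discover_uniques(transactions)
table = create_binary_table(transactions, uniques)

print("ITEMS", *transactions, "Count")
for item, row in table.items():
    print(item, *row, uniques[item])
```

Each row needs one bit per transaction, which is the source of the memory saving the text attributes to the bit vector representation.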
The first MapReduce step is applied to the binary table shown in Table 3; its pseudo code and the results of this first stage are presented in this section.
The first Map function is applied to the binary table, and its output is passed to the first Reduce function. The first Reduce function prunes the unpromising items from the database, which reduces the memory footprint and the number of items still to be examined, and in turn the execution time of the algorithm. The first Reduce function is shown in Figure 5.
Figure 5. First Reduce function
The minimum support value is assumed to be 2, and the first Reduce function produces the result shown in Table 5. It prunes the item "P" because its count is 1, which is less than the user-defined minimum support value. The items M, N, O and Q are retained, as their counts are 4, 4, 5 and 4 respectively.
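A minimal sketch of this first round follows, under two assumptions: the Map function emits a triple ⟨transaction id, item, 1⟩ for every 1 in the binary table (mirroring the triple form used later in the text), and the binary rows are an assumed toy table whose counts match those quoted above. The names `map1` and `reduce1` are hypothetical.

```python
from collections import defaultdict

MIN_SUPPORT = 2  # minimum support value assumed in the text

# Assumed binary table; columns are T1..T6.
binary_table = {
    "M": [0, 1, 1, 0, 1, 1],
    "N": [1, 1, 0, 1, 1, 0],
    "O": [1, 1, 1, 1, 0, 1],
    "P": [0, 0, 0, 0, 1, 0],
    "Q": [1, 1, 1, 0, 1, 0],
}

def map1(table):
    """Emit (tid, item, 1) for every set bit of the binary table."""
    for item, bits in table.items():
        for tid, bit in enumerate(bits, start=1):
            if bit:
                yield tid, item, 1

def reduce1(triples, min_support):
    """Sum the 1's per item and drop items below the minimum support."""
    counts = defaultdict(int)
    for _tid, item, one in triples:
        counts[item] += one
    return {item: c for item, c in counts.items() if c >= min_support}

frequent_items = reduce1(map1(binary_table), MIN_SUPPORT)
print(frequent_items)
```

Item P is emitted once by `map1` and removed by `reduce1`, so it never reaches the later rounds.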
The second Map function is then applied to the data; its pseudo code is shown in the following section. Note that (i) the items in transaction row 4 are not emitted, as element M is not present there, and (ii) ⟨5, MP, 1⟩ is not emitted, as the item P has already been pruned because its count is less than the user-defined minimum support value.
An example pattern is ⟨5, MN, 1⟩.
Figure 7. Second Reduce function
The second Reduce function produces the following result: the item pairs MN, MO, MQ, NO, NQ and OQ are frequent, with supports 2, 3, 3, 3, 3 and 3 respectively, since each of these six pairs appears in at least two transaction rows.
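The second round can be sketched as below. It assumes that the Map function emits ⟨transaction id, pair, 1⟩ for every pair of surviving items that occur together in a transaction, and it runs on an assumed toy binary table built to agree with the supports quoted above. `map2` and `reduce2` are hypothetical names, not the paper's pseudo code.

```python
from collections import Counter
from itertools import combinations

MIN_SUPPORT = 2

# Assumed binary table; columns are T1..T6.
binary_table = {
    "M": [0, 1, 1, 0, 1, 1],
    "N": [1, 1, 0, 1, 1, 0],
    "O": [1, 1, 1, 1, 0, 1],
    "P": [0, 0, 0, 0, 1, 0],
    "Q": [1, 1, 1, 0, 1, 0],
}
frequent_items = ["M", "N", "O", "Q"]  # P was pruned in the first round

def map2(table, items):
    """Emit (tid, pair, 1) for each pair of frequent items in a transaction."""
    n_transactions = len(next(iter(table.values())))
    for col in range(n_transactions):
        present = [i for i in items if table[i][col]]
        for a, b in combinations(present, 2):
            yield col + 1, a + b, 1

def reduce2(triples, min_support):
    """Sum the 1's per pair and keep pairs meeting the minimum support."""
    counts = Counter()
    for _tid, pair, one in triples:
        counts[pair] += one
    return {p: c for p, c in sorted(counts.items()) if c >= min_support}

emitted = list(map2(binary_table, frequent_items))
frequent_pairs = reduce2(emitted, MIN_SUPPORT)
print(frequent_pairs)
```

Because only the surviving items are paired, no pair containing P is ever generated, which is the candidate reduction the text describes.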
The third Map and Reduce functions are then applied; their pseudo code is shown in the following segment. The next Map function, map4, is applied afterwards, but since there are no relevant items it returns nothing, and the algorithm ends after discovering the frequent itemsets.
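The repeated rounds and the stopping condition can be sketched as a level-wise loop over bit vectors, where the support of an itemset is the number of 1 bits left after AND-ing the vectors of its items. This is an illustration under stated assumptions, not the paper's Dual Mine pseudo code: the candidate filter used here is the standard Apriori subset check (the paper's own pruning technique is not specified in the text), the bit vectors encode an assumed toy dataset, and the names `support` and `mine` are hypothetical. The 3-itemset supports it yields are a property of the toy data and are not reported in the text.

```python
from functools import reduce
from itertools import combinations
from operator import and_

MIN_SUPPORT = 2

# Assumed bit vectors; the leftmost bit is T1 and the rightmost bit is T6.
vectors = {"M": 0b011011, "N": 0b110110, "O": 0b111101,
           "P": 0b000010, "Q": 0b111010}

def support(itemset):
    """Count transactions containing every item: popcount of the AND."""
    return bin(reduce(and_, (vectors[i] for i in itemset))).count("1")

def mine(min_support):
    """Return {itemset: support} for all frequent itemsets, level by level."""
    level = {(i,): support((i,)) for i in sorted(vectors)}
    level = {c: s for c, s in level.items() if s >= min_support}
    frequent, k = {}, 1
    while level:  # stops when a round produces no frequent itemset
        frequent.update(level)
        items = sorted({i for itemset in level for i in itemset})
        k += 1
        # Keep a candidate only if all of its (k-1)-subsets are frequent.
        candidates = [c for c in combinations(items, k)
                      if all(sub in level for sub in combinations(c, k - 1))]
        level = {c: support(c) for c in candidates}
        level = {c: s for c, s in level.items() if s >= min_support}
    return frequent

frequent = mine(MIN_SUPPORT)
for itemset, s in frequent.items():
    print("".join(itemset), s)
```

On this toy data the fourth round has no candidate left, so the loop ends there, which matches the behaviour described for map4.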

Experimental Evaluation
The proposed Dual Mine (DM) algorithm is evaluated on synthetic datasets and compared with existing algorithms with respect to running time and memory usage. The results indicate that the proposed DM algorithm outperforms the existing algorithms by a good margin. The synthetic datasets are generated with the IBM Quest data mining code. The parameters of the dataset are shown in Table 7.

Table 7. Parameters used in synthetic dataset generation
The proposed DM algorithm is compared with several existing algorithms to assess how it performs against the best algorithms available in the literature. The comparison covers execution (running) time, memory utilization during execution and the number of candidates produced during execution. The existing algorithms compared in this paper are described below.
The BigFIM [6] algorithm is a well-known algorithm for discovering frequent itemsets from big data; it combines features of the Apriori algorithm and the Eclat algorithm [7] into a hybrid approach.
The parEclat [8] algorithm, developed by Zaki, is a parallel Eclat algorithm that uses a vertical representation of the data together with parallel computing to generate frequent itemsets from very large databases. The Single Pass Counting (SPC) algorithm [9] is a parallel implementation of the Apriori algorithm using MapReduce. In this algorithm the support counting of the candidates is parallelized; the algorithm is divided into two phases and overcomes the shortfalls of the classical Apriori algorithm.
The proposed Dual Mine algorithm and the other existing algorithms are executed on the synthetic dataset T30I20D10M. The numbers of candidates generated are listed in Table 8 and compared graphically in Figure 10.
Table 8. Experimental evaluation on the synthetic T30I20D10M dataset regarding the volume of candidates generated

Figure 10. Candidate generation comparison for T30I20D10M synthetic database
Although the synthetic dataset is very dense and its transactions are long, the proposed DM algorithm performed well, completed most runs without an abnormal exit and outperformed most of the other algorithms.
Table 9. Experimental evaluation on the synthetic T30I20D10M dataset regarding the running time
The proposed DM algorithm, along with the SPC algorithm, performed very well on the denser datasets, and DM ran without an out-of-memory error even for a very small minimum support value (0.15). BigFIM, in contrast, performed poorly and suffered many failures. The memory-based comparison is presented in the next section.

Memory usage (MB) by algorithm on the synthetic dataset T30I20D10M
The memory consumption of the proposed Dual Mine algorithm was lower than that of most of the existing algorithms. The BigFIM algorithm performed the worst, and the remaining algorithms performed moderately but lagged behind the DM algorithm.

Conclusion
The three existing algorithms and the proposed DM algorithm were evaluated experimentally on a dense synthetic dataset. The results indicate that the proposed algorithm fares better in memory consumption, generates far fewer candidates and, more importantly, takes much less time to complete execution. Its overall performance was very good.