Integrated Security and Privacy Framework for Big Data in Hadoop MapReduce Framework

Public cloud infrastructure is widely used by enterprises to store and process big data. The cloud and its distributed computing model not only provide a scalable, available and affordable solution for storage and compute services but also raise security concerns. Many existing security solutions encrypt data and allow plaintext to be accessed for data analytics only within the confines of secure hardware. However, the fact remains that large volumes of data are processed in a distributed environment involving hundreds of commodity machines, and there are numerous communications between machines in the MapReduce computing model. In the process, compromised MapReduce machines or functions can be used to launch query based inference attacks on big data that lead to leakage of sensitive information. The main focus of this paper is to overcome this problem. Towards this end, a methodology is proposed with an underlying algorithm for defeating query based inference attacks on big data in Hadoop. The proposed algorithm is known as Multi-Model Defence Against Query Based Inference Attacks (MMD-QBIA). A realistic attack model is considered for validating the effectiveness of the proposed methodology. Then an integrated framework for security and privacy of big data is evaluated. Cloudera Distribution Hadoop (CDH) is the environment used for the empirical study. The experimental results revealed that the proposed solution prevents different kinds of query based inference attacks on big data, besides providing security to big data in the Hadoop MapReduce framework.


INTRODUCTION
With the emergence of cloud computing and the big data ecosystem, there is every possibility of innovative approaches to deal with massive amounts of data without losing the value held in the data [14]. MapReduce is the programming model that supports parallel processing across thousands of commodity computers in cloud computing or distributed environments. MapReduce frameworks like Hadoop play a vital role in data analytics in distributed computing environments (DCE), and they have proved to be efficient in dealing with big data in numerous application domains [4]. With big data, there are security problems. Different attacks may occur when data is at rest or in transit. Big data analytics has many security and privacy challenges [19]. Our prior works [31] and [32] provided security enhancements. For instance, in [31] a lightweight security mechanism known as Lightweight Security Scheme (LSS) is defined. In [32], an algorithm named Flexible and Efficient Encryption (FEE) is defined to deal with structured data security and data dynamics on encrypted data that has been outsourced. However, our work in [31] and [32] does not consider the scenario where big data needs to be protected from privacy attacks when it is subjected to analytics in a distributed environment. In this paper, we consider this problem and provide a solution to prevent query based inference attacks on big data. Big data poses privacy challenges unless there is a robust mechanism that provides not only a cryptographic solution for data security but also prevention of data leakage [12]. Many solutions came into existence to protect the privacy of big data. Airavat is one of them, where a differential privacy (DP) based solution is provided. Privacy issues with the MapReduce programming model are explored in [1]. The usage of DP is advocated in [2] and [7]. Across DP based solutions, the protection concept is as illustrated in Figure 1.

Figure 1: Modus operandi of differential privacy based protection
As presented in Figure 1, a query is made to the database by an analyst or an adversary. The privacy guard, implemented based on DP, adds noise to the sensitive data and returns the result to the requester. Thus the adversary's attempt at inference is defeated. The data is still transmitted, but in a format transformed with DP. The DP transformation includes hashing, subsampling and adding noise [34].

Enhancing MapReduce Layer for Big Data Privacy
Privacy attacks on big data may occur when data is subjected to Map and Reduce methods. To overcome this problem, Jain et al. [3] enhanced the MapReduce (MR) layer with an additional layer between the MR layer and the Hadoop Distributed File System (HDFS). Input and output privacy are combined along with security. The method protects data from privacy attacks and reduces information loss. It also promotes scalability as it uses lightweight encryption. Aditham et al. [11] proposed different threat detection mechanisms for threats that may arise from malicious insiders. Their solution includes profiling process behaviour using library calls, system calls and memory access patterns. After the process profiles are built, they are verified dynamically at runtime to detect any discrepancies. Principal Component Analysis (PCA) and Singular Value Decomposition (SVD) methods are employed to estimate violations. In future, they intended to deal with big data privacy when the data is subjected to analytics. Gambs et al. [13] used a tool known as GEPETO for analysing big data privacy by interpreting mobility traces at large scale. They considered a MapReduce environment for their empirical study, where they sanitized data to prevent privacy attacks. They intended to improve it by using spatial cloaking methods in future. Dinh et al. [15] proposed a methodology for privacy preserving MapReduce computations. They incorporated secure shuffling, secure grouping and execution integrity.
Stephen et al. [16] proposed a method that uses program analysis to find security threats in MapReduce code. Geyer et al. [17], on the other hand, proposed a security framework for processing big data in a distributed environment. Pires et al. [20] proposed a lightweight security framework for the MapReduce programming paradigm. Raizi et al. [22] proposed a hybrid framework for secure data analytics. In the same fashion, Lu et al. [23] focused on an opportunistic computing framework with privacy and security in the healthcare domain. Du et al. [24] proposed an attestation mechanism for cloud service integrity as part of Software as a Service (SaaS). Xue and Hong [26] proposed a framework for secure data sharing in the presence of dynamic groups, while Li et al. [27] proposed a secure and privacy preserving mechanism for sharing health records. Yang et al. [28] used polynomial codes for security in distributed environments. Dong et al. [29] proposed a distributed processing approach that is hierarchical in nature. From the literature, it is observed that there has been considerable research on making MapReduce operations privacy aware. However, with respect to differential privacy, the existing works follow different approaches, and there is a need for an integrated multi-model approach for preventing query based inference attacks.

PRELIMINARIES
Differential Privacy (DP) is a technique used to protect data from privacy attacks. It was originally developed by Dwork, Nissim, McSherry and Smith and later improved by others [8]. Formally, let D1 and D2 denote two databases. They are known as neighbouring databases when they differ in at most a single data entry. Accordingly, an algorithm denoted as M is considered to be ε-differentially private if, for all pairs of neighbouring databases D1 and D2 and every output x, Eq. 1 holds.

$$\Pr[M(D_1) = x] \le e^{\varepsilon} \, \Pr[M(D_2) = x] \qquad (1)$$

The output of the computation does not reveal the presence of any data item in the input. Adversaries are not able to know whether a specific item is part of the dataset, which precludes deriving sensitive information about it from the data. Ideally, DP needs to be employed in such a way that when the data (after applying DP) is given to a third party analyst, he/she will not be able to learn the identity of any entity. This characterization of data is part of DP based methods. DP is best used to prevent query based inference attacks. In other words, adversaries cannot know the participation of an item (its presence or absence) in the dataset. The privacy budget (ε) is the control parameter for enforcing privacy on big data (as used in this paper). Considering two neighbouring datasets D1 and D2 and an output function A, the privacy budget needs to be low, so that the factor e^ε is almost equal to 1. This means that the outcome probability of A on D1 and D2 is almost the same. This is the ideal way of using the privacy budget when DP is employed. A lower privacy budget gives stronger privacy, but it leads to less utility of the data subjected to analytics. Therefore, the privacy budget ε is generally kept at values such as 0.01 or 0.1. Eq. 2 shows the usage of the privacy budget.

$$\Pr[A(D_1) \in S] \le e^{\varepsilon} \, \Pr[A(D_2) \in S] \qquad (2)$$

There is another important term pertaining to DP. It is known as sensitivity, and it determines the amount of noise added to the output of a MapReduce function in Hadoop (with respect to the work of this paper). The sensitivity is based on the magnitude of change in the outcome when a single row is added or removed. When a series of counting queries denoted as Q is made on D1 and D2, the sensitivity of Q is denoted as ∆Q and it is computed as in Eq. 3.

$$\Delta Q = \max_{D_1, D_2} \lVert Q(D_1) - Q(D_2) \rVert_1 \qquad (3)$$

In order to achieve DP, noise is added to the dataset. There are two primary mechanisms for adding noise. They are known as the Exponential Mechanism (EM) and the Laplace Mechanism (LM). The amount of noise added depends on the global sensitivity and the privacy budget. EM is a controlled strategy to achieve DP. It is used for output that is in categorical form. Intuitively, EM satisfies the DP definition because a change in a single row of the database has only a bounded effect on the probability of each outcome. It is suited to situations where the best response is to be picked. In a query-response system, let the input database be denoted as D and a potential response be denoted as r ∈ R, with a score function denoted as u: D × R → ℝ. An algorithm named A gives a response to the query so as to satisfy ε-differential privacy as in Eq. 4.

$$A(D, u) = \left\{ r : \Pr[r \in R] \propto \exp\!\left( \frac{\varepsilon \, u(D, r)}{2 \Delta u} \right) \right\} \qquad (4)$$

The score function determines the yield of the exponential mechanism. The privacy budget has its influence on the possible outcome. For a higher level of privacy, it is essential to keep the value of the privacy budget as low as possible. LM, on the other hand, computes the given function and perturbs each coordinate with noise drawn from the Laplace distribution. The level of noise is controlled by the privacy budget parameter. LM is useful for producing numerical outputs. For an algorithm denoted as A applied to D, with global sensitivity denoted as ∆f and the function denoted as f: D → ℝ^d, Eq. 5 shows how the noise is added.

$$A(D) = f(D) + \text{noise} \qquad (5)$$

If the added noise follows the Laplace distribution, ε-differential privacy is satisfied. It is denoted as noise ~ Lap(∆f/ε), where the location parameter is zero and the scale parameter is ∆f/ε. The probability density function, when the scale parameter is b and the location parameter is zero, is computed as in Eq. 6.

$$f(x \mid b) = \frac{1}{2b} \exp\!\left( -\frac{\lvert x \rvert}{b} \right) \qquad (6)$$

The standard deviation is denoted as σ(X) and the variance is denoted as D(X). Finally, the results are obtained as in Eq. 7 and Eq. 8. DP also exhibits two important properties. They are known as sequential composition (SC) and parallel composition (PC). The former refers to a sequence of computations that provide DP at each computation and at the sequence level as well, while the latter involves many disjoint computations carried out in parallel.
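The two noise mechanisms described above can be illustrated with a minimal Python sketch. It is not the implementation used in this paper; the function names, the seed and the example scores are assumptions made for illustration. Laplace noise is drawn as the difference of two exponential samples, which is a standard way to obtain Lap(0, b), and the exponential mechanism samples a response with probability proportional to exp(εu/2∆u).

```python
import math
import random


def laplace_noise(scale, rng):
    """Sample Lap(0, scale) as the difference of two exponential draws."""
    # If E1 and E2 are i.i.d. exponential with mean `scale`, E1 - E2 ~ Lap(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def laplace_mechanism(true_value, sensitivity, epsilon, rng):
    """Return f(D) + noise, with noise ~ Lap(sensitivity / epsilon)."""
    return true_value + laplace_noise(sensitivity / epsilon, rng)


def exponential_mechanism(candidates, score, sensitivity, epsilon, rng):
    """Pick r with probability proportional to exp(eps * u(r) / (2 * sensitivity))."""
    scores = [score(r) for r in candidates]
    top = max(scores)  # subtracting the max avoids overflow; ratios are unchanged
    weights = [math.exp(epsilon * (s - top) / (2.0 * sensitivity)) for s in scores]
    return rng.choices(candidates, weights=weights, k=1)[0]


# Hypothetical usage: a noisy count and a noisy "best response" selection.
rng = random.Random(42)  # fixed seed only to make the sketch repeatable
noisy_count = laplace_mechanism(true_value=120, sensitivity=1.0, epsilon=0.1, rng=rng)
best = exponential_mechanism(["A", "B", "C"], {"A": 3, "B": 9, "C": 5}.get, 1.0, 0.5, rng)
```

A smaller ε enlarges the Laplace scale ∆f/ε and flattens the exponential-mechanism weights, which is the privacy versus utility trade-off discussed in this section.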

PROPOSED FRAMEWORK
Technological innovations have changed the way an application stores and processes data. With cloud computing infrastructure, Internet-based computing has emerged as an ideal approach. The traditional means of storage and retrieval are no longer preferred, for cost and other reasons, when handling large volumes of data. Before the proposed framework is presented, the problem statement and motivating scenario are given.

Problem Statement
The MapReduce paradigm consists of two units of computation known as map and reduce. Each unit takes key-value pairs as input and, after processing, produces the desired output in the form of key-value pairs. Both input and output are stored in the Hadoop Distributed File System (HDFS). It is assumed that the input given to any MapReduce computation is encrypted and that adversaries can only get encrypted data if they succeed in launching attacks on stored data. However, when data is being processed, it is processed in plain text. Therefore, it is essential to protect such data from privacy attacks. Figure 2 shows the MapReduce computations for the WordCount benchmark.

Figure 2: Execution of WordCount benchmark with MapReduce computations
For big data analytics, a programmer typically implements Map and Reduce tasks. In many real world data analytics applications, the data of an organization is used for gaining business intelligence (BI). In this process, the main problem considered in this paper is privacy attacks launched by adversaries. To be specific, the attacks are query based inference attacks, in which the adversary wants to infer knowledge by learning whether a specific customer is present in the dataset. That is the reason it is known as a query based inference attack. The MapReduce source code in Listing 1 shows evidence of such an attack by an adversary.

Listing 1: Application specific MapReduce source code (malicious)
As observed in Line 14 through Line 16, the attacker is trying to find the presence of a customer named "Suzuki". If the customer is found in the data being processed, the adversary sets the word "Sazaki" at Line 15 and sets the value 1000000 as output. This is in the map() function. In the reduce() function, from Line 33 through Line 35, the adversary manipulates the sum value to obtain a very specific value as output. If that value is found in the output, the adversary confirms the presence of customer "Suzuki" in the big data. This kind of attack is known as a query based inference attack. This is the main privacy leakage problem addressed in this paper.
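The attack described above can be sketched in a few lines of Python. This is an in-memory stand-in for a Hadoop job, not a reproduction of Listing 1: the helper names (malicious_map, reduce_sum, run_job, adversary_infers_presence) are hypothetical, and the reducer here is a plain sum, whereas the reducer in Listing 1 is described as also manipulating the sum. Only the target name, the marker word and the value 1000000 come from the text.

```python
from collections import defaultdict

TARGET = "Suzuki"    # customer the adversary is probing for (from the text)
MARKER = "Sazaki"    # marker key emitted on a match (spelling as in the text)
SIGNAL = 1_000_000   # conspicuous value used as the covert signal


def malicious_map(record):
    """Emit (word, 1) like WordCount, plus a covert signal on a match."""
    for word in record.split():
        if word == TARGET:
            yield (MARKER, SIGNAL)
        else:
            yield (word, 1)


def reduce_sum(key, values):
    """Ordinary summing reducer; it passes the signal through unchanged."""
    return key, sum(values)


def run_job(records, mapper, reducer):
    """Tiny in-memory stand-in for the map, shuffle and reduce phases."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())


def adversary_infers_presence(output):
    """The adversary only inspects the final output for the signal value."""
    return output.get(MARKER, 0) >= SIGNAL


leaky_output = run_job(["Tanaka Suzuki Sato", "Sato Tanaka"], malicious_map, reduce_sum)
```

The adversary never reads the input directly; the presence of the target is inferred purely from the job output, which is why the defence has to act on what the reducer returns.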

Methodology
The objective of this methodology is to protect sensitive data even in the presence of potentially untrusted mapper and reducer code. It is based on cloud computing technology and a distributed programming framework. Hadoop is the MapReduce framework used for the empirical study, and it supports programming with map and reduce functions. The source code of the map() and reduce() functions may contain code which is malicious in nature. Often it is injected by adversaries to launch privacy attacks. Therefore, it is essential to ensure non-disclosure of sensitive information. The solution provided in this paper is based on DP. Cloudera Distribution Hadoop (CDH) includes the MapReduce framework. This framework is generally scalable, available and fault tolerant. It supports various real world use cases associated with big data analytics. Figure 3 shows the architectural overview of the proposed methodology.

Figure 3: Architectural overview of the proposed methodology
The computation system has different components such as the MapReduce framework, cloud storage, the JVM and the compute cloud. MapReduce reads input from HDFS and writes output to HDFS. MapReduce runs on a cluster of commodity computers. Privacy of big data is achieved using a DP based algorithm. In the presence of a malicious mapper or reducer, the proposed algorithm ensures that the privacy of big data is not lost. DP, as explained in Section 3, is capable of preventing certain kinds of privacy attacks such as query based inference attacks. The procedure used to prevent privacy attacks is as follows. Let D be the original dataset and D' be derived from D by using the DP technique. There is not much difference between these two datasets. Hence they are known as neighbouring datasets. An algorithm A achieves DP with output denoted as O as in Eq. 9.

$$\Pr[A(D) = O] \le \exp(\varepsilon) \cdot \Pr[A(D') = O] \qquad (9)$$

The degree to which privacy is protected is represented by ε. As the proposed methodology deals with large volumes of data, it is likely that some of the data is sensitive. The data may belong to any domain such as healthcare, banking, social networks and so on. There are many algorithms that provide privacy, but they may result in information loss due to the data transformation needed to ensure privacy. Thus, there is a trade-off between the privacy level and the deterioration of the utility of data. If this trade-off is not handled, it leads to a utility problem for big data. This is seen in the reconstruction function found in [33], which is shown in Eq. 10. Consider n random samples x1+y1, x2+y2, ..., xn+yn, where the cumulative distribution functions (CDFs) are denoted as Fx and Fy. Eq. 10 gives the posterior distribution. In the same fashion, for x1+y1, x2+y2, ..., xn+yn the estimation is as in Eq. 11. By differentiating Fx, the density function is obtained as in Eq. 12. The reconstruction and randomization solutions may cause data leakage as they retain some associated data. Therefore, such solutions are not suitable for strong privacy for big data. The DP construct (ε, a)-differential privacy satisfies the computation function given in Eq. 9 by considering D and D' as neighbouring datasets, for every S ⊆ Range(F). Eq. 13 is thus used to achieve this.

$$\Pr[F(D) \in S] \le \exp(\varepsilon) \times \Pr[F(D') \in S] \qquad (13)$$

As per this, inference attacks or query based privacy attacks are prevented. Thus adversaries cannot determine the presence or absence of a given entity in the data with any meaningful confidence.
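The privacy and utility trade-off can be made concrete with a short worked example. It assumes the Laplace mechanism of Section 3 applied to a single counting query with sensitivity ∆f = 1; these values are chosen for illustration only and are not settings reported in the experiments. The noise scale is b = ∆f/ε, and the expected absolute error of Laplace noise equals b.

```latex
\begin{align*}
b &= \frac{\Delta f}{\varepsilon}, \qquad \mathbb{E}\,\lvert \text{noise} \rvert = b \\
\varepsilon = 0.01 &: \quad b = 100, \qquad e^{\varepsilon} \approx 1.0101 \\
\varepsilon = 0.1  &: \quad b = 10,  \qquad e^{\varepsilon} \approx 1.1052 \\
\varepsilon = 1    &: \quad b = 1,   \qquad e^{\varepsilon} \approx 2.7183
\end{align*}
```

Under these assumptions, tightening ε from 1 to 0.01 brings the probability ratio bound close to 1 but makes a count inaccurate by about 100 on average, which is negligible for counts in the tens of thousands and destructive for small counts.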

Algorithm Design
We defined an algorithm known as Multi-Model Defence Against Query Based Inference Attacks (MMD-QBIA), which analyses the reducer code and the original dataset D. After execution of the algorithm, the neighbouring dataset D' results. D' is produced by the reducer in the MapReduce paradigm in order to defeat query based inference attacks. This is achieved by adding noise to the reducer output. However, adding too much noise leads to loss of utility of the big data. Therefore, we devised a plan to determine the noise level by analysing the reducer code (bytecode pattern). If the bytecode pattern is found to be genuine, the NoiseAddition() function is invoked to add a small amount of noise. If the pattern is not found, the StrongNoiseAddition() function is invoked as an attack is suspected.
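The decision described above can be sketched as follows. This is only an illustration under stated assumptions: bytecode pattern analysis is modelled as a lookup in a hypothetical set REGISTERED_PATTERNS, light noise is Laplace noise with sensitivity 1, and strong noise replaces every value with a uniform draw from an assumed valid range. None of these choices is confirmed by the paper beyond the two procedure names.

```python
import random

# Hypothetical registry of reducer patterns considered genuine.
REGISTERED_PATTERNS = {"sum", "count", "avg", "max", "min"}


def _laplace(scale, rng):
    """Lap(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noise_addition(reducer_output, epsilon, rng, sensitivity=1.0):
    """Light perturbation for a recognised reducer pattern."""
    scale = sensitivity / epsilon
    return {key: value + _laplace(scale, rng) for key, value in reducer_output.items()}


def strong_noise_addition(reducer_output, value_range, rng):
    """Sanitise every value when an attack is suspected (assumed behaviour)."""
    low, high = value_range
    return {key: rng.uniform(low, high) for key in reducer_output}


def mmd_qbia(reducer_pattern, reducer_output, epsilon, value_range, rng):
    """Return D' from D, choosing the noise model from the reducer pattern."""
    if reducer_pattern in REGISTERED_PATTERNS:
        return noise_addition(reducer_output, epsilon, rng)
    return strong_noise_addition(reducer_output, value_range, rng)


rng = random.Random(7)  # fixed seed only to make the sketch repeatable
genuine = mmd_qbia("sum", {"link-a": 120.0, "link-b": 45.0}, 1.0, (0, 30), rng)
suspected = mmd_qbia("unrecognised", {"Sazaki": 1_000_000}, 1.0, (0, 30), rng)
```

In the suspected case the conspicuous value 1000000 never reaches the output, so the adversary cannot use it as a signal.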

SECURITY INTEGRATED FRAMEWORK
Our prior works [31] and [32] provided security enhancements. In [31], a security mechanism known as Lightweight Security Scheme (LSS) is defined. In [32], an algorithm named Flexible and Efficient Encryption (FEE) is defined to deal with structured data security and data dynamics on encrypted data that has been outsourced. As shown in Figure 4, the key sharing scheme LSS is used for secure exchange of keys between the data provider and users. A lightweight cryptographic method is used to outsource data to the cloud and retrieve data from it. This method ensures secure end to end communication between the cloud server and the data provider. When data is to be shared with users, the data provider and users need LSS for secure key exchange. With the security keys, the users can perform two kinds of operations on the cloud. First, they can make queries to obtain data from non-relational data. Second, they can use SQL based queries for storage and retrieval of data from a relational database. Besides, users can perform data dynamics (changes to the outsourced relational data) directly. This is achieved with the Flexible and Efficient Encryption (FEE) scheme. The scheme takes care of the security of data flowing between the users and the cloud infrastructure. The MMD-QBIA algorithm described in Section 4 is used to ensure that the queries made by users for data from non-relational databases are securely processed. In fact, the algorithm is aimed at preventing query based inference attacks. The integrated architecture shown in Figure 4 thus offers a comprehensive and holistic approach to realizing big data security and privacy when data is at rest, in transit and in use for data analytics.

EXPERIMENTAL RESULTS
Experiments are made with a cluster of 3 machines, one master node and two slave nodes. The machines are configured with an Intel Core i5 processor at 3.4 GHz. The configuration of Hadoop is changed to set the replication number to 2 in conf/hadoop-site.xml. Two datasets are used for the empirical study. The first dataset is a real-world dataset collected from [6]. The second dataset is a synthesized one. Accordingly, details of two experiments are provided in this section. The first experiment uses the dataset from [6] while the second experiment uses the synthesized dataset.

Experiment 1
The big data [6] contains different attributes such as IP address, date, time and link. Here the sensitive attribute is the IP address, and adversaries launch query based inference attacks to learn the presence or absence of a specific IP address in the big data being processed in Hadoop MapReduce. The program uses the aggregate function sum. For this reason, the MMD-QBIA algorithm invokes the NoiseAddition() procedure, which in turn calls the NoiseAdditionForSum() procedure. This returns the noisy (privacy protected) outcome to its caller and thus D' is generated. D' is then returned by the reducer as the final output. The system gets two queries from the user. The first query is genuine (not malicious). The purpose of the program is to find how many times each link is repeated in the given log file.
Noise is added accordingly. With the second query, the attacker tries to find the presence of the IP address "192.168.133.33". As noise is added, it is not possible for the adversary to infer the presence of the IP address. As presented in Figure 5, the workload of the experiments is shown on the horizontal axis and the vertical axis shows the time taken in seconds. As the results show, the size of the workload influences the execution time. There is a linear increase in the execution time as the workload size increases. The figure shows the performance difference when the proposed DP algorithm MMD-QBIA is employed.

Experiment 2
In this experiment, the proposed algorithm is evaluated using a synthetic dataset. The dataset contains the family name of a person, the day of birth (0-30) and the score achieved in an entrance examination conducted by a university. The purpose of the program is to compute the average score of the data provided, based on the family name of the person. As a number of people with the same family name exist in the data, the MapReduce computation finds their average score.
In the bytecode of the program, the average function is intentionally removed. This action forces the algorithm to use the StrongNoiseAddition() procedure. Here the attacker knows the birth day of a person and tries to find the family name of the person. Since the day of birth ranges from 0 to 30, the average never exceeds 30. The attacker finds the birth day value and replaces it with a big number such as 1000000. Thus the attacker expects an average value greater than 30. In the experiment, however, the algorithm replaces the value 1000000 with a random number between 0 and 30. This defeats the attack and demonstrates the efficacy of the algorithm. The rationale behind this is that the actual value in the output is different from the value expected by the adversary.
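The effect described in this experiment can be reproduced with a small sketch. The data values and the helper names (attacker_tamper, sanitize, average) are hypothetical; only the range 0 to 30 and the signal value 1000000 are taken from the text. The sketch assumes that strong noise addition replaces any value outside the valid range with a random value inside it.

```python
import random

DAY_RANGE = (0, 30)   # valid range of the day-of-birth attribute (from the text)
SIGNAL = 1_000_000    # large value injected by the attacker (from the text)


def average(values):
    return sum(values) / len(values)


def attacker_tamper(days, known_day):
    """Replace the known birth day with a conspicuous value."""
    return [SIGNAL if day == known_day else day for day in days]


def sanitize(values, value_range, rng):
    """Replace out-of-range values with a random in-range value."""
    low, high = value_range
    return [v if low <= v <= high else rng.randint(low, high) for v in values]


rng = random.Random(3)  # fixed seed only to make the sketch repeatable
days = [3, 17, 29, 12]                               # hypothetical group with one family name
tampered = attacker_tamper(days, known_day=17)
attack_average = average(tampered)                   # far above 30 without a defence
defended_average = average(sanitize(tampered, DAY_RANGE, rng))  # stays within 0 to 30
```

Because the defended average stays inside the legitimate range, the attacker cannot tell which family name group contained the tampered record.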

Results of Integrated Architecture
The integrated architecture with FEE and LSS algorithms is evaluated with Cloudera Distribution Hadoop (CDH).
The observations are recorded in terms of encryption and decryption time, total upload time and total download time for a given workload size. As presented in Figure 6, the encryption and decryption performance is evaluated. The security schemes used in the empirical study are shown on the horizontal axis while the vertical axis shows the execution time for encryption and decryption. The results revealed that the proposed methods FEE and LSS took relatively less time for cryptographic operations. The rationale behind this is that they are designed to be lightweight. As presented in Figure 7, the total upload time is evaluated. The security schemes used in the empirical study are shown on the horizontal axis while the vertical axis shows the total upload time. The results revealed that the proposed methods FEE and LSS took relatively less time for uploading data. However, there is a linear increase in the time taken as the data grows in size. As presented in Figure 8, the total download time is evaluated. The security schemes used in the empirical study are shown on the horizontal axis while the vertical axis shows the total download time. The results revealed that the proposed methods FEE and LSS took relatively less time for downloading data. However, there is a linear increase in the time taken as the data grows in size.

PERFORMANCE EVALUATION
Certain assumptions are made in the empirical study. First, attackers have no direct access to data in HDFS. Second, users have access to the code of the map and reduce functions. Third, Map and Reduce functions in the MapReduce framework can gain access to storage media and the network. Fourth, there is secure communication among the nodes in the cluster. Fifth, any user of the system has normal network access privileges as an end user. The attack model is as follows. Attackers gain access to the map() function and encode sensitive data into a key. Afterwards, the same key is sent to the reduce() function. The reducer does not change the key, and it finally appears in the output.
The presence of the key in the final output indicates that the attack is successful. This model is known as a query based inference attack. When compared with the existing system named Airavat [25], the proposed system uses reducer analysis to determine whether there is a pre-registered pattern, and the proposed algorithm is employed to apply noise to the data. Unlike in Airavat, a potentially malicious value is replaced by a value with noise to defeat the privacy attack. The threat from a unique critical value used by the adversary is removed by the proposed algorithm. The usability of the proposed algorithm is found to be better than that of Airavat in the prevention of query based inference attacks. The integrated framework provides improved security and privacy to big data.

THREATS TO VALIDITY
The proposed solution to prevent privacy attacks on big data targets query based inference attacks. In this case, the attacker is interested in the presence of a sensitive entity in the big data. The proposed solution is based on the pre-registered reducer concept, which assumes that the reducer pattern is known beforehand. This is a threat to the validity of the proposed system if a reducer pattern cannot be known a priori. Nevertheless, the proposed algorithm is able to detect query based inference attacks with pattern analysis and noise addition using multiple models. It is useful for preventing privacy attacks of this kind. Another threat to the validity of the system is that the empirical study is made with 3 nodes in the cluster. This may appear small, as thousands of commodity computers are involved in real world cloud based distributed frameworks. However, the experiments are made with 3 low-configuration systems in order to develop a proof of concept prototype, which needs further enhancement to generalize the proposed solution to big data of different fields or domains.

CONCLUSIONS AND FUTURE WORK
In distributed programming frameworks, it is essential to protect data from untrusted or malicious code. The MapReduce programming model is widely used for handling big data. However, there are a number of security attacks on big data. Our prior works [31] and [32] provided security enhancements to protect big data when it is at rest and when it is in transit. They also considered both structured and unstructured data for security, besides supporting data dynamics on encrypted structured data. However, they do not cover query based inference attacks when big data is subjected to data analytics. The proposed algorithm, Multi-Model Defence Against Query Based Inference Attacks (MMD-QBIA), considers multiple models of preventing privacy attacks on big data. The algorithm has different strategies for different scenarios. For protecting aggregate values produced by the reducer, it has provision for various procedures that add appropriate noise. When an inference attack is indicated by the absence of a pre-defined mapper pattern, it invokes StrongNoiseAddition(), which ensures privacy so as to prevent disclosure of sensitive data to adversaries. Cloudera Distribution Hadoop (CDH) is the environment used for the empirical study. One real-world dataset and one synthetic dataset are used for the experiments. A proof of concept prototype is built, and the results revealed that the proposed system shows better usability than the existing system named Airavat in providing privacy protection to big data. The integrated security architecture is then evaluated with different schemes, and it is found that the framework provides enhanced security and privacy to big data. In future, we intend to perform experiments with more machines in the Hadoop cluster. Another direction for future work is to consider attacks other than query based inference attacks and improve our methodology to handle such attacks.

Figure 4: Integrated architecture for big data security, privacy and data dynamics

Figure 4: MapReduce outcome comparison (Stacked Line Graph)

As presented in Figure 4, the IP address is shown on the horizontal axis and the count (genuine and in presence of attacker) is shown on the vertical axis. As the difference between D and D' is very small, a stacked line graph is preferred. It shows the result of the proposed DP algorithm.

Figure 5: Execution time of MapReduce with and without MMD-QBIA

Figure 6: Performance comparison with encryption time and decryption time

Figure 7: Total upload time comparison with the schemes in integrated architecture

Figure 8: Total download time comparison with the schemes in integrated architecture

Table 1: Notations used in the MMD-QBIA algorithm

In Step 3 through Step 7, one of the two procedures named NoiseAddition() and StrongNoiseAddition() is executed based on the condition that checks whether the pattern is found. The NoiseAddition() procedure is designed in such a way that it takes care of only the aggregated values that are generally produced by the reduce() function of MapReduce computing. As the reducer produces different aggregates, namely sum, count, average, max and min, different procedures are defined as part of the algorithm to add noise to each of them. They include NoiseAdditionForSum(), NoiseAdditionForCount(), NoiseAdditionForAvg(), NoiseAdditionForMax() and NoiseAdditionForMin(). All these procedures are invoked as part of the normal noise addition procedure named NoiseAddition(). If the expected bytecode pattern is not found (suspected attack), the reducer output is subjected to StrongNoiseAddition(), which ensures complete sanitization of all target values prior to returning the final reducer output.

Algorithm: Multi-Model Defence Against Query Based Inference Attacks (MMD-QBIA)
Input: MapReduce code, Dataset D,

Algorithm 1: Multi-Model Defence Against Query Based Inference Attacks

As defined in Algorithm 1, the data (big data) held by the reducer prior to returning the output of the reduce function is taken as input. It is denoted as D and the algorithm transforms it into a neighbouring dataset D'. Finally, instead of returning D, the reducer returns D' as output. Thus adversaries fail in their query based inference attacks. The main algorithm starts by analysing the reducer bytecode pattern. If the regular pattern is found, noise addition is still needed, but it is limited in view of the computational cost. Step 2 of the algorithm does this analysis. It finds all single valued numbers. It also considers a random value list used for privacy protection, as in Step 3. Step 5 through Step 9 is an iterative process for adding noise. Step 10 takes the output values and, before they are returned by the reducer, subjects them to noise addition. Step 11 computes the noise and Step 12 returns the noise-added outcome, which comes back to Step 6 of the main algorithm. This outcome is D', the neighbouring database that defeats query based inference attacks launched by adversaries as per the DP philosophy.
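The per-aggregate procedures named above can be sketched as follows. The paper does not state the sensitivity used for each aggregate, so the choices here are assumptions: values are clamped to known bounds (with low < high), count has sensitivity 1, sum uses the largest absolute bound, average uses the range divided by the group size (treated as public), and max and min use the full range. The snake_case function names mirror the procedure names in the text but are otherwise hypothetical.

```python
import random


def _laplace(scale, rng):
    """Lap(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def _clamp(values, bounds):
    """Force values into [low, high] so the assumed sensitivities hold."""
    low, high = bounds
    return [min(max(v, low), high) for v in values]


def noise_addition_for_count(values, bounds, epsilon, rng):
    return len(values) + _laplace(1.0 / epsilon, rng)


def noise_addition_for_sum(values, bounds, epsilon, rng):
    low, high = bounds
    sensitivity = max(abs(low), abs(high))
    return sum(_clamp(values, bounds)) + _laplace(sensitivity / epsilon, rng)


def noise_addition_for_avg(values, bounds, epsilon, rng):
    low, high = bounds
    clamped = _clamp(values, bounds)
    sensitivity = (high - low) / len(clamped)  # assumes the group size is public
    return sum(clamped) / len(clamped) + _laplace(sensitivity / epsilon, rng)


def noise_addition_for_max(values, bounds, epsilon, rng):
    low, high = bounds
    return max(_clamp(values, bounds)) + _laplace((high - low) / epsilon, rng)


def noise_addition_for_min(values, bounds, epsilon, rng):
    low, high = bounds
    return min(_clamp(values, bounds)) + _laplace((high - low) / epsilon, rng)


PROCEDURES = {
    "count": noise_addition_for_count,
    "sum": noise_addition_for_sum,
    "avg": noise_addition_for_avg,
    "max": noise_addition_for_max,
    "min": noise_addition_for_min,
}


def noise_addition(aggregate, values, bounds, epsilon, rng):
    """Dispatch to the procedure that matches the reducer's aggregate."""
    return PROCEDURES[aggregate](values, bounds, epsilon, rng)
```

Max and min carry the largest assumed sensitivity, so for the same ε they receive far more noise than count; this is one reason a single noise routine for all aggregates would either over-protect counts or under-protect extremes.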
The work presented in Section 4 deals with prevention of query based inference attacks on big data in Hadoop MapReduce framework.

Table 2: MapReduce outcome in absence of attacker

As presented in Table 2, 192.168.133.33 is the IP address highlighted, as it is used later by the adversary to learn its presence in the data. The table shows the actual count of the IP address in the dataset. The results of MapReduce are as expected when the mapper and reducer are genuine. However, when the mapper or reducer is compromised, the intention is to launch privacy attacks on big data. To safeguard the IP address from disclosure to the attacker, the pattern of the Reduce function is obtained using a decompiler. The outcome is used in the algorithm to determine whether an attack is made or not. The result in the presence of the attacker is shown in Table 3.

Table 3: Results of MapReduce in presence of attacker

As presented in Table 3, the IP address and the corresponding noise-added value for count are provided. D is converted to D' with the proposed algorithm. For the IP address 192.168.133.33, which is the target of the privacy attack made by the adversary, the result is slightly changed and the value is 13653. The computation process of the algorithm for the highlighted IP is as follows.

$$\varepsilon = 8.85 \times 10^{-12}$$

$$\text{count (in presence of attacker)} = \text{count} + [(1 + \varepsilon) + R] = 13654 + [(1 + 8.85 \times 10^{-12}) - 2.00000000001] \approx 13653$$

The above computation illustrates the proof of concept. It is computed for the target IP address 192.168.133.33.

Table 4: MapReduce outcome with and without privacy protection

As presented in Table 4, the execution time of the MapReduce based DP algorithm MMD-QBIA, meant for protecting big data, is observed.

Table 5: Encryption and decryption time comparison

As presented in Table 5, the encryption and decryption time for AES, FEE and LSS are compared against different workloads.

Table 6: Total upload time taken by security schemes

As presented in Table 6, the total upload time for AES, RSA, FEE, ECDH and LSS is compared against different workloads.

Table 7: Total download time taken by security schemes

As presented in Table 7, the total download time for AES, RSA, FEE, ECDH and LSS is compared against different workloads.