Euclidean Distance Based Similarity Measurement and Ensuing Ranking Scheme for Document Search from Outsourced Cloud Data

In this paper, we propose the Euclidean Distance based Similarity Measurement and Ensuing Ranking (EDSMER) scheme to support effective document search over outsourced cloud data. It is another attempt to find an alternative to binary-based approaches. In this approach, the user or the data owner first filters out suitable keywords for each document, and the index is then prepared. To provide security and privacy, both the data and the index are encrypted and moved to the cloud space. The EDSMER scheme is applied after an authorized user requests documents through query terms. Initially, the authorized user sends a query to the Cloud Service Provider (CSP) to retrieve all the documents that are mapped to the supplied keywords. The proposed algorithm calculates the distance between the query terms and the


Introduction
This paper develops a document search algorithm that ranks documents by measuring the similarity between two sets of data using the Euclidean distance. The ultimate objective of this algorithm is to rank the relevant documents based on the query words received from the user.
Whenever diverse sets of data are available, it is necessary to pick out the features they have in common so that they can be analyzed against the objectives of the study. Several methods and principles are available to achieve this. Here we apply a purely mathematical approach to measure the distance between two terms, which gives the degree of closeness between them. Based on this degree of closeness, we are able to extract the words that best match the reference input words.

System Description
Our proposed system includes three major stakeholders: the Data Owner, the Cloud Service Provider (CSP), and the Authorized User.

2.1 Theory behind Euclidean Algorithm and Euclidean Distance
Let us take a collection of N documents $D_1, D_2, D_3, \ldots, D_N$ that have been converted and stored in the cloud space, and let the user query be Q. The expected output of the EDSMER system is a ranked list of retrieved documents. Obtaining each retrieved item involves calculating a relevance score and then a distance. Every document is assigned a relevance score. The relevance judgment is either 0 or 1 if the binary relevance method is used, and 0, 0.5, or 1 under the graded relevance system.
In general, the Euclidean distance between any two points X and Y is the length of the line segment that connects them. In Cartesian coordinates, let the two points be $X = (X_1, X_2, X_3, \ldots, X_n)$ and $Y = (Y_1, Y_2, Y_3, \ldots, Y_n)$. Then the distance between X and Y can be calculated using the Pythagorean formula as

$$d(X, Y) = \sqrt{\sum_{i=1}^{n} (X_i - Y_i)^2}$$

In the same manner, the Euclidean distance between the ideal scores $I = \{I_1, I_2, \ldots, I_n\}$ and the obtained scores $r = \{r_1, r_2, \ldots, r_n\}$ can be calculated by

$$\mathrm{Distance}(I, r) = \sqrt{\sum_{i=1}^{n} (I_i - r_i)^2}$$

Here r and I are Euclidean vectors, because any point in Euclidean space can be regarded as a vector.
The length of such a vector from the origin is called the Euclidean length or the Euclidean norm:

$$\|r\| = \sqrt{r \cdot r}$$

The displacement from I to r is given by

$$r - I = (r_1 - I_1, r_2 - I_2, r_3 - I_3, \ldots, r_n - I_n)$$

and the distance between the two points is the norm of this displacement:

$$\|r - I\| = \sqrt{(r_1 - I_1)^2 + (r_2 - I_2)^2 + \cdots + (r_n - I_n)^2}$$

The general approach to calculate the Euclidean distance is summarized below:
• Pre-process the two sets of data. One is from the cloud database and the other is from the user query.
• Calculate the one-dimensional distance between the first keyword from the user query and the keywords from the cloud database.
• Based on the number of keywords, the dot operation is performed.
• The distance for every keyword can be calculated.
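The distance and norm described in this section can be illustrated with a minimal Python sketch. The function names and the sample score vectors are assumptions made for illustration; they do not come from the EDSMER implementation.

```python
import math


def euclidean_distance(ideal, obtained):
    """Distance between two equal-length score vectors (hypothetical helper)."""
    if len(ideal) != len(obtained):
        raise ValueError("score vectors must have the same length")
    return math.sqrt(sum((i - r) ** 2 for i, r in zip(ideal, obtained)))


def euclidean_norm(vector):
    """Length of a vector from the origin, i.e. sqrt(v . v)."""
    return math.sqrt(sum(x * x for x in vector))


# Assumed example: ideal scores are all 1, obtained scores use the
# graded relevance values 0, 0.5 and 1 mentioned in the text.
ideal = [1.0, 1.0, 1.0]
obtained = [1.0, 0.5, 0.0]
distance = euclidean_distance(ideal, obtained)

# The distance equals the norm of the displacement vector r - I.
displacement = [r - i for i, r in zip(ideal, obtained)]
same_distance = euclidean_norm(displacement)
```

A smaller distance means the obtained scores are closer to the ideal scores, which is the property the ranking relies on.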

2.2 Description of EDSMER based Information Retrieval
This method also consists of two directional processes. The first is from the Data Owner to the CSP and the second is from the authenticated data user to the CSP. The first process covers the preliminary security and data preparation tasks, while the second process finds the keywords with the smallest distance, which in turn gives the best match to the query input and an efficient ranking. This operation is described below in detail.
Initially, the Data Owner extracts keywords from the contents, and these keywords are bundled together to form the index. After the index is created, encryption must be performed to protect privacy. Both the documents and the index are encrypted and stored in the cloud database. We used asymmetric encryption in this method. With these steps, the first stage of the process is completed.
In the second stage, the user query is processed against the keywords held by the CSP in the EDSMER algorithm to find the smallest distance between the words using the Euclidean approach. Based on the distance calculations, the ranking is prepared and revealed to the authenticated user.
The complete EDSMER algorithm is explained in two stages, namely (a) the Set-up Stage and (b) the Retrieval Stage.

(a) Set-up Stage
This stage is the primary stage of EDSMER-based IR over encrypted cloud data. It involves the following three processes: (i) Key generation, (ii) Index preparation, and (iii) Encryption.

(i) Key Generation
In this stage, a pair of keys is generated, consisting of a public key and a private key. Encryption is done with the public key, and the resulting cipher text can be recovered only by applying the corresponding private key.

(ii) Index Generation
Step 1: Representation of document
Let us assume that the data owner has the document as a collection of n files. Mathematically, it can be represented as $D = \{f_1, f_2, f_3, \ldots, f_n\}$.
Step 2: Extraction of Keywords
Every file contains certain keywords. Let us denote the keywords of a particular document as $k_1, k_2, k_3, \ldots$. All the keywords from the incorporated files need to be extracted and grouped under a single entity, $K = \{k_1, k_2, k_3, \ldots\}$.
Step 3: Index creation
While combining the keywords to form an index, it is necessary to differentiate each keyword from the others. One way to do this is to attach a tag that records the weight or some other attribute of the keyword.
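The three index generation steps can be sketched in Python as follows. This is only an illustration: the text does not say how the keyword tag is computed, so the sketch assumes the tag is the number of times a keyword occurs in a file, and `build_index` and the sample files are hypothetical.

```python
from collections import Counter


def build_index(files):
    """Map each keyword to {file_id: weight}.

    `files` maps a file identifier to its list of extracted keywords.
    The weight is the keyword's occurrence count (an assumption).
    """
    index = {}
    for file_id, keywords in files.items():
        for keyword, count in Counter(keywords).items():
            index.setdefault(keyword, {})[file_id] = count
    return index


# Assumed sample: two files with already-extracted keywords.
files = {
    "f1": ["cloud", "search", "cloud"],
    "f2": ["search", "ranking"],
}
index = build_index(files)
```

In the scheme described here, an index of this kind would be encrypted before it is moved to the cloud.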

(iii) Encryption
The EDSMER algorithm uses asymmetric encryption. In this type of cryptographic technique, two different keys are used, one for encryption and the other for decryption.
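The relationship between the public key and the private key can be illustrated with a toy example. The text does not name the cryptosystem used by EDSMER, so the sketch below substitutes textbook RSA with very small primes purely as an illustration; the function names and numbers are assumptions, and the code is not secure.

```python
# Toy textbook RSA, used only to illustrate asymmetric encryption.
# The cryptosystem, the primes and the exponent are all assumptions.


def generate_keypair(p, q, e=17):
    """Return ((e, n), (d, n)) for two distinct primes p and q."""
    n = p * q
    phi = (p - 1) * (q - 1)
    d = pow(e, -1, phi)  # modular inverse of e, requires Python 3.8+
    return (e, n), (d, n)


def encrypt(message, public_key):
    """Encrypt an integer message smaller than n with the public key."""
    e, n = public_key
    return pow(message, e, n)


def decrypt(cipher, private_key):
    """Recover the integer message with the matching private key."""
    d, n = private_key
    return pow(cipher, d, n)


public_key, private_key = generate_keypair(61, 53)
cipher = encrypt(65, public_key)
plain = decrypt(cipher, private_key)
```

Only the holder of the private key can turn `cipher` back into the original value, which is the property the set-up stage depends on.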

(b) Retrieval Stage
This second stage starts from the input supplied by the authorized user. It consists of three steps, namely: (i) Trap-door creation, (ii) Ranking based on the EDSMER algorithm, and (iii) Decryption.

(i) Trap-door creation
The input received from the authorized user is not encrypted, whereas the data held by the CSP are encrypted. Therefore, in order to compare the two kinds of keywords, a one-way mathematical function called the trap-door function must be generated.
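The one-way property of a trap door can be sketched as follows. The text does not specify the trap-door construction, so this example uses a keyed hash (HMAC-SHA256 from the Python standard library) as a stand-in; the key, the normalization, and the function name are assumptions. It shows only that a plain query keyword can be turned into a token that matches the token stored for the same keyword, and it does not by itself support distance computation.

```python
import hashlib
import hmac


def make_trapdoor(keyword, secret_key):
    """Derive a one-way token from a keyword (hypothetical helper)."""
    normalized = keyword.strip().lower().encode("utf-8")
    return hmac.new(secret_key, normalized, hashlib.sha256).hexdigest()


# Assumed shared key, for illustration only.
secret_key = b"shared-secret-for-illustration"

index_token = make_trapdoor("Cloud", secret_key)     # stored with the index
query_token = make_trapdoor(" cloud ", secret_key)   # built from the query

# Constant-time comparison of the two tokens.
tokens_match = hmac.compare_digest(index_token, query_token)
```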

(ii) Ranking based on EDSMER algorithm
After the trap door is created, the keywords entered by the authorized user are passed to the EDSMER algorithm. If there are n keywords in the user query, the Euclidean distance is calculated n times, once per query keyword, with respect to all the keywords in the index. After all the calculations are complete, the distance values are pooled and the ranking is prepared.

EDSMER Algorithm
Step 1: The authorized user inputs the keywords through a query
Step 2: The received inputs form a database of keywords
Step 3: Every keyword of this database is compared with the index stored in cloud storage and the minimum Euclidean distance is calculated
Step 4: All the minimum distance values are pooled and the least n results are filtered out
Step 5: Ranking based on the minimum distance is completed
Step 6: The ranked documents are displayed to the user
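The steps above can be sketched in Python under some stated assumptions. The text does not define how keywords are turned into numeric scores, so this sketch assumes that the index stores, for each document, a graded relevance score (0, 0.5 or 1) per keyword and that the ideal score for every query keyword is 1. The function name, the index layout, and the sample data are hypothetical, and encryption is left out.

```python
import math


def rank_documents(query_keywords, index, top_n):
    """Rank documents by Euclidean distance to the ideal score vector.

    `index` maps doc_id -> {keyword: score}; a missing keyword scores 0.
    """
    results = []
    for doc_id, scores in index.items():
        # Ideal score is 1.0 for each query keyword (assumption).
        squared = sum((1.0 - scores.get(k, 0.0)) ** 2 for k in query_keywords)
        results.append((math.sqrt(squared), doc_id))
    results.sort()  # smallest distance first
    return [doc_id for _, doc_id in results[:top_n]]


# Assumed sample index with graded relevance scores.
index = {
    "doc1": {"cloud": 1.0, "search": 0.5},
    "doc2": {"cloud": 0.5},
    "doc3": {"cloud": 1.0, "search": 1.0},
}
ranking = rank_documents(["cloud", "search"], index, top_n=2)
```

Here `doc3` has distance 0 and is ranked first, followed by `doc1` with distance 0.5; `doc2` is filtered out by the top-n cut.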

(iii) Decryption
After the ranking is complete, the results need to be converted into plain text. This is done by the decryption process. At the data owner side, encryption converts the index and the documents into cipher text, which has to be reverted to plain text before delivery to the user. The type of encryption used in the EDSMER method is asymmetric encryption, which uses two keys: the public key is used for encryption while the private key is used for decryption.
The algorithm for decryption is given below:

Decryption Algorithm
Step 1: Compute D 1 and D 2 D 1 = ( +1) 4 +1 mod ( − 1)
Step 2: Compute a and b,
Step 3: For all the ranked items, j to m, compute y as, = +
Step 4: Continue till completion of list
Step 5: Perform XOR operation between q and y

Dataset Description
RFC and FIRE datasets are used for the experimentation. Details about these datasets are given below:

Results and Discussions
Our proposed method is evaluated using two real-time databases, namely RFC and FIRE. The metrics used for the analysis are as follows:
1. Time required for generating the trap door function (with respect to the number of documents and the number of queries)
2. Computational cost
3. Response time of server
4. Communication cost
5. Recall
6. Mean Average Precision (MAP)
7. F-Measure
Two standard mechanisms, namely TRSE and RRSE, are chosen for comparing the performance and hence analyzing the attributes of our proposed EDSMER system. This comparison is restricted to items 1 to 4 above. For the remaining items (5 to 7), a few other algorithms have been taken.
Those are delivered by the following authors:

4.1 Time Required for Generating the Trap Door Function
The time required for generating the trap door function is analyzed below.
The trap door generation time is compared against the number of queries. Our proposed method consistently took only 120 to 150 seconds across the sample sizes of 500 to 2500 documents. Our method also took only half the time taken by TRSE and one-fourth of the time taken by RRSE.

4.2 Computational Cost
In general, the computational cost is the time taken by the system to complete a task. In this work, the task is to prepare the ranking. Figure 2 depicts this performance graphically. Overall, the three methods under analysis took between 490 and 2400 seconds to complete the process.
For a document size of 1 GB, EDSMER takes 490 seconds, whereas TRSE and RRSE take about 500 and 510 seconds respectively, so there is not much difference in performance. For document sizes of 2 GB and 3 GB, all three methods performed about the same. A wide deviation is observed, however, as the document size increases further: between 4 GB and 6 GB, the EDSMER algorithm outperforms the other two methods.

4.3 Response Time of Server
The response times taken by the server, along with the "m" values, are depicted in Figure 3 and Figure 4. The response time of the server is plotted against the number of queries. Up to 1000 queries were considered, and the response time varied from 1 second to 7 seconds.

4.4 Communication Cost
In general, the communication cost covers the entire round trip of the communication. In our work, we considered the total time taken by the system to receive, process, and complete the entire task. We considered 200 to 1200 queries. For sample sizes of 200 to 800 queries, our system outperforms the TRSE and RRSE mechanisms. For the remaining range, it is consistent with the other two mechanisms.

4.5 Performance Measures Comparison
Three parameters were taken for measuring the performance of our proposed method in two different datasets.

(i) Recall
It refers to the fraction of the relevant documents in the collection that are successfully retrieved.

(ii) Mean Average Precision (MAP)
This score is the mean, over all queries, of the average precision of each query. It is calculated as the ratio of the sum of the average precision values to the total number of queries.

(iii) F-Measure
The F-Measure, or F-Score, is the harmonic mean of precision and recall; hence, the higher the F-Measure, the better the retrieval performance.
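The three measures can be computed as in the following minimal sketch, which uses the standard textbook definitions: average precision is normalized by the number of relevant documents. The function names and the sample ranking are assumptions for illustration and are not taken from the experiments reported here.

```python
def recall(retrieved, relevant):
    """Fraction of the relevant documents that were retrieved."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(relevant & set(retrieved)) / len(relevant)


def precision(retrieved, relevant):
    """Fraction of the retrieved documents that are relevant."""
    if not retrieved:
        return 0.0
    return len(set(relevant) & set(retrieved)) / len(retrieved)


def f_measure(p, r):
    """Harmonic mean of precision p and recall r."""
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank holding a relevant document."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """Average of average_precision over (ranked, relevant) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Assumed sample: one query with four ranked results and three relevant documents.
ranked = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3", "d5"}
r = recall(ranked, relevant)
p = precision(ranked, relevant)
f = f_measure(p, r)
ap = average_precision(ranked, relevant)
```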
For the analysis of the above-mentioned metrics, a two-dimensional approach was followed in this paper. First, the RFC dataset was studied with EDSMER and the other five mechanisms to understand the robustness of our system. Then the FIRE dataset was used for the same comparison. Figure 6 exhibits the performance of EDSMER against the other five mechanisms on the RFC dataset. Figure 7 shows the performance on the FIRE dataset. The same trend is exhibited on the FIRE dataset, as shown in Figure 7. The striking factor, however, is that when the performance of EDSMER is compared between the two datasets, it performed better on RFC than on FIRE.

Conclusion & Future Research Enhancements
The Euclidean Distance Based Similarity Measurement and Ensuing Ranking (EDSMER) scheme for document search from outsourced cloud data has been presented in this paper, from its core principles to the results of experimentation. It is yet another attempt to find an alternative to binary-based approaches.
The Euclidean distance approach performed well for the larger document sizes. Hence it can be viewed as an alternative to the binary approach. In fact, this scheme performed better than the reference systems, TRSE and RRSE. To conclude, the EDSMER algorithm produced good performance across all the parameters taken for testing. This mechanism outperformed its counterparts in several metrics, but its recall rate is slightly lower than that of a few other systems taken for comparison. Hence this is the critical area to be developed further. Since this approach has good potential for information retrieval, this area can be taken up as future work.