Hadoop Job Scheduling Using Improvised Ant Colony Optimization

The Hadoop Distributed File System (HDFS) provides storage, while the MapReduce programming framework processes large datasets in parallel. Handling such complex and vast data while keeping performance parameters at an acceptable level is a difficult task. Hence, an improvised mechanism is proposed here that enhances the job scheduling capabilities of Hadoop and optimizes the allocation and utilization of resources. Significantly, an aggregator node is added to the default HDFS architecture to improve the performance of the Hadoop name node. In this paper, four entities, viz. the name node, the secondary name node, the aggregator nodes, and the data nodes, have been modified. Here, the aggregator node assigns jobs to data nodes, while the name node tracks the aggregator nodes. Also, based on job size and expected execution time, an improvised ant colony optimization method is developed for scheduling jobs. In the end, the results demonstrate notable improvement over native Hadoop and other approaches.

Keywords: Hadoop, Virtualization, MapReduce, Job Scheduling, Improved Ant Colony Optimization.


Introduction
Hadoop clusters have gained wide acceptance for their computational efficiency, thereby helping save time and cost. Hadoop comprises HDFS and MapReduce as its two mainstays. HDFS endows users with distributed storage access, while MapReduce offers distributed processing. The name node (master node) and the data nodes (slave nodes) build up HDFS, which ensures that distributed environments and storage facilities are managed efficiently. MapReduce runs tasks on a cluster, which allows data to be managed in a distributed data storage system. In fact, MapReduce splits the input dataset into many blocks, measuring 64 or 128 MB, before storing them in HDFS. The two functions used by the MapReduce component are as follows:
• Mapper: Each block requires a map operation that runs separately from the rest of the blocks, while ensuring that the map task is placed exactly at the data node that stores the block. At this phase, the operation produces a <key, value> pair for each term, such as <xyz, 1> (see the sketch after this list).
• Reducer: The reduce operation aggregates the intermediate <key, value> pairs emitted by the mappers by key and writes the final output back to HDFS.
Two tracker processes coordinate this execution:
• Job Tracker: The job tracker primarily handles the scheduling and processing of all jobs. Whenever the job tracker receives a job, it assigns that job to a task tracker, which then coordinates the execution of the job. The job tracker runs on the master node, i.e., the name node.
• Task Tracker: The map and reduce functions for a particular job assigned by the job tracker are performed by a task tracker. The task tracker reports the status of the assigned work back to the job tracker. Each task tracker is assigned many slots for the map and reduce functions required to perform a task. The balance of map to reduce tasks is also an important consideration, which is managed by the JVM. The task tracker runs on a slave node, i.e., a data node.
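To make the <key, value> flow concrete, the following is a minimal word-count sketch of the map and reduce operations in Python, written in the spirit of Hadoop Streaming; the function names and the in-memory shuffle are illustrative simplifications, not the Hadoop API itself.

    # Minimal word-count sketch of the map and reduce functions described
    # above (illustrative only; real Hadoop runs these distributed).
    from collections import defaultdict

    def mapper(block):
        """Emit a <key, value> pair such as <xyz, 1> for each term in a block."""
        for term in block.split():
            yield (term, 1)

    def reducer(pairs):
        """Aggregate the mapper output by key."""
        counts = defaultdict(int)
        for key, value in pairs:
            counts[key] += value
        return dict(counts)

    if __name__ == "__main__":
        block = "xyz abc xyz"
        print(reducer(mapper(block)))   # {'xyz': 2, 'abc': 1}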

Fig 2: Job and Task Tracker
A user submits a job to the job tracker, which assigns it to a task tracker for processing. The task tracker keeps the job tracker updated with the current status.

Virtualization
Virtualization is the concept of dividing the resources of the same hardware among multiple operating systems (the same or different) and applications at the same time, achieving higher efficiency and resource utilization. The hypervisor, or virtual machine monitor, is the manager that carries out the process of virtualization.
• A bare-metal hypervisor runs directly on top of the hardware in order to have greater control over it.
• A hosted hypervisor runs on top of a conventional operating system alongside some native processes. It differentiates between the guest and host operating systems running on the same hardware.

Types of Virtualization
Virtualization can be utilized in different areas to enhance performance and resource sharing. The various types of virtualization are as follows:
Server Virtualization: Server virtualization is a process in which multiple operating systems and applications run on a single server, so that the resources of that single host are divided among all the operating systems and applications. It improves resource utilization and cuts costs by requiring only one server in place of many.
Application Virtualization: Application virtualization enables the end user to access an application from a remotely located server. The application is not installed on every user desktop but is available on demand, which makes it very cost efficient for organizations.
Desktop Virtualization: This is similar to server virtualization. In desktop virtualization, the workstation with all its applications is virtualized. The hypervisor holds the customization and preferences of each application. Because it runs on a centralized mechanism, the user can access the desktop from any location. It provides higher efficiency and cost reduction.
Storage Virtualization: Storage virtualization is a technique in which storage from multiple hardware devices is combined to act as a single storage device. Here, the storage space provided by SAN and NAS is virtualized along with the disk files of the virtual machines. It thus supports disaster recovery management through replication.
Network Virtualization: Network virtualization is a concept in which all the networks from different network devices are combined into a single network, called a virtual network, so that its bandwidth can be divided into channels assigned to servers or devices.
Hadoop MapReduce is used to compute huge amounts of complex data within a tolerable elapsed time, because all the processes performed to analyze the data run in parallel on different nodes. Although Hadoop provides many functionalities, managing and provisioning resources for each incoming request becomes a difficult task. So, alongside the built-in features of Hadoop, there is a need to virtualize it. Virtualizing the Hadoop cluster adds new features to it:
• It brings elasticity, as the cluster can be expanded or reduced by adding or removing nodes on demand. The whole process is very fast.
• A physical cluster can be shared between multiple virtual clusters, so the physical cluster is reused, which improves resource utilization.
• The roles of task trackers and data nodes can be separated onto different machines to achieve higher security, as each has its own access authorization.
• After virtualization of the physical cluster, cloning from a single image (e.g., cloning of a data node) can be performed, which reduces cost and enhances performance.
When their functioning was analyzed, a majority of the known schedulers, including LATE and FCFS, failed to perform well. In fact, they were found unable to utilize resources in a balanced way. Moreover, these schedulers neglect the workload of a job, thereby causing imbalances in resource utilization. In order to enhance Hadoop performance, new methods for job scheduling, resource allocation, and utilization are proposed in this paper. Amazon EC2 nodes are used, wherein one particular node is designated as the master node, while the other nodes are designated as slave nodes. In the proposed HDFS cluster, each master node is populated with many aggregator nodes, while on the slave nodes the map, shuffle, and reduce functions are assigned as a three-phase process. The sketch below illustrates this job flow.
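A minimal sketch of the proposed three-tier flow follows: the name node tracks aggregator nodes, and each aggregator assigns jobs to its data nodes. The class names and the least-loaded/round-robin policies are illustrative assumptions for exposition, not the actual implementation.

    # Illustrative sketch of the three-tier job flow described above.
    # All names and policies here are assumptions for illustration only.
    class DataNode:
        def __init__(self, node_id):
            self.node_id = node_id
            self.jobs = []

        def run(self, job):
            # Map, shuffle, and reduce would execute here in three phases.
            self.jobs.append(job)

    class AggregatorNode:
        def __init__(self, data_nodes):
            self.data_nodes = data_nodes

        def assign(self, job):
            # Pick the least-loaded data node (placeholder policy).
            target = min(self.data_nodes, key=lambda d: len(d.jobs))
            target.run(job)
            return target.node_id

    class NameNode:
        def __init__(self, aggregators):
            self.aggregators = aggregators  # the name node tracks aggregators

        def submit(self, job):
            # Spread jobs over aggregators (placeholder policy).
            agg = self.aggregators[hash(job) % len(self.aggregators)]
            return agg.assign(job)

    cluster = NameNode([AggregatorNode([DataNode(i) for i in range(3)]),
                        AggregatorNode([DataNode(i) for i in range(3, 6)])])
    print(cluster.submit("job-1"))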

The total cost of an allocation $I$ over the virtual machines is given by

$\mathrm{TotalCost}(I) = \sum_{i=1}^{n} vm_i \cdot UCost_i$
In the model, each task must be allocated the virtual machine best suited to perform it. The ants use the optimization target to identify an optimal matching scheme. The ants search for the optimal solution in parallel, communicating and passing information to each other. Here, the pheromone deposited on a task-to-virtual-machine path is adjusted, which influences the path choices of the other ants and prepares the next iteration [20].
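As one way to picture this matching step, the sketch below applies the standard ACO transition rule, $p_{ij} \propto \tau_{ij}^{\alpha} \cdot \eta_{ij}^{\beta}$, to pick a VM for a task; the paper's exact variant, heuristic, and parameter values may differ, so treat the names and defaults here as assumptions.

    # Standard ACO transition rule (assumed): an ant picks VM j for task i
    # with probability proportional to tau_ij^alpha * eta_ij^beta.
    import random

    def choose_vm(task, tau, eta, alpha=1.0, beta=2.0):
        """tau: pheromone per (task, VM); eta: heuristic, e.g. 1/expected time."""
        weights = [(tau[task][j] ** alpha) * (eta[task][j] ** beta)
                   for j in range(len(tau[task]))]
        # Roulette-wheel selection over the weights.
        r, acc = random.uniform(0, sum(weights)), 0.0
        for j, w in enumerate(weights):
            acc += w
            if r <= acc:
                return j
        return len(weights) - 1

    tau = [[1.0, 1.0, 1.0]]    # pheromone for task 0 over 3 VMs
    eta = [[0.5, 1.0, 0.25]]   # heuristic desirability (illustrative)
    print(choose_vm(0, tau, eta))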

$\tau_{ij}(t+1) = (1-\rho)\,\tau_{ij}(t) + \Delta\tau_{ij}(t)$

$\Delta\tau_{ij}(t) = \begin{cases} Q/L_k, & \text{if } (i,j) \in T_k \\ 0, & \text{otherwise} \end{cases}$

6. If the current number of iterations is less than the limit, go back to Step 2. Otherwise, stop the iteration and return the best solution.
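The pheromone update above can be sketched directly: evaporate all trails by $\rho$, then let each ant $k$ deposit $Q/L_k$ on the edges of its tour $T_k$. The values of $\rho$ and $Q$ below are illustrative, not the paper's tuned settings.

    # Sketch of the pheromone update equations above (illustrative values).
    def update_pheromone(tau, tours, lengths, rho=0.5, Q=100.0):
        # Evaporation: tau_ij <- (1 - rho) * tau_ij
        for i in range(len(tau)):
            for j in range(len(tau[i])):
                tau[i][j] *= (1.0 - rho)
        # Deposit: each ant k adds Q / L_k along the edges of its tour T_k.
        for tour, L in zip(tours, lengths):
            for (i, j) in tour:
                tau[i][j] += Q / L
        return tau

    tau = [[1.0, 1.0], [1.0, 1.0]]
    tours = [[(0, 1), (1, 0)]]   # ant 0 used edges (0,1) and (1,0)
    lengths = [20.0]             # tour length L_0
    print(update_pheromone(tau, tours, lengths))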

Artificial Neural Network for Node Usage Prediction
The suggested ANN operates on the workload of the node and the aggregator node. It accepts a set of input variables and determines the weights that map those input variables to the output variable. This procedure can be described as an input stage, an activation stage, and an output stage [19].
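A minimal sketch of these input, activation, and output stages follows, assuming a tiny fixed-weight feed-forward network with sigmoid activations; the features, layer sizes, and weights are illustrative assumptions, not the trained model from the paper.

    # Minimal forward pass for node-usage prediction (illustrative only).
    import math

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    def predict_usage(features, w_hidden, w_out):
        """features: input variables (e.g. current load, job size);
        returns a predicted node-usage value in (0, 1)."""
        hidden = [sigmoid(sum(w * x for w, x in zip(row, features)))
                  for row in w_hidden]                      # activation stage
        return sigmoid(sum(w * h for w, h in zip(w_out, hidden)))  # output

    w_hidden = [[0.4, -0.2], [0.1, 0.7]]   # 2 inputs -> 2 hidden units
    w_out = [0.6, -0.3]                    # 2 hidden -> 1 output
    print(predict_usage([0.8, 0.5], w_hidden, w_out))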

Results and Experiments
A simulation platform is applied to implement the experiment with 2 datacenters and 50-250 tasks. Task lengths range from 5,000 MI (Million Instructions) to 100,000 MI. The cloud simulator is configured with the parameters shown in Table 1.

Table 1: Simulation parameters
Datacenters: 2
Number of tasks: 50-250
Task length: 5,000-100,000 MI
Number of virtual machines: 10
Cost per VM per unit: 1-100

The algorithm parameters were then tuned: before selecting the final set, the efficiency of 10 different groups of the parameters α, β, and r was evaluated and compared. The makespan is analyzed on all categories of the synthetic dataset for batch sizes of 100 to 1000, in steps of 100. Fig. 6 shows the makespan analysis on the right-skewed dataset. It is evident that for most batch sizes, IACOANN showed a better makespan.
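For reference, the makespan metric used above can be computed as the finish time of the busiest VM. The sketch below assumes task lengths in MI and VM speeds in MIPS with sequential execution per VM; the values are illustrative, not the experimental data.

    # Makespan of a task-to-VM assignment (illustrative values).
    def makespan(assignments, task_mi, vm_mips):
        """assignments[i] = VM index for task i; returns max VM finish time."""
        finish = [0.0] * len(vm_mips)
        for task, vm in enumerate(assignments):
            finish[vm] += task_mi[task] / vm_mips[vm]
        return max(finish)

    task_mi = [5000, 20000, 100000]   # task sizes, Million Instructions
    vm_mips = [1000, 2000]            # VM speeds, MIPS
    print(makespan([0, 1, 1], task_mi, vm_mips))  # 60.0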

Conclusion
The task scheduling problem in cloud computing requires the efficient mapping of jobs to virtual resources. Due to the heterogeneity of jobs and resources, many possible mappings can be defined. Heuristic and metaheuristic schedulers are utilized to map independent jobs, and metaheuristics have the potential to explore the huge search space of possible solutions. The proposed IACOANN has been presented in this research to improve makespan and resource utilization.