Hadoop framework implementation and performance analysis on a cloud


Abstract: The Hadoop framework uses the MapReduce programming paradigm to process big data by distributing data across a cluster and aggregating. MapReduce is one of the methods used to process big data hosted on large clusters. In this method, jobs are processed by dividing into small pieces and distributing over nodes. Parameters such as distributing method over nodes, the number of jobs held in a parallel fashion, and the number of nodes in the cluster affect the execution time of jobs. The aim of this paper is to determine how the numbers of nodes, maps, and reduces affect the performance of the Hadoop framework in a cloud environment. For this purpose, tests were carried out on a Hadoop cluster with 10 nodes hosted in a cloud environment by running PiEstimator, Grep, Teragen, and Terasort benchmarking tools on it. These benchmarking tools available under the Hadoop framework are classified as CPU-intensive and CPU-light applications as a result of tests. In CPU-light applications, increasing the numbers of nodes, maps, and reduces does not improve the efficiency of these applications; they even cause an increase in time spent on jobs by using system resources unnecessarily. Therefore, in CPU-light applications, selecting the numbers of nodes, maps, and reduces as minimum is found as the optimization of time spent on a process. In CPU-intensive applications, according to the phase that small job pieces is processed, it is found that selecting the number of maps or reduces equal to total number of CPUs on a cluster is the optimization of time spent on a process.

Keywords: Big data, Hadoop, cloud computing, MapReduce, KVM virtualization environment, benchmarking tools, Ganglia

Full Text: PDF