Turkish Journal of Electrical Engineering and Computer Sciences




High-impact scientific applications processed in distributed data centers often involve big data. To avoid the intolerable delays caused by moving huge volumes of data across data centers during processing, the concept of moving tasks to the data was introduced in the last decade. Even after this concept, termed data locality, was realized, the expected quality of service was not achieved. Data colocality was introduced later: data groupings are identified and data chunks are then placed accordingly. However, the data traffic expected at run time is generally not considered when placing data. The history of data movements is useful for estimating this expected traffic. This work uses that history and proposes an approach that intelligently selects the nodes on which data groups are placed, so that data movement is kept as low as possible. Log files are analyzed systematically, and a gain matrix is generated from maximum likelihood estimates of data movements. Formally, the gain matrix is inversely proportional to the expected data traffic inside the data center; it reflects the performance gain obtained by assigning a block to the node with the lowest expected future data movements. To identify the optimal placement, an algorithm based on the many-to-one assignment problem is presented. Experimental analysis shows that the proposed approach significantly reduces data movement and considerably improves performance.
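The idea of a log-derived gain matrix feeding a many-to-one assignment can be illustrated with a small sketch. It is not the algorithm of the paper, whose details are not given in this abstract. The log format (pairs of data group and destination node), the function names `gain_matrix` and `best_placement`, the per-node `capacity` limit, and the use of relative frequencies as the gain are all assumptions made for illustration. The relative frequency is the maximum likelihood estimate of the probability that a group's data is needed at a node, so a higher value stands in for lower expected traffic if the group is placed there.

```python
from collections import Counter
from itertools import product


def gain_matrix(log, groups, nodes):
    """Hypothetical gain: relative frequency with which group g was moved to node n.

    `log` is an assumed format: a list of (group, destination_node) pairs.
    """
    counts = Counter(log)                  # occurrences of each (group, node) pair
    totals = Counter(g for g, _ in log)    # logged movements per group
    return {
        g: {n: (counts[(g, n)] / totals[g] if totals[g] else 0.0) for n in nodes}
        for g in groups
    }


def best_placement(gain, groups, nodes, capacity):
    """Exhaustive many-to-one assignment: several groups may share a node,
    up to capacity[n] groups per node. Feasible only for tiny instances."""
    best, best_score = None, float("-inf")
    for choice in product(nodes, repeat=len(groups)):
        load = Counter(choice)
        if any(load[n] > capacity[n] for n in nodes):
            continue  # skip placements that overload a node
        score = sum(gain[g][n] for g, n in zip(groups, choice))
        if score > best_score:
            best, best_score = dict(zip(groups, choice)), score
    return best, best_score


if __name__ == "__main__":
    # Made-up movement log, for illustration only.
    log = [("g1", "A"), ("g1", "A"), ("g1", "B"),
           ("g2", "A"), ("g2", "A"), ("g2", "A"),
           ("g3", "B"), ("g3", "A")]
    groups, nodes = ["g1", "g2", "g3"], ["A", "B"]
    placement, score = best_placement(
        gain_matrix(log, groups, nodes), groups, nodes, {"A": 2, "B": 1}
    )
    print(placement, round(score, 4))
```

The exhaustive search grows exponentially with the number of groups and is used here only because it is easy to verify. For realistic sizes, one option would be a polynomial-time assignment solver such as `scipy.optimize.linear_sum_assignment`, with each node's column replicated once per capacity slot.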


Keywords: Data management, data organization, cloud computing, parallel and distributed systems
