This article explores various scheduling algorithms employed in Hadoop MapReduce for optimizing cluster performance. We delve into six distinct categories of schedulers: Deadline-aware, Data Locality-aware, Cost-aware, Resource-aware, Makespan-aware, and Learning-aware. Each category addresses specific challenges inherent in large-scale data processing. The effectiveness of each approach is evaluated by comparing its performance against established solutions and benchmarks.
Deadline-Aware Schedulers
These schedulers prioritize job completion within specified deadlines. They are crucial for time-sensitive applications in big data platforms. We further categorize them into schedulers designed for homogeneous and heterogeneous Hadoop clusters.
Deadline-Aware Schedulers in Homogeneous Clusters
Several algorithms aim to minimize deadline misses in homogeneous environments. DAPS, for instance, formulates scheduling as an online optimization problem, employing a preemptive resource allocation strategy. RDS leverages resource prediction and job completion time estimation for resource assignment. DamRT guarantees deadlines while considering data locality, prioritizing tasks based on urgency values calculated from estimated response times and slack time. MinEDF-WC extends the Earliest Deadline First (EDF) policy, allocating the minimal resources necessary to meet deadlines and efficiently utilizing spare slots. EDF/TD and EDF/MR minimize tardiness and miss rates, respectively, focusing on task laxity and schedulability. Other approaches employ schedulability tests, paused rate monotonic (PRM) scheduling, and effective sequence (ES) identification for deadline adherence.
Fig 1: Table summarizing Deadline-aware schedulers in homogeneous clusters and their properties.
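The EDF-style policies above share a common core: order jobs by deadline and check each job's slack (deadline minus elapsed time minus estimated runtime) before committing resources. A minimal sketch of that core idea, not any one scheduler's actual implementation, assuming hypothetical job records with `name`, `deadline`, and `est_runtime` fields:

```python
def edf_order(jobs, now=0.0):
    """Order jobs by Earliest Deadline First and flag whether each job's
    slack (deadline - start time - estimated runtime) is still non-negative,
    i.e. whether the deadline is expected to be met."""
    ordered = sorted(jobs, key=lambda j: j["deadline"])
    schedule, t = [], now
    for job in ordered:
        slack = job["deadline"] - t - job["est_runtime"]
        schedule.append((job["name"], slack >= 0))
        t += job["est_runtime"]  # jobs run back-to-back in this sketch
    return schedule
```

A real deadline-aware scheduler would additionally preempt or reclaim slots (as MinEDF-WC does), but the deadline ordering and slack test are the shared building blocks.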
Deadline-Aware Schedulers in Heterogeneous Clusters
In heterogeneous clusters, schedulers must account for varying node capabilities. Cost-efficient approaches focus on selecting appropriate Virtual Machines (VMs) and distributing workload to minimize expenses while meeting deadlines. Strategies include dynamic capacity management with deadline-driven policies, data locality-aware task re-allocation using minimum cost maximum-flow (MCMF) algorithms, and bipartite graph modeling for optimal task-resource mapping. ARIA utilizes job profiling and Lagrange’s method for resource estimation and deadline-based job ordering. Adaptive schedulers dynamically adjust resource allocation based on pending work and hardware heterogeneity, considering specialized accelerators like GPUs.
Fig 2: Table summarizing Deadline-aware schedulers in heterogeneous clusters and their properties.
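The cost-efficient strategies above reduce to a search over VM type and count: find the cheapest configuration whose aggregate throughput finishes the work before the deadline. A simplified sketch under assumed inputs (per-VM throughput in abstract work units per hour, hourly billing rounded up), not a reproduction of any cited scheduler:

```python
import math

def cheapest_vm_plan(work_units, deadline_h, vm_types, max_vms=32):
    """Exhaustively search (vm_type, count) pairs and return the cheapest
    plan (cost, vm_name, count) that completes `work_units` of map/reduce
    work within `deadline_h` hours, assuming hourly billing."""
    best = None
    for vm in vm_types:
        for n in range(1, max_vms + 1):
            runtime_h = work_units / (vm["units_per_hour"] * n)
            if runtime_h > deadline_h:
                continue  # this configuration misses the deadline
            cost = math.ceil(runtime_h) * vm["price_per_hour"] * n
            if best is None or cost < best[0]:
                best = (cost, vm["name"], n)
    return best
```

Production schedulers replace the brute-force loop with heuristics or MCMF/bipartite-matching formulations, but the feasibility test (runtime within deadline) and the objective (minimum billed cost) are the same.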
Data Locality-Aware Schedulers
These schedulers aim to minimize data transfer over the network by assigning tasks to nodes where the data resides. This optimization significantly improves Hadoop performance by reducing network congestion and task execution times. KNN-based schedulers use speculative prefetching and clustering of intermediate map outputs. Other strategies involve prioritizing nodes with low data localization saturation, employing bipartite graph modeling for optimal task placement, and using dynamic priority and localization ID techniques. MatchMaking, Maestro, and delay scheduling algorithms further enhance data locality by considering task and replica locations, node statuses, and maximum delay times.
Fig 3: Table summarizing Data Locality-aware schedulers and their properties.
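Delay scheduling, mentioned above, can be sketched compactly: when a node frees up, a job without local data on that node is skipped for a bounded number of rounds before the scheduler falls back to a non-local launch. The field names below (`local_nodes`, the `skip_counts` bookkeeping dict) are illustrative, not Hadoop's actual API:

```python
def delay_schedule(queue, free_node, skip_counts, max_skips=3):
    """Offer `free_node` to jobs in queue order. Launch data-local when
    possible; otherwise let a job wait up to `max_skips` rounds before
    accepting a non-local launch. Returns (job_name, placement) or None."""
    for job in queue:
        name = job["name"]
        if free_node in job["local_nodes"]:
            skip_counts[name] = 0  # reset: the job got a local slot
            return name, "local"
        if skip_counts.get(name, 0) >= max_skips:
            return name, "non-local"  # waited long enough; run anyway
        skip_counts[name] = skip_counts.get(name, 0) + 1
    return None
```

The bounded skip count is what distinguishes this from strict locality: short waits recover most locality without starving jobs whose data happens to sit on busy nodes.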
Cost-Aware Schedulers
Minimizing power consumption and operational costs is paramount in large Hadoop clusters. Cost-aware schedulers tackle this challenge by optimizing resource allocation based on various factors. Approaches include Real Coded Genetic Algorithms (RCGA) for node allocation, considering load, makespan, and memory constraints. Heuristic strategies like CETSS focus on minimizing financial cost while meeting deadlines, employing dynamic task renting and sub-deadline relaxation techniques. Other algorithms optimize resource utilization based on earliest finish time, base unit execution time estimation, and workflow-level resource management. ChEsS minimizes budget and optimizes execution time across multiple heterogeneous clusters. Cura leverages MapReduce profiling for efficient resource multiplexing and cost-aware provisioning.
Fig 4: Table summarizing Cost-aware schedulers and their properties.
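Several of the cost-aware algorithms above build on an earliest-finish-time rule over heterogeneous resources: each task goes to the machine where it would complete soonest, given current load and machine speed. A minimal greedy sketch of that rule (relative speeds and runtimes are hypothetical inputs):

```python
def earliest_finish_assign(task_runtimes, machine_speeds):
    """Greedy earliest-finish-time assignment: place each task on the
    machine where it finishes soonest, accounting for that machine's
    queued load and relative speed. Returns (assignment, final loads)."""
    loads = [0.0] * len(machine_speeds)
    plan = []
    for rt in task_runtimes:
        finishes = [loads[m] + rt / machine_speeds[m]
                    for m in range(len(loads))]
        m = finishes.index(min(finishes))  # machine with earliest finish
        loads[m] = finishes[m]
        plan.append(m)
    return plan, loads
```

Cost-aware variants weight the finish-time objective by per-machine price, which this sketch omits for brevity.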
Resource-Aware Schedulers
Efficient resource utilization is crucial for maximizing cluster throughput. Resource-aware schedulers strive to improve resource allocation and prevent bottlenecks. Dynamic Performance Heuristic-based Bin Packing (DP-HBP) optimizes resource allocation in heterogeneous virtualized environments. ACO-BP transforms task scheduling into a 2-Dimensional bin packing problem solved using Ant Colony Optimization. PRISM employs phase-level scheduling, assigning resources based on task phases and their utility values. COSHH categorizes jobs based on requirements and resource characteristics using k-means clustering. Adaptive schedulers like RAS introduce the “job slot” concept for efficient resource management. Other frameworks focus on resource bottleneck detection and mitigation, size-based scheduling with aging functions, and cost-based resource provisioning models.
Fig 5: Table summarizing Resource-aware schedulers and their properties.
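The bin-packing framing used by DP-HBP and ACO-BP treats each task as a two-dimensional (CPU, memory) demand to be packed onto nodes. As a rough stand-in for those heuristics, here is plain first-fit decreasing on 2-D demands, with hypothetical capacity tuples rather than real YARN container specs:

```python
def first_fit_decreasing(tasks, node_cap):
    """Pack (cpu, mem) task demands onto identical nodes using first-fit
    decreasing. Returns (per-task node index, number of nodes opened)."""
    nodes = []  # remaining (cpu, mem) capacity of each opened node
    # Sort tasks by their dominant resource demand, largest first
    order = sorted(range(len(tasks)),
                   key=lambda i: max(tasks[i]), reverse=True)
    placement = [None] * len(tasks)
    for i in order:
        cpu, mem = tasks[i]
        for n, (rc, rm) in enumerate(nodes):
            if cpu <= rc and mem <= rm:  # fits in an already-open node
                nodes[n] = (rc - cpu, rm - mem)
                placement[i] = n
                break
        else:  # no open node fits: open a new one
            nodes.append((node_cap[0] - cpu, node_cap[1] - mem))
            placement[i] = len(nodes) - 1
    return placement, len(nodes)
```

ACO-BP searches the packing space with ant colony optimization instead of a fixed greedy order, but the feasibility constraint per node is the same.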
Makespan-Aware Schedulers
Minimizing makespan, the total completion time of a set of jobs, is essential for optimizing cluster performance. Virtual Job Schedulers (VJS) employ partitioning algorithms like TLSP and PRED to optimize reducer busy time and overall reducer phase time. Secure and performance-aware optimization frameworks like SPO utilize the HEFT algorithm and consider network traffic during shuffling. TMaR minimizes makespan in heterogeneous environments by optimizing task placement and reducing shuffling traffic. Approximation algorithms address makespan minimization with varying server speeds, considering both preemptive and non-preemptive reduce tasks. BalancedPools divides jobs into pools of equal makespan and applies Johnson’s algorithm for optimal job ordering. TuMM dynamically adjusts slot configurations based on workload predictions. Other strategies involve joint scheduling of overlapping phases and prioritizing jobs based on I/O and computing intensity.
Fig 6: Table summarizing Makespan-aware schedulers and their properties.
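Johnson's algorithm, which BalancedPools applies for job ordering, has a compact two-stage form when each job is modeled as a (map time, reduce time) pair: jobs whose map stage is no longer than their reduce stage go first in ascending map time; the rest go last in descending reduce time. A sketch under that idealized two-stage model (the tuple format is an assumption, not BalancedPools' data structure):

```python
def johnsons_rule(jobs):
    """Johnson's two-machine rule applied to (name, map_t, reduce_t) jobs.
    Minimizes makespan for the idealized two-stage map->reduce pipeline."""
    # Map-bound jobs run first, shortest map stage earliest
    front = sorted((j for j in jobs if j[1] <= j[2]), key=lambda j: j[1])
    # Reduce-bound jobs run last, shortest reduce stage latest
    back = sorted((j for j in jobs if j[1] > j[2]),
                  key=lambda j: j[2], reverse=True)
    return [j[0] for j in front + back]
```

Real MapReduce pipelines overlap the shuffle with map execution, which the two-machine model ignores; BalancedPools compensates by first partitioning jobs into pools of comparable makespan.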
Learning-Aware Schedulers
These schedulers leverage machine learning techniques to adapt to dynamic workload characteristics and optimize scheduling decisions. CLQLMRS uses Q-learning to improve data and cache locality, reducing execution time. Hybrid schedulers employing reinforcement learning identify straggler tasks and schedule them on fast processing nodes, considering data locality. MapReduce Reinforcement Learning (MRRL) schedulers observe system state and suggest speculative re-execution of slower tasks for faster completion. These adaptive approaches learn from past executions and continuously improve scheduling efficiency.
Fig 7: Table summarizing Learning-aware schedulers and their properties.
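The Q-learning at the heart of schedulers like CLQLMRS reduces to the standard tabular update: nudge the value of a (state, action) pair toward the observed reward plus the discounted best value of the next state. A minimal sketch with hypothetical states and actions (e.g. deciding between a local and a remote task launch), not CLQLMRS's actual state space:

```python
def q_update(Q, state, action, reward, next_state,
             alpha=0.5, gamma=0.9, actions=("local", "remote")):
    """One tabular Q-learning step on a dict-backed Q-table:
    Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q.get((next_state, a), 0.0) for a in actions)
    old = Q.get((state, action), 0.0)
    Q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return Q[(state, action)]
```

Repeated over many scheduling rounds, with rewards derived from observed task completion times, this update is what lets such schedulers adapt placement policy to the cluster's actual behavior rather than a static model.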
This comprehensive analysis of various scheduling algorithms in Hadoop MapReduce provides insights into the diverse approaches for optimizing cluster performance based on specific needs and challenges. Each category offers unique advantages and contributes to the efficient processing of large-scale datasets in Hadoop.