TR-02-5.pdf

Adaptive and Virtual Reconfigurations for Dynamic Job Scheduling in Clusters

Songqing Chen, Li Xiao, and Xiaodong Zhang 
 
Proceedings of the 22nd International Conference on Distributed Computing
Systems (ICDCS'2002), Vienna, Austria, July 2-5, 2002.

Abstract

In a cluster system with dynamic load sharing support, a job submission
or migration to a workstation is determined by the availability of CPU
and memory resources of the workstation at the time. In such a
system, a small number of running jobs with unexpectedly large memory
allocation requirements may significantly increase the queuing delays
of the remaining jobs with normal memory requirements, slowing down
the execution of individual jobs and decreasing system throughput.
We call this phenomenon the job blocking problem
because the big jobs block the execution of the majority of jobs in the
cluster. Since the memory demand of jobs may not be known in advance and
may change dynamically, the likelihood that unsuitable job
submissions/migrations cause the blocking problem is high, and
existing load sharing schemes are unable to handle this
problem effectively. We propose a software method, incorporated into dynamic load
sharing, which adaptively reserves a small set of workstations through
virtual cluster reconfiguration to provide special services to the jobs
demanding large memory allocations. This policy implicitly applies the
shortest-remaining-processing-time principle. As soon as the blocking problem is
resolved by the reconfiguration, the system will adaptively switch back
to the normal load sharing state. We present three contributions in this
study. (1) We quantitatively present the conditions that cause the job
blocking problem. (2) We present the adaptive software method
in a dynamic load sharing system and show that the adaptive process
causes little additional overhead. (3) Using trace-driven simulations, we
show that our method can effectively improve the cluster computing
performance by quickly resolving the job blocking problem. The
effectiveness and performance insights are also analytically verified.
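
The adaptive reservation idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names Workstation, place, and reconfigure, the big_threshold and block_limit parameters, the use of a queued-job count as the blocking signal, and the choice to reserve the workstation with the most free memory are all assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Workstation:
    name: str
    mem_total: int                 # user memory in MB (assumed unit)
    cpu_slots: int                 # max concurrent jobs (assumed CPU threshold)
    jobs: list = field(default_factory=list)  # memory demands of running jobs
    reserved: bool = False         # True while set aside for large-memory jobs

    def mem_free(self):
        return self.mem_total - sum(self.jobs)

    def can_accept(self, mem_demand):
        # Admission depends on both CPU and memory availability.
        return len(self.jobs) < self.cpu_slots and mem_demand <= self.mem_free()


def place(job_mem, cluster, big_threshold):
    """Place a job on the first suitable workstation, or return None to queue it."""
    is_big = job_mem >= big_threshold
    any_reserved = any(ws.reserved for ws in cluster)
    for ws in cluster:
        # While a reservation is active (assumed rule): big jobs use only
        # reserved workstations, normal jobs use only unreserved ones.
        if any_reserved and ws.reserved != is_big:
            continue
        if ws.can_accept(job_mem):
            ws.jobs.append(job_mem)
            return ws
    return None


def reconfigure(cluster, queued_normal_jobs, block_limit, n_reserve=1):
    """Reserve workstations while blocking is detected; release them afterwards."""
    blocked = queued_normal_jobs > block_limit   # hypothetical blocking signal
    if blocked and not any(ws.reserved for ws in cluster):
        # Assumption: reserve the workstations with the most free memory.
        by_free = sorted(cluster, key=Workstation.mem_free, reverse=True)
        for ws in by_free[:n_reserve]:
            ws.reserved = True
    elif not blocked:
        # Blocking resolved: switch back to normal load sharing.
        for ws in cluster:
            ws.reserved = False
```

In this sketch the reconfiguration is virtual in the sense that only a flag changes; no job or workstation is physically moved when the reservation is turned on or off.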