Using Volcano's Gang Scheduler¶
The Gang scheduling strategy is one of the core scheduling algorithms of the volcano-scheduler. It satisfies the "All or nothing" scheduling requirement during the scheduling process, preventing arbitrary scheduling of Pods that could waste cluster resources. The specific algorithm observes whether the number of scheduled Pods under a Job meets the minimum running quantity. When the Job's minimum running quantity is satisfied, scheduling actions are performed for all Pods under the Job; otherwise, no actions are taken.
Use Cases¶
The Gang scheduling algorithm, based on the concept of a Pod group, is particularly suitable for scenarios that require multi-process collaboration. AI scenarios often involve complex workflows, such as Data Ingestion, Data Analysis, Data Splitting, Training, Serving, and Logging, which require a group of containers to work together. This makes the Gang scheduling strategy based on container groups very appropriate.
In multi-threaded parallel computing communication scenarios under the MPI computation framework, Gang scheduling is also very suitable because it requires master and slave processes to work together. High relevance among containers in a container group may lead to resource contention, and overall scheduling allocation can effectively resolve deadlocks.
In scenarios with insufficient cluster resources, the Gang scheduling strategy significantly improves the utilization of cluster resources. For example, if the cluster can currently accommodate only 2 Pods, but the minimum number of Pods required for scheduling is 3, then all Pods of this Job will remain pending until the cluster can accommodate 3 Pods, at which point the Pods will be scheduled. This effectively prevents the partial scheduling of Pods, which would not meet the requirements and would occupy resources, making other Jobs unable to run.
Concept Explanation¶
The Gang Scheduler is the core scheduling plugin of Volcano, and it is enabled by default upon installing Volcano. When creating a workload, you only need to specify the scheduler name as Volcano.
Volcano schedules based on PodGroups. When creating a workload, there is no need to manually create PodGroup resources; Volcano will automatically create them based on the workload information. Below is an example of a PodGroup:
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
name: test
namespace: default
spec:
minMember: 1
minResources:
cpu: "3"
memory: "2048Mi"
priorityClassName: high-prority
queue: default
Use Case¶
In a multi-threaded parallel computing communication scenario under the MPI computation framework, we need to ensure that all Pods can be successfully scheduled to ensure the task is completed correctly. Setting minAvailable to 4 means that 1 mpimaster and 3 mpiworkers are required to run.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
name: lm-mpi-job
labels:
"volcano.sh/job-type": "MPI"
spec:
minAvailable: 4
schedulerName: volcano
plugins:
ssh: []
svc: []
policies:
- event: PodEvicted
action: RestartJob
tasks:
- replicas: 1
name: mpimaster
policies:
- event: TaskCompleted
action: CompleteJob
template:
spec:
containers:
- command:
- /bin/sh
- -c
- |
MPI_HOST=`cat /etc/volcano/mpiworker.host | tr "\n" ","`;
mkdir -p /var/run/sshd; /usr/sbin/sshd;
mpiexec --allow-run-as-root --host ${MPI_HOST} -np 3 mpi_hello_world;
image: docker.m.daocloud.io/volcanosh/example-mpi:0.0.1
name: mpimaster
ports:
- containerPort: 22
name: mpijob-port
workingDir: /home
resources:
requests:
cpu: "500m"
limits:
cpu: "500m"
restartPolicy: OnFailure
imagePullSecrets:
- name: default-secret
- replicas: 3
name: mpiworker
template:
spec:
containers:
- command:
- /bin/sh
- -c
- |
mkdir -p /var/run/sshd; /usr/sbin/sshd -D;
image: docker.m.daocloud.io/volcanosh/example-mpi:0.0.1
name: mpiworker
ports:
- containerPort: 22
name: mpijob-port
workingDir: /home
resources:
requests:
cpu: "1000m"
limits:
cpu: "1000m"
restartPolicy: OnFailure
imagePullSecrets:
- name: default-secret
Generate the resources for PodGroup:
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
annotations:
creationTimestamp: "2024-05-28T09:18:50Z"
generation: 5
labels:
volcano.sh/job-type: MPI
name: lm-mpi-job-9c571015-37c7-4a1a-9604-eaa2248613f2
namespace: default
ownerReferences:
- apiVersion: batch.volcano.sh/v1alpha1
blockOwnerDeletion: true
controller: true
kind: Job
name: lm-mpi-job
uid: 9c571015-37c7-4a1a-9604-eaa2248613f2
resourceVersion: "25173454"
uid: 7b04632e-7cff-4884-8e9a-035b7649d33b
spec:
minMember: 4
minResources:
count/pods: "4"
cpu: 3500m
limits.cpu: 3500m
pods: "4"
requests.cpu: 3500m
minTaskMember:
mpimaster: 1
mpiworker: 3
queue: default
status:
conditions:
- lastTransitionTime: "2024-05-28T09:19:01Z"
message: '3/4 tasks in gang unschedulable: pod group is not ready, 1 Succeeded,
3 Releasing, 4 minAvailable'
reason: NotEnoughResources
status: "True"
transitionID: f875efa5-0358-4363-9300-06cebc0e7466
type: Unschedulable
- lastTransitionTime: "2024-05-28T09:18:53Z"
reason: tasks in gang are ready to be scheduled
status: "True"
transitionID: 5a7708c8-7d42-4c33-9d97-0581f7c06dab
type: Scheduled
phase: Pending
succeeded: 1
From the PodGroup, it can be seen that it is associated with the workload through ownerReferences and sets the minimum number of running Pods to 4.