gpu-cpu-mutex - Zion Boggan

Why not an existing tool

I looked at the obvious options before writing this. NVIDIA MPS and MIG both assume you want to co-locate work on a GPU, either by sharing it spatially (MPS) or by partitioning it (MIG). I wanted strict serialization: one job in VRAM at a time, full stop. MIG doesn't even run on consumer cards. Slurm works, but standing up slurmctld + slurmd for two tmux loops is wildly heavy. task-spooler and parallel --sem model a single submission queue rather than a transparent inline wrapper that any script can call.

So this is ~120 lines of bash. flock(2) already does the hard part: kernel advisory locks that release automatically on exit, crash, or kill -9, because the kernel drops the lock as soon as the last descriptor for the locked file is closed. There is no stale-lock problem, no PID file to clean up, no daemon to restart.
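To make the kill -9 claim concrete, here is a small sketch that is not part of the project. It assumes the util-linux flock utility is installed and uses a throwaway temp file as the lock. A holder takes the lock, gets SIGKILLed, and a non-blocking probe checks the lock before and after.

```shell
# Throwaway lock file for the demo (not the path gpu_run.sh uses).
lock="$(mktemp)"

# Holder: open the lock on fd 9, lock it, then become a long sleep.
# Because of the exec, $! below is the PID of the process holding fd 9.
(
  exec 9>"$lock"
  flock 9
  exec sleep 300
) &
holder=$!
sleep 1   # crude wait for the holder to acquire the lock

# Non-blocking probe while the holder is alive.
if flock -n "$lock" true; then before="free"; else before="held"; fi

# Kill the holder without giving it any chance to clean up.
kill -9 "$holder"
wait "$holder" 2>/dev/null || true

# Probe again: the kernel closed fd 9, which dropped the lock.
if flock -n "$lock" true; then after="free"; else after="held"; fi

echo "before kill: $before, after kill: $after"
rm -f "$lock"
```

On a Linux box this should report the lock as held before the kill and free after it, with no cleanup code involved.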

The GPU mutex

gpu_run.sh wraps any command with an exclusive flock on a shared lock file. The second caller blocks until the first releases. Two details matter in practice:

Reentrancy. A wrapped command often calls other wrapped commands, like gpu_run.sh bash gen.sh where gen.sh itself dispatches GPU work. Without a guard, that child blocks on a lock its ancestor already holds and deadlocks. The script solves this by exporting GPU_LOCK_HELD=1 when it acquires the lock; nested calls see that and skip re-locking, so the outer process owns the lock for the entire tree.

An escape hatch. Not everything that runs on the GPU box touches the GPU (build steps, upload jobs). Setting PSRUN_SKIP_GPU_LOCK=1 bypasses the mutex so those jobs don't queue behind a running model.

gpu_run.sh: the whole GPU mutex (kernel flock + reentrancy guard)
LOCK="${GPU_LOCK_PATH:-/tmp/gpu.lock}"

# Reentrancy: if an ancestor already holds the lock, skip re-acquiring.
# Without this, a wrapped command that internally calls another wrapped
# command deadlocks waiting on a lock its own parent holds.
if [ -n "${GPU_LOCK_HELD:-}" ]; then
  exec "$@"
fi

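# Open (or create) the lock file on fd 9. The lock belongs to this open
# file, so it is released once the script and its children have exited.
exec 9>"$LOCK"
echo "[gpu_run] waiting for GPU lock..." >&2
flock 9   # blocks until no other process holds the lock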
echo "[gpu_run] acquired GPU lock -> running: $*" >&2
export GPU_LOCK_HELD=1   # propagate to children so they skip self-locking
"$@"
rc=$?
echo "[gpu_run] done (rc=$rc), releasing GPU lock" >&2
exit $rc
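
The listing above does not show the PSRUN_SKIP_GPU_LOCK check, so the following is only a sketch of how the bypass might sit next to the reentrancy guard. The placement and the exact test are my assumptions, not the script's actual code. It writes a trimmed copy of the wrapper to a temp directory, points GPU_LOCK_PATH at a scratch lock file, and exercises both paths: a nested call that must not deadlock, and a bypassed call that must not wait on a busy lock.

```shell
dir="$(mktemp -d)"
script="$dir/gpu_run.sh"
export GPU_LOCK_PATH="$dir/gpu.lock"   # scratch lock, not /tmp/gpu.lock

# Trimmed copy of the wrapper with a hypothetical escape-hatch check added.
cat > "$script" <<'EOF'
LOCK="${GPU_LOCK_PATH:-/tmp/gpu.lock}"
# Escape hatch (assumed placement): non-GPU jobs skip the mutex entirely.
if [ "${PSRUN_SKIP_GPU_LOCK:-}" = "1" ]; then exec "$@"; fi
# Reentrancy guard, as in the listing above.
if [ -n "${GPU_LOCK_HELD:-}" ]; then exec "$@"; fi
exec 9>"$LOCK"
flock 9
export GPU_LOCK_HELD=1
"$@"
EOF

# 1. Nested call: the inner wrapper sees GPU_LOCK_HELD and does not re-lock.
nested="$(sh "$script" sh "$script" echo nested-ok)"

# 2. Bypass: a holder keeps the lock for 3 s; the bypassed job returns at once.
sh "$script" sleep 3 &
holder=$!
sleep 1
bypassed="$(PSRUN_SKIP_GPU_LOCK=1 sh "$script" echo bypassed-ok)"
if kill -0 "$holder" 2>/dev/null; then holder_state="still-running"; else holder_state="gone"; fi
wait "$holder"

echo "$nested / $bypassed / holder $holder_state"
rm -rf "$dir"
```

If the bypass works, the second job finishes while the holder is still running, which is the whole point of the escape hatch.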

The CPU/RAM guard

cpu_run.sh is a counting semaphore: N flock slots rather than one, so heavy local jobs run at most N at a time and the rest queue, instead of everything starting at once.
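
cpu_run.sh itself is not reproduced here, so this is only a sketch of one way to build a counting semaphore out of flock: one lock file per slot, probed with a non-blocking flock. The slot directory, the fd number, and the function name try_acquire_slot are hypothetical.

```shell
CPU_RUN_SLOTS=3
SLOT_DIR="$(mktemp -d)"   # hypothetical location for the slot files

# Probe each slot file once. On success, fd 8 holds that slot's lock
# and the function returns 0; if every slot is busy it returns 1.
try_acquire_slot() {
  i=0
  while [ "$i" -lt "$CPU_RUN_SLOTS" ]; do
    exec 8>"$SLOT_DIR/slot.$i"
    if flock -n 8; then
      return 0
    fi
    exec 8>&-
    i=$((i + 1))
  done
  return 1
}

# Fill all three slots with background jobs that hold them for 2 s.
for n in 1 2 3; do
  ( try_acquire_slot && exec sleep 2 ) &
done
sleep 1

# A fourth job finds no free slot while the first three are running.
if try_acquire_slot; then fourth="acquired"; else fourth="blocked"; fi

# Once the holders exit, their slots are free again.
wait
if try_acquire_slot; then after="acquired"; else after="blocked"; fi
exec 8>&-

echo "fourth job while full: $fourth, after holders exit: $after"
rm -rf "$SLOT_DIR"
```

A real wrapper would retry in a loop with a short sleep rather than probing once; the single probe just keeps the demo short.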

The thread caps are what make it actually useful: the script pins OMP_NUM_THREADS, OPENBLAS_NUM_THREADS, MKL_NUM_THREADS, NUMEXPR_NUM_THREADS, and VECLIB_MAXIMUM_THREADS to CPU_RUN_THREADS before the command runs. With CPU_RUN_SLOTS=3 and CPU_RUN_THREADS=3, guarded jobs use at most 3 x 3 = 9 cores at any time, leaving the rest of the system responsive. Everything runs under nice -n 15 and ionice -c3 so interactive workloads stay fast even when the queue is full.
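
As a rough sketch of the capping step (the function name run_capped is hypothetical, and the ionice fallback is my addition for machines where ionice is missing or not permitted), the environment pinning and priority wrappers could look like this:

```shell
CPU_RUN_SLOTS=3
CPU_RUN_THREADS=3

# Run a command with library thread pools capped and at low CPU/IO priority.
# The body is a subshell, so the exports do not leak into the caller.
run_capped() (
  export OMP_NUM_THREADS="$CPU_RUN_THREADS"
  export OPENBLAS_NUM_THREADS="$CPU_RUN_THREADS"
  export MKL_NUM_THREADS="$CPU_RUN_THREADS"
  export NUMEXPR_NUM_THREADS="$CPU_RUN_THREADS"
  export VECLIB_MAXIMUM_THREADS="$CPU_RUN_THREADS"
  # Assumption: fall back to nice alone if ionice is unavailable.
  if command -v ionice >/dev/null 2>&1 && ionice -c3 true 2>/dev/null; then
    exec nice -n 15 ionice -c3 "$@"
  fi
  exec nice -n 15 "$@"
)

# The child sees the caps; the worst case is slots times threads.
seen="$(run_capped sh -c 'echo "$OMP_NUM_THREADS $MKL_NUM_THREADS"')"
max_cores=$((CPU_RUN_SLOTS * CPU_RUN_THREADS))
echo "child sees: $seen, worst case cores: $max_cores"
```

These variables only cap libraries that honor them (OpenMP, OpenBLAS, MKL, NumExpr, Accelerate); a job that spawns its own worker processes is not limited by this.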

The maxwait escape means a job that has been waiting too long runs anyway, capped and niced, rather than stalling indefinitely. The system degrades gracefully rather than locking up.
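
One way to get that behavior, sketched against a single lock file rather than N slots to keep it short: flock -w gives up after a timeout, and the wrapper runs the command either way. CPU_RUN_MAXWAIT and run_with_maxwait are hypothetical names, and the thread caps and nice/ionice wrapping are left out here.

```shell
lock="$(mktemp)"
CPU_RUN_MAXWAIT=1   # seconds; hypothetical variable name

# Wait up to CPU_RUN_MAXWAIT for the lock, then run the command regardless.
# Reports on stderr whether the job got the lock or overflowed past it.
run_with_maxwait() (
  exec 8>"$lock"
  if flock -w "$CPU_RUN_MAXWAIT" 8; then
    echo "queued" >&2
  else
    echo "overflow" >&2
    exec 8>&-
  fi
  "$@"
)

# A holder keeps the lock busy well past the maxwait.
( exec 8>"$lock"; flock 8; exec sleep 30 ) &
holder=$!
sleep 1

busy="$(run_with_maxwait true 2>&1)"    # lock busy: times out, runs anyway

kill "$holder"
wait "$holder" 2>/dev/null || true

idle="$(run_with_maxwait true 2>&1)"    # lock free: acquires immediately

echo "lock busy: $busy, lock idle: $idle"
rm -f "$lock"
```

The trade-off is deliberate: an overflow job breaks the concurrency limit, but a stuck queue never blocks work forever.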