28319ff · GPU CPU Mutex

gpu-cpu-mutex: flock-based GPU mutex + CPU/RAM guard

Two small shell tools to share one GPU and a bounded CPU budget across
independent processes using only flock. No daemon, scheduler, or root.

28319ff Zion Boggan committed on Jun 5, 2026

LICENSE +21 -0

		@@ -0,0 +1,21 @@
	+	MIT License
	+
	+	Copyright (c) 2026 Zion Boggan
	+
	+	Permission is hereby granted, free of charge, to any person obtaining a copy
	+	of this software and associated documentation files (the "Software"), to deal
	+	in the Software without restriction, including without limitation the rights
	+	to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
	+	copies of the Software, and to permit persons to whom the Software is
	+	furnished to do so, subject to the following conditions:
	+
	+	The above copyright notice and this permission notice shall be included in all
	+	copies or substantial portions of the Software.
	+
	+	THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
	+	IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
	+	FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
	+	AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
	+	LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
	+	OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
	+	SOFTWARE.

README.md +65 -0

		@@ -0,0 +1,65 @@
	+	# gpu-cpu-mutex
	+
	+	Two tiny shell tools that let multiple independent processes share one GPU and a bounded CPU/RAM budget without colliding - using nothing but `flock`. No daemon, no scheduler, no root, no software to install. If you run two (or more) long-lived agent loops, cron jobs, or terminals on the same box and they all occasionally hit the same GPU or kick off heavy CPU batches, this stops them from stepping on each other.
	+
	+	I built this to coordinate two perpetual AI worker loops that share a single RTX 3060 over SSH. Either loop can fire a GPU job at any time; without coordination they'd both load a model into VRAM at once and OOM/corrupt. With `gpu_run.sh`, the second job simply waits for the first to finish. The CPU guard came after an unthrottled image batch drove the load average to 46 on 12 cores, pushed swap to 98.7%, and froze the box. Now heavy jobs queue through N slots with pinned thread counts.
	+
	+	## Why not just use $EXISTING_THING?
	+
	+	I looked. None of them fit "two unrelated processes, one consumer GPU, share politely":
	+
	+	| Option | Why it didn't fit |
	+	|---|---|
	+	| NVIDIA MPS / MIG | Designed to co-locate work on a GPU (spatial sharing / partitioning). I want the opposite - strict serialization so only one job touches VRAM at a time. MIG isn't even supported on consumer cards. |
	+	| Slurm `--gres=gpu` | A real scheduler, and it works - but standing up `slurmctld`+`slurmd` to serialize two tmux loops on a homelab box is wildly heavy. |
	+	| `CUDA_VISIBLE_DEVICES` gating | Picks which GPU, doesn't stop two jobs from grabbing the same one. |
	+	| `task-spooler` (`ts`), `parallel --sem` | Closer - but they model a single queue you submit into. I wanted a transparent wrapper any script/loop can call inline, that's reentrant and has an escape hatch (see below). |
	+
	+	So this is ~90 lines of bash, comments included. `flock` already does the hard part (kernel-level advisory locks, auto-released on process exit - even on crash/kill). These tools are just the ergonomic shell around it.
	+
	+	## `gpu_run.sh` - the GPU mutex
	+
	+	Serializes every GPU job through one exclusive `flock` on a shared lock file. The second caller blocks until the first releases.
	+
	+	```bash
	+	gpu_run.sh python gen_image.py "a knight" out.png
	+	gpu_run.sh bash gen_music.sh "lo-fi" track.wav
	+	```
	+
	+	Two details that matter in practice:
	+
	+	- Reentrancy. A wrapped command often calls other wrapped commands (`gpu_run.sh bash gen.sh`, where `gen.sh` internally also dispatches GPU work). Naively that self-deadlocks - the child blocks on a lock its own ancestor holds. So when the lock is acquired it exports `GPU_LOCK_HELD=1`; any nested call sees that and skips re-locking. The lock is held for the whole tree and released once at the top.
	+	- An escape hatch. Not everything that runs on the GPU box touches the GPU (a build step, a publish/upload). Those shouldn't queue behind a running model. Set `GPU_SKIP_LOCK=1` (the flag `examples/remote_dispatch.py` checks) or wire your own check so non-GPU work runs immediately instead of waiting.
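
The self-deadlock described above can be reproduced in a few lines. This is a minimal Python sketch and not part of the repo: `try_lock` and `needs_lock` are hypothetical helper names, and a temp file stands in for `/tmp/gpu.lock`. It relies on `flock` treating two separate `open()` calls on the same path as independent lock holders, which is why a nested wrapper would block on its own ancestor unless it checks `GPU_LOCK_HELD` first.

```python
import fcntl
import os
import tempfile


def try_lock(path):
    """Open path and try a non-blocking exclusive flock; return the file or None."""
    f = open(path, "a")
    try:
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        f.close()
        return None
    return f


def needs_lock(env):
    """Hypothetical guard mirroring the GPU_LOCK_HELD check: lock only at the top of the tree."""
    return not env.get("GPU_LOCK_HELD")


lock_path = os.path.join(tempfile.mkdtemp(), "gpu.lock")  # stand-in for /tmp/gpu.lock

outer = try_lock(lock_path)    # the top-level wrapper takes the lock
nested = try_lock(lock_path)   # a naive nested wrapper opens the same file again
print("outer acquired:", outer is not None)
print("naive nested acquired:", nested is not None)  # a blocking flock would hang here

child_env = dict(os.environ, GPU_LOCK_HELD="1")  # what the outer wrapper exports
print("nested needs lock:", needs_lock(child_env))

outer.close()  # closing the descriptor releases the lock
```

The non-blocking attempt is only there so the demonstration returns instead of hanging; the wrappers themselves block, which is exactly the case the environment flag avoids.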
	+
	+	## `cpu_run.sh` - the CPU/RAM guard
	+
	+	A counting semaphore (N flock slots) plus thread caps, so heavy local jobs queue `N`-at-a-time instead of all-at-once, and no single job can grab every core.
	+
	+	```bash
	+	CPU_RUN_SLOTS=3 CPU_RUN_THREADS=3 cpu_run.sh python postproc.py raw.png out/
	+	```
	+
	+	- N slots: at most `CPU_RUN_SLOTS` (default 3) run concurrently across every caller. The rest poll every 2 seconds with non-blocking lock attempts until a slot frees, or until `CPU_RUN_MAXWAIT` (default 3600 seconds) elapses, after which the job runs anyway: still thread-capped and niced, but never stalled indefinitely.
	+	- Thread caps: pins `OMP_NUM_THREADS` / `OPENBLAS_NUM_THREADS` / `MKL_NUM_THREADS` / `NUMEXPR_NUM_THREADS` / `VECLIB_MAXIMUM_THREADS` to `CPU_RUN_THREADS`, so one numpy/torch job can't silently fan out to all cores. `3 slots × 3 threads ≈ 9 cores`, leaving headroom for the rest of the system.
	+	- Politeness: runs under `nice -n 15` + `ionice -c3` so interactive work stays responsive even when saturated.
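
The slot sweep and the thread caps above can be sketched in Python as follows. This is an illustration under assumptions, not the script itself: `acquire_slot` and `thread_cap_env` are hypothetical names, and a temp directory stands in for `/tmp/cpu_run_slots`. It shows one non-blocking pass over the slot files, without the retry loop or the `nice`/`ionice` step.

```python
import fcntl
import os
import tempfile


def acquire_slot(slot_dir, slots):
    """Sweep slot1..slotN once; return (index, open file) for the first free slot, else None."""
    os.makedirs(slot_dir, exist_ok=True)
    for i in range(1, slots + 1):
        f = open(os.path.join(slot_dir, f"slot{i}"), "a")
        try:
            fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return i, f          # keep f open: the open descriptor is the slot
        except BlockingIOError:
            f.close()            # slot busy, try the next one
    return None


def thread_cap_env(threads):
    """Environment overrides that cap common BLAS/OpenMP thread pools."""
    names = ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS",
             "NUMEXPR_NUM_THREADS", "VECLIB_MAXIMUM_THREADS")
    return {name: str(threads) for name in names}


slot_dir = tempfile.mkdtemp()   # stand-in for /tmp/cpu_run_slots
held = [acquire_slot(slot_dir, 3) for _ in range(3)]   # three callers fill every slot
fourth = acquire_slot(slot_dir, 3)                     # a fourth caller finds none free
print("slots held:", [i for i, _ in held])
print("fourth caller got:", fourth)

held[1][1].close()                                     # the job in slot 2 finishes
print("after release:", acquire_slot(slot_dir, 3)[0])
print("worst-case threads:", 3 * int(thread_cap_env(3)["OMP_NUM_THREADS"]))
```

A caller that gets `None` back would sleep and sweep again, which is the role `CPU_RUN_MAXWAIT` bounds in the shell version.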
	+
	+	## How it works (the whole trick)
	+
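	+	`flock(2)` gives you a kernel advisory lock tied to an open file description. The kernel releases it automatically once every descriptor referring to it is closed, which includes the holder exiting, crashing, or being `kill -9`'d. That last part is why this is robust: there's no lock file to "clean up," no stale-lock problem, no PID files. The lock is the live process holding an fd. One caveat follows from the same rule: children inherit the descriptor, so a backgrounded child that outlives the wrapper keeps the lock held until it exits too.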
	+
	+	- Mutex = one exclusive lock on one path (`flock 9` on `/tmp/gpu.lock`).
	+	- Counting semaphore = N lock files; grab any one with `flock -n` (non-blocking); if all N are held, wait and retry.
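
The "no stale lock" property can be checked directly. The sketch below is a hypothetical Python demonstration, not part of the repo: a child process takes an exclusive `flock` on a temp file, the parent confirms it cannot get the lock, the child is killed with `SIGKILL`, and the parent then acquires the lock with no cleanup step. The file name `demo.lock` and the variable names are illustrative.

```python
import fcntl
import os
import signal
import subprocess
import sys
import tempfile

lock_path = os.path.join(tempfile.mkdtemp(), "demo.lock")

# Child: take the exclusive lock, report readiness, then sleep for a long time.
child_src = (
    "import fcntl, sys, time\n"
    "f = open(sys.argv[1], 'a')\n"
    "fcntl.flock(f, fcntl.LOCK_EX)\n"
    "print('locked', flush=True)\n"
    "time.sleep(600)\n"
)
child = subprocess.Popen([sys.executable, "-c", child_src, lock_path],
                         stdout=subprocess.PIPE, text=True)
child.stdout.readline()            # wait until the child holds the lock

mine = open(lock_path, "a")
try:
    fcntl.flock(mine, fcntl.LOCK_EX | fcntl.LOCK_NB)
    blocked_while_alive = False
except BlockingIOError:
    blocked_while_alive = True     # the live child still owns the lock

child.send_signal(signal.SIGKILL)  # simulate `kill -9` on the lock holder
child.wait()
child.stdout.close()

try:
    fcntl.flock(mine, fcntl.LOCK_EX | fcntl.LOCK_NB)
    acquired_after_kill = True     # the kernel dropped the dead holder's lock
except BlockingIOError:
    acquired_after_kill = False

print("blocked while holder alive:", blocked_while_alive)
print("acquired after kill -9:", acquired_after_kill)
mine.close()
```

The lock file itself is never deleted in this run; it is only a name to lock on, which is why nothing needs tidying after a crash.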
	+
	+	## Install
	+
	+	```bash
	+	git clone https://github.com/zionboggan/gpu-cpu-mutex
	+	chmod +x gpu-cpu-mutex/*.sh
	+	# put them on PATH, or call by path
	+	```
	+
	+	No dependencies beyond `bash`, `flock` (util-linux), and optionally `ionice`. Lock paths default to `/tmp` and are configurable via env (see the top of each script).
	+
	+	## License
	+
	+	MIT - see [LICENSE](LICENSE).

cpu_run.sh +57 -0

		@@ -0,0 +1,57 @@
	+	#!/bin/bash
	+	# cpu_run.sh - CPU/RAM GUARD. Bounds heavy local jobs so an unthrottled batch can never thrash the
	+	# box. (This exists because a parallel image-postproc batch once drove load to 46 on 12 cores and
	+	# swap to 98.7% → the machine froze and had to be restarted.)
	+	#
	+	# What it does:
	+	# 1. COUNTING SEMAPHORE - at most CPU_RUN_SLOTS (default 3) run at once across ALL callers (flock slots).
	+	# So 18 jobs queue through 3-at-a-time instead of all-at-once.
	+	# 2. THREAD CAPS - pins BLAS/OpenMP to CPU_RUN_THREADS (default 3) so one numpy/torch job can't grab
	+	# every core. 3 slots × 3 threads ≈ 9 cores max, leaving headroom for everything else.
	+	# 3. nice + ionice - heavy work yields CPU/IO so interactive sessions stay responsive under load.
	+	#
	+	# Usage: cpu_run.sh <command...>
	+	# e.g. cpu_run.sh python postproc.py raw.png out/
	+	#
	+	# Config (env):
	+	#   CPU_RUN_SLOTS     concurrent slots (default 3)
	+	#   CPU_RUN_THREADS   BLAS/OpenMP thread cap (default 3)
	+	#   CPU_RUN_MAXWAIT   seconds to wait for a free slot before running anyway (default 3600)
	+	#   CPU_RUN_SLOT_DIR  directory holding the slot lock files (default /tmp/cpu_run_slots)
	+	set -u
	+	SLOTS="${CPU_RUN_SLOTS:-3}"
	+	THREADS="${CPU_RUN_THREADS:-3}"
	+	MAXWAIT="${CPU_RUN_MAXWAIT:-3600}"
	+	DIR="${CPU_RUN_SLOT_DIR:-/tmp/cpu_run_slots}"
	+	mkdir -p "$DIR" 2>/dev/null || true
	+
	+	# Marks that a job is running INSIDE the guard - your tools can check for it and auto-re-exec through
	+	# this wrapper if it's absent, so they can never accidentally run unbounded.
	+	export CPU_RUN_ACTIVE=1
	+	export OMP_NUM_THREADS="$THREADS" OPENBLAS_NUM_THREADS="$THREADS" MKL_NUM_THREADS="$THREADS" \
	+	NUMEXPR_NUM_THREADS="$THREADS" VECLIB_MAXIMUM_THREADS="$THREADS"
	+
	+	# Replace this shell with the job at low CPU priority (and idle I/O class when ionice exists).
	+	run() {
	+	  if command -v ionice >/dev/null 2>&1; then
	+	    exec nice -n 15 ionice -c3 "$@"
	+	  else
	+	    exec nice -n 15 "$@"
	+	  fi
	+	}
	+
	+	# Try to grab one of N slots (non-blocking sweep); if all are busy, retry every 2s until MAXWAIT,
	+	# then run un-slotted (thread caps and nice still apply).
	+	waited=0
	+	while :; do
	+	  for i in $(seq 1 "$SLOTS"); do
	+	    exec 9>"$DIR/slot$i"
	+	    if flock -n 9; then
	+	      echo "[cpu_run] slot $i/$SLOTS acquired (threads=$THREADS, nice 15) → $*" >&2
	+	      run "$@" # exec replaces the shell; fd 9 stays open in the job, so the slot frees when it exits
	+	    fi
	+	    exec 9>&-
	+	  done
	+	  if [ "$waited" -ge "$MAXWAIT" ]; then
	+	    echo "[cpu_run] all $SLOTS slots busy ${MAXWAIT}s - running un-slotted (still capped+nice'd): $*" >&2
	+	    run "$@"
	+	  fi
	+	  sleep 2; waited=$((waited+2))
	+	done

examples/remote_dispatch.py +45 -0

		@@ -0,0 +1,45 @@
	+	#!/usr/bin/env python3
	+	"""Optional: enforce the same GPU mutex from inside a Python dispatcher.
	+
	+	If your GPU jobs are fired over SSH to a remote box (or from a long-running Python
	+	loop rather than a shell), you can hold the same flock in-process so the wrapper
	+	and the dispatcher share one lock. The reentrancy trick is identical to gpu_run.sh:
	+	honor an inherited GPU_LOCK_HELD and skip re-locking, and respect a skip flag for
	+	non-GPU work (builds, uploads) that happens to run on the same box.
	+
	+	This is a generic skeleton - replace the dispatch body with your own call.
	+	"""
	+	import fcntl
	+	import os
	+	import subprocess
	+	import sys
	+
	+	GPU_LOCK_PATH = os.environ.get("GPU_LOCK_PATH", "/tmp/gpu.lock")
	+
	+
	+	def gpu_dispatch(argv, is_gpu_job=True, timeout=600):
	+	    # Lock only for GPU work, and skip if an ancestor already holds it (GPU_LOCK_HELD) or the caller
	+	    # marked this dispatch as non-GPU (GPU_SKIP_LOCK=1 - e.g. a build/publish that never touches VRAM).
	+	    need_lock = (
	+	        is_gpu_job
	+	        and not os.environ.get("GPU_LOCK_HELD")
	+	        and not os.environ.get("GPU_SKIP_LOCK")
	+	    )
	+	    lockf = None
	+	    try:
	+	        if need_lock:
	+	            lockf = open(GPU_LOCK_PATH, "w")
	+	            sys.stderr.write("[dispatch] waiting for GPU lock...\n")
	+	            fcntl.flock(lockf, fcntl.LOCK_EX)
	+	            sys.stderr.write("[dispatch] acquired GPU lock\n")
	+	            os.environ["GPU_LOCK_HELD"] = "1"  # propagate to children + any nested dispatch
	+	        return subprocess.run(argv, text=True, timeout=timeout).returncode
	+	    finally:
	+	        if lockf is not None:
	+	            fcntl.flock(lockf, fcntl.LOCK_UN)
	+	            lockf.close()
	+	            os.environ.pop("GPU_LOCK_HELD", None)
	+	
	+	
	+	if __name__ == "__main__":
	+	    sys.exit(gpu_dispatch(sys.argv[1:]))

gpu_run.sh +32 -0

		@@ -0,0 +1,32 @@
	+	#!/bin/bash
	+	# gpu_run.sh - GPU MUTEX. Serializes every GPU job through one shared flock so independent callers
	+	# (two agent loops, cron jobs, terminals - whatever) never load a model into VRAM at the same time
	+	# and OOM/corrupt each other. CPU work is unaffected; only GPU jobs serialize.
	+	#
	+	# Usage: gpu_run.sh <command...>
	+	# e.g. gpu_run.sh python gen_image.py "<prompt>" out.png
	+	# gpu_run.sh bash gen_music.sh "<prompt>" track.wav
	+	#
	+	# Every caller that touches the GPU MUST go through this. flock holds the lock for the duration;
	+	# any other caller blocks until it's free, then proceeds. The lock is released automatically on
	+	# exit/crash/kill (kernel advisory lock tied to the fd) - there is no stale-lock to clean up.
	+	#
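	+	# Config:
	+	#   GPU_LOCK_PATH  lock file (default /tmp/gpu.lock)
	+	#   GPU_SKIP_LOCK  set to 1 to run immediately without locking (non-GPU work on the GPU box)
	+	LOCK="${GPU_LOCK_PATH:-/tmp/gpu.lock}"
	+
	+	# Reentrancy: if an ancestor already holds the lock, don't re-acquire - otherwise a wrapped command
	+	# that internally calls another wrapped command would deadlock waiting on a lock its own parent holds.
	+	# GPU_SKIP_LOCK is the escape hatch for work that never touches VRAM.
	+	if [ -n "${GPU_LOCK_HELD:-}" ] || [ -n "${GPU_SKIP_LOCK:-}" ]; then
	+	  exec "$@"
	+	fi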
	+
	+	exec 9>"$LOCK"
	+	echo "[gpu_run] waiting for GPU lock..." >&2
	+	flock 9
	+	echo "[gpu_run] acquired GPU lock → running: $*" >&2
	+	export GPU_LOCK_HELD=1 # propagate to children + any nested gpu_run so they skip self-locking
	+	"$@"
	+	rc=$?
	+	echo "[gpu_run] done (rc=$rc), releasing GPU lock" >&2
	+	exit $rc