Zion Boggan zionboggan.com

gpu-cpu-mutex: flock-based GPU mutex + CPU/RAM guard

Two small shell tools to share one GPU and a bounded CPU budget across
independent processes using only flock. No daemon, scheduler, or root.
28319ff   Zion Boggan committed on Jun 5, 2026
LICENSE +21 -0
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2026 Zion Boggan
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
README.md +65 -0
@@ -0,0 +1,65 @@
+# gpu-cpu-mutex
+
+Two tiny shell tools that let **multiple independent processes share one GPU and a bounded CPU/RAM budget** without colliding - using nothing but `flock`. No daemon, no scheduler, no root, no software to install. If you run two (or more) long-lived agent loops, cron jobs, or terminals on the same box and they all occasionally hit the *same* GPU or kick off heavy CPU batches, this stops them from stepping on each other.
+
+I built this to coordinate two perpetual AI worker loops that share a single RTX 3060 over SSH. Either loop can fire a GPU job at any time; without coordination they'd both load a model into VRAM at once and OOM/corrupt. With `gpu_run.sh`, the second job simply *waits* for the first to finish. The CPU guard came after an unthrottled image batch drove load to 46 on 12 cores and 98.7% swap and froze the box - now heavy jobs queue through N slots with pinned thread counts.
+
+## Why not just use $EXISTING_THING?
+
+I looked. None of them fit "two unrelated processes, one consumer GPU, share politely":
+
+| Option | Why it didn't fit |
+|---|---|
+| **NVIDIA MPS / MIG** | Designed to *co-locate* work on a GPU (spatial sharing / partitioning). I want the opposite - strict **serialization** so only one job touches VRAM at a time. MIG isn't even supported on consumer cards. |
+| **Slurm `--gres=gpu`** | A real scheduler, and it works - but standing up `slurmctld`+`slurmd` to serialize two tmux loops on a homelab box is wildly heavy. |
+| **`CUDA_VISIBLE_DEVICES` gating** | Picks *which* GPU, doesn't stop two jobs from grabbing the *same* one. |
+| **`task-spooler` (`ts`), `parallel --sem`** | Closer - but they model a single queue you submit *into*. I wanted a transparent wrapper any script/loop can call inline, that's reentrant and has an escape hatch (see below). |
+
+So this is ~90 lines of bash. `flock` already does the hard part (kernel-level advisory locks, auto-released on process exit - even on crash/kill). These tools are just the ergonomic shell around it.
+
+## `gpu_run.sh` - the GPU mutex
+
+Serializes every GPU job through one exclusive `flock` on a shared lock file. The second caller blocks until the first releases.
+
+```bash
+gpu_run.sh python gen_image.py "a knight" out.png
+gpu_run.sh bash gen_music.sh "lo-fi" track.wav
+```
+
+Two details that matter in practice:
+
+- **Reentrancy.** A wrapped command often calls *other* wrapped commands (`gpu_run.sh bash gen.sh`, where `gen.sh` internally also dispatches GPU work). Naively that self-deadlocks - the child blocks on a lock its own ancestor holds. So once the wrapper acquires the lock it exports `GPU_LOCK_HELD=1`; any nested call sees that and skips re-locking. The lock is held for the *whole* tree and released once at the top.
+- **An escape hatch.** Not everything that runs *on* the GPU box touches the GPU (a build step, a publish/upload). Those shouldn't queue behind a running model. Run them without the wrapper, or set `GPU_SKIP_LOCK=1` (the flag `examples/remote_dispatch.py` checks) where a dispatcher wraps everything, so non-GPU work starts immediately instead of waiting.
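The two rules above can be sketched in Python as a context manager. This is a minimal illustration, not code from the repo: `gpu_lock`, `is_locked`, and `nested_demo` are hypothetical names, and the skip flag is assumed to be `GPU_SKIP_LOCK`, the name used in `examples/remote_dispatch.py`.

```python
import contextlib
import fcntl
import os
import tempfile


@contextlib.contextmanager
def gpu_lock(path, env=os.environ):
    """Hold an exclusive flock on `path` unless an enclosing scope already does.

    Hypothetical helper. Yields True if this call took the lock, False if it skipped.
    """
    if env.get("GPU_LOCK_HELD") or env.get("GPU_SKIP_LOCK"):
        # Nested call, or work explicitly marked as non-GPU: leave the lock alone.
        yield False
        return
    with open(path, "a") as lockf:         # "a" creates the file without truncating it
        fcntl.flock(lockf, fcntl.LOCK_EX)  # blocks until the current holder releases
        env["GPU_LOCK_HELD"] = "1"         # seen by nested calls and by child processes
        try:
            yield True
        finally:
            env.pop("GPU_LOCK_HELD", None)
            fcntl.flock(lockf, fcntl.LOCK_UN)


def is_locked(path):
    """True if some other open file description holds the lock right now."""
    with open(path, "a") as probe:
        try:
            fcntl.flock(probe, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return True
        return False  # leaving the `with` closes the probe and drops its lock


def nested_demo(path):
    """Acquire at the top, re-enter once, and report what each level saw."""
    with gpu_lock(path) as outer_took_lock:
        with gpu_lock(path) as inner_took_lock:
            held_inside = is_locked(path)
    return outer_took_lock, inner_took_lock, held_inside, is_locked(path)


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "gpu.lock")
    print(nested_demo(demo_path))
```

Without the `GPU_LOCK_HELD` check, the inner `with` would open the lock file a second time and block forever on the lock the outer scope holds, which is the self-deadlock the reentrancy rule avoids.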
+
+## `cpu_run.sh` - the CPU/RAM guard
+
+A **counting semaphore** (N flock slots) plus **thread caps**, so heavy local jobs queue `N`-at-a-time instead of all-at-once, and no single job can grab every core.
+
+```bash
+CPU_RUN_SLOTS=3 CPU_RUN_THREADS=3 cpu_run.sh python postproc.py raw.png out/
+```
+
+- **N slots:** at most `CPU_RUN_SLOTS` (default 3) run concurrently across every caller. The rest poll every 2 seconds with a non-blocking `flock -n` sweep until a slot frees. If `CPU_RUN_MAXWAIT` seconds (default 3600) pass first, the job runs anyway without a slot - still thread-capped and niced, never stalled indefinitely.
+- **Thread caps:** sets `OMP_NUM_THREADS` / `OPENBLAS_NUM_THREADS` / `MKL_NUM_THREADS` / `NUMEXPR_NUM_THREADS` / `VECLIB_MAXIMUM_THREADS` to `CPU_RUN_THREADS` (default 3), so one numpy/torch job can't silently fan out to all cores. This only limits libraries that read those variables. `3 slots × 3 threads ≈ 9 cores`, leaving headroom for the rest of the system on a 12-core box.
+- **Politeness:** runs under `nice -n 15` + `ionice -c3` so interactive work stays responsive even when saturated.
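For callers that are already in Python, the slot sweep and the thread caps might look roughly like this. It is a sketch under assumptions: `try_acquire_slot` and `capped_env` are hypothetical helpers, and the `slot1..slotN` file layout simply mirrors what `cpu_run.sh` uses.

```python
import fcntl
import os
import tempfile

# The same variables cpu_run.sh exports.
THREAD_VARS = ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS",
               "NUMEXPR_NUM_THREADS", "VECLIB_MAXIMUM_THREADS")


def try_acquire_slot(slot_dir, slots):
    """Sweep slot1..slotN once; return (index, file) for the first free slot, else None."""
    os.makedirs(slot_dir, exist_ok=True)
    for i in range(1, slots + 1):
        slot = open(os.path.join(slot_dir, f"slot{i}"), "a")
        try:
            fcntl.flock(slot, fcntl.LOCK_EX | fcntl.LOCK_NB)  # never blocks
        except BlockingIOError:
            slot.close()  # busy: drop this descriptor and try the next slot
            continue
        return i, slot    # keep `slot` open (and referenced) for as long as the job runs
    return None


def capped_env(threads, base=None):
    """Copy of the environment with every thread-count variable set to `threads`."""
    env = dict(os.environ if base is None else base)
    env.update({name: str(threads) for name in THREAD_VARS})
    return env


if __name__ == "__main__":
    demo_dir = tempfile.mkdtemp()
    first = try_acquire_slot(demo_dir, 2)
    second = try_acquire_slot(demo_dir, 2)
    third = try_acquire_slot(demo_dir, 2)  # both slots are taken at this point
    print(first[0], second[0], third)
```

A real caller would wrap `try_acquire_slot` in a sleep-and-retry loop with a maximum wait, then launch the job with something like `subprocess.run(cmd, env=capped_env(3))` while the returned file object stays open.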
+
+## How it works (the whole trick)
+
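+`flock(2)` gives you a kernel advisory lock tied to an open file description (the kernel object behind a file descriptor). The kernel releases it automatically once the last descriptor referring to it closes - including when the process exits, crashes, or is `kill -9`'d. That last part is why this is robust: there's no lock file to "clean up," no stale-lock problem, no PID files. The lock *is* the live process holding an fd. One consequence: child processes inherit the descriptor, so the lock stays held until every process in the wrapped tree that still has it open has exited.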
+
+- **Mutex** = one exclusive lock on one path (`flock 9` on `/tmp/gpu.lock`).
+- **Counting semaphore** = N lock files; grab any one with `flock -n` (non-blocking); if all N are held, wait and retry.
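The release-on-kill property is easy to check directly. The sketch below (hypothetical names `HOLDER`, `can_lock`, `release_on_kill`; Linux `flock` semantics assumed) starts a child that takes the lock, probes it, sends SIGKILL, and probes again.

```python
import fcntl
import os
import subprocess
import sys
import tempfile

# Child program: take the lock, announce it, then sleep far longer than the demo needs.
HOLDER = """
import fcntl, sys, time
f = open(sys.argv[1], "a")
fcntl.flock(f, fcntl.LOCK_EX)
print("locked", flush=True)
time.sleep(60)
"""


def can_lock(path):
    """Try a non-blocking exclusive lock; True if it was free."""
    with open(path, "a") as f:
        try:
            fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return False
        return True  # leaving the `with` closes the file, which drops the lock again


def release_on_kill(path):
    """Return (free_while_child_alive, free_after_kill)."""
    child = subprocess.Popen([sys.executable, "-c", HOLDER, path],
                             stdout=subprocess.PIPE, text=True)
    try:
        child.stdout.readline()  # wait until the child reports that it holds the lock
        free_while_alive = can_lock(path)
    finally:
        child.kill()             # SIGKILL: the child gets no chance to clean up
        child.wait()
        child.stdout.close()
    return free_while_alive, can_lock(path)


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "demo.lock")
    print(release_on_kill(demo_path))
```

The lock file itself is never deleted and is never consulted for its contents; only the kernel's record of who holds the lock matters.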
+
+## Install
+
+```bash
+git clone https://github.com/zionboggan/gpu-cpu-mutex
+chmod +x gpu-cpu-mutex/*.sh
+# put them on PATH, or call by path
+```
+
+No dependencies beyond `bash`, `flock` (util-linux), `nice` and `seq` (coreutils), and optionally `ionice`. Lock paths default to `/tmp` and are configurable via env: `GPU_LOCK_PATH` for the GPU lock file and `CPU_RUN_SLOT_DIR` for the CPU slot directory.
+
+## License
+
+MIT - see [LICENSE](LICENSE).
cpu_run.sh +57 -0
@@ -0,0 +1,57 @@
+#!/bin/bash
+# cpu_run.sh - CPU/RAM GUARD. Bounds heavy local jobs so an unthrottled batch can never thrash the
+# box. (This exists because a parallel image-postproc batch once drove load to 46 on 12 cores and
+# swap to 98.7% → the machine froze and had to be restarted.)
+#
+# What it does:
+# 1. COUNTING SEMAPHORE - at most CPU_RUN_SLOTS (default 3) run at once across ALL callers (flock slots).
+# So 18 jobs queue through 3-at-a-time instead of all-at-once.
+# 2. THREAD CAPS - pins BLAS/OpenMP to CPU_RUN_THREADS (default 3) so one numpy/torch job can't grab
+# every core. 3 slots × 3 threads ≈ 9 cores max, leaving headroom for everything else.
+# 3. nice + ionice - heavy work yields CPU/IO so interactive sessions stay responsive under load.
+#
+# Usage: cpu_run.sh <command...>   (e.g. cpu_run.sh python postproc.py raw.png out/)
+#
+# Config (env):
+#   CPU_RUN_SLOTS     concurrent slots (default 3)
+#   CPU_RUN_THREADS   BLAS/OpenMP thread cap (default 3)
+#   CPU_RUN_MAXWAIT   seconds to wait for a free slot before running anyway (default 3600)
+#   CPU_RUN_SLOT_DIR  directory holding the slot lock files (default /tmp/cpu_run_slots)
+set -u
+SLOTS="${CPU_RUN_SLOTS:-3}"
+THREADS="${CPU_RUN_THREADS:-3}"
+MAXWAIT="${CPU_RUN_MAXWAIT:-3600}"
+DIR="${CPU_RUN_SLOT_DIR:-/tmp/cpu_run_slots}"
+mkdir -p "$DIR" 2>/dev/null || true
+
+# Marks that a job is running INSIDE the guard - your tools can check for it and auto-re-exec through
+# this wrapper if it's absent, so they can never accidentally run unbounded.
+export CPU_RUN_ACTIVE=1
+export OMP_NUM_THREADS="$THREADS" OPENBLAS_NUM_THREADS="$THREADS" MKL_NUM_THREADS="$THREADS" \
+       NUMEXPR_NUM_THREADS="$THREADS" VECLIB_MAXIMUM_THREADS="$THREADS"
+
+run() {
+  if command -v ionice >/dev/null 2>&1; then
+    exec nice -n 15 ionice -c3 "$@"
+  else
+    exec nice -n 15 "$@"
+  fi
+}
+
+# Try to grab one of N slots (non-blocking sweep); if all busy, retry until MAXWAIT, then run un-slotted.
+waited=0
+while :; do
+  for i in $(seq 1 "$SLOTS"); do
+    exec 9>"$DIR/slot$i"
+    if flock -n 9; then
+      echo "[cpu_run] slot $i/$SLOTS acquired (threads=$THREADS, nice 15) → $*" >&2
+      run "$@"   # exec replaces this shell; the job inherits fd 9, so the slot frees when it exits
+    fi
+    exec 9>&-    # slot busy: close the fd before trying the next one
+  done
+  if [ "$waited" -ge "$MAXWAIT" ]; then
+    echo "[cpu_run] all $SLOTS slots busy ${MAXWAIT}s - running un-slotted (still capped+nice'd): $*" >&2
+    run "$@"
+  fi
+  sleep 2; waited=$((waited+2))
+done
examples/remote_dispatch.py +45 -0
@@ -0,0 +1,45 @@
+#!/usr/bin/env python3
+"""Optional: enforce the same GPU mutex from inside a Python dispatcher.
+
+If your GPU jobs are fired over SSH to a remote box (or from a long-running Python
+loop rather than a shell), you can hold the *same* flock in-process so the wrapper
+and the dispatcher share one lock. The reentrancy trick is identical to gpu_run.sh:
+honor an inherited GPU_LOCK_HELD and skip re-locking, and respect a skip flag for
+non-GPU work (builds, uploads) that happens to run on the same box.
+
+This is a generic skeleton - replace the dispatch body with your own call.
+"""
+import fcntl
+import os
+import subprocess
+import sys
+
+GPU_LOCK_PATH = os.environ.get("GPU_LOCK_PATH", "/tmp/gpu.lock")
+
+
+def gpu_dispatch(argv, is_gpu_job=True, timeout=600):
+    # Lock only for GPU work, and skip if an ancestor already holds it (GPU_LOCK_HELD) or the caller
+    # marked this dispatch as non-GPU (GPU_SKIP_LOCK=1 - e.g. a build/publish that never touches VRAM).
+    need_lock = (
+        is_gpu_job
+        and not os.environ.get("GPU_LOCK_HELD")
+        and not os.environ.get("GPU_SKIP_LOCK")
+    )
+    lockf = None
+    try:
+        if need_lock:
+            lockf = open(GPU_LOCK_PATH, "w")
+            sys.stderr.write("[dispatch] waiting for GPU lock...\n")
+            fcntl.flock(lockf, fcntl.LOCK_EX)
+            sys.stderr.write("[dispatch] acquired GPU lock\n")
+            os.environ["GPU_LOCK_HELD"] = "1"  # propagate to children + any nested dispatch
+        return subprocess.run(argv, text=True, timeout=timeout).returncode
+    finally:
+        if lockf is not None:
+            os.environ.pop("GPU_LOCK_HELD", None)  # only unset the flag if this call set it
+            fcntl.flock(lockf, fcntl.LOCK_UN)
+            lockf.close()
+
+
+if __name__ == "__main__":
+    sys.exit(gpu_dispatch(sys.argv[1:]))
gpu_run.sh +32 -0
@@ -0,0 +1,32 @@
+#!/bin/bash
+# gpu_run.sh - GPU MUTEX. Serializes every GPU job through one shared flock so independent callers
+# (two agent loops, cron jobs, terminals - whatever) never load a model into VRAM at the same time
+# and OOM/corrupt each other. CPU work is unaffected; only GPU jobs serialize.
+#
+# Usage: gpu_run.sh <command...>
+# e.g. gpu_run.sh python gen_image.py "<prompt>" out.png
+# gpu_run.sh bash gen_music.sh "<prompt>" track.wav
+#
+# Every caller that touches the GPU MUST go through this. flock holds the lock for the duration;
+# any other caller blocks until it's free, then proceeds. The lock is released automatically on
+# exit/crash/kill (kernel advisory lock tied to the fd) - there is no stale-lock to clean up.
+#
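+# Config:
+#   GPU_LOCK_PATH   lock file (default /tmp/gpu.lock)
+#   GPU_SKIP_LOCK   set to 1 to run without the lock (non-GPU work); alias: PSRUN_SKIP_GPU_LOCK
+LOCK="${GPU_LOCK_PATH:-/tmp/gpu.lock}"
+
+# Reentrancy: if an ancestor already holds the lock, don't re-acquire - otherwise a wrapped command
+# that internally calls another wrapped command would deadlock waiting on a lock its own parent holds.
+# Escape hatch: a skip flag marks work that never touches the GPU, so it must not queue either.
+if [ -n "${GPU_LOCK_HELD:-}${GPU_SKIP_LOCK:-}${PSRUN_SKIP_GPU_LOCK:-}" ]; then exec "$@"; fi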
+
+exec 9>"$LOCK"
+echo "[gpu_run] waiting for GPU lock..." >&2
+flock 9
+echo "[gpu_run] acquired GPU lock → running: $*" >&2
+export GPU_LOCK_HELD=1 # propagate to children + any nested gpu_run so they skip self-locking
+"$@"
+rc=$?
+echo "[gpu_run] done (rc=$rc), releasing GPU lock" >&2
+exit $rc