| @@ -0,0 +1,21 @@ | ||
| + | MIT License | |
| + | ||
| + | Copyright (c) 2026 Zion Boggan | |
| + | ||
| + | Permission is hereby granted, free of charge, to any person obtaining a copy | |
| + | of this software and associated documentation files (the "Software"), to deal | |
| + | in the Software without restriction, including without limitation the rights | |
| + | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell | |
| + | copies of the Software, and to permit persons to whom the Software is | |
| + | furnished to do so, subject to the following conditions: | |
| + | ||
| + | The above copyright notice and this permission notice shall be included in all | |
| + | copies or substantial portions of the Software. | |
| + | ||
| + | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR | |
| + | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, | |
| + | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE | |
| + | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER | |
| + | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, | |
| + | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE | |
| + | SOFTWARE. |
| @@ -0,0 +1,65 @@ | ||
| + | # gpu-cpu-mutex | |
| + | ||
| + | Two tiny shell tools that let **multiple independent processes share one GPU and a bounded CPU/RAM budget** without colliding - using nothing but `flock`. No daemon, no scheduler, no root, nothing to install beyond the scripts themselves. If you run two (or more) long-lived agent loops, cron jobs, or terminals on the same box and they all occasionally hit the *same* GPU or kick off heavy CPU batches, this stops them from stepping on each other. | |
| + | ||
| + | I built this to coordinate two perpetual AI worker loops that share a single RTX 3060 over SSH. Either loop can fire a GPU job at any time; without coordination they'd both load a model into VRAM at once and OOM or corrupt each other. With `gpu_run.sh`, the second job simply *waits* for the first to finish. The CPU guard came after an unthrottled image batch drove the load average to 46 on 12 cores, pushed swap usage to 98.7%, and froze the box - now heavy jobs queue through N slots with capped thread counts. | |
| + | ||
| + | ## Why not just use $EXISTING_THING? | |
| + | ||
| + | I looked. None of them fit "two unrelated processes, one consumer GPU, share politely": | |
| + | ||
| + | | Option | Why it didn't fit | | |
| + | |---|---| | |
| + | | **NVIDIA MPS / MIG** | Designed to *co-locate* work on a GPU (spatial sharing / partitioning). I want the opposite - strict **serialization** so only one job touches VRAM at a time. MIG isn't even supported on consumer cards. | | |
| + | | **Slurm `--gres=gpu`** | A real scheduler, and it works - but standing up `slurmctld`+`slurmd` to serialize two tmux loops on a homelab box is wildly heavy. | | |
| + | | **`CUDA_VISIBLE_DEVICES` gating** | Picks *which* GPU, doesn't stop two jobs from grabbing the *same* one. | | |
| + | | **`task-spooler` (`ts`), `parallel --sem`** | Closer - but they model a single queue you submit *into*. I wanted a transparent wrapper any script/loop can call inline, that's reentrant and has an escape hatch (see below). | | |
| + | ||
| + | So this is under 100 lines of bash, comments included. `flock` already does the hard part (kernel-level advisory locks, auto-released on process exit - even on crash/kill). These tools are just the ergonomic shell around it. | |
| + | ||
| + | ## `gpu_run.sh` - the GPU mutex | |
| + | ||
| + | Serializes every GPU job through one exclusive `flock` on a shared lock file. The second caller blocks until the first releases. | |
| + | ||
| + | ```bash | |
| + | gpu_run.sh python gen_image.py "a knight" out.png | |
| + | gpu_run.sh bash gen_music.sh "lo-fi" track.wav | |
| + | ``` | |
| + | ||
| + | Two details that matter in practice: | |
| + | ||
| + | - **Reentrancy.** A wrapped command often calls *other* wrapped commands (`gpu_run.sh bash gen.sh`, where `gen.sh` internally also dispatches GPU work). Naively that self-deadlocks - the child blocks on a lock its own ancestor holds. So when the wrapper acquires the lock it exports `GPU_LOCK_HELD=1`; any nested call sees that and skips re-locking. The lock is held for the *whole* tree and released once at the top. | |
| + | - **An escape hatch.** Not everything that runs *on* the GPU box touches the GPU (a build step, a publish/upload). Those shouldn't queue behind a running model. The optional Python dispatcher honors `GPU_SKIP_LOCK=1` for this; `gpu_run.sh` has no skip flag of its own, so run non-GPU commands without the wrapper (or wire your own check) and they start immediately instead of waiting. | |
| + | ||
| + | ## `cpu_run.sh` - the CPU/RAM guard | |
| + | ||
| + | A **counting semaphore** (N flock slots) plus **thread caps**, so heavy local jobs queue `N`-at-a-time instead of all-at-once, and no single job can grab every core. | |
| + | ||
| + | ```bash | |
| + | CPU_RUN_SLOTS=3 CPU_RUN_THREADS=3 cpu_run.sh python postproc.py raw.png out/ | |
| + | ``` | |
| + | ||
| + | - **N slots:** at most `CPU_RUN_SLOTS` (default 3) run concurrently across every caller. The rest poll every 2 seconds with a non-blocking sweep over the slots until one frees. If `CPU_RUN_MAXWAIT` seconds (default 3600) elapse first, the job runs anyway: un-slotted, but still thread-capped and niced, so it is never stalled indefinitely. | |
| + | - **Thread caps:** pins `OMP_NUM_THREADS` / `OPENBLAS_NUM_THREADS` / `MKL_NUM_THREADS` / `NUMEXPR_NUM_THREADS` / `VECLIB_MAXIMUM_THREADS` to `CPU_RUN_THREADS`, so one numpy/torch job can't silently fan out to all cores. `3 slots × 3 threads ≈ 9 cores`, leaving headroom for the rest of the system. | |
| + | - **Politeness:** runs under `nice -n 15` + `ionice -c3` so interactive work stays responsive even when saturated. | |
| + | ||
| + | ## How it works (the whole trick) | |
| + | ||
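| + | `flock(2)` gives you a kernel advisory lock tied to an open file description. The kernel releases it automatically once every descriptor referring to it is closed - including when the process exits, crashes, or is `kill -9`'d. That last part is why this is robust: there's no lock file to "clean up," no stale-lock problem, no PID files. The lock *is* the live process holding an fd. One caveat follows from the same rule: child processes inherit the descriptor, so a background process that outlives the wrapped command keeps the lock held until it exits too. | |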
| + | ||
| + | - **Mutex** = one exclusive lock on one path (`flock 9` on `/tmp/gpu.lock`). | |
| + | - **Counting semaphore** = N lock files; grab any one with `flock -n` (non-blocking); if all N are held, wait and retry. | |
| + | ||
| + | ## Install | |
| + | ||
| + | ```bash | |
| + | git clone https://github.com/zionboggan/gpu-cpu-mutex | |
| + | chmod +x gpu-cpu-mutex/*.sh | |
| + | # put them on PATH, or call by path | |
| + | ``` | |
| + | ||
| + | No dependencies beyond `bash`, coreutils (`nice`, `seq`), `flock` (util-linux), and optionally `ionice`. Lock paths default to `/tmp` and are configurable via env (see the top of each script). | |
| + | ||
| + | ## License | |
| + | ||
| + | MIT - see [LICENSE](LICENSE). |
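
The README's "How it works" section rests on one property: the lock disappears when its holder dies, with no cleanup step. A minimal sketch of that property follows, assuming `flock` from util-linux is on PATH. The temp lock file and the variable names (`lock`, `holder`, `busy`, `freed`) are made up for the demo and are not part of the repo.

```shell
#!/bin/bash
# Sketch only: show that an flock lock blocks a second caller and vanishes when the holder is killed.
lock="$(mktemp)"            # throwaway lock file for the demo, not the real /tmp/gpu.lock

# Holder: take the exclusive lock on fd 9, then exec a long sleep that inherits fd 9.
( exec 9>"$lock"; flock 9; exec sleep 30 ) &
holder=$!
sleep 1                     # crude wait so the holder has time to acquire the lock

# While the holder is alive, a non-blocking attempt should be refused.
if flock -n "$lock" true; then busy=no; else busy=yes; fi

# kill -9 gives the holder no chance to clean up; the kernel closes fd 9 for it.
kill -9 "$holder"
wait "$holder" 2>/dev/null

# The same non-blocking attempt should now succeed.
if flock -n "$lock" true; then freed=yes; else freed=no; fi

echo "busy_while_held=$busy freed_after_kill=$freed"
rm -f "$lock"
```

The one-second sleep is a shortcut for the demo; real callers never need it, because they simply block in `flock` until the lock is free.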
| @@ -0,0 +1,57 @@ | ||
| + | #!/bin/bash | |
| + | # cpu_run.sh - CPU/RAM GUARD. Bounds heavy local jobs so an unthrottled batch can never thrash the | |
| + | # box. (This exists because a parallel image-postproc batch once drove load to 46 on 12 cores and | |
| + | # swap to 98.7% → the machine froze and had to be restarted.) | |
| + | # | |
| + | # What it does: | |
| + | # 1. COUNTING SEMAPHORE - at most CPU_RUN_SLOTS (default 3) run at once across ALL callers (flock slots). | |
| + | # So 18 jobs queue through 3-at-a-time instead of all-at-once. | |
| + | # 2. THREAD CAPS - pins BLAS/OpenMP to CPU_RUN_THREADS (default 3) so one numpy/torch job can't grab | |
| + | # every core. 3 slots × 3 threads ≈ 9 cores max, leaving headroom for everything else. | |
| + | # 3. nice + ionice - heavy work yields CPU/IO so interactive sessions stay responsive under load. | |
| + | # | |
| + | # Usage: cpu_run.sh <command...> | |
| + | # e.g. cpu_run.sh python postproc.py raw.png out/ | |
| + | # | |
| + | # Config (env): | |
| + | # CPU_RUN_SLOTS concurrent slots (default 3) | |
| + | # CPU_RUN_THREADS BLAS/OpenMP thread cap (default 3) | |
| + | # CPU_RUN_MAXWAIT seconds to wait for a free slot before running anyway (default 3600) | |
| + | set -u | |
| + | SLOTS="${CPU_RUN_SLOTS:-3}" | |
| + | THREADS="${CPU_RUN_THREADS:-3}" | |
| + | MAXWAIT="${CPU_RUN_MAXWAIT:-3600}" | |
| + | DIR="${CPU_RUN_SLOT_DIR:-/tmp/cpu_run_slots}" | |
| + | mkdir -p "$DIR" 2>/dev/null || true | |
| + | ||
| + | # Marks that a job is running INSIDE the guard - your tools can check for it and auto-re-exec through | |
| + | # this wrapper if it's absent, so they can never accidentally run unbounded. | |
| + | export CPU_RUN_ACTIVE=1 | |
| + | export OMP_NUM_THREADS="$THREADS" OPENBLAS_NUM_THREADS="$THREADS" MKL_NUM_THREADS="$THREADS" \ | |
| + | NUMEXPR_NUM_THREADS="$THREADS" VECLIB_MAXIMUM_THREADS="$THREADS" | |
| + | ||
| + | run() { | |
| + |   if command -v ionice >/dev/null 2>&1; then | |
| + |     exec nice -n 15 ionice -c3 "$@" | |
| + |   else | |
| + |     exec nice -n 15 "$@" | |
| + |   fi | |
| + | } | |
| + | ||
| + | # Try to grab one of N slots (non-blocking sweep); if all are busy, retry every 2s until MAXWAIT, | |
| + | # then run un-slotted (still thread-capped and nice'd). | |
| + | waited=0 | |
| + | while :; do | |
| + |   for i in $(seq 1 "$SLOTS"); do | |
| + |     exec 9>"$DIR/slot$i" | |
| + |     if flock -n 9; then | |
| + |       echo "[cpu_run] slot $i/$SLOTS acquired (threads=$THREADS, nice 15) → $*" >&2 | |
| + |       run "$@" # exec replaces the shell; the command inherits fd 9, so the slot is held until it exits | |
| + |     fi | |
| + |     exec 9>&- | |
| + |   done | |
| + |   if [ "$waited" -ge "$MAXWAIT" ]; then | |
| + |     echo "[cpu_run] all $SLOTS slots busy ${MAXWAIT}s - running un-slotted (still capped+nice'd): $*" >&2 | |
| + |     run "$@" | |
| + |   fi | |
| + |   sleep 2; waited=$((waited+2)) | |
| + | done |
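
The slot sweep in `cpu_run.sh` above ends in `exec`, which makes it hard to watch the semaphore itself. The sketch below repeats the sweep without the `exec`, so a single shell can take slots, be refused once all are taken, and reuse a freed one. It assumes bash 4.1 or newer (for `{var}` descriptor allocation) and `flock` from util-linux. `slot_dir`, `slots`, `acquire`, `slot_fd` and `slot_no` are hypothetical names for the demo, not part of the script.

```shell
#!/bin/bash
# Sketch only: the N-slot sweep from cpu_run.sh, minus the final exec, so the outcome can be inspected.
slot_dir="$(mktemp -d)"   # scratch directory standing in for /tmp/cpu_run_slots
slots=2

# acquire: sweep slot1..slotN once. On success the descriptor stays open (that is what holds the
# lock) and is recorded in slot_fd / slot_no. Returns 1 if every slot is taken.
acquire() {
  local i
  for i in $(seq 1 "$slots"); do
    exec {slot_fd}>"$slot_dir/slot$i"      # allocate a fresh descriptor for this slot file
    if flock -n "$slot_fd"; then
      slot_no=$i
      return 0
    fi
    exec {slot_fd}>&-                      # slot busy: close it and try the next one
  done
  return 1
}

acquire; first=$slot_no; first_fd=$slot_fd
acquire; second=$slot_no
if acquire; then third=granted; else third=busy; fi

exec {first_fd}>&-                         # closing the descriptor frees the first slot
acquire; reused=$slot_no

echo "first=$first second=$second third=$third reused=$reused"
rm -rf "$slot_dir"
```

Each `acquire` opens the slot file afresh, and separate opens of the same file contend with each other under `flock` even inside one process. That is why a single shell can stand in for several independent callers here.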
| @@ -0,0 +1,45 @@ | ||
| + | #!/usr/bin/env python3 | |
| + | """Optional: enforce the same GPU mutex from inside a Python dispatcher. | |
| + | ||
| + | If your GPU jobs are fired over SSH to a remote box (or from a long-running Python | |
| + | loop rather than a shell), you can hold the *same* flock in-process so the wrapper | |
| + | and the dispatcher share one lock. The reentrancy trick is identical to gpu_run.sh: | |
| + | honor an inherited GPU_LOCK_HELD and skip re-locking, and respect a skip flag for | |
| + | non-GPU work (builds, uploads) that happens to run on the same box. | |
| + | ||
| + | This is a generic skeleton - replace the dispatch body with your own call. | |
| + | """ | |
| + | import fcntl | |
| + | import os | |
| + | import subprocess | |
| + | import sys | |
| + | ||
| + | GPU_LOCK_PATH = os.environ.get("GPU_LOCK_PATH", "/tmp/gpu.lock") | |
| + | ||
| + | ||
| + | def gpu_dispatch(argv, is_gpu_job=True, timeout=600): | |
| + |     # Lock only for GPU work, and skip if an ancestor already holds it (GPU_LOCK_HELD) or the caller | |
| + |     # marked this dispatch as non-GPU (GPU_SKIP_LOCK=1 - e.g. a build/publish that never touches VRAM). | |
| + |     need_lock = ( | |
| + |         is_gpu_job | |
| + |         and not os.environ.get("GPU_LOCK_HELD") | |
| + |         and not os.environ.get("GPU_SKIP_LOCK") | |
| + |     ) | |
| + |     lockf = None | |
| + |     try: | |
| + |         if need_lock: | |
| + |             lockf = open(GPU_LOCK_PATH, "w") | |
| + |             sys.stderr.write("[dispatch] waiting for GPU lock...\n") | |
| + |             fcntl.flock(lockf, fcntl.LOCK_EX) | |
| + |             sys.stderr.write("[dispatch] acquired GPU lock\n") | |
| + |             os.environ["GPU_LOCK_HELD"] = "1" # propagate to children + any nested dispatch | |
| + |         return subprocess.run(argv, text=True, timeout=timeout).returncode | |
| + |     finally: | |
| + |         if lockf is not None: | |
| + |             fcntl.flock(lockf, fcntl.LOCK_UN) | |
| + |             lockf.close() | |
| + |             os.environ.pop("GPU_LOCK_HELD", None) | |
| + | ||
| + | ||
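| + | if __name__ == "__main__": | |
| + |     if len(sys.argv) < 2: # nothing to run: fail fast instead of taking the lock for an empty command | |
| + |         sys.exit(f"usage: {sys.argv[0]} <command...>") | |
| + |     sys.exit(gpu_dispatch(sys.argv[1:])) |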
| @@ -0,0 +1,32 @@ | ||
| + | #!/bin/bash | |
| + | # gpu_run.sh - GPU MUTEX. Serializes every GPU job through one shared flock so independent callers | |
| + | # (two agent loops, cron jobs, terminals - whatever) never load a model into VRAM at the same time | |
| + | # and OOM/corrupt each other. CPU work is unaffected; only GPU jobs serialize. | |
| + | # | |
| + | # Usage: gpu_run.sh <command...> | |
| + | # e.g. gpu_run.sh python gen_image.py "<prompt>" out.png | |
| + | # gpu_run.sh bash gen_music.sh "<prompt>" track.wav | |
| + | # | |
| + | # Every caller that touches the GPU MUST go through this. flock holds the lock for the duration; | |
| + | # any other caller blocks until it's free, then proceeds. The lock is released automatically on | |
| + | # exit/crash/kill (kernel advisory lock tied to the fd) - there is no stale-lock to clean up. | |
| + | # | |
| + | # Config: | |
| + | # GPU_LOCK_PATH lock file (default /tmp/gpu.lock) | |
| + | LOCK="${GPU_LOCK_PATH:-/tmp/gpu.lock}" | |
| + | ||
| + | # Reentrancy: if an ancestor already holds the lock, don't re-acquire - otherwise a wrapped command | |
| + | # that internally calls another wrapped command would deadlock waiting on a lock its own parent holds. | |
| + | if [ -n "${GPU_LOCK_HELD:-}" ]; then | |
| + |   exec "$@" | |
| + | fi | |
| + | ||
| + | exec 9>"$LOCK" | |
| + | echo "[gpu_run] waiting for GPU lock..." >&2 | |
| + | flock 9 | |
| + | echo "[gpu_run] acquired GPU lock → running: $*" >&2 | |
| + | export GPU_LOCK_HELD=1 # propagate to children + any nested gpu_run so they skip self-locking | |
| + | "$@" | |
| + | rc=$? | |
| + | echo "[gpu_run] done (rc=$rc), releasing GPU lock" >&2 | |
| + | exit $rc |
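
The reentrancy check near the top of `gpu_run.sh` is easiest to trust after watching a nested call get through. The sketch below writes a stripped-down copy of the same pattern to a temp file and has it invoke itself, with `timeout` turning a would-be deadlock into a visible failure. It assumes `flock` (util-linux) and `timeout` (coreutils) are available. `lock_run.sh`, `work`, `nested_out` and `nested_rc` are hypothetical names for the demo.

```shell
#!/bin/bash
# Sketch only: a cut-down copy of the gpu_run.sh pattern, written to a temp file so it can call itself.
unset GPU_LOCK_HELD                       # start clean in case the caller's environment has it set
work="$(mktemp -d)"
wrapper="$work/lock_run.sh"               # hypothetical stand-in for gpu_run.sh
export GPU_LOCK_PATH="$work/gpu.lock"     # keep the demo away from the real /tmp/gpu.lock

cat >"$wrapper" <<'EOF'
#!/bin/bash
LOCK="${GPU_LOCK_PATH:-/tmp/gpu.lock}"
# Nested call: an ancestor already holds the lock, so run the command directly.
if [ -n "${GPU_LOCK_HELD:-}" ]; then
  exec "$@"
fi
exec 9>"$LOCK"
flock 9
export GPU_LOCK_HELD=1
"$@"
EOF

# Outer wrapper runs inner wrapper runs echo. Without the GPU_LOCK_HELD check the inner call would
# open the lock file again and block on the lock its own parent holds, until timeout killed it.
nested_out="$(timeout 10 bash "$wrapper" bash "$wrapper" echo nested-ok)"
nested_rc=$?

echo "rc=$nested_rc out=$nested_out"
rm -rf "$work"
```

To see the deadlock this guards against, delete the `if` block from the heredoc and rerun: the nested call should hang on `flock 9` until `timeout` ends it with a non-zero status.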