docs: v0.16.1 — minions worker deployment guide (from #287) (#317)

* docs: v0.16.1 — minions worker deployment guide (from #287)

  New docs/guides/minions-deployment.md covering persistent worker deploy patterns (watchdog cron, inline --follow for cron-only workloads) plus the sharp edges of running gbrain jobs work against Supabase in production.

  Addresses a real gap: existing minions docs (minions-fix.md, minions-shell-jobs.md) cover schema repair and shell-job security, not deploy patterns. With v0.16.0's durable agent runtime, the persistent worker is now load-bearing for subagent + subagent_aggregator handlers too, so a supervised deploy story matters.

  Pre-landing accuracy pass corrected five factual bugs against current source:
  - max_stalled column default (5, not 1 or 3)
  - stalled-jobs smoke-test query (active, not waiting)
  - watchdog SIGTERM-to-SIGKILL grace (10s minimum, not 2s)
  - cron env pattern (crontab env lines, not source ~/.bashrc)
  - --follow exit semantics (blocks until submitted job is terminal, not until queue is empty)

  Docs-only. No code changed. Zero migration required.

  Contributed by a downstream agent fork via #287.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: credit Wintermute correctly in v0.16.1 CHANGELOG

  Wintermute is gbrain's own OpenClaw instance running in production, not a community contributor. The original CHANGELOG framing ("community contributor @wintermute") understated the funnier truth: the agent built on top of the project wrote the deploy guide for the project after hitting its sharp edges in production. Dogfooding with extra steps.

  Co-Authored-By: Wintermute (OpenClaw) <noreply@anthropic.com>
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: rewrite minions deployment guide for agent line-by-line execution

  Fixes 12 findings from reading v0.16.1 guide as-an-agent would:

  Real bugs:
  - Crontab syntax wrong for user crontabs (6-field format dumped into `crontab -e` got "bad minute" or parsed `user` as the command). Now two labeled blocks: 5-field for `crontab -e`, 6-field for `/etc/crontab`.
  - Watchdog restart loop (old shutdown lines in unrotated log re-matched every 5 min forever). New `minion-watchdog.sh` writes 2-line PID file (PID + restart epoch) and only considers log lines newer than the epoch. Regex rewritten explicit (mawk rejects `{n}` intervals).
  - Credentials in world-readable /etc/crontab. Secrets move to /etc/gbrain.env (mode 600), referenced via BASH_ENV in crontab.

  Structural:
  - Preconditions block (5 fail-fast checks).
  - "Which option?" decision tree.
  - Template variable table (6 vars documented).
  - Upgrade section (v0.13.x -> v0.16.2 checklist).
  - Option 3: systemd.service + Procfile + fly.toml.partial snippets.
  - Uninstall section.
  - `--follow` example uses `gbrain embed --stale` (a real command) instead of the fictional `gbrain enrich`.
  - Dead-end "Proposed CLI flags (not yet implemented)" replaced with a "Tune per-job today" callout pointing at flags that exist.
  - Known Issues rewritten as imperatives.

  Also wires `docs/guides/minions-deployment.md` into `scripts/llms-config.ts` under the Configuration section so remote agents fetching llms.txt / llms-full.txt see the guide by name.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v0.16.2)

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: sync v0.16.2 CHANGELOG with the actual --follow example in the guide

  The shipped docs/guides/minions-deployment.md uses `gbrain embed --stale` (a real command) but the v0.16.2 CHANGELOG entry still referenced `gbrain enrich --brain $GBRAIN_WORKSPACE` (the older draft). Bring the CHANGELOG in line with what actually shipped.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 00:01:08 -07:00
parent 0e9f8814a5
commit 418d955fd3
12 changed files with 935 additions and 2 deletions
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -2,6 +2,100 @@

 All notable changes to GBrain will be documented in this file.

+## [0.16.2] - 2026-04-22
+
+## **The deployment guide now reads like a runbook an agent can execute line-by-line.**
+## **Three real bugs from v0.16.1 fixed, nine DX gaps closed.**
+
+v0.16.1 shipped the Minions worker deployment guide. Re-reading it as the agent it was written for, top-to-bottom, copy-pasting every block, surfaced twelve issues a human skim-reader would not catch. Three are real bugs that break a first-time deploy. Nine are structural gaps that force the agent to invent values.
+
+The bugs: the crontab example used `*/5 * * * * user bash /path/...` which is `/etc/crontab` format only, so an agent running `crontab -e` and pasting it got "bad minute" or parsed `user` as the command. The watchdog script grepped `tail -20` of an unrotated log for shutdown markers, so every 5-minute tick after the first restart re-matched the old shutdown line forever and killed the healthy worker on loop. And `DATABASE_URL=postgresql://user:pass@...` lived directly in `/etc/crontab`, which is mode 644 (world-readable).
+
+The gaps: no preconditions block, no "which option should I pick" selector, hardcoded `/path/to/...` and `/my/workspace` throughout with no template-variable legend, no upgrade section (so an agent coming from v0.13.x had no idea `GBRAIN_ALLOW_SHELL_JOBS=1` is now required or that `max_stalled` flipped from 1 to 5), no alternative to bare cron for Fly/Render/systemd deployments, a "Proposed CLI flags (not yet implemented)" block that an agent would copy and get `unrecognized flag`, and a `MinionWorker.maxStalledCount` note that did not tell the agent what to do.
+
+### What this means for operators
+
+The guide is now copy-pasteable without invention. Every `$VAR` is documented in a table at the top. Every code block runs as-is on the target it claims. The watchdog writes a two-line PID file (PID + restart epoch) and the shutdown check only considers log lines newer than the epoch, which is the actual fix for the restart loop. Secrets live in `/etc/gbrain.env` (mode 600), referenced via `BASH_ENV=/etc/gbrain.env` in crontab. A new Option 3 ships a systemd unit, a Procfile, and a fly.toml fragment so Fly/Render/Railway/systemd users skip cron entirely. The upgrade section walks the v0.13.x → v0.16.2 checklist (stop worker, apply migrations, add `GBRAIN_ALLOW_SHELL_JOBS`, swap the watchdog).
+
+The shipped watchdog was verified against an abbreviated end-to-end test (3 ticks in ~30 seconds inside an Ubuntu 22.04 container): tick 1 starts the worker and writes the 2-line PID file; tick 2 sees a shutdown line with a 1-hour-old timestamp and correctly does nothing; tick 3 sees a fresh shutdown line and correctly restarts. The regex was caught and fixed during the test when mawk rejected `{n}` interval quantifiers. The systemd unit was smoked in a privileged container with `Restart=always` firing a second banner after a 10-second `RestartSec` window, confirming crash-recovery works before any host ever boots the unit.
+
+## To take advantage of v0.16.2
+
+`gbrain upgrade` pulls the new guide. If you deployed under v0.16.1 with the original watchdog, swap it:
+
+1. **Re-read the guide:**
+   ```bash
+   less docs/guides/minions-deployment.md
+   ```
+2. **Swap the watchdog script.** The v0.16.1 version has the restart-loop bug:
+   ```bash
+   sudo install -m 755 docs/guides/minions-deployment-snippets/minion-watchdog.sh \
+     /usr/local/bin/minion-watchdog.sh
+   ```
+3. **Move secrets out of crontab.** Put `DATABASE_URL` and `GBRAIN_ALLOW_SHELL_JOBS=1` into `/etc/gbrain.env` (mode 600), reference it from crontab via `BASH_ENV=/etc/gbrain.env`.
+4. **Fix the cron form.** If you pasted the v0.16.1 `*/5 * * * * user bash ...` into `crontab -e`, drop the `user` column and the explicit `bash` prefix.
+5. **If you have shell access to a long-running box,** consider Option 3 (systemd) instead of Option 1 (watchdog). systemd replaces the watchdog entirely and is the cleanest path.
+
+No schema change. No data migration. Docs + snippets only.
+
+### Itemized changes
+
+**Fixed**
+- **Crontab syntax now matches the target.** Two labeled blocks: 5-field for `crontab -e`, 6-field with user column for `/etc/crontab`. An agent no longer hits "bad minute" or has `user` parsed as the command.
+- **Watchdog restart loop killed.** The shipped `minion-watchdog.sh` writes a two-line PID file (PID on line 1, restart epoch on line 2) and only considers log lines whose ISO-8601 timestamp is newer than the epoch. Stale shutdown lines from earlier restarts no longer re-match every 5 minutes forever. Regex rewritten to use explicit `[0-9][0-9][0-9][0-9]` instead of `{4}` intervals because mawk (Debian/Ubuntu's default awk) rejects interval quantifiers. Verified end-to-end in a 3-tick abbreviated test inside Ubuntu 22.04.
+- **Credentials off the world-readable filesystem.** Secrets move to `/etc/gbrain.env` (mode 600, owned by the worker user), referenced via `BASH_ENV=/etc/gbrain.env` in crontab. `/etc/crontab` is mode 644 and user crontabs under `/var/spool/cron/` are readable by root. A new `gbrain.env.example` ships in-repo with the full env surface.
+
+**Added**
+- **Preconditions block.** Five checks at the top of the guide: `gbrain` on PATH, DB connectivity, schema version, crontab write access, and the `GBRAIN_ALLOW_SHELL_JOBS=1` requirement for shell-job workers. Agent fails fast on setup, not content.
+- **Decision tree.** "Which option?" selector at the top of the deployment section. Subagent workloads and long jobs take Option 1. Scheduled scripts take Option 2. Managed platforms without shell access, and systemd hosts, take Option 3. Replaces the previous "recommended for X" prose that forced re-reading.
+- **Template variable table.** Six variables (`$GBRAIN_BIN`, `$GBRAIN_WORKER_USER`, `$GBRAIN_WORKER_PID_FILE`, `$GBRAIN_WORKER_LOG_FILE`, `$GBRAIN_WORKSPACE`, `$GBRAIN_ENV_FILE`) with meaning and typical value. Agent substitutes once, everything downstream lands correctly.
+- **Upgrade section.** v0.13.x → v0.16.2 checklist: stop the worker, run migrations, add `GBRAIN_ALLOW_SHELL_JOBS=1` for shell jobs, handle the `max_stalled` default flip from 1 to 5, swap the v0.16.1 watchdog for the current one.
+- **Option 3: service manager.** New `systemd.service`, `Procfile`, and `fly.toml.partial` ship under `docs/guides/minions-deployment-snippets/`. systemd replaces the watchdog entirely with `Restart=always` + `RestartSec=10s` and runs the worker as an unprivileged user with `PrivateTmp`, `ProtectSystem=strict`, and `ReadWritePaths`. Smoked end-to-end in a privileged container: banner fired twice across a 10-second restart cycle, `Restart=always` honored, unit enabled for boot persistence.
+- **Uninstall section.** One-paragraph rollback for each option.
+- **`docs/guides/minions-deployment.md` listed in `scripts/llms-config.ts`.** Remote agents fetching `llms.txt` or `llms-full.txt` now see the deployment guide without having to guess its path.
+
+**Changed**
+- **`--follow` example uses a gbrain subcommand, not `node my-script.mjs`.** The new example submits `gbrain embed --stale` as a shell job on a dedicated queue with `--timeout-ms 600000`. Maps directly onto how an OpenClaw-style agent actually schedules brain maintenance.
+- **"Proposed CLI flags (not yet implemented)" dead-end removed.** Replaced with a "Tune per-job today" callout pointing at the `gbrain jobs submit` flags that exist in source (`--max-stalled`, `--backoff-type`, `--backoff-delay`, `--backoff-jitter`, `--timeout-ms`, `--idempotency-key` — all first-class since v0.13.1).
+- **Known Issues rewritten as imperatives.** "DO NOT pass `maxStalledCount` to `MinionWorker`" leads the paragraph, followed by the reason and the correct knob (`gbrain jobs submit --max-stalled N`). Zombie-shell-children section leads with the 10s / 30s numbers and the action.
+
+Contributed by garrytan (issue report). Fixes verified by an abbreviated end-to-end test suite (render-check + watchdog 3-tick + systemd container smoke + `bun test` + full E2E DB lifecycle).
+
+## [0.16.1] - 2026-04-22
+
+## **Minions worker deployment, finally documented.**
+## **If you run `gbrain jobs work` in production, there's now a guide for the sharp edges.**
+
+Garry's OpenClaw (gbrain's own instance, out there actually running `gbrain jobs work` in production) wrote a real deployment guide for the Minions worker, the piece of gbrain most operators hit next after getting sync running. Agents dogfooding the project they live on is a weird, good feedback loop. Two patterns: a watchdog cron for persistent workers, and an inline `--follow` for cron-only workloads. It covers the connection-drop, stall-detector, and zombie-child traps that show up once your brain is actually working for you. Every command and every default in the guide is checked against current source (`max_stalled = 5`, not 1 or 3; `--follow` exits on submitted-job-terminal, not queue-empty; stalled jobs show up as `active`, not `waiting`). Nothing about this was obvious, and nothing about it was in the docs before.
+
+With v0.16.0's durable agent runtime now shipping, the persistent worker is load-bearing for a lot more (`subagent` + `subagent_aggregator` handlers run there too). A supervised deployment story is the sharp end of the stick.
+
+### What this means for operators
+
+If you have been running the Minions worker under `nohup` with no restart story, this guide is the missing manual. Copy the watchdog script, paste the crontab env lines (`SHELL=/bin/bash`, `PATH`, `DATABASE_URL`, `GBRAIN_ALLOW_SHELL_JOBS=1`), and wire the cron to run every 5 minutes. You get a restart loop that handles the three silent-death modes: DB connection blip, lock-renewal stall, event loop wedge.
+
+If you are running scheduled shell jobs only, skip the persistent worker and use `--follow`. 2-3 seconds of startup overhead is trivial when your job runs for a minute.
+
+Docs-only release. No code changed. Zero migration required.
+
+## To take advantage of v0.16.1
+
+`gbrain upgrade` pulls the new guide. Read it:
+
+1. **Open the guide:**
+   ```bash
+   less docs/guides/minions-deployment.md
+   ```
+   Or browse it on GitHub.
+2. **Persistent worker:** copy `minion-watchdog.sh`, set crontab env lines, wire a `*/5 * * * *` cron.
+3. **Scheduled shell jobs only:** rewrite your cron as `gbrain jobs submit shell ... --follow --timeout-ms N` and drop the persistent worker entirely.
+4. **The "Proposed CLI flags" section** (`--lock-duration` / `--max-stalled` / `--stall-interval` on `gbrain jobs work`): those are on the roadmap. Per-job `--max-stalled` on `gbrain jobs submit` is already real and writes to the row's column directly.
+
+### Itemized changes
+
+**Added**
+- **Minions worker deployment guide** — new `docs/guides/minions-deployment.md` covering watchdog cron patterns, inline `--follow` for cron-only workloads, and the sharp edges of running `gbrain jobs work` against Supabase in production. Addresses a real gap: existing Minions docs (`minions-fix.md`, `minions-shell-jobs.md`) cover schema repair and shell-job security, not deploy patterns. Contributed by your OpenClaw via #287. Pre-landing accuracy pass corrected five factual bugs against current source: the `max_stalled` column default (5, not 1 or 3), the stalled-jobs smoke-test query (`active`, not `waiting`), the SIGTERM-to-SIGKILL grace window (10s minimum, not 2s), the cron env pattern (crontab env lines, not `source ~/.bashrc`), and the `--follow` exit semantics (blocks until submitted job is terminal, not until queue is empty).
+
 ## [0.16.0] - 2026-04-20

 ## **Durable agents land. Your LLM loops survive crashes, timeouts, and worker restarts now.**
--- a/2
+++ b/2
@@ -1 +1 @@
-0.16.0
+0.16.2
--- a/docs/guides/minions-deployment-snippets/Procfile
+++ b/docs/guides/minions-deployment-snippets/Procfile
@@ -0,0 +1,10 @@
+# Procfile — Render / Railway / Heroku.
+#
+# Fly.io users: see fly.toml.partial instead.
+#
+# Set secrets via the platform's env UI or CLI (e.g. `heroku config:set`,
+# `render env:set`, `railway variables set`). At minimum:
+#   DATABASE_URL=postgresql://...
+#   GBRAIN_ALLOW_SHELL_JOBS=1   # only if submitting shell jobs
+
+worker: gbrain jobs work --concurrency 2
--- a/docs/guides/minions-deployment-snippets/fly.toml.partial
+++ b/docs/guides/minions-deployment-snippets/fly.toml.partial
@@ -0,0 +1,22 @@
+# fly.toml — partial. Merge into your existing fly.toml.
+#
+# Set secrets once (never commit them):
+#   fly secrets set DATABASE_URL='postgresql://user:pass@host:6543/db?prepare=false'
+#   fly secrets set GBRAIN_ALLOW_SHELL_JOBS=1   # only if submitting shell jobs
+#   fly secrets set ANTHROPIC_API_KEY=...       # optional
+#
+# Fly.io auto-restarts the process on crash — no watchdog needed.
+
+[processes]
+  worker = "gbrain jobs work --concurrency 2"
+
+# Scale the worker process to 1 machine (job queue serializes work; more
+# machines means higher concurrency but also more Postgres connections).
+#   fly scale count worker=1
+
+# If you want the worker in its own VM size:
+# [[vm]]
+#   processes = ["worker"]
+#   memory = "512mb"
+#   cpu_kind = "shared"
+#   cpus = 1
--- a/docs/guides/minions-deployment-snippets/gbrain.env.example
+++ b/docs/guides/minions-deployment-snippets/gbrain.env.example
@@ -0,0 +1,35 @@
+# /etc/gbrain.env — secrets + env for the gbrain worker.
+#
+# Install:
+#   sudo install -m 600 -o $GBRAIN_WORKER_USER -g $GBRAIN_WORKER_USER \
+#     gbrain.env.example /etc/gbrain.env
+#   sudoedit /etc/gbrain.env   # fill in real values
+#
+# Referenced from crontab via BASH_ENV=/etc/gbrain.env, or from systemd
+# via EnvironmentFile=/etc/gbrain.env. Never commit real secrets.
+
+# --- Required ---------------------------------------------------------------
+
+# Postgres connection string. For Supabase transaction pooler, include
+# prepare=false (see CLAUDE.md #284/#286).
+DATABASE_URL=postgresql://user:pass@host:6543/db?prepare=false
+
+# --- Required if you submit `shell` jobs ------------------------------------
+# Only the worker process needs this. Submitters do not.
+GBRAIN_ALLOW_SHELL_JOBS=1
+
+# --- Optional ---------------------------------------------------------------
+
+# LLM provider keys (needed for `subagent` handler, transcription, enrichment).
+# ANTHROPIC_API_KEY=
+# OPENAI_API_KEY=
+
+# Custom handler plugins (see docs/guides/plugin-handlers.md).
+# GBRAIN_PLUGIN_PATH=/etc/gbrain/plugins
+
+# Pool size tuning for Supabase transaction pooler (default 10; drop to 2
+# if you hit MaxClients during upgrade subprocess spawns).
+# GBRAIN_POOL_SIZE=2
+
+# Connection-level concurrency cap for Anthropic Messages API.
+# GBRAIN_ANTHROPIC_MAX_INFLIGHT=4
--- a/docs/guides/minions-deployment-snippets/minion-watchdog.sh
+++ b/docs/guides/minions-deployment-snippets/minion-watchdog.sh
@@ -0,0 +1,68 @@
+#!/bin/bash
+# minion-watchdog.sh — restart gbrain jobs work if the process is dead or
+# has logged a shutdown marker since its last start.
+#
+# Fixes the v0.16.1 restart-loop bug: old shutdown lines from previous
+# restarts stayed in the unrotated log and every tick re-matched them
+# forever. This version writes a restart epoch to line 2 of the PID file
+# and only considers log lines newer than that epoch.
+#
+# Run every 5 minutes from crontab. See docs/guides/minions-deployment.md.
+set -u
+
+PID_FILE="${GBRAIN_WORKER_PID_FILE:-/tmp/gbrain-worker.pid}"
+LOG_FILE="${GBRAIN_WORKER_LOG_FILE:-/tmp/gbrain-worker.log}"
+GBRAIN="${GBRAIN_BIN:-/usr/local/bin/gbrain}"
+CONCURRENCY="${GBRAIN_WORKER_CONCURRENCY:-2}"
+
+start_worker() {
+  # Append (>>) so the log survives restarts, as the guide describes; stderr
+  # is merged so banner lines ("[minion worker] shell handler enabled",
+  # "worker shutting down") all land in $LOG_FILE.
+  nohup "$GBRAIN" jobs work --concurrency "$CONCURRENCY" \
+    >> "$LOG_FILE" 2>&1 &
+  local pid=$!
+  # Line 1: PID. Line 2: restart epoch (seconds since 1970).
+  # Readers that want just PID use `head -n1 "$PID_FILE"`.
+  printf '%s\n%s\n' "$pid" "$(date +%s)" > "$PID_FILE"
+}
+
+shutdown_since_restart() {
+  # Only match shutdown lines logged AFTER the most recent restart epoch.
+  # Worker log lines start with ISO-8601 UTC timestamps ("2026-04-21T19:05:12Z ...").
+  local restart_epoch
+  restart_epoch=$(sed -n '2p' "$PID_FILE" 2>/dev/null || echo 0)
+  [ -z "$restart_epoch" ] && restart_epoch=0
+
+  # POSIX-portable regex (no {n} intervals — mawk on Debian/Ubuntu rejects them).
+  awk -v since="$restart_epoch" '
+    match($0, /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]T[0-9:.+Z-]+/) {
+      ts_str = substr($0, RSTART, RLENGTH)
+      cmd = "date -d \"" ts_str "\" +%s 2>/dev/null"
+      cmd | getline ts
+      close(cmd)
+      if (ts + 0 > since + 0) print
+    }
+  ' "$LOG_FILE" 2>/dev/null | grep -q "worker stopped\|worker shutting down"
+}
+
+if [ -f "$PID_FILE" ]; then
+  PID=$(head -n1 "$PID_FILE")
+  if [ -n "$PID" ] && kill -0 "$PID" 2>/dev/null; then
+    # Process alive — check whether the worker logged an internal shutdown
+    # AFTER the last start. If yes, worker is dead-inside; restart.
+    if shutdown_since_restart; then
+      kill "$PID" 2>/dev/null
+      # 10s grace: covers shell handler's 5s child SIGTERM→SIGKILL window
+      # and leaves room for in-flight jobs to flush. Bump to 30 if your
+      # jobs run > 10s.
+      sleep 10
+      kill -9 "$PID" 2>/dev/null
+      start_worker
+    fi
+  else
+    # PID file exists but process is gone (crash / kill -9 / reboot).
+    start_worker
+  fi
+else
+  start_worker
+fi
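The restart-epoch filter is the part of this script most worth exercising in isolation. The sketch below restates the same idea in plain bash, assuming GNU `date` and log lines that begin with an ISO-8601 timestamp. The helper names `write_pid_file` and `lines_since_restart` are hypothetical and do not exist in the shipped script.

```bash
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not part of minion-watchdog.sh.
# Assumes GNU date (date -d) and ISO-8601 timestamps at the start of each log line.

# Write the two-line PID file: line 1 = PID, line 2 = restart epoch (seconds).
write_pid_file() {  # usage: write_pid_file FILE PID EPOCH
  printf '%s\n%s\n' "$2" "$3" > "$1"
}

# Print only the log lines stamped after the restart epoch on line 2.
lines_since_restart() {  # usage: lines_since_restart PID_FILE LOG_FILE
  local since ts line
  since=$(sed -n '2p' "$1" 2>/dev/null || true)
  since=${since:-0}
  while IFS= read -r line; do
    # Skip lines that do not start with a YYYY-MM-DDT... timestamp.
    case $line in
      [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]T*) ;;
      *) continue ;;
    esac
    # The timestamp is the first space-delimited word on the line.
    ts=$(date -u -d "${line%% *}" +%s 2>/dev/null) || continue
    if [ "$ts" -gt "$since" ]; then
      printf '%s\n' "$line"
    fi
  done < "$2"
  return 0
}
```

Piping the output through `grep -q 'worker shutting down'` reproduces the watchdog's decision: a shutdown line stamped before the epoch no longer triggers a restart, while one stamped after it does.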
--- a/docs/guides/minions-deployment-snippets/systemd.service
+++ b/docs/guides/minions-deployment-snippets/systemd.service
@@ -0,0 +1,44 @@
+[Unit]
+Description=gbrain minion worker
+Documentation=https://github.com/garrytan/gbrain/blob/master/docs/guides/minions-deployment.md
+After=network-online.target
+Wants=network-online.target
+
+[Service]
+Type=simple
+# Runs as an unprivileged user that owns the brain repo and any shell-job cwds.
+# Create with: sudo useradd --system --home /srv/gbrain --shell /usr/sbin/nologin gbrain
+User=gbrain
+Group=gbrain
+WorkingDirectory=/srv/gbrain
+
+# Env file is mode 600, owned by User=. Do not put secrets in this unit.
+EnvironmentFile=/etc/gbrain.env
+
+ExecStart=/usr/local/bin/gbrain jobs work --concurrency 2
+
+# Replaces the cron watchdog. Restart=always restarts the worker after any exit, clean or not.
+Restart=always
+RestartSec=10s
+
+# Graceful shutdown: SIGTERM → wait → SIGKILL. 30s matches worker grace
+# for in-flight jobs and the shell handler's 5s child SIGTERM window.
+KillSignal=SIGTERM
+TimeoutStopSec=30s
+
+StandardOutput=journal
+StandardError=journal
+SyslogIdentifier=gbrain-worker
+
+# Default 1024 is tight for Bun + Postgres pool + concurrent subagent LLM calls.
+LimitNOFILE=65535
+
+# Hardening (optional — remove if they break your deployment).
+NoNewPrivileges=true
+PrivateTmp=true
+ProtectSystem=strict
+ProtectHome=read-only
+ReadWritePaths=/srv/gbrain
+
+[Install]
+WantedBy=multi-user.target
--- a/docs/guides/minions-deployment.md
+++ b/docs/guides/minions-deployment.md
@@ -0,0 +1,323 @@
+# Minions Worker Deployment Guide
+
+Deploy `gbrain jobs work` so it stays running across crashes, reboots, and
+Postgres connection blips. Written for agents to execute line-by-line.
+
+## The problem
+
+The persistent worker can die silently from:
+
+- Database connection drops (Supabase/Postgres maintenance or network blips).
+- Lock-renewal failures → the stall detector eventually dead-letters jobs.
+- Bun process crashes with no automatic restart.
+- Internal event-loop death (PID alive, worker loop stopped).
+
+When the worker dies, submitted jobs sit in `waiting` forever. Nothing in
+gbrain core auto-restarts the worker — that's what this guide wires up.
+
+## Variables used in this guide
+
+Substitute these once before copy-pasting any snippet.
+
+| Variable | Meaning | Typical value |
+|---|---|---|
+| `$GBRAIN_BIN` | Absolute path to the `gbrain` binary | `$(command -v gbrain)` — often `/usr/local/bin/gbrain` or `~/.bun/bin/gbrain` |
+| `$GBRAIN_WORKER_USER` | OS user that owns the worker process | the same user that ran `gbrain init`; never `root` |
+| `$GBRAIN_WORKER_PID_FILE` | Worker PID + restart-epoch file | `/tmp/gbrain-worker.pid` (or `/var/run/gbrain/worker.pid` for systemd) |
+| `$GBRAIN_WORKER_LOG_FILE` | Worker log sink (stdout + stderr merged) | `/tmp/gbrain-worker.log` (or `/var/log/gbrain/worker.log`) |
+| `$GBRAIN_WORKSPACE` | `cwd` for shell jobs submitted by this deployment | absolute path, e.g. `/srv/my-brain` |
+| `$GBRAIN_ENV_FILE` | Secrets file sourced by crontab / systemd | `/etc/gbrain.env` (mode 600) |
+
+## Preconditions
+
+Run these before Step 1 of any option. Fail fast if something is wrong.
+
+```bash
+# 1. gbrain is on PATH and resolves to an absolute location.
+command -v gbrain || { echo "gbrain not on PATH. Install, then retry."; exit 1; }
+
+# 2. DATABASE_URL points at reachable Postgres (or PGLite path exists).
+gbrain doctor --fast --json | jq '.checks[] | select(.name=="db_connectivity")'
+
+# 3. Schema is up to date. If version=0 or status=="fail", fix it first:
+#    gbrain apply-migrations --yes
+gbrain doctor --fast --json | jq '.checks[] | select(.name=="schema_version")'
+
+# 4. You have write access to at least one crontab mechanism.
+crontab -l >/dev/null 2>&1 && echo "user crontab OK"
+[ -w /etc/crontab ] && echo "/etc/crontab OK"
+
+# 5. If you plan to submit `shell` jobs, the WORKER process needs
+#    GBRAIN_ALLOW_SHELL_JOBS=1 (submitters do not). The handler is gated
+#    in registerBuiltinHandlers(); without the flag the worker startup
+#    line reads "shell handler disabled (...)".
+```
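Check 5 has no command of its own. One way to make it checkable, once a worker has been started, is to look for the startup banner in the worker log. This is a sketch under two assumptions taken from the guide and the shipped watchdog comments: that the enabled banner contains `shell handler enabled` and the disabled one contains `shell handler disabled`. The function name `shell_handler_state` is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch only: shell_handler_state is a hypothetical helper, not a gbrain command.
# Assumes the worker log contains "shell handler enabled" or "shell handler disabled".

# Print enabled, disabled, or unknown based on the most recent banner in the log.
shell_handler_state() {  # usage: shell_handler_state LOG_FILE
  local last
  last=$(grep -oE 'shell handler (enabled|disabled)' "$1" 2>/dev/null | tail -n 1 || true)
  case $last in
    *enabled)  echo enabled ;;
    *disabled) echo disabled ;;
    *)         echo unknown ;;
  esac
}
```

Calling `shell_handler_state "$GBRAIN_WORKER_LOG_FILE"` after the first worker start reports which banner was logged most recently; `unknown` means no banner was found or the log is missing.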
+
+## Which option?
+
+- Your workload runs LLM subagents (`gbrain agent run`) or jobs that take
+  longer than 30 s → **Option 1** (watchdog cron + persistent worker).
+- Your workload is short deterministic scripts on a fixed schedule (every
+  3 h, daily, weekly) → **Option 2** (inline `--follow`).
+- You deploy to a managed platform with no shell access (Fly/Render/Railway),
+  or to any systemd host → **Option 3** (service manager, which replaces cron).
+
+## Option 1: watchdog cron + persistent worker
+
+A 5-minute cron checks whether the worker process is alive **and** whether
+it has logged an internal shutdown since its last start. Restarts if either
+condition fails.
+
+### 1a. Install the env file (secrets stay out of crontab)
+
+Never paste `DATABASE_URL` or API keys into crontab. `/etc/crontab` is
+mode 644 (world-readable); user crontabs under `/var/spool/cron/` are
+readable by `root`. Use the shipped env-file template:
+
+```bash
+sudo install -m 600 -o $GBRAIN_WORKER_USER -g $GBRAIN_WORKER_USER \
+  docs/guides/minions-deployment-snippets/gbrain.env.example /etc/gbrain.env
+sudoedit /etc/gbrain.env
+```
+
+Fill in the connection string and `GBRAIN_ALLOW_SHELL_JOBS=1` (if
+applicable). See
+[`gbrain.env.example`](./minions-deployment-snippets/gbrain.env.example)
+for the full list.
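After editing, it is worth confirming that the file still has the permissions this section relies on. A minimal sketch, assuming GNU `stat` (`stat -c`): the function name `check_env_file` is hypothetical, and the check covers only the file mode and the presence of a `DATABASE_URL=` line, not whether the connection string works.

```bash
#!/usr/bin/env bash
# Sketch only: check_env_file is a hypothetical helper. Assumes GNU stat.

# Succeed only if the env file is readable, mode 600, and defines DATABASE_URL.
check_env_file() {  # usage: check_env_file FILE
  local file=$1 mode
  if [ ! -r "$file" ]; then
    echo "unreadable: $file"
    return 1
  fi
  mode=$(stat -c '%a' "$file")
  if [ "$mode" != "600" ]; then
    echo "mode is $mode, expected 600"
    return 1
  fi
  if ! grep -q '^DATABASE_URL=' "$file"; then
    echo "DATABASE_URL is not set"
    return 1
  fi
  echo "ok"
}
```

Run it as the worker user or as root, since a mode-600 file is unreadable to anyone else.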
+
+### 1b. Install the watchdog script
+
+The [`minion-watchdog.sh`](./minions-deployment-snippets/minion-watchdog.sh)
+ships in-repo and writes a two-line PID file (PID on line 1, restart epoch
+on line 2). The restart-epoch marker is how the watchdog distinguishes
+stale shutdown lines in the log from current ones — without it, every tick
+after the first restart would match an old `worker shutting down` line and
+loop forever.
+
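+Requires GNU `date` (`date -d`), which Linux ships by default. To test the
+watchdog locally on macOS/BSD, run `brew install coreutils` and put its
+`gnubin` directory first on `PATH` in the cron env so `date` resolves to GNU
+`date`; a shell alias is not visible to the `date` call the script makes from
+inside awk. Production Linux boxes work as-is.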
+
+```bash
+sudo install -m 755 -o $GBRAIN_WORKER_USER -g $GBRAIN_WORKER_USER \
+  docs/guides/minions-deployment-snippets/minion-watchdog.sh \
+  /usr/local/bin/minion-watchdog.sh
+```
+
+### 1c. Wire into cron
+
+Pick the form that matches the crontab you're editing.
+
+**If you ran `crontab -e`** (user crontab — 5-field, no user column):
+
+```
+SHELL=/bin/bash
+PATH=/usr/local/bin:/usr/bin:/bin
+BASH_ENV=/etc/gbrain.env
+*/5 * * * * /usr/local/bin/minion-watchdog.sh
+```
+
+**If you edited `/etc/crontab` directly** (system crontab — 6-field, with
+user column):
+
+```
+SHELL=/bin/bash
+PATH=/usr/local/bin:/usr/bin:/bin
+BASH_ENV=/etc/gbrain.env
+*/5 * * * * gbrain /usr/local/bin/minion-watchdog.sh
+```
+
+In both forms, `BASH_ENV=/etc/gbrain.env` tells non-interactive bash to
+source the env file before running the watchdog — that's how the
+connection string and `GBRAIN_ALLOW_SHELL_JOBS` reach the worker without
+landing in the world-readable crontab itself.
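Whether a value from the env file reaches the worker depends on whether bash exports it. A plain `NAME=value` line sourced through `BASH_ENV` becomes a shell variable in the sourcing bash, and non-bash child processes inherit it only if it is exported (for example with `set -a` around the assignments, or an `export` prefix). The sketch below is one way to check this on your own host before trusting the cron wiring. The helper names `child_sees` and `report_child_env` are hypothetical, and `printenv` stands in for the worker binary; nothing here is taken from gbrain itself.

```bash
#!/usr/bin/env bash
# Sketch only: child_sees and report_child_env are hypothetical helpers for
# checking BASH_ENV wiring. printenv stands in for the worker process.

# Succeed if a non-bash child of a BASH_ENV-sourcing shell has NAME in its
# environment.
child_sees() {  # usage: child_sees ENV_FILE NAME
  BASH_ENV="$1" bash -c 'printenv "$1" >/dev/null' _ "$2"
}

# Report the result in words so it can be read or asserted on.
report_child_env() {  # usage: report_child_env ENV_FILE NAME
  if child_sees "$1" "$2"; then
    echo "exported"
  else
    echo "not exported"
  fi
}
```

Running `report_child_env /etc/gbrain.env DATABASE_URL` as the worker user shows whether the worker would inherit the connection string from the env file as it is currently written.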
+
+### 1d. Log rotation
+
+The watchdog appends to the worker log across restarts, so the file grows
+without bound unless something rotates it. Rotate it externally with
+`logrotate`:
+
+```
+# /etc/logrotate.d/gbrain-worker
+/tmp/gbrain-worker.log {
+  daily
+  rotate 7
+  missingok
+  notifempty
+  copytruncate
+}
+```
+
+`copytruncate` is important — the watchdog's restart-epoch check survives
+it (the epoch is compared against in-log timestamps, not file inode).
+
+## Option 2: inline `--follow` (no persistent worker)
+
+Each cron run brings its own temporary worker. `--follow` starts one on
+the queue and blocks until the just-submitted job reaches a terminal state
+(`completed` / `failed` / `dead` / `cancelled`). 2-3 s startup overhead
+per job; negligible vs job duration for scheduled work.
+
+Example: nightly brain enrichment as a shell job.
+
+```bash
+GBRAIN_ALLOW_SHELL_JOBS=1 gbrain jobs submit shell \
+  --queue nightly-enrich \
+  --params "{\"cmd\":\"$GBRAIN_BIN embed --stale\",\"cwd\":\"$GBRAIN_WORKSPACE\"}" \
+  --follow \
+  --timeout-ms 600000
+```
+
+Replace `gbrain embed --stale` with whichever gbrain subcommand you're
+scheduling (`sync`, `extract`, `orphans`, `doctor`, `check-backlinks`,
+`lint`, `autopilot`). If you're shelling out to a non-gbrain binary,
+keep its absolute path in the `cmd`.
+
+**Shared-queue gotcha.** If other jobs are already waiting on the same
+queue with higher priority or earlier `created_at`, the temporary worker
+processes those first before reaching yours. `--follow` still exits only
+when YOUR job finishes. For strict single-job semantics on shared queues,
+use a dedicated queue name like `nightly-enrich` above.
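For cron, the Option 2 invocation can be easier to maintain as a small wrapper function than as a one-line crontab entry. The following is a sketch, not a shipped script: the function name `submit_nightly_embed` is hypothetical, the flags are the ones used in the example above, and the JSON is assembled with `printf` on the assumption that neither `$GBRAIN_BIN` nor `$GBRAIN_WORKSPACE` contains double quotes or backslashes.

```bash
#!/usr/bin/env bash
# Sketch only: submit_nightly_embed is a hypothetical wrapper around the
# `gbrain jobs submit shell ... --follow` invocation shown in the guide.
# Assumes GBRAIN_BIN and GBRAIN_WORKSPACE contain no double quotes or backslashes.

submit_nightly_embed() {
  local bin=${GBRAIN_BIN:-/usr/local/bin/gbrain}
  local workspace=${GBRAIN_WORKSPACE:?set GBRAIN_WORKSPACE to an absolute path}
  local params
  # Build {"cmd":"<bin> embed --stale","cwd":"<workspace>"}.
  params=$(printf '{"cmd":"%s embed --stale","cwd":"%s"}' "$bin" "$workspace")
  # The dedicated queue keeps the temporary worker from picking up other jobs first.
  GBRAIN_ALLOW_SHELL_JOBS=1 "$bin" jobs submit shell \
    --queue nightly-enrich \
    --params "$params" \
    --follow \
    --timeout-ms 600000
}
```

Setting `GBRAIN_BIN=echo` before calling the function prints the argument list instead of submitting a job, which is a cheap way to inspect the generated `--params` JSON before wiring the wrapper into cron.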
+
+## Option 3: service manager (systemd / Fly / Render / Railway)
+
+Replaces the watchdog entirely. No cron, no PID file, no restart-loop.
+The service manager owns liveness.
+
+### systemd (Linux hosts with shell access)
+
+```bash
+# Create the worker user if it doesn't exist.
+sudo useradd --system --home "$GBRAIN_WORKSPACE" --shell /usr/sbin/nologin gbrain \
+  2>/dev/null || true
+sudo mkdir -p "$GBRAIN_WORKSPACE" && sudo chown gbrain:gbrain "$GBRAIN_WORKSPACE"
+
+# Install the unit file, substituting /srv/gbrain → your workspace path.
+sudo install -m 644 docs/guides/minions-deployment-snippets/systemd.service \
+  /etc/systemd/system/gbrain-worker.service
+sudo sed -i "s|/srv/gbrain|$GBRAIN_WORKSPACE|g" \
+  /etc/systemd/system/gbrain-worker.service
+
+# See 1a above for /etc/gbrain.env install.
+sudo systemctl daemon-reload
+sudo systemctl enable --now gbrain-worker
+sudo systemctl status gbrain-worker
+journalctl -u gbrain-worker -n 50
+```
+
+`Restart=always` + `RestartSec=10s` give you crash-loop recovery. The unit
+runs as an unprivileged `gbrain` user with `PrivateTmp`, `ProtectSystem=strict`,
+and `ReadWritePaths=$GBRAIN_WORKSPACE`. `LimitNOFILE=65535` in the shipped
+unit covers Bun + Postgres pool + concurrent LLM subagent calls without
+hitting the default 1024 cap.
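Because the `sed` step rewrites paths in place, checking that both path directives now point at the workspace can catch a bad substitution before `systemctl enable --now`. A sketch with a hypothetical helper name, `unit_paths_match`: it only reads the unit file and makes no systemd calls.

```bash
#!/usr/bin/env bash
# Sketch only: unit_paths_match is a hypothetical helper, not part of gbrain.

# Succeed if WorkingDirectory= and ReadWritePaths= both equal WORKSPACE.
unit_paths_match() {  # usage: unit_paths_match UNIT_FILE WORKSPACE
  local unit=$1 workspace=$2 key value
  for key in WorkingDirectory ReadWritePaths; do
    # Take the first assignment of the key found in the unit file.
    value=$(sed -n "s/^${key}=//p" "$unit" | head -n 1)
    if [ "$value" != "$workspace" ]; then
      echo "$key is '${value}', expected '${workspace}'"
      return 1
    fi
  done
  echo "ok"
}
```

For example, `unit_paths_match /etc/systemd/system/gbrain-worker.service "$GBRAIN_WORKSPACE"` prints `ok` when the substitution landed and names the first mismatched directive otherwise.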
+
+### Fly.io
+
+Merge the `[processes]` block from
+[`fly.toml.partial`](./minions-deployment-snippets/fly.toml.partial) into
+your existing `fly.toml`. Set secrets with `fly secrets set` —
+Fly auto-restarts the process on crash.
+
+### Render / Railway / Heroku
+
+Drop [`Procfile`](./minions-deployment-snippets/Procfile) at the repo root.
+Set the connection string and `GBRAIN_ALLOW_SHELL_JOBS=1` via the
+platform's env UI or CLI.
+
+## Upgrading an existing deployment
+
+If you deployed on v0.13.x or earlier, walk this checklist:
+
+1. **Stop the worker before upgrading.**
+   `kill $(head -n1 /tmp/gbrain-worker.pid)` and wait for the process to
+   exit. Otherwise an in-flight job can run against a partially migrated schema.
+2. **Run `gbrain upgrade`**. Then `gbrain apply-migrations --yes` if
+   `gbrain doctor` reports any migration as `partial` or `pending`.
+3. **If you run shell jobs:** from v0.14 onward, the worker requires
+   `GBRAIN_ALLOW_SHELL_JOBS=1` to register the `shell` handler. Add it to
+   `/etc/gbrain.env`. Submitters don't need the flag; only the worker does.
+4. **If you tuned your watchdog for `max_stalled=1`:** v0.14.3 migration
+   v15 raised the schema default to 5 and backfilled existing non-terminal
+   rows. A watchdog tuned around 1-strike dead-lettering will now
+   over-restart because it takes 5 misses to dead-letter. Switch to the
+   shipped watchdog (which keys on log markers, not job state).
+5. **If your v0.16.1 watchdog is still running:** it has a restart-loop
+   bug (old shutdown lines in the unrotated log re-match every 5 min
+   forever). Install the current `minion-watchdog.sh` from this guide's
+   snippets — it writes a restart epoch into the PID file and only
+   considers log lines newer than that epoch.
+6. **Verify.** `gbrain doctor` should report zero `pending` or `partial`
+   migrations. `gbrain jobs stats` should show no unexplained growth in
+   `dead` between pre- and post-upgrade.
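+
+To see what the installed watchdog recorded (step 5), a minimal sketch,
+assuming the two-line PID file layout this guide describes (PID on line 1,
+restart epoch on line 2). The helper name `read_worker_pidfile` is
+hypothetical and is not part of the shipped snippets:
+
+```bash
+# Hypothetical helper: print the PID and restart epoch from a two-line PID file.
+read_worker_pidfile() {
+  local file="${1:-/tmp/gbrain-worker.pid}"
+  [ -r "$file" ] || { echo "no PID file at $file" >&2; return 1; }
+  local pid epoch
+  pid="$(sed -n '1p' "$file")"     # line 1: worker PID
+  epoch="$(sed -n '2p' "$file")"   # line 2: restart epoch (may be absent)
+  echo "pid=$pid epoch=${epoch:-missing}"
+}
+```
+
+An `epoch=missing` result may indicate the PID file was written by an older
+watchdog that did not record a restart epoch.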
+
+## Known issues
+
+### Supabase connection drops
+
+The worker uses a single Postgres connection. If Supabase drops it
+(maintenance, connection limits, network blip), lock renewal fails
+silently. The stall detector then dead-letters the job after
+`max_stalled` misses.
+
+**Current defaults that make this worse:**
+
+- `lockDuration: 30000` (30 s) — too short for long jobs during connection blips.
+- `max_stalled: 5` (schema column default on master — see `src/schema.sql`
+  and `src/core/pglite-schema.ts`). Five missed heartbeats before dead-letter.
+- `stalledInterval: 30000` (30 s) — checks too aggressively.
+
+**Tune per-job today.** `gbrain jobs submit` accepts `--max-stalled N`,
+`--backoff-type fixed|exponential`, `--backoff-delay <ms>`,
+`--backoff-jitter 0..1`, and `--timeout-ms N` as first-class flags
+(since v0.13.1). These write onto the job row at submit time — which is
+what `handleStalled()` reads — so per-job tuning is the real knob today.
+Worker-level `--lock-duration` / `--stall-interval` are on the roadmap;
+until they land, rely on per-job `--max-stalled` plus the watchdog (or
+systemd) for worker health.
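+
+As an illustration of per-job tuning, a sketch that uses only the flags
+listed above. The values, the queue name, and the wrapper name
+`submit_tolerant_job` are assumptions chosen for illustration, not
+recommended defaults:
+
+```bash
+# Hypothetical wrapper: submit a shell job that tolerates more missed
+# heartbeats and retries with exponential backoff.
+submit_tolerant_job() {
+  local cmd="$1" cwd="$2"
+  gbrain jobs submit shell \
+    --queue nightly-enrich \
+    --params "{\"cmd\":\"$cmd\",\"cwd\":\"$cwd\"}" \
+    --max-stalled 10 \
+    --backoff-type exponential \
+    --backoff-delay 5000 \
+    --backoff-jitter 0.2 \
+    --timeout-ms 600000
+}
+```
+
+Call it as `submit_tolerant_job "$GBRAIN_BIN embed --stale" "$GBRAIN_WORKSPACE"`.
+Because the values land on the job row at submit time, each job can carry
+its own tolerance.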
+
+### DO NOT pass `maxStalledCount` to `MinionWorker`
+
+It's a no-op. The stall detector reads the row's `max_stalled` column
+(set at submit time), not the worker option at `src/core/minions/worker.ts:74`.
+Use `gbrain jobs submit --max-stalled N` per-job instead.
+
+### Zombie shell children
+
+When the Bun worker crashes hard, child processes from shell jobs can
+become zombies. The watchdog's 10 s `SIGTERM → SIGKILL` window covers the
+shell handler's 5 s child-kill grace (`KILL_GRACE_MS`). For long-running
+shell jobs, bump the watchdog's `sleep 10` to `sleep 30` so the worker
+has time to flush in-flight jobs before the kill.
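+
+The `SIGTERM → SIGKILL` escalation can be sketched as follows. This is not
+the shipped watchdog's code; the function name `stop_with_grace` and the
+one-second polling loop are assumptions for illustration:
+
+```bash
+# Hypothetical helper: send SIGTERM, wait up to $2 seconds, then SIGKILL.
+stop_with_grace() {
+  local pid="$1" grace="${2:-10}" waited=0
+  kill -TERM "$pid" 2>/dev/null || return 0   # process already gone
+  while kill -0 "$pid" 2>/dev/null; do
+    if [ "$waited" -ge "$grace" ]; then
+      kill -KILL "$pid" 2>/dev/null           # grace expired, force it
+      break
+    fi
+    sleep 1
+    waited=$((waited + 1))
+  done
+}
+```
+
+For the long-running shell jobs mentioned above,
+`stop_with_grace "$(head -n1 /tmp/gbrain-worker.pid)" 30` corresponds to
+the 30 s window.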
+
+## Smoke test
+
+```bash
+# Worker alive?
+kill -0 $(head -n1 /tmp/gbrain-worker.pid) 2>/dev/null && echo ALIVE || echo DEAD
+
+# Aggregate queue health.
+gbrain jobs stats
+
+# Active jobs. A stalled job still shows `active` (expired lock_until) until requeued.
+gbrain jobs list --status active --limit 10
+
+# Dead-lettered jobs.
+gbrain jobs list --status dead --limit 10
+
+# Shell handler registered? (stderr banner merged into log via 2>&1.)
+grep "shell handler enabled" /tmp/gbrain-worker.log
+```
+
+## Uninstall
+
+- **Option 1 (watchdog cron):** `crontab -e`, delete the watchdog line.
+  `kill $(head -n1 /tmp/gbrain-worker.pid) && rm /tmp/gbrain-worker.pid`.
+  Optionally `sudo rm /etc/gbrain.env /usr/local/bin/minion-watchdog.sh`.
+- **Option 2 (inline `--follow`):** remove the cron entry. Nothing else to
+  clean up — temporary workers exit with their jobs.
+- **Option 3 (systemd):** `sudo systemctl disable --now gbrain-worker`,
+  then `sudo rm /etc/systemd/system/gbrain-worker.service /etc/gbrain.env`,
+  then `sudo systemctl daemon-reload`.
+- **Option 3 (Fly/Render/Railway):** delete the `worker` process from
+  `fly.toml` / `Procfile` and redeploy. Secrets set via `fly secrets`
+  persist until `fly secrets unset`.
--- a/llms-full.txt
+++ b/llms-full.txt
@@ -3281,6 +3281,336 @@ echo "Dream cycle complete at $(date)"

 ---

+## docs/guides/minions-deployment.md
+
+Source: https://raw.githubusercontent.com/garrytan/gbrain/master/docs/guides/minions-deployment.md
+
+# Minions Worker Deployment Guide
+
+Deploy `gbrain jobs work` so it stays running across crashes, reboots, and
+Postgres connection blips. Written for agents to execute line-by-line.
+
+## The problem
+
+The persistent worker can die silently from:
+
+- Database connection drops (Supabase/Postgres maintenance or network blips).
+- Lock-renewal failures → the stall detector eventually dead-letters jobs.
+- Bun process crashes with no automatic restart.
+- Internal event-loop death (PID alive, worker loop stopped).
+
+When the worker dies, submitted jobs sit in `waiting` forever. Nothing in
+gbrain core auto-restarts the worker — that's what this guide wires up.
+
+## Variables used in this guide
+
+Substitute these once before copy-pasting any snippet.
+
+| Variable | Meaning | Typical value |
+|---|---|---|
+| `$GBRAIN_BIN` | Absolute path to the `gbrain` binary | `$(command -v gbrain)` — often `/usr/local/bin/gbrain` or `~/.bun/bin/gbrain` |
+| `$GBRAIN_WORKER_USER` | OS user that owns the worker process | the same user that ran `gbrain init`; never `root` |
+| `$GBRAIN_WORKER_PID_FILE` | Worker PID + restart-epoch file | `/tmp/gbrain-worker.pid` (or `/var/run/gbrain/worker.pid` for systemd) |
+| `$GBRAIN_WORKER_LOG_FILE` | Worker log sink (stdout + stderr merged) | `/tmp/gbrain-worker.log` (or `/var/log/gbrain/worker.log`) |
+| `$GBRAIN_WORKSPACE` | `cwd` for shell jobs submitted by this deployment | absolute path, e.g. `/srv/my-brain` |
+| `$GBRAIN_ENV_FILE` | Secrets file sourced by crontab / systemd | `/etc/gbrain.env` (mode 600) |
+
+## Preconditions
+
+Run these before Step 1 of any option. Fail fast if something is wrong.
+
+```bash
+# 1. gbrain is on PATH and resolves to an absolute location.
+command -v gbrain || { echo "gbrain not on PATH. Install, then retry."; exit 1; }
+
+# 2. DATABASE_URL points at reachable Postgres (or PGLite path exists).
+gbrain doctor --fast --json | jq '.checks[] | select(.name=="db_connectivity")'
+
+# 3. Schema is up to date. If version=0 or status=="fail", fix it first:
+#    gbrain apply-migrations --yes
+gbrain doctor --fast --json | jq '.checks[] | select(.name=="schema_version")'
+
+# 4. You have write access to at least one crontab mechanism.
+crontab -l >/dev/null 2>&1 && echo "user crontab OK"
+[ -w /etc/crontab ] && echo "/etc/crontab OK"
+
+# 5. If you plan to submit `shell` jobs, the WORKER process needs
+#    GBRAIN_ALLOW_SHELL_JOBS=1 (submitters do not). The handler is gated
+#    in registerBuiltinHandlers(); without the flag the worker startup
+#    line reads "shell handler disabled (...)".
+```
+
+## Which option?
+
+- Your workload runs LLM subagents (`gbrain agent run`) or jobs that take
+  longer than 30 s → **Option 1** (watchdog cron + persistent worker).
+- Your workload is short deterministic scripts on a fixed schedule (every
+  3 h, daily, weekly) → **Option 2** (inline `--follow`).
+- You deploy to a platform without shell access (Fly/Render/Railway), or
+  your host runs systemd → **Option 3** (service manager; replaces cron).
+
+## Option 1: watchdog cron + persistent worker
+
+A 5-minute cron checks whether the worker process is alive **and** whether
+it has logged an internal shutdown since its last start. Restarts if either
+condition fails.
+
+### 1a. Install the env file (secrets stay out of crontab)
+
+Never paste `DATABASE_URL` or API keys into crontab. `/etc/crontab` is
+mode 644 (world-readable); user crontabs under `/var/spool/cron/` are
+readable by `root`. Use the shipped env-file template:
+
+```bash
+sudo install -m 600 -o $GBRAIN_WORKER_USER -g $GBRAIN_WORKER_USER \
+  docs/guides/minions-deployment-snippets/gbrain.env.example /etc/gbrain.env
+sudoedit /etc/gbrain.env
+```
+
+Fill in the connection string and `GBRAIN_ALLOW_SHELL_JOBS=1` (if
+applicable). See
+[`gbrain.env.example`](./minions-deployment-snippets/gbrain.env.example)
+for the full list.
+
+### 1b. Install the watchdog script
+
+The [`minion-watchdog.sh`](./minions-deployment-snippets/minion-watchdog.sh)
+ships in-repo and writes a two-line PID file (PID on line 1, restart epoch
+on line 2). The restart-epoch marker is how the watchdog distinguishes
+stale shutdown lines in the log from current ones — without it, every tick
+after the first restart would match an old `worker shutting down` line and
+loop forever.
+
+Requires GNU coreutils (Linux default). On macOS/BSD install via
+`brew install coreutils` and alias `date` to `gdate` in the cron env if you
+want to test the watchdog locally; production Linux boxes work as-is.
+
+```bash
+sudo install -m 755 -o $GBRAIN_WORKER_USER -g $GBRAIN_WORKER_USER \
+  docs/guides/minions-deployment-snippets/minion-watchdog.sh \
+  /usr/local/bin/minion-watchdog.sh
+```
+
+### 1c. Wire into cron
+
+Pick the form that matches the crontab you're editing.
+
+**If you ran `crontab -e`** (user crontab — 5-field, no user column):
+
+```
+SHELL=/bin/bash
+PATH=/usr/local/bin:/usr/bin:/bin
+BASH_ENV=/etc/gbrain.env
+*/5 * * * * /usr/local/bin/minion-watchdog.sh
+```
+
+**If you edited `/etc/crontab` directly** (system crontab — 6-field, with
+user column):
+
+```
+SHELL=/bin/bash
+PATH=/usr/local/bin:/usr/bin:/bin
+BASH_ENV=/etc/gbrain.env
+*/5 * * * * gbrain /usr/local/bin/minion-watchdog.sh
+```
+
+In both forms, `BASH_ENV=/etc/gbrain.env` tells non-interactive bash to
+source the env file before running the watchdog — that's how the
+connection string and `GBRAIN_ALLOW_SHELL_JOBS` reach the worker without
+landing in the world-readable crontab itself.
+
+### 1d. Log rotation
+
+The watchdog appends to the worker log across restarts, so the file grows
+without bound unless you rotate it externally with `logrotate`:
+
+```
+# /etc/logrotate.d/gbrain-worker
+/tmp/gbrain-worker.log {
+  daily
+  rotate 7
+  missingok
+  notifempty
+  copytruncate
+}
+```
+
+`copytruncate` is important — the watchdog's restart-epoch check survives
+it (the epoch is compared against in-log timestamps, not file inode).
+
+## Option 2: inline `--follow` (no persistent worker)
+
+Each cron run brings its own temporary worker. `--follow` starts one on
+the queue and blocks until the just-submitted job reaches a terminal state
+(`completed` / `failed` / `dead` / `cancelled`). 2-3 s startup overhead
+per job; negligible vs job duration for scheduled work.
+
+Example: nightly brain enrichment as a shell job.
+
+```bash
+GBRAIN_ALLOW_SHELL_JOBS=1 gbrain jobs submit shell \
+  --queue nightly-enrich \
+  --params "{\"cmd\":\"$GBRAIN_BIN embed --stale\",\"cwd\":\"$GBRAIN_WORKSPACE\"}" \
+  --follow \
+  --timeout-ms 600000
+```
+
+Replace `embed --stale` with whichever gbrain subcommand you're
+scheduling (`sync`, `extract`, `orphans`, `doctor`, `check-backlinks`,
+`lint`, `autopilot`). If you're shelling out to a non-gbrain binary,
+use its absolute path in the `cmd`.
+
+**Shared-queue gotcha.** If other jobs are already waiting on the same
+queue with higher priority or earlier `created_at`, the temporary worker
+processes those before reaching yours. `--follow` still exits only
+when YOUR job finishes. For strict single-job semantics on shared queues,
+use a dedicated queue name like `nightly-enrich` above.
+
+## Option 3: service manager (systemd / Fly / Render / Railway)
+
+Replaces the watchdog entirely. No cron, no PID file, no restart-loop.
+The service manager owns liveness.
+
+### systemd (Linux hosts with shell access)
+
+```bash
+# Create the worker user if it doesn't exist.
+sudo useradd --system --home "$GBRAIN_WORKSPACE" --shell /usr/sbin/nologin gbrain \
+  2>/dev/null || true
+sudo mkdir -p "$GBRAIN_WORKSPACE" && sudo chown gbrain:gbrain "$GBRAIN_WORKSPACE"
+
+# Install the unit file, substituting /srv/gbrain → your workspace path.
+sudo install -m 644 docs/guides/minions-deployment-snippets/systemd.service \
+  /etc/systemd/system/gbrain-worker.service
+sudo sed -i "s|/srv/gbrain|$GBRAIN_WORKSPACE|g" \
+  /etc/systemd/system/gbrain-worker.service
+
+# See 1a above for /etc/gbrain.env install.
+sudo systemctl daemon-reload
+sudo systemctl enable --now gbrain-worker
+sudo systemctl status gbrain-worker
+journalctl -u gbrain-worker -n 50
+```
+
+`Restart=always` + `RestartSec=10s` give you crash-loop recovery. The unit
+runs as an unprivileged `gbrain` user with `PrivateTmp`, `ProtectSystem=strict`,
+and `ReadWritePaths=$GBRAIN_WORKSPACE`. `LimitNOFILE=65535` in the shipped
+unit covers Bun + Postgres pool + concurrent LLM subagent calls without
+hitting the default 1024 cap.
+
+### Fly.io
+
+Merge the `[processes]` block from
+[`fly.toml.partial`](./minions-deployment-snippets/fly.toml.partial) into
+your existing `fly.toml`. Set secrets with `fly secrets set`. Fly
+auto-restarts the process on crash, so no watchdog is needed.
+
+### Render / Railway / Heroku
+
+Drop [`Procfile`](./minions-deployment-snippets/Procfile) at the repo root.
+Set the connection string and `GBRAIN_ALLOW_SHELL_JOBS=1` via the
+platform's env UI or CLI.
+
+## Upgrading an existing deployment
+
+If you deployed on v0.13.x or earlier, walk this checklist:
+
+1. **Stop the worker before upgrading.**
+   `kill $(head -n1 /tmp/gbrain-worker.pid)` and wait for the process to
+   exit. Otherwise an in-flight job can run against a partially migrated schema.
+2. **Run `gbrain upgrade`**. Then `gbrain apply-migrations --yes` if
+   `gbrain doctor` reports any migration as `partial` or `pending`.
+3. **If you run shell jobs:** from v0.14 onward, the worker requires
+   `GBRAIN_ALLOW_SHELL_JOBS=1` to register the `shell` handler. Add it to
+   `/etc/gbrain.env`. Submitters don't need the flag; only the worker does.
+4. **If you tuned your watchdog for `max_stalled=1`:** v0.14.3 migration
+   v15 raised the schema default to 5 and backfilled existing non-terminal
+   rows. A watchdog tuned around 1-strike dead-lettering will now
+   over-restart because it takes 5 misses to dead-letter. Switch to the
+   shipped watchdog (which keys on log markers, not job state).
+5. **If your v0.16.1 watchdog is still running:** it has a restart-loop
+   bug (old shutdown lines in the unrotated log re-match every 5 min
+   forever). Install the current `minion-watchdog.sh` from this guide's
+   snippets — it writes a restart epoch into the PID file and only
+   considers log lines newer than that epoch.
+6. **Verify.** `gbrain doctor` should report zero `pending` or `partial`
+   migrations. `gbrain jobs stats` should show no unexplained growth in
+   `dead` between pre- and post-upgrade.
+
+## Known issues
+
+### Supabase connection drops
+
+The worker uses a single Postgres connection. If Supabase drops it
+(maintenance, connection limits, network blip), lock renewal fails
+silently. The stall detector then dead-letters the job after
+`max_stalled` misses.
+
+**Current defaults that make this worse:**
+
+- `lockDuration: 30000` (30 s) — too short for long jobs during connection blips.
+- `max_stalled: 5` (schema column default on master — see `src/schema.sql`
+  and `src/core/pglite-schema.ts`). Five missed heartbeats before dead-letter.
+- `stalledInterval: 30000` (30 s) — checks too aggressively.
+
+**Tune per-job today.** `gbrain jobs submit` accepts `--max-stalled N`,
+`--backoff-type fixed|exponential`, `--backoff-delay <ms>`,
+`--backoff-jitter 0..1`, and `--timeout-ms N` as first-class flags
+(since v0.13.1). These write onto the job row at submit time — which is
+what `handleStalled()` reads — so per-job tuning is the real knob today.
+Worker-level `--lock-duration` / `--stall-interval` are on the roadmap;
+until they land, rely on per-job `--max-stalled` plus the watchdog (or
+systemd) for worker health.
+
+### DO NOT pass `maxStalledCount` to `MinionWorker`
+
+It's a no-op. The stall detector reads the row's `max_stalled` column
+(set at submit time), not the worker option at `src/core/minions/worker.ts:74`.
+Use `gbrain jobs submit --max-stalled N` per-job instead.
+
+### Zombie shell children
+
+When the Bun worker crashes hard, child processes from shell jobs can
+become zombies. The watchdog's 10 s `SIGTERM → SIGKILL` window covers the
+shell handler's 5 s child-kill grace (`KILL_GRACE_MS`). For long-running
+shell jobs, bump the watchdog's `sleep 10` to `sleep 30` so the worker
+has time to flush in-flight jobs before the kill.
+
+## Smoke test
+
+```bash
+# Worker alive?
+kill -0 $(head -n1 /tmp/gbrain-worker.pid) 2>/dev/null && echo ALIVE || echo DEAD
+
+# Aggregate queue health.
+gbrain jobs stats
+
+# Active jobs. A stalled job still shows `active` (expired lock_until) until requeued.
+gbrain jobs list --status active --limit 10
+
+# Dead-lettered jobs.
+gbrain jobs list --status dead --limit 10
+
+# Shell handler registered? (stderr banner merged into log via 2>&1.)
+grep "shell handler enabled" /tmp/gbrain-worker.log
+```
+
+## Uninstall
+
+- **Option 1 (watchdog cron):** `crontab -e`, delete the watchdog line.
+  `kill $(head -n1 /tmp/gbrain-worker.pid) && rm /tmp/gbrain-worker.pid`.
+  Optionally `sudo rm /etc/gbrain.env /usr/local/bin/minion-watchdog.sh`.
+- **Option 2 (inline `--follow`):** remove the cron entry. Nothing else to
+  clean up — temporary workers exit with their jobs.
+- **Option 3 (systemd):** `sudo systemctl disable --now gbrain-worker`,
+  then `sudo rm /etc/systemd/system/gbrain-worker.service /etc/gbrain.env`,
+  then `sudo systemctl daemon-reload`.
+- **Option 3 (Fly/Render/Railway):** delete the `worker` process from
+  `fly.toml` / `Procfile` and redeploy. Secrets set via `fly secrets`
+  persist until `fly secrets unset`.
+
+---
+
 ## docs/guides/quiet-hours.md

 Source: https://raw.githubusercontent.com/garrytan/gbrain/master/docs/guides/quiet-hours.md
--- a/llms.txt
+++ b/llms.txt
@@ -18,6 +18,7 @@ Repo: https://github.com/garrytan/gbrain
 - [docs/GBRAIN_RECOMMENDED_SCHEMA.md](https://raw.githubusercontent.com/garrytan/gbrain/master/docs/GBRAIN_RECOMMENDED_SCHEMA.md): MECE directory structure (people/, companies/, concepts/).
 - [docs/guides/live-sync.md](https://raw.githubusercontent.com/garrytan/gbrain/master/docs/guides/live-sync.md): Incremental markdown sync setup.
 - [docs/guides/cron-schedule.md](https://raw.githubusercontent.com/garrytan/gbrain/master/docs/guides/cron-schedule.md): Recurring job scheduling.
+- [docs/guides/minions-deployment.md](https://raw.githubusercontent.com/garrytan/gbrain/master/docs/guides/minions-deployment.md): Deploying the gbrain jobs worker: crontab + watchdog, inline --follow, systemd/Procfile/fly.toml, upgrade checklist.
 - [docs/guides/quiet-hours.md](https://raw.githubusercontent.com/garrytan/gbrain/master/docs/guides/quiet-hours.md): Notification hold + timezone-aware delivery.
 - [docs/mcp/DEPLOY.md](https://raw.githubusercontent.com/garrytan/gbrain/master/docs/mcp/DEPLOY.md): MCP server deployment.

--- a/package.json
+++ b/package.json
@@ -1,6 +1,6 @@
 {
  "name": "gbrain",
-  "version": "0.16.0",
+  "version": "0.16.2",
  "description": "Postgres-native personal knowledge brain with hybrid RAG search",
  "type": "module",
  "main": "src/core/index.ts",
--- a/scripts/llms-config.ts
+++ b/scripts/llms-config.ts
@@ -92,6 +92,12 @@ export const SECTIONS: DocSection[] = [
        description: "Recurring job scheduling.",
        path: "docs/guides/cron-schedule.md",
      },
+      {
+        title: "docs/guides/minions-deployment.md",
+        description:
+          "Deploying the gbrain jobs worker: crontab + watchdog, inline --follow, systemd/Procfile/fly.toml, upgrade checklist.",
+        path: "docs/guides/minions-deployment.md",
+      },
      {
        title: "docs/guides/quiet-hours.md",
        description: "Notification hold + timezone-aware delivery.",
@@ -1 +1 @@
-0.16.0
+0.16.2