
Garry Tan 5fd9cd2644 feat: shell job type + worker abort-path fix (v0.13.0) (#217)

* feat(minions): add protected-name constant + ctx.shutdownSignal

Introduce PROTECTED_JOB_NAMES ('shell') in a side-effect-free core module
so queue.ts can check it without importing from handlers/. MinionJobContext
gains shutdownSignal (distinct from signal) — handlers that need to run
SIGTERM-triggered cleanup subscribe to both; most handlers ignore shutdown
and run through the worker's 30s cleanup race to natural completion.

* fix(minions): MinionQueue.add gains trusted 4th arg + trim-normalized guard

Adds allowProtectedSubmit opt-in as a separate 4th parameter (NOT folded into
opts) so callers spreading user-provided opts ({...userOpts}) can't accidentally
carry the trust flag. PROTECTED_JOB_NAMES check runs on the trimmed name BEFORE
insert, closing the queue.add(' shell ', ...) whitespace bypass that would have
evaded a has(name) check.

* fix(minions): worker calls failJob on abort + wires ctx.shutdownSignal

Pre-v0.13.0 worker returned silently when ctx.signal.aborted fired, leaving
jobs in 'active' until stall sweep. Handlers using cooperative cancel had
no deterministic status flip — timeout/cancel/lock-loss all looked the same
from downstream callers (gbrain jobs get, --follow loops).

Fix: derive abort reason from abort.signal.reason ('timeout' | 'cancel' |
'lock-lost' | 'shutdown') and call failJob with 'aborted: <reason>' text.
failJob is idempotent via token+status match, so no-op when another path
already flipped status (handleTimeouts, cancelJob, stall).

Also: new shutdownAbort (instance-level AbortController) fires on process
SIGTERM/SIGINT and propagates to every handler's ctx.shutdownSignal.
Shell handler listens to both signals and runs SIGTERM→5s→SIGKILL on its
child on either; other handlers only listen to ctx.signal so deploy
restarts don't cancel them mid-flight.

* feat(minions): add shell job handler + submission audit log

New 'shell' job type spawns arbitrary commands under the Minions worker.
Deterministic cron scripts (API fetch, token refresh, scrape+write) can
move off the LLM gateway — zero Opus tokens per fire.

Handler contract:
- cmd or argv (exactly one required). cmd spawns via /bin/sh -c (absolute
  path, not 'sh', to block PATH-override shell substitution). argv spawns
  direct with no shell.
- cwd required, must be absolute. Operator-trust boundary.
- env defaults to SHELL_ENV_ALLOWLIST ({PATH, HOME, USER, LANG, TZ,
  NODE_ENV}) picked from process.env, with caller overrides merged on top.
  Prevents accidental $OPENAI_API_KEY interpolation into scripts.
- stdout/stderr retained as UTF-8-safe tails (64KB/16KB) via
  string_decoder.StringDecoder. Prepends [truncated N bytes] marker.
- Abort (either ctx.signal or ctx.shutdownSignal) fires SIGTERM → 5s grace
  → SIGKILL on child. Timer NOT .unref'd so worker's 30s race waits for
  the child to actually die.

shell-audit.ts writes a JSONL line per submission to
~/.gbrain/audit/shell-jobs-YYYY-Www.jsonl (ISO-week rotated, override via
GBRAIN_AUDIT_DIR). argv logged as JSON array (not space-joined, which would
flatten args with spaces). Never logs env values. Best-effort writes:
failures log to stderr but don't block submission.

* feat(jobs): submit_job MCP guard + CLI --timeout-ms + starvation warning

submit_job operation gains timeout_ms param (was missing — couldn't plumb
the existing MinionJobInput field through from either CLI or MCP). When
ctx.remote=true and name is in PROTECTED_JOB_NAMES, throws
OperationError('permission_denied'). Combined with the queue.add trusted
guard, MCP callers can never submit shell jobs even if the env flag is on.

CLI submit: new --timeout-ms N flag. Passes {allowProtectedSubmit:true}
as the 4th arg to queue.add only when the submitted name is protected
(not blanket-set for every job). Prints a starvation-warning block to
stderr when a shell job is submitted without --follow, pointing at both
--follow and 'gbrain jobs work' remediation. Fires for every shell submit
regardless of the submitter's env — the submitter env is a weak proxy for
the worker env.

Worker handler registration: conditional on GBRAIN_ALLOW_SHELL_JOBS=1.
Default: off. 'gbrain jobs submit --help' now lists handler types with a
pointer to docs/guides/minions-shell-jobs.md for shell.

* test(minions): 40 unit + 4 E2E cases for shell handler

Unit (test/minions-shell.test.ts):
- Protected names: trim-normalized, case-sensitive, whitespace bypass defense
- MinionQueue.add: trusted opt-in, whitespace bypass, non-protected untouched
- Handler validation: cmd|argv exclusive, cwd required/absolute, env strings
- Spawn: cmd/argv happy paths, non-zero exit, ENOENT, result shape
- Env allowlist: leaked-secret blocked, PATH inherited, caller override
- Abort: ctx.signal, ctx.shutdownSignal, pre-aborted signal
- Audit: ISO-week year boundary (2027-01-01 → W53 2026), mid-year W52/W53,
  GBRAIN_AUDIT_DIR override, argv as JSON array, env never logged, EACCES
  non-blocking
- Output truncation: 100KB → last 64KB with [truncated N bytes] marker

E2E (test/e2e/minions-shell.test.ts):
- Full lifecycle: submit → worker claim → spawn → complete
- MinionQueue.add without trusted arg throws (including whitespace bypass)
- submit_job with ctx.remote=true rejects shell (MCP guard)
- submit_job with ctx.remote=false allows shell (CLI path)

* chore: bump version and changelog (v0.13.0)

Move gateway crons to Minions. Zero LLM tokens per cron fire.
Worker abort path finally marks aborted jobs dead.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: reframe v0.13.0 copy for OpenClaw operators (not Wintermute-specific)

gbrain is an open-source product for any OpenClaw/Hermes operator, not
Garry's personal Wintermute deployment. Rewords the v0.13.0 CHANGELOG
entry, the minions-shell-jobs guide, and the deferred TODOS entries to
speak to "your OpenClaw" / "OpenClaw operators" instead.

Replaces /data/wintermute cwd examples with the canonical
/data/.openclaw/workspace path. Pre-existing Wintermute references in
older CHANGELOG entries (v0.11/v0.10.3) left unchanged.

* feat(migrations): add v0.13.0 adoption playbook for shell jobs

Adding the migration file the CEO review originally scoped out. Without
it, operators upgrade to v0.13.0 and the capability ships but adoption
doesn't happen — the 60% gateway CPU reduction only lands if someone
actually rewrites their crontab.

skills/migrations/v0.13.0.md is the instruction manual the host agent
reads on gbrain upgrade:

- Enable worker: GBRAIN_ALLOW_SHELL_JOBS=1 gbrain jobs work (Postgres)
  or per-tick --follow (PGLite)
- Audit cron manifest: classify LLM-requiring vs deterministic
- Propose per-cron rewrites with diffs, approved one at a time
- Env allowlist guidance for scripts that need API keys
- Verification playbook: run one fire, compare pre/post, only then
  approve the next batch
- Starvation sanity-check runbook item

Iron rules: never auto-rewrite the operator's crontab (host-specific
code per CLAUDE.md). LLM-requiring crons stay on the gateway. Ambiguous
cases ask the operator.

No mechanical orchestrator ships with this migration — every rewrite
is operator judgment. A future gbrain crontab-to-minions helper is
tracked in TODOS.md as P1.

* docs: sync UPGRADING + SKILLPACK with v0.13.0 shell jobs

UPGRADING_DOWNSTREAM_AGENTS.md: append v0.13.0 section per the file's
convention (each release appends). No skill edits required, feature is
off-by-default, optional adoption via skills/migrations/v0.13.0.md.
Lists typical LLM-vs-deterministic classifications so operators know
which of their crons are candidates for migration.

GBRAIN_SKILLPACK.md: add shell-jobs guide row to the cron/Minions guide
table so it's discoverable alongside existing Cron via Minions, Plugin
Handlers, and Minions fix guides.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-20 10:54:31 +08:00


Minions shell jobs — move deterministic crons off the gateway

30 seconds

# Run your first shell job:
GBRAIN_ALLOW_SHELL_JOBS=1 gbrain jobs submit shell \
  --params '{"cmd":"echo hello","cwd":"/tmp"}' --follow
# → exit_code: 0, stdout_tail: "hello\n", duration_ms: 43

That's it. Your cron scripts now have a home with retry, backoff, DLQ, and gbrain jobs list visibility, without each one booting a full LLM session.

PGLite users: gbrain jobs work does not run on PGLite (exclusive file lock). Every crontab invocation must use --follow for inline execution. Postgres users can run a persistent worker; see recipes below.

Why it exists

If your agent runs deterministic scripts from cron (token refresh, API fetch, scrape + write), each one pays the cost of a full LLM session on the gateway. Fourteen simultaneous fires on a Series A deployment pin CPU at 100% and block live messages. None of those scripts need reasoning. They need a shell.

Shell jobs move them to the Minions worker: one deterministic-script execution per cron, zero LLM tokens, unified visibility and retry.

Security model (read this)

Shell exec has a large blast radius. We ship two independent gates, and both must pass:

1. MCP boundary. submit_job with name: 'shell' is rejected when ctx.remote === true (MCP callers). This is independent of the env flag: remote agents can never submit shell jobs. MinionQueue.add('shell', ...) has its own guard too, so an in-process handler can't programmatically bypass this.
2. Env flag. The worker only registers the shell handler when GBRAIN_ALLOW_SHELL_JOBS=1 is set on the worker process. Default: off. Your agent opts in per-host.
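
The two gates can be sketched in TypeScript as follows. This is a minimal illustration, not the gbrain source: PROTECTED_JOB_NAMES, allowProtectedSubmit, and GBRAIN_ALLOW_SHELL_JOBS come from the release notes above, while the helper names (assertSubmittable, assertNotRemoteShell, shouldRegisterShellHandler) are hypothetical, and a plain Error stands in for OperationError.

```typescript
// Illustrative sketch only; helper names are hypothetical, not gbrain's API.
const PROTECTED_JOB_NAMES: ReadonlySet<string> = new Set(["shell"]);

interface TrustedSubmit {
  allowProtectedSubmit?: boolean;
}

// Queue-side guard: normalize first, then check, so ' shell ' cannot slip past.
// The trust flag is its own argument so spreading user-supplied opts never sets it.
function assertSubmittable(name: string, trusted?: TrustedSubmit): string {
  const normalized = name.trim();
  if (PROTECTED_JOB_NAMES.has(normalized) && trusted?.allowProtectedSubmit !== true) {
    throw new Error(
      `protected job name '${normalized}' requires CLI or operation-local submitter`,
    );
  }
  return normalized;
}

// MCP boundary: remote callers are rejected regardless of the env flag.
function assertNotRemoteShell(name: string, remote: boolean): void {
  if (remote && PROTECTED_JOB_NAMES.has(name.trim())) {
    throw new Error("permission_denied: shell jobs cannot be submitted over MCP");
  }
}

// Worker-side gate: register the handler only when the flag is exactly "1".
function shouldRegisterShellHandler(env: Record<string, string | undefined>): boolean {
  return env["GBRAIN_ALLOW_SHELL_JOBS"] === "1";
}
```

Keeping the trust flag out of the options object is the design point: a caller that forwards untrusted options cannot set it by accident.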

What the env allowlist does AND does not do. Shell jobs run with a minimal env: PATH, HOME, USER, LANG, TZ, NODE_ENV. Your secrets like OPENAI_API_KEY and DATABASE_URL are NOT passed to the child. You opt in additional keys per job via env: { ... }. This stops accidental $OPENAI_API_KEY interpolation in a user-authored script. It does not sandbox filesystem reads: a shell script can cat ~/.env or any file the worker process can read. The operator picks a safe cwd. That is the trust boundary.
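
A minimal sketch of the allowlist merge, assuming the child env is built by picking the six allowlisted keys from the worker's environment and layering the job's env on top. SHELL_ENV_ALLOWLIST is named in the release notes; buildChildEnv and the sample values are hypothetical.

```typescript
// Illustrative sketch only; buildChildEnv is a hypothetical helper name.
const SHELL_ENV_ALLOWLIST = ["PATH", "HOME", "USER", "LANG", "TZ", "NODE_ENV"] as const;

function buildChildEnv(
  parentEnv: Record<string, string | undefined>,
  overrides: Record<string, string> = {},
): Record<string, string> {
  const env: Record<string, string> = {};
  // Copy only allowlisted keys that are actually set on the worker.
  for (const key of SHELL_ENV_ALLOWLIST) {
    const value = parentEnv[key];
    if (value !== undefined) env[key] = value;
  }
  // Per-job env is merged last, so the caller can add or replace keys.
  return { ...env, ...overrides };
}

// Sample worker environment with a secret that must not reach the child.
const childEnv = buildChildEnv(
  { PATH: "/usr/bin", HOME: "/home/op", OPENAI_API_KEY: "sk-secret" },
  { API_TOKEN: "from-job-params" },
);
```

Here childEnv carries PATH, HOME, and API_TOKEN, and has no OPENAI_API_KEY entry.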

Audit trail, not forensic insurance. Every submission writes a JSONL line to ~/.gbrain/audit/shell-jobs-YYYY-Www.jsonl (ISO-week rotation; override with GBRAIN_AUDIT_DIR). Write failures log to stderr and don't block submission, so an adversary who fills the disk could silently disable the trail. Good for "what did this cron submit last Tuesday", not for security-critical forensics.

The command text is logged as-is. If you embed a secret in cmd (curl -H 'Authorization: Bearer ...'), it shows up in the audit file. Put secrets in env: instead.
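
To find the right audit file for a given day, the ISO-week name can be computed as in the sketch below. It assumes weeks are calculated in UTC (the real implementation may use local time); isoWeekStamp and auditFilePath are hypothetical helper names, and homeDir is passed in rather than read from the process.

```typescript
// Illustrative sketch only; helper names are hypothetical.
// Assumption: ISO weeks are computed in UTC.
function isoWeekStamp(date: Date): string {
  const d = new Date(Date.UTC(date.getUTCFullYear(), date.getUTCMonth(), date.getUTCDate()));
  const dayOfWeek = d.getUTCDay() || 7; // Monday = 1 ... Sunday = 7
  // Shift to the Thursday of this ISO week; its year is the ISO year.
  d.setUTCDate(d.getUTCDate() + 4 - dayOfWeek);
  const isoYear = d.getUTCFullYear();
  const yearStart = Date.UTC(isoYear, 0, 1);
  const week = Math.ceil(((d.getTime() - yearStart) / 86_400_000 + 1) / 7);
  return `${isoYear}-W${String(week).padStart(2, "0")}`;
}

function auditFilePath(
  date: Date,
  env: Record<string, string | undefined>,
  homeDir: string,
): string {
  // GBRAIN_AUDIT_DIR overrides the default directory when set.
  const dir = env["GBRAIN_AUDIT_DIR"] || `${homeDir}/.gbrain/audit`;
  return `${dir}/shell-jobs-${isoWeekStamp(date)}.jsonl`;
}
```

Under this calculation 2027-01-01 falls in 2026-W53, the year-boundary case the release notes list among the unit tests.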

Migrate a cron

Postgres worker (recommended)

In one terminal, start a persistent worker:

GBRAIN_ALLOW_SHELL_JOBS=1 gbrain jobs work

Rewrite crontab to submit shell jobs (no --follow):

# Before (LLM gateway):
#   OpenClaw cron: x-garrytan-unified
# After (Minions worker). Keep the entry on one line: crontab has no backslash continuation.
3 13,16,19,22,1,4,7,10 * * * gbrain jobs submit shell --params '{"cmd":"node scripts/x-garrytan-daily.mjs","cwd":"/data/.openclaw/workspace"}' --max-attempts 3 --timeout-ms 300000

The worker claims the job on its next poll, runs it, and records exit_code, stdout_tail, and stderr_tail in the result. Failures retry up to --max-attempts with exponential backoff.
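
The release notes describe stdout_tail and stderr_tail as UTF-8-safe tails (64KB/16KB) with a [truncated N bytes] marker. The sketch below shows one way such a tail could be cut. It is a simplification under stated assumptions: it buffers all chunks instead of streaming through string_decoder.StringDecoder, tailUtf8 is a hypothetical name, and the newline after the marker is a guess at the layout.

```typescript
// Illustrative sketch only; tailUtf8 is a hypothetical helper name.
function tailUtf8(chunks: Uint8Array[], maxBytes: number): string {
  // Concatenate all chunks (a real handler would keep a rolling buffer).
  const total = chunks.reduce((n, c) => n + c.length, 0);
  const all = new Uint8Array(total);
  let offset = 0;
  for (const c of chunks) {
    all.set(c, offset);
    offset += c.length;
  }
  const decoder = new TextDecoder();
  if (total <= maxBytes) return decoder.decode(all);

  let start = total - maxBytes;
  // Skip UTF-8 continuation bytes (10xxxxxx) so the tail starts on a character boundary.
  while (start < total && ((all[start] ?? 0) & 0xc0) === 0x80) start++;
  return `[truncated ${start} bytes]\n` + decoder.decode(all.subarray(start));
}
```

Cutting on a character boundary means a multi-byte character is either kept whole or dropped whole, so the tail never begins with a replacement character.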

PGLite (inline execution)

PGLite doesn't support the persistent worker daemon. Every crontab invocation uses --follow to run inline:

# Each cron tick spawns a short-lived worker that runs the job inline.
# Keep the entry on one line: crontab has no backslash continuation.
3 13,16,19,22,1,4,7,10 * * * GBRAIN_ALLOW_SHELL_JOBS=1 gbrain jobs submit shell --params '{"cmd":"node scripts/x-garrytan-daily.mjs","cwd":"/data/.openclaw/workspace"}' --follow --timeout-ms 300000

Note: --follow blocks the crontab slot until the job finishes. If 14 shell crons land at the same minute and each takes 30s, they serialize through crontab's spawning limits. Postgres + persistent worker scales better.

Submitting with `argv` (no shell interpolation)

For programmatic callers assembling commands from JSON, use argv instead of cmd. No shell, no injection surface:

gbrain jobs submit shell \
  --params '{"argv":["node","scripts/fetch.mjs","--date","2026-04-19"],"cwd":"/data"}' \
  --follow
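
The difference between the two modes comes down to what gets handed to the process spawner. A minimal sketch, assuming cmd is wrapped as /bin/sh -c and argv is passed through untouched, as the release notes describe; toSpawnPlan and SpawnPlan are hypothetical names, and the hostile argument is a made-up example.

```typescript
// Illustrative sketch only; toSpawnPlan and SpawnPlan are hypothetical names.
interface ShellParams {
  cmd?: string;
  argv?: string[];
}

interface SpawnPlan {
  file: string;
  args: string[];
}

function toSpawnPlan(params: ShellParams): SpawnPlan {
  if (params.cmd !== undefined) {
    // cmd: the whole string is interpreted by the shell (absolute path, not 'sh').
    return { file: "/bin/sh", args: ["-c", params.cmd] };
  }
  // argv: first element is the binary, the rest are literal arguments.
  const [file, ...args] = params.argv ?? [];
  if (file === undefined) {
    throw new Error("shell: specify exactly one of cmd or argv");
  }
  return { file, args };
}

// A hostile value stays a single literal argument because no shell parses it.
const plan = toSpawnPlan({
  argv: ["node", "scripts/fetch.mjs", "--date", "2026-04-19; rm -rf /"],
});
```

In the argv plan the date value, semicolon included, reaches the script as one argument. The same text inside cmd would be split and executed by /bin/sh.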

Debug a failed job

# List dead shell jobs
gbrain jobs list --status dead --name shell

# Inspect one
gbrain jobs get 42
# → error_text, stacktrace, result.stdout_tail, result.stderr_tail

# Submission audit log (operator trail, not forensic)
cat ~/.gbrain/audit/shell-jobs-*.jsonl | jq '.'

# First-time failure mode: submitted without env flag on the worker
gbrain jobs list --status waiting --name shell
# If rows pile up here, no worker with GBRAIN_ALLOW_SHELL_JOBS=1 is running.

Limitations

Filesystem reads are not sandboxed. See "Security model" above. Don't point cwd at a directory full of secrets.
Audit log is advisory. Disk-full or EACCES silently disables it.
Cancel latency is lock-renewal-bounded (~7-15 s by default). A cancelled child keeps running until the next lock-renewal tick fails.
--follow claim order is by priority/created_at. If another job is waiting in the same queue at the time of --follow, that one runs first.
cwd symlink TOCTOU. The absolute-path check doesn't guard against symlinks pointing elsewhere at execution time. Operator-scope concern.
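
Once the abort signal does reach the handler, the release notes describe SIGTERM, a 5s grace, then SIGKILL. That escalation can be sketched as below, with the timer injected so the logic can be exercised without a real process. KillableChild, Scheduler, and terminateWithGrace are hypothetical names, not gbrain's API.

```typescript
// Illustrative sketch only; all names here are hypothetical.
type KillSignal = "SIGTERM" | "SIGKILL";

interface KillableChild {
  exited: boolean;
  kill(signal: KillSignal): void;
}

type Scheduler = (fn: () => void, ms: number) => void;

function terminateWithGrace(
  child: KillableChild,
  graceMs: number = 5000,
  // Default scheduler keeps a normal (ref'd) timer so the caller waits for it.
  schedule: Scheduler = (fn, ms) => {
    setTimeout(fn, ms);
  },
): void {
  // Ask politely first so the script can clean up.
  child.kill("SIGTERM");
  // After the grace period, force-kill only if the child is still alive.
  schedule(() => {
    if (!child.exited) child.kill("SIGKILL");
  }, graceMs);
}
```

Under these assumptions the worst-case cancel latency is the lock-renewal delay plus the grace period.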

Errors

| Error | What it means | Fix |
| --- | --- | --- |
| `shell: specify exactly one of cmd or argv` | `cmd` and `argv` are mutually exclusive. Both absent is also invalid. | Choose one. `cmd` for shell-interpolated strings; `argv` for structured args. |
| `shell: cwd is required and must be an absolute path` | `cwd` must be a string starting with `/`. | Set `cwd` in `--params` to an absolute path. |
| `shell: argv must be an array of strings` | `argv` has a non-string entry or isn't an array. | Pass `argv: ["bin","arg1","arg2"]`. |
| `shell: env values must all be strings` | `env` has a number/bool/object value. | Stringify: `"env":{"COUNT":"3"}` not `"env":{"COUNT":3}`. |
| `permission_denied: shell jobs cannot be submitted over MCP` | An MCP client tried to submit a shell job. By design CLI-only. | Submit from CLI or via a trusted operation handler (`ctx.remote === false`). |
| `protected job name 'shell' requires CLI or operation-local submitter` | A caller invoked `MinionQueue.add('shell', ...)` without the `trusted` opt-in. | Pass `{ allowProtectedSubmit: true }` as the 4th arg. CLI and `submit_job` do this automatically. |
| `aborted: timeout` / `aborted: cancel` / `aborted: shutdown` / `aborted: lock-lost` | The worker's abort signal fired mid-execution. Child got SIGTERM, 5s grace, then SIGKILL. | Expected: timeout / user cancel / deploy restart / stall. Inspect `gbrain jobs get` to see which. |
| `exit N: <stderr_tail_500>` | Script exited non-zero. | Read `stderr_tail` in `gbrain jobs get`. |
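
The four validation errors above can be reproduced with a small pre-flight check, useful if you assemble --params programmatically. This is a sketch: the messages are copied from the table, but validateShellParams is a hypothetical name and the order of the checks is an assumption.

```typescript
// Illustrative sketch only; validateShellParams is a hypothetical helper,
// and the order in which checks run is an assumption.
function validateShellParams(p: Record<string, unknown>): void {
  const hasCmd = p["cmd"] !== undefined;
  const hasArgv = p["argv"] !== undefined;
  // Both present or both absent is invalid.
  if (hasCmd === hasArgv) {
    throw new Error("shell: specify exactly one of cmd or argv");
  }

  const argv = p["argv"];
  if (hasArgv && !(Array.isArray(argv) && argv.every((a) => typeof a === "string"))) {
    throw new Error("shell: argv must be an array of strings");
  }

  const cwd = p["cwd"];
  if (typeof cwd !== "string" || !cwd.startsWith("/")) {
    throw new Error("shell: cwd is required and must be an absolute path");
  }

  const env = p["env"];
  if (env !== undefined) {
    const ok =
      typeof env === "object" &&
      env !== null &&
      !Array.isArray(env) &&
      Object.values(env).every((v) => typeof v === "string");
    if (!ok) throw new Error("shell: env values must all be strings");
  }
}
```

For example, validateShellParams({ cmd: "echo hi", cwd: "/tmp", env: { COUNT: 3 } }) throws the env error, while the same call with "3" as a string passes.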
