feat: shell job type + worker abort-path fix (v0.13.0) (#217)

* feat(minions): add protected-name constant + ctx.shutdownSignal

Introduce PROTECTED_JOB_NAMES ('shell') in a side-effect-free core module
so queue.ts can check it without importing from handlers/. MinionJobContext
gains shutdownSignal (distinct from signal) — handlers that need to run
SIGTERM-triggered cleanup subscribe to both; most handlers ignore shutdown
and run through the worker's 30s cleanup race to natural completion.

* fix(minions): MinionQueue.add gains trusted 4th arg + trim-normalized guard

Adds allowProtectedSubmit opt-in as a separate 4th parameter (NOT folded into
opts) so callers spreading user-provided opts ({...userOpts}) can't accidentally
carry the trust flag. PROTECTED_JOB_NAMES check runs on the trimmed name BEFORE
insert, closing the queue.add(' shell ', ...) whitespace bypass that would have
evaded a has(name) check.
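
A minimal sketch of the guard described above, assuming a simplified in-memory queue. Only PROTECTED_JOB_NAMES, isProtectedJobName and allowProtectedSubmit are names from this change; SketchQueue, TrustedOpts, QueuedJob, the error text, and storing the trimmed name are assumptions made for illustration.

```typescript
// Sketch only: everything except PROTECTED_JOB_NAMES, isProtectedJobName and
// allowProtectedSubmit is a hypothetical stand-in for the real queue.
const PROTECTED_JOB_NAMES: ReadonlySet<string> = new Set(["shell"]);

function isProtectedJobName(jobName: string): boolean {
  // Trim before the lookup so ' shell ' cannot slip past a plain has(name).
  return PROTECTED_JOB_NAMES.has(jobName.trim());
}

interface TrustedOpts {
  allowProtectedSubmit?: boolean;
}

interface QueuedJob {
  name: string;
  data: unknown;
  opts: Record<string, unknown>;
}

class SketchQueue {
  readonly jobs: QueuedJob[] = [];

  // The trust flag is a separate 4th parameter, so spreading user-provided
  // opts into the 3rd argument can never carry it along.
  add(
    jobName: string,
    data: unknown,
    opts: Record<string, unknown> = {},
    trusted: TrustedOpts = {},
  ): QueuedJob {
    const normalized = jobName.trim();
    if (isProtectedJobName(normalized) && trusted.allowProtectedSubmit !== true) {
      throw new Error(`protected job name requires trusted submit: ${normalized}`);
    }
    // Assumption: the sketch stores the trimmed name; the real insert may differ.
    const job: QueuedJob = { name: normalized, data, opts };
    this.jobs.push(job);
    return job;
  }
}
```

In this sketch, add(' shell ', data, {...userOpts, allowProtectedSubmit: true}) still throws, because the flag is only honored in the 4th position.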

* fix(minions): worker calls failJob on abort + wires ctx.shutdownSignal

Pre-v0.13.0, the worker returned silently when ctx.signal aborted, leaving
jobs in 'active' until the stall sweep. Handlers using cooperative cancel had
no deterministic status flip; timeout/cancel/lock-loss all looked the same
to downstream callers (gbrain jobs get, --follow loops).

Fix: derive abort reason from abort.signal.reason ('timeout' | 'cancel' |
'lock-lost' | 'shutdown') and call failJob with 'aborted: <reason>' text.
failJob is idempotent via token+status match, so the call is a no-op when
another path has already flipped status (handleTimeouts, cancelJob, stall).

Also: new shutdownAbort (instance-level AbortController) fires on process
SIGTERM/SIGINT and propagates to every handler's ctx.shutdownSignal.
Shell handler listens to both signals and runs SIGTERM→5s→SIGKILL on its
child on either; other handlers only listen to ctx.signal so deploy
restarts don't cancel them mid-flight.
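
A rough sketch of the abort-reason derivation, assuming the reason is carried as a plain string on AbortSignal.reason. The four reason strings and the 'aborted: <reason>' text come from this change; runWithAbortReporting, the FailJob signature, the 'unknown' fallback, and rethrowing non-abort errors are assumptions.

```typescript
// Sketch only: function names, the FailJob signature and the "unknown"
// fallback are hypothetical; the reason strings come from the commit text.
type AbortReason = "timeout" | "cancel" | "lock-lost" | "shutdown";
const ABORT_REASONS: readonly string[] = ["timeout", "cancel", "lock-lost", "shutdown"];

function toAbortReason(signal: AbortSignal): AbortReason | "unknown" {
  const reason: unknown = signal.reason;
  return typeof reason === "string" && ABORT_REASONS.includes(reason)
    ? (reason as AbortReason)
    : "unknown";
}

function abortErrorText(signal: AbortSignal): string {
  return `aborted: ${toAbortReason(signal)}`;
}

type FailJob = (jobId: number, lockToken: string, errorText: string) => Promise<void>;

async function runWithAbortReporting(
  jobId: number,
  lockToken: string,
  abort: AbortController,
  handler: (signal: AbortSignal) => Promise<void>,
  failJob: FailJob,
): Promise<void> {
  try {
    await handler(abort.signal);
  } catch (err) {
    if (abort.signal.aborted) {
      // Report the abort instead of returning silently.
      await failJob(jobId, lockToken, abortErrorText(abort.signal));
      return;
    }
    throw err; // assumption: non-abort errors take the normal failure path
  }
}
```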

* feat(minions): add shell job handler + submission audit log

New 'shell' job type spawns arbitrary commands under the Minions worker.
Deterministic cron scripts (API fetch, token refresh, scrape+write) can
move off the LLM gateway — zero Opus tokens per fire.

Handler contract:
- cmd or argv (exactly one required). cmd spawns via /bin/sh -c (absolute
  path, not 'sh', to block PATH-override shell substitution). argv spawns
  the binary directly, with no shell.
- cwd required, must be absolute. Operator-trust boundary.
- env defaults to SHELL_ENV_ALLOWLIST ({PATH, HOME, USER, LANG, TZ,
  NODE_ENV}) picked from process.env, with caller overrides merged on top.
  Prevents accidental $OPENAI_API_KEY interpolation into scripts.
- stdout/stderr retained as UTF-8-safe tails (64KB/16KB) via
  string_decoder.StringDecoder. Prepends a [truncated N bytes] marker when
  output exceeds the limit.
- Abort (either ctx.signal or ctx.shutdownSignal) fires SIGTERM → 5s grace
  → SIGKILL on child. Timer NOT .unref'd so worker's 30s race waits for
  the child to actually die.
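
The abort bullet above can be sketched as follows, assuming the child exposes a kill(signal) method like Node's ChildProcess. armKillSequence, Killable, the hasExited callback and the injectable schedule parameter are hypothetical names chosen so the sequence can be exercised without spawning a process.

```typescript
// Sketch only: armKillSequence and its parameter names are hypothetical.
interface Killable {
  kill(signal: "SIGTERM" | "SIGKILL"): boolean;
}

type Schedule = (fn: () => void, ms: number) => unknown;

function armKillSequence(
  child: Killable,
  signals: AbortSignal[],
  hasExited: () => boolean,
  graceMs: number = 5000,
  schedule: Schedule = (fn, ms) => setTimeout(fn, ms),
): void {
  let started = false;
  const onAbort = (): void => {
    if (started || hasExited()) return; // run the sequence at most once
    started = true;
    child.kill("SIGTERM");
    // Deliberately not unref'd: the pending timer keeps the process alive
    // until the SIGKILL decision has been made.
    schedule(() => {
      if (!hasExited()) child.kill("SIGKILL");
    }, graceMs);
  };
  for (const signal of signals) {
    if (signal.aborted) onAbort(); // covers a pre-aborted signal
    else signal.addEventListener("abort", onAbort, { once: true });
  }
}
```

With a real child from child_process.spawn, hasExited could be () => child.exitCode !== null || child.signalCode !== null, and signals would be [ctx.signal, ctx.shutdownSignal].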

shell-audit.ts writes a JSONL line per submission to
~/.gbrain/audit/shell-jobs-YYYY-Www.jsonl (ISO-week rotated, override via
GBRAIN_AUDIT_DIR). argv logged as JSON array (not space-joined, which would
flatten args with spaces). Never logs env values. Best-effort writes:
failures log to stderr but don't block submission.
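
A sketch of the ISO-week file naming, assuming weeks are computed in UTC (the real formatter may use local time). isoWeekLabel and auditFileName are hypothetical helper names; the shell-jobs-YYYY-Www.jsonl pattern is from this change.

```typescript
// Sketch only: helper names and the UTC choice are assumptions.
function isoWeekLabel(date: Date): string {
  // Shift a UTC copy to the Thursday of the same ISO week; the ISO year of
  // a week is the calendar year that contains its Thursday.
  const d = new Date(Date.UTC(date.getUTCFullYear(), date.getUTCMonth(), date.getUTCDate()));
  const dayOfWeek = d.getUTCDay() === 0 ? 7 : d.getUTCDay(); // Mon=1 .. Sun=7
  d.setUTCDate(d.getUTCDate() + 4 - dayOfWeek);
  const isoYear = d.getUTCFullYear();
  const yearStart = Date.UTC(isoYear, 0, 1);
  const dayOfYear = (d.getTime() - yearStart) / 86400000 + 1;
  const week = Math.ceil(dayOfYear / 7);
  return `${isoYear}-W${String(week).padStart(2, "0")}`;
}

function auditFileName(date: Date): string {
  return `shell-jobs-${isoWeekLabel(date)}.jsonl`;
}

// 2027-01-01 is a Friday, so it belongs to the last ISO week of 2026.
console.log(auditFileName(new Date(Date.UTC(2027, 0, 1)))); // shell-jobs-2026-W53.jsonl
```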

* feat(jobs): submit_job MCP guard + CLI --timeout-ms + starvation warning

submit_job operation gains timeout_ms param (was missing — couldn't plumb
the existing MinionJobInput field through from either CLI or MCP). When
ctx.remote=true and name is in PROTECTED_JOB_NAMES, throws
OperationError('permission_denied'). Combined with the queue.add trusted
guard, MCP callers can never submit shell jobs even if the env flag is on.
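
A minimal sketch of the remote guard, assuming a context object with a boolean remote field. OperationError and the 'permission_denied' code are named in this change, but the constructor shape shown here, OperationContext, assertSubmitAllowed and the message text are assumptions.

```typescript
// Sketch only: the OperationError constructor shape, OperationContext and
// assertSubmitAllowed are hypothetical.
class OperationError extends Error {
  readonly code: string;
  constructor(code: string, message: string = code) {
    super(message);
    this.name = "OperationError";
    this.code = code;
  }
}

interface OperationContext {
  remote: boolean;
}

const PROTECTED_JOB_NAMES: ReadonlySet<string> = new Set(["shell"]);

function assertSubmitAllowed(ctx: OperationContext, jobName: string): void {
  // Checked independently of GBRAIN_ALLOW_SHELL_JOBS: remote callers are
  // rejected even when the worker has the handler enabled.
  if (ctx.remote && PROTECTED_JOB_NAMES.has(jobName.trim())) {
    throw new OperationError(
      "permission_denied",
      `job '${jobName.trim()}' cannot be submitted by remote callers`,
    );
  }
}
```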

CLI submit: new --timeout-ms N flag. Passes {allowProtectedSubmit:true}
as the 4th arg to queue.add only when the submitted name is protected
(not blanket-set for every job). Prints a starvation-warning block to
stderr when a shell job is submitted without --follow, pointing at both
--follow and 'gbrain jobs work' remediation. Fires for every shell submit
regardless of the submitter's env — the submitter env is a weak proxy for
the worker env.

Worker handler registration: conditional on GBRAIN_ALLOW_SHELL_JOBS=1.
Default: off. 'gbrain jobs submit --help' now lists handler types with a
pointer to docs/guides/minions-shell-jobs.md for shell.

* test(minions): 40 unit + 4 E2E cases for shell handler

Unit (test/minions-shell.test.ts):
- Protected names: trim-normalized, case-sensitive, whitespace bypass defense
- MinionQueue.add: trusted opt-in, whitespace bypass, non-protected untouched
- Handler validation: cmd|argv exclusive, cwd required/absolute, env strings
- Spawn: cmd/argv happy paths, non-zero exit, ENOENT, result shape
- Env allowlist: leaked-secret blocked, PATH inherited, caller override
- Abort: ctx.signal, ctx.shutdownSignal, pre-aborted signal
- Audit: ISO-week year boundary (2027-01-01 → W53 2026), mid-year W52/W53,
  GBRAIN_AUDIT_DIR override, argv as JSON array, env never logged, EACCES
  non-blocking
- Output truncation: 100KB → last 64KB with [truncated N bytes] marker

E2E (test/e2e/minions-shell.test.ts):
- Full lifecycle: submit → worker claim → spawn → complete
- MinionQueue.add without trusted arg throws (including whitespace bypass)
- submit_job with ctx.remote=true rejects shell (MCP guard)
- submit_job with ctx.remote=false allows shell (CLI path)

* chore: bump version and changelog (v0.13.0)

Move gateway crons to Minions. Zero LLM tokens per cron fire.
Worker abort path finally marks aborted jobs dead.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: reframe v0.13.0 copy for OpenClaw operators (not Wintermute-specific)

gbrain is an open-source product for any OpenClaw/Hermes operator, not
Garry's personal Wintermute deployment. Rewords the v0.13.0 CHANGELOG
entry, the minions-shell-jobs guide, and the deferred TODOS entries to
speak to "your OpenClaw" / "OpenClaw operators" instead.

Replaces /data/wintermute cwd examples with the canonical
/data/.openclaw/workspace path. Pre-existing Wintermute references in
older CHANGELOG entries (v0.11/v0.10.3) left unchanged.

* feat(migrations): add v0.13.0 adoption playbook for shell jobs

Adding the migration file the CEO review originally scoped out. Without
it, operators upgrade to v0.13.0 and the capability ships but adoption
doesn't happen — the 60% gateway CPU reduction only lands if someone
actually rewrites their crontab.

skills/migrations/v0.13.0.md is the instruction manual the host agent
reads on gbrain upgrade:

- Enable worker: GBRAIN_ALLOW_SHELL_JOBS=1 gbrain jobs work (Postgres)
  or per-tick --follow (PGLite)
- Audit cron manifest: classify LLM-requiring vs deterministic
- Propose per-cron rewrites with diffs, approved one at a time
- Env allowlist guidance for scripts that need API keys
- Verification playbook: run one fire, compare pre/post, only then
  approve the next batch
- Starvation sanity-check runbook item

Iron rules: never auto-rewrite the operator's crontab (host-specific
code per CLAUDE.md). LLM-requiring crons stay on the gateway. Ambiguous
cases ask the operator.

No mechanical orchestrator ships with this migration — every rewrite
is operator judgment. A future gbrain crontab-to-minions helper is
tracked in TODOS.md as P1.

* docs: sync UPGRADING + SKILLPACK with v0.13.0 shell jobs

UPGRADING_DOWNSTREAM_AGENTS.md: append v0.13.0 section per the file's
convention (each release appends). No skill edits required: the feature is
off by default, with optional adoption via skills/migrations/v0.13.0.md.
Lists typical LLM-vs-deterministic classifications so operators know
which of their crons are candidates for migration.

GBRAIN_SKILLPACK.md: add shell-jobs guide row to the cron/Minions guide
table so it's discoverable alongside existing Cron via Minions, Plugin
Handlers, and Minions fix guides.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Garry Tan
2026-04-20 10:54:31 +08:00
committed by GitHub
parent c89aa909c7
commit 5fd9cd2644
19 changed files with 1580 additions and 23 deletions

@@ -2,6 +2,72 @@
All notable changes to GBrain will be documented in this file.
## [0.14.0] - 2026-04-20
## **Move gateway crons to Minions. Zero LLM tokens per cron fire.**
## **Worker abort path finally marks aborted jobs dead.**
Your OpenClaw gateway pins at 100% CPU when your 32 cron jobs each boot a full Opus session per fire, and ~14 of them are pure API-fetch-and-write scripts that don't need reasoning at all. This release adds a `shell` job type to Minions so those deterministic crons move off the gateway to the Minions worker. ~60% gateway load reduction at OpenClaw scale. Retry, backoff, DLQ, unified `gbrain jobs list` visibility, all free. The LLM-reasoning crons stay on the gateway where they belong.
Getting there meant fixing the Minions worker abort path, which had been quietly wrong since v0.11: aborted jobs (timeout, cancel, lock loss) returned silently without calling `failJob`, so status stayed `active` until a stall sweep found them ~30s later. This release calls `failJob` immediately, with the abort reason as its `error_text`. Handlers get cleaner signals, operators see accurate status, and `--follow` stops hanging past timeouts.
### The numbers that matter
Measured on the new `test/minions-shell.test.ts` (40 unit cases) and `test/e2e/minions-shell.test.ts` (4 E2E cases) plus 5 rounds of pre-landing review (spec adversarial x2, CEO scope, DX, eng, Codex outside voice).
| Metric | BEFORE v0.14.0 | AFTER v0.14.0 | Δ |
|---------------------------------------------------|-------------------------|-----------------------------------|----------------------|
| LLM tokens per cron fire | ~full Opus context boot | 0 (deterministic crons) | **100% reduction** |
| Gateway CPU headroom with ~14 crons moved | 0% | ~60% free | cron load off gateway|
| Aborted job status lag (timeout/cancel/lock-loss) | up to 30s | immediate `failJob` call | **deterministic** |
| Shell submission surfaces | none | CLI + trusted `submit_job` | 2 paths, both gated |
| Submission audit trail | none | JSONL at `~/.gbrain/audit/` | operational trace |
| Unit tests | 1318 pass | **1358 pass (+40 shell cases)** | +40 |
| E2E tests | 124 | **128 (+4 shell lifecycle)** | +4 |
| Pre-landing review rounds | 1 (eng) | **5 (spec×2 / CEO / DX / eng / codex)** | 29 issues surfaced, 26 resolved |
The abort-path fix is the quietly-important one. Handlers that use `ctx.signal` for cooperative cancel (sync, embed) now have deterministic status flips instead of waiting for the stall sweep. Shell jobs get reliable timeout semantics for the first time: `cmd: 'sleep 30', timeout_ms: 2000` hits `dead` at ~2100ms instead of ~32000ms.
### What this means for OpenClaw operators
`gbrain upgrade` reads `skills/migrations/v0.14.0.md` and walks your host agent through the adoption: enable the worker with `GBRAIN_ALLOW_SHELL_JOBS=1`, audit every cron entry (LLM-requiring stays, deterministic moves), propose a rewrite per cron with a diff, verify one fire end-to-end before approving the next batch. Never auto-rewrites your crontab — every change is a human approval per-cron. On Postgres, one persistent worker daemon claims each job. On PGLite, every crontab invocation adds `--follow` for inline execution because PGLite doesn't support the worker daemon. Either way, your gateway CPU stops pinning at 100% and your live messages stop getting blocked by batch processing. See `docs/guides/minions-shell-jobs.md` for usage recipes and `skills/migrations/v0.14.0.md` for the adoption playbook.
### Itemized changes
#### New `shell` job type
- **Spawn arbitrary commands as Minions jobs.** Pass `{cmd: "string"}` (shell-interpolated via `/bin/sh -c`) or `{argv: ["bin","arg"]}` (no shell, safe for programmatic callers). Both forms require an absolute `cwd`. Env vars are scoped to a minimal allowlist (`PATH, HOME, USER, LANG, TZ, NODE_ENV`) to prevent accidental `$OPENAI_API_KEY` interpolation; callers opt in to additional keys per job.
- **Two-layer security: MCP boundary + env flag.** `submit_job` rejects `name: 'shell'` when `ctx.remote === true`. Independent of the env flag. `MinionQueue.add('shell', ...)` also rejects unless the caller explicitly opts in via `{allowProtectedSubmit: true}` as the 4th arg, so an in-process handler can't programmatically submit a shell child by accident. Worker only registers the handler when `GBRAIN_ALLOW_SHELL_JOBS=1` is set on the worker process. Default: off. Opt in per-host.
- **Graceful child shutdown.** Abort fires SIGTERM, 5-second grace, then SIGKILL. Listens to both `ctx.signal` (timeout/cancel/lock-loss) and a new `ctx.shutdownSignal` (worker process SIGTERM/SIGINT), so deploy restarts don't orphan shell children. Non-shell handlers ignore `shutdownSignal` and keep running through the worker's 30s cleanup race.
- **UTF-8-safe output truncation.** stdout is retained as the last 64KB, stderr as the last 16KB, with a `[truncated N bytes]` marker prepended when exceeded. Uses `string_decoder.StringDecoder` so multibyte characters don't split across the truncation boundary.
- **Operational audit trail at `~/.gbrain/audit/shell-jobs-YYYY-Www.jsonl`** (ISO-week rotation, override via `GBRAIN_AUDIT_DIR`). Records caller, remote flag, job_id, cwd, and cmd/argv display. Never logs env values. Best-effort writes: failures log to stderr but don't block submission. Operational trace for "what did this cron submit last Tuesday," not forensic insurance.
- **Starvation warning on shell submission.** If you `gbrain jobs submit shell ...` without `--follow`, stderr prints a warning block pointing at both `--follow` and `gbrain jobs work` remediation. It fires on every such submit regardless of the submitter's env, because the submitter env is a weak proxy for the worker env. Turns a silent "job sits in waiting forever" failure mode into a directed next step.
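
To illustrate the UTF-8-safe tail from the list above, here is a rough sketch under stated assumptions. `TailBuffer` is a hypothetical name, the newline after the marker is a guess, and skipping orphaned continuation bytes at the cut is one possible way to avoid a broken character; the shipped handler may do this differently.

```typescript
import { StringDecoder } from "node:string_decoder";

// Sketch only: TailBuffer is hypothetical. It keeps the last `maxBytes`
// bytes and decodes them without a broken multibyte character at the cut.
class TailBuffer {
  private chunks: Buffer[] = [];
  private kept = 0;
  private dropped = 0;
  private readonly maxBytes: number;

  constructor(maxBytes: number) {
    this.maxBytes = maxBytes;
  }

  push(chunk: Buffer): void {
    this.chunks.push(chunk);
    this.kept += chunk.length;
    // Drop whole chunks from the front, then trim the oldest remaining one.
    while (this.kept > this.maxBytes) {
      const first = this.chunks[0];
      if (first === undefined) break;
      const excess = this.kept - this.maxBytes;
      const cut = Math.min(excess, first.length);
      if (cut === first.length) this.chunks.shift();
      else this.chunks[0] = first.subarray(cut);
      this.kept -= cut;
      this.dropped += cut;
    }
  }

  render(): string {
    let bytes: Buffer = Buffer.concat(this.chunks);
    let skipped = 0;
    if (this.dropped > 0) {
      // Skip UTF-8 continuation bytes (10xxxxxx) orphaned by the cut.
      while (skipped < bytes.length && (bytes.readUInt8(skipped) & 0xc0) === 0x80) skipped++;
      bytes = bytes.subarray(skipped);
    }
    const decoder = new StringDecoder("utf8");
    const text = decoder.write(bytes) + decoder.end();
    const truncated = this.dropped + skipped;
    return truncated > 0 ? `[truncated ${truncated} bytes]\n${text}` : text;
  }
}
```

Under this sketch, stdout would use `new TailBuffer(64 * 1024)` and stderr `new TailBuffer(16 * 1024)`, with `push` called from each stream's `data` event.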
#### Worker abort path overhaul
- **Aborted jobs now call `failJob` with the abort reason.** Pre-v0.14.0 worker returned silently when `ctx.signal.aborted` fired, leaving jobs in `active` until stall sweep. Fixed: catch-block now derives reason from `abort.signal.reason` (`timeout`, `cancel`, `lock-lost`, `shutdown`) and calls `failJob(id, token, "aborted: <reason>")`. Token-match makes the call idempotent: if another path already flipped status, it no-ops cleanly. Downstream `--follow` loops and status assertions now reflect reality.
- **`ctx.shutdownSignal` separated from `ctx.signal`.** Only fires on worker process SIGTERM/SIGINT. Handlers that need shutdown-specific cleanup (currently: shell handler's SIGTERM→SIGKILL on its child) subscribe to both signals. Non-shell handlers subscribe only to `ctx.signal` and don't get cancelled mid-flight on deploy restart.
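
The idempotency claim above can be sketched with an in-memory stand-in for the jobs table. `JobRow`, its field names, and this `failJob` signature are hypothetical; retry and backoff are omitted, so the sketch flips straight to `dead`.

```typescript
// Sketch only: JobRow, its fields and this in-memory failJob are
// hypothetical; the real call runs against the database.
type JobStatus = "waiting" | "active" | "completed" | "dead";

interface JobRow {
  id: number;
  status: JobStatus;
  lockToken: string | null;
  errorText: string | null;
}

// Returns true only when this call performed the status flip.
function failJob(
  table: Map<number, JobRow>,
  id: number,
  lockToken: string,
  errorText: string,
): boolean {
  const row = table.get(id);
  // Token + status match: if a timeout sweep, cancel, or stall path already
  // flipped the row, or another worker holds the lock, do nothing.
  if (row === undefined || row.status !== "active" || row.lockToken !== lockToken) {
    return false;
  }
  row.status = "dead";
  row.errorText = errorText;
  row.lockToken = null;
  return true;
}
```

A second call with the same token returns `false` and leaves the first `error_text` in place, which is the no-op behavior the bullet describes.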
#### CLI + operation surface additions
- **`gbrain jobs submit --timeout-ms N`.** Per-job wall-clock timeout in ms. Surfaced from the existing `timeout_ms` schema field, which had no CLI flag before.
- **`submit_job` operation gains `timeout_ms` param.** Same field exposed through MCP (for non-protected names).
- **`gbrain jobs submit --help` lists handler types.** `shell` is explicitly called out as CLI-only with a pointer to the guide. Closes the "what handlers are even available" discovery gap.
#### Tests
- **40 new unit cases in `test/minions-shell.test.ts`** covering validation (cmd/argv/cwd/env), spawn happy + error paths, UTF-8 safe truncation, SIGTERM abort via both signals, env allowlist (OPENAI_API_KEY blocked, PATH inherited, caller override), ISO-week filename at year boundary (2027-01-01 → W53 2026), audit write happy + EACCES failure paths, whitespace-bypass defense on `MinionQueue.add(' shell ', ...)`, and auto-added regression tests per the iron rule (non-protected names unaffected).
- **4 E2E tests in `test/e2e/minions-shell.test.ts`** covering full lifecycle (submit → worker claim → spawn → complete with captured stdout), `MinionQueue.add` defense-in-depth, `submit_job` MCP-guard rejection, `submit_job` CLI-path acceptance.
#### Docs
- **New `docs/guides/minions-shell-jobs.md`** opens with a 30-second copy-paste hello-world, then covers the two-layer security model with honest callouts about what env allowlist does and does not do, Postgres vs PGLite crontab recipes side-by-side, debug playbook (`gbrain jobs list`, `gbrain jobs get`, audit log tail, PGLite `--follow` note), known limitations, and an `#errors` table linked from every `UnrecoverableError` the handler throws.
- **New `skills/migrations/v0.14.0.md`** is the adoption playbook your host agent reads on `gbrain upgrade`. Walks through enabling the worker, auditing cron entries (LLM-requiring vs deterministic), proposing per-cron rewrites with diffs, and verifying end-to-end before batch approval. Iron rule: never auto-rewrites the operator's crontab — every change is human-approved per-cron.
- **README.md** links the guide from the Commands section.
#### Pre-ship review
Five independent rounds surfaced 29 issues across the plan. 26 resolved before a single line of code was written: spec-review adversarial subagent (x2 iterations) caught implementer-ergonomic gaps (caller derivation, mkdirSync, ISO-week formatter). CEO review + SELECTIVE EXPANSION cherry-picked argv form, audit log, SIGTERM grace, env allowlist, MCP-guard defense-in-depth, honest FS-read trust model, orphan-child `setTimeout.unref()` fix. DX review added the starvation warning block. Eng review added `ctx.shutdownSignal` separation, revised trusted-arg from opts-fold to separate 4th arg (stops accidental pass-through via `{...userOpts}` spreads), 18 additional test cases, 4 iron-rule regression tests. Codex outside voice caught 4 architectural dealbreakers: the worker abort silent-return bug (the "contract is a lie" finding), `--timeout-ms` CLI flag and `submit_job` param both missing, `PROTECTED_JOB_NAMES.has(name)` whitespace bypass before normalization. Effort estimate revised 8-10h → 16-20h once the full review was done.
## [0.13.1] - 2026-04-20
## **The brain stops being a write-once graph and starts being a runtime.**

@@ -52,8 +52,11 @@ strict behavior when unset.
- `src/commands/graph-query.ts` — `gbrain graph-query <slug> [--type T] [--depth N] [--direction in|out|both]`: typed-edge relationship traversal (renders indented tree)
- `src/core/link-extraction.ts` — shared library for the v0.12.0 graph layer. extractEntityRefs (canonical, replaces backlinks.ts duplicate) matches both `[Name](people/slug)` markdown links and Obsidian `[[people/slug|Name]]` wikilinks as of v0.12.3. extractPageLinks, inferLinkType heuristics (attended/works_at/invested_in/founded/advises/source/mentions), parseTimelineEntries, isAutoLinkEnabled config helper. `DIR_PATTERN` covers `people`, `companies`, `deals`, `topics`, `concepts`, `projects`, `entities`, `tech`, `finance`, `personal`, `openclaw`. Used by extract.ts, operations.ts auto-link post-hook, and backlinks.ts.
- `src/core/minions/` — Minions job queue: BullMQ-inspired, Postgres-native (queue, worker, backoff, types)
- `src/core/minions/queue.ts` — MinionQueue class (submit, claim, complete, fail, stall detection, parent-child, depth/child-cap, per-job timeouts, cascade-kill, attachments, idempotency keys, child_done inbox, removeOnComplete/Fail)
- `src/core/minions/worker.ts` — MinionWorker class (handler registry, lock renewal, graceful shutdown, timeout safety net)
- `src/core/minions/queue.ts` — MinionQueue class (submit, claim, complete, fail, stall detection, parent-child, depth/child-cap, per-job timeouts, cascade-kill, attachments, idempotency keys, child_done inbox, removeOnComplete/Fail). `add()` takes a 4th `trusted` arg (separate from `opts` to prevent spread leakage); protected names in `PROTECTED_JOB_NAMES` require `{allowProtectedSubmit: true}` and the check runs trim-normalized (whitespace-bypass safe).
- `src/core/minions/worker.ts` — MinionWorker class (handler registry, lock renewal, graceful shutdown, timeout safety net). v0.14.0 abort-path fix: aborted jobs now call `failJob` with reason (`timeout`/`cancel`/`lock-lost`/`shutdown`) instead of returning silently. `shutdownAbort` (instance field) fires on process SIGTERM/SIGINT and propagates to `ctx.shutdownSignal` — shell handler listens to it; non-shell handlers don't.
- `src/core/minions/protected-names.ts` — side-effect-free constant module exporting `PROTECTED_JOB_NAMES` + `isProtectedJobName()`. Kept pure so queue core can import without loading handler modules.
- `src/core/minions/handlers/shell.ts` — `shell` job handler. Spawns `/bin/sh -c cmd` (absolute path, PATH-override-safe) or `argv[0] argv[1..]` (no shell). Env allowlist: `PATH, HOME, USER, LANG, TZ, NODE_ENV` + caller `env:` overrides. UTF-8-safe stdout/stderr tail via `string_decoder.StringDecoder`. Abort (either `ctx.signal` or `ctx.shutdownSignal`) fires SIGTERM → 5s grace → SIGKILL on child. Requires `GBRAIN_ALLOW_SHELL_JOBS=1` on worker (gated by `registerBuiltinHandlers`).
- `src/core/minions/handlers/shell-audit.ts` — per-submission JSONL audit trail at `~/.gbrain/audit/shell-jobs-YYYY-Www.jsonl` (ISO-week rotation; override via `GBRAIN_AUDIT_DIR`). Best-effort: `mkdirSync(recursive)` + `appendFileSync`; failures logged to stderr, submission not blocked. Logs cmd (first 80 chars) or argv (JSON array). Never logs env values.
- `src/core/minions/attachments.ts` — Attachment validation (path traversal, null byte, oversize, base64, duplicate detection)
- `src/commands/jobs.ts` — `gbrain jobs` CLI subcommands + `gbrain jobs work` daemon
- `src/commands/features.ts` — `gbrain features --json --auto-fix`: usage scan + feature adoption salesman

@@ -216,6 +216,8 @@ gbrain skillpack-check | jq # full JSON: {healthy, summary, actions[], doc
If anything's off, `actions[]` tells you the exact command to run. For deeper troubleshooting: [`docs/guides/minions-fix.md`](docs/guides/minions-fix.md).
Moving gateway crons to Minions (deterministic scripts, zero LLM tokens per fire): [`docs/guides/minions-shell-jobs.md`](docs/guides/minions-shell-jobs.md).
## Skillify: your skills tree stops being a black box
Hermes and similar agent frameworks auto-create skills as a background behavior. Fine until you don't know what the agent shipped. Checklists decay. Tests drift. Resolver entries get stale. Six months later you've got an opaque pile of "skills" that nobody has read, nobody has tested, and nobody is sure still work.

@@ -84,6 +84,30 @@ board" — likely an advisor-role page prior plus verb-pattern combinations.
## P1
### Minions shell jobs — Phase 2 scheduling (deferred from v0.13.0)
**What:** `minion_schedules` table + autopilot-cycle scanner that submits due shell jobs.
**Why:** v0.13.0 moves shell scripts to Minions but still leaves scheduling in the host crontab. Your OpenClaw's `scripts/service-manager.sh` + crontab is the only piece left on the host side. A DB-driven scheduler would mean a single `gbrain autopilot --install` replaces the host crontab entirely, scheduling is visible via `gbrain jobs list --scheduled`, and downtime-on-one-machine tolerance improves (schedule is shared DB state, not per-host crontab).
**Pros:** Canonical host-agnostic deployment. No more host-specific crontab.
**Cons:** Cross-engine migration complexity (new table on both PGLite + Postgres). Autopilot-cycle scanner needs to handle missed-schedule semantics (fire-once-on-startup or skip-if-past-now), and this is where every other cron-like system has historically accrued bugs.
**Depends on:** v0.13.0 shell jobs shipped. ✅
### `gbrain crontab-to-minions <file>` migration helper (deferred from v0.13.0)
**What:** Parse an existing crontab file, emit a proposed rewrite using `gbrain jobs submit shell ...` for each deterministic entry, keep LLM-requiring entries as-is.
**Why:** Hand-rewriting ~14 OpenClaw cron entries is error-prone and one-shot. A helper would make the migration reversible and auditable (diff the before/after crontab, dry-run the first N, commit).
**Pros:** Removes the "rewrite 14 lines by hand" tax every agent operator pays on adoption.
**Cons:** Crontab parsing is historically fiddly (5-field vs 6-field, `@hourly` aliases, Vixie extensions, env vars in crontab). Could misrewrite entries with shell substitution.
**Depends on:** v0.13.0 shell jobs shipped. ✅
### Batch the DB-source extract read path (deferred from v0.12.1)
**What:** `extractLinksFromDB` and `extractTimelineFromDB` at `src/commands/extract.ts:447, 504` issue one `engine.getPage(slug)` per slug after `engine.getAllSlugs()`. On a 47K-page brain that's still 47K serial reads over the Supabase pooler.
@@ -204,6 +228,50 @@ board" — likely an advisor-role page prior plus verb-pattern combinations.
## P2
### Minions: `gbrain jobs stats --orphaned` (deferred from v0.13.0)
**What:** New CLI flag / output column surfacing jobs that are waiting with no registered handler on any live worker.
**Why:** v0.13.0 adds shell jobs that require `GBRAIN_ALLOW_SHELL_JOBS=1` on the worker. If an operator submits a shell job but no worker with the flag is running, the row sits in `waiting` silently. The CLI's starvation warning + docs help at submit time; this TODO surfaces the problem at operational-check time.
**Pros:** Closes the "did my cron actually run" ambiguity for multi-machine deployments.
**Cons:** Knowing "no worker has this handler registered" requires worker heartbeat tracking, which Minions doesn't have yet (it's stateless at DB level beyond `lock_token`). Could be approximated by "no jobs of this name have completed in last N minutes AND count of waiting is > 0."
**Depends on:** v0.13.0 shell jobs shipped. ✅
### Minions: AbortReason plumbing on MinionJobContext (deferred from v0.13.0)
**What:** Handlers today can't distinguish whether `ctx.signal.aborted` fired due to timeout, cancel, or lock-loss. v0.13.0 derives this at worker-catch-time from `abort.signal.reason`, but the handler can't see it directly. Expose `ctx.abortReason?: 'timeout' | 'cancel' | 'lock-lost' | 'shutdown'` on the context.
**Why:** Shell handler's kill-sequence today can't decide "retry this" (lock-lost) vs "don't retry, user cancelled" (cancel) — they look the same. A typed AbortReason lets handlers make that decision for themselves.
**Pros:** Handlers get richer signals.
**Cons:** Small surface-area addition to the handler API. Not strictly required since the worker already makes the retry/dead decision for them.
**Depends on:** v0.13.0 shell jobs shipped. ✅
### Minions: blocking-mode audit log for true forensic integrity (deferred from v0.13.0)
**What:** Opt-in mode for `shell-audit` where `appendFileSync` failures DO block submission instead of logging-and-continuing.
**Why:** v0.13.0 ships the audit log in best-effort mode, which means a disk-full attacker can silently disable the forensic trail. Acceptable for v0.13.0 because the primary use is operational ("what did this cron do last Tuesday"), not security forensics. Operators who want fail-closed semantics should have a flag.
**Pros:** Enables true forensic integrity for deployments that need it.
**Cons:** Fail-closed means a transient disk issue blocks shell submissions, which can be worse than a missing log line for most operators. Opt-in is the right shape but adds surface area.
**Depends on:** v0.13.0 shell jobs shipped. ✅
### Minions: configurable per-job output buffer sizes (deferred from v0.13.0)
**What:** Add `max_stdout_bytes` / `max_stderr_bytes` to ShellJobParams; override the 64KB/16KB defaults.
**Why:** 64KB/16KB covers typical OpenClaw scripts today but a verbose benchmark or a debug-dump script could need more.
**Depends on:** First shell-job author who actually needs it. Don't pre-build the flag.
### Security hardening follow-ups (deferred from security-wave-3)
**What:** Close remaining security gaps identified during the v0.9.4 Codex outside-voice review that didn't make the wave's in-scope cut.

@@ -1 +1 @@
-0.13.1
+0.14.0

@@ -52,6 +52,7 @@ Running a production brain.
| [Cron via Minions](../skills/conventions/cron-via-minions.md) | Why scheduled work runs as Minion jobs, not `agentTurn`. Auto-applied by v0.11.0 migration for built-in handlers; host-specific handlers use the plugin contract below. |
| [Plugin Handlers](guides/plugin-handlers.md) | Registering host-specific Minion handlers via code (no data-file exec surface). |
| [Minions fix](guides/minions-fix.md) | Repairing a half-migrated v0.11.0 install. |
| [Shell jobs (v0.14.0+)](guides/minions-shell-jobs.md) | Move deterministic crons (API fetch, token refresh, scrape+write) off the LLM gateway. Zero tokens per fire, ~60% gateway headroom. Follow `skills/migrations/v0.14.0.md` for the adoption playbook. |
| [Quiet Hours & Timezone](guides/quiet-hours.md) | Hold notifications during sleep, timezone-aware delivery |
| [Executive Assistant Pattern](guides/executive-assistant.md) | Email triage, meeting prep, scheduling |
| [Operational Disciplines](guides/operational-disciplines.md) | Signal detection, brain-first, sync-after-write, heartbeat, dream cycle |

@@ -319,6 +319,42 @@ v0.13 edges carry new `link_type` values. If your fork has graph-query skills th
### Type normalization NOT in v0.13
Legacy rows with `link_type='attendee'` or `link_type='mention'` coexist with new `'attended'` / `'mentions'` rows. Your queries filtering on old type names keep working. A separate opt-in `gbrain normalize-types` command in v0.14 handles the rename.
## v0.14.0 shell jobs (optional adoption, no skill edits)
Adds a `shell` job type to Minions so deterministic cron scripts (API fetch, token
refresh, scrape + write) move off the LLM gateway. Zero tokens per fire. ~60%
gateway CPU headroom at typical scale. The feature is **off by default**, so
existing installs keep running exactly as they did before. Nothing breaks.
To adopt, follow `skills/migrations/v0.14.0.md`. The short version:
1. Set `GBRAIN_ALLOW_SHELL_JOBS=1` on the worker process, then `gbrain jobs work`
(Postgres). On PGLite, every crontab invocation uses `--follow` for inline
execution; no persistent worker.
2. Classify each of your host's cron entries: LLM-requiring (keep on gateway) vs
deterministic (candidate for shell). Typical splits:
- **Deterministic → shell:** `ycli-token-refresh`, `x-oauth2-refresh`,
`x-garrytan-unified`, `calendar-sync-to-brain`, `github-pulse`,
`frameio-scan`, `flight-tracker`, `x-raw-json-backfill`.
- **LLM-requiring → stay:** `social-radar`, `content-ideas`, `adversary-vacuum`,
`ea-inbox-sweep`, `morning-briefing`, `brain-maintenance`.
3. For each deterministic cron, rewrite as:
```cron
# crontab has no backslash line continuation, so keep the entry on one line.
3 13,16,19,22,1,4,7,10 * * * gbrain jobs submit shell --params '{"cmd":"node scripts/your-script.mjs","cwd":"/data/.openclaw/workspace"}' --max-attempts 3 --timeout-ms 300000
```
4. Watch `gbrain jobs get <id>` for exit_code / stdout_tail / stderr_tail on each
fire. Compare against pre-migration behavior before approving the next batch.
**No skill edits required.** The handler runs worker-side; skill files don't
change. If your host exposed custom handlers via the plugin contract (v0.11.0),
they still work the same way.
Iron rule: **never auto-rewrite the operator's crontab.** Every rewrite is
per-cron, human-approved, with a diff. If you want automation later, the
upcoming `gbrain crontab-to-minions <file>` helper is P1 in TODOS.
---

@@ -0,0 +1,167 @@
# Minions shell jobs — move deterministic crons off the gateway
## 30 seconds
```bash
# Run your first shell job:
GBRAIN_ALLOW_SHELL_JOBS=1 gbrain jobs submit shell \
--params '{"cmd":"echo hello","cwd":"/tmp"}' --follow
# → exit_code: 0, stdout_tail: "hello\n", duration_ms: 43
```
That's it. Your cron scripts now have a home with retry, backoff, DLQ, and
`gbrain jobs list` visibility, without each one booting a full LLM session.
**PGLite users:** `gbrain jobs work` does not run on PGLite (exclusive file
lock). Every crontab invocation must use `--follow` for inline execution.
Postgres users can run a persistent worker; see recipes below.
---
## Why it exists
If your agent runs deterministic scripts from cron (token refresh, API fetch,
scrape + write), each one pays the cost of a full LLM session on the gateway.
Fourteen simultaneous fires on a Series A deployment pin CPU at 100% and block
live messages. None of those scripts need reasoning. They need a shell.
Shell jobs move them to the Minions worker: one deterministic-script execution
per cron, zero LLM tokens, unified visibility and retry.
---
## Security model (read this)
Shell exec is a large blast radius. We ship two independent gates, and both
must pass:
1. **MCP boundary.** `submit_job` with `name: 'shell'` is rejected when
`ctx.remote === true` (MCP callers). Independent of the env flag. Remote
agents can never submit shell jobs. `MinionQueue.add('shell', ...)` has its
own guard too, so an in-process handler can't programmatically bypass this.
2. **Env flag.** The worker only registers the shell handler when
`GBRAIN_ALLOW_SHELL_JOBS=1` is set on the worker process. Default: off. Your
agent opts in per-host.
**What the env allowlist does AND does not do.** Shell jobs run with a minimal
env: `PATH, HOME, USER, LANG, TZ, NODE_ENV`. Your secrets like `OPENAI_API_KEY`
and `DATABASE_URL` are NOT passed to the child. You opt in additional keys per
job via `env: { ... }`. This stops accidental `$OPENAI_API_KEY` interpolation in
a user-authored script. It does **not** sandbox filesystem reads: a shell
script can `cat ~/.env` or any file the worker process can read. The operator
picks a safe `cwd`. That is the trust boundary.
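
To make the allowlist behaviour concrete, here is a minimal sketch of the merge described above. It assumes a plain object stands in for the worker's environment; `buildChildEnvSketch` and `workerEnv` are illustrative names, not the handler's exports.

```typescript
// Illustrative sketch only: mirrors the documented allowlist, not the real handler.
const ALLOWLIST = ['PATH', 'HOME', 'USER', 'LANG', 'TZ', 'NODE_ENV'] as const;

type Env = Record<string, string | undefined>;

function buildChildEnvSketch(
  workerEnv: Env,
  jobEnv: Record<string, string> = {},
): Record<string, string> {
  const child: Record<string, string> = {};
  // Copy only allowlisted keys from the worker's environment.
  for (const key of ALLOWLIST) {
    const value = workerEnv[key];
    if (typeof value === 'string') child[key] = value;
  }
  // Per-job `env` entries are opted in explicitly and win on conflict.
  return { ...child, ...jobEnv };
}

// Made-up values for illustration.
const workerEnv: Env = { PATH: '/usr/bin', HOME: '/home/op', OPENAI_API_KEY: 'sk-not-forwarded' };
const childEnv = buildChildEnvSketch(workerEnv, { YC_API_TOKEN: 'abc' });
console.log(Object.keys(childEnv).sort().join(','));
// → HOME,PATH,YC_API_TOKEN
```

`OPENAI_API_KEY` is absent from the child's environment, but nothing here stops the script from reading a file that contains it. That is the filesystem caveat above.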
**Audit trail, not forensic insurance.** Every submission writes a JSONL line
to `~/.gbrain/audit/shell-jobs-YYYY-Www.jsonl` (ISO-week rotation; override
with `GBRAIN_AUDIT_DIR`). Failures log to stderr and don't block submission, so
a disk-full adversary could silently disable the trail. Good for "what did
this cron submit last Tuesday", not for security-critical forensics.
**The command text is logged as-is.** If you embed a secret in `cmd`
(`curl -H 'Authorization: Bearer ...'`), it shows up in the audit file. Put
secrets in `env:` instead.
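
For the "what did this cron submit last Tuesday" use case, a small sketch of reading the JSONL trail may help. The field names follow the `ShellAuditEvent` shape introduced in this change; the sample lines and the helper names (`parseAuditJsonl`, `submissionsOn`) are made up for illustration.

```typescript
// Field names follow ShellAuditEvent from this change; sample data is invented.
interface AuditLine {
  ts: string;
  caller: 'cli' | 'mcp';
  remote: boolean;
  job_id: number;
  cwd: string;
  cmd_display?: string;
  argv_display?: string[];
}

function parseAuditJsonl(text: string): AuditLine[] {
  return text
    .split('\n')
    .filter((line) => line.trim().length > 0) // skip blank trailing lines
    .map((line) => JSON.parse(line) as AuditLine);
}

function submissionsOn(lines: AuditLine[], isoDate: string): AuditLine[] {
  // `ts` is an ISO-8601 timestamp, so the date is its first 10 characters.
  return lines.filter((l) => l.ts.slice(0, 10) === isoDate);
}

const sample = [
  '{"ts":"2026-04-14T13:03:00.000Z","caller":"cli","remote":false,"job_id":41,"cwd":"/data","cmd_display":"node scripts/a.mjs"}',
  '{"ts":"2026-04-15T13:03:00.000Z","caller":"cli","remote":false,"job_id":42,"cwd":"/data","argv_display":["node","scripts/b.mjs"]}',
  '',
].join('\n');

const matches = submissionsOn(parseAuditJsonl(sample), '2026-04-14');
console.log(matches.map((l) => l.job_id));
```

Because `cmd_display` is stored as-is (truncated), any token embedded in the command would show up in `matches` too.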
---
## Migrate a cron
### Postgres worker (recommended)
In one terminal, start a persistent worker:
```bash
GBRAIN_ALLOW_SHELL_JOBS=1 gbrain jobs work
```
Rewrite crontab to submit shell jobs (no `--follow`):
```cron
# Before (LLM gateway):
# OpenClaw cron: x-garrytan-unified
# After (Minions worker):
3 13,16,19,22,1,4,7,10 * * * \
gbrain jobs submit shell \
--params '{"cmd":"node scripts/x-garrytan-daily.mjs","cwd":"/data/.openclaw/workspace"}' \
--max-attempts 3 --timeout-ms 300000
```
Worker claims the job on next poll, runs it, records `exit_code` +
`stdout_tail` + `stderr_tail` in the result. Failures retry per
`--max-attempts` with exponential backoff.
### PGLite (inline execution)
PGLite doesn't support the persistent worker daemon. Every crontab invocation
uses `--follow` to run inline:
```cron
# Each cron tick spawns a short-lived worker that runs the job inline.
3 13,16,19,22,1,4,7,10 * * * \
GBRAIN_ALLOW_SHELL_JOBS=1 gbrain jobs submit shell \
--params '{"cmd":"node scripts/x-garrytan-daily.mjs","cwd":"/data/.openclaw/workspace"}' \
--follow --timeout-ms 300000
```
Note: `--follow` blocks the crontab slot until the job finishes. If 14 shell
crons land at the same minute and each takes 30s, they serialize through
crontab's spawning limits. Postgres + persistent worker scales better.
### Submitting with `argv` (no shell interpolation)
For programmatic callers assembling commands from JSON, use `argv` instead of
`cmd`. No shell, no injection surface:
```bash
gbrain jobs submit shell \
--params '{"argv":["node","scripts/fetch.mjs","--date","2026-04-19"],"cwd":"/data"}' \
--follow
```
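
A programmatic caller might assemble the `--params` value along these lines. This is a sketch under assumptions: `buildArgvParams` is a hypothetical helper, and its checks only approximate the handler's validation.

```typescript
// Hypothetical helper: builds the --params JSON for an argv-style shell job.
interface ArgvJobParams {
  argv: string[];
  cwd: string;
}

function buildArgvParams(argv: string[], cwd: string): string {
  if (argv.length === 0) throw new Error('argv must not be empty');
  if (!cwd.startsWith('/')) throw new Error('cwd must be an absolute path');
  const params: ArgvJobParams = { argv, cwd };
  // JSON.stringify handles quoting, so a value with spaces or shell
  // metacharacters stays a single argv element.
  return JSON.stringify(params);
}

const untrustedDate = '2026-04-19; rm -rf /';
const paramsJson = buildArgvParams(['node', 'scripts/fetch.mjs', '--date', untrustedDate], '/data');
console.log(paramsJson);
```

The hostile-looking date ends up as one literal argument to `node`, because the `argv` path spawns without a shell. If the caller launches `gbrain` itself through a shell, the JSON string still needs ordinary shell quoting at that outer layer.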
---
## Debug a failed job
```bash
# List dead shell jobs
gbrain jobs list --status dead
# Inspect one
gbrain jobs get 42
# → error_text, stacktrace, result.stdout_tail, result.stderr_tail
# Submission audit log (operator trail, not forensic)
cat ~/.gbrain/audit/shell-jobs-*.jsonl | jq '.'
# First-time failure mode: submitted without env flag on the worker
gbrain jobs list --status waiting --name shell
# If rows pile up here, no worker with GBRAIN_ALLOW_SHELL_JOBS=1 is running.
```
---
## Limitations
- **Filesystem reads are not sandboxed.** See "Security model" above. Don't
point `cwd` at a directory full of secrets.
- **Audit log is advisory.** Disk-full or EACCES silently disables it.
- **Cancel latency is lock-renewal-bounded** (~7-15 s by default). A cancelled
child keeps running until the next lock-renewal tick fails.
- **`--follow` claim order** is by priority/created_at. If another job is
waiting in the same queue at the time of `--follow`, that one runs first.
- **`cwd` symlink TOCTOU.** The absolute-path check doesn't guard against
symlinks pointing elsewhere at execution time. Operator-scope concern.
---
## Errors {#errors}
| Error | What it means | Fix |
|---|---|---|
| `shell: specify exactly one of cmd or argv` | `cmd` and `argv` are mutually exclusive. Both absent is also invalid. | Choose one. `cmd` for shell-interpolated strings; `argv` for structured args. |
| `shell: cwd is required and must be an absolute path` | `cwd` must be a string starting with `/`. | Set `cwd` in `--params` to an absolute path. |
| `shell: argv must be an array of strings` | `argv` has a non-string entry or isn't an array. | Pass `argv: ["bin","arg1","arg2"]`. |
| `shell: env values must all be strings` | `env` has a number/bool/object value. | Stringify: `"env":{"COUNT":"3"}` not `"env":{"COUNT":3}`. |
| `permission_denied: shell jobs cannot be submitted over MCP` | An MCP client tried to submit a shell job. By design CLI-only. | Submit from CLI or via a trusted operation handler (`ctx.remote === false`). |
| `protected job name 'shell' requires CLI or operation-local submitter` | A caller invoked `MinionQueue.add('shell', ...)` without the `trusted` opt-in. | Pass `{ allowProtectedSubmit: true }` as the 4th arg. CLI and `submit_job` do this automatically. |
| `aborted: timeout` / `aborted: cancel` / `aborted: shutdown` / `aborted: lock-lost` | The worker's abort signal fired mid-execution. Child got SIGTERM, 5s grace, then SIGKILL. | Expected: timeout / user cancel / deploy restart / stall. Inspect `gbrain jobs get` to see which. |
| `exit N: <stderr_tail_500>` | Script exited non-zero. | Read `stderr_tail` in `gbrain jobs get`. |
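
Tooling that reacts to failures could bucket `error_text` using the prefixes in the table. The sketch below assumes the stored text starts exactly with those strings; `classifyShellError` and the `ShellFailure` type are illustrative, not part of gbrain.

```typescript
// Illustrative classifier keyed on the error prefixes listed in the table above.
type ShellFailure =
  | { kind: 'aborted'; reason: string }
  | { kind: 'exit'; code: number; stderr: string }
  | { kind: 'validation'; message: string }
  | { kind: 'other'; message: string };

function classifyShellError(errorText: string): ShellFailure {
  const aborted = /^aborted: (.+)$/.exec(errorText);
  if (aborted) return { kind: 'aborted', reason: aborted[1] };

  // stderr may span several lines, so match any character including newlines.
  const exit = /^exit (-?\d+): ([\s\S]*)$/.exec(errorText);
  if (exit) return { kind: 'exit', code: Number(exit[1]), stderr: exit[2] };

  // Parameter validation errors from the handler all start with "shell: ".
  if (errorText.startsWith('shell: ')) return { kind: 'validation', message: errorText };
  return { kind: 'other', message: errorText };
}

console.log(classifyShellError('aborted: timeout'));
console.log(classifyShellError('exit 2: fetch failed\n'));
```

A reasonable policy on top of this: treat `validation` as a bug in the crontab line, `aborted` as operational noise worth counting, and `exit` as the script's own failure.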


@@ -0,0 +1,185 @@
---
version: 0.14.0
feature_pitch:
headline: "Move deterministic crons off the LLM gateway. Zero tokens per fire."
description: |
OpenClaw operators: your gateway pins at 100% CPU because most cron jobs
boot a full Opus session to do deterministic work (API fetch, token
refresh, scrape + markdown write) that doesn't need reasoning. v0.14.0
adds a `shell` job type to Minions so those jobs run as child processes
under the existing Minions worker. ~60% gateway CPU reduction at typical
scale. Retry, backoff, DLQ, unified `gbrain jobs list` visibility, all
free. The LLM-reasoning crons stay on the gateway where they belong.
recipe: docs/guides/minions-shell-jobs.md
tiers: null
---
# v0.14.0 Migration: Adopt the `shell` job type
**Audience: host agents (OpenClaw deployments, Hermes operators, future
hosts) reading this AFTER `gbrain apply-migrations` has run. There is no
mechanical orchestrator for this release — every change described here
is host-specific and requires operator judgment per cron.** Shell jobs
are off by default on upgrade (nothing breaks). This file walks the
host agent through enabling and adopting them.
## Iron rules
1. **Never auto-rewrite the operator's crontab.** Crontab entries are
host-specific code per the CLAUDE.md "host-specific code" exception.
Every rewrite is an explicit human approval, per cron, with a diff.
2. **LLM-requiring crons stay on the gateway.** If the cron's work
includes reasoning (sentiment classification, triage, synthesis,
prose generation), it belongs on the gateway. Moving it to a shell
job throws away the LLM session the cron needs.
3. **Deterministic crons are the only candidates.** API fetch, token
refresh, database read + markdown write, HTTP refresh call, scrape.
If the script is `node scripts/x.mjs` or `curl ... | jq` or similar
and does zero LLM work, it's a candidate.
## Step 1: Enable the worker
Pick the engine the operator is on:
**Postgres** (most OpenClaw/Hermes deployments):
```bash
# In the worker bootstrap, export the env flag and run the daemon:
GBRAIN_ALLOW_SHELL_JOBS=1 gbrain jobs work
```
The worker claims shell jobs from the queue and executes them. Retries,
backoff, and dead-letter all work the same as sync/embed jobs.
**PGLite**: no persistent worker, per-tick inline execution only:
```bash
# Every crontab invocation must use --follow; PGLite's worker daemon
# exits immediately due to exclusive file lock.
GBRAIN_ALLOW_SHELL_JOBS=1 gbrain jobs submit shell \
--params '{"cmd":"...","cwd":"..."}' --follow
```
## Step 2: Audit the operator's cron manifest
Read the operator's cron manifest. Typical locations:
- `~/.claude/cron/jobs.json` (OpenClaw)
- `scripts/service-manager.sh` in the host repo
- System crontab (`crontab -l`)
For each entry, classify:
| Pattern | Class | Action |
|---------|-------|--------|
| `agentTurn <skill>` or any OpenClaw-dispatched LLM skill | LLM-requiring | **Leave as-is.** Needs gateway. |
| `node scripts/*.mjs` that hits an API and writes markdown | Deterministic | Propose shell-job rewrite. |
| Token refresh (`ycli token-refresh`, `x-oauth2-refresh`) | Deterministic | Propose shell-job rewrite. |
| Scrape + write (`frameio-scan`, `flight-tracker`) | Deterministic | Propose shell-job rewrite. |
| Audio transcription or any LLM-dependent extract | LLM-requiring | Leave as-is. |
| `bash` wrapper scripts that may call LLM tools internally | Ambiguous | Ask the operator. Don't assume. |
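
A first-pass triage of the manifest could be sketched as below. The patterns are a rough heuristic drawn from the table, `classifyCronCommand` is a hypothetical name, and the result is a suggestion for the operator to review, never a decision.

```typescript
// Heuristic only: the operator confirms every classification.
type CronClass = 'llm-requiring' | 'deterministic' | 'ambiguous';

function classifyCronCommand(command: string): CronClass {
  const cmd = command.trim();
  // OpenClaw-dispatched skills need the gateway.
  if (/\bagentTurn\b/.test(cmd)) return 'llm-requiring';
  // Wrapper scripts may call LLM tools internally, so never assume.
  if (/^(bash|sh)\s/.test(cmd)) return 'ambiguous';
  // Plain node scripts and token refreshers are shell-job candidates.
  if (/^node\s+\S+\.mjs\b/.test(cmd)) return 'deterministic';
  if (/\b(token-refresh|oauth2-refresh)\b/.test(cmd)) return 'deterministic';
  // Anything unrecognised goes back to the operator.
  return 'ambiguous';
}

for (const entry of ['agentTurn social-radar', 'node scripts/github-pulse.mjs', 'bash run-all.sh']) {
  console.log(`${entry} -> ${classifyCronCommand(entry)}`);
}
```

Defaulting to `ambiguous` keeps the heuristic on the safe side: a `node` script that secretly calls an LLM API would still be misclassified, which is why the per-cron human review in Step 3 stays mandatory.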
## Step 3: Propose rewrites per cron
For each deterministic cron, propose the exact rewrite with a diff. Show
the operator both sides. Let them approve per-cron, not in bulk.
**Before** (LLM gateway):
```
OpenClaw cron: x-garrytan-unified, 3 13,16,19,22,1,4,7,10 * * *
→ runs agentTurn x-garrytan-unified
→ boots Opus context, invokes script, returns
```
**After** (Minions worker):
```cron
3 13,16,19,22,1,4,7,10 * * * \
gbrain jobs submit shell \
--params '{"cmd":"node /data/.openclaw/workspace/scripts/x-garrytan-daily.mjs","cwd":"/data/.openclaw/workspace"}' \
--max-attempts 3 --timeout-ms 300000
```
Rewrite rules:
- `cwd` is required and must be an absolute path. Operator picks it. It
should be the directory the script expects to run in (the host repo
root, typically).
- `--max-attempts 3` matches the default Minions retry policy. Override
if the script is non-idempotent and should only run once per fire.
- `--timeout-ms N` caps the child's wall-clock runtime. Set to the 95th
percentile of the script's observed runtime, plus slack. Examples:
token refresh → 30s; API fetch → 300s; scrape → 600s.
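- **PGLite operators:** add `--follow` to every line and prefix it with
`GBRAIN_ALLOW_SHELL_JOBS=1`, as in the PGLite example in Step 1. There is
no persistent worker to start.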
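
The `--timeout-ms` rule can be turned into arithmetic. The sketch below uses a nearest-rank 95th percentile and a 1.5x slack factor. Both are assumptions, since the guide only says "plus slack", and `suggestTimeoutMs` is a hypothetical helper.

```typescript
// Sketch: derive a --timeout-ms value from observed runtimes (milliseconds).
// The 1.5x slack factor is an assumption, not a gbrain default.
function suggestTimeoutMs(observedMs: number[], slackFactor = 1.5): number {
  if (observedMs.length === 0) throw new Error('need at least one observed runtime');
  const sorted = [...observedMs].sort((a, b) => a - b);
  // Nearest-rank p95: smallest sample with at least 95% of samples at or below it.
  const rank = Math.ceil(0.95 * sorted.length);
  const p95 = sorted[rank - 1];
  // Round up to a whole second so the flag value is easy to read.
  return Math.ceil((p95 * slackFactor) / 1000) * 1000;
}

// Made-up sample: nine quick runs and one slow outlier.
const runtimes = [12000, 15000, 14000, 90000, 13500, 16000, 15500, 14200, 13900, 18000];
console.log(suggestTimeoutMs(runtimes));
// → 135000
```

With only ten samples the p95 is the slowest run, so one outlier dominates. Collecting more fires before settling on a value gives a steadier number.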
## Step 4: Secrets that the script needs
Shell jobs receive a minimal env allowlist by default: `PATH, HOME,
USER, LANG, TZ, NODE_ENV`. They do NOT inherit `OPENAI_API_KEY`,
`ANTHROPIC_API_KEY`, `DATABASE_URL`, or any other worker env vars.
If a cron's script needs an API key, name it explicitly:
```bash
gbrain jobs submit shell \
--params '{"cmd":"node scripts/yc-sync.mjs","cwd":"/data/.openclaw/workspace","env":{"YC_API_TOKEN":"'"$YC_API_TOKEN"'"}}'
```
The shell expands `$YC_API_TOKEN` at submit time, so the worker receives
the JSON with the literal token value. The audit log records only `cwd`
and the truncated `cmd` / `argv`; `env` keys and values never appear in it.
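
When params are assembled in code rather than by shell quoting, letting a JSON serializer do the escaping avoids two pitfalls: non-string `env` values (rejected by the handler) and tokens containing `"` or `\`, which would produce invalid JSON with the inline-quoting approach. A minimal sketch, with `toEnvStrings` and `buildCmdParams` as hypothetical helpers:

```typescript
// Sketch: build --params for a cmd-style job with per-job env.
type EnvInput = Record<string, string | number | boolean>;

function toEnvStrings(values: EnvInput): Record<string, string> {
  const out: Record<string, string> = {};
  // The handler requires every env value to be a string.
  for (const key of Object.keys(values)) out[key] = String(values[key]);
  return out;
}

function buildCmdParams(cmd: string, cwd: string, env: EnvInput): string {
  return JSON.stringify({ cmd, cwd, env: toEnvStrings(env) });
}

const cmdParamsJson = buildCmdParams('node scripts/yc-sync.mjs', '/data/.openclaw/workspace', {
  YC_API_TOKEN: 'placeholder-"quoted"-token', // placeholder, not a real token
  PAGE_SIZE: 50,
});
console.log(cmdParamsJson);
```

The resulting string carries the literal token, so treat it like any other secret-bearing value when passing it to the CLI.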
## Step 5: Verify the first migrated cron
After rewriting ONE cron with the operator's approval:
1. Wait for the next scheduled fire (or trigger manually: `gbrain jobs
submit shell --params '...' --follow`).
2. Check `gbrain jobs list --status completed --name shell --limit 5`
for the result.
3. `gbrain jobs get <id>` shows `exit_code`, `stdout_tail`, `stderr_tail`,
`duration_ms`.
4. Compare against the pre-migration behavior: did it do the same work?
Same output files changed? Same side effects?
Only after one cron is verified working end-to-end should the operator
approve the next batch.
## Step 6: Starvation sanity check
If the operator submits shell jobs but forgot to set
`GBRAIN_ALLOW_SHELL_JOBS=1` on the worker, jobs sit in `waiting`
indefinitely. The CLI warns on submission, but for daemon-style
deployments the warning scrolls past. Add this to the operator's
ops-check runbook:
```bash
gbrain jobs list --status waiting --name shell
```
If rows pile up here, either (a) no worker has the env flag set, or
(b) the worker crashed. Fix by restarting with the flag.
## Non-goals (explicitly deferred to later releases)
- **Automatic crontab rewrites.** Deferred to a future `gbrain
crontab-to-minions <file>` helper. P1 in TODOS.md.
- **DB-backed scheduler.** `minion_schedules` table replaces host
crontab entirely. P1 in TODOS.md.
- **Orphaned-shell-job stats.** `gbrain jobs stats --orphaned` would
surface the "no worker with env flag" case. P2 in TODOS.md.
- **Configurable buffer sizes.** Output tails are fixed at 64KB stdout
/ 16KB stderr. P2 in TODOS.md.
## When to stop
The migration is done when:
1. The worker runs with `GBRAIN_ALLOW_SHELL_JOBS=1` (Postgres) or every
cron uses `--follow` (PGLite).
2. Every deterministic cron the operator approved has been rewritten.
3. The operator has verified at least one full cron fire cycle
end-to-end and confirmed the output matches pre-migration.
4. `gbrain jobs stats` shows shell jobs completing at expected rates
with few or zero retries.
Gateway CPU should visibly drop after the first few rewrites. That's
the signal the adoption is working.


@@ -57,8 +57,8 @@ export async function runJobs(engine: BrainEngine, args: string[]): Promise<void
USAGE
gbrain jobs submit <name> [--params JSON] [--follow] [--priority N]
[--delay Nms] [--timeout-ms Nms] [--max-attempts N]
[--queue Q] [--dry-run]
gbrain jobs list [--status S] [--queue Q] [--limit N]
gbrain jobs get <id>
gbrain jobs cancel <id>
@@ -68,6 +68,18 @@ USAGE
gbrain jobs stats
gbrain jobs smoke
gbrain jobs work [--queue Q] [--concurrency N]
HANDLER TYPES (built in)
sync Pull and embed new pages from the repo
embed (Re-)embed pages; --params '{"slug":...}' or '{"all":true}'
lint Run page linter; --params '{"dir":"...","fix":true}'
import Bulk import markdown; --params '{"dir":"..."}'
extract Extract links + timeline entries; '{"mode":"all"}'
backlinks Check or fix back-links; '{"action":"fix"}'
autopilot-cycle One autopilot pass (sync+extract+embed+backlinks)
shell Run a command or argv. Requires GBRAIN_ALLOW_SHELL_JOBS=1
on the worker. Params: {cmd?, argv?, cwd, env?}.
See: docs/guides/minions-shell-jobs.md
`);
return;
}
@@ -93,6 +105,12 @@ USAGE
const delay = parseInt(parseFlag(args, '--delay') ?? '0', 10);
const maxAttempts = parseInt(parseFlag(args, '--max-attempts') ?? '3', 10);
const queueName = parseFlag(args, '--queue') ?? 'default';
const timeoutMsRaw = parseFlag(args, '--timeout-ms');
const timeoutMs = timeoutMsRaw !== undefined ? parseInt(timeoutMsRaw, 10) : undefined;
if (timeoutMsRaw !== undefined && (isNaN(timeoutMs!) || timeoutMs! <= 0)) {
console.error('Error: --timeout-ms must be a positive integer (milliseconds)');
process.exit(1);
}
const dryRun = hasFlag(args, '--dry-run');
const follow = hasFlag(args, '--follow');
@@ -103,6 +121,7 @@ USAGE
console.log(` Priority: ${priority}`);
console.log(` Max attempts: ${maxAttempts}`);
if (delay > 0) console.log(` Delay: ${delay}ms`);
if (timeoutMs) console.log(` Timeout: ${timeoutMs}ms`);
console.log(` Data: ${JSON.stringify(data)}`);
return;
}
@@ -114,12 +133,51 @@ USAGE
process.exit(1);
}
// The CLI path is a trusted submitter. Pass {allowProtectedSubmit: true}
// ONLY for protected names, not blanket-set for every submission, so any
// future protected name forces explicit opt-in at the call site.
const { isProtectedJobName } = await import('../core/minions/protected-names.ts');
const trusted = isProtectedJobName(name) ? { allowProtectedSubmit: true } : undefined;
const job = await queue.add(name, data, {
priority,
delay: delay > 0 ? delay : undefined,
max_attempts: maxAttempts,
queue: queueName,
timeout_ms: timeoutMs,
}, trusted);
// Submission audit log (operational trace, not forensic insurance).
try {
const { logShellSubmission } = await import('../core/minions/handlers/shell-audit.ts');
if (name.trim() === 'shell') {
logShellSubmission({
caller: 'cli',
remote: false,
job_id: job.id,
cwd: typeof data.cwd === 'string' ? data.cwd : '',
cmd_display: typeof data.cmd === 'string' ? data.cmd.slice(0, 80) : undefined,
argv_display: Array.isArray(data.argv)
? (data.argv as unknown[]).filter((a): a is string => typeof a === 'string').map((a) => a.slice(0, 80))
: undefined,
});
}
} catch { /* audit failures never block submission */ }
// Starvation warning (DX polish). Fire for every non-`--follow` shell submit
// regardless of the submitter's own `GBRAIN_ALLOW_SHELL_JOBS` — the submitter
// env is a weak proxy for the worker env (they may run on different machines),
// so the warning remains useful any time the job might sit in 'waiting'.
if (!follow && name.trim() === 'shell') {
process.stderr.write(
`\n⚠ Shell jobs require GBRAIN_ALLOW_SHELL_JOBS=1 on the worker process.\n` +
` Your job was queued (id=${job.id}) but will sit in 'waiting' until a\n` +
` worker with the env flag starts. To run now:\n\n` +
` GBRAIN_ALLOW_SHELL_JOBS=1 gbrain jobs submit shell \\\n` +
` --params '...' --follow\n\n` +
` Or start a persistent worker (Postgres only — PGLite uses --follow):\n\n` +
` GBRAIN_ALLOW_SHELL_JOBS=1 gbrain jobs work\n\n`,
);
}
if (follow) {
console.log(`Job #${job.id} submitted (${name}). Executing inline...`);
@@ -470,4 +528,16 @@ export async function registerBuiltinHandlers(worker: MinionWorker, engine: Brai
}
return { partial: false, steps };
});
// Shell handler: registered ONLY when GBRAIN_ALLOW_SHELL_JOBS=1 is set on the
// worker process. Default-closed; opt-in per-host. Without the flag, shell
// jobs submitted via CLI insert rows but no worker claims them (they sit in
// 'waiting' — the CLI prints a starvation warning for that case).
if (process.env.GBRAIN_ALLOW_SHELL_JOBS === '1') {
const { shellHandler } = await import('../core/minions/handlers/shell.ts');
worker.register('shell', shellHandler);
process.stderr.write('[minion worker] shell handler enabled (GBRAIN_ALLOW_SHELL_JOBS=1)\n');
} else {
process.stderr.write('[minion worker] shell handler disabled (set GBRAIN_ALLOW_SHELL_JOBS=1 to enable)\n');
}
}


@@ -0,0 +1,75 @@
/**
* Shell-job submission audit log (operational trace, NOT forensic insurance).
*
* Writes a JSONL line per shell-job submission to `~/.gbrain/audit/shell-jobs-YYYY-Www.jsonl`
* (ISO week rotation, override via `GBRAIN_AUDIT_DIR`). Best-effort: write failures go
* to stderr and never block submission, which means a disk-full attacker could silently
* disable the trail. CHANGELOG calls this out honestly: it's for debugging "what did
* this cron submit last Tuesday?", not for security-critical forensics.
*
* Never logs `env` values (may contain secrets). Does log `cmd` (first 80 chars) and
* `argv` (JSON array, each element truncated to 80 chars). The command text itself can
* contain inline tokens (`curl -H 'Authorization: Bearer ...'`), so the guide explicitly
* tells operators to put secrets in `env:` instead of embedding them in the command line.
*/
import * as fs from 'node:fs';
import * as path from 'node:path';
import * as os from 'node:os';
export interface ShellAuditEvent {
ts: string;
caller: 'cli' | 'mcp';
remote: boolean;
job_id: number;
cwd: string;
cmd_display?: string; // first 80 chars of cmd; may contain inline tokens
argv_display?: string[]; // each arg truncated individually to preserve separation
}
/** Compute `shell-jobs-YYYY-Www.jsonl` using ISO-8601 week numbering.
*
* Year-boundary edge: 2027-01-01 is ISO week 53 of year 2026, so the correct
* filename is `shell-jobs-2026-W53.jsonl`. This matches the ISO week standard
* (week containing the first Thursday of the year is W1; week containing Dec 28
* is always W52 or W53 of that year).
*/
export function computeAuditFilename(now: Date = new Date()): string {
// Copy date and move to nearest Thursday (ISO week anchor).
const d = new Date(Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate()));
const dayNum = (d.getUTCDay() + 6) % 7; // Mon=0, Sun=6
d.setUTCDate(d.getUTCDate() - dayNum + 3); // shift to Thursday
const isoYear = d.getUTCFullYear();
const firstThursday = new Date(Date.UTC(isoYear, 0, 4));
const firstThursdayDayNum = (firstThursday.getUTCDay() + 6) % 7;
firstThursday.setUTCDate(firstThursday.getUTCDate() - firstThursdayDayNum + 3);
const weekNum = Math.round((d.getTime() - firstThursday.getTime()) / (7 * 86400000)) + 1;
const ww = String(weekNum).padStart(2, '0');
return `shell-jobs-${isoYear}-W${ww}.jsonl`;
}
/** Resolve the audit dir. Honors `GBRAIN_AUDIT_DIR` for container/sandbox deployments
* where `$HOME` is read-only. Defaults to `~/.gbrain/audit/`. */
export function resolveAuditDir(): string {
const override = process.env.GBRAIN_AUDIT_DIR;
if (override && override.trim().length > 0) return override;
return path.join(os.homedir(), '.gbrain', 'audit');
}
export function logShellSubmission(event: Omit<ShellAuditEvent, 'ts'>): void {
const dir = resolveAuditDir();
const filename = computeAuditFilename();
const fullPath = path.join(dir, filename);
const line = JSON.stringify({ ...event, ts: new Date().toISOString() }) + '\n';
try {
fs.mkdirSync(dir, { recursive: true });
fs.appendFileSync(fullPath, line, { encoding: 'utf8' });
} catch (err) {
// Best-effort: log to stderr and keep going. A disk-full or EACCES attacker
// can silently disable this trail, which is why CHANGELOG calls it an
// operational trace, not forensic insurance.
const msg = err instanceof Error ? err.message : String(err);
process.stderr.write(`[shell-audit] write failed (${msg}); submission continues\n`);
}
}


@@ -0,0 +1,311 @@
/**
* `shell` job handler.
*
* Runs an arbitrary shell command or argv vector as a child process under the
* Minions worker. Purpose: move deterministic cron scripts (API fetch, token
* refresh, scrape + write) off the LLM gateway so they don't consume an Opus
* session each time.
*
* Security (both gates must pass):
* 1. `MinionQueue.add()` rejects name='shell' unless the caller explicitly
* opts in via `trusted.allowProtectedSubmit`. CLI path and the `submit_job`
* operation (when `ctx.remote === false`) set the flag. MCP callers don't.
* 2. This handler only registers when `process.env.GBRAIN_ALLOW_SHELL_JOBS === '1'`.
* Default: off. Without the flag the worker's `registeredNames` excludes
* shell and queued jobs stay in 'waiting'.
*
* Env model (honest): the child process receives a small allowlist (PATH, HOME,
* USER, LANG, TZ, NODE_ENV) merged with caller-supplied `job.data.env`. This
* prevents the accidental `$OPENAI_API_KEY` interpolation footgun. It does NOT
* sandbox filesystem reads — a shell script can `cat ~/.env` or any file the
* worker can read. The operator picks a safe `cwd`; that's the trust boundary.
*
* Shutdown: the handler listens to BOTH `ctx.signal` (timeout/cancel/lock-loss)
* and `ctx.shutdownSignal` (worker process SIGTERM). Either triggers the same
* kill sequence: SIGTERM → 5s grace → SIGKILL. Non-shell handlers ignore
* `shutdownSignal` so deploy restarts don't interrupt them mid-flight.
*/
import { spawn, type ChildProcess } from 'node:child_process';
import { StringDecoder } from 'node:string_decoder';
import * as path from 'node:path';
import type { MinionJobContext } from '../types.ts';
import { UnrecoverableError } from '../types.ts';
/** Environment variables passed through to shell children by default. Callers
* that need additional keys (e.g. a specific API token for a cron) must name
* them explicitly in `job.data.env`. Named keys override this allowlist. */
const SHELL_ENV_ALLOWLIST = ['PATH', 'HOME', 'USER', 'LANG', 'TZ', 'NODE_ENV'] as const;
/** Max bytes retained from stdout/stderr. Output exceeding these caps is
* truncated with a `[truncated N bytes]` marker. UTF-8-safe via StringDecoder. */
const STDOUT_TAIL_MAX_BYTES = 64 * 1024;
const STDERR_TAIL_MAX_BYTES = 16 * 1024;
/** Grace period between SIGTERM and SIGKILL. Well-behaved scripts catch SIGTERM,
* flush state, exit cleanly; non-behaving scripts get reaped. */
const KILL_GRACE_MS = 5000;
export interface ShellJobParams {
/** Shell command. Spawned via `/bin/sh -c cmd`. Exactly one of cmd or argv is required. */
cmd?: string;
/** Argv vector. Spawned directly without a shell. Exactly one of cmd or argv is required. */
argv?: string[];
/** Working directory. REQUIRED, must be an absolute path. The operator chooses
* this; it's the trust boundary for what files the script can read/write. */
cwd: string;
/** Additional env vars to pass to the child. Merged on top of SHELL_ENV_ALLOWLIST. */
env?: Record<string, string>;
}
export interface ShellJobResult {
exit_code: number;
stdout_tail: string;
stderr_tail: string;
duration_ms: number;
pid: number;
}
/** Validate and narrow `job.data` to ShellJobParams. Throws UnrecoverableError
* for misshapen input — validation failures are not retry-worthy. */
function validateParams(data: Record<string, unknown>): ShellJobParams {
const hasCmd = typeof data.cmd === 'string' && data.cmd.length > 0;
const hasArgv = Array.isArray(data.argv) && data.argv.length > 0;
if (hasCmd === hasArgv) {
// Both present or both absent: exactly one of cmd/argv is required.
throw new UnrecoverableError(
'shell: specify exactly one of cmd or argv (see: docs/guides/minions-shell-jobs.md#errors)',
);
}
if (hasArgv) {
const argvOk = (data.argv as unknown[]).every((a) => typeof a === 'string');
if (!argvOk) {
throw new UnrecoverableError(
'shell: argv must be an array of strings (see: docs/guides/minions-shell-jobs.md#errors)',
);
}
}
if (typeof data.cwd !== 'string' || data.cwd.length === 0) {
throw new UnrecoverableError(
'shell: cwd is required and must be an absolute path (see: docs/guides/minions-shell-jobs.md#errors)',
);
}
if (!path.isAbsolute(data.cwd)) {
throw new UnrecoverableError(
'shell: cwd is required and must be an absolute path (see: docs/guides/minions-shell-jobs.md#errors)',
);
}
if (data.env !== undefined) {
if (typeof data.env !== 'object' || data.env === null || Array.isArray(data.env)) {
throw new UnrecoverableError(
'shell: env must be an object of string values (see: docs/guides/minions-shell-jobs.md#errors)',
);
}
for (const v of Object.values(data.env as Record<string, unknown>)) {
if (typeof v !== 'string') {
throw new UnrecoverableError(
'shell: env values must all be strings (see: docs/guides/minions-shell-jobs.md#errors)',
);
}
}
}
return {
cmd: hasCmd ? (data.cmd as string) : undefined,
argv: hasArgv ? (data.argv as string[]) : undefined,
cwd: data.cwd,
env: (data.env as Record<string, string> | undefined),
};
}
/** Build the child process env: SHELL_ENV_ALLOWLIST picked from process.env,
* overlaid with caller-supplied `job.data.env`. Prevents accidental leak of
* OPENAI_API_KEY / DATABASE_URL / etc. into user-authored scripts. */
function buildChildEnv(override: Record<string, string> | undefined): Record<string, string> {
const env: Record<string, string> = {};
for (const key of SHELL_ENV_ALLOWLIST) {
const v = process.env[key];
if (typeof v === 'string') env[key] = v;
}
if (override) {
for (const [k, v] of Object.entries(override)) env[k] = v;
}
return env;
}
/** Bounded-length UTF-8-safe tail buffer. Accumulates bytes via StringDecoder
* so the last `maxBytes` of output is character-safe (no split multibyte chars).
* On truncation, the emitted string is prefixed with `[truncated N bytes]`. */
class TailBuffer {
private decoder = new StringDecoder('utf8');
private body = '';
private bodyBytes = 0;
private truncatedBytes = 0;
constructor(private readonly maxBytes: number) {}
append(chunk: Buffer): void {
const str = this.decoder.write(chunk);
if (str.length === 0) return;
this.body += str;
this.bodyBytes = Buffer.byteLength(this.body, 'utf8');
this.compactIfOver();
}
private compactIfOver(): void {
if (this.bodyBytes <= this.maxBytes) return;
// We need to keep only the trailing maxBytes. Byte-slicing mid-character is
// unsafe; instead, find the lowest character offset whose byte length from
// that point is <= maxBytes. A linear scan from the end over whole code
// points is good enough at 64KB scales.
const targetByteSize = this.maxBytes;
// Fast path: if body is all ASCII (1 byte per char), byteLength === length.
if (this.body.length === this.bodyBytes) {
const drop = this.bodyBytes - targetByteSize;
this.truncatedBytes += drop;
this.body = this.body.slice(drop);
this.bodyBytes = targetByteSize;
return;
}
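// Slow path: find a character boundary that lands just under maxBytes.
// Scan from the end; accumulate UTF-8 bytes per code point.
let tailBytes = 0;
let cut = this.body.length;
for (let i = this.body.length - 1; i >= 0; i--) {
// On a low surrogate, step back to its high surrogate so the pair is
// counted once (4 bytes) and the cut never lands inside it.
const unit = this.body.charCodeAt(i);
if (unit >= 0xdc00 && unit <= 0xdfff && i > 0) {
const prev = this.body.charCodeAt(i - 1);
if (prev >= 0xd800 && prev <= 0xdbff) i--;
}
const code = this.body.codePointAt(i);
const cpBytes = code === undefined ? 0
: code < 0x80 ? 1
: code < 0x800 ? 2
: code < 0x10000 ? 3
: 4;
if (tailBytes + cpBytes > targetByteSize) break;
tailBytes += cpBytes;
cut = i;
}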
const droppedBytes = this.bodyBytes - tailBytes;
this.truncatedBytes += droppedBytes;
this.body = this.body.slice(cut);
this.bodyBytes = tailBytes;
}
done(): string {
const tail = this.decoder.end();
if (tail.length > 0) {
this.body += tail;
this.bodyBytes = Buffer.byteLength(this.body, 'utf8');
this.compactIfOver();
}
if (this.truncatedBytes === 0) return this.body;
return `[truncated ${this.truncatedBytes} bytes]\n${this.body}`;
}
}
/** The shell handler itself. */
export async function shellHandler(ctx: MinionJobContext): Promise<ShellJobResult> {
const params = validateParams(ctx.data);
const env = buildChildEnv(params.env);
const startedAt = Date.now();
let proc: ChildProcess;
try {
if (params.cmd) {
// Absolute /bin/sh — not 'sh' — so a caller-supplied env with a poisoned
// PATH can't redirect to a different shell binary.
proc = spawn('/bin/sh', ['-c', params.cmd], {
cwd: params.cwd,
env,
stdio: ['ignore', 'pipe', 'pipe'],
});
} else {
const argv = params.argv!;
proc = spawn(argv[0], argv.slice(1), {
cwd: params.cwd,
env,
stdio: ['ignore', 'pipe', 'pipe'],
});
}
} catch (err) {
// Synchronous spawn failure (e.g. invalid arguments). A missing cwd or
// binary normally surfaces later through the child's 'error' event.
// Retryable.
throw err instanceof Error ? err : new Error(String(err));
}
const pid = proc.pid ?? -1;
const stdoutTail = new TailBuffer(STDOUT_TAIL_MAX_BYTES);
const stderrTail = new TailBuffer(STDERR_TAIL_MAX_BYTES);
proc.stdout?.on('data', (c: Buffer) => stdoutTail.append(c));
proc.stderr?.on('data', (c: Buffer) => stderrTail.append(c));
// Wire BOTH signals to the kill sequence. `ctx.signal` fires on timeout /
// cancel / lock-loss; `ctx.shutdownSignal` fires only on worker SIGTERM/SIGINT.
// Shell handler needs both — a deploy restart shouldn't leave children running
// past the 30s worker cleanup race.
let killTimer: ReturnType<typeof setTimeout> | null = null;
let killReason = '';
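// `proc.killed` only records that a signal was sent, not that the child
// exited, so use exitCode/signalCode to tell whether it is still running.
const stillRunning = () => proc.exitCode === null && proc.signalCode === null;
const onAbort = (label: string) => () => {
if (killTimer !== null) return; // already started
killReason = label;
if (stillRunning()) {
try { proc.kill('SIGTERM'); } catch { /* proc already exited */ }
}
killTimer = setTimeout(() => {
// Escalate only if the child survived the grace period.
if (stillRunning()) {
try { proc.kill('SIGKILL'); } catch { /* already exited */ }
}
}, KILL_GRACE_MS);
};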
const sigAbort = onAbort('signal');
const shutdownAbort = onAbort('shutdown');
ctx.signal.addEventListener('abort', sigAbort);
ctx.shutdownSignal.addEventListener('abort', shutdownAbort);
// Fire immediately if either already aborted before wiring
if (ctx.signal.aborted) sigAbort();
if (ctx.shutdownSignal.aborted) shutdownAbort();
const exitCode: number = await new Promise((resolve, reject) => {
proc.on('error', (err) => {
reject(err);
});
proc.on('exit', (code, signal) => {
// On a signal-terminated exit Node reports code === null and sets `signal`;
// map that to the shell's 128+N convention (SIGTERM=15 → 143, SIGKILL=9 → 137).
if (code !== null) resolve(code);
else if (signal === 'SIGTERM') resolve(143);
else if (signal === 'SIGKILL') resolve(137);
else resolve(-1);
});
}).finally(() => {
if (killTimer !== null) clearTimeout(killTimer);
ctx.signal.removeEventListener('abort', sigAbort);
ctx.shutdownSignal.removeEventListener('abort', shutdownAbort);
});
const duration_ms = Date.now() - startedAt;
const stdout_tail = stdoutTail.done();
const stderr_tail = stderrTail.done();
// If we sent SIGTERM/SIGKILL in response to an abort, surface that as the
// error rather than the exit code — clearer for debugging. Worker catch
// handles retry/dead classification.
if (killReason === 'signal' || killReason === 'shutdown') {
const err = new Error(
`aborted: ${killReason === 'shutdown' ? 'shutdown' : (ctx.signal.reason as Error)?.message || 'signal'}`,
);
throw err;
}
if (exitCode !== 0) {
throw new Error(
`exit ${exitCode}: ${stderr_tail.slice(-500)}`,
);
}
return { exit_code: exitCode, stdout_tail, stderr_tail, duration_ms, pid };
}
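
The two-signal wiring in the handler above can be read as a small one-shot pattern. The sketch below is illustrative only: `onFirstAbort` is a hypothetical helper that does not exist in this PR, and it assumes a runtime with a global `AbortController`. It shows the two details that matter: the first abort wins, and a signal that was already aborted before wiring still fires.

```typescript
// Hypothetical helper (not part of the PR): call `onFire` at most once when
// any of the given signals aborts, including signals aborted before wiring.
type Cleanup = () => void;

function onFirstAbort(
  signals: Record<string, AbortSignal>,
  onFire: (label: string) => void,
): Cleanup {
  let fired = false;
  const listeners: Array<[AbortSignal, () => void]> = [];
  for (const [label, signal] of Object.entries(signals)) {
    const listener = () => {
      if (fired) return; // first abort wins; later ones are ignored
      fired = true;
      onFire(label);
    };
    signal.addEventListener('abort', listener);
    listeners.push([signal, listener]);
  }
  // An AbortSignal does not replay 'abort' for late subscribers, so check
  // the `aborted` flag explicitly after wiring.
  for (const [signal, listener] of listeners) {
    if (signal.aborted) listener();
  }
  return () => {
    for (const [signal, listener] of listeners) {
      signal.removeEventListener('abort', listener);
    }
  };
}

// Shutdown fires first, then the per-job signal: only 'shutdown' is recorded.
const job = new AbortController();
const shutdown = new AbortController();
const seen: string[] = [];
const cleanup = onFirstAbort(
  { signal: job.signal, shutdown: shutdown.signal },
  (label) => seen.push(label),
);
shutdown.abort(new Error('shutdown'));
job.abort(new Error('timeout'));
cleanup();

// A signal aborted before wiring still triggers the callback.
const pre = new AbortController();
pre.abort(new Error('cancel'));
const seenLate: string[] = [];
onFirstAbort({ signal: pre.signal }, (label) => seenLate.push(label))();
```

In the handler itself the callback starts the SIGTERM, grace period, SIGKILL sequence, and the cleanup runs in the `finally` once the child has exited.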


@@ -0,0 +1,20 @@
/**
* Protected job names — side-effect-free constant module.
*
* Names in this set require an explicit `trusted.allowProtectedSubmit: true` opt-in
* when passed to `MinionQueue.add()`. The CLI path and the `submit_job` operation
* (when `ctx.remote === false`) set the flag; MCP callers never do. Defense-in-depth
* against in-process handlers that programmatically submit a shell child via
* `queue.add('shell', ...)`.
*
* This file must stay pure — no imports from handlers, no filesystem, no env reads.
* Queue core imports it; if this module grew side effects, every queue user would
* pay them at module load.
*/
export const PROTECTED_JOB_NAMES: ReadonlySet<string> = new Set(['shell']);
/** Check a job name against the protected set. Normalizes whitespace first. */
export function isProtectedJobName(name: string): boolean {
return PROTECTED_JOB_NAMES.has(name.trim());
}
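
For reference, a quick check of how the guard above treats a few inputs. The two declarations are local copies of the exports in this file so the snippet runs on its own; nothing else is assumed.

```typescript
// Local copies of the exports above, so the example is self-contained.
const PROTECTED_JOB_NAMES: ReadonlySet<string> = new Set(['shell']);
function isProtectedJobName(name: string): boolean {
  return PROTECTED_JOB_NAMES.has(name.trim());
}

const checks = {
  exact: isProtectedJobName('shell'),      // protected
  padded: isProtectedJobName('  shell\n'), // protected: whitespace is trimmed first
  upper: isProtectedJobName('SHELL'),      // not protected: the match is case-sensitive
  other: isProtectedJobName('sync'),       // not protected
};
console.log(checks);
```

The case-sensitive behavior is intentional per the unit tests in this PR: `'Shell'` is a different job name, not a bypass of `'shell'`.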


@@ -15,6 +15,16 @@ import type {
} from './types.ts';
import { rowToMinionJob, rowToInboxMessage, rowToAttachment } from './types.ts';
import { validateAttachment } from './attachments.ts';
import { isProtectedJobName } from './protected-names.ts';
/** Options for opting into protected-job-name submission. Passed as a separate
* 4th arg to `MinionQueue.add()` (NOT folded into `opts`) so user-spread
* `{...userOpts}` payloads can't accidentally carry the trust flag. */
export interface TrustedSubmitOpts {
/** When true, allow submission of names in PROTECTED_JOB_NAMES (currently 'shell').
* Set only by the CLI path and by `submit_job` when `ctx.remote === false`. */
allowProtectedSubmit?: boolean;
}
const MIGRATION_VERSION = 7;
@@ -55,10 +65,25 @@ export class MinionQueue {
* to 'waiting-children' atomically. Idempotency_key dedups via PG unique
* partial index; same key returns the existing row (no second insert).
*/
-async add(name: string, data?: Record<string, unknown>, opts?: Partial<MinionJobInput>): Promise<MinionJob> {
-if (!name || name.trim().length === 0) {
+async add(
+name: string,
+data?: Record<string, unknown>,
+opts?: Partial<MinionJobInput>,
+trusted?: TrustedSubmitOpts,
+): Promise<MinionJob> {
+// Normalize first so the protected-name check and the insert use the same
+// canonical form. Without the trim-before-check, `queue.add(' shell ', ...)`
+// would evade the guard and insert a job literally named 'shell'.
+const jobName = (name || '').trim();
+if (jobName.length === 0) {
throw new Error('Job name cannot be empty');
}
+if (isProtectedJobName(jobName) && !trusted?.allowProtectedSubmit) {
+throw new Error(
+`protected job name '${jobName}' requires CLI or operation-local submitter ` +
+`(pass {allowProtectedSubmit: true} as the 4th arg to MinionQueue.add)`,
+);
+}
await this.ensureSchema();
const childStatus: MinionJobStatus = opts?.delay ? 'delayed' : 'waiting';
@@ -126,7 +151,7 @@ export class MinionQueue {
RETURNING *`;
const params = [
-name.trim(),
+jobName,
opts?.queue ?? 'default',
childStatus,
opts?.priority ?? 0,
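
To illustrate why the trust flag lives in a separate 4th argument, here is a minimal stand-in for `MinionQueue.add` with the same guard order and no database. The function body, `JobOpts`, and the return shape are simplified assumptions for the example; only the parameter layout and the guard logic mirror the diff above.

```typescript
interface JobOpts { queue?: string; priority?: number }
interface TrustedSubmitOpts { allowProtectedSubmit?: boolean }

const PROTECTED = new Set(['shell']);

// Simplified stand-in for MinionQueue.add: same guard order, no database.
function add(
  name: string,
  data: Record<string, unknown> = {},
  opts: JobOpts = {},
  trusted?: TrustedSubmitOpts,
): { name: string; queue: string } {
  const jobName = (name || '').trim();
  if (jobName.length === 0) throw new Error('Job name cannot be empty');
  if (PROTECTED.has(jobName) && !trusted?.allowProtectedSubmit) {
    throw new Error(`protected job name '${jobName}' requires a trusted submitter`);
  }
  return { name: jobName, queue: opts.queue ?? 'default' };
}

// Untrusted payload that tries to smuggle the flag through opts.
const userOpts = JSON.parse('{"queue":"default","allowProtectedSubmit":true}');

let rejected = false;
try {
  // The flag inside the spread opts is never consulted by the guard.
  add(' shell ', { cmd: 'echo hi' }, { ...userOpts });
} catch {
  rejected = true;
}

// Only an explicit 4th argument opts in; the stored name is normalized.
const accepted = add(' shell ', { cmd: 'echo hi' }, { ...userOpts }, { allowProtectedSubmit: true });
```

If the flag were a field of `opts`, the first call would have succeeded, which is the accidental-carry case the commit message describes.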


@@ -159,8 +159,13 @@ export interface MinionJobContext {
name: string;
data: Record<string, unknown>;
attempts_made: number;
-/** AbortSignal for cooperative cancellation (fires on pause or lock loss). */
+/** AbortSignal for cooperative cancellation (fires on timeout, cancel, pause, or lock loss). */
signal: AbortSignal;
+/** AbortSignal that fires only on worker process SIGTERM/SIGINT. Handlers sensitive
+* to deploy restarts (e.g. the shell handler, which must run a SIGTERM → 5s → SIGKILL
+* sequence on its child) listen to this in addition to `signal`. Most handlers can
+* ignore it — workers give them the full 30s cleanup race to finish naturally. */
+shutdownSignal: AbortSignal;
/** Update structured progress (not just 0-100). */
updateProgress(progress: unknown): Promise<void>;
/** Accumulate token usage for this job. */


@@ -49,6 +49,13 @@ export class MinionWorker {
private inFlight = new Map<number, InFlightJob>();
private workerId = randomUUID();
/** Fires only on worker process SIGTERM/SIGINT. Handlers that need to run
* shutdown-specific cleanup (e.g. shell handler's SIGTERM→SIGKILL sequence on
* its child) subscribe via `ctx.shutdownSignal`. Separated from the per-job
* abort controller so non-shell handlers don't get cancelled mid-flight on
* deploy restart — they still get the full 30s cleanup race instead. */
private shutdownAbort = new AbortController();
private opts: Required<MinionWorkerOpts>;
constructor(
@@ -88,10 +95,16 @@ export class MinionWorker {
await this.queue.ensureSchema();
this.running = true;
-// Graceful shutdown
+// Graceful shutdown. Fires shutdownAbort so handlers subscribed to
+// `ctx.shutdownSignal` (currently: shell handler) can run their own cleanup
+// BEFORE the 30s cleanup race expires. Non-shell handlers ignore shutdown
+// and keep running — they get the full 30s window.
const shutdown = () => {
console.log('Minion worker shutting down...');
this.running = false;
+if (!this.shutdownAbort.signal.aborted) {
+this.shutdownAbort.abort(new Error('shutdown'));
+}
};
process.on('SIGTERM', shutdown);
process.on('SIGINT', shutdown);
@@ -246,7 +259,7 @@ export class MinionWorker {
if (!renewed) {
console.warn(`Lock lost for job ${job.id}, aborting execution`);
clearInterval(lockTimer);
-abort.abort();
+abort.abort(new Error('lock-lost'));
}
}, this.opts.lockDuration / 2);
@@ -260,7 +273,7 @@ export class MinionWorker {
timeoutTimer = setTimeout(() => {
if (!abort.signal.aborted) {
console.warn(`Job ${job.id} (${job.name}) hit per-job timeout (${job.timeout_ms}ms), aborting`);
-abort.abort();
+abort.abort(new Error('timeout'));
}
}, job.timeout_ms);
}
@@ -287,13 +300,18 @@ export class MinionWorker {
return;
}
-// Build job context with per-job AbortSignal
+// Build job context with per-job AbortSignal + shared shutdown signal.
+// Most handlers only care about `signal` (timeout / cancel / lock-loss).
+// `shutdownSignal` is separate: fires only on worker process SIGTERM/SIGINT.
+// Handlers that need to run cleanup before worker exit (shell handler's
+// SIGTERM→5s→SIGKILL on its child) subscribe to shutdownSignal too.
const context: MinionJobContext = {
id: job.id,
name: job.name,
data: job.data,
attempts_made: job.attempts_made,
signal: abort.signal,
+shutdownSignal: this.shutdownAbort.signal,
updateProgress: async (progress: unknown) => {
await this.queue.updateProgress(job.id, lockToken, progress);
},
@@ -343,13 +361,23 @@ export class MinionWorker {
} catch (err) {
clearInterval(lockTimer);
-// If aborted (paused or lock lost), don't try to fail the job
+// If the per-job abort fired, derive the reason from signal.reason (set
+// by whichever site aborted: 'timeout' / 'cancel' / 'lock-lost'). We call
+// failJob unconditionally — the DB match on status='active' + lock_token
+// makes it idempotent: if another path (handleTimeouts, cancelJob, stall)
+// already flipped status, our call no-ops cleanly. The prior silent-return
+// left jobs stranded in 'active' until a secondary sweep, breaking
+// timeout/cancel contracts downstream callers rely on.
+let errorText: string;
if (abort.signal.aborted) {
-console.log(`Job ${job.id} (${job.name}) aborted (paused or lock lost)`);
-return;
+const reason = abort.signal.reason instanceof Error
+? abort.signal.reason.message
+: String(abort.signal.reason || 'aborted');
+errorText = `aborted: ${reason}`;
+} else {
+errorText = err instanceof Error ? err.message : String(err);
}
-const errorText = err instanceof Error ? err.message : String(err);
const isUnrecoverable = err instanceof UnrecoverableError;
const attemptsExhausted = job.attempts_made + 1 >= job.max_attempts;
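
The reason derivation in the catch block above can be exercised in isolation. In this sketch `describeFailure` is a hypothetical name for the extracted logic (the worker inlines it), and it assumes a runtime where `AbortSignal.reason` is available.

```typescript
// Hypothetical extraction of the worker's error-text logic: prefer the abort
// reason when the per-job signal fired, else fall back to the thrown error.
function describeFailure(signal: AbortSignal, err: unknown): string {
  if (signal.aborted) {
    const reason = signal.reason instanceof Error
      ? signal.reason.message
      : String(signal.reason || 'aborted');
    return `aborted: ${reason}`;
  }
  return err instanceof Error ? err.message : String(err);
}

// Timeout site: abort(new Error('timeout')) wins over whatever the handler threw.
const timedOut = new AbortController();
timedOut.abort(new Error('timeout'));
const abortedText = describeFailure(timedOut.signal, new Error('child exited'));

// No abort: the handler's own error message is used unchanged.
const live = new AbortController();
const plainText = describeFailure(live.signal, new Error('exit 7: boom'));
```

Because each abort site now passes an `Error` with a fixed message, the stored failure text distinguishes `timeout`, `cancel`, and `lock-lost` instead of collapsing them.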


@@ -1047,26 +1047,44 @@ const file_url: Operation = {
const submit_job: Operation = {
name: 'submit_job',
-description: 'Submit a background job to the Minions queue',
+description: 'Submit a background job to the Minions queue. Built-in types: sync, embed, lint, import, extract, backlinks, autopilot-cycle. The `shell` type is CLI-only and rejected over MCP.',
params: {
-name: { type: 'string', required: true, description: 'Job type (sync, embed, lint, import)' },
+name: { type: 'string', required: true, description: 'Job type (sync, embed, lint, import, extract, backlinks, autopilot-cycle; shell is CLI-only)' },
data: { type: 'object', description: 'Job payload (JSON)' },
queue: { type: 'string', description: 'Queue name (default: "default")' },
priority: { type: 'number', description: 'Priority (0 = highest, default: 0)' },
max_attempts: { type: 'number', description: 'Max retry attempts (default: 3)' },
delay: { type: 'number', description: 'Delay in ms before eligible' },
+timeout_ms: { type: 'number', description: 'Per-job wall-clock timeout in ms; aborted job goes to dead' },
},
mutating: true,
handler: async (ctx, p) => {
-if (ctx.dryRun) return { dry_run: true, action: 'submit_job', name: p.name };
+const name = typeof p.name === 'string' ? p.name.trim() : '';
+if (ctx.dryRun) return { dry_run: true, action: 'submit_job', name };
+// Submit-side MCP guard: reject protected job names from untrusted callers
+// BEFORE we touch the DB. This is the first of the two security layers
+// (the second is MinionQueue.add's check). Independent of the worker-side
+// GBRAIN_ALLOW_SHELL_JOBS env flag — even if that flag is on, MCP callers
+// cannot submit protected-type jobs.
+const { isProtectedJobName } = await import('./minions/protected-names.ts');
+if (ctx.remote && isProtectedJobName(name)) {
+throw new OperationError('permission_denied', `'${name}' jobs cannot be submitted over MCP (CLI-only for security)`);
+}
const { MinionQueue } = await import('./minions/queue.ts');
const queue = new MinionQueue(ctx.engine);
-return queue.add(p.name as string, (p.data as Record<string, unknown>) || {}, {
+// Trusted flag set only when this is a local (non-remote) submission. When
+// remote=true, the guard above has already thrown for protected names, so
+// passing undefined here is safe for any non-protected name that slips by.
+const trusted = !ctx.remote && isProtectedJobName(name) ? { allowProtectedSubmit: true } : undefined;
+return queue.add(name, (p.data as Record<string, unknown>) || {}, {
queue: (p.queue as string) || 'default',
priority: (p.priority as number) || 0,
max_attempts: (p.max_attempts as number) || 3,
delay: (p.delay as number) || undefined,
-});
+timeout_ms: (p.timeout_ms as number) || undefined,
+}, trusted);
},
};
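
The submit-side decision in `submit_job` reduces to a small truth table over `ctx.remote` and the protected-name check. The sketch below restates it as a pure function; `decideSubmit` and `SubmitDecision` are hypothetical names for illustration and are not part of the PR.

```typescript
type SubmitDecision =
  | { allowed: false; error: string }
  | { allowed: true; trusted: { allowProtectedSubmit: true } | undefined };

const PROTECTED = new Set(['shell']);

// Hypothetical restatement of the handler's guard: remote callers are rejected
// for protected names; local callers get the trust flag only when it is needed.
function decideSubmit(rawName: unknown, remote: boolean): SubmitDecision {
  const name = typeof rawName === 'string' ? rawName.trim() : '';
  const isProtected = PROTECTED.has(name);
  if (remote && isProtected) {
    return { allowed: false, error: `'${name}' jobs cannot be submitted over MCP (CLI-only for security)` };
  }
  return {
    allowed: true,
    trusted: !remote && isProtected ? { allowProtectedSubmit: true } : undefined,
  };
}

const remoteShell = decideSubmit(' shell ', true); // rejected before any DB work
const localShell = decideSubmit('shell', false);   // allowed, trust flag set
const remoteSync = decideSubmit('sync', true);     // allowed, no trust flag
```

Even when this layer passes `undefined`, the second layer in `MinionQueue.add` still rejects a protected name, so a mistake here fails closed.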


@@ -0,0 +1,135 @@
/**
* E2E Minions Shell Handler Tests — exercises the full lifecycle against real
* Postgres: submit → worker claims → spawn → result → status flip.
*
* Unit tests in test/minions-shell.test.ts cover the handler in detail
* (validation, env allowlist, abort, SIGTERM grace, audit log). These E2E
* tests prove the wiring against real Postgres works end-to-end.
*
* Run: DATABASE_URL=... bun test test/e2e/minions-shell.test.ts
*/
import { describe, test, expect, beforeAll, afterAll, beforeEach } from 'bun:test';
import { hasDatabase, setupDB, teardownDB, getConn, getEngine } from './helpers.ts';
import { PostgresEngine } from '../../src/core/postgres-engine.ts';
import { MinionQueue } from '../../src/core/minions/queue.ts';
import { MinionWorker } from '../../src/core/minions/worker.ts';
import { shellHandler } from '../../src/core/minions/handlers/shell.ts';
import { runMigrations } from '../../src/core/migrate.ts';
const skip = !hasDatabase();
const describeE2E = skip ? describe.skip : describe;
if (skip) {
console.log('Skipping E2E minions shell tests (DATABASE_URL not set)');
}
async function makeEngine(): Promise<PostgresEngine> {
const url = process.env.DATABASE_URL!;
const e = new PostgresEngine();
await e.connect({ engine: 'postgres', database_url: url, poolSize: 4 });
return e;
}
async function waitTerminal(queue: MinionQueue, id: number, timeoutMs = 15000): Promise<string> {
const deadline = Date.now() + timeoutMs;
while (Date.now() < deadline) {
const j = await queue.getJob(id);
if (j && ['completed', 'failed', 'dead', 'cancelled'].includes(j.status)) return j.status;
await new Promise((r) => setTimeout(r, 100));
}
const j = await queue.getJob(id);
throw new Error(`job ${id} did not reach terminal state in ${timeoutMs}ms; last status=${j?.status}`);
}
describeE2E('E2E: Minions shell handler', () => {
beforeAll(async () => {
await setupDB();
await runMigrations(getEngine());
});
afterAll(async () => {
await teardownDB();
});
beforeEach(async () => {
const conn = getConn();
await conn.unsafe(`TRUNCATE minion_attachments, minion_inbox, minion_jobs RESTART IDENTITY CASCADE`);
});
test('CLI submit → worker claims → shell runs → completes', async () => {
const engine = await makeEngine();
try {
const queue = new MinionQueue(engine);
const job = await queue.add('shell',
{ cmd: 'echo hello', cwd: '/tmp' },
{},
{ allowProtectedSubmit: true },
);
expect(job.name).toBe('shell');
const worker = new MinionWorker(engine, { pollInterval: 100, lockDuration: 30000 });
worker.register('shell', shellHandler);
const runPromise = worker.start();
try {
// 20s tolerates DB warmup variance when run after other E2E files
const status = await waitTerminal(queue, job.id, 20000);
expect(status).toBe('completed');
const final = await queue.getJob(job.id);
expect((final!.result as any).exit_code).toBe(0);
expect((final!.result as any).stdout_tail).toBe('hello\n');
} finally {
worker.stop();
await runPromise;
}
} finally {
await engine.disconnect();
}
}, 45000);
test('MinionQueue.add("shell",...) without trusted arg → throws (defense-in-depth)', async () => {
const engine = await makeEngine();
try {
const queue = new MinionQueue(engine);
await expect(queue.add('shell', { cmd: 'echo ok', cwd: '/tmp' })).rejects.toThrow(/protected job name/);
// Whitespace bypass defense (Codex #1)
await expect(queue.add(' shell ', { cmd: 'echo ok', cwd: '/tmp' })).rejects.toThrow(/protected job name/);
} finally {
await engine.disconnect();
}
});
test('submit_job with ctx.remote=true rejects shell (MCP guard)', async () => {
const engine = await makeEngine();
try {
// Invoke submit_job operation directly with remote=true
const { operations } = await import('../../src/core/operations.ts');
const submitJob = operations.find((op: { name: string }) => op.name === 'submit_job')!;
await expect(
submitJob.handler(
{ engine, remote: true, dryRun: false } as any,
{ name: 'shell', data: { cmd: 'echo hi', cwd: '/tmp' } },
),
).rejects.toThrow(/permission_denied|cannot be submitted over MCP/i);
} finally {
await engine.disconnect();
}
});
test('submit_job with ctx.remote=false allows shell (CLI path)', async () => {
const engine = await makeEngine();
try {
const { operations } = await import('../../src/core/operations.ts');
const submitJob = operations.find((op: { name: string }) => op.name === 'submit_job')!;
const result = await submitJob.handler(
{ engine, remote: false, dryRun: false } as any,
{ name: 'shell', data: { cmd: 'echo hi', cwd: '/tmp' } },
);
expect((result as any).name).toBe('shell');
expect((result as any).status).toBe('waiting');
} finally {
await engine.disconnect();
}
});
});

test/minions-shell.test.ts

@@ -0,0 +1,342 @@
import { describe, test, expect, beforeAll, afterAll, beforeEach } from 'bun:test';
import { PGLiteEngine } from '../src/core/pglite-engine.ts';
import { MinionQueue } from '../src/core/minions/queue.ts';
import { UnrecoverableError } from '../src/core/minions/types.ts';
import type { MinionJobContext } from '../src/core/minions/types.ts';
import { shellHandler } from '../src/core/minions/handlers/shell.ts';
import { computeAuditFilename, resolveAuditDir, logShellSubmission } from '../src/core/minions/handlers/shell-audit.ts';
import { isProtectedJobName, PROTECTED_JOB_NAMES } from '../src/core/minions/protected-names.ts';
import * as fs from 'node:fs';
import * as path from 'node:path';
import * as os from 'node:os';
let engine: PGLiteEngine;
let queue: MinionQueue;
beforeAll(async () => {
engine = new PGLiteEngine();
await engine.connect({ databaseUrl: '' });
await engine.initSchema();
queue = new MinionQueue(engine);
});
afterAll(async () => {
await engine.disconnect();
});
beforeEach(async () => {
await engine.executeRaw('DELETE FROM minion_jobs');
});
// Build a minimal MinionJobContext for unit tests. Real worker provides this;
// here we mock it so the handler can be exercised without spinning up Postgres.
function makeCtx(
data: Record<string, unknown>,
opts: { signal?: AbortSignal; shutdownSignal?: AbortSignal } = {},
): MinionJobContext {
return {
id: 1,
name: 'shell',
data,
attempts_made: 0,
signal: opts.signal ?? new AbortController().signal,
shutdownSignal: opts.shutdownSignal ?? new AbortController().signal,
updateProgress: async () => {},
updateTokens: async () => {},
log: async () => {},
isActive: async () => true,
readInbox: async () => [],
};
}
// ---- protected-names ---------------------------------------------------------
describe('protected-names', () => {
test('shell is protected', () => {
expect(isProtectedJobName('shell')).toBe(true);
expect(PROTECTED_JOB_NAMES.has('shell')).toBe(true);
});
test('normalization: whitespace is trimmed before check', () => {
expect(isProtectedJobName(' shell ')).toBe(true);
expect(isProtectedJobName('\tshell\n')).toBe(true);
});
test('case-sensitive: Shell is NOT protected', () => {
expect(isProtectedJobName('Shell')).toBe(false);
expect(isProtectedJobName('SHELL')).toBe(false);
});
test('non-protected names pass through', () => {
expect(isProtectedJobName('sync')).toBe(false);
expect(isProtectedJobName('embed')).toBe(false);
expect(isProtectedJobName('')).toBe(false);
});
});
// ---- MinionQueue.add trusted guard ------------------------------------------
describe('MinionQueue.add protected-name guard', () => {
test('add("shell", ...) without trusted arg throws', async () => {
await expect(queue.add('shell', { cmd: 'echo', cwd: '/tmp' })).rejects.toThrow(/protected job name/);
});
test('add("shell", ..., opts, {allowProtectedSubmit:true}) succeeds', async () => {
const job = await queue.add('shell', { cmd: 'echo', cwd: '/tmp' }, undefined, { allowProtectedSubmit: true });
expect(job.name).toBe('shell');
expect(job.status).toBe('waiting');
});
// Whitespace bypass defense (Codex #1)
test('add(" shell ", ...) without trusted arg throws (whitespace bypass defense)', async () => {
await expect(queue.add(' shell ', { cmd: 'echo', cwd: '/tmp' })).rejects.toThrow(/protected job name/);
});
test('add(" shell ", ...) with trusted arg inserts normalized name "shell"', async () => {
const job = await queue.add(' shell ', { cmd: 'echo', cwd: '/tmp' }, undefined, { allowProtectedSubmit: true });
expect(job.name).toBe('shell');
});
test('add("Shell", ...) is treated as non-protected (case-sensitive)', async () => {
const job = await queue.add('Shell', {});
expect(job.name).toBe('Shell');
expect(job.status).toBe('waiting');
});
// Regression: non-protected names unaffected (Codex iron-rule)
test('REGRESSION: add("sync", ...) without trusted arg still succeeds', async () => {
const job = await queue.add('sync', { full: true });
expect(job.name).toBe('sync');
expect(job.status).toBe('waiting');
});
test('REGRESSION: trusted flag does NOT bypass empty-name check', async () => {
await expect(queue.add('', {}, undefined, { allowProtectedSubmit: true })).rejects.toThrow(/cannot be empty/);
});
});
// ---- Shell handler: validation ----------------------------------------------
describe('shell handler: validation', () => {
test('both cmd and argv → UnrecoverableError', async () => {
const p = shellHandler(makeCtx({ cmd: 'echo', argv: ['echo'], cwd: '/tmp' }));
await expect(p).rejects.toThrow(UnrecoverableError);
});
test('neither cmd nor argv → UnrecoverableError', async () => {
const p = shellHandler(makeCtx({ cwd: '/tmp' }));
await expect(p).rejects.toThrow(UnrecoverableError);
});
test('cwd missing → UnrecoverableError', async () => {
const p = shellHandler(makeCtx({ cmd: 'echo ok' }));
await expect(p).rejects.toThrow(UnrecoverableError);
});
test('cwd not absolute → UnrecoverableError', async () => {
const p = shellHandler(makeCtx({ cmd: 'echo ok', cwd: 'relative/path' }));
await expect(p).rejects.toThrow(UnrecoverableError);
});
test('argv non-array (string) → UnrecoverableError', async () => {
const p = shellHandler(makeCtx({ argv: 'echo ok', cwd: '/tmp' }));
await expect(p).rejects.toThrow(UnrecoverableError);
});
test('argv with non-string entries → UnrecoverableError', async () => {
const p = shellHandler(makeCtx({ argv: ['echo', 42], cwd: '/tmp' }));
await expect(p).rejects.toThrow(UnrecoverableError);
});
test('env with non-string values → UnrecoverableError', async () => {
const p = shellHandler(makeCtx({ cmd: 'echo', cwd: '/tmp', env: { FOO: 42 } }));
await expect(p).rejects.toThrow(UnrecoverableError);
});
});
// ---- Shell handler: spawn + output ------------------------------------------
describe('shell handler: spawn', () => {
test('cmd happy path: echo ok → exit 0, stdout captured', async () => {
const res = await shellHandler(makeCtx({ cmd: 'echo ok', cwd: '/tmp' })) as any;
expect(res.exit_code).toBe(0);
expect(res.stdout_tail).toBe('ok\n');
expect(res.stderr_tail).toBe('');
expect(typeof res.duration_ms).toBe('number');
expect(res.duration_ms).toBeGreaterThanOrEqual(0);
expect(typeof res.pid).toBe('number');
});
test('argv happy path: ["echo","hi"] → exit 0, stdout "hi\\n"', async () => {
const res = await shellHandler(makeCtx({ argv: ['echo', 'hi'], cwd: '/tmp' })) as any;
expect(res.exit_code).toBe(0);
expect(res.stdout_tail).toBe('hi\n');
});
test('non-zero exit → Error with stderr in message', async () => {
const p = shellHandler(makeCtx({ cmd: 'echo fail 1>&2; exit 7', cwd: '/tmp' }));
await expect(p).rejects.toThrow(/exit 7/);
});
test('argv with bogus binary → Error (retryable)', async () => {
const p = shellHandler(makeCtx({ argv: ['gbrain-nonexistent-binary-xyz'], cwd: '/tmp' }));
// spawn emits 'error' on ENOENT
await expect(p).rejects.toThrow();
});
test('result shape includes all declared keys', async () => {
const res = await shellHandler(makeCtx({ cmd: 'echo ok', cwd: '/tmp' })) as any;
expect(Object.keys(res).sort()).toEqual(['duration_ms', 'exit_code', 'pid', 'stderr_tail', 'stdout_tail']);
});
});
// ---- Shell handler: env allowlist -------------------------------------------
describe('shell handler: env allowlist', () => {
test('process env leak prevention: a faux secret is NOT in child env', async () => {
const saved = process.env.SHELL_TEST_SECRET;
process.env.SHELL_TEST_SECRET = 'should-not-leak';
try {
const res = await shellHandler(makeCtx({
cmd: 'echo "secret=${SHELL_TEST_SECRET:-EMPTY}"',
cwd: '/tmp',
})) as any;
expect(res.stdout_tail).toBe('secret=EMPTY\n');
} finally {
if (saved === undefined) delete process.env.SHELL_TEST_SECRET;
else process.env.SHELL_TEST_SECRET = saved;
}
});
test('PATH is inherited from worker', async () => {
const res = await shellHandler(makeCtx({
cmd: 'echo "path=$PATH"',
cwd: '/tmp',
})) as any;
expect(res.stdout_tail.startsWith('path=')).toBe(true);
expect(res.stdout_tail.length).toBeGreaterThan('path=\n'.length);
});
test('caller-supplied env key is added', async () => {
const res = await shellHandler(makeCtx({
cmd: 'echo "val=$MY_CUSTOM"',
cwd: '/tmp',
env: { MY_CUSTOM: 'hello' },
})) as any;
expect(res.stdout_tail).toBe('val=hello\n');
});
test('caller-supplied env can override allowlisted key (PATH)', async () => {
const res = await shellHandler(makeCtx({
cmd: 'echo "path=$PATH"',
cwd: '/tmp',
env: { PATH: '/custom/bin' },
})) as any;
expect(res.stdout_tail).toBe('path=/custom/bin\n');
});
});
// ---- Shell handler: abort --------------------------------------------------
describe('shell handler: abort', () => {
test('ctx.signal.abort triggers SIGTERM and handler throws aborted', async () => {
const ac = new AbortController();
const promise = shellHandler(makeCtx(
{ cmd: 'sleep 30', cwd: '/tmp' },
{ signal: ac.signal },
));
// Give spawn a beat to start
setTimeout(() => ac.abort(new Error('cancel')), 50);
await expect(promise).rejects.toThrow(/aborted/);
});
test('ctx.shutdownSignal.abort also triggers kill', async () => {
const shutdownCtl = new AbortController();
const promise = shellHandler(makeCtx(
{ cmd: 'sleep 30', cwd: '/tmp' },
{ shutdownSignal: shutdownCtl.signal },
));
setTimeout(() => shutdownCtl.abort(new Error('shutdown')), 50);
await expect(promise).rejects.toThrow(/aborted/);
});
test('pre-aborted signal → immediate kill', async () => {
const ac = new AbortController();
ac.abort(new Error('cancel'));
const promise = shellHandler(makeCtx(
{ cmd: 'sleep 30', cwd: '/tmp' },
{ signal: ac.signal },
));
await expect(promise).rejects.toThrow(/aborted/);
});
});
// ---- shell-audit: ISO-week filename ----------------------------------------
describe('shell-audit: computeAuditFilename', () => {
test('2027-01-01 is ISO week 53 of 2026', () => {
expect(computeAuditFilename(new Date('2027-01-01T12:00:00Z'))).toBe('shell-jobs-2026-W53.jsonl');
});
test('2026-12-28 (Monday) is ISO week 53 of 2026', () => {
expect(computeAuditFilename(new Date('2026-12-28T12:00:00Z'))).toBe('shell-jobs-2026-W53.jsonl');
});
test('2027-01-04 (Monday) is ISO week 1 of 2027', () => {
expect(computeAuditFilename(new Date('2027-01-04T12:00:00Z'))).toBe('shell-jobs-2027-W01.jsonl');
});
test('2026-04-19 (mid-year reference)', () => {
const f = computeAuditFilename(new Date('2026-04-19T00:00:00Z'));
expect(f).toMatch(/^shell-jobs-2026-W\d{2}\.jsonl$/);
});
});
// ---- shell-audit: write path -----------------------------------------------
describe('shell-audit: write', () => {
let tmpDir: string;
beforeEach(() => {
tmpDir = fs.mkdtempSync(path.join(os.tmpdir(), 'shell-audit-test-'));
process.env.GBRAIN_AUDIT_DIR = tmpDir;
});
afterAll(() => {
delete process.env.GBRAIN_AUDIT_DIR;
});
test('GBRAIN_AUDIT_DIR env override resolves to the custom dir', () => {
expect(resolveAuditDir()).toBe(tmpDir);
});
test('writes a JSONL line; creates dir if missing', () => {
const inner = path.join(tmpDir, 'nested-not-yet-created');
process.env.GBRAIN_AUDIT_DIR = inner;
logShellSubmission({
caller: 'cli', remote: false, job_id: 42, cwd: '/tmp', cmd_display: 'echo ok',
});
const files = fs.readdirSync(inner);
expect(files.length).toBe(1);
const content = fs.readFileSync(path.join(inner, files[0]), 'utf8').trim();
const parsed = JSON.parse(content);
expect(parsed.caller).toBe('cli');
expect(parsed.job_id).toBe(42);
expect(parsed.cmd_display).toBe('echo ok');
expect(parsed.ts).toBeDefined();
});
test('argv_display stored as JSON array (Codex #11)', () => {
logShellSubmission({
caller: 'cli', remote: false, job_id: 1, cwd: '/tmp',
argv_display: ['node', 'script.mjs', '--date', '2026-04-18'],
});
const files = fs.readdirSync(tmpDir);
const content = fs.readFileSync(path.join(tmpDir, files[0]), 'utf8').trim();
const parsed = JSON.parse(content);
expect(Array.isArray(parsed.argv_display)).toBe(true);
expect(parsed.argv_display).toEqual(['node', 'script.mjs', '--date', '2026-04-18']);
});
test('does NOT log env values', () => {
logShellSubmission({
caller: 'cli', remote: false, job_id: 1, cwd: '/tmp', cmd_display: 'echo ok',
});
const files = fs.readdirSync(tmpDir);
const content = fs.readFileSync(path.join(tmpDir, files[0]), 'utf8');
expect(content).not.toContain('env');
});
test('write failure (EACCES) is non-blocking', () => {
// Point at an uncreatable target: /dev/null is not a directory, so creating
// a path under it fails (typically ENOTDIR rather than EACCES).
process.env.GBRAIN_AUDIT_DIR = '/dev/null/not-a-dir';
// Should not throw — failures go to stderr.
expect(() => logShellSubmission({
caller: 'cli', remote: false, job_id: 1, cwd: '/tmp',
})).not.toThrow();
});
});
// ---- shell handler: UTF-8-safe output truncation ---------------------------
describe('shell handler: output truncation', () => {
test('stdout > 64KB is truncated and marker is prepended', async () => {
// Emit ~100KB of stdout to force truncation
const res = await shellHandler(makeCtx({
cmd: `yes ok | head -c 100000`,
cwd: '/tmp',
})) as any;
expect(res.exit_code).toBe(0);
expect(res.stdout_tail).toMatch(/^\[truncated \d+ bytes\]/);
expect(res.stdout_tail.length).toBeGreaterThan(0);
// Tail must contain characters we emitted
expect(res.stdout_tail).toContain('ok');
});
});
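
The ISO-week boundary cases in the `computeAuditFilename` tests above can be reproduced with the usual "Thursday of the same week" rule. The real implementation in `shell-audit.ts` is not shown in this diff, so `isoWeekFilename` below is a hypothetical sketch that only matches the filenames the tests expect; it works entirely in UTC.

```typescript
// Hypothetical sketch (the PR's computeAuditFilename is not shown here):
// derive the ISO-8601 week-numbering year and week from a date, in UTC.
function isoWeekFilename(now: Date): string {
  // Copy at UTC midnight so time-of-day cannot shift the result.
  const d = new Date(Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate()));
  const dayOfWeek = d.getUTCDay() || 7; // Monday=1 ... Sunday=7
  // Move to the Thursday of this ISO week; its calendar year is the ISO year.
  d.setUTCDate(d.getUTCDate() + 4 - dayOfWeek);
  const isoYear = d.getUTCFullYear();
  const yearStart = Date.UTC(isoYear, 0, 1);
  const dayOfYear = (d.getTime() - yearStart) / 86400000 + 1;
  const week = Math.ceil(dayOfYear / 7);
  return `shell-jobs-${isoYear}-W${String(week).padStart(2, '0')}.jsonl`;
}

const newYearsDay = isoWeekFilename(new Date('2027-01-01T12:00:00Z'));
const lastMonday = isoWeekFilename(new Date('2026-12-28T12:00:00Z'));
const firstMonday = isoWeekFilename(new Date('2027-01-04T12:00:00Z'));
```

The point of the year-boundary tests is that the filename year is the ISO week-numbering year, not the calendar year: 2027-01-01 is a Friday, so it still belongs to week 53 of 2026.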