gbrain/src/commands/jobs.ts
Garry Tan ff10796a00 fix(wave): v0.15.1 - 4 hot issues + scope expansion (#248)
* fix(wave): 4 hot issues + 3 scope expansions (v0.13.1)

Addresses four user-filed regressions after v0.13.0 plus three adjacent
footgun closures.

* #170 — CREATE INDEX [CONCURRENTLY] IF NOT EXISTS idx_pages_updated_at_desc
  on pages (updated_at DESC). Engine-aware migration v12 with invalid-index
  cleanup on Postgres, plain CREATE on PGLite. ~700x on 30k+ row brains.
  Contributed by @fuleinist (#215).

* #219 — Minions schema default max_stalled 1 -> 5. v13 migration ALTERs
  the default and UPDATEs existing non-terminal rows (waiting/active/
  delayed/waiting-children/paused) so live queues get rescued on upgrade.
  Adds MinionJobInput.max_stalled with [1,100] clamp. New --max-stalled
  CLI flag on `jobs submit`. Reported by @macbotmini-eng.

* #218 — package.json postinstall surfaces errors instead of silencing.
  trustedDependencies whitelists @electric-sql/pglite. doctor
  schema_version check fails loudly when migrations never ran and links
  to #218. README + INSTALL_FOR_AGENTS warn against `bun install -g`.
  Reported by @gopalpatel.

* #223 — @electric-sql/pglite pinned to exactly 0.4.3 (was ^0.4.4).
  PGLiteEngine.connect() wraps PGlite.create() errors with a message
  pointing at the issue + gbrain doctor. Does NOT suggest 'missing
  migrations' as a cause (create-time abort happens before migrations
  run). Pin is unverified against macOS 26.3; error-wrap is the safety
  net. Reported by @AndreLYL.

* Scope: `gbrain jobs submit` gains --backoff-type/--backoff-delay/
  --backoff-jitter/--timeout-ms/--idempotency-key (MinionJobInput audit).
* Scope: `gbrain jobs smoke --sigkill-rescue` regression case (opt-in,
  CI-only) that simulates a killed worker and asserts the new default
  rescues.
* Scope: `gbrain doctor --index-audit` reports zero-scan Postgres indexes
  as drop candidates (informational; no auto-drop).
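
The `[1,100]` clamp on `MinionJobInput.max_stalled` from #219 can be pictured as a small normalization step. This is a minimal sketch, not the code in the queue: the name `clampMaxStalled`, the truncation of fractional values, and returning `undefined` for missing or non-numeric input (so the schema default applies) are all assumptions made for illustration.

```typescript
// Hypothetical helper; the real clamp lives in the Minions queue code and may differ.
const MAX_STALLED_MIN = 1;
const MAX_STALLED_MAX = 100;

function clampMaxStalled(value: number | undefined): number | undefined {
  // Missing or non-numeric input: return undefined so the column default is used.
  if (value === undefined || !Number.isFinite(value)) return undefined;
  // Assumption: fractional values are truncated before clamping.
  const whole = Math.trunc(value);
  return Math.min(MAX_STALLED_MAX, Math.max(MAX_STALLED_MIN, whole));
}

const requested = [0, 7, 500].map(clampMaxStalled); // [1, 7, 100]
```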

Infrastructure:
* Migration interface extended with sqlFor: { postgres?, pglite? } and
  transaction: boolean. Runner picks the engine-specific branch and
  bypasses engine.transaction() when transaction:false (required for
  CONCURRENTLY). BrainEngine.kind readonly discriminator added.
* scripts/check-jsonb-pattern.sh CI guard extended to block
  `max_stalled DEFAULT 1` from regressing.
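
To make the `sqlFor` / `transaction` extension concrete, here is a rough sketch of how a runner might pick the engine-specific SQL and skip the transaction wrapper. Only `sqlFor`, `transaction`, and `kind` come from the notes above; `MiniEngine`, `pickSql`, `applyMigration`, the `sql` fallback field, and the optional `transaction` default are hypothetical. The invalid-index cleanup that v12 performs on Postgres is left out.

```typescript
type EngineKind = 'postgres' | 'pglite';

interface Migration {
  version: number;
  name: string;
  sql?: string;                                     // assumed engine-agnostic fallback
  sqlFor?: { postgres?: string; pglite?: string };  // engine-specific override
  transaction?: boolean;                            // false = run outside a transaction
}

// Hypothetical minimal engine surface, just enough for the sketch.
interface MiniEngine {
  readonly kind: EngineKind;
  executeRaw(sql: string): Promise<void>;
  transaction(fn: () => Promise<void>): Promise<void>;
}

function pickSql(m: Migration, kind: EngineKind): string {
  const sql = m.sqlFor?.[kind] ?? m.sql;
  if (!sql) throw new Error(`migration v${m.version} has no SQL for ${kind}`);
  return sql;
}

async function applyMigration(engine: MiniEngine, m: Migration): Promise<void> {
  const sql = pickSql(m, engine.kind);
  if (m.transaction === false) {
    // Postgres rejects CREATE INDEX CONCURRENTLY inside a transaction block.
    await engine.executeRaw(sql);
  } else {
    await engine.transaction(() => engine.executeRaw(sql));
  }
}

const v12: Migration = {
  version: 12,
  name: 'pages_updated_at_index',
  transaction: false,
  sqlFor: {
    postgres: 'CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_pages_updated_at_desc ON pages (updated_at DESC)',
    pglite: 'CREATE INDEX IF NOT EXISTS idx_pages_updated_at_desc ON pages (updated_at DESC)',
  },
};
```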

Tests:
* 15 new unit tests: v12/v13 structural + behavioral assertions,
  max_stalled default/clamp/backfill, PGLite error-wrap source guard,
  engine kind discriminator.
* 3 regression tests pinned by IRON RULE.
* Full unit suite: 1416 pass.
* Full E2E suite against Postgres 16 + pgvector: 126 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v0.13.1)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: sync documentation for v0.13.1

CLAUDE.md "Key files" and "Commands" sections refreshed to match the
v0.13.1 fix wave:

- Note `BrainEngine.kind` discriminator on engine.ts
- Document v0.13.1 connect() error-wrap on pglite-engine.ts
- Refresh src/core/minions/ layout (no shell handler, no protected-names,
  no quiet-hours/stagger — that was v0.13-development scaffolding that
  did not ship)
- Add src/core/migrate.ts entry with `Migration` interface extensions
  (`sqlFor`, `transaction: false`)
- Document new `gbrain jobs submit` flags (--max-stalled, --backoff-type,
  --backoff-delay, --backoff-jitter, --timeout-ms, --idempotency-key)
- Document `gbrain jobs smoke --sigkill-rescue` regression guard
- Document `gbrain doctor --index-audit` and the schema_version=0
  surface that catches #218 postinstall failures
- Extend check-jsonb-pattern.sh note with the max_stalled DEFAULT 1
  regression guard
- Touch up test file blurbs for migrate.test.ts, pglite-engine.test.ts,
  minions.test.ts with v0.13.1 coverage

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(e2e): run files sequentially to eliminate shared-DB race

The E2E suite was flaky. ~3 of every 5 runs had 4-10 failures clustered
in Links, Timeline, Versions, Minions resilience, Parallel Import, and
Page CRUD tests. Symptoms included "expected 16 pages, got 8" (half),
"expected 1 link inserted, got 0", timeline entries missing after
round-trip, and similar data-shape mismatches.

Root cause: bun test runs test FILES in parallel (each in a worker
process). 13 E2E files share one DATABASE_URL, and `setupDB()` in
`test/e2e/helpers.ts` does `TRUNCATE ... CASCADE` on all tables before
each file's `importFixtures()`. File A's TRUNCATE would race with file
B's in-flight INSERT stream, producing the observed half-populated or
wrong-count states.

An earlier attempt used a Postgres advisory lock held on a dedicated
single-connection client for the lifetime of each file's run. It broke
because bun's default 5000 ms hook timeout fires on queued beforeAll()
calls: with 13 files serializing through the lock, files 2-13 would
time out waiting for file 1 to finish.

This commit switches to sequential file execution at the harness level
via scripts/run-e2e.sh, which loops through test/e2e/*.test.ts one at
a time, tracks aggregate pass/fail counts, and exits non-zero on the
first failing file. No lock, no timeout issues, no changes to any test
file. package.json test:e2e points at the new script.
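
The control flow of `scripts/run-e2e.sh` can be sketched in TypeScript instead of shell. This is an illustration under assumptions, not the script itself: `runSequentially`, `FileResult`, and `Summary` are hypothetical names, the `runFile` callback stands in for running `bun test` on one file, and "exits non-zero on the first failing file" is read as "stop the loop at that file".

```typescript
interface FileResult { pass: number; fail: number }
interface Summary { filesRun: number; pass: number; fail: number; exitCode: number }

function runSequentially(
  files: string[],
  runFile: (file: string) => FileResult, // stand-in for one blocking test-file run
): Summary {
  const summary: Summary = { filesRun: 0, pass: 0, fail: 0, exitCode: 0 };
  for (const file of files) {
    // One file at a time: no two files ever touch the shared database concurrently.
    const result = runFile(file);
    summary.filesRun += 1;
    summary.pass += result.pass;
    summary.fail += result.fail;
    if (result.fail > 0) {
      summary.exitCode = 1; // stop at the first failing file
      break;
    }
  }
  return summary;
}
```

Because each file finishes before the next one starts, a `TRUNCATE` in one file's setup can no longer overlap with another file's inserts, which is the race described above.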

Verified: 5 back-to-back runs against the same Postgres container,
each completing in ~5 min. Every run: 13 files, 138 tests, 0 fails.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: bump version to 0.15.1 (fix wave locked to MINOR line)

Master v0.14.2 was the last /investigate root-cause wave on the
v0.14.x line. This fix wave opens v0.15.x: four hot issues (#170,
#218, #219, #223) close v0.13.x regressions that v0.14.x didn't
cover, so the MINOR bump reflects the semantic shift — new schema
migrations (v14, v15), a new CLI surface (`--max-stalled`,
`--sigkill-rescue`, `--index-audit`), a new BrainEngine contract
(`kind` discriminator + extended `Migration` interface), and a new
install-time contract (PGLite 0.4.3 pin + `trustedDependencies`).

Locked to 0.15.1 in advance: other work may land before/after this
PR, but the version is fixed so reviewers can cite a stable number.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 13:19:23 -07:00


/**
 * CLI handler for `gbrain jobs` subcommands.
 * Thin wrapper around MinionQueue and MinionWorker.
 */
import type { BrainEngine } from '../core/engine.ts';
import { MinionQueue } from '../core/minions/queue.ts';
import { MinionWorker } from '../core/minions/worker.ts';
import type { MinionJob, MinionJobStatus } from '../core/minions/types.ts';

function parseFlag(args: string[], flag: string): string | undefined {
  const idx = args.indexOf(flag);
  return idx >= 0 && idx + 1 < args.length ? args[idx + 1] : undefined;
}

function hasFlag(args: string[], flag: string): boolean {
  return args.includes(flag);
}
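
A short illustration of how the two flag helpers behave, including two edge cases worth knowing about. The helpers are repeated so the snippet runs on its own; the sample `argv` values are made up.

```typescript
function parseFlag(args: string[], flag: string): string | undefined {
  const idx = args.indexOf(flag);
  return idx >= 0 && idx + 1 < args.length ? args[idx + 1] : undefined;
}

function hasFlag(args: string[], flag: string): boolean {
  return args.includes(flag);
}

const argv = ['submit', 'embed', '--priority', '5', '--follow', '--queue'];

const priority = parseFlag(argv, '--priority'); // '5'
const follow = hasFlag(argv, '--follow');       // true

// Edge case 1: a value-taking flag in the last position has no value.
const queue = parseFlag(argv, '--queue');       // undefined

// Edge case 2: the helper does not check whether the next token is itself a flag.
const odd = parseFlag(['--queue', '--follow'], '--queue'); // '--follow'
```

Since values are returned as raw strings, numeric flags still need their own `parseInt` / `parseFloat` and `NaN` checks at the call site.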
function formatJob(job: MinionJob): string {
  const dur = job.finished_at && job.started_at
    ? `${((job.finished_at.getTime() - job.started_at.getTime()) / 1000).toFixed(1)}s`
    : '—';
  const stalled = job.status === 'active' && job.lock_until && job.lock_until < new Date()
    ? ' (stalled?)' : '';
  return ` ${String(job.id).padEnd(6)} ${job.name.padEnd(14)} ${(job.status + stalled).padEnd(20)} ${job.queue.padEnd(10)} ${dur.padEnd(8)} ${job.created_at.toISOString().slice(0, 19)}`;
}
function formatJobDetail(job: MinionJob): string {
  const lines = [
    `Job #${job.id}: ${job.name} (${job.status.toUpperCase()}${job.status === 'dead' ? ` after ${job.attempts_made} attempts` : ''})`,
    ` Queue: ${job.queue} | Priority: ${job.priority}`,
    ` Attempts: ${job.attempts_made}/${job.max_attempts} (started: ${job.attempts_started})`,
    ` Backoff: ${job.backoff_type} ${job.backoff_delay}ms (jitter: ${job.backoff_jitter})`,
  ];
  if (job.started_at) lines.push(` Started: ${job.started_at.toISOString()}`);
  if (job.finished_at) lines.push(` Finished: ${job.finished_at.toISOString()}`);
  if (job.lock_token) lines.push(` Lock: ${job.lock_token} (until ${job.lock_until?.toISOString()})`);
  if (job.delay_until) lines.push(` Delayed until: ${job.delay_until.toISOString()}`);
  if (job.parent_job_id) lines.push(` Parent: job #${job.parent_job_id} (on_child_fail: ${job.on_child_fail})`);
  if (job.error_text) lines.push(` Error: ${job.error_text}`);
  if (job.stacktrace.length > 0) {
    lines.push(` History:`);
    for (const entry of job.stacktrace) lines.push(` - ${entry}`);
  }
  if (job.progress != null) lines.push(` Progress: ${JSON.stringify(job.progress)}`);
  if (job.result != null) lines.push(` Result: ${JSON.stringify(job.result)}`);
  lines.push(` Data: ${JSON.stringify(job.data)}`);
  return lines.join('\n');
}
export async function runJobs(engine: BrainEngine, args: string[]): Promise<void> {
const sub = args[0];
if (!sub || sub === '--help' || sub === '-h') {
console.log(`gbrain jobs — Minions job queue
USAGE
gbrain jobs submit <name> [--params JSON] [--follow] [--priority N]
[--delay Nms] [--max-attempts N] [--max-stalled N]
[--backoff-type fixed|exponential] [--backoff-delay Nms]
[--backoff-jitter 0..1] [--timeout-ms Nms]
[--idempotency-key K] [--queue Q] [--dry-run]
gbrain jobs list [--status S] [--queue Q] [--limit N]
gbrain jobs get <id>
gbrain jobs cancel <id>
gbrain jobs retry <id>
gbrain jobs prune [--older-than 30d]
gbrain jobs delete <id>
gbrain jobs stats
gbrain jobs smoke [--sigkill-rescue]
gbrain jobs work [--queue Q] [--concurrency N]
HANDLER TYPES (built in)
sync Pull and embed new pages from the repo
embed (Re-)embed pages; --params '{"slug":...}' or '{"all":true}'
lint Run page linter; --params '{"dir":"...","fix":true}'
import Bulk import markdown; --params '{"dir":"..."}'
extract Extract links + timeline entries; '{"mode":"all"}'
backlinks Check or fix back-links; '{"action":"fix"}'
autopilot-cycle One autopilot pass (sync+extract+embed+backlinks)
shell Run a command or argv. Requires GBRAIN_ALLOW_SHELL_JOBS=1
on the worker. Params: {cmd?, argv?, cwd, env?}.
See: docs/guides/minions-shell-jobs.md
`);
return;
}
const queue = new MinionQueue(engine);
switch (sub) {
case 'submit': {
const name = args[1];
if (!name) {
console.error('Error: job name required. Usage: gbrain jobs submit <name>');
process.exit(1);
}
const paramsStr = parseFlag(args, '--params');
let data: Record<string, unknown> = {};
if (paramsStr) {
try { data = JSON.parse(paramsStr); }
catch { console.error('Error: --params must be valid JSON'); process.exit(1); }
}
const priority = parseInt(parseFlag(args, '--priority') ?? '0', 10);
const delay = parseInt(parseFlag(args, '--delay') ?? '0', 10);
const maxAttempts = parseInt(parseFlag(args, '--max-attempts') ?? '3', 10);
const maxStalledRaw = parseFlag(args, '--max-stalled');
const maxStalled = maxStalledRaw !== undefined ? parseInt(maxStalledRaw, 10) : undefined;
// v0.13.1 field audit: expose retry/backoff/timeout/idempotency knobs so
// users can tune Minions behavior without dropping into TypeScript.
const backoffTypeRaw = parseFlag(args, '--backoff-type');
const backoffType = backoffTypeRaw === 'fixed' || backoffTypeRaw === 'exponential'
? backoffTypeRaw
: undefined;
const backoffDelayRaw = parseFlag(args, '--backoff-delay');
const backoffDelay = backoffDelayRaw !== undefined ? parseInt(backoffDelayRaw, 10) : undefined;
const backoffJitterRaw = parseFlag(args, '--backoff-jitter');
const backoffJitter = backoffJitterRaw !== undefined ? parseFloat(backoffJitterRaw) : undefined;
const timeoutMsRaw = parseFlag(args, '--timeout-ms');
const timeoutMs = timeoutMsRaw !== undefined ? parseInt(timeoutMsRaw, 10) : undefined;
if (timeoutMsRaw !== undefined && (isNaN(timeoutMs!) || timeoutMs! <= 0)) {
console.error('Error: --timeout-ms must be a positive integer (milliseconds)');
process.exit(1);
}
const idempotencyKey = parseFlag(args, '--idempotency-key');
const queueName = parseFlag(args, '--queue') ?? 'default';
const dryRun = hasFlag(args, '--dry-run');
const follow = hasFlag(args, '--follow');
if (dryRun) {
console.log(`[DRY RUN] Would submit job:`);
console.log(` Name: ${name}`);
console.log(` Queue: ${queueName}`);
console.log(` Priority: ${priority}`);
console.log(` Max attempts: ${maxAttempts}`);
if (maxStalled !== undefined) console.log(` Max stalled: ${maxStalled}`);
if (backoffType) console.log(` Backoff type: ${backoffType}`);
if (backoffDelay !== undefined) console.log(` Backoff delay: ${backoffDelay}ms`);
if (backoffJitter !== undefined) console.log(` Backoff jitter: ${backoffJitter}`);
if (timeoutMs !== undefined) console.log(` Timeout: ${timeoutMs}ms`);
if (idempotencyKey) console.log(` Idempotency key: ${idempotencyKey}`);
if (delay > 0) console.log(` Delay: ${delay}ms`);
console.log(` Data: ${JSON.stringify(data)}`);
return;
}
try {
await queue.ensureSchema();
} catch (e) {
console.error(e instanceof Error ? e.message : String(e));
process.exit(1);
}
// The CLI path is a trusted submitter. Pass {allowProtectedSubmit: true}
// ONLY for protected names, not blanket-set for every submission, so any
// future protected name forces explicit opt-in at the call site.
const { isProtectedJobName } = await import('../core/minions/protected-names.ts');
const trusted = isProtectedJobName(name) ? { allowProtectedSubmit: true } : undefined;
const job = await queue.add(name, data, {
priority,
delay: delay > 0 ? delay : undefined,
max_attempts: maxAttempts,
max_stalled: maxStalled,
backoff_type: backoffType,
backoff_delay: backoffDelay,
backoff_jitter: backoffJitter,
timeout_ms: timeoutMs,
idempotency_key: idempotencyKey,
queue: queueName,
}, trusted);
// Submission audit log (operational trace, not forensic insurance).
try {
const { logShellSubmission } = await import('../core/minions/handlers/shell-audit.ts');
if (name.trim() === 'shell') {
logShellSubmission({
caller: 'cli',
remote: false,
job_id: job.id,
cwd: typeof data.cwd === 'string' ? data.cwd : '',
cmd_display: typeof data.cmd === 'string' ? data.cmd.slice(0, 80) : undefined,
argv_display: Array.isArray(data.argv)
? (data.argv as unknown[]).filter((a): a is string => typeof a === 'string').map((a) => a.slice(0, 80))
: undefined,
});
}
} catch { /* audit failures never block submission */ }
// Starvation warning (DX polish). Fire for every non-`--follow` shell submit
// regardless of the submitter's own `GBRAIN_ALLOW_SHELL_JOBS` — the submitter
// env is a weak proxy for the worker env (they may run on different machines),
// so the warning remains useful any time the job might sit in 'waiting'.
if (!follow && name.trim() === 'shell') {
process.stderr.write(
`\n⚠ Shell jobs require GBRAIN_ALLOW_SHELL_JOBS=1 on the worker process.\n` +
` Your job was queued (id=${job.id}) but will sit in 'waiting' until a\n` +
` worker with the env flag starts. To run now:\n\n` +
` GBRAIN_ALLOW_SHELL_JOBS=1 gbrain jobs submit shell \\\n` +
` --params '...' --follow\n\n` +
` Or start a persistent worker (Postgres only — PGLite uses --follow):\n\n` +
` GBRAIN_ALLOW_SHELL_JOBS=1 gbrain jobs work\n\n`,
);
}
if (follow) {
console.log(`Job #${job.id} submitted (${name}). Executing inline...`);
// Inline execution: run the job in this process
const worker = new MinionWorker(engine, { queue: queueName, pollInterval: 100 });
// Register built-in handlers
await registerBuiltinHandlers(worker, engine);
if (!worker.registeredNames.includes(name)) {
console.error(`Error: Unknown job type '${name}'.`);
console.error(`Available types: ${worker.registeredNames.join(', ')}`);
console.error(`Register custom types with worker.register('${name}', handler).`);
process.exit(1);
}
// Run worker for one job then stop
const startTime = Date.now();
const workerPromise = worker.start();
// Poll until this job completes
const pollInterval = setInterval(async () => {
const updated = await queue.getJob(job.id);
if (updated && ['completed', 'failed', 'dead', 'cancelled'].includes(updated.status)) {
worker.stop();
clearInterval(pollInterval);
}
}, 200);
await workerPromise;
clearInterval(pollInterval);
const final = await queue.getJob(job.id);
const elapsed = ((Date.now() - startTime) / 1000).toFixed(1);
if (final?.status === 'completed') {
console.log(`Job #${job.id} completed in ${elapsed}s`);
if (final.result) console.log(`Result: ${JSON.stringify(final.result)}`);
} else {
console.error(`Job #${job.id} ${final?.status}: ${final?.error_text}`);
process.exit(1);
}
} else {
console.log(JSON.stringify(job, null, 2));
}
break;
}
case 'list': {
const status = parseFlag(args, '--status') as MinionJobStatus | undefined;
const queueName = parseFlag(args, '--queue');
const limit = parseInt(parseFlag(args, '--limit') ?? '20', 10);
try { await queue.ensureSchema(); }
catch (e) { console.error(e instanceof Error ? e.message : String(e)); process.exit(1); }
const jobs = await queue.getJobs({ status, queue: queueName, limit });
if (jobs.length === 0) {
console.log('No jobs found.');
return;
}
console.log(` ${'ID'.padEnd(6)} ${'Name'.padEnd(14)} ${'Status'.padEnd(20)} ${'Queue'.padEnd(10)} ${'Time'.padEnd(8)} Created`);
console.log(' ' + '─'.repeat(80));
for (const job of jobs) console.log(formatJob(job));
console.log(`\n ${jobs.length} jobs shown`);
break;
}
case 'get': {
const id = parseInt(args[1], 10);
if (isNaN(id)) { console.error('Error: job ID required. Usage: gbrain jobs get <id>'); process.exit(1); }
try { await queue.ensureSchema(); }
catch (e) { console.error(e instanceof Error ? e.message : String(e)); process.exit(1); }
const job = await queue.getJob(id);
if (!job) { console.error(`Job #${id} not found.`); process.exit(1); }
console.log(formatJobDetail(job));
break;
}
case 'cancel': {
const id = parseInt(args[1], 10);
if (isNaN(id)) { console.error('Error: job ID required.'); process.exit(1); }
try { await queue.ensureSchema(); }
catch (e) { console.error(e instanceof Error ? e.message : String(e)); process.exit(1); }
const cancelled = await queue.cancelJob(id);
if (cancelled) {
console.log(`Job #${id} cancelled.`);
} else {
console.error(`Could not cancel job #${id} (may already be completed/dead).`);
process.exit(1);
}
break;
}
case 'retry': {
const id = parseInt(args[1], 10);
if (isNaN(id)) { console.error('Error: job ID required.'); process.exit(1); }
try { await queue.ensureSchema(); }
catch (e) { console.error(e instanceof Error ? e.message : String(e)); process.exit(1); }
const retried = await queue.retryJob(id);
if (retried) {
console.log(`Job #${id} re-queued for retry.`);
} else {
console.error(`Could not retry job #${id} (must be failed or dead).`);
process.exit(1);
}
break;
}
case 'delete': {
const id = parseInt(args[1], 10);
if (isNaN(id)) { console.error('Error: job ID required.'); process.exit(1); }
try { await queue.ensureSchema(); }
catch (e) { console.error(e instanceof Error ? e.message : String(e)); process.exit(1); }
const removed = await queue.removeJob(id);
if (removed) {
console.log(`Job #${id} deleted.`);
} else {
console.error(`Could not delete job #${id} (must be in a terminal status).`);
process.exit(1);
}
break;
}
case 'prune': {
const olderThanStr = parseFlag(args, '--older-than') ?? '30d';
const days = parseInt(olderThanStr, 10);
if (isNaN(days) || days <= 0) {
console.error('Error: --older-than must be a positive number (days). Example: --older-than 30d');
process.exit(1);
}
try { await queue.ensureSchema(); }
catch (e) { console.error(e instanceof Error ? e.message : String(e)); process.exit(1); }
const count = await queue.prune({ olderThan: new Date(Date.now() - days * 86400000) });
console.log(`Pruned ${count} jobs older than ${days} days.`);
break;
}
case 'stats': {
try { await queue.ensureSchema(); }
catch (e) { console.error(e instanceof Error ? e.message : String(e)); process.exit(1); }
const stats = await queue.getStats();
console.log('Job Stats (last 24h):');
if (stats.by_type.length > 0) {
console.log(` ${'Type'.padEnd(14)} ${'Total'.padEnd(7)} ${'Done'.padEnd(7)} ${'Failed'.padEnd(8)} ${'Dead'.padEnd(6)} Avg Time`);
for (const t of stats.by_type) {
const avgTime = t.avg_duration_ms != null ? `${(t.avg_duration_ms / 1000).toFixed(1)}s` : '—';
console.log(` ${t.name.padEnd(14)} ${String(t.total).padEnd(7)} ${String(t.completed).padEnd(7)} ${String(t.failed).padEnd(8)} ${String(t.dead).padEnd(6)} ${avgTime}`);
}
} else {
console.log(' No jobs in the last 24 hours.');
}
console.log(`\n Queue health: ${stats.queue_health.waiting} waiting, ${stats.queue_health.active} active, ${stats.queue_health.stalled} stalled`);
break;
}
case 'smoke': {
const startTime = Date.now();
try { await queue.ensureSchema(); }
catch (e) {
console.error(`SMOKE FAIL — schema init: ${e instanceof Error ? e.message : String(e)}`);
process.exit(1);
}
const sigkillRescue = hasFlag(args, '--sigkill-rescue');
const worker = new MinionWorker(engine, { queue: 'smoke', pollInterval: 100 });
worker.register('noop', async () => ({ ok: true, at: new Date().toISOString() }));
const job = await queue.add('noop', {}, { queue: 'smoke', max_attempts: 1 });
const workerPromise = worker.start();
const timeoutMs = 15000;
let final: MinionJob | null = null;
for (let elapsed = 0; elapsed < timeoutMs; elapsed += 100) {
await new Promise(r => setTimeout(r, 100));
final = await queue.getJob(job.id);
if (final && ['completed', 'failed', 'dead', 'cancelled'].includes(final.status)) break;
}
worker.stop();
await workerPromise;
const elapsedSec = ((Date.now() - startTime) / 1000).toFixed(2);
if (final?.status !== 'completed') {
console.error(`SMOKE FAIL — job #${job.id} status: ${final?.status ?? 'timeout'} (${elapsedSec}s elapsed)`);
if (final?.error_text) console.error(` Error: ${final.error_text}`);
process.exit(1);
}
// --sigkill-rescue: regression case for #219. Simulates a SIGKILL
// mid-flight by directly manipulating lock_until via handleStalled.
// Verifies that with the v0.13.1 schema default (max_stalled=5), a
// stalled job is REQUEUED rather than dead-lettered on first stall.
// Full subprocess-level SIGKILL lives in test/e2e/minions.test.ts.
if (sigkillRescue) {
const rescueJob = await queue.add('noop', {}, { queue: 'smoke' });
// Transition to active with a past lock_until, mimicking a worker
// that claimed and then got SIGKILL'd mid-run.
await engine.executeRaw(
`UPDATE minion_jobs
SET status='active',
lock_token='smoke-sigkill-rescue',
lock_until=now() - interval '1 minute',
started_at=now() - interval '2 minute',
attempts_started = attempts_started + 1
WHERE id=$1`,
[rescueJob.id]
);
const result = await queue.handleStalled();
const afterStall = await queue.getJob(rescueJob.id);
if (afterStall?.status === 'dead') {
console.error(
`SMOKE FAIL (--sigkill-rescue) — job #${rescueJob.id} was dead-lettered on first stall. ` +
`This is the #219 regression: schema default max_stalled should rescue, not dead-letter. ` +
`handleStalled: ${JSON.stringify(result)}`
);
process.exit(1);
}
if (afterStall?.status !== 'waiting') {
console.error(
`SMOKE FAIL (--sigkill-rescue) — unexpected status after stall: ${afterStall?.status}. ` +
`Expected 'waiting' (rescued). handleStalled: ${JSON.stringify(result)}`
);
process.exit(1);
}
try { await queue.removeJob(rescueJob.id); } catch { /* non-fatal cleanup */ }
}
const cfg = (await import('../core/config.ts')).loadConfig();
const engineLabel = cfg?.engine ?? 'unknown';
const tag = sigkillRescue ? ' + SIGKILL rescue' : '';
console.log(`SMOKE PASS — Minions healthy${tag} in ${elapsedSec}s (engine: ${engineLabel})`);
if (engineLabel === 'pglite') {
console.log('Note: the `gbrain jobs work` daemon requires Postgres. PGLite');
console.log('supports inline execution only (`submit --follow`).');
}
try { await queue.removeJob(job.id); } catch { /* non-fatal cleanup */ }
process.exit(0);
}
case 'work': {
// Check if PGLite
const config = (await import('../core/config.ts')).loadConfig();
if (config?.engine === 'pglite') {
console.error('Error: Worker daemon requires Postgres. PGLite uses an exclusive file lock that blocks other processes.');
console.error('Use --follow for inline execution: gbrain jobs submit <name> --follow');
process.exit(1);
}
const queueName = parseFlag(args, '--queue') ?? 'default';
const concurrency = parseInt(parseFlag(args, '--concurrency') ?? '1', 10);
try { await queue.ensureSchema(); }
catch (e) { console.error(e instanceof Error ? e.message : String(e)); process.exit(1); }
const worker = new MinionWorker(engine, { queue: queueName, concurrency });
await registerBuiltinHandlers(worker, engine);
console.log(`Minion worker started (queue: ${queueName}, concurrency: ${concurrency})`);
console.log(`Registered handlers: ${worker.registeredNames.join(', ')}`);
await worker.start();
break;
}
default:
console.error(`Unknown subcommand: ${sub}. Run 'gbrain jobs --help' for usage.`);
process.exit(1);
}
}
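
Both `submit --follow` and `smoke` poll `queue.getJob` until the job reaches a terminal status. The loop can be factored into a standalone helper, sketched below. `waitForTerminal`, `JobLike`, the simplified `Status` union, and the injected `getJob` callback are hypothetical names for illustration; they are not part of `MinionQueue`.

```typescript
// Simplified status set; the real MinionJobStatus has more non-terminal states.
type Status = 'waiting' | 'active' | 'completed' | 'failed' | 'dead' | 'cancelled';
interface JobLike { id: number; status: Status }

const TERMINAL: readonly Status[] = ['completed', 'failed', 'dead', 'cancelled'];

async function waitForTerminal(
  getJob: (id: number) => Promise<JobLike | null>,
  id: number,
  timeoutMs = 15000,
  intervalMs = 100,
): Promise<JobLike | null> {
  let last: JobLike | null = null;
  for (let elapsed = 0; elapsed < timeoutMs; elapsed += intervalMs) {
    // Sleep first, then look: mirrors the smoke loop's sleep-then-getJob order.
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
    last = await getJob(id);
    if (last && TERMINAL.includes(last.status)) return last;
  }
  // Timed out: the caller sees the last observed (non-terminal) state, or null.
  return last;
}
```

A caller would treat any result whose status is not `'completed'` as a failure, the same way the `smoke` case reports `final?.status ?? 'timeout'`.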
/**
 * Register built-in job handlers.
 *
 * Handlers call library-level Core functions (runSyncCore via performSync,
 * runExtractCore, runEmbedCore, runBacklinksCore) directly — NOT the CLI
 * wrappers. CLI wrappers call process.exit(1) on validation errors; if a
 * worker claimed a badly-formed job and ran one, the WORKER PROCESS would
 * die and every in-flight job would go stalled. Library Cores throw
 * instead, so one bad job fails one job — not the worker.
 *
 * Per the v0.11.1 plan (Codex architecture #5 — tension 3).
 */
export async function registerBuiltinHandlers(worker: MinionWorker, engine: BrainEngine): Promise<void> {
  worker.register('sync', async (job) => {
    const { performSync } = await import('./sync.ts');
    const repoPath = typeof job.data.repoPath === 'string' ? job.data.repoPath : undefined;
    const noPull = !!job.data.noPull;
    const noEmbed = job.data.noEmbed !== false;
    const result = await performSync(engine, { repoPath, noPull, noEmbed });
    return result;
  });

  worker.register('embed', async (job) => {
    const { runEmbedCore } = await import('./embed.ts');
    await runEmbedCore(engine, {
      slug: typeof job.data.slug === 'string' ? job.data.slug : undefined,
      slugs: Array.isArray(job.data.slugs) ? (job.data.slugs as string[]) : undefined,
      all: !!job.data.all,
      stale: job.data.all ? false : (job.data.stale !== false),
    });
    return { embedded: true };
  });

  worker.register('lint', async (job) => {
    const { runLintCore } = await import('./lint.ts');
    const target = typeof job.data.dir === 'string' ? job.data.dir : '.';
    const result = await runLintCore({ target, fix: !!job.data.fix, dryRun: !!job.data.dryRun });
    return result;
  });

  worker.register('import', async (job) => {
    // import.ts Core extraction deferred to v0.12.0 (import has parallel
    // workers + checkpointing). Keep the CLI wrapper call but note the
    // worker-kill risk is bounded: import's only process.exit fires on
    // a missing dir arg, which this handler always passes.
    const { runImport } = await import('./import.ts');
    const importArgs: string[] = [];
    if (job.data.dir) importArgs.push(String(job.data.dir));
    if (job.data.noEmbed) importArgs.push('--no-embed');
    await runImport(engine, importArgs);
    return { imported: true };
  });

  worker.register('extract', async (job) => {
    const { runExtractCore } = await import('./extract.ts');
    const mode = (typeof job.data.mode === 'string' && ['links', 'timeline', 'all'].includes(job.data.mode))
      ? (job.data.mode as 'links' | 'timeline' | 'all')
      : 'all';
    const dir = typeof job.data.dir === 'string'
      ? job.data.dir
      : (await engine.getConfig('sync.repo_path')) ?? '.';
    return await runExtractCore(engine, { mode, dir, dryRun: !!job.data.dryRun });
  });

  worker.register('backlinks', async (job) => {
    const { runBacklinksCore } = await import('./backlinks.ts');
    const action: 'check' | 'fix' = job.data.action === 'check' ? 'check' : 'fix';
    const dir = typeof job.data.dir === 'string'
      ? job.data.dir
      : (await engine.getConfig('sync.repo_path')) ?? '.';
    return await runBacklinksCore({ action, dir, dryRun: !!job.data.dryRun });
  });

  // The killer handler. Autopilot submits ONE `autopilot-cycle` per cycle
  // (idempotency_key on cycle slot) instead of a 4-job parent-child DAG,
  // because Minions' parent/child is NOT a depends_on primitive (Codex
  // H3/H4). Each step is wrapped in its own try/catch; the handler returns
  // `{ partial: true, failed_steps: [...] }` when any step fails. It does
  // NOT throw on partial failure — that would cause the Minion to retry,
  // and an intermittent extract bug would block every future cycle.
  worker.register('autopilot-cycle', async (job) => {
    const { performSync } = await import('./sync.ts');
    const { runExtractCore } = await import('./extract.ts');
    const { runEmbedCore } = await import('./embed.ts');
    const { runBacklinksCore } = await import('./backlinks.ts');
    const repoPath = typeof job.data.repoPath === 'string'
      ? job.data.repoPath
      : (await engine.getConfig('sync.repo_path')) ?? '.';
    const steps: Record<string, unknown> = {};
    const failed: string[] = [];

    // Bug 8 — Between phases, yield to the event loop. The worker's lock
    // renewal runs on a timer (src/core/minions/worker.ts); without a
    // periodic yield, long CPU-bound phases starve the renewal callback
    // and the job gets killed by the stalled-sweeper. A single
    // `await new Promise(r => setImmediate(r))` gives the timer a chance
    // to fire. The per-phase body is async+await already, so each phase
    // internally yields on its own I/O boundaries — this is a belt for
    // the gap between phases.
    //
    // Follow-up (deferred to v0.15): thread ctx.signal / ctx.shutdownSignal
    // through each core fn so mid-phase cancellation works on huge brains.
    const yieldToLoop = () => new Promise<void>(r => setImmediate(r));

    try { steps.sync = await performSync(engine, { repoPath, noEmbed: true }); }
    catch (e) { steps.sync = { error: e instanceof Error ? e.message : String(e) }; failed.push('sync'); }
    await yieldToLoop();

    try { steps.extract = await runExtractCore(engine, { mode: 'all', dir: repoPath }); }
    catch (e) { steps.extract = { error: e instanceof Error ? e.message : String(e) }; failed.push('extract'); }
    await yieldToLoop();

    try { await runEmbedCore(engine, { stale: true }); steps.embed = { embedded: true }; }
    catch (e) { steps.embed = { error: e instanceof Error ? e.message : String(e) }; failed.push('embed'); }
    await yieldToLoop();

    try { steps.backlinks = await runBacklinksCore({ action: 'fix', dir: repoPath }); }
    catch (e) { steps.backlinks = { error: e instanceof Error ? e.message : String(e) }; failed.push('backlinks'); }

    if (failed.length > 0) {
      return { partial: true, failed_steps: failed, steps };
    }
    return { partial: false, steps };
  });

  // Shell handler: registered ONLY when GBRAIN_ALLOW_SHELL_JOBS=1 is set on the
  // worker process. Default-closed; opt-in per-host. Without the flag, shell
  // jobs submitted via CLI insert rows but no worker claims them (they sit in
  // 'waiting' — the CLI prints a starvation warning for that case).
  if (process.env.GBRAIN_ALLOW_SHELL_JOBS === '1') {
    const { shellHandler } = await import('../core/minions/handlers/shell.ts');
    worker.register('shell', shellHandler);
    process.stderr.write('[minion worker] shell handler enabled (GBRAIN_ALLOW_SHELL_JOBS=1)\n');
  } else {
    process.stderr.write('[minion worker] shell handler disabled (set GBRAIN_ALLOW_SHELL_JOBS=1 to enable)\n');
  }
}
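
The error-isolation pattern used by the `autopilot-cycle` handler can be reduced to a generic helper: run each step in its own try/catch, record failures instead of throwing, and yield to the event loop between steps. This is a sketch under assumptions, not a proposed refactor. `runSteps`, `Step`, and `CycleResult` are hypothetical names, and `setTimeout(..., 0)` stands in for the handler's `setImmediate` so the snippet does not depend on Node-specific typings.

```typescript
interface Step { name: string; run: () => Promise<unknown> }

interface CycleResult {
  partial: boolean;
  failed_steps?: string[];
  steps: Record<string, unknown>;
}

// Give pending timers (such as a lock-renewal callback) a chance to fire.
const yieldToLoop = () => new Promise<void>((resolve) => setTimeout(resolve, 0));

async function runSteps(stepsToRun: Step[]): Promise<CycleResult> {
  const steps: Record<string, unknown> = {};
  const failed: string[] = [];
  let first = true;
  for (const step of stepsToRun) {
    if (!first) await yieldToLoop(); // yield only in the gaps between steps
    first = false;
    try {
      steps[step.name] = await step.run();
    } catch (e) {
      // Record the failure and keep going; a thrown error here would make
      // the whole job retry instead of reporting a partial result.
      steps[step.name] = { error: e instanceof Error ? e.message : String(e) };
      failed.push(step.name);
    }
  }
  return failed.length > 0
    ? { partial: true, failed_steps: failed, steps }
    : { partial: false, steps };
}
```

Returning a partial result rather than throwing matches the reasoning in the handler comment: one intermittently failing step should not block every later cycle through retries.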