gbrain/test/fail-improve.test.ts
Garry Tan e5a9f0126a feat: GStackBrain — 16 new skills, resolver, conventions, identity layer (v0.10.0) (#120)
* feat: migrate 8 existing skills to conformance format

Add YAML frontmatter (name, version, description, triggers, tools, mutating),
Contract, Anti-Patterns, and Output Format sections to all existing skills.
Rename Workflow to Phases. Ingest becomes thin router delegating to specialized
ingestion skills (Phase 2).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add RESOLVER.md, conventions directory, and output rules

RESOLVER.md is the skill dispatcher modeled on Wintermute's AGENTS.md.
Categorized routing table: Always-on, Brain ops, Ingestion, Thinking,
Operational, Setup, Identity. Conventions directory extracts cross-cutting
rules (quality, brain-first lookup, model routing, test-before-bulk).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: add skills conformance and resolver validation tests

skills-conformance.test.ts validates every skill has YAML frontmatter with
required fields, Contract, Anti-Patterns, and Output Format sections, and
manifest.json coverage. resolver.test.ts validates routing table categories,
skill path existence, and manifest-to-resolver coverage. 50 new tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add 9 brain skills from Wintermute (Phase 2)

Generalized from Wintermute's battle-tested skills:
- signal-detector: always-on idea+entity capture on every message
- brain-ops: brain-first lookup, read-enrich-write loop, source attribution
- idea-ingest: links/articles/tweets with author people page mandatory
- media-ingest: video/audio/PDF/book with entity extraction (absorbs video/youtube/book)
- meeting-ingestion: transcripts with attendee enrichment chaining
- citation-fixer: audit and fix citation formatting
- repo-architecture: filing rules by primary subject
- skill-creator: create skills with conformance standard + MECE check
- daily-task-manager: task lifecycle with priority levels

All Garry-specific references generalized. Core workflows preserved.
Updated RESOLVER.md and manifest.json.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add operational infrastructure + identity layer (Phase 3)

Operational skills:
- daily-task-prep: morning prep with calendar context and open threads
- cross-modal-review: quality gate via second model with refusal routing
- cron-scheduler: schedule staggering, quiet hours, wake-up override, idempotency
- reports: timestamped reports with keyword routing
- testing: skill validation framework (conformance checks)
- soul-audit: 6-phase interview generating SOUL.md, USER.md, ACCESS_POLICY.md, HEARTBEAT.md
- webhook-transforms: external events to brain signals with dead-letter queue

Identity layer:
- SOUL.md template (agent identity, generated by soul-audit)
- USER.md template (user profile, generated by soul-audit)
- ACCESS_POLICY.md template (4-tier access control)
- HEARTBEAT.md template (operational cadence)
- cross-modal.yaml convention (review pairs, refusal routing chain)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: update CLAUDE.md with 24 skills, RESOLVER.md, conventions, templates

GBrain is now a GStack mod for agent platforms. Updated architecture description,
key files listing (16 new skill files, RESOLVER.md, conventions, templates), skills
section (24 skills organized by resolver categories), and testing section (new
conformance and resolver tests).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add GStack detection + mod status to gbrain init (Phase 4)

After brain initialization, gbrain init now reports:
- Number of skills loaded (from manifest.json)
- GStack detection (checks known host paths, uses gstack-global-discover if available)
- GStack install instructions if not found
- Resolver and soul-audit pointers

Also adds installDefaultTemplates() for SOUL.md/USER.md/ACCESS_POLICY.md/HEARTBEAT.md
deployment, and detectGStack() using gstack-global-discover with fallback to known paths
(DRY: doesn't reimplement GStack's host detection logic).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: v0.10.0 release documentation

- CHANGELOG: 24 skills, signal detector, RESOLVER.md, soul-audit, access control,
  conventions, conformance standard, GStack detection in init
- README: updated skill section with 24 skills, resolver, conventions
- TODOS: added runtime MCP access control (P1)
- VERSION: 0.9.2 → 0.10.0
- package.json + manifest.json version bumped

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add skill table to CHANGELOG v0.10.0

16-row table detailing every new skill, what it does, and why it matters.
Written to sell the upgrade, not document the implementation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: restore package.json version after merge conflict resolution

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: zero-based README rewrite for GStackBrain v0.10.0

Lead with GStack mod identity. 24 skills table organized by category.
Install block references RESOLVER.md and soul-audit. GBrain+GStack
relationship explained. Removed redundancy (733 -> 406 lines).
All essential content preserved: install, recipes, architecture,
search, commands, engines, voice, knowledge model.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: extract install block to INSTALL_FOR_AGENTS.md, simplify README

The 30-line copy-paste install block becomes one line:
"Retrieve and follow INSTALL_FOR_AGENTS.md"

Benefits: agent always gets latest instructions (no stale copy-paste),
README stays clean, install details live where agents read them.

README now leads with what GBrain does ("gives your agent a brain")
instead of GStack relationship. Removed "requires frontier model" note.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: 3 bugs in init.ts from merge conflict resolution

1. llstatSync typo (merge corruption) → lstatSync
2. __dirname undefined in ESM module → fileURLToPath polyfill
3. require('fs') in ESM → use imported readFileSync

All three would crash gbrain init at runtime. Caught by /review.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add checkResolvable shared core function for resolver validation

Shared function at src/core/check-resolvable.ts validates that all skills
are reachable from RESOLVER.md, detects MECE overlaps (with whitelist for
always-on/router skills), finds gaps in frontmatter triggers, and scans
for DRY violations. Returns structured ResolvableIssue objects with
machine-parseable fix objects alongside human-readable action strings.

Three call sites: bun test, gbrain doctor, skill-creator skill.

Cleans up test/resolver.test.ts: removes stale 9-line skip list, imports
from production check-resolvable.ts instead of reimplementing parsing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: expand doctor with resolver validation, filesystem-first architecture

Doctor now runs filesystem checks (resolver health, skill conformance) before
connecting to DB. New --fast flag skips DB checks. Falls back to filesystem-only
when DB is unavailable. Adds schema_version: 2 to JSON output, composite health
score (0-100), and structured issues array with action strings for agent parsing.

Resolver health check calls checkResolvable() and surfaces actionable fix
instructions. Link integrity check uses engine.getHealth() dead_links count.

CLI routing split: doctor dispatched before connectEngine() so filesystem
checks always run. Fixes Codex-identified blocker where doctor required DB.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add adaptive load-aware throttling and fail-improve loop

backoff.ts: System load checking (CPU via os.loadavg, memory via os.freemem),
exponential backoff with 20-attempt max guard, active hours multiplier (2x
slower during waking hours), concurrent process limit (max 2). Windows-safe:
defaults to "proceed" when os.loadavg returns zeros.
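
The throttling rules above can be sketched roughly as follows. This is an illustration under stated assumptions, not the contents of backoff.ts: the function names `shouldProceed` and `nextDelayMs`, the 1-second base delay, the 60-second cap, and the 0.8 per-core load threshold are all hypothetical. Only the 20-attempt guard, the 2x active-hours multiplier, and the proceed-on-zero-loadavg behavior come from the commit message.

```typescript
// Hypothetical sketch of load-aware backoff; names and constants are assumptions.
const MAX_ATTEMPTS = 20;     // from the commit message: 20-attempt max guard
const BASE_DELAY_MS = 1000;  // assumed base delay
const MAX_DELAY_MS = 60000;  // assumed cap on the exponential part

/** Decide whether to proceed, given os.loadavg()-style values and a core count. */
function shouldProceed(loadavg: number[], cpuCount: number, threshold = 0.8): boolean {
  // os.loadavg() returns [0, 0, 0] on Windows, so all-zero means "proceed".
  if (loadavg.every((v) => v === 0)) return true;
  // Compare the 1-minute load average per core against the threshold.
  return loadavg[0] / Math.max(cpuCount, 1) < threshold;
}

/** Exponential delay for a zero-based attempt, doubled during active hours. */
function nextDelayMs(attempt: number, activeHours: boolean): number {
  if (attempt >= MAX_ATTEMPTS) {
    throw new Error(`backoff: gave up after ${MAX_ATTEMPTS} attempts`);
  }
  const exponential = Math.min(BASE_DELAY_MS * 2 ** attempt, MAX_DELAY_MS);
  return activeHours ? exponential * 2 : exponential;
}
```

A caller would typically pass `os.loadavg()` and `os.cpus().length` into `shouldProceed` and sleep for `nextDelayMs(attempt, activeHours)` whenever it returns false.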

fail-improve.ts: Deterministic-first, LLM-fallback pattern with JSONL failure
logging. Cascade failure handling: when both paths fail, throws LLM error and
logs both. Log rotation at 1000 entries. Call count tracking for deterministic
hit rate metrics. Auto-generates test cases from successful LLM fallbacks.
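
As a rough picture of the deterministic-first, LLM-fallback flow described above, here is a minimal sketch. It is not the code in fail-improve.ts: `executeWithFallback` and the in-memory `failureLog` are hypothetical stand-ins (the real loop writes JSONL files), and the flat `cascade_failure` field is a simplification.

```typescript
// Hypothetical sketch of the deterministic-first, LLM-fallback pattern.
// The in-memory array stands in for the JSONL failure log.
interface FailureEntry {
  operation: string;
  input: string;
  deterministic_result: null;
  llm_result: string;
  cascade_failure: boolean;
}

const failureLog: FailureEntry[] = [];

async function executeWithFallback(
  operation: string,
  input: string,
  deterministic: (input: string) => string | null,
  llmFallback: (input: string) => Promise<string>,
): Promise<string> {
  // Cheap path first: a non-null deterministic result is returned as-is.
  const direct = deterministic(input);
  if (direct !== null) return direct;

  try {
    const llmResult = await llmFallback(input);
    // Record the miss so a deterministic rule can be written for it later.
    failureLog.push({
      operation, input, deterministic_result: null,
      llm_result: llmResult, cascade_failure: false,
    });
    return llmResult;
  } catch (err) {
    // Cascade failure: log it, then rethrow the LLM error to the caller.
    const message = err instanceof Error ? err.message : String(err);
    failureLog.push({
      operation, input, deterministic_result: null,
      llm_result: `error: ${message}`, cascade_failure: true,
    });
    throw err;
  }
}
```

Logging only the misses keeps the log small and makes each entry a candidate for a new deterministic rule or a generated test case.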

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add transcription service and enrichment-as-a-service

transcription.ts: Groq Whisper (default) with OpenAI fallback. Files >25MB
segmented via ffmpeg. Provider auto-detection from env vars. Clear error
messages for missing API keys and unsupported formats.

enrichment-service.ts: Global enrichment service callable from any ingest
pathway. Entity slug generation (people/jane-doe, companies/acme-corp),
mention counting via searchKeyword, tier auto-escalation (Tier 3→2→1 based
on mention frequency and source diversity), batch enrichment with backoff
throttling, regex-based entity extraction from text.
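
The slug shapes mentioned above (people/jane-doe, companies/acme-corp) could be produced by a helper along these lines. `entitySlug` and `EntityKind` are hypothetical names, and the normalization rules are an assumption; enrichment-service.ts may handle punctuation and non-ASCII names differently.

```typescript
// Hypothetical helper; the real enrichment-service.ts may normalize differently.
type EntityKind = 'people' | 'companies';

function entitySlug(kind: EntityKind, name: string): string {
  const slug = name
    .normalize('NFKD')                // split accented letters into base + mark
    .replace(/[\u0300-\u036f]/g, '')  // drop the combining marks
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-')      // collapse non-alphanumerics to one hyphen
    .replace(/^-+|-+$/g, '');         // trim leading and trailing hyphens
  return `${kind}/${slug}`;
}
```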

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add data-research skill with recipe system, extraction, dedup, tracker

New skill: data-research — one parameterized pipeline for any email-to-
structured-data workflow (investor updates, donations, company metrics).
7-phase pipeline: define recipe, search, classify, extract (with extraction
integrity rule), archive, deduplicate, update tracker.

data-research.ts: Recipe validation, MRR/ARR/runway/headcount regex
extraction (battle-tested patterns), dedup with configurable tolerance,
markdown tracker parsing/appending, quarterly/monthly date windowing,
6-phase HTML email stripping with 500KB ReDoS cap.
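
To give a sense of what regex-based metric extraction looks like, here is a small sketch for MRR only. The commit does not show the actual patterns, so `extractMrr` and its regex are assumptions covering a few common spellings ("MRR: $120k", "MRR of $1.2M", "MRR $45,000"); spelled-out units such as "million" are deliberately not handled.

```typescript
// Hypothetical MRR extractor; the patterns in data-research.ts are not shown in the commit.
function extractMrr(text: string): number | null {
  // "MRR", up to 20 filler characters, optional "$", a number, optional k/M suffix.
  const match = /\bMRR\b[^\d$]{0,20}\$?\s*(\d[\d,]*(?:\.\d+)?)\s*([kKmM])?\b/.exec(text);
  if (!match) return null;
  const value = Number(match[1].replace(/,/g, ''));
  const suffix = (match[2] ?? '').toLowerCase();
  const multiplier = suffix === 'k' ? 1000 : suffix === 'm' ? 1000000 : 1;
  // Round to whole dollars to absorb floating-point noise from the multiply.
  return Math.round(value * multiplier);
}
```

Returning `null` rather than a guessed value when nothing matches is in the spirit of the extraction integrity rule: a missing number is better than an invented one.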

Registers data-research in manifest.json (25th skill) and RESOLVER.md.
Fixes backoff test robustness for high-load systems.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: update project documentation for v0.10.0 infrastructure additions

CLAUDE.md: added 6 new core files (check-resolvable, backoff, fail-improve,
transcription, enrichment-service, data-research), 6 new test files, updated
skill count to 25, test file count to 34.

README.md: updated skill count to 25, added data-research to skills table.

CHANGELOG.md: added Infrastructure section documenting resolver validation,
doctor expansion, adaptive throttling, fail-improve loop, voice transcription,
enrichment service, and data-research skill.

TODOS.md: anonymized personal references.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: doctor.ts use ES module imports, harden backoff test

Replace require('fs') with ES module import in doctor.ts for consistency
with the rest of the file. Backoff test made resilient to parallel test
execution leaking module-level state.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: README rewrite with production brain stats, sample output, new infrastructure

Lead with the flex: 17,888 pages, 4,383 people, 723 companies, 526 meeting
transcripts built in 12 days. Show sample query output so readers see what
they'll get. Document self-improving infrastructure (tier auto-escalation,
fail-improve loop, doctor trajectory). Add data-research recipes to Getting
Data In. Update commands section with doctor --fix, transcribe, research
init/list. Fix stale "24" references to "25".

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: README lead with YC President origin and production agent deployments

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: README lead with skill philosophy and link to Thin Harness Fat Skills

Skills section now explains: skill files are code, they encode entire
workflows, they call deterministic TypeScript for the parts that shouldn't
be LLM judgment. Links to the tweet and the architecture essay.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: link GStack repo, add 70K stars and 30K daily users

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: remove meeting transcript count from README (sensitive)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: rename political-donations recipe to expense-tracker (sensitivity)

Renamed the built-in data-research recipe from political-donations to
expense-tracker across README, CHANGELOG, SKILL.md, and reports routing.
Same extraction patterns (amounts, dates, recipients), neutral framing.
Also renamed social-radar keyword route to social-mentions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 19:41:34 -10:00


import { describe, test, expect, beforeEach, afterEach } from 'bun:test';
import { FailImproveLoop } from '../src/core/fail-improve.ts';
import { mkdtempSync, rmSync, existsSync, readFileSync } from 'fs';
import { join } from 'path';
import { tmpdir } from 'os';

describe('fail-improve', () => {
  let tempDir: string;
  let loop: FailImproveLoop;

  beforeEach(() => {
    // Each test gets a fresh temp dir so log files never leak between tests.
    tempDir = mkdtempSync(join(tmpdir(), 'gbrain-fail-improve-'));
    loop = new FailImproveLoop(tempDir);
  });

  afterEach(() => {
    // Remove this test's temp dir; an afterAll hook would only clean up the last one.
    try { rmSync(tempDir, { recursive: true, force: true }); } catch {}
  });

  test('execute returns deterministic result when it succeeds', async () => {
    const result = await loop.execute(
      'test_op',
      'hello',
      (input) => input.toUpperCase(),
      async () => 'llm-fallback',
    );
    expect(result).toBe('HELLO');
  });

  test('execute falls back to LLM when deterministic returns null', async () => {
    const result = await loop.execute(
      'test_op',
      'hello',
      () => null,
      async (input) => `llm: ${input}`,
    );
    expect(result).toBe('llm: hello');
  });

  test('execute logs failure when deterministic returns null', async () => {
    await loop.execute(
      'test_op',
      'test-input',
      () => null,
      async () => 'llm-result',
    );
    const failures = loop.getFailures('test_op');
    expect(failures.length).toBe(1);
    expect(failures[0].deterministic_result).toBeNull();
    expect(failures[0].llm_result).toContain('llm-result');
    expect(failures[0].input).toBe('test-input');
  });

  test('execute throws LLM error when both fail (cascade)', async () => {
    try {
      await loop.execute(
        'cascade_op',
        'input',
        () => null,
        async () => { throw new Error('LLM failed'); },
      );
      expect(true).toBe(false); // should not reach: execute must reject
    } catch (e: any) {
      expect(e.message).toBe('LLM failed');
    }
    // The cascade is logged as a single entry that records the LLM error
    const failures = loop.getFailures('cascade_op');
    expect(failures.length).toBe(1);
    expect(failures[0].llm_result).toContain('error: LLM failed');
    expect(failures[0].metadata?.cascade_failure).toBe(true);
  });

  test('logFailure creates JSONL file with valid entries', () => {
    loop.logFailure({
      timestamp: '2026-04-15T00:00:00Z',
      operation: 'jsonl_test',
      input: 'test input',
      deterministic_result: null,
      llm_result: 'llm output',
    });
    const filePath = join(tempDir, 'jsonl_test.jsonl');
    expect(existsSync(filePath)).toBe(true);
    const content = readFileSync(filePath, 'utf-8');
    const parsed = JSON.parse(content.trim());
    expect(parsed.operation).toBe('jsonl_test');
  });

  test('getFailures returns empty array for non-existent operation', () => {
    const failures = loop.getFailures('nonexistent');
    expect(failures).toEqual([]);
  });

  test('getFailuresByPattern groups by input prefix', async () => {
    const prefix = 'a]'.repeat(30); // 60 chars, only first 50 used as key
    loop.logFailure({ timestamp: 't1', operation: 'group_test', input: prefix + ' suffix1', deterministic_result: null, llm_result: 'a' });
    loop.logFailure({ timestamp: 't2', operation: 'group_test', input: prefix + ' suffix2', deterministic_result: null, llm_result: 'b' });
    loop.logFailure({ timestamp: 't3', operation: 'group_test', input: 'different input entirely', deterministic_result: null, llm_result: 'c' });
    const patterns = loop.getFailuresByPattern('group_test');
    expect(patterns.size).toBe(2); // same 50-char prefix groups together, "different" is separate
  });

  test('analyzeFailures computes metrics', async () => {
    // Run some executions
    await loop.execute('metrics_op', 'a', () => 'det', async () => 'llm');
    await loop.execute('metrics_op', 'b', () => null, async () => 'llm');
    await loop.execute('metrics_op', 'c', () => 'det', async () => 'llm');
    const analysis = loop.analyzeFailures('metrics_op');
    expect(analysis.operation).toBe('metrics_op');
    expect(analysis.total_calls).toBe(3);
    expect(analysis.deterministic_hits).toBe(2);
    expect(analysis.deterministic_rate).toBeCloseTo(2 / 3, 2);
    expect(analysis.total_failures).toBe(1); // one LLM fallback logged
  });

  test('generateTestCases creates tests from successful LLM fallbacks', async () => {
    await loop.execute('testgen_op', 'input-1', () => null, async () => 'expected-1');
    await loop.execute('testgen_op', 'input-2', () => null, async () => 'expected-2');
    const cases = loop.generateTestCases('testgen_op');
    expect(cases.length).toBe(2);
    expect(cases[0].input).toBe('input-1');
    expect(cases[0].source).toBe('fail-improve-loop');
  });

  test('generateTestCases excludes cascade failures', async () => {
    await loop.execute('excl_op', 'ok', () => null, async () => 'good');
    try {
      await loop.execute('excl_op', 'bad', () => null, async () => { throw new Error('boom'); });
    } catch {}
    const cases = loop.generateTestCases('excl_op');
    expect(cases.length).toBe(1); // only the successful fallback
  });

  test('logImprovement records improvement history', () => {
    loop.logImprovement('improve_op', 'Added regex for MRR format');
    loop.logImprovement('improve_op', 'Added regex for ARR format');
    const analysis = loop.analyzeFailures('improve_op');
    expect(analysis.total_improvements).toBe(2);
  });

  test('input is truncated to 1000 chars in log entries', async () => {
    const longInput = 'x'.repeat(5000);
    await loop.execute('trunc_op', longInput, () => null, async () => 'result');
    const failures = loop.getFailures('trunc_op');
    expect(failures[0].input.length).toBe(1000);
  });

  test('log rotation keeps last 1000 entries', () => {
    // Write 1010 entries, one second apart, so every timestamp is valid ISO 8601
    for (let i = 0; i < 1010; i++) {
      loop.logFailure({
        timestamp: new Date(Date.UTC(2026, 3, 15, 0, 0, i)).toISOString(),
        operation: 'rotation_test',
        input: `entry-${i}`,
        deterministic_result: null,
        llm_result: `result-${i}`,
      });
    }
    const failures = loop.getFailures('rotation_test');
    expect(failures.length).toBeLessThanOrEqual(1000);
    // Last entry should be preserved
    expect(failures[failures.length - 1].input).toBe('entry-1009');
  });
});
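
The getFailuresByPattern test above relies on failures being keyed by the first 50 characters of their input. A minimal sketch of that grouping, assuming a plain prefix slice (the helper name `groupByPrefix` is hypothetical and the real implementation may normalize inputs first):

```typescript
// Hypothetical sketch: bucket log entries by a fixed-length input prefix.
function groupByPrefix<T extends { input: string }>(
  entries: T[],
  prefixLength = 50,
): Map<string, T[]> {
  const groups = new Map<string, T[]>();
  for (const entry of entries) {
    // Inputs that share their first `prefixLength` characters share a bucket.
    const key = entry.input.slice(0, prefixLength);
    const bucket = groups.get(key);
    if (bucket) {
      bucket.push(entry);
    } else {
      groups.set(key, [entry]);
    }
  }
  return groups;
}
```

With the three inputs from the test (two sharing a 60-character prefix, one unrelated), this yields two buckets, matching the `patterns.size` assertion.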