gbrain/test/extract.test.ts
Garry Tan c22ca84772 feat: v0.13 frontmatter relationship indexing — YAML becomes typed graph edges (#231)
* feat(schema): links provenance + engine plumbing (v0.13)

Adds link_source, origin_page_id, origin_field columns with
UNIQUE NULLS NOT DISTINCT constraint + CHECK constraint. New indexes
on link_source + origin_page_id.

migrate.ts v11 handles idempotent upgrade path for existing brains.
Both engines: addLink/addLinksBatch threads new columns (4→7 col
unnest). removeLink gains linkSource filter. getLinks/getBacklinks
return new columns.

New engine method findByTitleFuzzy(name, dirPrefix?, minSim?) uses
pg_trgm % operator + similarity(). Drives the v0.13 resolver's
fuzzy-match step with zero LLM/embedding cost.

* feat(graph): frontmatter edge extraction + slug resolver (v0.13)

Canonical FRONTMATTER_LINK_MAP: field → type + direction + dir-hint
for 10 frontmatter patterns (company/companies, key_people, investors,
attendees, partner, lead, founded, sources, source, related/see_also).

Direction semantics: "incoming" means resolved value is the FROM side
so subject-of-verb reads naturally (pedro → meeting, not backwards).
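
A minimal sketch of the direction rule, for illustration only. The `Direction` and `Edge` types and the `orientEdge` helper are hypothetical names, not the definitions in link-extraction.ts; the slugs and link types are taken from the test file.

```typescript
// Hypothetical types for illustration; not the real gbrain definitions.
type Direction = 'outgoing' | 'incoming';

interface Edge {
  fromSlug: string;
  toSlug: string;
  linkType: string;
}

// Orient an edge between the page being written and a resolved frontmatter value.
// "incoming" puts the resolved value on the FROM side, so the edge reads
// subject -> object (investor -> deal, attendee -> meeting).
function orientEdge(
  pageSlug: string,
  resolvedSlug: string,
  linkType: string,
  direction: Direction,
): Edge {
  return direction === 'incoming'
    ? { fromSlug: resolvedSlug, toSlug: pageSlug, linkType }
    : { fromSlug: pageSlug, toSlug: resolvedSlug, linkType };
}

// investors on a deal page: the investor is the subject of the verb.
const investor = orientEdge('deals/seed', 'companies/yc', 'invested_in', 'incoming');
// company on a person page: the person is the subject.
const employer = orientEdge('people/test', 'companies/brex', 'works_at', 'outgoing');
```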

makeResolver(engine, {mode}) — two-mode resolver:
  batch (migration): slug → dir-hint → pg_trgm. NEVER hits search.
  live (put_page):   + optional search fallback with expand=false
                     (dodges hidden Haiku per operations-query learning).
Per-run cache: same name → single DB lookup.

extractFrontmatterLinks handles arrays-of-objects (investors:
[{name: 'Sequoia', role: 'lead'}]), skips bad types silently,
tracks unresolved names for the summary report.
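
One way the value handling described above could look, as a sketch under assumptions: `namesFromField` is a hypothetical helper, and reading the `name` key from object entries is inferred from the investors example rather than confirmed against the real extractor.

```typescript
// Hypothetical helper: collect candidate names from one frontmatter value.
// Accepts a bare string, an array of strings, or an array of objects that
// carry a string `name` key. Anything else is skipped without throwing.
function namesFromField(value: unknown): string[] {
  const items: unknown[] = Array.isArray(value) ? value : [value];
  const names: string[] = [];
  for (const item of items) {
    if (typeof item === 'string') {
      const trimmed = item.trim();
      if (trimmed !== '') names.push(trimmed);
    } else if (typeof item === 'object' && item !== null && !Array.isArray(item)) {
      const name = (item as { name?: unknown }).name;
      if (typeof name === 'string' && name.trim() !== '') names.push(name.trim());
    }
    // Numbers, booleans, null, undefined, nested arrays: skipped silently.
  }
  return names;
}

const investors = namesFromField([{ name: 'Sequoia', role: 'lead' }, 'yc', 42]);
// investors is ['Sequoia', 'yc']; the number 42 is dropped.
```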

extractPageLinks is now async. LinkCandidate gains fromSlug,
linkSource, originSlug, originField. Returns {candidates, unresolved}.

22 new tests: field-map coverage, direction semantics, source vs
sources, resolver fallback chain (batch + live), cache hit, bad
types skipped, context enrichment, FRONTMATTER_LINK_MAP integrity.

* feat(auto-link): bidirectional reconciliation + unresolved response

put_page auto-link post-hook now handles incoming-direction frontmatter
edges. Reconciliation splits candidates into out (fromSlug === slug)
and in (fromSlug !== slug — frontmatter fields like key_people on a
company page emit person → company edges).

Safe reconciliation via origin_page_id scoping: we only touch
link_source='frontmatter' edges where origin_slug = the page being
written. Markdown + manual edges survive untouched. Edges created
by OTHER pages' frontmatter also survive.
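
The scoping rule can be sketched as a pure planning step. This is an illustration, not the put_page hook itself: `planReconcile`, the `Candidate` and `StoredEdge` shapes, and the sample link types are hypothetical, and all candidates are assumed to come from the written page's frontmatter.

```typescript
// Hypothetical shapes for illustration.
interface Candidate {
  fromSlug: string;
  toSlug: string;
  linkType: string;
}

interface StoredEdge extends Candidate {
  linkSource: 'markdown' | 'frontmatter' | 'manual';
  originSlug: string | null;
}

const keyOf = (e: Candidate): string => JSON.stringify([e.fromSlug, e.toSlug, e.linkType]);

function planReconcile(pageSlug: string, candidates: Candidate[], existing: StoredEdge[]) {
  // Split by which side the written page sits on.
  const out = candidates.filter(c => c.fromSlug === pageSlug);
  const incoming = candidates.filter(c => c.fromSlug !== pageSlug);

  // Only frontmatter edges that originated from this page are in scope.
  const owned = existing.filter(
    e => e.linkSource === 'frontmatter' && e.originSlug === pageSlug,
  );
  const wanted = new Set(candidates.map(keyOf));
  const have = new Set(owned.map(keyOf));

  return {
    out,
    incoming,
    toAdd: candidates.filter(c => !have.has(keyOf(c))),
    toRemove: owned.filter(e => !wanted.has(keyOf(e))),
  };
}

const page = 'companies/acme';
const plan = planReconcile(
  page,
  [
    { fromSlug: 'people/alice', toSlug: page, linkType: 'works_at' },
    { fromSlug: page, toSlug: 'companies/fund-a', linkType: 'related' },
  ],
  [
    // Markdown edge: never touched.
    { fromSlug: page, toSlug: 'people/bob', linkType: 'mentions', linkSource: 'markdown', originSlug: null },
    // Frontmatter edge owned by another page: never touched.
    { fromSlug: 'people/carol', toSlug: page, linkType: 'works_at', linkSource: 'frontmatter', originSlug: 'people/carol' },
    // Stale frontmatter edge owned by this page: removed.
    { fromSlug: 'people/dave', toSlug: page, linkType: 'works_at', linkSource: 'frontmatter', originSlug: page },
    // Current frontmatter edge owned by this page: kept, not re-added.
    { fromSlug: 'people/alice', toSlug: page, linkType: 'works_at', linkSource: 'frontmatter', originSlug: page },
  ],
);
```

In this run only the people/dave edge lands in `toRemove` and only the fund-a edge lands in `toAdd`.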

put_page response extends auto_links with unresolved: Array<{field,
name}>. Agents writing attendees: [Pedro, Alex] where Alex doesn't
resolve see it in the response and can queue for enrichment.
Additive — existing agents unaffected.

extract.ts: delete the local 5-field extractFrontmatterLinks + local
inferLinkType. FS-source now calls canonical link-extraction.ts via
a synthetic resolver backed by the allSlugs Set. --include-frontmatter
flag (default OFF in v0.13 for back-compat; migration explicitly
enables for the one-time backfill). Top-20 unresolved names summary
when active.

* feat(migration): v0.13.0 orchestrator

3-phase orchestrator (schema → backfill → verify, then record) follows
the v0_12_2.ts pattern. Phase A triggers migrate.ts v11 via
gbrain init --migrate-only. Phase B runs:

  gbrain extract links --source db --include-frontmatter

to backfill frontmatter edges for every existing page. Uses the
batch-mode resolver (pg_trgm only, no LLM calls, zero API cost).
Ignores auto_link=false config — migration is canonical, the
auto_link flag controls per-write post-hook not one-time schema
work.

Idempotent + resumable via ON CONFLICT DO NOTHING + origin_page_id
scoping. Wall-clock budget: 2-5 min on 46K-page brains.

Registered in migrations/index.ts. apply-migrations test updated
to include v0.13.0 in skippedFuture for older installed versions.

* feat(release): upgrade-errors.jsonl trail + doctor surfacing

upgrade.ts catches post-upgrade subprocess failures as best-effort
today (line 65 comment: "post-upgrade is best-effort, don't fail
the upgrade"). When that chain silently fails, users end up with
half-upgraded brains and no signal.

v0.13: on post-upgrade failure, append a structured record to
~/.gbrain/upgrade-errors.jsonl with ts, phase, versions, error
message, and a paste-ready recovery hint.

doctor.ts reads the jsonl and surfaces the latest entry with a
warn-status check. User runs gbrain doctor, sees exactly what
failed, pastes the recovery command, files an issue if needed.
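
A sketch of the record format and the doctor-side read, kept free of file I/O so it stays self-contained. The field names (`fromVersion`, `toVersion`, `hint`) and both helpers are assumptions made for illustration; the real schema in upgrade.ts may differ.

```typescript
// Hypothetical record shape; field names are assumptions, not the real schema.
interface UpgradeErrorRecord {
  ts: string;
  phase: string;
  fromVersion: string;
  toVersion: string;
  error: string;
  hint: string; // paste-ready recovery command
}

// One record per line. JSON.stringify escapes embedded newlines, so a
// multi-line error message still occupies a single JSONL line.
function toJsonlLine(rec: UpgradeErrorRecord): string {
  return JSON.stringify(rec) + '\n';
}

// Return the most recent parseable entry, skipping blank or truncated lines.
function latestEntry(jsonl: string): UpgradeErrorRecord | null {
  const lines = jsonl.split('\n').filter(l => l.trim() !== '');
  for (const line of lines.reverse()) {
    try {
      return JSON.parse(line) as UpgradeErrorRecord;
    } catch {
      // A partially written line should not hide earlier entries.
    }
  }
  return null;
}

const log =
  toJsonlLine({
    ts: '2026-04-19T10:00:00Z',
    phase: 'post-upgrade',
    fromVersion: '0.12.2',
    toVersion: '0.13.0',
    error: 'subprocess exited 1\nstack line',
    hint: 'gbrain init --migrate-only',
  }) +
  toJsonlLine({
    ts: '2026-04-20T07:00:00Z',
    phase: 'backfill',
    fromVersion: '0.12.2',
    toVersion: '0.13.0',
    error: 'connection reset',
    hint: 'gbrain extract links --source db --include-frontmatter',
  });

const latest = latestEntry(log);
```

Appending each line to the file (for example with `fs.appendFileSync`) keeps earlier failures intact, which is why an append-only JSONL trail suits this use.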

Applies to every future release — doctor grows with the codebase
without per-release edits. The CHANGELOG pattern ("To take advantage
of v[version]" block) mirrors this in user-facing form.

* chore: bump version and changelog (v0.13.0)

v0.13.0 — Frontmatter Relationship Indexing.

Adds the "To take advantage of v[version]" block pattern to
CHANGELOG format (CLAUDE.md documents the requirement going
forward). Pairs with the upgrade-errors.jsonl + doctor surfacing
to close the "half-upgraded brain, no signal" loop.

UPGRADING_DOWNSTREAM_AGENTS.md gets a v0.13 section: no-action-
required verdict for most skills, optional diffs for meeting-
ingestion / enrich / idea-ingest if they want to consume
auto_links.unresolved.

skills/migrations/v0.13.0.md is the user-facing upgrade skill.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(v0.13): adversarial review P0s

Codex + Claude adversarial review caught 4 critical issues in the
v0.13 implementation. Fixing before ship.

1. findByTitleFuzzy SET LOCAL was a no-op. postgres.js auto-commits
   each sql`` so SET LOCAL pg_trgm.similarity_threshold committed
   before the `%` operator ran against it. Resolver used server
   default (0.3, not 0.55) → way too many fuzzy matches, wrong
   links on a 46K-page brain. Switched to inline
   `similarity(title, $1) >= $N` which has no transaction scoping.
   Added `ORDER BY sim DESC, slug ASC` for deterministic
   tie-breaking (prevents reconciliation churn on re-runs).

2. v11 migration now checks Postgres ≥ 15 before applying
   UNIQUE NULLS NOT DISTINCT. Old Supabase projects on PG14 would
   have dropped the old unique constraint and failed to add the
   new one, corrupting the uniqueness invariant. The check raises
   a clear error with the actual PG version, leaving the old
   constraint in place.

3. v11 migration now backfills NULL link_source → 'markdown' for
   pre-v0.13 legacy rows. Without this, reconciliation's existKey
   comparison treats NULL and 'markdown' as equivalent but the
   unique constraint sees them as distinct (NULLS NOT DISTINCT
   only collapses NULL with NULL, not NULL with 'markdown'). Result
   was duplicate edges accumulating forever. Treating legacy as
   markdown is the accurate best-guess — pre-v0.13 auto-link only
   emitted markdown edges.

4. v0_13_0.ts orchestrator now uses process.execPath, not a bare
   `gbrain` on PATH. After `gbrain upgrade` rewrites the binary,
   alias shadowing / PATH caching / multiple installs could
   resolve a stale `gbrain` binary. process.execPath is always
   the binary that loaded this migration module.
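
Item 1's threshold and ordering rule, mirrored in application code as a sketch. This is not the SQL that findByTitleFuzzy runs: `FuzzyHit` and `pickBest` are hypothetical, and the similarity scores are assumed to be precomputed. The 0.55 default is the intended threshold named above.

```typescript
// Hypothetical row shape: a candidate slug with its precomputed similarity.
interface FuzzyHit {
  slug: string;
  sim: number;
}

// Apply an explicit threshold (no reliance on session state), then order by
// sim DESC, slug ASC so equal scores always resolve to the same slug.
function pickBest(hits: FuzzyHit[], minSim = 0.55): FuzzyHit | null {
  const ranked = hits
    .filter(h => h.sim >= minSim)
    .sort((a, b) => b.sim - a.sim || (a.slug < b.slug ? -1 : a.slug > b.slug ? 1 : 0));
  return ranked[0] ?? null;
}

const hits: FuzzyHit[] = [
  { slug: 'people/alice-b', sim: 0.8 },
  { slug: 'people/alice-a', sim: 0.8 },
  { slug: 'people/alicia', sim: 0.4 },
];

const first = pickBest(hits);
const second = pickBest([...hits].reverse());
// Both runs pick people/alice-a regardless of input order.
```

Without the slug tie-break, two runs over the same data could pick different targets for equal scores, and reconciliation would delete and recreate edges on every run.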

Phase C verify clarified: reports page + link counts and points to
Phase B's own stdout as the authoritative signal for backfill
results (extract.ts already prints `Links: created N from M pages`).

* docs: scrub real names from public docs + add privacy rule to CLAUDE.md

Public artifacts (CHANGELOG, skills, docs) should never reveal real
contacts, companies, funds, or private agent-fork names from any
user's brain. When a doc copies a query like `gbrain graph diana-hu`
or names a fork like `Wintermute`, that real name gets indexed,
cross-referenced, and distributed with every release.

CLAUDE.md gains a "Privacy rule: scrub real names from public docs"
section with:
- What counts as public (CHANGELOG, README, docs/, skills/, PR bodies,
  commit messages, code comments)
- Name mapping table (agent forks → your agent fork; example person →
  alice-example; example fund → fund-a; etc.)
- Distinction between illustrative API examples with household brands
  (Stripe, Brex) and queries that reveal real relationships

Applied the rule to v0.13 scope:
- CHANGELOG v0.13 entry: Pedro/Diana/Wintermute/Sequoia/Benchmark/a16z
  all replaced with alice/charlie/fund-a/acme/agent-fork placeholders
- skills/migrations/v0.13.0.md: same
- docs/UPGRADING_DOWNSTREAM_AGENTS.md: Wintermute references scrubbed
  throughout (pre-v0.13 and v0.13 sections)
- CLAUDE.md: "Brain skills (from Wintermute)" → "(ported from an
  upstream agent fork)", internal Wintermute provenance notes
  genericized, "Garry finds fragile upgrade paths" → "the gbrain
  maintainers find fragile upgrade paths" in the template

Pre-v0.13 historical CHANGELOG entries (v0.10-v0.12) left alone —
those are shipped releases; rewriting changes public history.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 07:05:27 +08:00

145 lines
6.3 KiB
TypeScript
import { describe, it, expect } from 'bun:test';
import {
  extractMarkdownLinks,
  extractLinksFromFile,
  extractTimelineFromContent,
  walkMarkdownFiles,
} from '../src/commands/extract.ts';
describe('extractMarkdownLinks', () => {
  it('extracts relative markdown links', () => {
    const content = 'Check [Pedro](../people/pedro-franceschi.md) and [Brex](../../companies/brex.md).';
    const links = extractMarkdownLinks(content);
    expect(links).toHaveLength(2);
    expect(links[0].name).toBe('Pedro');
    expect(links[0].relTarget).toBe('../people/pedro-franceschi.md');
  });
  it('skips external URLs ending in .md', () => {
    const content = 'See [readme](https://example.com/readme.md) for details.';
    const links = extractMarkdownLinks(content);
    expect(links).toHaveLength(0);
  });
  it('handles links with no matches', () => {
    const content = 'No links here.';
    expect(extractMarkdownLinks(content)).toHaveLength(0);
  });
  it('extracts multiple links from same line', () => {
    const content = '[A](a.md) and [B](b.md)';
    expect(extractMarkdownLinks(content)).toHaveLength(2);
  });
});
describe('extractLinksFromFile', () => {
  it('resolves relative paths to slugs', async () => {
    const content = '---\ntitle: Test\n---\nSee [Pedro](../people/pedro.md).';
    const allSlugs = new Set(['people/pedro', 'deals/test-deal']);
    const links = await extractLinksFromFile(content, 'deals/test-deal.md', allSlugs);
    expect(links.length).toBeGreaterThanOrEqual(1);
    expect(links[0].from_slug).toBe('deals/test-deal');
    expect(links[0].to_slug).toBe('people/pedro');
  });
  it('skips links to non-existent pages', async () => {
    const content = 'See [Ghost](../people/ghost.md).';
    const allSlugs = new Set(['deals/test']);
    const links = await extractLinksFromFile(content, 'deals/test.md', allSlugs);
    expect(links).toHaveLength(0);
  });
  it('extracts frontmatter company links (v0.13, includeFrontmatter opt-in)', async () => {
    const content = '---\ncompany: brex\ntype: person\n---\nContent.';
    // v0.13 canonical: person page with company: X → person → company works_at (outgoing).
    // Resolver needs companies/brex to exist in allSlugs to emit the edge.
    const allSlugs = new Set(['people/test', 'companies/brex']);
    const links = await extractLinksFromFile(content, 'people/test.md', allSlugs, { includeFrontmatter: true });
    const companyLinks = links.filter(l => l.link_type === 'works_at');
    expect(companyLinks.length).toBeGreaterThanOrEqual(1);
    expect(companyLinks[0].from_slug).toBe('people/test');
    expect(companyLinks[0].to_slug).toBe('companies/brex');
  });
  it('extracts frontmatter investors array (v0.13: incoming direction)', async () => {
    // v0.13: deal page with investors:[yc, threshold] emits INCOMING edges:
    // companies/yc → deals/seed invested_in and same for threshold.
    const content = '---\ninvestors: [yc, threshold]\ntype: deal\n---\nContent.';
    const allSlugs = new Set(['deals/seed', 'companies/yc', 'companies/threshold']);
    const links = await extractLinksFromFile(content, 'deals/seed.md', allSlugs, { includeFrontmatter: true });
    const investorLinks = links.filter(l => l.link_type === 'invested_in');
    expect(investorLinks).toHaveLength(2);
    // Incoming: from = resolved investor, to = deal page.
    for (const l of investorLinks) {
      expect(l.to_slug).toBe('deals/seed');
      expect(l.from_slug).toMatch(/^companies\/(yc|threshold)$/);
    }
  });
  it('frontmatter extraction is default OFF (back-compat)', async () => {
    // Without includeFrontmatter, fs-source no longer auto-extracts frontmatter.
    // Matches db-source behavior. User opts in with --include-frontmatter flag.
    const content = '---\ncompany: brex\ntype: person\n---\nContent.';
    const allSlugs = new Set(['people/test', 'companies/brex']);
    const links = await extractLinksFromFile(content, 'people/test.md', allSlugs);
    expect(links).toEqual([]);
  });
  it('infers link type from directory structure', async () => {
    const content = 'See [Brex](../companies/brex.md).';
    const allSlugs = new Set(['people/pedro', 'companies/brex']);
    const links = await extractLinksFromFile(content, 'people/pedro.md', allSlugs);
    expect(links[0].link_type).toBe('works_at');
  });
  it('infers deal_for type for deals -> companies', async () => {
    const content = 'See [Brex](../companies/brex.md).';
    const allSlugs = new Set(['deals/seed', 'companies/brex']);
    const links = await extractLinksFromFile(content, 'deals/seed.md', allSlugs);
    expect(links[0].link_type).toBe('deal_for');
  });
});
describe('extractTimelineFromContent', () => {
  it('extracts bullet format entries', () => {
    const content = `## Timeline\n- **2025-03-18** | Meeting — Discussed partnership`;
    const entries = extractTimelineFromContent(content, 'people/test');
    expect(entries).toHaveLength(1);
    expect(entries[0].date).toBe('2025-03-18');
    expect(entries[0].source).toBe('Meeting');
    expect(entries[0].summary).toBe('Discussed partnership');
  });
  it('extracts header format entries', () => {
    const content = `### 2025-03-28 — Round Closed\n\nAll docs signed. Marcus joins the board.`;
    const entries = extractTimelineFromContent(content, 'deals/seed');
    expect(entries).toHaveLength(1);
    expect(entries[0].date).toBe('2025-03-28');
    expect(entries[0].summary).toBe('Round Closed');
    expect(entries[0].detail).toContain('Marcus joins the board');
  });
  it('returns empty for no timeline content', () => {
    const content = 'Just plain text without dates.';
    expect(extractTimelineFromContent(content, 'test')).toHaveLength(0);
  });
  it('extracts multiple bullet entries', () => {
    const content = `- **2025-01-01** | Source1 — Summary1\n- **2025-02-01** | Source2 — Summary2`;
    const entries = extractTimelineFromContent(content, 'test');
    expect(entries).toHaveLength(2);
  });
  it('handles em dash and en dash in bullet format', () => {
    // En dash (U+2013) separator; the em dash form is covered by the tests above.
    const content = `- **2025-03-18** | Meeting – Discussed partnership`;
    const entries = extractTimelineFromContent(content, 'test');
    expect(entries).toHaveLength(1);
  });
});
describe('walkMarkdownFiles', () => {
  it('is a function', () => {
    expect(typeof walkMarkdownFiles).toBe('function');
  });
});