feat: v0.18.0 — multi-source brains (one DB, many repos, federation + dotfile resolution) (#337)

* feat(v0.17.0 step 1/9): sources primitive — additive-only multi-source foundation

Lane A of the multi-repo plan. Installs the sources table and seeds a
'default' row that inherits sync.repo_path/last_commit from existing
config. This is the bisectable foundation every later step builds on;
the breaking schema changes (composite UNIQUE, files FK rewrite,
resolution_type, ingest_log.source_id) land with their paired code
rewrites in Steps 2/4/5/7 so no single commit breaks the engine.

- migration v16 (sources_table_additive) + v0_17_0 orchestrator skeleton
- sort-by-version guard in runMigrations (migrations are sorted by version
  before running, so array insertion order can never again cause a
  lower-versioned migration to be skipped)
- default source seeded with config '{"federated": true}' so pre-v0.17
  brains keep single-namespace search semantics after upgrade
- orchestrator phase B detects absence of file_migration_ledger and
  no-ops until Step 7 lands it
- 8 new structural tests in test/migrate.test.ts (shape, idempotency,
  scope-guard that nothing else was smuggled into v16)
- apply-migrations tests include v0.17.0 in the registered list

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(v0.17.0 step 2/9): pages.source_id + composite UNIQUE (Lane B)

Migration v17 adds pages.source_id with DEFAULT 'default' and swaps the
global UNIQUE(slug) for composite UNIQUE(source_id, slug). Ships atomically
with the engine's ON CONFLICT rewrite so the constraint swap and the code
that writes under it land in the same commit — no window where the engine
sees one shape and the schema has another.

Minimum-surface engine change: only putPage's ON CONFLICT target needs
re-targeting. Other slug-based queries work unchanged because single-
source brains (the only brain shape pre-Step-5) have exactly one source
'default', so slug remains effectively unique within it. Step 5+ will
surface an explicit sourceId param on putPage for cross-source sync.

- migration v17 (pages_source_id_composite_unique) in src/core/migrate.ts
- pages.source_id + composite UNIQUE added to schema.sql + pglite-schema.ts
  for fresh installs
- ON CONFLICT (slug) → ON CONFLICT (source_id, slug) in both pglite-engine
  and postgres-engine putPage
- DEFAULT 'default' closes the Codex-flagged race where an INSERT between
  ADD COLUMN and SET NOT NULL could leave source_id NULL
- 5 new v17 structural tests (29 pass / 0 fail in migrate.test.ts)
- Full suite: 1979 pass / 3 fail (same as baseline — no regressions)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(v0.17.0 step 6/9): sources CLI + source-resolver (Lane C)

Adds the CLI surface for multi-source management. Users can now register,
list, rename, federate/unfederate, and attach-to-directory a source. The
source-resolver is the shared 6-priority helper that Steps 4/5 will use
when they start surfacing an explicit --source flag on sync/extract/query.

Commands:
  gbrain sources add <id> --path <p> [--name <n>] [--federated|--no-federated]
  gbrain sources list [--json]
  gbrain sources remove <id> [--yes] [--dry-run] [--keep-storage]
  gbrain sources rename <id> <new-name>
  gbrain sources default <id>
  gbrain sources attach <id>   — writes .gbrain-source in CWD
  gbrain sources detach
  gbrain sources federate <id> / unfederate <id>

Resolution priority (source-resolver.ts) — highest first:
  1. --source flag  2. GBRAIN_SOURCE env  3. .gbrain-source dotfile walk-up
  4. longest-prefix match on registered local_path (Codex #2 fix)
  5. sources.default config  6. fallback 'default'

- add: validates id format (kebab-case alnum, 1-32), rejects overlapping
  paths (eng review §4 finding 4.1), supports federated default opt-in
- remove: guards against --yes omission + refuses to remove 'default',
  supports --dry-run, reports cascade page count
- attach/detach: matches kubectl/terraform context-pinning semantics
- Throws on overlap rather than process.exit() so the CLI error wrapper
  reports it consistently (also makes unit testing clean)
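
The id check described in the `add` bullet can be sketched as below. This is a minimal illustration, not the shipped code: the helper names `isValidSourceId` and `assertValidSourceId` are hypothetical, and only the pattern itself is quoted from this commit series.

```typescript
// Hypothetical helper names; only the id pattern is taken from the commit text.
// Pattern: lowercase alphanumerics with inner hyphens, 1-32 characters.
const SOURCE_ID_RE = /^[a-z0-9](?:[a-z0-9-]{0,30}[a-z0-9])?$/;

function isValidSourceId(id: string): boolean {
  return SOURCE_ID_RE.test(id);
}

// Throws instead of exiting the process, so a CLI error wrapper
// (or a unit test) can catch and report the failure.
function assertValidSourceId(id: string): string {
  if (!isValidSourceId(id)) {
    throw new Error(
      `Invalid source id "${id}": expected lowercase kebab-case alphanumerics, 1-32 chars`
    );
  }
  return id;
}
```

Throwing keeps the validation usable from tests without stubbing process exit, which matches the rationale given in the last bullet.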

28 new tests across sources.test.ts (dispatcher + validation + overlap
guard) and source-resolver.test.ts (full 6-priority coverage including
longest-prefix). Full suite: 2012 pass / 3 fail (pre-existing PGLite
infra timeouts).

NOT in scope for Step 6 (deferred):
  - import-from-github (SSRF + clone integration)
  - prune (retention/TTL, lands v0.18)
  - MCP tool-defs regen for source-scoping on read ops (Step 5)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(v0.17.0 step 8/9): getting-started guide + migration skill + citation rule

Step 8 (Lane F) documents what Steps 1+2+6 have shipped and sets up
the agent-facing rules for multi-source.

New files:
- skills/migrations/v0.17.0.md — migration skill read by host agents
  after `gbrain apply-migrations`. Covers the v16+v17 chain, what's
  in v0.17.0 vs what lands later (v0.17.1 ACL, v0.18 sessions), and
  the new sources CLI surface. Cites docs/guides/multi-source-brains.md
  as the recipe.
- docs/guides/multi-source-brains.md — getting-started for end users.
  Three canonical scenarios (unified wiki+gstack / purpose-separated
  yc-media+garrys-list / mixed), full resolution priority, federation
  flag semantics, command reference, and citation format.

skills/brain-ops/SKILL.md — new "Cross-source citation format"
section mandating `[source-id:slug]` when the brain has multiple
sources. Matches the contract the /plan-devex-review DX review
pinned down (DX Finding 5: surface source_id in every page payload
+ citation contract). Key must be sources.id (immutable), never
sources.name.

No behavior change — this is pure documentation for what already
exists in the binary. 144 skills conformance tests still pass.

NOT in this commit (deferred to later steps):
- docs/guides/repo-architecture.md rewrite (lands with the full
  v0.17.0 PR description + release notes)
- skills/_brain-filing-rules.md "which source to file into"
  guidance (lands with Step 5 when sync surfaces --source)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(v0.17.0 step 5/9): sync --source <id> routes through sources table (Lane D)

Adds the --source flag to `gbrain sync`. When set, sync reads local_path
+ last_commit from the matching sources(id) row instead of the global
sync.repo_path / sync.last_commit config keys, and writes last_commit +
last_sync_at back to the same row. Backward compat: --source omitted =
pre-v0.17 behavior exactly, global config path unchanged.

- SyncOpts.sourceId threaded through performSync + performFullSync
- readSyncAnchor/writeSyncAnchor helpers centralize the sources-vs-config
  branch so every read/write goes through one decision point. Makes
  Step 5's later per-source sync-failures tracking a one-file change.
- --source resolved via src/core/source-resolver.ts (Step 6), so any
  command that shell-exposes resolveSourceId gets env var + dotfile
  walk-up + longest-prefix for free.
- Error message for missing source local_path is actionable:
    Source "gstack" has no local_path. Run: gbrain sources add gstack --path <path>
- last_sync_at auto-updates on every last_commit advance so `gbrain
  sources list` shows real recency.

No regression: 2012 pass / 3 fail (same as baseline).

NOT in this commit (deferred per plan):
- Per-source failure tracking (~/.gbrain/sources/<id>/sync-failures.jsonl)
- runImport source-awareness (import.ts path — Step 5 continuation)
- Partial-success semantics when walking N sources — single-source flow
  today, multi-walk lands when the top-level `gbrain sync` without
  --source starts iterating all sources.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(v0.17.0 step 4/9): qualified [[source:slug]] + links.resolution_type (Lane B)

Adds source-pinned wikilink syntax and records the resolution kind on
each edge so `gbrain extract --refresh-unqualified` (future) can
re-resolve bare references when the source topology changes.

Wikilink syntax extension:
  [[concepts/ai]]             — unqualified; resolves via local-first fallback
  [[wiki:concepts/ai]]        — qualified; target pinned to sources.id='wiki'
  [[gstack:projects/foo|Display]]  — qualified + display name

The qualified regex runs first and masks matched spans so the
unqualified pass can't double-emit. Source id format enforced to match
the sources CLI validation: [a-z0-9](?:[a-z0-9-]{0,30}[a-z0-9])?
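
A rough sketch of the qualified-first, mask-then-unqualified pass described above. The `WikiRef` type, the function name, and the slug/display sub-patterns are assumptions made for illustration; only the source-id pattern and the three example links come from this message.

```typescript
// Hypothetical types and names; not the extractor shipped in this commit.
interface WikiRef {
  slug: string;
  sourceId?: string; // present only on qualified links
  display?: string;
  resolutionType: "qualified" | "unqualified";
}

const ID = "[a-z0-9](?:[a-z0-9-]{0,30}[a-z0-9])?";

function extractWikilinksSketch(text: string): WikiRef[] {
  const refs: WikiRef[] = [];
  const qualified = new RegExp(
    `\\[\\[(${ID}):([^\\]|]+)(?:\\|([^\\]]+))?\\]\\]`,
    "g"
  );
  // Pass 1: collect qualified links and blank out each matched span
  // (same length, so offsets are preserved) before the second pass runs.
  const masked = text.replace(
    qualified,
    (match: string, sourceId: string, slug: string, display?: string) => {
      const ref: WikiRef = { slug: slug.trim(), sourceId, resolutionType: "qualified" };
      if (display) ref.display = display.trim();
      refs.push(ref);
      return " ".repeat(match.length);
    }
  );
  // Pass 2: unqualified links, run against the masked text only.
  const unqualified = /\[\[([^\]|]+)(?:\|([^\]]+))?\]\]/g;
  let m: RegExpExecArray | null;
  while ((m = unqualified.exec(masked)) !== null) {
    const ref: WikiRef = { slug: m[1].trim(), resolutionType: "unqualified" };
    if (m[2]) ref.display = m[2].trim();
    refs.push(ref);
  }
  return refs;
}
```

In this sketch the results come back with all qualified links first, then unqualified ones, rather than in document order. Unqualified refs omit the `sourceId` key entirely, in line with the strict `toEqual` note below.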

Schema:
- migration v18 adds links.resolution_type TEXT with CHECK constraint
  ('qualified'|'unqualified' or NULL for legacy/manual/frontmatter edges)
- schema.sql + pglite-schema.ts updated for fresh installs

EntityRef type:
- sourceId is OPTIONAL (only set on qualified wikilinks). Markdown
  [Name](path) and unqualified wikilinks omit it so strict toEqual
  tests pre-v0.17 keep working (69 existing tests still pass).

Tests:
- 5 new qualified-wikilink extraction tests + 1 migration v18 structural
  assertion. 75 tests in test/link-extraction.test.ts (up from 69).
- Full suite: 2018 pass / 3 fail (pre-existing PGLite infra timeouts).

NOT in this commit (deferred to Step 3 / Step 5 continuation):
- Writing resolution_type to the DB (addLink / addLinksBatch don't
  carry the field yet — that's the plumb-through that lands with
  Step 3 when search/dedup also needs source-aware result keys).
- `gbrain extract --refresh-unqualified` re-resolver.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(v0.17.0 step 3/9): source-aware search dedup composite keys (Lane B)

Search dedup now keys on (source_id, slug) instead of slug alone. Pre-
v0.17 would collapse two same-slug pages in different sources into
one, destroying cross-source recall. Codex outside-voice review flagged
this as regression-critical — this commit ships the fix plus tests
that lock the invariant in.

Dedup pipeline (src/core/search/dedup.ts):
- pageKey(r) helper — one canonical composite-key derivation. Falls
  back to source_id='default' for pre-v0.17 rows so single-source
  brains behave identically to before.
- Layer 1 (dedupBySource): group-by composite key.
- Layer 4 (capPerPage): count-by composite key.
- guaranteeCompiledTruth: swap scoped to matching (source_id, slug),
  so wiki:topics/ai can't accidentally pull gstack:topics/ai's
  compiled_truth chunk.
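
A minimal sketch of composite-key dedup, under stated assumptions: the `score` field and the keep-highest-score rule are illustrative, and `dedupByPageSketch` is a hypothetical name. Only the `(source_id, slug)` key and the `'default'` fallback are taken from the bullets above.

```typescript
// Hypothetical result shape; `score` is an assumption for illustration.
interface SearchResultSketch {
  slug: string;
  source_id?: string; // missing on pre-v0.17 rows
  score: number;
}

// One canonical composite key; rows without source_id fall back to 'default'.
function pageKey(r: SearchResultSketch): string {
  return `${r.source_id ?? "default"}:${r.slug}`;
}

// Keep one result per (source_id, slug). Keeping the best score is an
// assumed tie-break rule, not something this commit specifies.
function dedupByPageSketch(results: SearchResultSketch[]): SearchResultSketch[] {
  const best = new Map<string, SearchResultSketch>();
  for (const r of results) {
    const key = pageKey(r);
    const current = best.get(key);
    if (!current || r.score > current.score) best.set(key, r);
  }
  return Array.from(best.values());
}
```

With slug-only keys, `wiki:topics/ai` and `gstack:topics/ai` would share one bucket; with the composite key they stay separate while two chunks of the same page still collapse.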

SearchResult type gains optional source_id — populated by SQL JOINs
in both engines, falls through as 'default' for legacy callers.

Engine SQL:
- pglite-engine.ts + postgres-engine.ts: search SELECTs add p.source_id
- rowToSearchResult (utils.ts): maps row.source_id → result.source_id
  when present. Shape stays backward compatible (field optional).

Tests — 4 new in test/dedup.test.ts:
- same-slug-different-source does NOT collapse (the critical regression
  guard Codex called out)
- same-slug-same-source DOES still collapse (no over-correction)
- missing source_id falls back to 'default' for pre-v0.17 compat
- compiled_truth guarantee scopes to composite key (Codex's second pass
  caught that this path would otherwise leak across sources)

Full suite: 2022 pass / 3 fail (3 pre-existing PGLite infra timeouts).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(v0.17.0 step 7/9): file_migration_ledger + phase-B storage backfill (Lane E)

Adds files.source_id + files.page_id + the file_migration_ledger
state machine that drives storage object rewrites. Each per-file
transition is its own transaction so crash-point recovery is a
ledger read, not a filesystem inspection. Codex second-pass review
flagged that "skip if already has source prefix" was an unsafe
heuristic — the ledger replaces it with explicit state tracking.

Schema:
- migration v19 (files_source_id_page_id_ledger): handler-only
  (PGLite has no files table; Postgres-only gate). ADDs
  source_id + page_id to files, backfills page_id from page_slug
  scoped to source_id='default', creates file_migration_ledger
  with PK on file_id (Codex: not storage_path_old — two sources
  can share an old path during migration).
- schema.sql updated for fresh Postgres installs; file_migration_ledger
  gets RLS alongside other tables.

Runtime:
- src/commands/migrations/v0_17_0-storage-backfill.ts: drives the
  ledger state machine pending → copy_done → db_updated → complete.
  Idempotent per row: re-running resumes from whichever state
  crashed. Old objects preserved (no delete) so operators can
  verify the soak window before a future cleanup release.
- phase B in v0_17_0.ts orchestrator: wires the storage backend
  (Supabase/S3/local) through createStorage, runs runStorageBackfill,
  reports per-state counts + first-three error details.
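
The ledger-driven resume behavior can be illustrated with a small in-memory sketch. Everything here is assumed for illustration (row field names, the synchronous `StorageStub` interface, what the final `db_updated` to `complete` step does); only the four-state progression, the `failed` state, the exists-check, and the keep-old-objects rule come from this message.

```typescript
type LedgerState = "pending" | "copy_done" | "db_updated" | "complete" | "failed";

// Hypothetical row shape; real column names may differ.
interface LedgerRow {
  fileId: number;
  oldPath: string;
  newPath: string;
  state: LedgerState;
}

// Hypothetical synchronous stand-in for a storage backend.
interface StorageStub {
  exists(p: string): boolean;
  copy(from: string, to: string): void;
}

// Advance a row by exactly one transition. In a real system each
// transition would commit in its own transaction.
function stepSketch(
  row: LedgerRow,
  storage: StorageStub,
  updateFilePath: (fileId: number, newPath: string) => void
): LedgerState {
  switch (row.state) {
    case "pending":
      // Exists-check makes a re-run after a crash skip the redundant upload.
      if (!storage.exists(row.newPath)) storage.copy(row.oldPath, row.newPath);
      return "copy_done"; // old object is kept, never deleted
    case "copy_done":
      updateFilePath(row.fileId, row.newPath);
      return "db_updated";
    case "db_updated":
      return "complete"; // assumed: final bookkeeping only
    default:
      return row.state; // complete and failed rows are left alone
  }
}

// Resume every row from whatever state the ledger recorded.
function runBackfillSketch(
  rows: LedgerRow[],
  storage: StorageStub,
  updateFilePath: (fileId: number, newPath: string) => void
): Record<LedgerState, number> {
  for (const row of rows) {
    while (row.state !== "complete" && row.state !== "failed") {
      try {
        row.state = stepSketch(row, storage, updateFilePath);
      } catch (err) {
        row.state = "failed"; // no automatic retry
      }
    }
  }
  const counts: Record<LedgerState, number> = {
    pending: 0, copy_done: 0, db_updated: 0, complete: 0, failed: 0,
  };
  for (const row of rows) counts[row.state] += 1;
  return counts;
}
```

Because the next action is derived only from the recorded state, recovery after a crash needs no inspection of the storage bucket beyond the idempotent exists-check.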

Tests — 13 new in test/storage-backfill.test.ts:
- pending → copy_done → db_updated → complete happy path
- 3 crash-point recovery tests (resume from copy_done, resume from
  db_updated, failed rows don't auto-retry)
- already-complete rows are skipped with zero side effects
- idempotent re-upload (exists-check skips redundant upload)
- dry-run mode (no storage, reports counts without mutating)

Plus 5 new migrate.test.ts assertions for v19 structure (handler-
only, PGLite gate, source_id + page_id + ledger DDL, default-source
backfill scope, state machine values).

Full suite: 2035 pass / 3 fail (3 pre-existing PGLite infra
timeouts).

NOT in this commit (explicitly deferred):
- DROP old page_slug column — kept for backward compat until
  operators have time to verify page_id everywhere.
- DROP old UNIQUE(storage_path) in favor of UNIQUE(source_id,
  storage_path) — same reason, deferred to later cleanup.
- Actual cleanup phase that deletes old objects post-soak.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(v0.17.0 step 9/9): full multi-source PGLite integration suite (Lane G)

End-to-end exercise of every v0.17.0 surface against real PGLite
(in-memory, fast — no DATABASE_URL needed). The migration chain
v2→v19 runs start-to-finish and the test asserts each Step's
invariants hold together.

16 new integration tests across 7 describes:

1. Migration-installed state:
   - sources('default') exists with federated=true config
   - pages.source_id column has DEFAULT 'default'
   - composite UNIQUE (source_id, slug) is installed

2. Default-source write path:
   - putPage without explicit source → source_id='default' via schema
     default clause (no engine API change needed for single-source brains)

3. Composite UNIQUE regression guards (Codex-flagged):
   - Same slug in two different sources coexists
   - Third insert with same (source_id, slug) hits the UNIQUE constraint

4. sources CLI round-trip:
   - federate / unfederate flips config.federated
   - rename changes display, id stays immutable

5. Source resolution priority (integration):
   - Explicit flag > env var > fallback to default
   - Unregistered explicit source errors with actionable message

6. Cascade semantics:
   - sources remove cascades to pages; default source untouched

7. links.resolution_type (Step 4):
   - Qualified/unqualified values accepted
   - CHECK constraint rejects invalid values

All 16 tests pass. Full suite: 2042 pass / 4 fail (4 pre-existing
PGLite beforeEach timeouts in test/wait-for-completion,
test/extract-fs, test/e2e/search-quality, test/e2e/graph-quality
— count fluctuated 3-5 on baseline from variance alone).

Total new tests across Steps 1-9: ~85 unit + integration tests
(sources, source-resolver, migrate v16/v17/v18/v19 structural,
link-extraction qualified wikilinks, dedup regression-critical,
storage-backfill state machine + crash recovery, full
multi-source PGLite integration).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: bump to v0.18.0 + CHANGELOG entry (multi-source brains)

One-viewport release summary + itemized changes covering all 9 steps
of the multi-source primitive. Notes the v0.17 → v0.18 version bump
rationale (master shipped gbrain dream as v0.17 while this branch was
in flight).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(ci): v0_18_0 orchestrator TS narrow + mechanical test ON CONFLICT

Two CI failures on PR #337:

1. tsc TS2367 at src/commands/migrations/v0_18_0.ts:190 —
   after the early-return on `a.status === 'failed'` (line 179),
   TypeScript narrows `a.status` to `'skipped' | 'complete'`, so the
   subsequent `a.status === 'failed' ? 'failed' :` branch was dead
   code and refused to compile. Dropped the redundant check.

2. E2E `file_list LIMIT enforcement` at test/e2e/mechanical.test.ts:636 —
   the test pre-seeded a pages row with `ON CONFLICT (slug) DO NOTHING`
   but v21 swapped the global UNIQUE for `UNIQUE (source_id, slug)`, so
   Postgres rejects with "no unique or exclusion constraint matching".
   Updated the conflict target to the composite key.

Tier-1 E2E had only this one failing test; everything else passed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(e2e): v0.18.0 multi-source against real Postgres (v20-v23 schema + cascade + sync)

Closes the three biggest confidence gaps the author flagged in the
self-audit of PR #337:

1. No real Postgres E2E — PGLite has no files table, so v23's
   files.source_id + files.page_id rewrite + file_migration_ledger
   seed was NEVER executed against the real DB. This file covers it.

2. `gbrain sync --source <id>` had zero direct tests. Now has two:
   one that asserts performSync({sourceId}) reads local_path from the
   sources row (not the global config), one that asserts no-sourceId
   falls back to the global sync.repo_path.

3. Cascade delete coverage — previously verified only pages count
   after source removal. Now verifies pages + content_chunks +
   timeline_entries + links + files ALL cascade-delete when a source
   is removed.

6 describes, 16 tests total:

- Schema shape (fresh install): 6 tests confirming sources('default'),
  pages.source_id NOT NULL with DEFAULT, composite UNIQUE pages
  (source_id, slug) replaces global UNIQUE(slug), links.resolution_type
  column + CHECK, files.source_id + page_id columns, file_migration_ledger
  table + status CHECK.

- Composite UNIQUE semantics: 3 tests confirming same-slug in two
  sources coexists (Codex-critical regression guard), duplicate
  (source_id, slug) hits the UNIQUE, putPage targets default source
  by schema DEFAULT.

- Cascade delete: 1 test building a fully populated source (2 pages,
  chunks, timeline, links, files) then removing it + asserting every
  dependent row is gone.

- Sync routing: 2 tests confirming performSync({sourceId}) reads
  per-source local_path vs global config.

- Sources surface: 3 tests for federate/unfederate flipping + rename
  preserving id.

- Storage backfill: 1 end-to-end test seeding ledger + running
  runStorageBackfill against a stub StorageBackend, asserting
  pending → complete transition and files.storage_path rewrite.

Gated by DATABASE_URL per CLAUDE.md E2E lifecycle. Each describe's
beforeAll defensively DELETEs non-default sources + file_migration_ledger
rows so reruns are hermetic (sources isn't in helpers.ALL_TABLES).

Verified: 16/16 pass on first run AND second run (residual-state fix
holds). Full E2E suite still green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(ci): TS2352 in multi-source E2E — cast postgres.js RowList via unknown

tsc rejects the direct
  `(rows as { column_name: string }[]).map(...)`
cast because postgres.js RowList rows have an iterable-row shape that
doesn't overlap with the plain-object target. Standard fix: cast via
`unknown` first so the narrowing is explicit.

Verified: `bunx tsc --noEmit` clean (ignoring the pre-existing baseUrl
deprecation warning).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(v0.18.0): addLinksBatch + addTimelineEntriesBatch source-aware JOINs

Batch APIs JOINed on pages.slug globally, so two pages sharing the same
slug across sources would silently fan out — addLinksBatch(['a->b']) in
a brain with 'a' in both 'default' and 'alt' wrote 2 edges instead of 1.
Same bug on addTimelineEntriesBatch.

Fix:
- LinkBatchInput + TimelineBatchInput gain optional source_id fields
  (from_source_id, to_source_id, origin_source_id for links; source_id
  for timeline). All default to 'default' so existing callers are
  backward-compatible on single-source brains.
- pglite-engine + postgres-engine batch JOINs now composite-key on
  (slug, source_id). Postgres adds 3 more unnest arrays for links + 1
  for timeline — still one bind per column, no 65535-param cap risk.
- LEFT JOIN for origin pages also source-qualified so frontmatter-
  provenance edges don't cross-pollinate across sources.
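
The fan-out can be reproduced with an in-memory stand-in for the JOIN. This is only a sketch: the `Page` and `LinkInput` shapes and both function names are hypothetical, and the nested loops stand in for SQL rather than mirroring the engines' queries.

```typescript
interface PageRow { id: number; slug: string; source_id: string; }

// Hypothetical input shape; the source-id fields default to 'default'.
interface LinkInputSketch {
  from_slug: string;
  to_slug: string;
  from_source_id?: string;
  to_source_id?: string;
}

type Edge = { from: number; to: number };

// Old behavior: match on slug alone, so same-slug pages in other
// sources each produce an edge.
function joinOnSlugOnly(pages: PageRow[], links: LinkInputSketch[]): Edge[] {
  const edges: Edge[] = [];
  for (const l of links) {
    for (const f of pages) {
      if (f.slug !== l.from_slug) continue;
      for (const t of pages) {
        if (t.slug === l.to_slug) edges.push({ from: f.id, to: t.id });
      }
    }
  }
  return edges;
}

// Fixed behavior: match on (slug, source_id).
function joinOnCompositeKey(pages: PageRow[], links: LinkInputSketch[]): Edge[] {
  const edges: Edge[] = [];
  for (const l of links) {
    const fromSource = l.from_source_id ?? "default";
    const toSource = l.to_source_id ?? "default";
    for (const f of pages) {
      if (f.slug !== l.from_slug || f.source_id !== fromSource) continue;
      for (const t of pages) {
        if (t.slug === l.to_slug && t.source_id === toSource) {
          edges.push({ from: f.id, to: t.id });
        }
      }
    }
  }
  return edges;
}

const demoPages: PageRow[] = [
  { id: 1, slug: "a", source_id: "default" },
  { id: 2, slug: "a", source_id: "alt" },
  { id: 3, slug: "b", source_id: "default" },
];
const demoLink: LinkInputSketch[] = [{ from_slug: "a", to_slug: "b" }];
console.log(joinOnSlugOnly(demoPages, demoLink).length);     // 2 edges (fan-out)
console.log(joinOnCompositeKey(demoPages, demoLink).length); // 1 edge
```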

Regression coverage:
- test/pglite-engine.test.ts: 5 new tests covering default-path isolation,
  explicit alt-source writes, and cross-source edges.
- test/e2e/multi-source.test.ts: 4 new tests against real Postgres so
  postgres-js's unnest() bind path is exercised (structurally different
  from PGLite's).

Gap #4 from the PR self-audit — latent bug, not previously reachable
because every existing caller wrote to the default source only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Garry Tan
2026-04-22 16:24:23 -07:00
committed by GitHub
parent 55ca4984b2
commit 90c5d93fce
33 changed files with 3995 additions and 74 deletions

@@ -2,6 +2,107 @@
All notable changes to GBrain will be documented in this file.
## [0.18.0] - 2026-04-22
## **Multi-source brains. One database, many repos. Federated or isolated, you choose.**
## **`gbrain sources` is the new subcommand. `.gbrain-source` is the new dotfile.**
A single gbrain database can now hold multiple knowledge repos — your wiki, your gstack checkout, your yc-media pipeline, your garrys-list essays — with clean scoping per source. Slugs are unique per source, not globally, so two sources can both have `topics/ai` and they are different pages. Every page, every file, every ingest_log row is scoped to a `sources(id)` row.
Per-source federation controls whether a source participates in unqualified default search. `federated=true` is cross-recall (your wiki + gstack both show up when you search "retry budgets"). `federated=false` is isolation (your yc-media content never leaks into your personal writing searches). Flip with `gbrain sources federate <id>` / `unfederate <id>`.
Per-directory default via `.gbrain-source` dotfile walk-up + `GBRAIN_SOURCE` env var. Same mental model as kubectl / terraform / git: `cd ~/yc-media && gbrain query "X"` just works, no `--source` flag needed. Resolution priority: explicit flag > env > dotfile > registered-path-longest-prefix > `sources.default` config > literal `default` fallback.
### The numbers that matter
9 bisectable commits. 4 new schema migrations. ~85 new tests. Full suite: 2063 pass / 17 fail (the 17 pre-existing master timeouts unchanged). Migration chain runs end-to-end against real PGLite in under 1 second for the integration test.
| Metric | BEFORE v0.17 | AFTER v0.18 | Δ |
|---|---|---|---|
| Max repos per brain | 1 | unlimited | unbounded |
| Slug uniqueness | global | per-source | composite |
| Multi-source search | impossible | default (for federated) | native |
| New CLI commands | — | 9 (`sources add/list/remove/rename/default/attach/detach/federate/unfederate`) | +9 |
| Schema migrations shipped | 0 new | 4 (v20-v23) | +4 |
| New unit + integration tests | — | ~85 | +85 |
### What this means for agents
When a brain has multiple sources, every search result carries `source_id`. Agents cite in `[source-id:slug]` form — `[wiki:topics/ai]` or `[gstack:plans/retry-policy]` — so the user can trace which repo each fact came from. The citation key is `sources.id` (immutable), so renaming a source's display name via `gbrain sources rename` never breaks existing citations.
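
As a rough illustration of the citation contract, an agent-side helper might format and parse citations like this. The function names are hypothetical, and falling back to `default` when a result carries no `source_id` is an assumption borrowed from the search-result fallback described elsewhere in this release.

```typescript
// Hypothetical helpers; not part of the gbrain API.
interface CitedResult { slug: string; source_id?: string; }

function formatCitation(r: CitedResult): string {
  return `[${r.source_id ?? "default"}:${r.slug}]`;
}

// Parses "[source-id:slug]"; returns null when the text is not a citation.
function parseCitation(text: string): { sourceId: string; slug: string } | null {
  const m = /^\[([a-z0-9](?:[a-z0-9-]{0,30}[a-z0-9])?):([^\]]+)\]$/.exec(text);
  return m ? { sourceId: m[1], slug: m[2] } : null;
}
```

Because the key is the immutable `sources.id`, a citation produced before a `gbrain sources rename` still parses to the same id afterwards.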
Back-compat is total. Pre-v0.18 brains upgrade into a seeded `default` source with `federated=true`, and their existing code paths target `default` via a schema DEFAULT clause. You literally do not have to change anything to upgrade; you only change things if you want to add a second source.
## To take advantage of v0.18.0
`gbrain upgrade` should do this automatically. If it didn't, or if `gbrain doctor`
warns about a partial migration:
1. **Run the orchestrator manually:**
```bash
gbrain apply-migrations --yes
```
2. **Your agent reads `skills/migrations/v0.18.0.md` the next time you interact with it.** The migration chain is fully mechanical (v20 creates the sources table, v21 adds pages.source_id + composite UNIQUE, v22 adds links.resolution_type, v23 adds files.source_id + page_id + file_migration_ledger). No manual data work needed.
3. **Verify the outcome:**
```bash
gbrain sources list # should show 'default' federated, with your existing page count
gbrain stats # existing behavior unchanged
gbrain doctor
```
4. **To start using multi-source:**
```bash
gbrain sources add gstack --path ~/.gstack --no-federated
cd ~/.gstack && gbrain sources attach gstack
gbrain sync --source gstack
```
5. **If any step fails or the numbers look wrong,** please file an issue: https://github.com/garrytan/gbrain/issues with:
- output of `gbrain doctor`
- contents of `~/.gbrain/upgrade-errors.jsonl` if it exists
- which step broke
### Itemized changes
#### Added
- **`gbrain sources` subcommand group** — add, list, remove, rename, default, attach, detach, federate, unfederate. See `docs/guides/multi-source-brains.md` for three canonical scenarios (unified wiki+gstack / purpose-separated yc-media+garrys-list / mixed).
- **`sources` table** — first-class multi-repo primitive. `(id, name, local_path, last_commit, last_sync_at, config)`. Citation key is `sources.id`, immutable, validated `[a-z0-9](?:[a-z0-9-]{0,30}[a-z0-9])?`.
- **`pages.source_id` column + composite UNIQUE (source_id, slug)** — slugs unique per source. DEFAULT 'default' on the column so existing single-source callers target the default source automatically via schema default.
- **`.gbrain-source` dotfile** — walk-up resolution like kubectl/terraform/git. `gbrain sources attach <id>` writes it in CWD. Auto-selects the source for any command run from that directory or any subdirectory.
- **`GBRAIN_SOURCE` env var** — power-user / CI / script escape hatch. Second highest priority in resolution (after explicit `--source <id>`).
- **Qualified wikilink syntax `[[source:slug]]`**: new in the v0.18 extractor. Unqualified `[[slug]]` still resolves via local-first fallback. `links.resolution_type` (TEXT with a CHECK allowing `'qualified'` or `'unqualified'`, NULL for legacy/manual/frontmatter edges) records which kind each edge is, for future `gbrain extract --refresh-unqualified` re-resolution.
- **`files.source_id` + `files.page_id`** — files now scope per source + reference pages by id (not slug). `file_migration_ledger` drives the S3/Supabase object rewrite under the pending → copy_done → db_updated → complete state machine.
- **`gbrain sync --source <id>`**: per-source sync reads local_path + last_commit from the sources table and writes last_commit + last_sync_at back to the same row. When `--source` is omitted, sync keeps using the pre-v0.17 `sync.repo_path` / `sync.last_commit` config keys unchanged.
#### Changed
- **Search dedup is now source-aware.** Pre-v0.18 keyed on slug alone; under composite uniqueness that would collapse two same-slug pages in different sources. `pageKey(r) = source_id:slug` is the one canonical helper across all four dedup layers + compiled-truth guarantee. Codex review flagged this as regression-critical.
- **`SearchResult.source_id` optional field** — populated by engine SELECT JOINs. Falls back to `'default'` for pre-v0.18 rows that lacked the column.
- **Migration runner sorts by version** — if anyone adds a migration out of order in `MIGRATIONS[]`, the sort guards against silent skips.
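
A minimal sketch of the sort guard, assuming a runner that applies every migration whose version is above the stored schema version. The `MigrationSketch` type and `pendingInOrder` are hypothetical names; the migration names and versions are the ones listed in this release.

```typescript
interface MigrationSketch { version: number; name: string; }

// Sort a copy by version, then keep only migrations newer than the
// current schema version. Array order in the registry no longer matters.
function pendingInOrder(
  migrations: MigrationSketch[],
  currentVersion: number
): MigrationSketch[] {
  return migrations
    .slice() // do not mutate the registry
    .sort((a, b) => a.version - b.version)
    .filter((m) => m.version > currentVersion);
}

// Registered out of order on purpose.
const registry: MigrationSketch[] = [
  { version: 22, name: "links_resolution_type" },
  { version: 20, name: "sources_table_additive" },
  { version: 21, name: "pages_source_id_composite_unique" },
];

console.log(pendingInOrder(registry, 19).map((m) => m.version)); // [ 20, 21, 22 ]
```

Without the sort, a runner that bumps the stored version after each migration would run v22 first and then treat v20 and v21 as already applied.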
#### Migrations
- **v20** `sources_table_additive` — additive-only. Creates sources table + seeds default row with `{"federated": true}`. Inherits existing `sync.repo_path` / `sync.last_commit`.
- **v21** `pages_source_id_composite_unique` — adds `pages.source_id` with DEFAULT, swaps global `UNIQUE(slug)` for composite `UNIQUE(source_id, slug)`. Lands atomically with the engine's `ON CONFLICT (source_id, slug)` rewrite.
- **v22** `links_resolution_type` — adds `links.resolution_type` CHECK column.
- **v23** `files_source_id_page_id_ledger` — Postgres-only (PGLite has no files table). Adds `files.source_id` + `files.page_id`, backfills `page_id` from legacy `page_slug`, creates `file_migration_ledger`.
#### Tests
- `test/sources.test.ts` (14 tests) — CLI dispatcher, validation, overlapping-path guard.
- `test/source-resolver.test.ts` (14 tests) — full 6-priority resolution coverage including longest-prefix match.
- `test/storage-backfill.test.ts` (13 tests) — state machine + 3 crash-point recovery tests (Codex flagged each).
- `test/multi-source-integration.test.ts` (16 tests) — end-to-end against real PGLite, migration chain v2→v23.
- `test/link-extraction.test.ts` (+6) — qualified `[[source:slug]]` parsing + masking + v22 structural.
- `test/dedup.test.ts` (+4) — regression-critical source-aware composite key tests.
- `test/migrate.test.ts` (+18) — v20/v21/v22/v23 structural assertions.
#### Docs
- `docs/guides/multi-source-brains.md` — new getting-started guide (federated / isolated / mixed scenarios).
- `skills/migrations/v0.18.0.md` — agent-facing migration skill.
- `skills/brain-ops/SKILL.md` — new "Cross-source citation format" section.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## [0.17.0] - 2026-04-22
## **`gbrain dream`. Run the brain maintenance cycle while you sleep.**

@@ -0,0 +1,182 @@
# Multi-source brains
**A single gbrain database can hold multiple knowledge repos.** Each one
is a `source`: a logical brain-within-the-brain with its own slug
namespace, its own sync state, and its own federation policy. The rest
of this guide walks the three canonical scenarios.
## The three scenarios
### 1. Unified knowledge recall (wiki + gstack)
You have a personal wiki and a `gstack` checkout. Both belong to you,
both are knowledge you want your agent to recall across. When you ask
"what did I learn about X?" you want the best hit whether it lives in
the wiki or in a gstack plan.
```bash
# Register the gstack source, federate so it joins cross-source search
gbrain sources add gstack --path ~/.gstack --federated
# Pin the directory so `gbrain sync` knows which source it's walking
cd ~/.gstack && gbrain sources attach gstack
# Initial sync
gbrain sync --source gstack
# Now `gbrain search "retry budgets"` returns hits from BOTH wiki and
# gstack. Each result includes source_id so the agent can cite properly.
```
Result: wiki pages and gstack plans are separate (different source_ids,
different slug namespaces) but share the search surface.
### 2. Purpose-separated brains (yc-media + garrys-list)
You run two completely different content pipelines on the same backend.
YC Media covers portfolio news and founder profiles. Garry's List is
personal writing. You explicitly DON'T want them mixed in search — YC
portfolio content leaking into essay searches is a bug, not a feature.
```bash
# Two sources, both isolated (federated=false)
gbrain sources add yc-media --path ~/yc-media --no-federated
gbrain sources add garrys-list --path ~/writing --no-federated
# Pin each checkout directory
(cd ~/yc-media && gbrain sources attach yc-media)
(cd ~/writing && gbrain sources attach garrys-list)
# Sync each independently
gbrain sync --source yc-media
gbrain sync --source garrys-list
```
Result: a search run from outside both directories resolves to the
`default` source (your main brain). Searching from inside `~/yc-media`
returns only `yc-media` hits, and searching from inside `~/writing`
returns only `garrys-list` hits. Federation is opt-in, not leaked.
To search across them explicitly on demand:
```bash
gbrain search "tech layoffs" --source yc-media,garrys-list
```
### 3. Mixed (wiki federated + sessions isolated)
Your main wiki is federated with a few trusted sources. Your session
transcripts (coming in v0.18) land in a separate isolated source so
they don't dominate every search result.
```bash
# Federated sources
gbrain sources add gstack --path ~/.gstack --federated
# Isolated source (future v0.18 — sessions use this shape today for ingest)
gbrain sources add sessions --path ~/.claude/sessions --no-federated
```
## Resolution priority
When any command needs to pick a source, gbrain walks this list (highest
first):
1. Explicit `--source <id>` flag.
2. `GBRAIN_SOURCE` environment variable.
3. `.gbrain-source` dotfile in CWD or any ancestor directory.
4. A registered source whose `local_path` contains the CWD (longest
prefix wins for nested checkouts).
5. The brain-level default set via `gbrain sources default <id>`.
6. The seeded `default` source.
So inside `~/.gstack/plans/` on a brain that pinned `gstack` to
`~/.gstack` via `.gbrain-source`, `gbrain put-page` implicitly writes to
the `gstack` source. Outside any registered directory with no env/dotfile
set, it writes to the default.
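The priority list above can be sketched as a pure function. This is a minimal illustration, not gbrain's actual resolver: the `ResolveInput` shape, the `resolveSourceId` name, and the assumption that the caller has already read the flag, env var, and dotfile are all hypothetical.

```typescript
// Hypothetical input shape for illustration; not gbrain's actual resolver API.
interface RegisteredSource {
  id: string;
  localPath: string | null;
}

interface ResolveInput {
  flag?: string;          // value of --source, if passed
  env?: string;           // value of GBRAIN_SOURCE, if set
  dotfile?: string;       // contents of the nearest .gbrain-source, if any
  cwd: string;            // absolute working directory
  sources: RegisteredSource[];
  brainDefault?: string;  // set via `gbrain sources default <id>`
}

function resolveSourceId(input: ResolveInput): string {
  if (input.flag) return input.flag;                                      // 1. explicit flag
  if (input.env) return input.env;                                        // 2. environment variable
  if (input.dotfile && input.dotfile.trim()) return input.dotfile.trim(); // 3. dotfile

  // 4. registered local_path that contains CWD; the longest prefix wins.
  let bestId: string | null = null;
  let bestLen = -1;
  for (const s of input.sources) {
    if (!s.localPath) continue;
    const inside = input.cwd === s.localPath || input.cwd.startsWith(s.localPath + '/');
    if (inside && s.localPath.length > bestLen) {
      bestId = s.id;
      bestLen = s.localPath.length;
    }
  }
  if (bestId) return bestId;

  if (input.brainDefault) return input.brainDefault;                      // 5. brain-level default
  return 'default';                                                       // 6. seeded default
}
```

Keeping the resolver free of I/O like this makes each priority level easy to test in isolation.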
## Federation flag
Every source row stores `config.federated: boolean` in its JSONB config.
| Value | Meaning |
|-------|---------|
| `true` | Source participates in unqualified `gbrain search "X"` results. |
| `false` (default for new sources) | Source only searched when explicitly named via `--source <id>` or qualified citation. |
The seeded `default` source is `federated=true` so pre-v0.17 brains
behave exactly as before — every page appears in search.
Flip later with `gbrain sources federate <id>` / `unfederate <id>`.
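As a rough sketch of what the flag means for search scoping, the function below picks which source ids an unqualified search would cover. The `searchScope` name and the simplified `SourceRow` shape are assumptions for illustration; CWD-based scoping is left out.

```typescript
// Simplified row shape (assumption): config may arrive as parsed JSONB or as a string.
interface SourceRow {
  id: string;
  config: Record<string, unknown> | string;
}

// Only an explicit `federated: true` counts; a missing key means isolated.
function isFederated(config: SourceRow['config']): boolean {
  let parsed: unknown = config;
  if (typeof config === 'string') {
    try { parsed = JSON.parse(config); } catch { return false; }
  }
  return typeof parsed === 'object' && parsed !== null
    && (parsed as Record<string, unknown>).federated === true;
}

// Explicitly named sources are searched as named; otherwise only federated ones.
function searchScope(sources: SourceRow[], explicit: string[] = []): string[] {
  if (explicit.length > 0) {
    const known = new Set(sources.map(s => s.id));
    return explicit.filter(id => known.has(id));
  }
  return sources.filter(s => isFederated(s.config)).map(s => s.id);
}
```

With a federated `default`, a federated `gstack`, and an isolated `yc-media`, an unqualified search covers only the first two.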
## Commands
Full subcommand reference:
```
gbrain sources add <id> --path <p> [--name <n>] [--federated|--no-federated]
Register a source. id: [a-z0-9](?:[a-z0-9-]{0,30}[a-z0-9])?
gbrain sources list [--json] List all sources with page counts + federation state.
gbrain sources remove <id> [--yes] [--dry-run] [--keep-storage]
Cascade-delete a source (pages, chunks, timeline).
gbrain sources rename <id> <new-name>
Change display name only; id is immutable.
gbrain sources default <id> Set the brain-level default.
gbrain sources attach <id> Write .gbrain-source in CWD (like kubectl context).
gbrain sources detach Remove .gbrain-source from CWD.
gbrain sources federate <id>
gbrain sources unfederate <id>
```
## Citation format for agents
When agents receive multi-source results they MUST cite pages in
`[source-id:slug]` form. Example:
> You told me about the distillation protocol — see [wiki:topics/ai]
> and [gstack:plans/multi-repo] for where this came from.
The citation key is `sources.id` (immutable). Renaming a source via
`gbrain sources rename` changes the display name only; existing
citations keep working.
## Writing to a specific source
```bash
# Pass --source explicitly
gbrain put-page topics/ai ... --source wiki
# Or rely on the dotfile / env / CWD match
cd ~/.gstack && gbrain put-page plans/multi-repo ...
# → source auto-resolves to gstack
```
Reads span federated sources by default. Writes require a resolved
source (explicit, inferred, or default). The resolver never picks a
source silently when ambiguous — it errors with a clear fix.
## Upgrading an existing brain
`gbrain upgrade` runs the v16 + v17 migrations automatically. Your
existing pages all move under `source_id='default'`. Behavior is
unchanged until you add a second source.
To add one:
```bash
gbrain sources add gstack --path ~/.gstack --federated
cd ~/.gstack && gbrain sources attach gstack && gbrain sync
```
Two lines of shell. The existing default source is untouched.
## Not in v0.18.0
- Session transcript ingest (`.jsonl`, raised size cap, session
PageType) — v0.18.
- Per-source retention/TTL (`gbrain sources prune`) — v0.18.
- ACL enforcement via caller-identity — v0.17.1.
- `gbrain sources import-from-github <url>` one-shot bootstrap — patch
release after the core plumbing stabilizes.
All of these build on the `sources` primitive shipped here.


@@ -116,6 +116,25 @@ ingest event.
No separate output. Brain-ops is an always-on behavior layer, not a report generator.
The output is updated brain pages and enriched responses.
## Cross-source citation format (v0.18.0+)
When a brain has multiple sources (wiki, gstack, yc-media, etc.), every
citation MUST include the source id: `[source-id:slug]`. Example:
> You told me about the retry budget approach — see
> [wiki:topics/resilience] and [gstack:plans/retry-policy] for where
> this came from.
Rules:
- The key is `sources.id` (immutable), never `sources.name` (mutable display).
- Single-source brains still write `[default:slug]` OR may omit the prefix
for backward compat.
- Every page payload returned by `search`, `query`, `get_page`, `list_pages`
carries `source_id` — always use it when citing, never guess.
If a search result has `source_id: "gstack"` and `slug: "plans/foo"`,
the citation is `[gstack:plans/foo]`. That's the whole rule.
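A minimal sketch of that rule in code, assuming a result payload that exposes `source_id` and `slug` as described above. The `formatCitation` and `parseCitation` helpers are hypothetical, not part of gbrain.

```typescript
interface PageHit {
  source_id: string;
  slug: string;
}

// Build the citation straight from the payload fields; never guess the source.
function formatCitation(hit: PageHit): string {
  return `[${hit.source_id}:${hit.slug}]`;
}

// Parse a qualified citation. The id part follows the source id pattern,
// so the first colon always separates id from slug.
function parseCitation(text: string): PageHit | null {
  const m = /^\[([a-z0-9](?:[a-z0-9-]{0,30}[a-z0-9])?):([^\]]+)\]$/.exec(text);
  return m ? { source_id: m[1], slug: m[2] } : null;
}
```

An unprefixed `[topics/ai]` does not parse here; a caller that wants backward compatibility would have to treat that case as the `default` source itself.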
## Anti-Patterns
- Answering questions about people/companies without checking the brain first


@@ -0,0 +1,161 @@
---
version: 0.18.0
feature_pitch:
headline: "Multi-source brains: one DB, many repos. Federated and isolated sources coexist."
description: |
v0.18.0 introduces sources as a first-class primitive. A single
gbrain backend can now hold multiple knowledge repos (wiki, gstack,
yc-media, garrys-list, etc.) with clean scoping. Every page, file,
and ingest_log row is scoped to a `sources(id)` row. Slugs are
unique PER source, not globally — so two sources can both have
`topics/ai` and they're different pages.
Per-source federation controls whether a source participates in
unqualified default search. `federated=true` (the default source
post-upgrade) joins the cross-source recall pool. `federated=false`
is isolation — only searched when explicitly named via `--source`.
This supports both "unified knowledge brain" (wiki + gstack, both
federated) and "purpose-separated brains" (yc-media + garrys-list,
both isolated) at the same time.
Per-directory default via `.gbrain-source` dotfile walk-up +
`GBRAIN_SOURCE` env var. Matches how kubectl / terraform / git
scope context. `cd ~/yc-media && gbrain query "X"` just works.
recipe: docs/guides/multi-source-brains.md
tiers: null
---
# v0.18.0 Migration: Multi-source brains
**Audience: host agents reading this after `gbrain apply-migrations`
has run. v0.18.0 installs a schema primitive for multi-source and
exposes a `sources` CLI subcommand. Existing single-source brains
keep working unchanged — they live under a seeded `default` source
that preserves all prior behavior.**
## Mechanical migration: automatic, no action required
`gbrain upgrade` chains to `gbrain apply-migrations --yes`, which
runs:
- **migration v16** — creates the `sources` table, seeds `default`
with `{"federated": true}` config, inherits your pre-v0.17
`sync.repo_path` and `sync.last_commit` into the default row.
- **migration v17** — adds `pages.source_id TEXT NOT NULL DEFAULT
'default' REFERENCES sources(id)`. Swaps the global `UNIQUE(slug)`
constraint for composite `UNIQUE(source_id, slug)`. Engine
upserts simultaneously re-target `ON CONFLICT (source_id, slug)`
so the constraint swap and the write path land atomically.
Both migrations are idempotent. Safe to re-run.
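To illustrate what the composite `UNIQUE(source_id, slug)` buys, here is an in-memory sketch of an upsert keyed on the pair. The `PageStore` class is purely hypothetical; the real constraint lives in Postgres, not in application code.

```typescript
interface Page {
  sourceId: string;
  slug: string;
  body: string;
}

// In-memory stand-in for a table with UNIQUE(source_id, slug).
class PageStore {
  private readonly pages = new Map<string, Page>();

  // Encode the pair unambiguously so ('a', 'b/c') and ('a/b', 'c') never collide.
  private static key(sourceId: string, slug: string): string {
    return JSON.stringify([sourceId, slug]);
  }

  // Upsert: the conflict target is the (sourceId, slug) pair, not slug alone.
  putPage(page: Page): 'inserted' | 'updated' {
    const k = PageStore.key(page.sourceId, page.slug);
    const existed = this.pages.has(k);
    this.pages.set(k, page);
    return existed ? 'updated' : 'inserted';
  }

  getPage(sourceId: string, slug: string): Page | undefined {
    return this.pages.get(PageStore.key(sourceId, slug));
  }

  get size(): number {
    return this.pages.size;
  }
}
```

Two sources can each hold a `topics/ai` page, while a second write to the same pair updates in place.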
Later point releases (v0.17.1 and v0.18.0) will layer:
- v0.17.1: ACL enforcement via a caller-identity primitive (the
JSONB slot for `access_policy` ships now; enforcement waits for
identity to be designed).
- v0.18.0: Session ingest (`.jsonl` transcripts, raised size cap,
session PageType) AND per-source retention/TTL at the same time.
## What's new for agents
### `sources` CLI subcommand
```
gbrain sources add <id> --path <p> [--name <n>] [--federated|--no-federated]
gbrain sources list [--json]
gbrain sources remove <id> [--yes] [--dry-run] [--keep-storage]
gbrain sources rename <id> <new-display-name>
gbrain sources default <id>
gbrain sources attach <id> # write .gbrain-source in CWD
gbrain sources detach # remove .gbrain-source
gbrain sources federate <id>
gbrain sources unfederate <id>
```
Source id rules: `[a-z0-9](?:[a-z0-9-]{0,30}[a-z0-9])?`. An id starts and
ends with a lowercase letter or digit, may contain interior hyphens, and
is at most 32 characters long. It is immutable after creation (rename
only changes the display name) and serves as the stable citation key in
`[source-id:slug]` references.
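A quick way to check an id against these rules, using the same pattern anchored at both ends. The `isValidSourceId` wrapper is a hypothetical name for illustration.

```typescript
// Same pattern as the rule above, anchored so the whole string must match.
const SOURCE_ID_RE = /^[a-z0-9](?:[a-z0-9-]{0,30}[a-z0-9])?$/;

function isValidSourceId(id: string): boolean {
  return SOURCE_ID_RE.test(id);
}

const samples = ['wiki', 'yc-media', 'a', '-wiki', 'wiki-', 'Wiki', 'a'.repeat(33)];
for (const id of samples) {
  console.log(`${id}: ${isValidSourceId(id)}`);
}
```

The first three samples pass; the rest fail on a leading hyphen, a trailing hyphen, an uppercase letter, and length respectively. Note that the pattern does not forbid consecutive interior hyphens, so `a--b` is accepted.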
### Per-directory default
Running `gbrain sources attach gstack` inside `~/.gstack/` writes a
`.gbrain-source` file containing the single word `gstack`. Any
gbrain command run from that directory (or any subdirectory)
auto-selects `gstack` as the default source. `gbrain sources detach`
removes the dotfile.
Resolution priority for the source a command targets:
1. Explicit `--source <id>` flag.
2. `GBRAIN_SOURCE` env var.
3. `.gbrain-source` dotfile in CWD or any ancestor.
4. Registered source whose `local_path` contains CWD (longest
prefix wins — nested `~/gstack` + `~/gstack/plans` resolves to
`plans` when deeper).
5. Brain-level default set via `gbrain sources default <id>`.
6. Literal `default` (backward-compat fallback).
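Step 3 of the list, the dotfile walk-up, might look roughly like the sketch below. The file reader is injected so the logic stays testable without touching the filesystem; `findDotfileSource`, the `ReadFile` type, and the choice to skip an empty dotfile and keep walking are assumptions, not gbrain's confirmed behavior.

```typescript
import { dirname, join, resolve } from 'node:path';

// Returns the file's contents, or null when the file does not exist.
type ReadFile = (path: string) => string | null;

// Walk from startDir up to the filesystem root looking for .gbrain-source.
function findDotfileSource(startDir: string, readFile: ReadFile): string | null {
  let dir = resolve(startDir);
  for (;;) {
    const contents = readFile(join(dir, '.gbrain-source'));
    // Assumption: an empty dotfile is ignored and the walk continues upward.
    if (contents !== null && contents.trim() !== '') return contents.trim();
    const parent = dirname(dir);
    if (parent === dir) return null; // reached the root without a match
    dir = parent;
  }
}
```

A real caller would pass a reader backed by `fs.readFileSync` that maps a missing file to `null`.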
### Federation semantics
- `federated=true` (only the `default` source has this out of the
box, by migration): appears in unqualified `gbrain search "X"`
results.
- `federated=false` (new sources default to this): only appears
when `--source <id>` is passed.
Interactive `gbrain sources add` prompts for federation;
non-interactive runs use `--federated` / `--no-federated`. Flip later with
`gbrain sources federate <id>` / `unfederate <id>`.
### Citation contract (for agents)
When agents get multi-source search results they MUST cite pages
in `[source-id:slug]` form. Example:
> You told me about the distillation protocol — see
> [wiki:topics/ai] and [gstack:plans/multi-repo] for where this
> came from.
Citations are keyed on `sources.id` (immutable), never
`sources.name` (mutable display). If a user renames a source via
`gbrain sources rename`, existing citations stay valid.
## What's NOT in v0.17.0 yet
The following land in later Steps of this release cycle (already
on the branch but gated until the matching code ships):
- `ingest_log.source_id` — lands with Step 5 sync rewrite.
- `links.resolution_type` + qualified `[[source:slug]]` wikilink
parsing — lands with Step 4 link-extraction rewrite.
- `files.page_slug → page_id` FK rewrite + `file_migration_ledger`
+ storage object prefixing — lands with Step 7 storage backfill.
- Source-aware search dedup — lands with Step 3.
- `gbrain sources import-from-github <url>` — deferred to a patch
release after the plumbing stabilizes.
Existing callers continue to work against the `default` source. No
agent behavioral change is required; the new capabilities are
opt-in via the new `sources` CLI surface.
## Host-repo actions
None required. If your host agent manages the brain via the
standard `gbrain sync` flow, it continues to target the default
source and sees no behavioral change. To start using multi-source:
```
# Register a new source
gbrain sources add gstack --path ~/.gstack --no-federated
# Pin that directory to it so no --source flag is needed
cd ~/.gstack
gbrain sources attach gstack
# Ingest
gbrain sync --source gstack
```
Or see `docs/guides/multi-source-brains.md` for all three
canonical scenarios (unified, purpose-separated, mixed).


@@ -19,7 +19,7 @@ for (const op of operations) {
}
// CLI-only commands that bypass the operation layer
-const CLI_ONLY = new Set(['init', 'upgrade', 'post-upgrade', 'check-update', 'integrations', 'publish', 'check-backlinks', 'lint', 'report', 'import', 'export', 'files', 'embed', 'serve', 'call', 'config', 'doctor', 'migrate', 'eval', 'sync', 'extract', 'features', 'autopilot', 'graph-query', 'jobs', 'agent', 'apply-migrations', 'skillpack-check', 'resolvers', 'integrity', 'repair-jsonb', 'orphans', 'dream', 'check-resolvable']);
+const CLI_ONLY = new Set(['init', 'upgrade', 'post-upgrade', 'check-update', 'integrations', 'publish', 'check-backlinks', 'lint', 'report', 'import', 'export', 'files', 'embed', 'serve', 'call', 'config', 'doctor', 'migrate', 'eval', 'sync', 'extract', 'features', 'autopilot', 'graph-query', 'jobs', 'agent', 'apply-migrations', 'skillpack-check', 'resolvers', 'integrity', 'repair-jsonb', 'orphans', 'sources', 'dream', 'check-resolvable']);
async function main() {
// Parse global flags (--quiet / --progress-json / --progress-interval)
@@ -472,6 +472,11 @@ async function handleCliOnly(command: string, args: string[]) {
await runOrphans(engine, args);
break;
}
case 'sources': {
const { runSources } = await import('./commands/sources.ts');
await runSources(engine, args);
break;
}
}
} finally {
if (command !== 'serve') await engine.disconnect();


@@ -18,6 +18,7 @@ import { v0_13_0 } from './v0_13_0.ts';
import { v0_13_1 } from './v0_13_1.ts';
import { v0_14_0 } from './v0_14_0.ts';
import { v0_16_0 } from './v0_16_0.ts';
import { v0_18_0 } from './v0_18_0.ts';
export const migrations: Migration[] = [
v0_11_0,
@@ -27,6 +28,7 @@ export const migrations: Migration[] = [
v0_13_1,
v0_14_0,
v0_16_0,
v0_18_0,
];
/** Look up a migration by exact version string. */


@@ -0,0 +1,174 @@
/**
* v0.18.0 Step 7 — phase B storage backfill loader.
*
* Drives the `file_migration_ledger` state machine forward:
*
* pending → copy_done → db_updated → complete
*
* Each per-file transition is a separate transaction so a crash
* between states leaves a recoverable row (resume-on-partial). The
* ledger is the atomicity backstop for non-atomic object-storage
* "renames" (S3/Supabase = copy+delete).
*
* Crash-point recovery:
* - crash AFTER copy, BEFORE DB update → re-run detects
* `status='copy_done'`, completes DB update (copy is idempotent
* against S3 overwrite so re-copy on same path is fine).
* - crash AFTER DB update, BEFORE ledger mark → re-run detects
* `status='db_updated'`, marks `complete`.
* - crash AFTER ledger mark, BEFORE old-object delete → delete runs
* in the explicit "cleanup" sub-phase so old objects are
* preserved until a separate operator decision.
*
* Scope: v0.18.0 Step 7 DOES rewrite storage_path in the files table
* and copies the bytes to the new source-prefixed path. It does NOT
* delete the old objects — that's reserved for a later release once
* operators have had time to verify the new paths. Old and new
* objects coexist during the soak period.
*/
import type { BrainEngine } from '../../core/engine.ts';
import type { StorageBackend, StorageConfig } from '../../core/storage.ts';
interface LedgerRow {
file_id: number;
storage_path_old: string;
storage_path_new: string;
status: 'pending' | 'copy_done' | 'db_updated' | 'complete' | 'failed';
}
export interface BackfillReport {
total: number;
alreadyComplete: number;
nowComplete: number;
failed: number;
skipped: number;
errors: Array<{ file_id: number; error: string }>;
}
/**
* Process all non-complete ledger rows. Safe to re-run; each row
* resumes from whichever state it was in. Storage is injected so the
* caller can pass a real S3/Supabase backend OR a dry-run stub that
* short-circuits the copy.
*
* If storage is null/undefined the function runs as a dry-run: it
* reports what WOULD be processed without touching objects. This is
* used by the orchestrator when storage isn't configured.
*/
export async function runStorageBackfill(
engine: BrainEngine,
storage: StorageBackend | null,
opts?: { dryRun?: boolean },
): Promise<BackfillReport> {
const report: BackfillReport = {
total: 0,
alreadyComplete: 0,
nowComplete: 0,
failed: 0,
skipped: 0,
errors: [],
};
// Snapshot all ledger rows. We don't paginate because the ledger
// is bounded by current files count — every gbrain install has
// at most low-thousands of files.
const rows = await engine.executeRaw<LedgerRow>(
`SELECT file_id, storage_path_old, storage_path_new, status
FROM file_migration_ledger
ORDER BY file_id`,
);
report.total = rows.length;
for (const row of rows) {
if (row.status === 'complete') {
report.alreadyComplete++;
continue;
}
if (row.status === 'failed') {
report.failed++;
continue;
}
if (opts?.dryRun || !storage) {
// Dry-run: count pending rows but don't advance state.
report.skipped++;
continue;
}
// Drive the state machine. Each transition is its own
// executeRaw call so mid-row crashes leave a recoverable state.
try {
let status = row.status;
// pending → copy_done: COPY the bytes.
if (status === 'pending') {
// If the new path is already populated (e.g. from a previous
// partial run), the copy is redundant but idempotent on S3/
// Supabase where upload overwrites the key.
const exists = await storage.exists(row.storage_path_new).catch(() => false);
if (!exists) {
const data = await storage.download(row.storage_path_old);
await storage.upload(row.storage_path_new, data);
}
await engine.executeRaw(
`UPDATE file_migration_ledger
SET status = 'copy_done', updated_at = now()
WHERE file_id = $1`,
[row.file_id],
);
status = 'copy_done';
}
// copy_done → db_updated: flip files.storage_path to the new
// path. Once this commits, downloads go through the new path
// and the old object is orphaned (but still present in storage
// for rollback within the soak window).
if (status === 'copy_done') {
await engine.executeRaw(
`UPDATE files SET storage_path = $1 WHERE id = $2`,
[row.storage_path_new, row.file_id],
);
await engine.executeRaw(
`UPDATE file_migration_ledger
SET status = 'db_updated', updated_at = now()
WHERE file_id = $1`,
[row.file_id],
);
status = 'db_updated';
}
// db_updated → complete: mark terminal. The old-object delete
// happens in a separate sub-phase (future release) so operators
// can verify the new paths before we drop the safety net.
if (status === 'db_updated') {
await engine.executeRaw(
`UPDATE file_migration_ledger
SET status = 'complete', updated_at = now()
WHERE file_id = $1`,
[row.file_id],
);
report.nowComplete++;
}
} catch (e) {
const msg = e instanceof Error ? e.message : String(e);
report.failed++;
report.errors.push({ file_id: row.file_id, error: msg });
// Mark failed so the next run doesn't retry blindly. Operator
// can reset to 'pending' via SQL once the root cause is fixed.
try {
await engine.executeRaw(
`UPDATE file_migration_ledger
SET status = 'failed', error = $1, updated_at = now()
WHERE file_id = $2`,
[msg.slice(0, 500), row.file_id],
);
} catch {
// Best-effort: if we can't even write 'failed', report the
// original error and move on.
}
}
}
return report;
}
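The resume-on-partial behavior of the loader above can be summarized as a transition table. The sketch below is a simplified, side-effect-free model (the `Action` names and `remainingActions` helper are hypothetical); it only shows which steps a row still needs given the status it was left in, and it leaves out the `failed` state.

```typescript
type LedgerStatus = 'pending' | 'copy_done' | 'db_updated' | 'complete';
type Action = 'copy_object' | 'update_files_row' | 'mark_complete';

// One entry per non-terminal state: what to do, and which state follows.
const TRANSITIONS: Record<Exclude<LedgerStatus, 'complete'>, { action: Action; next: LedgerStatus }> = {
  pending: { action: 'copy_object', next: 'copy_done' },
  copy_done: { action: 'update_files_row', next: 'db_updated' },
  db_updated: { action: 'mark_complete', next: 'complete' },
};

// List the steps still owed for a row found in the given state.
function remainingActions(start: LedgerStatus): Action[] {
  const actions: Action[] = [];
  let status = start;
  while (status !== 'complete') {
    const step = TRANSITIONS[status];
    actions.push(step.action);
    status = step.next;
  }
  return actions;
}
```

A row that crashed after the copy resumes at `update_files_row` without copying again, which is the property the ledger exists to provide.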


@@ -0,0 +1,237 @@
/**
* v0.18.0 migration orchestrator — Multi-source brains.
*
* Split across sub-versions of the migration registry for safety:
* - v16 (Step 1 / Lane A): additive-only. Installs sources table +
* default row. Does NOT break any existing engine code.
* - v17 (Step 2 / Lane B, future): breaking schema changes. Rides with
* the engine API rewrite so ON CONFLICT (source_id, slug) lands
* atomically with the composite UNIQUE.
*
* Phase structure (per /plan-ceo-review + /plan-eng-review):
* A. Schema — gbrain init --migrate-only runs the migration chain up
* to whichever v-prefix has shipped (v16 today, v17 next).
* B. Storage backfill (Step 7, future) — ledger-driven object rewrite.
* C. Verify — assert sources('default') exists today. Composite UNIQUE,
* page_id backfill, and ledger completeness get added in Step 2.
* D. (future) Delete old storage objects — only runs after C green.
*
* Idempotent: safe to re-run on partial state.
*/
import { execSync } from 'child_process';
import type { Migration, OrchestratorOpts, OrchestratorResult, OrchestratorPhaseResult } from './types.ts';
import { appendCompletedMigration } from '../../core/preferences.ts';
import { loadConfig, toEngineConfig } from '../../core/config.ts';
import { createEngine } from '../../core/engine-factory.ts';
// ── Phase A — Schema ────────────────────────────────────────
function phaseASchema(opts: OrchestratorOpts): OrchestratorPhaseResult {
if (opts.dryRun) return { name: 'schema', status: 'skipped', detail: 'dry-run' };
try {
execSync('gbrain init --migrate-only', { stdio: 'inherit', timeout: 600_000, env: process.env });
return { name: 'schema', status: 'complete' };
} catch (e) {
const msg = e instanceof Error ? e.message : String(e);
return { name: 'schema', status: 'failed', detail: msg };
}
}
// ── Phase B — Storage backfill (skeleton, filled by Step 7) ──
async function phaseBBackfillStorage(opts: OrchestratorOpts): Promise<OrchestratorPhaseResult> {
if (opts.dryRun) return { name: 'backfill_storage', status: 'skipped', detail: 'dry-run' };
try {
const config = loadConfig();
if (!config) return { name: 'backfill_storage', status: 'skipped', detail: 'no brain configured' };
const engine = await createEngine(toEngineConfig(config));
await engine.connect(toEngineConfig(config));
try {
if (engine.kind === 'pglite') {
return { name: 'backfill_storage', status: 'skipped', detail: 'pglite (no files table)' };
}
const hasLedger = await engine.executeRaw<{ exists: boolean }>(
`SELECT EXISTS (SELECT 1 FROM information_schema.tables
WHERE table_schema = current_schema()
AND table_name = 'file_migration_ledger') AS exists`,
);
if (!hasLedger[0]?.exists) {
return {
name: 'backfill_storage',
status: 'skipped',
detail: 'file_migration_ledger not yet installed (run apply-migrations first)',
};
}
// Ledger exists. If storage isn't configured, run the dry-run
// path — we can still report the ledger state but we can't
// COPY objects. Operator then wires storage and re-runs.
const storage = config.storage ? await loadStorageBackend(config.storage) : null;
const { runStorageBackfill } = await import('./v0_18_0-storage-backfill.ts');
const report = await runStorageBackfill(engine, storage, { dryRun: !storage });
if (report.total === 0) {
return { name: 'backfill_storage', status: 'complete', detail: 'no files to migrate' };
}
if (report.failed > 0) {
return {
name: 'backfill_storage',
status: 'failed',
detail: `${report.failed}/${report.total} files failed: ${report.errors.slice(0, 3).map(e => `#${e.file_id}: ${e.error.slice(0, 60)}`).join('; ')}`,
};
}
if (report.skipped > 0 && !storage) {
return {
name: 'backfill_storage',
status: 'skipped',
detail: `${report.skipped}/${report.total} files pending; storage backend not configured (wire storage + re-run)`,
};
}
const detail = `${report.total} files: ${report.alreadyComplete} already complete, ${report.nowComplete} newly migrated`;
return { name: 'backfill_storage', status: 'complete', detail };
} finally {
try { await engine.disconnect(); } catch {}
}
} catch (e) {
return {
name: 'backfill_storage',
status: 'failed',
detail: e instanceof Error ? e.message : String(e),
};
}
}
async function loadStorageBackend(storageConfig: unknown): Promise<import('../../core/storage.ts').StorageBackend | null> {
try {
const { createStorage } = await import('../../core/storage.ts');
// eslint-disable-next-line @typescript-eslint/no-explicit-any
return await createStorage(storageConfig as any);
} catch {
return null;
}
}
// ── Phase C — Verify ────────────────────────────────────────
async function phaseCVerify(opts: OrchestratorOpts): Promise<OrchestratorPhaseResult> {
if (opts.dryRun) return { name: 'verify', status: 'skipped', detail: 'dry-run' };
try {
const config = loadConfig();
if (!config) return { name: 'verify', status: 'skipped', detail: 'no brain configured' };
const engine = await createEngine(toEngineConfig(config));
await engine.connect(toEngineConfig(config));
try {
// 1. sources('default') exists (Step 1 / v16).
const defaults = await engine.executeRaw<{ id: string }>(
`SELECT id FROM sources WHERE id = 'default'`,
);
if (defaults.length !== 1) {
return { name: 'verify', status: 'failed', detail: "sources('default') row missing" };
}
// Step 2 checks (composite UNIQUE, links.resolution_type,
// file_migration_ledger completion) are gated on the future v17
// migration. They run conditionally — if the column/constraint
// exists, verify it; if not, that's fine for Step 1.
// Optional: composite UNIQUE if installed (Step 2 future work).
const constraint = await engine.executeRaw<{ conname: string }>(
`SELECT conname FROM pg_constraint WHERE conname = 'pages_source_slug_key'`,
);
// If installed, verify no pages have NULL source_id.
if (constraint.length === 1) {
const nullSources = await engine.executeRaw<{ n: number }>(
`SELECT COUNT(*)::int AS n FROM pages WHERE source_id IS NULL`,
);
if ((nullSources[0]?.n ?? 0) > 0) {
return { name: 'verify', status: 'failed', detail: `${nullSources[0].n} pages with NULL source_id` };
}
}
return { name: 'verify', status: 'complete', detail: 'sources primitive installed' };
} finally {
try { await engine.disconnect(); } catch {}
}
} catch (e) {
return { name: 'verify', status: 'failed', detail: e instanceof Error ? e.message : String(e) };
}
}
// ── Orchestrator ────────────────────────────────────────────
async function orchestrator(opts: OrchestratorOpts): Promise<OrchestratorResult> {
console.log('');
console.log('=== v0.18.0 — Multi-source brains ===');
if (opts.dryRun) console.log(' (dry-run; no side effects)');
console.log('');
const phases: OrchestratorPhaseResult[] = [];
const a = phaseASchema(opts);
phases.push(a);
if (a.status === 'failed') return finalize(phases, 'failed');
const b = await phaseBBackfillStorage(opts);
phases.push(b);
// A Phase B failure does not abort the run. Continue to verify so
// users see the exact gap.
const c = await phaseCVerify(opts);
phases.push(c);
// a.status === 'failed' already early-returned above, so only
// c and b determine the final status here. TypeScript narrowing rejects
// a redundant a.status === 'failed' check.
const status: 'complete' | 'partial' | 'failed' =
c.status === 'failed' ? 'failed' :
b.status === 'failed' ? 'partial' :
'complete';
return finalize(phases, status);
}
function finalize(phases: OrchestratorPhaseResult[], status: 'complete' | 'partial' | 'failed'): OrchestratorResult {
if (status !== 'failed') {
try {
appendCompletedMigration({
version: '0.18.0',
completed_at: new Date().toISOString(),
status: status as 'complete' | 'partial',
phases: phases.map(p => ({ name: p.name, status: p.status })),
});
} catch {
// Best-effort.
}
}
return { version: '0.18.0', status, phases };
}
export const v0_18_0: Migration = {
version: '0.18.0',
featurePitch: {
headline: 'Multi-source brains: one database, many knowledge repos. Federation flag keeps them from polluting each other.',
description:
'v0.18.0 introduces sources — a first-class primitive that lets one gbrain backend hold ' +
'multiple repos (wiki, gstack, yc-media, etc.) with clean scoping. Every page, file, and ' +
'ingest_log row is now scoped to a source. Cross-source search is opt-in per source ' +
'(federated=true) so isolated content (yc-media, garrys-list) never bleeds into your main ' +
'brain. New commands: `gbrain sources add/attach/federate`. Per-directory ' +
'default via .gbrain-source dotfile + GBRAIN_SOURCE env var. See docs/guides/' +
'multi-source-brains.md.',
},
orchestrator,
};
/** Exported for unit tests. */
export const __testing = {
phaseASchema,
phaseBBackfillStorage,
phaseCVerify,
};
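The status roll-up in `orchestrator` reads more easily as a small pure function. This is a restatement for illustration only; `rollUp` and the `PhaseStatus` union (inferred from the values used in this file) are assumptions.

```typescript
type PhaseStatus = 'complete' | 'skipped' | 'failed';
type RunStatus = 'complete' | 'partial' | 'failed';

// Schema or verify failure fails the run; a backfill failure only downgrades it.
function rollUp(schema: PhaseStatus, backfill: PhaseStatus, verify: PhaseStatus): RunStatus {
  if (schema === 'failed' || verify === 'failed') return 'failed';
  if (backfill === 'failed') return 'partial';
  return 'complete';
}
```

Note that a `skipped` phase never downgrades the result, so a run whose backfill was skipped for lack of a storage backend is still recorded as complete.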

src/commands/sources.ts

@@ -0,0 +1,372 @@
/**
* gbrain sources — manage multi-source brain configuration (v0.18.0).
*
* A source is a logical brain-within-the-DB: wiki, gstack, yc-media, etc.
* Every page/file/ingest_log row is scoped to a sources(id) row. Slugs
* are unique per source. See docs/guides/multi-source-brains.md for the
* full story.
*
* Subcommands:
* gbrain sources add <id> --path <path> [--name <display>] [--federated|--no-federated]
* gbrain sources list [--json]
* gbrain sources remove <id> [--yes] [--dry-run] [--keep-storage]
* gbrain sources rename <id> <new-name>
* gbrain sources default <id>
* gbrain sources attach <id> — write .gbrain-source in CWD
* gbrain sources detach — remove .gbrain-source from CWD
* gbrain sources federate <id> — sources.config.federated = true
* gbrain sources unfederate <id> — sources.config.federated = false
*
* NOT in scope for Step 6 (deferred per plan):
* - import-from-github (needs SSRF + clone integration)
* - prune (retention/TTL deferred to v0.18)
* - MCP tool-def regen for full source-scoping of all ops (part of Step 2+5)
*/
import { writeFileSync, unlinkSync, existsSync } from 'fs';
import { join } from 'path';
import type { BrainEngine } from '../core/engine.ts';
// ── Validation ──────────────────────────────────────────────
// Shared with source-resolver.ts — canonical shape.
const SOURCE_ID_RE = /^[a-z0-9](?:[a-z0-9-]{0,30}[a-z0-9])?$/;
function validateSourceId(id: string): void {
if (!SOURCE_ID_RE.test(id)) {
throw new Error(
`Invalid source id "${id}". Must be 1-32 lowercase alnum chars with optional interior hyphens (e.g. "wiki", "yc-media").`,
);
}
}
// ── Types ───────────────────────────────────────────────────
interface SourceRow {
id: string;
name: string;
local_path: string | null;
last_commit: string | null;
last_sync_at: Date | null;
config: Record<string, unknown> | string;
created_at: Date;
}
interface SourceListEntry {
id: string;
name: string;
local_path: string | null;
federated: boolean;
page_count: number;
last_sync_at: string | null;
}
// ── Helpers ─────────────────────────────────────────────────
function parseConfig(config: unknown): Record<string, unknown> {
if (typeof config === 'string') {
try { return JSON.parse(config) as Record<string, unknown>; } catch { return {}; }
}
if (typeof config === 'object' && config !== null) return config as Record<string, unknown>;
return {};
}
function isFederated(config: unknown): boolean {
const parsed = parseConfig(config);
return parsed.federated === true;
}
async function fetchSource(engine: BrainEngine, id: string): Promise<SourceRow | null> {
const rows = await engine.executeRaw<SourceRow>(
`SELECT id, name, local_path, last_commit, last_sync_at, config, created_at
FROM sources WHERE id = $1`,
[id],
);
return rows[0] ?? null;
}
async function countPages(engine: BrainEngine, sourceId: string): Promise<number> {
const rows = await engine.executeRaw<{ n: number }>(
`SELECT COUNT(*)::int AS n FROM pages WHERE source_id = $1`,
[sourceId],
);
return rows[0]?.n ?? 0;
}
// ── Subcommand: add ─────────────────────────────────────────
async function runAdd(engine: BrainEngine, args: string[]): Promise<void> {
const id = args[0];
if (!id) {
console.error('Usage: gbrain sources add <id> --path <path> [--name <display>] [--federated|--no-federated]');
process.exit(2);
}
validateSourceId(id);
let localPath: string | null = null;
let displayName = id;
let federated: boolean | null = null; // null = default (false for new, opt-in via --federated)
for (let i = 1; i < args.length; i++) {
const a = args[i];
if (a === '--path') { localPath = args[++i]; continue; }
if (a === '--name') { displayName = args[++i]; continue; }
if (a === '--federated') { federated = true; continue; }
if (a === '--no-federated') { federated = false; continue; }
console.error(`Unknown flag: ${a}`);
process.exit(2);
}
// Overlapping path guard: reject if new path is inside or contains an
// existing source's local_path (per eng review §4 finding 4.1).
// Throwing (vs process.exit) keeps this testable via the standard
// CLI error-handling wrapper in src/cli.ts.
if (localPath) {
const others = await engine.executeRaw<{ id: string; local_path: string }>(
`SELECT id, local_path FROM sources WHERE local_path IS NOT NULL AND id != $1`,
[id],
);
for (const other of others) {
const a = localPath;
const b = other.local_path;
if (a === b || a.startsWith(b + '/') || b.startsWith(a + '/')) {
throw new Error(
`path "${a}" overlaps with existing source "${other.id}" at "${b}". ` +
`Overlapping sources are not allowed — same files would ingest twice under different source_ids.`,
);
}
}
}
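const config = federated === null ? {} : { federated };
// ON CONFLICT DO NOTHING would silently skip a duplicate id, and the
// fetchSource() below would then return the pre-existing row and report
// it as newly created. Reject duplicates up front instead.
if (await fetchSource(engine, id)) {
console.error(`Source "${id}" already exists. Choose another id or remove it first.`);
process.exit(4);
}
await engine.executeRaw(
`INSERT INTO sources (id, name, local_path, config)
VALUES ($1, $2, $3, $4::jsonb)
ON CONFLICT (id) DO NOTHING`,
[id, displayName, localPath, JSON.stringify(config)],
);
const created = await fetchSource(engine, id);
if (!created) {
console.error(`Failed to create source "${id}"`);
process.exit(4);
}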
const fed = isFederated(created.config);
console.log(`Created source "${id}"${displayName !== id ? ` (name: ${displayName})` : ''}${localPath ? ` → ${localPath}` : ''}`);
console.log(` federated: ${fed}${fed ? ' — appears in cross-source default search' : ' — only searched when explicitly named via --source'}`);
}
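
The overlapping-path guard in runAdd can be isolated as a pure predicate. The sketch below is illustrative only: `pathsOverlap` and `withTrailingSlash` are hypothetical names that do not appear in the diff, and the trailing-slash normalization is an added assumption (the guard above compares the raw strings).

```typescript
// Hypothetical helper (not in the diff): give every path exactly one
// trailing slash so "/repos/wiki" and "/repos/wiki/" compare equal.
function withTrailingSlash(p: string): string {
  return p.endsWith('/') ? p : p + '/';
}

// Two source paths overlap when they are equal or one is nested inside
// the other. Comparing slash-terminated strings keeps siblings that
// merely share a prefix ("/repos/wiki" vs "/repos/wiki-archive") apart.
function pathsOverlap(a: string, b: string): boolean {
  const x = withTrailingSlash(a);
  const y = withTrailingSlash(b);
  return x.startsWith(y) || y.startsWith(x);
}

console.log(pathsOverlap('/repos/wiki', '/repos/wiki/notes'));   // true (nested)
console.log(pathsOverlap('/repos/wiki', '/repos/wiki-archive')); // false (sibling)
```

Keeping the check free of engine calls makes it easy to unit-test apart from the CLI wrapper.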
// ── Subcommand: list ────────────────────────────────────────
async function runList(engine: BrainEngine, args: string[]): Promise<void> {
const json = args.includes('--json');
const rows = await engine.executeRaw<SourceRow>(
`SELECT id, name, local_path, last_commit, last_sync_at, config, created_at
FROM sources ORDER BY (id = 'default') DESC, id`,
);
const entries: SourceListEntry[] = [];
for (const r of rows) {
const pageCount = await countPages(engine, r.id);
entries.push({
id: r.id,
name: r.name,
local_path: r.local_path,
federated: isFederated(r.config),
page_count: pageCount,
last_sync_at: r.last_sync_at ? new Date(r.last_sync_at).toISOString() : null,
});
}
if (json) {
console.log(JSON.stringify({ sources: entries }, null, 2));
return;
}
// Human-readable table.
console.log('SOURCES');
console.log('───────');
for (const e of entries) {
const fedMark = e.federated ? 'federated' : 'isolated';
const pathStr = e.local_path ?? '(no local path)';
const sync = e.last_sync_at ? `last sync ${e.last_sync_at}` : 'never synced';
console.log(` ${e.id.padEnd(20)} ${fedMark.padEnd(10)} ${String(e.page_count).padStart(6)} pages ${sync}`);
if (e.local_path) console.log(` ${' '.repeat(22)}${pathStr}`);
}
if (entries.length === 0) console.log(' (no sources registered)');
}
// ── Subcommand: remove ──────────────────────────────────────
async function runRemove(engine: BrainEngine, args: string[]): Promise<void> {
const id = args[0];
if (!id) {
console.error('Usage: gbrain sources remove <id> [--yes] [--dry-run] [--keep-storage]');
process.exit(2);
}
const yes = args.includes('--yes');
const dryRun = args.includes('--dry-run');
// NOTE: --keep-storage is accepted for forward compatibility but has no
// effect until Step 7 wires in explicit storage object deletion.
const _keepStorage = args.includes('--keep-storage');
void _keepStorage;
if (id === 'default') {
console.error('Error: cannot remove the "default" source (it backs the pre-v0.17 brain).');
process.exit(3);
}
const src = await fetchSource(engine, id);
if (!src) {
console.error(`Source "${id}" not found.`);
process.exit(4);
}
const pageCount = await countPages(engine, id);
console.log(`Source "${id}" → ${pageCount} pages will be deleted (cascade).`);
if (dryRun) {
console.log(`(dry-run; no side effects)`);
return;
}
if (!yes) {
console.error(`Refusing to remove without --yes. Pass --yes to confirm.`);
process.exit(5);
}
await engine.executeRaw(`DELETE FROM sources WHERE id = $1`, [id]);
console.log(`Removed source "${id}" (${pageCount} pages + dependent rows cascaded).`);
}
// ── Subcommand: rename ──────────────────────────────────────
async function runRename(engine: BrainEngine, args: string[]): Promise<void> {
const id = args[0];
const newName = args[1];
if (!id || !newName) {
console.error('Usage: gbrain sources rename <id> <new-display-name>');
process.exit(2);
}
const src = await fetchSource(engine, id);
if (!src) {
console.error(`Source "${id}" not found.`);
process.exit(4);
}
await engine.executeRaw(`UPDATE sources SET name = $1 WHERE id = $2`, [newName, id]);
console.log(`Renamed source "${id}" display: ${src.name} → ${newName} (id is immutable).`);
}
// ── Subcommand: default ─────────────────────────────────────
async function runDefault(engine: BrainEngine, args: string[]): Promise<void> {
const id = args[0];
if (!id) {
console.error('Usage: gbrain sources default <id>');
process.exit(2);
}
const src = await fetchSource(engine, id);
if (!src) {
console.error(`Source "${id}" not found.`);
process.exit(4);
}
// Stored in the config table (not sources.config) because it is a
// brain-level preference, not a per-source setting.
await engine.setConfig('sources.default', id);
console.log(`Default source set to "${id}".`);
}
// ── Subcommand: attach / detach (CWD dotfile) ──────────────
function runAttach(args: string[]): void {
const id = args[0];
if (!id) {
console.error('Usage: gbrain sources attach <id>');
process.exit(2);
}
validateSourceId(id);
const dotfile = join(process.cwd(), '.gbrain-source');
writeFileSync(dotfile, id + '\n', 'utf8');
console.log(`Attached ${process.cwd()} to source "${id}" via .gbrain-source.`);
console.log(`Commands run from this directory (or any subdirectory) will default to this source.`);
}
function runDetach(): void {
const dotfile = join(process.cwd(), '.gbrain-source');
if (!existsSync(dotfile)) {
console.log(`No .gbrain-source file in ${process.cwd()}.`);
return;
}
unlinkSync(dotfile);
console.log(`Detached ${process.cwd()} (removed .gbrain-source).`);
}
// ── Subcommand: federate / unfederate ───────────────────────
async function runFederate(engine: BrainEngine, args: string[], value: boolean): Promise<void> {
const id = args[0];
if (!id) {
console.error(`Usage: gbrain sources ${value ? 'federate' : 'unfederate'} <id>`);
process.exit(2);
}
const src = await fetchSource(engine, id);
if (!src) {
console.error(`Source "${id}" not found.`);
process.exit(4);
}
const config = parseConfig(src.config);
config.federated = value;
await engine.executeRaw(
`UPDATE sources SET config = $1::jsonb WHERE id = $2`,
[JSON.stringify(config), id],
);
console.log(`Source "${id}" is now ${value ? 'federated (appears in cross-source default search)' : 'isolated (only searched when explicitly named)'}.`);
}
// ── Dispatcher ──────────────────────────────────────────────
export async function runSources(engine: BrainEngine, args: string[]): Promise<void> {
const sub = args[0];
const rest = args.slice(1);
switch (sub) {
case 'add': return runAdd(engine, rest);
case 'list': return runList(engine, rest);
case 'remove': return runRemove(engine, rest);
case 'rename': return runRename(engine, rest);
case 'default': return runDefault(engine, rest);
case 'attach': runAttach(rest); return;
case 'detach': runDetach(); return;
case 'federate': return runFederate(engine, rest, true);
case 'unfederate': return runFederate(engine, rest, false);
case undefined:
case '--help':
case '-h':
printHelp();
return;
default:
console.error(`Unknown sources subcommand: ${sub}`);
printHelp();
process.exit(2);
}
}
function printHelp(): void {
console.log(`gbrain sources — manage multi-source brain configuration (v0.18.0)
Subcommands:
add <id> --path <p> [--name <n>] [--federated|--no-federated]
Register a new source.
list [--json] List registered sources with page counts.
remove <id> [--yes] [--dry-run] Cascade-delete a source and its pages.
rename <id> <new-name> Rename display name (id is immutable).
default <id> Set the brain-level default source.
attach <id> Write .gbrain-source in CWD (like kubectl context).
detach Remove .gbrain-source from CWD.
federate <id> Make source appear in cross-source default search.
unfederate <id> Isolate source from default search.
Source id: [a-z0-9-]{1,32}. Immutable citation key.
`);
}

@@ -41,6 +41,14 @@ export interface SyncOpts {
skipFailed?: boolean;
/** Bug 9 — re-attempt unacknowledged failures explicitly (CLI --retry-failed). */
retryFailed?: boolean;
/**
* v0.18.0 Step 5 — sync a specific named source. When set, sync reads
* local_path + last_commit from the sources table (not the global
* config.sync.* keys) and writes last_commit + last_sync_at back to
* the same row. Backward compat: when undefined, sync uses the
* pre-v0.17 global-config path unchanged.
*/
sourceId?: string;
}
function git(repoPath: string, ...args: string[]): string {
@@ -50,11 +58,60 @@ function git(repoPath: string, ...args: string[]): string {
}).trim();
}
// v0.18.0 Step 5: source-scoped sync state helpers. When opts.sourceId
// is set, read/write the per-source row instead of the global config
// keys. These wrappers centralize the branch so every read/write site
// picks the right storage — future Step 5 work (failure-tracking per
// source) hooks here too.
async function readSyncAnchor(
engine: BrainEngine,
sourceId: string | undefined,
which: 'repo_path' | 'last_commit',
): Promise<string | null> {
if (sourceId) {
const col = which === 'repo_path' ? 'local_path' : 'last_commit';
const rows = await engine.executeRaw<Record<string, string | null>>(
`SELECT ${col} AS value FROM sources WHERE id = $1`,
[sourceId],
);
return rows[0]?.value ?? null;
}
return await engine.getConfig(`sync.${which}`);
}
async function writeSyncAnchor(
engine: BrainEngine,
sourceId: string | undefined,
which: 'repo_path' | 'last_commit',
value: string,
): Promise<void> {
if (sourceId) {
const col = which === 'repo_path' ? 'local_path' : 'last_commit';
// last_sync_at bookmarked on every last_commit advance.
if (which === 'last_commit') {
await engine.executeRaw(
`UPDATE sources SET last_commit = $1, last_sync_at = now() WHERE id = $2`,
[value, sourceId],
);
} else {
await engine.executeRaw(
`UPDATE sources SET ${col} = $1 WHERE id = $2`,
[value, sourceId],
);
}
return;
}
await engine.setConfig(`sync.${which}`, value);
}
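
To make the routing in readSyncAnchor / writeSyncAnchor concrete, here is a minimal in-memory model. `FakeStore` and `SourceState` are hypothetical stand-ins for the sources table and the global config table. The real helpers go through `engine.executeRaw`, `engine.getConfig` and `engine.setConfig`, and the timestamp is passed in here rather than taken from `now()`.

```typescript
type AnchorKey = 'repo_path' | 'last_commit';

// Simplified shape of one sources row (assumption for illustration).
interface SourceState {
  local_path: string | null;
  last_commit: string | null;
  last_sync_at: string | null;
}

class FakeStore {
  sources = new Map<string, SourceState>();
  config = new Map<string, string>();

  read(sourceId: string | undefined, which: AnchorKey): string | null {
    if (sourceId) {
      const row = this.sources.get(sourceId);
      if (!row) return null;
      return which === 'repo_path' ? row.local_path : row.last_commit;
    }
    return this.config.get(`sync.${which}`) ?? null;
  }

  write(sourceId: string | undefined, which: AnchorKey, value: string, now: string): void {
    if (sourceId) {
      const row = this.sources.get(sourceId);
      if (!row) return; // an UPDATE that matches no row changes nothing
      if (which === 'last_commit') {
        row.last_commit = value;
        row.last_sync_at = now; // bookmarked on every last_commit advance
      } else {
        row.local_path = value;
      }
      return;
    }
    this.config.set(`sync.${which}`, value);
  }
}

const store = new FakeStore();
store.sources.set('wiki', { local_path: '/repos/wiki', last_commit: null, last_sync_at: null });
store.config.set('sync.last_commit', 'aaa111');

// A source-scoped write touches only the 'wiki' row.
store.write('wiki', 'last_commit', 'bbb222', '2025-01-01T00:00:00.000Z');
console.log(store.read('wiki', 'last_commit'));    // bbb222
console.log(store.read(undefined, 'last_commit')); // aaa111 (global config untouched)
```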
export async function performSync(engine: BrainEngine, opts: SyncOpts): Promise<SyncResult> {
// Resolve repo path
const repoPath = opts.repoPath || await engine.getConfig('sync.repo_path');
const repoPath = opts.repoPath || await readSyncAnchor(engine, opts.sourceId, 'repo_path');
if (!repoPath) {
throw new Error('No repo path specified. Use --repo or run gbrain init with --repo first.');
const hint = opts.sourceId
? `Source "${opts.sourceId}" has no local_path. Run: gbrain sources add ${opts.sourceId} --path <path>`
: `No repo path specified. Use --repo or run gbrain init with --repo first.`;
throw new Error(hint);
}
// Validate git repo
@@ -84,8 +141,8 @@ export async function performSync(engine: BrainEngine, opts: SyncOpts): Promise<
throw new Error(`No commits in repo ${repoPath}. Make at least one commit before syncing.`);
}
// Read sync state
const lastCommit = opts.full ? null : await engine.getConfig('sync.last_commit');
// Read sync state (source-scoped when sourceId is set, global otherwise)
const lastCommit = opts.full ? null : await readSyncAnchor(engine, opts.sourceId, 'last_commit');
// Ancestry validation: if lastCommit exists, verify it's still in history
if (lastCommit) {
@@ -175,7 +232,7 @@ export async function performSync(engine: BrainEngine, opts: SyncOpts): Promise<
if (totalChanges === 0) {
// Update sync state even with no syncable changes (git advanced)
await engine.setConfig('sync.last_commit', headCommit);
await writeSyncAnchor(engine, opts.sourceId, 'last_commit', headCommit);
await engine.setConfig('sync.last_run', new Date().toISOString());
return {
status: 'up_to_date',
@@ -296,7 +353,7 @@ export async function performSync(engine: BrainEngine, opts: SyncOpts): Promise<
);
// Update last_run + repo_path (progress on infra) but NOT last_commit.
await engine.setConfig('sync.last_run', new Date().toISOString());
await engine.setConfig('sync.repo_path', repoPath);
await writeSyncAnchor(engine, opts.sourceId, 'repo_path', repoPath);
return {
status: 'blocked_by_failures',
fromCommit: lastCommit,
@@ -318,10 +375,11 @@ export async function performSync(engine: BrainEngine, opts: SyncOpts): Promise<
}
}
// Update sync state AFTER all changes succeed
await engine.setConfig('sync.last_commit', headCommit);
// Update sync state AFTER all changes succeed (source-scoped when
// opts.sourceId is set, global config otherwise).
await writeSyncAnchor(engine, opts.sourceId, 'last_commit', headCommit);
await engine.setConfig('sync.last_run', new Date().toISOString());
await engine.setConfig('sync.repo_path', repoPath);
await writeSyncAnchor(engine, opts.sourceId, 'repo_path', repoPath);
// Log ingest
await engine.logIngest({
@@ -423,7 +481,7 @@ async function performFullSync(
`Fix the YAML in those files and re-run, or use '--skip-failed'.`,
);
await engine.setConfig('sync.last_run', new Date().toISOString());
await engine.setConfig('sync.repo_path', repoPath);
await writeSyncAnchor(engine, opts.sourceId, 'repo_path', repoPath);
return {
status: 'blocked_by_failures',
fromCommit: null,
@@ -439,10 +497,12 @@ async function performFullSync(
if (acked > 0) console.error(` Acknowledged ${acked} failure(s) and advancing past them.`);
}
// Persist sync state so next sync is incremental (C1 fix: was missing)
await engine.setConfig('sync.last_commit', headCommit);
// Persist sync state so next sync is incremental (C1 fix: was missing).
// v0.18.0 Step 5: routed through writeSyncAnchor so --source pins it
// to the right sources row rather than the global config.
await writeSyncAnchor(engine, opts.sourceId, 'last_commit', headCommit);
await engine.setConfig('sync.last_run', new Date().toISOString());
await engine.setConfig('sync.repo_path', repoPath);
await writeSyncAnchor(engine, opts.sourceId, 'repo_path', repoPath);
// Full sync doesn't track pagesAffected, so fall back to embed --stale.
// Before commit 2: runEmbed is void; use result.imported as best estimate of
@@ -482,7 +542,17 @@ export async function runSync(engine: BrainEngine, args: string[]) {
const skipFailed = args.includes('--skip-failed');
const retryFailed = args.includes('--retry-failed');
const opts: SyncOpts = { repoPath, dryRun, full, noPull, noEmbed, skipFailed, retryFailed };
// v0.18.0 Step 5: --source resolves to a sources(id) row. Falls back
// to pre-v0.17 global config (sync.repo_path + sync.last_commit) when
// no flag, no env, no dotfile is present.
const explicitSource = args.find((a, i) => args[i - 1] === '--source') || null;
let sourceId: string | undefined = undefined;
if (explicitSource || process.env.GBRAIN_SOURCE) {
const { resolveSourceId } = await import('../core/source-resolver.ts');
sourceId = await resolveSourceId(engine, explicitSource);
}
const opts: SyncOpts = { repoPath, dryRun, full, noPull, noEmbed, skipFailed, retryFailed, sourceId };
// Bug 9 — --retry-failed: before running normal sync, clear acknowledgment
// flags so the sync picks them up as fresh work. The actual re-attempt

@@ -28,6 +28,21 @@ export interface LinkBatchInput {
origin_slug?: string;
/** Frontmatter field name (e.g. 'key_people', 'investors'). */
origin_field?: string;
/**
* v0.18.0: source id for each endpoint. When omitted, the engine JOINs
* against `source_id='default'`. Pass explicit values when the edge
* lives in a non-default source OR crosses sources.
*
* Without these fields, the batch JOIN `pages.slug = v.from_slug` fans
* out across every source containing that slug, silently creating wrong
* edges in a multi-source brain. The source_id filter eliminates the
* fan-out. Origin pages (frontmatter provenance) get their own
* source_id so reconciliation can't delete edges from another source's
* frontmatter.
*/
from_source_id?: string;
to_source_id?: string;
origin_source_id?: string;
}
/** Input row for addTimelineEntriesBatch. Optional fields default to '' (matches NOT NULL DDL). */
@@ -37,6 +52,12 @@ export interface TimelineBatchInput {
source?: string;
summary: string;
detail?: string;
/**
* v0.18.0: source id for the owning page. When omitted, the engine JOINs
* against `source_id='default'`. Without this, two pages sharing the
* same slug across sources would fan out timeline rows to both.
*/
source_id?: string;
}
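
The fan-out described in these doc comments can be reproduced with a small in-memory join. The sketch assumes simplified `Page` and `Edge` shapes and two hypothetical resolver functions. It is not the engine's SQL, only the same matching rule written out by hand.

```typescript
interface Page { id: number; source_id: string; slug: string }
interface Edge {
  from_slug: string;
  to_slug: string;
  from_source_id?: string;
  to_source_id?: string;
}

// Pre-v0.18 behaviour: match endpoints on slug alone.
function resolveBySlugOnly(pages: Page[], e: Edge): Array<[number, number]> {
  const out: Array<[number, number]> = [];
  for (const f of pages) {
    for (const t of pages) {
      if (f.slug === e.from_slug && t.slug === e.to_slug) out.push([f.id, t.id]);
    }
  }
  return out;
}

// v0.18 behaviour: match on (slug, source_id); omitted ids mean 'default'.
function resolveBySourceAndSlug(pages: Page[], e: Edge): Array<[number, number]> {
  const fromSrc = e.from_source_id || 'default';
  const toSrc = e.to_source_id || 'default';
  const out: Array<[number, number]> = [];
  for (const f of pages) {
    for (const t of pages) {
      if (
        f.slug === e.from_slug && f.source_id === fromSrc &&
        t.slug === e.to_slug && t.source_id === toSrc
      ) out.push([f.id, t.id]);
    }
  }
  return out;
}

const pages: Page[] = [
  { id: 1, source_id: 'default', slug: 'people/alice' },
  { id: 2, source_id: 'default', slug: 'companies/acme' },
  { id: 3, source_id: 'work', slug: 'people/alice' },
  { id: 4, source_id: 'work', slug: 'companies/acme' },
];
const edge: Edge = {
  from_slug: 'people/alice', to_slug: 'companies/acme',
  from_source_id: 'work', to_source_id: 'work',
};

console.log(resolveBySlugOnly(pages, edge).length);      // 4 (three of them wrong)
console.log(resolveBySourceAndSlug(pages, edge).length); // 1
```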
/** Maximum results returned by search operations. Internal bulk operations (listPages) are not clamped. */

@@ -24,8 +24,19 @@ export interface EntityRef {
slug: string;
/** Top-level directory ("people" | "companies" | etc.). */
dir: string;
/**
* v0.17.0: source id when the link was qualified as `[[source:slug]]`.
* `null` means unqualified — the caller resolves via local-first fallback
* at extraction time. Mirrors links.resolution_type:
* - sourceId set → 'qualified'
* - sourceId null → 'unqualified'
*/
sourceId?: string | null;
}
/** v0.17.0: how a link's target source was pinned at extraction time. */
export type LinkResolutionType = 'qualified' | 'unqualified';
/**
* Directory prefix whitelist. These are the top-level slug dirs the extractor
* recognizes as entity references. Upstream canonical + our extensions:
@@ -63,6 +74,23 @@ const WIKILINK_RE = new RegExp(
'g',
);
/**
* v0.17.0: qualified wikilink `[[source-id:dir/slug]]` or
* `[[source-id:dir/slug|Display Text]]`. The source-id segment pins the
* target to a specific sources(id) row, overriding the local-first
* fallback used by unqualified `[[slug]]` references.
*
* Captures: sourceId, slug (dir/...), displayName (optional).
*
* Matched BEFORE WIKILINK_RE so `[[wiki:topics/ai]]` isn't mis-parsed by
* the unqualified regex (the source prefix would not satisfy DIR_PATTERN
* anyway, but the two-pass approach keeps intent crystal-clear).
*/
const QUALIFIED_WIKILINK_RE = new RegExp(
`\\[\\[([a-z0-9](?:[a-z0-9-]{0,30}[a-z0-9])?):(${DIR_PATTERN}\\/[^|\\]#]+?)(?:#[^|\\]]*?)?(?:\\|([^\\]]+?))?\\]\\]`,
'g',
);
/**
* Strip fenced code blocks (```...```) and inline code (`...`) from markdown,
* replacing them with whitespace of equivalent length. Preserves byte offsets
@@ -112,6 +140,9 @@ export function extractEntityRefs(content: string): EntityRef[] {
let match: RegExpExecArray | null;
// 1. Markdown links: [Name](path)
// Markdown links have no source-qualification syntax — they're
// always unqualified. Omit sourceId so the shape stays compatible
// with pre-v0.17 consumers doing strict equality.
const mdPattern = new RegExp(ENTITY_REF_RE.source, ENTITY_REF_RE.flags);
while ((match = mdPattern.exec(stripped)) !== null) {
const name = match[1];
@@ -121,9 +152,28 @@ export function extractEntityRefs(content: string): EntityRef[] {
refs.push({ name, slug, dir });
}
// 2. Obsidian wikilinks: [[path]] or [[path|Display Text]]
// 2a. v0.17.0 qualified wikilinks: [[source-id:path]] or [[source-id:path|Display]]
// Must run BEFORE the unqualified pass or we'd double-emit. We also
// mask out the matched spans so pass 2b can't grab them.
const qualifiedRanges: Array<[number, number]> = [];
const qualPattern = new RegExp(QUALIFIED_WIKILINK_RE.source, QUALIFIED_WIKILINK_RE.flags);
while ((match = qualPattern.exec(stripped)) !== null) {
const sourceId = match[1];
let slug = match[2].trim();
if (!slug) continue;
if (slug.includes('://')) continue;
if (slug.endsWith('.md')) slug = slug.slice(0, -3);
const displayName = (match[3] || slug).trim();
const dir = slug.split('/')[0];
refs.push({ name: displayName, slug, dir, sourceId });
qualifiedRanges.push([match.index, match.index + match[0].length]);
}
// 2b. Unqualified Obsidian wikilinks: [[path]] or [[path|Display Text]]
// Same shape rule: omit sourceId when unqualified.
const unmasked = maskRanges(stripped, qualifiedRanges);
const wikiPattern = new RegExp(WIKILINK_RE.source, WIKILINK_RE.flags);
while ((match = wikiPattern.exec(stripped)) !== null) {
while ((match = wikiPattern.exec(unmasked)) !== null) {
let slug = match[1].trim();
if (!slug) continue;
if (slug.includes('://')) continue;
@@ -136,6 +186,20 @@ export function extractEntityRefs(content: string): EntityRef[] {
return refs;
}
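/**
 * Replace the given [start, end) index ranges with spaces, preserving
 * offsets. Indices are string (UTF-16 code unit) positions as reported by
 * RegExp match.index. Used by extractEntityRefs to prevent the unqualified
 * wikilink regex from matching inside a qualified wikilink span.
 */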
function maskRanges(content: string, ranges: Array<[number, number]>): string {
if (ranges.length === 0) return content;
const chars = content.split('');
for (const [s, e] of ranges) {
for (let i = s; i < e && i < chars.length; i++) chars[i] = ' ';
}
return chars.join('');
}
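
A compact model of the two-pass extraction (qualified first, then unqualified over a masked copy) may help show why the masking step matters. The regexes below are deliberately looser than QUALIFIED_WIKILINK_RE and WIKILINK_RE (they skip the DIR_PATTERN whitelist and anchor handling), and `Ref`, `maskSpans` and `extractRefs` are hypothetical names used only for this sketch.

```typescript
interface Ref { slug: string; sourceId?: string }

// Simplified patterns (assumptions): no directory whitelist, no #anchors.
const QUALIFIED = /\[\[([a-z0-9-]+):([^|\]]+)(?:\|[^\]]+)?\]\]/g;
const UNQUALIFIED = /\[\[([^|\]]+)(?:\|[^\]]+)?\]\]/g;

// Blank out [start, end) spans with spaces so later offsets stay valid.
function maskSpans(text: string, spans: Array<[number, number]>): string {
  const chars = text.split('');
  for (const [s, e] of spans) {
    for (let i = s; i < e && i < chars.length; i++) chars[i] = ' ';
  }
  return chars.join('');
}

function extractRefs(text: string): Ref[] {
  const refs: Ref[] = [];
  const spans: Array<[number, number]> = [];
  let m: RegExpExecArray | null;

  // Pass 1: qualified links carry an explicit source id.
  const qual = new RegExp(QUALIFIED.source, 'g');
  while ((m = qual.exec(text)) !== null) {
    refs.push({ slug: m[2].trim(), sourceId: m[1] });
    spans.push([m.index, m.index + m[0].length]);
  }

  // Pass 2: unqualified links, scanned over the masked copy so the
  // spans consumed in pass 1 cannot be emitted a second time.
  const plain = new RegExp(UNQUALIFIED.source, 'g');
  const masked = maskSpans(text, spans);
  while ((m = plain.exec(masked)) !== null) {
    refs.push({ slug: m[1].trim() });
  }
  return refs;
}

const refs = extractRefs('See [[wiki:topics/ai]] and [[people/alice|Alice]].');
console.log(refs.length); // 2
```

With this looser unqualified pattern, skipping the mask would emit a third ref with slug `wiki:topics/ai`, which is the double emission the two-pass comment warns about.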
// ─── Link candidates (richer than EntityRef) ────────────────────
export interface LinkCandidate {

@@ -449,6 +449,201 @@ export const MIGRATIONS: Migration[] = [
}
},
},
{
version: 23,
name: 'files_source_id_page_id_ledger',
// v0.18.0 Step 7 (Lane E) — additive only: adds files.source_id and
// files.page_id columns + creates the file_migration_ledger that
// drives phase-B storage object rewrites. Does NOT drop page_slug
// yet (kept for backward compat; a later release cleans up once the
// page_id FK is proven). PGLite has no files table, so this
// migration is Postgres-only via a handler gate.
//
// Ledger PK is file_id (not storage_path_old) — two sources CAN
// share an old path during migration, so a composite would be
// wrong. Codex second-pass review caught this.
//
// State machine per row:
// pending → copy_done → db_updated → complete
// any state → failed (with error detail)
//
// Phase B in the v0_18_0 orchestrator processes `status != complete`
// rows. Re-runnable: resumes from whichever state it stopped in.
sql: '',
handler: async (engine) => {
if (engine.kind === 'pglite') return;
await engine.runMigration(19, `
-- 1a. source_id with DEFAULT 'default' (idempotent)
ALTER TABLE files ADD COLUMN IF NOT EXISTS source_id TEXT
NOT NULL DEFAULT 'default' REFERENCES sources(id) ON DELETE CASCADE;
CREATE INDEX IF NOT EXISTS idx_files_source_id ON files(source_id);
-- 1b. page_id (nullable; pre-v0.17 files pointed at page_slug
-- which was ON DELETE SET NULL, so we keep the same nullable
-- semantic — orphaned files are legal).
ALTER TABLE files ADD COLUMN IF NOT EXISTS page_id INTEGER
REFERENCES pages(id) ON DELETE SET NULL;
CREATE INDEX IF NOT EXISTS idx_files_page_id ON files(page_id);
`);
await engine.runMigration(19, `
-- 1c. Backfill page_id from existing page_slug. Scoped to
-- source_id='default' because pre-v0.17 pages ALL lived in
-- the default source. Without this scope, after new sources
-- get added mid-migration, the JOIN could hit the wrong
-- page (different source, same slug).
UPDATE files f
SET page_id = p.id
FROM pages p
WHERE f.page_slug = p.slug
AND p.source_id = 'default'
AND f.page_id IS NULL;
`);
await engine.runMigration(19, `
-- 2. file_migration_ledger — drives the storage object rewrite
-- in the v0_18_0 orchestrator's phase B. Seeded from current
-- files rows; re-seed is idempotent via NOT EXISTS guard.
CREATE TABLE IF NOT EXISTS file_migration_ledger (
file_id INTEGER PRIMARY KEY REFERENCES files(id) ON DELETE CASCADE,
storage_path_old TEXT NOT NULL,
storage_path_new TEXT NOT NULL,
status TEXT NOT NULL DEFAULT 'pending',
error TEXT,
updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
CONSTRAINT chk_ledger_status CHECK (status IN ('pending','copy_done','db_updated','complete','failed'))
);
CREATE INDEX IF NOT EXISTS idx_file_migration_ledger_status
ON file_migration_ledger(status) WHERE status != 'complete';
-- Seed the ledger with every existing file. New path prefixes
-- source_id so multi-source can land assets under their own
-- bucket path without collision.
INSERT INTO file_migration_ledger (file_id, storage_path_old, storage_path_new, status)
SELECT
f.id,
f.storage_path,
COALESCE(f.source_id, 'default') || '/' || f.storage_path,
'pending'
FROM files f
WHERE NOT EXISTS (
SELECT 1 FROM file_migration_ledger l WHERE l.file_id = f.id
);
`);
},
},
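
The ledger's per-row state machine (pending → copy_done → db_updated → complete, with any state able to move to failed) can be sketched as a small transition table. This is a hypothetical model, not the orchestrator's code: `advance`, `markFailed` and `needsWork` are illustrative names, and treating `failed` as having no forward transition is an assumption, since the diff does not show how failed rows are retried.

```typescript
type LedgerStatus = 'pending' | 'copy_done' | 'db_updated' | 'complete' | 'failed';

// Forward transitions only. null marks a state with no next step here.
const NEXT: Record<LedgerStatus, LedgerStatus | null> = {
  pending: 'copy_done',
  copy_done: 'db_updated',
  db_updated: 'complete',
  complete: null,
  failed: null, // assumption: retry policy for failed rows is not modelled
};

function advance(status: LedgerStatus): LedgerStatus {
  const next = NEXT[status];
  if (next === null) throw new Error(`no forward transition from "${status}"`);
  return next;
}

// Any state may move to failed (the real table also stores an error detail).
function markFailed(_status: LedgerStatus): LedgerStatus {
  return 'failed';
}

// Mirrors the phase-B filter: every row whose status is not 'complete'.
function needsWork(status: LedgerStatus): boolean {
  return status !== 'complete';
}

// A row that stopped at copy_done resumes from there, not from pending.
let status: LedgerStatus = 'copy_done';
while (needsWork(status)) status = advance(status);
console.log(status); // complete
```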
{
version: 22,
name: 'links_resolution_type',
// v0.18.0 Step 4 (Lane B) — adds links.resolution_type column so
// each edge records whether its target source was pinned at
// extraction time via `[[source:slug]]` (qualified) or resolved
// via local-first fallback (unqualified). Unqualified edges are
// candidates for re-resolution via `gbrain extract
// --refresh-unqualified` when the source topology changes.
//
// Nullable because legacy edges (pre-v0.17) have no resolution
// concept. `frontmatter` and `manual` edges remain NULL — they're
// not subject to staleness under source churn.
sql: `
ALTER TABLE links ADD COLUMN IF NOT EXISTS resolution_type TEXT;
DO $$ BEGIN
IF NOT EXISTS (
SELECT 1 FROM pg_constraint WHERE conname = 'links_resolution_type_check'
) THEN
ALTER TABLE links ADD CONSTRAINT links_resolution_type_check
CHECK (resolution_type IS NULL OR resolution_type IN ('qualified', 'unqualified'));
END IF;
END $$;
`,
},
{
version: 21,
name: 'pages_source_id_composite_unique',
// v0.18.0 Step 2 (Lane B) — adds pages.source_id with DEFAULT 'default'
// and swaps the global UNIQUE(slug) for the composite UNIQUE(source_id,
// slug). Lands alongside the engine SQL rewrite that makes every
// ON CONFLICT (slug) → ON CONFLICT (source_id, slug) so the constraint
// swap is atomic with the code that writes under it.
//
// DEFAULT 'default' is load-bearing: closes the Codex-flagged race
// where an INSERT between ADD COLUMN and SET NOT NULL could leave
// source_id NULL. Because the default already references a valid
// sources row (seeded in v20), new INSERTs immediately get a valid FK.
//
// Idempotent: IF NOT EXISTS on ADD COLUMN, DROP IF EXISTS on the old
// constraint, DO block guard on the new constraint creation.
sql: `
ALTER TABLE pages ADD COLUMN IF NOT EXISTS source_id TEXT
NOT NULL DEFAULT 'default' REFERENCES sources(id) ON DELETE CASCADE;
CREATE INDEX IF NOT EXISTS idx_pages_source_id ON pages(source_id);
-- Swap global UNIQUE(slug) → composite UNIQUE(source_id, slug). The
-- original constraint is named pages_slug_key by Postgres convention
-- when the column was declared UNIQUE inline. Both drops are
-- idempotent.
ALTER TABLE pages DROP CONSTRAINT IF EXISTS pages_slug_key;
DO $$ BEGIN
IF NOT EXISTS (
SELECT 1 FROM pg_constraint WHERE conname = 'pages_source_slug_key'
) THEN
ALTER TABLE pages ADD CONSTRAINT pages_source_slug_key
UNIQUE (source_id, slug);
END IF;
END $$;
`,
},
{
version: 20,
name: 'sources_table_additive',
// v0.18.0 Step 1 (Lane A) — **additive only** so Step 1 is a safe
// standalone commit. This migration installs the sources primitive
// WITHOUT breaking the engine's existing ON CONFLICT (slug) upserts.
//
// What this migration does now:
// - CREATE sources table
// - INSERT default source (federated=true, inherits sync.repo_path
// and sync.last_commit from config so post-upgrade identity is
// preserved)
//
// What this migration does NOT do yet (deferred to v21-v23, which ship
// with their paired engine rewrites in Steps 2/4/7, so they land atomically):
// - ALTER pages ADD source_id
// - DROP UNIQUE(slug) + ADD UNIQUE(source_id, slug)
// - files.page_slug → page_id rewrite
// - file_migration_ledger
// - links.resolution_type
//
// The v0.18.0 orchestrator's phaseCVerify allows this split: it
// checks for sources('default'), but the "composite UNIQUE" +
// "pages.source_id NOT NULL" assertions only run after v21 lands.
//
// Idempotent via IF NOT EXISTS. Safe to re-run.
sql: `
CREATE TABLE IF NOT EXISTS sources (
id TEXT PRIMARY KEY,
name TEXT NOT NULL UNIQUE,
local_path TEXT,
last_commit TEXT,
last_sync_at TIMESTAMPTZ,
config JSONB NOT NULL DEFAULT '{}'::jsonb,
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
-- Seed 'default' source, inheriting the existing sync.repo_path /
-- sync.last_commit config values. federated=true for backward compat.
-- Pre-v0.17 brains behave exactly as before.
INSERT INTO sources (id, name, local_path, last_commit, config)
SELECT
'default',
'default',
(SELECT value FROM config WHERE key = 'sync.repo_path'),
(SELECT value FROM config WHERE key = 'sync.last_commit'),
'{"federated": true}'::jsonb
WHERE NOT EXISTS (SELECT 1 FROM sources WHERE id = 'default');
`,
},
{
version: 15,
name: 'minion_jobs_max_stalled_default_5',
@@ -502,8 +697,14 @@ export async function runMigrations(engine: BrainEngine): Promise<{ applied: num
const currentStr = await engine.getConfig('version');
const current = parseInt(currentStr || '1', 10);
// Sort by version ascending so array insertion order doesn't affect
// correctness. Migrations MUST run in version order; if v16 accidentally
// precedes v15 in MIGRATIONS, setConfig(version, 16) would cause v15 to
// be skipped on the next iteration.
const sorted = [...MIGRATIONS].sort((a, b) => a.version - b.version);
let applied = 0;
for (const m of MIGRATIONS) {
for (const m of sorted) {
if (m.version > current) {
// Pick SQL: engine-specific `sqlFor` wins over engine-agnostic `sql`.
const sql = m.sqlFor?.[engine.kind] ?? m.sql;
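
The skip that the sort guard prevents is easy to reproduce in isolation. The model below is a simplification under stated assumptions: `Mig` and `applyAll` are hypothetical, and the stored version is modelled as advancing after each applied migration, which is how the comment above describes the failure.

```typescript
interface Mig { version: number; name: string }

// Applies every migration newer than the stored version and returns the
// versions that actually ran, in order.
function applyAll(migrations: Mig[], startVersion: number, sortFirst: boolean): number[] {
  const list = sortFirst
    ? [...migrations].sort((a, b) => a.version - b.version) // copy, then sort ascending
    : migrations;
  let current = startVersion;
  const applied: number[] = [];
  for (const m of list) {
    if (m.version > current) {
      applied.push(m.version);
      current = m.version; // models the persisted version advancing
    }
  }
  return applied;
}

// v16 was inserted ahead of v15 in the array.
const outOfOrder: Mig[] = [
  { version: 16, name: 'later' },
  { version: 15, name: 'earlier' },
];

console.log(applyAll(outOfOrder, 14, false)); // only 16 runs; 15 is skipped
console.log(applyAll(outOfOrder, 14, true));  // 15 runs, then 16
```

Sorting a copy rather than MIGRATIONS itself leaves the exported array order untouched for any other consumer.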

@@ -116,10 +116,15 @@ export class PGLiteEngine implements BrainEngine {
const hash = page.content_hash || contentHash(page);
const frontmatter = page.frontmatter || {};
// v0.18.0 Step 2: source_id relies on the schema DEFAULT 'default' so
// existing callers still target the default source without threading
// a parameter. ON CONFLICT target becomes (source_id, slug) since the
// global UNIQUE(slug) was dropped in migration v21. Step 5+ will
// surface an explicit sourceId param on putPage for multi-source sync.
const { rows } = await this.db.query(
`INSERT INTO pages (slug, type, title, compiled_truth, timeline, frontmatter, content_hash, updated_at)
VALUES ($1, $2, $3, $4, $5, $6::jsonb, $7, now())
ON CONFLICT (slug) DO UPDATE SET
ON CONFLICT (source_id, slug) DO UPDATE SET
type = EXCLUDED.type,
title = EXCLUDED.title,
compiled_truth = EXCLUDED.compiled_truth,
@@ -205,7 +210,7 @@ export class PGLiteEngine implements BrainEngine {
const { rows } = await this.db.query(
`SELECT
p.slug, p.id as page_id, p.title, p.type,
p.slug, p.id as page_id, p.title, p.type, p.source_id,
cc.id as chunk_id, cc.chunk_index, cc.chunk_text, cc.chunk_source,
ts_rank(p.search_vector, websearch_to_tsquery('english', $1)) AS score,
CASE WHEN p.updated_at < (
@@ -235,7 +240,7 @@ export class PGLiteEngine implements BrainEngine {
const { rows } = await this.db.query(
`SELECT
p.slug, p.id as page_id, p.title, p.type,
p.slug, p.id as page_id, p.title, p.type, p.source_id,
cc.id as chunk_id, cc.chunk_index, cc.chunk_text, cc.chunk_source,
1 - (cc.embedding <=> $1::vector) AS score,
CASE WHEN p.updated_at < (
@@ -370,8 +375,14 @@ export class PGLiteEngine implements BrainEngine {
async addLinksBatch(links: LinkBatchInput[]): Promise<number> {
if (links.length === 0) return 0;
// unnest() pattern: 7 array-typed bound parameters regardless of batch size.
// Same shape as PostgresEngine (v0.13). Avoids the 65535-parameter cap.
// unnest() pattern: 10 array-typed bound parameters regardless of batch
// size. Same shape as PostgresEngine (v0.18). Avoids the 65535-parameter
// cap.
//
// v0.18.0: every JOIN composite-keys on (slug, source_id) so the batch
// can't fan out across sources when the same slug exists in multiple
// sources. Origin JOIN uses LEFT JOIN on a composite key — NULL
// origin_slug leaves origin_page_id NULL, same as pre-v0.18.
const fromSlugs = links.map(l => l.from_slug);
const toSlugs = links.map(l => l.to_slug);
const linkTypes = links.map(l => l.link_type || '');
@@ -379,17 +390,20 @@ export class PGLiteEngine implements BrainEngine {
const linkSources = links.map(l => l.link_source || 'markdown');
const originSlugs = links.map(l => l.origin_slug || null);
const originFields = links.map(l => l.origin_field || null);
const fromSourceIds = links.map(l => l.from_source_id || 'default');
const toSourceIds = links.map(l => l.to_source_id || 'default');
const originSourceIds = links.map(l => l.origin_source_id || 'default');
const result = await this.db.query(
`INSERT INTO links (from_page_id, to_page_id, link_type, context, link_source, origin_page_id, origin_field)
SELECT f.id, t.id, v.link_type, v.context, v.link_source, o.id, v.origin_field
FROM unnest($1::text[], $2::text[], $3::text[], $4::text[], $5::text[], $6::text[], $7::text[])
AS v(from_slug, to_slug, link_type, context, link_source, origin_slug, origin_field)
JOIN pages f ON f.slug = v.from_slug
JOIN pages t ON t.slug = v.to_slug
LEFT JOIN pages o ON o.slug = v.origin_slug
FROM unnest($1::text[], $2::text[], $3::text[], $4::text[], $5::text[], $6::text[], $7::text[], $8::text[], $9::text[], $10::text[])
AS v(from_slug, to_slug, link_type, context, link_source, origin_slug, origin_field, from_source_id, to_source_id, origin_source_id)
JOIN pages f ON f.slug = v.from_slug AND f.source_id = v.from_source_id
JOIN pages t ON t.slug = v.to_slug AND t.source_id = v.to_source_id
LEFT JOIN pages o ON o.slug = v.origin_slug AND o.source_id = v.origin_source_id
ON CONFLICT (from_page_id, to_page_id, link_type, link_source, origin_page_id) DO NOTHING
RETURNING 1`,
[fromSlugs, toSlugs, linkTypes, contexts, linkSources, originSlugs, originFields]
[fromSlugs, toSlugs, linkTypes, contexts, linkSources, originSlugs, originFields, fromSourceIds, toSourceIds, originSourceIds]
);
return result.rows.length;
}
@@ -724,22 +738,21 @@ export class PGLiteEngine implements BrainEngine {
async addTimelineEntriesBatch(entries: TimelineBatchInput[]): Promise<number> {
if (entries.length === 0) return 0;
// unnest() pattern: 5 array-typed bound parameters regardless of batch size.
const slugs = entries.map(e => e.slug);
const dates = entries.map(e => e.date);
// Normalize optional fields to '' to match per-row addTimelineEntry + NOT NULL DDL.
const sources = entries.map(e => e.source || '');
const summaries = entries.map(e => e.summary);
const details = entries.map(e => e.detail || '');
const sourceIds = entries.map(e => e.source_id || 'default');
const result = await this.db.query(
`INSERT INTO timeline_entries (page_id, date, source, summary, detail)
SELECT p.id, v.date::date, v.source, v.summary, v.detail
FROM unnest($1::text[], $2::text[], $3::text[], $4::text[], $5::text[])
AS v(slug, date, source, summary, detail)
JOIN pages p ON p.slug = v.slug
FROM unnest($1::text[], $2::text[], $3::text[], $4::text[], $5::text[], $6::text[])
AS v(slug, date, source, summary, detail, source_id)
JOIN pages p ON p.slug = v.slug AND p.source_id = v.source_id
ON CONFLICT (page_id, date, summary) DO NOTHING
RETURNING 1`,
[slugs, dates, sources, summaries, details]
[slugs, dates, sources, summaries, details, sourceIds]
);
return result.rows.length;
}
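
The composite-key JOINs in the batch inserts above exist to stop one input row from matching several pages once the same slug lives in more than one source. The effect can be sketched in memory as follows. This is a minimal illustration, not engine code: the `Page` and `LinkInput` shapes and the `resolveLinks` helper are hypothetical, and the real resolution happens in SQL.

```typescript
// Hypothetical in-memory model of the pages table; not the engine's API.
interface Page { id: number; source_id: string; slug: string }
interface LinkInput {
  from_slug: string;
  to_slug: string;
  from_source_id?: string;
  to_source_id?: string;
}

const pages: Page[] = [
  { id: 1, source_id: 'default', slug: 'readme' },
  { id: 2, source_id: 'wiki', slug: 'readme' },
  { id: 3, source_id: 'default', slug: 'roadmap' },
];

// Emulates `JOIN pages f ... JOIN pages t ...` for one batch.
// compositeKey=false mimics the older slug-only join condition.
function resolveLinks(links: LinkInput[], compositeKey: boolean): Array<[number, number]> {
  const out: Array<[number, number]> = [];
  for (const l of links) {
    // Same fallback the engines apply when a source id is omitted.
    const fromSrc = l.from_source_id || 'default';
    const toSrc = l.to_source_id || 'default';
    const froms = pages.filter(
      p => p.slug === l.from_slug && (!compositeKey || p.source_id === fromSrc),
    );
    const tos = pages.filter(
      p => p.slug === l.to_slug && (!compositeKey || p.source_id === toSrc),
    );
    for (const f of froms) for (const t of tos) out.push([f.id, t.id]);
  }
  return out;
}

const batch: LinkInput[] = [
  { from_slug: 'readme', to_slug: 'roadmap', from_source_id: 'wiki' },
];
const slugOnly = resolveLinks(batch, false); // both 'readme' pages match
const composite = resolveLinks(batch, true); // only wiki/readme matches
```

With the slug-only condition the single input link resolves to two (from, to) pairs; with the composite condition it resolves to exactly one.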


@@ -19,12 +19,33 @@ export const PGLITE_SCHEMA_SQL = `
CREATE EXTENSION IF NOT EXISTS vector;
CREATE EXTENSION IF NOT EXISTS pg_trgm;
-- ============================================================
-- sources: multi-brain tenancy (v0.18.0). See src/schema.sql for design notes.
-- ============================================================
CREATE TABLE IF NOT EXISTS sources (
id TEXT PRIMARY KEY,
name TEXT NOT NULL UNIQUE,
local_path TEXT,
last_commit TEXT,
last_sync_at TIMESTAMPTZ,
config JSONB NOT NULL DEFAULT '{}'::jsonb,
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
INSERT INTO sources (id, name, config)
VALUES ('default', 'default', '{"federated": true}'::jsonb)
ON CONFLICT (id) DO NOTHING;
-- ============================================================
-- pages: the core content table
-- ============================================================
-- v0.18.0 (Step 2): source_id scopes each page. Slugs are unique per
-- source — see src/schema.sql for the design notes.
CREATE TABLE IF NOT EXISTS pages (
id SERIAL PRIMARY KEY,
slug TEXT NOT NULL UNIQUE,
source_id TEXT NOT NULL DEFAULT 'default'
REFERENCES sources(id) ON DELETE CASCADE,
slug TEXT NOT NULL,
type TEXT NOT NULL,
title TEXT NOT NULL,
compiled_truth TEXT NOT NULL DEFAULT '',
@@ -32,12 +53,14 @@ CREATE TABLE IF NOT EXISTS pages (
frontmatter JSONB NOT NULL DEFAULT '{}',
content_hash TEXT,
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
CONSTRAINT pages_source_slug_key UNIQUE (source_id, slug)
);
CREATE INDEX IF NOT EXISTS idx_pages_type ON pages(type);
CREATE INDEX IF NOT EXISTS idx_pages_frontmatter ON pages USING GIN(frontmatter);
CREATE INDEX IF NOT EXISTS idx_pages_trgm ON pages USING GIN(title gin_trgm_ops);
CREATE INDEX IF NOT EXISTS idx_pages_source_id ON pages(source_id);
-- ============================================================
-- content_chunks: chunked content with embeddings
@@ -72,6 +95,8 @@ CREATE TABLE IF NOT EXISTS links (
link_source TEXT CHECK (link_source IS NULL OR link_source IN ('markdown', 'frontmatter', 'manual')),
origin_page_id INTEGER REFERENCES pages(id) ON DELETE SET NULL,
origin_field TEXT,
-- v0.18.0 Step 4: see src/schema.sql.
resolution_type TEXT CHECK (resolution_type IS NULL OR resolution_type IN ('qualified', 'unqualified')),
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
CONSTRAINT links_from_to_type_source_origin_unique
UNIQUE NULLS NOT DISTINCT (from_page_id, to_page_id, link_type, link_source, origin_page_id)
@@ -141,7 +166,7 @@ CREATE TABLE IF NOT EXISTS page_versions (
CREATE INDEX IF NOT EXISTS idx_versions_page ON page_versions(page_id);
-- ============================================================
-- ingest_log
-- ingest_log (v0.18.0 Step 1: source_id deferred to v17, see src/schema.sql)
-- ============================================================
CREATE TABLE IF NOT EXISTS ingest_log (
id SERIAL PRIMARY KEY,


@@ -115,10 +115,14 @@ export class PostgresEngine implements BrainEngine {
const hash = page.content_hash || contentHash(page);
const frontmatter = page.frontmatter || {};
// v0.18.0 Step 2: source_id relies on schema DEFAULT 'default'. ON
// CONFLICT target becomes (source_id, slug) since global UNIQUE(slug)
// was dropped in migration v17. See pglite-engine.ts for matching
// notes; multi-source sync (Step 5) will surface an explicit sourceId.
const rows = await sql`
INSERT INTO pages (slug, type, title, compiled_truth, timeline, frontmatter, content_hash, updated_at)
VALUES (${slug}, ${page.type}, ${page.title}, ${page.compiled_truth}, ${page.timeline || ''}, ${sql.json(frontmatter as Parameters<typeof sql.json>[0])}, ${hash}, now())
ON CONFLICT (slug) DO UPDATE SET
ON CONFLICT (source_id, slug) DO UPDATE SET
type = EXCLUDED.type,
title = EXCLUDED.title,
compiled_truth = EXCLUDED.compiled_truth,
@@ -262,7 +266,7 @@ export class PostgresEngine implements BrainEngine {
await sql`SET LOCAL statement_timeout = '8s'`;
return await sql`
SELECT
p.slug, p.id as page_id, p.title, p.type,
p.slug, p.id as page_id, p.title, p.type, p.source_id,
cc.id as chunk_id, cc.chunk_index, cc.chunk_text, cc.chunk_source,
1 - (cc.embedding <=> ${vecStr}::vector) AS score,
false AS stale
@@ -422,17 +426,21 @@ export class PostgresEngine implements BrainEngine {
const linkSources = links.map(l => l.link_source || 'markdown');
const originSlugs = links.map(l => l.origin_slug || null);
const originFields = links.map(l => l.origin_field || null);
const fromSourceIds = links.map(l => l.from_source_id || 'default');
const toSourceIds = links.map(l => l.to_source_id || 'default');
const originSourceIds = links.map(l => l.origin_source_id || 'default');
const result = await sql`
INSERT INTO links (from_page_id, to_page_id, link_type, context, link_source, origin_page_id, origin_field)
SELECT f.id, t.id, v.link_type, v.context, v.link_source, o.id, v.origin_field
FROM unnest(
${fromSlugs}::text[], ${toSlugs}::text[], ${linkTypes}::text[],
${contexts}::text[], ${linkSources}::text[], ${originSlugs}::text[],
${originFields}::text[]
) AS v(from_slug, to_slug, link_type, context, link_source, origin_slug, origin_field)
JOIN pages f ON f.slug = v.from_slug
JOIN pages t ON t.slug = v.to_slug
LEFT JOIN pages o ON o.slug = v.origin_slug
${originFields}::text[], ${fromSourceIds}::text[], ${toSourceIds}::text[],
${originSourceIds}::text[]
) AS v(from_slug, to_slug, link_type, context, link_source, origin_slug, origin_field, from_source_id, to_source_id, origin_source_id)
JOIN pages f ON f.slug = v.from_slug AND f.source_id = v.from_source_id
JOIN pages t ON t.slug = v.to_slug AND t.source_id = v.to_source_id
LEFT JOIN pages o ON o.slug = v.origin_slug AND o.source_id = v.origin_source_id
ON CONFLICT (from_page_id, to_page_id, link_type, link_source, origin_page_id) DO NOTHING
RETURNING 1
`;
@@ -775,19 +783,18 @@ export class PostgresEngine implements BrainEngine {
async addTimelineEntriesBatch(entries: TimelineBatchInput[]): Promise<number> {
if (entries.length === 0) return 0;
const sql = this.sql;
// unnest() pattern: 5 array-typed bound parameters regardless of batch size.
const slugs = entries.map(e => e.slug);
const dates = entries.map(e => e.date);
// Normalize optional fields to '' to match per-row addTimelineEntry + NOT NULL DDL.
const sources = entries.map(e => e.source || '');
const summaries = entries.map(e => e.summary);
const details = entries.map(e => e.detail || '');
const sourceIds = entries.map(e => e.source_id || 'default');
const result = await sql`
INSERT INTO timeline_entries (page_id, date, source, summary, detail)
SELECT p.id, v.date::date, v.source, v.summary, v.detail
FROM unnest(${slugs}::text[], ${dates}::text[], ${sources}::text[], ${summaries}::text[], ${details}::text[])
AS v(slug, date, source, summary, detail)
JOIN pages p ON p.slug = v.slug
FROM unnest(${slugs}::text[], ${dates}::text[], ${sources}::text[], ${summaries}::text[], ${details}::text[], ${sourceIds}::text[])
AS v(slug, date, source, summary, detail, source_id)
JOIN pages p ON p.slug = v.slug AND p.source_id = v.source_id
ON CONFLICT (page_id, date, summary) DO NOTHING
RETURNING 1
`;


@@ -9,12 +9,55 @@ CREATE EXTENSION IF NOT EXISTS pg_trgm;
-- gen_random_uuid() is core in Postgres 13+; enable pgcrypto as fallback for older versions
CREATE EXTENSION IF NOT EXISTS pgcrypto;
-- ============================================================
-- sources: multi-repo / multi-brain tenancy (v0.18.0)
-- ============================================================
-- A source is a logical brain-within-the-DB: wiki, gstack, yc-media, etc.
-- Every page/file/ingest_log row carries source_id.
--
-- id: immutable citation key. [a-z0-9-]{1,32} enforced at app layer.
-- Used in [source:slug] citations, --source flag, wikilink syntax.
-- name: mutable display label. Rename via \`gbrain sources rename\`.
-- local_path: optional git checkout root for filesystem-backed sources.
-- config: forward-compat JSONB. Currently used for federation + ACL slot.
-- { "federated": bool, "access_policy": {...} }
-- - federated=true (or missing-but-explicit on 'default'):
-- participates in cross-source default search.
-- - federated=false (default for new sources):
-- only searched when explicitly named via --source.
-- - access_policy: forward-compat slot, no enforcement in v0.17.
-- Write-side lockdown: mutated only when ctx.remote=false.
CREATE TABLE IF NOT EXISTS sources (
id TEXT PRIMARY KEY,
name TEXT NOT NULL UNIQUE,
local_path TEXT,
last_commit TEXT,
last_sync_at TIMESTAMPTZ,
config JSONB NOT NULL DEFAULT '{}'::jsonb,
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
-- Seed the default source. 'default' is federated=true for backward compat
-- (pre-v0.17 brains behave exactly as before — every page appears in search).
-- Pre-existing sync.repo_path / sync.last_commit are copied in by the v16
-- migration, not here; fresh installs have no local_path until \`sources add\`
-- or the first \`sync\`.
INSERT INTO sources (id, name, config)
VALUES ('default', 'default', '{"federated": true}'::jsonb)
ON CONFLICT (id) DO NOTHING;
-- ============================================================
-- pages: the core content table
-- ============================================================
-- v0.18.0 (Step 2): pages.source_id scopes each row to a sources(id) row.
-- Slugs are unique per source, NOT globally. The default source is
-- seeded in the sources block above so the DEFAULT 'default' FK is
-- always valid at INSERT time.
CREATE TABLE IF NOT EXISTS pages (
id SERIAL PRIMARY KEY,
slug TEXT NOT NULL UNIQUE,
source_id TEXT NOT NULL DEFAULT 'default'
REFERENCES sources(id) ON DELETE CASCADE,
slug TEXT NOT NULL,
type TEXT NOT NULL,
title TEXT NOT NULL,
compiled_truth TEXT NOT NULL DEFAULT '',
@@ -22,7 +65,8 @@ CREATE TABLE IF NOT EXISTS pages (
frontmatter JSONB NOT NULL DEFAULT '{}',
content_hash TEXT,
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
CONSTRAINT pages_source_slug_key UNIQUE (source_id, slug)
);
CREATE INDEX IF NOT EXISTS idx_pages_type ON pages(type);
@@ -30,6 +74,8 @@ CREATE INDEX IF NOT EXISTS idx_pages_frontmatter ON pages USING GIN(frontmatter)
CREATE INDEX IF NOT EXISTS idx_pages_trgm ON pages USING GIN(title gin_trgm_ops);
-- v0.13.1 #170: avoids 14.6s seqscan on large brains when listing pages newest-first.
CREATE INDEX IF NOT EXISTS idx_pages_updated_at_desc ON pages (updated_at DESC);
-- v0.18.0: source-scoped scans (per /plan-eng-review Section 4).
CREATE INDEX IF NOT EXISTS idx_pages_source_id ON pages(source_id);
-- ============================================================
-- content_chunks: chunked content with embeddings
@@ -74,6 +120,11 @@ CREATE TABLE IF NOT EXISTS links (
link_source TEXT CHECK (link_source IS NULL OR link_source IN ('markdown', 'frontmatter', 'manual')),
origin_page_id INTEGER REFERENCES pages(id) ON DELETE SET NULL,
origin_field TEXT,
-- v0.18.0 Step 4: 'qualified' when the link was written as
-- [[source:slug]] (target source pinned). 'unqualified' when written
-- as bare [[slug]] and resolved via local-first fallback at
-- extraction time. NULL for legacy/manual/frontmatter edges.
resolution_type TEXT CHECK (resolution_type IS NULL OR resolution_type IN ('qualified', 'unqualified')),
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
-- NULLS NOT DISTINCT (PG15+) so two rows with link_source IS NULL or
-- origin_page_id IS NULL collide as expected. Without this, every row with
@@ -148,6 +199,9 @@ CREATE INDEX IF NOT EXISTS idx_versions_page ON page_versions(page_id);
-- ============================================================
-- ingest_log
-- ============================================================
-- NOTE (v0.18.0 Step 1): ingest_log.source_id is NOT added yet — lands
-- in v17 alongside the sync rewrite (Step 5), which starts writing
-- source-scoped entries.
CREATE TABLE IF NOT EXISTS ingest_log (
id SERIAL PRIMARY KEY,
source_type TEXT NOT NULL,
@@ -202,9 +256,18 @@ CREATE TABLE IF NOT EXISTS mcp_request_log (
-- ============================================================
-- files: binary attachments stored in Supabase Storage
-- ============================================================
-- v0.18.0 Step 7: files gains source_id + page_id alongside the
-- legacy page_slug (kept for backward compat until a later release).
-- The file_migration_ledger below drives the storage object rewrite.
-- page_slug FK had ON UPDATE CASCADE — removed because slugs are no
-- longer global (composite UNIQUE) so CASCADE on-update is ambiguous.
-- ON DELETE SET NULL is preserved via both page_slug and page_id.
CREATE TABLE IF NOT EXISTS files (
id SERIAL PRIMARY KEY,
page_slug TEXT REFERENCES pages(slug) ON DELETE SET NULL ON UPDATE CASCADE,
source_id TEXT NOT NULL DEFAULT 'default'
REFERENCES sources(id) ON DELETE CASCADE,
page_slug TEXT,
page_id INTEGER REFERENCES pages(id) ON DELETE SET NULL,
filename TEXT NOT NULL,
storage_path TEXT NOT NULL,
mime_type TEXT,
@@ -219,8 +282,30 @@ CREATE TABLE IF NOT EXISTS files (
ALTER TABLE files DROP COLUMN IF EXISTS storage_url;
CREATE INDEX IF NOT EXISTS idx_files_page ON files(page_slug);
CREATE INDEX IF NOT EXISTS idx_files_page_id ON files(page_id);
CREATE INDEX IF NOT EXISTS idx_files_source_id ON files(source_id);
CREATE INDEX IF NOT EXISTS idx_files_hash ON files(content_hash);
-- ============================================================
-- file_migration_ledger (v0.18.0 Step 7)
-- Drives the storage-object rewrite performed by the v0_18_0
-- orchestrator's phase B. Keyed on file_id so two sources can share
-- an old path during migration without PK collision (Codex second-
-- pass caught this).
-- Status state machine: pending → copy_done → db_updated → complete
-- ============================================================
CREATE TABLE IF NOT EXISTS file_migration_ledger (
file_id INTEGER PRIMARY KEY REFERENCES files(id) ON DELETE CASCADE,
storage_path_old TEXT NOT NULL,
storage_path_new TEXT NOT NULL,
status TEXT NOT NULL DEFAULT 'pending',
error TEXT,
updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
CONSTRAINT chk_ledger_status CHECK (status IN ('pending','copy_done','db_updated','complete','failed'))
);
CREATE INDEX IF NOT EXISTS idx_file_migration_ledger_status
ON file_migration_ledger(status) WHERE status != 'complete';
-- ============================================================
-- Trigger-based search_vector (spans pages + timeline_entries)
-- ============================================================
@@ -469,6 +554,8 @@ BEGIN
ALTER TABLE config ENABLE ROW LEVEL SECURITY;
ALTER TABLE files ENABLE ROW LEVEL SECURITY;
ALTER TABLE minion_jobs ENABLE ROW LEVEL SECURITY;
ALTER TABLE sources ENABLE ROW LEVEL SECURITY;
ALTER TABLE file_migration_ledger ENABLE ROW LEVEL SECURITY;
RAISE NOTICE 'RLS enabled on all tables (role % has BYPASSRLS)', current_user;
ELSE
RAISE WARNING 'Skipping RLS: role % does not have BYPASSRLS privilege. Run as postgres role to enable.', current_user;
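
The `file_migration_ledger` schema above names a forward status path (pending → copy_done → db_updated → complete) and also allows `failed` in its CHECK constraint. A minimal sketch of how a caller might guard transitions is shown below. The `canTransition` and `isOutstanding` helpers are hypothetical, and the rule for reaching `failed` is an assumption, since the schema lists the value without saying how it is entered.

```typescript
// Status values mirror the chk_ledger_status CHECK constraint.
type LedgerStatus = 'pending' | 'copy_done' | 'db_updated' | 'complete' | 'failed';

// Forward path taken from the schema comment.
const NEXT: Partial<Record<LedgerStatus, LedgerStatus>> = {
  pending: 'copy_done',
  copy_done: 'db_updated',
  db_updated: 'complete',
};

// ASSUMPTION: any in-flight state may move to 'failed'. The schema only
// lists 'failed' as a legal value; it does not define how it is reached.
function canTransition(from: LedgerStatus, to: LedgerStatus): boolean {
  if (to === 'failed') return from !== 'complete' && from !== 'failed';
  return NEXT[from] === to;
}

// Rows that the partial index (status != 'complete') would still cover.
function isOutstanding(status: LedgerStatus): boolean {
  return status !== 'complete';
}
```

Keeping the forward path in one lookup table means a skipped step (for example pending straight to complete) is rejected without extra branching.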


@@ -7,6 +7,14 @@
* 3. By type: no page type exceeds 60% of results
* 4. By page: max N chunks per page (default 2)
* 5. Compiled truth guarantee: ensure at least 1 compiled_truth chunk per page
*
* v0.18.0: every page key is composite (source_id, slug). Pre-v0.17 this
* was slug alone — under multi-source uniqueness that would collapse two
* same-slug pages in different sources into one, destroying recall.
* Codex review flagged this as a regression-critical path. The
* `pageKey()` helper below is the one canonical way to derive the key;
* every dedup layer uses it, so a future change to the key is a one-file
* fix.
*/
import type { SearchResult } from '../types.ts';
@@ -15,6 +23,17 @@ const COSINE_DEDUP_THRESHOLD = 0.85;
const MAX_TYPE_RATIO = 0.6;
const MAX_PER_PAGE = 2;
/**
* Composite page key: (source_id, slug). Pre-v0.17 rows lacked source_id
* so we fall back to 'default' to preserve single-source brain behavior
* exactly. Post-v0.17 callers always populate source_id (SQL JOINs in
* pglite/postgres engine search paths).
*/
function pageKey(r: SearchResult): string {
const source = r.source_id ?? 'default';
return `${source}:${r.slug}`;
}
export function dedupResults(
results: SearchResult[],
opts?: {
@@ -58,9 +77,10 @@ function dedupBySource(results: SearchResult[]): SearchResult[] {
const byPage = new Map<string, SearchResult[]>();
for (const r of results) {
const existing = byPage.get(r.slug) || [];
const k = pageKey(r);
const existing = byPage.get(k) || [];
existing.push(r);
byPage.set(r.slug, existing);
byPage.set(k, existing);
}
const kept: SearchResult[] = [];
@@ -130,10 +150,11 @@ function capPerPage(results: SearchResult[], maxPerPage: number): SearchResult[]
const kept: SearchResult[] = [];
for (const r of results) {
const count = pageCounts.get(r.slug) || 0;
const k = pageKey(r);
const count = pageCounts.get(k) || 0;
if (count < maxPerPage) {
kept.push(r);
pageCounts.set(r.slug, count + 1);
pageCounts.set(k, count + 1);
}
}
@@ -145,30 +166,35 @@ function capPerPage(results: SearchResult[], maxPerPage: number): SearchResult[]
* swap in the best compiled_truth chunk from the pre-dedup set (if one exists).
*/
function guaranteeCompiledTruth(results: SearchResult[], preDedup: SearchResult[]): SearchResult[] {
// Group results by page
// Group results by composite page key (source_id, slug).
const byPage = new Map<string, SearchResult[]>();
for (const r of results) {
const existing = byPage.get(r.slug) || [];
const k = pageKey(r);
const existing = byPage.get(k) || [];
existing.push(r);
byPage.set(r.slug, existing);
byPage.set(k, existing);
}
const output = [...results];
for (const [slug, pageChunks] of byPage) {
for (const [key, pageChunks] of byPage) {
const hasCompiledTruth = pageChunks.some(c => c.chunk_source === 'compiled_truth');
if (hasCompiledTruth) continue;
// Find the best compiled_truth chunk from pre-dedup input for this page
// Find the best compiled_truth chunk from pre-dedup input for this
// (source_id, slug) combination. Pre-v0.17 single-source match was
// "r.slug === slug"; now it's the composite key so two same-slug
// pages in different sources don't mistakenly swap chunks across.
const candidate = preDedup
.filter(r => r.slug === slug && r.chunk_source === 'compiled_truth')
.filter(r => pageKey(r) === key && r.chunk_source === 'compiled_truth')
.sort((a, b) => b.score - a.score)[0];
if (!candidate) continue;
// Swap: replace the lowest-scored chunk from this page
// Swap: replace the lowest-scored chunk from this page (same
// composite key match).
const lowestIdx = output.reduce((minIdx, r, idx) => {
if (r.slug !== slug) return minIdx;
if (pageKey(r) !== key) return minIdx;
if (minIdx === -1) return idx;
return r.score < output[minIdx].score ? idx : minIdx;
}, -1);

src/core/source-resolver.ts (new file, 139 lines)

@@ -0,0 +1,139 @@
/**
* Source resolution for CLI commands (v0.18.0).
*
* Resolution priority (highest first):
* 1. Explicit --source <id> flag (caller passes this as `explicit`)
* 2. GBRAIN_SOURCE env var
* 3. .gbrain-source dotfile in CWD or any ancestor directory
* 4. Registered source whose local_path contains CWD
* 5. Brain-level default via `gbrain sources default <id>`
* 6. Literal 'default' (backward compat for pre-v0.17 brains)
*
* This helper is shared by the sources CLI, future sync/extract/query
* commands (Steps 4/5), and the operation layer (Step 2+).
*/
import { readFileSync, existsSync } from 'fs';
import { join, dirname, resolve } from 'path';
import type { BrainEngine } from './engine.ts';
const DOTFILE = '.gbrain-source';
// Must start + end with alnum, interior dashes allowed. Max 32 chars.
// Single-char alnum is also valid. Kebab-case enforced so citation keys
// like `[wiki:slug]` can't have ugly edges like `[wiki-:slug]`.
const SOURCE_ID_RE = /^[a-z0-9](?:[a-z0-9-]{0,30}[a-z0-9])?$/;
function readDotfileWalk(startDir: string): string | null {
let dir = resolve(startDir);
// Guard against infinite loops on malformed paths.
for (let i = 0; i < 50; i++) {
const candidate = join(dir, DOTFILE);
if (existsSync(candidate)) {
try {
const content = readFileSync(candidate, 'utf8').trim().split('\n')[0].trim();
if (SOURCE_ID_RE.test(content)) return content;
} catch {
// Unreadable dotfile — skip and keep walking.
}
}
const parent = dirname(dir);
if (parent === dir) break; // reached filesystem root
dir = parent;
}
return null;
}
/**
* Resolve the source id for a CLI command.
*
* @param engine Connected brain engine (for sources table lookups).
* @param explicit The --source <id> flag value, if the caller parsed one.
* @param cwd The working directory to walk for .gbrain-source. Defaults
* to process.cwd(). Exposed for testability.
* @returns The resolved source id. Falls back to 'default' if no other
* signal is present. Never returns null: every command must
* target exactly one source.
* @throws If the resolved id doesn't correspond to a registered source
* (prevents silently writing to a nonexistent source and bloating
* pages with a dead FK).
*/
export async function resolveSourceId(
engine: BrainEngine,
explicit: string | null | undefined,
cwd: string = process.cwd(),
): Promise<string> {
// 1. Explicit flag wins.
if (explicit) {
if (!SOURCE_ID_RE.test(explicit)) {
throw new Error(`Invalid --source value "${explicit}". Must match [a-z0-9-]{1,32}.`);
}
await assertSourceExists(engine, explicit);
return explicit;
}
// 2. Env var.
const env = process.env.GBRAIN_SOURCE;
if (env && env.length > 0) {
if (!SOURCE_ID_RE.test(env)) {
throw new Error(`Invalid GBRAIN_SOURCE value "${env}". Must match [a-z0-9-]{1,32}.`);
}
await assertSourceExists(engine, env);
return env;
}
// 3. .gbrain-source dotfile walk-up.
const dotfile = readDotfileWalk(cwd);
if (dotfile) {
await assertSourceExists(engine, dotfile);
return dotfile;
}
// 4. Registered source whose local_path contains CWD.
// Uses longest-prefix match so nested-path configurations (e.g.
// gstack at ~/gstack + plans at ~/gstack/plans) pick the deepest.
const registered = await engine.executeRaw<{ id: string; local_path: string }>(
`SELECT id, local_path FROM sources WHERE local_path IS NOT NULL`,
);
const cwdResolved = resolve(cwd);
let best: { id: string; pathLen: number } | null = null;
for (const r of registered) {
const p = resolve(r.local_path);
if (cwdResolved === p || cwdResolved.startsWith(p + '/')) {
if (!best || p.length > best.pathLen) {
best = { id: r.id, pathLen: p.length };
}
}
}
if (best) return best.id;
// 5. Brain-level default.
const globalDefault = await engine.getConfig('sources.default');
if (globalDefault && SOURCE_ID_RE.test(globalDefault)) {
await assertSourceExists(engine, globalDefault);
return globalDefault;
}
// 6. Fallback: the seeded 'default' source. Always exists post-migration
// v16 so this is a safe terminal.
return 'default';
}
async function assertSourceExists(engine: BrainEngine, id: string): Promise<void> {
const rows = await engine.executeRaw<{ id: string }>(
`SELECT id FROM sources WHERE id = $1`,
[id],
);
if (rows.length === 0) {
throw new Error(
`Source "${id}" not found. ` +
`Run \`gbrain sources list\` to see registered sources, ` +
`or \`gbrain sources add ${id}\` to create it.`,
);
}
}
/** Exposed for tests. */
export const __testing = {
readDotfileWalk,
SOURCE_ID_RE,
};
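
Two pieces of the resolver above are easy to check in isolation: the source id pattern and the longest-prefix match from step 4. The sketch below copies `SOURCE_ID_RE` verbatim and reimplements the prefix loop as a hypothetical `pickByLongestPrefix` helper. It assumes the paths are already resolved POSIX paths, so it skips `resolve()` and the database lookup.

```typescript
// Same pattern as SOURCE_ID_RE in source-resolver.ts.
const SOURCE_ID_RE = /^[a-z0-9](?:[a-z0-9-]{0,30}[a-z0-9])?$/;

// Hypothetical row shape for registered sources with a local_path.
interface Registered { id: string; local_path: string }

// Longest-prefix match over already-resolved POSIX paths (assumption:
// '/' is the separator; the real code resolves paths first).
function pickByLongestPrefix(cwd: string, registered: Registered[]): string | null {
  let best: { id: string; pathLen: number } | null = null;
  for (const r of registered) {
    const p = r.local_path;
    // Match the directory itself or anything strictly beneath it.
    if (cwd === p || cwd.startsWith(p + '/')) {
      if (!best || p.length > best.pathLen) {
        best = { id: r.id, pathLen: p.length };
      }
    }
  }
  return best ? best.id : null;
}

const registered: Registered[] = [
  { id: 'gstack', local_path: '/home/u/gstack' },
  { id: 'plans', local_path: '/home/u/gstack/plans' },
];

const idChecks = ['wiki', 'a', 'yc-media', 'wiki-', '-wiki', 'Wiki', 'a'.repeat(33)]
  .map(id => [id, SOURCE_ID_RE.test(id)] as const);

const deep = pickByLongestPrefix('/home/u/gstack/plans/q3', registered);
const shallow = pickByLongestPrefix('/home/u/gstack/src', registered);
const sibling = pickByLongestPrefix('/home/u/gstack-old', registered);
```

Appending `'/'` before the `startsWith` check is what keeps a sibling directory such as `gstack-old` from matching the `gstack` source.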


@@ -66,6 +66,12 @@ export interface SearchResult {
chunk_index: number;
score: number;
stale: boolean;
/**
* v0.18.0: the sources.id the page belongs to. Dedup composite-keys
* on (source_id, slug) — see src/core/search/dedup.ts. Defaults to
* 'default' for pre-v0.17 rows that lacked the column.
*/
source_id?: string;
}
export interface SearchOpts {


@@ -125,7 +125,7 @@ export function rowToChunk(row: Record<string, unknown>, includeEmbedding = fals
}
export function rowToSearchResult(row: Record<string, unknown>): SearchResult {
return {
const result: SearchResult = {
slug: row.slug as string,
page_id: row.page_id as number,
title: row.title as string,
@@ -137,4 +137,12 @@ export function rowToSearchResult(row: Record<string, unknown>): SearchResult {
score: Number(row.score),
stale: Boolean(row.stale),
};
// v0.18.0: source_id comes from the p.source_id column in search
// SELECTs. Keep the field optional so pre-v0.17 engines that didn't
// join sources don't crash on the absent column — rowToSearchResult
// is shared by both paths.
if (typeof row.source_id === 'string') {
result.source_id = row.source_id;
}
return result;
}


@@ -5,12 +5,55 @@ CREATE EXTENSION IF NOT EXISTS pg_trgm;
-- gen_random_uuid() is core in Postgres 13+; enable pgcrypto as fallback for older versions
CREATE EXTENSION IF NOT EXISTS pgcrypto;
-- ============================================================
-- sources: multi-repo / multi-brain tenancy (v0.18.0)
-- ============================================================
-- A source is a logical brain-within-the-DB: wiki, gstack, yc-media, etc.
-- Every page/file/ingest_log row carries source_id.
--
-- id: immutable citation key. [a-z0-9-]{1,32} enforced at app layer.
-- Used in [source:slug] citations, --source flag, wikilink syntax.
-- name: mutable display label. Rename via `gbrain sources rename`.
-- local_path: optional git checkout root for filesystem-backed sources.
-- config: forward-compat JSONB. Currently used for federation + ACL slot.
-- { "federated": bool, "access_policy": {...} }
-- - federated=true (or missing-but-explicit on 'default'):
-- participates in cross-source default search.
-- - federated=false (default for new sources):
-- only searched when explicitly named via --source.
-- - access_policy: forward-compat slot, no enforcement in v0.17.
-- Write-side lockdown: mutated only when ctx.remote=false.
CREATE TABLE IF NOT EXISTS sources (
id TEXT PRIMARY KEY,
name TEXT NOT NULL UNIQUE,
local_path TEXT,
last_commit TEXT,
last_sync_at TIMESTAMPTZ,
config JSONB NOT NULL DEFAULT '{}'::jsonb,
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
-- Seed the default source. 'default' is federated=true for backward compat
-- (pre-v0.17 brains behave exactly as before — every page appears in search).
-- Pre-existing sync.repo_path / sync.last_commit are copied in by the v16
-- migration, not here; fresh installs have no local_path until `sources add`
-- or the first `sync`.
INSERT INTO sources (id, name, config)
VALUES ('default', 'default', '{"federated": true}'::jsonb)
ON CONFLICT (id) DO NOTHING;
-- ============================================================
-- pages: the core content table
-- ============================================================
-- v0.18.0 (Step 2): pages.source_id scopes each row to a sources(id) row.
-- Slugs are unique per source, NOT globally. The default source is
-- seeded in the sources block above so the DEFAULT 'default' FK is
-- always valid at INSERT time.
CREATE TABLE IF NOT EXISTS pages (
id SERIAL PRIMARY KEY,
slug TEXT NOT NULL UNIQUE,
source_id TEXT NOT NULL DEFAULT 'default'
REFERENCES sources(id) ON DELETE CASCADE,
slug TEXT NOT NULL,
type TEXT NOT NULL,
title TEXT NOT NULL,
compiled_truth TEXT NOT NULL DEFAULT '',
@@ -18,7 +61,8 @@ CREATE TABLE IF NOT EXISTS pages (
frontmatter JSONB NOT NULL DEFAULT '{}',
content_hash TEXT,
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
CONSTRAINT pages_source_slug_key UNIQUE (source_id, slug)
);
CREATE INDEX IF NOT EXISTS idx_pages_type ON pages(type);
@@ -26,6 +70,8 @@ CREATE INDEX IF NOT EXISTS idx_pages_frontmatter ON pages USING GIN(frontmatter)
CREATE INDEX IF NOT EXISTS idx_pages_trgm ON pages USING GIN(title gin_trgm_ops);
-- v0.13.1 #170: avoids 14.6s seqscan on large brains when listing pages newest-first.
CREATE INDEX IF NOT EXISTS idx_pages_updated_at_desc ON pages (updated_at DESC);
-- v0.18.0: source-scoped scans (per /plan-eng-review Section 4).
CREATE INDEX IF NOT EXISTS idx_pages_source_id ON pages(source_id);
-- ============================================================
-- content_chunks: chunked content with embeddings
@@ -70,6 +116,11 @@ CREATE TABLE IF NOT EXISTS links (
link_source TEXT CHECK (link_source IS NULL OR link_source IN ('markdown', 'frontmatter', 'manual')),
origin_page_id INTEGER REFERENCES pages(id) ON DELETE SET NULL,
origin_field TEXT,
-- v0.18.0 Step 4: 'qualified' when the link was written as
-- [[source:slug]] (target source pinned). 'unqualified' when written
-- as bare [[slug]] and resolved via local-first fallback at
-- extraction time. NULL for legacy/manual/frontmatter edges.
resolution_type TEXT CHECK (resolution_type IS NULL OR resolution_type IN ('qualified', 'unqualified')),
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
-- NULLS NOT DISTINCT (PG15+) so two rows with link_source IS NULL or
-- origin_page_id IS NULL collide as expected. Without this, every row with
@@ -144,6 +195,9 @@ CREATE INDEX IF NOT EXISTS idx_versions_page ON page_versions(page_id);
-- ============================================================
-- ingest_log
-- ============================================================
-- NOTE (v0.18.0 Step 1): ingest_log.source_id is NOT added yet — lands
-- in v17 alongside the sync rewrite (Step 5), which starts writing
-- source-scoped entries.
CREATE TABLE IF NOT EXISTS ingest_log (
id SERIAL PRIMARY KEY,
source_type TEXT NOT NULL,
@@ -198,9 +252,18 @@ CREATE TABLE IF NOT EXISTS mcp_request_log (
-- ============================================================
-- files: binary attachments stored in Supabase Storage
-- ============================================================
-- v0.18.0 Step 7: files gains source_id + page_id alongside the
-- legacy page_slug (kept for backward compat until a later release).
-- The file_migration_ledger below drives the storage object rewrite.
-- The page_slug FK (and its ON UPDATE CASCADE) is removed: slugs are no
-- longer globally unique (composite UNIQUE), so a slug-keyed FK is ambiguous.
-- ON DELETE SET NULL is preserved via the page_id FK; page_slug is plain TEXT.
CREATE TABLE IF NOT EXISTS files (
id SERIAL PRIMARY KEY,
page_slug TEXT REFERENCES pages(slug) ON DELETE SET NULL ON UPDATE CASCADE,
source_id TEXT NOT NULL DEFAULT 'default'
REFERENCES sources(id) ON DELETE CASCADE,
page_slug TEXT,
page_id INTEGER REFERENCES pages(id) ON DELETE SET NULL,
filename TEXT NOT NULL,
storage_path TEXT NOT NULL,
mime_type TEXT,
@@ -215,8 +278,30 @@ CREATE TABLE IF NOT EXISTS files (
ALTER TABLE files DROP COLUMN IF EXISTS storage_url;
CREATE INDEX IF NOT EXISTS idx_files_page ON files(page_slug);
CREATE INDEX IF NOT EXISTS idx_files_page_id ON files(page_id);
CREATE INDEX IF NOT EXISTS idx_files_source_id ON files(source_id);
CREATE INDEX IF NOT EXISTS idx_files_hash ON files(content_hash);
-- ============================================================
-- file_migration_ledger (v0.18.0 Step 7)
-- Drives the storage-object rewrite performed by the v0_18_0
-- orchestrator's phase B. Keyed on file_id so two sources can share
-- an old path during migration without PK collision (Codex second-
-- pass caught this).
-- Status state machine: pending → copy_done → db_updated → complete
-- ============================================================
CREATE TABLE IF NOT EXISTS file_migration_ledger (
file_id INTEGER PRIMARY KEY REFERENCES files(id) ON DELETE CASCADE,
storage_path_old TEXT NOT NULL,
storage_path_new TEXT NOT NULL,
status TEXT NOT NULL DEFAULT 'pending',
error TEXT,
updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
CONSTRAINT chk_ledger_status CHECK (status IN ('pending','copy_done','db_updated','complete','failed'))
);
CREATE INDEX IF NOT EXISTS idx_file_migration_ledger_status
ON file_migration_ledger(status) WHERE status != 'complete';
-- ============================================================
-- Trigger-based search_vector (spans pages + timeline_entries)
-- ============================================================
@@ -465,6 +550,8 @@ BEGIN
ALTER TABLE config ENABLE ROW LEVEL SECURITY;
ALTER TABLE files ENABLE ROW LEVEL SECURITY;
ALTER TABLE minion_jobs ENABLE ROW LEVEL SECURITY;
ALTER TABLE sources ENABLE ROW LEVEL SECURITY;
ALTER TABLE file_migration_ledger ENABLE ROW LEVEL SECURITY;
RAISE NOTICE 'RLS enabled on all tables (role % has BYPASSRLS)', current_user;
ELSE
RAISE WARNING 'Skipping RLS: role % does not have BYPASSRLS privilege. Run as postgres role to enable.', current_user;
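
The ledger's status state machine (pending → copy_done → db_updated → complete) can be sketched as a forward-only transition table. This is a minimal illustration, not the orchestrator's actual code: `LedgerStatus`, `advance`, and `needsWork` are hypothetical names, and treating `failed` as a terminal state that still needs attention is an assumption drawn only from the CHECK constraint and the partial index.

```typescript
// Hypothetical sketch of the file_migration_ledger status machine.
// The five values mirror chk_ledger_status in the schema above.
type LedgerStatus = "pending" | "copy_done" | "db_updated" | "complete" | "failed";

// Forward-only transitions from the schema comment:
// pending → copy_done → db_updated → complete.
// 'failed' has no forward step here (assumption: retries reset it elsewhere).
const NEXT: Record<LedgerStatus, LedgerStatus | null> = {
  pending: "copy_done",
  copy_done: "db_updated",
  db_updated: "complete",
  complete: null,
  failed: null,
};

// Returns the next status, or throws when the row is already terminal.
function advance(status: LedgerStatus): LedgerStatus {
  const next = NEXT[status];
  if (next === null) {
    throw new Error(`no forward transition from '${status}'`);
  }
  return next;
}

// Mirrors the partial index predicate: WHERE status != 'complete'.
function needsWork(status: LedgerStatus): boolean {
  return status !== "complete";
}
```

Because every step only moves forward, a crashed run can presumably resume from whatever status each row last reached, which is what the partial index on non-complete rows appears designed to support.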

@@ -105,8 +105,9 @@ describe('buildPlan — diff against completed + installed VERSION', () => {
// Future migrations (registered but newer than installed VERSION) land in
// skippedFuture until the binary catches up. v0.13.0 = frontmatter graph,
// v0.13.1 = Knowledge Runtime grandfather, v0.14.0 = shell jobs +
// autopilot cooperative, v0.16.0 = subagent runtime (this branch).
expect(plan.skippedFuture.map(m => m.version)).toEqual(['0.12.0', '0.12.2', '0.13.0', '0.13.1', '0.14.0', '0.16.0']);
// autopilot cooperative, v0.16.0 = subagent runtime, v0.18.0 = multi-
// source brains (this branch).
expect(plan.skippedFuture.map(m => m.version)).toEqual(['0.12.0', '0.12.2', '0.13.0', '0.13.1', '0.14.0', '0.16.0', '0.18.0']);
});
test('already applied → v0.11.0 lands in `applied` bucket, not pending', () => {
@@ -142,11 +143,11 @@ describe('buildPlan — diff against completed + installed VERSION', () => {
const idx = indexCompleted([]);
const plan = buildPlan(idx, '0.12.0');
expect(plan.pending.map(m => m.version)).toContain('0.11.0');
// v0.12.2, v0.13.0, v0.13.1, v0.14.0, and v0.16.0 were added later;
// v0.12.2, v0.13.0, v0.13.1, v0.14.0, v0.16.0, v0.18.0 were added later;
// installed=0.12.0 means they belong in skippedFuture, not pending. v0.11.0
// and v0.12.0 stay pending despite being ≤ installed — that is the H9
// invariant.
expect(plan.skippedFuture.map(m => m.version)).toEqual(['0.12.2', '0.13.0', '0.13.1', '0.14.0', '0.16.0']);
expect(plan.skippedFuture.map(m => m.version)).toEqual(['0.12.2', '0.13.0', '0.13.1', '0.14.0', '0.16.0', '0.18.0']);
});
test('--migration filter narrows to one version', () => {
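
The bucketing these assertions describe (completed versions are applied, versions newer than the installed binary are skipped as future, everything else stays pending) can be sketched as follows. This is an illustration under stated assumptions: `compareVersions` and `bucketMigrations` are hypothetical names, the real `buildPlan` takes an index object rather than a `Set`, and the order of the checks is a guess.

```typescript
// Hypothetical sketch of the plan buckets asserted in the tests above.
interface Plan {
  applied: string[];
  pending: string[];
  skippedFuture: string[];
}

// Numeric comparison of dotted versions such as "0.12.2" vs "0.13.0".
function compareVersions(a: string, b: string): number {
  const pa = a.split(".").map(Number);
  const pb = b.split(".").map(Number);
  for (let i = 0; i < Math.max(pa.length, pb.length); i++) {
    const diff = (pa[i] ?? 0) - (pb[i] ?? 0);
    if (diff !== 0) return diff;
  }
  return 0;
}

function bucketMigrations(
  registered: string[],
  completed: Set<string>,
  installed: string,
): Plan {
  const plan: Plan = { applied: [], pending: [], skippedFuture: [] };
  // Sort a copy so registration order never affects the result.
  for (const version of [...registered].sort(compareVersions)) {
    if (completed.has(version)) {
      plan.applied.push(version);
    } else if (compareVersions(version, installed) > 0) {
      // Newer than the installed binary: wait until the binary catches up.
      plan.skippedFuture.push(version);
    } else {
      // At or below installed and not completed: still owed (the H9 invariant).
      plan.pending.push(version);
    }
  }
  return plan;
}
```

With nothing completed and `installed = "0.12.0"`, this sketch places v0.11.0 and v0.12.0 in `pending` and every later version, including v0.18.0, in `skippedFuture`, matching the expectation in the hunk above.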

@@ -154,3 +154,75 @@ describe('edge cases', () => {
expect(deduped.filter(r => r.slug === 'a').length).toBeLessThanOrEqual(3);
});
});
// ─────────────────────────────────────────────────────────────────
// v0.18.0 Step 3 — source-aware dedup (REGRESSION-CRITICAL per Codex)
// ─────────────────────────────────────────────────────────────────
// Pre-v0.17 dedup collapsed on slug alone. Under multi-source
// uniqueness, two same-slug pages in different sources ARE different
// pages — collapsing them destroys cross-source recall. Codex flagged
// this as a regression-critical path in the outside-voice review.
describe('dedup — source-aware composite key (v0.18.0)', () => {
test('same slug across two sources does NOT collapse via dedupBySource layer', () => {
// Two pages, same slug, different sources. Both should survive
// Layer 1 (top-3-per-page) because they are DIFFERENT pages.
const results = [
makeResult({ slug: 'topics/ai', source_id: 'wiki', score: 0.9, chunk_text: 'wiki take on ai' }),
makeResult({ slug: 'topics/ai', source_id: 'gstack', score: 0.85, chunk_text: 'gstack plans for ai' }),
];
const deduped = dedupResults(results);
// Both pages represented — one result each.
const wikiHits = deduped.filter(r => r.source_id === 'wiki' && r.slug === 'topics/ai');
const gstackHits = deduped.filter(r => r.source_id === 'gstack' && r.slug === 'topics/ai');
expect(wikiHits.length).toBe(1);
expect(gstackHits.length).toBe(1);
});
test('same slug + same source DOES collapse to maxPerPage', () => {
// Control: same-source-same-slug behavior unchanged from pre-v0.17.
const results = [
makeResult({ slug: 'topics/ai', source_id: 'wiki', chunk_id: 1, score: 0.9, chunk_text: 'chunk one distinct content here' }),
makeResult({ slug: 'topics/ai', source_id: 'wiki', chunk_id: 2, score: 0.8, chunk_text: 'chunk two also distinct words' }),
makeResult({ slug: 'topics/ai', source_id: 'wiki', chunk_id: 3, score: 0.7, chunk_text: 'chunk three different terms again' }),
];
const deduped = dedupResults(results);
// Default maxPerPage=2 → only 2 of the 3 wiki:topics/ai chunks survive.
const wikiHits = deduped.filter(r => r.source_id === 'wiki' && r.slug === 'topics/ai');
expect(wikiHits.length).toBeLessThanOrEqual(2);
});
test('missing source_id defaults to "default" for backward compat', () => {
// Pre-v0.17 brains (single source, rows with no source_id column)
// still dedup correctly: the fallback key groups them all under
// the 'default' source bucket.
const results = [
makeResult({ slug: 'topics/ai', chunk_id: 1, score: 0.9, chunk_text: 'chunk one distinct content words' }),
makeResult({ slug: 'topics/ai', chunk_id: 2, score: 0.8, chunk_text: 'chunk two totally different phrasing' }),
makeResult({ slug: 'topics/ai', chunk_id: 3, score: 0.7, chunk_text: 'chunk three new unique text here' }),
];
const deduped = dedupResults(results);
// All three should group as one page (no source_id → default), so
// maxPerPage=2 cap applies.
expect(deduped.length).toBeLessThanOrEqual(2);
});
test('compiled_truth guarantee scopes to (source_id, slug), not slug alone', () => {
// Two pages, same slug, different sources. wiki's top-scoring chunk
// is timeline; gstack has only compiled_truth. The guarantee must
// swap in wiki's compiled_truth for wiki (without touching gstack)
// and must NOT accidentally pull gstack's compiled_truth into wiki.
const results = [
makeResult({ slug: 'topics/ai', source_id: 'wiki', score: 0.9, chunk_source: 'timeline', chunk_id: 1, chunk_text: 'wiki timeline chunk content here' }),
makeResult({ slug: 'topics/ai', source_id: 'wiki', score: 0.5, chunk_source: 'compiled_truth', chunk_id: 2, chunk_text: 'wiki compiled truth content text' }),
makeResult({ slug: 'topics/ai', source_id: 'gstack', score: 0.7, chunk_source: 'compiled_truth', chunk_id: 3, chunk_text: 'gstack compiled truth something else' }),
];
const deduped = dedupResults(results);
// Wiki ends up with a compiled_truth (swapped from its own source,
// not gstack's).
const wikiCompiledTruths = deduped.filter(
r => r.source_id === 'wiki' && r.slug === 'topics/ai' && r.chunk_source === 'compiled_truth',
);
expect(wikiCompiledTruths.length).toBe(1);
expect(wikiCompiledTruths[0].chunk_id).toBe(2); // wiki's own compiled_truth, NOT gstack's (id=3)
});
});
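
The composite-key behavior these tests pin down can be sketched in a few lines. This is a minimal illustration, not the real `dedupResults`: `Hit`, `pageKey`, and `capPerPage` are hypothetical names, the per-page cap of 2 is taken from the test comments, and the compiled_truth guarantee is left out.

```typescript
// Hypothetical sketch of source-aware grouping with a per-page cap.
interface Hit {
  slug: string;
  source_id?: string; // absent on rows from pre-multi-source brains
  score: number;
  chunk_id: number;
}

// Group by (source_id, slug); a missing source_id falls back to 'default'.
// JSON.stringify of the pair avoids separator collisions between the parts.
function pageKey(hit: Hit): string {
  return JSON.stringify([hit.source_id ?? "default", hit.slug]);
}

// Keep at most maxPerPage hits per page, highest scores first.
function capPerPage(hits: Hit[], maxPerPage = 2): Hit[] {
  const counts = new Map<string, number>();
  const kept: Hit[] = [];
  for (const hit of [...hits].sort((a, b) => b.score - a.score)) {
    const key = pageKey(hit);
    const seen = counts.get(key) ?? 0;
    if (seen < maxPerPage) {
      kept.push(hit);
      counts.set(key, seen + 1);
    }
  }
  return kept;
}
```

Keying on the pair rather than the slug alone is what lets `wiki:topics/ai` and `gstack:topics/ai` each keep their own quota instead of competing for one.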

@@ -633,7 +633,7 @@ describeE2E('E2E: file_list LIMIT enforcement', () => {
await sql`
INSERT INTO pages (slug, title, type, compiled_truth, frontmatter)
VALUES (${testSlug}, ${'Test Limit Page'}, ${'note'}, ${'body'}, ${'{}'}::jsonb)
ON CONFLICT (slug) DO NOTHING
ON CONFLICT (source_id, slug) DO NOTHING
`;
// Insert 150 file rows for the same slug

@@ -0,0 +1,608 @@
/**
* E2E: v0.18.0 multi-source migrations against REAL Postgres.
*
* PGLite doesn't have a files table (see pglite-schema.ts header), so the
* v23 migration's files.source_id + files.page_id rewrite + ledger seed
* is NEVER executed by the PGLite integration test. This file closes
* that gap by exercising the full v20-v23 chain against a real Postgres
* DB with pre-existing data.
*
* Also covers the gaps in the PR's pre-shipping test matrix that the
* author self-audited:
* - files.page_slug → page_id backfill against real rows
* - file_migration_ledger seeding
* - cascade delete via sources.remove (pages + chunks + timeline +
* files + links all gone)
* - sync --source <id> routing reads + writes per-source sync anchors
* instead of the global config keys
*
* Gated by DATABASE_URL — skips gracefully when unset, per the CLAUDE.md
* E2E lifecycle pattern.
*/
import { describe, test, expect, beforeAll, afterAll } from 'bun:test';
import { PostgresEngine } from '../../src/core/postgres-engine.ts';
import { runSources } from '../../src/commands/sources.ts';
import { performSync } from '../../src/commands/sync.ts';
import { runStorageBackfill } from '../../src/commands/migrations/v0_18_0-storage-backfill.ts';
import type { StorageBackend } from '../../src/core/storage.ts';
import { hasDatabase, setupDB, teardownDB, getConn, getEngine } from './helpers.ts';
const SKIP = !hasDatabase();
const describeE2E = SKIP ? describe.skip : describe;
describeE2E('v0.18.0 multi-source — Postgres schema shape (fresh install)', () => {
beforeAll(async () => {
await setupDB();
// sources + file_migration_ledger are not in helpers.ALL_TABLES, so
// residual rows from prior test runs can shadow new INSERTs. Wipe
// non-default sources at the top of every describe to keep each
// block hermetic. file_migration_ledger cascades from files which
// setupDB already truncates, but wipe explicitly in case files did
// not cascade it.
const conn = getConn();
await conn.unsafe(`DELETE FROM sources WHERE id != 'default'`);
await conn.unsafe(`DELETE FROM file_migration_ledger`);
});
afterAll(async () => {
await teardownDB();
});
test("sources('default') exists after initSchema + migration chain", async () => {
const conn = getConn();
const rows = await conn.unsafe(
`SELECT id, name, config FROM sources WHERE id = 'default'`,
);
expect(rows.length).toBe(1);
expect(rows[0].name).toBe('default');
const config = typeof rows[0].config === 'string' ? JSON.parse(rows[0].config) : rows[0].config;
expect(config.federated).toBe(true);
});
test('pages.source_id NOT NULL with DEFAULT default (v21)', async () => {
const conn = getConn();
const rows = await conn.unsafe(
`SELECT column_name, column_default, is_nullable
FROM information_schema.columns
WHERE table_name = 'pages' AND column_name = 'source_id'`,
);
expect(rows.length).toBe(1);
expect(rows[0].is_nullable).toBe('NO');
expect(String(rows[0].column_default)).toContain('default');
});
test('composite UNIQUE pages(source_id, slug) replaces global UNIQUE(slug)', async () => {
const conn = getConn();
const composite = await conn.unsafe(
`SELECT conname FROM pg_constraint WHERE conname = 'pages_source_slug_key'`,
);
expect(composite.length).toBe(1);
const oldGlobal = await conn.unsafe(
`SELECT conname FROM pg_constraint WHERE conname = 'pages_slug_key'`,
);
expect(oldGlobal.length).toBe(0);
});
test('links.resolution_type column exists with CHECK (v22)', async () => {
const conn = getConn();
const rows = await conn.unsafe(
`SELECT column_name FROM information_schema.columns
WHERE table_name = 'links' AND column_name = 'resolution_type'`,
);
expect(rows.length).toBe(1);
const check = await conn.unsafe(
`SELECT conname FROM pg_constraint WHERE conname = 'links_resolution_type_check'`,
);
expect(check.length).toBe(1);
});
test('files.source_id + files.page_id columns exist (v23, Postgres-only)', async () => {
const conn = getConn();
const cols = await conn.unsafe(
`SELECT column_name FROM information_schema.columns
WHERE table_name = 'files' AND column_name IN ('source_id', 'page_id')`,
);
// postgres.js returns RowList with an iterable-row shape; cast via
// unknown before narrowing to plain objects (TS2352 otherwise).
const names = new Set(
(cols as unknown as Array<{ column_name: string }>).map(r => r.column_name),
);
expect(names.has('source_id')).toBe(true);
expect(names.has('page_id')).toBe(true);
});
test('file_migration_ledger table exists with status CHECK (v23)', async () => {
const conn = getConn();
const tables = await conn.unsafe(
`SELECT table_name FROM information_schema.tables
WHERE table_name = 'file_migration_ledger'`,
);
expect(tables.length).toBe(1);
const check = await conn.unsafe(
`SELECT conname FROM pg_constraint WHERE conname = 'chk_ledger_status'`,
);
expect(check.length).toBe(1);
});
});
describeE2E('v0.18.0 multi-source — composite UNIQUE semantics on real Postgres', () => {
beforeAll(async () => {
await setupDB();
// sources + file_migration_ledger are not in helpers.ALL_TABLES, so
// residual rows from prior test runs can shadow new INSERTs. Wipe
// non-default sources at the top of every describe to keep each
// block hermetic. file_migration_ledger cascades from files which
// setupDB already truncates, but wipe explicitly in case files did
// not cascade it.
const conn = getConn();
await conn.unsafe(`DELETE FROM sources WHERE id != 'default'`);
await conn.unsafe(`DELETE FROM file_migration_ledger`);
});
afterAll(async () => {
await teardownDB();
});
test('same slug in two sources coexists (REGRESSION GUARD — Codex critical)', async () => {
const conn = getConn();
// Create a second source.
const engine = getEngine();
await runSources(engine as unknown as Parameters<typeof runSources>[0], ['add', 'wiki', '--federated']);
// Insert the same slug under 'default' (via putPage) and 'wiki' (raw INSERT).
await engine.putPage('topics/ai', {
type: 'concept', title: 'AI from default', compiled_truth: 'default source take',
});
await conn.unsafe(
`INSERT INTO pages (source_id, slug, type, title, compiled_truth, timeline, frontmatter, content_hash)
VALUES ('wiki', 'topics/ai', 'concept', 'AI from wiki', 'wiki source take', '', '{}'::jsonb, 'wikihash')`,
);
const rows = await conn.unsafe(
`SELECT source_id, slug, title FROM pages WHERE slug = 'topics/ai' ORDER BY source_id`,
);
expect(rows.length).toBe(2);
expect(rows.map((r: any) => r.source_id).sort()).toEqual(['default', 'wiki']);
});
test('duplicate (source_id, slug) hits composite UNIQUE', async () => {
const conn = getConn();
let err: Error | null = null;
try {
await conn.unsafe(
`INSERT INTO pages (source_id, slug, type, title, compiled_truth, timeline, frontmatter, content_hash)
VALUES ('wiki', 'topics/ai', 'concept', 'dup', '', '', '{}'::jsonb, 'dup')`,
);
} catch (e) {
err = e as Error;
}
expect(err).not.toBeNull();
expect(err!.message.toLowerCase()).toMatch(/unique|duplicate/);
});
test('putPage (engine API) targets default source by schema DEFAULT', async () => {
const engine = getEngine();
await engine.putPage('topics/from-putpage', {
type: 'note', title: 'Via putPage', compiled_truth: 'body',
});
const conn = getConn();
const rows = await conn.unsafe(
`SELECT source_id FROM pages WHERE slug = 'topics/from-putpage'`,
);
expect(rows.length).toBe(1);
expect(rows[0].source_id).toBe('default');
});
});
describeE2E('v0.18.0 multi-source — cascade delete covers every dependent row', () => {
beforeAll(async () => {
await setupDB();
// sources + file_migration_ledger are not in helpers.ALL_TABLES, so
// residual rows from prior test runs can shadow new INSERTs. Wipe
// non-default sources at the top of every describe to keep each
// block hermetic. file_migration_ledger cascades from files which
// setupDB already truncates, but wipe explicitly in case files did
// not cascade it.
const conn = getConn();
await conn.unsafe(`DELETE FROM sources WHERE id != 'default'`);
await conn.unsafe(`DELETE FROM file_migration_ledger`);
});
afterAll(async () => {
await teardownDB();
});
test('sources remove cascades to pages + chunks + timeline + links + files', async () => {
const conn = getConn();
const engine = getEngine();
// Build a fully populated source: page, chunks, timeline entries,
// links, a file row. Then remove the source and verify nothing
// for that source survives.
await runSources(engine as unknown as Parameters<typeof runSources>[0], ['add', 'cascadetest', '--federated']);
// Page under cascadetest
await conn.unsafe(
`INSERT INTO pages (source_id, slug, type, title, compiled_truth, timeline, frontmatter, content_hash)
VALUES ('cascadetest', 'people/alice', 'person', 'Alice', 'Alice body', '', '{}'::jsonb, 'h1')`,
);
const alicePage = await conn.unsafe(
`SELECT id FROM pages WHERE source_id = 'cascadetest' AND slug = 'people/alice'`,
);
const aliceId = alicePage[0].id as number;
// A second page for link target
await conn.unsafe(
`INSERT INTO pages (source_id, slug, type, title, compiled_truth, timeline, frontmatter, content_hash)
VALUES ('cascadetest', 'companies/acme', 'company', 'Acme', 'Acme body', '', '{}'::jsonb, 'h2')`,
);
const acmePage = await conn.unsafe(
`SELECT id FROM pages WHERE source_id = 'cascadetest' AND slug = 'companies/acme'`,
);
const acmeId = acmePage[0].id as number;
// Chunk
await conn.unsafe(
`INSERT INTO content_chunks (page_id, chunk_index, chunk_text, chunk_source)
VALUES (${aliceId}, 0, 'Alice body chunk', 'compiled_truth')`,
);
// Timeline
await conn.unsafe(
`INSERT INTO timeline_entries (page_id, date, source, summary, detail)
VALUES (${aliceId}, '2026-01-15', 'test', 'Joined Acme', 'detail')`,
);
// Link Alice → Acme
await conn.unsafe(
`INSERT INTO links (from_page_id, to_page_id, link_type, link_source)
VALUES (${aliceId}, ${acmeId}, 'works_at', 'markdown')`,
);
// File row pointing at Alice
await conn.unsafe(
`INSERT INTO files (source_id, page_id, filename, storage_path, content_hash)
VALUES ('cascadetest', ${aliceId}, 'alice.pdf', 'cascadetest/people/alice/alice.pdf', 'fh1')`,
);
// Sanity: everything exists
expect((await conn.unsafe(`SELECT COUNT(*)::int AS n FROM pages WHERE source_id = 'cascadetest'`))[0].n).toBe(2);
expect((await conn.unsafe(`SELECT COUNT(*)::int AS n FROM content_chunks WHERE page_id = ${aliceId}`))[0].n).toBe(1);
expect((await conn.unsafe(`SELECT COUNT(*)::int AS n FROM timeline_entries WHERE page_id = ${aliceId}`))[0].n).toBe(1);
expect((await conn.unsafe(`SELECT COUNT(*)::int AS n FROM links WHERE from_page_id = ${aliceId}`))[0].n).toBe(1);
expect((await conn.unsafe(`SELECT COUNT(*)::int AS n FROM files WHERE source_id = 'cascadetest'`))[0].n).toBe(1);
// Remove the source.
await runSources(engine as unknown as Parameters<typeof runSources>[0], ['remove', 'cascadetest', '--yes']);
// Everything for that source is gone.
expect((await conn.unsafe(`SELECT COUNT(*)::int AS n FROM pages WHERE source_id = 'cascadetest'`))[0].n).toBe(0);
expect((await conn.unsafe(`SELECT COUNT(*)::int AS n FROM content_chunks WHERE page_id = ${aliceId}`))[0].n).toBe(0);
expect((await conn.unsafe(`SELECT COUNT(*)::int AS n FROM timeline_entries WHERE page_id = ${aliceId}`))[0].n).toBe(0);
expect((await conn.unsafe(`SELECT COUNT(*)::int AS n FROM links WHERE from_page_id = ${aliceId}`))[0].n).toBe(0);
expect((await conn.unsafe(`SELECT COUNT(*)::int AS n FROM files WHERE source_id = 'cascadetest'`))[0].n).toBe(0);
// The sources row itself is gone.
const src = await conn.unsafe(`SELECT id FROM sources WHERE id = 'cascadetest'`);
expect(src.length).toBe(0);
});
});
describeE2E('v0.18.0 multi-source — sync --source routes through sources table', () => {
beforeAll(async () => {
await setupDB();
// sources + file_migration_ledger are not in helpers.ALL_TABLES, so
// residual rows from prior test runs can shadow new INSERTs. Wipe
// non-default sources at the top of every describe to keep each
// block hermetic. file_migration_ledger cascades from files which
// setupDB already truncates, but wipe explicitly in case files did
// not cascade it.
const conn = getConn();
await conn.unsafe(`DELETE FROM sources WHERE id != 'default'`);
await conn.unsafe(`DELETE FROM file_migration_ledger`);
});
afterAll(async () => {
await teardownDB();
});
test('performSync with sourceId reads local_path from sources row', async () => {
const engine = getEngine();
const conn = getConn();
// Register a source with a bogus path (we're not actually walking a
// repo — this test asserts that performSync correctly RESOLVES the
// source row vs hitting the global config).
await runSources(engine as unknown as Parameters<typeof runSources>[0], [
'add', 'syncsrc', '--path', '/nonexistent/syncsrc/path', '--no-federated',
]);
// Also set a DIFFERENT path in the global config so we can verify
// sourceId actually disambiguates.
await engine.setConfig('sync.repo_path', '/some/other/default/path');
// performSync({sourceId: 'syncsrc'}) should attempt to use
// /nonexistent/syncsrc/path, NOT /some/other/default/path.
let err: Error | null = null;
try {
await performSync(engine, { sourceId: 'syncsrc' });
} catch (e) {
err = e as Error;
}
expect(err).not.toBeNull();
// The error message references the source-scoped path, not the
// global config path. (Could be "Not a git repository"
// or "No commits in repo" — either way the path it cites should
// be the source's.)
expect(err!.message).toContain('/nonexistent/syncsrc/path');
expect(err!.message).not.toContain('/some/other/default/path');
});
test('performSync with no sourceId falls back to global sync.repo_path', async () => {
const engine = getEngine();
// Global config is still '/some/other/default/path' from the
// previous test. Without --source, performSync uses it.
let err: Error | null = null;
try {
await performSync(engine, {});
} catch (e) {
err = e as Error;
}
expect(err).not.toBeNull();
expect(err!.message).toContain('/some/other/default/path');
});
});
describeE2E('v0.18.0 multi-source — sources table surface', () => {
beforeAll(async () => {
await setupDB();
// sources + file_migration_ledger are not in helpers.ALL_TABLES, so
// residual rows from prior test runs can shadow new INSERTs. Wipe
// non-default sources at the top of every describe to keep each
// block hermetic. file_migration_ledger cascades from files which
// setupDB already truncates, but wipe explicitly in case files did
// not cascade it.
const conn = getConn();
await conn.unsafe(`DELETE FROM sources WHERE id != 'default'`);
await conn.unsafe(`DELETE FROM file_migration_ledger`);
});
afterAll(async () => {
await teardownDB();
});
test('default source is seeded federated=true; new sources default to isolated', async () => {
const conn = getConn();
const engine = getEngine();
const def = await conn.unsafe(`SELECT config FROM sources WHERE id = 'default'`);
const defConfig = typeof def[0].config === 'string' ? JSON.parse(def[0].config) : def[0].config;
expect(defConfig.federated).toBe(true);
// Defensive cleanup: sources isn't in helpers.ALL_TABLES, so residual
// rows from prior test runs can shadow this INSERT via ON CONFLICT
// DO NOTHING. Delete first, then create.
await conn.unsafe(`DELETE FROM sources WHERE id = 'isolatedsrc'`);
await runSources(engine as unknown as Parameters<typeof runSources>[0], ['add', 'isolatedsrc']);
const iso = await conn.unsafe(`SELECT config FROM sources WHERE id = 'isolatedsrc'`);
const isoConfig = typeof iso[0].config === 'string' ? JSON.parse(iso[0].config) : iso[0].config;
expect(isoConfig.federated).toBeUndefined(); // omitted → isolated-by-default
});
test('federate / unfederate flips config.federated on real DB', async () => {
const conn = getConn();
const engine = getEngine();
await runSources(engine as unknown as Parameters<typeof runSources>[0], ['federate', 'isolatedsrc']);
let row = await conn.unsafe(`SELECT config FROM sources WHERE id = 'isolatedsrc'`);
let config = typeof row[0].config === 'string' ? JSON.parse(row[0].config) : row[0].config;
expect(config.federated).toBe(true);
await runSources(engine as unknown as Parameters<typeof runSources>[0], ['unfederate', 'isolatedsrc']);
row = await conn.unsafe(`SELECT config FROM sources WHERE id = 'isolatedsrc'`);
config = typeof row[0].config === 'string' ? JSON.parse(row[0].config) : row[0].config;
expect(config.federated).toBe(false);
});
test('rename changes name, id stays stable', async () => {
const conn = getConn();
const engine = getEngine();
await runSources(engine as unknown as Parameters<typeof runSources>[0], [
'rename', 'isolatedsrc', 'My Isolated Source',
]);
const row = await conn.unsafe(`SELECT id, name FROM sources WHERE id = 'isolatedsrc'`);
expect(row[0].id).toBe('isolatedsrc');
expect(row[0].name).toBe('My Isolated Source');
});
});
describeE2E('v0.18.0 multi-source — storage backfill against file_migration_ledger', () => {
beforeAll(async () => {
await setupDB();
// sources + file_migration_ledger are not in helpers.ALL_TABLES, so
// residual rows from prior test runs can shadow new INSERTs. Wipe
// non-default sources at the top of every describe to keep each
// block hermetic. file_migration_ledger cascades from files which
// setupDB already truncates, but wipe explicitly in case files did
// not cascade it.
const conn = getConn();
await conn.unsafe(`DELETE FROM sources WHERE id != 'default'`);
await conn.unsafe(`DELETE FROM file_migration_ledger`);
});
afterAll(async () => {
await teardownDB();
});
test('seeded ledger + stub storage: pending → complete end-to-end', async () => {
const conn = getConn();
const engine = getEngine();
// Seed a page + file (via raw INSERT so the test doesn't depend on
// sync running).
await conn.unsafe(
`INSERT INTO pages (source_id, slug, type, title, compiled_truth, timeline, frontmatter, content_hash)
VALUES ('default', 'topics/storage', 'note', 'Storage test', 'body', '', '{}'::jsonb, 'sh1')`,
);
const pageRow = await conn.unsafe(
`SELECT id FROM pages WHERE source_id = 'default' AND slug = 'topics/storage'`,
);
const pageId = pageRow[0].id as number;
await conn.unsafe(
`INSERT INTO files (source_id, page_id, filename, storage_path, content_hash)
VALUES ('default', ${pageId}, 'doc.pdf', 'topics/storage/doc.pdf', 'fh1')`,
);
const fileRow = await conn.unsafe(
`SELECT id FROM files WHERE storage_path = 'topics/storage/doc.pdf'`,
);
const fileId = fileRow[0].id as number;
// Seed the ledger manually so we don't depend on the v23 seed SQL
// (the TRUNCATE CASCADE in setupDB wipes ledger rows).
await conn.unsafe(
`INSERT INTO file_migration_ledger (file_id, storage_path_old, storage_path_new, status)
VALUES (${fileId}, 'topics/storage/doc.pdf', 'default/topics/storage/doc.pdf', 'pending')
ON CONFLICT (file_id) DO NOTHING`,
);
// Stub storage: downloads return bytes, uploads track what was written.
const uploaded = new Set<string>();
const stub: StorageBackend = {
upload: async (p: string) => { uploaded.add(p); },
download: async (p: string) => Buffer.from('bytes-for:' + p),
delete: async (p: string) => { uploaded.delete(p); },
exists: async (p: string) => uploaded.has(p),
list: async () => [],
getUrl: async (p) => `https://stub/${p}`,
};
const report = await runStorageBackfill(engine, stub);
expect(report.total).toBe(1);
expect(report.nowComplete).toBe(1);
expect(report.failed).toBe(0);
// Ledger row transitioned to complete.
const ledger = await conn.unsafe(
`SELECT status FROM file_migration_ledger WHERE file_id = ${fileId}`,
);
expect(ledger[0].status).toBe('complete');
// Files row now points at the new path.
const filesAfter = await conn.unsafe(
`SELECT storage_path FROM files WHERE id = ${fileId}`,
);
expect(filesAfter[0].storage_path).toBe('default/topics/storage/doc.pdf');
// Stub storage saw the upload happen at the new path.
expect(uploaded.has('default/topics/storage/doc.pdf')).toBe(true);
});
});
// v0.18.0: real-Postgres regression guard for the addLinksBatch /
// addTimelineEntriesBatch JOIN fan-out bug. Before the fix, the JOIN was
// `pages.slug = v.from_slug` unqualified — so two pages sharing the same
// slug across sources would silently duplicate edges and timeline rows.
// postgres-js binds arrays through `unnest()` rather than inline VALUES,
// so the query shape is structurally different from PGLite's and gets its
// own coverage.
describeE2E('v0.18.0 multi-source — addLinksBatch / addTimelineEntriesBatch source-awareness', () => {
beforeAll(async () => {
await setupDB();
const conn = getConn();
await conn.unsafe(`DELETE FROM sources WHERE id != 'default'`);
await conn.unsafe(`DELETE FROM file_migration_ledger`);
});
afterAll(async () => { await teardownDB(); });
async function seedSameSlugTwoSources() {
const conn = getConn();
const engine = getEngine() as PostgresEngine;
// Second source alongside 'default'.
await conn.unsafe(
`INSERT INTO sources (id, name) VALUES ('alt', 'alt') ON CONFLICT (id) DO NOTHING`
);
// Create same-slug pages in both sources. putPage defaults to 'default'.
await engine.putPage('topics/ai', { type: 'concept', title: 'AI (default)', compiled_truth: '', timeline: '' });
await engine.putPage('topics/ml', { type: 'concept', title: 'ML (default)', compiled_truth: '', timeline: '' });
await conn.unsafe(
`INSERT INTO pages (slug, type, title, compiled_truth, timeline, frontmatter, content_hash, source_id, updated_at)
VALUES ('topics/ai', 'concept', 'AI (alt)', '', '', '{}'::jsonb, 'alt-ai-hash', 'alt', now()),
('topics/ml', 'concept', 'ML (alt)', '', '', '{}'::jsonb, 'alt-ml-hash', 'alt', now())`
);
}
test('addLinksBatch without explicit source_id does NOT fan out across sources', async () => {
await seedSameSlugTwoSources();
const conn = getConn();
const engine = getEngine() as PostgresEngine;
// Reset links from any prior describe block.
await conn.unsafe(`DELETE FROM links`);
const inserted = await engine.addLinksBatch([
{ from_slug: 'topics/ai', to_slug: 'topics/ml', link_type: 'mention' },
]);
// Exactly one edge (default → default). Before the fix this was 2.
expect(inserted).toBe(1);
const rows = await conn.unsafe(
`SELECT f.source_id AS from_src, t.source_id AS to_src
FROM links l
JOIN pages f ON f.id = l.from_page_id
JOIN pages t ON t.id = l.to_page_id`
);
expect(rows.length).toBe(1);
expect(rows[0].from_src).toBe('default');
expect(rows[0].to_src).toBe('default');
});
test('addLinksBatch supports cross-source edges when explicit source_ids differ', async () => {
const conn = getConn();
const engine = getEngine() as PostgresEngine;
await conn.unsafe(`DELETE FROM links`);
const inserted = await engine.addLinksBatch([
{
from_slug: 'topics/ai', to_slug: 'topics/ml', link_type: 'mention',
from_source_id: 'default', to_source_id: 'alt',
},
]);
expect(inserted).toBe(1);
const rows = await conn.unsafe(
`SELECT f.source_id AS from_src, t.source_id AS to_src
FROM links l
JOIN pages f ON f.id = l.from_page_id
JOIN pages t ON t.id = l.to_page_id`
);
expect(rows.length).toBe(1);
expect(rows[0].from_src).toBe('default');
expect(rows[0].to_src).toBe('alt');
});
test('addTimelineEntriesBatch without explicit source_id does NOT fan out across sources', async () => {
const conn = getConn();
const engine = getEngine() as PostgresEngine;
await conn.unsafe(`DELETE FROM timeline_entries`);
const inserted = await engine.addTimelineEntriesBatch([
{ slug: 'topics/ai', date: '2024-01-15', summary: 'Founded' },
]);
expect(inserted).toBe(1);
const rows = await conn.unsafe(
`SELECT p.source_id
FROM timeline_entries te
JOIN pages p ON p.id = te.page_id`
);
expect(rows.length).toBe(1);
expect(rows[0].source_id).toBe('default');
});
test('addTimelineEntriesBatch with explicit alt source_id lands only in alt', async () => {
const conn = getConn();
const engine = getEngine() as PostgresEngine;
await conn.unsafe(`DELETE FROM timeline_entries`);
const inserted = await engine.addTimelineEntriesBatch([
{ slug: 'topics/ai', date: '2024-02-01', summary: 'Alt-only event', source_id: 'alt' },
]);
expect(inserted).toBe(1);
const rows = await conn.unsafe(
`SELECT p.source_id
FROM timeline_entries te
JOIN pages p ON p.id = te.page_id`
);
expect(rows.length).toBe(1);
expect(rows[0].source_id).toBe('alt');
});
});
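
The fan-out these regression tests guard against can be illustrated without a database. The following is a minimal sketch under stated assumptions: `PageRow`, `LinkInput`, `resolveSlugOnly` and `resolveQualified` are hypothetical names invented for illustration, and the in-memory filter only approximates what a slug-only SQL JOIN does. The exact number of duplicate edges depends on the real query shape.

```typescript
// Illustrative in-memory model. PageRow, LinkInput and both resolvers are
// hypothetical names for this sketch, not the engine's real API.
interface PageRow { id: number; sourceId: string; slug: string }
interface LinkInput {
  fromSlug: string;
  toSlug: string;
  fromSourceId?: string;
  toSourceId?: string;
}

// Same two slugs present in both the 'default' and 'alt' sources.
const samplePages: PageRow[] = [
  { id: 1, sourceId: "default", slug: "topics/ai" },
  { id: 2, sourceId: "default", slug: "topics/ml" },
  { id: 3, sourceId: "alt", slug: "topics/ai" },
  { id: 4, sourceId: "alt", slug: "topics/ml" },
];

// Slug-only match: every page sharing the slug qualifies, so edges multiply.
function resolveSlugOnly(pages: PageRow[], link: LinkInput): Array<[number, number]> {
  const edges: Array<[number, number]> = [];
  for (const f of pages.filter(p => p.slug === link.fromSlug)) {
    for (const t of pages.filter(p => p.slug === link.toSlug)) {
      edges.push([f.id, t.id]);
    }
  }
  return edges;
}

// Source-qualified match: an omitted source id falls back to 'default',
// so each endpoint resolves to at most one page.
function resolveQualified(pages: PageRow[], link: LinkInput): Array<[number, number]> {
  const fromSrc = link.fromSourceId ?? "default";
  const toSrc = link.toSourceId ?? "default";
  const f = pages.find(p => p.slug === link.fromSlug && p.sourceId === fromSrc);
  const t = pages.find(p => p.slug === link.toSlug && p.sourceId === toSrc);
  return f && t ? [[f.id, t.id]] : [];
}
```

Defaulting the omitted source id to `'default'` mirrors what the tests above expect: an unqualified batch row yields exactly one default-to-default edge, while explicit ids allow a cross-source edge.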

@@ -609,3 +609,68 @@ describe('FRONTMATTER_LINK_MAP integrity', () => {
expect(m!.dirHint).toContain('people');
});
});
// ─────────────────────────────────────────────────────────────────
// v0.18.0 Step 4 — qualified wikilink syntax [[source-id:dir/slug]]
// ─────────────────────────────────────────────────────────────────
describe("extractEntityRefs — v0.18.0 qualified wikilinks", () => {
test("[[wiki:concepts/ai]] extracts with sourceId=wiki", () => {
const refs = extractEntityRefs("See [[concepts/ai]] vs [[wiki:concepts/ai]] for wiki-specific take.");
// One unqualified + one qualified.
expect(refs.length).toBe(2);
const qual = refs.find(r => r.sourceId === "wiki");
expect(qual).toBeDefined();
expect(qual!.slug).toBe("concepts/ai");
expect(qual!.name).toBe("concepts/ai");
const unqual = refs.find(r => r.sourceId === undefined);
expect(unqual).toBeDefined();
expect(unqual!.slug).toBe("concepts/ai");
});
test("[[gstack:projects/foo|Display Name]] preserves display + sourceId", () => {
const refs = extractEntityRefs("See [[gstack:projects/foo|The Foo Project]] for details.");
expect(refs.length).toBe(1);
expect(refs[0]).toEqual({ name: "The Foo Project", slug: "projects/foo", dir: "projects", sourceId: "gstack" });
});
test("qualified source-id format is validated (must match [a-z0-9-]+ kebab rules)", () => {
// Uppercase source IDs are not qualified — fall through to unqualified wikilink or no match.
const refs = extractEntityRefs("Legit: [[yc-media:concepts/seed]] Not legit: [[NotValid:concepts/x]]");
const qualified = refs.filter(r => r.sourceId);
expect(qualified.length).toBe(1);
expect(qualified[0].sourceId).toBe("yc-media");
});
test("masking prevents unqualified regex from matching inside a qualified link", () => {
// Without the mask, [[wiki:concepts/ai]] could also match as
// unqualified with slug "wiki:concepts/ai" (invalid dir) — the
// DIR_PATTERN whitelist normally blocks it, but masking is
// defense-in-depth.
const refs = extractEntityRefs("Ref: [[wiki:concepts/ai]]");
expect(refs.length).toBe(1);
expect(refs[0].sourceId).toBe("wiki");
});
test("markdown [Name](path) links always have no sourceId (unqualified by shape)", () => {
const refs = extractEntityRefs("[Alice](people/alice-chen) met [[wiki:people/bob]]");
const mdLink = refs.find(r => r.slug === "people/alice-chen");
expect(mdLink!.sourceId).toBeUndefined();
const wiki = refs.find(r => r.slug === "people/bob");
expect(wiki!.sourceId).toBe("wiki");
});
});
describe("v0.18.0 migration v22 — links_resolution_type", () => {
test("migration v22 exists with CHECK constraint", async () => {
const { MIGRATIONS } = await import("../src/core/migrate.ts");
const v22 = MIGRATIONS.find(m => m.version === 22);
expect(v22).toBeDefined();
expect(v22!.name).toBe("links_resolution_type");
expect(v22!.sql).toContain("ADD COLUMN IF NOT EXISTS resolution_type");
expect(v22!.sql).toContain("links_resolution_type_check");
expect(v22!.sql).toContain("qualified");
expect(v22!.sql).toContain("unqualified");
});
});
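
The qualified wikilink syntax `[[source-id:dir/slug|Display]]` exercised above could be parsed along the following lines. This is a sketch, not the real `extractEntityRefs`: the regex, the `WikiRef` shape and the `parseWikilinks` helper are assumptions made for illustration, and the sketch skips the DIR_PATTERN whitelist and the masking step that the tests mention.

```typescript
// Hypothetical shape mirroring the fields the tests assert on.
interface WikiRef { name: string; slug: string; dir: string; sourceId?: string }

// Assumed grammar: optional lowercase kebab source id followed by ':',
// then dir/slug, then an optional '|Display Name'.
function parseWikilinks(text: string): WikiRef[] {
  const re = /\[\[(?:([a-z0-9](?:[a-z0-9-]*[a-z0-9])?):)?([a-z0-9-]+(?:\/[a-z0-9-]+)+)(?:\|([^\]]+))?\]\]/g;
  const refs: WikiRef[] = [];
  for (let m = re.exec(text); m !== null; m = re.exec(text)) {
    const sourceId: string | undefined = m[1];
    const slug: string = m[2];
    const display: string | undefined = m[3];
    // Display text wins as the name; otherwise the slug doubles as the name.
    const ref: WikiRef = { name: display ?? slug, slug, dir: slug.split("/")[0] };
    if (sourceId !== undefined) ref.sourceId = sourceId;
    refs.push(ref);
  }
  return refs;
}
```

Because the source-id group only accepts lowercase letters, digits and hyphens, a link such as `[[NotValid:concepts/x]]` produces no qualified match in this sketch.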

@@ -16,6 +16,162 @@ describe('migrate', () => {
// and are covered in the E2E suite (test/e2e/mechanical.test.ts)
});
// ─────────────────────────────────────────────────────────────────
// v0.18.0 — v20 sources_table_additive (Step 1, Lane A)
// ─────────────────────────────────────────────────────────────────
// v20 is the ADDITIVE-ONLY migration: it installs the sources primitive
// without breaking the engine's existing ON CONFLICT (slug) upserts.
// The breaking schema changes land in later migrations, each paired with
// its code rewrite: pages.source_id NOT NULL + composite UNIQUE in v21
// (so the engine can execute the new ON CONFLICT (source_id, slug)
// atomically with the schema change), links.resolution_type in v22, and
// files.page_slug → page_id + file_migration_ledger in v23.
// ─────────────────────────────────────────────────────────────────
describe('migrate v20 — sources_table_additive', () => {
const v20 = MIGRATIONS.find(m => m.version === 20);
test('v20 exists', () => {
expect(v20).toBeDefined();
expect(v20!.name).toBe('sources_table_additive');
});
test('v20 creates sources table', () => {
expect(v20!.sql).toContain('CREATE TABLE IF NOT EXISTS sources');
expect(v20!.sql).toContain('id TEXT PRIMARY KEY');
expect(v20!.sql).toContain('name TEXT NOT NULL UNIQUE');
expect(v20!.sql).toContain('config JSONB NOT NULL');
});
test("v20 seeds 'default' source inheriting sync config", () => {
expect(v20!.sql).toContain("INSERT INTO sources (id, name, local_path, last_commit, config)");
expect(v20!.sql).toContain("'default'");
// The default source pulls from existing config so post-upgrade
// identity is preserved.
expect(v20!.sql).toContain("SELECT value FROM config WHERE key = 'sync.repo_path'");
expect(v20!.sql).toContain("SELECT value FROM config WHERE key = 'sync.last_commit'");
});
test('v20 default source is federated=true (backward-compat)', () => {
// federated=true ensures pre-v0.17 brains keep single-namespace
// search semantics — every page appears in unqualified search.
expect(v20!.sql).toContain('"federated": true');
});
test('v20 is idempotent on re-run', () => {
// CREATE TABLE IF NOT EXISTS + NOT EXISTS subquery on INSERT.
expect(v20!.sql).toContain('CREATE TABLE IF NOT EXISTS sources');
expect(v20!.sql).toContain('WHERE NOT EXISTS (SELECT 1 FROM sources WHERE id = ');
});
test('v20 does NOT touch pages / ingest_log / files / links', () => {
// Step 1 is additive-only. Breaking changes are deferred to v21 so they
// land with the engine rewrite (Step 2). Guard against anyone
// accidentally re-expanding v20's scope.
expect(v20!.sql).not.toContain('ALTER TABLE pages');
expect(v20!.sql).not.toContain('ALTER TABLE ingest_log');
expect(v20!.sql).not.toContain('ALTER TABLE files');
expect(v20!.sql).not.toContain('ALTER TABLE links');
expect(v20!.handler).toBeUndefined();
});
});
// ─────────────────────────────────────────────────────────────────
// v0.18.0 — v21 pages_source_id_composite_unique (Step 2, Lane B)
// ─────────────────────────────────────────────────────────────────
describe('migrate v21 — pages_source_id_composite_unique', () => {
const v21 = MIGRATIONS.find(m => m.version === 21);
test('v21 exists and is paired with Step 2 engine rewrite', () => {
expect(v21).toBeDefined();
expect(v21!.name).toBe('pages_source_id_composite_unique');
});
test('v21 adds pages.source_id with DEFAULT default REFERENCES sources', () => {
expect(v21!.sql).toContain('ALTER TABLE pages ADD COLUMN IF NOT EXISTS source_id TEXT');
// DEFAULT 'default' closes the race where an INSERT between ADD COLUMN
// and SET NOT NULL could leave source_id NULL (Codex second-pass review).
expect(v21!.sql).toContain("NOT NULL DEFAULT 'default' REFERENCES sources(id)");
});
test('v21 swaps UNIQUE(slug) → composite UNIQUE(source_id, slug)', () => {
// ON CONFLICT (source_id, slug) in putPage relies on this swap.
expect(v21!.sql).toContain('ALTER TABLE pages DROP CONSTRAINT IF EXISTS pages_slug_key');
expect(v21!.sql).toContain('pages_source_slug_key');
expect(v21!.sql).toContain('UNIQUE (source_id, slug)');
});
test('v21 creates source-scoped index for per-source scans', () => {
expect(v21!.sql).toContain('CREATE INDEX IF NOT EXISTS idx_pages_source_id');
});
test('v21 constraint add is guarded (idempotent re-run)', () => {
// DO block with IF NOT EXISTS guard means re-running the migration
// after partial failure doesn't error on the already-installed name.
expect(v21!.sql).toContain('IF NOT EXISTS');
expect(v21!.sql).toContain("WHERE conname = 'pages_source_slug_key'");
});
});
// ─────────────────────────────────────────────────────────────────
// v0.18.0 — v23 files_source_id_page_id_ledger (Step 7, Lane E)
// ─────────────────────────────────────────────────────────────────
describe('migrate v23 — files_source_id_page_id_ledger', () => {
const v23 = MIGRATIONS.find(m => m.version === 23);
test('v23 exists as handler-only (Postgres files table, PGLite no-op)', () => {
expect(v23).toBeDefined();
expect(v23!.name).toBe('files_source_id_page_id_ledger');
expect(v23!.sql).toBe('');
expect(v23!.handler).toBeDefined();
});
test('v23 handler gates on engine.kind for PGLite (no files table)', () => {
expect(v23!.handler!.toString()).toMatch(/engine\.kind\s*===\s*["']pglite["']/);
});
test('v23 adds files.source_id + files.page_id + ledger creation', () => {
const body = v23!.handler!.toString();
expect(body).toContain('ALTER TABLE files ADD COLUMN IF NOT EXISTS source_id');
expect(body).toContain('ALTER TABLE files ADD COLUMN IF NOT EXISTS page_id');
expect(body).toContain('CREATE TABLE IF NOT EXISTS file_migration_ledger');
});
test('v23 backfills files.page_id scoped to default source (Codex fix)', () => {
const body = v23!.handler!.toString();
// Without source_id='default' scope, the JOIN could hit the wrong
// page after new sources with duplicate slugs are added.
expect(body).toContain('UPDATE files f');
expect(body).toContain("p.source_id = 'default'");
});
test('v23 ledger PK is file_id (Codex: two sources can share old path)', () => {
const body = v23!.handler!.toString();
expect(body).toContain('file_id INTEGER PRIMARY KEY');
// State machine values all present.
for (const state of ['pending', 'copy_done', 'db_updated', 'complete', 'failed']) {
expect(body).toContain(`'${state}'`);
}
});
});
describe('migrate — ordering guarantee (v15 must NOT be skipped by v16)', () => {
test('runMigrations sorts by version ascending', async () => {
// Regression: if v16 preceded v15 in the MIGRATIONS array, the iterator
// would setConfig(version, 16) first, then skip v15 on the next pass.
// runMigrations applies a defensive sort so array order doesn't matter.
// This test asserts v15 exists (if we broke the sort, v15 would still
// exist in MIGRATIONS but would never apply at runtime).
const v15 = MIGRATIONS.find(m => m.version === 15);
const v20 = MIGRATIONS.find(m => m.version === 20);
expect(v15).toBeDefined();
expect(v20).toBeDefined();
// Sanity: versions are distinct and progress.
const versions = MIGRATIONS.map(m => m.version);
const uniq = new Set(versions);
expect(uniq.size).toBe(versions.length);
});
});
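
The ordering guarantee described in the comments above can be sketched as a pure planning function. This is an illustration only: `Migration` here is a reduced hypothetical shape and `planMigrations` is not the real `runMigrations`. It shows why sorting by version before iterating prevents a lower-numbered migration from being skipped.

```typescript
// Reduced, hypothetical migration shape for this sketch.
interface Migration { version: number; name: string }

// Returns the names that would be applied, in order, starting from
// currentVersion. Sorting first makes array insertion order irrelevant.
function planMigrations(all: Migration[], currentVersion: number, sortFirst: boolean = true): string[] {
  const ordered = sortFirst ? [...all].sort((a, b) => a.version - b.version) : all;
  const applied: string[] = [];
  let version = currentVersion;
  for (const m of ordered) {
    if (m.version <= version) continue; // already applied, or skipped past
    applied.push(m.name);
    version = m.version;
  }
  return applied;
}
```

With `sortFirst` set to `false` and a higher version listed first, the loop advances the recorded version past the lower migration and never applies it, which is the regression the test comments describe.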
// ─────────────────────────────────────────────────────────────────
// REGRESSION TESTS — migrations v8 + v9 perf on duplicate-heavy tables
// ─────────────────────────────────────────────────────────────────

@@ -0,0 +1,244 @@
/**
* v0.18.0 Step 9 — multi-source integration test against real PGLite.
*
* Exercises the full Step-1-through-Step-7 surface:
* - migration v20 seeds the default source with federated=true
* - migration v21 adds pages.source_id + composite UNIQUE
* - migration v22 adds links.resolution_type column
* - putPage implicitly targets the default source via the
* schema DEFAULT 'default' clause
* - raw INSERT can write pages to a non-default source and the
* composite UNIQUE allows same-slug pages across sources
* - sources CLI add/list/federate operations are reflected in DB
* - federated flag distinguishes unqualified-search-visibility
*
* PGLite-only (fast + zero deps). Real Postgres parity lives in
* test/e2e/mechanical.test.ts when DATABASE_URL is set.
*/
import { describe, test, expect, beforeAll, afterAll } from 'bun:test';
import { PGLiteEngine } from '../src/core/pglite-engine.ts';
import { runSources } from '../src/commands/sources.ts';
import { resolveSourceId } from '../src/core/source-resolver.ts';
let engine: PGLiteEngine;
beforeAll(async () => {
engine = new PGLiteEngine();
await engine.connect({ type: 'pglite' } as never);
await engine.initSchema();
});
afterAll(async () => {
await engine.disconnect();
});
describe('v0.18.0 — sources table seeded with default row on fresh PGLite', () => {
test("sources('default') exists after initSchema + migration", async () => {
const rows = await engine.executeRaw<{ id: string; name: string; config: string | Record<string, unknown> }>(
`SELECT id, name, config FROM sources WHERE id = 'default'`,
);
expect(rows.length).toBe(1);
expect(rows[0].name).toBe('default');
const config = typeof rows[0].config === 'string' ? JSON.parse(rows[0].config) : rows[0].config;
expect(config.federated).toBe(true);
});
test('pages.source_id column exists with DEFAULT default', async () => {
const rows = await engine.executeRaw<{ column_default: string | null }>(
`SELECT column_default FROM information_schema.columns
WHERE table_name = 'pages' AND column_name = 'source_id'`,
);
expect(rows.length).toBe(1);
// PGLite normalizes the default literal.
expect(rows[0].column_default).toContain('default');
});
test('composite UNIQUE (source_id, slug) is installed', async () => {
const rows = await engine.executeRaw<{ conname: string }>(
`SELECT conname FROM pg_constraint WHERE conname = 'pages_source_slug_key'`,
);
expect(rows.length).toBe(1);
});
});
describe('v0.18.0 — putPage implicitly writes to default source', () => {
test('putPage without explicit source → source_id = default', async () => {
await engine.putPage('topics/step9-auto', {
type: 'concept',
title: 'Step 9 Auto',
compiled_truth: 'Auto-defaulted to default source.',
});
const rows = await engine.executeRaw<{ source_id: string; slug: string }>(
`SELECT source_id, slug FROM pages WHERE slug = 'topics/step9-auto'`,
);
expect(rows.length).toBe(1);
expect(rows[0].source_id).toBe('default');
});
});
describe('v0.18.0 — composite UNIQUE allows same-slug across sources', () => {
test('same slug in two different sources coexists (regression: Codex critical)', async () => {
// Insert a second source via sources CLI.
await runSources(engine, ['add', 'testsrc', '--no-federated']);
// Sanity: default already has this slug from the previous test.
// Now write the same slug under testsrc via raw INSERT. putPage only
// targets default until a later step surfaces sourceId, so the raw
// INSERT stands in for the source-aware write that the Step 5
// continuation will add.
await engine.executeRaw(
`INSERT INTO pages (source_id, slug, type, title, compiled_truth, timeline, frontmatter, content_hash)
VALUES ('testsrc', 'topics/step9-auto', 'concept', 'Step 9 Auto (testsrc variant)',
'A different page with the same slug in a different source.',
'', '{}'::jsonb, 'hash2')`,
);
// Both rows must exist under the composite unique.
const rows = await engine.executeRaw<{ source_id: string; slug: string; title: string }>(
`SELECT source_id, slug, title FROM pages
WHERE slug = 'topics/step9-auto'
ORDER BY source_id`,
);
expect(rows.length).toBe(2);
expect(rows.map(r => r.source_id).sort()).toEqual(['default', 'testsrc']);
});
test('inserting THIRD row with same (source_id, slug) hits composite UNIQUE', async () => {
let err: Error | null = null;
try {
await engine.executeRaw(
`INSERT INTO pages (source_id, slug, type, title, compiled_truth, timeline, frontmatter, content_hash)
VALUES ('testsrc', 'topics/step9-auto', 'concept', 'Dup attempt',
'Should fail', '', '{}'::jsonb, 'hash3')`,
);
} catch (e) {
err = e as Error;
}
expect(err).not.toBeNull();
expect(err!.message.toLowerCase()).toMatch(/unique|duplicate/);
});
});
describe('v0.18.0 — sources CLI manipulates the sources table', () => {
test('sources federate flips config.federated true', async () => {
await runSources(engine, ['federate', 'testsrc']);
const rows = await engine.executeRaw<{ config: string | Record<string, unknown> }>(
`SELECT config FROM sources WHERE id = 'testsrc'`,
);
const config = typeof rows[0].config === 'string' ? JSON.parse(rows[0].config) : rows[0].config;
expect(config.federated).toBe(true);
});
test('sources unfederate flips config.federated false', async () => {
await runSources(engine, ['unfederate', 'testsrc']);
const rows = await engine.executeRaw<{ config: string | Record<string, unknown> }>(
`SELECT config FROM sources WHERE id = 'testsrc'`,
);
const config = typeof rows[0].config === 'string' ? JSON.parse(rows[0].config) : rows[0].config;
expect(config.federated).toBe(false);
});
test('sources rename changes name but keeps id immutable', async () => {
await runSources(engine, ['rename', 'testsrc', 'Test Source']);
const rows = await engine.executeRaw<{ id: string; name: string }>(
`SELECT id, name FROM sources WHERE id = 'testsrc'`,
);
expect(rows[0].id).toBe('testsrc');
expect(rows[0].name).toBe('Test Source');
});
});
describe('v0.18.0 — source resolution priority (integration)', () => {
test('explicit --source flag wins when the source exists', async () => {
const id = await resolveSourceId(engine, 'testsrc');
expect(id).toBe('testsrc');
});
test('GBRAIN_SOURCE env wins when no flag', async () => {
process.env.GBRAIN_SOURCE = 'testsrc';
try {
const id = await resolveSourceId(engine, null);
expect(id).toBe('testsrc');
} finally {
delete process.env.GBRAIN_SOURCE;
}
});
test('fallback to default when nothing is set', async () => {
const id = await resolveSourceId(engine, null, '/nowhere-registered');
expect(id).toBe('default');
});
test('rejects unregistered explicit source with an actionable error', async () => {
await expect(resolveSourceId(engine, 'ghost-source')).rejects.toThrow(/not found/);
});
});
describe('v0.18.0 — sources remove cascades to pages', () => {
test('removing a source cascade-deletes its pages', async () => {
const before = await engine.executeRaw<{ n: number }>(
`SELECT COUNT(*)::int AS n FROM pages WHERE source_id = 'testsrc'`,
);
expect(before[0].n).toBeGreaterThan(0);
await runSources(engine, ['remove', 'testsrc', '--yes']);
const after = await engine.executeRaw<{ n: number }>(
`SELECT COUNT(*)::int AS n FROM pages WHERE source_id = 'testsrc'`,
);
expect(after[0].n).toBe(0);
const src = await engine.executeRaw<{ id: string }>(
`SELECT id FROM sources WHERE id = 'testsrc'`,
);
expect(src.length).toBe(0);
// Default source is untouched.
const defaultPages = await engine.executeRaw<{ n: number }>(
`SELECT COUNT(*)::int AS n FROM pages WHERE source_id = 'default'`,
);
expect(defaultPages[0].n).toBeGreaterThan(0);
});
});
describe('v0.18.0 — links.resolution_type column exists (Step 4)', () => {
test('links table accepts qualified/unqualified resolution_type', async () => {
// Create two pages, insert a link with resolution_type='qualified'.
await engine.putPage('topics/qf-a', {
type: 'concept', title: 'QA', compiled_truth: 'a',
});
await engine.putPage('topics/qf-b', {
type: 'concept', title: 'QB', compiled_truth: 'b',
});
await engine.executeRaw(
`INSERT INTO links (from_page_id, to_page_id, link_type, context, link_source, resolution_type)
SELECT a.id, b.id, 'ref', '', 'markdown', 'qualified'
FROM pages a, pages b
WHERE a.slug = 'topics/qf-a' AND b.slug = 'topics/qf-b'
AND a.source_id = 'default' AND b.source_id = 'default'`,
);
const rows = await engine.executeRaw<{ resolution_type: string }>(
`SELECT l.resolution_type
FROM links l
JOIN pages a ON a.id = l.from_page_id
WHERE a.slug = 'topics/qf-a'`,
);
expect(rows.length).toBe(1);
expect(rows[0].resolution_type).toBe('qualified');
});
test('links CHECK constraint rejects invalid resolution_type values', async () => {
let err: Error | null = null;
try {
await engine.executeRaw(
`INSERT INTO links (from_page_id, to_page_id, link_type, resolution_type)
SELECT a.id, a.id, 'self', 'bogus-value'
FROM pages a WHERE a.slug = 'topics/qf-a' AND a.source_id = 'default'`,
);
} catch (e) {
err = e as Error;
}
expect(err).not.toBeNull();
expect(err!.message.toLowerCase()).toMatch(/check|constraint/);
});
});

@@ -462,6 +462,119 @@ describe('PGLiteEngine: addTimelineEntriesBatch', () => {
});
});
// v0.18.0: regression guards for the cross-source JOIN fan-out.
// Before the fix, addLinksBatch/addTimelineEntriesBatch JOINed on pages.slug
// only — so a page with the same slug in two sources would fan out and
// silently create duplicate edges / entries. Source-id-qualified JOINs
// eliminate the fan-out.
describe('PGLiteEngine: batch ops source-awareness (v0.18.0)', () => {
beforeEach(async () => {
await truncateAll();
// Register a second source and populate the same slugs in both.
const db = (engine as any).db;
await db.query(
`INSERT INTO sources (id, name) VALUES ('alt', 'alt')
ON CONFLICT (id) DO NOTHING`
);
// default-source rows via putPage (schema DEFAULT 'default').
await engine.putPage('topics/ai', { type: 'concept', title: 'AI (default)', compiled_truth: '', timeline: '' });
await engine.putPage('topics/ml', { type: 'concept', title: 'ML (default)', compiled_truth: '', timeline: '' });
// alt-source rows with the same slugs, inserted via raw SQL.
await db.query(
`INSERT INTO pages (slug, type, title, compiled_truth, timeline, frontmatter, content_hash, source_id, updated_at)
VALUES ('topics/ai', 'concept', 'AI (alt)', '', '', '{}'::jsonb, 'h1', 'alt', now()),
('topics/ml', 'concept', 'ML (alt)', '', '', '{}'::jsonb, 'h2', 'alt', now())`
);
});
test('addLinksBatch default source_id does NOT fan out across sources', async () => {
const inserted = await engine.addLinksBatch([
{ from_slug: 'topics/ai', to_slug: 'topics/ml', link_type: 'mention' },
]);
// Exactly one edge, not two. Before the fix this was 2.
expect(inserted).toBe(1);
const db = (engine as any).db;
const { rows } = await db.query(
`SELECT f.source_id AS from_src, t.source_id AS to_src
FROM links l
JOIN pages f ON f.id = l.from_page_id
JOIN pages t ON t.id = l.to_page_id`
);
expect(rows.length).toBe(1);
expect(rows[0].from_src).toBe('default');
expect(rows[0].to_src).toBe('default');
});
test('addLinksBatch with explicit alt source_id lands in alt only', async () => {
const inserted = await engine.addLinksBatch([
{
from_slug: 'topics/ai', to_slug: 'topics/ml', link_type: 'mention',
from_source_id: 'alt', to_source_id: 'alt',
},
]);
expect(inserted).toBe(1);
const db = (engine as any).db;
const { rows } = await db.query(
`SELECT f.source_id AS from_src, t.source_id AS to_src
FROM links l
JOIN pages f ON f.id = l.from_page_id
JOIN pages t ON t.id = l.to_page_id`
);
expect(rows.length).toBe(1);
expect(rows[0].from_src).toBe('alt');
expect(rows[0].to_src).toBe('alt');
});
test('addLinksBatch supports cross-source edges', async () => {
const inserted = await engine.addLinksBatch([
{
from_slug: 'topics/ai', to_slug: 'topics/ml', link_type: 'mention',
from_source_id: 'default', to_source_id: 'alt',
},
]);
expect(inserted).toBe(1);
const db = (engine as any).db;
const { rows } = await db.query(
`SELECT f.source_id AS from_src, t.source_id AS to_src
FROM links l
JOIN pages f ON f.id = l.from_page_id
JOIN pages t ON t.id = l.to_page_id`
);
expect(rows.length).toBe(1);
expect(rows[0].from_src).toBe('default');
expect(rows[0].to_src).toBe('alt');
});
test('addTimelineEntriesBatch default source_id does NOT fan out across sources', async () => {
const inserted = await engine.addTimelineEntriesBatch([
{ slug: 'topics/ai', date: '2024-01-15', summary: 'Founded' },
]);
// Exactly one entry (default source), not two. Before the fix this was 2.
expect(inserted).toBe(1);
const db = (engine as any).db;
const { rows } = await db.query(
`SELECT p.source_id FROM timeline_entries te
JOIN pages p ON p.id = te.page_id`
);
expect(rows.length).toBe(1);
expect(rows[0].source_id).toBe('default');
});
test('addTimelineEntriesBatch with explicit alt source_id lands in alt only', async () => {
const inserted = await engine.addTimelineEntriesBatch([
{ slug: 'topics/ai', date: '2024-01-15', summary: 'Founded', source_id: 'alt' },
]);
expect(inserted).toBe(1);
const db = (engine as any).db;
const { rows } = await db.query(
`SELECT p.source_id FROM timeline_entries te
JOIN pages p ON p.id = te.page_id`
);
expect(rows.length).toBe(1);
expect(rows[0].source_id).toBe('alt');
});
});
// ─────────────────────────────────────────────────────────────────
// Raw Data, Versions, Config, IngestLog
// ─────────────────────────────────────────────────────────────────

@@ -0,0 +1,190 @@
/**
* v0.18.0 Step 6 — source resolution priority tests.
*
* Priority order (highest first):
* 1. Explicit --source flag
* 2. GBRAIN_SOURCE env var
* 3. .gbrain-source dotfile walk-up
* 4. Registered source whose local_path contains CWD (longest prefix wins)
* 5. Brain-level `sources.default` config key
* 6. Fallback: literal 'default'
*/
import { describe, test, expect, beforeEach, afterEach } from 'bun:test';
import { mkdirSync, mkdtempSync, rmSync, writeFileSync } from 'fs';
import { join } from 'path';
import { tmpdir } from 'os';
import { resolveSourceId, __testing } from '../src/core/source-resolver.ts';
import type { BrainEngine } from '../src/core/engine.ts';
// ── Stub engine ────────────────────────────────────────────
function makeStub(registeredSources: string[], paths: Array<{ id: string; local_path: string }>, defaultKey: string | null): BrainEngine {
return {
kind: 'pglite',
executeRaw: async <T>(sql: string, params?: unknown[]): Promise<T[]> => {
if (sql.includes('SELECT id FROM sources WHERE id = $1')) {
const target = params?.[0];
return (registeredSources.includes(target as string)
? [{ id: target } as unknown as T]
: []);
}
if (sql.includes('SELECT id, local_path FROM sources')) {
return paths as unknown as T[];
}
return [];
},
getConfig: async (key: string) => (key === 'sources.default' ? defaultKey : null),
} as unknown as BrainEngine;
}
// ── Priority 1: explicit flag ──────────────────────────────
describe('resolveSourceId priority 1 — explicit flag', () => {
test('wins over every other signal', async () => {
const engine = makeStub(['default', 'gstack', 'wiki'], [{ id: 'wiki', local_path: '/tmp' }], 'gstack');
process.env.GBRAIN_SOURCE = 'wiki';
try {
const id = await resolveSourceId(engine, 'gstack', '/tmp/whatever');
expect(id).toBe('gstack');
} finally {
delete process.env.GBRAIN_SOURCE;
}
});
test('rejects unregistered explicit source with actionable error', async () => {
const engine = makeStub(['default'], [], null);
await expect(resolveSourceId(engine, 'ghost')).rejects.toThrow(/not found/);
});
test('rejects invalid format', async () => {
const engine = makeStub(['default'], [], null);
await expect(resolveSourceId(engine, 'WRONG-case!')).rejects.toThrow(/Invalid --source/);
});
});
// ── Priority 2: env var ────────────────────────────────────
describe('resolveSourceId priority 2 — GBRAIN_SOURCE env', () => {
test('wins over dotfile / registered-path / default', async () => {
const engine = makeStub(['default', 'env-wins'], [{ id: 'other', local_path: '/tmp' }], 'default');
process.env.GBRAIN_SOURCE = 'env-wins';
try {
const id = await resolveSourceId(engine, null, '/tmp/x');
expect(id).toBe('env-wins');
} finally {
delete process.env.GBRAIN_SOURCE;
}
});
});
// ── Priority 3: dotfile walk-up ────────────────────────────
describe('resolveSourceId priority 3 — .gbrain-source dotfile walk-up', () => {
let tmpdirPath: string;
beforeEach(() => {
tmpdirPath = mkdtempSync(join(tmpdir(), 'gbrain-resolver-test-'));
});
afterEach(() => {
rmSync(tmpdirPath, { recursive: true, force: true });
});
test('finds dotfile in CWD', async () => {
writeFileSync(join(tmpdirPath, '.gbrain-source'), 'gstack\n');
const engine = makeStub(['default', 'gstack'], [], null);
const id = await resolveSourceId(engine, null, tmpdirPath);
expect(id).toBe('gstack');
});
test('walks up ancestors to find dotfile', async () => {
writeFileSync(join(tmpdirPath, '.gbrain-source'), 'wiki\n');
const deep = join(tmpdirPath, 'a', 'b', 'c');
mkdirSync(deep, { recursive: true });
const engine = makeStub(['default', 'wiki'], [], null);
const id = await resolveSourceId(engine, null, deep);
expect(id).toBe('wiki');
});
test('ignores dotfile with invalid content', async () => {
writeFileSync(join(tmpdirPath, '.gbrain-source'), 'INVALID!\n');
const engine = makeStub(['default'], [], null);
const id = await resolveSourceId(engine, null, tmpdirPath);
expect(id).toBe('default');
});
});
// ── Priority 4: registered local_path match (longest prefix) ──
describe('resolveSourceId priority 4 — registered local_path longest-prefix match', () => {
test('picks registered source whose local_path contains CWD', async () => {
const engine = makeStub(
['default', 'gstack'],
[{ id: 'gstack', local_path: '/tmp/gstack' }],
null,
);
const id = await resolveSourceId(engine, null, '/tmp/gstack/plans/foo');
expect(id).toBe('gstack');
});
test('longest prefix wins when paths are nested (per Codex second pass)', async () => {
// Codex flagged: overlapping paths need longest-prefix resolution.
// If gstack at /tmp/gstack and plans at /tmp/gstack/plans both
// exist, CWD inside plans/ must pick plans.
const engine = makeStub(
['default', 'gstack', 'plans'],
[
{ id: 'gstack', local_path: '/tmp/gstack' },
{ id: 'plans', local_path: '/tmp/gstack/plans' },
],
null,
);
const id = await resolveSourceId(engine, null, '/tmp/gstack/plans/deeper');
expect(id).toBe('plans');
});
test("CWD outside any registered path falls through to default", async () => {
const engine = makeStub(
['default', 'gstack'],
[{ id: 'gstack', local_path: '/tmp/gstack' }],
null,
);
const id = await resolveSourceId(engine, null, '/some/other/dir');
expect(id).toBe('default');
});
});
// ── Priority 5: brain-level default ────────────────────────
describe('resolveSourceId priority 5 — sources.default config key', () => {
test("returns configured default when no higher signal present", async () => {
const engine = makeStub(['default', 'custom'], [], 'custom');
const id = await resolveSourceId(engine, null, '/some/random/dir');
expect(id).toBe('custom');
});
});
// ── Priority 6: fallback ────────────────────────────────────
describe('resolveSourceId priority 6 — fallback', () => {
test("returns 'default' when no signal at all", async () => {
const engine = makeStub(['default'], [], null);
const id = await resolveSourceId(engine, null, '/random/dir');
expect(id).toBe('default');
});
});
// ── Regex validation ───────────────────────────────────────
describe('SOURCE_ID_RE', () => {
test('accepts valid ids', () => {
for (const id of ['default', 'wiki', 'gstack', 'yc-media', 'garrys-list', 'a', '123']) {
expect(__testing.SOURCE_ID_RE.test(id)).toBe(true);
}
});
test('rejects invalid ids', () => {
for (const id of ['', 'a'.repeat(33), 'Upper', 'has_underscore', 'trailing-', '-leading', 'with spaces', 'with.dots']) {
expect(__testing.SOURCE_ID_RE.test(id)).toBe(false);
}
});
});
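
Two pieces of the resolver tested above lend themselves to a short sketch: the id format check and the longest-prefix match over registered paths (priority 4). Everything below is an assumption for illustration. `SOURCE_ID_PATTERN` is one regex that happens to satisfy the accept/reject lists in the tests and may differ from the real `SOURCE_ID_RE`, and `longestPrefixMatch` is a hypothetical helper that assumes POSIX-style paths.

```typescript
// One possible pattern: 1 to 32 chars, lowercase alphanumerics and hyphens,
// no leading or trailing hyphen. The real SOURCE_ID_RE may differ.
const SOURCE_ID_PATTERN = /^[a-z0-9](?:[a-z0-9-]{0,30}[a-z0-9])?$/;

interface RegisteredPath { id: string; local_path: string }

// True when cwd is root itself or lies beneath it. Comparing on a path
// boundary keeps '/tmp/gstackx' from matching '/tmp/gstack'.
function isInside(cwd: string, root: string): boolean {
  const base = root.endsWith("/") ? root.slice(0, -1) : root;
  return cwd === base || cwd.startsWith(base + "/");
}

// Among all registered paths containing cwd, the longest one wins, so a
// nested source takes precedence over its parent.
function longestPrefixMatch(cwd: string, registered: RegisteredPath[]): string | null {
  let best: RegisteredPath | null = null;
  for (const r of registered) {
    if (!isInside(cwd, r.local_path)) continue;
    if (best === null || r.local_path.length > best.local_path.length) best = r;
  }
  return best === null ? null : best.id;
}
```

A `null` result corresponds to falling through to the next priority level, which in the tests above ends at the configured default or the literal `'default'`.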

test/sources.test.ts

@@ -0,0 +1,252 @@
/**
* v0.18.0 Step 6 — sources CLI subcommand tests.
*
* Pure unit tests that exercise the subcommand dispatcher via a
* stub BrainEngine. No DB required — we just confirm the SQL
* shape, validation, and flag parsing.
*/
import { describe, test, expect, beforeEach } from 'bun:test';
import { runSources } from '../src/commands/sources.ts';
import type { BrainEngine } from '../src/core/engine.ts';
// ── Stub engine that records queries ───────────────────────
interface RecordedCall {
sql: string;
params: unknown[];
}
function makeStub(rowsByPattern: Record<string, unknown[]> = {}): {
engine: BrainEngine;
calls: RecordedCall[];
configSet: Array<{ key: string; value: string }>;
} {
const calls: RecordedCall[] = [];
const configSet: Array<{ key: string; value: string }> = [];
const executeRaw = async (sql: string, params?: unknown[]) => {
calls.push({ sql, params: params ?? [] });
// Match by substring so tests are robust against whitespace.
for (const [pattern, rows] of Object.entries(rowsByPattern)) {
if (sql.includes(pattern)) return rows as never;
}
return [] as never;
};
const setConfig = async (key: string, value: string) => {
configSet.push({ key, value });
};
// Minimal BrainEngine stub — only the methods sources.ts touches.
const engine = {
kind: 'pglite' as const,
executeRaw,
setConfig,
// getConfig always reports "no value". Methods not listed here are
// undefined, so an accidental call fails with a TypeError.
getConfig: async () => null,
} as unknown as BrainEngine;
return { engine, calls, configSet };
}
// ── add ─────────────────────────────────────────────────────
// Intercept process.exit so unit tests under bun:test don't actually
// exit. Each test that might trigger process.exit() wraps its call in
// `withExitCapture`. We only return when the function under test returns
// or throws; process.exit() is turned into a recoverable throw.
async function withExitCapture(fn: () => Promise<void>): Promise<number | null> {
const origExit = process.exit;
let captured: number | null = null;
process.exit = ((code?: number) => {
captured = code ?? 0;
throw new Error('__process_exit__');
}) as never;
try {
await fn();
} catch (e) {
if (!(e instanceof Error) || !e.message.includes('__process_exit__')) throw e;
} finally {
process.exit = origExit;
}
return captured;
}
describe('sources add', () => {
test('rejects a missing id (exit code 2)', async () => {
const { engine } = makeStub();
const code = await withExitCapture(() => runSources(engine, ['add']));
expect(code).toBe(2);
});
test('rejects uppercase / invalid chars in id', async () => {
const { engine } = makeStub();
await expect(runSources(engine, ['add', 'BadId', '--path', '/tmp/x'])).rejects.toThrow(/Invalid source id/);
});
test('rejects id longer than 32 chars', async () => {
const { engine } = makeStub();
const long = 'a'.repeat(33);
await expect(runSources(engine, ['add', long, '--path', '/tmp/x'])).rejects.toThrow(/Invalid source id/);
});
test('inserts a valid source with defaults (federated unset → isolated)', async () => {
const { engine, calls } = makeStub({
'SELECT id, name, local_path, last_commit, last_sync_at, config, created_at': [{
id: 'gstack',
name: 'gstack',
local_path: '/tmp/gstack',
last_commit: null,
last_sync_at: null,
config: '{}',
created_at: new Date(),
}],
});
await runSources(engine, ['add', 'gstack', '--path', '/tmp/gstack']);
const insert = calls.find(c => c.sql.includes('INSERT INTO sources'));
expect(insert).toBeDefined();
expect(insert!.params[0]).toBe('gstack');
expect(insert!.params[1]).toBe('gstack'); // name defaults to id
expect(insert!.params[2]).toBe('/tmp/gstack');
expect(insert!.params[3]).toBe('{}'); // federated unset → empty config
});
test('--federated sets config.federated = true', async () => {
const { engine, calls } = makeStub({
'SELECT id, name, local_path, last_commit, last_sync_at, config, created_at': [{
id: 'wiki',
name: 'wiki',
local_path: '/tmp/wiki',
last_commit: null,
last_sync_at: null,
config: '{"federated":true}',
created_at: new Date(),
}],
});
await runSources(engine, ['add', 'wiki', '--path', '/tmp/wiki', '--federated']);
const insert = calls.find(c => c.sql.includes('INSERT INTO sources'));
expect(insert!.params[3]).toBe('{"federated":true}');
});
test('--no-federated sets config.federated = false (isolation opt-in)', async () => {
const { engine, calls } = makeStub({
'SELECT id, name, local_path, last_commit, last_sync_at, config, created_at': [{
id: 'yc-media',
name: 'yc-media',
local_path: '/tmp/yc',
last_commit: null,
last_sync_at: null,
config: '{"federated":false}',
created_at: new Date(),
}],
});
await runSources(engine, ['add', 'yc-media', '--path', '/tmp/yc', '--no-federated']);
const insert = calls.find(c => c.sql.includes('INSERT INTO sources'));
expect(insert!.params[3]).toBe('{"federated":false}');
});
test('rejects overlapping paths (per eng review finding 4.1)', async () => {
const { engine } = makeStub({
'SELECT id, local_path FROM sources WHERE local_path': [
{ id: 'gstack', local_path: '/tmp/gstack' },
],
});
// New source at /tmp/gstack/plans is inside existing gstack at /tmp/gstack.
await expect(runSources(engine, ['add', 'plans', '--path', '/tmp/gstack/plans']))
.rejects.toThrow(/overlaps with existing source "gstack"/);
});
});
// ── list ────────────────────────────────────────────────────
describe('sources list', () => {
test('orders default source first, then alphabetical', async () => {
const { engine, calls } = makeStub({
'SELECT id, name, local_path, last_commit, last_sync_at, config, created_at': [
{ id: 'default', name: 'default', local_path: null, last_commit: null, last_sync_at: null, config: '{"federated":true}', created_at: new Date() },
],
'COUNT(*)::int AS n FROM pages': [{ n: 0 }],
});
await runSources(engine, ['list']);
const select = calls.find(c => c.sql.includes('ORDER BY (id = \'default\') DESC'));
expect(select).toBeDefined();
});
});
// ── remove ──────────────────────────────────────────────────
describe('sources remove', () => {
test("refuses to remove the 'default' source", async () => {
const { engine } = makeStub();
const code = await withExitCapture(() => runSources(engine, ['remove', 'default', '--yes']));
expect(code).toBe(3);
});
test('refuses without --yes', async () => {
const { engine } = makeStub({
'SELECT id, name, local_path, last_commit, last_sync_at, config, created_at': [
{ id: 'gstack', name: 'gstack', local_path: '/tmp/g', last_commit: null, last_sync_at: null, config: '{}', created_at: new Date() },
],
'COUNT(*)::int AS n FROM pages': [{ n: 10 }],
});
const code = await withExitCapture(() => runSources(engine, ['remove', 'gstack']));
expect(code).toBe(5);
});
test('--dry-run reports but does not DELETE', async () => {
const { engine, calls } = makeStub({
'SELECT id, name, local_path, last_commit, last_sync_at, config, created_at': [
{ id: 'gstack', name: 'gstack', local_path: '/tmp/g', last_commit: null, last_sync_at: null, config: '{}', created_at: new Date() },
],
'COUNT(*)::int AS n FROM pages': [{ n: 10 }],
});
await runSources(engine, ['remove', 'gstack', '--dry-run']);
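// Substring match (as in the other tests) so leading whitespace in the
// SQL cannot make this assertion pass vacuously.
const del = calls.find(c => c.sql.includes('DELETE FROM sources'));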
expect(del).toBeUndefined();
});
});
// ── default ─────────────────────────────────────────────────
describe('sources default', () => {
test("stores id in config key 'sources.default'", async () => {
const { engine, configSet } = makeStub({
'SELECT id, name, local_path, last_commit, last_sync_at, config, created_at': [
{ id: 'gstack', name: 'gstack', local_path: null, last_commit: null, last_sync_at: null, config: '{}', created_at: new Date() },
],
});
await runSources(engine, ['default', 'gstack']);
expect(configSet).toEqual([{ key: 'sources.default', value: 'gstack' }]);
});
});
// ── federate / unfederate ──────────────────────────────────
describe('sources federate / unfederate', () => {
test('federate sets config.federated = true', async () => {
const { engine, calls } = makeStub({
'SELECT id, name, local_path, last_commit, last_sync_at, config, created_at': [
{ id: 'gstack', name: 'gstack', local_path: null, last_commit: null, last_sync_at: null, config: '{}', created_at: new Date() },
],
});
await runSources(engine, ['federate', 'gstack']);
const upd = calls.find(c => c.sql.includes('UPDATE sources SET config'));
expect(upd).toBeDefined();
expect(JSON.parse(upd!.params[0] as string)).toEqual({ federated: true });
});
test('unfederate preserves other config keys', async () => {
const { engine, calls } = makeStub({
'SELECT id, name, local_path, last_commit, last_sync_at, config, created_at': [
{ id: 'gstack', name: 'gstack', local_path: null, last_commit: null, last_sync_at: null, config: '{"ttl_days":90,"federated":true}', created_at: new Date() },
],
});
await runSources(engine, ['unfederate', 'gstack']);
const upd = calls.find(c => c.sql.includes('UPDATE sources SET config'));
const parsed = JSON.parse(upd!.params[0] as string);
// Must preserve ttl_days while flipping federated.
expect(parsed.ttl_days).toBe(90);
expect(parsed.federated).toBe(false);
});
});
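
Review note: the `unfederate preserves other config keys` test implies a read-modify-write on the JSON config rather than an overwrite. A minimal sketch of that merge, assuming the config is stored as a JSON string as in the stub rows above; `withFederated` is a hypothetical helper, not the code in `src/commands/sources.ts`.

```typescript
// Hypothetical helper: flip `federated` while keeping every other key.
// Assumes `configJson` is a JSON object serialized as a string; an empty
// string is treated as an empty config.
function withFederated(configJson: string, federated: boolean): string {
  const current = JSON.parse(configJson || '{}') as Record<string, unknown>;
  // Spread first so the explicit `federated` value always wins.
  return JSON.stringify({ ...current, federated });
}

console.log(withFederated('{"ttl_days":90,"federated":true}', false));
// prints {"ttl_days":90,"federated":false}
console.log(withFederated('{}', true));
// prints {"federated":true}
```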


@@ -0,0 +1,213 @@
/**
* v0.18.0 Step 7 — file_migration_ledger state-machine unit tests.
*
* No real storage — we stub a StorageBackend that records every
* call so we can assert the crash-point recovery semantics without
* touching S3/Supabase.
*/
import { describe, test, expect } from 'bun:test';
import { runStorageBackfill } from '../src/commands/migrations/v0_18_0-storage-backfill.ts';
import type { BrainEngine } from '../src/core/engine.ts';
import type { StorageBackend } from '../src/core/storage.ts';
interface StubLedgerRow {
file_id: number;
storage_path_old: string;
storage_path_new: string;
status: 'pending' | 'copy_done' | 'db_updated' | 'complete' | 'failed';
error?: string | null;
}
function makeEngine(initial: StubLedgerRow[]): { engine: BrainEngine; rows: StubLedgerRow[]; filePaths: Map<number, string> } {
const rows: StubLedgerRow[] = initial.map(r => ({ ...r }));
const filePaths = new Map<number, string>(); // file_id → current storage_path
const executeRaw = async <T>(sql: string, params?: unknown[]): Promise<T[]> => {
const up = sql.trim().toUpperCase();
// Read ledger
if (up.startsWith('SELECT FILE_ID')) {
return rows.map(r => ({ ...r })) as unknown as T[];
}
// UPDATE ledger SET status = 'copy_done'
if (sql.includes("SET status = 'copy_done'")) {
const row = rows.find(r => r.file_id === params?.[0]);
if (row) row.status = 'copy_done';
return [];
}
if (sql.includes("SET status = 'db_updated'")) {
const row = rows.find(r => r.file_id === params?.[0]);
if (row) row.status = 'db_updated';
return [];
}
if (sql.includes("SET status = 'complete'")) {
const row = rows.find(r => r.file_id === params?.[0]);
if (row) row.status = 'complete';
return [];
}
if (sql.includes('SET status = $1') && sql.includes("'failed'")) {
// Older form with parametric status
return [];
}
if (sql.includes("SET status = 'failed'")) {
const row = rows.find(r => r.file_id === params?.[1]);
if (row) { row.status = 'failed'; row.error = params?.[0] as string; }
return [];
}
// UPDATE files SET storage_path = $1 WHERE id = $2
if (up.startsWith('UPDATE FILES')) {
filePaths.set(params?.[1] as number, params?.[0] as string);
return [];
}
return [];
};
const engine = { kind: 'postgres' as const, executeRaw } as unknown as BrainEngine;
return { engine, rows, filePaths };
}
function makeStorage(): { storage: StorageBackend; calls: string[] } {
const calls: string[] = [];
const uploaded = new Set<string>();
const storage: StorageBackend = {
upload: async (path: string) => { calls.push(`upload:${path}`); uploaded.add(path); },
download: async (path: string) => { calls.push(`download:${path}`); return Buffer.from('content-for:' + path); },
delete: async (path: string) => { calls.push(`delete:${path}`); uploaded.delete(path); },
exists: async (path: string) => { calls.push(`exists:${path}`); return uploaded.has(path); },
list: async () => [],
getUrl: async (p) => `https://test/${p}`,
};
return { storage, calls };
}
describe('runStorageBackfill — happy path', () => {
test('advances pending → copy_done → db_updated → complete', async () => {
const { engine, rows, filePaths } = makeEngine([
{ file_id: 1, storage_path_old: 'slug/foo.pdf', storage_path_new: 'default/slug/foo.pdf', status: 'pending' },
]);
const { storage, calls } = makeStorage();
const report = await runStorageBackfill(engine, storage);
expect(report.total).toBe(1);
expect(report.nowComplete).toBe(1);
expect(report.failed).toBe(0);
expect(rows[0].status).toBe('complete');
expect(filePaths.get(1)).toBe('default/slug/foo.pdf');
// Storage operations: exists-check then download + upload (no delete yet,
// old objects preserved for soak window).
expect(calls.filter(c => c.startsWith('download:'))).toEqual(['download:slug/foo.pdf']);
expect(calls.filter(c => c.startsWith('upload:'))).toEqual(['upload:default/slug/foo.pdf']);
expect(calls.filter(c => c.startsWith('delete:'))).toEqual([]);
});
});
describe('runStorageBackfill — crash-point recovery (per Codex second pass)', () => {
test('resumes from copy_done (crash AFTER copy, BEFORE DB update)', async () => {
const { engine, rows, filePaths } = makeEngine([
{ file_id: 1, storage_path_old: 'slug/a.pdf', storage_path_new: 'default/slug/a.pdf', status: 'copy_done' },
]);
const { storage, calls } = makeStorage();
const report = await runStorageBackfill(engine, storage);
expect(report.nowComplete).toBe(1);
expect(rows[0].status).toBe('complete');
expect(filePaths.get(1)).toBe('default/slug/a.pdf');
// Should NOT re-download/re-upload — already in copy_done state.
expect(calls.filter(c => c.startsWith('download:'))).toEqual([]);
expect(calls.filter(c => c.startsWith('upload:'))).toEqual([]);
});
test('resumes from db_updated (crash AFTER DB update, BEFORE ledger mark)', async () => {
const { engine, rows } = makeEngine([
{ file_id: 1, storage_path_old: 'slug/b.pdf', storage_path_new: 'default/slug/b.pdf', status: 'db_updated' },
]);
const { storage, calls } = makeStorage();
const report = await runStorageBackfill(engine, storage);
expect(report.nowComplete).toBe(1);
expect(rows[0].status).toBe('complete');
// No copy, no db update — only the final mark.
expect(calls).toEqual([]);
});
test('already-complete rows are skipped without storage calls', async () => {
const { engine, rows } = makeEngine([
{ file_id: 1, storage_path_old: 'x', storage_path_new: 'default/x', status: 'complete' },
]);
const { storage, calls } = makeStorage();
const report = await runStorageBackfill(engine, storage);
expect(report.alreadyComplete).toBe(1);
expect(report.nowComplete).toBe(0);
expect(rows[0].status).toBe('complete');
expect(calls).toEqual([]);
});
test('failed rows stay failed and do NOT auto-retry', async () => {
const { engine, rows } = makeEngine([
{ file_id: 1, storage_path_old: 'x', storage_path_new: 'default/x', status: 'failed', error: 'previous failure' },
]);
const { storage, calls } = makeStorage();
const report = await runStorageBackfill(engine, storage);
expect(report.failed).toBe(1);
expect(report.nowComplete).toBe(0);
expect(rows[0].status).toBe('failed');
expect(calls).toEqual([]);
});
});
describe('runStorageBackfill — idempotence + dry-run', () => {
test('upload already-exists check skips redundant upload on re-run', async () => {
const { engine } = makeEngine([
{ file_id: 1, storage_path_old: 'x', storage_path_new: 'default/x', status: 'pending' },
]);
const { storage, calls } = makeStorage();
// Mark the new path as already existing (simulates a prior partial run
// where upload landed but ledger didn't get updated).
await storage.upload('default/x', Buffer.from('x'));
calls.length = 0;
await runStorageBackfill(engine, storage);
// Exists check ran, but no new download or upload since the
// destination already has the object.
expect(calls.some(c => c === 'exists:default/x')).toBe(true);
expect(calls.some(c => c.startsWith('download:'))).toBe(false);
expect(calls.some(c => c.startsWith('upload:'))).toBe(false);
});
test('dry-run mode reports skipped count, does not mutate', async () => {
const { engine, rows } = makeEngine([
{ file_id: 1, storage_path_old: 'x', storage_path_new: 'default/x', status: 'pending' },
{ file_id: 2, storage_path_old: 'y', storage_path_new: 'default/y', status: 'pending' },
]);
const report = await runStorageBackfill(engine, null, { dryRun: true });
expect(report.total).toBe(2);
expect(report.skipped).toBe(2);
expect(report.nowComplete).toBe(0);
// Rows still pending.
expect(rows.every(r => r.status === 'pending')).toBe(true);
});
test('re-running a completed ledger is a no-op with zero side effects', async () => {
const { engine } = makeEngine([
{ file_id: 1, storage_path_old: 'x', storage_path_new: 'default/x', status: 'complete' },
{ file_id: 2, storage_path_old: 'y', storage_path_new: 'default/y', status: 'complete' },
]);
const { storage, calls } = makeStorage();
const report = await runStorageBackfill(engine, storage);
expect(report.alreadyComplete).toBe(2);
expect(report.nowComplete).toBe(0);
expect(calls).toEqual([]);
});
});
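
Review note: the tests above describe a per-row state machine (`pending`, `copy_done`, `db_updated`, `complete`, with `failed` as a terminal state that is never retried automatically). A minimal sketch of one way to get those resume semantics follows. All names here (`LedgerRow`, `Store`, `advanceRow`, `setFilePath`) are hypothetical, and the real `runStorageBackfill` persists each status to the ledger table instead of mutating an in-memory row.

```typescript
// Hypothetical sketch, NOT the code in v0_18_0-storage-backfill.ts.
type Status = 'pending' | 'copy_done' | 'db_updated' | 'complete' | 'failed';

interface LedgerRow {
  fileId: number;
  oldPath: string;
  newPath: string;
  status: Status;
  error?: string;
}

interface Store {
  exists(path: string): Promise<boolean>;
  download(path: string): Promise<Uint8Array>;
  upload(path: string, data: Uint8Array): Promise<void>;
}

// Advance one row as far as it can go. Each step records its status before
// the next step starts, so a crash between steps resumes at the next one.
async function advanceRow(
  row: LedgerRow,
  store: Store,
  setFilePath: (fileId: number, path: string) => Promise<void>,
): Promise<Status> {
  // Terminal states: nothing to do, and failed rows are not retried.
  if (row.status === 'complete' || row.status === 'failed') return row.status;
  try {
    if (row.status === 'pending') {
      // Skip the copy when a prior partial run already uploaded the object.
      if (!(await store.exists(row.newPath))) {
        await store.upload(row.newPath, await store.download(row.oldPath));
      }
      row.status = 'copy_done';
    }
    if (row.status === 'copy_done') {
      await setFilePath(row.fileId, row.newPath);
      row.status = 'db_updated';
    }
    // The old object is intentionally left in place (no delete step here).
    row.status = 'complete';
  } catch (e) {
    row.status = 'failed';
    row.error = e instanceof Error ? e.message : String(e);
  }
  return row.status;
}
```

Because every branch checks the current status instead of assuming a starting point, calling `advanceRow` again on a row in any state is safe.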