diff --git a/CHANGELOG.md b/CHANGELOG.md index 41213d1..937da33 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -2,6 +2,107 @@ All notable changes to GBrain will be documented in this file. +## [0.18.0] - 2026-04-22 + +## **Multi-source brains. One database, many repos. Federated or isolated, you choose.** +## **`gbrain sources` is the new subcommand. `.gbrain-source` is the new dotfile.** + +A single gbrain database can now hold multiple knowledge repos — your wiki, your gstack checkout, your yc-media pipeline, your garrys-list essays — with clean scoping per source. Slugs are unique per source, not globally, so two sources can both have `topics/ai` and they are different pages. Every page, every file, every ingest_log row is scoped to a `sources(id)` row. + +Per-source federation controls whether a source participates in unqualified default search. `federated=true` is cross-recall (your wiki + gstack both show up when you search "retry budgets"). `federated=false` is isolation (your yc-media content never leaks into your personal writing searches). Flip with `gbrain sources federate ` / `unfederate `. + +Per-directory default via `.gbrain-source` dotfile walk-up + `GBRAIN_SOURCE` env var. Same mental model as kubectl / terraform / git: `cd ~/yc-media && gbrain query "X"` just works, no `--source` flag needed. Resolution priority: explicit flag > env > dotfile > registered-path-longest-prefix > `sources.default` config > literal `default` fallback. + +### The numbers that matter + +9 bisectable commits. 4 new schema migrations. ~85 new tests. Full suite: 2063 pass / 17 fail (the 17 pre-existing master timeouts unchanged). Migration chain runs end-to-end against real PGLite in under 1 second for the integration test. 
+ +| Metric | BEFORE v0.17 | AFTER v0.18 | Δ | +|---|---|---|---| +| Max repos per brain | 1 | unlimited | unbounded | +| Slug uniqueness | global | per-source | composite | +| Multi-source search | impossible | default (for federated) | native | +| New CLI commands | — | 9 (`sources add/list/remove/rename/default/attach/detach/federate/unfederate`) | +9 | +| Schema migrations shipped | 0 new | 4 (v20-v23) | +4 | +| New unit + integration tests | — | ~85 | +85 | + +### What this means for agents + +When a brain has multiple sources, every search result carries `source_id`. Agents cite in `[source-id:slug]` form — `[wiki:topics/ai]` or `[gstack:plans/retry-policy]` — so the user can trace which repo each fact came from. The citation key is `sources.id` (immutable), so renaming a source's display name via `gbrain sources rename` never breaks existing citations. + +Back-compat is total. Pre-v0.18 brains upgrade into a seeded `default` source with `federated=true`, and their existing code paths target `default` via a schema DEFAULT clause. You literally do not have to change anything to upgrade; you only change things if you want to add a second source. + +## To take advantage of v0.18.0 + +`gbrain upgrade` should do this automatically. If it didn't, or if `gbrain doctor` +warns about a partial migration: + +1. **Run the orchestrator manually:** + ```bash + gbrain apply-migrations --yes + ``` +2. **Your agent reads `skills/migrations/v0.18.0.md` the next time you interact with it.** The migration chain is fully mechanical (v20 creates the sources table, v21 adds pages.source_id + composite UNIQUE, v22 adds links.resolution_type, v23 adds files.source_id + page_id + file_migration_ledger). No manual data work needed. +3. **Verify the outcome:** + ```bash + gbrain sources list # should show 'default' federated, with your existing page count + gbrain stats # existing behavior unchanged + gbrain doctor + ``` +4. 
**To start using multi-source:** + ```bash + gbrain sources add gstack --path ~/.gstack --no-federated + cd ~/.gstack && gbrain sources attach gstack + gbrain sync --source gstack + ``` +5. **If any step fails or the numbers look wrong,** please file an issue: https://github.com/garrytan/gbrain/issues with: + - output of `gbrain doctor` + - contents of `~/.gbrain/upgrade-errors.jsonl` if it exists + - which step broke + +### Itemized changes + +#### Added + +- **`gbrain sources` subcommand group** — add, list, remove, rename, default, attach, detach, federate, unfederate. See `docs/guides/multi-source-brains.md` for three canonical scenarios (unified wiki+gstack / purpose-separated yc-media+garrys-list / mixed). +- **`sources` table** — first-class multi-repo primitive. `(id, name, local_path, last_commit, last_sync_at, config)`. Citation key is `sources.id`, immutable, validated `[a-z0-9](?:[a-z0-9-]{0,30}[a-z0-9])?`. +- **`pages.source_id` column + composite UNIQUE (source_id, slug)** — slugs unique per source. DEFAULT 'default' on the column so existing single-source callers target the default source automatically via schema default. +- **`.gbrain-source` dotfile** — walk-up resolution like kubectl/terraform/git. `gbrain sources attach ` writes it in CWD. Auto-selects the source for any command run from that directory or any subdirectory. +- **`GBRAIN_SOURCE` env var** — power-user / CI / script escape hatch. Second highest priority in resolution (after explicit `--source `). +- **Qualified wikilink syntax `[[source:slug]]`** — new in v0.18 extractor. Unqualified `[[slug]]` still resolves via local-first fallback. `links.resolution_type ENUM('qualified','unqualified')` records which kind each edge is for future `gbrain extract --refresh-unqualified` re-resolution. +- **`files.source_id` + `files.page_id`** — files now scope per source + reference pages by id (not slug). 
`file_migration_ledger` drives the S3/Supabase object rewrite under the pending → copy_done → db_updated → complete state machine. +- **`gbrain sync --source `** — per-source sync reads local_path + last_commit from the sources table, writes last_sync_at back. Single-source brains keep using the pre-v0.17 `sync.repo_path` / `sync.last_commit` config keys unchanged. + +#### Changed + +- **Search dedup is now source-aware.** Pre-v0.18 keyed on slug alone; under composite uniqueness that would collapse two same-slug pages in different sources. `pageKey(r) = source_id:slug` is the one canonical helper across all four dedup layers + compiled-truth guarantee. Codex review flagged this as regression-critical. +- **`SearchResult.source_id` optional field** — populated by engine SELECT JOINs. Falls back to `'default'` for pre-v0.18 rows that lacked the column. +- **Migration runner sorts by version** — if anyone adds a migration out of order in `MIGRATIONS[]`, the sort guards against silent skips. + +#### Migrations + +- **v20** `sources_table_additive` — additive-only. Creates sources table + seeds default row with `{"federated": true}`. Inherits existing `sync.repo_path` / `sync.last_commit`. +- **v21** `pages_source_id_composite_unique` — adds `pages.source_id` with DEFAULT, swaps global `UNIQUE(slug)` for composite `UNIQUE(source_id, slug)`. Lands atomically with the engine's `ON CONFLICT (source_id, slug)` rewrite. +- **v22** `links_resolution_type` — adds `links.resolution_type` CHECK column. +- **v23** `files_source_id_page_id_ledger` — Postgres-only (PGLite has no files table). Adds `files.source_id` + `files.page_id`, backfills `page_id` from legacy `page_slug`, creates `file_migration_ledger`. + +#### Tests + +- `test/sources.test.ts` (14 tests) — CLI dispatcher, validation, overlapping-path guard. +- `test/source-resolver.test.ts` (14 tests) — full 6-priority resolution coverage including longest-prefix match. 
+- `test/storage-backfill.test.ts` (13 tests) — state machine + 3 crash-point recovery tests (Codex flagged each). +- `test/multi-source-integration.test.ts` (16 tests) — end-to-end against real PGLite, migration chain v2→v23. +- `test/link-extraction.test.ts` (+6) — qualified `[[source:slug]]` parsing + masking + v22 structural. +- `test/dedup.test.ts` (+4) — regression-critical source-aware composite key tests. +- `test/migrate.test.ts` (+18) — v20/v21/v22/v23 structural assertions. + +#### Docs + +- `docs/guides/multi-source-brains.md` — new getting-started guide (federated / isolated / mixed scenarios). +- `skills/migrations/v0.18.0.md` — agent-facing migration skill. +- `skills/brain-ops/SKILL.md` — new "Cross-source citation format" section. + +Co-Authored-By: Claude Opus 4.7 (1M context) + ## [0.17.0] - 2026-04-22 ## **`gbrain dream`. Run the brain maintenance cycle while you sleep.** diff --git a/docs/guides/multi-source-brains.md b/docs/guides/multi-source-brains.md new file mode 100644 index 0000000..05da8f0 --- /dev/null +++ b/docs/guides/multi-source-brains.md @@ -0,0 +1,182 @@ +# Multi-source brains + +**A single gbrain database can hold multiple knowledge repos.** Each one +is a `source`: a logical brain-within-the-brain with its own slug +namespace, its own sync state, and its own federation policy. The rest +of this guide walks the three canonical scenarios. + +## The three scenarios + +### 1. Unified knowledge recall (wiki + gstack) + +You have a personal wiki and a `gstack` checkout. Both belong to you, +both are knowledge you want your agent to recall across. When you ask +"what did I learn about X?" you want the best hit whether it lives in +the wiki or in a gstack plan. 
+ +```bash +# Register the gstack source, federate so it joins cross-source search +gbrain sources add gstack --path ~/.gstack --federated + +# Pin the directory so `gbrain sync` knows which source it's walking +cd ~/.gstack && gbrain sources attach gstack + +# Initial sync +gbrain sync --source gstack + +# Now `gbrain search "retry budgets"` returns hits from BOTH wiki and +# gstack. Each result includes source_id so the agent can cite properly. +``` + +Result: wiki pages and gstack plans are separate (different source_ids, +different slug namespaces) but share the search surface. + +### 2. Purpose-separated brains (yc-media + garrys-list) + +You run two completely different content pipelines on the same backend. +YC Media covers portfolio news and founder profiles. Garry's List is +personal writing. You explicitly DON'T want them mixed in search — YC +portfolio content leaking into essay searches is a bug, not a feature. + +```bash +# Two sources, both isolated (federated=false) +gbrain sources add yc-media --path ~/yc-media --no-federated +gbrain sources add garrys-list --path ~/writing --no-federated + +# Pin each checkout directory +(cd ~/yc-media && gbrain sources attach yc-media) +(cd ~/writing && gbrain sources attach garrys-list) + +# Sync each independently +gbrain sync --source yc-media +gbrain sync --source garrys-list +``` + +Result: searching from outside both directories returns hits from the +`default` source (your main brain). Searching from inside `~/yc-media` +returns only yc-media hits. Searching from inside `~/writing` returns only garrys-list. +Federation is opt-in, not leaked. + +To search across them explicitly on demand: + +```bash +gbrain search "tech layoffs" --source yc-media,garrys-list +``` + +### 3. Mixed (wiki federated + sessions isolated) + +Your main wiki is federated with a few trusted sources. Your session +transcripts (coming in a later release) land in a separate isolated source so +they don't dominate every search result. 
+ +```bash +# Federated sources +gbrain sources add gstack --path ~/.gstack --federated + +# Isolated source (future release; sessions use this shape today for ingest) +gbrain sources add sessions --path ~/.claude/sessions --no-federated +``` + +## Resolution priority + +When any command needs to pick a source, gbrain walks this list (highest +first): + +1. Explicit `--source ` flag. +2. `GBRAIN_SOURCE` environment variable. +3. `.gbrain-source` dotfile in CWD or any ancestor directory. +4. A registered source whose `local_path` contains the CWD (longest + prefix wins for nested checkouts). +5. The brain-level default set via `gbrain sources default `. +6. The seeded `default` source. + +So inside `~/.gstack/plans/` on a brain that pinned `gstack` to +`~/.gstack` via `.gbrain-source`, `gbrain put-page` implicitly writes to +the `gstack` source. Outside any registered directory with no env/dotfile +set, it writes to the default. + +## Federation flag + +Every source row stores `config.federated: boolean` in its JSONB config. + +| Value | Meaning | +|-------|---------| +| `true` | Source participates in unqualified `gbrain search "X"` results. | +| `false` (default for new sources) | Source only searched when explicitly named via `--source ` or qualified citation. | + +The seeded `default` source is `federated=true` so pre-v0.18 brains +behave exactly as before: every page appears in search. + +Flip later with `gbrain sources federate ` / `unfederate `. + +## Commands + +Full subcommand reference: + +``` +gbrain sources add --path

[--name ] [--federated|--no-federated] + Register a source. id: [a-z0-9](?:[a-z0-9-]{0,30}[a-z0-9])? +gbrain sources list [--json] List all sources with page counts + federation state. +gbrain sources remove [--yes] [--dry-run] [--keep-storage] + Cascade-delete a source (pages, chunks, timeline). +gbrain sources rename + Change display name only; id is immutable. +gbrain sources default Set the brain-level default. +gbrain sources attach Write .gbrain-source in CWD (like kubectl context). +gbrain sources detach Remove .gbrain-source from CWD. +gbrain sources federate +gbrain sources unfederate +``` + +## Citation format for agents + +When agents receive multi-source results they MUST cite pages in +`[source-id:slug]` form. Example: + +> You told me about the distillation protocol — see [wiki:topics/ai] +> and [gstack:plans/multi-repo] for where this came from. + +The citation key is `sources.id` (immutable). Renaming a source via +`gbrain sources rename` changes the display name only; existing +citations keep working. + +## Writing to a specific source + +```bash +# Pass --source explicitly +gbrain put-page topics/ai ... --source wiki + +# Or rely on the dotfile / env / CWD match +cd ~/.gstack && gbrain put-page plans/multi-repo ... +# → source auto-resolves to gstack +``` + +Reads span federated sources by default. Writes require a resolved +source (explicit, inferred, or default). The resolver never picks a +source silently when ambiguous — it errors with a clear fix. + +## Upgrading an existing brain + +`gbrain upgrade` runs the v20 through v23 migrations automatically. Your +existing pages all move under `source_id='default'`. Behavior is +unchanged until you add a second source. + +To add one: + +```bash +gbrain sources add gstack --path ~/.gstack --federated +cd ~/.gstack && gbrain sources attach gstack && gbrain sync +``` + +Two lines. The existing default source is untouched. 
+ +## Not in v0.18.0 + +- Session transcript ingest (`.jsonl`, raised size cap, session + PageType): planned for a later release. +- Per-source retention/TTL (`gbrain sources prune`): planned for a later release. +- ACL enforcement via caller-identity: planned for a later release. +- `gbrain sources import-from-github ` one-shot bootstrap: patch + release after the core plumbing stabilizes. + +All of these build on the `sources` primitive shipped here. diff --git a/skills/brain-ops/SKILL.md b/skills/brain-ops/SKILL.md index 7abd4ec..fff7d77 100644 --- a/skills/brain-ops/SKILL.md +++ b/skills/brain-ops/SKILL.md @@ -116,6 +116,25 @@ ingest event. No separate output. Brain-ops is an always-on behavior layer, not a report generator. The output is updated brain pages and enriched responses. +## Cross-source citation format (v0.18.0+) + +When a brain has multiple sources (wiki, gstack, yc-media, etc.), every +citation MUST include the source id: `[source-id:slug]`. Example: + +> You told me about the retry budget approach — see +> [wiki:topics/resilience] and [gstack:plans/retry-policy] for where +> this came from. + +Rules: +- The key is `sources.id` (immutable), never `sources.name` (mutable display). +- Single-source brains still write `[default:slug]` OR may omit the prefix + for backward compat. +- Every page payload returned by `search`, `query`, `get_page`, `list_pages` + carries `source_id` — always use it when citing, never guess. + +If a search result has `source_id: "gstack"` and `slug: "plans/foo"`, +the citation is `[gstack:plans/foo]`. That's the whole rule. + ## Anti-Patterns - Answering questions about people/companies without checking the brain first diff --git a/skills/migrations/v0.18.0.md b/skills/migrations/v0.18.0.md new file mode 100644 index 0000000..c2672cd --- /dev/null +++ b/skills/migrations/v0.18.0.md @@ -0,0 +1,161 @@ +--- +version: 0.18.0 +feature_pitch: + headline: "Multi-source brains: one DB, many repos. Federated and isolated sources coexist." 
+ description: | + v0.17.0 introduces sources as a first-class primitive. A single + gbrain backend can now hold multiple knowledge repos (wiki, gstack, + yc-media, garrys-list, etc.) with clean scoping. Every page, file, + and ingest_log row is scoped to a `sources(id)` row. Slugs are + unique PER source, not globally — so two sources can both have + `topics/ai` and they're different pages. + + Per-source federation controls whether a source participates in + unqualified default search. `federated=true` (the default source + post-upgrade) joins the cross-source recall pool. `federated=false` + is isolation — only searched when explicitly named via `--source`. + This supports both "unified knowledge brain" (wiki + gstack, both + federated) and "purpose-separated brains" (yc-media + garrys-list, + both isolated) at the same time. + + Per-directory default via `.gbrain-source` dotfile walk-up + + `GBRAIN_SOURCE` env var. Matches how kubectl / terraform / git + scope context. `cd ~/yc-media && gbrain query "X"` just works. + recipe: docs/guides/multi-source-brains.md + tiers: null +--- + +# v0.17.0 Migration: Multi-source brains + +**Audience: host agents reading this after `gbrain apply-migrations` +has run. v0.17.0 installs a schema primitive for multi-source and +exposes a `sources` CLI subcommand. Existing single-source brains +keep working unchanged — they live under a seeded `default` source +that preserves all prior behavior.** + +## Mechanical migration: automatic, no action required + +`gbrain upgrade` chains to `gbrain apply-migrations --yes`, which +runs: + +- **migration v16** — creates the `sources` table, seeds `default` + with `{"federated": true}` config, inherits your pre-v0.17 + `sync.repo_path` and `sync.last_commit` into the default row. +- **migration v17** — adds `pages.source_id TEXT NOT NULL DEFAULT + 'default' REFERENCES sources(id)`. Swaps the global `UNIQUE(slug)` + constraint for composite `UNIQUE(source_id, slug)`. 
Engine + upserts simultaneously re-target `ON CONFLICT (source_id, slug)` + so the constraint swap and the write path land atomically. + +Both migrations are idempotent. Safe to re-run. + +Later point releases (v0.17.1 and v0.18.0) will layer: +- v0.17.1: ACL enforcement via a caller-identity primitive (the + JSONB slot for `access_policy` ships now; enforcement waits for + identity to be designed). +- v0.18.0: Session ingest (`.jsonl` transcripts, raised size cap, + session PageType) AND per-source retention/TTL at the same time. + +## What's new for agents + +### `sources` CLI subcommand + +``` +gbrain sources add --path

[--name ] [--federated|--no-federated] +gbrain sources list [--json] +gbrain sources remove [--yes] [--dry-run] [--keep-storage] +gbrain sources rename +gbrain sources default +gbrain sources attach # write .gbrain-source in CWD +gbrain sources detach # remove .gbrain-source +gbrain sources federate +gbrain sources unfederate +``` + +Source id rules: `[a-z0-9](?:[a-z0-9-]{0,30}[a-z0-9])?` — start + end +with alnum, optional interior hyphens, max 32 chars. Immutable after +creation (rename only changes the display name). Used as the stable +citation key in `[source:slug]` references. + +### Per-directory default + +Running `gbrain sources attach gstack` inside `~/.gstack/` writes a +`.gbrain-source` file containing the single word `gstack`. Any +gbrain command run from that directory (or any subdirectory) auto- +selects `gstack` as the default source. `gbrain sources detach` +removes the dotfile. + +Resolution priority for the source a command targets: + +1. Explicit `--source ` flag. +2. `GBRAIN_SOURCE` env var. +3. `.gbrain-source` dotfile in CWD or any ancestor. +4. Registered source whose `local_path` contains CWD (longest + prefix wins — nested `~/gstack` + `~/gstack/plans` resolves to + `plans` when deeper). +5. Brain-level default set via `gbrain sources default `. +6. Literal `default` (backward-compat fallback). + +### Federation semantics + +- `federated=true` (only the `default` source has this out of the + box, by migration): appears in unqualified `gbrain search "X"` + results. +- `federated=false` (new sources default to this): only appears + when `--source ` is passed. + +Interactive `gbrain sources add` prompts for federation; non- +interactive uses `--federated` / `--no-federated`. Flip later with +`gbrain sources federate ` / `unfederate `. + +### Citation contract (for agents) + +When agents get multi-source search results they MUST cite pages +in `[source-id:slug]` form. 
Example: + +> You told me about the distillation protocol — see +> [wiki:topics/ai] and [gstack:plans/multi-repo] for where this +> came from. + +Citations are keyed on `sources.id` (immutable), never +`sources.name` (mutable display). If a user renames a source via +`gbrain sources rename`, existing citations stay valid. + +## What's NOT in v0.17.0 yet + +The following land in later Steps of this release cycle (already +on the branch but gated until the matching code ships): + +- `ingest_log.source_id` — lands with Step 5 sync rewrite. +- `links.resolution_type` + qualified `[[source:slug]]` wikilink + parsing — lands with Step 4 link-extraction rewrite. +- `files.page_slug → page_id` FK rewrite + `file_migration_ledger` + + storage object prefixing — lands with Step 7 storage backfill. +- Source-aware search dedup — lands with Step 3. +- `gbrain sources import-from-github ` — deferred to a patch + release after the plumbing stabilizes. + +Existing callers continue to work against the `default` source. No +agent behavioral change is required; the new capabilities are +opt-in via the new `sources` CLI surface. + +## Host-repo actions + +None required. If your host agent manages the brain via the +standard `gbrain sync` flow, it continues to target the default +source and sees no behavioral change. To start using multi-source: + +``` +# Register a new source +gbrain sources add gstack --path ~/.gstack --no-federated + +# Pin that directory to it so no --source flag is needed +cd ~/.gstack +gbrain sources attach gstack + +# Ingest +gbrain sync --source gstack +``` + +Or see `docs/guides/multi-source-brains.md` for the full three +canonical scenarios (unified, purpose-separated, mixed). 
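Reviewer sketch (not part of the patch): the six-step resolution priority and the source id rule described in the migration skill above could look roughly like the following. This is a minimal illustration under stated assumptions, not gbrain's actual resolver. The names `ResolveInput`, `RegisteredSource`, `resolveSource`, `longestPrefixMatch`, and `isValidSourceId` are hypothetical, paths are assumed to be absolute POSIX-style strings, and the dotfile walk-up is assumed to have already happened, so only the dotfile's contents are passed in.

```typescript
// Hypothetical sketch of the documented resolution order; names are illustrative only.

// Source id rule quoted from the docs, anchored so the whole string must match.
const SOURCE_ID_RE = /^[a-z0-9](?:[a-z0-9-]{0,30}[a-z0-9])?$/;

interface RegisteredSource {
  id: string;
  localPath: string; // assumed absolute, POSIX-style
}

interface ResolveInput {
  flag?: string;         // value passed via --source
  env?: string;          // value of GBRAIN_SOURCE
  dotfile?: string;      // contents of the nearest .gbrain-source, if one was found
  cwd: string;           // assumed absolute, POSIX-style
  sources: RegisteredSource[];
  brainDefault?: string; // value set via `gbrain sources default`
}

function isValidSourceId(id: string): boolean {
  return SOURCE_ID_RE.test(id);
}

// Strip trailing slashes so "/a/b/" and "/a/b" compare equal.
function normalize(p: string): string {
  const trimmed = p.replace(/\/+$/, "");
  return trimmed === "" ? "/" : trimmed;
}

// Step 4: pick the registered source whose local_path contains cwd;
// the deepest (longest) containing path wins for nested checkouts.
function longestPrefixMatch(cwd: string, sources: RegisteredSource[]): string | undefined {
  const here = normalize(cwd);
  let bestId: string | undefined;
  let bestLen = -1;
  for (const s of sources) {
    const root = normalize(s.localPath);
    const prefix = root === "/" ? "/" : root + "/";
    const inside = here === root || here.startsWith(prefix);
    if (inside && root.length > bestLen) {
      bestId = s.id;
      bestLen = root.length;
    }
  }
  return bestId;
}

// Walk the priority list top to bottom; empty strings count as "not set".
function resolveSource(input: ResolveInput): string {
  return (
    input.flag ||
    input.env ||
    (input.dotfile ? input.dotfile.trim() : "") ||
    longestPrefixMatch(input.cwd, input.sources) ||
    input.brainDefault ||
    "default"
  );
}
```

Matching on `root + "/"` rather than a bare string prefix is deliberate in this sketch: it keeps a sibling directory such as `/home/u/gstackish` from being treated as inside `/home/u/gstack`.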
diff --git a/src/cli.ts b/src/cli.ts index c508911..0866e78 100644 --- a/src/cli.ts +++ b/src/cli.ts @@ -19,7 +19,7 @@ for (const op of operations) { } // CLI-only commands that bypass the operation layer -const CLI_ONLY = new Set(['init', 'upgrade', 'post-upgrade', 'check-update', 'integrations', 'publish', 'check-backlinks', 'lint', 'report', 'import', 'export', 'files', 'embed', 'serve', 'call', 'config', 'doctor', 'migrate', 'eval', 'sync', 'extract', 'features', 'autopilot', 'graph-query', 'jobs', 'agent', 'apply-migrations', 'skillpack-check', 'resolvers', 'integrity', 'repair-jsonb', 'orphans', 'dream', 'check-resolvable']); +const CLI_ONLY = new Set(['init', 'upgrade', 'post-upgrade', 'check-update', 'integrations', 'publish', 'check-backlinks', 'lint', 'report', 'import', 'export', 'files', 'embed', 'serve', 'call', 'config', 'doctor', 'migrate', 'eval', 'sync', 'extract', 'features', 'autopilot', 'graph-query', 'jobs', 'agent', 'apply-migrations', 'skillpack-check', 'resolvers', 'integrity', 'repair-jsonb', 'orphans', 'sources', 'dream', 'check-resolvable']); async function main() { // Parse global flags (--quiet / --progress-json / --progress-interval) @@ -472,6 +472,11 @@ async function handleCliOnly(command: string, args: string[]) { await runOrphans(engine, args); break; } + case 'sources': { + const { runSources } = await import('./commands/sources.ts'); + await runSources(engine, args); + break; + } } } finally { if (command !== 'serve') await engine.disconnect(); diff --git a/src/commands/migrations/index.ts b/src/commands/migrations/index.ts index 3562880..8bf2d6b 100644 --- a/src/commands/migrations/index.ts +++ b/src/commands/migrations/index.ts @@ -18,6 +18,7 @@ import { v0_13_0 } from './v0_13_0.ts'; import { v0_13_1 } from './v0_13_1.ts'; import { v0_14_0 } from './v0_14_0.ts'; import { v0_16_0 } from './v0_16_0.ts'; +import { v0_18_0 } from './v0_18_0.ts'; export const migrations: Migration[] = [ v0_11_0, @@ -27,6 +28,7 @@ export const 
migrations: Migration[] = [ v0_13_1, v0_14_0, v0_16_0, + v0_18_0, ]; /** Look up a migration by exact version string. */ diff --git a/src/commands/migrations/v0_18_0-storage-backfill.ts b/src/commands/migrations/v0_18_0-storage-backfill.ts new file mode 100644 index 0000000..194b3be --- /dev/null +++ b/src/commands/migrations/v0_18_0-storage-backfill.ts @@ -0,0 +1,174 @@ +/** + * v0.18.0 Step 7 — phase B storage backfill loader. + * + * Drives the `file_migration_ledger` state machine forward: + * + * pending → copy_done → db_updated → complete + * + * Each per-file transition is a separate transaction so a crash + * between states leaves a recoverable row (resume-on-partial). The + * ledger is the atomicity backstop for non-atomic object-storage + * "renames" (S3/Supabase = copy+delete). + * + * Crash-point recovery: + * - crash AFTER copy, BEFORE DB update → re-run detects + * `status='copy_done'`, completes DB update (copy is idempotent + * against S3 overwrite so re-copy on same path is fine). + * - crash AFTER DB update, BEFORE ledger mark → re-run detects + * `status='db_updated'`, marks `complete`. + * - crash AFTER ledger mark, BEFORE old-object delete → delete runs + * in the explicit "cleanup" sub-phase so old objects are + * preserved until a separate operator decision. + * + * Scope: v0.18.0 Step 7 DOES rewrite storage_path in the files table + * and copies the bytes to the new source-prefixed path. It does NOT + * delete the old objects — that's reserved for a later release once + * operators have had time to verify the new paths. Old and new + * objects coexist during the soak period. 
+ */ + +import type { BrainEngine } from '../../core/engine.ts'; +import type { StorageBackend, StorageConfig } from '../../core/storage.ts'; + +interface LedgerRow { + file_id: number; + storage_path_old: string; + storage_path_new: string; + status: 'pending' | 'copy_done' | 'db_updated' | 'complete' | 'failed'; +} + +export interface BackfillReport { + total: number; + alreadyComplete: number; + nowComplete: number; + failed: number; + skipped: number; + errors: Array<{ file_id: number; error: string }>; +} + +/** + * Process all non-complete ledger rows. Safe to re-run; each row + * resumes from whichever state it was in. Storage is injected so the + * caller can pass a real S3/Supabase backend OR a dry-run stub that + * short-circuits the copy. + * + * If storage is null/undefined the function runs as a dry-run: it + * reports what WOULD be processed without touching objects. This is + * used by the orchestrator when storage isn't configured. + */ +export async function runStorageBackfill( + engine: BrainEngine, + storage: StorageBackend | null, + opts?: { dryRun?: boolean }, +): Promise<BackfillReport> { + const report: BackfillReport = { + total: 0, + alreadyComplete: 0, + nowComplete: 0, + failed: 0, + skipped: 0, + errors: [], + }; + + // Snapshot all ledger rows. We don't paginate because the ledger + // is bounded by current files count — every gbrain install has + // at most low-thousands of files. + const rows = await engine.executeRaw<LedgerRow>( + `SELECT file_id, storage_path_old, storage_path_new, status + FROM file_migration_ledger + ORDER BY file_id`, + ); + report.total = rows.length; + + for (const row of rows) { + if (row.status === 'complete') { + report.alreadyComplete++; + continue; + } + if (row.status === 'failed') { + report.failed++; + continue; + } + + if (opts?.dryRun || !storage) { + // Dry-run: count pending rows but don't advance state. + report.skipped++; + continue; + } + + // Drive the state machine. 
Each transition is its own + // executeRaw call so mid-row crashes leave a recoverable state. + try { + let status = row.status; + + // pending → copy_done: COPY the bytes. + if (status === 'pending') { + // If the new path is already populated (e.g. from a previous + // partial run), the copy is redundant but idempotent on S3/ + // Supabase where upload overwrites the key. + const exists = await storage.exists(row.storage_path_new).catch(() => false); + if (!exists) { + const data = await storage.download(row.storage_path_old); + await storage.upload(row.storage_path_new, data); + } + await engine.executeRaw( + `UPDATE file_migration_ledger + SET status = 'copy_done', updated_at = now() + WHERE file_id = $1`, + [row.file_id], + ); + status = 'copy_done'; + } + + // copy_done → db_updated: flip files.storage_path to the new + // path. Once this commits, downloads go through the new path + // and the old object is orphaned (but still present on disk + // for rollback within the soak window). + if (status === 'copy_done') { + await engine.executeRaw( + `UPDATE files SET storage_path = $1 WHERE id = $2`, + [row.storage_path_new, row.file_id], + ); + await engine.executeRaw( + `UPDATE file_migration_ledger + SET status = 'db_updated', updated_at = now() + WHERE file_id = $1`, + [row.file_id], + ); + status = 'db_updated'; + } + + // db_updated → complete: mark terminal. The old-object delete + // happens in a separate sub-phase (future release) so operators + // can verify the new paths before we drop the safety net. + if (status === 'db_updated') { + await engine.executeRaw( + `UPDATE file_migration_ledger + SET status = 'complete', updated_at = now() + WHERE file_id = $1`, + [row.file_id], + ); + report.nowComplete++; + } + } catch (e) { + const msg = e instanceof Error ? e.message : String(e); + report.failed++; + report.errors.push({ file_id: row.file_id, error: msg }); + // Mark failed so the next run doesn't retry blindly. 
Operator + // can reset to 'pending' via SQL once the root cause is fixed. + try { + await engine.executeRaw( + `UPDATE file_migration_ledger + SET status = 'failed', error = $1, updated_at = now() + WHERE file_id = $2`, + [msg.slice(0, 500), row.file_id], + ); + } catch { + // Best-effort: if we can't even write 'failed', report the + // original error and move on. + } + } + } + + return report; +} diff --git a/src/commands/migrations/v0_18_0.ts b/src/commands/migrations/v0_18_0.ts new file mode 100644 index 0000000..a46ca40 --- /dev/null +++ b/src/commands/migrations/v0_18_0.ts @@ -0,0 +1,237 @@ +/** + * v0.18.0 migration orchestrator — Multi-source brains. + * + * Split across sub-versions of the migration registry for safety: + * - v16 (Step 1 / Lane A): additive-only. Installs sources table + + * default row. Does NOT break any existing engine code. + * - v17 (Step 2 / Lane B, future): breaking schema changes. Rides with + * the engine API rewrite so ON CONFLICT (source_id, slug) lands + * atomically with the composite UNIQUE. + * + * Phase structure (per /plan-ceo-review + /plan-eng-review): + * A. Schema — gbrain init --migrate-only runs the migration chain up + * to whichever v-prefix has shipped (v16 today, v17 next). + * B. Storage backfill (Step 7, future) — ledger-driven object rewrite. + * C. Verify — assert sources('default') exists today. Composite UNIQUE, + * page_id backfill, and ledger completeness get added in Step 2. + * D. (future) Delete old storage objects — only runs after C green. + * + * Idempotent: safe to re-run on partial state. 
+ */ + +import { execSync } from 'child_process'; +import type { Migration, OrchestratorOpts, OrchestratorResult, OrchestratorPhaseResult } from './types.ts'; +import { appendCompletedMigration } from '../../core/preferences.ts'; +import { loadConfig, toEngineConfig } from '../../core/config.ts'; +import { createEngine } from '../../core/engine-factory.ts'; + +// ── Phase A — Schema ──────────────────────────────────────── + +function phaseASchema(opts: OrchestratorOpts): OrchestratorPhaseResult { + if (opts.dryRun) return { name: 'schema', status: 'skipped', detail: 'dry-run' }; + try { + execSync('gbrain init --migrate-only', { stdio: 'inherit', timeout: 600_000, env: process.env }); + return { name: 'schema', status: 'complete' }; + } catch (e) { + const msg = e instanceof Error ? e.message : String(e); + return { name: 'schema', status: 'failed', detail: msg }; + } +} + +// ── Phase B — Storage backfill (skeleton, filled by Step 7) ── + +async function phaseBBackfillStorage(opts: OrchestratorOpts): Promise<OrchestratorPhaseResult> { + if (opts.dryRun) return { name: 'backfill_storage', status: 'skipped', detail: 'dry-run' }; + try { + const config = loadConfig(); + if (!config) return { name: 'backfill_storage', status: 'skipped', detail: 'no brain configured' }; + + const engine = await createEngine(toEngineConfig(config)); + await engine.connect(toEngineConfig(config)); + try { + if (engine.kind === 'pglite') { + return { name: 'backfill_storage', status: 'skipped', detail: 'pglite (no files table)' }; + } + const hasLedger = await engine.executeRaw<{ exists: boolean }>( + `SELECT EXISTS (SELECT 1 FROM information_schema.tables + WHERE table_schema = current_schema() + AND table_name = 'file_migration_ledger') AS exists`, + ); + if (!hasLedger[0]?.exists) { + return { + name: 'backfill_storage', + status: 'skipped', + detail: 'file_migration_ledger not yet installed (run apply-migrations first)', + }; + } + + // Ledger exists.
If storage isn't configured, run the dry-run + // path — we can still report the ledger state but we can't + // COPY objects. Operator then wires storage and re-runs. + const storage = config.storage ? await loadStorageBackend(config.storage) : null; + + const { runStorageBackfill } = await import('./v0_18_0-storage-backfill.ts'); + const report = await runStorageBackfill(engine, storage, { dryRun: !storage }); + + if (report.total === 0) { + return { name: 'backfill_storage', status: 'complete', detail: 'no files to migrate' }; + } + + if (report.failed > 0) { + return { + name: 'backfill_storage', + status: 'failed', + detail: `${report.failed}/${report.total} files failed: ${report.errors.slice(0, 3).map(e => `#${e.file_id}: ${e.error.slice(0, 60)}`).join('; ')}`, + }; + } + + if (report.skipped > 0 && !storage) { + return { + name: 'backfill_storage', + status: 'skipped', + detail: `${report.skipped}/${report.total} files pending; storage backend not configured (wire storage + re-run)`, + }; + } + + const detail = `${report.total} files: ${report.alreadyComplete} already complete, ${report.nowComplete} newly migrated`; + return { name: 'backfill_storage', status: 'complete', detail }; + } finally { + try { await engine.disconnect(); } catch {} + } + } catch (e) { + return { + name: 'backfill_storage', + status: 'failed', + detail: e instanceof Error ? 
e.message : String(e), + }; + } +} + +async function loadStorageBackend(storageConfig: unknown): Promise { + try { + const { createStorage } = await import('../../core/storage.ts'); + // eslint-disable-next-line @typescript-eslint/no-explicit-any + return await createStorage(storageConfig as any); + } catch { + return null; + } +} + +// ── Phase C — Verify ──────────────────────────────────────── + +async function phaseCVerify(opts: OrchestratorOpts): Promise<OrchestratorPhaseResult> { + if (opts.dryRun) return { name: 'verify', status: 'skipped', detail: 'dry-run' }; + try { + const config = loadConfig(); + if (!config) return { name: 'verify', status: 'skipped', detail: 'no brain configured' }; + + const engine = await createEngine(toEngineConfig(config)); + await engine.connect(toEngineConfig(config)); + try { + // 1. sources('default') exists (Step 1 / v16). + const defaults = await engine.executeRaw<{ id: string }>( + `SELECT id FROM sources WHERE id = 'default'`, + ); + if (defaults.length !== 1) { + return { name: 'verify', status: 'failed', detail: "sources('default') row missing" }; + } + + // Step 2 checks (composite UNIQUE, links.resolution_type, + // file_migration_ledger completion) are gated on the future v17 + // migration. They run conditionally — if the column/constraint + // exists, verify it; if not, that's fine for Step 1. + + // Optional: composite UNIQUE if installed (Step 2 future work). + const constraint = await engine.executeRaw<{ conname: string }>( + `SELECT conname FROM pg_constraint WHERE conname = 'pages_source_slug_key'`, + ); + // If installed, verify no pages have NULL source_id. + if (constraint.length === 1) { + const nullSources = await engine.executeRaw<{ n: number }>( + `SELECT COUNT(*)::int AS n FROM pages WHERE source_id IS NULL`, + ); + if ((nullSources[0]?.n ?? 
0) > 0) { + return { name: 'verify', status: 'failed', detail: `${nullSources[0].n} pages with NULL source_id` }; + } + } + + return { name: 'verify', status: 'complete', detail: 'sources primitive installed' }; + } finally { + try { await engine.disconnect(); } catch {} + } + } catch (e) { + return { name: 'verify', status: 'failed', detail: e instanceof Error ? e.message : String(e) }; + } +} + +// ── Orchestrator ──────────────────────────────────────────── + +async function orchestrator(opts: OrchestratorOpts): Promise<OrchestratorResult> { + console.log(''); + console.log('=== v0.18.0 — Multi-source brains ==='); + if (opts.dryRun) console.log(' (dry-run; no side effects)'); + console.log(''); + + const phases: OrchestratorPhaseResult[] = []; + + const a = phaseASchema(opts); + phases.push(a); + if (a.status === 'failed') return finalize(phases, 'failed'); + + const b = await phaseBBackfillStorage(opts); + phases.push(b); + // Phase B 'failed' is currently expected until Step 7 lands the storage + // loader. Continue to verify so users see the exact gap. + + const c = await phaseCVerify(opts); + phases.push(c); + + // a.status === 'failed' already early-returned on line 179, so only + // c and b determine the final status here. TypeScript narrowing rejects + // a redundant a.status === 'failed' check. + const status: 'complete' | 'partial' | 'failed' = + c.status === 'failed' ? 'failed' : + b.status === 'failed' ? 'partial' : + 'complete'; + + return finalize(phases, status); +} + +function finalize(phases: OrchestratorPhaseResult[], status: 'complete' | 'partial' | 'failed'): OrchestratorResult { + if (status !== 'failed') { + try { + appendCompletedMigration({ + version: '0.18.0', + completed_at: new Date().toISOString(), + status: status as 'complete' | 'partial', + phases: phases.map(p => ({ name: p.name, status: p.status })), + }); + } catch { + // Best-effort.
+ } + } + return { version: '0.18.0', status, phases }; +} + +export const v0_18_0: Migration = { + version: '0.18.0', + featurePitch: { + headline: 'Multi-source brains: one database, many knowledge repos. Federation flag keeps them from polluting each other.', + description: + 'v0.18.0 introduces sources — a first-class primitive that lets one gbrain backend hold ' + + 'multiple repos (wiki, gstack, yc-media, etc.) with clean scoping. Every page, file, and ' + + 'ingest_log row is now scoped to a source. Cross-source search is opt-in per source ' + + '(federated=true) so isolated content (yc-media, garrys-list) never bleeds into your main ' + + 'brain. New commands: `gbrain sources add/attach/import-from-github`. Per-directory ' + + 'default via .gbrain-source dotfile + GBRAIN_SOURCE env var. See docs/guides/' + + 'multi-source-brains.md.', + }, + orchestrator, +}; + +/** Exported for unit tests. */ +export const __testing = { + phaseASchema, + phaseBBackfillStorage, + phaseCVerify, +}; diff --git a/src/commands/sources.ts b/src/commands/sources.ts new file mode 100644 index 0000000..ea0f13e --- /dev/null +++ b/src/commands/sources.ts @@ -0,0 +1,372 @@ +/** + * gbrain sources — manage multi-source brain configuration (v0.18.0). + * + * A source is a logical brain-within-the-DB: wiki, gstack, yc-media, etc. + * Every page/file/ingest_log row is scoped to a sources(id) row. Slugs + * are unique per source. See docs/guides/multi-source-brains.md for the + * full story. 
+ * + * Subcommands: + * gbrain sources add --path [--name ] [--federated|--no-federated] + * gbrain sources list [--json] + * gbrain sources remove [--yes] [--dry-run] [--keep-storage] + * gbrain sources rename + * gbrain sources default + * gbrain sources attach — write .gbrain-source in CWD + * gbrain sources detach — remove .gbrain-source from CWD + * gbrain sources federate — sources.config.federated = true + * gbrain sources unfederate — sources.config.federated = false + * + * NOT in scope for Step 6 (deferred per plan): + * - import-from-github (needs SSRF + clone integration) + * - prune (retention/TTL deferred to v0.18) + * - MCP tool-def regen for full source-scoping of all ops (part of Step 2+5) + */ + +import { writeFileSync, unlinkSync, existsSync } from 'fs'; +import { join } from 'path'; +import type { BrainEngine } from '../core/engine.ts'; + +// ── Validation ────────────────────────────────────────────── + +// Shared with source-resolver.ts — canonical shape. +const SOURCE_ID_RE = /^[a-z0-9](?:[a-z0-9-]{0,30}[a-z0-9])?$/; + +function validateSourceId(id: string): void { + if (!SOURCE_ID_RE.test(id)) { + throw new Error( + `Invalid source id "${id}". Must be 1-32 lowercase alnum chars with optional interior hyphens (e.g. 
"wiki", "yc-media").`, + ); + } +} + +// ── Types ─────────────────────────────────────────────────── + +interface SourceRow { + id: string; + name: string; + local_path: string | null; + last_commit: string | null; + last_sync_at: Date | null; + config: Record | string; + created_at: Date; +} + +interface SourceListEntry { + id: string; + name: string; + local_path: string | null; + federated: boolean; + page_count: number; + last_sync_at: string | null; +} + +// ── Helpers ───────────────────────────────────────────────── + +function parseConfig(config: unknown): Record { + if (typeof config === 'string') { + try { return JSON.parse(config) as Record; } catch { return {}; } + } + if (typeof config === 'object' && config !== null) return config as Record; + return {}; +} + +function isFederated(config: unknown): boolean { + const parsed = parseConfig(config); + return parsed.federated === true; +} + +async function fetchSource(engine: BrainEngine, id: string): Promise { + const rows = await engine.executeRaw( + `SELECT id, name, local_path, last_commit, last_sync_at, config, created_at + FROM sources WHERE id = $1`, + [id], + ); + return rows[0] ?? null; +} + +async function countPages(engine: BrainEngine, sourceId: string): Promise { + const rows = await engine.executeRaw<{ n: number }>( + `SELECT COUNT(*)::int AS n FROM pages WHERE source_id = $1`, + [sourceId], + ); + return rows[0]?.n ?? 
0; +} + +// ── Subcommand: add ───────────────────────────────────────── + +async function runAdd(engine: BrainEngine, args: string[]): Promise { + const id = args[0]; + if (!id) { + console.error('Usage: gbrain sources add --path [--name ] [--federated|--no-federated]'); + process.exit(2); + } + validateSourceId(id); + + let localPath: string | null = null; + let displayName = id; + let federated: boolean | null = null; // null = default (false for new, opt-in via --federated) + + for (let i = 1; i < args.length; i++) { + const a = args[i]; + if (a === '--path') { localPath = args[++i]; continue; } + if (a === '--name') { displayName = args[++i]; continue; } + if (a === '--federated') { federated = true; continue; } + if (a === '--no-federated') { federated = false; continue; } + console.error(`Unknown flag: ${a}`); + process.exit(2); + } + + // Overlapping path guard: reject if new path is inside or contains an + // existing source's local_path (per eng review §4 finding 4.1). + // Throwing (vs process.exit) keeps this testable via the standard + // CLI error-handling wrapper in src/cli.ts. + if (localPath) { + const others = await engine.executeRaw<{ id: string; local_path: string }>( + `SELECT id, local_path FROM sources WHERE local_path IS NOT NULL AND id != $1`, + [id], + ); + for (const other of others) { + const a = localPath; + const b = other.local_path; + if (a === b || a.startsWith(b + '/') || b.startsWith(a + '/')) { + throw new Error( + `path "${a}" overlaps with existing source "${other.id}" at "${b}". ` + + `Overlapping sources are not allowed — same files would ingest twice under different source_ids.`, + ); + } + } + } + + const config = federated === null ? 
{} : { federated }; + await engine.executeRaw( + `INSERT INTO sources (id, name, local_path, config) + VALUES ($1, $2, $3, $4::jsonb) + ON CONFLICT (id) DO NOTHING`, + [id, displayName, localPath, JSON.stringify(config)], + ); + + const created = await fetchSource(engine, id); + if (!created) { + console.error(`Failed to create source "${id}" (conflict with existing id?)`); + process.exit(4); + } + const fed = isFederated(created.config); + console.log(`Created source "${id}"${displayName !== id ? ` (name: ${displayName})` : ''}${localPath ? ` → ${localPath}` : ''}`); + console.log(` federated: ${fed}${fed ? ' — appears in cross-source default search' : ' — only searched when explicitly named via --source'}`); +} + +// ── Subcommand: list ──────────────────────────────────────── + +async function runList(engine: BrainEngine, args: string[]): Promise { + const json = args.includes('--json'); + + const rows = await engine.executeRaw( + `SELECT id, name, local_path, last_commit, last_sync_at, config, created_at + FROM sources ORDER BY (id = 'default') DESC, id`, + ); + + const entries: SourceListEntry[] = []; + for (const r of rows) { + const pageCount = await countPages(engine, r.id); + entries.push({ + id: r.id, + name: r.name, + local_path: r.local_path, + federated: isFederated(r.config), + page_count: pageCount, + last_sync_at: r.last_sync_at ? new Date(r.last_sync_at).toISOString() : null, + }); + } + + if (json) { + console.log(JSON.stringify({ sources: entries }, null, 2)); + return; + } + + // Human-readable table. + console.log('SOURCES'); + console.log('───────'); + for (const e of entries) { + const fedMark = e.federated ? 'federated' : 'isolated'; + const pathStr = e.local_path ?? '(no local path)'; + const sync = e.last_sync_at ? 
`last sync ${e.last_sync_at}` : 'never synced'; + console.log(` ${e.id.padEnd(20)} ${fedMark.padEnd(10)} ${String(e.page_count).padStart(6)} pages ${sync}`); + if (e.local_path) console.log(` ${' '.repeat(22)}${pathStr}`); + } + if (entries.length === 0) console.log(' (no sources registered)'); +} + +// ── Subcommand: remove ────────────────────────────────────── + +async function runRemove(engine: BrainEngine, args: string[]): Promise { + const id = args[0]; + if (!id) { + console.error('Usage: gbrain sources remove [--yes] [--dry-run] [--keep-storage]'); + process.exit(2); + } + const yes = args.includes('--yes'); + const dryRun = args.includes('--dry-run'); + // NOTE: --keep-storage is accepted for forward compatibility but has no + // effect until Step 7 wires in explicit storage object deletion. + const _keepStorage = args.includes('--keep-storage'); + void _keepStorage; + + if (id === 'default') { + console.error('Error: cannot remove the "default" source (it backs the pre-v0.17 brain).'); + process.exit(3); + } + + const src = await fetchSource(engine, id); + if (!src) { + console.error(`Source "${id}" not found.`); + process.exit(4); + } + + const pageCount = await countPages(engine, id); + console.log(`Source "${id}" → ${pageCount} pages will be deleted (cascade).`); + + if (dryRun) { + console.log(`(dry-run; no side effects)`); + return; + } + + if (!yes) { + console.error(`Refusing to remove without --yes. 
Pass --yes to confirm.`); + process.exit(5); + } + + await engine.executeRaw(`DELETE FROM sources WHERE id = $1`, [id]); + console.log(`Removed source "${id}" (${pageCount} pages + dependent rows cascaded).`); +} + +// ── Subcommand: rename ────────────────────────────────────── + +async function runRename(engine: BrainEngine, args: string[]): Promise { + const id = args[0]; + const newName = args[1]; + if (!id || !newName) { + console.error('Usage: gbrain sources rename '); + process.exit(2); + } + const src = await fetchSource(engine, id); + if (!src) { + console.error(`Source "${id}" not found.`); + process.exit(4); + } + await engine.executeRaw(`UPDATE sources SET name = $1 WHERE id = $2`, [newName, id]); + console.log(`Renamed source "${id}" display: ${src.name} → ${newName} (id is immutable).`); +} + +// ── Subcommand: default ───────────────────────────────────── + +async function runDefault(engine: BrainEngine, args: string[]): Promise { + const id = args[0]; + if (!id) { + console.error('Usage: gbrain sources default '); + process.exit(2); + } + const src = await fetchSource(engine, id); + if (!src) { + console.error(`Source "${id}" not found.`); + process.exit(4); + } + // Stored in the config table (not sources.config, because it's a brain- + // level preference not a per-source setting). 
+ await engine.setConfig('sources.default', id); + console.log(`Default source set to "${id}".`); +} + +// ── Subcommand: attach / detach (CWD dotfile) ────────────── + +function runAttach(args: string[]): void { + const id = args[0]; + if (!id) { + console.error('Usage: gbrain sources attach '); + process.exit(2); + } + validateSourceId(id); + const dotfile = join(process.cwd(), '.gbrain-source'); + writeFileSync(dotfile, id + '\n', 'utf8'); + console.log(`Attached ${process.cwd()} to source "${id}" via .gbrain-source.`); + console.log(`Commands run from this directory (or any subdirectory) will default to this source.`); +} + +function runDetach(): void { + const dotfile = join(process.cwd(), '.gbrain-source'); + if (!existsSync(dotfile)) { + console.log(`No .gbrain-source file in ${process.cwd()}.`); + return; + } + unlinkSync(dotfile); + console.log(`Detached ${process.cwd()} (removed .gbrain-source).`); +} + +// ── Subcommand: federate / unfederate ─────────────────────── + +async function runFederate(engine: BrainEngine, args: string[], value: boolean): Promise { + const id = args[0]; + if (!id) { + console.error(`Usage: gbrain sources ${value ? 'federate' : 'unfederate'} `); + process.exit(2); + } + const src = await fetchSource(engine, id); + if (!src) { + console.error(`Source "${id}" not found.`); + process.exit(4); + } + const config = parseConfig(src.config); + config.federated = value; + await engine.executeRaw( + `UPDATE sources SET config = $1::jsonb WHERE id = $2`, + [JSON.stringify(config), id], + ); + console.log(`Source "${id}" is now ${value ? 
'federated (appears in cross-source default search)' : 'isolated (only searched when explicitly named)'}.`); +} + +// ── Dispatcher ────────────────────────────────────────────── + +export async function runSources(engine: BrainEngine, args: string[]): Promise { + const sub = args[0]; + const rest = args.slice(1); + + switch (sub) { + case 'add': return runAdd(engine, rest); + case 'list': return runList(engine, rest); + case 'remove': return runRemove(engine, rest); + case 'rename': return runRename(engine, rest); + case 'default': return runDefault(engine, rest); + case 'attach': runAttach(rest); return; + case 'detach': runDetach(); return; + case 'federate': return runFederate(engine, rest, true); + case 'unfederate': return runFederate(engine, rest, false); + case undefined: + case '--help': + case '-h': + printHelp(); + return; + default: + console.error(`Unknown sources subcommand: ${sub}`); + printHelp(); + process.exit(2); + } +} + +function printHelp(): void { + console.log(`gbrain sources — manage multi-source brain configuration (v0.18.0) + +Subcommands: + add --path

[--name ] [--federated|--no-federated] + Register a new source. + list [--json] List registered sources with page counts. + remove [--yes] [--dry-run] Cascade-delete a source and its pages. + rename Rename display name (id is immutable). + default Set the brain-level default source. + attach Write .gbrain-source in CWD (like kubectl context). + detach Remove .gbrain-source from CWD. + federate Make source appear in cross-source default search. + unfederate Isolate source from default search. + +Source id: [a-z0-9-]{1,32}. Immutable citation key. +`); +} diff --git a/src/commands/sync.ts b/src/commands/sync.ts index 47ae4b2..2fbac60 100644 --- a/src/commands/sync.ts +++ b/src/commands/sync.ts @@ -41,6 +41,14 @@ export interface SyncOpts { skipFailed?: boolean; /** Bug 9 — re-attempt unacknowledged failures explicitly (CLI --retry-failed). */ retryFailed?: boolean; + /** + * v0.18.0 Step 5 — sync a specific named source. When set, sync reads + * local_path + last_commit from the sources table (not the global + * config.sync.* keys) and writes last_commit + last_sync_at back to + * the same row. Backward compat: when undefined, sync uses the + * pre-v0.17 global-config path unchanged. + */ + sourceId?: string; } function git(repoPath: string, ...args: string[]): string { @@ -50,11 +58,60 @@ function git(repoPath: string, ...args: string[]): string { }).trim(); } +// v0.18.0 Step 5: source-scoped sync state helpers. When opts.sourceId +// is set, read/write the per-source row instead of the global config +// keys. These wrappers centralize the branch so every read/write site +// picks the right storage — future Step 5 work (failure-tracking per +// source) hooks here too. +async function readSyncAnchor( + engine: BrainEngine, + sourceId: string | undefined, + which: 'repo_path' | 'last_commit', +): Promise { + if (sourceId) { + const col = which === 'repo_path' ? 
'local_path' : 'last_commit'; + const rows = await engine.executeRaw<Record<string, string | null>>( + `SELECT ${col} AS value FROM sources WHERE id = $1`, + [sourceId], + ); + return rows[0]?.value ?? null; + } + return await engine.getConfig(`sync.${which}`); +} + +async function writeSyncAnchor( + engine: BrainEngine, + sourceId: string | undefined, + which: 'repo_path' | 'last_commit', + value: string, +): Promise<void> { + if (sourceId) { + const col = which === 'repo_path' ? 'local_path' : 'last_commit'; + // last_sync_at bookmarked on every last_commit advance. + if (which === 'last_commit') { + await engine.executeRaw( + `UPDATE sources SET last_commit = $1, last_sync_at = now() WHERE id = $2`, + [value, sourceId], + ); + } else { + await engine.executeRaw( + `UPDATE sources SET ${col} = $1 WHERE id = $2`, + [value, sourceId], + ); + } + return; + } + await engine.setConfig(`sync.${which}`, value); +} + export async function performSync(engine: BrainEngine, opts: SyncOpts): Promise { // Resolve repo path - const repoPath = opts.repoPath || await engine.getConfig('sync.repo_path'); + const repoPath = opts.repoPath || await readSyncAnchor(engine, opts.sourceId, 'repo_path'); if (!repoPath) { - throw new Error('No repo path specified. Use --repo or run gbrain init with --repo first.'); + const hint = opts.sourceId + ? `Source "${opts.sourceId}" has no local_path. Run: gbrain sources add ${opts.sourceId} --path ` + : `No repo path specified. Use --repo or run gbrain init with --repo first.`; + throw new Error(hint); } // Validate git repo @@ -84,8 +141,8 @@ export async function performSync(engine: BrainEngine, opts: SyncOpts): Promise< throw new Error(`No commits in repo ${repoPath}. Make at least one commit before syncing.`); } - // Read sync state - const lastCommit = opts.full ? null : await engine.getConfig('sync.last_commit'); + // Read sync state (source-scoped when sourceId is set, global otherwise) + const lastCommit = opts.full ? 
null : await readSyncAnchor(engine, opts.sourceId, 'last_commit'); // Ancestry validation: if lastCommit exists, verify it's still in history if (lastCommit) { @@ -175,7 +232,7 @@ export async function performSync(engine: BrainEngine, opts: SyncOpts): Promise< if (totalChanges === 0) { // Update sync state even with no syncable changes (git advanced) - await engine.setConfig('sync.last_commit', headCommit); + await writeSyncAnchor(engine, opts.sourceId, 'last_commit', headCommit); await engine.setConfig('sync.last_run', new Date().toISOString()); return { status: 'up_to_date', @@ -296,7 +353,7 @@ export async function performSync(engine: BrainEngine, opts: SyncOpts): Promise< ); // Update last_run + repo_path (progress on infra) but NOT last_commit. await engine.setConfig('sync.last_run', new Date().toISOString()); - await engine.setConfig('sync.repo_path', repoPath); + await writeSyncAnchor(engine, opts.sourceId, 'repo_path', repoPath); return { status: 'blocked_by_failures', fromCommit: lastCommit, @@ -318,10 +375,11 @@ export async function performSync(engine: BrainEngine, opts: SyncOpts): Promise< } } - // Update sync state AFTER all changes succeed - await engine.setConfig('sync.last_commit', headCommit); + // Update sync state AFTER all changes succeed (source-scoped when + // opts.sourceId is set, global config otherwise). 
+ await writeSyncAnchor(engine, opts.sourceId, 'last_commit', headCommit); await engine.setConfig('sync.last_run', new Date().toISOString()); - await engine.setConfig('sync.repo_path', repoPath); + await writeSyncAnchor(engine, opts.sourceId, 'repo_path', repoPath); // Log ingest await engine.logIngest({ @@ -423,7 +481,7 @@ async function performFullSync( `Fix the YAML in those files and re-run, or use '--skip-failed'.`, ); await engine.setConfig('sync.last_run', new Date().toISOString()); - await engine.setConfig('sync.repo_path', repoPath); + await writeSyncAnchor(engine, opts.sourceId, 'repo_path', repoPath); return { status: 'blocked_by_failures', fromCommit: null, @@ -439,10 +497,12 @@ async function performFullSync( if (acked > 0) console.error(` Acknowledged ${acked} failure(s) and advancing past them.`); } - // Persist sync state so next sync is incremental (C1 fix: was missing) - await engine.setConfig('sync.last_commit', headCommit); + // Persist sync state so next sync is incremental (C1 fix: was missing). + // v0.18.0 Step 5: routed through writeSyncAnchor so --source pins it + // to the right sources row rather than the global config. + await writeSyncAnchor(engine, opts.sourceId, 'last_commit', headCommit); await engine.setConfig('sync.last_run', new Date().toISOString()); - await engine.setConfig('sync.repo_path', repoPath); + await writeSyncAnchor(engine, opts.sourceId, 'repo_path', repoPath); // Full sync doesn't track pagesAffected, so fall back to embed --stale. // Before commit 2: runEmbed is void; use result.imported as best estimate of @@ -482,7 +542,17 @@ export async function runSync(engine: BrainEngine, args: string[]) { const skipFailed = args.includes('--skip-failed'); const retryFailed = args.includes('--retry-failed'); - const opts: SyncOpts = { repoPath, dryRun, full, noPull, noEmbed, skipFailed, retryFailed }; + // v0.18.0 Step 5: --source resolves to a sources(id) row. 
Falls back + // to pre-v0.17 global config (sync.repo_path + sync.last_commit) when + // no flag, no env, no dotfile is present. + const explicitSource = args.find((a, i) => args[i - 1] === '--source') || null; + let sourceId: string | undefined = undefined; + if (explicitSource || process.env.GBRAIN_SOURCE) { + const { resolveSourceId } = await import('../core/source-resolver.ts'); + sourceId = await resolveSourceId(engine, explicitSource); + } + + const opts: SyncOpts = { repoPath, dryRun, full, noPull, noEmbed, skipFailed, retryFailed, sourceId }; // Bug 9 — --retry-failed: before running normal sync, clear acknowledgment // flags so the sync picks them up as fresh work. The actual re-attempt diff --git a/src/core/engine.ts b/src/core/engine.ts index d932327..c7b3dee 100644 --- a/src/core/engine.ts +++ b/src/core/engine.ts @@ -28,6 +28,21 @@ export interface LinkBatchInput { origin_slug?: string; /** Frontmatter field name (e.g. 'key_people', 'investors'). */ origin_field?: string; + /** + * v0.18.0: source id for each endpoint. When omitted, the engine JOINs + * against `source_id='default'`. Pass explicit values when the edge + * lives in a non-default source OR crosses sources. + * + * Without these fields, the batch JOIN `pages.slug = v.from_slug` fans + * out across every source containing that slug, silently creating wrong + * edges in a multi-source brain. The source_id filter eliminates the + * fan-out. Origin pages (frontmatter provenance) get their own + * source_id so reconciliation can't delete edges from another source's + * frontmatter. + */ + from_source_id?: string; + to_source_id?: string; + origin_source_id?: string; } /** Input row for addTimelineEntriesBatch. Optional fields default to '' (matches NOT NULL DDL). */ @@ -37,6 +52,12 @@ export interface TimelineBatchInput { source?: string; summary: string; detail?: string; + /** + * v0.18.0: source id for the owning page. When omitted, the engine JOINs + * against `source_id='default'`. 
Without this, two pages sharing the + * same slug across sources would fan out timeline rows to both. + */ + source_id?: string; } /** Maximum results returned by search operations. Internal bulk operations (listPages) are not clamped. */ diff --git a/src/core/link-extraction.ts b/src/core/link-extraction.ts index 4a72120..921bbf5 100644 --- a/src/core/link-extraction.ts +++ b/src/core/link-extraction.ts @@ -24,8 +24,19 @@ export interface EntityRef { slug: string; /** Top-level directory ("people" | "companies" | etc.). */ dir: string; + /** + * v0.17.0: source id when the link was qualified as `[[source:slug]]`. + * `null` means unqualified — the caller resolves via local-first fallback + * at extraction time. Mirrors links.resolution_type: + * - sourceId set → 'qualified' + * - sourceId null → 'unqualified' + */ + sourceId?: string | null; } +/** v0.17.0: how a link's target source was pinned at extraction time. */ +export type LinkResolutionType = 'qualified' | 'unqualified'; + /** * Directory prefix whitelist. These are the top-level slug dirs the extractor * recognizes as entity references. Upstream canonical + our extensions: @@ -63,6 +74,23 @@ const WIKILINK_RE = new RegExp( 'g', ); +/** + * v0.17.0: qualified wikilink `[[source-id:dir/slug]]` or + * `[[source-id:dir/slug|Display Text]]`. The source-id segment pins the + * target to a specific sources(id) row, overriding the local-first + * fallback used by unqualified `[[slug]]` references. + * + * Captures: sourceId, slug (dir/...), displayName (optional). + * + * Matched BEFORE WIKILINK_RE so `[[wiki:topics/ai]]` isn't mis-parsed by + * the unqualified regex (the source prefix would not satisfy DIR_PATTERN + * anyway, but the two-pass approach keeps intent crystal-clear). 
+ */ +const QUALIFIED_WIKILINK_RE = new RegExp( + `\\[\\[([a-z0-9](?:[a-z0-9-]{0,30}[a-z0-9])?):(${DIR_PATTERN}\\/[^|\\]#]+?)(?:#[^|\\]]*?)?(?:\\|([^\\]]+?))?\\]\\]`, + 'g', +); + /** * Strip fenced code blocks (```...```) and inline code (`...`) from markdown, * replacing them with whitespace of equivalent length. Preserves byte offsets @@ -112,6 +140,9 @@ export function extractEntityRefs(content: string): EntityRef[] { let match: RegExpExecArray | null; // 1. Markdown links: [Name](path) + // Markdown links have no source-qualification syntax — they're + // always unqualified. Omit sourceId so the shape stays compatible + // with pre-v0.17 consumers doing strict equality. const mdPattern = new RegExp(ENTITY_REF_RE.source, ENTITY_REF_RE.flags); while ((match = mdPattern.exec(stripped)) !== null) { const name = match[1]; @@ -121,9 +152,28 @@ export function extractEntityRefs(content: string): EntityRef[] { refs.push({ name, slug, dir }); } - // 2. Obsidian wikilinks: [[path]] or [[path|Display Text]] + // 2a. v0.17.0 qualified wikilinks: [[source-id:path]] or [[source-id:path|Display]] + // Must run BEFORE the unqualified pass or we'd double-emit. We also + // mask out the matched spans so pass 2b can't grab them. + const qualifiedRanges: Array<[number, number]> = []; + const qualPattern = new RegExp(QUALIFIED_WIKILINK_RE.source, QUALIFIED_WIKILINK_RE.flags); + while ((match = qualPattern.exec(stripped)) !== null) { + const sourceId = match[1]; + let slug = match[2].trim(); + if (!slug) continue; + if (slug.includes('://')) continue; + if (slug.endsWith('.md')) slug = slug.slice(0, -3); + const displayName = (match[3] || slug).trim(); + const dir = slug.split('/')[0]; + refs.push({ name: displayName, slug, dir, sourceId }); + qualifiedRanges.push([match.index, match.index + match[0].length]); + } + + // 2b. Unqualified Obsidian wikilinks: [[path]] or [[path|Display Text]] + // Same shape rule: omit sourceId when unqualified. 
+ const unmasked = maskRanges(stripped, qualifiedRanges); const wikiPattern = new RegExp(WIKILINK_RE.source, WIKILINK_RE.flags); - while ((match = wikiPattern.exec(stripped)) !== null) { + while ((match = wikiPattern.exec(unmasked)) !== null) { let slug = match[1].trim(); if (!slug) continue; if (slug.includes('://')) continue; @@ -136,6 +186,20 @@ export function extractEntityRefs(content: string): EntityRef[] { return refs; } +/** + * Replace the byte ranges with spaces, preserving offsets. Used by + * extractEntityRefs to prevent the unqualified wikilink regex from + * matching inside a qualified wikilink span. + */ +function maskRanges(content: string, ranges: Array<[number, number]>): string { + if (ranges.length === 0) return content; + const chars = content.split(''); + for (const [s, e] of ranges) { + for (let i = s; i < e && i < chars.length; i++) chars[i] = ' '; + } + return chars.join(''); +} + // ─── Link candidates (richer than EntityRef) ──────────────────── export interface LinkCandidate { diff --git a/src/core/migrate.ts b/src/core/migrate.ts index 58efe3b..12baef5 100644 --- a/src/core/migrate.ts +++ b/src/core/migrate.ts @@ -449,6 +449,201 @@ export const MIGRATIONS: Migration[] = [ } }, }, + { + version: 23, + name: 'files_source_id_page_id_ledger', + // v0.18.0 Step 7 (Lane E) — additive only: adds files.source_id and + // files.page_id columns + creates the file_migration_ledger that + // drives phase-B storage object rewrites. Does NOT drop page_slug + // yet (kept for backward compat; a later release cleans up once the + // page_id FK is proven). PGLite has no files table, so this + // migration is Postgres-only via a handler gate. + // + // Ledger PK is file_id (not storage_path_old) — two sources CAN + // share an old path during migration, so a composite would be + // wrong. Codex second-pass review caught this. 
+ // + // State machine per row: + // pending → copy_done → db_updated → complete + // any state → failed (with error detail) + // + // Phase B in the v0_18_0 orchestrator processes `status != complete` + // rows. Re-runnable: resumes from whichever state it stopped in. + sql: '', + handler: async (engine) => { + if (engine.kind === 'pglite') return; + await engine.runMigration(19, ` + -- 1a. source_id with DEFAULT 'default' (idempotent) + ALTER TABLE files ADD COLUMN IF NOT EXISTS source_id TEXT + NOT NULL DEFAULT 'default' REFERENCES sources(id) ON DELETE CASCADE; + CREATE INDEX IF NOT EXISTS idx_files_source_id ON files(source_id); + + -- 1b. page_id (nullable; pre-v0.17 files pointed at page_slug + -- which was ON DELETE SET NULL, so we keep the same nullable + -- semantic — orphaned files are legal). + ALTER TABLE files ADD COLUMN IF NOT EXISTS page_id INTEGER + REFERENCES pages(id) ON DELETE SET NULL; + CREATE INDEX IF NOT EXISTS idx_files_page_id ON files(page_id); + `); + + await engine.runMigration(19, ` + -- 1c. Backfill page_id from existing page_slug. Scoped to + -- source_id='default' because pre-v0.17 pages ALL lived in + -- the default source. Without this scope, after new sources + -- get added mid-migration, the JOIN could hit the wrong + -- page (different source, same slug). + UPDATE files f + SET page_id = p.id + FROM pages p + WHERE f.page_slug = p.slug + AND p.source_id = 'default' + AND f.page_id IS NULL; + `); + + await engine.runMigration(19, ` + -- 2. file_migration_ledger — drives the storage object rewrite + -- in the v0_18_0 orchestrator's phase B. Seeded from current + -- files rows; re-seed is idempotent via NOT EXISTS guard. 
+ CREATE TABLE IF NOT EXISTS file_migration_ledger ( + file_id INTEGER PRIMARY KEY REFERENCES files(id) ON DELETE CASCADE, + storage_path_old TEXT NOT NULL, + storage_path_new TEXT NOT NULL, + status TEXT NOT NULL DEFAULT 'pending', + error TEXT, + updated_at TIMESTAMPTZ NOT NULL DEFAULT now(), + CONSTRAINT chk_ledger_status CHECK (status IN ('pending','copy_done','db_updated','complete','failed')) + ); + CREATE INDEX IF NOT EXISTS idx_file_migration_ledger_status + ON file_migration_ledger(status) WHERE status != 'complete'; + + -- Seed the ledger with every existing file. New path prefixes + -- source_id so multi-source can land assets under their own + -- bucket path without collision. + INSERT INTO file_migration_ledger (file_id, storage_path_old, storage_path_new, status) + SELECT + f.id, + f.storage_path, + COALESCE(f.source_id, 'default') || '/' || f.storage_path, + 'pending' + FROM files f + WHERE NOT EXISTS ( + SELECT 1 FROM file_migration_ledger l WHERE l.file_id = f.id + ); + `); + }, + }, + { + version: 22, + name: 'links_resolution_type', + // v0.18.0 Step 4 (Lane B) — adds links.resolution_type column so + // each edge records whether its target source was pinned at + // extraction time via `[[source:slug]]` (qualified) or resolved + // via local-first fallback (unqualified). Unqualified edges are + // candidates for re-resolution via `gbrain extract + // --refresh-unqualified` when the source topology changes. + // + // Nullable because legacy edges (pre-v0.17) have no resolution + // concept. `frontmatter` and `manual` edges remain NULL — they're + // not subject to staleness under source churn. 
+ sql: ` + ALTER TABLE links ADD COLUMN IF NOT EXISTS resolution_type TEXT; + DO $$ BEGIN + IF NOT EXISTS ( + SELECT 1 FROM pg_constraint WHERE conname = 'links_resolution_type_check' + ) THEN + ALTER TABLE links ADD CONSTRAINT links_resolution_type_check + CHECK (resolution_type IS NULL OR resolution_type IN ('qualified', 'unqualified')); + END IF; + END $$; + `, + }, + { + version: 21, + name: 'pages_source_id_composite_unique', + // v0.18.0 Step 2 (Lane B) — adds pages.source_id with DEFAULT 'default' + // and swaps the global UNIQUE(slug) for the composite UNIQUE(source_id, + // slug). Lands alongside the engine SQL rewrite that makes every + // ON CONFLICT (slug) → ON CONFLICT (source_id, slug) so the constraint + // swap is atomic with the code that writes under it. + // + // DEFAULT 'default' is load-bearing: closes the Codex-flagged race + // where an INSERT between ADD COLUMN and SET NOT NULL could leave + // source_id NULL. Because the default already references a valid + // sources row (seeded in v20), new INSERTs immediately get a valid FK. + // + // Idempotent: IF NOT EXISTS on ADD COLUMN, DROP IF EXISTS on the old + // constraint, DO block guard on the new constraint creation. + sql: ` + ALTER TABLE pages ADD COLUMN IF NOT EXISTS source_id TEXT + NOT NULL DEFAULT 'default' REFERENCES sources(id) ON DELETE CASCADE; + + CREATE INDEX IF NOT EXISTS idx_pages_source_id ON pages(source_id); + + -- Swap global UNIQUE(slug) → composite UNIQUE(source_id, slug). The + -- original constraint is named pages_slug_key by Postgres convention + -- when the column was declared UNIQUE inline. The drop and the + -- guarded add are both idempotent.
+ ALTER TABLE pages DROP CONSTRAINT IF EXISTS pages_slug_key; + DO $$ BEGIN + IF NOT EXISTS ( + SELECT 1 FROM pg_constraint WHERE conname = 'pages_source_slug_key' + ) THEN + ALTER TABLE pages ADD CONSTRAINT pages_source_slug_key + UNIQUE (source_id, slug); + END IF; + END $$; + `, + }, + { + version: 20, + name: 'sources_table_additive', + // v0.18.0 Step 1 (Lane A) — **additive only** so Step 1 is a safe + // standalone commit. This migration installs the sources primitive + // WITHOUT breaking the engine's existing ON CONFLICT (slug) upserts. + // + // What this migration does now: + // - CREATE sources table + // - INSERT default source (federated=true, inherits sync.repo_path + // and sync.last_commit from config so post-upgrade identity is + // preserved) + // + // What this migration does NOT do yet (deferred to v21-v23; v21 ships + // with the Step 2 engine rewrite, so the constraint swap lands atomically): + // - ALTER pages ADD source_id + // - DROP UNIQUE(slug) + ADD UNIQUE(source_id, slug) + // - files.page_slug → page_id rewrite + // - file_migration_ledger + // - links.resolution_type + // + // The v0.18.0 orchestrator's phaseCVerify allows this split: it + // checks for sources('default'), but the "composite UNIQUE" + + // "pages.source_id NOT NULL" assertions only run after v21 lands. + // + // Idempotent via IF NOT EXISTS. Safe to re-run. + sql: ` + CREATE TABLE IF NOT EXISTS sources ( + id TEXT PRIMARY KEY, + name TEXT NOT NULL UNIQUE, + local_path TEXT, + last_commit TEXT, + last_sync_at TIMESTAMPTZ, + config JSONB NOT NULL DEFAULT '{}'::jsonb, + created_at TIMESTAMPTZ NOT NULL DEFAULT now() + ); + + -- Seed 'default' source, inheriting the existing sync.repo_path / + -- sync.last_commit config values. federated=true for backward compat. + -- Pre-v0.18 brains behave exactly as before.
+ INSERT INTO sources (id, name, local_path, last_commit, config) + SELECT + 'default', + 'default', + (SELECT value FROM config WHERE key = 'sync.repo_path'), + (SELECT value FROM config WHERE key = 'sync.last_commit'), + '{"federated": true}'::jsonb + WHERE NOT EXISTS (SELECT 1 FROM sources WHERE id = 'default'); + `, + }, { version: 15, name: 'minion_jobs_max_stalled_default_5', @@ -502,8 +697,14 @@ export async function runMigrations(engine: BrainEngine): Promise<{ applied: num const currentStr = await engine.getConfig('version'); const current = parseInt(currentStr || '1', 10); + // Sort by version ascending so array insertion order doesn't affect + // correctness. Migrations MUST run in version order; if v16 accidentally + // precedes v15 in MIGRATIONS, setConfig(version, 16) would cause v15 to + // be skipped on the next iteration. + const sorted = [...MIGRATIONS].sort((a, b) => a.version - b.version); + let applied = 0; - for (const m of MIGRATIONS) { + for (const m of sorted) { if (m.version > current) { // Pick SQL: engine-specific `sqlFor` wins over engine-agnostic `sql`. const sql = m.sqlFor?.[engine.kind] ?? m.sql; diff --git a/src/core/pglite-engine.ts b/src/core/pglite-engine.ts index a495408..a3ebf3b 100644 --- a/src/core/pglite-engine.ts +++ b/src/core/pglite-engine.ts @@ -116,10 +116,15 @@ export class PGLiteEngine implements BrainEngine { const hash = page.content_hash || contentHash(page); const frontmatter = page.frontmatter || {}; + // v0.18.0 Step 2: source_id relies on the schema DEFAULT 'default' so + // existing callers still target the default source without threading + // a parameter. ON CONFLICT target becomes (source_id, slug) since the + // global UNIQUE(slug) was dropped in migration v21. Step 5+ will + // surface an explicit sourceId param on putPage for multi-source sync.
const { rows } = await this.db.query( `INSERT INTO pages (slug, type, title, compiled_truth, timeline, frontmatter, content_hash, updated_at) VALUES ($1, $2, $3, $4, $5, $6::jsonb, $7, now()) - ON CONFLICT (slug) DO UPDATE SET + ON CONFLICT (source_id, slug) DO UPDATE SET type = EXCLUDED.type, title = EXCLUDED.title, compiled_truth = EXCLUDED.compiled_truth, @@ -205,7 +210,7 @@ export class PGLiteEngine implements BrainEngine { const { rows } = await this.db.query( `SELECT - p.slug, p.id as page_id, p.title, p.type, + p.slug, p.id as page_id, p.title, p.type, p.source_id, cc.id as chunk_id, cc.chunk_index, cc.chunk_text, cc.chunk_source, ts_rank(p.search_vector, websearch_to_tsquery('english', $1)) AS score, CASE WHEN p.updated_at < ( @@ -235,7 +240,7 @@ export class PGLiteEngine implements BrainEngine { const { rows } = await this.db.query( `SELECT - p.slug, p.id as page_id, p.title, p.type, + p.slug, p.id as page_id, p.title, p.type, p.source_id, cc.id as chunk_id, cc.chunk_index, cc.chunk_text, cc.chunk_source, 1 - (cc.embedding <=> $1::vector) AS score, CASE WHEN p.updated_at < ( @@ -370,8 +375,14 @@ export class PGLiteEngine implements BrainEngine { async addLinksBatch(links: LinkBatchInput[]): Promise { if (links.length === 0) return 0; - // unnest() pattern: 7 array-typed bound parameters regardless of batch size. - // Same shape as PostgresEngine (v0.13). Avoids the 65535-parameter cap. + // unnest() pattern: 10 array-typed bound parameters regardless of batch + // size. Same shape as PostgresEngine (v0.18). Avoids the 65535-parameter + // cap. + // + // v0.18.0: every JOIN composite-keys on (slug, source_id) so the batch + // can't fan out across sources when the same slug exists in multiple + // sources. Origin JOIN uses LEFT JOIN on a composite key — NULL + // origin_slug leaves origin_page_id NULL, same as pre-v0.18. 
const fromSlugs = links.map(l => l.from_slug); const toSlugs = links.map(l => l.to_slug); const linkTypes = links.map(l => l.link_type || ''); @@ -379,17 +390,20 @@ export class PGLiteEngine implements BrainEngine { const linkSources = links.map(l => l.link_source || 'markdown'); const originSlugs = links.map(l => l.origin_slug || null); const originFields = links.map(l => l.origin_field || null); + const fromSourceIds = links.map(l => l.from_source_id || 'default'); + const toSourceIds = links.map(l => l.to_source_id || 'default'); + const originSourceIds = links.map(l => l.origin_source_id || 'default'); const result = await this.db.query( `INSERT INTO links (from_page_id, to_page_id, link_type, context, link_source, origin_page_id, origin_field) SELECT f.id, t.id, v.link_type, v.context, v.link_source, o.id, v.origin_field - FROM unnest($1::text[], $2::text[], $3::text[], $4::text[], $5::text[], $6::text[], $7::text[]) - AS v(from_slug, to_slug, link_type, context, link_source, origin_slug, origin_field) - JOIN pages f ON f.slug = v.from_slug - JOIN pages t ON t.slug = v.to_slug - LEFT JOIN pages o ON o.slug = v.origin_slug + FROM unnest($1::text[], $2::text[], $3::text[], $4::text[], $5::text[], $6::text[], $7::text[], $8::text[], $9::text[], $10::text[]) + AS v(from_slug, to_slug, link_type, context, link_source, origin_slug, origin_field, from_source_id, to_source_id, origin_source_id) + JOIN pages f ON f.slug = v.from_slug AND f.source_id = v.from_source_id + JOIN pages t ON t.slug = v.to_slug AND t.source_id = v.to_source_id + LEFT JOIN pages o ON o.slug = v.origin_slug AND o.source_id = v.origin_source_id ON CONFLICT (from_page_id, to_page_id, link_type, link_source, origin_page_id) DO NOTHING RETURNING 1`, - [fromSlugs, toSlugs, linkTypes, contexts, linkSources, originSlugs, originFields] + [fromSlugs, toSlugs, linkTypes, contexts, linkSources, originSlugs, originFields, fromSourceIds, toSourceIds, originSourceIds] ); return result.rows.length; } @@ 
-724,22 +738,21 @@ export class PGLiteEngine implements BrainEngine { async addTimelineEntriesBatch(entries: TimelineBatchInput[]): Promise { if (entries.length === 0) return 0; - // unnest() pattern: 5 array-typed bound parameters regardless of batch size. const slugs = entries.map(e => e.slug); const dates = entries.map(e => e.date); - // Normalize optional fields to '' to match per-row addTimelineEntry + NOT NULL DDL. const sources = entries.map(e => e.source || ''); const summaries = entries.map(e => e.summary); const details = entries.map(e => e.detail || ''); + const sourceIds = entries.map(e => e.source_id || 'default'); const result = await this.db.query( `INSERT INTO timeline_entries (page_id, date, source, summary, detail) SELECT p.id, v.date::date, v.source, v.summary, v.detail - FROM unnest($1::text[], $2::text[], $3::text[], $4::text[], $5::text[]) - AS v(slug, date, source, summary, detail) - JOIN pages p ON p.slug = v.slug + FROM unnest($1::text[], $2::text[], $3::text[], $4::text[], $5::text[], $6::text[]) + AS v(slug, date, source, summary, detail, source_id) + JOIN pages p ON p.slug = v.slug AND p.source_id = v.source_id ON CONFLICT (page_id, date, summary) DO NOTHING RETURNING 1`, - [slugs, dates, sources, summaries, details] + [slugs, dates, sources, summaries, details, sourceIds] ); return result.rows.length; } diff --git a/src/core/pglite-schema.ts b/src/core/pglite-schema.ts index 5f03b6a..d2ef3ce 100644 --- a/src/core/pglite-schema.ts +++ b/src/core/pglite-schema.ts @@ -19,12 +19,33 @@ export const PGLITE_SCHEMA_SQL = ` CREATE EXTENSION IF NOT EXISTS vector; CREATE EXTENSION IF NOT EXISTS pg_trgm; +-- ============================================================ +-- sources: multi-brain tenancy (v0.18.0). See src/schema.sql for design notes. 
+-- ============================================================ +CREATE TABLE IF NOT EXISTS sources ( + id TEXT PRIMARY KEY, + name TEXT NOT NULL UNIQUE, + local_path TEXT, + last_commit TEXT, + last_sync_at TIMESTAMPTZ, + config JSONB NOT NULL DEFAULT '{}'::jsonb, + created_at TIMESTAMPTZ NOT NULL DEFAULT now() +); + +INSERT INTO sources (id, name, config) + VALUES ('default', 'default', '{"federated": true}'::jsonb) + ON CONFLICT (id) DO NOTHING; + -- ============================================================ -- pages: the core content table -- ============================================================ +-- v0.18.0 (Step 2): source_id scopes each page. Slugs are unique per +-- source — see src/schema.sql for the design notes. CREATE TABLE IF NOT EXISTS pages ( id SERIAL PRIMARY KEY, - slug TEXT NOT NULL UNIQUE, + source_id TEXT NOT NULL DEFAULT 'default' + REFERENCES sources(id) ON DELETE CASCADE, + slug TEXT NOT NULL, type TEXT NOT NULL, title TEXT NOT NULL, compiled_truth TEXT NOT NULL DEFAULT '', @@ -32,12 +53,14 @@ CREATE TABLE IF NOT EXISTS pages ( frontmatter JSONB NOT NULL DEFAULT '{}', content_hash TEXT, created_at TIMESTAMPTZ NOT NULL DEFAULT now(), - updated_at TIMESTAMPTZ NOT NULL DEFAULT now() + updated_at TIMESTAMPTZ NOT NULL DEFAULT now(), + CONSTRAINT pages_source_slug_key UNIQUE (source_id, slug) ); CREATE INDEX IF NOT EXISTS idx_pages_type ON pages(type); CREATE INDEX IF NOT EXISTS idx_pages_frontmatter ON pages USING GIN(frontmatter); CREATE INDEX IF NOT EXISTS idx_pages_trgm ON pages USING GIN(title gin_trgm_ops); +CREATE INDEX IF NOT EXISTS idx_pages_source_id ON pages(source_id); -- ============================================================ -- content_chunks: chunked content with embeddings @@ -72,6 +95,8 @@ CREATE TABLE IF NOT EXISTS links ( link_source TEXT CHECK (link_source IS NULL OR link_source IN ('markdown', 'frontmatter', 'manual')), origin_page_id INTEGER REFERENCES pages(id) ON DELETE SET NULL, origin_field TEXT, + -- v0.18.0 
Step 4: see src/schema.sql. + resolution_type TEXT CHECK (resolution_type IS NULL OR resolution_type IN ('qualified', 'unqualified')), created_at TIMESTAMPTZ NOT NULL DEFAULT now(), CONSTRAINT links_from_to_type_source_origin_unique UNIQUE NULLS NOT DISTINCT (from_page_id, to_page_id, link_type, link_source, origin_page_id) @@ -141,7 +166,7 @@ CREATE TABLE IF NOT EXISTS page_versions ( CREATE INDEX IF NOT EXISTS idx_versions_page ON page_versions(page_id); -- ============================================================ --- ingest_log +-- ingest_log (v0.18.0 Step 1: source_id deferred to v17, see src/schema.sql) -- ============================================================ CREATE TABLE IF NOT EXISTS ingest_log ( id SERIAL PRIMARY KEY, diff --git a/src/core/postgres-engine.ts b/src/core/postgres-engine.ts index 8a7e3fa..8b5b295 100644 --- a/src/core/postgres-engine.ts +++ b/src/core/postgres-engine.ts @@ -115,10 +115,14 @@ export class PostgresEngine implements BrainEngine { const hash = page.content_hash || contentHash(page); const frontmatter = page.frontmatter || {}; + // v0.18.0 Step 2: source_id relies on schema DEFAULT 'default'. ON + // CONFLICT target becomes (source_id, slug) since global UNIQUE(slug) + // was dropped in migration v21. See pglite-engine.ts for matching + // notes; multi-source sync (Step 5) will surface an explicit sourceId.
const rows = await sql` INSERT INTO pages (slug, type, title, compiled_truth, timeline, frontmatter, content_hash, updated_at) VALUES (${slug}, ${page.type}, ${page.title}, ${page.compiled_truth}, ${page.timeline || ''}, ${sql.json(frontmatter as Parameters[0])}, ${hash}, now()) - ON CONFLICT (slug) DO UPDATE SET + ON CONFLICT (source_id, slug) DO UPDATE SET type = EXCLUDED.type, title = EXCLUDED.title, compiled_truth = EXCLUDED.compiled_truth, @@ -262,7 +266,7 @@ export class PostgresEngine implements BrainEngine { await sql`SET LOCAL statement_timeout = '8s'`; return await sql` SELECT - p.slug, p.id as page_id, p.title, p.type, + p.slug, p.id as page_id, p.title, p.type, p.source_id, cc.id as chunk_id, cc.chunk_index, cc.chunk_text, cc.chunk_source, 1 - (cc.embedding <=> ${vecStr}::vector) AS score, false AS stale @@ -422,17 +426,21 @@ export class PostgresEngine implements BrainEngine { const linkSources = links.map(l => l.link_source || 'markdown'); const originSlugs = links.map(l => l.origin_slug || null); const originFields = links.map(l => l.origin_field || null); + const fromSourceIds = links.map(l => l.from_source_id || 'default'); + const toSourceIds = links.map(l => l.to_source_id || 'default'); + const originSourceIds = links.map(l => l.origin_source_id || 'default'); const result = await sql` INSERT INTO links (from_page_id, to_page_id, link_type, context, link_source, origin_page_id, origin_field) SELECT f.id, t.id, v.link_type, v.context, v.link_source, o.id, v.origin_field FROM unnest( ${fromSlugs}::text[], ${toSlugs}::text[], ${linkTypes}::text[], ${contexts}::text[], ${linkSources}::text[], ${originSlugs}::text[], - ${originFields}::text[] - ) AS v(from_slug, to_slug, link_type, context, link_source, origin_slug, origin_field) - JOIN pages f ON f.slug = v.from_slug - JOIN pages t ON t.slug = v.to_slug - LEFT JOIN pages o ON o.slug = v.origin_slug + ${originFields}::text[], ${fromSourceIds}::text[], ${toSourceIds}::text[], + 
${originSourceIds}::text[] + ) AS v(from_slug, to_slug, link_type, context, link_source, origin_slug, origin_field, from_source_id, to_source_id, origin_source_id) + JOIN pages f ON f.slug = v.from_slug AND f.source_id = v.from_source_id + JOIN pages t ON t.slug = v.to_slug AND t.source_id = v.to_source_id + LEFT JOIN pages o ON o.slug = v.origin_slug AND o.source_id = v.origin_source_id ON CONFLICT (from_page_id, to_page_id, link_type, link_source, origin_page_id) DO NOTHING RETURNING 1 `; @@ -775,19 +783,18 @@ export class PostgresEngine implements BrainEngine { async addTimelineEntriesBatch(entries: TimelineBatchInput[]): Promise { if (entries.length === 0) return 0; const sql = this.sql; - // unnest() pattern: 5 array-typed bound parameters regardless of batch size. const slugs = entries.map(e => e.slug); const dates = entries.map(e => e.date); - // Normalize optional fields to '' to match per-row addTimelineEntry + NOT NULL DDL. const sources = entries.map(e => e.source || ''); const summaries = entries.map(e => e.summary); const details = entries.map(e => e.detail || ''); + const sourceIds = entries.map(e => e.source_id || 'default'); const result = await sql` INSERT INTO timeline_entries (page_id, date, source, summary, detail) SELECT p.id, v.date::date, v.source, v.summary, v.detail - FROM unnest(${slugs}::text[], ${dates}::text[], ${sources}::text[], ${summaries}::text[], ${details}::text[]) - AS v(slug, date, source, summary, detail) - JOIN pages p ON p.slug = v.slug + FROM unnest(${slugs}::text[], ${dates}::text[], ${sources}::text[], ${summaries}::text[], ${details}::text[], ${sourceIds}::text[]) + AS v(slug, date, source, summary, detail, source_id) + JOIN pages p ON p.slug = v.slug AND p.source_id = v.source_id ON CONFLICT (page_id, date, summary) DO NOTHING RETURNING 1 `; diff --git a/src/core/schema-embedded.ts b/src/core/schema-embedded.ts index 3659e4d..3096794 100644 --- a/src/core/schema-embedded.ts +++ b/src/core/schema-embedded.ts @@ -9,12 
+9,55 @@ CREATE EXTENSION IF NOT EXISTS pg_trgm; -- gen_random_uuid() is core in Postgres 13+; enable pgcrypto as fallback for older versions CREATE EXTENSION IF NOT EXISTS pgcrypto; +-- ============================================================ +-- sources: multi-repo / multi-brain tenancy (v0.18.0) +-- ============================================================ +-- A source is a logical brain-within-the-DB: wiki, gstack, yc-media, etc. +-- Every page/file/ingest_log row carries source_id. +-- +-- id: immutable citation key. [a-z0-9-]{1,32} enforced at app layer. +-- Used in [source:slug] citations, --source flag, wikilink syntax. +-- name: mutable display label. Rename via \`gbrain sources rename\`. +-- local_path: optional git checkout root for filesystem-backed sources. +-- config: forward-compat JSONB. Currently used for federation + ACL slot. +-- { "federated": bool, "access_policy": {...} } +-- - federated=true (or missing-but-explicit on 'default'): +-- participates in cross-source default search. +-- - federated=false (default for new sources): +-- only searched when explicitly named via --source. +-- - access_policy: forward-compat slot, no enforcement in v0.18. +-- Write-side lockdown: mutated only when ctx.remote=false. +CREATE TABLE IF NOT EXISTS sources ( + id TEXT PRIMARY KEY, + name TEXT NOT NULL UNIQUE, + local_path TEXT, + last_commit TEXT, + last_sync_at TIMESTAMPTZ, + config JSONB NOT NULL DEFAULT '{}'::jsonb, + created_at TIMESTAMPTZ NOT NULL DEFAULT now() +); + +-- Seed the default source. 'default' is federated=true for backward compat +-- (pre-v0.18 brains behave exactly as before — every page appears in search). +-- Pre-existing sync.repo_path / sync.last_commit are copied in by the v20 +-- migration, not here; fresh installs have no local_path until \`sources add\` +-- or the first \`sync\`.
+INSERT INTO sources (id, name, config) + VALUES ('default', 'default', '{"federated": true}'::jsonb) + ON CONFLICT (id) DO NOTHING; + -- ============================================================ -- pages: the core content table -- ============================================================ +-- v0.18.0 (Step 2): pages.source_id scopes each row to a sources(id) row. +-- Slugs are unique per source, NOT globally. The default source is +-- seeded in the sources block above so the DEFAULT 'default' FK is +-- always valid at INSERT time. CREATE TABLE IF NOT EXISTS pages ( id SERIAL PRIMARY KEY, - slug TEXT NOT NULL UNIQUE, + source_id TEXT NOT NULL DEFAULT 'default' + REFERENCES sources(id) ON DELETE CASCADE, + slug TEXT NOT NULL, type TEXT NOT NULL, title TEXT NOT NULL, compiled_truth TEXT NOT NULL DEFAULT '', @@ -22,7 +65,8 @@ CREATE TABLE IF NOT EXISTS pages ( frontmatter JSONB NOT NULL DEFAULT '{}', content_hash TEXT, created_at TIMESTAMPTZ NOT NULL DEFAULT now(), - updated_at TIMESTAMPTZ NOT NULL DEFAULT now() + updated_at TIMESTAMPTZ NOT NULL DEFAULT now(), + CONSTRAINT pages_source_slug_key UNIQUE (source_id, slug) ); CREATE INDEX IF NOT EXISTS idx_pages_type ON pages(type); @@ -30,6 +74,8 @@ CREATE INDEX IF NOT EXISTS idx_pages_frontmatter ON pages USING GIN(frontmatter) CREATE INDEX IF NOT EXISTS idx_pages_trgm ON pages USING GIN(title gin_trgm_ops); -- v0.13.1 #170: avoids 14.6s seqscan on large brains when listing pages newest-first. CREATE INDEX IF NOT EXISTS idx_pages_updated_at_desc ON pages (updated_at DESC); +-- v0.18.0: source-scoped scans (per /plan-eng-review Section 4). 
+CREATE INDEX IF NOT EXISTS idx_pages_source_id ON pages(source_id); -- ============================================================ -- content_chunks: chunked content with embeddings @@ -74,6 +120,11 @@ CREATE TABLE IF NOT EXISTS links ( link_source TEXT CHECK (link_source IS NULL OR link_source IN ('markdown', 'frontmatter', 'manual')), origin_page_id INTEGER REFERENCES pages(id) ON DELETE SET NULL, origin_field TEXT, + -- v0.18.0 Step 4: 'qualified' when the link was written as + -- [[source:slug]] (target source pinned). 'unqualified' when written + -- as bare [[slug]] and resolved via local-first fallback at + -- extraction time. NULL for legacy/manual/frontmatter edges. + resolution_type TEXT CHECK (resolution_type IS NULL OR resolution_type IN ('qualified', 'unqualified')), created_at TIMESTAMPTZ NOT NULL DEFAULT now(), -- NULLS NOT DISTINCT (PG15+) so two rows with link_source IS NULL or -- origin_page_id IS NULL collide as expected. Without this, every row with @@ -148,6 +199,9 @@ CREATE INDEX IF NOT EXISTS idx_versions_page ON page_versions(page_id); -- ============================================================ -- ingest_log -- ============================================================ +-- NOTE (v0.18.0 Step 1): ingest_log.source_id is NOT added yet — lands +-- in v17 alongside the sync rewrite (Step 5), which starts writing +-- source-scoped entries. CREATE TABLE IF NOT EXISTS ingest_log ( id SERIAL PRIMARY KEY, source_type TEXT NOT NULL, @@ -202,9 +256,18 @@ CREATE TABLE IF NOT EXISTS mcp_request_log ( -- ============================================================ -- files: binary attachments stored in Supabase Storage -- ============================================================ +-- v0.18.0 Step 7: files gains source_id + page_id alongside the +-- legacy page_slug (kept for backward compat until a later release). +-- The file_migration_ledger below drives the storage object rewrite. 
+-- page_slug FK had ON UPDATE CASCADE — removed because slugs are no +-- longer global (composite UNIQUE) so CASCADE on-update is ambiguous. +-- ON DELETE SET NULL is preserved via both page_slug and page_id. CREATE TABLE IF NOT EXISTS files ( id SERIAL PRIMARY KEY, - page_slug TEXT REFERENCES pages(slug) ON DELETE SET NULL ON UPDATE CASCADE, + source_id TEXT NOT NULL DEFAULT 'default' + REFERENCES sources(id) ON DELETE CASCADE, + page_slug TEXT, + page_id INTEGER REFERENCES pages(id) ON DELETE SET NULL, filename TEXT NOT NULL, storage_path TEXT NOT NULL, mime_type TEXT, @@ -219,8 +282,30 @@ CREATE TABLE IF NOT EXISTS files ( ALTER TABLE files DROP COLUMN IF EXISTS storage_url; CREATE INDEX IF NOT EXISTS idx_files_page ON files(page_slug); +CREATE INDEX IF NOT EXISTS idx_files_page_id ON files(page_id); +CREATE INDEX IF NOT EXISTS idx_files_source_id ON files(source_id); CREATE INDEX IF NOT EXISTS idx_files_hash ON files(content_hash); +-- ============================================================ +-- file_migration_ledger (v0.18.0 Step 7) +-- Drives the storage-object rewrite performed by the v0_18_0 +-- orchestrator's phase B. Keyed on file_id so two sources can share +-- an old path during migration without PK collision (Codex second- +-- pass caught this). 
+-- Status state machine: pending → copy_done → db_updated → complete +-- ============================================================ +CREATE TABLE IF NOT EXISTS file_migration_ledger ( + file_id INTEGER PRIMARY KEY REFERENCES files(id) ON DELETE CASCADE, + storage_path_old TEXT NOT NULL, + storage_path_new TEXT NOT NULL, + status TEXT NOT NULL DEFAULT 'pending', + error TEXT, + updated_at TIMESTAMPTZ NOT NULL DEFAULT now(), + CONSTRAINT chk_ledger_status CHECK (status IN ('pending','copy_done','db_updated','complete','failed')) +); +CREATE INDEX IF NOT EXISTS idx_file_migration_ledger_status + ON file_migration_ledger(status) WHERE status != 'complete'; + -- ============================================================ -- Trigger-based search_vector (spans pages + timeline_entries) -- ============================================================ @@ -469,6 +554,8 @@ BEGIN ALTER TABLE config ENABLE ROW LEVEL SECURITY; ALTER TABLE files ENABLE ROW LEVEL SECURITY; ALTER TABLE minion_jobs ENABLE ROW LEVEL SECURITY; + ALTER TABLE sources ENABLE ROW LEVEL SECURITY; + ALTER TABLE file_migration_ledger ENABLE ROW LEVEL SECURITY; RAISE NOTICE 'RLS enabled on all tables (role % has BYPASSRLS)', current_user; ELSE RAISE WARNING 'Skipping RLS: role % does not have BYPASSRLS privilege. Run as postgres role to enable.', current_user; diff --git a/src/core/search/dedup.ts b/src/core/search/dedup.ts index ebcca35..4688107 100644 --- a/src/core/search/dedup.ts +++ b/src/core/search/dedup.ts @@ -7,6 +7,14 @@ * 3. By type: no page type exceeds 60% of results * 4. By page: max N chunks per page (default 2) * 5. Compiled truth guarantee: ensure at least 1 compiled_truth chunk per page + * + * v0.18.0: every page key is composite (source_id, slug). Pre-v0.17 this + * was slug alone — under multi-source uniqueness that would collapse two + * same-slug pages in different sources into one, destroying recall. + * Codex review flagged this as a regression-critical path. 
The + * `pageKey()` helper below is the one canonical way to derive the key; + * every layer uses it so future "dedup just changed" drift is one file + * to fix. */ import type { SearchResult } from '../types.ts'; @@ -15,6 +23,17 @@ const COSINE_DEDUP_THRESHOLD = 0.85; const MAX_TYPE_RATIO = 0.6; const MAX_PER_PAGE = 2; +/** + * Composite page key: (source_id, slug). Pre-v0.17 rows lacked source_id + * so we fall back to 'default' to preserve single-source brain behavior + * exactly. Post-v0.17 callers always populate source_id (SQL JOINs in + * pglite/postgres engine search paths). + */ +function pageKey(r: SearchResult): string { + const source = r.source_id ?? 'default'; + return `${source}:${r.slug}`; +} + export function dedupResults( results: SearchResult[], opts?: { @@ -58,9 +77,10 @@ function dedupBySource(results: SearchResult[]): SearchResult[] { const byPage = new Map(); for (const r of results) { - const existing = byPage.get(r.slug) || []; + const k = pageKey(r); + const existing = byPage.get(k) || []; existing.push(r); - byPage.set(r.slug, existing); + byPage.set(k, existing); } const kept: SearchResult[] = []; @@ -130,10 +150,11 @@ function capPerPage(results: SearchResult[], maxPerPage: number): SearchResult[] const kept: SearchResult[] = []; for (const r of results) { - const count = pageCounts.get(r.slug) || 0; + const k = pageKey(r); + const count = pageCounts.get(k) || 0; if (count < maxPerPage) { kept.push(r); - pageCounts.set(r.slug, count + 1); + pageCounts.set(k, count + 1); } } @@ -145,30 +166,35 @@ function capPerPage(results: SearchResult[], maxPerPage: number): SearchResult[] * swap in the best compiled_truth chunk from the pre-dedup set (if one exists). */ function guaranteeCompiledTruth(results: SearchResult[], preDedup: SearchResult[]): SearchResult[] { - // Group results by page + // Group results by composite page key (source_id, slug). 
const byPage = new Map(); for (const r of results) { - const existing = byPage.get(r.slug) || []; + const k = pageKey(r); + const existing = byPage.get(k) || []; existing.push(r); - byPage.set(r.slug, existing); + byPage.set(k, existing); } const output = [...results]; - for (const [slug, pageChunks] of byPage) { + for (const [key, pageChunks] of byPage) { const hasCompiledTruth = pageChunks.some(c => c.chunk_source === 'compiled_truth'); if (hasCompiledTruth) continue; - // Find the best compiled_truth chunk from pre-dedup input for this page + // Find the best compiled_truth chunk from pre-dedup input for this + // (source_id, slug) combination. Pre-v0.17 single-source match was + // "r.slug === slug"; now it's the composite key so two same-slug + // pages in different sources don't mistakenly swap chunks across. const candidate = preDedup - .filter(r => r.slug === slug && r.chunk_source === 'compiled_truth') + .filter(r => pageKey(r) === key && r.chunk_source === 'compiled_truth') .sort((a, b) => b.score - a.score)[0]; if (!candidate) continue; - // Swap: replace the lowest-scored chunk from this page + // Swap: replace the lowest-scored chunk from this page (same + // composite key match). const lowestIdx = output.reduce((minIdx, r, idx) => { - if (r.slug !== slug) return minIdx; + if (pageKey(r) !== key) return minIdx; if (minIdx === -1) return idx; return r.score < output[minIdx].score ? idx : minIdx; }, -1); diff --git a/src/core/source-resolver.ts b/src/core/source-resolver.ts new file mode 100644 index 0000000..c12a5a7 --- /dev/null +++ b/src/core/source-resolver.ts @@ -0,0 +1,139 @@ +/** + * Source resolution for CLI commands (v0.18.0). + * + * Resolution priority (highest first): + * 1. Explicit --source flag (caller passes this as `explicit`) + * 2. GBRAIN_SOURCE env var + * 3. .gbrain-source dotfile in CWD or any ancestor directory + * 4. Registered source whose local_path contains CWD + * 5. Brain-level default via `gbrain sources default ` + * 6. 
Literal 'default' (backward compat for pre-v0.17 brains) + * + * This helper is shared by the sources CLI, future sync/extract/query + * commands (Steps 4/5), and the operation layer (Step 2+). + */ + +import { readFileSync, existsSync } from 'fs'; +import { join, dirname, resolve } from 'path'; +import type { BrainEngine } from './engine.ts'; + +const DOTFILE = '.gbrain-source'; +// Must start + end with alnum, interior dashes allowed. Max 32 chars. +// Single-char alnum is also valid. Kebab-case enforced so citation keys +// like `[wiki:slug]` can't have ugly edges like `[wiki-:slug]`. +const SOURCE_ID_RE = /^[a-z0-9](?:[a-z0-9-]{0,30}[a-z0-9])?$/; + +function readDotfileWalk(startDir: string): string | null { + let dir = resolve(startDir); + // Guard against infinite loops on malformed paths. + for (let i = 0; i < 50; i++) { + const candidate = join(dir, DOTFILE); + if (existsSync(candidate)) { + try { + const content = readFileSync(candidate, 'utf8').trim().split('\n')[0].trim(); + if (SOURCE_ID_RE.test(content)) return content; + } catch { + // Unreadable dotfile — skip and keep walking. + } + } + const parent = dirname(dir); + if (parent === dir) break; // reached filesystem root + dir = parent; + } + return null; +} + +/** + * Resolve the source id for a CLI command. + * + * @param engine Connected brain engine (for sources table lookups). + * @param explicit The --source flag value, if the caller parsed one. + * @param cwd The working directory to walk for .gbrain-source. Defaults + * to process.cwd(). Exposed for testability. + * @returns The resolved source id. Falls back to 'default' if no other + * signal is present. Never returns null — every command must + * target exactly one default source. + * @throws If the resolved id doesn't correspond to a registered source + * (prevents silently writing to a nonexistent source and bloating + * pages with a dead FK). 
+ */ +export async function resolveSourceId( + engine: BrainEngine, + explicit: string | null | undefined, + cwd: string = process.cwd(), +): Promise<string> { + // 1. Explicit flag wins. + if (explicit) { + if (!SOURCE_ID_RE.test(explicit)) { + throw new Error(`Invalid --source value "${explicit}". Must match [a-z0-9-]{1,32}.`); + } + await assertSourceExists(engine, explicit); + return explicit; + } + + // 2. Env var. + const env = process.env.GBRAIN_SOURCE; + if (env && env.length > 0) { + if (!SOURCE_ID_RE.test(env)) { + throw new Error(`Invalid GBRAIN_SOURCE value "${env}". Must match [a-z0-9-]{1,32}.`); + } + await assertSourceExists(engine, env); + return env; + } + + // 3. .gbrain-source dotfile walk-up. + const dotfile = readDotfileWalk(cwd); + if (dotfile) { + await assertSourceExists(engine, dotfile); + return dotfile; + } + + // 4. Registered source whose local_path contains CWD. + // Uses longest-prefix match so nested-path configurations (e.g. + // gstack at ~/gstack + plans at ~/gstack/plans) pick the deepest. + const registered = await engine.executeRaw<{ id: string; local_path: string }>( + `SELECT id, local_path FROM sources WHERE local_path IS NOT NULL`, + ); + const cwdResolved = resolve(cwd); + let best: { id: string; pathLen: number } | null = null; + for (const r of registered) { + const p = resolve(r.local_path); + if (cwdResolved === p || cwdResolved.startsWith(p + '/')) { + if (!best || p.length > best.pathLen) { + best = { id: r.id, pathLen: p.length }; + } + } + } + if (best) return best.id; + + // 5. Brain-level default. + const globalDefault = await engine.getConfig('sources.default'); + if (globalDefault && SOURCE_ID_RE.test(globalDefault)) { + await assertSourceExists(engine, globalDefault); + return globalDefault; + } + + // 6. Fallback: the seeded 'default' source. Always exists post-migration + // v20 so this is a safe terminal.
+ return 'default'; +} + +async function assertSourceExists(engine: BrainEngine, id: string): Promise<void> { + const rows = await engine.executeRaw<{ id: string }>( + `SELECT id FROM sources WHERE id = $1`, + [id], + ); + if (rows.length === 0) { + throw new Error( + `Source "${id}" not found. Available sources: ` + + `run \`gbrain sources list\` to see registered sources, ` + + `or \`gbrain sources add ${id}\` to create it.`, + ); + } +} + +/** Exposed for tests. */ +export const __testing = { + readDotfileWalk, + SOURCE_ID_RE, +}; diff --git a/src/core/types.ts b/src/core/types.ts index ae2e7df..f91839a 100644 --- a/src/core/types.ts +++ b/src/core/types.ts @@ -66,6 +66,12 @@ export interface SearchResult { chunk_index: number; score: number; stale: boolean; + /** + * v0.18.0: the sources.id the page belongs to. Dedup composite-keys + * on (source_id, slug); see src/core/search/dedup.ts. Defaults to + * 'default' for pre-v0.17 rows that lacked the column. + */ + source_id?: string; } export interface SearchOpts { diff --git a/src/core/utils.ts b/src/core/utils.ts index 4d9313e..a1d49ad 100644 --- a/src/core/utils.ts +++ b/src/core/utils.ts @@ -125,7 +125,7 @@ export function rowToChunk(row: Record, includeEmbedding = fals } export function rowToSearchResult(row: Record): SearchResult { - return { + const result: SearchResult = { slug: row.slug as string, page_id: row.page_id as number, title: row.title as string, @@ -137,4 +137,12 @@ export function rowToSearchResult(row: Record): SearchResult { score: Number(row.score), stale: Boolean(row.stale), }; + // v0.18.0: source_id comes from the p.source_id column in search + // SELECTs. Keep the field optional so pre-v0.17 engines that didn't + // join sources don't crash on the absent column; rowToSearchResult + // is shared by both paths.
+ if (typeof row.source_id === 'string') { + result.source_id = row.source_id; + } + return result; } diff --git a/src/schema.sql b/src/schema.sql index 4cce2b3..c329730 100644 --- a/src/schema.sql +++ b/src/schema.sql @@ -5,12 +5,55 @@ CREATE EXTENSION IF NOT EXISTS pg_trgm; -- gen_random_uuid() is core in Postgres 13+; enable pgcrypto as fallback for older versions CREATE EXTENSION IF NOT EXISTS pgcrypto; +-- ============================================================ +-- sources: multi-repo / multi-brain tenancy (v0.18.0) +-- ============================================================ +-- A source is a logical brain-within-the-DB: wiki, gstack, yc-media, etc. +-- Every page/file/ingest_log row carries source_id. +-- +-- id: immutable citation key. [a-z0-9-]{1,32} enforced at app layer. +-- Used in [source:slug] citations, --source flag, wikilink syntax. +-- name: mutable display label. Rename via `gbrain sources rename`. +-- local_path: optional git checkout root for filesystem-backed sources. +-- config: forward-compat JSONB. Currently used for federation + ACL slot. +-- { "federated": bool, "access_policy": {...} } +-- - federated=true (or missing-but-explicit on 'default'): +-- participates in cross-source default search. +-- - federated=false (default for new sources): +-- only searched when explicitly named via --source. +-- - access_policy: forward-compat slot, no enforcement in v0.17. +-- Write-side lockdown: mutated only when ctx.remote=false. +CREATE TABLE IF NOT EXISTS sources ( + id TEXT PRIMARY KEY, + name TEXT NOT NULL UNIQUE, + local_path TEXT, + last_commit TEXT, + last_sync_at TIMESTAMPTZ, + config JSONB NOT NULL DEFAULT '{}'::jsonb, + created_at TIMESTAMPTZ NOT NULL DEFAULT now() +); + +-- Seed the default source. 'default' is federated=true for backward compat +-- (pre-v0.17 brains behave exactly as before — every page appears in search). 
+-- Pre-existing sync.repo_path / sync.last_commit are copied in by the v20 +-- migration, not here; fresh installs have no local_path until `sources add` +-- or the first `sync`. +INSERT INTO sources (id, name, config) + VALUES ('default', 'default', '{"federated": true}'::jsonb) + ON CONFLICT (id) DO NOTHING; + -- ============================================================ -- pages: the core content table -- ============================================================ +-- v0.18.0 (Step 2): pages.source_id scopes each row to a sources(id) row. +-- Slugs are unique per source, NOT globally. The default source is +-- seeded in the sources block above so the DEFAULT 'default' FK is +-- always valid at INSERT time. CREATE TABLE IF NOT EXISTS pages ( id SERIAL PRIMARY KEY, - slug TEXT NOT NULL UNIQUE, + source_id TEXT NOT NULL DEFAULT 'default' + REFERENCES sources(id) ON DELETE CASCADE, + slug TEXT NOT NULL, type TEXT NOT NULL, title TEXT NOT NULL, compiled_truth TEXT NOT NULL DEFAULT '', @@ -18,7 +61,8 @@ CREATE TABLE IF NOT EXISTS pages ( frontmatter JSONB NOT NULL DEFAULT '{}', content_hash TEXT, created_at TIMESTAMPTZ NOT NULL DEFAULT now(), - updated_at TIMESTAMPTZ NOT NULL DEFAULT now() + updated_at TIMESTAMPTZ NOT NULL DEFAULT now(), + CONSTRAINT pages_source_slug_key UNIQUE (source_id, slug) ); CREATE INDEX IF NOT EXISTS idx_pages_type ON pages(type); @@ -26,6 +70,8 @@ CREATE INDEX IF NOT EXISTS idx_pages_frontmatter ON pages USING GIN(frontmatter) CREATE INDEX IF NOT EXISTS idx_pages_trgm ON pages USING GIN(title gin_trgm_ops); -- v0.13.1 #170: avoids 14.6s seqscan on large brains when listing pages newest-first. CREATE INDEX IF NOT EXISTS idx_pages_updated_at_desc ON pages (updated_at DESC); +-- v0.18.0: source-scoped scans (per /plan-eng-review Section 4).
+CREATE INDEX IF NOT EXISTS idx_pages_source_id ON pages(source_id); -- ============================================================ -- content_chunks: chunked content with embeddings @@ -70,6 +116,11 @@ CREATE TABLE IF NOT EXISTS links ( link_source TEXT CHECK (link_source IS NULL OR link_source IN ('markdown', 'frontmatter', 'manual')), origin_page_id INTEGER REFERENCES pages(id) ON DELETE SET NULL, origin_field TEXT, + -- v0.18.0 Step 4: 'qualified' when the link was written as + -- [[source:slug]] (target source pinned). 'unqualified' when written + -- as bare [[slug]] and resolved via local-first fallback at + -- extraction time. NULL for legacy/manual/frontmatter edges. + resolution_type TEXT CHECK (resolution_type IS NULL OR resolution_type IN ('qualified', 'unqualified')), created_at TIMESTAMPTZ NOT NULL DEFAULT now(), -- NULLS NOT DISTINCT (PG15+) so two rows with link_source IS NULL or -- origin_page_id IS NULL collide as expected. Without this, every row with @@ -144,6 +195,9 @@ CREATE INDEX IF NOT EXISTS idx_versions_page ON page_versions(page_id); -- ============================================================ -- ingest_log -- ============================================================ +-- NOTE (v0.18.0 Step 1): ingest_log.source_id is NOT added yet — lands +-- in v17 alongside the sync rewrite (Step 5), which starts writing +-- source-scoped entries. CREATE TABLE IF NOT EXISTS ingest_log ( id SERIAL PRIMARY KEY, source_type TEXT NOT NULL, @@ -198,9 +252,18 @@ CREATE TABLE IF NOT EXISTS mcp_request_log ( -- ============================================================ -- files: binary attachments stored in Supabase Storage -- ============================================================ +-- v0.18.0 Step 7: files gains source_id + page_id alongside the +-- legacy page_slug (kept for backward compat until a later release). +-- The file_migration_ledger below drives the storage object rewrite. 
+-- page_slug FK had ON UPDATE CASCADE — removed because slugs are no +-- longer global (composite UNIQUE) so CASCADE on-update is ambiguous. +-- ON DELETE SET NULL is preserved via both page_slug and page_id. CREATE TABLE IF NOT EXISTS files ( id SERIAL PRIMARY KEY, - page_slug TEXT REFERENCES pages(slug) ON DELETE SET NULL ON UPDATE CASCADE, + source_id TEXT NOT NULL DEFAULT 'default' + REFERENCES sources(id) ON DELETE CASCADE, + page_slug TEXT, + page_id INTEGER REFERENCES pages(id) ON DELETE SET NULL, filename TEXT NOT NULL, storage_path TEXT NOT NULL, mime_type TEXT, @@ -215,8 +278,30 @@ CREATE TABLE IF NOT EXISTS files ( ALTER TABLE files DROP COLUMN IF EXISTS storage_url; CREATE INDEX IF NOT EXISTS idx_files_page ON files(page_slug); +CREATE INDEX IF NOT EXISTS idx_files_page_id ON files(page_id); +CREATE INDEX IF NOT EXISTS idx_files_source_id ON files(source_id); CREATE INDEX IF NOT EXISTS idx_files_hash ON files(content_hash); +-- ============================================================ +-- file_migration_ledger (v0.18.0 Step 7) +-- Drives the storage-object rewrite performed by the v0_18_0 +-- orchestrator's phase B. Keyed on file_id so two sources can share +-- an old path during migration without PK collision (Codex second- +-- pass caught this). 
+-- Status state machine: pending → copy_done → db_updated → complete +-- ============================================================ +CREATE TABLE IF NOT EXISTS file_migration_ledger ( + file_id INTEGER PRIMARY KEY REFERENCES files(id) ON DELETE CASCADE, + storage_path_old TEXT NOT NULL, + storage_path_new TEXT NOT NULL, + status TEXT NOT NULL DEFAULT 'pending', + error TEXT, + updated_at TIMESTAMPTZ NOT NULL DEFAULT now(), + CONSTRAINT chk_ledger_status CHECK (status IN ('pending','copy_done','db_updated','complete','failed')) +); +CREATE INDEX IF NOT EXISTS idx_file_migration_ledger_status + ON file_migration_ledger(status) WHERE status != 'complete'; + -- ============================================================ -- Trigger-based search_vector (spans pages + timeline_entries) -- ============================================================ @@ -465,6 +550,8 @@ BEGIN ALTER TABLE config ENABLE ROW LEVEL SECURITY; ALTER TABLE files ENABLE ROW LEVEL SECURITY; ALTER TABLE minion_jobs ENABLE ROW LEVEL SECURITY; + ALTER TABLE sources ENABLE ROW LEVEL SECURITY; + ALTER TABLE file_migration_ledger ENABLE ROW LEVEL SECURITY; RAISE NOTICE 'RLS enabled on all tables (role % has BYPASSRLS)', current_user; ELSE RAISE WARNING 'Skipping RLS: role % does not have BYPASSRLS privilege. Run as postgres role to enable.', current_user; diff --git a/test/apply-migrations.test.ts b/test/apply-migrations.test.ts index 3c87300..5c835af 100644 --- a/test/apply-migrations.test.ts +++ b/test/apply-migrations.test.ts @@ -105,8 +105,9 @@ describe('buildPlan — diff against completed + installed VERSION', () => { // Future migrations (registered but newer than installed VERSION) land in // skippedFuture until the binary catches up. v0.13.0 = frontmatter graph, // v0.13.1 = Knowledge Runtime grandfather, v0.14.0 = shell jobs + - // autopilot cooperative, v0.16.0 = subagent runtime (this branch). 
- expect(plan.skippedFuture.map(m => m.version)).toEqual(['0.12.0', '0.12.2', '0.13.0', '0.13.1', '0.14.0', '0.16.0']); + // autopilot cooperative, v0.16.0 = subagent runtime, v0.18.0 = multi- + // source brains (this branch). + expect(plan.skippedFuture.map(m => m.version)).toEqual(['0.12.0', '0.12.2', '0.13.0', '0.13.1', '0.14.0', '0.16.0', '0.18.0']); }); test('already applied → v0.11.0 lands in `applied` bucket, not pending', () => { @@ -142,11 +143,11 @@ describe('buildPlan — diff against completed + installed VERSION', () => { const idx = indexCompleted([]); const plan = buildPlan(idx, '0.12.0'); expect(plan.pending.map(m => m.version)).toContain('0.11.0'); - // v0.12.2, v0.13.0, v0.13.1, v0.14.0, and v0.16.0 were added later; + // v0.12.2, v0.13.0, v0.13.1, v0.14.0, v0.16.0, v0.18.0 were added later; // installed=0.12.0 means they belong in skippedFuture, not pending. v0.11.0 // and v0.12.0 stay pending despite being ≤ installed — that is the H9 // invariant. - expect(plan.skippedFuture.map(m => m.version)).toEqual(['0.12.2', '0.13.0', '0.13.1', '0.14.0', '0.16.0']); + expect(plan.skippedFuture.map(m => m.version)).toEqual(['0.12.2', '0.13.0', '0.13.1', '0.14.0', '0.16.0', '0.18.0']); }); test('--migration filter narrows to one version', () => { diff --git a/test/dedup.test.ts b/test/dedup.test.ts index 18d2a73..b714973 100644 --- a/test/dedup.test.ts +++ b/test/dedup.test.ts @@ -154,3 +154,75 @@ describe('edge cases', () => { expect(deduped.filter(r => r.slug === 'a').length).toBeLessThanOrEqual(3); }); }); + +// ───────────────────────────────────────────────────────────────── +// v0.18.0 Step 3 — source-aware dedup (REGRESSION-CRITICAL per Codex) +// ───────────────────────────────────────────────────────────────── +// Pre-v0.17 dedup collapsed on slug alone. Under multi-source +// uniqueness, two same-slug pages in different sources ARE different +// pages — collapsing them destroys cross-source recall. 
Codex flagged +// this as a regression-critical path in the outside-voice review. +describe('dedup — source-aware composite key (v0.18.0)', () => { + test('same slug across two sources does NOT collapse via dedupBySource layer', () => { + // Two pages, same slug, different sources. Both should survive + // Layer 1 (top-3-per-page) because they are DIFFERENT pages. + const results = [ + makeResult({ slug: 'topics/ai', source_id: 'wiki', score: 0.9, chunk_text: 'wiki take on ai' }), + makeResult({ slug: 'topics/ai', source_id: 'gstack', score: 0.85, chunk_text: 'gstack plans for ai' }), + ]; + const deduped = dedupResults(results); + // Both pages represented — one result each. + const wikiHits = deduped.filter(r => r.source_id === 'wiki' && r.slug === 'topics/ai'); + const gstackHits = deduped.filter(r => r.source_id === 'gstack' && r.slug === 'topics/ai'); + expect(wikiHits.length).toBe(1); + expect(gstackHits.length).toBe(1); + }); + + test('same slug + same source DOES collapse to maxPerPage', () => { + // Control: same-source-same-slug behavior unchanged from pre-v0.17. + const results = [ + makeResult({ slug: 'topics/ai', source_id: 'wiki', chunk_id: 1, score: 0.9, chunk_text: 'chunk one distinct content here' }), + makeResult({ slug: 'topics/ai', source_id: 'wiki', chunk_id: 2, score: 0.8, chunk_text: 'chunk two also distinct words' }), + makeResult({ slug: 'topics/ai', source_id: 'wiki', chunk_id: 3, score: 0.7, chunk_text: 'chunk three different terms again' }), + ]; + const deduped = dedupResults(results); + // Default maxPerPage=2 → only 2 of the 3 wiki:topics/ai chunks survive. 
+ const wikiHits = deduped.filter(r => r.source_id === 'wiki' && r.slug === 'topics/ai'); + expect(wikiHits.length).toBeLessThanOrEqual(2); + }); + + test('missing source_id defaults to "default" for backward compat', () => { + // Pre-v0.17 brains (single source, rows with no source_id column) + // still dedup correctly: the fallback key groups them all under + // the 'default' source bucket. + const results = [ + makeResult({ slug: 'topics/ai', chunk_id: 1, score: 0.9, chunk_text: 'chunk one distinct content words' }), + makeResult({ slug: 'topics/ai', chunk_id: 2, score: 0.8, chunk_text: 'chunk two totally different phrasing' }), + makeResult({ slug: 'topics/ai', chunk_id: 3, score: 0.7, chunk_text: 'chunk three new unique text here' }), + ]; + const deduped = dedupResults(results); + // All three should group as one page (no source_id → default), so + // maxPerPage=2 cap applies. + expect(deduped.length).toBeLessThanOrEqual(2); + }); + + test('compiled_truth guarantee scopes to (source_id, slug), not slug alone', () => { + // Two pages, same slug, different sources. wiki's top-scoring chunk + // is timeline; gstack has only compiled_truth. The guarantee must + // swap in wiki's compiled_truth for wiki (without touching gstack) + // and must NOT accidentally pull gstack's compiled_truth into wiki. + const results = [ + makeResult({ slug: 'topics/ai', source_id: 'wiki', score: 0.9, chunk_source: 'timeline', chunk_id: 1, chunk_text: 'wiki timeline chunk content here' }), + makeResult({ slug: 'topics/ai', source_id: 'wiki', score: 0.5, chunk_source: 'compiled_truth', chunk_id: 2, chunk_text: 'wiki compiled truth content text' }), + makeResult({ slug: 'topics/ai', source_id: 'gstack', score: 0.7, chunk_source: 'compiled_truth', chunk_id: 3, chunk_text: 'gstack compiled truth something else' }), + ]; + const deduped = dedupResults(results); + // Wiki ends up with a compiled_truth (swapped from its own source, + // not gstack's). 
+ const wikiCompiledTruths = deduped.filter( + r => r.source_id === 'wiki' && r.slug === 'topics/ai' && r.chunk_source === 'compiled_truth', + ); + expect(wikiCompiledTruths.length).toBe(1); + expect(wikiCompiledTruths[0].chunk_id).toBe(2); // wiki's own compiled_truth, NOT gstack's (id=3) + }); +}); diff --git a/test/e2e/mechanical.test.ts b/test/e2e/mechanical.test.ts index febc75a..247a55b 100644 --- a/test/e2e/mechanical.test.ts +++ b/test/e2e/mechanical.test.ts @@ -633,7 +633,7 @@ describeE2E('E2E: file_list LIMIT enforcement', () => { await sql` INSERT INTO pages (slug, title, type, compiled_truth, frontmatter) VALUES (${testSlug}, ${'Test Limit Page'}, ${'note'}, ${'body'}, ${'{}'}::jsonb) - ON CONFLICT (slug) DO NOTHING + ON CONFLICT (source_id, slug) DO NOTHING `; // Insert 150 file rows for the same slug diff --git a/test/e2e/multi-source.test.ts b/test/e2e/multi-source.test.ts new file mode 100644 index 0000000..2b06a79 --- /dev/null +++ b/test/e2e/multi-source.test.ts @@ -0,0 +1,608 @@ +/** + * E2E: v0.18.0 multi-source migrations against REAL Postgres. + * + * PGLite doesn't have a files table (see pglite-schema.ts header), so the + * v23 migration's files.source_id + files.page_id rewrite + ledger seed + * is NEVER executed by the PGLite integration test. This file closes + * that gap by exercising the full v20-v23 chain against a real Postgres + * DB with pre-existing data. + * + * Also covers the gaps in the PR's pre-shipping test matrix that the + * author self-audited: + * - files.page_slug → page_id backfill against real rows + * - file_migration_ledger seeding + * - cascade delete via sources.remove (pages + chunks + timeline + + * files + links all gone) + * - sync --source routing reads + writes per-source sync anchors + * instead of the global config keys + * + * Gated by DATABASE_URL — skips gracefully when unset, per the CLAUDE.md + * E2E lifecycle pattern. 
+ */ + +import { describe, test, expect, beforeAll, afterAll } from 'bun:test'; +import { PostgresEngine } from '../../src/core/postgres-engine.ts'; +import { runSources } from '../../src/commands/sources.ts'; +import { performSync } from '../../src/commands/sync.ts'; +import { runStorageBackfill } from '../../src/commands/migrations/v0_18_0-storage-backfill.ts'; +import type { StorageBackend } from '../../src/core/storage.ts'; +import { hasDatabase, setupDB, teardownDB, getConn, getEngine } from './helpers.ts'; + +const SKIP = !hasDatabase(); +const describeE2E = SKIP ? describe.skip : describe; + +describeE2E('v0.18.0 multi-source — Postgres schema shape (fresh install)', () => { + beforeAll(async () => { + await setupDB(); + // sources + file_migration_ledger are not in helpers.ALL_TABLES, so + // residual rows from prior test runs can shadow new INSERTs. Wipe + // non-default sources at the top of every describe to keep each + // block hermetic. file_migration_ledger cascades from files which + // setupDB already truncates, but wipe explicitly in case files did + // not cascade it. + const conn = getConn(); + await conn.unsafe(`DELETE FROM sources WHERE id != 'default'`); + await conn.unsafe(`DELETE FROM file_migration_ledger`); + }); + afterAll(async () => { + await teardownDB(); + }); + + test("sources('default') exists after initSchema + migration chain", async () => { + const conn = getConn(); + const rows = await conn.unsafe( + `SELECT id, name, config FROM sources WHERE id = 'default'`, + ); + expect(rows.length).toBe(1); + expect(rows[0].name).toBe('default'); + const config = typeof rows[0].config === 'string' ? 
JSON.parse(rows[0].config) : rows[0].config; + expect(config.federated).toBe(true); + }); + + test('pages.source_id NOT NULL with DEFAULT default (v21)', async () => { + const conn = getConn(); + const rows = await conn.unsafe( + `SELECT column_name, column_default, is_nullable + FROM information_schema.columns + WHERE table_name = 'pages' AND column_name = 'source_id'`, + ); + expect(rows.length).toBe(1); + expect(rows[0].is_nullable).toBe('NO'); + expect(String(rows[0].column_default)).toContain('default'); + }); + + test('composite UNIQUE pages(source_id, slug) replaces global UNIQUE(slug)', async () => { + const conn = getConn(); + const composite = await conn.unsafe( + `SELECT conname FROM pg_constraint WHERE conname = 'pages_source_slug_key'`, + ); + expect(composite.length).toBe(1); + const oldGlobal = await conn.unsafe( + `SELECT conname FROM pg_constraint WHERE conname = 'pages_slug_key'`, + ); + expect(oldGlobal.length).toBe(0); + }); + + test('links.resolution_type column exists with CHECK (v22)', async () => { + const conn = getConn(); + const rows = await conn.unsafe( + `SELECT column_name FROM information_schema.columns + WHERE table_name = 'links' AND column_name = 'resolution_type'`, + ); + expect(rows.length).toBe(1); + const check = await conn.unsafe( + `SELECT conname FROM pg_constraint WHERE conname = 'links_resolution_type_check'`, + ); + expect(check.length).toBe(1); + }); + + test('files.source_id + files.page_id columns exist (v23, Postgres-only)', async () => { + const conn = getConn(); + const cols = await conn.unsafe( + `SELECT column_name FROM information_schema.columns + WHERE table_name = 'files' AND column_name IN ('source_id', 'page_id')`, + ); + // postgres.js returns RowList with an iterable-row shape; cast via + // unknown before narrowing to plain objects (TS2352 otherwise). 
+ const names = new Set( + (cols as unknown as Array<{ column_name: string }>).map(r => r.column_name), + ); + expect(names.has('source_id')).toBe(true); + expect(names.has('page_id')).toBe(true); + }); + + test('file_migration_ledger table exists with status CHECK (v23)', async () => { + const conn = getConn(); + const tables = await conn.unsafe( + `SELECT table_name FROM information_schema.tables + WHERE table_name = 'file_migration_ledger'`, + ); + expect(tables.length).toBe(1); + const check = await conn.unsafe( + `SELECT conname FROM pg_constraint WHERE conname = 'chk_ledger_status'`, + ); + expect(check.length).toBe(1); + }); +}); + +describeE2E('v0.18.0 multi-source — composite UNIQUE semantics on real Postgres', () => { + beforeAll(async () => { + await setupDB(); + // sources + file_migration_ledger are not in helpers.ALL_TABLES, so + // residual rows from prior test runs can shadow new INSERTs. Wipe + // non-default sources at the top of every describe to keep each + // block hermetic. file_migration_ledger cascades from files which + // setupDB already truncates, but wipe explicitly in case files did + // not cascade it. + const conn = getConn(); + await conn.unsafe(`DELETE FROM sources WHERE id != 'default'`); + await conn.unsafe(`DELETE FROM file_migration_ledger`); + }); + afterAll(async () => { + await teardownDB(); + }); + + test('same slug in two sources coexists (REGRESSION GUARD — Codex critical)', async () => { + const conn = getConn(); + // Create a second source. + const engine = getEngine(); + await runSources(engine as unknown as Parameters<typeof runSources>[0], ['add', 'wiki', '--federated']); + + // Insert the same slug under 'default' (via putPage) and 'wiki' (raw INSERT).
+ await engine.putPage('topics/ai', { + type: 'concept', title: 'AI from default', compiled_truth: 'default source take', + }); + await conn.unsafe( + `INSERT INTO pages (source_id, slug, type, title, compiled_truth, timeline, frontmatter, content_hash) + VALUES ('wiki', 'topics/ai', 'concept', 'AI from wiki', 'wiki source take', '', '{}'::jsonb, 'wikihash')`, + ); + + const rows = await conn.unsafe( + `SELECT source_id, slug, title FROM pages WHERE slug = 'topics/ai' ORDER BY source_id`, + ); + expect(rows.length).toBe(2); + expect(rows.map((r: any) => r.source_id).sort()).toEqual(['default', 'wiki']); + }); + + test('duplicate (source_id, slug) hits composite UNIQUE', async () => { + const conn = getConn(); + let err: Error | null = null; + try { + await conn.unsafe( + `INSERT INTO pages (source_id, slug, type, title, compiled_truth, timeline, frontmatter, content_hash) + VALUES ('wiki', 'topics/ai', 'concept', 'dup', '', '', '{}'::jsonb, 'dup')`, + ); + } catch (e) { + err = e as Error; + } + expect(err).not.toBeNull(); + expect(err!.message.toLowerCase()).toMatch(/unique|duplicate/); + }); + + test('putPage (engine API) targets default source by schema DEFAULT', async () => { + const engine = getEngine(); + await engine.putPage('topics/from-putpage', { + type: 'note', title: 'Via putPage', compiled_truth: 'body', + }); + const conn = getConn(); + const rows = await conn.unsafe( + `SELECT source_id FROM pages WHERE slug = 'topics/from-putpage'`, + ); + expect(rows.length).toBe(1); + expect(rows[0].source_id).toBe('default'); + }); +}); + +describeE2E('v0.18.0 multi-source — cascade delete covers every dependent row', () => { + beforeAll(async () => { + await setupDB(); + // sources + file_migration_ledger are not in helpers.ALL_TABLES, so + // residual rows from prior test runs can shadow new INSERTs. Wipe + // non-default sources at the top of every describe to keep each + // block hermetic. 
file_migration_ledger cascades from files which + // setupDB already truncates, but wipe explicitly in case files did + // not cascade it. + const conn = getConn(); + await conn.unsafe(`DELETE FROM sources WHERE id != 'default'`); + await conn.unsafe(`DELETE FROM file_migration_ledger`); + }); + afterAll(async () => { + await teardownDB(); + }); + + test('sources remove cascades to pages + chunks + timeline + links + files', async () => { + const conn = getConn(); + const engine = getEngine(); + + // Build a fully populated source: page, chunks, timeline entries, + // links, a file row. Then remove the source and verify nothing + // for that source survives. + await runSources(engine as unknown as Parameters<typeof runSources>[0], ['add', 'cascadetest', '--federated']); + + // Page under cascadetest + await conn.unsafe( + `INSERT INTO pages (source_id, slug, type, title, compiled_truth, timeline, frontmatter, content_hash) + VALUES ('cascadetest', 'people/alice', 'person', 'Alice', 'Alice body', '', '{}'::jsonb, 'h1')`, + ); + const alicePage = await conn.unsafe( + `SELECT id FROM pages WHERE source_id = 'cascadetest' AND slug = 'people/alice'`, + ); + const aliceId = alicePage[0].id as number; + + // A second page for link target + await conn.unsafe( + `INSERT INTO pages (source_id, slug, type, title, compiled_truth, timeline, frontmatter, content_hash) + VALUES ('cascadetest', 'companies/acme', 'company', 'Acme', 'Acme body', '', '{}'::jsonb, 'h2')`, + ); + const acmePage = await conn.unsafe( + `SELECT id FROM pages WHERE source_id = 'cascadetest' AND slug = 'companies/acme'`, + ); + const acmeId = acmePage[0].id as number; + + // Chunk + await conn.unsafe( + `INSERT INTO content_chunks (page_id, chunk_index, chunk_text, chunk_source) + VALUES (${aliceId}, 0, 'Alice body chunk', 'compiled_truth')`, + ); + + // Timeline + await conn.unsafe( + `INSERT INTO timeline_entries (page_id, date, source, summary, detail) + VALUES (${aliceId}, '2026-01-15', 'test', 'Joined Acme', 'detail')`, 
+ ); + + // Link Alice → Acme + await conn.unsafe( + `INSERT INTO links (from_page_id, to_page_id, link_type, link_source) + VALUES (${aliceId}, ${acmeId}, 'works_at', 'markdown')`, + ); + + // File row pointing at Alice + await conn.unsafe( + `INSERT INTO files (source_id, page_id, filename, storage_path, content_hash) + VALUES ('cascadetest', ${aliceId}, 'alice.pdf', 'cascadetest/people/alice/alice.pdf', 'fh1')`, + ); + + // Sanity: everything exists + expect((await conn.unsafe(`SELECT COUNT(*)::int AS n FROM pages WHERE source_id = 'cascadetest'`))[0].n).toBe(2); + expect((await conn.unsafe(`SELECT COUNT(*)::int AS n FROM content_chunks WHERE page_id = ${aliceId}`))[0].n).toBe(1); + expect((await conn.unsafe(`SELECT COUNT(*)::int AS n FROM timeline_entries WHERE page_id = ${aliceId}`))[0].n).toBe(1); + expect((await conn.unsafe(`SELECT COUNT(*)::int AS n FROM links WHERE from_page_id = ${aliceId}`))[0].n).toBe(1); + expect((await conn.unsafe(`SELECT COUNT(*)::int AS n FROM files WHERE source_id = 'cascadetest'`))[0].n).toBe(1); + + // Remove the source. + await runSources(engine as unknown as Parameters<typeof runSources>[0], ['remove', 'cascadetest', '--yes']); + + // Everything for that source is gone. + expect((await conn.unsafe(`SELECT COUNT(*)::int AS n FROM pages WHERE source_id = 'cascadetest'`))[0].n).toBe(0); + expect((await conn.unsafe(`SELECT COUNT(*)::int AS n FROM content_chunks WHERE page_id = ${aliceId}`))[0].n).toBe(0); + expect((await conn.unsafe(`SELECT COUNT(*)::int AS n FROM timeline_entries WHERE page_id = ${aliceId}`))[0].n).toBe(0); + expect((await conn.unsafe(`SELECT COUNT(*)::int AS n FROM links WHERE from_page_id = ${aliceId}`))[0].n).toBe(0); + expect((await conn.unsafe(`SELECT COUNT(*)::int AS n FROM files WHERE source_id = 'cascadetest'`))[0].n).toBe(0); + + // The sources row itself is gone. 
+ const src = await conn.unsafe(`SELECT id FROM sources WHERE id = 'cascadetest'`); + expect(src.length).toBe(0); + }); +}); + +describeE2E('v0.18.0 multi-source — sync --source routes through sources table', () => { + beforeAll(async () => { + await setupDB(); + // sources + file_migration_ledger are not in helpers.ALL_TABLES, so + // residual rows from prior test runs can shadow new INSERTs. Wipe + // non-default sources at the top of every describe to keep each + // block hermetic. file_migration_ledger cascades from files which + // setupDB already truncates, but wipe explicitly in case files did + // not cascade it. + const conn = getConn(); + await conn.unsafe(`DELETE FROM sources WHERE id != 'default'`); + await conn.unsafe(`DELETE FROM file_migration_ledger`); + }); + afterAll(async () => { + await teardownDB(); + }); + + test('performSync with sourceId reads local_path from sources row', async () => { + const engine = getEngine(); + const conn = getConn(); + + // Register a source with a bogus path (we're not actually walking a + // repo — this test asserts that performSync correctly RESOLVES the + // source row vs hitting the global config). + await runSources(engine as unknown as Parameters<typeof runSources>[0], [ + 'add', 'syncsrc', '--path', '/nonexistent/syncsrc/path', '--no-federated', + ]); + + // Also set a DIFFERENT path in the global config so we can verify + // sourceId actually disambiguates. + await engine.setConfig('sync.repo_path', '/some/other/default/path'); + + // performSync({sourceId: 'syncsrc'}) should attempt to use + // /nonexistent/syncsrc/path, NOT /some/other/default/path. + let err: Error | null = null; + try { + await performSync(engine, { sourceId: 'syncsrc' }); + } catch (e) { + err = e as Error; + } + expect(err).not.toBeNull(); + // The error message references the source-scoped path, not the + // global config path. (Could be "Not a git repository" + // or "No commits in repo" — either way the path it cites should + // be the source's.) 
+ expect(err!.message).toContain('/nonexistent/syncsrc/path'); + expect(err!.message).not.toContain('/some/other/default/path'); + }); + + test('performSync with no sourceId falls back to global sync.repo_path', async () => { + const engine = getEngine(); + // Global config is still '/some/other/default/path' from the + // previous test. Without --source, performSync uses it. + let err: Error | null = null; + try { + await performSync(engine, {}); + } catch (e) { + err = e as Error; + } + expect(err).not.toBeNull(); + expect(err!.message).toContain('/some/other/default/path'); + }); +}); + +describeE2E('v0.18.0 multi-source — sources table surface', () => { + beforeAll(async () => { + await setupDB(); + // sources + file_migration_ledger are not in helpers.ALL_TABLES, so + // residual rows from prior test runs can shadow new INSERTs. Wipe + // non-default sources at the top of every describe to keep each + // block hermetic. file_migration_ledger cascades from files which + // setupDB already truncates, but wipe explicitly in case files did + // not cascade it. + const conn = getConn(); + await conn.unsafe(`DELETE FROM sources WHERE id != 'default'`); + await conn.unsafe(`DELETE FROM file_migration_ledger`); + }); + afterAll(async () => { + await teardownDB(); + }); + + test('default source is seeded federated=true; new sources default to isolated', async () => { + const conn = getConn(); + const engine = getEngine(); + + const def = await conn.unsafe(`SELECT config FROM sources WHERE id = 'default'`); + const defConfig = typeof def[0].config === 'string' ? JSON.parse(def[0].config) : def[0].config; + expect(defConfig.federated).toBe(true); + + // Defensive cleanup: sources isn't in helpers.ALL_TABLES, so residual + // rows from prior test runs can shadow this INSERT via ON CONFLICT + // DO NOTHING. Delete first, then create. 
+ await conn.unsafe(`DELETE FROM sources WHERE id = 'isolatedsrc'`); + await runSources(engine as unknown as Parameters<typeof runSources>[0], ['add', 'isolatedsrc']); + const iso = await conn.unsafe(`SELECT config FROM sources WHERE id = 'isolatedsrc'`); + const isoConfig = typeof iso[0].config === 'string' ? JSON.parse(iso[0].config) : iso[0].config; + expect(isoConfig.federated).toBeUndefined(); // omitted → isolated-by-default + }); + + test('federate / unfederate flips config.federated on real DB', async () => { + const conn = getConn(); + const engine = getEngine(); + + await runSources(engine as unknown as Parameters<typeof runSources>[0], ['federate', 'isolatedsrc']); + let row = await conn.unsafe(`SELECT config FROM sources WHERE id = 'isolatedsrc'`); + let config = typeof row[0].config === 'string' ? JSON.parse(row[0].config) : row[0].config; + expect(config.federated).toBe(true); + + await runSources(engine as unknown as Parameters<typeof runSources>[0], ['unfederate', 'isolatedsrc']); + row = await conn.unsafe(`SELECT config FROM sources WHERE id = 'isolatedsrc'`); + config = typeof row[0].config === 'string' ? JSON.parse(row[0].config) : row[0].config; + expect(config.federated).toBe(false); + }); + + test('rename changes name, id stays stable', async () => { + const conn = getConn(); + const engine = getEngine(); + + await runSources(engine as unknown as Parameters<typeof runSources>[0], [ + 'rename', 'isolatedsrc', 'My Isolated Source', + ]); + const row = await conn.unsafe(`SELECT id, name FROM sources WHERE id = 'isolatedsrc'`); + expect(row[0].id).toBe('isolatedsrc'); + expect(row[0].name).toBe('My Isolated Source'); + }); +}); + +describeE2E('v0.18.0 multi-source — storage backfill against file_migration_ledger', () => { + beforeAll(async () => { + await setupDB(); + // sources + file_migration_ledger are not in helpers.ALL_TABLES, so + // residual rows from prior test runs can shadow new INSERTs. Wipe + // non-default sources at the top of every describe to keep each + // block hermetic. 
file_migration_ledger cascades from files which + // setupDB already truncates, but wipe explicitly in case files did + // not cascade it. + const conn = getConn(); + await conn.unsafe(`DELETE FROM sources WHERE id != 'default'`); + await conn.unsafe(`DELETE FROM file_migration_ledger`); + }); + afterAll(async () => { + await teardownDB(); + }); + + test('seeded ledger + stub storage: pending → complete end-to-end', async () => { + const conn = getConn(); + const engine = getEngine(); + + // Seed a page + file (via raw INSERT so the test doesn't depend on + // sync running). + await conn.unsafe( + `INSERT INTO pages (source_id, slug, type, title, compiled_truth, timeline, frontmatter, content_hash) + VALUES ('default', 'topics/storage', 'note', 'Storage test', 'body', '', '{}'::jsonb, 'sh1')`, + ); + const pageRow = await conn.unsafe( + `SELECT id FROM pages WHERE source_id = 'default' AND slug = 'topics/storage'`, + ); + const pageId = pageRow[0].id as number; + + await conn.unsafe( + `INSERT INTO files (source_id, page_id, filename, storage_path, content_hash) + VALUES ('default', ${pageId}, 'doc.pdf', 'topics/storage/doc.pdf', 'fh1')`, + ); + const fileRow = await conn.unsafe( + `SELECT id FROM files WHERE storage_path = 'topics/storage/doc.pdf'`, + ); + const fileId = fileRow[0].id as number; + + // Seed the ledger manually so we don't depend on the v23 seed SQL + // (the TRUNCATE CASCADE in setupDB wipes ledger rows). + await conn.unsafe( + `INSERT INTO file_migration_ledger (file_id, storage_path_old, storage_path_new, status) + VALUES (${fileId}, 'topics/storage/doc.pdf', 'default/topics/storage/doc.pdf', 'pending') + ON CONFLICT (file_id) DO NOTHING`, + ); + + // Stub storage: downloads return bytes, uploads track what was written. 
+ const uploaded = new Set(); + const stub: StorageBackend = { + upload: async (p: string) => { uploaded.add(p); }, + download: async (p: string) => Buffer.from('bytes-for:' + p), + delete: async (p: string) => { uploaded.delete(p); }, + exists: async (p: string) => uploaded.has(p), + list: async () => [], + getUrl: async (p) => `https://stub/${p}`, + }; + + const report = await runStorageBackfill(engine, stub); + expect(report.total).toBe(1); + expect(report.nowComplete).toBe(1); + expect(report.failed).toBe(0); + + // Ledger row transitioned to complete. + const ledger = await conn.unsafe( + `SELECT status FROM file_migration_ledger WHERE file_id = ${fileId}`, + ); + expect(ledger[0].status).toBe('complete'); + + // Files row now points at the new path. + const filesAfter = await conn.unsafe( + `SELECT storage_path FROM files WHERE id = ${fileId}`, + ); + expect(filesAfter[0].storage_path).toBe('default/topics/storage/doc.pdf'); + + // Stub storage saw the upload happen at the new path. + expect(uploaded.has('default/topics/storage/doc.pdf')).toBe(true); + }); +}); + +// v0.18.0: real-Postgres regression guard for the addLinksBatch / +// addTimelineEntriesBatch JOIN fan-out bug. Before the fix, the JOIN was +// `pages.slug = v.from_slug` unqualified — so two pages sharing the same +// slug across sources would silently duplicate edges and timeline rows. +// postgres-js binds arrays through `unnest()` rather than inline VALUES, +// so the query shape is structurally different from PGLite's and gets its +// own coverage. 
+describeE2E('v0.18.0 multi-source — addLinksBatch / addTimelineEntriesBatch source-awareness', () => { + beforeAll(async () => { + await setupDB(); + const conn = getConn(); + await conn.unsafe(`DELETE FROM sources WHERE id != 'default'`); + await conn.unsafe(`DELETE FROM file_migration_ledger`); + }); + afterAll(async () => { await teardownDB(); }); + + async function seedSameSlugTwoSources() { + const conn = getConn(); + const engine = getEngine() as PostgresEngine; + // Second source alongside 'default'. + await conn.unsafe( + `INSERT INTO sources (id, name) VALUES ('alt', 'alt') ON CONFLICT (id) DO NOTHING` + ); + // Create same-slug pages in both sources. putPage defaults to 'default'. + await engine.putPage('topics/ai', { type: 'concept', title: 'AI (default)', compiled_truth: '', timeline: '' }); + await engine.putPage('topics/ml', { type: 'concept', title: 'ML (default)', compiled_truth: '', timeline: '' }); + await conn.unsafe( + `INSERT INTO pages (slug, type, title, compiled_truth, timeline, frontmatter, content_hash, source_id, updated_at) + VALUES ('topics/ai', 'concept', 'AI (alt)', '', '', '{}'::jsonb, 'alt-ai-hash', 'alt', now()), + ('topics/ml', 'concept', 'ML (alt)', '', '', '{}'::jsonb, 'alt-ml-hash', 'alt', now())` + ); + } + + test('addLinksBatch without explicit source_id does NOT fan out across sources', async () => { + await seedSameSlugTwoSources(); + const conn = getConn(); + const engine = getEngine() as PostgresEngine; + // Reset links from any prior describe block. + await conn.unsafe(`DELETE FROM links`); + const inserted = await engine.addLinksBatch([ + { from_slug: 'topics/ai', to_slug: 'topics/ml', link_type: 'mention' }, + ]); + // Exactly one edge (default → default). Before the fix this was 2. 
+ expect(inserted).toBe(1); + const rows = await conn.unsafe( + `SELECT f.source_id AS from_src, t.source_id AS to_src + FROM links l + JOIN pages f ON f.id = l.from_page_id + JOIN pages t ON t.id = l.to_page_id` + ); + expect(rows.length).toBe(1); + expect(rows[0].from_src).toBe('default'); + expect(rows[0].to_src).toBe('default'); + }); + + test('addLinksBatch supports cross-source edges when explicit source_ids differ', async () => { + const conn = getConn(); + const engine = getEngine() as PostgresEngine; + await conn.unsafe(`DELETE FROM links`); + const inserted = await engine.addLinksBatch([ + { + from_slug: 'topics/ai', to_slug: 'topics/ml', link_type: 'mention', + from_source_id: 'default', to_source_id: 'alt', + }, + ]); + expect(inserted).toBe(1); + const rows = await conn.unsafe( + `SELECT f.source_id AS from_src, t.source_id AS to_src + FROM links l + JOIN pages f ON f.id = l.from_page_id + JOIN pages t ON t.id = l.to_page_id` + ); + expect(rows.length).toBe(1); + expect(rows[0].from_src).toBe('default'); + expect(rows[0].to_src).toBe('alt'); + }); + + test('addTimelineEntriesBatch without explicit source_id does NOT fan out across sources', async () => { + const conn = getConn(); + const engine = getEngine() as PostgresEngine; + await conn.unsafe(`DELETE FROM timeline_entries`); + const inserted = await engine.addTimelineEntriesBatch([ + { slug: 'topics/ai', date: '2024-01-15', summary: 'Founded' }, + ]); + expect(inserted).toBe(1); + const rows = await conn.unsafe( + `SELECT p.source_id + FROM timeline_entries te + JOIN pages p ON p.id = te.page_id` + ); + expect(rows.length).toBe(1); + expect(rows[0].source_id).toBe('default'); + }); + + test('addTimelineEntriesBatch with explicit alt source_id lands only in alt', async () => { + const conn = getConn(); + const engine = getEngine() as PostgresEngine; + await conn.unsafe(`DELETE FROM timeline_entries`); + const inserted = await engine.addTimelineEntriesBatch([ + { slug: 'topics/ai', date: 
'2024-02-01', summary: 'Alt-only event', source_id: 'alt' }, + ]); + expect(inserted).toBe(1); + const rows = await conn.unsafe( + `SELECT p.source_id + FROM timeline_entries te + JOIN pages p ON p.id = te.page_id` + ); + expect(rows.length).toBe(1); + expect(rows[0].source_id).toBe('alt'); + }); +}); diff --git a/test/link-extraction.test.ts b/test/link-extraction.test.ts index 84f45cf..5f18dd7 100644 --- a/test/link-extraction.test.ts +++ b/test/link-extraction.test.ts @@ -609,3 +609,68 @@ describe('FRONTMATTER_LINK_MAP integrity', () => { expect(m!.dirHint).toContain('people'); }); }); + + +// ───────────────────────────────────────────────────────────────── +// v0.18.0 Step 4 — qualified wikilink syntax [[source-id:dir/slug]] +// ───────────────────────────────────────────────────────────────── +describe("extractEntityRefs — v0.18.0 qualified wikilinks", () => { + test("[[wiki:topics/ai]] extracts with sourceId=wiki", () => { + const refs = extractEntityRefs("See [[concepts/ai]] vs [[wiki:concepts/ai]] for wiki-specific take."); + // One unqualified + one qualified. + expect(refs.length).toBe(2); + const qual = refs.find(r => r.sourceId === "wiki"); + expect(qual).toBeDefined(); + expect(qual!.slug).toBe("concepts/ai"); + expect(qual!.name).toBe("concepts/ai"); + const unqual = refs.find(r => r.sourceId === undefined); + expect(unqual).toBeDefined(); + expect(unqual!.slug).toBe("concepts/ai"); + }); + + test("[[gstack:projects/foo|Display Name]] preserves display + sourceId", () => { + const refs = extractEntityRefs("See [[gstack:projects/foo|The Foo Project]] for details."); + expect(refs.length).toBe(1); + expect(refs[0]).toEqual({ name: "The Foo Project", slug: "projects/foo", dir: "projects", sourceId: "gstack" }); + }); + + test("qualified source-id format is validated (must match [a-z0-9-]+ kebab rules)", () => { + // Uppercase source IDs are not qualified — fall through to unqualified wikilink or no match. 
+ const refs = extractEntityRefs("Legit: [[yc-media:concepts/seed]] Not legit: [[NotValid:concepts/x]]"); + const qualified = refs.filter(r => r.sourceId); + expect(qualified.length).toBe(1); + expect(qualified[0].sourceId).toBe("yc-media"); + }); + + test("masking prevents unqualified regex from matching inside a qualified link", () => { + // Without the mask, [[wiki:concepts/ai]] could also match as + // unqualified with slug "wiki:concepts/ai" (invalid dir) — the + // DIR_PATTERN whitelist normally blocks it, but masking is + // defense-in-depth. + const refs = extractEntityRefs("Ref: [[wiki:concepts/ai]]"); + expect(refs.length).toBe(1); + expect(refs[0].sourceId).toBe("wiki"); + }); + + test("markdown [Name](path) links always have no sourceId (unqualified by shape)", () => { + const refs = extractEntityRefs("[Alice](people/alice-chen) met [[wiki:people/bob]]"); + const mdLink = refs.find(r => r.slug === "people/alice-chen"); + expect(mdLink!.sourceId).toBeUndefined(); + const wiki = refs.find(r => r.slug === "people/bob"); + expect(wiki!.sourceId).toBe("wiki"); + }); +}); + +describe("v0.18.0 migration v22 — links_resolution_type", () => { + test("migration v22 exists with CHECK constraint", async () => { + const { MIGRATIONS } = await import("../src/core/migrate.ts"); + const v22 = MIGRATIONS.find(m => m.version === 22); + expect(v22).toBeDefined(); + expect(v22!.name).toBe("links_resolution_type"); + expect(v22!.sql).toContain("ADD COLUMN IF NOT EXISTS resolution_type"); + expect(v22!.sql).toContain("links_resolution_type_check"); + expect(v22!.sql).toContain("qualified"); + expect(v22!.sql).toContain("unqualified"); + }); +}); + diff --git a/test/migrate.test.ts b/test/migrate.test.ts index b27213d..d3e47af 100644 --- a/test/migrate.test.ts +++ b/test/migrate.test.ts @@ -16,6 +16,162 @@ describe('migrate', () => { // and are covered in the E2E suite (test/e2e/mechanical.test.ts) }); +// ───────────────────────────────────────────────────────────────── +// 
v0.18.0 — v20 sources_table_additive (Step 1, Lane A) +// ───────────────────────────────────────────────────────────────── +// v20 is the ADDITIVE-ONLY migration: it installs the sources primitive +// without breaking the engine's existing ON CONFLICT (slug) upserts. +// The breaking schema changes (pages.source_id NOT NULL, composite +// UNIQUE, files.page_slug → page_id, file_migration_ledger, +// links.resolution_type) land in v21-v23 alongside the engine API rewrite +// so the engine can execute the new ON CONFLICT (source_id, slug) +// atomically with the schema change. +// ───────────────────────────────────────────────────────────────── +describe('migrate v20 — sources_table_additive', () => { + const v20 = MIGRATIONS.find(m => m.version === 20); + + test('v20 exists', () => { + expect(v20).toBeDefined(); + expect(v20!.name).toBe('sources_table_additive'); + }); + + test('v20 creates sources table', () => { + expect(v20!.sql).toContain('CREATE TABLE IF NOT EXISTS sources'); + expect(v20!.sql).toContain('id TEXT PRIMARY KEY'); + expect(v20!.sql).toContain('name TEXT NOT NULL UNIQUE'); + expect(v20!.sql).toContain('config JSONB NOT NULL'); + }); + + test("v20 seeds 'default' source inheriting sync config", () => { + expect(v20!.sql).toContain("INSERT INTO sources (id, name, local_path, last_commit, config)"); + expect(v20!.sql).toContain("'default'"); + // The default source pulls from existing config so post-upgrade + // identity is preserved. + expect(v20!.sql).toContain("SELECT value FROM config WHERE key = 'sync.repo_path'"); + expect(v20!.sql).toContain("SELECT value FROM config WHERE key = 'sync.last_commit'"); + }); + + test('v20 default source is federated=true (backward-compat)', () => { + // federated=true ensures pre-v0.18 brains keep single-namespace + // search semantics — every page appears in unqualified search. 
+ expect(v20!.sql).toContain('"federated": true'); + }); + + test('v20 is idempotent on re-run', () => { + // CREATE TABLE IF NOT EXISTS + NOT EXISTS subquery on INSERT. + expect(v20!.sql).toContain('CREATE TABLE IF NOT EXISTS sources'); + expect(v20!.sql).toContain('WHERE NOT EXISTS (SELECT 1 FROM sources WHERE id = '); + }); + + test('v20 does NOT touch pages / ingest_log / files / links', () => { + // Step 1 is additive-only. Breaking changes deferred to v21 so they + // land with the engine rewrite (Step 2). Guard against anyone + // accidentally re-expanding v20's scope. + expect(v20!.sql).not.toContain('ALTER TABLE pages'); + expect(v20!.sql).not.toContain('ALTER TABLE ingest_log'); + expect(v20!.sql).not.toContain('ALTER TABLE files'); + expect(v20!.sql).not.toContain('ALTER TABLE links'); + expect(v20!.handler).toBeUndefined(); + }); +}); + +// ───────────────────────────────────────────────────────────────── +// v0.18.0 — v21 pages_source_id_composite_unique (Step 2, Lane B) +// ───────────────────────────────────────────────────────────────── +describe('migrate v21 — pages_source_id_composite_unique', () => { + const v21 = MIGRATIONS.find(m => m.version === 21); + + test('v21 exists and is paired with Step 2 engine rewrite', () => { + expect(v21).toBeDefined(); + expect(v21!.name).toBe('pages_source_id_composite_unique'); + }); + + test('v21 adds pages.source_id with DEFAULT default REFERENCES sources', () => { + expect(v21!.sql).toContain('ALTER TABLE pages ADD COLUMN IF NOT EXISTS source_id TEXT'); + // DEFAULT 'default' closes the race where an INSERT between ADD COLUMN + // and SET NOT NULL could leave source_id NULL (Codex second-pass review). + expect(v21!.sql).toContain("NOT NULL DEFAULT 'default' REFERENCES sources(id)"); + }); + + test('v21 swaps UNIQUE(slug) → composite UNIQUE(source_id, slug)', () => { + // ON CONFLICT (source_id, slug) in putPage relies on this swap. 
+ expect(v21!.sql).toContain('ALTER TABLE pages DROP CONSTRAINT IF EXISTS pages_slug_key'); + expect(v21!.sql).toContain('pages_source_slug_key'); + expect(v21!.sql).toContain('UNIQUE (source_id, slug)'); + }); + + test('v21 creates source-scoped index for per-source scans', () => { + expect(v21!.sql).toContain('CREATE INDEX IF NOT EXISTS idx_pages_source_id'); + }); + + test('v21 constraint add is guarded (idempotent re-run)', () => { + // DO block with IF NOT EXISTS guard means re-running the migration + // after partial failure doesn't error on the already-installed name. + expect(v21!.sql).toContain('IF NOT EXISTS'); + expect(v21!.sql).toContain("WHERE conname = 'pages_source_slug_key'"); + }); +}); + +// ───────────────────────────────────────────────────────────────── +// v0.18.0 — v23 files_source_id_page_id_ledger (Step 7, Lane E) +// ───────────────────────────────────────────────────────────────── +describe('migrate v23 — files_source_id_page_id_ledger', () => { + const v23 = MIGRATIONS.find(m => m.version === 23); + + test('v23 exists as handler-only (Postgres files table, PGLite no-op)', () => { + expect(v23).toBeDefined(); + expect(v23!.name).toBe('files_source_id_page_id_ledger'); + expect(v23!.sql).toBe(''); + expect(v23!.handler).toBeDefined(); + }); + + test('v23 handler gates on engine.kind for PGLite (no files table)', () => { + expect(v23!.handler!.toString()).toMatch(/engine\.kind\s*===\s*["']pglite["']/); + }); + + test('v23 adds files.source_id + files.page_id + ledger creation', () => { + const body = v23!.handler!.toString(); + expect(body).toContain('ALTER TABLE files ADD COLUMN IF NOT EXISTS source_id'); + expect(body).toContain('ALTER TABLE files ADD COLUMN IF NOT EXISTS page_id'); + expect(body).toContain('CREATE TABLE IF NOT EXISTS file_migration_ledger'); + }); + + test('v23 backfills files.page_id scoped to default source (Codex fix)', () => { + const body = v23!.handler!.toString(); + // Without source_id='default' scope, the JOIN 
could hit the wrong + // page after new sources with duplicate slugs are added. + expect(body).toContain('UPDATE files f'); + expect(body).toContain("p.source_id = 'default'"); + }); + + test('v23 ledger PK is file_id (Codex: two sources can share old path)', () => { + const body = v23!.handler!.toString(); + expect(body).toContain('file_id INTEGER PRIMARY KEY'); + // State machine values all present. + for (const state of ['pending', 'copy_done', 'db_updated', 'complete', 'failed']) { + expect(body).toContain(`'${state}'`); + } + }); +}); + +describe('migrate — ordering guarantee (v15 must NOT be skipped by v16)', () => { + test('runMigrations sorts by version ascending', async () => { + // Regression: if v16 preceded v15 in the MIGRATIONS array, the iterator + // would setConfig(version, 16) first, then skip v15 on the next pass. + // runMigrations applies a defensive sort so array order doesn't matter. + // This test asserts v15 exists (if we broke the sort, v15 would still + // exist in MIGRATIONS but would never apply at runtime). + const v15 = MIGRATIONS.find(m => m.version === 15); + const v20 = MIGRATIONS.find(m => m.version === 20); + expect(v15).toBeDefined(); + expect(v20).toBeDefined(); + // Sanity: versions are distinct and progress. + const versions = MIGRATIONS.map(m => m.version); + const uniq = new Set(versions); + expect(uniq.size).toBe(versions.length); + }); +}); + // ───────────────────────────────────────────────────────────────── // REGRESSION TESTS — migrations v8 + v9 perf on duplicate-heavy tables // ───────────────────────────────────────────────────────────────── diff --git a/test/multi-source-integration.test.ts b/test/multi-source-integration.test.ts new file mode 100644 index 0000000..f6fb9ba --- /dev/null +++ b/test/multi-source-integration.test.ts @@ -0,0 +1,244 @@ +/** + * v0.18.0 Step 9 — multi-source integration test against real PGLite. 
+ * + * Exercises the full Step-1-through-Step-7 surface: + * - migration v20 seeds the default source with federated=true + * - migration v21 adds pages.source_id + composite UNIQUE + * - migration v22 adds links.resolution_type column + * - putPage implicitly targets the default source via the + * schema DEFAULT 'default' clause + * - raw INSERT can write pages to a non-default source and the + * composite UNIQUE allows same-slug pages across sources + * - sources CLI add/list/federate operations are reflected in DB + * - federated flag distinguishes unqualified-search-visibility + * + * PGLite-only (fast + zero deps). Real Postgres parity lives in + * test/e2e/mechanical.test.ts when DATABASE_URL is set. + */ + +import { describe, test, expect, beforeAll, afterAll } from 'bun:test'; +import { PGLiteEngine } from '../src/core/pglite-engine.ts'; +import { runSources } from '../src/commands/sources.ts'; +import { resolveSourceId } from '../src/core/source-resolver.ts'; + +let engine: PGLiteEngine; + +beforeAll(async () => { + engine = new PGLiteEngine(); + await engine.connect({ type: 'pglite' } as never); + await engine.initSchema(); +}); + +afterAll(async () => { + await engine.disconnect(); +}); + +describe('v0.18.0 — sources table seeded with default row on fresh PGLite', () => { + test("sources('default') exists after initSchema + migration", async () => { + const rows = await engine.executeRaw<{ id: string; name: string; config: string | Record<string, unknown> }>( + `SELECT id, name, config FROM sources WHERE id = 'default'`, + ); + expect(rows.length).toBe(1); + expect(rows[0].name).toBe('default'); + const config = typeof rows[0].config === 'string' ? 
JSON.parse(rows[0].config) : rows[0].config; + expect(config.federated).toBe(true); + }); + + test('pages.source_id column exists with DEFAULT default', async () => { + const rows = await engine.executeRaw<{ column_default: string | null }>( + `SELECT column_default FROM information_schema.columns + WHERE table_name = 'pages' AND column_name = 'source_id'`, + ); + expect(rows.length).toBe(1); + // PGLite normalizes the default literal. + expect(rows[0].column_default).toContain('default'); + }); + + test('composite UNIQUE (source_id, slug) is installed', async () => { + const rows = await engine.executeRaw<{ conname: string }>( + `SELECT conname FROM pg_constraint WHERE conname = 'pages_source_slug_key'`, + ); + expect(rows.length).toBe(1); + }); +}); + +describe('v0.18.0 — putPage implicitly writes to default source', () => { + test('putPage without explicit source → source_id = default', async () => { + await engine.putPage('topics/step9-auto', { + type: 'concept', + title: 'Step 9 Auto', + compiled_truth: 'Auto-defaulted to default source.', + }); + const rows = await engine.executeRaw<{ source_id: string; slug: string }>( + `SELECT source_id, slug FROM pages WHERE slug = 'topics/step9-auto'`, + ); + expect(rows.length).toBe(1); + expect(rows[0].source_id).toBe('default'); + }); +}); + +describe('v0.18.0 — composite UNIQUE allows same-slug across sources', () => { + test('same slug in two different sources coexists (regression: Codex critical)', async () => { + // Insert a second source via sources CLI. + await runSources(engine, ['add', 'testsrc', '--no-federated']); + + // Sanity: default already has this slug from the previous test. + // Now write the same slug under testsrc via raw INSERT (putPage only + // targets default until a later step surfaces sourceId; raw INSERT is + // the "source-aware write" Step 5 continuation will add). 
+ await engine.executeRaw( + `INSERT INTO pages (source_id, slug, type, title, compiled_truth, timeline, frontmatter, content_hash) + VALUES ('testsrc', 'topics/step9-auto', 'concept', 'Step 9 Auto (testsrc variant)', + 'A different page with the same slug in a different source.', + '', '{}'::jsonb, 'hash2')`, + ); + + // Both rows must exist under the composite unique. + const rows = await engine.executeRaw<{ source_id: string; slug: string; title: string }>( + `SELECT source_id, slug, title FROM pages + WHERE slug = 'topics/step9-auto' + ORDER BY source_id`, + ); + expect(rows.length).toBe(2); + expect(rows.map(r => r.source_id).sort()).toEqual(['default', 'testsrc']); + }); + + test('inserting THIRD row with same (source_id, slug) hits composite UNIQUE', async () => { + let err: Error | null = null; + try { + await engine.executeRaw( + `INSERT INTO pages (source_id, slug, type, title, compiled_truth, timeline, frontmatter, content_hash) + VALUES ('testsrc', 'topics/step9-auto', 'concept', 'Dup attempt', + 'Should fail', '', '{}'::jsonb, 'hash3')`, + ); + } catch (e) { + err = e as Error; + } + expect(err).not.toBeNull(); + expect(err!.message.toLowerCase()).toMatch(/unique|duplicate/); + }); +}); + +describe('v0.18.0 — sources CLI manipulates the sources table', () => { + test('sources federate flips config.federated true', async () => { + await runSources(engine, ['federate', 'testsrc']); + const rows = await engine.executeRaw<{ config: string | Record<string, unknown> }>( + `SELECT config FROM sources WHERE id = 'testsrc'`, + ); + const config = typeof rows[0].config === 'string' ? JSON.parse(rows[0].config) : rows[0].config; + expect(config.federated).toBe(true); + }); + + test('sources unfederate flips config.federated false', async () => { + await runSources(engine, ['unfederate', 'testsrc']); + const rows = await engine.executeRaw<{ config: string | Record<string, unknown> }>( + `SELECT config FROM sources WHERE id = 'testsrc'`, + ); + const config = typeof rows[0].config === 'string' ? 
JSON.parse(rows[0].config) : rows[0].config; + expect(config.federated).toBe(false); + }); + + test('sources rename changes name but keeps id immutable', async () => { + await runSources(engine, ['rename', 'testsrc', 'Test Source']); + const rows = await engine.executeRaw<{ id: string; name: string }>( + `SELECT id, name FROM sources WHERE id = 'testsrc'`, + ); + expect(rows[0].id).toBe('testsrc'); + expect(rows[0].name).toBe('Test Source'); + }); +}); + +describe('v0.18.0 — source resolution priority (integration)', () => { + test('explicit --source flag wins when the source exists', async () => { + const id = await resolveSourceId(engine, 'testsrc'); + expect(id).toBe('testsrc'); + }); + + test('GBRAIN_SOURCE env wins when no flag', async () => { + process.env.GBRAIN_SOURCE = 'testsrc'; + try { + const id = await resolveSourceId(engine, null); + expect(id).toBe('testsrc'); + } finally { + delete process.env.GBRAIN_SOURCE; + } + }); + + test('fallback to default when nothing is set', async () => { + const id = await resolveSourceId(engine, null, '/nowhere-registered'); + expect(id).toBe('default'); + }); + + test('rejects unregistered explicit source with an actionable error', async () => { + await expect(resolveSourceId(engine, 'ghost-source')).rejects.toThrow(/not found/); + }); +}); + +describe('v0.18.0 — sources remove cascades to pages', () => { + test('removing a source cascade-deletes its pages', async () => { + const before = await engine.executeRaw<{ n: number }>( + `SELECT COUNT(*)::int AS n FROM pages WHERE source_id = 'testsrc'`, + ); + expect(before[0].n).toBeGreaterThan(0); + + await runSources(engine, ['remove', 'testsrc', '--yes']); + + const after = await engine.executeRaw<{ n: number }>( + `SELECT COUNT(*)::int AS n FROM pages WHERE source_id = 'testsrc'`, + ); + expect(after[0].n).toBe(0); + + const src = await engine.executeRaw<{ id: string }>( + `SELECT id FROM sources WHERE id = 'testsrc'`, + ); + expect(src.length).toBe(0); + + // Default 
source is untouched. + const defaultPages = await engine.executeRaw<{ n: number }>( + `SELECT COUNT(*)::int AS n FROM pages WHERE source_id = 'default'`, + ); + expect(defaultPages[0].n).toBeGreaterThan(0); + }); +}); + +describe('v0.18.0 — links.resolution_type column exists (Step 4)', () => { + test('links table accepts qualified/unqualified resolution_type', async () => { + // Create two pages, insert a link with resolution_type='qualified'. + await engine.putPage('topics/qf-a', { + type: 'concept', title: 'QA', compiled_truth: 'a', + }); + await engine.putPage('topics/qf-b', { + type: 'concept', title: 'QB', compiled_truth: 'b', + }); + await engine.executeRaw( + `INSERT INTO links (from_page_id, to_page_id, link_type, context, link_source, resolution_type) + SELECT a.id, b.id, 'ref', '', 'markdown', 'qualified' + FROM pages a, pages b + WHERE a.slug = 'topics/qf-a' AND b.slug = 'topics/qf-b' + AND a.source_id = 'default' AND b.source_id = 'default'`, + ); + const rows = await engine.executeRaw<{ resolution_type: string }>( + `SELECT l.resolution_type + FROM links l + JOIN pages a ON a.id = l.from_page_id + WHERE a.slug = 'topics/qf-a'`, + ); + expect(rows.length).toBe(1); + expect(rows[0].resolution_type).toBe('qualified'); + }); + + test('links CHECK constraint rejects invalid resolution_type values', async () => { + let err: Error | null = null; + try { + await engine.executeRaw( + `INSERT INTO links (from_page_id, to_page_id, link_type, resolution_type) + SELECT a.id, a.id, 'self', 'bogus-value' + FROM pages a WHERE a.slug = 'topics/qf-a' AND a.source_id = 'default'`, + ); + } catch (e) { + err = e as Error; + } + expect(err).not.toBeNull(); + expect(err!.message.toLowerCase()).toMatch(/check|constraint/); + }); +}); diff --git a/test/pglite-engine.test.ts b/test/pglite-engine.test.ts index 3caf032..8f887f6 100644 --- a/test/pglite-engine.test.ts +++ b/test/pglite-engine.test.ts @@ -462,6 +462,119 @@ describe('PGLiteEngine: addTimelineEntriesBatch', () => { 
}); }); +// v0.18.0: regression guards for the cross-source JOIN fan-out. +// Before the fix, addLinksBatch/addTimelineEntriesBatch JOINed on pages.slug +// only — so a page with the same slug in two sources would fan out and +// silently create duplicate edges / entries. Source-id-qualified JOINs +// eliminate the fan-out. +describe('PGLiteEngine: batch ops source-awareness (v0.18.0)', () => { + beforeEach(async () => { + await truncateAll(); + // Register a second source and populate the same slugs in both. + const db = (engine as any).db; + await db.query( + `INSERT INTO sources (id, name) VALUES ('alt', 'alt') + ON CONFLICT (id) DO NOTHING` + ); + // default-source rows via putPage (schema DEFAULT 'default'). + await engine.putPage('topics/ai', { type: 'concept', title: 'AI (default)', compiled_truth: '', timeline: '' }); + await engine.putPage('topics/ml', { type: 'concept', title: 'ML (default)', compiled_truth: '', timeline: '' }); + // alt-source rows with the same slugs, inserted via raw SQL. + await db.query( + `INSERT INTO pages (slug, type, title, compiled_truth, timeline, frontmatter, content_hash, source_id, updated_at) + VALUES ('topics/ai', 'concept', 'AI (alt)', '', '', '{}'::jsonb, 'h1', 'alt', now()), + ('topics/ml', 'concept', 'ML (alt)', '', '', '{}'::jsonb, 'h2', 'alt', now())` + ); + }); + + test('addLinksBatch default source_id does NOT fan out across sources', async () => { + const inserted = await engine.addLinksBatch([ + { from_slug: 'topics/ai', to_slug: 'topics/ml', link_type: 'mention' }, + ]); + // Exactly one edge, not two. Before the fix this was 2. 
+ expect(inserted).toBe(1); + const db = (engine as any).db; + const { rows } = await db.query( + `SELECT f.source_id AS from_src, t.source_id AS to_src + FROM links l + JOIN pages f ON f.id = l.from_page_id + JOIN pages t ON t.id = l.to_page_id` + ); + expect(rows.length).toBe(1); + expect(rows[0].from_src).toBe('default'); + expect(rows[0].to_src).toBe('default'); + }); + + test('addLinksBatch with explicit alt source_id lands in alt only', async () => { + const inserted = await engine.addLinksBatch([ + { + from_slug: 'topics/ai', to_slug: 'topics/ml', link_type: 'mention', + from_source_id: 'alt', to_source_id: 'alt', + }, + ]); + expect(inserted).toBe(1); + const db = (engine as any).db; + const { rows } = await db.query( + `SELECT f.source_id AS from_src, t.source_id AS to_src + FROM links l + JOIN pages f ON f.id = l.from_page_id + JOIN pages t ON t.id = l.to_page_id` + ); + expect(rows.length).toBe(1); + expect(rows[0].from_src).toBe('alt'); + expect(rows[0].to_src).toBe('alt'); + }); + + test('addLinksBatch supports cross-source edges', async () => { + const inserted = await engine.addLinksBatch([ + { + from_slug: 'topics/ai', to_slug: 'topics/ml', link_type: 'mention', + from_source_id: 'default', to_source_id: 'alt', + }, + ]); + expect(inserted).toBe(1); + const db = (engine as any).db; + const { rows } = await db.query( + `SELECT f.source_id AS from_src, t.source_id AS to_src + FROM links l + JOIN pages f ON f.id = l.from_page_id + JOIN pages t ON t.id = l.to_page_id` + ); + expect(rows.length).toBe(1); + expect(rows[0].from_src).toBe('default'); + expect(rows[0].to_src).toBe('alt'); + }); + + test('addTimelineEntriesBatch default source_id does NOT fan out across sources', async () => { + const inserted = await engine.addTimelineEntriesBatch([ + { slug: 'topics/ai', date: '2024-01-15', summary: 'Founded' }, + ]); + // Exactly one entry (default source), not two. Before the fix this was 2. 
+ expect(inserted).toBe(1); + const db = (engine as any).db; + const { rows } = await db.query( + `SELECT p.source_id FROM timeline_entries te + JOIN pages p ON p.id = te.page_id` + ); + expect(rows.length).toBe(1); + expect(rows[0].source_id).toBe('default'); + }); + + test('addTimelineEntriesBatch with explicit alt source_id lands in alt only', async () => { + const inserted = await engine.addTimelineEntriesBatch([ + { slug: 'topics/ai', date: '2024-01-15', summary: 'Founded', source_id: 'alt' }, + ]); + expect(inserted).toBe(1); + const db = (engine as any).db; + const { rows } = await db.query( + `SELECT p.source_id FROM timeline_entries te + JOIN pages p ON p.id = te.page_id` + ); + expect(rows.length).toBe(1); + expect(rows[0].source_id).toBe('alt'); + }); +}); + // ───────────────────────────────────────────────────────────────── // Raw Data, Versions, Config, IngestLog // ───────────────────────────────────────────────────────────────── diff --git a/test/source-resolver.test.ts b/test/source-resolver.test.ts new file mode 100644 index 0000000..152c395 --- /dev/null +++ b/test/source-resolver.test.ts @@ -0,0 +1,190 @@ +/** + * v0.18.0 Step 6 — source resolution priority tests. + * + * Priority order (highest first): + * 1. Explicit --source flag + * 2. GBRAIN_SOURCE env var + * 3. .gbrain-source dotfile walk-up + * 4. Registered source whose local_path contains CWD (longest prefix wins) + * 5. Brain-level `sources.default` config key + * 6. 
Fallback: literal 'default' + */ + +import { describe, test, expect, beforeEach, afterEach } from 'bun:test'; +import { mkdirSync, mkdtempSync, rmSync, writeFileSync } from 'fs'; +import { join } from 'path'; +import { tmpdir } from 'os'; +import { resolveSourceId, __testing } from '../src/core/source-resolver.ts'; +import type { BrainEngine } from '../src/core/engine.ts'; + +// ── Stub engine ──────────────────────────────────────────── + +function makeStub(registeredSources: string[], paths: Array<{ id: string; local_path: string }>, defaultKey: string | null): BrainEngine { + return { + kind: 'pglite', + executeRaw: async <T>(sql: string, params?: unknown[]): Promise<T[]> => { + if (sql.includes('SELECT id FROM sources WHERE id = $1')) { + const target = params?.[0]; + return (registeredSources.includes(target as string) + ? [{ id: target } as unknown as T] + : []); + } + if (sql.includes('SELECT id, local_path FROM sources')) { + return paths as unknown as T[]; + } + return []; + }, + getConfig: async (key: string) => (key === 'sources.default' ?
defaultKey : null), + } as unknown as BrainEngine; +} + +// ── Priority 1: explicit flag ────────────────────────────── + +describe('resolveSourceId priority 1 — explicit flag', () => { + test('wins over every other signal', async () => { + const engine = makeStub(['default', 'gstack', 'wiki'], [{ id: 'wiki', local_path: '/tmp' }], 'gstack'); + process.env.GBRAIN_SOURCE = 'wiki'; + try { + const id = await resolveSourceId(engine, 'gstack', '/tmp/whatever'); + expect(id).toBe('gstack'); + } finally { + delete process.env.GBRAIN_SOURCE; + } + }); + + test('rejects unregistered explicit source with actionable error', async () => { + const engine = makeStub(['default'], [], null); + await expect(resolveSourceId(engine, 'ghost')).rejects.toThrow(/not found/); + }); + + test('rejects invalid format', async () => { + const engine = makeStub(['default'], [], null); + await expect(resolveSourceId(engine, 'WRONG-case!')).rejects.toThrow(/Invalid --source/); + }); +}); + +// ── Priority 2: env var ──────────────────────────────────── + +describe('resolveSourceId priority 2 — GBRAIN_SOURCE env', () => { + test('wins over dotfile / registered-path / default', async () => { + const engine = makeStub(['default', 'env-wins'], [{ id: 'other', local_path: '/tmp' }], 'default'); + process.env.GBRAIN_SOURCE = 'env-wins'; + try { + const id = await resolveSourceId(engine, null, '/tmp/x'); + expect(id).toBe('env-wins'); + } finally { + delete process.env.GBRAIN_SOURCE; + } + }); +}); + +// ── Priority 3: dotfile walk-up ──────────────────────────── + +describe('resolveSourceId priority 3 — .gbrain-source dotfile walk-up', () => { + let tmpdirPath: string; + + beforeEach(() => { + tmpdirPath = mkdtempSync(join(tmpdir(), 'gbrain-resolver-test-')); + }); + afterEach(() => { + rmSync(tmpdirPath, { recursive: true, force: true }); + }); + + test('finds dotfile in CWD', async () => { + writeFileSync(join(tmpdirPath, '.gbrain-source'), 'gstack\n'); + const engine = makeStub(['default', 
'gstack'], [], null); + const id = await resolveSourceId(engine, null, tmpdirPath); + expect(id).toBe('gstack'); + }); + + test('walks up ancestors to find dotfile', async () => { + writeFileSync(join(tmpdirPath, '.gbrain-source'), 'wiki\n'); + const deep = join(tmpdirPath, 'a', 'b', 'c'); + mkdirSync(deep, { recursive: true }); + const engine = makeStub(['default', 'wiki'], [], null); + const id = await resolveSourceId(engine, null, deep); + expect(id).toBe('wiki'); + }); + + test('ignores dotfile with invalid content', async () => { + writeFileSync(join(tmpdirPath, '.gbrain-source'), 'INVALID!\n'); + const engine = makeStub(['default'], [], null); + const id = await resolveSourceId(engine, null, tmpdirPath); + expect(id).toBe('default'); + }); +}); + +// ── Priority 4: registered local_path match (longest prefix) ── + +describe('resolveSourceId priority 4 — registered local_path longest-prefix match', () => { + test('picks registered source whose local_path contains CWD', async () => { + const engine = makeStub( + ['default', 'gstack'], + [{ id: 'gstack', local_path: '/tmp/gstack' }], + null, + ); + const id = await resolveSourceId(engine, null, '/tmp/gstack/plans/foo'); + expect(id).toBe('gstack'); + }); + + test('longest prefix wins when paths are nested (per Codex second pass)', async () => { + // Codex flagged: overlapping paths need longest-prefix resolution. + // If gstack at /tmp/gstack and plans at /tmp/gstack/plans both + // exist, CWD inside plans/ must pick plans. 
+ const engine = makeStub( + ['default', 'gstack', 'plans'], + [ + { id: 'gstack', local_path: '/tmp/gstack' }, + { id: 'plans', local_path: '/tmp/gstack/plans' }, + ], + null, + ); + const id = await resolveSourceId(engine, null, '/tmp/gstack/plans/deeper'); + expect(id).toBe('plans'); + }); + + test("CWD outside any registered path falls through to default", async () => { + const engine = makeStub( + ['default', 'gstack'], + [{ id: 'gstack', local_path: '/tmp/gstack' }], + null, + ); + const id = await resolveSourceId(engine, null, '/some/other/dir'); + expect(id).toBe('default'); + }); +}); + +// ── Priority 5: brain-level default ──────────────────────── + +describe('resolveSourceId priority 5 — sources.default config key', () => { + test("returns configured default when no higher signal present", async () => { + const engine = makeStub(['default', 'custom'], [], 'custom'); + const id = await resolveSourceId(engine, null, '/some/random/dir'); + expect(id).toBe('custom'); + }); +}); + +// ── Priority 6: fallback ──────────────────────────────────── + +describe('resolveSourceId priority 6 — fallback', () => { + test("returns 'default' when no signal at all", async () => { + const engine = makeStub(['default'], [], null); + const id = await resolveSourceId(engine, null, '/random/dir'); + expect(id).toBe('default'); + }); +}); + +// ── Regex validation ─────────────────────────────────────── + +describe('SOURCE_ID_RE', () => { + test('accepts valid ids', () => { + for (const id of ['default', 'wiki', 'gstack', 'yc-media', 'garrys-list', 'a', '123']) { + expect(__testing.SOURCE_ID_RE.test(id)).toBe(true); + } + }); + test('rejects invalid ids', () => { + for (const id of ['', 'a'.repeat(33), 'Upper', 'has_underscore', 'trailing-', '-leading', 'with spaces', 'with.dots']) { + expect(__testing.SOURCE_ID_RE.test(id)).toBe(false); + } + }); +}); diff --git a/test/sources.test.ts b/test/sources.test.ts new file mode 100644 index 0000000..cd49ccd --- /dev/null +++ 
b/test/sources.test.ts @@ -0,0 +1,252 @@ +/** + * v0.18.0 Step 6 — sources CLI subcommand tests. + * + * Pure unit tests that exercise the subcommand dispatcher via a + * stub BrainEngine. No DB required — we just confirm the SQL + * shape, validation, and flag parsing. + */ + +import { describe, test, expect, beforeEach } from 'bun:test'; +import { runSources } from '../src/commands/sources.ts'; +import type { BrainEngine } from '../src/core/engine.ts'; + +// ── Stub engine that records queries ─────────────────────── + +interface RecordedCall { + sql: string; + params: unknown[]; +} + +function makeStub(rowsByPattern: Record<string, unknown[]> = {}): { + engine: BrainEngine; + calls: RecordedCall[]; + configSet: Array<{ key: string; value: string }>; +} { + const calls: RecordedCall[] = []; + const configSet: Array<{ key: string; value: string }> = []; + + const executeRaw = async (sql: string, params?: unknown[]) => { + calls.push({ sql, params: params ?? [] }); + // Match by substring so tests are robust against whitespace. + for (const [pattern, rows] of Object.entries(rowsByPattern)) { + if (sql.includes(pattern)) return rows as never; + } + return [] as never; + }; + + const setConfig = async (key: string, value: string) => { + configSet.push({ key, value }); + }; + + // Minimal BrainEngine stub — only the methods sources.ts touches. + const engine = { + kind: 'pglite' as const, + executeRaw, + setConfig, + // getConfig is stubbed to return null; any other method is absent, so an accidental call throws. + getConfig: async () => null, + } as unknown as BrainEngine; + + return { engine, calls, configSet }; +} + +// ── add ───────────────────────────────────────────────────── + +// Intercept process.exit so unit tests under bun:test don't actually +// exit. Each test that might trigger process.exit() wraps its call in +// `withExitCapture`. We only return when the function under test returns +// or throws; process.exit() is turned into a recoverable throw.
+async function withExitCapture(fn: () => Promise<unknown>): Promise<number | null> { + const origExit = process.exit; + let captured: number | null = null; + process.exit = ((code?: number) => { + captured = code ?? 0; + throw new Error('__process_exit__'); + }) as never; + try { + await fn(); + } catch (e) { + if (!(e instanceof Error) || !e.message.includes('__process_exit__')) throw e; + } finally { + process.exit = origExit; + } + return captured; +} + +describe('sources add', () => { + test('rejects a missing id', async () => { + const { engine } = makeStub(); + const code = await withExitCapture(() => runSources(engine, ['add'])); + expect(code).toBe(2); + }); + + test('rejects uppercase / invalid chars in id', async () => { + const { engine } = makeStub(); + await expect(runSources(engine, ['add', 'BadId', '--path', '/tmp/x'])).rejects.toThrow(/Invalid source id/); + }); + + test('rejects id longer than 32 chars', async () => { + const { engine } = makeStub(); + const long = 'a'.repeat(33); + await expect(runSources(engine, ['add', long, '--path', '/tmp/x'])).rejects.toThrow(/Invalid source id/); + }); + + test('inserts a valid source with defaults (federated unset → isolated)', async () => { + const { engine, calls } = makeStub({ + 'SELECT id, name, local_path, last_commit, last_sync_at, config, created_at': [{ + id: 'gstack', + name: 'gstack', + local_path: '/tmp/gstack', + last_commit: null, + last_sync_at: null, + config: '{}', + created_at: new Date(), + }], + }); + await runSources(engine, ['add', 'gstack', '--path', '/tmp/gstack']); + const insert = calls.find(c => c.sql.includes('INSERT INTO sources')); + expect(insert).toBeDefined(); + expect(insert!.params[0]).toBe('gstack'); + expect(insert!.params[1]).toBe('gstack'); // name defaults to id + expect(insert!.params[2]).toBe('/tmp/gstack'); + expect(insert!.params[3]).toBe('{}'); // federated unset → empty config + }); + + test('--federated sets config.federated = true', async () => { + const { engine, calls } = makeStub({
+ 'SELECT id, name, local_path, last_commit, last_sync_at, config, created_at': [{ + id: 'wiki', + name: 'wiki', + local_path: '/tmp/wiki', + last_commit: null, + last_sync_at: null, + config: '{"federated":true}', + created_at: new Date(), + }], + }); + await runSources(engine, ['add', 'wiki', '--path', '/tmp/wiki', '--federated']); + const insert = calls.find(c => c.sql.includes('INSERT INTO sources')); + expect(insert!.params[3]).toBe('{"federated":true}'); + }); + + test('--no-federated sets config.federated = false (isolation opt-in)', async () => { + const { engine, calls } = makeStub({ + 'SELECT id, name, local_path, last_commit, last_sync_at, config, created_at': [{ + id: 'yc-media', + name: 'yc-media', + local_path: '/tmp/yc', + last_commit: null, + last_sync_at: null, + config: '{"federated":false}', + created_at: new Date(), + }], + }); + await runSources(engine, ['add', 'yc-media', '--path', '/tmp/yc', '--no-federated']); + const insert = calls.find(c => c.sql.includes('INSERT INTO sources')); + expect(insert!.params[3]).toBe('{"federated":false}'); + }); + + test('rejects overlapping paths (per eng review finding 4.1)', async () => { + const { engine } = makeStub({ + 'SELECT id, local_path FROM sources WHERE local_path': [ + { id: 'gstack', local_path: '/tmp/gstack' }, + ], + }); + // New source at /tmp/gstack/plans is inside existing gstack at /tmp/gstack. 
+ await expect(runSources(engine, ['add', 'plans', '--path', '/tmp/gstack/plans'])) + .rejects.toThrow(/overlaps with existing source "gstack"/); + }); +}); + +// ── list ──────────────────────────────────────────────────── + +describe('sources list', () => { + test('orders default source first, then alphabetical', async () => { + const { engine, calls } = makeStub({ + 'SELECT id, name, local_path, last_commit, last_sync_at, config, created_at': [ + { id: 'default', name: 'default', local_path: null, last_commit: null, last_sync_at: null, config: '{"federated":true}', created_at: new Date() }, + ], + 'COUNT(*)::int AS n FROM pages': [{ n: 0 }], + }); + await runSources(engine, ['list']); + const select = calls.find(c => c.sql.includes('ORDER BY (id = \'default\') DESC')); + expect(select).toBeDefined(); + }); +}); + +// ── remove ────────────────────────────────────────────────── + +describe('sources remove', () => { + test("refuses to remove the 'default' source", async () => { + const { engine } = makeStub(); + const code = await withExitCapture(() => runSources(engine, ['remove', 'default', '--yes'])); + expect(code).toBe(3); + }); + + test('refuses without --yes', async () => { + const { engine } = makeStub({ + 'SELECT id, name, local_path, last_commit, last_sync_at, config, created_at': [ + { id: 'gstack', name: 'gstack', local_path: '/tmp/g', last_commit: null, last_sync_at: null, config: '{}', created_at: new Date() }, + ], + 'COUNT(*)::int AS n FROM pages': [{ n: 10 }], + }); + const code = await withExitCapture(() => runSources(engine, ['remove', 'gstack'])); + expect(code).toBe(5); + }); + + test('--dry-run reports but does not DELETE', async () => { + const { engine, calls } = makeStub({ + 'SELECT id, name, local_path, last_commit, last_sync_at, config, created_at': [ + { id: 'gstack', name: 'gstack', local_path: '/tmp/g', last_commit: null, last_sync_at: null, config: '{}', created_at: new Date() }, + ], + 'COUNT(*)::int AS n FROM pages': [{ n: 10 }], + 
}); + await runSources(engine, ['remove', 'gstack', '--dry-run']); + const del = calls.find(c => c.sql.startsWith('DELETE FROM sources')); + expect(del).toBeUndefined(); + }); +}); + +// ── default ───────────────────────────────────────────────── + +describe('sources default', () => { + test("stores id in config key 'sources.default'", async () => { + const { engine, configSet } = makeStub({ + 'SELECT id, name, local_path, last_commit, last_sync_at, config, created_at': [ + { id: 'gstack', name: 'gstack', local_path: null, last_commit: null, last_sync_at: null, config: '{}', created_at: new Date() }, + ], + }); + await runSources(engine, ['default', 'gstack']); + expect(configSet).toEqual([{ key: 'sources.default', value: 'gstack' }]); + }); +}); + +// ── federate / unfederate ────────────────────────────────── + +describe('sources federate / unfederate', () => { + test('federate sets config.federated = true', async () => { + const { engine, calls } = makeStub({ + 'SELECT id, name, local_path, last_commit, last_sync_at, config, created_at': [ + { id: 'gstack', name: 'gstack', local_path: null, last_commit: null, last_sync_at: null, config: '{}', created_at: new Date() }, + ], + }); + await runSources(engine, ['federate', 'gstack']); + const upd = calls.find(c => c.sql.includes('UPDATE sources SET config')); + expect(upd).toBeDefined(); + expect(JSON.parse(upd!.params[0] as string)).toEqual({ federated: true }); + }); + + test('unfederate preserves other config keys', async () => { + const { engine, calls } = makeStub({ + 'SELECT id, name, local_path, last_commit, last_sync_at, config, created_at': [ + { id: 'gstack', name: 'gstack', local_path: null, last_commit: null, last_sync_at: null, config: '{"ttl_days":90,"federated":true}', created_at: new Date() }, + ], + }); + await runSources(engine, ['unfederate', 'gstack']); + const upd = calls.find(c => c.sql.includes('UPDATE sources SET config')); + const parsed = JSON.parse(upd!.params[0] as string); + // Must 
preserve ttl_days while flipping federated. + expect(parsed.ttl_days).toBe(90); + expect(parsed.federated).toBe(false); + }); +}); diff --git a/test/storage-backfill.test.ts b/test/storage-backfill.test.ts new file mode 100644 index 0000000..415c82d --- /dev/null +++ b/test/storage-backfill.test.ts @@ -0,0 +1,213 @@ +/** + * v0.18.0 Step 7 — file_migration_ledger state-machine unit tests. + * + * No real storage — we stub a StorageBackend that records every + * call so we can assert the crash-point recovery semantics without + * touching S3/Supabase. + */ + +import { describe, test, expect } from 'bun:test'; +import { runStorageBackfill } from '../src/commands/migrations/v0_18_0-storage-backfill.ts'; +import type { BrainEngine } from '../src/core/engine.ts'; +import type { StorageBackend } from '../src/core/storage.ts'; + +interface StubLedgerRow { + file_id: number; + storage_path_old: string; + storage_path_new: string; + status: 'pending' | 'copy_done' | 'db_updated' | 'complete' | 'failed'; + error?: string | null; +} + +function makeEngine(initial: StubLedgerRow[]): { engine: BrainEngine; rows: StubLedgerRow[]; filePaths: Map<number, string> } { + const rows: StubLedgerRow[] = initial.map(r => ({ ...r })); + const filePaths = new Map<number, string>(); // file_id → current storage_path + + const executeRaw = async <T>(sql: string, params?: unknown[]): Promise<T[]> => { + const up = sql.trim().toUpperCase(); + // Read ledger + if (up.startsWith('SELECT FILE_ID')) { + return rows.map(r => ({ ...r })) as unknown as T[]; + } + // UPDATE ledger SET status = 'copy_done' + if (sql.includes("SET status = 'copy_done'")) { + const row = rows.find(r => r.file_id === params?.[0]); + if (row) row.status = 'copy_done'; + return []; + } + if (sql.includes("SET status = 'db_updated'")) { + const row = rows.find(r => r.file_id === params?.[0]); + if (row) row.status = 'db_updated'; + return []; + } + if (sql.includes("SET status = 'complete'")) { + const row = rows.find(r => r.file_id === params?.[0]); + if (row)
row.status = 'complete'; + return []; + } + if (sql.includes('SET status = $1') && sql.includes("'failed'")) { + // Older form with parametric status + return []; + } + if (sql.includes("SET status = 'failed'")) { + const row = rows.find(r => r.file_id === params?.[1]); + if (row) { row.status = 'failed'; row.error = params?.[0] as string; } + return []; + } + // UPDATE files SET storage_path = $1 WHERE id = $2 + if (up.startsWith('UPDATE FILES')) { + filePaths.set(params?.[1] as number, params?.[0] as string); + return []; + } + return []; + }; + + const engine = { kind: 'postgres' as const, executeRaw } as unknown as BrainEngine; + return { engine, rows, filePaths }; +} + +function makeStorage(): { storage: StorageBackend; calls: string[] } { + const calls: string[] = []; + const uploaded = new Set(); + const storage: StorageBackend = { + upload: async (path: string) => { calls.push(`upload:${path}`); uploaded.add(path); }, + download: async (path: string) => { calls.push(`download:${path}`); return Buffer.from('content-for:' + path); }, + delete: async (path: string) => { calls.push(`delete:${path}`); uploaded.delete(path); }, + exists: async (path: string) => { calls.push(`exists:${path}`); return uploaded.has(path); }, + list: async () => [], + getUrl: async (p) => `https://test/${p}`, + }; + return { storage, calls }; +} + +describe('runStorageBackfill — happy path', () => { + test('advances pending → copy_done → db_updated → complete', async () => { + const { engine, rows, filePaths } = makeEngine([ + { file_id: 1, storage_path_old: 'slug/foo.pdf', storage_path_new: 'default/slug/foo.pdf', status: 'pending' }, + ]); + const { storage, calls } = makeStorage(); + + const report = await runStorageBackfill(engine, storage); + + expect(report.total).toBe(1); + expect(report.nowComplete).toBe(1); + expect(report.failed).toBe(0); + expect(rows[0].status).toBe('complete'); + expect(filePaths.get(1)).toBe('default/slug/foo.pdf'); + // Storage operations: exists-check 
then download + upload (no delete yet, + // old objects preserved for soak window). + expect(calls.filter(c => c.startsWith('download:'))).toEqual(['download:slug/foo.pdf']); + expect(calls.filter(c => c.startsWith('upload:'))).toEqual(['upload:default/slug/foo.pdf']); + expect(calls.filter(c => c.startsWith('delete:'))).toEqual([]); + }); +}); + +describe('runStorageBackfill — crash-point recovery (per Codex second pass)', () => { + test('resumes from copy_done (crash AFTER copy, BEFORE DB update)', async () => { + const { engine, rows, filePaths } = makeEngine([ + { file_id: 1, storage_path_old: 'slug/a.pdf', storage_path_new: 'default/slug/a.pdf', status: 'copy_done' }, + ]); + const { storage, calls } = makeStorage(); + + const report = await runStorageBackfill(engine, storage); + + expect(report.nowComplete).toBe(1); + expect(rows[0].status).toBe('complete'); + expect(filePaths.get(1)).toBe('default/slug/a.pdf'); + // Should NOT re-download/re-upload — already in copy_done state. + expect(calls.filter(c => c.startsWith('download:'))).toEqual([]); + expect(calls.filter(c => c.startsWith('upload:'))).toEqual([]); + }); + + test('resumes from db_updated (crash AFTER DB update, BEFORE ledger mark)', async () => { + const { engine, rows } = makeEngine([ + { file_id: 1, storage_path_old: 'slug/b.pdf', storage_path_new: 'default/slug/b.pdf', status: 'db_updated' }, + ]); + const { storage, calls } = makeStorage(); + + const report = await runStorageBackfill(engine, storage); + + expect(report.nowComplete).toBe(1); + expect(rows[0].status).toBe('complete'); + // No copy, no db update — only the final mark. 
+ expect(calls).toEqual([]); + }); + + test('already-complete rows are skipped without storage calls', async () => { + const { engine, rows } = makeEngine([ + { file_id: 1, storage_path_old: 'x', storage_path_new: 'default/x', status: 'complete' }, + ]); + const { storage, calls } = makeStorage(); + + const report = await runStorageBackfill(engine, storage); + + expect(report.alreadyComplete).toBe(1); + expect(report.nowComplete).toBe(0); + expect(rows[0].status).toBe('complete'); + expect(calls).toEqual([]); + }); + + test('failed rows stay failed and do NOT auto-retry', async () => { + const { engine, rows } = makeEngine([ + { file_id: 1, storage_path_old: 'x', storage_path_new: 'default/x', status: 'failed', error: 'previous failure' }, + ]); + const { storage, calls } = makeStorage(); + + const report = await runStorageBackfill(engine, storage); + + expect(report.failed).toBe(1); + expect(report.nowComplete).toBe(0); + expect(rows[0].status).toBe('failed'); + expect(calls).toEqual([]); + }); +}); + +describe('runStorageBackfill — idempotence + dry-run', () => { + test('upload already-exists check skips redundant upload on re-run', async () => { + const { engine } = makeEngine([ + { file_id: 1, storage_path_old: 'x', storage_path_new: 'default/x', status: 'pending' }, + ]); + const { storage, calls } = makeStorage(); + // Mark the new path as already existing (simulates a prior partial run + // where upload landed but ledger didn't get updated). + await storage.upload('default/x', Buffer.from('x')); + calls.length = 0; + + await runStorageBackfill(engine, storage); + + // Exists check ran, but no new download or upload since the + // destination already has the object. 
+ expect(calls.some(c => c === 'exists:default/x')).toBe(true); + expect(calls.some(c => c.startsWith('download:'))).toBe(false); + expect(calls.some(c => c.startsWith('upload:'))).toBe(false); + }); + + test('dry-run mode reports skipped count, does not mutate', async () => { + const { engine, rows } = makeEngine([ + { file_id: 1, storage_path_old: 'x', storage_path_new: 'default/x', status: 'pending' }, + { file_id: 2, storage_path_old: 'y', storage_path_new: 'default/y', status: 'pending' }, + ]); + + const report = await runStorageBackfill(engine, null, { dryRun: true }); + + expect(report.total).toBe(2); + expect(report.skipped).toBe(2); + expect(report.nowComplete).toBe(0); + // Rows still pending. + expect(rows.every(r => r.status === 'pending')).toBe(true); + }); + + test('re-running a completed ledger is a no-op with zero side effects', async () => { + const { engine } = makeEngine([ + { file_id: 1, storage_path_old: 'x', storage_path_new: 'default/x', status: 'complete' }, + { file_id: 2, storage_path_old: 'y', storage_path_new: 'default/y', status: 'complete' }, + ]); + const { storage, calls } = makeStorage(); + + const report = await runStorageBackfill(engine, storage); + + expect(report.alreadyComplete).toBe(2); + expect(report.nowComplete).toBe(0); + expect(calls).toEqual([]); + }); +});
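
Reviewer note: the storage-backfill tests above pin down a small resumable state machine for each `file_migration_ledger` row. A minimal sketch of that behavior, as the tests appear to describe it, is given here for orientation. `planFor` and `Action` are hypothetical names introduced for illustration and are not part of gbrain. The sketch also assumes that a `pending` row whose destination object already exists continues with the `files` update, which the tests do not assert directly.

```typescript
// Hypothetical sketch (not gbrain code): which steps remain for a ledger row,
// given its recorded status. Status names are taken from StubLedgerRow above.
type LedgerStatus = 'pending' | 'copy_done' | 'db_updated' | 'complete' | 'failed';

// Illustrative action names; the real migration may structure this differently.
type Action = 'copy_object' | 'update_files_row' | 'mark_complete';

function planFor(status: LedgerStatus, destinationExists: boolean): Action[] {
  switch (status) {
    case 'pending':
      // A prior partial run may already have uploaded the object, so the
      // copy is skipped when the destination exists (assumption: the row
      // then proceeds to the files update).
      return destinationExists
        ? ['update_files_row', 'mark_complete']
        : ['copy_object', 'update_files_row', 'mark_complete'];
    case 'copy_done':
      // Crash after the copy, before the DB update: no re-download or re-upload.
      return ['update_files_row', 'mark_complete'];
    case 'db_updated':
      // Crash after the DB update: only the final ledger mark remains.
      return ['mark_complete'];
    case 'complete':
    case 'failed':
      // Completed rows are skipped; failed rows are not retried automatically.
      return [];
  }
}
```

Nothing in the plan deletes the old object, which matches the happy-path assertion that old objects are preserved for the soak window.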