fix: JSONB double-encode + splitBody wiki + parseEmbedding (v0.12.1) (#196)
* fix: splitBody and inferType for wiki-style markdown content - splitBody now requires explicit timeline sentinel (<!-- timeline -->, --- timeline ---, or --- directly before ## Timeline / ## History). A bare --- in body text is a markdown horizontal rule, not a separator. This fixes the 83% content truncation @knee5 reported on a 1,991-article wiki where 4,856 of 6,680 wikilinks were lost. - serializeMarkdown emits <!-- timeline --> sentinel for round-trip stability. - inferType extended with /writing/, /wiki/analysis/, /wiki/guides/, /wiki/hardware/, /wiki/architecture/, /wiki/concepts/. Path order is most-specific-first so projects/blog/writing/essay.md → writing, not project. - PageType union extended: writing, analysis, guide, hardware, architecture. Updates test/import-file.test.ts to use the new sentinel. Co-Authored-By: @knee5 (PR #187) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix: JSONB double-encode bug on Postgres + parseEmbedding NaN scores Two related Postgres-string-typed-data bugs that PGLite hid: 1. JSONB double-encode (postgres-engine.ts:107,668,846 + files.ts:254): ${JSON.stringify(value)}::jsonb in postgres.js v3 stringified again on the wire, storing JSONB columns as quoted string literals. Every frontmatter->>'key' returned NULL on Postgres-backed brains; GIN indexes were inert. Switched to sql.json(value), which is the postgres.js-native JSONB encoder (Parameter with OID 3802). Affected columns: pages.frontmatter, raw_data.data, ingest_log.pages_updated, files.metadata. page_versions.frontmatter is downstream via INSERT...SELECT and propagates the fix. 2. pgvector embeddings returning as strings (utils.ts): getEmbeddingsByChunkIds returned "[0.1,0.2,...]" instead of Float32Array on Supabase, producing [NaN] cosine scores. Adds parseEmbedding() helper handling Float32Array, numeric arrays, and pgvector string format. Throws loud on malformed vectors (per Codex's no-silent-NaN requirement); returns null for non-vector strings (treated as "no embedding here"). rowToChunk delegates to parseEmbedding. E2E regression test at test/e2e/postgres-jsonb.test.ts asserts jsonb_typeof = 'object' AND col->>'k' returns expected scalar across all 5 affected columns — the test that should have caught the original bug. Runs in CI via the existing pgvector service. Co-Authored-By: @knee5 (PR #187 — JSONB triple-fix) Co-Authored-By: @leonardsellem (PR #175 — parseEmbedding) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat: extract wikilink syntax with ancestor-search slug resolution extractMarkdownLinks now handles [[page]] and [[page|Display Text]] alongside standard [text](page.md). For wiki KBs where authors omit leading ../ (thinking in wiki-root-relative terms), resolveSlug walks ancestor directories until it finds a matching slug. Without this, wikilinks under tech/wiki/analysis/ targeting [[../../finance/wiki/concepts/foo]] silently dangled when the correct relative depth was 3 × ../ instead of 2. Co-Authored-By: @knee5 (PR #187) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat: gbrain repair-jsonb + v0.12.1 migration + CI grep guard - New gbrain repair-jsonb command. Detects rows where jsonb_typeof(col) = 'string' and rewrites them via (col #>> '{}')::jsonb across 5 affected columns: pages.frontmatter, raw_data.data, ingest_log.pages_updated, files.metadata, page_versions.frontmatter. Idempotent — re-running is a no-op. PGLite engines short-circuit cleanly (the bug never affected the parameterized encode path PGLite uses). --dry-run shows what would be repaired; --json for scripting. - New v0_12_1.ts migration orchestrator. Phases: schema → repair → verify. Modeled on v0_12_0 pattern, registered in migrations/index.ts. Runs automatically via gbrain upgrade / apply-migrations. - CI grep guard at scripts/check-jsonb-pattern.sh fails the build if anyone reintroduces the ${JSON.stringify(x)}::jsonb interpolation pattern. Wired into bun test via package.json. Best-effort static analysis (multi-line and helper-wrapped variants are caught by the E2E round-trip test instead). - Updates apply-migrations.test.ts expectations to account for the new v0.12.1 entry in the registry. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore: bump version and changelog (v0.12.1) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: update project documentation for v0.12.1 - CLAUDE.md: document repair-jsonb command, v0_12_1 migration, splitBody sentinel contract, inferType wiki subtypes, CI grep guard, new test files (repair-jsonb, migrations-v0_12_1, markdown) - README.md: add gbrain repair-jsonb to ADMIN command reference - INSTALL_FOR_AGENTS.md: fix verification count (6 -> 7), add v0.12.1 upgrade guidance for Postgres brains - docs/GBRAIN_VERIFY.md: add check #8 for JSONB integrity on Postgres-backed brains - docs/UPGRADING_DOWNSTREAM_AGENTS.md: add v0.12.1 section with migration steps, splitBody contract, wiki subtype inference - skills/migrate/SKILL.md: document native wikilink extraction via gbrain extract links (v0.12.1+) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
75
CHANGELOG.md
75
CHANGELOG.md
@@ -2,6 +2,81 @@
|
||||
|
||||
All notable changes to GBrain will be documented in this file.
|
||||
|
||||
## [0.12.2] - 2026-04-19
|
||||
|
||||
## **Postgres frontmatter queries actually work now.**
|
||||
## **Wiki articles stop disappearing when you import them.**
|
||||
|
||||
This is a data-correctness hotfix for the `v0.12.0`-and-earlier Postgres-backed brains. If you run gbrain on Postgres or Supabase, you've been losing data without knowing it. PGLite users were unaffected. Upgrade auto-repairs your existing rows. Lands on top of v0.12.1 (extract N+1 fix + migration timeout fix) — pull `gbrain upgrade` and you get both.
|
||||
|
||||
### What was broken
|
||||
|
||||
**Frontmatter columns were silently stored as quoted strings, not JSON.** Every `put_page` wrote `frontmatter` to Postgres via `${JSON.stringify(value)}::jsonb` — postgres.js v3 stringified again on the wire, so the column ended up holding `"\"{\\\"author\\\":\\\"garry\\\"}\""` instead of `{"author":"garry"}`. Every `frontmatter->>'key'` query returned NULL. GIN indexes on JSONB were inert. Same bug on `raw_data.data`, `ingest_log.pages_updated`, `files.metadata`, and `page_versions.frontmatter`. PGLite hid this entirely (different driver path) — which is exactly why it slipped past the existing test suite.
|
||||
|
||||
**Wiki articles got truncated by 83% on import.** `splitBody` treated *any* standalone `---` line in body content as a timeline separator. Discovered by @knee5 migrating a 1,991-article wiki where a 23,887-byte article landed in the DB as 593 bytes (4,856 of 6,680 wikilinks lost).
|
||||
|
||||
**`/wiki/` subdirectories silently typed as `concept`.** Articles under `/wiki/analysis/`, `/wiki/guides/`, `/wiki/hardware/`, `/wiki/architecture/`, and `/writing/` defaulted to `type='concept'` — type-filtered queries lost everything in those buckets.
|
||||
|
||||
**pgvector embeddings sometimes returned as strings → NaN search scores.** Discovered by @leonardsellem on Supabase, where `getEmbeddingsByChunkIds` returned `"[0.1,0.2,…]"` instead of `Float32Array`, producing `[NaN]` query scores.
|
||||
|
||||
### What you can do now that you couldn't before
|
||||
|
||||
- **`frontmatter->>'author'` returns `garry`, not NULL.** GIN indexes work. Postgres queries by frontmatter key actually retrieve pages.
|
||||
- **Wiki articles round-trip intact.** Markdown horizontal rules in body text are horizontal rules, not timeline separators.
|
||||
- **Recover already-truncated pages with `gbrain sync --full`.** Re-import from your source-of-truth markdown rebuilds `compiled_truth` correctly.
|
||||
- **Search scores stop going `NaN` on Supabase.** Cosine rescoring sees real `Float32Array` embeddings.
|
||||
- **Type-filtered queries find your wiki articles.** `/wiki/analysis/` becomes type `analysis`, `/writing/` becomes `writing`, etc.
|
||||
|
||||
### How to upgrade
|
||||
|
||||
```bash
|
||||
gbrain upgrade
|
||||
```
|
||||
|
||||
The `v0.12.2` orchestrator runs automatically: applies any schema changes, then `gbrain repair-jsonb` rewrites every double-encoded row in place using `jsonb_typeof = 'string'` as the guard. Idempotent — re-running is a no-op. PGLite engines short-circuit cleanly. Batches well on large brains.
|
||||
|
||||
If you want to recover pages that were truncated by the splitBody bug:
|
||||
|
||||
```bash
|
||||
gbrain sync --full
|
||||
```
|
||||
|
||||
That re-imports every page from disk, so the new `splitBody` rebuilds the full `compiled_truth` correctly.
|
||||
|
||||
### What's new under the hood
|
||||
|
||||
- **`gbrain repair-jsonb`** — standalone command for the JSONB fix. Run it manually if needed; the migration runs it automatically. `--dry-run` shows what would be repaired without touching data. `--json` for scripting.
|
||||
- **CI grep guard** at `scripts/check-jsonb-pattern.sh` — fails the build if anyone reintroduces the `${JSON.stringify(x)}::jsonb` interpolation pattern. Wired into `bun test` so it runs on every CI invocation.
|
||||
- **New E2E regression test** at `test/e2e/postgres-jsonb.test.ts` — round-trips all four JSONB write sites against real Postgres and asserts `jsonb_typeof = 'object'` plus `->>` returns the expected scalar. The test that should have caught the original bug.
|
||||
- **Wikilink extraction** — `[[page]]` and `[[page|Display Text]]` syntaxes now extracted alongside standard `[text](page.md)` markdown links. Includes ancestor-search resolution for wiki KBs where authors omit one or more leading `../`.
|
||||
|
||||
### Migration scope
|
||||
|
||||
The repair touches five JSONB columns:
|
||||
- `pages.frontmatter`
|
||||
- `raw_data.data`
|
||||
- `ingest_log.pages_updated`
|
||||
- `files.metadata`
|
||||
- `page_versions.frontmatter` (downstream of `pages.frontmatter` via INSERT...SELECT)
|
||||
|
||||
Other JSONB columns in the schema (`minion_jobs.{data,result,progress,stacktrace}`, `minion_inbox.payload`) were always written via the parameterized `$N::jsonb` form so they were never affected.
|
||||
|
||||
### Behavior changes (read this if you upgrade)
|
||||
|
||||
`splitBody` now requires an explicit sentinel for timeline content. Recognized markers (in priority order):
|
||||
1. `<!-- timeline -->` (preferred — what `serializeMarkdown` emits)
|
||||
2. `--- timeline ---` (decorated separator)
|
||||
3. `---` directly before `## Timeline` or `## History` heading (backward-compat fallback)
|
||||
|
||||
If you intentionally used a plain `---` to mark your timeline section in source markdown, add `<!-- timeline -->` above it manually. The fallback covers the common case (`---` followed by `## Timeline`).
|
||||
|
||||
### Attribution
|
||||
|
||||
Built from community PRs #187 (@knee5) and #175 (@leonardsellem). The original PRs reported the bugs and proposed the fixes; this release re-implements them on top of the v0.12.0 knowledge graph release with expanded migration scope, schema audit (all 5 affected columns vs the 3 originally reported), engine-aware behavior, CI grep guard, and an E2E regression test that should have caught this in the first place. Codex outside-voice review during planning surfaced the missed `page_versions.frontmatter` propagation path and the noisy-truncated-diagnostic anti-pattern that was dropped from this scope. Thanks for finding the bugs and providing the recovery path — both PRs left work to do but the foundation was right.
|
||||
|
||||
Co-Authored-By: @knee5 (PR #187 — splitBody, inferType wiki, JSONB triple-fix)
|
||||
Co-Authored-By: @leonardsellem (PR #175 — parseEmbedding, getEmbeddingsByChunkIds fix)
|
||||
|
||||
## [0.12.1] - 2026-04-19
|
||||
|
||||
## **Extract no longer hangs on large brains.**
|
||||
|
||||
14
CLAUDE.md
14
CLAUDE.md
@@ -61,7 +61,10 @@ strict behavior when unset.
|
||||
- `src/mcp/server.ts` — MCP stdio server (generated from operations)
|
||||
- `src/commands/auth.ts` — Standalone token management (create/list/revoke/test)
|
||||
- `src/commands/upgrade.ts` — Self-update CLI. `runPostUpgrade()` enumerates migrations from the TS registry (src/commands/migrations/index.ts) and tail-calls `runApplyMigrations(['--yes', '--non-interactive'])` so the mechanical side of every outstanding migration runs unconditionally.
|
||||
- `src/commands/migrations/` — TS migration registry (compiled into the binary; no filesystem walk of `skills/migrations/*.md` needed at runtime). `index.ts` lists migrations in semver order. `v0_11_0.ts` = Minions adoption orchestrator (8 phases). `v0_12_0.ts` = Knowledge Graph auto-wire orchestrator (5 phases: schema → config check → backfill links → backfill timeline → verify). `phaseASchema` has a 600s timeout (bumped from 60s in v0.12.1 for duplicate-heavy brains). All orchestrators are idempotent and resumable from `partial` status.
|
||||
- `src/commands/migrations/` — TS migration registry (compiled into the binary; no filesystem walk of `skills/migrations/*.md` needed at runtime). `index.ts` lists migrations in semver order. `v0_11_0.ts` = Minions adoption orchestrator (8 phases). `v0_12_0.ts` = Knowledge Graph auto-wire orchestrator (5 phases: schema → config check → backfill links → backfill timeline → verify). `phaseASchema` has a 600s timeout (bumped from 60s in v0.12.1 for duplicate-heavy brains). `v0_12_2.ts` = JSONB double-encode repair orchestrator (4 phases: schema → repair-jsonb → verify → record). All orchestrators are idempotent and resumable from `partial` status.
|
||||
- `src/commands/repair-jsonb.ts` — `gbrain repair-jsonb [--dry-run] [--json]`: rewrites `jsonb_typeof='string'` rows in place across 5 affected columns (pages.frontmatter, raw_data.data, ingest_log.pages_updated, files.metadata, page_versions.frontmatter). Fixes v0.12.0 double-encode bug on Postgres; PGLite no-ops. Idempotent.
|
||||
- `src/core/markdown.ts` — Frontmatter parsing + body splitter. `splitBody` requires an explicit timeline sentinel (`<!-- timeline -->`, `--- timeline ---`, or `---` immediately before `## Timeline`/`## History`). Plain `---` in body text is a markdown horizontal rule, not a separator. `inferType` auto-types `/wiki/analysis/` → analysis, `/wiki/guides/` → guide, `/wiki/hardware/` → hardware, `/wiki/architecture/` → architecture, `/writing/` → writing (plus the existing people/companies/deals/etc heuristics).
|
||||
- `scripts/check-jsonb-pattern.sh` — CI grep guard. Fails the build if anyone reintroduces the `${JSON.stringify(x)}::jsonb` interpolation pattern (which postgres.js v3 double-encodes). Wired into `bun test`.
|
||||
- `docs/UPGRADING_DOWNSTREAM_AGENTS.md` — Patches for downstream agent skill forks (Wintermute etc.) to apply when upgrading. Each release appends a new section. v0.10.3 includes diffs for brain-ops, meeting-ingestion, signal-detector, enrich.
|
||||
- `src/core/schema-embedded.ts` — AUTO-GENERATED from schema.sql (run `bun run build:schema`)
|
||||
- `src/schema.sql` — Full Postgres + pgvector DDL (source of truth, generates schema-embedded.ts)
|
||||
@@ -129,6 +132,9 @@ Key commands added for Minions (job queue):
|
||||
- `gbrain jobs stats` — job health dashboard
|
||||
- `gbrain jobs work [--queue Q] [--concurrency N]` — start worker daemon (Postgres only)
|
||||
|
||||
Key commands added in v0.12.2:
|
||||
- `gbrain repair-jsonb [--dry-run] [--json]` — repair double-encoded JSONB rows left over from v0.12.0-and-earlier Postgres writes. Idempotent; PGLite no-ops. The `v0_12_2` migration runs this automatically on `gbrain upgrade`.
|
||||
|
||||
## Testing
|
||||
|
||||
`bun test` runs all tests. After the v0.12.1 release: ~75 unit test files + 8 E2E test files (1412 unit pass, 119 E2E when `DATABASE_URL` is set — skip gracefully otherwise). Unit tests run
|
||||
@@ -172,12 +178,16 @@ parity), `test/cli.test.ts` (CLI structure), `test/config.test.ts` (config redac
|
||||
`test/features.test.ts` (feature scanning, brain_score calculation, CLI routing, persistence),
|
||||
`test/file-upload-security.test.ts` (symlink traversal, cwd confinement, slug + filename allowlists, remote vs local trust),
|
||||
`test/query-sanitization.test.ts` (prompt-injection stripping, output sanitization, structural boundary),
|
||||
`test/search-limit.test.ts` (clampSearchLimit default/cap behavior across list_pages and get_ingest_log).
|
||||
`test/search-limit.test.ts` (clampSearchLimit default/cap behavior across list_pages and get_ingest_log),
|
||||
`test/repair-jsonb.test.ts` (v0.12.2 JSONB repair: TARGETS list, idempotency, engine-awareness),
|
||||
`test/migrations-v0_12_2.test.ts` (v0.12.2 orchestrator phases: schema → repair → verify → record),
|
||||
`test/markdown.test.ts` (splitBody sentinel precedence, horizontal-rule preservation, inferType wiki subtypes).
|
||||
|
||||
E2E tests (`test/e2e/`): Run against real Postgres+pgvector. Require `DATABASE_URL`.
|
||||
- `bun run test:e2e` runs Tier 1 (mechanical, all operations, no API keys). Includes 9 dedicated cases for the postgres-engine `addLinksBatch` / `addTimelineEntriesBatch` bind path — postgres-js's `unnest()` binding is structurally different from PGLite's and gets its own coverage.
|
||||
- `test/e2e/search-quality.test.ts` runs search quality E2E against PGLite (no API keys, in-memory)
|
||||
- `test/e2e/graph-quality.test.ts` runs the v0.10.3 knowledge graph pipeline (auto-link via put_page, reconciliation, traversePaths) against PGLite in-memory
|
||||
- `test/e2e/postgres-jsonb.test.ts` — v0.12.2 regression test. Round-trips all 5 JSONB write sites (pages.frontmatter, raw_data.data, ingest_log.pages_updated, files.metadata, page_versions.frontmatter) against real Postgres and asserts `jsonb_typeof='object'` plus `->>'key'` returns the expected scalar. The test that should have caught the original double-encode bug.
|
||||
- `test/e2e/upgrade.test.ts` runs check-update E2E against real GitHub API (network required)
|
||||
- Tier 2 (`skills.test.ts`) requires OpenClaw + API keys, runs nightly in CI
|
||||
- If `.env.testing` doesn't exist in this directory, check sibling worktrees for one:
|
||||
|
||||
@@ -127,7 +127,7 @@ Verify: `gbrain integrations doctor` (after at least one is configured)
|
||||
|
||||
## Step 9: Verify
|
||||
|
||||
Read `docs/GBRAIN_VERIFY.md` and run all 6 verification checks. Check #4 (live sync
|
||||
Read `docs/GBRAIN_VERIFY.md` and run all 7 verification checks. Check #4 (live sync
|
||||
actually works) is the most important.
|
||||
|
||||
## Upgrade
|
||||
@@ -145,3 +145,10 @@ this is how features ship in the binary but stay dormant in the user's brain.
|
||||
For v0.12.0+ specifically: if your brain was created before v0.12.0, run
|
||||
`gbrain extract links --source db && gbrain extract timeline --source db` to
|
||||
backfill the new graph layer (see Step 4.5 above).
|
||||
|
||||
For v0.12.2+ specifically: if your brain is Postgres- or Supabase-backed and
|
||||
predates v0.12.2, the `v0_12_2` migration runs `gbrain repair-jsonb`
|
||||
automatically during `gbrain post-upgrade` to fix the double-encoded JSONB
|
||||
columns. PGLite brains no-op. If wiki-style imports were truncated by the old
|
||||
`splitBody` bug, run `gbrain sync --full` after upgrading to rebuild
|
||||
`compiled_truth` from source markdown.
|
||||
|
||||
@@ -536,6 +536,7 @@ ADMIN
|
||||
gbrain integrations Integration recipe dashboard
|
||||
gbrain check-backlinks check|fix Back-link enforcement
|
||||
gbrain lint [--fix] LLM artifact detection
|
||||
gbrain repair-jsonb [--dry-run] Repair v0.12.0 double-encoded JSONB (Postgres)
|
||||
gbrain transcribe <audio> Transcribe audio (Groq Whisper)
|
||||
gbrain research init <name> Scaffold a data-research recipe
|
||||
gbrain research list Show available recipes
|
||||
|
||||
@@ -224,6 +224,43 @@ heuristics won't find them — file an issue with a sample page.
|
||||
|
||||
---
|
||||
|
||||
## 8. JSONB Frontmatter Integrity (v0.12.2)
|
||||
|
||||
Postgres-backed brains created before v0.12.2 had double-encoded JSONB columns
|
||||
(`frontmatter->>'key'` returned NULL, GIN indexes were inert). `gbrain upgrade`
|
||||
runs `gbrain repair-jsonb` automatically via the `v0_12_2` orchestrator.
|
||||
Verify the repair succeeded.
|
||||
|
||||
**Command:**
|
||||
|
||||
```bash
|
||||
gbrain repair-jsonb --dry-run --json
|
||||
```
|
||||
|
||||
**Expected:** `totalRepaired: 0` across all 5 columns (`pages.frontmatter`,
|
||||
`raw_data.data`, `ingest_log.pages_updated`, `files.metadata`,
|
||||
`page_versions.frontmatter`). A zero count means every row is properly-typed
|
||||
JSON objects, not string-encoded JSON.
|
||||
|
||||
**If the count is > 0:** The repair didn't run or was interrupted. Re-run
|
||||
without `--dry-run`:
|
||||
|
||||
```bash
|
||||
gbrain repair-jsonb
|
||||
```
|
||||
|
||||
Idempotent. PGLite brains always report 0 (unaffected by the original bug).
|
||||
|
||||
**Bonus check** — frontmatter-keyed queries actually resolve:
|
||||
|
||||
```bash
|
||||
gbrain call list_pages '{"frontmatterKey": "type", "frontmatterValue": "person"}'
|
||||
```
|
||||
|
||||
If this returns rows on a brain with person pages, the JSONB path is healthy.
|
||||
|
||||
---
|
||||
|
||||
## Quick Verification (all checks in one pass)
|
||||
|
||||
```bash
|
||||
@@ -247,7 +284,10 @@ gbrain check-update --json
|
||||
|
||||
# 7. Knowledge graph populated (links + timeline > 0)
|
||||
gbrain stats | grep -E 'links|timeline'
|
||||
|
||||
# 8. JSONB integrity (v0.12.2 — Postgres only, PGLite always 0)
|
||||
gbrain repair-jsonb --dry-run --json
|
||||
```
|
||||
|
||||
If all seven return successfully, the installation is healthy. For the full
|
||||
If all eight return successfully, the installation is healthy. For the full
|
||||
end-to-end sync test (4c), push a real change and verify it appears in search.
|
||||
|
||||
@@ -177,6 +177,74 @@ Timeline entries still need explicit `gbrain timeline-add` calls.
|
||||
```
|
||||
Should return an indented tree of typed edges.
|
||||
|
||||
---
|
||||
|
||||
## v0.12.2 hotfix (data-correctness, no skill edits)
|
||||
|
||||
v0.12.2 is a Postgres data-correctness hotfix. No forked skill files need to
|
||||
change — the skill contracts are unchanged. But you DO need to run the migration,
|
||||
and you should know about one behavior change in markdown parsing.
|
||||
|
||||
### 1. Run the migration (Postgres-backed brains)
|
||||
|
||||
```bash
|
||||
gbrain upgrade
|
||||
```
|
||||
|
||||
The `v0_12_2` orchestrator runs `gbrain repair-jsonb` automatically. It rewrites
|
||||
rows where `jsonb_typeof = 'string'` across `pages.frontmatter`, `raw_data.data`,
|
||||
`ingest_log.pages_updated`, `files.metadata`, and `page_versions.frontmatter`.
|
||||
Idempotent, safe to re-run. PGLite brains no-op cleanly.
|
||||
|
||||
Verify after upgrade:
|
||||
|
||||
```bash
|
||||
gbrain repair-jsonb --dry-run --json # expect totalRepaired: 0
|
||||
```
|
||||
|
||||
### 2. Recover any truncated wiki articles
|
||||
|
||||
If your brain imported wiki-style markdown before v0.12.2, some pages were
|
||||
silently truncated (any standalone `---` in body content was treated as a
|
||||
timeline separator). Re-import from source:
|
||||
|
||||
```bash
|
||||
gbrain sync --full
|
||||
```
|
||||
|
||||
The new `splitBody` rebuilds `compiled_truth` correctly.
|
||||
|
||||
### 3. Know the splitBody contract going forward
|
||||
|
||||
`splitBody` now requires an explicit timeline sentinel. Recognized markers
|
||||
(priority order):
|
||||
|
||||
1. `<!-- timeline -->` (preferred — what `serializeMarkdown` emits)
|
||||
2. `--- timeline ---` (decorated separator)
|
||||
3. `---` directly before `## Timeline` or `## History` heading (backward-compat)
|
||||
|
||||
A bare `---` in body text is now a markdown horizontal rule, not a timeline
|
||||
separator. If your agent writes pages with a bare `---` delimiter, migrate to
|
||||
`<!-- timeline -->` — the `serializeMarkdown` helper already does this.
|
||||
|
||||
### 4. Wiki subtypes now auto-typed
|
||||
|
||||
`inferType` now auto-detects five additional directory patterns as their own
|
||||
page types (previously they all defaulted to `concept`):
|
||||
|
||||
| Path pattern | New type |
|
||||
|------------------------|----------------|
|
||||
| `/wiki/analysis/` | `analysis` |
|
||||
| `/wiki/guides/` | `guide` |
|
||||
| `/wiki/hardware/` | `hardware` |
|
||||
| `/wiki/architecture/` | `architecture` |
|
||||
| `/writing/` | `writing` |
|
||||
|
||||
If your skills or queries filter by `type=concept` and expect wiki content in
|
||||
that bucket, update them to include the new types.
|
||||
|
||||
---
|
||||
|
||||
## Future versions
|
||||
|
||||
When gbrain ships a new version, this doc will be updated with the diffs for that
|
||||
|
||||
@@ -1,6 +1,6 @@
|
||||
{
|
||||
"name": "gbrain",
|
||||
"version": "0.12.0",
|
||||
"version": "0.12.2",
|
||||
"description": "Postgres-native personal knowledge brain with hybrid RAG search",
|
||||
"type": "module",
|
||||
"main": "src/core/index.ts",
|
||||
@@ -20,8 +20,9 @@
|
||||
"build": "bun build --compile --outfile bin/gbrain src/cli.ts",
|
||||
"build:all": "bun build --compile --target=bun-darwin-arm64 --outfile bin/gbrain-darwin-arm64 src/cli.ts && bun build --compile --target=bun-linux-x64 --outfile bin/gbrain-linux-x64 src/cli.ts",
|
||||
"build:schema": "bash scripts/build-schema.sh",
|
||||
"test": "bun test",
|
||||
"test": "scripts/check-jsonb-pattern.sh && bun test",
|
||||
"test:e2e": "bun test test/e2e/",
|
||||
"check:jsonb": "scripts/check-jsonb-pattern.sh",
|
||||
"postinstall": "gbrain --version >/dev/null 2>&1 && gbrain apply-migrations --yes --non-interactive 2>/dev/null || true",
|
||||
"prepublish:clawhub": "bun run build:all",
|
||||
"publish:clawhub": "clawhub package publish . --family bundle-plugin"
|
||||
|
||||
32
scripts/check-jsonb-pattern.sh
Executable file
32
scripts/check-jsonb-pattern.sh
Executable file
@@ -0,0 +1,32 @@
|
||||
#!/usr/bin/env bash
|
||||
# CI guard: fail if any source file uses the buggy `${JSON.stringify(x)}::jsonb`
|
||||
# template-string pattern instead of postgres.js's `sql.json(x)`.
|
||||
#
|
||||
# This is best-effort static analysis. It catches the common copy-paste form
|
||||
# that caused the v0.12.0 silent-data-loss bug (JSONB columns stored as
|
||||
# string literals on Postgres while PGLite hid the bug). Multi-line and
|
||||
# helper-wrapped variants are NOT caught here — those are covered by
|
||||
# test/e2e/postgres-jsonb.test.ts which round-trips actual writes through
|
||||
# real Postgres and asserts `frontmatter->>'k'` returns objects, not strings.
|
||||
#
|
||||
# Usage: scripts/check-jsonb-pattern.sh
|
||||
# Exit: 0 when no matches, 1 when matches found.
|
||||
|
||||
set -euo pipefail
|
||||
|
||||
ROOT="$(git rev-parse --show-toplevel 2>/dev/null || pwd)"
|
||||
cd "$ROOT"
|
||||
|
||||
# Match the interpolated form: ${JSON.stringify(...)}::jsonb
|
||||
# Using grep -P for Perl-compatible regex (lookahead-free pattern is enough here).
|
||||
PATTERN='\$\{JSON\.stringify\([^)]*\)\}::jsonb'
|
||||
|
||||
if grep -rEn "$PATTERN" src/ 2>/dev/null; then
|
||||
echo
|
||||
echo "ERROR: Found JSON.stringify(...)::jsonb pattern in src/."
|
||||
echo " postgres.js v3 stringifies again, producing JSONB string literals."
|
||||
echo " Use sql.json(x) instead. See feedback_postgres_jsonb_double_encode.md."
|
||||
exit 1
|
||||
fi
|
||||
|
||||
echo "OK: no JSON.stringify(x)::jsonb interpolation pattern in src/"
|
||||
@@ -50,10 +50,17 @@ Universal migration from any wiki, note tool, or brain system into GBrain.
|
||||
## Obsidian Migration
|
||||
|
||||
1. Import the vault directory into gbrain (Obsidian vaults are markdown directories)
|
||||
2. Convert `[[wikilinks]]` to gbrain links:
|
||||
- Read each page from gbrain
|
||||
- For each `[[Name]]` found, resolve to a slug and create a link in gbrain
|
||||
- `[[Name|alias]]` uses the alias for context
|
||||
2. Wire the graph with native wikilink support (v0.12.1+):
|
||||
|
||||
```bash
|
||||
gbrain extract links --source db --dry-run | head -20 # preview
|
||||
gbrain extract links --source db # commit
|
||||
```
|
||||
|
||||
`extract links` natively parses `[[relative/path]]` and `[[relative/path|Display Text]]`
|
||||
alongside standard `[text](page.md)` markdown syntax. Ancestor-search resolution handles
|
||||
wiki KBs where authors omit one or more leading `../` prefixes. The `.md` suffix is
|
||||
inferred automatically for wikilinks.
|
||||
|
||||
Obsidian-specific:
|
||||
- Tags (`#tag`) become gbrain tags
|
||||
|
||||
@@ -18,7 +18,7 @@ for (const op of operations) {
|
||||
}
|
||||
|
||||
// CLI-only commands that bypass the operation layer
|
||||
const CLI_ONLY = new Set(['init', 'upgrade', 'post-upgrade', 'check-update', 'integrations', 'publish', 'check-backlinks', 'lint', 'report', 'import', 'export', 'files', 'embed', 'serve', 'call', 'config', 'doctor', 'migrate', 'eval', 'sync', 'extract', 'features', 'autopilot', 'graph-query', 'jobs', 'apply-migrations', 'skillpack-check']);
|
||||
const CLI_ONLY = new Set(['init', 'upgrade', 'post-upgrade', 'check-update', 'integrations', 'publish', 'check-backlinks', 'lint', 'report', 'import', 'export', 'files', 'embed', 'serve', 'call', 'config', 'doctor', 'migrate', 'eval', 'sync', 'extract', 'features', 'autopilot', 'graph-query', 'jobs', 'apply-migrations', 'skillpack-check', 'repair-jsonb']);
|
||||
|
||||
async function main() {
|
||||
const args = process.argv.slice(2);
|
||||
@@ -306,6 +306,11 @@ async function handleCliOnly(command: string, args: string[]) {
|
||||
await runApplyMigrations(args);
|
||||
return;
|
||||
}
|
||||
if (command === 'repair-jsonb') {
|
||||
const { runRepairJsonbCli } = await import('./commands/repair-jsonb.ts');
|
||||
await runRepairJsonbCli(args);
|
||||
return;
|
||||
}
|
||||
if (command === 'skillpack-check') {
|
||||
// Agent-readable health report. Shells out to doctor + apply-migrations
|
||||
// internally; does not need its own DB connection.
|
||||
|
||||
@@ -76,19 +76,72 @@ export function walkMarkdownFiles(dir: string): { path: string; relPath: string
|
||||
|
||||
// --- Link extraction ---
|
||||
|
||||
/** Extract markdown links to .md files (relative paths only) */
|
||||
/**
|
||||
* Extract markdown links to .md files (relative paths only).
|
||||
*
|
||||
* Handles two syntaxes:
|
||||
* 1. Standard markdown: [text](relative/path.md)
|
||||
* 2. Wikilinks: [[relative/path]] or [[relative/path|Display Text]]
|
||||
*
|
||||
* Both are resolved relative to the file that contains them. External URLs
|
||||
* (containing ://) are always skipped. For wikilinks, the .md suffix is added
|
||||
* if absent and section anchors (#heading) are stripped.
|
||||
*/
|
||||
export function extractMarkdownLinks(content: string): { name: string; relTarget: string }[] {
|
||||
const results: { name: string; relTarget: string }[] = [];
|
||||
const pattern = /\[([^\]]+)\]\(([^)]+\.md)\)/g;
|
||||
|
||||
const mdPattern = /\[([^\]]+)\]\(([^)]+\.md)\)/g;
|
||||
let match;
|
||||
while ((match = pattern.exec(content)) !== null) {
|
||||
while ((match = mdPattern.exec(content)) !== null) {
|
||||
const target = match[2];
|
||||
if (target.includes('://')) continue; // skip external URLs
|
||||
if (target.includes('://')) continue;
|
||||
results.push({ name: match[1], relTarget: target });
|
||||
}
|
||||
|
||||
const wikiPattern = /\[\[([^|\]]+?)(?:\|[^\]]*?)?\]\]/g;
|
||||
while ((match = wikiPattern.exec(content)) !== null) {
|
||||
const rawPath = match[1].trim();
|
||||
if (rawPath.includes('://')) continue;
|
||||
const hashIdx = rawPath.indexOf('#');
|
||||
const pagePath = hashIdx >= 0 ? rawPath.slice(0, hashIdx) : rawPath;
|
||||
if (!pagePath) continue;
|
||||
const relTarget = pagePath.endsWith('.md') ? pagePath : pagePath + '.md';
|
||||
const pipeIdx = match[0].indexOf('|');
|
||||
const displayName = pipeIdx >= 0 ? match[0].slice(pipeIdx + 1, -2).trim() : rawPath;
|
||||
results.push({ name: displayName, relTarget });
|
||||
}
|
||||
|
||||
return results;
|
||||
}
|
||||
|
||||
/**
|
||||
* Resolve a wikilink target to a canonical slug, given the directory of the
|
||||
* containing page and the set of all known slugs in the brain.
|
||||
*
|
||||
* Wiki KBs often use inconsistent relative depths. Authors omit one or more
|
||||
* leading `../` because they think in "wiki-root-relative" terms. Resolution
|
||||
* order (first match wins):
|
||||
* 1. Standard `join(fileDir, relTarget)` — exact relative path as written
|
||||
* 2. Ancestor search — strip leading path components from fileDir, retry
|
||||
*
|
||||
* Returns null when no matching slug is found (dangling link).
|
||||
*/
|
||||
export function resolveSlug(fileDir: string, relTarget: string, allSlugs: Set<string>): string | null {
|
||||
const targetNoExt = relTarget.endsWith('.md') ? relTarget.slice(0, -3) : relTarget;
|
||||
|
||||
const s1 = join(fileDir, targetNoExt);
|
||||
if (allSlugs.has(s1)) return s1;
|
||||
|
||||
const parts = fileDir.split('/').filter(Boolean);
|
||||
for (let strip = 1; strip <= parts.length; strip++) {
|
||||
const ancestor = parts.slice(0, parts.length - strip).join('/');
|
||||
const candidate = ancestor ? join(ancestor, targetNoExt) : targetNoExt;
|
||||
if (allSlugs.has(candidate)) return candidate;
|
||||
}
|
||||
|
||||
return null;
|
||||
}
|
||||
|
||||
/** Infer link type from directory structure */
|
||||
function inferLinkType(fromDir: string, toDir: string, frontmatter?: Record<string, unknown>): string {
|
||||
const from = fromDir.split('/')[0];
|
||||
@@ -146,8 +199,8 @@ export function extractLinksFromFile(
|
||||
const fm = parseFrontmatterFromContent(content, relPath);
|
||||
|
||||
for (const { name, relTarget } of extractMarkdownLinks(content)) {
|
||||
const resolved = join(fileDir, relTarget).replace('.md', '');
|
||||
if (allSlugs.has(resolved)) {
|
||||
const resolved = resolveSlug(fileDir, relTarget, allSlugs);
|
||||
if (resolved !== null) {
|
||||
links.push({
|
||||
from_slug: slug, to_slug: resolved,
|
||||
link_type: inferLinkType(fileDir, dirname(resolved), fm),
|
||||
|
||||
@@ -251,7 +251,7 @@ async function uploadRaw(args: string[]) {
|
||||
await sql`
|
||||
INSERT INTO files (page_slug, filename, storage_path, mime_type, size_bytes, content_hash, metadata)
|
||||
VALUES (${pageSlug}, ${filename}, ${storagePath}, ${mimeType}, ${stat.size}, ${'sha256:' + hash},
|
||||
${JSON.stringify({ type: fileType, upload_method: method })}::jsonb)
|
||||
${sql.json({ type: fileType, upload_method: method })})
|
||||
ON CONFLICT (storage_path) DO UPDATE SET
|
||||
content_hash = EXCLUDED.content_hash,
|
||||
size_bytes = EXCLUDED.size_bytes,
|
||||
|
||||
@@ -13,10 +13,12 @@
|
||||
import type { Migration } from './types.ts';
|
||||
import { v0_11_0 } from './v0_11_0.ts';
|
||||
import { v0_12_0 } from './v0_12_0.ts';
|
||||
import { v0_12_2 } from './v0_12_2.ts';
|
||||
|
||||
export const migrations: Migration[] = [
|
||||
v0_11_0,
|
||||
v0_12_0,
|
||||
v0_12_2,
|
||||
];
|
||||
|
||||
/** Look up a migration by exact version string. */
|
||||
|
||||
141
src/commands/migrations/v0_12_2.ts
Normal file
141
src/commands/migrations/v0_12_2.ts
Normal file
@@ -0,0 +1,141 @@
|
||||
/**
|
||||
* v0.12.2 migration orchestrator — JSONB double-encode repair.
|
||||
*
|
||||
* v0.12.0-and-earlier wrote JSONB columns via the buggy
|
||||
* `JSON.stringify(value)`-then-cast-to-jsonb interpolation pattern, which
|
||||
* postgres.js v3 stringified again on the wire. Result: every
|
||||
* `frontmatter->>'key'` query returned NULL on Postgres-backed brains and
|
||||
* GIN indexes on JSONB columns were inert. PGLite was unaffected (its
|
||||
* driver path uses parameterized binding, never interpolation).
|
||||
*
|
||||
* v0.12.2 fixes the writes (sql.json) AND repairs existing rows in place.
|
||||
* This is the migration. It's idempotent (only touches `jsonb_typeof = 'string'`
|
||||
* rows) and safe to re-run. PGLite engines no-op cleanly.
|
||||
*
|
||||
* Phases (all idempotent):
|
||||
* A. Schema — gbrain init --migrate-only (no schema changes in v0.12.2
|
||||
* but we still apply for consistency with v0.12.0).
|
||||
* B. Repair — gbrain repair-jsonb (the actual JSONB fix).
|
||||
* C. Verify — gbrain repair-jsonb --dry-run --json; assert 0 remaining.
|
||||
* D. Record — append completed.jsonl.
|
||||
*/
|
||||
|
||||
import { execSync } from 'child_process';
|
||||
import type { Migration, OrchestratorOpts, OrchestratorResult, OrchestratorPhaseResult } from './types.ts';
|
||||
import { appendCompletedMigration } from '../../core/preferences.ts';
|
||||
|
||||
// ── Phase A — Schema ────────────────────────────────────────
|
||||
|
||||
function phaseASchema(opts: OrchestratorOpts): OrchestratorPhaseResult {
|
||||
if (opts.dryRun) return { name: 'schema', status: 'skipped', detail: 'dry-run' };
|
||||
try {
|
||||
execSync('gbrain init --migrate-only', { stdio: 'inherit', timeout: 60_000, env: process.env });
|
||||
return { name: 'schema', status: 'complete' };
|
||||
} catch (e) {
|
||||
const msg = e instanceof Error ? e.message : String(e);
|
||||
return { name: 'schema', status: 'failed', detail: msg };
|
||||
}
|
||||
}
|
||||
|
||||
// ── Phase B — JSONB repair ──────────────────────────────────
|
||||
|
||||
function phaseBRepair(opts: OrchestratorOpts): OrchestratorPhaseResult {
|
||||
if (opts.dryRun) return { name: 'jsonb_repair', status: 'skipped', detail: 'dry-run' };
|
||||
try {
|
||||
execSync('gbrain repair-jsonb', { stdio: 'inherit', timeout: 600_000, env: process.env });
|
||||
return { name: 'jsonb_repair', status: 'complete' };
|
||||
} catch (e) {
|
||||
const msg = e instanceof Error ? e.message : String(e);
|
||||
return { name: 'jsonb_repair', status: 'failed', detail: msg };
|
||||
}
|
||||
}
|
||||
|
||||
// ── Phase C — Verify ────────────────────────────────────────
|
||||
|
||||
function phaseCVerify(opts: OrchestratorOpts): OrchestratorPhaseResult {
|
||||
if (opts.dryRun) return { name: 'verify', status: 'skipped', detail: 'dry-run' };
|
||||
try {
|
||||
const out = execSync('gbrain repair-jsonb --dry-run --json', {
|
||||
encoding: 'utf-8', timeout: 60_000, env: process.env,
|
||||
});
|
||||
const parsed = JSON.parse(out) as { total_repaired?: number; engine?: string };
|
||||
const remaining = parsed.total_repaired ?? 0;
|
||||
if (remaining > 0) {
|
||||
return {
|
||||
name: 'verify',
|
||||
status: 'failed',
|
||||
detail: `${remaining} string-typed JSONB rows remain after repair`,
|
||||
};
|
||||
}
|
||||
return { name: 'verify', status: 'complete', detail: parsed.engine ? `engine=${parsed.engine}` : undefined };
|
||||
} catch (e) {
|
||||
const msg = e instanceof Error ? e.message : String(e);
|
||||
return { name: 'verify', status: 'failed', detail: msg };
|
||||
}
|
||||
}
|
||||
|
||||
// ── Orchestrator ────────────────────────────────────────────
|
||||
|
||||
async function orchestrator(opts: OrchestratorOpts): Promise<OrchestratorResult> {
|
||||
console.log('');
|
||||
console.log('=== v0.12.2 — JSONB double-encode repair ===');
|
||||
if (opts.dryRun) console.log(' (dry-run; no side effects)');
|
||||
console.log('');
|
||||
|
||||
const phases: OrchestratorPhaseResult[] = [];
|
||||
|
||||
const a = phaseASchema(opts);
|
||||
phases.push(a);
|
||||
if (a.status === 'failed') return finalizeResult(phases, 'failed');
|
||||
|
||||
const b = phaseBRepair(opts);
|
||||
phases.push(b);
|
||||
if (b.status === 'failed') return finalizeResult(phases, 'failed');
|
||||
|
||||
const c = phaseCVerify(opts);
|
||||
phases.push(c);
|
||||
|
||||
const overallStatus: 'complete' | 'partial' | 'failed' =
|
||||
a.status === 'failed' || b.status === 'failed' ? 'failed' :
|
||||
c.status === 'failed' ? 'partial' :
|
||||
'complete';
|
||||
|
||||
return finalizeResult(phases, overallStatus);
|
||||
}
|
||||
|
||||
function finalizeResult(phases: OrchestratorPhaseResult[], status: 'complete' | 'partial' | 'failed'): OrchestratorResult {
|
||||
if (status !== 'failed') {
|
||||
try {
|
||||
appendCompletedMigration({ version: '0.12.2', status: status as 'complete' | 'partial' });
|
||||
} catch {
|
||||
// Recording is best-effort.
|
||||
}
|
||||
}
|
||||
return {
|
||||
version: '0.12.2',
|
||||
status,
|
||||
phases,
|
||||
};
|
||||
}
|
||||
|
||||
export const v0_12_2: Migration = {
|
||||
version: '0.12.2',
|
||||
featurePitch: {
|
||||
headline: 'Postgres frontmatter queries now work — JSONB double-encode bug fixed and existing rows auto-repaired',
|
||||
description:
|
||||
'gbrain v0.12.0-and-earlier silently stored JSONB columns as quoted string literals on ' +
|
||||
'Postgres/Supabase (PGLite was unaffected). Every `frontmatter->>\'key\'` returned NULL ' +
|
||||
'and GIN indexes were inert. v0.12.2 fixes the writes AND auto-repairs every existing ' +
|
||||
'string-typed row in pages.frontmatter, raw_data.data, ingest_log.pages_updated, ' +
|
||||
'files.metadata, and page_versions.frontmatter. The migration is idempotent. Pages ' +
|
||||
'truncated by the splitBody horizontal-rule bug can be recovered with `gbrain sync --full`.',
|
||||
},
|
||||
orchestrator,
|
||||
};
|
||||
|
||||
/** Exported for unit tests. */
|
||||
export const __testing = {
|
||||
phaseASchema,
|
||||
phaseBRepair,
|
||||
phaseCVerify,
|
||||
};
|
||||
151
src/commands/repair-jsonb.ts
Normal file
151
src/commands/repair-jsonb.ts
Normal file
@@ -0,0 +1,151 @@
|
||||
/**
|
||||
* `gbrain repair-jsonb` — repair JSONB columns that were stored as string
|
||||
* literals due to the v0.12.0-and-earlier double-encode bug.
|
||||
*
|
||||
* Background: postgres-engine.ts wrote frontmatter and other JSONB columns
|
||||
* via the buggy `JSON.stringify(value)`-then-cast-to-jsonb interpolation
|
||||
* pattern, which postgres.js v3 stringified AGAIN on the wire. Result: every `frontmatter->>'key'` query returned NULL
|
||||
* on Postgres-backed brains; GIN indexes were inert. PGLite was unaffected
|
||||
* (different driver path). v0.12.1 fixes the writes (sql.json) but existing
|
||||
* rows stay broken until they're rewritten — that's what this command does.
|
||||
*
|
||||
* Strategy: for each affected JSONB column, detect rows where
|
||||
* `jsonb_typeof(col) = 'string'` and rewrite them via `(col #>> '{}')::jsonb`,
|
||||
* which extracts the string payload and re-parses it as JSONB. Idempotent:
|
||||
* re-running is a no-op (no rows match the guard). PGLite is a no-op too
|
||||
* (it never wrote string-typed JSONB).
|
||||
*
|
||||
* Affected columns (audit of src/schema.sql):
|
||||
* - pages.frontmatter (postgres-engine.ts:107 putPage)
|
||||
* - raw_data.data (postgres-engine.ts:668 putRawData)
|
||||
* - ingest_log.pages_updated (postgres-engine.ts:846 logIngest)
|
||||
* - files.metadata (commands/files.ts:254 file upload)
|
||||
* - page_versions.frontmatter (downstream of pages.frontmatter via
|
||||
* INSERT...SELECT FROM pages)
|
||||
*
|
||||
* Other JSONB columns (minion_jobs.{data,result,progress,stacktrace},
|
||||
* minion_inbox.payload) were always written via parameterized form ($N::jsonb
|
||||
* with a string parameter, not interpolation) so they were never affected.
|
||||
*/
|
||||
|
||||
import { loadConfig, toEngineConfig } from '../core/config.ts';
|
||||
import type { EngineConfig } from '../core/types.ts';
|
||||
import * as db from '../core/db.ts';
|
||||
|
||||
interface RepairTarget {
|
||||
table: string;
|
||||
column: string;
|
||||
/** Optional secondary key column for logging. */
|
||||
keyCol?: string;
|
||||
}
|
||||
|
||||
const TARGETS: RepairTarget[] = [
|
||||
{ table: 'pages', column: 'frontmatter', keyCol: 'slug' },
|
||||
{ table: 'raw_data', column: 'data', keyCol: 'source' },
|
||||
{ table: 'ingest_log', column: 'pages_updated', keyCol: 'source_ref' },
|
||||
{ table: 'files', column: 'metadata', keyCol: 'storage_path' },
|
||||
{ table: 'page_versions', column: 'frontmatter', keyCol: 'snapshot_at' },
|
||||
];
|
||||
|
||||
export interface RepairResult {
|
||||
engine: string;
|
||||
per_target: Array<{
|
||||
table: string;
|
||||
column: string;
|
||||
rows_repaired: number;
|
||||
}>;
|
||||
total_repaired: number;
|
||||
}
|
||||
|
||||
export interface RepairOpts {
|
||||
dryRun: boolean;
|
||||
/** Engine config override (for tests). Defaults to loadConfig() result. */
|
||||
engineConfig?: EngineConfig;
|
||||
}
|
||||
|
||||
/**
|
||||
* Run the repair against the currently-configured engine.
|
||||
*
|
||||
* On PGLite this finds 0 rows (the bug never affected the parameterized
|
||||
* encode path PGLite uses) and exits cleanly. On Postgres it issues one
|
||||
* idempotent UPDATE per target column.
|
||||
*/
|
||||
export async function repairJsonb(opts: RepairOpts = { dryRun: false }): Promise<RepairResult> {
|
||||
let engineCfg = opts.engineConfig;
|
||||
if (!engineCfg) {
|
||||
const config = loadConfig();
|
||||
if (!config) {
|
||||
throw new Error('No brain configured. Run: gbrain init');
|
||||
}
|
||||
engineCfg = toEngineConfig(config);
|
||||
}
|
||||
const engineKind = engineCfg.engine || 'postgres';
|
||||
|
||||
const result: RepairResult = {
|
||||
engine: engineKind,
|
||||
per_target: [],
|
||||
total_repaired: 0,
|
||||
};
|
||||
|
||||
if (engineKind === 'pglite') {
|
||||
for (const t of TARGETS) {
|
||||
result.per_target.push({ table: t.table, column: t.column, rows_repaired: 0 });
|
||||
}
|
||||
return result;
|
||||
}
|
||||
|
||||
await db.connect(engineCfg);
|
||||
const sql = db.getConnection();
|
||||
|
||||
for (const t of TARGETS) {
|
||||
let repaired = 0;
|
||||
|
||||
if (opts.dryRun) {
|
||||
const rows = await sql.unsafe(
|
||||
`SELECT count(*)::int AS n FROM ${t.table} WHERE jsonb_typeof(${t.column}) = 'string'`,
|
||||
);
|
||||
repaired = (rows[0] as { n: number }).n;
|
||||
} else {
|
||||
const rows = await sql.unsafe(
|
||||
`UPDATE ${t.table}
|
||||
SET ${t.column} = (${t.column} #>> '{}')::jsonb
|
||||
WHERE jsonb_typeof(${t.column}) = 'string'
|
||||
RETURNING 1`,
|
||||
);
|
||||
repaired = rows.length;
|
||||
}
|
||||
|
||||
result.per_target.push({ table: t.table, column: t.column, rows_repaired: repaired });
|
||||
result.total_repaired += repaired;
|
||||
}
|
||||
|
||||
return result;
|
||||
}
|
||||
|
||||
export async function runRepairJsonbCli(args: string[]): Promise<void> {
|
||||
const dryRun = args.includes('--dry-run');
|
||||
const jsonMode = args.includes('--json');
|
||||
|
||||
const result = await repairJsonb({ dryRun });
|
||||
|
||||
if (jsonMode) {
|
||||
console.log(JSON.stringify({ status: 'ok', dry_run: dryRun, ...result }));
|
||||
return;
|
||||
}
|
||||
|
||||
if (result.engine === 'pglite') {
|
||||
console.log('Engine: pglite — JSONB double-encode bug never affected this path. No-op.');
|
||||
return;
|
||||
}
|
||||
|
||||
console.log(`${dryRun ? '[dry-run] ' : ''}Engine: postgres`);
|
||||
console.log(`${dryRun ? '[dry-run] ' : ''}JSONB repair across ${TARGETS.length} columns:`);
|
||||
for (const t of result.per_target) {
|
||||
const verb = dryRun ? 'would repair' : 'repaired';
|
||||
console.log(` ${t.table}.${t.column}: ${verb} ${t.rows_repaired} rows`);
|
||||
}
|
||||
console.log(`${dryRun ? '[dry-run] ' : ''}Total ${dryRun ? 'to repair' : 'repaired'}: ${result.total_repaired} rows`);
|
||||
if (!dryRun && result.total_repaired === 0) {
|
||||
console.log('Nothing to repair (already-valid JSONB or fresh install).');
|
||||
}
|
||||
}
|
||||
@@ -22,14 +22,16 @@ export interface ParsedMarkdown {
|
||||
* tags: [startups, growth]
|
||||
* ---
|
||||
* Compiled truth content here...
|
||||
* ---
|
||||
*
|
||||
* <!-- timeline -->
|
||||
* Timeline content here...
|
||||
*
|
||||
* The first --- pair is YAML frontmatter (handled by gray-matter).
|
||||
* After frontmatter, the body is split at the first standalone ---
|
||||
* (a line containing only --- with optional whitespace).
|
||||
* Everything before is compiled_truth, everything after is timeline.
|
||||
* If no body --- exists, all content is compiled_truth.
|
||||
* After frontmatter, the body is split at the first recognized timeline
|
||||
* sentinel: `<!-- timeline -->` (preferred), `--- timeline ---` (decorated),
|
||||
* or a plain `---` immediately preceding a `## Timeline` / `## History`
|
||||
* heading (backward-compat for existing files). A bare `---` in body text
|
||||
* is treated as a markdown horizontal rule, not a timeline separator.
|
||||
*/
|
||||
export function parseMarkdown(content: string, filePath?: string): ParsedMarkdown {
|
||||
const { data: frontmatter, content: body } = matter(content);
|
||||
@@ -62,26 +64,21 @@ export function parseMarkdown(content: string, filePath?: string): ParsedMarkdow
|
||||
}
|
||||
|
||||
/**
|
||||
* Split body content at first standalone --- separator.
|
||||
* Split body content at the first recognized timeline sentinel.
|
||||
* Returns compiled_truth (before) and timeline (after).
|
||||
*
|
||||
* Recognized sentinels (in order of precedence):
|
||||
* 1. `<!-- timeline -->` — preferred, unambiguous, what serializeMarkdown emits
|
||||
* 2. `--- timeline ---` — decorated separator
|
||||
* 3. `---` ONLY when the next non-empty line is `## Timeline` or `## History`
|
||||
* (backward-compat fallback for older gbrain-written files)
|
||||
*
|
||||
* A plain `---` line is a markdown horizontal rule, NOT a timeline separator.
|
||||
* Treating bare `---` as a separator caused 83% content truncation on wiki corpora.
|
||||
*/
|
||||
export function splitBody(body: string): { compiled_truth: string; timeline: string } {
|
||||
// Match a line that is only --- (with optional whitespace)
|
||||
// Must not be at the very start (that would be frontmatter)
|
||||
const lines = body.split('\n');
|
||||
let splitIndex = -1;
|
||||
|
||||
for (let i = 0; i < lines.length; i++) {
|
||||
const trimmed = lines[i].trim();
|
||||
if (trimmed === '---') {
|
||||
// Skip if this is the very first non-empty line (leftover from frontmatter parsing)
|
||||
const beforeContent = lines.slice(0, i).join('\n').trim();
|
||||
if (beforeContent.length > 0) {
|
||||
splitIndex = i;
|
||||
break;
|
||||
}
|
||||
}
|
||||
}
|
||||
const splitIndex = findTimelineSplitIndex(lines);
|
||||
|
||||
if (splitIndex === -1) {
|
||||
return { compiled_truth: body, timeline: '' };
|
||||
@@ -92,6 +89,33 @@ export function splitBody(body: string): { compiled_truth: string; timeline: str
|
||||
return { compiled_truth, timeline };
|
||||
}
|
||||
|
||||
function findTimelineSplitIndex(lines: string[]): number {
|
||||
for (let i = 0; i < lines.length; i++) {
|
||||
const trimmed = lines[i].trim();
|
||||
|
||||
if (trimmed === '<!-- timeline -->' || trimmed === '<!--timeline-->') {
|
||||
return i;
|
||||
}
|
||||
|
||||
if (trimmed === '--- timeline ---' || /^---\s+timeline\s+---$/i.test(trimmed)) {
|
||||
return i;
|
||||
}
|
||||
|
||||
if (trimmed === '---') {
|
||||
const beforeContent = lines.slice(0, i).join('\n').trim();
|
||||
if (beforeContent.length === 0) continue;
|
||||
|
||||
for (let j = i + 1; j < lines.length; j++) {
|
||||
const next = lines[j].trim();
|
||||
if (next.length === 0) continue;
|
||||
if (/^##\s+(timeline|history)\b/i.test(next)) return i;
|
||||
break;
|
||||
}
|
||||
}
|
||||
}
|
||||
return -1;
|
||||
}
|
||||
|
||||
/**
|
||||
* Serialize a page back to markdown format.
|
||||
* Produces: frontmatter + compiled_truth + --- + timeline
|
||||
@@ -116,7 +140,7 @@ export function serializeMarkdown(
|
||||
|
||||
let body = compiled_truth;
|
||||
if (timeline) {
|
||||
body += '\n\n---\n\n' + timeline;
|
||||
body += '\n\n<!-- timeline -->\n\n' + timeline;
|
||||
}
|
||||
|
||||
return yamlContent + '\n\n' + body + '\n';
|
||||
@@ -125,8 +149,18 @@ export function serializeMarkdown(
|
||||
function inferType(filePath?: string): PageType {
|
||||
if (!filePath) return 'concept';
|
||||
|
||||
// Normalize: add leading / for consistent matching
|
||||
// Normalize: add leading / for consistent matching.
|
||||
// Wiki subtypes and /writing/ check FIRST — they're stronger signals than
|
||||
// ancestor directories. e.g. `projects/blog/writing/essay.md` is a piece of
|
||||
// writing, not a project page; `tech/wiki/analysis/foo.md` is analysis,
|
||||
// not a hit on the broader `tech/` ancestor.
|
||||
const lower = ('/' + filePath).toLowerCase();
|
||||
if (lower.includes('/writing/')) return 'writing';
|
||||
if (lower.includes('/wiki/analysis/')) return 'analysis';
|
||||
if (lower.includes('/wiki/guides/') || lower.includes('/wiki/guide/')) return 'guide';
|
||||
if (lower.includes('/wiki/hardware/')) return 'hardware';
|
||||
if (lower.includes('/wiki/architecture/')) return 'architecture';
|
||||
if (lower.includes('/wiki/concepts/') || lower.includes('/wiki/concept/')) return 'concept';
|
||||
if (lower.includes('/people/') || lower.includes('/person/')) return 'person';
|
||||
if (lower.includes('/companies/') || lower.includes('/company/')) return 'company';
|
||||
if (lower.includes('/deals/') || lower.includes('/deal/')) return 'deal';
|
||||
|
||||
@@ -17,7 +17,7 @@ import type {
|
||||
} from './types.ts';
|
||||
import { GBrainError } from './types.ts';
|
||||
import * as db from './db.ts';
|
||||
import { validateSlug, contentHash, rowToPage, rowToChunk, rowToSearchResult } from './utils.ts';
|
||||
import { validateSlug, contentHash, rowToPage, rowToChunk, rowToSearchResult, parseEmbedding } from './utils.ts';
|
||||
|
||||
export class PostgresEngine implements BrainEngine {
|
||||
private _sql: ReturnType<typeof postgres> | null = null;
|
||||
@@ -104,7 +104,7 @@ export class PostgresEngine implements BrainEngine {
|
||||
|
||||
const rows = await sql`
|
||||
INSERT INTO pages (slug, type, title, compiled_truth, timeline, frontmatter, content_hash, updated_at)
|
||||
VALUES (${slug}, ${page.type}, ${page.title}, ${page.compiled_truth}, ${page.timeline || ''}, ${JSON.stringify(frontmatter)}::jsonb, ${hash}, now())
|
||||
VALUES (${slug}, ${page.type}, ${page.title}, ${page.compiled_truth}, ${page.timeline || ''}, ${sql.json(frontmatter)}, ${hash}, now())
|
||||
ON CONFLICT (slug) DO UPDATE SET
|
||||
type = EXCLUDED.type,
|
||||
title = EXCLUDED.title,
|
||||
@@ -272,7 +272,8 @@ export class PostgresEngine implements BrainEngine {
|
||||
`;
|
||||
const result = new Map<number, Float32Array>();
|
||||
for (const row of rows) {
|
||||
if (row.embedding) result.set(row.id as number, row.embedding as Float32Array);
|
||||
const parsed = parseEmbedding(row.embedding);
|
||||
if (parsed) result.set(row.id as number, parsed);
|
||||
}
|
||||
return result;
|
||||
}
|
||||
@@ -711,7 +712,7 @@ export class PostgresEngine implements BrainEngine {
|
||||
const sql = this.sql;
|
||||
const result = await sql`
|
||||
INSERT INTO raw_data (page_id, source, data)
|
||||
SELECT id, ${source}, ${JSON.stringify(data)}::jsonb
|
||||
SELECT id, ${source}, ${sql.json(data as Record<string, unknown>)}
|
||||
FROM pages WHERE slug = ${slug}
|
||||
ON CONFLICT (page_id, source) DO UPDATE SET
|
||||
data = EXCLUDED.data,
|
||||
@@ -889,7 +890,7 @@ export class PostgresEngine implements BrainEngine {
|
||||
const sql = this.sql;
|
||||
await sql`
|
||||
INSERT INTO ingest_log (source_type, source_ref, pages_updated, summary)
|
||||
VALUES (${entry.source_type}, ${entry.source_ref}, ${JSON.stringify(entry.pages_updated)}::jsonb, ${entry.summary})
|
||||
VALUES (${entry.source_type}, ${entry.source_ref}, ${sql.json(entry.pages_updated)}, ${entry.summary})
|
||||
`;
|
||||
}
|
||||
|
||||
|
||||
@@ -1,5 +1,5 @@
|
||||
// Page types
|
||||
export type PageType = 'person' | 'company' | 'deal' | 'yc' | 'civic' | 'project' | 'concept' | 'source' | 'media';
|
||||
export type PageType = 'person' | 'company' | 'deal' | 'yc' | 'civic' | 'project' | 'concept' | 'source' | 'media' | 'writing' | 'analysis' | 'guide' | 'hardware' | 'architecture';
|
||||
|
||||
export interface Page {
|
||||
id: number;
|
||||
|
||||
@@ -43,6 +43,51 @@ export function rowToPage(row: Record<string, unknown>): Page {
|
||||
};
|
||||
}
|
||||
|
||||
/**
|
||||
* Normalize an embedding value into a Float32Array.
|
||||
*
|
||||
* pgvector returns embeddings in different shapes depending on driver/path:
|
||||
* - postgres.js (Postgres): often a string like `"[0.1,0.2,...]"`
|
||||
* - pglite: typically a numeric array or Float32Array
|
||||
* - pgvector node binding: numeric array
|
||||
* - Some queries that JSON-aggregate embeddings: JSON-string array
|
||||
*
|
||||
* Without normalization, downstream cosine math sees a string and produces
|
||||
* NaN scores silently. This helper guarantees a Float32Array or throws
|
||||
* loudly on malformed input — never returns NaN.
|
||||
*/
|
||||
export function parseEmbedding(value: unknown): Float32Array | null {
|
||||
if (value === null || value === undefined) return null;
|
||||
if (value instanceof Float32Array) return value;
|
||||
if (Array.isArray(value)) {
|
||||
if (value.length === 0) return new Float32Array(0);
|
||||
if (typeof value[0] !== 'number') {
|
||||
throw new Error(`parseEmbedding: array contains non-numeric element (${typeof value[0]})`);
|
||||
}
|
||||
return Float32Array.from(value as number[]);
|
||||
}
|
||||
if (typeof value === 'string') {
|
||||
const trimmed = value.trim();
|
||||
// Plain non-vector strings: treat as "no embedding here", return null.
|
||||
// Strings that LOOK like vector literals but contain garbage: throw,
|
||||
// because that's a real corruption signal worth surfacing loudly.
|
||||
if (!trimmed.startsWith('[') || !trimmed.endsWith(']')) return null;
|
||||
const inner = trimmed.slice(1, -1).trim();
|
||||
if (inner.length === 0) return new Float32Array(0);
|
||||
const parts = inner.split(',');
|
||||
const out = new Float32Array(parts.length);
|
||||
for (let i = 0; i < parts.length; i++) {
|
||||
const n = Number(parts[i].trim());
|
||||
if (!Number.isFinite(n)) {
|
||||
throw new Error(`parseEmbedding: non-finite value at index ${i}: ${parts[i]}`);
|
||||
}
|
||||
out[i] = n;
|
||||
}
|
||||
return out;
|
||||
}
|
||||
return null;
|
||||
}
|
||||
|
||||
export function rowToChunk(row: Record<string, unknown>, includeEmbedding = false): Chunk {
|
||||
return {
|
||||
id: row.id as number,
|
||||
@@ -50,7 +95,7 @@ export function rowToChunk(row: Record<string, unknown>, includeEmbedding = fals
|
||||
chunk_index: row.chunk_index as number,
|
||||
chunk_text: row.chunk_text as string,
|
||||
chunk_source: row.chunk_source as 'compiled_truth' | 'timeline',
|
||||
embedding: includeEmbedding && row.embedding ? row.embedding as Float32Array : null,
|
||||
embedding: includeEmbedding ? parseEmbedding(row.embedding) : null,
|
||||
model: row.model as string,
|
||||
token_count: row.token_count as number | null,
|
||||
embedded_at: row.embedded_at ? new Date(row.embedded_at as string) : null,
|
||||
|
||||
@@ -102,9 +102,10 @@ describe('buildPlan — diff against completed + installed VERSION', () => {
|
||||
expect(plan.applied).toEqual([]);
|
||||
expect(plan.partial).toEqual([]);
|
||||
expect(plan.pending.map(m => m.version)).toContain('0.11.0');
|
||||
// v0.12.0 (Knowledge Graph auto-wire) is registered but installed VERSION
|
||||
// is 0.11.1, so it lands in skippedFuture until the binary catches up.
|
||||
expect(plan.skippedFuture.map(m => m.version)).toEqual(['0.12.0']);
|
||||
// v0.12.0 (Knowledge Graph) and v0.12.2 (JSONB repair) are registered but
|
||||
// installed VERSION is 0.11.1, so they land in skippedFuture until the
|
||||
// binary catches up.
|
||||
expect(plan.skippedFuture.map(m => m.version)).toEqual(['0.12.0', '0.12.2']);
|
||||
});
|
||||
|
||||
test('already applied → v0.11.0 lands in `applied` bucket, not pending', () => {
|
||||
@@ -140,7 +141,10 @@ describe('buildPlan — diff against completed + installed VERSION', () => {
|
||||
const idx = indexCompleted([]);
|
||||
const plan = buildPlan(idx, '0.12.0');
|
||||
expect(plan.pending.map(m => m.version)).toContain('0.11.0');
|
||||
expect(plan.skippedFuture).toEqual([]);
|
||||
// v0.12.2 was added later (JSONB repair); installed=0.12.0 means it
|
||||
// belongs in skippedFuture, not pending. v0.11.0 and v0.12.0 stay
|
||||
// pending despite being ≤ installed — that is the H9 invariant.
|
||||
expect(plan.skippedFuture.map(m => m.version)).toEqual(['0.12.2']);
|
||||
});
|
||||
|
||||
test('--migration filter narrows to one version', () => {
|
||||
|
||||
174
test/e2e/postgres-jsonb.test.ts
Normal file
174
test/e2e/postgres-jsonb.test.ts
Normal file
@@ -0,0 +1,174 @@
|
||||
/**
|
||||
* E2E JSONB round-trip tests — the test that should have caught the v0.12.0
|
||||
* silent-data-loss bug originally.
|
||||
*
|
||||
* v0.12.0-and-earlier wrote JSONB columns via `${JSON.stringify(value)}::jsonb`
|
||||
* which postgres.js v3 stringified again on the wire. Result: every JSONB
|
||||
* column stored a quoted-string literal instead of an object. Every
|
||||
* `frontmatter->>'key'` query returned NULL. PGLite was unaffected (different
|
||||
* driver path), which is why every previous unit test passed while real
|
||||
* Postgres-backed brains silently lost data.
|
||||
*
|
||||
* These tests exercise each of the four JSONB write sites and assert that:
|
||||
* 1. `jsonb_typeof(col) = 'object'` (or 'array' for array-shaped values)
|
||||
* — proves the column is a real JSONB structure, not a string literal.
|
||||
* 2. `col->>'key'` returns the expected scalar — proves downstream queries
|
||||
* and GIN indexes will work as intended.
|
||||
*
|
||||
* Without these E2E assertions, the CI grep guard in scripts/check-jsonb-pattern.sh
|
||||
* is the only protection — and it doesn't catch helper-wrapped or multi-line
|
||||
* variants of the buggy pattern.
|
||||
*
|
||||
* Run: DATABASE_URL=... bun test test/e2e/postgres-jsonb.test.ts
|
||||
*/
|
||||
|
||||
import { describe, test, expect, beforeAll, afterAll } from 'bun:test';
|
||||
import {
|
||||
hasDatabase, setupDB, teardownDB, getEngine, getConn,
|
||||
} from './helpers.ts';
|
||||
|
||||
const skip = !hasDatabase();
|
||||
const describeE2E = skip ? describe.skip : describe;
|
||||
|
||||
if (skip) {
|
||||
console.log('Skipping E2E JSONB round-trip tests (DATABASE_URL not set)');
|
||||
}
|
||||
|
||||
describeE2E('Postgres JSONB round-trip — frontmatter / data / pages_updated / metadata', () => {
|
||||
beforeAll(async () => { await setupDB(); });
|
||||
afterAll(async () => { await teardownDB(); });
|
||||
|
||||
test('pages.frontmatter — putPage stores object, not string literal', async () => {
|
||||
const engine = getEngine();
|
||||
const conn = getConn();
|
||||
|
||||
await engine.putPage('jsonb-test/frontmatter', {
|
||||
type: 'concept',
|
||||
title: 'JSONB roundtrip',
|
||||
compiled_truth: 'body',
|
||||
frontmatter: { author: 'garry', score: 7, tags: ['x', 'y'] },
|
||||
});
|
||||
|
||||
const rows = await conn.unsafe(`
|
||||
SELECT
|
||||
jsonb_typeof(frontmatter) AS jt,
|
||||
frontmatter->>'author' AS author,
|
||||
frontmatter->>'score' AS score,
|
||||
frontmatter->'tags' AS tags
|
||||
FROM pages
|
||||
WHERE slug = 'jsonb-test/frontmatter'
|
||||
`);
|
||||
|
||||
expect(rows).toHaveLength(1);
|
||||
expect(rows[0].jt).toBe('object');
|
||||
expect(rows[0].author).toBe('garry');
|
||||
expect(rows[0].score).toBe('7');
|
||||
expect(rows[0].tags).toEqual(['x', 'y']);
|
||||
});
|
||||
|
||||
test('raw_data.data — putRawData stores object, not string literal', async () => {
|
||||
const engine = getEngine();
|
||||
const conn = getConn();
|
||||
|
||||
await engine.putPage('jsonb-test/raw', { type: 'concept', title: 't', compiled_truth: '' });
|
||||
await engine.putRawData('jsonb-test/raw', 'unit-test', { kind: 'fixture', count: 42 });
|
||||
|
||||
const rows = await conn.unsafe(`
|
||||
SELECT
|
||||
jsonb_typeof(rd.data) AS jt,
|
||||
rd.data->>'kind' AS kind,
|
||||
rd.data->>'count' AS count
|
||||
FROM raw_data rd
|
||||
JOIN pages p ON p.id = rd.page_id
|
||||
WHERE p.slug = 'jsonb-test/raw' AND rd.source = 'unit-test'
|
||||
`);
|
||||
|
||||
expect(rows).toHaveLength(1);
|
||||
expect(rows[0].jt).toBe('object');
|
||||
expect(rows[0].kind).toBe('fixture');
|
||||
expect(rows[0].count).toBe('42');
|
||||
});
|
||||
|
||||
test('ingest_log.pages_updated — logIngest stores array, not string literal', async () => {
|
||||
const engine = getEngine();
|
||||
const conn = getConn();
|
||||
|
||||
await engine.logIngest({
|
||||
source_type: 'unit-test',
|
||||
source_ref: 'jsonb-roundtrip',
|
||||
pages_updated: ['a/b', 'c/d', 'e/f'],
|
||||
summary: 'roundtrip-check',
|
||||
});
|
||||
|
||||
const rows = await conn.unsafe(`
|
||||
SELECT
|
||||
jsonb_typeof(pages_updated) AS jt,
|
||||
pages_updated->>0 AS first,
|
||||
jsonb_array_length(pages_updated) AS len
|
||||
FROM ingest_log
|
||||
WHERE source_ref = 'jsonb-roundtrip'
|
||||
`);
|
||||
|
||||
expect(rows).toHaveLength(1);
|
||||
expect(rows[0].jt).toBe('array');
|
||||
expect(rows[0].first).toBe('a/b');
|
||||
expect(rows[0].len).toBe(3);
|
||||
});
|
||||
|
||||
test('files.metadata — write site uses sql.json, not string interpolation', async () => {
|
||||
const conn = getConn();
|
||||
|
||||
// Mimic the write at src/commands/files.ts:254 (the bonus fix).
|
||||
await conn`
|
||||
INSERT INTO files (filename, storage_path, mime_type, size_bytes, content_hash, metadata)
|
||||
VALUES (
|
||||
'roundtrip.bin',
|
||||
'unit-test/roundtrip.bin',
|
||||
'application/octet-stream',
|
||||
${0},
|
||||
'sha256:test',
|
||||
${conn.json({ type: 'archive', upload_method: 'unit-test' })}
|
||||
)
|
||||
`;
|
||||
|
||||
const rows = await conn.unsafe(`
|
||||
SELECT
|
||||
jsonb_typeof(metadata) AS jt,
|
||||
metadata->>'type' AS type,
|
||||
metadata->>'upload_method' AS method
|
||||
FROM files
|
||||
WHERE storage_path = 'unit-test/roundtrip.bin'
|
||||
`);
|
||||
|
||||
expect(rows).toHaveLength(1);
|
||||
expect(rows[0].jt).toBe('object');
|
||||
expect(rows[0].type).toBe('archive');
|
||||
expect(rows[0].method).toBe('unit-test');
|
||||
});
|
||||
|
||||
test('page_versions.frontmatter — INSERT...SELECT propagates object shape', async () => {
|
||||
const engine = getEngine();
|
||||
const conn = getConn();
|
||||
|
||||
await engine.putPage('jsonb-test/versioned', {
|
||||
type: 'concept',
|
||||
title: 'versioned',
|
||||
compiled_truth: 'v1',
|
||||
frontmatter: { mood: 'happy' },
|
||||
});
|
||||
await engine.createVersion('jsonb-test/versioned');
|
||||
|
||||
const rows = await conn.unsafe(`
|
||||
SELECT
|
||||
jsonb_typeof(pv.frontmatter) AS jt,
|
||||
pv.frontmatter->>'mood' AS mood
|
||||
FROM page_versions pv
|
||||
JOIN pages p ON p.id = pv.page_id
|
||||
WHERE p.slug = 'jsonb-test/versioned'
|
||||
`);
|
||||
|
||||
expect(rows.length).toBeGreaterThan(0);
|
||||
expect(rows[0].jt).toBe('object');
|
||||
expect(rows[0].mood).toBe('happy');
|
||||
});
|
||||
});
|
||||
@@ -252,7 +252,7 @@ title: Chunked
|
||||
|
||||
This is compiled truth content that should be chunked as compiled_truth source.
|
||||
|
||||
---
|
||||
<!-- timeline -->
|
||||
|
||||
- 2024-01-01: This is timeline content that should be chunked as timeline source.
|
||||
`);
|
||||
|
||||
@@ -2,7 +2,7 @@ import { describe, test, expect } from 'bun:test';
|
||||
import { parseMarkdown, serializeMarkdown, splitBody } from '../src/core/markdown.ts';
|
||||
|
||||
describe('Markdown Parser', () => {
|
||||
test('parses frontmatter + compiled_truth + timeline', () => {
|
||||
test('parses frontmatter + compiled_truth + timeline (explicit sentinel)', () => {
|
||||
const md = `---
|
||||
type: concept
|
||||
title: Do Things That Don't Scale
|
||||
@@ -11,7 +11,7 @@ tags: [startups, growth]
|
||||
|
||||
Paul Graham argues that startups should do unscalable things early on.
|
||||
|
||||
---
|
||||
<!-- timeline -->
|
||||
|
||||
- 2013-07-01: Published on paulgraham.com
|
||||
- 2024-11-15: Referenced in batch kickoff talk
|
||||
@@ -90,30 +90,75 @@ Content
|
||||
});
|
||||
|
||||
describe('splitBody', () => {
|
||||
test('splits at first standalone ---', () => {
|
||||
const body = 'Above the line\n\n---\n\nBelow the line';
|
||||
test('splits at <!-- timeline --> sentinel', () => {
|
||||
const body = 'Above the line\n\n<!-- timeline -->\n\nBelow the line';
|
||||
const { compiled_truth, timeline } = splitBody(body);
|
||||
expect(compiled_truth).toContain('Above the line');
|
||||
expect(timeline).toContain('Below the line');
|
||||
});
|
||||
|
||||
test('returns all as compiled_truth if no separator', () => {
|
||||
test('splits at --- timeline --- sentinel', () => {
|
||||
const body = 'Above the line\n\n--- timeline ---\n\nBelow the line';
|
||||
const { compiled_truth, timeline } = splitBody(body);
|
||||
expect(compiled_truth).toContain('Above the line');
|
||||
expect(timeline).toContain('Below the line');
|
||||
});
|
||||
|
||||
test('splits at --- when followed by ## Timeline heading', () => {
|
||||
const body = 'Article content\n\n---\n\n## Timeline\n\n- 2024: Event happened';
|
||||
const { compiled_truth, timeline } = splitBody(body);
|
||||
expect(compiled_truth).toContain('Article content');
|
||||
expect(timeline).toContain('## Timeline');
|
||||
expect(timeline).toContain('Event happened');
|
||||
});
|
||||
|
||||
test('splits at --- when followed by ## History heading', () => {
|
||||
const body = 'Article content\n\n---\n\n## History\n\n- 2020: Founded';
|
||||
const { compiled_truth, timeline } = splitBody(body);
|
||||
expect(compiled_truth).toContain('Article content');
|
||||
expect(timeline).toContain('## History');
|
||||
});
|
||||
|
||||
test('does NOT split at plain --- (horizontal rule in article body)', () => {
|
||||
const body = 'Above the line\n\n---\n\nBelow the line';
|
||||
const { compiled_truth, timeline } = splitBody(body);
|
||||
expect(compiled_truth).toBe(body);
|
||||
expect(timeline).toBe('');
|
||||
});
|
||||
|
||||
test('does NOT split on multiple plain --- horizontal rules', () => {
|
||||
const body = 'Section 1\n\n---\n\nSection 2\n\n---\n\nSection 3';
|
||||
const { compiled_truth, timeline } = splitBody(body);
|
||||
expect(compiled_truth).toBe(body);
|
||||
expect(timeline).toBe('');
|
||||
});
|
||||
|
||||
test('returns all as compiled_truth if no sentinel', () => {
|
||||
const body = 'Just some content\nWith multiple lines';
|
||||
const { compiled_truth, timeline } = splitBody(body);
|
||||
expect(compiled_truth).toBe(body);
|
||||
expect(timeline).toBe('');
|
||||
});
|
||||
|
||||
test('handles --- at end of content', () => {
|
||||
test('plain --- at end of content stays in compiled_truth', () => {
|
||||
const body = 'Content here\n\n---\n';
|
||||
const { compiled_truth, timeline } = splitBody(body);
|
||||
expect(compiled_truth).toContain('Content here');
|
||||
expect(timeline.trim()).toBe('');
|
||||
expect(compiled_truth).toBe(body);
|
||||
expect(timeline).toBe('');
|
||||
});
|
||||
|
||||
test('<!-- timeline --> with content before and after', () => {
|
||||
const body = '## Summary\n\nArticle summary here.\n\n---\n\nMore body content.\n\n<!-- timeline -->\n\n- 2024: Timeline entry';
|
||||
const { compiled_truth, timeline } = splitBody(body);
|
||||
expect(compiled_truth).toContain('## Summary');
|
||||
expect(compiled_truth).toContain('More body content.');
|
||||
expect(compiled_truth).not.toContain('Timeline entry');
|
||||
expect(timeline).toContain('Timeline entry');
|
||||
});
|
||||
});
|
||||
|
||||
describe('serializeMarkdown', () => {
|
||||
test('round-trips through parse and serialize', () => {
|
||||
test('round-trips through parse and serialize (explicit sentinel)', () => {
|
||||
const original = `---
|
||||
type: concept
|
||||
title: Do Things That Don't Scale
|
||||
@@ -125,7 +170,7 @@ custom: value
|
||||
|
||||
Paul Graham argues that startups should do unscalable things early on.
|
||||
|
||||
---
|
||||
<!-- timeline -->
|
||||
|
||||
- 2013-07-01: Published on paulgraham.com
|
||||
`;
|
||||
@@ -148,7 +193,7 @@ Paul Graham argues that startups should do unscalable things early on.
|
||||
});
|
||||
|
||||
describe('parseMarkdown edge cases', () => {
|
||||
test('handles content with multiple --- separators', () => {
|
||||
test('does NOT split on plain --- separators (horizontal rules stay in compiled_truth)', () => {
|
||||
const md = `---
|
||||
type: concept
|
||||
title: Test
|
||||
@@ -158,16 +203,38 @@ First section.
|
||||
|
||||
---
|
||||
|
||||
Timeline part 1.
|
||||
Second section.
|
||||
|
||||
---
|
||||
|
||||
More timeline.`;
|
||||
Third section.`;
|
||||
const parsed = parseMarkdown(md);
|
||||
// Only splits at the FIRST standalone ---
|
||||
expect(parsed.compiled_truth.trim()).toBe('First section.');
|
||||
expect(parsed.timeline).toContain('Timeline part 1.');
|
||||
expect(parsed.timeline).toContain('More timeline.');
|
||||
expect(parsed.compiled_truth).toContain('First section.');
|
||||
expect(parsed.compiled_truth).toContain('Second section.');
|
||||
expect(parsed.compiled_truth).toContain('Third section.');
|
||||
expect(parsed.timeline).toBe('');
|
||||
});
|
||||
|
||||
test('splits on <!-- timeline --> sentinel with horizontal rules in body', () => {
|
||||
const md = `---
|
||||
type: concept
|
||||
title: Test
|
||||
---
|
||||
|
||||
First section.
|
||||
|
||||
---
|
||||
|
||||
Second section.
|
||||
|
||||
<!-- timeline -->
|
||||
|
||||
- 2024: Timeline entry`;
|
||||
const parsed = parseMarkdown(md);
|
||||
expect(parsed.compiled_truth).toContain('First section.');
|
||||
expect(parsed.compiled_truth).toContain('Second section.');
|
||||
expect(parsed.compiled_truth).not.toContain('Timeline entry');
|
||||
expect(parsed.timeline).toContain('Timeline entry');
|
||||
});
|
||||
|
||||
test('handles frontmatter without type or title', () => {
|
||||
@@ -177,7 +244,7 @@ custom_field: hello
|
||||
|
||||
Some content.`;
|
||||
const parsed = parseMarkdown(md);
|
||||
expect(parsed.type).toBeTruthy(); // should have a default
|
||||
expect(parsed.type).toBeTruthy();
|
||||
expect(parsed.compiled_truth.trim()).toBe('Some content.');
|
||||
expect(parsed.frontmatter.custom_field).toBe('hello');
|
||||
});
|
||||
@@ -199,4 +266,19 @@ Some content.`;
|
||||
expect(parseMarkdown('', 'concepts/thing.md').type).toBe('concept');
|
||||
expect(parseMarkdown('', 'companies/acme.md').type).toBe('company');
|
||||
});
|
||||
|
||||
test('infers type from wiki subdirectory paths', () => {
|
||||
expect(parseMarkdown('', 'tech/wiki/concepts/longevity-science.md').type).toBe('concept');
|
||||
expect(parseMarkdown('', 'tech/wiki/guides/team-os-claude-code.md').type).toBe('guide');
|
||||
expect(parseMarkdown('', 'tech/wiki/analysis/agi-timeline-debate.md').type).toBe('analysis');
|
||||
expect(parseMarkdown('', 'tech/wiki/hardware/h100-vs-gb200-training-benchmarks.md').type).toBe('hardware');
|
||||
expect(parseMarkdown('', 'tech/wiki/architecture/kb-infrastructure.md').type).toBe('architecture');
|
||||
expect(parseMarkdown('', 'finance/wiki/analysis/polymarket-bot-automation-thesis.md').type).toBe('analysis');
|
||||
expect(parseMarkdown('', 'personal/wiki/concepts/career-regrets-2026-framework.md').type).toBe('concept');
|
||||
});
|
||||
|
||||
test('infers writing type from /writing/ paths', () => {
|
||||
expect(parseMarkdown('', 'writing/post.md').type).toBe('writing');
|
||||
expect(parseMarkdown('', 'projects/blog/writing/essay.md').type).toBe('writing');
|
||||
});
|
||||
});
|
||||
|
||||
59
test/migrations-v0_12_2.test.ts
Normal file
59
test/migrations-v0_12_2.test.ts
Normal file
@@ -0,0 +1,59 @@
|
||||
/**
|
||||
* Tests for the v0.12.2 JSONB-double-encode-repair orchestrator.
|
||||
*
|
||||
* Covers the contract that makes this migration safe to ship:
|
||||
* - Registered in the TS registry (so apply-migrations sees it).
|
||||
* - Phase functions exported via __testing for unit-level coverage.
|
||||
* - Dry-run skips all side-effect phases.
|
||||
* - Feature pitch explains what the user can NOW do that they couldn't.
|
||||
*
|
||||
* Idempotency, repair correctness, and PGLite-no-op behavior are exercised
|
||||
* end-to-end against real Postgres in test/e2e/postgres-jsonb.test.ts.
|
||||
*/
|
||||
|
||||
import { describe, test, expect } from 'bun:test';
|
||||
|
||||
describe('v0.12.2 — JSONB double-encode repair migration', () => {
|
||||
test('registered in the TS migration registry', async () => {
|
||||
const { migrations, getMigration } = await import('../src/commands/migrations/index.ts');
|
||||
const versions = migrations.map(m => m.version);
|
||||
expect(versions).toContain('0.12.2');
|
||||
const m = getMigration('0.12.2');
|
||||
expect(m).not.toBeNull();
|
||||
expect(m!.featurePitch.headline).toContain('JSONB');
|
||||
expect(typeof m!.orchestrator).toBe('function');
|
||||
});
|
||||
|
||||
test('feature pitch lists the affected columns and the recovery path', async () => {
|
||||
const { v0_12_2 } = await import('../src/commands/migrations/v0_12_2.ts');
|
||||
const desc = v0_12_2.featurePitch.description ?? '';
|
||||
expect(desc).toContain('pages.frontmatter');
|
||||
expect(desc).toContain('raw_data.data');
|
||||
expect(desc).toContain('ingest_log.pages_updated');
|
||||
expect(desc).toContain('files.metadata');
|
||||
expect(desc).toContain('page_versions.frontmatter');
|
||||
expect(desc).toContain('gbrain sync --full');
|
||||
});
|
||||
|
||||
test('phase functions exported for unit testing', async () => {
|
||||
const { __testing } = await import('../src/commands/migrations/v0_12_2.ts');
|
||||
expect(typeof __testing.phaseASchema).toBe('function');
|
||||
expect(typeof __testing.phaseBRepair).toBe('function');
|
||||
expect(typeof __testing.phaseCVerify).toBe('function');
|
||||
});
|
||||
|
||||
test('dry-run skips all side-effect phases', async () => {
|
||||
const { v0_12_2 } = await import('../src/commands/migrations/v0_12_2.ts');
|
||||
const result = await v0_12_2.orchestrator({
|
||||
yes: true,
|
||||
dryRun: true,
|
||||
noAutopilotInstall: true,
|
||||
});
|
||||
expect(result.version).toBe('0.12.2');
|
||||
expect(result.phases.length).toBeGreaterThanOrEqual(3);
|
||||
for (const p of result.phases) {
|
||||
expect(p.status).toBe('skipped');
|
||||
expect(p.detail).toContain('dry-run');
|
||||
}
|
||||
});
|
||||
});
|
||||
37
test/repair-jsonb.test.ts
Normal file
37
test/repair-jsonb.test.ts
Normal file
@@ -0,0 +1,37 @@
|
||||
/**
|
||||
* Unit tests for `gbrain repair-jsonb`.
|
||||
*
|
||||
* The actual repair logic runs against real Postgres in
|
||||
* test/e2e/postgres-jsonb.test.ts (covers the round-trip + the migration
|
||||
* orchestrator end to end). Here we cover only the engine-detection
|
||||
* short-circuit: PGLite was never affected by the JSONB double-encode bug,
|
||||
* so the command must report 0 repaired rows and never connect.
|
||||
*/
|
||||
|
||||
import { describe, test, expect } from 'bun:test';
|
||||
import { repairJsonb } from '../src/commands/repair-jsonb.ts';
|
||||
|
||||
describe('repairJsonb — PGLite short-circuit', () => {
|
||||
test('PGLite engines short-circuit: no DB connection, all targets report 0 repaired', async () => {
|
||||
const result = await repairJsonb({
|
||||
dryRun: false,
|
||||
engineConfig: { engine: 'pglite' },
|
||||
});
|
||||
expect(result.engine).toBe('pglite');
|
||||
expect(result.total_repaired).toBe(0);
|
||||
// All 5 columns reported: pages.frontmatter, raw_data.data,
|
||||
// ingest_log.pages_updated, files.metadata, page_versions.frontmatter.
|
||||
expect(result.per_target.length).toBe(5);
|
||||
for (const t of result.per_target) {
|
||||
expect(t.rows_repaired).toBe(0);
|
||||
}
|
||||
const tables = result.per_target.map(t => `${t.table}.${t.column}`).sort();
|
||||
expect(tables).toEqual([
|
||||
'files.metadata',
|
||||
'ingest_log.pages_updated',
|
||||
'page_versions.frontmatter',
|
||||
'pages.frontmatter',
|
||||
'raw_data.data',
|
||||
]);
|
||||
});
|
||||
});
|
||||
@@ -1,5 +1,5 @@
|
||||
import { describe, test, expect } from 'bun:test';
|
||||
import { validateSlug, contentHash, rowToPage, rowToChunk, rowToSearchResult } from '../src/core/utils.ts';
|
||||
import { validateSlug, contentHash, parseEmbedding, rowToPage, rowToChunk, rowToSearchResult } from '../src/core/utils.ts';
|
||||
|
||||
describe('validateSlug', () => {
|
||||
test('accepts valid slugs', () => {
|
||||
@@ -98,6 +98,52 @@ describe('rowToChunk', () => {
|
||||
}, true);
|
||||
expect(chunk.embedding).not.toBeNull();
|
||||
});
|
||||
|
||||
test('parses pgvector string embeddings when requested', () => {
|
||||
const chunk = rowToChunk({
|
||||
id: 1, page_id: 1, chunk_index: 0, chunk_text: 'text',
|
||||
chunk_source: 'compiled_truth', embedding: '[0.1, 0.2, 0.3]',
|
||||
model: 'test', token_count: 5, embedded_at: '2024-01-01',
|
||||
}, true);
|
||||
expect(chunk.embedding).toBeInstanceOf(Float32Array);
|
||||
expect(Array.from(chunk.embedding || [])).toHaveLength(3);
|
||||
expect(chunk.embedding?.[0]).toBeCloseTo(0.1, 6);
|
||||
expect(chunk.embedding?.[1]).toBeCloseTo(0.2, 6);
|
||||
expect(chunk.embedding?.[2]).toBeCloseTo(0.3, 6);
|
||||
});
|
||||
});
|
||||
|
||||
describe('parseEmbedding', () => {
|
||||
test('returns Float32Array unchanged', () => {
|
||||
const emb = new Float32Array([0.1, 0.2]);
|
||||
expect(parseEmbedding(emb)).toBe(emb);
|
||||
});
|
||||
|
||||
test('parses pgvector text into Float32Array', () => {
|
||||
const parsed = parseEmbedding('[0.1, 0.2, 0.3]');
|
||||
expect(parsed).toBeInstanceOf(Float32Array);
|
||||
expect(Array.from(parsed || [])).toHaveLength(3);
|
||||
expect(parsed?.[0]).toBeCloseTo(0.1, 6);
|
||||
expect(parsed?.[1]).toBeCloseTo(0.2, 6);
|
||||
expect(parsed?.[2]).toBeCloseTo(0.3, 6);
|
||||
});
|
||||
|
||||
test('returns null for unsupported embedding values', () => {
|
||||
expect(parseEmbedding(null)).toBeNull();
|
||||
expect(parseEmbedding(undefined)).toBeNull();
|
||||
expect(parseEmbedding('not-a-vector')).toBeNull();
|
||||
});
|
||||
|
||||
test('parses numeric array into Float32Array', () => {
|
||||
const parsed = parseEmbedding([0.5, 0.25, 0.125]);
|
||||
expect(parsed).toBeInstanceOf(Float32Array);
|
||||
expect(parsed?.[0]).toBeCloseTo(0.5, 6);
|
||||
});
|
||||
|
||||
test('throws on vector-like string with non-numeric content (no silent NaN)', () => {
|
||||
expect(() => parseEmbedding('[abc, def]')).toThrow();
|
||||
expect(() => parseEmbedding('[1, NaN, 3]')).toThrow();
|
||||
});
|
||||
});
|
||||
|
||||
describe('rowToSearchResult', () => {
|
||||
|
||||
Reference in New Issue
Block a user