Files

Garry Tan c0b621923b fix: JSONB double-encode + splitBody wiki + parseEmbedding (v0.12.1) (#196 )

* fix: splitBody and inferType for wiki-style markdown content

- splitBody now requires explicit timeline sentinel (<!-- timeline -->,
  --- timeline ---, or --- directly before ## Timeline / ## History).
  A bare --- in body text is a markdown horizontal rule, not a separator.
  This fixes the 83% content truncation @knee5 reported on a 1,991-article
  wiki where 4,856 of 6,680 wikilinks were lost.

- serializeMarkdown emits <!-- timeline --> sentinel for round-trip stability.

- inferType extended with /writing/, /wiki/analysis/, /wiki/guides/,
  /wiki/hardware/, /wiki/architecture/, /wiki/concepts/. Path order is
  most-specific-first so projects/blog/writing/essay.md → writing,
  not project.

- PageType union extended: writing, analysis, guide, hardware, architecture.

Updates test/import-file.test.ts to use the new sentinel.

Co-Authored-By: @knee5 (PR #187)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: JSONB double-encode bug on Postgres + parseEmbedding NaN scores

Two related Postgres-string-typed-data bugs that PGLite hid:

1. JSONB double-encode (postgres-engine.ts:107,668,846 + files.ts:254):
   ${JSON.stringify(value)}::jsonb in postgres.js v3 stringified again
   on the wire, storing JSONB columns as quoted string literals. Every
   frontmatter->>'key' returned NULL on Postgres-backed brains; GIN
   indexes were inert. Switched to sql.json(value), which is the
   postgres.js-native JSONB encoder (Parameter with OID 3802).
   Affected columns: pages.frontmatter, raw_data.data,
   ingest_log.pages_updated, files.metadata. page_versions.frontmatter
   is downstream via INSERT...SELECT and propagates the fix.

2. pgvector embeddings returning as strings (utils.ts):
   getEmbeddingsByChunkIds returned "[0.1,0.2,...]" instead of
   Float32Array on Supabase, producing [NaN] cosine scores.
   Adds parseEmbedding() helper handling Float32Array, numeric arrays,
   and pgvector string format. Throws loud on malformed vectors
   (per Codex's no-silent-NaN requirement); returns null for
   non-vector strings (treated as "no embedding here"). rowToChunk
   delegates to parseEmbedding.

E2E regression test at test/e2e/postgres-jsonb.test.ts asserts
jsonb_typeof = 'object' AND col->>'k' returns expected scalar across
all 5 affected columns — the test that should have caught the original
bug. Runs in CI via the existing pgvector service.

Co-Authored-By: @knee5 (PR #187 — JSONB triple-fix)
Co-Authored-By: @leonardsellem (PR #175 — parseEmbedding)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: extract wikilink syntax with ancestor-search slug resolution

extractMarkdownLinks now handles [[page]] and [[page|Display Text]]
alongside standard [text](page.md). For wiki KBs where authors omit
leading ../ (thinking in wiki-root-relative terms), resolveSlug
walks ancestor directories until it finds a matching slug.

Without this, wikilinks under tech/wiki/analysis/ targeting
[[../../finance/wiki/concepts/foo]] silently dangled when the
correct relative depth was 3 × ../ instead of 2.

Co-Authored-By: @knee5 (PR #187)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: gbrain repair-jsonb + v0.12.1 migration + CI grep guard

- New gbrain repair-jsonb command. Detects rows where
  jsonb_typeof(col) = 'string' and rewrites them via
  (col #>> '{}')::jsonb across 5 affected columns:
  pages.frontmatter, raw_data.data, ingest_log.pages_updated,
  files.metadata, page_versions.frontmatter. Idempotent — re-running
  is a no-op. PGLite engines short-circuit cleanly (the bug never
  affected the parameterized encode path PGLite uses). --dry-run
  shows what would be repaired; --json for scripting.

- New v0_12_1.ts migration orchestrator. Phases: schema → repair → verify.
  Modeled on v0_12_0 pattern, registered in migrations/index.ts.
  Runs automatically via gbrain upgrade / apply-migrations.

- CI grep guard at scripts/check-jsonb-pattern.sh fails the build if
  anyone reintroduces the ${JSON.stringify(x)}::jsonb interpolation
  pattern. Wired into bun test via package.json. Best-effort static
  analysis (multi-line and helper-wrapped variants are caught by the
  E2E round-trip test instead).

- Updates apply-migrations.test.ts expectations to account for the new
  v0.12.1 entry in the registry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v0.12.1)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: update project documentation for v0.12.1

- CLAUDE.md: document repair-jsonb command, v0_12_1 migration,
  splitBody sentinel contract, inferType wiki subtypes, CI grep
  guard, new test files (repair-jsonb, migrations-v0_12_1, markdown)
- README.md: add gbrain repair-jsonb to ADMIN command reference
- INSTALL_FOR_AGENTS.md: fix verification count (6 -> 7), add
  v0.12.1 upgrade guidance for Postgres brains
- docs/GBRAIN_VERIFY.md: add check #8 for JSONB integrity on
  Postgres-backed brains
- docs/UPGRADING_DOWNSTREAM_AGENTS.md: add v0.12.1 section with
  migration steps, splitBody contract, wiki subtype inference
- skills/migrate/SKILL.md: document native wikilink extraction
  via gbrain extract links (v0.12.1+)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-19 07:14:24 +08:00

7.6 KiB

Raw Blame History

GBrain Installation Verification Runbook

Run these checks after install to confirm every part of GBrain is working. Each check includes the command, expected output, and what to do if it fails.

The most important check is #4 (live sync). "Sync ran" is not the same as "sync worked." A sync that silently skips pages because of a pooler bug is worse than no sync at all, because you think it's working.

1. Schema Verification

Command:

gbrain doctor --json

Expected: All checks return "ok":

connection: connected, N pages
pgvector: extension installed
rls: enabled on all tables
schema_version: current
embeddings: coverage percentage

If it fails: The doctor output includes specific fix instructions for each check. See skills/setup/SKILL.md Error Recovery table.

2. Skillpack Loaded

Check: Ask the agent: "What is the brain-agent loop?"

Expected: The agent references GBRAIN_SKILLPACK.md Section 2 and describes the read-write cycle: detect entities, read brain, respond with context, write brain, sync.

If it fails: The agent hasn't loaded the skillpack. Run step 6 from the install paste (read docs/GBRAIN_SKILLPACK.md).

3. Auto-Update Configured

Command:

gbrain check-update --json

Expected: Returns JSON with current_version, latest_version, update_available (boolean). The cron gbrain-update-check is registered.

If it fails: Run step 7 from the install paste. See GBRAIN_SKILLPACK.md Section 17.

4. Live Sync Actually Works

This is the most important check. Three parts.

4a. Coverage Check

Compare page count in the DB against syncable file count in the repo:

gbrain stats

Then count syncable files:

find /data/brain -name '*.md' \
  -not -path '*/.*' \
  -not -path '*/.raw/*' \
  -not -path '*/ops/*' \
  -not -name 'README.md' \
  -not -name 'index.md' \
  -not -name 'schema.md' \
  -not -name 'log.md' \
  | wc -l

Expected: Page count in gbrain stats should be close to the file count. Some difference is normal (files added since last sync), but if page count is less than half the file count, sync is silently skipping pages.

If page count is way too low: The #1 cause is the connection pooler bug. Check your DATABASE_URL:

If it contains pooler.supabase.com:6543, verify it's using Session mode, not Transaction mode.
Transaction mode breaks engine.transaction() and causes .begin() is not a function errors.
Fix: switch to Session mode pooler string, then run gbrain sync --full to reimport everything.

4b. Embed Check

gbrain stats

Expected: Embedded chunk count should be close to total chunk count.

If embedded is much lower than total:

gbrain embed --stale

If OPENAI_API_KEY is not set, embeddings can't be generated. Keyword search still works without embeddings, but hybrid/semantic search won't.

4c. End-to-End Test

This is the real test. Edit a brain page, push, wait, search.

Edit a page in the brain repo (e.g., correct a fact on a person's page):

# Example: fix a line in Gustaf's page
cd /data/brain
# Make a small edit to any .md file
git add -A && git commit -m "test: verify live sync" && git push

Wait for the next sync cycle (cron interval or --watch poll).
Search for the corrected text:

gbrain search "<text from the correction>"

Expected: The search returns the corrected text, not the old version.

If it returns old text: Sync failed silently. Check:

Is the sync cron registered and running?
Is gbrain sync --watch still alive (if using watch mode)?
Run gbrain config get sync.last_run to see when sync last ran.
Run gbrain sync --repo /data/brain manually and check for errors.
If you see .begin() is not a function, fix the pooler (see 4a above).

5. Embedding Coverage

Command:

gbrain stats

Expected: Embedded chunk count matches (or is close to) total chunk count.

If zero or very low: OPENAI_API_KEY may be missing or invalid. Check:

echo $OPENAI_API_KEY | head -c 10

If blank, set the key. Then:

gbrain embed --stale

6. Brain-First Lookup Protocol

Check: Ask the agent about a person or concept that exists in the brain.

Expected: The agent uses gbrain search or gbrain query FIRST, not grep or external APIs. The response includes brain-sourced context with source attribution.

If it fails: The brain-first lookup protocol isn't injected into the agent's system context. See skills/setup/SKILL.md Phase D.

7. Knowledge Graph Wired

The v0.12.0 graph layer needs to be populated for existing brains. New writes are auto-linked, but historical pages need a one-time backfill.

Command:

gbrain stats | grep -E 'links|timeline'

Expected: Both links and timeline_entries are non-zero (assuming the brain has content with entity references and dated markdown).

If it's zero on a brain with imported content: Run the backfill.

gbrain extract links --source db --dry-run | head -5    # preview
gbrain extract links --source db                         # commit
gbrain extract timeline --source db
gbrain stats                                             # confirm > 0

Bonus check — graph traversal works:

# Pick any well-connected slug from your brain
gbrain graph-query people/<some-person-slug> --depth 2

Expected: Indented tree of typed edges (--attended-->, --works_at-->, etc.). If the slug has no inbound or outbound links, try a different one or run extract again.

If extract finds nothing: Your pages may not use entity-reference syntax. The extractor matches [Name](people/slug), [Name](../people/slug.md), and bare people/slug references. If your brain uses a different format, the auto-link heuristics won't find them — file an issue with a sample page.

8. JSONB Frontmatter Integrity (v0.12.2)

Postgres-backed brains created before v0.12.2 had double-encoded JSONB columns (frontmatter->>'key' returned NULL, GIN indexes were inert). gbrain upgrade runs gbrain repair-jsonb automatically via the v0_12_2 orchestrator. Verify the repair succeeded.

Command:

gbrain repair-jsonb --dry-run --json

Expected: totalRepaired: 0 across all 5 columns (pages.frontmatter, raw_data.data, ingest_log.pages_updated, files.metadata, page_versions.frontmatter). A zero count means every row is properly-typed JSON objects, not string-encoded JSON.

If the count is > 0: The repair didn't run or was interrupted. Re-run without --dry-run:

gbrain repair-jsonb

Idempotent. PGLite brains always report 0 (unaffected by the original bug).

Bonus check — frontmatter-keyed queries actually resolve:

gbrain call list_pages '{"frontmatterKey": "type", "frontmatterValue": "person"}'

If this returns rows on a brain with person pages, the JSONB path is healthy.

Quick Verification (all checks in one pass)

# 1. Schema
gbrain doctor --json

# 2. Sync recency
gbrain config get sync.last_run

# 3. Page count + embed coverage
gbrain stats

# 4. Search works
gbrain search "test query from your brain content"

# 5. Catch any unembedded chunks
gbrain embed --stale

# 6. Auto-update
gbrain check-update --json

# 7. Knowledge graph populated (links + timeline > 0)
gbrain stats | grep -E 'links|timeline'

# 8. JSONB integrity (v0.12.2 — Postgres only, PGLite always 0)
gbrain repair-jsonb --dry-run --json

If all eight return successfully, the installation is healthy. For the full end-to-end sync test (4c), push a real change and verify it appears in search.

7.6 KiB Raw Blame History

GBrain Installation Verification Runbook

1. Schema Verification

2. Skillpack Loaded

3. Auto-Update Configured

4. Live Sync Actually Works

4a. Coverage Check

4b. Embed Check

4c. End-to-End Test

5. Embedding Coverage

6. Brain-First Lookup Protocol

7. Knowledge Graph Wired

8. JSONB Frontmatter Integrity (v0.12.2)

Quick Verification (all checks in one pass)

7.6 KiB

Raw Blame History