Files
gbrain/test/markdown.test.ts
Garry Tan c0b621923b fix: JSONB double-encode + splitBody wiki + parseEmbedding (v0.12.1) (#196)
* fix: splitBody and inferType for wiki-style markdown content

- splitBody now requires explicit timeline sentinel (<!-- timeline -->,
  --- timeline ---, or --- directly before ## Timeline / ## History).
  A bare --- in body text is a markdown horizontal rule, not a separator.
  This fixes the 83% content truncation @knee5 reported on a 1,991-article
  wiki where 4,856 of 6,680 wikilinks were lost.

- serializeMarkdown emits <!-- timeline --> sentinel for round-trip stability.

- inferType extended with /writing/, /wiki/analysis/, /wiki/guides/,
  /wiki/hardware/, /wiki/architecture/, /wiki/concepts/. Path order is
  most-specific-first so projects/blog/writing/essay.md → writing,
  not project.

- PageType union extended: writing, analysis, guide, hardware, architecture.

Updates test/import-file.test.ts to use the new sentinel.

Co-Authored-By: @knee5 (PR #187)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: JSONB double-encode bug on Postgres + parseEmbedding NaN scores

Two related Postgres-string-typed-data bugs that PGLite hid:

1. JSONB double-encode (postgres-engine.ts:107,668,846 + files.ts:254):
   ${JSON.stringify(value)}::jsonb in postgres.js v3 stringified again
   on the wire, storing JSONB columns as quoted string literals. Every
   frontmatter->>'key' returned NULL on Postgres-backed brains; GIN
   indexes were inert. Switched to sql.json(value), which is the
   postgres.js-native JSONB encoder (Parameter with OID 3802).
   Affected columns: pages.frontmatter, raw_data.data,
   ingest_log.pages_updated, files.metadata. page_versions.frontmatter
   is downstream via INSERT...SELECT and propagates the fix.

2. pgvector embeddings returning as strings (utils.ts):
   getEmbeddingsByChunkIds returned "[0.1,0.2,...]" instead of
   Float32Array on Supabase, producing [NaN] cosine scores.
   Adds parseEmbedding() helper handling Float32Array, numeric arrays,
   and pgvector string format. Throws loud on malformed vectors
   (per Codex's no-silent-NaN requirement); returns null for
   non-vector strings (treated as "no embedding here"). rowToChunk
   delegates to parseEmbedding.

E2E regression test at test/e2e/postgres-jsonb.test.ts asserts
jsonb_typeof = 'object' AND col->>'k' returns expected scalar across
all 5 affected columns — the test that should have caught the original
bug. Runs in CI via the existing pgvector service.

Co-Authored-By: @knee5 (PR #187 — JSONB triple-fix)
Co-Authored-By: @leonardsellem (PR #175 — parseEmbedding)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: extract wikilink syntax with ancestor-search slug resolution

extractMarkdownLinks now handles [[page]] and [[page|Display Text]]
alongside standard [text](page.md). For wiki KBs where authors omit
leading ../ (thinking in wiki-root-relative terms), resolveSlug
walks ancestor directories until it finds a matching slug.

Without this, wikilinks under tech/wiki/analysis/ targeting
[[../../finance/wiki/concepts/foo]] silently dangled when the
correct relative depth was 3 × ../ instead of 2.

Co-Authored-By: @knee5 (PR #187)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: gbrain repair-jsonb + v0.12.1 migration + CI grep guard

- New gbrain repair-jsonb command. Detects rows where
  jsonb_typeof(col) = 'string' and rewrites them via
  (col #>> '{}')::jsonb across 5 affected columns:
  pages.frontmatter, raw_data.data, ingest_log.pages_updated,
  files.metadata, page_versions.frontmatter. Idempotent — re-running
  is a no-op. PGLite engines short-circuit cleanly (the bug never
  affected the parameterized encode path PGLite uses). --dry-run
  shows what would be repaired; --json for scripting.

- New v0_12_1.ts migration orchestrator. Phases: schema → repair → verify.
  Modeled on v0_12_0 pattern, registered in migrations/index.ts.
  Runs automatically via gbrain upgrade / apply-migrations.

- CI grep guard at scripts/check-jsonb-pattern.sh fails the build if
  anyone reintroduces the ${JSON.stringify(x)}::jsonb interpolation
  pattern. Wired into bun test via package.json. Best-effort static
  analysis (multi-line and helper-wrapped variants are caught by the
  E2E round-trip test instead).

- Updates apply-migrations.test.ts expectations to account for the new
  v0.12.1 entry in the registry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v0.12.1)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: update project documentation for v0.12.1

- CLAUDE.md: document repair-jsonb command, v0_12_1 migration,
  splitBody sentinel contract, inferType wiki subtypes, CI grep
  guard, new test files (repair-jsonb, migrations-v0_12_1, markdown)
- README.md: add gbrain repair-jsonb to ADMIN command reference
- INSTALL_FOR_AGENTS.md: fix verification count (6 -> 7), add
  v0.12.1 upgrade guidance for Postgres brains
- docs/GBRAIN_VERIFY.md: add check #8 for JSONB integrity on
  Postgres-backed brains
- docs/UPGRADING_DOWNSTREAM_AGENTS.md: add v0.12.1 section with
  migration steps, splitBody contract, wiki subtype inference
- skills/migrate/SKILL.md: document native wikilink extraction
  via gbrain extract links (v0.12.1+)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 07:14:24 +08:00

285 lines
9.1 KiB
TypeScript

import { describe, test, expect } from 'bun:test';
import { parseMarkdown, serializeMarkdown, splitBody } from '../src/core/markdown.ts';
describe('Markdown Parser', () => {
test('parses frontmatter + compiled_truth + timeline (explicit sentinel)', () => {
const md = `---
type: concept
title: Do Things That Don't Scale
tags: [startups, growth]
---
Paul Graham argues that startups should do unscalable things early on.
<!-- timeline -->
- 2013-07-01: Published on paulgraham.com
- 2024-11-15: Referenced in batch kickoff talk
`;
const parsed = parseMarkdown(md);
expect(parsed.type).toBe('concept');
expect(parsed.title).toBe("Do Things That Don't Scale");
expect(parsed.tags).toEqual(['startups', 'growth']);
expect(parsed.compiled_truth).toContain('unscalable things');
expect(parsed.timeline).toContain('Published on paulgraham.com');
expect(parsed.timeline).toContain('batch kickoff talk');
});
test('handles no timeline separator', () => {
const md = `---
type: concept
title: Superlinear Returns
---
Returns in many fields are superlinear.
Performance compounds over time.
`;
const parsed = parseMarkdown(md);
expect(parsed.compiled_truth).toContain('superlinear');
expect(parsed.timeline).toBe('');
});
test('handles empty body', () => {
const md = `---
type: concept
title: Empty Page
---
`;
const parsed = parseMarkdown(md);
expect(parsed.compiled_truth).toBe('');
expect(parsed.timeline).toBe('');
});
test('removes type, title, tags from frontmatter object', () => {
const md = `---
type: concept
title: Test
tags: [a, b]
custom_field: hello
---
Content
`;
const parsed = parseMarkdown(md);
expect(parsed.frontmatter).not.toHaveProperty('type');
expect(parsed.frontmatter).not.toHaveProperty('title');
expect(parsed.frontmatter).not.toHaveProperty('tags');
expect(parsed.frontmatter).toHaveProperty('custom_field', 'hello');
});
test('infers type from file path', () => {
const md = `---
title: Someone
---
Content
`;
const parsed = parseMarkdown(md, 'people/someone.md');
expect(parsed.type).toBe('person');
});
test('infers slug from file path', () => {
const md = `---
type: concept
title: Test
---
Content
`;
const parsed = parseMarkdown(md, 'concepts/do-things-that-dont-scale.md');
expect(parsed.slug).toBe('concepts/do-things-that-dont-scale');
});
});
describe('splitBody', () => {
test('splits at <!-- timeline --> sentinel', () => {
const body = 'Above the line\n\n<!-- timeline -->\n\nBelow the line';
const { compiled_truth, timeline } = splitBody(body);
expect(compiled_truth).toContain('Above the line');
expect(timeline).toContain('Below the line');
});
test('splits at --- timeline --- sentinel', () => {
const body = 'Above the line\n\n--- timeline ---\n\nBelow the line';
const { compiled_truth, timeline } = splitBody(body);
expect(compiled_truth).toContain('Above the line');
expect(timeline).toContain('Below the line');
});
test('splits at --- when followed by ## Timeline heading', () => {
const body = 'Article content\n\n---\n\n## Timeline\n\n- 2024: Event happened';
const { compiled_truth, timeline } = splitBody(body);
expect(compiled_truth).toContain('Article content');
expect(timeline).toContain('## Timeline');
expect(timeline).toContain('Event happened');
});
test('splits at --- when followed by ## History heading', () => {
const body = 'Article content\n\n---\n\n## History\n\n- 2020: Founded';
const { compiled_truth, timeline } = splitBody(body);
expect(compiled_truth).toContain('Article content');
expect(timeline).toContain('## History');
});
test('does NOT split at plain --- (horizontal rule in article body)', () => {
const body = 'Above the line\n\n---\n\nBelow the line';
const { compiled_truth, timeline } = splitBody(body);
expect(compiled_truth).toBe(body);
expect(timeline).toBe('');
});
test('does NOT split on multiple plain --- horizontal rules', () => {
const body = 'Section 1\n\n---\n\nSection 2\n\n---\n\nSection 3';
const { compiled_truth, timeline } = splitBody(body);
expect(compiled_truth).toBe(body);
expect(timeline).toBe('');
});
test('returns all as compiled_truth if no sentinel', () => {
const body = 'Just some content\nWith multiple lines';
const { compiled_truth, timeline } = splitBody(body);
expect(compiled_truth).toBe(body);
expect(timeline).toBe('');
});
test('plain --- at end of content stays in compiled_truth', () => {
const body = 'Content here\n\n---\n';
const { compiled_truth, timeline } = splitBody(body);
expect(compiled_truth).toBe(body);
expect(timeline).toBe('');
});
test('<!-- timeline --> with content before and after', () => {
const body = '## Summary\n\nArticle summary here.\n\n---\n\nMore body content.\n\n<!-- timeline -->\n\n- 2024: Timeline entry';
const { compiled_truth, timeline } = splitBody(body);
expect(compiled_truth).toContain('## Summary');
expect(compiled_truth).toContain('More body content.');
expect(compiled_truth).not.toContain('Timeline entry');
expect(timeline).toContain('Timeline entry');
});
});
describe('serializeMarkdown', () => {
test('round-trips through parse and serialize (explicit sentinel)', () => {
const original = `---
type: concept
title: Do Things That Don't Scale
tags:
- startups
- growth
custom: value
---
Paul Graham argues that startups should do unscalable things early on.
<!-- timeline -->
- 2013-07-01: Published on paulgraham.com
`;
const parsed = parseMarkdown(original);
const serialized = serializeMarkdown(
parsed.frontmatter,
parsed.compiled_truth,
parsed.timeline,
{ type: parsed.type, title: parsed.title, tags: parsed.tags },
);
// Re-parse the serialized version
const reparsed = parseMarkdown(serialized);
expect(reparsed.type).toBe(parsed.type);
expect(reparsed.title).toBe(parsed.title);
expect(reparsed.compiled_truth).toBe(parsed.compiled_truth);
expect(reparsed.timeline).toBe(parsed.timeline);
expect(reparsed.frontmatter.custom).toBe('value');
});
});
describe('parseMarkdown edge cases', () => {
test('does NOT split on plain --- separators (horizontal rules stay in compiled_truth)', () => {
const md = `---
type: concept
title: Test
---
First section.
---
Second section.
---
Third section.`;
const parsed = parseMarkdown(md);
expect(parsed.compiled_truth).toContain('First section.');
expect(parsed.compiled_truth).toContain('Second section.');
expect(parsed.compiled_truth).toContain('Third section.');
expect(parsed.timeline).toBe('');
});
test('splits on <!-- timeline --> sentinel with horizontal rules in body', () => {
const md = `---
type: concept
title: Test
---
First section.
---
Second section.
<!-- timeline -->
- 2024: Timeline entry`;
const parsed = parseMarkdown(md);
expect(parsed.compiled_truth).toContain('First section.');
expect(parsed.compiled_truth).toContain('Second section.');
expect(parsed.compiled_truth).not.toContain('Timeline entry');
expect(parsed.timeline).toContain('Timeline entry');
});
test('handles frontmatter without type or title', () => {
const md = `---
custom_field: hello
---
Some content.`;
const parsed = parseMarkdown(md);
expect(parsed.type).toBeTruthy();
expect(parsed.compiled_truth.trim()).toBe('Some content.');
expect(parsed.frontmatter.custom_field).toBe('hello');
});
test('handles content with no frontmatter at all', () => {
const md = `Just plain text with no YAML.`;
const parsed = parseMarkdown(md);
expect(parsed.compiled_truth).toContain('Just plain text');
});
test('handles empty string', () => {
const parsed = parseMarkdown('');
expect(parsed.compiled_truth).toBe('');
expect(parsed.timeline).toBe('');
});
test('infers type from various directory paths', () => {
expect(parseMarkdown('', 'people/someone.md').type).toBe('person');
expect(parseMarkdown('', 'concepts/thing.md').type).toBe('concept');
expect(parseMarkdown('', 'companies/acme.md').type).toBe('company');
});
test('infers type from wiki subdirectory paths', () => {
expect(parseMarkdown('', 'tech/wiki/concepts/longevity-science.md').type).toBe('concept');
expect(parseMarkdown('', 'tech/wiki/guides/team-os-claude-code.md').type).toBe('guide');
expect(parseMarkdown('', 'tech/wiki/analysis/agi-timeline-debate.md').type).toBe('analysis');
expect(parseMarkdown('', 'tech/wiki/hardware/h100-vs-gb200-training-benchmarks.md').type).toBe('hardware');
expect(parseMarkdown('', 'tech/wiki/architecture/kb-infrastructure.md').type).toBe('architecture');
expect(parseMarkdown('', 'finance/wiki/analysis/polymarket-bot-automation-thesis.md').type).toBe('analysis');
expect(parseMarkdown('', 'personal/wiki/concepts/career-regrets-2026-framework.md').type).toBe('concept');
});
test('infers writing type from /writing/ paths', () => {
expect(parseMarkdown('', 'writing/post.md').type).toBe('writing');
expect(parseMarkdown('', 'projects/blog/writing/essay.md').type).toBe('writing');
});
});