* fix: splitBody and inferType for wiki-style markdown content

  - splitBody now requires an explicit timeline sentinel (<!-- timeline -->, --- timeline ---, or --- directly before ## Timeline / ## History). A bare --- in body text is a markdown horizontal rule, not a separator. This fixes the 83% content truncation @knee5 reported on a 1,991-article wiki where 4,856 of 6,680 wikilinks were lost.
  - serializeMarkdown emits the <!-- timeline --> sentinel for round-trip stability.
  - inferType extended with /writing/, /wiki/analysis/, /wiki/guides/, /wiki/hardware/, /wiki/architecture/, /wiki/concepts/. Path order is most-specific-first so projects/blog/writing/essay.md → writing, not project.
  - PageType union extended: writing, analysis, guide, hardware, architecture.

  Updates test/import-file.test.ts to use the new sentinel.

  Co-Authored-By: @knee5 (PR #187)
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: JSONB double-encode bug on Postgres + parseEmbedding NaN scores

  Two related Postgres-string-typed-data bugs that PGLite hid:

  1. JSONB double-encode (postgres-engine.ts:107,668,846 + files.ts:254): ${JSON.stringify(value)}::jsonb in postgres.js v3 stringified again on the wire, storing JSONB columns as quoted string literals. Every frontmatter->>'key' returned NULL on Postgres-backed brains; GIN indexes were inert. Switched to sql.json(value), which is the postgres.js-native JSONB encoder (Parameter with OID 3802). Affected columns: pages.frontmatter, raw_data.data, ingest_log.pages_updated, files.metadata. page_versions.frontmatter is downstream via INSERT...SELECT and propagates the fix.
  2. pgvector embeddings returning as strings (utils.ts): getEmbeddingsByChunkIds returned "[0.1,0.2,...]" instead of Float32Array on Supabase, producing [NaN] cosine scores. Adds parseEmbedding() helper handling Float32Array, numeric arrays, and pgvector string format. Throws loud on malformed vectors (per Codex's no-silent-NaN requirement); returns null for non-vector strings (treated as "no embedding here"). rowToChunk delegates to parseEmbedding.

  E2E regression test at test/e2e/postgres-jsonb.test.ts asserts jsonb_typeof = 'object' AND col->>'k' returns the expected scalar across all 5 affected columns. This is the test that should have caught the original bug. Runs in CI via the existing pgvector service.

  Co-Authored-By: @knee5 (PR #187, JSONB triple-fix)
  Co-Authored-By: @leonardsellem (PR #175, parseEmbedding)
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: extract wikilink syntax with ancestor-search slug resolution

  extractMarkdownLinks now handles [[page]] and [[page|Display Text]] alongside standard [text](page.md). For wiki KBs where authors omit leading ../ (thinking in wiki-root-relative terms), resolveSlug walks ancestor directories until it finds a matching slug. Without this, wikilinks under tech/wiki/analysis/ targeting [[../../finance/wiki/concepts/foo]] silently dangled when the correct relative depth was 3 × ../ instead of 2.

  Co-Authored-By: @knee5 (PR #187)
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: gbrain repair-jsonb + v0.12.1 migration + CI grep guard

  - New gbrain repair-jsonb command. Detects rows where jsonb_typeof(col) = 'string' and rewrites them via (col #>> '{}')::jsonb across 5 affected columns: pages.frontmatter, raw_data.data, ingest_log.pages_updated, files.metadata, page_versions.frontmatter. Idempotent: re-running is a no-op. PGLite engines short-circuit cleanly (the bug never affected the parameterized encode path PGLite uses). --dry-run shows what would be repaired; --json for scripting.
  - New v0_12_1.ts migration orchestrator. Phases: schema → repair → verify. Modeled on the v0_12_0 pattern, registered in migrations/index.ts. Runs automatically via gbrain upgrade / apply-migrations.
  - CI grep guard at scripts/check-jsonb-pattern.sh fails the build if anyone reintroduces the ${JSON.stringify(x)}::jsonb interpolation pattern. Wired into bun test via package.json. Best-effort static analysis (multi-line and helper-wrapped variants are caught by the E2E round-trip test instead).
  - Updates apply-migrations.test.ts expectations to account for the new v0.12.1 entry in the registry.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v0.12.1)

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: update project documentation for v0.12.1

  - CLAUDE.md: document repair-jsonb command, v0_12_1 migration, splitBody sentinel contract, inferType wiki subtypes, CI grep guard, new test files (repair-jsonb, migrations-v0_12_1, markdown)
  - README.md: add gbrain repair-jsonb to ADMIN command reference
  - INSTALL_FOR_AGENTS.md: fix verification count (6 -> 7), add v0.12.1 upgrade guidance for Postgres brains
  - docs/GBRAIN_VERIFY.md: add check #8 for JSONB integrity on Postgres-backed brains
  - docs/UPGRADING_DOWNSTREAM_AGENTS.md: add v0.12.1 section with migration steps, splitBody contract, wiki subtype inference
  - skills/migrate/SKILL.md: document native wikilink extraction via gbrain extract links (v0.12.1+)

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
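The parseEmbedding behavior described in the commit message (accept Float32Array, numeric arrays, and the pgvector string format; throw on malformed vectors; return null for non-vector strings) could look roughly like the sketch below. It is not the helper in utils.ts: the names sketchParseEmbedding and toFloat32, the error message, and the handling of an empty "[]" are assumptions made for illustration.

```typescript
// Hypothetical sketch of the parseEmbedding contract; not the real utils.ts code.
function sketchParseEmbedding(value: unknown): Float32Array | null {
  if (value === null || value === undefined) return null;

  // Already in the desired shape: pass through unchanged.
  if (value instanceof Float32Array) return value;

  // Plain arrays: every element must be a finite number.
  if (Array.isArray(value)) {
    return toFloat32(value.map((item) => (typeof item === 'number' ? item : NaN)));
  }

  if (typeof value === 'string') {
    const text = value.trim();
    // Strings that do not look like "[...]" are treated as "no embedding here".
    if (!text.startsWith('[') || !text.endsWith(']')) return null;

    const inner = text.slice(1, -1).trim();
    if (inner === '') return new Float32Array(0); // assumption: "[]" is an empty vector

    // Empty components (e.g. "[1,,2]") become NaN so they fail loudly below,
    // because Number('') would otherwise silently yield 0.
    const parts = inner.split(',').map((part) => part.trim());
    return toFloat32(parts.map((part) => (part === '' ? NaN : Number(part))));
  }

  return null;
}

// Converts to Float32Array, throwing instead of letting NaN reach cosine scoring.
function toFloat32(values: number[]): Float32Array {
  for (const v of values) {
    if (!Number.isFinite(v)) {
      throw new Error('Malformed embedding: non-finite component');
    }
  }
  return Float32Array.from(values);
}
```

Throwing on a bracketed but unparseable string, while returning null for anything that is not bracketed at all, keeps the two failure modes distinct: the first signals corrupt data, the second simply means the column held something other than a vector.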
285 lines
9.1 KiB
TypeScript
import { describe, test, expect } from 'bun:test';
import { parseMarkdown, serializeMarkdown, splitBody } from '../src/core/markdown.ts';

describe('Markdown Parser', () => {
  test('parses frontmatter + compiled_truth + timeline (explicit sentinel)', () => {
    const md = `---
type: concept
title: Do Things That Don't Scale
tags: [startups, growth]
---

Paul Graham argues that startups should do unscalable things early on.

<!-- timeline -->

- 2013-07-01: Published on paulgraham.com
- 2024-11-15: Referenced in batch kickoff talk
`;
    const parsed = parseMarkdown(md);
    expect(parsed.type).toBe('concept');
    expect(parsed.title).toBe("Do Things That Don't Scale");
    expect(parsed.tags).toEqual(['startups', 'growth']);
    expect(parsed.compiled_truth).toContain('unscalable things');
    expect(parsed.timeline).toContain('Published on paulgraham.com');
    expect(parsed.timeline).toContain('batch kickoff talk');
  });

  test('handles no timeline separator', () => {
    const md = `---
type: concept
title: Superlinear Returns
---

Returns in many fields are superlinear.
Performance compounds over time.
`;
    const parsed = parseMarkdown(md);
    expect(parsed.compiled_truth).toContain('superlinear');
    expect(parsed.timeline).toBe('');
  });

  test('handles empty body', () => {
    const md = `---
type: concept
title: Empty Page
---
`;
    const parsed = parseMarkdown(md);
    expect(parsed.compiled_truth).toBe('');
    expect(parsed.timeline).toBe('');
  });

  test('removes type, title, tags from frontmatter object', () => {
    const md = `---
type: concept
title: Test
tags: [a, b]
custom_field: hello
---

Content
`;
    const parsed = parseMarkdown(md);
    expect(parsed.frontmatter).not.toHaveProperty('type');
    expect(parsed.frontmatter).not.toHaveProperty('title');
    expect(parsed.frontmatter).not.toHaveProperty('tags');
    expect(parsed.frontmatter).toHaveProperty('custom_field', 'hello');
  });

  test('infers type from file path', () => {
    const md = `---
title: Someone
---
Content
`;
    const parsed = parseMarkdown(md, 'people/someone.md');
    expect(parsed.type).toBe('person');
  });

  test('infers slug from file path', () => {
    const md = `---
type: concept
title: Test
---
Content
`;
    const parsed = parseMarkdown(md, 'concepts/do-things-that-dont-scale.md');
    expect(parsed.slug).toBe('concepts/do-things-that-dont-scale');
  });
});

describe('splitBody', () => {
  test('splits at <!-- timeline --> sentinel', () => {
    const body = 'Above the line\n\n<!-- timeline -->\n\nBelow the line';
    const { compiled_truth, timeline } = splitBody(body);
    expect(compiled_truth).toContain('Above the line');
    expect(timeline).toContain('Below the line');
  });

  test('splits at --- timeline --- sentinel', () => {
    const body = 'Above the line\n\n--- timeline ---\n\nBelow the line';
    const { compiled_truth, timeline } = splitBody(body);
    expect(compiled_truth).toContain('Above the line');
    expect(timeline).toContain('Below the line');
  });

  test('splits at --- when followed by ## Timeline heading', () => {
    const body = 'Article content\n\n---\n\n## Timeline\n\n- 2024: Event happened';
    const { compiled_truth, timeline } = splitBody(body);
    expect(compiled_truth).toContain('Article content');
    expect(timeline).toContain('## Timeline');
    expect(timeline).toContain('Event happened');
  });

  test('splits at --- when followed by ## History heading', () => {
    const body = 'Article content\n\n---\n\n## History\n\n- 2020: Founded';
    const { compiled_truth, timeline } = splitBody(body);
    expect(compiled_truth).toContain('Article content');
    expect(timeline).toContain('## History');
  });

  test('does NOT split at plain --- (horizontal rule in article body)', () => {
    const body = 'Above the line\n\n---\n\nBelow the line';
    const { compiled_truth, timeline } = splitBody(body);
    expect(compiled_truth).toBe(body);
    expect(timeline).toBe('');
  });

  test('does NOT split on multiple plain --- horizontal rules', () => {
    const body = 'Section 1\n\n---\n\nSection 2\n\n---\n\nSection 3';
    const { compiled_truth, timeline } = splitBody(body);
    expect(compiled_truth).toBe(body);
    expect(timeline).toBe('');
  });

  test('returns all as compiled_truth if no sentinel', () => {
    const body = 'Just some content\nWith multiple lines';
    const { compiled_truth, timeline } = splitBody(body);
    expect(compiled_truth).toBe(body);
    expect(timeline).toBe('');
  });

  test('plain --- at end of content stays in compiled_truth', () => {
    const body = 'Content here\n\n---\n';
    const { compiled_truth, timeline } = splitBody(body);
    expect(compiled_truth).toBe(body);
    expect(timeline).toBe('');
  });

  test('<!-- timeline --> with content before and after', () => {
    const body = '## Summary\n\nArticle summary here.\n\n---\n\nMore body content.\n\n<!-- timeline -->\n\n- 2024: Timeline entry';
    const { compiled_truth, timeline } = splitBody(body);
    expect(compiled_truth).toContain('## Summary');
    expect(compiled_truth).toContain('More body content.');
    expect(compiled_truth).not.toContain('Timeline entry');
    expect(timeline).toContain('Timeline entry');
  });
});

describe('serializeMarkdown', () => {
  test('round-trips through parse and serialize (explicit sentinel)', () => {
    const original = `---
type: concept
title: Do Things That Don't Scale
tags:
  - startups
  - growth
custom: value
---

Paul Graham argues that startups should do unscalable things early on.

<!-- timeline -->

- 2013-07-01: Published on paulgraham.com
`;
    const parsed = parseMarkdown(original);
    const serialized = serializeMarkdown(
      parsed.frontmatter,
      parsed.compiled_truth,
      parsed.timeline,
      { type: parsed.type, title: parsed.title, tags: parsed.tags },
    );

    // Re-parse the serialized version
    const reparsed = parseMarkdown(serialized);
    expect(reparsed.type).toBe(parsed.type);
    expect(reparsed.title).toBe(parsed.title);
    expect(reparsed.compiled_truth).toBe(parsed.compiled_truth);
    expect(reparsed.timeline).toBe(parsed.timeline);
    expect(reparsed.frontmatter.custom).toBe('value');
  });
});

describe('parseMarkdown edge cases', () => {
  test('does NOT split on plain --- separators (horizontal rules stay in compiled_truth)', () => {
    const md = `---
type: concept
title: Test
---

First section.

---

Second section.

---

Third section.`;
    const parsed = parseMarkdown(md);
    expect(parsed.compiled_truth).toContain('First section.');
    expect(parsed.compiled_truth).toContain('Second section.');
    expect(parsed.compiled_truth).toContain('Third section.');
    expect(parsed.timeline).toBe('');
  });

  test('splits on <!-- timeline --> sentinel with horizontal rules in body', () => {
    const md = `---
type: concept
title: Test
---

First section.

---

Second section.

<!-- timeline -->

- 2024: Timeline entry`;
    const parsed = parseMarkdown(md);
    expect(parsed.compiled_truth).toContain('First section.');
    expect(parsed.compiled_truth).toContain('Second section.');
    expect(parsed.compiled_truth).not.toContain('Timeline entry');
    expect(parsed.timeline).toContain('Timeline entry');
  });

  test('handles frontmatter without type or title', () => {
    const md = `---
custom_field: hello
---

Some content.`;
    const parsed = parseMarkdown(md);
    expect(parsed.type).toBeTruthy();
    expect(parsed.compiled_truth.trim()).toBe('Some content.');
    expect(parsed.frontmatter.custom_field).toBe('hello');
  });

  test('handles content with no frontmatter at all', () => {
    const md = `Just plain text with no YAML.`;
    const parsed = parseMarkdown(md);
    expect(parsed.compiled_truth).toContain('Just plain text');
  });

  test('handles empty string', () => {
    const parsed = parseMarkdown('');
    expect(parsed.compiled_truth).toBe('');
    expect(parsed.timeline).toBe('');
  });

  test('infers type from various directory paths', () => {
    expect(parseMarkdown('', 'people/someone.md').type).toBe('person');
    expect(parseMarkdown('', 'concepts/thing.md').type).toBe('concept');
    expect(parseMarkdown('', 'companies/acme.md').type).toBe('company');
  });

  test('infers type from wiki subdirectory paths', () => {
    expect(parseMarkdown('', 'tech/wiki/concepts/longevity-science.md').type).toBe('concept');
    expect(parseMarkdown('', 'tech/wiki/guides/team-os-claude-code.md').type).toBe('guide');
    expect(parseMarkdown('', 'tech/wiki/analysis/agi-timeline-debate.md').type).toBe('analysis');
    expect(parseMarkdown('', 'tech/wiki/hardware/h100-vs-gb200-training-benchmarks.md').type).toBe('hardware');
    expect(parseMarkdown('', 'tech/wiki/architecture/kb-infrastructure.md').type).toBe('architecture');
    expect(parseMarkdown('', 'finance/wiki/analysis/polymarket-bot-automation-thesis.md').type).toBe('analysis');
    expect(parseMarkdown('', 'personal/wiki/concepts/career-regrets-2026-framework.md').type).toBe('concept');
  });

  test('infers writing type from /writing/ paths', () => {
    expect(parseMarkdown('', 'writing/post.md').type).toBe('writing');
    expect(parseMarkdown('', 'projects/blog/writing/essay.md').type).toBe('writing');
  });
});
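
The sentinel contract exercised by the splitBody tests above can be sketched as a small standalone function. This is only an illustration under stated assumptions, not the code in src/core/markdown.ts: the name sketchSplitBody, the trimming of the two halves, and the choice to skip blank lines between --- and the heading are all hypothetical.

```typescript
// Hypothetical sketch of the timeline-sentinel rule; not the real splitBody.
interface SplitResult {
  compiled_truth: string;
  timeline: string;
}

// Explicit sentinels named in the commit message.
const EXPLICIT_SENTINELS = new Set(['<!-- timeline -->', '--- timeline ---']);

// A bare --- only counts when the next non-blank line is one of these headings.
const TIMELINE_HEADING = /^##\s+(Timeline|History)\b/i;

function sketchSplitBody(body: string): SplitResult {
  const lines = body.split('\n');

  for (let i = 0; i < lines.length; i++) {
    const line = (lines[i] ?? '').trim();

    if (EXPLICIT_SENTINELS.has(line)) {
      // The sentinel line itself is dropped from both halves.
      return {
        compiled_truth: lines.slice(0, i).join('\n').trimEnd(),
        timeline: lines.slice(i + 1).join('\n').trim(),
      };
    }

    if (line === '---') {
      // Look past blank lines for a Timeline/History heading (assumption).
      let j = i + 1;
      while (j < lines.length && (lines[j] ?? '').trim() === '') j++;
      if (j < lines.length && TIMELINE_HEADING.test((lines[j] ?? '').trim())) {
        // The heading stays in the timeline half; the --- is dropped.
        return {
          compiled_truth: lines.slice(0, i).join('\n').trimEnd(),
          timeline: lines.slice(j).join('\n').trim(),
        };
      }
      // Otherwise this is an ordinary horizontal rule: keep scanning.
    }
  }

  // No sentinel found: the whole body is compiled_truth, returned untouched.
  return { compiled_truth: body, timeline: '' };
}
```

Returning the body unchanged when no sentinel is present is what lets the "does NOT split" cases assert strict equality with the input rather than a trimmed copy.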