feat: GBrain v0.1.0 — Postgres-native personal knowledge brain (#1)

* chore: add CLAUDE.md with project context and gstack skill routing rules

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: initialize project with Bun + TypeScript

  package.json with dependencies (postgres, pgvector, openai, anthropic,
  MCP SDK, gray-matter). TypeScript config targeting ESNext with bundler
  module resolution.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add foundation layer — engine interface, Postgres engine, schema

  BrainEngine pluggable interface with full PostgresEngine: CRUD, search
  (keyword + vector), links, tags, timeline, versions, stats, health,
  ingest log, config. Trigger-based tsvector spanning pages +
  timeline_entries. Markdown parser with frontmatter, compiled_truth /
  timeline splitting, and round-trip serialization. 19 tests passing.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add 3-tier chunking and embedding service

  Recursive delimiter-aware chunker (5-level hierarchy, 300-word chunks,
  50-word overlap). Semantic chunker with Savitzky-Golay boundary
  detection and recursive fallback. LLM-guided chunker via Claude Haiku
  with sliding window topic detection. OpenAI embedding service with
  batch support, exponential backoff, and rate limit handling.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add hybrid search with RRF fusion, expansion, and 4-layer dedup

  Hybrid search merges vector (pgvector HNSW) + keyword (tsvector) via
  Reciprocal Rank Fusion. Multi-query expansion via Claude Haiku
  generates 2 alternative phrasings. 4-layer dedup pipeline: by source,
  cosine similarity, type diversity (60% cap), per-page cap.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add GBRAIN_V0 spec, pluggable engine architecture, SQLite engine plan

  GBRAIN_V0.md: full product spec with architecture decisions, CLI
  commands, schema, search architecture, chunking strategies, first-time
  experience, and future plans.
  ENGINES.md: pluggable engine interface, capability matrix, how to add
  new backends.
  SQLITE_ENGINE.md: complete SQLite implementation plan with schema, FTS5
  setup, vector search options, and contributor guide.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add CLI with all commands

  Full CLI dispatcher with 25+ commands: init (Supabase wizard), get,
  put, delete, list, search, query (hybrid RRF), import (bulk with
  progress bar), export (round-trip), embed, stats, health,
  tag/untag/tags, link/unlink/backlinks/graph, timeline/timeline-add,
  history/revert, config, upgrade, serve, call. Smart slug resolution on
  reads. Version snapshots on updates.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add MCP stdio server with all brain tools

  20 MCP tools mirroring CLI operations: get/put/delete/list pages,
  search (keyword), query (hybrid RRF + expansion), tags, links with
  graph traversal, timeline, stats, health, version history, and revert.
  Auto-chunks and embeds on put_page. CLI and MCP share the same engine.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add 6 skill files and ClawHub manifest

  Fat markdown skills for AI agents: ingest (meetings/docs/articles with
  timeline merge), query (3-layer search + synthesis + citations),
  maintain (health checks, stale detection, orphan audit), enrich
  (external API enrichment), briefing (daily briefing compilation),
  migrate (universal migration from
  Obsidian/Notion/Logseq/markdown/CSV/JSON/Roam). ClawHub manifest for
  skill distribution.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add README, CONTRIBUTING, update CLAUDE.md test references

  README with quickstart, commands, architecture, library usage, MCP
  setup, and links to design docs. CONTRIBUTING with setup, project
  structure, and guides for adding commands and engines. CLAUDE.md
  updated to reference actual test files instead of
  planned-but-unwritten import test.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: address adversarial review findings — 5 critical/high fixes

  - revertToVersion: add page_id check to prevent cross-page data corruption
  - traverseGraph: use UNION instead of UNION ALL for cycle safety
  - embedAll: preserve all chunks when embedding stale subset only
  - embedding: throw on retry exhaustion instead of returning zero vectors
  - putPage: validate slugs to prevent path traversal on export

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v0.1.0)

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: expand README with schema, install, search architecture, and motivation

  Why it exists, how search works (with ASCII diagram), full database
  schema with all 9 tables and index details, chunking strategies
  explained, storage estimates, setup wizard walkthrough, knowledge model
  with example page, library usage with more examples, expanded skills
  table.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: add MIT license (Copyright 2026 Garry Tan)

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add OpenClaw install flow as primary option in README

  OpenClaw users just say "install gbrain" and the orchestrator handles
  everything: package install, Supabase setup wizard, skill registration.
  Shows the conversational interface for querying, ingesting, and
  briefings. ClawHub and standalone CLI paths follow as alternatives.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add prerequisites and explicit OpenClaw install instructions

  Prerequisites table listing Supabase, OpenAI, and Anthropic
  dependencies with links. Environment variable setup. Explicit
  step-by-step prompt for OpenClaw users showing exactly what to tell the
  orchestrator. Note that search degrades gracefully without API keys
  (keyword-only without OpenAI, no expansion without Anthropic).

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: scrub named references, add PG essay demo section to README

  Replace all Pedro/Brex/Jensen Huang/River AI examples with Paul Graham
  essay examples using the kindling corpus. Add "Try it" section to
  README showing the power of hybrid search on PG essays in 90 seconds.
  Update test fixtures to use concept pages instead of person pages.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 12:48:10 -07:00
parent 3144971cd0
commit b22cbd349a
62 changed files with 6655 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,462 @@
+# GBrain
+
+Open source personal knowledge brain. Postgres + pgvector + hybrid search that actually works.
+
+```bash
+gbrain query "what does Paul Graham say about doing things that don't scale?"
+```
+
+```
+concepts/do-things-that-dont-scale (concept) score=0.0312
+  The most common unscalable thing founders have to do at the start is to
+  recruit users manually. Nearly all startups have to...
+
+concepts/how-to-get-startup-ideas (concept) score=0.0298
+  The way to get startup ideas is not to try to think of startup ideas.
+  It's to look for problems, preferably problems you have yourself...
+
+concepts/relentlessly-resourceful (concept) score=0.0251
+  Not merely relentless. That's not enough to make things go your way
+  except in a few mostly uninteresting domains. In any interesting domain...
+```
+
+Hybrid search finds essays by meaning, not just keywords. "Doing things that don't scale" matches even when the exact phrase doesn't appear. That's the point.
+
+## Why this exists
+
+You have a brain full of knowledge. It lives in markdown files, meeting notes, CRM exports, Obsidian vaults, Notion databases. It's scattered, unsearchable, and going stale.
+
+Search is the bottleneck. Keyword search misses semantic matches. Vector search misses exact names and phrases. Neither connects related ideas across documents.
+
+GBrain fixes this with hybrid search that combines both approaches, plus a knowledge model that treats every page like an intelligence assessment: compiled truth on top (your current best understanding, rewritten when evidence changes), append-only timeline on the bottom (the evidence trail that never gets edited).
+
+AI agents maintain the brain. You ingest a document and the agent updates every entity mentioned, creates cross-reference links, and appends timeline entries. MCP clients query it. The intelligence lives in fat markdown skills, not application code.
+
+## Try it: Paul Graham's essays in 90 seconds
+
+GBrain ships with 10 Paul Graham essays as a kindling corpus. After setup, they're already in your brain:
+
+```bash
+# What's in there?
+gbrain stats
+# Pages: 10, Chunks: 47, Embedded: 47, Links: 0
+
+# Keyword search (fast, exact matches)
+gbrain search "startups"
+
+# Hybrid search (the good one, semantic + keyword + expansion)
+gbrain query "what makes a great founder?"
+
+# Read a specific essay
+gbrain get concepts/do-things-that-dont-scale
+
+# Find essays related to a concept
+gbrain query "when should you ignore conventional wisdom?"
+
+# Check brain health
+gbrain health
+# Pages: 10, Embed coverage: 100%, Stale: 0, Orphans: 10
+```
+
+The essays are just the demo. The real power comes when you import your own knowledge: thousands of pages about people, companies, projects, and the connections between them.
+
+## Install
+
+### Prerequisites
+
+GBrain uses three external services (the database is required; search degrades gracefully without the API keys):
+
+| Dependency | What it's for | How to get it |
+|------------|--------------|---------------|
+| **Supabase account** | Postgres + pgvector database | [supabase.com](https://supabase.com) (Pro tier, $25/mo for 8GB) |
+| **OpenAI API key** | Embeddings (text-embedding-3-large) | [platform.openai.com/api-keys](https://platform.openai.com/api-keys) |
+| **Anthropic API key** | Multi-query expansion + LLM chunking (Haiku) | [console.anthropic.com](https://console.anthropic.com) |
+
+Set the API keys as environment variables:
+
+```bash
+export OPENAI_API_KEY=sk-...
+export ANTHROPIC_API_KEY=sk-ant-...
+```
+
+The Supabase connection URL is configured during `gbrain init`. The OpenAI and Anthropic SDKs read their keys from the environment automatically.
+
+Without an OpenAI key, search still works (keyword only, no vector search). Without an Anthropic key, search still works (no multi-query expansion, no LLM chunking).
+
+### With OpenClaw (recommended)
+
+If you're running OpenClaw, tell it to set up your brain. Make sure your API keys are set in the environment first.
+
+```
+You: "Install gbrain and set up my knowledge brain.
+      I need you to:
+      1. Run: bun add gbrain
+      2. Run: gbrain init --supabase (follow the wizard to connect my Supabase database)
+      3. Run: gbrain import data/kindling/ (import the demo corpus)
+      4. Read the skill files in skills/ so you know how to use the brain"
+```
+
+OpenClaw will install the package, walk through the Supabase connection wizard, import demo data, and learn the 6 brain skills (ingest, query, maintain, enrich, briefing, migrate).
+
+After setup, you talk to your brain through OpenClaw:
+
+```
+You: "What essays do we have about startups?"
+You: "Ingest my meeting notes from today"
+You: "Give me a briefing for my meetings tomorrow"
+You: "Import my Obsidian vault into the brain"
+```
+
+OpenClaw reads the skill files in `skills/`, figures out which gbrain commands to run, and does the work. You never touch the CLI directly unless you want to.
+
+### With ClawHub
+
+```bash
+clawhub install gbrain
+```
+
+This installs the npm package, copies the skill files, and runs `gbrain init --supabase` on first use.
+
+### Standalone CLI
+
+```bash
+npm install -g gbrain
+```
+
+### As a library
+
+```bash
+bun add gbrain
+```
+
+```typescript
+import { PostgresEngine } from 'gbrain';
+```
+
+All paths require a Postgres database with pgvector. Supabase Pro ($25/mo) is the recommended zero-ops option.
+
+## Setup
+
+After installing via the standalone CLI or library path, run the setup wizard:
+
+```bash
+# Guided wizard: auto-provisions Supabase or accepts a connection URL
+gbrain init --supabase
+
+# Or connect to any Postgres with pgvector
+gbrain init --url postgresql://user:pass@host:5432/dbname
+```
+
+The init wizard:
+1. Checks for Supabase CLI, offers auto-provisioning
+2. Falls back to manual connection URL if CLI isn't available
+3. Runs the full schema migration (tables, indexes, triggers, extensions)
+4. Imports the kindling corpus (10 PG essays) as demo data
+5. Verifies the connection and prints your first query to try
+
+Config is saved to `~/.gbrain/config.json` with 0600 permissions.
+
+OpenClaw users skip this step. The orchestrator runs the wizard for you during install.
+
+## First import
+
+```bash
+# Import your markdown wiki (auto-chunks and auto-embeds)
+gbrain import /path/to/brain/
+
+# Skip embedding if you want to import fast and embed later
+gbrain import /path/to/brain/ --no-embed
+
+# Backfill embeddings for pages that don't have them
+gbrain embed --stale
+```
+
+Import is idempotent: re-running it skips unchanged files, detected by comparing SHA-256 content hashes. A progress bar shows status. Expect roughly 30s for a text-only import of 7,000 files and 10-15 min to embed them.
+
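The hash comparison behind that idempotency can be sketched as follows. This is a minimal illustration, not GBrain's actual import code: `contentHash`, `needsImport`, and the in-memory `HashStore` (standing in for the `content_hash` column) are hypothetical names.

```typescript
import { createHash } from "node:crypto";

// SHA-256 hex digest of a file's content.
function contentHash(content: string): string {
  return createHash("sha256").update(content, "utf8").digest("hex");
}

// Hypothetical in-memory stand-in for the stored content_hash per slug.
type HashStore = Map<string, string>;

// Returns true when the file is new or changed and should be (re)imported.
function needsImport(store: HashStore, slug: string, content: string): boolean {
  const hash = contentHash(content);
  if (store.get(slug) === hash) return false; // unchanged: skip
  store.set(slug, hash); // record the new hash alongside the import
  return true;
}
```

Hashing content rather than comparing modification times means a file that was touched but not edited is still skipped.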
+## The knowledge model
+
+Every page in the brain follows the compiled truth + timeline pattern:
+
+```markdown
+---
+type: concept
+title: Do Things That Don't Scale
+tags: [startups, growth, pg-essay]
+---
+
+Paul Graham's argument that startups should do unscalable things early on.
+The most common: recruiting users manually, one at a time. Airbnb went
+door to door in New York photographing apartments. Stripe manually
+installed their payment integration for early users.
+
+The key insight: the unscalable effort teaches you what users actually
+want, which you can't learn any other way.
+
+---
+
+- 2013-07-01: Published on paulgraham.com
+- 2024-11-15: Referenced in batch W25 kickoff talk
+- 2025-02-20: Cited in discussion about AI agent onboarding strategies
+```
+
+Above the body's `---` separator (the one after the frontmatter block): **compiled truth**. Your current best understanding. Gets rewritten when new evidence changes the picture. Below: **timeline**. Append-only evidence trail. Never edited, only added to.
+
+The compiled truth is the answer. The timeline is the proof.
+
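A rough sketch of how a page in this layout could be split into its three parts. It assumes the frontmatter is fenced by `---` lines at the very top and that the first standalone `---` after it is the truth/timeline separator; `splitPage` and `ParsedPage` are illustrative names, and the real parser (which the commit message says uses gray-matter) may behave differently.

```typescript
interface ParsedPage {
  frontmatter: string;   // raw YAML between the opening fences
  compiledTruth: string; // body above the separator
  timeline: string;      // body below the separator
}

function splitPage(markdown: string): ParsedPage {
  const lines = markdown.split("\n");
  let frontmatter = "";
  let bodyStart = 0;

  // Optional frontmatter: the first line is `---`, closed by the next `---`.
  if (lines[0]?.trim() === "---") {
    const close = lines.findIndex((line, i) => i > 0 && line.trim() === "---");
    if (close !== -1) {
      frontmatter = lines.slice(1, close).join("\n");
      bodyStart = close + 1;
    }
  }

  const body = lines.slice(bodyStart);
  // The first standalone `---` in the body separates truth from timeline.
  const sep = body.findIndex((line) => line.trim() === "---");
  if (sep === -1) {
    return { frontmatter, compiledTruth: body.join("\n").trim(), timeline: "" };
  }
  return {
    frontmatter,
    compiledTruth: body.slice(0, sep).join("\n").trim(),
    timeline: body.slice(sep + 1).join("\n").trim(),
  };
}
```

A page with no separator is treated here as all compiled truth with an empty timeline, which is one reasonable default rather than a documented behavior.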
+## How search works
+
+```
+Query: "when should you ignore conventional wisdom?"
+         |
+    Multi-query expansion (Claude Haiku)
+    "contrarian thinking startups", "going against the crowd"
+         |
+    +----+----+
+    |         |
+  Vector    Keyword
+  (HNSW     (tsvector +
+  cosine)    ts_rank)
+    |         |
+    +----+----+
+         |
+    RRF Fusion: score = sum(1/(60 + rank))
+         |
+    4-Layer Dedup
+    1. Best chunk per page
+    2. Cosine similarity > 0.85
+    3. Type diversity (60% cap)
+    4. Per-page chunk cap
+         |
+    Stale alerts (compiled truth older than latest timeline)
+         |
+    Results
+```
+
+Keyword search alone misses conceptual matches. "Ignore conventional wisdom" won't find an essay titled "The Bus Ticket Theory of Genius" even though it's exactly about that. Vector search alone misses exact phrases when the embedding is diluted by surrounding text. RRF fusion gets both right. Multi-query expansion catches phrasings you didn't think of.
+
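The fusion step in the diagram, `score = sum(1/(60 + rank))`, can be sketched like this. It assumes ranks start at 1 and that each hit carries a stable `id`; `rrfFuse` and `RankedHit` are hypothetical names, not the library's API.

```typescript
interface RankedHit {
  id: string; // hypothetical chunk or page identifier
}

// Reciprocal Rank Fusion: each list contributes 1 / (k + rank) per hit,
// with rank starting at 1 (an assumption). k = 60 matches the diagram.
function rrfFuse(lists: RankedHit[][], k = 60): { id: string; score: number }[] {
  const scores = new Map<string, number>();
  for (const list of lists) {
    list.forEach((hit, index) => {
      const rank = index + 1;
      scores.set(hit.id, (scores.get(hit.id) ?? 0) + 1 / (k + rank));
    });
  }
  return [...scores.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}
```

Because only ranks enter the formula, the vector and keyword scores never need to be put on a common scale, and a hit that appears in both lists outranks one that tops a single list.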
+## Database schema
+
+9 tables in Postgres + pgvector:
+
+```
+pages                    The core content table
+  slug (UNIQUE)          e.g. "concepts/do-things-that-dont-scale"
+  type                   person, company, deal, yc, civic, project, concept, source, media
+  title, compiled_truth, timeline
+  frontmatter (JSONB)    Arbitrary metadata
+  search_vector          Trigger-based tsvector (title + compiled_truth + timeline + timeline_entries)
+  content_hash           SHA-256 for import idempotency
+
+content_chunks           Chunked content with embeddings
+  page_id (FK)           Links to pages
+  chunk_text             The chunk content
+  chunk_source           'compiled_truth' or 'timeline'
+  embedding (vector)     1536-dim from text-embedding-3-large (reduced from the model's default 3072)
+  HNSW index             Cosine similarity search
+
+links                    Cross-references between pages
+  from_page_id, to_page_id
+  link_type              knows, invested_in, works_at, founded, references, etc.
+
+tags                     page_id + tag (many-to-many)
+
+timeline_entries         Structured timeline events
+  page_id, date, source, summary, detail (markdown)
+
+page_versions            Snapshot history for compiled_truth
+  compiled_truth, frontmatter, snapshot_at
+
+raw_data                 Sidecar JSON from external APIs
+  page_id, source, data (JSONB)
+
+ingest_log               Audit trail of import/ingest operations
+
+config                   Brain-level settings (embedding model, chunk strategy)
+```
+
+Indexes: B-tree on slug/type, GIN on frontmatter/search_vector, HNSW on embeddings, pg_trgm on title for fuzzy slug resolution.
+
+## Chunking
+
+Three strategies, dispatched by content type:
+
+**Recursive** (timeline, bulk import): 5-level delimiter hierarchy (paragraphs, lines, sentences, clauses, words). 300-word chunks with 50-word sentence-aware overlap. Fast, predictable, lossless.
+
+**Semantic** (compiled truth): Embeds each sentence, computes adjacent cosine similarities, applies Savitzky-Golay smoothing to find topic boundaries. Falls back to recursive on failure. Best quality for intelligence assessments.
+
+**LLM-guided** (high-value content, on request): Pre-splits into 128-word candidates, asks Claude Haiku to identify topic shifts in sliding windows. 3 retries per window. Most expensive, best results.
+
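To make the 300-word / 50-word numbers concrete, here is a simplified sketch of overlapping word windows. It covers only the lowest rung of the recursive strategy (plain word splitting) and ignores the paragraph, line, sentence, and clause levels; `chunkWords` is an illustrative name and whitespace is normalized, so unlike the real chunker it is not lossless.

```typescript
// Split text into windows of `size` words, each sharing `overlap` words
// with the previous window. Defaults follow the sizes quoted above.
function chunkWords(text: string, size = 300, overlap = 50): string[] {
  if (overlap >= size) throw new Error("overlap must be smaller than size");
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  const chunks: string[] = [];
  const step = size - overlap;
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + size).join(" "));
    if (start + size >= words.length) break; // last window reached the end
  }
  return chunks;
}
```

With these defaults a 700-word text yields three chunks starting at words 0, 250, and 500.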
+## Commands
+
+```
+SETUP
+  gbrain init [--supabase|--url <conn>]     Create brain (guided wizard)
+  gbrain upgrade                            Self-update
+
+PAGES
+  gbrain get <slug>                         Read a page (supports fuzzy slug matching)
+  gbrain put <slug> [< file.md]             Write/update a page (auto-versions)
+  gbrain delete <slug>                      Delete a page
+  gbrain list [--type T] [--tag T] [-n N]   List pages with filters
+
+SEARCH
+  gbrain search <query>                     Keyword search (tsvector)
+  gbrain query <question>                   Hybrid search (vector + keyword + RRF + expansion)
+
+IMPORT/EXPORT
+  gbrain import <dir> [--no-embed]          Import markdown directory (idempotent)
+  gbrain export [--dir ./out/]              Export to markdown (round-trip)
+
+EMBEDDINGS
+  gbrain embed [<slug>|--all|--stale]       Generate/refresh embeddings
+
+LINKS + GRAPH
+  gbrain link <from> <to> [--type T]        Create typed link
+  gbrain unlink <from> <to>                 Remove link
+  gbrain backlinks <slug>                   Incoming links
+  gbrain graph <slug> [--depth N]           Traverse link graph (recursive CTE, default depth 5)
+
+TAGS
+  gbrain tags <slug>                        List tags
+  gbrain tag <slug> <tag>                   Add tag
+  gbrain untag <slug> <tag>                 Remove tag
+
+TIMELINE
+  gbrain timeline [<slug>]                  View timeline entries
+  gbrain timeline-add <slug> <date> <text>  Add timeline entry
+
+ADMIN
+  gbrain stats                              Brain statistics
+  gbrain health                             Health dashboard (embed coverage, stale, orphans)
+  gbrain history <slug>                     Page version history
+  gbrain revert <slug> <version-id>         Revert to previous version
+  gbrain config [get|set] <key> [value]     Brain config
+  gbrain serve                              MCP server (stdio)
+  gbrain call <tool> '<json>'               Raw tool invocation
+  gbrain --tools-json                       Tool discovery (JSON)
+```
+
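`gbrain graph` is described above as a depth-limited recursive CTE. The in-memory sketch below shows the same idea outside SQL: a breadth-first walk that stops at a maximum depth and never revisits a page, so cycles cannot loop. `LinkGraph` and `traverse` are hypothetical stand-ins for the `links` table and the engine call.

```typescript
// Adjacency list: slug -> slugs it links to (stand-in for the links table).
type LinkGraph = Map<string, string[]>;

// Breadth-first traversal up to maxDepth hops from `start`.
// Returns each reachable slug with the depth at which it was first seen.
function traverse(graph: LinkGraph, start: string, maxDepth = 5): Map<string, number> {
  const depthOf = new Map<string, number>([[start, 0]]);
  let frontier = [start];
  for (let depth = 1; depth <= maxDepth && frontier.length > 0; depth++) {
    const next: string[] = [];
    for (const slug of frontier) {
      for (const target of graph.get(slug) ?? []) {
        if (depthOf.has(target)) continue; // already reached: cycle-safe
        depthOf.set(target, depth);
        next.push(target);
      }
    }
    frontier = next;
  }
  return depthOf;
}
```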
+## Using as a library
+
+GBrain is library-first. The CLI and MCP server are thin wrappers over the engine.
+
+```typescript
+import { PostgresEngine } from 'gbrain';
+
+const engine = new PostgresEngine();
+await engine.connect({ database_url: process.env.DATABASE_URL });
+await engine.initSchema();
+
+// Write a page
+await engine.putPage('concepts/superlinear-returns', {
+  type: 'concept',
+  title: 'Superlinear Returns',
+  compiled_truth: 'Paul Graham argues that returns in many fields are superlinear...',
+  timeline: '- 2023-10-01: Published on paulgraham.com',
+});
+
+// Keyword search (hybrid fusion and expansion run above the engine)
+const results = await engine.searchKeyword('startup growth');
+
+// Typed links
+await engine.addLink('concepts/superlinear-returns', 'concepts/do-things-that-dont-scale', '', 'references');
+
+// Graph traversal
+const graph = await engine.traverseGraph('concepts/superlinear-returns', 3);
+
+// Health check
+const health = await engine.getHealth();
+// { page_count: 10, embed_coverage: 1.0, stale_pages: 0, orphan_pages: 10 }
+```
+
+The `BrainEngine` interface is pluggable. See `docs/ENGINES.md` for how to add backends.
+
+## MCP server
+
+Add to your Claude Code or Cursor MCP config:
+
+```json
+{
+  "mcpServers": {
+    "gbrain": {
+      "command": "gbrain",
+      "args": ["serve"]
+    }
+  }
+}
+```
+
+20 tools: get_page, put_page, delete_page, list_pages, search, query, add_tag, remove_tag, get_tags, add_link, remove_link, get_links, get_backlinks, traverse_graph, add_timeline_entry, get_timeline, get_stats, get_health, get_versions, revert_version.
+
+Every tool mirrors a CLI command. Drift tests verify identical behavior.
+
+## Skills
+
+Fat markdown files that tell AI agents HOW to use gbrain. No skill logic in the binary.
+
+| Skill | What it does |
+|-------|-------------|
+| **ingest** | Ingest meetings, docs, articles. Updates compiled truth (rewrite, not append), appends timeline, creates cross-reference links across all mentioned entities. |
+| **query** | 3-layer search (keyword + vector + structured) with synthesis and citations. Says "the brain doesn't have info on X" rather than hallucinating. |
+| **maintain** | Periodic health: find contradictions, stale compiled truth, orphan pages, dead links, tag inconsistency, missing embeddings, overdue threads. |
+| **enrich** | Enrich pages from external APIs. Raw data stored separately, distilled highlights go to compiled truth. |
+| **briefing** | Daily briefing: today's meetings with participant context, active deals with deadlines, time-sensitive threads, recent changes. |
+| **migrate** | Universal migration from Obsidian (wikilinks to gbrain links), Notion (stripped UUIDs), Logseq (block refs), plain markdown, CSV, JSON, Roam. |
+
+## Architecture
+
+```
+CLI / MCP Server
+     (thin wrappers, identical operations)
+              |
+      BrainEngine interface
+       (pluggable backend)
+              |
+     +--------+--------+
+     |                  |
+PostgresEngine     SQLiteEngine
+  (ships v0)       (designed, community PRs welcome)
+     |
+Supabase Pro ($25/mo)
+  Postgres + pgvector + pg_trgm
+  connection pooling via Supavisor
+```
+
+Embedding, chunking, and search fusion are engine-agnostic. Only raw keyword search (`searchKeyword`) and raw vector search (`searchVector`) are engine-specific. RRF fusion, multi-query expansion, and 4-layer dedup run above the engine on `SearchResult[]` arrays.
+
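Since dedup runs on plain `SearchResult[]` arrays, two of its layers can be sketched without any engine. The field names below are assumptions, and the 60% cap is interpreted here as "no page type may fill more than 60% of the requested result count", which the README does not spell out. The cosine-similarity and per-page chunk cap layers are left out.

```typescript
interface SearchResult {
  pageSlug: string; // hypothetical field names, for illustration
  pageType: string;
  score: number;
}

// Layer 1: keep only the highest-scoring chunk per page, best first.
function bestChunkPerPage(results: SearchResult[]): SearchResult[] {
  const best = new Map<string, SearchResult>();
  for (const r of results) {
    const current = best.get(r.pageSlug);
    if (!current || r.score > current.score) best.set(r.pageSlug, r);
  }
  return [...best.values()].sort((a, b) => b.score - a.score);
}

// Layer 3: walk score-sorted results and skip any whose type already
// holds capPercent of the `limit` slots.
function capTypeShare(results: SearchResult[], limit: number, capPercent = 60): SearchResult[] {
  const maxPerType = Math.max(1, Math.floor((limit * capPercent) / 100));
  const perType = new Map<string, number>();
  const kept: SearchResult[] = [];
  for (const r of results) {
    if (kept.length >= limit) break;
    const used = perType.get(r.pageType) ?? 0;
    if (used >= maxPerType) continue; // this type has used its share
    perType.set(r.pageType, used + 1);
    kept.push(r);
  }
  return kept;
}
```

Chaining them as `capTypeShare(bestChunkPerPage(results), 5)` lets lower-scoring pages of an under-represented type displace the fourth and later hits of a dominant type.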
+## Storage estimates
+
+For a brain with ~7,500 pages:
+
+| Component | Size |
+|-----------|------|
+| Page text (compiled_truth + timeline) | ~150MB |
+| JSONB frontmatter + indexes | ~70MB |
+| Content chunks (~22K, text) | ~80MB |
+| Embeddings (22K x 1536 floats) | ~134MB |
+| HNSW index overhead | ~270MB |
+| Links, tags, timeline, versions | ~50MB |
+| **Total** | **~750MB** |
+
+Supabase free tier (500MB) won't fit a large brain. Supabase Pro ($25/mo, 8GB) is the starting point.
+
+Initial embedding cost: ~$4-5 for 7,500 pages via OpenAI text-embedding-3-large.
+
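The embeddings row and the total can be checked with quick arithmetic. The sketch assumes 4-byte (32-bit) floats and decimal megabytes; per-row storage overhead is ignored, which is one reason the result lands slightly off the table's ~134MB.

```typescript
const chunkCount = 22000;  // "~22K" content chunks from the table
const dimensions = 1536;   // embedding width from the schema section
const bytesPerFloat = 4;   // assumption: 32-bit floats

const embeddingBytes = chunkCount * dimensions * bytesPerFloat;
const embeddingMB = embeddingBytes / 1000000;

// Rows of the table above, in MB, in the order listed.
const rowsMB = [150, 70, 80, 134, 270, 50];
const totalMB = rowsMB.reduce((sum, mb) => sum + mb, 0);

console.log(embeddingBytes);         // 135168000
console.log(embeddingMB.toFixed(1)); // 135.2
console.log(totalMB);                // 754
```

The rows sum to 754MB, consistent with the rounded ~750MB total and above the 500MB free-tier limit.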
+## Docs
+
+- [GBRAIN_V0.md](docs/GBRAIN_V0.md) -- Full product spec, all architecture decisions, every option considered
+- [ENGINES.md](docs/ENGINES.md) -- Pluggable engine interface, capability matrix, how to add backends
+- [SQLITE_ENGINE.md](docs/SQLITE_ENGINE.md) -- Complete SQLite engine plan with schema, FTS5, vector search options
+
+## Contributing
+
+See [CONTRIBUTING.md](CONTRIBUTING.md). PRs are welcome for:
+
+- SQLite engine implementation
+- Docker Compose for self-hosted Postgres
+- Additional migration sources
+- New enrichment API integrations
+
+## License
+
+MIT