feat: GBrain v0.1.0 — Postgres-native personal knowledge brain (#1)

* chore: add CLAUDE.md with project context and gstack skill routing rules

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: initialize project with Bun + TypeScript

package.json with dependencies (postgres, pgvector, openai, anthropic,
MCP SDK, gray-matter). TypeScript config targeting ESNext with bundler
module resolution.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add foundation layer — engine interface, Postgres engine, schema

BrainEngine pluggable interface with full PostgresEngine: CRUD, search
(keyword + vector), links, tags, timeline, versions, stats, health,
ingest log, config. Trigger-based tsvector spanning pages +
timeline_entries. Markdown parser with frontmatter, compiled_truth /
timeline splitting, and round-trip serialization. 19 tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add 3-tier chunking and embedding service

Recursive delimiter-aware chunker (5-level hierarchy, 300-word chunks,
50-word overlap). Semantic chunker with Savitzky-Golay boundary detection
and recursive fallback. LLM-guided chunker via Claude Haiku with sliding
window topic detection. OpenAI embedding service with batch support,
exponential backoff, and rate limit handling.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add hybrid search with RRF fusion, expansion, and 4-layer dedup

Hybrid search merges vector (pgvector HNSW) + keyword (tsvector) via
Reciprocal Rank Fusion. Multi-query expansion via Claude Haiku generates
2 alternative phrasings. 4-layer dedup pipeline: by source, cosine
similarity, type diversity (60% cap), per-page cap.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add GBRAIN_V0 spec, pluggable engine architecture, SQLite engine plan

GBRAIN_V0.md: full product spec with architecture decisions, CLI commands,
schema, search architecture, chunking strategies, first-time experience,
and future plans. ENGINES.md: pluggable engine interface, capability matrix,
how to add new backends. SQLITE_ENGINE.md: complete SQLite implementation
plan with schema, FTS5 setup, vector search options, and contributor guide.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add CLI with all commands

Full CLI dispatcher with 25+ commands: init (Supabase wizard), get, put,
delete, list, search, query (hybrid RRF), import (bulk with progress bar),
export (round-trip), embed, stats, health, tag/untag/tags, link/unlink/
backlinks/graph, timeline/timeline-add, history/revert, config, upgrade,
serve, call. Smart slug resolution on reads. Version snapshots on updates.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add MCP stdio server with all brain tools

20 MCP tools mirroring CLI operations: get/put/delete/list pages,
search (keyword), query (hybrid RRF + expansion), tags, links with
graph traversal, timeline, stats, health, version history, and revert.
Auto-chunks and embeds on put_page. CLI and MCP share the same engine.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add 6 skill files and ClawHub manifest

Fat markdown skills for AI agents: ingest (meetings/docs/articles with
timeline merge), query (3-layer search + synthesis + citations), maintain
(health checks, stale detection, orphan audit), enrich (external API
enrichment), briefing (daily briefing compilation), migrate (universal
migration from Obsidian/Notion/Logseq/markdown/CSV/JSON/Roam).
ClawHub manifest for skill distribution.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add README, CONTRIBUTING, update CLAUDE.md test references

README with quickstart, commands, architecture, library usage, MCP setup,
and links to design docs. CONTRIBUTING with setup, project structure,
and guides for adding commands and engines. CLAUDE.md updated to reference
actual test files instead of planned-but-unwritten import test.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: address adversarial review findings — 5 critical/high fixes

- revertToVersion: add page_id check to prevent cross-page data corruption
- traverseGraph: use UNION instead of UNION ALL for cycle safety
- embedAll: preserve all chunks when embedding stale subset only
- embedding: throw on retry exhaustion instead of returning zero vectors
- putPage: validate slugs to prevent path traversal on export

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v0.1.0)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: expand README with schema, install, search architecture, and motivation

Why it exists, how search works (with ASCII diagram), full database schema
with all 9 tables and index details, chunking strategies explained, storage
estimates, setup wizard walkthrough, knowledge model with example page,
library usage with more examples, expanded skills table.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: add MIT license (Copyright 2026 Garry Tan)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add OpenClaw install flow as primary option in README

OpenClaw users just say "install gbrain" and the orchestrator handles
everything: package install, Supabase setup wizard, skill registration.
Shows the conversational interface for querying, ingesting, and briefings.
ClawHub and standalone CLI paths follow as alternatives.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add prerequisites and explicit OpenClaw install instructions

Prerequisites table listing Supabase, OpenAI, and Anthropic dependencies
with links. Environment variable setup. Explicit step-by-step prompt for
OpenClaw users showing exactly what to tell the orchestrator. Note that
search degrades gracefully without API keys (keyword-only without OpenAI,
no expansion without Anthropic).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: scrub named references, add PG essay demo section to README

Replace all Pedro/Brex/Jensen Huang/River AI examples with Paul Graham
essay examples using the kindling corpus. Add "Try it" section to README
showing the power of hybrid search on PG essays in 90 seconds. Update
test fixtures to use concept pages instead of person pages.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Garry Tan
2026-04-05 12:48:10 -07:00
committed by GitHub
parent 3144971cd0
commit b22cbd349a
62 changed files with 6655 additions and 0 deletions

README.md Normal file

@@ -0,0 +1,462 @@
# GBrain
Open source personal knowledge brain. Postgres + pgvector + hybrid search that actually works.
```bash
gbrain query "what does Paul Graham say about doing things that don't scale?"
```
```
concepts/do-things-that-dont-scale (concept) score=0.0312
The most common unscalable thing founders have to do at the start is to
recruit users manually. Nearly all startups have to...
concepts/how-to-get-startup-ideas (concept) score=0.0298
The way to get startup ideas is not to try to think of startup ideas.
It's to look for problems, preferably problems you have yourself...
concepts/relentlessly-resourceful (concept) score=0.0251
Not merely relentless. That's not enough to make things go your way
except in a few mostly uninteresting domains. In any interesting domain...
```
Hybrid search finds essays by meaning, not just keywords. "Doing things that don't scale" matches even when the exact phrase doesn't appear. That's the point.
## Why this exists
You have a brain full of knowledge. It lives in markdown files, meeting notes, CRM exports, Obsidian vaults, Notion databases. It's scattered, unsearchable, and going stale.
Search is the bottleneck. Keyword search misses semantic matches. Vector search misses exact names and phrases. Neither connects related ideas across documents.
GBrain fixes this with hybrid search that combines both approaches, plus a knowledge model that treats every page like an intelligence assessment: compiled truth on top (your current best understanding, rewritten when evidence changes), append-only timeline on the bottom (the evidence trail that never gets edited).
AI agents maintain the brain. You ingest a document and the agent updates every entity mentioned, creates cross-reference links, and appends timeline entries. MCP clients query it. The intelligence lives in fat markdown skills, not application code.
## Try it: Paul Graham's essays in 90 seconds
GBrain ships with 10 Paul Graham essays as a kindling corpus. After setup, they're already in your brain:
```bash
# What's in there?
gbrain stats
# Pages: 10, Chunks: 47, Embedded: 47, Links: 0
# Keyword search (fast, exact matches)
gbrain search "startups"
# Hybrid search (the good one, semantic + keyword + expansion)
gbrain query "what makes a great founder?"
# Read a specific essay
gbrain get concepts/do-things-that-dont-scale
# Find essays related to a concept
gbrain query "when should you ignore conventional wisdom?"
# Check brain health
gbrain health
# Pages: 10, Embed coverage: 100%, Stale: 0, Orphans: 10
```
The essays are just the demo. The real power comes when you import your own knowledge: thousands of pages about people, companies, projects, and the connections between them.
## Install
### Prerequisites
GBrain needs three things to run:
| Dependency | What it's for | How to get it |
|------------|--------------|---------------|
| **Supabase account** | Postgres + pgvector database | [supabase.com](https://supabase.com) (Pro tier, $25/mo for 8GB) |
| **OpenAI API key** | Embeddings (text-embedding-3-large) | [platform.openai.com/api-keys](https://platform.openai.com/api-keys) |
| **Anthropic API key** | Multi-query expansion + LLM chunking (Haiku) | [console.anthropic.com](https://console.anthropic.com) |
Set the API keys as environment variables:
```bash
export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...
```
The Supabase connection URL is configured during `gbrain init`. The OpenAI and Anthropic SDKs read their keys from the environment automatically.
Without an OpenAI key, search still works (keyword only, no vector search). Without an Anthropic key, search still works (no multi-query expansion, no LLM chunking).
### With OpenClaw (recommended)
If you're running OpenClaw, tell it to set up your brain. Make sure your API keys are set in the environment first.
```
You: "Install gbrain and set up my knowledge brain.
I need you to:
1. Run: bun add gbrain
2. Run: gbrain init --supabase (follow the wizard to connect my Supabase database)
3. Run: gbrain import data/kindling/ (import the demo corpus)
4. Read the skill files in skills/ so you know how to use the brain"
```
OpenClaw will install the package, walk through the Supabase connection wizard, import demo data, and learn the 6 brain skills (ingest, query, maintain, enrich, briefing, migrate).
After setup, you talk to your brain through OpenClaw:
```
You: "What essays do we have about startups?"
You: "Ingest my meeting notes from today"
You: "Give me a briefing for my meetings tomorrow"
You: "Import my Obsidian vault into the brain"
```
OpenClaw reads the skill files in `skills/`, figures out which gbrain commands to run, and does the work. You never touch the CLI directly unless you want to.
### With ClawHub
```bash
clawhub install gbrain
```
This installs the npm package, copies the skill files, and runs `gbrain init --supabase` on first use.
### Standalone CLI
```bash
npm install -g gbrain
```
### As a library
```bash
bun add gbrain
```
```typescript
import { PostgresEngine } from 'gbrain';
```
All paths require a Postgres database with pgvector. Supabase Pro ($25/mo) is the recommended zero-ops option.
## Setup
After installing via CLI or library path, run the setup wizard:
```bash
# Guided wizard: auto-provisions Supabase or accepts a connection URL
gbrain init --supabase
# Or connect to any Postgres with pgvector
gbrain init --url postgresql://user:pass@host:5432/dbname
```
The init wizard:
1. Checks for Supabase CLI, offers auto-provisioning
2. Falls back to manual connection URL if CLI isn't available
3. Runs the full schema migration (tables, indexes, triggers, extensions)
4. Imports the kindling corpus (10 PG essays) as demo data
5. Verifies the connection and prints your first query to try
Config is saved to `~/.gbrain/config.json` with 0600 permissions.
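A minimal sketch of writing a config file with owner-only permissions, assuming Node-compatible `fs` APIs (which Bun provides). The `BrainConfig` shape and the `saveConfig` and `permissionBits` helpers are hypothetical; only the 0600 mode and the JSON file come from the text above.

```typescript
import { chmodSync, mkdirSync, statSync, writeFileSync } from "node:fs";
import { dirname } from "node:path";

// Hypothetical config shape, for illustration only.
interface BrainConfig {
  database_url: string;
}

// Write the config as JSON, readable and writable by the owner only.
function saveConfig(path: string, config: BrainConfig): void {
  mkdirSync(dirname(path), { recursive: true });
  // `mode` only applies when the file is created, so chmod afterwards
  // to tighten permissions on a file that already existed.
  writeFileSync(path, JSON.stringify(config, null, 2) + "\n", { mode: 0o600 });
  chmodSync(path, 0o600);
}

// Return the permission bits of a file (e.g. 0o600).
function permissionBits(path: string): number {
  return statSync(path).mode & 0o777;
}
```

The explicit `chmodSync` matters on re-runs: without it, an existing config with looser permissions would keep them.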
OpenClaw users skip this step. The orchestrator runs the wizard for you during install.
## First import
```bash
# Import your markdown wiki (auto-chunks and auto-embeds)
gbrain import /path/to/brain/
# Skip embedding if you want to import fast and embed later
gbrain import /path/to/brain/ --no-embed
# Backfill embeddings for pages that don't have them
gbrain embed --stale
```
Import is idempotent. Re-running it skips unchanged files (compared by SHA-256 content hash). A progress bar shows status. Expect ~30s for a text import of 7,000 files and ~10-15 min for embedding.
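The skip-if-unchanged check can be sketched as follows. This is an illustration under assumptions, not the shipped importer: `contentHash`, `shouldImport`, and the in-memory `storedHashes` map (standing in for the `content_hash` column) are hypothetical names.

```typescript
import { createHash } from "node:crypto";

// Hex-encoded SHA-256 of a file's text content.
function contentHash(markdown: string): string {
  return createHash("sha256").update(markdown, "utf8").digest("hex");
}

// Hypothetical in-memory stand-in for the stored per-page content hash.
const storedHashes = new Map<string, string>();

// Returns true when the file is new or changed and needs (re)importing.
function shouldImport(slug: string, markdown: string): boolean {
  const hash = contentHash(markdown);
  if (storedHashes.get(slug) === hash) return false; // unchanged: skip
  storedHashes.set(slug, hash);
  return true;
}
```

Hashing the content rather than comparing modification times means a file that was touched but not edited is still skipped.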
## The knowledge model
Every page in the brain follows the compiled truth + timeline pattern:
```markdown
---
type: concept
title: Do Things That Don't Scale
tags: [startups, growth, pg-essay]
---
Paul Graham's argument that startups should do unscalable things early on.
The most common: recruiting users manually, one at a time. Airbnb went
door to door in New York photographing apartments. Stripe manually
installed their payment integration for early users.
The key insight: the unscalable effort teaches you what users actually
want, which you can't learn any other way.
---
- 2013-07-01: Published on paulgraham.com
- 2024-11-15: Referenced in batch W25 kickoff talk
- 2025-02-20: Cited in discussion about AI agent onboarding strategies
```
Above the `---` separator (the one after the body, not the pair that fences the frontmatter): **compiled truth**. Your current best understanding. Gets rewritten when new evidence changes the picture. Below: **timeline**. Append-only evidence trail. Never edited, only added to.
The compiled truth is the answer. The timeline is the proof.
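A rough sketch of how a page in this format could be split into its three parts. The project lists gray-matter as a dependency, but this example is dependency-free; `splitPage` and `ParsedPage` are hypothetical names, and the frontmatter is left as raw text rather than parsed YAML.

```typescript
interface ParsedPage {
  frontmatter: string;   // raw YAML text between the leading '---' lines
  compiledTruth: string; // everything between frontmatter and the separator
  timeline: string;      // everything below the separator
}

// Index of the first line at or after `from` that is exactly '---', or -1.
function findSeparator(lines: string[], from: number): number {
  return lines.findIndex((line, i) => i >= from && line.trim() === "---");
}

function splitPage(markdown: string): ParsedPage {
  const lines = markdown.split(/\r?\n/);
  let frontmatter = "";
  let bodyStart = 0;

  // Optional frontmatter: a leading '---' line up to the next '---' line.
  if (lines[0]?.trim() === "---") {
    const close = findSeparator(lines, 1);
    if (close !== -1) {
      frontmatter = lines.slice(1, close).join("\n");
      bodyStart = close + 1;
    }
  }

  // The next '---' after the frontmatter splits truth from timeline.
  const sep = findSeparator(lines, bodyStart);
  const truth = sep === -1 ? lines.slice(bodyStart) : lines.slice(bodyStart, sep);
  const timeline = sep === -1 ? [] : lines.slice(sep + 1);

  return {
    frontmatter,
    compiledTruth: truth.join("\n").trim(),
    timeline: timeline.join("\n").trim(),
  };
}
```

A page with no separator is treated as all compiled truth with an empty timeline. That is a choice made for this sketch, not documented behavior.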
## How search works
```
Query: "when should you ignore conventional wisdom?"
              |
   Multi-query expansion (Claude Haiku)
   "contrarian thinking startups", "going against the crowd"
              |
         +----+----+
         |         |
      Vector    Keyword
      (HNSW     (tsvector +
      cosine)   ts_rank)
         |         |
         +----+----+
              |
   RRF Fusion: score = sum(1/(60 + rank))
              |
   4-Layer Dedup
     1. Best chunk per page
     2. Cosine similarity > 0.85
     3. Type diversity (60% cap)
     4. Per-page chunk cap
              |
   Stale alerts (compiled truth older than latest timeline)
              |
           Results
```
Keyword search alone misses conceptual matches. "Ignore conventional wisdom" won't find an essay titled "The Bus Ticket Theory of Genius" even though it's exactly about that. Vector search alone misses exact phrases when the embedding is diluted by surrounding text. RRF fusion gets both right. Multi-query expansion catches phrasings you didn't think of.
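The fusion step from the diagram can be sketched in a few lines. This assumes 1-based ranks and that each input list contains a result ID at most once; the diagram does not specify either, and `rrfFuse` is a hypothetical name.

```typescript
// Reciprocal Rank Fusion over ranked lists of result IDs.
// k = 60 matches the constant in the diagram; ranks are assumed 1-based.
function rrfFuse(
  rankedLists: string[][],
  k = 60,
): Array<{ id: string; score: number }> {
  const scores = new Map<string, number>();
  for (const list of rankedLists) {
    list.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  // Highest fused score first.
  return [...scores.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}

// Usage: "b" appears in both lists, so it outranks "a" despite a lower vector rank.
const fused = rrfFuse([
  ["a", "b", "c"], // vector results, best first
  ["b", "d"],      // keyword results, best first
]);
```

Because only ranks are used, the vector and keyword scores never need to be on the same scale. The scores in the example output at the top of this README have a magnitude consistent with this formula.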
## Database schema
9 tables in Postgres + pgvector:
```
pages                     The core content table
  slug (UNIQUE)           e.g. "concepts/do-things-that-dont-scale"
  type                    person, company, deal, yc, civic, project, concept, source, media
  title, compiled_truth, timeline
  frontmatter (JSONB)     Arbitrary metadata
  search_vector           Trigger-based tsvector (title + compiled_truth + timeline + timeline_entries)
  content_hash            SHA-256 for import idempotency

content_chunks            Chunked content with embeddings
  page_id (FK)            Links to pages
  chunk_text              The chunk content
  chunk_source            'compiled_truth' or 'timeline'
  embedding (vector)      1536-dim from text-embedding-3-large (the model defaults to 3072 dims; 1536 requires the API's dimensions parameter)
  HNSW index              Cosine similarity search

links                     Cross-references between pages
  from_page_id, to_page_id
  link_type               knows, invested_in, works_at, founded, references, etc.

tags                      page_id + tag (many-to-many)

timeline_entries          Structured timeline events
  page_id, date, source, summary, detail (markdown)

page_versions             Snapshot history for compiled_truth
  compiled_truth, frontmatter, snapshot_at

raw_data                  Sidecar JSON from external APIs
  page_id, source, data (JSONB)

ingest_log                Audit trail of import/ingest operations

config                    Brain-level settings (embedding model, chunk strategy)
```
Indexes: B-tree on slug/type, GIN on frontmatter/search_vector, HNSW on embeddings, pg_trgm on title for fuzzy slug resolution.
## Chunking
Three strategies, dispatched by content type:
**Recursive** (timeline, bulk import): 5-level delimiter hierarchy (paragraphs, lines, sentences, clauses, words). 300-word chunks with 50-word sentence-aware overlap. Fast, predictable, lossless.
**Semantic** (compiled truth): Embeds each sentence, computes adjacent cosine similarities, applies Savitzky-Golay smoothing to find topic boundaries. Falls back to recursive on failure. Best quality for intelligence assessments.
**LLM-guided** (high-value content, on request): Pre-splits into 128-word candidates, asks Claude Haiku to identify topic shifts in sliding windows. 3 retries per window. Most expensive, best results.
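To make the semantic strategy's boundary detection concrete, here is a small sketch. It assumes a 5-point quadratic Savitzky-Golay window (coefficients -3, 12, 17, 12, -3 over 35), edge values padded by repetition, and boundaries chosen as local minima of the smoothed curve. The text above gives none of these parameters, and all function names are hypothetical.

```typescript
// Cosine similarity of two equal-length vectors (0 if either has zero norm).
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Entry i compares sentence i with sentence i + 1.
function adjacentSimilarities(embeddings: number[][]): number[] {
  const sims: number[] = [];
  for (let i = 0; i + 1 < embeddings.length; i++) {
    sims.push(cosine(embeddings[i] ?? [], embeddings[i + 1] ?? []));
  }
  return sims;
}

// 5-point quadratic Savitzky-Golay smoothing; edges reuse the nearest value.
const SG_COEFFS = [-3, 12, 17, 12, -3];
const SG_NORM = 35;

function savitzkyGolay5(values: number[]): number[] {
  const last = values.length - 1;
  return values.map((_, i) => {
    let acc = 0;
    SG_COEFFS.forEach((c, j) => {
      const idx = Math.min(last, Math.max(0, i + j - 2));
      acc += c * (values[idx] ?? 0);
    });
    return acc / SG_NORM;
  });
}

// A boundary is an interior local minimum of the smoothed similarity curve.
// A returned index i means "split between sentence i and sentence i + 1".
function findBoundaries(similarities: number[]): number[] {
  const s = savitzkyGolay5(similarities);
  const boundaries: number[] = [];
  for (let i = 1; i < s.length - 1; i++) {
    const prev = s[i - 1] ?? 0;
    const cur = s[i] ?? 0;
    const next = s[i + 1] ?? 0;
    if (cur < prev && cur <= next) boundaries.push(i);
  }
  return boundaries;
}
```

Smoothing before looking for minima keeps a single noisy sentence pair from producing a spurious split, while a sustained drop in similarity still shows up as a clear dip.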
## Commands
```
SETUP
  gbrain init [--supabase|--url <conn>]     Create brain (guided wizard)
  gbrain upgrade                            Self-update

PAGES
  gbrain get <slug>                         Read a page (supports fuzzy slug matching)
  gbrain put <slug> [< file.md]             Write/update a page (auto-versions)
  gbrain delete <slug>                      Delete a page
  gbrain list [--type T] [--tag T] [-n N]   List pages with filters

SEARCH
  gbrain search <query>                     Keyword search (tsvector)
  gbrain query <question>                   Hybrid search (vector + keyword + RRF + expansion)

IMPORT/EXPORT
  gbrain import <dir> [--no-embed]          Import markdown directory (idempotent)
  gbrain export [--dir ./out/]              Export to markdown (round-trip)

EMBEDDINGS
  gbrain embed [<slug>|--all|--stale]       Generate/refresh embeddings

LINKS + GRAPH
  gbrain link <from> <to> [--type T]        Create typed link
  gbrain unlink <from> <to>                 Remove link
  gbrain backlinks <slug>                   Incoming links
  gbrain graph <slug> [--depth N]           Traverse link graph (recursive CTE, default depth 5)

TAGS
  gbrain tags <slug>                        List tags
  gbrain tag <slug> <tag>                   Add tag
  gbrain untag <slug> <tag>                 Remove tag

TIMELINE
  gbrain timeline [<slug>]                  View timeline entries
  gbrain timeline-add <slug> <date> <text>  Add timeline entry

ADMIN
  gbrain stats                              Brain statistics
  gbrain health                             Health dashboard (embed coverage, stale, orphans)
  gbrain history <slug>                     Page version history
  gbrain revert <slug> <version-id>         Revert to previous version
  gbrain config [get|set] <key> [value]     Brain config
  gbrain serve                              MCP server (stdio)
  gbrain call <tool> '<json>'               Raw tool invocation
  gbrain --tools-json                       Tool discovery (JSON)
```
## Using as a library
GBrain is library-first. The CLI and MCP server are thin wrappers over the engine.
```typescript
import { PostgresEngine } from 'gbrain';

const engine = new PostgresEngine();
await engine.connect({ database_url: process.env.DATABASE_URL });
await engine.initSchema();

// Write a page
await engine.putPage('concepts/superlinear-returns', {
  type: 'concept',
  title: 'Superlinear Returns',
  compiled_truth: 'Paul Graham argues that returns in many fields are superlinear...',
  timeline: '- 2023-10-01: Published on paulgraham.com',
});

// Keyword search (hybrid fusion and expansion run above the engine)
const results = await engine.searchKeyword('startup growth');

// Typed links
await engine.addLink('concepts/superlinear-returns', 'concepts/do-things-that-dont-scale', '', 'references');

// Graph traversal
const graph = await engine.traverseGraph('concepts/superlinear-returns', 3);

// Health check
const health = await engine.getHealth();
// { page_count: 10, embed_coverage: 1.0, stale_pages: 0, orphan_pages: 10 }
```
The `BrainEngine` interface is pluggable. See `docs/ENGINES.md` for how to add backends.
## MCP server
Add to your Claude Code or Cursor MCP config:
```json
{
  "mcpServers": {
    "gbrain": {
      "command": "gbrain",
      "args": ["serve"]
    }
  }
}
```
20 tools: get_page, put_page, delete_page, list_pages, search, query, add_tag, remove_tag, get_tags, add_link, remove_link, get_links, get_backlinks, traverse_graph, add_timeline_entry, get_timeline, get_stats, get_health, get_versions, revert_version.
Every tool mirrors a CLI command. Drift tests verify identical behavior.
## Skills
Fat markdown files that tell AI agents HOW to use gbrain. No skill logic in the binary.
| Skill | What it does |
|-------|-------------|
| **ingest** | Ingest meetings, docs, articles. Updates compiled truth (rewrite, not append), appends timeline, creates cross-reference links across all mentioned entities. |
| **query** | 3-layer search (keyword + vector + structured) with synthesis and citations. Says "the brain doesn't have info on X" rather than hallucinating. |
| **maintain** | Periodic health: find contradictions, stale compiled truth, orphan pages, dead links, tag inconsistency, missing embeddings, overdue threads. |
| **enrich** | Enrich pages from external APIs. Raw data stored separately, distilled highlights go to compiled truth. |
| **briefing** | Daily briefing: today's meetings with participant context, active deals with deadlines, time-sensitive threads, recent changes. |
| **migrate** | Universal migration from Obsidian (wikilinks to gbrain links), Notion (stripped UUIDs), Logseq (block refs), plain markdown, CSV, JSON, Roam. |
## Architecture
```
        CLI / MCP Server
(thin wrappers, identical operations)
               |
      BrainEngine interface
       (pluggable backend)
               |
      +--------+--------+
      |                 |
PostgresEngine     SQLiteEngine
  (ships v0)       (designed, community PRs welcome)
      |
Supabase Pro ($25/mo)
Postgres + pgvector + pg_trgm
connection pooling via Supavisor
```
Embedding, chunking, and search fusion are engine-agnostic. Only raw keyword search (`searchKeyword`) and raw vector search (`searchVector`) are engine-specific. RRF fusion, multi-query expansion, and 4-layer dedup run above the engine on `SearchResult[]` arrays.
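Because these stages operate on plain result arrays, they can be written without touching the database. The sketch below shows two of the four dedup layers under stated assumptions: the `SearchResult` fields here are hypothetical (the real type may differ), and the 60% cap is interpreted as "no single page type may fill more than 60% of the returned slots".

```typescript
// Hypothetical result shape, for illustration only.
interface SearchResult {
  slug: string;  // page identifier
  type: string;  // page type, e.g. "concept"
  score: number; // fused score, higher is better
}

// Layer 1: keep only the highest-scoring chunk for each page.
function bestChunkPerPage(results: SearchResult[]): SearchResult[] {
  const best = new Map<string, SearchResult>();
  for (const r of results) {
    const current = best.get(r.slug);
    if (!current || r.score > current.score) best.set(r.slug, r);
  }
  return [...best.values()].sort((a, b) => b.score - a.score);
}

// Layer 3: cap how many of the `limit` slots one page type may take.
// Expects `results` sorted best first.
function capTypeShare(
  results: SearchResult[],
  limit: number,
  capPercent = 60,
): SearchResult[] {
  const maxPerType = Math.max(1, Math.floor((limit * capPercent) / 100));
  const counts = new Map<string, number>();
  const kept: SearchResult[] = [];
  for (const r of results) {
    if (kept.length >= limit) break;
    const seen = counts.get(r.type) ?? 0;
    if (seen >= maxPerType) continue; // this type already filled its share
    counts.set(r.type, seen + 1);
    kept.push(r);
  }
  return kept;
}
```

With a limit of 5, at most 3 results of one type survive, so a lower-scoring result of another type can still surface.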
## Storage estimates
For a brain with ~7,500 pages:
| Component | Size |
|-----------|------|
| Page text (compiled_truth + timeline) | ~150MB |
| JSONB frontmatter + indexes | ~70MB |
| Content chunks (~22K, text) | ~80MB |
| Embeddings (22K x 1536 floats) | ~134MB |
| HNSW index overhead | ~270MB |
| Links, tags, timeline, versions | ~50MB |
| **Total** | **~750MB** |
Supabase free tier (500MB) won't fit a large brain. Supabase Pro ($25/mo, 8GB) is the starting point.
Initial embedding cost: ~$4-5 for 7,500 pages via OpenAI text-embedding-3-large.
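The embedding row and the total in the table can be checked with a little arithmetic. This sketch assumes 4-byte floats and ignores per-row overhead; `embeddingBytes` is a hypothetical helper.

```typescript
// Raw embedding storage: one 4-byte float per dimension per chunk.
function embeddingBytes(chunks: number, dims: number, bytesPerFloat = 4): number {
  return chunks * dims * bytesPerFloat;
}

const bytes = embeddingBytes(22_000, 1536); // 135,168,000 bytes
const embeddingMb = bytes / 1_000_000;      // roughly 135 MB in decimal units

// Sum of the per-component estimates from the table above, in MB.
const componentsMb = [150, 70, 80, 134, 270, 50];
const totalMb = componentsMb.reduce((sum, mb) => sum + mb, 0); // 754
```

Both figures line up with the table: about 135 MB of raw vectors against the listed ~134MB, and a component sum of 754 against the listed ~750MB.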
## Docs
- [GBRAIN_V0.md](docs/GBRAIN_V0.md) -- Full product spec, all architecture decisions, every option considered
- [ENGINES.md](docs/ENGINES.md) -- Pluggable engine interface, capability matrix, how to add backends
- [SQLITE_ENGINE.md](docs/SQLITE_ENGINE.md) -- Complete SQLite engine plan with schema, FTS5, vector search options
## Contributing
See [CONTRIBUTING.md](CONTRIBUTING.md). PRs welcome for:
- SQLite engine implementation
- Docker Compose for self-hosted Postgres
- Additional migration sources
- New enrichment API integrations
## License
MIT