Files
gbrain/docs/ENGINES.md
Garry Tan b22cbd349a feat: GBrain v0.1.0 — Postgres-native personal knowledge brain (#1)
* chore: add CLAUDE.md with project context and gstack skill routing rules

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: initialize project with Bun + TypeScript

package.json with dependencies (postgres, pgvector, openai, anthropic,
MCP SDK, gray-matter). TypeScript config targeting ESNext with bundler
module resolution.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add foundation layer — engine interface, Postgres engine, schema

BrainEngine pluggable interface with full PostgresEngine: CRUD, search
(keyword + vector), links, tags, timeline, versions, stats, health,
ingest log, config. Trigger-based tsvector spanning pages +
timeline_entries. Markdown parser with frontmatter, compiled_truth /
timeline splitting, and round-trip serialization. 19 tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add 3-tier chunking and embedding service

Recursive delimiter-aware chunker (5-level hierarchy, 300-word chunks,
50-word overlap). Semantic chunker with Savitzky-Golay boundary detection
and recursive fallback. LLM-guided chunker via Claude Haiku with sliding
window topic detection. OpenAI embedding service with batch support,
exponential backoff, and rate limit handling.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add hybrid search with RRF fusion, expansion, and 4-layer dedup

Hybrid search merges vector (pgvector HNSW) + keyword (tsvector) via
Reciprocal Rank Fusion. Multi-query expansion via Claude Haiku generates
2 alternative phrasings. 4-layer dedup pipeline: by source, cosine
similarity, type diversity (60% cap), per-page cap.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add GBRAIN_V0 spec, pluggable engine architecture, SQLite engine plan

GBRAIN_V0.md: full product spec with architecture decisions, CLI commands,
schema, search architecture, chunking strategies, first-time experience,
and future plans. ENGINES.md: pluggable engine interface, capability matrix,
how to add new backends. SQLITE_ENGINE.md: complete SQLite implementation
plan with schema, FTS5 setup, vector search options, and contributor guide.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add CLI with all commands

Full CLI dispatcher with 25+ commands: init (Supabase wizard), get, put,
delete, list, search, query (hybrid RRF), import (bulk with progress bar),
export (round-trip), embed, stats, health, tag/untag/tags, link/unlink/
backlinks/graph, timeline/timeline-add, history/revert, config, upgrade,
serve, call. Smart slug resolution on reads. Version snapshots on updates.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add MCP stdio server with all brain tools

20 MCP tools mirroring CLI operations: get/put/delete/list pages,
search (keyword), query (hybrid RRF + expansion), tags, links with
graph traversal, timeline, stats, health, version history, and revert.
Auto-chunks and embeds on put_page. CLI and MCP share the same engine.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add 6 skill files and ClawHub manifest

Fat markdown skills for AI agents: ingest (meetings/docs/articles with
timeline merge), query (3-layer search + synthesis + citations), maintain
(health checks, stale detection, orphan audit), enrich (external API
enrichment), briefing (daily briefing compilation), migrate (universal
migration from Obsidian/Notion/Logseq/markdown/CSV/JSON/Roam).
ClawHub manifest for skill distribution.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add README, CONTRIBUTING, update CLAUDE.md test references

README with quickstart, commands, architecture, library usage, MCP setup,
and links to design docs. CONTRIBUTING with setup, project structure,
and guides for adding commands and engines. CLAUDE.md updated to reference
actual test files instead of planned-but-unwritten import test.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: address adversarial review findings — 5 critical/high fixes

- revertToVersion: add page_id check to prevent cross-page data corruption
- traverseGraph: use UNION instead of UNION ALL for cycle safety
- embedAll: preserve all chunks when embedding stale subset only
- embedding: throw on retry exhaustion instead of returning zero vectors
- putPage: validate slugs to prevent path traversal on export

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v0.1.0)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: expand README with schema, install, search architecture, and motivation

Why it exists, how search works (with ASCII diagram), full database schema
with all 9 tables and index details, chunking strategies explained, storage
estimates, setup wizard walkthrough, knowledge model with example page,
library usage with more examples, expanded skills table.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: add MIT license (Copyright 2026 Garry Tan)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add OpenClaw install flow as primary option in README

OpenClaw users just say "install gbrain" and the orchestrator handles
everything: package install, Supabase setup wizard, skill registration.
Shows the conversational interface for querying, ingesting, and briefings.
ClawHub and standalone CLI paths follow as alternatives.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add prerequisites and explicit OpenClaw install instructions

Prerequisites table listing Supabase, OpenAI, and Anthropic dependencies
with links. Environment variable setup. Explicit step-by-step prompt for
OpenClaw users showing exactly what to tell the orchestrator. Note that
search degrades gracefully without API keys (keyword-only without OpenAI,
no expansion without Anthropic).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: scrub named references, add PG essay demo section to README

Replace all Pedro/Brex/Jensen Huang/River AI examples with Paul Graham
essay examples using the kindling corpus. Add "Try it" section to README
showing the power of hybrid search on PG essays in 90 seconds. Update
test fixtures to use concept pages instead of person pages.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 12:48:10 -07:00

199 lines
9.3 KiB
Markdown

# Pluggable Engine Architecture
## The idea
Every GBrain operation goes through `BrainEngine`. The engine is the contract between "what the brain can do" and "how it's stored." Swap the engine, keep everything else.
v0 ships `PostgresEngine` backed by Supabase. The interface is designed so a `SQLiteEngine`, `DuckDBEngine`, or `TursoEngine` could slot in without touching the CLI, MCP server, skills, or any consumer code.
## Why this matters
Different users have different constraints:
| User | Needs | Best engine |
|------|-------|-------------|
| Power user (you) | World-class search, 7K+ pages, zero-ops | PostgresEngine + Supabase |
| Open source hacker | Single file, no server, git-friendly | SQLiteEngine (future) |
| Team/enterprise | Multi-user, RLS, audit trail | PostgresEngine + self-hosted |
| Researcher | Analytics, bulk exports, embeddings | DuckDBEngine (someday) |
| Edge/mobile | Offline-first, sync later | SQLiteEngine + sync (someday) |
The engine interface means we don't have to choose. Ship Postgres now, let the community build the rest.
## The interface
```typescript
// src/core/engine.ts
export interface BrainEngine {
// Lifecycle
connect(config: EngineConfig): Promise<void>;
disconnect(): Promise<void>;
initSchema(): Promise<void>;
transaction<T>(fn: (engine: BrainEngine) => Promise<T>): Promise<T>;
// Pages CRUD
getPage(slug: string): Promise<Page | null>;
putPage(slug: string, page: PageInput): Promise<Page>;
deletePage(slug: string): Promise<void>;
listPages(filters: PageFilters): Promise<Page[]>;
// Search
searchKeyword(query: string, opts?: SearchOpts): Promise<SearchResult[]>;
searchVector(embedding: Float32Array, opts?: SearchOpts): Promise<SearchResult[]>;
// Chunks
upsertChunks(slug: string, chunks: ChunkInput[]): Promise<void>;
getChunks(slug: string): Promise<Chunk[]>;
// Links
addLink(from: string, to: string, context?: string, linkType?: string): Promise<void>;
removeLink(from: string, to: string): Promise<void>;
getLinks(slug: string): Promise<Link[]>;
getBacklinks(slug: string): Promise<Link[]>;
traverseGraph(slug: string, depth?: number): Promise<GraphNode[]>;
// Tags
addTag(slug: string, tag: string): Promise<void>;
removeTag(slug: string, tag: string): Promise<void>;
getTags(slug: string): Promise<string[]>;
// Timeline
addTimelineEntry(slug: string, entry: TimelineInput): Promise<void>;
getTimeline(slug: string, opts?: TimelineOpts): Promise<TimelineEntry[]>;
// Raw data
putRawData(slug: string, source: string, data: object): Promise<void>;
getRawData(slug: string, source?: string): Promise<RawData[]>;
// Versions
createVersion(slug: string): Promise<PageVersion>;
getVersions(slug: string): Promise<PageVersion[]>;
revertToVersion(slug: string, versionId: number): Promise<void>;
// Stats + health
getStats(): Promise<BrainStats>;
getHealth(): Promise<BrainHealth>;
// Ingest log
logIngest(entry: IngestLogInput): Promise<void>;
getIngestLog(opts?: IngestLogOpts): Promise<IngestLogEntry[]>;
// Config
getConfig(key: string): Promise<string | null>;
setConfig(key: string, value: string): Promise<void>;
}
```
### Key design choices
**Slug-based API, not ID-based.** Every method takes slugs, not numeric IDs. The engine resolves slugs to IDs internally. This keeps the interface portable... slugs are strings, IDs are database-specific.
**Embedding is NOT in the engine.** The engine stores embeddings and searches by vector, but it doesn't generate embeddings. `src/core/embedding.ts` handles that. This is intentional: embedding is an external API call (OpenAI), not a storage concern. All engines share the same embedding service.
**Chunking is NOT in the engine.** Same logic. `src/core/chunkers/` handles chunking. The engine stores and retrieves chunks. All engines share the same chunkers.
**Search returns `SearchResult[]`, not raw rows.** The engine is responsible for its own search implementation (tsvector vs FTS5, pgvector vs sqlite-vss) but must return a uniform result type. RRF fusion and dedup happen above the engine, in `src/core/search/hybrid.ts`.
**`traverseGraph` exists but is engine-specific.** Postgres uses recursive CTEs. SQLite would use a loop with depth tracking. The interface is the same: give me a slug and max depth, return the graph.
## How search works across engines
```
+-------------------+
| hybrid.ts |
| (RRF fusion + |
| dedup, shared) |
+--------+----------+
|
+------------+------------+
| |
+--------v--------+ +--------v--------+
| engine.search | | engine.search |
| Keyword() | | Vector() |
+-----------------+ +-----------------+
| |
+-----------+-----------+ +---------+---------+
| | | |
+-------v-------+ +-------v---+ +-------v---+ +----v--------+
| Postgres: | | SQLite: | | Postgres: | | SQLite: |
| tsvector + | | FTS5 + | | pgvector | | sqlite-vss |
| ts_rank + | | bm25 | | HNSW | | or vec0 |
| websearch_to_ | | | | cosine | | |
| tsquery | | | | | | |
+---------------+ +-----------+ +-----------+ +-------------+
```
RRF fusion, multi-query expansion, and 4-layer dedup are engine-agnostic. They operate on `SearchResult[]` arrays. Only the raw keyword and vector searches are engine-specific.
## PostgresEngine (v0, ships)
**Dependencies:** `postgres` (porsager/postgres), `pgvector`
**Postgres-specific features used:**
- `tsvector` + `GIN` index for full-text search with `ts_rank` weighting
- `pgvector` HNSW index for cosine similarity vector search
- `pg_trgm` + `GIN` for fuzzy slug resolution
- Recursive CTEs for graph traversal
- Trigger-based search_vector (spans pages + timeline_entries)
- JSONB for frontmatter with GIN index
- Connection pooling via Supabase Supavisor (port 6543)
**Hosting:** Supabase Pro ($25/mo). Zero-ops. Managed Postgres with pgvector built in.
**Why not self-hosted for v0:** The brain should be infrastructure agents use, not something you maintain. Self-hosted Postgres with Docker is a welcome community PR, but v0 optimizes for zero ops.
## Adding a new engine
1. Create `src/core/<name>-engine.ts` implementing `BrainEngine`
2. Add to engine factory in `src/core/engine.ts`:
```typescript
export function createEngine(type: string): BrainEngine {
switch (type) {
case 'postgres': return new PostgresEngine();
case 'sqlite': return new SQLiteEngine();
default: throw new Error(`Unknown engine: ${type}`);
}
}
```
3. Store engine type in `~/.gbrain/config.json`: `{ "engine": "sqlite", ... }`
4. Add tests. The test suite should be engine-agnostic where possible... same test cases, different engine constructor.
5. Document in this file + add a design doc in `docs/`
### What you DON'T need to touch
- `src/cli.ts` (dispatches to engine, doesn't know which one)
- `src/mcp/server.ts` (same)
- `src/core/chunkers/*` (shared across engines)
- `src/core/embedding.ts` (shared across engines)
- `src/core/search/hybrid.ts`, `expansion.ts`, `dedup.ts` (shared, operate on SearchResult[])
- `skills/*` (fat markdown, engine-agnostic)
### What you DO need to implement
Every method in `BrainEngine`. The full interface. No optional methods, no feature flags. If your engine can't do vector search (e.g., a pure-text engine), implement `searchVector` to return `[]` and document the limitation.
## Capability matrix
| Capability | PostgresEngine | SQLiteEngine (future) | Notes |
|-----------|---------------|----------------------|-------|
| CRUD | Full | Full | |
| Keyword search | tsvector + ts_rank | FTS5 + bm25 | Different ranking algorithms |
| Vector search | pgvector HNSW | sqlite-vss or vec0 | Different index types |
| Fuzzy slug | pg_trgm | LIKE + Levenshtein | Postgres is better here |
| Graph traversal | Recursive CTE | Loop with depth tracking | Same interface |
| Transactions | Full ACID | Full ACID | Both support this |
| JSONB queries | GIN index | json_extract | Postgres is richer |
| Concurrent access | Connection pooling | Single writer | SQLite limitation |
| Hosting | Supabase, self-hosted, Docker | Local file | |
## Future engine ideas
**SQLiteEngine** (most requested). See `docs/SQLITE_ENGINE.md` for the full plan. Single file, no server, git-friendly. Uses FTS5 for keyword search, sqlite-vss or vec0 for vector search. Great for open source users who want zero infrastructure.
**TursoEngine.** libSQL (SQLite fork) with embedded replicas and HTTP edge access. Would give SQLite's simplicity with cloud sync. Interesting for mobile/edge use cases.
**DuckDBEngine.** Analytical workloads. Bulk exports, embedding analysis, brain-wide statistics. Not for OLTP. Could be a secondary engine for analytics alongside Postgres for operations.
**Custom/Remote.** The interface is clean enough that someone could build an engine backed by any storage: Firestore, DynamoDB, a REST API, even a flat file system. The interface doesn't assume SQL.