Garry Tan 912a321cfa GBrain v0.4.0 — production agent documentation + reference architecture (#10 )


2026-04-09 10:17:13 -07:00

GBrain Skillpack: Reference Architecture for AI Agents

1. What This Document Is

This is a reference architecture for how a production AI agent uses gbrain as its knowledge backbone. It is based on patterns from a real deployment with 14,700+ brain files, 40+ skills, and 20+ cron jobs running continuously.

This is not a tutorial. It is a pattern book. Here's what works, here's why.

The memex vision, realized. Vannevar Bush described the memex in "As We May Think" (1945): a device where an individual stores all their books, records, and communications, mechanized so it may be consulted with exceeding speed and flexibility. GBrain is that device. A personal knowledge store with full provenance trails, hybrid search across everything you've ever read, said, or thought, and an AI agent that maintains it while you sleep. Bush imagined trails of association linking items together. GBrain has typed links, backlinks, and graph traversal. Bush imagined a scholar building a trail through a body of knowledge. GBrain's compiled truth pattern IS that trail, continuously rewritten as new evidence arrives.

The key difference from Bush's vision: the memex was passive (you had to build the trails). GBrain is active. The agent detects entities, enriches pages, creates cross-references, and maintains compiled truth automatically. You don't build the memex. The memex builds itself.

2. The Brain-Agent Loop

The core read-write cycle that makes the brain compound over time:

Signal arrives (message, meeting, email, tweet, link)
  |
  v
Detect entities (people, companies, concepts, original thinking)
  |
  v
READ: Check brain first (gbrain search, gbrain get)
  |
  v
Respond with context (brain makes every answer better)
  |
  v
WRITE: Update brain pages (new info compiled into existing pages)
  |
  v
Sync: gbrain indexes changes (available for next query)
  |
  v
(next signal arrives — agent is now smarter than last time)

Every signal that flows through your agent should touch the brain in both directions. Read before responding. Write after learning something new. The next time that person, company, or concept comes up, the agent already has context.

The brain almost always has something. External APIs fill gaps — they don't start from scratch.

An agent without this loop answers from stale context every time. An agent with it gets smarter with every conversation, every meeting, every email. Six months in, the compounding is visible: the agent knows more about your world than you can hold in working memory, because it never forgets and it never stops indexing.

The loop has two invariants:

- Every READ improves the response. If you answered a question about a person without checking their brain page first, you gave a worse answer than you could have.
- Every WRITE improves future reads. If a meeting transcript mentioned new information about a company and you didn't update the company page, you created a gap that will bite you later.
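
As a minimal sketch of the loop above, assuming the `gbrain` CLI is on PATH: only `gbrain search` and `gbrain sync` come from this document. `run_gbrain`, `handle_signal`, and the `respond` / `write_pages` callbacks are hypothetical names used for illustration.

```python
import subprocess
from typing import Callable, Dict, List

Runner = Callable[[List[str]], str]


def run_gbrain(args: List[str]) -> str:
    """Run the gbrain CLI and return stdout (assumes gbrain is installed)."""
    result = subprocess.run(
        ["gbrain", *args], capture_output=True, text=True, check=True
    )
    return result.stdout


def handle_signal(
    entities: List[str],
    respond: Callable[[Dict[str, str]], str],
    write_pages: Callable[[Dict[str, str]], None],
    run: Runner = run_gbrain,
) -> str:
    # READ: check the brain first for every detected entity.
    context = {name: run(["search", name]) for name in entities}
    # Respond using whatever the brain already knows.
    reply = respond(context)
    # WRITE: compile new information into brain pages (caller-supplied).
    write_pages(context)
    # Sync so the changes are indexed before the next signal arrives.
    run(["sync"])
    return reply
```

Passing `run` as a parameter keeps the read-respond-write-sync order testable without a live brain.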

3. Entity Detection -- Run It on Every Message

Spawn a lightweight sub-agent on EVERY inbound message. Use a cheap, fast model (e.g. Claude Sonnet). The sub-agent captures two things, in priority order:

Original Thinking (PRIMARY)

The user's ideas, observations, theses, frameworks, and philosophical riffs. This is the highest-value signal in the entire system. Original thinking becomes essays, talks, leadership philosophy, strategic insight. It compounds.

Capture the user's EXACT phrasing. The language IS the insight. "The ambition-to-lifespan ratio has never been more broken" captures something that "tension between ambition and mortality" doesn't. Don't clean it up. Don't paraphrase.

Route by authorship:

| Signal | Destination |
|---|---|
| User generated the idea | `brain/originals/{slug}.md` |
| World concept they reference | `brain/concepts/{slug}.md` |
| Product or business idea | `brain/ideas/{slug}.md` |
| Personal reflection or pattern | `brain/personal/reflections/` |

What counts: Original observations about how the world works, novel connections between disparate things, frameworks and mental models, pattern recognition moments, hot takes with reasoning, metaphors that reveal new angles.

What doesn't count: Routine operational messages ("ok", "do it"), pure questions without embedded observations, echoing back something the agent said.

Entity Mentions (SECONDARY)

People, companies, media references. For each:

- Check if brain page exists (gbrain search "name")
- If no page and entity is notable: create it, enrich it
- If thin page: spawn background enrichment
- If rich page: load it silently for context
- For new facts about existing entities: append to timeline

Rules

- Fire on EVERY message. No exceptions unless purely operational.
- Don't block the conversation. Spawn and forget.
- User's direct statements are the HIGHEST-authority signal.
- Iron law: back-link FROM entity pages TO the source that mentions them. An unlinked mention is a broken brain. Append the back-link to the entity page's Timeline or See Also section in this format: `- **YYYY-MM-DD** | Referenced in [page title](path/to/page.md) -- context`
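
The back-link format can be produced mechanically. A small sketch, where `backlink_entry` is a hypothetical helper name and the example path is illustrative:

```python
from datetime import date


def backlink_entry(day: date, title: str, path: str, context: str) -> str:
    """Build one back-link line for an entity page's Timeline or See Also."""
    return f"- **{day.isoformat()}** | Referenced in [{title}]({path}) -- {context}"


line = backlink_entry(
    date(2026, 4, 7),
    "Team Sync",
    "meetings/2026-04-07-team-sync.md",
    "discussed product launch",
)
```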

3b. The Originals Folder -- Capturing Intellectual Capital

Most knowledge systems capture WHAT YOU FOUND (articles, meetings, people). The originals folder captures WHAT YOU THINK.

When the user generates an original observation, thesis, framework, or hot take, the agent captures it verbatim in brain/originals/. This is the highest-value content in the entire brain.

The authorship test:

- User generated the idea? -> originals/{slug}.md
- User's unique synthesis of someone else's ideas? -> originals/ (the synthesis is original)
- World concept someone else coined? -> concepts/{slug}.md
- Product or business idea? -> ideas/{slug}.md

Naming: Use the user's own language for the slug. meatsuit-maintenance-tax not biological-needs-maintenance-overhead. The vividness IS the concept.
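
The authorship test and naming rule might be sketched as follows. The `kind` labels, `slugify`, and `route` are hypothetical. The slug rule shown (lowercase, hyphen-separated) is an assumption for illustration, not gbrain's own slug validation.

```python
import re

# Destinations follow the routing rules above; the keys are made-up labels.
DESTINATIONS = {
    "original": "brain/originals/{slug}.md",
    "synthesis": "brain/originals/{slug}.md",  # the synthesis is original
    "concept": "brain/concepts/{slug}.md",
    "idea": "brain/ideas/{slug}.md",
}


def slugify(phrase: str) -> str:
    """Keep the user's own words; only lowercase and hyphenate them."""
    return re.sub(r"[^a-z0-9]+", "-", phrase.lower()).strip("-")


def route(kind: str, phrase: str) -> str:
    return DESTINATIONS[kind].format(slug=slugify(phrase))


path = route("original", "Meatsuit maintenance tax")
# path == "brain/originals/meatsuit-maintenance-tax.md"
```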

Cross-link originals to: people who shaped the thinking, companies where it played out, meetings where it was discussed, books and media that influenced it, other originals it connects to (ideas form clusters). An original without cross-links is a dead original. The connections ARE the intelligence.

Over time, the originals folder becomes a searchable archive of the user's intellectual output, organized by topic. This is the memex at its most powerful: not just remembering what you read, but remembering what you THOUGHT about what you read.

4. The Brain-First Lookup Protocol

Before calling ANY external API to research a person, company, or topic:

1. gbrain search "name"     -- keyword match, fast, works day one
2. gbrain query "what do we know about name"  -- hybrid search, needs embeddings
3. gbrain get <slug>         -- direct page read when you know the slug
4. External APIs as FALLBACK only

The brain almost always has something. Even a timeline entry from three months ago is better context than starting from scratch with a web search.
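
One way to sketch the protocol as a fallback chain. Here `run` is any callable that executes a gbrain subcommand and returns its output, and `external` is any web or API lookup; both are hypothetical stand-ins. Treating empty output as "no hits" is an assumption about the CLI's behavior.

```python
from typing import Callable, List, Optional, Tuple


def brain_first_lookup(
    name: str,
    run: Callable[[List[str]], str],
    external: Callable[[str], str],
    slug: Optional[str] = None,
) -> Tuple[str, str]:
    """Try the brain in protocol order; call external APIs only as fallback."""
    steps = [["search", name], ["query", f"what do we know about {name}"]]
    if slug is not None:
        steps.append(["get", slug])  # direct page read when the slug is known
    for args in steps:
        output = run(args).strip()
        if output:  # assumption: empty output means the brain had nothing
            return args[0], output
    return "external", external(name)
```

The returned label records which step answered, which makes it easy to log how often the external fallback is actually needed.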

For each entity found: load compiled truth + recent timeline entries before responding. The compiled truth section gives you the state of play in 30 seconds. The timeline gives you what changed recently.

This is mandatory. An agent that calls Brave Search before checking the brain is wasting money and giving worse answers. The brain has context that no external API can provide: relationship history, the user's own assessments, meeting transcripts, cross-references to other entities.

5. Enrichment Pipeline -- 7-Step Protocol

When to enrich: entity mentioned in conversation, meeting attendees, email threads, social interactions, new contacts, whenever the brain page is thin or missing.

Tier System

Scale API spend to importance. Don't blow 20 API calls on a passing mention.

| Tier | Who | Effort | API Calls |
|---|---|---|---|
| Tier 1 | Key people and companies: inner circle, business partners, portfolio companies | Full pipeline, ALL data sources | 10-15 |
| Tier 2 | Notable: people you interact with occasionally | Web search + social + brain cross-reference | 3-5 |
| Tier 3 | Minor mentions: everyone else worth tracking | Brain cross-reference + social lookup if handle known | 1-2 |

The 7 Steps

Step 1: Identify entities. From the incoming signal (meeting, email, tweet), extract people names, company names, and what they're associated with.

Step 2: Check brain state. Does a page exist? If yes, read it -- you're on the UPDATE path. If no, you're on the CREATE path. Check gbrain search first.

Step 3: Extract signal from source. Don't just pull facts -- pull texture:

- What opinion did they express? -> What They Believe
- What are they building or shipping? -> What They're Building
- Did they express emotion? -> What Makes Them Tick
- Who did they engage with? -> Network / Relationship
- Is this a recurring topic? -> Hobby Horses
- What did they commit to? -> Open Threads
- What was their energy? -> Trajectory

Step 4: Data source lookups. For CREATE or thin pages, run structured lookups. The order matters -- stop when you have enough signal for the entity's tier.

Priority order:

1. Brain cross-reference (free, highest-value -- always first): gbrain search "name" to find mentions across meetings, other people pages, company pages.
2. Web search via Brave or Exa: background, press, talks, funding.
3. X/Twitter deep lookup (enterprise API or scraping): beliefs, building, hobby horses, network, trajectory.
4. People enrichment: Crustdata (LinkedIn data), Happenstance (web research, career arcs).
5. Company/funding data: Captain API (Pitchbook-grade funding, valuation, team data).
6. Meeting history: Circleback (transcript search, attendee lookup).
7. Contact data (Google Contacts, CRM sync).

X/Twitter lookup is underrated. When you have someone's handle, their tweets are the single best source for: what they believe (opinions expressed unprompted), what they're building (shipping announcements), hobby horses (recurring topics), who they engage with (reply patterns, amplification), and trajectory (posting frequency, tone shifts). This goes into the brain page's "What They Believe" and "Hobby Horses" sections.
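
The tier budget and the Step 4 priority order combine naturally into one loop. In this sketch the budgets are the upper bounds from the tier table, every lookup (including the free brain cross-reference) is counted as one call for simplicity, and `enough` is a hypothetical caller-supplied test for "enough signal". None of these names come from gbrain.

```python
from typing import Callable, Dict, List, Tuple

# Upper bound of the "API Calls" column for each tier.
TIER_BUDGET = {1: 15, 2: 5, 3: 2}

Source = Tuple[str, Callable[[str], str]]


def enrich(
    name: str,
    tier: int,
    sources: List[Source],
    enough: Callable[[Dict[str, str]], bool],
) -> Dict[str, str]:
    """Run lookups in priority order until the budget or signal is reached."""
    budget = TIER_BUDGET[tier]
    results: Dict[str, str] = {}
    for label, fetch in sources:  # sources must already be in priority order
        if len(results) >= budget or enough(results):
            break
        results[label] = fetch(name)
    return results
```

A Tier 3 mention stops after two lookups even when more sources are configured, which is the point of scaling spend to importance.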

Step 5: Save raw data. Save every API response to a .raw/ sidecar alongside the brain page, as JSON keyed by provider: sources.{provider}.fetched_at holds the fetch time and sources.{provider}.data holds the response. Overwrite on re-enrichment; don't append.
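
A sketch of the sidecar write, under stated assumptions: the sidecar lives at `.raw/<page-stem>.json` next to the page, and "overwrite" means replacing that provider's entry. The document specifies only the `.raw/` location and the two JSON keys. `save_raw` and the example page path are hypothetical.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def save_raw(page_path: Path, provider: str, data: dict) -> Path:
    """Store one provider's raw response in the page's .raw/ sidecar."""
    sidecar = page_path.parent / ".raw" / (page_path.stem + ".json")  # assumed layout
    sidecar.parent.mkdir(parents=True, exist_ok=True)
    doc = json.loads(sidecar.read_text()) if sidecar.exists() else {"sources": {}}
    # Overwrite this provider's entry; never append a second copy.
    doc["sources"][provider] = {
        "fetched_at": datetime.now(timezone.utc).isoformat(),
        "data": data,
    }
    sidecar.write_text(json.dumps(doc, indent=2))
    return sidecar


# Usage: enriching twice leaves a single, latest entry for the provider.
with tempfile.TemporaryDirectory() as tmp:
    page = Path(tmp) / "people" / "sarah-chen.md"
    save_raw(page, "crustdata", {"title": "old"})
    sidecar_path = save_raw(page, "crustdata", {"title": "new"})
    saved = json.loads(sidecar_path.read_text())
```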

Step 6: Write to brain. CREATE path: use the page template from your brain's schema, fill compiled truth from all data gathered, add first timeline entry. UPDATE path: append timeline, update compiled truth if the new signal materially changes the picture. Flag contradictions -- don't silently resolve them.

Step 7: Cross-reference. After updating a person page: update their company page, update deal pages, add back-links. After updating a company page: update founder pages, update deal pages. Every entity page should link to every other entity page that references it.

People Pages

A person page isn't a LinkedIn profile. It's a living portrait:

- Executive Summary -- How do you know them? Why do they matter?
- State -- Role, company, relationship, key context
- What They Believe -- Ideology, worldview, first principles
- What They're Building -- Current projects, features shipped
- What Motivates Them -- Ambition drivers, career arc
- Assessment -- Strengths, weaknesses, net read
- Trajectory -- Ascending, plateauing, pivoting, declining?
- Relationship -- History, temperature, open threads
- Contact -- Email, phone, X handle, LinkedIn
- Timeline -- Reverse chronological, append-only, never rewritten

Facts are table stakes. Texture is the value.

6. Compiled Truth + Timeline Pattern

Every brain page has a horizontal rule separating two zones:

Above the line: Compiled truth. A synthesis that represents the current state of play. If you read only the compiled truth section, you know everything you need. This gets rewritten when new evidence changes the picture.

Below the line: Timeline. Append-only log of every signal, in reverse chronological order. Never rewritten, never deleted. This is the evidence base. Every compiled truth claim should be traceable to one or more timeline entries.

## Executive Summary
One paragraph. How do you know them, why do they matter.

## State
Role, company, key numbers, relationship status.

## What They Believe
Their worldview, first principles, hills they die on.

## What They're Building
Current projects, recent launches, what's next.

## Assessment
Strengths, weaknesses, your net read on this person.

## Trajectory
Where they're headed. Ascending, plateauing, pivoting?

## Relationship
History with you. Last interaction. Open threads.

## Contact
Email, phone, X handle, LinkedIn.

---

## Timeline

- **2026-04-07** | Met at Team Sync. Discussed new product launch. Seemed energized
  about the pivot. [Source: Meeting notes "Team Sync" #12345, 2026-04-07 2:00 PM PT]
- **2026-04-03** | Mentioned in email thread re Q2 planning. Taking lead on ops.
  [Source: email from Sarah Chen re Q2 board deck, 2026-04-03 10:30 AM PT]
- **2026-03-15** | First meeting. Intro from Pedro. Strong technical background.
  [Source: User, direct message, 2026-03-15 3:00 PM PT]

The compiled truth pattern works because the agent rewrites the synthesis as new evidence arrives, but the evidence itself is immutable. Six months of timeline entries compress into a one-paragraph assessment that's always current.
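
The two-zone layout can be handled with plain string operations. This sketch assumes the page has no YAML front matter, so the first line that is exactly `---` is the separator, and that the timeline starts with a `## Timeline` heading as in the template above. `split_page` and `add_timeline_entry` are hypothetical helpers.

```python
from typing import Tuple


def split_page(text: str) -> Tuple[str, str]:
    """Return (compiled_truth, timeline), split at the first '---' line."""
    lines = text.splitlines()
    rule = lines.index("---")  # raises ValueError if the page has no rule
    truth = "\n".join(lines[:rule]).strip()
    timeline = "\n".join(lines[rule + 1:]).strip()
    return truth, timeline


def add_timeline_entry(text: str, entry: str) -> str:
    """Insert an entry directly under '## Timeline' so newest stays first."""
    head, heading, tail = text.partition("## Timeline\n")
    if not heading:
        raise ValueError("page has no '## Timeline' section")
    # Existing entries are left untouched; the new one goes on top.
    return head + heading + "\n" + entry + "\n" + tail.lstrip("\n")
```

Only the text above the rule is ever rewritten. The timeline helper adds lines and never edits or removes existing ones, which keeps the evidence base append-only.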

GBrain integration: gbrain query weights compiled truth higher than timeline entries in search results, so the freshest synthesis surfaces first.

7. Source Attribution -- Every Fact Needs a Citation

This is not a suggestion. It is a hard requirement. Every fact written to a brain page needs an inline [Source: ...] citation with full provenance.

Format

[Source: {who}, {channel/context}, {date} {time} {tz}]

Examples by Category

Direct statements: [Source: User, direct message, 2026-04-07 12:33 PM PT]

Meetings: [Source: Meeting notes "Team Sync" #12345, 2026-04-03 12:11 PM PT]

API enrichment: [Source: Crustdata LinkedIn enrichment, 2026-04-07 12:35 PM PT]

Social media (MUST include full URL): [Source: X/@pedroh96 tweet, product launch, 2026-04-07](https://x.com/pedroh96/status/...)

Email: [Source: email from Sarah Chen re Q2 board deck, 2026-04-05 2:30 PM PT]

Workspace: [Source: Slack #engineering, Keith re deploy schedule, 2026-04-06 11:45 AM PT]

Web research: [Source: Happenstance research, 2026-04-07 12:35 PM PT]

Published media: [Source: [Wall Street Journal, 2026-04-05](https://wsj.com/...)]

Funding data: [Source: Captain API funding data, 2026-04-07 2:00 PM PT]
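
A formatter keeps citations uniform across categories. A minimal sketch, where `format_source` is a hypothetical helper and the default `PT` timezone label is taken from the examples above:

```python
from datetime import datetime
from typing import Optional


def format_source(
    who: str,
    context: str,
    when: datetime,
    tz: str = "PT",
    url: Optional[str] = None,
) -> str:
    """Render [Source: {who}, {channel/context}, {date} {time} {tz}]."""
    hour12 = when.hour % 12 or 12  # 0 -> 12 AM, 13 -> 1 PM
    meridiem = "AM" if when.hour < 12 else "PM"
    stamp = f"{when:%Y-%m-%d} {hour12}:{when:%M} {meridiem} {tz}"
    citation = f"[Source: {who}, {context}, {stamp}]"
    # Link the citation when a URL is available (mandatory for tweets).
    return f"{citation}({url})" if url else citation


cite = format_source("User", "direct message", datetime(2026, 4, 7, 12, 33))
# cite == "[Source: User, direct message, 2026-04-07 12:33 PM PT]"
```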

Why This Matters

Six months from now, someone reads a brain page and can trace every single fact back to where it came from. "User said it" isn't enough. WHERE, ABOUT WHAT, WHEN.

The Rule Most Agents Miss

Source attribution applies to compiled truth AND timeline. The compiled truth section (above the line) isn't exempt from citations just because it's a synthesis. Every claim needs a source. "Pedro co-founded Brex" needs [Source: ...] just as much as a timeline entry does.

Tweet URLs Are Mandatory

A tweet reference without a URL is a broken citation. Format: [Source: X/@handle tweet, topic, date](https://x.com/handle/status/ID). This is a real production problem: hundreds of brain pages end up with broken tweet citations when the URL is omitted.
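
Broken tweet citations are easy to detect in bulk. The regular expression below is an assumption derived from the format above, not a gbrain feature, and `broken_tweet_citations` is a hypothetical name:

```python
import re
from typing import List

# A tweet citation, optionally followed by its (https://x.com/.../status/ID) link.
TWEET_CITATION = re.compile(
    r"\[Source: X/@\w+ tweet[^\]]*\](\(https://x\.com/\w+/status/\d+\))?"
)


def broken_tweet_citations(text: str) -> List[str]:
    """Return every tweet citation that is missing its status URL."""
    return [
        match.group(0)
        for match in TWEET_CITATION.finditer(text)
        if match.group(1) is None
    ]
```

Running a check like this during weekly maintenance would surface missing URLs before they spread across hundreds of pages.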

Source Hierarchy for Conflicting Information

1. User's direct statements (highest authority)
2. Primary sources (meetings, emails, direct conversations)
3. Enrichment APIs (Crustdata, Happenstance, Captain)
4. Web search results
5. Social media posts

When sources conflict, note the contradiction in compiled truth with both citations. Don't silently pick one.
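
A sketch of how conflicts might be surfaced rather than resolved. The `AUTHORITY` labels and the `CONFLICT:` prefix are made up for illustration; only the ordering comes from the hierarchy above.

```python
from typing import List, Tuple

# Highest authority first, mirroring the hierarchy above.
AUTHORITY = ["user", "primary", "enrichment", "web", "social"]

Claim = Tuple[str, str, str]  # (value, source kind, citation)


def reconcile(claims: List[Claim]) -> str:
    """Return one cited statement, or a flagged contradiction with all citations."""
    ranked = sorted(claims, key=lambda claim: AUTHORITY.index(claim[1]))
    if len({value for value, _, _ in ranked}) == 1:
        value, _, citation = ranked[0]  # sources agree: cite the strongest
        return f"{value} {citation}"
    # Sources disagree: keep every version, strongest first, never pick silently.
    return "CONFLICT: " + " vs. ".join(f"{v} {c}" for v, _, c in ranked)
```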

8. Meeting Ingestion

Meetings are the richest signal source in the entire system. Every meeting produces entity updates across multiple brain pages.

Transcript Source

Circleback or any meeting recording service with API access. The key requirement: speaker diarization (who said what) and webhook support.

Schedule

Run as a cron job. A reasonable cadence: 3x/day (10 AM, 4 PM, 9 PM) to catch new meetings throughout the day.

After Every Meeting

1. Pull the full transcript. Always pull the complete transcript, not just the AI summary. AI-generated summaries hallucinate framing -- they editorialize what was "agreed" or "decided" when no such agreement happened. The transcript is ground truth.

2. Create the meeting page. Write to brain/meetings/YYYY-MM-DD-short-description.md with the agent's OWN analysis:

- Above the bar: Agent's summary reframed through the user's priorities. What matters to YOU, not a generic meeting recap. Flag surprises, contradictions, and implications. Name real decisions and commitments (not performative ones). Call out what was left unsaid or unresolved.
- Below the bar: Full diarized transcript (append-only evidence base). Format: **Speaker** (HH:MM:SS): Words.

3. Propagate to entity pages (MANDATORY). This is the step most agents skip. A meeting is NOT fully ingested until every entity page has been updated:

- People pages: Update State, append Timeline with meeting-specific insights
- Company pages: Update State with new metrics, status, decisions, feedback
- Deal pages: Update State with new terms, status, deadlines

4. Extract action items into your task list.

5. Commit and sync. gbrain sync so the new pages are immediately searchable.
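
The page path and transcript line format from step 2 can be generated like this. Both helper names are hypothetical, and the slug rule (lowercase, hyphenated) is an assumption:

```python
import re
from datetime import date


def meeting_page_path(day: date, description: str) -> str:
    """brain/meetings/YYYY-MM-DD-short-description.md"""
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    return f"brain/meetings/{day.isoformat()}-{slug}.md"


def transcript_line(speaker: str, offset_seconds: int, words: str) -> str:
    """**Speaker** (HH:MM:SS): Words."""
    hours, remainder = divmod(offset_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"**{speaker}** ({hours:02d}:{minutes:02d}:{seconds:02d}): {words}"


path = meeting_page_path(date(2026, 4, 7), "Team Sync")
# path == "brain/meetings/2026-04-07-team-sync.md"
```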

Back-Linking

Meeting page links to attendee pages. Attendee pages link back to meeting with context. The graph is bidirectional. Always.
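
Bidirectionality is checkable. A sketch that, given each page's outgoing links (however you extract them), lists the back-links that are still missing; `missing_backlinks` is a hypothetical helper:

```python
from typing import Dict, List, Set, Tuple


def missing_backlinks(links: Dict[str, Set[str]]) -> List[Tuple[str, str]]:
    """Return (page, target) pairs where page should link to target but doesn't."""
    return sorted(
        (destination, source)
        for source, targets in links.items()
        for destination in targets
        if source not in links.get(destination, set())
    )


graph = {
    "meetings/2026-04-07-team-sync.md": {"people/sarah-chen.md", "people/pedro.md"},
    "people/sarah-chen.md": {"meetings/2026-04-07-team-sync.md"},
    "people/pedro.md": set(),
}
gaps = missing_backlinks(graph)
# gaps == [("people/pedro.md", "meetings/2026-04-07-team-sync.md")]
```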

9. Reference Cron Schedule

A production agent runs 20+ recurring jobs that interact with the brain. Here is a generalized reference schedule:

| Frequency | Job | Brain Interaction |
|---|---|---|
| Every 30 min | Email monitoring | `gbrain search` sender, update people pages |
| Every 30 min | Message monitoring | `gbrain search` sender, entity detection |
| Hourly | Social media ingestion | Create/update media pages, entity extraction |
| Hourly | Workspace scanning | Update project pages, flag mentions |
| 3x/day | Meeting processing | Full ingestion pipeline (Section 8) |
| Daily AM | Morning briefing | `gbrain search` for calendar attendees, deal status, active threads |
| Daily AM | Task preparation | Pull today's tasks, cross-reference brain for context |
| Weekly | Brain maintenance | `gbrain doctor`, `gbrain embed --stale`, orphan detection |
| Weekly | Contacts sync | New contacts -> brain pages, enrichment pipeline |

Quiet Hours Gate

Before sending any notification, check if it's quiet hours (e.g., 11 PM - 8 AM, configure to your schedule). During quiet hours:

- Hold non-urgent notifications
- Merge held messages into the next morning briefing
- Only break quiet hours for genuinely urgent items (time-sensitive, would cause real damage if delayed)
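
The gate itself is small. A sketch using the example window (11 PM to 8 AM), where `is_quiet`, `dispatch`, and the `held` list are hypothetical names. The window wraps past midnight, so the comparison has to handle both cases.

```python
from datetime import time
from typing import List


def is_quiet(now: time, start: time = time(23, 0), end: time = time(8, 0)) -> bool:
    """True when `now` falls inside the quiet window [start, end)."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end  # window wraps past midnight


def dispatch(message: str, urgent: bool, now: time, held: List[str]) -> str:
    """Send immediately, or hold for the next morning briefing."""
    if is_quiet(now) and not urgent:
        held.append(message)  # merged into the next morning briefing
        return "held"
    return "sent"
```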

Travel-Aware Timezone Handling

The agent reads your calendar for flights, hotels, and out-of-office blocks to infer your current location and timezone. All times are shown in YOUR local timezone -- "4:42 AM HT" in Hawaii, not "14:42 UTC" or "7:42 AM PT".

When you travel, cron jobs that would fire during your home-timezone waking hours but hit your sleeping hours at the destination get held and folded into the next morning briefing. No config change needed. The agent figures it out from your calendar.

This means: fly to Tokyo, land, sleep... wake up to a morning briefing that includes everything your crons would have sent you at 11 AM Pacific (which was 3 AM Tokyo during Pacific Daylight Time). Zero missed signals, zero 3 AM pings.

Every cron job includes: quiet hours check, location/timezone awareness, sub-agent spawning for heavy work.
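The travel-aware check amounts to converting each job's fire time into the user's current timezone before applying the quiet-hours window. The sketch below is illustrative only: the function names are hypothetical, and it uses fixed UTC offsets to stay self-contained, where a real agent would resolve full IANA zones from the calendar-inferred location so daylight saving is handled.

```python
from datetime import datetime, time, timedelta, timezone

# Fixed offsets keep the sketch self-contained. These are assumptions for the
# example; a real agent would look up the zone for the inferred location.
PDT = timezone(timedelta(hours=-7), "PDT")
JST = timezone(timedelta(hours=9), "JST")


def local_fire_time(fire_at: datetime, user_tz: timezone) -> datetime:
    """Convert a scheduled fire time into the user's current local time."""
    return fire_at.astimezone(user_tz)


def should_hold(fire_at: datetime, user_tz: timezone,
                quiet_start: time = time(23, 0),
                quiet_end: time = time(8, 0)) -> bool:
    """True if the job would land inside quiet hours where the user is right now.

    Assumes the quiet window wraps midnight (start later in the day than end).
    """
    local = local_fire_time(fire_at, user_tz).time()
    return local >= quiet_start or local < quiet_end
```

For example, a job scheduled for 11 AM PDT lands at 3 AM in Tokyo, so `should_hold(datetime(2025, 6, 2, 11, 0, tzinfo=PDT), JST)` is true while the same call with `PDT` as the user's timezone is false.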

10. Content and Media Ingestion

When the user shares a link, article, video, tweet, or document:

Fetch and process -- transcribe video, OCR PDF, parse article
Save to brain at sources/ or media/
Cross-reference with existing brain pages (who's mentioned? what companies? what concepts?)
Surface interesting angles given the user's interests and worldview
Commit and sync -- gbrain sync

YouTube Ingestion

YouTube is a first-class workflow, not an afterthought.

Transcribe with speaker diarization via Diarize.io -- identifies WHO said WHAT, not just a wall of text
Create brain page at media/youtube/{slug}.md with: title, channel, date, link, diarized transcript, agent's analysis
Agent's analysis is the value add: what matters, key quotes attributed to specific speakers, connections to existing brain pages, implications
Cross-reference: every person mentioned gets a back-link from their brain page to this video
Over time, media/ becomes a searchable archive of every video, podcast, talk, and interview the user has consumed, with the agent's commentary layered on top

Social Media Bundles

Don't just save a tweet. Reconstruct the full context:

Thread reconstruction (quoted tweets, replies in context)
Linked articles fetched and summarized
Engagement data (what resonated, what didn't)
Entity extraction from the full bundle

PDFs and Documents

OCR when needed, extract structured data, save to sources/. For books and long-form: chapter summaries, key quotes with page numbers, cross-references to brain pages for people and concepts mentioned.

11. Executive Assistant Pattern

The brain transforms basic EA work into contextual EA work. It is the difference between "you have a meeting at 3" and "you have a meeting at 3 with Pedro -- last time you discussed the Series B timeline, he was concerned about burn rate, here's the latest from his company page."

Email Triage

Before triaging any email: gbrain search for sender context. Load their brain page. Now you know: who they are, your relationship history, what they care about, and what open threads exist. The triage is informed, not mechanical.

Meeting Prep

Before any meeting: gbrain search all attendees. Load relationship pages. Surface: last interaction date, open threads, recent timeline entries, relevant deal status. The user walks into every meeting already briefed.

Scheduling

When scheduling: check brain for meeting frequency, last interaction, relationship temperature. "You haven't met with Diana in 6 weeks and she has an open thread about the Q3 launch" is a useful scheduling nudge.

After Clearing Inbox

Update relevant brain pages with new information from email threads. Every email is a signal. The brain should reflect what was learned.

12. The Three Search Modes

GBrain provides three distinct search modes. Use the right one for the job.

| Mode | Command | Needs Embeddings | Speed | Best For |
| --- | --- | --- | --- | --- |
| Keyword | `gbrain search "name"` | No | Fastest | Known names, exact matches, day-one queries |
| Hybrid | `gbrain query "what do we know"` | Yes | Fast | Semantic questions, fuzzy matching, conceptual search |
| Direct | `gbrain get <slug>` | No | Instant | Loading a specific page when you know the slug |

Progression

Day one: Use keyword search (gbrain search). It works without embeddings and catches exact name matches.
After first embed: Use hybrid search (gbrain query) for semantic questions. "Who do I know at fintech companies?" works here.
When you know the slug: Use direct get (gbrain get pedro-franceschi). Instant, no search overhead.
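The progression above can be expressed as a small decision function that builds the command line to run. This is a sketch, not part of GBrain: `pick_command` and its parameters are hypothetical, and only the three subcommands (`search`, `query`, `get`) come from this document.

```python
from typing import Optional


def pick_command(query: str, slug: Optional[str] = None,
                 semantic: bool = False, has_embeddings: bool = False) -> list[str]:
    """Build the argv for the search mode that fits the request."""
    if slug:
        # Known slug: direct get, no search overhead.
        return ["gbrain", "get", slug]
    if semantic and has_embeddings:
        # Semantic question and embeddings exist: hybrid search.
        return ["gbrain", "query", query]
    # Day one, exact names, or no embeddings yet: keyword search.
    return ["gbrain", "search", query]
```

The returned list can be passed to `subprocess.run`. Returning argv as a list, rather than a shell string, avoids quoting problems when a query contains spaces or quotes.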

Token Budget Awareness

Search returns chunks, not full pages. Read the search excerpts first. Only use gbrain get <slug> for the full page when the chunk confirms relevance.

"Tell me about Pedro" -> gbrain get pedro-franceschi (you want the full page)
"Did anyone mention the Series A?" -> search results are enough (scan chunks)
"What's the latest on Brex?" -> search first, then get the company page if needed

Precedence for Conflicting Information

1. User's direct statements (always win)
2. Compiled truth sections (synthesized from evidence)
3. Timeline entries (raw signal, reverse chronological)
4. External sources (web search, APIs)
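One way to apply this ordering when two sources disagree is to rank each candidate fact by its source kind and keep the best-ranked one. The sketch below assumes a hypothetical `Claim` type and hypothetical source-kind labels; only the ordering itself comes from the list above.

```python
from dataclasses import dataclass

# Lower number wins. The order mirrors the precedence list above;
# the key names are illustrative labels, not GBrain identifiers.
PRECEDENCE = {
    "user_statement": 0,
    "compiled_truth": 1,
    "timeline": 2,
    "external": 3,
}


@dataclass
class Claim:
    """A candidate fact. Hypothetical type for illustration."""
    value: str
    source_kind: str  # one of the PRECEDENCE keys


def resolve(claims: list[Claim]) -> Claim:
    """Return the claim from the highest-precedence source; ties keep the first seen."""
    if not claims:
        raise ValueError("no claims to resolve")
    return min(claims, key=lambda c: PRECEDENCE[c.source_kind])
```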

13. How GBrain Complements Agent Memory

A production agent has three layers of memory. All three should be consulted. They serve different purposes.

| Layer | What It Stores | Examples | How to Access |
| --- | --- | --- | --- |
| GBrain | World knowledge -- facts about people, companies, deals, meetings, concepts, ideas | Pedro's company page, meeting transcripts, original theses, deal terms | `gbrain search`, `gbrain query`, `gbrain get` |
| Agent memory (`memory_search`) | Operational state -- preferences, architecture decisions, tool config, session continuity | "User prefers concise formatting", "Deploy to staging before prod", "ClawVisor task IDs" | `memory_search`, file reads |
| Session context | Current conversation window -- what was just said, what the user just asked | The last 20 messages, current task, immediate context | Already in context |

When to Use Each

"Who is Pedro?" -> GBrain (world knowledge about a person)
"How do I format messages for this user?" -> Agent memory (operational preference)
"What did I just ask you to do?" -> Session context (immediate)
"What happened in Tuesday's meeting?" -> GBrain (meeting transcript + entity pages)
"Which API key goes where?" -> Agent memory (tool configuration)

GBrain is for facts about the world. Agent memory is for how the agent operates. Session context is for right now. Don't store operational preferences in GBrain. Don't store people dossiers in agent memory.

14. Integration Setup Guides

Three integrations that make the agent real. Without these, the brain is a static database. With them, it's alive.

14a. ClawVisor -- Secure Gateway to Google and iMessage

ClawVisor is a credential vaulting and authorization gateway. The agent never holds API keys directly. ClawVisor enforces policies, manages task-scoped authorization, and injects credentials at request time.

Services: Gmail (list, read, send, draft), Google Calendar (CRUD), Google Drive (list, search, read), Google Contacts (list, search), Apple iMessage (list, read, search, send), GitHub, Slack.

Task-scoped authorization: Every request must include a task_id from an approved standing task. Tasks declare: purpose (verbose, 2-3 sentences), authorized actions with expected use patterns, auto-execute flag, lifetime (standing vs ephemeral).

Why this matters for GBrain: The EA workflow needs Gmail (sender lookup before triage), Calendar (meeting prep, attendee pages), Contacts (enrichment trigger), and iMessage (direct instructions). ClawVisor gives the agent access without giving it raw credentials.

Setup:

Create agent in ClawVisor dashboard, copy agent token
Set CLAWVISOR_URL and CLAWVISOR_AGENT_TOKEN in env
Activate services (Google, iMessage, etc.) in the dashboard
Create standing tasks with expansive scopes (narrow purposes cause false blocks)
Store standing task IDs in agent memory for reuse

Critical scoping rule: Be expansive in task purposes. "Full executive assistant email management including inbox triage, searching by any criteria, reading emails, tracking threads" works. "Email triage" gets rejected. The intent verification model uses the purpose to judge whether each request is consistent -- if your purpose is narrow, legitimate requests fail verification.

14b. Circleback -- Meeting Ingestion via Webhooks

Circleback records meetings, generates transcripts with speaker diarization, and fires webhooks on completion.

Webhook setup:

In Circleback dashboard -> Automations -> add webhook
URL: {your_agent_gateway}/hooks/circleback-meetings
Circleback provides a signing secret for HMAC-SHA256 signature verification
Store the signing secret in your webhook transform for verification

Webhook payload: Meeting JSON with id, name, attendees, notes, action items, full transcript, calendar event context.

Signature verification: Header X-Circleback-Signature contains sha256=<hex>. Verify with HMAC-SHA256(body, signing_secret). Reject unverified webhooks.
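The verification step can be sketched with Python's standard library. This is a minimal sketch under assumptions: it takes the signature to be computed over the raw request body bytes and the signing secret to be used as UTF-8 bytes, which should be confirmed against Circleback's documentation. The function name is hypothetical.

```python
import hashlib
import hmac


def verify_signature(body: bytes, header_value: str, signing_secret: str) -> bool:
    """Check an `X-Circleback-Signature` value of the form `sha256=<hex>`."""
    prefix = "sha256="
    if not header_value.startswith(prefix):
        return False
    received = header_value[len(prefix):]
    expected = hmac.new(signing_secret.encode(), body, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode(), received.encode())
```

Verify against the raw bytes as received, before any JSON parsing or normalization, because re-serializing the payload can change whitespace or key order and invalidate the signature.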

OAuth for API access: Circleback uses dynamic client registration (OAuth 2.0). Access tokens expire in ~24h, auto-refresh via refresh token. Store credentials in agent memory.

Flow: Webhook fires -> transform validates signature + normalizes -> agent wakes -> pulls full transcript via API -> creates brain meeting page -> propagates to entity pages -> commits to brain repo -> gbrain sync.

14c. Quo (OpenPhone) -- SMS and Call Integration

Quo (formerly OpenPhone) provides business phone numbers with SMS, calls, voicemail, and AI transcripts.

Webhook setup:

In Quo dashboard -> Integrations -> Webhooks
Register webhooks for: message.received, call.completed, call.summary.completed, call.transcript.completed
Point all to: {your_agent_gateway}/hooks/quo-events
Store registered webhook IDs in agent memory

How inbound texts work:

Webhook fires with sender phone, message text, conversation context
Agent looks up sender in brain by phone number
Surfaces to user's messaging platform with sender identity + brain context
Drafts reply for approval (never auto-replies without explicit permission)

How inbound calls work:

call.completed fires -> if duration > 30s, fetch transcript + AI summary via API
Ingest to brain (meeting-style page at meetings/)
Update relevant people and company pages

API auth: Bare API key in Authorization header (no Bearer prefix).

Key endpoints: POST /v1/messages (send SMS), GET /v1/messages (list), GET /v1/call-transcripts/{id}, GET /v1/conversations.
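The auth rule above is easy to get wrong because most HTTP clients default to a `Bearer` prefix. A minimal sketch of building a request with the bare key follows. The base URL is a placeholder (this document does not state it), the function name is hypothetical, and the request is only constructed here, not sent.

```python
import json
import urllib.request
from typing import Optional

# Placeholder: the base URL is not given in this document. Substitute the
# value from the Quo API documentation.
BASE_URL = "https://api.example.invalid"


def build_quo_request(api_key: str, method: str, path: str,
                      payload: Optional[dict] = None) -> urllib.request.Request:
    """Build a request that carries the bare API key in the Authorization header."""
    data = json.dumps(payload).encode() if payload is not None else None
    req = urllib.request.Request(BASE_URL + path, data=data, method=method)
    req.add_header("Authorization", api_key)  # bare key, no "Bearer " prefix
    if data is not None:
        req.add_header("Content-Type", "application/json")
    return req
```

For example, `build_quo_request(key, "GET", "/v1/conversations")` prepares a list call. The JSON body fields for `POST /v1/messages` are not specified in this document, so they are left to the caller.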

15. Five Operational Disciplines

These are the non-negotiable disciplines that separate a production agent from a demo.

1. Signal Detection on Every Message (MANDATORY)

Every inbound message triggers entity detection and original-thinking capture. No exceptions. If the user thinks out loud and the brain doesn't capture it, the system is broken. This is the #1 operational discipline.

2. Brain-First Lookup Before External APIs (MANDATORY)

gbrain search before Brave Search. gbrain get before Crustdata. The brain almost always has something. External APIs fill gaps. An agent that reaches for the web before checking its own brain is wasting money and giving worse answers.

3. Source Attribution on Every Brain Write (MANDATORY)

Every fact written to a brain page gets an inline [Source: ...] citation. No exceptions. Compiled truth isn't exempt because it's a synthesis. Tweet URLs are mandatory -- a tweet reference without a URL is a broken citation. The goal: six months from now, every fact traces back to where it came from.
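A pre-write check can enforce this rule mechanically. The sketch below is a heuristic only: it assumes citations appear inline as `[Source: ...]` and treats any citation body that mentions the word "tweet" as a tweet reference. The function name and the problem strings are hypothetical.

```python
import re

SOURCE_RE = re.compile(r"\[Source:\s*([^\]]+)\]")
URL_RE = re.compile(r"https?://\S+")


def citation_problems(line: str) -> list[str]:
    """Return a list of problems with one fact line; an empty list means it passes."""
    citations = SOURCE_RE.findall(line)
    if not citations:
        return ["missing [Source: ...] citation"]
    problems = []
    for body in citations:
        # Heuristic: a citation that mentions a tweet must carry its URL.
        if "tweet" in body.lower() and not URL_RE.search(body):
            problems.append("tweet citation without a URL")
    return problems
```

Running a check like this on every line a skill is about to write catches broken citations before they reach the brain repo, rather than six months later.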

4. Iron Law Back-Linking (MANDATORY)

When a person or company with a brain page is mentioned in ANY brain file, that file MUST be linked FROM the person or company's brain page. This is the connective tissue of the brain. An unlinked mention is a broken brain. Every skill that writes to the brain enforces this.
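The back-linking rule can be audited with a simple graph check. This sketch assumes mentions are already expressed as links and that the link graph has been extracted into a dictionary. The function name, the input shape, and the example slugs are hypothetical.

```python
def missing_backlinks(links: dict[str, set[str]],
                      entity_slugs: set[str]) -> list[tuple[str, str]]:
    """Return (entity, page) pairs where `page` links to `entity` but `entity` does not link back.

    `links` maps each page slug to the set of slugs it links to.
    `entity_slugs` is the set of person and company pages.
    """
    missing = []
    for page, targets in links.items():
        for entity in targets & entity_slugs:
            if page not in links.get(entity, set()):
                missing.append((entity, page))
    return sorted(missing)
```

An empty result means every mention of a person or company is linked from that entity's page. Any returned pair names the entity page that needs a new link and the file it should point to.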

5. Durable Skills Over One-Off Work

If you do something twice, make it a skill + cron. The first time is discovery. The second time is a system failure.

The development cycle:

Concept a process -- describe what needs to happen
Run it manually for 3-10 items -- see if the output is good
Revise -- iterate on quality, fix gaps, adjust the bar
Codify into a skill -- create a new skill or add to an existing one
Add to cron -- automate it so it runs without being asked

The skills should collectively cover every type of ingest event without overlap. If two skills both try to create the same brain page, that's a coverage violation. Each entity type and signal source should have exactly one owner skill.
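The "exactly one owner" rule can be checked from a registry of which skill creates which page types. This is a sketch only: the registry shape, the function name, and the skill names in the usage example are hypothetical.

```python
from collections import defaultdict


def coverage_report(skills: dict[str, set[str]], required: set[str]) -> dict:
    """Find page types with more than one owner skill, and required types with none.

    `skills` maps a skill name to the page types it creates.
    """
    owners = defaultdict(set)
    for skill, page_types in skills.items():
        for page_type in page_types:
            owners[page_type].add(skill)
    overlaps = {t: sorted(s) for t, s in owners.items() if len(s) > 1}
    unowned = sorted(required - set(owners))
    return {"overlaps": overlaps, "unowned": unowned}
```

For instance, if a hypothetical `meeting-ingest` skill and a `call-ingest` skill both create `meetings` pages, the report lists `meetings` under `overlaps`, which is the coverage violation described above.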

Appendix: GBrain CLI Quick Reference

Commands referenced in this document:

| Command | Purpose |
| --- | --- |
| `gbrain search "term"` | Keyword search across all brain pages |
| `gbrain query "question"` | Hybrid search (vector + keyword + RRF) |
| `gbrain get <slug>` | Read a specific brain page by slug |
| `gbrain sync` | Sync local markdown repo to gbrain index |
| `gbrain import <path>` | Import files into the brain |
| `gbrain embed --stale` | Re-embed pages with stale or missing embeddings |
| `gbrain stats` | Show brain statistics (page count, last sync, etc.) |
| `gbrain doctor` | Diagnose brain health issues |
| `gbrain doctor --json` | Machine-readable health check (for cron jobs) |
| `gbrain init` | Initialize a new brain database |

Run gbrain --help for the full command reference.
