Garry Tan e9f3c9c24d docs: live sync setup + verification runbook + API key loading (#24)

* docs: add SKILLPACK Section 18 — Live Sync (MUST ADD)

Contract-first guide for keeping the vector DB in sync with the brain
repo. Documents the pooler prerequisite (Session mode required for
transactions), sync + embed primitives, four example approaches (cron,
--watch, webhook, git hook), isSyncable exclusions, silent skip warning,
and OpenClaw/Hermes cron registration examples.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add GBRAIN_VERIFY.md installation verification runbook

Six-check runbook: schema (doctor), skillpack loaded, auto-update,
live sync (coverage check + embed check + end-to-end push-and-search
test), embedding coverage, brain-first lookup protocol. Emphasizes
"sync ran" != "sync worked" — the real test is searching for corrected
text after a push.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add setup Phases H (Live Sync) and I (Verification)

Phase H: MUST ADD live sync setup — pooler prerequisite check, automatic
sync configuration (agent picks approach), sync+embed chaining, coverage
verification. Phase I: run GBRAIN_VERIFY.md end-to-end before declaring
setup complete.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add install steps 8-9 (live sync + verification)

Step 8: set up automatic sync with SKILLPACK Section 18 reference.
Step 9: run GBRAIN_VERIFY.md runbook. Add GBRAIN_VERIFY.md to docs
section.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add API key loading instructions to CLAUDE.md

Source ~/.zshrc before running Tier 2 tests so OPENAI_API_KEY and
ANTHROPIC_API_KEY are available. Without this, embedding and skills
tests skip silently.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: bump version to v0.5.0

Live sync, verification runbook, API key loading instructions.
Version markers updated in SKILLPACK and RECOMMENDED_SCHEMA.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add anti-hand-roll rule to skill routing in CLAUDE.md

Explicitly prohibit manually running git commit + push + gh pr create
when /ship is available. /ship handles VERSION, CHANGELOG,
document-release, reviews, and coverage audit. Hand-rolling skips
all of these. Added "commit and ship" / "push and ship" variants
to the ship routing rule.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: changelog voice rule + rewrite 0.5.0 changelog to sell the upgrade

CLAUDE.md: add changelog voice guidance — lead with benefits, not
implementation details. Make users want to upgrade.

CHANGELOG: rewrite 0.5.0 entries from dry feature descriptions to
capability-focused bullets ("your brain never falls behind" not
"SKILLPACK Section 18 added").

SKILLPACK Section 17: update the auto-update message template to
instruct agents to sell the upgrade, not just summarize the diff.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add v0.5.0 migration directive for live sync + verification

Agents upgrading from v0.4.x will automatically: check their pooler
connection string, set up automatic sync, and run the verification
runbook. Without this migration file, upgrading agents would learn
about live sync (by re-reading Section 18) but wouldn't set it up.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: sharpen migration file guidance in CLAUDE.md

Replace vague "requires agent action" with concrete trigger list:
new setup steps existing users don't have, MUST ADD skillpack sections,
schema changes, deprecated commands, new verification steps, new crons.
Add the key test: "if an existing user upgrades and does nothing else,
will their brain work worse?"

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: make Section 17 upgrade flow work for direct user requests

Section 17 was structured as a cron-initiated flow only. An agent
handling "upgrade gbrain" might just run the command and stop, missing
the post-upgrade steps where the value is (re-read skills, run
migrations, schema sync). Added explicit entry point for direct
upgrade requests. Made Steps 2-4 more concrete about where to find
files and why migrations can't be skipped.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: add E2E sync tests — git-to-DB pipeline (11 tests)

Tests the full sync lifecycle against real Postgres+pgvector:
- First sync imports all pages from a git repo
- Second sync with no changes returns up_to_date
- Incremental sync picks up new files (add → commit → sync → verify)
- Incremental sync picks up modifications — THE CRITICAL TEST:
  corrected text appears in DB and keyword search after sync
- Incremental sync handles deletes
- Non-syncable files are excluded (README, .raw/, ops/)
- Sync state (last_commit, last_run) persisted to config
- Sync logged to ingest_log
- --full reimports everything
- --dry-run shows changes without applying

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: strengthen CLAUDE.md to always run ALL test tiers

Replace passive "source zshrc" suggestion with ALWAYS directive.
Explicitly state that "run all tests" means ALL tiers including
Tier 2 with API keys. Do not skip Tier 2 just because keys need
loading.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: Tier 2 E2E tests — correct openclaw CLI invocation

The tests used `openclaw -p` which doesn't exist. The correct command
is `openclaw agent --local --agent <id> --message <prompt>`. Also fixed
JSON output parsing (structured JSON goes to stderr, not stdout — use
non-JSON mode instead). Fixed ingest test to assert on agent response
text rather than test DB state (the agent writes to its own configured
DB, not the ephemeral test DB).

82 tests pass, 0 fail, 0 skip across all 5 E2E files.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-10 07:23:59 -10:00

GBrain Skillpack: Reference Architecture for AI Agents

1. What This Document Is

This is a reference architecture for how a production AI agent uses gbrain as its knowledge backbone. It is based on patterns from a real deployment with 14,700+ brain files, 40+ skills, and 20+ cron jobs running continuously.

This is not a tutorial. It is a pattern book. Here's what works, here's why.

The memex vision, realized. Vannevar Bush described the memex in "As We May Think" (1945): a device where an individual stores all their books, records, and communications, mechanized so it may be consulted with exceeding speed and flexibility. GBrain is that device. A personal knowledge store with full provenance trails, hybrid search across everything you've ever read, said, or thought, and an AI agent that maintains it while you sleep. Bush imagined trails of association linking items together. GBrain has typed links, backlinks, and graph traversal. Bush imagined a scholar building a trail through a body of knowledge. GBrain's compiled truth pattern IS that trail, continuously rewritten as new evidence arrives.

The key difference from Bush's vision: the memex was passive (you had to build the trails). GBrain is active. The agent detects entities, enriches pages, creates cross-references, and maintains compiled truth automatically. You don't build the memex. The memex builds itself.

2. The Brain-Agent Loop

The core read-write cycle that makes the brain compound over time:

Signal arrives (message, meeting, email, tweet, link)
  |
  v
Detect entities (people, companies, concepts, original thinking)
  |
  v
READ: Check brain first (gbrain search, gbrain get)
  |
  v
Respond with context (brain makes every answer better)
  |
  v
WRITE: Update brain pages (new info compiled into existing pages)
  |
  v
Sync: gbrain indexes changes (available for next query)
  |
  v
(next signal arrives — agent is now smarter than last time)

Every signal that flows through your agent should touch the brain in both directions. Read before responding. Write after learning something new. The next time that person, company, or concept comes up, the agent already has context.

The brain almost always has something. External APIs fill gaps — they don't start from scratch.

An agent without this loop answers from stale context every time. An agent with it gets smarter with every conversation, every meeting, every email. Six months in, the compounding is visible: the agent knows more about your world than you can hold in working memory, because it never forgets and it never stops indexing.

The loop has two invariants:

Every READ improves the response. If you answered a question about a person without checking their brain page first, you gave a worse answer than you could have.
Every WRITE improves future reads. If a meeting transcript mentioned new information about a company and you didn't update the company page, you created a gap that will bite you later.
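
A minimal Python sketch of the loop above, assuming the five stages are passed in as plain callables. The names `handle_signal`, `LoopResult`, and the callable parameters are hypothetical, not part of gbrain. The point is the ordering: read before responding, write after, sync last.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class LoopResult:
    response: str
    pages_written: list = field(default_factory=list)


def handle_signal(
    signal: str,
    detect: Callable[[str], list],
    read_brain: Callable[[str], str],
    respond: Callable[[str, dict], str],
    write_brain: Callable[[str, str], str],
    sync: Callable[[], None],
) -> LoopResult:
    """Run one pass of the brain-agent loop for a single inbound signal."""
    entities = detect(signal)
    # READ: load brain context for every detected entity before answering.
    context = {entity: read_brain(entity) for entity in entities}
    response = respond(signal, context)
    # WRITE: record what was learned so the next read is better.
    written = [write_brain(entity, signal) for entity in entities]
    # Sync only when something changed, so the next query sees it.
    if written:
        sync()
    return LoopResult(response, written)
```

In a real agent, `read_brain` would wrap `gbrain search` / `gbrain get` and `sync` would wrap `gbrain sync`.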

3. Entity Detection -- Run It on Every Message

Spawn a lightweight sub-agent on EVERY inbound message. Use a cheap, fast model (e.g. Claude Sonnet). The sub-agent captures two things, in priority order:

Original Thinking (PRIMARY)

The user's ideas, observations, theses, frameworks, and philosophical riffs. This is the highest-value signal in the entire system. Original thinking becomes essays, talks, leadership philosophy, strategic insight. It compounds.

Capture the user's EXACT phrasing. The language IS the insight. "The ambition-to-lifespan ratio has never been more broken" captures something that "tension between ambition and mortality" doesn't. Don't clean it up. Don't paraphrase.

Route by authorship:

Signal	Destination
User generated the idea	`brain/originals/{slug}.md`
World concept they reference	`brain/concepts/{slug}.md`
Product or business idea	`brain/ideas/{slug}.md`
Personal reflection or pattern	`brain/personal/reflections/`

What counts: Original observations about how the world works, novel connections between disparate things, frameworks and mental models, pattern recognition moments, hot takes with reasoning, metaphors that reveal new angles.

What doesn't count: Routine operational messages ("ok", "do it"), pure questions without embedded observations, echoing back something the agent said.
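
The routing table above can be sketched as a small lookup, assuming the sub-agent has already classified the signal. The signal labels, `slugify`, and `route` are hypothetical names. The table gives only a folder for personal reflections, so the `{slug}.md` file name under it is an assumption.

```python
import re

# Destination templates taken from the routing table.
DESTINATIONS = {
    "user_idea": "brain/originals/{slug}.md",
    "world_concept": "brain/concepts/{slug}.md",
    "product_idea": "brain/ideas/{slug}.md",
    # Assumption: reflections are stored one file per slug in this folder.
    "personal_reflection": "brain/personal/reflections/{slug}.md",
}


def slugify(phrase: str) -> str:
    """Lowercase the user's own phrasing and join the words with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", phrase.lower()).strip("-")


def route(signal_kind: str, phrase: str) -> str:
    """Return the brain path for a classified signal."""
    return DESTINATIONS[signal_kind].format(slug=slugify(phrase))
```

The slug comes from the user's own words, so the vivid phrasing survives into the file name.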

Entity Mentions (SECONDARY)

People, companies, media references. For each:

Check if brain page exists (gbrain search "name")
If no page and entity is notable: create it, enrich it
If thin page: spawn background enrichment
If rich page: load it silently for context
For new facts about existing entities: append to timeline
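
A hedged sketch of the page-state decision above. The document does not define "thin", so the 500-character threshold is purely an assumption, as are the names `Action` and `decide`.

```python
from enum import Enum
from typing import Optional

# Assumption: a page shorter than this counts as "thin".
THIN_PAGE_CHARS = 500


class Action(Enum):
    SKIP = "skip"
    CREATE_AND_ENRICH = "create_and_enrich"
    ENRICH_IN_BACKGROUND = "enrich_in_background"
    LOAD_FOR_CONTEXT = "load_for_context"


def decide(page_text: Optional[str], notable: bool) -> Action:
    """Map the state of an entity's brain page to the next step."""
    if page_text is None:
        # No page yet: only notable entities get one.
        return Action.CREATE_AND_ENRICH if notable else Action.SKIP
    if len(page_text) < THIN_PAGE_CHARS:
        return Action.ENRICH_IN_BACKGROUND
    return Action.LOAD_FOR_CONTEXT
```

Appending new facts to the timeline is a separate step that applies whenever a page exists.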

Rules

Fire on EVERY message. No exceptions unless purely operational.
Don't block the conversation. Spawn and forget.
User's direct statements are the HIGHEST-authority signal.
Iron law: back-link FROM entity pages TO the source that mentions them. An unlinked mention is a broken brain. Format: append to their Timeline or See Also: - **YYYY-MM-DD** | Referenced in [page title](path/to/page.md) -- context
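
The back-link line format from the iron law can be produced with a small helper. `backlink_entry` is a hypothetical name; the output format is the one stated in the rule.

```python
from datetime import date


def backlink_entry(on: date, title: str, path: str, context: str) -> str:
    """Format a Timeline / See Also line that points back at the source page."""
    return f"- **{on.isoformat()}** | Referenced in [{title}]({path}) -- {context}"
```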

3b. The Originals Folder -- Capturing Intellectual Capital

Most knowledge systems capture WHAT YOU FOUND (articles, meetings, people). The originals folder captures WHAT YOU THINK.

When the user generates an original observation, thesis, framework, or hot take, the agent captures it verbatim in brain/originals/. This is the highest-value content in the entire brain.

The authorship test:

User generated the idea? -> originals/{slug}.md
User's unique synthesis of someone else's ideas? -> originals/ (the synthesis is original)
World concept someone else coined? -> concepts/{slug}.md
Product or business idea? -> ideas/{slug}.md

Naming: Use the user's own language for the slug. meatsuit-maintenance-tax not biological-needs-maintenance-overhead. The vividness IS the concept.

Cross-link originals to: people who shaped the thinking, companies where it played out, meetings where it was discussed, books and media that influenced it, other originals it connects to (ideas form clusters). An original without cross-links is a dead original. The connections ARE the intelligence.

Over time, the originals folder becomes a searchable archive of the user's intellectual output, organized by topic. This is the memex at its most powerful: not just remembering what you read, but remembering what you THOUGHT about what you read.

4. The Brain-First Lookup Protocol

Before calling ANY external API to research a person, company, or topic:

1. gbrain search "name"     -- keyword match, fast, works day one
2. gbrain query "what do we know about name"  -- hybrid search, needs embeddings
3. gbrain get <slug>         -- direct page read when you know the slug
4. External APIs as FALLBACK only

The brain almost always has something. Even a timeline entry from three months ago is better context than starting from scratch with a web search.

For each entity found: load compiled truth + recent timeline entries before responding. The compiled truth section gives you the state of play in 30 seconds. The timeline gives you what changed recently.

This is mandatory. An agent that calls Brave Search before checking the brain is wasting money and giving worse answers. The brain has context that no external API can provide: relationship history, the user's own assessments, meeting transcripts, cross-references to other entities.
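
A minimal sketch of the lookup order, assuming a non-empty stdout from the `gbrain` CLI means a hit (the real output format is not specified here). `run_cli` and `brain_first_lookup` are hypothetical names, and stopping at the first hit is a simplification. The command runner is injectable so the flow can be tested without gbrain installed.

```python
import subprocess
from typing import Callable, Optional


def run_cli(argv: list) -> str:
    """Run a command and return its stdout, or an empty string on failure."""
    try:
        done = subprocess.run(argv, capture_output=True, text=True, check=False)
    except FileNotFoundError:
        return ""
    return done.stdout.strip() if done.returncode == 0 else ""


def brain_first_lookup(
    name: str,
    slug: Optional[str] = None,
    external: Optional[Callable[[str], str]] = None,
    run: Callable[[list], str] = run_cli,
) -> tuple:
    """Try the brain in protocol order; call the external API only as fallback."""
    steps = [
        ("search", ["gbrain", "search", name]),
        ("query", ["gbrain", "query", f"what do we know about {name}"]),
    ]
    if slug:
        steps.append(("get", ["gbrain", "get", slug]))
    for label, argv in steps:
        output = run(argv)
        if output:
            return label, output
    if external is not None:
        return "external", external(name)
    return "none", ""
```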

5. Enrichment Pipeline -- 7-Step Protocol

When to enrich: entity mentioned in conversation, meeting attendees, email threads, social interactions, new contacts, whenever the brain page is thin or missing.

Tier System

Scale API spend to importance. Don't blow 20 API calls on a passing mention.

Tier	Who	Effort	API Calls
Tier 1	Key people and companies: inner circle, business partners, portfolio companies	Full pipeline, ALL data sources	10-15
Tier 2	Notable: people you interact with occasionally	Web search + social + brain cross-reference	3-5
Tier 3	Minor mentions: everyone else worth tracking	Brain cross-reference + social lookup if handle known	1-2
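
One way to enforce the tier budgets is a small counter that refuses calls past the limit. This is a sketch: `CallBudget` is a hypothetical name, and using the upper end of each range in the table as a hard cap is an assumption.

```python
# Upper bound of the "API Calls" column for each tier.
MAX_API_CALLS = {1: 15, 2: 5, 3: 2}


class CallBudget:
    """Track external API calls spent while enriching one entity."""

    def __init__(self, tier: int) -> None:
        self.limit = MAX_API_CALLS[tier]
        self.used = 0

    def spend(self, calls: int = 1) -> bool:
        """Reserve calls if the budget allows; return False when it does not."""
        if self.used + calls > self.limit:
            return False
        self.used += calls
        return True
```

Brain cross-references are free, so they would not pass through `spend`.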

The 7 Steps

Step 1: Identify entities. From the incoming signal (meeting, email, tweet), extract people names, company names, and what they're associated with.

Step 2: Check brain state. Does a page exist? If yes, read it -- you're on the UPDATE path. If no, you're on the CREATE path. Check gbrain search first.

Step 3: Extract signal from source. Don't just pull facts -- pull texture:

What opinion did they express? -> What They Believe
What are they building or shipping? -> What They're Building
Did they express emotion? -> What Makes Them Tick
Who did they engage with? -> Network / Relationship
Is this a recurring topic? -> Hobby Horses
What did they commit to? -> Open Threads
What was their energy? -> Trajectory

Step 4: Data source lookups. For CREATE or thin pages, run structured lookups. The order matters -- stop when you have enough signal for the entity's tier.

Priority order:

Brain cross-reference (free, highest-value -- always first): gbrain search "name" to find mentions across meetings, other people pages, company pages.
Web search via Brave or Exa: background, press, talks, funding.
X/Twitter deep lookup (enterprise API or scraping): beliefs, building, hobby horses, network, trajectory.
People enrichment: Crustdata (LinkedIn data), Happenstance (web research, career arcs).
Company/funding data: Captain API (Pitchbook-grade funding, valuation, team data).
Meeting history: Circleback (transcript search, attendee lookup).
Contact data: Google Contacts, CRM sync.

X/Twitter lookup is underrated. When you have someone's handle, their tweets are the single best source for: what they believe (opinions expressed unprompted), what they're building (shipping announcements), hobby horses (recurring topics), who they engage with (reply patterns, amplification), and trajectory (posting frequency, tone shifts). This goes into the brain page's "What They Believe" and "Hobby Horses" sections.

Step 5: Save raw data. Every API response gets saved to a .raw/ sidecar alongside the brain page. JSON with sources.{provider}.fetched_at and .data. Overwrite on re-enrichment, don't append.

Step 6: Write to brain. CREATE path: use the page template from your brain's schema, fill compiled truth from all data gathered, add first timeline entry. UPDATE path: append timeline, update compiled truth if the new signal materially changes the picture. Flag contradictions -- don't silently resolve them.

Step 7: Cross-reference. After updating a person page: update their company page, update deal pages, add back-links. After updating a company page: update founder pages, update deal pages. Every entity page should link to every other entity page that references it.
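
Step 5 can be sketched as follows. The `sources.{provider}.fetched_at` / `.data` shape and the overwrite rule come from the step itself. The file layout (`.raw/<page-stem>.json` next to the page) and the name `save_raw` are assumptions.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def save_raw(page_path: Path, provider: str, data: dict) -> Path:
    """Store one provider's raw API response in the page's .raw/ sidecar."""
    # Assumption: one JSON sidecar per page, named after the page file.
    sidecar = page_path.parent / ".raw" / f"{page_path.stem}.json"
    sidecar.parent.mkdir(parents=True, exist_ok=True)
    doc = json.loads(sidecar.read_text()) if sidecar.exists() else {"sources": {}}
    # Overwrite this provider's entry on re-enrichment; never append.
    doc["sources"][provider] = {
        "fetched_at": datetime.now(timezone.utc).isoformat(),
        "data": data,
    }
    sidecar.write_text(json.dumps(doc, indent=2))
    return sidecar
```

Other providers' entries are left untouched, so each one keeps its own `fetched_at`.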

People Pages

A person page isn't a LinkedIn profile. It's a living portrait:

Executive Summary -- How do you know them? Why do they matter?
State -- Role, company, relationship, key context
What They Believe -- Ideology, worldview, first principles
What They're Building -- Current projects, features shipped
What Motivates Them -- Ambition drivers, career arc
Assessment -- Strengths, weaknesses, net read
Trajectory -- Ascending, plateauing, pivoting, declining?
Relationship -- History, temperature, open threads
Contact -- Email, phone, X handle, LinkedIn
Timeline -- Reverse chronological, append-only, never rewritten

Facts are table stakes. Texture is the value.

6. Compiled Truth + Timeline Pattern

Every brain page has a horizontal rule separating two zones:

Above the line: Compiled truth. A synthesis that represents the current state of play. If you read only the compiled truth section, you know everything you need. This gets rewritten when new evidence changes the picture.

Below the line: Timeline. Append-only log of every signal, in reverse chronological order. Never rewritten, never deleted. This is the evidence base. Every compiled truth claim should be traceable to one or more timeline entries.

## Executive Summary
One paragraph. How do you know them, why do they matter.

## State
Role, company, key numbers, relationship status.

## What They Believe
Their worldview, first principles, hills they die on.

## What They're Building
Current projects, recent launches, what's next.

## Assessment
Strengths, weaknesses, your net read on this person.

## Trajectory
Where they're headed. Ascending, plateauing, pivoting?

## Relationship
History with you. Last interaction. Open threads.

## Contact
Email, phone, X handle, LinkedIn.

---

## Timeline

- **2026-04-07** | Met at Team Sync. Discussed new product launch. Seemed energized
  about the pivot. [Source: Meeting notes "Team Sync" #12345, 2026-04-07 2:00 PM PT]
- **2026-04-03** | Mentioned in email thread re Q2 planning. Taking lead on ops.
  [Source: email from Sarah Chen re Q2 board deck, 2026-04-03 10:30 AM PT]
- **2026-03-15** | First meeting. Intro from Pedro. Strong technical background.
  [Source: User, direct message, 2026-03-15 3:00 PM PT]

The compiled truth pattern works because the agent rewrites the synthesis as new evidence arrives, but the evidence itself is immutable. Six months of timeline entries compress into a one-paragraph assessment that's always current.
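
A minimal sketch of how an agent might handle the two zones, assuming the first line that is exactly `---` is the separator (so no YAML front matter) and that the timeline zone starts with a `## Timeline` heading as in the template. `split_page` and `add_timeline_entry` are hypothetical helpers.

```python
def split_page(markdown: str) -> tuple:
    """Split a brain page into (compiled truth, timeline) at the first rule."""
    lines = markdown.splitlines()
    for index, line in enumerate(lines):
        if line.strip() == "---":
            above = "\n".join(lines[:index]).strip()
            below = "\n".join(lines[index + 1:]).strip()
            return above, below
    return markdown.strip(), ""


def add_timeline_entry(markdown: str, entry: str) -> str:
    """Put a new entry at the top of the timeline; leave existing entries as-is."""
    truth, timeline = split_page(markdown)
    existing = timeline.removeprefix("## Timeline").strip()
    new_timeline = "## Timeline\n\n" + entry + ("\n" + existing if existing else "")
    return f"{truth}\n\n---\n\n{new_timeline}\n"
```

Rewriting compiled truth would replace only the first element of the split, so the evidence below the line is never touched.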

GBrain integration: gbrain query weights compiled truth higher than timeline entries in search results, so the freshest synthesis surfaces first.

7. Source Attribution -- Every Fact Needs a Citation

This is not a suggestion. It is a hard requirement. Every fact written to a brain page needs an inline [Source: ...] citation with full provenance.

Format

[Source: {who}, {channel/context}, {date} {time} {tz}]

Examples by Category

Direct statements: [Source: User, direct message, 2026-04-07 12:33 PM PT]

Meetings: [Source: Meeting notes "Team Sync" #12345, 2026-04-03 12:11 PM PT]

API enrichment: [Source: Crustdata LinkedIn enrichment, 2026-04-07 12:35 PM PT]

Social media (MUST include full URL): [Source: X/@pedroh96 tweet, product launch, 2026-04-07](https://x.com/pedroh96/status/...)

Email: [Source: email from Sarah Chen re Q2 board deck, 2026-04-05 2:30 PM PT]

Workspace: [Source: Slack #engineering, Keith re deploy schedule, 2026-04-06 11:45 AM PT]

Web research: [Source: Happenstance research, 2026-04-07 12:35 PM PT]

Published media: [Source: [Wall Street Journal, 2026-04-05](https://wsj.com/...)]

Funding data: [Source: Captain API funding data, 2026-04-07 2:00 PM PT]
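
The base citation format can be generated rather than typed by hand. A sketch, with `cite` as a hypothetical helper; the timezone label is passed in as plain text (e.g. "PT") because the examples use short labels rather than IANA zone names.

```python
from datetime import datetime


def cite(who: str, context: str, when: datetime, tz_label: str) -> str:
    """Build an inline [Source: ...] citation with date, 12-hour time, and zone."""
    hour = when.hour % 12 or 12
    meridiem = "AM" if when.hour < 12 else "PM"
    stamp = f"{when:%Y-%m-%d} {hour}:{when:%M} {meridiem} {tz_label}"
    return f"[Source: {who}, {context}, {stamp}]"
```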

Why This Matters

Six months from now, someone reads a brain page and can trace every single fact back to where it came from. "User said it" isn't enough. WHERE, ABOUT WHAT, WHEN.

The Rule Most Agents Miss

Source attribution applies to compiled truth AND timeline. The compiled truth section (above the line) isn't exempt from citations just because it's a synthesis. Every claim needs a source. "Pedro co-founded Brex" needs [Source: ...] just as much as a timeline entry does.

Tweet URLs Are Mandatory

A tweet reference without a URL is a broken citation. Format: [Source: X/@handle tweet, topic, date](https://x.com/handle/status/ID). This is a real production problem: hundreds of brain pages end up with broken tweet citations when the URL is omitted.
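
A hedged sketch of a checker for this rule. It assumes citations follow the format above and that a valid URL ends in a numeric status ID. The pattern and the name `broken_tweet_citations` are illustrative.

```python
import re

# Matches a tweet citation and, optionally, the status URL that must follow it.
TWEET_CITATION = re.compile(
    r"\[Source: X/@(\w+) tweet[^\]]*\](\(https://x\.com/\1/status/\d+\))?"
)


def broken_tweet_citations(text: str) -> list:
    """Return every tweet citation that is missing its status URL."""
    return [
        match.group(0)
        for match in TWEET_CITATION.finditer(text)
        if match.group(2) is None
    ]
```

A check like this could run as part of nightly maintenance to list the pages that need repair.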

Source Hierarchy for Conflicting Information

User's direct statements (highest authority)
Primary sources (meetings, emails, direct conversations)
Enrichment APIs (Crustdata, Happenstance, Captain)
Web search results
Social media posts

When sources conflict, note the contradiction in compiled truth with both citations. Don't silently pick one.
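
A sketch of the conflict rule, assuming each claim is tagged with one of five source kinds. The kind labels, `Claim`, `reconcile`, and the `CONTRADICTION:` marker are all hypothetical. Claims are ordered by authority, and a disagreement is surfaced with every citation instead of being resolved silently.

```python
from dataclasses import dataclass

# Highest authority first, following the hierarchy above.
AUTHORITY = ["user", "primary", "enrichment_api", "web_search", "social_media"]


@dataclass(frozen=True)
class Claim:
    value: str
    kind: str
    citation: str


def reconcile(claims: list) -> str:
    """Return one cited line, flagging a contradiction when the values differ."""
    ordered = sorted(claims, key=lambda claim: AUTHORITY.index(claim.kind))
    if len({claim.value for claim in ordered}) == 1:
        top = ordered[0]
        return f"{top.value} {top.citation}"
    parts = "; ".join(f"{claim.value} {claim.citation}" for claim in ordered)
    return f"CONTRADICTION: {parts}"
```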

8. Meeting Ingestion

Meetings are the richest signal source in the entire system. Every meeting produces entity updates across multiple brain pages.

Transcript Source

Circleback or any meeting recording service with API access. The key requirement: speaker diarization (who said what) and webhook support.

Schedule

Run as a cron job. A reasonable cadence: 3x/day (10 AM, 4 PM, 9 PM) to catch new meetings throughout the day.

After Every Meeting

1. Pull the full transcript. Always pull the complete transcript, not just the AI summary. AI-generated summaries hallucinate framing -- they editorialize what was "agreed" or "decided" when no such agreement happened. The transcript is ground truth.

2. Create the meeting page. Write to brain/meetings/YYYY-MM-DD-short-description.md with the agent's OWN analysis:

Above the bar: Agent's summary reframed through the user's priorities. What matters to YOU, not a generic meeting recap. Flag surprises, contradictions, and implications. Name real decisions and commitments (not performative ones). Call out what was left unsaid or unresolved.
Below the bar: Full diarized transcript (append-only evidence base). Format: **Speaker** (HH:MM:SS): Words.

3. Propagate to entity pages (MANDATORY). This is the step most agents skip. A meeting is NOT fully ingested until every entity page has been updated:

People pages: Update State, append Timeline with meeting-specific insights
Company pages: Update State with new metrics, status, decisions, feedback
Deal pages: Update State with new terms, status, deadlines

4. Extract action items into your task list.

5. Commit and sync. gbrain sync so the new pages are immediately searchable.
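
The meeting page path and transcript line format from step 2 can be sketched as two helpers. `meeting_path` and `transcript_line` are hypothetical names, and the timestamp is assumed to be an offset in seconds from the start of the recording.

```python
import re
from datetime import date


def meeting_path(day: date, description: str) -> str:
    """Build brain/meetings/YYYY-MM-DD-short-description.md."""
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    return f"brain/meetings/{day.isoformat()}-{slug}.md"


def transcript_line(speaker: str, offset_seconds: int, words: str) -> str:
    """Format one diarized line as **Speaker** (HH:MM:SS): Words."""
    hours, remainder = divmod(offset_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"**{speaker}** ({hours:02d}:{minutes:02d}:{seconds:02d}): {words}"
```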

Back-Linking

Meeting page links to attendee pages. Attendee pages link back to meeting with context. The graph is bidirectional. Always.

9. Reference Cron Schedule

A production agent runs 20+ recurring jobs that interact with the brain. Here is a generalized reference schedule:

Frequency	Job	Brain Interaction
Every 30 min	Email monitoring	`gbrain search` sender, update people pages
Every 30 min	Message monitoring	`gbrain search` sender, entity detection
Hourly	Social media ingestion	Create/update media pages, entity extraction
Hourly	Workspace scanning	Update project pages, flag mentions
3x/day	Meeting processing	Full ingestion pipeline (Section 8)
Daily AM	Morning briefing	`gbrain search` for calendar attendees, deal status, active threads
Daily AM	Task preparation	Pull today's tasks, cross-reference brain for context
Weekly	Brain maintenance	`gbrain doctor`, `gbrain embed --stale`, orphan detection
Weekly	Contacts sync	New contacts -> brain pages, enrichment pipeline

Quiet Hours Gate

Before sending any notification, check if it's quiet hours (e.g., 11 PM - 8 AM, configure to your schedule). During quiet hours:

Hold non-urgent notifications
Merge held messages into the next morning briefing
Only break quiet hours for genuinely urgent items (time-sensitive, would cause real damage if delayed)
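
A minimal sketch of the gate, using the example window of 11 PM to 8 AM as defaults. `is_quiet` and `route_notification` are hypothetical names. The window is allowed to wrap past midnight, which is the usual case.

```python
from datetime import time


def is_quiet(now: time, start: time = time(23, 0), end: time = time(8, 0)) -> bool:
    """True when `now` falls in the quiet window, including windows past midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end


def route_notification(now: time, urgent: bool) -> str:
    """Send right away, or hold for the next morning briefing."""
    return "send" if urgent or not is_quiet(now) else "hold"
```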

Travel-Aware Timezone Handling

The agent reads your calendar for flights, hotels, and out-of-office blocks to infer your current location and timezone. All times shown in YOUR local timezone -- "4:42 AM HT" in Hawaii, not "14:42 UTC" or "7:42 AM PT".

When you travel, cron jobs that would fire during your home-timezone waking hours but hit your sleeping hours at the destination get held and folded into the next morning briefing. No config change needed. The agent figures it out from your calendar.

This means: fly to Tokyo, land, sleep... wake up to a morning briefing that includes everything your crons would have sent you at 11 AM Pacific daylight time (which was 3 AM in Tokyo). Zero missed signals, zero 3 AM pings.
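
A sketch of the timezone check, assuming the agent has already inferred the user's current UTC offset from the calendar. Fixed offsets keep the example dependency-free; a real agent would more likely use IANA zone names. `to_local` and `should_hold` are hypothetical names, and the quiet window is assumed to wrap past midnight.

```python
from datetime import datetime, timedelta, timezone


def to_local(utc_moment: datetime, utc_offset_hours: float) -> datetime:
    """Convert an aware UTC datetime to the user's current local time."""
    return utc_moment.astimezone(timezone(timedelta(hours=utc_offset_hours)))


def should_hold(
    utc_moment: datetime,
    utc_offset_hours: float,
    quiet_start_hour: int = 23,
    quiet_end_hour: int = 8,
) -> bool:
    """True when a cron firing now would land in the user's local quiet hours."""
    local_hour = to_local(utc_moment, utc_offset_hours).hour
    return local_hour >= quiet_start_hour or local_hour < quiet_end_hour
```

With an offset of -10 (Hawaii), a job firing at 14:42 UTC shows as 04:42 local and is held.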

Every cron job includes: quiet hours check, location/timezone awareness, sub-agent spawning for heavy work.

The Dream Cycle

The most important cron job runs while you sleep. When quiet hours start, the dream cycle kicks off:

Entity sweep. Scan today's conversations for every person, company, concept, or idea you mentioned. Check each against the brain.
Enrich the thin spots. Create pages for entities that don't exist yet. Update pages that are thin. Write your direct assessments verbatim... the exact words you used, not a cleaned-up paraphrase.
Fix broken citations. Tweet links without URLs, missing source attributions, timeline entries without dates. The citation hygiene problems that accumulate during fast daytime conversations get cleaned up in the background.
Consolidate memory. Signals that matter get promoted to MEMORY.md. Patterns the agent noticed across multiple conversations get surfaced. Ephemeral context becomes durable knowledge.

The dream cycle is why the brain compounds. During the day, you're moving fast and the agent captures signal opportunistically. At night, the agent goes back through everything methodically. You wake up and the brain is smarter than when you went to sleep.

This is the difference between an agent that forgets and one that remembers. The dream cycle is not optional for a production brain. Without it, signal leaks out of every conversation. With it, nothing is lost.

OpenClaw

Ships with DREAMS.md as a default skill. Three phases (light, deep, REM) run automatically during quiet hours. Entity sweeps, memory promotion, and a narrative dream diary are built in.

Hermes Agent

Hermes has all the pieces but doesn't bundle a dream cycle by default. Set one up with the cron scheduler:

/cron add "0 2 * * *" "Dream cycle: search today's sessions for
  entities I mentioned. For each person, company, or idea: check
  if a brain page exists (gbrain search), create or update it if
  thin. Fix any broken citations. Then consolidate: read MEMORY.md,
  promote important signals, remove stale entries."
  --name "nightly-dream-cycle"

The scheduled job spawns an isolated agent session that can call session_search() to scan recent conversations (FTS5 over SQLite), gbrain search / gbrain get to check the brain, and memory(action="replace") to consolidate. Enable Honcho (plugins/memory/honcho) for automatic dialectic reasoning on top.

Key Hermes files for reference: tools/memory_tool.py (MEMORY.md/USER.md ops), tools/session_search_tool.py (past conversation retrieval), cron/scheduler.py (gateway tick loop).

10. Content and Media Ingestion

When the user shares a link, article, video, tweet, or document:

Fetch and process -- transcribe video, OCR PDF, parse article
Save to brain at sources/ or media/
Cross-reference with existing brain pages (who's mentioned? what companies? what concepts?)
Surface interesting angles given the user's interests and worldview
Commit and sync -- gbrain sync

YouTube Ingestion

YouTube is a first-class workflow, not an afterthought.

Transcribe with speaker diarization via Diarize.io -- identifies WHO said WHAT, not just a wall of text
Create brain page at media/youtube/{slug}.md with: title, channel, date, link, diarized transcript, agent's analysis
Agent's analysis is the value add: what matters, key quotes attributed to specific speakers, connections to existing brain pages, implications
Cross-reference: every person mentioned gets a back-link from their brain page to this video
Over time, media/ becomes a searchable archive of every video, podcast, talk, and interview the user has consumed, with the agent's commentary layered on top

Tweets

Don't just save a tweet. Reconstruct the full context:

Thread reconstruction (quoted tweets, replies in context)
Linked articles fetched and summarized
Engagement data (what resonated, what didn't)
Entity extraction from the full bundle

PDFs and Documents

OCR when needed, extract structured data, save to sources/. For books and long-form: chapter summaries, key quotes with page numbers, cross-references to brain pages for people and concepts mentioned.

11. Executive Assistant Pattern

The brain transforms basic EA work into contextual EA work. The difference between "you have a meeting at 3" and "you have a meeting at 3 with Pedro -- last time you discussed the Series B timeline, he was concerned about burn rate, here's the latest from his company page."

Email Triage

Before triaging any email: gbrain search for sender context. Load their brain page. Now you know: who they are, your relationship history, what they care about, and what open threads exist. The triage is informed, not mechanical.

Meeting Prep

Before any meeting: gbrain search all attendees. Load relationship pages. Surface: last interaction date, open threads, recent timeline entries, relevant deal status. The user walks into every meeting already briefed.

Scheduling

When scheduling: check brain for meeting frequency, last interaction, relationship temperature. "You haven't met with Diana in 6 weeks and she has an open thread about the Q3 launch" is a useful scheduling nudge.

After Clearing Inbox

Update relevant brain pages with new information from email threads. Every email is a signal. The brain should reflect what was learned.

12. The Three Search Modes

GBrain provides three distinct search modes. Use the right one for the job.

Mode	Command	Needs Embeddings	Speed	Best For
Keyword	`gbrain search "name"`	No	Fastest	Known names, exact matches, day-one queries
Hybrid	`gbrain query "what do we know"`	Yes	Fast	Semantic questions, fuzzy matching, conceptual search
Direct	`gbrain get <slug>`	No	Instant	Loading a specific page when you know the slug

Progression

Day one: Use keyword search (gbrain search). It works without embeddings and catches exact name matches.
After first embed: Use hybrid search (gbrain query) for semantic questions. "Who do I know at fintech companies?" works here.
When you know the slug: Use direct get (gbrain get pedro-franceschi). Instant, no search overhead.
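
The progression can be sketched as a chooser that returns the command to run. `choose_command` and its parameters are hypothetical, and treating "semantic question" as a boolean flag is a simplification. The three commands are the ones in the table.

```python
from typing import Optional


def choose_command(
    text: str,
    slug: Optional[str] = None,
    semantic: bool = False,
    embeddings_ready: bool = False,
) -> list:
    """Pick the gbrain invocation that fits what the agent knows right now."""
    if slug:
        # Known slug: read the page directly, no search overhead.
        return ["gbrain", "get", slug]
    if semantic and embeddings_ready:
        # Hybrid search needs embeddings, so it is only available after embed.
        return ["gbrain", "query", text]
    # Keyword search works from day one.
    return ["gbrain", "search", text]
```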

Token Budget Awareness

Search returns chunks, not full pages. Read the search excerpts first. Only use gbrain get <slug> for the full page when the chunk confirms relevance.

"Tell me about Pedro" -> gbrain get pedro-franceschi (you want the full page)
"Did anyone mention the Series A?" -> search results are enough (scan chunks)
"What's the latest on Brex?" -> search first, then get the company page if needed

Precedence for Conflicting Information

1. User's direct statements (always win)
2. Compiled truth sections (synthesized from evidence)
3. Timeline entries (raw signal, reverse chronological)
4. External sources (web search, APIs)
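A minimal sketch of applying the precedence list above when two sources disagree. The claim shape (`source`, `date`, `value`) and the source labels are hypothetical, and the "newest wins among equals" tie-break is an assumption, not something GBrain defines.

```javascript
// Lower index wins. The order mirrors the precedence list above.
const PRECEDENCE = ["user_statement", "compiled_truth", "timeline", "external"];

// Pick the claim from the most authoritative source.
// Among claims of equal rank, prefer the newest (assumed tie-break).
function resolveConflict(claims) {
  const rank = (claim) => {
    const i = PRECEDENCE.indexOf(claim.source);
    return i === -1 ? PRECEDENCE.length : i; // unknown sources rank last
  };
  const newestFirst = (a, b) => (a.date < b.date ? 1 : a.date > b.date ? -1 : 0);
  return [...claims].sort((a, b) => rank(a) - rank(b) || newestFirst(a, b))[0];
}

const winner = resolveConflict([
  { source: "external", date: "2026-04-08", value: "CFO" },
  { source: "timeline", date: "2026-04-01", value: "COO" },
  { source: "compiled_truth", date: "2026-03-15", value: "CEO" },
]);
console.log(winner.value); // "CEO"
```

The external claim is the most recent, but compiled truth still wins because rank is compared before date.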

13. How GBrain Complements Agent Memory

A production agent has three layers of memory. All three should be consulted. They serve different purposes.

Layer	What It Stores	Examples	How to Access
GBrain	World knowledge -- facts about people, companies, deals, meetings, concepts, ideas	Pedro's company page, meeting transcripts, original theses, deal terms	`gbrain search`, `gbrain query`, `gbrain get`
Agent memory	Operational state -- preferences, architecture decisions, tool config, session continuity	"User prefers concise formatting", "Deploy to staging before prod"	OpenClaw: `memory_search`. Hermes: `memory(action="read")` + `session_search()`
Session context	Current conversation window -- what was just said, what the user just asked	The last 20 messages, current task, immediate context	Already in context

When to Use Each

"Who is Pedro?" -> GBrain (world knowledge about a person)
"How do I format messages for this user?" -> Agent memory (operational preference)
"What did I just ask you to do?" -> Session context (immediate)
"What happened in Tuesday's meeting?" -> GBrain (meeting transcript + entity pages)
"Which API key goes where?" -> Agent memory (tool configuration)

GBrain is for facts about the world. Agent memory is for how the agent operates. Session context is for right now. Don't store operational preferences in GBrain. Don't store people dossiers in agent memory.

14. Integration Setup Guides

Three integrations that make the agent real. Without these, the brain is a static database. With them, it's alive.

14a. Credential Gateway (ClawVisor / Hermes Gateway)

The EA workflow needs Gmail, Calendar, Contacts, and messaging access. The agent should never hold API keys directly. Use a credential gateway that enforces policies and injects credentials at request time.

OpenClaw: ClawVisor. ClawVisor is a credential vaulting and authorization gateway with task-scoped authorization.

Services: Gmail (list, read, send, draft), Google Calendar (CRUD), Google Drive (list, search, read), Google Contacts (list, search), Apple iMessage (list, read, search, send), GitHub, Slack.

Task-scoped authorization: Every request must include a task_id from an approved standing task. Tasks declare: purpose (verbose, 2-3 sentences), authorized actions with expected use patterns, auto-execute flag, lifetime (standing vs ephemeral).

Why this matters for GBrain: The EA workflow needs Gmail (sender lookup before triage), Calendar (meeting prep, attendee pages), Contacts (enrichment trigger), and iMessage (direct instructions). ClawVisor gives the agent access without giving it raw credentials.

Setup:

Create agent in ClawVisor dashboard, copy agent token
Set CLAWVISOR_URL and CLAWVISOR_AGENT_TOKEN in env
Activate services (Google, iMessage, etc.) in the dashboard
Create standing tasks with expansive scopes (narrow purposes cause false blocks)
Store standing task IDs in agent memory for reuse

Critical scoping rule: Be expansive in task purposes. "Full executive assistant email management including inbox triage, searching by any criteria, reading emails, tracking threads" works. "Email triage" gets rejected. The intent verification model uses the purpose to judge whether each request is consistent -- if your purpose is narrow, legitimate requests fail verification.

Hermes Agent: Built-in gateway. Hermes has multi-platform messaging (Telegram, Discord, Slack, WhatsApp, Signal, Email) and tool access built into its gateway. Use config.yaml to configure API credentials. The gateway daemon manages connections and routes webhooks to agent sessions. For Google services, configure OAuth credentials in the gateway config. Hermes's scheduled automations can run the same EA workflows (email triage, calendar prep, contact enrichment) through the gateway's tool system.

14b. Circleback -- Meeting Ingestion via Webhooks

Circleback records meetings, generates transcripts with speaker diarization, and fires webhooks on completion.

Webhook setup:

In Circleback dashboard -> Automations -> add webhook
URL: {your_agent_gateway}/hooks/circleback-meetings
Circleback provides a signing secret for HMAC-SHA256 signature verification
Store the signing secret in your webhook transform for verification

Webhook payload: Meeting JSON with id, name, attendees, notes, action items, full transcript, calendar event context.

Signature verification: Header X-Circleback-Signature contains sha256=<hex>. Verify with HMAC-SHA256(body, signing_secret). Reject unverified webhooks.
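A hedged sketch of that check in Node.js, assuming the header format described above (sha256= followed by a hex digest) and that the transform has access to the raw, unparsed request body. The function name is hypothetical. The block uses CommonJS `require`; in an .mjs file, switch to `import crypto from "node:crypto"`.

```javascript
const crypto = require("node:crypto");

// Returns true only when the header matches HMAC-SHA256(rawBody, signingSecret).
function verifyCirclebackSignature(rawBody, header, signingSecret) {
  if (typeof header !== "string" || !header.startsWith("sha256=")) return false;
  const expected = crypto
    .createHmac("sha256", signingSecret)
    .update(rawBody)
    .digest("hex");
  const got = Buffer.from(header.slice("sha256=".length), "utf8");
  const want = Buffer.from(expected, "utf8");
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return got.length === want.length && crypto.timingSafeEqual(got, want);
}
```

Verify against the raw bytes, not a re-serialized JSON object. Re-serializing can change whitespace or key order and break the digest.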

OAuth for API access: Circleback uses dynamic client registration (OAuth 2.0). Access tokens expire in ~24h, auto-refresh via refresh token. Store credentials in agent memory.

Flow: Webhook fires -> transform validates signature + normalizes -> agent wakes -> pulls full transcript via API -> creates brain meeting page -> propagates to entity pages -> commits to brain repo -> gbrain sync.

14c. Quo (OpenPhone) -- SMS and Call Integration

Quo (formerly OpenPhone) provides business phone numbers with SMS, calls, voicemail, and AI transcripts.

Webhook setup:

In Quo dashboard -> Integrations -> Webhooks
Register webhooks for: message.received, call.completed, call.summary.completed, call.transcript.completed
Point all to: {your_agent_gateway}/hooks/quo-events
Store registered webhook IDs in agent memory

How inbound texts work:

Webhook fires with sender phone, message text, conversation context
Agent looks up sender in brain by phone number
Surfaces to user's messaging platform with sender identity + brain context
Drafts reply for approval (never auto-replies without explicit permission)

How inbound calls work:

call.completed fires -> if duration > 30s, fetch transcript + AI summary via API
Ingest to brain (meeting-style page at meetings/)
Update relevant people and company pages

API auth: Bare API key in Authorization header (no Bearer prefix).

Key endpoints: POST /v1/messages (send SMS), GET /v1/messages (list), GET /v1/call-transcripts/{id}, GET /v1/conversations.
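Two small helpers that capture the rules above: the 30-second threshold for call ingestion and the bare-key Authorization header. This is a sketch only. The event shape (`type`, `duration` in seconds) is a hypothetical simplification of the webhook payload, so check the real field names before relying on it.

```javascript
// Ingest only completed calls longer than 30 seconds.
// `type` and `duration` (seconds) are assumed field names.
function shouldIngestCall(event) {
  return event.type === "call.completed" && event.duration > 30;
}

// Quo expects the bare API key, with no "Bearer " prefix.
function quoHeaders(apiKey) {
  return { Authorization: apiKey };
}
```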

15. Five Operational Disciplines

These are the non-negotiable disciplines that separate a production agent from a demo.

1. Signal Detection on Every Message (MANDATORY)

Every inbound message triggers entity detection and original-thinking capture. No exceptions. If the user thinks out loud and the brain doesn't capture it, the system is broken. This is the #1 operational discipline.

2. Brain-First Lookup Before External APIs (MANDATORY)

gbrain search before Brave Search. gbrain get before Crustdata. The brain almost always has something. External APIs fill gaps. An agent that reaches for the web before checking its own brain is wasting money and giving worse answers.

3. Source Attribution on Every Brain Write (MANDATORY)

Every fact written to a brain page gets an inline [Source: ...] citation. No exceptions. Compiled truth isn't exempt because it's a synthesis. Tweet URLs are mandatory -- a tweet reference without a URL is a broken citation. The goal: six months from now, every fact traces back to where it came from.

4. Iron Law Back-Linking (MANDATORY)

When a person or company with a brain page is mentioned in ANY brain file, that file MUST be linked FROM the person or company's brain page. This is the connective tissue of the brain. An unlinked mention is a broken brain. Every skill that writes to the brain enforces this.

5. Durable Skills Over One-Off Work

If you do something twice, make it a skill + cron. The first time is discovery. The second time is a system failure.

The development cycle:

Concept a process -- describe what needs to happen
Run it manually for 3-10 items -- see if the output is good
Revise -- iterate on quality, fix gaps, adjust the bar
Codify into a skill -- create a new skill or add to an existing one
Add to cron -- automate it so it runs without being asked

The skills should collectively cover every type of ingest event without overlap. If two skills both try to create the same brain page, that's a coverage violation. Each entity type and signal source should have exactly one owner skill.

Appendix: GBrain CLI Quick Reference

Commands referenced in this document:

Command	Purpose
`gbrain search "term"`	Keyword search across all brain pages
`gbrain query "question"`	Hybrid search (vector + keyword + RRF)
`gbrain get <slug>`	Read a specific brain page by slug
`gbrain sync`	Sync local markdown repo to gbrain index
`gbrain import <path>`	Import files into the brain
`gbrain embed --stale`	Re-embed pages with stale or missing embeddings
`gbrain stats`	Show brain statistics (page count, last sync, etc.)
`gbrain doctor`	Diagnose brain health issues
`gbrain doctor --json`	Machine-readable health check (for cron jobs)
`gbrain init`	Initialize a new brain database

Run gbrain --help for the full command reference.

16. Deterministic Collectors -- Code for Data, LLMs for Judgment

When your agent keeps failing at a mechanical task despite repeated prompt fixes, stop fighting the LLM. Move the mechanical work to code.

The Pattern That Broke

We built an email triage system. The agent swept Gmail, classified emails by urgency, and posted a digest to the user. One rule: every email item must include a clickable [Open in Gmail] link so the user can act on it with one tap.

We put the rule in the skill file. We put it in MEMORY.md. We put it in the cron prompt. We wrote "NO EXCEPTIONS" in all caps. We wrote "ZERO TOLERANCE" after the fourth failure. The agent still dropped links -- on carry-forward reminders, on FYI items, on "still awaiting" sections. The user asked five times. Each time we added stronger language to the prompt.

The failure mode is probabilistic. The LLM understands the rule. It follows it for the first 10 items. Then it gets sloppy on item 11, especially on items that are re-surfaced from state rather than freshly pulled from the API. No amount of prompt engineering fixes a 90%-reliable formatting task, because 90% reliability over 20 items per sweep means you fail visibly about twice per day. That's enough to destroy trust.
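A back-of-the-envelope check of the 90%-over-20-items figure, assuming each item fails independently (a simplification, since real failures cluster on re-surfaced items):

```javascript
const perItemReliability = 0.9;
const itemsPerSweep = 20;

// Expected number of dropped links in one sweep.
const expectedMisses = itemsPerSweep * (1 - perItemReliability);
// Probability that a sweep contains at least one dropped link.
const pAtLeastOneMiss = 1 - Math.pow(perItemReliability, itemsPerSweep);

console.log(expectedMisses.toFixed(1));          // "2.0"
console.log((pAtLeastOneMiss * 100).toFixed(1)); // "87.8"
```

Under that assumption a 20-item sweep drops about two links on average, and roughly seven sweeps in eight contain at least one visible miss.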

The Fix: Separate Deterministic from Analytical

┌─────────────────────────────┐     ┌──────────────────────────────┐
│  Deterministic Collector    │────▶│       LLM Agent              │
│  (Node.js / Python script)  │     │                              │
│                             │     │  • Read the pre-formatted    │
│  • Pull data from API       │     │    digest                    │
│  • Store structured JSON    │     │  • Classify items            │
│  • Generate links/URLs      │     │  • Add commentary            │
│  • Detect patterns (regex)  │     │  • Run brain enrichment      │
│  • Track state (seen/new)   │     │  • Draft replies             │
│  • Output markdown digest   │     │  • Surface to user           │
│                             │     │                              │
│  CODE — deterministic,      │     │  AI — judgment, context,     │
│  never forgets              │     │  creativity                  │
└─────────────────────────────┘     └──────────────────────────────┘

The collector handles everything mechanical:

Pulling emails from Gmail (via credential gateway)
Generating [Open in Gmail](URL) from message IDs -- by code, not by LLM
Detecting signature requests (DocuSign/Dropbox Sign regex patterns)
Tracking which messages are new vs. already seen (state file)
Storing structured JSON with full metadata
Generating a pre-formatted markdown digest with every link already embedded

The LLM reads the pre-formatted digest and does what LLMs are good at:

Classifying urgency (requires understanding relationships, deadlines, context)
Writing commentary ("this is the $110M acquisition thread, 7 days dropped")
Running brain enrichment on notable entities (gbrain search + page updates)
Drafting replies
Deciding what to surface vs. filter

The links are in the source data. The LLM can't forget them because it doesn't generate them.
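The new-versus-seen tracking from the collector list can be as small as a set of known message IDs. A sketch, assuming a hypothetical `knownIds` array in the state file (the actual state.json layout may differ):

```javascript
// Split a fresh API pull into new and already-seen messages and
// return the updated state. `knownIds` is a hypothetical state field.
function trackSeen(state, messages) {
  const known = new Set(state.knownIds);
  const fresh = messages.filter((m) => !known.has(m.id));
  const seen = messages.filter((m) => known.has(m.id));
  for (const m of fresh) known.add(m.id);
  return { fresh, seen, state: { ...state, knownIds: [...known] } };
}
```

Because the function returns a new state object instead of mutating its input, the collector can write state.json only after the digest has been generated successfully.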

Implementation

The email collector follows the same architecture as the X/Twitter collector (a deterministic data pipeline for social media monitoring):

scripts/email-collector/
├── email-collector.mjs     # No LLM calls, no external deps
├── data/
│   ├── state.json          # Last pull timestamp, known IDs, pending signatures
│   ├── messages/           # Structured JSON per day
│   │   └── 2026-04-09.json
│   └── digests/            # Pre-formatted markdown
│       └── 2026-04-09.md

Every stored message includes:

{
  "id": "19d74109a811b9e7",
  "account": "work",
  "authuser": "user@example.com",
  "from": "Alex Smith",
  "subject": "Re: Next Steps",
  "snippet": "Hey, wanted to follow up on...",
  "timestamp": "2026-04-09T08:56:09Z",
  "is_unread": true,
  "is_noise": false,
  "is_signature": false,
  "gmail_link": "https://mail.google.com/mail/u/?authuser=user@example.com#inbox/19d74109a811b9e7",
  "gmail_markdown": "[Open in Gmail](https://mail.google.com/mail/u/?authuser=user@example.com#inbox/19d74109a811b9e7)"
}

The gmail_link and gmail_markdown fields are computed from id + authuser at collection time. Three lines of code. Never wrong.
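Roughly what those three lines look like. The function name is hypothetical; the URL shape is copied from the stored-message example above.

```javascript
// Build both link fields from the message id and the account address.
function gmailLinks(id, authuser) {
  const gmail_link = `https://mail.google.com/mail/u/?authuser=${authuser}#inbox/${id}`;
  const gmail_markdown = `[Open in Gmail](${gmail_link})`;
  return { gmail_link, gmail_markdown };
}
```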

Cron Integration

The email monitoring cron runs the collector first, then invokes the LLM:

1. node email-collector.mjs collect     → deterministic API pull, store JSON
2. node email-collector.mjs digest      → generate markdown with links baked in
3. node email-collector.mjs signatures  → list pending e-signature items
4. LLM reads digest + signatures        → classifies, enriches, posts to user

The collector runs in under 10 seconds. The LLM analysis takes 30-60 seconds. Total: under 90 seconds for a full inbox sweep with brain enrichment.
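A sketch of the digest step, using the field names from the stored-message example. The line format is an assumption, not the collector's actual output. The point is that the link is copied from stored data, and a missing link fails loudly in code instead of silently in a prompt.

```javascript
// Render one digest line per message. The link comes from stored data,
// so a missing link is a collector bug and throws immediately.
function renderDigest(messages) {
  return messages
    .map((m) => {
      if (!m.gmail_markdown) throw new Error(`message ${m.id} has no link`);
      const flag = m.is_unread ? "NEW" : "seen";
      return `- [${flag}] ${m.from}: ${m.subject} ${m.gmail_markdown}`;
    })
    .join("\n");
}
```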

Where Else This Pattern Applies

The deterministic-collector pattern works for any recurring data pull where the LLM was previously responsible for both fetching AND formatting:

Signal Source	Collector Generates	LLM Adds
Email	Gmail links, sender metadata, signature detection	Urgency classification, enrichment, reply drafts
X/Twitter	Tweet links, engagement metrics, deletion detection	Sentiment analysis, narrative detection, content ideas
Calendar	Event links, attendee lists, conflict detection	Prep briefings, meeting context from brain
Slack	Channel links, thread links, mention detection	Priority classification, action item extraction
GitHub	PR/issue links, diff stats, CI status	Code review context, priority assessment

The principle: if a piece of output MUST be present and MUST be formatted correctly every time, generate it in code. If a piece of output requires judgment, context, or creativity, generate it with the LLM. Don't ask the LLM to do both in the same pass.

The Lesson

When an LLM keeps failing at a mechanical task despite repeated prompt fixes:

Stop adding more prompt language. You've already written "NO EXCEPTIONS" and "ZERO TOLERANCE." The LLM read it. The failure is probabilistic, not comprehension.

Identify what's mechanical vs. analytical. URL generation is mechanical. Classification is analytical. State tracking is mechanical. Commentary is analytical.
Move the mechanical work to a script. Node.js, Python, bash -- anything deterministic. No LLM calls, no external dependencies if possible.
Feed the LLM pre-formatted data. The script's output becomes the LLM's input. Links are already there. Metadata is already structured. The LLM just adds judgment.
Wire it into your cron. Script runs first (fast, cheap, reliable), then LLM reads the output (slower, expensive, creative).

This is not about the LLM being bad. It's about using the right tool for the right job. Code is 100% reliable at string concatenation. LLMs are 90% reliable at string concatenation but 10x better at understanding what an email means. Use both.

17. Upgrades and Auto-Update Notifications

GBrain ships updates frequently. There are two ways an upgrade happens:

User says "upgrade gbrain": Run gbrain check-update --json to see what's new, then run the Full Upgrade Flow below (Steps 1-6). Do NOT just run gbrain upgrade and stop. The post-upgrade steps (re-read skills, run migrations, schema sync) are where the value is. Without them, the agent has new code but old behavior.

Cron finds an update: The auto-update cron checks for new versions and messages the user. The user decides whether to upgrade. If yes, run the same Full Upgrade Flow (Steps 1-6).

The upgrade is always manual. Never install without the user's explicit permission.

The Check (cron-initiated)

Run gbrain check-update --json. If update_available is false, stay completely silent — do nothing. If true, message the user on their preferred channel.
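A sketch of the silent-unless-true rule. Only `update_available` comes from this document; `latest_version` and `current_version` are hypothetical field names, so check the real JSON output before using them.

```javascript
// Decide what the cron should do with `gbrain check-update --json` output.
// Returns null (stay silent) or the headline of the message to send.
function updateNotice(jsonText) {
  let info;
  try {
    info = JSON.parse(jsonText);
  } catch {
    return null; // unparseable output: stay silent rather than guess
  }
  if (!info || info.update_available !== true) return null;
  // latest_version / current_version are assumed field names.
  const latest = info.latest_version ? `GBrain ${info.latest_version}` : "A new GBrain version";
  const current = info.current_version ? ` (you're on ${info.current_version})` : "";
  return `${latest} is available${current}`;
}
```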

The Message

Sell the upgrade. The user should feel "hell yeah, I want that." Lead with what they can DO now that they couldn't before, not what files changed. Frame as capabilities and benefits, not implementation details. Make them excited that GBrain keeps getting better. 2-3 punchy bullets, no raw markdown, no file names.

GBrain v0.5.0 is available (you're on v0.4.0)

What's new:

Your brain never falls behind. Live sync keeps the vector DB current automatically, so edits show up in search within minutes, not "whenever someone remembers to run sync"

New verification runbook catches silent failures: the pooler bug that skips pages, missing embeddings, stale search results

New installs set up live sync automatically. No more manual setup step

Want me to upgrade? I'll update everything and refresh my playbook.

(Reply yes to upgrade, not now to skip, weekly to check less often, or stop to turn off update checks)

Handling Responses

User says	Action
yes / y / sure / ok / do it / upgrade / go ahead	Run the full upgrade flow (see below)
not now / later / skip / snooze	Acknowledge, check again next cycle
weekly	Store preference in agent memory, switch cron to weekly
daily	Store preference, switch cron back to daily
stop / unsubscribe / no more	Store preference, disable the cron. Tell user: "Update checks disabled. Say 'resume gbrain updates' or run `gbrain check-update` anytime."

Acceptable "yes": any clearly affirmative response. When in doubt, ask again. Never auto-upgrade. Always wait for explicit confirmation.
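A conservative sketch of the table above as code. It exact-matches the listed phrases and returns "ask_again" for everything else. The action names are hypothetical. An agent can accept other clearly affirmative replies on its own judgment; this only shows the safe default.

```javascript
const REPLIES = {
  upgrade: ["yes", "y", "sure", "ok", "do it", "upgrade", "go ahead"],
  snooze: ["not now", "later", "skip", "snooze"],
  weekly: ["weekly"],
  daily: ["daily"],
  stop: ["stop", "unsubscribe", "no more"],
};

// Exact-match the normalized reply; anything else means "ask again".
function classifyReply(text) {
  const reply = text.trim().toLowerCase().replace(/[.!]+$/, "");
  for (const [action, phrases] of Object.entries(REPLIES)) {
    if (phrases.includes(reply)) return action;
  }
  return "ask_again";
}
```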

The Full Upgrade Flow (after user says yes)

Step 1: Update the binary/package. Run gbrain upgrade. This updates the CLI and all shipped files (skills, docs, migrations).

Step 2: Re-read all updated skills. Find the gbrain package directory (bun pm ls 2>/dev/null | grep gbrain or check node_modules/gbrain/). Re-read every skill file in skills/*/SKILL.md to pick up new patterns and workflows. Updated skills = better agent behavior. The user gets this for free.

Step 3: Re-read the production reference docs. Read docs/GBRAIN_SKILLPACK.md and docs/GBRAIN_RECOMMENDED_SCHEMA.md fresh from the gbrain package directory. These contain the latest patterns, cron schedules, and integration guides. This is how the agent learns about new capabilities like live sync (Section 18).

Step 4: Check for version-specific migration directives. Look for skills/migrations/v[version].md files between the old and new version. If they exist, read and execute them in order. These are the post-upgrade actions that make the new version actually work for existing users (e.g., v0.5.0 migration sets up live sync and runs the verification runbook). Do NOT skip this step. Without migrations, the agent has new code but the user's environment hasn't changed.
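One way to select and order the migration files for Step 4. This sketch assumes file names of the form vX.Y.Z.md and reads "between the old and new version" as newer than the old version, up to and including the new one. Both are assumptions about the convention.

```javascript
// Parse "v0.5.0" or "0.5.0" into [major, minor, patch].
function parseVersion(v) {
  const m = /^v?(\d+)\.(\d+)\.(\d+)$/.exec(v);
  if (!m) throw new Error(`bad version: ${v}`);
  return m.slice(1).map(Number);
}

function compareVersions(a, b) {
  const [x, y] = [parseVersion(a), parseVersion(b)];
  for (let i = 0; i < 3; i++) if (x[i] !== y[i]) return x[i] - y[i];
  return 0;
}

// Migration files newer than `from` and no newer than `to`, oldest first.
function pendingMigrations(files, from, to) {
  return files
    .map((file) => ({ file, version: file.replace(/\.md$/, "") }))
    .filter(({ version }) => /^v\d+\.\d+\.\d+$/.test(version))
    .filter(({ version }) => compareVersions(version, from) > 0 && compareVersions(version, to) <= 0)
    .sort((a, b) => compareVersions(a.version, b.version))
    .map(({ file }) => file);
}
```

Upgrading from v0.4.0 to v0.5.0 with v0.3.0.md, v0.4.1.md, and v0.5.0.md present would run v0.4.1.md and then v0.5.0.md.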

Step 5: Schema sync — suggest new recommendations without undoing user choices. Read ~/.gbrain/update-state.json to see what the user previously adopted, declined, and customized. For each NEW recommendation in the updated schema:

If the user never saw it before, suggest it
If the user previously declined it, do NOT re-suggest
Never modify or remove the user's custom directories

Update ~/.gbrain/update-state.json with new choices and bump schema_version_applied.
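The suggest-once rule from Step 5, sketched as a filter. The `adopted` and `declined` arrays are a hypothetical shape for update-state.json; the real file may be structured differently.

```javascript
// Decide which recommendations to surface after an upgrade.
// Anything the user already adopted or declined is never re-suggested.
// `adopted` and `declined` are assumed field names.
function newSuggestions(recommended, state) {
  const seen = new Set([...(state.adopted ?? []), ...(state.declined ?? [])]);
  return recommended.filter((name) => !seen.has(name));
}
```

The function only reads the recommendation list and the stored choices, so it cannot touch the user's custom directories.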

Step 6: Report what changed. Tell the user what was upgraded and what actions were taken.

Frequency Preferences

Default: daily. Store in agent memory as gbrain_update_frequency: daily|weekly|off. Also persist in ~/.gbrain/update-state.json so it survives agent context resets.

Quiet Hours

Standard quiet hours gate (Section 9). Update notifications are never urgent. Hold for morning briefing.

Cron Registration

Name: gbrain-update-check
Default schedule: 0 9 * * * (daily 9 AM)
Weekly schedule: 0 9 * * 1 (Monday 9 AM)
Prompt: "Run gbrain check-update --json. If update_available is true, summarize the changelog and message me asking if I'd like to upgrade. If false, stay silent."
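A small sketch mapping the stored gbrain_update_frequency value to the cron expressions above. The function name is hypothetical.

```javascript
// Map the stored frequency preference to a cron expression.
function updateCheckSchedule(frequency) {
  switch (frequency) {
    case "daily":
      return "0 9 * * *"; // daily 9 AM
    case "weekly":
      return "0 9 * * 1"; // Monday 9 AM
    case "off":
      return null; // cron disabled
    default:
      throw new Error(`unknown frequency: ${frequency}`);
  }
}
```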

Standalone Skillpack Users (no gbrain CLI)

If you loaded this SKILLPACK directly (copied the file or read from GitHub) without installing gbrain, you can still stay current. Both this file and GBRAIN_RECOMMENDED_SCHEMA.md have version markers in their first lines:

<!-- skillpack-version: 0.5.0 -->
<!-- source: https://raw.githubusercontent.com/garrytan/gbrain/master/docs/GBRAIN_SKILLPACK.md -->

Self-update check: Periodically fetch the first line of the source URL and compare the version marker against your local copy:

curl -s https://raw.githubusercontent.com/garrytan/gbrain/master/docs/GBRAIN_SKILLPACK.md | head -1
# Returns: <!-- skillpack-version: X.Y.Z -->

If the remote version is newer (minor or major bump, same rule as the CLI), fetch the full file and replace your local copy. Do the same for GBRAIN_RECOMMENDED_SCHEMA.md.
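A sketch of the comparison, reading "minor or major bump" to mean that patch-only differences do not trigger a refresh. That reading is an assumption, and the function names are hypothetical.

```javascript
// Extract [major, minor, patch] from a marker line such as
// "<!-- skillpack-version: 0.5.0 -->". Returns null if no marker is found.
function markerVersion(line) {
  const m = /skillpack-version:\s*(\d+)\.(\d+)\.(\d+)/.exec(line);
  return m ? m.slice(1).map(Number) : null;
}

// True when the remote marker is a minor or major bump over the local one.
function shouldRefresh(remoteLine, localLine) {
  const remote = markerVersion(remoteLine);
  const local = markerVersion(localLine);
  if (!remote || !local) return false; // missing marker: do nothing
  const [rMaj, rMin] = remote;
  const [lMaj, lMin] = local;
  return rMaj > lMaj || (rMaj === lMaj && rMin > lMin);
}
```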

Cron for standalone users:

Name: gbrain-skillpack-update
Schedule: 0 9 * * 1 (weekly Monday 9 AM)
Prompt: "Fetch the first line of https://raw.githubusercontent.com/garrytan/gbrain/master/docs/GBRAIN_SKILLPACK.md and compare the skillpack-version against my local copy. If newer, fetch the full file, re-read it, and tell the user what changed. Do the same for GBRAIN_RECOMMENDED_SCHEMA.md. If both are current, stay silent."

This way standalone users get new patterns, workflows, and schema recommendations without installing the gbrain CLI. The agent fetches, diffs, and updates its own playbook.

18. Live Sync -- Keep the Index Current (MUST ADD)

The brain repo is the source of truth. The vector DB is a derived index. If sync doesn't run, gbrain returns stale answers. This is not theoretical: a hallucination was corrected in a brain page but the vector DB kept serving the old text because nobody ran gbrain sync. This section is mandatory during agent install.

The Contract

After setup, the vector DB must stay current with the brain repo automatically. No human should need to run gbrain sync manually. The agent chooses HOW based on its environment. The only hard requirement: verify it actually works (see docs/GBRAIN_VERIFY.md).

Prerequisite: Session Mode Pooler

Sync uses engine.transaction() on every import. If DATABASE_URL points to Supabase's Transaction mode pooler, sync will throw .begin() is not a function and silently skip most pages. This is the #1 cause of "sync ran but nothing happened."

Fix: use the Session mode pooler string (port 6543, Session mode) or the direct connection (port 5432, IPv6-only). Verify by running gbrain sync and checking that the page count in gbrain stats matches the syncable file count in the repo. If they diverge, your connection string is wrong.

The Primitives

Always chain sync + embed:

gbrain sync --repo /path/to/brain && gbrain embed --stale

gbrain sync --repo <path> -- one-shot incremental sync. Detects changes via git diff, imports only what changed. For small changesets (<= 100 files), embeddings are generated inline during import.
gbrain embed --stale -- backfill embeddings for any chunks that don't have them. This is a safety net: it catches chunks from large syncs (>100 files, where embeddings are deferred) or prior --no-embed runs.
gbrain sync --watch --repo <path> -- foreground polling loop, every 60s (configurable with --interval N). Embeds inline for small changesets. Exits after 5 consecutive failures, so run under a process manager (systemd Restart=on-failure, pm2) or pair with a cron fallback.

Example Approaches (pick what fits your environment)

Cron job (recommended for agents): run every 5-30 minutes.

gbrain sync --repo /data/brain && gbrain embed --stale

Works with any cron scheduler: OpenClaw, Hermes, system crontab.

Long-lived watcher: for near-instant sync (60s polling).

gbrain sync --watch --repo /data/brain

Run under a process manager that auto-restarts on exit. Pair with a cron fallback since --watch exits on repeated failures.

GitHub webhook: triggers sync on push events. Optional, for instant sync (<5s). Set up the webhook to call gbrain sync --repo /data/brain && gbrain embed --stale. If using webhooks, verify X-Hub-Signature-256 against a shared secret.

Git post-receive hook: if the brain repo is on the same machine.

What Gets Synced

Sync only indexes "syncable" markdown files. These are excluded by design:

Hidden paths (.git/, .raw/, etc.)
The ops/ directory
Meta files: README.md, index.md, schema.md, log.md

This is intentional. The brain is the .md files in the directory structure, not every file in the repo.
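An approximation of those exclusion rules, useful for counting syncable files during verification. This is a sketch, not gbrain's implementation. It assumes repo-relative paths with forward slashes, and that meta files are excluded by name in any directory.

```javascript
const META_FILES = new Set(["README.md", "index.md", "schema.md", "log.md"]);

// Approximate the exclusion rules above for a repo-relative path.
function looksSyncable(relPath) {
  const parts = relPath.split("/");
  const name = parts[parts.length - 1];
  if (!name.endsWith(".md")) return false;                // markdown only
  if (parts.some((p) => p.startsWith("."))) return false; // hidden paths
  if (parts[0] === "ops") return false;                   // ops/ directory
  if (META_FILES.has(name)) return false;                 // meta files (assumed: any directory)
  return true;
}
```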

Sync is Idempotent

Concurrent runs are safe. Two syncs on the same commit no-op because content hashes match. If both a cron and --watch fire simultaneously, no conflict.
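Why matching content hashes make a repeat sync a no-op, sketched with a Map standing in for the database. SHA-256 and the function names are assumptions; this document does not say which hash gbrain uses. The block uses CommonJS `require`.

```javascript
const crypto = require("node:crypto");

const sha256 = (text) => crypto.createHash("sha256").update(text).digest("hex");

// Import a page only when its content hash differs from the stored one.
// `index` is a plain Map standing in for the database.
function importIfChanged(index, slug, content) {
  const hash = sha256(content);
  if (index.get(slug) === hash) return "skipped";
  index.set(slug, hash);
  return "imported";
}
```

Running the same content through twice leaves the index unchanged, so it does not matter whether the cron or the watcher gets there first.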

Silent Skip Warning

If a file fails to import (malformed frontmatter, binary content in a .md file, connection error on a single page), sync logs a warning but still advances sync.last_commit. That file is silently skipped until it changes again or someone runs gbrain sync --full.

The verify step in docs/GBRAIN_VERIFY.md catches this by comparing page count in the DB against syncable file count in the repo. If they diverge, --full reconciles.

OpenClaw

Name: gbrain-auto-sync
Schedule: */15 * * * *
Prompt: "Run: gbrain sync --repo /data/brain && gbrain embed --stale
  Log the result. If sync fails with .begin() is not a function,
  the DATABASE_URL is using Transaction mode pooler. Switch to
  Session mode (see GBRAIN_SKILLPACK.md Section 18)."

Add to /data/.openclaw/cron/jobs.json.

Hermes

/cron add "*/15 * * * *" "Run gbrain sync --repo /data/brain &&
  gbrain embed --stale. Log the result." --name "gbrain-auto-sync"

What "Done" Looks Like

1. Edit a brain file, commit, push
2. Wait for the next sync cycle (cron interval or --watch poll)
3. gbrain search "<text from the edit>" returns the updated content
4. gbrain stats shows page count matching syncable file count
5. Embedded chunk count is close to total chunk count

If step 3 returns old content, sync failed silently. Run the full verification from docs/GBRAIN_VERIFY.md.
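The two numeric checks in that list can be scripted. A sketch, where the input field names are hypothetical and the 0.95 threshold for "close to" is an assumption to tune for your brain:

```javascript
// Evaluate the page-count and embedding-coverage checks.
// Returns a list of problems; an empty list means both checks pass.
function liveSyncProblems({ pages, syncableFiles, embeddedChunks, totalChunks }, minEmbeddedRatio = 0.95) {
  const problems = [];
  if (pages !== syncableFiles) {
    problems.push(`page count ${pages} != syncable files ${syncableFiles}`);
  }
  if (totalChunks > 0 && embeddedChunks / totalChunks < minEmbeddedRatio) {
    problems.push(`only ${embeddedChunks}/${totalChunks} chunks embedded`);
  }
  return problems;
}
```

This does not replace the search test in the list. Counts can match while search still serves old text, which is why the push-and-search check comes first.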
