gbrain/docs/integrations
Garry Tan 013b348c28 v0.12.3: Reliability wave — sync deadlock, search timeout scoping, wikilinks, orphans (#216)
* fix(sync): remove nested transaction that deadlocks > 10 file syncs

sync.ts wraps the add/modify loop in engine.transaction(), and each
importFromContent inside opens another one. PGLite's
_runExclusiveTransaction is a non-reentrant mutex — the second call
queues on the mutex the first is holding, and the process hangs forever
in ep_poll. Reproduced with a 15-file commit: unpatched hangs, patched
runs in 3.4s. Fix drops the outer wrap; per-file atomicity is correct
anyway (one file's failure should not roll back the others).
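
The hang can be reproduced in miniature. The sketch below uses a hypothetical promise-chain `Mutex` as a stand-in for PGLite's internal lock (the real implementation is not shown here), and a hypothetical `settlesWithin` helper to observe the hang without blocking forever.

```typescript
// Hypothetical non-reentrant mutex: each caller queues behind the previous one.
class Mutex {
  private tail: Promise<void> = Promise.resolve();

  runExclusive<T>(fn: () => Promise<T>): Promise<T> {
    const run = this.tail.then(fn);
    // The next caller waits until this one settles, whether it succeeds or fails.
    this.tail = run.then(() => undefined, () => undefined);
    return run;
  }
}

// Resolves true if p settles within ms milliseconds, false otherwise.
function settlesWithin(p: Promise<unknown>, ms: number): Promise<boolean> {
  return Promise.race([
    p.then(() => true, () => true),
    new Promise<boolean>((resolve) => setTimeout(() => resolve(false), ms)),
  ]);
}

// Nested: the outer holder awaits an inner call that is queued behind itself.
const nestedLock = new Mutex();
const nested = nestedLock.runExclusive(() =>
  nestedLock.runExclusive(async () => "imported a.md"),
);

// Flat: one exclusive section per file, nothing held while waiting.
const flatLock = new Mutex();
const flat = Promise.all(
  ["a.md", "b.md"].map((file) => flatLock.runExclusive(async () => `imported ${file}`)),
);
```

Under these assumptions `nested` never settles while `flat` completes, which mirrors why dropping the outer wrap removes the deadlock.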

(cherry picked from commit 4a1ac00105226695d16fb343b44e55a52f44b95b)

* test(sync): regression guard for #132 top-level engine.transaction wrap

Reads src/commands/sync.ts verbatim and asserts no uncommented
engine.transaction() call appears above the add/modify loop. Protects
against silent reintroduction of the nested-mutex deadlock that hung
> 10-file syncs forever in ep_poll.

* feat(utils): tryParseEmbedding() skip+warn sibling for availability path

parseEmbedding() throws on structural corruption — right call for ingest/
migrate paths where silent skips would be data loss. Wrong call for
search/rescore paths where one corrupt row in 10K would kill every
query that touches it.

tryParseEmbedding() wraps parseEmbedding in try/catch: returns null on
any shape that would throw, warns once per session so the bad row is
visible in logs. Use it anywhere we'd rather degrade ranking than blow
up the whole query.
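
A minimal sketch of the skip+warn wrapper described above. The `parseEmbedding` body here is a hypothetical stand-in (the real one in src/core/utils.ts may accept other shapes), and the module-level flag is only one possible way to warn once per session.

```typescript
// Hypothetical stand-in for the throwing parser.
function parseEmbedding(raw: unknown): Float32Array {
  if (raw instanceof Float32Array) return raw;
  const parsed: unknown = typeof raw === "string" ? JSON.parse(raw) : raw;
  if (
    !Array.isArray(parsed) ||
    parsed.some((v) => typeof v !== "number" || !Number.isFinite(v))
  ) {
    throw new Error("corrupt embedding: expected an array of finite numbers");
  }
  return Float32Array.from(parsed as number[]);
}

let warnedThisSession = false;

// Returns null instead of throwing, and logs only the first failure.
function tryParseEmbedding(raw: unknown): Float32Array | null {
  try {
    return parseEmbedding(raw);
  } catch (err) {
    if (!warnedThisSession) {
      warnedThisSession = true;
      console.warn(`skipping corrupt embedding: ${(err as Error).message}`);
    }
    return null;
  }
}

// Usage: a rescore loop drops bad rows instead of failing the whole query.
const rows = [
  { id: 1, embedding: "[0.1, 0.2]" },
  { id: 2, embedding: "not json" },
];
const usableIds = rows
  .filter((row) => tryParseEmbedding(row.embedding) !== null)
  .map((row) => row.id);
```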

Retrofit postgres-engine.getEmbeddingsByChunkIds (the #175 slice call
site) — the 5-line rescore loop was the direct motivator. Keep the
throwing parseEmbedding() for everything else (pglite-engine rowToChunk,
migrate-engine round-trips, ingest).

* postgres-engine: scope search statement_timeout to the transaction

searchKeyword and searchVector run on a pooled postgres.js client
(max: 10 by default). The original code bounded each search with

  await sql`SET statement_timeout = '8s'`
  try { await sql`<query>` }
  finally { await sql`SET statement_timeout = '0'` }

but every tagged template is an independent round-trip that picks an
arbitrary connection from the pool. The SET, the query, and the reset
could all land on DIFFERENT connections. In practice the GUC sticks
to whichever connection ran the SET and then gets returned to the
pool — the next unrelated caller on that connection inherits the 8s
timeout (clipping legitimate long queries) or the reset-to-0 (disabling
the guard for whoever expected it). A crash in the middle leaves the
state set permanently.

Wrap each search in sql.begin(async sql => …). postgres.js reserves
a single connection for the transaction body, so the SET LOCAL, the
query, and the implicit COMMIT all run on the same connection. SET
LOCAL scopes the GUC to the transaction — COMMIT or ROLLBACK restores
the previous value automatically, regardless of the code path out.
Error paths can no longer leak the GUC.
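
The shape of the fix can be sketched as follows. The `Query` and `Sql` types are minimal structural stand-ins for the postgres.js client (only the tagged-template call and `begin` are modeled), and the SELECT is an illustrative placeholder rather than the real ranked query.

```typescript
// Minimal structural model of a tagged-template SQL client.
type Query = (strings: TemplateStringsArray, ...values: unknown[]) => Promise<unknown[]>;
type Sql = Query & {
  begin<T>(body: (tx: Query) => Promise<T>): Promise<T>;
};

// Hypothetical search wrapper: the timeout and the query both go through the
// transaction handle, so they share the one connection begin() reserves.
async function searchKeyword(sql: Sql, term: string): Promise<unknown[]> {
  return sql.begin(async (tx) => {
    // SET LOCAL lasts only until COMMIT or ROLLBACK, so nothing leaks to the pool.
    await tx`SET LOCAL statement_timeout = '8s'`;
    return tx`SELECT slug FROM pages WHERE title ILIKE ${"%" + term + "%"}`;
  });
}
```

There is no reset statement and no `finally` block: leaving the transaction by any path restores the previous setting.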

No API change. Timeout value and semantics are identical (8s cap on
search queries, no effect on embed --all / bulk import which runs
outside these methods). Only one transaction per search — BEGIN +
COMMIT round-trips are negligible next to a ranked FTS or pgvector
query.

Also closes the earlier audit finding R4-F002 which reported the same
pattern on searchKeyword. This PR covers both searchKeyword and
searchVector so the pool-leak class is fully closed.

Tests (test/postgres-engine.test.ts, new file):
- No bare SET statement_timeout remains after stripping comments.
- searchKeyword and searchVector each wrap their query in sql.begin.
- Both use SET LOCAL.
- Neither explicitly clears the timeout with SET statement_timeout=0.

Source-level guardrails keep the fast unit suite DB-free. Live
Postgres coverage of the search path is in test/e2e/search-quality.test.ts,
which continues to exercise these methods end-to-end against
pgvector when DATABASE_URL is set.

(cherry picked from commit 6146c3b470dce7380da024a238eab9e6b2174296)

* feat(orphans): add gbrain orphans command for finding under-connected pages

Surfaces pages with zero inbound wikilinks. Essential for content
enrichment cycles in KBs with 1000+ pages. By default it filters out
auto-generated pages, raw sources, and pseudo-pages where having no
inbound links is expected; pass --include-pseudo to disable this filter.

Supports text (grouped by domain), --json, --count outputs.
Also exposed as find_orphans MCP operation.
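
The core detection amounts to a set difference over the link graph. This is a sketch with hypothetical names (`Link`, `findOrphans`, `groupByDomain`); the default filtering of auto-generated pages and pseudo-pages is omitted.

```typescript
interface Link {
  from: string;
  to: string;
}

// Returns slugs that no link points at, sorted for stable output.
function findOrphans(pages: string[], links: Link[]): string[] {
  const linked = new Set(links.map((link) => link.to));
  return pages.filter((slug) => !linked.has(slug)).sort();
}

// Groups slugs by their first path segment, e.g. "people/ada" -> "people".
function groupByDomain(slugs: string[]): Record<string, string[]> {
  const groups: Record<string, string[]> = {};
  for (const slug of slugs) {
    const domain = slug.split("/")[0] ?? slug;
    (groups[domain] ??= []).push(slug);
  }
  return groups;
}

const orphans = findOrphans(
  ["people/ada", "people/bob", "tech/rust"],
  [{ from: "people/ada", to: "tech/rust" }],
);
const grouped = groupByDomain(orphans);
```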

Tests cover basic detection, filtering, all output modes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
(cherry picked from commit f50954f8e03f85803c6133c85c530bd45e9aceaa)

* feat(extract): support Obsidian wikilinks + wiki-style domain slugs in canonical extractor

extractEntityRefs now recognizes both syntaxes equally:
  [Name](people/slug)      -- upstream original
  [[people/slug|Name]]     -- Obsidian wikilink (new)

Extends DIR_PATTERN to include domain-organized wiki slugs used by
Karpathy-style knowledge bases:
  - entities  (legacy prefix some brains keep during migration)
  - projects  (gbrain canonical, was missing from regex)
  - tech, finance, personal, openclaw (domain-organized wiki roots)

Before this change, a 2,100-page brain with wikilinks throughout extracted
zero auto-links on put_page because the regex only matched markdown-style
[name](path). After: 1,377 new typed edges on a single extract --source db
pass over the same corpus.

Matches the behavior of the extract.ts filesystem walker (which already
handled wikilinks as of the wiki-markdown-compat fix wave), so the db and
fs sources now produce the same link graph from the same content.

Both patterns share the DIR_PATTERN constant so adding a new entity dir
only requires updating one string.
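
One way to share a single directory list between the two syntaxes is sketched below. The `DIR_PATTERN` value is an assumption built from the prefixes named in this message (the real constant may list more), and the regexes are illustrative rather than the ones in src/core/link-extraction.ts.

```typescript
// Assumed directory list; adding a new entity dir means editing this one string.
const DIR_PATTERN = "people|entities|projects|tech|finance|personal|openclaw";

// [Name](people/slug)
const MARKDOWN_LINK = new RegExp(
  `\\[([^\\]]+)\\]\\(((?:${DIR_PATTERN})/[^)\\s]+)\\)`,
  "g",
);
// [[people/slug|Name]] or [[people/slug]]
const WIKILINK = new RegExp(
  `\\[\\[((?:${DIR_PATTERN})/[^\\]|]+)(?:\\|([^\\]]+))?\\]\\]`,
  "g",
);

interface EntityRef {
  slug: string;
  name: string;
}

function extractEntityRefs(text: string): EntityRef[] {
  const refs: EntityRef[] = [];
  for (const m of text.matchAll(MARKDOWN_LINK)) {
    refs.push({ name: m[1], slug: m[2] });
  }
  for (const m of text.matchAll(WIKILINK)) {
    // A wikilink without an alias falls back to the slug as its display name.
    refs.push({ slug: m[1], name: m[2] ?? m[1] });
  }
  return refs;
}

const refs = extractEntityRefs(
  "Met [Ada](people/ada), see [[projects/gbrain|GBrain]] and [[tech/rust]], not [x](https://example.com).",
);
```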

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
(cherry picked from commit 1cfb15679a684e94bec5a48c537a0a40a85f57ab)

* feat(doctor): jsonb_integrity + markdown_body_completeness detection

Add two v0.12.1-era reliability checks to `gbrain doctor`:

- `jsonb_integrity` scans the 4 known write sites from the v0.12.0
  double-encode bug (pages.frontmatter, raw_data.data,
  ingest_log.pages_updated, files.metadata) and reports rows where
  jsonb_typeof(col) = 'string'. The fix hint points at
  `gbrain repair-jsonb` (the standalone repair command shipped in
  v0.12.1).

- `markdown_body_completeness` flags pages whose compiled_truth is
  <30% of the raw source content length when raw has multiple H2/H3
  boundaries. Heuristic only; suggests `gbrain sync --force` or
  `gbrain import --force <slug>`.
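
The completeness heuristic reduces to a length ratio gated on heading count. In this sketch the function names are hypothetical, and "multiple" is assumed to mean at least two H2/H3 headings.

```typescript
// Counts lines that start an H2 or H3 section ("## " or "### ").
function countSectionBoundaries(markdown: string): number {
  return (markdown.match(/^#{2,3} /gm) ?? []).length;
}

// Flags a page whose compiled body is under 30% of a multi-section raw source.
function looksTruncated(compiledTruth: string, rawSource: string): boolean {
  if (rawSource.length === 0) return false;
  const hasMultipleSections = countSectionBoundaries(rawSource) >= 2;
  const ratio = compiledTruth.length / rawSource.length;
  return hasMultipleSections && ratio < 0.3;
}

const raw = "## A\n" + "x".repeat(500) + "\n### B\n" + "y".repeat(500);
const truncatedPage = looksTruncated("## A\nshort", raw);
const completePage = looksTruncated(raw, raw);
```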

Also adds test/e2e/jsonb-roundtrip.test.ts — the regression coverage
that should have caught the original double-encode bug. Hits all four
write sites against real Postgres and asserts jsonb_typeof='object'
plus `->>'key'` returns the expected scalar.

Detection only: doctor diagnoses, `gbrain repair-jsonb` treats.
No overlap with the standalone repair path.

* chore: bump to v0.12.3 + changelog (reliability wave)

Master shipped v0.12.1 (extract N+1 + migration timeout) and v0.12.2
(JSONB double-encode + splitBody + wiki types + parseEmbedding) while
this wave was mid-flight. Ships the remaining pieces as v0.12.3:

- sync deadlock (#132, @sunnnybala)
- statement_timeout scoping (#158, @garagon)
- Obsidian wikilinks + domain patterns (#187 slice, @knee5)
- gbrain orphans command (#187 slice, @knee5)
- tryParseEmbedding() availability helper
- doctor detection for jsonb_integrity + markdown_body_completeness

No schema, no migration, no data touch.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs: update project documentation for v0.12.3

CLAUDE.md:
- Add src/commands/orphans.ts entry
- Expand src/commands/doctor.ts with v0.12.3 jsonb_integrity +
  markdown_body_completeness check descriptions
- Update src/core/link-extraction.ts to mention Obsidian wikilinks +
  extended DIR_PATTERN (entities/projects/tech/finance/personal/openclaw)
- Update src/core/utils.ts to mention tryParseEmbedding sibling
- Update src/core/postgres-engine.ts to note statement_timeout scoping +
  tryParseEmbedding usage in getEmbeddingsByChunkIds
- Add Key commands added in v0.12.3 section (orphans, doctor checks)
- Add test/orphans.test.ts, test/postgres-engine.test.ts, updated
  descriptions for test/sync.test.ts, test/doctor.test.ts,
  test/utils.test.ts
- Add test/e2e/jsonb-roundtrip.test.ts with note on intentional overlap
- Bump operation count from ~36 to ~41 (find_orphans shipped in v0.12.3)

README.md:
- Add gbrain orphans to ADMIN commands block

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: sunnnybala <dhruvagarwal5018@gmail.com>
Co-authored-by: Gustavo Aragon <gustavoraularagon@gmail.com>
Co-authored-by: Clevin Canales <clevin@Clevins-MacBook-Pro.local>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Clevin Canales <clev.canales@gmail.com>
2026-04-19 18:23:02 +08:00

Getting Data Into Your Brain

GBrain is the retrieval layer. But retrieval is only as good as what you put in. This directory covers how to get data flowing into your brain automatically.

How Data Flows In

Signal arrives (phone call, email, tweet, calendar event)
  ↓
Collector captures it (deterministic code, reliable)
  ↓
Agent analyzes it (LLM, judgment, entity detection)
  ↓
Brain pages created/updated (compiled truth + timeline)
  ↓
GBrain indexes it (chunking, embedding, search-ready)
  ↓
Next query is smarter (the compounding effect)

Available Integrations

Self-Installing Recipes

These are integration recipes your agent can set up for you. Run gbrain integrations to see what's available and their status.

| Recipe | Category | Requires | What It Does | Setup Time |
|---|---|---|---|---|
| ngrok-tunnel | Infra | - | Fixed public URL for MCP + voice ($8/mo) | 10 min |
| credential-gateway | Infra | - | Gmail + Calendar access (ClawVisor or Google OAuth) | 15 min |
| voice-to-brain | Sense | ngrok-tunnel | Phone calls create brain pages via Twilio + OpenAI Realtime | 30 min |
| email-to-brain | Sense | credential-gateway | Gmail messages flow into entity pages via deterministic collector | 20 min |
| x-to-brain | Sense | - | Twitter timeline, mentions, keyword monitoring with deletion detection | 15 min |
| calendar-to-brain | Sense | credential-gateway | Google Calendar events become searchable daily brain pages | 20 min |
| meeting-sync | Sense | - | Circleback meeting transcripts auto-import with attendee propagation | 15 min |

Manual Integration Guides

These require manual setup (no self-installing recipe yet):

| Guide | What It Does |
|---|---|
| Credential Gateway | Set up ClawVisor or Hermes for Gmail, Calendar, Contacts access |
| Meeting & Call Webhooks | Circleback meeting transcripts + Quo/OpenPhone SMS/calls |

How to Read a Recipe

Integration recipes are markdown files with YAML frontmatter. Your agent reads the recipe and walks you through setup.

---
id: voice-to-brain              # unique identifier
name: Voice-to-Brain            # human-readable name
version: 0.7.0                  # recipe version
description: Phone calls...     # what it does
category: sense                 # sense (data input) or reflex (automated response)
requires: []                    # other recipes that must be set up first
secrets:                        # API keys and credentials needed
  - name: TWILIO_ACCOUNT_SID
    description: Twilio account SID
    where: https://console.twilio.com    # exact URL to get this key
health_checks:                  # typed DSL to verify the integration is working
  - type: http
    url: "https://api.twilio.com/2010-04-01/Accounts/$TWILIO_ACCOUNT_SID.json"
    auth: basic
    auth_user: "$TWILIO_ACCOUNT_SID"
    auth_token: "$TWILIO_AUTH_TOKEN"
    label: "Twilio account"
setup_time: 30 min              # estimated time to complete setup
---

[Setup instructions the agent follows step by step...]

The recipe IS the installer. Your agent (OpenClaw, Hermes, Claude Code) reads the markdown body and executes the setup steps. It asks you for API keys, validates each one, configures the integration, and runs a smoke test.

Recipe trust boundary

Only recipes shipped inside the gbrain package itself (the recipes/ directory in a source install, or the global install copy) are trusted. Recipes discovered at runtime from $GBRAIN_RECIPES_DIR or a cwd-local ./recipes/ are marked untrusted: they cannot run command health checks, cannot run http health checks (SSRF defense), and cannot use the deprecated string health_check form. Untrusted recipes can still use env_exists and any_of compositions. To ship a recipe that runs live checks, contribute it upstream so it becomes package-bundled.
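
As a rough illustration of the rule above, the sketch below gates health checks on recipe trust. The four check type names come from this section; the field names (`name`, `checks`, `url`, `argv`) and the `isAllowed` function are hypothetical and may not match the real DSL.

```typescript
type HealthCheck =
  | { type: "env_exists"; name: string }
  | { type: "any_of"; checks: HealthCheck[] }
  | { type: "http"; url: string }
  | { type: "command"; argv: string[] }
  | string; // deprecated string health_check form

function isAllowed(check: HealthCheck, trusted: boolean): boolean {
  if (trusted) return true;
  // Untrusted recipes cannot use the deprecated string form.
  if (typeof check === "string") return false;
  switch (check.type) {
    case "env_exists":
      return true;
    case "any_of":
      // A composition is allowed only if every member is allowed.
      return check.checks.every((inner) => isAllowed(inner, false));
    default:
      // http (SSRF defense) and command checks are rejected.
      return false;
  }
}

const untrustedComposite: HealthCheck = {
  type: "any_of",
  checks: [
    { type: "env_exists", name: "TWILIO_ACCOUNT_SID" },
    { type: "env_exists", name: "TWILIO_AUTH_TOKEN" },
  ],
};
const compositeOk = isAllowed(untrustedComposite, false);
```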

The Deterministic Collector Pattern

When an LLM keeps failing at a mechanical task despite repeated prompt fixes, stop fighting the LLM. Move the mechanical work to code.

Code for data. LLMs for judgment.

  • Email collection: code pulls emails with baked-in links (100% reliable). LLM reads the digest, classifies, enriches brain entries (judgment).
  • Tweet collection: code pulls timeline, detects deletions, tracks engagement (deterministic). LLM extracts entities, writes brain updates (judgment).
  • Calendar sync: code pulls events and attendees (deterministic). LLM enriches attendee brain pages (judgment).

This pattern prevents the "LLM forgot the links" failure mode. Mechanical work must be 100% reliable. Judgment work is where LLMs shine.
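
A minimal sketch of the split for the email case. All names here are hypothetical, the link format is a placeholder, and the classifier is synchronous only to keep the example short (a real LLM call would be async).

```typescript
interface Email {
  id: string;
  from: string;
  subject: string;
}

interface DigestEntry extends Email {
  link: string;
}

// Deterministic step: code attaches the link, so it can never be forgotten.
function collect(emails: Email[]): DigestEntry[] {
  return emails.map((email) => ({
    ...email,
    link: `https://mail.example.com/message/${email.id}`, // placeholder URL format
  }));
}

// Judgment step: the classifier (an LLM in practice) only adds a label.
type Classifier = (entry: DigestEntry) => string;

function enrich(digest: DigestEntry[], classify: Classifier) {
  return digest.map((entry) => ({ ...entry, label: classify(entry) }));
}

const digest = collect([{ id: "m1", from: "ada@example.com", subject: "Intro call" }]);
const enriched = enrich(digest, (entry) =>
  entry.subject.toLowerCase().includes("intro") ? "introduction" : "other",
);
```

Because the classifier receives entries that already carry their links, a poor label degrades enrichment but never loses the reference.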

See Deterministic Collectors for the full pattern.

Architecture

For details on the shared infrastructure that all integrations build on (import pipeline, chunking, embedding, search), see the Infrastructure Layer.

For the philosophy behind thin harness + fat skills, see Thin Harness, Fat Skills.