gbrain/docs/integrations/README.md
Garry Tan 7bbfc3e36a security: fix wave 3 — 9 vulns (file_upload, SSRF, recipe trust, prompt injection) (#174)
* feat(engine): add cap parameter to clampSearchLimit (H6)

clampSearchLimit(limit, defaultLimit, cap = MAX_SEARCH_LIMIT) — third arg
is a caller-specified cap so operation handlers can enforce limits below
MAX_SEARCH_LIMIT. Backward compatible: existing two-arg callers still cap
at MAX_SEARCH_LIMIT.

This fixes a Codex-caught semantics bug: the prior signature took (limit,
defaultLimit) where the second arg was misread as a cap. clampSearchLimit(x, 20)
was actually allowing values up to 100, not 20.
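
A minimal sketch of the three-argument semantics described above. The value of `MAX_SEARCH_LIMIT` (100) is taken from this message; the handling of missing or invalid input is an assumption, and the real helper may differ.

```typescript
// Hypothetical sketch; not the gbrain source.
const MAX_SEARCH_LIMIT = 100; // implied by "allowing values up to 100" above

function clampSearchLimit(
  limit: number | undefined,
  defaultLimit: number,
  cap: number = MAX_SEARCH_LIMIT,
): number {
  // Assumption: missing, non-finite, or sub-1 input falls back to the default.
  if (limit === undefined || !Number.isFinite(limit) || limit < 1) {
    return Math.min(defaultLimit, cap);
  }
  // The second argument is only a default; the third is the actual ceiling.
  return Math.min(Math.floor(limit), cap);
}
```

With two arguments, `clampSearchLimit(500, 20)` still caps at `MAX_SEARCH_LIMIT`; a handler that wants a lower ceiling passes it explicitly, as in `clampSearchLimit(500, 20, 50)`.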

* feat(integrations): SSRF defense + recipe trust boundary (B1, B2, Fix 2, Fix 4, B3, B4)

- B1: split loadAllRecipes into trusted (package-bundled) and untrusted
  (cwd/recipes, $GBRAIN_RECIPES_DIR) tiers. Only package-bundled recipes
  get embedded=true. Closes the fake trust boundary that let any cwd-local
  recipe bypass health-check gates.
- B2: hard-block string health_checks for non-embedded recipes (was previously
  only blocked when isUnsafeHealthCheck regex matched, which the cwd recipe
  exploit bypassed). Embedded recipes still get the regex defense.
- Fix 2: gate command DSL health_checks on isEmbedded. Non-embedded
  recipes cannot spawnSync.
- Fix 4 + B3 + B4: gate http DSL health_checks on isEmbedded; for embedded
  recipes, validate URLs via new isInternalUrl() before fetch:
  - Scheme allowlist (http/https only): blocks file:, data:, blob:, ftp:, javascript:
  - IPv4 range check covering hex/octal/decimal/single-integer bypass forms
  - IPv6 loopback ::1 + IPv4-mapped ::ffff: (canonicalized hex hextets handled)
  - Metadata hostnames (AWS, GCP, instance-data) blocked
  - fetch with redirect: 'manual' + per-hop re-validation up to 3 hops
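
The URL checks listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the gbrain implementation: the helper bodies, the metadata host list, and the "unparseable means unsafe" rule are all hypothetical. It leans on the WHATWG `URL` parser, which normalizes hex, octal, and single-integer IPv4 hosts to dotted decimal for `http`/`https` URLs. Redirect re-validation is left out of this sketch.

```typescript
// Hypothetical sketch of an SSRF pre-flight check.
const METADATA_HOSTS = new Set(["metadata.google.internal", "instance-data"]);

function isPrivateIpv4(octets: number[]): boolean {
  const [a, b] = octets;
  return (
    a === 0 ||                            // 0.0.0.0/8
    a === 10 ||                           // 10.0.0.0/8
    a === 127 ||                          // loopback
    (a === 169 && b === 254) ||           // link-local, incl. 169.254.169.254
    (a === 172 && b >= 16 && b <= 31) ||  // 172.16.0.0/12
    (a === 192 && b === 168)              // 192.168.0.0/16
  );
}

function isInternalUrl(raw: string): boolean {
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    return true; // assumption: unparseable input is treated as unsafe
  }
  // Scheme allowlist: anything other than http/https is rejected.
  if (url.protocol !== "http:" && url.protocol !== "https:") return true;

  const host = url.hostname.toLowerCase();
  if (host === "localhost" || host.endsWith(".localhost")) return true;
  if (METADATA_HOSTS.has(host)) return true;
  if (host === "[::1]") return true; // IPv6 loopback

  // IPv4-mapped IPv6 is serialized as hex hextets, e.g. [::ffff:7f00:1].
  const mapped = /^\[::ffff:([0-9a-f]{1,4}):([0-9a-f]{1,4})\]$/.exec(host);
  if (mapped) {
    const hi = parseInt(mapped[1], 16);
    const lo = parseInt(mapped[2], 16);
    return isPrivateIpv4([hi >> 8, hi & 255, lo >> 8, lo & 255]);
  }

  // Dotted-decimal IPv4 (the parser has already normalized other forms).
  const v4 = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/.exec(host);
  if (v4) return isPrivateIpv4(v4.slice(1).map(Number));
  return false;
}
```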

Original PRs #105-109 by @garagon. Wave 3 collector branch reimplemented
the fixes after Codex outside-voice review found that PRs #106/#108 alone
did not actually gate cwd-local recipes (B1) and that PR #108 missed
redirect-following SSRF (B3) and non-http schemes (B4).

* feat(file_upload): path/slug/filename validation + remote-caller confinement (Fix 1, B5, H5, M4, Fix 5)

- Fix 1 + B5 + H1: validateUploadPath uses realpathSync + path.relative
  to defeat symlink-parent traversal. lstatSync alone (the original PR #105
  approach) only catches final-component symlinks; a symlinked parent dir
  still followed to /etc/passwd. Now the entire path chain is resolved.
- H5: validatePageSlug uses an allowlist regex (alphanumeric + hyphens,
  slash-separated segments). Closes URL-encoded traversal (%2e%2e%2f),
  Unicode lookalikes, backslashes, control chars implicitly.
- M4: validateFilename allowlist regex. Rejects control chars, backslash,
  RTL override (\u202E), leading dot/dash. Filename flows into storage_path
  so this matters for every storage backend.
- Fix 5: clamp list_pages and get_ingest_log limits at the operation layer
  via new clampSearchLimit cap parameter (list_pages caps at 100,
  get_ingest_log at 50). Internal bulk commands bypass the operation
  layer and remain uncapped.
- New OperationContext.remote flag distinguishes trusted local CLI from
  untrusted MCP callers. file_upload uses strict cwd confinement when
  remote=true (default), loose mode when remote=false (CLI). MCP stdio
  server sets remote=true; cli.ts and handleToolCall (gbrain call) set
  remote=false.
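
To illustrate why an allowlist closes the slug and filename cases above "implicitly", here is a small sketch. The regexes, function names, and the 255-character limit are hypothetical; the patterns actually used by `validatePageSlug` and `validateFilename` are not shown in this message.

```typescript
// Hypothetical allowlist patterns; the real ones may differ.
// Slug: alphanumeric/hyphen segments separated by single slashes.
const SLUG_RE = /^[A-Za-z0-9-]+(?:\/[A-Za-z0-9-]+)*$/;
// Filename: must start alphanumeric, so leading dot and dash are rejected.
const FILENAME_RE = /^[A-Za-z0-9][A-Za-z0-9._-]*$/;

function isValidSlug(slug: string): boolean {
  return SLUG_RE.test(slug);
}

function isValidFilename(name: string): boolean {
  // Assumption: 255 characters as a conservative length limit.
  return name.length <= 255 && FILENAME_RE.test(name);
}
```

Because only listed characters are accepted, `%`, `.`, backslash, control characters, and non-ASCII lookalikes in slugs never need their own rules; they fail the match.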

Original PR #105 by @garagon. Issue #139 reported by @Hybirdss.

* feat(search): query sanitization + structural prompt boundary (Fix 3, M1, M2, M3)

- M1: restructure callHaikuForExpansion to use a system message that declares
  the user query as untrusted data, plus an XML-tagged <user_query> boundary
  in the user message. Layered defense with the existing tool_choice constraint
  (3 layers vs 1).
- Fix 3 (regex sanitizer, defense-in-depth): sanitizeQueryForPrompt strips
  triple-backtick code fences, XML/HTML tags, leading injection prefixes,
  and caps at 500 chars. Original query is still used for downstream search;
  only the LLM-facing copy is sanitized.
- M2: sanitizeExpansionOutput validates the model's alternative_queries array
  before it flows into search. Strips control chars, caps length, dedupes
  case-insensitively, drops empty/non-string items, caps to 2 items.
- M3: console.warn on stripped content NEVER logs the query text — privacy-safe
  debug signal only.
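
A rough sketch of the two sanitizers described above. The function names and the 500-character and 2-item caps come from this message; everything else is an assumption, including the specific injection prefixes, the 200-character cap on alternatives, and the choice to remove fence markers rather than fenced content.

```typescript
// Hypothetical sketch; not the gbrain source.
const FENCE = "`".repeat(3); // triple-backtick marker
const MAX_QUERY_CHARS = 500;
// Assumption: a few illustrative leading prefixes; the real list is not given.
const INJECTION_PREFIX_RE =
  /^\s*(?:ignore (?:all )?previous instructions|system:|assistant:)\s*/i;

function sanitizeQueryForPrompt(query: string): string {
  return query
    .split(FENCE).join(" ")           // drop code-fence markers
    .replace(/<[^>]*>/g, " ")         // drop XML/HTML tags
    .replace(INJECTION_PREFIX_RE, "") // drop a leading injection prefix
    .replace(/\s+/g, " ")
    .trim()
    .slice(0, MAX_QUERY_CHARS);
}

function sanitizeExpansionOutput(raw: unknown, maxItems = 2, maxLen = 200): string[] {
  if (!Array.isArray(raw)) return [];
  const seen = new Set<string>();
  const out: string[] = [];
  for (const item of raw) {
    if (typeof item !== "string") continue; // drop non-string items
    const cleaned = item
      .replace(/[\u0000-\u001f\u007f]/g, "") // strip control characters
      .trim()
      .slice(0, maxLen);
    if (cleaned === "") continue;            // drop empty items
    const key = cleaned.toLowerCase();       // case-insensitive dedupe
    if (seen.has(key)) continue;
    seen.add(key);
    out.push(cleaned);
    if (out.length === maxItems) break;
  }
  return out;
}
```

Only the copy sent to the model goes through `sanitizeQueryForPrompt`; the original query string is what the downstream search receives.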

Original PR #107 by @garagon. M1/M2/M3 are wave 3 hardening per Codex review.

* chore: bump version and changelog (v0.10.2)

Security wave 3: 9 vulnerabilities closed across file_upload, recipe trust
boundary, SSRF defense, prompt injection, and limit clamping. See CHANGELOG
for full details.

Contributors:
- @garagon (PRs #105-109)
- @Hybirdss (Issue #139)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: sync documentation with v0.10.2 security wave 3

- CLAUDE.md: document OperationContext.remote, new security helpers
  (validateUploadPath, validatePageSlug, validateFilename, isInternalUrl,
  parseOctet, hostnameToOctets, isPrivateIpv4, getRecipeDirs,
  sanitizeQueryForPrompt, sanitizeExpansionOutput), updated clampSearchLimit
  signature, recipe trust boundary, new test files
- docs/integrations/README.md: replace string-form health_check example
  with typed DSL (string checks now hard-block for non-embedded recipes);
  add recipe trust boundary subsection
- docs/mcp/DEPLOY.md: document file_upload remote-caller cwd confinement,
  symlink rejection, slug/filename allowlists

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-16 23:03:15 -07:00


# Getting Data Into Your Brain
GBrain is the retrieval layer. But retrieval is only as good as what you put in.
This directory covers how to get data flowing into your brain automatically.
## How Data Flows In
```
Signal arrives (phone call, email, tweet, calendar event)
Collector captures it (deterministic code, reliable)
Agent analyzes it (LLM, judgment, entity detection)
Brain pages created/updated (compiled truth + timeline)
GBrain indexes it (chunking, embedding, search-ready)
Next query is smarter (the compounding effect)
```
## Available Integrations
### Self-Installing Recipes
These are integration recipes your agent can set up for you. Run
`gbrain integrations` to see what's available and their status.
| Recipe | Category | Requires | What It Does | Setup Time |
|--------|----------|----------|-------------|------------|
| [ngrok-tunnel](../../recipes/ngrok-tunnel.md) | Infra | — | Fixed public URL for MCP + voice ($8/mo) | 10 min |
| [credential-gateway](../../recipes/credential-gateway.md) | Infra | — | Gmail + Calendar access (ClawVisor or Google OAuth) | 15 min |
| [voice-to-brain](../../recipes/twilio-voice-brain.md) | Sense | ngrok-tunnel | Phone calls create brain pages via Twilio + OpenAI Realtime | 30 min |
| [email-to-brain](../../recipes/email-to-brain.md) | Sense | credential-gateway | Gmail messages flow into entity pages via deterministic collector | 20 min |
| [x-to-brain](../../recipes/x-to-brain.md) | Sense | — | Twitter timeline, mentions, keyword monitoring with deletion detection | 15 min |
| [calendar-to-brain](../../recipes/calendar-to-brain.md) | Sense | credential-gateway | Google Calendar events become searchable daily brain pages | 20 min |
| [meeting-sync](../../recipes/meeting-sync.md) | Sense | — | Circleback meeting transcripts auto-import with attendee propagation | 15 min |
### Manual Integration Guides
These require manual setup (no self-installing recipe yet):
| Guide | What It Does |
|-------|-------------|
| [Credential Gateway](credential-gateway.md) | Set up ClawVisor or Hermes for Gmail, Calendar, Contacts access |
| [Meeting & Call Webhooks](meeting-webhooks.md) | Circleback meeting transcripts + Quo/OpenPhone SMS/calls |
## How to Read a Recipe
Integration recipes are markdown files with YAML frontmatter. Your agent reads
the recipe and walks you through setup.
```yaml
---
id: voice-to-brain # unique identifier
name: Voice-to-Brain # human-readable name
version: 0.7.0 # recipe version
description: Phone calls... # what it does
category: sense # sense (data input) or reflex (automated response)
requires: [] # other recipes that must be set up first
secrets: # API keys and credentials needed
  - name: TWILIO_ACCOUNT_SID
    description: Twilio account SID
    where: https://console.twilio.com # exact URL to get this key
health_checks: # typed DSL to verify the integration is working
  - type: http
    url: "https://api.twilio.com/2010-04-01/Accounts/$TWILIO_ACCOUNT_SID.json"
    auth: basic
    auth_user: "$TWILIO_ACCOUNT_SID"
    auth_token: "$TWILIO_AUTH_TOKEN"
    label: "Twilio account"
setup_time: 30 min # estimated time to complete setup
---
[Setup instructions the agent follows step by step...]
```
**The recipe IS the installer.** Your agent (OpenClaw, Hermes, Claude Code) reads
the markdown body and executes the setup steps. It asks you for API keys, validates
each one, configures the integration, and runs a smoke test.
### Recipe trust boundary
Only recipes shipped inside the gbrain package itself (the `recipes/` directory in
a source install, or the global install copy) are trusted. Recipes discovered at
runtime from `$GBRAIN_RECIPES_DIR` or a cwd-local `./recipes/` are marked untrusted:
they cannot run `command` health checks, cannot run `http` health checks (SSRF
defense), and cannot use the deprecated string health_check form. Untrusted recipes
can still use `env_exists` and `any_of` compositions. To ship a recipe that runs
live checks, contribute it upstream so it becomes package-bundled.
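
The rules above amount to a small gate on each health check. The following sketch shows one way to express it; the `HealthCheck` type and its field names (`name`, `argv`, `checks`) are hypothetical and may not match gbrain's actual DSL.

```typescript
// Hypothetical model of the health-check DSL; field names are assumptions.
type HealthCheck =
  | string // deprecated string form
  | { type: "env_exists"; name: string }
  | { type: "command"; argv: string[] }
  | { type: "http"; url: string }
  | { type: "any_of"; checks: HealthCheck[] };

// Returns true if a recipe with the given trust level may run this check.
function isCheckAllowed(check: HealthCheck, embedded: boolean): boolean {
  // Package-bundled recipes may use every form (other defenses still apply).
  if (embedded) return true;
  // Untrusted recipes: string checks are hard-blocked.
  if (typeof check === "string") return false;
  // any_of is allowed only if every nested check is itself allowed.
  if (check.type === "any_of") {
    return check.checks.every((c) => isCheckAllowed(c, false));
  }
  // Of the remaining forms, only env_exists is safe for untrusted recipes.
  return check.type === "env_exists";
}
```

For example, an untrusted recipe with `{ type: "env_exists", name: "TWILIO_ACCOUNT_SID" }` passes the gate, while the `http` check from the frontmatter example does not unless the recipe is package-bundled.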
## The Deterministic Collector Pattern
When an LLM keeps failing at a mechanical task despite repeated prompt fixes,
stop fighting the LLM. Move the mechanical work to code.
**Code for data. LLMs for judgment.**
- Email collection: code pulls emails with baked-in links (100% reliable).
LLM reads the digest, classifies, enriches brain entries (judgment).
- Tweet collection: code pulls timeline, detects deletions, tracks engagement
(deterministic). LLM extracts entities, writes brain updates (judgment).
- Calendar sync: code pulls events and attendees (deterministic). LLM enriches
attendee brain pages (judgment).
This pattern prevents the "LLM forgot the links" failure mode. Mechanical work
must be 100% reliable. Judgment work is where LLMs shine.
See [Deterministic Collectors](../guides/deterministic-collectors.md) for the
full pattern.
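
As a rough illustration of the split, the sketch below keeps link construction in plain code and passes only the finished digest to a judgment step. All names here (`RawEmail`, `buildDigest`, `enrich`, `Classifier`) are hypothetical, and the Gmail link format is an assumption for illustration rather than something the collectors are known to use.

```typescript
interface RawEmail { id: string; from: string; subject: string }
interface DigestEntry { from: string; subject: string; link: string }

// Deterministic step: pure code, so every entry always carries its link.
function buildDigest(emails: RawEmail[]): DigestEntry[] {
  return emails.map((e) => ({
    from: e.from,
    subject: e.subject,
    // Assumed link format, for illustration only.
    link: `https://mail.google.com/mail/u/0/#inbox/${e.id}`,
  }));
}

// Judgment step: the LLM call is injected, so it can only add a label.
// It never constructs or rewrites the link.
type Classifier = (entry: DigestEntry) => Promise<string>;

async function enrich(
  digest: DigestEntry[],
  classify: Classifier,
): Promise<Array<DigestEntry & { label: string }>> {
  const out: Array<DigestEntry & { label: string }> = [];
  for (const entry of digest) {
    out.push({ ...entry, label: await classify(entry) });
  }
  return out;
}
```

The design point is that the model receives entries whose mechanical fields are already complete, so a forgetful or inconsistent response cannot lose them.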
## Architecture
For details on the shared infrastructure that all integrations build on
(import pipeline, chunking, embedding, search), see the
[Infrastructure Layer](../architecture/infra-layer.md).
For the philosophy behind thin harness + fat skills, see
[Thin Harness, Fat Skills](../ethos/THIN_HARNESS_FAT_SKILLS.md).