* feat(engine): add cap parameter to clampSearchLimit (H6)

  clampSearchLimit(limit, defaultLimit, cap = MAX_SEARCH_LIMIT): the third arg is a caller-specified cap so operation handlers can enforce limits below MAX_SEARCH_LIMIT. Backward compatible: existing two-arg callers still cap at MAX_SEARCH_LIMIT.

  This fixes a Codex-caught semantics bug: the prior signature took (limit, defaultLimit) where the second arg was misread as a cap. clampSearchLimit(x, 20) was actually allowing values up to 100, not 20.

* feat(integrations): SSRF defense + recipe trust boundary (B1, B2, Fix 2, Fix 4, B3, B4)

  - B1: split loadAllRecipes into trusted (package-bundled) and untrusted (cwd/recipes, $GBRAIN_RECIPES_DIR) tiers. Only package-bundled recipes get embedded=true. Closes the fake trust boundary that let any cwd-local recipe bypass health-check gates.
  - B2: hard-block string health_checks for non-embedded recipes (was previously only blocked when isUnsafeHealthCheck regex matched, which the cwd recipe exploit bypassed). Embedded recipes still get the regex defense.
  - Fix 2: gate command DSL health_checks on isEmbedded. Non-embedded recipes cannot spawnSync.
  - Fix 4 + B3 + B4: gate http DSL health_checks on isEmbedded; for embedded recipes, validate URLs via new isInternalUrl() before fetch:
    - Scheme allowlist (http/https only): blocks file:, data:, blob:, ftp:, javascript:
    - IPv4 range check covering hex/octal/decimal/single-integer bypass forms
    - IPv6 loopback ::1 + IPv4-mapped ::ffff: (canonicalized hex hextets handled)
    - Metadata hostnames (AWS, GCP, instance-data) blocked
    - fetch with redirect: 'manual' + per-hop re-validation up to 3 hops

  Original PRs #105-109 by @garagon. Wave 3 collector branch reimplemented the fixes after Codex outside-voice review found that PRs #106/#108 alone did not actually gate cwd-local recipes (B1) and that PR #108 missed redirect-following SSRF (B3) and non-http schemes (B4).

* feat(file_upload): path/slug/filename validation + remote-caller confinement (Fix 1, B5, H5, M4, Fix 5)

  - Fix 1 + B5 + H1: validateUploadPath uses realpathSync + path.relative to defeat symlink-parent traversal. lstatSync alone (the original PR #105 approach) only catches final-component symlinks; a symlinked parent dir still followed to /etc/passwd. Now the entire path chain is resolved.
  - H5: validatePageSlug uses an allowlist regex (alphanumeric + hyphens, slash-separated segments). Closes URL-encoded traversal (%2e%2e%2f), Unicode lookalikes, backslashes, control chars implicitly.
  - M4: validateFilename allowlist regex. Rejects control chars, backslash, RTL override (\u202E), leading dot/dash. Filename flows into storage_path so this matters for every storage backend.
  - Fix 5: clamp list_pages and get_ingest_log limits at the operation layer via new clampSearchLimit cap parameter (list_pages caps at 100, get_ingest_log at 50). Internal bulk commands bypass the operation layer and remain uncapped.
  - New OperationContext.remote flag distinguishes trusted local CLI from untrusted MCP callers. file_upload uses strict cwd confinement when remote=true (default), loose mode when remote=false (CLI). MCP stdio server sets remote=true; cli.ts and handleToolCall (gbrain call) set remote=false.

  Original PR #105 by @garagon. Issue #139 reported by @Hybirdss.

* feat(search): query sanitization + structural prompt boundary (Fix 3, M1, M2, M3)

  - M1: restructure callHaikuForExpansion to use a system message that declares the user query as untrusted data, plus an XML-tagged <user_query> boundary in the user message. Layered defense with the existing tool_choice constraint (3 layers vs 1).
  - Fix 3 (regex sanitizer, defense-in-depth): sanitizeQueryForPrompt strips triple-backtick code fences, XML/HTML tags, leading injection prefixes, and caps at 500 chars. Original query is still used for downstream search; only the LLM-facing copy is sanitized.
  - M2: sanitizeExpansionOutput validates the model's alternative_queries array before it flows into search. Strips control chars, caps length, dedupes case-insensitively, drops empty/non-string items, caps to 2 items.
  - M3: console.warn on stripped content NEVER logs the query text (privacy-safe debug signal only).

  Original PR #107 by @garagon. M1/M2/M3 are wave 3 hardening per Codex review.

* chore: bump version and changelog (v0.10.2)

  Security wave 3: 9 vulnerabilities closed across file_upload, recipe trust boundary, SSRF defense, prompt injection, and limit clamping. See CHANGELOG for full details.

  Contributors:
  - @garagon (PRs #105-109)
  - @Hybirdss (Issue #139)

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: sync documentation with v0.10.2 security wave 3

  - CLAUDE.md: document OperationContext.remote, new security helpers (validateUploadPath, validatePageSlug, validateFilename, isInternalUrl, parseOctet, hostnameToOctets, isPrivateIpv4, getRecipeDirs, sanitizeQueryForPrompt, sanitizeExpansionOutput), updated clampSearchLimit signature, recipe trust boundary, new test files
  - docs/integrations/README.md: replace string-form health_check example with typed DSL (string checks now hard-block for non-embedded recipes); add recipe trust boundary subsection
  - docs/mcp/DEPLOY.md: document file_upload remote-caller cwd confinement, symlink rejection, slug/filename allowlists

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

# Getting Data Into Your Brain

GBrain is the retrieval layer. But retrieval is only as good as what you put in.
This directory covers how to get data flowing into your brain automatically.

## How Data Flows In

```
Signal arrives (phone call, email, tweet, calendar event)
  ↓
Collector captures it (deterministic code, reliable)
  ↓
Agent analyzes it (LLM, judgment, entity detection)
  ↓
Brain pages created/updated (compiled truth + timeline)
  ↓
GBrain indexes it (chunking, embedding, search-ready)
  ↓
Next query is smarter (the compounding effect)
```

## Available Integrations

### Self-Installing Recipes

These are integration recipes your agent can set up for you. Run
`gbrain integrations` to see what's available and their status.

| Recipe | Category | Requires | What It Does | Setup Time |
|--------|----------|----------|-------------|------------|
| [ngrok-tunnel](../../recipes/ngrok-tunnel.md) | Infra | — | Fixed public URL for MCP + voice ($8/mo) | 10 min |
| [credential-gateway](../../recipes/credential-gateway.md) | Infra | — | Gmail + Calendar access (ClawVisor or Google OAuth) | 15 min |
| [voice-to-brain](../../recipes/twilio-voice-brain.md) | Sense | ngrok-tunnel | Phone calls create brain pages via Twilio + OpenAI Realtime | 30 min |
| [email-to-brain](../../recipes/email-to-brain.md) | Sense | credential-gateway | Gmail messages flow into entity pages via deterministic collector | 20 min |
| [x-to-brain](../../recipes/x-to-brain.md) | Sense | — | Twitter timeline, mentions, keyword monitoring with deletion detection | 15 min |
| [calendar-to-brain](../../recipes/calendar-to-brain.md) | Sense | credential-gateway | Google Calendar events become searchable daily brain pages | 20 min |
| [meeting-sync](../../recipes/meeting-sync.md) | Sense | — | Circleback meeting transcripts auto-import with attendee propagation | 15 min |

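
The Requires column implies a setup order: a recipe can only be installed once the recipes it depends on are in place. A minimal sketch of deriving that order, assuming each recipe exposes just an `id` and a `requires` list (the `RecipeDeps` type and `setupOrder` helper are hypothetical, not part of gbrain):

```typescript
// Hypothetical shape: only `id` and `requires` mirror fields named in the recipes.
interface RecipeDeps {
  id: string;
  requires: string[];
}

// Depth-first topological sort: every recipe appears after the recipes it requires.
function setupOrder(recipes: RecipeDeps[]): string[] {
  const byId = new Map<string, RecipeDeps>();
  for (const r of recipes) byId.set(r.id, r);

  const done = new Set<string>();
  const visiting = new Set<string>();
  const order: string[] = [];

  const visit = (id: string): void => {
    if (done.has(id)) return;
    if (visiting.has(id)) throw new Error(`dependency cycle at ${id}`);
    const recipe = byId.get(id);
    if (!recipe) throw new Error(`unknown recipe: ${id}`);
    visiting.add(id);
    for (const dep of recipe.requires) visit(dep);
    visiting.delete(id);
    done.add(id);
    order.push(id);
  };

  for (const r of recipes) visit(r.id);
  return order;
}

const order = setupOrder([
  { id: "voice-to-brain", requires: ["ngrok-tunnel"] },
  { id: "email-to-brain", requires: ["credential-gateway"] },
  { id: "ngrok-tunnel", requires: [] },
  { id: "credential-gateway", requires: [] },
]);
console.log(order.join(" -> "));
// ngrok-tunnel -> voice-to-brain -> credential-gateway -> email-to-brain
```

The sketch rejects unknown ids and cycles instead of guessing, since a half-installed dependency chain is harder to debug than an early error.
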
### Manual Integration Guides

These require manual setup (no self-installing recipe yet):

| Guide | What It Does |
|-------|-------------|
| [Credential Gateway](credential-gateway.md) | Set up ClawVisor or Hermes for Gmail, Calendar, Contacts access |
| [Meeting & Call Webhooks](meeting-webhooks.md) | Circleback meeting transcripts + Quo/OpenPhone SMS/calls |

## How to Read a Recipe

Integration recipes are markdown files with YAML frontmatter. Your agent reads
the recipe and walks you through setup.

```yaml
---
id: voice-to-brain                # unique identifier
name: Voice-to-Brain              # human-readable name
version: 0.7.0                    # recipe version
description: Phone calls...       # what it does
category: sense                   # sense (data input) or reflex (automated response)
requires: []                      # other recipes that must be set up first
secrets:                          # API keys and credentials needed
  - name: TWILIO_ACCOUNT_SID
    description: Twilio account SID
    where: https://console.twilio.com  # exact URL to get this key
health_checks:                    # typed DSL to verify the integration is working
  - type: http
    url: "https://api.twilio.com/2010-04-01/Accounts/$TWILIO_ACCOUNT_SID.json"
    auth: basic
    auth_user: "$TWILIO_ACCOUNT_SID"
    auth_token: "$TWILIO_AUTH_TOKEN"
    label: "Twilio account"
setup_time: 30 min                # estimated time to complete setup
---

[Setup instructions the agent follows step by step...]
```

**The recipe IS the installer.** Your agent (OpenClaw, Hermes, Claude Code) reads
the markdown body and executes the setup steps. It asks you for API keys, validates
each one, configures the integration, and runs a smoke test.

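
To make the frontmatter fields concrete, here is a rough TypeScript view of the example above, plus a helper that lists which secrets the agent still has to ask for. The field names come from the example; the types, the `missingSecrets` helper, and the omission of `health_checks` are assumptions for illustration, not gbrain's actual loader.

```typescript
// Field names follow the example frontmatter; the types are an assumption.
interface RecipeSecret {
  name: string;
  description: string;
  where: string;
}

interface RecipeFrontmatter {
  id: string;
  name: string;
  version: string;
  description: string;
  category: "sense" | "reflex";
  requires: string[];
  secrets: RecipeSecret[];
  setup_time: string;
}

// Hypothetical helper: secrets that are unset (or empty) in the given environment.
function missingSecrets(
  recipe: RecipeFrontmatter,
  env: Record<string, string | undefined>,
): RecipeSecret[] {
  return recipe.secrets.filter((s) => !env[s.name]);
}

const voice: RecipeFrontmatter = {
  id: "voice-to-brain",
  name: "Voice-to-Brain",
  version: "0.7.0",
  description: "Phone calls...",
  category: "sense",
  requires: [],
  secrets: [
    {
      name: "TWILIO_ACCOUNT_SID",
      description: "Twilio account SID",
      where: "https://console.twilio.com",
    },
  ],
  setup_time: "30 min",
};

// With an empty environment, the agent would prompt for every declared secret.
for (const s of missingSecrets(voice, {})) {
  console.log(`Missing ${s.name}: get it at ${s.where}`);
}
```
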
### Recipe trust boundary

Only recipes shipped inside the gbrain package itself (the `recipes/` directory in
a source install, or the global install copy) are trusted. Recipes discovered at
runtime from `$GBRAIN_RECIPES_DIR` or a cwd-local `./recipes/` are marked untrusted:
they cannot run `command` health checks, cannot run `http` health checks (SSRF
defense), and cannot use the deprecated string health_check form. Untrusted recipes
can still use `env_exists` and `any_of` compositions. To ship a recipe that runs
live checks, contribute it upstream so it becomes package-bundled.

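
The rules above amount to a small gate in front of every health check. A minimal sketch, assuming hypothetical check shapes (only the type names `http`, `command`, `env_exists`, `any_of` and the string form come from the text; `isCheckAllowed` and the field names are illustrative, not gbrain's real API):

```typescript
// Hypothetical check shapes for illustration only.
type HealthCheck =
  | string // deprecated string form
  | { type: "http"; url: string }
  | { type: "command"; command: string }
  | { type: "env_exists"; name: string }
  | { type: "any_of"; checks: HealthCheck[] };

// Returns true when a recipe with the given trust level may run the check.
function isCheckAllowed(check: HealthCheck, embedded: boolean): boolean {
  if (embedded) return true; // package-bundled recipes are still subject to their own defenses
  if (typeof check === "string") return false; // string form is blocked for untrusted recipes
  switch (check.type) {
    case "env_exists":
      return true;
    case "any_of":
      // Assumption: a composition is allowed only if every member is allowed.
      return check.checks.every((c) => isCheckAllowed(c, embedded));
    default:
      return false; // http and command are blocked for untrusted recipes
  }
}

console.log(isCheckAllowed({ type: "http", url: "https://api.twilio.com" }, false)); // false
console.log(isCheckAllowed({ type: "env_exists", name: "TWILIO_AUTH_TOKEN" }, false)); // true
```

Treating `any_of` recursively is a guess at the intended behavior: without it, an untrusted recipe could wrap an `http` check in a composition to get past the gate.
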
## The Deterministic Collector Pattern

When an LLM keeps failing at a mechanical task despite repeated prompt fixes,
stop fighting the LLM. Move the mechanical work to code.

**Code for data. LLMs for judgment.**

- Email collection: code pulls emails with baked-in links (100% reliable).
  LLM reads the digest, classifies, enriches brain entries (judgment).
- Tweet collection: code pulls timeline, detects deletions, tracks engagement
  (deterministic). LLM extracts entities, writes brain updates (judgment).
- Calendar sync: code pulls events and attendees (deterministic). LLM enriches
  attendee brain pages (judgment).

This pattern prevents the "LLM forgot the links" failure mode. Mechanical work
must be 100% reliable. Judgment work is where LLMs shine.

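
The split between the two kinds of work can be sketched in a few lines. Everything here is illustrative: the `RawEmail` shape, the `mail.example.com` link format, and the `classify` callback (standing in for an LLM call) are assumptions, not the actual collector code.

```typescript
interface RawEmail {
  id: string;
  from: string;
  subject: string;
}

interface DigestEntry extends RawEmail {
  link: string; // baked in by code, never generated by the LLM
}

// Deterministic step: the same input always yields the same digest, links included.
// The URL format is a placeholder, not a real deep-link scheme.
function collect(emails: RawEmail[]): DigestEntry[] {
  return emails.map((e) => ({ ...e, link: `https://mail.example.com/message/${e.id}` }));
}

// Judgment step: `classify` stands in for an LLM call and only adds a label.
// It cannot drop or alter the link, because the link is copied through by code.
function enrich(
  digest: DigestEntry[],
  classify: (entry: DigestEntry) => string,
): Array<DigestEntry & { label: string }> {
  return digest.map((entry) => ({ ...entry, label: classify(entry) }));
}

const digest = collect([{ id: "m1", from: "ada@example.com", subject: "Intro call" }]);
const pages = enrich(digest, (e) => (e.subject.indexOf("call") !== -1 ? "meeting" : "other"));
for (const p of pages) {
  console.log(p.link, p.label);
}
```

Because the LLM-facing function only returns a label, a bad model response can mislabel an entry but cannot lose the link.
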
See [Deterministic Collectors](../guides/deterministic-collectors.md) for the
full pattern.

## Architecture

For details on the shared infrastructure that all integrations build on
(import pipeline, chunking, embedding, search), see the
[Infrastructure Layer](../architecture/infra-layer.md).

For the philosophy behind thin harness + fat skills, see
[Thin Harness, Fat Skills](../ethos/THIN_HARNESS_FAT_SKILLS.md).