* security: path traversal, query bounds, marker injection fixes LocalStorage: contained() method validates all paths stay within storage root. file-resolver: resolveFile validates filePath within brainRoot, marker prefix rejects ../, absolute paths, bare '..'. file_list: LIMIT 100 on slug-filtered branch + FILE_LIST_LIMIT constant for both branches. Co-Authored-By: Gus <garagon@users.noreply.github.com> * security: symlink hardening in all file walkers All 4 walkers in files.ts (collectFiles, findRedirects, findAndClean, scan) plus init.ts counter now use lstatSync + isSymbolicLink skip. Tests import production collectFiles instead of reimplementing it. node_modules skipped. CLI file list and verify queries bounded with LIMIT. Co-Authored-By: Gus <garagon@users.noreply.github.com> * feat: typed health check DSL + recipe migration 4 DSL types: http, env_exists, command, any_of. Replaces raw execSync on recipe YAML. All 7 first-party recipes migrated from shell strings to typed objects. String health_checks still accepted with deprecation warning + metachar validation for non-embedded recipes. isUnsafeHealthCheck blocks shell injection for user-created recipes. Co-Authored-By: Gus <garagon@users.noreply.github.com> * chore: bump version and changelog (v0.9.3) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * test: E2E test for file_list LIMIT enforcement against real Postgres Inserts 150 file rows for one slug, verifies file_list returns at most 100 (both slug-filtered and unfiltered branches). Proves the LIMIT works at the database level, not just in unit tests. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Gus <garagon@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
34 KiB
id, name, version, description, category, requires, secrets, health_checks, setup_time, cost_estimate
| id | name | version | description | category | requires | secrets | health_checks | setup_time | cost_estimate | ||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| twilio-voice-brain | Voice-to-Brain | 0.8.1 | Phone calls create brain pages via Twilio + voice pipeline + GBrain MCP. Two architectures -- OpenAI Realtime (turnkey) or DIY STT+LLM+TTS (full control). Callers talk, brain pages appear. | sense |
|
|
|
30 min | $15-25/mo (Twilio number $1-2 + voice $0.01/min, OpenAI Realtime $0.06/min input + $0.24/min output) |
Voice-to-Brain: Phone Calls That Create Brain Pages
Call a phone number. Talk. A structured brain page appears with entity detection, cross-references, and a summary posted to your messaging app.
IMPORTANT: Instructions for the Agent
You are the installer. This recipe is written for YOU (the AI agent) to execute on behalf of the user. Follow these instructions precisely.
Why sequential execution matters: Each step depends on the previous one:
- Step 1 validates prerequisites. If GBrain isn't configured, nothing else works.
- Step 2 collects credentials. If a credential is wrong, Steps 5-7 will silently fail.
- Step 3 creates the ngrok tunnel. Step 5 needs the ngrok URL for the Twilio webhook.
- Step 5 configures Twilio. Step 7 (smoke test) needs Twilio configured to reach your server.
Do not skip steps. Do not reorder steps. Do not batch multiple steps.
Stop points (MUST pause and verify before continuing):
- After Step 1: all prerequisites pass? If not, fix before proceeding.
- After each credential in Step 2: validation passes? If not, help the user fix it.
- After Step 6: health check passes? If not, debug before smoke test.
- After Step 7: brain page created? If not, troubleshoot before declaring success.
When something fails: Tell the user EXACTLY what failed, what it means, and what to try. Never say "something went wrong." Say "Twilio returned a 401, which means the auth token is incorrect. Let's re-enter it."
Architecture
Two pipeline options:
Option A: OpenAI Realtime (turnkey, simpler)
Caller (phone)
↓ Twilio (WebSocket, g711_ulaw audio — no transcoding)
Voice Server (Node.js, your machine or cloud)
↓↑ OpenAI Realtime API (STT + LLM + TTS in one pipeline)
↓ Function calls during conversation
GBrain MCP (semantic search, page reads, page writes)
↓ Post-call
Brain page created (meetings/YYYY-MM-DD-call-{caller}.md)
Summary posted to messaging app (Telegram/Slack/Discord)
Option B: DIY STT+LLM+TTS (full control, production-grade)
Caller (phone or WebRTC browser)
↓ Twilio WebSocket OR WebRTC
Voice Server (Node.js)
↓ Deepgram STT (streaming speech-to-text, speaker diarization)
↓ Claude API (streaming SSE, sentence-boundary dispatch)
↓ Cartesia / OpenAI TTS (text-to-speech, low latency)
↓ Function calls during conversation
GBrain MCP (semantic search, page reads, page writes)
↓ Post-call
Brain page + audio upload + transcript storage
Why v2 (Option B)? OpenAI Realtime is a black box — you can't control STT quality, swap LLMs, or debug audio issues. The DIY stack gives you transparent Deepgram+Claude+TTS with full control over each stage. Trade-off: more integration work, but you own the pipeline.
Production-tested v2 architecture (pipeline.mjs, ~250 lines):
- Streaming SSE from Claude with sentence-boundary TTS dispatch
- 20-turn conversation history cap (prevents context bloat)
- Reconnect logic with exponential backoff on STT/TTS disconnects
- Periodic keepalives to prevent WebSocket timeout
- Audio endpointing for natural turn-taking
- Smart VAD (Silero) as default with push-to-talk fallback
Opinionated Defaults
These are production-tested defaults from a real deployment. Customize after setup.
Caller routing (prompt-based, enforced server-side):
- Owner: OTP challenge via secure channel, then full access (read + write + gateway)
- Trusted contacts: callback verification, scoped write access
- Known contacts (brain score >= 4): warm greeting by name, offer to transfer
- Unknown callers: screen, ask name + reason, take message
Security:
- Twilio signature validation on
/voiceendpoint (X-Twilio-Signature header) - Unauthenticated callers never see write tools
- Caller ID is NOT trusted for auth (OTP or callback required)
Setup Flow
Step 1: Check Prerequisites
STOP if any check fails. Fix before proceeding.
Run these checks and report results to the user:
# 1. Verify GBrain is configured
gbrain doctor --json
If this fails: "GBrain isn't set up yet. Let's run gbrain init --supabase first."
# 2. Verify Node.js 18+
node --version
If missing or < 18: "Node.js 18+ is required. Install it: https://nodejs.org/en/download"
# 3. Check if ngrok is installed
which ngrok
If missing:
- Mac: "Run
brew install ngrokin your terminal." - Linux: "Run
snap install ngrokor download from https://ngrok.com/download"
Tell the user: "All prerequisites checked. [N/3 passed]. [List any that failed and how to fix.]"
Step 2: Collect and Validate Credentials
Ask for each credential ONE AT A TIME. Validate IMMEDIATELY. Do not proceed to the next credential until the current one validates.
Credential 1: Twilio Account SID + Auth Token
Tell the user: "I need your Twilio Account SID and Auth Token. Here's exactly where to find them:
- Go to https://www.twilio.com/console (sign up free if you don't have an account)
- After logging in, you'll see your Account SID right on the main dashboard (it starts with 'AC' followed by 32 characters)
- Below it you'll see Auth Token — click 'Show' to reveal it
- Copy both values and paste them to me"
After the user provides them, validate immediately:
curl -s -u "$TWILIO_ACCOUNT_SID:$TWILIO_AUTH_TOKEN" \
"https://api.twilio.com/2010-04-01/Accounts/$TWILIO_ACCOUNT_SID.json" \
| grep -q '"status"' \
&& echo "PASS: Twilio credentials valid" \
|| echo "FAIL: Twilio credentials invalid — double-check the SID starts with AC and the auth token is correct"
If validation fails: "That didn't work. Common issues: (1) the SID should start with 'AC', (2) make sure you clicked 'Show' to reveal the auth token and copied the full value, (3) if you just created the account, wait 30 seconds and try again."
STOP HERE until Twilio validates.
Credential 2: OpenAI API Key
Tell the user: "I need your OpenAI API key. Here's exactly where to get one:
- Go to https://platform.openai.com/api-keys
- Click '+ Create new secret key' (top right)
- Name it something like 'gbrain-voice'
- Click 'Create secret key'
- Copy the key immediately — you won't be able to see it again after closing the dialog
- Paste it to me
Note: your OpenAI account needs Realtime API access. Most accounts have it by default."
After the user provides it, validate immediately:
curl -sf -H "Authorization: Bearer $OPENAI_API_KEY" \
https://api.openai.com/v1/models > /dev/null \
&& echo "PASS: OpenAI key valid" \
|| echo "FAIL: OpenAI key invalid — make sure you copied the full key (starts with sk-)"
If validation fails: "That didn't work. Common issues: (1) the key starts with 'sk-', (2) make sure you copied the entire key (it's long), (3) if you just created it, it's active immediately — no delay needed."
STOP HERE until OpenAI validates.
Credential 3: ngrok Account (Hobby tier recommended)
Tell the user: "I need your ngrok auth token. I strongly recommend the Hobby tier ($8/mo) because it gives you a fixed domain that never changes. With the free tier, your URL changes every time ngrok restarts, breaking Twilio and Claude Desktop.
- Go to https://dashboard.ngrok.com/signup (sign up)
- Recommended: Go to https://dashboard.ngrok.com/billing and upgrade to Hobby ($8/mo). This gives you a fixed domain.
- If you upgraded: go to https://dashboard.ngrok.com/domains and click
'+ New Domain'. Choose a name (e.g.,
your-brain-voice.ngrok.app). - Go to https://dashboard.ngrok.com/get-started/your-authtoken
- Copy your Authtoken and paste it to me
- Also tell me your fixed domain name (if you created one)"
ngrok config add-authtoken $NGROK_TOKEN \
&& echo "PASS: ngrok configured" \
|| echo "FAIL: ngrok auth token rejected"
If user has a fixed domain, use --url flag (Step 3 below).
If user stayed on free tier, URLs will change on restart (the watchdog handles this).
Credential 4: Messaging Platform (for call summaries)
Ask the user: "Where should I send call summaries? Options: Telegram, Slack, or Discord."
Based on their choice:
- Telegram: "Create a bot via @BotFather on Telegram, copy the bot token, and
tell me which chat/group to send summaries to."
Validate:
curl -sf "https://api.telegram.org/bot$TOKEN/getMe" | grep -q '"ok":true' - Slack: "Create an Incoming Webhook at https://api.slack.com/apps → your app →
Incoming Webhooks → Add New. Copy the webhook URL."
Validate:
curl -sf -X POST -d '{"text":"GBrain voice test"}' $WEBHOOK_URL - Discord: "Go to your server → channel settings → Integrations → Webhooks →
New Webhook. Copy the webhook URL."
Validate:
curl -sf -X POST -H "Content-Type: application/json" -d '{"content":"GBrain voice test"}' $WEBHOOK_URL
Tell the user: "All credentials validated. Moving to server setup."
Step 3: Start ngrok Tunnel
# With fixed domain (Hobby tier — recommended):
ngrok http 8765 --url your-brain-voice.ngrok.app
# Without fixed domain (free tier — URL changes on restart):
ngrok http 8765
If using a fixed domain, the URL is always https://your-brain-voice.ngrok.app.
If using free tier, copy the URL from the ngrok output (changes every restart).
Note: ngrok runs in the foreground. Run it in a background process or new terminal tab.
The same ngrok account can also serve your GBrain MCP server (see ngrok-tunnel recipe for the full multi-service pattern).
Step 4: Create Voice Server
Create the voice server directory and install dependencies:
mkdir -p voice-agent && cd voice-agent
npm init -y
npm install ws express
The voice server needs these components in server.mjs:
-
HTTP server on port 8765 with:
POST /voice— returns TwiML that opens a WebSocket media stream to/wsGET /health— returns{ ok: true }- Twilio signature validation (
X-Twilio-Signatureheader) on/voice
-
WebSocket handler at
/wsthat:- Accepts Twilio media stream (g711_ulaw audio)
- Opens a second WebSocket to
wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview - Bridges audio bidirectionally (no transcoding — both sides use g711_ulaw)
- Handles
response.function_call_arguments.doneevents from OpenAI (tool execution) - Sends tool results back via
conversation.item.createwith typefunction_call_output
-
System prompt builder that takes caller phone number and returns:
- Appropriate greeting based on caller routing rules
- Available tools (read-only for unauthenticated, full for authenticated)
- Instructions: "You are a voice assistant. Search the brain before answering questions. Take messages from unknown callers. Never hang up first."
-
Tool executor that:
- Spawns GBrain MCP client (
gbrain serveas stdio child process) - Routes function calls:
search_brain→gbrain query,lookup_person→gbrain search+gbrain get - Gates write tools behind authentication
- Spawns GBrain MCP client (
-
Post-call handler that:
- Saves transcript to
brain/meetings/YYYY-MM-DD-call-{caller}.md - Posts summary to the user's messaging platform
- Runs
gbrain sync --no-pull --no-embedto index the new page
- Saves transcript to
-
WebRTC endpoint (optional, for browser-based calling):
POST /session— accepts SDP offer, forwards to OpenAI Realtime/v1/realtime/callsas multipart form-data, returns SDP answerGET /call— serves a web client HTML page with:- WebRTC connection to OpenAI Realtime API
- RNNoise WASM noise suppression (AudioWorklet)
- Push-to-talk AND auto-VAD mode switching
- Pipeline: Microphone → RNNoise denoise → MediaStream → WebRTC → OpenAI
POST /tool— receives tool calls from the WebRTC data channel, executes them, returns results- This lets users call the voice agent from a browser tab instead of a phone
WebRTC session creation pseudocode:
POST /session: sdp = request.body // caller's SDP offer sessionConfig = JSON.stringify({ type: 'realtime', model: 'gpt-4o-realtime-preview', audio: { output: { voice: VOICE } }, instructions: buildPrompt(null), tools: TOOL_SETS.unauthenticated, }) // Use native FormData (Node 18+) — NOT manual multipart fd = new FormData() fd.set('sdp', sdp) fd.set('session', sessionConfig) response = POST 'https://api.openai.com/v1/realtime/calls' Authorization: Bearer OPENAI_API_KEY body: fd // fetch() sets Content-Type automatically return response.text() // SDP answerImportant WebRTC gotchas:
voicegoes underaudio.output.voice, not top-level- Do NOT send
turn_detectionin session config (not accepted by/v1/realtime/calls) - Do NOT send
session.updateon connect (server already configured it) - All
session.updatecalls must includetype: 'realtime'to avoid session.type errors input_audio_transcriptionis NOT supported over WebRTC data channel — use Whisper post-call on recorded audio instead- Trigger greeting via data channel after WebRTC connects
Reference implementation: The architecture above and the OpenAI Realtime API docs (https://platform.openai.com/docs/guides/realtime) provide the building blocks.
Step 5: Configure Twilio Phone Number
Tell the user: "Now I need to set up your Twilio phone number. Here's what to do:
- Go to https://www.twilio.com/console/phone-numbers/search
- Search for a number (pick your area code or any available number)
- Click 'Buy' next to the number you want (costs $1-2/month)
- After purchase, go to https://www.twilio.com/console/phone-numbers/incoming
- Click on your new number
- Scroll to 'Voice Configuration'
- Under 'A call comes in', select 'Webhook'
- Enter:
https://YOUR-NGROK-URL.ngrok-free.app/voice - Method: HTTP POST
- Click 'Save configuration'
- Tell me the phone number you purchased"
Or if the user prefers CLI:
# Buy a number (US local)
twilio phone-numbers:buy:local --area-code 415
# Configure webhook
twilio phone-numbers:update PHONE_SID \
--voice-url https://YOUR-NGROK-URL.ngrok-free.app/voice \
--voice-method POST
Step 6: Start Voice Server and Verify
cd voice-agent && node server.mjs
STOP and verify:
curl -sf http://localhost:8765/health && echo "Voice server: running" || echo "Voice server: NOT running"
If not running: check the server logs for errors. Common issues:
- Port 8765 already in use:
lsof -i :8765to find what's using it - Missing environment variables: make sure OPENAI_API_KEY is set
- Module not found: run
npm installagain
Step 7: Smoke Test (Outbound Call)
This is the magical moment. The agent calls the USER to prove the system works.
Tell the user: "Your phone is about to ring. Pick up and talk for about 30 seconds. Say something like 'Hey, I'm testing my new voice-to-brain system. Remind me to check the quarterly numbers tomorrow.' When you're done, hang up."
curl -X POST "https://api.twilio.com/2010-04-01/Accounts/$TWILIO_ACCOUNT_SID/Calls.json" \
--data-urlencode "To=USER_PHONE_NUMBER" \
--data-urlencode "From=TWILIO_PHONE_NUMBER" \
--data-urlencode "Url=https://YOUR-NGROK-URL.ngrok-free.app/voice" \
-u "$TWILIO_ACCOUNT_SID:$TWILIO_AUTH_TOKEN"
After the call ends, verify ALL of these:
- Messaging notification arrived with call summary
- Brain page exists:
gbrain search "call" --limit 1 - The brain page has: transcript, entity mentions, action items
If the smoke test fails:
- No ring: check Twilio console for error logs at https://www.twilio.com/console/debugger
- Ring but no voice: check ngrok tunnel is up, check OpenAI key is valid
- Voice works but no brain page: check post-call handler logs, run
gbrain syncmanually - Brain page but no messaging: check messaging bot token is valid
STOP HERE until the smoke test passes. Do not declare success until the user confirms they received the messaging notification AND the brain page exists.
Step 8: Set Up Inbound Calling
Tell the user: "The smoke test passed — voice-to-brain is live! Your number is [TWILIO_NUMBER]. Now let's set up inbound calling."
- Twilio webhook is already configured from Step 5
- Ask: "Do you want calls to your existing phone to forward to this number after a few rings? That way you answer if you can, and the voice agent picks up if you don't."
- Configure caller routing rules in the system prompt
- Add the user's phone number as the "owner" number for full access
Step 9: Watchdog (Auto-restart)
# Cron watchdog (every 2 minutes) — add to crontab
*/2 * * * * curl -sf http://localhost:8765/health > /dev/null || (cd /path/to/voice-agent && node server.mjs >> /tmp/voice-agent.log 2>&1 &)
If using ngrok, also set up URL monitoring (free ngrok URLs change on restart):
# Check if ngrok URL changed, update Twilio if so
NGROK_URL=$(curl -s http://localhost:4040/api/tunnels 2>/dev/null | grep -o '"public_url":"https://[^"]*' | grep -o 'https://.*')
if [ -n "$NGROK_URL" ]; then
twilio phone-numbers:update PHONE_SID --voice-url "$NGROK_URL/voice"
fi
Step 10: Log Setup Completion
mkdir -p ~/.gbrain/integrations/twilio-voice-brain
echo '{"ts":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'","event":"setup_complete","source_version":"0.8.1","status":"ok","details":{"phone":"TWILIO_NUMBER","deployment":"local+ngrok"}}' >> ~/.gbrain/integrations/twilio-voice-brain/heartbeat.jsonl
Tell the user: "Voice-to-brain is fully set up. Your number is [NUMBER]. Here's what happens now: anyone who calls gets screened by the voice agent. Known contacts get a warm greeting. Unknown callers leave a message. Every call creates a brain page with the full transcript, and you get a summary on [their messaging platform]. The watchdog restarts the server if it crashes."
Cost Estimate
| Component | Monthly Cost | Source |
|---|---|---|
| Twilio phone number | $1-2/mo | Twilio pricing |
| Twilio voice minutes (100 min) | $1-2/mo | $0.0085-0.015/min depending on direction |
| OpenAI Realtime input (100 min) | $6/mo | $0.06/min |
| OpenAI Realtime output (50 min) | $12/mo | $0.24/min |
| ngrok (free tier) | $0 | Static domain: $8/mo |
| Total estimate | $20-22/mo | For ~100 min of calls |
Troubleshooting
Calls don't connect:
- Check ngrok:
curl http://localhost:4040/api/tunnels— if empty, ngrok isn't running - Check voice server:
curl http://localhost:8765/health— should return{"ok":true} - Check Twilio debugger: https://www.twilio.com/console/debugger — shows webhook errors
- Check webhook URL: go to https://www.twilio.com/console/phone-numbers/incoming, click your number, verify the webhook URL matches your ngrok URL
Voice agent doesn't respond:
- Check OpenAI key: the validation command from Step 2 should still pass
- Check server logs for WebSocket errors (look for "connection refused" or "401")
- Verify Realtime API access: not all OpenAI accounts have it. Check https://platform.openai.com/docs/guides/realtime
Brain pages not created after call:
- Run
gbrain doctor— if it fails, the database connection is broken - Check if the post-call handler ran (look in server logs for "transcript saved")
- Run
gbrain syncmanually to force indexing - Check file permissions on the brain repo directory
ngrok URL keeps changing:
- Free ngrok URLs change every time ngrok restarts
- The watchdog (Step 9) handles this automatically
- For a permanent URL: upgrade to ngrok paid ($8/mo) for a static domain, or deploy to Fly.io/Railway instead
Note on Option B credentials: If using the DIY pipeline (Option B), you will also need API keys for your chosen STT provider (e.g., Deepgram) and TTS provider (e.g., Cartesia, OpenAI TTS). Collect and validate these during Step 2 alongside the Twilio and OpenAI credentials listed above.
Critical Production Fixes (v0.8.1)
These are NOT optional. They prevent real production failures discovered in a deployment handling daily calls.
Unicode Crash Fix (CRITICAL)
Problem: Em dashes (--), arrows (->), and other non-ASCII characters in the prompt context cause broken surrogate pairs that crash the Twilio WebSocket connection. Phone calls drop silently.
Fix: Replace ALL non-ASCII characters with ASCII equivalents throughout the entire prompt file before sending to Twilio. This is invisible in development (browsers handle unicode fine) and catastrophic in production.
function sanitizeForTwilio(text) {
return text
.replace(/[\u2014\u2013]/g, '--') // em/en dash
.replace(/[\u2018\u2019]/g, "'") // smart quotes
.replace(/[\u201C\u201D]/g, '"') // smart double quotes
.replace(/\u2192/g, '->') // right arrow
.replace(/\u2190/g, '<-') // left arrow
.replace(/[\u2026]/g, '...') // ellipsis
.replace(/[^\x00-\x7F]/g, '') // strip remaining non-ASCII
}
PII Scrub from Voice Context (CRITICAL)
Problem: Brain context loaded into the voice prompt may contain phone numbers, email addresses, and other PII. The voice agent reads these aloud to callers.
Fix: Regex-strip PII from all voice context before injecting into the prompt:
- Phone numbers:
/\+?\d[\d\s\-().]{7,}\d/g - Email addresses:
/[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/g - URLs with auth tokens or API keys
- Any string matching common credential patterns
Identity-First Prompt (IMPORTANT)
Problem: Voice agents lose their identity mid-conversation. Saying "You are NOT Claude" doesn't stick. The model reverts to its base persona.
Fix: Put identity FIRST in the system prompt, before any context or rules:
# You ARE [Agent Name]
You are [Name], a voice assistant who works with [Brain Name].
You are NOT Claude. You are NOT a general AI assistant.
[Name] has their own personality: [traits].
# Context
[... brain context, calendar, tasks ...]
# Rules
[... behavioral rules ...]
Positioning identity before context ensures the model sees it first and maintains it throughout the conversation.
Auto-Upload Call Audio (RECOMMENDED)
Problem: If post-call processing fails, the call audio is lost forever.
Fix: Auto-upload ALL call audio immediately on call end:
- Twilio calls: download the MP3 recording URL from Twilio
- WebRTC calls: capture via MediaRecorder (webm/opus format)
- Upload via
gbrain files upload-raw <audio-file> --page meetings/call-slug --type call-recording - GBrain auto-routes: small files stay in git, large files go to cloud storage
with
.redirect.yamlpointer. Files >= 100 MB use TUS resumable upload. - Generate signed URLs for playback:
gbrain files signed-url <storage-path> - This ensures every call has a recoverable audio source regardless of whether the transcript or brain page was created successfully
Smart VAD as Default
Problem: Push-to-talk is unnatural on phone calls. Server-side VAD has variable quality.
Fix: Default to Smart VAD (Silero VAD) for voice activity detection:
- Better endpointing than server-side VAD
- Fewer false triggers in noisy environments
- PTT available as fallback (UI toggle for WebRTC clients)
- Presets: quiet (0.7 threshold), normal (0.85), noisy (0.95), very_noisy (0.98)
Production Patterns (Recommended)
These patterns come from a production voice deployment handling real calls daily. They are NOT required for basic setup. Implement them AFTER the smoke test passes. Each pattern is self-contained and optional.
Agent Identity & Engagement
Identity Separation
Problem: A voice agent pretending to be the full AI system creates uncanny valley. Pattern: The voice agent picks its own name and personality, distinct from the main AI brain. "I work with [Brain], [Owner]'s AI." Lighter, more playful, more curious.
Pre-Computed Bid System
Problem: Dead air kills engagement. Voice agents wait passively. Pattern: At call start, scan live context and pre-compute up to 10 engagement bids. Two types: informative (tasks, calendar, social monitoring) and relational (curiosity templates). Bids go INTO the prompt so the agent picks from a list. Use bids #1 and #2 for greeting, cycle the rest during conversation. Never ask "anything else?" — bring up the next bid.
Context-First Prompt
Problem: Voice agent greets generically because it doesn't know what's happening today. Pattern: Load live context at call start: tasks, calendar, location, social monitoring, morning briefing. Position context FIRST in the prompt (before rules) so the model sees it immediately and uses it in the greeting. Try/catch per section. Cap 500-1000 chars each.
Proactive Advisor Mode
Problem: Voice agents are reactive task machines. Pattern: The agent drives the conversation. Anticipate decisions on stale tasks. Suggest capitalizing on trending items. Connect upcoming events with brain context. "Dead air is your enemy" — fill every pause. Never wait passively.
Conversation Timing (the #1 fix)
Problem: Voice agents interrupt mid-thought AND go silent when the caller is done. Both feel terrible. Early "fill every pause" instructions cause the agent to talk over the caller while they're thinking. Pattern: Replace blanket "never be silent" with nuanced timing rules:
- Caller talking or thinking: SHUT UP. Even 3-5 second pauses mid-thought, wait. Incomplete sentence or mid-story = still thinking. Do not interrupt.
- Caller done (complete thought + 2-3 seconds silence): NOW respond. Use a bid, ask a follow-up, or pivot to the next topic.
- Detection heuristic: Incomplete sentence = still thinking. Complete statement + silence = done. Question directed at you = respond immediately.
- Hard rule: Never let silence go past 5 seconds after a COMPLETE thought.
Add this as a labeled section in the system prompt (e.g., # CRITICAL: Conversation Timing)
positioned prominently so the model sees it early. This came from real usage feedback
and is the single highest-impact voice quality improvement.
No Repetition Rule
Problem: Voice agent cycles back to the same bid multiple times in a call. Pattern: Add to the system prompt: "Do NOT repeat yourself. If you already said something, move to the NEXT bid. Vary your responses." Simple but addresses a real annoyance that compounds over longer calls.
Prompt Engineering
Radical Prompt Compression
Problem: Long system prompts increase latency and cost on every turn. Pattern: Compress aggressively. Production went 13K to 4.7K tokens (65% cut). Bullets over prose, cut repetition, behavior-first. Every token costs latency + money.
OpenAI Realtime Prompting Guide Structure
Problem: Prose paragraphs parse slowly for the model.
Pattern: Use labeled markdown sections: # Role & Objective, # Personality & Tone,
# Rules, # Conversation Flow with state machine substates (## State 1: VERIFY,
## State 2: GREETING, ## State 3: CONVERSATION), # Trust.
Auth-Before-Speech
Problem: Auth flow adds dead air at call start. Pattern: Call the auth tool BEFORE speaking any greeting. Then speak "Hey, code's on its way." Shaves seconds off the round-trip.
Brain Escalation
Problem: Voice agent can't answer complex questions that need the full brain. Pattern: If caller says "talk to [Brain]" or asks a deep question, immediately route to main AI via gateway tool with verbal bridge: "one sec, checking with [Brain]."
Call Reliability
Stuck Watchdog
Problem: Calls go silent when VAD stalls or tool execution hangs.
Pattern: 20-second timer. If no audio out: clear input buffer, inject "you still
there?" system message, force response.create.
Never Hang Up
Problem: AI agents try to end calls. Pattern: Hard prompt rule: only the caller decides when the call ends. Never say goodbye, "I'll let you go," or wrap-up language. If silence, ask "you still there?"
Thinking Sound
Problem: Dead air during slow tool execution. Pattern: Pre-generate g711_ulaw audio chunks in a JSON array. Loop at 20ms intervals during slow tools (brain search, web lookup). Stop when tool result returns.
Fallback TwiML
Problem: Voice agent crashes, callers get silence.
Pattern: /fallback endpoint returns TwiML forwarding to owner's cell. Configure as
Twilio fallback URL.
Authentication & Authorization
Tool Set Architecture
Problem: Unauthenticated callers accessing write operations.
Pattern: Four sets: READ_TOOLS (all callers), WRITE_TOOLS (owner), SCOPED_WRITE_TOOLS
(trusted users), GATEWAY_TOOLS (authenticated). LLM doesn't see write tools until auth
succeeds. Upgrade via session.update with new tools array. All session.update calls
must include type: 'realtime'.
Trusted User Auth with Callback
Problem: People other than the owner need authenticated access. Pattern: Phone registry + callback verification. Each user gets a scope: full, household, content, operational. Scope determines which tools they access.
Caller Routing
Problem: Different callers need different experiences.
Pattern: buildPrompt(callerPhone) returns different system prompts: owner (OTP),
trusted (callback), inner circle (warm greeting + transfer), known (greeting, message),
unknown (screen + message).
Voice Quality
Dynamic VAD / Noise Mode
Problem: Background noise causes false triggers or missed speech.
Pattern: set_noise_mode tool adjusts VAD threshold mid-call. Presets: quiet (0.7),
normal (0.85), noisy (0.95), very_noisy (0.98). Agent calls proactively on noise.
On-Screen Debug UI
Problem: console.log is useless when testing from a phone. Pattern: WebRTC client displays tool calls, results, errors, and key events inline.
Real-Time Awareness
Live Moment Capture
Problem: Important things said during a call are lost if the call drops or the
post-call summary tool doesn't fire.
Pattern: When the caller shares something important (feedback, ideas, personal
stories, decisions), log it in real-time using a log_voice_request tool. Don't
wait until the call ends. Tell the caller: "Got that, sending it to [Brain] now."
Also stream key moments to [messaging platform] during the call so the main agent
has awareness before the call is over.
Belt-and-Suspenders Post-Call
Problem: Post-call processing depends on the voice agent remembering to call the
post_call_summary tool. If the call drops or the agent forgets, the call is lost.
Pattern: Both the tool-based AND the automatic call-end handler should post
structured signals. The call-end handler (fires on WebSocket close or /call-end)
should post to [messaging platform] with:
- Audio file path
- Transcript file path (or warning if missing)
- Tools used during the call
- Explicit instruction: "[Brain]: Read the call, summarize, take action."
This ensures every call gets processed regardless of whether the voice agent remembered to call the summary tool. Belt and suspenders.
Post-Call Processing
Mandatory 3-Step Post-Call
Problem: Main agent doesn't know a call happened. Pattern: Every call ends with three steps:
- Messaging notification — summary to [messaging platform]
- Transcript to brain —
brain/meetings/YYYY-MM-DD-call-{caller}.md - Audio to storage — Twilio MP3 or WebRTC webm/opus, uploaded to cloud storage
WebRTC Audio + Transcript Parity
Problem: WebRTC calls don't go through Twilio, no automatic logging.
Pattern: Client captures audio (MediaRecorder, webm/opus) and transcript (per-turn
POST to /transcript). On call end, POST to /call-end saves JSON log. Both channels
produce identical output formats. Note: input_audio_transcription is NOT supported
over WebRTC data channel — use Whisper post-call instead.
Dual API Event Handling
Problem: OpenAI Realtime API changed event names.
Pattern: Handle both response.audio.delta (old) and response.output_audio.delta
(new). Same for .done events. Future-proofs against API changes.
Brain Query Optimization
Report-Aware Query Routing
Problem: Voice queries about specific topics trigger slow vector searches. Pattern: Check the question against a keyword map BEFORE full brain search:
| Keyword | Report Loaded |
|---|---|
| email, inbox, mail | inbox sweep report |
| social, twitter, mentions | social engagement report |
| briefing, morning | morning briefing |
| meeting | meeting sync report |
| slack | slack scan report |
| content, ideas | content ideas report |
Load up to 2,500 chars of matching report. Break after first match. Fall back to full brain search if no keyword matches.