feat: Voice v0.8.0 + feature discovery + Edge Function removal (#55)

* chore: remove Supabase Edge Function MCP deployment

The Edge Function never worked reliably. All MCP traffic goes through self-hosted server + ngrok tunnel. Removes deploy-remote.sh, edge-entry.ts, supabase/functions/, .env.production.example, and CHATGPT.md (OAuth not implemented).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: rewrite MCP docs for self-hosted + ngrok deployment

All per-client guides updated from Edge Function URLs to self-hosted server + ngrok tunnel pattern. DEPLOY.md rewritten with local vs remote paths. ALTERNATIVES.md now shows self-hosted as primary, with ngrok, Tailscale, and Fly.io/Railway comparison.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: voice recipe v0.8.0 — 25 production patterns from real deployment

Identity separation, pre-computed bid system, conversation timing fix, proactive advisor mode, radical prompt compression, OpenAI Realtime Prompting Guide structure, auth-before-speech, brain escalation, stuck watchdog, never-hang-up rule, thinking sounds, fallback TwiML, tool set architecture, trusted user auth, caller routing, dynamic VAD, on-screen debug UI, live moment capture, belt-and-suspenders post-call, mandatory 3-step post-call, WebRTC parity, dual API events, report-aware query routing. WebRTC pseudocode updated with native FormData and 6 gotchas.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: post-upgrade feature discovery framework

upgrade.ts captures old version before upgrading, then execs gbrain post-upgrade (new binary) to read migration files and print feature pitches. Migration files get YAML frontmatter with feature_pitch field (headline, description, recipe, tiers). CLI prints excited builder tone post-upgrade. v0.8.0 migration offers voice setup with environment detection (server vs local) and 3-tier progressive disclosure.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: add Voice section to README with WebRTC screenshot + tweet link

Her out of the box: voice-to-brain with 25 production patterns. WebRTC client screenshot embedded. Remote MCP section rewritten for self-hosted + ngrok. Setup block genericized.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: add recipe validation tests + genericize personal refs

5 new integration tests: secrets completeness, semver version, requires resolution, all-recipes-parse, no-personal-references. Test fixture genericized. CLAUDE.md/TODOS.md/SKILLPACK updated for v0.8.0. build:edge script removed from package.json.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v0.8.0)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-11 10:52:30 -10:00
parent 0ca2e86acb
commit 91ced664b6
30 changed files with 786 additions and 1132 deletions
--- a/recipes/twilio-voice-brain.md
+++ b/recipes/twilio-voice-brain.md
@@ -1,7 +1,7 @@
 ---
 id: twilio-voice-brain
 name: Voice-to-Brain
-version: 0.7.0
+version: 0.8.0
 description: Phone calls create brain pages via Twilio + OpenAI Realtime + GBrain MCP. Callers talk, brain pages appear.
 category: sense
 requires: [ngrok-tunnel]
@@ -230,7 +230,7 @@ If using free tier, copy the URL from the ngrok output (changes every restart).
 Note: ngrok runs in the foreground. Run it in a background process or new terminal tab.

 The same ngrok account can also serve your GBrain MCP server (see
-[ngrok Setup](docs/mcp/NGROK_SETUP.md) for the full multi-service pattern).
+[ngrok-tunnel recipe](recipes/ngrok-tunnel.md) for the full multi-service pattern).

 ### Step 4: Create Voice Server

@@ -286,18 +286,23 @@ The voice server needs these components in `server.mjs`:
   ```
   POST /session:
     sdp = request.body  // caller's SDP offer
-     form = new FormData()
-     form.append('sdp', sdp)
-     form.append('session', JSON.stringify({
+
+     sessionConfig = JSON.stringify({
       type: 'realtime',
       model: 'gpt-4o-realtime-preview',
-       audio: {output: {voice: VOICE}},
-       instructions: buildPrompt(null)
-     }))
+       audio: { output: { voice: VOICE } },
+       instructions: buildPrompt(null),
+       tools: TOOL_SETS.unauthenticated,
+     })
+
+     // Use native FormData (Node 18+) — NOT manual multipart
+     fd = new FormData()
+     fd.set('sdp', sdp)
+     fd.set('session', sessionConfig)

     response = POST 'https://api.openai.com/v1/realtime/calls'
       Authorization: Bearer OPENAI_API_KEY
-       body: form
+       body: fd   // fetch() sets Content-Type automatically

     return response.text()  // SDP answer
   ```
@@ -306,6 +311,8 @@ The voice server needs these components in `server.mjs`:
   - `voice` goes under `audio.output.voice`, not top-level
   - Do NOT send `turn_detection` in session config (not accepted by `/v1/realtime/calls`)
   - Do NOT send `session.update` on connect (server already configured it)
+   - All `session.update` calls must include `type: 'realtime'` to avoid session.type errors
+   - `input_audio_transcription` is NOT supported over WebRTC data channel — use Whisper post-call on recorded audio instead
   - Trigger greeting via data channel after WebRTC connects

 **Reference implementation:** The architecture above and the OpenAI Realtime API
@@ -464,3 +471,198 @@ The watchdog restarts the server if it crashes."
 - Free ngrok URLs change every time ngrok restarts
 - The watchdog (Step 9) handles this automatically
 - For a permanent URL: upgrade to ngrok paid ($8/mo) for a static domain, or deploy to Fly.io/Railway instead
+
+## Production Patterns (Recommended)
+
+These patterns come from a production voice deployment handling real calls daily.
+They are NOT required for basic setup. **Implement them AFTER the smoke test passes.**
+Each pattern is self-contained and optional.
+
+### Agent Identity & Engagement
+
+#### Identity Separation
+**Problem:** A voice agent pretending to be the full AI system creates an uncanny-valley effect.
+**Pattern:** The voice agent picks its own name and personality, distinct from the main
+AI brain. "I work with [Brain], [Owner]'s AI." Lighter, more playful, more curious.
+
+#### Pre-Computed Bid System
+**Problem:** Dead air kills engagement. Voice agents wait passively.
+**Pattern:** At call start, scan live context and pre-compute up to 10 engagement bids.
+Two types: informative (tasks, calendar, social radar) and relational (curiosity templates).
+Bids go INTO the prompt so the agent picks from a list. Use bids #1 and #2 for greeting,
+cycle the rest during conversation. Never ask "anything else?" — bring up the next bid.
+
+#### Context-First Prompt
+**Problem:** Voice agent greets generically because it doesn't know what's happening today.
+**Pattern:** Load live context at call start: tasks, calendar, location, social radar,
+morning briefing. Position context FIRST in the prompt (before rules) so the model sees
+it immediately and uses it in the greeting. Try/catch per section. Cap 500-1000 chars each.
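A minimal sketch of the context-first assembly described above, not part of the recipe as committed. The names `buildContextBlock` and `buildPromptWithContext`, the `{ label, load }` section shape, and the `# Live Context` heading are all assumptions made for illustration. Loaders are synchronous here to keep the example short; real ones would likely be async.

```javascript
// sections: [{ label, load }] where load() returns a string and may throw.
// maxChars defaults to the upper end of the 500-1000 char cap.
function buildContextBlock(sections, maxChars = 1000) {
  const parts = [];
  for (const { label, load } of sections) {
    try {
      const text = String(load()).trim();
      if (text) parts.push(`## ${label}\n${text.slice(0, maxChars)}`);
    } catch {
      // One failing source must not block the call; skip it and move on.
    }
  }
  return parts.join('\n\n');
}

// Context goes FIRST, rules after, so the model sees today's state up front.
function buildPromptWithContext(sections, rules) {
  const context = buildContextBlock(sections);
  return context ? `# Live Context\n${context}\n\n${rules}` : rules;
}
```

A section whose loader throws is dropped silently, so a broken calendar integration degrades the greeting rather than the call.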
+
+#### Proactive Advisor Mode
+**Problem:** Voice agents are reactive task machines.
+**Pattern:** The agent drives the conversation. Anticipate decisions on stale tasks.
+Suggest capitalizing on trending items. Connect upcoming events with brain context.
+"Dead air is your enemy": fill pauses once the caller has finished a thought, subject
+to the Conversation Timing rules. Never wait passively.
+
+#### Conversation Timing (the #1 fix)
+**Problem:** Voice agents interrupt mid-thought AND go silent when the caller is done.
+Both feel terrible. Early "fill every pause" instructions cause the agent to talk over
+the caller while they're thinking.
+**Pattern:** Replace blanket "never be silent" with nuanced timing rules:
+- **Caller talking or thinking:** SHUT UP. Even 3-5 second pauses mid-thought, wait.
+  Incomplete sentence or mid-story = still thinking. Do not interrupt.
+- **Caller done** (complete thought + 2-3 seconds silence): NOW respond. Use a bid,
+  ask a follow-up, or pivot to the next topic.
+- **Detection heuristic:** Incomplete sentence = still thinking. Complete statement +
+  silence = done. Question directed at you = respond immediately.
+- **Hard rule:** Never let silence go past 5 seconds after a COMPLETE thought.
+
+Add this as a labeled section in the system prompt (e.g., `# CRITICAL: Conversation Timing`)
+positioned prominently so the model sees it early. This came from real usage feedback
+and is the single highest-impact voice quality improvement.
+
+#### No Repetition Rule
+**Problem:** Voice agent cycles back to the same bid multiple times in a call.
+**Pattern:** Add to the system prompt: "Do NOT repeat yourself. If you already said
+something, move to the NEXT bid. Vary your responses." Simple but addresses a real
+annoyance that compounds over longer calls.
+
+### Prompt Engineering
+
+#### Radical Prompt Compression
+**Problem:** Long system prompts increase latency and cost on every turn.
+**Pattern:** Compress aggressively. Production went 13K to 4.7K tokens (65% cut).
+Bullets over prose, cut repetition, behavior-first. Every token costs latency + money.
+
+#### OpenAI Realtime Prompting Guide Structure
+**Problem:** Prose paragraphs parse slowly for the model.
+**Pattern:** Use labeled markdown sections: `# Role & Objective`, `# Personality & Tone`,
+`# Rules`, `# Conversation Flow` with state machine substates (`## State 1: VERIFY`,
+`## State 2: GREETING`, `## State 3: CONVERSATION`), `# Trust`.
+
+#### Auth-Before-Speech
+**Problem:** Auth flow adds dead air at call start.
+**Pattern:** Call the auth tool BEFORE speaking any greeting. Then speak "Hey, code's on
+its way." Shaves seconds off the round-trip.
+
+#### Brain Escalation
+**Problem:** Voice agent can't answer complex questions that need the full brain.
+**Pattern:** If caller says "talk to [Brain]" or asks a deep question, immediately route
+to main AI via gateway tool with verbal bridge: "one sec, checking with [Brain]."
+
+### Call Reliability
+
+#### Stuck Watchdog
+**Problem:** Calls go silent when VAD stalls or tool execution hangs.
+**Pattern:** 20-second timer. If no audio out: clear input buffer, inject "you still
+there?" system message, force `response.create`.
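A minimal sketch of the watchdog above, not part of the recipe as committed. `createStuckWatchdog`, the `send` callback, and the injectable `now` clock are hypothetical names. The three event types are Realtime API client events, but the wording of the injected system message is an assumption.

```javascript
const STUCK_AFTER_MS = 20_000; // 20-second window from the pattern above

// send(event) is assumed to serialize the event onto the Realtime connection.
function createStuckWatchdog(send, now = Date.now) {
  let lastAudioOutAt = now();
  return {
    // Call whenever an audio delta is forwarded to the caller.
    noteAudioOut() {
      lastAudioOutAt = now();
    },
    // Call on an interval (for example once per second).
    check() {
      if (now() - lastAudioOutAt < STUCK_AFTER_MS) return false;
      send({ type: 'input_audio_buffer.clear' });
      send({
        type: 'conversation.item.create',
        item: {
          type: 'message',
          role: 'system',
          content: [{ type: 'input_text', text: 'The line went quiet. Ask: "you still there?"' }],
        },
      });
      send({ type: 'response.create' });
      lastAudioOutAt = now(); // restart the window so nudges are not spammed
      return true;
    },
  };
}
```

Injecting the clock keeps the timing logic testable without real timers; in a server it would typically be driven by `setInterval(() => watchdog.check(), 1000)`.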
+
+#### Never Hang Up
+**Problem:** AI agents try to end calls.
+**Pattern:** Hard prompt rule: only the caller decides when the call ends. Never say
+goodbye, "I'll let you go," or wrap-up language. If silence, ask "you still there?"
+
+#### Thinking Sound
+**Problem:** Dead air during slow tool execution.
+**Pattern:** Pre-generate g711_ulaw audio chunks in a JSON array. Loop at 20ms intervals
+during slow tools (brain search, web lookup). Stop when tool result returns.
+
+#### Fallback TwiML
+**Problem:** Voice agent crashes, callers get silence.
+**Pattern:** `/fallback` endpoint returns TwiML forwarding to owner's cell. Configure as
+Twilio fallback URL.
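A minimal sketch of the `/fallback` response body, not part of the recipe as committed. `fallbackTwiml` and `escapeXml` are hypothetical helper names, and the phone number in the usage note is a placeholder. `<Response>` and `<Dial>` are standard TwiML verbs.

```javascript
// Escape the five XML special characters so the value cannot break the markup.
function escapeXml(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;');
}

// Build TwiML that forwards the caller to the owner's number.
function fallbackTwiml(ownerNumber) {
  return (
    '<?xml version="1.0" encoding="UTF-8"?>' +
    `<Response><Dial>${escapeXml(ownerNumber)}</Dial></Response>`
  );
}
```

The handler would return `fallbackTwiml('+15550100')` with `Content-Type: text/xml`. Because the string is static, this route can be served by a process separate from the voice server, which is what makes it useful when the voice server itself is down.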
+
+### Authentication & Authorization
+
+#### Tool Set Architecture
+**Problem:** Unauthenticated callers accessing write operations.
+**Pattern:** Four sets: READ_TOOLS (all callers), WRITE_TOOLS (owner), SCOPED_WRITE_TOOLS
+(trusted users), GATEWAY_TOOLS (authenticated). LLM doesn't see write tools until auth
+succeeds. Upgrade via `session.update` with new tools array. All `session.update` calls
+must include `type: 'realtime'`.
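A minimal sketch of the four-set split and the post-auth upgrade, not part of the recipe as committed. The tool names (`search_brain`, `write_page`, `add_household_note`, `ask_brain`) and the role strings are placeholders; real entries would carry full function schemas.

```javascript
// Placeholder tool definitions, one per set.
const READ_TOOLS = [{ type: 'function', name: 'search_brain' }];
const WRITE_TOOLS = [{ type: 'function', name: 'write_page' }];
const SCOPED_WRITE_TOOLS = [{ type: 'function', name: 'add_household_note' }];
const GATEWAY_TOOLS = [{ type: 'function', name: 'ask_brain' }];

// Unauthenticated callers only ever see READ_TOOLS.
function toolsFor(role) {
  switch (role) {
    case 'owner':
      return [...READ_TOOLS, ...WRITE_TOOLS, ...GATEWAY_TOOLS];
    case 'trusted':
      return [...READ_TOOLS, ...SCOPED_WRITE_TOOLS, ...GATEWAY_TOOLS];
    default:
      return [...READ_TOOLS];
  }
}

// Sent after auth succeeds; `type: 'realtime'` is included per the rule above.
function buildToolUpgrade(role) {
  return {
    type: 'session.update',
    session: { type: 'realtime', tools: toolsFor(role) },
  };
}
```

Keeping write tools out of the initial session means a prompt-injected caller cannot talk the model into calling something it was never shown.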
+
+#### Trusted User Auth with Callback
+**Problem:** People other than the owner need authenticated access.
+**Pattern:** Phone registry + callback verification. Each user gets a scope: full,
+household, content, operational. Scope determines which tools they access.
+
+#### Caller Routing
+**Problem:** Different callers need different experiences.
+**Pattern:** `buildPrompt(callerPhone)` returns different system prompts: owner (OTP),
+trusted (callback), inner circle (warm greeting + transfer), known (greeting, message),
+unknown (screen + message).
+
+### Voice Quality
+
+#### Dynamic VAD / Noise Mode
+**Problem:** Background noise causes false triggers or missed speech.
+**Pattern:** `set_noise_mode` tool adjusts VAD threshold mid-call. Presets: quiet (0.7),
+normal (0.85), noisy (0.95), very_noisy (0.98). The agent calls it proactively when
+background noise becomes a problem.
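A minimal sketch of what a `set_noise_mode` handler might send, not part of the recipe as committed. The preset values come from the pattern above; `buildNoiseModeUpdate` is a hypothetical name, and placing `turn_detection` under `session.audio.input` is an assumption to check against the Realtime API version in use.

```javascript
const NOISE_PRESETS = { quiet: 0.7, normal: 0.85, noisy: 0.95, very_noisy: 0.98 };

// Build the mid-call session.update for a named preset.
function buildNoiseModeUpdate(mode) {
  if (!Object.hasOwn(NOISE_PRESETS, mode)) {
    throw new Error(`unknown noise mode: ${mode}`);
  }
  return {
    type: 'session.update',
    session: {
      type: 'realtime',
      // Assumed field path; a higher threshold needs louder speech to trigger.
      audio: {
        input: {
          turn_detection: { type: 'server_vad', threshold: NOISE_PRESETS[mode] },
        },
      },
    },
  };
}
```

Rejecting unknown modes matters because the mode string comes from a model-generated tool call.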
+
+#### On-Screen Debug UI
+**Problem:** console.log is useless when testing from a phone.
+**Pattern:** WebRTC client displays tool calls, results, errors, and key events inline.
+
+### Real-Time Awareness
+
+#### Live Moment Capture
+**Problem:** Important things said during a call are lost if the call drops or the
+post-call summary tool doesn't fire.
+**Pattern:** When the caller shares something important (feedback, ideas, personal
+stories, decisions), log it in real-time using a `log_voice_request` tool. Don't
+wait until the call ends. Tell the caller: "Got that, sending it to [Brain] now."
+Also stream key moments to [messaging platform] during the call so the main agent
+has awareness before the call is over.
+
+#### Belt-and-Suspenders Post-Call
+**Problem:** Post-call processing depends on the voice agent remembering to call the
+`post_call_summary` tool. If the call drops or the agent forgets, the call is lost.
+**Pattern:** Both the tool-based AND the automatic call-end handler should post
+structured signals. The call-end handler (fires on WebSocket close or `/call-end`)
+should post to [messaging platform] with:
+- Audio file path
+- Transcript file path (or warning if missing)
+- Tools used during the call
+- Explicit instruction: "[Brain]: Read the call, summarize, take action."
+
+This ensures every call gets processed regardless of whether the voice agent
+remembered to call the summary tool. Belt and suspenders.
+
+### Post-Call Processing
+
+#### Mandatory 3-Step Post-Call
+**Problem:** Main agent doesn't know a call happened.
+**Pattern:** Every call ends with three steps:
+1. **Messaging notification** — summary to [messaging platform]
+2. **Transcript to brain** — `brain/meetings/YYYY-MM-DD-call-{caller}.md`
+3. **Audio to storage** — Twilio MP3 or WebRTC webm/opus, uploaded to cloud storage
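A minimal sketch of building the transcript path from step 2, not part of the recipe as committed. `transcriptPath` and `slugify` are hypothetical helpers, and using the UTC date and an `unknown` fallback for empty caller names are assumptions.

```javascript
// Lowercase, collapse non-alphanumeric runs to '-', trim stray dashes.
function slugify(value) {
  return String(value)
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-')
    .replace(/^-+|-+$/g, '');
}

// Returns brain/meetings/YYYY-MM-DD-call-{caller}.md
function transcriptPath(callEndedAt, caller) {
  const day = callEndedAt.toISOString().slice(0, 10); // YYYY-MM-DD in UTC
  return `brain/meetings/${day}-call-${slugify(caller) || 'unknown'}.md`;
}
```

A deployment that files calls by the owner's local date would need to format the day in that timezone instead of UTC.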
+
+#### WebRTC Audio + Transcript Parity
+**Problem:** WebRTC calls don't go through Twilio, so there is no automatic logging.
+**Pattern:** Client captures audio (MediaRecorder, webm/opus) and transcript (per-turn
+POST to `/transcript`). On call end, POST to `/call-end` saves JSON log. Both channels
+produce identical output formats. Note: `input_audio_transcription` is NOT supported
+over WebRTC data channel — use Whisper post-call instead.
+
+#### Dual API Event Handling
+**Problem:** OpenAI Realtime API changed event names.
+**Pattern:** Handle both `response.audio.delta` (old) and `response.output_audio.delta`
+(new). Same for `.done` events. Future-proofs against API changes.
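A minimal sketch of handling both event names in one place, not part of the recipe as committed. `normalizeEventType` and `handleRealtimeEvent` are hypothetical names; the four event type strings are the old and new ones named in the pattern above.

```javascript
// Map old event names onto the new ones so handlers are written once.
const LEGACY_TO_CURRENT = {
  'response.audio.delta': 'response.output_audio.delta',
  'response.audio.done': 'response.output_audio.done',
};

function normalizeEventType(type) {
  return Object.hasOwn(LEGACY_TO_CURRENT, type) ? LEGACY_TO_CURRENT[type] : type;
}

// handlers is keyed by the NEW event names; returns true if one ran.
function handleRealtimeEvent(event, handlers) {
  const type = normalizeEventType(event.type);
  if (!Object.hasOwn(handlers, type)) return false;
  handlers[type](event);
  return true;
}
```

Normalizing at the edge keeps the rest of the server unaware of which API generation it is talking to.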
+
+### Brain Query Optimization
+
+#### Report-Aware Query Routing
+**Problem:** Voice queries about specific topics trigger slow vector searches.
+**Pattern:** Check the question against a keyword map BEFORE full brain search:
+
+| Keyword | Report Loaded |
+|---------|--------------|
+| email, inbox, mail | inbox sweep report |
+| social, twitter, mentions | social radar report |
+| briefing, morning | morning briefing |
+| meeting | meeting sync report |
+| slack | slack scan report |
+| content, ideas | content ideas report |
+
+Load up to 2,500 chars of matching report. Break after first match. Fall back to full
+brain search if no keyword matches.
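A minimal sketch of the keyword routing described above, not part of the recipe as committed. The keyword table mirrors the one in the pattern; `routeQuery`, the `loadReport` callback, and the returned object shape are assumptions. Matching is by plain substring, a simplification that would also match words like "slacker".

```javascript
const REPORT_ROUTES = [
  { keywords: ['email', 'inbox', 'mail'], report: 'inbox sweep report' },
  { keywords: ['social', 'twitter', 'mentions'], report: 'social radar report' },
  { keywords: ['briefing', 'morning'], report: 'morning briefing' },
  { keywords: ['meeting'], report: 'meeting sync report' },
  { keywords: ['slack'], report: 'slack scan report' },
  { keywords: ['content', 'ideas'], report: 'content ideas report' },
];
const MAX_REPORT_CHARS = 2500;

// loadReport(name) is assumed to return the report text as a string.
function routeQuery(question, loadReport) {
  const text = question.toLowerCase();
  for (const route of REPORT_ROUTES) {
    if (route.keywords.some((keyword) => text.includes(keyword))) {
      // First match wins; cap the payload so the voice turn stays fast.
      const content = loadReport(route.report).slice(0, MAX_REPORT_CHARS);
      return { source: 'report', report: route.report, content };
    }
  }
  // No keyword hit: the caller falls back to the full brain search.
  return { source: 'brain_search', report: null, content: null };
}
```

Because the loop returns on the first hit, table order decides ties: a question mentioning both Twitter and a meeting loads the social radar report.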