Garry Tan baf3517868 feat: v0.9.0 -- smart file storage, publish, production-grade skills (#62)

* feat: battle-tested skill patterns from production deployment

Backport production-learned brain-operations patterns:
- Iron Law of Back-Linking (mandatory bidirectional linking)
- Brain filing rules (file by primary subject, not format)
- Enrichment protocol (7-step pipeline, 3-tier system, person/company templates)
- Media ingest workflows (articles, videos, podcasts, PDFs, screenshots)
- Citation requirements (mandatory [Source: ...] on every fact)
- Test Before Bulk operating principle
- Voice recipe: unicode crash fix, PII scrub, identity-first prompt, DIY STT+LLM+TTS
- X-to-Brain recipe: image OCR, Filtered Stream, tweet rating rubric, cron stagger

* chore: bump version and changelog (v0.8.1)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: add _brain-filing-rules.md to CLAUDE.md key files

* feat: smart file upload with TUS resumable and .redirect.yaml pointers

- Supabase Storage auto-selects upload method by file size:
  < 100 MB standard POST, >= 100 MB TUS resumable (6 MB chunks + retry)
- Signed URL generation for private bucket access (1-hour expiry)
- New `upload-raw` command with size routing: small text stays in git,
  large/media files go to cloud with .redirect.yaml pointer
- New `signed-url` command for generating access links
- File resolver supports both .redirect.yaml (v0.9+) and .redirect (legacy)
- Redirect format upgraded: 10 fields with full metadata
- All migration commands (mirror, redirect, restore, clean) handle both formats

* feat: skills reference actual gbrain file commands

- Filing rules document upload-raw, signed-url, and .redirect.yaml format
- Ingest skill uses gbrain files upload-raw for raw source preservation
- Maintain skill adds file storage health checks
- Setup skill adds storage configuration phase with migration guidance
- Voice recipe uses upload-raw for call audio storage
- Migration v0.9.0 with complete storage setup instructions

* chore: bump version and changelog (v0.9.0)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: gbrain publish -- shareable HTML with password protection

First code+skill pair: deterministic code does the work (strip private data,
encrypt with AES-256-GCM, generate self-contained HTML), the skill tells the
agent when and how to use it. 34 new tests.

See: https://x.com/garrytan/status/2042925773300908103

* feat: backlinks check/fix, page lint, and report commands

Three new deterministic tools (zero LLM calls):

- gbrain backlinks check/fix -- scans brain for entity mentions without
  back-links, creates them. Enforces the Iron Law from the skills.
- gbrain lint [--fix] -- catches LLM preambles, code fence wrapping,
  placeholder dates, missing frontmatter, broken citations, empty sections.
  --fix auto-strips fixable artifacts.
- gbrain report --type <name> -- saves timestamped reports to
  brain/reports/{type}/YYYY-MM-DD-HHMM.md for audit trails.

33 new tests (409 total, 0 fail).

* feat: v0.9.0 migration tells agents to swap scripts for built-in commands

Migration file now:
- Lists all 5 new deterministic commands with usage examples
- Includes a script-to-command replacement table (old -> new)
- Tells the agent to find custom script references in AGENTS.md,
  skills, and cron jobs and replace with gbrain commands
- Adds recommended cron jobs for daily backlink fix + weekly lint
- References the Thin Harness, Fat Skills thread

* fix: CLI routing bugs found during DX review

- Fixed subArgs reference error in handleCliOnly (used wrong variable name)
- Renamed gbrain backlinks check/fix to gbrain check-backlinks to avoid
  conflict with existing backlinks operation (per-page incoming links)
- Added TOOLS section to --help output showing publish, check-backlinks,
  lint, report
- Added upload-raw and signed-url to FILES section in --help
- Updated all docs/migration references to use check-backlinks

* fix: security hardening from adversarial review

- XSS: sanitize marked.parse() output (strip script/iframe/on* attrs)
- Path traversal: validate report --type against [a-z0-9-] pattern
- TUS: HEAD request before retry to get server's actual offset (TUS spec)
- Pointer: upload-raw now includes pointer content in JSON output
- Symlinks: use lstatSync in all walkers to prevent directory escape

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

2026-04-11 21:46:07 -10:00

X-to-Brain: Twitter Monitoring That Updates Your Brain

Your timeline, mentions, and keyword searches flow into brain pages. The collector tracks deletions, engagement velocity, and narrative patterns. You wake up knowing what happened on X while you slept.

IMPORTANT: Instructions for the Agent

You are the installer. Follow these steps precisely.

The core pattern: code for data, LLMs for judgment. The X collector is deterministic code. It pulls tweets, detects deletions, tracks engagement. It NEVER interprets content. YOU (the agent) read the collected data and make judgment calls: who is important, what entities are mentioned, what narratives are forming.

Why sequential execution matters:

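Step 1 validates the API key. Without it, nothing connects to X.
Step 2 looks up the user's numeric ID. Without it, the timeline and mentions endpoints cannot be called.
Step 3 sets up the collector. Without it, you have no data.
Step 4 runs the first collection. Without data, you can't enrich.
Step 5 is YOUR job: read the collected tweets, update brain pages.
Steps 6 and 7 schedule the collector and log setup completion.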

Do not skip steps. Do not reorder. Verify after each step.

Architecture

X API v2 (Bearer token auth)
  ↓ Three collection streams:
  ├── Own timeline: GET /2/users/{id}/tweets
  ├── Mentions: GET /2/users/{id}/mentions
  └── Keyword searches: GET /2/tweets/search/recent
  ↓
X Collector (deterministic Node.js script)
  ↓ Outputs:
  ├── data/tweets/{own,mentions,searches}/{id}.json
  ├── data/deletions/{id}.json (detected via diff)
  ├── data/engagement/{id}.json (velocity snapshots)
  └── data/state.json (pagination, rate limits)
  ↓
Agent reads collected data
  ↓ Judgment calls:
  ├── Entity detection (people, companies mentioned)
  ├── Brain page updates (timeline entries)
  ├── Narrative pattern detection
  └── Engagement spike alerts

Opinionated Defaults

Three collection streams:

Own timeline — your tweets, for your own archive and engagement tracking
Mentions — who is talking about you, for relationship tracking
Keyword searches — topics you care about, for signal detection

Deletion detection:

Compare tweet IDs from previous run vs current
If an ID is missing AND the tweet is < 7 days old, call GET /2/tweets/{id}
404 = confirmed deleted. Save the original tweet + deletion timestamp.
Alert on deletions from accounts you track.

Engagement velocity:

Snapshot likes/retweets/replies for tracked tweets
Alert if likes doubled AND previous count >= 50
Alert if likes gained > 100 absolute since last check
Only write snapshot if metrics actually changed (idempotent)

Rate limit awareness:

Basic tier: 1500 req/15min for timeline, 450 for mentions, 60 for search
Collector tracks rate limits in state.json
Back off automatically when approaching limits

Prerequisites

GBrain installed and configured (gbrain doctor passes)
Node.js 18+ (for the collector script)
X Developer account with API access

Setup Flow

Step 1: Get X API Credentials

Tell the user: "I need your X API Bearer token. Here's exactly where to get it:

Go to https://developer.x.com/en/portal/dashboard
If you don't have a developer account, click 'Sign up' (free tier available)
Create a new Project (name it anything, e.g., 'GBrain')
Inside the project, create a new App
Go to the app's 'Keys and tokens' tab
Under 'Bearer Token', click 'Generate' (or 'Regenerate')
Copy the Bearer Token and paste it to me

Note: Free tier gives read-only access with low limits. Basic tier ($200/mo) gives search/recent endpoint and higher limits. Pro tier gets full archive search."

Validate immediately:

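# /2/users/me requires user-context auth, so check the app-only Bearer Token with a public user lookup
curl -sf -H "Authorization: Bearer $X_BEARER_TOKEN" \
  "https://api.x.com/2/users/by/username/XDevelopers" \
  && echo "PASS: X API connected" \
  || echo "FAIL: X API token invalid"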

If validation fails: "That didn't work. Common issues: (1) make sure you copied the Bearer Token, not the API Key or API Secret, (2) Bearer Tokens are long strings starting with 'AAA...', (3) if you just created the app, the token is valid immediately."

STOP until X API validates.

Step 2: Get Your X User ID

# Look up the user's X user ID from their handle
curl -sf -H "Authorization: Bearer $X_BEARER_TOKEN" \
  "https://api.x.com/2/users/by/username/USERNAME" | grep -o '"id":"[^"]*"'

Ask the user for their X handle (e.g., @yourhandle). Look up their user ID. Save it — the collector needs the numeric ID, not the handle.
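
The grep above works for a quick check, but parsing the JSON is sturdier once the lookup moves into a script. A minimal sketch, assuming the v2 user lookup returns a body shaped like `{ "data": { "id": "...", "username": "..." } }`; `extractUserId` is a hypothetical helper name, not part of GBrain:

```javascript
// Hypothetical helper: pull the numeric user ID out of the body returned by
// GET /2/users/by/username/:username.
function extractUserId(responseBody) {
  const parsed = JSON.parse(responseBody);
  const id = parsed?.data?.id;
  // IDs are numeric strings; keep them as strings to avoid precision loss.
  if (typeof id !== 'string' || !/^\d+$/.test(id)) {
    throw new Error('No user ID in response: ' + responseBody.slice(0, 200));
  }
  return id;
}
```

Throwing on a missing ID keeps a bad handle from silently producing an empty collector config.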

Step 3: Configure the Collector

Create the collector directory:

mkdir -p x-collector/data/{tweets/{own,mentions,searches},deletions,engagement}
cd x-collector

The collector script needs these capabilities:

collect — pull tweets from three streams:
- Own timeline: GET /2/users/{id}/tweets with max_results=100
- Mentions: GET /2/users/{id}/mentions with max_results=100
- Keyword searches: configurable search terms via GET /2/tweets/search/recent
Deletion detection — compare previous run's tweet IDs vs current. For missing IDs, verify with individual tweet lookup. 404 = deleted.
Engagement tracking — snapshot metrics for tracked tweets. Only write if metrics changed.
State management — save pagination tokens, last run timestamp, rate limit state to data/state.json
Atomic writes — write to .tmp file, then rename (prevents corrupt data on crash)

Configure keyword searches based on what the user cares about:

{
  "searches": [
    "\"your name\" -from:yourhandle",
    "\"your company\" OR \"your product\"",
    "topic you track"
  ]
}
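
As an illustration of how the collector might turn that config into requests, the sketch below builds one GET /2/tweets/search/recent URL per query. The function name `buildSearchUrls` and the choice of `tweet.fields` are assumptions made for this example:

```javascript
// Assumed config shape, matching the JSON above.
const config = {
  searches: ['"your name" -from:yourhandle', 'topic you track'],
};

// Build one search/recent URL per configured query.
// URLSearchParams handles quoting and operator characters in the query.
function buildSearchUrls(cfg, maxResults = 100) {
  return cfg.searches.map((query) => {
    const params = new URLSearchParams({
      query,
      max_results: String(maxResults),
      'tweet.fields': 'created_at,public_metrics,author_id',
    });
    return 'https://api.x.com/2/tweets/search/recent?' + params.toString();
  });
}
```

Letting URLSearchParams do the encoding avoids hand-escaping the quotes and `-from:` operators in each search term.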

Step 4: Run First Collection

node x-collector.mjs collect

Verify: ls data/tweets/own/ should contain tweet JSON files. Show the user a sample: "Found N tweets from your timeline, M mentions, K search results."

Step 5: Enrich Brain Pages

This is YOUR job (the agent). Read the collected tweets:

Detect entities: who tweeted? Who is mentioned? What companies/topics?
Check the brain: gbrain search "person name" — do we have a page?
Update brain pages: for each notable person or company mentioned: - YYYY-MM-DD | Tweeted about {topic} [Source: X, @handle, {date}]
Track narratives: if someone tweets about the same topic 3+ times in a week, note the pattern in their compiled truth
Flag deletions: if a tracked account deleted a tweet, note it: - YYYY-MM-DD | Deleted tweet: "{content}" [Source: X deletion, detected {date}]
Sync: gbrain sync --no-pull --no-embed
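
Topic assignment is a judgment call for the agent, but the counting and formatting around it can be deterministic. A minimal sketch, assuming the agent has already tagged each tweet with `author`, `topic`, and `created_at`; both function names are hypothetical:

```javascript
const WEEK_MS = 7 * 24 * 60 * 60 * 1000;

// Format a timeline entry in the style shown in the steps above.
function timelineEntry(date, topic, handle) {
  return `- ${date} | Tweeted about ${topic} [Source: X, @${handle}, ${date}]`;
}

// Return author/topic pairs that appear minCount or more times within
// the last 7 days. Older tweets are ignored.
function findNarratives(taggedTweets, now = Date.now(), minCount = 3) {
  const groups = new Map();
  for (const t of taggedTweets) {
    if (now - Date.parse(t.created_at) > WEEK_MS) continue;
    const key = JSON.stringify([t.author, t.topic]);
    const group = groups.get(key) ?? { author: t.author, topic: t.topic, count: 0 };
    group.count += 1;
    groups.set(key, group);
  }
  return [...groups.values()].filter((g) => g.count >= minCount);
}
```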

Step 6: Set Up Cron

The collector should run every 30 minutes:

*/30 * * * * cd /path/to/x-collector && node x-collector.mjs collect >> /tmp/x-collector.log 2>&1

The agent should review collected data 2-3x daily and run enrichment.

Step 7: Log Setup Completion

mkdir -p ~/.gbrain/integrations/x-to-brain
echo '{"ts":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'","event":"setup_complete","source_version":"0.8.1","status":"ok","details":{"user_id":"X_USER_ID"}}' >> ~/.gbrain/integrations/x-to-brain/heartbeat.jsonl

Production Patterns (v0.8.1)

These patterns come from a production deployment tracking 19+ accounts with real-time monitoring.

Image OCR (NEW)

Problem: Text-only collection misses visual context in tweet images -- screenshots, charts, memes with text overlay, quote screenshots.

Fix: Run OCR on tweet images via a vision model (Claude Sonnet or equivalent):

For every tweet with images, extract full text content via vision API
Store OCR output alongside the tweet data
Include extracted text in entity detection and brain page updates
Charts/data visualizations: extract data points, describe findings

This catches signal that text-only collectors miss entirely.

Real-Time Monitoring via Filtered Stream (NEW)

Problem: 30-minute polling means you find out about things 30 minutes late. For time-sensitive content (engagement spikes, deletions, breaking threads), that's too slow.

Fix: Use the X API Filtered Stream (GET /2/tweets/search/stream) for near-real-time monitoring. It catches outbound tweets within seconds.

Setup:

Add filter rules: POST /2/tweets/search/stream/rules with your tracking terms
Open persistent connection: GET /2/tweets/search/stream
Process tweets as they arrive (no polling delay)

Requirements: Basic tier ($200/mo) minimum for Filtered Stream access.

Use alongside polling: Stream for real-time alerts, polling for completeness (stream can drop tweets during disconnects).
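
The rule-setup step can be sketched as follows. The request body uses the `add` array of `{ value, tag }` objects that the rules endpoint accepts; the tag text and both function names are illustrative assumptions:

```javascript
// Build the JSON body for POST /2/tweets/search/stream/rules.
// The `tag` values are arbitrary labels chosen here for illustration.
function buildStreamRules(terms) {
  return { add: terms.map((value) => ({ value, tag: `gbrain: ${value}` })) };
}

// Hypothetical wrapper; uses the global fetch available in Node.js 18+.
async function postStreamRules(bearerToken, terms) {
  const res = await fetch('https://api.x.com/2/tweets/search/stream/rules', {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${bearerToken}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify(buildStreamRules(terms)),
  });
  if (!res.ok) throw new Error(`Rule update failed: HTTP ${res.status}`);
  return res.json();
}
```

Keeping the payload builder separate from the network call makes the rules easy to inspect or test without touching the API.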

Tweet Rating Rubric (NEW)

Problem: Not all tweets deserve the same attention. Without scoring, every tweet gets equal weight.

Fix: Rate tweets on a 6-dimension rubric:

Reach -- follower count, engagement rate
Relevance -- connection to your interests/work
Sentiment -- positive/negative/neutral toward you
Novelty -- new information vs rehash
Actionability -- does this require a response?
Virality potential -- engagement velocity, quote-tweet ratio

Re-rate after 60 minutes to track engagement trajectory. A tweet at 50 likes that hits 500 in an hour is a different signal than one that stays at 50.

Outbound Tweet Monitoring (NEW)

Problem: You tweet something and don't notice engagement patterns until hours later.

Fix: 60-second monitoring window after every outbound tweet:

Check engagement velocity (likes, replies, quotes)
Flag unusual reply-to-like ratios (high reply ratios signal controversy)
Flag if quote-tweet ratio > retweet ratio (commentary, not sharing)
Cross-reference mentioned accounts against brain for context
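
The two ratio checks above are simple enough to run as deterministic code. A minimal sketch over the `public_metrics` fields of a tweet; the function name, the flag names, and the default threshold of 1.0 (more replies than likes) are assumptions, since the text does not define what counts as a high reply ratio:

```javascript
// Return a list of flags for one outbound tweet's public_metrics.
function outboundFlags(metrics, replyRatioThreshold = 1.0) {
  const flags = [];
  const { like_count, reply_count, retweet_count, quote_count } = metrics;

  // Guard against division by zero on tweets with no likes yet.
  let replyRatio = 0;
  if (like_count > 0) replyRatio = reply_count / like_count;
  else if (reply_count > 0) replyRatio = Infinity;

  if (replyRatio > replyRatioThreshold) flags.push('high_reply_ratio');
  if (quote_count > retweet_count) flags.push('quotes_exceed_retweets');
  return flags;
}
```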

X-to-Brain Pipeline (NEW)

Every tweet interaction can automatically create/update brain pages:

Mentioned person has a brain page? Append to their timeline
New person mentioned? Check notability gate, create page if notable
Article URL in tweet? Fetch and ingest via article workflow
Video URL in tweet? Queue for transcription pipeline
Images? OCR and extract text content

Follow skills/_brain-filing-rules.md for filing decisions.

Cron Staggering (IMPORTANT)

Problem: Multiple cron jobs firing simultaneously causes resource contention and timeouts.

Fix: Stagger all collection schedules so that at most one job starts in any given minute:

# Good: staggered
*/30 * * * * x-collector       # :00, :30
5,35 * * * * x-bundle-ingest   # :05, :35
10 */3 * * * social-monitor     # :10 every 3h

# Bad: overlapping
*/30 * * * * x-collector
*/30 * * * * x-bundle-ingest   # fires at same time!
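
Overlaps like the one above can be caught before they reach the crontab. The sketch below is a hypothetical checker that compares only the minute field of each schedule, so it is conservative: jobs that share a minute but differ in the hour field are still reported. It handles `*`, `*/n`, single values, and comma lists, and rejects anything else:

```javascript
// Expand a cron minute field into the set of minutes (0-59) it fires on.
function expandMinutes(field) {
  if (field === '*') return new Set(Array.from({ length: 60 }, (_, i) => i));
  const minutes = new Set();
  for (const part of field.split(',')) {
    const step = part.match(/^\*\/([1-9]\d*)$/);
    if (step) {
      for (let m = 0; m < 60; m += Number(step[1])) minutes.add(m);
    } else if (/^\d+$/.test(part)) {
      minutes.add(Number(part));
    } else {
      throw new Error(`Unsupported minute field: ${part}`);
    }
  }
  return minutes;
}

// Report every minute on which more than one job would start.
function findCollisions(jobs) {
  const byMinute = new Map();
  for (const { name, schedule } of jobs) {
    const minuteField = schedule.trim().split(/\s+/)[0];
    for (const m of expandMinutes(minuteField)) {
      if (!byMinute.has(m)) byMinute.set(m, []);
      byMinute.get(m).push(name);
    }
  }
  return [...byMinute]
    .filter(([, names]) => names.length > 1)
    .map(([minute, names]) => ({ minute, names }))
    .sort((a, b) => a.minute - b.minute);
}
```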

Implementation Guide

These are production-tested patterns from a deployment tracking 19+ accounts.

Deletion Detection Algorithm

detect_deletions(prevIds, currentIds):
  for id in prevIds:
    if id in currentIds: continue          // still exists

    stored = load_tweet(id)
    if not stored: continue                // never stored

    // HEURISTIC 1: Only check tweets < 7 days old
    age = now - stored.created_at
    if age > 7_DAYS: continue              // aged out of API window

    // HEURISTIC 2: Skip if last seen > 48h ago
    staleness = now - stored.last_updated
    if staleness > 48_HOURS: continue      // fell out of window, not deleted

    // HEURISTIC 3: Already logged?
    if deletion_file_exists(id): continue

    // VERIFY via direct API call
    res = GET /2/tweets/{id}
    if res.status == 404 OR (res.ok AND no data):
      save_deletion(id, stored, detected_at=now)
      alert(f"DELETION: {stored.author} deleted: {preview(stored.text)}")

Why the heuristics matter: Without #2 (48h staleness check), you get false positives on old tweets that just aged out of the API search window. Without #1 (7-day cap), you'd investigate thousands of old tweets on every run.
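
For reference, the same algorithm as a runnable Node.js sketch. Storage and network access are passed in through a hypothetical `deps` object (`loadTweet`, `deletionLogged`, `lookupTweet`, `saveDeletion`) so the logic can be exercised without the API; those names and the `{ status, data }` lookup result are assumptions of this example:

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Returns the deletion records created during this run.
async function detectDeletions(prevIds, currentIds, deps, now = Date.now()) {
  const current = new Set(currentIds);
  const detected = [];
  for (const id of prevIds) {
    if (current.has(id)) continue; // still visible

    const stored = deps.loadTweet(id);
    if (!stored) continue; // never stored

    // Heuristic 1: only check tweets less than 7 days old.
    if (now - Date.parse(stored.created_at) > 7 * DAY_MS) continue;
    // Heuristic 2: skip tweets last seen more than 48 hours ago.
    if (now - Date.parse(stored.last_updated) > 2 * DAY_MS) continue;
    // Heuristic 3: skip deletions that are already logged.
    if (deps.deletionLogged(id)) continue;

    // Verify with a direct lookup: a 404, or a 200 with no data, means gone.
    const res = await deps.lookupTweet(id);
    const gone = res.status === 404 || (res.status === 200 && !res.data);
    if (gone) {
      const record = { id, original: stored, detected_at: new Date(now).toISOString() };
      deps.saveDeletion(record);
      detected.push(record);
    }
  }
  return detected;
}
```

Because the three heuristics run before the lookup, only tweets that pass all of them cost an API request.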

Engagement Velocity Tracking

track_engagement(id, metrics):
  snapshots = load_snapshots(id)
  last = snapshots[-1] if snapshots else null

  if last AND metrics_equal(last.metrics, metrics): return  // no change

  snapshots.append({timestamp: now, metrics})
  if len(snapshots) > 100: snapshots = snapshots[-100:]  // cap growth

  // Alert conditions (OR logic):
  if last:
    old_likes = last.metrics.like_count
    new_likes = metrics.like_count

    // Condition 1: 2x on established tweets (>= 50 likes)
    if old_likes >= 50 AND new_likes >= old_likes * 2:
      alert(f"VELOCITY: {id} likes {old_likes} -> {new_likes}")

    // Condition 2: Absolute jump > 100
    elif (new_likes - old_likes) > 100:
      alert(f"VELOCITY: {id} likes {old_likes} -> {new_likes}")

Threshold design: 50 minimum prevents noise from small tweets going 2→4. The 100 absolute jump catches big spikes on tweets with any baseline.
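
The thresholds can be checked with a pure function that takes the existing snapshots and the latest metrics. A sketch under a few assumptions: snapshots are stored as `{ timestamp, metrics }`, equality is judged on likes, retweets, and replies only, and the returned `alert` labels (`'doubled'`, `'jump'`) are hypothetical:

```javascript
const MAX_SNAPSHOTS = 100;

// Returns the updated snapshot list and an alert label (or null).
function trackEngagement(snapshots, metrics, now = new Date().toISOString()) {
  const last = snapshots.length ? snapshots[snapshots.length - 1] : null;

  // Idempotent: write nothing if the tracked counts did not change.
  const unchanged = last !== null &&
    last.metrics.like_count === metrics.like_count &&
    last.metrics.retweet_count === metrics.retweet_count &&
    last.metrics.reply_count === metrics.reply_count;
  if (unchanged) return { snapshots, alert: null };

  // Append and cap growth at the most recent MAX_SNAPSHOTS entries.
  const next = [...snapshots, { timestamp: now, metrics }].slice(-MAX_SNAPSHOTS);

  let alert = null;
  if (last) {
    const oldLikes = last.metrics.like_count;
    const newLikes = metrics.like_count;
    if (oldLikes >= 50 && newLikes >= oldLikes * 2) alert = 'doubled';
    else if (newLikes - oldLikes > 100) alert = 'jump';
  }
  return { snapshots: next, alert };
}
```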

Atomic File Writes

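import { writeFileSync, renameSync } from 'node:fs';

function atomicWrite(path, obj) {
  const tmp = path + '.tmp';  // same directory, so same filesystem as the target
  writeFileSync(tmp, JSON.stringify(obj, null, 2));
  renameSync(tmp, path);      // atomic on POSIX when tmp and path share a filesystem
}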

If the process dies mid-write, the .tmp file is left behind but the original is untouched. Critical when you have thousands of per-tweet JSON files.

Rate Limit Handling

rate_limits = {}  // per endpoint

after_each_request(endpoint, headers):
  rate_limits[endpoint] = {
    remaining: int(headers['x-rate-limit-remaining']),  // header values arrive as strings
    reset: int(headers['x-rate-limit-reset'])            // epoch seconds
  }

is_rate_limited(endpoint, min_remaining=2):
  r = rate_limits[endpoint]
  return r AND r.remaining <= min_remaining

Reserve 2 requests per endpoint so other streams still work. If mentions hits the limit, own timeline and searches can still run.
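
A runnable sketch of the same bookkeeping. It assumes response headers are available as a plain object with lowercase keys, and it adds one behavior the pseudocode does not show: an endpoint stops counting as limited once its reset time (epoch seconds) has passed. The class name is hypothetical:

```javascript
class RateLimitTracker {
  constructor(minRemaining = 2) {
    this.minRemaining = minRemaining;
    this.limits = new Map(); // endpoint -> { remaining, reset }
  }

  // Call after each request with that response's headers.
  record(endpoint, headers) {
    const remaining = Number.parseInt(headers['x-rate-limit-remaining'], 10);
    const reset = Number.parseInt(headers['x-rate-limit-reset'], 10);
    if (Number.isNaN(remaining) || Number.isNaN(reset)) return; // headers absent
    this.limits.set(endpoint, { remaining, reset });
  }

  // True when the endpoint is at or below the reserve and the window is still open.
  isLimited(endpoint, nowSeconds = Math.floor(Date.now() / 1000)) {
    const r = this.limits.get(endpoint);
    if (!r) return false;                    // nothing recorded yet
    if (nowSeconds >= r.reset) return false; // window has rolled over
    return r.remaining <= this.minRemaining;
  }
}
```

Parsing the header values to integers matters here: comparing the raw strings would make `'10' <= 2` evaluate by numeric coercion in some cases and by accident in others.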

Stdout Contract

The collector prints structured lines the cron agent can parse:

RUN_START:{timestamp}
OWN_TWEETS:{total} ({new} new)
MENTIONS:{total} ({new} new)
DELETION_DETECTED:{id}:{author}:{preview}
VELOCITY_ALERT:{id}:likes:{old}->{new}:{minutes}min
RUN_COMPLETE:{timestamp}:tweets_stored={N}:deletions={N}:velocity_alerts={N}
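
On the consuming side, a cron agent could parse those lines roughly as follows. This is a sketch: the function name and the shape of the returned summary are assumptions, and the patterns are anchored so that colons inside timestamps or tweet previews do not break parsing:

```javascript
// Parse collector stdout into a summary object. Unrecognized lines are ignored.
function parseCollectorOutput(stdout) {
  const summary = {
    started: null, completed: null, own: null, mentions: null,
    deletions: [], velocityAlerts: [], totals: null,
  };
  for (const line of stdout.split(/\r?\n/)) {
    let m;
    if ((m = line.match(/^RUN_START:(.+)$/))) {
      summary.started = m[1];
    } else if ((m = line.match(/^(OWN_TWEETS|MENTIONS):(\d+) \((\d+) new\)$/))) {
      const key = m[1] === 'OWN_TWEETS' ? 'own' : 'mentions';
      summary[key] = { total: Number(m[2]), new: Number(m[3]) };
    } else if ((m = line.match(/^DELETION_DETECTED:(\d+):([^:]+):(.*)$/))) {
      summary.deletions.push({ id: m[1], author: m[2], preview: m[3] });
    } else if ((m = line.match(/^VELOCITY_ALERT:(\d+):likes:(\d+)->(\d+):(\d+)min$/))) {
      summary.velocityAlerts.push({
        id: m[1], oldLikes: Number(m[2]), newLikes: Number(m[3]), minutes: Number(m[4]),
      });
    } else if ((m = line.match(/^RUN_COMPLETE:(.+):tweets_stored=(\d+):deletions=(\d+):velocity_alerts=(\d+)$/))) {
      summary.completed = m[1];
      summary.totals = {
        tweetsStored: Number(m[2]), deletions: Number(m[3]), velocityAlerts: Number(m[4]),
      };
    }
  }
  return summary;
}
```

A run whose summary has `started` set but `completed` still null is one that crashed or was killed mid-collection.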

What the Agent Should Test After Setup

Deletion detection: Post a tweet, collect, delete it, collect again. Verify deletion is detected on second run.
Rate limit: Run collect with very low remaining quota. Verify it stops gracefully and reports which streams were skipped.
Engagement: Find a tweet with 45 likes. Mock it jumping to 90 (no alert, < 50 threshold). Then 50→100 (alert: 2x). Then 30→150 (alert: >100 jump).
Deduplication: Collect, then like one of your own tweets, collect again. Verify _collected_at is preserved (not overwritten).
Atomic writes: Kill the process mid-collection. Verify no corrupted JSON.

Cost Estimate

Component	Monthly Cost
X API Free tier	$0 (read-only, low limits)
X API Basic tier	$200/mo (search + higher limits)
X API Pro tier	$5,000/mo (full archive)
Recommended	$0 (free) or $200 (basic)

Free tier works for personal monitoring. Basic tier is needed for keyword search.

Troubleshooting

API returns 403:

Check your app has the right access level (Read or Read+Write)
Free tier apps can only use basic endpoints
Some endpoints require Basic or Pro tier

Rate limited (429):

The collector respects rate limits automatically
If hitting limits frequently, increase the cron interval to 60 minutes
Check data/state.json for rate limit tracking

No tweets collected:

Verify the user ID is correct (numeric, not handle)
Check the Bearer Token is valid (Step 1 validation)
Some accounts may have protected tweets (requires OAuth 2.0 user context)
