Files

Garry Tan f82978d38d security: fix wave 2 — 5 vulns + typed health check DSL (v0.9.3) (#95 )

* security: path traversal, query bounds, marker injection fixes

LocalStorage: contained() method validates all paths stay within storage root.
file-resolver: resolveFile validates filePath within brainRoot, marker prefix
rejects ../, absolute paths, bare '..'. file_list: LIMIT 100 on slug-filtered
branch + FILE_LIST_LIMIT constant for both branches.

Co-Authored-By: Gus <garagon@users.noreply.github.com>

* security: symlink hardening in all file walkers

All 4 walkers in files.ts (collectFiles, findRedirects, findAndClean, scan)
plus init.ts counter now use lstatSync + isSymbolicLink skip. Tests import
production collectFiles instead of reimplementing it. node_modules skipped.
CLI file list and verify queries bounded with LIMIT.

Co-Authored-By: Gus <garagon@users.noreply.github.com>

* feat: typed health check DSL + recipe migration

4 DSL types: http, env_exists, command, any_of. Replaces raw execSync
on recipe YAML. All 7 first-party recipes migrated from shell strings
to typed objects. String health_checks still accepted with deprecation
warning + metachar validation for non-embedded recipes. isUnsafeHealthCheck
blocks shell injection for user-created recipes.

Co-Authored-By: Gus <garagon@users.noreply.github.com>

* chore: bump version and changelog (v0.9.3)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* test: E2E test for file_list LIMIT enforcement against real Postgres

Inserts 150 file rows for one slug, verifies file_list returns at most
100 (both slug-filtered and unfiltered branches). Proves the LIMIT
works at the database level, not just in unit tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Gus <garagon@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

2026-04-13 07:49:13 -10:00

16 KiB

Raw Blame History

id, name, version, description, category, requires, secrets, health_checks, setup_time, cost_estimate

name

version

description

X-to-Brain: Twitter Monitoring That Updates Your Brain

Your timeline, mentions, and keyword searches flow into brain pages. The collector tracks deletions, engagement velocity, and narrative patterns. You wake up knowing what happened on X while you slept.

IMPORTANT: Instructions for the Agent

You are the installer. Follow these steps precisely.

The core pattern: code for data, LLMs for judgment. The X collector is deterministic code. It pulls tweets, detects deletions, tracks engagement. It NEVER interprets content. YOU (the agent) read the collected data and make judgment calls: who is important, what entities are mentioned, what narratives are forming.

Why sequential execution matters:

Step 1 validates the API key. Without it, nothing connects to X.
Step 2 sets up the collector. Without it, you have no data.
Step 3 runs the first collection. Without data, you can't enrich.
Step 4 is YOUR job: read the collected tweets, update brain pages.

Do not skip steps. Do not reorder. Verify after each step.

Architecture

X API v2 (Bearer token auth)
  ↓ Three collection streams:
  ├── Own timeline: GET /users/{id}/tweets
  ├── Mentions: GET /users/{id}/mentions
  └── Keyword searches: GET /tweets/search/recent
  ↓
X Collector (deterministic Node.js script)
  ↓ Outputs:
  ├── data/tweets/{own,mentions,searches}/{id}.json
  ├── data/deletions/{id}.json (detected via diff)
  ├── data/engagement/{id}.json (velocity snapshots)
  └── data/state.json (pagination, rate limits)
  ↓
Agent reads collected data
  ↓ Judgment calls:
  ├── Entity detection (people, companies mentioned)
  ├── Brain page updates (timeline entries)
  ├── Narrative pattern detection
  └── Engagement spike alerts

Opinionated Defaults

Three collection streams:

Own timeline — your tweets, for your own archive and engagement tracking
Mentions — who is talking about you, for relationship tracking
Keyword searches — topics you care about, for signal detection

Deletion detection:

Compare tweet IDs from previous run vs current
If an ID is missing AND the tweet is < 7 days old, call GET /tweets/{id}
404 = confirmed deleted. Save the original tweet + deletion timestamp.
Alert on deletions from accounts you track.

Engagement velocity:

Snapshot likes/retweets/replies for tracked tweets
Alert if likes doubled AND previous count >= 50
Alert if likes gained > 100 absolute since last check
Only write snapshot if metrics actually changed (idempotent)

Rate limit awareness:

Basic tier: 1500 req/15min for timeline, 450 for mentions, 60 for search
Collector tracks rate limits in state.json
Back off automatically when approaching limits

Prerequisites

GBrain installed and configured (gbrain doctor passes)
Node.js 18+ (for the collector script)
X Developer account with API access

Setup Flow

Step 1: Get X API Credentials

Tell the user: "I need your X API Bearer token. Here's exactly where to get it:

Go to https://developer.x.com/en/portal/dashboard
If you don't have a developer account, click 'Sign up' (free tier available)
Create a new Project (name it anything, e.g., 'GBrain')
Inside the project, create a new App
Go to the app's 'Keys and tokens' tab
Under 'Bearer Token', click 'Generate' (or 'Regenerate')
Copy the Bearer Token and paste it to me

Note: Free tier gives read-only access with low limits. Basic tier ($200/mo) gives search/recent endpoint and higher limits. Pro tier gets full archive search."

Validate immediately:

curl -sf -H "Authorization: Bearer $X_BEARER_TOKEN" \
  "https://api.x.com/2/users/me" \
  && echo "PASS: X API connected" \
  || echo "FAIL: X API token invalid"

If validation fails: "That didn't work. Common issues: (1) make sure you copied the Bearer Token, not the API Key or API Secret, (2) Bearer Tokens are long strings starting with 'AAA...', (3) if you just created the app, the token is valid immediately."

STOP until X API validates.

Step 2: Get Your X User ID

# Look up the user's X user ID from their handle
curl -sf -H "Authorization: Bearer $X_BEARER_TOKEN" \
  "https://api.x.com/2/users/by/username/USERNAME" | grep -o '"id":"[^"]*"'

Ask the user for their X handle (e.g., @yourhandle). Look up their user ID. Save it — the collector needs the numeric ID, not the handle.

Step 3: Configure the Collector

Create the collector directory:

mkdir -p x-collector/data/{tweets/{own,mentions,searches},deletions,engagement}
cd x-collector

The collector script needs these capabilities:

collect — pull tweets from three streams:
- Own timeline: GET /2/users/{id}/tweets with max_results=100
- Mentions: GET /2/users/{id}/mentions with max_results=100
- Keyword searches: configurable search terms via GET /2/tweets/search/recent
Deletion detection — compare previous run's tweet IDs vs current. For missing IDs, verify with individual tweet lookup. 404 = deleted.
Engagement tracking — snapshot metrics for tracked tweets. Only write if metrics changed.
State management — save pagination tokens, last run timestamp, rate limit state to data/state.json
Atomic writes — write to .tmp file, then rename (prevents corrupt data on crash)

Configure keyword searches based on what the user cares about:

{
  "searches": [
    "\"your name\" -from:yourhandle",
    "\"your company\" OR \"your product\"",
    "topic you track"
  ]
}

Step 4: Run First Collection

node x-collector.mjs collect

Verify: ls data/tweets/own/ should contain tweet JSON files. Show the user a sample: "Found N tweets from your timeline, M mentions, K search results."

Step 5: Enrich Brain Pages

This is YOUR job (the agent). Read the collected tweets:

Detect entities: who tweeted? Who is mentioned? What companies/topics?
Check the brain: gbrain search "person name" — do we have a page?
Update brain pages: for each notable person or company mentioned: - YYYY-MM-DD | Tweeted about {topic} [Source: X, @handle, {date}]
Track narratives: if someone tweets about the same topic 3+ times in a week, note the pattern in their compiled truth
Flag deletions: if a tracked account deleted a tweet, note it: - YYYY-MM-DD | Deleted tweet: "{content}" [Source: X deletion, detected {date}]
Sync: gbrain sync --no-pull --no-embed

Step 6: Set Up Cron

The collector should run every 30 minutes:

*/30 * * * * cd /path/to/x-collector && node x-collector.mjs collect >> /tmp/x-collector.log 2>&1

The agent should review collected data 2-3x daily and run enrichment.

Step 7: Log Setup Completion

mkdir -p ~/.gbrain/integrations/x-to-brain
echo '{"ts":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'","event":"setup_complete","source_version":"0.8.1","status":"ok","details":{"user_id":"X_USER_ID"}}' >> ~/.gbrain/integrations/x-to-brain/heartbeat.jsonl

Production Patterns (v0.8.1)

These patterns come from a production deployment tracking 19+ accounts with real-time monitoring.

Image OCR (NEW)

Problem: Text-only collection misses visual context in tweet images -- screenshots, charts, memes with text overlay, quote screenshots.

Fix: Run OCR on tweet images via a vision model (Claude Sonnet or equivalent):

For every tweet with images, extract full text content via vision API
Store OCR output alongside the tweet data
Include extracted text in entity detection and brain page updates
Charts/data visualizations: extract data points, describe findings

This catches signal that text-only collectors miss entirely.

Real-Time Monitoring via Filtered Stream (NEW)

Problem: 30-minute polling means you find out about things 30 minutes late. For time-sensitive content (engagement spikes, deletions, breaking threads), that's too slow.

Fix: Use Twitter's Filtered Stream API (GET /2/tweets/search/stream) for near-real-time monitoring. Catches outbound tweets within seconds.

Setup:

Add filter rules: POST /2/tweets/search/stream/rules with your tracking terms
Open persistent connection: GET /2/tweets/search/stream
Process tweets as they arrive (no polling delay)

Requirements: Basic tier ($200/mo) minimum for Filtered Stream access.

Use alongside polling: Stream for real-time alerts, polling for completeness (stream can drop tweets during disconnects).

Tweet Rating Rubric (NEW)

Problem: Not all tweets deserve the same attention. Without scoring, every tweet gets equal weight.

Fix: Rate tweets on a 6-dimension rubric:

Reach -- follower count, engagement rate
Relevance -- connection to your interests/work
Sentiment -- positive/negative/neutral toward you
Novelty -- new information vs rehash
Actionability -- does this require a response?
Virality potential -- engagement velocity, quote-tweet ratio

Re-rate after 60 minutes to track engagement trajectory. A tweet at 50 likes that hits 500 in an hour is a different signal than one that stays at 50.

Outbound Tweet Monitoring (NEW)

Problem: You tweet something and don't notice engagement patterns until hours later.

Fix: 60-second monitoring window after every outbound tweet:

Check engagement velocity (likes, replies, quotes)
Flag unusual reply-to-like ratios (high reply ratios signal controversy)
Flag if quote-tweet ratio > retweet ratio (commentary, not sharing)
Cross-reference mentioned accounts against brain for context

X-to-Brain Pipeline (NEW)

Every tweet interaction can automatically create/update brain pages:

Mentioned person has a brain page? Append to their timeline
New person mentioned? Check notability gate, create page if notable
Article URL in tweet? Fetch and ingest via article workflow
Video URL in tweet? Queue for transcription pipeline
Images? OCR and extract text content

Follow skills/_brain-filing-rules.md for filing decisions.

Cron Staggering (IMPORTANT)

Problem: Multiple cron jobs firing simultaneously causes resource contention and timeouts.

Fix: Stagger all collection schedules so max 1 runs per minute:

# Good: staggered
*/30 * * * * x-collector       # :00, :30
5,35 * * * * x-bundle-ingest   # :05, :35
10 */3 * * * social-monitor     # :10 every 3h

# Bad: overlapping
*/30 * * * * x-collector
*/30 * * * * x-bundle-ingest   # fires at same time!

Implementation Guide

These are production-tested patterns from a deployment tracking 19+ accounts.

Deletion Detection Algorithm

detect_deletions(prevIds, currentIds):
  for id in prevIds:
    if id in currentIds: continue          // still exists

    stored = load_tweet(id)
    if not stored: continue                // never stored

    // HEURISTIC 1: Only check tweets < 7 days old
    age = now - stored.created_at
    if age > 7_DAYS: continue              // aged out of API window

    // HEURISTIC 2: Skip if last seen > 48h ago
    staleness = now - stored.last_updated
    if staleness > 48_HOURS: continue      // fell out of window, not deleted

    // HEURISTIC 3: Already logged?
    if deletion_file_exists(id): continue

    // VERIFY via direct API call
    res = GET /tweets/{id}
    if res.status == 404 OR (res.ok AND no data):
      save_deletion(id, original_tweet, detected_at)
      alert(f"DELETION: {author} deleted: {preview}")

Why the heuristics matter: Without #2 (48h staleness check), you get false positives on old tweets that just aged out of the API search window. Without #1 (7-day cap), you'd investigate thousands of old tweets on every run.

Engagement Velocity Tracking

track_engagement(id, metrics):
  snapshots = load_snapshots(id)
  last = snapshots[-1] if snapshots else null

  if last AND metrics_equal(last, metrics): return  // no change

  snapshots.append({timestamp: now, metrics})
  if len(snapshots) > 100: snapshots = snapshots[-100:]  // cap growth

  // Alert conditions (OR logic):
  if last:
    old_likes = last.like_count
    new_likes = metrics.like_count

    // Condition 1: 2x on established tweets (>= 50 likes)
    if old_likes >= 50 AND new_likes >= old_likes * 2:
      alert(f"VELOCITY: {id} likes {old_likes} -> {new_likes}")

    // Condition 2: Absolute jump > 100
    elif (new_likes - old_likes) > 100:
      alert(f"VELOCITY: {id} likes {old_likes} -> {new_likes}")

Threshold design: 50 minimum prevents noise from small tweets going 2→4. The 100 absolute jump catches big spikes on tweets with any baseline.

Atomic File Writes

atomic_write(path, obj):
  tmp = path + '.tmp'
  writeFileSync(tmp, JSON.stringify(obj, null, 2))
  renameSync(tmp, path)  // atomic on most filesystems

If the process dies mid-write, the .tmp file is left behind but the original is untouched. Critical when you have thousands of per-tweet JSON files.

Rate Limit Handling

rate_limits = {}  // per endpoint

after_each_request(endpoint, headers):
  rate_limits[endpoint] = {
    remaining: headers['x-rate-limit-remaining'],
    reset: headers['x-rate-limit-reset']
  }

is_rate_limited(endpoint, min_remaining=2):
  r = rate_limits[endpoint]
  return r AND r.remaining <= min_remaining

Reserve 2 requests per endpoint so other streams still work. If mentions hits the limit, own timeline and searches can still run.

Stdout Contract

The collector prints structured lines the cron agent can parse:

RUN_START:{timestamp}
OWN_TWEETS:{total} ({new} new)
MENTIONS:{total} ({new} new)
DELETION_DETECTED:{id}:{author}:{preview}
VELOCITY_ALERT:{id}:likes:{old}->{new}:{minutes}min
RUN_COMPLETE:{timestamp}:tweets_stored={N}:deletions={N}:velocity_alerts={N}

What the Agent Should Test After Setup

Deletion detection: Post a tweet, collect, delete it, collect again. Verify deletion is detected on second run.
Rate limit: Run collect with very low remaining quota. Verify it stops gracefully and reports which streams were skipped.
Engagement: Find a tweet with 45 likes. Mock it jumping to 90 (no alert, < 50 threshold). Then 50→100 (alert: 2x). Then 30→150 (alert: >100 jump).
Deduplication: Collect, then like one of your own tweets, collect again. Verify _collected_at is preserved (not overwritten).
Atomic writes: Kill the process mid-collection. Verify no corrupted JSON.

Cost Estimate

Component	Monthly Cost
X API Free tier	$0 (read-only, low limits)
X API Basic tier	$200/mo (search + higher limits)
X API Pro tier	$5,000/mo (full archive)
Recommended	$0 (free) or $200 (basic)

Free tier works for personal monitoring. Basic tier needed for keyword search.

Troubleshooting

API returns 403:

Check your app has the right access level (Read or Read+Write)
Free tier apps can only use basic endpoints
Some endpoints require Basic or Pro tier

Rate limited (429):

The collector respects rate limits automatically
If hitting limits frequently, increase the cron interval to 60 minutes
Check data/state.json for rate limit tracking

No tweets collected:

Verify the user ID is correct (numeric, not handle)
Check the Bearer Token is valid (Step 1 validation)
Some accounts may have protected tweets (requires OAuth 2.0 user context)

16 KiB Raw Blame History