Files
gbrain/skills/ingest/SKILL.md
Garry Tan baf3517868 feat: v0.9.0 -- smart file storage, publish, production-grade skills (#62)
* feat: battle-tested skill patterns from production deployment

Backport production-learned brain-operations patterns:
- Iron Law of Back-Linking (mandatory bidirectional linking)
- Brain filing rules (file by primary subject, not format)
- Enrichment protocol (7-step pipeline, 3-tier system, person/company templates)
- Media ingest workflows (articles, videos, podcasts, PDFs, screenshots)
- Citation requirements (mandatory [Source: ...] on every fact)
- Test Before Bulk operating principle
- Voice recipe: unicode crash fix, PII scrub, identity-first prompt, DIY STT+LLM+TTS
- X-to-Brain recipe: image OCR, Filtered Stream, tweet rating rubric, cron stagger

* chore: bump version and changelog (v0.8.1)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: add _brain-filing-rules.md to CLAUDE.md key files

* feat: smart file upload with TUS resumable and .redirect.yaml pointers

- Supabase Storage auto-selects upload method by file size:
  < 100 MB standard POST, >= 100 MB TUS resumable (6 MB chunks + retry)
- Signed URL generation for private bucket access (1-hour expiry)
- New `upload-raw` command with size routing: small text stays in git,
  large/media files go to cloud with .redirect.yaml pointer
- New `signed-url` command for generating access links
- File resolver supports both .redirect.yaml (v0.9+) and .redirect (legacy)
- Redirect format upgraded: 10 fields with full metadata
- All migration commands (mirror, redirect, restore, clean) handle both formats

* feat: skills reference actual gbrain file commands

- Filing rules document upload-raw, signed-url, and .redirect.yaml format
- Ingest skill uses gbrain files upload-raw for raw source preservation
- Maintain skill adds file storage health checks
- Setup skill adds storage configuration phase with migration guidance
- Voice recipe uses upload-raw for call audio storage
- Migration v0.9.0 with complete storage setup instructions

* chore: bump version and changelog (v0.9.0)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: gbrain publish -- shareable HTML with password protection

First code+skill pair: deterministic code does the work (strip private data,
encrypt with AES-256-GCM, generate self-contained HTML), the skill tells the
agent when and how to use it. 34 new tests.

See: https://x.com/garrytan/status/2042925773300908103

* feat: backlinks check/fix, page lint, and report commands

Three new deterministic tools (zero LLM calls):

- gbrain backlinks check/fix -- scans brain for entity mentions without
  back-links, creates them. Enforces the Iron Law from the skills.
- gbrain lint [--fix] -- catches LLM preambles, code fence wrapping,
  placeholder dates, missing frontmatter, broken citations, empty sections.
  --fix auto-strips fixable artifacts.
- gbrain report --type <name> -- saves timestamped reports to
  brain/reports/{type}/YYYY-MM-DD-HHMM.md for audit trails.

33 new tests (409 total, 0 fail).

* feat: v0.9.0 migration tells agents to swap scripts for built-in commands

Migration file now:
- Lists all 5 new deterministic commands with usage examples
- Includes a script-to-command replacement table (old -> new)
- Tells the agent to find custom script references in AGENTS.md,
  skills, and cron jobs and replace with gbrain commands
- Adds recommended cron jobs for daily backlink fix + weekly lint
- References the Thin Harness, Fat Skills thread

* fix: CLI routing bugs found during DX review

- Fixed subArgs reference error in handleCliOnly (used wrong variable name)
- Renamed gbrain backlinks check/fix to gbrain check-backlinks to avoid
  conflict with existing backlinks operation (per-page incoming links)
- Added TOOLS section to --help output showing publish, check-backlinks,
  lint, report
- Added upload-raw and signed-url to FILES section in --help
- Updated all docs/migration references to use check-backlinks

* fix: security hardening from adversarial review

- XSS: sanitize marked.parse() output (strip script/iframe/on* attrs)
- Path traversal: validate report --type against [a-z0-9-] pattern
- TUS: HEAD request before retry to get server's actual offset (TUS spec)
- Pointer: upload-raw now includes pointer content in JSON output
- Symlinks: use lstatSync in all walkers to prevent directory escape

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-11 21:46:07 -10:00

10 KiB

Ingest Skill

Ingest meetings, articles, media, documents, and conversations into the brain.

Filing rule: Read skills/_brain-filing-rules.md before creating any new page.

Iron Law: Back-Linking (MANDATORY)

Every mention of a person or company with a brain page MUST create a back-link FROM that entity's page TO the page mentioning them. An unlinked mention is a broken brain. See skills/_brain-filing-rules.md for format.

Citation Requirements (MANDATORY)

Every fact written to a brain page must carry an inline [Source: ...] citation.

  • User's statements: [Source: User, {context}, YYYY-MM-DD]
  • Meeting data: [Source: Meeting "{title}", YYYY-MM-DD]
  • Email/message: [Source: email from {name} re: {subject}, YYYY-MM-DD]
  • Web content: [Source: {publication}, {URL}, YYYY-MM-DD]
  • Social media: [Source: X/@handle, YYYY-MM-DD](URL) (include link)
  • Synthesis: [Source: compiled from {sources}]

Workflow

  1. Parse the source. Extract people, companies, dates, and events from the input.
  2. For each entity mentioned:
    • Read the entity's page from gbrain to check if it exists
    • If exists: update compiled_truth (rewrite State section with new info, don't append)
    • If new: check notability gate, then store the page in gbrain with the appropriate type and slug
  3. Append to timeline. Add a timeline entry in gbrain for each event, with date, summary, and source citation.
  4. Create cross-reference links. Link entities in gbrain for every entity pair mentioned together, using the appropriate relationship type.
  5. Back-link all entities. Update EVERY mentioned entity's page with a back-link to this page (Iron Law).
  6. Timeline merge. The same event appears on ALL mentioned entities' timelines. If Alice met Bob at Acme Corp, the event goes on Alice's page, Bob's page, and Acme Corp's page.

Entity Detection on Every Message

Production agents should detect entity mentions on EVERY inbound message. This is the signal detection loop that makes the brain compound over time.

Protocol

  1. Scan the message for entity mentions: people, companies, concepts, original thinking. Fire on every message (no exceptions unless purely operational).
  2. For each entity detected:
    • gbrain search "name" -- does a page already exist?
    • If yes: load context with gbrain get <slug>. Use the compiled truth to inform your response. Update the page if the message contains new information.
    • If no: assess notability (see skills/_brain-filing-rules.md). If the entity is worth tracking, create a new page with gbrain put <type/slug> and populate with what you know.
  3. After creating or updating pages: sync to gbrain:
    gbrain sync --no-pull --no-embed
    
  4. Don't block the conversation. Entity detection and enrichment should happen alongside the response, not before it. The user shouldn't wait for brain writes to get an answer.

What counts as notable

  • People the user interacts with or discusses (not random mentions)
  • Companies relevant to the user's work or interests
  • Concepts or frameworks the user references or creates
  • The user's own original thinking (ideas, theses, observations) -- highest value
  • See skills/_brain-filing-rules.md for the full notability gate

What to capture from the user's own thinking

Original thinking is the most valuable signal. Capture exact phrasing -- the user's language IS the insight. Don't paraphrase.

  • Novel observations or theses
  • Frameworks, mental models, heuristics
  • Connections between ideas that others miss
  • Contrarian positions with reasoning
  • Strong reactions to external stimuli (what triggered it and why)

Media Workflows

Content the user encounters should be captured in the brain. File by PRIMARY SUBJECT, not by format (see skills/_brain-filing-rules.md).

Articles & Web Content

Input: URL shared by user, or article mentioned in conversation.

Process:

  1. Fetch content (web_fetch or equivalent)
  2. Extract: title, author, publication, date, full text
  3. Summarize: executive summary + key arguments (not a rehash)
  4. Extract entities: people, companies, concepts mentioned
  5. Save raw source for provenance (see Raw Source Preservation below)
  6. Analyze for the user: don't just summarize. What's interesting given what you know about them? Flag connections, contradictions, content opportunities.

Write to: appropriate directory per filing rules (about a person -> people/, about a company -> companies/, reusable framework -> concepts/, raw data -> sources/)

Videos & Podcasts

Input: URL (YouTube, podcast, etc.) or local audio/video file.

Process:

  1. Get transcript -- speaker-diarized if possible (services like Diarize.io provide speaker-labeled, word-level timing)
  2. Save raw transcript (both JSON and human-readable TXT)
  3. Analyze: executive summary, key ideas, key quotes with speaker attribution, notable stories/anecdotes, people and companies mentioned
  4. Extract and cross-reference all entities mentioned
  5. HARD RULE: every video/podcast brain page MUST link to the raw diarized transcript. A page without transcript links is incomplete.

Write to: media/videos/ or media/podcasts/ with back-links to all entities.

Quality bar:

  • Compelling headline (not "This video discusses...")
  • Executive summary that makes you want to watch/listen
  • Key Ideas as actual insights, not topic labels
  • Verbatim quotes with real speaker names (not "speaker_0")
  • All entities extracted with context and back-linked

PDFs & Documents

Input: File path or URL.

Process:

  1. Extract text (OCR if scanned/image PDF)
  2. Save raw source for provenance
  3. Summarize: executive summary + key sections + notable data
  4. Extract entities
  5. Cross-reference from entity pages

Write to: per filing rules (file by primary subject, not format).

Screenshots & Images

Input: Image file.

Process:

  1. Analyze content (OCR for text-heavy images, description for photos)
  2. If tweet screenshot: extract text, author, date, route to social media workflow
  3. If article screenshot: extract text, route to article workflow
  4. If data/chart: extract data points, describe findings

Write to: depends on content -- route to the appropriate workflow above.

Meeting Transcripts

Input: Transcript from meeting recording service, or manual notes.

Process:

  1. Pull full transcript (source of truth -- AI summaries are medium-low trust)
  2. Save raw transcript for provenance
  3. Write meeting page with YOUR analysis above the line, raw transcript below
  4. Entity propagation (MANDATORY): for each attendee and company discussed:
    • Update their brain page State section if new info surfaced
    • Append to their Timeline with link to the meeting page
    • Create page if person/company is notable and has no page yet
  5. A meeting is NOT fully ingested until all entity pages are updated

Write to: meetings/YYYY-MM-DD-short-description.md

What makes a good meeting page:

  • Reveals the real crux, not a bullet dump
  • Connects to existing brain pages (people, companies, deals)
  • Flags what changed (status, decisions, new info)
  • Names tension or what was left unsaid
  • Captures actual dynamic, not performative summary

Social Media Content

Input: Tweet, thread, or social media post.

Process:

  1. Fetch full content (thread, quote tweets, context)
  2. If images present: OCR via vision model for full text extraction
  3. Summarize: what's being said, why it matters, who's involved
  4. Extract entities and update brain pages
  5. Include direct link to the original post (MANDATORY for citations)

Write to: media/x/ for daily aggregation, or entity-specific directories if the post is primarily about a person/company.

Raw Source Preservation

Every ingested item must have its raw source preserved for provenance.

Use gbrain files upload-raw for automatic size routing:

gbrain files upload-raw <file> --page <page-slug> --type <type>
  • < 100 MB text/PDF: stays in git (brain repo .raw/ sidecar directories)
  • >= 100 MB OR media (video, audio, images): uploaded to cloud storage via TUS resumable upload, .redirect.yaml pointer left in the brain repo

The .redirect.yaml pointer format:

target: supabase://brain-files/page-slug/filename.mp4
bucket: brain-files
storage_path: page-slug/filename.mp4
size: 524288000
size_human: 500 MB
hash: sha256:abc123...
mime: video/mp4
uploaded: 2026-04-11T...
type: transcript

Accessing stored files:

  • gbrain files signed-url <storage-path> -- generate 1-hour signed URL for viewing/sharing
  • gbrain files restore <dir> -- download back to local from cloud storage

Use put_raw_data in gbrain to store raw API responses and metadata (JSON, not binary).

Test Before Bulk

When processing multiple items (batch video ingestion, bulk meeting processing, etc.):

  1. Test on 3-5 items first. Run in test mode if available.
  2. Read the actual output. Is the quality good? Are titles compelling (not "This video discusses...")? Are entities extracted and back-linked? Is the format clean?
  3. Fix what's wrong in the approach/skill, not via one-off patches.
  4. Only then: bulk execute with throttling, commits every 5-10 items.

The marginal cost of testing 3 items first is near zero. The cost of cleaning up 100 bad pages is enormous.

Quality Rules

  • Executive summary in compiled_truth must be updated, not just timeline appended
  • State section is REWRITTEN, not appended to. Current best understanding only.
  • Timeline entries are reverse-chronological (newest first)
  • Every person/company mentioned gets a page if notable (see filing rules)
  • Link types: knows, works_at, invested_in, founded, met_at, discussed
  • Source attribution: every timeline entry includes [Source: ...] citation
  • Back-links: every entity mention creates a back-link (Iron Law)
  • Filing: file by primary subject, not format or source (see filing rules)

Tools Used

  • Read a page from gbrain (get_page)
  • Store/update a page in gbrain (put_page)
  • Add a timeline entry in gbrain (add_timeline_entry)
  • Link entities in gbrain (add_link)
  • List tags for a page (get_tags)
  • Tag a page in gbrain (add_tag)
  • Store raw data in gbrain (put_raw_data)
  • Check backlinks in gbrain (get_backlinks)