gbrain/skills/enrich/SKILL.md
Garry Tan baf3517868 feat: v0.9.0 -- smart file storage, publish, production-grade skills (#62)
* feat: battle-tested skill patterns from production deployment

Backport production-learned brain-operations patterns:
- Iron Law of Back-Linking (mandatory bidirectional linking)
- Brain filing rules (file by primary subject, not format)
- Enrichment protocol (7-step pipeline, 3-tier system, person/company templates)
- Media ingest workflows (articles, videos, podcasts, PDFs, screenshots)
- Citation requirements (mandatory [Source: ...] on every fact)
- Test Before Bulk operating principle
- Voice recipe: unicode crash fix, PII scrub, identity-first prompt, DIY STT+LLM+TTS
- X-to-Brain recipe: image OCR, Filtered Stream, tweet rating rubric, cron stagger

* chore: bump version and changelog (v0.8.1)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: add _brain-filing-rules.md to CLAUDE.md key files

* feat: smart file upload with TUS resumable and .redirect.yaml pointers

- Supabase Storage auto-selects upload method by file size:
  < 100 MB standard POST, >= 100 MB TUS resumable (6 MB chunks + retry)
- Signed URL generation for private bucket access (1-hour expiry)
- New `upload-raw` command with size routing: small text stays in git,
  large/media files go to cloud with .redirect.yaml pointer
- New `signed-url` command for generating access links
- File resolver supports both .redirect.yaml (v0.9+) and .redirect (legacy)
- Redirect format upgraded: 10 fields with full metadata
- All migration commands (mirror, redirect, restore, clean) handle both formats

* feat: skills reference actual gbrain file commands

- Filing rules document upload-raw, signed-url, and .redirect.yaml format
- Ingest skill uses gbrain files upload-raw for raw source preservation
- Maintain skill adds file storage health checks
- Setup skill adds storage configuration phase with migration guidance
- Voice recipe uses upload-raw for call audio storage
- Migration v0.9.0 with complete storage setup instructions

* chore: bump version and changelog (v0.9.0)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: gbrain publish -- shareable HTML with password protection

First code+skill pair: deterministic code does the work (strip private data,
encrypt with AES-256-GCM, generate self-contained HTML), the skill tells the
agent when and how to use it. 34 new tests.

See: https://x.com/garrytan/status/2042925773300908103

* feat: backlinks check/fix, page lint, and report commands

Three new deterministic tools (zero LLM calls):

- gbrain backlinks check/fix -- scans brain for entity mentions without
  back-links, creates them. Enforces the Iron Law from the skills.
- gbrain lint [--fix] -- catches LLM preambles, code fence wrapping,
  placeholder dates, missing frontmatter, broken citations, empty sections.
  --fix auto-strips fixable artifacts.
- gbrain report --type <name> -- saves timestamped reports to
  brain/reports/{type}/YYYY-MM-DD-HHMM.md for audit trails.

33 new tests (409 total, 0 fail).

* feat: v0.9.0 migration tells agents to swap scripts for built-in commands

Migration file now:
- Lists all 5 new deterministic commands with usage examples
- Includes a script-to-command replacement table (old -> new)
- Tells the agent to find custom script references in AGENTS.md,
  skills, and cron jobs and replace with gbrain commands
- Adds recommended cron jobs for daily backlink fix + weekly lint
- References the Thin Harness, Fat Skills thread

* fix: CLI routing bugs found during DX review

- Fixed subArgs reference error in handleCliOnly (used wrong variable name)
- Renamed gbrain backlinks check/fix to gbrain check-backlinks to avoid
  conflict with existing backlinks operation (per-page incoming links)
- Added TOOLS section to --help output showing publish, check-backlinks,
  lint, report
- Added upload-raw and signed-url to FILES section in --help
- Updated all docs/migration references to use check-backlinks

* fix: security hardening from adversarial review

- XSS: sanitize marked.parse() output (strip script/iframe/on* attrs)
- Path traversal: validate report --type against [a-z0-9-] pattern
- TUS: HEAD request before retry to get server's actual offset (TUS spec)
- Pointer: upload-raw now includes pointer content in JSON output
- Symlinks: use lstatSync in all walkers to prevent directory escape

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-11 21:46:07 -10:00


# Enrich Skill
Enrich person and company pages from external sources. Scale effort to importance.
> **Filing rule:** Read `skills/_brain-filing-rules.md` before creating any new page.
## Iron Law: Back-Linking (MANDATORY)
Every mention of a person or company with a brain page MUST create a back-link
FROM that entity's page TO the page mentioning them. An unlinked mention is a
broken brain. See `skills/_brain-filing-rules.md` for format.
## Philosophy
A brain page should read like an intelligence dossier, not a LinkedIn scrape.
Facts are table stakes. Texture is the value -- what do they believe, what are
they building, what makes them tick, where are they headed.
## Citation Requirements (MANDATORY)
Every fact must carry an inline `[Source: ...]` citation.
Three formats:
- **Direct attribution:** `[Source: User, {context}, YYYY-MM-DD]`
- **API/external:** `[Source: {provider} enrichment, YYYY-MM-DD]`
- **Synthesis:** `[Source: compiled from {list of sources}]`
Source precedence (highest to lowest):
1. User's direct statements
2. Compiled truth (pre-existing brain synthesis)
3. Timeline entries (raw evidence)
4. External sources (API enrichment, web search)
When sources conflict, note the contradiction with both citations.
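
A minimal sketch of the three citation formats and the precedence order above. The function names and the `SOURCE_PRECEDENCE` keys are hypothetical, chosen for illustration; they are not part of gbrain.

```python
from datetime import date

# Lower number = higher precedence, mirroring the ordered list above.
# The key names are illustrative assumptions.
SOURCE_PRECEDENCE = {
    "user": 1,            # user's direct statements
    "compiled_truth": 2,  # pre-existing brain synthesis
    "timeline": 3,        # raw evidence
    "external": 4,        # API enrichment, web search
}


def direct_citation(context: str, on: date) -> str:
    """Direct attribution: [Source: User, {context}, YYYY-MM-DD]."""
    return f"[Source: User, {context}, {on.isoformat()}]"


def api_citation(provider: str, on: date) -> str:
    """API/external: [Source: {provider} enrichment, YYYY-MM-DD]."""
    return f"[Source: {provider} enrichment, {on.isoformat()}]"


def synthesis_citation(sources: list[str]) -> str:
    """Synthesis: [Source: compiled from {list of sources}]."""
    return f"[Source: compiled from {', '.join(sources)}]"


def rank_claims(claims: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Sort (source_kind, text) pairs so the highest-precedence claim comes first."""
    return sorted(claims, key=lambda claim: SOURCE_PRECEDENCE[claim[0]])
```

Ranking only decides which claim leads. Conflicting claims still both stay on the page, each with its own citation.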
## When To Enrich
### Primary triggers
- User mentions an entity in conversation
- Entity appears in a meeting transcript or email
- New contact appears with significant context
- Entity makes news or has a major event
- Any ingest pipeline encounters a notable entity
### Do NOT enrich
- Random mentions with no relationship signal
- Bot/spam accounts
- Entities with no substantive connection to the user's work
- Same page enriched within the past week (unless new signal warrants it)
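
The gate above can be sketched as a single predicate. The `Mention` fields are hypothetical, and "within the past week" is read here as fewer than 7 days, which is an assumption.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class Mention:
    # All field names are illustrative, not a gbrain schema.
    has_relationship_signal: bool
    is_bot_or_spam: bool
    relevant_to_users_work: bool
    last_enriched: Optional[date] = None
    new_signal: bool = False


def should_enrich(m: Mention, today: date) -> bool:
    """Apply the Do-NOT-enrich rules; anything that survives is enriched."""
    if m.is_bot_or_spam:
        return False
    if not m.has_relationship_signal or not m.relevant_to_users_work:
        return False
    # Assumption: "past week" means fewer than 7 days since last enrichment.
    if m.last_enriched is not None and today - m.last_enriched < timedelta(days=7):
        return m.new_signal
    return True
```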
## Enrichment Tiers
Scale enrichment to importance. Don't waste API calls on low-value entities.
| Tier | Who | Effort | Sources |
|------|-----|--------|---------|
| 1 (key) | Inner circle, close collaborators, key contacts | Full pipeline | All available APIs + deep web research |
| 2 (notable) | Occasional interactions, industry figures | Moderate | Web research + social + brain cross-ref |
| 3 (minor) | Worth tracking, not critical | Light | Brain cross-ref + social lookup if handle known |
## The Enrichment Protocol (7 Steps)
### Step 1: Identify entities
Extract people, companies, concepts from the incoming signal.
### Step 2: Check brain state
For each entity:
- `gbrain search "name"` -- does a page already exist?
- **If yes:** UPDATE path (add new signal, update compiled truth if material)
- **If no:** CREATE path (check notability gate first, then create)
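
The branch above, as a rough sketch. `choose_path` is a hypothetical helper, and returning `"skip"` when the notability gate fails is an assumption about what the gate implies.

```python
def choose_path(page_exists: bool, passes_notability_gate: bool) -> str:
    """Pick the write path for one entity: 'update', 'create', or 'skip'."""
    if page_exists:
        return "update"
    # Assumption: an entity that fails the notability gate gets no page.
    return "create" if passes_notability_gate else "skip"
```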
### Step 3: Extract signal from source
Don't just capture facts. Capture texture:
| Signal Type | What to Extract |
|-------------|----------------|
| Opinions, beliefs | What They Believe section |
| Current projects, features shipped | What They're Building section |
| Ambition, career arc, motivation | What Motivates Them section |
| Topics they return to obsessively | Hobby Horses section |
| Who they amplify, argue with, respect | Network / Relationships |
| Ascending, plateauing, pivoting? | Trajectory section |
| Role, company, funding, location | State section (hard facts) |
### Step 4: External data source lookups
Priority order -- stop when you have enough signal for the entity's tier.
**4a. Brain cross-reference (always, all tiers)**
- `gbrain search "name"` and `gbrain query "what do we know about name"`
- Check related pages: company pages for person enrichment and vice versa
- This is free and often the richest source
**4b. Web research (Tier 1 and 2)**
- Use Perplexity, Brave Search, Exa, or equivalent web research tool
- **Key pattern:** Send existing brain knowledge as context so the search
  returns DELTA (what's new vs what you already know), not a rehash
- Opus-class models for Tier 1 deep research, lighter models for Tier 2
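
One way to sketch the delta pattern from 4b: fold what the brain already knows into the research prompt. The prompt wording and `build_delta_prompt` are illustrative assumptions, not a prescribed format for any search tool.

```python
def build_delta_prompt(entity: str, known_facts: list[str]) -> str:
    """Build a research prompt that asks only for what is new about an entity."""
    known = "\n".join(f"- {fact}" for fact in known_facts) or "- (nothing yet)"
    return (
        f"Research {entity}.\n"
        f"We already know:\n{known}\n"
        "Report only what is new or what contradicts the above. "
        "Do not repeat known facts."
    )
```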
**4c. Social media lookup (all tiers when handle known)**
- Pull recent posts/tweets for tone, interests, current focus
- Social media is the highest-texture signal for what someone actually thinks
**4d. People enrichment APIs (Tier 1)**
- LinkedIn data, career history, connections, education
**4e. Company enrichment APIs (Tier 1)**
- Company data, financials, headcount, key hires, recent news
| Data Need | Example Sources | Tier |
|-----------|----------------|------|
| Web research | Perplexity, Brave, Exa | 1-2 |
| LinkedIn / career | Crustdata, Proxycurl, People Data Labs | 1 |
| Career history | Happenstance, LinkedIn | 1 |
| Funding / company data | Crunchbase, PitchBook, Clearbit | 1 |
| Social media | Platform APIs, web scraping | 1-3 |
| Meeting history | Calendar/meeting transcript tools | 1-2 |
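
The tier-to-lookup mapping in sub-steps 4a-4e, with the "stop when you have enough signal" rule, can be sketched as follows. The step names, `fetch`, and `enough` are hypothetical placeholders for whatever tools and judgment the agent actually uses.

```python
from typing import Callable

# Step names are shorthand for sub-steps 4a-4e, in priority order.
LOOKUPS_BY_TIER = {
    1: ["brain_cross_ref", "web_research", "social_media", "people_api", "company_api"],
    2: ["brain_cross_ref", "web_research", "social_media"],
    3: ["brain_cross_ref", "social_media"],
}


def plan_lookups(tier: int, handle_known: bool) -> list[str]:
    """Return the ordered lookups for a tier; social lookup needs a known handle."""
    steps = LOOKUPS_BY_TIER[tier]
    if not handle_known:
        steps = [s for s in steps if s != "social_media"]
    return list(steps)


def run_lookups(
    steps: list[str],
    fetch: Callable[[str], object],
    enough: Callable[[dict], bool],
) -> dict:
    """Run lookups in order and stop as soon as `enough` says the signal suffices."""
    results: dict = {}
    for step in steps:
        results[step] = fetch(step)
        if enough(results):
            break
    return results
```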
### Step 5: Save raw data (preserves provenance)
Store raw API responses via `put_raw_data` in gbrain:
```json
{
  "source": "crustdata",
  "fetched_at": "2026-04-11T...",
  "query": "jane doe",
  "data": { ... }
}
```
Raw data preserves provenance. If the compiled truth is ever questioned,
the raw data shows exactly what the API returned.
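
A minimal sketch of building that envelope before handing it to `put_raw_data`. `build_raw_record` is a hypothetical helper. How `put_raw_data` is invoked is not shown in this skill, so the sketch stops at producing the record, and the UTC ISO 8601 timestamp format is an assumption.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def build_raw_record(
    source: str,
    query: str,
    data: dict,
    fetched_at: Optional[datetime] = None,
) -> dict:
    """Wrap an untouched API response with the provenance fields shown above."""
    # Assumption: timestamps are stored as UTC ISO 8601 strings.
    stamp = fetched_at or datetime.now(timezone.utc)
    return {
        "source": source,
        "fetched_at": stamp.isoformat(),
        "query": query,
        "data": data,  # stored as returned, never summarized or cleaned
    }


if __name__ == "__main__":
    record = build_raw_record("crustdata", "jane doe", {"name": "Jane Doe"})
    print(json.dumps(record, indent=2))
```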
### Step 6: Write to brain
#### CREATE path
1. Check notability gate (see `skills/_brain-filing-rules.md`)
2. Check filing rules -- where does this entity go?
3. Create page with the appropriate template (below)
4. Fill compiled truth with citations
5. Add first timeline entry
6. Leave empty sections as `[No data yet]` (don't fill with boilerplate)
#### UPDATE path
1. Add new timeline entries (reverse-chronological order; append-only, so never rewrite existing entries)
2. Update compiled truth ONLY if the new signal materially changes the picture
3. Update State section with new facts
4. Flag contradictions between new signal and existing compiled truth
5. Don't overwrite user-written assessments with API boilerplate
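
A sketch of the first UPDATE step: insert a new timeline entry in date order without touching existing lines. It assumes entries use the `- **YYYY-MM-DD** | Event [Source: ...]` line format from the page templates in this skill. `format_entry` and `add_entry` are hypothetical helpers.

```python
from datetime import date


def format_entry(on: date, event: str, citation: str) -> str:
    """Render one timeline line in the template's format."""
    return f"- **{on.isoformat()}** | {event} {citation}"


def entry_date(line: str) -> str:
    """Extract the ISO date from a '- **YYYY-MM-DD** | ...' line."""
    return line[4:14]


def add_entry(timeline: list[str], entry: str) -> list[str]:
    """Return a new timeline with `entry` placed newest-first; old lines are unchanged."""
    for i, line in enumerate(timeline):
        # ISO dates sort correctly as plain strings.
        if entry_date(line) <= entry_date(entry):
            return timeline[:i] + [entry] + timeline[i:]
    return timeline + [entry]
```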
#### Person page template
```markdown
---
title: Full Name
type: person
created: YYYY-MM-DD
updated: YYYY-MM-DD
tags: []
company: Current Company
relationship: How the user knows them
email:
linkedin:
twitter:
location:
---
# Full Name
> 1-paragraph executive summary: HOW do you know them, WHY do they matter,
> what's the current state of the relationship.
## State
Role, company, key context. Hard facts only.
## What They Believe
Ideology, first principles, worldview. What hills do they die on?
## What They're Building
Current projects, recent launches, what they're focused on.
## What Motivates Them
Ambition, career arc, what drives them.
## Hobby Horses
Topics they return to obsessively. Recurring themes in their work/posts.
## Assessment
Your read on this person. Strengths, gaps, trajectory.
## Trajectory
Ascending, plateauing, pivoting, declining? Where are they headed?
## Relationship
History of interactions, shared context, relationship quality.
## Contact
Email, social handles, preferred communication channel.
## Network
Key connections, mutual contacts, organizational relationships.
## Open Threads
Active conversations, pending items, things to follow up on.
---
## Timeline
Reverse chronological. Every entry has a date and [Source: ...] citation.
- **YYYY-MM-DD** | Event description [Source: ...]
```
#### Company page template
```markdown
---
title: Company Name
type: company
created: YYYY-MM-DD
updated: YYYY-MM-DD
tags: []
---
# Company Name
> 1-paragraph executive summary.
## State
What they do, stage, key people, key metrics, your connection.
## Open Threads
Active items, pending decisions, things to track.
---
## Timeline
- **YYYY-MM-DD** | Event description [Source: ...]
```
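
A rough sketch of scaffolding a new company page from the template above, leaving unfilled sections as `[No data yet]`. `new_company_page` is a hypothetical helper, and it does no YAML escaping of the title, which a real implementation would need.

```python
from datetime import date

# Section headings taken from the company template above.
COMPANY_SECTIONS = ["State", "Open Threads"]


def new_company_page(name: str, summary: str, today: date, first_entry: str) -> str:
    """Render a company page skeleton with one timeline entry."""
    lines = [
        "---",
        f"title: {name}",  # assumption: name needs no YAML quoting
        "type: company",
        f"created: {today.isoformat()}",
        f"updated: {today.isoformat()}",
        "tags: []",
        "---",
        f"# {name}",
        f"> {summary}",
    ]
    for section in COMPANY_SECTIONS:
        # Empty sections get the placeholder, not boilerplate.
        lines += [f"## {section}", "[No data yet]"]
    lines += ["---", "## Timeline", first_entry]
    return "\n".join(lines) + "\n"
```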
### Step 7: Cross-reference
- Update company pages from person enrichment (and vice versa)
- Update related project/deal pages if relevant context surfaced
- Add back-links from every entity mentioned (MANDATORY)
- Check index files if the brain uses them
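
The mandatory back-link step can be checked mechanically once mentions and existing back-links are known. The sketch below assumes both are already available as plain mappings of page slugs. The slugs and the `missing_backlinks` helper are hypothetical, and the actual link format lives in `skills/_brain-filing-rules.md`.

```python
def missing_backlinks(
    mentions: dict[str, set[str]],
    backlinks: dict[str, set[str]],
) -> list[tuple[str, str]]:
    """Return (entity_page, mentioning_page) pairs that still need a back-link.

    mentions:  mentioning page -> entity pages it mentions
    backlinks: entity page -> pages it already links back to
    """
    missing = []
    for page, entities in mentions.items():
        for entity in entities:
            if page not in backlinks.get(entity, set()):
                missing.append((entity, page))
    return sorted(missing)
```

An empty result means every mention has its back-link.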
## Bulk Enrichment Rules
- **Test on 3-5 entities first.** Read actual output. Check quality.
- Only proceed to bulk after test shots pass your quality bar.
- 3+ entities from one source -> batch process or spawn sub-agent
- Throttle API calls. Respect rate limits.
- Commit every 5-10 entities during bulk runs.
- Save a report after bulk enrichment (see Report Storage below).
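
The rules above can be sketched as a loop: run a small test batch, stop if it fails review, then throttle and commit in chunks. `enrich`, `commit`, and `passes_quality` are hypothetical callables supplied by the caller, and `passes_quality` stands in for actually reading the output.

```python
import time
from typing import Callable


def bulk_enrich(
    entities: list,
    enrich: Callable,
    commit: Callable[[], None],
    passes_quality: Callable[[object], bool],
    test_size: int = 3,
    commit_every: int = 5,
    delay_s: float = 0.0,
):
    """Return (results, completed). Stops after the test batch if quality fails."""
    test_batch, rest = entities[:test_size], entities[test_size:]
    results = [enrich(e) for e in test_batch]
    if not all(passes_quality(r) for r in results):
        return results, False  # fix the pipeline before going bulk

    pending = len(test_batch)  # enriched but not yet committed
    for entity in rest:
        results.append(enrich(entity))
        pending += 1
        time.sleep(delay_s)  # crude throttle; real rate limits vary per API
        if pending >= commit_every:
            commit()
            pending = 0
    if pending:
        commit()
    return results, True
```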
## Validation Rules
- Connection count < 20 on LinkedIn = likely wrong person, skip
- Name mismatch between brain and API = skip, flag for review
- Joke profiles or obviously wrong data = save to raw, don't update page
- Don't overwrite user-written assessments with API boilerplate
- When in doubt: save raw data but don't update brain page
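
A minimal sketch of the two mechanical checks above. `ApiProfile` and `validate_profile` are hypothetical, and matching names by case-insensitive equality is an assumption; real matching likely needs to be fuzzier. In every non-`update` case the raw data is still saved.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ApiProfile:
    # Illustrative fields, not a real API response schema.
    name: str
    linkedin_connections: Optional[int] = None


def validate_profile(brain_name: str, profile: ApiProfile) -> str:
    """Return 'update', 'skip', or 'flag' for one API result."""
    # Assumption: names match if equal after trimming and lowercasing.
    if profile.name.strip().lower() != brain_name.strip().lower():
        return "flag"  # name mismatch: skip the update, flag for review
    if profile.linkedin_connections is not None and profile.linkedin_connections < 20:
        return "skip"  # likely the wrong person
    return "update"
```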
## Report Storage
After enrichment sweeps, save a report:
- Number of entities processed
- New pages created vs existing updated
- Data sources called and results quality
- Notable discoveries or contradictions
- Validation flags or API failures
This creates an audit trail for brain enrichment over time.
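
A sketch of a report body covering the five items above. `SweepReport` and its field names are hypothetical, and the Markdown layout is one possible shape, not a required format.

```python
from dataclasses import dataclass, field


def bullets(items) -> list[str]:
    """Render items as Markdown bullets, or a single '- None' when empty."""
    rendered = [f"- {item}" for item in items]
    return rendered or ["- None"]


@dataclass
class SweepReport:
    processed: int
    created: int
    updated: int
    sources: dict[str, str] = field(default_factory=dict)  # source -> quality note
    discoveries: list[str] = field(default_factory=list)
    flags: list[str] = field(default_factory=list)

    def to_markdown(self) -> str:
        lines = [
            "# Enrichment report",
            f"- Entities processed: {self.processed}",
            f"- New pages created: {self.created}",
            f"- Existing pages updated: {self.updated}",
            "## Data sources and result quality",
            *bullets(f"{name}: {quality}" for name, quality in self.sources.items()),
            "## Notable discoveries or contradictions",
            *bullets(self.discoveries),
            "## Validation flags or API failures",
            *bullets(self.flags),
        ]
        return "\n".join(lines) + "\n"
```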
## Tools Used
- Read a page from gbrain (get_page)
- Store/update a page in gbrain (put_page)
- Add a timeline entry in gbrain (add_timeline_entry)
- List pages in gbrain by type (list_pages)
- Store raw API data in gbrain (put_raw_data)
- Retrieve raw data from gbrain (get_raw_data)
- Link entities in gbrain (add_link)
- Check backlinks in gbrain (get_backlinks)