feat: v0.9.0 -- smart file storage, publish, production-grade skills (#62)

* feat: battle-tested skill patterns from production deployment

Backport production-learned brain-operations patterns:
- Iron Law of Back-Linking (mandatory bidirectional linking)
- Brain filing rules (file by primary subject, not format)
- Enrichment protocol (7-step pipeline, 3-tier system, person/company templates)
- Media ingest workflows (articles, videos, podcasts, PDFs, screenshots)
- Citation requirements (mandatory [Source: ...] on every fact)
- Test Before Bulk operating principle
- Voice recipe: unicode crash fix, PII scrub, identity-first prompt, DIY STT+LLM+TTS
- X-to-Brain recipe: image OCR, Filtered Stream, tweet rating rubric, cron stagger

* chore: bump version and changelog (v0.8.1)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: add _brain-filing-rules.md to CLAUDE.md key files

* feat: smart file upload with TUS resumable and .redirect.yaml pointers

- Supabase Storage auto-selects upload method by file size: < 100 MB standard POST, >= 100 MB TUS resumable (6 MB chunks + retry)
- Signed URL generation for private bucket access (1-hour expiry)
- New `upload-raw` command with size routing: small text stays in git, large/media files go to cloud with .redirect.yaml pointer
- New `signed-url` command for generating access links
- File resolver supports both .redirect.yaml (v0.9+) and .redirect (legacy)
- Redirect format upgraded: 10 fields with full metadata
- All migration commands (mirror, redirect, restore, clean) handle both formats

* feat: skills reference actual gbrain file commands

- Filing rules document upload-raw, signed-url, and .redirect.yaml format
- Ingest skill uses gbrain files upload-raw for raw source preservation
- Maintain skill adds file storage health checks
- Setup skill adds storage configuration phase with migration guidance
- Voice recipe uses upload-raw for call audio storage
- Migration v0.9.0 with complete storage setup instructions

* chore: bump version and changelog (v0.9.0)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: gbrain publish -- shareable HTML with password protection

First code+skill pair: deterministic code does the work (strip private data, encrypt with AES-256-GCM, generate self-contained HTML), the skill tells the agent when and how to use it. 34 new tests.

See: https://x.com/garrytan/status/2042925773300908103

* feat: backlinks check/fix, page lint, and report commands

Three new deterministic tools (zero LLM calls):
- gbrain backlinks check/fix -- scans brain for entity mentions without back-links, creates them. Enforces the Iron Law from the skills.
- gbrain lint [--fix] -- catches LLM preambles, code fence wrapping, placeholder dates, missing frontmatter, broken citations, empty sections. --fix auto-strips fixable artifacts.
- gbrain report --type <name> -- saves timestamped reports to brain/reports/{type}/YYYY-MM-DD-HHMM.md for audit trails.

33 new tests (409 total, 0 fail).

* feat: v0.9.0 migration tells agents to swap scripts for built-in commands

Migration file now:
- Lists all 5 new deterministic commands with usage examples
- Includes a script-to-command replacement table (old -> new)
- Tells the agent to find custom script references in AGENTS.md, skills, and cron jobs and replace with gbrain commands
- Adds recommended cron jobs for daily backlink fix + weekly lint
- References the Thin Harness, Fat Skills thread

* fix: CLI routing bugs found during DX review

- Fixed subArgs reference error in handleCliOnly (used wrong variable name)
- Renamed gbrain backlinks check/fix to gbrain check-backlinks to avoid conflict with existing backlinks operation (per-page incoming links)
- Added TOOLS section to --help output showing publish, check-backlinks, lint, report
- Added upload-raw and signed-url to FILES section in --help
- Updated all docs/migration references to use check-backlinks

* fix: security hardening from adversarial review

- XSS: sanitize marked.parse() output (strip script/iframe/on* attrs)
- Path traversal: validate report --type against [a-z0-9-] pattern
- TUS: HEAD request before retry to get server's actual offset (TUS spec)
- Pointer: upload-raw now includes pointer content in JSON output
- Symlinks: use lstatSync in all walkers to prevent directory escape

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
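
The commit message combines two constraints for `gbrain report`: the `--type` value is validated against `[a-z0-9-]`, and reports land at `brain/reports/{type}/YYYY-MM-DD-HHMM.md`. A minimal Python sketch of that idea follows. It is an illustration only, not gbrain's code: the function name `report_path` and the `brain_root` parameter are hypothetical.

```python
import re
from datetime import datetime
from pathlib import PurePosixPath

# Character set taken from the commit message: [a-z0-9-].
REPORT_TYPE_RE = re.compile(r"[a-z0-9-]+")


def report_path(report_type: str, now: datetime, brain_root: str = "brain") -> PurePosixPath:
    """Build brain/reports/{type}/YYYY-MM-DD-HHMM.md (hypothetical helper)."""
    # fullmatch rejects "../x", "a/b", uppercase names and the empty string,
    # so the type cannot point outside the reports directory.
    if not REPORT_TYPE_RE.fullmatch(report_type):
        raise ValueError(f"invalid report type: {report_type!r}")
    stamp = now.strftime("%Y-%m-%d-%H%M")
    return PurePosixPath(brain_root) / "reports" / report_type / f"{stamp}.md"
```

Validating the type before building the path keeps the check in one place; any caller that reaches the join already holds a safe name.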
@@ -1,6 +1,25 @@
 # Ingest Skill

-Ingest meetings, articles, documents, and conversations into the brain.
+Ingest meetings, articles, media, documents, and conversations into the brain.

+> **Filing rule:** Read `skills/_brain-filing-rules.md` before creating any new page.
+
+## Iron Law: Back-Linking (MANDATORY)
+
+Every mention of a person or company with a brain page MUST create a back-link
+FROM that entity's page TO the page mentioning them. An unlinked mention is a
+broken brain. See `skills/_brain-filing-rules.md` for format.
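
One way to picture the Iron Law is as a check over a link graph: if a page mentions a person or company that has its own page, that entity page must link back. The sketch below assumes a hypothetical in-memory representation (`links` maps each page slug to the set of slugs it links to) and assumes entity pages live under `people/` and `companies/`. It does not reproduce how `gbrain check-backlinks` is implemented.

```python
# Assumption: person and company pages are identified by these slug prefixes.
ENTITY_PREFIXES = ("people/", "companies/")


def missing_backlinks(links: dict[str, set[str]]) -> list[tuple[str, str]]:
    """Return (entity, source) pairs where source mentions entity without a link back."""
    missing = []
    for source, mentioned in links.items():
        for entity in mentioned:
            is_entity = entity.startswith(ENTITY_PREFIXES)
            has_page = entity in links  # the rule only covers entities with a page
            if is_entity and has_page and source not in links[entity]:
                missing.append((entity, source))
    return sorted(missing)


links = {
    "meetings/2026-04-11-acme-sync": {"people/alice", "companies/acme"},
    "people/alice": {"meetings/2026-04-11-acme-sync"},
    "companies/acme": set(),  # mentioned by the meeting, but no link back
}
print(missing_backlinks(links))
```

Here only `companies/acme` is reported, because Alice's page already links to the meeting.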
+
+## Citation Requirements (MANDATORY)
+
+Every fact written to a brain page must carry an inline `[Source: ...]` citation.
+
+- **User's statements:** `[Source: User, {context}, YYYY-MM-DD]`
+- **Meeting data:** `[Source: Meeting "{title}", YYYY-MM-DD]`
+- **Email/message:** `[Source: email from {name} re: {subject}, YYYY-MM-DD]`
+- **Web content:** `[Source: {publication}, {URL}, YYYY-MM-DD]`
+- **Social media:** `[Source: X/@handle, YYYY-MM-DD](URL)` (include link)
+- **Synthesis:** `[Source: compiled from {sources}]`
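
The citation templates above are regular enough to generate and check mechanically. The following Python sketch formats three of them and tests whether a line carries any `[Source: ...]` citation. The helper names (`cite_meeting`, `cite_web`, `cite_social`, `has_citation`) are made up for this example and are not part of gbrain.

```python
import re
from datetime import date

# Matches any inline "[Source: ...]" marker; the trailing "(URL)" of the
# social-media form is optional and therefore not part of the pattern.
CITATION_RE = re.compile(r"\[Source: [^\]]+\]")


def cite_meeting(title: str, when: date) -> str:
    return f'[Source: Meeting "{title}", {when.isoformat()}]'


def cite_web(publication: str, url: str, when: date) -> str:
    return f"[Source: {publication}, {url}, {when.isoformat()}]"


def cite_social(handle: str, when: date, url: str) -> str:
    # Social media citations carry the link to the original post.
    return f"[Source: X/@{handle}, {when.isoformat()}]({url})"


def has_citation(line: str) -> bool:
    return CITATION_RE.search(line) is not None


fact = "Acme closed its seed round. " + cite_meeting("Q2 planning", date(2026, 4, 11))
print(fact)
```

A check like `has_citation` only confirms that a marker is present; whether the cited source actually supports the fact still needs a reader.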
+
 ## Workflow

@@ -8,10 +27,11 @@ Ingest meetings, articles, documents, and conversations into the brain.
 2. **For each entity mentioned:**
    - Read the entity's page from gbrain to check if it exists
    - If exists: update compiled_truth (rewrite State section with new info, don't append)
-   - If new: store the page in gbrain with the appropriate type and slug
-3. **Append to timeline.** Add a timeline entry in gbrain for each event, with date, summary, and source.
+   - If new: check notability gate, then store the page in gbrain with the appropriate type and slug
+3. **Append to timeline.** Add a timeline entry in gbrain for each event, with date, summary, and source citation.
 4. **Create cross-reference links.** Link entities in gbrain for every entity pair mentioned together, using the appropriate relationship type.
-5. **Timeline merge.** The same event appears on ALL mentioned entities' timelines. If Alice met Bob at Acme Corp, the event goes on Alice's page, Bob's page, and Acme Corp's page.
+5. **Back-link all entities.** Update EVERY mentioned entity's page with a back-link to this page (Iron Law).
+6. **Timeline merge.** The same event appears on ALL mentioned entities' timelines. If Alice met Bob at Acme Corp, the event goes on Alice's page, Bob's page, and Acme Corp's page.
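
The timeline merge step amounts to fanning one event out to every mentioned entity. A small Python sketch of that fan-out, using the Alice/Bob/Acme example from the step, is shown below. The data model is an assumption made for illustration: timelines are plain lists keyed by slug, and entries are `(date, summary, source)` tuples kept newest first.

```python
from collections import defaultdict

# Hypothetical entry shape: (YYYY-MM-DD date, summary, source citation).
Entry = tuple[str, str, str]


def merge_event(timelines: dict[str, list[Entry]], entities: list[str], entry: Entry) -> None:
    """Put the same event on every mentioned entity's timeline, newest first."""
    for slug in entities:
        timeline = timelines[slug]
        if entry not in timeline:  # re-running the ingest must not duplicate entries
            timeline.append(entry)
            # ISO dates sort correctly as strings; reverse gives newest first.
            timeline.sort(key=lambda e: e[0], reverse=True)


timelines: dict[str, list[Entry]] = defaultdict(list)
event = ("2026-04-11", "Alice met Bob at Acme Corp", '[Source: Meeting "Intro", 2026-04-11]')
merge_event(timelines, ["people/alice", "people/bob", "companies/acme-corp"], event)
```

Making the merge idempotent means a repeated ingest of the same meeting leaves each timeline unchanged.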

 ## Entity Detection on Every Message

@@ -26,13 +46,11 @@ the signal detection loop that makes the brain compound over time.
    - `gbrain search "name"` -- does a page already exist?
    - **If yes:** load context with `gbrain get <slug>`. Use the compiled truth to
      inform your response. Update the page if the message contains new information.
-   - **If no:** assess notability. If the entity is worth tracking (will come up
-     again, is relevant to the user's world), create a new page with
-     `gbrain put <type/slug>` and populate with what you know.
-3. **After creating or updating pages:** commit changes to the brain repo, then
-   sync to gbrain:
+   - **If no:** assess notability (see `skills/_brain-filing-rules.md`). If the entity
+     is worth tracking, create a new page with `gbrain put <type/slug>` and populate
+     with what you know.
+3. **After creating or updating pages:** sync to gbrain:
    ```bash
-   git add brain/ && git commit -m "update entity pages"
    gbrain sync --no-pull --no-embed
    ```
 4. **Don't block the conversation.** Entity detection and enrichment should happen
@@ -42,18 +60,184 @@ the signal detection loop that makes the brain compound over time.
 ### What counts as notable

 - People the user interacts with or discusses (not random mentions)
-- Companies relevant to the user's work, investments, or interests
+- Companies relevant to the user's work or interests
 - Concepts or frameworks the user references or creates
 - The user's own original thinking (ideas, theses, observations) -- highest value
+- See `skills/_brain-filing-rules.md` for the full notability gate

+### What to capture from the user's own thinking
+
+Original thinking is the most valuable signal. Capture exact phrasing -- the user's
+language IS the insight. Don't paraphrase.
+
+- Novel observations or theses
+- Frameworks, mental models, heuristics
+- Connections between ideas that others miss
+- Contrarian positions with reasoning
+- Strong reactions to external stimuli (what triggered it and why)
+
+## Media Workflows
+
+Content the user encounters should be captured in the brain. File by PRIMARY
+SUBJECT, not by format (see `skills/_brain-filing-rules.md`).
+
+### Articles & Web Content
+
+**Input:** URL shared by user, or article mentioned in conversation.
+
+**Process:**
+1. Fetch content (`web_fetch` or equivalent)
+2. Extract: title, author, publication, date, full text
+3. Summarize: executive summary + key arguments (not a rehash)
+4. Extract entities: people, companies, concepts mentioned
+5. **Save raw source** for provenance (see Raw Source Preservation below)
+6. Analyze for the user: don't just summarize. What's interesting given what you
+   know about them? Flag connections, contradictions, content opportunities.
+
+**Write to:** appropriate directory per filing rules (about a person -> `people/`,
+about a company -> `companies/`, reusable framework -> `concepts/`, raw data -> `sources/`)
+
+### Videos & Podcasts
+
+**Input:** URL (YouTube, podcast, etc.) or local audio/video file.
+
+**Process:**
+1. Get transcript -- speaker-diarized if possible (services like Diarize.io provide
+   speaker-labeled, word-level timing)
+2. **Save raw transcript** (both JSON and human-readable TXT)
+3. Analyze: executive summary, key ideas, key quotes with speaker attribution,
+   notable stories/anecdotes, people and companies mentioned
+4. Extract and cross-reference all entities mentioned
+5. **HARD RULE:** every video/podcast brain page MUST link to the raw diarized
+   transcript. A page without transcript links is incomplete.
+
+**Write to:** `media/videos/` or `media/podcasts/` with back-links to all entities.
+
+**Quality bar:**
+- Compelling headline (not "This video discusses...")
+- Executive summary that makes you want to watch/listen
+- Key Ideas as actual insights, not topic labels
+- Verbatim quotes with real speaker names (not "speaker_0")
+- All entities extracted with context and back-linked
+
+### PDFs & Documents
+
+**Input:** File path or URL.
+
+**Process:**
+1. Extract text (OCR if scanned/image PDF)
+2. **Save raw source** for provenance
+3. Summarize: executive summary + key sections + notable data
+4. Extract entities
+5. Cross-reference from entity pages
+
+**Write to:** per filing rules (file by primary subject, not format).
+
+### Screenshots & Images
+
+**Input:** Image file.
+
+**Process:**
+1. Analyze content (OCR for text-heavy images, description for photos)
+2. If tweet screenshot: extract text, author, date, route to social media workflow
+3. If article screenshot: extract text, route to article workflow
+4. If data/chart: extract data points, describe findings
+
+**Write to:** depends on content -- route to the appropriate workflow above.
+
+### Meeting Transcripts
+
+**Input:** Transcript from meeting recording service, or manual notes.
+
+**Process:**
+1. Pull full transcript (source of truth -- AI summaries are medium-low trust)
+2. **Save raw transcript** for provenance
+3. Write meeting page with YOUR analysis above the line, raw transcript below
+4. **Entity propagation (MANDATORY):** for each attendee and company discussed:
+   - Update their brain page State section if new info surfaced
+   - Append to their Timeline with link to the meeting page
+   - Create page if person/company is notable and has no page yet
+5. A meeting is NOT fully ingested until all entity pages are updated
+
+**Write to:** `meetings/YYYY-MM-DD-short-description.md`
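
As an illustration of the `meetings/YYYY-MM-DD-short-description.md` naming scheme, the Python sketch below derives a path from a date and a free-text description. The slug rule (lowercase, runs of non-alphanumerics collapsed to a hyphen) is an assumption chosen for the example; the skill text does not specify one, and `meeting_page_path` is a hypothetical name.

```python
import re
from datetime import date


def meeting_page_path(when: date, description: str) -> str:
    """Build meetings/YYYY-MM-DD-short-description.md from a date and a title."""
    # Assumed slug rule: lowercase, non-alphanumeric runs become one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    if not slug:
        raise ValueError("description must contain at least one letter or digit")
    return f"meetings/{when.isoformat()}-{slug}.md"


print(meeting_page_path(date(2026, 4, 11), "Acme Q2 Planning"))
```

Leading with the ISO date keeps meeting pages in chronological order when the directory is listed alphabetically.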
+
+**What makes a good meeting page:**
+- Reveals the real crux, not a bullet dump
+- Connects to existing brain pages (people, companies, deals)
+- Flags what changed (status, decisions, new info)
+- Names tension or what was left unsaid
+- Captures actual dynamic, not performative summary
+
+### Social Media Content
+
+**Input:** Tweet, thread, or social media post.
+
+**Process:**
+1. Fetch full content (thread, quote tweets, context)
+2. If images present: OCR via vision model for full text extraction
+3. Summarize: what's being said, why it matters, who's involved
+4. Extract entities and update brain pages
+5. Include direct link to the original post (MANDATORY for citations)
+
+**Write to:** `media/x/` for daily aggregation, or entity-specific directories
+if the post is primarily about a person/company.
+
+## Raw Source Preservation
+
+Every ingested item must have its raw source preserved for provenance.
+
+**Use `gbrain files upload-raw` for automatic size routing:**
+```bash
+gbrain files upload-raw <file> --page <page-slug> --type <type>
+```
+
+- **< 100 MB text/PDF**: stays in git (brain repo `.raw/` sidecar directories)
+- **>= 100 MB OR media** (video, audio, images): uploaded to cloud storage
+  via TUS resumable upload, `.redirect.yaml` pointer left in the brain repo
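
The two routing bullets above reduce to a single predicate on size and MIME type. The Python sketch below restates that rule; it is not the `upload-raw` implementation. Two assumptions are labeled in the code: 1 MB is treated as 1024 * 1024 bytes (the pointer example in this file pairs `size: 524288000` with `500 MB`), and any small non-media file is treated like text/PDF.

```python
# Assumption: "100 MB" means 100 * 1024 * 1024 bytes.
SIZE_THRESHOLD = 100 * 1024 * 1024

# Media types named in the rule: video, audio, images.
MEDIA_PREFIXES = ("video/", "audio/", "image/")


def route_raw_file(size_bytes: int, mime: str) -> str:
    """Return "cloud" or "git" for a raw source file (hypothetical helper)."""
    if size_bytes >= SIZE_THRESHOLD or mime.startswith(MEDIA_PREFIXES):
        return "cloud"  # uploaded; a .redirect.yaml pointer stays in the repo
    # Assumption: small files that are not media follow the text/PDF rule.
    return "git"  # kept in the brain repo's .raw/ sidecar directory


print(route_raw_file(2_000_000, "application/pdf"))  # small PDF
print(route_raw_file(2_000_000, "audio/mpeg"))       # small, but media
```

Note that media goes to cloud storage regardless of size, so the size threshold only matters for text and PDF files.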
+
+The `.redirect.yaml` pointer format:
+```yaml
+target: supabase://brain-files/page-slug/filename.mp4
+bucket: brain-files
+storage_path: page-slug/filename.mp4
+size: 524288000
+size_human: 500 MB
+hash: sha256:abc123...
+mime: video/mp4
+uploaded: 2026-04-11T...
+type: transcript
+```
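
To show how the fields in the pointer example relate to one another, here is a Python sketch that derives them from a file's bytes. It mirrors only the nine fields visible in the example above (the commit message mentions ten, and the tenth is not shown here, so it is left out). The function names and the `uploaded` timestamp format are assumptions, not gbrain's code.

```python
import hashlib
from datetime import datetime, timezone


def human_size(size: int) -> str:
    """Format bytes with 1024-based units, e.g. 524288000 -> "500 MB"."""
    units = ["B", "KB", "MB", "GB"]
    value, i = float(size), 0
    while value >= 1024 and i < len(units) - 1:
        value /= 1024
        i += 1
    return f"{value:.0f} {units[i]}"


def build_pointer(data: bytes, bucket: str, page_slug: str, filename: str,
                  mime: str, file_type: str, uploaded: datetime) -> dict:
    """Assemble the pointer fields shown in the example (hypothetical helper)."""
    storage_path = f"{page_slug}/{filename}"
    return {
        "target": f"supabase://{bucket}/{storage_path}",
        "bucket": bucket,
        "storage_path": storage_path,
        "size": len(data),
        "size_human": human_size(len(data)),
        "hash": "sha256:" + hashlib.sha256(data).hexdigest(),
        "mime": mime,
        "uploaded": uploaded.isoformat(),  # assumed ISO 8601 format
        "type": file_type,
    }


pointer = build_pointer(b"abc", "brain-files", "page-slug", "filename.mp4",
                        "video/mp4", "transcript",
                        datetime(2026, 4, 11, tzinfo=timezone.utc))
```

Storing the content hash in the pointer lets a later restore verify that the downloaded file matches what was uploaded.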
+
+**Accessing stored files:**
+- `gbrain files signed-url <storage-path>` -- generate 1-hour signed URL for viewing/sharing
+- `gbrain files restore <dir>` -- download back to local from cloud storage
+
+Use `put_raw_data` in gbrain to store raw API responses and metadata (JSON, not binary).
+
+## Test Before Bulk
+
+When processing multiple items (batch video ingestion, bulk meeting processing, etc.):
+
+1. **Test on 3-5 items first.** Run in test mode if available.
+2. **Read the actual output.** Is the quality good? Are titles compelling (not
+   "This video discusses...")? Are entities extracted and back-linked? Is the
+   format clean?
+3. **Fix what's wrong** in the approach/skill, not via one-off patches.
+4. **Only then: bulk execute** with throttling, commits every 5-10 items.
+
+The marginal cost of testing 3 items first is near zero. The cost of cleaning
+up 100 bad pages is enormous.
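
The four steps above can be read as a small control loop: run a pilot, gate on review, then process the rest in committed batches. The Python sketch below is one possible shape for that loop. Every name in it (`pilot_then_bulk`, `process`, `approve`, `commit`) is hypothetical, and the review step is reduced to a callback that returns a boolean.

```python
import time
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def pilot_then_bulk(items: Sequence[T], process: Callable[[T], R],
                    approve: Callable[[list[R]], bool],
                    commit: Callable[[list[R]], None],
                    pilot_size: int = 3, commit_every: int = 5,
                    delay_seconds: float = 0.0) -> list[R]:
    """Process a small pilot, stop unless it is approved, then run the rest in batches."""
    pilot = [process(item) for item in items[:pilot_size]]
    if not approve(pilot):  # step 2: someone actually reads the output
        raise RuntimeError("pilot rejected: fix the approach before the bulk run")
    commit(pilot)
    results = list(pilot)

    batch: list[R] = []
    for item in items[pilot_size:]:
        batch.append(process(item))
        time.sleep(delay_seconds)  # simple throttle between items
        if len(batch) >= commit_every:
            commit(batch)
            results.extend(batch)
            batch = []
    if batch:  # commit whatever is left after the last full batch
        commit(batch)
        results.extend(batch)
    return results
```

A usage example with stand-in callbacks:

```python
commits: list[list[int]] = []
out = pilot_then_bulk(list(range(12)), lambda x: x * 2, lambda pilot: True, commits.append)
print([len(c) for c in commits])
```

Raising on a rejected pilot, rather than continuing with a warning, keeps a bad approach from ever reaching the bulk phase.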
+
 ## Quality Rules

 - Executive summary in compiled_truth must be updated, not just timeline appended
 - State section is REWRITTEN, not appended to. Current best understanding only.
 - Timeline entries are reverse-chronological (newest first)
-- Every person/company mentioned gets a page if one doesn't exist
+- Every person/company mentioned gets a page if notable (see filing rules)
 - Link types: knows, works_at, invested_in, founded, met_at, discussed
-- Source attribution: every timeline entry includes the source (meeting, article, email, etc.)
+- Source attribution: every timeline entry includes [Source: ...] citation
+- Back-links: every entity mention creates a back-link (Iron Law)
+- Filing: file by primary subject, not format or source (see filing rules)

 ## Tools Used

@@ -63,3 +247,5 @@ the signal detection loop that makes the brain compound over time.
 - Link entities in gbrain (add_link)
 - List tags for a page (get_tags)
 - Tag a page in gbrain (add_tag)
+- Store raw data in gbrain (put_raw_data)
+- Check backlinks in gbrain (get_backlinks)