# Deterministic Collectors: Code for Data, LLMs for Judgment

## Goal

Separate mechanical work (100% reliable code) from analytical work (LLM judgment) so that deterministic tasks never fail probabilistically.

## What the User Gets

Without this: the LLM generates Gmail links, formats tables, and tracks state.
It follows the rule for the first 10 items, then drops a link on item 11. You
write "NO EXCEPTIONS" in the prompt. It still fails. At 90% per-item
reliability, a 20-item digest averages two visible failures per run, and if
failures are independent only about 12% of digests come out clean
(0.9^20 ≈ 0.12). Trust is destroyed.

With this: code handles URLs, formatting, and state (100% reliable). The LLM
reads pre-formatted data and adds judgment, classification, and enrichment.
Links are never wrong because the LLM never generates them.

## Implementation

```
// The pattern: code collects, LLM analyzes

// STEP 1: Deterministic collector (script, no LLM calls)
collector_run():
  messages = gmail_api.fetch_unread()
  structured_messages = []
  for msg in messages:
    // build the link once so the markdown field can reuse it
    gmail_link = f"https://mail.google.com/mail/u/?authuser={account}#inbox/{msg.id}"
    structured = {
      id: msg.id,
      from: msg.sender,
      subject: msg.subject,
      snippet: msg.snippet,
      gmail_link: gmail_link,
      gmail_markdown: f"[Open in Gmail]({gmail_link})",
      is_signature: regex_match(msg, DOCUSIGN_PATTERNS),
      is_noise: regex_match(msg, NOISE_PATTERNS),
      is_new: msg.id not in state.seen_ids   // checked before the ID is recorded
    }
    store(structured)
    structured_messages.append(structured)
    state.seen_ids.add(msg.id)
  generate_markdown_digest(structured_messages)

// STEP 2: LLM reads the pre-formatted digest
llm_analyze():
  digest = read("data/digests/today.md")  // links already baked in
  classify_urgency(digest)                 // judgment call
  add_commentary(digest)                   // contextual analysis
  run_brain_enrichment(notable_entities)   // gbrain search + update
  draft_replies(urgent_items)              // creative work
  surface_to_user(final_output)            // delivery

// STEP 3: Wire into cron
cron_job():
  collector_run()     // fast, cheap, deterministic
  llm_analyze()       // slower, expensive, creative
```
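
To make Step 1 concrete, here is a minimal Python sketch of the collector's pure parts: link generation, regex noise detection, seen/new tracking, and digest rendering. It is an illustration under assumptions, not the actual `email-collector.mjs`: the message dicts (with `id`, `sender`, `subject`, `snippet` keys), the two noise patterns, and the function names `collect` and `render_digest` are all hypothetical, and no real Gmail API call is made.

```python
import re

# Hypothetical noise patterns for illustration; a real collector would tune these.
NOISE_PATTERNS = [
    re.compile(r"unsubscribe", re.IGNORECASE),
    re.compile(r"^no-?reply@", re.IGNORECASE),
]


def gmail_link(account, message_id):
    """Build the link from the same URL template as the pseudocode above."""
    return f"https://mail.google.com/mail/u/?authuser={account}#inbox/{message_id}"


def is_noise(msg):
    """Pure function of the message fields: same input, same answer."""
    fields = (msg["sender"], msg["subject"], msg["snippet"])
    return any(p.search(f) for p in NOISE_PATTERNS for f in fields)


def collect(messages, account, seen_ids):
    """Turn raw messages into structured records; updates seen_ids in place."""
    records = []
    for msg in messages:
        link = gmail_link(account, msg["id"])
        records.append({
            "id": msg["id"],
            "from": msg["sender"],
            "subject": msg["subject"],
            "snippet": msg["snippet"],
            "gmail_link": link,
            "gmail_markdown": f"[Open in Gmail]({link})",
            "is_noise": is_noise(msg),
            # Checked before the ID is recorded, so first sightings are "new".
            "is_new": msg["id"] not in seen_ids,
        })
        seen_ids.add(msg["id"])
    return records


def render_digest(records):
    """Render one markdown bullet per record, link already embedded."""
    lines = []
    for r in records:
        tag = "NEW" if r["is_new"] else "seen"
        lines.append(
            f"- **{r['subject']}** from {r['from']} ({tag}) {r['gmail_markdown']}"
        )
    return "\n".join(lines) + "\n"
```

Every link in the digest comes from `gmail_link`, so a missing or malformed link can only be a bug in that one function, which is straightforward to test.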

### The Architecture

```
+-----------------------------+     +------------------------------+
|  Deterministic Collector    |---->|       LLM Agent              |
|  (Node.js / Python script)  |     |                              |
|                             |     |  - Read the pre-formatted    |
|  - Pull data from API       |     |    digest                    |
|  - Store structured JSON    |     |  - Classify items            |
|  - Generate links/URLs      |     |  - Add commentary            |
|  - Detect patterns (regex)  |     |  - Run brain enrichment      |
|  - Track state (seen/new)   |     |  - Draft replies             |
|  - Output markdown digest   |     |  - Surface to user           |
|                             |     |                              |
|  CODE — deterministic,      |     |  AI — judgment, context,     |
|  never forgets              |     |  creativity                  |
+-----------------------------+     +------------------------------+
```

### File Structure

```
scripts/email-collector/
├── email-collector.mjs     # No LLM calls, no external deps
├── data/
│   ├── state.json          # Last pull timestamp, known IDs, pending signatures
│   ├── messages/           # Structured JSON per day
│   │   └── 2026-04-09.json
│   └── digests/            # Pre-formatted markdown
│       └── 2026-04-09.md
```

### Where the Pattern Applies

| Signal Source | Collector Generates | LLM Adds |
|--------------|-------------------|----------|
| **Email** | Gmail links, sender metadata, signature detection | Urgency classification, enrichment, reply drafts |
| **X/Twitter** | Tweet links, engagement metrics, deletion detection | Sentiment analysis, narrative detection, content ideas |
| **Calendar** | Event links, attendee lists, conflict detection | Prep briefings, meeting context from brain |
| **Slack** | Channel links, thread links, mention detection | Priority classification, action item extraction |
| **GitHub** | PR/issue links, diff stats, CI status | Code review context, priority assessment |

### The Principle

If a piece of output MUST be present and MUST be formatted correctly every
time, generate it in code. If a piece of output requires judgment, context,
or creativity, generate it with the LLM. Don't ask the LLM to do both in
the same pass.
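
One way to enforce this split mechanically is sketched below. It is a variation on the flow described above, not the pattern as originally stated: it assumes the LLM step returns only judgments keyed by message ID and that code renders the final lines from the collector's records. The function name `merge_judgments` and the `urgency` and `note` fields are hypothetical.

```python
def merge_judgments(records, judgments):
    """Combine collector records with LLM judgments.

    Links and subjects are taken only from the collector records; anything
    else the LLM returns (including any link it invents) is ignored.
    """
    by_id = {j["id"]: j for j in judgments}
    lines = []
    for r in records:
        j = by_id.get(r["id"], {})
        urgency = j.get("urgency", "unclassified")
        note = j.get("note", "")
        line = f"- [{urgency}] {r['subject']} {r['gmail_markdown']}"
        if note:
            line += f" ({note})"
        lines.append(line)
    return "\n".join(lines)
```

If the LLM skips an item, the item still appears with its link and an `unclassified` label, so a lapse in judgment never becomes a missing link.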

## Tricky Spots

1. **LLMs forget links, so bake them into code.** The LLM will follow the
   "include a Gmail link" rule for the first 10 items, then silently drop
   it on item 11. No amount of prompt engineering fixes probabilistic
   formatting over long outputs. The fix: generate every link in the
   collector script. The LLM reads pre-formatted markdown where links are
   already embedded. It can't forget what it didn't generate.

2. **Noise filtering must be deterministic.** Regex-based noise detection
   (newsletters, automated receipts, marketing) belongs in the collector,
   not the LLM. The LLM might classify a newsletter as "possibly important"
   on one run and "noise" on the next. Code classifies the same input the
   same way every time.

3. **Atomic writes prevent corruption.** The collector writes to a state
   file (`state.json`) that tracks which messages have been seen. If the
   script crashes mid-write, the state file can be left truncated or
   half-written. Write to a temp file in the same directory, then rename it
   over the target; a rename within one filesystem is atomic. Write the
   digest the same way so the LLM never reads a partial digest if its job
   fires during a collection run.
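
The temp-file-then-rename approach from item 3 could look like the following Python sketch. The helper names `atomic_write_text` and `save_state` are hypothetical, and the shape of the state dict is left to the caller. It relies on `os.replace`, which overwrites the destination in a single step when both paths are on the same filesystem.

```python
import json
import os
import tempfile


def atomic_write_text(path, text):
    """Write text so readers see either the old file or the complete new one."""
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)
    # The temp file lives next to the target: a rename is only atomic when
    # source and destination are on the same filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-", suffix=".part")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(text)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the partial temp file; the original target is untouched.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise


def save_state(path, state):
    """Serialize the collector state and write it atomically."""
    atomic_write_text(path, json.dumps(state, indent=2, sort_keys=True))
```

The same `atomic_write_text` helper can write the markdown digest, so a reader that opens the file mid-run gets the previous complete digest, not a partial one.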

## How to Verify

1. **Run the collector and check every link.** Execute the collector script
   manually. Open the generated digest. Click every `[Open in Gmail]` link
   (or equivalent). Every single link must resolve to the correct item. If
   any link is broken or missing, the collector has a bug.

2. **Verify noise filtering is consistent.** Run the collector twice on the
   same input data. The noise classification (`is_noise` field) must be
   identical both times. If it varies, a probabilistic element leaked into
   the deterministic layer.

3. **Verify the LLM reads structured output.** Run the full pipeline
   (collector then LLM). Check that the LLM's analysis references data
   from the structured digest, not from its own generation. The links in
   the final output should be identical to the links in the digest file.

---

*Part of the [GBrain Skillpack](../GBRAIN_SKILLPACK.md).*