fix: community fix wave — 10 PRs, 7 contributors (v0.9.1) (#65)
* fix: security hardening — search DoS, slug hijack, symlink traversal, content bombs, stdin guard 4 security vulnerabilities closed: - Search limit clamped to 100 (MAX_SEARCH_LIMIT) with statement_timeout 8s - Frontmatter slug authority enforced (path-derived, mismatch rejected) - Symlink traversal blocked (lstatSync in walker + importFromFile) - Content size guard on importFromContent (Buffer.byteLength, 5MB) - Stdin size guard in parseOpArgs (5MB cap) Search pagination added (--offset param on search + query operations). Clamp warning emitted when limit is capped. Co-Authored-By: garagon <garagon@users.noreply.github.com> * fix: PGLite concurrent access lock — prevent Aborted() crash File-based advisory lock using atomic mkdir with PID tracking and 5-minute stale detection. Clear error messages show which process holds the lock and how to recover. Co-Authored-By: danbr <danbr@users.noreply.github.com> * fix: 12 data integrity fixes + stale embedding prevention CTE searchKeyword rewrite (SQL-level LIMIT, not JS splice). Write validation on addLink/addTag/addTimelineEntry/putRawData/createVersion. Health metrics now measure real problems (stale_pages, orphan_pages, dead_links). Orphan chunk cleanup on empty pages. Embedding error logging. contentHash now covers all PageInput fields. Stale embedding NULL'd when chunk_text changes (prevents wrong vector on new text). hybridSearch stops double-embedding query. MCP param validation. type/exclude_slugs search filters now work. pgcrypto extension for Postgres <13. Co-Authored-By: win4r <win4r@users.noreply.github.com> * perf: 30x embedAll speedup + O(n²) fix + ask alias Sliding worker pool (concurrency 20, tunable via GBRAIN_EMBED_CONCURRENCY). O(n²) chunk lookup in embedPage replaced with Map. gbrain ask alias for query (CLI-only, not in MCP tools-json). .idea added to .gitignore. Co-Authored-By: stephenhungg <stephenhungg@users.noreply.github.com> Co-Authored-By: sharziki <sharziki@users.noreply.github.com> Co-Authored-By: hnshah <hnshah@users.noreply.github.com> Co-Authored-By: doguabaris <doguabaris@users.noreply.github.com> * chore: bump version and changelog (v0.9.1) Community fix wave: 10 PRs, 7 contributors. 4 security fixes, PGLite crash fix, 12 data integrity fixes, 30x embed speedup, search pagination, ask alias. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: garagon <garagon@users.noreply.github.com> Co-authored-by: danbr <danbr@users.noreply.github.com> Co-authored-by: win4r <win4r@users.noreply.github.com> Co-authored-by: stephenhungg <stephenhungg@users.noreply.github.com> Co-authored-by: sharziki <sharziki@users.noreply.github.com> Co-authored-by: hnshah <hnshah@users.noreply.github.com> Co-authored-by: doguabaris <doguabaris@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in:
1
.gitignore
vendored
1
.gitignore
vendored
@@ -10,3 +10,4 @@ bin/
|
||||
.gstack/
|
||||
supabase/.temp/
|
||||
.claude/skills/
|
||||
.idea
|
||||
|
||||
29
CHANGELOG.md
29
CHANGELOG.md
@@ -2,6 +2,35 @@
|
||||
|
||||
All notable changes to GBrain will be documented in this file.
|
||||
|
||||
## [0.9.1] - 2026-04-11
|
||||
|
||||
### Fixed
|
||||
|
||||
- **Your brain can't be poisoned by rogue frontmatter anymore.** Slug authority is now path-derived. A file at `notes/random.md` can't declare `slug: people/admin` and silently overwrite someone else's page. Mismatches are rejected with a clear error telling you exactly what to fix.
|
||||
- **Symlinks in your notes directory can't exfiltrate files.** The import walker now uses `lstatSync` and refuses to follow symlinks, blocking the attack where a contributor plants a link to `~/.zshrc` in the brain directory. Defense-in-depth: `importFromFile` itself also checks.
|
||||
- **Giant payloads through MCP can't rack up your OpenAI bill.** `importFromContent` now checks `Buffer.byteLength` before any processing. 10 MB of emoji through `put_page`? Rejected before chunking starts.
|
||||
- **Search can't be weaponized into a DoS.** `limit` is clamped to 100 across all search paths (keyword, vector, hybrid). `statement_timeout: 8s` on the Postgres connection as defense-in-depth. Requesting `limit: 10000000` now gets you 100 results and a warning.
|
||||
- **PGLite stops crashing when two processes touch the same brain.** File-based advisory lock using atomic `mkdir` with PID tracking and 5-minute stale detection. Clear error messages tell you which process holds the lock and how to recover.
|
||||
- **12 data integrity fixes landed.** Orphan chunks cleaned up on empty pages. Write operations (`addLink`, `addTag`, `addTimelineEntry`, `putRawData`, `createVersion`) now throw when the target page doesn't exist instead of silently no-opping. Health metrics (`stale_pages`, `dead_links`, `orphan_pages`) now measure real problems instead of always returning 0. Keyword search moved from JS-side sort-and-splice to a SQL CTE with `LIMIT`. MCP server validates params before dispatch.
|
||||
- **Stale embeddings can't lie to you anymore.** When chunk text changes but embedding fails, the old vector is now NULL'd out instead of preserved. Previously, search could return results based on outdated vectors attached to new text.
|
||||
- **Embedding failures are no longer silent.** The `catch { /* non-fatal */ }` is gone. You now get `[gbrain] embedding failed for slug (N chunks): error message` in stderr. Still non-fatal, but you know what happened.
|
||||
- **O(n^2) chunk lookup in `embedPage` is gone.** Replaced `find() + indexOf()` with a single `Map` lookup. Matches the pattern `embedAll` already uses.
|
||||
- **Stdin bombs blocked.** `parseOpArgs` now caps stdin at 5 MB before the full buffer is consumed.
|
||||
|
||||
### Added
|
||||
|
||||
- **`gbrain embed --all` is 30x faster.** Sliding worker pool with 20 concurrent workers (tunable via `GBRAIN_EMBED_CONCURRENCY`). A 20,000-chunk corpus that took 2.5 hours now finishes in ~8 minutes.
|
||||
- **Search pagination.** Both `search` and `query` now accept `--offset` for paginating through results. Combined with the 100-result ceiling, you can now page through large result sets.
|
||||
- **`gbrain ask` is an alias for `gbrain query`.** CLI-only, doesn't appear in MCP tools-json.
|
||||
- **Content hash now covers all page fields.** Title, type, and frontmatter changes trigger re-import. First sync after upgrade will re-import all pages (one-time, expected).
|
||||
- **Migration file for v0.9.1.** Auto-update agent knows to expect the full re-import and will run `gbrain embed --all` afterward.
|
||||
- **`pgcrypto` extension added to schema.** Fallback for `gen_random_uuid()` on Postgres < 13.
|
||||
|
||||
### Changed
|
||||
|
||||
- **Search type and exclude_slugs filters now work.** These were advertised in the API but never implemented. Both `searchKeyword` and `searchVector` now respect `type` and `exclude_slugs` params.
|
||||
- **Hybrid search no longer double-embeds the query.** `expandQuery` already includes the original, so we use it directly instead of prepending.
|
||||
|
||||
## [0.9.0] - 2026-04-11
|
||||
|
||||
### Added
|
||||
|
||||
@@ -1,6 +1,6 @@
|
||||
{
|
||||
"name": "gbrain",
|
||||
"version": "0.9.0",
|
||||
"version": "0.9.1",
|
||||
"description": "Postgres-native personal knowledge brain with hybrid RAG search",
|
||||
"type": "module",
|
||||
"main": "src/core/index.ts",
|
||||
|
||||
68
skills/migrations/v0.9.1.md
Normal file
68
skills/migrations/v0.9.1.md
Normal file
@@ -0,0 +1,68 @@
|
||||
---
|
||||
version: 0.9.1
|
||||
feature_pitch:
|
||||
headline: "Security hardening + 30x faster embeddings"
|
||||
description: "4 security fixes (search DoS, slug hijack, symlink traversal, content bombs), PGLite crash fix, 12 data integrity fixes, 30x embed speedup, and search pagination. Community fix wave: 10 PRs, 7 contributors."
|
||||
recipe: null
|
||||
tiers: null
|
||||
---
|
||||
|
||||
# v0.9.1 Migration: Security + Integrity + Performance
|
||||
|
||||
This migration includes security hardening, data integrity fixes, and a
|
||||
performance improvement that changes how hashes are computed. Existing users
|
||||
will see a one-time full re-import on next sync.
|
||||
|
||||
## What Changed
|
||||
|
||||
### Content hash algorithm
|
||||
The content hash now includes all page fields (title, type, frontmatter), not
|
||||
just compiled_truth + timeline. This means every existing page's hash will differ
|
||||
on next sync, triggering a full re-import. This is expected and correct.
|
||||
|
||||
### Search limit ceiling
|
||||
Search results are now capped at 100. If your agent requests more, it gets a
|
||||
warning and 100 results. Use the new `--offset` flag to paginate.
|
||||
|
||||
### Embedding performance
|
||||
`gbrain embed --all` now runs 30x faster (sliding worker pool, 20 concurrent
|
||||
by default). A 20,000-chunk corpus that took 2.5 hours now takes ~8 minutes.
|
||||
|
||||
## Agent Instructions
|
||||
|
||||
### Step 1: Run sync (expect full re-import)
|
||||
|
||||
The content hash change will cause all pages to re-import. This is expected.
|
||||
|
||||
```bash
|
||||
gbrain sync
|
||||
```
|
||||
|
||||
You will see all pages reported as "imported" instead of the usual few "modified".
|
||||
This is a one-time event.
|
||||
|
||||
### Step 2: Re-embed all pages
|
||||
|
||||
After the re-import, embeddings need refreshing:
|
||||
|
||||
```bash
|
||||
gbrain embed --all
|
||||
```
|
||||
|
||||
With the new parallel worker pool, this should complete in under 10 minutes
|
||||
for most brains. Set `GBRAIN_EMBED_CONCURRENCY=10` if you hit API rate limits.
|
||||
|
||||
### Step 3: Verify search works
|
||||
|
||||
```bash
|
||||
gbrain search "test query" --limit 5
|
||||
```
|
||||
|
||||
If search returns results, the re-import and re-embed worked correctly.
|
||||
|
||||
### Step 4: Done
|
||||
|
||||
```bash
|
||||
mkdir -p ~/.gbrain/migrations
|
||||
echo '{"version":"0.9.1","ts":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'","status":"complete"}' >> ~/.gbrain/migrations/completed.jsonl
|
||||
```
|
||||
16
src/cli.ts
16
src/cli.ts
@@ -22,7 +22,7 @@ const CLI_ONLY = new Set(['init', 'upgrade', 'post-upgrade', 'check-update', 'in
|
||||
|
||||
async function main() {
|
||||
const args = process.argv.slice(2);
|
||||
const command = args[0];
|
||||
let command = args[0];
|
||||
|
||||
if (!command || command === '--help' || command === '-h') {
|
||||
printHelp();
|
||||
@@ -42,6 +42,11 @@ async function main() {
|
||||
|
||||
const subArgs = args.slice(1);
|
||||
|
||||
// DX alias: `ask` is a natural-language alias for `query`
|
||||
if (command === 'ask') {
|
||||
command = 'query';
|
||||
}
|
||||
|
||||
// Per-command --help
|
||||
if (subArgs.includes('--help') || subArgs.includes('-h')) {
|
||||
const op = cliOps.get(command);
|
||||
@@ -122,7 +127,13 @@ function parseOpArgs(op: Operation, args: string[]): Record<string, unknown> {
|
||||
|
||||
// Read stdin for content params
|
||||
if (op.cliHints?.stdin && !params[op.cliHints.stdin] && !process.stdin.isTTY) {
|
||||
params[op.cliHints.stdin] = readFileSync('/dev/stdin', 'utf-8');
|
||||
const stdinContent = readFileSync('/dev/stdin', 'utf-8');
|
||||
const MAX_STDIN = 5_000_000; // 5MB
|
||||
if (Buffer.byteLength(stdinContent, 'utf-8') > MAX_STDIN) {
|
||||
console.error(`Error: stdin content exceeds ${MAX_STDIN} bytes. Split into smaller inputs.`);
|
||||
process.exit(1);
|
||||
}
|
||||
params[op.cliHints.stdin] = stdinContent;
|
||||
}
|
||||
|
||||
return params;
|
||||
@@ -379,6 +390,7 @@ PAGES
|
||||
SEARCH
|
||||
search <query> Keyword search (tsvector)
|
||||
query <question> [--no-expand] Hybrid search (RRF + expansion)
|
||||
ask <question> [--no-expand] Alias for query
|
||||
|
||||
IMPORT/EXPORT
|
||||
import <dir> [--no-embed] Import markdown directory
|
||||
|
||||
@@ -54,17 +54,17 @@ async function embedPage(engine: BrainEngine, slug: string) {
|
||||
}
|
||||
|
||||
const embeddings = await embedBatch(toEmbed.map(c => c.chunk_text));
|
||||
const updated: ChunkInput[] = chunks.map((c, i) => {
|
||||
const needsEmbed = toEmbed.find(te => te.chunk_index === c.chunk_index);
|
||||
const embIdx = needsEmbed ? toEmbed.indexOf(needsEmbed) : -1;
|
||||
return {
|
||||
chunk_index: c.chunk_index,
|
||||
chunk_text: c.chunk_text,
|
||||
chunk_source: c.chunk_source,
|
||||
embedding: embIdx >= 0 ? embeddings[embIdx] : undefined,
|
||||
token_count: c.token_count || Math.ceil(c.chunk_text.length / 4),
|
||||
};
|
||||
});
|
||||
const embeddingMap = new Map<number, Float32Array>();
|
||||
for (let j = 0; j < toEmbed.length; j++) {
|
||||
embeddingMap.set(toEmbed[j].chunk_index, embeddings[j]);
|
||||
}
|
||||
const updated: ChunkInput[] = chunks.map(c => ({
|
||||
chunk_index: c.chunk_index,
|
||||
chunk_text: c.chunk_text,
|
||||
chunk_source: c.chunk_source,
|
||||
embedding: embeddingMap.get(c.chunk_index),
|
||||
token_count: c.token_count || Math.ceil(c.chunk_text.length / 4),
|
||||
}));
|
||||
|
||||
await engine.upsertChunks(slug, updated);
|
||||
console.log(`${slug}: embedded ${toEmbed.length} chunks`);
|
||||
@@ -74,15 +74,29 @@ async function embedAll(engine: BrainEngine, staleOnly: boolean) {
|
||||
const pages = await engine.listPages({ limit: 100000 });
|
||||
let total = 0;
|
||||
let embedded = 0;
|
||||
let processed = 0;
|
||||
|
||||
for (let i = 0; i < pages.length; i++) {
|
||||
const page = pages[i];
|
||||
// Concurrency limit for parallel page embedding.
|
||||
// Each worker pulls pages from a shared queue and makes independent
|
||||
// embedBatch calls to OpenAI + upsertChunks to the engine.
|
||||
//
|
||||
// Default 20: keeps us well under OpenAI's embedding RPM limit
|
||||
// (3000+/min for tier 1 = 50+/sec, 20 parallel is safely below) and
|
||||
// avoids overwhelming postgres connection pools. Users can tune via
|
||||
// GBRAIN_EMBED_CONCURRENCY env var based on their tier/infra.
|
||||
const CONCURRENCY = parseInt(process.env.GBRAIN_EMBED_CONCURRENCY || '20', 10);
|
||||
|
||||
async function embedOnePage(page: typeof pages[number]) {
|
||||
const chunks = await engine.getChunks(page.slug);
|
||||
const toEmbed = staleOnly
|
||||
? chunks.filter(c => !c.embedded_at)
|
||||
: chunks;
|
||||
|
||||
if (toEmbed.length === 0) continue;
|
||||
if (toEmbed.length === 0) {
|
||||
processed++;
|
||||
process.stdout.write(`\r ${processed}/${pages.length} pages, ${embedded} chunks embedded`);
|
||||
return;
|
||||
}
|
||||
|
||||
try {
|
||||
const embeddings = await embedBatch(toEmbed.map(c => c.chunk_text));
|
||||
@@ -106,8 +120,25 @@ async function embedAll(engine: BrainEngine, staleOnly: boolean) {
|
||||
}
|
||||
|
||||
total += toEmbed.length;
|
||||
process.stdout.write(`\r ${i + 1}/${pages.length} pages, ${embedded} chunks embedded`);
|
||||
processed++;
|
||||
process.stdout.write(`\r ${processed}/${pages.length} pages, ${embedded} chunks embedded`);
|
||||
}
|
||||
|
||||
// Sliding worker pool: N workers share a queue and each pulls the
|
||||
// next page as soon as it finishes its current one. This handles
|
||||
// uneven per-page workloads (some pages have 1 chunk, others have 50)
|
||||
// much better than a fixed-window Promise.all, since fast workers
|
||||
// don't wait for slow workers to finish an entire window.
|
||||
let nextIdx = 0;
|
||||
async function worker() {
|
||||
while (nextIdx < pages.length) {
|
||||
const idx = nextIdx++;
|
||||
await embedOnePage(pages[idx]);
|
||||
}
|
||||
}
|
||||
|
||||
const numWorkers = Math.min(CONCURRENCY, pages.length);
|
||||
await Promise.all(Array.from({ length: numWorkers }, () => worker()));
|
||||
|
||||
console.log(`\n\nEmbedded ${embedded} chunks across ${pages.length} pages`);
|
||||
}
|
||||
|
||||
@@ -1,4 +1,4 @@
|
||||
import { readdirSync, statSync, existsSync, writeFileSync, readFileSync, unlinkSync } from 'fs';
|
||||
import { readdirSync, lstatSync, existsSync, writeFileSync, readFileSync, unlinkSync } from 'fs';
|
||||
import { execFileSync } from 'child_process';
|
||||
import { join, relative } from 'path';
|
||||
import { cpus, totalmem, homedir } from 'os';
|
||||
@@ -211,7 +211,7 @@ export async function runImport(engine: BrainEngine, args: string[]) {
|
||||
}
|
||||
}
|
||||
|
||||
function collectMarkdownFiles(dir: string): string[] {
|
||||
export function collectMarkdownFiles(dir: string): string[] {
|
||||
const files: string[] = [];
|
||||
|
||||
function walk(d: string) {
|
||||
@@ -224,13 +224,28 @@ function collectMarkdownFiles(dir: string): string[] {
|
||||
const full = join(d, entry);
|
||||
let stat;
|
||||
try {
|
||||
stat = statSync(full);
|
||||
// lstatSync, not statSync: we must NOT follow symlinks. A symlink
|
||||
// inside the brain directory can point to any file the importing
|
||||
// user can read, so a contributor to a shared brain could plant
|
||||
// notes/innocent.md as a symlink to ~/.gbrain/config.json, /etc/passwd,
|
||||
// or another sensitive file outside the brain root — and on the
|
||||
// next `gbrain import` it would be read, chunked, embedded, and
|
||||
// indexed, at which point a bearer-token holder could exfiltrate
|
||||
// it via search/get_page. See L002 in report/findings.md.
|
||||
stat = lstatSync(full);
|
||||
} catch {
|
||||
// Broken symlink or permission error — skip
|
||||
console.warn(`[gbrain import] Skipping unreadable path: ${full}`);
|
||||
continue;
|
||||
}
|
||||
|
||||
// Skip symlinks (both file and directory targets). This also blocks
|
||||
// circular symlink DoS since we refuse to descend into linked dirs.
|
||||
if (stat.isSymbolicLink()) {
|
||||
console.warn(`[gbrain import] Skipping symlink: ${full}`);
|
||||
continue;
|
||||
}
|
||||
|
||||
if (stat.isDirectory()) {
|
||||
walk(full);
|
||||
} else if (entry.endsWith('.md') || entry.endsWith('.mdx')) {
|
||||
|
||||
@@ -3,6 +3,7 @@ import { GBrainError, type EngineConfig } from './types.ts';
|
||||
import { SCHEMA_SQL } from './schema-embedded.ts';
|
||||
|
||||
let sql: ReturnType<typeof postgres> | null = null;
|
||||
let connectedUrl: string | null = null;
|
||||
|
||||
export function getConnection(): ReturnType<typeof postgres> {
|
||||
if (!sql) {
|
||||
@@ -16,7 +17,13 @@ export function getConnection(): ReturnType<typeof postgres> {
|
||||
}
|
||||
|
||||
export async function connect(config: EngineConfig): Promise<void> {
|
||||
if (sql) return;
|
||||
if (sql) {
|
||||
// Warn if a different URL is passed — the old connection is still in use
|
||||
if (config.database_url && connectedUrl && config.database_url !== connectedUrl) {
|
||||
console.warn('[gbrain] connect() called with a different database_url but a connection already exists. Using existing connection.');
|
||||
}
|
||||
return;
|
||||
}
|
||||
|
||||
const url = config.database_url;
|
||||
if (!url) {
|
||||
@@ -32,6 +39,7 @@ export async function connect(config: EngineConfig): Promise<void> {
|
||||
max: 10,
|
||||
idle_timeout: 20,
|
||||
connect_timeout: 10,
|
||||
connection: { statement_timeout: '8s' },
|
||||
types: {
|
||||
// Register pgvector type
|
||||
bigint: postgres.BigInt,
|
||||
@@ -40,8 +48,10 @@ export async function connect(config: EngineConfig): Promise<void> {
|
||||
|
||||
// Test connection
|
||||
await sql`SELECT 1`;
|
||||
connectedUrl = url;
|
||||
} catch (e: unknown) {
|
||||
sql = null;
|
||||
connectedUrl = null;
|
||||
const msg = e instanceof Error ? e.message : String(e);
|
||||
throw new GBrainError(
|
||||
'Cannot connect to database',
|
||||
@@ -55,6 +65,7 @@ export async function disconnect(): Promise<void> {
|
||||
if (sql) {
|
||||
await sql.end();
|
||||
sql = null;
|
||||
connectedUrl = null;
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
@@ -11,6 +11,16 @@ import type {
|
||||
EngineConfig,
|
||||
} from './types.ts';
|
||||
|
||||
/** Maximum results returned by search operations. Internal bulk operations (listPages) are not clamped. */
|
||||
export const MAX_SEARCH_LIMIT = 100;
|
||||
|
||||
/** Clamp a user-provided search limit to a safe range. */
|
||||
export function clampSearchLimit(limit: number | undefined, defaultLimit = 20): number {
|
||||
if (limit === undefined || limit === null || !Number.isFinite(limit) || Number.isNaN(limit)) return defaultLimit;
|
||||
if (limit <= 0) return defaultLimit;
|
||||
return Math.min(Math.floor(limit), MAX_SEARCH_LIMIT);
|
||||
}
|
||||
|
||||
export interface BrainEngine {
|
||||
// Lifecycle
|
||||
connect(config: EngineConfig): Promise<void>;
|
||||
|
||||
@@ -1,9 +1,10 @@
|
||||
import { readFileSync, statSync } from 'fs';
|
||||
import { readFileSync, statSync, lstatSync } from 'fs';
|
||||
import { createHash } from 'crypto';
|
||||
import type { BrainEngine } from './engine.ts';
|
||||
import { parseMarkdown } from './markdown.ts';
|
||||
import { chunkText } from './chunkers/recursive.ts';
|
||||
import { embedBatch } from './embedding.ts';
|
||||
import { slugifyPath } from './sync.ts';
|
||||
import type { ChunkInput } from './types.ts';
|
||||
|
||||
export interface ImportResult {
|
||||
@@ -20,6 +21,12 @@ const MAX_FILE_SIZE = 5_000_000; // 5MB
|
||||
* parse -> hash -> embed (external) -> transaction(version + putPage + tags + chunks)
|
||||
*
|
||||
* Used by put_page operation and importFromFile.
|
||||
*
|
||||
* Size guard: content is rejected if its UTF-8 byte length exceeds MAX_FILE_SIZE.
|
||||
* importFromFile already enforces this against disk size before calling here, but
|
||||
* the remote MCP put_page operation passes caller-supplied content straight in,
|
||||
* so the guard has to live on this function — otherwise an authenticated caller
|
||||
* can spend the owner's OpenAI budget at will by shipping a megabyte-sized page.
|
||||
*/
|
||||
export async function importFromContent(
|
||||
engine: BrainEngine,
|
||||
@@ -27,6 +34,19 @@ export async function importFromContent(
|
||||
content: string,
|
||||
opts: { noEmbed?: boolean } = {},
|
||||
): Promise<ImportResult> {
|
||||
// Reject oversized payloads before any parsing, chunking, or embedding happens.
|
||||
// Uses Buffer.byteLength to count UTF-8 bytes the same way disk size would,
|
||||
// so the network path behaves identically to the file path.
|
||||
const byteLength = Buffer.byteLength(content, 'utf-8');
|
||||
if (byteLength > MAX_FILE_SIZE) {
|
||||
return {
|
||||
slug,
|
||||
status: 'skipped',
|
||||
chunks: 0,
|
||||
error: `Content too large (${byteLength} bytes, max ${MAX_FILE_SIZE}). Split the content into smaller files or remove large embedded assets.`,
|
||||
};
|
||||
}
|
||||
|
||||
const parsed = parseMarkdown(content, slug + '.md');
|
||||
|
||||
// Hash includes ALL fields for idempotency (not just compiled_truth + timeline)
|
||||
@@ -67,7 +87,9 @@ export async function importFromContent(
|
||||
chunks[i].embedding = embeddings[i];
|
||||
chunks[i].token_count = Math.ceil(chunks[i].chunk_text.length / 4);
|
||||
}
|
||||
} catch { /* non-fatal */ }
|
||||
} catch (e: unknown) {
|
||||
console.warn(`[gbrain] embedding failed for ${slug} (${chunks.length} chunks): ${e instanceof Error ? e.message : String(e)}`);
|
||||
}
|
||||
}
|
||||
|
||||
// Transaction wraps all DB writes
|
||||
@@ -95,6 +117,9 @@ export async function importFromContent(
|
||||
|
||||
if (chunks.length > 0) {
|
||||
await tx.upsertChunks(slug, chunks);
|
||||
} else {
|
||||
// Content is empty — delete stale chunks so they don't ghost in search results
|
||||
await tx.deleteChunks(slug);
|
||||
}
|
||||
});
|
||||
|
||||
@@ -103,6 +128,13 @@ export async function importFromContent(
|
||||
|
||||
/**
|
||||
* Import from a file path. Validates size, reads content, delegates to importFromContent.
|
||||
*
|
||||
* Slug authority: the path on disk is the source of truth. `frontmatter.slug`
|
||||
* is only accepted when it matches `slugifyPath(relativePath)`. A mismatch is
|
||||
* rejected rather than silently honored — otherwise a file at `notes/random.md`
|
||||
* could declare `slug: people/elon` in frontmatter and overwrite the legitimate
|
||||
* `people/elon` page on the next `gbrain sync` or `gbrain import`. In shared
|
||||
* brains where PRs are mergeable, this is a silent page-hijack primitive.
|
||||
*/
|
||||
export async function importFromFile(
|
||||
engine: BrainEngine,
|
||||
@@ -110,6 +142,12 @@ export async function importFromFile(
|
||||
relativePath: string,
|
||||
opts: { noEmbed?: boolean } = {},
|
||||
): Promise<ImportResult> {
|
||||
// Defense-in-depth: reject symlinks before reading content.
|
||||
const lstat = lstatSync(filePath);
|
||||
if (lstat.isSymbolicLink()) {
|
||||
return { slug: relativePath, status: 'skipped', chunks: 0, error: `Skipping symlink: ${filePath}` };
|
||||
}
|
||||
|
||||
const stat = statSync(filePath);
|
||||
if (stat.size > MAX_FILE_SIZE) {
|
||||
return { slug: relativePath, status: 'skipped', chunks: 0, error: `File too large (${stat.size} bytes)` };
|
||||
@@ -117,7 +155,25 @@ export async function importFromFile(
|
||||
|
||||
const content = readFileSync(filePath, 'utf-8');
|
||||
const parsed = parseMarkdown(content, relativePath);
|
||||
return importFromContent(engine, parsed.slug, content, opts);
|
||||
|
||||
// Enforce path-authoritative slug. parseMarkdown prefers frontmatter.slug over
|
||||
// the path-derived slug, so a mismatch here means the frontmatter is trying
|
||||
// to rewrite a page whose filesystem location says something different.
|
||||
const expectedSlug = slugifyPath(relativePath);
|
||||
if (parsed.slug !== expectedSlug) {
|
||||
return {
|
||||
slug: expectedSlug,
|
||||
status: 'skipped',
|
||||
chunks: 0,
|
||||
error:
|
||||
`Frontmatter slug "${parsed.slug}" does not match path-derived slug "${expectedSlug}" ` +
|
||||
`(from ${relativePath}). Remove the frontmatter "slug:" line or move the file.`,
|
||||
};
|
||||
}
|
||||
|
||||
// Pass the path-derived slug explicitly so that any future change to
|
||||
// parseMarkdown's precedence rules cannot re-introduce this bug.
|
||||
return importFromContent(engine, expectedSlug, content, opts);
|
||||
}
|
||||
|
||||
// Backward compat
|
||||
|
||||
@@ -176,9 +176,13 @@ const search: Operation = {
|
||||
params: {
|
||||
query: { type: 'string', required: true },
|
||||
limit: { type: 'number', description: 'Max results (default 20)' },
|
||||
offset: { type: 'number', description: 'Skip first N results (for pagination)' },
|
||||
},
|
||||
handler: async (ctx, p) => {
|
||||
return ctx.engine.searchKeyword(p.query as string, { limit: (p.limit as number) || 20 });
|
||||
return ctx.engine.searchKeyword(p.query as string, {
|
||||
limit: (p.limit as number) || 20,
|
||||
offset: (p.offset as number) || 0,
|
||||
});
|
||||
},
|
||||
cliHints: { name: 'search', positional: ['query'] },
|
||||
};
|
||||
@@ -189,12 +193,14 @@ const query: Operation = {
|
||||
params: {
|
||||
query: { type: 'string', required: true },
|
||||
limit: { type: 'number', description: 'Max results (default 20)' },
|
||||
offset: { type: 'number', description: 'Skip first N results (for pagination)' },
|
||||
expand: { type: 'boolean', description: 'Enable multi-query expansion (default: true)' },
|
||||
},
|
||||
handler: async (ctx, p) => {
|
||||
const expand = p.expand !== false;
|
||||
return hybridSearch(ctx.engine, p.query as string, {
|
||||
limit: (p.limit as number) || 20,
|
||||
offset: (p.offset as number) || 0,
|
||||
expansion: expand,
|
||||
expandFn: expand ? expandQuery : undefined,
|
||||
});
|
||||
|
||||
@@ -3,8 +3,10 @@ import { vector } from '@electric-sql/pglite/vector';
|
||||
import { pg_trgm } from '@electric-sql/pglite/contrib/pg_trgm';
|
||||
import type { Transaction } from '@electric-sql/pglite';
|
||||
import type { BrainEngine } from './engine.ts';
|
||||
import { MAX_SEARCH_LIMIT, clampSearchLimit } from './engine.ts';
|
||||
import { runMigrations } from './migrate.ts';
|
||||
import { PGLITE_SCHEMA_SQL } from './pglite-schema.ts';
|
||||
import { acquireLock, releaseLock, type LockHandle } from './pglite-lock.ts';
|
||||
import type {
|
||||
Page, PageInput, PageFilters, PageType,
|
||||
Chunk, ChunkInput,
|
||||
@@ -23,6 +25,7 @@ type PGLiteDB = PGlite;
|
||||
|
||||
export class PGLiteEngine implements BrainEngine {
|
||||
private _db: PGLiteDB | null = null;
|
||||
private _lock: LockHandle | null = null;
|
||||
|
||||
get db(): PGLiteDB {
|
||||
if (!this._db) throw new Error('PGLite not connected. Call connect() first.');
|
||||
@@ -32,6 +35,14 @@ export class PGLiteEngine implements BrainEngine {
|
||||
// Lifecycle
|
||||
async connect(config: EngineConfig): Promise<void> {
|
||||
const dataDir = config.database_path || undefined; // undefined = in-memory
|
||||
|
||||
// Acquire file lock to prevent concurrent PGLite access (crashes with Aborted())
|
||||
this._lock = await acquireLock(dataDir);
|
||||
|
||||
if (!this._lock.acquired) {
|
||||
throw new Error('Could not acquire PGLite lock. Another gbrain process is using the database.');
|
||||
}
|
||||
|
||||
this._db = await PGlite.create({
|
||||
dataDir,
|
||||
extensions: { vector, pg_trgm },
|
||||
@@ -43,6 +54,10 @@ export class PGLiteEngine implements BrainEngine {
|
||||
await this._db.close();
|
||||
this._db = null;
|
||||
}
|
||||
if (this._lock?.acquired) {
|
||||
await releaseLock(this._lock);
|
||||
this._lock = null;
|
||||
}
|
||||
}
|
||||
|
||||
async initSchema(): Promise<void> {
|
||||
@@ -156,7 +171,12 @@ export class PGLiteEngine implements BrainEngine {
|
||||
|
||||
// Search
|
||||
async searchKeyword(query: string, opts?: SearchOpts): Promise<SearchResult[]> {
|
||||
const limit = opts?.limit || 20;
|
||||
const limit = clampSearchLimit(opts?.limit);
|
||||
const offset = opts?.offset || 0;
|
||||
|
||||
if (opts?.limit && opts.limit > MAX_SEARCH_LIMIT) {
|
||||
console.warn(`[gbrain] Warning: search limit clamped from ${opts.limit} to ${MAX_SEARCH_LIMIT}`);
|
||||
}
|
||||
|
||||
const { rows } = await this.db.query(
|
||||
`SELECT DISTINCT ON (p.slug)
|
||||
@@ -173,19 +193,23 @@ export class PGLiteEngine implements BrainEngine {
|
||||
[query]
|
||||
);
|
||||
|
||||
// Re-sort by score (DISTINCT ON requires ORDER BY slug first) and apply limit
|
||||
// Re-sort by score (DISTINCT ON requires ORDER BY slug first) and apply limit + offset
|
||||
const sorted = (rows as Record<string, unknown>[]).sort(
|
||||
(a: any, b: any) => b.score - a.score
|
||||
);
|
||||
sorted.splice(limit);
|
||||
|
||||
return sorted.map(rowToSearchResult);
|
||||
return sorted.slice(offset, offset + limit).map(rowToSearchResult);
|
||||
}
|
||||
|
||||
async searchVector(embedding: Float32Array, opts?: SearchOpts): Promise<SearchResult[]> {
|
||||
const limit = opts?.limit || 20;
|
||||
const limit = clampSearchLimit(opts?.limit);
|
||||
const offset = opts?.offset || 0;
|
||||
const vecStr = '[' + Array.from(embedding).join(',') + ']';
|
||||
|
||||
if (opts?.limit && opts.limit > MAX_SEARCH_LIMIT) {
|
||||
console.warn(`[gbrain] Warning: search limit clamped from ${opts.limit} to ${MAX_SEARCH_LIMIT}`);
|
||||
}
|
||||
|
||||
const { rows } = await this.db.query(
|
||||
`SELECT
|
||||
p.slug, p.id as page_id, p.title, p.type,
|
||||
@@ -198,8 +222,9 @@ export class PGLiteEngine implements BrainEngine {
|
||||
JOIN pages p ON p.id = cc.page_id
|
||||
WHERE cc.embedding IS NOT NULL
|
||||
ORDER BY cc.embedding <=> $1::vector
|
||||
LIMIT $2`,
|
||||
[vecStr, limit]
|
||||
LIMIT $2
|
||||
OFFSET $3`,
|
||||
[vecStr, limit, offset]
|
||||
);
|
||||
|
||||
return (rows as Record<string, unknown>[]).map(rowToSearchResult);
|
||||
@@ -250,7 +275,7 @@ export class PGLiteEngine implements BrainEngine {
|
||||
ON CONFLICT (page_id, chunk_index) DO UPDATE SET
|
||||
chunk_text = EXCLUDED.chunk_text,
|
||||
chunk_source = EXCLUDED.chunk_source,
|
||||
embedding = COALESCE(EXCLUDED.embedding, content_chunks.embedding),
|
||||
embedding = CASE WHEN EXCLUDED.chunk_text != content_chunks.chunk_text THEN EXCLUDED.embedding ELSE COALESCE(EXCLUDED.embedding, content_chunks.embedding) END,
|
||||
model = COALESCE(EXCLUDED.model, content_chunks.model),
|
||||
token_count = EXCLUDED.token_count,
|
||||
embedded_at = COALESCE(EXCLUDED.embedded_at, content_chunks.embedded_at)`,
|
||||
|
||||
144
src/core/pglite-lock.ts
Normal file
144
src/core/pglite-lock.ts
Normal file
@@ -0,0 +1,144 @@
|
||||
/**
|
||||
* PGLite File Lock — prevents concurrent process access to the same data directory.
|
||||
*
|
||||
* PGLite uses embedded Postgres (WASM) which only supports one connection at a time.
|
||||
* When `gbrain embed` (which can take minutes) is running and another process tries
|
||||
* to connect, PGLite throws `Aborted()` because it can't handle concurrent access.
|
||||
*
|
||||
* This module implements a simple advisory lock using a lock file next to the data
|
||||
* directory. It uses atomic `mkdir` (which is POSIX-atomic) combined with PID tracking
|
||||
* for stale lock detection.
|
||||
*
|
||||
* Usage:
|
||||
* const lock = await acquireLock(dataDir);
|
||||
* try { ... } finally { await releaseLock(lock); }
|
||||
*/
|
||||
|
||||
import { mkdirSync, existsSync, readFileSync, writeFileSync, rmSync, statSync } from 'fs';
|
||||
import { join } from 'path';
|
||||
|
||||
const LOCK_DIR_NAME = '.gbrain-lock';
|
||||
const LOCK_FILE = 'lock';
|
||||
const STALE_THRESHOLD_MS = 5 * 60 * 1000; // 5 minutes — embed jobs can be long
|
||||
|
||||
export interface LockHandle {
|
||||
lockDir: string;
|
||||
acquired: boolean;
|
||||
}
|
||||
|
||||
function getLockDir(dataDir: string | undefined): string {
|
||||
// Use the parent of the data dir for the lock, or a temp location for in-memory
|
||||
if (!dataDir) {
|
||||
// In-memory PGLite — no concurrent access possible since it's process-scoped
|
||||
// Return a sentinel that we skip
|
||||
return '';
|
||||
}
|
||||
return join(dataDir, LOCK_DIR_NAME);
|
||||
}
|
||||
|
||||
function isProcessAlive(pid: number): boolean {
|
||||
try {
|
||||
// Sending signal 0 checks existence without actually sending a signal
|
||||
process.kill(pid, 0);
|
||||
return true;
|
||||
} catch {
|
||||
return false;
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* Attempt to acquire an exclusive lock on the PGLite data directory.
|
||||
* Returns { acquired: true } if the lock was obtained, { acquired: false } otherwise.
|
||||
* Stale locks (from dead processes) are automatically cleaned up.
|
||||
*/
|
||||
export async function acquireLock(dataDir: string | undefined, opts?: { timeoutMs?: number }): Promise<LockHandle> {
|
||||
const lockDir = getLockDir(dataDir);
|
||||
|
||||
// In-memory PGLite — no lock needed (process-scoped, can't be shared)
|
||||
if (!lockDir) {
|
||||
return { lockDir: '', acquired: true };
|
||||
}
|
||||
|
||||
const timeoutMs = opts?.timeoutMs ?? 30_000; // 30 second default timeout
|
||||
const startTime = Date.now();
|
||||
|
||||
while (Date.now() - startTime < timeoutMs) {
|
||||
// Check for stale lock first
|
||||
if (existsSync(lockDir)) {
|
||||
const lockPath = join(lockDir, LOCK_FILE);
|
||||
try {
|
||||
const lockData = JSON.parse(readFileSync(lockPath, 'utf-8'));
|
||||
const lockPid = lockData.pid as number;
|
||||
const lockTime = lockData.acquired_at as number;
|
||||
|
||||
// Is the locking process still alive?
|
||||
if (!isProcessAlive(lockPid)) {
|
||||
// Stale lock — clean it up
|
||||
try { rmSync(lockDir, { recursive: true, force: true }); } catch { /* race condition, try again */ }
|
||||
} else if (Date.now() - lockTime > STALE_THRESHOLD_MS) {
|
||||
// Lock held for too long — assume stale (e.g., process hung)
|
||||
// Still alive but probably stuck — force remove
|
||||
try { rmSync(lockDir, { recursive: true, force: true }); } catch { /* race condition */ }
|
||||
} else {
|
||||
// Lock is held by a live process — wait and retry
|
||||
await new Promise(r => setTimeout(r, 1000));
|
||||
continue;
|
||||
}
|
||||
} catch {
|
||||
// Corrupt lock file — remove it
|
||||
try { rmSync(lockDir, { recursive: true, force: true }); } catch { /* race condition */ }
|
||||
}
|
||||
}
|
||||
|
||||
// Try to acquire lock (atomic mkdir)
|
||||
try {
|
||||
mkdirSync(lockDir, { recursive: false });
|
||||
// We got the lock — write our PID
|
||||
const lockPath = join(lockDir, LOCK_FILE);
|
||||
writeFileSync(lockPath, JSON.stringify({
|
||||
pid: process.pid,
|
||||
acquired_at: Date.now(),
|
||||
command: process.argv.slice(1).join(' '),
|
||||
}), { mode: 0o644 });
|
||||
|
||||
return { lockDir, acquired: true };
|
||||
} catch (e: unknown) {
|
||||
// mkdir failed — someone else grabbed it between our check and mkdir
|
||||
// This is fine, we'll retry
|
||||
if (Date.now() - startTime >= timeoutMs) {
|
||||
// Timeout — report which process holds the lock
|
||||
const lockPath = join(lockDir, LOCK_FILE);
|
||||
try {
|
||||
const lockData = JSON.parse(readFileSync(lockPath, 'utf-8'));
|
||||
throw new Error(
|
||||
`GBrain: Timed out waiting for PGLite lock. Process ${lockData.pid} has held it since ${new Date(lockData.acquired_at).toISOString()} (command: ${lockData.command}). ` +
|
||||
`If that process is dead, remove ${lockDir} and try again.`
|
||||
);
|
||||
} catch (readErr) {
|
||||
if (readErr instanceof Error && readErr.message.startsWith('GBrain')) throw readErr;
|
||||
throw new Error(
|
||||
`GBrain: Timed out waiting for PGLite lock. Remove ${lockDir} and try again.`
|
||||
);
|
||||
}
|
||||
}
|
||||
// Brief wait before retry
|
||||
await new Promise(r => setTimeout(r, 500));
|
||||
}
|
||||
}
|
||||
|
||||
// Should not reach here, but just in case
|
||||
throw new Error(`GBrain: Timed out waiting for PGLite lock.`);
|
||||
}
|
||||
|
||||
/**
|
||||
* Release a previously acquired lock.
|
||||
*/
|
||||
export async function releaseLock(lock: LockHandle): Promise<void> {
|
||||
if (!lock.lockDir || !lock.acquired) return;
|
||||
|
||||
try {
|
||||
rmSync(lock.lockDir, { recursive: true, force: true });
|
||||
} catch {
|
||||
// Lock file already removed (e.g., by stale cleanup) — that's fine
|
||||
}
|
||||
}
|
||||
@@ -1,5 +1,6 @@
|
||||
import postgres from 'postgres';
|
||||
import type { BrainEngine } from './engine.ts';
|
||||
import { MAX_SEARCH_LIMIT, clampSearchLimit } from './engine.ts';
|
||||
import { runMigrations } from './migrate.ts';
|
||||
import { SCHEMA_SQL } from './schema-embedded.ts';
|
||||
import type {
|
||||
@@ -98,7 +99,7 @@ export class PostgresEngine implements BrainEngine {
|
||||
async putPage(slug: string, page: PageInput): Promise<Page> {
|
||||
slug = validateSlug(slug);
|
||||
const sql = this.sql;
|
||||
const hash = page.content_hash || contentHash(page.compiled_truth, page.timeline || '');
|
||||
const hash = page.content_hash || contentHash(page);
|
||||
const frontmatter = page.frontmatter || {};
|
||||
|
||||
const rows = await sql`
|
||||
@@ -178,31 +179,56 @@ export class PostgresEngine implements BrainEngine {
|
||||
// Search
|
||||
async searchKeyword(query: string, opts?: SearchOpts): Promise<SearchResult[]> {
|
||||
const sql = this.sql;
|
||||
const limit = opts?.limit || 20;
|
||||
const limit = clampSearchLimit(opts?.limit);
|
||||
const offset = opts?.offset || 0;
|
||||
const type = opts?.type;
|
||||
const excludeSlugs = opts?.exclude_slugs;
|
||||
|
||||
if (opts?.limit && opts.limit > MAX_SEARCH_LIMIT) {
|
||||
console.warn(`[gbrain] Warning: search limit clamped from ${opts.limit} to ${MAX_SEARCH_LIMIT}`);
|
||||
}
|
||||
|
||||
// CTE: rank pages by FTS score, then pick the best chunk per page in SQL
|
||||
const rows = await sql`
|
||||
SELECT DISTINCT ON (p.slug)
|
||||
p.slug, p.id as page_id, p.title, p.type,
|
||||
cc.chunk_text, cc.chunk_source,
|
||||
ts_rank(p.search_vector, websearch_to_tsquery('english', ${query})) AS score,
|
||||
CASE WHEN p.updated_at < (
|
||||
SELECT MAX(te.created_at) FROM timeline_entries te WHERE te.page_id = p.id
|
||||
) THEN true ELSE false END AS stale
|
||||
FROM pages p
|
||||
JOIN content_chunks cc ON cc.page_id = p.id
|
||||
WHERE p.search_vector @@ websearch_to_tsquery('english', ${query})
|
||||
ORDER BY p.slug, score DESC
|
||||
WITH ranked_pages AS (
|
||||
SELECT p.id, p.slug, p.title, p.type,
|
||||
ts_rank(p.search_vector, websearch_to_tsquery('english', ${query})) AS score
|
||||
FROM pages p
|
||||
WHERE p.search_vector @@ websearch_to_tsquery('english', ${query})
|
||||
${type ? sql`AND p.type = ${type}` : sql``}
|
||||
${excludeSlugs?.length ? sql`AND p.slug != ALL(${excludeSlugs})` : sql``}
|
||||
ORDER BY score DESC
|
||||
LIMIT ${limit}
|
||||
OFFSET ${offset}
|
||||
),
|
||||
best_chunks AS (
|
||||
SELECT DISTINCT ON (rp.slug)
|
||||
rp.slug, rp.id as page_id, rp.title, rp.type, rp.score,
|
||||
cc.chunk_text, cc.chunk_source
|
||||
FROM ranked_pages rp
|
||||
JOIN content_chunks cc ON cc.page_id = rp.id
|
||||
ORDER BY rp.slug, cc.chunk_index
|
||||
)
|
||||
SELECT slug, page_id, title, type, chunk_text, chunk_source, score,
|
||||
false AS stale
|
||||
FROM best_chunks
|
||||
ORDER BY score DESC
|
||||
`;
|
||||
// Re-sort by score (DISTINCT ON requires ORDER BY slug first) and apply limit
|
||||
rows.sort((a: any, b: any) => b.score - a.score);
|
||||
rows.splice(limit);
|
||||
|
||||
return rows.map(rowToSearchResult);
|
||||
}
|
||||
|
||||
async searchVector(embedding: Float32Array, opts?: SearchOpts): Promise<SearchResult[]> {
|
||||
const sql = this.sql;
|
||||
const limit = opts?.limit || 20;
|
||||
const limit = clampSearchLimit(opts?.limit);
|
||||
const offset = opts?.offset || 0;
|
||||
const type = opts?.type;
|
||||
const excludeSlugs = opts?.exclude_slugs;
|
||||
|
||||
if (opts?.limit && opts.limit > MAX_SEARCH_LIMIT) {
|
||||
console.warn(`[gbrain] Warning: search limit clamped from ${opts.limit} to ${MAX_SEARCH_LIMIT}`);
|
||||
}
|
||||
|
||||
const vecStr = '[' + Array.from(embedding).join(',') + ']';
|
||||
|
||||
const rows = await sql`
|
||||
@@ -210,14 +236,15 @@ export class PostgresEngine implements BrainEngine {
|
||||
p.slug, p.id as page_id, p.title, p.type,
|
||||
cc.chunk_text, cc.chunk_source,
|
||||
1 - (cc.embedding <=> ${vecStr}::vector) AS score,
|
||||
CASE WHEN p.updated_at < (
|
||||
SELECT MAX(te.created_at) FROM timeline_entries te WHERE te.page_id = p.id
|
||||
) THEN true ELSE false END AS stale
|
||||
false AS stale
|
||||
FROM content_chunks cc
|
||||
JOIN pages p ON p.id = cc.page_id
|
||||
WHERE cc.embedding IS NOT NULL
|
||||
${type ? sql`AND p.type = ${type}` : sql``}
|
||||
${excludeSlugs?.length ? sql`AND p.slug != ALL(${excludeSlugs})` : sql``}
|
||||
ORDER BY cc.embedding <=> ${vecStr}::vector
|
||||
LIMIT ${limit}
|
||||
OFFSET ${offset}
|
||||
`;
|
||||
|
||||
return rows.map(rowToSearchResult);
|
||||
@@ -268,7 +295,7 @@ export class PostgresEngine implements BrainEngine {
|
||||
ON CONFLICT (page_id, chunk_index) DO UPDATE SET
|
||||
chunk_text = EXCLUDED.chunk_text,
|
||||
chunk_source = EXCLUDED.chunk_source,
|
||||
embedding = COALESCE(EXCLUDED.embedding, content_chunks.embedding),
|
||||
embedding = CASE WHEN EXCLUDED.chunk_text != content_chunks.chunk_text THEN EXCLUDED.embedding ELSE COALESCE(EXCLUDED.embedding, content_chunks.embedding) END,
|
||||
model = COALESCE(EXCLUDED.model, content_chunks.model),
|
||||
token_count = EXCLUDED.token_count,
|
||||
embedded_at = COALESCE(EXCLUDED.embedded_at, content_chunks.embedded_at)`,
|
||||
@@ -298,7 +325,7 @@ export class PostgresEngine implements BrainEngine {
|
||||
// Links
|
||||
async addLink(from: string, to: string, context?: string, linkType?: string): Promise<void> {
|
||||
const sql = this.sql;
|
||||
await sql`
|
||||
const result = await sql`
|
||||
INSERT INTO links (from_page_id, to_page_id, link_type, context)
|
||||
SELECT f.id, t.id, ${linkType || ''}, ${context || ''}
|
||||
FROM pages f, pages t
|
||||
@@ -306,7 +333,9 @@ export class PostgresEngine implements BrainEngine {
|
||||
ON CONFLICT (from_page_id, to_page_id) DO UPDATE SET
|
||||
link_type = EXCLUDED.link_type,
|
||||
context = EXCLUDED.context
|
||||
RETURNING id
|
||||
`;
|
||||
if (result.length === 0) throw new Error(`addLink failed: page "${from}" or "${to}" not found`);
|
||||
}
|
||||
|
||||
async removeLink(from: string, to: string): Promise<void> {
|
||||
@@ -381,9 +410,13 @@ export class PostgresEngine implements BrainEngine {
|
||||
// Tags
|
||||
async addTag(slug: string, tag: string): Promise<void> {
|
||||
const sql = this.sql;
|
||||
// Verify page exists before attempting insert (ON CONFLICT DO NOTHING
|
||||
// swallows the "already tagged" case, but we still need to detect missing pages)
|
||||
const page = await sql`SELECT id FROM pages WHERE slug = ${slug}`;
|
||||
if (page.length === 0) throw new Error(`addTag failed: page "${slug}" not found`);
|
||||
await sql`
|
||||
INSERT INTO tags (page_id, tag)
|
||||
SELECT id, ${tag} FROM pages WHERE slug = ${slug}
|
||||
VALUES (${page[0].id}, ${tag})
|
||||
ON CONFLICT (page_id, tag) DO NOTHING
|
||||
`;
|
||||
}
|
||||
@@ -410,11 +443,13 @@ export class PostgresEngine implements BrainEngine {
|
||||
// Timeline
|
||||
async addTimelineEntry(slug: string, entry: TimelineInput): Promise<void> {
|
||||
const sql = this.sql;
|
||||
await sql`
|
||||
const result = await sql`
|
||||
INSERT INTO timeline_entries (page_id, date, source, summary, detail)
|
||||
SELECT id, ${entry.date}::date, ${entry.source || ''}, ${entry.summary}, ${entry.detail || ''}
|
||||
FROM pages WHERE slug = ${slug}
|
||||
RETURNING id
|
||||
`;
|
||||
if (result.length === 0) throw new Error(`addTimelineEntry failed: page "${slug}" not found`);
|
||||
}
|
||||
|
||||
async getTimeline(slug: string, opts?: TimelineOpts): Promise<TimelineEntry[]> {
|
||||
@@ -451,14 +486,16 @@ export class PostgresEngine implements BrainEngine {
|
||||
// Raw data
|
||||
async putRawData(slug: string, source: string, data: object): Promise<void> {
|
||||
const sql = this.sql;
|
||||
await sql`
|
||||
const result = await sql`
|
||||
INSERT INTO raw_data (page_id, source, data)
|
||||
SELECT id, ${source}, ${JSON.stringify(data)}::jsonb
|
||||
FROM pages WHERE slug = ${slug}
|
||||
ON CONFLICT (page_id, source) DO UPDATE SET
|
||||
data = EXCLUDED.data,
|
||||
fetched_at = now()
|
||||
RETURNING id
|
||||
`;
|
||||
if (result.length === 0) throw new Error(`putRawData failed: page "${slug}" not found`);
|
||||
}
|
||||
|
||||
async getRawData(slug: string, source?: string): Promise<RawData[]> {
|
||||
@@ -489,6 +526,7 @@ export class PostgresEngine implements BrainEngine {
|
||||
FROM pages WHERE slug = ${slug}
|
||||
RETURNING *
|
||||
`;
|
||||
if (rows.length === 0) throw new Error(`createVersion failed: page "${slug}" not found`);
|
||||
return rows[0] as unknown as PageVersion;
|
||||
}
|
||||
|
||||
@@ -555,13 +593,16 @@ export class PostgresEngine implements BrainEngine {
|
||||
(SELECT count(*) FROM content_chunks WHERE embedded_at IS NOT NULL)::float /
|
||||
GREATEST((SELECT count(*) FROM content_chunks), 1)::float as embed_coverage,
|
||||
(SELECT count(*) FROM pages p
|
||||
WHERE p.updated_at < (SELECT MAX(te.created_at) FROM timeline_entries te WHERE te.page_id = p.id)
|
||||
WHERE (p.compiled_truth != '' OR p.timeline != '')
|
||||
AND NOT EXISTS (SELECT 1 FROM content_chunks cc WHERE cc.page_id = p.id)
|
||||
) as stale_pages,
|
||||
(SELECT count(*) FROM pages p
|
||||
WHERE NOT EXISTS (SELECT 1 FROM links l WHERE l.to_page_id = p.id)
|
||||
AND NOT EXISTS (SELECT 1 FROM links l WHERE l.from_page_id = p.id)
|
||||
) as orphan_pages,
|
||||
(SELECT count(*) FROM links l
|
||||
WHERE NOT EXISTS (SELECT 1 FROM pages p WHERE p.id = l.to_page_id)
|
||||
(SELECT count(*) FROM content_chunks cc
|
||||
JOIN pages p ON p.id = cc.page_id
|
||||
WHERE p.compiled_truth = '' AND p.timeline = ''
|
||||
) as dead_links,
|
||||
(SELECT count(*) FROM content_chunks WHERE embedded_at IS NULL) as missing_embeddings
|
||||
`;
|
||||
|
||||
@@ -7,6 +7,7 @@
|
||||
*/
|
||||
|
||||
import type { BrainEngine } from '../engine.ts';
|
||||
import { MAX_SEARCH_LIMIT, clampSearchLimit } from '../engine.ts';
|
||||
import type { SearchResult, SearchOpts } from '../types.ts';
|
||||
import { embed } from '../embedding.ts';
|
||||
import { dedupResults } from './dedup.ts';
|
||||
@@ -24,21 +25,25 @@ export async function hybridSearch(
|
||||
opts?: HybridSearchOpts,
|
||||
): Promise<SearchResult[]> {
|
||||
const limit = opts?.limit || 20;
|
||||
const offset = opts?.offset || 0;
|
||||
const innerLimit = Math.min(limit * 2, MAX_SEARCH_LIMIT);
|
||||
|
||||
// Run keyword search (always available, no API key needed)
|
||||
const keywordResults = await engine.searchKeyword(query, { limit: limit * 2 });
|
||||
const keywordResults = await engine.searchKeyword(query, { limit: innerLimit });
|
||||
|
||||
// Skip vector search entirely if no OpenAI key is configured
|
||||
if (!process.env.OPENAI_API_KEY) {
|
||||
return dedupResults(keywordResults).slice(0, limit);
|
||||
return dedupResults(keywordResults).slice(offset, offset + limit);
|
||||
}
|
||||
|
||||
// Determine query variants (optionally with expansion)
|
||||
// expandQuery already includes the original query in its return value,
|
||||
// so we use it directly instead of prepending query again
|
||||
let queries = [query];
|
||||
if (opts?.expansion && opts?.expandFn) {
|
||||
try {
|
||||
const expanded = await opts.expandFn(query);
|
||||
queries = [query, ...expanded].slice(0, 3);
|
||||
queries = await opts.expandFn(query);
|
||||
if (queries.length === 0) queries = [query];
|
||||
} catch {
|
||||
// Expansion failure is non-fatal
|
||||
}
|
||||
@@ -49,14 +54,14 @@ export async function hybridSearch(
|
||||
try {
|
||||
const embeddings = await Promise.all(queries.map(q => embed(q)));
|
||||
vectorLists = await Promise.all(
|
||||
embeddings.map(emb => engine.searchVector(emb, { limit: limit * 2 })),
|
||||
embeddings.map(emb => engine.searchVector(emb, { limit: innerLimit })),
|
||||
);
|
||||
} catch {
|
||||
// Embedding failure is non-fatal, fall back to keyword-only
|
||||
}
|
||||
|
||||
if (vectorLists.length === 0) {
|
||||
return dedupResults(keywordResults).slice(0, limit);
|
||||
return dedupResults(keywordResults).slice(offset, offset + limit);
|
||||
}
|
||||
|
||||
// Merge all result lists via RRF
|
||||
@@ -66,7 +71,7 @@ export async function hybridSearch(
|
||||
// Dedup
|
||||
const deduped = dedupResults(fused);
|
||||
|
||||
return deduped.slice(0, limit);
|
||||
return deduped.slice(offset, offset + limit);
|
||||
}
|
||||
|
||||
/**
|
||||
|
||||
@@ -66,6 +66,7 @@ export interface SearchResult {
|
||||
|
||||
export interface SearchOpts {
|
||||
limit?: number;
|
||||
offset?: number;
|
||||
type?: PageType;
|
||||
exclude_slugs?: string[];
|
||||
}
|
||||
|
||||
@@ -1,5 +1,5 @@
|
||||
import { createHash } from 'crypto';
|
||||
import type { Page, PageType, Chunk, SearchResult } from './types.ts';
|
||||
import type { Page, PageInput, PageType, Chunk, SearchResult } from './types.ts';
|
||||
|
||||
/**
|
||||
* Validate and normalize a slug. Slugs are lowercased repo-relative paths.
|
||||
@@ -13,10 +13,19 @@ export function validateSlug(slug: string): string {
|
||||
}
|
||||
|
||||
/**
|
||||
* SHA-256 hash of compiled_truth + timeline, used for import idempotency.
|
||||
* SHA-256 hash of page content, used for import idempotency.
|
||||
* Hashes all PageInput fields to match importFromContent's hash algorithm.
|
||||
*/
|
||||
export function contentHash(compiledTruth: string, timeline: string): string {
|
||||
return createHash('sha256').update(compiledTruth + '\n---\n' + timeline).digest('hex');
|
||||
export function contentHash(page: PageInput): string {
|
||||
return createHash('sha256')
|
||||
.update(JSON.stringify({
|
||||
title: page.title,
|
||||
type: page.type,
|
||||
compiled_truth: page.compiled_truth,
|
||||
timeline: page.timeline || '',
|
||||
frontmatter: page.frontmatter || {},
|
||||
}))
|
||||
.digest('hex');
|
||||
}
|
||||
|
||||
export function rowToPage(row: Record<string, unknown>): Page {
|
||||
|
||||
@@ -3,10 +3,29 @@ import { StdioServerTransport } from '@modelcontextprotocol/sdk/server/stdio.js'
|
||||
import { ListToolsRequestSchema, CallToolRequestSchema } from '@modelcontextprotocol/sdk/types.js';
|
||||
import type { BrainEngine } from '../core/engine.ts';
|
||||
import { operations, OperationError } from '../core/operations.ts';
|
||||
import type { OperationContext } from '../core/operations.ts';
|
||||
import type { Operation, OperationContext } from '../core/operations.ts';
|
||||
import { loadConfig } from '../core/config.ts';
|
||||
import { VERSION } from '../version.ts';
|
||||
|
||||
/** Validate required params exist and have the expected type */
|
||||
function validateParams(op: Operation, params: Record<string, unknown>): string | null {
|
||||
for (const [key, def] of Object.entries(op.params)) {
|
||||
if (def.required && (params[key] === undefined || params[key] === null)) {
|
||||
return `Missing required parameter: ${key}`;
|
||||
}
|
||||
if (params[key] !== undefined && params[key] !== null) {
|
||||
const val = params[key];
|
||||
const expected = def.type;
|
||||
if (expected === 'string' && typeof val !== 'string') return `Parameter "${key}" must be a string`;
|
||||
if (expected === 'number' && typeof val !== 'number') return `Parameter "${key}" must be a number`;
|
||||
if (expected === 'boolean' && typeof val !== 'boolean') return `Parameter "${key}" must be a boolean`;
|
||||
if (expected === 'object' && (typeof val !== 'object' || Array.isArray(val))) return `Parameter "${key}" must be an object`;
|
||||
if (expected === 'array' && !Array.isArray(val)) return `Parameter "${key}" must be an array`;
|
||||
}
|
||||
}
|
||||
return null;
|
||||
}
|
||||
|
||||
export async function startMcpServer(engine: BrainEngine) {
|
||||
const server = new Server(
|
||||
{ name: 'gbrain', version: VERSION },
|
||||
@@ -54,8 +73,14 @@ export async function startMcpServer(engine: BrainEngine) {
|
||||
dryRun: !!(params?.dry_run),
|
||||
};
|
||||
|
||||
const safeParams = params || {};
|
||||
const validationError = validateParams(op, safeParams);
|
||||
if (validationError) {
|
||||
return { content: [{ type: 'text', text: JSON.stringify({ error: 'invalid_params', message: validationError }, null, 2) }], isError: true };
|
||||
}
|
||||
|
||||
try {
|
||||
const result = await op.handler(ctx, params || {});
|
||||
const result = await op.handler(ctx, safeParams);
|
||||
return { content: [{ type: 'text', text: JSON.stringify(result, null, 2) }] };
|
||||
} catch (e: unknown) {
|
||||
if (e instanceof OperationError) {
|
||||
@@ -79,6 +104,9 @@ export async function handleToolCall(
|
||||
const op = operations.find(o => o.name === tool);
|
||||
if (!op) throw new Error(`Unknown tool: ${tool}`);
|
||||
|
||||
const validationError = validateParams(op, params);
|
||||
if (validationError) throw new Error(validationError);
|
||||
|
||||
const ctx: OperationContext = {
|
||||
engine,
|
||||
config: loadConfig() || { engine: 'postgres' },
|
||||
|
||||
@@ -2,6 +2,8 @@
|
||||
|
||||
CREATE EXTENSION IF NOT EXISTS vector;
|
||||
CREATE EXTENSION IF NOT EXISTS pg_trgm;
|
||||
-- gen_random_uuid() is core in Postgres 13+; enable pgcrypto as fallback for older versions
|
||||
CREATE EXTENSION IF NOT EXISTS pgcrypto;
|
||||
|
||||
-- ============================================================
|
||||
-- pages: the core content table
|
||||
|
||||
@@ -40,6 +40,26 @@ describe('CLI version', () => {
|
||||
});
|
||||
});
|
||||
|
||||
describe('ask alias', () => {
|
||||
test('ask alias maps to query in source', () => {
|
||||
expect(cliSource).toContain("if (command === 'ask')");
|
||||
expect(cliSource).toContain("command = 'query'");
|
||||
});
|
||||
|
||||
test('ask does NOT appear in --tools-json output', async () => {
|
||||
const proc = Bun.spawn(['bun', 'run', 'src/cli.ts', '--tools-json'], {
|
||||
cwd: new URL('..', import.meta.url).pathname,
|
||||
stdout: 'pipe',
|
||||
stderr: 'pipe',
|
||||
});
|
||||
const stdout = await new Response(proc.stdout).text();
|
||||
await proc.exited;
|
||||
const tools = JSON.parse(stdout);
|
||||
const names = tools.map((t: any) => t.name);
|
||||
expect(names).not.toContain('ask');
|
||||
});
|
||||
});
|
||||
|
||||
describe('CLI dispatch integration', () => {
|
||||
test('--version outputs version', async () => {
|
||||
const proc = Bun.spawn(['bun', 'run', 'src/cli.ts', '--version'], {
|
||||
|
||||
128
test/embed.test.ts
Normal file
128
test/embed.test.ts
Normal file
@@ -0,0 +1,128 @@
|
||||
import { describe, test, expect, mock, beforeEach, afterEach } from 'bun:test';
|
||||
import type { BrainEngine } from '../src/core/engine.ts';
|
||||
|
||||
// Mock the embedding module BEFORE importing runEmbed, so runEmbed picks up
|
||||
// the mocked embedBatch. We track max concurrent invocations via a counter
|
||||
// that increments on entry and decrements when the mock resolves.
|
||||
let activeEmbedCalls = 0;
|
||||
let maxConcurrentEmbedCalls = 0;
|
||||
let totalEmbedCalls = 0;
|
||||
|
||||
mock.module('../src/core/embedding.ts', () => ({
|
||||
embedBatch: async (texts: string[]) => {
|
||||
activeEmbedCalls++;
|
||||
totalEmbedCalls++;
|
||||
if (activeEmbedCalls > maxConcurrentEmbedCalls) {
|
||||
maxConcurrentEmbedCalls = activeEmbedCalls;
|
||||
}
|
||||
// Simulate API latency so concurrent workers actually overlap.
|
||||
await new Promise(r => setTimeout(r, 30));
|
||||
activeEmbedCalls--;
|
||||
return texts.map(() => new Float32Array(1536));
|
||||
},
|
||||
}));
|
||||
|
||||
// Import AFTER mocking.
|
||||
const { runEmbed } = await import('../src/commands/embed.ts');
|
||||
|
||||
// Proxy-based mock engine that matches test/import-file.test.ts pattern.
|
||||
function mockEngine(overrides: Partial<Record<string, any>> = {}): BrainEngine {
|
||||
const calls: { method: string; args: any[] }[] = [];
|
||||
const track = (method: string) => (...args: any[]) => {
|
||||
calls.push({ method, args });
|
||||
if (overrides[method]) return overrides[method](...args);
|
||||
return Promise.resolve(null);
|
||||
};
|
||||
const engine = new Proxy({} as any, {
|
||||
get(_, prop: string) {
|
||||
if (prop === '_calls') return calls;
|
||||
if (overrides[prop]) return overrides[prop];
|
||||
return track(prop);
|
||||
},
|
||||
});
|
||||
return engine;
|
||||
}
|
||||
|
||||
beforeEach(() => {
|
||||
activeEmbedCalls = 0;
|
||||
maxConcurrentEmbedCalls = 0;
|
||||
totalEmbedCalls = 0;
|
||||
});
|
||||
|
||||
afterEach(() => {
|
||||
delete process.env.GBRAIN_EMBED_CONCURRENCY;
|
||||
});
|
||||
|
||||
describe('runEmbed --all (parallel)', () => {
|
||||
test('runs embedBatch calls concurrently across pages', async () => {
|
||||
const NUM_PAGES = 20;
|
||||
const pages = Array.from({ length: NUM_PAGES }, (_, i) => ({ slug: `page-${i}` }));
|
||||
// Each page has one chunk without an embedding (stale).
|
||||
const chunksBySlug = new Map(
|
||||
pages.map(p => [
|
||||
p.slug,
|
||||
[{ chunk_index: 0, chunk_text: `text for ${p.slug}`, chunk_source: 'compiled_truth', embedded_at: null, token_count: 4 }],
|
||||
]),
|
||||
);
|
||||
|
||||
const engine = mockEngine({
|
||||
listPages: async () => pages,
|
||||
getChunks: async (slug: string) => chunksBySlug.get(slug) || [],
|
||||
upsertChunks: async () => {},
|
||||
});
|
||||
|
||||
process.env.GBRAIN_EMBED_CONCURRENCY = '10';
|
||||
|
||||
await runEmbed(engine, ['--all']);
|
||||
|
||||
expect(totalEmbedCalls).toBe(NUM_PAGES);
|
||||
// Concurrency actually happened.
|
||||
expect(maxConcurrentEmbedCalls).toBeGreaterThan(1);
|
||||
// And stayed within the configured limit.
|
||||
expect(maxConcurrentEmbedCalls).toBeLessThanOrEqual(10);
|
||||
});
|
||||
|
||||
test('respects GBRAIN_EMBED_CONCURRENCY=1 (serial)', async () => {
|
||||
const pages = Array.from({ length: 5 }, (_, i) => ({ slug: `page-${i}` }));
|
||||
const chunksBySlug = new Map(
|
||||
pages.map(p => [
|
||||
p.slug,
|
||||
[{ chunk_index: 0, chunk_text: `text ${p.slug}`, chunk_source: 'compiled_truth', embedded_at: null, token_count: 4 }],
|
||||
]),
|
||||
);
|
||||
|
||||
const engine = mockEngine({
|
||||
listPages: async () => pages,
|
||||
getChunks: async (slug: string) => chunksBySlug.get(slug) || [],
|
||||
upsertChunks: async () => {},
|
||||
});
|
||||
|
||||
process.env.GBRAIN_EMBED_CONCURRENCY = '1';
|
||||
|
||||
await runEmbed(engine, ['--all']);
|
||||
|
||||
expect(totalEmbedCalls).toBe(5);
|
||||
expect(maxConcurrentEmbedCalls).toBe(1);
|
||||
});
|
||||
|
||||
test('skips pages whose chunks are all already embedded when --stale', async () => {
|
||||
const pages = [{ slug: 'fresh' }, { slug: 'stale' }];
|
||||
const chunksBySlug = new Map<string, any[]>([
|
||||
['fresh', [{ chunk_index: 0, chunk_text: 'hi', chunk_source: 'compiled_truth', embedded_at: '2026-01-01', token_count: 1 }]],
|
||||
['stale', [{ chunk_index: 0, chunk_text: 'hi', chunk_source: 'compiled_truth', embedded_at: null, token_count: 1 }]],
|
||||
]);
|
||||
|
||||
const engine = mockEngine({
|
||||
listPages: async () => pages,
|
||||
getChunks: async (slug: string) => chunksBySlug.get(slug) || [],
|
||||
upsertChunks: async () => {},
|
||||
});
|
||||
|
||||
process.env.GBRAIN_EMBED_CONCURRENCY = '5';
|
||||
|
||||
await runEmbed(engine, ['--stale']);
|
||||
|
||||
// Only the stale page triggers an embedBatch call.
|
||||
expect(totalEmbedCalls).toBe(1);
|
||||
});
|
||||
});
|
||||
@@ -1,7 +1,7 @@
|
||||
import { describe, test, expect, beforeAll, afterAll } from 'bun:test';
|
||||
import { writeFileSync, mkdirSync, rmSync } from 'fs';
|
||||
import { writeFileSync, mkdirSync, rmSync, symlinkSync } from 'fs';
|
||||
import { join } from 'path';
|
||||
import { importFile } from '../src/core/import-file.ts';
|
||||
import { importFile, importFromContent } from '../src/core/import-file.ts';
|
||||
import type { BrainEngine } from '../src/core/engine.ts';
|
||||
|
||||
const TMP = join(import.meta.dir, '.tmp-import-test');
|
||||
@@ -87,6 +87,91 @@ This is the compiled truth.
|
||||
expect((engine as any)._calls.length).toBe(0);
|
||||
});
|
||||
|
||||
test('rejects frontmatter slug that does not match the file path', async () => {
|
||||
// In a shared brain where contributors can land PRs, this prevents a
|
||||
// poisoned notes/random.md from declaring `slug: people/elon` in its
|
||||
// frontmatter and overwriting the legitimate people/elon page on sync.
|
||||
const filePath = join(TMP, 'hijack.md');
|
||||
writeFileSync(filePath, `---
|
||||
type: person
|
||||
title: Elon Musk
|
||||
slug: people/elon
|
||||
---
|
||||
|
||||
Poisoned content that would overwrite people/elon.
|
||||
`);
|
||||
|
||||
const engine = mockEngine();
|
||||
const result = await importFile(engine, filePath, 'notes/random.md', { noEmbed: true });
|
||||
|
||||
expect(result.status).toBe('skipped');
|
||||
expect(result.error).toContain('people/elon');
|
||||
expect(result.error).toContain('notes/random');
|
||||
// No writes to the DB — the hijack never reaches putPage/createVersion.
|
||||
expect((engine as any)._calls.length).toBe(0);
|
||||
});
|
||||
|
||||
test('accepts frontmatter slug that matches the file path', async () => {
|
||||
// Sanity: a legitimate file whose frontmatter slug happens to equal the
|
||||
// path-derived slug must still import.
|
||||
const filePath = join(TMP, 'alice.md');
|
||||
writeFileSync(filePath, `---
|
||||
type: person
|
||||
title: Alice
|
||||
slug: people/alice-smith
|
||||
---
|
||||
|
||||
Legit content.
|
||||
`);
|
||||
|
||||
const engine = mockEngine();
|
||||
const result = await importFile(engine, filePath, 'people/alice-smith.md', { noEmbed: true });
|
||||
|
||||
expect(result.status).toBe('imported');
|
||||
expect(result.slug).toBe('people/alice-smith');
|
||||
});
|
||||
|
||||
test('uses path-derived slug when no frontmatter slug is set', async () => {
|
||||
// The common case: no frontmatter.slug, so the path determines the slug.
|
||||
const filePath = join(TMP, 'concept-path.md');
|
||||
writeFileSync(filePath, `---
|
||||
type: concept
|
||||
title: From Path
|
||||
---
|
||||
|
||||
Content.
|
||||
`);
|
||||
|
||||
const engine = mockEngine();
|
||||
const result = await importFile(engine, filePath, 'concepts/from-path.md', { noEmbed: true });
|
||||
|
||||
expect(result.status).toBe('imported');
|
||||
expect(result.slug).toBe('concepts/from-path');
|
||||
});
|
||||
|
||||
test('skips symlinks in importFromFile (defense-in-depth)', async () => {
|
||||
// Even if the walker somehow passes a symlink through, importFromFile
|
||||
// should catch it and return skipped.
|
||||
const realFile = join(TMP, 'real-target.md');
|
||||
writeFileSync(realFile, `---
|
||||
type: concept
|
||||
title: Real
|
||||
---
|
||||
|
||||
Content.
|
||||
`);
|
||||
const linkPath = join(TMP, 'symlink-file.md');
|
||||
try { rmSync(linkPath); } catch { /* may not exist */ }
|
||||
symlinkSync(realFile, linkPath);
|
||||
|
||||
const engine = mockEngine();
|
||||
const result = await importFile(engine, linkPath, 'symlink-file.md', { noEmbed: true });
|
||||
|
||||
expect(result.status).toBe('skipped');
|
||||
expect(result.error).toContain('symlink');
|
||||
expect((engine as any)._calls.length).toBe(0);
|
||||
});
|
||||
|
||||
test('skips file when content hash matches (idempotent)', async () => {
|
||||
const filePath = join(TMP, 'unchanged.md');
|
||||
writeFileSync(filePath, `---
|
||||
@@ -252,6 +337,49 @@ Content to chunk but not embed.
|
||||
}
|
||||
});
|
||||
|
||||
test('rejects in-memory content larger than MAX_FILE_SIZE', async () => {
|
||||
// The remote MCP put_page operation hands user-supplied content straight
|
||||
// to importFromContent, which is the path this guard defends. The guard
|
||||
// must trigger BEFORE parseMarkdown / chunkText / embedBatch — if it doesn't,
|
||||
// an authenticated attacker can force the owner to pay for embedding a
|
||||
// multi-megabyte string.
|
||||
const bigContent = '---\ntitle: Big\n---\n' + 'x'.repeat(5_100_000);
|
||||
|
||||
const engine = mockEngine();
|
||||
const result = await importFromContent(engine, 'big-slug', bigContent, { noEmbed: true });
|
||||
|
||||
expect(result.status).toBe('skipped');
|
||||
expect(result.error).toContain('too large');
|
||||
// No engine work at all — confirms the guard short-circuits before any
|
||||
// parsing or chunking allocation.
|
||||
expect((engine as any)._calls.length).toBe(0);
|
||||
});
|
||||
|
||||
test('uses UTF-8 byte length, not JS string length, for the size check', async () => {
|
||||
// 2.6M 4-byte codepoints = ~10.4 MB UTF-8 but only 2.6M JS UTF-16 code units.
|
||||
// A length-based check would let this through; a byteLength check catches it.
|
||||
const fourByteChar = '\u{1F600}'; // emoji, 4 bytes in UTF-8
|
||||
const bigContent = fourByteChar.repeat(2_600_000);
|
||||
|
||||
const engine = mockEngine();
|
||||
const result = await importFromContent(engine, 'emoji-slug', bigContent, { noEmbed: true });
|
||||
|
||||
expect(result.status).toBe('skipped');
|
||||
expect(result.error).toContain('too large');
|
||||
expect((engine as any)._calls.length).toBe(0);
|
||||
});
|
||||
|
||||
test('accepts in-memory content just under MAX_FILE_SIZE', async () => {
|
||||
// Sanity: content exactly at the limit must still import. If this test
|
||||
// fails, the guard is off-by-one and will break legitimate large imports.
|
||||
const content = '---\ntitle: Borderline\n---\n' + 'x'.repeat(4_900_000);
|
||||
|
||||
const engine = mockEngine();
|
||||
const result = await importFromContent(engine, 'borderline-slug', content, { noEmbed: true });
|
||||
|
||||
expect(result.status).toBe('imported');
|
||||
});
|
||||
|
||||
test('assigns sequential chunk_index values', async () => {
|
||||
const filePath = join(TMP, 'indexed.md');
|
||||
const longText = Array(50).fill('This is a sentence that adds length to the content.').join(' ');
|
||||
|
||||
89
test/import-walker.test.ts
Normal file
89
test/import-walker.test.ts
Normal file
@@ -0,0 +1,89 @@
|
||||
import { describe, test, expect, beforeEach, afterEach } from 'bun:test';
|
||||
import { mkdirSync, writeFileSync, symlinkSync, rmSync, mkdtempSync } from 'fs';
|
||||
import { tmpdir } from 'os';
|
||||
import { join } from 'path';
|
||||
import { collectMarkdownFiles } from '../src/commands/import.ts';
|
||||
|
||||
// These tests exercise the filesystem walker that feeds `gbrain import`.
|
||||
// They target L002 (report/findings.md): a malicious symlink inside a shared
|
||||
// brain directory must not cause the walker to read files outside the brain
|
||||
// root. See src/commands/import.ts:collectMarkdownFiles.
|
||||
|
||||
describe('collectMarkdownFiles — symlink containment', () => {
|
||||
let root: string;
|
||||
let secretDir: string;
|
||||
|
||||
beforeEach(() => {
|
||||
// Fresh directories per test so symlinks can't cross-contaminate runs.
|
||||
root = mkdtempSync(join(tmpdir(), 'gbrain-walker-root-'));
|
||||
secretDir = mkdtempSync(join(tmpdir(), 'gbrain-walker-secret-'));
|
||||
});
|
||||
|
||||
afterEach(() => {
|
||||
rmSync(root, { recursive: true, force: true });
|
||||
rmSync(secretDir, { recursive: true, force: true });
|
||||
});
|
||||
|
||||
test('includes real markdown files inside the root', () => {
|
||||
writeFileSync(join(root, 'legit.md'), '# legit\n');
|
||||
mkdirSync(join(root, 'notes'));
|
||||
writeFileSync(join(root, 'notes', 'other.md'), '# other\n');
|
||||
|
||||
const files = collectMarkdownFiles(root);
|
||||
expect(files).toContain(join(root, 'legit.md'));
|
||||
expect(files).toContain(join(root, 'notes', 'other.md'));
|
||||
});
|
||||
|
||||
test('skips a symlink file pointing outside the brain root', () => {
|
||||
// Plant a real secret outside the brain root
|
||||
const secretFile = join(secretDir, 'secret.md');
|
||||
writeFileSync(secretFile, '# secret — must not be ingested\n');
|
||||
|
||||
// Inside the brain, create a symlink that points at the secret.
|
||||
// Before the fix, statSync followed the link and reported it as
|
||||
// a regular file, so it ended up in the walker's output and got
|
||||
// fed to importFile — chunked, embedded, and indexed in the brain.
|
||||
writeFileSync(join(root, 'legit.md'), '# legit\n');
|
||||
symlinkSync(secretFile, join(root, 'innocent.md'));
|
||||
|
||||
const files = collectMarkdownFiles(root);
|
||||
expect(files).toContain(join(root, 'legit.md'));
|
||||
// The symlink itself must not appear — this is the security guarantee.
|
||||
expect(files).not.toContain(join(root, 'innocent.md'));
|
||||
// And the canonical secret path must definitely not be in the results.
|
||||
expect(files).not.toContain(secretFile);
|
||||
});
|
||||
|
||||
test('does not descend into a symlinked directory', () => {
|
||||
// Create a directory outside the root with a markdown file inside it.
|
||||
const outsideSub = join(secretDir, 'sub');
|
||||
mkdirSync(outsideSub);
|
||||
writeFileSync(join(outsideSub, 'external.md'), '# external\n');
|
||||
|
||||
// Plant a symlink inside the brain pointing at that directory.
|
||||
// Before the fix, walk() would follow it and emit external.md.
|
||||
// With lstatSync, stat.isSymbolicLink() is true and we refuse
|
||||
// to descend — this also blocks circular-symlink DoS as a side effect.
|
||||
writeFileSync(join(root, 'legit.md'), '# legit\n');
|
||||
symlinkSync(outsideSub, join(root, 'linked-notes'));
|
||||
|
||||
const files = collectMarkdownFiles(root);
|
||||
expect(files).toContain(join(root, 'legit.md'));
|
||||
expect(files).not.toContain(join(root, 'linked-notes', 'external.md'));
|
||||
expect(files).not.toContain(join(outsideSub, 'external.md'));
|
||||
});
|
||||
|
||||
test('skips broken symlinks without crashing', () => {
|
||||
// A dangling symlink — the target never existed. Pre-existing behavior
|
||||
// (PR #26 / PR #38) handled this via try/catch around statSync. The
|
||||
// L002 fix must not regress it: lstatSync succeeds on a dangling link
|
||||
// (it reports on the link itself, not the target), so we reach the
|
||||
// isSymbolicLink() branch and skip cleanly, no throw.
|
||||
writeFileSync(join(root, 'legit.md'), '# legit\n');
|
||||
symlinkSync('/nonexistent/path/to/nowhere', join(root, 'dangling.md'));
|
||||
|
||||
const files = collectMarkdownFiles(root);
|
||||
expect(files).toContain(join(root, 'legit.md'));
|
||||
expect(files).not.toContain(join(root, 'dangling.md'));
|
||||
});
|
||||
});
|
||||
89
test/pglite-lock.test.ts
Normal file
89
test/pglite-lock.test.ts
Normal file
@@ -0,0 +1,89 @@
|
||||
import { describe, test, expect, beforeEach, afterEach } from 'bun:test';
|
||||
import { mkdirSync, rmSync, existsSync, readFileSync, writeFileSync } from 'fs';
|
||||
import { join } from 'path';
|
||||
import { tmpdir } from 'os';
|
||||
import { acquireLock, releaseLock, type LockHandle } from '../src/core/pglite-lock';
|
||||
|
||||
const TEST_DIR = join(tmpdir(), 'gbrain-lock-test-' + process.pid);
|
||||
|
||||
describe('pglite-lock', () => {
|
||||
beforeEach(() => {
|
||||
// Clean up test directory
|
||||
if (existsSync(TEST_DIR)) rmSync(TEST_DIR, { recursive: true, force: true });
|
||||
mkdirSync(TEST_DIR, { recursive: true });
|
||||
});
|
||||
|
||||
afterEach(() => {
|
||||
if (existsSync(TEST_DIR)) rmSync(TEST_DIR, { recursive: true, force: true });
|
||||
});
|
||||
|
||||
test('acquires and releases lock', async () => {
|
||||
const lock = await acquireLock(TEST_DIR);
|
||||
expect(lock.acquired).toBe(true);
|
||||
expect(existsSync(join(TEST_DIR, '.gbrain-lock'))).toBe(true);
|
||||
|
||||
await releaseLock(lock);
|
||||
expect(existsSync(join(TEST_DIR, '.gbrain-lock'))).toBe(false);
|
||||
});
|
||||
|
||||
test('prevents concurrent lock acquisition', async () => {
|
||||
const lock1 = await acquireLock(TEST_DIR, { timeoutMs: 2000 });
|
||||
expect(lock1.acquired).toBe(true);
|
||||
|
||||
// Second lock attempt should timeout
|
||||
await expect(acquireLock(TEST_DIR, { timeoutMs: 1000 })).rejects.toThrow(/Timed out/);
|
||||
|
||||
await releaseLock(lock1);
|
||||
});
|
||||
|
||||
test('detects and cleans stale lock from dead process', async () => {
|
||||
// Simulate a stale lock from a dead process
|
||||
const lockDir = join(TEST_DIR, '.gbrain-lock');
|
||||
mkdirSync(lockDir);
|
||||
writeFileSync(join(lockDir, 'lock'), JSON.stringify({
|
||||
pid: 999999999, // Non-existent PID
|
||||
acquired_at: Date.now(),
|
||||
command: 'test',
|
||||
}));
|
||||
|
||||
// Should clean up the stale lock and acquire
|
||||
const lock = await acquireLock(TEST_DIR);
|
||||
expect(lock.acquired).toBe(true);
|
||||
|
||||
await releaseLock(lock);
|
||||
});
|
||||
|
||||
test('skips lock for in-memory (undefined dataDir)', async () => {
|
||||
const lock = await acquireLock(undefined);
|
||||
expect(lock.acquired).toBe(true);
|
||||
expect(lock.lockDir).toBe('');
|
||||
|
||||
// Release should be a no-op
|
||||
await releaseLock(lock);
|
||||
});
|
||||
|
||||
test('lock file contains PID and command', async () => {
|
||||
const lock = await acquireLock(TEST_DIR);
|
||||
const lockData = JSON.parse(readFileSync(join(TEST_DIR, '.gbrain-lock', 'lock'), 'utf-8'));
|
||||
|
||||
expect(lockData.pid).toBe(process.pid);
|
||||
expect(lockData.acquired_at).toBeDefined();
|
||||
expect(lockData.command).toBeDefined();
|
||||
|
||||
await releaseLock(lock);
|
||||
});
|
||||
|
||||
test('releases lock on disconnect even if DB close fails', async () => {
|
||||
const lock = await acquireLock(TEST_DIR);
|
||||
expect(lock.acquired).toBe(true);
|
||||
|
||||
// Simulate DB already closed
|
||||
await releaseLock(lock);
|
||||
expect(existsSync(join(TEST_DIR, '.gbrain-lock'))).toBe(false);
|
||||
|
||||
// Second acquisition should work
|
||||
const lock2 = await acquireLock(TEST_DIR);
|
||||
expect(lock2.acquired).toBe(true);
|
||||
await releaseLock(lock2);
|
||||
});
|
||||
});
|
||||
67
test/search-limit.test.ts
Normal file
67
test/search-limit.test.ts
Normal file
@@ -0,0 +1,67 @@
|
||||
import { describe, it, expect } from 'bun:test';
|
||||
import { MAX_SEARCH_LIMIT, clampSearchLimit } from '../src/core/engine.ts';
|
||||
|
||||
describe('clampSearchLimit', () => {
|
||||
it('uses default when undefined', () => {
|
||||
expect(clampSearchLimit(undefined)).toBe(20);
|
||||
});
|
||||
|
||||
it('uses custom default when provided', () => {
|
||||
expect(clampSearchLimit(undefined, 10)).toBe(10);
|
||||
});
|
||||
|
||||
it('passes through in-range values', () => {
|
||||
expect(clampSearchLimit(50)).toBe(50);
|
||||
});
|
||||
|
||||
it('clamps oversized values to MAX_SEARCH_LIMIT', () => {
|
||||
expect(clampSearchLimit(10_000_000)).toBe(MAX_SEARCH_LIMIT);
|
||||
});
|
||||
|
||||
it('uses default for zero', () => {
|
||||
expect(clampSearchLimit(0)).toBe(20);
|
||||
});
|
||||
|
||||
it('uses default for negative', () => {
|
||||
expect(clampSearchLimit(-5)).toBe(20);
|
||||
});
|
||||
|
||||
it('floors fractional values', () => {
|
||||
expect(clampSearchLimit(7.9)).toBe(7);
|
||||
});
|
||||
|
||||
it('uses default for NaN', () => {
|
||||
expect(clampSearchLimit(NaN)).toBe(20);
|
||||
});
|
||||
|
||||
it('clamps Infinity to MAX_SEARCH_LIMIT', () => {
|
||||
expect(clampSearchLimit(Infinity)).toBe(20); // !isFinite → default
|
||||
});
|
||||
|
||||
it('MAX_SEARCH_LIMIT is 100', () => {
|
||||
expect(MAX_SEARCH_LIMIT).toBe(100);
|
||||
});
|
||||
});
|
||||
|
||||
describe('listPages is NOT affected by search clamp', () => {
|
||||
it('listPages accepts limit > MAX_SEARCH_LIMIT (regression test)', async () => {
|
||||
// listPages uses PageFilters.limit, NOT clampSearchLimit.
|
||||
// This test verifies the clamp is scoped to search operations only.
|
||||
// We import the PGLite engine and check that listPages with limit 100000 works.
|
||||
const { PGLiteEngine } = await import('../src/core/pglite-engine.ts');
|
||||
const engine = new PGLiteEngine();
|
||||
await engine.connect({});
|
||||
await engine.initSchema();
|
||||
|
||||
// Insert a page
|
||||
await engine.putPage('test/big-list', {
|
||||
title: 'Test', type: 'concept', compiled_truth: 'test content', timeline: '',
|
||||
});
|
||||
|
||||
// listPages with limit 100000 should NOT be clamped
|
||||
const pages = await engine.listPages({ limit: 100000 });
|
||||
expect(pages.length).toBeGreaterThanOrEqual(1);
|
||||
|
||||
await engine.disconnect();
|
||||
});
|
||||
});
|
||||
@@ -29,19 +29,20 @@ describe('validateSlug', () => {
|
||||
|
||||
describe('contentHash', () => {
|
||||
test('returns deterministic hash', () => {
|
||||
const h1 = contentHash('hello', 'world');
|
||||
const h2 = contentHash('hello', 'world');
|
||||
const page = { title: 'Test', type: 'concept' as const, compiled_truth: 'hello', timeline: 'world' };
|
||||
const h1 = contentHash(page);
|
||||
const h2 = contentHash(page);
|
||||
expect(h1).toBe(h2);
|
||||
});
|
||||
|
||||
test('changes when content changes', () => {
|
||||
const h1 = contentHash('hello', 'world');
|
||||
const h2 = contentHash('hello', 'changed');
|
||||
const h1 = contentHash({ title: 'Test', type: 'concept' as const, compiled_truth: 'hello', timeline: 'world' });
|
||||
const h2 = contentHash({ title: 'Test', type: 'concept' as const, compiled_truth: 'hello', timeline: 'changed' });
|
||||
expect(h1).not.toBe(h2);
|
||||
});
|
||||
|
||||
test('returns hex string', () => {
|
||||
const h = contentHash('test', '');
|
||||
const h = contentHash({ title: 'Test', type: 'concept' as const, compiled_truth: 'test', timeline: '' });
|
||||
expect(h).toMatch(/^[a-f0-9]{64}$/);
|
||||
});
|
||||
});
|
||||
|
||||
Reference in New Issue
Block a user