gbrain/test/eval.test.ts
Garry Tan d547a64600 feat: search quality boost — compiled truth ranking + detail parameter (v0.8.1) (#64)
* feat: search quality boost — compiled truth ranking, detail parameter, cosine re-scoring

Compiled truth chunks now rank 2x higher in hybrid search via RRF
normalization + source boost. New --detail flag (low/medium/high)
controls timeline inclusion. Cosine re-scoring blends query-chunk
similarity before dedup for query-specific ranking.

Also: remove DISTINCT ON from keyword search (dedup handles per-page
capping), add chunk_id + chunk_index to SearchResult, add
getEmbeddingsByChunkIds to BrainEngine interface.

Inspired by Ramp Labs' "Latent Briefing" paper (April 2026).

* feat: RRF normalization, source-aware dedup, detail param in operations

RRF scores normalized to 0-1 before 2.0x compiled truth boost.
Source-aware dedup guarantees compiled truth chunk per page.
Detail parameter added to query operation, dedupResults added to
bare search operation. Debug logging via GBRAIN_SEARCH_DEBUG=1.
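
The ranking order described above (fuse, normalize to 0-1, then boost) can be sketched as follows. This is a minimal illustration, not the gbrain implementation: it assumes textbook reciprocal rank fusion with a default k of 60 and normalization by dividing by the maximum fused score. The names `Hit`, `Scored`, `fuseAndBoost`, and the `source` values are hypothetical. Only the 2.0x multiplier and the normalize-before-boost order come from the commit message.

```typescript
// Hypothetical shapes for illustration; not the actual gbrain types.
type Source = 'compiled_truth' | 'timeline';
interface Hit { chunkId: number; source: Source; }
interface Scored extends Hit { score: number; }

const COMPILED_TRUTH_BOOST = 2.0; // multiplier stated in the commit message

function fuseAndBoost(lists: Hit[][], rrfK = 60): Scored[] {
  const byId = new Map<number, Scored>();
  // Textbook RRF: each ranked list contributes 1 / (k + rank), rank starting at 1.
  for (const list of lists) {
    list.forEach((hit, i) => {
      const prev = byId.get(hit.chunkId) ?? { ...hit, score: 0 };
      prev.score += 1 / (rrfK + i + 1);
      byId.set(hit.chunkId, prev);
    });
  }
  const fused = Array.from(byId.values());
  // Assumption: normalize to 0-1 by dividing by the maximum fused score.
  const max = fused.reduce((m, h) => Math.max(m, h.score), 0);
  for (const h of fused) {
    const normalized = max > 0 ? h.score / max : 0;
    h.score = h.source === 'compiled_truth'
      ? normalized * COMPILED_TRUTH_BOOST
      : normalized;
  }
  return fused.sort((a, b) => b.score - a.score);
}
```

Normalizing first keeps the boost independent of how many lists are fused and of the choice of k: a compiled truth chunk with a normalized score above 0.5 outranks any unboosted chunk.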

* chore: bump version and changelog (v0.8.1)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: CJK word count in query expansion

CJK text is not space-delimited. A query like "向量搜索优化" was counted
as 1 word and silently skipped expansion. Now counts characters for CJK
queries instead of space-separated tokens.
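
A minimal sketch of the counting rule, assuming a simple script check. The function name `countQueryWords` and the Unicode ranges are assumptions; the commit does not say which ranges gbrain tests for.

```typescript
// Han ideographs, Hiragana, Katakana, and Hangul syllables.
// Assumed ranges for illustration; the real check may differ.
const CJK = /[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af]/;

function countQueryWords(query: string): number {
  const trimmed = query.trim();
  if (trimmed === '') return 0;
  if (CJK.test(trimmed)) {
    // CJK text is not space-delimited: count non-whitespace characters instead.
    return Array.from(trimmed.replace(/\s+/g, '')).length;
  }
  // Otherwise count whitespace-separated tokens.
  return trimmed.split(/\s+/).length;
}
```

With this rule, "向量搜索优化" counts as 6 rather than 1, so it is no longer treated as a one-word query.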

Co-Authored-By: YIING99 <yiing99@users.noreply.github.com>

* feat: retrieval evaluation harness — P@k, R@k, MRR, nDCG@k + gbrain eval

Full IR evaluation framework: precisionAtK, recallAtK, mrr, ndcgAtK
metrics with runEval() orchestrator. gbrain eval CLI with single-run
table and A/B comparison mode (--config-a / --config-b) for parameter
tuning. HybridSearchOpts now accepts rrfK and dedupOpts overrides.
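
For reference, the four metrics can be sketched as pure functions. This is not the code in src/core/search/eval.ts. It is a minimal version that assumes hits are unique IDs, a linear gain for nDCG (grade divided by log2(rank + 1)), and a precision denominator that stays at k even when fewer than k hits are returned.

```typescript
function precisionAtK(hits: string[], relevant: Set<string>, k: number): number {
  if (k <= 0) return 0;
  const found = hits.slice(0, k).filter((id) => relevant.has(id)).length;
  return found / k; // denominator is k, even if fewer than k hits came back
}

function recallAtK(hits: string[], relevant: Set<string>, k: number): number {
  if (k <= 0 || relevant.size === 0) return 0;
  const found = hits.slice(0, k).filter((id) => relevant.has(id)).length;
  return found / relevant.size;
}

function mrr(hits: string[], relevant: Set<string>): number {
  // Reciprocal rank of the first relevant hit; 0 when none is found.
  const idx = hits.findIndex((id) => relevant.has(id));
  return idx === -1 ? 0 : 1 / (idx + 1);
}

function dcg(gains: number[]): number {
  // Position i (0-based) is rank i + 1, discounted by log2(rank + 1).
  return gains.reduce((sum, g, i) => sum + g / Math.log2(i + 2), 0);
}

function ndcgAtK(hits: string[], grades: Map<string, number>, k: number): number {
  if (k <= 0) return 0;
  const actual = dcg(hits.slice(0, k).map((id) => grades.get(id) ?? 0));
  // Ideal ordering: all graded documents sorted by grade, best first.
  const ideal = dcg(Array.from(grades.values()).sort((a, b) => b - a).slice(0, k));
  return ideal === 0 ? 0 : actual / ideal;
}
```

An exponential gain (2^grade - 1) is a common alternative for nDCG; it weights highly graded documents more heavily but gives the same result for binary relevance.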

Co-Authored-By: 4shut0sh <4shut0sh@users.noreply.github.com>

* test: search quality tests — RRF boost, dedup guarantee, cosine similarity, E2E benchmark

42 new tests across 3 files:
- test/search.test.ts: RRF normalization, compiled truth 2x boost, dedup key
  collision prevention, cosine similarity edge cases, CJK word count detection
- test/dedup.test.ts: source-aware compiled truth guarantee, layer interactions,
  custom maxPerPage, empty/single result edge cases
- test/e2e/search-quality.test.ts: full pipeline against PGLite with basis vector
  embeddings — chunk_id/chunk_index fields, detail parameter filtering,
  getEmbeddingsByChunkIds, keyword multi-chunk, vector ordering

Also: export rrfFusion + cosineSimilarity for unit testing, fix PGLite
getEmbeddingsByChunkIds to parse string vectors from pgvector.

* test: search quality benchmark with A/B comparison (baseline vs PR#64)

Benchmark measures P@1, MRR, nDCG@5, and source accuracy across 8 queries
against 5 seeded pages. Key finding: boost helps entity lookups but
over-corrects temporal queries. Validates the --detail parameter as the
right control mechanism. Output at docs/benchmarks/2026-04-13.md.

* feat: query intent classifier — auto-selects detail level, 100% source accuracy

Zero-latency heuristic classifier detects query intent from text patterns:
- "Who is Pedro?" → entity → detail=low (compiled truth only)
- "When did we last meet?" → temporal → detail=high (no boost, natural ranking)
- "Variant fund announcement" → event → detail=high
- General queries → detail=medium (default with boost)

The key insight: skip the 2.0x compiled truth boost for detail=high queries.
Temporal/event queries want natural ranking where timeline entries can win.
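
The mapping above can be sketched as a few regular-expression checks. The patterns and the names `classifyIntent`, `detailFor`, and `shouldBoost` are illustrative assumptions; the commit message does not show the classifier's actual rules.

```typescript
type Intent = 'entity' | 'temporal' | 'event' | 'general';
type Detail = 'low' | 'medium' | 'high';

// Illustrative patterns only; the real classifier's rules are not shown here.
const TEMPORAL = /\b(when|last|latest|recent(ly)?|yesterday|today|ago)\b/i;
const EVENT = /\b(announce(d|ment)?|launch(ed)?|raised|funding round)\b/i;
const ENTITY = /^\s*(who|what) (is|are|was|were)\b/i;

function classifyIntent(query: string): Intent {
  // Temporal cues are checked first so "when ..." questions never fall through.
  if (TEMPORAL.test(query)) return 'temporal';
  if (EVENT.test(query)) return 'event';
  if (ENTITY.test(query)) return 'entity';
  return 'general';
}

function detailFor(intent: Intent): Detail {
  switch (intent) {
    case 'entity':
      return 'low'; // compiled truth only
    case 'temporal':
    case 'event':
      return 'high'; // natural ranking, timeline entries can win
    default:
      return 'medium'; // default, with boost
  }
}

// The compiled truth boost is skipped for detail=high queries.
function shouldBoost(detail: Detail): boolean {
  return detail !== 'high';
}
```

Because the check is plain pattern matching, it adds no model call to the query path, which is what makes the classifier effectively zero-latency.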

Benchmark results (source accuracy = does the top chunk match expected type):
- Baseline: 100% (already good, no boost needed)
- Boost only: 71.4% (boost over-corrects temporal queries)
- Boost + intent classifier: 100% (best of both worlds)

35 unit tests for the classifier. 590 total tests pass.

* feat: query intent classifier — auto-selects detail level, 100% source accuracy

Heuristic classifier detects query intent from text patterns (zero latency,
no LLM call). Maps temporal queries ("when did we last meet") to detail=high,
entity queries ("who is X") to detail=low, events to detail=high.

Benchmark results (29 pages, 20 queries, graded relevance):
- Baseline: P@1=0.947, MRR=0.974, source accuracy=89.5%
- Boost only: P@1=0.895, MRR=0.939, source accuracy=63.2% (over-correction)
- Boost + intent: P@1=0.947, MRR=0.974, source accuracy=89.5% (fully recovered)

The intent classifier eliminates the boost's over-correction on temporal queries
while preserving its benefits for entity lookups. 35 unit tests for the classifier.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* test: search quality benchmark with A/B comparison (baseline vs PR#64)

Rich benchmark: 29 pages, 58 chunks, 20 queries with graded relevance.
Now measures CHUNK-LEVEL quality, not just page-level retrieval.

Key findings (C. Boost+Intent vs A. Baseline):
- Unique pages in top-10: 7.2 → 8.7 (+21% broader coverage)
- Compiled truth ratio: 51.6% → 66.8% (+15pp more signal)
- CT-first rate: 100% (compiled truth leads for entity queries)
- Timeline accessible: 100% (temporal queries still find dates)
- Source accuracy: 89.5% maintained (intent classifier prevents regression)

The boost alone (B) causes -26pp source accuracy regression.
Intent classifier (C) recovers it fully.

* docs: clean benchmark report — ELI10 search quality analysis for PR#64

Replaces two drafts with one clean report. Explains what changed, why it
matters, and what the numbers mean. All fictional data, no private info.

Key findings: 21% more page coverage per query, 29% more compiled truth
in results. Intent classifier prevents boost from burying timeline for
temporal queries. Full per-query breakdown with before/after comparison.

* chore: remove auto-generated benchmark file (clean version is 2026-04-14-search-quality.md)

* docs: update project documentation for search quality boost

CLAUDE.md: added search/intent.ts, search/eval.ts, commands/eval.ts to key
files. Added 5 new test files (search, dedup, intent, eval, e2e/search-quality).
Updated test count from 23+4 to 28+5. Added docs/benchmarks/ to key files.

README.md: updated search pipeline diagram with intent classifier, RRF
normalization, compiled truth boost, cosine re-scoring, and 5-layer dedup.
Added --detail flag explanation and benchmark instructions.

CHANGELOG.md: added search quality entries to v0.9.3 (intent classifier,
--detail flag, gbrain eval, CJK fix). Credited @4shut0sh and @YIING99.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: headline benchmark gains in changelog

* docs: add community attribution rule to CHANGELOG voice section

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: YIING99 <yiing99@users.noreply.github.com>
Co-authored-by: 4shut0sh <4shut0sh@users.noreply.github.com>
2026-04-13 21:03:40 -10:00


/**
 * Unit tests for src/core/search/eval.ts
 *
 * Pure function tests — no database, no API keys, runs in: bun test
 */
import { describe, test, expect } from 'bun:test';
import {
  precisionAtK,
  recallAtK,
  mrr,
  ndcgAtK,
  parseQrels,
} from '../src/core/search/eval.ts';

// ─────────────────────────────────────────────────────────────────
// precisionAtK
// ─────────────────────────────────────────────────────────────────
describe('precisionAtK', () => {
  test('all hits relevant → 1.0', () => {
    const relevant = new Set(['a', 'b', 'c']);
    expect(precisionAtK(['a', 'b', 'c'], relevant, 3)).toBe(1.0);
  });

  test('no hits relevant → 0.0', () => {
    const relevant = new Set(['x', 'y']);
    expect(precisionAtK(['a', 'b', 'c'], relevant, 3)).toBe(0.0);
  });

  test('partial: 2 of 5 hits relevant at k=5', () => {
    const relevant = new Set(['a', 'c']);
    expect(precisionAtK(['a', 'b', 'c', 'd', 'e'], relevant, 5)).toBeCloseTo(2 / 5);
  });

  test('k=1 with first hit relevant → 1.0', () => {
    const relevant = new Set(['a']);
    expect(precisionAtK(['a', 'b', 'c'], relevant, 1)).toBe(1.0);
  });

  test('k=1 with first hit not relevant → 0.0', () => {
    const relevant = new Set(['b']);
    expect(precisionAtK(['a', 'b', 'c'], relevant, 1)).toBe(0.0);
  });

  test('k greater than hits length → denominator stays k', () => {
    const relevant = new Set(['a', 'b']);
    // Both returned hits are relevant, but the denominator is k=10 → 2/10
    expect(precisionAtK(['a', 'b'], relevant, 10)).toBeCloseTo(2 / 10);
  });

  test('empty hits → 0', () => {
    expect(precisionAtK([], new Set(['a']), 5)).toBe(0);
  });

  test('empty relevant set → 0', () => {
    expect(precisionAtK(['a', 'b'], new Set(), 5)).toBe(0);
  });

  test('k=0 → 0', () => {
    expect(precisionAtK(['a', 'b'], new Set(['a']), 0)).toBe(0);
  });
});

// ─────────────────────────────────────────────────────────────────
// recallAtK
// ─────────────────────────────────────────────────────────────────
describe('recallAtK', () => {
  test('all relevant found → 1.0', () => {
    const relevant = new Set(['a', 'b']);
    expect(recallAtK(['a', 'b', 'c'], relevant, 3)).toBe(1.0);
  });

  test('none found → 0.0', () => {
    const relevant = new Set(['x', 'y', 'z']);
    expect(recallAtK(['a', 'b', 'c'], relevant, 3)).toBe(0.0);
  });

  test('1 of 3 relevant found', () => {
    const relevant = new Set(['a', 'x', 'y']);
    expect(recallAtK(['a', 'b', 'c'], relevant, 3)).toBeCloseTo(1 / 3);
  });

  test('relevant found beyond k → not counted', () => {
    const relevant = new Set(['a', 'b']);
    // 'b' is at rank 5, beyond k=3
    expect(recallAtK(['a', 'x', 'y', 'z', 'b'], relevant, 3)).toBeCloseTo(1 / 2);
  });

  test('empty hits → 0', () => {
    expect(recallAtK([], new Set(['a']), 5)).toBe(0);
  });

  test('empty relevant set → 0', () => {
    expect(recallAtK(['a', 'b'], new Set(), 5)).toBe(0);
  });
});

// ─────────────────────────────────────────────────────────────────
// mrr
// ─────────────────────────────────────────────────────────────────
describe('mrr', () => {
  test('first hit relevant → 1.0', () => {
    expect(mrr(['a', 'b', 'c'], new Set(['a']))).toBe(1.0);
  });

  test('second hit relevant → 0.5', () => {
    expect(mrr(['x', 'a', 'c'], new Set(['a']))).toBeCloseTo(0.5);
  });

  test('third hit relevant → 1/3', () => {
    expect(mrr(['x', 'y', 'a'], new Set(['a']))).toBeCloseTo(1 / 3);
  });

  test('no relevant hit → 0', () => {
    expect(mrr(['x', 'y', 'z'], new Set(['a']))).toBe(0);
  });

  test('empty hits → 0', () => {
    expect(mrr([], new Set(['a']))).toBe(0);
  });

  test('empty relevant → 0', () => {
    expect(mrr(['a', 'b'], new Set())).toBe(0);
  });

  test('uses first relevant hit when multiple are relevant', () => {
    // 'b' is rank 2, 'c' is rank 3 — MRR should use 'b' at rank 2
    expect(mrr(['x', 'b', 'c'], new Set(['b', 'c']))).toBeCloseTo(0.5);
  });
});

// ─────────────────────────────────────────────────────────────────
// ndcgAtK
// ─────────────────────────────────────────────────────────────────
describe('ndcgAtK', () => {
  test('perfect ranking with binary relevance → 1.0', () => {
    const grades = new Map([['a', 1], ['b', 1]]);
    // Hits: a at rank 1, b at rank 2, same as the ideal ordering
    expect(ndcgAtK(['a', 'b', 'c'], grades, 5)).toBeCloseTo(1.0);
  });

  test('single relevant doc at rank 1 → 1.0', () => {
    const grades = new Map([['a', 1]]);
    expect(ndcgAtK(['a', 'x', 'y'], grades, 5)).toBeCloseTo(1.0);
  });

  test('single relevant doc at rank 2 → less than 1', () => {
    const grades = new Map([['a', 1]]);
    const score = ndcgAtK(['x', 'a', 'y'], grades, 5);
    expect(score).toBeGreaterThan(0);
    expect(score).toBeLessThan(1);
  });

  test('no relevant in hits → 0', () => {
    const grades = new Map([['a', 1], ['b', 1]]);
    expect(ndcgAtK(['x', 'y', 'z'], grades, 5)).toBe(0);
  });

  test('graded relevance: higher grade docs placed first → nDCG=1', () => {
    const grades = new Map([['a', 3], ['b', 2], ['c', 1]]);
    expect(ndcgAtK(['a', 'b', 'c'], grades, 3)).toBeCloseTo(1.0);
  });

  test('graded relevance: lower grade first → nDCG < 1', () => {
    const grades = new Map([['a', 3], ['b', 2], ['c', 1]]);
    // Reversed: worst first
    const score = ndcgAtK(['c', 'b', 'a'], grades, 3);
    expect(score).toBeGreaterThan(0);
    expect(score).toBeLessThan(1);
  });

  test('graded relevance: reversed is worse than perfect', () => {
    const grades = new Map([['a', 3], ['b', 2], ['c', 1]]);
    const perfect = ndcgAtK(['a', 'b', 'c'], grades, 3);
    const reversed = ndcgAtK(['c', 'b', 'a'], grades, 3);
    expect(perfect).toBeGreaterThan(reversed);
  });

  test('k=1 picks only the first hit', () => {
    const grades = new Map([['a', 1], ['b', 1]]);
    // Rank 1 is 'x', which is not relevant
    expect(ndcgAtK(['x', 'a', 'b'], grades, 1)).toBe(0);
    // Rank 1 is 'a', which is relevant
    expect(ndcgAtK(['a', 'x', 'b'], grades, 1)).toBeCloseTo(1.0);
  });

  test('empty hits → 0', () => {
    expect(ndcgAtK([], new Map([['a', 1]]), 5)).toBe(0);
  });

  test('empty grades → 0', () => {
    expect(ndcgAtK(['a', 'b'], new Map(), 5)).toBe(0);
  });

  test('k=0 → 0', () => {
    expect(ndcgAtK(['a', 'b'], new Map([['a', 1]]), 0)).toBe(0);
  });
});

// ─────────────────────────────────────────────────────────────────
// parseQrels
// ─────────────────────────────────────────────────────────────────
describe('parseQrels', () => {
  test('parses inline JSON array', () => {
    const input = JSON.stringify([
      { query: 'foo', relevant: ['a', 'b'] },
    ]);
    const result = parseQrels(input);
    expect(result).toHaveLength(1);
    expect(result[0].query).toBe('foo');
    expect(result[0].relevant).toEqual(['a', 'b']);
  });

  test('parses inline JSON object with queries array', () => {
    const input = JSON.stringify({
      version: 1,
      queries: [{ query: 'bar', relevant: ['x'] }],
    });
    const result = parseQrels(input);
    expect(result).toHaveLength(1);
    expect(result[0].query).toBe('bar');
  });

  test('preserves grades when present', () => {
    const input = JSON.stringify([
      { query: 'baz', relevant: ['a'], grades: { a: 3, b: 1 } },
    ]);
    const result = parseQrels(input);
    expect(result[0].grades).toEqual({ a: 3, b: 1 });
  });

  test('throws on invalid JSON', () => {
    expect(() => parseQrels('not-json')).toThrow();
  });

  test('throws on unrecognized format', () => {
    expect(() => parseQrels(JSON.stringify({ foo: 'bar' }))).toThrow();
  });
});