* feat: search quality boost - compiled truth ranking, detail parameter, cosine re-scoring

  Compiled truth chunks now rank 2x higher in hybrid search via RRF normalization + source boost. New --detail flag (low/medium/high) controls timeline inclusion. Cosine re-scoring blends query-chunk similarity before dedup for query-specific ranking.

  Also: remove DISTINCT ON from keyword search (dedup handles per-page capping), add chunk_id + chunk_index to SearchResult, add getEmbeddingsByChunkIds to BrainEngine interface.

  Inspired by Ramp Labs' "Latent Briefing" paper (April 2026).

* feat: RRF normalization, source-aware dedup, detail param in operations

  RRF scores normalized to 0-1 before 2.0x compiled truth boost. Source-aware dedup guarantees compiled truth chunk per page. Detail parameter added to query operation, dedupResults added to bare search operation. Debug logging via GBRAIN_SEARCH_DEBUG=1.

* chore: bump version and changelog (v0.8.1)

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: CJK word count in query expansion

  CJK text is not space-delimited. A query like "向量搜索优化" was counted as 1 word and silently skipped expansion. Now counts characters for CJK queries instead of space-separated tokens.

  Co-Authored-By: YIING99 <yiing99@users.noreply.github.com>

* feat: retrieval evaluation harness - P@k, R@k, MRR, nDCG@k + gbrain eval

  Full IR evaluation framework: precisionAtK, recallAtK, mrr, ndcgAtK metrics with runEval() orchestrator. gbrain eval CLI with single-run table and A/B comparison mode (--config-a / --config-b) for parameter tuning. HybridSearchOpts now accepts rrfK and dedupOpts overrides.

  Co-Authored-By: 4shut0sh <4shut0sh@users.noreply.github.com>

* test: search quality tests - RRF boost, dedup guarantee, cosine similarity, E2E benchmark

  42 new tests across 3 files:
  - test/search.test.ts: RRF normalization, compiled truth 2x boost, dedup key collision prevention, cosine similarity edge cases, CJK word count detection
  - test/dedup.test.ts: source-aware compiled truth guarantee, layer interactions, custom maxPerPage, empty/single result edge cases
  - test/e2e/search-quality.test.ts: full pipeline against PGLite with basis vector embeddings (chunk_id/chunk_index fields, detail parameter filtering, getEmbeddingsByChunkIds, keyword multi-chunk, vector ordering)

  Also: export rrfFusion + cosineSimilarity for unit testing, fix PGLite getEmbeddingsByChunkIds to parse string vectors from pgvector.

* test: search quality benchmark with A/B comparison (baseline vs PR#64)

  Benchmark measures P@1, MRR, nDCG@5, and source accuracy across 8 queries against 5 seeded pages. Key finding: boost helps entity lookups but over-corrects temporal queries. Validates the --detail parameter as the right control mechanism. Output at docs/benchmarks/2026-04-13.md.

* feat: query intent classifier - auto-selects detail level, 100% source accuracy

  Zero-latency heuristic classifier detects query intent from text patterns:
  - "Who is Pedro?" → entity → detail=low (compiled truth only)
  - "When did we last meet?" → temporal → detail=high (no boost, natural ranking)
  - "Variant fund announcement" → event → detail=high
  - General queries → detail=medium (default with boost)

  The key insight: skip the 2.0x compiled truth boost for detail=high queries. Temporal/event queries want natural ranking where timeline entries can win.

  Benchmark results (source accuracy = does the top chunk match expected type):
  - Baseline: 100% (already good, no boost needed)
  - Boost only: 71.4% (boost over-corrects temporal queries)
  - Boost + intent classifier: 100% (best of both worlds)

  35 unit tests for the classifier. 590 total tests pass.

* feat: query intent classifier - auto-selects detail level, 100% source accuracy

  Heuristic classifier detects query intent from text patterns (zero latency, no LLM call). Maps temporal queries ("when did we last meet") to detail=high, entity queries ("who is X") to detail=low, events to detail=high.

  Benchmark results (29 pages, 20 queries, graded relevance):
  - Baseline: P@1=0.947, MRR=0.974, source accuracy=89.5%
  - Boost only: P@1=0.895, MRR=0.939, source accuracy=63.2% (over-correction)
  - Boost + intent: P@1=0.947, MRR=0.974, source accuracy=89.5% (fully recovered)

  The intent classifier eliminates the boost's over-correction on temporal queries while preserving its benefits for entity lookups. 35 unit tests for the classifier.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* test: search quality benchmark with A/B comparison (baseline vs PR#64)

  Rich benchmark: 29 pages, 58 chunks, 20 queries with graded relevance. Now measures CHUNK-LEVEL quality, not just page-level retrieval.

  Key findings (C. Boost+Intent vs A. Baseline):
  - Unique pages in top-10: 7.2 → 8.7 (+21% broader coverage)
  - Compiled truth ratio: 51.6% → 66.8% (+15pp more signal)
  - CT-first rate: 100% (compiled truth leads for entity queries)
  - Timeline accessible: 100% (temporal queries still find dates)
  - Source accuracy: 89.5% maintained (intent classifier prevents regression)

  The boost alone (B) causes -26pp source accuracy regression. Intent classifier (C) recovers it fully.

* docs: clean benchmark report - ELI10 search quality analysis for PR#64

  Replaces two drafts with one clean report. Explains what changed, why it matters, and what the numbers mean. All fictional data, no private info.

  Key findings: 21% more page coverage per query, 29% more compiled truth in results. Intent classifier prevents boost from burying timeline for temporal queries. Full per-query breakdown with before/after comparison.

* chore: remove auto-generated benchmark file (clean version is 2026-04-14-search-quality.md)

* docs: update project documentation for search quality boost

  CLAUDE.md: added search/intent.ts, search/eval.ts, commands/eval.ts to key files. Added 5 new test files (search, dedup, intent, eval, e2e/search-quality). Updated test count from 23+4 to 28+5. Added docs/benchmarks/ to key files.

  README.md: updated search pipeline diagram with intent classifier, RRF normalization, compiled truth boost, cosine re-scoring, and 5-layer dedup. Added --detail flag explanation and benchmark instructions.

  CHANGELOG.md: added search quality entries to v0.9.3 (intent classifier, --detail flag, gbrain eval, CJK fix). Credited @4shut0sh and @YIING99.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: headline benchmark gains in changelog

* docs: add community attribution rule to CHANGELOG voice section

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: YIING99 <yiing99@users.noreply.github.com>
Co-authored-by: 4shut0sh <4shut0sh@users.noreply.github.com>
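The commit message above describes normalizing RRF scores to 0-1 and then applying a 2.0x compiled truth boost, which is skipped for detail=high queries. The following is a minimal sketch of that idea, not the repository's `rrfFusion`. The `Hit` shape, the `'compiled_truth'` / `'timeline'` source labels, the `applyBoost` flag, and the default `rrfK = 60` are assumptions made for illustration.

```typescript
// Hypothetical shapes for illustration; the real SearchResult is not shown in this file.
type Source = 'compiled_truth' | 'timeline';
interface Hit { chunkId: number; source: Source }
interface Scored extends Hit { score: number }

// The 2.0x factor comes from the commit message; everything else here is assumed.
const COMPILED_TRUTH_BOOST = 2.0;

function sketchRrfFusion(lists: Hit[][], rrfK = 60, applyBoost = true): Scored[] {
  const byId = new Map<number, Scored>();

  // Reciprocal rank fusion: each list contributes 1 / (k + rank), with 1-based ranks.
  for (const list of lists) {
    list.forEach((hit, index) => {
      const entry = byId.get(hit.chunkId) ?? { ...hit, score: 0 };
      entry.score += 1 / (rrfK + index + 1);
      byId.set(hit.chunkId, entry);
    });
  }

  const fused = [...byId.values()];
  const max = Math.max(0, ...fused.map((h) => h.score));

  for (const h of fused) {
    // Normalize to 0-1 first so the boost acts on comparable scores.
    if (max > 0) h.score /= max;
    // Skipping the boost models the detail=high path (natural ranking).
    if (applyBoost && h.source === 'compiled_truth') h.score *= COMPILED_TRUTH_BOOST;
  }

  return fused.sort((a, b) => b.score - a.score);
}

// Usage: a timeline chunk outranks a compiled truth chunk in both input lists.
const keywordHits: Hit[] = [
  { chunkId: 1, source: 'timeline' },
  { chunkId: 2, source: 'compiled_truth' },
];
const vectorHits: Hit[] = [
  { chunkId: 1, source: 'timeline' },
  { chunkId: 2, source: 'compiled_truth' },
];

console.log(sketchRrfFusion([keywordHits, vectorHits])[0].chunkId);            // 2 (boosted)
console.log(sketchRrfFusion([keywordHits, vectorHits], 60, false)[0].chunkId); // 1 (natural)
```

Normalizing before boosting keeps the multiplier meaningful regardless of how many lists are fused or which `rrfK` is chosen, which is presumably why the commit orders the two steps that way.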
245 lines
8.7 KiB
TypeScript
/**
 * Unit tests for src/core/search/eval.ts
 *
 * Pure function tests: no database, no API keys. Run with: bun test
 */

import { describe, test, expect } from 'bun:test';
import {
  precisionAtK,
  recallAtK,
  mrr,
  ndcgAtK,
  parseQrels,
} from '../src/core/search/eval.ts';

// ─────────────────────────────────────────────────────────────────
// precisionAtK
// ─────────────────────────────────────────────────────────────────

describe('precisionAtK', () => {
  test('all hits relevant → 1.0', () => {
    const relevant = new Set(['a', 'b', 'c']);
    expect(precisionAtK(['a', 'b', 'c'], relevant, 3)).toBe(1.0);
  });

  test('no hits relevant → 0.0', () => {
    const relevant = new Set(['x', 'y']);
    expect(precisionAtK(['a', 'b', 'c'], relevant, 3)).toBe(0.0);
  });

  test('partial: 2 of 5 hits relevant at k=5', () => {
    const relevant = new Set(['a', 'c']);
    expect(precisionAtK(['a', 'b', 'c', 'd', 'e'], relevant, 5)).toBeCloseTo(2 / 5);
  });

  test('k=1 with first hit relevant → 1.0', () => {
    const relevant = new Set(['a']);
    expect(precisionAtK(['a', 'b', 'c'], relevant, 1)).toBe(1.0);
  });

  test('k=1 with first hit not relevant → 0.0', () => {
    const relevant = new Set(['b']);
    expect(precisionAtK(['a', 'b', 'c'], relevant, 1)).toBe(0.0);
  });

  test('k greater than hits length → uses actual hits', () => {
    const relevant = new Set(['a', 'b']);
    // Both of the 2 hits are relevant, but the denominator stays k=10 → 2/10
    expect(precisionAtK(['a', 'b'], relevant, 10)).toBeCloseTo(2 / 10);
  });

  test('empty hits → 0', () => {
    expect(precisionAtK([], new Set(['a']), 5)).toBe(0);
  });

  test('empty relevant set → 0', () => {
    expect(precisionAtK(['a', 'b'], new Set(), 5)).toBe(0);
  });

  test('k=0 → 0', () => {
    expect(precisionAtK(['a', 'b'], new Set(['a']), 0)).toBe(0);
  });
});

// ─────────────────────────────────────────────────────────────────
// recallAtK
// ─────────────────────────────────────────────────────────────────

describe('recallAtK', () => {
  test('all relevant found → 1.0', () => {
    const relevant = new Set(['a', 'b']);
    expect(recallAtK(['a', 'b', 'c'], relevant, 3)).toBe(1.0);
  });

  test('none found → 0.0', () => {
    const relevant = new Set(['x', 'y', 'z']);
    expect(recallAtK(['a', 'b', 'c'], relevant, 3)).toBe(0.0);
  });

  test('1 of 3 relevant found', () => {
    const relevant = new Set(['a', 'x', 'y']);
    expect(recallAtK(['a', 'b', 'c'], relevant, 3)).toBeCloseTo(1 / 3);
  });

  test('relevant found beyond k → not counted', () => {
    const relevant = new Set(['a', 'b']);
    // 'b' is at rank 5, beyond k=3
    expect(recallAtK(['a', 'x', 'y', 'z', 'b'], relevant, 3)).toBeCloseTo(1 / 2);
  });

  test('empty hits → 0', () => {
    expect(recallAtK([], new Set(['a']), 5)).toBe(0);
  });

  test('empty relevant set → 0', () => {
    expect(recallAtK(['a', 'b'], new Set(), 5)).toBe(0);
  });
});

// ─────────────────────────────────────────────────────────────────
// mrr
// ─────────────────────────────────────────────────────────────────

describe('mrr', () => {
  test('first hit relevant → 1.0', () => {
    expect(mrr(['a', 'b', 'c'], new Set(['a']))).toBe(1.0);
  });

  test('second hit relevant → 0.5', () => {
    expect(mrr(['x', 'a', 'c'], new Set(['a']))).toBeCloseTo(0.5);
  });

  test('third hit relevant → 1/3', () => {
    expect(mrr(['x', 'y', 'a'], new Set(['a']))).toBeCloseTo(1 / 3);
  });

  test('no relevant hit → 0', () => {
    expect(mrr(['x', 'y', 'z'], new Set(['a']))).toBe(0);
  });

  test('empty hits → 0', () => {
    expect(mrr([], new Set(['a']))).toBe(0);
  });

  test('empty relevant → 0', () => {
    expect(mrr(['a', 'b'], new Set())).toBe(0);
  });

  test('uses first relevant hit when multiple are relevant', () => {
    // 'b' is at rank 2 and 'c' at rank 3; MRR should use 'b' at rank 2
    expect(mrr(['x', 'b', 'c'], new Set(['b', 'c']))).toBeCloseTo(0.5);
  });
});

// ─────────────────────────────────────────────────────────────────
// ndcgAtK
// ─────────────────────────────────────────────────────────────────

describe('ndcgAtK', () => {
  test('perfect ranking with binary relevance → 1.0', () => {
    const grades = new Map([['a', 1], ['b', 1]]);
    // Hits: 'a' at rank 1, 'b' at rank 2, same as the ideal ordering
    expect(ndcgAtK(['a', 'b', 'c'], grades, 5)).toBeCloseTo(1.0);
  });

  test('single relevant doc at rank 1 → 1.0', () => {
    const grades = new Map([['a', 1]]);
    expect(ndcgAtK(['a', 'x', 'y'], grades, 5)).toBeCloseTo(1.0);
  });

  test('single relevant doc at rank 2 → less than 1', () => {
    const grades = new Map([['a', 1]]);
    const score = ndcgAtK(['x', 'a', 'y'], grades, 5);
    expect(score).toBeGreaterThan(0);
    expect(score).toBeLessThan(1);
  });

  test('no relevant in hits → 0', () => {
    const grades = new Map([['a', 1], ['b', 1]]);
    expect(ndcgAtK(['x', 'y', 'z'], grades, 5)).toBe(0);
  });

  test('graded relevance: higher grade docs placed first → nDCG=1', () => {
    const grades = new Map([['a', 3], ['b', 2], ['c', 1]]);
    expect(ndcgAtK(['a', 'b', 'c'], grades, 3)).toBeCloseTo(1.0);
  });

  test('graded relevance: lower grade first → nDCG < 1', () => {
    const grades = new Map([['a', 3], ['b', 2], ['c', 1]]);
    // Reversed: worst first
    const score = ndcgAtK(['c', 'b', 'a'], grades, 3);
    expect(score).toBeGreaterThan(0);
    expect(score).toBeLessThan(1);
  });

  test('graded relevance: reversed is worse than perfect', () => {
    const grades = new Map([['a', 3], ['b', 2], ['c', 1]]);
    const perfect = ndcgAtK(['a', 'b', 'c'], grades, 3);
    const reversed = ndcgAtK(['c', 'b', 'a'], grades, 3);
    expect(perfect).toBeGreaterThan(reversed);
  });

  test('k=1 picks only the first hit', () => {
    const grades = new Map([['a', 1], ['b', 1]]);
    // Only 'x' at rank 1, not relevant
    expect(ndcgAtK(['x', 'a', 'b'], grades, 1)).toBe(0);
    // Only 'a' at rank 1, relevant
    expect(ndcgAtK(['a', 'x', 'b'], grades, 1)).toBeCloseTo(1.0);
  });

  test('empty hits → 0', () => {
    expect(ndcgAtK([], new Map([['a', 1]]), 5)).toBe(0);
  });

  test('empty grades → 0', () => {
    expect(ndcgAtK(['a', 'b'], new Map(), 5)).toBe(0);
  });

  test('k=0 → 0', () => {
    expect(ndcgAtK(['a', 'b'], new Map([['a', 1]]), 0)).toBe(0);
  });
});

// ─────────────────────────────────────────────────────────────────
// parseQrels
// ─────────────────────────────────────────────────────────────────

describe('parseQrels', () => {
  test('parses inline JSON array', () => {
    const input = JSON.stringify([
      { query: 'foo', relevant: ['a', 'b'] },
    ]);
    const result = parseQrels(input);
    expect(result).toHaveLength(1);
    expect(result[0].query).toBe('foo');
    expect(result[0].relevant).toEqual(['a', 'b']);
  });

  test('parses inline JSON object with queries array', () => {
    const input = JSON.stringify({
      version: 1,
      queries: [{ query: 'bar', relevant: ['x'] }],
    });
    const result = parseQrels(input);
    expect(result).toHaveLength(1);
    expect(result[0].query).toBe('bar');
  });

  test('preserves grades when present', () => {
    const input = JSON.stringify([
      { query: 'baz', relevant: ['a'], grades: { a: 3, b: 1 } },
    ]);
    const result = parseQrels(input);
    expect(result[0].grades).toEqual({ a: 3, b: 1 });
  });

  test('throws on invalid JSON', () => {
    expect(() => parseQrels('not-json')).toThrow();
  });

  test('throws on unrecognized format', () => {
    expect(() => parseQrels(JSON.stringify({ foo: 'bar' }))).toThrow();
  });
});
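
The assertions in this file pin down how the four metrics behave at their edges, for example that precision divides by k even when fewer than k hits exist. The sketch below is one set of definitions consistent with those assertions; it is not the code in src/core/search/eval.ts, which is not shown here. The `sketch*` names are hypothetical, hit IDs are assumed to be unique, and nDCG is assumed to use linear gain with a log2 rank discount (an exponential-gain variant would also satisfy the tests).

```typescript
// Fraction of the top-k slots filled by relevant hits; the denominator is always k.
function sketchPrecisionAtK(hits: string[], relevant: Set<string>, k: number): number {
  if (k <= 0) return 0;
  const found = hits.slice(0, k).filter((id) => relevant.has(id)).length;
  return found / k;
}

// Fraction of all relevant IDs that appear within the top k hits.
function sketchRecallAtK(hits: string[], relevant: Set<string>, k: number): number {
  if (k <= 0 || relevant.size === 0) return 0;
  const found = hits.slice(0, k).filter((id) => relevant.has(id)).length;
  return found / relevant.size;
}

// Reciprocal rank of the first relevant hit (for one query), or 0 if none is found.
function sketchMrr(hits: string[], relevant: Set<string>): number {
  const index = hits.findIndex((id) => relevant.has(id));
  return index === -1 ? 0 : 1 / (index + 1);
}

// Discounted cumulative gain: gain at 1-based rank r is divided by log2(r + 1).
function sketchDcg(gains: number[]): number {
  return gains.reduce((sum, gain, i) => sum + gain / Math.log2(i + 2), 0);
}

// DCG of the actual ranking divided by the DCG of the best possible ranking.
function sketchNdcgAtK(hits: string[], grades: Map<string, number>, k: number): number {
  if (k <= 0) return 0;
  const actual = sketchDcg(hits.slice(0, k).map((id) => grades.get(id) ?? 0));
  const idealGains = [...grades.values()].sort((a, b) => b - a).slice(0, k);
  const ideal = sketchDcg(idealGains);
  return ideal === 0 ? 0 : actual / ideal;
}

// Usage, mirroring two of the cases tested above.
console.log(sketchPrecisionAtK(['a', 'b'], new Set(['a', 'b']), 10)); // 0.2
console.log(sketchMrr(['x', 'b', 'c'], new Set(['b', 'c'])));         // 0.5
```

Guarding on `ideal === 0` covers the empty-grades case without a separate branch, and checking `k <= 0` up front avoids a division by zero in the precision calculation.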