gbrain/test/intent.test.ts
Garry Tan d547a64600 feat: search quality boost — compiled truth ranking + detail parameter (v0.8.1) (#64)
* feat: search quality boost — compiled truth ranking, detail parameter, cosine re-scoring

Compiled truth chunks now rank 2x higher in hybrid search via RRF
normalization + source boost. New --detail flag (low/medium/high)
controls timeline inclusion. Cosine re-scoring blends query-chunk
similarity before dedup for query-specific ranking.
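
A rough sketch of what the cosine re-scoring blend could look like. The commit exports a `cosineSimilarity` helper, but its signature here is assumed, and `blendScore` with its `weight` parameter is purely hypothetical (the actual blend formula is not stated in this message).

```typescript
// Cosine similarity between two equal-length vectors.
// Returns 0 for mismatched lengths or zero-magnitude vectors (an assumed convention).
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) return 0;
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical linear blend of the fused rank score with query-chunk similarity.
// `weight` is an illustrative parameter, not a documented gbrain setting.
function blendScore(rankScore: number, similarity: number, weight: number): number {
  return (1 - weight) * rankScore + weight * similarity;
}
```

Running this before dedup means the per-page cap is applied to scores that already reflect how close each chunk is to the query.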

Also: remove DISTINCT ON from keyword search (dedup handles per-page
capping), add chunk_id + chunk_index to SearchResult, add
getEmbeddingsByChunkIds to BrainEngine interface.

Inspired by Ramp Labs' "Latent Briefing" paper (April 2026).

* feat: RRF normalization, source-aware dedup, detail param in operations

RRF scores normalized to 0-1 before 2.0x compiled truth boost.
Source-aware dedup guarantees compiled truth chunk per page.
Detail parameter added to query operation, dedupResults added to
bare search operation. Debug logging via GBRAIN_SEARCH_DEBUG=1.
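
A minimal sketch of the normalize-then-boost step, assuming standard reciprocal rank fusion with k=60 and normalization by dividing by the top score. The `Hit` and `Scored` types, the `source` field values, and both function names are hypothetical, not gbrain's actual API.

```typescript
// Hypothetical types and names for illustration only.
type Source = 'compiled_truth' | 'timeline';
interface Hit { chunkId: number; source: Source }
interface Scored { hit: Hit; score: number }

// Standard RRF: every ranked list contributes 1 / (k + rank) per hit (rank is 1-based).
function fuse(lists: Hit[][], k = 60): Scored[] {
  const fused = new Map<number, Scored>();
  for (const list of lists) {
    list.forEach((hit, index) => {
      const entry = fused.get(hit.chunkId) ?? { hit, score: 0 };
      entry.score += 1 / (k + index + 1);
      fused.set(hit.chunkId, entry);
    });
  }
  return Array.from(fused.values());
}

// Scale scores into 0-1 (assumed: divide by the max), then multiply
// compiled truth chunks by the boost and sort descending.
function normalizeAndBoost(scored: Scored[], boost = 2.0): Scored[] {
  const max = scored.reduce((m, s) => Math.max(m, s.score), 0);
  if (max === 0) return [];
  return scored
    .map(s => ({
      hit: s.hit,
      score: (s.score / max) * (s.hit.source === 'compiled_truth' ? boost : 1),
    }))
    .sort((a, b) => b.score - a.score);
}
```

Normalizing first keeps the boost meaningful: raw RRF scores are tiny (on the order of 1/k), so a fixed multiplier is easier to reason about on a 0-1 scale.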

* chore: bump version and changelog (v0.8.1)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: CJK word count in query expansion

CJK text is not space-delimited. A query like "向量搜索优化" was counted
as 1 word and silently skipped expansion. Now counts characters for CJK
queries instead of space-separated tokens.
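
A sketch of the counting rule, assuming a regex over common CJK Unicode blocks. The helper name `countQueryWords` and the exact character ranges are illustrative assumptions, not the code in this commit.

```typescript
// Assumed ranges: Hiragana/Katakana, CJK Extension A, CJK Unified Ideographs, Hangul syllables.
const CJK_PATTERN = /[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af]/;

// Hypothetical helper: count characters for CJK queries, space-separated tokens otherwise.
function countQueryWords(query: string): number {
  const trimmed = query.trim();
  if (trimmed === '') return 0;
  if (CJK_PATTERN.test(trimmed)) {
    // CJK text is not space-delimited, so each non-whitespace character counts.
    return Array.from(trimmed.replace(/\s+/g, '')).length;
  }
  return trimmed.split(/\s+/).length;
}
```

With this rule, "向量搜索优化" counts as 6 rather than 1, so it is no longer treated as a one-word query.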

Co-Authored-By: YIING99 <yiing99@users.noreply.github.com>

* feat: retrieval evaluation harness — P@k, R@k, MRR, nDCG@k + gbrain eval

Full IR evaluation framework: precisionAtK, recallAtK, mrr, ndcgAtK
metrics with runEval() orchestrator. gbrain eval CLI with single-run
table and A/B comparison mode (--config-a / --config-b) for parameter
tuning. HybridSearchOpts now accepts rrfK and dedupOpts overrides.
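
For reference, a sketch of the four metrics using their textbook definitions. The metric names come from this commit, but the signatures below are assumptions, `reciprocalRank` is a helper introduced for illustration, and the exponential-gain form of nDCG is one common variant that may differ from the harness.

```typescript
// Fraction of the top-k results that are relevant.
function precisionAtK(ranked: string[], relevant: Set<string>, k: number): number {
  if (k <= 0) return 0;
  return ranked.slice(0, k).filter(id => relevant.has(id)).length / k;
}

// Fraction of all relevant items that appear in the top-k.
function recallAtK(ranked: string[], relevant: Set<string>, k: number): number {
  if (relevant.size === 0) return 0;
  return ranked.slice(0, k).filter(id => relevant.has(id)).length / relevant.size;
}

// 1 / rank of the first relevant result, or 0 if none is found.
function reciprocalRank(ranked: string[], relevant: Set<string>): number {
  const index = ranked.findIndex(id => relevant.has(id));
  return index === -1 ? 0 : 1 / (index + 1);
}

// Mean reciprocal rank across a set of queries.
function mrr(runs: { ranked: string[]; relevant: Set<string> }[]): number {
  if (runs.length === 0) return 0;
  return runs.reduce((sum, r) => sum + reciprocalRank(r.ranked, r.relevant), 0) / runs.length;
}

// nDCG@k with graded relevance: gain (2^grade - 1), discount log2(position + 1).
function ndcgAtK(ranked: string[], grades: Map<string, number>, k: number): number {
  const dcg = (gains: number[]) =>
    gains.reduce((sum, g, i) => sum + (Math.pow(2, g) - 1) / Math.log2(i + 2), 0);
  const actual = ranked.slice(0, k).map(id => grades.get(id) ?? 0);
  const ideal = Array.from(grades.values()).sort((a, b) => b - a).slice(0, k);
  const idealDcg = dcg(ideal);
  return idealDcg === 0 ? 0 : dcg(actual) / idealDcg;
}
```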

Co-Authored-By: 4shut0sh <4shut0sh@users.noreply.github.com>

* test: search quality tests — RRF boost, dedup guarantee, cosine similarity, E2E benchmark

42 new tests across 3 files:
- test/search.test.ts: RRF normalization, compiled truth 2x boost, dedup key
  collision prevention, cosine similarity edge cases, CJK word count detection
- test/dedup.test.ts: source-aware compiled truth guarantee, layer interactions,
  custom maxPerPage, empty/single result edge cases
- test/e2e/search-quality.test.ts: full pipeline against PGLite with basis vector
  embeddings — chunk_id/chunk_index fields, detail parameter filtering,
  getEmbeddingsByChunkIds, keyword multi-chunk, vector ordering

Also: export rrfFusion + cosineSimilarity for unit testing, fix PGLite
getEmbeddingsByChunkIds to parse string vectors from pgvector.

* test: search quality benchmark with A/B comparison (baseline vs PR#64)

Benchmark measures P@1, MRR, nDCG@5, and source accuracy across 8 queries
against 5 seeded pages. Key finding: boost helps entity lookups but
over-corrects temporal queries. Validates the --detail parameter as the
right control mechanism. Output at docs/benchmarks/2026-04-13.md.

* feat: query intent classifier — auto-selects detail level, 100% source accuracy

Zero-latency heuristic classifier detects query intent from text patterns:
- "Who is Pedro?" → entity → detail=low (compiled truth only)
- "When did we last meet?" → temporal → detail=high (no boost, natural ranking)
- "Variant fund announcement" → event → detail=high
- General queries → detail=medium (default with boost)

The key insight: skip the 2.0x compiled truth boost for detail=high queries.
Temporal/event queries want natural ranking where timeline entries can win.
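
A simplified sketch of the mapping above. The regex patterns, their priority order, and the names `classifyIntentSketch`, `detailForIntent`, and `boostForDetail` are illustrative assumptions; the real classifier in src/core/search/intent.ts covers many more patterns.

```typescript
type QueryIntent = 'entity' | 'temporal' | 'event' | 'general';
type Detail = 'low' | 'medium' | 'high';

// Assumed patterns, checked in priority order: temporal, then event, then entity.
const TEMPORAL = /\b(when|last|recent|latest|history|timeline)\b/i;
const EVENT = /\b(announcement|announced|launched|raised|acquisition|ipo)\b/i;
const ENTITY = /^(who is|what is|what does|tell me about)\b/i;

// Pure text matching: no LLM call, so classification adds no meaningful latency.
function classifyIntentSketch(query: string): QueryIntent {
  if (TEMPORAL.test(query)) return 'temporal';
  if (EVENT.test(query)) return 'event';
  if (ENTITY.test(query)) return 'entity';
  return 'general';
}

// Intent to detail level, following the list in this commit message.
function detailForIntent(intent: QueryIntent): Detail {
  if (intent === 'entity') return 'low';
  if (intent === 'temporal' || intent === 'event') return 'high';
  return 'medium';
}

// detail=high skips the 2.0x compiled truth boost so timeline entries can win.
function boostForDetail(detail: Detail): number {
  return detail === 'high' ? 1.0 : 2.0;
}
```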

Benchmark results (source accuracy = does the top chunk match expected type):
- Baseline: 100% (already good, no boost needed)
- Boost only: 71.4% (boost over-corrects temporal queries)
- Boost + intent classifier: 100% (best of both worlds)

35 unit tests for the classifier. 590 total tests pass.

* feat: query intent classifier — auto-selects detail level, 100% source accuracy

Heuristic classifier detects query intent from text patterns (zero latency,
no LLM call). Maps temporal queries ("when did we last meet") to detail=high,
entity queries ("who is X") to detail=low, events to detail=high.

Benchmark results (29 pages, 20 queries, graded relevance):
- Baseline: P@1=0.947, MRR=0.974, source accuracy=89.5%
- Boost only: P@1=0.895, MRR=0.939, source accuracy=63.2% (over-correction)
- Boost + intent: P@1=0.947, MRR=0.974, source accuracy=89.5% (fully recovered)

The intent classifier eliminates the boost's over-correction on temporal queries
while preserving its benefits for entity lookups. 35 unit tests for the classifier.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* test: search quality benchmark with A/B comparison (baseline vs PR#64)

Rich benchmark: 29 pages, 58 chunks, 20 queries with graded relevance.
Now measures CHUNK-LEVEL quality, not just page-level retrieval.

Key findings (C. Boost+Intent vs A. Baseline):
- Unique pages in top-10: 7.2 → 8.7 (+21% broader coverage)
- Compiled truth ratio: 51.6% → 66.8% (+15pp more signal)
- CT-first rate: 100% (compiled truth leads for entity queries)
- Timeline accessible: 100% (temporal queries still find dates)
- Source accuracy: 89.5% maintained (intent classifier prevents regression)

The boost alone (B) causes -26pp source accuracy regression.
Intent classifier (C) recovers it fully.

* docs: clean benchmark report — ELI10 search quality analysis for PR#64

Replaces two drafts with one clean report. Explains what changed, why it
matters, and what the numbers mean. All fictional data, no private info.

Key findings: 21% more page coverage per query, 29% more compiled truth
in results. Intent classifier prevents boost from burying timeline for
temporal queries. Full per-query breakdown with before/after comparison.

* chore: remove auto-generated benchmark file (clean version is 2026-04-14-search-quality.md)

* docs: update project documentation for search quality boost

CLAUDE.md: added search/intent.ts, search/eval.ts, commands/eval.ts to key
files. Added 5 new test files (search, dedup, intent, eval, e2e/search-quality).
Updated test count from 23+4 to 28+5. Added docs/benchmarks/ to key files.

README.md: updated search pipeline diagram with intent classifier, RRF
normalization, compiled truth boost, cosine re-scoring, and 5-layer dedup.
Added --detail flag explanation and benchmark instructions.

CHANGELOG.md: added search quality entries to v0.9.3 (intent classifier,
--detail flag, gbrain eval, CJK fix). Credited @4shut0sh and @YIING99.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: headline benchmark gains in changelog

* docs: add community attribution rule to CHANGELOG voice section

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: YIING99 <yiing99@users.noreply.github.com>
Co-authored-by: 4shut0sh <4shut0sh@users.noreply.github.com>
2026-04-13 21:03:40 -10:00


/**
 * Query Intent Classifier tests
 */
import { describe, test, expect } from 'bun:test';
import { classifyQueryIntent, autoDetectDetail } from '../src/core/search/intent.ts';

describe('classifyQueryIntent', () => {
  describe('entity queries', () => {
    test('"Who is Pedro?" → entity', () => {
      expect(classifyQueryIntent('Who is Pedro?')).toBe('entity');
    });
    test('"What does Variant do?" → entity', () => {
      expect(classifyQueryIntent('What does Variant do?')).toBe('entity');
    });
    test('"Tell me about Brex" → entity', () => {
      expect(classifyQueryIntent('Tell me about Brex')).toBe('entity');
    });
    test('"What is the ownership economy?" → entity', () => {
      expect(classifyQueryIntent('What is the ownership economy?')).toBe('entity');
    });
    test('"Summarize Pedro" → entity', () => {
      expect(classifyQueryIntent('Summarize Pedro')).toBe('entity');
    });
    test('"Background on Variant Fund" → entity', () => {
      expect(classifyQueryIntent('Background on Variant Fund')).toBe('entity');
    });
    test('"What do we know about Brex?" → entity', () => {
      expect(classifyQueryIntent('What do we know about Brex?')).toBe('entity');
    });
  });

  describe('temporal queries', () => {
    test('"When did we last meet Pedro?" → temporal', () => {
      expect(classifyQueryIntent('When did we last meet Pedro?')).toBe('temporal');
    });
    test('"Recent updates on Variant" → temporal', () => {
      expect(classifyQueryIntent('Recent updates on Variant')).toBe('temporal');
    });
    test('"Meeting notes about Pedro" → temporal', () => {
      expect(classifyQueryIntent('Meeting notes about Pedro')).toBe('temporal');
    });
    test('"What\'s new with Brex?" → temporal', () => {
      expect(classifyQueryIntent("What's new with Brex?")).toBe('temporal');
    });
    test('"Last conversation with Jesse" → temporal', () => {
      expect(classifyQueryIntent('Last conversation with Jesse')).toBe('temporal');
    });
    test('"Timeline of Variant" → temporal', () => {
      expect(classifyQueryIntent('Timeline of Variant')).toBe('temporal');
    });
    test('"History with Pedro" → temporal', () => {
      expect(classifyQueryIntent('History with Pedro')).toBe('temporal');
    });
    test('"Updates from last month" → temporal', () => {
      expect(classifyQueryIntent('Updates from last month')).toBe('temporal');
    });
    test('"Latest on Brex" → temporal', () => {
      expect(classifyQueryIntent('Latest on Brex')).toBe('temporal');
    });
    test('"How long ago did we meet Jesse?" → temporal', () => {
      expect(classifyQueryIntent('How long ago did we meet Jesse?')).toBe('temporal');
    });
    test('"2024-03 Pedro" → temporal (date pattern)', () => {
      expect(classifyQueryIntent('2024-03 Pedro')).toBe('temporal');
    });
  });

  describe('event queries', () => {
    test('"Variant fund announcement" → event', () => {
      expect(classifyQueryIntent('Variant fund announcement')).toBe('event');
    });
    test('"Brex launched new product" → event', () => {
      expect(classifyQueryIntent('Brex launched new product')).toBe('event');
    });
    test('"Series B raised $50M" → event', () => {
      expect(classifyQueryIntent('Series B raised $50M')).toBe('event');
    });
    test('"Brex IPO" → event', () => {
      expect(classifyQueryIntent('Brex IPO')).toBe('event');
    });
    test('"What happened with the acquisition" → event', () => {
      expect(classifyQueryIntent('What happened with the acquisition')).toBe('event');
    });
  });

  describe('full context queries → temporal', () => {
    test('"Give me everything on Pedro" → temporal', () => {
      expect(classifyQueryIntent('Give me everything on Pedro')).toBe('temporal');
    });
    test('"Full history with Variant" → temporal', () => {
      expect(classifyQueryIntent('Full history with Variant')).toBe('temporal');
    });
    test('"All information about Brex" → temporal', () => {
      expect(classifyQueryIntent('All information about Brex')).toBe('temporal');
    });
    test('"Deep dive on AI philosophy" → temporal', () => {
      expect(classifyQueryIntent('Deep dive on AI philosophy')).toBe('temporal');
    });
  });

  describe('general queries', () => {
    test('"AI changes who gets to build" → general', () => {
      expect(classifyQueryIntent('AI changes who gets to build')).toBe('general');
    });
    test('"fintech payments infrastructure" → general', () => {
      expect(classifyQueryIntent('fintech payments infrastructure')).toBe('general');
    });
    test('"Pedro Brex" → general (bare entity name)', () => {
      expect(classifyQueryIntent('Pedro Brex')).toBe('general');
    });
    test('"crypto web3 ownership" → general', () => {
      expect(classifyQueryIntent('crypto web3 ownership')).toBe('general');
    });
  });
});

describe('autoDetectDetail', () => {
  test('entity queries → low', () => {
    expect(autoDetectDetail('Who is Pedro?')).toBe('low');
    expect(autoDetectDetail('What does Variant do?')).toBe('low');
  });
  test('temporal queries → high', () => {
    expect(autoDetectDetail('When did we last meet Pedro?')).toBe('high');
    expect(autoDetectDetail('Recent updates on Variant')).toBe('high');
  });
  test('event queries → high', () => {
    expect(autoDetectDetail('Variant fund announcement')).toBe('high');
  });
  test('general queries → undefined (default)', () => {
    expect(autoDetectDetail('AI changes who gets to build')).toBeUndefined();
    expect(autoDetectDetail('fintech payments')).toBeUndefined();
  });
});