# Evaluation metrics
Each field on `EvalMetrics` is aggregated (usually averaged) over the evaluation queries. Composite scores are produced by `ScoreCalculator` from weighted metrics for a given use case.
**Absolute numbers vs. comparisons.** Metrics are always relative to the embedding profile and corpus. Use them to compare strategies within the same run, not as standalone quality guarantees.
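As an illustration, composite scoring of the kind described above reduces to a weighted average over whichever metrics carry weight for the use case. The sketch below is an assumption about that general pattern, not `ScoreCalculator`'s actual implementation; only the metric names come from this page.

```python
# Minimal sketch of composite scoring, assuming positive weights and
# metrics normalized so that higher is better. Illustrative only --
# ScoreCalculator's real logic lives in the chunktuner source.

def composite_score(metrics: dict[str, float], weights: dict[str, float]) -> float:
    # Skip metrics that were not computed (e.g. optional faithfulness)
    # and renormalize the remaining weights so scores stay comparable.
    present = {name: w for name, w in weights.items() if name in metrics}
    total = sum(present.values())
    if total == 0.0:
        return 0.0
    return sum(metrics[name] * w for name, w in present.items()) / total


# Hypothetical values: faithfulness is absent, so its weight is dropped
# and the other weights are renormalized before averaging.
score = composite_score(
    {"token_recall": 0.82, "mrr": 0.61, "token_iou": 0.47},
    {"token_recall": 0.5, "token_iou": 0.2, "mrr": 0.2, "faithfulness": 0.1},
)
```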
## Definitions
Full definitions, value ranges, and intuition for every metric field live in the metrics glossary, which now holds the content that previously lived on this page.
## Use-case weights (defaults)
Default metric weights are returned by `chunktuner.config.score_profile_weights(use_case)` for:
| `use_case` | Notes |
|---|---|
| `rag_qa` | Weights `token_recall`, `token_iou`, `mrr`, optional `faithfulness`, `avg_tokens_per_query`, and `duplication_ratio`. |
| `search` | Emphasizes `recall_at_k[1]` (exposed to the scorer as `recall_at_1`), `mrr`, `avg_tokens_per_query`, and `duplication_ratio`. |
| `summarization` | `token_recall`, `avg_chunk_length`, `duplication_ratio`. |
| `code_assist` | `token_recall`, `mrr`, `chunk_length_std`, `duplication_ratio`. |
Inspect `src/chunktuner/config.py` for the exact floating-point weights.
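The weights can also be pulled programmatically. `score_profile_weights` is the entry point named above; the assumption that it returns a plain metric-name-to-weight mapping is ours, so verify the shape against the source.

```python
from chunktuner.config import score_profile_weights

# Assumed return shape: a mapping of metric name -> float weight.
# Check src/chunktuner/config.py for the actual structure and values.
weights = score_profile_weights("rag_qa")
for metric, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{metric:>22}: {weight:.2f}")

# For the search profile, recall_at_k[1] is exposed under the flat key
# "recall_at_1", so it can be weighted like any other scalar metric.
search_weights = score_profile_weights("search")
```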
## Further reading
- Strategy guide — how strategies feed these metrics
- Python API — Evaluator