Evaluation metrics

Each field on EvalMetrics is aggregated (usually averaged) over evaluation queries. Composite scores are produced by ScoreCalculator from weighted metrics for a given use case.
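As a rough illustration of what a weighted composite score looks like, here is a minimal sketch. The function name, the renormalization behavior, and the example weights are assumptions for illustration only; they are not the actual `ScoreCalculator` implementation (which may also invert lower-is-better metrics such as `duplication_ratio`).

```python
def composite_score(metrics: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of the metrics a use case cares about.

    Hypothetical sketch: metrics missing from `weights` are ignored,
    and weights are renormalized over the metrics actually present.
    """
    present = {name: w for name, w in weights.items() if name in metrics}
    total = sum(present.values())
    if total == 0:
        return 0.0
    return sum(metrics[name] * w for name, w in present.items()) / total

# Illustrative numbers only.
metrics = {"token_recall": 0.82, "mrr": 0.64, "duplication_ratio": 0.05}
weights = {"token_recall": 0.5, "mrr": 0.3, "duplication_ratio": 0.2}
score = composite_score(metrics, weights)
```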

Absolute numbers vs comparisons

Metrics are always relative to the embedding profile and corpus. Use them to compare strategies in the same run, not as standalone quality guarantees.
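In practice, "compare within the same run" means ranking strategies against each other rather than against a fixed threshold. A small sketch, with hypothetical strategy names and made-up metric values:

```python
# Hypothetical results from two chunking strategies evaluated in one run,
# against the same corpus and embedding profile.
run_results = {
    "fixed_512": {"token_recall": 0.71, "mrr": 0.58},
    "semantic": {"token_recall": 0.79, "mrr": 0.63},
}

# Rank strategies relative to each other; the absolute values mean
# little outside this corpus and embedding profile.
best = max(run_results, key=lambda name: run_results[name]["token_recall"])
```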

Definitions

Full definitions, ranges, and intuition for every metric field live in the metrics glossary (the same content that previously lived on this page).

Use-case weights (defaults)

Default metric weights are returned by chunktuner.config.score_profile_weights(use_case) for:

| use_case | Notes |
| --- | --- |
| rag_qa | Weights token_recall, token_iou, mrr, optional faithfulness, avg_tokens_per_query, duplication_ratio. |
| search | Emphasizes recall_at_k[1] (exposed to the scorer as recall_at_1), mrr, avg_tokens_per_query, duplication_ratio. |
| summarization | Weights token_recall, avg_chunk_length, duplication_ratio. |
| code_assist | Weights token_recall, mrr, chunk_length_std, duplication_ratio. |

Inspect src/chunktuner/config.py for the exact floating-point weights.
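A sketch of the shape these profiles take, assuming score_profile_weights returns a mapping of metric name to weight. The numeric values below are placeholders chosen to sum to 1.0, not the real weights from src/chunktuner/config.py, and only two of the four use cases are shown:

```python
# Hypothetical mirror of chunktuner.config.score_profile_weights(use_case).
# Values are illustrative placeholders; see src/chunktuner/config.py for
# the exact floating-point weights. The optional faithfulness weight for
# rag_qa is omitted here.
EXAMPLE_WEIGHTS: dict[str, dict[str, float]] = {
    "rag_qa": {
        "token_recall": 0.35,
        "token_iou": 0.20,
        "mrr": 0.25,
        "avg_tokens_per_query": 0.10,
        "duplication_ratio": 0.10,
    },
    "search": {
        # recall_at_k[1] is exposed to the scorer under this flat key.
        "recall_at_1": 0.40,
        "mrr": 0.35,
        "avg_tokens_per_query": 0.15,
        "duplication_ratio": 0.10,
    },
}

def score_profile_weights(use_case: str) -> dict[str, float]:
    """Placeholder stand-in for the real config lookup."""
    return EXAMPLE_WEIGHTS[use_case]
```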

Further reading