# Evaluation metrics
Each field on `EvalMetrics` is aggregated (usually averaged) over the evaluation queries. Composite scores are produced by `ScoreCalculator` from weighted metrics for a given use case.
**Absolute numbers vs. comparisons.** Metrics are always relative to the embedding profile and corpus. Use them to compare strategies within the same run, not as standalone quality guarantees.
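As an illustration, composite scoring of the kind described above reduces to a weighted average over whichever metrics carry weight for the use case. The sketch below is an assumption about that general pattern, not `ScoreCalculator`'s actual implementation; only the metric names come from this page.

```python
# Minimal sketch of composite scoring, assuming positive weights and
# metrics normalized so that higher is better. Illustrative only --
# ScoreCalculator's real logic lives in the chunktuner source.

def composite_score(metrics: dict[str, float], weights: dict[str, float]) -> float:
    # Skip metrics that were not computed (e.g. optional faithfulness)
    # and renormalize the remaining weights so scores stay comparable.
    present = {name: w for name, w in weights.items() if name in metrics}
    total = sum(present.values())
    if total == 0.0:
        return 0.0
    return sum(metrics[name] * w for name, w in present.items()) / total


# Hypothetical values: faithfulness is absent, so its weight is dropped
# and the other weights are renormalized before averaging.
score = composite_score(
    {"token_recall": 0.82, "mrr": 0.61, "token_iou": 0.47},
    {"token_recall": 0.5, "token_iou": 0.2, "mrr": 0.2, "faithfulness": 0.1},
)
```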
## Definitions
Full definitions, value ranges, and intuition for every metric field live in the metrics glossary, which now holds the content that previously lived on this page.
## Use-case weights (defaults)
Default metric weights are returned by `chunktuner.config.score_profile_weights(use_case)` for:
| `use_case` | Notes |
|---|---|
| `rag_qa` | Weights `token_recall`, `token_iou`, `mrr`, optional `faithfulness`, `avg_tokens_per_query`, and `duplication_ratio`. |
| `search` | Emphasizes `recall_at_k[1]` (exposed to the scorer as `recall_at_1`), `mrr`, `avg_tokens_per_query`, and `duplication_ratio`. |
| `summarization` | `token_recall`, `avg_chunk_length`, `duplication_ratio`. |
| `code_assist` | `token_recall`, `mrr`, `chunk_length_std`, `duplication_ratio`. |
Inspect `src/chunktuner/config.py` for the exact floating-point weights.
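The weights can also be pulled programmatically. `score_profile_weights` is the entry point named above; the assumption that it returns a plain metric-name-to-weight mapping is ours, so verify the shape against the source.

```python
from chunktuner.config import score_profile_weights

# Assumed return shape: a mapping of metric name -> float weight.
# Check src/chunktuner/config.py for the actual structure and values.
weights = score_profile_weights("rag_qa")
for metric, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{metric:>22}: {weight:.2f}")

# For the search profile, recall_at_k[1] is exposed under the flat key
# "recall_at_1", so it can be weighted like any other scalar metric.
search_weights = score_profile_weights("search")
```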
## Further reading
- Strategy guide — how strategies feed these metrics
- Python API — Evaluator