# Using chunktuner alongside LangChain
chunktuner benchmarks chunking strategies; LangChain handles retrieval, chains, and agents. They compose: run chunktuner to pick (strategy, params), then configure LangChain splitters to match.
## Typical workflow
- Run `chunk-tune recommend ./my_docs --use-case rag_qa` (add `--output-format yaml` if you want machine-readable output).
- Parse the `best` entry from the printed YAML/JSON: it matches `Recommendation.model_dump()`; see `chunktuner.models.Recommendation` and `EvalResult` (`best.config` is a `ChunkConfig` with fields `name` and `params`).
- Map the winning `recursive_character` params to `RecursiveCharacterTextSplitter` (character-based `chunk_size`/`chunk_overlap`).
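For orientation, the machine-readable output is shaped roughly like this. This is an illustrative sketch only: the `best`, `config`, `name`, and `params` keys follow `Recommendation.model_dump()` as described above, but the real output may carry additional fields (scores, alternatives) not shown here.

```yaml
# Illustrative shape only; real output may include more fields.
best:
  config:
    name: recursive_character
    params:
      chunk_size_chars: 1600
      chunk_overlap_chars: 100
```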
## Example: recursive_character → LangChain
`recursive_character` uses `chunk_size_chars` and `chunk_overlap_chars` in `ChunkConfig.params` (see `src/chunktuner/chunking/recursive_character.py`).
import yaml
from pathlib import Path
# LangChain ≥0.2: `langchain-text-splitters`. Older stacks: `langchain.text_splitter`.
from langchain_text_splitters import RecursiveCharacterTextSplitter
# Save CLI output first, e.g.:
# chunk-tune recommend ./my_docs --use-case rag_qa --output-format yaml > recommend.yaml
data = yaml.safe_load(Path("recommend.yaml").read_text())
cfg = data["best"]["config"]
if cfg["name"] != "recursive_character":
    raise SystemExit(f"Expected recursive_character, got {cfg['name']!r}")
p = cfg["params"]
splitter = RecursiveCharacterTextSplitter(
    chunk_size=int(p.get("chunk_size_chars", 1600)),
    chunk_overlap=int(p.get("chunk_overlap_chars", 0)),
    separators=list(p.get("separators", ["\n\n", "\n", ". ", " ", ""])),
)
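If the character-based `chunk_size`/`chunk_overlap` semantics are unfamiliar, the toy splitter below makes them concrete. This is not LangChain's actual algorithm (which recursively splits on the separator list and then merges pieces back up toward `chunk_size`); it is just a naive sliding window over characters.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Naive character window: each chunk starts chunk_size - chunk_overlap
    characters after the previous one, so consecutive chunks share
    chunk_overlap characters. Toy model only; LangChain's splitter also
    prefers to break on separators such as paragraph and line boundaries."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

The overlap trades index size for context continuity: with `chunk_overlap=2`, every boundary character appears in two chunks, so a query matching text near a cut still retrieves a chunk containing its surroundings.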
Install LangChain’s text splitter package in your app (`langchain-text-splitters`); chunktuner does not declare it as a dependency.
## Using chunktuner chunks directly
from pathlib import Path
from chunktuner import FileIngestor, default_registry
from chunktuner.models import ChunkConfig
docs = FileIngestor().ingest_dir(Path("./my_docs"))
strategy = default_registry.get("recursive_character")
config = ChunkConfig(
    name="recursive_character",
    params={"chunk_size_chars": 1600, "chunk_overlap_chars": 100},
)
chunks = strategy.chunk(docs[0], config)
# Feed chunk.text into your LangChain document / vector store pipeline
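To hand those chunks to LangChain, you ultimately need `page_content`/`metadata` pairs, the shape `langchain_core`'s `Document(page_content=..., metadata=...)` constructor takes. A minimal, dependency-free sketch of that mapping; note that `ChunkStub` and its `meta` field are stand-ins for illustration, and only the `.text` attribute is attested above:

```python
from dataclasses import dataclass, field

@dataclass
class ChunkStub:
    """Stand-in for a chunktuner chunk; the real object exposes .text
    (per the comment above). The meta field here is illustrative."""
    text: str
    meta: dict = field(default_factory=dict)

def to_document_inputs(chunks, source: str) -> list[tuple[str, dict]]:
    """Map chunk objects to (page_content, metadata) pairs, tagging each
    with the source path so retrieval hits can be traced back to a file."""
    return [(c.text, {"source": source, **c.meta}) for c in chunks]

pairs = to_document_inputs(
    [ChunkStub("hello"), ChunkStub("world", {"page": 2})],
    source="my_docs/a.md",
)
```

From here, `Document(page_content=text, metadata=md) for text, md in pairs` plugs straight into a LangChain vector store's `add_documents`.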