# Using chunktuner alongside Haystack (2.x)
chunktuner ranks chunking strategies on your corpus; Haystack runs preprocessing, embedding, and retrieval pipelines. Haystack is not a dependency of chunktuner — install it separately.
## Typical workflow
- Run `chunk-tune recommend ./my_docs --use-case rag_qa --output-format yaml` and parse `best.config` (a `ChunkConfig`: `name`, `params`).
- Map character-oriented params (e.g. `chunk_size_chars` from `recursive_character`) to Haystack's `DocumentSplitter` units (`split_by="word"` uses word counts, not characters). Calibrate `split_length`/`split_overlap` empirically, or approximate words ≈ chars / 5 for rough English prose.
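A minimal sketch of that char → word mapping, using the rough 5-chars-per-word heuristic from the text (the helper name `chars_to_words` is illustrative, not part of either library; calibrate the ratio on your own corpus):

```python
def chars_to_words(chunk_size_chars: int, chunk_overlap_chars: int,
                   chars_per_word: float = 5.0) -> tuple[int, int]:
    """Approximate DocumentSplitter word budgets from character-based params."""
    split_length = max(1, round(chunk_size_chars / chars_per_word))
    split_overlap = round(chunk_overlap_chars / chars_per_word)
    return split_length, split_overlap

# 1600-char chunks with 100-char overlap → roughly 320 words / 20 words
split_length, split_overlap = chars_to_words(1600, 100)
```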
## Example: Haystack Document from chunktuner chunks
```python
from pathlib import Path

from chunktuner import FileIngestor, default_registry
from chunktuner.models import ChunkConfig
from haystack import Document

docs = FileIngestor().ingest_dir(Path("./my_docs"))
strategy = default_registry.get("recursive_character")
cfg = ChunkConfig(name="recursive_character", params={"chunk_size_chars": 1600, "chunk_overlap_chars": 100})
chunks = strategy.chunk(docs[0], cfg)

haystack_docs = [
    Document(content=c.text, meta={"doc_id": docs[0].id, "chunk_id": c.id, "start": c.start_offset, "end": c.end_offset})
    for c in chunks
]
```
## Example: DocumentSplitter after a rough char → word mapping
```python
from haystack.components.preprocessors import DocumentSplitter

# Illustrative: map a character target to a word budget — validate on your data.
splitter = DocumentSplitter(split_by="word", split_length=300, split_overlap=30)
# result = splitter.run(documents=haystack_docs)
```
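The "validate on your data" step can be done with the standard library alone: measure the actual chars-per-word ratio of a sample instead of assuming 5:1, then derive the word budget from it (the helper name `chars_per_word` is illustrative):

```python
def chars_per_word(text: str) -> float:
    """Average characters per whitespace-delimited word in a text sample."""
    words = text.split()
    if not words:
        return 0.0
    return len(text) / len(words)

sample = "Haystack splits documents by word count, not characters."
ratio = chars_per_word(sample)

# Derive split_length from a 1600-char target using the measured ratio
# rather than the 5:1 rule of thumb.
split_length = max(1, round(1600 / ratio))
```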
For recursive splitting closer to LangChain's hierarchy, Haystack also documents `RecursiveDocumentSplitter` — chunktuner does not emit its separator list today; use chunktuner-produced chunks when you need parity with an evaluation run.