Chunking strategy for ingesting files? #1903
Replies: 3 comments
-
yep, you’re right to suspect something’s off. if you’re getting exactly 5 chunks for a 10-page A4 doc, chances are LightRAG is using a fixed-size, tokenizer-based chunker with no adaptive structure detection. this usually causes two major issues: semantic units get split across chunk boundaries, and unrelated sections get merged into the same chunk.
most RAG pipelines suffer from these by default, especially if they chunk before any semantic restoration. we’ve documented this and a few related ingestion traps in some depth; happy to share if that’s helpful. you’re not alone on this, but if you want context-aware or document-type-specific chunking, fixed-size won’t cut it. (see the sketch below for why the numbers land at roughly 5 chunks.)
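for reference, here’s a minimal sketch of what a fixed-size, tokenizer-based chunker typically does. this is illustrative, not LightRAG’s actual API; the chunk size, overlap, and encoding name are placeholder values you’d want to check against your install’s defaults:

```python
# Minimal sketch of fixed-size, tokenizer-based chunking with overlap.
# Parameter values are illustrative, not LightRAG's actual defaults.
import tiktoken

def fixed_size_chunks(text: str, chunk_tokens: int = 1200, overlap: int = 100):
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    step = chunk_tokens - overlap  # stride between chunk starts
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_tokens]
        chunks.append(enc.decode(window))
        if start + chunk_tokens >= len(tokens):
            break
    return chunks

# Back-of-envelope: 10 A4 pages at ~500-650 words/page is roughly
# 6,000+ tokens; 6,000 tokens / ~1,100-token stride is ~5-6 chunks,
# which matches the count you observed.
```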
-
So LightRAG just uses fixed-size chunks as its strategy? I was hoping for semantic embedding here... Do you know of a solution that uses semantic embedding and graphing for RAG and is relatively simple to install? I've been looking for weeks now...
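To be concrete about what I mean by semantic chunking: embed each sentence, then split wherever the similarity between adjacent sentences drops. A rough sketch, assuming sentence-transformers is installed; the model name and threshold are placeholders, not a recommendation:

```python
# Sketch of semantic (embedding-based) chunking: start a new chunk
# wherever adjacent sentences stop being similar to each other.
from sentence_transformers import SentenceTransformer
import numpy as np

def semantic_chunks(sentences: list[str], threshold: float = 0.6):
    model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model
    embs = model.encode(sentences, normalize_embeddings=True)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        # Cosine similarity; vectors are already normalized.
        sim = float(np.dot(embs[i - 1], embs[i]))
        if sim < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```

For what it's worth, LangChain's experimental `SemanticChunker` and LlamaIndex's `SemanticSplitterNodeParser` implement variants of this idea, in case either fits your "simple to install" constraint.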
-
Chunking boundaries seem less important since there is post-processing anyway, but yes, in mix mode large chunks are returned as-is instead of a separate RAG providing alternative, more accurate chunks. I'm thinking of adding another post-processing step to summarize and rerank the large raw chunks in a separate workflow, doing mix mode manually, and just having LightRAG produce the graph data. That keeps it simple but increases cost. Using a better LLM to build the graph might even remove the need for mix mode. (rough sketch of the rerank step below.)
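The reranking half of that workflow is cheap to prototype with a cross-encoder; the summarize step would then be one LLM call per surviving chunk. A sketch under those assumptions; the model name is a placeholder and `summarize()` just stands in for whatever LLM call you'd wire up:

```python
# Sketch of the proposed post-processing: rerank the large raw chunks
# against the query with a cross-encoder, keep the top few, then
# summarize the survivors with an LLM (left abstract here).
from sentence_transformers import CrossEncoder

def rerank_and_trim(query: str, raw_chunks: list[str], top_k: int = 3):
    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # placeholder
    scores = reranker.predict([(query, chunk) for chunk in raw_chunks])
    ranked = sorted(zip(scores, raw_chunks), key=lambda p: p[0], reverse=True)
    return [chunk for _, chunk in ranked[:top_k]]

def summarize(chunk: str) -> str:
    # Placeholder: call your LLM of choice with a summarization prompt.
    raise NotImplementedError
```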
-
As far as I can see in the provided files for LightRAG, it is not possible to change the chunking strategy; only fixed chunking is available. Is this correct? Or have I set something up wrong? Ingesting a 10-page (A4) document returned 5 chunks... Thoughts?