Chunking strategy for ingesting files? #1903
Replies: 3 comments
-
yep, you’re right to suspect something’s off. if you’re getting exactly 5 chunks for a 10-page A4 doc, chances are LightRAG is using a fixed-size, tokenizer-based chunker with no adaptive structure detection. this usually causes two major issues: semantic units get split across chunk boundaries, and unrelated sections get merged into the same chunk.
most RAG pipelines suffer from these by default, especially if they chunk before any semantic restoration. we’ve documented this and a few related ingestion traps in some depth; happy to share if that’s helpful. you’re not alone on this, but if you want context-aware or document-type-specific chunking, fixed-size won’t cut it. (see the sketch below for why the numbers land at roughly 5 chunks.)
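for reference, here’s a minimal sketch of what a fixed-size, tokenizer-based chunker typically does. this is illustrative, not LightRAG’s actual API; the chunk size, overlap, and encoding name are placeholder values you’d want to check against your install’s defaults:

```python
# Minimal sketch of fixed-size, tokenizer-based chunking with overlap.
# Parameter values are illustrative, not LightRAG's actual defaults.
import tiktoken

def fixed_size_chunks(text: str, chunk_tokens: int = 1200, overlap: int = 100):
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    step = chunk_tokens - overlap  # stride between chunk starts
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_tokens]
        chunks.append(enc.decode(window))
        if start + chunk_tokens >= len(tokens):
            break
    return chunks

# Back-of-envelope: 10 A4 pages at ~500-650 words/page is roughly
# 6,000+ tokens; 6,000 tokens / ~1,100-token stride is ~5-6 chunks,
# which matches the count you observed.
```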
-
So LightRAG just uses fixed-size chunks as its strategy? I was hoping for semantic embedding here... Do you know of a solution that uses semantic embedding and graphing for RAG and is relatively simple to install? I've been looking for weeks now...
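To be concrete about what I mean by semantic chunking: embed each sentence, then split wherever the similarity between adjacent sentences drops. A rough sketch, assuming sentence-transformers is installed; the model name and threshold are placeholders, not a recommendation:

```python
# Sketch of semantic (embedding-based) chunking: start a new chunk
# wherever adjacent sentences stop being similar to each other.
from sentence_transformers import SentenceTransformer
import numpy as np

def semantic_chunks(sentences: list[str], threshold: float = 0.6):
    model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model
    embs = model.encode(sentences, normalize_embeddings=True)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        # Cosine similarity; vectors are already normalized.
        sim = float(np.dot(embs[i - 1], embs[i]))
        if sim < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```

For what it's worth, LangChain's experimental `SemanticChunker` and LlamaIndex's `SemanticSplitterNodeParser` implement variants of this idea, in case either fits your "simple to install" constraint.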
-
Chunking boundaries seem less important since there is post-processing anyway, but yes, in mix mode large chunks are returned as-is instead of a separate RAG providing alternative, more accurate chunks. I'm thinking of adding another post-processing step to summarize and rerank the large raw chunks in a separate workflow, doing mix mode manually, and just having LightRAG produce the graph data. That keeps it simple but increases cost. Using a better LLM to build the graph might even remove the need for mix mode. (rough sketch of the rerank step below.)
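The reranking half of that workflow is cheap to prototype with a cross-encoder; the summarize step would then be one LLM call per surviving chunk. A sketch under those assumptions; the model name is a placeholder and `summarize()` just stands in for whatever LLM call you'd wire up:

```python
# Sketch of the proposed post-processing: rerank the large raw chunks
# against the query with a cross-encoder, keep the top few, then
# summarize the survivors with an LLM (left abstract here).
from sentence_transformers import CrossEncoder

def rerank_and_trim(query: str, raw_chunks: list[str], top_k: int = 3):
    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # placeholder
    scores = reranker.predict([(query, chunk) for chunk in raw_chunks])
    ranked = sorted(zip(scores, raw_chunks), key=lambda p: p[0], reverse=True)
    return [chunk for _, chunk in ranked[:top_k]]

def summarize(chunk: str) -> str:
    # Placeholder: call your LLM of choice with a summarization prompt.
    raise NotImplementedError
```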
-
As far as I can see in the provided files for LightRAG, it is not possible to change the chunking strategy; only fixed chunking is available. Is this correct? Or have I set something up wrong? Ingesting a 10-page (A4) document returned 5 chunks... Thoughts?