speed up chunking & add separator chunking #48

rangehow · 2024-09-18T06:53:00Z

Hey, so this PR's got two main changes:

Docs are now converted to tokens in bulk, which gives chunking_by_token_size roughly a 3x speedup when handling a large number of docs (we tested with 30k). It won't make much difference for small-scale workloads, but 30k docs is still pretty much toy-level (both industry and research usually work with far more), so this should be a solid upgrade.
We've added support for separator-based splitting without needing any extra dependencies. This splitting method tries to keep the grammatical structure intact, so each chunk is made up of complete clauses or sentences (as long as no overlap is used). The logic is adapted from langchain, so it might not behave exactly the same, but it does what it says on the tin.
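
For readers who want a feel for the two ideas, here is a minimal sketch. It is not the code from this PR: the function names, the default separators, and the whitespace "tokenizer" are all assumptions made for illustration, and a real setup would pass in the batch-encode call of whatever tokenizer it uses.

```python
from typing import Callable, Dict, List, Sequence


def toy_encode_batch(docs: Sequence[str]) -> List[List[str]]:
    """Hypothetical stand-in for a tokenizer's batch call (whitespace 'tokens')."""
    return [doc.split() for doc in docs]


def chunk_by_token_size_sketch(
    docs: Sequence[str],
    encode_batch: Callable[[Sequence[str]], List[List[str]]] = toy_encode_batch,
    max_token_size: int = 8,
    overlap_token_size: int = 2,
) -> List[Dict[str, object]]:
    """Tokenize all docs in one batch call, then slide a fixed-size window over each."""
    assert 0 <= overlap_token_size < max_token_size
    step = max_token_size - overlap_token_size
    chunks: List[Dict[str, object]] = []
    # One encode call for the whole corpus instead of one call per document.
    for doc_index, tokens in enumerate(encode_batch(docs)):
        for start in range(0, len(tokens), step):
            window = tokens[start:start + max_token_size]
            chunks.append({"doc_index": doc_index, "tokens": window})
    return chunks


def split_by_separators_sketch(
    text: str,
    separators: Sequence[str] = ("\n", ". ", "! ", "? "),
    max_token_size: int = 8,
) -> List[str]:
    """Cut text at separators, then greedily pack whole pieces into chunks (no overlap)."""
    pieces = [text]
    for sep in separators:
        next_pieces: List[str] = []
        for piece in pieces:
            parts = piece.split(sep)
            # Re-attach the separator to every part but the last so no text is lost.
            next_pieces.extend(part + sep for part in parts[:-1])
            next_pieces.append(parts[-1])
        pieces = [piece for piece in next_pieces if piece]

    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for piece in pieces:
        n_tokens = len(piece.split())  # toy token count, an assumption of this sketch
        # Start a new chunk when adding this piece would exceed the budget.
        # A single piece larger than the budget is kept whole rather than cut.
        if current and current_len + n_tokens > max_token_size:
            chunks.append("".join(current))
            current, current_len = [], 0
        current.append(piece)
        current_len += n_tokens
    if current:
        chunks.append("".join(current))
    return chunks
```

Because pieces are only ever cut at a separator and then packed whole, joining the returned chunks gives back the input text, and every chunk boundary falls at the end of a sentence or line. The fixed-window function, by contrast, can cut mid-sentence, which is the trade-off the separator mode is meant to avoid.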

nano_graphrag/_spliter.py

nano_graphrag/graphrag.py

codecov · 2024-09-18T12:57:34Z

Codecov Report

Attention: Patch coverage is 97.36842% with 1 line in your changes missing coverage. Please review.

Project coverage is 94.36%. Comparing base (ad74d13) to head (bcfe5bf).
Report is 3 commits behind head on main.

Files with missing lines	Patch %	Lines
nano_graphrag/_op.py	97.22%	1 Missing ⚠️

Additional details and impacted files

@@           Coverage Diff           @@
##             main      #48   +/-   ##
=======================================
  Coverage   94.36%   94.36%           
=======================================
  Files          11       11           
  Lines        1189     1189           
=======================================
  Hits         1122     1122           
  Misses         67       67


rangehow · 2024-09-18T13:06:19Z

Now I guess everything should be fine :)

gusye1234

Great work! There are a few typing errors, I think.

nano_graphrag/_op.py

nano_graphrag/graphrag.py

gusye1234 reviewed Sep 18, 2024

nano_graphrag/_spliter.py (outdated, resolved)

nano_graphrag/graphrag.py (outdated, resolved)

rangehow added 2 commits September 18, 2024 20:50

speed up chunking & add separator chunking

2915526

add test code for splitter & reformat chunking methods

9900d35

rangehow force-pushed the new-feature-branch branch from 20fed70 to 9900d35 on September 18, 2024 12:50

typo

6b5ad6c

rangehow added 2 commits September 18, 2024 21:04

fix overlap behaviour

67ecee0

typo

0072896

gusye1234 reviewed Sep 19, 2024

nano_graphrag/_op.py (outdated, resolved)

nano_graphrag/_op.py (outdated, resolved)

nano_graphrag/graphrag.py (outdated, resolved)

typo for type check

bcfe5bf

gusye1234 merged commit 13ce7d1 into gusye1234:main Sep 19, 2024
3 checks passed
