@rangehow (Collaborator) commented Sep 18, 2024

Hey, this PR makes two main changes:

  1. Docs are now converted to tokens in bulk in chunking_by_token_size, which gives roughly a 3x speedup on large inputs (we tested with 30k docs). It won't make much difference at small scale, but 30k docs is still pretty much toy-level, since both industry and research usually work with far more. So this is a solid upgrade.

  2. We've added support for separator-based splitting with no extra dependencies. This method tries to keep the grammatical structure intact, so each chunk is made of complete clauses or sentences (as long as no overlap is used). The logic is adapted from langchain, so it might not behave exactly the same, but it does what it says on the tin.
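To make the second point concrete, here is a minimal sketch of separator-based splitting on a token stream. It is not the code from this PR or from langchain: the function names `split_on_separators` and `merge_pieces` are hypothetical, whitespace-separated words stand in for real tokenizer ids, and no overlap is applied. The idea is to cut after every separator token, then greedily pack whole pieces into chunks under a token budget.

```python
from typing import List, Sequence, Set


def split_on_separators(tokens: Sequence[str], separators: Set[str]) -> List[List[str]]:
    """Cut the token stream after every separator token, keeping the separator."""
    pieces: List[List[str]] = []
    current: List[str] = []
    for tok in tokens:
        current.append(tok)
        if tok in separators:
            pieces.append(current)
            current = []
    if current:  # trailing tokens that never hit a separator
        pieces.append(current)
    return pieces


def merge_pieces(pieces: Sequence[Sequence[str]], max_tokens: int) -> List[List[str]]:
    """Greedily pack whole pieces into chunks of at most max_tokens tokens."""
    chunks: List[List[str]] = []
    current: List[str] = []
    for piece in pieces:
        # Fallback (an assumption of this sketch): a single piece longer
        # than the budget is hard-split at fixed token offsets.
        if len(piece) > max_tokens:
            if current:
                chunks.append(current)
                current = []
            for i in range(0, len(piece), max_tokens):
                chunks.append(list(piece[i:i + max_tokens]))
            continue
        # Start a new chunk when adding this piece would exceed the budget.
        if len(current) + len(piece) > max_tokens:
            chunks.append(current)
            current = []
        current = current + list(piece)
    if current:
        chunks.append(current)
    return chunks


# Toy usage: "." is the only separator, budget is 7 tokens per chunk.
tokens = "The cat sat . The dog ran off . Done .".split()
chunks = merge_pieces(split_on_separators(tokens, {"."}), max_tokens=7)
for chunk in chunks:
    print(" ".join(chunk))
# The cat sat .
# The dog ran off . Done .
```

Because pieces are only ever merged whole, every chunk ends on a separator unless the hard-split fallback kicks in. Adding overlap would mean copying tokens across chunk boundaries, which is where the "complete sentences" guarantee stops holding.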

codecov bot commented Sep 18, 2024

Codecov Report

Attention: Patch coverage is 97.36842% with 1 line in your changes missing coverage. Please review.

Project coverage is 94.36%. Comparing base (ad74d13) to head (bcfe5bf).
Report is 3 commits behind head on main.

Files with missing lines | Patch % | Lines
nano_graphrag/_op.py | 97.22% | 1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main      #48   +/-   ##
=======================================
  Coverage   94.36%   94.36%           
=======================================
  Files          11       11           
  Lines        1189     1189           
=======================================
  Hits         1122     1122           
  Misses         67       67           


@rangehow
Collaborator Author

Now I guess all should be well :)

Owner

@gusye1234 gusye1234 left a comment

Great work! There are a few typing errors, I think.

@gusye1234 gusye1234 merged commit 13ce7d1 into gusye1234:main Sep 19, 2024
3 checks passed