@smcdonald-jus smcdonald-jus commented Aug 1, 2025

Description

Fixes #19580

When performing sparse queries with words ending in a double "e" (such as refugee), relevant database entries were not returned. This is because the Snowball stemmer is not idempotent on such words (i.e. stem(stem('refugee')) = 'refug' != stem('refugee') = 'refuge'), and the query was being stemmed twice while the database entries were stemmed only once.
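
To make the mismatch concrete, here is a minimal sketch using a toy stand-in for the stemmer. `toy_stem` is hypothetical and is not Snowball; it only drops one trailing "e", which is enough to show how a non-idempotent stemmer makes a twice-stemmed query miss a once-stemmed index entry.

```python
def toy_stem(word: str) -> str:
    """Toy stand-in for a stemmer (NOT Snowball): drop one trailing 'e'."""
    return word[:-1] if word.endswith("e") else word


# The database entry is stemmed once at index time.
indexed_lexeme = toy_stem("refugee")

# A query stemmed once produces the same lexeme and matches.
query_stemmed_once = toy_stem("refugee")

# A query stemmed twice drifts to a different lexeme and no longer matches.
query_stemmed_twice = toy_stem(toy_stem("refugee"))

print(indexed_lexeme == query_stemmed_once)   # comparison for the single-stem path
print(indexed_lexeme == query_stemmed_twice)  # comparison for the double-stem path
```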

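This is a consequence of the definition of ts_query, which first transformed the string into a tsquery, then cast it to text so the AND operators could be replaced with OR operators, and finally converted it back into a tsquery. The intermediate text already contains stemmed terms, so the final conversion stems them a second time.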

Instead, we can replace the spaces in the initial query string with "|", so that the to_tsquery transformation is applied only once. This still produces an OR query, but the terms are stemmed a single time.
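
A minimal sketch of that idea, not the exact code in this PR: `to_or_query_string` is a hypothetical helper that joins the query terms with "|" before the string is handed to to_tsquery. Extracting terms with `\w+` (which drops punctuation that to_tsquery could otherwise read as operators) is an assumption of this sketch, not something the change itself is known to do.

```python
import re


def to_or_query_string(query_str: str) -> str:
    """Join the terms of a free-text query with '|' so that a single
    to_tsquery call yields an OR query (hypothetical helper)."""
    # Assumption: keep only word characters so stray '&', '!', ':' or
    # parentheses in user input are not passed through as operators.
    terms = re.findall(r"\w+", query_str)
    return " | ".join(terms)


# The result is meant to be passed once, as a bound parameter, to e.g.
#   SELECT to_tsquery('english', %s)
# so stemming happens a single time on the database side.
print(to_or_query_string("refugee crisis"))
```

An empty or punctuation-only query yields an empty string here, so a caller would likely want to skip the sparse search in that case rather than send it to to_tsquery.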

Note: this assumes English stemming and may have unintended consequences for other languages/encodings, depending on how spaces work in the corresponding Snowball stemmers.

New Package?

Did I fill in the tool.llamahub section in the pyproject.toml and provide a detailed README.md for my new integration or package?

  • Yes
  • No

Version Bump?

Did I bump the version in the pyproject.toml file of the package I am updating? (Except for the llama-index-core package)

  • Yes
  • No

Type of Change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

Your pull-request will likely not be merged unless it is covered by some form of impactful unit testing.

  • I added new unit tests to cover this change
  • I believe this change is already covered by existing unit tests

Suggested Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added Google Colab support for the newly added notebooks.
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I ran uv run make format; uv run make lint to appease the lint gods

@dosubot dosubot bot added the size:S label (This PR changes 10-29 lines, ignoring generated files.) Aug 1, 2025
@smcdonald-jus smcdonald-jus changed the title from "fix ts_query definition to avoid double-stemming" to "fix: change ts_query definition to avoid double-stemming" Aug 1, 2025
@logan-markewich logan-markewich merged commit f0d2f93 into run-llama:main Aug 2, 2025
11 checks passed

Development

Successfully merging this pull request may close these issues.

[Bug]: sparse queries in postgres get stemmed twice