QA : Metadata and multi instance? #2001

Prabesh002 · 2025-08-24T15:45:31Z

I have a question about the best way to implement strict multi-tenancy with LightRAG.

My app has two levels: Organization and Branch. Security requirement:

Users from Org A should never access Org B’s data.
Regular users only see their own branch.
Privileged org-level users can query across all branches in their org.
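
A minimal sketch of how these three rules could be turned into a metadata scope before any retrieval happens. The `User` class, the `is_org_admin` flag, and `allowed_scope` are hypothetical names for illustration; nothing here is part of LightRAG.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class User:
    # Hypothetical user record; field names are assumptions for this sketch.
    organization_id: str
    branch_id: str
    is_org_admin: bool = False


def allowed_scope(user: User, requested_branch: Optional[str] = None) -> dict:
    """Return the metadata filter this user is allowed to query with."""
    # The organization is always pinned to the user's own org.
    scope = {"organization_id": user.organization_id}
    if user.is_org_admin:
        # Org-level users may narrow to one branch or leave it open.
        if requested_branch is not None:
            scope["branch_id"] = requested_branch
    else:
        # Regular users are pinned to their own branch.
        if requested_branch not in (None, user.branch_id):
            raise PermissionError("branch outside the user's scope")
        scope["branch_id"] = user.branch_id
    return scope
```

The point of the design is that the scope is derived from the authenticated user, never from client-supplied filter values.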

The approach I’m considering: tag all data at ingestion with organization_id and branch_id. What I need clarity on is:

1. Ingestion

How do I attach custom metadata like organization_id and branch_id to text passed into rag.insert() so that it flows into chunks, entities, and relationships?

Is there a metadata parameter in rag.insert() that I’m missing? I see that it accepts ids, and that the query parameters also take ids, but I was wondering whether there is a flexible key-value metadata option that could act as a filter.
Or is the intended way to preprocess text → chunks and use rag.insert_custom_kg() with custom metadata (like in insert_custom_kg.py)? I know we can define custom relationships that way, but I'd like to avoid the manual work.

2. Querying

At query time, I want to filter retrieval so only matching organization_id and (optionally) branch_id are considered.

Does QueryParam support filters for this? For example:

security_filters = {
    "organization_id": "org-abc-123", 
    "branch_id": "branch-xyz-456"
}

query_params = QueryParam(mode="hybrid", filters=security_filters)
response = rag.query(query="...", param=query_params)

This library seems awesome, and I'd rather not mess with how it works internally. From the few Medium articles I've read, it creates a local directory and runs from there, so I could probably create one LightRAG instance per organization. The files I store are on an IIS server, where each organization has its own dedicated folder (named by its ID) that is further separated by branch, so I could create the instance there. Whenever an organization queries or batch-inserts documents, I could open that organization's instance, use it, and close it when I am done, which I guess reduces cross-organization issues?
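
A rough sketch of the per-organization directory idea, assuming each organization ID maps to one folder under a shared base path. The helper name, the allowed-character rule, and the `lightrag` subfolder are all assumptions for illustration. The ID check matters because the ID becomes part of a filesystem path.

```python
import re
from pathlib import Path

# Assumed ID format: letters, digits, underscore, hyphen. Adjust to your IDs.
_SAFE_ID = re.compile(r"[A-Za-z0-9_-]{1,64}")


def org_working_dir(base_dir: str, organization_id: str) -> Path:
    """Map an organization ID to its own working directory."""
    # Reject anything that could escape the base folder (e.g. "../other-org").
    if not _SAFE_ID.fullmatch(organization_id):
        raise ValueError(f"unsafe organization id: {organization_id!r}")
    return Path(base_dir) / organization_id / "lightrag"
```

The returned path would then be passed as `working_dir` when constructing that organization's LightRAG instance, so that each organization reads and writes only its own files.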

I'm not sure how the data is chunked or stored at the database level. I'm more of a docs reader and I haven't looked into the internals yet, as I only found this library today. I just want to learn more about it, as it seems interesting and the benchmarks look good :D

I hope I can get some help, and I apologise if this is naive :')) I'm kind of slacking off while writing this, but I hope I wrote enough to get some basic answers!

onestardao · 2025-08-25T02:18:25Z

looks like you’re basically asking two things:

how to attach custom metadata (like organization_id, branch_id) at ingestion time, and
whether query-time filters can restrict retrieval by those metadata fields.

for ingestion, most frameworks don’t inject metadata automatically; you usually pass it alongside each chunk when calling insert. the key is to expand your insert payload so each document carries { text, metadata: {...} }. lightRAG doesn’t seem to block this, but the helper wrapper functions may need to be bypassed or extended.

for querying, what you’re sketching (QueryParam(mode="hybrid", filters=security_filters)) is the right idea. if the underlying vectorstore supports metadata filters (like metadata_key == value), you can push those down. if not, you’ll need a post-filter step after retrieval.
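
a minimal sketch of that post-filter step, assuming each retrieved hit is a dict carrying a `metadata` dict. the hit shape and the function name are assumptions, not a LightRAG or vectorstore API.

```python
from typing import Iterable


def post_filter(hits: Iterable[dict], required: dict) -> list[dict]:
    """Keep only hits whose metadata matches every required key/value."""
    kept = []
    for hit in hits:
        meta = hit.get("metadata") or {}
        # A hit with missing metadata is dropped, never passed through.
        if all(meta.get(key) == value for key, value in required.items()):
            kept.append(hit)
    return kept
```

note that filtering after retrieval can shrink the result set below the requested top-k, so it usually needs a larger initial fetch than a pushed-down filter would.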

so the gap isn’t really in lightRAG’s concept but in making sure the vector db driver you’re using respects metadata at both insert and filter time.

if you want a more structured breakdown, we keep a “ProblemMap” that matches common RAG failures to fixes (like metadata drift, or filter collapse). your case looks like No.1 (hallucination & chunk drift if metadata not bound) and No.8 (black-box ingestion where metadata is silently dropped). if you’d like the link to that checklist, just say so and I’ll share it.

6 replies

onestardao Aug 25, 2025

thanks for laying out the details so clearly. what you’re describing (multi-org separation with metadata + query-time filters) is exactly where most RAG pipelines tend to break in practice — the ingestion and querying layers aren’t always aligned.

from our side we’ve seen this map cleanly to two failure modes:

ProblemMap No.2 → ingestion misalignment (metadata not consistently attached at chunking time)

ProblemMap No.7 → query-side filter gaps (metadata exists but isn’t propagated through the retrieval pipeline)

the trick is to treat the metadata binding as a “semantic firewall”: it has to survive every stage from chunk creation → vector insertion → query filtering. if it gets dropped at any stage, you end up with the kind of cross-org bleed you’re worried about.

if it helps, i can share the step-by-step checklist we use to debug these two cases. it’s fairly lightweight: just a few diagnostic queries against known metadata keys usually show whether the failure is ingestion-side or query-side.

in practice, once you confirm which side is dropping the binding, the fix is mechanical (schema enforcement at ingestion vs. filter propagation at retrieval).

ref: ProblemMap/README.md

Prabesh002 Aug 25, 2025
Author

Thanks for pointing me toward the ProblemMap. I’ve gone through the WFGY repo and skimmed the map, but I still need to pin down a few LightRAG-specific mechanics before I can wire in the firewall cleanly.

Can I pass specific ids at insert() time and be sure they end up in every chunk, entity node, and relationship edge? Or do I need to bypass the defaults? I remember you saying it depends on the provider I'm using, and I'm pretty sure it works with Chroma. For Neo4j, does LightRAG automatically attach metadata to nodes/edges, or do I need to hook into the ingest pipeline and extend the schema? The WFGY map covers the failure, but I’d like to see the intended LightRAG schema contract (i.e. what metadata fields survive by default).

When I use QueryParam(..., filters=...), will those filters be enforced in both Chroma vector lookups and Neo4j? Or are graph queries separate?

And regarding Postgres: when creating an instance, does LightRAG auto-create collections/tables? (I think it should, since it created the local dir.) Which database will it use, and do I have any control over that? (I haven't looked into it.)

onestardao Aug 25, 2025

Hi, my friend. I will use AI to generate the answer to your question, since it's faster for me; hope you don't mind :) (if you feed my TXTOS into your AI you get the same result, because the AI will know everything if you give it my txt file) ^___^

===

thanks for the clear follow-up. the symptoms you describe still line up with two classic failure points we see a lot

No.2 ingestion misalignment. metadata not consistently attached at chunk time
No.7 query-side filter gaps. metadata exists but is not carried through the retrieval pipeline

below is a concrete way to turn your plan into a “semantic firewall”. idea is simple. bind org and branch at every stage. if any stage drops the binding, you catch it fast.

1) ingestion checklist

normalize keys first

org_id, branch_id, entity_id, created_at

keep names identical in every store. strings only. no nested objects.
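
a small sketch of that normalization rule as a check you could run before any insert. the function name is hypothetical; the four keys are the ones listed above.

```python
REQUIRED_KEYS = ("org_id", "branch_id", "entity_id", "created_at")


def normalize_metadata(raw: dict) -> dict:
    """Return a flat, string-only dict containing exactly the required keys."""
    missing = [key for key in REQUIRED_KEYS if key not in raw]
    if missing:
        raise KeyError(f"missing metadata keys: {missing}")
    clean = {}
    for key in REQUIRED_KEYS:
        value = raw[key]
        # No nested objects: containers are rejected rather than stringified.
        if isinstance(value, (dict, list, tuple, set)):
            raise TypeError(f"{key} must be a scalar, got {type(value).__name__}")
        text = str(value).strip()
        if not text:
            raise ValueError(f"{key} must not be empty")
        clean[key] = text
    return clean
```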

vector side (chroma)

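# collection is a chromadb Collection,
# e.g. collection = client.get_or_create_collection("docs")
collection.upsert(
  ids=[doc_id],
  embeddings=[vec],
  metadatas=[{
    "org_id": org_id,
    "branch_id": branch_id,
    "entity_id": entity_id,
    "created_at": ts
  }],
  documents=[text]
)

verify right after insert

# chroma expects one top-level condition in where,
# so combine two conditions with $and
collection.query(
  query_texts=["ping"],
  n_results=1,
  where={"$and": [{"org_id": org_id}, {"branch_id": branch_id}]}
)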

if where does not filter in your build, use per-tenant collections or per-tenant namespaces. either isolates ids or you will get bleed.
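
a hedged sketch of a helper that builds the where filter for both cases: branch-scoped queries and org-wide queries. it assumes the Chroma convention that several conditions are combined under an explicit `$and`; the function name is hypothetical, and you should check the filter syntax against the Chroma version you run.

```python
from typing import Optional


def build_where(org_id: str, branch_id: Optional[str] = None) -> dict:
    """Build a Chroma-style metadata filter for one org, optionally one branch."""
    if not org_id:
        # Never build a filter without an org; that would match every tenant.
        raise ValueError("org_id is required")
    if branch_id is None:
        # Org-wide query: a single condition needs no operator.
        return {"org_id": org_id}
    # Branch-scoped query: two conditions combined explicitly.
    return {"$and": [{"org_id": org_id}, {"branch_id": branch_id}]}
```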

graph side (neo4j)

CREATE CONSTRAINT IF NOT EXISTS
FOR (d:Doc) REQUIRE d.doc_id IS UNIQUE;

MERGE (d:Doc {doc_id:$doc_id})
SET d.text=$text,
    d.org_id=$org_id,
    d.branch_id=$branch_id,
    d.entity_id=$entity_id,
    d.created_at=$ts

// if you create chunk nodes, keep them in the same statement
// so that d is still bound when the relationship is merged:
MERGE (c:Chunk {chunk_id:$chunk_id})
SET c.org_id=$org_id, c.branch_id=$branch_id
MERGE (d)-[:HAS_CHUNK]->(c);

add simple constraints or indexes on org_id and branch_id for both Doc and Chunk.

smoke tests

ingest two tiny docs A and B with different org ids. one chunk each
chroma query with filter. expect only A for org A

direct cypher

MATCH (d:Doc {org_id:$org}) RETURN count(d)

the count must match the number of docs you ingested for that org

if either side fails the isolation test, fix that side before continuing

2) query-time checklist

you want the filter to be enforced in both legs. first on vector search. then again on graph fetch. finally intersect ids in the orchestrator. this prevents any single leg from leaking.

vector leg

# collection is the same chromadb Collection used at ingestion
hits = collection.query(
  query_embeddings=[qvec],
  n_results=50,
  where={"$and": [{"org_id": org_id}, {"branch_id": branch_id}]}
)
vec_ids = set(hits["ids"][0])

graph leg

MATCH (d:Doc {org_id:$org_id, branch_id:$branch_id})
WHERE d.doc_id IN $vec_ids
RETURN d LIMIT 50

orchestrator guard

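# vec_ids: ids returned by the vector leg
# graph_ids: doc_id values returned by the graph leg
vec_ids = set(vec_ids)
graph_ids = set(graph_ids)
final_ids = vec_ids & graph_ids  # keep only ids that both legs returned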

if your QueryParam(..., filters=...) only applies on the graph side, keep the intersection anyway. if chroma supports where use it. if not, isolate per tenant with collections or with an id prefix like ORG123__<doc_id>.
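
a minimal sketch of the id-prefix fallback combined with the intersection guard. the `__` separator follows the ORG123__<doc_id> pattern above; the function names are hypothetical.

```python
SEP = "__"


def tenant_doc_id(org_id: str, doc_id: str) -> str:
    """Prefix a document id with its org, e.g. ORG123__doc-1."""
    if not org_id or SEP in org_id:
        # The separator must stay unambiguous inside the prefix.
        raise ValueError("org_id must be non-empty and must not contain '__'")
    return f"{org_id}{SEP}{doc_id}"


def belongs_to(org_id: str, prefixed_id: str) -> bool:
    """True when the id carries exactly this org's prefix."""
    return prefixed_id.startswith(org_id + SEP)


def guard_ids(org_id: str, vec_ids, graph_ids) -> set:
    """Intersect both legs, then drop anything without the caller's prefix."""
    both = set(vec_ids) & set(graph_ids)
    return {pid for pid in both if belongs_to(org_id, pid)}
```

the prefix check is a second line of defense: even if both legs return a foreign id, it is dropped before it reaches the answer.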

3) where does metadata get attached

attach at chunk build time. never after. the chunker should write the same metadata dict to both the vector payload and the graph nodes. if you rely on a downstream stage to “recover” metadata, gaps appear.
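
a minimal sketch of attaching metadata at chunk build time. this is a plain character-window chunker written for illustration, not LightRAG's own chunker; the function name, the chunk dict shape, and the size values are assumptions.

```python
def build_chunks(text: str, metadata: dict, size: int = 200, overlap: int = 20) -> list[dict]:
    """Split text into overlapping windows and stamp each one with metadata."""
    if size <= overlap:
        raise ValueError("size must be larger than overlap")
    step = size - overlap
    chunks = []
    for index, start in enumerate(range(0, len(text), step)):
        chunks.append({
            "chunk_id": f"{metadata['entity_id']}:{index}",
            "text": text[start:start + size],
            # Copy the dict so later mutation cannot change stored chunks.
            "metadata": dict(metadata),
        })
    return chunks
```

the same chunk dict can then feed both the vector insert and the graph insert, so neither store has to reconstruct the binding later.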

4) about postgres and drivers

lightRAG's default stores are file-based and live under the working directory (JSON key-value files, a NanoVectorDB file, and a NetworkX graph file), which is why you see a local dir created. when you configure a pg driver, the tables live in whatever DSN you pass. chroma can be remote or local. neo4j is its own store. the rule is the same. whatever you pick, ensure the same four keys are present and indexed.

5) small diagnostic set you can run in five minutes

minimal doc per org. run chroma query with and without filter. confirm difference
run a raw cypher count per org. confirm numbers
turn off the graph leg. confirm that the vector leg alone still isolates with where or collections
turn off the vector leg. confirm that the graph leg isolates with MATCH (d:Doc {org_id:$org})
re-enable both and keep the id intersection in the orchestrator
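
the diagnostic steps can be wrapped in a tiny leak check, sketched here under the assumption that you can expose each leg as a function taking an org id and returning hits with a `metadata` dict. `find_leaks` and the hit shape are hypothetical.

```python
from typing import Callable, Iterable


def find_leaks(search: Callable[[str], Iterable[dict]],
               org_ids: Iterable[str]) -> list[tuple[str, str]]:
    """Run one scoped search per org and report (asked_org, found_org) mismatches."""
    leaks = []
    for org_id in org_ids:
        for hit in search(org_id):
            found = (hit.get("metadata") or {}).get("org_id")
            # A hit with a different or missing org_id counts as a leak.
            if found != org_id:
                leaks.append((org_id, str(found)))
    return leaks
```

run it once per leg (vector only, graph only, both). an empty list from every run is the pass condition.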

that workflow usually tells you exactly which side is dropping the binding.

reference checklist map if you want the broader context
https://github.com/onestardao/WFGY/tree/main/ProblemMap/README.md
most of what we discussed maps to No.2 and No.7. if you need a short one-pager version for your repo, tell me and i’ll distill it.

happy to review a tiny repo snippet if you want me to sanity-check the chunker and the query hook.

Prabesh002 Aug 26, 2025
Author

HELLO MY FRIEND! Thank you so much, I'll look into this and will create a new q/a or thread if I get stuck

Thank you for your help, and hope you have a wonderful day >_>

Also I will install this package to my project and try it out today to see how this works

onestardao Aug 26, 2025

You are welcome , also hope you have a BigBig wonderful day

BigBig Smile 4u --> BigBig ^_________________^
