Checked other resources
I added a very descriptive title to this question.
I searched the LangChain documentation with the integrated search.
I used the GitHub search to find a similar question and didn't find it.
Commit to Help
I commit to help with one of those options 👆
Example Code
import uuid
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore, LocalFileStore
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableLambda, RunnablePassthrough
from langchain_google_vertexai import ChatVertexAI, VertexAIEmbeddings

def create_multi_vector_retriever(store, vectorstore, id_key, text_summaries=None, texts=None):
    """Create a multi-vector retriever that indexes summaries and stores the raw contents.

    Args:
        store: Docstore that holds the raw documents (texts, images, etc.).
        vectorstore: Vector store that indexes the summaries.
        id_key: Metadata key linking each summary to its raw document.
        text_summaries: List of text summaries for new documents (optional).
        texts: List of corresponding texts/images/etc. for new documents (optional).

    Returns:
        The multi-vector retriever instance.
    """
    # Create the multi-vector retriever
    retriever = MultiVectorRetriever(
        vectorstore=vectorstore,
        docstore=store,
        id_key=id_key,
        search_kwargs={"k": 5},
    )

    def add_documents(retriever, doc_summaries, doc_contents):
        doc_ids = [str(uuid.uuid4()) for _ in doc_contents]
        summary_docs = [
            Document(page_content=s, metadata={id_key: doc_ids[i]})
            for i, s in enumerate(doc_summaries)
        ]
        retriever.vectorstore.add_documents(summary_docs)
        retriever.docstore.mset(list(zip(doc_ids, doc_contents)))
        print("added")

    # Filter out empty text summaries
    non_empty_text_summaries = [summary for summary in text_summaries if summary.strip()]

    # Add texts, tables, and images if summaries are not empty
    if non_empty_text_summaries:
        add_documents(retriever, non_empty_text_summaries, texts)

    return retriever

vectorstore = Chroma(
    collection_name="second",
    persist_directory="/content/embeddings12",
    embedding_function=VertexAIEmbeddings(model_name="textembedding-gecko@latest"),
)
docstore = InMemoryStore()
id_key = "doc_id"

# Create retriever
retriever_multi_vector_img = create_multi_vector_retriever(
    docstore,
    vectorstore,
    id_key,
    doc_summaries,        # descriptions of the images
    doc_img_base64_list,  # the actual images
)

def multi_modal_rag_chain(retriever):
    """Multi-modal RAG chain."""
    # Multi-modal LLM
    model = ChatVertexAI(
        temperature=0, model_name="gemini-pro-vision", max_output_tokens=2048, safety_settings=safety_settings
    )

    # RAG pipeline
    chain = (
        {
            "context": retriever | RunnableLambda(split_image_text_types),
            "question": RunnablePassthrough(),
        }
        | RunnableLambda(img_prompt_func)
        | model
        | StrOutputParser()
    )
    return chain

# Create RAG chain
chain_multimodal_rag = multi_modal_rag_chain(retriever_multi_vector_img)
Description
I have been using LangChain's MultiVectorRetriever to store images and their descriptions in my embeddings DB: I embed the descriptions and store the images as-is for retrieval. The documents are keyed by doc_id and stored in an InMemoryStore, but I would like this store to live in a persistent directory (or in a variable I can save to disk).
I have tried using LocalFileStore, but it works with bytes-like objects while the docstore needs values in str format (per the documentation), so that approach threw a TypeError.
Is there any way to implement this functionality? Please help me; I am just a beginner with LangChain and LLMs.
Above is my code for the retriever and the LLM chain.
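Something like the following is what I had in mind (an untested sketch, assuming EncoderBackedStore takes a store, a key encoder, a value serializer, and a value deserializer; the /content/docstore path is just a placeholder): wrap LocalFileStore, which persists bytes on disk, in an EncoderBackedStore from langchain.storage that converts my base64 strings to bytes on write and back to str on read.

# Untested sketch: LocalFileStore persists bytes, so wrap it so the retriever
# can keep reading/writing plain strings (base64-encoded images in my case).
from langchain.storage import EncoderBackedStore, LocalFileStore

byte_store = LocalFileStore("/content/docstore")  # placeholder path

persistent_docstore = EncoderBackedStore(
    byte_store,
    lambda key: key,                        # doc_id keys are already strings
    lambda value: value.encode("utf-8"),    # str -> bytes on write
    lambda data: data.decode("utf-8"),      # bytes -> str on read
)

# Used in place of InMemoryStore when building the retriever above
retriever_multi_vector_img = create_multi_vector_retriever(
    persistent_docstore,
    vectorstore,
    id_key,
    doc_summaries,
    doc_img_base64_list,
)

Would something along these lines work, or is there a recommended way to persist the docstore?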
Thanks!
System Info
Running the code on Google Colab.