[Question]: Showing duplicate entries in my vector db #14611
Comments
The issue of duplicate entries in your vector database could be due to several factors in your ingestion and upserting process. Here are some steps and best practices to help you identify and resolve the problem:
Here is an example of how upserts are handled:

    def _handle_upserts(self, nodes: List[BaseNode], store_doc_text: bool = True) -> List[BaseNode]:
        assert self.docstore is not None

        existing_doc_ids_before = set(self.docstore.get_all_document_hashes().values())
        doc_ids_from_nodes = set()
        deduped_nodes_to_run = {}
        for node in nodes:
            ref_doc_id = node.ref_doc_id if node.ref_doc_id else node.id_
            doc_ids_from_nodes.add(ref_doc_id)
            existing_hash = self.docstore.get_document_hash(ref_doc_id)
            if not existing_hash:
                # No stored hash for this document yet: record it and process the node.
                self.docstore.set_document_hash(ref_doc_id, node.hash)
                deduped_nodes_to_run[ref_doc_id] = node
            elif existing_hash and existing_hash != node.hash:
                # The stored hash differs: remove the old entries, then re-add.
                self.docstore.delete_ref_doc(ref_doc_id, raise_error=False)
                if self.vector_store is not None:
                    self.vector_store.delete(ref_doc_id)
                self.docstore.set_document_hash(ref_doc_id, node.hash)
                deduped_nodes_to_run[ref_doc_id] = node
            else:
                # The stored hash matches: skip the node.
                continue

        if self.docstore_strategy == DocstoreStrategy.UPSERTS_AND_DELETE:
            # Delete documents that were stored before but are absent from this run.
            doc_ids_to_delete = existing_doc_ids_before - doc_ids_from_nodes
            for ref_doc_id in doc_ids_to_delete:
                self.docstore.delete_document(ref_doc_id)
                if self.vector_store is not None:
                    self.vector_store.delete(ref_doc_id)

        nodes_to_run = list(deduped_nodes_to_run.values())
        self.docstore.add_documents(nodes_to_run, store_text=store_doc_text)
        return nodes_to_run

Additionally, check how your vector store is configured:

    vector_store = PineconeVectorStore(
        pinecone_index=pinecone_index,
        namespace="my_namespace",
        insert_kwargs={"some_key": "some_value"},
        batch_size=50,
        remove_text_from_metadata=True,
        add_sparse_vector=True,
        tokenizer=my_tokenizer,
        default_empty_query_vector=[0.0] * 128,
    )

By following these practices, you can help prevent and manage duplicate entries in your vector database [1][2][3].
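To make the hash-comparison idea easier to try in isolation, here is a minimal sketch that uses plain dictionaries in place of the docstore and the vector store. The `Node` class and the `upsert` function are hypothetical stand-ins written for illustration, not the library's own classes or API, and only the "new", "changed", and "unchanged" branches are modelled.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Node:
    """Hypothetical stand-in for a pipeline node (not the library's class)."""
    ref_doc_id: str
    text: str

    @property
    def hash(self) -> str:
        # Content hash used to detect whether a document has changed.
        return hashlib.sha256(self.text.encode("utf-8")).hexdigest()


def upsert(nodes: List[Node], doc_hashes: Dict[str, str],
           vectors: Dict[str, str]) -> List[Node]:
    """Write only new or changed nodes; return the nodes that were written."""
    written: Dict[str, Node] = {}
    for node in nodes:
        if doc_hashes.get(node.ref_doc_id) == node.hash:
            continue  # Unchanged since the last run: skip it.
        # New or changed: overwrite the stored hash and the stored entry.
        doc_hashes[node.ref_doc_id] = node.hash
        vectors[node.ref_doc_id] = node.text  # the dict stands in for the vector store
        written[node.ref_doc_id] = node
    return list(written.values())


doc_hashes: Dict[str, str] = {}
vectors: Dict[str, str] = {}
batch = [Node("a", "hello"), Node("b", "world")]

first_run = upsert(batch, doc_hashes, vectors)    # both nodes are new
second_run = upsert(batch, doc_hashes, vectors)   # nothing changed
third_run = upsert([Node("a", "hello!")], doc_hashes, vectors)  # "a" changed

print(len(first_run), len(second_run), len(third_run), len(vectors))  # 2 0 1 2
```

Running the same batch twice leaves the store at two entries because the second run finds matching hashes and writes nothing.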
Question Validation
Question
I created nodes using an ingestion pipeline:
After concatenating all the chunks, I have 113 nodes in total.
However, when I upsert those nodes into my Pinecone vector DB, it shows a vector count of 572.
This is the code I am using to upsert the nodes into my vector DB:
Can anyone please help me with this?
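One possible cause of a count mismatch like this, offered as an assumption rather than a diagnosis, is that every run assigns fresh random IDs, so repeated upserts add new vectors instead of overwriting the old ones. The sketch below uses a plain dictionary as a stand-in for an index where upserting an existing ID overwrites the record. The `stable_id` and `upsert_by_id` helpers and the `report.pdf` name are hypothetical.

```python
import hashlib
import uuid
from typing import Dict, List, Tuple


def stable_id(doc_id: str, chunk_text: str) -> str:
    """Derive a deterministic ID from the source document and the chunk text."""
    digest = hashlib.sha256(f"{doc_id}\x00{chunk_text}".encode("utf-8")).hexdigest()
    return digest[:32]


def upsert_by_id(index: Dict[str, str], records: List[Tuple[str, str]]) -> None:
    """Dictionary stand-in for an index: an existing ID is overwritten."""
    for record_id, text in records:
        index[record_id] = text


chunks = [("report.pdf", "first chunk"), ("report.pdf", "second chunk")]

random_index: Dict[str, str] = {}
stable_index: Dict[str, str] = {}

# Simulate running the same ingestion five times.
for _ in range(5):
    upsert_by_id(random_index, [(str(uuid.uuid4()), text) for _, text in chunks])
    upsert_by_id(stable_index, [(stable_id(doc, text), text) for doc, text in chunks])

print(len(random_index))  # 10: every run added two new records
print(len(stable_index))  # 2: every run overwrote the same two records
```

If the IDs are derived from the content, re-running the ingestion is idempotent, and the vector count should match the node count.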