Motivation / Background
We have observed frequent duplicates or near-duplicates across different modules,
such as user inputs, knowledge graph nodes, or text-based data in our pipeline.
This results in:
Increased confusion and potential redundancy in search or retrieval,
Larger storage requirements,
Potential performance degradation over time.
Previously, we had a specialized deduplication utility targeting Node objects (FinDKG project),
but this approach does not generalize well to other text-based data.
Proposal
Introduce a deduplication.py module with more flexible deduplication-related functions:
deduplication_internally
deduplication_externally
Key features and improvements:
Works with strings: Instead of requiring Node objects, we now accept lists of raw strings.
Optional embeddings parameter:
Users can supply a BaseEmbedding instance from the Camel embedding library,
and the function will internally handle embeddings.
Alternatively, users can pass in precomputed embeddings directly if they have
their own embedding process or the data is already embedded.
Multiple strategies:
Initially supports a "top1" strategy (i.e., for each text, take its highest-similarity match and treat it as a duplicate if that similarity is above the threshold).
A future "llm-supervise" strategy will rely on an LLM to decide whether two
texts are duplicates, especially in borderline cases or when semantic similarity is unclear.
(Currently not implemented, but planned.)
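To make the "top1" idea concrete, here is a minimal sketch, not the proposed implementation. The function name `dedupe_top1_sketch`, its parameters, and the sample vectors are hypothetical, chosen only for illustration. `embed_fn` is a plain callable standing in for an embedding backend; it does not reproduce the `BaseEmbedding` interface. The sketch also assumes cosine similarity and compares each text only against earlier texts, so the first occurrence is the one that is kept.

```python
import math
from typing import Callable, Dict, List, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def dedupe_top1_sketch(
    texts: List[str],
    threshold: float = 0.9,
    embed_fn: Optional[Callable[[List[str]], List[List[float]]]] = None,
    embeddings: Optional[List[List[float]]] = None,
) -> Dict[int, int]:
    """Hypothetical helper: map each duplicate's index to the index it duplicates.

    Exactly one of `embed_fn` (computes vectors) or `embeddings`
    (precomputed vectors) must be given.
    """
    if (embed_fn is None) == (embeddings is None):
        raise ValueError("Provide exactly one of embed_fn or embeddings.")
    vectors = embeddings if embeddings is not None else embed_fn(texts)
    if len(vectors) != len(texts):
        raise ValueError("Need one embedding per text.")

    duplicate_of: Dict[int, int] = {}
    for i in range(1, len(texts)):
        # "top1": consider only the single most similar earlier text.
        best_j = max(
            range(i), key=lambda j: cosine_similarity(vectors[i], vectors[j])
        )
        if cosine_similarity(vectors[i], vectors[best_j]) > threshold:
            duplicate_of[i] = best_j
    return duplicate_of


# Usage with precomputed (made-up) embeddings.
texts = ["What is AI?", "What is artificial intelligence?", "Deduplication example"]
precomputed = [[0.10, 0.30, 0.50], [0.10, 0.30, 0.51], [0.90, 0.10, 0.00]]
duplicate_of = dedupe_top1_sketch(texts, threshold=0.95, embeddings=precomputed)
unique_texts = [t for i, t in enumerate(texts) if i not in duplicate_of]
```

Accepting either a callable or precomputed vectors mirrors the optional embeddings parameter described above. A planned "llm-supervise" strategy could plausibly replace the threshold check for borderline pairs, but that part is left out here because it is not yet specified.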
Example
See the updated function deduplicate_internally in deduplication.py.