[Feature Request] General Deduplication Utility for String-based Data. #1517

Open
2 of 4 tasks
keli-wen opened this issue Jan 28, 2025 · 0 comments
keli-wen commented Jan 28, 2025

Required prerequisites

Motivation / Background

We have observed frequent duplicates or near-duplicates across different modules,
such as user inputs, knowledge graph nodes, or text-based data in our pipeline.

This results in:

  • Increased confusion and potential redundancy in search or retrieval,
  • Larger storage requirements,
  • Potential performance degradation over time.

Previously, we had a specialized deduplication utility targeting Node objects (FinDKG project),
but this approach does not generalize well to other text-based data.

Proposal

Introduce a deduplication.py module with more flexible deduplication-related functions:

  • deduplicate_internally
  • deduplicate_externally

Key features and improvements:

  1. Works with strings: Instead of requiring Node objects, we now accept lists of raw strings.
  2. Optional embeddings parameter:
    • Users can supply a BaseEmbedding instance from the Camel embedding library,
      and the function will internally handle embeddings.
    • Alternatively, users can pass in precomputed embeddings directly if they have
      their own embedding process or data is pre-embedded.
  3. Multiple strategies:
    • Initially supports a "top1" strategy (i.e., for each text, take its single
      highest-similarity match and treat the pair as duplicates if that similarity
      is above the threshold).
    • A future "llm-supervise" strategy will rely on an LLM to decide whether two
      texts are duplicates, especially in borderline cases or when semantic
      similarity alone is inconclusive. (Currently not implemented, but planned.)
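
To make the "top1" idea concrete, here is a minimal sketch over precomputed embeddings. It is not the proposed implementation: the helper names (cosine_similarity, top1_duplicates), the use of cosine similarity, and the choice to compare each item only with earlier items are assumptions made for illustration.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top1_duplicates(
    embeddings: List[List[float]], threshold: float = 0.65
) -> Dict[int, int]:
    """Map each duplicate index to the earlier index it is most similar to.

    Assumption: each item is compared only with the items before it, so
    the first occurrence in a group of duplicates is the one that is kept.
    """
    duplicate_to_target: Dict[int, int] = {}
    for i in range(1, len(embeddings)):
        best_j, best_sim = -1, -1.0
        for j in range(i):
            sim = cosine_similarity(embeddings[i], embeddings[j])
            if sim > best_sim:
                best_j, best_sim = j, sim
        # Only the single best match ("top1") is checked against the threshold.
        if best_sim > threshold:
            duplicate_to_target[i] = best_j
    return duplicate_to_target


# Toy vectors: items 2 and 3 are close to items 0 and 1 respectively.
vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
mapping = top1_duplicates(vectors, threshold=0.65)
unique_ids = [i for i in range(len(vectors)) if i not in mapping]
```

With these toy vectors, items 2 and 3 are mapped to items 0 and 1, and only indices 0 and 1 remain as unique entries.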

Example

See the updated function deduplicate_internally in deduplication.py:

from typing import List, Literal, Optional

# BaseEmbedding and DeduplicationResult come from the project itself.


def deduplicate_internally(
    texts: List[str],
    threshold: float = 0.65,
    embedding_instance: Optional[BaseEmbedding[str]] = None,
    embeddings: Optional[List[List[float]]] = None,
    strategy: Literal["top1", "llm-supervise"] = "top1",
) -> DeduplicationResult:
    ...
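
As a rough illustration of how the two embedding parameters in this signature could interact, the sketch below resolves them into a single list of vectors. Everything in it is hypothetical: SupportsEmbedList is a stand-in for an embedding backend, resolve_embeddings is not part of the proposal, and treating "both supplied" or "neither supplied" as an error is an assumption.

```python
from typing import List, Optional, Protocol


class SupportsEmbedList(Protocol):
    """Hypothetical stand-in for an embedding backend."""

    def embed_list(self, objs: List[str]) -> List[List[float]]: ...


def resolve_embeddings(
    texts: List[str],
    embedding_instance: Optional[SupportsEmbedList] = None,
    embeddings: Optional[List[List[float]]] = None,
) -> List[List[float]]:
    """Return one vector per text, taken from exactly one of the two sources."""
    # Assumption: supplying both sources, or neither, is rejected.
    if (embedding_instance is None) == (embeddings is None):
        raise ValueError(
            "Provide exactly one of 'embedding_instance' or 'embeddings'."
        )
    if embeddings is not None:
        vectors = embeddings
    else:
        assert embedding_instance is not None  # guaranteed by the check above
        vectors = embedding_instance.embed_list(texts)
    if len(vectors) != len(texts):
        raise ValueError("Expected one embedding per input text.")
    return vectors


class CharCountEmbedding:
    """Toy embedder for illustration only: counts of 'a' and 'b' per string."""

    def embed_list(self, objs: List[str]) -> List[List[float]]:
        return [[float(s.count("a")), float(s.count("b"))] for s in objs]


# Either path yields a list with one vector per input text.
from_instance = resolve_embeddings(
    ["ab", "aab"], embedding_instance=CharCountEmbedding()
)
from_precomputed = resolve_embeddings(["x"], embeddings=[[0.5]])
```

Validating the inputs up front keeps the similarity step independent of where the vectors came from.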
@keli-wen keli-wen added the enhancement New feature or request label Jan 28, 2025
@keli-wen keli-wen self-assigned this Jan 28, 2025