Slow dataset queries #1665

Open
forgetso opened this issue Feb 1, 2025 · 1 comment
Labels
bug (Something isn't working), dev (Product development), size-s

Comments

Member

forgetso commented Feb 1, 2025

We are using $sample with a size of 2 when getting captchas. This is causing slow queries on the nodes. We need to change this approach as follows:

  1. Create an index on { datasetId: 1, solved: 1 }

  2. Instead of $sample, use a random selection method to improve performance. For example:

  • Add a random field to each document at insertion time.
  • Index this field.
  • Query using $gte or $lte to efficiently retrieve random documents.
  3. Use $limit before $sample

Instead of sampling from the entire dataset, limit the query first:

db.captchas.aggregate([
  { $match: { datasetId: "0xe666b35451f302b9fccfbe783b1de9a6a4420b840abed071931d68a9ccc1c21d", solved: true } },
  { $limit: 1000 },  // Get a subset first
  { $sample: { size: 2 } },  // Then sample from that subset
  { $project: { datasetId: 1, datasetContentId: 1, captchaId: 1, captchaContentId: 1, items: 1, target: 1 } }
]);

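This caps the number of documents that $sample has to consider at 1,000. Note that the sample is then drawn only from the first 1,000 matching documents rather than from the whole dataset.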
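The random-field option from item 2 can be sketched roughly as follows. This is an illustration only, not the project's implementation: the field name `random`, the helper names, and the compound index mentioned in the comments are all assumptions. The in-memory `pickRandom` function stands in for the database so the selection logic (range query from a random pivot, wrapping around when too few documents are found) can be followed without a running node.

```javascript
// Sketch of the random-field approach. The field name `random` and every
// helper name below are hypothetical, not taken from the project.

// Insertion time: tag each captcha with a uniform value in [0, 1).
function withRandomField(doc, rng = Math.random) {
  return { ...doc, random: rng() };
}

// Build the two range queries for a given pivot. Assuming a compound index
// such as { datasetId: 1, solved: 1, random: 1 }, each could be run in
// mongosh along the lines of:
//   db.captchas.find(q.primary.filter).sort(q.primary.sort).limit(size)
function randomRangeQueries(datasetId, pivot) {
  const base = { datasetId, solved: true };
  return {
    primary: { filter: { ...base, random: { $gte: pivot } }, sort: { random: 1 } },
    // Wrap around to the smallest values when the primary query runs short.
    fallback: { filter: { ...base, random: { $lt: pivot } }, sort: { random: 1 } },
  };
}

// In-memory stand-in for running those two queries against the collection.
function pickRandom(docs, datasetId, size, pivot = Math.random()) {
  const pool = docs
    .filter((d) => d.datasetId === datasetId && d.solved === true)
    .sort((a, b) => a.random - b.random);
  const primary = pool.filter((d) => d.random >= pivot).slice(0, size);
  if (primary.length === size) return primary;
  const fallback = pool
    .filter((d) => d.random < pivot)
    .slice(0, size - primary.length);
  return primary.concat(fallback);
}

// Small made-up data set used only to exercise the logic.
const docs = [
  { captchaId: "a", datasetId: "ds1", solved: true, random: 0.1 },
  { captchaId: "b", datasetId: "ds1", solved: true, random: 0.4 },
  { captchaId: "c", datasetId: "ds1", solved: false, random: 0.5 },
  { captchaId: "d", datasetId: "ds1", solved: true, random: 0.9 },
  { captchaId: "e", datasetId: "ds2", solved: true, random: 0.95 },
];

const picked = pickRandom(docs, "ds1", 2, 0.8).map((d) => d.captchaId);
console.log(picked);
```

One trade-off to keep in mind: documents that sit next to each other in `random` order tend to be returned together, and a document's chance of being picked depends on the gap before it, so the result is only approximately uniform. Using a separate pivot for each document requested is one possible way to reduce that correlation.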

forgetso added the bug and dev labels Feb 1, 2025
Member

goastler commented Feb 3, 2025

Aggregate results have no guaranteed ordering, so you don't need the random field.

forgetso added the size-s label Feb 4, 2025