
Logic for collecting Histogram efficiently using Point Trees #14439

Open
wants to merge 4 commits into base: main
Conversation

@jainankitk (Contributor) commented Apr 4, 2025

Description

This PR adds multi-range traversal logic to collect the histogram on a numeric field indexed as point values for MATCH_ALL cases. Even for non-match-all cases like PointRangeQuery, this logic can be used if the query field == histogram field. For the latter, we need to supply the PointRangeQuery bounds for building the appropriate Ranges to be collected. I need some input from the community on how this can be plugged correctly into the HistogramCollector.

One of the key assumptions is the absence of any deleted documents. Going forward (especially if the percentage of deleted documents is low), we could consider correcting the collected Ranges by subtracting the deleted documents. Although, if I remember correctly, getting doc values for just the deleted documents was a non-trivial task!
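To make the description concrete, here is a minimal, Lucene-free sketch of the idea (all names are hypothetical; the real implementation works against Lucene's point values and BKD leaf blocks): a leaf block whose [min, max] range falls entirely inside one bucket is counted in bulk, and only boundary-straddling blocks are visited value by value.

```java
import java.util.Arrays;

// Hypothetical simplified model: plain sorted arrays stand in for BKD leaf blocks.
public class MultiRangeTraversalSketch {

  /**
   * Counts values per bucket. bucketUpper is a sorted array of exclusive upper
   * bounds; values are assumed to be below the last bound.
   */
  static long[] collect(long[][] leafBlocks, long[] bucketUpper) {
    long[] counts = new long[bucketUpper.length];
    for (long[] block : leafBlocks) {
      long min = block[0];                 // blocks are assumed sorted
      long max = block[block.length - 1];
      int minBucket = bucketOf(min, bucketUpper);
      int maxBucket = bucketOf(max, bucketUpper);
      if (minBucket == maxBucket) {
        // Whole block lies inside one bucket: bulk-add its value count,
        // with no per-value work (this is the skip described above).
        counts[minBucket] += block.length;
      } else {
        // Block crosses a bucket boundary: fall back to per-value collection.
        for (long v : block) {
          counts[bucketOf(v, bucketUpper)]++;
        }
      }
    }
    return counts;
  }

  /** Index of the first upper bound strictly greater than value. */
  static int bucketOf(long value, long[] bucketUpper) {
    int idx = Arrays.binarySearch(bucketUpper, value);
    return idx >= 0 ? idx + 1 : -idx - 1;
  }
}
```

The real change additionally has to deal with multi-dimensional packed bytes, inner tree nodes, and doc IDs rather than raw values, but the bulk-versus-per-value split is the core of it.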

Related issue #13335

@jainankitk jainankitk changed the title Adding logic for collecting Histogram efficiently using Point Trees Logic for collecting Histogram efficiently using Point Trees Apr 4, 2025
@jainankitk (Contributor Author)

@stefanvodita / @jpountz - Would love to get your thoughts on this optimization, and how we can leverage it in Lucene. In a nutshell, it solves the following problem:

Given a sorted, non-overlapping set of intervals (histogram buckets are one example), it collects the matching document counts in a single traversal of the point tree index, skipping over leaf blocks entirely unless the values in a leaf block overlap with more than one interval. This ensures that the number of leaf blocks actually traversed is bounded by the number of buckets, while the remaining leaf blocks are collected in bulk. Hence it can collect the doc counts very efficiently, especially when the ratio of documents to buckets is high.
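As a toy illustration of that bound: in a one-dimensional point tree, the leaf blocks cover consecutive sorted value ranges, so each bucket boundary can fall inside at most one block. The sketch below (hypothetical names; plain arrays standing in for leaf blocks) counts the boundary-straddling blocks, which are the only ones needing per-value traversal:

```java
// Hypothetical model of the bound: with globally sorted values split into
// consecutive leaf blocks, at most (#boundaries) blocks straddle a boundary.
public class CrossingBlocksBound {

  /** Number of leaf blocks whose [min, max] range contains a bucket boundary. */
  static int countCrossingBlocks(long[] sortedValues, int blockSize, long[] boundaries) {
    int crossing = 0;
    for (int start = 0; start < sortedValues.length; start += blockSize) {
      int end = Math.min(start + blockSize, sortedValues.length) - 1;
      long min = sortedValues[start];
      long max = sortedValues[end];
      for (long b : boundaries) {
        if (min < b && b <= max) { // boundary falls strictly inside this block
          crossing++;
          break;                   // count each block at most once
        }
      }
    }
    return crossing;
  }
}
```

Every other block is entirely inside one bucket and can be counted in bulk, which is why the per-value cost stays bounded by the number of buckets regardless of document count.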

@stefanvodita (Contributor) left a comment


Sorry, I only had a quick look. Is the optimisation here analogous to the one HistogramLeafCollector does with the skipper?

@jainankitk (Contributor Author)

> Sorry, I only had a quick look. Is the optimisation here analogous to the one HistogramLeafCollector does with the skipper?

No, this approach is different from the skipper, as it leverages PointValues instead of DocValues for computing the buckets.

@stefanvodita (Contributor)

I didn't mean to imply that the two solutions are the same, apologies if that's how it came across.

> Need some inputs from the community on how it can be plugged correctly into the HistogramCollector

I think you could start in HistogramCollector.getLeafCollector (code). Right now we throw an exception if the field we're using doesn't have doc values (code). You'd need a new branch for the case you want to implement, and a new LeafCollector similar to the ones already in the file. Having that would make it easier to think through the next steps.
Let me know if this doesn't answer the question @jainankitk; maybe you'd already gone through this and were looking for a different answer.
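For illustration only, here is a hypothetical sketch of the kind of branching being suggested, with the per-segment strategy choice pulled out as a plain function (none of these names exist in Lucene; the real decision would inspect the segment's FieldInfo and the query):

```java
// Hypothetical dispatch model: pick a collection strategy per segment
// based on what the field and query offer.
public class CollectorDispatchSketch {

  enum Strategy { POINT_TREE_BULK, DOC_VALUES_SKIPPER, DOC_VALUES, UNSUPPORTED }

  static Strategy choose(
      boolean hasPoints,
      boolean hasDocValues,
      boolean hasSkipper,
      boolean queryIsMatchAllOrSameFieldRange) {
    if (hasPoints && queryIsMatchAllOrSameFieldRange) {
      return Strategy.POINT_TREE_BULK;    // new branch proposed in this PR
    } else if (hasDocValues && hasSkipper) {
      return Strategy.DOC_VALUES_SKIPPER; // existing skipper-based collector
    } else if (hasDocValues) {
      return Strategy.DOC_VALUES;         // existing per-doc collector
    }
    return Strategy.UNSUPPORTED;          // today this case throws
  }
}
```

Each strategy would then map to its own LeafCollector, mirroring how the existing collectors in the file are selected.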

At a higher level, I'm curious if you had a use-case in mind.

@jainankitk (Contributor Author)

> I didn't mean to imply that the two solutions are the same, apologies if that's how it came across.

Not at all. I was initially confused by the skipper logic myself; only after spending some time did I realize this approach is slightly different. So, thanks for reiterating the question.

> I think you could start in HistogramCollector.getLeafCollector (code). Right now we throw an exception if the field we're using isn't doc values (code).

Currently, a Collector doesn't need to be aware of the Query itself. Collectors are designed to collect individual docIds, or to use a DocIdStream from the scorer. But this custom collector does not need the scorer to provide documents; it can bulk-collect documents, assuming a MATCH_ALL query or a PointRangeQuery where PointRangeQuery.field == histogram.field. Otherwise, it should fall back to the traditional method of collecting matching documents.
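A hedged sketch of that eligibility check, with a toy query model standing in for Lucene's MatchAllDocsQuery and PointRangeQuery (all names here are hypothetical):

```java
// Hypothetical eligibility check: bulk point-tree collection applies only
// for match-all queries, or point range queries on the histogram field itself.
public class BulkCollectEligibility {

  // Toy stand-in for the real query classes.
  record Query(String type, String field) {}

  /** True when the point-tree bulk path applies; otherwise fall back to scorer-driven collection. */
  static boolean canBulkCollect(Query query, String histogramField) {
    if ("MATCH_ALL".equals(query.type())) {
      return true; // every document matches, so the tree counts are exact
    }
    // A range query on the same field can reuse its bounds to clip the ranges.
    return "POINT_RANGE".equals(query.type()) && histogramField.equals(query.field());
  }
}
```

In the range-query case, the query's lower and upper bounds would additionally clip the first and last collected Ranges, as noted in the PR description.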

> At a higher level, I'm curious if you had a use-case in mind.

This optimization can be applied to the following use cases:

  • Number of sales per price range (0-50, 50-100, 100-250, ...)
  • Number of visits to a website for each day in a month
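For the second use case, fixed-width bucketing reduces to an integer floor division; a simplified sketch (ignoring time zones, leap seconds, and variable month lengths, which a real date histogram must handle):

```java
// Hypothetical fixed-width bucketing: map an epoch-millis timestamp to a
// day-of-month bucket index relative to the start of the month.
public class DailyBucketSketch {

  static final long MILLIS_PER_DAY = 86_400_000L;

  /** 0-based day index within the month; negative inputs floor correctly too. */
  static int dayBucket(long timestampMillis, long monthStartMillis) {
    return (int) Math.floorDiv(timestampMillis - monthStartMillis, MILLIS_PER_DAY);
  }
}
```

With fixed-width buckets like these, the interval boundaries needed by the point-tree traversal are trivial to enumerate up front.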

Just as a data point, this change helped us improve date histogram latency from 5168 ms to 160 ms (a ~32x speedup!) for the big5 workload in OpenSearch.
