Logic for collecting Histogram efficiently using Point Trees #14439
Conversation
@stefanvodita / @jpountz - I would love to get your thoughts on this optimization and how we can leverage it in Lucene. In a nutshell, it solves the following problem: given a sorted, non-overlapping set of intervals (histogram buckets are one example), it collects the matching document count for each interval in a single traversal of |
Sorry, I only had a quick look. Is the optimisation here analogous to the one `HistogramLeafCollector` does with the skipper?
No, this approach is different from the skipper as it leverages |
I didn't mean to imply that the two solutions are the same, apologies if that's how it came across.
Let me know if this doesn't answer the question @jainankitk, maybe you'd already gone through this and you were looking for a different answer. At a higher level, I'm curious if you had a use-case in mind. |
Not at all. I was initially confused by the skipper logic too, and only after spending some time on it did I realize that this approach is slightly different. So, thanks for reiterating the question.
Currently,
This optimization can be applied to the following use cases:
Just as a data point, this change helped us improve date histogram latency from 5168 ms to 160 ms (~32x!!) for big5 workload in OpenSearch |
Description
This PR adds multi-range traversal logic to collect the histogram on a numeric field indexed as pointValues for MATCH_ALL cases. Even for non-match-all cases such as `PointRangeQuery`, this logic can be used if the query field is the same as the histogram field. For the latter, the `PointRangeQuery` bounds need to be supplied so that the appropriate `Ranges` to be collected can be built. I need some input from the community on how this can be plugged correctly into the `HistogramCollector`.
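To make the multi-range traversal idea concrete, here is a minimal sketch under simplifying assumptions. It does not use Lucene's point tree API: `Node`, `build`, and `count` are hypothetical names, the tree is a toy binary tree over sorted `long` values, and buckets are half-open `[lower, upper)` intervals. It only illustrates how a subtree that falls entirely inside the active bucket can be counted in one step, while the active bucket only ever moves forward.

```java
import java.util.Arrays;

/** Toy illustration only: NOT Lucene's point tree API. All names are hypothetical. */
class MultiRangeSketch {

    /** A node covering the values in [min, max]; leaves hold sorted values directly. */
    static final class Node {
        final long min, max;
        final int count;
        final long[] values;    // non-null only for leaves
        final Node left, right; // non-null only for inner nodes

        Node(long[] sortedValues) {
            this.values = sortedValues;
            this.min = sortedValues[0];
            this.max = sortedValues[sortedValues.length - 1];
            this.count = sortedValues.length;
            this.left = null;
            this.right = null;
        }

        Node(Node left, Node right) {
            this.values = null;
            this.left = left;
            this.right = right;
            this.min = left.min;
            this.max = right.max;
            this.count = left.count + right.count;
        }
    }

    /** Builds a balanced tree over sorted[from, to) with at most leafSize values per leaf. */
    static Node build(long[] sorted, int from, int to, int leafSize) {
        if (to - from <= leafSize) {
            return new Node(Arrays.copyOfRange(sorted, from, to));
        }
        int mid = (from + to) >>> 1;
        return new Node(build(sorted, from, mid, leafSize), build(sorted, mid, to, leafSize));
    }

    private final long[] lower, upper, counts; // buckets are [lower[i], upper[i]), sorted, non-overlapping
    private int active = 0;                    // index of the bucket currently being filled

    private MultiRangeSketch(long[] lower, long[] upper) {
        this.lower = lower;
        this.upper = upper;
        this.counts = new long[lower.length];
    }

    private void collect(Node node) {
        // Buckets ending at or before node.min cannot match this node or anything after it.
        while (active < lower.length && upper[active] <= node.min) active++;
        if (active == lower.length) return;   // no buckets left
        if (node.max < lower[active]) return; // node lies in a gap before the active bucket
        if (node.min >= lower[active] && node.max < upper[active]) {
            counts[active] += node.count;     // whole subtree falls into one bucket
            return;
        }
        if (node.values != null) {            // leaf crossing a boundary: check each value
            for (long v : node.values) {
                while (active < lower.length && upper[active] <= v) active++;
                if (active == lower.length) return;
                if (v >= lower[active]) counts[active]++;
            }
            return;
        }
        collect(node.left);                   // in-order traversal keeps values ascending
        collect(node.right);
    }

    /** Counts values per bucket in one traversal of the toy tree. */
    static long[] count(long[] sortedValues, long[] lower, long[] upper, int leafSize) {
        MultiRangeSketch sketch = new MultiRangeSketch(lower, upper);
        sketch.collect(build(sortedValues, 0, sortedValues.length, leafSize));
        return sketch.counts;
    }

    public static void main(String[] args) {
        long[] values = {1, 2, 3, 5, 8, 13, 21, 34, 55, 89};
        long[] counts = count(values, new long[] {0, 10, 50}, new long[] {10, 50, 100}, 2);
        System.out.println(Arrays.toString(counts)); // prints [5, 3, 2]
    }
}
```

In this toy example, the subtree holding 1 through 8 is counted with a single addition because it fits entirely inside the first bucket; individual values are only inspected in leaves that straddle a bucket boundary.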
One of the key assumptions is the absence of any deleted documents. Going forward (especially if the percentage of deleted documents is low), we could consider correcting the collected `Ranges` by subtracting the deleted documents. Although, if I remember correctly, getting doc values for just the deleted documents was a non-trivial task!

Related issue #13335
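The correction for deleted documents floated above could look roughly like the following sketch. It assumes the field values of the deleted documents are already available as an array, which is exactly the part described as non-trivial; `bucketOf` and `correct` are hypothetical helpers, not existing Lucene methods, and buckets are again half-open `[lower, upper)` intervals that are sorted and non-overlapping.

```java
import java.util.Arrays;

/** Hypothetical sketch: subtract deleted documents from per-bucket counts. */
class DeletedDocCorrection {

    /** Returns the index of the bucket containing value, or -1 if it falls in no bucket. */
    static int bucketOf(long value, long[] lower, long[] upper) {
        int idx = Arrays.binarySearch(lower, value);
        if (idx < 0) {
            idx = -idx - 2; // last bucket whose lower bound is <= value
        }
        return (idx >= 0 && value < upper[idx]) ? idx : -1;
    }

    /** Returns a copy of counts with one decrement per deleted value that lands in a bucket. */
    static long[] correct(long[] counts, long[] deletedValues, long[] lower, long[] upper) {
        long[] corrected = counts.clone();
        for (long v : deletedValues) {
            int bucket = bucketOf(v, lower, upper);
            if (bucket >= 0) {
                corrected[bucket]--;
            }
        }
        return corrected;
    }

    public static void main(String[] args) {
        long[] lower = {0, 10, 50};
        long[] upper = {10, 50, 100};
        long[] corrected = correct(new long[] {5, 3, 2}, new long[] {3, 21, 200}, lower, upper);
        System.out.println(Arrays.toString(corrected)); // prints [4, 2, 2]
    }
}
```

The cost of this step grows with the number of deleted documents rather than with the size of the index, which is why it would only be attractive when the deleted percentage is low.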