Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Group By Truncates at Limit in Combine Operator #14956

Open
ankitsultana opened this issue Jan 30, 2025 · 4 comments
Open

Group By Truncates at Limit in Combine Operator #14956

ankitsultana opened this issue Jan 30, 2025 · 4 comments
Labels

Comments

@ankitsultana
Copy link
Contributor

Queries such as the following return incorrect results in most cases because the Combine Operator truncates the IndexedTable at the Query Limit. Moreover there are no warnings in the response:

SELECT COUNT(*), cityID
FROM my_table
GROUP BY cityID
LIMIT 10

Relevant code for this is here:

// When there is no ORDER BY, trim is not required because the indexed table stops accepting new groups once the
// result size is reached
if (!hasOrderBy) {
int resultSize;
if (hasHaving) {
// Keep more groups when there is HAVING clause
resultSize = trimSize;
} else {
// TODO: Keeping only 'LIMIT' groups can cause inaccurate result because the groups are randomly selected
// without ordering. Consider ordering on group-by columns if no ordering is specified.
resultSize = limit;
}
int initialCapacity = getIndexedTableInitialCapacity(resultSize, numGroups, minInitialIndexedTableCapacity);
return getTrimDisabledIndexedTable(dataSchema, false, queryContext, resultSize, initialCapacity, numThreads,
executorService);
}

I think we can do a couple of things to improve this:

  1. Add a server level config to control the trim for Group By without Order-By.
  2. Set a flag in the response to indicate a trim was done at server level.

Marking this as a bug because the documentation for Grouping Algorithm does not call this out explicitly: https://docs.pinot.apache.org/users/user-guide-query/query-syntax/grouping-algorithm

@Jackie-Jiang
Copy link
Contributor

This is tracked under #11924

@Jackie-Jiang
Copy link
Contributor

Duplicate of #11706

@Jackie-Jiang Jackie-Jiang marked this as a duplicate of #11706 Jan 30, 2025
@anandkrshaw
Copy link
Contributor

@ankitsultana @Jackie-Jiang
An implicit order by will not harm. right ?
That looks to be an cleaner and consistent approach.

@ankitsultana
Copy link
Contributor Author

It will help a bit but from our perspective the main requirement is to return a warning to indicate that results are partial.

There are some other scenarios of partial results that we want to handle too and one of our engineers will take a stab at it in the next month or two.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants