Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cleanup controller and MSE broker entry points to keep error codes an… #15277

Open
wants to merge 25 commits into
base: master
Choose a base branch
from

Conversation

gortiz
Copy link
Contributor

@gortiz gortiz commented Mar 14, 2025

This PR fixes some of the issues described in #14950.

It is mainly a clean up of MultiStageBrokerRequestHandler and PinotQueryResource to make parts reusable and query handling more consistent.

Specifically, it fixes:

  1. On errors, Pinot controller query endpoints inconsistently return broker-like JSON error payloads (if the broker detects the error) or plain text (if the controller itself detected the error)
  2. The Pinot controller detected errors are not logged.
  3. MSE errors don't include the actual error code. Instead, error code 200 is always returned.

And it also improves (but not totally fixes) MSE errors including the stack trace

Notes for reviewers:

  1. The main that have been changed are:
    • pinot-broker/src/main/java/org/apache/pinot/broker/requesthandler/MultiStageBrokerRequestHandler.java
      • A slight refactor that makes code easier to extend. It also defines 3 level of errors (green, yellow and red) used to decide whether to log or not.
    • pinot-controller/src/main/java/org/apache/pinot/controller/api/resources/PinotQueryResource.java
      • Most methods return the streaming output instead of a string. This is more efficient (we don't need to allocate all the responses from brokers, which could be very expensive when returning large amounts of data!) and can also be used to return actual errors.
    • pinot-query-planner/src/main/java/org/apache/pinot/query/QueryEnvironment.java
      I introduced the concept of CompiledQuery, which is the output of applying Calcite. This simplifies the code and makes it more efficient, given that we could be optimizing a query more than once previously.
  2. The diff between some classes is larger than expected because I've had to add a new try-with-resources. I recommend to review the PR with hidden whitespaces

@codecov-commenter
Copy link

codecov-commenter commented Mar 14, 2025

Codecov Report

Attention: Patch coverage is 27.56410% with 226 lines in your changes missing coverage. Please review.

Project coverage is 63.57%. Comparing base (59551e4) to head (34909ba).
Report is 1880 commits behind head on master.

Files with missing lines Patch % Lines
...requesthandler/MultiStageBrokerRequestHandler.java 0.00% 109 Missing ⚠️
...t/controller/api/resources/PinotQueryResource.java 14.51% 52 Missing and 1 partial ⚠️
.../java/org/apache/pinot/query/QueryEnvironment.java 60.37% 39 Missing and 3 partials ⚠️
...main/java/org/apache/pinot/common/utils/Timer.java 0.00% 10 Missing ⚠️
...g/apache/pinot/query/parser/utils/ParserUtils.java 0.00% 7 Missing ⚠️
...pinot/query/CalciteContextExceptionClassifier.java 44.44% 2 Missing and 3 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #15277      +/-   ##
============================================
+ Coverage     61.75%   63.57%   +1.82%     
- Complexity      207     1459    +1252     
============================================
  Files          2436     2783     +347     
  Lines        133233   156947   +23714     
  Branches      20636    24070    +3434     
============================================
+ Hits          82274    99781   +17507     
- Misses        44911    49653    +4742     
- Partials       6048     7513    +1465     
Flag Coverage Δ
custom-integration1 100.00% <ø> (+99.99%) ⬆️
integration 100.00% <ø> (+99.99%) ⬆️
integration1 100.00% <ø> (+99.99%) ⬆️
integration2 0.00% <ø> (ø)
java-11 63.55% <27.56%> (+1.84%) ⬆️
java-21 63.47% <27.56%> (+1.84%) ⬆️
skip-bytebuffers-false 63.57% <27.56%> (+1.82%) ⬆️
skip-bytebuffers-true 63.44% <27.56%> (+35.71%) ⬆️
temurin 63.57% <27.56%> (+1.82%) ⬆️
unittests 63.57% <27.56%> (+1.82%) ⬆️
unittests1 56.10% <47.51%> (+9.20%) ⬆️
unittests2 34.18% <18.58%> (+6.45%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@yashmayya yashmayya added cleanup multi-stage Related to the multi-stage query engine labels Mar 17, 2025
Comment on lines +98 to +102
<dependency>
<groupId>org.assertj</groupId>
<artifactId>assertj-core</artifactId>
<scope>test</scope>
</dependency>
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is time to start using better assertions.

Comment on lines -502 to -535
/**
* Test invalid queries which should cause query exceptions.
*
* @throws Exception
*/
public void testQueryExceptions()
throws Exception {
testQueryException("POTATO", QueryErrorCode.SQL_PARSING);

// Ideally, we should attempt to unify the error codes returned by the two query engines if possible
testQueryException("SELECT COUNT(*) FROM potato",
useMultiStageQueryEngine()
? QueryErrorCode.QUERY_PLANNING : QueryErrorCode.TABLE_DOES_NOT_EXIST);

testQueryException("SELECT POTATO(ArrTime) FROM mytable",
useMultiStageQueryEngine()
? QueryErrorCode.QUERY_PLANNING : QueryErrorCode.QUERY_VALIDATION);

// ArrTime expects a numeric type
testQueryException("SELECT COUNT(*) FROM mytable where ArrTime = 'potato'",
useMultiStageQueryEngine()
? QueryErrorCode.QUERY_EXECUTION : QueryErrorCode.QUERY_VALIDATION);

// Cannot use numeric aggregate function for string column
testQueryException("SELECT MAX(OriginState) FROM mytable where ArrTime > 5",
QueryErrorCode.QUERY_VALIDATION);
}

private void testQueryException(String query, QueryErrorCode errorCode)
throws Exception {
JsonNode jsonObject = postQuery(query);
assertEquals(jsonObject.get("exceptions").get(0).get("errorCode").asInt(), errorCode.getId());
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For an unknown reason these tests were only executed in realtime integration tests. I don't think that makes sense. Instead now these tests are executed in ErrorCodesIntegrationTests and their subclasses. They are also split into different tests so we can have better coverage even if one of them fail. They are also executed against both controllers and brokers. So there are 2 configuration axes (MSE vs SSE and broker vs controller) and therefore 4 classes that actually run these tests. As a result, we need to startup the cluster 4 extra times. That is something we need to improve in the future (ie using JUnit 5 it would be trivial to do)

Comment on lines -75 to -80
if (tableName != null) {
if (_resolvedTables == null) {
_resolvedTables = new HashSet<>();
}
_resolvedTables.add(tableName);
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bziobrowski I don't think we need this anymore.

@@ -439,51 +439,51 @@ public void testWindowFunctions() {
"SELECT col1, col2, SUM(col3) OVER (PARTITION BY col1 ORDER BY col3 RANGE BETWEEN UNBOUNDED PRECEDING AND 1 "
+ "FOLLOWING) FROM a";
e = expectThrows(RuntimeException.class, () -> _queryEnvironment.planQuery(sumQueryWithCustomRangeWindow));
assertTrue(e.getCause().getCause().getMessage()
assertTrue(e.getCause().getMessage()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before the exception that was thrown was an ExecutionException whose cause was the actual exception. Now we get rid of the ExecutionException and therefore we should not call getCause

@@ -644,13 +644,13 @@
"description": "nested aggregation",
"sql": "SELECT min(max(int_col)) FROM {tbl}",
"comments": ".*Aggregate expressions cannot be nested.",
"expectedException": "Error composing query plan for.*"
"expectedException": ".*Aggregate expressions cannot be nested"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error message now is better than before. In fact, we shouldn't have been asserting the original message.

@gortiz gortiz requested a review from bziobrowski March 19, 2025 11:10
Copy link
Contributor

@yashmayya yashmayya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @gortiz, this is a really nice cleanup!

/// assertions to be made before failing.
///
/// See [Soft Assertions in AssertJ docs](https://assertj.github.io/doc/#assertj-core-soft-assertions)
public QueryErrorAssert.Soft softFirstException() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When would we explicitly want non-soft assertions?

Copy link
Contributor Author

@gortiz gortiz Mar 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When only one condition is needed (then soft assertions are ok, but require more code to be used because they need to be closed) or when one condition only make sense if the next one is true (ie an assert verifies that something is not null and the next one uses it as a not null value).

There are some cases like that in this PR

return new QueryErrorAssert.Soft(actual.get("exceptions").get(0));
}

public static class Soft extends AbstractSoftAssertions implements AutoCloseable {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused?

Copy link
Contributor Author

@gortiz gortiz Mar 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is unused, but contrary to the other Soft class, this is the standard way to create soft assertions, so it is useful to keep it here in case we want to use soft assertions in the future for QueryAssert. I think it will be very useful once we start adding assertions on success scenarios.

Comment on lines 708 to 710
assertThat(response.get("exceptions").get(0).get("message").asText())
.as("First exception message")
.contains("Illegal base64 character");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use QueryAssert here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I've modified this code before I created QueryAssert. Changing it.

return Math.max(remainingTime, 0);
public long getRemainingTimeMs() {
long remainingNs = _deadlineNs - _clock.nanos();
return Math.max(remainingNs / 1000, 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be remainingNs / 1000000?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep. My bad. BTW, I've changed this class because nanos are recommended over millis and also because it was useful to add some extra methods

} catch (Throwable t) {
throw new RuntimeException("Error composing query plan for: " + sqlQuery, t);
try (CompiledQuery compiledQuery = compile(sqlQuery, sqlNodeAndOptions)) {
return compiledQuery.planQuery(requestId);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Neat refactor!

@@ -105,47 +104,41 @@ public class PinotQueryResource {
@POST
@Path("sql")
@ManualAuthorization // performed by broker
public String handlePostSql(String requestJsonStr, @Context HttpHeaders httpHeaders) {
public StreamingOutput handlePostSql(String requestJsonStr, @Context HttpHeaders httpHeaders) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the purpose of changing the return type to StreamingOutput?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The advantage of returning a StreamingOutput instead of a String is double:

  1. We don't need to allocate the response on heap. Instead we can just pipeline the response from the broker. We use the same trick in LocalLogFileServer and TablesResource.
  2. We can start to pipeline results to the customer as soon as the first bytes from broker arrive. Before we had to wait until the whole response was received in the controller.

Comment on lines +627 to +628
case INTERNAL:
case UNKNOWN:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are these "yellow" errors and not "red"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Red errors are errors we didn't catch and somehow bubbled up to the handleRequest. By definition, QueryExceptions are never red exceptions. Even if they are internal or unknown, we already caught them and explicitly set a user error message for them.

Colors are not about how problematic the error is. Instead, they are focused on how we want to log the error and how/what send to used.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cleanup multi-stage Related to the multi-stage query engine
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants