
Exit if no start block is found #133

Merged: 52 commits into main from specify-block-height-exit, Jan 18, 2024
Conversation

cam-schultz (Collaborator):

Why this should be merged

Originally in #114, we were planning to implement the catch-up mechanism such that if the starting block was found neither in the database nor via the StartBlockHeight catch-up option, we'd emit a warning and process from genesis (see the linked Discussion).

@minghinmatthewlam suggested we instead emit a warning and exit in this case, leaving it up to the user to explicitly specify that they want to process blocks from genesis.

How this works

  • Updates the StartBlockHeight config option to be a signed integer with a valid range of [-1, maxint], where -1 represents genesis (i.e. block height 0). We can't use 0 directly because Go's JSON unmarshalling does not let us distinguish between an explicitly provided 0 and an absent key: in both cases the Go int is initialized to 0. (A sketch of this scheme follows the list.)
  • Also updates the relayer goroutine wait logic to exit as soon as a single relayer fails. This is related to Implement failure recovery #30. If a relayer goroutine fails consistently, the failing source chain should be disabled in the config.
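A minimal sketch of the sentinel scheme described in the first bullet (the type, field, and method names here are hypothetical, not the PR's actual code):

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// SourceSubnet is a hypothetical stand-in for the relayer's config type.
// With a plain int, an absent key and an explicit 0 both unmarshal to 0,
// so -1 is reserved to mean "start from genesis" and 0 to mean "unset".
type SourceSubnet struct {
	StartBlockHeight int64 `json:"start-block-height"`
}

// startBlock reports the configured start height and whether one was set.
func (s *SourceSubnet) startBlock() (uint64, bool, error) {
	switch {
	case s.StartBlockHeight == -1:
		return 0, true, nil // explicit request to start at genesis
	case s.StartBlockHeight == 0:
		return 0, false, nil // key not provided; fall back to the database
	case s.StartBlockHeight > 0:
		return uint64(s.StartBlockHeight), true, nil
	default:
		return 0, false, errors.New("start-block-height must be in [-1, maxint]")
	}
}

func main() {
	var unset, zero SourceSubnet
	_ = json.Unmarshal([]byte(`{}`), &unset)
	_ = json.Unmarshal([]byte(`{"start-block-height": 0}`), &zero)
	fmt.Println(unset.StartBlockHeight, zero.StartBlockHeight) // 0 0: indistinguishable
}
```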

How this was tested

CI

How is this documented

N/A

@@ -216,6 +220,9 @@ func (c *Config) Validate() error {
return fmt.Errorf("invalid subnetID in configuration. error: %v", err)
}
sourceBlockchainIDs = append(sourceBlockchainIDs, blockchainID)

// Write back to the config
c.SourceSubnets[i] = s
cam-schultz (Collaborator, Author):
This was an existing bug. c.SourceSubnets is a slice of values, not references, so updating s (as occurs in s.Validate) did not update c. We only updated s to populate the allowed destinations list, which was not used anywhere in our tests.
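To illustrate the bug with simplified, hypothetical types (not the actual relayer config):

```go
package main

import "fmt"

// Subnet is a hypothetical stand-in for the config's source subnet type.
type Subnet struct {
	AllowedDestinations []string
}

// Validate mutates the receiver, mirroring how s.Validate populated the
// allowed destinations list in the real config.
func (s *Subnet) Validate() {
	s.AllowedDestinations = append(s.AllowedDestinations, "chain-a")
}

func main() {
	cfg := []Subnet{{}}

	for _, s := range cfg {
		s.Validate() // s is a copy; cfg[0] is untouched
	}
	fmt.Println(len(cfg[0].AllowedDestinations)) // 0

	for i, s := range cfg {
		s.Validate()
		cfg[i] = s // write the modified copy back, as the diff above does
	}
	fmt.Println(len(cfg[0].AllowedDestinations)) // 1
}
```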

main/main.go Outdated
var wg sync.WaitGroup
wg.Add(1) // Single counter to exit as soon as any relayer goroutine fails
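For context, a rough sketch of the single-counter pattern shown above (helper names are illustrative, not the PR's actual code):

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

func relay(id int) error {
	if id == 2 {
		time.Sleep(10 * time.Millisecond)
		return errors.New("simulated failure")
	}
	select {} // healthy relayers run forever
}

func main() {
	var wg sync.WaitGroup
	wg.Add(1) // single counter: the first failure releases Wait
	for i := 0; i < 3; i++ {
		go func(id int) {
			if err := relay(id); err != nil {
				fmt.Printf("relayer %d failed: %v\n", id, err)
				// Note: if a second relayer also failed, this second Done
				// would panic; a channel or sync.Once would be more robust.
				wg.Done()
			}
		}(i)
	}
	wg.Wait()
	fmt.Println("a relayer failed; exiting")
}
```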
Contributor:
Do we just want to get rid of the health check then? It no longer does anything if we exit on the first goroutine failure.

cam-schultz (Collaborator, Author):
We should keep the health check endpoint in order to more easily integrate with our avalanchego infra setup. I'm open to ideas on how best to use that endpoint versus exiting, though...

Thinking about it some more, I think it would be more appropriate to mark the application as unhealthy here rather than exit. That would make it easier to track in the deployment itself, rather than relying on filtering logs.

Contributor:
Still thinking it through, but I think we should probably keep the original health check. We could panic in certain situations, such as when there's no db entry for a chain and no start height set; that gets us the immediate exit in that scenario, though not in all of them.

cam-schultz (Collaborator, Author):
Wait... that's already implemented, lol. I'll just revert the wg change altogether.

cam-schultz (Collaborator, Author):
IMO we should only panic if we encounter something that should otherwise be impossible. Exiting the application should only be done from the top-level main, or via the health check. Otherwise error handling can quickly become intractable.

Contributor:
I agree with those points for something like assert or exit() in C/C++, but panic in Go is a fair bit different. Idiomatically, panics should be thrown for "should be impossible" situations, but also for errors we know are unrecoverable, such as a failed db, misconfiguration, etc.

IMO, panics actually make error handling easier: a panic does everything that bubbling up an error would do (unwinding the stack, executing deferred functions), but it also prints a stack trace, which our current error handling doesn't. It also makes it clear in the code, at the earliest possible point, that if this errors, nothing above it can fix it. The logger we use also provides functions like StopOnPanic to catch the panic, log the error, and then continue panicking. Repos like subnet-evm use panic quite liberally.

All this being said, the failure recovery PR I almost have ready uses a context instead to cancel the other goroutines, which probably works slightly better because it allows the other goroutines to unwind gracefully.
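A hedged sketch of that context-based approach using golang.org/x/sync/errgroup (this is an assumption about the shape of the upcoming PR, not its actual code):

```go
package main

import (
	"context"
	"errors"
	"fmt"

	"golang.org/x/sync/errgroup"
)

func main() {
	// The first relayer to return a non-nil error cancels ctx; the other
	// goroutines see the cancellation and unwind gracefully.
	g, ctx := errgroup.WithContext(context.Background())
	for i := 0; i < 3; i++ {
		id := i
		g.Go(func() error {
			if id == 2 {
				return errors.New("simulated relayer failure")
			}
			<-ctx.Done() // healthy relayers exit when the context is canceled
			return ctx.Err()
		})
	}
	fmt.Println("exited:", g.Wait()) // g.Wait returns the first non-nil error
}
```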

cam-schultz (Collaborator, Author):
Valid points; I'm probably biased by my experience with C++ error handling. I think you hit the nail on the head here: "if this errors, nothing above this can fix it." That said, most of the edge cases we'd likely encounter in the relayer are in theory recoverable by refreshing state as we would at startup, making panicking versus automated recovery somewhat equivalent.

In that case, my strong preference would be to have interfaces return errors rather than have implementations panic so that we leave room for improved error recovery in the future. We should strive to design interfaces such that they can never put the application in an unrecoverable state at all. At a high level that would look something like interfaces such as DestinationClient or Subscriber reporting errors to Relayer or MessageRelayer and either recovering or panicking from there.

cam-schultz and others added 6 commits January 12, 2024 22:21
Co-authored-by: Michael Kaplan <55204436+michaelkaplan13@users.noreply.github.com>
Signed-off-by: cam-schultz <78878559+cam-schultz@users.noreply.github.com>
geoff-vball (Contributor) commented Jan 13, 2024:

The race condition does matter if the application exits or crashes while processing historical blocks. If a new block is processed before the historical blocks are done, the new block will be recorded as the "most recently seen". A crash (or an error thrown by a different relayer goroutine) would then mean that on the next startup, the relayer starts processing from the newest height and skips the historical blocks.

geoff-vball previously approved these changes Jan 16, 2024

geoff-vball (Contributor):

This all looks good to me. I created some tickets for the existing issues I mentioned in the comments.

r.logger.Error("failed to convert latest block to big.Int", zap.Error(err))
return 0, err
}
if startBlockHeight == 0 || latestProcessedBlock.Uint64() > startBlockHeight {
Collaborator:
Sanity checking here: is the startBlockHeight == 0 condition necessary? I think latestProcessedBlock.Uint64() > startBlockHeight is sufficient.

It would also be good to add unit tests for this function.

cam-schultz (Collaborator, Author):
That's correct. Simplified this condition and added a unit test.
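A sketch of what the simplified condition and a table-driven unit test could look like (the function and test names are illustrative, not the PR's actual code):

```go
package relayer

import "testing"

// startingHeight is an illustrative helper: resume from whichever is
// greater, the last processed block in the db or the configured start
// height. With startBlockHeight unset (0), any processed height wins,
// which is why the startBlockHeight == 0 check was redundant.
func startingHeight(latestProcessed, startBlockHeight uint64) uint64 {
	if latestProcessed > startBlockHeight {
		return latestProcessed
	}
	return startBlockHeight
}

func TestStartingHeight(t *testing.T) {
	cases := []struct{ latest, start, want uint64 }{
		{latest: 10, start: 0, want: 10},  // no start height configured
		{latest: 10, start: 5, want: 10},  // db is ahead of the configured start
		{latest: 10, start: 20, want: 20}, // configured start is ahead
	}
	for _, c := range cases {
		if got := startingHeight(c.latest, c.start); got != c.want {
			t.Errorf("startingHeight(%d, %d) = %d, want %d", c.latest, c.start, got, c.want)
		}
	}
}
```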

cam-schultz merged commit 2c071a1 into main Jan 18, 2024
7 checks passed
cam-schultz deleted the specify-block-height-exit branch January 18, 2024 15:19