Exit if no start block is found #133
Conversation
@@ -216,6 +220,9 @@ func (c *Config) Validate() error {
 			return fmt.Errorf("invalid subnetID in configuration. error: %v", err)
 		}
 		sourceBlockchainIDs = append(sourceBlockchainIDs, blockchainID)
+
+		// Write back to the config
+		c.SourceSubnets[i] = s
This was an existing bug. `c.SourceSubnets` is a slice of values, not references, so updating `s` (as occurs in `s.Validate`) did not update `c`. We only updated `s` to populate the allowed destinations list, which was not used anywhere in our tests.
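For illustration, a minimal standalone sketch of the value-semantics pitfall described above, using hypothetical struct and field names rather than the relayer's actual config types:

```go
package main

import "fmt"

// Hypothetical types for illustration only; the relayer's actual config
// structs differ.
type SourceSubnet struct {
	AllowedDestinations []string
}

func (s *SourceSubnet) Validate() error {
	// Imagine Validate populating derived fields on s, as in the real config.
	s.AllowedDestinations = []string{"destination-subnet"}
	return nil
}

type Config struct {
	SourceSubnets []SourceSubnet
}

func main() {
	c := Config{SourceSubnets: []SourceSubnet{{}}}

	// range yields a copy of each slice element, so changes made through the
	// loop variable are lost unless explicitly written back by index.
	for i, s := range c.SourceSubnets {
		_ = s.Validate()       // mutates the copy s, not c.SourceSubnets[i]
		c.SourceSubnets[i] = s // write back to the config to persist it
	}

	fmt.Println(c.SourceSubnets[0].AllowedDestinations) // [destination-subnet]
}
```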
main/main.go (Outdated)
var wg sync.WaitGroup
wg.Add(1) // Single counter to exit as soon as any relayer goroutine fails
Do we just want to get rid of the health check then? It no longer does anything if we exit on the first goroutine failure.
We should keep the health check endpoint in order to more easily integrate with our avalanchego infra setup. Open to ideas on how best to utilize that endpoint versus exiting though...
Thinking about it some more, I think it would be more appropriate to mark the application as unhealthy here rather than exit. That would make it easier to track in the deployment itself, rather than relying on filtering logs.
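For reference, a minimal sketch of what marking the application unhealthy could look like, assuming a plain `net/http` health endpoint and hypothetical names (not the relayer's actual health check implementation):

```go
package main

import (
	"net/http"
	"sync/atomic"
)

// healthy is flipped to false when any relayer goroutine reports a fatal
// error, so the deployment's probes can observe the failure without the
// process exiting or anyone filtering logs.
var healthy atomic.Bool

func init() { healthy.Store(true) }

// markUnhealthy would be called by a goroutine that hits an unrecoverable
// error, instead of exiting the process.
func markUnhealthy() { healthy.Store(false) }

func healthHandler(w http.ResponseWriter, _ *http.Request) {
	if healthy.Load() {
		w.WriteHeader(http.StatusOK)
		return
	}
	w.WriteHeader(http.StatusServiceUnavailable)
}

func main() {
	http.HandleFunc("/health", healthHandler)
	_ = http.ListenAndServe(":8080", nil)
}
```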
Still thinking it through, but I think we should probably keep the original health check. We could panic in certain situations, such as if there's no db entry for a chain and no start height set, which gets us the immediate exit in that scenario, but not in all cases.
Wait... that's already implemented lol. I'll just revert the `wg` change altogether.
IMO we should only panic if we encounter something that should otherwise be impossible. Exiting the application should only be done from the top-level `main`, or via the health check. Otherwise error handling can quickly become intractable.
I agree with those points for something like `ASSERT` or `os.exit()` in C/C++, but panic in Go is a fair bit different. Idiomatically, panics should be thrown for "should be impossible" situations, but also for errors that we know are unrecoverable, such as a failed db, misconfiguration, etc.
IMO, panics actually make error handling easier, as panicking does everything that bubbling up an error would do (unwinds the stack, executing `defer`'d functions), but also prints a stack trace, which our current error handling doesn't do. It also makes it clear in the code, at the earliest possible point, that if this errors, nothing above it can fix it. The logger we use also contains functions like `StopOnPanic` to catch the panic, log the error, then continue panicking. Repos like `subnet-evm` use panic quite liberally.
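As a self-contained illustration of that unwinding behavior (a generic sketch, not the relayer's actual logger API):

```go
package main

import "log"

func processChain() {
	// defer'd functions still run while a panic unwinds the stack, so
	// cleanup happens before the process dies.
	defer log.Println("closing db connection")

	// An unrecoverable condition (failed db, misconfiguration) panics at the
	// earliest point where it is known that nothing above can fix it.
	panic("no db entry for chain and no start height configured")
}

func main() {
	// A top-level recover can log the failure and then re-panic so the stack
	// trace is preserved, similar in spirit to a StopOnPanic-style helper.
	defer func() {
		if r := recover(); r != nil {
			log.Printf("fatal: %v", r)
			panic(r)
		}
	}()
	processChain()
}
```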
All this being said, the failure recovery PR I almost have ready uses a context instead to cancel the other goroutines, which probably works slightly better because it allows the other goroutines to unwind gracefully.
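A rough sketch of that context-based approach, with illustrative names only:

```go
package main

import (
	"context"
	"log"
	"sync"
	"time"
)

// runRelayer stands in for a per-chain relayer goroutine. When any one of
// them hits a fatal error it calls cancel(); the others observe ctx.Done()
// and unwind gracefully instead of being killed mid-operation.
func runRelayer(ctx context.Context, id int, cancel context.CancelFunc) {
	for {
		select {
		case <-ctx.Done():
			log.Printf("relayer %d: shutting down: %v", id, ctx.Err())
			return
		case <-time.After(100 * time.Millisecond):
			if id == 0 { // simulate a fatal error in one goroutine
				log.Printf("relayer %d: fatal error, cancelling peers", id)
				cancel()
				return
			}
		}
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			runRelayer(ctx, id, cancel)
		}(i)
	}
	wg.Wait() // returns once every goroutine has unwound
}
```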
Valid points, I'm probably biased by my experience with C++ error handling. I think you hit the nail on the head here - "if this errors, nothing above this can fix it." That said, most of the edge cases we'd likely encounter in the relayer are in theory recoverable by refreshing the state as we would do at startup, making panicking versus automated recovery somewhat equivalent.
In that case, my strong preference would be to have interfaces return errors rather than have implementations panic, so that we leave room for improved error recovery in the future. We should strive to design interfaces such that they can never put the application in an unrecoverable state at all. At a high level, that would look something like interfaces such as `DestinationClient` or `Subscriber` reporting errors to `Relayer` or `MessageRelayer`, and either recovering or panicking from there.
The race condition does matter if the application exits or crashes while processing historical blocks. If it processes a new block before the historical blocks are done, the new block will be set as the "most recently seen". A crash (or an error thrown by a different relayer goroutine) would mean that on the next startup, the relayer starts processing from the newest height, and skips the historical blocks.
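To illustrate the ordering hazard (hypothetical height store, not the relayer's actual storage API):

```go
package main

import "fmt"

// heightStore stands in for the persisted "most recently seen" block height.
type heightStore struct{ latest uint64 }

func (h *heightStore) commit(height uint64) { h.latest = height }

func main() {
	db := &heightStore{}

	// Historical catch-up is still in progress (blocks 100..199 unprocessed)
	// when a new live block 200 arrives and is committed first.
	db.commit(200)

	// If the process crashes here (or another relayer goroutine errors out),
	// the next startup resumes from db.latest and skips the historical range.
	fmt.Printf("restart would resume from block %d, skipping 100..199\n", db.latest)
}
```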
This all looks good to me. I created some tickets for the existing issues I mentioned in the comments.
relayer/relayer.go (Outdated)
	r.logger.Error("failed to convert latest block to big.Int", zap.Error(err))
	return 0, err
}
if startBlockHeight == 0 || latestProcessedBlock.Uint64() > startBlockHeight {
Sanity checking here: is the `startBlockHeight == 0` condition necessary? I think `latestProcessedBlock.Uint64() > startBlockHeight` is sufficient.
It would also be good to add unit tests for this function.
That's correct. Simplified this condition and added a unit test
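For illustration, a sketch of the simplified comparison together with a table-driven test; the helper name and signature below are hypothetical, not the relayer's actual function:

```go
package relayer

import "testing"

// calculateStartingBlockHeight prefers the latest processed block from the
// database when it is ahead of the configured start height. A separate
// startBlockHeight == 0 check is unnecessary: when no start height is
// configured, any latest processed block is already greater than 0.
func calculateStartingBlockHeight(latestProcessedBlock, startBlockHeight uint64) uint64 {
	if latestProcessedBlock > startBlockHeight {
		return latestProcessedBlock
	}
	return startBlockHeight
}

func TestCalculateStartingBlockHeight(t *testing.T) {
	cases := []struct {
		name                   string
		latestProcessed, start uint64
		want                   uint64
	}{
		{"no configured start height", 50, 0, 50},
		{"db ahead of configured start", 50, 10, 50},
		{"configured start ahead of db", 10, 50, 50},
	}
	for _, c := range cases {
		if got := calculateStartingBlockHeight(c.latestProcessed, c.start); got != c.want {
			t.Errorf("%s: got %d, want %d", c.name, got, c.want)
		}
	}
}
```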
Why this should be merged
Originally in #114, we were planning on implementing the catch-up mechanism such that if the starting block was not found in the database nor in the `StartBlockHeight` catch-up option, we'd emit a warning and process from genesis (see discussion). @minghinmatthewlam suggested we instead emit a warning and exit in this case, leaving it up to the user to explicitly specify that they want to process blocks from genesis.
How this works
Changes the `StartBlockHeight` config option to be a signed integer, with a valid range of [-1, +maxint], with -1 representing genesis (i.e. block height 0). We can't use 0 directly because Go JSON unmarshalling does not let us distinguish between a provided value of 0 and the key not being provided at all, since in both cases the Go int is initialized to 0.
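For illustration, a minimal sketch of why an omitted key and an explicit 0 are indistinguishable after unmarshalling, and how the -1 sentinel avoids that; the JSON key name below is an assumption, not necessarily the relayer's actual config key:

```go
package main

import (
	"encoding/json"
	"fmt"
)

type Config struct {
	// -1 means "explicitly start from genesis" (block height 0). A plain
	// integer field is set to its zero value both when the key is omitted
	// and when 0 is provided, so 0 alone cannot carry that meaning.
	StartBlockHeight int64 `json:"start-block-height"`
}

func main() {
	var omitted, zero, genesis Config
	_ = json.Unmarshal([]byte(`{}`), &omitted)
	_ = json.Unmarshal([]byte(`{"start-block-height": 0}`), &zero)
	_ = json.Unmarshal([]byte(`{"start-block-height": -1}`), &genesis)

	// omitted and zero are indistinguishable; only -1 is unambiguous.
	fmt.Println(omitted.StartBlockHeight, zero.StartBlockHeight, genesis.StartBlockHeight) // 0 0 -1
}
```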
How this was tested
CI
How is this documented
N/A