Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle expired refresh tokens #237

Merged
merged 18 commits into from
Feb 26, 2025
Merged

Conversation

pkalita-lbl
Copy link
Collaborator

Fixes #118

Apologies in advance for this novel 👇🏻

Background

A quick reminder about how authenticated requests to nmdc-server API endpoints work and what is currently implemented:

  1. When a user logs in they are issued an access token and a refresh token.
  2. The access token is sent with API requests; the refresh token is stored for later. The access token is valid for 24 hours.
  3. Once the access token expires, subsequent API requests made with it will result in a 401 Unauthorized response.
  4. At that point the refresh token needs to be submitted to the /auth/refresh endpoint. If the refresh token is valid (it is valid for 365 days), the /auth/refresh endpoint will return a new access token.
  5. Then the original API request that generated the 401 Unauthorized response can be retried with the new access token.

What isn't currently being handled is if the call to /auth/refresh in step 4 also returns a 401 Unauthorized response because the refresh token has expired.

Summary

The changes here might be considered what our colleague Mark Miller would call a "belt-and-suspenders" approach. Although I'll admit there might be 2 belts and 3 pairs of suspenders here. In other words, there is a lot of redundancy. The reason for the redundancy is that the need for offline support (producing paused API requests) and some oddities about when Android devices resume paused mutations produce a lot of odd edge cases.

At a high level, the approach is two-fold:

  • Preemptively log the user out when their refresh token expires in order to prevent them from taking any actions that would produce an API request that cannot be fulfilled
  • If an API request does trigger a failed token exchange, catch that specific error and log the user out

Details

  • A new hook has been added (src/useLogoutAndRedirect.ts) to facilitate the log out action. The function provided by this hook differs from the logout function provided by the Store in that it also does a bit of offline query cleanup (note: just queries! not mutations!), presents a toast message to the user, and returns them to the welcome screen.
  • These changes are related to preemptively logging the user out when the refresh token expires:
    • In src/Store.tsx we now track when the time that the refresh token will expire. This is done by decoding the JWT refresh token and extracting the exp field.
    • A new component (RefreshTokenExpirationMonitor) picks up the refresh token expiration time and, if that time is soon-ish, sets up a timer to perform the logout action.
    • As a convenience to the user (and also as a debugging aid to the developer), the refresh token expiration time is now shown in the User section of the Settings tab (SettingsUserList).
  • These changes are related to catching and recovering from a case where an API request does somehow go out and fail due to an expired refresh token:
    • In src/api.ts we define a new Error subclass, RefreshTokenExchangeError. This error gets thrown if a call to /auth/refresh goes out and fails due to a 401 Unauthorized response or can't go out due to a missing refresh token.
    • Normally React Query swallows up errors thrown by query and mutation functions so that it can possibly retry them or ultimately invoke the query/mutation onError callbacks. These changes to instruct React Query to rethrow the error in the render phase (throwOnError) if it is an instance of RefreshTokenExchangeError. The error can then be caught by a new error boundary TokenRefreshErrorBoundary. Finally when TokenRefreshErrorBoundary catches an error it can perform the logout action.

      Why all that rigamarole with error boundaries instead of using the query/mutation onError callbacks in the first place? First, individual queries don't have an onError callback -- only individual mutations do. You can have a global onError callback for all queries using QueryCache options, but that's outside of the React component tree, so you couldn't utilize useStore, useHistory, etc from there. Second, even if you could figure out what to do with queries, for mutations you'd still have to remember to put the correct onError handler on each individual mutation and that felt, ahem, error prone. For more details see: https://tkdodo.eu/blog/react-query-error-handling

    • The TokenRefreshErrorBoundary is wrapped around pages where we make authenticated API requests. Mostly this is handled by adding it to AuthRoute. The one exception is the HomePage, which does make authenticated API requests if you are logged in, but is also accessible if you are not. So that needed to be wrapped with TokenRefreshErrorBoundary in Router.
  • These changes are about handling mutations that were paused due to being offline. When the app is offline and a mutation is requested, the mutation gets created in a "paused" state and held there for later. There are broadly two cases to consider.
    • If the user makes offline mutations, then closes the app, and reopens the app some time in the future those paused mutations will be rehydrated from the cache. Currently we use the rehydration onSuccess callback (src/QueryClientProvider.tsx) as an opportunity to resume the paused mutations. However the refresh token might have expired while the app was closed. So these changes add a check to see if the user is logged in before resuming because if they're not logged in those resumed mutation are guaranteed to fail. Since paused mutation might not have been resumed after the cache rehydration, we also want to attempt to resume any paused mutations just after login (TokenPage.tsx).
    • The second use case is: the user makes offline mutations, gets logged out due to the refresh token timeout, keeps the app open and then regains network connection. In this case, React Query really really wants to resume those paused mutations immediately after the network reconnect. But if that happens, those mutation requests will go out, fail, and then all that offline work will be lost. The only way I could find to interrupt that process is by making a custom subclass of one of the React Query classes (NmdcQueryClient) where I override the resumePausedMutations method so that it includes a check for an active refresh token. If it's missing the offline mutations will stay paused until the next login.
  • Miscellaneous stuff
    • I initially thought that some features added in a recent version of React Query could help me solve a problem. It turns out that wasn't the case. But I left the upgrade in here because it didn't seem to harm anything.
    • During testing I got sick of having to scroll down to get to the Save button in the StudyForm component. So I made it always visible at the bottom of the screen. In addition to solving my own annoyance I think it's a good UX improvement regardless. It did require adding some additional CSS color variables to make it look nice.

@pkalita-lbl pkalita-lbl marked this pull request as ready for review February 24, 2025 22:03
@eecavanna
Copy link
Collaborator

I'll review this on Wednesday.

Copy link
Collaborator

@eecavanna eecavanna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Submitting the review now so that my review comments become public, and then I will re-read the PR description and post a follow-up comment.

Comment on lines +25 to +29
let logoutTimer: NodeJS.Timeout;
const msUntilExpiration = refreshTokenExpiration.getTime() - Date.now();
if (msUntilExpiration < MAX_DELAY_MS) {
logoutTimer = setTimeout(logoutAndRedirect, msUntilExpiration);
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking through this algorithm. If today is January 1 and the token expires January 3, we will see that the expiration date is sooner than 7 days (i.e. MAX_DELAY_MS) from now, so will set the timeout. But the timeout is still being set for the token expiration time. Did you mean to instead set it to be earlier than that?

Caveat: At this point in the PR review, I haven't read the rest of the diff yet.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I am wondering how setTimeout works when an app does not remain in use throughout the entire timeout period. For example, I am wondering whether the countdown pauses while the app is closed and, when the app is reopened, it is as though no time has passed since it was closed (e.g., timer has progressed by 5 minutes, while real-world clock has progressed by 5 hours) — or whether the countdown (effectively) continues while the app is closed and, when the app is reopened, it is as though that much real-world time has passed (e.g., timer and real-world clock have both progressed by 5 hours).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But the timeout is still being set for the token expiration time. Did you mean to instead set it to be earlier than that?

No, my rationale is that this is covering the case where the user is actively using the app when the refresh token expires. But in that case they will have an active access token, which may be expiring some time between 1 second and 24 hours from now. The user could technically make successful API requests in that period while the access token is active but the refresh token is expired. That gives us a natural buffer period and we don't need to force the log out any amount of time before the refresh token expiration.

I am wondering whether the countdown pauses while the app is closed

The state of the timer is not persisted when the app closes. The timer just goes away at that point (similar to when a web page is closed).

when the app is reopened

When the app is reopened, a new timer is setup (if it's within the MAX_DELAY_MS threshold) with no connection to the (non-existent) ghosts of timers from sessions past.

@eecavanna
Copy link
Collaborator

Sorry to say, I think I'm out of time in terms of what I can afford to spend reviewing this PR this week, given unrelated tasks and our plans to submit to the app stores for an initial review by Monday.

I'm OK with this being merged in, given that it does address a previously-unaddressed scenario:

What isn't currently being handled is if the call to /auth/refresh in step 4 also returns a 401 Unauthorized response because the refresh token has expired.

I would like to think more about that scenario, but that isn't something I would be able to do this week. We can revisit this in the future.

pkalita-lbl and others added 2 commits February 26, 2025 13:10
Co-authored-by: eecavanna <134325062+eecavanna@users.noreply.github.com>
@pkalita-lbl pkalita-lbl merged commit 0b19da1 into main Feb 26, 2025
1 check passed
@pkalita-lbl pkalita-lbl deleted the issue-118-refresh-token-expiration branch February 26, 2025 22:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Handle case where stored refresh token expires
2 participants