How to handle onConnectionInterrupted callback due to connection close in 24 hours #607

Closed
cyber274 opened this issue Jan 2, 2025 · 11 comments
Labels
p3 This is a minor priority issue

Comments

@cyber274
cyber274 commented Jan 2, 2025

Describe the bug

We are creating an MQTT connection (software.amazon.awssdk.crt.mqtt.MqttClientConnection) over WebSocket using IAM role credentials. The connection is created successfully and we are able to receive incoming events.

However, we noticed that the MQTT connection is closed unexpectedly after 24 hours because of an IoT Core limit, and the connection does not resume automatically.

This is the code.

 {
        try {
            final EventLoopGroup eventLoopGroup = new EventLoopGroup(1);
            final ClientBootstrap clientBootstrap = new ClientBootstrap(eventLoopGroup,
                    new HostResolver(eventLoopGroup));

            final SocketOptions socketOptions = new SocketOptions();
            socketOptions.keepAliveTimeoutSecs = 30;
            socketOptions.keepAliveIntervalSecs = 5;
            socketOptions.keepAlive = true;

            return AwsIotMqttConnectionBuilder.newDefaultBuilder()
                    .withBootstrap(clientBootstrap)
                    .withClientId(iotConnectClientId)
                    .withEndpoint(iotCoreEndpoint)
                    .withWebsockets(true)
                    .withReconnectTimeoutSecs(1, 60)
                    .withWebsocketSigningRegion(region)
                    .withProtocolOperationTimeoutMs(30 * 1000)
                    .withPingTimeoutMs(15 * 1000)
                    .withWebsocketCredentialsProvider(credentialsProvider)
                    .withKeepAliveSecs(60)
                    .withSocketOptions(socketOptions)
                    .withTimeoutMs(30 * 1000)
                    .withCleanSession(false)
                    .withConnectionEventCallbacks(getMqttClientConnectionEvents(siteId))
                    .build();

        } catch (final Exception e) {
            log.error("Exception while creating AwsIotMqtt Connection", e);
            throw e;
        }
}

private MqttClientConnectionEvents getMqttClientConnectionEvents(final String siteId) {
        return new MqttClientConnectionEvents() {

            @Override
            public void onConnectionInterrupted(int errorCode) {
                if (errorCode != 0) {
                    log.warn("connection interrupted: {} and string: {} for MQTT topic: {}",
                            errorCode, CRT.awsErrorString(errorCode), siteId);
                }
                // 5134 error code gets generated on WEBSOCKET_TTL_EXPIRATION disconnectReason
                // Handling this error code and retrying the subscription...
                if (errorCode == 5134) {
                    log.info("Reconnecting...");

                    try {
                        log.info("Creating new client");
                        dataSyncMQTTSubscriber.createIotCoreSubscriptionWithRetry(siteId);
                        log.info("New client is created successfully");
                    } catch (Exception e) {
                        log.error("Exception while reconnecting to MQTT topic: {} ", siteId, e);
                    }
                }
            }
            @Override
            public void onConnectionResumed(boolean b) {
                log.warn("connection resumed for MQTT topic: {}", siteId);
            }
            @Override
            public void onConnectionSuccess(OnConnectionSuccessReturn data) {
                if (data.getSessionPresent()) {
                    log.info("[MQTTEventSecondaryHandler.onSuccess] Thin client is " +
                            "using existing session to create subscription to MQTT topic: {}", siteId);
                } else {
                    log.info("[MQTTEventSecondaryHandler.onSuccess] Thin client is " +
                            "starting a new session to create successful subscription to MQTT topic: {}", siteId);
                }
            }
            @Override
            public void onConnectionFailure(OnConnectionFailureReturn data) {
                log.error("[MQTTEventSecondaryHandler.onFailure] Thin client is facing" +
                        " error while subscribing to MQTT topic: {} with error: {}", siteId, data.getErrorCode());
            }
            @Override
            public void onConnectionClosed(OnConnectionClosedReturn data) {
                log.error("[MQTTEventSecondaryHandler.OnClose] Thin client subscription connection " +
                        "to MQTT topic: {} is closed {}.", siteId, data);
            }
        };
    }
createIotCoreSubscriptionWithRetry() {
   ....
  mqttSecondaryConnection.connect().get(10, TimeUnit.SECONDS);
  ....
}

We receive error code 5134 and the error message "string: The connection was closed unexpectedly. for MQTT topic: <MQTT topic>" in the onConnectionInterrupted() callback method.

We tried calling mqttSecondaryConnection.connect().get(10, TimeUnit.SECONDS) to reconnect from the onConnectionInterrupted() callback method, but we received:

java.util.concurrent.ExecutionException: software.amazon.awssdk.crt.CrtRuntimeException: MqttClientConnection.mqtt_connect: aws_mqtt_client_connection_connect failed (aws_last_error: AWS_ERROR_MQTT_ALREADY_CONNECTED(5132), The requested operation is invalid as the connection is already open.) AWS_ERROR_MQTT_ALREADY_CONNECTED(5132)
    at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:396) ~[?:?]
    at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2096) ~[?:?]
    at ...

Can anybody help with how to reconnect in this case?

Expected Behavior

The connection should resume automatically.

Current Behavior

The connection does not resume in this case.

Reproduction Steps

Create a fresh connection and keep it open for 24 hours to observe the connection closure.

Possible Solution

No response

Additional Information/Context

No response

SDK version used

JDK17

Environment details (OS name and version, etc.)

IOT Core

@bretambrose
Contributor

Copy-pasting from the other issue:

The client will always reconnect in the scenario you have described. If you have CRT logs showing otherwise, please attach them.

@MikeDombo
Contributor

The logs imply that you're trying to reconnect and the SDK itself has already reconnected for you.

If you remove your own reconnection logic, then what happens? Do you see that the connection is interrupted and then re-established as expected?

Also, since you're using Greengrass, why not just use the Greengrass MQTT connection? It will handle message spooling and reconnection for you. https://docs.aws.amazon.com/greengrass/v2/developerguide/ipc-iot-core-mqtt.html

@jmklix jmklix added response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 2 days. p3 This is a minor priority issue and removed needs-triage This issue or PR still needs to be triaged. labels Jan 2, 2025
@cyber274
Author

cyber274 commented Jan 2, 2025

Copy-pasting from the other issue:

The client will always reconnect in the scenario you have described. If you have CRT logs showing otherwise, please attach them.

attached: https://paste.amazon.com/show/gujrap/1735839898

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 2 days. label Jan 2, 2025
@cyber274
Author

cyber274 commented Jan 2, 2025

The logs imply that you're trying to reconnect and the SDK itself has already reconnected for you.

If you remove your own reconnection logic, then what happens? Do you see that the connection is interrupted and then re-established as expected?

Also, since you're using Greengrass, why not just use the Greengrass MQTT connection? It will handle message spooling and reconnection for you. https://docs.aws.amazon.com/greengrass/v2/developerguide/ipc-iot-core-mqtt.html

If you remove reconnection logic, do you see that the connection is interrupted?
We run the reconnection logic only after the connection is interrupted. So yes, the connection is interrupted after 24 hours.

If you remove reconnection logic, do you see the connection re-established?
I have not yet tried removing the reconnection logic to see whether the connection re-establishes. However, the only reconnection logic I run is mqttSecondaryConnection.connect().get(10, TimeUnit.SECONDS), and the connection is not re-established. I hope this does not interfere with the existing mqttConnection.

why not just use the Greengrass MQTT connection?
Let me explore this.

@MikeDombo
Contributor

But, i am only running ... as reconnection logic

You should not be doing any reconnection logic at all. Just let the SDK reconnect.
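
One way to act on this advice is to make the interruption callback observe-only. The sketch below is a hypothetical stand-alone class (not part of the AWS SDK) so that it runs without the CRT on the classpath; the method names merely mirror the MqttClientConnectionEvents callbacks from the original post, and the idea of resubscribing when no session is present is an assumption about what the application would want, not something confirmed in this thread.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical stand-alone tracker; not an AWS SDK class. */
final class ConnectionStateTracker {
    enum State { CONNECTED, INTERRUPTED }

    private final AtomicReference<State> state = new AtomicReference<>(State.CONNECTED);
    private final AtomicInteger interruptions = new AtomicInteger();

    // Intended to be called from onConnectionInterrupted(errorCode).
    // It only records the event and deliberately never calls connect().
    void onConnectionInterrupted(int errorCode) {
        state.set(State.INTERRUPTED);
        interruptions.incrementAndGet();
    }

    // Intended to be called from onConnectionResumed(sessionPresent).
    // Returns true when the caller may want to resubscribe, because the
    // broker reported that no previous session was present.
    boolean onConnectionResumed(boolean sessionPresent) {
        state.set(State.CONNECTED);
        return !sessionPresent;
    }

    State state() { return state.get(); }

    int interruptions() { return interruptions.get(); }

    public static void main(String[] args) {
        ConnectionStateTracker tracker = new ConnectionStateTracker();
        tracker.onConnectionInterrupted(5134);
        System.out.println(tracker.state());
        boolean resubscribe = tracker.onConnectionResumed(true);
        System.out.println(tracker.state() + ", resubscribe=" + resubscribe);
    }
}
```

In a real handler, the two methods would be invoked from the corresponding callbacks, with logging kept as in the original post and the createIotCoreSubscriptionWithRetry call removed.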

@cyber274
Author

cyber274 commented Jan 2, 2025

But, i am only running ... as reconnection logic

You should not be doing any reconnection logic at all. Just let the SDK reconnect.

Let me try this.

Also, is there a way to reproduce this scenario sooner? Otherwise each test takes 24 hours.

@MikeDombo
Contributor

You could remove the internet connection and then add it back in a bit. That will require a reconnection. It won't necessarily be the same as the websocket timeout, but it does prove that reconnecting works.
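
When running that test, it may help to know roughly how long the client could wait between attempts while the network is down. The sketch below assumes the delay starts at the minimum passed to withReconnectTimeoutSecs(1, 60), doubles after each failed attempt, and is capped at the maximum. That doubling schedule is an assumption for illustration and is not confirmed anywhere in this thread; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of a capped doubling backoff; not SDK code. */
final class ReconnectBackoffSketch {

    // Returns the delay in seconds before each of the first `attempts` retries,
    // ASSUMING the delay doubles after every failure and never exceeds maxSecs.
    static List<Long> delays(long minSecs, long maxSecs, int attempts) {
        List<Long> out = new ArrayList<>();
        long current = minSecs;
        for (int i = 0; i < attempts; i++) {
            out.add(current);
            current = Math.min(current * 2, maxSecs);
        }
        return out;
    }

    public static void main(String[] args) {
        // Values taken from withReconnectTimeoutSecs(1, 60) in the original post.
        System.out.println(delays(1, 60, 8)); // prints [1, 2, 4, 8, 16, 32, 60, 60]
    }
}
```

Under these assumptions, a client that has been offline for a few minutes could take up to about a minute to notice that the network is back, so the test should wait at least that long before concluding that reconnection failed.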

@cyber274
Author

cyber274 commented Jan 3, 2025

why not just use the Greengrass MQTT connection?
We need to connect to a cross-region IoT Thing registered in us-east-1 from us-west-2 for a resiliency use case. Greengrass MQTT can't be used because there is no Thing or certificate in the us-west-2 region, so we're using an IAM role with WebSocket credentials for the MQTT connection.

@MikeDombo
Contributor

Please get the client logs showing that it disconnects and either does not attempt to reconnect at all or fails to reconnect.

Most likely the issue is that you are providing static credentials that don't refresh. What credentials provider are you using?

@bretambrose
Contributor

Closing as internal issue was resolved with code analysis. Putting session credentials inside a static provider will lead to reconnect failures once the session credentials expire.
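
To illustrate why that fails, here is a minimal stand-alone model of the two provider styles. Everything in it is hypothetical (the SessionCredentials and Provider types, the one-hour lifetime, the token strings) and none of it is the CRT credentials API; it only assumes that each reconnect asks the provider for credentials and needs them to be unexpired at that moment.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.Function;

/** Hypothetical model of static vs. refreshing credentials; not SDK code. */
final class CredentialExpiryDemo {

    // Stand-in for temporary (session) credentials with an expiry time.
    static final class SessionCredentials {
        final String token;
        final Instant expiresAt;
        SessionCredentials(String token, Instant expiresAt) {
            this.token = token;
            this.expiresAt = expiresAt;
        }
        boolean validAt(Instant now) { return now.isBefore(expiresAt); }
    }

    // Asked for credentials on every connect or reconnect attempt.
    interface Provider { SessionCredentials get(Instant now); }

    // Static provider: returns the same credentials forever, even after expiry.
    static Provider staticProvider(SessionCredentials fixed) { return now -> fixed; }

    // Refreshing provider: fetches new credentials once the cached ones expire.
    static Provider refreshingProvider(Function<Instant, SessionCredentials> fetch) {
        return new Provider() {
            private SessionCredentials cached;
            @Override
            public synchronized SessionCredentials get(Instant now) {
                if (cached == null || !cached.validAt(now)) {
                    cached = fetch.apply(now);
                }
                return cached;
            }
        };
    }

    // In this model a reconnect succeeds only with unexpired credentials.
    static boolean canReconnect(Provider provider, Instant now) {
        return provider.get(now).validAt(now);
    }

    public static void main(String[] args) {
        Instant start = Instant.parse("2025-01-02T00:00:00Z");
        Duration lifetime = Duration.ofHours(1); // assumed session lifetime
        Function<Instant, SessionCredentials> fetch =
                now -> new SessionCredentials("token@" + now, now.plus(lifetime));

        Provider fixed = staticProvider(fetch.apply(start));
        Provider refreshing = refreshingProvider(fetch);
        Instant dayLater = start.plus(Duration.ofHours(24));

        System.out.println("static at start:      " + canReconnect(fixed, start));
        System.out.println("static after 24h:     " + canReconnect(fixed, dayLater));
        System.out.println("refreshing after 24h: " + canReconnect(refreshing, dayLater));
    }
}
```

The initial connection works with either provider, which matches the report that the connection was created successfully; only the reconnect after the 24-hour disconnect exposes the expired session credentials held by the static provider.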

@bretambrose bretambrose removed the bug This issue is a bug. label Jan 23, 2025