
disconnection with error code [5154] and reason code [SESSION_TAKEN_OVER] #580

Closed
kriehzoo opened this issue May 7, 2024 · 24 comments
Labels
bug This issue is a bug. p3 This is a minor priority issue

Comments


kriehzoo commented May 7, 2024

Describe the bug

I have a subscribing client written in Java with Spring Boot.
When I start the client, I keep seeing SESSION_TAKEN_OVER disconnects.
I already opened a case with AWS support but was referred to this repo to open an issue here.
I know that SESSION_TAKEN_OVER happens when the same client ID is used by different clients, but I only start one client, so this cannot be the issue.
My suspicion is that resources may be leaking from a previous run.
Do I need some cleanup when shutting down my client? I came across CrtResource.waitForNoResources(); is this required?
How can I detect if resources are leaking?
I use my IoT thing name as client ID.

Building my client looks like this:

@Autowired
  public IoTCoreMqtt5ConnectionHandler(
      @Value("${iot.enabled:false}") Boolean enabled,
      @Value("${iot.thing.name}") @NotEmpty String thingName,
      @Value("${iot.endpoint.host}") String host,
      @Value("${iot.endpoint.port:8883}") @NotNull Long port,
      @Value("${iot.auth.certPath}") @NotEmpty String certPath,
      @Value("${iot.auth.privateKeyPath}") @NotEmpty String privateKeyPath,
      @Value("${iot.auth.caFilePath}") @NotEmpty String caFilePath,
      @Value("${iot.subscribeTimeoutSeconds:10}") Long subscribeTimeoutSeconds,
      @Value("${iot.keepAliveIntervalSeconds:60}") Long keepAliveIntervalSeconds,
      ApplicationEventPublisher applicationEventPublisher, ObjectMapper objectMapper) {
    this.enabled = enabled;
    this.thingName = thingName;
    this.eventTopic = "connector/" + thingName + "/event";
    this.host = host;
    this.port = port;
    this.certPath = certPath;
    this.privateKeyPath = privateKeyPath;
    this.caFilePath = caFilePath;
    this.subscribeTimeoutSeconds = subscribeTimeoutSeconds;
    this.keepAliveIntervalSeconds = keepAliveIntervalSeconds;
    this.applicationEventPublisher = applicationEventPublisher;
    this.objectMapper = objectMapper;
  }

stopping it:

/**
   * Disconnect from the MQTT broker and wait for all CRT resources to be cleaned up.
   * A stopped client can be restarted.
   *
   * @throws ExecutionException   if this future completed exceptionally
   * @throws InterruptedException if the current thread was interrupted while waiting
   */
  private void stop() throws ExecutionException, InterruptedException {
    log.debug("stop");
    client.stop(new DisconnectPacket.DisconnectPacketBuilder()
        .withReasonCode(DisconnectPacket.DisconnectReasonCode.NORMAL_DISCONNECTION)
        .withReasonString("client is shutting down")
        .build());
  }
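Note that client.stop() only requests the stop; it does not block until the client has actually shut down. One way to wait for that (not in my current code, just a sketch; the future name and timeout are made up) would be to complete a future from the onStopped lifecycle callback:

// sketch only: complete a future from the lifecycle events and wait for it after stop()
private final CompletableFuture<Void> stoppedFuture = new CompletableFuture<>();

// inside my Mqtt5ClientOptions.LifecycleEvents implementation:
@Override
public void onStopped(Mqtt5Client client, OnStoppedReturn onStoppedReturn) {
  stoppedFuture.complete(null);
}

// a stop variant that blocks until the client reports it has stopped:
private void stopAndAwait() throws ExecutionException, InterruptedException, TimeoutException {
  client.stop(new DisconnectPacket.DisconnectPacketBuilder()
      .withReasonCode(DisconnectPacket.DisconnectReasonCode.NORMAL_DISCONNECTION)
      .withReasonString("client is shutting down")
      .build());
  stoppedFuture.get(10, TimeUnit.SECONDS); // timeout chosen arbitrarily for this sketch
}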

creating the client:

/**
   * Creates a client without starting it.
   */
  private void createClient() {
    if (client != null) {
      throw new CreateMqtt5ClientFailedException("Client is already created");
    }

    ConnectPacket.ConnectPacketBuilder connectPacketBuilder = new ConnectPacket.ConnectPacketBuilder()
        .withClientId(thingName)
        .withKeepAliveIntervalSeconds(keepAliveIntervalSeconds);

    SocketOptions socketOptions = new SocketOptions();
    socketOptions.keepAlive = true;

    // NEEDS REVIEW: the maximum is 7 days, and even that may require a change to the account limits via the limits process.
    // Currently, although we request 7 days, the negotiated session expiry interval is 3600 seconds (1 hour), which
    // would be too short for our use case.
    connectPacketBuilder.withSessionExpiryIntervalSeconds(60 * 60 * 24 * 7L); // 7 days in seconds
    try (AwsIotMqtt5ClientBuilder builder =
             AwsIotMqtt5ClientBuilder.newDirectMqttBuilderWithMtlsFromPath(host, certPath, privateKeyPath)
                 .withConnectProperties(connectPacketBuilder)
                 .withCertificateAuthorityFromPath(null, caFilePath)
                 .withLifeCycleEvents(lifecycleEvents)
                 .withPublishEvents(publishEvents)
                 // the MQTT broker waits 1.5 x the keep-alive interval before it drops an idle connection
                 .withMinReconnectDelayMs((long) (keepAliveIntervalSeconds * 1000 * 1.5))
                 //.withSocketOptions(socketOptions)
                 // persistent session (must have same client ID, and a session expiry > 0)
                 // REVIEW REJOIN_POST_SUCCESS should be enough, but then an existing session was not re-joined
                 .withSessionBehavior(Mqtt5ClientOptions.ClientSessionBehavior.REJOIN_ALWAYS)
                 .withPort(port);
    ) {
      client = builder.build();
    }

    if (client == null) {
      throw new CreateMqtt5ClientFailedException("Failed to create the MQTT-5 client");
    }
  }

starting it:

/**
   * Connect to the MQTT broker using MQTT-5.
   * A client can be stopped and restarted.
   * If the client has not been created, it will be created first.
   *
   * @throws ExecutionException   if this future completed exceptionally
   * @throws InterruptedException if the current thread was interrupted while waiting
   */
  private void startClient() throws ExecutionException, InterruptedException {
    if (client == null) {
      createClient();
    } else {
      if (client.getIsConnected()) {
        throw new StartMqtt5ClientFailedException("Client is already connected");
      }
    }
    log.info("Start");
    client.start();
  }

subscribing:

private void subscribeToConnectorEvents() throws ExecutionException, InterruptedException, TimeoutException {
    QOS qos = QOS.AT_LEAST_ONCE;
    log.debug("subscribing to topic [{}] with QOS [{}]", eventTopic, qos);
    SubAckPacket subAckPacket = client.subscribe(new SubscribePacket.SubscribePacketBuilder()
        .withSubscription(eventTopic,
            qos,
            true,
            false,
            SubscribePacket.RetainHandlingType.SEND_ON_SUBSCRIBE_IF_NEW) // only for retained messages
        .build()).get(subscribeTimeoutSeconds, TimeUnit.SECONDS);
    log.info("subscribed to topic [{}] with reason codes {}: {}",
        eventTopic, subAckPacket.getReasonCodes(), subAckPacket.getReasonString());

    if (!subAckPacket.getReasonCodes().contains(SubAckPacket.SubAckReasonCode.GRANTED_QOS_1)) {
      throw new SubscriptionDoesNotMatchRequirementsException(String.format(
          "QOS [%s] was not granted although QOS [%s] was requested",
          SubAckPacket.SubAckReasonCode.GRANTED_QOS_1, qos));
    }

    if (hasSubscriptionError(subAckPacket)) {
      throw new SubscriptionFailedException(String.format(
          "Subscription to topic [%s] failed with reason codes [%s]: %s",
          eventTopic, subAckPacket.getReasonCodes(), subAckPacket.getReasonString()));
    }
  }

shutdown:

public void onShutdown() {
    if (!enabled) {
      return;
    }
    stopAndCloseClient();
  }

  private void stopAndCloseClient() {
    try {
      stopClient();
    } finally {
      closeClient();
    }
  }

  private void closeClient() {
    log.debug("close");
    // Important: When finished with an MQTT5 client, you must call close() on it or any associated
    // native resource may leak
    client.close();
    //CrtResource.waitForNoResources(); // NEEDS REVIEW
    //log.info("resources cleaned up");
    client = null;
  }

  private void stopClient() {
    try {
      stop();
    } catch (ExecutionException e) {
      throw new DisconnectOnShutdownFailedException(e);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // re-interrupt the thread, no need to fail the shutdown
    }
  }

  public void onContextClosedEvent(ContextClosedEvent event) {
    log.debug("onApplicationEvent: {}", event);
    // Important: When finished with an MQTT5 client, you must call close() on it or any associated
    // native resource may leak
    onShutdown();
  }

startup:

public void onContextRefreshedEvent(ContextRefreshedEvent event) {
    log.debug("onApplicationEvent: {}", event);
    onStartup();
  }

  private void onStartup() {
    if (!enabled) {
      log.warn("IoT Core is disabled");
      return;
    }

    try {
      startClient();
    } catch (InterruptedException ex) {
      Thread.currentThread().interrupt(); // re-interrupt the thread
      throw new StartMqtt5ClientFailedException(ex); // and fail the startup

    } catch (CrtRuntimeException | ExecutionException ex) {
      throw new StartMqtt5ClientFailedException(ex);
    }

    // REVIEW wait for the initial connection to be established before subscribing
    // Once connected, the MQTT5 client can subscribe to one or more topics
    // make sure not to do this in the connection success callback because it will
    // block the MQTT event loop, and you will run into a timeout
    try {
      subscribeToConnectorEvents();
    } catch (ExecutionException | InterruptedException | TimeoutException e) {
      if (e instanceof InterruptedException) {
        Thread.currentThread().interrupt(); // re-interrupt the thread
      }
      throw new ConnectAndSubscribeTopicsFailedException("Failed to subscribe", e);
    }
  }

I think these are all the relevant code snippets.

Expected Behavior

to not see SESSION_TAKEN_OVER at all

Current Behavior

15:45:33.113 [main] DEBUG c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - onApplicationEvent: org.springframework.context.event.ContextRefreshedEvent[source=org.springframework.boot.web.servlet.context.AnnotationConfigServletWebServerApplicationContext@305f031, started on Tue May 07 15:45:31 CEST 2024]
15:45:33.500 [main] INFO c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - Start
15:45:33.500 [main] DEBUG c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - subscribing to topic [connector/connector-608/event] with QOS [AT_LEAST_ONCE]
15:45:33.500 [AwsEventLoop 2] DEBUG c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - attempting connect
15:45:33.889 [AwsEventLoop 2] DEBUG c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - connection success with clientID [connector-608] and rejoinedSession [true] and sessionExpiryIntervalSeconds [3600] and serverKeepAliveSeconds [60] : assignedClientIdentifier [null], reasonCode [SUCCESS], reasonString [null], responseInformation [null], sessionExpiryIntervalSeconds [3600], sessionPresent [true], serverKeepAliveSeconds [60], userProperties []
15:45:33.960 [main] INFO c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - subscribed to topic [connector/connector-608/event] with reason codes [GRANTED_QOS_1]: null
15:45:33.965 [main] INFO c.m.c.s.ConnectorClientApplication - Started ConnectorClientApplication in 2.44 seconds (process running for 2.796)
15:45:34.167 [AwsEventLoop 2] DEBUG c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - disconnection with error code [5154]
15:45:34.167 [AwsEventLoop 2] DEBUG c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - disconnection with reason code [SESSION_TAKEN_OVER] and sessionExpiryIntervalSeconds [null]: null
15:47:01.925 [AwsEventLoop 2] DEBUG c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - attempting connect
15:47:02.426 [AwsEventLoop 2] DEBUG c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - connection success with clientID [connector-608] and rejoinedSession [true] and sessionExpiryIntervalSeconds [3600] and serverKeepAliveSeconds [60] : assignedClientIdentifier [null], reasonCode [SUCCESS], reasonString [null], responseInformation [null], sessionExpiryIntervalSeconds [3600], sessionPresent [true], serverKeepAliveSeconds [60], userProperties []
15:47:03.127 [AwsEventLoop 2] DEBUG c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - disconnection with error code [5154]
15:47:03.127 [AwsEventLoop 2] DEBUG c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - disconnection with reason code [SESSION_TAKEN_OVER] and sessionExpiryIntervalSeconds [null]: null
15:47:29.742 [AwsEventLoop 2] DEBUG c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - attempting connect
15:47:30.053 [AwsEventLoop 2] DEBUG c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - connection success with clientID [connector-608] and rejoinedSession [true] and sessionExpiryIntervalSeconds [3600] and serverKeepAliveSeconds [60] : assignedClientIdentifier [null], reasonCode [SUCCESS], reasonString [null], responseInformation [null], sessionExpiryIntervalSeconds [3600], sessionPresent [true], serverKeepAliveSeconds [60], userProperties []
15:47:31.157 [AwsEventLoop 2] DEBUG c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - disconnection with error code [5154]
15:47:31.157 [AwsEventLoop 2] DEBUG c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - disconnection with reason code [SESSION_TAKEN_OVER] and sessionExpiryIntervalSeconds [null]: null
15:49:31.158 [AwsEventLoop 2] DEBUG c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - attempting connect
15:49:31.471 [AwsEventLoop 2] DEBUG c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - connection success with clientID [connector-608] and rejoinedSession [true] and sessionExpiryIntervalSeconds [3600] and serverKeepAliveSeconds [60] : assignedClientIdentifier [null], reasonCode [SUCCESS], reasonString [null], responseInformation [null], sessionExpiryIntervalSeconds [3600], sessionPresent [true], serverKeepAliveSeconds [60], userProperties []
15:49:32.455 [AwsEventLoop 2] DEBUG c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - disconnection with error code [5154]
15:49:32.455 [AwsEventLoop 2] DEBUG c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - disconnection with reason code [SESSION_TAKEN_OVER] and sessionExpiryIntervalSeconds [null]: null
15:50:55.170 [AwsEventLoop 2] DEBUG c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - attempting connect
15:50:55.534 [AwsEventLoop 2] DEBUG c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - connection success with clientID [connector-608] and rejoinedSession [true] and sessionExpiryIntervalSeconds [3600] and serverKeepAliveSeconds [60] : assignedClientIdentifier [null], reasonCode [SUCCESS], reasonString [null], responseInformation [null], sessionExpiryIntervalSeconds [3600], sessionPresent [true], serverKeepAliveSeconds [60], userProperties []
15:50:56.696 [AwsEventLoop 2] DEBUG c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - disconnection with error code [5154]
15:50:56.696 [AwsEventLoop 2] DEBUG c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - disconnection with reason code [SESSION_TAKEN_OVER] and sessionExpiryIntervalSeconds [null]: null
15:51:08.092 [AwsEventLoop 2] DEBUG c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - attempting connect
15:51:08.392 [AwsEventLoop 2] DEBUG c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - connection success with clientID [connector-608] and rejoinedSession [true] and sessionExpiryIntervalSeconds [3600] and serverKeepAliveSeconds [60] : assignedClientIdentifier [null], reasonCode [SUCCESS], reasonString [null], responseInformation [null], sessionExpiryIntervalSeconds [3600], sessionPresent [true], serverKeepAliveSeconds [60], userProperties []
15:51:08.939 [AwsEventLoop 2] DEBUG c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - disconnection with error code [5154]

Reproduction Steps

Relevant source files were provided to AWS support. This issue will also be linked there.

Possible Solution

I suspect a resource leak.

Additional Information/Context

No response

SDK version used

1.20.1

Environment details (OS name and version, etc.)

macOS 14.3.1

@kriehzoo kriehzoo added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels May 7, 2024

kriehzoo commented May 7, 2024

I enabled CRT tracing and native debugging by adding these VM options:
-Daws.crt.log.level=Trace -Daws.crt.log.destination=Stdout -Daws.crt.debugnative=true
I also added a call to CrtResource.logNativeResources() on startup and shutdown to see whether something is leaking,
plus a call to CrtResource.waitForNoResources() after closing the client.

When I just run my client and wait for a couple of minutes, I keep getting SESSION_TAKEN_OVER:

[TRACE] [2024-05-07T15:45:31Z] [00000001727eb000] [event-loop] - id=0x144f2ed60: waiting for a maximum of 100s 0ns
17:46:21.263 [AwsEventLoop 2] DEBUG c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - disconnection with error code [5154]
17:46:21.263 [AwsEventLoop 2] DEBUG c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - disconnection with reason code [SESSION_TAKEN_OVER] and sessionExpiryIntervalSeconds [null]: null

Next time:

[TRACE] [2024-05-07T15:46:30Z] [00000001725bb000] [event-loop] - id=0x144f2be90: wake up with 0 events to process.
17:46:30.448 [AwsEventLoop 2] DEBUG c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - connection success with clientID [connector-608] and rejoinedSession [true] and sessionExpiryIntervalSeconds [3600] and serverKeepAliveSeconds [60] : assignedClientIdentifier [null], reasonCode [SUCCESS], reasonString [null], responseInformation [null], sessionExpiryIntervalSeconds [3600], sessionPresent [true], serverKeepAliveSeconds [60], userProperties []
17:46:31.716 [AwsEventLoop 2] DEBUG c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - disconnection with error code [5154]
17:46:31.716 [AwsEventLoop 2] DEBUG c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - disconnection with reason code [SESSION_TAKEN_OVER] and sessionExpiryIntervalSeconds [null]: null

This goes on for a while until I stop my app.

Some suspicious things I see in the logs:

[ERROR] [2024-05-07T15:59:11Z] [00000001725bb000] [mqtt5-client] - MQTT5 client termination function in JNI called, but could not get env

And after waiting for the one-minute timeout:

[ERROR] [2024-05-07T16:00:11Z] [00000001702cb000] [JavaCrtResource] - waitForNoResources - timeOut
[ERROR] [2024-05-07T16:00:11Z] [00000001702cb000] [JavaCrtResource] - Dumping native object set:
[ERROR] [2024-05-07T16:00:11Z] [00000001702cb000] [JavaCrtResource] - [Id 0, Class SocketOptions, Refs 1](2024-05-07T15:38:51.039533Z) - <null>
[ERROR] [2024-05-07T16:00:11Z] [00000001702cb000] [JavaCrtResource] - [Id 1, Class TlsContextOptions, Refs 1](2024-05-07T15:38:51.042230Z) - <null>
[ERROR] [2024-05-07T16:00:11Z] [00000001702cb000] [JavaCrtResource] - [Id 3, Class TlsContext, Refs 1](2024-05-07T15:38:51.042744Z) - <null>
[ERROR] [2024-05-07T16:00:11Z] [00000001702cb000] [JavaCrtResource] - [Id 5, Class ClientBootstrap, Refs 1](2024-05-07T15:38:51.072969Z) - <null>
   Forward references by Id: 7 6 
[ERROR] [2024-05-07T16:00:11Z] [00000001702cb000] [JavaCrtResource] - [Id 6, Class EventLoopGroup, Refs 2](2024-05-07T15:38:51.072987Z) - <null>
[ERROR] [2024-05-07T16:00:11Z] [00000001702cb000] [JavaCrtResource] - [Id 7, Class HostResolver, Refs 1](2024-05-07T15:38:51.073470Z) - <null>
   Forward references by Id: 6 
[WARN] [2024-05-07T16:00:13Z] [000000016c657000] [JavaCrtGeneral] - Not all native threads were successfully joined during gentle shutdown.  Memory may be leaked.

Process finished with exit code 130 (interrupted by signal 2:SIGINT)

I hope this helps; I can also provide the full CRT logs if needed.


kriehzoo commented May 7, 2024

Another suspicious line from the same logs:

[ERROR] [2024-05-07T15:59:11Z] [00000001725bb000] [mqtt5-client] - id=0x144bd3db0: client channel shutdown invoked from unexpected state 6(PENDING_RECONNECT)


sbSteveK commented May 8, 2024

The SESSION_TAKEN_OVER error code is sent in a disconnect packet from the server when the server gets a connect packet containing the same client id as a client that is currently connected to it. The server will accept the connection from the new client and disconnect the previously connected client.

You probably have two clients running concurrently using the same client id and they are taking turns disconnecting each other. To test whether this is the issue, try changing your createClient() function to use a UUID to generate a client id instead of using the thing name at .withClientId(thingName). If this resolves the SESSION_TAKEN_OVER issue, you can then look into your code to figure out how and why it's creating more than one client.

You can also set up and check CloudWatch logs to see the incoming CONNECT packets to your MQTT endpoint. If you see more than one CONNECT packet, each with a different UUID client ID, that will also confirm that something is creating multiple clients and starting them up.
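For example, a quick way to try this inside createClient() would be something like the following (the prefix is arbitrary, and the IoT policy attached to your certificate must allow connecting with that client ID):

// sketch: use a random client id instead of the thing name for testing
String testClientId = "session-test-" + java.util.UUID.randomUUID();

ConnectPacket.ConnectPacketBuilder connectPacketBuilder = new ConnectPacket.ConnectPacketBuilder()
    .withClientId(testClientId)
    .withKeepAliveIntervalSeconds(keepAliveIntervalSeconds);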

@jmklix jmklix added response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 2 days. p3 This is a minor priority issue and removed needs-triage This issue or PR still needs to be triaged. labels May 8, 2024
@kriehzoo

Hi Kim,
thank you for your suggestions.
I already checked for multiple clients using the same client ID and also reviewed my code for this.
However, I am pretty sure that I am not using the same client ID.
Still, I will try what you suggested, although this will probably give me issues with IoT permissions, as the permission policy is based on the thing name. I will post the outcome as soon as possible.
And I will double check the CloudWatch logs.
In the meantime, what about my previous comments on suspicious log output from the SDK?

@kriehzoo

Hello Kim,

Without any changes (still using the thing name as client ID), here are the events after starting my client, as shown in the IoTCoreV2 logs in CloudWatch.

I get three events (Connect, Subscribe, Disconnect) in the first log stream:

{
  "timestamp": "2024-05-14 14:52:53.812",
  "logLevel": "INFO",
  "traceId": "a9f601e6-295a-ebc3-d168-459817ead22c",
  "accountId": "851725189817",
  "status": "Success",
  "eventType": "Connect",
  "protocol": "MQTT",
  "clientId": "connector-608",
  "principalId": "fb03ca6b5bf8ea850fa216eb96007b9ab7f2c4182ba23ad945f7df94b07fa836",
  "sourceIp": "80.120.7.190",
  "sourcePort": 47021
}

{
  "timestamp": "2024-05-14 14:52:53.873",
  "logLevel": "INFO",
  "traceId": "e7870603-26d5-c35f-2450-16ea327a34e7",
  "accountId": "851725189817",
  "status": "Success",
  "eventType": "Subscribe",
  "protocol": "MQTT",
  "topicName": "connector/connector-608/event",
  "subscriptions": [
    {
      "topicName": "connector/connector-608/event",
      "reasonCode": 1
    }
  ],
  "clientId": "connector-608",
  "principalId": "fb03ca6b5bf8ea850fa216eb96007b9ab7f2c4182ba23ad945f7df94b07fa836",
  "sourceIp": "80.120.7.190",
  "sourcePort": 47021
}

{
  "timestamp": "2024-05-14 14:52:54.916",
  "logLevel": "ERROR",
  "traceId": "7b921e95-2c25-e6a4-3ad8-e32471faa7d3",
  "accountId": "851725189817",
  "status": "Failure",
  "eventType": "Disconnect",
  "protocol": "MQTT",
  "clientId": "connector-608",
  "principalId": "fb03ca6b5bf8ea850fa216eb96007b9ab7f2c4182ba23ad945f7df94b07fa836",
  "sourceIp": "80.120.7.190",
  "sourcePort": 47021,
  "reason": "DUPLICATE_CLIENT_ID",
  "details": "A new connection was established with the same client ID",
  "disconnectReason": "DUPLICATE_CLIENTID"
}

then the next log stream shows a connect event:

{
  "timestamp": "2024-05-14 14:54:11.414",
  "logLevel": "INFO",
  "traceId": "f6676a00-5383-fa83-aa57-8a101cbe619b",
  "accountId": "851725189817",
  "status": "Success",
  "eventType": "Connect",
  "protocol": "MQTT",
  "clientId": "connector-608",
  "principalId": "fb03ca6b5bf8ea850fa216eb96007b9ab7f2c4182ba23ad945f7df94b07fa836",
  "sourceIp": "80.120.7.190",
  "sourcePort": 5666
}

and another log stream contains a disconnect:

{
  "timestamp": "2024-05-14 14:56:12.377",
  "logLevel": "ERROR",
  "traceId": "a340d241-125f-cbbe-0734-89bd69d46ed4",
  "accountId": "851725189817",
  "status": "Failure",
  "eventType": "Disconnect",
  "protocol": "MQTT",
  "clientId": "connector-608",
  "principalId": "fb03ca6b5bf8ea850fa216eb96007b9ab7f2c4182ba23ad945f7df94b07fa836",
  "sourceIp": "80.120.7.190",
  "sourcePort": 40656,
  "reason": "DUPLICATE_CLIENT_ID",
  "details": "A new connection was established with the same client ID",
  "disconnectReason": "DUPLICATE_CLIENTID"
}

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 2 days. label May 14, 2024
@kriehzoo

When I stop my client, I get these client logs:

17:02:55.585 [AwsEventLoop 12] DEBUG c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - disconnection with reason code [SESSION_TAKEN_OVER] and sessionExpiryIntervalSeconds [null]: null
17:03:32.011 [SpringApplicationShutdownHook] DEBUG c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - onApplicationEvent: org.springframework.context.event.ContextClosedEvent[source=org.springframework.boot.web.servlet.context.AnnotationConfigServletWebServerApplicationContext@305f031, started on Tue May 14 16:52:51 CEST 2024]
17:03:32.011 [SpringApplicationShutdownHook] DEBUG c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - stop
17:03:32.013 [SpringApplicationShutdownHook] DEBUG c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - close
17:03:32.013 [SpringApplicationShutdownHook] DEBUG c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - closed
17:03:32.013 [SpringApplicationShutdownHook] DEBUG c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - waitForNoResources
17:04:32.202 [SpringApplicationShutdownHook] WARN c.m.c.s.i.m.IoTCoreMqtt5ConnectionHandler - Failed to wait for all resources being cleaned up
java.lang.RuntimeException: Timeout waiting for resource count to drop to zero
at software.amazon.awssdk.crt.CrtResource.waitForNoResources(CrtResource.java:448)
at com.medina.connector.standalone.iot.mqtt5.IoTCoreMqtt5ConnectionHandler.closeClient(IoTCoreMqtt5ConnectionHandler.java:417)
at com.medina.connector.standalone.iot.mqtt5.IoTCoreMqtt5ConnectionHandler.stopAndCloseClient(IoTCoreMqtt5ConnectionHandler.java:404)
at com.medina.connector.standalone.iot.mqtt5.IoTCoreMqtt5ConnectionHandler.onShutdown(IoTCoreMqtt5ConnectionHandler.java:397)
at com.medina.connector.standalone.iot.mqtt5.IoTCoreMqtt5ConnectionHandler.onContextClosedEvent(IoTCoreMqtt5ConnectionHandler.java:443)
at com.medina.connector.standalone.iot.mqtt5.IoTCoreMqtt5ConnectionHandler.onApplicationEvent(IoTCoreMqtt5ConnectionHandler.java:389)
at org.springframework.context.event.SimpleApplicationEventMulticaster.doInvokeListener(SimpleApplicationEventMulticaster.java:185)
at org.springframework.context.event.SimpleApplicationEventMulticaster.invokeListener(SimpleApplicationEventMulticaster.java:178)
at org.springframework.context.event.SimpleApplicationEventMulticaster.multicastEvent(SimpleApplicationEventMulticaster.java:156)
at org.springframework.context.support.AbstractApplicationContext.publishEvent(AbstractApplicationContext.java:451)
at org.springframework.context.support.AbstractApplicationContext.publishEvent(AbstractApplicationContext.java:384)
at org.springframework.context.support.AbstractApplicationContext.doClose(AbstractApplicationContext.java:1130)
at org.springframework.boot.web.servlet.context.ServletWebServerApplicationContext.doClose(ServletWebServerApplicationContext.java:174)
at org.springframework.context.support.AbstractApplicationContext.close(AbstractApplicationContext.java:1093)
at org.springframework.boot.SpringApplicationShutdownHook.closeAndWait(SpringApplicationShutdownHook.java:145)
at java.base/java.lang.Iterable.forEach(Iterable.java:75)
at org.springframework.boot.SpringApplicationShutdownHook.run(SpringApplicationShutdownHook.java:114)
at java.base/java.lang.Thread.run(Thread.java:1583)
17:04:32.215 [SpringApplicationShutdownHook] INFO o.s.i.e.SourcePollingChannelAdapter - stopped bean 'fileWatcherFlow.org.springframework.integration.config.SourcePollingChannelAdapterFactoryBean#0'; defined in: 'class path resource [com/medina/connector/standalone/file/FileWatcherConfig.class]'; from source: 'bean method fileWatcherFlow'
17:04:32.215 [SpringApplicationShutdownHook] INFO o.s.i.channel.DirectChannel - Channel 'application.fileWatcherFlow.channel#0' has 0 subscriber(s).
17:04:32.215 [SpringApplicationShutdownHook] INFO o.s.i.endpoint.EventDrivenConsumer - stopped bean 'fileWatcherFlow.org.springframework.integration.config.ConsumerEndpointFactoryBean#0'; defined in: 'class path resource [com/medina/connector/standalone/file/FileWatcherConfig.class]'; from source: 'bean method fileWatcherFlow'
17:04:32.220 [SpringApplicationShutdownHook] INFO o.s.i.endpoint.EventDrivenConsumer - Removing {logging-channel-adapter:_org.springframework.integration.errorLogger} as a subscriber to the 'errorChannel' channel
17:04:32.220 [SpringApplicationShutdownHook] INFO o.s.i.c.PublishSubscribeChannel - Channel 'application.errorChannel' has 0 subscriber(s).
17:04:32.220 [SpringApplicationShutdownHook] INFO o.s.i.endpoint.EventDrivenConsumer - stopped bean '_org.springframework.integration.errorLogger'

Process finished with exit code 130 (interrupted by signal 2:SIGINT)

The event in CloudWatch that matches the shutdown:

{
  "timestamp": "2024-05-14 15:02:55.634",
  "logLevel": "ERROR",
  "traceId": "1b441fcb-5258-47a5-283c-1cea49068f0d",
  "accountId": "851725189817",
  "status": "Failure",
  "eventType": "Disconnect",
  "protocol": "MQTT",
  "clientId": "connector-608",
  "principalId": "fb03ca6b5bf8ea850fa216eb96007b9ab7f2c4182ba23ad945f7df94b07fa836",
  "sourceIp": "80.120.7.190",
  "sourcePort": 54916,
  "reason": "DUPLICATE_CLIENT_ID",
  "details": "A new connection was established with the same client ID",
  "disconnectReason": "DUPLICATE_CLIENTID"
}

And that is the last IoT event I get there within the next few minutes, so it seems this was indeed the only client running.

@kriehzoo

As expected, the connect fails when using a random UUID:

connection failure with error code: [5150]: assignedClientIdentifier [null], reasonCode [NOT_AUTHORIZED], reasonString [CONNACK:Client is not authenticated/authorized to send the message:0af2347e-d534-cb34-502c-af1bac3ebdb4]

@kriehzoo

I have now also added thread synchronization using a CountDownLatch to make sure the subscribe is only done after a successful connect. The issue persists, however.
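Roughly, the synchronization looks like this (field names and the timeout are illustrative, not my exact code):

// sketch: gate the subscribe on a successful connect
private final CountDownLatch connectedLatch = new CountDownLatch(1);

// in my Mqtt5ClientOptions.LifecycleEvents implementation:
@Override
public void onConnectionSuccess(Mqtt5Client client, OnConnectionSuccessReturn onConnectionSuccessReturn) {
  connectedLatch.countDown();
}

// in onStartup(), inside the existing try block, before subscribeToConnectorEvents():
if (!connectedLatch.await(subscribeTimeoutSeconds, TimeUnit.SECONDS)) {
  throw new ConnectAndSubscribeTopicsFailedException("Timed out waiting for the initial connection", null);
}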

@alfred2g

Hello @kriehzoo,
Try expanding the certificate's permissions to allow any client ID, or create a separate cert/key for testing purposes if you cannot modify the one you are currently using, and then try the UUID-generated client ID again. Also make sure the client ID is generated and used at client-creation time, and not stored anywhere, so the identical generated client ID cannot be used again.
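For testing only, an IoT policy statement along these lines (region and account ID are placeholders; do not keep client/* in production) allows the certificate to connect with any client ID:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "iot:Connect",
      "Resource": "arn:aws:iot:<region>:<account-id>:client/*"
    }
  ]
}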


bretambrose commented May 20, 2024

I feel like we are not going in the right direction in diagnosing this.

What we need are SDK logs at the Debug level. You can find documentation for doing so here: https://awslabs.github.io/aws-crt-java/software/amazon/awssdk/crt/Log.html

I will also echo what's been said previously: this behavior only occurs if multiple clients are fighting over the same client id.

Beyond logs, a complete working sample that repros the issue would also be helpful.
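As a minimal sketch (see the Log javadoc linked above), file-based logging at Debug level can be turned on once at application startup, before the first client or builder is created:

import software.amazon.awssdk.crt.Log;

// call this once, early in startup, before any CRT object exists
Log.initLoggingToFile(Log.LogLevel.Debug, "crt-debug.log"); // the file path is just an example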

@jmklix jmklix added the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 2 days. label May 21, 2024
@kriehzoo

Hi @bretambrose, by SDK logs at Debug level, do you mean CRT logging? I enabled that before to get trace output and provided the relevant part.
I also provided my full client code to AWS support in the related support case. Could you please get in contact with them, because I am probably not allowed by my company to post the full Java class here.

@kriehzoo

@alfred2g I will try that and let you know

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 2 days. label May 27, 2024
@kriehzoo

@alfred2g I added a second IoT policy that allows any client ID for the same certificate, and I changed the code to use a UUID as client ID.
I have not seen any connection-related errors for more than 5 minutes now, which should be enough to verify it is working.
That said, I then changed it back to use the thingName and it is also working. That is unexpected.

@bretambrose

Hi @bretambrose, by SDK logs at Debug level, do you mean CRT logging? I enabled that before to get trace output and provided the relevant part. I also provided my full client code to AWS support in the related support case. Could you please get in contact with them, because I am probably not allowed by my company to post the full Java class here.

Please include full debug logs. Snippets aren't of much use.

@kriehzoo

@alfred2g I also removed the second IoT policy again, and I still do not get SESSION_TAKEN_OVER so far. I will keep an eye on it.

@kriehzoo

What about waitForNoResources()? Am I supposed to call it when closing my client? As mentioned, it always hangs there for some time and then fails with an exception. Would it be safe to remove it?

@kriehzoo

@bretambrose next time it happens I will send you the full debug logs. Thanks for your support, and to @alfred2g as well.

@bretambrose

So this only happens with low frequency?

@kriehzoo

@bretambrose it happened every time I started the client, and now it seems to be gone; to be honest, I am not sure why.

@kriehzoo

If you could give me a quick answer on waitForNoResources(), I could probably resolve this issue.

@bretambrose

@bretambrose it happened every time I started the client, and now it seems to be gone; to be honest, I am not sure why.

Seems very likely that a second process -- either on your machine or on a colleague's machine -- was using the same client id and account.

You don't need to bother with waitForNoResources. That's a public-internal API we used to verify native resources aren't leaking within tests/samples.

@kriehzoo

@bretambrose thank you for the quick response; I will remove the call to waitForNoResources().
After some testing I will update this issue.

@kriehzoo

@bretambrose the issue cannot be reproduced anymore, so I will close it.


This issue is now closed. Comments on closed issues are hard for our team to see.
If you need more assistance, please open a new issue that references this one.
