Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Confluence pagination handling #5440

Merged
merged 4 commits into from
Feb 19, 2025

Conversation

san81
Copy link
Collaborator

@san81 san81 commented Feb 17, 2025

Description

With each search call, Confluence API returns the cursors for the next page that we should use to navigate through the search results. Enhanced the code to follow through these cursor based navigation

Issues Resolved

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • New functionality includes testing.
  • New functionality has a documentation issue. Please link to it in this PR.
    • New functionality has javadoc added
  • Commits are signed with a real name per the DCO

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Santhosh Gandhe <1909520+san81@users.noreply.github.com>
Signed-off-by: Santhosh Gandhe <1909520+san81@users.noreply.github.com>
Signed-off-by: Santhosh Gandhe <1909520+san81@users.noreply.github.com>
@@ -38,7 +40,7 @@ public class ConfluenceRestClient extends AtlassianRestClient {
//public static final String REST_API_SPACES = "/rest/api/api/spaces";
public static final String FIFTY = "50";
public static final String START_AT = "startAt";
public static final String MAX_RESULT = "maxResults";
public static final String LIMIT = "limit";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe LIMIT_PARAM or LIMIT_PARAM_NAME for clarity.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renamed to LIMIT_PARAM

} while (startAt < total);
log.debug("Content items fetched so far: {}", total);
paginationLinks = searchContentItems.getLinks();
} while (paginationLinks != null && paginationLinks.getNext() != null);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have any concerns about having this as a loop that we don't have control over the bounds for?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Is there an expectation that this while loop terminates?

Copy link
Collaborator Author

@san81 san81 Feb 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. It is paginating the confluence site content using confluence search results api provided cursor pointer. It can be a large site but it can't go infinite because we are relaying on the Confluence api response.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get that it is bounded by the response. But, that could put our code in an infinite loop if Confluence had a bug or the server itself were maliciously sending the same link over and over.

Should we have any sort of checks to ensure that doesn't happen?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dlvenable it looks like it is infinite loop in one way or the other. Even if it terminates the loop here, it will come back here to read more. So, as a source, that is OK, right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yah, agree with Krishna. Even if we exit the loop, we will come back here to complete the page navigation

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea that makes sense. We'd hit it again. Maybe we should have a metric or some other mechanism to at least detect it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, currently, I am recording the metric at the end of the loop. I will move inside in my next PR

Signed-off-by: Santhosh Gandhe <1909520+san81@users.noreply.github.com>
} while (startAt < total);
log.debug("Content items fetched so far: {}", total);
paginationLinks = searchContentItems.getLinks();
} while (paginationLinks != null && paginationLinks.getNext() != null);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get that it is bounded by the response. But, that could put our code in an infinite loop if Confluence had a bug or the server itself were maliciously sending the same link over and over.

Should we have any sort of checks to ensure that doesn't happen?

@kkondaka kkondaka merged commit b8bf24b into opensearch-project:main Feb 19, 2025
45 of 47 checks passed
divbok pushed a commit to divbok/data-prepper that referenced this pull request Feb 24, 2025
* pagination handling

Signed-off-by: Santhosh Gandhe <1909520+san81@users.noreply.github.com>

* adding option to pass batch size for the source coordinator partition

Signed-off-by: Santhosh Gandhe <1909520+san81@users.noreply.github.com>

* renaming the constant

Signed-off-by: Santhosh Gandhe <1909520+san81@users.noreply.github.com>

* review comments

Signed-off-by: Santhosh Gandhe <1909520+san81@users.noreply.github.com>

---------

Signed-off-by: Santhosh Gandhe <1909520+san81@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants