Confluence Connector Pagination #3320

@WildDogOne

Bug Description

A full sync with the Confluence connector only pulls 50 documents when a CQL query is set.

To Reproduce

Set a CQL query as an "advanced rule" in the connector's "sync rules", for example:
    [
      {
        "query": "created >= now('-5y')"
      }
    ]

Expected behavior

The sync pulls all Confluence content from the last 5 years (obvious overkill, but that is a different story).

Environment

8.17.3

Solution

I have been playing around with the "paginated_api_call" function in "confluence.py" and noticed that it relies on a next link in the response to find the following page.
However, according to the API documentation, the /api/search call does not seem to return such a link:
https://docs.atlassian.com/atlassian-confluence/REST/6.6.0/#content-search

It seems that pagination for a search has to be done by moving the start window.
Here is a quick proof of concept that still follows the next link in case another function needs it:

    async def paginated_api_call(self, url_name, **url_kwargs):
        """Make a paginated API call for Confluence objects using the passed url_name.
        Args:
            url_name (str): URL Name to identify the API endpoint to hit
        Yields:
            response: JSON response.
        """
        base_url = os.path.join(self.host_url, URLS[url_name].format(**url_kwargs))
        # Assumes base_url already contains a query string, so "&start=" is valid.
        url = f"{base_url}&start=0"

        while True:
            try:
                self._logger.debug(f"Starting pagination for API endpoint {url}")
                response = await self.api_call(url=url)
                json_response = await response.json()
                yield json_response

                links = json_response.get("_links") or {}
                start = json_response.get("start", 0)
                size = json_response.get("size", 0)
                total_size = json_response.get("totalSize", 0)
                if links.get("next"):
                    # Follow the server-provided next link when there is one.
                    url = os.path.join(self.host_url, links.get("next")[1:])
                elif size and start + size < total_size:
                    # No next link (e.g. search): advance the start window.
                    url = f"{base_url}&start={start + size}"
                else:
                    return
            except Exception as exception:
                self._logger.warning(
                    f"Skipping data for type {url_name} from {base_url}. Exception: {exception}."
                )
                break
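
To check the start-window arithmetic without a Confluence instance, here is a minimal, self-contained sketch. `fake_search`, `paginate` and `collect` are hypothetical names, and the response shape (`start`, `size`, `totalSize`, `results`) is only assumed to match what the search endpoint returns. It is not the connector's code.

```python
import asyncio

TOTAL = 120  # pretend the CQL query matches 120 documents


async def fake_search(start, limit=50):
    """Hypothetical stand-in for the search endpoint: one page, no next link."""
    results = list(range(start, min(start + limit, TOTAL)))
    return {
        "results": results,
        "start": start,
        "size": len(results),
        "totalSize": TOTAL,
        "_links": {},
    }


async def paginate(fetch_page):
    """Yield pages by moving the start window until totalSize is reached."""
    start = 0
    while True:
        page = await fetch_page(start)
        yield page
        size = page.get("size", 0)
        next_start = page.get("start", start) + size
        # Stop on an empty page too, so a wrong totalSize cannot loop forever.
        if size == 0 or next_start >= page.get("totalSize", 0):
            return
        start = next_start


async def collect():
    documents = []
    async for page in paginate(fake_search):
        documents.extend(page["results"])
    return documents


documents = asyncio.run(collect())
print(len(documents))
```

With the fake endpoint above, all 120 documents are collected in three requests instead of stopping after the first 50.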

While debugging this I also found another issue in the function "search_by_query": it never checks whether "entity_details" exists, so if "entity_details" is None, the call fails.
I fixed this with an additional condition:

    async def search_by_query(self, query):
        async for entity in self.confluence_client.search_by_query(query=query):
            # entity can be space or content
            entity_details = entity.get(SPACE) or entity.get(CONTENT)

            if not entity_details:
                continue
            if (
                entity_details.get("type", "") == "attachment"
                and entity_details.get("container", {}).get("title") is None
            ):
                continue
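
A quick illustration of why the guard matters, as a sketch with made-up data. The values of `SPACE` and `CONTENT` are assumed here to be "space" and "content", and `keep_entity` is a hypothetical helper, not a function from the connector.

```python
SPACE = "space"      # assumed value of the connector constant
CONTENT = "content"  # assumed value of the connector constant


def keep_entity(entity):
    """Return True if a search result should be processed further."""
    entity_details = entity.get(SPACE) or entity.get(CONTENT)
    if not entity_details:
        # Neither key is present; without this check the .get() calls
        # below would raise AttributeError on None.
        return False
    if (
        entity_details.get("type", "") == "attachment"
        and entity_details.get("container", {}).get("title") is None
    ):
        return False
    return True


results = [
    {"content": {"type": "page", "title": "Runbook"}},
    {"user": {"displayName": "someone"}},  # neither space nor content
    {"content": {"type": "attachment", "container": {}}},
]
kept = [r for r in results if keep_entity(r)]
print(kept)
```

Only the first result survives: the second has no entity details at all, and the third is an attachment whose container has no title.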
