Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Azure Blob Storage] Add logs #2530

Open
wants to merge 5 commits into
base: main
Choose a base branch
from
Open

Conversation

moxarth-elastic
Copy link
Collaborator

@moxarth-elastic moxarth-elastic commented May 8, 2024

Part Of #2299

Adding more logs in Azure Blob Storage connector.

Log file: https://drive.google.com/file/d/1RMuieFYk-llbqZUj__eYJbnUY5W2wq7K/view?usp=drive_link
Updated file: https://drive.google.com/file/d/1dVEpn6ccoz5uKAgzSrahtDp1vyifJDIh/view?usp=drive_link

Checklists

Pre-Review Checklist

  • this PR does NOT contain credentials of any kind, such as API keys or username/passwords (double check config.yml.example)
  • this PR has a meaningful title
  • this PR links to all relevant github issues that it fixes or partially addresses
  • if there is no GH issue, please create it. Each PR should have a link to an issue
  • this PR has a thorough description
  • Tested the changes locally
  • Added a label for each target release version (example: v7.13.2, v7.14.0, v8.0.0)

Copy link
Member

@jedrazb jedrazb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, one additional thing I noticed is that in get_docs we have

 async def get_docs(self, filtering=None):
        ...
        async for container_data in self.get_container(container_list=self.containers):
            if container_data:
                ...

I don't have the context to know why we need if container_data but if it's needed I would log a debug statement that in else clause that we are skipping container because it's empty(?) / doesn't have data?

@@ -212,6 +216,7 @@ async def get_container(self, container_list):
Yields:
dictionary: Container document with name & metadata
"""
self._logger.debug("Fetching containers")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in get_blob we have log level INFO, maybe worth changing log level INFO here as well to keep it consistent?

@@ -185,6 +185,7 @@ async def get_content(self, blob, timestamp=None, doit=None):
if not self.can_file_be_downloaded(file_extension, filename, file_size):
return

self._logger.debug(f"Downloading content for file: {filename}")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a duplicate line - you already have "f"Downloading content for blob: {blob_name} from {container_name} container"" that happens right after

@@ -247,6 +252,7 @@ async def get_blob(self, container):
Yields:
dictionary: Formatted blob document
"""
self._logger.info(f"Fetching blobs for '{container['name']}' container")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
self._logger.info(f"Fetching blobs for '{container['name']}' container")
self._logger.info(f"Fetching blobs for container '{container['name']}'")

Copy link
Member

@artem-shelkovnikov artem-shelkovnikov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Would be also cool to log how many blobs are found in the container. Also would be nice to report progress in format of "Extracted X out of Y blobs for container {container}" reported every 100 lines and in the end of sync.
  2. Log lines "Generating connection string." should be DEBUG
  3. Log lines "Successfully connected to the Azure Blob Storage" should be DEBUG
  4. Need a DEBUG log line saying which containers were found, then which ones will be synced in the end after all validations for it

@moxarth-elastic
Copy link
Collaborator Author

I don't have the context to know why we need if container_data but if it's needed I would log a debug statement that in else clause that we are skipping container because it's empty(?) / doesn't have data?

This condition was added due this if statement. So, at the end of the execution of this method

async def get_container(self, container_list):
we return None so added a condition in get_docs to handle it.

@jedrazb
Copy link
Member

jedrazb commented May 14, 2024

@moxarth-elastic could you share updated example DEBUG output, before we hit ✅

@moxarth-elastic
Copy link
Collaborator Author

@moxarth-elastic could you share updated example DEBUG output, before we hit ✅

Here is the updated log file: https://drive.google.com/file/d/1HJ36WDN87P3q4fCeQGu8FiQUrh9texD-/view?usp=drive_link

@praveen-elastic
Copy link
Collaborator

Buildkite test this

Comment on lines 294 to 296
self._logger.info(
f"Fetched {blob_count} blobs from '{container['name']}' container"
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should go into a finally statement so that it's always shown.

Copy link
Member

@artem-shelkovnikov artem-shelkovnikov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any chance we can have logs of actual requests to the ABS?

@moxarth-elastic
Copy link
Collaborator Author

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants