
Apparently random NoCredentialsError after running for a while #1006

Open
4 of 6 tasks
jsoucheiron opened this issue Apr 18, 2023 · 21 comments
Labels
bug Something isn't working

Comments

@jsoucheiron commented Apr 18, 2023

Describe the bug
We have an aiohttp server that sends SQS messages as a result of certain actions. After running for a while we'll get:

Traceback (most recent call last):
  File "/usr/local/skyscanner/app/services/sqs.py", line 21, in send_message
    await client.send_message(
  File "/usr/local/lib/python3.10/site-packages/aiobotocore/client.py", line 141, in _make_api_call
    http, parsed_response = await self._make_request(
  File "/usr/local/lib/python3.10/site-packages/aiobotocore/client.py", line 161, in _make_request
    return await self._endpoint.make_request(operation_model, request_dict)
  File "/usr/local/lib/python3.10/site-packages/aiobotocore/endpoint.py", line 77, in _send_request
    request = await self.create_request(request_dict, operation_model)
  File "/usr/local/lib/python3.10/site-packages/aiobotocore/endpoint.py", line 70, in create_request
    await self._event_emitter.emit(event_name, request=request,
  File "/usr/local/lib/python3.10/site-packages/aiobotocore/hooks.py", line 27, in _emit
    response = await handler(**kwargs)
  File "/usr/local/lib/python3.10/site-packages/aiobotocore/signers.py", line 16, in handler
    return await self.sign(operation_name, request)
  File "/usr/local/lib/python3.10/site-packages/aiobotocore/signers.py", line 63, in sign
    auth.add_auth(request)
  File "/usr/local/lib/python3.10/site-packages/botocore/auth.py", line 378, in add_auth
    raise NoCredentialsError()
botocore.exceptions.NoCredentialsError: Unable to locate credentials

Our code that triggers the issue in production, where we use IAM roles:

from typing import Any, Dict, Optional

from aiobotocore.session import get_session

# json_dumps_extended and logger are project-specific helpers not shown here.


class SQSService:
    def __init__(self, sqs_region: str, sqs_url: str):
        self.default_source = "unknown"
        self.sqs_region = sqs_region
        self.sqs_url = sqs_url

    async def send_message(self, pushed_data: Dict[str, Any], data_type: str, source: Optional[str]):
        try:
            session = get_session()
            async with session.create_client("sqs", region_name=self.sqs_region) as client:
                await client.send_message(
                    QueueUrl=self.sqs_url,
                    MessageBody=json_dumps_extended(
                        {"pushed_data": pushed_data, "data_type": data_type, "source": source or self.default_source}
                    ),
                )
        except Exception:
            logger.exception(f"Something went wrong in SQS upload of {pushed_data}")

We've tried multiple versions including 2.0.0 and 2.5.0

After many tests trying to find a way to reproduce the issue locally, we've managed to mitigate it using backoff. When we do, this is what we get:
(screenshot of the retry logs attached in the original issue)

This leads me to believe there's a race condition somewhere that only triggers after running for a while, where you might temporarily end up with missing credentials.
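A minimal sketch of a backoff-based mitigation like the one mentioned above (send_with_retry and its parameters are illustrative names, not the exact wrapper used in production):

import backoff
from aiobotocore.session import get_session
from botocore.exceptions import NoCredentialsError


# Hypothetical wrapper: retry with exponential backoff when credential
# resolution transiently fails.
@backoff.on_exception(backoff.expo, NoCredentialsError, max_tries=5)
async def send_with_retry(sqs_region: str, sqs_url: str, message_body: str) -> None:
    session = get_session()
    async with session.create_client("sqs", region_name=sqs_region) as client:
        await client.send_message(QueueUrl=sqs_url, MessageBody=message_body)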

Checklist

  • I have reproduced in environment where pip check passes without errors
  • I have provided pip freeze results
  • I have provided sample code or detailed way to reproduce
  • I have tried the same code in botocore to ensure this is an aiobotocore specific issue
  • I have tried similar code in aiohttp to ensure this is an aiobotocore specific issue
  • I have checked the latest and older versions of aiobotocore/aiohttp/python to see if this is a regression / injection

pip freeze results

$ pip freeze
aiobotocore==2.0.0
aiocache==0.12.0
aiodns==3.0.0
aiohttp==3.8.1
aioitertools==0.11.0
aiosignal==1.3.1
aiotask-context==0.6.1
alembic==1.0.11
async-timeout==4.0.2
asyncpg==0.27.0
asyncpgsa==0.27.1
attrs==23.1.0
backoff==2.2.1
basictracer==3.2.0
black==23.3.0
boto3==1.19.8
botocore==1.22.8
Brotli==1.0.9
build==0.10.0
cachetools==5.3.0
cchardet==2.1.7
certifi==2022.12.7
cffi==1.15.1
charset-normalizer==2.1.1
click==8.1.3
coverage==7.2.3
cryptography==3.4.8
Deprecated==1.2.13
exceptiongroup==1.1.1
flake8==6.0.0
frozenlist==1.3.3
googleapis-common-protos==1.59.0
grpcio==1.53.0
gunicorn==20.1.0
idna==3.4
importlib-metadata==6.4.1
iniconfig==2.0.0
isort==5.12.0
Jinja2==3.1.2
jmespath==0.10.0
jq==1.4.1
jsonpickle==3.0.1
lightstep==4.4.8
Mako==1.2.4
markdown-it-py==2.2.0
MarkupSafe==2.1.2
mccabe==0.7.0
mdurl==0.1.2
moto==4.1.7
multidict==6.0.4
mypy-extensions==1.0.0
object-pool==0.2
opentelemetry-api==1.15.0
opentelemetry-exporter-otlp==1.15.0
opentelemetry-exporter-otlp-proto-grpc==1.15.0
opentelemetry-exporter-otlp-proto-http==1.15.0
opentelemetry-instrumentation==0.36b0
opentelemetry-instrumentation-aiohttp-client==0.36b0
opentelemetry-instrumentation-logging==0.36b0
opentelemetry-opentracing-shim==0.36b0
opentelemetry-propagator-ot-trace==0.36b0
opentelemetry-proto==1.15.0
opentelemetry-sdk==1.15.0
opentelemetry-semantic-conventions==0.36b0
opentelemetry-util-http==0.36b0
opentracing==2.4.0
packaging==23.1
pathspec==0.11.1
pbr==5.11.1
pip-tools==6.13.0
platformdirs==3.2.0
pluggy==1.0.0
pprintpp==0.4.0
protobuf==3.20.3
psycopg2-binary==2.9.6
pycares==4.3.0
pycodestyle==2.10.0
pycparser==2.21
pydantic==1.10.7
pyflakes==3.0.1
pyformance==0.4
Pygments==2.15.0
pyproject_hooks==1.0.0
pytest==7.3.1
pytest-aiohttp==1.0.4
pytest-asyncio==0.21.0
pytest-clarity==1.0.1
pytest-cov==2.12.1
pytest-env==0.6.2
pytest-mock==1.12.1
python-dateutil==2.8.2
python-editor==1.0.4
PyYAML==6.0
requests==2.28.2
responses==0.23.1
rich==13.3.4
s3transfer==0.5.2
six==1.16.0
SQLAlchemy==1.3.24
statsd==3.3.0
thrift==0.16.0
toml==0.10.2
tomli==2.0.1
types-PyYAML==6.0.12.9
types-requests==2.28.11.17
types-urllib3==1.26.25.10
typing_extensions==4.5.0
urllib3==1.26.15
uvloop==0.17.0
Werkzeug==2.2.3
wrapt==1.15.0
xmltodict==0.13.0
yarl==1.8.2
zipp==3.15.0

Environment:

  • Python Version: 3.10 (per the traceback paths)
  • OS name and version: [e.g. linux(python-slim docker)]
  • We haven't been able to reproduce for 3.9 and aiobotocore==1.2.2

Additional context
Happy to provide any further context to help resolve this.

@thehesiod (Collaborator)

there's definitely something funky with creds based on all the recent issues logged. we need a reliable test case where we can compare debug botocore and aiobotocore logs

@jsoucheiron (Author)

I wish I could provide it but I haven't managed to reproduce this locally yet, just in production after running for a while.

@DavidRomanovizc added the bug label on May 2, 2023
@hbikki commented May 8, 2023

I noticed a similar issue happening when reading/writing to S3 with a process count > 5 on version 2.4.2.

@thehesiod (Collaborator)

any interesting info with debug level logging?
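(For reference, capturing those logs only needs standard logging configuration; a minimal sketch:)

import logging

# Root logger at INFO keeps output readable; botocore/aiobotocore at DEBUG shows
# credential lookup and request signing for each call.
logging.basicConfig(level=logging.INFO)
logging.getLogger("botocore").setLevel(logging.DEBUG)
logging.getLogger("aiobotocore").setLevel(logging.DEBUG)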

@jsoucheiron (Author)

To add some additional context on this that might help untangle the issue:

  • SQSService is long lived
  • session = get_session() will be called multiple times
  • get_session() calls might happen during token rotation

Would it be a better approach to have a long-lived session instantiated in the class instead of creating a new one every time send_message() is called?

@thehesiod (Collaborator)

long lived session/client always preferred. botocore should take care of refreshing credentials.

@jsoucheiron commented May 9, 2023 via email

@thehesiod (Collaborator)

can you try once the release with #1022 is available?

@thehesiod (Collaborator)

could be related to #1025, I'd try once that release is available (later today)

@thehesiod (Collaborator)

actually the important part isn't the session, it's the client: you should keep your client alive for as long as possible. A client is tied to a connection pool, so it's expensive to keep re-creating them

@thehesiod (Collaborator)

to debug this I really need a reproducible test case. I have my own AWS account, so if you can create a fully encapsulated test case I can try to debug it; otherwise there just isn't enough for me to go on here and I'll have to close it. Another option is to create a test case using moto
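For anyone putting such a test case together, a minimal self-contained skeleton using moto's threaded server could look roughly like this (port, region and queue name are arbitrary; this only shows the happy path, not the credential failure itself):

import asyncio

from aiobotocore.session import get_session
from moto.server import ThreadedMotoServer  # moto appears in the pip freeze above


async def main() -> None:
    # moto's in-process mocks don't intercept aiohttp traffic, so run moto as a
    # local HTTP server and point the aiobotocore client at it with dummy creds.
    server = ThreadedMotoServer(port=5001)
    server.start()
    try:
        session = get_session()
        async with session.create_client(
            "sqs",
            region_name="eu-west-1",
            endpoint_url="http://127.0.0.1:5001",
            aws_access_key_id="testing",
            aws_secret_access_key="testing",
        ) as client:
            queue = await client.create_queue(QueueName="repro-queue")
            await client.send_message(QueueUrl=queue["QueueUrl"], MessageBody="hello")
    finally:
        server.stop()


asyncio.run(main())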

@jsoucheiron (Author)

The problem is that, given the client is an async context manager, there's no nice/elegant way to have a long-lived client. You'd need to enter it manually and create some teardown hook to exit.

@thehesiod (Collaborator)

sure there is, we do this all the time:

import contextlib

from aiobotocore.session import get_session


class SQSService:
    def __init__(self, sqs_region: str, sqs_url: str):
        self.default_source = "unknown"
        self.sqs_region = sqs_region
        self.sqs_url = sqs_url
        self._exit_stack = contextlib.AsyncExitStack()

    async def __aenter__(self):
        # The exit stack keeps the client (and its connection pool) open until __aexit__.
        session = get_session()
        self._client = await self._exit_stack.enter_async_context(
            session.create_client("sqs", region_name=self.sqs_region)
        )
        return self

    async def __aexit__(self, *args):
        await self._exit_stack.__aexit__(*args)
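One way to wire this into an aiohttp app so the client lives as long as the application is via cleanup_ctx; a sketch building on the class above (the region and queue URL are placeholders):

from aiohttp import web


async def sqs_ctx(app: web.Application):
    # Enter the service once at startup; aiohttp resumes the generator on
    # shutdown, which closes the client and its connection pool.
    async with SQSService("eu-west-1", "https://sqs.eu-west-1.amazonaws.com/123456789012/my-queue") as sqs:
        app["sqs"] = sqs
        yield


def create_app() -> web.Application:
    app = web.Application()
    app.cleanup_ctx.append(sqs_ctx)
    return app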

@jsoucheiron (Author)

This is the kind of pattern I'd love to see documented.
If there are certain ways of using the library that minimize load or are generally best practices, given how it operates internally, we should make this explicit in the docs so people can adopt these patterns.

@thehesiod (Collaborator)

I think we assumed it was common knowledge but open to PRs / issues to add to docs

@thehesiod (Collaborator)

I'd like to be able to get to the bottom of what's causing this issue as well though. Unfortunately we'll need some sort of way to reproduce

@nishantkumar1292

Do we have a solution for this yet? I'm still experiencing this. However, I thought the issue was not random, and occurred with almost every call. But it could be because I only looked at later logs.

I wonder if explicitly passing access_key and secret_key would resolve this?
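For context, passing credentials explicitly is supported by create_client; a sketch with placeholder values (note that static keys bypass the provider chain but are never refreshed, unlike IAM-role credentials):

from aiobotocore.session import get_session


async def send(sqs_region: str, sqs_url: str, body: str) -> None:
    session = get_session()
    # Placeholder static credentials; rotation becomes the caller's problem.
    async with session.create_client(
        "sqs",
        region_name=sqs_region,
        aws_access_key_id="AKIA_PLACEHOLDER",
        aws_secret_access_key="SECRET_PLACEHOLDER",
    ) as client:
        await client.send_message(QueueUrl=sqs_url, MessageBody=body)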

@thehesiod (Collaborator)

we need a way to repro or a detailed analysis from someone who can repro

@github-actions bot added the Stale label on Sep 21, 2024
@thehesiod (Collaborator)

What's funny is we're hitting something like this as well on AWS. I'm guessing the AWS call periodically fails for some reason.

@github-actions bot removed the Stale label on Sep 22, 2024
@KedaSong

Following up on this issue: is there any update or fix coming? I never saw this issue while using botocore/boto3, but switching to aiobotocore gives me this error now.

@aio-libs deleted a comment from github-actions bot on Nov 20, 2024
@thehesiod (Collaborator)

@KedaSong at this point I'm a broken record: we need a reproducible test case.
