
EC2 metadata endpoint intermittent credentials retrieval failures (IMDSv1) #666

Open
krynju opened this issue Jan 22, 2024 · 5 comments

krynju commented Jan 22, 2024

In early December we started observing occasional AWS.jl credential-retrieval failures during CI runs.
When running on an AWS EC2 instance with credentials assigned, we would get the standard "Can't find AWS credentials!" message.
The issue only appeared when the Julia process was under increased load from the CI run; I could never reproduce it outside that scenario.

This is on Julia 1.9.3, AWS.jl 1.89.1, and IMDSv1.

Debugging timeline:

  1. Ruled out possible issues with networking and Docker
  2. Tried to trigger the issue manually, no luck
  3. Forked AWS.jl and added error logs on failures. The failures turned out to be connection timeouts from HTTP.jl
  4. Increased the timeout to 10s and added an additional run(`curl ...`) call to the same URL whenever the EC2 metadata endpoint timed out: HTTP.jl still timed out, and curl ran fine immediately afterwards
  5. Launched 3 test containers calling the metadata endpoint every 5s (AWS.jl with a 1s timeout, curl with a 1s timeout, and curl with a 60s timeout) - so far no failures, so it doesn't look like an issue caused by the metadata endpoint's availability
  6. Increased the timeout to 20s - still timed out while curl ran just fine
  7. Increased the timeout to 30s - still timed out while curl ran just fine
  8. Changed the implementation to use Downloads.jl (based on libcurl) - no failures observed since then
  9. Update (23.01.2024) - I observed a timeout failure in the Downloads.jl-based implementation too. Even though the timeout was set to 5s, the logs noted timeouts of 7 and 17 seconds for two separate calls.

It looks like some timeout issue in the HTTP.jl client code, but I haven't dug into it yet.
I will add updates once I do, as this is probably more of an HTTP.jl issue than an AWS.jl one, but I'm posting it here for visibility in case someone hits a similar issue.

Update (23.01.2024) - Downloads.jl failed too, so it may not be an HTTP.jl issue after all, but some Julia-level issue.

Downloads.jl patch
https://github.com/krynju/AWS.jl/compare/v1.89.1...krynju:AWS.jl:kr/downloadsjl?expand=1
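Conceptually, the Downloads.jl-based patch swaps the HTTP.jl call for libcurl via Downloads.jl. A minimal sketch of the idea (not the actual patch; the helper name and the simplified error handling are mine):

```julia
using Downloads

# Hypothetical helper, not part of AWS.jl: fetch an IMDSv1 metadata path
# via Downloads.jl (libcurl) with an overall timeout in seconds.
function imds_get(path; timeout=5)
    io = IOBuffer()
    Downloads.download("http://169.254.169.254/latest/meta-data/$path", io;
                       timeout=timeout)
    return String(take!(io))
end

# On an EC2 instance with an instance profile attached, this would return
# the role name and then its temporary credentials as JSON:
# role  = imds_get("iam/security-credentials/")
# creds = imds_get("iam/security-credentials/$role")
```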

omus (Member) commented Jan 25, 2024

Nice sleuthing work.

> The issue only appeared when the Julia process was under increased load from the CI run; I could never reproduce it outside that scenario.

I believe the IMDS service itself is hosted on the same machine as the instance, so it's possible that a heavy load could cause IMDS to become unresponsive. Using curl would probably work around these kinds of issues, as it should behave better with retries after connection failures. For example, with HTTP.jl, retries after connection failures occur immediately:

julia> using HTTP

julia> @time try
           HTTP.get("http://203.0.113.0", connect_timeout=5, retries=3, retry_delay=fill(10, 3), verbose=1)  # Host is the unused TEST-NET-3 iprange
       catch e
           e
       end;
┌ LogLevel(-1001): 🔄  Retry HTTP.Exceptions.ConnectError("http://203.0.113.0", CapturedException(Base.IOError("connect: host is unreachable (EHOSTUNREACH)", -65), Any[(getconnection(::Type{Sockets.TCPSocket}, host::SubString{String}, port::SubString{String}; keepalive::Bool, readtimeout::Int64, kw::Base.Pairs{Symbol, Any, NTuple{5, Symbol}, NamedTuple{(:require_ssl_verification, :iofunction, :decompress, :verbose, :retry_delay), Tuple{Bool, Nothing, Nothing, Int64, Vector{Int64}}}}) at Connections.jl:526, 1), (getconnection at Connections.jl:503 [inlined], 1), (#10 at Connections.jl:462 [inlined], 1), (macro expansion at try_with_timeout.jl:82 [inlined], 1), ((::ConcurrentUtilities.var"#2#4"{Any, Timer, ConcurrentUtilities.TimedOut{Any}, Channel{Any}, HTTP.Connections.var"#10#13"{Sockets.TCPSocket, Bool, Bool, Base.Pairs{Symbol, Any, NTuple{5, Symbol}, NamedTuple{(:readtimeout, :iofunction, :decompress, :verbose, :retry_delay), Tuple{Int64, Nothing, Nothing, Int64, Vector{Int64}}}}, SubString{String}, SubString{String}}})() at threadingconstructs.jl:416, 1)])): GET / HTTP/1.1
└ @ HTTP.RetryRequest ~/.julia/packages/HTTP/bDoga/src/clientlayers/RetryRequest.jl:62
(the same retry log lines repeat twice more)
  0.733070 seconds (4.53 k allocations: 1.041 MiB)

So I think that to fix the issue you're seeing, we need to add retries with a delay between them.

However, this introduces a new problem to solve. For AWS.jl users without an IMDS server running, we don't want to introduce delays from multiple retries; we'd rather fail fast. What we may want to do is disable retries and use a fast connect_timeout for the initial IMDS connection, and then, once we've confirmed IMDS is available, enable retries.
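The probe-then-retry idea could look roughly like this (a sketch, not the eventual AWS.jl implementation; the helper names, timeouts, and delay values are mine):

```julia
using HTTP

const IMDS_BASE = "http://169.254.169.254/latest/meta-data"  # standard IMDS address

# Phase 1: a single fast probe with no retries, so hosts without an IMDS
# server fail quickly instead of stacking up retry delays.
function imds_available(; connect_timeout=1)
    try
        HTTP.get("$IMDS_BASE/"; connect_timeout=connect_timeout, retry=false)
        return true
    catch
        return false
    end
end

# Phase 2: once IMDS is known to exist, use retries with delays between
# attempts to ride out transient unresponsiveness under load.
function imds_get(path)
    imds_available() || error("IMDS not available")
    return HTTP.get("$IMDS_BASE/$path";
                    connect_timeout=5, retries=3, retry_delays=fill(2, 3))
end
```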

If you want to make a PR for this I'd be happy to review.

Also, is there a reason you're not using AWS.jl 1.90+? There was some refactoring of our IMDS code, which added support for IMDSv2 and IMDSv1 and a more robust test suite.

krynju (Author) commented Jan 25, 2024

> I believe the IMDS service itself is hosted on the same machine as the instance, so it's possible that a heavy load could cause IMDS to become unresponsive. Using curl would probably work around these kinds of issues, as it should behave better with retries after connection failures. For example, with HTTP.jl, retries after connection failures occur immediately:

I don't know much about it, but I'm still leaning toward this being a Julia issue rather than an IMDS one.

I'm running 3 test containers that only run the credentials fetch in a loop every 5 seconds, and so far none of them has logged any errors (it's been a week). Meanwhile, there have been failures on the affected containers (mostly just one, but I've seen it happen on two different ones, though they're less loaded). All of this is on the same EC2 instance, so I think even if IMDS were running off the instance we would probably see failures in the test containers too.

> So I think that to fix the issue you're seeing, we need to add retries with a delay between them.

> However, this introduces a new problem to solve. For AWS.jl users without an IMDS server running, we don't want to introduce delays from multiple retries; we'd rather fail fast. What we may want to do is disable retries and use a fast connect_timeout for the initial IMDS connection, and then, once we've confirmed IMDS is available, enable retries.

> If you want to make a PR for this I'd be happy to review.

There's still the issue that we're using connect_timeout to detect the existence of IMDS, which, as we're seeing, is not reliable in this case (it fails randomly). It would probably take a bigger redesign of the credential-fetching interface to get this working reliably. I wonder how other AWS libraries/packages are doing it - it might be worthwhile to have a look.

Anyway, either what you're suggesting is the way to go, or some more advanced interface where we specify the expected credential source(s) and retry on them.

I'm looking into this and trying to get things somewhat stable for now, but I don't have a bulletproof solution in mind yet. We probably shouldn't have this issue in the first place, and it's weird that two different HTTP client packages (HTTP.jl and Downloads.jl) are facing the same issue.

For now, I've added retries to the wrapper we use for credentials management and am testing it out. The additional MethodError and StatusError cases are there because the EC2 instance-profile call sometimes fails mid-execution, which raises those.
It goes something like this:

# using AWS, HTTP, Retry   (Retry.jl provides @repeat/@retry;
# `aws_config_kwargs` is defined elsewhere in our wrapper)
awsconf = @repeat 10 try
    AWS.AWSConfig(; aws_config_kwargs...)
catch ex
    @retry if (
        ex isa AWS.AWSExceptions.NoCredentials ||
        ex isa MethodError ||
        ex isa HTTP.Exceptions.StatusError
    )
        @warn("Retrying AWS credentials initial fetch...")
    end
end

> Also, is there a reason you're not using AWS.jl 1.90+? There was some refactoring of our IMDS code, which added support for IMDSv2 and IMDSv1 and a more robust test suite.

I was looking to upgrade to IMDSv2, but this instability basically blocked me, so I just stayed on v1.89 to figure things out. I tested it and it works fine (I had to increase the hop limit), but the same stability issue seems to affect it too, from what I saw in the one upgrade attempt I made.

omus (Member) commented Jan 25, 2024

> I'm looking into this and trying to get things somewhat stable for now, but I don't have a bulletproof solution in mind yet. We probably shouldn't have this issue in the first place, and it's weird that two different HTTP client packages (HTTP.jl and Downloads.jl) are facing the same issue.

I think having this issue open is fine, as we're trying to solve "IMDS connections intermittently fail". It may end up being solved in another library, but this at least gives others a place to report their IMDS failures.

> For now, I've added retries to the wrapper we use for credentials management and am testing it out.

The code you provided won't introduce a delay between failures, which could be problematic:

julia> using HTTP, Retry

julia> e = @time try
           @repeat 3 try
              @info "GET"
              HTTP.get("http://203.0.113.0", connect_timeout=5, retry=false, verbose=1)
           catch e
              @retry if true end
           end
       catch e
           e
       end;
[ Info: GET
[ Info: GET
[ Info: GET
  0.017360 seconds (876 allocations: 54.586 KiB)

You can use Retry.jl with @delay_retry, but if you want more control you may need to introduce your own delay. Alternatively, with Base's `retry`:

julia> f = retry(delays=fill(3, 2)) do  # I recommend using `ExponentialBackOff` for your delay
           @info "GET"
           HTTP.get("http://203.0.113.0", connect_timeout=5, retry=false, verbose=1)
       end
#90 (generic function with 1 method)

julia> @time try; f(); catch e; end
[ Info: GET
[ Info: GET
[ Info: GET
  6.052492 seconds (21.95 k allocations: 1.466 MiB, 0.42% compilation time)

One other thing: the IMDS credentials include a session token, which means they eventually expire. The code you have will only retry the initial credential fetch, not credential renewals. You may want to manually set renew in AWSCredentials so you can force retries upon renewal too.
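A hedged sketch of that renewal idea, assuming (per AWS.jl's credential API) that the `AWSCredentials` constructor accepts an `expiry` keyword and that `renew` is a settable field invoked on expiry; `fetch_imds_creds` is a placeholder, not a real AWS.jl function:

```julia
using AWS, Dates

# Placeholder for whatever performs the raw IMDS credential fetch;
# the key values and expiry here are arbitrary.
fetch_imds_creds() =
    AWSCredentials("AKID-PLACEHOLDER", "SECRET-PLACEHOLDER";
                   expiry=now(UTC) + Minute(15))

# Wrap the fetch in Base.retry with exponential backoff so renewals,
# not just the initial fetch, get retried with delays between attempts.
fetch_with_retry = retry(fetch_imds_creds;
                         delays=ExponentialBackOff(n=5, first_delay=1.0, max_delay=30.0))

creds = fetch_with_retry()
creds.renew = fetch_with_retry  # assumes `renew` is a mutable field on AWSCredentials
```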

krynju (Author) commented Jan 25, 2024

Thanks for the info on the delays. I'm getting some weird results even with your example.

julia> e = @time try
                  @repeat 3 try
                     @info "GET"
                     # not my local subnet, unreachable
                     HTTP.get("http://192.168.70.1", connect_timeout=10, retry=false, verbose=1) 
                  catch e
                     @retry if true end
                  end
              catch e
                  e
              end;
[ Info: GET
[ Info: GET
[ Info: GET
 30.029708 seconds (582 allocations: 38.852 KiB)
julia> e = @time try
                  @repeat 3 try
                     @info "GET"
                     # my local subnet, no device on that ip
                     HTTP.get("http://192.168.60.4", connect_timeout=10, retry=false, verbose=1) 
                  catch e
                     @retry if true end
                  end
              catch e
                  e
              end;
[ Info: GET
[ Info: GET
[ Info: GET
  9.133081 seconds (889 allocations: 55.562 KiB)

Also, have a look at this: it's from when I added error logging; HTTP was clearly waiting the full 20s there in the container.
The same thing happened with Downloads.jl.

┌ Error: IMDS.get error
│   exception =
│    HTTP.ConnectError for url = `http://169.254.169.254/latest/meta-data/iam/info`: TimeoutException: try_with_timeout timed out after 20.0 seconds
│    Stacktrace:
│      [1] (::HTTP.ConnectionRequest.var"#connections#4"{HTTP.ConnectionRequest.var"#connections#1#5"{HTTP.TimeoutRequest.var"#timeouts#3"{HTTP.TimeoutRequest.var"#timeouts#1#4"{HTTP.ExceptionRequest.var"#exceptions#2"{HTTP.ExceptionRequest.var"#exceptions#1#3"{typeof(HTTP.StreamRequest.streamlayer)}}}}}})(req::HTTP.Messages.Request; proxy::Nothing, socket_type::Type, socket_type_tls::Type, readtimeout::Int64, connect_timeout::Int64, logerrors::Bool, logtag::Nothing, kw::Base.Pairs{Symbol, Union{Nothing, Int64}, Tuple{Symbol, Symbol, Symbol}, NamedTuple{(:iofunction, :decompress, :verbose), Tuple{Nothing, Nothing, Int64}}})
│        @ HTTP.ConnectionRequest ~/depot/packages/HTTP/bDoga/src/clientlayers/ConnectionRequest.jl:8

Maybe this is a lead? From what I can tell, in my earlier experiments it would always wait the full timeout or longer.
In the case above, connecting to an IP in my local network clearly exited faster than the completely unreachable case.

The current implementation pretty much runs 10 times, with each run taking a second due to the timeout, so it seems the metadata endpoint is completely unreachable?


> One other thing: the IMDS credentials include a session token, which means they eventually expire. The code you have will only retry the initial credential fetch, not credential renewals. You may want to manually set renew in AWSCredentials so you can force retries upon renewal too.

We're using a wrapper over this, and I handle refreshes automatically there. On get, the wrapper either initializes fresh credentials or refreshes them, and I've added the retry for both paths. I'll have more results tomorrow.
Thanks for the tip!

ericphanson (Member) commented

julia> using HTTP

julia> @time try
           HTTP.get("http://203.0.113.0", connect_timeout=5, retries=3, retry_delay=fill(10, 3), verbose=1)  # Host is the unused TEST-NET-3 iprange
       catch e
           e
       end;
(three immediate retry log lines from HTTP.RetryRequest, as in the quoted output above)
  0.733070 seconds (4.53 k allocations: 1.041 MiB)

@omus I think the keyword argument should be retry_delays. retry_delay doesn't do anything, and HTTP.jl has the unfortunate pattern of silently dropping unused keyword arguments, so it swallows the typo instead of erroring. When I switch this to retry_delays it does work (though I had to change the address to a local one like http://localhost:31337, otherwise the error was declared non-retryable).
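For concreteness, the earlier example with the corrected keyword, pointed at a local unused port so the connection failure counts as retryable (the port and delay values here are illustrative):

```julia
using HTTP

# With `retry_delays` (plural) the three retries are actually spaced out;
# with the misspelled `retry_delay` they fire back-to-back.
try
    HTTP.get("http://localhost:31337";
             connect_timeout=5, retries=3, retry_delays=fill(2, 3), verbose=1)
catch e
    @info "gave up after delayed retries" typeof(e)
end
```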
