EC2 metadata endpoint intermittent credentials retrieval failures (IMDSv1) #666
Nice sleuthing work.

I believe the IMDS service itself is hosted on the same machine as the instance, so it's possible that a heavy load could cause IMDS to become unresponsive. Using curl would probably work around these kinds of issues, as it should behave better with retries between connection failures. For example, with HTTP.jl, retries after connection failures occur immediately:

```julia
julia> using HTTP

julia> @time try
           HTTP.get("http://203.0.113.0", connect_timeout=5, retries=3, retry_delay=fill(10, 3), verbose=1)  # Host is in the unused TEST-NET-3 IP range
       catch e
           e
       end;
┌ LogLevel(-1001): 🔄 Retry HTTP.Exceptions.ConnectError("http://203.0.113.0", CapturedException(Base.IOError("connect: host is unreachable (EHOSTUNREACH)", -65), Any[(getconnection(::Type{Sockets.TCPSocket}, host::SubString{String}, port::SubString{String}; keepalive::Bool, readtimeout::Int64, kw::Base.Pairs{Symbol, Any, NTuple{5, Symbol}, NamedTuple{(:require_ssl_verification, :iofunction, :decompress, :verbose, :retry_delay), Tuple{Bool, Nothing, Nothing, Int64, Vector{Int64}}}}) at Connections.jl:526, 1), (getconnection at Connections.jl:503 [inlined], 1), (#10 at Connections.jl:462 [inlined], 1), (macro expansion at try_with_timeout.jl:82 [inlined], 1), ((::ConcurrentUtilities.var"#2#4"{Any, Timer, ConcurrentUtilities.TimedOut{Any}, Channel{Any}, HTTP.Connections.var"#10#13"{Sockets.TCPSocket, Bool, Bool, Base.Pairs{Symbol, Any, NTuple{5, Symbol}, NamedTuple{(:readtimeout, :iofunction, :decompress, :verbose, :retry_delay), Tuple{Int64, Nothing, Nothing, Int64, Vector{Int64}}}}, SubString{String}, SubString{String}}})() at threadingconstructs.jl:416, 1)])): GET / HTTP/1.1
└ @ HTTP.RetryRequest ~/.julia/packages/HTTP/bDoga/src/clientlayers/RetryRequest.jl:62
┌ LogLevel(-1001): 🔄 Retry HTTP.Exceptions.ConnectError("http://203.0.113.0", CapturedException(Base.IOError("connect: host is unreachable (EHOSTUNREACH)", -65), Any[(getconnection(::Type{Sockets.TCPSocket}, host::SubString{String}, port::SubString{String}; keepalive::Bool, readtimeout::Int64, kw::Base.Pairs{Symbol, Any, NTuple{5, Symbol}, NamedTuple{(:require_ssl_verification, :iofunction, :decompress, :verbose, :retry_delay), Tuple{Bool, Nothing, Nothing, Int64, Vector{Int64}}}}) at Connections.jl:526, 1), (getconnection at Connections.jl:503 [inlined], 1), (#10 at Connections.jl:462 [inlined], 1), (macro expansion at try_with_timeout.jl:82 [inlined], 1), ((::ConcurrentUtilities.var"#2#4"{Any, Timer, ConcurrentUtilities.TimedOut{Any}, Channel{Any}, HTTP.Connections.var"#10#13"{Sockets.TCPSocket, Bool, Bool, Base.Pairs{Symbol, Any, NTuple{5, Symbol}, NamedTuple{(:readtimeout, :iofunction, :decompress, :verbose, :retry_delay), Tuple{Int64, Nothing, Nothing, Int64, Vector{Int64}}}}, SubString{String}, SubString{String}}})() at threadingconstructs.jl:416, 1)])): GET / HTTP/1.1
└ @ HTTP.RetryRequest ~/.julia/packages/HTTP/bDoga/src/clientlayers/RetryRequest.jl:62
┌ LogLevel(-1001): 🔄 Retry HTTP.Exceptions.ConnectError("http://203.0.113.0", CapturedException(Base.IOError("connect: host is unreachable (EHOSTUNREACH)", -65), Any[(getconnection(::Type{Sockets.TCPSocket}, host::SubString{String}, port::SubString{String}; keepalive::Bool, readtimeout::Int64, kw::Base.Pairs{Symbol, Any, NTuple{5, Symbol}, NamedTuple{(:require_ssl_verification, :iofunction, :decompress, :verbose, :retry_delay), Tuple{Bool, Nothing, Nothing, Int64, Vector{Int64}}}}) at Connections.jl:526, 1), (getconnection at Connections.jl:503 [inlined], 1), (#10 at Connections.jl:462 [inlined], 1), (macro expansion at try_with_timeout.jl:82 [inlined], 1), ((::ConcurrentUtilities.var"#2#4"{Any, Timer, ConcurrentUtilities.TimedOut{Any}, Channel{Any}, HTTP.Connections.var"#10#13"{Sockets.TCPSocket, Bool, Bool, Base.Pairs{Symbol, Any, NTuple{5, Symbol}, NamedTuple{(:readtimeout, :iofunction, :decompress, :verbose, :retry_delay), Tuple{Int64, Nothing, Nothing, Int64, Vector{Int64}}}}, SubString{String}, SubString{String}}})() at threadingconstructs.jl:416, 1)])): GET / HTTP/1.1
└ @ HTTP.RetryRequest ~/.julia/packages/HTTP/bDoga/src/clientlayers/RetryRequest.jl:62
  0.733070 seconds (4.53 k allocations: 1.041 MiB)
```

So I think to fix the issue you're seeing we need to add retries with a delay between each of them. However, this introduces a new problem we now need to solve: for AWS.jl users without an IMDS server running, we don't want to introduce delays from multiple retries and would rather fail fast. What we maybe want to do is disable retries and use a fast connect timeout.

If you want to make a PR for this I'd be happy to review.

Also, is there a reason you're not using AWS.jl 1.90+? There was some refactoring of our IMDS code which added support for IMDSv2 and IMDSv1 and a more robust test suite.
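The two strategies above (retry with a delay when IMDS is expected, fail fast when it isn't) can be sketched without touching the network. This is only an illustration of the trade-off; `with_retries`, `fail_fast`, and the `fetch_creds` argument are hypothetical names, not AWS.jl API:

```julia
# Retry with a delay between attempts: suits hosts where an IMDS server
# is expected but may be temporarily unresponsive under load.
function with_retries(fetch_creds; retries::Int=3, delay::Real=1.0)
    for attempt in 1:retries
        try
            return fetch_creds()
        catch
            attempt == retries && rethrow()
            sleep(delay)  # back off instead of retrying immediately
        end
    end
end

# Fail-fast variant: a single attempt with no delay, for hosts without IMDS
# where the first connection failure is already conclusive.
fail_fast(fetch_creds) = fetch_creds()
```

The real decision is which wrapper to pick (or which timeouts to pass to the HTTP client) based on whether an IMDS endpoint is expected to exist.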
I don't know much about it, but I'm still leaning toward it being a Julia issue rather than the IMDS. I'm running 3 test containers that only run the credentials fetch in a loop every 5 seconds, and so far none of them has logged any errors (it's been a week). Meanwhile there have been failures on the affected containers (mostly just one, but I've seen this happen on two different ones; they're less loaded though). All are on the same EC2 instance, so I think even if the IMDS were running off the instance we would probably see failures in the test containers.
There's still the issue of the fact that we're utilizing

Anyway, either what you're suggesting is the way to go, or some advanced interface where we specify the expected source(s) and retry on it. I'm looking into this and trying to get this somewhat stable for now, but I don't really have any bulletproof solution in mind yet. We probably shouldn't have this issue in the first place, and it's weird that two different HTTP client packages are facing the same issue (HTTP.jl and Downloads.jl).

For now I added retries to the wrapper on credentials management we use and I'm testing it out. The additional MethodError and StatusError are due to the EC2 instance profile call sometimes failing mid-execution, which causes these.

```julia
awsconf = @repeat 10 try
    AWS.AWSConfig(; aws_config_kwargs...)
catch ex
    @retry if (
        ex isa AWS.AWSExceptions.NoCredentials ||
        ex isa MethodError ||
        ex isa HTTP.Exceptions.StatusError
    )
        @warn("Retrying AWS credentials initial fetch...")
    end
end
```
I was looking to upgrade to IMDSv2, but this instability basically blocked me and I just stayed at v1.89 to figure stuff out. I tested it out and it works fine (I had to up the hop limit), but the same stability issue seems to be affecting it, from what I've seen in the one attempt I made to upgrade.
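For reference, the hop-limit change mentioned above is an EC2 instance setting; a sketch of making it with the AWS CLI (the instance ID is a placeholder):

```shell
# Raise the IMDS hop limit so containerized workloads (one extra network hop)
# can still reach the metadata endpoint over IMDSv2.
aws ec2 modify-instance-metadata-options \
    --instance-id i-0123456789abcdef0 \
    --http-put-response-hop-limit 2 \
    --http-tokens optional   # "required" would enforce IMDSv2 only
```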
I think having this issue is fine as we're trying to solve "IMDS connections intermittently fail". This issue may end up being solved in another library but this at least is a place where others can report their IMDS failures.
The code you provided won't introduce a delay between failures, which could be problematic:

```julia
julia> using HTTP, Retry

julia> e = @time try
           @repeat 3 try
               @info "GET"
               HTTP.get("http://203.0.113.0", connect_timeout=5, retry=false, verbose=1)
           catch e
               @retry if true end
           end
       catch e
           e
       end;
[ Info: GET
[ Info: GET
[ Info: GET
  0.017360 seconds (876 allocations: 54.586 KiB)
```

You can use Retry.jl with:

```julia
julia> f = retry(delays=fill(3, 2)) do  # I recommend using `ExponentialBackOff` for your delay
           @info "GET"
           HTTP.get("http://203.0.113.0", connect_timeout=5, retry=false, verbose=1)
       end
#90 (generic function with 1 method)

julia> @time try; f(); catch e; end
[ Info: GET
[ Info: GET
[ Info: GET
  6.052492 seconds (21.95 k allocations: 1.466 MiB, 0.42% compilation time)
```

One other thing: the IMDS credentials include a session token, which means they expire eventually. The code you have will only retry on the initial credential fetch and not credential renewals. You may want to manually set
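The `retry`/`ExponentialBackOff` suggestion can be sketched offline as well; `flaky_fetch` and the attempt counter below are hypothetical stand-ins for the real IMDS request:

```julia
# Base.retry wraps a function so failures are retried with delays between
# attempts; ExponentialBackOff yields growing delays (with jitter).
attempts = Ref(0)
flaky_fetch() = (attempts[] += 1; attempts[] < 3 ? error("IMDS unreachable") : "credentials")

f = retry(flaky_fetch; delays=ExponentialBackOff(n=4, first_delay=0.05, max_delay=0.25))
f()  # fails twice, sleeps between attempts, then returns "credentials"
```

With `n=4` delays the wrapped function gets up to five attempts in total before the last exception is rethrown.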
Thanks for this info on the delays. I'm getting some weird results even with your example:

```julia
julia> e = @time try
           @repeat 3 try
               @info "GET"
               # not my local subnet, unreachable
               HTTP.get("http://192.168.70.1", connect_timeout=10, retry=false, verbose=1)
           catch e
               @retry if true end
           end
       catch e
           e
       end;
[ Info: GET
[ Info: GET
[ Info: GET
 30.029708 seconds (582 allocations: 38.852 KiB)

julia> e = @time try
           @repeat 3 try
               @info "GET"
               # my local subnet, no device on that ip
               HTTP.get("http://192.168.60.4", connect_timeout=10, retry=false, verbose=1)
           catch e
               @retry if true end
           end
       catch e
           e
       end;
[ Info: GET
[ Info: GET
[ Info: GET
  9.133081 seconds (889 allocations: 55.562 KiB)
```

Also have a look at this: it's from when I added error logging; I had HTTP clearly waiting for 20s there in the container.
Maybe this is a lead? From what I can tell from my experiments earlier, it would always wait the full timeout or longer. The current implementation pretty much runs 10 times and each run takes a second due to the timeout, so it seems the metadata endpoint is completely unreachable?
We're using a wrapper over this and I'm handling refreshes automatically there. Basically on
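A minimal sketch of such a refresh wrapper, assuming a fixed session lifetime; `CachedCreds`, `current_creds!`, and the `fetch_fn` argument are hypothetical names, not AWS.jl API:

```julia
using Dates

# Expiry-aware cache: renew the session credentials a few minutes before
# they expire, so renewals go through the same (retry-wrapped) fetch path
# as the initial fetch.
mutable struct CachedCreds
    value::Any
    expiry::DateTime
end

function current_creds!(cache::CachedCreds, fetch_fn; margin=Minute(5))
    if now(UTC) >= cache.expiry - margin
        cache.value = fetch_fn()              # retry-wrapped fetch goes here
        cache.expiry = now(UTC) + Hour(1)     # assumed session lifetime
    end
    return cache.value
end
```

The point is that `fetch_fn` is invoked on renewals too, so a retry policy applied there covers both the initial fetch and every refresh.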
@omus I think the keyword argument should be
In early December we started observing occasional AWS.jl credential retrieval failures during CI runs.
When running on an AWS EC2 instance with credentials assigned we would get the standard "Can't find AWS credentials!" message.
The issue would only appear when the Julia process had an increased load due to the CI running, and I could never reproduce it outside of that scenario.
This is on Julia 1.9.3, AWS.jl 1.89.1, and IMDSv1.
Debugging timeline:
- Added a `run('curl ....')` call to the same URL when the EC2 metadata endpoint would time out: HTTP.jl still timed out, and curl ran fine after HTTP.jl timed out
- It looks like it's some HTTP.jl client code issue with timeouts, but I haven't gone into this yet
- I will add updates once I do, as this is probably more of an HTTP.jl issue than an AWS.jl issue, but I'm posting here for visibility in case someone gets a similar issue
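A cross-check along these lines can be sketched with the stdlib Downloads.jl next to the curl binary; the function name and structure below are illustrative, not the code used in CI:

```julia
using Downloads  # stdlib; uses the same libcurl that the curl CLI does

# Fetch the same URL with Downloads.jl and with the curl binary, so the two
# results (body or exception) can be compared when one client times out and
# the other succeeds.
function fetch_both(url; timeout=5)
    via_downloads = try
        sprint(io -> Downloads.download(url, io; timeout=timeout))
    catch e
        e
    end
    via_curl = try
        read(`curl -s --max-time $(timeout) $url`, String)
    catch e
        e
    end
    return (via_downloads, via_curl)
end
```

If the two clients disagree (curl succeeds while Downloads.jl/HTTP.jl times out), that points at the Julia side rather than the endpoint.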
Update (23.01.2024): Downloads.jl failed too, so it may not be an HTTP.jl issue after all, but some Julia-related issue.
Downloads.jl patch: https://github.com/krynju/AWS.jl/compare/v1.89.1...krynju:AWS.jl:kr/downloadsjl?expand=1