proxy-next-upstream (including default on error and timeout) does not always pick a different upstream depending on load balancer and concurrent requests #11852
/move-kind bug

I understand why your reproduce example's choice is to have pods with nginx configured to return 404 on location /. But that cannot be considered a real-world use case, as real-world workloads don't have all the pods configured to return 404. If you want, you can change the test to a real-world use case where first the pod or pods are returning 200, then introduce an event for 4XX or 5XX, and so on and so forth. But unless you can post the data like ...
/remove-kind bug
@longwuyuan While I agree the reproduction example is a bit contrived and not particularly reflective of any real-world use case, it is the easiest way to reliably reproduce this issue without overly complicating the test case. The sporadic nature of this issue is why I have opted for such a simplistic approach to reproducing it. If the backend service is acting reliably at all (preventing the retry path from being hit consistently), the issue becomes much harder to reproduce.

To maybe point you more directly to where the issue lies, as can be seen from my reproduction example, note the access log line that one of my curls produced in the ingress controller's nginx:
As you can see, the same upstream address was used for every retry attempt.

I have amended the command for getting the logs from the ingress controller and will happily provide more information, but I think the example I have provided is the minimal reproducible one. The problem happens entirely in the ingress-nginx controller and for any error case that `proxy-next-upstream` is configured to handle.
ok. I am on a learning curve here so please help out with some questions. The replica count is 2 and both replicas are causing a lookup for the next upstream. Do I have to reproduce this on my own to figure out what happens when there is at least one replica that is returning 200 instead of 404? Does that one not get picked?
If any of the upstreams that get picked return a non-error response, nginx behaves as expected and ends the retry chain there. For any one request, another attempt is only performed if there is an error according to the `proxy-next-upstream` configuration. The default template configures 3 attempts per request via `proxy_next_upstream_tries`.
In nginx, `fail_timeout` and `max_fails` remove a failed backend for a certain period of time, but the balancer does not have this capability.
If A fails once, it will be removed for 30 seconds.
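For illustration only, a minimal sketch of what that kind of bookkeeping could look like on the Lua side; all names here are made up and ingress-nginx does not currently implement anything like this:

```lua
-- Sketch of nginx-style max_fails / fail_timeout bookkeeping for a Lua
-- balancer. Hypothetical names only; kept per worker process for simplicity.
local MAX_FAILS = 3
local FAIL_TIMEOUT = 30 -- seconds

local failures = {} -- endpoint -> { count = n, first_failure = timestamp }

local function is_available(endpoint)
  local record = failures[endpoint]
  if not record then
    return true
  end
  if ngx.now() - record.first_failure > FAIL_TIMEOUT then
    -- window expired, give the endpoint another chance
    failures[endpoint] = nil
    return true
  end
  return record.count < MAX_FAILS
end

local function report_failure(endpoint)
  local record = failures[endpoint]
  if not record then
    failures[endpoint] = { count = 1, first_failure = ngx.now() }
  else
    record.count = record.count + 1
  end
end
```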
While experimenting with the consistent hashing balancer, I found that the retries always pick the same upstream. This can easily be verified by configuring consistent hashing on the reproduction ingress. The OpenResty load balancers have support for selecting a "next" upstream (although the round-robin implementation is lackluster, as it is the same as the regular balancing logic, so it's subject to the same concurrency problem), e.g. via the `next` method of the consistent hashing balancer. This has no effect for ingress-nginx, however, as only `find` is ever called:

ingress-nginx/rootfs/etc/nginx/lua/balancer/chash.lua Lines 29 to 32 in 1c2aecb
I think what needs to happen, at the very least, is that the usage of these load balancers in ingress-nginx's Lua code retains the index of the last used upstream and then uses the balancer's `next` method to pick a different one on a retry.
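Something along these lines is what I have in mind. This is only a rough sketch, assuming the `find`/`next` API exposed by lua-resty-balancer's `resty.chash`; the function and field names are illustrative, not taken from the actual ingress-nginx code:

```lua
-- Rough sketch: remember the ring index returned by resty.chash's find() on
-- the first attempt, and advance with next() on every retry so a different
-- node is picked for the same request.
local function chash_pick(self, hash_key)
  local state = ngx.ctx.chash_retry_state

  if not state then
    -- first attempt for this request: hash the key as usual
    local endpoint, index = self.instance:find(hash_key)
    ngx.ctx.chash_retry_state = { index = index }
    return endpoint
  end

  -- retry triggered by proxy_next_upstream: walk to the next node on the ring
  local endpoint, index = self.instance:next(state.index)
  state.index = index
  return endpoint
end
```

Since `ngx.ctx` lives for the whole request, it survives the internal retries that `proxy_next_upstream` triggers, which is what makes this kind of bookkeeping possible.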
@marvin-roesch There are no resources like developer time available to allocate for research and triaging, particularly for features that are further away from the core Ingress API specs. Secondly, you have only 2 pods in your test, both pods are configured to return 404 for ALL requests, and you are expecting the controller to route to the next upstream and get a 200. If I understand your test correctly, I can't see how/where/why the controller should route to a new next upstream (IP address of a pod) when there is no 3rd pod that is healthy or capable of responding with 200.
Again, @longwuyuan, the fact that all pods return 404 should not matter. There might well be a real-world case where a system is in a bad state and thus all upstreams return an error - that is fine for that case and "expected". The problem is that ingress-nginx does not attempt any other upstream at all; it just tries the same one over and over again (in the case of consistent hashing) or whatever the round-robin state dictates right now (causing issues with many concurrent requests, as exemplified in my initial report). The expected behavior here, coming purely from an nginx perspective without any fancy Lua-based load balancing, is that a different upstream would be picked.

If a pod is unreachable for long enough that readiness probes start to fail etc., the behavior will be "correct" in a real-life system since the affected pods will just be taken out of the rotation, but for the duration between the requests failing and the readiness probe registering that, the retries are the only thing that can bridge the gap.

If you really want me to, I can come up with a reproduction case that does sometimes return a 200, but again, that will only obscure the actual problem and make it harder to reproduce reliably. The real-world case where we first encountered this was when a high-traffic service had one of its pods crash and we had tons of requests failing despite several other pods for that service still being healthy. Our root cause analysis showed that the ingress was retrying the same upstream IP for the failing requests, but behaved "correctly" (according to the `proxy-next-upstream` settings) once the crashed pod had been removed from the endpoints.
Thanks @marvin-roesch, at least this discussion is adding info here that could complete the triaging. My request now to you is to help me clarify what I understand from your comments. So I am copy/pasting a select set of words from your comments below:
Based on those words, I want to draw your attention one more time to my understanding of your test. Both pods of the backend workload return 404 (or, in the real-world scenario, all pods return an error). This makes me assume that the controller trying the same last upstream 3 times instead of a new upstream is not abnormal. There is no other upstream that the controller can try. Your expectation is in the words below:
I am bothering you so much because I don't see how the controller can attempt any other upstream if there are none in a Ready state. You can choose to ignore my comments if that is better, because I am not a developer. To summarize: if I can run a 3-replica workload and manually trick 2 of the replicas into failing, we can get proof that there was at least one healthy replica and that the controller should have routed to that healthy replica as the next upstream. And also there is a shortage of resources, as I mentioned, so community contributors here would be a help.
Also @marvin-roesch, just so my repeated questions don't come across as odd, I wanted to clarify that I acknowledge the fact below from you:
And I am asking my question after reading this fact, as there is no clear smoking-gun data that shows the state of retrying the same upstream before and after this window of time.
@longwuyuan Thanks for trying to understand this! I'll try my best to answer your questions. I'll be picking out the sentences that I think need addressing, too.
I think you've addressed this yourself in your following comment, but just to reiterate and confirm: Yes, there is no issue if Kubernetes has marked the affected pods as non-Ready and that change has propagated through to the nginx configuration. This can take several seconds though, which can be quite critical for a very high traffic system.
Yep, as outlined above, this is solely an issue for the short amount of time when the pods are in a bad state but the rest of the system has not been made aware of this yet. The ingress controller is working perfectly fine for us once that's the case.
Yep, that would be the way to replicate this that's a little closer to the real-life example we encountered. Always responding with 404s (or any other error that `proxy-next-upstream` covers) just makes the issue easier to reproduce reliably.
I've been trying my best to point out the pertinent pieces of the code that cause this. The load balancing logic would somehow need to be made aware of which upstreams it has already tried. If we were relying on nginx's native upstream selection, it would keep track of the peers it has already tried for a request, but the Lua balancer bypasses that.

I'd be happy to contribute a fix for this, but I'm unsure about what approach to take. It might be easiest to pull what the EWMA load balancer is doing into the others, see the following snippet. I was surprised to find it already takes care of this. It seems this was addressed as part of solving #6632, but the issue still affects all other load balancers.

ingress-nginx/rootfs/etc/nginx/lua/balancer/ewma.lua Lines 184 to 190 in 0111961
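For reference, this is roughly the shape of what I mean by pulling that behaviour into the other balancers. It is only a sketch with made-up names (`pick_peer`, `tried_endpoints`), not the actual ewma.lua logic:

```lua
-- Sketch only: track which endpoints a request has already been sent to in
-- ngx.ctx (which survives proxy_next_upstream retries) and skip them when the
-- underlying strategy would pick them again. pick_peer() stands in for the
-- concrete strategy (round-robin, chash, ...) and is not a real function name.
local function balance_avoiding_tried(self)
  local tried = ngx.ctx.tried_endpoints
  if not tried then
    tried = {}
    ngx.ctx.tried_endpoints = tried
  end

  local endpoint
  for _ = 1, #self.endpoints do
    local candidate = self:pick_peer()
    if not tried[candidate] then
      endpoint = candidate
      break
    end
  end

  -- if every endpoint has already been tried, just take whatever comes next
  endpoint = endpoint or self:pick_peer()
  tried[endpoint] = true
  return endpoint
end
```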
It seems the summary is:
PS - Please express your thoughts on just using EWMA for now.
@longwuyuan That summary seems mostly correct to me, except for one thing: looking at all the load balancer implementations, this affects the round-robin balancer as well as both consistent hashing ones (`chash` and `chashsubset`).

I'll check if EWMA works for our purposes in the meantime. Thanks for pointing me to the Slack, I have joined it and will create a thread to start discussing potential ways forward. I think aligning all load balancer implementations is the way to go.
Thanks a lot @marvin-roesch. This is an action-items overview that is getting tracked here, so it helps a lot. I will copy/paste the summary and add the note you made.

Summary
/remove-triage needs-information
This is stale, but we won't close it automatically; just bear in mind the maintainers may be busy with other tasks and will reach your issue ASAP. If you have any question or request to prioritize this, please reach out in the Kubernetes Slack.
/remove-lifecycle frozen |
This is stale, but we won't close it automatically; just bear in mind the maintainers may be busy with other tasks and will reach your issue ASAP. If you have any question or request to prioritize this, please reach out in the Kubernetes Slack.
/remove-lifecycle frozen |
/triage accepted

@marvin-roesch this was discussed with @rikatz in the community meeting today, so please engage here or in the Slack channel one more time. Sorry for the long wait; it's not good on the resources side of things. Today's motivation was that we are going to address something similar in the context of timeouts/timing in #12397, but this use case is not the same as that PR.
What happened:
When one backend pod fails under a condition covered by `proxy_next_upstream` (e.g. `http_404` for easy testing), if there's a large volume of requests, any one request may reuse the same backend for all tries rather than actually using the "next" backend. This happens for sure with the default round-robin balancer, but most likely with all balancer implementations.

What you expected to happen:
If a backend request fails due to one of the `proxy_next_upstream` conditions, it should be retried with at least one of the other available backends, regardless of the configured load balancer or any concurrent requests.

NGINX Ingress controller version (exec into the pod and run `nginx-ingress-controller --version`): 1.11.2
Kubernetes version (use `kubectl version`): 1.28.10

Environment:
Cloud provider or hardware configuration: MacBook Pro with Apple M2
OS (e.g. from /etc/os-release): Ubuntu 22.04.4 via Multipass on macOS 14.5
Kernel (e.g. `uname -a`): 5.15.0-119-generic

Install tools:
Basic cluster related info:
kubectl version
kubectl get nodes -o wide
How was the ingress-nginx-controller installed:
microk8s enable ingress
Current State of the controller:
kubectl describe ingressclasses
kubectl -n ingress get all -o wide
kubectl -n ingress describe po nginx-ingress-microk8s-controller-4hrss
Current state of ingress object, if applicable:
kubectl -n default get all,ing -o wide
kubectl -n <appnamespace> describe ing <ingressname>
Others:
How to reproduce this issue:
Install minikube/kind
Install the ingress controller
Install an application with at least 2 pods that will always respond with status 404
Create an ingress which tries next upstream on 404
Make many requests in parallel
Observe in the ingress controller's access logs (`kubectl logs -n ingress-nginx $POD_NAME`) that many requests will have the same upstream in succession in `$upstream_addr`, e.g.

Anything else we need to know:
The problem is exacerbated by few backend pods (like 2 in the repro case) being hit by a large request volume concurrently. There is basically a conflict between global load-balancing behaviour and per-request retries at play here. With the default round-robin load balancer, for example, the balancer instance is obviously shared by all requests (on an nginx worker) for a particular backend.
Assuming a system with 2 backend endpoints for the sake of simplicity, the flow of information can be as follows:
1. Request 1 comes in and is routed to endpoint A by the round-robin balancer
2. Request 2 comes in and is routed to endpoint B
3. Request 1's response fails, its `proxy_next_upstream` config requests another endpoint from the load balancing system, it gets routed to endpoint A by the round-robin balancer
4. Request 3 comes in and is routed to endpoint B
5. Request 1's retry fails as well, its `proxy_next_upstream` config requests another endpoint from the load balancing system, it again gets routed to endpoint A by the round-robin balancer

As you can see, this means request 1 is only handled by endpoint A despite the `proxy_next_upstream` directive. Depending on the actual rate and order of requests etc., request 2 could have faced a similar fate, but request 3 came in before the initial response failed, so it happens to work out in that case.

This makes `proxy-next-upstream` extremely unreliable and makes it behave in unexpected ways. An approach to fixing this would be to make the Lua-based load balancing aware of which endpoints have already been tried. The semantics are hard to nail down exactly, however, since this might break the guarantees that some load balancing strategies aim to provide. On the other hand, having the next-upstream choice work reliably at all is invaluable for bridging over requests in a failure scenario. A backend endpoint might become unreachable, which should result in it eventually being removed from the load balancing once probes have caught up to the fact. In the meantime, the default `error timeout` strategy would try the "next" available upstream for any requests hitting that endpoint, but if everything aligns just right, the load balancer would always return the same endpoint, resulting in a 502 despite the system at large being perfectly capable of handling the request.
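For illustration only, here is a tiny standalone simulation (plain Lua, nothing nginx-specific; the endpoint names are just the hypothetical A/B from the flow above) of how one shared round-robin index produces exactly that interleaving:

```lua
-- Plain Lua simulation of one nginx worker's shared round-robin state in the
-- hypothetical two-endpoint scenario above. Each call advances the single
-- shared index, no matter which request (or retry) is asking for an endpoint.
local endpoints = { "A", "B" }
local rr_index = 0

local function next_endpoint()
  rr_index = rr_index % #endpoints + 1
  return endpoints[rr_index]
end

print("request 1          -> " .. next_endpoint()) -- A
print("request 2          -> " .. next_endpoint()) -- B
print("request 1, retry 1 -> " .. next_endpoint()) -- A again
print("request 3          -> " .. next_endpoint()) -- B
print("request 1, retry 2 -> " .. next_endpoint()) -- A again
```

Running it shows request 1 landing on endpoint A for every attempt, while the unrelated requests in between keep absorbing endpoint B's turns.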