Add /healthz endpoint #3095

Merged: 2 commits merged on Jul 15, 2024
6 changes: 6 additions & 0 deletions server/handlers.go
@@ -29,6 +29,12 @@
const MainnetChainId = 1
const RinkebyChainId = 4

func (s *LivepeerServer) healthzHandler() http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
Member:

I think we shouldn't use this for the startupProbe though, otherwise we'll start sending traffic to the pod during that window where it's just 5xx'ing.

Unless there's no correlation between the time the existing healthcheck fails and the service being ready to serve, and the 5xx responses are just nginx failing to route to it because k8s thinks it's not ready.

In other words, do you know why the existing healthcheck doesn't work for around a minute before the B gets ready? Because from the outside it looks like it's actually checking something useful.
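For context, here is a minimal sketch (not part of this PR) of the kind of readiness-gated handler this concern points at: it keeps answering 503 until the node's startup sequence finishes, so Kubernetes would not route traffic to a pod that is still coming up. The ready flag and MarkReady helper are hypothetical names, not existing Livepeer APIs.

package server

import (
	"net/http"
	"sync/atomic"
)

// ready flips to true once startup (e.g. database and chain init) completes.
var ready atomic.Bool

// MarkReady would be called at the end of the node's startup sequence.
func MarkReady() { ready.Store(true) }

// readyzHandler reports 503 until MarkReady has been called, then 200.
func readyzHandler() http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !ready.Load() {
			http.Error(w, "starting", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
}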

Contributor Author:

Well... /status does not check anything useful. It also never returns any status other than 200. The reason it isn't available right away is that a number of things happen sequentially before the HTTP server starts (and exposes endpoints). In that sense, /healthz is not a regression; it behaves the same.

That said, I'm not sure why we're seeing 5xx during deployments. My guess is that we start two different web services in parallel (the HTTP video server, which serves videos, and the HTTP CLI server, which serves CLI commands), and maybe one is available while the other isn't. That made me think the /healthz endpoint should live on the same port where we serve video. I moved it in these 2 commits: ab8e557 and https://github.com/livepeer/livepeer-infra/pull/2245/commits/2828eedbd376d2e4debf4db8bf8e047cdbf7c678

PTAL: @victorges @pwilczynskiclearcode
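Roughly, the idea in those commits (sketched here, not the actual diff) is to register /healthz on the mux that serves video traffic instead of the CLI webserver mux, so the probe reflects the port that actually receives user traffic. videoServerHandlers is a made-up name for wherever that mux is built.

// In package server; needs "net/http". Sketch only, the real registration
// lives in the linked commits, and videoServerHandlers is a hypothetical name.
func (s *LivepeerServer) videoServerHandlers() *http.ServeMux {
	mux := http.NewServeMux()
	// Probe endpoint on the same port that serves video.
	mux.Handle("/healthz", s.healthzHandler())
	// ... existing video-serving routes would be registered here too.
	return mux
}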

Member:

Got it! Thanks for looking this up and explaining. Having this healthcheck on the "video port" also makes sense to me.

respondOk(w, nil)
})

Codecov / codecov/patch warning on server/handlers.go#L34-L35: added lines #L34 - L35 were not covered by tests.
}

// Status
func (s *LivepeerServer) statusHandler() http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
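Regarding the Codecov warning above, a minimal test sketch that would exercise the new handler, assuming respondOk writes a 200 response and that a zero-value LivepeerServer is safe here since healthzHandler doesn't touch any of its fields:

package server

import (
	"net/http"
	"net/http/httptest"
	"testing"
)

func TestHealthzHandler(t *testing.T) {
	s := &LivepeerServer{} // assumption: zero value is fine for this handler
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/healthz", nil)

	s.healthzHandler().ServeHTTP(rec, req)

	if rec.Code != http.StatusOK {
		t.Fatalf("expected 200 OK, got %d", rec.Code)
	}
}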
1 change: 1 addition & 0 deletions server/webserver.go
@@ -35,6 +35,7 @@ func (s *LivepeerServer) cliWebServerHandlers(bindAddr string) *http.ServeMux {
db := s.LivepeerNode.Database

// Status
mux.Handle("/healthz", s.healthzHandler())
mux.Handle("/status", s.statusHandler())
mux.Handle("/streamID", s.streamIdHandler())
mux.Handle("/manifestID", s.manifestIdHandler())