document that watcher stops see #273 and add each_with_retry #275

grosser · 2017-11-09T05:04:50Z

nicer alternative to the current loop { watch } we are using, could also make retry: true an option for each, but wanted to keep it clean

@simon3z

/cc @jonmoter

simon3z · 2017-11-09T10:02:38Z

As mentioned on issue #273, I don't think that logic such as each_with_retry belongs in kubeclient.
👍 for the documentation.

@grosser on the specifics of the code: what is the new done flag used for?

cc @moolitayer @cben

cben · 2017-11-09T10:18:35Z

lib/kubeclient/watch_stream.rb

        end
+        done


the only way I see this can be false is if connection was interrupted before at least 1 complete line was received?

grosser · 2017-11-09T15:21:41Z

it will be false if the user calls `break` or `return` inside the yield. To know the difference between finished and aborted, the code has to be inside the watcher. The simple `loop { }` I did previously will not allow it to ever finish, that's why I added `each_with_retry` :)
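A minimal sketch of the distinction described above, assuming a hypothetical `FakeWatchStream` that stands in for the real watch stream (it is not kubeclient's class): `each` returns `true` only when the stream ended by itself, and the call evaluates to `nil` when the block uses `break`, so the caller can tell when reconnecting makes sense.

```ruby
# Hypothetical stand-in for a watch stream (not kubeclient's real class).
# Each call to #each simulates one connection that serves a batch of
# notices and then closes.
class FakeWatchStream
  attr_reader :connections

  def initialize(batches)
    @batches = batches
    @connections = 0
  end

  def each
    @connections += 1
    (@batches.shift || []).each { |notice| yield notice }
    true # reached only if the block did not `break` out of the method
  end
end

stream = FakeWatchStream.new([%w[a b], %w[c stop d]])
seen = []

loop do
  ended_by_itself = stream.each do |notice|
    seen << notice
    break if notice == 'stop' # the `each` call then evaluates to nil
  end
  break unless ended_by_itself # deliberate exit: do not reconnect
end
```

The first simulated connection closes on its own and is retried; the second is left with `break`, so the outer loop stops instead of reconnecting.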


grosser · 2017-11-27T21:53:44Z

ok to merge this ?
I'd think having a stable/reliable watcher is a very common usecase, there can be many reasons a connection is interrupted and blindly retrying breaks when also using .finish

zeari

This is a little non-standard since obj.each usually returns obj, but I'm OK with these changes 👍

grosser · 2018-01-04T16:09:11Z

@simon3z I can haz merge ?

simon3z · 2018-01-04T16:12:11Z

> @simon3z I can haz merge ?

cc @moolitayer @cben

cben · 2018-01-22T13:26:49Z

Hi, I'm very much aware I owe a review here.
Meanwhile I've been trying to figure out how to restart precisely. I recently asked on kubernetes/website#6540 (comment) and IIUC the answers, it's officially possible to take last seen resourceVersion of individual object in collection during watch, and start new collection watch from that version. (should deal with possibility of 410 error meaning version is too old.)

Also @pkliczewski now told me apparently official Go client's "informer framework" solves it, hadn't checked it yet.

So, about this PR:

WatchStream here remains version-agnostic. If user created it via watch_* functions with resource_version: arg, it will have version baked into url, if not it won't.
each_with_retry with specific starting version is not good. It means you'll get same data again.
Generally, restart with overlap is problematic to consume: assuming you want to ignore out-of-date updates, those are hard to recognize, because comparing 2 versions is not allowed.

IMHO restart logic does belong in kubeclient, but we'd need something more version-aware.
@grosser what do you think, how would you like to continue this?

grosser · 2018-01-22T18:14:14Z

so "remove resource_version from subsequent calls"
or "keep track of resource_version and pass it when retrying" ?
... first seems reasonable/easy, second would be nice, but seems a bit scary
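The second option could look roughly like the sketch below. Everything here is an assumption for illustration: `WATCH_LOG` and `fake_watch` simulate a server and are not kubeclient methods. The fake server finds the resume point by equality only, since versions are treated as opaque and are never compared.

```ruby
# Hypothetical in-memory "server": an ordered log of notices.
WATCH_LOG = %w[101 105 230 231].map do |version|
  { type: 'MODIFIED', object: { metadata: { resourceVersion: version } } }
end

# Simulates one connection: serves at most two notices that follow the given
# version, then closes. Assumes the version, if given, exists in the log.
def fake_watch(resource_version: nil)
  start = 0
  if resource_version
    index = WATCH_LOG.index do |notice|
      notice[:object][:metadata][:resourceVersion] == resource_version
    end
    start = index + 1
  end
  WATCH_LOG[start, 2] || []
end

last_version = nil
seen = []

loop do
  notices = fake_watch(resource_version: last_version)
  break if notices.empty? # only so that this demo terminates
  notices.each do |notice|
    # Remember the version of the latest notice and resume from it next time.
    last_version = notice[:object][:metadata][:resourceVersion]
    seen << last_version
  end
end
```

Because each reconnect starts after the last version seen, no notice is delivered twice. A real implementation would also have to handle the server rejecting a version that is too old (410).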

qingling128 · 2020-01-02T22:55:38Z

We are having similar issues when connection is closed and the watcher crashes instead of retrying.

+1 for having some logic in the kubeclient to:

- Retry when connection is closed.
- Keep track of resource_version and pass it when retrying.

need some way to distinguish "exited intentionally" (.finish/break/return) from connection closed.

cben · 2020-02-27T11:42:08Z

I quite forgot this PR. Putting resumption aside, there was a good point here that .each should distinguish deliberate exit — by break / return / .finish — from forced connection closure.

I'm by now aware of 5 ways watching can exit:

1. .finish => exits cleanly ✔️
2. break / return => exits cleanly ✔️
3. error such as 410 Gone => does not raise any exception, instead passes a "notice" with ERROR type into the block, then exits cleanly 🤦‍♂️
   (see "High request rate towards Kubernetes API server", fabric8io/fluent-plugin-kubernetes_metadata_filter#213)
4. connection closed by server => may exit cleanly with no exception 😕
5. connection closed by server/middleboxes(?)/network problems => may raise HTTP::ConnectionError.
   (see "Fluentd Crash", fabric8io/fluent-plugin-kubernetes_metadata_filter#194)

The technical reason is we have handle_exception helper for RestClient code paths but forgot to do any error handling for HTTP code paths 🤣

(4) vs (5) is clearly a mess. I'm tempted to say the user doesn't care how exactly a watch got disconnected: the bottom line is the "infinite" loop got broken, and the user needs to resume/retry in the same way.

If backward compatibility is not a question, I'd argue (3) (4) (5) should all throw an exception.
(and not leak HTTP exception — users shouldn't be aware about the precise http libraries kubeclient uses — better wrap with our Kubeclient::HttpError).
It was an unplanned accident that (4) may exit cleanly; from POV of caller the call promises an "infinite" loop, and IMO any disconnection should be an in-your-face error so people understand it's something they must handle.

The solution here of returning a boolean has benefit of backward compatibility.
There is existing code out there calling watch and handling resumption, and it's written assuming clean exit (4)... Most of it will fail in case of error (5) though?

But a boolean doesn't cover (3) well. I mean current behavior of passing error data into the block is ... workable, but not elegant. And handling 410 is essential to correct usage if you pass resourceVersion!

A "meta" consideration: as we've seen here, there is value in accumulating experience of how things actually break, and if we just swallow errors into a boolean, there will be no details in logs, so we (kubeclient) will not learn, and users will not learn.

I propose to break compatibility (and bump major version) and turn (3) (4) (5) into Kubeclient errors.
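A rough sketch of what that could mean for callers, under stated assumptions: the error classes and `strict_watch` below are hypothetical names invented for illustration (they are not kubeclient's API), and `IOError` stands in for whatever the HTTP library raises.

```ruby
# Hypothetical error classes, for illustration only (not kubeclient's).
class WatchError < StandardError; end
class WatchGoneError < WatchError; end         # case (3)
class WatchDisconnectedError < WatchError; end # cases (4) and (5)

# `source` is anything whose #each yields notice hashes and returns when
# the connection closes. IOError stands in for the HTTP library's error.
def strict_watch(source)
  source.each do |notice|
    if notice[:type] == 'ERROR'
      raise WatchGoneError, "server sent ERROR notice: #{notice.dig(:object, :message)}"
    end
    yield notice
  end
  # Reaching this line means the stream ended without `break`: case (4).
  raise WatchDisconnectedError, 'connection closed by server'
rescue IOError => e
  # Case (5): wrap the transport error so callers see one error family.
  raise WatchDisconnectedError, "connection failed: #{e.message}"
end

events = [
  { type: 'ADDED' },
  { type: 'ERROR', object: { message: 'too old resource version' } }
]
seen = []
outcome =
  begin
    strict_watch(events) { |notice| seen << notice[:type] }
  rescue WatchError => e
    e.class
  end
```

A `break` inside the block still leaves `strict_watch` without raising, so the deliberate exits (1) and (2) stay clean while every other way out surfaces as a `WatchError`.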

@grosser @qingling128 @jcantrill @mszabo @Stono @fradee what do you think?
Would that be a step forward? Or too disruptive?

[And yes, I still hope in some future version we'll get Kubeclient itself resuming after disconnection. IFF initial resourceVersion was set. And it will still leave caller to handle 410, so I think we need to fix the interface first anyway.]

[I'm also working on adding some docs on watching, because README is really lacking...]

mszabo · 2020-02-27T12:18:46Z

@cben Thanks for working on this. I think it would be a step forward.

Stono · 2020-02-28T07:09:36Z

Be aware if you're watching streams that don't change frequently, you'll see 410 GONE quite a bit when you reconnect.

Examples are say, CRD's that are rarely updated. When you reconnect you'll likely get a 410 GONE if you're trying to resume from the last seen CRD, and you'll need to resume from 0 instead - dropping any ADDED events with a timestamp <= the start of your stream.
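The filtering step described above might be sketched as follows. This assumes the timestamp meant is each object's `metadata.creationTimestamp`; `replayed_add?` is a hypothetical helper name, and the notices are plain hashes rather than real API objects.

```ruby
require 'time'

# True for ADDED notices that merely replay objects which already existed
# when the stream started. Assumes creationTimestamp is the timestamp meant.
def replayed_add?(notice, stream_started_at)
  return false unless notice[:type] == 'ADDED'

  created = Time.iso8601(notice.dig(:object, :metadata, :creationTimestamp))
  created <= stream_started_at
end

started = Time.iso8601('2020-02-28T07:00:00Z')
notices = [
  { type: 'ADDED',    object: { metadata: { name: 'old', creationTimestamp: '2020-02-27T10:00:00Z' } } },
  { type: 'ADDED',    object: { metadata: { name: 'new', creationTimestamp: '2020-02-28T07:05:00Z' } } },
  { type: 'MODIFIED', object: { metadata: { name: 'old', creationTimestamp: '2020-02-27T10:00:00Z' } } }
]

# Keep everything except ADDED notices for objects that predate the stream.
fresh = notices.reject { |notice| replayed_add?(notice, started) }
names = fresh.map { |notice| [notice[:type], notice[:object][:metadata][:name]] }
```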

grosser · 2020-02-29T15:59:11Z

raise on 3-4-5 sounds like an improvement over silently stopping the loop, it should be easy to see/catch for client code.

I'd still like a resume: true option (or by default, but risks loops) or so that takes the resourceVersion of the incoming objects and then reconnects with the latest version when disconnected.

qingling128 · 2020-03-01T00:23:21Z

Re #275 (comment):
That does sound like a step forward to me. We did run into an issue fluent-plugin-kubernetes_metadata_filter/pull/214 when an exception was not thrown as expected.

cben · 2020-03-11T10:12:21Z

Confirmed (3) {"type": "ERROR", ...} notices are returned with a 200 status 😐
Turns out it's deliberate in kubernetes per kubernetes/kubernetes#25151:

> Either way, for web sockets if we don't fix this clients can't get this
> error (the browser eats the 410 during the websocket negotiation and client
> code can't get access). So it's at least broken for web sockets.

This is unfortunate for :raw mode — if we don't parse the input, we can't raise an exception.

Could document this exception?
Or could add heuristic parsing when the raw text contains ERROR, just to check whether to raise an exception, otherwise discarding the parse results.
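A minimal sketch of such a heuristic, assuming each raw line is one JSON-encoded notice; `error_notice?` is a hypothetical helper name. The cheap substring check avoids parsing most lines, and the parse result is used only for the decision and then discarded.

```ruby
require 'json'

# Returns true only if the raw line parses as a notice of type ERROR.
def error_notice?(raw_line)
  return false unless raw_line.include?('ERROR') # cheap pre-check, no parsing

  parsed = JSON.parse(raw_line)
  parsed.is_a?(Hash) && parsed['type'] == 'ERROR'
rescue JSON::ParserError
  false # unparseable text is passed through untouched in raw mode
end

error_line = '{"type":"ERROR","object":{"code":410,"message":"too old resource version"}}'
label_line = '{"type":"ADDED","object":{"metadata":{"labels":{"level":"ERROR"}}}}'

checks = [error_line, label_line, 'ERROR but not json'].map { |line| error_notice?(line) }
```

The second line shows why the substring alone is not enough: it contains ERROR in a label but is an ordinary ADDED notice.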

rockb1017 · 2020-09-16T07:44:20Z

Thank you for your work and great discussions! Any plan for this to be merged?

grosser force-pushed the grosser/read branch from 779277f to 92b502c Compare November 9, 2017 05:09

cben reviewed Nov 9, 2017

document that watcher stops see ManageIQ#273 and add each_with_retry

927bdd0

grosser force-pushed the grosser/read branch from 92b502c to 927bdd0 Compare November 27, 2017 21:44

zeari approved these changes Jan 4, 2018

cben mentioned this pull request Jan 10, 2018

Add an API concepts document and describe terminology and API chunking kubernetes/website#6540

Merged

moolitayer added the v2.x/no label Jan 21, 2018

cben added bug enhancement labels May 27, 2018

cben mentioned this pull request Aug 23, 2018

Finish the watch stream connection when exiting ManageIQ/manageiq-providers-kubernetes#278

Merged

cben mentioned this pull request Oct 9, 2018

Watch gets dropped quite frequently for v1/services #352

Closed

qingling128 mentioned this pull request Jan 2, 2020

Fluentd Crash fabric8io/fluent-plugin-kubernetes_metadata_filter#194

Closed

cben added a commit to cben/kubeclient that referenced this pull request Feb 12, 2020

WIP, cf ManageIQ#275

56dfb9b

need some way to distinguish "exited intentionally" (.finish/break/return) from connection closed.

qingling128 mentioned this pull request Mar 1, 2020

raise exception after exiting watcher.each fabric8io/fluent-plugin-kubernetes_metadata_filter#214

Merged

This was referenced Mar 10, 2020

5.0 incompatible changes plan #435

Open

Document watch and upcoming changes #436

Merged

cben mentioned this pull request Jul 10, 2020

How to detect 410 Gone in watch response? #452

Open
