Initial write up of restore / desync troubles. #30

Draft · wants to merge 3 commits into main
Conversation

timbru (Owner) commented Sep 23, 2024

Needs more thought...

timbru marked this pull request as draft September 23, 2024 15:11
timbru linked an issue Sep 23, 2024 that may be closed by this pull request
synchronisation events where it issues an [@!RFC8181] list query even if it has
no new content to publish. Because this interaction requires that the
Publication Server signs an [@!RFC8181] list reply, this operation can be costly
for Publication Servers that serve a large number of publishers. Therefore,
A reviewer commented:

Why is this costly? Is it possible that this is a particular problem with Krill because it's generating a new EE certificate for each transaction, as it does for the RFC 6492 service?

timbru (Owner, Author) replied:

I think it's mostly okay, but in the case of 1500+ CAs (which happens) it does get rather chatty.

We spoke about this in the past, and there was indeed a problem in Krill then, but that was something else: an inefficiency in the implementation that meant it spent a lot of time deserializing state. That has since been fixed.

Nonetheless, 1500 signed responses every minute could be significant, and because the publication protocol does not (yet) support rate limiting there would be no clean way to tell clients to back off if needed.

So, I thought: fetching the notification file just hits the CDN, so there is no problem in hammering that from delegated CAs, which will likely be fewer in number than RPs anyway.
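For reference, the exchange being discussed is the [@!RFC8181] list query and list reply. Both are carried as CMS-signed objects, and the reply enumerates every object the server currently holds for that publisher, which is where the per-publisher signing cost comes from. A minimal illustration (URIs and hashes are placeholder values):

```xml
<!-- List query from the publisher (CA); carried in a CMS-signed object -->
<msg type="query" version="4"
     xmlns="http://www.hactrn.net/uris/rpki/publication-spec/">
  <list/>
</msg>

<!-- List reply from the Publication Server, also CMS-signed; one <list/>
     element per object currently held for this publisher -->
<msg type="reply" version="4"
     xmlns="http://www.hactrn.net/uris/rpki/publication-spec/">
  <list uri="rsync://rpki.example.org/repo/alice/0/example.cer"
        hash="f26eafba3d..."/>
</msg>
```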

timbru (Owner, Author) added:

To be clear: I am open to suggestions :) How about:

Because this interaction requires that the Publication Server signs an [@!RFC8181] list reply, this operation can be costly for Publication Servers that serve a large number of publishers.

New:

For Publication Servers that serve a large number (1000s) of publishers this operation could become costly, and unfortunately the [@!RFC8181] protocol has no clean support for rate limiting.

Would that work?

A reviewer replied:

Yep, I think mentioning the rate limiting consideration is a good idea. The suggested text above sounds fine to me.

Notification file the CA MAY perform this verification every minute.

If the expected files are not found to be published within a reasonable time
(let's say 5 minutes?), or if the CA recognises that there is a regression in
A reviewer commented:

I'm not sure about a suggestion like this. There will typically be a few moving parts between the publication server and the RRDP service accessed by the client, and it may not be that often that repeating the updates will fix the problem. Maybe instead something like "expect to see objects within five minutes, and if you don't, please contact the publication service operator"?

timbru (Owner, Author) replied:

As mentioned above for context, my idea was that hammering the CDN every minute would be fine.

  • wrt published within a reasonable time

But, you are right of course. There are moving parts in between, and the delay can even be 10 minutes or longer depending on the setup. And, unfortunately (another thing for my publication++ wishlist), there is no indication to the client about how long is normal here (e.g. the RFC8181 success reply could have included a hint).

So, if we include a time here then we should probably be conservative and use something like 15 minutes? But note that if the publisher does a full resync after 5 minutes, this is probably not an issue. It would just be another list request with a reply that tells the publisher that everything is there.

  • wrt CA recognises... regression

I think there is merit in monitoring the repository for changes regardless of recent publication activity. It could help publishers discover that the repository has regressed, in which case "contact publication server operator" could just be: issue a warning and do another full synchronisation (RFC8181 list and publish diff).

Does this make sense? I will think about updated text, but suggestions are welcome of course.
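For concreteness, a sketch of what the CA would be polling here: the [@!RFC8182] notification file served from the CDN. The CA can check that the session_id is unchanged and the serial does not move backwards, and that its recently published objects appear via the referenced snapshot or deltas; failing that, it would warn and fall back to the full resynchronisation described above. All values below are illustrative placeholders:

```xml
<!-- RFC 8182 notification file; the CA checks session_id continuity, that the
     serial only increases, and that its recent objects appear in the deltas -->
<notification xmlns="http://www.ripe.net/rpki/rrdp" version="1"
              session_id="9df4b597-af9e-4dca-bdda-719cce2c4e28"
              serial="1333">
  <snapshot uri="https://rrdp.example.org/9df4b597/1333/snapshot.xml"
            hash="ab66c2..."/>
  <delta serial="1333"
         uri="https://rrdp.example.org/9df4b597/1333/delta.xml"
         hash="4e11b1..."/>
</notification>
```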

A reviewer replied:

> So, if we include a time here then we should probably be conservative and use something like 15 minutes?

I think 5 minutes is better, because in the normal course of things that should be the outer limit of any delay. If it's regularly taking longer than that, then the setup on the server side needs to be improved.

> But note that if the publisher does a full resync after 5 minutes, this is probably not an issue. It would just be another list request with a reply that tells the publisher that everything is there.

Yep, that's a fair point. (I'd assumed that 'resynchronisation' meant 'delete existing objects and re-upload', which I think can happen inside a single request, so it may be worth documenting the resynchronisation process.)
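As an illustration of that resynchronisation happening inside a single request (URIs, hashes, and payload are placeholders): after comparing the [@!RFC8181] list reply against its own state, the publisher can send one query message that withdraws objects it no longer wants and (re)publishes anything missing or out of date:

```xml
<msg type="query" version="4"
     xmlns="http://www.hactrn.net/uris/rpki/publication-spec/">
  <!-- withdraw an object the server holds but the publisher no longer wants -->
  <withdraw uri="rsync://rpki.example.org/repo/alice/0/stale.roa"
            hash="dd38f1..."/>
  <!-- (re)publish an object that was missing from the list reply -->
  <publish uri="rsync://rpki.example.org/repo/alice/0/current.roa">
    ...base64 DER of the object...
  </publish>
</msg>
```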

Base automatically changed from cdn-not-mandatory to main September 25, 2024 09:28
Successfully merging this pull request may close these issues.

Server restore with loss of data should be clarified