Unboxer has an unbustable cache, which can take weeks to clear out #64

Open
dfabulich opened this issue Sep 5, 2024 · 2 comments · May be fixed by #65

Comments

@dfabulich
Contributor

dfabulich commented Sep 5, 2024

Unboxer has multiple layers of caching.

  • It has a cache of the Master-Index.xml file (revalidated via ETag every five minutes)
  • It manages an LRU cache directory of zips downloaded from the IF Archive (entries can be invalidated by changes to the Master-Index.xml file)
  • The app returns a Cache-Control: max-age header on every response
    • On error responses (e.g. 404 errors), it uses max-age=86400 (24 hours)
    • On OK responses, it uses max-age=604800 (7 days)
  • Cloudflare sits in front of the main domain https://unbox.ifarchive.org/, caching everything for 1 to 7 days and passing along the Cache-Control header.
  • There's an nginx cache for subdomains (which serve the HTML files themselves), caching everything for 1 to 7 days and passing along the Cache-Control header.
  • Client-side browsers (and any proxy the user might have) honor the Cache-Control header, caching everything for 1 to 7 days. For HTML documents, the user can refresh the cache by clicking the refresh button, but there's no easy user-visible way to refresh subresources in the user's browser.
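
As a rough illustration of the header policy in the list above, here is a minimal sketch that maps a response status to the Cache-Control value described there. The function name and the exact status boundaries are assumptions for illustration, not taken from the unboxer's source.

```python
from typing import Optional

ERROR_MAX_AGE = 86400   # 24 hours, used on error responses such as 404
OK_MAX_AGE = 604800     # 7 days, used on OK responses


def cache_control_for(status: int) -> Optional[str]:
    """Return the Cache-Control value for a response status (hypothetical helper)."""
    if 200 <= status < 300:
        return f"max-age={OK_MAX_AGE}"
    if status >= 400:
        return f"max-age={ERROR_MAX_AGE}"
    # Assumption: other statuses (e.g. redirects) are left without a value here,
    # since the list above only covers OK and error responses.
    return None
```

Every cache layer downstream sees the same value, which is why one 7-day header ends up being honored several times over.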

When the IF Archive team updates a zip, they update it in place. In the worst case, this means:

  • A user visits a file on Monday the 1st at noon, priming nginx's cache with HTML and Cloudflare's cache with subresources.
  • The HTML file and subresources are updated in-place on the IF Archive at 12:01pm.
  • A user visits the file on Monday the 8th at 11:59am, requesting the HTML file from nginx, which responds with the old version, caching for seven days, and subresources from Cloudflare, which also caches for seven days.
  • That user visits the file on Monday the 15th at 11:58am. Their browser cache is "fresh" and so they get outdated content. They refresh their browser, getting fresh HTML from nginx, but Cloudflare continues to serve stale subresources.
  • A user with no/expired cache visits the file on Monday the 15th at 12:01pm. At last, there can be no cache that thinks it's fresh enough to serve this content, and now, finally, fresh content is conveyed to the user.

This would be a very unlucky outcome; but it's still pretty much guaranteed that any updates to a zip file won't be reflected on the unboxer for seven days, if only thanks to nginx caching.
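
The worst-case timeline above can be checked with a little date arithmetic. This is a sketch under two stated assumptions: each cache layer restarts the full max-age clock when it stores a copy (no Age-header accounting), and each layer fetches from the one before it a minute before that copy expires. The function name and the concrete date (1 July 2024, an arbitrary Monday the 1st) are made up for illustration.

```python
from datetime import datetime, timedelta

MAX_AGE = timedelta(seconds=604800)  # 7 days, the OK-response max-age


def stale_until(primed_at, layers, max_age=MAX_AGE, slack=timedelta(minutes=1)):
    """Latest time a stale copy still counts as fresh after passing through
    `layers` chained caches, each filled just before the previous one expires."""
    fetched_at = primed_at
    expires_at = primed_at
    for _ in range(layers):
        expires_at = fetched_at + max_age   # this layer trusts its copy until here
        fetched_at = expires_at - slack     # the next layer fetches just before that
    return expires_at


primed = datetime(2024, 7, 1, 12, 0)      # shared cache primed at noon
updated = primed + timedelta(minutes=1)   # archive updated in place at 12:01pm
last_stale = stale_until(primed, layers=2)  # shared cache, then the browser cache

print(last_stale)            # 2024-07-15 11:59:00
print(last_stale - updated)  # 13 days, 23:58:00
```

With two layers (nginx or Cloudflare, then the browser), the stale window comes out just short of 14 days, matching the timeline above.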

EDIT: Updated, since the nginx cache is only for subdomains; a given request goes through either the nginx cache or the Cloudflare cache, but not both.

@curiousdannii
Member

The nginx cache is only for subdomains, so a request uses either the nginx or the Cloudflare cache, but not both.

@dfabulich
Contributor Author

Thanks; I updated my description with the full story.

Here are some options for what to do.

  1. We could serve HTML with Cache-Control: max-age=0, or at least a very small value, so that the user's browser always contacts us for the latest HTML. That's not as bad as it sounds, because the browser will make a conditional GET request, so nginx will likely just return a 304 Not Modified response.

  2. Alternately, it is possible to purge nginx's cache on a URL-by-URL basis.

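    The way to do that is to point proxy_cache_bypass at a secret request header, then fetch the URL with that header set. A request carrying the header skips the cached copy, and nginx stores the fresh upstream response in its place. https://serverfault.com/questions/493411/how-to-delete-single-nginx-cache-file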

    proxy_cache_bypass $http_x_b2ca678b4c936f905fb82f2733f5297f;
    
    curl -s -o /dev/null  -H "X-b2ca678b4c936f905fb82f2733f5297f: 1" "https://23nwbwjk2e.unbox.ifarchive.org/23nwbwjk2e/www/index.html"
    

    Since the nginx configuration is currently being generated by nginx.sh, we can keep the key secret in the production options.json.

    When it's time to purge nginx's cache, we can fetch those URLs with the bypass header.

    But this is much more complicated than just setting max-age=0, and if we continue to use max-age=604800, the user's browser will still cache files for 7 days.

  3. As for Cloudflare, since the subresources are the target of a 301 redirect, we could add a cache key as a URL parameter.

    For example, currently we redirect from https://23nwbwjk2e.unbox.ifarchive.org/23nwbwjk2e/www/audio/bgm/Theme1.ogg to https://unbox.ifarchive.org/23nwbwjk2e/www/audio/bgm/Theme1.ogg, with no Cache-Control header on the redirect.

    We could instead issue a 302 redirect with Cache-Control: max-age=0, pointing to https://unbox.ifarchive.org/23nwbwjk2e/www/audio/bgm/Theme1.ogg?hash=98487658efd7b63f3e3cf237522bcae7, and serve that destination with Cache-Control: max-age=31536000 (1 year). If Theme1.ogg ever changes, we'll use a different URL for it, ensuring that the user gets fresh content.

  4. Alternately, Cloudflare has an API to purge files from its cache. https://developers.cloudflare.com/api/operations/zone-purge

    It requires a secret key, and I think it might be complicated to compute exactly which URLs need to be purged, so I like this idea slightly less.
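
To make option 3 concrete, here is a minimal sketch of how the redirect target could be built. It assumes the cache key is a digest of the file's bytes; the helper names and the choice of SHA-256 are illustrative assumptions, and nothing here is taken from the unboxer's code.

```python
import hashlib
from urllib.parse import urlencode, urlsplit, urlunsplit

MAIN_HOST = "unbox.ifarchive.org"
ONE_YEAR = 31536000  # seconds


def versioned_redirect(subdomain_url, content):
    """Build a 302 from a subdomain URL to a content-addressed main-domain URL.

    Hypothetical helper: the digest algorithm and parameter name are assumptions.
    """
    path = urlsplit(subdomain_url).path
    digest = hashlib.sha256(content).hexdigest()
    location = urlunsplit(("https", MAIN_HOST, path, urlencode({"hash": digest}), ""))
    # The redirect itself must not be cached, so a changed file is noticed at once.
    return 302, {"Location": location, "Cache-Control": "max-age=0"}


def destination_headers():
    """Headers for the versioned URL; it can be cached for a long time because
    any change to the file produces a different URL."""
    return {"Cache-Control": f"max-age={ONE_YEAR}"}
```

The trade-off is that every subresource request still costs one uncached redirect, but the redirect response is tiny compared with the file it points to.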

dfabulich added a commit to dfabulich/ifarchive-unbox that referenced this issue Sep 6, 2024
Fixes iftechfoundation#64.

Setting max-age=0 ensures that we'll always return fresh HTML content.

This won't blow out our bandwidth, because browsers/nginx/Cloudflare will still make conditional `GET` requests, to which we'll respond with a cheap, fast "304 Not Modified" response in most cases.

For subresources on the subdomain, we already redirect them to the main domain, but now we redirect them with a `?lastmod=###` parameter. URLs with that parameter can have a week-long max-age, because if they change, we'll switch to a different URL.
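
The conditional `GET` exchange the commit message relies on can be sketched as follows. This is a simplified illustration, not the commit's code: the function name is hypothetical, the validator is passed in as an opaque ETag string, and the comparison ignores parts of the HTTP spec such as `*`, lists of tags, and weak validators.

```python
def respond(request_headers, body, etag):
    """Answer a GET, returning 304 with no body when the client's copy is current.

    Simplification: header lookup is exact-case and If-None-Match is treated
    as a single tag, which real servers should not assume.
    """
    headers = {"ETag": etag, "Cache-Control": "max-age=0"}
    if request_headers.get("If-None-Match") == etag:
        # The cached copy is still valid: send headers only, no body.
        return 304, headers, b""
    return 200, headers, body
```

With max-age=0, each visit triggers a revalidation like this one, but the 304 path sends no body, which is why it stays cheap.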