WIP: fix counter resets when merging batches #9909

Draft: wants to merge 13 commits into main

fionaliao
Contributor

What this PR does

When merging samples from different iterators, histogram counter reset hints need to be recalculated. The mergeIterator currently does not do so.

This is currently a WIP to test out a possible fix.

We know the following:

The fix has two parts:

  1. When merging, track whether the last sample written came from the current batchStream or from the new batch. If the source switches, write an unknown counter reset hint; otherwise keep the sample's existing hint.
  2. On its own, the rule above always sets the first sample of an incoming batch to unknown in merge(). However, a batch holds a run of consecutive samples from a chunk, and a later batch might come from the same chunk, with samples that directly follow the previous batch's samples. Therefore, batchStream keeps a map of iteratorID -> timestamp of the last histogram from that iterator that was written to the batchStream. When we write the first sample of the new batch, we check whether the previous sample in the batchStream has the same timestamp as the last histogram timestamp recorded for the same iterator. If so, we can trust the counter reset hint of the sample in the new batch rather than resetting it. The idea of detecting consecutive samples across batches using the previous timestamp comes from @krajorama and the PR fix(histograms): inflated counter resets on merge #9823.
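
To make the two rules easier to review, here is a minimal, self-contained sketch of the idea. It is not the code in this PR: `Hint`, `Sample`, `Batch`, `Stream` and `Merge` are hypothetical, simplified stand-ins for the real types in pkg/querier/batch, and the sketch assumes timestamps are distinct across inputs (no deduplication).

```go
package main

import "fmt"

// Hint is a simplified stand-in for a histogram counter reset hint.
type Hint int

const (
	Unknown Hint = iota
	Reset
	NotReset
)

// Sample is a hypothetical histogram sample: a timestamp plus its hint.
type Sample struct {
	T    int64
	Hint Hint
}

// Batch is a run of consecutive samples read from one iterator.
type Batch struct {
	IteratorID int
	Samples    []Sample
}

// Stream is a hypothetical stand-in for batchStream.
type Stream struct {
	out []Sample
	// lastTS maps iteratorID -> timestamp of the last sample from that
	// iterator that was written to the stream.
	lastTS map[int]int64
}

func NewStream() *Stream { return &Stream{lastTS: map[int]int64{}} }

type source int

const (
	none source = iota
	fromStream
	fromBatch
)

// Merge merges b into the stream by timestamp and recalculates hints.
// Assumption: timestamps are distinct across the stream and the batch.
func (s *Stream) Merge(b Batch) {
	merged := make([]Sample, 0, len(s.out)+len(b.Samples))
	last := none
	i, j := 0, 0
	for i < len(s.out) || j < len(b.Samples) {
		takeStream := j >= len(b.Samples) || (i < len(s.out) && s.out[i].T < b.Samples[j].T)
		var smp Sample
		var src source
		if takeStream {
			smp, src = s.out[i], fromStream
			i++
		} else {
			smp, src = b.Samples[j], fromBatch
			j++
		}
		switch {
		case src == fromBatch && j == 1:
			// Part 2: first sample of the batch. Trust its hint only if the
			// sample written just before it is the last one from the same iterator.
			prevTS, seen := s.lastTS[b.IteratorID]
			if !(seen && len(merged) > 0 && merged[len(merged)-1].T == prevTS) {
				smp.Hint = Unknown
			}
		case last != none && src != last:
			// Part 1: the source switched, so the hint is no longer reliable.
			smp.Hint = Unknown
		}
		merged = append(merged, smp)
		last = src
	}
	if n := len(b.Samples); n > 0 {
		s.lastTS[b.IteratorID] = b.Samples[n-1].T
	}
	s.out = merged
}

func main() {
	s := NewStream()
	s.Merge(Batch{IteratorID: 1, Samples: []Sample{{T: 10, Hint: NotReset}, {T: 20, Hint: NotReset}}})
	s.Merge(Batch{IteratorID: 1, Samples: []Sample{{T: 30, Hint: NotReset}, {T: 40, Hint: Reset}}})
	s.Merge(Batch{IteratorID: 2, Samples: []Sample{{T: 35, Hint: NotReset}}})
	for _, smp := range s.out {
		fmt.Println(smp.T, smp.Hint)
	}
}
```

In this sketch, the second batch from iterator 1 keeps its hints because its first sample directly follows the last sample iterator 1 wrote. The sample from iterator 2, and the stream sample that follows it, both fall back to unknown.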

Note that we could still be over-detecting unknown counter resets: chunks in different newNonOverlappingIterators may actually be consecutive, but for now samples coming from different iterators are set to unknown. At this point in the Mimir code we have no way to tell whether chunks are consecutive to each other (see prometheus/prometheus#15346 for related discussion).

Current TODOs:

  • Go through the code carefully; the reasoning about what is safe and correct gets tricky
  • Add a bunch of tests

Which issue(s) this PR fixes or relates to

Fixes #

Checklist

  • Tests updated.
  • Documentation added.
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
  • about-versioning.md updated with experimental features.
