kafka replay speed: add alert for when we miss records in Kafka #9921

Open
wants to merge 7 commits into main from dimitar/ingest/detect-gaps-when-consuming

Conversation

dimitarvdimitrov (Contributor)

What this PR does

Adds metrics and an alert to detect when the ingester misses records while consuming from Kafka, which would indicate a bug.

Which issue(s) this PR fixes or relates to

Fixes #

Checklist

  • Tests updated.
  • Documentation added.
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
  • about-versioning.md updated with experimental features.

@pracucci pracucci (Collaborator) left a comment


Good job. Can you add an assertion to existing unit tests to ensure the metric is always 0 at the end of each test? Should be quick to do.
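
For reference, a minimal sketch of the kind of assertion being asked for, assuming the new gap metric is exposed as a plain `prometheus.Counter` (the package and variable names here are illustrative, not the ones introduced by this PR):

```go
package ingest

import (
	"testing"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/testutil"
	"github.com/stretchr/testify/assert"
)

// assertNoMissedRecords can be called at the end of existing unit tests to make
// sure the gap-detection counter never moved during the test.
func assertNoMissedRecords(t *testing.T, missedRecords prometheus.Counter) {
	t.Helper()
	assert.Zero(t, testutil.ToFloat64(missedRecords), "expected the missed-records metric to be 0 at the end of the test")
}
```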

Base automatically changed from dimitar/ingest/remove-fetchWant-trimming to main November 17, 2024 18:14
I realized that trimming `fetchWant`s can end up discarding offsets in extreme circumstances.

### How it works

If the `fetchWant` is so big that its estimated size would exceed 2 GiB, then we trim it by reducing its end offset. The idea is that the next `fetchWant` picks up from where this one left off.
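
As a rough sketch of that idea (types and field names are illustrative, not the actual `concurrentFetchers` code), trimming just pulls in the end offset so the estimated size stays under the cap:

```go
package ingest

// maxFetchBytes is the 2 GiB cap described above.
const maxFetchBytes = 2 << 30

type fetchWant struct {
	startOffset, endOffset int64 // half-open offset range [startOffset, endOffset)
	bytesPerRecord         int64 // rolling-average estimate of record size
}

// trim reduces endOffset so the estimated fetch size fits under maxFetchBytes.
// The next fetchWant is expected to start at the (possibly reduced) endOffset.
func (w fetchWant) trim() fetchWant {
	if (w.endOffset-w.startOffset)*w.bytesPerRecord <= maxFetchBytes {
		return w
	}
	w.endOffset = w.startOffset + maxFetchBytes/w.bytesPerRecord
	return w
}
```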

### How it can break

We trim the `fetchWant` in `UpdateBytesPerRecord` too. `UpdateBytesPerRecord` can be invoked in `concurrentFetchers.run` after the `fetchWant` has been dispatched. In that case the next `fetchWant` has already been calculated from the untrimmed end offset, and we end up with a gap.
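
Illustrative arithmetic of that failure mode (not code from this PR): the next `fetchWant` is derived from the untrimmed end offset, so a later trim leaves a hole between the two.

```go
package main

import "fmt"

func main() {
	// The dispatched fetchWant originally covers offsets [0, 1000), and the next
	// fetchWant has already been derived from that end offset.
	dispatchedEnd := int64(1000)
	nextStart := dispatchedEnd

	// UpdateBytesPerRecord later trims the dispatched fetchWant down to [0, 600).
	trimmedEnd := int64(600)

	// Offsets [600, 1000) now belong to neither fetchWant: a gap of 400 records,
	// which is exactly what the new metric and alert are meant to catch.
	fmt.Printf("gap of %d records: offsets [%d, %d)\n", nextStart-trimmedEnd, trimmedEnd, nextStart)
}
```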

### Did it break?

It's hard to tell, but it's very unlikely. To reach 2 GiB, the bytes-per-record estimation would have needed to be 2 MiB. While such large records are possible, they should be rare, and our rolling-average estimation of record size shouldn't reach that.

Signed-off-by: Dimitar Dimitrov <[email protected]>
Signed-off-by: Dimitar Dimitrov <[email protected]>
Signed-off-by: Dimitar Dimitrov <[email protected]>
Signed-off-by: Dimitar Dimitrov <[email protected]>
@dimitarvdimitrov dimitarvdimitrov force-pushed the dimitar/ingest/detect-gaps-when-consuming branch from aa30f73 to 90aaeeb Compare November 18, 2024 09:45
@dimitarvdimitrov dimitarvdimitrov marked this pull request as ready for review November 18, 2024 09:45
Signed-off-by: Dimitar Dimitrov <[email protected]>
@tacole02 tacole02 (Contributor) left a comment


Looks good! A few minor questions/suggestions.


How it **works**:

- Ingester reads records from Kafka, and processes them sequentially. It keeps track of the offset of the last record it processed.

Suggested change
- Ingester reads records from Kafka, and processes them sequentially. It keeps track of the offset of the last record it processed.
- The ingester reads records from Kafka and processes them sequentially. It keeps track of the offset of the last record it's processed.

- Upon fetching the next batch of records, it checks if the first available record has an offset one greater than the last processed offset. If the first available offset is larger than that, then the ingester has missed some records.

Suggested change
- Upon fetching the next batch of records, it checks if the first available record has an offset one greater than the last processed offset. If the first available offset is larger than that, then the ingester has missed some records.
- Upon fetching the next batch of records, it checks if the first available record has an offset of one greater than the last processed offset. If the first available offset is larger than that, then the ingester has missed some records.
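
For reference, the check described in this bullet boils down to something like the following (hypothetical names, not the code added in this PR):

```go
package ingest

import "github.com/prometheus/client_golang/prometheus"

// checkFetchContinuity compares the first offset of a newly fetched batch with the
// last processed offset and counts any offsets that were skipped in between.
func checkFetchContinuity(lastProcessedOffset, firstFetchedOffset int64, missedRecords prometheus.Counter) {
	expected := lastProcessedOffset + 1
	if firstFetchedOffset > expected {
		// Offsets [expected, firstFetchedOffset) were never seen by the ingester.
		missedRecords.Add(float64(firstFetchedOffset - expected))
	}
}
```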


- Kafka doesn't guarantee sequential offsets. If a record has been manually deleted from Kafka or the records have been produced in a transaction and the transaction was aborted, then there may be a gap.

Suggested change
- Kafka doesn't guarantee sequential offsets. If a record has been manually deleted from Kafka or the records have been produced in a transaction and the transaction was aborted, then there may be a gap.
- Kafka doesn't guarantee sequential offsets. If a record has been manually deleted from Kafka or if the records have been produced in a transaction and the transaction was aborted, then there may be a gap.

- Mimir doesn't produce in transactions and does not delete records.
- When the ingester starts up, it will attempt to resume from the last offset it processed. If the ingester has been unavailable for long enough that the next record is already removed due to retention, then the ingester will miss some records.

Suggested change
- When the ingester starts up, it will attempt to resume from the last offset it processed. If the ingester has been unavailable for long enough that the next record is already removed due to retention, then the ingester will miss some records.
- When the ingester starts, it attempts to resume from the last offset it processed. If the ingester has been unavailable for long enough that the next record is already removed due to retention, then the ingester misses some records.


We avoid using future tense in the docs.

- Mimir doesn't produce in transactions and does not delete records.

"Mimir doesn't produce in transactions" reads unclear to me. Is the "in" supposed to be here?


How to **investigate**:

- Verify that there have been no deleted records in your Kafka cluster by humans or other applications.

I think we can probably remove "by humans or other applications".
