[operator] Collector fails with featureGate errors when Upgrading the Operator to chart version 0.68.1 #1320
Comments
Hey there, @jlcrow! 👋 I'm here to assist you with any bugs, questions, or contributions you have while waiting for a human maintainer. The issue you're encountering with the OpenTelemetry collector failing with featureGate errors and crashing in a loop after upgrading to chart version 0.68.1 is due to changes in the […]. To resolve this issue, you need to ensure that your CRDs are properly annotated and labeled before upgrading. Here are the steps: […]
Additionally, if you are upgrading from version 0.67 to 0.68, note that the […]
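(For illustration only, since the bot's original steps aren't preserved above: when Helm needs to adopt existing CRDs during an upgrade, "properly annotated and labeled" usually means ownership metadata like the following. The release name and namespace values here are placeholders, not taken from this thread.)

```yaml
# Sketch: ownership metadata on an existing CRD so Helm will adopt it on upgrade.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: opentelemetrycollectors.opentelemetry.io
  annotations:
    meta.helm.sh/release-name: opentelemetry-operator      # placeholder release name
    meta.helm.sh/release-namespace: opentelemetry-operator # placeholder namespace
  labels:
    app.kubernetes.io/managed-by: Helm
```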
@jlcrow this is a known bug and will be fixed by this PR in the operator open-telemetry/opentelemetry-operator#3074
separately, i would recommend upgrading the collector CR's apiVersion to v1beta1 when you get a chance :)
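(Sketch of that change, using the resource name from the config shared later in this thread. Note that the v1beta1 API also takes `spec.config` as structured YAML rather than a single string, as in the example further down.)

```yaml
# apiVersion: opentelemetry.io/v1alpha1   # old alpha API
apiVersion: opentelemetry.io/v1beta1      # beta API used by newer operator releases
kind: OpenTelemetryCollector
metadata:
  name: otel-prometheus                   # name from the config later in this thread
```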
solved by open-telemetry/opentelemetry-operator#3074; this will be fixed in the next operator helm release. Thank you for your patience :)
@jlcrow can you upgrade to latest and let me know if that fixes things?
@jaronoff97
Still seeing errors when the collector comes up
hmm any logs from the operator?
@jaronoff97 Nothing from the operator except info logs from the manager container
one note, i tried running your config and you should know that the memory_ballast extension has been removed. testing this locally now though!
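(A rough sketch of that migration: drop the ballast extension and lean on the memory_limiter processor instead. The limiter values mirror the config shared later in this thread; treat them as placeholders to tune for your workload.)

```yaml
config:
  extensions:
    # memory_ballast: ...   # removed in newer collector versions; delete this block
    health_check: {}
  processors:
    memory_limiter:
      check_interval: 5s    # values mirrored from the config shared below
      limit_percentage: 90
```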
i saw this message from the otel operator:
and this is working now:
Note: the target allocator is failing to start up because it's missing permissions on its service account, but otherwise things worked fully as expected.
Before:
After:
@jaronoff97 Should have provided my latest config:
also note, i needed to get rid of the priority class name and the service account name, which weren't provided. but thanks for updating, giving it a try...
yeah i tested going from 0.65.0 -> 0.69.0 which was fully successful with this configuration:

Config:

```yaml
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel-prometheus
spec:
  mode: statefulset
  podAnnotations:
    sidecar.istio.io/inject: "false"
  targetAllocator:
    enabled: true
    prometheusCR:
      enabled: true
    observability:
      metrics:
        enableMetrics: true
    resources:
      requests:
        memory: 300Mi
        cpu: 300m
      limits:
        memory: 512Mi
        cpu: 500m
  resources:
    requests:
      memory: 600Mi
      cpu: 300m
    limits:
      memory: 1Gi
      cpu: 500m
  env:
    - name: K8S_POD_NAME
      valueFrom:
        fieldRef:
          fieldPath: metadata.name
    - name: K8S_POD_IP
      valueFrom:
        fieldRef:
          fieldPath: status.podIP
  config:
    processors:
      batch: {}
      memory_limiter:
        check_interval: 5s
        limit_percentage: 90
    extensions:
      health_check:
        endpoint:
```

```yaml
# Source: opentelemetry-kube-stack/templates/clusterrole.yaml
apiVersion: rbac.authorization.k8s.io/v1
```
@jaronoff97 idk man the feature gates seem to be sticking around for me when the operator is deploying the collector. I'm running on GKE; I don't think that should matter though.
what was the version before? I thought it was 0.65.1, but want to confirm. And did you install the operator helm chart with upgrades disabled or any other flags? If i can get a local repro, I can try to get a fix out ASAP, otherwise it would be helpful to enable debug logging on the operator.
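(For the debug-logging suggestion, a minimal sketch of chart values that usually does it, assuming the operator chart exposes an extraArgs passthrough for the manager container; verify the key and flag names against the chart version you are running.)

```yaml
# values.yaml for the opentelemetry-operator chart (sketch only)
manager:
  extraArgs:
    - --zap-log-level=debug   # controller-runtime zap flag for verbose operator logs
```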
I was able to make it to 0.67.0; any later version breaks the same way
yeah i just did this exact process:
here's another user who reported a similar issue by doing a clean install of the operator: #1339 (comment)
Looks like after a full uninstall and reinstall the flag is no longer present and the collector comes up successfully
okay that's good, but i'm not satisfied with it. i'm going to keep investigating here and try to get a repro... i'm thinking maybe going from an older version to one that adds the flag, back to the previous version, and then up to latest may cause it.
@jaronoff97 I spoke too soon, somewhere along the line the targetallocator stopped picking up my monitors and I lost almost all of my metrics. I just went back to the alpha spec and chart version 0.67 to get things working again
that's probably due to the permissions change i alluded to here. This was the error message I saw:
this block should do the trick, but I'm on mobile rn so sorry if it's not exactly right 😅
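(The block itself didn't survive in this copy of the thread. The following is only a rough sketch of the kind of read access a target allocator service account typically needs for Prometheus CRs; the names, namespace, and binding below are placeholders, not the maintainer's original snippet.)

```yaml
# Sketch: ClusterRole granting a target allocator read access to Prometheus CRs.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otel-targetallocator-cr-read
rules:
  - apiGroups: ["monitoring.coreos.com"]
    resources: ["servicemonitors", "podmonitors"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["namespaces", "nodes", "services", "endpoints", "pods", "configmaps"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["discovery.k8s.io"]
    resources: ["endpointslices"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: otel-targetallocator-cr-read
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: otel-targetallocator-cr-read
subjects:
  - kind: ServiceAccount
    name: otel-prometheus-targetallocator   # placeholder; match your TA's service account
    namespace: monitoring                    # placeholder namespace
```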
I'm still having weird issues with the targetallocator on one of my clusters: it consistently fails to pick up any servicemonitor or podmonitor CRs. I tried a number of things, including a full uninstall and reinstall, working with version 69 of the chart and 108 of the collector. I checked the RBAC for the service account and the auth appears to be there.
In the end, on a whim, I reverted the api back to v1alpha1, and when I deployed the spec the targetallocator's /scrape_configs endpoint started showing all the podmonitors and servicemonitors instead of only the default prometheus config that's in the chart. I actually don't understand at all why this isn't working correctly, as I have another operator on another GKE cluster with the same config that doesn't seem to have an issue with the beta api.
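(One thing that might be worth ruling out when comparing the two API versions, offered as a guess rather than a confirmed diagnosis: the v1beta1 spec lets you set the Prometheus CR selectors explicitly, and a narrow or unexpected selector would produce exactly this "only the default scrape config" behavior. A sketch, with empty selectors meaning "match everything":)

```yaml
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel-prometheus
spec:
  targetAllocator:
    enabled: true
    prometheusCR:
      enabled: true
      # Empty label selectors match all ServiceMonitors/PodMonitors.
      # If these end up narrower than under v1alpha1, the allocator will only
      # serve the chart's built-in scrape config.
      serviceMonitorSelector: {}
      podMonitorSelector: {}
```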
Performed a routine helm upgrade from chart version 0.65.1 to 0.68.1; after the upgrade, the created OpenTelemetry collector will not start. No errors in the operator; the collector errors and crashloops.
Collector config