The Grafana Cloud Operator is an Ansible-based OpenShift Operator that automates the configuration and management of Grafana OnCall within an OpenShift cluster. It simplifies the process of setting up Grafana OnCall, ensuring seamless integration with Alertmanager and consistent alert forwarding. The operator also creates dashboards for SLOs and ensures proper cleanup of integrations and dashboards.
Grafana Cloud is a fully managed observability platform from Grafana Labs, providing a seamless experience across metrics, logs, and traces. Within Grafana Cloud, Grafana OnCall is a dedicated incident response coordination system, directly integrating with Grafana's alerting mechanism to manage on-call schedules, escalations, and incident tracking.
Manually configuring Grafana OnCall on a cluster involves several complex steps, including creating accounts, configuring integrations, and editing configurations. This process is time-consuming, error-prone, and can lead to inconsistencies and misconfigurations if not done accurately. Automating these tasks with the Grafana Cloud Operator simplifies the setup, reduces errors, and ensures consistency across clusters.
The Grafana Cloud Operator supports the creation and management of dashboards that track key Service Level Indicators (SLIs) and ensure compliance with Service Level Objectives (SLOs) for Kubernetes (K8s) and OpenShift environments. This section provides an overview of the SLIs monitored and the SLOs established, along with the automation approach for creating dashboards.
Key SLIs and SLOs Monitored:

- K8s API Uptime
  - SLI: Measures the uptime of the Kubernetes API.
  - SLO: Ensure 99.5% uptime.
- K8s API Request Error Rate
  - SLI: Monitors the error rate of requests made to the Kubernetes API.
  - SLO: Ensure a 99.9% success rate.
- OpenShift Console Uptime
  - SLI: Tracks the uptime of the OpenShift web console.
- HAProxy / Router Uptime
  - SLI: Measures the uptime of HAProxy or router services.
- OpenShift Authentication Uptime
  - SLI: Monitors the uptime of the OpenShift authentication service.

Dashboard Organization:

- Dedicated Folders: Each customer has a dedicated folder in Grafana Cloud for storing their respective dashboards.
- One Dashboard per Cluster: A dashboard is created for each cluster to track these SLIs and ensure they meet the defined SLOs.
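The exact queries used by the generated dashboards come from the slo-observability charts; purely as an illustration, an SLI like the K8s API request success rate could be expressed as a Prometheus recording rule such as the one below (the rule name, window, and label selectors are assumptions, not the operator's actual definitions):

```yaml
groups:
  - name: slo-k8s-api
    rules:
      # Fraction of successful (non-5xx) apiserver requests over 28 days;
      # compare against the 99.9% request-success SLO.
      - record: slo:apiserver_request_success:ratio_rate28d
        expr: |
          1 - (
            sum(rate(apiserver_request_total{code=~"5.."}[28d]))
            /
            sum(rate(apiserver_request_total[28d]))
          )
```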
This operator is built with the Ansible Operator Framework (part of the Operator SDK), combining the ease of use of Operators with the power of Ansible automation. It reacts to custom resources created within the OpenShift cluster to manage the creation and integration of Grafana OnCall resources with the cluster.
The Grafana Cloud Operator leverages the flexibility of Ansible in responding to custom resource changes on the cluster, ensuring Grafana OnCall is properly configured and maintained.
The operator's workflow can be described in two different architectural models:
**A. Hub and Spoke Model**
In the Hub-Spoke model, the operator is installed on a central Hub cluster and manages Grafana OnCall configurations for multiple Spoke clusters. This model is ideal for organizations with multiple clusters and aims to centralize monitoring and management.
```mermaid
graph TD
    subgraph "Hub and Spoke Integration with Grafana OnCall and SLO Management"
        subgraph "OpenShift Hub Cluster"
            InitHub[Start: Operation Initiated in Hub]
            GetGCOHub[Get All Config CRs]
            CheckMultipleCRs[Ensure Only One GCC CR Exists]
            GetToken[Retrieve Grafana API Token from Secret]
            ListIntegrations[Fetch List of Existing Integrations in Grafana OnCall]
            FetchClusters[Fetch ManagedClusters]
            FetchSlackChannels[Fetch Slack Channel CRs from All Namespaces]
            DetermineMissingIntegrations[Determine Clusters Missing Integrations]
            CreateIntegration[Create Integration in Grafana OnCall for Missing Clusters]
            CreateSLOFolders[Create SLO Folders in Grafana Cloud]
            CreateSLODashboards[Create SLO Dashboards for Each Cluster]
            InitHub --> GetGCOHub
            GetGCOHub --> CheckMultipleCRs
            CheckMultipleCRs --> GetToken
            GetToken --> ListIntegrations
            ListIntegrations --> FetchClusters
            FetchClusters --> FetchSlackChannels
            FetchSlackChannels --> DetermineMissingIntegrations
            DetermineMissingIntegrations --> CreateIntegration
            DetermineMissingIntegrations --> CreateSLOFolders
            CreateSLOFolders --> CreateSLODashboards
        end
        subgraph "Grafana Cloud"
            GOHub[Grafana OnCall]
            SLOFolders[SLO Folders]
            SLODashboards[SLO Dashboards]
        end
        subgraph "Spoke Clusters"
            SC1[Spoke Cluster 1]
            SC2[Spoke Cluster 2]
            SC3[Spoke Cluster 3]
        end
        subgraph "OpenClusterManagement (OCM)"
            CreateIntegration --> |Create Integration| GOHub
            CreateSLOFolders --> |Create Folders| SLOFolders
            CreateSLODashboards --> |Create Dashboards| SLODashboards
            GOHub -->|Return: Endpoint| RHACM[RHACM/OCM]
            RHACM --> |ManifestWork| SC1
            RHACM --> |ManifestWork| SC2
            RHACM --> |ManifestWork| SC3
        end
    end
```
Centralized ManagedClusters Monitoring: The operator, installed on the Hub cluster, continually monitors for the presence of ManagedCluster resources from Hive that are registered from Spoke clusters. These resources are significant markers, indicating the clusters that require Grafana OnCall integration.
Centralized Slack Channel CRs Monitoring: The operator installed on the Hub cluster continually monitors for the presence of Slack Channel resources from Slack Operator that are registered for Spoke clusters. The channel resources are present in the same namespace as the Custom Resource generating the ManagedCluster and are attached to the Grafana OnCall integration.
Cross-Cluster Grafana OnCall Setup: For each ManagedCluster identified, the operator communicates with Grafana Cloud's API, initiating the integration process. This setup involves creating the necessary configurations on Grafana Cloud and retrieving vital details such as the Alertmanager HTTP URL for each respective Spoke cluster.
`ManifestWork` Synchronization: Utilizing `ManifestWork` resources from open-cluster-management, the operator ensures that alerting configurations are consistent across all Spoke clusters. This mechanism efficiently propagates configuration changes from the Hub to the Spokes, particularly the alert forwarding settings in Alertmanager, and uses the Watchdog alert for heartbeats.

Centralized Secret Management: The operator centrally manages the `alertmanager-main-generated` secret for each Spoke cluster. Through `ManifestWork`, it disseminates the updated secret configurations, ensuring each Spoke cluster's Alertmanager can successfully forward alerts to Grafana OnCall. Additionally, it adds an option for an OnCall Heartbeat, which acts as monitoring for the monitoring system itself; the Watchdog alert is used for these heartbeats.

Forwarding Alerts to Slack: The Fetch Slack Info and Configure Slack steps show how the operator additionally configures Grafana OnCall to send alerts directly to a specified Slack channel for enhanced incident awareness and response. Note: This feature utilizes slack-operator, another one of our open source projects; head over there for detailed information on that operator.
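The propagated configuration can be pictured as a standard open-cluster-management `ManifestWork`. The sketch below is illustrative only; the resource name, target namespace, and embedded secret contents are assumptions, not the operator's exact output:

```yaml
apiVersion: work.open-cluster-management.io/v1
kind: ManifestWork
metadata:
  name: grafana-oncall-alertmanager   # hypothetical name
  namespace: spoke-cluster-1          # namespace of the target ManagedCluster on the Hub
spec:
  workload:
    manifests:
      # The work agent on the Spoke applies this manifest locally.
      - apiVersion: v1
        kind: Secret
        metadata:
          name: alertmanager-main-generated
          namespace: openshift-monitoring
        type: Opaque
        data:
          alertmanager.yaml: <base64-encoded Alertmanager config with the OnCall receiver>
```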
**B. Standalone Cluster Model**
In a standalone cluster model, the operator is installed directly on a single cluster and manages the Grafana OnCall configuration solely for that cluster. This setup is suitable for individual clusters or standalone environments.
```mermaid
graph TD
    subgraph "OpenShift Standalone Cluster"
        Init[Start: Operation Initiated]
        GetClusterName[Retrieve Cluster Name]
        CheckIntegration[Check Grafana Integration Existence]
        CreateIntegration[Create Grafana OnCall Integration]
        ModSecret[Include: modify_alertmanager_secret]
        Reencode[Re-encode Alertmanager Content]
        PatchSecret[Patch alertmanager-main Secret]
        UpdateCR[Update CR Status to ConfigUpdated]
        CreateSLOFolder[Create SLO Folder in Grafana Cloud]
        CreateSLODashboard[Create SLO Dashboard in Grafana Cloud]
        Init --> GetClusterName
        GetClusterName --> CheckIntegration
        CheckIntegration -->|Integration doesn't exist| CreateIntegration
        CheckIntegration -->|Integration exists| UpdateCR
        CreateIntegration --> FetchSlackInfo
        FetchSlackInfo --> ConfigureSlack
        ConfigureSlack --> ModSecret
        ModSecret --> Reencode
        Reencode --> PatchSecret
        PatchSecret --> UpdateCR
        CreateIntegration --> CreateSLOFolder
        CreateSLOFolder --> CreateSLODashboard
    end
    subgraph "Grafana Cloud"
        GO[Grafana OnCall]
    end
    ConfigureSlack -->|API Call: Configure Slack| GO
    GO -->|Return: Endpoint| ConfigureSlack
```
Operator Workflow in Standalone Cluster: The operator functions within the single OpenShift cluster, monitoring resources that indicate the local cluster's need for Grafana OnCall integration.
Direct Grafana OnCall Setup: Upon identifying the GCC CR, described in the next section, the operator proceeds with the Grafana OnCall setup by interacting with Grafana Cloud's API. It establishes the necessary integrations and secures essential details, including the Alertmanager HTTP URL.
In-Cluster Configuration Management: The operator directly applies configuration changes within the cluster, bypassing the need for `ManifestWork`. It ensures the Alertmanager's alert forwarding settings are correctly configured for seamless communication with Grafana OnCall. Additionally, it adds an option for an OnCall Heartbeat, which acts as monitoring for the monitoring system itself, using the Watchdog alert.

Local Secret Management: Managing the `alertmanager-main-generated` secret locally, the operator updates its configuration. This update enables the Alertmanager within the standalone cluster to route alerts effectively to Grafana OnCall, completing the integration process.

Forwarding Alerts to Slack: Just like in the hub-and-spoke model, a Slack channel can be configured in standalone mode by populating the `slackId` field. This additionally configures Grafana OnCall to send alerts directly to a specified Slack channel for enhanced incident awareness and response. Note: This feature utilizes slack-operator, another one of our open source projects; head over there for detailed information on that operator.
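Conceptually, the patched Alertmanager configuration gains a webhook-style receiver pointing at the endpoint returned by Grafana OnCall when the integration is created. The fragment below is a minimal sketch; the receiver name and URL placeholders are assumptions:

```yaml
route:
  receiver: grafana-oncall
receivers:
  - name: grafana-oncall
    webhook_configs:
      # Endpoint returned by Grafana OnCall for the Alertmanager integration.
      - url: https://<oncall-endpoint>/integrations/v1/alertmanager/<integration-token>/
        send_resolved: true
```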
- An OpenShift cluster up and running.
- The `oc` CLI tool.
- Access to a Grafana OnCall API key with relevant permissions.
This section outlines the process of installing the Grafana Cloud Operator through a custom catalog as well as Helm charts. By following these steps, you will be able to deploy the operator on a cluster.
1. Create a Namespace

   Start by creating a specific namespace for the Grafana Cloud Operator.

   ```shell
   oc create namespace grafana-cloud-operator
   ```
2. Create a Secret Called `saap-dockerconfigjson`

   Since the catalog image is private, you need to create a secret that contains your Docker credentials. This secret is necessary to pull the catalog image from the GitHub Container Registry.

   ```shell
   oc -n grafana-cloud-operator create secret docker-registry saap-dockerconfigjson \
     --docker-server=ghcr.io \
     --docker-username=<username> \
     --docker-password=<your-access-token> \
     --docker-email=<email>
   ```

   Note: Replace `<username>`, `<your-access-token>`, and `<email>` with your GitHub username, a personal access token (with the `read:packages` scope enabled), and your email, respectively.
3. Create Grafana API Token Secret

   The operator needs to interact with Grafana Cloud's APIs, and for this it requires an API token. Create a secret to store this token securely.

   ```shell
   oc -n grafana-cloud-operator create secret generic grafana-api-token-secret \
     --from-literal=api-token=<your-grafana-api-token>
   ```

   Note: Obtain the API token from your Grafana OnCall settings page and replace `<your-grafana-api-token>` with your actual API token.
4. Apply the Custom Catalog Source

   Now, apply the custom catalog source configuration to your cluster. This catalog source contains the operator that you wish to install.

   ```shell
   oc -n grafana-cloud-operator create -f custom-catalog.yaml
   ```

   Note: Ensure that `custom-catalog.yaml` is properly configured with the right details of your custom catalog.
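The referenced `custom-catalog.yaml` is a standard OLM `CatalogSource`. A minimal sketch is shown below; the image path and display name are assumptions and must be adjusted to your registry:

```yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: grafana-cloud-operator-catalog
  namespace: grafana-cloud-operator
spec:
  sourceType: grpc
  image: ghcr.io/stakater/grafana-cloud-operator-catalog:latest  # hypothetical image path
  displayName: Grafana Cloud Operator Catalog
  secrets:
    - saap-dockerconfigjson  # pull secret created in the earlier step
```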
5. Install the Operator via OperatorHub

   Navigate to the OperatorHub in your OpenShift console. Search for "Grafana Cloud Operator" and proceed with the installation by following the on-screen instructions. Select the `grafana-cloud-operator` namespace for deploying the operator.
6. Verify the Installation

   After the installation, ensure that the operator's components are running properly. Check the status of the pods with the following command:

   ```shell
   oc -n grafana-cloud-operator get pods
   ```

   You should see the operator pod in a Running state.
1. Create a Namespace

   We need a separate namespace for the Grafana Cloud Operator to keep things organized and isolated.

   ```shell
   oc create namespace grafana-cloud-operator
   ```
2. Create a Docker Registry Secret

   This secret is required to pull the operator image from a private registry. Without it, the cluster won't be able to access the images, and the deployment will fail.

   ```shell
   oc -n grafana-cloud-operator create secret docker-registry saap-dockerconfigjson \
     --docker-server=ghcr.io \
     --docker-username=<username> \
     --docker-password=<your-access-token> \
     --docker-email=<email>
   ```

   Note: Make sure to replace `<username>`, `<your-access-token>`, and `<email>` with your actual information. The access token should have the appropriate permissions to read from the container registry.
3. Create a Grafana API Token Secret

   The Grafana Cloud Operator interacts with Grafana Cloud's APIs. As such, it requires an API token, which should be stored as a Kubernetes secret.

   ```shell
   oc -n grafana-cloud-operator create secret generic grafana-api-token-secret \
     --from-literal=api-token=<your-grafana-api-token>
   ```

   Note: Replace `<your-grafana-api-token>` with your actual Grafana API token. You can generate or find this token in your Grafana OnCall settings page.
4. Install the Grafana Cloud Operator Using Helm

   Now you're set to install the Grafana Cloud Operator using Helm. Run the following command, replacing `<chart-path>` with the path to your Helm chart. This command installs the chart with the release name `grafana-cloud-operator` in the `grafana-cloud-operator` namespace.

   ```shell
   helm install grafana-cloud-operator <chart-path> --namespace grafana-cloud-operator
   ```

   Note: The default chart path is `charts/grafana-oncall` from the root of the repo.
5. Verify the Installation

   Check if all the pods related to the Grafana Cloud Operator are up and running.

   ```shell
   oc -n grafana-cloud-operator get pods
   ```

   This command lists the pods in the `grafana-cloud-operator` namespace, allowing you to verify their status. Ensure that all pods are in either the Running or Completed state, indicating that they are operational.

This Helm-based approach simplifies the deployment of the Grafana Cloud Operator by encapsulating the configuration details. Users can easily upgrade or roll back the operator, leveraging Helm's package management capabilities.
After installation, you can create a `Config` resource that the operator recognizes by applying the manifest described below.

The operator gets its instructions from a custom resource (CR) that follows the `Config` Custom Resource Definition (CRD). This CR contains all the necessary information, from the API token required to interact with Grafana Cloud to the mode of operation the operator should adopt.
Here's a step-by-step guide on understanding and applying this configuration:
1. Preparing Your Custom Resource

   First, let's break down the essential parts of the CR:

   ```yaml
   apiVersion: grafanacloud.stakater.com/v1alpha1
   kind: Config
   metadata:
     name: config-sample # This is a user-defined name for your custom resource
     namespace: grafana-cloud-operator # Namespace where the operator is installed
   spec:
     enabled: true
     sloObservabilityURL: https://raw.githubusercontent.com/stakater/charts/slo-observability-0.0.9 # URL of the SLO Dashboard release to use
     sloCloudAPI: https://grafana.net/api # API URL for SLO Dashboards
     sloDashboardAPIToken:
       key: api-token # The key field within the secret holding the Dashboard API token
       secretName: slo-dashboard-api-token-secret # The name of the Kubernetes secret storing the Dashboard API token
     grafanaAPIToken:
       key: api-token # The key field within the secret holding the Grafana OnCall API token
       secretName: grafana-api-token-secret # The name of the Kubernetes secret storing the Grafana OnCall API token
     slackId: C0DDD0ZD4JZ # For standalone mode, populate this field to connect a Slack channel to the Grafana OnCall integration
     slack: true # Slack alerts toggle for the integration; set to false to disable sending alerts to the channel. Defaults to true
     provisionMode: standalone # Determines the mode of operation - 'hubAndSpoke' or 'standaloneCluster'
   ```
   - `metadata`: Contains general information about the custom resource you are creating, such as its name and the namespace it resides in.
   - `spec`: This is where the bulk of the configuration goes. It is broken down further below:
     - `enabled`: Currently does nothing, but the idea is to use this flag to support removal of the Grafana integration in the future.
     - `sloCloudAPI`: API URL for Grafana Dashboards.
     - `sloObservabilityVersion`: Use this field to select any available release of the SLO Dashboards.
     - `sloDashboardAPIToken`: Since the operator needs to interact with the Grafana Dashboards API, you need to provide it with an API token. This token is stored within a Kubernetes secret for security, and here you point the operator to the right secret and key.
     - `grafanaAPIToken`: Since the operator needs to interact with Grafana OnCall's API, you need to provide it with an API token. This token is stored within a Kubernetes secret for security, and here you point the operator to the right secret and key.
     - `provisionMode`: Indicates how the operator should function: 'hubAndSpoke' mode, where it manages multiple clusters, or 'standaloneCluster' for managing a single cluster.
     - `slackId`: For the `standalone` provision mode, populate this field to connect a Slack channel to the Grafana OnCall integration.
     - `slack`: A boolean toggle for sending Slack alerts to the channel. Defaults to true.
2. Applying the Custom Resource

   Once your custom resource is ready and tailored for your specific use case, apply it within your OpenShift environment. This action tells the operator what it should do.

   ```shell
   oc apply -f your-config-file.yaml
   ```
3. Modes of Operation

   The `provisionMode` in the spec can be one of the following two values:

   - `hubAndSpoke`: Use this when you have the operator installed on a central Hub cluster and you intend for it to manage Grafana OnCall integrations on multiple Spoke clusters.
   - `standaloneCluster`: Use this when the operator is handling Grafana OnCall integration for the single cluster where it is installed and operated.

   Here's how you would set the `provisionMode` for a standalone cluster:

   ```yaml
   spec:
     provisionMode: standaloneCluster
   ```
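For comparison, a hub-and-spoke deployment differs only in the `provisionMode` and omits `slackId`, since Slack channels are discovered from Slack Channel CRs on the Hub. The field values below are illustrative:

```yaml
apiVersion: grafanacloud.stakater.com/v1alpha1
kind: Config
metadata:
  name: config-sample
  namespace: grafana-cloud-operator
spec:
  enabled: true
  grafanaAPIToken:
    key: api-token
    secretName: grafana-api-token-secret
  provisionMode: hubAndSpoke
```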
The operator adapts its behavior based on this directive, ensuring that your Grafana OnCall integrations are set up and managed in a way that's optimal for your organizational architecture and needs.
The Grafana Cloud Operator includes a robust deletion mechanism that not only handles dashboards but also integrations in Grafana Cloud. This feature ensures outdated or unnecessary resources are efficiently removed, maintaining an organized and optimal environment.
- ManagedCluster Monitoring: The operator actively watches ManagedClusters to detect changes or deletions. This ensures that resources associated with deleted or modified clusters are identified for removal.
- Identification of Outdated Resources:
  - Dashboards: Retrieves unique identifiers (UIDs) for dashboards linked to ManagedClusters to target for deletion.
  - Integrations: Collects identifiers for integrations tied to ManagedClusters to accurately manage their lifecycle.
- Automated Deletion: Once outdated or deleted ManagedClusters are detected, the operator sends DELETE requests to the Grafana Cloud API to remove the associated dashboards and integrations.
- Error Handling: Built-in error handling manages scenarios where resources may have already been deleted or do not exist, preventing unnecessary failures.
- Automated Cleanup: Ensures dashboards and integrations tied to ManagedClusters are cleaned up automatically, reducing manual intervention.
- Resource Optimization: Helps maintain a lean Grafana Cloud environment by removing unused resources, improving performance and manageability.
- Smooth Integration with Playbooks: The deletion process can be integrated with Ansible playbooks for a comprehensive orchestration logic.
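The selection step above amounts to a set difference between the resources known to Grafana Cloud and the clusters that still exist. A simplified Python sketch of that logic follows; the function and field names are illustrative, not the operator's actual Ansible tasks:

```python
def find_orphaned_uids(cloud_resources, active_clusters):
    """Return UIDs of dashboards/integrations whose cluster no longer exists.

    cloud_resources: mapping of resource UID -> cluster name it belongs to
    active_clusters: iterable of ManagedCluster names currently registered
    """
    active = set(active_clusters)
    # Any resource whose owning cluster is gone is a candidate for a
    # DELETE request against the Grafana Cloud API.
    return sorted(uid for uid, cluster in cloud_resources.items()
                  if cluster not in active)


# Example: cluster-b was deleted, so its dashboard and integration are orphaned.
resources = {
    "dash-a1": "cluster-a",
    "dash-b1": "cluster-b",
    "integ-b1": "cluster-b",
}
orphans = find_orphaned_uids(resources, ["cluster-a"])
print(orphans)  # ['dash-b1', 'integ-b1']
```

The operator then issues DELETE calls for each returned UID, with error handling that tolerates resources already removed on the Grafana Cloud side.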
After you've applied the CR, the operator starts performing its duties based on the instructions given. You can monitor the operator's activities and troubleshoot potential issues by examining the logs of the operator pod:
```shell
oc -n grafana-cloud-operator logs -f <operator-pod-name>
```
This command will stream the logs from the operator to your console, providing real-time updates on what the operator is doing. It's crucial for identifying any problems the operator encounters while trying to set up Grafana OnCall.
File a GitHub issue.
Join and talk to us on Slack for discussing the Grafana Cloud Operator.
Please use the issue tracker to report any bugs or file feature requests.
Apache2 © Stakater
Grafana Cloud Ansible Operator is maintained by Stakater. Like it? Please let us know at [email protected]
See our other projects or contact us in case of professional services and queries on [email protected]