OpenTelemetry Metrics: Adds support to collect Operation level metrics #4682

Open
wants to merge 40 commits into
base: master
40 commits
3d1d6ba
Added request level metrics
sourabh1007 Sep 13, 2024
60c411b
add IsClientMetricsEnabled option
sourabh1007 Sep 13, 2024
55b6505
added contract file
sourabh1007 Sep 13, 2024
6fb216c
wip
sourabh1007 Sep 16, 2024
841c7bc
adding test
sourabh1007 Sep 18, 2024
ee32e43
added documentation
sourabh1007 Oct 14, 2024
1ddbcbe
emit metrics
sourabh1007 Oct 16, 2024
7e9a344
fixed dimensions
sourabh1007 Oct 17, 2024
bc6f273
nonworking changes
sourabh1007 Oct 17, 2024
278c5c5
final commit
sourabh1007 Oct 18, 2024
8bae2a9
remove unnecessary dependencies
sourabh1007 Oct 18, 2024
cebace3
contract update
sourabh1007 Oct 18, 2024
6c32e52
fix merges
sourabh1007 Oct 18, 2024
f6974ea
remove console
sourabh1007 Oct 18, 2024
161abe9
add noops if disables
sourabh1007 Oct 18, 2024
2569ed9
added null check
sourabh1007 Oct 18, 2024
a3ee34d
[INTERNAL] CI: Fixes emulator set-up to leverage central SDK teams sc…
kirankumarkolli Oct 18, 2024
de96acc
VectorIndexDefinition: Adds Support for Partitioned DiskANN (#4792)
kundadebdatta Oct 18, 2024
2af0b05
Azurecore: Fixes upgrading azure core dependency to latest (#4819)
kirankumarkolli Oct 18, 2024
8d80c1c
DeleteAllItemsByPartitionKeyStreamAsync: Adds DeleteAllItemsByPartiti…
kirankumarkolli Oct 18, 2024
316b3d8
Merge branch 'master' into users/sourabhjain/otelmetriccpu
sourabh1007 Oct 22, 2024
c6a33b0
rename file
sourabh1007 Oct 22, 2024
4fd1193
refactor code
sourabh1007 Oct 23, 2024
033fda4
refactor code
sourabh1007 Oct 23, 2024
e92477a
perf tests
sourabh1007 Oct 23, 2024
cf5bb03
updated contracts
sourabh1007 Oct 24, 2024
6338908
code refactor
sourabh1007 Oct 24, 2024
321520c
refactored code
sourabh1007 Oct 25, 2024
51d485c
Merge branch 'master' into users/sourabhjain/otelmetriccpu
sourabh1007 Oct 25, 2024
7d03b8f
Merge branch 'master' into users/sourabhjain/otelmetriccpu
sourabh1007 Oct 26, 2024
1614b75
Merge branch 'master' into users/sourabhjain/otelmetriccpu
sourabh1007 Nov 1, 2024
9081e1a
added region contacted as dimension
sourabh1007 Nov 1, 2024
66efcf9
Merge branch 'master' into users/sourabhjain/otelmetriccpu
sourabh1007 Nov 6, 2024
ba63724
Merge branch 'master' into users/sourabhjain/otelmetriccpu
sourabh1007 Nov 12, 2024
a451d49
Merge branch 'master' into users/sourabhjain/otelmetriccpu
sourabh1007 Nov 14, 2024
736c292
perf fix
sourabh1007 Nov 14, 2024
6d6957b
inc perf test
sourabh1007 Nov 14, 2024
36d0ee1
perf results
sourabh1007 Nov 14, 2024
2509f3a
Merge branch 'master' into users/sourabhjain/otelmetriccpu
sourabh1007 Nov 14, 2024
36079d4
refactor according to versioning
sourabh1007 Nov 18, 2024
2 changes: 2 additions & 0 deletions Microsoft.Azure.Cosmos/src/CosmosClient.cs
@@ -1438,6 +1438,8 @@ private int DecrementNumberOfActiveClients()
// In case dispose is called multiple times. Check if at least 1 active client is there
if (NumberOfActiveClients > 0)
{
CosmosOperationMeter.RemoveInstanceCount(this.Endpoint);

return Interlocked.Decrement(ref NumberOfActiveClients);
}

8 changes: 8 additions & 0 deletions Microsoft.Azure.Cosmos/src/CosmosClientTelemetryOptions.cs
@@ -53,5 +53,13 @@ public class CosmosClientTelemetryOptions
/// but has to beware that customer data may be shown when the latter option is chosen. It's the user's responsibility to sanitize the queries if necessary.
/// </summary>
public QueryTextMode QueryTextMode { get; set; } = QueryTextMode.None;

/// <summary>
/// Indicates whether client-side metrics collection is enabled.
/// When set to true, the client captures and reports metrics such as request counts, latencies, errors, and other key performance indicators.
/// When false, no client metrics are gathered or reported.
/// </summary>
/// <remarks>Metrics data can be published to a monitoring system such as Prometheus or Azure Monitor, depending on the configured metrics provider.</remarks>
public bool IsClientMetricsEnabled { get; set; }
}
}
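A minimal sketch of how an application might observe these measurements once the option is enabled, using only `System.Diagnostics.Metrics` from the .NET base library. It is not the SDK's implementation: the meter and histogram created here are stand-ins that reuse the meter name, version, metric name, and unit from the constants in this PR, and the class name `MeterListenerSketch`, the `Observed` list, and the tag key `db.operation.name` are hypothetical names chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

public static class MeterListenerSketch
{
    // Measurements seen by the listener, kept so the result can be inspected.
    public static readonly List<double> Observed = new List<double>();

    public static void Main()
    {
        // Stand-in meter (assumption): in a real application the SDK would own this meter.
        using Meter meter = new Meter("Azure.Cosmos.Client.Operation", "1.0.0");
        Histogram<double> duration = meter.CreateHistogram<double>(
            name: "db.client.operation.duration",
            unit: "s",
            description: "Total end-to-end duration of the operation");

        using MeterListener listener = new MeterListener();

        // Subscribe only to instruments published by the operation meter.
        listener.InstrumentPublished = (instrument, meterListener) =>
        {
            if (instrument.Meter.Name == "Azure.Cosmos.Client.Operation")
            {
                meterListener.EnableMeasurementEvents(instrument);
            }
        };

        // Called synchronously for every double measurement on an enabled instrument.
        listener.SetMeasurementEventCallback<double>((instrument, value, tags, state) =>
        {
            Observed.Add(value);
            Console.WriteLine($"{instrument.Name}: {value} {instrument.Unit}");
        });

        listener.Start();

        // Simulated measurement; the tag key is a hypothetical dimension name.
        duration.Record(0.042, new KeyValuePair<string, object>("db.operation.name", "read_item"));
    }
}
```

In production the same subscription is normally done by a metrics pipeline such as the OpenTelemetry SDK rather than a hand-written listener; the listener is used here only because it needs no extra packages.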
7 changes: 7 additions & 0 deletions Microsoft.Azure.Cosmos/src/DocumentClient.cs
@@ -955,6 +955,13 @@ internal virtual void Initialize(Uri serviceEndpoint,
// Loading VM Information (non blocking call and initialization won't fail if this call fails)
VmMetadataApiHandler.TryInitialize(this.httpClient);

if (this.cosmosClientTelemetryOptions.IsClientMetricsEnabled)
{
CosmosOperationMeter.Initialize();

CosmosOperationMeter.AddInstanceCount(this.ServiceEndpoint);
}

// Starting ClientTelemetry Job
this.telemetryToServiceHelper = TelemetryToServiceHelper.CreateAndInitializeClientConfigAndTelemetryJob(this.clientId,
this.ConnectionPolicy,
60 changes: 41 additions & 19 deletions Microsoft.Azure.Cosmos/src/Resource/ClientContextCore.cs
@@ -13,6 +13,7 @@ namespace Microsoft.Azure.Cosmos
using System.Text;
using System.Threading;
using System.Threading.Tasks;
using global::Azure;
using Microsoft.Azure.Cosmos.Handlers;
using Microsoft.Azure.Cosmos.Resource.CosmosExceptions;
using Microsoft.Azure.Cosmos.Routing;
@@ -498,22 +499,24 @@ private async Task<TResult> RunWithDiagnosticsHelperAsync<TResult>(
RequestOptions requestOptions,
ResourceType? resourceType = null)
{
Func<string> getOperationName = () =>
{
// If opentelemetry is not enabled then return null operation name, so that no activity is created.
if (openTelemetry == null)
{
return null;
}

if (resourceType is not null && this.IsBulkOperationSupported(resourceType.Value, operationType))
{
return OpenTelemetryConstants.Operations.ExecuteBulkPrefix + openTelemetry.Item1;
}
return openTelemetry.Item1;
};

using (OpenTelemetryCoreRecorder recorder =
OpenTelemetryRecorderFactory.CreateRecorder(
getOperationName: () =>
{
// If opentelemetry is not enabled then return null operation name, so that no activity is created.
if (openTelemetry == null)
{
return null;
}

if (resourceType is not null && this.IsBulkOperationSupported(resourceType.Value, operationType))
{
return OpenTelemetryConstants.Operations.ExecuteBulkPrefix + openTelemetry.Item1;
}
return openTelemetry.Item1;
},
getOperationName: getOperationName,
containerName: containerName,
databaseName: databaseName,
operationType: operationType,
@@ -525,20 +528,30 @@ private async Task<TResult> RunWithDiagnosticsHelperAsync<TResult>(
try
{
TResult result = await task(trace).ConfigureAwait(false);
if (openTelemetry != null && recorder.IsEnabled)
// Checks if OpenTelemetry is configured for this operation and either Trace or Metrics are enabled by customer
if (openTelemetry != null
&& (!this.ClientOptions.CosmosClientTelemetryOptions.DisableDistributedTracing || this.ClientOptions.CosmosClientTelemetryOptions.IsClientMetricsEnabled))
{
// Record request response information
// Extracts and records telemetry data from the result of the operation.
OpenTelemetryAttributes response = openTelemetry?.Item2(result);

// Records the telemetry attributes for Distributed Tracing (if enabled)
recorder.Record(response);
}

// Records metrics such as request units, latency, and item count for the operation.
CosmosOperationMeter.RecordTelemetry(getOperationName: getOperationName,
accountName: this.client.Endpoint,
containerName: containerName,
databaseName: databaseName,
attributes: response);
}
return result;
}
catch (OperationCanceledException oe) when (!(oe is CosmosOperationCanceledException))
{
CosmosOperationCanceledException operationCancelledException = new CosmosOperationCanceledException(oe, trace);
recorder.MarkFailed(operationCancelledException);

throw operationCancelledException;
}
catch (ObjectDisposedException objectDisposed) when (!(objectDisposed is CosmosObjectDisposedException))
@@ -563,7 +576,16 @@ private async Task<TResult> RunWithDiagnosticsHelperAsync<TResult>(
catch (Exception ex)
{
recorder.MarkFailed(ex);

if (openTelemetry != null && ex is CosmosException cosmosException)
{
// Records telemetry data related to the exception.
CosmosOperationMeter.RecordTelemetry(getOperationName: getOperationName,
accountName: this.client.Endpoint,
containerName: containerName,
databaseName: databaseName,
ex: cosmosException);
}

throw;
}
}
@@ -5,6 +5,7 @@
namespace Microsoft.Azure.Cosmos.Telemetry
{
using System;
using System.Collections.Generic;
using global::Azure.Core;

internal sealed class AppInsightClassicAttributeKeys : IActivityAttributePopulator
@@ -160,5 +161,19 @@ public void PopulateAttributes(DiagnosticScope scope, QueryTextMode? queryTextMo
}
}
}

public KeyValuePair<string, object>[] PopulateOperationMeterDimensions(string operationName, string containerName, string databaseName, Uri accountName, OpenTelemetryAttributes attributes, CosmosException ex)
{
return new KeyValuePair<string, object>[]
{
new KeyValuePair<string, object>(AppInsightClassicAttributeKeys.ContainerName, containerName),
new KeyValuePair<string, object>(AppInsightClassicAttributeKeys.DbName, databaseName),
new KeyValuePair<string, object>(AppInsightClassicAttributeKeys.ServerAddress, accountName.Host),
new KeyValuePair<string, object>(AppInsightClassicAttributeKeys.DbOperation, operationName),
new KeyValuePair<string, object>(AppInsightClassicAttributeKeys.StatusCode, (int)(attributes?.StatusCode ?? ex?.StatusCode)),
new KeyValuePair<string, object>(AppInsightClassicAttributeKeys.SubStatusCode, attributes?.SubStatusCode ?? ex?.SubStatusCode),
new KeyValuePair<string, object>(AppInsightClassicAttributeKeys.Region, string.Join(",", attributes.Diagnostics.GetContactedRegions()))
};
}
}
}
@@ -0,0 +1,143 @@
// ------------------------------------------------------------
// Copyright (c) Microsoft Corporation. All rights reserved.
// ------------------------------------------------------------

namespace Microsoft.Azure.Cosmos
{
/// <summary>
/// The CosmosDbClientMetrics class provides constants related to OpenTelemetry metrics for Azure Cosmos DB.
/// These metrics are useful for tracking various aspects of Cosmos DB client operations and are compliant with the OpenTelemetry Semantic Conventions.
/// It defines standardized names, units, descriptions, and histogram buckets for measuring and monitoring performance through OpenTelemetry.
/// </summary>
public sealed class CosmosDbClientMetrics
{
/// <summary>
/// Constants describing the operation-level meter and its metrics.
/// </summary>
public static class OperationMetrics
{
/// <summary>
/// Name of the operation meter
/// </summary>
public const string MeterName = "Azure.Cosmos.Client.Operation";

/// <summary>
/// Version of the operation meter
/// </summary>
public const string Version = "1.0.0";

/// <summary>
/// Metric Names
/// </summary>
public static class Name
{
/// <summary>
/// Total request units per operation (sum of RUs for all requests needed when processing an operation)
/// </summary>
public const string RequestCharge = "db.client.cosmosdb.operation.request_charge";

/// <summary>
/// Total end-to-end duration of the operation
/// </summary>
public const string Latency = "db.client.operation.duration";

/// <summary>
/// For feed operations (query, readAll, readMany, change feed) and batch operations, this meter captures the actual item count in responses from the service.
/// </summary>
public const string RowCount = "db.client.response.row_count";

/// <summary>
/// Number of active SDK client instances.
/// </summary>
public const string ActiveInstances = "db.client.cosmosdb.active_instance.count";
}

/// <summary>
/// Unit for metrics
/// </summary>
public static class Unit
{
/// <summary>
/// Unit representing a simple count
/// </summary>
public const string Count = "#";

/// <summary>
/// Unit representing time in seconds
/// </summary>
public const string Sec = "s";

/// <summary>
/// Unit representing request units
/// </summary>
public const string RequestUnit = "# RU";

}

/// <summary>
/// Provides descriptions for metrics.
/// </summary>
public static class Description
{
/// <summary>
/// Description for operation duration
/// </summary>
public const string Latency = "Total end-to-end duration of the operation";

/// <summary>
/// Description for total request units per operation
/// </summary>
public const string RequestCharge = "Total request units per operation (sum of RUs for all requests needed when processing an operation)";

/// <summary>
/// Description for the item count metric in responses
/// </summary>
public const string RowCount = "For feed operations (query, readAll, readMany, change feed) and batch operations, this meter captures the actual item count in responses from the service";

/// <summary>
/// Description for the active SDK client instances metric
/// </summary>
public const string ActiveInstances = "Number of active SDK client instances.";
}
}

/// <summary>
/// Buckets
/// </summary>
public static class HistogramBuckets
{
/// <summary>
/// ExplicitBucketBoundaries for "db.client.cosmosdb.operation.request_charge" Metrics
/// </summary>
/// <remarks>
/// <b>1, 5, 10</b>: Low Usage Levels, These smaller buckets allow for precise tracking of operations that consume a minimal number of Request Units. This is important for lightweight operations, such as basic read requests or small queries, where resource utilization should be optimized. Monitoring these low usage levels can help ensure that the application is not inadvertently using more resources than necessary.<br></br>
/// <b>25, 50</b>: Moderate Usage Levels, These ranges capture more moderate operations, which are typical in many applications. For example, queries that return a reasonable amount of data or perform standard CRUD operations may fall within these limits. Identifying usage patterns in these buckets can help detect efficiency issues in routine operations.<br></br>
/// <b>100, 250</b>: Higher Usage Levels, These boundaries represent operations that may require significant resources, such as complex queries or larger transactions. Monitoring RUs in these ranges can help identify performance bottlenecks or costly queries that might lead to throttling.<br></br>
/// <b>500, 1000</b>: Very High Usage Levels, These buckets capture operations that consume a substantial number of Request Units, which may indicate potentially expensive queries or batch processes. Understanding the frequency and patterns of such high RU usage can be critical in optimizing performance and ensuring the application remains within provisioned throughput limits.
/// </remarks>
public static readonly double[] RequestUnitBuckets = new double[] { 1, 5, 10, 25, 50, 100, 250, 500, 1000 };

/// <summary>
/// ExplicitBucketBoundaries for "db.client.operation.duration" Metrics.
/// </summary>
/// <remarks>
/// <b>0.001, 0.005, 0.010</b> seconds: Higher Precision at Low-Millisecond Levels, These buckets (1 ms to 10 ms) suit high-performance workloads, especially microservices or low-latency queries.<br></br>
/// <b>0.050, 0.100, 0.200</b> seconds: Granularity for Standard Web Applications, These values allow detailed tracking for latencies between 50ms and 200ms, which are common in web applications. Fine-grained buckets in this range help in identifying performance issues before they grow critical, while covering the typical response times expected in Cosmos DB.<br></br>
/// <b>0.500, 1.000</b> seconds: Wider Range for Intermediate Latencies, Operations that take longer, in the range of 500ms to 1 second, are still important for performance monitoring. By capturing these values, you maintain awareness of potential bottlenecks or slower requests that may need optimization.<br></br>
/// <b>2.000, 5.000</b> seconds: Capturing Outliers and Slow Queries, It’s important to track outliers that might go beyond 1 second. Having buckets for 2 and 5 seconds enables identification of rare, long-running operations that may require further investigation.
/// </remarks>
public static readonly double[] RequestLatencyBuckets = new double[] { 0.001, 0.005, 0.010, 0.050, 0.100, 0.200, 0.500, 1.000, 2.000, 5.000 };

/// <summary>
/// ExplicitBucketBoundaries for "db.client.response.row_count" Metrics
/// </summary>
/// <remarks>
/// <b>1, 10, 50, 100</b>: Small Response Sizes, These buckets are useful for capturing scenarios where only a small number of items are returned. Such small queries are common in real-time or interactive applications where performance and quick responses are critical. They also help in tracking the efficiency of operations that should return minimal data, minimizing resource usage.<br></br>
/// <b>250, 500, 1000</b>: Moderate Response Sizes, These values represent typical workloads where moderate amounts of data are returned in each query. This is useful for applications that need to return more information, such as data analytics or reporting systems. Tracking these ranges helps identify whether the system is optimized for these relatively larger data sets and if they lead to any performance degradation.<br></br>
/// <b>2000, 5000</b>: Larger Response Sizes, These boundaries capture scenarios where the query returns large datasets, often used in batch processing or in-depth analytical queries. These larger page sizes can potentially increase memory or CPU usage and may lead to longer query execution times, making it important to track performance in these ranges.<br></br>
/// <b>10000</b>: Very Large Response Sizes (Outliers), This boundary is included to capture rare, very large response sizes. Such queries can put significant strain on system resources, including memory, CPU, and network bandwidth, and can often lead to performance issues such as high latency or even network drops.
/// </remarks>
public static readonly double[] RowCountBuckets = new double[] { 1, 10, 50, 100, 250, 500, 1000, 2000, 5000, 10000 };
}
}
}
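To make the bucket boundaries above concrete, here is a small sketch of how a measurement maps to a bucket. It assumes the OpenTelemetry explicit-bucket convention in which each boundary is an inclusive upper bound and one extra overflow bucket collects everything above the last boundary. The class `BucketSketch` and the method `BucketIndex` are hypothetical helpers written for illustration, not part of the SDK; the boundary values are copied from `RequestUnitBuckets`.

```csharp
using System;

public static class BucketSketch
{
    // Boundaries copied from RequestUnitBuckets in the diff above.
    public static readonly double[] RequestUnitBuckets = { 1, 5, 10, 25, 50, 100, 250, 500, 1000 };

    // Returns the index of the bucket that a value falls into.
    // Bucket i covers (boundaries[i - 1], boundaries[i]]; the final index
    // (boundaries.Length) is the overflow bucket for values above the last boundary.
    public static int BucketIndex(double[] boundaries, double value)
    {
        for (int i = 0; i < boundaries.Length; i++)
        {
            if (value <= boundaries[i])
            {
                return i;
            }
        }

        return boundaries.Length;
    }

    public static void Main()
    {
        // Sample request charges: a cheap point read, a boundary value,
        // a small query, and an expensive outlier.
        foreach (double requestCharge in new[] { 0.5, 1.0, 7.3, 1200.0 })
        {
            int index = BucketIndex(RequestUnitBuckets, requestCharge);
            Console.WriteLine($"{requestCharge} RU -> bucket {index}");
        }
    }
}
```

With nine boundaries there are ten buckets, so an operation costing more than 1000 RU lands in the overflow bucket and its exact charge cannot be read back from the histogram alone.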