[Internal] Upgrade Resiliency: Fixes Duplicate Channel and Task Creation. #4123

@kundadebdatta kundadebdatta commented Oct 13, 2023

Pull Request Template

Description

Scope: Applicable when the "Advanced Replica Selection" feature is enabled using the environment variable AZURE_COSMOS_REPLICA_VALIDATION_ENABLED = True

Problem: A recent conversation with the IC3 team helped us uncover a potential issue in the open connection async flow. The team migrated to the latest preview SDK version, 3.35.2-preview, and has seen a sudden and consistent spike in the number of TCP connections since the cosmos client initialization. The graphs of open TCP connections below show this in more detail.

[Images: four graphs of the number of open TCP connections; not preserved in this copy]

It appears that during the time of impact the RPS was around 15,000 requests/sec, and the number of TCP connections doubled compared to historical trends.

Analysis: While validating both the Unknown and Unhealthy replicas, the OpenChannelAsync() stage in the LoadBalancingPartition currently has no way to check whether a connection has already been made to the given endpoint of the BE replica. As a result, if N parallel requests land in this method, together they open N TCP connections to that endpoint, one per request. This is what causes the sudden spike in the number of open TCP connections.

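        internal Task OpenChannelAsync(Guid activityId)
        {
            IChannel channel = null;
            this.capacityLock.EnterWriteLock();
            try
            {
                // Only capacity is checked here. Nothing checks whether a healthy
                // channel to this endpoint already exists, so every caller that
                // arrives below capacity opens a new one.
                if (this.capacity < this.maxCapacity)
                {
                    channel = this.OpenChannelAndIncrementCapacity(
                        activityId: activityId);
                }
                else
                {
                    ....
                }
            }
            // (the rest of the method is elided in this excerpt)
        }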

Solution: The code changes in this PR do the following:

  • It uses the latest version of the Cosmos.Direct package, 3.31.5, which refactors the behavior in the LoadBalancingPartition and adds an extra check to validate whether the openChannels dictionary in the LoadBalancingPartition contains a healthy LbChannelState for a specific endpoint. If a healthy channel is already established to that endpoint, we skip opening a connection for it. In other words, if a healthy connection is present for a given endpoint, no new connection needs to be opened.

  • Additionally, this PR adds a RefreshAsync() method to the AsyncLazyWithRefreshTask<T> in the AsyncCacheNonBlocking<K, V> that maintains a reference to the background task. If the task is already running for the cache key, any further Refresh requests are a no-op. This dramatically reduces background task creation, and with it the number of duplicate replica validations for the same partition key range.

  • This PR also narrows the replica validation scope to only the Unhealthy replicas by default. The Unknown replicas are validated only when the CosmosClient is initialized through the CreateAndInitializeAsync() flow.
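The first bullet can be illustrated with a minimal sketch. It is not the Cosmos.Direct implementation: `ChannelStateSketch`, `PartitionSketch` and `DuplicateChannelDemo` are hypothetical stand-ins, a plain list replaces the openChannels dictionary, and no real connection is opened. It only shows the idea of checking for a healthy channel before creating a new one.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical stand-in for a per-channel health record (not the Cosmos.Direct type).
public sealed class ChannelStateSketch
{
    public bool IsHealthy { get; set; }
}

// Hypothetical stand-in for the partition that owns all channels to one endpoint.
public sealed class PartitionSketch
{
    private readonly object gate = new object();
    private readonly List<ChannelStateSketch> openChannels = new List<ChannelStateSketch>();

    // Number of channels created so far; each would correspond to one TCP connection.
    public int OpenChannelCount
    {
        get
        {
            lock (this.gate)
            {
                return this.openChannels.Count;
            }
        }
    }

    // Creates a channel only when no healthy channel exists yet.
    public Task OpenChannelAsync()
    {
        lock (this.gate)
        {
            foreach (ChannelStateSketch state in this.openChannels)
            {
                if (state.IsHealthy)
                {
                    // A healthy channel is already present: skip creation (no-op).
                    return Task.CompletedTask;
                }
            }

            // No healthy channel found: record a new one (assumed healthy here).
            this.openChannels.Add(new ChannelStateSketch { IsHealthy = true });
            return Task.CompletedTask;
        }
    }
}

public static class DuplicateChannelDemo
{
    // Fires requestCount parallel open requests and returns how many channels exist afterwards.
    public static int OpenFromParallelRequests(int requestCount)
    {
        PartitionSketch partition = new PartitionSketch();
        Task[] requests = new Task[requestCount];
        for (int i = 0; i < requestCount; i++)
        {
            requests[i] = Task.Run(() => partition.OpenChannelAsync());
        }

        Task.WaitAll(requests);
        return partition.OpenChannelCount;
    }
}
```

With the health check in place, `DuplicateChannelDemo.OpenFromParallelRequests(50)` ends with a single channel. Without the check, every request would add its own.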

Detailed Sequence Diagram: Below is the sequence diagram for validating the channel health before opening the connections.

sequenceDiagram
    participant X as GatewayAddressCache <br> [v3 Code]
    participant A as RntbdOpenConnectionHandler <br> [Direct Code]
    participant B as Rntbd.TransportClient <br> [Direct Code]
    participant C as ChannelDictionary <br> [Direct Code]
    participant D as LoadBalancingChannel <br> [Direct Code]
    participant E as LoadBalancingPartition <br> [Direct Code]
    participant F as Channel <br> [Direct Code]
    Note over X: The open connection process is <br> triggered as a background task.    
    X->>A: 1. TryOpenRntbdChannelsAsync <br> (transport addresses)
    Note over A: The incoming requests <br> acquire the semaphore <br> and open channels.
    A->>B: 2. OpenConnectionAsync()
    Note over B: Rntbd.TransportClient <br> implements abstract class <br> Microsoft.Azure.Documents.TransportClient
    B->>C: 3. GetChannel <br> (validationRequired: true)
    C-->>B: 4. Gets a <br> LoadBalancingChannel
    B->>D: 5. OpenChannelAsync() 
    D->>E: 6. OpenChannelAsync()
    E->>E: 7. Are there any healthy LbChannelState <br> present in the openChannels dictionary ?
    E-->>D: 8. Yes - Healthy LbChannelState present. <br> Skip new channel creation.
    E->>F: 9. No - no healthy LbChannelState present. <br> Create Channel using OpenChannelAsync().
    F->>E: 10. Returns Channel Init Task.   
    E-->>D: 11. Returns Channel Init Task.
    D-->>C: 12. Returns Channel Init <br> Task Or No-Op.
    C-->>B: 13. Returns Channel Init <br> Task Or No-Op.
    B-->>A: 14. Returns Channel Init <br> Task Or No-Op.
    A-->>A: 15. a) If successful, mark transport address to Connected <br> b) If exception is caught, mark transport address to Unhealthy.

Benchmark Results: Below are the benchmark results before and after the fix.

VM Configuration: SKU: Standard D16s v3, vCPU: 16, RAM: 64 GB.
VM Location: West US 2.
Cosmos Test Account Location: West US 2.
Cosmos Test Account RU Count: 500K.
Upgrade Duration: 5 hours.
Upgrade Domains: 20
Test Run Time: 6 Hours.
RPS: Approx. 5000.

TCP Connection Count Before the Fix:

[Image not preserved]

TCP Connection Count After the Fix:

[Image not preserved]

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)

Closing issues

To automatically close an issue: closes #4033

@kundadebdatta kundadebdatta self-assigned this Oct 14, 2023
Member

@FabianMeiswinkel FabianMeiswinkel left a comment

LGTM - Thanks. Very good PR description - thanks for setting an example here!

Member Author

kundadebdatta commented Nov 4, 2023 via email

@kundadebdatta kundadebdatta added Upgrade Resiliency auto-merge Enables automation to merge PRs labels Nov 6, 2023
@microsoft-github-policy-service microsoft-github-policy-service bot merged commit f7a4c56 into master Nov 6, 2023
20 checks passed
@microsoft-github-policy-service microsoft-github-policy-service bot deleted the users/kundadebdatta/4034_fix_duplicate_channel_creation branch November 6, 2023 06:27
}
finally
{
if (slimAcquired)
Member

Shouldn't this semaphore be released only when the refresh has completed?

Member Author

Discussed this offline. The check below should cover this:

if (addresses
    .Get(Protocol.Tcp)
    .ReplicaTransportAddressUris
    .Any(x => x.ShouldRefreshHealthStatus()))

The semaphore is released as soon as the task is successfully scheduled, and the above check is good enough to stop other parallel threads from creating duplicate tasks.

System.Diagnostics.Trace.CorrelationManager.ActivityId);
}
});
}
finally
Member

nit: the try-finally can be moved inside the if (code refactoring); the finally clause then doesn't need the explicit check.

Member Author

Agreed. I will refactor this with the next PR to master. Good catch!
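A minimal sketch of the shape suggested in the nit, with the try-finally nested inside the if so that the finally clause needs no flag check. This is an illustration only, not the code in this PR: `GuardedRefreshSketch`, `TryRefreshAsync` and the `refreshWork` delegate are hypothetical names, and the work is awaited here for simplicity rather than scheduled in the background.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class GuardedRefreshSketch
{
    // Runs refreshWork only if the semaphore can be taken without waiting.
    // Returns true when this caller ran the refresh, false when it was skipped.
    public static async Task<bool> TryRefreshAsync(SemaphoreSlim semaphore, Func<Task> refreshWork)
    {
        bool acquired = await semaphore.WaitAsync(0);
        if (acquired)
        {
            try
            {
                await refreshWork();
            }
            finally
            {
                // Reached only when the semaphore was acquired, so no explicit check is needed.
                semaphore.Release();
            }
        }

        return acquired;
    }
}
```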

@@ -302,11 +314,12 @@ public void SetOpenConnectionsHandler(IOpenConnectionsHandler openConnectionsHan
.ReplicaTransportAddressUris
.Any(x => x.ShouldRefreshHealthStatus()))
{
Task refreshAddressesInBackgroundTask = Task.Run(async () =>
bool slimAcquired = await this.semaphore.WaitAsync(0);
Member

Clarification: What would happen if this check did not exist?

Member Author

If the condition below didn't exist, there would be no gate to detect whether another thread had already acquired the semaphore and scheduled a background refresh task. So, without the check, the try block would always be executed.

        if (slimAcquired)
        {
            this.serverPartitionAddressCache.Refresh(
                key: partitionKeyRangeIdentity,
                singleValueInitFunc: (currentCachedValue) => this.GetAddressesForRangeIdAsync(
                    request,
                    cachedAddresses: currentCachedValue,
                    partitionKeyRangeIdentity.CollectionRid,
                    partitionKeyRangeIdentity.PartitionKeyRangeId,
                    forceRefresh: true));
        }

Also, the code snippet bool slimAcquired = await this.semaphore.WaitAsync(0); checks whether the semaphore can be acquired at that point in time. If it is already held, the call immediately returns false and the thread continues its execution. This guarantees that no duplicate tasks are created.
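The zero-timeout behavior described above can be seen in isolation with a small sketch. `TryAcquireSketch` and `ProbeAsync` are hypothetical names; only `SemaphoreSlim.WaitAsync(int)` and `Release()` come from .NET.

```csharp
using System.Threading;
using System.Threading.Tasks;

public static class TryAcquireSketch
{
    // Probes a single-slot semaphore three times with a zero timeout.
    public static async Task<bool[]> ProbeAsync()
    {
        SemaphoreSlim semaphore = new SemaphoreSlim(1, 1);

        // The slot is free, so the first attempt acquires it.
        bool first = await semaphore.WaitAsync(0);

        // The slot is now held, so a zero-timeout attempt returns false without waiting.
        bool second = await semaphore.WaitAsync(0);

        // After a release, the slot can be acquired again.
        semaphore.Release();
        bool third = await semaphore.WaitAsync(0);

        semaphore.Release();
        return new[] { first, second, third };
    }
}
```

The probe yields true, false, true: a caller that finds the semaphore held does not block, it simply skips the guarded work.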

microsoft-github-policy-service bot added a commit that referenced this pull request May 21, 2024
* HttpClient: Adds detection of DNS changes through use of SocketsHttpHandler for .NET 6 and above (#3762)

* initial commit

* removed unneeded usings

* added validation callback, still needs tests

* nits + fixes

* added additional test

* test change

* removed unneeded Dispose calls

* removed unnneed dispose calls

* requested changes

* added pooledConnectionLifetime as client option

* nit

Co-authored-by: Kevin Pilch <[email protected]>

* Update Microsoft.Azure.Cosmos/src/HttpClient/CosmosHttpClientCore.cs

Co-authored-by: Matias Quaranta <[email protected]>

* Update Microsoft.Azure.Cosmos/src/CosmosClientOptions.cs

Co-authored-by: Matias Quaranta <[email protected]>

* suggested changes

* remove test, reorder usings

* updated contracts

* removed all non XXXAPI.json changes from UpdateContracts run

* removed public surface, added random timespan

* removed change from unrelated file

* Update Microsoft.Azure.Cosmos/src/HttpClient/CosmosHttpClientCore.cs

Co-authored-by: Matias Quaranta <[email protected]>

* added thread safe random method

* nit

* fixed merge mistake

* added reflection failsafe/error tracing

* nits

* added back removed if

* fixed formatting

* changed random method, fixed serverCertificateCustomValidation

---------

Co-authored-by: Kevin Pilch <[email protected]>
Co-authored-by: Matias Quaranta <[email protected]>

* [Internal] Tests: Fixes Open Telemetry attributes for ReadMany test (#3805)

* Fixing test

* New baseline

* Undo some changes

* [Internal] Client Telemetry: Refactors code to run client telemetry data processing task in background. (#3783)

* first draft

* remove failure count test

* refactporing

* code refactor

* create task with timeout

* fix test

* code refactoring

* fix timeout code

* space fix

* not failing if processor is taking time

* fix procrsser test

* code refactor

* refactor and test fix

* Patch: Adds Move Operation (#3389)

* Basic changes to introduce move operator

* Added "from" object in patch spec operation.
Added testcase block.

* Fixed testcase.

* Changes made to address comments'

* Added comment regarding enum mutations

* Formatted comment

Co-authored-by: Matias Quaranta <[email protected]>

* Moved summary location.

* Ran UpdateContracts.ps1

---------

Co-authored-by: Amaan Haque <[email protected]>
Co-authored-by: Amaan Haque <[email protected]>
Co-authored-by: Matias Quaranta <[email protected]>

* [Internal] Pipelines: Adds nightly build to produce packages (#3802)

* Support cleaning

* wire previous content delete

* as text

* with variable

* another test

* param with types

* as string

* no delete

* no quotes

* undoing

* re-adding quotes

* testing empty

* trying another test

* readding version

* fixing publishing artifacts

* fixing parameter

* Fixing official pipeline

* version 5

* fixing main pipeline

* test with true

* using start time

* nightly preview

* passing parameters to pack

* Fixing nuget version

* arguments on the nuget pack

* folder structure

* testing v5

* Using only content

* Removing currentDate

* [Internal] OpenTelemetry: Direct Package update and replacing diagnostic files (#3797)

* Direct Package update and replacing dagnostic files

* Resolve merge conflicts

* Running updateCOnstracts script

* Removed LinqTranslationWithCustomSerializerBaseline file

* Adding isDistributedTracingEnabled flag

* Running update contracts

* Running update contracts

* Running update contracts

* fix test

* Code cleanup for test fix

* Code cleanup for test fix

* Making regex expression readable

* Adding comment for regex expression

---------

Co-authored-by: Sourabh Jain <[email protected]>

* [Internal] MerlinBot: Adds auto-merge and cleanup automation (#3813)

* Add config changes

* Polishing automerge config

* Update fabricbot.json (#3824)

* [Internal] Upgrade Resiliency: Adds Logic to Validate `Unknown` Replicas along with `Unhealthy`. (#3820)

* Code changes to add aggressive validation logic.

* Code changes to enable aggressive validation for all regions.

* Code changes to pull in msdata cosmos.direct changes related to aggresive validation logic.

* Code changes to make minor cosmetic changes.

* Code changes to address review comments.

* Serialization: Fixes call to CosmosSerializer.FromStream on Gateway mode when EnableContentResponseOnWrite is false (#3814)

* Do not call serializer if ResponseMessage.Content is empty.

* Add unit test

* Update unit tests

* Remove unused usings

---------

Co-authored-by: Matias Quaranta <[email protected]>

* [Internal] Documentation: Adds documentation covering build pipelines (#3822)

* Add doc

* Move benchmark

* Fixing texts

* Client Encryption: Adds release version of Microsoft.Azure.Cosmos to Microsoft.Azure.Cosmos.Encryption.Custom (#3799)

* cosmos version change

* changing preview to release

* resolving code review comments

---------

Co-authored-by: Santosh Kulkarni <[email protected]>

* SDK 3.33.0 : Adds version bump and changelog (#3823)

* release 3.30.0

* added changelog

* updated changelog

* updated changelog

* suggested change to changelog

* updated changelog

---------

Co-authored-by: Matias Quaranta <[email protected]>

* [Internal] Documentation: Adds msdata/direct Sync-up Guide. (#3828)

* Code changes to add msdata/direct sync-up documentation.

* Code changes to address review comments.

* Code changes to address further review comments.

* Code changes to address minor review comments.

* Removed internal links.

* Query: Adds TRIM string system function support in LINQ (#3833)

* add trim support

* Added some test coverage

* address reviews

---------

Co-authored-by: Minh Le <[email protected]>

* Query: Fixes Parsing Error in SQL DOM when CultureInfo is available (#3832)

* add fix

* Add cultural info to test to verify correct behavior

* address pr review to restore to restore culture

* fix comment

---------

Co-authored-by: Minh Le <[email protected]>

* Client Encryption: Adds api FetchDataEncryptionKeyWithoutRawKeyAsync and FetchDataEncryptionKeyAsync to get DEK without and with raw key respectively.  (#3809)

* added raw key to MdeEncryption

* adding ray key to Mde Algo

* test case changes

* resolving code review comments

* code optimization to reduce keyvault calls

* removed Microsoft.Data.Encryption.Cryptography nuget package

* added api for dek with raw key

* resolved code review comments

* adding change log

* code review changes

* Initial commit (#3826)

* Query: Adds Computed Property SDK Support (#3761)

* Initial commit

* Restored settings.json changes.

* Update

* Addressed comments; still need to be tested using Emulator.

* Fixes after test run.

* Ignored the computed property tests based on the sync this morning (to allow for preview release).

* Suite0 fixes.

* Test update.

* Suite0 fixes

* [Internal] Samples: Adds OpenTelemetry and Application Insights samples (#3818)

* add opentelemetry and application insights samples

* address pr comments

* [Internal] Query: Added custom serializer coverage tests to ExpressionToSQL.cs (#3722)

* Ensure enum as string is preserved for custom serializer

* Failing test

* Added failing tests

* Updated requested names

* Ignore result of test for now

* Added additional comment on why the test is ignored

* Replaced with sample code

* Remove ignore attribute from tests, documented misbehavior for future use

* Updated comment

---------

Co-authored-by: leminh98 <[email protected]>

* Query: Added remaining Cosmos Type checking functions to CosmosLinqExtensions (#3724)

* Added the remaining Cosmos Type checking functions to the CosmosLinqExtensions

* Added comments requested

* Updated comment

* Updated baseline

* Improve readability of dictionary initialization

* Aligned with code style guide

* Revert change to baseline

* Executed update baseline script

---------

Co-authored-by: neildsh <[email protected]>
Co-authored-by: leminh98 <[email protected]>

* update sdk version and section tags (#3841)

* PackageLicense: Removes PackageLicenseUrl and Adds PackageLicenseFile since PackageLicenseUrl is deprecated (#3847)

* proposal to add PackageLicenseFile since PackageLicenseUrl is deprecated. https://github.com/NuGet/Home/issues/4628

* adding attribute Visible=false

* making ChangeFeedMode.LatestVersion accessible to the public (#3854)

* AI Integration: Fixes Operation Name in the activity and end to end Tests. (#3845)

* first draft

* second draft

* 3rd draft

* remove untouched file

* test fix

* fix order

* change order

* refactor

* skip network activities in test

* remove network attributes

* SDK 3.34.0 : Adds version bump and changelog (#3855)

* SDK 3.34.0: Adds version bump and changelog

* adding changelog changes

* added a missing PREVIEW PR

* Update changelog.md

Co-authored-by: Justine Cocchi <[email protected]>

* Update changelog.md

Co-authored-by: Kiran Kumar Kolli <[email protected]>

* removed 3840 as it was not committed

* change text for 3832

* fix merge issue

* add 3724

* Update changelog.md

Co-authored-by: Matias Quaranta <[email protected]>

* Update changelog.md

Co-authored-by: Matias Quaranta <[email protected]>

* including 3845

---------

Co-authored-by: Justine Cocchi <[email protected]>
Co-authored-by: Kiran Kumar Kolli <[email protected]>
Co-authored-by: Matias Quaranta <[email protected]>

* Release: Fixes changelog.md change for 3845 to preview (#3859)

* removing ThirdPartyNotice.txt from content and contentfiles folders (#3864)

* Documentation: Adds see also link to Container.CreateTransactionalBatch (#3860)

* Linking limit documentation to Container.CreateTransactionalBatch(PartitionKey) method

* Resolved PR comments

* Links update

* Using learn.microsoft instead of docs.microsoft in the links

---------

Co-authored-by: Matias Quaranta <[email protected]>

* Query: Adds type-markers with count and length for large arrays (#3852)

* initial commit

* cleanup

* update test output

* cleanup

* typo

* Pr comments

* [Internal] AI Integration or Open Telemetry: Design Document (#3858)

* first draft

* redesign

* ädd link

* updated observability url

* Benchmarking: Adds use of ARM Templates for benchmarking (#3838)

* initial commit DONT REVIEW

* fixes and documentation

* Apply suggestions from code review

Co-authored-by: Matias Quaranta <[email protected]>

* requested changes

* Apply suggestions from code review

Co-authored-by: Matias Quaranta <[email protected]>

* name changes

* readme changes

* nits + changing case of parameters file

---------

Co-authored-by: Matias Quaranta <[email protected]>

* Update README.md (#3875)

URL typo.

* moved to new file (#3876)

* Direct Package Upgrade 3.31.0: Refactors code to make compatible with latest direct (#3877)

* upgrade to 3.31.0

* add more regions

* enable dt for operations

* updated contract file

* [Preview] Integrated cache: Adds BypassIntegratedCache to DedicatedGatewayRequestOptions (#3836)

* Integrated cache: Add BypassIntegratedCache to DedicatedGatewayRequestOptions

Currently, integrated cache is used by default for Dedicated Gateway.
Customers cannot skip cache for particular requests or data unless they
shift to multi-tenant Gateway,which will lose the benefits of Dedicated
Gateway.

For customers to have more control over integrated cache, we're
introducing a new "RequestOption" called "BypassIntegratedCache". This
option will allow the customer to decide whether to use integrated cache
for each request or not. If this value is set to true, the item/query
will be served from backend and won't be cached in Dedicated Gateway.

* Move this feature to public preview

* Address comments

1. Add more tests
2. Add more detail and example code for BypassIntegratedCache

* Revert changes in EncryptionCustomAPI

---------

Co-authored-by: Jiajun Peng <[email protected]>
Co-authored-by: Matias Quaranta <[email protected]>

* Client Encryption: Adds Microsoft.Azure.Cosmos compatibility to version 3.34.0 (#3874)

* chaging Microsoft.Azure.Cosmos support version

* resolved merge conflicts

* CosmosClient: Fixes missing Trace when converting HTTP Timeout to 503 (#3866)

* Added tracing when converting HTTP Timeout to 503

* Fixed tracing when converting HTTP Timeout to 503

* Resolved PR comments

* Using ITrace as part of ClientSideRequestStatisticsTraceDatum

* Refactoring

* Test update

* Unit tests fix

* AI Integration: Fixes Open Telemetry Example (#3868)

* first draft

* add filter

* revert csproj

* fix sample

* changed log message

* remove unused library

* [Internal] Query: Adds OptimisticDirectExecute and RequiresDistribution headers (#3882)

* Adding ODE and RequiresDistribution Headers

* Fixed comments

* Updated parameter in SwitchToFallbackPipelineAsync

* Renamed TryUnwrapContinuationToken to UnwrapContinuationToken

---------

Co-authored-by: neildsh <[email protected]>

* Query: Refactors the EnableOptimisticDirectExecution flag to be public (#3883)

* Made EnableOptimisticDirectExecution a public flag

* Updated contract

* [Internal] OpenTelemetry : Adds Telemetry Distributed Tracing functionality (#3801)

* Direct Package update and replacing dagnostic files

* Resolve merge conflicts

* Running updateCOnstracts script

* Add code changes for distributed tracing open telemetry changes

* Add distributed tracing tests

* Updated tests for distributed tracing

* Addin traceID for diagnostics

* Running update contract script

* Removed LinqTranslationWithCustomSerializerBaseline file

* Adding isDistributedTracingEnabled flag

* Running update contracts

* Running update contracts

* Updates based on differnt code review comments

* Running update contracts

* Running update contracts

* Running update contracts

* fix test

* Code cleanup for test fix

* Running Update contracts

* resolving merge conflicts

* resolving merge conflicts

* Set EnableDistributedTracing to true for performance tests

* Benchmark project change for distributed tracing

* Updating tests

* Updated unit tests

* Updated unit tests

* Updated tests and constructors based on review comments

* Updated scope name in recorder

* Updated distributedOtel tests to cover more scenarios

* Updated distributedOtel tests

* Reverting benchmark performance test changes

* Update DistributedOpentelemetry tests

* Update test cleanup

* Update distributed tests with custom builder

* Update distributed open telemetry tests

* Update contracts

* Cleanup files

* Update distributed Otel tests

* Update distributed Otel tests

* code refactoring

* fix custom listener

* Update direct package to 3.31.1

* Code clean up

* Update tests with display name

---------

Co-authored-by: Sourabh Jain <[email protected]>

* Documentation: Adds additional remarks to CosmosClient (#3891)

* CosmosClient documentation improvements

* Cref fix

* Link fix

* Documentation fix

* Typo fix

---------

Co-authored-by: Matias Quaranta <[email protected]>

* Open Telemetry End To End Test: Adds baseline for network level requests trace (#3887)

* enable request level in end to end

* made some changes

* fix tests

* fix display name

* hardcoded containername and databasenam

* fix tests

* temp

* fix tests

* update contracts

* fix tests

* fixed display name

* [Internal] Design Docs: Adds Design Document for Client Telemetry (#3590)

* sdk design for client telemetry

* Otel design

* update optel design

* added more nformation

* updated ct design

* remove otel design

* Client Telemetry Design

* update typos

* fix typos

* fix typos

* added limitation

* updated docs

* updated doc

* updated text

* Update docs/observability.md

Co-authored-by: Kiran Kumar Kolli <[email protected]>

* Update docs/observability.md

Co-authored-by: Kiran Kumar Kolli <[email protected]>

* Update docs/observability.md

Co-authored-by: Kiran Kumar Kolli <[email protected]>

* Update docs/observability.md

Co-authored-by: Kiran Kumar Kolli <[email protected]>

* Update docs/observability.md

Co-authored-by: Kiran Kumar Kolli <[email protected]>

* move stuff here and there.

---------

Co-authored-by: Kiran Kumar Kolli <[email protected]>

* [Internal] Design Docs: Adds Design Document for Client Telemetry Part 2 (#3903)

* updated doc

* Update docs/observability.md

Co-authored-by: Justine Cocchi <[email protected]>

* updated text

---------

Co-authored-by: Justine Cocchi <[email protected]>

* ConnectionPolicy: Refactors Code to Reduce Default Request Timeout to 6 Seconds. (#3902)

* Code changes to reduce default request timeout to 6 seconds.

* Code changes to update API doc default request timeout to 6 seconds.

* [Internal] Upgrade Resiliency: Adds Replica Health State Diagnostics. (#3835)

* Code changes to add replica health status in diagnostics.

* Code changes to fix performance test build failure.

* Code changes to add health state capture logic in address cache.

* Code changes to fix benchmark test execution.

* Code changes to add tests to validate health state cache.

* Code changes to reduce default request timeout to 5 seconds.

* Revert "Code changes to reduce default request timeout to 5 seconds."

This reverts commit 139f37e588fc9dfed608431f4186c567a080e622.

* Subpartitioning: Fixes handling of split physical partitions (#3879)

* Initial Commit DO NOT REVIEW

* bug fix, needs Direct Package Changes

* fix for change feed and query plus tests

* clean up

* query + clean up

---------

Co-authored-by: Kiran Kumar Kolli <[email protected]>

* [Query] Fixes empty property name parsing exception (#3907)

* inital commit

* cleanup

* test cleanup

* PR comments

* PR comment

* [Preview] Query: Refactors EnableOptimisticDirectExecution to true by default in Preview mode (#3909)

* Setting EnableODE to true by default in Preview Mode.

* Added seperate if block for default value

* Updated property

* Removed unused Usings

* Updated contracts

* Updated test

* Updated directive indentation

* Documentations: Adds links to PatchItems docs (#3910)

* Added links to PatchItems docs

* Undo changes from internal file

* [Internal] Direct Package Upgrade: Refactors Code to Bump Up `Cosmos.Direct` Package to `3.31.2` (#3918)

* Code changes to bump up the direct version.

* Code changes to mark the Israel Central region as public.

* Code changes to update contracts.

* Code changes to fix test failure. Some clean ups.

* Code changes to add detailed message for open channels count.

* SDK 3.35.0 : Adds version bump and changelog (#3920)

* release PR

* updated changelog.md

* more changelog updates

* [Internal] Last minute fix to changelog for 3.35.0 (#3921)

* release PR

* updated changelog.md

* more changelog updates

* changelog fix

* Update changelog.md

* Update changelog.md

* [Internal] Query: Adds new header SupportedSerializationFormats (#3911)

* Binary Serilaization Response test

* Added new header SupportedSerializationFormats

* Modified existing use of CosmosSerializationFormatOptions

* Modified tests and removed unused code

* Addressed comments

* Added more negative cases

* Revert changes

* Added spaces

* Addressed comments

* Addressed comments

* Removed SupportedSerializationFormats from Headers file

* Removed unused JsonSerilazationFormats option

* Addressed comments

* Addressed comments

* Addressed comments

* Addressed comments

* Added new enum TransportSerializationFormat

* Added new enum TransportSerializationFormat

* Addressed comments

* Removed unused parameter

* Addressed comments

* updating API

* remove tests

* Text fixes

* fix typo

* remove TransportSerializationFormat header

* text reverts

* revert

* test update

* PR comments

* remove test owner headers HeadersValidationTests.cs

* PR comments - remove unsupported tests and scope client

---------

Co-authored-by: Heet <[email protected]>
Co-authored-by: neildsh <[email protected]>
Co-authored-by: Matias Quaranta <[email protected]>

* Code changes to optimize the rntbd open connection logic to open connections in parallel. (#3939)

* Query : Adds support for newtonsoft member access via ExtensionData (#3834)

* Support newtonsoft member access via ExtensionData

* Return null instead of empty string

* Added tests for select & where

* Updated baseline with note

---------

Co-authored-by: leminh98 <[email protected]>

* HttpTransport: Fixes HttpTimeoutPolicies to not accidentally suppress retries (#3944)

* Fix HttpTimeoutPolicies to not accidentally suppress retries

* Removing HttpTimeoutPolicy.MaxRetryTimeLimit altogether

* SDK 3.35.1 : Adds version bump and changelog (#3945)

* version bump

* changelog

* contract

* [Internal] Changelog: Fixes recommended version and title (#3948)

* SDK 3.35.1: Adds version bump and changelog

* Update changelog.md

* Update changelog.md

* Update changelog.md

* Update changelog.md

* Update changelog.md

* [Internal] Dependencies: Fixes dependabot alert for System.Linq.Dynamic.Core (#3957)

* Removing 1

* Removing 2

* Removing 3

* [Internal] Upgrade Resiliency: Adds Code to Enable Replica Validation Feature for Preview (#3951)

* Code changes to add replica validation feature in cosmos client options.

* Code changes to upgrade the cosmos direct version to 3.31.3.

* Adding emulator test to cover replica validation.

* Code changes to address cosmetic clean ups.

* Code changes to address review comments. Fixed preview build failures.

* Code changes to enable replica validation for preview package by default.

* Code changes to address review comments.

* Code changes to fix preview unit tests.

* Code changes to disable environment variable at the end of the test.

* Client Encryption: Adds package reference Microsoft.Azure.Cosmos version 3.35.1-preview (#3956)

* changing cosmos preview version

* updating build file

---------

Co-authored-by: Matias Quaranta <[email protected]>

* [Internal] FabricBot: Adds GitOps.ResourceManagement because of FabricBot decommissioning (#3966)

* Add prIssueManagement.yml to onboard repo to GitOps.ResourceManagement as FabricBot replacement

Owners of the FabricBot configuration should have received email notification. The same information contained in the email is published internally at: https://aka.ms/gim/fabricbot. Details on the replacement service and the syntax of the new yaml configuration file are available publicly at: https://microsoft.github.io/GitOps/policies/resource-management.html

Please review and merge this PR to complete the process of onboarding to the new service.

* Deleting fabricbot.json

---------

Co-authored-by: microsoft-github-policy-service[bot] <77245923+microsoft-github-policy-service[bot]@users.noreply.github.com>

* [Internal] Query: Refactors certain tests to not fail when EnableOptimisticDirectExecution is set to true in 3.35.0-preview package (#3955)

* Updated emulator and baseline tests to not fail when ODE is set to default true in PREVIEW mode

* Fixed QueryAsync() test

* Fixed QueryAsync() in EndToEndTraceBaselineTests

* Undid changes to IndexMetrics baseline file

* Updated EndToEndTraceWriterBaselineTests.QueryAsync xml

* Updated xml

* Updated xml to have request options tag

* Diagnostics: Fixes verbose levels for "Operation will NOT be retried" (#3969)

* Query: Fixes malformed continuation token exception type and message (#3917)

* Fixed malformed continuation token issue where the exception was not a CosmosException and did not have the correct Status and Sub Status codes.

* Fixed incorrect indentation

* Added type check for incoming exception

* Replaced if/else with extra catch block

* Moved fix to a higher point in the call stack

* Removed unused Usings

* Updated test code

---------

Co-authored-by: Matias Quaranta <[email protected]>

* [Internal] Upgrade Resiliency: Refactors Code to Enable Replica Validation Feature Through `CosmosClientOptions` And Environment Variable (#3974)

* Code changes to use client options to enable or disable replica validation.

* Code changes to fix preview build failures.

* Query : Adds string comparison alternative when converting LINQ to SQL (#3668)

* string.Compare supported with LINQ to SQL

* Update tests

* Update test name

* Update tests

* Add test

* Create helper ReverseExpressionTypeForStrings

* PR feedback

* Update tests

* Update base line

---------

Co-authored-by: Aditya <[email protected]>
Co-authored-by: Matias Quaranta <[email protected]>

* AI Integration: Fixes event generation for failed requests (#3973)

* first draft

* fix code

* included feedback

* flip condition

* updated docs

* Update docs/observability.md

Co-authored-by: Matias Quaranta <[email protected]>

* Update observability.md

* updated contract

---------

Co-authored-by: Matias Quaranta <[email protected]>

* [Internal] Category: Refactors Cosmos benchmark operations (#3961)

* Refactoring: base classes for operations.

* Updating comments.

* Adding new line at the end of the file.

* Fixing code review points.

* Restore PrepareAsync to be virtual.

* 3.35.2: Adds new SDK versions and contract files (#3985)

* Updated change log and bumped up the version.

* Changing the version to 3.35.2

* Code changes to address review comments.

* Code changes to make minor fixes.

* Code changes to move some fixes into preview.

* [INTERNAL] LocalQuorum: Adds documentation for LocalQuorum (#3993)

* Draft of local-quorum documentation

* Adding experimental to header

* Adding cross-region read guarantees

* Reads Bounded clarification

* Adding account consistency step also

* Non-Prod usage note at top

* Addressing review comments

* Some more refinement

* Code changes to update release note. (#3996)

* Client Encryption: Adds fix for supporting Prefix Partition Key (Hierarchical partitioning) (#3979)

* Hierarchical pk bug fix

* Hierarchical pk bug fix

* Hierarchical pk bug fix

* Hierarchical pk bug fix

* Hierarchical pk bug fix

* testing new version

* adding more tests

* adding more tests

* adding more tests

* code review changes

* test fix

* test fix

* test fix

* test fix

---------

Co-authored-by: Nalu Tripician <[email protected]>

* Query: Refactors changelog.md with Optimistic Direct Execution recommendation (#4004)

* Update changelog.md

This is a recommendation for customers if they would like to use the ODE features.

* Updated release notes for ODE

* [Internal] Query: Adds performance testing for OptimisticDirectExecution pipeline (#3839)

* Infrastructure for performance testing with ODE pipeline.

* Resolve comments

* Removed randomization from data creation process

* Fixed comments

* Removed Query and EnableODE from QueryStatisticsMetrics, as they do not relate to query statistics.

* Removed try catch to make CreateItemAsync call always succeed

* Removed one liner functions

* Removed code from MetricsSerializer and QueryStatisticsDatumVisitor files

* Fixed comments

* Removed request Charge check

* Bug in Debug Assert

* Test

* Bug in debug assert fix

* Fixed second bug in Metrics Accumulator class

* Added ignore flag to ode perf tests so that they do not run on every loop build

* Added comment explaining the Ignore flag.

* Query: Adds ODE continuation token support for non-ODE pipelines (#4009)

* Added code to throw exception if ODE continuation token goes into non ODE pipeline

* Removed count variable

* Updated test name

* Removed ODE continuation token logic from caller class

* Simplified code

* Fixed comments

* Updated continuation token cast

* Removed const string for continuation token

* Added Ignore flag for test

* Added baseline test

* Updated baseline test

* Code changes to disable replica validation in preview package. (#4019)

* 3.35.3: Adds new SDK versions and contract files (#4024)

* Updated change log and bumped up the patch version.

* Code changes to update the change log message.

* [Internal] Distributed tracing: Adds a sample to collect activities and events using custom listener (#4021)

* custom listener example

* removed unwanted code

* add comments

* fix appsettings

* revert changes

* Code changes to fix race condition by calling dispose too early. (#4030)

* Code changes to update change log for release 3.35.3 (#4032)

* Documentation: Fixes article links (replaced links V2 to V3 SDK version) + Azure Cosmos DB typo (#4031)

* Documentation link fix

* Fixed Typo "Azure CosmosDB"→"Azure Cosmos DB"

* [Internal] Benchmark tool: Adds Cosmos Benchmark Metrics (#3950)

* Adding metrics for Benchmark tool.

* Adding OpenTelemetry.

* Revert "Adding OpenTelemetry."

This reverts commit c7da0884697064103145099e284892365f4ebb68.

* Telemetry for windowed percentiles.

* OpenTelemetry, AppInsights and Dashboard.

* Removing DiagnosticDataListener.

* Code styling, comments and clean-up.

* Fixing issues with dashboard.

* Fixing positions of charts on the dashboard.

* Fixing the dashboard.

* Updating titles and subtitles.

* Removing ILogger and other not required references.

* Fixing code review points.

* Fixing issues after rebase.

* Removing unnecessary changes.

* Fixing code review points.

* Fixing code review points.

* make MetricsCollectorProvider non static and remove locks

* fix

* fixes

* use static class name TelemetrySpan.IncludePercentile

* use app insights connection string

* modified:   Microsoft.Azure.Cosmos.Samples/Tools/Benchmark/Program.cs

* modified:   Microsoft.Azure.Cosmos.Samples/Tools/Benchmark/Program.cs

* rename AppInsightsConnectionString

* fix

* fix comments

* fix if AppInsights c string is not set

* summary

* fix

* remove unnecessary collector types

* remove unnecessary meter provider

* add event source

* remove folder

* fix

* split success and failed latencies

* fix

* fix

---------

Co-authored-by: David Chaava <[email protected]>
Co-authored-by: David Chaava <[email protected]>

* GatewayAddressCache: Fixes Unobserved Exception During Background Address Refresh (#4039)

* Code changes to fix unobserved exception during background address refresh.

* Code changes to add exception handler in task.

* Code changes to fix null ref exception.

* Revert "Code changes to fix null ref exception."

This reverts commit 83f90d578bd301339f6fa13981a0fe2fc3d65fa6.

* Revert "Code changes to add exception handler in task."

This reverts commit c49ed8162758217a09df28417a6f76649eab6a26.

* Code changes to address review comments.

* Revert "Code changes to address review comments."

This reverts commit d2b9f6b501f64f1a50b8a49de3ea76fbb9b5c853.

* Documentation: Adds additional note for GetContactedRegions method (#4042)

* Added small remark for GetContactedRegions method documentation

* Moved to remarks

* [Internal] Client Telemetry: Adds Client Telemetry pipeline sending data to service (#3900)

* first draft

* comment other pipelines

* print variables

* comment other pipelines

* added env variable

* minor changes

* update env variable

* print env variable

* add space in end

* fix test

* fix tests

* fix test

* fix tests

* remove response interceptor

* logs

* debug mode

* 3failing test to print logs

* minor refactoring

* 2nd windows-2019

* fix ct tests

* 2remove debugging

* fix tests

* revert

* uncomment pipelines

* fix test

* minor changes

* release and emulator pipeline

* update pipelines

* ignore abstract class test

* fixing pipeline

* refactor code

* change it to class name to run tests

* added emulator setup

* 1 temp commit

* env variable

* renames env variable

* fix tests

* add condition

* fix tests

* reorder env variable

* revert pipeline

* did some clean up

* change to revert

* Revert "change to revert"

This reverts commit 03db3c104505dc7b8f3cea267835c92ca530f8f4.

* fix typos

* throw if exception interceptor is null

* remove modelling changes

* removed virtual

* Update Microsoft.Azure.Cosmos/tests/Microsoft.Azure.Cosmos.EmulatorTests/Utils/HttpHandlerHelper.cs

Co-authored-by: Matias Quaranta <[email protected]>

* added condition for pipelines

* Revert "added condition for pipelines"

This reverts commit f9a208cd28e01badee97a2eb770a486cea67c1f0.

* changed cond

* fix cond

* more enhancement

* testing for release pipeline

* refactor code and use test category

* added comments on test

* refactor pipeline code

* fix variables

* fix pipeline

---------

Co-authored-by: Matias Quaranta <[email protected]>

* [Internal] Client Telemetry: Refactors code for collectors (#4037)

* refactored code

* implemented review comments

* test fix

* fix tests

* fix test

* fix test

* logger fix

* update contract

* fix test

* updated benchmarks

* [Internal] Automation: Adds logic to tag customer-reported issues (#4047)

* Added customer-reported label

* Changing condition

* padding

* more padding

* permission name

* padding

* [Internal] Benchmark tool: Adds requests diagnostic data capture and upload to storage (#3926)

* azure-cosmos-dotnet-v3/issues/3889
add diagnostics data capturing during benchmark and storing into blob storage after finish

* modified:   Microsoft.Azure.Cosmos.Samples/Tools/Benchmark/README.md

* modified:   Microsoft.Azure.Cosmos.Samples/Tools/Benchmark/AzureVmBenchmark/scripts/custom-script.sh

* fix bug

* fix review comments

* fix comments

* fix comments

* fix case

* add tests and refactoring

* fix

* unify logging

* add summaries

* fix method summary

* fix BOM

* fix review comments

* fix comment

* fix line breaks

* modified:   Microsoft.Azure.Cosmos.Samples/Tools/Benchmark/AzureVmBenchmark/azuredeploy.json

* modified:   Microsoft.Azure.Cosmos.Samples/Tools/Benchmark/ARMTemplate/README.md

* modified:   Microsoft.Azure.Cosmos.Samples/Tools/Benchmark/README.md

* modified:   Microsoft.Azure.Cosmos.Samples/Tools/Benchmark/README.md

* catch exceptions

* add container prefix

* ResultStorageContainerPrefix

* modified:   Microsoft.Azure.Cosmos.Samples/Tools/Benchmark/AzureVmBenchmark/azuredeploy.json

* modified:   Microsoft.Azure.Cosmos.Samples/Tools/Benchmark/AzureVmBenchmark/scripts/execute.sh

* modified:   Microsoft.Azure.Cosmos.Samples/Tools/Benchmark/AzureVmBenchmark/scripts/custom-script.sh

* modified:   Microsoft.Azure.Cosmos.Samples/Tools/Benchmark/AzureVmBenchmark/azuredeploy.json

* modified:   Microsoft.Azure.Cosmos.Samples/Tools/Benchmark/AzureVmBenchmark/azuredeploy.json

* modified:   Microsoft.Azure.Cosmos.Samples/Tools/Benchmark/AzureVmBenchmark/azuredeploy.json

* modified:   Microsoft.Azure.Cosmos.Samples/Tools/Benchmark/AzureVmBenchmark/azuredeploy.json

* modified:   Microsoft.Azure.Cosmos.Samples/Tools/Benchmark/AzureVmBenchmark/azuredeploy.json

* modified:   Microsoft.Azure.Cosmos.Samples/Tools/Benchmark/AzureVmBenchmark/azuredeploy.json

* modified:   Microsoft.Azure.Cosmos.Samples/Tools/Benchmark/AzureVmBenchmark/azuredeploy.json

* modified:   Microsoft.Azure.Cosmos.Samples/Tools/Benchmark/AzureVmBenchmark/azuredeploy.json

* Update Microsoft.Azure.Cosmos.Samples/Tools/Benchmark/BenchmarkConfig.cs

Co-authored-by: Kiran Kumar Kolli <[email protected]>

* Update Microsoft.Azure.Cosmos.Samples/Tools/Benchmark/AzureVmBenchmark/azuredeploy.json

Co-authored-by: Matias Quaranta <[email protected]>

* Update Microsoft.Azure.Cosmos.Samples/Tools/Benchmark/BenchmarkConfig.cs

Co-authored-by: Kiran Kumar Kolli <[email protected]>

* Update Microsoft.Azure.Cosmos.Samples/Tools/Benchmark/Fx/DiagnosticDataListener.cs

Co-authored-by: Matias Quaranta <[email protected]>

* fix comments

* fix comments

* modified:   Microsoft.Azure.Cosmos.Samples/Tools/Benchmark/README.md

* modified:   Microsoft.Azure.Cosmos.Samples/Tools/Benchmark/AzureVmBenchmark/scripts/execute.sh

* make BlobClient Lazy singleton

* new file:   Microsoft.Azure.Cosmos.Samples/Tools/Benchmark/AzureVmBenchmark/README.md
	modified:   Microsoft.Azure.Cosmos.Samples/Tools/Benchmark/AzureVmBenchmark/azuredeploy.json

* modified:   Microsoft.Azure.Cosmos.Samples/Tools/Benchmark/AzureVmBenchmark/README.md

* modified:   Microsoft.Azure.Cosmos.Samples/Tools/Benchmark/AzureVmBenchmark/scripts/execute.sh

* modified:   Microsoft.Azure.Cosmos.Samples/Tools/Benchmark/AzureVmBenchmark/scripts/execute.sh

* check on diagnostic collection

* remove locks and improve logs appending

* removed unnecessary directory

* removed unnecessary directory

* removed unnecessary directory

* new file:   Microsoft.Azure.Cosmos.Samples/Tools/Benchmark/ARMTemplate/README.md

* add dashboard

* fix arm template

* change branch

* fix

* add dashboard name

* fix dashboard

* add logging

* fix

* trace error

* fix divide by zero

* add trace errors

* fix

* fix

* fix

* fix

* fix

* migrate to text writer

* fixes

* diagnostic logs

* add diagnostic logs

* remove flush and reset

* metric collection window lock

* collection window

* force flush every n seconds

* fix bug

* fix

* Update Microsoft.Azure.Cosmos.Samples/Tools/Benchmark/README.md

Co-authored-by: Matias Quaranta <[email protected]>

* change default metric interval

* constant

* fix container creating conflict issue

* change azuredeploy branch name

* remove ArmTemplate folder

* fix DiagnosticLatencyThresholdInMs default value

---------

Co-authored-by: David Chaava <[email protected]>
Co-authored-by: Kiran Kumar Kolli <[email protected]>
Co-authored-by: Matias Quaranta <[email protected]>

* [Internal] Benchmark tool: Adds a dashboard feature that generates plot queries for metrics with a workload name prefix, depending on the benchmark workload type. #4048 (#4053)

* Merge remote-tracking branch 'origin/master' into users/v-dchaava/benchmark-diagnostics/3889

* modified:   Microsoft.Azure.Cosmos.Samples/Tools/Benchmark/AzureVmBenchmark/README.md

* add metrics prefixes

* fix chart metrics names

* fix dashboard queries according selected workload type

---------

Co-authored-by: David Chaava <[email protected]>

* [Internal] Client Telemetry: Adds client config api call to get latest flag status (#4050)

* first draft

* test fix

* fix dependent projects

* reduce refresh time in tests

* fix tests and added comments

* fix diagnostic handler fix

* fix test

* adding test

* ret pushmove console

* fix test

* provide options to enable/disable this feature in benchmark and ctl proj

* updated trace message

Co-authored-by: Matias Quaranta <[email protected]>

* remove import

* updated traces

Co-authored-by: Matias Quaranta <[email protected]>

* test fix

* remove null assignment

* fix test

---------

Co-authored-by: Matias Quaranta <[email protected]>

* [Internal] Benchmark tool: Fixes benchmark run command using OSSProjectRef parameter (#4066)

* fix benchmark run command using OSSProjectRef parameter

* remove ShouldUnsetParentConfigurationAndPlatform=false

---------

Co-authored-by: David Chaava <[email protected]>

* [Query] Adds public backend metrics property to Diagnostics (#4001)

* initial commit

* some pr comments, WIP

* Refactor

* more

* Public constructors and modify accumulators

* accumulator updates and undo test changes

* add test

* PR comments

* bug fix

* ToString() refactor

* contract updates

* test updates

* small fixes

* text fix

* Update accumulators

* fix

* PR comments

* small fix

* Rename BE -> ServerSide

* more renaming

* Update API and tests

* separate public and internal classes

* API update

* change namespace

* Pr comments

* public constructors and bug fix

* API updates

* renaming and test updates

* PR comments

* more PR comments

* PR comments, test additions

* API updates and more tests

* tests and pkrangeid update

* PR comments

* more PR comments

* smol test fix

* PR comments - renaming properties and constructor rehash

* contract update

* seal classes and private fields.

* update indexHitRatio calc

* mocking refactor to abstract classes

* contract updates

* PR comments - Update documentation

* [Query][Internal] Adds tests for aggregate queries with invalid continuation tokens (#4052)

* partial test

* Tests and error handling update

* update error message

* typo

* update original err msg

* combine tests

* test cleanup

* undo error message update

* [Internal] Benchmark tool: Fixes code refactoring to model the metrics as EventSource (#4040)

* Adding metrics for Benchmark tool.

* Adding OpenTelemetry.

* Revert "Adding OpenTelemetry."

This reverts commit c7da0884697064103145099e284892365f4ebb68.

* Telemetry for windowed percentiles.

* OpenTelemetry, AppInsights and Dashboard.

* Removing DiagnosticDataListener.

* Code styling, comments and clean-up.

* Fixing issues with dashboard.

* Fixing positions of charts on the dashboard.

* Fixing the dashboard.

* Updating titles and subtitles.

* Removing ILogger and other not required references.

* Fixing code review points.

* Fixing issues after rebase.

* Removing unnecessary changes.

* Fixing code review points.

* Fixing code review points.

* make MetricsCollectorProvider non static and remove locks

* fix

* fixes

* use static class name TelemetrySpan.IncludePercentile

* use app insights connection string

* modified:   Microsoft.Azure.Cosmos.Samples/Tools/Benchmark/Program.cs

* modified:   Microsoft.Azure.Cosmos.Samples/Tools/Benchmark/Program.cs

* rename AppInsightsConnectionString

* fix

* fix comments

* fix if AppInsights c string is not set

* summary

* fix

* remove unnecessary collector types

* remove unnecessary meter provider

* add event source

* remove folder

* fix

* split success and failed latencies

* Code refactor to use EventSource design pattern for metrics

* Fixing build breaks

* Removing BenchmarkExecutionEventSource

* Fixing misc things

* Some extra cleanup

* use TimeSpan except milliseconds

* fix metrics publication

* fix metrics publication

* move tests to benchmark folder

* move back benchmark test

* use background task for flushing metrics

* remove sync metrics flushing

* split failed and success operations

* fix latencies charts

* fix benchmark run command

* remove ShouldUnsetParentConfigurationAndPlatform=false

---------

Co-authored-by: Mikhail Lipin <[email protected]>
Co-authored-by: David Chaava <[email protected]>
Co-authored-by: David Chaava <[email protected]>

* first draft (#4079)

* Subpartitioning: Fixes bug for queries on subpartitioned containers (#3934)

* initial fix, needs testing on prod

* test fix

* clean up pr

* query rework

* refactors previous changes

* requested changes and bug fixes

* nits

* requested changes

* bug fixes

* start of test

* added test

* nit: changed name of EffectivePartitionKeyRanges to EffectiveRangesForPartitionKey

* Address code comments

* Address code comments

* saving work

* addresses code comments

* nit, spacing

* PartitionKeyHash fixes

* Fixes bugs in tests

* Removed bad method, added additional test coverage

* Removed EffectivePartitionKeyString use

* test fix

* requested changes

* Requested changes

* fixed test

* Test fix

* Added comment

---------

Co-authored-by: SrinikhilReddy <[email protected]>

* [Internal] Query: Fixes LINQ Test Organization (#4076)

* preliminary change

* Add some more boiler plate code

* move all linq test to the same folder; add some groupBy test

* fix references error in test refactoring

add code for group by substitution. Still need to adjust binding post groupby

* preliminary for the groupby functions with key and value selector

* trying to change collection inputs for group by

* Undo the LINQ GROUP BY part

* fix accidental changes

---------

Co-authored-by: Minh Le <[email protected]>

* ClientTelemetry : Adds logic to call client config in every 10 minutes (#4071)

* first draft

* fix tests

* fixes

* fix tests

* remove consoles

* added exception

* remove comment

* fix tests

* fix test

* rev comments

* rev comments

* refactor code

* remove log from api exception

* SDK 3.35.4: Adds version bump and changelog (#4087)

* bump version and changelog

* added apis

* Update changelog.md

* [Internal] Query: Fixes escaped string parsing in SqlParser (#4054)

* Initial commit

* Addressed comments.

* Benchmark : Fixes benchmark runs (#4088)

* pk to result container

* set pk

* pk value fix

* update run.sh

* remove changes value

* remove telemetry service end point

* cleanup

* [Internal] Query: Adds Index Metrics V2 Object Model (#4058)

* making necessary ownership change

* made change to ownerships

* header test

* Call to TryCreate instead of Create in Responsemessage

* Add baseline test infra for index metric parser

* update baseline files

* Add parse retry logic

* Update headers test

* address code review

* address code review

* fix tests

* Update csproj file

* Adopt the new header

* update the response to parse with text instead of base 64

* test for headers adoption of uri escape

* Add URI Decode logic

* Update baseline

* Update with the new header name from back end

* update the query parsing requirement

* New Index Metrics DOM

* fix build error

* Code clean up

* Address code review

* Turn off switching to V2

* Fix test

* fix test errors

* Address code review comment

* addressed code review

* removed the empty entity

* update test parse

* update test

---------

Co-authored-by: Minh Le <[email protected]>

* Distributed Tracing: Fixes dependency failure on appinsights (#4098)

* first draft

* refactor

* fix tests

* fixed condition

* [Internal] Query: Adds deserializing logic for ClientQL Coordinator Distribution Plan (#3988)

* First commit.

* Added remaining classes for ClientQL structure

* Added ClientQLDeserializing class and added CoordinatorDistributionPlan folder

* Added support for all Enumerable and Scalar Expressions

* Added baseline tests for testing CoordinatorDistributionPlan deserializing

* Made ClientQL objects immutable

* Added error and null checks for Value calls

* Updated List<> with IReadOnlyList<>

* Made most functions in the Deserializing class private and static

* Added static constant class for Enumerable expressions

* Added null checking for arrays

* Removed null checks from deserializing array functions

* Removed support for JavaScript

* Removed support for Unwind

* Function names changed

* Removed few functions.

* Updated constants class

* Function Formatting for ClientQL Deserializing (#4062)

* Adding error handling for Deserializing functions

* Finished updating code to remove all dependency on Newtonsoft.Json

* Removed try catch for all upper level functions

* Resolved comments

* Resolved comments pt2

* Updated error message

* Resolved comments pt3

* Changed parameter types from int to long

* Removed ClientQLDelegate

* Syntax Fixes

* Removed ClientQLFlattenEnumerable file. This is JS.

* Fixed List helper functions

* Made singleton constructors from public to private

* Updated the DeserializeClientQLBinaryLiteral function

* Renamed ClientQL to QL

* Fixed variable names

* Updated more variable names

* Removed support for Type

* Removed all extra newlines

* Added null checks

* Updated the name CoordinatorDistributionPlan to ClientDistributionPlan

* Removed all support for Cassandra, Mongo and Binary Literal

* Updated ClientQL to Cql

* Updated baseline test class property.

* [Internal] Query: Adds check to detect unsupported queries for Optimistic Direct Execution code path (#4090)

* Added query validity function on Ode code path

* Fixed syntax

* Updated to use string search instead of query parsing

* Updated string search to now be regex

* Changed location of caller for QueryValidityCheck()

* Updated regex string

* Added extra test coverage

* Added const string to error messages

* Added compile flag to Regex

* Fixed comments

* Added missing null reference coverage

* Removed extra foreach loop in test

* Removed useQueryPlan bool in test code

* [Internal] Query: Fixes minor issues with TestQueryValidityCheckWithODEAsync (#4105)

* Fixed typos and made test more readable

* Another typo

* Query: Adds LINQ RegexMatch Extension method (#4078)

* Add support for translation to RegexMatch

* Add test and fix some indexing issues

* remove visit explicit, add some comment. Update public contract and added the baseline for the test

* add the missing baseline

* added test

* address code review

* update csproj

---------

Co-authored-by: Minh Le <[email protected]>

* Changing Bounded to Strong (#4103)

* Client Telemetry: Adds new public APIs (#4056)

* Revert "[Internal] Client Telemetry: Refactors code for collectors (#4037)"

This reverts commit e2311a9fdcca392ec7d49c13939aaff3404deb85.

* Revert "Revert "[Internal] Client Telemetry: Refactors code for collectors (#4037)""

This reverts commit f04234b76174180b482eadfa0f6f412c80d380c3.

* first draft

* initialize object

* null handle

* update contracts

* compilation charges

* fix tests

* public API changes

* add docs

* contract updated

* fixed tests

* by default switch off telemetry in sdk

* fix tests

* fix assertion

* incorporate review comments

* feature flag fix in script

* switch case

* add test

* fix tests

* fix test

* fixed run.sh

* minor changes

* code refactor

* changed default values and fix tests

* [Internal] Build: Adds CodeQL support in nightly builds (#4113)

* Update azure-pipelines-nightly.yml

* Newlines in variables sections

* Benchmark: Fixes to show estimated cost of a container only when new container is getting created (#4109)

* Showing Estimated Cost only when new container is getting created

* read container to get container response

* disable client telemetry by default

* removed unused imports

* resolve merge conflict

* fixed name

* fix container not found

* removed the message

* Update Microsoft.Azure.Cosmos.Samples/Tools/Benchmark/Program.cs

Co-authored-by: Matias Quaranta <[email protected]>

* removed line space

---------

Co-authored-by: Kiran Kumar Kolli <[email protected]>
Co-authored-by: Matias Quaranta <[email protected]>

* Distributed Tracing: Fixes SDK responses compatibility with opentelemetry response (#4097)

* adding tests

* wip

* wip2

* fix code

* add tests

* fix test

* fix test

* remove consoles

* fix indent and remove unused imports

* internal to private rollback

* added docs

* removed unused imports

* added exception in message

* fix exception catching

* Revert "Query: Adds new system strings in JsonBinaryEncoding, replacing 1-byte user strings (#3400)" (#4108)

This reverts commit 9140890d788cd43d5668d12072be6b965995a28a.

* CosmosClientOptions: Adds support for multiple formats of Azure region names (#4016)

* Allow CosmosClientOptions to take ApplicationRegion and ApplicationPreferredRegions in multiple region name formats.

This is a proposed fix for - https://github.com/Azure/azure-cosmos-dotnet-v3/issues/2330

* Address PR comment to avoid duplicating list of names.

* Remove the map table cache

The map table is only used on initialization, so there's no need to keep a cache of it for the lifetime of the application

* Only convert the region names when the client is initializing

The cache is created before converting all the names, so it only needs to be created once, but doesn't remain for the entire lifetime of the application

* Update tests

* Make RegionNameMapper an instantiable class

Instead of having a prepare/clear cache system on a static class, make RegionNameMapper a class that gets instantiated for use and let the ctor handle it.

* Remove debugging

* Update tests to actually test things

---------

Co-authored-by: Pradeep Chellappan <[email protected]>
Co-authored-by: Pradeep Chellappan <[email protected]>
Co-authored-by: Matias Quaranta <[email protected]>
Co-authored-by: Kiran Kumar Kolli <[email protected]>

* Distributed Tracing: Fixes traceid null exception issue (#4111)

* Fix traceid null exception issue

* Fixing merge conflicts

* Fixing merge conflicts

* Update script

* Code cleanup

* Updated change description

* updated comment description

* updated comment description

---------

Co-authored-by: Matias Quaranta <[email protected]>

* Telemetry Options: Adds telemetry options in GA package (#4117)

* GA telemetry options and updated contract

* enable request level option

* added request option in public contract

* [Internal] Direct Package: Adds version bump (#4120)

* direct version bump

* Code changes to fix emulator tests to comply with direct release 3.31.5.

---------

Co-authored-by: Debdatta Kunda <[email protected]>

* Query : Adds Missing QueryMetrics Documentation (#4127)

* Update ServerSidePartitionedMetrics.cs

* Update ServerSidePartitionedMetrics.cs

* TriggerOperation: Adds Upsert Operation Support (#4119)

* Added Upsert Trigger Operation Support

* updated contract

* fix test

* SDK 3.36.0 : Adds version bump and changelog (#4118)

* first draft

* updated changelog

* remove already released PRs

* updated pr links

* changelog and contract changes

* updated changelog

* updated changelog

* updated changelog

* remove 4071 from changelog as it should be internal PR

* removed an internal query log

* updated contracts

* Release 3.36.0: Fixes pipeline by removing ReleasePackage variable (#4130)

* remove release variable

* revert build config variable change also

* Item Operations: Fixes JsonSerialization exception when MissingMemberHandling = Error on Json default settings when NotFound on Item operations. (#4125)

* issue 4115 initial checkin. need insight from issuer on reproducing this issue

* test refactoring and adding more coverage for other NotFound scenarios

* commit on some actionables

* setting JsonConvert.DefaultSettings to null so that other tests will not fail

* as requested, removed catches from test methods

* [Internal] Query : Adds test coverage for custom serializers (#4114)

* initial cleanup

* test updates - working

* cleanup

* more cleanup

* more

* whoops

* Add results to baseline

* adding payload to xml

* some generics

* cleanup

* Add datamember serializer

* reorder functions and test fix

* tostring() update and add case

* fix payload

* fix datamembertest

* cleanup

* cleanup

* PR comment

---------

Co-authored-by: Matias Quaranta <[email protected]>

* Release 3.36.0 : Fixes Client Telemetry Release Test (#4132)

* Client Telemetry Release test fix

* get endpoint from env variable

* read client telemetry endpoint service from env

* updated yaml

* Update CosmosItemTests.cs (#4141)

* Release 3.36.0: Fixes client config test and preview pipeline (#4149) (#4150)

* Fixed client config test and preview pipeline

* fix telemetry service step

* Query : Fixes querying conflicts (#4100)

* Initial commit

* Update

* Updated the test

* Updated the test

* Sample fix; to validate Suite0.

* Skipped the ConflictsTest (which depends on azure cosmosdb account)

* Addressed comments

* Added Unit Test.

* Reverted unnecessary change.

* Fixes changelog typo and date (#4155)

* Bump Azure.Identity in /Microsoft.Azure.Cosmos.Samples/Usage/Encryption (#4136)

Bumps [Azure.Identity](https://github.com/Azure/azure-sdk-for-net) from 1.5.0 to 1.10.2.
- [Release notes](https://github.com/Azure/azure-sdk-for-net/releases)
- [Commits](https://github.com/Azure/azure-sdk-for-net/compare/Azure.Identity_1.5.0...Azure.Identity_1.10.2)

---
updated-dependencies:
- dependency-name: Azure.Identity
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Matias Quaranta <[email protected]>

* Bump Azure.Identity (#4135)

Bumps [Azure.Identity](https://github.com/Azure/azure-sdk-for-net) from 1.5.0 to 1.10.2.
- [Release notes](https://github.com/Azure/azure-sdk-for-net/releases)
- [Commits](https://github.com/Azure/azure-sdk-for-net/compare/Azure.Identity_1.5.0...Azure.Identity_1.10.2)

---
updated-dependencies:
- dependency-name: Azure.Identity
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Per Partition Automatic Failover: Fixes Gateway 503 Cold Start Issue (#4073)

* Code changes to add retry logic for GW returned 503.9002.

* Revert "Code changes to add retry logic for GW returned 503.9002."

This reverts commit 53ef5f3c1b038d14dbb1473cafa18223b33af2ce.

* Code changes to clean up the PPAF retry logic fix.

* Code changes to add retry logic for GW returned 503.9002.

* Revert "Code changes to add retry logic for GW returned 503.9002."

This reverts commit 53ef5f3c1b038d14dbb1473cafa18223b33af2ce.

* Code changes to clean up the PPAF retry logic fix.

* Code changes to revert location cache changes.

* Code changes to revert location cache changes.

* Code changes to fix some of the failing tests.

* Code changes to fix unit tests.

* Code changes to add unit tests for client options.

* Code changes to draft docs for PPAF design approach.

* Code changes to add SDK side design docs for PPAF.

* Code changes to modify the PPAF design.

* Code changes to fix unit test.

* Code changes to rename test name.

* Code changes to add some cosmetic changes.

* Code changes to enable retry on write for all regions in single master accounts.

* Code changes to add code comments.

* Code changes to clean up and handle endpoints in location cache.

* Code changes to fix unit tests. Added detailed code comments.

* Code changes to clean up the account read endpoints generation logic.

* Code changes to fix unit tests.

* Code changes to disable retry when ppaf is not enabled. Also validated application preferred region.

* Code changes to fix unit tests.

* Code changes to update md file.

* Code changes to remove cache expiry check for account read endpoints.

* Code changes to fix unit test.

* Code changes to fix more tests.

* Code changes to address review comments.

* Code changes to fix verbiage in design document.

* [Internal] Query: Fixes optimalPageSize logic for OFFSET LIMIT in ORDER BY queries (#4158)

* Fix logic in CosmosQueryExecutionContextFactory where we determine optimal page size for ORDER BY queries that have an OFFSET/LIMIT clause.  Previously, the logic was only being applied to TOP and not OFFSET/LIMIT.

* Changes based on PR feedback

* Change based on PR feedback

* [Internal] Client Telemetry: Adds telemetry contract (#4161)

* add tests

* fix tests

* remove unrelated files

* Client Encryption: Adds Azure.Identity from 1.1.1 to 1.10.2 (#4134)

* Bump Azure.Identity in /Microsoft.Azure.Cosmos.Encryption.Custom/src

Bumps [Azure.Identity](https://github.com/Azure/azure-sdk-for-net) from 1.1.1 to 1.10.2.
- [Release notes](https://github.com/Azure/azure-sdk-for-net/releases)
- [Commits](https://github.com/Azure/azure-sdk-for-net/compare/Azure.Identity_1.1.1...Azure.Identity_1.10.2)

---
updated-dependencies:
- dependency-name: Azure.Identity
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>

* Update Microsoft.Azure.Cosmos.Encryption.Custom.csproj

Updated the Azure.Core version

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Santosh Kulkarni <[email protected]>
Co-authored-by: Matias Quaranta <[email protected]>

* [Internal] Query: Removes ForcePassThrough support (#4160)

* Remove passThrough from the codebase

* Updated TestOptimisticDirectExecutionQueryAsync() to remove all aspects of passThrough from it

* Undoing previous change

* Removed forcePassThrough from FullPipelineTests.cs

* Undoing changes to remove PassThrough

* Undoing passThrough removal pt2

* Undoing changes to SanityQueryTests.cs

* Updated TestTryExecuteQueryHelper()

* Updated comment

* Changed boolean location

* Removed TryExecuteQueryAsync()

* Fixed indentation

* [Internal] Upgrade Resiliency: Fixes Duplicate Channel and Task Creation. (#4123)

* Code changes to fix duplicate channel and thread pool on refresh flow.

* Code changes to fix failed tests.

* Code changes to add global semaphore for concurrency control in address cache.

* Code changes to refactor the refresh async method.

* Code changes to address review comments.

* Code changes to update summary.

* [Internal] DocumentClient: Adds TryGetAccountProperties (#4167)

* add api

* tests

* Update test

* Rename

* Query: Fixes documentation to reflect state of System.Text.Json serializer (#4170)

* Update Program.cs

* Update Program.cs

* Update Program.cs

* Update Program.cs

---------

Co-authored-by: Matias Quaranta <[email protected]>

* [Internal] Query: Adds interface for linq serialization functions (#4163)

* initial commit

* add interface

* PR comments and TranslationContext cleanup

* update params

* fix parameters

* PR comments

* PR comments

* PR comments

* simplifying serializer class

* interface updates

* Update docs

* PR comments

* PR comments

* PR comments - rename and fix assert

* Routing: Adds ExcludeRegions Feature to RequestOptions (#4128)

* adds excludeRegions

* suggested changes

* removed unused usings

* fixed blank line error

* removed using

* update contracts

* fixed test

* reverted automatic changes to BaselineTests

* requested changes

* bug fix

* PPOF test fix

* Upgrade Resiliency: Adds Code to Enable Advanced Replica Selection Feature for Preview (#4180)

* Code changes to enable replica validation for preview.

* Code changes to enable replica validation for preview and GA.

* VS 17.8 auto-runs NuGetAudit and flags 10.0.2; the CosmosDB SDK already mitigated it by changing the MAXDEPTH (#4185)

* Documentation: Adds Upsert documentation to include status codes for Created vs Replaced (#4186)

* Upsert status codes clarification

* Upgrade Resiliency: Adds Code to Enable Advanced Replica Selection Feature for Preview (#4180)

* Code changes to enable replica validation for preview.

* Code changes to enable replica validation for preview and GA.

---------

Co-authored-by: Debdatta Kunda <[email protected]>

* [Internal] Code Analysis: Fixes all warnings in source/test/usage projects (#4188)

* [Internal] CodeAnalysis: Fixing CA2200 for test projects

* Making code warning clean

* fixing the usages projects

* Removing the insource overrides

* One more small fix

---------

Co-authored-by: Sourabh Jain <[email protected]>

* 3.37.0: Adds new SDK versions and contract files (#4191)

* Updated change log and bumped up the patch version.

* Updated change log and bumped up the minor version.

* Updated change log to reflect correct version.

* [Internal] Versioning: Adds guidance for versioning SDK releases (#4192)

* Create versioning.md

* Update versioning.md
…
Labels
auto-merge Enables automation to merge PRs Upgrade Resiliency
Projects
Status: Done
5 participants