
terraform: fix security rule reconciliation on Azure #3454

Merged — 10 commits merged into main on Nov 4, 2024

Conversation

elchead
Contributor

@elchead elchead commented Oct 22, 2024

Context

In #3257, the migration to defining the network security rules as separate resources was started.
Unfortunately, the concurrent use of inline and separate security rule resources causes problems (see https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/network_security_group). Terraform cannot resolve which declarative state of the rules should be applied when both inline rules and separate resources exist, so it becomes a race over which resource overwrites the others.

This bug was first observed in the weekly Azure TDX Terraform provider test, but it also failed during the release test with the CLI (https://github.com/edgelesssys/constellation/actions/runs/11402745527/job/31778136903).
The observed behavior was that the Kube API server was no longer reachable, because the network security rule was missing.
The bug also applies to Azure SEV-SNP, but since it is a concurrency race, its manifestation is not deterministic. In fact, the correct behavior happened to occur during the release test on SEV-SNP (https://github.com/edgelesssys/constellation/actions/runs/11402745527/job/31778136329#step:19:874).

The problem can occur on any cluster on Azure, and every `terraform apply` could trigger this bug.

Proposed change(s)

When the inline security rules block of the network security group is left empty, that resource does not reconcile the rules.
Thus, the solution is to migrate fully to separate resources for the security rules.
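In Terraform terms, the fix follows the pattern the azurerm provider docs recommend: declare no inline `security_rule` blocks on the group, and manage each rule as its own `azurerm_network_security_rule` resource, so each rule has exactly one source of truth. A simplified sketch (resource names, ports, and priorities are illustrative, not Constellation's actual configuration):

```hcl
# Sketch only: names and attributes are illustrative, not Constellation's config.
resource "azurerm_network_security_group" "nsg" {
  name                = "example-nsg"
  location            = azurerm_resource_group.rg.location
  resource_group_name = azurerm_resource_group.rg.name
  # No inline security_rule blocks: this resource no longer reconciles rules.
}

resource "azurerm_network_security_rule" "kubernetes" {
  name                        = "kubernetes"
  priority                    = 100
  direction                   = "Inbound"
  access                      = "Allow"
  protocol                    = "Tcp"
  source_port_range           = "*"
  destination_port_range      = "6443"
  source_address_prefix       = "*"
  destination_address_prefix  = "*"
  resource_group_name         = azurerm_resource_group.rg.name
  network_security_group_name = azurerm_network_security_group.nsg.name
}
```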

I propose a patch release, with users on 2.18 required to upgrade directly to 2.19.1.

  • remove the inline network security rules on Azure

Additional info

I tested setting up a new cluster with this change; subsequent applies did not produce any Terraform diff, as expected.
The cluster remained reachable.

Upgrade test from v2.18.0 -> v2.19.1:

  • Azure TDX
  • Azure SEV-SNP

For upgrades from v2.18.0 on Azure, the old security rules are orphaned and should be manually cleaned up by the user:

```bash
#!/bin/bash
name="<insert>"  # the name provided in the config
id="<insert>"    # the cluster id

rules=(
  "kubernetes"
  "bootstrapper"
  "verify"
  "recovery"
  "join"
  "debugd"
  "konnectivity"
)

for rule in "${rules[@]}"; do
  echo "Deleting rule: $rule"
  az network nsg rule delete \
    --resource-group "$name-rg" \
    --nsg-name "$name-$id" \
    --name "$rule"
done

echo "All specified rules have been deleted."
```

Checklist

  • Run the E2E tests that are relevant to this PR's changes
  • Update docs
  • Add labels (e.g., for changelog category)
  • Is PR title adequate for changelog?
  • Link to Milestone


netlify bot commented Oct 22, 2024

Deploy Preview for constellation-docs ready!

🔨 Latest commit: e31af63
🔍 Latest deploy log: https://app.netlify.com/sites/constellation-docs/deploys/6723ac5a6150500008359a35
😎 Deploy Preview: https://deploy-preview-3454--constellation-docs.netlify.app

@elchead elchead changed the title terraform: fix security rule reconcilation on azure terraform: fix security rule reconcilation on Azure Oct 22, 2024
@elchead elchead added bug fix Fixing a bug needs backport This PR needs to be backported to a previous release labels Oct 22, 2024
@elchead elchead added this to the v2.20.0 milestone Oct 22, 2024
@elchead
Contributor Author

elchead commented Oct 22, 2024

@3u13r I understand that your intention for the dangling konnectivity rule in #3257 was to enforce that the other rules are deleted, but this should not be necessary as far as I can see. Or do you see any issue?

@3u13r
Member

3u13r commented Oct 22, 2024

I understand that your intention for the dangling konnectivity rule in #3257 was to enforce that the other rules are deleted, but this should not be necessary as far as I can see. Or do you see any issue?

No; if the posted CI runs succeed, this solution is cleaner in general and is likely to cause fewer problems in the future.

@elchead
Contributor Author

elchead commented Oct 24, 2024

I'm not sure where the best place for the migration script would be (see Additional info in the description).

```bash
echo "CLI VERSION:"
echo $(./build/constellation version)
CLI=$(realpath ./build/constellation)
bazel run --test_timeout=14400 //e2e/internal/upgrade:upgrade_test -- --want-worker "$WORKERNODES" --want-control "$CONTROLNODES" --target-image "$IMAGE" --target-kubernetes "$KUBERNETES" --target-microservices "$MICROSERVICES" --cli "$CLI"
```
Contributor Author

The old code was broken because an empty string `""` substituted for a flag value breaks the flag parsing.
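The failure mode can be reproduced with plain Bash: an unquoted empty variable vanishes from the argument list entirely, while a quoted `""` is passed as a real (empty) argument that can confuse flag parsing. A minimal illustration with a hypothetical `count` helper (not project code):

```shell
#!/bin/bash
# count prints how many arguments it received
count() { echo "$#"; }

empty=""
count --flag $empty next    # unquoted: the empty variable is dropped, prints 2
count --flag "$empty" next  # quoted: an empty argument is passed, prints 3
```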

Contributor Author

Furthermore, the upgrade CLI (with potentially simulated patch version) should be used. Previously, the target embedded its own CLI that was always tagged with vNEXT-pre. This doesn't work for simulating a patch upgrade.

Member

This doesn't work for simulating a patch upgrade.

I thought the idea of `simulatedTargetVersion` was supposed to solve this problem. If we only use the built CLI to migrate the config, I don't think `simulatedTargetVersion` works as intended. We should either fix this or remove it.

Also, if we pass the CLI path as a flag, then this should be the flow of this e2e test, and we should remove the test's CLI dependency in Bazel. What I'd prefer, though, is to just use `simulatedTargetVersion` here as well and remove the flag again.

Contributor Author

I decided to keep the explicit CLI passing so that the identical CLI is used for `config migrate` and the rest of the upgrade workflow.

```diff
@@ -297,10 +297,10 @@ func getCLIPath(cliPathFlag string) (string, error) {
	pathCLI := os.Getenv("PATH_CLI")
	var relCLIPath string
	switch {
	case pathCLI != "":
		relCLIPath = pathCLI
	case cliPathFlag != "":
```
Contributor Author

a flag should take precedence over ENV

@elchead elchead marked this pull request as ready for review October 24, 2024 12:03
@3u13r
Member

3u13r commented Oct 24, 2024

If you are not already on it: changing the PR from draft to open is a good point to merge the fixup commits and present a neat commit history.

@elchead elchead force-pushed the as/fix-azure-upgrade branch 3 times, most recently from 2334a78 to f887c8e Compare October 24, 2024 12:36
Member

@3u13r 3u13r left a comment


We should also document the not necessarily required but at least recommended migration steps in docs/docs/reference/migration.md


@elchead
Contributor Author

elchead commented Oct 25, 2024

As discussed with @3u13r, we decided to build the target CLI inside the upgrade job itself, because separating this step created several consistency issues for the simulated patch version scenario.
The main reason is that the upgrade tool itself didn't have the patched version but referred to its own embedded version to determine some target versions.

Moreover, the extra job didn't save any time, because the CLI is built inside the Bazel upgrade target anyway.

@elchead elchead requested a review from 3u13r October 25, 2024 11:09

### Azure

* During the upgrade, security rules are migrated, and it is recommended (but not strictly required) that the user clean up the old ones manually. The script below shows how to delete the old rules programmatically through the Azure CLI:
Contributor Author

I'd recommend not making this optional; otherwise we could not change the security rule name back and remove the priority offset in Terraform.


### Azure

* During the upgrade, security rules are migrated, and the old ones need to be cleaned up manually by the user. The script below shows how to delete them through the Azure CLI:
Contributor Author

Once approved, I will backport this to the v2.19 docs

@elchead elchead requested a review from 3u13r October 31, 2024 15:30
@elchead elchead force-pushed the as/fix-azure-upgrade branch 3 times, most recently from 41313fd to 9f47e9a Compare October 31, 2024 15:38
Member

@3u13r 3u13r left a comment


LGTM

- To keep using an existing UAMI, add the `Owner` permission with the scope of your `resourceGroup`.
- Otherwise, simply [create new Constellation IAM credentials](../workflows/config.md#creating-an-iam-configuration) and use the created UAMI.
- To migrate the authentication for an existing cluster on Azure to an UAMI with the necessary permissions:
* The `provider.azure.appClientID` and `provider.azure.appClientSecret` fields are no longer supported and should be removed.
Member

I don't think we currently enforce any rules regarding `*` vs `-`; they seem to be interchangeable. Did your linter change this, or do you have a reason for this preference? I know the docs currently favor `*` (i.e., it is used more). I think we can revert this for now, so the last commit that changed the line isn't updated, and revisit this discussion later.

Member

We agreed on markdownlint, which (unfortunately) picks the marker character of the topmost list item on the page. So if you wanted to keep the diff small but still use the linter, you would replace the first item's marker with `-` in this case.
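As a side note, markdownlint's unordered-list rule (MD004) can also be pinned to a fixed marker instead of the default first-item-wins `consistent` mode; a sketch of a hypothetical `.markdownlint.yaml` (not this repo's actual configuration):

```yaml
# Hypothetical .markdownlint.yaml: pin the unordered-list marker explicitly
# instead of relying on "consistent" (first list item on the page wins).
MD004:
  style: dash  # alternatives: "asterisk", "plus", "consistent"
```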

@elchead
Contributor Author

elchead commented Nov 4, 2024

@elchead elchead changed the title terraform: fix security rule reconcilation on Azure terraform: fix security rule reconciliation on Azure Nov 4, 2024
@elchead elchead merged commit 54058ee into main Nov 4, 2024
13 checks passed
@elchead elchead deleted the as/fix-azure-upgrade branch November 4, 2024 07:59
elchead added a commit that referenced this pull request Nov 5, 2024
* fix security rule reconciliation on azure
* fix simulated patch version upgrade