Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VPC: Add subnet reconciliation #1970

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

cjschaef
Copy link
Contributor

Add support to reoncile VPC subnets for the new v2 VPC Infrastructure reoncile logic.

Related: #1896

What this PR does / why we need it:
Add new support for reconciling VPC subnets for VPC Infra

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:

/area provider/ibmcloud

  1. Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

Release note:

Add support to new (v2) reconciliation of Subnets for VPC Infrastructure

@k8s-ci-robot k8s-ci-robot added the area/provider/ibmcloud Issues or PRs related to ibmcloud provider label Sep 18, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: cjschaef
Once this PR has been reviewed and has the lgtm label, please assign mkumatag for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 18, 2024
@k8s-ci-robot
Copy link
Contributor

Hi @cjschaef. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Sep 18, 2024
Copy link

netlify bot commented Sep 18, 2024

Deploy Preview for kubernetes-sigs-cluster-api-ibmcloud ready!

Name Link
🔨 Latest commit 6689ccf
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-cluster-api-ibmcloud/deploys/66fab794c67a7d0008972da5
😎 Deploy Preview https://deploy-preview-1970--kubernetes-sigs-cluster-api-ibmcloud.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

Copy link
Contributor

@Karthik-K-N Karthik-K-N left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM


// Reconcile Control Plane subnets.
requeue := false
for _, subnet := range subnets {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If no inter dependency, In future for performance optimization we can think of reconciling data and control plane subnets concurrently.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is likely possible, if subnet creation is found to take longer than desired I can revisit that. I know Custom Image and LB's are the larger time sinks (in terms of waiting for them to be Ready, and SG/SGR reconciliation is heavily complex and can take time as well.

return nil
}

// findOrCreatePublicGateway will attempt to find if there is an existing Public Gateway for a specific zone, for the cluster (in cluster's Resource Group and VPC), or create a new one. Only one Public Gateway is required in each zone, for any subnets in that zone.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my better understanding, Do you plan to support multiple subnets in same zone, Do you have any use case in mind when we will need to do this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the simple path here, is when using different subnets for Control Plane and Data Plane, distributed across the same zones (one CP subnet per zone and one DP subnet per zone).
Otherwise, if the user desires to include edge subnets, or distribute nodes within smaller subnets IP CIDR's, multiple subnets could be used per zone as well.

@Karthik-K-N
Copy link
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 19, 2024
// If no Worker subnets were supplied, attempt to create one in each zone.
if s.IBMVPCCluster.Spec.Network.WorkerSubnets == nil || len(s.IBMVPCCluster.Spec.Network.WorkerSubnets) == 0 {
// If neither Control Plane nor Worker subnets were supplied, we rely on both Planes using the same subnet per zone, and we will re-reconcile those subnets below, for IBMVPCCluster Status updates.
if len(s.IBMVPCCluster.Spec.Network.ControlPlaneSubnets) != 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have mentioned as when both are not supplied rely on same subnets.
Don't you need to compare the len equals to 0 here instead of not equals to?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could reword, but no, in this case if some Control Plane subnets were supplied, we will auto-generate some Worker subnets.
If neither got provided, then Worker subnets will use the Control Plane subnets that were auto-generated.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok got it.

// If we have the subnet name, we can easily check Network Stauts on the subnet's status.
if isControlPlane {
// If we find the subnet in Control Plane subnet status and that it is ready, we can return, with no requeue required.
if subnet, ok := s.NetworkStatus().ControlPlaneSubnets[*subnet.Name]; ok && subnet.Ready {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s.NetworkStatus() != nil is not required here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that is a good idea to be safe.

}
} else {
// If we find the subnet in Worker subnet status and that it is ready, we can return, with no requeue required.
if subnet, ok := s.NetworkStatus().WorkerSubnets[*subnet.Name]; ok && subnet.Ready {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as prev comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will update.

// If we have the subnet name, we can easily check Network Stauts on the subnet's status.
if isControlPlane {
// If we find the subnet in Control Plane subnet status and that it is ready, we can return, with no requeue required.
if subnet, ok := s.NetworkStatus().ControlPlaneSubnets[*subnet.Name]; ok && subnet.Ready {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even if we find it in status, don't we need to check the cloud that its still in desired state?
Please take care of this in all the similar blocks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The discussion I had previously, was that with certain VPC resources, if the details are already in Status, checking the realtime status of said resource would not need to happen. For instance, handling cases where CAPI created a resource, and someone/thing deleted the resource, having CAPI attempt to re-reconcile that case, even though Status had a name and ID (which are no longer valid, like ID) would not be desirable.

As VPC and Subnets, do not typically change status, unlike Load Balancers, the expectation is that once these resources are Ready (or the equivalent VPC status value/constant) it should stay that way until deletion (either requested or done maliciously). Load Balancers, due to updating being a frequent occurrence, likely would not be able to rely on previous status (instead checking status in realtime).

If this expectation has changed, and CAPI is expected to recreate all VPC resources (if they disappear, API fails to find them, etc.), I can change all resource logic to no rely on Status details (such as ID) if that changes in realtime.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Karthik-K-N @mkumatag could you please share your opinion here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We usually follow this pattern of fetching id from status and checking its actual status in cloud, Its to make sure that the observed resource status is correct and up to date with real world.

I think its better to follow this pattern as its helps in early detection of any potential problems.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should reconciliation completely stop (panic) if a resource cannot be found then?
Or should resources be recreated and update Status with new ID's, etc.?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As of now reconciler will throw error. For reference. Its upto the user to resolve the error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will attempt to refactor, with the expectation that anything in Status that cannot be found in IBM Cloud, etc. results in an continuous error. Other resources will be updated in the same fashion in other PR's then.

}

// checkSubnetStatus will check the status of a IBM Cloud Subnet and update the Network Status.
func (s *VPCClusterScope) checkSubnetStatus(subnetDetails *vpcv1.Subnet, isControlPlane bool) (bool, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel we need to rename this if you are going to update the status also as part of this func.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will change to update.

}

// Add a tag to the subnet for the cluster.
err = s.TagResource(s.IBMVPCCluster.Name, *subnetDetails.CRN)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hope tagging subnet is tested? We have nt tried before.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it does. I don't perform exhaustive testing, due to time constraints, but I test all the reconciliation logic after each set of changes for the standard path. So tagging works on subnets (based on the API I don't think GlobalTagging cares what a resource is, as long as it has a CRN), even though the UI leaves a lot to be desired in terms of tagging support.

# ibmcloud resource search "tags:\"us-east-capi-subnet-1-v2vzf\"" | grep "Name\|Resource Type"
Name:                us-east-capi-subnet-1-v2vzf-pgateway-us-east-1
Resource Type:       public-gateway
Name:                us-east-capi-subnet-1-v2vzf-subnet-compute-us-east-1
Resource Type:       subnet
Name:                us-east-capi-subnet-1-v2vzf-subnet-control-plane-us-east-1
Resource Type:       subnet
Name:                us-east-capi-subnet-1-v2vzf-pgateway-us-east-2
Resource Type:       public-gateway
Name:                us-east-capi-subnet-1-v2vzf-subnet-control-plane-us-east-2
Resource Type:       subnet
Name:                us-east-capi-subnet-1-v2vzf-subnet-compute-us-east-2
Resource Type:       subnet
Name:                us-east-capi-subnet-1-v2vzf-pgateway-us-east-3
Resource Type:       public-gateway
Name:                us-east-capi-subnet-1-v2vzf-subnet-compute-us-east-3
Resource Type:       subnet
Name:                us-east-capi-subnet-1-v2vzf-subnet-control-plane-us-east-3
Resource Type:       subnet
Name:                us-east-capi-subnet-1-v2vzf-rhcos
Resource Type:       image
Name:                us-east-capi-subnet-1-v2vzf-vpc
Resource Type:       vpc

@cjschaef
Copy link
Contributor Author

I don't see this error is local linting, and other instances don't appear to fail

controllers/ibmvpccluster_controller.go:268:176: printf: non-constant format string in call to sigs.k8s.io/cluster-api/util/conditions.MarkFalse (govet)
		conditions.MarkFalse(clusterScope.IBMVPCCluster, infrav1beta2.VPCSubnetReadyCondition, infrav1beta2.VPCSubnetReconciliationFailedReason, capiv1beta1.ConditionSeverityError, err.Error())
		                                                                                                                                                                             ^

Going to assume this is a test flake and get the test to rebuild.

@cjschaef cjschaef force-pushed the vpc_reconcile_subnets branch 2 times, most recently from 8fb2ad3 to f0f141c Compare September 19, 2024 18:10
@cjschaef
Copy link
Contributor Author

ah, think I found the error and have fixed it now.

}

options := &vpcv1.CreateSubnetOptions{}
options.SetSubnetPrototype(&vpcv1.SubnetPrototype{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently in existing VPC flow we create subnet with these properties https://github.com/Karthik-K-N/cluster-api-provider-ibmcloud/blob/83e571fe6e85a6edd94d46b762a1844f9f7dd195/cloud/scope/cluster.go#L219-L236. Do you think its not required to set CIDR block or is it not required when you set TotalIpv4AddressCount?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like the two are mutually exclusive (not completely sure though), so you are able to define one or the other (total IPs or a CIDR).
Given the history we have, I opted to keep using the total IP's for now, but once we consider allowing this to be configurable, we could provide both as options (total IP's or CIDR's).

if _, ok := subnetMap[*subnet.Name]; ok {
subnetName = subnet.Name
}
} else if subnetID != nil && subnet.ID != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are you comparing subnetID != nil here?
Are n't you defining this few lines above?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I'll clean that up.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines +986 to +1005
if subnetDetails == nil || subnetDetails.ID == nil || subnetDetails.CRN == nil {
return fmt.Errorf("error failed creating subnet: %s", *subnet.Name)
}

// Initially populate subnet's status.
resourceStatus := &infrav1beta2.ResourceStatus{
ID: *subnetDetails.ID,
Name: subnetDetails.Name,
Ready: false,
}
if isControlPlane {
s.SetResourceStatus(infrav1beta2.ResourceTypeControlPlaneSubnet, resourceStatus)
} else {
s.SetResourceStatus(infrav1beta2.ResourceTypeWorkerSubnet, resourceStatus)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use updateSubnetStatus() here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better to default during the initial population and update on followup loops.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And to be honest, it doesn't matter anymore, we will have a followup loop after subnet(s) creation, so we will callback to updateSubnetStatus on the next iteration.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, what I meant is that I don't see a code difference here from line 986 to 1000 comparing to updateSubnetStatus() func.
So why can't we just call that func instead of these dup lines?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code diff is Ready: false is hardcoded here, compared with the function lookup.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't that func sets the ready as false if you pass the subnet details from here? Since it won't be available immediately after creation.
IMHO we can just use that func here instead of building and setting the status from here. It looks duplicate to me.
@Karthik-K-N can you please see whether my opinion making sense here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if subnetDetails == nil || subnetDetails.ID == nil || subnetDetails.CRN == nil {
return fmt.Errorf("error failed creating subnet: %s", *subnet.Name)
}
// Initially populate subnet's status.
resourceStatus := &infrav1beta2.ResourceStatus{
ID: *subnetDetails.ID,
Name: subnetDetails.Name,
Ready: false,
}
if isControlPlane {
s.SetResourceStatus(infrav1beta2.ResourceTypeControlPlaneSubnet, resourceStatus)
} else {
s.SetResourceStatus(infrav1beta2.ResourceTypeWorkerSubnet, resourceStatus)
}
if subnetDetails == nil || subnetDetails.ID == nil || subnetDetails.CRN == nil {
return fmt.Errorf("error failed creating subnet: %s", *subnet.Name)
}
return s.updateSubnetStatus(subnetDetails, isControlPlane)

Copy link
Contributor

@Karthik-K-N Karthik-K-N Sep 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure whats will be the value for these subnetDetails, for the newly created subnet
but if either subnetDetails.Status is nil or subnetDetails.Status not vpcv1.SubnetStatusAvailableConst then anyhow we are setting ready as false. so we can reuse I believe.

 subnetDetails.Status != nil && *subnetDetails.Status == string(vpcv1.SubnetStatusAvailableConst) 

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If create succeeds, it would populate and give subnetDetails properly or else it will throw an error.
Not sure in what case it would return err as nil and subnetDetails also as nil.

}

// Otherwise, if these is an ID or name, attempt to lookup the subnet and update status as necessary.
if subnet.ID != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you found an existing subnet with id or name, wondering how are you planning to differentiate this during delete phase? Just asking for my understanding!
If it via tags, hope all the resources involved are having CRN like you mentioned, to attach the tags.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this moment, I don't have solid plans for deletion logic. Tagging is going to likely be involved/required to properly handle all of the edge cases involved with bring your own 'resource' (vpc, subnet, image, etc.).
Yes, I have already gone through and tested the tagging required for the resource CAPI is creating, but deletion has not been a focus, given time constraints.

Copy link
Contributor

@Karthik-K-N Karthik-K-N left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM

return nil, err
}
if subnet == nil {
return nil, nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wondering if we can return some custom error here for better consumption?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

var subnets []infrav1beta2.Subnet
var err error
// If no ControlPlane Subnets were supplied, we default to create one in each availability zone of the region.
if s.IBMVPCCluster.Spec.Network.ControlPlaneSubnets == nil || len(s.IBMVPCCluster.Spec.Network.ControlPlaneSubnets) == 0 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nil check is not required for the ControlPlaneSubnets as it is a slice and len function will take care of that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

// For Subnets, we collect all of the required subnets, for each Plane, and reconcile them individually. Requeing if one is missing or just created. Reconciliation is attempted on all subnets each loop, to prevent single subnet creation per reconciliation loop.
func (s *VPCClusterScope) ReconcileSubnets() (bool, error) {
var subnets []infrav1beta2.Subnet
var err error
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see pointing in declaring this err variable here, can be safely removed and move the declaration wherever needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe I need this declared here, so that when assigning subnets below, it is not redefined. The linting at least requires the declarations first, before assigning.

}

// If no Worker subnets were supplied, attempt to create one in each zone.
if s.IBMVPCCluster.Spec.Network.WorkerSubnets == nil || len(s.IBMVPCCluster.Spec.Network.WorkerSubnets) == 0 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above, remove the nil check

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

Comment on lines 871 to 879
} else {
// If no ID or name was provided, that is an error to be raised. One or the other must be specified when subnets are supplied.
return false, fmt.Errorf("error subnet has no defined id or name, one is required")
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can be checked in the beginning and return early failure

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I think that is a safe expectation. updated

Comment on lines +866 to +876
} else if subnetDetails != nil {
// Update status if subnet was found.
return s.updateSubnetStatus(subnetDetails, isControlPlane)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updateSubnetStatus called multiple times, lets optimise later.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There might be a slight improvement possible, but the complexity of lookup calls, dependent on which fields/values are defined, may make the logic more complex to reduce the duplication of updateSubnetStatus.
I could review once VPC support changes are completed.

}

// TODO(cjschaef): Move to webhook validation.
if subnet.Zone == nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check and return early

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

Comment on lines 951 to 953
var ipCount int64 = 256
// We currnetly only support IPv4
ipVersion := "ipv4"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this be moved to constant section in the debugging of the file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I moved to consts.

Add support to reoncile VPC subnets for the new v2
VPC Infrastructure reoncile logic.

Related: kubernetes-sigs#1896
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/provider/ibmcloud Issues or PRs related to ibmcloud provider cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants