
CA: refactor utils related to NodeInfos #7479

Open · wants to merge 1 commit into base: master

Conversation

@towca (Collaborator) commented Nov 7, 2024

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

This is a part of Dynamic Resource Allocation (DRA) support in Cluster Autoscaler. There were multiple very similar utils related to copying and sanitizing NodeInfos scattered around the CA codebase. Instead of adding similar DRA handling to each of them separately, they are consolidated into a single location that will later be adapted to handle DRA.

Which issue(s) this PR fixes:

The CA/DRA integration is tracked in kubernetes/kubernetes#118612, this is just part of the implementation.

Special notes for your reviewer:

The first commit in the PR is just a squash of #7466, and it shouldn't be a part of this review. The PR will be rebased on top of master after #7466 is merged.

This is intended to be a no-op refactor. It was extracted from #7350 after #7447 and #7466.

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

https://github.com/kubernetes/enhancements/blob/9de7f62e16fc5c1ea3bd40689487c9edc7fa5057/keps/sig-node/4381-dra-structured-parameters/README.md

@k8s-ci-robot k8s-ci-robot added the kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. label Nov 7, 2024
@k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: towca

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added area/cluster-autoscaler cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Nov 7, 2024
@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Nov 7, 2024
@towca (Collaborator, Author) commented Nov 7, 2024

/assign @MaciekPytel
/assign @jackfrancis
/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 7, 2024
towca added a commit to towca/autoscaler that referenced this pull request Nov 14, 2024
towca added a commit to towca/autoscaler that referenced this pull request Nov 19, 2024
@towca (Collaborator, Author) commented Nov 19, 2024

/assign @BigDarkClown

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 19, 2024
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 19, 2024
towca added a commit to towca/autoscaler that referenced this pull request Nov 19, 2024
@@ -34,7 +34,7 @@ import (
"k8s.io/autoscaler/cluster-autoscaler/core/scaledown/planner"
scaledownstatus "k8s.io/autoscaler/cluster-autoscaler/core/scaledown/status"
"k8s.io/autoscaler/cluster-autoscaler/core/scaleup"
orchestrator "k8s.io/autoscaler/cluster-autoscaler/core/scaleup/orchestrator"

Contributor commented:

yikes

towca added a commit to towca/autoscaler that referenced this pull request Nov 20, 2024
towca added a commit to towca/autoscaler that referenced this pull request Nov 20, 2024
simulator.BuildNodeInfoForNode, core_utils.GetNodeInfoFromTemplate,
and scheduler_utils.DeepCopyTemplateNode all had very similar logic
for sanitizing and copying NodeInfos. They're all consolidated into
one file in simulator, sharing common logic.

DeepCopyNodeInfo is changed to be a framework.NodeInfo method.

MixedTemplateNodeInfoProvider now correctly uses ClusterSnapshot to
correlate Nodes to scheduled pods, instead of using a live Pod lister.
This means that the snapshot now has to be properly initialized in a
bunch of tests.
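
For orientation, a minimal sketch of what the DeepCopyNodeInfo method could look like on framework.NodeInfo, pieced together from the snippets quoted in this PR; the pod-copying details and the PodInfo/Pods()/resourceapi names are assumptions, not the exact PR code:

// Sketch only: assumes NodeInfo exposes its pods as *PodInfo (with a Pod field)
// and its DRA resource slices as LocalResourceSlices, as the quoted diff suggests.
func (n *NodeInfo) DeepCopy() *NodeInfo {
	var newPods []*PodInfo
	for _, pod := range n.Pods() {
		newPods = append(newPods, &PodInfo{Pod: pod.Pod.DeepCopy()})
	}
	var newSlices []*resourceapi.ResourceSlice
	for _, slice := range n.LocalResourceSlices {
		newSlices = append(newSlices, slice.DeepCopy())
	}
	return NewNodeInfo(n.Node().DeepCopy(), newSlices, newPods...)
}
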
towca added a commit to towca/autoscaler that referenced this pull request Nov 21, 2024
}
nodeInfo, err := simulator.BuildNodeInfoForNode(sanitizedNode, podsForNodes[node.Name], daemonsets, p.forceDaemonSets)
templateNodeInfo, caErr := simulator.TemplateNodeInfoFromExampleNodeInfo(nodeInfo, id, daemonsets, p.forceDaemonSets, taintConfig)
if err != nil {

Contributor commented:

should be if caErr != nil { here?

(also do we need to define a new caErr variable here instead of just re-using err?)
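
For illustration, the suggested shape might be roughly this (a sketch; the enclosing function's return values are assumed):

nodeInfo, err := simulator.BuildNodeInfoForNode(sanitizedNode, podsForNodes[node.Name], daemonsets, p.forceDaemonSets)
if err != nil {
	// Handle the BuildNodeInfoForNode error first (exact handling assumed).
	return nil, err
}
templateNodeInfo, caErr := simulator.TemplateNodeInfoFromExampleNodeInfo(nodeInfo, id, daemonsets, p.forceDaemonSets, taintConfig)
if caErr != nil {
	// Check the error that was actually returned on the line above.
	return nil, caErr
}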

for _, slice := range n.LocalResourceSlices {
newSlices = append(newSlices, slice.DeepCopy())
}
return NewNodeInfo(n.Node().DeepCopy(), newSlices, newPods...)

Contributor commented:

Because the NewNodeInfo constructor only sets a node object if the passed-in node is not nil:

	if node != nil {
		result.schedNodeInfo.SetNode(node)
	}

... invoking n.Node().DeepCopy() inline like this might (theoretically) be subject to a nil pointer exception.
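
For illustration, a nil-guarded variant could look like this (a sketch; apiv1 is assumed to be the usual alias for k8s.io/api/core/v1):

// Only deep-copy the node if one is actually set, mirroring the constructor's nil check.
var nodeCopy *apiv1.Node
if n.Node() != nil {
	nodeCopy = n.Node().DeepCopy()
}
return NewNodeInfo(nodeCopy, newSlices, newPods...)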

id := nodeGroup.Id()
baseNodeInfo, err := nodeGroup.TemplateNodeInfo()
if err != nil {
return nil, errors.ToAutoscalerError(errors.CloudProviderError, err)

Contributor commented:

is this error response too generic?
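
For illustration, one way to make it more specific would be to attach node group context (a sketch; AddPrefix is assumed to be available on AutoscalerError, as it is elsewhere in CA):

baseNodeInfo, err := nodeGroup.TemplateNodeInfo()
if err != nil {
	// Wrap the error with the node group id so the failure is attributable in logs.
	return nil, errors.ToAutoscalerError(errors.CloudProviderError, err).AddPrefix("failed to build a template node info for node group %q: ", nodeGroup.Id())
}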

// TemplateNodeInfoFromNodeGroupTemplate returns a template NodeInfo object based on NodeGroup.TemplateNodeInfo(). The template is sanitized, and only
// contains the pods that should appear on a new Node from the same node group (e.g. DaemonSet pods).
func TemplateNodeInfoFromNodeGroupTemplate(nodeGroup nodeGroupTemplateNodeInfoGetter, daemonsets []*appsv1.DaemonSet, taintConfig taints.TaintConfig) (*framework.NodeInfo, errors.AutoscalerError) {
id := nodeGroup.Id()

Contributor commented:

I don't think we need to assign this to a var

return TemplateNodeInfoFromExampleNodeInfo(baseNodeInfo, id, daemonsets, true, taintConfig)
}
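
For illustration, the call site could simply inline it (a sketch of the suggestion):

return TemplateNodeInfoFromExampleNodeInfo(baseNodeInfo, nodeGroup.Id(), daemonsets, true, taintConfig)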

// TemplateNodeInfoFromExampleNodeInfo returns a template NodeInfo object based on a real example NodeInfo from the cluster. The template is sanitized, and only

Contributor commented:

Not sure if I love the term "example" here. Would TemplateNodeInfoFromNode or TemplateNodeInfoFromRealNode or TemplateNodeInfoFromRealNodeInfo work? Then we'd document it like this:

// TemplateNodeInfoFromNode returns a template NodeInfo object based on a NodeInfo from a real node on the cluster. The template is sanitized, and only
	// We need to sanitize the node before determining the DS pods, since taints are checked there, and
	// we might need to filter some out during sanitization.
	sanitizedNode := sanitizeNodeInfo(realNode, newNodeNameBase, randSuffix, &taintConfig)
// No need to sanitize the expected pods again - they either come from sanitizedNode and were sanitized above,

etc.

My observation is that the word "example" suggests something non-real, like a mock object.

Labels
approved · area/cluster-autoscaler · cncf-cla: yes · do-not-merge/hold · kind/cleanup · size/XXL
Projects
None yet
5 participants