feat(recommender): add OOMMinBumpUp&OOMBumpUpRatio to CRD #8012
base: master
Conversation
We might want to create a proper AEP for this, but this is the general direction I'm thinking. I can open additional issues to track the specific flags we'd like to support for this type of configuration.
Hey @omerap12 thanks for the PR!
I agree it makes sense to be able to configure the OOM bump behavior on VPA level. There are a few questions on how to implement this, though:
- I'm not sure if we want this to be a configuration on Container level or on Pod level, i.e. should this apply to all Containers controlled by a certain VPA, or should this rather be something that's controlled per individual Container? I think so far we've been mostly offering configurations on Container level, so probably that would also apply here. Or do we have some indication that people who want to configure custom OOM bumps want to do this for all Containers of a Pod in the same way?
- I don't think we should introduce a new configuration type `recommenderConfig`. Technically, all of these properties are configuration options of the recommender (histogram decay options, maxAllowed, minAllowed, which resources to include in the recommendations, etc.), so this doesn't seem like a reasonable way to group things. If we agree to make this configuration Container specific, I'd rather add it to the `ContainerResourcePolicy`.
- Currently, the OOM bump configuration is part of the `AggregationsConfig`, as it is assumed to be globally configured, like all the other options in there. This config is only initialized once, in `main.go`: `model.InitializeAggregationsConfig(model.NewAggregationsConfig(*memoryAggregationInterval, *memoryAggregationIntervalCount, *memoryHistogramDecayHalfLife, *cpuHistogramDecayHalfLife, *oomBumpUpRatio, *oomMinBumpUp))`
- If we do, however, want to make this configurable per VPA, I'd rather opt for pushing this configuration down, rather than adding some if-else to the cluster_feeder that would mean having to find the correct VPA for a Pod every time we add an OOM sample.
- IMHO, a possible place to put these configuration options would be the `aggregate_container_state`, where we already have the necessary methods to re-load the ContainerResourcePolicy options on VPA updates, and then read them in `cluster.go`, right before we add the OOM sample to the ContainerAggregation: `err := containerState.RecordOOM(timestamp, requestedMemory)`
WDYT?
Thanks for the input!
So yep, I agree with all of your suggestions :)
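To make the suggested direction concrete, below is a minimal, hedged sketch of the fallback logic it implies: per-container values kept next to the aggregate container state and resolved against the recommender's global defaults just before the OOM sample is recorded. The type, field, and helper names are assumptions for illustration only, not the actual VPA API; only `RecordOOM` and `ContainerResourcePolicy` come from the discussion above.

```go
// Hypothetical sketch only; field and helper names are assumptions, not the VPA API.
// It illustrates resolving per-container OOM bump settings against the recommender's
// global defaults right before an OOM sample would be recorded.
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

// containerOOMPolicy mirrors the idea of carrying the two knobs on the
// per-container resource policy and re-loading them on VPA updates.
type containerOOMPolicy struct {
	OOMBumpUpRatio *float64           // nil means "use the global flag value"
	OOMMinBumpUp   *resource.Quantity // nil means "use the global flag value"
}

// resolveOOMBump returns the effective ratio and minimum bump (in bytes),
// falling back to the globally configured defaults when a value is unset.
func resolveOOMBump(p *containerOOMPolicy, defaultRatio float64, defaultMinBump int64) (float64, int64) {
	ratio, minBump := defaultRatio, defaultMinBump
	if p != nil && p.OOMBumpUpRatio != nil {
		ratio = *p.OOMBumpUpRatio
	}
	if p != nil && p.OOMMinBumpUp != nil {
		minBump = p.OOMMinBumpUp.Value()
	}
	return ratio, minBump
}

func main() {
	ratio := 1.5
	minBumpQty := resource.MustParse("100Mi")
	perContainer := &containerOOMPolicy{OOMBumpUpRatio: &ratio, OOMMinBumpUp: &minBumpQty}

	// Values a recommender would use just before calling RecordOOM for this container.
	r, m := resolveOOMBump(perContainer, 1.2, 100*1024*1024)
	fmt.Printf("effective ratio=%v, min bump=%d bytes\n", r, m)
}
```

The design point is that the resolved values live next to the aggregate state that `RecordOOM` already uses, so the cluster_feeder never has to look up the owning VPA for every OOM sample.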
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: omerap12. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/remove area provider/cluster-api
/remove-area provider/cluster-api
/label tide/merge-method-squash
I haven't looked at the tests yet, but here are a few small comments
/label api-review
// check that perVPA is on if being used
if err := validatePerVPAFeatureFlag(&policy); err != nil {
	return err
}
This piece of code made me wonder if we should try to move closer to how k/k validates resources. That's something for another day, though.
Yeah, we should create a separate issue to discuss it.
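For illustration, a self-contained, hedged sketch of what the feature-gate check in the excerpt above could look like. Aside from the `validatePerVPAFeatureFlag` name, which appears in the diff, the types, field names, and gate variable are assumptions, not the actual implementation.

```go
// Hypothetical, self-contained sketch: reject per-VPA OOM bump settings when the
// (assumed) feature gate is disabled. The real handler works on the VPA API types.
package main

import "fmt"

type containerPolicy struct {
	ContainerName  string
	OOMBumpUpRatio *float64
	OOMMinBumpUp   *int64
}

type podResourcePolicy struct {
	ContainerPolicies []containerPolicy
}

// perVPAOOMBumpEnabled stands in for an assumed feature gate.
var perVPAOOMBumpEnabled = false

func validatePerVPAFeatureFlag(policy *podResourcePolicy) error {
	if policy == nil || perVPAOOMBumpEnabled {
		return nil
	}
	for _, cp := range policy.ContainerPolicies {
		if cp.OOMBumpUpRatio != nil || cp.OOMMinBumpUp != nil {
			return fmt.Errorf("container %q sets OOM bump options, but the per-VPA OOM bump feature gate is disabled", cp.ContainerName)
		}
	}
	return nil
}

func main() {
	ratio := 1.5
	err := validatePerVPAFeatureFlag(&podResourcePolicy{
		ContainerPolicies: []containerPolicy{{ContainerName: "app", OOMBumpUpRatio: &ratio}},
	})
	fmt.Println(err) // non-nil while the gate is disabled
}
```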
What type of PR is this?
/kind feature
What this PR does / why we need it:
This PR adds new options to the Vertical Pod Autoscaler (VPA) to better handle Out of Memory (OOM) events.
It adds two new settings to the VPA configuration: OOMBumpUpRatio and OOMMinBumpUp.
These settings can be set for each container within a VPA's resource policy.
If they are not set for a specific container, the default values from the VPA recommender are used.
Example:
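A hedged illustration of the shape of the per-container settings; the field names below are assumptions based on the PR title and description, not necessarily the final API.

```go
// Hypothetical illustration only; field names are assumptions based on the PR title.
package main

import "fmt"

// oomPolicy captures the two per-container knobs described above.
type oomPolicy struct {
	ContainerName  string
	OOMBumpUpRatio float64 // multiplier applied to the memory recommendation after an OOM
	OOMMinBumpUp   int64   // minimum absolute memory bump in bytes after an OOM
}

func main() {
	// Bump "app" by a factor of 2 and by at least 256MiB after an OOM;
	// containers without such a policy keep the recommender's global defaults.
	p := oomPolicy{ContainerName: "app", OOMBumpUpRatio: 2.0, OOMMinBumpUp: 256 << 20}
	fmt.Printf("%+v\n", p)
}
```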
Which issue(s) this PR fixes:
part of #7650
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: