@M00nF1sh M00nF1sh commented Oct 21, 2020

This PR contains the following changes:

  1. Optimize memory usage by using an in-memory podInfo repository instead of an Informer.
    • On large clusters with many pods and big podSpecs, the pod Informer can consume a huge amount of memory, which is unnecessary from our controller's perspective.
    • We switch to an in-memory podInfo repository that holds only the pod information we need.
    • Note: with this implementation, memory is still needed to hold all pod specs during startup/resync, but that memory will be GCed. In the future, we can clone each pod during the ListFunc, keeping only the fields we care about, to allow earlier GC.
  2. Fix a race condition when readinessGate is enabled.
    • When we receive endpoints events, the pods behind some unready endpoints may not yet have the containersReady condition and thus cannot be registered. However, since they have a readinessGate, they cannot turn healthy either, so no endpoint change is triggered to give them another chance to be registered. (This occurs more easily when the APIServer is slow.)
    • This PR introduces a new return value, containsPotentialReadyEndpoints, when resolving PodEndpoints; it indicates whether there are unready endpoints that could turn ready if we re-run the reconcile.
    • In the future, we may evaluate using pod events directly.
  3. Rename RequeueAfterError and RequeueError to RequeueNeededAfter and RequeueNeeded to better match their semantics.
    • They are not errors, so these two structures carry a reason instead of an error.
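The podInfo repository from change (1) can be sketched as a small, thread-safe map keyed by namespace/name that stores only the fields the controller cares about, instead of caching whole pod objects the way a shared Informer does. This is an illustrative sketch under assumed names (`podInfo`, `podInfoRepo`); the real repository's field set and API may differ:

```go
package main

import (
	"fmt"
	"sync"
)

// podInfo holds only the fields the controller needs, rather than the
// full pod object an Informer would cache. (Field set is hypothetical.)
type podInfo struct {
	Key             string // namespace/name
	UID             string
	ContainersReady bool
	ReadinessGates  []string
}

// podInfoRepo is a minimal thread-safe in-memory repository,
// populated from watch event handlers.
type podInfoRepo struct {
	mu    sync.RWMutex
	items map[string]podInfo
}

func newPodInfoRepo() *podInfoRepo {
	return &podInfoRepo{items: make(map[string]podInfo)}
}

// Save stores the trimmed-down pod record (on add/update events).
func (r *podInfoRepo) Save(info podInfo) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.items[info.Key] = info
}

// Get returns the record for key and whether it exists.
func (r *podInfoRepo) Get(key string) (podInfo, bool) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	info, ok := r.items[key]
	return info, ok
}

// Delete removes a record when the pod is deleted.
func (r *podInfoRepo) Delete(key string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.items, key)
}

func main() {
	repo := newPodInfoRepo()
	repo.Save(podInfo{Key: "default/app-1", UID: "uid-1", ContainersReady: true})
	info, ok := repo.Get("default/app-1")
	fmt.Println(ok, info.ContainersReady) // true true
}
```

Because each record is a few small fields rather than a full podSpec, the steady-state footprint scales with pod count, not spec size, which matches the memory reduction reported below.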
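The containsPotentialReadyEndpoints flag from change (2) can be sketched as an extra return value from endpoint resolution: when a pod is not yet registrable but carries a readiness gate, the caller should requeue instead of waiting for an endpoint event that the race prevents from ever arriving. The `pod` model and function shape here are hypothetical simplifications, not the controller's actual signatures:

```go
package main

import "fmt"

// pod is a simplified stand-in for the controller's pod view.
type pod struct {
	Key              string
	ContainersReady  bool
	HasReadinessGate bool
}

// resolvePodEndpoints returns the pods that can be registered now, plus
// containsPotentialReadyEndpoints: true when some pod cannot be
// registered yet but could turn ready later, so the reconcile should be
// re-run rather than relying on a future endpoint event.
func resolvePodEndpoints(pods []pod) (ready []pod, containsPotentialReadyEndpoints bool) {
	for _, p := range pods {
		switch {
		case p.ContainersReady:
			ready = append(ready, p)
		case p.HasReadinessGate:
			// Not registrable yet; since its readiness gate blocks it
			// from turning healthy on its own, schedule a requeue.
			containsPotentialReadyEndpoints = true
		}
	}
	return ready, containsPotentialReadyEndpoints
}

func main() {
	ready, retry := resolvePodEndpoints([]pod{
		{Key: "a", ContainersReady: true},
		{Key: "b", HasReadinessGate: true},
	})
	fmt.Println(len(ready), retry) // 1 true
}
```

When the flag is true, the reconciler can return a requeue signal so the unready pod gets re-evaluated once its containers become ready, breaking the deadlock described above.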

Tests done

  1. On a 200-node cluster with 4k pods, each pod with a ~124KB spec, memory usage dropped from 2GB to 100MB.
  2. Verified that the race condition above no longer occurs when podReadinessGate is enabled in a slow-APIServer environment.
  3. log message for requeue: {"level":"debug","ts":1603295614.2622936,"logger":"controllers.targetGroupBinding","msg":"requeue after","duration":15,"reason":"monitor targetHealth"}

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Oct 21, 2020
@M00nF1sh M00nF1sh changed the title [WIP] optimize the memory usage optimize the memory usage and fix bugs Oct 21, 2020
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 21, 2020
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: kishorj, M00nF1sh
To complete the pull request process, please assign after the PR has been reviewed.
You can assign the PR to them by writing /assign in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@M00nF1sh M00nF1sh merged commit 4506ade into kubernetes-sigs:main Oct 21, 2020
Timothy-Dougherty pushed a commit to adammw/aws-load-balancer-controller that referenced this pull request Nov 9, 2023
* add pod repo

* change requeue errors to use reason instead