@M00nF1sh M00nF1sh commented Oct 21, 2020

This PR contains the following changes:

  1. Optimize memory usage by using an in-memory podInfo repository instead of an Informer.
    • On large clusters with many pods and big podSpecs, the pod Informer can consume a huge amount of memory, which is unnecessary from our controller's perspective.
    • We switch to an in-memory podInfo repository that holds only the pod information we need.
    • Note: with this implementation, memory is still needed to hold all pod specs during startup/resync, but that memory will be GCed. In the future, we can clone each pod during the ListFunc, keeping only the fields we care about, to allow earlier GC.
  2. Fix a race condition when readinessGate is enabled.
    • When we receive endpoints events, the pods behind some unready endpoints may not yet have the containersReady condition and thus cannot be registered. However, since they have a readinessGate, they cannot turn healthy either, so no endpoint change is triggered to give them another chance to be registered. (This occurs more easily when the APIServer is slow.)
    • This PR introduces a new return value, containsPotentialReadyEndpoints, when resolving PodEndpoints; it indicates whether there are unready endpoints that could turn ready if we re-run the reconcile.
    • In the future, we may evaluate using pod events directly.
  3. Rename RequeueAfterError and RequeueError to RequeueNeededAfter and RequeueNeeded to better match their semantics.
    • They are not errors, so these two structures carry a reason instead of an error.
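The podInfo repository from change (1) can be sketched as a small, thread-safe map keyed by namespace/name that stores only the fields the controller cares about, instead of caching whole pod objects the way a shared Informer does. This is an illustrative sketch under assumed names (`podInfo`, `podInfoRepo`); the real repository's field set and API may differ:

```go
package main

import (
	"fmt"
	"sync"
)

// podInfo holds only the fields the controller needs, rather than the
// full pod object an Informer would cache. (Field set is hypothetical.)
type podInfo struct {
	Key             string // namespace/name
	UID             string
	ContainersReady bool
	ReadinessGates  []string
}

// podInfoRepo is a minimal thread-safe in-memory repository,
// populated from watch event handlers.
type podInfoRepo struct {
	mu    sync.RWMutex
	items map[string]podInfo
}

func newPodInfoRepo() *podInfoRepo {
	return &podInfoRepo{items: make(map[string]podInfo)}
}

// Save stores the trimmed-down pod record (on add/update events).
func (r *podInfoRepo) Save(info podInfo) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.items[info.Key] = info
}

// Get returns the record for key and whether it exists.
func (r *podInfoRepo) Get(key string) (podInfo, bool) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	info, ok := r.items[key]
	return info, ok
}

// Delete removes a record when the pod is deleted.
func (r *podInfoRepo) Delete(key string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.items, key)
}

func main() {
	repo := newPodInfoRepo()
	repo.Save(podInfo{Key: "default/app-1", UID: "uid-1", ContainersReady: true})
	info, ok := repo.Get("default/app-1")
	fmt.Println(ok, info.ContainersReady) // true true
}
```

Because each record is a few small fields rather than a full podSpec, the steady-state footprint scales with pod count, not spec size, which matches the memory reduction reported below.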
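The containsPotentialReadyEndpoints flag from change (2) can be sketched as an extra return value from endpoint resolution: when a pod is not yet registrable but carries a readiness gate, the caller should requeue instead of waiting for an endpoint event that the race prevents from ever arriving. The `pod` model and function shape here are hypothetical simplifications, not the controller's actual signatures:

```go
package main

import "fmt"

// pod is a simplified stand-in for the controller's pod view.
type pod struct {
	Key              string
	ContainersReady  bool
	HasReadinessGate bool
}

// resolvePodEndpoints returns the pods that can be registered now, plus
// containsPotentialReadyEndpoints: true when some pod cannot be
// registered yet but could turn ready later, so the reconcile should be
// re-run rather than relying on a future endpoint event.
func resolvePodEndpoints(pods []pod) (ready []pod, containsPotentialReadyEndpoints bool) {
	for _, p := range pods {
		switch {
		case p.ContainersReady:
			ready = append(ready, p)
		case p.HasReadinessGate:
			// Not registrable yet; since its readiness gate blocks it
			// from turning healthy on its own, schedule a requeue.
			containsPotentialReadyEndpoints = true
		}
	}
	return ready, containsPotentialReadyEndpoints
}

func main() {
	ready, retry := resolvePodEndpoints([]pod{
		{Key: "a", ContainersReady: true},
		{Key: "b", HasReadinessGate: true},
	})
	fmt.Println(len(ready), retry) // 1 true
}
```

When the flag is true, the reconciler can return a requeue signal so the unready pod gets re-evaluated once its containers become ready, breaking the deadlock described above.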

Tests done

  1. On a 200-node cluster with 4k pods, each pod with a ~124KB spec, memory usage dropped from 2GB to 100MB.
  2. Verified that the race condition above no longer occurs when podReadinessGate is enabled in a slow-APIServer environment.
  3. log message for requeue: {"level":"debug","ts":1603295614.2622936,"logger":"controllers.targetGroupBinding","msg":"requeue after","duration":15,"reason":"monitor targetHealth"}

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Oct 21, 2020
@M00nF1sh M00nF1sh changed the title [WIP] optimize the memory usage optimize the memory usage and fix bugs Oct 21, 2020
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 21, 2020
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: kishorj, M00nF1sh
To complete the pull request process, please assign after the PR has been reviewed.
You can assign the PR to them by writing /assign in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@M00nF1sh M00nF1sh merged commit 4506ade into kubernetes-sigs:main Oct 21, 2020
Timothy-Dougherty pushed a commit to adammw/aws-load-balancer-controller that referenced this pull request Nov 9, 2023
* add pod repo

* change requeue errors to use reason instead