Multiple errors after upgrade #5671

@josemrs

After checking multiple breaking changes, I thought I had it under control; apparently not.

We run EKS 1.32 AWSManagedControlPlanes with 1.32 AWSManagedMachinePools and custom AL2 AMIs.

The upgrade was going to be in two stages: first to the latest v1beta1 release, then to the latest v1beta2 release, as recommended here.

So I did:

./clusterctl-v1.10.6 upgrade plan

Checking new release availability...

Latest release available for the v1beta1 API Version of Cluster API (contract):

NAME                    NAMESPACE                           TYPE                     CURRENT VERSION   NEXT VERSION
bootstrap-kubeadm       capi-kubeadm-bootstrap-system       BootstrapProvider        v1.7.3            v1.10.6
control-plane-kubeadm   capi-kubeadm-control-plane-system   ControlPlaneProvider     v1.7.3            v1.10.6
cluster-api             capi-system                         CoreProvider             v1.7.3            v1.10.6
infrastructure-aws      capa-system                         InfrastructureProvider   v2.5.2            v2.9.1

You can now apply the upgrade by executing the following command:

clusterctl upgrade apply --contract v1beta1

So I ran the upgrade command for the intermediate upgrade and everything was upgraded. However, both CAPI and CAPA started constantly logging reconciliation and connection errors.

Perhaps it is this, but I thought I had it under control because of this.

These are the logs. I tried to pick only the ones for one particular cluster; we have almost 30, all failing like this.

Logs from capa-controller-manager
I0919 10:58:42.605598       1 awsmanagedmachinepool_controller.go:202] "Reconciling AWSManagedMachinePool" controller="awsmanagedmachinepool" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSManagedMachinePool" AWSManagedMachinePool="prod/services-prod-pool-ap-southeast-2a" namespace="prod" name="services-prod-pool-ap-southeast-2a" reconcileID="82996b04-ef8f-4b26-b570-95f5010121cb" MachinePool="prod/services-prod-pool-ap-southeast-2a" cluster="prod/services.REDACTED"
I0919 10:58:42.605729       1 launchtemplate.go:81] "checking for existing launch template" controller="awsmanagedmachinepool" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSManagedMachinePool" AWSManagedMachinePool="prod/services-prod-pool-ap-southeast-2a" namespace="prod" name="services-prod-pool-ap-southeast-2a" reconcileID="82996b04-ef8f-4b26-b570-95f5010121cb" MachinePool="prod/services-prod-pool-ap-southeast-2a" cluster="prod/services.REDACTED"
[...]
I0919 10:58:45.429754       1 tags.go:128] "Reconciling ASG tags" controller="awsmanagedmachinepool" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSManagedMachinePool" AWSManagedMachinePool="prod/services-prod-pool-ap-southeast-2a" namespace="prod" name="services-prod-pool-ap-southeast-2a" reconcileID="82996b04-ef8f-4b26-b570-95f5010121cb" MachinePool="prod/services-prod-pool-ap-southeast-2a" cluster="prod/services.REDACTED" cluster-name="services_ap-southeast-2_prod_alienvault_cloud" nodegroup-name="services-prod-pool-ap-southeast-2a"
Logs from capi-controller-manager
E0919 11:01:39.644472       1 controller.go:347] "Reconciler error" err="Object prod/services.REDACTED is already owned by another MachinePool controller services-prod-pool-prometheus-ap-southeast-2" controller="machinepool" controllerGroup="cluster.x-k8s.io" controllerKind="MachinePool" MachinePool="prod/services-prod-pool-ap-southeast-2b" namespace="prod" name="services-prod-pool-ap-southeast-2b" reconcileID="dd96348e-37dc-4d9d-90f8-33b72cca5aa1"
E0919 11:01:42.691574       1 controller.go:347] "Reconciler error" err="Object prod/services.REDACTED is already owned by another MachinePool controller services-prod-pool-prometheus-ap-southeast-2" controller="machinepool" controllerGroup="cluster.x-k8s.io" controllerKind="MachinePool" MachinePool="prod/services-prod-pool-ap-southeast-2c" namespace="prod" name="services-prod-pool-ap-southeast-2c" reconcileID="4b104a11-3d94-401f-b227-c89eceb45e71"
E0919 11:01:44.009112       1 controller.go:347] "Reconciler error" err="Object prod/services.REDACTED is already owned by another MachinePool controller services-prod-pool-prometheus-ap-southeast-2" controller="machinepool" controllerGroup="cluster.x-k8s.io" controllerKind="MachinePool" MachinePool="prod/services-prod-pool-ap-southeast-2c" namespace="prod" name="services-prod-pool-ap-southeast-2c" reconcileID="35240758-7625-420d-85cc-517b095fa4f4"
E0919 11:01:52.674593       1 controller.go:347] "Reconciler error" err="Object prod/services.REDACTED is already owned by another MachinePool controller services-prod-pool-prometheus-ap-southeast-2" controller="machinepool" controllerGroup="cluster.x-k8s.io" controllerKind="MachinePool" MachinePool="prod/services-prod-pool-ap-southeast-2a" namespace="prod" name="services-prod-pool-ap-southeast-2a" reconcileID="5cd6d5a9-452a-474b-bcff-09ad0e98e6a1"
E0919 11:01:52.952752       1 controller.go:347] "Reconciler error" err="Object prod/services.REDACTED is already owned by another MachinePool controller services-prod-pool-prometheus-ap-southeast-2" controller="machinepool" controllerGroup="cluster.x-k8s.io" controllerKind="MachinePool" MachinePool="prod/services-prod-pool-ap-southeast-2a" namespace="prod" name="services-prod-pool-ap-southeast-2a" reconcileID="36a5298a-d1d2-4e8c-a7e3-da275b13d90b"
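
For triage, the repeated "Reconciler error" lines above can be collapsed into one entry per distinct error message, listing which MachinePools report it. The following is a minimal sketch, assuming klog-style `key="value"` fields as they appear in these logs; the function name and the shortened sample lines are illustrative and not part of any CAPI tooling.

```python
import re
from collections import defaultdict

# klog structured fields look like key="value"; values may contain escaped quotes.
FIELD = re.compile(r'([A-Za-z][\w-]*)="((?:[^"\\]|\\.)*)"')


def group_reconciler_errors(lines):
    """Map each distinct err message to the sorted MachinePools that logged it."""
    groups = defaultdict(set)
    for line in lines:
        if '"Reconciler error"' not in line:
            continue  # skip info-level and unrelated lines
        fields = dict(FIELD.findall(line))
        if "err" in fields:
            groups[fields["err"]].add(fields.get("MachinePool", "<unknown>"))
    return {err: sorted(pools) for err, pools in groups.items()}


# Shortened copies of the log lines above (trailing fields dropped for brevity).
_ERR = (
    'err="Object prod/services.REDACTED is already owned by another MachinePool '
    'controller services-prod-pool-prometheus-ap-southeast-2" controller="machinepool" '
)
SAMPLE = [
    'E0919 11:01:39.644472       1 controller.go:347] "Reconciler error" '
    + _ERR + 'MachinePool="prod/services-prod-pool-ap-southeast-2b"',
    'E0919 11:01:42.691574       1 controller.go:347] "Reconciler error" '
    + _ERR + 'MachinePool="prod/services-prod-pool-ap-southeast-2c"',
    'E0919 11:01:52.674593       1 controller.go:347] "Reconciler error" '
    + _ERR + 'MachinePool="prod/services-prod-pool-ap-southeast-2a"',
]

if __name__ == "__main__":
    for err, pools in group_reconciler_errors(SAMPLE).items():
        print(f"{len(pools)} pool(s): {err}")
        for pool in pools:
            print(f"  {pool}")
```

On the lines shown here, every error carries the same message and names the same owning MachinePool (the prometheus pool), while the 2a, 2b and 2c pools are the ones failing.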
Logs from capi-kubeadm-bootstrap-controller-manager
I0919 10:57:44.297447       1 cluster_accessor.go:320] "Disconnecting" controller="clustercache" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" Cluster="prod/services.REDACTED" namespace="prod" name="services.REDACTED" reconcileID="de112319-22c9-4bc8-a248-da3869cb4f13"
I0919 10:57:44.297492       1 cluster_accessor.go:327] "Disconnected" controller="clustercache" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" Cluster="prod/services.REDACTED" namespace="prod" name="services.REDACTED" reconcileID="de112319-22c9-4bc8-a248-da3869cb4f13"
I0919 10:57:44.298712       1 cluster_accessor.go:252] "Connecting" controller="clustercache" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" Cluster="prod/services.REDACTED" namespace="prod" name="services.REDACTED" reconcileID="b212685a-8419-4acd-8ff3-7d893b41a2e3"
I0919 10:57:47.933214       1 cluster_accessor.go:274] "Connected" controller="clustercache" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" Cluster="prod/services.REDACTED" namespace="prod" name="services.REDACTED" reconcileID="b212685a-8419-4acd-8ff3-7d893b41a2e3"
Logs from capi-kubeadm-control-plane-system
I0919 11:00:09.828007       1 cluster_accessor.go:320] "Disconnecting" controller="clustercache" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" Cluster="prod/services.ap-southeast-2.prod.alienvault.cloud" namespace="prod" name="services.ap-southeast-2.prod.alienvault.cloud" reconcileID="f74b3271-9d4b-4b6a-95a7-7abe21839a7b"
I0919 11:00:09.828056       1 cluster_accessor.go:327] "Disconnected" controller="clustercache" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" Cluster="prod/services.ap-southeast-2.prod.alienvault.cloud" namespace="prod" name="services.ap-southeast-2.prod.alienvault.cloud" reconcileID="f74b3271-9d4b-4b6a-95a7-7abe21839a7b"
I0919 11:00:09.829332       1 cluster_accessor.go:252] "Connecting" controller="clustercache" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" Cluster="prod/services.ap-southeast-2.prod.alienvault.cloud" namespace="prod" name="services.ap-southeast-2.prod.alienvault.cloud" reconcileID="95222f01-14a5-4e4b-bec3-372e95d9b983"
I0919 11:00:13.479651       1 cluster_accessor.go:274] "Connected" controller="clustercache" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" Cluster="prod/services.ap-southeast-2.prod.alienvault.cloud" namespace="prod" name="services.ap-southeast-2.prod.alienvault.cloud" reconcileID="95222f01-14a5-4e4b-bec3-372e95d9b983"
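
The clustercache lines above are info-level disconnect and reconnect cycles rather than errors. To put a number on the churn, each "Connecting" line can be paired with its "Connected" line by `reconcileID`. This is a rough sketch, assuming the klog header format seen in these logs; klog omits the year, so the code assumes an arbitrary leap year (2000) purely to make the timestamps parseable. The names are illustrative.

```python
import re
from datetime import datetime

# klog header, quoted message, then the Cluster and reconcileID fields.
LINE = re.compile(
    r'^[IWEF](\d{4} \d{2}:\d{2}:\d{2}\.\d{6})\s+\d+\s+\S+\]\s+"(\w+)".*?'
    r'Cluster="([^"]+)".*?reconcileID="([^"]+)"'
)

ASSUMED_YEAR = "2000"  # assumption: klog lines carry no year; any leap year works


def connect_durations(lines):
    """Return [(cluster, seconds)] for each Connecting -> Connected pair,
    matched on reconcileID."""
    started = {}
    durations = []
    for line in lines:
        m = LINE.match(line)
        if not m:
            continue
        stamp, msg, cluster, rid = m.groups()
        ts = datetime.strptime(ASSUMED_YEAR + stamp, "%Y%m%d %H:%M:%S.%f")
        if msg == "Connecting":
            started[rid] = ts
        elif msg == "Connected" and rid in started:
            elapsed = (ts - started.pop(rid)).total_seconds()
            durations.append((cluster, elapsed))
    return durations


# Shortened copies of two lines from the bootstrap controller log above.
SAMPLE_CONNECT = [
    'I0919 10:57:44.298712       1 cluster_accessor.go:252] "Connecting" '
    'controller="clustercache" controllerKind="Cluster" '
    'Cluster="prod/services.REDACTED" reconcileID="b212685a-8419-4acd-8ff3-7d893b41a2e3"',
    'I0919 10:57:47.933214       1 cluster_accessor.go:274] "Connected" '
    'controller="clustercache" controllerKind="Cluster" '
    'Cluster="prod/services.REDACTED" reconcileID="b212685a-8419-4acd-8ff3-7d893b41a2e3"',
]

if __name__ == "__main__":
    for cluster, seconds in connect_durations(SAMPLE_CONNECT):
        print(f"{cluster}: connected after {seconds:.3f}s")
```

Counting the pairs per cluster over a longer log window would show how often each workload cluster is being reconnected.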

This is the config of this particular cluster:

ap-southeast-2 cluster YAML
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: services.REDACTED
  namespace: prod
  annotations:
    argocd.argoproj.io/sync-wave: "0"
spec:
  clusterNetwork:
    pods:
      cidrBlocks:
      - 192.168.0.0/16
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v2beta2
    kind: AWSManagedControlPlane
    name: services.REDACTED
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
    kind: AWSManagedCluster
    name: services.REDACTED

---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
kind: AWSManagedCluster
metadata:
  name: services.REDACTED
  namespace: prod
  annotations:
    argocd.argoproj.io/sync-wave: "10"
spec: {}

---
apiVersion: controlplane.cluster.x-k8s.io/v1beta2
kind: AWSManagedControlPlane
metadata:
  name: services.REDACTED
  namespace: prod
  annotations:
    argocd.argoproj.io/sync-wave: "20"
spec:
  associateOIDCProvider: true
  eksClusterName: services_REDACTED_1
  region: ap-southeast-2
  version: v1.32.0
  network:
    vpc:
      id: vpc-XXXXXXXXXX
    subnets:
    - id: subnet-X
    - id: subnet-Y
    - id: subnet-Z
    securityGroupOverrides: 
      node-eks-additional: sg-W
  endpointAccess:
    private: true
    public: false
  bastion:
    enabled: false
  oidcIdentityProviderConfig:
    identityProviderConfigName: Okta
    issuerUrl: https://.okta.com/oauth2/XXXXXXXXXXXX
    clientId: XXXXXXXXX
    usernameClaim: preferred_username
    groupsClaim: groups
    groupsPrefix: "okta:"
  logging:
    apiServer: false
    controllerManager: false
    audit: false
    authenticator: false
    scheduler: false
  iamAuthenticatorConfig:
    mapRoles:
    - username: "kubernetes-admin"
      rolearn: "arn:aws:iam::XXXXXXXXXXXX:role/saas-OktaAdmins"
      groups:
      - "system:masters"
  addons:
  - name: "kube-proxy"
    version: "v1.32.6-eksbuild.6"
    conflictResolution: "overwrite"
  - name: "vpc-cni"
    version: "v1.20.1-eksbuild.1"
    conflictResolution: "overwrite"
  - name: "aws-ebs-csi-driver"
    version: "v1.48.0-eksbuild.1"
    conflictResolution: "overwrite"
    serviceAccountRoleARN: "arn:aws:iam::XXXXXXXXXXXX:role/prod-AmazonEKS_EBS_CSI_DriverRole"
  vpcCni:
    env:
    - name: POD_SECURITY_GROUP_ENFORCING_MODE
      value: standard
    - name: ENABLE_POD_ENI
      value: "true"
    - name: ENABLE_PREFIX_DELEGATION
      value: "true"
  additionalTags:
    Owner: "EngOps"
    created_by: "https://bitbucket.org/redacted/capi-cluster"
    Environment: "prod"
  identityRef:
    kind: AWSClusterRoleIdentity
    name: prod
  roleAdditionalPolicies:
  - arn:aws:iam::aws:policy/AmazonEKSVPCResourceController
---
apiVersion: bootstrap.cluster.x-k8s.io/v1beta2
kind: EKSConfig
metadata:
  name: services.REDACTED
  namespace: prod
spec:
  boostrapCommandOverride: "# Self-bootstrap embedded in AMI, doing nothing here for cluster"
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachinePool
metadata:
  name: services-prod-pool-prometheus-ap-southeast-2
  namespace: prod
  annotations:
    cluster.x-k8s.io/replicas-managed-by: "external-autoscaler"
    argocd.argoproj.io/sync-wave: "30"
spec:
  clusterName: services.REDACTED
  replicas: 2
  failureDomains:
  - ap-southeast-2a
  - ap-southeast-2b
  template:
    spec:
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta2
          kind: EKSConfig
          name: services.REDACTED
          namespace: prod
        dataSecretName: services.REDACTED
      clusterName: services.REDACTED
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
        kind: AWSManagedMachinePool
        name: services-prod-pool-prometheus-ap-southeast-2

---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
kind: AWSManagedMachinePool
metadata:
  name: services-prod-pool-prometheus-ap-southeast-2
  namespace: prod
  annotations:
    argocd.argoproj.io/sync-wave: "30"
spec:
  eksNodegroupName: services-prod-pool-prometheus
  availabilityZones:
  - ap-southeast-2a
  - ap-southeast-2b
  scaling:
    minSize: 2
    maxSize: 4
  updateConfig:
    maxUnavailable: 1
  awsLaunchTemplate:
    instanceType: m5.large
    ami:
      id: ami-YYYYYY
  labels:
    usm.io/role: prometheus
  taints:
  - key: dedicated
    effect: no-schedule
    value: prometheus
  subnetIDs:
  - subnet-X
  - subnet-Y
  roleAdditionalPolicies:
  - arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore

---
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachinePool
metadata:
  name: services-prod-pool-ap-southeast-2a
  namespace: prod
  annotations:
    cluster.x-k8s.io/replicas-managed-by: "external-autoscaler"
    argocd.argoproj.io/sync-wave: "40"
spec:
  clusterName: services.REDACTED
  replicas: 2
  failureDomains:
  - ap-southeast-2a
  template:
    spec:
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta2
          kind: EKSConfig
          name: services.REDACTED
          namespace: prod
        dataSecretName: services.REDACTED
      clusterName: services.REDACTED
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
        kind: AWSManagedMachinePool
        name: services-prod-pool-ap-southeast-2a

---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
kind: AWSManagedMachinePool
metadata:
  name: services-prod-pool-ap-southeast-2a
  namespace: prod
  annotations:
    argocd.argoproj.io/sync-wave: "40"
spec:
  eksNodegroupName: services-prod-pool-ap-southeast-2a
  availabilityZones:
  - ap-southeast-2a
  scaling:
    minSize: 2
    maxSize: 25
  updateConfig:
    maxUnavailablePercentage: 40
  subnetIDs:
  - subnet-X
  awsLaunchTemplate:
    instanceType: m5.xlarge
    ami:
      id: ami-YYYYYY
  roleAdditionalPolicies:
  - arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore

---
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachinePool
metadata:
  name: services-prod-pool-ap-southeast-2b
  namespace: prod
  annotations:
    cluster.x-k8s.io/replicas-managed-by: "external-autoscaler"
    argocd.argoproj.io/sync-wave: "41"
spec:
  clusterName: services.REDACTED
  replicas: 2
  failureDomains:
  - ap-southeast-2b
  template:
    spec:
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta2
          kind: EKSConfig
          name: services.REDACTED
          namespace: prod
        dataSecretName: services.REDACTED
      clusterName: services.REDACTED
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
        kind: AWSManagedMachinePool
        name: services-prod-pool-ap-southeast-2b

---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
kind: AWSManagedMachinePool
metadata:
  name: services-prod-pool-ap-southeast-2b
  namespace: prod
  annotations:
    argocd.argoproj.io/sync-wave: "41"
spec:
  eksNodegroupName: services-prod-pool-ap-southeast-2b
  availabilityZones:
  - ap-southeast-2b
  scaling:
    minSize: 2
    maxSize: 25
  updateConfig:
    maxUnavailablePercentage: 40
  subnetIDs:
  - subnet-Y
  awsLaunchTemplate:
    instanceType: m5.xlarge
    ami:
      id: ami-YYYYYY
  roleAdditionalPolicies:
  - arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore

---
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachinePool
metadata:
  name: services-prod-pool-ap-southeast-2c
  namespace: prod
  annotations:
    cluster.x-k8s.io/replicas-managed-by: "external-autoscaler"
    argocd.argoproj.io/sync-wave: "42"
spec:
  clusterName: services.REDACTED
  replicas: 2
  failureDomains:
  - ap-southeast-2c
  template:
    spec:
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta2
          kind: EKSConfig
          name: services.REDACTED
          namespace: prod
        dataSecretName: services.REDACTED
      clusterName: services.REDACTED
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
        kind: AWSManagedMachinePool
        name: services-prod-pool-ap-southeast-2c

---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
kind: AWSManagedMachinePool
metadata:
  name: services-prod-pool-ap-southeast-2c
  namespace: prod
  annotations:
    argocd.argoproj.io/sync-wave: "42"
spec:
  eksNodegroupName: services-prod-pool-ap-southeast-2c
  availabilityZones:
  - ap-southeast-2c
  scaling:
    minSize: 2
    maxSize: 25
  updateConfig:
    maxUnavailablePercentage: 40
  subnetIDs:
  - subnet-Z
  awsLaunchTemplate:
    instanceType: m5.xlarge
    ami:
      id: ami-YYYYYY
  roleAdditionalPolicies:
  - arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
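
One thing that stands out in the manifests above: all four MachinePools point their `bootstrap.configRef` at the same EKSConfig, `prod/services.REDACTED`, which matches the object named in the "already owned by another MachinePool controller" error. Several other objects share that name, so whether this sharing is what triggers the error is an assumption, not a confirmed root cause. The sketch below only checks manifests for bootstrap configs referenced by more than one MachinePool. It works on already-parsed manifests (plain dicts), and `machine_pool` is a hypothetical helper that builds just the fields the check reads.

```python
from collections import defaultdict


def shared_bootstrap_refs(manifests):
    """Return {(kind, namespace, name): [MachinePool names]} for every
    bootstrap configRef that more than one MachinePool points at."""
    users = defaultdict(list)
    for manifest in manifests:
        if manifest.get("kind") != "MachinePool":
            continue
        meta = manifest.get("metadata", {})
        ref = (
            manifest.get("spec", {})
            .get("template", {})
            .get("spec", {})
            .get("bootstrap", {})
            .get("configRef")
        )
        if not ref:
            continue
        # Fall back to the MachinePool's namespace if the ref omits one.
        namespace = ref.get("namespace", meta.get("namespace"))
        users[(ref.get("kind"), namespace, ref.get("name"))].append(meta.get("name"))
    return {key: pools for key, pools in users.items() if len(pools) > 1}


def machine_pool(name):
    # Hypothetical helper: mirrors only the relevant part of the YAML above.
    return {
        "kind": "MachinePool",
        "metadata": {"name": name, "namespace": "prod"},
        "spec": {"template": {"spec": {"bootstrap": {"configRef": {
            "kind": "EKSConfig",
            "name": "services.REDACTED",
            "namespace": "prod",
        }}}}},
    }


POOLS = [
    machine_pool("services-prod-pool-prometheus-ap-southeast-2"),
    machine_pool("services-prod-pool-ap-southeast-2a"),
    machine_pool("services-prod-pool-ap-southeast-2b"),
    machine_pool("services-prod-pool-ap-southeast-2c"),
]

if __name__ == "__main__":
    for (kind, namespace, name), pools in shared_bootstrap_refs(POOLS).items():
        print(f"{kind} {namespace}/{name} is referenced by {len(pools)} MachinePools:")
        for pool in pools:
            print(f"  {pool}")
```

If the sharing does turn out to matter, giving each MachinePool its own EKSConfig would be one thing to try; that is a suggestion to verify against the CAPI and CAPA release notes, not a known fix.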
