Open
Labels
needs-priority, needs-triage
Description
After checking multiple breaking changes, I thought I had it under control, but apparently not.
We run EKS 1.32 AWSManagedControlPlanes with 1.32 AWSManagedMachinePools on custom AL2 AMIs.
The upgrade was planned in two stages: first to the latest v1beta1 release, then to the latest v1beta2, as recommended here.
So I ran:
./clusterctl-v1.10.6 upgrade plan
Checking new release availability...
Latest release available for the v1beta1 API Version of Cluster API (contract):
NAME                    NAMESPACE                           TYPE                     CURRENT VERSION   NEXT VERSION
bootstrap-kubeadm       capi-kubeadm-bootstrap-system       BootstrapProvider        v1.7.3            v1.10.6
control-plane-kubeadm   capi-kubeadm-control-plane-system   ControlPlaneProvider     v1.7.3            v1.10.6
cluster-api             capi-system                         CoreProvider             v1.7.3            v1.10.6
infrastructure-aws      capa-system                         InfrastructureProvider   v2.5.2            v2.9.1
You can now apply the upgrade by executing the following command:
clusterctl upgrade apply --contract v1beta1
So I ran the upgrade command for the intermediate upgrade, and everything was upgraded. However, both CAPI and CAPA started complaining constantly about reconciliation and connection errors.
Perhaps the cause is this, but I thought I had it under control because of this.
These are the logs. I tried to pick only the ones for one particular cluster; we have almost 30 clusters, all failing like this.
Logs from capa-controller-manager
I0919 10:58:42.605598 1 awsmanagedmachinepool_controller.go:202] "Reconciling AWSManagedMachinePool" controller="awsmanagedmachinepool" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSManagedMachinePool" AWSManagedMachinePool="prod/services-prod-pool-ap-southeast-2a" namespace="prod" name="services-prod-pool-ap-southeast-2a" reconcileID="82996b04-ef8f-4b26-b570-95f5010121cb" MachinePool="prod/services-prod-pool-ap-southeast-2a" cluster="prod/services.REDACTED"
I0919 10:58:42.605729 1 launchtemplate.go:81] "checking for existing launch template" controller="awsmanagedmachinepool" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSManagedMachinePool" AWSManagedMachinePool="prod/services-prod-pool-ap-southeast-2a" namespace="prod" name="services-prod-pool-ap-southeast-2a" reconcileID="82996b04-ef8f-4b26-b570-95f5010121cb" MachinePool="prod/services-prod-pool-ap-southeast-2a" cluster="prod/services.REDACTED"
[...]
I0919 10:58:45.429754 1 tags.go:128] "Reconciling ASG tags" controller="awsmanagedmachinepool" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSManagedMachinePool" AWSManagedMachinePool="prod/services-prod-pool-ap-southeast-2a" namespace="prod" name="services-prod-pool-ap-southeast-2a" reconcileID="82996b04-ef8f-4b26-b570-95f5010121cb" MachinePool="prod/services-prod-pool-ap-southeast-2a" cluster="prod/services.REDACTED" cluster-name="services_ap-southeast-2_prod_alienvault_cloud" nodegroup-name="services-prod-pool-ap-southeast-2a"
Logs from capi-controller-manager
E0919 11:01:39.644472 1 controller.go:347] "Reconciler error" err="Object prod/services.REDACTED is already owned by another MachinePool controller services-prod-pool-prometheus-ap-southeast-2" controller="machinepool" controllerGroup="cluster.x-k8s.io" controllerKind="MachinePool" MachinePool="prod/services-prod-pool-ap-southeast-2b" namespace="prod" name="services-prod-pool-ap-southeast-2b" reconcileID="dd96348e-37dc-4d9d-90f8-33b72cca5aa1"
E0919 11:01:42.691574 1 controller.go:347] "Reconciler error" err="Object prod/services.REDACTED is already owned by another MachinePool controller services-prod-pool-prometheus-ap-southeast-2" controller="machinepool" controllerGroup="cluster.x-k8s.io" controllerKind="MachinePool" MachinePool="prod/services-prod-pool-ap-southeast-2c" namespace="prod" name="services-prod-pool-ap-southeast-2c" reconcileID="4b104a11-3d94-401f-b227-c89eceb45e71"
E0919 11:01:44.009112 1 controller.go:347] "Reconciler error" err="Object prod/services.REDACTED is already owned by another MachinePool controller services-prod-pool-prometheus-ap-southeast-2" controller="machinepool" controllerGroup="cluster.x-k8s.io" controllerKind="MachinePool" MachinePool="prod/services-prod-pool-ap-southeast-2c" namespace="prod" name="services-prod-pool-ap-southeast-2c" reconcileID="35240758-7625-420d-85cc-517b095fa4f4"
E0919 11:01:52.674593 1 controller.go:347] "Reconciler error" err="Object prod/services.REDACTED is already owned by another MachinePool controller services-prod-pool-prometheus-ap-southeast-2" controller="machinepool" controllerGroup="cluster.x-k8s.io" controllerKind="MachinePool" MachinePool="prod/services-prod-pool-ap-southeast-2a" namespace="prod" name="services-prod-pool-ap-southeast-2a" reconcileID="5cd6d5a9-452a-474b-bcff-09ad0e98e6a1"
E0919 11:01:52.952752 1 controller.go:347] "Reconciler error" err="Object prod/services.REDACTED is already owned by another MachinePool controller services-prod-pool-prometheus-ap-southeast-2" controller="machinepool" controllerGroup="cluster.x-k8s.io" controllerKind="MachinePool" MachinePool="prod/services-prod-pool-ap-southeast-2a" namespace="prod" name="services-prod-pool-ap-southeast-2a" reconcileID="36a5298a-d1d2-4e8c-a7e3-da275b13d90b"
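Every error above names the same conflicting owner. As a rough aid for triaging this across many clusters, here is a minimal sketch that groups such lines by the object, the reported owner, and the affected MachinePool. It assumes the log lines keep the key="value" layout shown above; the function name `summarize_owner_conflicts` is hypothetical and not part of any CAPI tooling.

```python
import re
from collections import defaultdict

# Matches the "already owned by another MachinePool controller" message and
# the MachinePool key that follows it on the same log line.
CONFLICT_RE = re.compile(
    r'err="Object (?P<object>\S+) is already owned by another MachinePool '
    r'controller (?P<owner>[^"]+)".*?MachinePool="(?P<pool>[^"]+)"'
)


def summarize_owner_conflicts(lines):
    """Return {(object, owner): {pool: count}} for ownership-conflict errors."""
    summary = defaultdict(lambda: defaultdict(int))
    for line in lines:
        match = CONFLICT_RE.search(line)
        if match:
            key = (match["object"], match["owner"])
            summary[key][match["pool"]] += 1
    # Convert nested defaultdicts to plain dicts for easier printing/comparison.
    return {key: dict(pools) for key, pools in summary.items()}
```

Feeding it the capi-controller-manager log (for example, the lines read from a saved log file) gives one entry per conflicting object, with a count of errors per MachinePool.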
Logs from capi-kubeadm-bootstrap-controller-manager
I0919 10:57:44.297447 1 cluster_accessor.go:320] "Disconnecting" controller="clustercache" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" Cluster="prod/services.REDACTED" namespace="prod" name="services.REDACTED" reconcileID="de112319-22c9-4bc8-a248-da3869cb4f13"
I0919 10:57:44.297492 1 cluster_accessor.go:327] "Disconnected" controller="clustercache" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" Cluster="prod/services.REDACTED" namespace="prod" name="services.REDACTED" reconcileID="de112319-22c9-4bc8-a248-da3869cb4f13"
I0919 10:57:44.298712 1 cluster_accessor.go:252] "Connecting" controller="clustercache" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" Cluster="prod/services.REDACTED" namespace="prod" name="services.REDACTED" reconcileID="b212685a-8419-4acd-8ff3-7d893b41a2e3"
I0919 10:57:47.933214 1 cluster_accessor.go:274] "Connected" controller="clustercache" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" Cluster="prod/services.REDACTED" namespace="prod" name="services.REDACTED" reconcileID="b212685a-8419-4acd-8ff3-7d893b41a2e3"
Logs from capi-kubeadm-control-plane-system
I0919 11:00:09.828007 1 cluster_accessor.go:320] "Disconnecting" controller="clustercache" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" Cluster="prod/services.ap-southeast-2.prod.alienvault.cloud" namespace="prod" name="services.ap-southeast-2.prod.alienvault.cloud" reconcileID="f74b3271-9d4b-4b6a-95a7-7abe21839a7b"
I0919 11:00:09.828056 1 cluster_accessor.go:327] "Disconnected" controller="clustercache" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" Cluster="prod/services.ap-southeast-2.prod.alienvault.cloud" namespace="prod" name="services.ap-southeast-2.prod.alienvault.cloud" reconcileID="f74b3271-9d4b-4b6a-95a7-7abe21839a7b"
I0919 11:00:09.829332 1 cluster_accessor.go:252] "Connecting" controller="clustercache" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" Cluster="prod/services.ap-southeast-2.prod.alienvault.cloud" namespace="prod" name="services.ap-southeast-2.prod.alienvault.cloud" reconcileID="95222f01-14a5-4e4b-bec3-372e95d9b983"
I0919 11:00:13.479651 1 cluster_accessor.go:274] "Connected" controller="clustercache" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" Cluster="prod/services.ap-southeast-2.prod.alienvault.cloud" namespace="prod" name="services.ap-southeast-2.prod.alienvault.cloud" reconcileID="95222f01-14a5-4e4b-bec3-372e95d9b983"
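The clustercache entries above show a disconnect immediately followed by a reconnect. To see how long each reconnect takes, here is a small sketch that pairs "Connecting" and "Connected" lines per cluster. It assumes the timestamp layout shown above and that both lines of a pair fall on the same day (the log prefix carries no year, and midnight rollover is not handled); `reconnect_durations` is a hypothetical helper name.

```python
import re
from datetime import datetime

# Log prefix, quoted message, then the Cluster="..." key, as in the lines above.
LINE_RE = re.compile(
    r'^[IWEF]\d{4} (?P<time>\d{2}:\d{2}:\d{2}\.\d{6}) .*?\] "(?P<msg>[^"]+)"'
    r'.*?Cluster="(?P<cluster>[^"]+)"'
)


def reconnect_durations(lines):
    """Return (cluster, seconds) for each "Connecting" -> "Connected" pair."""
    started = {}
    durations = []
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue
        stamp = datetime.strptime(match["time"], "%H:%M:%S.%f")
        cluster, msg = match["cluster"], match["msg"]
        if msg == "Connecting":
            started[cluster] = stamp
        elif msg == "Connected" and cluster in started:
            elapsed = stamp - started.pop(cluster)
            durations.append((cluster, elapsed.total_seconds()))
    return durations
```

This only measures the reconnect time; it does not explain why the accessor disconnects in the first place.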
This is the config of this particular cluster:
ap-southeast-2 cluster YAML
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: services.REDACTED
  namespace: prod
  annotations:
    argocd.argoproj.io/sync-wave: "0"
spec:
  clusterNetwork:
    pods:
      cidrBlocks:
        - 192.168.0.0/16
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v2beta2
    kind: AWSManagedControlPlane
    name: services.REDACTED
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
    kind: AWSManagedCluster
    name: services.REDACTED
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
kind: AWSManagedCluster
metadata:
  name: services.REDACTED
  namespace: prod
  annotations:
    argocd.argoproj.io/sync-wave: "10"
spec: {}
---
apiVersion: controlplane.cluster.x-k8s.io/v1beta2
kind: AWSManagedControlPlane
metadata:
  name: services.REDACTED
  namespace: prod
  annotations:
    argocd.argoproj.io/sync-wave: "20"
spec:
  associateOIDCProvider: true
  eksClusterName: services_REDACTED_1
  region: ap-southeast-2
  version: v1.32.0
  network:
    vpc:
      id: vpc-XXXXXXXXXX
    subnets:
      - id: subnet-X
      - id: subnet-Y
      - id: subnet-Z
    securityGroupOverrides:
      node-eks-additional: sg-W
  endpointAccess:
    private: true
    public: false
  bastion:
    enabled: false
  oidcIdentityProviderConfig:
    identityProviderConfigName: Okta
    issuerUrl: https://.okta.com/oauth2/XXXXXXXXXXXX
    clientId: XXXXXXXXX
    usernameClaim: preferred_username
    groupsClaim: groups
    groupsPrefix: "okta:"
  logging:
    apiServer: false
    controllerManager: false
    audit: false
    authenticator: false
    scheduler: false
  iamAuthenticatorConfig:
    mapRoles:
      - username: "kubernetes-admin"
        rolearn: "arn:aws:iam::XXXXXXXXXXXX:role/saas-OktaAdmins"
        groups:
          - "system:masters"
  addons:
    - name: "kube-proxy"
      version: "v1.32.6-eksbuild.6"
      conflictResolution: "overwrite"
    - name: "vpc-cni"
      version: "v1.20.1-eksbuild.1"
      conflictResolution: "overwrite"
    - name: "aws-ebs-csi-driver"
      version: "v1.48.0-eksbuild.1"
      conflictResolution: "overwrite"
      serviceAccountRoleARN: "arn:aws:iam::XXXXXXXXXXXX:role/prod-AmazonEKS_EBS_CSI_DriverRole"
  vpcCni:
    env:
      - name: POD_SECURITY_GROUP_ENFORCING_MODE
        value: standard
      - name: ENABLE_POD_ENI
        value: "true"
      - name: ENABLE_PREFIX_DELEGATION
        value: "true"
  additionalTags:
    Owner: "EngOps"
    created_by: "https://bitbucket.org/redacted/capi-cluster"
    Environment: "prod"
  identityRef:
    kind: AWSClusterRoleIdentity
    name: prod
  roleAdditionalPolicies:
    - arn:aws:iam::aws:policy/AmazonEKSVPCResourceController
---
apiVersion: bootstrap.cluster.x-k8s.io/v1beta2
kind: EKSConfig
metadata:
  name: services.REDACTED
  namespace: prod
spec:
  boostrapCommandOverride: "# Self-bootstrap embedded in AMI, doing nothing here for cluster"
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachinePool
metadata:
  name: services-prod-pool-prometheus-ap-southeast-2
  namespace: prod
  annotations:
    cluster.x-k8s.io/replicas-managed-by: "external-autoscaler"
    argocd.argoproj.io/sync-wave: "30"
spec:
  clusterName: services.REDACTED
  replicas: 2
  failureDomains:
    - ap-southeast-2a
    - ap-southeast-2b
  template:
    spec:
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta2
          kind: EKSConfig
          name: services.REDACTED
          namespace: prod
        dataSecretName: services.REDACTED
      clusterName: services.REDACTED
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
        kind: AWSManagedMachinePool
        name: services-prod-pool-prometheus-ap-southeast-2
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
kind: AWSManagedMachinePool
metadata:
  name: services-prod-pool-prometheus-ap-southeast-2
  namespace: prod
  annotations:
    argocd.argoproj.io/sync-wave: "30"
spec:
  eksNodegroupName: services-prod-pool-prometheus
  availabilityZones:
    - ap-southeast-2a
    - ap-southeast-2b
  scaling:
    minSize: 2
    maxSize: 4
  updateConfig:
    maxUnavailable: 1
  awsLaunchTemplate:
    instanceType: m5.large
    ami:
      id: ami-YYYYYY
  labels:
    usm.io/role: prometheus
  taints:
    - key: dedicated
      effect: no-schedule
      value: prometheus
  subnetIDs:
    - subnet-X
    - subnet-Y
  roleAdditionalPolicies:
    - arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachinePool
metadata:
  name: services-prod-pool-ap-southeast-2a
  namespace: prod
  annotations:
    cluster.x-k8s.io/replicas-managed-by: "external-autoscaler"
    argocd.argoproj.io/sync-wave: "40"
spec:
  clusterName: services.REDACTED
  replicas: 2
  failureDomains:
    - ap-southeast-2a
  template:
    spec:
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta2
          kind: EKSConfig
          name: services.REDACTED
          namespace: prod
        dataSecretName: services.REDACTED
      clusterName: services.REDACTED
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
        kind: AWSManagedMachinePool
        name: services-prod-pool-ap-southeast-2a
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
kind: AWSManagedMachinePool
metadata:
  name: services-prod-pool-ap-southeast-2a
  namespace: prod
  annotations:
    argocd.argoproj.io/sync-wave: "40"
spec:
  eksNodegroupName: services-prod-pool-ap-southeast-2a
  availabilityZones:
    - ap-southeast-2a
  scaling:
    minSize: 2
    maxSize: 25
  updateConfig:
    maxUnavailablePercentage: 40
  subnetIDs:
    - subnet-X
  awsLaunchTemplate:
    instanceType: m5.xlarge
    ami:
      id: ami-YYYYYY
  roleAdditionalPolicies:
    - arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachinePool
metadata:
  name: services-prod-pool-ap-southeast-2b
  namespace: prod
  annotations:
    cluster.x-k8s.io/replicas-managed-by: "external-autoscaler"
    argocd.argoproj.io/sync-wave: "41"
spec:
  clusterName: services.REDACTED
  replicas: 2
  failureDomains:
    - ap-southeast-2b
  template:
    spec:
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta2
          kind: EKSConfig
          name: services.REDACTED
          namespace: prod
        dataSecretName: services.REDACTED
      clusterName: services.REDACTED
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
        kind: AWSManagedMachinePool
        name: services-prod-pool-ap-southeast-2b
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
kind: AWSManagedMachinePool
metadata:
  name: services-prod-pool-ap-southeast-2b
  namespace: prod
  annotations:
    argocd.argoproj.io/sync-wave: "41"
spec:
  eksNodegroupName: services-prod-pool-ap-southeast-2b
  availabilityZones:
    - ap-southeast-2b
  scaling:
    minSize: 2
    maxSize: 25
  updateConfig:
    maxUnavailablePercentage: 40
  subnetIDs:
    - subnet-Y
  awsLaunchTemplate:
    instanceType: m5.xlarge
    ami:
      id: ami-YYYYYY
  roleAdditionalPolicies:
    - arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachinePool
metadata:
  name: services-prod-pool-ap-southeast-2c
  namespace: prod
  annotations:
    cluster.x-k8s.io/replicas-managed-by: "external-autoscaler"
    argocd.argoproj.io/sync-wave: "42"
spec:
  clusterName: services.REDACTED
  replicas: 2
  failureDomains:
    - ap-southeast-2c
  template:
    spec:
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta2
          kind: EKSConfig
          name: services.REDACTED
          namespace: prod
        dataSecretName: services.REDACTED
      clusterName: services.REDACTED
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
        kind: AWSManagedMachinePool
        name: services-prod-pool-ap-southeast-2c
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
kind: AWSManagedMachinePool
metadata:
  name: services-prod-pool-ap-southeast-2c
  namespace: prod
  annotations:
    argocd.argoproj.io/sync-wave: "42"
spec:
  eksNodegroupName: services-prod-pool-ap-southeast-2c
  availabilityZones:
    - ap-southeast-2c
  scaling:
    minSize: 2
    maxSize: 25
  updateConfig:
    maxUnavailablePercentage: 40
  subnetIDs:
    - subnet-Z
  awsLaunchTemplate:
    instanceType: m5.xlarge
    ami:
      id: ami-YYYYYY
  roleAdditionalPolicies:
    - arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
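One thing visible in the manifests above: all four MachinePools point their bootstrap configRef at the same EKSConfig, prod/services.REDACTED, which has the same namespace and name as the object in the capi-controller-manager error. Whether that sharing is what the upgraded controller rejects is an assumption, not something confirmed here. A minimal sketch for spotting such shared references, assuming the manifests have already been loaded into plain dicts (for example with a YAML parser); `shared_bootstrap_refs` is a hypothetical helper name.

```python
from collections import defaultdict


def shared_bootstrap_refs(manifests):
    """Map each bootstrap configRef used by 2+ MachinePools to the pool names."""
    users = defaultdict(list)
    for doc in manifests:
        if doc.get("kind") != "MachinePool":
            continue
        meta = doc.get("metadata", {})
        ref = (
            doc.get("spec", {})
            .get("template", {})
            .get("spec", {})
            .get("bootstrap", {})
            .get("configRef")
        )
        if not ref:
            continue
        # Fall back to the MachinePool's namespace when the ref omits one.
        namespace = ref.get("namespace") or meta.get("namespace", "")
        key = (ref.get("kind"), namespace, ref.get("name"))
        users[key].append(meta.get("name"))
    return {key: pools for key, pools in users.items() if len(pools) > 1}
```

Run against the documents above, this would report a single shared reference, the EKSConfig services.REDACTED in namespace prod, used by all four MachinePools.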