
Kube Adaptor Restart on Lost Lease #2291

@c-kruse

Description

Describe the bug
Within a Site, only one kube-adaptor container should be considered the leader at a time. This includes situations where multiple router deployments are running for HA, as well as momentary rollout scale up/down events. To achieve this, the kube-adaptor uses the Kubernetes Leases API (coordination.k8s.io).

Presently, when the current leader loses the Lease, usually due to Lease API availability issues, the container exits with code 1. The kube-adaptor container is then restarted in the Pod (Pod .spec.restartPolicy=Always), sometimes entering CrashLoopBackOff when the issue persists.
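
For illustration only, a minimal client-go sketch of this pattern; the lease name, namespace/identity lookup, and timings below are hypothetical and not taken from the kube-adaptor source:

// Hypothetical sketch of leader election over a Lease, mirroring the
// reported exit-on-lost-lease behavior; not the actual kube-adaptor code.
package main

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Lease lock shared by all kube-adaptor containers in the Site.
	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "skupper-site-lease", Namespace: os.Getenv("NAMESPACE")},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: os.Getenv("HOSTNAME")},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second,
		RenewDeadline: 10 * time.Second,
		RetryPeriod:   2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// Leader-only work (site flow controller, status sync) would run here.
				<-ctx.Done()
			},
			OnStoppedLeading: func() {
				// Current behavior: exit and rely on the Pod restart policy.
				os.Exit(1)
			},
		},
	})
}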

While the router should continue to operate gracefully as configured without a running adaptor, this ends up contributing to larger network instability in a few ways.

  • Readiness: the Pod readiness check depends on a running kube-adaptor container. When the Pod is marked as not Ready, it is removed from the EndpointSlices so that Service traffic is not sent to the router.
  • Configuration drift: While the kube-adaptor is not running, it is not syncing desired configuration to the router.

Originally reported here: #2250

How To Reproduce
Steps to reproduce the behavior:
⚠️ Applying this configuration will indiscriminately limit Kubernetes API Lease operations and should be considered harmful to overall cluster health. ⚠️

---
apiVersion: flowcontrol.apiserver.k8s.io/v1
kind: PriorityLevelConfiguration
metadata:
  name: gh2291-leader-election
spec:
  limited:
    lendablePercent: 0
    limitResponse:
      type: Reject
    nominalConcurrencyShares: 0
    borrowingLimitPercent: 0
  type: Limited
---
apiVersion: flowcontrol.apiserver.k8s.io/v1
kind: FlowSchema
metadata:
  name: gh2291-leader-election
spec:
  distinguisherMethod:
    type: ByUser
  matchingPrecedence: 110
  priorityLevelConfiguration:
    name: gh2291-leader-election
  rules:
    - resourceRules:
        - apiGroups:
            - coordination.k8s.io
          resources:
            - leases
          namespaces:
            - '*'
          verbs:
            - get
            - watch
            - create
            - update
      subjects:
        - kind: Group
          group:
            name: system:serviceaccounts

  • Apply the above API Priority and Fairness configuration to severely limit API server concurrency on the Leases API.
  • Deploy Skupper Sites until kube-adaptors begin to crash; how many are needed depends on your cluster. Locally with kind (single-node control plane and etcd) it took roughly 15 sites.

Alternatively, fight the skupper controller by editing the skupper-router Role to remove the create, update, and delete verbs from the leases rule.

Expected behavior

When leader election is lost (see the sketch after this list):

  • do not exit
  • stop the leader processes (site flow controller + status sync)
  • log the error
  • retry leader election with backoff
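
A hedged sketch of this, building on the client-go snippet above (same imports plus "log", with OnStoppedLeading changed to only log rather than exit; the function name, backoff values, and 30s cap are illustrative, not an actual implementation):

// runWithRetry re-enters leader election after the lease is lost instead of
// exiting. The ctx handed to OnStartedLeading is cancelled by client-go before
// OnStoppedLeading runs, which should stop the site flow controller and status
// sync; the error is logged and the election retried with a capped backoff.
func runWithRetry(ctx context.Context, cfg leaderelection.LeaderElectionConfig) {
	delay := time.Second
	for {
		// RunOrDie blocks until leadership is acquired and later lost,
		// or until ctx is cancelled.
		leaderelection.RunOrDie(ctx, cfg)
		if ctx.Err() != nil {
			return // the adaptor itself is shutting down
		}
		log.Printf("lost leader lease; retrying election in %s", delay)
		select {
		case <-time.After(delay):
		case <-ctx.Done():
			return
		}
		if delay < 30*time.Second {
			delay *= 2
		}
	}
}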

Environment details

  • Skupper Operator: 2.0+
  • Platform: kubernetes
