gpu-device-plugin does not respect tolerations #1921

@audiophonicz

Description

Describe the support request
I just updated NFD and the GPU plugins from 0.27 to 0.31.1 due to a K3s version increment.

I'm installing gpu-device-plugin via the Helm chart per INSTALL.md.

Tolerations for NFD (Node-Feature-Discovery) work. Apparently the Device-Plugins-Operator does not support tolerations, but that's OK because it can run on any of the other 5 nodes.

The GPU-Device-Plugin chart's values.yaml shows tolerations as a supported field.

Helm accepts tolerations via both --set flags and values.yaml, but the DaemonSet does not respect them. The install emits a warning:

> helm install gpu-device-plugin intel/intel-device-plugins-gpu \
>   --namespace inteldeviceplugins-system --version 0.31.1  -f gpu-device-plugin-values.yml

W1207 23:08:44.767727 3079715 warnings.go:70] unknown field "spec.tolerations"
Helm Get Values Output Showing Tolerations
> helm get values gpu-device-plugin -n inteldeviceplugins-system
USER-SUPPLIED VALUES:
allocationPolicy: none
enableMonitoring: true
image:
  hub: intel
  tag: ""
initImage:
  enable: false
  hub: intel
  tag: ""
logLevel: 2
name: worker
nodeFeatureRule: true
nodeSelector:
  intel.feature.node.kubernetes.io/gpu: "true"
resourceManager: false
sharedDevNum: 10
tolerations:
- effect: NoSchedule
  key: dedicated
  operator: Equal
  value: transcode

But the pod/DaemonSet does not accept the tolerations and discards them. Editing the DaemonSet directly with kubectl edit saves as a valid edit, but the tolerations are stripped out again immediately (presumably reconciled away by the owning GpuDevicePlugin controller).
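For reference, a toleration only takes effect when it lands in the pod template of the DaemonSet, i.e. under spec.template.spec.tolerations, not at the top level of the spec (which would match the unknown field "spec.tolerations" warning above). A minimal sketch of where the field would need to appear in the rendered manifest, using the toleration values from my values.yml:

```yaml
# Sketch only: where Kubernetes expects tolerations in a DaemonSet.
apiVersion: apps/v1
kind: DaemonSet
spec:
  template:
    spec:
      # Tolerations belong in the pod template spec, not the DaemonSet spec.
      tolerations:
      - key: dedicated
        operator: Equal
        value: transcode
        effect: NoSchedule
```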

GPU-Device-Plugin Daemonset Manifest
> kubectl get daemonset -n inteldeviceplugins-system intel-gpu-plugin-worker -o yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  annotations:
    deprecated.daemonset.template.generation: "2"
  creationTimestamp: "2024-12-08T04:20:14Z"
  generation: 2
  labels:
    app: intel-gpu-plugin
  name: intel-gpu-plugin-worker
  namespace: inteldeviceplugins-system
  ownerReferences:
  - apiVersion: deviceplugin.intel.com/v1
    blockOwnerDeletion: true
    controller: true
    kind: GpuDevicePlugin
    name: worker
    uid: 6d565661-c726-4359-8ee7-e5a35bee391d
  resourceVersion: "313888847"
  uid: c822c290-bfe7-4388-a5d9-2270ecdc4a2d
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: intel-gpu-plugin
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: intel-gpu-plugin
    spec:
      containers:
      - args:
        - -v
        - "2"
        - -enable-monitoring
        - -shared-dev-num
        - "10"
        - -allocation-policy
        - none
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: HOST_IP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.hostIP
        image: intel/intel-gpu-plugin:0.31.1
        imagePullPolicy: IfNotPresent
        name: intel-gpu-plugin
        resources:
          limits:
            cpu: 100m
            memory: 90Mi
          requests:
            cpu: 40m
            memory: 45Mi
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          seLinuxOptions:
            type: container_device_plugin_t
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /dev/dri
          name: devfs
          readOnly: true
        - mountPath: /sys/class/drm
          name: sysfsdrm
          readOnly: true
        - mountPath: /var/lib/kubelet/device-plugins
          name: kubeletsockets
        - mountPath: /var/run/cdi
          name: cdipath
      dnsPolicy: ClusterFirst
      nodeSelector:
        intel.feature.node.kubernetes.io/gpu: "true"
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: default
      serviceAccountName: default
      terminationGracePeriodSeconds: 30
      volumes:
      - hostPath:
          path: /dev/dri
          type: ""
        name: devfs
      - hostPath:
          path: /sys/class/drm
          type: ""
        name: sysfsdrm
      - hostPath:
          path: /var/lib/kubelet/device-plugins
          type: ""
        name: kubeletsockets
      - hostPath:
          path: /var/run/cdi
          type: DirectoryOrCreate
        name: cdipath
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate
status:
  currentNumberScheduled: 5
  desiredNumberScheduled: 5
  numberAvailable: 5
  numberMisscheduled: 0
  numberReady: 5
  observedGeneration: 2
  updatedNumberScheduled: 5

Expected Behavior
Tolerations are accepted, and DaemonSet workers are scheduled on matching nodes regardless of taints.

System (please complete the following information if applicable):

  • OS version: Debian Bookworm 12.7
  • Kubernetes: K3s v1.30.6+k3s1
  • Kernel version: Linux 6.10.11+bpo-amd64
  • Device plugins version: v0.31.1
  • Hardware info: N/A - Multi
