Description
Describe the support request
I just updated NFD and GPU Plugins from 0.27 to 0.31.1 due to K3s version increment.
Installing gpu-device-plugin via Helm chart per INSTALL.md
NFD (Node-Feature-Discovery) tolerations work. Apparently the Device-Plugins-Operator does not support tolerations, but that's OK because it can run on any of the other 5 nodes.
GPU-Device-Plugin Chart Values.yaml shows tolerations as a supported field.
Helm accepts both --set commands and values.yaml inputs for Tolerations, but the Daemonset does not respect them.
> helm install gpu-device-plugin intel/intel-device-plugins-gpu \
> --namespace inteldeviceplugins-system --version 0.31.1 -f gpu-device-plugin-values.yml
W1207 23:08:44.767727 3079715 warnings.go:70] unknown field "spec.tolerations"
Helm Get Values Output Showing Tolerations
>helm get values gpu-device-plugin -n inteldeviceplugins-system
USER-SUPPLIED VALUES:
allocationPolicy: none
enableMonitoring: true
image:
  hub: intel
  tag: ""
initImage:
  enable: false
  hub: intel
  tag: ""
logLevel: 2
name: worker
nodeFeatureRule: true
nodeSelector:
  intel.feature.node.kubernetes.io/gpu: "true"
resourceManager: false
sharedDevNum: 10
tolerations:
- effect: NoSchedule
  key: dedicated
  operator: Equal
  value: transcode
But the pod/daemonset does not accept the tolerations and throws them away. Editing the DaemonSet directly with kubectl edit saves as a valid edit, but the tolerations are thrown away again immediately.
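One way to narrow this down: the `unknown field "spec.tolerations"` warning is the kind of message the API server emits when a field is missing from a resource's schema, so it may be worth checking whether the installed GpuDevicePlugin CRD actually declares `spec.tolerations`. The sketch below walks a CRD object (as parsed JSON) and reports, per version, whether a dotted field path exists in its OpenAPI schema. The helper name `field_in_crd_schema` is hypothetical, and the CRD name `gpudeviceplugins.deviceplugin.intel.com` in the usage note is an assumption derived from the kind and API group shown in this report. This is a diagnostic aid, not a confirmed root cause.

```python
def field_in_crd_schema(crd: dict, path: str) -> dict:
    """Return {version_name: bool} saying whether `path` (e.g. "spec.tolerations")
    is declared in each version's openAPIV3Schema of a CustomResourceDefinition.

    `crd` is the parsed JSON of a CRD object. This helper is a hypothetical
    illustration and only follows nested `properties` maps.
    """
    result = {}
    for version in crd.get("spec", {}).get("versions", []):
        node = version.get("schema", {}).get("openAPIV3Schema", {})
        found = True
        for part in path.split("."):
            props = node.get("properties", {})
            if part not in props:
                found = False
                break
            node = props[part]
        result[version.get("name", "?")] = found
    return result
```

To use it, save the output of `kubectl get crd gpudeviceplugins.deviceplugin.intel.com -o json` (CRD name assumed) to a file, load it with `json.load`, and call `field_in_crd_schema(crd, "spec.tolerations")`. A `False` for the served version would be consistent with the warning above.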
GPU-Device-Plugin Daemonset Manifest
> k get daemonset -n inteldeviceplugins-system intel-gpu-plugin-worker -o yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  annotations:
    deprecated.daemonset.template.generation: "2"
  creationTimestamp: "2024-12-08T04:20:14Z"
  generation: 2
  labels:
    app: intel-gpu-plugin
  name: intel-gpu-plugin-worker
  namespace: inteldeviceplugins-system
  ownerReferences:
  - apiVersion: deviceplugin.intel.com/v1
    blockOwnerDeletion: true
    controller: true
    kind: GpuDevicePlugin
    name: worker
    uid: 6d565661-c726-4359-8ee7-e5a35bee391d
  resourceVersion: "313888847"
  uid: c822c290-bfe7-4388-a5d9-2270ecdc4a2d
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: intel-gpu-plugin
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: intel-gpu-plugin
    spec:
      containers:
      - args:
        - -v
        - "2"
        - -enable-monitoring
        - -shared-dev-num
        - "10"
        - -allocation-policy
        - none
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: HOST_IP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.hostIP
        image: intel/intel-gpu-plugin:0.31.1
        imagePullPolicy: IfNotPresent
        name: intel-gpu-plugin
        resources:
          limits:
            cpu: 100m
            memory: 90Mi
          requests:
            cpu: 40m
            memory: 45Mi
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          seLinuxOptions:
            type: container_device_plugin_t
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /dev/dri
          name: devfs
          readOnly: true
        - mountPath: /sys/class/drm
          name: sysfsdrm
          readOnly: true
        - mountPath: /var/lib/kubelet/device-plugins
          name: kubeletsockets
        - mountPath: /var/run/cdi
          name: cdipath
      dnsPolicy: ClusterFirst
      nodeSelector:
        intel.feature.node.kubernetes.io/gpu: "true"
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: default
      serviceAccountName: default
      terminationGracePeriodSeconds: 30
      volumes:
      - hostPath:
          path: /dev/dri
          type: ""
        name: devfs
      - hostPath:
          path: /sys/class/drm
          type: ""
        name: sysfsdrm
      - hostPath:
          path: /var/lib/kubelet/device-plugins
          type: ""
        name: kubeletsockets
      - hostPath:
          path: /var/run/cdi
          type: DirectoryOrCreate
        name: cdipath
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate
status:
  currentNumberScheduled: 5
  desiredNumberScheduled: 5
  numberAvailable: 5
  numberMisscheduled: 0
  numberReady: 5
  observedGeneration: 2
  updatedNumberScheduled: 5
Expected Behavior
Tolerations are accepted and daemonset workers are scheduled on all nodes regardless of taint.
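To make the expectation concrete, here is a minimal sketch of the standard Kubernetes toleration-matching rule, applied to the pod spec from the DaemonSet. The function names `tolerates` and `untolerated_taints` are hypothetical, and the node taint `dedicated=transcode:NoSchedule` is an assumption inferred from the toleration in the values file; the actual node taints are not shown in this report.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if a single toleration matches a single taint.

    Follows the usual rule: an empty effect or key on the toleration acts as
    a wildcard, operator "Exists" ignores the value, and operator "Equal"
    (the default when omitted) requires equal values.
    """
    effect = toleration.get("effect", "")
    if effect and effect != taint.get("effect", ""):
        return False
    key = toleration.get("key", "")
    if key and key != taint.get("key", ""):
        return False
    operator = toleration.get("operator", "") or "Equal"
    if operator == "Exists":
        return True
    if operator == "Equal":
        return toleration.get("value", "") == taint.get("value", "")
    return False


def untolerated_taints(pod_spec: dict, node_taints: list) -> list:
    """Return the scheduling-relevant taints that no toleration in pod_spec covers."""
    tolerations = pod_spec.get("tolerations") or []
    return [
        taint
        for taint in node_taints
        if taint.get("effect") in ("NoSchedule", "NoExecute")
        and not any(tolerates(tol, taint) for tol in tolerations)
    ]


# Assumed taint on the dedicated nodes (not shown in the report).
taint = {"key": "dedicated", "value": "transcode", "effect": "NoSchedule"}
toleration = {"key": "dedicated", "operator": "Equal",
              "value": "transcode", "effect": "NoSchedule"}

# The DaemonSet pod spec above has no tolerations, so the taint blocks scheduling.
print(untolerated_taints({}, [taint]))
# With the toleration from the values file applied, nothing is left untolerated.
print(untolerated_taints({"tolerations": [toleration]}, [taint]))
```

Under that assumed taint, the first call returns the taint and the second returns an empty list, which is the behavior the Helm values were meant to produce.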
System (please complete the following information if applicable):
- OS version: Debian Bookworm 12.7
- Kubernetes: K3s v1.30.6+k3s1
- Kernel version: Linux 6.10.11+bpo-amd64
- Device plugins version: v0.31.1
- Hardware info: N/A - Multi