gpu: do not fail if no GPU devices found #272
Conversation
@mythi @rojkov - you'll have to let me know if this is the kind of fix you were thinking of. I don't know the device plugin framework well enough to understand all the interactions etc. (like that 5s sleep/poll loop - makes me twitch...). @poussa - FYI.
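For anyone else not steeped in the device plugin framework, here is a very rough sketch of the shape of that 5s sleep/poll loop as it stood before this PR. All the names (`devicePlugin`, `deviceTree`, `Scan`, `scanPeriod`) are assumptions standing in for the real gpu_plugin.go code, not the actual implementation:

```go
// NOTE: rough sketch only; devicePlugin, deviceTree and Scan are stand-ins
// for the real gpu_plugin.go types, and scanPeriod is the 5s poll interval
// mentioned above.
package gpuplugin

import "time"

const scanPeriod = 5 * time.Second

// deviceTree stands in for the device plugin framework's device tree.
type deviceTree map[string][]string

type devicePlugin struct{}

// scan would probe /sys/class/drm and /dev/dri; details omitted here.
func (dp *devicePlugin) scan() (deviceTree, error) {
	return deviceTree{}, nil
}

// Scan re-scans every scanPeriod and pushes the result to notify. In the
// pre-patch shape a scan error is returned, which terminates the plugin.
func (dp *devicePlugin) Scan(notify func(deviceTree)) error {
	for {
		tree, err := dp.scan()
		if err != nil {
			return err // fatal on a GPU-less node: the pod dies and the daemonset relaunches it
		}
		notify(tree)
		time.Sleep(scanPeriod)
	}
}
```

The interesting part for this PR is the `return err` line: on a node with no Intel GPU, a failed scan takes the whole plugin down.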
CI failed - I need to fix the test as well. Will do so and repush... you can still review the 'intent' though ;-)
Force-pushed 94bb778 to 1ebb830
Tests updated. So, updating involved deleting the error check, as I've removed the ability to error... which somehow feels a bit remiss, but I can't see an obvious way to add anything more to the tests.
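For what it's worth, a minimal sketch of what the remaining unit test might look like once the error return is gone. The `devicePlugin` type, its fields and the `scan()` signature here are assumptions, not the repository's actual test code:

```go
// NOTE: hypothetical test shape; devicePlugin and its fields are assumptions,
// not the repository's real test code.
package gpuplugin

import "testing"

type devicePlugin struct {
	sysfsDir string
	devfsDir string
}

// scan now returns only a device tree; there is no error path left to assert on.
func (dp *devicePlugin) scan() map[string][]string {
	// probing the (empty) temp dirs finds nothing, so the tree stays empty
	return map[string][]string{}
}

func TestScanWithNoDevices(t *testing.T) {
	dp := &devicePlugin{sysfsDir: t.TempDir(), devfsDir: t.TempDir()}
	if tree := dp.scan(); len(tree) != 0 {
		t.Errorf("expected an empty device tree, got %v", tree)
	}
}
```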
should fix #230 too btw
Are you deploying with the YAML provided in this repo? It would probably work if you leave the …
Ah, yes, because the scan is done at daemonset time, not container time (I was looking at how the …
cmd/gpu_plugin/gpu_plugin.go
Outdated
we could also keep returning the error from scan() to keep the unit tests as-is and trap the error here without returning?
ah, I like that idea @mythi. It didn't occur to me as I started by fixing the lower func, and the changes rippled up... I have rewritten the patch - it is now a one-line change!
Pushed.
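In the spirit of that suggestion, a hedged sketch of what the reworked loop might look like: scan() keeps its error return (so the existing unit tests can stand), and the caller just logs and carries on rather than returning. Again, the names and structure are assumptions about gpu_plugin.go, not the actual diff:

```go
// NOTE: a sketch of the idea, not the actual change; all names are assumptions.
package gpuplugin

import (
	"log"
	"time"
)

const scanPeriod = 5 * time.Second

type deviceTree map[string][]string

type devicePlugin struct{}

// scan keeps its error return, so the existing unit tests can stay as they are.
func (dp *devicePlugin) scan() (deviceTree, error) {
	// probing /sys and /dev omitted
	return deviceTree{}, nil
}

func (dp *devicePlugin) Scan(notify func(deviceTree)) {
	for {
		tree, err := dp.scan()
		if err != nil {
			// The one-line change in spirit: warn and go around the loop
			// again instead of returning the error and killing the plugin.
			log.Printf("warning: device scan failed: %+v", err)
		} else {
			notify(tree)
		}
		time.Sleep(scanPeriod)
	}
}
```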
Force-pushed 1ebb830 to 5d70de8
Jenkins looks to have failed/timed out - not sure if the 'push rejected' messages are the problem or not: …
mythi left a comment:
I think we're dealing with user experience design here. We no longer err on anything. Users who expect to get GPU resources but find that kubelet shows none (while the pod is running OK) need to go to the logs to find the reason. OTOH, from now on, no GPU and no Intel GPU behave the same...
Ack. That's one of the downsides of daemonsets, particularly if you have a mixed cluster where not all nodes have GPUs - you don't want to wholly fail the daemonset deployment, but nor do you want some of the nodes to get stuck in a constant retry loop. AFAIK, the daemonset containers should not fail unless it really is a fatal issue. I think staring at the logs and the resource metrics is then the only real way to see what got deployed.
rojkov left a comment:
Looks good to me. Thanks!
@grahamwhaley please re-push to trigger Jenkins (just in case)
If we fail to scan for GPU devices (note, that is potentially different
from not finding any devices during a scan), then warn on it, and go
around the poll loop again. Do not treat it as a fatal error or we might
end up in a re-launch death-deploy loop... Of course, getting a warning
in your logs every 5s could also be annoying, but is somewhat 'less fatal'.

Fixes: intel#260
Fixes: intel#230

Signed-off-by: Graham Whaley <[email protected]>
Force-pushed 5d70de8 to 6537e38
rebased and pushed
@mythi, looks like Travis CI timed out...
restarted... let's see if it's fast enough now |
If we do not find any GPU devices, for whatever reason
(such as missing or unparsable /dev or /sys files), do not
error and quit, just return an empty device tree (indicating
we have no devices on this node).
This is preferable as we will then not enter a retry-death-spin
type scenario.
Fixes: #260
Signed-off-by: Graham Whaley [email protected]
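For completeness, a small sketch of the approach this commit message describes: scan() swallows probe failures and hands back an empty device tree, so the node simply advertises zero GPUs. The deviceTree type and the listCards helper below are hypothetical stand-ins for the real code:

```go
// NOTE: deviceTree, devicePlugin and listCards are hypothetical stand-ins,
// not the real gpu_plugin.go implementation.
package gpuplugin

import "log"

type deviceTree map[string][]string

type devicePlugin struct {
	sysfsDir string
	devfsDir string
}

// scan no longer returns an error: if /sys or /dev cannot be read or parsed,
// it logs why and reports an empty tree, i.e. "this node has no GPUs".
func (dp *devicePlugin) scan() deviceTree {
	tree := deviceTree{}

	cards, err := listCards(dp.sysfsDir) // hypothetical helper walking /sys/class/drm
	if err != nil {
		log.Printf("no GPU devices found: %+v", err)
		return tree
	}
	for _, card := range cards {
		tree["i915"] = append(tree["i915"], card)
	}
	return tree
}

// listCards is a placeholder for the real sysfs walk; details omitted.
func listCards(sysfsDir string) ([]string, error) {
	return nil, nil
}
```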