Conversation

@grahamwhaley

If we do not find any GPU devices, for whatever reason
(such as missing or unparsable /dev or /sys files), do not
error out and quit; just return an empty device tree (indicating
we have no devices on this node).
This is preferable, as we then avoid entering a retry-death-spin
type scenario.

Fixes: #260

Signed-off-by: Graham Whaley [email protected]
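For illustration, here is a minimal standalone Go sketch of that idea. The names (`scanGPUs`, `deviceTree`, the resource string and paths) are hypothetical and not the plugin's actual API; the point is only that a missing or unreadable sysfs directory yields an empty tree rather than an error:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// deviceTree is a stand-in for the plugin's device tree: a map from a
// resource name to the device nodes found for it on this node.
type deviceTree map[string][]string

// scanGPUs looks for DRM card entries under a sysfs-style directory.
// If the directory is missing or unreadable (e.g. a kernel built without
// CONFIG_DRM), it logs a warning and returns an empty tree instead of an
// error, so the caller keeps polling rather than exiting and being
// re-launched over and over.
func scanGPUs(sysfsDrmDir string) deviceTree {
	tree := deviceTree{}

	entries, err := os.ReadDir(sysfsDrmDir)
	if err != nil {
		fmt.Fprintf(os.Stderr, "warning: cannot read %s: %v\n", sysfsDrmDir, err)
		return tree // empty: no devices on this node
	}

	for _, e := range entries {
		if strings.HasPrefix(e.Name(), "card") {
			tree["gpu.example.com/i915"] = append(tree["gpu.example.com/i915"],
				filepath.Join("/dev/dri", e.Name()))
		}
	}
	return tree
}

func main() {
	fmt.Println(scanGPUs("/sys/class/drm"))
}
```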

@grahamwhaley
Author

@mythi @rojkov - you'll have to let me know if this is the kind of fix you were thinking of. I don't know the device plugin framework well enough to understand all the interactions etc. (like that 5s sleep/poll loop - makes me twitch...).
I also cannot trivially get a setup to test this with CoreOS. I tried minikube and kind, but they both still import my GPU :-(
I tried looking for the CoreOS kernel config, to check whether they set CONFIG_DRM, but could not find their configs!

@poussa - fyi.

@grahamwhaley
Author

CI failed - I need to fix the test as well. Will do so and repush... you can still review the 'intent' though ;-)

@grahamwhaley
Author

Tests updated. Updating involved deleting the error check, as I've removed the ability to error... which somehow feels a bit remiss, but I can't see an obvious way to add anything more to the tests.
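For what it's worth, a test could still assert something beyond "no panic": a missing sysfs path should produce an empty tree rather than an error. A minimal sketch, written against the hypothetical `scanGPUs` function from the sketch above and assuming it lives in the same package:

```go
package main

import "testing"

// TestScanGPUsMissingSysfs checks that a non-existent sysfs path yields an
// empty device tree. There is no error to check any more, so the emptiness
// of the tree is the observable behaviour.
func TestScanGPUsMissingSysfs(t *testing.T) {
	tree := scanGPUs("/nonexistent/sys/class/drm")
	if len(tree) != 0 {
		t.Fatalf("expected empty device tree, got %v", tree)
	}
}
```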

@mythi
Contributor

mythi commented Jan 23, 2020

should fix #230 too btw

@mythi
Contributor

mythi commented Jan 23, 2020

> I also cannot trivially get a setup to test this with CoreOS.

Are you deploying with the YAML provided in this repo? It would probably work if you leave the /sys mount out so that the container does not see the dir...

@grahamwhaley
Author

Ah, yes, because the scan is done at daemonset time, not container time (I was looking at how the /dev/ got mapped into the containers earlier, which is a twisty little path...).
I'll give that a try.

@mythi
Contributor


We could also keep returning an error from scan() to keep the unit tests as they are, and trap the error here without returning it?

@grahamwhaley
Author


Ah, I like that idea @mythi. It didn't occur to me, as I started by fixing the lower func and the changes rippled up... I have rewritten the patch - it is now a one-line change!
Pushed.

@grahamwhaley
Author

Jenkins looks to have failed/timed out - not sure if the 'push rejected' messages are the problem or not:

```
The push refers to repository [cloud-native-image-registry.westus.cloudapp.azure.com/ubuntu-demo-opencl]
fb8676ae6c1a: Preparing
98901c6661fd: Preparing
af72ecf4a4df: Preparing
30f0e153f434: Preparing
f55aa0bd26b8: Preparing
1d0dfb259f6a: Preparing
21ec61b65b20: Preparing
43c67172d1d1: Preparing
1d0dfb259f6a: Waiting
21ec61b65b20: Waiting
43c67172d1d1: Waiting
f55aa0bd26b8: Layer already exists
1d0dfb259f6a: Layer already exists
21ec61b65b20: Layer already exists
af72ecf4a4df: Pushed
43c67172d1d1: Layer already exists
30f0e153f434: Pushed
98901c6661fd: Pushed
fb8676ae6c1a: Pushed
272-rejected: digest: sha256:c5ea36cebe9731a30dba93aa6da2327c9e832ca1942f162a77d4dff8496f0d73 size: 1989
[Pipeline] }
```

@mythi mythi requested a review from rojkov January 28, 2020 04:23
mythi previously approved these changes Jan 28, 2020
Contributor

@mythi left a comment


I think we're dealing with user experience design here. We no longer err on anything. Users who expect to get GPU resources while kubelet shows none (and the pod is running OK) need to find the logs to find the reason. OTOH, from now on, 'no GPU' and 'no Intel GPU' behave the same...

@grahamwhaley
Author

> I think we're dealing with user experience design here. We no longer err on anything. Users who expect to get GPU resources while kubelet shows none (and the pod is running OK) need to find the logs to find the reason. OTOH, from now on, 'no GPU' and 'no Intel GPU' behave the same...

Ack. One of the downsides of daemonsets, particularly if you have a mixed cluster where not all nodes have GPUs: you don't want to wholly fail the daemonset deployment, but nor do you want some of the nodes to get stuck in a constant retry loop. AFAIK, the daemonset containers should not fail unless it really is a fatal issue. I think staring at the logs and the resource metrics is then the only real way to see what got deployed.

rojkov previously approved these changes Jan 28, 2020
Contributor

@rojkov left a comment


Looks good to me. Thanks!

@mythi
Contributor

mythi commented Jan 29, 2020

@grahamwhaley please re-push to trigger jenkins (just in case)

If we fail to scan for GPU devices (note, that is potentially
different from not finding any devices during a scan), then
warn on it, and go around the poll loop again. Do not treat
it as a fatal error or we might end up in a re-launch death
deploy loop...

Of course, getting a warning in your logs every 5s could also
be annoying, but is somewhat 'less fatal'.

Fixes: intel#260
Fixes: intel#230

Signed-off-by: Graham Whaley <[email protected]>
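A minimal sketch of the shape of that change, using hypothetical names (`scan`, the notify call, the poll interval) rather than the plugin's actual code: the scan error is trapped inside the poll loop, logged as a warning, and the loop simply goes around again instead of returning the error as fatal.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"time"
)

// scan stands in for the plugin's device scan; on a node without a DRM
// class directory it returns an error.
func scan() (map[string][]string, error) {
	return nil, errors.New("no DRM class directory found")
}

func main() {
	// Trap the scan error inside the poll loop, warn, and go around
	// again, instead of returning it as fatal (which would make the
	// daemonset pod exit and be re-launched endlessly).
	for i := 0; i < 3; i++ { // bounded here only so the demo terminates
		devTree, err := scan()
		if err != nil {
			log.Printf("warning: device scan failed: %v", err)
		} else {
			fmt.Println("notifying kubelet with", devTree)
		}
		time.Sleep(100 * time.Millisecond)
	}
}
```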
@grahamwhaley grahamwhaley dismissed stale reviews from rojkov and mythi via 6537e38 January 29, 2020 09:25
@grahamwhaley
Author

re-based and pushed

@grahamwhaley
Author

@mythi, looks like Travis CI timed out...

> The job exceeded the maximum time limit for jobs, and has been terminated.

@rojkov
Contributor

rojkov commented Jan 29, 2020

restarted... let's see if it's fast enough now

@rojkov rojkov self-requested a review January 30, 2020 09:05
@rojkov rojkov merged commit 8841141 into intel:master Jan 30, 2020


Development

Successfully merging this pull request may close these issues.

GPU plugin fails on CoreOS
