Conversation

@grahamwhaley

If we do not find any GPU devices, for whatever reason
(such as missing or unparsable /dev or /sys files), do not
error out and quit; just return an empty device tree (indicating
we have no devices on this node).
This is preferable, as we then avoid entering a retry-death-spin
type scenario.

Fixes: #260

Signed-off-by: Graham Whaley [email protected]
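For illustration, here is a minimal standalone Go sketch of that idea. The names (`scanGPUs`, `deviceTree`, the resource string and paths) are hypothetical and not the plugin's actual API; the point is only that a missing or unreadable sysfs directory yields an empty tree rather than an error:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// deviceTree is a stand-in for the plugin's device tree: a map from a
// resource name to the device nodes found for it on this node.
type deviceTree map[string][]string

// scanGPUs looks for DRM card entries under a sysfs-style directory.
// If the directory is missing or unreadable (e.g. a kernel built without
// CONFIG_DRM), it logs a warning and returns an empty tree instead of an
// error, so the caller keeps polling rather than exiting and being
// re-launched over and over.
func scanGPUs(sysfsDrmDir string) deviceTree {
	tree := deviceTree{}

	entries, err := os.ReadDir(sysfsDrmDir)
	if err != nil {
		fmt.Fprintf(os.Stderr, "warning: cannot read %s: %v\n", sysfsDrmDir, err)
		return tree // empty: no devices on this node
	}

	for _, e := range entries {
		if strings.HasPrefix(e.Name(), "card") {
			tree["gpu.example.com/i915"] = append(tree["gpu.example.com/i915"],
				filepath.Join("/dev/dri", e.Name()))
		}
	}
	return tree
}

func main() {
	fmt.Println(scanGPUs("/sys/class/drm"))
}
```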

@grahamwhaley
Author

@mythi @rojkov - you'll have to let me know if this is the kind of fix you were thinking of. I don't know the device plugin framework well enough to understand all the interactions etc. (like that 5s sleep/poll loop - makes me twitch...).
I also cannot trivially get a setup to test this with CoreOS. I tried minikube and kind, but they both still import my GPU :-(
I tried looking for the CoreOS kernel config, to check whether they set CONFIG_DRM, but could not find their configs!

@poussa - fyi.

@grahamwhaley
Author

CI failed - I need to fix the test as well. Will do so and repush... you can still review the 'intent' though ;-)

@grahamwhaley
Author

Tests updated. Updating involved deleting the error check, as I've removed the ability to error... which somehow feels a bit remiss, but I can't see an obvious way to add anything more to the tests.
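For what it's worth, a test could still assert something beyond "no panic": a missing sysfs path should produce an empty tree rather than an error. A minimal sketch, written against the hypothetical `scanGPUs` function from the sketch above and assuming it lives in the same package:

```go
package main

import "testing"

// TestScanGPUsMissingSysfs checks that a non-existent sysfs path yields an
// empty device tree. There is no error to check any more, so the emptiness
// of the tree is the observable behaviour.
func TestScanGPUsMissingSysfs(t *testing.T) {
	tree := scanGPUs("/nonexistent/sys/class/drm")
	if len(tree) != 0 {
		t.Fatalf("expected empty device tree, got %v", tree)
	}
}
```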

@mythi
Contributor

mythi commented Jan 23, 2020

should fix #230 too btw

@mythi
Contributor

mythi commented Jan 23, 2020

> I also cannot trivially get a setup to test this with CoreOS.

Are you deploying with the YAML provided in this repo? It would probably work if you leave the /sys mount out so that the container does not see the dir...

@grahamwhaley
Author

Ah, yes, because the scan is done at daemonset time, not container time (I was looking at how the /dev/ got mapped into the containers earlier, which is a twisty little path...).
I'll give that a try.

@mythi
Contributor


We could also keep returning an error from scan() to keep the unit tests as they are, and trap the error here without returning it?

@grahamwhaley
Author


Ah, I like that idea @mythi. It didn't occur to me, as I started by fixing the lower func and the changes rippled up... I have rewritten the patch - it is now a one-line change!
Pushed.

@grahamwhaley
Author

Jenkins looks to have failed/timed out - not sure if the 'push rejected' messages are the problem or not:

```
The push refers to repository [cloud-native-image-registry.westus.cloudapp.azure.com/ubuntu-demo-opencl]
fb8676ae6c1a: Preparing
98901c6661fd: Preparing
af72ecf4a4df: Preparing
30f0e153f434: Preparing
f55aa0bd26b8: Preparing
1d0dfb259f6a: Preparing
21ec61b65b20: Preparing
43c67172d1d1: Preparing
1d0dfb259f6a: Waiting
21ec61b65b20: Waiting
43c67172d1d1: Waiting
f55aa0bd26b8: Layer already exists
1d0dfb259f6a: Layer already exists
21ec61b65b20: Layer already exists
af72ecf4a4df: Pushed
43c67172d1d1: Layer already exists
30f0e153f434: Pushed
98901c6661fd: Pushed
fb8676ae6c1a: Pushed
272-rejected: digest: sha256:c5ea36cebe9731a30dba93aa6da2327c9e832ca1942f162a77d4dff8496f0d73 size: 1989
[Pipeline] }
```

@mythi mythi requested a review from rojkov January 28, 2020 04:23
mythi previously approved these changes Jan 28, 2020
Contributor

@mythi left a comment


I think we're dealing with user experience design here. We no longer err on anything. Users who expect to get GPU resources while kubelet shows none (and the pod is running OK) need to find the logs to find the reason. OTOH, from now on, 'no GPU' and 'no Intel GPU' behave the same...

@grahamwhaley
Author

> I think we're dealing with user experience design here. We no longer err on anything. Users who expect to get GPU resources while kubelet shows none (and the pod is running OK) need to find the logs to find the reason. OTOH, from now on, 'no GPU' and 'no Intel GPU' behave the same...

Ack. One of the downsides of daemonsets, particularly if you have a mixed cluster where not all nodes have GPUs: you don't want to wholly fail the daemonset deployment, but nor do you want some of the nodes to get stuck in a constant retry loop. AFAIK, the daemonset containers should not fail unless it really is a fatal issue. I think staring at the logs and the resource metrics is then the only real way to see what got deployed.

rojkov previously approved these changes Jan 28, 2020
Contributor

@rojkov left a comment


Looks good to me. Thanks!

@mythi
Contributor

mythi commented Jan 29, 2020

@grahamwhaley please re-push to trigger jenkins (just in case)

If we fail to scan for GPU devices (note, that is potentially
different from not finding any devices during a scan), then
warn on it, and go around the poll loop again. Do not treat
it as a fatal error or we might end up in a re-launch death
deploy loop...

Of course, getting a warning in your logs every 5s could also
be annoying, but is somewhat 'less fatal'.

Fixes: intel#260
Fixes: intel#230

Signed-off-by: Graham Whaley <[email protected]>
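A minimal sketch of the shape of that change, using hypothetical names (`scan`, the notify call, the poll interval) rather than the plugin's actual code: the scan error is trapped inside the poll loop, logged as a warning, and the loop simply goes around again instead of returning the error as fatal.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"time"
)

// scan stands in for the plugin's device scan; on a node without a DRM
// class directory it returns an error.
func scan() (map[string][]string, error) {
	return nil, errors.New("no DRM class directory found")
}

func main() {
	// Trap the scan error inside the poll loop, warn, and go around
	// again, instead of returning it as fatal (which would make the
	// daemonset pod exit and be re-launched endlessly).
	for i := 0; i < 3; i++ { // bounded here only so the demo terminates
		devTree, err := scan()
		if err != nil {
			log.Printf("warning: device scan failed: %v", err)
		} else {
			fmt.Println("notifying kubelet with", devTree)
		}
		time.Sleep(100 * time.Millisecond)
	}
}
```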
@grahamwhaley grahamwhaley dismissed stale reviews from rojkov and mythi via 6537e38 January 29, 2020 09:25
@grahamwhaley
Author

re-based and pushed

@grahamwhaley
Author

@mythi, looks like Travis CI timed out...

> The job exceeded the maximum time limit for jobs, and has been terminated.

@rojkov
Contributor

rojkov commented Jan 29, 2020

restarted... let's see if it's fast enough now

@rojkov rojkov self-requested a review January 30, 2020 09:05
@rojkov rojkov merged commit 8841141 into intel:master Jan 30, 2020


Development

Successfully merging this pull request may close these issues.

GPU plugin fails on CoreOS
