Description
Symptom: on-demand/spot compute nodes fail to boot with Scheduler health check failures shortly after a job is allocated to them. Here is an example from /var/log/slurmctld.log:
[2025-10-18T07:34:17.003] sched: Allocate JobId=684906 NodeList=od-r6a-l-dy-od-16-gb-1-cores-4 #CPUs=1 Partition=od-16-gb-1-cores
[2025-10-18T07:37:44.231] update_node: node od-r6a-l-dy-od-16-gb-1-cores-4 reason set to: Scheduler health check failed
[2025-10-18T07:37:44.231] requeue job JobId=684906 due to failure of node od-r6a-l-dy-od-16-gb-1-cores-4
[2025-10-18T07:37:44.232] powering down node od-r6a-l-dy-od-16-gb-1-cores-4
The node fails about three minutes after the job is allocated to it.
When a compute node boots, it downloads on_compute_node_configured.sh from S3 to /opt/slurm/config/bin as on_compute_node_configured.sh.new and compares it against the existing file:
# Make sure we're running the latest version
dest_script="$config_bin_dir/${script_name}"
mkdir -p $config_bin_dir
aws s3 cp s3://$assets_bucket/$assets_base_key/config/bin/${script_name} $dest_script.new
chmod 0700 $dest_script.new
if ! [ -e $dest_script ] || ! diff -q $dest_script $dest_script.new; then
mv -f $dest_script.new $dest_script
exec $dest_script
else
rm $dest_script.new
fi
On a compute node, this downloads on_compute_node_configured.sh to an NFS-mounted location served by the head_node. Because every node writes to the same $dest_script.new path, when many machines spin up in parallel and all run the aws s3 cp at once, I've hit the case where a compute node completes the copy but then fails the chmod because some other machine (I'm assuming) is also doing the cp and the file "isn't there". My machine then fails to boot and is marked bad due to a Scheduler health check failure.
Here's the output from my bootstrap_error_msg showing the failure:
{
"datetime": "2025-10-18T16:34:48.938+00:00",
"version": 0,
"scheduler": "slurm",
"cluster-name": "tsi5s",
"node-role": "ComputeFleet",
"component": "custom-action",
"level": "ERROR",
"instance-id": "i-00081cf47aa1c8b47",
"event-type": "custom-action-error",
"message": "Failed to execute OnNodeConfigured script 1, return code: 126.",
"detail": {
"action": "OnNodeConfigured",
"step": 1,
"stage": "executing",
"error": {
"exit_code": 126,
"stderr": "+ set -x\n+ script_name=on_compute_node_configured.sh\n+ exec\n++ logger -s -t on_compute_node_configured.sh\n<13>Oct 18 09:34:48 on_compute_node_configured.sh: ++ date\n<13>Oct 18 09:34:48 on_compute_node_configured.sh: + echo 'Sat Oct 18 09:34:48 PDT 2025: Started on_compute_node_configured.sh'\n<13>Oct 18 09:34:48 on_compute_node_configured.sh: Sat Oct 18 09:34:48 PDT 2025: Started on_compute_node_configured.sh\n<13>Oct 18 09:34:48 on_compute_node_configured.sh: + assets_bucket=cdk-hnb659fds-assets-326469498578-us-west-2\n<13>Oct 18 09:34:48 on_compute_node_configured.sh: + assets_base_key=tsi5s\n<13>Oct 18 09:34:48 on_compute_node_configured.sh: + export AWS_DEFAULT_REGION=us-west-2\n<13>Oct 18 09:34:48 on_compute_node_configured.sh: + AWS_DEFAULT_REGION=us-west-2\n<13>Oct 18 09:34:48 on_compute_node_configured.sh: + ErrorSnsTopicArn=arn:aws:sns:us-west-2:326469498578:eda-slurm-dev\n<13>Oct 18 09:34:48 on_compute_node_configured.sh: + HomeMountSrc=\n<13>Oct 18 09:34:48 on_compute_node_configured.sh: + playbooks_s3_url=s3://cdk-hnb659fds-assets-326469498578-us-west-2/tsi5s/playbooks.zip\n<13>Oct 18 09:34:48 on_compute_node_configured.sh: + trap on_exit EXIT\n<13>Oct 18 09:34:48 on_compute_node_configured.sh: + config_dir=/opt/slurm/config\n<13>Oct 18 09:34:48 on_compute_node_configured.sh: + config_bin_dir=/opt/slurm/config/bin\n<13>Oct 18 09:34:48 on_compute_node_configured.sh: + '[' -e /opt/slurm/config/bin/on_compute_node_configured_custom_prolog.sh ']'\n<13>Oct 18 09:34:48 on_compute_node_configured.sh: + /opt/slurm/config/bin/on_compute_node_configured_custom_prolog.sh\n<13>Oct 18 09:34:48 on_compute_node_configured.sh: + set -x\n<13>Oct 18 09:34:48 on_compute_node_configured.sh: + set -e\n<13>Oct 18 09:34:48 on_compute_node_configured.sh: + script_name=on_compute_node_configured_custom_prolog.sh\n<13>Oct 18 09:34:48 on_compute_node_configured.sh: + exec\n<13>Oct 18 09:34:48 on_compute_node_configured.sh: ++ logger -s -t on_compute_node_configured_custom_prolog.sh\n<13>Oct 18 09:34:48 on_compute_node_configured.sh: <13>Oct 18 09:34:48 on_compute_node_configured_custom_prolog.sh: ++ date\n<13>Oct 18 09:34:48 on_compute_node_configured.sh: <13>Oct 18 09:34:48 on_compute_node_configured_custom_prolog.sh: + echo 'Sat Oct 18 09:34:48 PDT 2025: Started on_compute_node_configured_custom_prolog.sh'\n<13>Oct 18 09:34:48 on_compute_node_configured.sh: <13>Oct 18 09:34:48 on_compute_node_configured_custom_prolog.sh: Sat Oct 18 09:34:48 PDT 2025: Started on_compute_node_configured_custom_prolog.sh\n<13>Oct 18 09:34:48 on_compute_node_configured.sh: <13>Oct 18 09:34:48 on_compute_node_configured_custom_prolog.sh: + echo 'Enabling SSH X11 forwarding...'\n<13>Oct 18 09:34:48 on_compute_node_configured.sh: <13>Oct 18 09:34:48 on_compute_node_configured_custom_prolog.sh: Enabling SSH X11 forwarding...\n<13>Oct 18 09:34:48 on_compute_node_configured.sh: <13>Oct 18 09:34:48 on_compute_node_configured_custom_prolog.sh: + grep -q '^X11Forwarding yes' /etc/ssh/sshd_config\n<13>Oct 18 09:34:48 on_compute_node_configured.sh: <13>Oct 18 09:34:48 on_compute_node_configured_custom_prolog.sh: + echo 'X11Forwarding yes'\n<13>Oct 18 09:34:48 on_compute_node_configured.sh: <13>Oct 18 09:34:48 on_compute_node_configured_custom_prolog.sh: + echo 'X11DisplayOffset 10'\n<13>Oct 18 09:34:48 on_compute_node_configured.sh: <13>Oct 18 09:34:48 on_compute_node_configured_custom_prolog.sh: + echo 'X11UseLocalhost yes'\n<13>Oct 18 09:34:48 on_compute_node_configured.sh: <13>Oct 18 09:34:48 
on_compute_node_configured_custom_prolog.sh: + systemctl restart sshd\n<13>Oct 18 09:34:48 on_compute_node_configured.sh: <13>Oct 18 09:34:48 on_compute_node_configured_custom_prolog.sh: + echo 'SSH X11 forwarding enabled and sshd restarted'\n<13>Oct 18 09:34:48 on_compute_node_configured.sh: <13>Oct 18 09:34:48 on_compute_node_configured_custom_prolog.sh: SSH X11 forwarding enabled and sshd restarted\n<13>Oct 18 09:34:48 on_compute_node_configured.sh: <13>Oct 18 09:34:48 on_compute_node_configured_custom_prolog.sh: + systemctl is-enabled --quiet gdm\n<13>Oct 18 09:34:48 on_compute_node_configured.sh: <13>Oct 18 09:34:48 on_compute_node_configured_custom_prolog.sh: + echo 'GDM not enabled - no need to disable... skipping'\n<13>Oct 18 09:34:48 on_compute_node_configured.sh: <13>Oct 18 09:34:48 on_compute_node_configured_custom_prolog.sh: GDM not enabled - no need to disable... skipping\n<13>Oct 18 09:34:48 on_compute_node_configured.sh: <13>Oct 18 09:34:48 on_compute_node_configured_custom_prolog.sh: ++ date\n<13>Oct 18 09:34:48 on_compute_node_configured.sh: + '[' -z ']'\n<13>Oct 18 09:34:48 on_compute_node_configured.sh: + diff -q /opt/slurm/config/munge.key /etc/munge/munge.key\n<13>Oct 18 09:34:48 on_compute_node_configured.sh: <13>Oct 18 09:34:48 on_compute_node_configured_custom_prolog.sh: + echo 'Sat Oct 18 09:34:48 PDT 2025: Finished on_compute_node_configured_custom_prolog.sh'\n<13>Oct 18 09:34:48 on_compute_node_configured.sh: <13>Oct 18 09:34:48 on_compute_node_configured_custom_prolog.sh: Sat Oct 18 09:34:48 PDT 2025: Finished on_compute_node_configured_custom_prolog.sh\n<13>Oct 18 09:34:48 on_compute_node_configured.sh: <13>Oct 18 09:34:48 on_compute_node_configured_custom_prolog.sh: + exit 0\n<13>Oct 18 09:34:48 on_compute_node_configured.sh: + dest_script=/opt/slurm/config/bin/on_compute_node_configured.sh\n<13>Oct 18 09:34:48 on_compute_node_configured.sh: + mkdir -p /opt/slurm/config/bin\n<13>Oct 18 09:34:48 on_compute_node_configured.sh: + aws s3 cp s3://cdk-hnb659fds-assets-326469498578-us-west-2/tsi5s/config/bin/on_compute_node_configured.sh /opt/slurm/config/bin/on_compute_node_configured.sh.new\n<13>Oct 18 09:34:48 on_compute_node_configured.sh: Completed 3.6 KiB/3.6 KiB (44.1 KiB/s) with 1 file(s) remaining\ndownload: s3://cdk-hnb659fds-assets-326469498578-us-west-2/tsi5s/config/bin/on_compute_node_configured.sh to ../../opt/slurm/config/bin/on_compute_node_configured.sh.new\n<13>Oct 18 09:34:48 on_compute_node_configured.sh: + chmod 0700 /opt/slurm/config/bin/on_compute_node_configured.sh.new\n<13>Oct 18 09:34:48 on_compute_node_configured.sh: chmod: changing permissions of '/opt/slurm/config/bin/on_compute_node_configured.sh.new': No such file or directory\n<13>Oct 18 09:34:48 on_compute_node_configured.sh: + '[' -e /opt/slurm/config/bin/on_compute_node_configured.sh ']'\n<13>Oct 18 09:34:48 on_compute_node_configured.sh: + diff -q /opt/slurm/config/bin/on_compute_node_configured.sh /opt/slurm/config/bin/on_compute_node_configured.sh.new\n<13>Oct 18 09:34:48 on_compute_node_configured.sh: diff: /opt/slurm/config/bin/on_compute_node_configured.sh.new: No such file or directory\n<13>Oct 18 09:34:48 on_compute_node_configured.sh: + mv -f /opt/slurm/config/bin/on_compute_node_configured.sh.new /opt/slurm/config/bin/on_compute_node_configured.sh\n<13>Oct 18 09:34:48 on_compute_node_configured.sh: mv: cannot stat '/opt/slurm/config/bin/on_compute_node_configured.sh.new': No such file or directory\n<13>Oct 18 09:34:48 on_compute_node_configured.sh: + exec 
/opt/slurm/config/bin/on_compute_node_configured.sh\n<13>Oct 18 09:34:48 on_compute_node_configured.sh: /tmp/tmp0gqh2_m9: line 62: /opt/slurm/config/bin/on_compute_node_configured.sh: Permission denied\n<13>Oct 18 09:34:48 on_compute_node_configured.sh: /tmp/tmp0gqh2_m9: line 62: exec: /opt/slurm/config/bin/on_compute_node_configured.sh: cannot execute: Permission denied\n"
}
},
"compute": {
"name": "od-r7a-m-dy-od-8-gb-1-cores-2",
"instance-id": "i-00081cf47aa1c8b47",
"instance-type": "r7a.medium",
"availability-zone": "us-west-2a",
"address": "10.6.13.193",
"hostname": "ip-10-6-13-193.example.net",
"queue-name": "od-r7a-m",
"compute-resource": "od-8-gb-1-cores",
"node-type": "dynamic"
}
}
This needs to be rethought. I'd suggest we just assume the S3 copy is always correct and not bother checking against the head_node - there's no upside, since nothing ever uses the head_node copy?
If we do want to keep the head_node's NFS copy up to date, on_compute_node_configured.sh should be copied to node-local storage first, checked there, and then written over the head_node version without ever reading what's on the head_node. I'm not sure how this second option behaves if many nodes do the copy at once - I guess the last one to write wins? And who actually cares, since we're not executing from this version? A rough sketch of that approach is below.
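For what it's worth, here is a minimal, untested sketch of that second option. The mktemp temp paths and the comments are my additions; the variable names come from the snippet above:
dest_script="$config_bin_dir/${script_name}"
mkdir -p "$config_bin_dir"
# Download to node-local storage so no other node can touch this copy.
local_copy=$(mktemp /tmp/${script_name}.XXXXXX)
aws s3 cp s3://$assets_bucket/$assets_base_key/config/bin/${script_name} "$local_copy"
chmod 0700 "$local_copy"
# Refresh the shared copy without ever reading it: stage the file on the same
# filesystem under a unique name, then rename it over the destination.
# The rename means the last node to write wins and nobody sees a partial file.
staged_copy=$(mktemp "${dest_script}.XXXXXX")
cp "$local_copy" "$staged_copy"
chmod 0700 "$staged_copy"
mv -f "$staged_copy" "$dest_script"
rm -f "$local_copy"
Whether the final rename is truly atomic depends on the NFS export, but even a last-writer-wins overwrite of identical content is harmless here.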
For now, I've removed this section of code and am relying on the S3 version being correct.
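If the project would rather keep the existing compare-and-exec logic, a narrower fix might be to have each node stage its download under a unique name instead of the shared $dest_script.new. A rough sketch, untested - the mktemp call, quoting, and the > /dev/null on diff are my additions:
dest_script="$config_bin_dir/${script_name}"
mkdir -p "$config_bin_dir"
# Stage the download under a per-node unique name so parallel boots no longer
# clobber a single shared $dest_script.new.
tmp_script=$(mktemp "${dest_script}.new.XXXXXX")
aws s3 cp s3://$assets_bucket/$assets_base_key/config/bin/${script_name} "$tmp_script"
chmod 0700 "$tmp_script"
if ! [ -e "$dest_script" ] || ! diff -q "$dest_script" "$tmp_script" > /dev/null; then
    # mv within the same directory is a rename, so other nodes never see
    # $dest_script half-written.
    mv -f "$tmp_script" "$dest_script"
    exec "$dest_script"
else
    rm -f "$tmp_script"
fi
With a per-node temp name, concurrent boots can no longer delete or overwrite each other's staging file, which is exactly the failure shown in the stderr above.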