nsutil.UnmountNS: CNI-created network namespace not cleaned up for short-lived jobs #25610

@jonasdemoor

Nomad version

Nomad v1.8.11+ent
BuildDate 2025-03-11T09:23:02Z
Revision f1d10f7f43b943002a505307ae896f8176c038e4+CHANGES

Operating system and Environment details

$ cat /etc/os-release 
PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"
NAME="Debian GNU/Linux"
VERSION_ID="12"
VERSION="12 (bookworm)"
VERSION_CODENAME=bookworm
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
$ uname -a
Linux nomadclndev03 6.1.0-32-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.129-1 (2025-03-06) x86_64 GNU/Linux
$ podman version
Client:       Podman Engine
Version:      5.2.2
API Version:  5.2.2
Go Version:   go1.23.1
Built:        Tue Sep 17 17:43:50 2024
OS/Arch:      linux/amd64
$ apt-cache policy nomad-driver-podman 
nomad-driver-podman:
  Installed: 0.6.2-1
  Candidate: 0.6.2-1
  Version table:
 *** 0.6.2-1 500
        500 http://aptly.ugent.be hashicorp/bookworm amd64 Packages
        100 /var/lib/dpkg/status
     0.6.1-1 500
        500 http://aptly.ugent.be hashicorp/bookworm amd64 Packages
     0.6.0-1 500
        500 http://aptly.ugent.be hashicorp/bookworm amd64 Packages

Issue

We are experiencing a race condition when running short-lived workloads with CNI. Specifically, nsutil.UnmountNS (func UnmountNS(nsPath string) error) fails during the garbage collection (GC) process because the target network namespace is still marked as busy.
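
For context, here is a minimal sketch of the failure mode as we understand it (the helper below is hypothetical, not Nomad's actual nsutil code): the CNI network namespace is a file bind-mounted under /var/run/netns/<alloc-id>, and unmounting it while another process still holds the namespace open returns EBUSY, which surfaces in the client log as "device or resource busy".

// Hypothetical sketch of the failure mode; unmountNetns is illustrative only.
package nsdebug

import (
	"errors"
	"fmt"

	"golang.org/x/sys/unix"
)

func unmountNetns(nsPath string) error {
	// If the container runtime (e.g. podman/conmon) or the just-exited task
	// still holds the namespace open when GC runs, the unmount returns EBUSY.
	if err := unix.Unmount(nsPath, 0); err != nil {
		if errors.Is(err, unix.EBUSY) {
			return fmt.Errorf("failed to unmount NS: at %s: %w", nsPath, err)
		}
		return err
	}
	return nil
}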

This issue is only triggered when:

  • CNI networking is enabled (network { mode = "cni/private" })
  • The workload completes very quickly (e.g., exit 0 immediately)

If either of the following changes is made, the issue is no longer reproducible:

  • Removing the CNI configuration (i.e., not setting network.mode)
  • Making the job run slightly longer (e.g., using sleep 45)

We are currently mitigating this by adding an artificial delay (sleep 45) to our short-lived batch jobs. However, a more robust resolution would be ideal; let us know if further logs or traces would be helpful.
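
One direction that might make GC more tolerant (purely a sketch on our side, not a claim about where a fix belongs in Nomad) is retrying the unmount with a short backoff, so that the brief window in which the runtime still holds the namespace does not fail the whole postrun hook. The helper name, attempt count, and backoff below are hypothetical.

// Hypothetical retry-with-backoff wrapper around the namespace unmount;
// the retry policy is illustrative only.
package nsdebug

import (
	"errors"
	"fmt"
	"time"

	"golang.org/x/sys/unix"
)

func unmountNetnsWithRetry(nsPath string, attempts int, backoff time.Duration) error {
	var err error
	for i := 0; i < attempts; i++ {
		err = unix.Unmount(nsPath, 0)
		if err == nil {
			return nil
		}
		// Only EBUSY is worth retrying; anything else is a real failure.
		if !errors.Is(err, unix.EBUSY) {
			return err
		}
		time.Sleep(backoff)
	}
	return fmt.Errorf("failed to unmount NS: at %s after %d attempts: %w", nsPath, attempts, err)
}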

Reproduction steps

  1. Submit the job definition provided below.
  2. Tail the log output on the client.
  3. Let several periodic allocations run.
  4. Within a few runs (typically 3–4), the error occurs.

Expected Result

The short-lived CNI workload terminates cleanly and GC proceeds without errors: specifically, the container and its associated network namespace should be removed without hitting a "device or resource busy" error.

Actual Result

GC fails to unmount the network namespace with a "device or resource busy" error, resulting in noisy logs and potential resource leaks.

Job file (if appropriate)

job "cni-debug" {
  type        = "batch"
  namespace   = "default"
  datacenters = ["S10"]

  periodic {
    crons            = ["*/1 * * * * *"]
    prohibit_overlap = false
  }

  # Pin to a specific host for easier testing
  constraint {
    attribute = "${attr.unique.hostname}"
    value     = "hostname"
  }

  group "debug" {
    restart {
      attempts = 0
      mode     = "fail"
    }

    reschedule {
      attempts  = 0
      unlimited = false
    }

    count = 1

    network {
      mode = "cni/private"
    }

    task "sleep" {
      driver = "podman"

      config {
        image   = "busybox:latest"
        command = "/bin/sh"
        args    = ["-c", "exit 0"]
      }
    }
  }
}

Nomad Server logs (if appropriate)

N/A

Nomad Client logs (if appropriate)

2025-04-07T11:24:18.681+0200 [ERROR] client.alloc_runner: postrun failed: alloc_id=66f88f81-ea9c-15ea-c113-068bf09aca81 error="hook \"network\" failed: failed to unmount NS: at /var/run/netns/66f88f81-ea9c-15ea-c113-068bf09aca81: device or resource busy"
