
Conversation

@pvts-mat
Contributor

@pvts-mat pvts-mat commented Jul 10, 2025

[LTS 9.4]
CVE-2025-21786
VULN-54096

Problem

https://access.redhat.com/security/cve/CVE-2025-21786

A vulnerability was found in the Linux kernel's work queue subsystem, which manages background task execution. The issue stems from improper handling of the "rescuer" thread during the cleanup of unbound work queues.

Background

The workqueue system allows kernel code to defer tasks for asynchronous execution by kernel threads - the "generic async execution mechanism", as kernel/workqueue.c's header comment puts it.

A piece of work to be executed is called a work item. It's represented by the simple struct work_struct, coupling a function defining the job with some additional data:

struct work_struct {
	atomic_long_t data;
	struct list_head entry;
	work_func_t func;
#ifdef CONFIG_LOCKDEP
	struct lockdep_map lockdep_map;
#endif
};
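
For orientation, a minimal sketch of how kernel code typically uses this structure (DECLARE_WORK and schedule_work() are the standard workqueue API; my_job and my_work are illustrative names, not from the code under discussion):

#include <linux/workqueue.h>

static void my_job(struct work_struct *work)
{
	pr_info("deferred work running\n");	/* executes later, in a worker thread */
}

/* Couple the job function with its bookkeeping data in a static work item. */
static DECLARE_WORK(my_work, my_job);

static void submit_example(void)
{
	schedule_work(&my_work);	/* put the item on the system-wide work queue */
}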

Work items are submitted through the API to work queues

struct workqueue_struct {

The type of work queue an item is put on determines how it will be executed.

From there the items are distributed to internal pool work queues

struct pool_workqueue {

where they await execution by kernel threads called workers. These can easily be observed with any process-listing tool like ps or top, e.g.

# ps -e w | grep kworker/

      7 ?        I      0:00 [kworker/0:0-events]
      8 ?        I<     0:00 [kworker/0:0H-events_highpri]
      9 ?        I      0:00 [kworker/u22:0-events_unbound]
     11 ?        I      0:00 [kworker/u22:1-events_unbound]
     19 ?        I      0:00 [kworker/0:1-events]
     25 ?        I      0:00 [kworker/1:0-rcu_gp]
     26 ?        I<     0:00 [kworker/1:0H-events_highpri]
     31 ?        I      0:00 [kworker/2:0-events]
…

The workers are gathered in work pools

struct worker_pool {

Each pool work queue is associated with a single work pool, and each work pool has zero or more workers attached. Each CPU has two work pools assigned - one for normal work items and the other for high-priority ones. Apart from the CPU-bound pools there are also unbound work pools (backing the unbound work queues mentioned in the CVE), whose number is dynamic. This variety of work pools balances a tradeoff: CPU-bound pools give high locality of execution (and thus efficiency), while unbound pools allow much simpler load balancing.

It's possible for the work items in a work pool to become deadlocked, e.g. under memory pressure, when a new worker needed to make progress cannot be created. For this reason a work queue can have a rescue worker

struct worker *rescuer; /* MD: rescue worker */

which can pick up any work item from the work pool, break the deadlock and push execution forward. The rescuer's thread function rescuer_thread() is the subject of the CVE's fix e769461 in the mainline kernel.
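
Not every work queue has a rescuer: as noted in the review discussion below, only queues created with the WQ_MEM_RECLAIM flag get one. A hedged sketch of allocating such a queue (alloc_workqueue() and the flags are the standard API; the queue name is illustrative):

#include <linux/workqueue.h>

static struct workqueue_struct *my_reclaim_wq;

static int my_setup(void)
{
	/*
	 * WQ_MEM_RECLAIM gives the queue a dedicated rescuer thread,
	 * guaranteeing forward progress under memory pressure; WQ_UNBOUND
	 * routes its items to the unbound work pools described above.
	 */
	my_reclaim_wq = alloc_workqueue("my_reclaim_wq",
					WQ_UNBOUND | WQ_MEM_RECLAIM, 0);
	return my_reclaim_wq ? 0 : -ENOMEM;
}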

Analysis

The bug

Following the KASAN logs from https://lore.kernel.org/lkml/CAKHoSAvP3iQW+GwmKzWjEAOoPvzeWeoMO0Gz7Pp3_4kxt-RMoA@mail.gmail.com/ it can be seen that the use-after-free scenario unfolded as follows:

  1. The rescuer thread released the pool workqueue with put_pwq(…) at

    put_pwq(pwq);

    It was sure - per the accompanying comment

    /*
     * Put the reference grabbed by send_mayday(). @pool won't
     * go away while we're still attached to it.
     */

    - that the pool associated with this pool workqueue would still be around at the moment of the worker_detach_from_pool(…) call at

    worker_detach_from_pool(rescuer);

  2. Simultaneously, the pool workqueue's release work item, pwq_release_workfn(…), executed by a dedicated kthread worker (visible in the stack below), released the worker pool as well

    Last potentially related work creation:
     kasan_save_stack+0x24/0x50 mm/kasan/common.c:47
     __kasan_record_aux_stack+0x8c/0xa0 mm/kasan/generic.c:541
     __call_rcu_common.constprop.0+0x6a/0xad0 kernel/rcu/tree.c:3086
     put_unbound_pool+0x552/0x830 kernel/workqueue.c:4965
     pwq_release_workfn+0x4c6/0x9e0 kernel/workqueue.c:5065
     kthread_worker_fn+0x2b9/0xb00 kernel/kthread.c:844
     kthread+0x2c2/0x3a0 kernel/kthread.c:389
     ret_from_fork+0x48/0x80 arch/x86/kernel/process.c:147
     ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
    

    at

    work->func(work);

    reducing the pool's ref count to 0 and scheduling it for destruction.

  3. The pool workqueue, guarded by the Read-Copy-Update mechanism, was destroyed soon after by the idle task 0, along with its worker pool (the RCU-deferred free pattern is sketched after this list):

    Freed by task 0:
     kasan_save_stack+0x24/0x50 mm/kasan/common.c:47
     kasan_save_track+0x14/0x30 mm/kasan/common.c:68
     kasan_save_free_info+0x3a/0x60 mm/kasan/generic.c:579
     poison_slab_object mm/kasan/common.c:247 [inline]
     __kasan_slab_free+0x38/0x50 mm/kasan/common.c:264
     kasan_slab_free include/linux/kasan.h:230 [inline]
     slab_free_hook mm/slub.c:2342 [inline]
     slab_free mm/slub.c:4579 [inline]
     kfree+0x212/0x4a0 mm/slub.c:4727
     rcu_do_batch kernel/rcu/tree.c:2567 [inline]
     rcu_core+0x835/0x17f0 kernel/rcu/tree.c:2823
     handle_softirqs+0x1b1/0x7d0 kernel/softirq.c:554
     __do_softirq kernel/softirq.c:588 [inline]
     invoke_softirq kernel/softirq.c:428 [inline]
     __irq_exit_rcu kernel/softirq.c:637 [inline]
     irq_exit_rcu+0x94/0xc0 kernel/softirq.c:649
     instr_sysvec_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1049 [inline]
     sysvec_apic_timer_interrupt+0x70/0x80 arch/x86/kernel/apic/apic.c:1049
     asm_sysvec_apic_timer_interrupt+0x1a/0x20 arch/x86/include/asm/idtentry.h:702
    
  4. The rescuer thread continued execution, hitting the worker_detach_from_pool(…) call, which attempted to remove the rescuer worker from the workers list of a pool which no longer existed:

    __dump_stack lib/dump_stack.c:94 [inline]
    dump_stack_lvl+0x116/0x1b0 lib/dump_stack.c:120
    print_address_description mm/kasan/report.c:377 [inline]
    print_report+0xcb/0x620 mm/kasan/report.c:488
    kasan_report+0xbd/0xf0 mm/kasan/report.c:601
    __list_del include/linux/list.h:195 [inline]
    __list_del_entry include/linux/list.h:218 [inline]
    list_del include/linux/list.h:229 [inline]
    detach_worker+0x164/0x180 kernel/workqueue.c:2709
    worker_detach_from_pool kernel/workqueue.c:2728 [inline]
    rescuer_thread+0x69d/0xcd0 kernel/workqueue.c:3526
    kthread+0x2c2/0x3a0 kernel/kthread.c:389
    ret_from_fork+0x48/0x80 arch/x86/kernel/process.c:147
    ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
    

    See

    list_del(&worker->node);

    and the read/write operations in list_del's underlying implementation:

    static inline void __list_del(struct list_head * prev, struct list_head * next)
    {
    	next->prev = prev;
    	WRITE_ONCE(prev->next, next);
    }

    With the pool already freed, the list entries these pointers reach through live in freed memory, so the writes above are exactly the use-after-free KASAN reported.
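
For context on the RCU-guarded destruction in step 3, here is a minimal, hedged sketch of the deferral pattern (kfree_rcu() is the standard kernel API; struct foo is illustrative):

#include <linux/rcupdate.h>
#include <linux/slab.h>

struct foo {
	int data;
	struct rcu_head rcu;	/* storage used by the deferred-free callback */
};

static void drop_foo(struct foo *f)
{
	/*
	 * Don't kfree() immediately: RCU readers may still hold pointers
	 * to f. kfree_rcu() frees only after a grace period elapses -
	 * which is why the actual kfree happens in an RCU callback run
	 * by task 0 in the "Freed by task 0" stack above.
	 */
	kfree_rcu(f, rcu);
}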

The fix

The core of the fix is moving the put_pwq(…) call after the worker_detach_from_pool(…) call to ensure the pool's ref count remains greater than zero at the moment of detaching the rescuer from it. Before:

/*
 * Put the reference grabbed by send_mayday(). @pool won't
 * go away while we're still attached to it.
 */
put_pwq(pwq);

/*
 * Leave this pool. Notify regular workers; otherwise, we end up
 * with 0 concurrency and stalling the execution.
 */
kick_pool(pool);
raw_spin_unlock_irq(&pool->lock);

worker_detach_from_pool(rescuer);

raw_spin_lock_irq(&wq_mayday_lock);

After:

/*
 * Leave this pool. Notify regular workers; otherwise, we end up
 * with 0 concurrency and stalling the execution.
 */
kick_pool(pool);
raw_spin_unlock_irq(&pool->lock);

worker_detach_from_pool(rescuer);

/*
 * Put the reference grabbed by send_mayday(). @pool might
 * go away any time after it.
 */
put_pwq_unlocked(pwq);

raw_spin_lock_irq(&wq_mayday_lock);

Although the moved call changed to put_pwq_unlocked(…), it's actually the same put_pwq(…), only wrapped in a raw_spin_lock_irq(…) / raw_spin_unlock_irq(…) pair:

	raw_spin_lock_irq(&pwq->pool->lock);
	put_pwq(pwq);
	raw_spin_unlock_irq(&pwq->pool->lock);

This can be seen even more clearly in the original version of the fix proposed by Tejun Heo on the mailing list https://lore.kernel.org/lkml/[email protected]/:

+		/*
+		 * Put the reference grabbed by send_mayday(). This must come
+		 * after the final access of the pool.
+		 */
+		raw_spin_lock_irq(&pool->lock);
+		put_pwq(pwq);
+		raw_spin_unlock_irq(&pool->lock);

This wrapping was not necessary before because pool->lock was already held at the time of the put_pwq(pwq) call, see

raw_spin_lock_irq(&pool->lock);

Applicability: no

The affected file kernel/workqueue.c is unconditionally compiled into every kernel

signal.o sys.o umh.o workqueue.o pid.o task_work.o \

so it's part of any LTS 9.4 build regardless of the configuration used.

However, the CVE-2025-21786 bug fixed by the e769461 patch does not apply to the code found at the ciqlts9_4 revision, and the patch, while not harmful on the functional level, shouldn't be applied. The arguments are listed below.

The "fixes" commit is missing from the LTS 9.4 history

The e769461 fix names 68f8305 as the commit introducing the bug. That commit is missing from the LTS 9.4 history of kernel/workqueue.c, nor was it ever backported - see workqueue-history.txt.

Commit e769461's message explicitly blames changes introduced in 68f8305:

The commit 68f8305("workqueue: Reap workers via kthread_stop() and remove detach_completion") adds code to reap the normal workers but mistakenly does not handle the rescuer and also removes the code waiting for the rescuer in put_unbound_pool(), which caused a use-after-free bug reported by Cheung Wall.

The "code waiting for the rescuer" removed in 68f8305 is present in the ciqlts9_4 revision:

if (pool->detach_completion)
	wait_for_completion(pool->detach_completion);

The put_pwq(…) call is not placed randomly

Examining the git history shows that the authors of the workqueue mechanism - Lai Jiangshan and Tejun Heo - took great care to place the grab/put calls in the proper spots. See commit 77668c8, which introduced the put_pwq(…) call:

workqueue: fix a possible race condition between rescuer and pwq-release

There is a race condition between rescuer_thread() and
pwq_unbound_release_workfn().

Even after a pwq is scheduled for rescue, the associated work items
may be consumed by any worker.  If all of them are consumed before the
rescuer gets to them and the pwq's base ref was put due to attribute
change, the pwq may be released while still being linked on
@wq->maydays list making the rescuer dereference already freed pwq
later.

Make send_mayday() pin the target pwq until the rescuer is done with
it.

(In fact, this commit pre-emptively fixed the CVE-2025-21786 bug (not a CVE back then), which only re-surfaced after the 68f8305 commit - it addresses the same problem.)

Commit 13b1d62, in turn, dealt with the placement of the worker_detach_from_pool(…) call and explicitly related it to the put_pwq(…) call:

workqueue: move rescuer pool detachment to the end

In 51697d393922 ("workqueue: use generic attach/detach routine for
rescuers"), The rescuer detaches itself from the pool before put_pwq()
so that the put_unbound_pool() will not destroy the rescuer-attached
pool.

It is unnecessary.  worker_detach_from_pool() can be used as the last
statement to access to the pool just like the regular workers,
put_unbound_pool() will wait for it to detach and then free the pool.

So we move the worker_detach_from_pool() down, make it coincide with
the regular workers.

It's only the "put_unbound_pool() will wait for it to detach" part that turned false after the introduction of 68f8305 - which, again, never happened in LTS 9.4.

Using the patched version is not without cost

From the short bug and fix analysis above it should be clear that applying the CVE-2025-21786 patch merely makes the rescuer hold a reference a little longer, so applying it "just in case" might seem harmless. However, besides the residual uncertainty about that harmlessness, the patch introduces an unnecessary lock/unlock of &pwq->pool->lock around the put_pwq(pwq) call (see the fix above). In general it's better to avoid unnecessary locking: it hurts performance and can introduce deadlock scenarios that weren't there before.

RedHat's "Affected" classification doesn't hold much weight

A counter-argument to not backporting the patch could be Red Hat listing "Red Hat Enterprise Linux 9" as "Affected" on the CVE-2025-21786 bug's page https://access.redhat.com/security/cve/CVE-2025-21786.

However, Red Hat's "Affected" may in actuality mean either "affected, confirmed" or "not investigated yet":

Unless explicitly stated as not affected, all previous versions of packages in any minor update stream of a product listed here should be assumed vulnerable, although may not have been subject to full analysis.

This stands in contrast to the "Not affected" classification, which does mean "not affected, confirmed".

@pvts-mat pvts-mat marked this pull request as draft July 10, 2025 15:27
@pvts-mat
Contributor Author

The "draft" status is only to prevent accidental merge, the PR is ready for review.

@kerneltoast
Collaborator

The pwq refcount was able to hit zero because the initial pwq reference was put in apply_wqattrs_cleanup(). This happened because a task changed the implicated workqueue's CPU affinity mask by writing to /sys/devices/virtual/workqueue/WQ_NAME/cpumask, which triggers a pwq replacement. After the new pwqs are committed, the old ones are freed by apply_wqattrs_cleanup() putting those initial references.

So for the issue to occur, the following must happen at around the same time:

  • There is a worker running from inside a workqueue's rescuer kthread. Only workqueues with WQ_MEM_RECLAIM have a rescuer kthread, and even then the rescuer kthread is only used as a fallback to guarantee forward progress of the workqueue's workers when memory pressure is high. There aren't many workqueues with WQ_MEM_RECLAIM and even then it is rare for a worker to hit the rescuer kthread.
  • There is a task writing to /sys/devices/virtual/workqueue/WQ_NAME/cpumask for that workqueue that has a worker running in the rescuer kthread.
  • The last reference on the pwq must be put by either the worker in the rescuer kthread, or apply_wqattrs_cleanup() quickly enough to get the pwq freed before the rescuer kthread is done using it.
  • At least one RCU grace period must elapse after the last pwq reference is put so that the kfree_rcu() RCU callback can run and actually kfree the pwq. And this must occur before the rescuer kthread finishes using the pwq.

This can be triggered under high memory pressure while writing to /sys/devices/virtual/workqueue/WQ_NAME/cpumask and hammering the CPU running the rescuer kthread for WQ_NAME, I guess.
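
For illustration only, a hedged user-space sketch of that trigger (run as root; it assumes a WQ_SYSFS workqueue - "writeback" is a commonly exposed one that also has WQ_MEM_RECLAIM - and it only supplies the pwq-replacement ingredient; the memory-pressure and timing conditions above must still line up):

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	/*
	 * Every accepted cpumask write replaces the workqueue's pwqs;
	 * apply_wqattrs_cleanup() then puts the old pwqs' initial refs.
	 */
	int fd = open("/sys/devices/virtual/workqueue/writeback/cpumask",
		      O_WRONLY);

	if (fd < 0)
		return 1;
	for (;;) {
		write(fd, "1", 1);	/* hex cpumask: CPU 0 only */
		write(fd, "3", 1);	/* hex cpumask: CPUs 0-1 */
	}
}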

I don't think we should bother picking this, since the Fixes commit was introduced in 6.11 and wasn't backported to any stable kernels. The CVE fix itself is only present on 6.12+ kernels upstream, so I think it's safe to say we don't need to bother with this.

@pvts-mat
Copy link
Contributor Author

Thanks @kerneltoast for shedding more light on this issue

@PlaidCat
Collaborator

Closing as not applicable. Thank you @pvts-mat and @kerneltoast

@PlaidCat PlaidCat closed this Jul 15, 2025