
Conversation

@pvts-mat
Contributor

@pvts-mat pvts-mat commented Jul 10, 2025

[LTS 9.4]
CVE-2025-21786
VULN-54096

Problem

https://access.redhat.com/security/cve/CVE-2025-21786

A vulnerability was found in the Linux kernel's work queue subsystem, which manages background task execution. The issue stems from improper handling of the "rescuer" thread during the cleanup of unbound work queues.

Background

The workqueue system allows kernel code to defer tasks for asynchronous execution by kernel threads - the "generic async execution mechanism", as kernel/workqueue.c's header comment puts it.

A piece of work to be executed is called a work item. It's represented by the simple struct work_struct, coupling a function defining the job with some additional data:

struct work_struct {
	atomic_long_t data;
	struct list_head entry;
	work_func_t func;
#ifdef CONFIG_LOCKDEP
	struct lockdep_map lockdep_map;
#endif
};
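
For orientation, a minimal sketch of how kernel code typically uses this structure (DECLARE_WORK and schedule_work() are the standard workqueue API; my_job and my_work are illustrative names, not from the code under discussion):

#include <linux/workqueue.h>

static void my_job(struct work_struct *work)
{
	pr_info("deferred work running\n");	/* executes later, in a worker thread */
}

/* Couple the job function with its bookkeeping data in a static work item. */
static DECLARE_WORK(my_work, my_job);

static void submit_example(void)
{
	schedule_work(&my_work);	/* put the item on the system-wide work queue */
}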

Work items are submitted through the API to work queues

struct workqueue_struct {

The type of work queue an item is put on determines how it will be executed.

From there the items are distributed to internal pool work queues

struct pool_workqueue {

where they await execution by kernel threads called workers. These can easily be observed with any process-listing tool like ps or top, e.g.

# ps -e w | grep kworker/

      7 ?        I      0:00 [kworker/0:0-events]
      8 ?        I<     0:00 [kworker/0:0H-events_highpri]
      9 ?        I      0:00 [kworker/u22:0-events_unbound]
     11 ?        I      0:00 [kworker/u22:1-events_unbound]
     19 ?        I      0:00 [kworker/0:1-events]
     25 ?        I      0:00 [kworker/1:0-rcu_gp]
     26 ?        I<     0:00 [kworker/1:0H-events_highpri]
     31 ?        I      0:00 [kworker/2:0-events]
…

The workers are gathered in work pools

struct worker_pool {

Each pool work queue is associated with a single work pool, and each work pool has zero or more workers attached. Each CPU has two work pools assigned - one for normal work items and the other for high-priority ones. Apart from the CPU-bound pools there are also unbound work pools (backing the unbound work queues mentioned in the CVE), whose number is dynamic. This variety of work pools balances a tradeoff: CPU-bound pools give high locality of execution (and thus efficiency), while unbound pools allow much simpler load balancing.

It's possible for the work items in a work pool to become deadlocked, e.g. under memory pressure, when a new worker needed to make progress cannot be created. For this reason a work queue can have a rescue worker

struct worker *rescuer; /* MD: rescue worker */

which can pick up any work item from the work pool, break the deadlock and push execution forward. The rescuer's thread function rescuer_thread() is the subject of the CVE's fix e769461 in the mainline kernel.
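
Not every work queue has a rescuer: as noted in the review discussion below, only queues created with the WQ_MEM_RECLAIM flag get one. A hedged sketch of allocating such a queue (alloc_workqueue() and the flags are the standard API; the queue name is illustrative):

#include <linux/workqueue.h>

static struct workqueue_struct *my_reclaim_wq;

static int my_setup(void)
{
	/*
	 * WQ_MEM_RECLAIM gives the queue a dedicated rescuer thread,
	 * guaranteeing forward progress under memory pressure; WQ_UNBOUND
	 * routes its items to the unbound work pools described above.
	 */
	my_reclaim_wq = alloc_workqueue("my_reclaim_wq",
					WQ_UNBOUND | WQ_MEM_RECLAIM, 0);
	return my_reclaim_wq ? 0 : -ENOMEM;
}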

Analysis

The bug

Following the KASAN logs from https://lore.kernel.org/lkml/CAKHoSAvP3iQW+GwmKzWjEAOoPvzeWeoMO0Gz7Pp3_4kxt-RMoA@mail.gmail.com/ it can be seen that the use-after-free scenario unfolded as follows:

  1. The rescuer thread released the pool workqueue with put_pwq(…) at

    put_pwq(pwq);

    It was sure - per the accompanying comment

    /*
     * Put the reference grabbed by send_mayday(). @pool won't
     * go away while we're still attached to it.
     */

    - that the pool associated with this pool workqueue would still be around at the moment of the worker_detach_from_pool(…) call at

    worker_detach_from_pool(rescuer);

  2. Simultaneously, the pool workqueue's release work item, pwq_release_workfn(…), executed by a dedicated kthread worker (visible in the stack below), released the worker pool as well

    Last potentially related work creation:
     kasan_save_stack+0x24/0x50 mm/kasan/common.c:47
     __kasan_record_aux_stack+0x8c/0xa0 mm/kasan/generic.c:541
     __call_rcu_common.constprop.0+0x6a/0xad0 kernel/rcu/tree.c:3086
     put_unbound_pool+0x552/0x830 kernel/workqueue.c:4965
     pwq_release_workfn+0x4c6/0x9e0 kernel/workqueue.c:5065
     kthread_worker_fn+0x2b9/0xb00 kernel/kthread.c:844
     kthread+0x2c2/0x3a0 kernel/kthread.c:389
     ret_from_fork+0x48/0x80 arch/x86/kernel/process.c:147
     ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
    

    at

    work->func(work);

    reducing the pool's ref count to 0 and scheduling it for destruction.

  3. The pool workqueue, guarded by the Read-Copy-Update mechanism, was destroyed soon after by the idle task 0, along with its worker pool (the RCU-deferred free pattern is sketched after this list):

    Freed by task 0:
     kasan_save_stack+0x24/0x50 mm/kasan/common.c:47
     kasan_save_track+0x14/0x30 mm/kasan/common.c:68
     kasan_save_free_info+0x3a/0x60 mm/kasan/generic.c:579
     poison_slab_object mm/kasan/common.c:247 [inline]
     __kasan_slab_free+0x38/0x50 mm/kasan/common.c:264
     kasan_slab_free include/linux/kasan.h:230 [inline]
     slab_free_hook mm/slub.c:2342 [inline]
     slab_free mm/slub.c:4579 [inline]
     kfree+0x212/0x4a0 mm/slub.c:4727
     rcu_do_batch kernel/rcu/tree.c:2567 [inline]
     rcu_core+0x835/0x17f0 kernel/rcu/tree.c:2823
     handle_softirqs+0x1b1/0x7d0 kernel/softirq.c:554
     __do_softirq kernel/softirq.c:588 [inline]
     invoke_softirq kernel/softirq.c:428 [inline]
     __irq_exit_rcu kernel/softirq.c:637 [inline]
     irq_exit_rcu+0x94/0xc0 kernel/softirq.c:649
     instr_sysvec_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1049 [inline]
     sysvec_apic_timer_interrupt+0x70/0x80 arch/x86/kernel/apic/apic.c:1049
     asm_sysvec_apic_timer_interrupt+0x1a/0x20 arch/x86/include/asm/idtentry.h:702
    
  4. The rescuer thread continued execution, hitting the worker_detach_from_pool(…) call, which attempted to remove the rescuer worker from the workers list of a pool which no longer existed:

    __dump_stack lib/dump_stack.c:94 [inline]
    dump_stack_lvl+0x116/0x1b0 lib/dump_stack.c:120
    print_address_description mm/kasan/report.c:377 [inline]
    print_report+0xcb/0x620 mm/kasan/report.c:488
    kasan_report+0xbd/0xf0 mm/kasan/report.c:601
    __list_del include/linux/list.h:195 [inline]
    __list_del_entry include/linux/list.h:218 [inline]
    list_del include/linux/list.h:229 [inline]
    detach_worker+0x164/0x180 kernel/workqueue.c:2709
    worker_detach_from_pool kernel/workqueue.c:2728 [inline]
    rescuer_thread+0x69d/0xcd0 kernel/workqueue.c:3526
    kthread+0x2c2/0x3a0 kernel/kthread.c:389
    ret_from_fork+0x48/0x80 arch/x86/kernel/process.c:147
    ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
    

    See

    list_del(&worker->node);

    and the read/write operations in list_del's underlying implementation:

    static inline void __list_del(struct list_head * prev, struct list_head * next)
    {
    	next->prev = prev;
    	WRITE_ONCE(prev->next, next);
    }

    With the pool already freed, the list entries these pointers reach through live in freed memory, so the writes above are exactly the use-after-free KASAN reported.
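
For context on the RCU-guarded destruction in step 3, here is a minimal, hedged sketch of the deferral pattern (kfree_rcu() is the standard kernel API; struct foo is illustrative):

#include <linux/rcupdate.h>
#include <linux/slab.h>

struct foo {
	int data;
	struct rcu_head rcu;	/* storage used by the deferred-free callback */
};

static void drop_foo(struct foo *f)
{
	/*
	 * Don't kfree() immediately: RCU readers may still hold pointers
	 * to f. kfree_rcu() frees only after a grace period elapses -
	 * which is why the actual kfree happens in an RCU callback run
	 * by task 0 in the "Freed by task 0" stack above.
	 */
	kfree_rcu(f, rcu);
}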

The fix

The core of the fix is moving the put_pwq(…) call after the worker_detach_from_pool(…) call to ensure the pool's ref count remains greater than zero at the moment of detaching the rescuer from it. Before:

/*
 * Put the reference grabbed by send_mayday(). @pool won't
 * go away while we're still attached to it.
 */
put_pwq(pwq);

/*
 * Leave this pool. Notify regular workers; otherwise, we end up
 * with 0 concurrency and stalling the execution.
 */
kick_pool(pool);
raw_spin_unlock_irq(&pool->lock);

worker_detach_from_pool(rescuer);

raw_spin_lock_irq(&wq_mayday_lock);

After:

/*
 * Leave this pool. Notify regular workers; otherwise, we end up
 * with 0 concurrency and stalling the execution.
 */
kick_pool(pool);
raw_spin_unlock_irq(&pool->lock);

worker_detach_from_pool(rescuer);

/*
 * Put the reference grabbed by send_mayday(). @pool might
 * go away any time after it.
 */
put_pwq_unlocked(pwq);

raw_spin_lock_irq(&wq_mayday_lock);

Although the moved call changed to put_pwq_unlocked(…), it's actually the same put_pwq(…), only wrapped in a raw_spin_lock_irq(…) / raw_spin_unlock_irq(…) pair:

	raw_spin_lock_irq(&pwq->pool->lock);
	put_pwq(pwq);
	raw_spin_unlock_irq(&pwq->pool->lock);

This can be seen even more clearly in the original version of the fix proposed by Tejun Heo on the mailing list https://lore.kernel.org/lkml/[email protected]/:

+		/*
+		 * Put the reference grabbed by send_mayday(). This must come
+		 * after the final access of the pool.
+		 */
+		raw_spin_lock_irq(&pool->lock);
+		put_pwq(pwq);
+		raw_spin_unlock_irq(&pool->lock);

This wrapping was not necessary before because pool->lock was already held at the time of the put_pwq(pwq) call, see

raw_spin_lock_irq(&pool->lock);

Applicability: no

The affected file kernel/workqueue.c is unconditionally compiled into every kernel

signal.o sys.o umh.o workqueue.o pid.o task_work.o \

so it's part of any LTS 9.4 build regardless of the configuration used.

However, the CVE-2025-21786 bug fixed by the e769461 patch does not apply to the code found at the ciqlts9_4 revision, and the patch, while not harmful on the functional level, shouldn't be applied. The arguments are listed below.

The "fixes" commit is missing from the LTS 9.4 history

The e769461 fix names 68f8305 as the commit introducing the bug. That commit is missing from the LTS 9.4 history of kernel/workqueue.c, nor was it ever backported - see workqueue-history.txt.

Commit e769461's message explicitly blames changes introduced in 68f8305:

The commit 68f8305("workqueue: Reap workers via kthread_stop() and remove detach_completion") adds code to reap the normal workers but mistakenly does not handle the rescuer and also removes the code waiting for the rescuer in put_unbound_pool(), which caused a use-after-free bug reported by Cheung Wall.

The "code waiting for the rescuer" removed in 68f8305 is present in the ciqlts9_4 revision:

if (pool->detach_completion)
	wait_for_completion(pool->detach_completion);

The put_pwq(…) call is not placed randomly

Examining the git history shows that the authors of the workqueue mechanism - Lai Jiangshan and Tejun Heo - took great care to place the grab/put calls in the proper spots. See commit 77668c8, which introduced the put_pwq(…) call:

workqueue: fix a possible race condition between rescuer and pwq-release

There is a race condition between rescuer_thread() and
pwq_unbound_release_workfn().

Even after a pwq is scheduled for rescue, the associated work items
may be consumed by any worker.  If all of them are consumed before the
rescuer gets to them and the pwq's base ref was put due to attribute
change, the pwq may be released while still being linked on
@wq->maydays list making the rescuer dereference already freed pwq
later.

Make send_mayday() pin the target pwq until the rescuer is done with
it.

(In fact, this commit pre-emptively fixed the CVE-2025-21786 bug (not a CVE back then), which only re-surfaced after the 68f8305 commit - it addresses the same problem.)

Commit 13b1d62, in turn, dealt with the placement of the worker_detach_from_pool(…) call and explicitly related it to the put_pwq(…) call:

workqueue: move rescuer pool detachment to the end

In 51697d393922 ("workqueue: use generic attach/detach routine for
rescuers"), The rescuer detaches itself from the pool before put_pwq()
so that the put_unbound_pool() will not destroy the rescuer-attached
pool.

It is unnecessary.  worker_detach_from_pool() can be used as the last
statement to access to the pool just like the regular workers,
put_unbound_pool() will wait for it to detach and then free the pool.

So we move the worker_detach_from_pool() down, make it coincide with
the regular workers.

It's only the "put_unbound_pool() will wait for it to detach" part that turned false after the introduction of 68f8305 - which, again, never happened in LTS 9.4.

Using the patched version is not without cost

From the short bug and fix analysis above it should be clear that applying the CVE-2025-21786 patch merely makes the rescuer hold a reference a little longer, so applying it "just in case" might seem harmless. However, besides the residual uncertainty about that harmlessness, the patch introduces an unnecessary lock/unlock of &pwq->pool->lock around the put_pwq(pwq) call (see the fix above). In general it's better to avoid unnecessary locking: it hurts performance and can introduce deadlock scenarios that weren't there before.

RedHat's "Affected" classification doesn't hold much weight

A counter-argument to not backporting the patch could be Red Hat listing "Red Hat Enterprise Linux 9" as "Affected" on the CVE-2025-21786 bug's page https://access.redhat.com/security/cve/CVE-2025-21786.

However, Red Hat's "Affected" may in actuality mean either "affected, confirmed" or "not investigated yet":

Unless explicitly stated as not affected, all previous versions of packages in any minor update stream of a product listed here should be assumed vulnerable, although may not have been subject to full analysis.

This stands in contrast to the "Not affected" classification, which does mean "not affected, confirmed".

@pvts-mat pvts-mat marked this pull request as draft July 10, 2025 15:27
@pvts-mat
Contributor Author

The "draft" status is only to prevent accidental merge, the PR is ready for review.

@kerneltoast
Collaborator

The pwq refcount was able to hit zero because the initial pwq reference was put in apply_wqattrs_cleanup(). This happened because a task changed the implicated workqueue's CPU affinity mask by writing to /sys/devices/virtual/workqueue/WQ_NAME/cpumask, which triggers a pwq replacement. After the new pwqs are committed, the old ones are freed by apply_wqattrs_cleanup() putting those initial references.

So for the issue to occur, the following must happen at around the same time:

  • There is a worker running from inside a workqueue's rescuer kthread. Only workqueues with WQ_MEM_RECLAIM have a rescuer kthread, and even then the rescuer kthread is only used as a fallback to guarantee forward progress of the workqueue's workers when memory pressure is high. There aren't many workqueues with WQ_MEM_RECLAIM and even then it is rare for a worker to hit the rescuer kthread.
  • There is a task writing to /sys/devices/virtual/workqueue/WQ_NAME/cpumask for that workqueue that has a worker running in the rescuer kthread.
  • The last reference on the pwq must be put by either the worker in the rescuer kthread, or apply_wqattrs_cleanup() quickly enough to get the pwq freed before the rescuer kthread is done using it.
  • At least one RCU grace period must elapse after the last pwq reference is put so that the kfree_rcu() RCU callback can run and actually kfree the pwq. And this must occur before the rescuer kthread finishes using the pwq.

This can be triggered under high memory pressure while writing to /sys/devices/virtual/workqueue/WQ_NAME/cpumask and hammering the CPU running the rescuer kthread for WQ_NAME, I guess.
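
For illustration only, a hedged user-space sketch of that trigger (run as root; it assumes a WQ_SYSFS workqueue - "writeback" is a commonly exposed one that also has WQ_MEM_RECLAIM - and it only supplies the pwq-replacement ingredient; the memory-pressure and timing conditions above must still line up):

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	/*
	 * Every accepted cpumask write replaces the workqueue's pwqs;
	 * apply_wqattrs_cleanup() then puts the old pwqs' initial refs.
	 */
	int fd = open("/sys/devices/virtual/workqueue/writeback/cpumask",
		      O_WRONLY);

	if (fd < 0)
		return 1;
	for (;;) {
		write(fd, "1", 1);	/* hex cpumask: CPU 0 only */
		write(fd, "3", 1);	/* hex cpumask: CPUs 0-1 */
	}
}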

I don't think we should bother picking this, since the Fixes commit was introduced in 6.11 and wasn't backported to any stable kernels. The CVE fix itself is only present on 6.12+ kernels upstream, so I think it's safe to say we don't need to bother with this.

@pvts-mat
Copy link
Contributor Author

Thanks @kerneltoast for shedding more light on this issue

@PlaidCat
Collaborator

Closing as not applicable. Thank you @pvts-mat and @kerneltoast

@PlaidCat PlaidCat closed this Jul 15, 2025