Skip to content

Conversation

alanwguo
Copy link
Contributor

Why are these changes needed?

Bugs introduced in #52102

Two bugs:

  • proc is a TypedDict so it needs to be fetched via proc["pid"] instead of proc.pid.
  • Changing processes_pid is backwards-incompatible change that ends up changing the dashboard APIs that power the ray dashboard. Maintain backwards-compatibility

Verified fix:
Metrics work again:
Screenshot 2025-08-27 at 12 22 40 PM

Ray Dashboard works again:
Screenshot 2025-08-27 at 12 21 51 PM

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Alan Guo <[email protected]>
@alanwguo alanwguo requested a review from a team as a code owner August 27, 2025 19:30
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses two critical bugs related to GPU metrics that were impacting the Ray Dashboard. The first fix correctly handles TypedDict access for process information, changing proc.pid to proc["pid"]. The second fix restores backward compatibility for the dashboard API by converting the processes_pids dictionary back to a list.

The changes are correct and well-targeted. I've added a couple of suggestions to improve the robustness of the code against potential inconsistencies in the data structure of processes_pids coming from different GPU monitoring providers (nvidia-smi vs. pynvml), which could lead to runtime errors.

@@ -937,7 +937,7 @@ def _get_workers(self, gpus: Optional[List[GpuUtilizationInfo]] = None):
processes = gpu.get("processes_pids")
if processes:
for proc in processes.values():
gpu_pid_mapping[proc.pid].append(proc)
gpu_pid_mapping[proc["pid"]].append(proc)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

qq: is there a test we can add to actually hit this code path

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will update test_report_per_component_stats_gpu to make sure it fails without my fix

Signed-off-by: Alan Guo <[email protected]>
@alanwguo
Copy link
Contributor Author

verified with RAY_metric_enable_gpu_nvsmi enabled and disabled

Signed-off-by: Alan Guo <[email protected]>
Signed-off-by: Alan Guo <[email protected]>
Signed-off-by: Alan Guo <[email protected]>
@@ -893,7 +893,7 @@ def _get_agent_proc(self) -> psutil.Process:
def _generate_worker_key(self, proc: psutil.Process) -> Tuple[int, float]:
return (proc.pid, proc.create_time())

def _get_workers(self, gpus: Optional[List[GpuUtilizationInfo]] = None):
def _get_worker_processes(self):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this change is for test mocking

Copy link
Collaborator

@can-anyscale can-anyscale left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great test, just one question

@@ -221,7 +221,7 @@ async def _get_actor_info(actor: Optional[dict]) -> Optional[dict]:
break

for gpu_stats in node_physical_stats.get("gpus", []):
# gpu_stats.get("processes") can be None, an empty list or a
# gpu_stats.get("processesPids") can be None, an empty list or a
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[1]

@@ -1763,6 +1769,14 @@ def _compose_stats_payload(

self._metrics_agent.clean_all_dead_worker_metrics()

# Convert processes_pids back to a list of dictionaries to maintain backwards-compatibility
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[1] is using processPids as the key while this uses processes_pids so I cannot connect the dots of who is using this field; perhaps add comments to explain what is the consumption of this gpu object (the datacenter in this case? and that it expects this type)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment here

@@ -579,29 +777,58 @@ def test_report_per_component_stats_gpu():
else:
assert record.value == 86

# Test stats with two tasks on one GPU.
# Test stats with two tasks on one GPU.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: undo?

Signed-off-by: Alan Guo <[email protected]>
@can-anyscale can-anyscale self-requested a review August 27, 2025 21:49
Signed-off-by: Alan Guo <[email protected]>
# TODO(aguo): Add a pydantic model for this dict to maintain compatibility
# with the Ray Dashboard API and UI code.

# NOTE: This converts keys to "Google style", (e.g: "processes_pids" -> "processesPids")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@can-anyscale added comment here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

look at the implementation of jsonify_asdict

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's confusing and unnecessary but it's too late to change. It would be massively backwards-incompatible to remove this conversion.

@alanwguo alanwguo added the go add ONLY when ready to merge, run all tests label Aug 27, 2025
Copy link
Collaborator

@can-anyscale can-anyscale left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Signed-off-by: Alan Guo <[email protected]>
Signed-off-by: Alan Guo <[email protected]>
@ray-gardener ray-gardener bot added dashboard Issues specific to the Ray Dashboard core Issues that should be addressed in Ray Core labels Aug 28, 2025
@alanwguo alanwguo merged commit bbdce36 into ray-project:master Aug 28, 2025
5 checks passed
@aslonnie aslonnie mentioned this pull request Aug 28, 2025
8 tasks
aslonnie pushed a commit that referenced this pull request Aug 28, 2025
cherrypick #56009

Bugs introduced in #52102, Two bugs:
- proc is a TypedDict so it needs to be fetched via `proc["pid"]`
instead of `proc.pid`.
- Changing `processes_pid` is backwards-incompatible change that ends up
changing the dashboard APIs that power the ray dashboard. Maintain
backwards-compatibility

Verified fix:
Metrics work again:
<img width="947" height="441" alt="Screenshot 2025-08-27 at 12 22 40 PM"
src="https://github.com/user-attachments/assets/0a9a83e7-b720-4ad0-b90e-1baa394edde5"
/>


Ray Dashboard works again:
<img width="1824" height="1029" alt="Screenshot 2025-08-27 at 12 21
51 PM"
src="https://github.com/user-attachments/assets/6b0e08e4-69c9-4223-b736-ff69b8d306db"
/>

---------

Signed-off-by: Alan Guo <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Issues that should be addressed in Ray Core dashboard Issues specific to the Ray Dashboard go add ONLY when ready to merge, run all tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants