# [Dashboard] Add GPU component usage #52102
Signed-off-by: zhilong <[email protected]>
Signed-off-by: zhaoch23 <[email protected]>
Script to test:

```python
import os
import time

import ray
import torch

# Initialize Ray, using all available GPUs
ray.init()

# Check if CUDA is available
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA device count: {torch.cuda.device_count()}")


@ray.remote(num_gpus=0.5)
class TorchGPUWorker:
    def __init__(self):
        assert torch.cuda.is_available(), "CUDA is not available"
        print(os.getenv("CUDA_VISIBLE_DEVICES"))
        self.device = torch.device("cuda")
        print(f"Worker running on device: {self.device}")

    def matrix_multiply(self, size=16384):
        a = torch.randn(size, size, device=self.device)
        b = torch.randn(size, size, device=self.device)
        # Warm-up run before timing.
        torch.matmul(a, b)
        torch.cuda.synchronize()

        REPEATS = 6
        start = time.time()
        for _ in range(REPEATS):
            c = torch.matmul(a, b)
        torch.cuda.synchronize()
        print(f"{REPEATS} matmuls took {time.time() - start:.2f}s")
        return c.norm().item()


if __name__ == "__main__":
    # Create actors; with num_gpus=0.5, two actors share each GPU.
    gpu_workers = [TorchGPUWorker.remote() for _ in range(4)]
    # Run GPU tasks repeatedly so the dashboard shows sustained usage.
    for _ in range(100):
        result = ray.get([worker.matrix_multiply.remote() for worker in gpu_workers])
        print(f"Result norm of matrix multiply on GPU: {result}")
```
Signed-off-by: zhaoch23 <[email protected]>
```python
                expr="sum(ray_component_gpu_utilization{{{global_filters}}} / 100) by (Component, pid, GpuIndex, GpuDeviceName)",
                legend="{{Component}}::{{pid}}, gpu.{{GpuIndex}}, {{GpuDeviceName}}",
            ),
        ],
    ),
    Panel(
        id=46,
        title="Component GPU Memory Usage",
        description="GPU memory usage of Ray components.",
        unit="bytes",
        targets=[
            Target(
                expr="sum(ray_component_gpu_memory_usage{{{global_filters}}}) by (Component, pid, GpuIndex, GpuDeviceName)",
                legend="{{Component}}::{{pid}}, gpu.{{GpuIndex}}, {{GpuDeviceName}}",
```
Let's remove `pid` to align with the Node CPU component graph.
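With `pid` dropped, the `Target` above might look like the following sketch (the adjusted aggregation labels and legend are my illustration, not taken from the final diff):

```python
Target(
    expr="sum(ray_component_gpu_utilization{{{global_filters}}} / 100) by (Component, GpuIndex, GpuDeviceName)",
    legend="{{Component}}, gpu.{{GpuIndex}}, {{GpuDeviceName}}",
),
```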
```python
if pid == "-":  # no process on this GPU
    continue
gpu_id = int(gpu_id)
pinfo = ProcessGPUInfo(
```
Can we use a different type here? Since `gpu_memory_usage` is of type `Megabytes`, it's very confusing for it to be a percentage and may introduce tricky bugs later.
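One hedged way to keep the units unambiguous is to use distinct types for megabytes and percentages. The names below (`Megabytes`, `Percentage`, and the `ProcessGPUInfo` fields) are illustrative sketches, not the actual Ray definitions:

```python
from dataclasses import dataclass
from typing import NewType, Optional

# Distinct types so a percentage can't silently flow into a megabytes field.
Megabytes = NewType("Megabytes", int)
Percentage = NewType("Percentage", float)


def mb_to_bytes(mb: Megabytes) -> int:
    """Convert a Megabytes value to raw bytes for a dashboard panel with unit="bytes"."""
    return int(mb) * 1024 * 1024


@dataclass
class ProcessGPUInfo:
    pid: int
    gpu_memory_usage: Megabytes  # always MiB, never a percentage
    gpu_utilization: Optional[Percentage]  # None when the backend can't report it
```

A static type checker (mypy, pyright) will then flag a `Percentage` assigned to `gpu_memory_usage`, which is exactly the class of bug this comment worries about.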
```python
    if nv_process.usedGpuMemory
    else 0
),
gpu_utilization=None,  # Not available in pynvml
```
What if we match the Ray Dashboard behavior, where we show the total GPU utilization of the GPUs that the process attaches to, not necessarily the utilization exclusive to that process? The nvidia-smi parsing is fragile, and I don't know what backwards-compatibility guarantees nvidia-smi provides.
Do you mean we give up nvidia-smi? Or do we implement a fallback strategy that uses pynvml to display the total GPU utilization if nvidia-smi is not available?
Yes, let's remove the usage of nvidia-smi and just use pynvml all the time. We can add nvidia-smi at a later time if there is enough demand, but I think for most use cases the pynvml approach should be good enough.
Discussed offline. We will be adding the nvidia-smi dependency. We will add a test validating the output of `nvidia-smi pmon`. We will also update the Ray dashboard UI to use the GPU utilization value from nvidia-smi instead of pynvml.
Signed-off-by: zhaoch23 <[email protected]>
I have fixed some potential parsing errors. This is what it looks like on my side:
Signed-off-by: zhaoch23 <[email protected]>
```python
# Whether GPU metrics collection via `nvidia-smi` is enabled.
# Controlled by the environment variable `RAY_metric_enable_gpu_nvsmi`.
# Defaults to False to use pynvml to collect usage.
RAY_METRIC_ENABLE_GPU_NVSMI = env_bool("RAY_metric_enable_gpu_nvsmi", False)
```
Should we turn it on by default? Otherwise the code path will never be tested and no one is going to use this feature.
Discussed offline: we will gradually roll it out, and we also have a follow-up to use the C NVML API directly instead of the nvidia-smi CLI.
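For reference, a minimal `env_bool` helper with the semantics this flag assumes. Ray ships its own implementation; this sketch only illustrates the opt-in behavior:

```python
import os


def env_bool(name: str, default: bool) -> bool:
    """Read a boolean feature flag from the environment.

    Unset -> default; otherwise "1"/"true"/"yes" (case-insensitive) -> True.
    This mirrors common env-flag parsing but is an assumption, not Ray's exact code.
    """
    value = os.environ.get(name)
    if value is None:
        return default
    return value.strip().lower() in ("1", "true", "yes")


# Defaults to False so pynvml remains the collection path unless opted in.
RAY_METRIC_ENABLE_GPU_NVSMI = env_bool("RAY_metric_enable_gpu_nvsmi", False)
```

Keeping the default off supports the gradual rollout mentioned above: users opt in per node via `RAY_metric_enable_gpu_nvsmi=1`.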
```python
0 if columns[pid_index] == "-" else int(columns[pid_index]),
0 if columns[sm_index] == "-" else int(columns[sm_index]),
0 if columns[mem_index] == "-" else int(columns[mem_index]),
```
Is it guaranteed to be an integer, never a float?
They are int types according to NVML.
```python
gpu_id_index = table_header.index("gpu")
pid_index = table_header.index("pid")
sm_index = table_header.index("sm")
mem_index = table_header.index("mem")
```
What if these headers don't exist?
pmon is expected to include all of these fields. A ValueError is now raised and handled if the header or any of these fields are missing.
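A sketch of the header-indexed parsing with the ValueError behavior described above. Column names follow `nvidia-smi pmon` output; the function name and return shape are illustrative, not the actual Ray code:

```python
from typing import List, Tuple


def parse_pmon_rows(output: str) -> List[Tuple[int, int, int, int]]:
    """Parse `nvidia-smi pmon` text into (gpu_id, pid, sm%, mem%) tuples.

    Raises ValueError if the header or any required column is missing.
    """
    lines = output.strip().splitlines()
    if not lines or not lines[0].startswith("#"):
        raise ValueError("missing pmon header")
    table_header = lines[0].lstrip("#").split()
    try:
        gpu_id_index = table_header.index("gpu")
        pid_index = table_header.index("pid")
        sm_index = table_header.index("sm")
        mem_index = table_header.index("mem")
    except ValueError as e:
        raise ValueError(f"unexpected pmon header: {table_header}") from e

    rows = []
    for line in lines[1:]:
        if line.startswith("#"):  # units line, e.g. "# Idx  #  C/G  %  %  ..."
            continue
        columns = line.split()
        if columns[pid_index] == "-":  # no process on this GPU
            continue
        rows.append((
            int(columns[gpu_id_index]),
            int(columns[pid_index]),
            0 if columns[sm_index] == "-" else int(columns[sm_index]),
            0 if columns[mem_index] == "-" else int(columns[mem_index]),
        ))
    return rows
```

Indexing columns by header name rather than fixed position keeps the parser working if nvidia-smi reorders or adds columns, while missing required columns fail loudly.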
```python
# Build process ID -> GPU info mapping for faster lookups
gpu_pid_mapping = defaultdict(list)
if gpus is not None:
    for gpu in gpus:
        processes = gpu.get("processes_pids")
        if processes:
            for proc in processes.values():
                gpu_pid_mapping[proc.pid].append(proc)
```
This is no longer needed since `gpu_ids` is already a dict.
One pid can appear on multiple GPUs. Here we build a flat map that gathers entries by pid across all GPUs.
```python
    """
    tags = {"ip": self._ip, "Component": component_name}

    records = []
    records.append(
        Record(
            gauge=METRICS_GAUGES["component_gpu_memory_mb"],
            value=0.0,
            tags=tags,
        )
    )
    records.append(
        Record(
            gauge=METRICS_GAUGES["component_gpu_percentage"],
            value=0.0,
            tags=tags,
        )
    )
```
This can just be moved to `_generate_reseted_stats_record`.
Unlike CPU processes, a GPU process is deemed stale as soon as it stops using GPU resources, even if it remains alive. To address this, we implemented a separate method for generating and resetting GPU statistics.
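The reset logic described here could be sketched roughly as follows. The gauge names come from the diff above; the function name, arguments, and record shape are illustrative assumptions, not the actual Ray implementation:

```python
from typing import Dict, List, Set


def generate_gpu_reset_records(
    previously_reported: Set[str],
    currently_using_gpu: Set[str],
    ip: str,
) -> List[Dict]:
    """Emit zeroed GPU gauges for components that stopped using the GPU.

    A component is stale as soon as it leaves `currently_using_gpu`,
    even if its process is still alive.
    """
    records = []
    for component_name in previously_reported - currently_using_gpu:
        tags = {"ip": ip, "Component": component_name}
        for gauge in ("component_gpu_memory_mb", "component_gpu_percentage"):
            # Zero out the gauge so stale values don't linger on the dashboard.
            records.append({"gauge": gauge, "value": 0.0, "tags": tags})
    return records
```

Emitting explicit zeros (rather than simply not reporting) matters for pull-based metrics: without a reset record, the last nonzero sample would keep being scraped.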
```python
# Track if this process has GPU usage
if (
    stat.get("gpu_memory_usage", 0) > 0
    or stat.get("gpu_utilization", 0) > 0
):
    gpu_proc.add(proc_name)
```
Do we need all this code?
Similarly to the above, this code tracks only processes with positive GPU usage, in order to identify stale GPU processes and generate corresponding reset records for them.
LG. Just some naming nits.
Co-authored-by: Jiajun Yao <[email protected]> Signed-off-by: zhilong <[email protected]>
Signed-off-by: zhilong <[email protected]>
Signed-off-by: zhaoch23 <[email protected]>
@zhaoch23 @Bye-legumes there are some lint failures.
Signed-off-by: zhaoch23 <[email protected]>
Follow-up: use NVML with pybind to implement pynvml to get the GPU usage directly.
Signed-off-by: zhilong <[email protected]> Signed-off-by: zhaoch23 <[email protected]> Signed-off-by: zhilong <[email protected]> Co-authored-by: zhaoch23 <[email protected]> Co-authored-by: Jiajun Yao <[email protected]> Signed-off-by: avigyabb <[email protected]>
## Why are these changes needed?

Bugs introduced in #52102. Two bugs:

- `proc` is a TypedDict, so it needs to be fetched via `proc["pid"]` instead of `proc.pid`.
- Changing `processes_pid` is a backwards-incompatible change that ends up changing the dashboard APIs that power the Ray dashboard. Maintain backwards compatibility.

Verified fix. Metrics work again:

<img width="947" height="441" alt="Screenshot 2025-08-27 at 12 22 40 PM" src="https://github.com/user-attachments/assets/0a9a83e7-b720-4ad0-b90e-1baa394edde5" />

Ray Dashboard works again:

<img width="1824" height="1029" alt="Screenshot 2025-08-27 at 12 21 51 PM" src="https://github.com/user-attachments/assets/6b0e08e4-69c9-4223-b736-ff69b8d306db" />

## Checks

- [x] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference.
- [ ] I've made sure the tests are passing.

Signed-off-by: Alan Guo <[email protected]>
## Why are these changes needed?
Close #45755.
This PR addresses the need for enhanced GPU usage metrics at the task/actor level in the Ray dashboard. Currently, the Ray dashboard provides detailed CPU and memory usage metrics for individual tasks and actors, but lacks similar granularity for GPU metrics. This enhancement aims to fill that gap by introducing per-task/actor GPU utilization and memory usage metrics.
- `dashboard/agent.py`, `dashboard/modules/stats_collector.py`: GPU stats collection (`nvidia-smi --query-gpu` if NVML is not available).
- `dashboard/frontend/src/pages/node/Stats.vue`, `dashboard/frontend/src/components/ResourceIcon.tsx`: `gpu-core`, `gpu-mem` icons and tooltip helpers.
- `python/ray/dashboard/tests/test_gpu_stats.py`: `CUDA_VISIBLE_DEVICES=0` + mock NVML bindings to assert dashboard JSON schema and time-series values.
- `pylint: disable=c-extension-no-member` guards, build-time NVML check in `setup.py`.

## Related issue number

Close #45755.
## Checks

- I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.