-
Notifications
You must be signed in to change notification settings - Fork 68
✨ OPRUN-4240: Add test profiling toolchain for heap and CPU analysis #2298
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
✅ Deploy Preview for olmv1 ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #2298 +/- ##
==========================================
- Coverage 74.24% 74.21% -0.03%
==========================================
Files 91 91
Lines 7046 7046
==========================================
- Hits 5231 5229 -2
- Misses 1402 1403 +1
- Partials 413 414 +1
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
pedjak
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that all other interactions with the code are happening through Makefile targets, it would be nice if we can continue doing that for the profiling as well, something like:
make baseline-profile test-experiment-e2eor maybe
make test-experiment-e2e PROFILING=baselineSimilar goes for other added scripts.
| # Get pod name | ||
| log_info "Finding pod with label: ${POD_LABEL}" | ||
| POD=$(kubectl get pod -n "${NAMESPACE}" -l "${POD_LABEL}" -o name | head -1) | ||
| if [ -z "${POD}" ]; then | ||
| log_error "No pod found with label ${POD_LABEL}" | ||
| exit 1 | ||
| fi | ||
| log_success "Found pod: ${POD}" | ||
| PODS[$component]="${POD}" | ||
|
|
||
| # Create component output directory | ||
| mkdir -p "${OUTPUT_DIR}/${component}" | ||
|
|
||
| # Set up port forwarding | ||
| log_info "Setting up port forwarding to ${POD}:${PPROF_PORT} -> localhost:${LOCAL_PORT}..." | ||
| kubectl port-forward -n "${NAMESPACE}" "${POD}" "${LOCAL_PORT}:${PPROF_PORT}" > "${OUTPUT_DIR}/${component}/port-forward.log" 2>&1 & | ||
| PF_PIDS[$component]=$! | ||
| log_info "Port-forward started (PID: ${PF_PIDS[$component]})" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we could set port-forwarding on the deployment directly, and that would simplify the script, because we do not need to search pods - deployment name is fixed.
|
|
||
| # Wait for port forwards to be ready | ||
| log_info "Waiting for port forwards to initialize..." | ||
| sleep 5 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of wait here, I would try to connect to the endpoint and retry several times if I get the error.
| log_success "Collection complete. Collected ${n} profiles from each component." | ||
|
|
||
| # Clean up empty profiles (created during cluster teardown) | ||
| log_info "Cleaning up empty profile files..." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps this should be executed also when user presses ctrl-c?
| EOF | ||
| } | ||
|
|
||
| log_info() { | ||
| echo -e "${BLUE}[INFO]${NC} $*" | ||
| } | ||
|
|
||
| log_error() { | ||
| echo -e "${RED}[ERROR]${NC} $*" | ||
| } | ||
|
|
||
| log_success() { | ||
| echo -e "${GREEN}[SUCCESS]${NC} $*" | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All added scripts have a share of common code. Perhaps we should extract it into something like common.sh and source it other scripts?
| - --leader-elect | ||
| {{- end }} | ||
| - --metrics-bind-address=:7443 | ||
| {{- if .Values.options.e2e.enabled }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Profiling could we turned on also for other reasons, not just for e2e tests. Hence, I would suggest to name this option something like:
.Values.options.profiling.enabled
| --- | ||
|
|
||
| **Created:** 2025-10-28 | ||
| **Status:** Production Ready |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Claude thinks it is. :)
|
|
||
| **Created:** 2025-10-28 | ||
| **Status:** Production Ready | ||
| **Version:** 1.0.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we gonna track the version of this separately from the released artifacts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no need to version this, so we won't bother trying to keep it in sync.
| # This will: | ||
| # 1. Start make test-experimental-e2e | ||
| # 2. Wait for pod to be ready | ||
| # 3. Collect profiles every 15s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the code and PR description are actually speaking about 10s intervals.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, I changed the default, and Claude didn't pick up on this.
|
|
||
| ```bash | ||
| # Set up symlinks to existing data | ||
| ln -s /home/tshort/experimental-e2e-testing/test1 e2e-profiles/baseline |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps we should make some more abstract paths instead of publishing /home/tshort ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this is probably a bad example... and Claude didn't clean it up.
I'm going to disagree here. We have a number of other utilities that can be run outside the Makefile. Also, passing arguments to Makefiles is onerous (as given by your example of having to use |
881c6a5 to
e02313c
Compare
We should not prevent people to call the scripts directly, but having no target on Makefile level makes them harder to discover. I would also say that profiling itself is orthogonal to running e2e tests. We could profile controllers even if do not run tests. If we make target for starting and stopping profiling, then we could do something like: make baseline-profiling e2e stop-profilingIf we want to profile without running tests, we can then do make baseline-profiling
# play with OLM
# ...
# ...
make stop-profiling |
anik120
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider this separation:
- e2e profiling toolchain for heap and CPU analysis in introduced (independent of any Claude/AI stuff)
- A helpful Claude command is introduced to help with the AI-ness (in a different PR)
Isn't a that more cleaner?
1 is something we'd do traditionally (but now it's being done with assistance from AI).
2 is the new world stuff.
| # E2E Profiling Plugin - Summary | ||
|
|
||
| ## 📊 What Was Created |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as the other PR, we probably shouldn't commit this to main. The Readme.md and Usage.md seems to be doing the job.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file is useful for Claude to know how to use the tool as a plug-in.
I hear your concerns. Because the scripts are expecting to run a specific e2e test, that would require a bit of refactoring. The functionality works now with the existing e2e tests. I would consider doing that as a follow up. |
That's starting to sound like busy-work. I can do it, but I don't really see the benefit. |
|
/hold |
e02313c to
a645f32
Compare
a645f32 to
eb1ecbb
Compare
|
/unhold |
|
I did a lint pass with ShellCheck: |
Makefile
Outdated
| start-profiling: #HELP Start profiling in background. Run your tests, then use 'make stop-profiling'. Use PROFILE_NAME=<name> to specify output name. | ||
| ./hack/tools/e2e-profiling/start-profiling.sh $(PROFILE_NAME) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we craft the target like:
start-profiling/%:
./hack/tools/e2e-profiling/start-profiling.sh $*
then we could simply the invocation with:
make start-profiling/foo| cat >> "${REPORT_FILE}" << 'EOF' | ||
| ### Top Memory Allocators (Peak Profile) | ||
| ``` | ||
| EOF | ||
|
|
||
| cd "${component_dir}" && go tool pprof -top "${peak_name}" 2>/dev/null | head -20 >> "${REPORT_FILE}" | ||
|
|
||
| cat >> "${REPORT_FILE}" << 'EOF' | ||
| ``` | ||
| EOF | ||
|
|
||
| # OpenAPI-specific analysis | ||
| log_info " Analyzing OpenAPI allocations..." >&2 | ||
| cat >> "${REPORT_FILE}" << 'EOF' | ||
| ### OpenAPI-Related Allocations | ||
| ``` | ||
| EOF |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these three cat invocation (and perhaps others too) could be merged into single one:
cat >> "${REPORT_FILE}" << EOF
### Top Memory Allocators (Peak Profile)
` ``
$(cd "${component_dir}" && go tool pprof -top "${peak_name}" 2>/dev/null | head -20)
` ``
### OpenAPI-Related Allocations
` ``
$(cd "${component_dir}" && go tool pprof -text "${peak_name}" 2>/dev/null | grep -i openapi | head -20 >> "${REPORT_FILE}" || echo "No OpenAPI allocations found")
` ``
EOF
It would mprove readability if we wrap the calls to go tool pprof into bash functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please confirm your requsted change? I don't think it would work as given.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please confirm your requsted change? I don't think it would work as given.
It is a nit, but IMHO is more readable. See this little example:
function bar() {
echo "hello world"
}
cat << EOF
first line
$(bar)
second line
EOFThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've cleaned this up a bit, hopefully more readable.
|
|
||
| # Start everything in background | ||
| nohup bash -c ' | ||
| set -euo pipefail |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could we move this into a separare script instead keeping it in this large string?
We don't use ShellCheck for anything... so that is a requirement that has not been placed on anything else. I don't think it's appropriate to apply it on this and not anything else. In other words, if we want to use ShellCheck, we should create a verify target for it. It does pass a However, I'll look at some of those issues. |
| - --metrics-bind-address=:8443 | ||
| {{- if .Values.options.profiling.enabled }} | ||
| - --pprof-bind-address=:6060 | ||
| {{- end }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Liked we do not do that by default
Perfect
hack/tools/e2e-profiling/README.md
Outdated
| - [ ] Add real-time dashboard | ||
| - [ ] Support different output formats (JSON, CSV) | ||
| - [ ] Add mutex profiling | ||
| - [ ] Support custom component configurations |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it all is generated by AI
Could we just shape it and have only what is required here, making sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've reduced it.
|
general comment/question: the introduced logic is not so trivial, and I wonder if we would get on maintainability if it is written in some other more suited language than bash? (not pushing for anything particular here) |
If it's non-trivial in bash, it's likely non-trivial in any other language. Given how far this has progressed, I'm reluctant to switch to another language and take even longer to get this merged. I've also tried to minimize the number of required tools, and doing it in anything other than Bash or Go will add new build environment requirements. As of now, nothing new is required to run this. That being said... I can look to see what would happen if it were written in golang, but I don't want that to hold up this PR. |
eb1ecbb to
c121576
Compare
pedjak
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm thanks!
| e2e: | ||
| enabled: false | ||
| profiling: | ||
| enabled: false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 for stable disabled by default great
hack/tools/e2e-profiling/README.md
Outdated
| ## Requirements | ||
|
|
||
| - kubectl (cluster access) | ||
| - go (for `go tool pprof`) | ||
| - make | ||
| - curl | ||
| - bash 4.0+ | ||
|
|
||
| ## Real-World Results | ||
|
|
||
| OpenAPI caching optimization: | ||
| - **Memory:** 49.6 MB → 41.2 MB (-16.9%) | ||
| - **OpenAPI allocations:** 13 MB → 3.5 MB (-73%) | ||
| - **Key insight:** Repeated schema fetching was #1 memory consumer | ||
|
|
||
| ## Architecture | ||
|
|
||
| **Scripts:** | ||
| - `profile-collector-daemon.sh` - Background collection process | ||
| - `start-profiling.sh` - Start daemon mode profiling | ||
| - `stop-profiling.sh` - Stop daemon and cleanup | ||
| - `run-profiled-test.sh` - Orchestrate test + profiling | ||
| - `analyze-profiles.sh` - Generate analysis reports | ||
| - `compare-profiles.sh` - Create comparison reports | ||
| - `common.sh` - Shared utilities and logging | ||
|
|
||
| **Key Features:** | ||
| - Deployment-based port-forwarding (survives pod restarts) | ||
| - Automatic retry with 30s timeout | ||
| - Graceful cleanup on exit/interrupt | ||
| - Multi-component support (operator-controller + catalogd) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| ## Requirements | |
| - kubectl (cluster access) | |
| - go (for `go tool pprof`) | |
| - make | |
| - curl | |
| - bash 4.0+ | |
| ## Real-World Results | |
| OpenAPI caching optimization: | |
| - **Memory:** 49.6 MB → 41.2 MB (-16.9%) | |
| - **OpenAPI allocations:** 13 MB → 3.5 MB (-73%) | |
| - **Key insight:** Repeated schema fetching was #1 memory consumer | |
| ## Architecture | |
| **Scripts:** | |
| - `profile-collector-daemon.sh` - Background collection process | |
| - `start-profiling.sh` - Start daemon mode profiling | |
| - `stop-profiling.sh` - Stop daemon and cleanup | |
| - `run-profiled-test.sh` - Orchestrate test + profiling | |
| - `analyze-profiles.sh` - Generate analysis reports | |
| - `compare-profiles.sh` - Create comparison reports | |
| - `common.sh` - Shared utilities and logging | |
| **Key Features:** | |
| - Deployment-based port-forwarding (survives pod restarts) | |
| - Automatic retry with 30s timeout | |
| - Graceful cleanup on exit/interrupt | |
| - Multi-component support (operator-controller + catalogd) |
I think we can remove this section — what do you think?
It looks like it might have been AI-generated and doesn’t seem necessary for our use case.
It also feels a bit verbose and not quite aligned with the writing style we usually use for these READMEs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the level is fine. It was definitely reduced, and can always be edited back later. Generally, for documentation, more detail is better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is actually a great example of how it's helpful to generate docs like this to provide additional context.
and not quite aligned with the writing style we usually use for these READMEs.
Writing styles before were plagued with not-enough context syndrome, which made contributors reliant on person/people who wrote the code to understand context (tribal knowledge).
I personally saw this file while doing my first "eye over" of the PR, and was grateful to get more context from this file about what to expect from this PR (just like how reading the summary of a research paper helps understand the rest of the paper)
.gitignore
Outdated
| .claude/cache/ | ||
| .claude/logs/ | ||
| .claude/.session* | ||
| .claude/*.log |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could not be
# Ignore Claude local settings, history, and logs
.claude/
Also, it seems out of scope here
Are we adding any claude implementation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is what Claude said to include in the .gitignore; it knows what should be there! :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tmshort this does feel a little out of place tbh. Everything else is looking great.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @tmshort
I added just a few nits.
I think the README we need to shape
AI will add info not really required and or not helpful.
Often too verbose, so I think it requires some human adjustments.
PS.: I did not test it out locally.
Beyond that I am ok with we add it.
@dtfranz worked in the performance checks so I think it would be nice get his review.
anik120
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@todd there's just one little tiny thing bothering me otherwise this is looking great!
hack/tools/e2e-profiling/README.md
Outdated
| ## Requirements | ||
|
|
||
| - kubectl (cluster access) | ||
| - go (for `go tool pprof`) | ||
| - make | ||
| - curl | ||
| - bash 4.0+ | ||
|
|
||
| ## Real-World Results | ||
|
|
||
| OpenAPI caching optimization: | ||
| - **Memory:** 49.6 MB → 41.2 MB (-16.9%) | ||
| - **OpenAPI allocations:** 13 MB → 3.5 MB (-73%) | ||
| - **Key insight:** Repeated schema fetching was #1 memory consumer | ||
|
|
||
| ## Architecture | ||
|
|
||
| **Scripts:** | ||
| - `profile-collector-daemon.sh` - Background collection process | ||
| - `start-profiling.sh` - Start daemon mode profiling | ||
| - `stop-profiling.sh` - Stop daemon and cleanup | ||
| - `run-profiled-test.sh` - Orchestrate test + profiling | ||
| - `analyze-profiles.sh` - Generate analysis reports | ||
| - `compare-profiles.sh` - Create comparison reports | ||
| - `common.sh` - Shared utilities and logging | ||
|
|
||
| **Key Features:** | ||
| - Deployment-based port-forwarding (survives pod restarts) | ||
| - Automatic retry with 30s timeout | ||
| - Graceful cleanup on exit/interrupt | ||
| - Multi-component support (operator-controller + catalogd) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is actually a great example of how it's helpful to generate docs like this to provide additional context.
and not quite aligned with the writing style we usually use for these READMEs.
Writing styles before were plagued with not-enough context syndrome, which made contributors reliant on person/people who wrote the code to understand context (tribal knowledge).
I personally saw this file while doing my first "eye over" of the PR, and was grateful to get more context from this file about what to expect from this PR (just like how reading the summary of a research paper helps understand the rest of the paper)
.gitignore
Outdated
| .claude/cache/ | ||
| .claude/logs/ | ||
| .claude/.session* | ||
| .claude/*.log |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tmshort this does feel a little out of place tbh. Everything else is looking great.
c121576 to
8f49c58
Compare
|
OK, I've converted this into golang (vs. bash) and generalized it a bit more:
|
hack/tools/test-profiling/README.md
Outdated
| ## Requirements | ||
|
|
||
| **Runtime:** | ||
| - Kubernetes cluster access (via kubeconfig) | ||
|
|
||
| **Build:** | ||
| - go 1.24+ | ||
| - make |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will be hard we keep it maintained
when we bump do we really need this section?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm looking to remove it, and make other improvements... this is becoming an HnH project today,...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work ! 🎉
Nit: Do we need the requirement section in the doc
if not, I think is better we remove since will be hard for us keep maintained.
8f49c58 to
5bf313d
Compare
5bf313d to
5147fad
Compare
Introduces automated memory and CPU profiling for e2e tests with: - Automatic port-forwarding to Kubernetes deployment pprof endpoints - Configurable periodic heap and CPU profile collection with differential timing - Analysis report generation with growth metrics and top allocators - Makefile targets: start-profiling, stop-profiling, analyze-profiles 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]> Signed-off-by: Todd Short <[email protected]>
5147fad to
a75a2b0
Compare
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: camilamacedo86, pedjak The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
✨ Add e2e profiling toolchain for heap and CPU analysis
Introduces automated memory and CPU profiling for e2e tests with:
🤖 Generated with Claude Code
Co-Authored-By: Claude [email protected]
Signed-off-by: Todd Short [email protected]
Reviewer Checklist