Description
Problem description
We've discovered a memory leak in Node.js applications using gRPC with xDS (@grpc/grpc-js v1.12.5 and @grpc/grpc-js-xds v1.12.1). Applications showed a consistent pattern of memory growth, increasing by approximately 300MB within 24 hours of running, eventually leading to container crashes due to hitting memory limits.
Reproduction steps
The issue occurred consistently when xDS was enabled.
Environment
- OS name, version and architecture: Debian 4.19.208-1 x86_64 GNU/Linux
- Node version: v20.17.0
- Node installation method [e.g. nvm]: nvm
- If applicable, compiler version [e.g. clang 3.8.0-2ubuntu4]: N/A
- Package name and version: @grpc/grpc-js v1.12.5 and @grpc/grpc-js-xds v1.12.1
Additional context
We have recently started to use xDS with gRPC, and in our Node applications we’re using the following packages:
- @grpc/grpc-js: v1.12.5
- @grpc/grpc-js-xds: v1.12.1
With xDS enabled, we noticed that all Node applications exhibit the same memory utilization pattern: within 24 hours of running, the memory footprint of the application increases by about 300 MB.
At some point the containers hit the max memory limit, crash and are recreated.
With xDS disabled, we do not observe this pattern. Disabled means that the grpc-js-xds package is still loaded; however, the endpoint used for gRPC does not use the xDS protocol. E.g.,
process.env.WS_GRPC_XDS_OFF === 'true' ?
  `${packageName}.platform.internal:8001` :
  `xds:///${packageName}:8000`;
Below is a chart of memory utilization for a particular Node application. In this example, xDS was turned on for the application just before 6:00 PM on January 9th. Over the following 24-hour period, memory utilization climbed abnormally. Just after 6:00 on January 10th, xDS was disabled and the containers were restarted. For the next 2+ days, the application's memory profile was normal.
This is one particular Node application, but we observed this pattern on all Node applications where xDS was enabled.
After analyzing heap snapshots of affected application instances, we identified a large number of LrsCallState objects with a high total retained memory size (249 instances, with 631 MB retained in this example).
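For anyone who wants to repeat this kind of analysis, here is a minimal sketch of capturing a heap snapshot from inside a running Node process. It uses Node's built-in `v8.writeHeapSnapshot`; the helper name `captureHeapSnapshot` and the file naming scheme are our own assumptions, not part of grpc-js.

```typescript
import { writeHeapSnapshot } from 'node:v8';

// Hypothetical helper (not part of grpc-js): writes a heap snapshot to the
// current working directory and returns the file name that was written.
// Note: writeHeapSnapshot is synchronous and blocks the event loop while it
// runs, and it can need a significant amount of extra memory.
function captureHeapSnapshot(label: string): string {
  const fileName = `${label}-${process.pid}-${Date.now()}.heapsnapshot`;
  return writeHeapSnapshot(fileName);
}
```

Taking one snapshot shortly after startup and another a few hours later, then comparing them in Chrome DevTools, should make a growing count of one class stand out.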
After reviewing the implementation, we expected to see only one instance of the LrsCallState class, because it is stored on the XdsSingleServerClient, which we found had only one instance:
After deeper review, we found that this lrsCallState attribute is unset here:
And then recreated here:
Normally, the unset instance without references would be garbage collected; however, the LrsCallState has a NodeJS.Timeout that is created with a setInterval call here:
When an instance is unset, its statsTimer is not cleared, so the interval stays active and the Node.js runtime keeps a reference to its callback. Because that callback refers back to the LrsCallState instance, the instance and its associated resources are never collected, resulting in a memory leak:
We created a patched version of the grpc-js-xds package, with the changes in this draft PR:
After 3 full days with the patched xDS client, memory utilization in our Node applications is much closer to the normal profile.
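To illustrate the retention mechanism described above, here is a simplified sketch. It is not the grpc-js-xds source and not the contents of the draft PR: `CallStateSketch`, `ClientSketch`, `countTimersAfterRestarts`, and the `activeTimers` counter are hypothetical names we made up to show how an uncleared `setInterval` keeps its owner alive, and how clearing it before replacing the instance avoids that.

```typescript
// Hypothetical stand-in for LrsCallState; illustrative only.
class CallStateSketch {
  // Counts timers that were started and have not been cleared.
  static activeTimers = 0;

  private statsTimer: NodeJS.Timeout;
  private reportsSent = 0;

  constructor(intervalMs: number) {
    // The arrow function captures `this`, so while the interval is active
    // the runtime's timer list keeps this instance reachable.
    this.statsTimer = setInterval(() => {
      this.reportsSent += 1;
    }, intervalMs);
    // unref() only lets this demo process exit; the timer stays scheduled
    // and still holds the callback.
    this.statsTimer.unref();
    CallStateSketch.activeTimers += 1;
  }

  // Clearing the interval removes the runtime's reference to the callback,
  // so the instance can be collected once nothing else points to it.
  close(): void {
    clearInterval(this.statsTimer);
    CallStateSketch.activeTimers -= 1;
  }
}

// Hypothetical stand-in for the client that owns a single call state.
class ClientSketch {
  private callState: CallStateSketch | null = null;

  // Leaky pattern: the old instance is dropped but its timer keeps running.
  restartLeaky(): void {
    this.callState = null;
    this.callState = new CallStateSketch(60_000);
  }

  // Fixed pattern: stop the old timer before replacing the instance.
  restartFixed(): void {
    this.callState?.close();
    this.callState = new CallStateSketch(60_000);
  }
}

// Returns how many timers remain active after the given number of restarts.
function countTimersAfterRestarts(restarts: number, fixed: boolean): number {
  const before = CallStateSketch.activeTimers;
  const client = new ClientSketch();
  for (let i = 0; i < restarts; i++) {
    if (fixed) {
      client.restartFixed();
    } else {
      client.restartLeaky();
    }
  }
  return CallStateSketch.activeTimers - before;
}
```

In this sketch, `countTimersAfterRestarts(5, false)` leaves five timers active (one per discarded instance), while `countTimersAfterRestarts(5, true)` leaves only the one belonging to the current instance.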