Improve performance of loading json graph profile #4
I had an issue where profiling my project with mkcheck generated a JSON graph file of about 40 MB, and even the simplest of the Python tool operations, like `list`, would take over 8 min 27 s to complete.
I profiled the complete Python invocation using Scalene and found that it wasn't a CPU-only bottleneck, but a memory one in `parse_graph()`. The `inputs` and `outputs` variables are sets, and they are filled in through the loop with the `|` (union) operator, which returns a new set with elements from the set and all others. The complete object was copied into a new one at each iteration. This is why the Scalene profile showed a peak memory allocation of 149 GB for that single line (it probably wasn't all in use at once, but accumulated over the repeated assignments throughout the loop iterations).
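The copying behavior described above can be illustrated with a minimal sketch (hypothetical sets, not the actual mkcheck code):

```python
inputs = set()
for i in range(3):
    before = inputs
    # `|` builds a brand-new set on every iteration; the old object's
    # contents are copied over and the old object is discarded
    inputs = inputs | {i}
    assert inputs is not before

assert inputs == {0, 1, 2}
```

With large sets this turns each loop iteration into a full copy of everything accumulated so far, which explains the memory churn.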
The solution is to use the `update()` method, which updates the set in place, adding elements from all the others: the same object is reused and simply gains the new elements. Also, since the `proc_in` variable wasn't used anywhere else, I inlined the call, removing an extra set instantiation.

After this, I made a small change that doesn't help as much: using `json.load(f)` instead of `json.loads(f.read())`. The function `json.loads()` parses a string (the one returned by `f.read()`), while `json.load()` reads JSON directly from the file object.

With these changes, the runtime went from 8 min 27 s to 26 s, which is far more manageable.