Iterate over objects in TDirectory in linear time. #638
This was a dumb mistake, pointed out by Andrew Wightman (Notre Dame) with a file of 61944 histograms.
Since names are not unique identifiers for objects in TDirectories, `uproot.TDirectory.__getitem__` was iterating through the list, looking for matches. If you do that n times, the time complexity is O(n²).

However, names are almost unique identifiers for objects in TDirectories, so I added a `uproot.TDirectory._keys_lookup`, which is a hashmap from names to lists of matching indexes in `uproot.TDirectory._keys`. For a given name, the number of items to search through is much shorter, usually 1.

This reduces the time needed to read the 61944 histograms from 206 seconds to 54 seconds. Most importantly, it's flat: both the first and the last 1000 histograms take 0.85 seconds, whereas before it was 0.85 seconds for the first 1000 histograms and 6.2 seconds for the last 1000 histograms. The time complexity to read n histograms is now O(n).
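For illustration, here is a minimal sketch of the name-to-indexes lookup idea. It is not the actual uproot code: the `Key` and `Directory` classes, the `getitem_linear` method, and the "highest cycle wins" rule for duplicate names are all assumptions made for this example. Only the attribute names `_keys` and `_keys_lookup` are taken from the description above.

```python
from collections import defaultdict


class Key:
    """Hypothetical stand-in for a directory key: a name plus a cycle number."""

    def __init__(self, name, cycle):
        self.name = name
        self.cycle = cycle


class Directory:
    """Hypothetical stand-in for a TDirectory-like container (not uproot's class)."""

    def __init__(self, keys):
        self._keys = list(keys)
        # Map each name to the list of indexes in self._keys that carry it.
        # Names are almost unique, so each list usually has exactly one entry.
        self._keys_lookup = defaultdict(list)
        for index, key in enumerate(self._keys):
            self._keys_lookup[key.name].append(index)

    def getitem_linear(self, name):
        # Old approach: scan every key on every lookup, O(n) per call,
        # hence O(n^2) when every object is read once.
        best = None
        for key in self._keys:
            if key.name == name and (best is None or key.cycle > best.cycle):
                best = key
        if best is None:
            raise KeyError(name)
        return best

    def __getitem__(self, name):
        # New approach: only inspect the few indexes that share this name.
        # .get() avoids inserting an empty list for names that do not exist.
        indexes = self._keys_lookup.get(name, [])
        if not indexes:
            raise KeyError(name)
        # Assumption for this sketch: among duplicates, the highest cycle wins.
        return max((self._keys[i] for i in indexes), key=lambda k: k.cycle)


directory = Directory([Key("hist", 1), Key("hist", 2), Key("other", 1)])
newest = directory["hist"]
```

Building the lookup costs one pass over the keys, and each later lookup touches only the entries that share the requested name, so reading all n objects takes roughly O(n) in total as long as duplicate names are rare.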