⚡️ Speed up method `BaseVocab.load_state_dict` by 5% #229

codeflash-ai · 2025-10-24T06:55:39Z

📄 5% (0.05x) speedup for `BaseVocab.load_state_dict` in `stanza/models/common/vocab.py`

⏱️ Runtime: 52.5 microseconds → 49.8 microseconds (best of 720 runs)

📝 Explanation and details

The optimization uses **local variable binding**: it binds the built-in `setattr` function to a local name, `_setattr`, before the loop. This removes a repeated name lookup from each loop iteration.

**Key change**: Instead of resolving the name `setattr` on every loop iteration, the function is bound once to a local variable `_setattr` and reused.
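A minimal sketch of the change described here. `TinyVocab` is a hypothetical stand-in, not the actual `BaseVocab` source, and the shape of the loop is assumed from the explanation in this description.

```python
class TinyVocab:
    """Hypothetical stand-in for BaseVocab; not the stanza class."""

    @classmethod
    def load_state_dict_before(cls, state_dict):
        new = cls()
        for attr, value in state_dict.items():
            # `setattr` is resolved by name on every iteration
            setattr(new, attr, value)
        return new

    @classmethod
    def load_state_dict_after(cls, state_dict):
        new = cls()
        _setattr = setattr  # bind the built-in once to a local name
        for attr, value in state_dict.items():
            _setattr(new, attr, value)
        return new


state = {"lang": "en", "_unit2id": {"a": 0}, "_id2unit": ["a"]}
before = TinyVocab.load_state_dict_before(state)
after = TinyVocab.load_state_dict_after(state)
# Both variants should leave the instance with identical attributes.
print(vars(before) == vars(after))
```

Both variants assign the same attributes; only the way the callable is looked up differs.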

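**Why this provides speedup**: In CPython, reading a local variable is an indexed load from the frame's array of locals, while a name such as `setattr` that is not local is looked up in the module's globals dictionary and, when it is not found there, in the builtins dictionary. Each call to `setattr(new, attr, value)` pays for that lookup, while `_setattr(new, attr, value)` uses the cheaper local load. Recent CPython versions cache global and built-in lookups, so the size of the gain depends on the interpreter version.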
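One way to see the difference without timing anything is to inspect the compiled code objects. The two functions below are illustrative only (their names are assumptions, not part of stanza); the sketch checks which name table each callable reference ends up in.

```python
def with_global(obj, items):
    for attr, value in items:
        setattr(obj, attr, value)


def with_local(obj, items):
    _setattr = setattr
    for attr, value in items:
        _setattr(obj, attr, value)


# co_names holds names resolved through globals/builtins at run time.
global_names = with_global.__code__.co_names
# co_varnames holds arguments and local variables of the function.
local_names = with_local.__code__.co_varnames

print("setattr" in global_names, "_setattr" in local_names)
```

In `with_local`, `setattr` still appears in `co_names`, but it is resolved once before the loop rather than on every iteration.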

**Performance impact**: The line profiler shows the `setattr` line improved from 61,376 ns total time to 33,204 ns total time (about 46% less time on that specific line), contributing to the overall 5% speedup.
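The percentages can be re-derived from the figures reported in this description. The sketch below only redoes the arithmetic; the variable names are illustrative, and the overall speedup is shown relative to both the old and the new runtime because the description does not say which base it uses.

```python
# Line-profiler totals for the setattr line, in nanoseconds
line_before_ns = 61_376
line_after_ns = 33_204
line_reduction = (line_before_ns - line_after_ns) / line_before_ns

# Best-of-720 runtimes for the whole method, in microseconds
runtime_before_us = 52.5
runtime_after_us = 49.8
saved = runtime_before_us - runtime_after_us
saved_vs_before = saved / runtime_before_us
saved_vs_after = saved / runtime_after_us

print(f"line: {line_reduction:.1%}")
print(f"overall: {saved_vs_before:.1%} of old, {saved_vs_after:.1%} of new")
```

Either base rounds to the reported 5%.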

**Test case effectiveness**: The loop runs once per key in `state_dict`, so the saving grows with the number of attributes being restored rather than with the vocabulary size. The large-scale test cases build vocabularies of 1,000 units but still pass only a handful of keys, so each call avoids only a few lookups; the benefit accumulates mainly when `load_state_dict` is called many times or with many attributes.
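A rough way to check how the effect scales is to time both loop shapes against state dicts with different numbers of keys. This is a sketch under stated assumptions: `Holder`, `load_global`, `load_local`, and `compare` are hypothetical names, the loop shape is assumed from this description, and the timings will vary by machine and Python version.

```python
import timeit


class Holder:
    """Hypothetical empty class standing in for a vocab instance."""


def load_global(state_dict):
    new = Holder()
    for attr, value in state_dict.items():
        setattr(new, attr, value)
    return new


def load_local(state_dict):
    new = Holder()
    _setattr = setattr  # bound once, outside the loop
    for attr, value in state_dict.items():
        _setattr(new, attr, value)
    return new


def compare(num_attrs, number=2000):
    """Time both variants on a state dict with num_attrs keys."""
    state = {f"attr{i}": i for i in range(num_attrs)}
    t_global = timeit.timeit(lambda: load_global(state), number=number)
    t_local = timeit.timeit(lambda: load_local(state), number=number)
    return t_global, t_local


for n in (6, 600):
    t_global, t_local = compare(n)
    print(f"{n} attrs: global {t_global:.4f}s, local {t_local:.4f}s")
```

With only a few keys, the fixed cost of creating the instance and the extra local assignment can outweigh the saved lookups, which is why the number of keys matters more than the size of any single value.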

✅ Correctness verification report:

| Test | Status |
|------|--------|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | ✅ 55 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |

🌀 Generated Regression Tests and Runtime

import pytest
from stanza.models.common.vocab import BaseVocab

# unit tests

# -------------------- BASIC TEST CASES --------------------

def test_load_state_dict_basic_attributes():
    # Test loading a minimal state dict with all attributes
    state_dict = {
        'lang': 'en',
        'idx': 42,
        'cutoff': 5,
        'lower': True,
        '_unit2id': {'hello': 0, 'world': 1},
        '_id2unit': ['hello', 'world']
    }
    codeflash_output = BaseVocab.load_state_dict(state_dict); vocab = codeflash_output

def test_load_state_dict_minimal():
    # Test loading with minimal state dict (no optional attributes)
    state_dict = {
        '_unit2id': {},
        '_id2unit': []
    }
    codeflash_output = BaseVocab.load_state_dict(state_dict); vocab = codeflash_output

def test_load_state_dict_with_extra_attributes():
    # Test that extra attributes are also loaded
    state_dict = {
        '_unit2id': {'x': 0},
        '_id2unit': ['x'],
        'extra_attr': 123
    }
    codeflash_output = BaseVocab.load_state_dict(state_dict); vocab = codeflash_output

def test_load_state_dict_normal_usage():
    # Test that the loaded vocab can be used for lookups
    state_dict = {
        'lower': False,
        '_unit2id': {'A': 0, 'B': 1},
        '_id2unit': ['A', 'B']
    }
    codeflash_output = BaseVocab.load_state_dict(state_dict); vocab = codeflash_output

# -------------------- EDGE TEST CASES --------------------

def test_load_state_dict_missing_attributes():
    # Missing _unit2id or _id2unit should cause errors on usage
    state_dict = {
        'lang': 'fr'
    }
    codeflash_output = BaseVocab.load_state_dict(state_dict); vocab = codeflash_output
    # Should raise AttributeError when accessing _unit2id/_id2unit
    with pytest.raises(AttributeError):
        _ = vocab._unit2id
    with pytest.raises(AttributeError):
        _ = vocab._id2unit
    # Should raise AttributeError when using __len__
    with pytest.raises(AttributeError):
        _ = len(vocab)

def test_load_state_dict_empty_dict():
    # Loading an empty dict should result in a vocab with no _unit2id/_id2unit
    codeflash_output = BaseVocab.load_state_dict({}); vocab = codeflash_output
    with pytest.raises(AttributeError):
        _ = vocab._unit2id
    with pytest.raises(AttributeError):
        _ = vocab._id2unit
    with pytest.raises(AttributeError):
        _ = len(vocab)

def test_load_state_dict_mutable_state():
    # Changing the state_dict after loading should not affect the vocab
    state_dict = {
        '_unit2id': {'a': 0},
        '_id2unit': ['a']
    }
    codeflash_output = BaseVocab.load_state_dict(state_dict); vocab = codeflash_output
    state_dict['_unit2id']['b'] = 1
    state_dict['_id2unit'].append('b')

def test_load_state_dict_with_unusual_types():
    # Test with values of unexpected types
    state_dict = {
        '_unit2id': None,
        '_id2unit': None
    }
    codeflash_output = BaseVocab.load_state_dict(state_dict); vocab = codeflash_output
    # __len__ should fail
    with pytest.raises(TypeError):
        _ = len(vocab)

def test_load_state_dict_with_conflicting_attributes():
    # If state_dict has attributes that conflict with existing ones
    state_dict = {
        'state_attrs': ['x', 'y'],
        '_unit2id': {'foo': 0},
        '_id2unit': ['foo']
    }
    codeflash_output = BaseVocab.load_state_dict(state_dict); vocab = codeflash_output

# -------------------- LARGE SCALE TEST CASES --------------------

def test_load_state_dict_large_vocab():
    # Test with a large vocabulary
    units = [f"word{i}" for i in range(1000)]
    unit2id = {u: i for i, u in enumerate(units)}
    state_dict = {
        '_unit2id': unit2id,
        '_id2unit': units
    }
    codeflash_output = BaseVocab.load_state_dict(state_dict); vocab = codeflash_output
    # Check that all units are present and correct
    for i in [0, 499, 999]:
        pass

def test_load_state_dict_large_with_lower():
    # Test with large vocab and lower=True
    units = [f"WORD{i}" for i in range(1000)]
    unit2id = {u.lower(): i for i, u in enumerate(units)}
    state_dict = {
        'lower': True,
        '_unit2id': unit2id,
        '_id2unit': [u.lower() for u in units]
    }
    codeflash_output = BaseVocab.load_state_dict(state_dict); vocab = codeflash_output

def test_load_state_dict_large_extra_attrs():
    # Test with large vocab and extra attributes
    units = [str(i) for i in range(1000)]
    state_dict = {
        '_unit2id': {u: int(u) for u in units},
        '_id2unit': units,
        'lang': 'large',
        'idx': 123,
        'cutoff': 0,
        'lower': False,
        'extra': 'extra_value'
    }
    codeflash_output = BaseVocab.load_state_dict(state_dict); vocab = codeflash_output
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
import pytest
from stanza.models.common.vocab import BaseVocab

# ---- UNIT TESTS FOR load_state_dict ----

# Helper function to make a state_dict for testing
def make_vocab_state_dict(lang="en", idx=0, cutoff=0, lower=False, units=None):
    if units is None:
        units = ["a", "b", "c"]
    _unit2id = {u: i for i, u in enumerate(units)}
    _id2unit = {i: u for i, u in enumerate(units)}
    return {
        'lang': lang,
        'idx': idx,
        'cutoff': cutoff,
        'lower': lower,
        '_unit2id': _unit2id,
        '_id2unit': _id2unit
    }

# ---- BASIC TEST CASES ----

def test_basic_load_state_dict_creates_instance():
    # Test that load_state_dict returns a BaseVocab instance
    state_dict = make_vocab_state_dict()
    codeflash_output = BaseVocab.load_state_dict(state_dict); vocab = codeflash_output

def test_basic_attributes_set_correctly():
    # Test that all attributes in state_dict are set correctly
    state_dict = make_vocab_state_dict(lang="fr", idx=5, cutoff=2, lower=True, units=["x", "y"])
    codeflash_output = BaseVocab.load_state_dict(state_dict); vocab = codeflash_output

def test_basic_len_and_str_methods():
    # Test that __len__ and __str__ methods work after loading state_dict
    state_dict = make_vocab_state_dict(lang="es", units=["uno", "dos", "tres"])
    codeflash_output = BaseVocab.load_state_dict(state_dict); vocab = codeflash_output
    s = str(vocab)

# ---- EDGE TEST CASES ----

def test_empty_state_dict():
    # Test loading from an empty state dict
    codeflash_output = BaseVocab.load_state_dict({}); vocab = codeflash_output
    # Accessing _unit2id or _id2unit should raise AttributeError
    with pytest.raises(AttributeError):
        _ = vocab._unit2id
    with pytest.raises(AttributeError):
        _ = vocab._id2unit

def test_missing_some_attributes():
    # Test state_dict missing some attributes
    state_dict = {'lang': 'de', '_unit2id': {'a': 0}, '_id2unit': {0: 'a'}}
    codeflash_output = BaseVocab.load_state_dict(state_dict); vocab = codeflash_output

def test_extra_attributes():
    # Test state_dict with extra attributes not in state_attrs
    state_dict = make_vocab_state_dict()
    state_dict['extra'] = 42
    codeflash_output = BaseVocab.load_state_dict(state_dict); vocab = codeflash_output

def test_overwrite_existing_attribute():
    # Test that attributes are overwritten from state_dict
    vocab = BaseVocab(lang="it", idx=99)
    state_dict = {'lang': 'ru', 'idx': 1, '_unit2id': {'z': 0}, '_id2unit': {0: 'z'}}
    codeflash_output = BaseVocab.load_state_dict(state_dict); loaded_vocab = codeflash_output

def test_non_dict_input_raises():
    # Test that non-dict input raises AttributeError or TypeError
    with pytest.raises(AttributeError):
        codeflash_output = BaseVocab.load_state_dict(None); _ = codeflash_output
    with pytest.raises(AttributeError):
        codeflash_output = BaseVocab.load_state_dict(123); _ = codeflash_output
    with pytest.raises(AttributeError):
        codeflash_output = BaseVocab.load_state_dict("not a dict"); _ = codeflash_output

def test_mutable_state_dict_does_not_affect_instance():
    # Test that changing state_dict after loading does not affect the instance
    state_dict = make_vocab_state_dict()
    codeflash_output = BaseVocab.load_state_dict(state_dict); vocab = codeflash_output
    state_dict['lang'] = "changed"

def test_unit2id_and_id2unit_are_set():
    # Test that _unit2id and _id2unit are set and usable
    state_dict = make_vocab_state_dict(units=["a", "b"])
    codeflash_output = BaseVocab.load_state_dict(state_dict); vocab = codeflash_output

# ---- LARGE SCALE TEST CASES ----

def test_large_vocab_state_dict():
    # Test with a large vocab (up to 1000 units)
    units = [f"word{i}" for i in range(1000)]
    state_dict = make_vocab_state_dict(lang="big", units=units)
    codeflash_output = BaseVocab.load_state_dict(state_dict); vocab = codeflash_output
    # Spot check a few units
    for i in [0, 499, 999]:
        pass

def test_performance_large_state_dict():
    # Test that loading a large state_dict does not take excessive time
    import time
    units = [f"u{i}" for i in range(1000)]
    state_dict = make_vocab_state_dict(lang="perf", units=units)
    start = time.time()
    codeflash_output = BaseVocab.load_state_dict(state_dict); vocab = codeflash_output
    end = time.time()

def test_large_state_dict_memory_independence():
    # Test that large state_dict does not share references with instance
    units = [f"word{i}" for i in range(1000)]
    state_dict = make_vocab_state_dict(lang="big", units=units)
    codeflash_output = BaseVocab.load_state_dict(state_dict); vocab = codeflash_output
    # Mutate state_dict and check vocab is unaffected
    state_dict['_unit2id']['word0'] = -1
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, `git checkout codeflash/optimize-BaseVocab.load_state_dict-mh4hyf2e` and push.


codeflash-ai bot requested a review from mashraf-222 October 24, 2025 06:55

codeflash-ai bot added the ⚡️ codeflash Optimization PR opened by Codeflash AI label Oct 24, 2025
