@codeflash-ai codeflash-ai bot commented Oct 24, 2025

📄 45% (0.45x) speedup for CompositeVocab.id2unit in stanza/models/common/vocab.py

⏱️ Runtime: 1.64 milliseconds → 1.13 milliseconds (best of 145 runs)
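
The headline percentage can be reproduced from the two runtimes. The sketch below assumes the speedup is defined as the time saved divided by the optimized runtime; the PR does not state the formula, so treat that definition as an assumption.

```python
# Runtimes reported in this PR, in milliseconds.
original_ms = 1.64
optimized_ms = 1.13

# Assumed definition: time saved relative to the optimized runtime.
speedup = (original_ms - optimized_ms) / optimized_ms

print(f"{speedup:.2f}x speedup ({speedup:.0%})")  # 0.45x speedup (45%)
```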

📝 Explanation and details

The optimized code achieves a 45% speedup through several micro-optimizations that reduce per-iteration overhead in the loop: attribute lookups, method lookups, and a repeated branch.

**Key optimizations:**

1. **Local variable caching for loop operations**: The `items.append` bound method is cached in a local variable (`append = items.append`) before the loops. This avoids looking up `append` on the `items` list at every iteration; a local variable lookup is cheaper than an attribute lookup in Python.

2. **Cached attribute and global lookups**: `self._id2unit` (an instance attribute) and `EMPTY_ID` (a module-level name) are bound to local names (`_id2unit` and `EMPTY`) before the loops, so each iteration reads a local instead of repeating the attribute or global lookup.

3. **Eliminated redundant conditional checks**: The original code had a single loop with conditional logic inside (`if self.keyed`). The optimized version splits this into two separate loops, one for keyed mode and one for non-keyed mode, so the condition is evaluated once per call instead of once per item.

4. **Minor string comparison optimization**: `if res == ""` was changed to `if not res`. The two are equivalent here because `res` is always a string, and the truthiness test is marginally faster.
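
The four changes above can be sketched as a before/after pair. This is a reconstruction from the description in this PR, written as standalone functions over a plain dict; it is not the actual `stanza` source, and the function names, the `fields` argument, and the value `EMPTY_ID = 2` (taken from the generated tests) are assumptions made for illustration.

```python
EMPTY_ID = 2  # assumed value, matching the constant used in the generated tests


def id2unit_original(fields, ids, sep, keyed):
    """'Before' shape: one loop, with the keyed check repeated per item."""
    items = []
    for v, k in zip(ids, fields.keys()):
        if v == EMPTY_ID:
            continue
        if keyed:
            items.append("{}={}".format(k, fields[k][v]))
        else:
            items.append(fields[k][v])
    if sep is None:
        return items
    res = sep.join(items)
    if res == "":
        res = "_"
    return res


def id2unit_optimized(fields, ids, sep, keyed):
    """'After' shape: lookups cached in locals, branch hoisted out of the loop."""
    items = []
    append = items.append  # 1. bound method cached in a local
    empty = EMPTY_ID       # 2. global cached in a local
    if keyed:              # 3. mode is checked once, not once per item
        for v, k in zip(ids, fields.keys()):
            if v == empty:
                continue
            append("{}={}".format(k, fields[k][v]))
    else:
        for v, k in zip(ids, fields.keys()):
            if v == empty:
                continue
            append(fields[k][v])
    if sep is None:
        return items
    res = sep.join(items)
    if not res:            # 4. truthiness test instead of == ""
        res = "_"
    return res


fields = {"A": ["foo", "bar"], "B": ["baz", "qux"]}
print(id2unit_optimized(fields, [1, 0], "|", True))  # A=bar|B=baz
```

The two loops duplicate a few lines, which is the usual cost of hoisting a branch: slightly more code in exchange for less work per iteration.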

**Performance characteristics:**
These optimizations matter most in the large-scale test cases, where the loops process hundreds or thousands of vocabulary fields. The per-iteration savings add up with the iteration count, which is consistent with the gains observed on the tests that use large vocabularies with 500-1000 keys.
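
To see how the effect of caching a bound method scales with loop length, a small `timeit` comparison can help. This is an illustrative microbenchmark, not the measurement Codeflash ran; the function names and iteration counts are made up, and the size of the gap varies with the interpreter version and the machine.

```python
import timeit


def with_attribute_lookup(n):
    """Looks up items.append on every iteration."""
    items = []
    for i in range(n):
        items.append(i)
    return items


def with_cached_append(n):
    """Looks up items.append once and reuses the bound method."""
    items = []
    append = items.append
    for i in range(n):
        append(i)
    return items


n = 1000  # roughly the size of the large vocabularies in the tests
t_attr = timeit.timeit(lambda: with_attribute_lookup(n), number=200)
t_local = timeit.timeit(lambda: with_cached_append(n), number=200)
print(f"attribute lookup: {t_attr:.4f}s, cached local: {t_local:.4f}s")
```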

Correctness verification report:

| Test | Status |
| --- | --- |
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 132 Passed |
| ⏪ Replay Tests | 56 Passed |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |

🌀 Generated Regression Tests and Runtime
from collections.abc import Iterable

# imports
import pytest
from stanza.models.common.vocab import CompositeVocab

EMPTY_ID = 2

class BaseVocab:
    def __init__(self, data=None, lang="", idx=0, cutoff=0, lower=False):
        self.data = data
        self.lang = lang
        self.idx = idx
        self.cutoff = cutoff
        self.lower = lower
        if data is not None:
            self.build_vocab()
        self.state_attrs = ['lang', 'idx', 'cutoff', 'lower', '_unit2id', '_id2unit']

    def __str__(self):
        lang_str = "(%s)" % self.lang if self.lang else ""
        name = str(type(self)) + lang_str
        return "<%s: %s>" % (name, self._id2unit)

    def __len__(self):
        return len(self._id2unit)

    def __getitem__(self, key):
        if isinstance(key, str):
            return self.unit2id(key)
        elif isinstance(key, int) or isinstance(key, list):
            return self.id2unit(key)
        else:
            raise TypeError("Vocab key must be one of str, list, or int")

    def __contains__(self, key):
        return self.normalize_unit(key) in self._unit2id


# Helper function to create a CompositeVocab with a custom _id2unit mapping
def make_vocab(id2unit_dict, sep="", keyed=False):
    vocab = CompositeVocab(sep=sep, keyed=keyed)
    vocab._id2unit = id2unit_dict
    return vocab

# ---------------- BASIC TEST CASES ----------------

def test_basic_positional_sep():
    # Basic test: positional, sep="|"
    # _id2unit: {"A": ["foo", "bar"], "B": ["baz", "qux"]}
    v = make_vocab({"A": ["foo", "bar"], "B": ["baz", "qux"]}, sep="|", keyed=False)
    # id [0,0] -> "foo|baz"
    codeflash_output = v.id2unit([0, 0])
    # id [1,1] -> "bar|qux"
    codeflash_output = v.id2unit([1, 1])
    # id [1,0] -> "bar|baz"
    codeflash_output = v.id2unit([1, 0])

def test_basic_keyed_sep():
    # Basic test: keyed, sep="|"
    # _id2unit: {"A": ["foo", "bar"], "B": ["baz", "qux"]}
    v = make_vocab({"A": ["foo", "bar"], "B": ["baz", "qux"]}, sep="|", keyed=True)
    # id [0,0] -> "A=foo|B=baz"
    codeflash_output = v.id2unit([0, 0])
    # id [1,1] -> "A=bar|B=qux"
    codeflash_output = v.id2unit([1, 1])
    # id [1,0] -> "A=bar|B=baz"
    codeflash_output = v.id2unit([1, 0])

def test_basic_positional_no_sep():
    # Basic test: positional, sep=None
    v = make_vocab({"A": ["foo", "bar"], "B": ["baz", "qux"]}, sep=None, keyed=False)
    # Should return list of values, not string
    codeflash_output = v.id2unit([0, 1])
    codeflash_output = v.id2unit([1, 0])

def test_basic_keyed_no_sep():
    # Basic test: keyed, sep=None
    v = make_vocab({"A": ["foo", "bar"], "B": ["baz", "qux"]}, sep=None, keyed=True)
    # Should return list of "A=foo", "B=qux"
    codeflash_output = v.id2unit([0, 1])
    codeflash_output = v.id2unit([1, 0])

def test_basic_single_field():
    # Vocab with only one field, positional, sep="|"
    v = make_vocab({"A": ["foo", "bar"]}, sep="|", keyed=False)
    # id=0 as int should be treated as (0,)
    codeflash_output = v.id2unit(0)
    codeflash_output = v.id2unit(1)
    # id=[1] as list should also work
    codeflash_output = v.id2unit([1])

def test_basic_single_field_keyed():
    # Vocab with only one field, keyed, sep="|"
    v = make_vocab({"A": ["foo", "bar"]}, sep="|", keyed=True)
    codeflash_output = v.id2unit(0)
    codeflash_output = v.id2unit(1)
    codeflash_output = v.id2unit([1])

# ---------------- EDGE TEST CASES ----------------

def test_empty_id_returns_underscore():
    # All ids are EMPTY_ID, sep is not None
    v = make_vocab({"A": ["foo", "bar"], "B": ["baz", "qux"]}, sep="|", keyed=False)
    # [EMPTY_ID, EMPTY_ID] should return "_"
    codeflash_output = v.id2unit([EMPTY_ID, EMPTY_ID])

def test_empty_id_skips_fields():
    # Some ids are EMPTY_ID, sep is not None
    v = make_vocab({"A": ["foo", "bar"], "B": ["baz", "qux"]}, sep="|", keyed=False)
    # [EMPTY_ID, 0] should return "baz"
    codeflash_output = v.id2unit([EMPTY_ID, 0])
    # [1, EMPTY_ID] should return "bar"
    codeflash_output = v.id2unit([1, EMPTY_ID])

def test_empty_id_skips_fields_keyed():
    # Keyed version
    v = make_vocab({"A": ["foo", "bar"], "B": ["baz", "qux"]}, sep="|", keyed=True)
    # [EMPTY_ID, 1] -> "B=qux"
    codeflash_output = v.id2unit([EMPTY_ID, 1])
    # [1, EMPTY_ID] -> "A=bar"
    codeflash_output = v.id2unit([1, EMPTY_ID])
    # [EMPTY_ID, EMPTY_ID] -> "_"
    codeflash_output = v.id2unit([EMPTY_ID, EMPTY_ID])

def test_empty_id_with_sep_none():
    # sep=None, should return list, skip EMPTY_ID
    v = make_vocab({"A": ["foo", "bar"], "B": ["baz", "qux"]}, sep=None, keyed=False)
    codeflash_output = v.id2unit([EMPTY_ID, 1])
    codeflash_output = v.id2unit([1, EMPTY_ID])
    codeflash_output = v.id2unit([EMPTY_ID, EMPTY_ID])

def test_empty_id_with_sep_none_keyed():
    v = make_vocab({"A": ["foo", "bar"], "B": ["baz", "qux"]}, sep=None, keyed=True)
    codeflash_output = v.id2unit([EMPTY_ID, 1])
    codeflash_output = v.id2unit([1, EMPTY_ID])
    codeflash_output = v.id2unit([EMPTY_ID, EMPTY_ID])

def test_len1_vocab_int_id():
    # Vocab with one field, sep=None, int id
    v = make_vocab({"A": ["foo", "bar"]}, sep=None, keyed=False)
    codeflash_output = v.id2unit(1)
    codeflash_output = v.id2unit(0)

def test_len1_vocab_int_id_keyed():
    v = make_vocab({"A": ["foo", "bar"]}, sep=None, keyed=True)
    codeflash_output = v.id2unit(1)
    codeflash_output = v.id2unit(0)

def test_id_shorter_than_fields():
    # id shorter than number of fields: zip truncates, so only first fields used
    v = make_vocab({"A": ["foo", "bar"], "B": ["baz", "qux"], "C": ["x", "y"]}, sep="|", keyed=False)
    # id=[1] should only use "A", ignore "B", "C"
    codeflash_output = v.id2unit([1])
    # id=[0,1] should use "A", "B"
    codeflash_output = v.id2unit([0,1])

def test_id_longer_than_fields():
    # id longer than fields: zip truncates, so extra ids ignored
    v = make_vocab({"A": ["foo", "bar"], "B": ["baz", "qux"]}, sep="|", keyed=False)
    # id=[1,0,1,2] should only use first two ids
    codeflash_output = v.id2unit([1,0,1,2])

def test_sep_is_empty_string():
    # sep="" (empty string), should join with no separator
    v = make_vocab({"A": ["foo", "bar"], "B": ["baz", "qux"]}, sep="", keyed=False)
    codeflash_output = v.id2unit([1, 0])

def test_sep_is_none_and_empty_id():
    # sep=None, all EMPTY_ID, should return []
    v = make_vocab({"A": ["foo", "bar"]}, sep=None, keyed=False)
    codeflash_output = v.id2unit([EMPTY_ID])

def test_non_iterable_id_with_len1_vocab():
    # id is not iterable, vocab has length 1
    v = make_vocab({"A": ["foo", "bar"]}, sep="|", keyed=False)
    codeflash_output = v.id2unit(0)
    codeflash_output = v.id2unit(1)

def test_non_iterable_id_with_len1_vocab_keyed():
    v = make_vocab({"A": ["foo", "bar"]}, sep="|", keyed=True)
    codeflash_output = v.id2unit(0)
    codeflash_output = v.id2unit(1)

def test_id_is_tuple():
    # id is tuple
    v = make_vocab({"A": ["foo", "bar"], "B": ["baz", "qux"]}, sep="|", keyed=False)
    codeflash_output = v.id2unit((0, 1))

def test_id_is_range():
    # id is range object
    v = make_vocab({"A": ["foo", "bar"], "B": ["baz", "qux"]}, sep="|", keyed=False)
    codeflash_output = v.id2unit(range(2))


def test_id_is_string_should_not_be_treated_as_iterable():
    # id is a string. A str is itself Iterable, so it is not wrapped as a single id;
    # its characters are then used as list indices, which is expected to raise TypeError.
    v = make_vocab({"A": ["foo", "bar"]}, sep="|", keyed=False)
    with pytest.raises(TypeError):
        v.id2unit("0")  # "0" is neither an int nor an iterable of ints

# ---------------- LARGE SCALE TEST CASES ----------------

def test_large_vocab_positional():
    # Large vocab, positional, sep="|"
    # 1000 fields, each with 2 values
    id2unit_dict = {f"F{i}": [str(i), str(i+1)] for i in range(1000)}
    v = make_vocab(id2unit_dict, sep="|", keyed=False)
    # All 0s
    ids = [0]*1000
    expected = "|".join(str(i) for i in range(1000))
    codeflash_output = v.id2unit(ids)
    assert codeflash_output == expected
    # All 1s
    ids = [1]*1000
    expected = "|".join(str(i+1) for i in range(1000))
    codeflash_output = v.id2unit(ids)
    assert codeflash_output == expected

def test_large_vocab_keyed():
    # Large vocab, keyed, sep="|"
    id2unit_dict = {f"K{i}": [str(i), str(i+1)] for i in range(1000)}
    v = make_vocab(id2unit_dict, sep="|", keyed=True)
    ids = [0]*1000
    expected = "|".join(f"K{i}={i}" for i in range(1000))
    codeflash_output = v.id2unit(ids)
    assert codeflash_output == expected
    ids = [1]*1000
    expected = "|".join(f"K{i}={i+1}" for i in range(1000))
    codeflash_output = v.id2unit(ids)
    assert codeflash_output == expected

def test_large_vocab_with_empty_ids():
    # Large vocab, some EMPTY_IDs
    id2unit_dict = {f"F{i}": [str(i), str(i+1)] for i in range(1000)}
    v = make_vocab(id2unit_dict, sep="|", keyed=False)
    ids = [EMPTY_ID]*1000
    codeflash_output = v.id2unit(ids)
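    # Mix of EMPTY_ID and 1: only the odd-indexed fields contribute
    ids = [EMPTY_ID if i % 2 == 0 else 1 for i in range(1000)]
    expected = "|".join(str(i+1) for i in range(1000) if i % 2 == 1)
    codeflash_output = v.id2unit(ids)
    assert codeflash_output == expected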

def test_large_vocab_sep_none():
    # Large vocab, sep=None
    id2unit_dict = {f"F{i}": [str(i), str(i+1)] for i in range(1000)}
    v = make_vocab(id2unit_dict, sep=None, keyed=False)
    ids = [1]*1000
    expected = [str(i+1) for i in range(1000)]
    codeflash_output = v.id2unit(ids)
    assert codeflash_output == expected
    # All EMPTY_IDs
    ids = [EMPTY_ID]*1000
    codeflash_output = v.id2unit(ids)

def test_large_vocab_keyed_sep_none():
    id2unit_dict = {f"K{i}": [str(i), str(i+1)] for i in range(1000)}
    v = make_vocab(id2unit_dict, sep=None, keyed=True)
    ids = [0]*1000
    expected = [f"K{i}={i}" for i in range(1000)]
    codeflash_output = v.id2unit(ids)
    assert codeflash_output == expected
    ids = [EMPTY_ID]*1000
    codeflash_output = v.id2unit(ids)

def test_large_vocab_short_id():
    # Large vocab, id shorter than fields
    id2unit_dict = {f"F{i}": [str(i), str(i+1)] for i in range(1000)}
    v = make_vocab(id2unit_dict, sep="|", keyed=False)
    ids = [1]  # Only first field used
    codeflash_output = v.id2unit(ids)
    ids = [0, 1, 1]  # Only first three fields used
    expected = "|".join([str(0), str(2), str(3)])
    codeflash_output = v.id2unit(ids)
    assert codeflash_output == expected

def test_large_vocab_long_id():
    # Large vocab, id longer than fields
    id2unit_dict = {f"F{i}": [str(i), str(i+1)] for i in range(1000)}
    v = make_vocab(id2unit_dict, sep="|", keyed=False)
    ids = [1]*1005  # Only first 1000 used
    expected = "|".join(str(i+1) for i in range(1000))
    codeflash_output = v.id2unit(ids)
    assert codeflash_output == expected
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
from collections.abc import Iterable

# imports
import pytest
from stanza.models.common.vocab import CompositeVocab

EMPTY_ID = 2

class BaseVocab:
    def __init__(self, data=None, lang="", idx=0, cutoff=0, lower=False):
        self.data = data
        self.lang = lang
        self.idx = idx
        self.cutoff = cutoff
        self.lower = lower
        if data is not None:
            self.build_vocab()
        self.state_attrs = ['lang', 'idx', 'cutoff', 'lower', '_unit2id', '_id2unit']

    def __str__(self):
        lang_str = "(%s)" % self.lang if self.lang else ""
        name = str(type(self)) + lang_str
        return "<%s: %s>" % (name, self._id2unit)

    def __len__(self):
        return len(self._id2unit)

    def __getitem__(self, key):
        if isinstance(key, str):
            return self.unit2id(key)
        elif isinstance(key, int) or isinstance(key, list):
            return self.id2unit(key)
        else:
            raise TypeError("Vocab key must be one of str, list, or int")

    def __contains__(self, key):
        return self.normalize_unit(key) in self._unit2id

# --- Unit Tests for id2unit ---

# Helper to create a CompositeVocab with custom _id2unit mapping
def make_vocab(id2unit_dict, sep="|", keyed=False):
    vocab = CompositeVocab(sep=sep, keyed=keyed)
    vocab._id2unit = id2unit_dict
    return vocab

# -------------------
# 1. Basic Test Cases
# -------------------

def test_basic_positional_vocab():
    # 2-part positional vocab, sep='|', keyed=False
    vocab = make_vocab({'A': ['x', 'y', 'z'], 'B': ['a', 'b', 'c']}, sep="|", keyed=False)
    # Normal case: both ids valid
    codeflash_output = vocab.id2unit([0, 1])  # A=0->x, B=1->b
    codeflash_output = vocab.id2unit([2, 2])
    # Single id for vocab with length 1
    vocab1 = make_vocab({'A': ['x', 'y']}, sep="|", keyed=False)
    codeflash_output = vocab1.id2unit(1)
    codeflash_output = vocab1.id2unit([0])

def test_basic_keyed_vocab():
    # 2-part keyed vocab, sep='|', keyed=True
    vocab = make_vocab({'TENSE': ['Past', 'Pres', 'Fut'], 'NUMBER': ['Sing', 'Plur']}, sep="|", keyed=True)
    # Normal case: both ids valid
    codeflash_output = vocab.id2unit([1, 0])
    codeflash_output = vocab.id2unit([2, 1])

def test_basic_empty_separator():
    # sep=None returns list
    vocab = make_vocab({'A': ['x', 'y'], 'B': ['a', 'b']}, sep=None, keyed=False)
    codeflash_output = vocab.id2unit([1, 0])
    vocab = make_vocab({'K': ['foo', 'bar']}, sep=None, keyed=True)
    codeflash_output = vocab.id2unit([1])

# -------------------
# 2. Edge Test Cases
# -------------------

def test_edge_empty_id():
    # EMPTY_ID should be skipped
    vocab = make_vocab({'A': ['x', 'y', 'z'], 'B': ['a', 'b', 'c']}, sep="|", keyed=False)
    # One id is EMPTY_ID
    codeflash_output = vocab.id2unit([EMPTY_ID, 1])
    codeflash_output = vocab.id2unit([0, EMPTY_ID])
    # Both ids EMPTY_ID, should return "_"
    codeflash_output = vocab.id2unit([EMPTY_ID, EMPTY_ID])

def test_edge_empty_id_keyed():
    vocab = make_vocab({'TENSE': ['Past', 'Pres', 'Fut'], 'NUMBER': ['Sing', 'Plur']}, sep="|", keyed=True)
    # One id is EMPTY_ID
    codeflash_output = vocab.id2unit([EMPTY_ID, 1])
    codeflash_output = vocab.id2unit([2, EMPTY_ID])
    # Both EMPTY_ID
    codeflash_output = vocab.id2unit([EMPTY_ID, EMPTY_ID])

def test_edge_separator_none_empty():
    # sep=None, all EMPTY_IDs, should return []
    vocab = make_vocab({'A': ['x', 'y'], 'B': ['a', 'b']}, sep=None, keyed=False)
    codeflash_output = vocab.id2unit([EMPTY_ID, EMPTY_ID])
    vocab = make_vocab({'K': ['foo', 'bar']}, sep=None, keyed=True)
    codeflash_output = vocab.id2unit([EMPTY_ID])

def test_edge_single_id_noniterable():
    # Single id, non-iterable, vocab with length 1
    vocab = make_vocab({'A': ['x', 'y', 'z']}, sep="|", keyed=False)
    vocab._id2unit = {'A': ['x', 'y', 'z']}  # length 1
    codeflash_output = vocab.id2unit(2)
    # When id is EMPTY_ID, should return "_"
    codeflash_output = vocab.id2unit(EMPTY_ID)

def test_edge_id_length_mismatch():
    # If id shorter than vocab keys, should only zip up to min length
    vocab = make_vocab({'A': ['x', 'y'], 'B': ['a', 'b']}, sep="|", keyed=False)
    # id has only one element
    codeflash_output = vocab.id2unit([1])
    # id longer than vocab keys: extra ids ignored
    codeflash_output = vocab.id2unit([1, 0, 2])

def test_edge_empty_vocab():
    # No keys in vocab
    vocab = make_vocab({}, sep="|", keyed=False)
    codeflash_output = vocab.id2unit([])
    codeflash_output = vocab.id2unit([EMPTY_ID])
    # sep=None returns []
    vocab = make_vocab({}, sep=None, keyed=False)
    codeflash_output = vocab.id2unit([])



def test_large_positional_vocab():
    # 1000 keys, each with 3 values
    keys = [f"K{i}" for i in range(1000)]
    id2unit_dict = {k: [f"{k}_v{j}" for j in range(3)] for k in keys}
    vocab = make_vocab(id2unit_dict, sep="|", keyed=False)
    # All ids are 1
    ids = [1]*1000
    expected = "|".join([f"K{i}_v1" for i in range(1000)])
    codeflash_output = vocab.id2unit(ids)
    assert codeflash_output == expected
    # All EMPTY_IDs
    ids = [EMPTY_ID]*1000
    codeflash_output = vocab.id2unit(ids)
    # First half valid, second half EMPTY_ID
    ids = [0]*500 + [EMPTY_ID]*500
    expected = "|".join([f"K{i}_v0" for i in range(500)])
    codeflash_output = vocab.id2unit(ids)
    assert codeflash_output == expected

def test_large_keyed_vocab():
    # 500 keys, each with 2 values
    keys = [f"F{i}" for i in range(500)]
    id2unit_dict = {k: [f"{k}A", f"{k}B"] for k in keys}
    vocab = make_vocab(id2unit_dict, sep="|", keyed=True)
    # All ids are 1
    ids = [1]*500
    expected = "|".join([f"{k}={k}B" for k in keys])
    codeflash_output = vocab.id2unit(ids)
    assert codeflash_output == expected
    # All EMPTY_IDs
    ids = [EMPTY_ID]*500
    codeflash_output = vocab.id2unit(ids)
    # Alternate EMPTY_ID and valid
    ids = [EMPTY_ID if i%2==0 else 1 for i in range(500)]
    expected = "|".join([f"{k}={k}B" for i, k in enumerate(keys) if i%2==1])
    codeflash_output = vocab.id2unit(ids)
    assert codeflash_output == expected

def test_large_separator_none():
    # 100 keys, sep=None, keyed=False
    keys = [f"K{i}" for i in range(100)]
    id2unit_dict = {k: [f"{k}_0", f"{k}_1"] for k in keys}
    vocab = make_vocab(id2unit_dict, sep=None, keyed=False)
    ids = [1]*100
    expected = [f"K{i}_1" for i in range(100)]
    codeflash_output = vocab.id2unit(ids)
    assert codeflash_output == expected
    # All EMPTY_IDs
    ids = [EMPTY_ID]*100
    codeflash_output = vocab.id2unit(ids)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
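
The comparison that the `codeflash_output` comment refers to can be pictured as running the original and the optimized function on the same inputs and requiring identical results, including identical exception types. The helper below is a hypothetical sketch of that idea, not Codeflash's actual mechanism; `assert_same_behavior`, `join_original`, and `join_optimized` are names invented for this example.

```python
def assert_same_behavior(original, optimized, cases):
    """Run both callables on each argument tuple and compare outcomes."""
    for args in cases:
        try:
            expected = ("ok", original(*args))
        except Exception as exc:
            expected = ("raised", type(exc))
        try:
            actual = ("ok", optimized(*args))
        except Exception as exc:
            actual = ("raised", type(exc))
        assert actual == expected, f"mismatch for {args!r}: {actual} != {expected}"


def join_original(parts, sep):
    res = sep.join(parts)
    if res == "":
        res = "_"
    return res


def join_optimized(parts, sep):
    # Same result for strings: an empty join falls back to "_".
    return sep.join(parts) or "_"


# The last case makes both versions raise TypeError, which also counts as a match.
assert_same_behavior(
    join_original,
    join_optimized,
    [(["a", "b"], "|"), ([], "|"), ([1], "|")],
)
```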
⏪ Replay Tests and Runtime

To edit these changes, run `git checkout codeflash/optimize-CompositeVocab.id2unit-mh4j6fbb` and push.

@codeflash-ai codeflash-ai bot requested a review from mashraf-222 October 24, 2025 07:30
@codeflash-ai codeflash-ai bot added the ⚡️ codeflash Optimization PR opened by Codeflash AI label Oct 24, 2025