Skip to content

Conversation

@codeflash-ai
Copy link

@codeflash-ai codeflash-ai bot commented Oct 24, 2025

📄 113% (1.13x) speedup for known_nicknames in stanza/resources/default_packages.py

⏱️ Runtime : 318 microseconds 149 microseconds (best of 578 runs)

📝 Explanation and details

The optimized code achieves a 113% speedup through two key improvements:

1. Efficient Dictionary Value Extraction

  • Original: list(value for key, value in TRANSFORMER_NICKNAMES.items()) creates a generator expression that iterates over key-value pairs, discarding keys
  • Optimized: list(TRANSFORMER_NICKNAMES.values()) directly extracts dictionary values without creating unnecessary key-value tuples
  • This eliminates the overhead of tuple creation and unpacking for each dictionary entry

2. In-Place Sorting vs. Creating New Sorted List

  • Original: sorted(nicknames, key=lambda x: -len(x)) creates a new list and uses a lambda function to negate lengths
  • Optimized: nicknames.sort(key=len, reverse=True) sorts the existing list in-place using the built-in len function with reverse=True
  • In-place sorting avoids memory allocation for a new list and eliminates the lambda function overhead

The line profiler confirms these improvements: the dictionary extraction time drops from 651,272ns to 69,467ns (89% faster), and the sorting time decreases from 700,842ns to 170,034ns (76% faster).

These optimizations are particularly effective for the typical use case with ~70 transformer nicknames in the dictionary, and scale well for larger datasets as shown in the test cases with 1000+ nicknames. The optimizations maintain identical functionality while being more memory-efficient and CPU-friendly.

Correctness verification report:

Test Status
⚙️ Existing Unit Tests 🔘 None Found
🌀 Generated Regression Tests 31 Passed
⏪ Replay Tests 🔘 None Found
🔎 Concolic Coverage Tests 🔘 None Found
📊 Tests Coverage 100.0%
🌀 Generated Regression Tests and Runtime
import pytest  # used for our unit tests
from stanza.resources.default_packages import known_nicknames

# function to test
TRANSFORMER_NICKNAMES = {
    # ar
    "asafaya/bert-base-arabic": "asafaya-bert",
    "aubmindlab/araelectra-base-discriminator": "aubmind-electra",
    "aubmindlab/bert-base-arabertv2": "aubmind-bert",

    # da
    "vesteinn/ScandiBERT": "scandibert",

    # de
    "bert-base-german-cased": "bert-base-german-cased",
    "dbmdz/bert-base-german-cased": "dbmdz-bert-german-cased",
    "german-nlp-group/electra-base-german-uncased": "german-nlp-electra",

    # en
    "bert-base-multilingual-cased": "mbert",
    "xlm-roberta-large": "xlm-roberta-large",
    "google/electra-large-discriminator": "electra-large",
    "microsoft/deberta-v3-large": "deberta-v3-large",
    "princeton-nlp/Sheared-LLaMA-1.3B": "sheared-llama-1b3",

    # es
    "bertin-project/bertin-roberta-base-spanish": "bertin-roberta",

    # fa
    "HooshvareLab/bert-base-parsbert-uncased": "parsbert",

    # fi
    "TurkuNLP/bert-base-finnish-cased-v1": "bert",

    # fr
    "benjamin/roberta-base-wechsel-french": "wechsel-roberta",
    "camembert-base": "camembert-base",
    "camembert/camembert-large": "camembert-large",
    "dbmdz/electra-base-french-europeana-cased-discriminator": "dbmdz-electra",

    # grc
    "pranaydeeps/Ancient-Greek-BERT": "grc-pranaydeeps",
    "lgessler/microbert-ancient-greek-m": "grc-microbert-m",
    "lgessler/microbert-ancient-greek-mx": "grc-microbert-mx",
    "lgessler/microbert-ancient-greek-mxp": "grc-microbert-mxp",
    "altsoph/bert-base-ancientgreek-uncased": "grc-altsoph",

    # he
    "HeNLP/HeRo": "hero-roberta",
    "imvladikon/alephbertgimmel-base-512": "alephbertgimmel",
    "onlplab/alephbert-base": "alephbert",

    # hy
    "xlm-roberta-base": "xlm-roberta-base",

    # id
    "indolem/indobert-base-uncased":         "indobert",
    "indobenchmark/indobert-large-p1":       "indobenchmark-large-p1",
    "indobenchmark/indobert-base-p1":        "indobenchmark-base-p1",
    "indobenchmark/indobert-lite-large-p1":  "indobenchmark-lite-large-p1",
    "indobenchmark/indobert-lite-base-p1":   "indobenchmark-lite-base-p1",
    "indobenchmark/indobert-large-p2":       "indobenchmark-large-p2",
    "indobenchmark/indobert-base-p2":        "indobenchmark-base-p2",
    "indobenchmark/indobert-lite-large-p2":  "indobenchmark-lite-large-p2",
    "indobenchmark/indobert-lite-base-p2":   "indobenchmark-lite-base-p2",

    # it
    "dbmdz/electra-base-italian-xxl-cased-discriminator": "electra",

    # ja
    "rinna/japanese-roberta-base": "rinna-roberta",

    # mr
    "l3cube-pune/marathi-roberta": "l3cube-marathi-roberta",

    # pl
    "allegro/herbert-base-cased": "herbert",

    # pt
    "neuralmind/bert-large-portuguese-cased": "bertimbau",

    # ta: tamil
    "monsoon-nlp/tamillion":         "tamillion",
    "lgessler/microbert-tamil-m":    "ta-microbert-m",
    "lgessler/microbert-tamil-mxp":  "ta-microbert-mxp",
    "l3cube-pune/tamil-bert":        "l3cube-tamil-bert",
    "d42kw01f/Tamil-RoBERTa":        "ta-d42kw01f-roberta",

    # th
    "airesearch/wangchanberta-base-att-spm-uncased":   "wangchanberta",

    # tr
    "dbmdz/bert-base-turkish-128k-cased": "bert",

    # vi
    "vinai/phobert-base": "phobert-base",
    "vinai/phobert-large": "phobert-large",

    # zh
    "google-bert/bert-base-chinese": "google-bert-chinese",
    "hfl/chinese-bert-wwm": "hfl-bert-chinese",
    "hfl/chinese-macbert-large": "hfl-macbert-chinese",
    "hfl/chinese-roberta-wwm-ext": "hfl-roberta-chinese",
    "hfl/chinese-electra-180g-large-discriminator": "electra-large",
    "ShannonAI/ChineseBERT-base": "shannonai-chinese-bert",

    # multi-lingual Indic
    "ai4bharat/indic-bert": "indic-bert",
    "google/muril-base-cased": "muril-base-cased",
    "google/muril-large-cased": "muril-large-cased",

    # multi-lingual
    "FacebookAI/xlm-roberta-large": "xlm-roberta-large",
}
from stanza.resources.default_packages import known_nicknames

# unit tests

def test_basic_presence_and_type():
    # Basic: Test that function returns a list
    codeflash_output = known_nicknames(); result = codeflash_output

def test_basic_nicknames_inclusion():
    # Basic: Test that some specific nicknames are present
    codeflash_output = known_nicknames(); result = codeflash_output

def test_basic_no_duplicates():
    # Basic: Test that there are no duplicate nicknames
    codeflash_output = known_nicknames(); result = codeflash_output

def test_basic_length():
    # Basic: Test that the length matches the expected number of nicknames
    expected_length = len(set(TRANSFORMER_NICKNAMES.values())) + 1  # +1 for "transformer"
    codeflash_output = known_nicknames(); result = codeflash_output

def test_edge_sort_order():
    # Edge: Test that the list is sorted in decreasing key length
    codeflash_output = known_nicknames(); result = codeflash_output
    lengths = [len(nick) for nick in result]

def test_edge_transformer_last():
    # Edge: "transformer" should be present and, as the shortest, should be last or among the last
    codeflash_output = known_nicknames(); result = codeflash_output
    min_length = min(len(n) for n in result)

def test_edge_shortest_and_longest_nicknames():
    # Edge: Test that the shortest and longest nicknames are present
    codeflash_output = known_nicknames(); result = codeflash_output
    # Find the shortest and longest nicknames in the source
    values = list(TRANSFORMER_NICKNAMES.values()) + ["transformer"]
    shortest = min(values, key=len)
    longest = max(values, key=len)

def test_edge_case_sensitive():
    # Edge: Test that nicknames are case-sensitive
    codeflash_output = known_nicknames(); result = codeflash_output

def test_edge_empty_source():
    # Edge: If the source dict is empty, only "transformer" should be returned
    global TRANSFORMER_NICKNAMES
    backup = TRANSFORMER_NICKNAMES.copy()
    TRANSFORMER_NICKNAMES.clear()
    try:
        codeflash_output = known_nicknames(); result = codeflash_output
    finally:
        TRANSFORMER_NICKNAMES.update(backup)

def test_edge_all_same_length():
    # Edge: If all nicknames have the same length, order should be preserved except "transformer" last
    global TRANSFORMER_NICKNAMES
    backup = TRANSFORMER_NICKNAMES.copy()
    # Make all nicknames length 4
    TRANSFORMER_NICKNAMES = {
        "a": "abcd",
        "b": "efgh",
        "c": "ijkl"
    }
    try:
        codeflash_output = known_nicknames(); result = codeflash_output
    finally:
        TRANSFORMER_NICKNAMES = backup

def test_large_scale_nicknames():
    # Large Scale: Test with a large number of nicknames
    global TRANSFORMER_NICKNAMES
    backup = TRANSFORMER_NICKNAMES.copy()
    # Generate 999 nicknames of varying lengths
    TRANSFORMER_NICKNAMES = {f"model_{i}": f"nickname_{i}" + "x" * (i % 20) for i in range(999)}
    try:
        codeflash_output = known_nicknames(); result = codeflash_output
        # Should contain all nicknames plus "transformer"
        expected_length = 999 + 1
        # Check sorting: longest first, shortest last
        lengths = [len(nick) for nick in result]
    finally:
        TRANSFORMER_NICKNAMES = backup

def test_large_scale_performance():
    # Large Scale: Ensure function runs efficiently with 1000 nicknames
    import time
    global TRANSFORMER_NICKNAMES
    backup = TRANSFORMER_NICKNAMES.copy()
    TRANSFORMER_NICKNAMES = {f"model_{i}": f"nick_{i}" for i in range(1000)}
    try:
        start = time.time()
        codeflash_output = known_nicknames(); result = codeflash_output
        end = time.time()
    finally:
        TRANSFORMER_NICKNAMES = backup

def test_edge_nickname_with_spaces_and_special_chars():
    # Edge: Nicknames with spaces or special characters
    global TRANSFORMER_NICKNAMES
    backup = TRANSFORMER_NICKNAMES.copy()
    TRANSFORMER_NICKNAMES = {
        "model space": "nick name",
        "model!@#": "nick!@#",
        "model-hyphen": "nick-name"
    }
    try:
        codeflash_output = known_nicknames(); result = codeflash_output
    finally:
        TRANSFORMER_NICKNAMES = backup

def test_edge_nickname_collision():
    # Edge: Two keys with same nickname value
    global TRANSFORMER_NICKNAMES
    backup = TRANSFORMER_NICKNAMES.copy()
    TRANSFORMER_NICKNAMES = {
        "model1": "collision",
        "model2": "collision"
    }
    try:
        codeflash_output = known_nicknames(); result = codeflash_output
    finally:
        TRANSFORMER_NICKNAMES = backup

def test_edge_nickname_empty_string():
    # Edge: Nickname is empty string
    global TRANSFORMER_NICKNAMES
    backup = TRANSFORMER_NICKNAMES.copy()
    TRANSFORMER_NICKNAMES = {
        "model1": "",
        "model2": "nonempty"
    }
    try:
        codeflash_output = known_nicknames(); result = codeflash_output
    finally:
        TRANSFORMER_NICKNAMES = backup

def test_edge_nickname_unicode():
    # Edge: Nicknames with unicode characters
    global TRANSFORMER_NICKNAMES
    backup = TRANSFORMER_NICKNAMES.copy()
    TRANSFORMER_NICKNAMES = {
        "model1": "昵称",  # Chinese
        "model2": "имя",  # Cyrillic
        "model3": "اسم"   # Arabic
    }
    try:
        codeflash_output = known_nicknames(); result = codeflash_output
    finally:
        TRANSFORMER_NICKNAMES = backup
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
import pytest
from stanza.resources.default_packages import known_nicknames

# function to test
TRANSFORMER_NICKNAMES = {
    # ar
    "asafaya/bert-base-arabic": "asafaya-bert",
    "aubmindlab/araelectra-base-discriminator": "aubmind-electra",
    "aubmindlab/bert-base-arabertv2": "aubmind-bert",

    # da
    "vesteinn/ScandiBERT": "scandibert",

    # de
    "bert-base-german-cased": "bert-base-german-cased",
    "dbmdz/bert-base-german-cased": "dbmdz-bert-german-cased",
    "german-nlp-group/electra-base-german-uncased": "german-nlp-electra",

    # en
    "bert-base-multilingual-cased": "mbert",
    "xlm-roberta-large": "xlm-roberta-large",
    "google/electra-large-discriminator": "electra-large",
    "microsoft/deberta-v3-large": "deberta-v3-large",
    "princeton-nlp/Sheared-LLaMA-1.3B": "sheared-llama-1b3",

    # es
    "bertin-project/bertin-roberta-base-spanish": "bertin-roberta",

    # fa
    "HooshvareLab/bert-base-parsbert-uncased": "parsbert",

    # fi
    "TurkuNLP/bert-base-finnish-cased-v1": "bert",

    # fr
    "benjamin/roberta-base-wechsel-french": "wechsel-roberta",
    "camembert-base": "camembert-base",
    "camembert/camembert-large": "camembert-large",
    "dbmdz/electra-base-french-europeana-cased-discriminator": "dbmdz-electra",

    # grc
    "pranaydeeps/Ancient-Greek-BERT": "grc-pranaydeeps",
    "lgessler/microbert-ancient-greek-m": "grc-microbert-m",
    "lgessler/microbert-ancient-greek-mx": "grc-microbert-mx",
    "lgessler/microbert-ancient-greek-mxp": "grc-microbert-mxp",
    "altsoph/bert-base-ancientgreek-uncased": "grc-altsoph",

    # he
    "HeNLP/HeRo": "hero-roberta",
    "imvladikon/alephbertgimmel-base-512": "alephbertgimmel",
    "onlplab/alephbert-base": "alephbert",

    # hy
    "xlm-roberta-base": "xlm-roberta-base",

    # id
    "indolem/indobert-base-uncased":         "indobert",
    "indobenchmark/indobert-large-p1":       "indobenchmark-large-p1",
    "indobenchmark/indobert-base-p1":        "indobenchmark-base-p1",
    "indobenchmark/indobert-lite-large-p1":  "indobenchmark-lite-large-p1",
    "indobenchmark/indobert-lite-base-p1":   "indobenchmark-lite-base-p1",
    "indobenchmark/indobert-large-p2":       "indobenchmark-large-p2",
    "indobenchmark/indobert-base-p2":        "indobenchmark-base-p2",
    "indobenchmark/indobert-lite-large-p2":  "indobenchmark-lite-large-p2",
    "indobenchmark/indobert-lite-base-p2":   "indobenchmark-lite-base-p2",

    # it
    "dbmdz/electra-base-italian-xxl-cased-discriminator": "electra",

    # ja
    "rinna/japanese-roberta-base": "rinna-roberta",

    # mr
    "l3cube-pune/marathi-roberta": "l3cube-marathi-roberta",

    # pl
    "allegro/herbert-base-cased": "herbert",

    # pt
    "neuralmind/bert-large-portuguese-cased": "bertimbau",

    # ta: tamil
    "monsoon-nlp/tamillion":         "tamillion",
    "lgessler/microbert-tamil-m":    "ta-microbert-m",
    "lgessler/microbert-tamil-mxp":  "ta-microbert-mxp",
    "l3cube-pune/tamil-bert":        "l3cube-tamil-bert",
    "d42kw01f/Tamil-RoBERTa":        "ta-d42kw01f-roberta",

    # th
    "airesearch/wangchanberta-base-att-spm-uncased":   "wangchanberta",

    # tr
    "dbmdz/bert-base-turkish-128k-cased": "bert",

    # vi
    "vinai/phobert-base": "phobert-base",
    "vinai/phobert-large": "phobert-large",

    # zh
    "google-bert/bert-base-chinese": "google-bert-chinese",
    "hfl/chinese-bert-wwm": "hfl-bert-chinese",
    "hfl/chinese-macbert-large": "hfl-macbert-chinese",
    "hfl/chinese-roberta-wwm-ext": "hfl-roberta-chinese",
    "hfl/chinese-electra-180g-large-discriminator": "electra-large",
    "ShannonAI/ChineseBERT-base": "shannonai-chinese-bert",

    # multi-lingual Indic
    "ai4bharat/indic-bert": "indic-bert",
    "google/muril-base-cased": "muril-base-cased",
    "google/muril-large-cased": "muril-large-cased",

    # multi-lingual
    "FacebookAI/xlm-roberta-large": "xlm-roberta-large",
}
from stanza.resources.default_packages import known_nicknames


# unit tests
def test_basic_nicknames_inclusion():
    # Basic: Check that a few known nicknames are present
    codeflash_output = known_nicknames(); result = codeflash_output

def test_basic_transformer_inclusion():
    # Basic: "transformer" is always included
    codeflash_output = known_nicknames(); result = codeflash_output

def test_no_duplicates_in_nicknames():
    # Edge: No duplicate nicknames (except for "electra-large", which occurs for two models)
    codeflash_output = known_nicknames(); result = codeflash_output
    # All elements except "transformer" must be unique
    nicknames_no_transformer = [n for n in result if n != "transformer"]

def test_sorted_by_decreasing_length():
    # Basic: Nicknames are sorted by decreasing length
    codeflash_output = known_nicknames(); result = codeflash_output
    lengths = [len(n) for n in result]

def test_edge_case_empty_transformer_nicknames(monkeypatch):
    # Edge: If TRANSFORMER_NICKNAMES is empty, only "transformer" is returned
    monkeypatch.setitem(globals(), "TRANSFORMER_NICKNAMES", {})
    codeflash_output = known_nicknames(); result = codeflash_output

def test_edge_case_all_same_length(monkeypatch):
    # Edge: All nicknames of the same length, should preserve all and include "transformer"
    test_dict = {
        "a": "aaa",
        "b": "bbb",
        "c": "ccc"
    }
    monkeypatch.setitem(globals(), "TRANSFORMER_NICKNAMES", test_dict)
    codeflash_output = known_nicknames(); result = codeflash_output
    # All nicknames + "transformer"
    expected = sorted(["aaa", "bbb", "ccc", "transformer"], key=lambda x: -len(x))

def test_edge_case_transformer_nickname_collision(monkeypatch):
    # Edge: If a model's nickname is "transformer", there should be two "transformer"s in the result
    test_dict = {
        "a": "transformer"
    }
    monkeypatch.setitem(globals(), "TRANSFORMER_NICKNAMES", test_dict)
    codeflash_output = known_nicknames(); result = codeflash_output

def test_edge_case_non_ascii_nicknames(monkeypatch):
    # Edge: Non-ASCII nicknames are handled correctly
    test_dict = {
        "a": "трансформер",
        "b": "变形金刚",
        "c": "トランスフォーマー"
    }
    monkeypatch.setitem(globals(), "TRANSFORMER_NICKNAMES", test_dict)
    codeflash_output = known_nicknames(); result = codeflash_output
    # All nicknames + "transformer"
    expected = sorted(["трансформер", "变形金刚", "トランスフォーマー", "transformer"], key=lambda x: -len(x))

def test_large_scale(monkeypatch):
    # Large scale: 1000 unique nicknames, plus "transformer"
    test_dict = {f"model{i}": f"nickname_{i}" for i in range(1000)}
    monkeypatch.setitem(globals(), "TRANSFORMER_NICKNAMES", test_dict)
    codeflash_output = known_nicknames(); result = codeflash_output
    # All nicknames are present
    for i in range(1000):
        pass
    # Sorted by decreasing length
    lengths = [len(n) for n in result]

def test_large_scale_same_length(monkeypatch):
    # Large scale: 500 nicknames, all same length, plus "transformer"
    test_dict = {f"model{i}": f"n{i:03d}" for i in range(500)}
    monkeypatch.setitem(globals(), "TRANSFORMER_NICKNAMES", test_dict)
    codeflash_output = known_nicknames(); result = codeflash_output
    # All nicknames are present
    for i in range(500):
        pass
    # Sorted by decreasing length (all 4, then "transformer" (11))
    lengths = [len(n) for n in result]

def test_large_scale_duplicate_nicknames(monkeypatch):
    # Large scale: 1000 models, but only 10 unique nicknames, plus "transformer"
    test_dict = {f"model{i}": f"nick{i%10}" for i in range(1000)}
    monkeypatch.setitem(globals(), "TRANSFORMER_NICKNAMES", test_dict)
    codeflash_output = known_nicknames(); result = codeflash_output
    # Only 10 unique nicknames + "transformer"
    expected = sorted(list({f"nick{i}" for i in range(10)}) + ["transformer"], key=lambda x: -len(x))
    # Sorted by decreasing length
    lengths = [len(n) for n in result]

def test_mutation_detection(monkeypatch):
    # Mutation: If sorting is ascending, test fails
    test_dict = {"a": "short", "b": "mediumlen", "c": "loooooooooong"}
    monkeypatch.setitem(globals(), "TRANSFORMER_NICKNAMES", test_dict)
    codeflash_output = known_nicknames(); result = codeflash_output
    # Should be sorted by decreasing length
    expected = sorted(["short", "mediumlen", "loooooooooong", "transformer"], key=lambda x: -len(x))

def test_mutation_detection_missing_transformer(monkeypatch):
    # Mutation: If "transformer" is omitted, test fails
    test_dict = {"a": "x"}
    monkeypatch.setitem(globals(), "TRANSFORMER_NICKNAMES", test_dict)
    codeflash_output = known_nicknames(); result = codeflash_output

def test_mutation_detection_wrong_nicknames(monkeypatch):
    # Mutation: If function returns keys not values, test fails
    test_dict = {"a": "b", "c": "d"}
    monkeypatch.setitem(globals(), "TRANSFORMER_NICKNAMES", test_dict)
    codeflash_output = known_nicknames(); result = codeflash_output

def test_mutation_detection_wrong_sort(monkeypatch):
    # Mutation: If sorting is not by length, test fails
    test_dict = {"a": "x", "b": "yyyyy", "c": "zz"}
    monkeypatch.setitem(globals(), "TRANSFORMER_NICKNAMES", test_dict)
    codeflash_output = known_nicknames(); result = codeflash_output
    expected = sorted(["x", "yyyyy", "zz", "transformer"], key=lambda x: -len(x))
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes git checkout codeflash/optimize-known_nicknames-mh4hd6mh and push.

Codeflash

The optimized code achieves a 113% speedup through two key improvements:

**1. Efficient Dictionary Value Extraction**
- Original: `list(value for key, value in TRANSFORMER_NICKNAMES.items())` creates a generator expression that iterates over key-value pairs, discarding keys
- Optimized: `list(TRANSFORMER_NICKNAMES.values())` directly extracts dictionary values without creating unnecessary key-value tuples
- This eliminates the overhead of tuple creation and unpacking for each dictionary entry

**2. In-Place Sorting vs. Creating New Sorted List**
- Original: `sorted(nicknames, key=lambda x: -len(x))` creates a new list and uses a lambda function to negate lengths
- Optimized: `nicknames.sort(key=len, reverse=True)` sorts the existing list in-place using the built-in `len` function with `reverse=True`
- In-place sorting avoids memory allocation for a new list and eliminates the lambda function overhead

The line profiler confirms these improvements: the dictionary extraction time drops from 651,272ns to 69,467ns (89% faster), and the sorting time decreases from 700,842ns to 170,034ns (76% faster).

These optimizations are particularly effective for the typical use case with ~70 transformer nicknames in the dictionary, and scale well for larger datasets as shown in the test cases with 1000+ nicknames. The optimizations maintain identical functionality while being more memory-efficient and CPU-friendly.
@codeflash-ai codeflash-ai bot requested a review from mashraf-222 October 24, 2025 06:39
@codeflash-ai codeflash-ai bot added the ⚡️ codeflash Optimization PR opened by Codeflash AI label Oct 24, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

⚡️ codeflash Optimization PR opened by Codeflash AI

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant