
Conversation

codeflash-ai bot commented Nov 5, 2025

📄 42% (1.42×) speedup for `strip_accents_text` in `spacy/lang/yo/lex_attrs.py`

⏱️ Runtime: 4.86 milliseconds → 3.42 milliseconds (best of 250 runs)

📝 Explanation and details

The optimization achieves a 42% speedup through two key micro-optimizations that reduce per-iteration lookup overhead in the Unicode processing loop:

What was optimized:

  1. Function lookup caching: the `unicodedata.normalize` and `unicodedata.category` lookups were moved out of the loop by binding them to local variables `normalize` and `category`.
  2. List comprehension conversion: the generator expression inside `"".join()` was changed to a list comprehension (both changes are sketched below).
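
A minimal before/after sketch of the pattern, reconstructed from the two points above (the exact change lives on the PR branch linked at the bottom):

```python
import unicodedata

# Before: unicodedata.category is resolved through the module on every
# character, and join() consumes a generator expression.
def strip_accents_text(text):
    return "".join(
        c
        for c in unicodedata.normalize("NFD", text)
        if unicodedata.category(c) != "Mn"  # Mn = nonspacing (combining) mark
    )

# After: both functions are bound to locals once, and join() receives
# a ready-made list.
def strip_accents_text_optimized(text):
    normalize = unicodedata.normalize
    category = unicodedata.category
    return "".join([c for c in normalize("NFD", text) if category(c) != "Mn"])
```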

Why this speeds up the code:

  • Reduced global lookups: In the original code, each iteration accessed unicodedata.normalize and unicodedata.category through global module attribute lookups. The optimized version performs these lookups once and stores them as local variables, which are faster to access in Python.
  • Improved join() performance: while both generators and list comprehensions work with join(), CPython's str.join() first materializes its argument into a sequence in order to size the output, so passing a ready-made list skips that internal conversion step (a quick timing sketch follows this list).
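
A quick way to observe both effects in isolation, using only the standard library; absolute numbers will vary by machine and CPython version:

```python
import timeit

setup = "import unicodedata; text = 'résumé café naïve ' * 200"

# Global attribute lookup on every character, generator fed to join()
original = (
    "''.join(c for c in unicodedata.normalize('NFD', text)"
    " if unicodedata.category(c) != 'Mn')"
)

# Cached local binding, list comprehension fed to join()
optimized = (
    "category = unicodedata.category; "
    "''.join([c for c in unicodedata.normalize('NFD', text)"
    " if category(c) != 'Mn'])"
)

print("original :", timeit.timeit(original, setup=setup, number=2000))
print("optimized:", timeit.timeit(optimized, setup=setup, number=2000))
```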

Performance impact based on test results:
The optimization shows consistent improvements across all test cases, with particularly strong gains on:

  • Large-scale operations (30-47% faster on large texts)
  • Text processing with many Unicode operations (20-40% improvement)
  • Basic accent removal tasks (5-20% faster)

Real-world benefits:
This function is part of spaCy's Yoruba language module (`spacy/lang/yo/lex_attrs.py`), where it supports lexical attribute handling. Text preprocessing often runs over large volumes of accent-heavy text, so these micro-optimizations compound. The improvements are most pronounced on longer texts and Unicode-heavy content, making the change particularly valuable for natural language processing workloads where accent stripping is applied repeatedly to large corpora.
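
For a concrete sense of the behavior being preserved: the function normalizes to NFD and drops nonspacing marks, so Yoruba tone diacritics are removed while base characters are kept.

```python
from spacy.lang.yo.lex_attrs import strip_accents_text

# Yoruba marks tone with diacritics; stripping the combining marks
# leaves only the base characters.
print(strip_accents_text("Yorùbá"))  # -> "Yoruba"
```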

Correctness verification report:

| Test | Status |
|------|--------|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 57 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests and Runtime
import unicodedata

# imports
import pytest  # used for our unit tests
from spacy.lang.yo.lex_attrs import strip_accents_text

# unit tests

# ---------------------------
# Basic Test Cases
# ---------------------------

def test_basic_ascii():
    # ASCII text should remain unchanged
    codeflash_output = strip_accents_text("hello world") # 3.21μs -> 2.79μs (14.9% faster)

def test_basic_accented():
    # Common accented characters
    codeflash_output = strip_accents_text("café") # 2.96μs -> 2.75μs (7.60% faster)
    codeflash_output = strip_accents_text("naïve") # 1.66μs -> 1.59μs (4.15% faster)
    codeflash_output = strip_accents_text("résumé") # 1.48μs -> 1.33μs (10.9% faster)

def test_basic_mixed():
    # Mixed accented and non-accented
    codeflash_output = strip_accents_text("fiancé and friend") # 3.81μs -> 3.32μs (14.8% faster)

def test_basic_uppercase():
    # Accented uppercase letters
    codeflash_output = strip_accents_text("ÉCOLE") # 2.70μs -> 2.48μs (9.05% faster)
    codeflash_output = strip_accents_text("À LA MODE") # 1.84μs -> 1.61μs (13.9% faster)

def test_basic_multiple_accents():
    # Multiple accents on one letter
    codeflash_output = strip_accents_text("Crème brûlée") # 3.46μs -> 3.12μs (10.8% faster)

# ---------------------------
# Edge Test Cases
# ---------------------------

def test_empty_string():
    # Empty string should return empty string
    codeflash_output = strip_accents_text("") # 1.18μs -> 1.09μs (8.52% faster)

def test_only_accents():
    # String with only combining marks
    # U+0301: COMBINING ACUTE ACCENT
    codeflash_output = strip_accents_text("\u0301\u0300\u0327") # 2.41μs -> 2.10μs (14.6% faster)

def test_non_latin_characters():
    # Non-latin scripts should not be affected unless they use combining marks
    # Cyrillic
    codeflash_output = strip_accents_text("Привет") # 2.53μs -> 2.35μs (7.66% faster)
    # Greek with accent
    codeflash_output = strip_accents_text("άέήίόύώ") # 2.63μs -> 2.50μs (5.20% faster)
    # Chinese
    codeflash_output = strip_accents_text("你好") # 1.00μs -> 892ns (12.7% faster)

def test_combined_characters():
    # Characters already in decomposed form
    text = "e\u0301"  # 'e' + combining acute accent
    codeflash_output = strip_accents_text(text) # 1.68μs -> 1.56μs (7.16% faster)

def test_special_symbols_and_numbers():
    # Numbers and symbols should not be changed
    codeflash_output = strip_accents_text("12345!@#$%") # 2.87μs -> 2.40μs (19.3% faster)

def test_mixed_whitespace():
    # Whitespace should not be affected
    codeflash_output = strip_accents_text("café \t\n résumé") # 4.02μs -> 3.63μs (10.8% faster)

def test_surrogate_pairs_and_emojis():
    # Emojis and surrogate pairs should not be changed
    codeflash_output = strip_accents_text("😀😁😂") # 2.23μs -> 2.15μs (3.82% faster)

def test_combining_marks_on_non_letters():
    # Combining marks applied to symbols
    text = "$\u0301"  # ' + combining acute accent
    codeflash_output = strip_accents_text(text) # 1.79μs -> 1.64μs (9.16% faster)

def test_edge_case_zalgo_text():
    # Zalgo text (letters with excessive combining marks); the marks are
    # reconstructed here, since the zero-width originals are easily lost
    zalgo = "".join(ch + "\u0301\u0327\u0300\u0316\u0353" for ch in "Zalgo")
    codeflash_output = strip_accents_text(zalgo) # 23.5μs -> 20.2μs (16.3% faster)

def test_non_string_input():
    # Should raise an error if input is not a string
    with pytest.raises(TypeError):
        strip_accents_text(123) # 1.84μs -> 2.02μs (9.10% slower)
    with pytest.raises(TypeError):
        strip_accents_text(None) # 1.01μs -> 1.07μs (5.32% slower)
    with pytest.raises(TypeError):
        strip_accents_text(['café']) # 679ns -> 705ns (3.69% slower)

# ---------------------------
# Large Scale Test Cases
# ---------------------------

def test_large_text_no_accents():
    # Large text with no accents
    text = "a" * 1000
    codeflash_output = strip_accents_text(text) # 68.2μs -> 48.8μs (39.8% faster)

def test_large_text_with_accents():
    # Large text with accents
    text = "é" * 1000
    codeflash_output = strip_accents_text(text) # 135μs -> 100μs (33.8% faster)

def test_large_mixed_text():
    # Large text with mixed accented and non-accented letters
    text = ("café " * 200).strip()  # 200 repetitions
    expected = ("cafe " * 200).strip()
    codeflash_output = strip_accents_text(text) # 87.6μs -> 63.0μs (39.0% faster)

def test_large_text_random_accents():
    # Large text with randomly placed accented letters
    base = "The résumé of José contains naïve ideas and café receipts. "
    text = base * 20  # 20 repetitions
    expected = "The resume of Jose contains naive ideas and cafe receipts. " * 20
    codeflash_output = strip_accents_text(text) # 96.4μs -> 69.6μs (38.5% faster)

def test_large_unicode_range():
    # All Latin-1 Supplement characters (U+00C0 to U+00FF)
    latin1_supp = ''.join(chr(i) for i in range(0xC0, 0x100))
    # Remove accents from all
    codeflash_output = strip_accents_text(latin1_supp); expected = codeflash_output # 12.2μs -> 10.2μs (20.3% faster)
    # Check that no combining marks remain
    for c in expected:
        assert unicodedata.category(c) != "Mn"
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
import unicodedata  # required for the function under test

# imports
import pytest  # used for our unit tests
from spacy.lang.yo.lex_attrs import strip_accents_text

# unit tests

# --------------------------
# Basic Test Cases
# --------------------------

def test_basic_ascii():
    # ASCII string should remain unchanged
    codeflash_output = strip_accents_text("hello world") # 2.67μs -> 2.31μs (15.4% faster)

def test_basic_accented():
    # Accented Latin characters should lose their accents
    codeflash_output = strip_accents_text("café") # 2.62μs -> 2.35μs (11.8% faster)
    codeflash_output = strip_accents_text("naïve") # 1.51μs -> 1.44μs (5.43% faster)
    codeflash_output = strip_accents_text("résumé") # 1.39μs -> 1.33μs (4.98% faster)
    codeflash_output = strip_accents_text("São Paulo") # 1.62μs -> 1.40μs (15.9% faster)

def test_basic_mixed():
    # Mixed accented and non-accented characters
    codeflash_output = strip_accents_text("Crème brûlée") # 3.43μs -> 2.98μs (15.0% faster)

def test_basic_uppercase_accented():
    # Uppercase accented characters
    codeflash_output = strip_accents_text("ÉCOLE") # 2.55μs -> 2.26μs (13.0% faster)

def test_basic_multiple_accents():
    # Multiple accents in a word
    codeflash_output = strip_accents_text("fiancée") # 2.66μs -> 2.44μs (8.88% faster)
    codeflash_output = strip_accents_text("touché") # 1.52μs -> 1.41μs (7.65% faster)

def test_basic_combining_marks():
    # Characters with combining marks (e.g. 'e' + combining acute accent)
    s = "e\u0301"  # 'e' + combining acute accent
    codeflash_output = strip_accents_text(s) # 1.71μs -> 1.48μs (15.5% faster)

# --------------------------
# Edge Test Cases
# --------------------------

def test_edge_empty_string():
    # Empty string should return empty string
    codeflash_output = strip_accents_text("") # 1.21μs -> 1.03μs (17.3% faster)

def test_edge_only_accents():
    # String with only combining marks (should return empty string)
    codeflash_output = strip_accents_text("\u0301\u0300\u0327") # 2.48μs -> 2.05μs (20.8% faster)

def test_edge_non_latin_script():
    # Non-Latin scripts should be preserved if not decomposed into combining marks
    # Cyrillic with accent (should remove accent)
    codeflash_output = strip_accents_text("й") # 2.20μs -> 2.00μs (10.4% faster)
    # Greek with tonos (should remove accent)
    codeflash_output = strip_accents_text("ά") # 959ns -> 817ns (17.4% faster)
    # Hebrew, Arabic, Chinese: should remain unchanged
    codeflash_output = strip_accents_text("שלום") # 1.44μs -> 1.34μs (6.69% faster)
    codeflash_output = strip_accents_text("مرحبا") # 1.04μs -> 992ns (5.04% faster)
    codeflash_output = strip_accents_text("你好") # 758ns -> 665ns (14.0% faster)

def test_edge_emojis_and_symbols():
    # Emojis and symbols should remain unchanged
    codeflash_output = strip_accents_text("😀👍🏽") # 2.24μs -> 2.01μs (11.2% faster)
    codeflash_output = strip_accents_text("©™®") # 1.41μs -> 1.42μs (0.777% slower)

def test_edge_whitespace_and_punctuation():
    # Whitespace and punctuation should remain unchanged
    codeflash_output = strip_accents_text(" \t\n!@#$%^&*()_+-=[]{};:'\",.<>/?") # 4.90μs -> 3.92μs (25.0% faster)

def test_edge_surrogate_pairs():
    # Surrogate pairs (e.g. rare Unicode symbols) should remain unchanged
    codeflash_output = strip_accents_text("𝄞") # 1.49μs -> 1.39μs (7.49% faster)

def test_edge_precomposed_vs_decomposed():
    # Precomposed and decomposed forms should yield same result
    precomposed = "é"
    decomposed = "e\u0301"
    codeflash_output = strip_accents_text(precomposed) # 2.24μs -> 2.00μs (12.3% faster)
    codeflash_output = strip_accents_text(decomposed) # 908ns -> 747ns (21.6% faster)


def test_edge_long_combining_sequence():
    # A single letter with many combining marks
    s = "a" + "\u0301\u0327\u0300\u031B"  # a with acute, cedilla, grave, horn
    codeflash_output = strip_accents_text(s) # 3.66μs -> 3.46μs (5.84% faster)

def test_edge_accented_nonspacing_marks():
    # Nonspacing marks on non-Latin letters
    s = "א" + "\u0301"  # Hebrew Alef + combining acute
    codeflash_output = strip_accents_text(s) # 1.98μs -> 1.83μs (7.68% faster)

# --------------------------
# Large Scale Test Cases
# --------------------------

def test_large_scale_long_string():
    # Large string with repeated accented characters
    s = "áéíóúñ" * 1000
    expected = "aeioun" * 1000
    codeflash_output = strip_accents_text(s) # 795μs -> 573μs (38.7% faster)

def test_large_scale_mixed_script():
    # Large string mixing Latin, Cyrillic, Greek, Chinese, and accented Latin
    latin = "aáeéiíoóuú" * 100
    cyrillic = "йцукен" * 100
    greek = "αάβγ" * 100
    chinese = "你好世界" * 100
    s = latin + cyrillic + greek + chinese
    expected = ("aaeiiioouu" * 100) + ("ицукен" * 100) + ("ααβγ" * 100) + ("你好世界" * 100)
    codeflash_output = strip_accents_text(s) # 236μs -> 174μs (35.2% faster)

def test_large_scale_all_unicode_blocks():
    # Test a string containing characters from many Unicode blocks
    blocks = [
        "áéíóú",  # Latin accented
        "йцукен",  # Cyrillic
        "αάβγ",    # Greek
        "你好世界", # Chinese
        "😀👍🏽",   # Emoji
        "𝄞",       # Musical symbol
        "مرحبا",   # Arabic
        "שלום",    # Hebrew
        "©™®",     # Symbols
    ]
    s = "".join(blocks) * 50
    expected = "".join([
        "aeiou",
        "ицукен",
        "ααβγ",
        "你好世界",
        "😀👍🏽",
        "𝄞",
        "مرحبا",
        "שלום",
        "©™®",
    ]) * 50
    codeflash_output = strip_accents_text(s) # 175μs -> 132μs (32.2% faster)

def test_large_scale_no_accents():
    # Large string with no accents, should be unchanged
    s = "The quick brown fox jumps over the lazy dog. " * 1000
    codeflash_output = strip_accents_text(s) # 3.08ms -> 2.10ms (46.7% faster)

def test_large_scale_only_combining_marks():
    # Large string of only combining marks, should return empty string
    s = "\u0301" * 1000  # combining acute accent
    codeflash_output = strip_accents_text(s) # 57.2μs -> 43.8μs (30.6% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, run `git checkout codeflash/optimize-strip_accents_text-mhmjqmvm` and push.

codeflash-ai bot requested a review from mashraf-222 on Nov 5, 2025 at 22:05
codeflash-ai bot added the labels ⚡️ codeflash (Optimization PR opened by Codeflash AI) and 🎯 Quality: High (Optimization Quality according to Codeflash) on Nov 5, 2025