⚡️ Speed up function `like_num` by 47% #10

codeflash-ai · 2025-11-05T21:58:13Z

📄 47% (0.47x) speedup for `like_num` in `spacy/lang/tn/lex_attrs.py`

⏱️ Runtime : 5.95 milliseconds → 4.06 milliseconds (best of 130 runs)

📝 Explanation and details

The optimized code achieves a 47% speedup (5.95 ms → 4.06 ms) through five optimizations:

1. Set-based membership lookups with lazy caching: The most impactful change converts the list membership checks against `_num_words` and `_ordinal_words` to set lookups. Sets provide O(1) average-case lookup time versus O(n) for lists. The sets are built lazily on the first call and cached on the function object, so later calls reuse them and the external behavior stays the same.

2. Conditional string processing: Instead of always calling `text.replace(",", "").replace(".", "")`, the optimized version first checks whether commas or periods exist using `',' in text or '.' in text`. This eliminates unnecessary string operations for the majority of inputs that don't contain these characters.

3. Improved leading character checks: Replaces `text.startswith(("+", "-", "±", "~"))` with `text and text[0] in "+-±~"`, which is faster for single-character checks and includes a safety check for empty strings.

4. Optimized fraction validation: Adds a preliminary `"/" in text` check before the more expensive `text.count("/") == 1` operation, providing an early exit for non-fractional inputs.

5. Enhanced ordinal suffix checking: Adds a length check (`len(text_lower) > 2`) before string slicing operations, avoiding unnecessary work on short strings.
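
Taken together, the five changes can be sketched as follows. This is a hypothetical reconstruction from the description above, not the PR's actual diff: the word lists are abbreviated, and the cache attribute names `_num_set` and `_ord_set` are assumptions made for illustration.

```python
# Hypothetical reconstruction; abbreviated word lists, not the full spaCy tables.
_num_words = ["lefela", "nngwe", "pedi", "tharo", "milione"]
_ordinal_words = ["ntlha", "bobedi", "boraro", "bone"]


def like_num(text):
    # (3) Strip one leading sign; `text and ...` guards the empty string.
    if text and text[0] in "+-±~":
        text = text[1:]
    # (2) Only rebuild the string when a separator is actually present.
    if "," in text or "." in text:
        text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    # (4) Cheap containment test before counting slashes.
    if "/" in text and text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    # (1) Build the sets once and cache them on the function object
    #     (attribute names are assumptions, not taken from the PR).
    try:
        num_set = like_num._num_set
        ord_set = like_num._ord_set
    except AttributeError:
        num_set = like_num._num_set = frozenset(_num_words)
        ord_set = like_num._ord_set = frozenset(_ordinal_words)
    text_lower = text.lower()
    if text_lower in num_set or text_lower in ord_set:
        return True
    # (5) Require at least one character before the "th" suffix.
    if len(text_lower) > 2 and text_lower.endswith("th"):
        return text_lower[:-2].isdigit()
    return False
```

Caching on the function object keeps the module-level lists untouched, so any other code that imports `_num_words` or `_ordinal_words` still sees lists.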

Performance impact by test case type:

- Integer recognition: 12-34% faster due to reduced overhead from conditional string processing
- Word lookups: 40-52% faster from O(1) set lookups versus O(n) list searches
- Invalid inputs: 46-80% faster from early exits and reduced processing
- Ordinal numbers: 16-76% faster from optimized suffix checking and set lookups
- Large-scale tests: Show consistent 30-50% improvements, demonstrating the optimizations scale well

The optimizations are particularly effective for workloads with many word-based number lookups and mixed input types, while maintaining identical functionality and correctness.
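
The word-lookup gains depend on how far a list scan has to go before it finds (or fails to find) a match. One rough way to check this locally is a `timeit` micro-benchmark. The `words` list below is a synthetic stand-in of about the same size as `_num_words`, and the timings will vary by machine and Python version, so no expected numbers are shown.

```python
import timeit

# Synthetic stand-in for a 36-entry word list (an assumption, not the spaCy data).
words = [f"word{i}" for i in range(36)]
word_set = frozenset(words)
probe = words[-1]      # last entry: the worst-case hit for a list scan
miss = "not-a-word"    # absent key: a list must scan every entry


def bench(container, key, number=200_000):
    # Time repeated membership tests against the given container.
    return timeit.timeit(lambda: key in container, number=number)


list_hit, set_hit = bench(words, probe), bench(word_set, probe)
list_miss, set_miss = bench(words, miss), bench(word_set, miss)
print(f"hit : list {list_hit:.4f}s  set {set_hit:.4f}s")
print(f"miss: list {list_miss:.4f}s  set {set_miss:.4f}s")
```

With lists this short the absolute difference per call is small, which is consistent with the sub-microsecond timings reported in the generated tests.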

✅ Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | ✅ 10889 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
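
As a minimal illustration of what such regression tests check, the sketch below runs a baseline and an optimized variant of one step (the leading-sign strip, using the two expressions quoted in the explanation above) over the same inputs and reports the first disagreement. The helper names `strip_sign_original`, `strip_sign_optimized`, and `first_mismatch` are hypothetical and are not part of Codeflash or spaCy.

```python
def strip_sign_original(text):
    # Baseline form, as quoted in the explanation.
    if text.startswith(("+", "-", "±", "~")):
        text = text[1:]
    return text


def strip_sign_optimized(text):
    # Optimized form, as quoted in the explanation.
    if text and text[0] in "+-±~":
        text = text[1:]
    return text


def first_mismatch(baseline, candidate, inputs):
    # Return the first input on which the two callables disagree, or None.
    for value in inputs:
        if baseline(value) != candidate(value):
            return value
    return None


samples = ["", "+", "-123", "±1/2", "~9/10", "123", "++123", " 123", "12+3"]
mismatch = first_mismatch(strip_sign_original, strip_sign_optimized, samples)
print("no mismatches" if mismatch is None else f"first mismatch: {mismatch!r}")
```

The comparison uses `is None` rather than truthiness because the empty string is itself a valid test input.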

🌀 Generated Regression Tests and Runtime

import pytest  # used for our unit tests
from spacy.lang.tn.lex_attrs import like_num

# function to test
_num_words = [
    "lefela",
    "nngwe",
    "pedi",
    "tharo",
    "nne",
    "tlhano",
    "thataro",
    "supa",
    "robedi",
    "robongwe",
    "lesome",
    "lesomenngwe",
    "lesomepedi",
    "sometharo",
    "somenne",
    "sometlhano",
    "somethataro",
    "somesupa",
    "somerobedi",
    "somerobongwe",
    "someamabedi",
    "someamararo",
    "someamane",
    "someamatlhano",
    "someamarataro",
    "someamasupa",
    "someamarobedi",
    "someamarobongwe",
    "lekgolo",
    "sekete",
    "milione",
    "bilione",
    "terilione",
    "kwatirilione",
    "gajillione",
    "bazillione",
]

_ordinal_words = [
    "ntlha",
    "bobedi",
    "boraro",
    "bone",
    "botlhano",
    "borataro",
    "bosupa",
    "borobedi ",
    "borobongwe",
    "bolesome",
    "bolesomengwe",
    "bolesomepedi",
    "bolesometharo",
    "bolesomenne",
    "bolesometlhano",
    "bolesomethataro",
    "bolesomesupa",
    "bolesomerobedi",
    "bolesomerobongwe",
    "somamabedi",
    "someamararo",
    "someamane",
    "someamatlhano",
    "someamarataro",
    "someamasupa",
    "someamarobedi",
    "someamarobongwe",
    "lekgolo",
    "sekete",
    "milione",
    "bilione",
    "terilione",
    "kwatirilione",
    "gajillione",
    "bazillione",
]

# unit tests

# --------------------------
# 1. Basic Test Cases
# --------------------------

def test_basic_integer():
    # Basic integer string
    codeflash_output = like_num("123") # 1.06μs -> 942ns (12.5% faster)
    # Integer with leading zeros
    codeflash_output = like_num("000123") # 515ns -> 429ns (20.0% faster)

def test_basic_fraction():
    # Simple fraction
    codeflash_output = like_num("1/2") # 1.56μs -> 1.52μs (2.70% faster)
    # Fraction with larger numbers
    codeflash_output = like_num("123/456") # 1.07μs -> 937ns (14.2% faster)

def test_basic_num_words():
    # Basic number words in _num_words
    codeflash_output = like_num("lefela") # 1.41μs -> 1.34μs (5.01% faster)
    codeflash_output = like_num("milione") # 1.05μs -> 687ns (52.3% faster)

def test_basic_ordinal_words():
    # Basic ordinal words in _ordinal_words
    codeflash_output = like_num("ntlha") # 1.52μs -> 1.24μs (22.0% faster)
    codeflash_output = like_num("bobedi") # 994ns -> 774ns (28.4% faster)

def test_basic_ordinal_th_suffix():
    # Numeric ordinals with 'th' suffix
    codeflash_output = like_num("1th") # 2.13μs -> 1.86μs (14.8% faster)
    codeflash_output = like_num("123th") # 1.45μs -> 1.10μs (31.3% faster)

def test_basic_negative_positive_signs():
    # Numbers with positive/negative/approximate signs
    codeflash_output = like_num("+123") # 1.10μs -> 891ns (23.7% faster)
    codeflash_output = like_num("-123") # 508ns -> 427ns (19.0% faster)
    codeflash_output = like_num("±123") # 549ns -> 626ns (12.3% slower)
    codeflash_output = like_num("~123") # 419ns -> 348ns (20.4% faster)

def test_basic_case_insensitivity():
    # Case insensitivity for number/ordinal words
    codeflash_output = like_num("LEFELA") # 1.36μs -> 1.21μs (12.3% faster)
    codeflash_output = like_num("Milione") # 1.05μs -> 670ns (56.6% faster)
    codeflash_output = like_num("NtLha") # 849ns -> 609ns (39.4% faster)

def test_basic_with_commas_and_periods():
    # Numbers with commas and periods
    codeflash_output = like_num("1,234") # 998ns -> 1.08μs (7.85% slower)
    codeflash_output = like_num("1.234") # 556ns -> 633ns (12.2% slower)
    codeflash_output = like_num("1,234,567") # 535ns -> 491ns (8.96% faster)
    codeflash_output = like_num("1.234.567") # 434ns -> 491ns (11.6% slower)

# --------------------------
# 2. Edge Test Cases
# --------------------------

def test_edge_empty_string():
    # Empty string should not be recognized as a number
    codeflash_output = like_num("") # 1.98μs -> 1.19μs (65.8% faster)

def test_edge_only_sign():
    # Only sign, no digits
    codeflash_output = like_num("+") # 2.04μs -> 1.46μs (39.4% faster)
    codeflash_output = like_num("-") # 1.03μs -> 604ns (71.4% faster)
    codeflash_output = like_num("±") # 903ns -> 520ns (73.7% faster)
    codeflash_output = like_num("~") # 820ns -> 416ns (97.1% faster)

def test_edge_non_digit_fraction():
    # Fraction with non-digit numerator/denominator
    codeflash_output = like_num("a/2") # 2.33μs -> 2.40μs (2.87% slower)
    codeflash_output = like_num("2/b") # 1.26μs -> 1.21μs (4.39% faster)
    codeflash_output = like_num("a/b") # 930ns -> 706ns (31.7% faster)
    codeflash_output = like_num("1/2/3") # 1.13μs -> 800ns (41.2% faster)

def test_edge_non_num_words():
    # Words not in _num_words or _ordinal_words
    codeflash_output = like_num("hello") # 1.77μs -> 1.47μs (20.6% faster)
    codeflash_output = like_num("number") # 1.29μs -> 973ns (32.8% faster)

def test_edge_mixed_alpha_numeric():
    # Mixed alphanumeric that isn't a valid number
    codeflash_output = like_num("123abc") # 1.94μs -> 1.46μs (32.8% faster)
    codeflash_output = like_num("abc123") # 1.10μs -> 655ns (68.1% faster)
    codeflash_output = like_num("12thabc") # 1.07μs -> 688ns (54.9% faster)

def test_edge_spaces():
    # Numbers with leading/trailing spaces should not be recognized
    codeflash_output = like_num(" 123") # 1.88μs -> 1.43μs (31.2% faster)
    codeflash_output = like_num("123 ") # 1.07μs -> 829ns (28.8% faster)
    codeflash_output = like_num(" lefela") # 1.10μs -> 700ns (56.6% faster)
    codeflash_output = like_num("lefela ") # 842ns -> 514ns (63.8% faster)

def test_edge_th_suffix_non_numeric():
    # 'th' suffix with non-numeric prefix
    codeflash_output = like_num("abcth") # 2.13μs -> 1.80μs (18.6% faster)

def test_edge_commas_periods_in_fraction():
    # Fraction with commas/periods inside
    codeflash_output = like_num("1,234/567") # 1.81μs -> 2.04μs (11.2% slower)
    codeflash_output = like_num("1234/5,678") # 1.09μs -> 1.10μs (1.18% slower)
    codeflash_output = like_num("1.234/567") # 806ns -> 816ns (1.23% slower)
    codeflash_output = like_num("1234/5.678") # 657ns -> 729ns (9.88% slower)

def test_edge_ordinal_with_space():
    # Ordinal words with extra spaces
    codeflash_output = like_num("bobedi ") # 2.02μs -> 1.46μs (38.1% faster)

def test_edge_long_word_not_in_list():
    # Long word not in num/ordinal lists
    codeflash_output = like_num("supermilione") # 2.07μs -> 1.51μs (36.9% faster)

def test_edge_zero_and_leading_zeros():
    # Zero and various representations
    codeflash_output = like_num("0") # 804ns -> 687ns (17.0% faster)
    codeflash_output = like_num("00") # 472ns -> 451ns (4.66% faster)

def test_edge_fraction_with_leading_zeros():
    codeflash_output = like_num("01/02") # 1.43μs -> 1.56μs (8.65% slower)

def test_edge_num_words_with_case_and_punctuation():
    # Number words with punctuation or case
    codeflash_output = like_num("LeFeLa") # 1.43μs -> 1.28μs (11.8% faster)
    codeflash_output = like_num("milione!") # 1.62μs -> 1.21μs (33.2% faster)
    codeflash_output = like_num("milione.") # 1.02μs -> 975ns (4.82% faster)

def test_edge_ordinal_words_with_case_and_punctuation():
    codeflash_output = like_num("NtLha") # 1.58μs -> 1.21μs (30.9% faster)
    codeflash_output = like_num("ntlha!") # 1.47μs -> 1.10μs (33.0% faster)
    codeflash_output = like_num("ntlha.") # 924ns -> 907ns (1.87% faster)

def test_edge_fraction_with_signs():
    codeflash_output = like_num("+1/2") # 1.74μs -> 1.76μs (1.36% slower)
    codeflash_output = like_num("-123/456") # 1.22μs -> 1.20μs (1.42% faster)
    codeflash_output = like_num("±123/456") # 913ns -> 930ns (1.83% slower)
    codeflash_output = like_num("~123/456") # 588ns -> 589ns (0.170% slower)

def test_edge_fraction_with_multiple_slashes():
    codeflash_output = like_num("1/2/3") # 1.90μs -> 1.73μs (9.34% faster)

def test_edge_th_suffix_with_signs():
    codeflash_output = like_num("+123th") # 2.42μs -> 1.97μs (22.9% faster)
    codeflash_output = like_num("-123th") # 1.35μs -> 885ns (52.3% faster)
    codeflash_output = like_num("±123th") # 1.18μs -> 843ns (39.6% faster)
    codeflash_output = like_num("~123th") # 960ns -> 564ns (70.2% faster)

def test_edge_th_suffix_with_leading_zeros():
    codeflash_output = like_num("0001th") # 2.31μs -> 1.76μs (30.9% faster)

def test_edge_fraction_with_leading_trailing_spaces():
    codeflash_output = like_num(" 1/2") # 2.22μs -> 2.18μs (1.88% faster)
    codeflash_output = like_num("1/2 ") # 1.54μs -> 1.30μs (18.8% faster)

def test_edge_num_word_with_leading_trailing_spaces():
    codeflash_output = like_num(" lefela") # 1.94μs -> 1.52μs (27.2% faster)
    codeflash_output = like_num("lefela ") # 1.01μs -> 711ns (42.1% faster)

def test_edge_ordinal_word_with_leading_trailing_spaces():
    codeflash_output = like_num(" nthla") # 1.91μs -> 1.39μs (36.7% faster)
    codeflash_output = like_num("nthla ") # 976ns -> 657ns (48.6% faster)

def test_edge_fraction_with_non_digit_and_sign():
    codeflash_output = like_num("+a/2") # 2.35μs -> 2.26μs (3.94% faster)
    codeflash_output = like_num("-2/b") # 1.37μs -> 1.34μs (2.01% faster)

# --------------------------
# 3. Large Scale Test Cases
# --------------------------

def test_large_scale_many_integers():
    # Test a range of numbers up to 999
    for i in range(1, 1000):
        codeflash_output = like_num(str(i)) # 248μs -> 185μs (34.4% faster)

def test_large_scale_many_fractions():
    # Test a 10x10 grid of fractions i/j with i, j in 1, 101, ..., 901
    for i in range(1, 1000, 100):
        for j in range(1, 1000, 100):
            codeflash_output = like_num(f"{i}/{j}")

def test_large_scale_num_words():
    # All number words in _num_words should be recognized
    for word in _num_words:
        codeflash_output = like_num(word) # 20.0μs -> 14.2μs (40.7% faster)
        codeflash_output = like_num(word.upper())  # Case insensitivity

def test_large_scale_ordinal_words():
    # All ordinal words in _ordinal_words should be recognized
    for word in _ordinal_words:
        codeflash_output = like_num(word.strip()) # 22.8μs -> 15.3μs (49.4% faster)
        codeflash_output = like_num(word.strip().upper())  # Case insensitivity

def test_large_scale_th_suffix():
    # Numbers with th suffix: 1th, 101th, ..., 901th
    for i in range(1, 1000, 100):
        codeflash_output = like_num(f"{i}th") # 10.1μs -> 6.96μs (45.5% faster)

def test_large_scale_negative_positive_signs():
    # Signed numbers: 1, 101, ..., 901 with each sign prefix
    for sign in ["+", "-", "±", "~"]:
        for i in range(1, 1000, 100):
            codeflash_output = like_num(f"{sign}{i}")

def test_large_scale_invalid_inputs():
    # Large number of invalid inputs
    for i in range(1, 1000, 100):
        codeflash_output = like_num(f"abc{i}") # 9.06μs -> 5.76μs (57.3% faster)
        codeflash_output = like_num(f"{i}abc")
        codeflash_output = like_num(f"{i}/abc") # 7.93μs -> 4.84μs (63.8% faster)
        codeflash_output = like_num(f"abc/{i}")

def test_large_scale_commas_periods():
    # Numbers with commas/periods in various positions
    for i in range(1, 1000, 100):
        s = f"{i:,}"  # i < 1000 here, so no thousands separator is produced
        codeflash_output = like_num(s) # 3.32μs -> 2.48μs (34.1% faster)
        s = f"{i}.{i}"
        codeflash_output = like_num(s)

def test_large_scale_fraction_with_commas_periods():
    # Fractions with commas/periods in numerator/denominator
    for i in range(1, 1000, 100):
        for j in range(1, 1000, 100):
            s = f"{i:,}/{j:,}"
            codeflash_output = like_num(s)
            s = f"{i}.{i}/{j}.{j}"
            codeflash_output = like_num(s)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
import pytest  # used for our unit tests
from spacy.lang.tn.lex_attrs import like_num

# function to test
_num_words = [
    "lefela",
    "nngwe",
    "pedi",
    "tharo",
    "nne",
    "tlhano",
    "thataro",
    "supa",
    "robedi",
    "robongwe",
    "lesome",
    "lesomenngwe",
    "lesomepedi",
    "sometharo",
    "somenne",
    "sometlhano",
    "somethataro",
    "somesupa",
    "somerobedi",
    "somerobongwe",
    "someamabedi",
    "someamararo",
    "someamane",
    "someamatlhano",
    "someamarataro",
    "someamasupa",
    "someamarobedi",
    "someamarobongwe",
    "lekgolo",
    "sekete",
    "milione",
    "bilione",
    "terilione",
    "kwatirilione",
    "gajillione",
    "bazillione",
]

_ordinal_words = [
    "ntlha",
    "bobedi",
    "boraro",
    "bone",
    "botlhano",
    "borataro",
    "bosupa",
    "borobedi ",
    "borobongwe",
    "bolesome",
    "bolesomengwe",
    "bolesomepedi",
    "bolesometharo",
    "bolesomenne",
    "bolesometlhano",
    "bolesomethataro",
    "bolesomesupa",
    "bolesomerobedi",
    "bolesomerobongwe",
    "somamabedi",
    "someamararo",
    "someamane",
    "someamatlhano",
    "someamarataro",
    "someamasupa",
    "someamarobedi",
    "someamarobongwe",
    "lekgolo",
    "sekete",
    "milione",
    "bilione",
    "terilione",
    "kwatirilione",
    "gajillione",
    "bazillione",
]

# unit tests

# -----------------------
# BASIC TEST CASES
# -----------------------

def test_basic_integers():
    # Simple positive integer
    codeflash_output = like_num("123") # 901ns -> 763ns (18.1% faster)
    # Simple zero
    codeflash_output = like_num("0") # 518ns -> 421ns (23.0% faster)
    # Simple negative integer with '-'
    codeflash_output = like_num("-456") # 638ns -> 513ns (24.4% faster)
    # Simple positive integer with '+'
    codeflash_output = like_num("+789") # 334ns -> 290ns (15.2% faster)
    # Integer with leading zeros
    codeflash_output = like_num("000123") # 433ns -> 339ns (27.7% faster)

def test_basic_fractions():
    # Simple fraction
    codeflash_output = like_num("3/4") # 1.53μs -> 1.46μs (4.58% faster)
    # Fraction with leading zeros
    codeflash_output = like_num("03/04") # 1.01μs -> 981ns (3.16% faster)
    # Fraction with negative sign
    codeflash_output = like_num("-5/6") # 838ns -> 763ns (9.83% faster)
    # Fraction with positive sign
    codeflash_output = like_num("+7/8") # 524ns -> 486ns (7.82% faster)
    # Fraction with ± sign
    codeflash_output = like_num("±1/2") # 697ns -> 758ns (8.05% slower)
    # Fraction with ~ sign
    codeflash_output = like_num("~9/10") # 811ns -> 803ns (0.996% faster)

def test_basic_formatted_numbers():
    # Number with commas
    codeflash_output = like_num("1,234") # 1.05μs -> 1.01μs (4.36% faster)
    # Number with periods (should be stripped)
    codeflash_output = like_num("1.234") # 564ns -> 593ns (4.89% slower)
    # Number with both commas and periods
    codeflash_output = like_num("1,234.567") # 593ns -> 568ns (4.40% faster)

def test_basic_num_words():
    # Lowercase number word
    codeflash_output = like_num("pedi") # 1.43μs -> 1.29μs (10.9% faster)
    # Uppercase number word
    codeflash_output = like_num("PEDI") # 605ns -> 485ns (24.7% faster)
    # Mixed case number word
    codeflash_output = like_num("Tharo") # 635ns -> 545ns (16.5% faster)

def test_basic_ordinal_words():
    # Lowercase ordinal word
    codeflash_output = like_num("bobedi") # 1.66μs -> 1.24μs (34.5% faster)
    # Uppercase ordinal word
    codeflash_output = like_num("BOBEDI") # 779ns -> 532ns (46.4% faster)
    # Mixed case ordinal word
    codeflash_output = like_num("Bone") # 858ns -> 668ns (28.4% faster)

def test_basic_ordinal_suffix():
    # Numeric ordinal with 'th'
    codeflash_output = like_num("5th") # 2.12μs -> 1.82μs (16.5% faster)
    # Numeric ordinal with leading zeros
    codeflash_output = like_num("0010th") # 1.54μs -> 1.19μs (30.2% faster)
    # Numeric ordinal with sign
    codeflash_output = like_num("+12th") # 1.45μs -> 949ns (53.0% faster)
    codeflash_output = like_num("-100th") # 1.17μs -> 835ns (40.2% faster)

# -----------------------
# EDGE TEST CASES
# -----------------------

def test_edge_empty_and_whitespace():
    # Empty string
    codeflash_output = like_num("") # 1.78μs -> 1.14μs (56.3% faster)
    # Whitespace only
    codeflash_output = like_num("   ") # 1.39μs -> 1.18μs (17.5% faster)

def test_edge_non_numeric_strings():
    # Random string
    codeflash_output = like_num("hello") # 1.78μs -> 1.52μs (16.7% faster)
    # String with numbers and letters
    codeflash_output = like_num("abc123") # 1.28μs -> 817ns (56.2% faster)
    # String with special characters
    codeflash_output = like_num("@#$%") # 984ns -> 640ns (53.8% faster)

def test_edge_invalid_fractions():
    # Fraction with non-digit numerator
    codeflash_output = like_num("a/2") # 2.07μs -> 2.13μs (2.95% slower)
    # Fraction with non-digit denominator
    codeflash_output = like_num("2/b") # 1.23μs -> 1.17μs (4.87% faster)
    # Fraction with both non-digit
    codeflash_output = like_num("x/y") # 1.02μs -> 721ns (41.1% faster)
    # Fraction with more than one slash
    codeflash_output = like_num("1/2/3") # 1.06μs -> 754ns (40.1% faster)
    # Fraction with missing denominator
    codeflash_output = like_num("5/") # 1.17μs -> 958ns (21.7% faster)
    # Fraction with missing numerator
    codeflash_output = like_num("/5") # 961ns -> 771ns (24.6% faster)

def test_edge_invalid_ordinal_suffix():
    # Non-digit prefix with 'th'
    codeflash_output = like_num("abcth") # 2.09μs -> 1.78μs (17.6% faster)
    # 'th' only
    codeflash_output = like_num("th") # 1.32μs -> 810ns (63.3% faster)
    # Only sign and 'th'
    codeflash_output = like_num("+th") # 1.21μs -> 732ns (65.6% faster)

def test_edge_signs_and_punctuation():
    # Only sign
    codeflash_output = like_num("+") # 1.93μs -> 1.41μs (37.6% faster)
    codeflash_output = like_num("-") # 1.07μs -> 609ns (76.2% faster)
    codeflash_output = like_num("±") # 907ns -> 525ns (72.8% faster)
    codeflash_output = like_num("~") # 814ns -> 416ns (95.7% faster)
    # Only punctuation
    codeflash_output = like_num(",") # 925ns -> 713ns (29.7% faster)
    codeflash_output = like_num(".") # 827ns -> 585ns (41.4% faster)
    # Sign with whitespace
    codeflash_output = like_num("+   ") # 1.25μs -> 1.08μs (15.4% faster)
    # Number with extra spaces
    codeflash_output = like_num("  123  ") # 1.03μs -> 720ns (43.6% faster)

def test_edge_case_sensitivity_and_trailing_spaces():
    # Number word with trailing space
    codeflash_output = like_num("pedi ") # 1.69μs -> 1.39μs (21.7% faster)
    # Ordinal word with trailing space (note: one entry in _ordinal_words has a trailing space)
    codeflash_output = like_num("borobedi ") # 1.05μs -> 910ns (15.7% faster)
    # Number word with leading space
    codeflash_output = like_num(" pedi") # 866ns -> 582ns (48.8% faster)

def test_edge_large_numbers_and_words():
    # Large numbers as digits
    codeflash_output = like_num("999999999") # 805ns -> 712ns (13.1% faster)
    # Large number word
    codeflash_output = like_num("milione") # 1.21μs -> 873ns (38.7% faster)
    # Large ordinal word
    codeflash_output = like_num("bazillione") # 855ns -> 561ns (52.4% faster)

def test_edge_punctuation_inside_words():
    # Number word with punctuation
    codeflash_output = like_num("pedi!") # 1.62μs -> 1.40μs (15.8% faster)
    # Ordinal word with punctuation
    codeflash_output = like_num("bobedi.") # 1.30μs -> 1.19μs (9.34% faster)
    # Ordinal word with internal punctuation
    codeflash_output = like_num("bo,bedi") # 878ns -> 704ns (24.7% faster)

def test_edge_malformed_numbers():
    # Number with multiple signs
    codeflash_output = like_num("++123") # 1.94μs -> 1.59μs (22.0% faster)
    # Number with sign in the middle
    codeflash_output = like_num("12+3") # 1.24μs -> 970ns (27.3% faster)
    # Number with sign at the end
    codeflash_output = like_num("123-") # 856ns -> 558ns (53.4% faster)

# -----------------------
# LARGE SCALE TEST CASES
# -----------------------

def test_large_many_numbers_and_fractions():
    # Test a batch of 1000 valid numbers
    for i in range(1000):
        codeflash_output = like_num(str(i)) # 249μs -> 185μs (34.2% faster)
    # Test a batch of 1000 valid fractions
    for i in range(1, 1001):
        codeflash_output = like_num(f"{i}/{i+1}") # 424μs -> 374μs (13.4% faster)

def test_large_many_invalid_strings():
    # Test a batch of 1000 invalid strings
    for i in range(1000):
        codeflash_output = like_num(f"foo{i}") # 722μs -> 401μs (79.7% faster)

def test_large_num_words():
    # All _num_words should be recognized (case-insensitive)
    for word in _num_words:
        codeflash_output = like_num(word) # 20.2μs -> 14.2μs (42.1% faster)
        codeflash_output = like_num(word.upper())
        codeflash_output = like_num(word.capitalize()) # 17.4μs -> 11.4μs (52.8% faster)

def test_large_ordinal_words():
    # All _ordinal_words should be recognized (case-insensitive)
    for word in _ordinal_words:
        codeflash_output = like_num(word) # 22.8μs -> 14.9μs (52.5% faster)
        codeflash_output = like_num(word.upper())
        codeflash_output = like_num(word.capitalize()) # 20.3μs -> 11.8μs (72.4% faster)

def test_large_numeric_ordinals():
    # Test ordinals from 1th to 999th
    for i in range(1, 1000):
        codeflash_output = like_num(f"{i}th") # 814μs -> 462μs (76.0% faster)

def test_large_formatted_numbers():
    # Test plain and separator-formatted renderings of 1..999
    for i in range(1, 1000):
        num = f"{i:,}"  # i < 1000, so no thousands separator appears
        codeflash_output = like_num(num) # 248μs -> 185μs (33.6% faster)
        num_dot = f"{i:,}".replace(",", ".")  # same string as num here: there is no comma to replace
        codeflash_output = like_num(num_dot)

def test_large_edge_invalid_fractions():
    # Fractions with invalid parts
    for i in range(1, 1000):
        codeflash_output = like_num(f"foo/{i}") # 833μs -> 567μs (46.9% faster)
        codeflash_output = like_num(f"{i}/bar")
        codeflash_output = like_num(f"{i}/{i}/{i}") # 854μs -> 585μs (46.1% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, run `git checkout codeflash/optimize-like_num-mhmjhccy` and push.


codeflash-ai bot requested a review from mashraf-222 November 5, 2025 21:58

codeflash-ai bot added the labels `⚡️ codeflash` (Optimization PR opened by Codeflash AI) and `🎯 Quality: High` (Optimization Quality according to Codeflash) on Nov 5, 2025
