[format check] Misleading scientific, engineering, or underscore grouping notations in numbers#10425
Conversation
Codecov Report ✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## main #10425 +/- ##
==========================================
+ Coverage 96.19% 96.23% +0.04%
==========================================
Files 178 178
Lines 19683 19896 +213
==========================================
+ Hits 18934 19147 +213
Misses 749 749
Force-pushed from fdf5bc1 to 18b00a7
DanielNoord left a comment:
Seems useful, but would probably make it an extension? Or perhaps first have it as an extension to get feedback?
+1 for the idea
Force-pushed from f3ff9d4 to 1db9508
Force-pushed from 1db9508 to 77a3d6b
Force-pushed from cbe8b6a to 2886dc7
Force-pushed from 8ba51e0 to 35898de
Force-pushed from afc04e3 to dc7592d
Force-pushed from dc7592d to bad3e82
Python's tokenizer always produces NUMBER tokens without the minus sign (the `-` is a separate OP token), so _group_right is never called with a negative string. Remove the unreachable branch.
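This is easy to confirm with the stdlib tokenizer; a minimal demonstration (not part of the PR's code):

```python
import io
import tokenize

# The minus sign arrives as a separate OP token, so the NUMBER token
# string itself is never negative.
toks = [
    (tokenize.tok_name[tok.type], tok.string)
    for tok in tokenize.generate_tokens(io.StringIO("x = -1.5\n").readline)
    if tok.type in (tokenize.OP, tokenize.NUMBER)
]
assert ("OP", "-") in toks
assert ("NUMBER", "1.5") in toks
```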
Fold infinity and extreme exponent tests into test_to_another_standard_notation and merge both threshold tests into a single parametrized test.
Beyond 5 groups of 3 digits (e.g. 0.000_000_000_000_000_2), underscore notation is less readable than scientific notation. Fall back to scientific notation for 16+ digit expansions.
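A rough sketch of the cutoff; the helper name is illustrative, not the checker's actual code:

```python
def prefer_scientific(fraction: str) -> bool:
    # Beyond 5 groups of 3 digits (a 16+ digit expansion), underscore
    # grouping reads worse than scientific notation.
    return len(fraction.replace("_", "")) >= 16

assert prefer_scientific("000_000_000_000_000_2")   # 16 digits: suggest 2e-16
assert not prefer_scientific("000_000_000_000_002")  # 15 digits: keep underscores
```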
Use the original string from the source (e.g. '0.00002') instead of the float() representation (e.g. '2e-05') in error messages about invalid bases. This avoids confusing messages where the base itself looks like scientific notation.
When a number literal has more than 15 significant digits (the float64 guarantee), cap scientific/engineering suggestions to 15 sig figs and add decimal.Decimal("...") as an alternative suggestion. The precision warning is included in the message reason.
- Replace float-based _decimal_g_format with pure Decimal formatting to avoid introducing additional rounding in suggestions
- Add to_decimal_suggestion() to NumberFormatterHelper
- Add >15 sig figs test cases to unit tests
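The float64 limit behind this is easy to demonstrate; a small illustration of why decimal.Decimal is offered (example values assumed, not from the test suite):

```python
from decimal import Decimal

literal = "3.141592653589793238462643"  # 25 significant digits
# float64 only guarantees 15 significant digits; the tail is lost:
assert float(literal) == float("3.141592653589793")
# Decimal("...") preserves every digit of the source literal:
assert str(Decimal(literal)) == literal
```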
The underscore grouping was still showing all digits from str(float) (up to 17) while scientific/engineering were capped to 15. Now round the float to 15 significant digits before generating the underscore suggestion when the original literal exceeds float precision.
All callers guard against zero and always pass significands >= 1, making the `if not value` and `abs_val < 1` branches unreachable. Document assumptions as a commented-out assert.
Move the Decimal formatting of threshold and close-to-zero threshold strings from _check_bad_number_notation/_check_non_decimal_notation into open(), and skip all number-notation setup when the message is disabled.
Replace per-call rf-string pattern construction in _check_non_decimal_notation with module-level compiled patterns.
Compute the Decimal once and pass it to standardize() instead of having both the checker and standardize() parse the same string.
The PEP 515 validation regex rejected properly left-grouped fractional underscores like 1_234.567_89 and 123_456.123_456, then suggested them back as their own fix. Fix the regex to accept the fractional pattern \d{3}(_\d{3})*(_\d{1,2})? and require at least one integer digit so .123_456 (missing leading zero) is still caught.
Also improve the error message from "has underscores that are not delimiting packs of three digits" to "has non-standard underscore grouping", since the checker validates both integer and fractional parts with different grouping directions.
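A hedged sketch of the corrected grouping check; only the fractional pattern is quoted from the commit, the assembly around it is illustrative and assumes the check only runs on literals that contain underscores:

```python
import re

INT_PART = r"\d{1,3}(?:_\d{3})*"                      # right-grouped integer part
FRAC_PART = r"\d{3}(?:_\d{3})*(?:_\d{1,2})?|\d{1,2}"  # left-grouped fractional part
GROUPED = re.compile(rf"{INT_PART}(?:\.(?:{FRAC_PART}))?")

assert GROUPED.fullmatch("1_234.567_89")
assert GROUPED.fullmatch("123_456.123_456")
assert not GROUPED.fullmatch(".123_456")  # missing leading integer digit
assert not GROUPED.fullmatch("1_23")      # bad integer grouping
```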
The only caller always passes dec_number, so the None fallback was dead code that could not be reached in normal operation.
The bad-number-notation guard only skipped lowercase 'j', so uppercase complex literals like 1.5E3J fell through to _check_bad_number_notation and triggered float() to raise ValueError, which propagated as an astroid-error fatal crash for the whole file. Use str.endswith(('j', 'J')) to skip both forms.
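Both the crash and the fix are easy to reproduce in isolation:

```python
# Without the guard, the uppercase suffix reaches float() and raises:
crashed = False
try:
    float("1.5E3J")
except ValueError:
    crashed = True
assert crashed

# The fixed guard skips both suffix forms:
assert "1.5E3J".endswith(("j", "J"))
assert "2j".endswith(("j", "J"))
assert not "1.5e3".endswith(("j", "J"))
```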
Literals that overflow float (1.5e500) or underflow to zero (1e-1000) are value-fidelity bugs even when the form is valid scientific notation: the source reads as a tiny/huge number but Python silently evaluates it to 0.0 or math.inf. Detect these cases via a unified predicate (Decimal(str(float(dec_number))) != dec_number) and flag them with a dedicated reason message.
Also fix two related issues uncovered while refactoring:
- 1e-1000 was misclassified as "unconventional zero literal" with a destructive suggestion of '0.0'; applying the fix would silently change a tiny-but-nonzero literal into an actual zero. The zero check now uses dec_number == 0 (preserving the literal value) instead of the float-converted value.
- For overflow/underflow literals, the suggestion list dropped the form-canonical rebuilds (scientific, engineering, underscore), since they share the same runtime issue and don't help the user. Only math.inf/0.0 (runtime-equivalent) and decimal.Decimal (precision-preserving) are now suggested, giving the user two clear choices.
The standardize() helper is also simplified: it derives the float value from dec_number internally rather than taking both as parameters, which removes a redundant argument and makes dec_number the single source of truth for the literal.
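The unified predicate can be sketched standalone (the function name is illustrative):

```python
from decimal import Decimal

def float_loses_value(literal: str) -> bool:
    # True when round-tripping the literal through float changes its value.
    return Decimal(str(float(literal))) != Decimal(literal)

assert float_loses_value("1.5e500")   # overflows to inf
assert float_loses_value("1e-1000")   # underflows to 0.0
assert not float_loses_value("1.5e3")
```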
The zero-literal check in the float path already flagged unusual zero forms like '00.0' and '0e10', but the integer path had no equivalent check, so '00', '000', '0_0', etc. silently passed. Add an early branch in _check_non_decimal_notation that flags these with suggestion '0'. It is placed before the underscore-grouping check so '0_0' gets the clearer "unconventional zero" message instead of a misleading "underscores not grouping by 3" with suggestion '00'.
Limited to the decimal path: prefixed multi-zero literals like '0x00', '0b00000000', '0o000' can be intentional padding (byte alignment, permission masks, fixed-width) and aren't flagged.
Cleanup: dropped '0' from the float path's conventional-zero set, since that path is only reached when the literal contains '.', 'e' or 'E', making '0' unreachable there.
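An illustrative pattern for the new integer-path branch (the actual pattern in the PR may differ):

```python
import re

# Multi-character all-zero decimal literals: '00', '000', '0_0', ...
UNCONVENTIONAL_ZERO = re.compile(r"0[0_]+")

for literal in ("00", "000", "0_0"):
    assert UNCONVENTIONAL_ZERO.fullmatch(literal)  # flagged, suggest '0'
assert not UNCONVENTIONAL_ZERO.fullmatch("0")      # conventional zero
assert not UNCONVENTIONAL_ZERO.fullmatch("0x00")   # prefixed: may be padding
```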
The pattern was inlined in _check_bad_number_notation while the integer patterns next to it were already pre-compiled at module level. Hoist for consistency — re's internal cache means the perf delta is negligible, but the asymmetry made the inlined one read as somehow special when it isn't.
The 'this is a complex literal, not something we format' decision is a dispatch concern, so it belongs with the dispatcher. Pulling the endswith check into _check_number_notation lets process_tokens treat all NUMBER tokens uniformly.
The leftover ``# assert value and abs(value) >= 1`` comment was the function's actual precondition: the digit-counting ``len(str(int(abs)))`` only matches the intended precision when the integer part has at least one digit (i.e. abs >= 1). Callers (to_standard_*) always satisfy this because they compute the base as a tuple-shifted Decimal in [1, 10) or [1, 1000). Activate the assert so the invariant is documented and any future caller that breaks it fails loudly.
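The activated invariant, sketched in isolation (helper name illustrative):

```python
def int_digit_count(value: float) -> int:
    # Precondition from the commit: digit counting via len(str(int(...)))
    # only matches the intended precision for significands >= 1.
    assert value and abs(value) >= 1
    return len(str(int(abs(value))))

assert int_digit_count(9.87) == 1    # scientific significand in [1, 10)
assert int_digit_count(987.6) == 3   # engineering significand in [1, 1000)
```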
The number-notation-threshold help text describes the semantics with "greater than ... or smaller than the inverse ..."; the messages emitted at runtime said "bigger than" instead. Align the wording so the help text and the diagnostic phrase the same threshold the same way.
The original function carried 100+ lines, 13 returns, and a closure that captured ten variables; a too-many-locals/too-many-returns suppression was the only way to keep pylint happy. Each branch was hard to read in isolation and impossible to test on its own. Refactor:
- introduce a frozen _NumberContext NamedTuple bundling the per-literal state (line/start/string/value/dec_number/sig_figs/float_loses_value + the per-mode flags + has_exponent/has_underscore).
- replace the add_bad_notation_message closure with a method _emit_bad_notation that takes the context.
- split each branch into its own method:
  - _handle_zero_literal
  - _handle_plain_threshold
  - _handle_exponent_form (returns bool: emitted?)
  - _handle_underscore_form (returns bool: emitted?)
  - _handle_value_fidelity
- the dispatcher _check_bad_number_notation is now ~25 lines and reads top-to-bottom: zero literal → plain threshold → form check → fall through to value fidelity.
No behavior change; functional and unit tests are identical pre/post.
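A minimal sketch of the context bundle; the fields are taken from the commit message, and the real class carries more flags:

```python
from decimal import Decimal
from typing import NamedTuple

class _NumberContext(NamedTuple):
    line: int
    start: int
    string: str
    value: float
    dec_number: Decimal
    has_exponent: bool
    has_underscore: bool

ctx = _NumberContext(1, 0, "1.5e3", 1500.0, Decimal("1.5e3"), True, False)
assert ctx.has_exponent and not ctx.has_underscore
```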
The bulk of numeric literals in real code are plain numbers (no exponent, no underscore) within the allowed threshold band: values that the checker is going to skip anyway. The previous flow built a full Decimal, computed sig_figs and float_loses_value, and constructed a _NumberContext for every one of them before deciding to do nothing.
Hoist the threshold-band check ahead of the Decimal step so plain in-band literals exit immediately on the float value alone. Behavior is unchanged because the same predicate previously gated the early return inside the plain-threshold handler.
Measured against a synthetic file of 1000 plain literals, total pylint runtime drops from ~1.32s to ~1.05s (~25% faster on this workload). Decimal parsing was the dominant per-literal cost.
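The hoisted fast path can be sketched as follows (the threshold value and function name are assumptions):

```python
def plain_in_band(tok: str, threshold: float = 1e6) -> bool:
    # Plain literals (no exponent, no underscore) inside the threshold
    # band skip the expensive Decimal step entirely.
    if "e" in tok.lower() or "_" in tok:
        return False
    value = abs(float(tok))
    return 1 / threshold <= value <= threshold

assert plain_in_band("1234.5")          # fast exit: nothing to check
assert not plain_in_band("1_234.5")     # underscore form: full check
assert not plain_in_band("12345678.9")  # above threshold: full check
```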
The check now flags float-fidelity issues independently of notation form: literals that overflow to math.inf, underflow to 0.0, or exceed the 15 sig-fig float guarantee. Add a separate ``precision.py`` example pair for these cases (alongside the original ``notation.py``), and a paragraph in details.rst explaining when decimal.Decimal is suggested.
Add new C0330 bad-float-precision check for overflow/underflow/precision loss. Notation form and float-value fidelity are orthogonal concerns; users can now enable each independently.
Force-pushed from 8af0d5b to 5f0ac4b
Replace three mutually-exclusive boolean flags with an allowlist of accepted notation styles. Default `()` keeps the original behavior (any of scientific/engineering/underscore is fine); subsets like `scientific,engineering` let a project allow either form while still rejecting underscore. Reason wording for mismatches is picked from the most permissive remaining allowed style.
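The option semantics, sketched with the option plumbing omitted (names illustrative):

```python
def style_accepted(style: str, allowed: tuple[str, ...] = ()) -> bool:
    # Empty allowlist keeps the original behavior: any standard style passes.
    return not allowed or style in allowed

assert style_accepted("underscore")               # default: all styles fine
allowed = ("scientific", "engineering")
assert style_accepted("engineering", allowed)
assert not style_accepted("underscore", allowed)  # rejected by the subset
```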
🤖 Effect of this PR on checked open source code: effect on astropy: the following messages are now emitted.
(Comment truncated: GitHub allows only 65536 characters in a comment. Generated for commit 5f0ac4b.)
🤖 Effect of this PR on checked open source code: effect on astropy: the following messages are now emitted.
(Comment truncated: GitHub allows only 65536 characters in a comment. Generated for commit e3f419d.)
Type of Changes
Description
A little experiment to gather feedback with primers and reviews (still a draft).
The idea is to enforce normalized ways to write numbers above a threshold or close to 0 in absolute value:
Another idea could be to warn to use Python's Decimal for numbers with more than 15 digits of precision; not sure if it should be the same checker, though. We could also set/enforce a default precision for a project.