-
Notifications
You must be signed in to change notification settings - Fork 370
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Gensym and mangle updates #1651
Conversation
Make the hyx_ prefix just another X-quoted name, XhyX (mapped to ""), which only appears when required. Remove the trailing ? to leading `is_` special case.
The new format doesn't need them.
And fix remaining gensym tests.
#1634 should already be fixed as of 498a54e. I'm not a fan of removing I think a private Unicode character is a better choice than ASCII NUL so that the user can type it, if desired, without weird stuff happening like Unix tools mistaking the source code for binary. |
498a54e only fixes it for Hy's internal use though. Anything written in Hy will have to re-implement the same workaround. This would remove the need for the workaround altogether.
It's not just Pytest that does this kind of thing. The standard library cmd module, for example, gives a special meaning to the
Would |
See #1634 (comment). I would want to avoid situations where whether an identifier is mangled (in the narrow sense of having an escaped Python-illegal character) isn't easily discernible, or where an identifier that looks as if it were mangled were generated by something other than Hy. That's the value of an explicit prefix. I don't think U+FFFF is the best choice because it's designated as a non-character, so data that isn't Unicode can be more readily identified. I'd recommend instead a Unicode private-use character such as U+E000 or U+EFAF. This is the sort of thing the Private Use Areas are for, I think. |
That's not what "non-character" means. (At least not according to the current standard.) "Private-use" characters can have nonstandard, but widespread, interpretations, e.g. extra Chinese characters. Cooperating parties may have documented interpretations of such private characters. Hy should avoid taking them to avoid interfering with such use. Noncharacters, on the other hand, are just like private-use characters (they're still valid in Unicode strings) but the interpretation of them is meant for internal application use, rather than some shared nonstandard interpretation. And in fact, our old keyword implementation used a noncharacter. Thus a noncharacter seems more appropriate for gensyms. I think we can call the gensym format an implementation detail that is subject to change even after Hy 1.0, so we shouldn't worry to much about picking the wrong character if it turns out we have to change it later. |
Thanks for the link. This is a more complex issue than I realized, and I think Wikipedia misled me. Yeah, U+FFFF or another noncharacter seems fine. |
I mostly agree with this concern, but I think the X-quotes are sufficient to make a mangled symbol distinguishable at a glance and the further restriction that they must contain a valid name make them pretty resistant to accidents in software. Standard Python style uses Prefixed symbols are not immune to collisions, they just make them unlikely, but X-quotes already do that. Adding the prefix on top of that doesn't seems to help much, but it does cause the aforementioned compatibility problems. Snake case (or lower-case) and uppercase styles are entirely immune to collisions with X-quotes since mangled symbols always use both cases. Camel case could possibly cause a collision if two words in a row start with X, but this also requires the part between the X's to be a valid symbol name. Not likely. The risk is admittedly higher with the shorter punctuation names, but probably still too low to be concerned with. We could also eliminate that risk without resorting to an identifier prefix, for example, by making the quotes Generators are also resistant. Other Python AST generators I've heard of avoid making valid Python identifiers when they don't want collisions, since AST doesn't require it. Our gensyms used to work this way. String generators will likely already be using their own prefix, or at least one of the three standard styles. |
I see where you're coming from. You make some good points. Still, mangling seems weird enough that I would prefer an explicit prefix and a fail-noisy |
@gilch, where would you like to go from here? Do you want to undo the things I objected to, try to get other approvals to overcome my veto, or close the PR? |
I'm closing this for lack of response, but I think I'll reuse your changes to the gensym format for a new PR. |
Closes #1577
Closes #1531
Fixes #1634
It doesn't have NEWS or docs yet. This update will require clearing your
.pyc
s. Tests have been updated, but there might be a few more edge cases we'd want tested.Addressing a recent example #1648 (comment)
This changes the mangled output of
from
on the current master, to
which is still not quite as pretty as the gensyms before the mangling makeover, but it's close. And at least it's still valid Python.
Mangled symbols in general are prettier now. The
hyx_
prefix is now just another X-quoted name,XhyX
(mapped to ""), which only appears when it's actually required--when mangling a Python keyword--like(mangle "for")
toXhyXfor
--or something starting with an invalid character that doesn't otherwise need X-quotes--like(mangle "42")
toXhyX42
This simplifies the logic for handling leading underscores and makes mangled symbols more readable generally.
unmangle
can now detect a mangled symbol by an even number of X's that have valid symbol names inside, which is not likely to happen accidentally.This also removes the trailing
?
to leadingis_
special case, which makes the mangling rules more consistent. It wasn't that useful for using Python's predicate functions, which aren't named that consistently anyway. (E.g. string functions, like.islower
, andhasattr
etc.) And it interfered with pytest collection and anything else using similar identifier prefix logic, which isn't that unusual in Python. But, predicates meant for export to Python will now have to explicitly sayis-foo
instead offoo?
.Common ASCII punctuation likely to actually appear in Hy symbols now have shorter mangled names via a secondary lookup table. This changes
->>
to_XgtXXgtX
instead of the rather long-winded_hyx_XgreaterHthan_signXXgreaterHthan_signX
that we have now.Gensyms now have an embedded ASCII NUL as the invalid character (instead of the semicolon), which mangles to
X__X
, so that it can also be used to separate the gensym name from its number suffix (instead of the|
). NUL obviously should not be allowed in normal Hy symbols, although I haven't tested this. (Another option would have been a unicode noncharacter like we had for keywords. I'd suggest'\uffff'
, which is easy to recognize.) The gensym number suffix is now in hexadecimal and starts with two digits instead of four.