Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Gensym and mangle updates #1651

Closed
wants to merge 12 commits into from
Closed

Gensym and mangle updates #1651

wants to merge 12 commits into from

Conversation

gilch
Copy link
Member

@gilch gilch commented Jul 2, 2018

Closes #1577
Closes #1531
Fixes #1634

It doesn't have NEWS or docs yet. This update will require clearing your .pycs. Tests have been updated, but there might be a few more edge cases we'd want tested.

Addressing a recent example #1648 (comment)

This changes the mangled output of

(let [a 1 b 2]
  (do
    (print a b)
    (let [b 3 c 4]
      (print a b c))
    (print a b)))

from

_hyx_XsemicolonXletXvertical_lineX1237 = {}
_hyx_XsemicolonXletXvertical_lineX1237['a'] = 1
_hyx_XsemicolonXletXvertical_lineX1237['b'] = 2
print(_hyx_XsemicolonXletXvertical_lineX1237['a'],
    _hyx_XsemicolonXletXvertical_lineX1237['b'])
_hyx_XsemicolonXletXvertical_lineX1238 = {}
_hyx_XsemicolonXletXvertical_lineX1238['b'] = 3
_hyx_XsemicolonXletXvertical_lineX1238['c'] = 4
print(_hyx_XsemicolonXletXvertical_lineX1237['a'],
    _hyx_XsemicolonXletXvertical_lineX1238['b'],
    _hyx_XsemicolonXletXvertical_lineX1238['c'])
print(_hyx_XsemicolonXletXvertical_lineX1237['a'],
    _hyx_XsemicolonXletXvertical_lineX1237['b'])

on the current master, to

_letX__X10 = {}
_letX__X10['a'] = 1
_letX__X10['b'] = 2
print(_letX__X10['a'], _letX__X10['b'])
_letX__X11 = {}
_letX__X11['b'] = 3
_letX__X11['c'] = 4
print(_letX__X10['a'], _letX__X11['b'], _letX__X11['c'])
print(_letX__X10['a'], _letX__X10['b'])

which is still not quite as pretty as the gensyms before the mangling makeover, but it's close. And at least it's still valid Python.

Mangled symbols in general are prettier now. The hyx_ prefix is now just another X-quoted name,
XhyX (mapped to ""), which only appears when it's actually required--when mangling a Python keyword--like (mangle "for") to XhyXfor--or something starting with an invalid character that doesn't otherwise need X-quotes--like (mangle "42") to XhyX42

This simplifies the logic for handling leading underscores and makes mangled symbols more readable generally. unmangle can now detect a mangled symbol by an even number of X's that have valid symbol names inside, which is not likely to happen accidentally.

This also removes the trailing ? to leading is_ special case, which makes the mangling rules more consistent. It wasn't that useful for using Python's predicate functions, which aren't named that consistently anyway. (E.g. string functions, like .islower, and hasattr etc.) And it interfered with pytest collection and anything else using similar identifier prefix logic, which isn't that unusual in Python. But, predicates meant for export to Python will now have to explicitly say is-foo instead of foo?.

Common ASCII punctuation likely to actually appear in Hy symbols now have shorter mangled names via a secondary lookup table. This changes ->> to _XgtXXgtX instead of the rather long-winded
_hyx_XgreaterHthan_signXXgreaterHthan_signX that we have now.

Gensyms now have an embedded ASCII NUL as the invalid character (instead of the semicolon), which mangles to X__X, so that it can also be used to separate the gensym name from its number suffix (instead of the |). NUL obviously should not be allowed in normal Hy symbols, although I haven't tested this. (Another option would have been a unicode noncharacter like we had for keywords. I'd suggest '\uffff', which is easy to recognize.) The gensym number suffix is now in hexadecimal and starts with two digits instead of four.

@Kodiologist
Copy link
Member

Kodiologist commented Jul 2, 2018

#1634 should already be fixed as of 498a54e.

I'm not a fan of removing is_ or removing the mangling prefix from most mangled names.

I think a private Unicode character is a better choice than ASCII NUL so that the user can type it, if desired, without weird stuff happening like Unix tools mistaking the source code for binary.

@gilch
Copy link
Member Author

gilch commented Jul 4, 2018

#1634 should already be fixed as of 498a54e.

498a54e only fixes it for Hy's internal use though. Anything written in Hy will have to re-implement the same workaround. This would remove the need for the workaround altogether.

I'm not a fan of removing is_ or removing the mangling prefix from most mangled names.

It's not just Pytest that does this kind of thing. The standard library cmd module, for example, gives a special meaning to the do_ prefix in method names. Both hyx_ and is_ will interfere with these. And if the standard library does it, you can bet the same technique is used elsewhere in the Python ecosystem (Pytest itself being one example). We can't predict what prefixes will be significant to special case them. (Special cases aren't special enough to break the rules.) I consider Python compatibility a Hy Priority. So it's better if Hy's mangling doesn't use prefixes at all.

I think a private Unicode character is a better choice than ASCII NUL so that the user can type it, if desired, without weird stuff happening like Unix tools mistaking the source code for binary.

Would \uffff work better then? It wouldn't be too difficult to change that. But I expect the user will be typing either X__X or \x00, not a literal NUL. What exactly were you imagining here?

@gilch gilch requested review from vodik and refi64 July 4, 2018 20:48
@Kodiologist
Copy link
Member

Kodiologist commented Jul 4, 2018

498a54e only fixes it for Hy's internal use though.

See #1634 (comment).

I would want to avoid situations where whether an identifier is mangled (in the narrow sense of having an escaped Python-illegal character) isn't easily discernible, or where an identifier that looks as if it were mangled were generated by something other than Hy. That's the value of an explicit prefix.

I don't think U+FFFF is the best choice because it's designated as a non-character, so data that isn't Unicode can be more readily identified. I'd recommend instead a Unicode private-use character such as U+E000 or U+EFAF. This is the sort of thing the Private Use Areas are for, I think.

@gilch
Copy link
Member Author

gilch commented Jul 4, 2018

I don't think U+FFFF is the best choice because it's designated as a non-character, so data that isn't Unicode can be more readily identified.

That's not what "non-character" means. (At least not according to the current standard.) "Private-use" characters can have nonstandard, but widespread, interpretations, e.g. extra Chinese characters. Cooperating parties may have documented interpretations of such private characters. Hy should avoid taking them to avoid interfering with such use.

Noncharacters, on the other hand, are just like private-use characters (they're still valid in Unicode strings) but the interpretation of them is meant for internal application use, rather than some shared nonstandard interpretation.

And in fact, our old keyword implementation used a noncharacter. Thus a noncharacter seems more appropriate for gensyms.

I think we can call the gensym format an implementation detail that is subject to change even after Hy 1.0, so we shouldn't worry to much about picking the wrong character if it turns out we have to change it later.

@Kodiologist
Copy link
Member

Thanks for the link. This is a more complex issue than I realized, and I think Wikipedia misled me. Yeah, U+FFFF or another noncharacter seems fine.

@gilch
Copy link
Member Author

gilch commented Jul 5, 2018

I would want to avoid situations where whether an identifier is mangled (in the narrow sense of having an escaped Python-illegal character) isn't easily discernible, or where an identifier that looks as if it were mangled were generated by something other than Hy. That's the value of an explicit prefix.

I mostly agree with this concern, but I think the X-quotes are sufficient to make a mangled symbol distinguishable at a glance and the further restriction that they must contain a valid name make them pretty resistant to accidents in software. Standard Python style uses snake_case, UPPER_CASE, or CamelCase for identifiers. Mangled symbols don't look like any of these, even without the prefix.

Prefixed symbols are not immune to collisions, they just make them unlikely, but X-quotes already do that. Adding the prefix on top of that doesn't seems to help much, but it does cause the aforementioned compatibility problems. Snake case (or lower-case) and uppercase styles are entirely immune to collisions with X-quotes since mangled symbols always use both cases. Camel case could possibly cause a collision if two words in a row start with X, but this also requires the part between the X's to be a valid symbol name. Not likely. The risk is admittedly higher with the shorter punctuation names, but probably still too low to be concerned with. We could also eliminate that risk without resorting to an identifier prefix, for example, by making the quotes X_/_X instead of X/X, which might even improve readability.

Generators are also resistant. Other Python AST generators I've heard of avoid making valid Python identifiers when they don't want collisions, since AST doesn't require it. Our gensyms used to work this way. String generators will likely already be using their own prefix, or at least one of the three standard styles.

@Kodiologist
Copy link
Member

I see where you're coming from. You make some good points. Still, mangling seems weird enough that I would prefer an explicit prefix and a fail-noisy unmangle, although this has the cost of the user needing to keep in mind that a Python identifier may start with a different character than the corresponding Hy symbol. And that's already the case because of the "is_" rule, which I also want to keep.

@Kodiologist
Copy link
Member

Kodiologist commented Apr 10, 2019

@gilch, where would you like to go from here? Do you want to undo the things I objected to, try to get other approvals to overcome my veto, or close the PR?

@Kodiologist
Copy link
Member

I'm closing this for lack of response, but I think I'll reuse your changes to the gensym format for a new PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants