-
-
Notifications
You must be signed in to change notification settings - Fork 419
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes for rfc-179 unicode string #3679
Conversation
@rowland66 after #3683 is merged you can rebase on master (or at least cherry-pick the changes to |
Hi @rowland66, The changelog - changed label was added to this pull request; all PRs with a changelog label need to have release notes included as part of the PR. If you haven't added release notes already, please do. Release notes are added by creating a uniquely named file in the The basic format of the release notes (using markdown) should be:
Thanks. |
I'm marking as "DO NOT MERGE" given the RFC hasn't been accepted yet and I don't want this to get accidentally merged. |
@@ -380,7 +380,12 @@ class Array[A] is Seq[A] | |||
Truncate an array to the given length, discarding excess elements. If the | |||
array is already smaller than len, do nothing. | |||
""" | |||
_size = _size.min(len) | |||
if len >= _alloc then | |||
_size = len.min(_alloc) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if len >= _alloc
then len.min(_alloc)
is just _alloc
.
Or these two len.min(_alloc)
lines could be pulled out
Returns the byte array underlying the string. This buffer will contain | ||
bytes of the String codepoints in the default system encoding (UTF-8). | ||
The array will not reflect all changes in the String from which it is | ||
obtained. This is an unsafe function. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure about "This is an unsafe function." as despite a few weird bits of behavior this doesn't cause any "unsafety" per se and it's a bit scary.
Maybe more like:
"The array may not reflect future changes in the String from which it is obtained.
This function is meant to supply low-level access to the bytes of a string, the array function provides a safer interface."
""" | ||
Increase the size of a string to the give len in bytes. This is an | ||
unsafe operation, and should only be used when string's _ptr has | ||
been manipulated through a FFI call and the string size is known. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This didn't get caught in the RFC as is stands, but as I'm reading this, this can easily cause UB/segfaults in pure, isolated Pony code. I think this needs some form of authorization or require an extra FFI call.
Can these use cases use from_cpointer instead, which takes authorization in the form of a non-null Pointer[U8] ref
?
var last: USize = 0 | ||
let offset = _offset_to_index(from.isize()) | ||
|
||
if (to > to.isize().max_value().usize()) then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extra parens here?
|
||
// use the new size' for alloc if we're not including the last used byte | ||
// from the original data and only include the extra allocated bytes if | ||
// we're including the last byte. | ||
_alloc = if last == _size then _alloc - offset else size' end | ||
_alloc = if to == _size then _alloc - from else size' end | ||
|
||
_size = size' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Memory safety issues here, the user can set _size to more or less anything with this. I think this needs some extra capping on the to and from
end | ||
|
||
while (((inc > 0) and (i < limit) and (n <= offset)) or | ||
((inc < 0) and (i >= 0) and (n > offset))) do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like an extra set of parens
""" | ||
Return an iterator over the codepoints in the string. | ||
Return an iterator over the codepoint in the string. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this was unintentional?
end | ||
|
||
class _LimittedIterator[A] is Iterator[A] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: should be Limited
@@ -66,7 +66,7 @@ class _TestPing is UDPNotify | |||
=> | |||
_h.complete_action("ping receive") | |||
|
|||
let s = String .> append(consume data) | |||
let s = recover val String.from_iso_array(consume data) end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
recover is a bit odd here since it's dropping from iso^
to val
, perhaps annotate s: String val
@@ -106,7 +106,7 @@ class _TestPong is UDPNotify | |||
=> | |||
_h.complete_action("pong receive") | |||
|
|||
let s = String .> append(consume data) | |||
let s = recover val String.from_iso_array(consume data) end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here
Thanks for the comments Jason. I have added responses below.
Here is something else that I have been thinking about. The functions in
the String class are not consistent about using signed or unsigned integers
for string indexes. In the interest of backward compatibility, I have not
changed function signatures. My preference would be to use signed integers
always, except in special cases when dealing with raw data that might not
represent numbers. This comes from my understanding that the interactions
between signed and unsigned integer values can be confusing and a potential
source of bugs that can be easily avoided by consistently using signed
integer values. However, Pony does not appear to follow this convention.
Has this issue ever been discussed, and are there strong opinions for the
opposite point of view?
On Sat, Dec 12, 2020 at 12:05 AM Jason Carr ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In packages/builtin/array.pony
<#3679 (comment)>:
> @@ -380,7 +380,12 @@ class Array[A] is Seq[A]
Truncate an array to the given length, discarding excess elements. If the
array is already smaller than len, do nothing.
"""
- _size = _size.min(len)
+ if len >= _alloc then
+ _size = len.min(_alloc)
if len >= _alloc then len.min(_alloc) is just _alloc.
Or these two len.min(_alloc) lines could be pulled out
This code was copied from the old truncate() function in String. I agree
that it is redundant. I will also remove the call to reserve() that comes
later. This code made sense in the String class because a \0 was added at
the end of the string.
------------------------------
In packages/builtin/string.pony
<#3679 (comment)>:
> end
+ fun current_byte_buffer(): this->Array[U8] box =>
+ """
+ Returns the byte array underlying the string. This buffer will contain
+ bytes of the String codepoints in the default system encoding (UTF-8).
+ The array will not reflect all changes in the String from which it is
+ obtained. This is an unsafe function.
I'm not sure about "This is an unsafe function." as despite a few weird
bits of behavior this doesn't cause any "unsafety" per se and it's a bit
scary.
Maybe more like:
"The array may not reflect future changes in the String from which it is
obtained.
This function is meant to supply low-level access to the bytes of a
string, the array function provides a safer interface."
I will change the language. I suppose that it is unsafe in the sense the
byte buffer might fail to reflect the value of the string if modifications
are made. So it is safe to use if you understand the underlying
implementation of the String class, but general use might produce
surprising results and bugs.
------------------------------
In packages/builtin/string.pony
<#3679 (comment)>:
> @@ -338,36 +470,68 @@ actor Main
_size = s
end
+ fun ref resize(len: USize) =>
+ """
+ Increase the size of a string to the give len in bytes. This is an
+ unsafe operation, and should only be used when string's _ptr has
+ been manipulated through a FFI call and the string size is known.
This didn't get caught in the RFC as is stands, but as I'm reading this,
this can easily cause UB/segfaults in pure, isolated Pony code. I think
this needs some form of authorization or require an extra FFI call.
Can these use cases use from_cpointer instead, which takes authorization
in the form of a non-null Pointer[U8] ref?
I think that the use case for this is to create a String of a certain size
as a buffer. A pointer to that buffer is then passed to a foreign function.
The foreign function puts string bytes in the buffer, and returns the size
of the string in bytes. This function is then called with the returned size
to synchronize the String class with its buffer (_ptr). This was previously
done using the truncate() function. I changed the truncate() function to
take the number of codepoints as a parameter, so truncate() could no longer
be used in this way. That is why I created resize(). Maybe we should change
the name to resize_bytes() to better indicate that this function takes the
number of bytes as a parameter. All bytes related functions are somewhat
dangerous in String now.
------------------------------
In packages/builtin/string.pony
<#3679 (comment)>:
> fun ref trim_in_place(from: USize = 0, to: USize = -1) =>
"""
Trim the string to a portion of itself, covering `from` until `to`.
Unlike slice, the operation does not allocate a new string nor copy
elements.
"""
- let last = _size.min(to)
- let offset = last.min(from)
- let size' = last - offset
+ var last: USize = 0
+ let offset = _offset_to_index(from.isize())
+
+ if (to > to.isize().max_value().usize()) then
extra parens here?
Yes, 20 years of Java and it's a muscle memory. I will remove.
------------------------------
In packages/builtin/string.pony
<#3679 (comment)>:
>
// use the new size' for alloc if we're not including the last used byte
// from the original data and only include the extra allocated bytes if
// we're including the last byte.
- _alloc = if last == _size then _alloc - offset else size' end
+ _alloc = if to == _size then _alloc - from else size' end
_size = size'
Memory safety issues here, the user can set _size to more or less anything
with this. I think this needs some extra capping on the to and from
fun ref _trim_in_place() is not available outside the package. It is only
called from trim_in_place() and chop(). In both cases the value of 'to' is
controlled.
------------------------------
In packages/builtin/string.pony
<#3679 (comment)>:
> @@ -388,8 +552,17 @@ actor Main
Both the original and the new string are immutable, as they share memory.
The operation does not allocate a new string pointer nor copy elements.
"""
- let last = _size.min(to)
- let offset = last.min(from)
+ var last: USize = 0
+ let offset = _offset_to_index(from.isize())
+ if (to > to.isize().max_value().usize()) then
Extra parens
Will fix
------------------------------
In packages/builtin/string.pony
<#3679 (comment)>:
> @@ -388,8 +552,17 @@ actor Main
Both the original and the new string are immutable, as they share memory.
The operation does not allocate a new string pointer nor copy elements.
"""
- let last = _size.min(to)
- let offset = last.min(from)
+ var last: USize = 0
+ let offset = _offset_to_index(from.isize())
These also need the min operations as before for memory safety
The _offset_to_index() function is safe in that 'from' is limited to the
number of codepoints in the String.
------------------------------
In packages/builtin/string.pony
<#3679 (comment)>:
> + fun _offset_to_index(offset: ISize, start: USize = 0): USize =>
+ let limit: USize = _size
+ var inc: ISize = 1
+ var n = ISize(0)
+ var i = start.min(_size)
+ if offset < 0 then
+ inc = -1
+ if start == 0 then
+ i = _size - 1
+ else
+ i = start - 1
+ end
+ end
+
+ while (((inc > 0) and (i < limit) and (n <= offset)) or
+ ((inc < 0) and (i >= 0) and (n > offset))) do
Seems like an extra set of parens
Will fix
------------------------------
In packages/builtin/string.pony
<#3679 (comment)>:
> """
- Return an iterator over the codepoints in the string.
+ Return an iterator over the codepoint in the string.
I think this was unintentional?
Will fix
------------------------------
In packages/builtin/string.pony
<#3679 (comment)>:
> end
+
+class _LimittedIterator[A] is Iterator[A]
Nit: should be Limited
Will fix
------------------------------
In packages/net/_test.pony
<#3679 (comment)>:
> @@ -66,7 +66,7 @@ class _TestPing is UDPNotify
=>
_h.complete_action("ping receive")
- let s = String .> append(consume data)
+ let s = recover val String.from_iso_array(consume data) end
recover is a bit odd here since it's dropping from iso^ to val, perhaps
annotate s: String val
I will remove the recover and declare the type.
------------------------------
In packages/net/_test.pony
<#3679 (comment)>:
> @@ -106,7 +106,7 @@ class _TestPong is UDPNotify
=>
_h.complete_action("pong receive")
- let s = String .> append(consume data)
+ let s = recover val String.from_iso_array(consume data) end
Same here
Will do the same as above.
… —
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#3679 (review)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AGEQLV4MOKQNTOUDNPUCTMLSUL2ZDANCNFSM4UIJ3LUA>
.
--
*Rowland E. Smith*
P: (862) 260-4163
M: (201) 396-3842
|
Thanks for the responses @rowland66. I didn't think to catch that the private methods' safety conditions were being checked by the calling public methods. And I totally get you on that muscle memory haha. I make exactly the same mistake when switching between Java and parens-less languages. Wrt. resize I think in general "dangerous" behavior is fine, but (and I think others could speak to other methods for ensuring it) I don't believe there's any flexibility with regards to memory safety in pure Pony code |
Discussed the I suggest renaming it |
OK. I will go ahead and make that change. We will now have 2 truncate
functions. truncate() will only allow the size of the string to be
decreased by codepoints. truncate_bytes() will only allow the size of the
string to be increased by bytes up to the reserved buffer (_alloc). Does
not seem terribly intuitive.
…On Tue, Dec 15, 2020 at 2:13 PM Joe Eli McIlvain ***@***.***> wrote:
Discussed the resize/resize_bytes issue in the sync call.
I suggest renaming it truncate_bytes, and ensuring that it can only
reduce the size of the buffer and never increase it, to avoid the memory
safety issue. This would be at parity with how truncate previously
worked, right?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#3679 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AGEQLV4MNFP4WPWUHSILOWLSU6YOTANCNFSM4UIJ3LUA>
.
--
*Rowland E. Smith*
P: (862) 260-4163
M: (201) 396-3842
|
Oh, if we support resizing up to the length of the buffer, then that is reasonable to be named resize. |
I can change it back. Not sure about the buffer initialization. Truncate
seemed to work this way previously.
…On Wed, Dec 16, 2020 at 1:42 AM Jason Carr ***@***.***> wrote:
Oh, if we support resizing up to the length of the buffer, then that is
reasonable to be named resize.
The only pitfall with that is we may need to ensure the entire buffer is
always initialized for all strings in that case, based on my understanding
of the codegen wrt. LLVMs memory model
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#3679 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AGEQLVZMG6QQG52AGLD3WRDSVBJD3ANCNFSM4UIJ3LUA>
.
--
*Rowland E. Smith*
P: (862) 260-4163
M: (201) 396-3842
|
@rowland66 are you still interested in working on this? |
Hi Sean,
I have less time to work on this kind of stuff, but I would like to see it
finished. I left it at a point where I thought that it was pretty much
done. I had responded to code reviews with some changes. There may have
been some issues with test suites that needed to be addressed, and I was
not able to make much progress on my own. So if others are willing to help,
I would be willing to keep working to get it merged.
…On Sun, Jan 30, 2022 at 1:54 PM Sean T Allen ***@***.***> wrote:
@rowland66 <https://github.com/rowland66> are you still interested in
working on this?
—
Reply to this email directly, view it on GitHub
<#3679 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AGEQLV2WJFZ6P6NL65FK3JLUYWCNZANCNFSM4UIJ3LUA>
.
Triage notifications on the go with GitHub Mobile for iOS
<https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675>
or Android
<https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.
You are receiving this because you were mentioned.Message ID:
***@***.***>
--
*Rowland E. Smith*
P: (862) 260-4163
M: (201) 396-3842
|
@ponylang/committer I feel like we should close this as it is rather out of date. Agreed? |
Implementation of rfc-179 improving unicode support in the String class.