Changes for rfc-179 unicode string #3679

rowland66 · 2020-12-01T00:53:20Z

Implementation of rfc-179 improving unicode support in the String class.

chalcolith · 2020-12-09T19:58:17Z

@rowland66 after #3683 is merged you can rebase on master (or at least cherry-pick the changes to .cirrus.yml) to make the Windows checks run.

ponylang-main · 2020-12-09T23:58:03Z

Hi @rowland66,

The changelog - changed label was added to this pull request; all PRs with a changelog label need to have release notes included as part of the PR. If you haven't added release notes already, please do.

Release notes are added by creating a uniquely named file in the .release-notes directory. We suggest you call the file 3679.md to match the number of this pull request.

The basic format of the release notes (using markdown) should be:

## Title

End user description of changes, why it's important,
problems it solves etc.

If a breaking change, make sure to include 1 or more
examples what code would look like prior to this change
and how to update it to work after this change.

Thanks.

SeanTAllen · 2020-12-09T23:58:35Z

I'm marking as "DO NOT MERGE" given the RFC hasn't been accepted yet and I don't want this to get accidentally merged.

…on tests.

jasoncarr0 · 2020-12-12T04:18:35Z

packages/builtin/array.pony

@@ -380,7 +380,12 @@ class Array[A] is Seq[A]
    Truncate an array to the given length, discarding excess elements. If the
    array is already smaller than len, do nothing.
    """
-    _size = _size.min(len)
+    if len >= _alloc then
+      _size = len.min(_alloc)


if len >= _alloc then len.min(_alloc) is just _alloc.
Or these two len.min(_alloc) lines could be pulled out

jasoncarr0 · 2020-12-12T04:29:56Z

packages/builtin/string.pony

+    Returns the byte array underlying the string. This buffer will contain
+    bytes of the String codepoints in the default system encoding (UTF-8).
+    The array will not reflect all changes in the String from which it is
+    obtained. This is an unsafe function.


I'm not sure about "This is an unsafe function." as despite a few weird bits of behavior this doesn't cause any "unsafety" per se and it's a bit scary.

Maybe more like:
"The array may not reflect future changes in the String from which it is obtained.

This function is meant to supply low-level access to the bytes of a string, the array function provides a safer interface."

jasoncarr0 · 2020-12-12T04:34:07Z

packages/builtin/string.pony

+    """
+    Increase the size of a string to the give len in bytes. This is an
+    unsafe operation, and should only be used when string's _ptr has
+    been manipulated through a FFI call and the string size is known.


This didn't get caught in the RFC as is stands, but as I'm reading this, this can easily cause UB/segfaults in pure, isolated Pony code. I think this needs some form of authorization or require an extra FFI call.

Can these use cases use from_cpointer instead, which takes authorization in the form of a non-null Pointer[U8] ref?

jasoncarr0 · 2020-12-12T04:41:34Z

packages/builtin/string.pony

+    var last: USize = 0
+    let offset = _offset_to_index(from.isize())
+
+    if (to > to.isize().max_value().usize()) then


extra parens here?

jasoncarr0 · 2020-12-12T04:42:46Z

packages/builtin/string.pony


    // use the new size' for alloc if we're not including the last used byte
    // from the original data and only include the extra allocated bytes if
    // we're including the last byte.
-    _alloc = if last == _size then _alloc - offset else size' end
+    _alloc = if to == _size then _alloc - from else size' end

    _size = size'


Memory safety issues here, the user can set _size to more or less anything with this. I think this needs some extra capping on the to and from

jasoncarr0 · 2020-12-12T04:53:41Z

packages/builtin/string.pony

+    end
+
+    while (((inc > 0) and (i < limit) and (n <= offset)) or
+           ((inc < 0) and (i >= 0) and (n > offset))) do


Seems like an extra set of parens

jasoncarr0 · 2020-12-12T04:54:19Z

packages/builtin/string.pony

    """
-    Return an iterator over the codepoints in the string.
+    Return an iterator over the codepoint in the string.


I think this was unintentional?

jasoncarr0 · 2020-12-12T04:55:04Z

packages/builtin/string.pony

    end
+
+class _LimittedIterator[A] is Iterator[A]


Nit: should be Limited

jasoncarr0 · 2020-12-12T05:03:47Z

packages/net/_test.pony

@@ -66,7 +66,7 @@ class _TestPing is UDPNotify
  =>
    _h.complete_action("ping receive")

-    let s = String .> append(consume data)
+    let s = recover val String.from_iso_array(consume data) end


recover is a bit odd here since it's dropping from iso^ to val, perhaps annotate s: String val

jasoncarr0 · 2020-12-12T05:03:55Z

packages/net/_test.pony

@@ -106,7 +106,7 @@ class _TestPong is UDPNotify
  =>
    _h.complete_action("pong receive")

-    let s = String .> append(consume data)
+    let s = recover val String.from_iso_array(consume data) end


rowland66 · 2020-12-14T01:11:31Z

Thanks for the comments Jason. I have added responses below. Here is something else that I have been thinking about. The functions in the String class are not consistent about using signed or unsigned integers for string indexes. In the interest of backward compatibility, I have not changed function signatures. My preference would be to use signed integers always, except in special cases when dealing with raw data that might not represent numbers. This comes from my understanding that the interactions between signed and unsigned integer values can be confusing and a potential source of bugs that can be easily avoided by consistently using signed integer values. However, Pony does not appear to follow this convention. Has this issue ever been discussed, and are there strong opinions for the opposite point of view?

On Sat, Dec 12, 2020 at 12:05 AM Jason Carr ***@***.***> wrote: ***@***.**** commented on this pull request. ------------------------------ In packages/builtin/array.pony <#3679 (comment)>: > @@ -380,7 +380,12 @@ class Array[A] is Seq[A] Truncate an array to the given length, discarding excess elements. If the array is already smaller than len, do nothing. """ - _size = _size.min(len) + if len >= _alloc then + _size = len.min(_alloc) if len >= _alloc then len.min(_alloc) is just _alloc. Or these two len.min(_alloc) lines could be pulled out

This code was copied from the old truncate() function in String. I agree that it is redundant. I will also remove the call to reserve() that comes later. This code made sense in the String class because a \0 was added at the end of the string.

------------------------------ In packages/builtin/string.pony <#3679 (comment)>: > end + fun current_byte_buffer(): this->Array[U8] box => + """ + Returns the byte array underlying the string. This buffer will contain + bytes of the String codepoints in the default system encoding (UTF-8). + The array will not reflect all changes in the String from which it is + obtained. This is an unsafe function. I'm not sure about "This is an unsafe function." as despite a few weird bits of behavior this doesn't cause any "unsafety" per se and it's a bit scary. Maybe more like: "The array may not reflect future changes in the String from which it is obtained. This function is meant to supply low-level access to the bytes of a string, the array function provides a safer interface."

I will change the language. I suppose that it is unsafe in the sense the byte buffer might fail to reflect the value of the string if modifications are made. So it is safe to use if you understand the underlying implementation of the String class, but general use might produce surprising results and bugs.

------------------------------ In packages/builtin/string.pony <#3679 (comment)>: > @@ -338,36 +470,68 @@ actor Main _size = s end + fun ref resize(len: USize) => + """ + Increase the size of a string to the give len in bytes. This is an + unsafe operation, and should only be used when string's _ptr has + been manipulated through a FFI call and the string size is known. This didn't get caught in the RFC as is stands, but as I'm reading this, this can easily cause UB/segfaults in pure, isolated Pony code. I think this needs some form of authorization or require an extra FFI call. Can these use cases use from_cpointer instead, which takes authorization in the form of a non-null Pointer[U8] ref?

I think that the use case for this is to create a String of a certain size as a buffer. A pointer to that buffer is then passed to a foreign function. The foreign function puts string bytes in the buffer, and returns the size of the string in bytes. This function is then called with the returned size to synchronize the String class with its buffer (_ptr). This was previously done using the truncate() function. I changed the truncate() function to take the number of codepoints as a parameter, so truncate() could no longer be used in this way. That is why I created resize(). Maybe we should change the name to resize_bytes() to better indicate that this function takes the number of bytes as a parameter. All bytes related functions are somewhat dangerous in String now.

------------------------------ In packages/builtin/string.pony <#3679 (comment)>: > fun ref trim_in_place(from: USize = 0, to: USize = -1) => """ Trim the string to a portion of itself, covering `from` until `to`. Unlike slice, the operation does not allocate a new string nor copy elements. """ - let last = _size.min(to) - let offset = last.min(from) - let size' = last - offset + var last: USize = 0 + let offset = _offset_to_index(from.isize()) + + if (to > to.isize().max_value().usize()) then extra parens here?

Yes, 20 years of Java and it's a muscle memory. I will remove.

------------------------------ In packages/builtin/string.pony <#3679 (comment)>: > // use the new size' for alloc if we're not including the last used byte // from the original data and only include the extra allocated bytes if // we're including the last byte. - _alloc = if last == _size then _alloc - offset else size' end + _alloc = if to == _size then _alloc - from else size' end _size = size' Memory safety issues here, the user can set _size to more or less anything with this. I think this needs some extra capping on the to and from

fun ref _trim_in_place() is not available outside the package. It is only called from trim_in_place() and chop(). In both cases the value of 'to' is controlled.

------------------------------ In packages/builtin/string.pony <#3679 (comment)>: > @@ -388,8 +552,17 @@ actor Main Both the original and the new string are immutable, as they share memory. The operation does not allocate a new string pointer nor copy elements. """ - let last = _size.min(to) - let offset = last.min(from) + var last: USize = 0 + let offset = _offset_to_index(from.isize()) + if (to > to.isize().max_value().usize()) then Extra parens

Will fix

------------------------------ In packages/builtin/string.pony <#3679 (comment)>: > @@ -388,8 +552,17 @@ actor Main Both the original and the new string are immutable, as they share memory. The operation does not allocate a new string pointer nor copy elements. """ - let last = _size.min(to) - let offset = last.min(from) + var last: USize = 0 + let offset = _offset_to_index(from.isize()) These also need the min operations as before for memory safety

The _offset_to_index() function is safe in that 'from' is limited to the number of codepoints in the String.

------------------------------ In packages/builtin/string.pony <#3679 (comment)>: > + fun _offset_to_index(offset: ISize, start: USize = 0): USize => + let limit: USize = _size + var inc: ISize = 1 + var n = ISize(0) + var i = start.min(_size) + if offset < 0 then + inc = -1 + if start == 0 then + i = _size - 1 + else + i = start - 1 + end + end + + while (((inc > 0) and (i < limit) and (n <= offset)) or + ((inc < 0) and (i >= 0) and (n > offset))) do Seems like an extra set of parens

Will fix

------------------------------ In packages/builtin/string.pony <#3679 (comment)>: > """ - Return an iterator over the codepoints in the string. + Return an iterator over the codepoint in the string. I think this was unintentional?

Will fix

------------------------------ In packages/builtin/string.pony <#3679 (comment)>: > end + +class _LimittedIterator[A] is Iterator[A] Nit: should be Limited

Will fix

------------------------------ In packages/net/_test.pony <#3679 (comment)>: > @@ -66,7 +66,7 @@ class _TestPing is UDPNotify => _h.complete_action("ping receive") - let s = String .> append(consume data) + let s = recover val String.from_iso_array(consume data) end recover is a bit odd here since it's dropping from iso^ to val, perhaps annotate s: String val

I will remove the recover and declare the type.

------------------------------ In packages/net/_test.pony <#3679 (comment)>: > @@ -106,7 +106,7 @@ class _TestPong is UDPNotify => _h.complete_action("pong receive") - let s = String .> append(consume data) + let s = recover val String.from_iso_array(consume data) end Same here

Will do the same as above.

…

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#3679 (review)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AGEQLV4MOKQNTOUDNPUCTMLSUL2ZDANCNFSM4UIJ3LUA> .

-- *Rowland E. Smith* P: (862) 260-4163 M: (201) 396-3842

jasoncarr0 · 2020-12-14T03:59:40Z

Thanks for the responses @rowland66. I didn't think to catch that the private methods' safety conditions were being checked by the calling public methods. And I totally get you on that muscle memory haha. I make exactly the same mistake when switching between Java and parens-less languages.

Wrt. resize I think in general "dangerous" behavior is fine, but (and I think others could speak to other methods for ensuring it) I don't believe there's any flexibility with regards to memory safety in pure Pony code

jemc · 2020-12-15T19:13:30Z

Discussed the resize/resize_bytes issue in the sync call.

I suggest renaming it truncate_bytes, and ensuring that it can only reduce the size of the buffer and never increase it, to avoid the memory safety issue. This would be at parity with how truncate previously worked, right?

rowland66 · 2020-12-16T02:42:08Z

OK. I will go ahead and make that change. We will now have 2 truncate functions. truncate() will only allow the size of the string to be decreased by codepoints. truncate_bytes() will only allow the size of the string to be increased by bytes up to the reserved buffer (_alloc). Does not seem terribly intuitive.

…

On Tue, Dec 15, 2020 at 2:13 PM Joe Eli McIlvain ***@***.***> wrote: Discussed the resize/resize_bytes issue in the sync call. I suggest renaming it truncate_bytes, and ensuring that it can only reduce the size of the buffer and never increase it, to avoid the memory safety issue. This would be at parity with how truncate previously worked, right? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#3679 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AGEQLV4MNFP4WPWUHSILOWLSU6YOTANCNFSM4UIJ3LUA> .

-- *Rowland E. Smith* P: (862) 260-4163 M: (201) 396-3842

jasoncarr0 · 2020-12-16T06:41:49Z

Oh, if we support resizing up to the length of the buffer, then that is reasonable to be named resize.
The only pitfall with that is we may need to ensure the entire buffer is always initialized for all strings in that case, based on my understanding of the codegen wrt. LLVMs memory model

rowland66 · 2020-12-18T02:45:34Z

I can change it back. Not sure about the buffer initialization. Truncate seemed to work this way previously.

…

On Wed, Dec 16, 2020 at 1:42 AM Jason Carr ***@***.***> wrote: Oh, if we support resizing up to the length of the buffer, then that is reasonable to be named resize. The only pitfall with that is we may need to ensure the entire buffer is always initialized for all strings in that case, based on my understanding of the codegen wrt. LLVMs memory model — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#3679 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AGEQLVZMG6QQG52AGLD3WRDSVBJD3ANCNFSM4UIJ3LUA> .

-- *Rowland E. Smith* P: (862) 260-4163 M: (201) 396-3842

SeanTAllen · 2022-01-30T18:54:08Z

@rowland66 are you still interested in working on this?

rowland66 · 2022-01-31T03:17:41Z

Hi Sean, I have less time to work on this kind of stuff, but I would like to see it finished. I left it at a point where I thought that it was pretty much done. I had responded to code reviews with some changes. There may have been some issues with test suites that needed to be addressed, and I was not able to make much progress on my own. So if others are willing to help, I would be willing to keep working to get it merged.

…

On Sun, Jan 30, 2022 at 1:54 PM Sean T Allen ***@***.***> wrote: @rowland66 <https://github.com/rowland66> are you still interested in working on this? — Reply to this email directly, view it on GitHub <#3679 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AGEQLV2WJFZ6P6NL65FK3JLUYWCNZANCNFSM4UIJ3LUA> . Triage notifications on the go with GitHub Mobile for iOS <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> or Android <https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>. You are receiving this because you were mentioned.Message ID: ***@***.***>

-- *Rowland E. Smith* P: (862) 260-4163 M: (201) 396-3842

SeanTAllen · 2024-01-29T02:05:12Z

@ponylang/committer I feel like we should close this as it is rather out of date. Agreed?

rowland66 added 2 commits November 29, 2020 22:21

Changes for rfc-179 unicode string

73cf4c2

Fix for concat_bytes() bug. rfc 179

4d9571b

SeanTAllen added the changelog - changed Automatically add "Changed" CHANGELOG entry on merge label Dec 9, 2020

SeanTAllen added the do not merge This PR should not be merged at this time label Dec 9, 2020

rowland66 added 3 commits December 11, 2020 20:19

Added current_byte_buffer() function and current_byte_buffer() functi…

2fc45b2

…on tests.

Fixed buffered/writer.pony writev() to be paramaterized by StringEncoder

cb0723b

Refresh with changes from Master

9bbfa26

jasoncarr0 reviewed Dec 12, 2020

View reviewed changes

Base automatically changed from master to main February 8, 2021 23:02

rhagenson mentioned this pull request Jul 14, 2021

Package-level documentation (#3281) #3785

Merged

SeanTAllen force-pushed the main branch from 9af58ff to 7f41bd7 Compare August 9, 2023 20:53

ponylang-main added the discuss during sync Should be discussed during an upcoming sync label Jan 29, 2024

SeanTAllen closed this Jan 30, 2024

ponylang-main removed the discuss during sync Should be discussed during an upcoming sync label Jan 30, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Changes for rfc-179 unicode string #3679

Changes for rfc-179 unicode string #3679

rowland66 commented Dec 1, 2020 •

edited by aturley

Loading

chalcolith commented Dec 9, 2020

ponylang-main commented Dec 9, 2020

SeanTAllen commented Dec 9, 2020

jasoncarr0 Dec 12, 2020

jasoncarr0 Dec 12, 2020

jasoncarr0 Dec 12, 2020

jasoncarr0 Dec 12, 2020

jasoncarr0 Dec 12, 2020

jasoncarr0 Dec 12, 2020

jasoncarr0 Dec 12, 2020

jasoncarr0 Dec 12, 2020

jasoncarr0 Dec 12, 2020

jasoncarr0 Dec 12, 2020

rowland66 commented Dec 14, 2020 via email

jasoncarr0 commented Dec 14, 2020 •

edited

Loading

jemc commented Dec 15, 2020

rowland66 commented Dec 16, 2020 via email

jasoncarr0 commented Dec 16, 2020

rowland66 commented Dec 18, 2020 via email

SeanTAllen commented Jan 30, 2022

rowland66 commented Jan 31, 2022 via email

SeanTAllen commented Jan 29, 2024

Changes for rfc-179 unicode string #3679

Changes for rfc-179 unicode string #3679

Conversation

rowland66 commented Dec 1, 2020 • edited by aturley Loading

chalcolith commented Dec 9, 2020

ponylang-main commented Dec 9, 2020

SeanTAllen commented Dec 9, 2020

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rowland66 commented Dec 14, 2020 via email

jasoncarr0 commented Dec 14, 2020 • edited Loading

jemc commented Dec 15, 2020

rowland66 commented Dec 16, 2020 via email

jasoncarr0 commented Dec 16, 2020

rowland66 commented Dec 18, 2020 via email

SeanTAllen commented Jan 30, 2022

rowland66 commented Jan 31, 2022 via email

SeanTAllen commented Jan 29, 2024

rowland66 commented Dec 1, 2020 •

edited by aturley

Loading

jasoncarr0 commented Dec 14, 2020 •

edited

Loading