GH-45523: [R] Implement Utf8View type bindings#49712
Conversation
|
We should rebase once #49710 is merged as changes from that PR are in this branch as they were needed to make it work. |
e8768d0 to
bb711a7
Compare
| return this->value_builder_->Append( | ||
| std::string_view(view_.bytes, static_cast<size_t>(view_.size))); |
There was a problem hiding this comment.
The old code passed (const char*, int32_t) which matches StringBuilder::Append but not StringViewBuilder::Append (which takes int64_t). Switching to std::string_view works for both builder types.
|
A lot of |
de59365 to
cf9ad10
Compare
There was a problem hiding this comment.
Pull request overview
This PR adds R bindings and conversion support for Arrow utf8_view / string_view, addressing IPC/table conversion failures for data containing StringView columns.
Changes:
- Adds
string_view()R data type binding, export registration, and type tests. - Implements R↔Arrow conversion paths for StringView arrays and dictionary values.
- Updates generated documentation and related converter support in C++/Python.
Reviewed changes
Copilot reviewed 13 out of 23 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| r/R/type.R | Adds StringView R6 type, constructor, and canonical type aliases. |
| r/R/arrowExports.R | Registers R wrapper for StringView__initialize. |
| r/NAMESPACE | Exports string_view. |
| r/src/datatype.cpp | Maps Arrow STRING_VIEW to R StringView and initializes utf8_view. |
| r/src/r_to_arrow.cpp | Adds R-to-Arrow StringView conversion and dictionary StringView value handling. |
| r/src/array_to_vector.cpp | Adds Arrow-to-R StringView and wider dictionary index conversion support. |
| r/src/arrowExports.cpp | Registers native StringView initialization entry point. |
| cpp/src/arrow/util/converter.h | Enables dictionary converters for StringViewType. |
| python/pyarrow/src/arrow/python/python_to_arrow.cc | Adjusts Python dictionary StringView append call. |
| r/tests/testthat/test-Array.R | Adds StringView array round-trip tests. |
| r/tests/testthat/test-Table.R | Adds table/dictionary StringView and wider index tests. |
| r/tests/testthat/test-data-type.R | Adds StringView data type and code round-trip tests. |
| r/DESCRIPTION | Updates roxygen metadata. |
| r/man/data-type.Rd | Adds string_view() documentation entry. |
| r/man/acero.Rd | Regenerated Acero documentation. |
| r/man/arrow-package.Rd | Regenerated package author documentation. |
| r/man/csv_convert_options.Rd | Regenerated CSV conversion docs. |
| r/man/csv_read_options.Rd | Regenerated CSV read docs. |
| r/man/CsvReadOptions.Rd | Regenerated CSV read options docs. |
| r/man/enums.Rd | Regenerated enum documentation. |
| r/man/JsonFileFormat.Rd | Regenerated JSON file format docs. |
| r/man/reexports.Rd | Regenerated reexports docs. |
| r/man/vctrs_extension_array.Rd | Regenerated vctrs extension docs. |
Files not reviewed (10)
- r/man/CsvReadOptions.Rd: Language not supported
- r/man/JsonFileFormat.Rd: Language not supported
- r/man/acero.Rd: Language not supported
- r/man/arrow-package.Rd: Language not supported
- r/man/csv_convert_options.Rd: Language not supported
- r/man/csv_read_options.Rd: Language not supported
- r/man/data-type.Rd: Language not supported
- r/man/enums.Rd: Language not supported
- r/man/reexports.Rd: Language not supported
- r/man/vctrs_extension_array.Rd: Language not supported
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 13 out of 23 changed files in this pull request and generated 1 comment.
Files not reviewed (10)
- r/man/CsvReadOptions.Rd: Language not supported
- r/man/JsonFileFormat.Rd: Language not supported
- r/man/acero.Rd: Language not supported
- r/man/arrow-package.Rd: Language not supported
- r/man/csv_convert_options.Rd: Language not supported
- r/man/csv_read_options.Rd: Language not supported
- r/man/data-type.Rd: Language not supported
- r/man/enums.Rd: Language not supported
- r/man/reexports.Rd: Language not supported
- r/man/vctrs_extension_array.Rd: Language not supported
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
a74c9c1 to
1842e7e
Compare
|
Current CI failure is unrelated; #50219 |
|
Lots of rounds with Claude, seems to make sense! I tested the code from the original issue and it works with this PR. |
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
pitrou
left a comment
There was a problem hiding this comment.
Thanks for doing this @thisisnic . Some comments below, but the main question is: can we also add support for BinaryView? It should be relatively easy given all the work already done here.
| DICTIONARY_CASE(LargeBinaryType); | ||
| DICTIONARY_CASE(StringType); | ||
| DICTIONARY_CASE(LargeStringType); | ||
| DICTIONARY_CASE(StringViewType); |
| ) | ||
| ) | ||
| StringView <- R6Class( | ||
| "StringView", |
There was a problem hiding this comment.
Is there a reason for calling it StringView instead of Utf8View (like the existing Utf8 and LargeUtf8)?
| code = function(namespace = FALSE) call2("binary", .ns = if (namespace) "arrow") | ||
| ) | ||
| ) | ||
| LargeBinary <- R6Class( |
There was a problem hiding this comment.
Why not also add BinaryView here?
| // StringViewArray uses a different memory layout (views + data buffers) rather | ||
| // than offsets, so skip the offset-based fast path and fall through to the | ||
| // GetView()-based element loop below. | ||
| if (array->type_id() != Type::STRING_VIEW) { |
There was a problem hiding this comment.
Can use is_binary_view_like(array->type_id()) to also match BinaryView.
| // than offsets, so skip the offset-based fast path and fall through to the | ||
| // GetView()-based element loop below. | ||
| if (array->type_id() != Type::STRING_VIEW) { | ||
| auto p_offset = array->data()->GetValues<int32_t>(1); |
There was a problem hiding this comment.
So this doesn't work for LargeBinary and LargeString, which have 64-bit offsets, right?
| template <typename T> | ||
| struct RConverterTrait<T, enable_if_t<is_binary_view_like_type<T>::value && | ||
| !is_string_view_type<T>::value>> { | ||
| // not implemented |
There was a problem hiding this comment.
Why not implement it too? It should be reasonably easy given you already the implementations for Binary and StringView.
|
|
||
| template <typename T> | ||
| class RPrimitiveConverter<T, enable_if_string_view<T>> | ||
| : public PrimitiveConverter<T, RConverter> { |
There was a problem hiding this comment.
It seems there's a lot in common between this and the regular String converter (only UnsafeAppendUtf8Strings differs AFAICT), perhaps it's worth factoring things out in a common base class?
| expect_array_roundtrip(c("itsy", NA, "spider"), utf8()) | ||
| expect_array_roundtrip(c("itsy", NA, "spider"), large_utf8(), as = large_utf8()) | ||
|
|
||
| expect_array_roundtrip(c("itsy", NA, "", "spider"), string_view(), as = string_view()) |
There was a problem hiding this comment.
String view arrays have a different representation for "short" and "long" strings (i.e. > 12 characters), I suggest exercising that:
| expect_array_roundtrip(c("itsy", NA, "", "spider"), string_view(), as = string_view()) | |
| expect_array_roundtrip(c("itsy", NA, "", "spider"), string_view(), as = string_view()) | |
| expect_array_roundtrip(c("itsy", NA, "", "a long non-inlined string", "another long string"), string_view(), as = string_view()) |
| expect_equal_data_frame(tab_large, fact) | ||
| }) | ||
|
|
||
| test_that("Table converts dictionary arrays with wider index types back to R", { |
There was a problem hiding this comment.
This test does not seem related, is it just adding a test that was overlooked before?
| }) | ||
|
|
||
| test_that("Table converts dictionary arrays with string_view values", { | ||
| expected <- data.frame(foo = factor(c("x", NA, "x"))) |
There was a problem hiding this comment.
Would also suggest testing with longer strings here.
Rationale for this change
No bindings for Utf8View type in the R package
What changes are included in this PR?
Implement bindings
Are these changes tested?
Yep
Are there any user-facing changes?
Yep, adding functionality.
AI Usage
Heavily used Codex/Claude here. I'm not confident of every line of code. I read things over, and iterated on it making sure that tests pass and nothing seemed wildly incorrect.