GH-45523: [R] Implement Utf8View type bindings #49712

Open
thisisnic wants to merge 16 commits into apache:main from thisisnic:GH-45523-ipc-polars

thisisnic (Member) commented Apr 11, 2026

Rationale for this change

The R package has no bindings for the Utf8View type.

What changes are included in this PR?

Implement bindings

Are these changes tested?

Yep

Are there any user-facing changes?

Yep, adding functionality.

AI Usage

I used Codex/Claude heavily here, and I'm not confident in every line of code. I read it over and iterated until the tests passed and nothing seemed wildly incorrect.

thisisnic (Member, Author) commented

We should rebase once #49710 is merged; this branch includes the changes from that PR because they were needed to make this work.

Comment on lines +829 to +830
return this->value_builder_->Append(
std::string_view(view_.bytes, static_cast<size_t>(view_.size)));

Member Author

The old code passed (const char*, int32_t) which matches StringBuilder::Append but not StringViewBuilder::Append (which takes int64_t). Switching to std::string_view works for both builder types.
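As a rough illustration of that point, here is a minimal sketch. `NarrowBuilder`, `WideBuilder`, `RStringView`, and `AppendView` are hypothetical stand-ins, not Arrow classes; the only assumption carried over from the comment above is that both real builders accept a `std::string_view`. Passing a `std::string_view` lets one generic call site serve both builders without caring which integer width the pointer/length overload uses.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical builder whose pointer/length overload takes a 32-bit length.
struct NarrowBuilder {
  std::vector<std::string> values;
  void Append(const char* data, int32_t length) {
    values.emplace_back(data, static_cast<std::size_t>(length));
  }
  void Append(std::string_view v) { values.emplace_back(v); }
};

// Hypothetical builder whose pointer/length overload takes a 64-bit length.
struct WideBuilder {
  std::vector<std::string> values;
  void Append(const char* data, int64_t length) {
    values.emplace_back(data, static_cast<std::size_t>(length));
  }
  void Append(std::string_view v) { values.emplace_back(v); }
};

// Hypothetical view of one R string: a pointer plus a 32-bit size.
struct RStringView {
  const char* bytes;
  int32_t size;
};

// One call site for both builders: the length type never appears here.
template <typename Builder>
void AppendView(Builder& builder, const RStringView& view) {
  assert(view.size >= 0);
  builder.Append(std::string_view(view.bytes, static_cast<std::size_t>(view.size)));
}
```

Calling `AppendView` with either builder type and the same `RStringView` appends the same bytes to both.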

Comment thread r/src/r_to_arrow.cpp
thisisnic (Member, Author) commented

A lot of .Rd updates as I used a newer roxygen2 version.

Copilot AI review requested due to automatic review settings May 28, 2026 09:37
thisisnic force-pushed the GH-45523-ipc-polars branch from de59365 to cf9ad10 May 28, 2026 09:37
github-actions (bot) added the "awaiting change review" label and removed the "awaiting changes" label May 28, 2026

Copilot AI (Contributor) left a comment

Pull request overview

This PR adds R bindings and conversion support for Arrow utf8_view / string_view, addressing IPC/table conversion failures for data containing StringView columns.

Changes:

  • Adds string_view() R data type binding, export registration, and type tests.
  • Implements R↔Arrow conversion paths for StringView arrays and dictionary values.
  • Updates generated documentation and related converter support in C++/Python.

Reviewed changes

Copilot reviewed 13 out of 23 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| r/R/type.R | Adds StringView R6 type, constructor, and canonical type aliases. |
| r/R/arrowExports.R | Registers R wrapper for StringView__initialize. |
| r/NAMESPACE | Exports string_view. |
| r/src/datatype.cpp | Maps Arrow STRING_VIEW to R StringView and initializes utf8_view. |
| r/src/r_to_arrow.cpp | Adds R-to-Arrow StringView conversion and dictionary StringView value handling. |
| r/src/array_to_vector.cpp | Adds Arrow-to-R StringView and wider dictionary index conversion support. |
| r/src/arrowExports.cpp | Registers native StringView initialization entry point. |
| cpp/src/arrow/util/converter.h | Enables dictionary converters for StringViewType. |
| python/pyarrow/src/arrow/python/python_to_arrow.cc | Adjusts Python dictionary StringView append call. |
| r/tests/testthat/test-Array.R | Adds StringView array round-trip tests. |
| r/tests/testthat/test-Table.R | Adds table/dictionary StringView and wider index tests. |
| r/tests/testthat/test-data-type.R | Adds StringView data type and code round-trip tests. |
| r/DESCRIPTION | Updates roxygen metadata. |
| r/man/data-type.Rd | Adds string_view() documentation entry. |
| r/man/acero.Rd | Regenerated Acero documentation. |
| r/man/arrow-package.Rd | Regenerated package author documentation. |
| r/man/csv_convert_options.Rd | Regenerated CSV conversion docs. |
| r/man/csv_read_options.Rd | Regenerated CSV read docs. |
| r/man/CsvReadOptions.Rd | Regenerated CSV read options docs. |
| r/man/enums.Rd | Regenerated enum documentation. |
| r/man/JsonFileFormat.Rd | Regenerated JSON file format docs. |
| r/man/reexports.Rd | Regenerated reexports docs. |
| r/man/vctrs_extension_array.Rd | Regenerated vctrs extension docs. |
Files not reviewed (10)
  • r/man/CsvReadOptions.Rd: Language not supported
  • r/man/JsonFileFormat.Rd: Language not supported
  • r/man/acero.Rd: Language not supported
  • r/man/arrow-package.Rd: Language not supported
  • r/man/csv_convert_options.Rd: Language not supported
  • r/man/csv_read_options.Rd: Language not supported
  • r/man/data-type.Rd: Language not supported
  • r/man/enums.Rd: Language not supported
  • r/man/reexports.Rd: Language not supported
  • r/man/vctrs_extension_array.Rd: Language not supported


Comment thread r/src/r_to_arrow.cpp Outdated
Comment thread r/man/acero.Rd Outdated
Copilot AI review requested due to automatic review settings May 28, 2026 12:11

Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 13 out of 23 changed files in this pull request and generated 1 comment.

Files not reviewed (10)
  • r/man/CsvReadOptions.Rd: Language not supported
  • r/man/JsonFileFormat.Rd: Language not supported
  • r/man/acero.Rd: Language not supported
  • r/man/arrow-package.Rd: Language not supported
  • r/man/csv_convert_options.Rd: Language not supported
  • r/man/csv_read_options.Rd: Language not supported
  • r/man/data-type.Rd: Language not supported
  • r/man/enums.Rd: Language not supported
  • r/man/reexports.Rd: Language not supported
  • r/man/vctrs_extension_array.Rd: Language not supported

Comment thread r/src/r_to_arrow.cpp Outdated
thisisnic and others added 2 commits June 10, 2026 16:07
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
thisisnic force-pushed the GH-45523-ipc-polars branch from a74c9c1 to 1842e7e June 10, 2026 15:10
Copilot AI review requested due to automatic review settings June 10, 2026 15:18

Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 12 out of 13 changed files in this pull request and generated no new comments.

Files not reviewed (1)
  • r/man/data-type.Rd: Language not supported

thisisnic (Member, Author) commented

The current CI failure is unrelated; see #50219.

thisisnic marked this pull request as ready for review June 18, 2026 15:01
thisisnic requested a review from jonkeane as a code owner June 18, 2026 15:01
Copilot AI review requested due to automatic review settings June 18, 2026 15:01
thisisnic requested a review from pitrou as a code owner June 18, 2026 15:01
thisisnic (Member, Author) commented

Lots of rounds with Claude, seems to make sense! I tested the code from the original issue and it works with this PR.

Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 12 out of 13 changed files in this pull request and generated 2 comments.

Files not reviewed (1)
  • r/man/data-type.Rd: Generated file

Comment thread python/pyarrow/src/arrow/python/python_to_arrow.cc
Comment thread r/src/array_to_vector.cpp Outdated
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 18, 2026 16:30

Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 12 out of 13 changed files in this pull request and generated 1 comment.

Files not reviewed (1)
  • r/man/data-type.Rd: Generated file

Comment thread r/src/array_to_vector.cpp
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 18, 2026 16:43

Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 12 out of 13 changed files in this pull request and generated no new comments.

Files not reviewed (1)
  • r/man/data-type.Rd: Generated file

pitrou (Member) left a comment

Thanks for doing this, @thisisnic. Some comments below, but the main question is: can we also add support for BinaryView? It should be relatively easy given all the work already done here.

DICTIONARY_CASE(LargeBinaryType);
DICTIONARY_CASE(StringType);
DICTIONARY_CASE(LargeStringType);
DICTIONARY_CASE(StringViewType);

Member

Why not also BinaryViewType?

Comment thread r/R/type.R
)
)
StringView <- R6Class(
"StringView",

Member

Is there a reason for calling it StringView instead of Utf8View (like the existing Utf8 and LargeUtf8)?

Comment thread r/R/type.R
code = function(namespace = FALSE) call2("binary", .ns = if (namespace) "arrow")
)
)
LargeBinary <- R6Class(

Member

Why not also add BinaryView here?

Comment thread r/src/array_to_vector.cpp
// StringViewArray uses a different memory layout (views + data buffers) rather
// than offsets, so skip the offset-based fast path and fall through to the
// GetView()-based element loop below.
if (array->type_id() != Type::STRING_VIEW) {

Member

Can use is_binary_view_like(array->type_id()) to also match BinaryView.

Comment thread r/src/array_to_vector.cpp
// than offsets, so skip the offset-based fast path and fall through to the
// GetView()-based element loop below.
if (array->type_id() != Type::STRING_VIEW) {
auto p_offset = array->data()->GetValues<int32_t>(1);

Member

So this doesn't work for LargeBinary and LargeString, which have 64-bit offsets, right?
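For context on the offset width: in the offset-based variable-length layout, value i spans [offsets[i], offsets[i+1]) in the data buffer, with 32-bit offsets for Binary/String and 64-bit offsets for LargeBinary/LargeString. The sketch below uses plain pointers and a hypothetical `ValueAt` helper (not Arrow's buffer API) to show a reader parameterized on the offset type, which is roughly what an offset-based fast path would need in order to cover both widths.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>

// Hypothetical helper: returns value i from an offset-based layout.
// OffsetT must match the array's actual offset width (int32_t or int64_t);
// reading a 64-bit offsets buffer through an int32_t pointer would
// misinterpret its contents.
template <typename OffsetT>
std::string_view ValueAt(const OffsetT* offsets, const char* data, std::size_t i) {
  const OffsetT begin = offsets[i];
  const OffsetT end = offsets[i + 1];
  assert(begin <= end);
  return std::string_view(data + begin, static_cast<std::size_t>(end - begin));
}
```

The same function body serves both widths; only the pointer type passed for `offsets` changes.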

Comment thread r/src/r_to_arrow.cpp
template <typename T>
struct RConverterTrait<T, enable_if_t<is_binary_view_like_type<T>::value &&
!is_string_view_type<T>::value>> {
// not implemented

Member

Why not implement it too? It should be reasonably easy given you already have the implementations for Binary and StringView.

Comment thread r/src/r_to_arrow.cpp

template <typename T>
class RPrimitiveConverter<T, enable_if_string_view<T>>
: public PrimitiveConverter<T, RConverter> {

Member

It seems there's a lot in common between this and the regular String converter (only UnsafeAppendUtf8Strings differs AFAICT). Perhaps it's worth factoring the shared parts out into a common base class?
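One possible shape for such a refactor, as a sketch only: the class names below are hypothetical and the bodies are toy stand-ins, not the R converter code. A CRTP base owns the shared path and calls a single hook, `AppendStrings`, which is the one step each derived class supplies.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

using Strings = std::vector<std::optional<std::string>>;

// Hypothetical shared base: everything common to both converters lives here.
template <typename Derived>
class StringConverterBase {
 public:
  void Extend(const Strings& values) {
    for (const auto& v : values) {
      if (!v) ++null_count_;  // shared bookkeeping
    }
    // The only step that differs between layouts.
    static_cast<Derived*>(this)->AppendStrings(values);
  }
  int null_count() const { return null_count_; }

 private:
  int null_count_ = 0;
};

// Toy offset-based layout: one data buffer plus running end offsets.
class OffsetStringConverter : public StringConverterBase<OffsetStringConverter> {
 public:
  void AppendStrings(const Strings& values) {
    assert(!offsets_.empty());
    for (const auto& v : values) {
      if (v) data_ += *v;
      offsets_.push_back(data_.size());
    }
  }
  std::string data_;
  std::vector<std::size_t> offsets_{0};
};

// Toy view-based layout: records one length per value.
class ViewStringConverter : public StringConverterBase<ViewStringConverter> {
 public:
  void AppendStrings(const Strings& values) {
    for (const auto& v : values) lengths_.push_back(v ? v->size() : 0);
  }
  std::vector<std::size_t> lengths_;
};
```

CRTP is just one option here; a virtual hook or a shared free function would express the same split.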

expect_array_roundtrip(c("itsy", NA, "spider"), utf8())
expect_array_roundtrip(c("itsy", NA, "spider"), large_utf8(), as = large_utf8())

expect_array_roundtrip(c("itsy", NA, "", "spider"), string_view(), as = string_view())

Member

String view arrays use a different representation for "short" strings (up to 12 bytes, stored inline) and "long" strings (more than 12 bytes). I suggest exercising both:

Suggested change
- expect_array_roundtrip(c("itsy", NA, "", "spider"), string_view(), as = string_view())
+ expect_array_roundtrip(c("itsy", NA, "", "spider"), string_view(), as = string_view())
+ expect_array_roundtrip(c("itsy", NA, "", "a long non-inlined string", "another long string"), string_view(), as = string_view())
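To make the short/long distinction concrete, here is a simplified model of the 16-byte view described in the Arrow columnar format: a 4-byte length, followed either by up to 12 bytes of inline data or by a 4-byte prefix, a 4-byte buffer index, and a 4-byte offset. `View`, `MakeView`, and `Read` are hypothetical names, and a single `std::string` stands in for the data buffers; this is not Arrow's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <string_view>

constexpr std::size_t kInlineSize = 12;

// Simplified 16-byte view: length + 12 payload bytes.
struct View {
  int32_t length;
  char payload[12];  // inline bytes, or prefix(4) + buffer index(4) + offset(4)
};
static_assert(sizeof(View) == 16, "a view occupies 16 bytes");

bool IsInline(const View& v) {
  return static_cast<std::size_t>(v.length) <= kInlineSize;
}

// Builds a view for s, spilling long strings into `buffer`.
View MakeView(std::string_view s, std::string& buffer, int32_t buffer_index) {
  View v{};  // zero-initialized, so unused inline bytes stay zero
  v.length = static_cast<int32_t>(s.size());
  if (s.size() <= kInlineSize) {
    if (!s.empty()) std::memcpy(v.payload, s.data(), s.size());
  } else {
    const int32_t offset = static_cast<int32_t>(buffer.size());
    buffer.append(s);
    std::memcpy(v.payload, s.data(), 4);           // 4-byte prefix
    std::memcpy(v.payload + 4, &buffer_index, 4);  // which data buffer
    std::memcpy(v.payload + 8, &offset, 4);        // where in that buffer
  }
  return v;
}

// Reads the string back; the result points into v or into buffer.
std::string_view Read(const View& v, const std::string& buffer) {
  const std::size_t n = static_cast<std::size_t>(v.length);
  if (IsInline(v)) return std::string_view(v.payload, n);
  int32_t offset = 0;
  std::memcpy(&offset, v.payload + 8, 4);
  return std::string_view(buffer.data() + offset, n);
}
```

Under this model "spider" stays inline, while "a long non-inlined string" (25 bytes) goes through the data buffer, so a test with only short strings never touches the second code path.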

expect_equal_data_frame(tab_large, fact)
})

test_that("Table converts dictionary arrays with wider index types back to R", {

Member

This test does not seem related, is it just adding a test that was overlooked before?

})

test_that("Table converts dictionary arrays with string_view values", {
expected <- data.frame(foo = factor(c("x", NA, "x")))

Member

Would also suggest testing with longer strings here.
