Skip to content

Conversation

@JigaoLuo
Copy link
Contributor

@JigaoLuo JigaoLuo commented Jun 8, 2025

Description

For issue #18967, this PR is one part of merging the PR Draft #18968. In this PR, almost all rmm::device_scalar calls in libcudf are replaced with cudf::detail::device_scalar due to its internal host-pinned bounce buffer.

This is also a call to action to use host-pinned memory globally in libcudf, with arguments stated in #18967 and #18968.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@JigaoLuo JigaoLuo requested a review from a team as a code owner June 8, 2025 13:10
@JigaoLuo JigaoLuo requested review from nvdbaranec and vuule June 8, 2025 13:10
@copy-pr-bot
Copy link

copy-pr-bot bot commented Jun 8, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Jun 8, 2025
@JigaoLuo
Copy link
Contributor Author

JigaoLuo commented Jun 8, 2025

I also updated DEVELOPER_GUIDE.md to promote the use of cudf::detail::device_scalar. We can discuss expanding this section further. I recommend replacing all instances of rmm::device_scalar with cudf::detail::device_scalar across libcudf as soon as feasible.

So, when is such replacement not feasible?

Only one instance of rmm::device_scalar exists outside cpp/include/cudf/detail/device_scalar.hpp is here:

rmm::device_scalar<cuda::std::atomic_flag> needs_global_memory_fallback(stream);
auto global_set_ref = global_set.ref(cuco::op::insert_and_find);
compute_mapping_indices(grid_size,
num_rows,
global_set_ref,
row_bitmask,
skip_rows_with_nulls,
local_mapping_index.data(),
global_mapping_index.data(),
block_cardinality.data(),
needs_global_memory_fallback.data(),
stream);
cuda::std::atomic_flag h_needs_fallback;
// Cannot use `device_scalar::value` as it requires a copy constructor, which
// `atomic_flag` doesn't have.
CUDF_CUDA_TRY(cudaMemcpyAsync(&h_needs_fallback,

Replacing it is not feasible as cuda::std::atomic_flag has no copy constructor (also noted in code comment), but such a copy constructor is needed by the cudf::detail::device_scalar's bounce buffer.

The error log if this replacement is attempted:
/home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/detail/internal_functional.h(306): error: function "cuda::std::__4::atomic_flag::atomic_flag(const cuda::std::__4::atomic_flag &)" (declared at line 640 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/libcudacxx/../../../libcudacxx/include/cuda/std/atomic) cannot be referenced -- it is a deleted function
        : exemplar(x)
                  ^
          detected during:
            instantiation of "thrust::detail::uninitialized_fill_functor<T>::uninitialized_fill_functor(const T &) [with T=cuda::std::__4::atomic_flag]" at line 89 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/system/detail/generic/uninitialized_fill.inl
            instantiation of "ForwardIterator thrust::system::detail::generic::detail::uninitialized_fill_n(thrust::execution_policy<DerivedPolicy> &, ForwardIterator, Size, const T &, thrust::detail::false_type) [with DerivedPolicy=thrust::system::cpp::detail::tag, ForwardIterator=cuda::std::__4::atomic_flag *, Size=std::size_t, T=cuda::std::__4::atomic_flag]" at line 115 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/system/detail/generic/uninitialized_fill.inl
            instantiation of "ForwardIterator thrust::system::detail::generic::uninitialized_fill_n(thrust::execution_policy<DerivedPolicy> &, ForwardIterator, Size, const T &) [with DerivedPolicy=thrust::system::cpp::detail::tag, ForwardIterator=cuda::std::__4::atomic_flag *, Size=std::size_t, T=cuda::std::__4::atomic_flag]" at line 55 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/detail/uninitialized_fill.inl
            instantiation of "ForwardIterator thrust::uninitialized_fill_n(const thrust::detail::execution_policy_base<DerivedPolicy> &, ForwardIterator, Size, const T &) [with DerivedPolicy=thrust::system::cpp::detail::tag, ForwardIterator=cuda::std::__4::atomic_flag *, Size=std::size_t, T=cuda::std::__4::atomic_flag]" at line 86 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/detail/allocator/value_initialize_range.inl
            instantiation of "thrust::detail::disable_if<thrust::detail::allocator_traits_detail::needs_default_construct_via_allocator<Allocator, thrust::detail::pointer_element<Pointer>::type>::value, void>::type thrust::detail::allocator_traits_detail::value_initialize_range(Allocator &, Pointer, Size) [with Allocator=cudf::detail::rmm_host_allocator<cuda::std::__4::atomic_flag>, Pointer=cuda::std::__4::atomic_flag *, Size=std::size_t]" at line 94 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/detail/allocator/value_initialize_range.inl
            [ 5 instantiation contexts not shown ]
            instantiation of "cudf::detail::host_vector<T>::host_vector(size_t, const cudf::detail::rmm_host_allocator<T> &) [with T=cuda::std::__4::atomic_flag]" at line 486 of /home/jluo/cudf-dev/cpp/include/cudf/detail/utilities/vector_factories.hpp
            instantiation of "cudf::detail::host_vector<T> cudf::detail::make_pinned_vector_async<T>(size_t, rmm::cuda_stream_view) [with T=cuda::std::__4::atomic_flag]" at line 502 of /home/jluo/cudf-dev/cpp/include/cudf/detail/utilities/vector_factories.hpp
            instantiation of "cudf::detail::host_vector<T> cudf::detail::make_pinned_vector<T>(size_t, rmm::cuda_stream_view) [with T=cuda::std::__4::atomic_flag]" at line 57 of /home/jluo/cudf-dev/cpp/include/cudf/detail/device_scalar.hpp
            instantiation of "cudf::detail::device_scalar<T>::device_scalar(rmm::cuda_stream_view, rmm::device_async_resource_ref) [with T=cuda::std::__4::atomic_flag]" at line 108 of /home/jluo/cudf-dev/cpp/src/groupby/hash/compute_aggregations.cuh
            instantiation of "rmm::device_uvector<cudf::size_type> cudf::groupby::detail::hash::compute_aggregations(int64_t, __nv_bool, const cudf::bitmask_type *, SetType &, cudf::host_span<const cudf::groupby::aggregation_request, 18446744073709551615UL>, cudf::detail::result_cache *, rmm::cuda_stream_view) [with SetType=cudf::groupby::detail::hash::global_set_t]" at line 21 of /home/jluo/cudf-dev/cpp/src/groupby/hash/compute_aggregations.cu

/home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/detail/internal_functional.h(306): error: function "cuda::std::__4::atomic_flag::atomic_flag(const cuda::std::__4::atomic_flag &)" (declared at line 640 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/libcudacxx/../../../libcudacxx/include/cuda/std/atomic) cannot be referenced -- it is a deleted function
        : exemplar(x)
                  ^
          detected during:
            instantiation of "thrust::detail::uninitialized_fill_functor<T>::uninitialized_fill_functor(const T &) [with T=cuda::std::__4::atomic_flag]" at line 89 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/system/detail/generic/uninitialized_fill.inl
            instantiation of "ForwardIterator thrust::system::detail::generic::detail::uninitialized_fill_n(thrust::execution_policy<DerivedPolicy> &, ForwardIterator, Size, const T &, thrust::detail::false_type) [with DerivedPolicy=thrust::system::cpp::detail::tag, ForwardIterator=cuda::std::__4::atomic_flag *, Size=std::size_t, T=cuda::std::__4::atomic_flag]" at line 115 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/system/detail/generic/uninitialized_fill.inl
            instantiation of "ForwardIterator thrust::system::detail::generic::uninitialized_fill_n(thrust::execution_policy<DerivedPolicy> &, ForwardIterator, Size, const T &) [with DerivedPolicy=thrust::system::cpp::detail::tag, ForwardIterator=cuda::std::__4::atomic_flag *, Size=std::size_t, T=cuda::std::__4::atomic_flag]" at line 55 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/detail/uninitialized_fill.inl
            instantiation of "ForwardIterator thrust::uninitialized_fill_n(const thrust::detail::execution_policy_base<DerivedPolicy> &, ForwardIterator, Size, const T &) [with DerivedPolicy=thrust::system::cpp::detail::tag, ForwardIterator=cuda::std::__4::atomic_flag *, Size=std::size_t, T=cuda::std::__4::atomic_flag]" at line 86 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/detail/allocator/value_initialize_range.inl
            instantiation of "thrust::detail::disable_if<thrust::detail::allocator_traits_detail::needs_default_construct_via_allocator<Allocator, thrust::detail::pointer_element<Pointer>::type>::value, void>::type thrust::detail::allocator_traits_detail::value_initialize_range(Allocator &, Pointer, Size) [with Allocator=cudf::detail::rmm_host_allocator<cuda::std::__4::atomic_flag>, Pointer=cuda::std::__4::atomic_flag *, Size=std::size_t]" at line 94 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/detail/allocator/value_initialize_range.inl
            [ 5 instantiation contexts not shown ]
            instantiation of "cudf::detail::host_vector<T>::host_vector(size_t, const cudf::detail::rmm_host_allocator<T> &) [with T=cuda::std::__4::atomic_flag]" at line 486 of /home/jluo/cudf-dev/cpp/include/cudf/detail/utilities/vector_factories.hpp
            instantiation of "cudf::detail::host_vector<T> cudf::detail::make_pinned_vector_async<T>(size_t, rmm::cuda_stream_view) [with T=cuda::std::__4::atomic_flag]" at line 502 of /home/jluo/cudf-dev/cpp/include/cudf/detail/utilities/vector_factories.hpp
            instantiation of "cudf::detail::host_vector<T> cudf::detail::make_pinned_vector<T>(size_t, rmm::cuda_stream_view) [with T=cuda::std::__4::atomic_flag]" at line 57 of /home/jluo/cudf-dev/cpp/include/cudf/detail/device_scalar.hpp
            instantiation of "cudf::detail::device_scalar<T>::device_scalar(rmm::cuda_stream_view, rmm::device_async_resource_ref) [with T=cuda::std::__4::atomic_flag]" at line 108 of /home/jluo/cudf-dev/cpp/src/groupby/hash/compute_aggregations.cuh
            instantiation of "rmm::device_uvector<cudf::size_type> cudf::groupby::detail::hash::compute_aggregations(int64_t, __nv_bool, const cudf::bitmask_type *, SetType &, cudf::host_span<const cudf::groupby::aggregation_request, 18446744073709551615UL>, cudf::detail::result_cache *, rmm::cuda_stream_view) [with SetType=cudf::groupby::detail::hash::nullable_global_set_t]" at line 21 of /home/jluo/cudf-dev/cpp/src/groupby/hash/compute_aggregations_null.cu

/home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/detail/internal_functional.h(311): error: function "cuda::std::__4::atomic_flag::atomic_flag(const cuda::std::__4::atomic_flag &)" (declared at line 640 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/libcudacxx/../../../libcudacxx/include/cuda/std/atomic) cannot be referenced -- it is a deleted function
        : exemplar(other.exemplar)
                  ^
          detected during:
            instantiation of "thrust::detail::uninitialized_fill_functor<T>::uninitialized_fill_functor(const thrust::detail::uninitialized_fill_functor<T> &) [with T=cuda::std::__4::atomic_flag]" at line 64 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/system/detail/sequential/for_each.h
            instantiation of "InputIterator thrust::system::detail::sequential::for_each_n(thrust::system::detail::sequential::execution_policy<DerivedPolicy> &, InputIterator, Size, UnaryFunction) [with DerivedPolicy=thrust::system::cpp::detail::tag, InputIterator=cuda::std::__4::atomic_flag *, Size=std::size_t, UnaryFunction=thrust::detail::uninitialized_fill_functor<cuda::std::__4::atomic_flag>]" at line 66 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/detail/for_each.inl
            instantiation of "InputIterator thrust::for_each_n(const thrust::detail::execution_policy_base<DerivedPolicy> &, InputIterator, Size, UnaryFunction) [with DerivedPolicy=thrust::system::cpp::detail::tag, InputIterator=cuda::std::__4::atomic_flag *, Size=std::size_t, UnaryFunction=thrust::detail::uninitialized_fill_functor<cuda::std::__4::atomic_flag>]" at line 89 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/system/detail/generic/uninitialized_fill.inl
            instantiation of "ForwardIterator thrust::system::detail::generic::detail::uninitialized_fill_n(thrust::execution_policy<DerivedPolicy> &, ForwardIterator, Size, const T &, thrust::detail::false_type) [with DerivedPolicy=thrust::system::cpp::detail::tag, ForwardIterator=cuda::std::__4::atomic_flag *, Size=std::size_t, T=cuda::std::__4::atomic_flag]" at line 115 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/system/detail/generic/uninitialized_fill.inl
            instantiation of "ForwardIterator thrust::system::detail::generic::uninitialized_fill_n(thrust::execution_policy<DerivedPolicy> &, ForwardIterator, Size, const T &) [with DerivedPolicy=thrust::system::cpp::detail::tag, ForwardIterator=cuda::std::__4::atomic_flag *, Size=std::size_t, T=cuda::std::__4::atomic_flag]" at line 55 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/detail/uninitialized_fill.inl
            [ 7 instantiation contexts not shown ]
            instantiation of "cudf::detail::host_vector<T>::host_vector(size_t, const cudf::detail::rmm_host_allocator<T> &) [with T=cuda::std::__4::atomic_flag]" at line 486 of /home/jluo/cudf-dev/cpp/include/cudf/detail/utilities/vector_factories.hpp
            instantiation of "cudf::detail::host_vector<T> cudf::detail::make_pinned_vector_async<T>(size_t, rmm::cuda_stream_view) [with T=cuda::std::__4::atomic_flag]" at line 502 of /home/jluo/cudf-dev/cpp/include/cudf/detail/utilities/vector_factories.hpp
            instantiation of "cudf::detail::host_vector<T> cudf::detail::make_pinned_vector<T>(size_t, rmm::cuda_stream_view) [with T=cuda::std::__4::atomic_flag]" at line 57 of /home/jluo/cudf-dev/cpp/include/cudf/detail/device_scalar.hpp
            instantiation of "cudf::detail::device_scalar<T>::device_scalar(rmm::cuda_stream_view, rmm::device_async_resource_ref) [with T=cuda::std::__4::atomic_flag]" at line 108 of /home/jluo/cudf-dev/cpp/src/groupby/hash/compute_aggregations.cuh
            instantiation of "rmm::device_uvector<cudf::size_type> cudf::groupby::detail::hash::compute_aggregations(int64_t, __nv_bool, const cudf::bitmask_type *, SetType &, cudf::host_span<const cudf::groupby::aggregation_request, 18446744073709551615UL>, cudf::detail::result_cache *, rmm::cuda_stream_view) [with SetType=cudf::groupby::detail::hash::global_set_t]" at line 21 of /home/jluo/cudf-dev/cpp/src/groupby/hash/compute_aggregations.cu

/home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/detail/internal_functional.h(311): error: function "cuda::std::__4::atomic_flag::atomic_flag(const cuda::std::__4::atomic_flag &)" (declared at line 640 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/libcudacxx/../../../libcudacxx/include/cuda/std/atomic) cannot be referenced -- it is a deleted function
        : exemplar(other.exemplar)
                  ^
          detected during:
            instantiation of "thrust::detail::uninitialized_fill_functor<T>::uninitialized_fill_functor(const thrust::detail::uninitialized_fill_functor<T> &) [with T=cuda::std::__4::atomic_flag]" at line 64 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/system/detail/sequential/for_each.h
            instantiation of "InputIterator thrust::system::detail::sequential::for_each_n(thrust::system::detail::sequential::execution_policy<DerivedPolicy> &, InputIterator, Size, UnaryFunction) [with DerivedPolicy=thrust::system::cpp::detail::tag, InputIterator=cuda::std::__4::atomic_flag *, Size=std::size_t, UnaryFunction=thrust::detail::uninitialized_fill_functor<cuda::std::__4::atomic_flag>]" at line 66 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/detail/for_each.inl
            instantiation of "InputIterator thrust::for_each_n(const thrust::detail::execution_policy_base<DerivedPolicy> &, InputIterator, Size, UnaryFunction) [with DerivedPolicy=thrust::system::cpp::detail::tag, InputIterator=cuda::std::__4::atomic_flag *, Size=std::size_t, UnaryFunction=thrust::detail::uninitialized_fill_functor<cuda::std::__4::atomic_flag>]" at line 89 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/system/detail/generic/uninitialized_fill.inl
            instantiation of "ForwardIterator thrust::system::detail::generic::detail::uninitialized_fill_n(thrust::execution_policy<DerivedPolicy> &, ForwardIterator, Size, const T &, thrust::detail::false_type) [with DerivedPolicy=thrust::system::cpp::detail::tag, ForwardIterator=cuda::std::__4::atomic_flag *, Size=std::size_t, T=cuda::std::__4::atomic_flag]" at line 115 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/system/detail/generic/uninitialized_fill.inl
            instantiation of "ForwardIterator thrust::system::detail::generic::uninitialized_fill_n(thrust::execution_policy<DerivedPolicy> &, ForwardIterator, Size, const T &) [with DerivedPolicy=thrust::system::cpp::detail::tag, ForwardIterator=cuda::std::__4::atomic_flag *, Size=std::size_t, T=cuda::std::__4::atomic_flag]" at line 55 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/detail/uninitialized_fill.inl
            [ 7 instantiation contexts not shown ]
            instantiation of "cudf::detail::host_vector<T>::host_vector(size_t, const cudf::detail::rmm_host_allocator<T> &) [with T=cuda::std::__4::atomic_flag]" at line 486 of /home/jluo/cudf-dev/cpp/include/cudf/detail/utilities/vector_factories.hpp
            instantiation of "cudf::detail::host_vector<T> cudf::detail::make_pinned_vector_async<T>(size_t, rmm::cuda_stream_view) [with T=cuda::std::__4::atomic_flag]" at line 502 of /home/jluo/cudf-dev/cpp/include/cudf/detail/utilities/vector_factories.hpp
            instantiation of "cudf::detail::host_vector<T> cudf::detail::make_pinned_vector<T>(size_t, rmm::cuda_stream_view) [with T=cuda::std::__4::atomic_flag]" at line 57 of /home/jluo/cudf-dev/cpp/include/cudf/detail/device_scalar.hpp
            instantiation of "cudf::detail::device_scalar<T>::device_scalar(rmm::cuda_stream_view, rmm::device_async_resource_ref) [with T=cuda::std::__4::atomic_flag]" at line 108 of /home/jluo/cudf-dev/cpp/src/groupby/hash/compute_aggregations.cuh
            instantiation of "rmm::device_uvector<cudf::size_type> cudf::groupby::detail::hash::compute_aggregations(int64_t, __nv_bool, const cudf::bitmask_type *, SetType &, cudf::host_span<const cudf::groupby::aggregation_request, 18446744073709551615UL>, cudf::detail::result_cache *, rmm::cuda_stream_view) [with SetType=cudf::groupby::detail::hash::nullable_global_set_t]" at line 21 of /home/jluo/cudf-dev/cpp/src/groupby/hash/compute_aggregations_null.cu

/home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/detail/internal_functional.h(320): error: function "cuda::std::__4::atomic_flag::atomic_flag(const cuda::std::__4::atomic_flag &)" (declared at line 640 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/libcudacxx/../../../libcudacxx/include/cuda/std/atomic) cannot be referenced -- it is a deleted function
      ::new (static_cast<void*>(&x)) T(exemplar);
                                      ^
          detected during:
            instantiation of "void thrust::detail::uninitialized_fill_functor<T>::operator()(T &) [with T=cuda::std::__4::atomic_flag]" at line 44 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/detail/function.h
            instantiation of "Result thrust::detail::wrapped_function<Function, Result>::operator()(Ts &&...) const [with Function=thrust::detail::uninitialized_fill_functor<cuda::std::__4::atomic_flag>, Result=void, Ts=<cuda::std::__4::atomic_flag &>]" at line 70 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/system/detail/sequential/for_each.h
            instantiation of "InputIterator thrust::system::detail::sequential::for_each_n(thrust::system::detail::sequential::execution_policy<DerivedPolicy> &, InputIterator, Size, UnaryFunction) [with DerivedPolicy=thrust::system::cpp::detail::tag, InputIterator=cuda::std::__4::atomic_flag *, Size=std::size_t, UnaryFunction=thrust::detail::uninitialized_fill_functor<cuda::std::__4::atomic_flag>]" at line 66 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/detail/for_each.inl
            instantiation of "InputIterator thrust::for_each_n(const thrust::detail::execution_policy_base<DerivedPolicy> &, InputIterator, Size, UnaryFunction) [with DerivedPolicy=thrust::system::cpp::detail::tag, InputIterator=cuda::std::__4::atomic_flag *, Size=std::size_t, UnaryFunction=thrust::detail::uninitialized_fill_functor<cuda::std::__4::atomic_flag>]" at line 89 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/system/detail/generic/uninitialized_fill.inl
            instantiation of "ForwardIterator thrust::system::detail::generic::detail::uninitialized_fill_n(thrust::execution_policy<DerivedPolicy> &, ForwardIterator, Size, const T &, thrust::detail::false_type) [with DerivedPolicy=thrust::system::cpp::detail::tag, ForwardIterator=cuda::std::__4::atomic_flag *, Size=std::size_t, T=cuda::std::__4::atomic_flag]" at line 115 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/system/detail/generic/uninitialized_fill.inl
            [ 8 instantiation contexts not shown ]
            instantiation of "cudf::detail::host_vector<T>::host_vector(size_t, const cudf::detail::rmm_host_allocator<T> &) [with T=cuda::std::__4::atomic_flag]" at line 486 of /home/jluo/cudf-dev/cpp/include/cudf/detail/utilities/vector_factories.hpp
            instantiation of "cudf::detail::host_vector<T> cudf::detail::make_pinned_vector_async<T>(size_t, rmm::cuda_stream_view) [with T=cuda::std::__4::atomic_flag]" at line 502 of /home/jluo/cudf-dev/cpp/include/cudf/detail/utilities/vector_factories.hpp
            instantiation of "cudf::detail::host_vector<T> cudf::detail::make_pinned_vector<T>(size_t, rmm::cuda_stream_view) [with T=cuda::std::__4::atomic_flag]" at line 57 of /home/jluo/cudf-dev/cpp/include/cudf/detail/device_scalar.hpp
            instantiation of "cudf::detail::device_scalar<T>::device_scalar(rmm::cuda_stream_view, rmm::device_async_resource_ref) [with T=cuda::std::__4::atomic_flag]" at line 108 of /home/jluo/cudf-dev/cpp/src/groupby/hash/compute_aggregations.cuh
            instantiation of "rmm::device_uvector<cudf::size_type> cudf::groupby::detail::hash::compute_aggregations(int64_t, __nv_bool, const cudf::bitmask_type *, SetType &, cudf::host_span<const cudf::groupby::aggregation_request, 18446744073709551615UL>, cudf::detail::result_cache *, rmm::cuda_stream_view) [with SetType=cudf::groupby::detail::hash::global_set_t]" at line 21 of /home/jluo/cudf-dev/cpp/src/groupby/hash/compute_aggregations.cu

/home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/detail/internal_functional.h(320): error: function "cuda::std::__4::atomic_flag::atomic_flag(const cuda::std::__4::atomic_flag &)" (declared at line 640 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/libcudacxx/../../../libcudacxx/include/cuda/std/atomic) cannot be referenced -- it is a deleted function
      ::new (static_cast<void*>(&x)) T(exemplar);
                                      ^
          detected during:
            instantiation of "void thrust::detail::uninitialized_fill_functor<T>::operator()(T &) [with T=cuda::std::__4::atomic_flag]" at line 44 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/detail/function.h
            instantiation of "Result thrust::detail::wrapped_function<Function, Result>::operator()(Ts &&...) const [with Function=thrust::detail::uninitialized_fill_functor<cuda::std::__4::atomic_flag>, Result=void, Ts=<cuda::std::__4::atomic_flag &>]" at line 70 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/system/detail/sequential/for_each.h
            instantiation of "InputIterator thrust::system::detail::sequential::for_each_n(thrust::system::detail::sequential::execution_policy<DerivedPolicy> &, InputIterator, Size, UnaryFunction) [with DerivedPolicy=thrust::system::cpp::detail::tag, InputIterator=cuda::std::__4::atomic_flag *, Size=std::size_t, UnaryFunction=thrust::detail::uninitialized_fill_functor<cuda::std::__4::atomic_flag>]" at line 66 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/detail/for_each.inl
            instantiation of "InputIterator thrust::for_each_n(const thrust::detail::execution_policy_base<DerivedPolicy> &, InputIterator, Size, UnaryFunction) [with DerivedPolicy=thrust::system::cpp::detail::tag, InputIterator=cuda::std::__4::atomic_flag *, Size=std::size_t, UnaryFunction=thrust::detail::uninitialized_fill_functor<cuda::std::__4::atomic_flag>]" at line 89 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/system/detail/generic/uninitialized_fill.inl
            instantiation of "ForwardIterator thrust::system::detail::generic::detail::uninitialized_fill_n(thrust::execution_policy<DerivedPolicy> &, ForwardIterator, Size, const T &, thrust::detail::false_type) [with DerivedPolicy=thrust::system::cpp::detail::tag, ForwardIterator=cuda::std::__4::atomic_flag *, Size=std::size_t, T=cuda::std::__4::atomic_flag]" at line 115 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/system/detail/generic/uninitialized_fill.inl
            [ 8 instantiation contexts not shown ]
            instantiation of "cudf::detail::host_vector<T>::host_vector(size_t, const cudf::detail::rmm_host_allocator<T> &) [with T=cuda::std::__4::atomic_flag]" at line 486 of /home/jluo/cudf-dev/cpp/include/cudf/detail/utilities/vector_factories.hpp
            instantiation of "cudf::detail::host_vector<T> cudf::detail::make_pinned_vector_async<T>(size_t, rmm::cuda_stream_view) [with T=cuda::std::__4::atomic_flag]" at line 502 of /home/jluo/cudf-dev/cpp/include/cudf/detail/utilities/vector_factories.hpp
            instantiation of "cudf::detail::host_vector<T> cudf::detail::make_pinned_vector<T>(size_t, rmm::cuda_stream_view) [with T=cuda::std::__4::atomic_flag]" at line 57 of /home/jluo/cudf-dev/cpp/include/cudf/detail/device_scalar.hpp
            instantiation of "cudf::detail::device_scalar<T>::device_scalar(rmm::cuda_stream_view, rmm::device_async_resource_ref) [with T=cuda::std::__4::atomic_flag]" at line 108 of /home/jluo/cudf-dev/cpp/src/groupby/hash/compute_aggregations.cuh
            instantiation of "rmm::device_uvector<cudf::size_type> cudf::groupby::detail::hash::compute_aggregations(int64_t, __nv_bool, const cudf::bitmask_type *, SetType &, cudf::host_span<const cudf::groupby::aggregation_request, 18446744073709551615UL>, cudf::detail::result_cache *, rmm::cuda_stream_view) [with SetType=cudf::groupby::detail::hash::nullable_global_set_t]" at line 21 of /home/jluo/cudf-dev/cpp/src/groupby/hash/compute_aggregations_null.cu

3 errors detected in the compilation of "/home/jluo/cudf-dev/cpp/src/groupby/hash/compute_aggregations.cu".
gmake[2]: *** [CMakeFiles/cudf.dir/build.make:1296: CMakeFiles/cudf.dir/src/groupby/hash/compute_aggregations.cu.o] Error 1
gmake[2]: *** Waiting for unfinished jobs....
3 errors detected in the compilation of "/home/jluo/cudf-dev/cpp/src/groupby/hash/compute_aggregations_null.cu".
gmake[2]: *** [CMakeFiles/cudf.dir/build.make:1311: CMakeFiles/cudf.dir/src/groupby/hash/compute_aggregations_null.cu.o] Error 1
gmake[1]: *** [CMakeFiles/Makefile2:1353: CMakeFiles/cudf.dir/all] Error 2
gmake: *** [Makefile:166: all] Error 2

Non-Parquet-related Changes

Changes in these following files do not impact Parquet:

  • cpp/include/cudf_test/nanoarrow_utils.hpp
  • cpp/src/copying/get_element.cu
  • cpp/src/join/sort_merge_join.cu
  • and the two test files

Replacements were still made for consistency with the DEVELOPER_GUIDE.md update. A global replacement makes more sense to me and better to fit the changed DEVELOPER_GUIDE.md. We can also discuss. Thanks

@vuule vuule added Performance Performance related issue non-breaking Non-breaking change labels Jun 9, 2025
@vuule
Copy link
Contributor

vuule commented Jun 9, 2025

/ok to test 04e5415

@vuule vuule added the improvement Improvement / enhancement to an existing function label Jun 9, 2025
Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR!
Some of the changes need to be reverted, the rest looks good 👍
Do you know which of these caused pageable copies in the Parquet reader?

@JigaoLuo

This comment was marked as resolved.

…uh (and other changes for linker and compiler)"

This reverts commit 7f0785a.
@vyasr
Copy link
Contributor

vyasr commented Jun 11, 2025

Addressing rapidsai/rmm#1955 would help improve this issue with rmm directly (we shouldn't block this PR on that making that change though, it's a long-term suggestion for improvement).

@JigaoLuo JigaoLuo requested a review from vuule June 17, 2025 06:53
@vuule
Copy link
Contributor

vuule commented Aug 8, 2025

@vuule Would you please describe the next steps before we are ready to merge?

I think the code changes can be merged as they are now, I'm just waiting for @JigaoLuo to add the change described here to proceed with the (final) reviews.

@ttnghia
Copy link
Contributor

ttnghia commented Aug 8, 2025

I think the code changes can be merged as they are now

I think this should better wait until #19608 merges in, then rebase before merging.

@JigaoLuo JigaoLuo force-pushed the no-miss-sync-cudf-devicescalar branch from 23b400c to be5c1f4 Compare August 9, 2025 07:48
@JigaoLuo
Copy link
Contributor Author

JigaoLuo commented Aug 9, 2025

I’ve reverted the previous changes in DEVELOPER_GUIDE.md.

Additionally, I ran a similar profiling to #18968 (comment) and observed a slight drop in the number of mis-sync:

The command:

for t in 1 2 4 8 16 32 64 128; 
do 
  ./PARQUET_MULTITHREAD_READER_NVBENCH -d 0 -b 0 --axis num_cols=32 --axis run_length=2 --axis total_data_size=$((1024 * 1024 * 128 * t)) --axis num_threads=$t --axis num_iterations=10 --csv <PATH>;
done

# then with nsys profile

nsys export -t sqlite report{thread-case}.nsys-rep

nsys analyze -r cuda_memcpy_async:rows=-1 report{thread-case}.sqlite| wc -l

The current branch 25.10: Before this PR

$ nsys analyze -r cuda_memcpy_async:rows=-1 report_upstream_8threads.sqlite | wc -l
63181

With this PR

$ nsys analyze -r cuda_memcpy_async:rows=-1 report_PR_8threads.sqlite | wc -l
63021

@JigaoLuo
Copy link
Contributor Author

JigaoLuo commented Aug 9, 2025

The target number we're aiming for is the one mentioned in the draft PR comment : ~301, based on the same binary benchmark run.

I don’t recall the exact impact of each individual change on the reduction of mis-syncs, but from what I observed, the thrust reduction seemed to trigger the most mis-synchronizations. There are also a lot of calls of thrust reduction.

Apologies for the delay—I wasn’t able to find time over the past two days. I also have a major paper deadline coming up, so I want to apologize in advance for any slow responses.

@JigaoLuo
Copy link
Contributor Author

JigaoLuo commented Aug 9, 2025

Now I turn this PR Draft into a PR. But let's wait for the #19608

@JigaoLuo JigaoLuo marked this pull request as ready for review August 9, 2025 09:21
rapids-bot bot pushed a commit that referenced this pull request Aug 11, 2025
Fixes the `cudf::reduction::detail::reduce` internal utility to use the returned `cudf::scalar` instances directly in the CUB calls to simplify the logic.
This should help solve the issues for building/running #19119 -- the device-scalar ctors are no longer required.

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Tianyu Liu (https://github.com/kingcrimsontianyu)
  - Muhammad Haseeb (https://github.com/mhaseeb123)
  - Nghia Truong (https://github.com/ttnghia)

URL: #19608
@davidwendt
Copy link
Contributor

Ok, #19608 is merge. I think this will be good to go once the merge conflict is resolved here.

@JigaoLuo
Copy link
Contributor Author

Thanks for the notification. I’ve resolved the conflict.
Before merging, I’d also like to request a code review for the changes made in DEVELOPER_GUIDE.md.

@rapidsai rapidsai deleted a comment from copy-pr-bot bot Aug 12, 2025
@vuule
Copy link
Contributor

vuule commented Aug 12, 2025

/ok to test 6708c96

Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for iterating on this.

@JigaoLuo
Copy link
Contributor Author

JigaoLuo commented Aug 12, 2025

Thanks again for your help—really appreciate it!

Also, big thanks to everyone here, including RMM forks.

@rapidsai rapidsai deleted a comment from copy-pr-bot bot Aug 12, 2025
@davidwendt
Copy link
Contributor

/ok to test 5fc3411

@vuule
Copy link
Contributor

vuule commented Aug 12, 2025

/merge

@rapids-bot rapids-bot bot merged commit 6a7134c into rapidsai:branch-25.10 Aug 12, 2025
90 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Performance Performance related issue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants