Implement merge update for timeseries matching on the index #2781
base: vasil.pashov/feature/merge
Conversation
Force-pushed from 6447387 to 967b70c
ankerl::unordered_dense::set<entity::position_t> unique_values_for_string_column(const Column& column) {
    ankerl::unordered_dense::set<entity::position_t> output_set;
    // Guessing that unique values is a third of the column length
    // TODO would be useful to have actual unique count here from stats
    static auto map_reserve_ratio = ConfigsMap::instance()->get_int("UniqueColumns.AllocationRatio", 3);
    output_set.reserve(column.row_count() / map_reserve_ratio);

    details::visit_type(column.type().data_type(), [&](auto col_desc_tag) {
        using type_info = ScalarTypeInfo<decltype(col_desc_tag)>;
        if constexpr (is_sequence_type(type_info::data_type)) {
            arcticdb::for_each<typename type_info::TDT>(column, [&output_set](auto value) {
                output_set.emplace(value);
            });
        } else {
            util::raise_rte("Column {} is not a string type column");
        }
    });
    return output_set;
}
This was moved from cpp/arcticdb/column_store/segment_utils.hpp without any changes
inline ankerl::unordered_dense::set<entity::position_t> unique_values_for_string_column(const Column& column) {
    ankerl::unordered_dense::set<entity::position_t> output_set;
    // Guessing that unique values is a third of the column length
    // TODO would be useful to have actual unique count here from stats
    static auto map_reserve_ratio = ConfigsMap::instance()->get_int("UniqueColumns.AllocationRatio", 3);
    output_set.reserve(column.row_count() / map_reserve_ratio);

    details::visit_type(column.type().data_type(), [&](auto col_desc_tag) {
        using type_info = ScalarTypeInfo<decltype(col_desc_tag)>;
        if constexpr (is_sequence_type(type_info::data_type)) {
            arcticdb::for_each<typename type_info::TDT>(column, [&output_set](auto value) {
                output_set.emplace(value);
            });
This is moved inside cpp/arcticdb/column_store/segment_utils.cpp without any changes
while (target_index_it != target_index_end && source_row < source_row_end) {
    const timestamp source_ts = source_->index_value_at(source_row);
    // TODO: Profile and compare to linear or adaptive (linear below some threshold) search
    auto target_match_it = std::lower_bound(target_index_it, target_index_end, source_ts);
This solution is O(source_rows * log(target_rows)), which in the fairly common case where all rows match becomes O(N*log(N)).
We can do better with a search built on the same idea as binary lifting, but for an array.
I.e. we can do a search similar to a binary search, but we start by probing offsets that are powers of 2 away from target_index_it. E.g. we check:
target_index_it + 1
target_index_it + 2
target_index_it + 4
target_index_it + 8
target_index_it + 16  # say here we finally get a larger value, so we continue with a regular binary search between offsets 8 and 16
This approach is nice because its complexity is O(source_rows * log(target_rows / source_rows)) or something similar, which is guaranteed to be <= O(target_rows).
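A minimal sketch of that galloping (exponential) lower bound, assuming random-access iterators; the function name is illustrative, not from the PR:

    #include <algorithm>
    #include <iterator>

    template <typename RandomIt, typename T>
    RandomIt gallop_lower_bound(RandomIt first, RandomIt last, const T& value) {
        using diff_t = typename std::iterator_traits<RandomIt>::difference_type;
        const diff_t n = last - first;
        diff_t bound = 1;
        // Probe offsets 1, 2, 4, 8, ... until we overshoot the value or the range end.
        while (bound < n && *(first + bound) < value) {
            bound *= 2;
        }
        // The answer now lies in [first + bound/2, first + min(bound, n)].
        return std::lower_bound(first + bound / 2, first + std::min(bound, n), value);
    }

Each lookup costs O(log(gap)) comparisons, where gap is the distance advanced, which is what yields the O(source_rows * log(target_rows / source_rows)) bound above.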
I have a ticket for performance improvements: 10655963947.
I'm adding this there. I'll also leave a link to the ticket in the code.
It's worse actually, as target_index_it is not a random access iterator (although we do have one)
Another optimisation would be to decrement target_index_end until it is just >= source_row_end, to further reduce the search space
It's worse actually, as target_index_it is not a random access iterator (although we do have one)

I didn't know about this one. Does it really need the dynamic dispatch that folly::Poly has? It looks like it's going to have overhead in hot loops. Fortunately it's not used anywhere; I suggest removing folly::Poly and the base and using this.

Another optimisation would be to decrement target_index_end until it is just >= source_row_end, to further reduce the search space

Wouldn't this make its worst case linear?
cpp/arcticdb/processing/clause.hpp
Outdated
/// This clause will perform update values or insert values based on strategy_ in a segment. The source of new values is
/// the source_ member. Source and target must have the same index type. There are two actions
/// UPDATE: For a particular row in the segment if there's a row source_ for which all values in the columns listed in
For a particular row in the segment if there's a row source_
Nit: "row in source_"
cpp/arcticdb/processing/clause.cpp
Outdated
template<typename T>
void select(std::span<const size_t> indexes_to_keep, std::vector<T>& vec) {
    arcticdb::debug::check<arcticdb::ErrorCode::E_ASSERTION_FAILURE>(
            std::ranges::is_sorted(indexes_to_keep),
Nit: it needs to be strictly sorted, e.g. {0, 0, 1} would not work.
Could use std::adjacent_find(std::greater_equal{}) == std::end but that's less readable and probably not worth it for a debug check anyway.
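A minimal sketch of the suggested strictness check; a range is strictly increasing iff no adjacent pair satisfies lhs >= rhs:

    #include <algorithm>
    #include <cstddef>
    #include <functional>
    #include <span>

    bool is_strictly_sorted(std::span<const size_t> indexes) {
        // {0, 0, 1} fails here but would pass std::ranges::is_sorted.
        return std::adjacent_find(indexes.begin(), indexes.end(), std::greater_equal<>{}) ==
               indexes.end();
    }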
Yeah, it never crossed my mind that someone could put repeated indexes into such a function. I like the adjacent_find idea.
EDIT: I just looked at how debug checks are implemented and I'll remove this check. Due to the ifdef, the debug check has an empty body in the release build, but the standard mandates that the arguments must still be evaluated. Compilers probably optimize this away, but there are no guarantees and I definitely don't want this check to be executed.
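A sketch of the evaluation concern (hypothetical names): a regular function must evaluate its arguments even when its body compiles to nothing, whereas a macro can drop the expression entirely:

    #include <cassert>

    inline void debug_check_fn(bool cond, const char* msg) {
    #ifndef NDEBUG
        assert(cond && msg);
    #endif
        (void)cond;
        (void)msg;
    }
    // debug_check_fn(std::ranges::is_sorted(v), "..."); // is_sorted still runs in release

    #ifdef NDEBUG
    #define DEBUG_CHECK(cond, msg) ((void)0) // condition never evaluated
    #else
    #define DEBUG_CHECK(cond, msg) assert((cond) && (msg))
    #endif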
            new_entity_id++;
        }
    }
    ranges_and_keys.erase(ranges_and_keys.begin() + new_entity_id, ranges_and_keys.end());
What do you think about using join_view instead of writing this loop again, like:

select(row_slices_to_keep, offsets);
auto ranges_to_keep = std::ranges::join_view(offsets);
return select(ranges_to_keep, ranges_and_keys);
Note that the inner loop also performs in-place reindexing, which is lost with the suggested approach. ranges_and_keys are the global slices to be added (since this can only be the first clause, the input ranges_and_keys is the index), and offsets is "local": each vector is an index into the filtered ranges_and_keys.
debug::check<ErrorCode::E_ASSERTION_FAILURE>(
        new_row >= source_data_row_, "Cannot move SourceColumnIterator backwards"
);
// TODO: Implement operator+= because std::advance is linear
Maybe it would be good to open tickets for some of the TODOs introduced in this PR?
Yeah, most of the TODOs have tickets, but it's a good idea to leave the ticket id in the code as well. The ticket for this is 10754574651.
// If the source frame is a list of tensors, the index frame is kept separately, so the first
// non-index column will be at index 0. If there's an index the first ColRange will start from
// 1 and the first column of each segment will contain the index
return col_range.first + index_in_range - 2 * index_field_count;
TODO: Come back to this
What do you mean?
I suspect he means to come back and figure out what this is doing, cos I also have no clue
This indexes into the source descriptor, which is not sliced, using the column-sliced indexes.
Example:
Descriptor: index col0 col1 col2 col3
Cols per slice = 2
Column slices will be [{1, 3}, {3, 5}]
The target column slice for the second slice will be index|col2|col3
The for loop goes column_idx = source_descriptor.index().field_count(); column_idx < columns_in_range
where const size_t columns_in_range = target_segment.num_columns(), which is 3,
so we have for (column_idx = 1; column_idx < 3; ...)
To get the correct source index (keep in mind the source is not sliced) we need to take into account that the column slices start from 1 (to account for the index) and that, for each column slice, column_idx also starts from 1 to account for the index. Thus we need to subtract 2 * index size to get the correct index into the source.
This can be done differently so that the loop is col_idx = col_range.start; col_idx < col_range.end; this would simplify indexing into the source but would require similar re-indexing to get the correct column from the target.
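A worked version of that arithmetic as a self-checking sketch; the helper name is hypothetical, only the formula comes from the diff:

    #include <cassert>
    #include <cstddef>

    // Global descriptor: index, col0, col1, col2, col3. The source tensor list
    // has no index column, so col0 lives at source position 0.
    size_t source_position(size_t col_range_first, size_t column_idx, size_t index_field_count) {
        // Subtract index_field_count once because col_range is offset past the
        // index in the global descriptor, and once more because column_idx
        // restarts after the index inside every sliced segment.
        return col_range_first + column_idx - 2 * index_field_count;
    }

    int main() {
        // Second column slice {3, 5} holds index|col2|col3, so column_idx runs 1..2.
        assert(source_position(3, 1, 1) == 2); // col2 is the third source tensor
        assert(source_position(3, 2, 1) == 3); // col3 is the fourth source tensor
    }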
for (size_t segment_idx = 0; segment_idx < target_segments.size(); ++segment_idx) {
    // TODO: Implement equivalent QueryBuilder.optimise_for_speed. Initial implementation will always drop the
    // current string pool and recreate it from scratch taking account the new string data from source. This is
    // equivalent to QueryBuilder.optimise_for_memory.
I'm not sure how useful optimise_for_speed will be for things we write back to storage. Yes, the MergeClause will be faster because we only need to append to the existing StringPool, but the encoding, decoding, and IO in WriteClause, and reading what was written, will be slower because they have to process a needlessly long StringPool.
Thinking more: there are definitely use cases where optimise_for_speed would make sense, e.g. when updating a categorical column the string pool wouldn't change.
We just need to caveat optimise_for_speed: if used repeatedly it will gradually erode read performance by piling up unused strings.
I'm going to open a ticket and we can evaluate it later 10754806837
There are other considerations here. With data read from disk, there is no fast way to answer the question "does this string pool contain this string?" (i.e. the map_ member is not populated). By appending values from the source into the pool, the values in it would no longer be unique, which is likely to cause headaches down the road.
I'd like to keep strings in the pool unique, and to have no "dead" values in string pools on disk initially.
cpp/arcticdb/processing/clause.cpp
Outdated
const bool next_segment_starts_with_last_used_source_index =
        next_segment_range.first == source_->index_value_at(last_value_first_occurrence);
if (index_value_spans_two_segments && next_segment_starts_with_last_used_source_index) {
    source_row = last_value_first_occurrence;
Note that this can end up O(num_target_row_slices * num_source_rows) when all index values are equal to each other.
This will probably rarely be a problem (unless the target data is very fragmented). We can resolve it by iterating over the unique index value ranges in the source instead.
Might be worth adding a small comment explaining that this is quadratic in certain edge cases, in case we need to fix it in the future.
Turned out to be a small change. I added it to the PR
cpp/arcticdb/processing/clause.cpp
Outdated
/// Decide which rows of should be updated and which rows from source should be inserted.
/// 1. If there's a timestamp index use MergeUpdateClause::filter_index_match this will produce a vector of size equal
///    to the number of rows from the source that fall into the processed slice. Each vector will contain a vector of
///    row-indexes in target that match the corresponding soruce index value.
Nit: Typo soruce
cpp/arcticdb/processing/clause.cpp
Outdated
} else {
    internal::check<ErrorCode::E_ASSERTION_FAILURE>(
            source_->has_tensors(), "Input frame does not contain neither a segment nor tensors"
    );
Nit: I'd maybe find it easier to read if written as:

if (has_segment) {
    // do something with segment
} else if (has_tensors()) {
    // do something with tensors
} else {
    internal::raise<E_ASSERTION_FAILURE>(...);
}
cpp/arcticdb/processing/clause.cpp
Outdated
std::span<const size_t> rows_to_update_for_source_row =
        rows_to_update[source_row - source_row_start];
if (rows_to_update_for_source_row.empty()) {
    ++source_row;
Nit: Maybe use for (; source_row < source_row_end; ++source_row) and not increment it in two places?
Also, I think this if is not really needed. If rows_to_update_for_source_row.empty(), the for loop below wouldn't do anything, and just incrementing the iterator every time wouldn't hurt.
| "Fixed string sequences are not supported for merge update" | ||
| ); | ||
| } else if constexpr (is_dynamic_string_type(TDT::data_type())) { | ||
| std::optional<ScopedGILLock> gil_lock; |
Maybe we can copy-paste the doc on the optional scoped GIL from here. It took me a bit to figure out why we were initializing a nullopt gil_lock.
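For context, a sketch of the deferred-lock pattern behind that nullopt initialization; ScopedGILLock stands in for the project's RAII GIL guard and the rest of the names are illustrative:

    #include <optional>
    #include <vector>

    struct ScopedGILLock {
        ScopedGILLock() { /* acquire the GIL */ }
        ~ScopedGILLock() { /* release the GIL */ }
    };

    void process_rows(const std::vector<bool>& row_needs_python) {
        std::optional<ScopedGILLock> gil_lock; // nullopt: GIL not held yet
        for (bool needs_python : row_needs_python) {
            if (needs_python && !gil_lock) {
                gil_lock.emplace(); // take the GIL lazily, only on first real need
            }
            // ... process the row ...
        }
    } // ~optional releases the GIL here, but only if it was ever taken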
arcticdb::ColumnData cd = segment.column_data(i);
for (auto it = cd.begin<TDT>(); it != cd.end<TDT>(); ++it) {
    if constexpr (std::same_as<typename TDT::DataTypeTag::raw_type, int8_t>) {
        out = fmt::format_to(out, "{} ", i, int(*it));
Why do we convert the int8_t to int, which is probably int32_t?
int8 values are printed as letters instead of numbers. Using the {:n} formatter is the proper way to do this. Updated.
I think uint8_t are also printed as characters?
I fixed this by using {:n} and it's committed in the PR now.
constexpr static bool is_sequence = is_sequence_type(col_type_info::data_type);
if constexpr (is_input_string_like && is_sequence) {
    // Clang has a bug where it the for_each is just regular range-based for the constexpr if will not
    // the body of the if even if the condition is false. This leads to compile time errors because it tries
Nit: Typos: Maybe meant to say:
// Clang has a bug where IF the for_each is just regular range-based for the constexpr if will COMPILE
// the body of the if even if the condition is false. This leads to compile time errors because it tries
// to call set_string with non-string values.
// https://stackoverflow.com/questions/79817660/discarded-branch-of-c-constexpr-if-fails-compilation-because-it-calls-non-matc
lots of typos indeed :(
        std::is_same_v<typename col_type_info::RawType, InputValueType>,
        "Type mismatch when setting data for Column[{}]. Column data type is {}.",
        column_index,
        col_type_info::data_type
Nit: Could also include typeid(InputValueType).name() in the error message.
typeid(...).name() is implementation-defined and differs across compilers and compiler versions, but I guess it's better than nothing.
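A small illustration of the caveat, with behaviour that varies by implementation:

    #include <iostream>
    #include <typeinfo>

    int main() {
        std::cout << typeid(int).name() << '\n';    // "i" on GCC/Clang, "int" on MSVC
        std::cout << typeid(double).name() << '\n'; // "d" on GCC/Clang, "double" on MSVC
    }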
template<
        std::ranges::sized_range IndexCols, typename ColumnSlice, std::ranges::sized_range CurrentCol,
        std::ranges::sized_range... RestCols>
Out of curiosity: does adding these heavily templatized methods impact build time?
I guess it should be fine as long as we don't use these utilities to do anything crazy like creating sliced segments with 100 columns.
E.g. this recursive method would need to be compiled N times for an N-column segment.
Yeah, it'll most definitely impact the build time, especially if we start using them widely. It's hard to measure now because it's used only in one TU, and most of the tests have the same descriptor and thus mostly the same types, so the template is not instantiated for each separate test.
if (descriptor.fields().size() <= cols_per_segment) {
    return std::vector{descriptor};
}
const size_t num_segments = descriptor.fields().size() / cols_per_segment;
Shouldn't we round up? E.g. with 3 fields and 2 cols per segment we'll end up dropping the last field?
True
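A minimal sketch of the round-up fix being agreed to here (the helper name is illustrative):

    #include <cstddef>

    size_t num_segments_for(size_t num_fields, size_t cols_per_segment) {
        // Ceiling division: 3 fields with 2 cols per segment -> 2 segments,
        // so the last field is no longer dropped.
        return (num_fields + cols_per_segment - 1) / cols_per_segment;
    }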
        std::vector<SliceAndKey>&& new_slices, std::unique_ptr<proto::descriptors::UserDefinedMetadata>&& user_meta
) {
    ranges::sort(new_slices, [](const SliceAndKey& a, const SliceAndKey& b) {
        return std::tie(a.slice_.col_range, a.slice_.row_range) < std::tie(b.slice_.col_range, b.slice_.row_range);
Doesn't SliceAndKey already have an ordering operator implemented here?
It compares only the start column of the column slice and the start row of the row slice, and doesn't take into account the last col and row. That doesn't matter for pure updates but could matter for inserts. I'll use the one in SliceAndKey as you suggest because it's easier on the eye, but I might need to add a custom comparator in the next PR.
std::vector<SliceAndKey> merged_ranges_and_keys;
auto new_slice = new_slices.begin();
for (SliceAndKey& slice : pipeline_context.slice_and_keys_) {
    if (new_slice != new_slices.end() && new_slice->slice_ == slice.slice_) {
Maybe adding a TODO here for when we implement insert would be good. Insert would potentially produce new slices. Also, depending on how we decide to handle defragmentation on merge, we might need to do something more special.
With insert we would also need to adjust the row range on pre-existing slices.
Yeah, insert will require changes here. I thought it was obvious but I'll leave a TODO.
@@ -1 +1 @@
-Subproject commit 2dc54dfc5af9fb973860a38e4245ae0063740988
+Subproject commit 7220a4eebf20515cdd9c34721e15ca082bae9038
Why was a vcpkg submodule update needed?
The target branch is not master but a bit older. This vcpkg is the latest master. We can ignore vcpkg differences for now. I'll make sure all is good before merging into master
if constexpr (std::same_as<typename TDT::DataTypeTag::raw_type, int8_t>) {
    out = fmt::format_to(out, "{} ", i, int(*it));
} else {
    out = fmt::format_to(out, "{} ", i, *it);
Could add a comment that string columns, bools, sparse data, etc are not handled properly yet
Fixed #2781 (comment)
}

template<typename FormatContext>
constexpr auto format(const arcticdb::SegmentInMemory& segment, FormatContext& ctx) const {
This would be dangerous to accidentally include in production code, it would be crazy slow for any non-trivially sized data. Could we move it to some test-only header instead?
Discussed offline that it can be beneficial during debugging. I added a compile-time macro controlled by a CMake variable to enable/disable this; it's off by default. I also changed build_steps.yml to enable it for tests.
TypeDescriptor() : data_type_(DataType::UINT8), dimension_(Dimension::Dim0) {}

static TypeDescriptor scalar_type(DataType type) { return TypeDescriptor(type, Dimension::Dim0); }
There's already a function called something like make_scalar_type that does this
Turned out it's not even used :(
template<typename Given, typename Base>
concept decays_to = std::same_as<std::decay_t<Given>, Base>;

template<typename ReturnType, typename Functor, typename... Args>
concept invocable_r = std::is_invocable_r_v<ReturnType, Functor, Args...>;
I personally find this kind of aliasing unhelpful
I no longer need invocable_r and I'll remove it. However, this is the way the standard library handles type traits and concepts; it's widely accepted now (e.g. std::is_same_v/std::same_as, std::is_floating_point_v/std::floating_point, etc...).
Guessing this isn't to be left in
self,
symbol: str,
data: NormalizableType,
strategy: MergeStrategy,
Suggested change:
-    strategy: MergeStrategy,
+    strategy: MergeStrategy = MergeStrategy(),
/// It has two template specialisations for each way an InputFrame can store its data. It's supposed to be used only to
/// perform forward iteration.
template<typename TDT, typename T>
class SourceView {};
I'm really confused by this. If it's designed for forward iteration, then why doesn't it implement ++? It seems to be used as a weird random accessor by calling set_row and then dereferencing it?
It's hard to speculatively abstract for a future implementation. Given that this stuff just isn't implemented for Arrow yet, maybe we should start without all the abstractions, and then have a separate ticket to make it work with Arrow sources?
I might have missed them, but I didn't see any larger Python tests for the source overwriting a set of disjoint segments, and the index key structure being correct afterwards?
Reference Issues/PRs
Monday https://man312219.monday.com/boards/7852509418/views/168855452
What does this implement or fix?
This is the first part of the merge update functionality. It supports only updating timeseries that contain no string columns.
MergeUpdate is implemented as a clause in the processing pipeline which uses the read-modify-write codepath. This clause can only appear as the first clause in a processing pipeline.
Overview of the algorithm
1.1. Row range indexes require a full table scan.
1.2. In the case of a timestamp index this filters out only the ranges and keys whose index span contains at least one value from the source's index. This does not mean that there's a match, only that a match is possible. A crucial assumption is that the source is ordered. This means that after ranges_and_keys are ordered by row slice we can perform only forward iteration over the source index to find matches.
2.2. Filter by index. If the data has a timestamp index, binary search can be used to find the matching rows: MergeUpdateClause::filter_index_match produces a vector of size equal to the number of rows from the source that fall into the processed slice. Each element is a vector of row-indexes in the target that match the corresponding source index value. If the nested vector is empty, an insertion must happen (not implemented). A sketch of this step follows after the list.
2.3. (NOT IMPLEMENTED) If MergeUpdateClause::on_ is not empty, all columns listed in it must match for a particular row for it to be marked as matching. For each column in MergeUpdateClause::on_, iterate over the vector of vectors produced in the previous step, checking for a match only the target rows that are in the inner vector. If there is no match for this particular column, remove the target row index. This means that the ordering of the columns in MergeUpdateClause::on_ matters, and it would be more efficient to start with the columns that have the least chance of matching.
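A hedged sketch of the matching step described in 2.2, assuming both indexes are sorted; this is illustrative, not the PR's code:

    #include <algorithm>
    #include <cstdint>
    #include <span>
    #include <vector>

    using timestamp = std::int64_t;

    // For each source index value, collect the target row indexes with an equal
    // timestamp. Only forward movement over both sorted inputs is needed.
    std::vector<std::vector<size_t>> filter_index_match_sketch(
            std::span<const timestamp> target_index, std::span<const timestamp> source_index) {
        std::vector<std::vector<size_t>> matches(source_index.size());
        auto target_it = target_index.begin();
        for (size_t source_row = 0; source_row < source_index.size(); ++source_row) {
            const timestamp ts = source_index[source_row];
            // Binary search the remaining target range for the first row >= ts.
            target_it = std::lower_bound(target_it, target_index.end(), ts);
            // An equal run means updates; leaving the vector empty signals an insert.
            for (auto it = target_it; it != target_index.end() && *it == ts; ++it) {
                matches[source_row].push_back(static_cast<size_t>(it - target_index.begin()));
            }
        }
        return matches;
    }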
Testing utilities
This PR also adds utility functions which can be used to create segments and input frames easily.
create_dense_segment takes a stream descriptor and an argument pack where the i-th argument in the pack corresponds to the i-th field in the descriptor. It then creates a pre-allocated dense segment out of the data in the pack. The data in the pack can be any sequence with a finite size (e.g. std::view, std::array, std::vector), with the slight optimization that if the range is contiguous it'll use memcpy, otherwise std::copy. All ranges in the pack must have the same length.

slice_data_into_segments takes a descriptor, rows per segment, cols per segment, and a pack of arguments. It operates similarly to create_dense_segment but returns a tuple of 3 vectors: the segments (with no more than rows_per_segment rows and cols_per_segment cols) and the corresponding row slice and col slice for each segment. The returned segments are in column-major order (first all the segments for the first col slice, next for the second col slice, etc...).

input_frame_from_tensors is similar in concept to create_dense_segment but creates an InputFrame of NativeTensors. The only gotcha is that the NativeTensors are non-owning; that is why this function materializes all views. The second element in the return value is all the data that the NativeTensors point to. If it's a view it's materialized, otherwise it's just forwarded.

Any other comments?
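A hypothetical usage of create_dense_segment based on the description above; the exact signature and the descriptor variable are assumptions, not the PR's API:

    std::vector<int64_t> index_vals{0, 1, 2};
    std::vector<double> prices{1.0, 2.0, 3.0};
    std::array<int32_t, 3> volumes{10, 20, 30};
    // The i-th pack argument fills the i-th field of the descriptor, so all
    // ranges must have the same length; contiguous ranges are copied with memcpy.
    SegmentInMemory segment = create_dense_segment(descriptor, index_vals, prices, volumes);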
Checklist
Checklist for code changes...