materialize-clickhouse: add ClickHouse materialization connector #4022
jacobmarble wants to merge 35 commits into main from …
Conversation
I'm a little concerned about the connector state variable. We need to guarantee monotonically increasing version numbers per record, and I'm just storing a version value in the state. What if multiple tasks implement a single materialization? Is that state object per-task, or have I introduced a race condition? It might be better to track the version per record; I'll consider that today.
Implements a Go materialization connector for ClickHouse using ReplacingMergeTree tables with INSERT-only upsert semantics. Key design choices:
- ReplacingMergeTree exclusively (no delta/MergeTree mode)
- Native bulk inserts via clickhouse-go PrepareBatch
- Includes offline tests, online tests against live ClickHouse, and full integration tests
bd21954 to 7afb1c1
Hard-delete rows now use the full record values from ConvertAll instead of synthesizing typed zero values per column. ReplacingMergeTree only needs _is_deleted=1 to hide rows from FINAL queries, so the actual column values in delete rows don't matter. This simplifies the code by removing the tombstoneValue helper and its type-specific zero map.
Replace the sequential single-batch approach with a map of per-binding batches. Rows are appended during the Store loop and all batches are flushed in the StartCommitFunc callback, allowing interleaved bindings and deferring the network send to commit time.
Remove connectorState and the incrementing version counter persisted in the Flow recovery log. Instead, always write _version=0 and rely on ReplacingMergeTree's documented tie-breaking rule: when ver is equal for multiple rows, the most recently inserted row wins. This eliminates the checkpoint JSON serialization, UnmarshalState logic, and the associated state management complexity.
Enable automatic background CLEANUP merges on ReplacingMergeTree tables via SETTINGS block in CREATE TABLE. This ensures deleted rows are eventually purged without requiring manual OPTIMIZE ... FINAL CLEANUP. Also adds a README documenting ClickHouse table mechanics, the ReplacingMergeTree approach for updates/deletes, and illustrated examples of merge behavior. Bumps docker-compose ClickHouse to 25.8.
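For illustration, the generated DDL might look roughly like the following sketch. The table and column names are hypothetical, and the exact SETTINGS names for enabling background CLEANUP merges are my best recollection of ClickHouse's merge-tree settings; they may differ by version.

```sql
-- Hypothetical shape of the generated DDL (names illustrative).
CREATE TABLE example_binding (
    id          UInt64,
    payload     String,
    _version    UInt64,  -- later replaced by flow_published_at
    _is_deleted UInt8
)
ENGINE = ReplacingMergeTree(_version, _is_deleted)
ORDER BY id
SETTINGS
    -- Assumed settings: force periodic merges so deleted rows are
    -- eventually purged without a manual OPTIMIZE ... FINAL CLEANUP.
    min_age_to_force_merge_seconds = 600,
    allow_experimental_replacing_merge_with_cleanup = 1;
```

Passing `_is_deleted` as the second engine argument is what lets FINAL queries hide delete rows, and the age-based merge setting is what makes cleanup eventually happen in the background.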
I've made some important changes since I opened the PR. Deleting records is simplified: rather than replace all record values with zero values, we set `_is_deleted = 1` and keep the full record values. The record version is always zero, so version reconciliation depends on the age of the part, not the value of the version. This shifts the logical clock burden to ClickHouse, where it likely doesn't add meaningful load. Added a README for humans.
This might be imprecise, since it just reflects my current understanding:
I wonder if it makes sense to only operate in delta-update mode. In this mode, there is no Load operation and we write all changes in the Store phase. You can see this mode in the filesink materializations, such as materialize-s3-parquet, and I think it can be enabled optionally for any materialization binding. The Load query could still have value; it would reduce the size of any deltas. So maybe it's a question of what the default mode is?
The binding is a number that identifies the binding, in case there is more than one collection being written. The document identifier is usually the … There also exists a …
Soft-delete is a tombstone and hard-delete is actually deleted. A normal OTLP materialization (with `flow_document`) will look something like this: … Then, if this record is deleted, `op` changes to `d`. I would think that to soft-delete we would just append the …
CI expects the container name materialize-clickhouse-db-1, but the service was named 'clickhouse', producing materialize-clickhouse-clickhouse-1.
The dedicated _version UInt64 column (always set to 0) is replaced by flow_published_at, a standard Estuary metadata timestamp that naturally increases with each transaction. This simplifies the schema and makes version ordering explicit rather than relying on part insertion order. Also corrects README record counts, FINAL behavior description (FINAL both deduplicates and filters deleted rows when is_deleted is configured), and removes the WHERE _is_deleted = 0 predicate from load queries since FINAL already handles it.
I've added batching limits today: flush batches at 100,000 records or one minute, whichever comes first.
Load keys are now bulk-inserted into a session-scoped temporary table via PrepareBatch, then joined against the target table to fetch existing documents. Each binding gets its own single-connection native pool to guarantee temp table visibility. This eliminates the GroupSet parameter binding limitation and supports much larger key batches (100k vs 1k).
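In SQL terms, the load flow described above might look roughly like this sketch (table and column names are hypothetical; the keys would be bulk-inserted via PrepareBatch rather than literal VALUES). Temporary tables in ClickHouse are session-scoped, which is why each binding needs its own single-connection pool.

```sql
-- Hypothetical session-scoped load flow (names illustrative).
CREATE TEMPORARY TABLE load_keys (id UInt64);

-- Keys are bulk-inserted into load_keys via PrepareBatch, then the
-- existing documents are fetched with a semi-join against the target:
SELECT id, flow_document
FROM example_binding FINAL
WHERE id IN (SELECT id FROM load_keys);
```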
Replace openDB, openNativeConn, and openNativeConnSingle with a single newClickhouseOptions() that returns *clickhouse.Options, letting callers configure pool settings and open connections directly.
The general structure of the connector looks good! Thank you!
I left some inline comments about specific parts, but two main issues need to be addressed:
Deletions
I think we should re-use the existing `_meta/op: "d"` instead of `_is_deleted` for deletions. One way to do this is:
- In Validate, the connector should signal that `_meta/op` is a column that must be materialized (this is called putting a "constraint" on a column). This means the connector asks that this column always be materialized and the user cannot disable / exclude it, so we are guaranteed to always be materializing this column.
- Update the code so that instead of checking for `_is_deleted`, we check for the `_meta/op` column being equal to the string `d` for deletions.
This may or may not work with ClickHouse; I'm not sure how flexible their system is for deletions, but let me know what you think.
Transactions
Currently the Store phase is inserting documents in batches, but there is no transactionality: i.e. if the Flow transaction fails at any point in time, the documents that have been inserted are not going to be rolled back. Further, documents should not be visible in the destination system until the whole transaction can be committed at the same time.
We most likely want to use the post-commit apply pattern in Clickhouse as well, so: stage inserts in a temporary table, and in Acknowledge move them over to the destination tables, or something along these lines. We can discuss more about this to better understand the constraints of Clickhouse and see how we can best achieve transactionality.
Co-authored-by: Mahdi Dibaiee <mdibaiee@pm.me>
ef0bed5 to c16836a
I can improve the … In order to atomically store more than one batch (100,000) of documents, ClickHouse does provide a method, but it involves writing to a staging table that lives on disk, then migrating the inserted partitions to the target table. To just …
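The staging-then-migrate approach mentioned above might look roughly like this sketch. Table names are hypothetical; `MOVE PARTITION ... TO TABLE` is ClickHouse's mechanism for moving data between tables with identical structure, and `tuple()` is the partition expression for an unpartitioned table.

```sql
-- Hypothetical post-commit apply flow (names illustrative).
-- 1. During Store, stage inserts into a table with the same
--    structure and partitioning as the target:
INSERT INTO example_binding_staging (id, payload, _is_deleted)
VALUES (1, 'hello', 0);

-- 2. In Acknowledge, atomically move the staged partition(s)
--    into the destination table:
ALTER TABLE example_binding_staging
    MOVE PARTITION tuple() TO TABLE example_binding;
```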
@jacobmarble regarding …
Description:
Adds materialize-clickhouse.
Closes #783
Workflow steps:
Configuration is similar to other materialize types.
Documentation links affected:
Documentation added in estuary/flow#2776
Notes for reviewers:
Quirks
ClickHouse is append-only, so updates and deletes cannot be implemented as in-place database-level operations. To update an existing record, we insert a new version (column `_version`) of the record (a record with the same primary key, aka ORDER BY key) and leave the old record as-is. To delete an existing record, we insert a tombstone record, which is simply a record with column `_is_deleted = 1`. In a background process, the ReplacingMergeTree engine removes duplicate records AND deleted records after a non-deterministic period of time. Until then, queries return all versions of the record.
To mitigate this, we add the `FINAL` qualifier to our own Load queries. This causes the query engine to deduplicate primary keys at query time, but does not cause it to remove deleted records from results. (Weird, right?) Hence, the `WHERE _is_deleted = 0` clause.
Please Review Closely
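Concretely, a Load query of the kind described here might look like this sketch (table, key, and column names are hypothetical):

```sql
-- FINAL deduplicates versions of the same key at query time;
-- the predicate then filters out tombstone rows.
SELECT flow_document
FROM example_binding FINAL
WHERE id = 42
  AND _is_deleted = 0;
```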
I'm uncertain about the way that soft delete and hard delete are coupled. What does the user expect when a document is soft-deleted?
Other materialize implementations' load queries fetch the binding as well as the document identifier. I'm not sure what these things are yet, so perhaps a good topic for 1:1 conversation.
Performance
The `Load()` method is relatively naive compared to others: it synchronously executes a SELECT query per loaded key. I've left it this way for now so that we can improve performance empirically when required.