
Clickhouse/sink: Add Exactly-Once Semantics to ClickHouse Target #146

Open
laskoviymishka opened this issue Dec 19, 2024 · 3 comments · May be fixed by #147
Labels
enhancement New feature or request

Comments

@laskoviymishka
Contributor

laskoviymishka commented Dec 19, 2024

Add Exactly-Once Semantics to ClickHouse Target

Feature Request

Implement exactly-once delivery semantics for the ClickHouse target in the Transfer project. This ensures that data processed by the Transfer pipeline is delivered to ClickHouse without duplication or loss, even in the presence of failures or retries.


Motivation

ClickHouse is widely used for analytics workloads, where data consistency and accuracy are critical. Supporting exactly-once semantics in the ClickHouse target will:

  1. Eliminate Duplicate Data: Prevent data duplication caused by retries or failures.
  2. Ensure Data Integrity: Guarantee that all records are processed once and only once.
  3. Improve Fault Tolerance: Handle failures like crashes, rebalances, or network issues gracefully.

Proposed Approach

Adopt techniques similar to the Kafka Connect Exactly-Once Delivery model:

  1. Transactional Writes:
  2. Offset Management:
    • Store offsets alongside the written data in ClickHouse, using _partition and _offset columns to track processed records (see the sketch after this list).
    • Ensure that offsets are committed atomically with the data.
  3. Distributed Coordination:
    • Use a distributed lock mechanism (e.g., ZooKeeper, etcd) to ensure only one instance of the sinker processes a partition at a time during rebalances.
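
A minimal Go sketch of the offset-aware write path from item 2: the destination table carries _partition and _offset columns, the sinker looks up the highest committed offset on startup, and replayed rows at or below it are skipped. The "events" table, the column names, the helpers, and the registered "clickhouse" database/sql driver are illustrative assumptions, not the Transfer project's actual code.

```go
// Minimal sketch of the offset-aware write path, assuming database/sql with a
// ClickHouse driver registered under the name "clickhouse" (e.g. clickhouse-go).
// The "events" table, its _partition/_offset columns, and these helpers are
// illustrative only.
package sink

import (
	"context"
	"database/sql"
	"fmt"
	"strings"
)

// Row is one record taken from a source partition.
type Row struct {
	Partition int32
	Offset    int64
	Payload   string
}

// lastCommittedOffset returns the highest offset already written for a source
// partition (or -1 if nothing was written), so the sinker knows where to
// resume after a crash.
func lastCommittedOffset(ctx context.Context, db *sql.DB, partition int32) (int64, error) {
	var (
		cnt    uint64
		offset int64
	)
	err := db.QueryRowContext(ctx,
		"SELECT count(), max(_offset) FROM events WHERE _partition = ?", partition).
		Scan(&cnt, &offset)
	if err != nil {
		return 0, err
	}
	if cnt == 0 {
		return -1, nil // nothing delivered yet
	}
	return offset, nil
}

// writeBatch inserts rows together with their source coordinates in a single
// INSERT block. On a retry the same rows are either filtered out by the
// committed-offset check or collapse under ClickHouse insert block
// deduplication, keeping the write idempotent.
func writeBatch(ctx context.Context, db *sql.DB, rows []Row, committed int64) error {
	var (
		placeholders []string
		args         []any
	)
	for _, r := range rows {
		if r.Offset <= committed {
			continue // already delivered before the restart
		}
		placeholders = append(placeholders, "(?, ?, ?)")
		args = append(args, r.Partition, r.Offset, r.Payload)
	}
	if len(placeholders) == 0 {
		return nil // whole batch was a replay
	}
	query := fmt.Sprintf("INSERT INTO events (_partition, _offset, payload) VALUES %s",
		strings.Join(placeholders, ", "))
	_, err := db.ExecContext(ctx, query, args...)
	return err
}
```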

Testing

Add test cases to verify exactly-once semantics:

  1. Crash Recovery:
    • Simulate a crash during data writing and verify no duplicates exist after restart (a test sketch follows this list).
  2. Concurrent Processing:
    • Simulate a Kafka rebalance causing concurrent processing by two sinkers and ensure no duplicates or data loss.
  3. Offset Rewind:
    • Rewind offsets and reprocess data, ensuring idempotent writes.
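
A sketch of the crash-recovery check as a Go test, reusing the hypothetical Row/writeBatch/lastCommittedOffset helpers from the sketch above and assuming a test ClickHouse instance reachable via a CLICKHOUSE_TEST_DSN environment variable (an assumed name, not an existing project setting):

```go
// Sketch of the crash-recovery check: deliver the same batch twice (as if the
// sinker crashed before acknowledging it) and assert the row count did not grow.
package sink

import (
	"context"
	"database/sql"
	"os"
	"testing"
)

func TestWriteBatchIsIdempotent(t *testing.T) {
	dsn := os.Getenv("CLICKHOUSE_TEST_DSN")
	if dsn == "" {
		t.Skip("no test ClickHouse configured")
	}
	db, err := sql.Open("clickhouse", dsn)
	if err != nil {
		t.Fatal(err)
	}
	defer db.Close()

	ctx := context.Background()
	batch := []Row{
		{Partition: 0, Offset: 0, Payload: "a"},
		{Partition: 0, Offset: 1, Payload: "b"},
	}

	// Deliver twice: the second pass simulates a redelivery after a crash that
	// happened before the source offsets were acknowledged.
	for i := 0; i < 2; i++ {
		committed, err := lastCommittedOffset(ctx, db, 0)
		if err != nil {
			t.Fatal(err)
		}
		if err := writeBatch(ctx, db, batch, committed); err != nil {
			t.Fatal(err)
		}
	}

	var rows uint64
	if err := db.QueryRowContext(ctx,
		"SELECT count() FROM events WHERE _partition = 0").Scan(&rows); err != nil {
		t.Fatal(err)
	}
	if rows != uint64(len(batch)) {
		t.Fatalf("expected %d rows after redelivery, got %d", len(batch), rows)
	}
}
```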

Additional Notes

  • This feature may require additional configuration so that users who do not need exactly-once semantics are unaffected, ensuring backward compatibility.
  • It requires that ClickHouse has the KeeperMap table engine enabled.
laskoviymishka added a commit that referenced this issue Dec 19, 2024
@laskoviymishka laskoviymishka linked a pull request Dec 19, 2024 that will close this issue
@laskoviymishka laskoviymishka added the enhancement New feature or request label Dec 19, 2024
@BorisTyshkevich

The transaction support mentioned for ClickHouse has been in early beta for more than a year and is not expected to be production-ready in a cluster environment for the foreseeable future, so it should not be the basis for exactly-once functionality.

The approach based on idempotent inserts and block deduplication is pretty solid and is used for the same purpose in tools like the ClickHouse Kafka Sink Connector (Java).

Moreover, if you use Keeper, it is the best clustered transactional store for source-table offsets, instead of a source-DB table or an S3 bucket. It is also universal across different source connectors. For exactly-once, it is better to store offsets in the destination DB, not the source DB.

The KeeperMap engine lets you manage that data through standard SQL commands (SELECT/INSERT/ALTER) without a separate Keeper connection. The setting keeper_map_strict_mode=1 makes an UPDATE transactional.
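
A rough sketch of such a KeeperMap-backed offset store, again assuming database/sql with a ClickHouse driver; the Keeper path, table, column, and function names are illustrative, not an existing API of Transfer or any client library:

```go
// Rough sketch of a KeeperMap-backed offset store. All names are illustrative.
package sink

import (
	"context"
	"database/sql"
)

// ensureOffsetTable creates the offset store once. KeeperMap keeps the rows in
// (ClickHouse) Keeper, so they survive restarts and are shared across replicas.
// KeeperMap supports a single primary-key column, hence the combined key.
func ensureOffsetTable(ctx context.Context, db *sql.DB) error {
	_, err := db.ExecContext(ctx, `
		CREATE TABLE IF NOT EXISTS transfer_offsets (
			source_id String,  -- e.g. "topic:partition"
			offset    Int64
		)
		ENGINE = KeeperMap('/transfer/offsets')
		PRIMARY KEY source_id`)
	return err
}

// loadOffset returns the last committed offset for a source, or -1 if none yet.
func loadOffset(ctx context.Context, db *sql.DB, sourceID string) (int64, error) {
	var offset int64
	err := db.QueryRowContext(ctx,
		"SELECT offset FROM transfer_offsets WHERE source_id = ?", sourceID).Scan(&offset)
	if err == sql.ErrNoRows {
		return -1, nil
	}
	return offset, err
}

// commitOffset advances the stored offset after a successful data insert. With
// keeper_map_strict_mode=1 the update is checked against the value's version in
// Keeper (compare-and-swap style), so a stale concurrent sinker fails instead
// of silently overwriting a newer offset. The very first commit for a new
// source needs a plain INSERT instead; omitted here for brevity.
func commitOffset(ctx context.Context, db *sql.DB, sourceID string, offset int64) error {
	_, err := db.ExecContext(ctx, `
		ALTER TABLE transfer_offsets
		UPDATE offset = ? WHERE source_id = ?
		SETTINGS keeper_map_strict_mode = 1`, offset, sourceID)
	return err
}
```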

@laskoviymishka
Contributor Author

Yes, storing the offset in the destination DB inside a Keeper-backed table (see here) would let us deliver this feature without waiting for transaction support in ClickHouse itself.

@BorisTyshkevich

Great! I still propose using keeper_map_strict_mode=1 for better atomicity of operations.
