Mysql/source: binlog position/starttime logic #209
Comments
I've thought about this, but I'm not sure about the end-user flow. Any ideas on how it would be used from the end user's perspective?
A common use case: sync between MySQL and ClickHouse fails in production at midnight. The binlog is huge, and replaying it from the beginning is not an effective way to restore and resync state. With the suggested functionality, we could restart replication and sync from midnight.
Yeah, but how should this be operated at the ops level? As an extra option?
Since incremental snapshots may differ, or not be available at all, from provider to provider, I would for now add a time tag or pointer as an optional parameter of the MySQL connector, set in the source connector options of the transfer config. It would also be nice to log the actual value when a data transfer finishes or fails.
Hello gentlemen!
@work-vv @Poltoruhin it's still unclear to me how it would work.
Suppose we have MySQL to ClickHouse replication with real-time synchronization via binlog. If data desynchronization occurs for some reason (e.g., an error in the transfer process, or ClickHouse crashing after a version update), synchronization needs to be restored quickly. During testing, creating a snapshot in ClickHouse from the binlog for 4 tables with 5 million records each took 1.5 hours. However, if there is a way to synchronize data from a specific point, we can retain the existing data and specify synchronization from the moment of failure. Alternatively, we could create a dump file and import it directly into ClickHouse (another optimization suggestion is to enable snapshot creation from MySQL to ClickHouse via table dumps), specifying the moment the dump was created as the synchronization starting point.
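For the dump-based variant, the starting point can usually be read from the dump itself: when `mysqldump` is run with `--master-data` (or `--source-data` on newer MySQL versions), it writes a `CHANGE MASTER TO ...` statement carrying the binlog file and offset at which the dump was taken. A minimal Python sketch that extracts those coordinates follows; the function and type names are illustrative and are not part of the transfer tool.

```python
import re
from typing import NamedTuple, Optional


class BinlogPosition(NamedTuple):
    file: str
    pos: int


# Accepts both the legacy spelling (MASTER_LOG_*) and the newer one (SOURCE_LOG_*).
_COORDS = re.compile(
    r"CHANGE (?:MASTER|REPLICATION SOURCE) TO\s+"
    r"(?:MASTER|SOURCE)_LOG_FILE='([^']+)',\s*"
    r"(?:MASTER|SOURCE)_LOG_POS=(\d+)",
    re.IGNORECASE,
)


def position_from_dump(dump_text: str) -> Optional[BinlogPosition]:
    """Return the binlog coordinates recorded in a dump, or None if absent."""
    match = _COORDS.search(dump_text)
    if match is None:
        return None
    return BinlogPosition(match.group(1), int(match.group(2)))
```

The returned pair is what a "start from here" option would need; if the dump was taken without the option, the function returns `None` and the position has to come from somewhere else.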
Let me rephrase. We have the following config:
Let's add a new field:
So once the transfer is started, we start not from the head of the binlog, but exactly from the position given in that field. One question: what should we do if there is no such binlog file (i.e. it has been rotated)?
This is an optional value in the config, and setting it correctly is the creator's responsibility.
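Even if the value is the creator's responsibility, a transfer could fail fast on a rotated file instead of starting from an unusable position. The sketch below, in Python for brevity, checks a requested position against the binlog files still present on the server, i.e. the kind of name-and-size list that `SHOW BINARY LOGS` returns. All names here are hypothetical; the project may handle this differently.

```python
from typing import Dict


class BinlogUnavailableError(Exception):
    """The requested binlog file is no longer present on the server."""


def validate_start_position(file: str, pos: int, available: Dict[str, int]) -> None:
    """Check a start position against known binlog files.

    `available` maps binlog file name -> file size in bytes.
    Raises if the file is gone or the offset lies outside the file.
    """
    if file not in available:
        raise BinlogUnavailableError(
            f"binlog file {file!r} not found; it may have been rotated or purged"
        )
    size = available[file]
    if not 0 <= pos <= size:
        raise ValueError(f"offset {pos} is outside {file!r} (size {size})")
```

Raising a dedicated error keeps the "file was rotated" case distinguishable from a plain typo in the offset, so the operator knows a fresh snapshot is needed.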
Okay, now it's a lot clearer, and it seems fairly trivial to implement: it's enough to add a new property to the model here and use it inside SyncBinlogPosition. If the model field is present, take the value from it; otherwise, keep the existing logic.
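The "use the field if present, otherwise keep the existing logic" rule can be sketched as follows. This is a Python illustration only, not the project's code: `MysqlSourceModel`, its field names, and `resolve_start_position` are assumptions, and the existing behaviour is represented by a callback.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

Position = Tuple[str, int]  # (binlog file, byte offset)


@dataclass
class MysqlSourceModel:
    """Hypothetical source model with an optional explicit start position."""
    host: str
    start_binlog_file: Optional[str] = None
    start_binlog_pos: Optional[int] = None


def resolve_start_position(
    model: MysqlSourceModel,
    existing_logic: Callable[[], Position],
) -> Position:
    """Prefer the position from the model; otherwise fall back to existing logic."""
    if model.start_binlog_file is not None and model.start_binlog_pos is not None:
        return model.start_binlog_file, model.start_binlog_pos
    return existing_logic()
```

Requiring both the file and the offset avoids guessing a default for a half-specified position; a partially filled model simply falls through to the existing behaviour.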
I don't think it's a good idea to place the binlog position in the YAML config, mixing settings with data/metadata. The CDC position is already stored somewhere, and we need access to that data through several operations:
And, of course, all connectors should follow the same rules for position management.
That's also doable, but connectors have different rules regarding position: some store the position in the source itself (like Kafka with consumer groups), while others store it outside (like MySQL).
I expect that in the future all of them will use KeeperMap to support EoD. Until then, yes, some connectors won't allow offset manipulation.
That won't happen, since KeeperMap is only available in the ClickHouse target, and some DBs don't allow you to directly manipulate an offset position that is stored outside (for example, PostgreSQL). As an alternative, we could add a KeeperMap coordinator implementation instead of S3; this would make managing transfer state a lot easier.
Yes, coordinator=keepermap would be a great addition. I agree that offsets relate more to the source, though they probably belong with the coordinator. Either way, I see this as an operation, not a setting in YAML.
I would say that the command that sets the position can be simplified even further, since we still need a config to verify that the position is valid, and the config already contains information about the type of source.
Of course, we need the config file and the coordinator settings (like the bucket). Commands for reset/get would also be very useful in real life.
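To make the "operation, not a setting" idea concrete, here is a rough Python sketch of get/set/reset over coordinator-held state. Everything in it is hypothetical: `PositionStore` is an in-memory stand-in for a real coordinator backend (such as the S3 bucket mentioned above), and the `validate` callback stands in for whatever check the source config makes possible.

```python
from typing import Callable, Dict, Optional, Tuple

Position = Tuple[str, int]  # (binlog file, byte offset)


class PositionStore:
    """In-memory stand-in for transfer state kept by a coordinator."""

    def __init__(self, validate: Callable[[Position], None]) -> None:
        self._state: Dict[str, Position] = {}
        self._validate = validate

    def get(self, transfer_id: str) -> Optional[Position]:
        """Return the stored position, or None if the transfer has none."""
        return self._state.get(transfer_id)

    def set(self, transfer_id: str, position: Position) -> None:
        """Validate against the source first, then overwrite the position."""
        self._validate(position)
        self._state[transfer_id] = position

    def reset(self, transfer_id: str) -> None:
        """Forget the position so the next run falls back to default behaviour."""
        self._state.pop(transfer_id, None)
```

Keeping validation inside `set` means an invalid position is rejected before it overwrites the last known good one.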
The binlog start position and/or binlog start time is missing from the MySQL adapter config, even though the pointer is used in the runtime logic. Adding it looks like a minor change, but it would be a big improvement that can reduce replication time in some cases, such as quickly recovering replication on top of an existing snapshot.