Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mysql/source: binlog position/starttime logic #209

Open
work-vv opened this issue Feb 7, 2025 · 18 comments
Open

Mysql/source: binlog position/starttime logic #209

work-vv opened this issue Feb 7, 2025 · 18 comments
Labels
enhancement New feature or request good first issue Good for newcomers help wanted Extra attention is needed

Comments

@work-vv
Copy link

work-vv commented Feb 7, 2025

Binlog start position or/and binlog startime missed in mysql adapter config, though pointer used in runtime logic. It looks like a minor change but huge improvement which can reduce replication time in some cases, such as quick recovering replication on top of existing snapshot.

@laskoviymishka
Copy link
Contributor

Thought about this, but not sure regards to end user flow, any ideas how it can be used from end user perspective?

@work-vv
Copy link
Author

work-vv commented Feb 8, 2025

Common usecase. Sync between mysql and ch failed in production at midnight) Binlog is huge and replicate it from the beginning is not effective way to restore and sync state again. But with suggested functionality we can restart replication and sync process from midnight.

@laskoviymishka
Copy link
Contributor

Yeah, but how this shall be operated on ops level? As an extra option?

@work-vv
Copy link
Author

work-vv commented Feb 8, 2025

As soon as incremenal snaphot might be differ or not available from provider to provider I would put time tag or pointer for now as an optional parameter for mysql connector and fill it in transfer config source connector options. Also it would be nice to show actual value in logs on data transfer finished or failed.

@laskoviymishka laskoviymishka changed the title Add Mysql binlog position/starttime logic Mysql/source: binlog position/starttime logic Feb 8, 2025
@laskoviymishka laskoviymishka added the enhancement New feature or request label Feb 8, 2025
@Poltoruhin
Copy link

Hello gentlemens!
Just wanna say that this enhancement would be much appreciated. We trying to use CH for OLAP alongside with main MySQL db and the case described by @work-vv seems very possible.

@laskoviymishka
Copy link
Contributor

@work-vv @Poltoruhin it's still unclear to me how it would work.
If the idea is to allow transfer to recover after binlog rotate - then, it's a case for increment-only transfer where SyncBinlog procedure will recover binlog on a restart.

@work-vv
Copy link
Author

work-vv commented Feb 12, 2025

Suppose we have a MySQL to ClickHouse replication with real-time synchronization via binlog. If, for some reason, data desynchronization occurs (e.g., an error in the transfer process, ClickHouse crashing after a version update), it is necessary to restore synchronization shortly. During testing, creating a snapshot in ClickHouse from the binlog for 4 tables with 5 million records each took 1.5 hours. However, if there is a way to synchronize data from a specific point, we can retain the existing data and specify synchronization from the moment of failure. Alternatively, we could create a dump file and import it directly into ClickHouse (another optimization suggestion is to enable snapshot creation from MySQL to ClickHouse via table dumps), specifying the synchronization starting point at the moment the dump was created.

@laskoviymishka
Copy link
Contributor

Let me rephrase, we have following config:

id: test
type: SNAPSHOT_AND_INCREMENT
src:
  type: mysql
  params:
    Host: mysql
    User: myuser
    Password: mypassword
    Database: mydb
    Port: 3306
dst:
  type: ch
  params:
    ....

Let's add a new field: startpos, as binlog position / gtid-id

id: test
type: INCREMENT_ONLY
src:
  type: mysql
  params:
    Host: mysql
    User: myuser
    Password: mypassword
    Database: mydb
    Port: 3306
    StartPos:
      type: binlog
      value: "binlog_0001/123"
dst:
  type: ch
  params:
    ....

So once transfer is started - we start not from a head of a binlog, but exactly from "binlog_0001/123".

There is a question: what to do if there is no such binlog file? (i.e. it's got rotated).

@work-vv
Copy link
Author

work-vv commented Feb 12, 2025

this is optional value in config which is creator responsibility

@work-vv
Copy link
Author

work-vv commented Feb 12, 2025

There is a question: what to do if there is no such binlog file? (i.e. it's got rotated). #
I would suggest to stop with error, same as you do for any other incorrectly defined option.

@laskoviymishka
Copy link
Contributor

Okay, now it's a lot more clear, and seems kinda trivial to implement, it's enough to add a new property to model here and use it inside SyncBinlogPosition, if model field is presented - took value from it, otherwise - keep exist logic.

@laskoviymishka laskoviymishka added good first issue Good for newcomers help wanted Extra attention is needed labels Feb 12, 2025
@BorisTyshkevich
Copy link

I don't think it's a good idea to place the binlog position into the yaml config mixing settings and data/metadata.

CDC position is already stored somewhere, and we need to get access to that data using several operations:

  • trcli cdc-position set "string"
  • trcli cdc-position get
  • trcli cdc-position reset

And, of course, all connectors should follow the same rules for position management.

@laskoviymishka
Copy link
Contributor

That's also doable, but all connectors have different rules regards to position, some store position in source itself (like Kafka with consumer group), but some outside (like MySQL).

@BorisTyshkevich
Copy link

I expect that in the future, all of them will use KepperMap to support EoD. Till then - yes, some connectors won't allow offset manipulations.

@laskoviymishka
Copy link
Contributor

That won't happen, since keeper map is only available in ClickHouse target, and some DBs doesn't allow you to directly manipulate offset position that stored outside (for example PostgreSQL). As alternative we could add keeper map coordinator implementation instead of s3, this will make management of transfer state a lot easier.
Offset Management is a source db specific concern, it's really hard to unify it somehow and I don't see any real profit from such unification.

@BorisTyshkevich
Copy link

BorisTyshkevich commented Feb 15, 2025

Yes, coodinator=keepermap would be a great addition.

I agree that offsets are more related to the SRC, but probably to the coordinator. Anyway, I see it as an operation, not a setting in Yaml

trcli mysql-position set "...."
trcli postgres-position set "...."
trcli s3-position set "...."
trcli keepermap-position set "...."

trcli set-position mysql "...."
trcli set-position postgres "...."

@laskoviymishka
Copy link
Contributor

I would say that a command that set position can be even simplified

trcli state set-position "" --config file.yaml

Since we still need a config to verify that position is valid, and config already contains information about type of source.

@BorisTyshkevich
Copy link

of course, we need config file and coordinator settings (like the bucket).

Commands for reset/get would also be very useful in real life.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request good first issue Good for newcomers help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

4 participants