Skip to content

Pitr collector can't handle transactions split over multiple binlogs #2286

@wonko

Description

@wonko

Report

(Operator: 1.18.0, PXC @@Version: 8.0.42-33.1, Logcollector image: percona/percona-xtrabackup:8.0.35-33.1 with init image percona/percona-xtradb-cluster-operator:1.18.0.)

When the set from the last uploaded binlog spans two binlogs on another host, the log collector assumes there's a gap.

More about the problem

Consider the log snippet below, taken from a live cluster (I filtered out the lines about the cache, can re-insert them). First of all, the log has three runs, each of them taking the binlogs from different nodes (first node 2, then 1, then 0 - no idea why, but hey 🤷 ). I've added some newlines to make it clearer.

First run goes well, and the second run as well. The second detects correctly that the previous uploaded last set is matching one of the binlogs on this node. So far, so good, and we uploaded the following set as last set: 146109a7-04bf-11f0-a88e-d3a485b81481:6093289-6093543.

It's the third run that goes wrong. Looking carefully, you'll see that the start-entry (6093289) is in binlog.083868 (having ids 6093268-6093330) and the end-marker is in binlog.083869 (having 6093331-6093585). So, the data is actually there, but it is split across two binlogs on this machine.

This case isn't handled, and the result is a) gap detected, b) everything is uploaded.

I believe the logic is flawed, there should only be a search for the binlog which contains the last uploaded transaction.

If more data is needed, lmk.

2025/12/03 10:22:10 creating collector functions
2025/12/03 10:22:11 running binlog collector
2025/12/03 10:22:18 reading binlogs from pxc with hostname= app-pxc-db-pxc-2.app-pxc-db-pxc.percona.svc.cluster.local
2025/12/03 10:22:18 no cache entry for binlog.038968
2025/12/03 10:22:18 updating binlog cache
2025/12/03 10:22:19 last uploaded GTID set: 146109a7-04bf-11f0-a88e-d3a485b81481:6093310-6093351
2025/12/03 10:22:19 checking binlog.038968 (146109a7-04bf-11f0-a88e-d3a485b81481:6093352-6093375) against last uploaded set
2025/12/03 10:22:19 last uploaded 146109a7-04bf-11f0-a88e-d3a485b81481:6093310-6093351 is not subset of 146109a7-04bf-11f0-a88e-d3a485b81481:6093352-6093375 in binlog.038968 or vice versa
2025/12/03 10:22:19 checking binlog.038967 (146109a7-04bf-11f0-a88e-d3a485b81481:6093310-6093351) against last uploaded set
2025/12/03 10:22:19 last uploaded 146109a7-04bf-11f0-a88e-d3a485b81481:6093310-6093351 is equal to 146109a7-04bf-11f0-a88e-d3a485b81481:6093310-6093351 in binlog.038967
2025/12/03 10:22:19 last uploaded binlog: binlog.038967
2025/12/03 10:22:19 starting to process binlog with name binlog.038968
2025/12/03 10:22:19 successfully wrote binlog file binlog.038968 to storage with name binlog_1764757292_ab03247eec263a1762919c95229b6697


2025/12/03 10:32:31 reading binlogs from pxc with hostname= app-pxc-db-pxc-1.app-pxc-db-pxc.percona.svc.cluster.local
2025/12/03 10:32:32 no cache entry for binlog.078307
2025/12/03 10:32:32 no cache entry for binlog.078308
2025/12/03 10:32:32 no cache entry for binlog.078309
2025/12/03 10:32:32 no cache entry for binlog.078310
2025/12/03 10:32:32 updating binlog cache
2025/12/03 10:32:32 last uploaded GTID set: 146109a7-04bf-11f0-a88e-d3a485b81481:6093352-6093375
2025/12/03 10:32:32 checking binlog.078310 (146109a7-04bf-11f0-a88e-d3a485b81481:6093289-6093543) against last uploaded set
2025/12/03 10:32:32 last uploaded 146109a7-04bf-11f0-a88e-d3a485b81481:6093352-6093375 is subset of 146109a7-04bf-11f0-a88e-d3a485b81481:6093289-6093543 in binlog.078310
2025/12/03 10:32:32 last uploaded binlog: binlog.078310
2025/12/03 10:32:32 starting to process binlog with name binlog.078310
2025/12/03 10:32:33 successfully wrote binlog file binlog.078310 to storage with name binlog_1764757204_1f2bc15feab62ff967bded2cc9e7f3f5


2025/12/03 10:42:55 reading binlogs from pxc with hostname= app-pxc-db-pxc-0.app-pxc-db-pxc.percona.svc.cluster.local
2025/12/03 10:42:56 no cache entry for binlog.083868
2025/12/03 10:42:56 no cache entry for binlog.083869
2025/12/03 10:42:56 no cache entry for binlog.083870
2025/12/03 10:42:56 updating binlog cache
2025/12/03 10:42:56 last uploaded GTID set: 146109a7-04bf-11f0-a88e-d3a485b81481:6093289-6093543
2025/12/03 10:42:56 checking binlog.083870 () against last uploaded set
2025/12/03 10:42:56 checking binlog.083869 (146109a7-04bf-11f0-a88e-d3a485b81481:6093331-6093585) against last uploaded set
2025/12/03 10:42:56 last uploaded 146109a7-04bf-11f0-a88e-d3a485b81481:6093289-6093543 is not subset of 146109a7-04bf-11f0-a88e-d3a485b81481:6093331-6093585 in binlog.083869 or vice versa
2025/12/03 10:42:56 checking binlog.083868 (146109a7-04bf-11f0-a88e-d3a485b81481:6093268-6093330) against last uploaded set
2025/12/03 10:42:56 last uploaded 146109a7-04bf-11f0-a88e-d3a485b81481:6093289-6093543 is not subset of 146109a7-04bf-11f0-a88e-d3a485b81481:6093268-6093330 in binlog.083868 or vice versa
2025/12/03 10:42:56 checking binlog.083867 (146109a7-04bf-11f0-a88e-d3a485b81481:6092908-6093267) against last uploaded set
2025/12/03 10:42:56 last uploaded 146109a7-04bf-11f0-a88e-d3a485b81481:6093289-6093543 is not subset of 146109a7-04bf-11f0-a88e-d3a485b81481:6092908-6093267 in binlog.083867 or vice versa
2025/12/03 10:42:56 checking binlog.083866 (146109a7-04bf-11f0-a88e-d3a485b81481:6092527-6092907) against last uploaded set
2025/12/03 10:42:56 last uploaded 146109a7-04bf-11f0-a88e-d3a485b81481:6093289-6093543 is not subset of 146109a7-04bf-11f0-a88e-d3a485b81481:6092527-6092907 in binlog.083866 or vice versa
2025/12/03 10:42:56 checking binlog.083865 (146109a7-04bf-11f0-a88e-d3a485b81481:6092443-6092526) against last uploaded set
2025/12/03 10:42:56 last uploaded 146109a7-04bf-11f0-a88e-d3a485b81481:6093289-6093543 is not subset of 146109a7-04bf-11f0-a88e-d3a485b81481:6092443-6092526 in binlog.083865 or vice versa

Steps to reproduce

No idea on how to exactly trigger this, I guess some careful flushing of binlogs might trigger this. We observed this in our cluster fairly regular (once, twice per day). Cluster has bursty traffic.

Versions

  1. Kubernetes 1.33
  2. Operator 1.18.0
  3. Database 8.0.42

Anything else?

I looked at the source code to figure out the logic while debugging this issue (all our backups are marked as not-PITR-safe-to-restore). I believe the logic in https://github.com/percona/percona-xtradb-cluster-operator/blob/main/cmd/pitr/collector/collector.go#L521 and following lines should be changed to only consider the end-id to find the binlog which was (half-)processed.

Metadata

Metadata

Assignees

Labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions