[clone]refactor the clone action as we introduced external path #4844

neuyilan · 2025-01-06T13:06:33Z

Purpose

https://cwiki.apache.org/confluence/display/PAIMON/PIP-29%3A+Introduce+Table+Multi-Location++Management

Refactor the clone action now that external paths have been introduced.

Note that regardless of where the source table's data is stored (warehouse path or external path), the data is always copied to the warehouse path of the target table.

If the target table kept using the source table's external path as its data path, the data files of the source table and the target table would be mixed together in the same location.
What's your opinion?
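
To make the copy rule above concrete, here is a minimal sketch in plain Java. It assumes that a data file's location can be split into a root (the source table's warehouse path or its external path) plus a relative part, and that cloning keeps the relative part under the target table's warehouse path. `ClonePathSketch`, `toTargetPath` and the example locations are hypothetical names for illustration and are not part of Paimon.

```java
/** Hypothetical helper, not a Paimon class: re-roots a source file under the target table path. */
final class ClonePathSketch {

    private ClonePathSketch() {}

    /**
     * @param sourceFile absolute location of a data file in the source table
     * @param sourceRoot root the file lives under (warehouse table path or external path)
     * @param targetTableRoot warehouse path of the target table
     * @return the location the file would be copied to
     */
    static String toTargetPath(String sourceFile, String sourceRoot, String targetTableRoot) {
        String root = stripTrailingSlash(sourceRoot);
        if (!sourceFile.startsWith(root + "/")) {
            throw new IllegalArgumentException(sourceFile + " is not under " + root);
        }
        // Keep the relative layout (for example the bucket directory), drop the source root.
        String relative = sourceFile.substring(root.length() + 1);
        return stripTrailingSlash(targetTableRoot) + "/" + relative;
    }

    private static String stripTrailingSlash(String path) {
        return path.endsWith("/") ? path.substring(0, path.length() - 1) : path;
    }

    public static void main(String[] args) {
        String target = "hdfs://warehouse/db.db/t_clone";
        // A file stored under the source table's warehouse path.
        System.out.println(toTargetPath(
                "hdfs://warehouse/db.db/t/bucket-0/data-0.orc", "hdfs://warehouse/db.db/t", target));
        // A file stored under the source table's external path.
        System.out.println(toTargetPath(
                "oss://external/t/bucket-0/data-1.orc", "oss://external/t/", target));
    }
}
```

In this sketch both files end up under the target table's own path, so the source and target tables never share a data directory.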

Tests

Add CloneActionITCase.testCloneTableWithSourceTableExternalPath

API and Format

no

Documentation

JingsongLi · 2025-01-07T07:11:43Z

I feel that the current clone process needs to be refactored:

Use a single parallelism to query all manifest files and copy the manifest list file.
Read the manifests with distributed parallelism, determine whether each one needs to be rewritten (depending on whether an external path is involved), and complete the copy or rewrite of the manifest.
Shuffle by data file name.
Copy the data files in a distributed manner.

This hierarchical approach to copying is the correct solution.
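
As an illustration of the "shuffle by data file name" step, the following plain-Java sketch assigns each file name to one of `parallelism` copy tasks by hashing the name. It is a simplification under stated assumptions: in a Flink job a keyed shuffle would normally do this, and Flink's own key distribution differs from the `hashCode`-based one used here. `FileNameShuffleSketch`, `channelFor` and `assign` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: distributes data file names over parallel copy tasks by name. */
final class FileNameShuffleSketch {

    private FileNameShuffleSketch() {}

    /** Returns the index of the copy task responsible for the given file name. */
    static int channelFor(String fileName, int parallelism) {
        if (parallelism <= 0) {
            throw new IllegalArgumentException("parallelism must be positive");
        }
        // floorMod keeps the result in [0, parallelism) even for negative hash codes.
        return Math.floorMod(fileName.hashCode(), parallelism);
    }

    /** Groups file names by the copy task that would receive them. */
    static Map<Integer, List<String>> assign(List<String> fileNames, int parallelism) {
        Map<Integer, List<String>> byChannel = new TreeMap<>();
        for (String name : fileNames) {
            byChannel
                    .computeIfAbsent(channelFor(name, parallelism), k -> new ArrayList<>())
                    .add(name);
        }
        return byChannel;
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList("data-0.orc", "data-1.orc", "changelog-0.orc");
        System.out.println(assign(files, 2));
    }
}
```

Because the task index depends only on the file name, the same file is always routed to the same task.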

neuyilan · 2025-01-07T11:23:54Z

> I feel that the current clone process needs to be refactored:
>
> Use a single parallelism to query all manifest files and copy the manifest list file.
>
> Read the manifests with distributed parallelism, determine whether each one needs to be rewritten (depending on whether an external path is involved), and complete the copy or rewrite of the manifest.
>
> Shuffle by data file name.
>
> Copy the data files in a distributed manner.
>
> This hierarchical approach to copying is the correct solution.

Thanks for your advice; I will do my best.

neuyilan · 2025-01-07T16:13:02Z

Hi Jingsong, according to the original design [1] and the discussion above, I plan to refactor the clone action into the following Flink batch job.

The first stage picks the tables that need to be cloned. If the database parameter is not passed, all tables of all databases are cloned. If the table parameter is not passed, all tables of the database are cloned. (Unchanged from the original design.)
The second stage picks the related files (Snapshot, Schema, ManifestList, Manifest, Datafile, ChangeLog, IndexFile) of the snapshot in the source table. (Unchanged from the original design.)
The third stage only copies the schema files to the target path. The schema files comprise Snapshot, Schema, ManifestList and IndexFile.
The fourth stage copies or rewrites the manifest files with distributed parallelism. If a manifest involves an external path, it is rewritten; otherwise, it is copied.
Shuffle the data files by file name. (Data files comprise Datafile and ChangeLog.)
The fifth stage copies the data files with distributed parallelism.
Shuffle by the target table's name to the next stage.
The sixth stage recreates the snapshot hint file. (Unchanged from the original design.)
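
The table-selection rule of the first stage can be sketched as follows. This is a minimal illustration in plain Java, assuming the catalog is modelled as a map from database name to table names; `PickTablesSketch` and `pickTables` are hypothetical names, and ignoring the table parameter when no database is given is an assumption read off the wording above rather than confirmed behavior.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the first stage: decide which tables to clone. */
final class PickTablesSketch {

    private PickTablesSketch() {}

    /**
     * @param catalog database name to table names (a stand-in for a real catalog)
     * @param database database parameter, or null if not passed
     * @param table table parameter, or null if not passed
     * @return fully qualified names ("db.table") of the tables to clone
     */
    static List<String> pickTables(Map<String, List<String>> catalog, String database, String table) {
        List<String> picked = new ArrayList<>();
        if (database == null) {
            // No database given: clone all tables of all databases.
            for (Map.Entry<String, List<String>> entry : catalog.entrySet()) {
                for (String t : entry.getValue()) {
                    picked.add(entry.getKey() + "." + t);
                }
            }
        } else if (table == null) {
            // Database given, no table: clone all tables of that database.
            for (String t : catalog.getOrDefault(database, Collections.<String>emptyList())) {
                picked.add(database + "." + t);
            }
        } else {
            // Both given: clone exactly one table.
            picked.add(database + "." + table);
        }
        return picked;
    }

    public static void main(String[] args) {
        Map<String, List<String>> catalog = new LinkedHashMap<>();
        catalog.put("db1", Arrays.asList("t1", "t2"));
        catalog.put("db2", Arrays.asList("t3"));
        System.out.println(pickTables(catalog, null, null));
        System.out.println(pickTables(catalog, "db1", null));
        System.out.println(pickTables(catalog, "db1", "t2"));
    }
}
```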

Please help confirm whether this refactoring is appropriate. Thanks.

[1] https://cwiki.apache.org/confluence/display/PAIMON/PIP-18%3A+Introduce+clone+Action+and+Procedure

JingsongLi · 2025-01-08T08:20:15Z

Hi @neuyilan, thanks for your design!

For the second stage, I think we can just pick the manifests. We don't need to pick files there.

neuyilan · 2025-01-08T09:22:33Z

> For the second stage, I think we can just pick the manifests. We don't need to pick files there.

Hi @JingsongLi,
if we only pick the manifest files in the second stage, then when we copy the Snapshot, Schema and IndexFile files, do you mean that we pass only a snapshot ID between upstream and downstream operators, pick the required files at each step, and then copy the corresponding files?

The original design was to pick out all files up front and then copy the corresponding files at each step according to file type.

JingsongLi · 2025-01-09T00:38:12Z

> > For the second stage, I think we can just pick the manifests. We don't need to pick files there.
>
> Hi @JingsongLi, if we only pick the manifest files in the second stage, then when we copy the Snapshot, Schema and IndexFile files, do you mean that we pass only a snapshot ID between upstream and downstream operators, pick the required files at each step, and then copy the corresponding files?
>
> The original design was to pick out all files up front and then copy the corresponding files at each step according to file type.

Yes, I think we can refactor it now.

neuyilan · 2025-01-09T03:53:02Z

Hi @JingsongLi, thanks again for the advice. I have refactored the clone action into the following Flink batch job; please review it again. Thanks.

The first stage picks the tables that need to be cloned. If the database parameter is not passed, all tables of all databases are cloned. If the table parameter is not passed, all tables of the database are cloned. (Unchanged from the original design.)
The second stage just picks the schema files and copies them to the target path. The schema files comprise Snapshot, Schema, ManifestList and IndexFile.
The third stage just picks the manifest files with single parallelism.
The fourth stage copies or rewrites the manifest files with distributed parallelism. If a manifest involves an external path, it is rewritten; otherwise, it is copied.
The fifth stage picks all the data files with single parallelism. (Data files comprise Datafile and ChangeLog.)
Shuffle the data files by file name.
The sixth stage copies the data files with distributed parallelism.
Shuffle by the target table's name to the next stage.
The seventh stage recreates the snapshot hint file. (Unchanged from the original design.)
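
The copy-or-rewrite decision of the fourth stage could look roughly like the sketch below. It assumes, purely for illustration, that a manifest is a list of entries and that an entry records an external path only when its data file lives outside the warehouse; a manifest with any such entry is rewritten with the external path cleared, and all others are copied unchanged. `ManifestActionSketch`, `Entry`, `decide` and `rewrite` are hypothetical names and do not reflect the actual Paimon classes touched by this PR.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of the fourth stage: copy a manifest as-is or rewrite it. */
final class ManifestActionSketch {

    enum Action { COPY, REWRITE }

    /** Stand-in for a manifest entry; externalPath is null for files under the warehouse path. */
    static final class Entry {
        final String fileName;
        final String externalPath;

        Entry(String fileName, String externalPath) {
            this.fileName = fileName;
            this.externalPath = externalPath;
        }
    }

    private ManifestActionSketch() {}

    /** A manifest must be rewritten if any of its entries points to an external path. */
    static Action decide(List<Entry> entries) {
        for (Entry entry : entries) {
            if (entry.externalPath != null) {
                return Action.REWRITE;
            }
        }
        return Action.COPY;
    }

    /** Produces entries for the target table, which stores all data under its warehouse path. */
    static List<Entry> rewrite(List<Entry> entries) {
        List<Entry> rewritten = new ArrayList<>();
        for (Entry entry : entries) {
            rewritten.add(new Entry(entry.fileName, null));
        }
        return rewritten;
    }

    public static void main(String[] args) {
        List<Entry> manifest = Arrays.asList(
                new Entry("data-0.orc", null),
                new Entry("data-1.orc", "oss://external/t/bucket-0/data-1.orc"));
        System.out.println(decide(manifest));
        System.out.println(decide(rewrite(manifest)));
    }
}
```

Manifests that never mention an external path can be copied byte for byte, which is why only the rewrite case needs to parse and re-serialize entries in this sketch.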

wwj6591812 · 2025-01-09T16:22:34Z

@neuyilan
Thanks a lot for preparing this PR.
I think changing the job topology like this is fine. "Pick the required files at each step, then copy the corresponding files" is not only clearer but also more scalable.
Just one small question: why do you emphasize that this refactoring is only for the batch job? Why not change the streaming job's topology in the same way?

neuyilan · 2025-01-10T02:45:13Z

> Just one small question: why do you emphasize that this refactoring is only for the batch job? Why not change the streaming job's topology in the same way?

Hi @wwj6591812, thanks for the reminder. I had a misunderstanding before. After this modification, both the batch job and the streaming job will be affected. Is that right?

neuyilan · 2025-01-13T02:11:30Z

@JingsongLi @wwj6591812 PTAL, Thanks.

paimon-core/src/main/java/org/apache/paimon/io/DataFileMeta.java

paimon-flink/paimon-flink-common/src/main/java/org/apache/paimon/flink/clone/CloneFileInfo.java

...link-common/src/main/java/org/apache/paimon/flink/clone/PickSchemaFilesForCloneOperator.java

paimon-flink/paimon-flink-common/src/main/java/org/apache/paimon/flink/clone/CloneFileInfo.java

...aimon-flink-common/src/main/java/org/apache/paimon/flink/clone/CopyManifestFileOperator.java

...link/paimon-flink-common/src/test/java/org/apache/paimon/flink/action/CloneActionITCase.java

JingsongLi

Looks good to me!

fix the clone tests

644594d

neuyilan marked this pull request as draft January 6, 2025 13:06

neuyilan changed the title ~~[clone]fix the clone when we introduced external path~~ [clone]fix the clone action when we introduced external path Jan 6, 2025

refactor the clone job

aeeb30f

neuyilan changed the title ~~[clone]fix the clone action when we introduced external path~~ [clone]refactor the clone action as we introduced external path Jan 10, 2025

neuyilan added 2 commits January 10, 2025 11:41

add clone it for external path table

31bf4e7

remove useless codes

80a034f

neuyilan marked this pull request as ready for review January 10, 2025 03:48

JingsongLi reviewed Jan 14, 2025

paimon-core/src/main/java/org/apache/paimon/io/DataFileMeta.java (outdated, resolved)

paimon-flink/paimon-flink-common/src/main/java/org/apache/paimon/flink/clone/CloneFileInfo.java (outdated, resolved)

remove FileType

0f96789

JingsongLi reviewed Jan 14, 2025

paimon-flink/paimon-flink-common/src/main/java/org/apache/paimon/flink/clone/CloneFileInfo.java (outdated, resolved)

neuyilan requested a review from JingsongLi January 14, 2025 16:01

merge master

3e47968

neuyilan closed this Jan 16, 2025

neuyilan reopened this Jan 16, 2025

neuyilan closed this Jan 17, 2025

neuyilan reopened this Jan 17, 2025

JingsongLi reviewed Jan 17, 2025

...link-common/src/main/java/org/apache/paimon/flink/clone/PickSchemaFilesForCloneOperator.java (outdated, resolved)

JingsongLi reviewed Jan 17, 2025

paimon-flink/paimon-flink-common/src/main/java/org/apache/paimon/flink/clone/CloneFileInfo.java (outdated, resolved)

JingsongLi reviewed Jan 17, 2025

...aimon-flink-common/src/main/java/org/apache/paimon/flink/clone/CopyManifestFileOperator.java (resolved)

neuyilan added 2 commits January 17, 2025 17:40

refactor the clone action

bb5523a

ignore the testCloneTableWithExpiration

902e659

JingsongLi reviewed Feb 10, 2025

...link/paimon-flink-common/src/test/java/org/apache/paimon/flink/action/CloneActionITCase.java (outdated, resolved)

neuyilan added 3 commits February 10, 2025 20:04

merge master

42f1bed

add retry clone job

0f02a58

fix clone it

be96e7f

neuyilan requested a review from JingsongLi February 12, 2025 01:32

JingsongLi approved these changes Feb 12, 2025

JingsongLi merged commit ecdf46f into apache:master Feb 12, 2025
12 checks passed
