
[Feature] Simplify migrate_table and migrate_iceberg_table into one procedure for easier use #5074

Open
liyubin117 opened this issue Feb 13, 2025 · 4 comments
Labels: enhancement (New feature or request)

@liyubin117 (Contributor)

Search before asking

  • I searched in the issues and found nothing similar.

Motivation

I found that #4639 introduces a new procedure, migrate_iceberg_table, which is similar to migrate_table. We could use a connector argument to distinguish the two scenarios in one procedure instead of introducing a new one:

CALL sys.migrate_table(connector => 'hive', source_table => 'default.hivetable', options => 'file.format=orc');
CALL sys.migrate_table(connector => 'iceberg', source_table => 'default.icebergtable', options => 'file.format=orc');

Solution

No response

Anything else?

No response

Are you willing to submit a PR?

  • I'm willing to submit a PR!
liyubin117 added the enhancement (New feature or request) label on Feb 13, 2025
@liyubin117 (Contributor, Author)

@LsomeYeah What do you think? Looking forward to your opinions.

@LsomeYeah (Contributor)

@liyubin117 Hi, thanks for the invite. In the first version of Iceberg migration, we did use a connector argument in migrate_table to distinguish the two scenarios. The reasons why I introduced a new procedure instead are as follows.

  1. The arguments of the two scenarios are different. When migrating a Hive table to Paimon, a Hive catalog in Paimon is needed to access both the origin Hive table and the target Paimon table; the catalog in CALL catalog.sys.migrate_table must be a Hive catalog in Paimon, so we only need to provide source_table in the procedure to locate the origin Hive table. But for Iceberg migration, we cannot use a Paimon catalog to access the source Iceberg table, so we have to provide some extra information about the origin Iceberg table (such as the type of the Iceberg catalog managing it, the warehouse, the Hive metastore URI, etc.); see the sketch after this list.
  2. Future scalability and usability. In the future, we may consider migrating tables from Delta or Hudi to Paimon, and their arguments may differ as well; too many arguments in the same procedure would increase the complexity of usage.
  3. The migration from Hive and from Delta is separated in Iceberg too: https://iceberg.apache.org/docs/1.6.0/table-migration/?h=migrati#migrating-from-different-table-formats. Migration from Hive in Iceberg also needs only an Iceberg catalog to access the origin and target tables, while migration from other data lakes needs special processing.
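
For concreteness, the contrast might look roughly like this (a sketch only; the iceberg_options keys are illustrative placeholders, not the exact names from the Paimon docs):

CALL sys.migrate_table(
    connector => 'hive',
    source_table => 'default.hivetable',  -- resolved via the enclosing Paimon Hive catalog
    options => 'file.format=orc');

CALL sys.migrate_iceberg_table(
    source_table => 'default.icebergtable',
    -- extra source-catalog information that a Paimon catalog cannot supply
    -- (illustrative keys: catalog type, warehouse path, Hive metastore URI)
    iceberg_options => 'catalog-type=hive,warehouse=/path/to/warehouse,uri=thrift://hms-host:9083',
    options => 'file.format=orc');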

@liyubin117 (Contributor, Author)

liyubin117 commented Feb 14, 2025

@LsomeYeah Thanks for your explanation. I have a minor doubt: the only difference in arguments between the two procedures is that migrate_iceberg_table has iceberg_options. Could we use sys.migrate_table('iceberg', 'icebergCatalog.db.t1') to reuse the options defined on the created catalog instead of declaring them in the procedure again?
I found that the migration procedure in Iceberg is CALL catalog_name.system.migrate('spark_catalog.db.sample', map('foo', 'bar'));, where the catalog is included in the table argument.

MigrateTableProcedure

@ProcedureHint(
            argument = {
                @ArgumentHint(name = "connector", type = @DataTypeHint("STRING")),
                @ArgumentHint(name = "source_table", type = @DataTypeHint("STRING")),
                @ArgumentHint(name = "options", type = @DataTypeHint("STRING"), isOptional = true),
                @ArgumentHint(
                        name = "parallelism",
                        type = @DataTypeHint("Integer"),
                        isOptional = true)
            })

MigrateIcebergTableProcedure

@ProcedureHint(
            argument = {
                @ArgumentHint(name = "source_table", type = @DataTypeHint("STRING")),
                @ArgumentHint(
                        name = "iceberg_options",
                        type = @DataTypeHint("STRING"),
                        isOptional = true),
                @ArgumentHint(name = "options", type = @DataTypeHint("STRING"), isOptional = true),
                @ArgumentHint(
                        name = "parallelism",
                        type = @DataTypeHint("Integer"),
                        isOptional = true)
            })
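
Under the proposal in this issue, the two hints above might be merged into a single signature along these lines (a hypothetical sketch, not actual Paimon code):

@ProcedureHint(
        argument = {
            @ArgumentHint(name = "connector", type = @DataTypeHint("STRING")),
            @ArgumentHint(name = "source_table", type = @DataTypeHint("STRING")),
            // only consulted when connector => 'iceberg'
            @ArgumentHint(
                    name = "iceberg_options",
                    type = @DataTypeHint("STRING"),
                    isOptional = true),
            @ArgumentHint(name = "options", type = @DataTypeHint("STRING"), isOptional = true),
            @ArgumentHint(
                    name = "parallelism",
                    type = @DataTypeHint("Integer"),
                    isOptional = true)
        })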

@LsomeYeah (Contributor)

LsomeYeah commented Feb 14, 2025

@liyubin117 Happy to discuss. Currently there is no catalog in Paimon that can access Iceberg tables, so the icebergCatalog in 'icebergCatalog.db.t1' would have to be an Iceberg catalog. In fact, I had used an Iceberg catalog to access Iceberg tables for migration before, but the migration code lives in the paimon-core module, and that would introduce Iceberg dependencies into paimon-core, which turned out to be undesirable after discussing with some Paimon committers.

As far as I know, Iceberg uses the catalog in catalog.database.tablename as the source catalog, and it must be a SparkCatalog or a SparkSessionCatalog. SparkCatalog is a wrapped Iceberg catalog that can only load Iceberg tables; SparkSessionCatalog wraps an Iceberg catalog plus a delegate catalog that implements some Spark catalog interfaces for loading non-Iceberg tables, which may be what lets Iceberg's migration handle migrating CSV, Parquet, etc. to Iceberg. The procedure introduced here is in Flink, and Paimon currently has no Flink catalog that can load both Paimon tables and non-Paimon tables.
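
For reference, the Iceberg/Spark wiring behind the migrate call mentioned above looks roughly like this (a sketch following the pattern in the Iceberg docs; the catalog name and table are illustrative):

-- Spark session configuration (e.g. in spark-defaults.conf or via --conf flags):
--   spark.sql.catalog.spark_catalog      = org.apache.iceberg.spark.SparkSessionCatalog
--   spark.sql.catalog.spark_catalog.type = hive

-- The source catalog is taken from the qualified table name:
CALL spark_catalog.system.migrate('spark_catalog.db.sample');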
