Ora2Iceberg is your go-to tool for simplifying data migration! It seamlessly transfers Oracle tables and views into Iceberg structures across data lakes and warehouses, whether cloud-based or on-premises. It supports a wide range of storage systems, including S3-compatible object stores, HDFS, and local file systems, and integrates with any Iceberg-compatible catalog, including JDBC catalogs backed by popular free databases like PostgreSQL and MySQL, so a complex Hadoop setup or a Spark/Hive stack is unnecessary. Seriously, who wants more Hadoop headaches?
- Data Engineers: Effortlessly migrate from legacy Oracle databases to modern systems (finally, something less painful than Mondays!).
- Cloud Architects: Build cost-effective, scalable data lakes on services such as AWS Glue and S3Tables.
- Database Administrators: Simplify schema conversions and data type mappings without breaking a sweat.
- Analytics Teams: Unlock high-performance querying with Iceberg’s partitioning and indexing. It’s like upgrading from a bicycle to a sports car.
- Multi-Catalog Support: Compatible with Glue, Hive, S3Tables, and more.
- Customizable Data Mappings: Override Oracle-to-Iceberg type mappings with ease.
- Partitioning Options: Optimize table performance with advanced partitioning strategies.
- Flexible Upload Modes: Full and incremental modes today, with a merge mode planned.
- Security Compliance: Dependencies are validated for secure builds, because nobody likes surprises in production.
The machine on which you run Ora2Iceberg can be:
- Remote or local to the source Oracle database;
- Remote or local to the destination catalogs;
- Remote to the destination storage, if that storage supports network access (S3-compatible, Hadoop, Snowflake);
- Local to the destination storage, if copying to a plain local file system.
The tool runs on Linux or Windows, in any shell of your choice, and requires Java 11 or later (popular free JREs such as Corretto and Temurin are fully supported).
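Before the first run, it is worth confirming that a suitable Java runtime is on the PATH of the machine you picked, for example:

```bash
# The reported version should be 11 or later (Corretto and Temurin both work)
java -version
```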
Scenario 1: Copying the table to AWS S3 storage, using AWS Glue as a catalog
This scenario copies the Oracle tables to general-purpose AWS S3 storage, which can then be accessed by AWS analytical tools.
- Configure the S3 storage bucket (specified as iceberg-warehouse in the example below).
- Configure and save the AWS access key pair according to https://docs.aws.amazon.com/IAM/latest/UserGuide/security-creds-programmatic-access.html
- Set the following environment variables in your shell:
  - AWS_REGION to the region of your bucket;
  - AWS_ACCESS_KEY_ID to your access key ID;
  - AWS_SECRET_ACCESS_KEY to the secret access key.
```bash
export AWS_REGION=us-east-1
export AWS_ACCESS_KEY_ID=SOMEKEYYY
export AWS_SECRET_ACCESS_KEY=SOMESECRET
cd ~/ora2iceberg/build/libs/
java -jar ora2iceberg-0.8.1.7-all.jar \
  --source-jdbc-url jdbc:oracle:thin:@dbhost:1521/SID \
  --source-user dbuser --source-password hispassword \
  -T glue -C test \
  -H "s3://iceberg-warehouse" \
  -Rio-impl=org.apache.iceberg.aws.s3.S3FileIO \
  --source-object mtl_item_attributes \
  --source-schema inv -N dest-warehose -U 1
```
In the example above, the parameters represent the following:
- --source-jdbc-url: the source database URL, where dbhost is the hostname, 1521 is the listener port, and SID is the database’s service name;
- --source-user and --source-password: source database username and password, without any quotes, as is;
- --source-object: name of the table in the source database;
- --source-schema: name of the schema containing the table;
- -T: catalog type; can be glue, hive, nessie, or jdbc;
- -C: catalog branch or reference;
- -U: catalog endpoint URI; a mandatory parameter, but with AWS Glue you don’t have to specify it, hence the placeholder value 1;
- -N: Iceberg namespace, no quotes;
- -H: destination path for the Iceberg table, in this example the path to the AWS S3 bucket;
- -R: prefix for passing additional Iceberg properties; the property follows it with no space. Working with S3 requires explicitly specifying the IO implementation, exactly as shown in the example above.
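If the AWS CLI is installed and configured with the same credentials (an assumption on top of this guide, not a requirement of Ora2Iceberg), you can quickly confirm that the Iceberg data and metadata files landed in the bucket:

```bash
# List what Ora2Iceberg wrote into the warehouse bucket;
# the exact prefix layout depends on the namespace and table name
aws s3 ls s3://iceberg-warehouse/ --recursive | head -n 20
```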
Scenario 2: Copying the table to the Local File System, using an on-prem Nessie or a relational database as a catalog
This scenario is useful if you’re going to access the Iceberg tables locally via ClickHouse or DuckDB.
- The tool must be local to your destination storage.
- If the Nessie catalog is used, it must be configured to accept connections from your account (a quick connectivity check is shown right after this list).
- If a database is used as a catalog, you need to know its type (PostgreSQL, Oracle, or MySQL), the login credentials, and the hostname and port on which it accepts connections.
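Before launching the copy, you can make sure the Nessie REST endpoint is reachable from this machine; a minimal sketch, assuming a recent Nessie release that exposes the v2 configuration resource:

```bash
# Should return a small JSON document describing the Nessie server configuration
curl http://cataloghostname:19120/api/v2/config
```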
```bash
cd ~/ora2iceberg/build/libs/
java -jar ora2iceberg-0.8.1.7-all.jar \
  --source-jdbc-url jdbc:oracle:thin:@dbhost:1521/SID \
  --source-user dbuser --source-password hispassword \
  -T nessie -C test \
  -U "http://cataloghostname:19120/api/v2" \
  -H "file:///clickhouse/iceberg" \
  --source-object mtl_item_attributes \
  --source-schema inv -N dest-warehose
```
In the example above, the parameters represent the following:
- --source-jdbc-url: the source database URL, where dbhost is the hostname, 1521 is the listener port, and SID is the database’s service name;
- --source-user and --source-password: source database username and password, without any quotes, as is;
- --source-object: name of the table in the source database;
- --source-schema: name of the schema containing the table;
- -T: catalog type; can be glue, hive, nessie, or jdbc;
- -C: catalog branch or reference;
- -U: catalog endpoint URI in double quotes; a mandatory parameter, here in http format, where cataloghostname is the Nessie catalog host and 19120 is the Nessie port;
- -N: Iceberg namespace, no quotes;
- -H: destination path for the Iceberg table, in this example the local directory /clickhouse/iceberg prefixed with file://.
```bash
cd ~/ora2iceberg/build/libs/
java -jar ora2iceberg-0.8.1.7-all.jar \
  --source-jdbc-url jdbc:oracle:thin:@dbhost:1521/SID \
  --source-user dbuser --source-password hispassword \
  -T jdbc -C test \
  -U "jdbc:postgresql://pgdbhost:5432/postgres" \
  -Rjdbc.user=catdbuser -Rjdbc.password=catdbpassword \
  -H "file:///clickhouse/iceberg" \
  --source-object mtl_item_attributes \
  --source-schema inv -N dest-warehose
```
In the example above, the parameters represent the following:
- --source-jdbc-url: the source database URL, where dbhost is the hostname, 1521 is the listener port, and SID is the database’s service name;
- --source-user and --source-password: source database username and password, without any quotes, as is;
- --source-object: name of the table in the source database;
- --source-schema: name of the schema containing the table;
- -T: catalog type; can be glue, hive, nessie, or jdbc;
- -C: catalog branch or reference;
- -U: catalog endpoint URI in double quotes; a mandatory parameter, here in jdbc format, where pgdbhost is the PostgreSQL host, 5432 is its listener port, and postgres is the name of the database that will store the catalog data;
- -N: Iceberg namespace, no quotes;
- -H: destination path for the Iceberg table in quotes, in this example the local directory /clickhouse/iceberg prefixed with file://;
- -R: prefix for passing additional Iceberg properties; the property follows it with no space. When the catalog lives in a database, you have to specify -Rjdbc.user and -Rjdbc.password for the database that stores the catalog.
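Once the copy completes, you can sanity-check the new table straight from the local file system with DuckDB and its iceberg extension. This is a minimal sketch; the exact table directory under /clickhouse/iceberg depends on the namespace and table name, so adjust the path (or point it at a specific metadata/*.metadata.json file) to match what Ora2Iceberg actually wrote:

```bash
# Count the rows of the freshly written Iceberg table with DuckDB;
# the namespace/table path below is an assumed layout, adjust as needed
echo "INSTALL iceberg; LOAD iceberg; \
      SELECT count(*) FROM iceberg_scan('/clickhouse/iceberg/dest-warehose/mtl_item_attributes');" | duckdb
```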
Scenario 3: Copying the table to S3-compatible storage, using Nessie as a catalog
You can use this scenario for transferring the tables to an existing on-prem or cloud-based S3-compatible storage, such as Apache Ozone.
- Configure the S3 storage bucket (specified as bucket-test in the example below).
- Configure and save the access key pair if needed.
- Set the following environment variables in your shell:
  - AWS_REGION to the region of your bucket;
  - AWS_ACCESS_KEY_ID to your access key ID;
  - AWS_SECRET_ACCESS_KEY to the secret access key.
```bash
export AWS_REGION=us-east-1
export AWS_ACCESS_KEY_ID=SOMEONESKEY
export AWS_SECRET_ACCESS_KEY=THEIRSECRET
cd ~/ora2iceberg/build/libs/
java -jar ora2iceberg-0.8.1.7-all.jar \
  --source-jdbc-url jdbc:oracle:thin:@dbhost:1521/SID \
  --source-user dbuser --source-password hispassword \
  -T nessie -C test \
  -U "http://cataloghostname:19120/api/v2" \
  -H "s3://bucket-test" \
  -Rio-impl=org.apache.iceberg.aws.s3.S3FileIO \
  -Rs3.endpoint=http://s3host:9878/ \
  -Rs3.path-style-access=true \
  --source-object mtl_item_attributes \
  --source-schema inv -N dest-warehose
```
In the example above, the parameters represent the following:
- --source-jdbc-url: the source database URL, where dbhost is the hostname, 1521 is the listener port, and SID is the database’s service name;
- --source-user and --source-password: source database username and password, without any quotes, as is;
- --source-object: name of the table in the source database;
- --source-schema: name of the schema containing the table;
- -T: catalog type; can be glue, hive, nessie, or jdbc;
- -C: catalog branch or reference;
- -U: catalog endpoint URI in double quotes; a mandatory parameter, here in http format, where cataloghostname is the Nessie catalog host and 19120 is the Nessie port;
- -N: Iceberg namespace, no quotes;
- -H: destination path for the Iceberg table in quotes, in this example the path to the S3 bucket named bucket-test;
- -R: prefix for passing additional Iceberg properties; the property follows it with no space. With third-party S3-compatible storage, you have to specify the IO implementation (-Rio-impl) exactly as shown, -Rs3.endpoint in the above http format without quotes (where s3host is the S3 storage hostname and 9878 is its port), and -Rs3.path-style-access=true.
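If you have the AWS CLI at hand (an assumption; any S3-compatible client works just as well), you can point it at the same endpoint to confirm the files were written:

```bash
# List the bucket through the S3-compatible endpoint used above
aws s3 ls s3://bucket-test/ --recursive --endpoint-url http://s3host:9878/ | head -n 20
```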
Scenario 4: Copying the table to AWS S3 storage, using Hive as a catalog
You can use this scenario for transferring the tables to AWS S3 storage when you already have an on-prem Hadoop cluster, or when using the AWS EMR service with Hive Server activated.
- Configure the S3 storage bucket (specified as bucket-test in the example below).
- Create the Hive database if using one other than ‘default’.
- Set the following environment variables in your shell:
  - AWS_REGION to the region of your bucket;
  - AWS_ACCESS_KEY_ID to your access key ID;
  - AWS_SECRET_ACCESS_KEY to the secret access key.
```bash
export AWS_REGION=us-east-1
export AWS_ACCESS_KEY_ID=AccOuNtKey
export AWS_SECRET_ACCESS_KEY=OhSecReT
cd ~/ora2iceberg/build/libs/
java -jar ora2iceberg-0.8.1.7-all.jar \
  --source-jdbc-url jdbc:oracle:thin:@dbhost:1521/SID \
  --source-user dbuser --source-password hispassword \
  -T hive -C default \
  -U "thrift://hiveserver:9083" \
  -H "s3://bucket-test" \
  -Rio-impl=org.apache.iceberg.aws.s3.S3FileIO \
  --source-object mtl_item_attributes \
  --source-schema inv -N dest-warehose
```
In the example above, the parameters represent the following:
- --source-jdbc-url: the source database URL, where dbhost is the hostname, 1521 is the listener port, and SID is the database’s service name;
- --source-user and --source-password: source database username and password, without any quotes, as is;
- --source-object: name of the table in the source database;
- --source-schema: name of the schema containing the table;
- -T: catalog type; can be glue, hive, nessie, or jdbc;
- -C: catalog database name in Hive;
- -U: catalog endpoint URI in double quotes; a mandatory parameter, here in thrift format, where hiveserver is the Hive server host and 9083 is the Hive metastore port;
- -N: Iceberg namespace, no quotes;
- -H: destination path for the Iceberg table in quotes, in this example the path to the S3 bucket named bucket-test;
- -R: prefix for passing additional Iceberg properties; the property follows it with no space. With AWS S3 storage, you have to specify the IO implementation (-Rio-impl) exactly as shown.
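A quick way to confirm the catalog side is to look for the Iceberg namespace among the Hive databases. This sketch goes through HiveServer2 rather than the metastore port used above, and assumes HiveServer2 is running on the same host on its default port 10000:

```bash
# Hypothetical check: the Iceberg namespace should appear as a Hive database
beeline -u "jdbc:hive2://hiveserver:10000" -e "SHOW DATABASES;"
```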
Ora2Iceberg maps Oracle types to Iceberg types as follows:

| Oracle Type | Iceberg Type |
|---|---|
| NUMBER | decimal(38,10) |
| NUMBER(p,s) | decimal(p,s) |
| NUMBER(p,0), s=0, p<10 | integer, int |
| NUMBER(p,0), s=0, p<19 | long, BigInt |
| VARCHAR2, CHAR | string |
| TIMESTAMP | timestamp |
| DATE | timestamp |
Customize mappings using the -m option:
-m "COLUMN_NAME:NUMBER=long; %_ID:NUMBER=integer"
The following partition transform types are supported:

| Type | Description |
|---|---|
| IDENTITY | Direct column mapping |
| YEAR | Partition by year |
| MONTH | Partition by month |
| DAY | Partition by day |
| HOUR | Partition by hour |
| BUCKET | Hash-based bucketing (requires bucket count) |
| TRUNCATE | Truncate strings to a fixed length |
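The parameter reference below lists a switch for partition definitions, but this guide does not show its exact spelling or value syntax. Purely as a hypothetical sketch (the --partition-by name, the COLUMN=TRANSFORM syntax, and the LAST_UPDATE_DATE column are all assumptions, not documented behavior; verify against the tool's own usage output), a daily-partitioned load might look like:

```bash
# HYPOTHETICAL: the partition flag name and its value syntax are assumptions,
# not taken from this guide; check the tool's usage output before relying on them
java -jar ora2iceberg-0.8.1.7-all.jar \
  --source-jdbc-url jdbc:oracle:thin:@dbhost:1521/SID \
  --source-user dbuser --source-password hispassword \
  -T glue -C test -H "s3://iceberg-warehouse" \
  -Rio-impl=org.apache.iceberg.aws.s3.S3FileIO \
  --source-object mtl_item_attributes --source-schema inv -N dest-warehose -U 1 \
  --partition-by "LAST_UPDATE_DATE=DAY"
```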
Source connection parameters:

| Short | Long | Explanation | Example |
|---|---|---|---|
| | --source-jdbc-url | Oracle JDBC URL for the source connection. This parameter is required. | jdbc:oracle:thin:@dbhost:1521/SID |
| | --source-user | Oracle username for the source connection. | dbuser |
| | --source-password | Password for the source Oracle connection. | |
| | --source-schema | Source schema name. If not specified, the value of --source-user is used. | inv |
| | --source-object | Name of the source table, view, or SQL statement. | mtl_item_attributes |
| | | Optional | |
Iceberg catalog and destination parameters:

| Short | Long | Explanation | Example |
|---|---|---|---|
| -T | | Type of Iceberg catalog. Can be predefined (e.g., REST, JDBC, HADOOP) or a fully qualified class name. | glue |
| -C | | Name of the Apache Iceberg catalog. | test |
| -U | | URI for the Apache Iceberg catalog. | thrift://hiveserver:9083 |
| -H | | Location of the Apache Iceberg warehouse. | s3://iceberg-warehouse |
| -N | | Namespace for the Iceberg catalog. Defaults to the source schema. | dest-warehose |
| | | Name of the destination Iceberg table. Defaults to the source object name for tables/views. | |
| | | Partitioning definitions for the Iceberg table. | |
| | | Upload mode: full, incremental, or (planned) merge. | |
| -R | | Additional properties for the Apache Iceberg catalog implementation. | -Rio-impl=org.apache.iceberg.aws.s3.S3FileIO |
Data type handling parameters:

| Short | Long | Explanation | Example |
|---|---|---|---|
| | | Automatically infer numeric types (e.g., BIGINT vs NUMERIC). Not implemented yet. | |
| | | Default numeric precision and scale for ambiguous NUMBER columns. | |
| -m | | Custom mappings from source data types to Iceberg types. | -m "%_ID:NUMBER=integer" |
For more details, documentation, and updates, visit the official website: