ora2iceberg Data Pipe

1. What is Ora2Iceberg?

Ora2Iceberg is your go-to tool for simplifying data migration! It seamlessly transfers Oracle tables and views into Iceberg structures across various data lakes and warehouses—whether cloud-based or on-premises. Supporting a wide range of storage systems, including S3-compatible, HDFS, and local files, it integrates effortlessly with all Iceberg-compatible catalogs. This includes JDBC connections to popular free databases like PostgreSQL and MySQL, making complex Hadoop setups or Spark/Hive unnecessary. Seriously, who wants more Hadoop headaches?

Who Benefits from Ora2Iceberg?

  • Data Engineers: Effortlessly migrate from legacy Oracle databases to modern systems (finally, something less painful than Mondays!).

  • Cloud Architects: Enable cost-effective and scalable data lakes like AWS Glue and S3Tables.

  • Database Administrators: Simplify schema conversions and data type mappings without breaking a sweat.

  • Analytics Teams: Unlock high-performance querying with Iceberg’s partitioning and indexing. It’s like upgrading from a bicycle to a sports car.

Key Features

  • Multi-Catalog Support: Compatible with Glue, Hive, S3Tables, and more.

  • Customizable Data Mappings: Override Oracle-to-Iceberg type mappings with ease.

  • Partitioning Options: Optimize table performance with advanced partitioning strategies.

  • Flexible Upload Modes: Full, incremental, and future merge upload modes.

  • Security Compliance: Validate dependencies for secure builds—because nobody likes surprises in production.

2. Getting Started

Prerequisites

Environment Setup:

  • Install Java (JDK 11 or later).

Supported Iceberg Catalogs:

  • REST, JDBC, Hadoop, Hive, Nessie, Glue, S3Tables, DynamoDB.

Installation

1. Clone the repository

git clone https://github.com/a2solutions/ora2iceberg.git

2. Navigate to the project directory:

cd ora2iceberg

3. Build the project:

  • On Linux/Unix:

./gradlew clean build

  • On Windows:

gradlew clean build

4. Run the tool:

java -jar build/libs/ora2iceberg.jar

Optional: Validating Secure Dependencies

Run the following to check the dependency libraries for known vulnerabilities:

gradle dependencyCheckAnalyze

3. Basic Usage

Common assumptions:

The machine on which you run Ora2Iceberg can be:

  • Remote or local to the source Oracle database;

  • Remote or local to the destination catalogs;

  • Remote to the destination storage if it supports network access (S3-compatible, Hadoop, Snowflake);

  • Local to the destination storage if copying to the plain local file system.

The tool can be run on Linux or Windows in any shell of your choice capable of running Java 11 or later (popular free JREs such as Corretto and Temurin are fully supported).
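
A quick way to confirm the runtime on either platform before proceeding (the exact output format varies by JDK vendor):

java -version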

Scenario 1: Copying the table to AWS S3 storage, using AWS Glue as the catalog.

This scenario covers copying Oracle tables to general-purpose AWS S3 storage, which can then be accessed by AWS analytical tools.

Prerequisites

  • Configure the S3 storage bucket (specified as iceberg-warehouse in the below example).

  • Set the following environment variables in your shell:

    • AWS_REGION to the region of your bucket;

    • AWS_ACCESS_KEY_ID to your access key ID;

    • AWS_SECRET_ACCESS_KEY to the secret access key.

The AWS credentials used must have access to both the S3 bucket and the AWS Glue catalog.

Example

export AWS_REGION=us-east-1
export AWS_ACCESS_KEY_ID=SOMEKEYYY
export AWS_SECRET_ACCESS_KEY=SOMESECRET

cd ~/ora2iceberg/build/libs/
java -jar ora2iceberg-0.8.1.7-all.jar \
--source-jdbc-url jdbc:oracle:thin:@dbhost:1521/SID \
--source-user dbuser --source-password hispassword \
-T glue -C test \
-H "s3://iceberg-warehouse" \
-Rio-impl=org.apache.iceberg.aws.s3.S3FileIO \
--source-object mtl_item_attributes \
--source-schema inv -N dest-warehose -U 1

In the example above, the parameters represent the following:

--source-jdbc-url: specifies the source database URL, where dbhost is the hostname, 1521 is the listener port and SID is the database's service name;

--source-user and --source-password: source database username and password, as is, without any quotes;

--source-object: name of the table in the source database;

--source-schema: name of the schema containing the table;

-T: catalog type; can be glue, hive, nessie or jdbc;

-C: catalog branch or reference;

-U: catalog endpoint URI. This parameter is mandatory, but AWS Glue does not need it, hence the placeholder value 1;

-N: Iceberg namespace, no quotes;

-H: destination path for the Iceberg table, in this example the path to the AWS S3 bucket;

-R: passes additional Iceberg properties. It is used as a prefix to each additional property, with the property following it with no spaces. Working with S3 requires explicitly specifying the IO implementation, exactly as shown in the example above.
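
By default the destination Iceberg table takes the name of the source object. To write to a different table name, the documented --iceberg-table (-t) option can be added to the same invocation. A hedged sketch, where the table name item_attrs is purely illustrative:

java -jar ora2iceberg-0.8.1.7-all.jar \
--source-jdbc-url jdbc:oracle:thin:@dbhost:1521/SID \
--source-user dbuser --source-password hispassword \
-T glue -C test \
-H "s3://iceberg-warehouse" \
-Rio-impl=org.apache.iceberg.aws.s3.S3FileIO \
--source-object mtl_item_attributes \
--iceberg-table item_attrs \
--source-schema inv -N dest-warehose -U 1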

Scenario 2: Copying the table to the local file system, using on-prem Nessie or a relational database as the catalog

This scenario is useful if you're going to access the Iceberg tables locally via ClickHouse or DuckDB.

Prerequisites

  1. The tool must be local to your destination storage.

  2. If a Nessie catalog is used, it must be configured to accept connections from your account.

  3. If a database is used as the catalog, you need to know its type (Postgres, Oracle or MySQL), the login credentials, and the hostname and port on which it accepts connections.

Example: Local Iceberg storage + Nessie catalog on-prem

cd ~/ora2iceberg/build/libs/
java -jar ora2iceberg-0.8.1.7-all.jar \
--source-jdbc-url jdbc:oracle:thin:@dbhost:1521/SID \
--source-user dbuser --source-password hispassword \
-T nessie -C test \
-U "http://cataloghostname:19120/api/v2" \
-H "file:///clickhouse/iceberg" \
--source-object mtl_item_attributes \
--source-schema inv -N dest-warehose

In the example above, the parameters represent the following:

--source-jdbc-url: specifies the source database URL, where dbhost is the hostname, 1521 is the listener port and SID is the database's service name;

--source-user and --source-password: source database username and password, as is, without any quotes;

--source-object: name of the table in the source database;

--source-schema: name of the schema containing the table;

-T: catalog type; can be glue, hive, nessie or jdbc;

-C: catalog branch or reference;

-U: catalog endpoint URI in double quotes, mandatory parameter, in this case in http format where cataloghostname is the Nessie catalog host and 19120 is the Nessie port;

-N: Iceberg namespace, no quotes;

-H: destination path for the Iceberg table, in this example the local directory /clickhouse/iceberg prefixed with file://.

Example: Local Iceberg storage + catalog in relational database

cd ~/ora2iceberg/build/libs/
java -jar ora2iceberg-0.8.1.7-all.jar \
--source-jdbc-url jdbc:oracle:thin:@dbhost:1521/SID \
--source-user dbuser --source-password hispassword \
-T jdbc -C test \
-U "jdbc:postgresql://pgdbhost:5432/postgres" \
-Rjdbc.user=catdbuser -Rjdbc.password=catdbpassword \
-H "file:///clickhouse/iceberg" \
--source-object mtl_item_attributes \
--source-schema inv -N dest-warehose

In the example above, the parameters represent the following:

--source-jdbc-url: specifies the source database URL, where dbhost is the hostname, 1521 is the listener port and SID is the database's service name;

--source-user and --source-password: source database username and password, as is, without any quotes;

--source-object: name of the table in the source database;

--source-schema: name of the schema containing the table;

-T: catalog type; can be glue, hive, nessie or jdbc;

-C: catalog branch or reference;

-U: catalog endpoint URI in double quotes, mandatory parameter, in this case in jdbc format where pgdbhost is the PostgreSQL database host, 5432 is its listener's port and postgres is the name of the database that will store the catalog data;

-N: Iceberg namespace, no quotes;

-H: destination path for the Iceberg table in quotes, in this example the local directory /clickhouse/iceberg prefixed with file://;

-R: passes additional Iceberg properties. It is used as a prefix to each additional property, with the property following it with no spaces. When the catalog is stored in a database, you have to specify -Rjdbc.user and -Rjdbc.password for the database that will store the catalog.
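
As noted in the prerequisites, the catalog database does not have to be PostgreSQL; MySQL or Oracle can hold it as well. A hedged sketch of the same command pointed at a hypothetical MySQL instance (mysqlhost, port 3306 and the database name iceberg_catalog are placeholders; the matching JDBC driver also needs to be available to the tool, which may depend on how the jar was built):

java -jar ora2iceberg-0.8.1.7-all.jar \
--source-jdbc-url jdbc:oracle:thin:@dbhost:1521/SID \
--source-user dbuser --source-password hispassword \
-T jdbc -C test \
-U "jdbc:mysql://mysqlhost:3306/iceberg_catalog" \
-Rjdbc.user=catdbuser -Rjdbc.password=catdbpassword \
-H "file:///clickhouse/iceberg" \
--source-object mtl_item_attributes \
--source-schema inv -N dest-warehose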

Scenario 3: Copying the table to S3-compatible storage with Nessie or jdbc as the catalog

You can use this scenario for transferring tables to existing on-prem or cloud-based S3-compatible storage, such as Apache Ozone.

Prerequisites

  • Configure the S3 storage bucket (specified as bucket-test in the below example)

  • Configure and save the access key pair if needed.

  • Set the following environment variables in your shell:

    • AWS_REGION to the region of your bucket;

    • AWS_ACCESS_KEY_ID to your access key ID;

    • AWS_SECRET_ACCESS_KEY to the secret access key.

Example with Nessie

export AWS_REGION=us-east-1
export AWS_ACCESS_KEY_ID=SOMEONESKEY
export AWS_SECRET_ACCESS_KEY=THEIRSECRET
cd ~/ora2iceberg/build/libs/
java -jar ora2iceberg-0.8.1.7-all.jar \
--source-jdbc-url jdbc:oracle:thin:@dbhost:1521/SID \
--source-user dbuser --source-password hispassword \
-T nessie -C test \
-U "http://cataloghostname:19120/api/v2" \
-H "s3://bucket-test" \
-Rio-impl=org.apache.iceberg.aws.s3.S3FileIO \
-Rs3.endpoint=http://s3host:9878/ \
-Rs3.path-style-access=true \
--source-object mtl_item_attributes \
--source-schema inv -N dest-warehose

In the example above, the parameters represent the following:

--source-jdbc-url: specifies the source database URL, where dbhost is the hostname, 1521 is the listener port and SID is the database's service name;

--source-user and --source-password: source database username and password, as is, without any quotes;

--source-object: name of the table in the source database;

--source-schema: name of the schema containing the table;

-T: catalog type; can be glue, hive, nessie or jdbc;

-C: catalog branch or reference;

-U: catalog endpoint URI in double quotes, mandatory parameter, in this case in http format where cataloghostname is the Nessie catalog host and 19120 is the Nessie port;

-N: Iceberg namespace, no quotes;

-H: destination path for the Iceberg table in quotes, in this example the path to the S3 bucket named bucket-test;

-R: passes additional Iceberg properties. It is used as a prefix to each additional property, with the property following it with no spaces. When used with third-party S3-compatible storage, you have to specify the IO implementation (-Rio-impl) exactly as shown, -Rs3.endpoint in the above http format without quotes (where s3host is the S3 storage hostname and 9878 is its port), and -Rs3.path-style-access=true.
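
The same pattern works for other S3-compatible stores; only the endpoint changes. A hedged sketch against a hypothetical MinIO endpoint (miniohost and port 9000 are placeholders, and the AWS_* environment variables should hold the MinIO credentials, as above):

java -jar ora2iceberg-0.8.1.7-all.jar \
--source-jdbc-url jdbc:oracle:thin:@dbhost:1521/SID \
--source-user dbuser --source-password hispassword \
-T nessie -C test \
-U "http://cataloghostname:19120/api/v2" \
-H "s3://bucket-test" \
-Rio-impl=org.apache.iceberg.aws.s3.S3FileIO \
-Rs3.endpoint=http://miniohost:9000/ \
-Rs3.path-style-access=true \
--source-object mtl_item_attributes \
--source-schema inv -N dest-warehose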

Scenario 4: Copying the table to AWS S3 storage with Hive as the catalog

You can use this scenario for transferring tables to AWS S3 storage when you already have an on-prem Hadoop cluster or use the AWS EMR service with the Hive server activated.

Prerequisites

  • Configure the S3 storage bucket (specified as bucket-test in the below example)

  • Create the Hive database if using other than ‘default’.

  • Set the following environment variables in your shell:

    • AWS_REGION to the region of your bucket;

    • AWS_ACCESS_KEY_ID to your access key ID;

    • AWS_SECRET_ACCESS_KEY to the secret access key.

Example

export AWS_REGION=us-east-1
export AWS_ACCESS_KEY_ID=AccOuNtKey
export AWS_SECRET_ACCESS_KEY=OhSecReT
cd ~/ora2iceberg/build/libs/
java -jar ora2iceberg-0.8.1.7-all.jar \
--source-jdbc-url jdbc:oracle:thin:@dbhost:1521/SID \
--source-user dbuser --source-password hispassword \
-T hive -C default \
-U "thrift://hiveserver:9083" \
-H "s3://bucket-test" \
-Rio-impl=org.apache.iceberg.aws.s3.S3FileIO \
--source-object mtl_item_attributes \
--source-schema inv -N dest-warehose

In the example above, the parameters represent the following:

--source-jdbc-url: specifies the source database URL, where dbhost is the hostname, 1521 is the listener port and SID is the database's service name;

--source-user and --source-password: source database username and password, as is, without any quotes;

--source-object: name of the table in the source database;

--source-schema: name of the schema containing the table;

-T: catalog type; can be glue, hive, nessie or jdbc;

-C: catalog database name in Hive;

-U: catalog endpoint URI in double quotes, mandatory parameter, in this case in thrift format where hiveserver is the Hive metastore host and 9083 is its Thrift port;

-N: Iceberg namespace, no quotes;

-H: destination path for the Iceberg table in quotes, in this example the path to the S3 bucket named bucket-test;

-R: passes additional Iceberg properties. It is used as a prefix to each additional property, with the property following it with no spaces. When used with AWS S3 storage, you have to specify the IO implementation (-Rio-impl) exactly as shown.

4. Data Type Mapping

Default Mappings

Ora2Iceberg maps Oracle types to Iceberg types as follows:

Oracle Type                 Iceberg Type
NUMBER                      decimal(38,10)
NUMBER(p,s)                 decimal(p,s)
NUMBER(p,0), s=0, p<10      integer (int)
NUMBER(p,0), s=0, p<19      long (bigint)
VARCHAR2, CHAR              string
TIMESTAMP                   timestamp
DATE                        timestamp

Custom Overrides

Customize mappings using the -m option:

-m "COLUMN_NAME:NUMBER=long; %_ID:NUMBER=integer"

Syntax:

COLUMN_OR_PATTERN:ORACLE_TYPE=ICEBERG_TYPE

Examples:

Map a specific column: -m "EMP_ID:NUMBER=long"

Use patterns: -m "%_ID:NUMBER=integer"

Supports % for partial matches (at the beginning or end only). (Seriously, no middle matches—don’t even try!)
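
In a full invocation the mapping override is simply another option. A hedged sketch that reuses the Scenario 1 command and forces every column matching %_ID to long (the pattern is illustrative):

java -jar ora2iceberg-0.8.1.7-all.jar \
--source-jdbc-url jdbc:oracle:thin:@dbhost:1521/SID \
--source-user dbuser --source-password hispassword \
-T glue -C test \
-H "s3://iceberg-warehouse" \
-Rio-impl=org.apache.iceberg.aws.s3.S3FileIO \
--source-object mtl_item_attributes \
--source-schema inv -N dest-warehose -U 1 \
-m "%_ID:NUMBER=long"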

5. Default Number Format

The default fallback for ambiguous Oracle NUMBER columns is decimal(38,10).

Configuration

Override using the -d parameter:

-d "decimal(20,5)"

Example:

java -jar build/libs/ora2iceberg.jar \
-d "decimal(20,5)"

6. Partitioning

Supported Partition Types

Type        Description
IDENTITY    Direct column mapping
YEAR        Partition by year
MONTH       Partition by month
DAY         Partition by day
HOUR        Partition by hour
BUCKET      Hash-based bucketing (requires bucket count)
TRUNCATE    Truncate strings to a fixed length

Syntax

Define partitions using the -P option:

-P column_name=IDENTITY

-P column_name=BUCKET,10

Example:

java -jar build/libs/ora2iceberg.jar \
-P dept=IDENTITY \
-P emp_id=BUCKET,10

(If you don’t love partitions yet, you will soon!)
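
Because Oracle DATE and TIMESTAMP columns map to Iceberg timestamps, the time-based transforms can be applied to them directly. A hedged sketch that reuses the Scenario 2 Nessie command and partitions by month on a hypothetical last_update_date column:

java -jar ora2iceberg-0.8.1.7-all.jar \
--source-jdbc-url jdbc:oracle:thin:@dbhost:1521/SID \
--source-user dbuser --source-password hispassword \
-T nessie -C test \
-U "http://cataloghostname:19120/api/v2" \
-H "file:///clickhouse/iceberg" \
--source-object mtl_item_attributes \
--source-schema inv -N dest-warehose \
-P last_update_date=MONTH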

7. Understanding Parameters

Source Connection Parameters

-j, --source-jdbc-url
  Oracle JDBC URL for the source connection. This parameter is required.
  Example: jdbc:oracle:thin:@localhost:1521:xe

-u, --source-user
  Oracle username for the source connection.
  Example: system

-p, --source-password
  Password for the source Oracle connection.
  Example: password123

-s, --source-schema
  Source schema name. If not specified, the value of <source-user> is used.
  Example: HR

-o, --source-object
  Name of the source table, view, or SQL SELECT query. SQL SELECT is not implemented yet.
  Example: employees or SELECT * FROM employees WHERE department_id = 10

-w, --where-clause
  Optional WHERE clause for the <source-object>. Valid only if <source-object> points to a table or view.
  Example: WHERE salary > 50000
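
Putting the short forms together, a hedged sketch of a filtered copy that reuses the example values from the list above and the Scenario 1 destination options (the filter is illustrative):

java -jar ora2iceberg-0.8.1.7-all.jar \
-j jdbc:oracle:thin:@localhost:1521:xe \
-u system -p password123 \
-s HR -o employees \
-w "WHERE salary > 50000" \
-T glue -C test -H "s3://iceberg-warehouse" \
-Rio-impl=org.apache.iceberg.aws.s3.S3FileIO \
-N dest-warehose -U 1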

Iceberg Destination Parameters

-T, --iceberg-catalog-type
  Type of Iceberg catalog. Can be predefined (e.g., REST, JDBC, HADOOP) or a fully qualified class name.
  Example: REST

-C, --iceberg-catalog
  Name of the Apache Iceberg catalog.
  Example: MyCatalog

-U, --iceberg-catalog-uri
  URI for the Apache Iceberg catalog.
  Example: http://localhost:8080

-H, --iceberg-warehouse
  Location of the Apache Iceberg warehouse.
  Example: /path/to/warehouse

-N, --iceberg-namespace
  Namespace for the Iceberg catalog. Defaults to the source schema.
  Example: db_namespace

-t, --iceberg-table
  Name of the destination Iceberg table. Defaults to the source object name for tables/views.
  Example: iceberg_table_name

-P, --iceberg-partition
  Partitioning definitions for the Iceberg table.
  Example: columnName=YEAR or columnName=BUCKET,10

-L, --upload-mode
  Upload mode: full, incremental, or merge. Merge is not implemented yet.
  Example: full

-R, --iceberg-catalog-properties
  Additional properties for the Apache Iceberg catalog implementation.
  Example: -Rs3.endpoint=http://ozone.companyname.com:9878/
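
The scenario examples above all perform full loads; the documented -L option also accepts incremental (merge is not implemented yet). A hedged sketch that re-runs the Scenario 1 copy in incremental mode; since this README does not describe how incremental change detection works, treat this purely as option syntax:

java -jar ora2iceberg-0.8.1.7-all.jar \
--source-jdbc-url jdbc:oracle:thin:@dbhost:1521/SID \
--source-user dbuser --source-password hispassword \
-T glue -C test \
-H "s3://iceberg-warehouse" \
-Rio-impl=org.apache.iceberg.aws.s3.S3FileIO \
--source-object mtl_item_attributes \
--source-schema inv -N dest-warehose -U 1 \
-L incremental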

Additional Options

-f, --auto-infer-types
  Automatically infer numeric types (e.g., BIGINT vs NUMERIC). Not implemented yet.
  Example: -f

-d, --default-number-type
  Default numeric precision/scale for ambiguous NUMBER columns. Defaults to decimal(38,10).
  Example: decimal(10,2)

-m, --data-type-map
  Custom mappings from source data types to Iceberg types.
  Example: "COLUMN_NAME:NUMBER=integer; PATTERN%:NUMBER=decimal(20,0)"

More Information

For more details, documentation, and updates, visit the project's official website.

License

This project is licensed under the Apache-2.0 License.
