ora2iceberg Data Pipe

1. What is Ora2Iceberg?

Ora2Iceberg is your go-to tool for simplifying data migration! It seamlessly transfers Oracle tables and views into Iceberg structures across various data lakes and warehouses—whether cloud-based or on-premises. Supporting a wide range of storage systems, including S3-compatible, HDFS, and local files, it integrates effortlessly with all Iceberg-compatible catalogs. This includes JDBC connections to popular free databases like PostgreSQL and MySQL, making complex Hadoop setups or Spark/Hive unnecessary. Seriously, who wants more Hadoop headaches?

Who Benefits from Ora2Iceberg?

  • Data Engineers: Effortlessly migrate from legacy Oracle databases to modern systems (finally, something less painful than Mondays!).

  • Cloud Architects: Enable cost-effective and scalable data lakes like AWS Glue and S3Tables.

  • Database Administrators: Simplify schema conversions and data type mappings without breaking a sweat.

  • Analytics Teams: Unlock high-performance querying with Iceberg’s partitioning and indexing. It’s like upgrading from a bicycle to a sports car.

Key Features

  • Multi-Catalog Support: Compatible with Glue, Hive, S3Tables, and more.

  • Customizable Data Mappings: Override Oracle-to-Iceberg type mappings with ease.

  • Partitioning Options: Optimize table performance with advanced partitioning strategies.

  • Flexible Upload Modes: Full, incremental, and future merge upload modes.

  • Security Compliance: Validate dependencies for secure builds—because nobody likes surprises in production.

2. Getting Started

Prerequisites

Environment Setup:

  • Install Java (JDK 11 or later).

Supported Iceberg Catalogs:

  • REST, JDBC, Hadoop, Hive, Nessie, Glue, S3Tables, DynamoDB.

Installation

1. Clone the repository

git clone https://github.com/a2solutions/ora2iceberg.git

2. Navigate to the project directory:

cd ora2iceberg

3. Build the project:

  • On Linux/Unix:

./gradlew clean build

  • On Windows:

gradlew clean build

4. Run the tool:

java -jar build/libs/ora2iceberg.jar

Optional: Validating Secure Dependencies

Run the following to check the dependency libraries for known vulnerabilities:

gradle dependencyCheckAnalyze

3. Basic Usage

Common assumptions:

The machine on which you run Ora2Iceberg can be:

  • Remote or local to the source Oracle database;

  • Remote or local to the destination catalogs;

  • Remote to the destination storage if it supports network access (S3-compatible, Hadoop, Snowflake);

  • Local to the destination storage if copying to the plain local file system.

The tool can be run on Linux or Windows in any shell of your choice capable of running Java 11 or later (popular free JREs such as Corretto and Temurin are fully supported).
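
A quick way to confirm the runtime on either platform before proceeding (the exact output format varies by JDK vendor):

java -version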

Scenario 1: Copying the table to AWS S3 storage, using AWS Glue as the catalog.

This scenario covers copying Oracle tables to general-purpose AWS S3 storage, which can then be accessed by AWS analytical tools.

Prerequisites

  • Configure the S3 storage bucket (specified as iceberg-warehouse in the below example).

  • Set the following environment variables in your shell:

    • AWS_REGION to the region of your bucket;

    • AWS_ACCESS_KEY_ID to your access key ID;

    • AWS_SECRET_ACCESS_KEY to the secret access key.

The AWS credentials used must have access to both the S3 bucket and the AWS Glue catalog.

Example

export AWS_REGION=us-east-1
export AWS_ACCESS_KEY_ID=SOMEKEYYY
export AWS_SECRET_ACCESS_KEY=SOMESECRET

cd ~/ora2iceberg/build/libs/
java -jar ora2iceberg-0.8.1.7-all.jar \
--source-jdbc-url jdbc:oracle:thin:@dbhost:1521/SID \
--source-user dbuser --source-password hispassword \
-T glue -C test \
-H "s3://iceberg-warehouse" \
-Rio-impl=org.apache.iceberg.aws.s3.S3FileIO \
--source-object mtl_item_attributes \
--source-schema inv -N dest-warehose -U 1

In the example above, the parameters represent the following:

--source-jdbc-url: specifies the source database URL, where dbhost is the hostname, 1521 is the listener port and SID is the database's service name;

--source-user and --source-password: source database username and password, as is, without any quotes;

--source-object: name of the table in the source database;

--source-schema: name of the schema containing the table;

-T: catalog type; can be glue, hive, nessie or jdbc;

-C: catalog branch or reference;

-U: catalog endpoint URI. This parameter is mandatory, but AWS Glue does not need it, hence the placeholder value 1;

-N: Iceberg namespace, no quotes;

-H: destination path for the Iceberg table, in this example the path to the AWS S3 bucket;

-R: passes additional Iceberg properties. It is used as a prefix to each additional property, with the property following it with no spaces. Working with S3 requires explicitly specifying the IO implementation, exactly as shown in the example above.
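
By default the destination Iceberg table takes the name of the source object. To write to a different table name, the documented --iceberg-table (-t) option can be added to the same invocation. A hedged sketch, where the table name item_attrs is purely illustrative:

java -jar ora2iceberg-0.8.1.7-all.jar \
--source-jdbc-url jdbc:oracle:thin:@dbhost:1521/SID \
--source-user dbuser --source-password hispassword \
-T glue -C test \
-H "s3://iceberg-warehouse" \
-Rio-impl=org.apache.iceberg.aws.s3.S3FileIO \
--source-object mtl_item_attributes \
--iceberg-table item_attrs \
--source-schema inv -N dest-warehose -U 1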

Scenario 2: Copying the table to the local file system, using on-prem Nessie or a relational database as the catalog

This scenario is useful if you're going to access the Iceberg tables locally via ClickHouse or DuckDB.

Prerequisites

  1. The tool must be local to your destination storage.

  2. If a Nessie catalog is used, it must be configured to accept connections from your account.

  3. If a database is used as the catalog, you need to know its type (Postgres, Oracle or MySQL), the login credentials, and the hostname and port on which it accepts connections.

Example: Local Iceberg storage + Nessie catalog on-prem

cd ~/ora2iceberg/build/libs/
java -jar ora2iceberg-0.8.1.7-all.jar \
--source-jdbc-url jdbc:oracle:thin:@dbhost:1521/SID \
--source-user dbuser --source-password hispassword \
-T nessie -C test \
-U "http://cataloghostname:19120/api/v2" \
-H "file:///clickhouse/iceberg" \
--source-object mtl_item_attributes \
--source-schema inv -N dest-warehose

In the example above, the parameters represent the following:

--source-jdbc-url: specifies the source database URL, where dbhost is the hostname, 1521 is the listener port and SID is the database's service name;

--source-user and --source-password: source database username and password, as is, without any quotes;

--source-object: name of the table in the source database;

--source-schema: name of the schema containing the table;

-T: catalog type; can be glue, hive, nessie or jdbc;

-C: catalog branch or reference;

-U: catalog endpoint URI in double quotes, mandatory parameter, in this case in http format where cataloghostname is the Nessie catalog host and 19120 is the Nessie port;

-N: Iceberg namespace, no quotes;

-H: destination path for the Iceberg table, in this example the local directory /clickhouse/iceberg prefixed with file://.

Example: Local Iceberg storage + catalog in relational database

cd ~/ora2iceberg/build/libs/
java -jar ora2iceberg-0.8.1.7-all.jar \
--source-jdbc-url jdbc:oracle:thin:@dbhost:1521/SID \
--source-user dbuser --source-password hispassword \
-T jdbc -C test \
-U "jdbc:postgresql://pgdbhost:5432/postgres" \
-Rjdbc.user=catdbuser -Rjdbc.password=catdbpassword \
-H "file:///clickhouse/iceberg" \
--source-object mtl_item_attributes \
--source-schema inv -N dest-warehose

In the example above, the parameters represent the following:

--source-jdbc-url: specifies the source database URL, where dbhost is the hostname, 1521 is the listener port and SID is the database's service name;

--source-user and --source-password: source database username and password, as is, without any quotes;

--source-object: name of the table in the source database;

--source-schema: name of the schema containing the table;

-T: catalog type; can be glue, hive, nessie or jdbc;

-C: catalog branch or reference;

-U: catalog endpoint URI in double quotes, mandatory parameter, in this case in jdbc format where pgdbhost is the PostgreSQL database host, 5432 is its listener's port and postgres is the name of the database that will store the catalog data;

-N: Iceberg namespace, no quotes;

-H: destination path for the Iceberg table in quotes, in this example the local directory /clickhouse/iceberg prefixed with file://;

-R: passes additional Iceberg properties. It is used as a prefix to each additional property, with the property following it with no spaces. When the catalog is stored in a database, you have to specify -Rjdbc.user and -Rjdbc.password for the database that will store the catalog.
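
As noted in the prerequisites, the catalog database does not have to be PostgreSQL; MySQL or Oracle can hold it as well. A hedged sketch of the same command pointed at a hypothetical MySQL instance (mysqlhost, port 3306 and the database name iceberg_catalog are placeholders; the matching JDBC driver also needs to be available to the tool, which may depend on how the jar was built):

java -jar ora2iceberg-0.8.1.7-all.jar \
--source-jdbc-url jdbc:oracle:thin:@dbhost:1521/SID \
--source-user dbuser --source-password hispassword \
-T jdbc -C test \
-U "jdbc:mysql://mysqlhost:3306/iceberg_catalog" \
-Rjdbc.user=catdbuser -Rjdbc.password=catdbpassword \
-H "file:///clickhouse/iceberg" \
--source-object mtl_item_attributes \
--source-schema inv -N dest-warehose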

Scenario 3: Copying the table to S3-compatible storage with Nessie or jdbc as the catalog

You can use this scenario for transferring tables to existing on-prem or cloud-based S3-compatible storage, such as Apache Ozone.

Prerequisites

  • Configure the S3 storage bucket (specified as bucket-test in the below example)

  • Configure and save the access key pair if needed.

  • Set the following environment variables in your shell:

    • AWS_REGION to the region of your bucket;

    • AWS_ACCESS_KEY_ID to your access key ID;

    • AWS_SECRET_ACCESS_KEY to the secret access key.

Example with Nessie

export AWS_REGION=us-east-1
export AWS_ACCESS_KEY_ID=SOMEONESKEY
export AWS_SECRET_ACCESS_KEY=THEIRSECRET
cd ~/ora2iceberg/build/libs/
java -jar ora2iceberg-0.8.1.7-all.jar \
--source-jdbc-url jdbc:oracle:thin:@dbhost:1521/SID \
--source-user dbuser --source-password hispassword \
-T nessie -C test \
-U "http://cataloghostname:19120/api/v2" \
-H "s3://bucket-test" \
-Rio-impl=org.apache.iceberg.aws.s3.S3FileIO \
-Rs3.endpoint=http://s3host:9878/ \
-Rs3.path-style-access=true \
--source-object mtl_item_attributes \
--source-schema inv -N dest-warehose

In the example above, the parameters represent the following:

--source-jdbc-url: specifies the source database URL, where dbhost is the hostname, 1521 is the listener port and SID is the database's service name;

--source-user and --source-password: source database username and password, as is, without any quotes;

--source-object: name of the table in the source database;

--source-schema: name of the schema containing the table;

-T: catalog type; can be glue, hive, nessie or jdbc;

-C: catalog branch or reference;

-U: catalog endpoint URI in double quotes, mandatory parameter, in this case in http format where cataloghostname is the Nessie catalog host and 19120 is the Nessie port;

-N: Iceberg namespace, no quotes;

-H: destination path for the Iceberg table in quotes, in this example the path to the S3 bucket named bucket-test;

-R: passes additional Iceberg properties. It is used as a prefix to each additional property, with the property following it with no spaces. When used with third-party S3-compatible storage, you have to specify the IO implementation (-Rio-impl) exactly as shown, -Rs3.endpoint in the above http format without quotes (where s3host is the S3 storage hostname and 9878 is its port), and -Rs3.path-style-access=true.
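
The same pattern works for other S3-compatible stores; only the endpoint changes. A hedged sketch against a hypothetical MinIO endpoint (miniohost and port 9000 are placeholders, and the AWS_* environment variables should hold the MinIO credentials, as above):

java -jar ora2iceberg-0.8.1.7-all.jar \
--source-jdbc-url jdbc:oracle:thin:@dbhost:1521/SID \
--source-user dbuser --source-password hispassword \
-T nessie -C test \
-U "http://cataloghostname:19120/api/v2" \
-H "s3://bucket-test" \
-Rio-impl=org.apache.iceberg.aws.s3.S3FileIO \
-Rs3.endpoint=http://miniohost:9000/ \
-Rs3.path-style-access=true \
--source-object mtl_item_attributes \
--source-schema inv -N dest-warehose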

Scenario 4: Copying the table to AWS S3 storage with Hive as the catalog

You can use this scenario for transferring tables to AWS S3 storage when you already have an on-prem Hadoop cluster or use the AWS EMR service with the Hive server activated.

Prerequisites

  • Configure the S3 storage bucket (specified as bucket-test in the below example)

  • Create the Hive database if using other than ‘default’.

  • Set the following environment variables in your shell:

    • AWS_REGION to the region of your bucket;

    • AWS_ACCESS_KEY_ID to your access key ID;

    • AWS_SECRET_ACCESS_KEY to the secret access key.

Example

export AWS_REGION=us-east-1
export AWS_ACCESS_KEY_ID=AccOuNtKey
export AWS_SECRET_ACCESS_KEY=OhSecReT
cd ~/ora2iceberg/build/libs/
java -jar ora2iceberg-0.8.1.7-all.jar \
--source-jdbc-url jdbc:oracle:thin:@dbhost:1521/SID \
--source-user dbuser --source-password hispassword \
-T hive -C default \
-U "thrift://hiveserver:9083" \
-H "s3://bucket-test" \
-Rio-impl=org.apache.iceberg.aws.s3.S3FileIO \
--source-object mtl_item_attributes \
--source-schema inv -N dest-warehose

In the example above, the parameters represent the following:

--source-jdbc-url: specifies the source database URL, where dbhost is the hostname, 1521 is the listener port and SID is the database's service name;

--source-user and --source-password: source database username and password, as is, without any quotes;

--source-object: name of the table in the source database;

--source-schema: name of the schema containing the table;

-T: catalog type; can be glue, hive, nessie or jdbc;

-C: catalog database name in Hive;

-U: catalog endpoint URI in double quotes, mandatory parameter, in this case in thrift format where hiveserver is the Hive metastore host and 9083 is its Thrift port;

-N: Iceberg namespace, no quotes;

-H: destination path for the Iceberg table in quotes, in this example the path to the S3 bucket named bucket-test;

-R: passes additional Iceberg properties. It is used as a prefix to each additional property, with the property following it with no spaces. When used with AWS S3 storage, you have to specify the IO implementation (-Rio-impl) exactly as shown.

4. Data Type Mapping

Default Mappings

Ora2Iceberg maps Oracle types to Iceberg types as follows:

Oracle Type                 Iceberg Type
NUMBER                      decimal(38,10)
NUMBER(p,s)                 decimal(p,s)
NUMBER(p,0), s=0, p<10      integer (int)
NUMBER(p,0), s=0, p<19      long (bigint)
VARCHAR2, CHAR              string
TIMESTAMP                   timestamp
DATE                        timestamp

Custom Overrides

Customize mappings using the -m option:

-m "COLUMN_NAME:NUMBER=long; %_ID:NUMBER=integer"

Syntax:

COLUMN_OR_PATTERN:ORACLE_TYPE=ICEBERG_TYPE

Examples:

Map a specific column: -m "EMP_ID:NUMBER=long"

Use patterns: -m "%_ID:NUMBER=integer"

Supports % for partial matches (at the beginning or end only). (Seriously, no middle matches—don’t even try!)
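
In a full invocation the mapping override is simply another option. A hedged sketch that reuses the Scenario 1 command and forces every column matching %_ID to long (the pattern is illustrative):

java -jar ora2iceberg-0.8.1.7-all.jar \
--source-jdbc-url jdbc:oracle:thin:@dbhost:1521/SID \
--source-user dbuser --source-password hispassword \
-T glue -C test \
-H "s3://iceberg-warehouse" \
-Rio-impl=org.apache.iceberg.aws.s3.S3FileIO \
--source-object mtl_item_attributes \
--source-schema inv -N dest-warehose -U 1 \
-m "%_ID:NUMBER=long"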

5. Default Number Format

The default fallback for ambiguous Oracle NUMBER columns is decimal(38,10).

Configuration

Override using the -d parameter:

-d "decimal(20,5)"

Example:

java -jar build/libs/ora2iceberg.jar \
-d "decimal(20,5)"

6. Partitioning

Supported Partition Types

Type        Description
IDENTITY    Direct column mapping
YEAR        Partition by year
MONTH       Partition by month
DAY         Partition by day
HOUR        Partition by hour
BUCKET      Hash-based bucketing (requires bucket count)
TRUNCATE    Truncate strings to a fixed length

Syntax

Define partitions using the -P option:

-P column_name=IDENTITY

-P column_name=BUCKET,10

Example:

java -jar build/libs/ora2iceberg.jar \
-P dept=IDENTITY \
-P emp_id=BUCKET,10

(If you don’t love partitions yet, you will soon!)
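
Because Oracle DATE and TIMESTAMP columns map to Iceberg timestamps, the time-based transforms can be applied to them directly. A hedged sketch that reuses the Scenario 2 Nessie command and partitions by month on a hypothetical last_update_date column:

java -jar ora2iceberg-0.8.1.7-all.jar \
--source-jdbc-url jdbc:oracle:thin:@dbhost:1521/SID \
--source-user dbuser --source-password hispassword \
-T nessie -C test \
-U "http://cataloghostname:19120/api/v2" \
-H "file:///clickhouse/iceberg" \
--source-object mtl_item_attributes \
--source-schema inv -N dest-warehose \
-P last_update_date=MONTH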

7. Understanding Parameters

Source Connection Parameters

-j, --source-jdbc-url
  Oracle JDBC URL for the source connection. This parameter is required.
  Example: jdbc:oracle:thin:@localhost:1521:xe

-u, --source-user
  Oracle username for the source connection.
  Example: system

-p, --source-password
  Password for the source Oracle connection.
  Example: password123

-s, --source-schema
  Source schema name. If not specified, the value of <source-user> is used.
  Example: HR

-o, --source-object
  Name of the source table, view, or SQL SELECT query. SQL SELECT is not implemented yet.
  Example: employees or SELECT * FROM employees WHERE department_id = 10

-w, --where-clause
  Optional WHERE clause for the <source-object>. Valid only if <source-object> points to a table or view.
  Example: WHERE salary > 50000
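
Putting the short forms together, a hedged sketch of a filtered copy that reuses the example values from the list above and the Scenario 1 destination options (the filter is illustrative):

java -jar ora2iceberg-0.8.1.7-all.jar \
-j jdbc:oracle:thin:@localhost:1521:xe \
-u system -p password123 \
-s HR -o employees \
-w "WHERE salary > 50000" \
-T glue -C test -H "s3://iceberg-warehouse" \
-Rio-impl=org.apache.iceberg.aws.s3.S3FileIO \
-N dest-warehose -U 1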

Iceberg Destination Parameters

-T, --iceberg-catalog-type
  Type of Iceberg catalog. Can be predefined (e.g., REST, JDBC, HADOOP) or a fully qualified class name.
  Example: REST

-C, --iceberg-catalog
  Name of the Apache Iceberg catalog.
  Example: MyCatalog

-U, --iceberg-catalog-uri
  URI for the Apache Iceberg catalog.
  Example: http://localhost:8080

-H, --iceberg-warehouse
  Location of the Apache Iceberg warehouse.
  Example: /path/to/warehouse

-N, --iceberg-namespace
  Namespace for the Iceberg catalog. Defaults to the source schema.
  Example: db_namespace

-t, --iceberg-table
  Name of the destination Iceberg table. Defaults to the source object name for tables/views.
  Example: iceberg_table_name

-P, --iceberg-partition
  Partitioning definitions for the Iceberg table.
  Example: columnName=YEAR or columnName=BUCKET,10

-L, --upload-mode
  Upload mode: full, incremental, or merge. Merge is not implemented yet.
  Example: full

-R, --iceberg-catalog-properties
  Additional properties for the Apache Iceberg catalog implementation.
  Example: -Rs3.endpoint=http://ozone.companyname.com:9878/
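
The scenario examples above all perform full loads; the documented -L option also accepts incremental (merge is not implemented yet). A hedged sketch that re-runs the Scenario 1 copy in incremental mode; since this README does not describe how incremental change detection works, treat this purely as option syntax:

java -jar ora2iceberg-0.8.1.7-all.jar \
--source-jdbc-url jdbc:oracle:thin:@dbhost:1521/SID \
--source-user dbuser --source-password hispassword \
-T glue -C test \
-H "s3://iceberg-warehouse" \
-Rio-impl=org.apache.iceberg.aws.s3.S3FileIO \
--source-object mtl_item_attributes \
--source-schema inv -N dest-warehose -U 1 \
-L incremental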

Additional Options

-f, --auto-infer-types
  Automatically infer numeric types (e.g., BIGINT vs NUMERIC). Not implemented yet.
  Example: -f

-d, --default-number-type
  Default numeric precision/scale for ambiguous NUMBER columns. Defaults to decimal(38,10).
  Example: decimal(10,2)

-m, --data-type-map
  Custom mappings from source data types to Iceberg types.
  Example: "COLUMN_NAME:NUMBER=integer; PATTERN%:NUMBER=decimal(20,0)"

More Information

For more details, documentation, and updates, visit the project's official website.

License

This project is licensed under the Apache-2.0 License.
