Add support for encrypted/protected data type in iceberg table #1582

yigal-rozenberg · 2025-01-27T17:28:01Z

Feature Request / Improvement

I am working on extending the data types supported by Apache Iceberg with a new complex type: 'ProtectedType'.
Internally, this new data type is a StructType consisting of a header and a payload.
The header includes, at minimum:

Encryption Provider ID
Encryption Key ID
Data Type

The payload holds the encrypted data as a BinaryType.
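To make the layout concrete, here is a minimal sketch of the header/payload envelope in plain Python. It does not use the PyIceberg API; the class names `ProtectedHeader` and `ProtectedValue`, the field names, and the sample values are assumptions made for illustration only.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ProtectedHeader:
    provider_id: str  # Encryption Provider ID
    key_id: str       # Encryption Key ID
    data_type: str    # type of the clear text, e.g. "string" or "date"


@dataclass(frozen=True)
class ProtectedValue:
    header: ProtectedHeader
    payload: bytes    # encrypted data; opaque to the table format

    def to_struct(self) -> dict:
        """Return the value as a nested record: {"header": {...}, "payload": b"..."}."""
        return {"header": asdict(self.header), "payload": self.payload}

    @classmethod
    def from_struct(cls, record: dict) -> "ProtectedValue":
        """Rebuild the value from the nested record produced by to_struct()."""
        return cls(ProtectedHeader(**record["header"]), record["payload"])


value = ProtectedValue(
    header=ProtectedHeader(provider_id="provider-a", key_id="key-001", data_type="string"),
    payload=b"\x8f\x02\x41",  # placeholder bytes, not real cipher text
)
record = value.to_struct()
restored = ProtectedValue.from_struct(record)
```

Because the header travels with every value, a reader can tell which provider and key to ask for and which type to decode into, without consulting anything outside the record.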

The goal is to let end users interact with the new type transparently, including operations between encrypted data items and operations between encrypted data and clear text.
Furthermore, Puffin files could be extended to store aggregate data based on the clear text values, bloom filters, and optionally an inverted index for regex search without a full table scan.

Looking for guidance on how such data type can be introduced and what are the dependencies I would need to address with the various readers and writers.

protected_type_merge.txt

yigal-rozenberg · 2025-01-27T17:38:06Z

Sample program to read and write using ProtectedType data.

app.txt

kevinjqliu · 2025-01-28T19:18:45Z

Hey there! Thanks for creating this issue. Typically, for something like this we would want to create an Improvement Proposal and get feedback from the community.
In this case, the proposal seems to be adding a new data type to the table specification. (Note that the table specification should be language agnostic.)

Here are some somewhat related threads that I've found:
https://lists.apache.org/thread/jm5xoy3fro4omlqlo476cf0118dcznkr
apache/iceberg#10909

Hope this helps!

yigal-rozenberg · 2025-01-28T21:36:17Z

Before posting this as a proper improvement request, I would like to come up with a POC that demonstrates the desired functionality. The thread you provided talks about the need for proper data-centric security, and I have some years of experience in this topic.
IMHO the best way to secure data and centrally control access is to use data item encryption. In some cases this is also referred to as column-level encryption; however, that term can be confused with file encryption of column-based data.
When data items are encrypted, the cipher text can be sent, shared, and accessed across multiple systems and engines.
The challenge is that cipher text by itself does not include metadata such as the key ID used to encrypt it or the original data type of the clear text.
As a first phase, I am trying to understand how, in the Iceberg Python interface, I can create a new data type that behaves one way when it stores and reads data from table storage and another way when data is inserted, updated, or selected.

Do I need to implement different str / repr methods in the new data type, or I need to do it elsewhere?
Where should I implement the operators that support operations between two encrypted values, and between encrypted values and clear text?
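As a rough illustration of the behavior being asked about, the sketch below wraps an integer in a value class whose operators work between two protected values and between a protected value and clear text. It is plain Python, not PyIceberg code, and does not answer where such hooks belong in the library. `ProtectedInt`, `KEYS`, and `_toy_transform` are hypothetical names, and the XOR transform is only a stand-in for a real cipher.

```python
# Hypothetical key store: key ID -> key bytes.
KEYS = {"key-001": b"\x5a\xa5\x3c"}


def _toy_transform(data: bytes, key: bytes) -> bytes:
    """Stand-in for a real cipher (XOR is its own inverse). NOT secure."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


class ProtectedInt:
    """An integer stored as cipher text plus the ID of the key that protects it."""

    def __init__(self, key_id: str, payload: bytes) -> None:
        self.key_id = key_id
        self.payload = payload

    @classmethod
    def protect(cls, value: int, key_id: str) -> "ProtectedInt":
        clear = value.to_bytes(8, "big", signed=True)
        return cls(key_id, _toy_transform(clear, KEYS[key_id]))

    def reveal(self) -> int:
        clear = _toy_transform(self.payload, KEYS[self.key_id])
        return int.from_bytes(clear, "big", signed=True)

    @staticmethod
    def _clear(other):
        # Accept either another protected value or a clear-text int.
        return other.reveal() if isinstance(other, ProtectedInt) else other

    def __eq__(self, other):
        if not isinstance(other, (ProtectedInt, int)):
            return NotImplemented
        return self.reveal() == self._clear(other)

    def __lt__(self, other):
        if not isinstance(other, (ProtectedInt, int)):
            return NotImplemented
        return self.reveal() < self._clear(other)

    def __add__(self, other):
        if not isinstance(other, (ProtectedInt, int)):
            return NotImplemented
        # The result is protected again under this value's key.
        return ProtectedInt.protect(self.reveal() + self._clear(other), self.key_id)

    __radd__ = __add__  # supports clear_text + protected

    def __repr__(self) -> str:
        # Never print the clear text; show only metadata.
        return f"ProtectedInt(key_id={self.key_id!r}, payload=<{len(self.payload)} bytes>)"


a = ProtectedInt.protect(40, "key-001")
b = ProtectedInt.protect(2, "key-001")
total = a + b   # protected + protected
mixed = 2 + a   # clear text + protected
```

In this sketch the string representation lives on the value wrapper and deliberately hides the clear text, while the arithmetic and comparison operators decrypt on demand.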

yigal-rozenberg · 2025-01-31T17:03:56Z

added code change here: #1594

Fokko · 2025-02-03T10:01:44Z

As @kevinjqliu already pointed out, this is a bigger thing than just implementing it in PyIceberg. There is also a proposal out for file-level encryption: apache/iceberg#12162. For the implementation, I think there are a few more things to take into account. For example, query optimization uses the column's min/max from the manifest file, so we need to make sure to decrypt these as well before evaluating the metrics.

Do I need to implement different str / repr methods in the new data type, or I need to do it elsewhere?

I think this is being done on the PrimitiveType.
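The min/max point above can be illustrated with a small sketch: cipher text generally does not sort like the clear text, so bounds have to be decrypted before a file is pruned. This is plain Python with a toy XOR transform standing in for a real cipher; `toy_encrypt`, `toy_decrypt`, `might_contain`, and `naive_might_contain` are hypothetical names, not PyIceberg functions.

```python
from typing import Callable

TOY_KEY = 0xFF  # hypothetical single-byte key for the stand-in transform


def toy_encrypt(value: int) -> bytes:
    """Stand-in for a real cipher: XOR each byte of an 8-byte big-endian int. NOT secure."""
    return bytes(b ^ TOY_KEY for b in value.to_bytes(8, "big"))


def toy_decrypt(payload: bytes) -> int:
    return int.from_bytes(bytes(b ^ TOY_KEY for b in payload), "big")


def might_contain(lower: bytes, upper: bytes, wanted: int,
                  decrypt: Callable[[bytes], int]) -> bool:
    """Return False only when a file can be skipped for `column == wanted`.

    The protected bounds are decrypted first, so the comparison happens on clear text.
    """
    return decrypt(lower) <= wanted <= decrypt(upper)


def naive_might_contain(lower: bytes, upper: bytes, wanted_cipher: bytes) -> bool:
    """Compare cipher text directly; only valid if the cipher preserves order."""
    return lower <= wanted_cipher <= upper


file_values = [3, 17, 256]
lower = toy_encrypt(min(file_values))  # bounds computed on clear text, stored protected
upper = toy_encrypt(max(file_values))
```

With this toy transform, `might_contain(lower, upper, 17, toy_decrypt)` keeps the file, while `naive_might_contain(lower, upper, toy_encrypt(17))` returns `False` and would wrongly skip a file that does contain 17.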

yigal-rozenberg · 2025-02-03T16:39:53Z

Hi @Fokko,
File encryption is essential functionality that adds an important layer of data integrity and security at the file level.
Data-centric security addresses a different challenge in the privacy and security of data.
The idea is to protect and classify data at creation time and carry the security properties across all systems, without the need to re-encrypt and reclassify it.
Traditional data security solutions rely on data discovery and classification based on the location of the data (server.db.schema.table.column), and need to continuously track and re-identify data as it flows through the organization.
Once the engine identifies the base protected data type as protected, and not just a structure or binary buffer, the engine layer can set min/max and even aggregation values in Puffin files to further improve performance.
Additionally, these Puffin files can be extended to include bloom filters (based on the clear text values) and even inverted indexes to allow wildcard and other operations over the clear text data representation (this is where file encryption is important).
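As a sketch of the bloom filter idea, the following plain-Python filter is built over clear text values and answers membership checks without touching the encrypted payloads. It is not a Puffin blob format and not an existing Iceberg API; the class name, the bit-array size, and the hashing scheme are assumptions chosen for brevity.

```python
import hashlib


class BloomFilter:
    """Minimal bloom filter: no false negatives, a tunable rate of false positives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 4) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, value: bytes):
        # Derive num_hashes bit positions by salting SHA-256 with a seed byte.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + value).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value: bytes) -> None:
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value: bytes) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(value))


# Built at write time, while the clear text is still available.
clear_values = [b"alice", b"bob", b"carol"]
bloom = BloomFilter()
for item in clear_values:
    bloom.add(item)
```

Since the filter is derived from clear text, it can leak information about the protected values, which is why the file holding it would itself need to be encrypted.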

I am still trying to understand the code base, and identify how and where I can implement supported operators according to the input data type to decipher the data and decode it to the original data type (e.g., binary->str, binary->date, etc.).

yigal-rozenberg · 2025-02-03T17:16:14Z

related technical publication: https://www.etsi.org/deliver/etsi_ts/103500_103599/103532/01.01.01_60/ts_103532v010101p.pdf

Fokko · 2025-02-06T18:53:27Z

Thanks for the additional context @yigal-rozenberg. I find this very interesting. Also, I know certain companies that use this pattern for GDPR, where they erase the decryption key when a right-to-be-forgotten request is issued.
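The key-erasure pattern can be sketched as follows: the protected payload stays in the table untouched, and deleting the key is what makes it unreadable. `KeyStore` and `toy_transform` are hypothetical names, the per-user key ID is an assumption, and the XOR transform is only a stand-in for a real cipher.

```python
import secrets


class KeyStore:
    """Hypothetical in-memory key store: key ID -> key bytes."""

    def __init__(self) -> None:
        self._keys = {}

    def create(self, key_id: str) -> None:
        self._keys[key_id] = secrets.token_bytes(32)

    def get(self, key_id: str) -> bytes:
        return self._keys[key_id]  # raises KeyError once the key is erased

    def erase(self, key_id: str) -> None:
        self._keys.pop(key_id, None)


def toy_transform(data: bytes, key: bytes) -> bytes:
    """Stand-in for a real cipher (XOR is its own inverse). NOT secure."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


store = KeyStore()
store.create("user-42")
payload = toy_transform(b"alice@example.com", store.get("user-42"))

# While the key exists, the stored payload can be read back.
clear = toy_transform(payload, store.get("user-42"))

# Right-to-be-forgotten: erase the key; the payload in the table is untouched.
store.erase("user-42")
```

After `erase`, any attempt to fetch the key fails, so the payload can no longer be decrypted even though no data file was rewritten.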

If you want more feedback on this, I would highly encourage you to discuss it on the dev-list, which is the official place for communication for the Iceberg project (across languages).
