Add support for encrypted/protected data type in iceberg table #1582

Open
yigal-rozenberg opened this issue Jan 27, 2025 · 8 comments

@yigal-rozenberg
Contributor

Feature Request / Improvement

I am working on extending Apache Iceberg's supported data types with a new complex type: 'ProtectedType'.
Internally, this new data type is a StructType consisting of a header and a payload.
The header would include, at minimum:

  1. Encryption Provider ID
  2. Encryption Key ID
  3. Data Type

The payload would hold the encrypted data as BinaryType.
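
As a rough illustration of the layout described above, here is a minimal sketch using plain Python dataclasses. It deliberately does not use PyIceberg's type classes; the names `ProtectedHeader`, `ProtectedValue`, and `ClearType`, and the set of clear text types, are assumptions made only for this example.

```python
from dataclasses import dataclass
from enum import Enum


class ClearType(Enum):
    """Original (clear text) type of the encrypted value; hypothetical set."""
    STRING = "string"
    LONG = "long"
    DATE = "date"


@dataclass(frozen=True)
class ProtectedHeader:
    provider_id: str      # 1. Encryption Provider ID
    key_id: str           # 2. Encryption Key ID
    data_type: ClearType  # 3. Data Type of the clear text


@dataclass(frozen=True)
class ProtectedValue:
    header: ProtectedHeader
    payload: bytes        # encrypted data, stored as binary


value = ProtectedValue(
    header=ProtectedHeader("provider-a", "key-001", ClearType.STRING),
    payload=b"\x8f\x02\x17",  # placeholder bytes, not real cipher text
)
```

In an actual table schema the same shape would presumably map to a struct with a nested header struct and a binary payload field.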

The goal is to let end users interact transparently with the new type, including operations between encrypted data items and clear text.
Furthermore, Puffin files could be extended to store aggregate data based on the clear text values, bloom filters, and optionally an inverted index for regex search without a full table scan.

I am looking for guidance on how such a data type can be introduced and which dependencies I would need to address in the various readers and writers.

protected_type_merge.txt

@yigal-rozenberg
Contributor Author

Sample program to read and write using ProtectedType data.

app.txt

@kevinjqliu
Contributor

Hey there! Thanks for creating this issue. Typically, for something like this we would want to create an improvement proposal and get feedback from the community.
In this case, the proposal seems to be adding a new data type to the table specification. (Note that the table specification should be language agnostic.)

Here are some somewhat related threads that I've found:
https://lists.apache.org/thread/jm5xoy3fro4omlqlo476cf0118dcznkr
apache/iceberg#10909

Hope this helps!

@yigal-rozenberg
Contributor Author

Before posting this as a proper improvement request, I would like to come up with a POC that demonstrates the desired functionality. The thread you provided talks about the need for proper data-centric security, and I have some years of experience in this topic.
IMHO the best way to secure data and centrally control access is to use data item encryption. In some cases this is also referred to as column-level encryption; however, that term can be confused with file encryption in column-based data.
When data items are encrypted, the cipher text can be sent, shared, and accessed across multiple systems and engines.
The challenge is that cipher text by itself does not include metadata such as the key ID used to encrypt it or the original data type of the clear text.
As a first phase, I am trying to understand how, in the Iceberg Python interface, I can create a new data type that behaves one way when it stores and reads data from table storage, and another way when data is inserted, updated, or selected.

Do I need to implement different `__str__` / `__repr__` methods in the new data type, or do I need to do it elsewhere?
Where to implement the operators to support operations between 2 encrypted types, and operations between encrypted and clear text?
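
To make the two questions above concrete, here is a small sketch of a value wrapper whose `__repr__` never shows clear text and whose comparison operators accept either another protected value or a clear text operand. Everything in it is hypothetical: the `Protected` class, the `KEYS` registry, and the single-byte XOR "cipher", which is a stand-in for a real encryption provider and is not secure. It is not PyIceberg API.

```python
from typing import Dict, Union

# Hypothetical key registry: key_id -> single-byte XOR key (toy only, NOT secure).
KEYS: Dict[str, int] = {"key-001": 0x5A}


def _xor(data: bytes, key: int) -> bytes:
    return bytes(b ^ key for b in data)


class Protected:
    """Cipher text plus the metadata needed to decode it again."""

    def __init__(self, key_id: str, data_type: str, payload: bytes) -> None:
        self.key_id = key_id
        self.data_type = data_type  # "string" or "long" in this sketch
        self.payload = payload

    @classmethod
    def encrypt(cls, value: Union[str, int], key_id: str) -> "Protected":
        if isinstance(value, str):
            raw, data_type = value.encode("utf-8"), "string"
        elif isinstance(value, int):
            raw, data_type = str(value).encode("ascii"), "long"
        else:
            raise TypeError(f"unsupported clear text type: {type(value).__name__}")
        return cls(key_id, data_type, _xor(raw, KEYS[key_id]))

    def reveal(self) -> Union[str, int]:
        """Decrypt and decode back to the original clear text type."""
        raw = _xor(self.payload, KEYS[self.key_id])
        return raw.decode("utf-8") if self.data_type == "string" else int(raw.decode("ascii"))

    def __repr__(self) -> str:
        # Storage-facing view: metadata only, never the clear text.
        return (f"Protected(key_id={self.key_id!r}, data_type={self.data_type!r}, "
                f"payload=<{len(self.payload)} bytes>)")

    @staticmethod
    def _clear(other):
        return other.reveal() if isinstance(other, Protected) else other

    def __eq__(self, other) -> bool:
        return self.reveal() == self._clear(other)

    def __lt__(self, other) -> bool:
        return self.reveal() < self._clear(other)


name = Protected.encrypt("alice", "key-001")
age = Protected.encrypt(41, "key-001")
```

With this shape, `name == "alice"` and `age < 50` both work, while `repr(name)` only exposes the key ID, data type, and payload length. Where such operators would actually live in PyIceberg is the open question of this comment.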

@yigal-rozenberg
Contributor Author

Added a code change here: #1594

@Fokko
Contributor

Fokko commented Feb 3, 2025

As @kevinjqliu already pointed out, this is a bigger thing than just implementing it in PyIceberg. There is also a proposal out for file-level encryption: apache/iceberg#12162. For the implementation, I think there are a few more things to take into account. For example, query optimization uses the column's min/max from the manifest file, so we need to make sure these are decrypted as well before evaluating the metrics.

> Do I need to implement different `__str__` / `__repr__` methods in the new data type, or I need to do it elsewhere?

I think this is being done on the PrimitiveType.
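
To illustrate the min/max point, here is a minimal sketch of file pruning where the bounds are stored encrypted and decrypted before the comparison. The function names and the XOR-based toy cipher are assumptions for the example only; this is not how PyIceberg's metrics evaluation is implemented.

```python
from typing import Callable

MASK = 0x5A5A5A5A5A5A5A5A  # toy key, NOT real encryption


def toy_encrypt(value: int) -> bytes:
    """Encrypt a non-negative 63-bit integer with a toy XOR mask."""
    return (value ^ MASK).to_bytes(8, "big")


def toy_decrypt(payload: bytes) -> int:
    return int.from_bytes(payload, "big") ^ MASK


def might_contain_rows(
    lower_enc: bytes,
    upper_enc: bytes,
    literal: int,
    decrypt: Callable[[bytes], int],
) -> bool:
    """Return True if a file with bounds [lower, upper] may hold rows equal to `literal`."""
    lower = decrypt(lower_enc)
    upper = decrypt(upper_enc)
    return lower <= literal <= upper


lower_enc, upper_enc = toy_encrypt(10), toy_encrypt(20)

keep_file = might_contain_rows(lower_enc, upper_enc, 15, toy_decrypt)  # 15 is within [10, 20]
skip_file = might_contain_rows(lower_enc, upper_enc, 25, toy_decrypt)  # 25 is outside

# Comparing the raw cipher text would wrongly prune the file for literal 15,
# because cipher text order does not follow clear text order.
raw_compare = lower_enc <= toy_encrypt(15) <= upper_enc
```

Here `keep_file` is true and `skip_file` is false, while `raw_compare` is false even though 15 lies between the clear text bounds, which is why the bounds have to be decrypted first.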

@yigal-rozenberg
Contributor Author

yigal-rozenberg commented Feb 3, 2025

Hi @Fokko,
File encryption is essential functionality that adds an important layer of integrity and security at the file level.
Data-centric security addresses a different challenge in data privacy and security.
The idea is based on a concept where you protect and classify data during creation and carry the security properties across all systems without the need to re-encrypt and reclassify it.
Traditional data security solutions rely on data discovery and classification based on the location of the data (server.db.schema.table.column), and the need to continuously track and re-identify data as it flows in the organization.
Once the base protected data type is identified by the engine as protected and not just a structure or binary buffer, the engine layer can set min/max and even aggregation values in puffin files to further improve performance.
Additionally, these puffin files can be extended to include bloom filters (based on the clear text values) and even inverted indexes to allow wildcard and other operations over the clear text data representation (this is where file encryption is important).
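
As a sketch of the bloom filter idea, the following builds a small filter over clear text values so that an equality lookup can skip files without decrypting every row. The `BloomFilter` class, its sizes, and the hashing scheme are assumptions for illustration; how such a filter would be serialized as a Puffin blob is not covered here.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Minimal bloom filter over string values (illustrative parameters)."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, value: str) -> Iterator[int]:
        # Derive num_hashes bit positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value: str) -> None:
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value: str) -> bool:
        # False means definitely absent; True means possibly present.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(value))


# Built at write time from the clear text values of one data file.
file_filter = BloomFilter()
for clear_value in ["alice", "bob"]:
    file_filter.add(clear_value)
```

A reader would call `file_filter.might_contain("alice")` before opening the file. Because the filter is derived from clear text, it leaks information about the values unless the file holding it is itself encrypted.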

I am still trying to understand the code base, and identify how and where I can implement supported operators according to the input data type to decipher the data and decode it to the original data type (e.g., binary->str, binary->date, etc.).

@Fokko
Contributor

Fokko commented Feb 6, 2025

Thanks for the additional context @yigal-rozenberg. I find this very interesting. Also, I know certain companies that use this pattern for GDPR, where they erase the decryption key when a right-to-be-forgotten request is issued.
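
The key-erasure pattern can be sketched roughly as follows. The `KeyStore` class and the repeating-key XOR are hypothetical stand-ins for a real key management service and cipher, used only to show that data becomes unreadable once its key is gone.

```python
import secrets
from typing import Dict


class KeyStore:
    """Hypothetical in-memory key store; a real deployment would use a KMS."""

    def __init__(self) -> None:
        self._keys: Dict[str, bytes] = {}

    def create(self, key_id: str) -> None:
        self._keys[key_id] = secrets.token_bytes(16)

    def get(self, key_id: str) -> bytes:
        return self._keys[key_id]  # raises KeyError once the key is erased

    def erase(self, key_id: str) -> None:
        del self._keys[key_id]


def toy_xor(data: bytes, key: bytes) -> bytes:
    """Toy symmetric transform (NOT secure): XOR with a repeating key."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


store = KeyStore()
store.create("user-42")  # one key per data subject

cipher_text = toy_xor(b"alice@example.com", store.get("user-42"))
round_trip = toy_xor(cipher_text, store.get("user-42"))

# Right-to-be-forgotten: drop the key; stored cipher text stays where it is.
store.erase("user-42")
```

After `erase`, any attempt to call `store.get("user-42")` fails, so the cipher text already written to immutable data files can no longer be decoded.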

If you want more feedback on this, I would highly encourage you to discuss it on the dev-list, which is the official place for communication for the Iceberg project (across languages).
