Add support for encrypted/protected data type in iceberg table #1582
Comments
Sample program to read and write using ProtectedType data.
Hey there! Thanks for creating this issue. Typically, for something like this we would want to create an Improvement Proposal and get feedback from the community. Here are some somewhat related threads that I've found. Hope this helps!
Before posting this as a proper improvement request, I would like to come up with a POC that demonstrates the desired functionality. The thread you provided talks about the need for proper data-centric security, and I have some years of experience in this topic. Do I need to implement different str / repr methods in the new data type, or do I need to do it elsewhere?
Added a code change here: #1594
As @kevinjqliu already pointed out, this is bigger than just implementing it in PyIceberg. There is also a proposal out for file-level encryption: apache/iceberg#12162. For the implementation, I think there are a few more things to take into account. For example, query optimization uses the column's min/max values from the manifest file, so we need to make sure to decrypt these as well before evaluating the metrics.
I think this is being done on the
Hi @Fokko, I am still trying to understand the code base and identify how and where I can implement supported operators according to the input data type, to decipher the data and decode it to the original data type (e.g., binary->str, binary->date, etc.).
Related technical publication: https://www.etsi.org/deliver/etsi_ts/103500_103599/103532/01.01.01_60/ts_103532v010101p.pdf
Thanks for the additional context @yigal-rozenberg. I find this very interesting. Also, I know certain companies that use this pattern for GDPR, where they erase the decryption key when a right-to-be-forgotten request is issued. If you want more feedback on this, I would highly encourage you to discuss it on the dev-list, which is the official place for communication for the Iceberg project (across languages).
Feature Request / Improvement
I am working on extending the data types supported by Apache Iceberg with a new complex type: 'ProtectedType'.
This new data type internally is a StructType including a header and a payload.
The header is to include, at minimum:
The payload is to include the encrypted data as BinaryType.
The goal is to give end users transparent interaction with the new type, allowing operations between encrypted data items and clear text.
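To make the header/payload layout and the "transparent interaction" goal more concrete, here is a minimal standard-library sketch. It is only an illustration under stated assumptions: the header fields `key_id` and `algorithm`, the helpers `protect`, `reveal`, and `equals_clear`, and the XOR placeholder are all hypothetical and are not taken from the attached code. In an actual Iceberg schema the payload would be a BinaryType field inside the struct.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ProtectedHeader:
    # Hypothetical header fields, chosen only for illustration.
    key_id: str
    algorithm: str


@dataclass(frozen=True)
class ProtectedValue:
    header: ProtectedHeader
    payload: bytes  # encrypted bytes; would map to BinaryType in the table


def toy_cipher(data: bytes, key: bytes) -> bytes:
    """Placeholder XOR transform. NOT real encryption; it stands in for a real cipher."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


def protect(clear: str, key_id: str, keys: dict[str, bytes]) -> ProtectedValue:
    """Wrap a clear-text string into a header + encrypted payload."""
    header = ProtectedHeader(key_id=key_id, algorithm="toy-xor")
    return ProtectedValue(header, toy_cipher(clear.encode("utf-8"), keys[key_id]))


def reveal(value: ProtectedValue, keys: dict[str, bytes]) -> str:
    """Recover the clear text using the key named in the header."""
    return toy_cipher(value.payload, keys[value.header.key_id]).decode("utf-8")


def equals_clear(value: ProtectedValue, clear: str, keys: dict[str, bytes]) -> bool:
    """Compare a protected value with clear text without exposing the payload to the caller."""
    return reveal(value, keys) == clear
```

The header carries what a reader needs to pick the right key, so a comparison such as `equals_clear(value, "alice", keys)` can be evaluated without the caller handling the encrypted bytes directly.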
Furthermore, allow extension of Puffin files to store aggregate data based on the clear text values, bloom filters, and optionally an inverted index for regex search without a full table scan.
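As a rough illustration of the bloom filter idea, the sketch below builds a filter over clear-text values so a reader could rule out a data file without decrypting its payloads. The `BloomFilter` class, its parameters, and the SHA-256 based hashing are assumptions made for this example; it does not show the Puffin file format or how such a blob would be stored there.

```python
import hashlib


class BloomFilter:
    """Minimal bloom filter over string values (illustrative parameters)."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, value: str):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value: str) -> None:
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(value)
        )
```

A writer would call `add` with each clear-text value before encrypting it. A reader that gets `False` from `might_contain` can skip the file, while `True` still requires decrypting and checking the rows.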
Looking for guidance on how such data type can be introduced and what are the dependencies I would need to address with the various readers and writers.
protected_type_merge.txt