feat: Dynamic Parquet encryption and decryption properties #16779
Conversation
Hi @alamb, could you please take a look at this when you have time, or suggest someone else who could review this?
Thanks @adamreeve -- I took a look and I think the APIs and code look pretty good to me. I have some API and testing suggestions but otherwise 👍
const ENCRYPTION_FACTORY_ID: &str = "example.memory_kms_encryption";

/// This example demonstrates reading and writing Parquet files that
/// are encrypted using Parquet Modular Encryption, and uses the
Instead of actually integrating with a KMS in the example, could we potentially stub it out and avoid the dependency on the new crate?
I realize this is like an "end to end" integration test, but I think that might make more sense to keep somewhere outside the core datafusion crate 🤔
Yeah I'll change this example so it doesn't use the new crate. I intend to add a new feature or crate in parquet-key-management-rs that implements integration with DataFusion anyway, so having an example here that just partially implements this probably isn't very helpful.
I've removed the new dependency now and mocked out the KMS behaviour by just base64 encoding encryption keys.
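As a rough illustration of that kind of mock (a minimal sketch; the type and method names below are made up for illustration and are not taken from the DataFusion example or the parquet crate), "wrapping" a key is just base64 encoding it, so no real key service is involved:

```rust
// Hypothetical mock KMS: "encrypting" a data-encryption key is just base64
// encoding it, so the example needs no external key management service.
use base64::{engine::general_purpose::STANDARD, Engine as _};

struct MockKms;

impl MockKms {
    /// "Wrap" (encrypt) a key by base64 encoding it.
    fn wrap_key(&self, key: &[u8]) -> String {
        STANDARD.encode(key)
    }

    /// Recover the original key from the wrapped form.
    fn unwrap_key(&self, wrapped: &str) -> Result<Vec<u8>, base64::DecodeError> {
        STANDARD.decode(wrapped)
    }
}

fn main() {
    let kms = MockKms;
    let key = [42u8; 16]; // an example AES-128 key
    let wrapped = kms.wrap_key(&key);
    assert_eq!(kms.unwrap_key(&wrapped).unwrap(), key.to_vec());
}
```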
@@ -96,13 +95,11 @@ impl TryFrom<&TableParquetOptions> for WriterPropertiesBuilder {
    global,
    column_specific_options,
    key_value_metadata,
-   crypto,
+   crypto: _,
Does this mean the crypto options on the table parquet options are ignored now (perhaps replaced by the crypto factory)? If so, I suggest we remove the field or mark it deprecated.
No, they are still used, but I wanted to keep the logic for creating the FileEncryptionProperties in one place. So after try_from is called, there's a new set_writer_encryption_properties method that will either use the existing ConfigFileEncryptionProperties or the new factory: https://github.com/adamreeve/datafusion/blob/b42f21ddf7dd01991e576dc34472db010dc23012/datafusion/datasource-parquet/src/file_format.rs#L1327-L1334
I couldn't include this logic in the try_from implementation as it requires some other data from outside the TableParquetOptions.
I guess this could be a little bit surprising if a user expects try_from to be non-lossy. I could probably restore the old behaviour of using the ConfigFileEncryptionProperties here, or just add a comment about this. I've added a comment about this for now.
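To make the described flow concrete, here is a simplified sketch. The types below are stand-ins for TableParquetOptions, WriterPropertiesBuilder and the new set_writer_encryption_properties step (the real DataFusion signatures differ); the point is that try_from deliberately ignores the crypto options, and a separate step with access to extra context, such as the registered factories, supplies the encryption properties:

```rust
// Simplified stand-ins, not the real DataFusion/parquet types.
#![allow(dead_code)]

#[derive(Clone)]
struct FileEncryptionProperties {
    footer_key: Vec<u8>,
}

#[derive(Default)]
struct CryptoOptions {
    // Statically configured encryption (the ConfigFileEncryptionProperties path)...
    file_encryption: Option<FileEncryptionProperties>,
    // ...or the id of a registered encryption factory.
    factory_id: Option<String>,
}

#[derive(Default)]
struct TableParquetOptions {
    crypto: CryptoOptions,
}

#[derive(Default)]
struct WriterPropertiesBuilder {
    encryption: Option<FileEncryptionProperties>,
}

impl TryFrom<&TableParquetOptions> for WriterPropertiesBuilder {
    type Error = String;

    // Deliberately lossy: crypto is handled in a later step, because that
    // step needs data that isn't available inside TableParquetOptions.
    fn try_from(_opts: &TableParquetOptions) -> Result<Self, Self::Error> {
        Ok(Self::default())
    }
}

// Stand-in for the set_writer_encryption_properties step: chooses between
// the statically configured properties and a registered factory.
fn set_writer_encryption_properties(
    mut builder: WriterPropertiesBuilder,
    crypto: &CryptoOptions,
    lookup_factory: impl Fn(&str) -> Option<FileEncryptionProperties>,
) -> WriterPropertiesBuilder {
    if let Some(id) = &crypto.factory_id {
        builder.encryption = lookup_factory(id);
    } else if let Some(props) = &crypto.file_encryption {
        builder.encryption = Some(props.clone());
    }
    builder
}
```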
I will try and review this again in the next day or two
Same.
use std::sync::Arc;
use tempfile::TempDir;

const ENCRYPTION_FACTORY_ID: &str = "example.mock_kms_encryption";
One question I have is whether we're happy to keep these identifiers arbitrary or if we should suggest or enforce some kind of format. We could suggest using the qualified name of the factory type, eg. parquet_encrypted_with_kms::TestEncryptionFactory, but there could be valid reasons to register different instances of the same type.
We might also want to reserve a prefix to allow for factories to be built-in in the future, eg. "apache." or "datafusion.".
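A small, self-contained illustration of that point (the registry and factory types below are simplified stand-ins, not the DataFusion API): keying the registry by an arbitrary string rather than the factory's type name lets two differently configured instances of the same type coexist.

```rust
// Why arbitrary string ids rather than type names: the same factory type can
// be registered twice under different ids with different configuration.
use std::collections::HashMap;
use std::sync::Arc;

trait EncryptionFactory: Send + Sync {
    fn kms_endpoint(&self) -> &str;
}

struct KmsEncryptionFactory {
    endpoint: String,
}

impl EncryptionFactory for KmsEncryptionFactory {
    fn kms_endpoint(&self) -> &str {
        &self.endpoint
    }
}

fn main() {
    let mut registry: HashMap<String, Arc<dyn EncryptionFactory>> = HashMap::new();

    // Same type, different configuration, distinguished only by the id.
    registry.insert(
        "example.kms_eu_encryption".to_string(),
        Arc::new(KmsEncryptionFactory { endpoint: "https://kms.eu.example".into() }),
    );
    registry.insert(
        "example.kms_us_encryption".to_string(),
        Arc::new(KmsEncryptionFactory { endpoint: "https://kms.us.example".into() }),
    );

    let factory = &registry["example.kms_eu_encryption"];
    println!("using KMS at {}", factory.kms_endpoint());
}
```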
I think arbitrary keys are good for now, personally, until we have some specific use case for namespacing these keys.
Yep, for now I think it is clear enough. If we notice some obvious deficiency in the future we can revisit the naming scheme.
Thank you @adamreeve -- this looks good to me 🏆
This looks reasonable to me. The true test for me will be trying to start to use it as soon as it's merged :)
Thanks @adamreeve!
@mbutrovich I'm curious if you can share any details about how you want to use this. Do you need compatibility with PyArrow or Spark, and would the parquet-key-management crate be useful to you? I plan to add DataFusion integration to that crate once this is released.
At the moment my primary interest is for Comet, which is a Spark accelerator built on DataFusion. For Comet I think my plan is to use JNI to call back to the registered KMS on the Spark side, since I don't expect every Spark KMS implementation to reimplement a Rust version.
Let's get it merged so we can begin testing more easily.
@mbutrovich it would be great if you could test this with Comet before the 50.0.0 release so that we can make any adjustments if needed before a public release.
Thanks again @adamreeve
Squashed commit message:
* Add encryption factory API for more flexible encryption configuration
* Remove extra DynEncryptionFactory trait
* Tidy up example
* Tidy ups
* Fix "unnecessary qualification" errors in example
* Fix toml format with taplo
* Fix broken link in docs
* Clippy fixes
* Update examples README
* Add extra validation of table encryption
* clippy fix
* Remove register_parquet_encryption_factory from SessionContext
* Remove new dependency from example
* Update examples readme
* Run taplo format
* Fix outdated method reference in comment
* Extra comment

Co-authored-by: Andrew Lamb <[email protected]>
One thing we might want to change before releasing this is making the EncryptionFactory API async. Related: apache/arrow-rs#8238
I've made a PR to change this to async: #17342
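For context, a purely hypothetical sketch of what an async factory API can look like (this is not the trait from #17342; the method names, arguments and error types here are assumptions, shown only to illustrate why async matters when keys come from a remote KMS):

```rust
// Hypothetical shape only, using the async_trait crate; the real DataFusion
// trait may differ. Async methods let implementations call out to a remote
// KMS without blocking the runtime.
use async_trait::async_trait;
use std::collections::HashMap;

pub struct FileEncryptionProperties; // placeholder
pub struct FileDecryptionProperties; // placeholder

pub type BoxError = Box<dyn std::error::Error + Send + Sync>;

#[async_trait]
pub trait EncryptionFactory: Send + Sync {
    async fn get_file_encryption_properties(
        &self,
        options: &HashMap<String, String>, // opaque, factory-specific options
        file_path: &str,
    ) -> Result<Option<FileEncryptionProperties>, BoxError>;

    async fn get_file_decryption_properties(
        &self,
        options: &HashMap<String, String>,
        file_path: &str,
    ) -> Result<Option<FileDecryptionProperties>, BoxError>;
}
```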
Which issue does this PR close?
Closes #16778.
Rationale for this change
See #16778. This allows generating encryption keys per file and retrieving keys based on encryption metadata stored in the Parquet files, rather than requiring readers to know AES keys upfront.
What changes are included in this PR?
- A new EncryptionFactory trait for types that generate file encryption and decryption properties. This is loosely based on the approach used by Spark (see this comment for details).
- EncryptionFactory instances can be registered in the RuntimeEnv, similar to how ObjectStores can be registered. (ObjectStores are registered based on their URI scheme, but encryption factory ids are arbitrary string identifiers.)
- The crypto configuration field in TableParquetOptions allows setting an encryption factory id, and opaque configuration options required by the encryption factory.
- The registered EncryptionFactory is used where necessary.
Are these changes tested?
Yes, new unit tests and an example have been added.
Are there any user-facing changes?
Yes, this is a new user-facing feature.