Skip to content

Conversation

adamreeve
Copy link
Contributor

@adamreeve adamreeve commented Jul 15, 2025

Which issue does this PR close?

Rationale for this change

See #16778. This allows per-file encryption key generation and for keys to be retrieved based on encryption metadata stored in the Parquet files, rather than readers needing to know AES keys upfront.

What changes are included in this PR?

  • Adds a new EncryptionFactory trait for types that generate file encryption and decryption properties. This is loosely based on the approach used by Spark (see this comment for details).
  • Allows registering EncryptionFactory instances in the RuntimeEnv, similar to how ObjectStores can be registered. (ObjectStores are registered based on their URI scheme, but encryption factory ids are arbitrary string identifiers).
  • Updates the crypto configuration field in TableParquetOptions to allow setting an encryption factory id, and opaque configuration options required by the encryption factory.
  • Updates Parquet encryption and decryption code to use a registered EncryptionFactory where necessary.

Are these changes tested?

Yes, new unit tests and an example have been added.

Are there any user-facing changes?

Yes, this is a new user-facing feature.

@github-actions github-actions bot added core Core DataFusion crate common Related to common crate execution Related to the execution crate datasource Changes to the datasource crate labels Jul 15, 2025
@adamreeve
Copy link
Contributor Author

Hi @alamb, could you please take a look at this when you have time, or suggest someone else who could review this?

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @adamreeve -- I took a look and I think the APIs and code look pretty good to me . I have some API and testing suggestions but otherwise 👍

const ENCRYPTION_FACTORY_ID: &str = "example.memory_kms_encryption";

/// This example demonstrates reading and writing Parquet files that
/// are encrypted using Parquet Modular Encryption, and uses the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of actually integrating with a KMS in the example, could we potentially stub it out and avoid the dependency on the new crate?

I realize this is like a "end to end" integration test, but I think that might make more sense to keep somewhere outside the core datafusion crate 🤔

Copy link
Contributor Author

@adamreeve adamreeve Aug 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I'll change this example so it doesn't use the new crate. I intend to add a new feature or crate in parquet-key-management-rs that implements integration with DataFusion anyway, so having an example here that just partially implements this probably isn't very helpful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've removed the new dependency now and mocked out the KMS behaviour by just base64 encoding encryption keys.

@@ -96,13 +95,11 @@ impl TryFrom<&TableParquetOptions> for WriterPropertiesBuilder {
global,
column_specific_options,
key_value_metadata,
crypto,
crypto: _,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this mean the crypto options on the table parquet options are ignored now (perhaps replaced by the crypto factory). If so I suggest we remove the field or mark it deprecated

Copy link
Contributor Author

@adamreeve adamreeve Aug 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No they are still used, but I wanted to keep the logic for creating the FileEncryptionProperties in once place, so after try_from is called, there's a new set_writer_encryption_properties method that will either use the existing ConfigFileEncryptionProperties or the new factory: https://github.com/adamreeve/datafusion/blob/b42f21ddf7dd01991e576dc34472db010dc23012/datafusion/datasource-parquet/src/file_format.rs#L1327-L1334

I couldn't include this logic in the try_from implementation as it requires some other data from outside the TableParquetOptions.

I guess this could be a little bit surprising if a user expects try_from to be non-lossy. I could probably restore the old behaviour of using the ConfigFileEncryptionProperties here, or just add a comment about this?. I've added a comment about this for now.

@alamb
Copy link
Contributor

alamb commented Aug 6, 2025

I will try and review this again in the next day or two

@mbutrovich
Copy link
Contributor

I will try and review this again in the next day or two

Same.

use std::sync::Arc;
use tempfile::TempDir;

const ENCRYPTION_FACTORY_ID: &str = "example.mock_kms_encryption";
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One question I have is whether we're happy to keep these identifiers arbitrary or if we should suggest or enforce some kind of format. We could suggest using the qualified name of the factory type, eg. parquet_encrypted_with_kms::TestEncryptionFactory, but there could be valid reasons to register different instances of the same type.

We might also want to reserve a prefix to allow for factories to be built-in in the future, eg. apache. or datafusion..

Copy link
Contributor

@alamb alamb Aug 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think arbitrary keys are good for now, personally, until we have some specific use case for name spacing of these keys

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep for now I think is clear enough. If we notice some obvious deficiency in the future we can revisit the naming scheme.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @adamreeve -- this looks good to me 🏆

Copy link
Contributor

@mbutrovich mbutrovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks reasonable to me. The true test for me will be trying to start to use it as soon as it's merged :)

Thanks @adamreeve!

@adamreeve
Copy link
Contributor Author

@mbutrovich I'm curious if you can share any details about how you want to use this. Do you need compatibility with PyArrow or Spark, and would the parquet-key-management crate be useful to you? I plan to add DataFusion integration to that crate once this is released.

@mbutrovich
Copy link
Contributor

mbutrovich commented Aug 8, 2025

@mbutrovich I'm curious if you can share any details about how you want to use this. Do you need compatibility with PyArrow or Spark, and would the parquet-key-management crate be useful to you? I plan to add DataFusion integration to that crate once this is released.

At the moment my primary interest is for Comet, which is a Spark accelerator built on DataFusion. For Comet I think my plan is to use JNI to call back to the registered KMS on the Spark side since I don't expect every Spark KMS implementation to reimplement a Rust version.

@alamb
Copy link
Contributor

alamb commented Aug 8, 2025

Let's get it merged so we can begin testing more easily

@mbutrovich it would be great if you could test this with comet before the 50.0.0 release so that we can make any adjustments if needed before a public release

@alamb alamb merged commit cbc0614 into apache:main Aug 8, 2025
29 checks passed
@alamb
Copy link
Contributor

alamb commented Aug 8, 2025

Thanks again @adamreeve

@adamreeve adamreeve deleted the encryption_factory branch August 10, 2025 20:49
shruti2522 pushed a commit to shruti2522/datafusion that referenced this pull request Aug 10, 2025
)

* Add encryption factory API for more flexible encryption configuration

* Remove extra DynEncryptionFactory trait

* Tidy up example

* Tidy ups

* Fix "unnecessary qualification" errors in example

* Fix toml format with taplo

* Fix broken link in docs

* Clippy fixes

* Update examples README

* Add extra validation of table encryption

* clippy fix

* Remove register_parquet_encryption_factory from SessionContext

* Remove new dependency from example

* Update examples readme

* Run taplo format

* Fix outdated method reference in comment

* Extra comment

---------

Co-authored-by: Andrew Lamb <[email protected]>
@adamreeve adamreeve restored the encryption_factory branch August 13, 2025 04:58
@adamreeve
Copy link
Contributor Author

One thing we might want to change before releasing this is making the EncryptionFactory methods async. I guess for the Comet use case, calling into Java will need to be synchronous, but for native Rust use it would make sense for these to be async as they may need to access a KMS over the network.

Related: apache/arrow-rs#8238

@adamreeve
Copy link
Contributor Author

I've made a PR to change this to async: #17342

adamreeve added a commit to adamreeve/datafusion that referenced this pull request Aug 29, 2025
)

* Add encryption factory API for more flexible encryption configuration

* Remove extra DynEncryptionFactory trait

* Tidy up example

* Tidy ups

* Fix "unnecessary qualification" errors in example

* Fix toml format with taplo

* Fix broken link in docs

* Clippy fixes

* Update examples README

* Add extra validation of table encryption

* clippy fix

* Remove register_parquet_encryption_factory from SessionContext

* Remove new dependency from example

* Update examples readme

* Run taplo format

* Fix outdated method reference in comment

* Extra comment

---------

Co-authored-by: Andrew Lamb <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
common Related to common crate core Core DataFusion crate datasource Changes to the datasource crate execution Related to the execution crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

More flexible Parquet encryption configuration
3 participants