Describe the bug
DatasetBuilder.to_dataframe() fails if the S3 bucket is encrypted with server-side KMS encryption and a KMS key is supplied.
The S3Uploader.upload() method uses {"SSEKMSKeyId": kms_key, "ServerSideEncryption": "aws:kms"} [ref] when a KMS key is provided, and this is what DatasetBuilder.to_csv_file() uses to upload the objects. In the next step of DatasetBuilder.to_dataframe(), however, S3Downloader.download() is called, and it passes the KMS key as {"SSECustomerKey": kms_key} [ref]. This is incorrect and leads to the error shown in the logs: the download should decrypt with SSEKMSKeyId (not SSECustomerKey), matching the parameter that was originally used to upload the objects.
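For clarity, here is a minimal sketch of the mismatch (paraphrasing the behaviour described above, not the SDK source); the key ARN is a made-up example:

def upload_extra_args(kms_key):
    # What S3Uploader.upload() sends when a KMS key is given:
    # the object is written with SSE-KMS under that key.
    return {"SSEKMSKeyId": kms_key, "ServerSideEncryption": "aws:kms"}

def download_extra_args(kms_key):
    # What S3Downloader.download() sends for the same key:
    # an SSE-C header, which does not match how the object was encrypted.
    return {"SSECustomerKey": kms_key}

key = "arn:aws:kms:us-east-1:111122223333:key/example"  # made-up ARN
print(upload_extra_args(key))    # {'SSEKMSKeyId': ..., 'ServerSideEncryption': 'aws:kms'}
print(download_extra_args(key))  # {'SSECustomerKey': ...}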
To reproduce
See the Screenshots or logs section below.
Expected behavior
Expected behavior is being able to load a query output into a pandas data frame when using DatasetBuilder.to_dataframe() with a KMS key. Note that the KMS key is supplied via feature_store.create_dataset(kms_key_id=<key>), as in the sketch below.
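For reference, this is roughly how the failing path is reached; the bucket, key id, and feature group name are placeholders, not values from the original report:

from sagemaker.session import Session
from sagemaker.feature_store.feature_group import FeatureGroup
from sagemaker.feature_store.feature_store import FeatureStore

session = Session()
feature_store = FeatureStore(sagemaker_session=session)
my_feature_group = FeatureGroup(name="my-feature-group", sagemaker_session=session)

# The bucket behind output_path enforces SSE-KMS with the supplied key.
builder = feature_store.create_dataset(
    base=my_feature_group,
    output_path="s3://my-sse-kms-bucket/output",
    kms_key_id="<a valid kms key for SSE bucket>",
)

# Uploads the query result with SSEKMSKeyId, then tries to download it
# back with SSECustomerKey, which fails as shown in the logs below.
df, query_string = builder.to_dataframe()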
Screenshots or logs
from sagemaker import s3

# csv_file: S3 URI of the CSV written into the SSE-KMS bucket by
# DatasetBuilder.to_csv_file(); this is the download to_dataframe() performs.
s3.S3Downloader.download(
    s3_uri=csv_file,
    local_path="./",
    kms_key='<a valid kms key for SSE bucket>',
    sagemaker_session=feature_group_session.feature_store_session,
)
On doing the above, it fails with the following error.
File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/botocore/client.py:530, in ClientCreator._create_api_method.<locals>._api_call(self, *args, **kwargs)
526 raise TypeError(
527 f"{py_operation_name}() only accepts keyword arguments."
528 )
529 # The "self" in this scope is referring to the BaseClient.
--> 530 return self._make_api_call(operation_name, kwargs)
File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/botocore/client.py:964, in BaseClient._make_api_call(self, operation_name, api_params)
962 error_code = parsed_response.get("Error", {}).get("Code")
963 error_class = self.exceptions.from_code(error_code)
--> 964 raise error_class(parsed_response, operation_name)
965 else:
966 return parsed_response
ClientError: An error occurred (400) when calling the HeadObject operation: Bad Request
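For context, this matches plain boto3 behaviour for SSE-KMS objects (sketch only; bucket and key names are made up): HeadObject needs no encryption headers at all for SSE-KMS, while SSE-C headers on such an object are rejected with a 400.

import boto3

s3_client = boto3.client("s3")

# SSE-KMS objects are decrypted server-side; no encryption headers are
# needed as long as the caller has kms:Decrypt on the key.
s3_client.head_object(Bucket="my-sse-kms-bucket", Key="output/dataset.csv")

# Sending SSE-C headers for an object written with SSE-KMS is rejected by
# S3 with a 400, which is the HeadObject error in the traceback above.
s3_client.head_object(
    Bucket="my-sse-kms-bucket",
    Key="output/dataset.csv",
    SSECustomerAlgorithm="AES256",
    SSECustomerKey="<key material>",
)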
System information
- SageMaker Python SDK version: 2.167.0
- Framework name (eg. PyTorch) or algorithm (eg. KMeans): N/A
- Framework version: N/A
- Python version: 3.10.10
- CPU or GPU: CPU
- Custom Docker image (Y/N): N
Workaround
Using S3_URI, _ = DatasetBuilder.to_csv_file() and then calling pd.read_csv(S3_URI) works (see the sketch below). It is also odd that DatasetBuilder.to_dataframe() first downloads the object to disk, loads it into a data frame, and then deletes the local file, when it could simply read the object into a data frame without downloading it.
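A sketch of the workaround, reusing the builder from the snippet under Expected behavior; pd.read_csv on an s3:// URI requires s3fs to be installed:

import pandas as pd

# to_csv_file() uploads with the correct SSE-KMS parameters and returns
# the S3 URI of the result plus the query string.
s3_uri, _query = builder.to_csv_file()

# Decryption happens server-side for SSE-KMS, so reading the object
# directly works without any customer-provided key.
df = pd.read_csv(s3_uri)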