
DatasetBuilder.to_dataframe() fails if S3 buckets are encrypted with server-side KMS encryption and a KMS key is supplied. #4034


Description

@groverpr

Describe the bug

DatasetBuilder.to_dataframe() fails if S3 buckets are encrypted with server-side KMS encryption and a KMS key is supplied.

When a KMS key is provided, S3Uploader.upload() passes {"SSEKMSKeyId": kms_key, "ServerSideEncryption": "aws:kms"} as extra arguments [ref], and this is what DatasetBuilder.to_csv_file() uses to upload the query output. But in the next step of DatasetBuilder.to_dataframe(), S3Downloader.download() is called, which passes the KMS key as {"SSECustomerKey": kms_key} [ref]. This is incorrect and leads to the error shown in the logs section: the download should use SSEKMSKeyId to decrypt (not SSECustomerKey, which is an SSE-C parameter), matching how the objects were originally uploaded. A sketch of the mismatch follows.
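A minimal boto3 sketch of the mismatch (the bucket, key, and KMS key ARN are placeholders, and the SDK internals are paraphrased, not quoted):

import boto3

s3 = boto3.client("s3")

# Upload path: S3Uploader.upload() sends SSE-KMS arguments, which S3 accepts.
s3.put_object(
    Bucket="my-sse-kms-bucket",           # placeholder
    Key="dataset-output/data.csv",        # placeholder
    Body=b"col_a,col_b\n1,2\n",
    ServerSideEncryption="aws:kms",
    SSEKMSKeyId="<kms-key-arn>",          # placeholder
)

# Download path: S3Downloader.download() sends the same key as SSECustomerKey,
# an SSE-C parameter. S3 rejects SSE-C headers on a KMS-encrypted object, so
# the preliminary HeadObject fails with 400 Bad Request:
s3.download_file(
    "my-sse-kms-bucket",
    "dataset-output/data.csv",
    "./data.csv",
    ExtraArgs={"SSECustomerKey": "<kms-key-arn>"},  # wrong for SSE-KMS
)

# For SSE-KMS objects no key needs to be sent on read at all; S3 decrypts
# transparently as long as the caller has kms:Decrypt on the key:
s3.download_file("my-sse-kms-bucket", "dataset-output/data.csv", "./data.csv")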

To reproduce

See the code and error in the "Screenshots or logs" section below.

Expected behavior

The expected behavior is to be able to load the query output into a pandas data frame when calling DatasetBuilder.to_dataframe() with a KMS key. Note that the KMS key is supplied via feature_store.create_dataset(kms_key_id=<key>).
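A minimal sketch of the intended flow (the feature group, bucket, and key ARN are placeholders for an environment where the output bucket enforces SSE-KMS):

import sagemaker
from sagemaker.feature_store.feature_store import FeatureStore

session = sagemaker.Session()
feature_store = FeatureStore(sagemaker_session=session)

builder = feature_store.create_dataset(
    base=my_feature_group,                        # an existing FeatureGroup (placeholder)
    output_path="s3://my-sse-kms-bucket/output",  # placeholder
    kms_key_id="<kms-key-arn>",                   # placeholder
)

# Expected: returns the query output as a pandas data frame (plus the query string).
# Actual: the internal S3Downloader.download() call fails with 400 Bad Request.
df, query = builder.to_dataframe()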

Screenshots or logs

# csv_file is the S3 URI returned by DatasetBuilder.to_csv_file()
s3.S3Downloader.download(
    s3_uri=csv_file,
    local_path="./",
    kms_key="<a valid kms key for SSE bucket>",
    sagemaker_session=feature_group_session.feature_store_session,
)

Running the above fails with the following error.

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/botocore/client.py:530, in ClientCreator._create_api_method.<locals>._api_call(self, *args, **kwargs)
    526     raise TypeError(
    527         f"{py_operation_name}() only accepts keyword arguments."
    528     )
    529 # The "self" in this scope is referring to the BaseClient.
--> 530 return self._make_api_call(operation_name, kwargs)

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/botocore/client.py:964, in BaseClient._make_api_call(self, operation_name, api_params)
    962     error_code = parsed_response.get("Error", {}).get("Code")
    963     error_class = self.exceptions.from_code(error_code)
--> 964     raise error_class(parsed_response, operation_name)
    965 else:
    966     return parsed_response

ClientError: An error occurred (400) when calling the HeadObject operation: Bad Request

System information

  • SageMaker Python SDK version: 2.167.0
  • Framework name (eg. PyTorch) or algorithm (eg. KMeans): N/A
  • Framework version: N/A
  • Python version: 3.10.10
  • CPU or GPU: CPU
  • Custom Docker image (Y/N): N

Workaround
Using S3_URI, _ = DatasetBuilder.to_csv_file() and then calling pd.read_csv(S3_URI) works, as sketched below. It is also odd that DatasetBuilder.to_dataframe() first downloads the object to local disk, loads it into a data frame, and then deletes it, when it could simply read the object into a data frame without downloading it at all.
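A sketch of the workaround (assuming pandas can read s3:// URIs, i.e. s3fs is installed, and the caller has kms:Decrypt on the key):

import pandas as pd

# Skip the SDK's broken download step: materialize the CSV in S3, then let
# pandas read the s3:// URI directly; S3 decrypts SSE-KMS transparently.
csv_uri, _ = builder.to_csv_file()  # builder from create_dataset(...) above
df = pd.read_csv(csv_uri)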
