
[QST] RAPIDS Qualification Tool Fails to Obtain S3 action getFileStatus for Databricks-AWS Platform #1441

Open
anguy116 opened this issue Nov 27, 2024 · 7 comments
Labels: question, user_tools

@anguy116

What is your question?

I'm currently trying to run the RAPIDS qualification command against an existing databricks-aws cluster whose Spark event logs are written to s3://bucket/cluster_logs.

My question is: with all of my prerequisite checks out of the way, why does the RAPIDS tool not have the correct privileges to read the S3 bucket that contains my cluster logs?

The first command I'm running:

export AWS_PROFILE=role_with_suffice_permissions
export DATABRICKS_CONFIG_FILE=~/.databrickscfg

spark_rapids qualification \
--platform databricks-aws \
--eventLogs s3://bucket/cluster_logs/<cluster_id>/eventlog/<cluster_id>_<cluster_ip>/<some_id>/

Another command I'm also attempting to run:

export AWS_PROFILE=role_with_suffice_permissions
export DATABRICKS_CONFIG_FILE=~/.databrickscfg

spark_rapids qualification \
--platform databricks-aws \
--eventLogs s3://bucket/cluster_logs \
--cpu-cluster <cluster_id>

My Databricks cluster:
instance type: m7gd.2xlarge
access mode: single user
Databricks runtime: 16.0
instance_profile: (has an s3:PutObjectAcl policy for resource s3://bucket/cluster_logs/)

The issue I'm running into for the first command:

2024-11-27 09:04:06 main WARN EventLogPathProcessor:93 - Unexpected exception occurred reading s3a://bucket/cluster_logs/1125-182533-elyhzrsq/eventlog/1125-182533-elyhzrsq_15_1_31_253/6465501575869150700/, skipping!
java.nio.file.AccessDeniedException: s3a://bucket/cluster_logs/1125-182533-elyhzrsq/eventlog/1125-182533-elyhzrsq_15_1_31_253/6465501575869150700: getFileStatus on s3a://bucket/cluster_logs/1125-182533-elyhzrsq/eventlog/1125-182533-elyhzrsq_15_1_31_253/6465501575869150700: com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID: 1AKHSJ6NKD6E01FT; S3 Extended Request ID: fwkXq5X/AU8lQx54bAFz41bhoCqD3Ya6e/gxW8EgNoAYIadaLKJRUaCfafFf33bT2FOOY6u1IBY=; Proxy: null), S3 Extended Request ID: fwkXq5X/AU8lQx54bAFz41bhoCqD3Ya6e/gxW8EgNoAYIadaLKJRUaCfafFf33bT2FOOY6u1IBY=:403 Forbidden

Similarly, when I run the second command:

2024-11-27 08:47:04 main WARN EventLogPathProcessor:93 - Unexpected exception occurred reading s3a://bucket/cluster_logs/, skipping!
java.nio.file.AccessDeniedException: s3a://bucket/cluster_logs: getFileStatus on s3a://bucket/cluster_logs: com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID: MC1XR3WYNZG64KZD; S3 Extended Request ID: FHNkNKGHdXGrCDGVjD/p+5tIQ9OUWTX/PMgO7SEPpgqtZRs2LJkJklPoZhrA0IQVyv4rm/60KOs=; Proxy: null), S3 Extended Request ID: FHNkNKGHdXGrCDGVjD/p+5tIQ9OUWTX/PMgO7SEPpgqtZRs2LJkJklPoZhrA0IQVyv4rm/60KOs=:403 Forbidden

Sanity Checks

  • The role_with_suffice_permissions profile has the right permissions: I can freely pull, push, and list the s3://bucket/cluster_logs/ bucket (such checks are sketched below).
  • The Databricks CLI is set up correctly: I can list all clusters in my workspace.
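
A minimal sketch of such checks with the AWS CLI, assuming only the bucket path and profile name already shown in the commands above:

# Confirm which identity the profile actually resolves to.
aws sts get-caller-identity --profile role_with_suffice_permissions

# Confirm that the same identity can list the event-log prefix outside of the tool.
aws s3 ls s3://bucket/cluster_logs/ --profile role_with_suffice_permissions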

I can see that the RAPIDS tool is picking up the correct profile name when I run the commands with --verbose:

2024-11-27 09:55:32,271 DEBUG rapids.tools.qualification: Processing Rapids plugin Arguments {'aws_profile': 'role_with_suffice_permissions'}
2024-11-27 09:55:32,271 DEBUG rapids.tools.qualification: Processing tool CLI argument.. aws_profile:['role_with_suffice_permissions']
@parthosa
Copy link
Collaborator

parthosa commented Nov 27, 2024

To better debug this issue, could you clarify the following:

  1. Where is the tool running?

    • Is it running inside a Databricks node or on a local machine?
  2. If running inside a Databricks node:

    • The AWS_PROFILE environment variable might not be used. Instead, the instance profile attached to the Databricks cluster would be used for accessing S3.
    • Does the IAM role attached to the Databricks cluster have only the s3:PutObjectAcl permission? This is not sufficient to list and read objects in the S3 bucket.
    • Databricks recommends adding the following permissions for access to S3:
      • s3:GetObject
      • s3:ListBucket
    • For more details, please refer to the documentation on instance profiles. (A minimal policy sketch with these permissions follows below.)
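
As an illustration only, a minimal inline-policy sketch granting the two permissions above; the role and policy names are placeholders, not values taken from this issue:

# Hypothetical role/policy names; bucket path follows the issue's example.
cat > s3-read-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::bucket",
        "arn:aws:s3:::bucket/cluster_logs/*"
      ]
    }
  ]
}
EOF

aws iam put-role-policy \
  --role-name <instance-profile-role> \
  --policy-name spark-rapids-s3-read \
  --policy-document file://s3-read-policy.json

Note that s3:ListBucket is evaluated against the bucket ARN while s3:GetObject is evaluated against object ARNs, which is why both resources appear in the statement.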

@anguy116
Author

This is running locally on my machine

@parthosa
Collaborator

Thanks for clarifying that. I am able to reproduce the bug by adjusting the policy.

Could you check whether the policy associated with the user whose credentials are saved under the AWS_PROFILE role_with_suffice_permissions includes the s3:ListBucket permission? (One way to test this directly is sketched below.)
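
A sketch of how to test the list permission in isolation, using only the bucket prefix and profile name from this issue (no other values assumed):

# A 403 here would point to the profile itself lacking s3:ListBucket.
aws s3api list-objects-v2 --bucket bucket --prefix cluster_logs/ \
    --max-items 1 --profile role_with_suffice_permissions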

@anguy116
Author

Yes, looking into the IAM role, it has a policy attached that allows most if not all actions on all S3 resources (e.g., allow s3:* on resource *). It only specifies the actions it CANNOT perform, and the list-bucket permission is not one of them.

@parthosa
Collaborator

That's good.

  • The IAM role needs to be attached to an IAM user to provide the credentials. I suspect there could be an override happening.
  • Could you check whether any other environment variables are set? (e.g. AWS_ACCESS_KEY_ID (or AWS_ACCESS_KEY) and AWS_SECRET_KEY (or AWS_SECRET_ACCESS_KEY)) A quick check is sketched below.
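
A quick way to check for such an override, sketched with standard AWS CLI commands (only the profile name from above is assumed):

# List any AWS-related variables that could override the profile's credentials.
env | grep '^AWS_'

# Show where the CLI resolves credentials from for this profile
# (shared-credentials-file vs. environment).
aws configure list --profile role_with_suffice_permissions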

@anguy116
Author

Steps I've taken in response to this:

  • No other AWS environment variables are set besides AWS_PROFILE.
  • I removed the [default] section from my .aws/credentials in case it was being used to access my AWS resources (the resulting file layout is sketched below).
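
For reference, a sketch of what the credential files might look like after that change; the key values are placeholders, and only the profile name from the commands above is assumed:

cat ~/.aws/credentials
# [role_with_suffice_permissions]
# aws_access_key_id     = <access-key-id>
# aws_secret_access_key = <secret-access-key>

cat ~/.aws/config
# [profile role_with_suffice_permissions]
# region = <aws-region>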

Verbose logs for clarity:

2024-11-28 14:01:49,245 WARNING rapids.tools.csp: Property profile is not set. Setting default value DEFAULT from environment variable
2024-11-28 14:01:49,245 WARNING rapids.tools.csp: Property awsProfile is not set. Setting default value role_with_suffice_permissions from environment variable
2024-11-28 14:01:49,245 WARNING rapids.tools.csp: Property cliConfigFile is not set. Setting default value /Users/user/.databrickscfg from environment variable
2024-11-28 14:01:49,245 WARNING rapids.tools.csp: Property awsCliConfigFile is not set. Setting default value /Users/user/.aws/config from environment variable
2024-11-28 14:01:49,245 WARNING rapids.tools.csp: Property awsCredentialFile is not set. Setting default value /Users/user/.aws/credentials from environment variable

NOTE: The DEFAULT value for the Databricks profile is the one I use to connect to and interact with my DBX environment.
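
For context, a sketch of the Databricks CLI config file the tool reads via DATABRICKS_CONFIG_FILE; the host and token values are placeholders:

cat ~/.databrickscfg
# [DEFAULT]
# host  = https://<workspace>.cloud.databricks.com
# token = <personal-access-token>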

@parthosa
Collaborator

parthosa commented Dec 11, 2024

As an alternative, we provide a Jupyter notebook for running the Qualification tool in a Databricks environment - Download

Prerequisite:

  • A running compute cluster on Databricks (a single node is sufficient, or reuse any existing cluster).
  • An instance profile linked to the cluster (see the Databricks Docs).
    • Since the outputs are being written to S3, I suspect this is already set up.

Notebook Usage:

  1. Import the notebook into Databricks via File -> Import Notebook.
  2. Open the notebook and attach it to the compute cluster mentioned above.
  3. Enter the event log s3 path in the text widget at the top of the notebook.
  4. Click Run all to run qualification tool on the provided logs.
