[FEA] we should support hdfs federation #1475

Open
nvliyuan opened this issue Dec 20, 2024 · 2 comments
Labels
core_tools Scope the core module (scala) feature request New feature or request user_tools Scope the wrapper module running CSP, QualX, and reports (python)

Comments

@nvliyuan
Contributor

nvliyuan commented Dec 20, 2024

Currently we only support paths with the hdfs:// scheme: https://github.com/NVIDIA/spark-rapids-tools/blob/3db52ef3a4c56abce9d1d3337daa05ae176d9622/user_tools/src/spark_rapids_tools/storagelib/hdfs/hdfspath.py

def detect_platform_from_eventlogs_prefix(self):

However, some customers use HDFS federation, so it would be nice if we could also support viewfs:// paths.

@nvliyuan nvliyuan added ? - Needs Triage feature request New feature or request labels Dec 20, 2024
@amahussein amahussein added user_tools Scope the wrapper module running CSP, QualX, and reports (python) core_tools Scope the core module (scala) labels Dec 31, 2024
@amahussein
Collaborator

Thanks @nvliyuan!
We are in the process of changing the CLI workflow slightly.
The --platform argument is required, because we found that auto-detecting the platform from the eventlog path is not always accurate.
Supporting viewfs requires the following:

Core-tools

  • Investigate how to support federation through the Spark configuration/properties and how to pass those settings to the java command.

user-tools

Our cross-platform storage uses pyarrow.filesystem. Fortunately, it supports viewfs.
So we need to:

  • In general: add the viewfs scheme to the HDFS interface in our Python code.
  • For distributed mode: make sure that distributed mode can be configured to use viewfs (CC: @parthosa )
  • Determine how to pass the correct configurations to the java CLI.
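
As a rough illustration of the first bullet, here is a minimal sketch of treating viewfs:// the same way as hdfs:// when classifying a path. It is not the actual spark-rapids-tools code: the names `HDFS_LIKE_SCHEMES` and `is_hdfs_like` are hypothetical, and it only covers scheme recognition, not the pyarrow or Hadoop configuration.

```python
from urllib.parse import urlparse

# Hypothetical set of URI schemes that would be routed to the HDFS interface.
# "viewfs" is the addition requested in this issue.
HDFS_LIKE_SCHEMES = frozenset({"hdfs", "viewfs"})


def is_hdfs_like(path: str) -> bool:
    """Return True if the path uses a scheme handled by the HDFS interface."""
    return urlparse(path).scheme in HDFS_LIKE_SCHEMES


# Example usage with made-up paths:
print(is_hdfs_like("hdfs://namenode:8020/eventlogs"))   # hdfs scheme
print(is_hdfs_like("viewfs://cluster-x/eventlogs"))     # federated view
print(is_hdfs_like("s3://bucket/eventlogs"))            # not HDFS-backed
```

Keeping the schemes in one set means any further HDFS-compatible scheme would be a one-line change, assuming the underlying filesystem layer accepts it.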

@parthosa
Collaborator

parthosa commented Jan 7, 2025

@amahussein The distributed mode uses our cross-platform storage for all file access. Since, as you mentioned, pyarrow supports viewfs, that should simplify things for distributed mode as well.
