nccl-rccl-parser

This tool dumps out rccl-tests/nccl-tests commands directly from an application, so that potential scaling bottlenecks can be identified when a distributed application uses the RCCL/NCCL modules.

To get started please clone the following repository:

git clone --recursive https://github.com/ROCmSoftwarePlatform/nccl-rccl-parser

To run the tests, we use the following repositories:

On ROCm: https://github.com/ROCmSoftwarePlatform/rccl-tests
On CUDA: https://github.com/NVIDIA/nccl-tests.git

Pre-requisites:

RCCL/NCCL installed.
rccl-tests or nccl-tests installed.

How to use the tool:

Run application and collect RCCL/NCCL Log:

First, make sure you are running the application in a distributed setup. Run it for at least one iteration with the two environment variables below set, and write the output to a log file named nccl_debug_log.txt.

On CUDA:

NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,COLL <application/executable> |& tee nccl_debug_log.txt

On ROCm: (HSA_FORCE_FINE_GRAIN_PCIE=1 is needed for PCIe P2P but not for GPUs connected by XGMI)

HSA_FORCE_FINE_GRAIN_PCIE=1 NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,COLL <application/executable> |& tee nccl_debug_log.txt

NOTE: For some workloads, buffered output can alter the RCCL/NCCL log format, which may break the parser. Prefixing the application command with the following can help:

PYTHONUNBUFFERED=x stdbuf -i0 -o0 -e0
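
The collection step can also be driven from Python. The following is a minimal sketch, not part of this repository: the helper names build_debug_env and run_and_log are hypothetical. It sets the environment variables listed above, sets PYTHONUNBUFFERED (Python's standard switch for unbuffered output), and writes the merged stdout/stderr stream to the log file while echoing it, much like `|& tee`.

```python
import os
import subprocess
import sys


def build_debug_env(rocm=False, base=None):
    """Return a copy of the environment with the logging variables set."""
    env = dict(os.environ if base is None else base)
    env["NCCL_DEBUG"] = "INFO"
    env["NCCL_DEBUG_SUBSYS"] = "INIT,COLL"
    # Ask Python child processes not to buffer stdout/stderr.
    env["PYTHONUNBUFFERED"] = "1"
    if rocm:
        # Needed for PCIe P2P only; not needed for XGMI-connected GPUs.
        env["HSA_FORCE_FINE_GRAIN_PCIE"] = "1"
    return env


def run_and_log(command, log_path="nccl_debug_log.txt", rocm=False):
    """Run `command` (a list of arguments), echoing and saving its output."""
    with open(log_path, "w") as log:
        proc = subprocess.Popen(
            command,
            env=build_debug_env(rocm),
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr, like `|&` in bash
            text=True,
        )
        for line in proc.stdout:
            sys.stdout.write(line)
            log.write(line)
        return proc.wait()
```

For example, run_and_log(["python", "train.py"], rocm=True) would run a (hypothetical) train.py and leave the log in nccl_debug_log.txt.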

Automated way:

Once you have the debug log, run the command below to gather the performance results.

On CUDA devices, use the --cuda argument.

On ROCm devices, use the --rocm argument.

Note: If you pass neither argument, the automated script only dumps out the output data from the parser.

On ROCm:

python run_parser_and_generate_summary.py --nccl-debug-log nccl_debug_log.txt --rocm

On CUDA:

python run_parser_and_generate_summary.py --nccl-debug-log nccl_debug_log.txt --cuda

Easy mode: one bash script:

Ensure a RUN_COMMAND has been set; this can be any executable or bash script.

Usage on ROCm: bash automated_parser.sh --run-command "{RUN_COMMAND}" --use-rocm

Usage on CUDA: bash automated_parser.sh --run-command "{RUN_COMMAND}"

This collects the logs from your program automatically and dumps out the final CSV report.

To run the tool manually step by step:

Use Parser to dump out the test commands:

Once the log has been collected, use the parser to dump out either all the rccl/nccl test commands or only the unique commands with their respective counts in the workload. Note: To dump out the unique commands, use the --unique argument. Optional parameters: --output-script-name, --unique

Here is the usage of the script:

python rccl_nccl_parser.py --nccl-debug-log nccl_debug_log.txt --output-script-name net
(or)
python rccl_nccl_parser.py --nccl-debug-log nccl_debug_log.txt --output-script-name net --unique

The first command dumps out all the rccl/nccl tests in the order in which the application executes them (net_rccl_nccl.sh file). The second command dumps out a script file with the unique commands and a CSV file listing each command with its count.
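
To illustrate what counting unique commands means, here is a minimal sketch. It is not the implementation in rccl_nccl_parser.py. It assumes collective log lines that contain `NCCL INFO <Collective>:` followed by `count`, `datatype`, `op`, and `[nranks=N]` fields; the exact line layout can differ between RCCL/NCCL versions, and the function name count_unique_collectives is hypothetical.

```python
import re
from collections import Counter

# Assumed line shape (may vary by RCCL/NCCL version):
#   ... NCCL INFO AllReduce: opCount 0 sendbuff 0x.. recvbuff 0x..
#       count 1024 datatype 7 op 0 root 0 comm 0x.. [nranks=8] stream 0x..
COLL_RE = re.compile(
    r"NCCL INFO (?P<coll>AllReduce|Broadcast): .*?"
    r"\bcount (?P<count>\d+) datatype (?P<dtype>\d+) op (?P<op>\d+)"
    r".*?\[nranks=(?P<nranks>\d+)\]"
)


def count_unique_collectives(lines):
    """Count identical (collective, count, datatype, op, nranks) calls."""
    counts = Counter()
    for line in lines:
        match = COLL_RE.search(line)
        if match is None:
            continue  # not a supported collective line
        key = (
            match["coll"],
            int(match["count"]),
            int(match["dtype"]),
            int(match["op"]),
            int(match["nranks"]),
        )
        counts[key] += 1
    return counts
```

Each key identifies one distinct call signature, and its value is the number of times that signature appears in the log. Because every rank may log the same call, a real tool would likely also filter by rank or device.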

Run rccl-tests/nccl-tests:

Once you dump out the scripts, copy the script into the nccl-tests/rccl-tests folder, run it, and gather the performance data. Inside the nccl-tests/rccl-tests repository:

sh net_unique.sh |& tee rccl_perf_data.txt

Once you run the above script, the performance data of each command is redirected to a text file.

Generate Summary:

The final step is to use the above performance log to generate a summary, in the form of a CSV file, for each command. The summary gives the average values for each command, such as Time(us), algBw, and busBw (out-of-place and in-place). For PyTorch, please consider the out-of-place values.

To generate the summary, navigate to the tool nccl-rccl-parser:

python generate_summary.py --log-file rccl_perf_data.txt --output-file-name test_app_data --script-file net_unique.sh

This dumps out a csv file with performance data for further analysis.
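
The averaging described above can be sketched as follows. This is an illustration only, not the code in generate_summary.py. It assumes that each data row of the performance log ends with eight columns: time, algbw, busbw, and an error count for out-of-place, then the same four for in-place, and that header and comment lines start with `#`. The function names are hypothetical.

```python
def parse_perf_rows(lines):
    """Extract (out_of_place, in_place) triples of time, algbw, busbw."""
    rows = []
    for line in lines:
        fields = line.split()
        # Skip blank lines, '#' comments/headers, and short lines.
        if not fields or fields[0].startswith("#") or len(fields) < 8:
            continue
        try:
            out_of_place = [float(x) for x in fields[-8:-5]]
            in_place = [float(x) for x in fields[-4:-1]]
        except ValueError:
            continue  # not a numeric data row
        rows.append((out_of_place, in_place))
    return rows


def average_perf(rows):
    """Average time (us), algbw and busbw over all parsed rows."""
    if not rows:
        raise ValueError("no performance rows found")
    names = ("time_us", "algbw", "busbw")
    summary = {}
    for idx, label in ((0, "out_of_place"), (1, "in_place")):
        for j, name in enumerate(names):
            total = sum(row[idx][j] for row in rows)
            summary[f"{label}_{name}"] = total / len(rows)
    return summary
```

Reading the columns from the end of each row keeps the sketch independent of how many leading columns (size, count, type, and so on) a given test prints.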

Supported Collectives:

Currently, only the AllReduce and Broadcast calls are supported by this tool. Other collectives need to be added as more experiments are run.
