
Pipe-ALL AlexNet output formatting error when running entirely on little cluster at certain frequencies #8

Open
fragmential opened this issue Jan 23, 2024 · 2 comments

Comments

@fragmential

Demonstration:
[Screenshot 2024-01-23 at 13 55 46: the garbled output report]

Steps to reproduce:

  • Make sure LD_LIBRARY_PATH is set correctly.
  • Set the scaling governor to performance and cap policy0 at 1 GHz:
echo performance > /sys/devices/system/cpu/cpufreq/policy0/scaling_governor
echo performance > /sys/devices/system/cpu/cpufreq/policy2/scaling_governor
echo 1000000 > /sys/devices/system/cpu/cpufreq/policy0/scaling_max_freq
  • Run the pipeline:
./graph_alexnet_all_pipe_sync --threads=4 --threads2=2 --n=60 --total_cores=6 --partition_point=8 --partition_point2=8 --order="L-G-B"

The error occurs at frequencies of 1 GHz and higher. It once failed to appear at 1.2 GHz, but otherwise it occurs very consistently.

Consistently reproducible on our system.

The hardware is plugged directly into mains power using the supplied Anker PowerPort+ 1 power supply.

Significance of the issue:

  • It disrupts our output parser; we spent hours yesterday trying to "fix" a bug in our scripts that turned out to be caused by this.
@Ehsan-aghapour
Copy link
Owner

Sorry for the inconvenience.
It seems that the output reports of the stages are getting mixed. The reason is that each stage runs in a separate thread, and when the threads try to print simultaneously, their output interleaves. One possible solution is to use std::cerr instead of std::cout for the output reports. The source file is examples/graph_alexnet_all_pipe_sync.cpp; at the end of do_run_1, do_run_2, and do_run_3, std::cout could be replaced with std::cerr.
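If the reports need to stay on std::cout, another option is to serialize the prints with a shared mutex. Below is a minimal sketch under that assumption; print_stage_report is a hypothetical helper for illustration, not the actual code in examples/graph_alexnet_all_pipe_sync.cpp.

```cpp
#include <iostream>
#include <mutex>
#include <sstream>
#include <string>

// Hypothetical helper, not part of the repository.
// Each stage thread builds its whole report first, then emits it in a single
// write while holding a shared mutex, so concurrent stages cannot interleave.
static std::mutex report_mutex;

void print_stage_report(const std::string& stage_name, double latency_ms) {
    std::ostringstream report;
    report << stage_name << " inference time: " << latency_ms << " ms\n";

    std::lock_guard<std::mutex> lock(report_mutex);
    std::cout << report.str() << std::flush;
}
```

The same idea (one lock, or one pre-built string written in a single statement) could wrap the existing std::cout calls at the end of do_run_1, do_run_2, and do_run_3.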

What was your solution to this problem?

@fragmential
Author

We didn't have a solution. We couldn't process the inference times for this data, and we couldn't average the other measurements with other runs because our data-processing scripts relied on the order of items being consistent, which it wasn't. We didn't have time to change the scripts, so we settled on excluding the inference times from those tests.
