
Unable to Limit the Number of Cores When Using --workers with -m ANIb #448

Open
loomiscoh opened this issue Jan 24, 2025 · 6 comments

@loomiscoh

I am encountering a problem similar to the one described here. Specifically, when running -m ANIb and specifying --workers, the program still utilizes the maximum number of cores available. My goal is to limit the number of cores used by ANIb.

System Information:

  • pyani v0.2.13.1
  • OS: Rocky Linux 9.4 (Blue Onyx)
  • python v3.8.20

When executing the following command, all 40 cores on our cluster are utilized, even though I specify --workers 1.

average_nucleotide_identity.py -i input_dir -o output_dir -m ANIb --workers 1

Despite setting --workers 1, the program uses all 40 cores. The same behavior occurs when setting --workers to any other number.

@peterjc
Collaborator

peterjc commented Jan 24, 2025

Can you check how many blastn jobs are running, or what seems to be using all the cores?
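
One quick way to check, assuming a Linux host with the procps version of ps (the nlwp column is the per-process thread count):

$ ps -C blastn -o pid,%cpu,nlwp,args

Many blastn processes each near 100% CPU would point at the scheduler launching too many jobs; a single process with high %cpu and nlwp would point at blastn itself multi-threading.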

On my copy of blastn at least, it has a default of only one thread, and as far as I know, pyANI leaves it like that:

$ blastn -help
...
DESCRIPTION
   Nucleotide-Nucleotide BLAST 2.16.0+
...
 *** Miscellaneous options
 -parse_deflines
   Should the query and subject defline(s) be parsed?
 -num_threads <Integer, >=1>
   Number of threads (CPUs) to use in the BLAST search
   Default = `1'
    * Incompatible with:  remote
...

@loomiscoh
Author

loomiscoh commented Jan 27, 2025

When executing pyani with --workers 1, it appears that all of our cores are being used by blastn jobs.

I am running the same version of blastn:

$ blastn -help
...
DESCRIPTION
   Nucleotide-Nucleotide BLAST 2.16.0+
...
 *** Miscellaneous options
 -parse_deflines
   Should the query and subject defline(s) be parsed?
 -num_threads <Integer, >=1>
   Number of threads (CPUs) to use in the BLAST search
   Default = `1'
    * Incompatible with:  remote
...

In the average_nucleotide_identity.py documentation, the --workers argument is defined as:

$ average_nucleotide_identity.py -h
...
 --workers WORKERS     Number of worker processes for multiprocessing (default zero, meaning use all available cores)
...

It seems like the --workers argument in pyani is being ignored and is defaulting to 0 (which uses all available cores), which in turn lets the blastn jobs spread across every core.

@peterjc
Collaborator

peterjc commented Jan 27, 2025

Here is where pyANI v0.2.13.1 builds the blastn command:

https://github.com/widdowquinn/pyani/blob/v0.2.13.1/pyani/anib.py#L465

    return (
        f"{blastn_exe} -out {prefix}.blast_tab -query {fname1} -db {fname2} "
        "-xdrop_gap_final 150 -dust no -evalue 1e-15 -max_target_seqs 1 -outfmt "
        "'6 qseqid sseqid length mismatch pident nident qlen slen "
        "qstart qend sstart send positive ppos gaps' "
        "-task blastn"
    )

There is no -num_threads setting being used (I searched the codebase to check), so if blastn is really using more than one worker thread, that could be an NCBI issue.
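
For anyone wanting to repeat that search on a v0.2.13.1 checkout, something along these lines would do (the path is assumed to be the package directory of the source tree):

$ grep -rn "num_threads" pyani/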

However, getting multiple blastn processes running clearly points at pyANI not obeying the --workers setting 😞

@widdowquinn
Owner

widdowquinn commented Jan 27, 2025

Can you please try writing a log file and/or using the -v verbose option?

The ANIb code should be using the args.workers argument here:

https://github.com/widdowquinn/pyani/blob/855852633ae6957068081f653f10b18aac372fe6/pyani/scripts/subcommands/subcmd_anim.py#L378C5-L402C14

and ought to give a distinct log message depending on whether it receives a valid args.workers value: either "(using maximum number of worker threads)" if it sees None/0, or "(using %d worker threads, if available)" otherwise.

That will help us narrow down the issue.
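
Roughly, the branch being described looks like this (a paraphrase for illustration, not the exact code at the link above):

    if args.workers is None or args.workers == 0:
        logger.info("(using maximum number of worker threads)")
        workers = None  # multiprocessing.Pool then falls back to os.cpu_count()
    else:
        logger.info("(using %d worker threads, if available)", args.workers)
        workers = args.workers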

@widdowquinn
Owner

> On my copy of blastn at least, it has a default of only one thread, and as far as I know, pyANI leaves it like that:

The --workers argument should control the number of workers in the multiprocessing pool, rather than the number of threads blastn is using.
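
For illustration only, here is a minimal sketch of that intended behaviour (not pyani's actual scheduler code; run_one, run_all, and cmds are hypothetical names):

    import multiprocessing
    import subprocess

    def run_one(cmd):
        # Each pool worker runs a single blastn command; blastn itself
        # defaults to -num_threads 1, so total load should be ~workers cores.
        return subprocess.run(cmd, shell=True).returncode

    def run_all(cmds, workers=None):
        # processes=None lets Pool fall back to os.cpu_count(); an explicit
        # integer caps how many blastn jobs run concurrently.
        with multiprocessing.Pool(processes=workers) as pool:
            return pool.map(run_one, cmds)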

@loomiscoh
Author

I have attached a .log file I generated using the following command:
average_nucleotide_identity.py -i input_dir -o output_dir -m ANIb --workers 1 -l pyani_run_debug.log --debug

I am unable to locate the distinct log message you mentioned in my file; however, I notice "Using scheduler method: multiprocessing". It seems like pyani is launching multiple blastn processes, with all available cores being used across the blastn jobs.

pyani_run_debug.log
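
As a stopgap, assuming a Linux host with util-linux available, the whole run can be pinned to a fixed set of cores from outside pyani; the child blastn processes inherit the CPU affinity mask:

$ taskset -c 0-3 average_nucleotide_identity.py -i input_dir -o output_dir -m ANIb --workers 4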
