[UPDATE] Adaptation of trigger.sh for SLURM and SGE #302

rolivella opened this issue Feb 11, 2025 · 15 comments

rolivella commented Feb 11, 2025

🚀 [UPDATE] Adaptation of trigger.sh for SLURM and SGE

📌 Summary

This issue documents the adaptation of trigger.sh to be compatible with both SLURM and SGE, maintaining a modular structure and using submit_slurm.sh for job submission in SLURM.

🔄 Main Changes

  • Automatic detection of SLURM or SGE in trigger.sh.
  • Modification of launch_nf_run so that:
    • In SLURM, it submits sbatch submit_slurm.sh.
    • In SGE, it directly executes nextflow run.
  • No redundant configurations inside trigger.sh; everything is managed through the .config files.

📂 Modified Files

  • trigger.sh
  • submit_slurm.sh (renamed from submit_nf.sh for clarity)

🛠️ Steps to Apply the Changes

1️⃣ Modify trigger.sh

🔹 Add automatic SLURM/SGE detection

Add this block at the beginning of trigger.sh:

## 🔍 DETECT WHETHER WE ARE IN SLURM OR SGE
if command -v sinfo &> /dev/null; then
    SYSTEM="SLURM"
elif command -v qstat &> /dev/null; then
    SYSTEM="SGE"
else
    echo "❌ [ERROR] Neither SLURM nor SGE detected. Exiting..."
    exit 1
fi

echo "✅ [INFO] Detected system: $SYSTEM"

🔹 Replace the launch_nf_run function

Replace the existing launch_nf_run function with this improved version:

## 🚀 EXECUTE NEXTFLOW IN SLURM OR SGE
launch_nf_run () {
    local workflow_script=$1
    local log_file=$2
    local params="${@:3}" # Remaining parameters

    if [[ $SYSTEM == "SLURM" ]]; then
        echo "🚀 [INFO] Launching Nextflow with SLURM via submit_slurm.sh..."
        sbatch submit_slurm.sh "$workflow_script" "$CONFIG_FILE" "$LAB" "$params" "$log_file"

    elif [[ $SYSTEM == "SGE" ]]; then
        echo "🚀 [INFO] Launching Nextflow directly in SGE..."
        nextflow run "$workflow_script" -c "$CONFIG_FILE" -profile "$LAB" $params > "$log_file" 2>&1 &
    fi
}
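
For illustration, a hypothetical call to launch_nf_run (the log path and the --raws_folder/--dda parameters are made-up placeholders; CONFIG_FILE and LAB are assumed to be set earlier in trigger.sh):

launch_nf_run main.nf /path/to/logs/qsample_run.log --raws_folder /path/to/raws --dda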

2️⃣ Update submit_slurm.sh (previously submit_nf.sh)

Ensure submit_slurm.sh is properly structured to receive arguments and execute Nextflow correctly:

#!/bin/bash
# Arguments received from sbatch
WORKFLOW_SCRIPT=$1
CONFIG_FILE=$2
PROFILE=$3
PARAMS=$4
LOG_FILE=$5

# Load modules
module load Nextflow/23.10.1

# Run Nextflow
nextflow run "$WORKFLOW_SCRIPT" -c "$CONFIG_FILE" -profile "$PROFILE" $PARAMS > "$LOG_FILE" 2>&1

✅ How to Test the Changes

  1. Run trigger.sh in a SLURM environment and verify that sbatch is used correctly:
    ./trigger.sh LAB_NAME prod /path/to/assets data_file
  2. Run it in an SGE environment and check that nextflow run is executed directly:
    ./trigger.sh LAB_NAME prod /path/to/assets data_file
  3. Check logs to ensure Nextflow is running correctly.

📌 Conclusion

This change simplifies pipeline management by using a single trigger.sh for both SLURM and SGE, maintaining modularity and leveraging submit_slurm.sh for job submission in SLURM. 🚀🔥

@rolivella

Hi @braffes, feel free to share your SLURM config. Thank you!

braffes commented Feb 14, 2025

Hi, this is just a first attempt, and I haven't completely tested or optimized it for every type of job. However, it should still work.

I think "clusterOptions" is not mandatory since the default QOS should be normal.

Please note that I'm currently updating atlas/qsample, so the defaultlab.config file might not be the latest version. Feel free to merge in anything that is missing.

I will try to come back with a more accurate/optimized configuration when I can.

defaultlab.config.txt

@rolivella

Thanks @braffes, I'll test it next week and let you know.

rolivella changed the title from "To slurm queue" to "[UPDATE] Adaptation of trigger.sh for SLURM and SGE" on Mar 3, 2025
rolivella commented Mar 3, 2025

🚀 [UPDATE] Unified Nextflow Configuration for SLURM and SGE

✅ Subtasks

  • Integrate SLURM and SGE configurations into a single nextflow.config
  • Define profiles for slurm and sge
  • Ensure shared parameters remain in the common section
  • Implement automatic selection based on NXF_EXECUTOR environment variable
  • Test execution using -profile slurm and -profile sge
  • Verify job submission on both SLURM and SGE clusters
  • Validate log outputs for correctness

📌 Summary

This issue describes the integration of both SLURM and SGE execution environments into a single nextflow.config file. The goal is to maintain flexibility, allowing users to switch between execution modes using Nextflow profiles or an environment variable.

🔄 Main Changes

  • Added profiles section to separate slurm and sge configurations.
  • Moved shared parameters (folders, API settings, default values) to the global section.
  • Implemented NXF_EXECUTOR environment variable for automatic execution mode detection.
  • Ensured compatibility with both SLURM and SGE clusters.

📂 Modified Files

  • nextflow.config (Unified SLURM & SGE configuration)

🛠️ Steps to Apply the Changes

1️⃣ Modify nextflow.config to support SLURM and SGE

params {
    executor = System.getenv('NXF_EXECUTOR') ?: 'slurm' // Auto-select based on env variable

    // Folders:
    home_dir                    = "/home/proteomics"
    databases_folder            = "${params.home_dir}/mygit/atlas-databases"
    contaminants_file           = "${params.home_dir}/mygit/atlas-config/atlas-test/assets/contaminants.fasta"
    contaminants_prefix         = "CON_"
    scripts_folder              = "$baseDir/bin"
    tools_folder                = "$baseDir/tools"

    // API:
    url_api_signin              = "10.102.1.54/api/auth/signin"
    url_api_insert_file         = "10.102.1.54/api/file/insertFromPipelineRequest"
    url_api_insert_wetlab_data  = "10.102.1.54/api/data/pipeline"
    url_api_insert_data         = "10.102.1.54/api/data/pipelineRequest"
    url_api_insert_quant        = "10.102.1.54/api/quantification/pipeline"
    url_api_fileinfo            = "10.102.1.54/api/fileInfo/pipeline"
    url_api_insert_modif        = "10.102.1.54/api/modification/pipeline"
    url_api_insert_wetlab_file  = "10.102.1.54/api/file/insertFromPipeline"

    // API Key:
    api_key_qc_params           = "453a2301-6698-43dd-baae-7eb4c6a5eaa5"

    // Default search engine:
    search_engine               = "comet"
}

profiles {
    slurm {
        process {
            executor = "slurm"
            queue = "genoa64"
            cpus = '1'
            cache = 'lenient'

            clusterOptions = { task.time <= 3.h ? '--qos=shorter' :
                               (task.time <= 6.h ? '--qos=short' :
                               (task.time <= 12.h ? '--qos=normal' :
                               (task.time <= 24.h ? '--qos=long' :
                               (task.time <= 48.h ? '--qos=vlong' : '--qos=marathon')))) }

            withLabel:big_cpus {
                cpus = 8
                time = '6h'
                memory = '20G'
            }

            withLabel:big_mem {
                time = '12h'
                memory = '60G'
            }
        }
    }

    sge {
        process {
            executor = "sge"
            queue = "all.q"
            cpus = '1'
            cache = 'lenient'

            clusterOptions = '-l h_rt=48:00:00 -pe smp 4 -l h_vmem=16G'

            withLabel:big_cpus {
                cpus = 8
                time = '6h'
                memory = '20G'
                clusterOptions = '-l h_rt=6:00:00 -pe smp 8 -l h_vmem=20G'
            }

            withLabel:big_mem {
                time = '12h'
                memory = '60G'
                clusterOptions = '-l h_rt=12:00:00 -l h_vmem=60G'
            }
        }
    }
}

singularity {
    enabled = true
    cacheDir = "${params.home_dir}/mygit/atlas-imgs"
    runOptions = "-B ${params.home_dir}:${params.home_dir}"
}

✅ How to Test the Changes

  1. Run Nextflow with SLURM profile:
    nextflow run main.nf -profile slurm
  2. Run Nextflow with SGE profile:
    nextflow run main.nf -profile sge
  3. Auto-detect executor using environment variable:
    export NXF_EXECUTOR=slurm
    nextflow run main.nf
  4. Verify jobs are submitted correctly to SLURM or SGE:
    squeue -u $USER  # SLURM
    qstat -u $USER    # SGE
  5. Check logs to ensure proper execution.

📌 Conclusion

This update provides a unified Nextflow configuration supporting both SLURM and SGE using profiles. It allows seamless switching between execution environments and enables automatic selection via environment variables. 🚀🔥

@rolivella

All changes are done; tomorrow I'll test them within the atlas-test branch:

  • First test the SGE profile. Maybe I'll need to modify the -profile section by adding sge.
  • Then test SLURM profile.

rolivella commented Mar 5, 2025

After testing, I think the config files are becoming too involved and should be rethought for the sake of clarity and scalability. This could be the new structure:

nextflow_configs/
│── shared.config               # Shared parameters (images, tools, API, etc.)
│── slurm/
│   ├── slurm.config            # SLURM-specific configuration
│   ├── profiles/
│   │   ├── tiny.config         # SLURM tiny profile
│   │   ├── small.config        # SLURM small profile
│   │   ├── medium.config       # SLURM medium profile
│   │   ├── big.config          # SLURM big profile
│── sge/
│   ├── sge.config              # SGE-specific configuration
│   ├── profiles/
│   │   ├── tiny.config         # SGE tiny profile
│   │   ├── small.config        # SGE small profile
│   │   ├── medium.config       # SGE medium profile
│   │   ├── big.config          # SGE big profile
│── aws/
│   ├── aws.config              # AWS Batch-specific configuration
│   ├── profiles/
│   │   ├── tiny.config         # AWS tiny profile
│   │   ├── small.config        # AWS small profile
│   │   ├── medium.config       # AWS medium profile
│   │   ├── big.config          # AWS big profile
│── pbs/
│   ├── pbs.config              # PBS/Torque-specific configuration
│   ├── profiles/
│   │   ├── tiny.config         # PBS tiny profile
│   │   ├── small.config        # PBS small profile
│   │   ├── medium.config       # PBS medium profile
│   │   ├── big.config          # PBS big profile
│── nextflow.config             # Main entrypoint that auto-selects executor

The nextflow.config should be modified accordingly:

// Auto-detect executor from environment
def executor = System.getenv('NXF_EXECUTOR') ?: 'slurm'

// Include shared settings
includeConfig 'nextflow_configs/shared.config'

// Load the correct executor configuration
if (executor == 'slurm') {
    includeConfig 'nextflow_configs/slurm/slurm.config'
} else if (executor == 'sge') {
    includeConfig 'nextflow_configs/sge/sge.config'
} else if (executor == 'aws') {
    includeConfig 'nextflow_configs/aws/aws.config'
} else if (executor == 'pbs') {
    includeConfig 'nextflow_configs/pbs/pbs.config'
} else {
    throw new RuntimeException("Unknown executor: " + executor)
}

And rethink its old content:

params.custom_config_base = "/home/proteomics/mygit/atlas-config/atlas-test/."
includeConfig("atlas_custom.config")

Also rethink this -profile $LAB,"${13}" in trigger.sh (see the sketch below).
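
A minimal sketch of how that -profile argument could be assembled instead (the PROFILES variable is hypothetical; "${13}" is assumed to carry the executor profile name, e.g. slurm or sge, as in the current call; CONFIG_FILE, params and log_file are the variables used earlier in trigger.sh):

# Combine the lab profile with an optional executor profile
PROFILES="$LAB"
if [[ -n "${13:-}" ]]; then
    PROFILES="$PROFILES,${13}"
fi
nextflow run main.nf -c "$CONFIG_FILE" -profile "$PROFILES" $params > "$log_file" 2>&1 &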

rolivella commented Mar 11, 2025

Done until today:

  • Now atlas-test branch works with SGE cluster with brand new config structure. ✅
  • Working DDA workflows with SLURM. ✅

Pending:

  • Extend the trigger_slurm.sh call to cover all method parameters. ✅
  • Unit tests:
    • DDA SLURM. ✅
    • DIA SLURM. ✅
    • Comet SLURM. ✅
    • DDA SGE. ✅
    • DIA SGE. ✅
    • Comet SGE. ✅
  • Slack notification system also for SGE. ✅
  • Integration test. ✅
  • Insert test SLURM and SGE. ✅
  • Issues to solve before moving:
    • Memory definition in the sbatch script vs. the config file. If I remove the sbatch memory setting from the script, I get: slurmstepd: error: Detected 1 oom_kill event in StepId=8601174.batch. Some of the step tasks have been OOM Killed. Solved: one thing is the memory requested for the generic Nextflow head process, and the other is the resources requested for each particular process. So I leave #SBATCH --mem=1G in trigger_slurm.sh (see the sketch after this list). ✅
  • Feedback users about trigger_slurm.sh.
  • Move to prod CRG and check.
  • Update wiki.
  • Publication of new atlas release.
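
For reference, a minimal sketch of trigger_slurm.sh with the 1G head-job allocation mentioned above (the job name, output pattern and argument layout are assumptions, not the exact production script; the module version and nextflow call mirror the earlier submit_slurm.sh):

#!/bin/bash
#SBATCH --job-name=qsample_nextflow   # head job that only orchestrates the pipeline
#SBATCH --mem=1G                      # generic allocation for the Nextflow head process
#SBATCH --cpus-per-task=1
#SBATCH --output=%x_%j.log            # assumed log naming, adjust as needed

# Load the same Nextflow module used elsewhere in the pipeline
module load Nextflow/23.10.1

# Per-process resources come from the .config files, not from this script
nextflow run "$1" -c "$2" -profile "$3" "${@:4}"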

@rolivella

Feedback on SLURM Adaptation in the Pipeline

Hi @temaia @braffes

I would like to get your feedback regarding the SLURM adaptation I am working on. In our computing center, we are required to first launch a process using nextflow run, which I currently allocate 1G of memory to. From this initial process, the rest of the pipeline’s processes are submitted with the resources configured in the relevant .config files.

My question:

Does your cluster work in the same way, or does it follow a different setup? I want to understand if you also need a script to launch the Nextflow process in your environment.

The best solution for me would be if it works as I have implemented it now since that way I don’t have to handle specific cases. However, I am open to your feedback and would appreciate your input.

I have added a new script for this purpose, which you can find in the test version of ATLAS:
🔗 trigger_slurm.sh

Looking forward to your feedback!

Thanks,
Roger

braffes commented Mar 18, 2025

Hi @rolivella ,

In my current qsample setup, the nextflow command is started inside my VM, not in a SLURM job. I don't think I'm interested in using a job to handle the Nextflow process; from my point of view, it would just waste one core...

That said, I can do some tests on your new version if needed.

Best,
Brice

@rolivella

OK @braffes, thanks for your feedback; it makes sense to add an option to skip this Nextflow head job. I'll do it, but you'll have to test it because that mode is not allowed on my institutional cluster.

@rolivella

Hi @braffes,

I'm going to refocus on this issue to have it wrapped up by next week.

Actually, as far as I know about SLURM, I believe you do need a job script to launch the Nextflow process. It's not like SGE in that sense. So the way I've implemented it now (having Nextflow started via a SLURM job) should be fine.

What’s your opinion on that?

braffes commented Mar 28, 2025

Hi @rolivella,

I am not sure I understand your point. There are two possibilities:

  1. Use a SLURM job (with a SBATCH script) to manage the Nextflow main process on a VM/submit node, which will then run new jobs via the "SLURM executor" (for some clusters, it is mandatory to do this to avoid monopolizing the submit node).
  2. Start the Nextflow main process directly on the VM/submit node, which will then run new jobs via the "SLURM executor" (for some clusters, this is accepted).

I currently use the second option since I don't need a SLURM job to run the Nextflow main process and I don't want to waste 1 core on my HPC cluster for this job.

I have never played with SGE, but after reading a bit of the documentation, it seems that sbatch and qsub do the same kind of thing, right? With both schedulers you have the choice of creating a job for the main Nextflow process or not. I think it would be a good idea to support both possibilities; see the sketch below.
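
As a rough illustration of the two options in trigger.sh terms (the NF_HEAD_AS_JOB flag is hypothetical, not an existing variable; the rest reuses the variables from the earlier launch_nf_run snippet):

# Option 1: wrap the Nextflow head process in a SLURM job (mandatory on some clusters)
# Option 2: start the head process directly on the VM/submit node
if [[ "${NF_HEAD_AS_JOB:-yes}" == "yes" ]]; then
    sbatch trigger_slurm.sh "$workflow_script" "$CONFIG_FILE" "$LAB" "$params" "$log_file"
else
    nextflow run "$workflow_script" -c "$CONFIG_FILE" -profile "$LAB" $params > "$log_file" 2>&1 &
fi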

@rolivella

Hi @braffes

Yes, you're right — the issue I have is that I cannot test the second option because my cluster only works as in option 1.

The only solution I can think of is to make a change in the pipeline so that it supports option 2 as well, and then you could try it out once it's ready. I would push it to the atlas-test branch on GitHub.

Would that work for you?

@temaia, do you know which mode your SLURM cluster uses? Or could I ask someone to clarify this?

braffes commented Mar 31, 2025

Hi @rolivella

Option 2 is the default one in your implementation, since I only modify the nextflow config file to use the slurm executor and don't change trigger.sh.

I'm OK with testing the new implementation.

@rolivella

@braffes ah, now I see. In that case I think I have finished the implementation, apart from some small loose ends.
