Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for larger genomes #239

Open
fmobegi opened this issue Jul 23, 2021 · 4 comments
Open

Support for larger genomes #239

fmobegi opened this issue Jul 23, 2021 · 4 comments

Comments

@fmobegi
Copy link

fmobegi commented Jul 23, 2021

Is your feature request related to a problem? Please describe

Am working on in-planta infection RNAseq assays for the Lentil-Ascochyta pathosystem. The pathogen side of things works perfectly with the pipeline as currently constituted (albeit with a few hiccups on the cleanup of large intermediate files as mentioned in a separate issue report). However, the analysis cannot go past samtools indexing when using the Lentil genome as a reference.

Describe the solution you'd like

Am not entirely sure how to handle this (perhaps a try-catch-except), but, if the tool could do some kind of chromosome length check and send the BAM files to appropriate samtools indexing and generate either the default *.BAI or *.CSI for larger genomes.

Describe alternatives you've considered

I have modified the process "samtools_index" as below to add the -c flag thus enabling csi index for this run.

process samtools_index {
  publishDir "${params.outdir}/Samples/${sample_id}", mode: params.publish_dir_mode, pattern: publish_pattern_samtools_index
  tag { sample_id }
  label "samtools"

  input:
    set val(sample_id), file(bam_file) from SORTED_FOR_INDEX

  output:
    set val(sample_id), file(bam_file) into BAM_INDEXED_FOR_STRINGTIE
    set val(sample_id), file("*.bam.csi") into BAI_INDEXED_FILE
//    set val(sample_id), file("*.bam.bai") into BAI_INDEXED_FILE
    set val(sample_id), file("*.bam.log") into BAM_INDEXED_LOG

  script:
  """
  echo "#TRACE sample_id=${sample_id}"
  echo "#TRACE bam_bytes=`stat -Lc '%s' *.bam`"

//  samtools index ${bam_file}
  samtools index -c ${bam_file}
  samtools stats ${bam_file} > ${sample_id}.bam.log
  """
}

@spficklin
Copy link
Member

Hi @fmobegi . Thanks for reporting this issue. We'll look into trying to get this fixed. It sounds like you found a work around!

@JohnHadish
Copy link
Collaborator

I just encountered this issue with the wheat genome as well.

@JohnHadish
Copy link
Collaborator

The purpose of samtools index is to create an index file that can be used with other programs to increase efficency. There are two options for this output .bai and .csi. .bai is the older version, and cannot create indeces for chromosomes over 2^29 bases. Unforunately, many plant genomes have chromosomes which are longer than this. .csi does not suffer these limitations.

GEMmaker currently creates a .bai file for each bam file it creates. However, GEMmaker does not use this .bai file in any of its downstream processes. The .bai file is output so that the user can use it in other programs.

4 Potential Solutions:

  • Have GEMmaker default to .csi files
  • Have the option to turn of samtools_index (it would be off by default, like other outputs)
  • Have the option to switch between .bai and .csi output (.csi as default)
  • Remove samtools_index from GEMmaker

@JohnHadish
Copy link
Collaborator

I implimented option 1 from above in my local version by changing the file modules/local/samtools_index.nf to the following. The only change is to add the -c tag to the samtools index command and change the output search pattern:

/**
 * Indexes the BAM alignment file
 */
process samtools_index {
    tag { sample_id }
    publishDir "${params.outdir}/Samples/${sample_id}", mode: params.publish_dir_mode, pattern: params.publish_pattern_samtools_index
    
    conda (params.enable_conda ? "bioconda::samtools=1.14" : null)
    if (workflow.containerEngine == 'singularity' && !params.singularity_pull_docker_container) {
        container "https://depot.galaxyproject.org/singularity/samtools:1.14--hb421002_0"
    } else {
        container "quay.io/biocontainers/samtools:1.14--hb421002_0"
    }

    input:
    tuple val(sample_id), path(bam_file)

    output:
    tuple val(sample_id), path(bam_file), emit: BAM_FILES
    tuple val(sample_id), path("*.bam.csi"), emit: BAI_FILES
    tuple val(sample_id), path("*.bam.csi"), emit: LOGS

    script:
    """
    echo "#TRACE sample_id=${sample_id}"
    echo "#TRACE bam_bytes=`stat -Lc '%s' *.bam`"

    samtools index -c ${bam_file}
    samtools stats ${bam_file} > ${sample_id}.bam.log
    """
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants