Deal with accessions with non-existing files #139

Open
bmlab-sg opened this issue Mar 24, 2023 · 7 comments

@bmlab-sg

Description of feature

Hi,

In SRA, some run accessions have no associated files.
For example, BioProject PRJEB18755 has several runs that are complete ghosts (ERR2013571, ERR2013572, ERR2013573, ...), while others are fine.
When these ghost accessions are provided in the input, the pipeline will first retry:

[60/81e7b9] NOTE: Process `NFCORE_FETCHNGS:SRA:SRA_IDS_TO_RUNINFO (ERR2013613)` failed -- Execution is retried (2)

and then terminate with errors:

Command error:
  [ERROR] There is no content for id ERR2013581. Maybe you lack the right permissions?

Of course, one option is to filter out these entries before feeding them to the pipeline, but it would be great if these errors could be ignored.
Or maybe there is already an option for this that I am missing?
Thanks for any info on this; being able to deal with it easily would be extremely helpful!

@bmlab-sg bmlab-sg added the enhancement Improvement for existing functionality label Mar 24, 2023
@Midnighter
Contributor

If you just want to ignore the errors, you can create a local Nextflow configuration:

process {
  withName: SRA_IDS_TO_RUNINFO {
    errorStrategy = 'ignore'
  }
}

@drpatelh drpatelh added this to the 1.10 milestone Apr 25, 2023
@drpatelh
Member

Did this solution work for you, @bmlab-sg? We could try to have the pipeline ignore these sorts of IDs, but we would need some way to detect them via the metadata or otherwise.

@bmlab-sg
Author

@drpatelh - yes, that solution mostly solves this issue.
After looking at a few datasets, it seems that AvgSpotLen and/or Bases being >0 can be a good filtering marker for these ghosts.
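
As a rough illustration of that filtering idea, the sketch below splits run accessions into those worth keeping and likely ghosts. It assumes a comma-separated run-info table with `Run`, `AvgSpotLen`, and `Bases` columns; the function name `drop_ghost_runs` and the sample rows are hypothetical, and this is not how the pipeline itself handles such accessions.

```python
import csv
import io


def _positive(value):
    """Return True if value parses as a number greater than zero."""
    try:
        return float(value) > 0
    except (TypeError, ValueError):
        # Missing or non-numeric values count as "not positive".
        return False


def drop_ghost_runs(runinfo_csv, fields=("AvgSpotLen", "Bases")):
    """Split run accessions into (kept, ghosts).

    Assumption: a run is treated as a ghost when none of `fields` is > 0.
    """
    kept, ghosts = [], []
    for row in csv.DictReader(io.StringIO(runinfo_csv)):
        if any(_positive(row.get(name)) for name in fields):
            kept.append(row["Run"])
        else:
            ghosts.append(row["Run"])
    return kept, ghosts


# Hypothetical rows for illustration only; the values are made up.
example = """Run,AvgSpotLen,Bases
RUN_A,150,3000000
RUN_B,0,0
RUN_C,,
"""

kept, ghosts = drop_ghost_runs(example)
print("keep:", kept)
print("skip:", ghosts)
```

Treating a run as a ghost only when every field is zero or missing is one reading of "and/or"; a stricter filter would replace `any` with `all`.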

@drpatelh
Member

Cool. Thanks for the update. We can check whether these metadata fields are exposed; if so, we can add conditional filtering to the pipeline so that it doesn't hard fail in these scenarios.

@drpatelh drpatelh assigned robsyme and drpatelh and unassigned robsyme May 5, 2023
@drpatelh
Member

drpatelh commented May 6, 2023

I am unable to reproduce this issue anymore. This could be due to the recent changes to the ENA API, as fixed in #148.

I am now getting `[ERROR] No matches found for database id ERR2013613!`, and we are unable to retrieve any metadata via the API URL below, which means we can't explicitly filter by Bases or otherwise:
https://www.ebi.ac.uk/ena/portal/api/filereport?accession=ERR2013613&result=read_run&fields=run_accession%2Cexperiment_accession

ERR2013613

ERROR ~ Error executing process > 'NFCORE_FETCHNGS:SRA:SRA_IDS_TO_RUNINFO (ERR2013613)'

Caused by:
  Process `NFCORE_FETCHNGS:SRA:SRA_IDS_TO_RUNINFO (ERR2013613)` terminated with an error exit status (1)

Command executed:

  echo ERR2013613 > id.txt
  sra_ids_to_runinfo.py \
      id.txt \
      ERR2013613.runinfo.tsv \
  
  
  cat <<-END_VERSIONS > versions.yml
  "NFCORE_FETCHNGS:SRA:SRA_IDS_TO_RUNINFO":
      python: $(python --version | sed 's/Python //g')
  END_VERSIONS

Command exit status:
  1

Command output:
  (empty)

Command error:
  [ERROR] No matches found for database id ERR2013613!
  Line: 'ERR2013613'

ERR2013581

ERROR ~ Error executing process > 'NFCORE_FETCHNGS:SRA:SRA_IDS_TO_RUNINFO (ERR2013581)'

Caused by:
  Process `NFCORE_FETCHNGS:SRA:SRA_IDS_TO_RUNINFO (ERR2013581)` terminated with an error exit status (1)

Command executed:

  echo ERR2013581 > id.txt
  sra_ids_to_runinfo.py \
      id.txt \
      ERR2013581.runinfo.tsv \
  
  
  cat <<-END_VERSIONS > versions.yml
  "NFCORE_FETCHNGS:SRA:SRA_IDS_TO_RUNINFO":
      python: $(python --version | sed 's/Python //g')
  END_VERSIONS

Command exit status:
  1

Command output:
  (empty)

Command error:
  [ERROR] No matches found for database id ERR2013581!
  Line: 'ERR2013581'
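
For anyone who wants to check other accessions the same way, here is a minimal sketch that rebuilds the filereport URL quoted above and tests whether a response body contains any metadata rows. It assumes the endpoint returns tab-separated text with a header row; the helper names `filereport_url` and `has_metadata` are hypothetical, and the HTTP request itself is left out.

```python
import csv
import io
from urllib.parse import urlencode

ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"


def filereport_url(accession, fields=("run_accession", "experiment_accession")):
    """Build the filereport query URL for a single accession."""
    query = urlencode(
        {
            "accession": accession,
            "result": "read_run",
            "fields": ",".join(fields),  # the comma is percent-encoded as %2C
        }
    )
    return f"{ENA_FILEREPORT}?{query}"


def has_metadata(tsv_text):
    """Return True if the text holds at least one data row after the header.

    Assumption: the response is tab-separated with a single header line.
    """
    rows = [row for row in csv.reader(io.StringIO(tsv_text), delimiter="\t") if row]
    return len(rows) > 1


print(filereport_url("ERR2013613"))
print(has_metadata("run_accession\texperiment_accession\n"))
```

An accession for which `has_metadata` is false could then be skipped up front instead of failing inside the process.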

I will close this issue for now, but please feel free to re-open it if you encounter the issue again, and provide the relevant IDs so we can use them to fix it.

@drpatelh drpatelh closed this as completed May 6, 2023
@rohitrrj

rohitrrj commented Jul 19, 2024

Hello @drpatelh,
I recently encountered this issue while working on PRJNA1079722. Multiple runs in this project (SRR29688921, SRR29688964, SRR29688955, SRR29688939, SRR29688945, SRR29688933, SRR29688921, SRR29688964) seem to cause the same error. However, these don't seem to be "ghosts" like the ones you found previously: each of these runs seems to host data for the associated sample. Below is the error for one of them:

ERROR ~ Error executing process > 'NFCORE_FETCHNGS:SRA:SRA_IDS_TO_RUNINFO (SRR29688955)'

Caused by:
  Process `NFCORE_FETCHNGS:SRA:SRA_IDS_TO_RUNINFO (SRR29688955)` terminated with an error exit status (1)

Command executed:

  echo SRR29688955 > id.txt
  sra_ids_to_runinfo.py \
      id.txt \
      SRR29688955.runinfo.tsv \

  cat <<-END_VERSIONS > versions.yml
  "NFCORE_FETCHNGS:SRA:SRA_IDS_TO_RUNINFO":
      python: $(python --version | sed 's/Python //g')
  END_VERSIONS

Command exit status:
  1

Command output:
  (empty)

Command error:
  [ERROR] No matches found for database id SRR29688955!
  Line: 'SRR29688955'

@rohitrrj rohitrrj reopened this Jul 19, 2024
@lizzyjoan

I have encountered this same issue for the dataset PRJNA898600. I have also tried running just one sample from the project in multiple ways, using different identifiers (SRR22198886, SRS15675991, SRX18177158) and both ftp and sratools for --download_method.

The only variation I find is that with the ftp method, the SRA_IDS_TO_RUNINFO process technically completes and the pipeline fails at SRA_RUNINFO_TO_FTP instead. The underlying issue still seems to be that the dataset is not found: in the "/work/" directory, the .runinfo.tsv is empty regardless of how I run the pipeline.
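
A quick way to confirm which run-info files came back empty is to scan the work directory. The sketch below assumes the files match `*.runinfo.tsv` and normally start with a header line; `runinfo_is_empty` and `find_empty_runinfo` are hypothetical helper names, and the demo files are created in a temporary directory rather than a real work directory.

```python
import csv
import tempfile
from pathlib import Path


def runinfo_is_empty(path):
    """Return True if the TSV has no data rows (zero bytes or header only)."""
    with open(path, newline="") as handle:
        rows = [
            row
            for row in csv.reader(handle, delimiter="\t")
            if any(cell.strip() for cell in row)
        ]
    return len(rows) <= 1


def find_empty_runinfo(work_dir):
    """List the names of *.runinfo.tsv files under work_dir with no data rows."""
    return sorted(
        path.name
        for path in Path(work_dir).rglob("*.runinfo.tsv")
        if runinfo_is_empty(path)
    )


# Demo on made-up files that mimic a nested work directory layout.
with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "aa" / "task1"
    second = Path(tmp) / "bb" / "task2"
    first.mkdir(parents=True)
    second.mkdir(parents=True)
    (first / "RUN_A.runinfo.tsv").write_text("")
    (second / "RUN_B.runinfo.tsv").write_text(
        "run_accession\texperiment_accession\nRUN_B\tEXP_B\n"
    )
    empty_files = find_empty_runinfo(tmp)

print(empty_files)
```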

If it's helpful, here is the slight variation that I get with the FTP download method:

Error executing process > 'NFCORE_FETCHNGS:SRA:SRA_RUNINFO_TO_FTP (1)'

Caused by:
  Missing output file(s) `*.tsv` expected by process `NFCORE_FETCHNGS:SRA:SRA_RUNINFO_TO_FTP (1)` (note: input files are not included in the default matching set)


Command executed:

  sra_runinfo_to_ftp.py \
      SRX18177158.runinfo.tsv \
      SRX18177158.runinfo_ftp.tsv
  
  cat <<-END_VERSIONS > versions.yml
  "NFCORE_FETCHNGS:SRA:SRA_RUNINFO_TO_FTP":
      python: $(python --version | sed 's/Python //g')
  END_VERSIONS

Command exit status:
  0

Command output:
  (empty)

Work dir:
  /data/user/lizzyr/setbp1_hd/src/work/5d/eb2b5e5c24efac0162b2aad382a315

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`

This is with nf-core/fetchngs v1.12.0 and Nextflow version 24.04.3.
