Artem Babaian edited this page Aug 7, 2020 · 5 revisions

Serratus Working Bucket(~): s3://serratus-public/

All data files are stored in our AWS S3 bucket. This is the working directory for the project and contains raw, less organized data. If you are interested in data from here, your best bet is to join the Slack and ask; the right person can point you to it.
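Throughout this page, `~` stands for the bucket root `s3://serratus-public/`. A minimal sketch of expanding that shorthand into a full S3 URI follows; the function name `expand_bucket_path` is hypothetical and used only for illustration.

```python
# Bucket root as given at the top of this page.
BUCKET_ROOT = "s3://serratus-public/"


def expand_bucket_path(path: str) -> str:
    """Expand the wiki's '~' shorthand into a full S3 URI (hypothetical helper)."""
    if path == "~":
        return BUCKET_ROOT
    if path.startswith("~/"):
        # Drop the leading '~/' and append the remainder to the bucket root.
        return BUCKET_ROOT + path[2:]
    raise ValueError(f"not a bucket-relative path: {path!r}")
```

For example, `expand_bucket_path("~/seq/cov0")` gives `s3://serratus-public/seq/cov0`.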

~/notebook : Experiment associated data

For each electronic lab notebook entry, data associated with that run can be stored in this directory. Each folder is named with the date (YYMMDD) of the corresponding notebook file. For example:

The data for the experiment serratus/notebook/200411_CoV_Divergence_Simulations.ipynb is found in s3://serratus-public/notebook/200411/.
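The naming rule above can be sketched in a few lines. This assumes every notebook file name starts with a six-digit YYMMDD date, as in the example; the helper name `notebook_data_uri` is hypothetical.

```python
import re
from pathlib import PurePosixPath


def notebook_data_uri(notebook_path: str, bucket: str = "s3://serratus-public") -> str:
    """Return the S3 folder holding data for a notebook (hypothetical helper).

    Assumes the notebook file name begins with a six-digit YYMMDD date.
    """
    name = PurePosixPath(notebook_path).name
    match = re.match(r"(\d{6})(?:_|\.|$)", name)
    if match is None:
        raise ValueError(f"no YYMMDD prefix in notebook name: {name!r}")
    return f"{bucket}/notebook/{match.group(1)}/"
```

Calling it on `serratus/notebook/200411_CoV_Divergence_Simulations.ipynb` returns `s3://serratus-public/notebook/200411/`.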

~/out : Serratus alignment output

  • ~/out/200525_viro/bam : Aligned output files, named by SRA accession
  • ~/out/200525_viro/summary : .summary files for this experiment

~/seq : Reference sequences

Reference sequence sets and their associated index files. Includes pan-genomes, mega-genomes, and nucleotide and protein sequence sets.

Examples:

  • ~/seq/cov0 : All CoV sequences from NCBI

    • NCBI search: "(Coronaviridae) AND "viruses"[porgn:txid10239]"
    • Date Accessed: 2020/03/30
    • Results: 33296
  • ~/seq/hgr1 : Human rDNA testing sequence

~/sra : SraRunInfo Tables (.csv.gz)

SRA Accession and Run Information master tables. Accessed via SRA website. See also SRA-queries.
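As a rough sketch, a downloaded table can be read with the Python standard library alone. This assumes the tables keep the `Run` column header of a standard SraRunInfo export; the function name `read_run_accessions` is hypothetical.

```python
import csv
import gzip


def read_run_accessions(source):
    """List run accessions from a gzipped SraRunInfo CSV (hypothetical helper).

    `source` may be a file path or a binary file object.
    Assumes the table has a 'Run' column; rows without a value are skipped.
    """
    with gzip.open(source, mode="rt", newline="") as handle:
        return [row["Run"] for row in csv.DictReader(handle) if row.get("Run")]
```

Passing the path of a local `.csv.gz` copy returns the run accessions in file order.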

~/test-data : example data for development

Sequence Files

  • ../bam/ : aligned bam files for breaking into blocks
  • ../bam-block : bam file output of fq-blocks requiring merging
  • ../fq/ : sequencing reads of various lengths
  • ../fq-block : fq files broken into 'blocks'
  • ../out : Example output data of re-aligned reads

~/var/ : Assorted nuts and bolts

~/assemblies/ : Assemblies of the CoV+ identified datasets

in assemblies/analysis/:

  • catA-v[XXX].txt : list of assemblies of category A (single contig, longer than 25 Kbp)
  • catB-v[XXX].txt : list of assemblies of category B (more than one contig, total length longer than 25 Kbp)
  • cat[A/B]-v[XXX].fa : multifasta files of the lists above
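The two category definitions can be written as a small classifier. This is a sketch of the rule as stated here, not the script that produced the lists: it assumes 1 Kbp = 1,000 bp and a strict "longer than" comparison, and the name `assembly_category` is hypothetical.

```python
MIN_TOTAL_BP = 25_000  # assumption: 25 Kbp taken as 25,000 bp


def assembly_category(contig_lengths):
    """Return 'A', 'B', or None for a list of contig lengths in bp.

    A: a single contig longer than 25 Kbp.
    B: more than one contig with total length longer than 25 Kbp.
    None: total length is not longer than 25 Kbp.
    """
    total = sum(contig_lengths)
    if total <= MIN_TOTAL_BP:
        return None
    return "A" if len(contig_lengths) == 1 else "B"
```

For instance, one 29,903 bp contig falls in category A, while two contigs of 15,000 bp and 12,000 bp fall in category B.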

in assemblies/contigs/:

  • SRRxxx.minia.checkv_filtered.fa Minia k31 contigs filtered by CheckV, keeping only coronavirus hits
  • SRRxxx.coronaspades.checkv_filtered.fa coronaSPAdes scaffolds filtered by CheckV, keeping only coronavirus hits
  • SRRxxx.coronaspades.gene_clusters.fa coronaSPAdes' gene_clusters.fasta (you can think of them as contigs obtained by matching to a coronavirus HMM but the construction process is more complex than that!)
  • SRRxxx.coronaspades.gene_clusters.checkv_filtered.fa coronaSPAdes gene_clusters.fasta further filtered by CheckV

in assemblies/other/SRRxxx.[assembler]/:

  • SRRxxx.[assembler].contigs.fa.mfc unfiltered, straight out-of-the-assembler, MFCompress'd contigs (Minia) or scaffolds (coronaSPAdes) file. This file contains the whole assembly of the dataset, so it also includes host contigs and other viruses.
  • SRRxxx.inputdata.txt some statistics about the reads (number of reads, FASTQ file size)
  • SRRxxx.[assembler].checkv.[completeness|contamination|quality_summary].tsv.gz output of CheckV on the whole assembly file (i.e. contigs.fa.mfc)
  • SRRxxx.coronaspades.gene_clusters.checkv.[completeness|contamination|quality_summary].tsv.gz output of CheckV on the gene_clusters.fasta file (for coronaSPAdes)
  • SRRxxx.[assembler].txt output log of the assembler
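The file names in these folders follow an `accession.assembler.rest` pattern, with the assembler part absent for files such as `SRRxxx.inputdata.txt`. A minimal sketch of splitting a name into those parts follows; the helper `parse_assembly_filename`, the `AssemblyFile` type, and the lowercase assembler labels `minia` and `coronaspades` are assumptions taken from the names listed here.

```python
from typing import NamedTuple, Optional

# Assembler labels as they appear in the file names on this page.
ASSEMBLERS = {"minia", "coronaspades"}


class AssemblyFile(NamedTuple):
    accession: str
    assembler: Optional[str]  # None when the name has no assembler part
    suffix: str               # everything after accession (and assembler)


def parse_assembly_filename(filename: str) -> AssemblyFile:
    """Split an assemblies file name into its parts (hypothetical helper)."""
    accession, _, rest = filename.partition(".")
    if not accession or not rest:
        raise ValueError(f"unexpected file name: {filename!r}")
    first, _, remainder = rest.partition(".")
    if first in ASSEMBLERS and remainder:
        return AssemblyFile(accession, first, remainder)
    return AssemblyFile(accession, None, rest)
```

With a placeholder accession, `parse_assembly_filename("SRR0000001.minia.checkv_filtered.fa")` yields accession `SRR0000001`, assembler `minia`, and suffix `checkv_filtered.fa`.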

The magic that performed this is at https://gitlab.pasteur.fr/rchikhi_pasteur/serratus-batch-assembly
