Working Data Dir
Serratus Working Bucket (~): s3://serratus-public/
All data files are stored in our AWS S3 bucket. This is the working directory for the project and contains raw, less organized data. If you are interested in data stored here, your best bet is to join the Slack and ask; the right person can point you to it.
The S3 bucket has public read-only permissions. All files can be downloaded via aws cli or wget/curl.
- aws-cli: `aws s3 cp s3://serratus-public/<file_path> .`
- wget/curl: `wget https://serratus-public.s3.amazonaws.com/<file_path>`
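The two download routes above differ only in how the object is addressed. As a minimal sketch, the helper below rewrites an `s3://` URI into the HTTPS form shown in the wget example; the function names `s3_to_https` and `fetch_public` are hypothetical and not part of the project.

```bash
s3_to_https() {
  # $1: s3://<bucket>/<key>  ->  https://<bucket>.s3.amazonaws.com/<key>
  case "$1" in
    s3://*/*) ;;
    *) echo "expected s3://<bucket>/<key>, got: $1" >&2; return 1 ;;
  esac
  path=${1#s3://}        # strip the scheme, leaving <bucket>/<key>
  bucket=${path%%/*}     # everything before the first slash
  key=${path#*/}         # everything after the first slash
  printf 'https://%s.s3.amazonaws.com/%s\n' "$bucket" "$key"
}

fetch_public() {
  # Download one object over HTTPS into the current directory.
  # Usage: fetch_public s3://serratus-public/<file_path>
  curl -fsSL -O "$(s3_to_https "$1")"
}
```

If no AWS credentials are configured, adding the AWS CLI's global `--no-sign-request` option to the `aws s3 cp` command should allow anonymous reads from a public bucket.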
For each electronic lab notebook entry, data associated with that run can be stored in this directory. Each folder is named by date (YYMMDD), corresponding to the date of the notebook file. For example, the data for the experiment serratus/notebook/200411_CoV_Divergence_Simulations.ipynb is found in s3://serratus-public/notebook/200411/.
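The naming convention can be sketched as a small helper that derives the data folder from a notebook filename. It assumes every notebook name starts with a six-digit YYMMDD date followed by an underscore, as in the example; `notebook_data_dir` is a hypothetical name.

```bash
notebook_data_dir() {
  # $1: path to a notebook file whose name starts with a YYMMDD date
  name=$(basename "$1")
  date_prefix=${name%%_*}    # text before the first underscore
  case "$date_prefix" in
    [0-9][0-9][0-9][0-9][0-9][0-9])
      printf 's3://serratus-public/notebook/%s/\n' "$date_prefix" ;;
    *)
      echo "no YYMMDD prefix in: $name" >&2; return 1 ;;
  esac
}

# Usage: notebook_data_dir serratus/notebook/200411_CoV_Divergence_Simulations.ipynb
```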
- ~/out/200525_viro/bam: Aligned output files, named by SRA accession
- ~/out/200525_viro/summary: .summary files for this experiment
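Paths on this page use `~` as shorthand for the bucket root. A minimal sketch of expanding that shorthand into a full S3 URI, assuming `~` always stands for s3://serratus-public/ (the helper name `expand_bucket_path` is hypothetical):

```bash
# Assumption: "~" is this page's shorthand for the bucket root.
SERRATUS_BUCKET="s3://serratus-public"

expand_bucket_path() {
  # $1: a wiki-style path such as '~/out/200525_viro/bam'
  case "$1" in
    "~/"*) printf '%s/%s\n' "$SERRATUS_BUCKET" "${1#??}" ;;  # drop the leading "~/"
    *)     printf '%s\n' "$1" ;;                             # leave other paths unchanged
  esac
}

# Usage: aws s3 ls "$(expand_bucket_path '~/out/200525_viro/summary')/"
```

Quote the argument with single quotes so the shell does not expand `~` to your home directory first.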
Reference sequence sets and their associated index files. These include pan-genomes, mega-genomes, and nucleotide and protein sequences.
Examples:
- ~/seq/cov0: All CoV sequences from NCBI
  - NCBI search: "(Coronaviridae) AND "viruses"[porgn:txid10239]"
  - Date Accessed: 2020/03/30
  - Results: 33296
- ~/seq/hgr1: Human rDNA testing sequence
  - From this publication
SRA Accession and Run Information master tables. Accessed via SRA website. See also SRA-queries.
Sequence Files
- ../bam/: aligned bam files for breaking into blocks
- ../bam-block: bam file output of fq-blocks requiring merging
- ../fq/: sequencing reads of various lengths
- ../fq-block: fq files broken into 'blocks'
- ../out: example output data of re-aligned reads
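To illustrate what breaking fq files into blocks can look like, here is a minimal sketch using POSIX `split`. It is not the project's actual tooling: it assumes plain four-line FASTQ records, and the block size and the function name `fq_split_blocks` are hypothetical.

```bash
fq_split_blocks() {
  # $1: input FASTQ   $2: reads per block   $3: output prefix
  # Each FASTQ record is assumed to span exactly 4 lines.
  split -l "$(( $2 * 4 ))" -a 4 "$1" "$3"
}

# Usage: fq_split_blocks reads.fq 1000000 fq-block/reads.
# Blocks are written as <prefix>aaaa, <prefix>aaab, ... and can be
# concatenated in name order to recover the original file.
```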
In assemblies/analysis/:
- catA-v[XXX].txt: list of assemblies of category A (single contig, longer than 25 Kbp)
- catB-v[XXX].txt: list of assemblies of category B (> 1 contigs, total length longer than 25 Kbp)
- cat[A/B]-v[XXX].fa: multifasta files of the lists above
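The two category definitions can be expressed as a short check over a single assembly's FASTA file. This is a sketch only, not the script that produced the lists: it assumes 25 Kbp means 25,000 bases and that "longer than" is a strict comparison, and `classify_assembly` is a hypothetical name.

```bash
classify_assembly() {
  # $1: FASTA file holding the contigs of one assembly
  # Prints "A", "B", or "none".
  awk -v min_len=25000 '
    /^>/ { n++; next }                                  # header line: one more contig
         { gsub(/[ \t\r]/, ""); total += length($0) }   # sequence line: add its bases
    END {
      if (n == 1 && total > min_len)      print "A"     # single contig, > 25 Kbp
      else if (n > 1 && total > min_len)  print "B"     # several contigs, total > 25 Kbp
      else                                print "none"
    }
  ' "$1"
}
```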
In assemblies/contigs/:
- SRRxxx.minia.checkv_filtered.fa: Minia k31 contigs filtered by CheckV, keeping only coronavirus hits
- SRRxxx.coronaspades.checkv_filtered.fa: coronaSPAdes scaffolds filtered by CheckV, keeping only coronavirus hits
- SRRxxx.coronaspades.gene_clusters.fa: coronaSPAdes' gene_clusters.fasta (you can think of them as contigs obtained by matching to a coronavirus HMM, but the construction process is more complex than that!)
- SRRxxx.coronaspades.gene_clusters.checkv_filtered.fa: coronaSPAdes gene_clusters.fasta further filtered by CheckV
In assemblies/other/SRRxxx.[assembler]/:
- SRRxxx.[assembler].contigs.fa.mfc: unfiltered contigs (Minia) or scaffolds (coronaSPAdes), straight out of the assembler and compressed with MFCompress. This file contains the whole assembly of the dataset, so it also includes host contigs and other viruses.
- SRRxxx.inputdata.txt: some statistics about the reads (number of reads, FASTQ file size)
- SRRxxx.[assembler].checkv.[completeness|contamination|quality_summary].tsv.gz: output of CheckV on the whole assembly file (i.e. contigs.fa.mfc)
- SRRxxx.coronaspades.gene_clusters.checkv.[completeness|contamination|quality_summary].tsv.gz: output of CheckV on the gene_clusters.fasta file (for coronaSPAdes)
- SRRxxx.[assembler].txt: output log of the assembler
The magic that performed this lives at https://gitlab.pasteur.fr/rchikhi_pasteur/serratus-batch-assembly
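The gzipped CheckV tables listed above can be inspected without unpacking them to disk. The sketch below pulls the contig IDs that carry a given quality label out of a quality_summary file. It assumes the table is tab-separated with a header row containing `contig_id` and `checkv_quality` columns, as in CheckV's documented output; verify this against your files. The function name `checkv_contigs_with_quality` is hypothetical.

```bash
checkv_contigs_with_quality() {
  # $1: a *.quality_summary.tsv.gz file   $2: quality label to keep
  gzip -dc "$1" | awk -F'\t' -v want="$2" '
    NR == 1 {
      # Map header names to column numbers so column order does not matter.
      for (i = 1; i <= NF; i++) col[$i] = i
      if (!("contig_id" in col) || !("checkv_quality" in col)) {
        print "expected columns not found in header" > "/dev/stderr"
        exit 1
      }
      next
    }
    $col["checkv_quality"] == want { print $col["contig_id"] }
  '
}

# Usage: checkv_contigs_with_quality SRRxxx.minia.checkv.quality_summary.tsv.gz High-quality
```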