Working Data Dir
Serratus Working Bucket (~): s3://serratus-public/
All data files are stored on our AWS S3 bucket. This is the working directory for the project and contains raw/less organized data.
For each electronic lab notebook entry, data associated with that run can be stored in this directory. Each folder is a date (YYMMDD) corresponding to the date of the notebook file. For example, the data for the experiment serratus/notebook/200411_CoV_Divergence_Simulations.ipynb is found in s3://serratus-public/notebook/200411/.
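The naming rule above can be sketched in a few lines of Python. This is only an illustration: the function name `notebook_data_prefix` is hypothetical, and it assumes every notebook file name starts with a six-digit YYMMDD date followed by an underscore, as in the example.

```python
import re
from pathlib import PurePosixPath

# Working bucket named on this page.
BUCKET = "s3://serratus-public"


def notebook_data_prefix(notebook_path: str) -> str:
    """Return the S3 folder for a notebook whose file name starts with YYMMDD_."""
    name = PurePosixPath(notebook_path).name
    match = re.match(r"(\d{6})_", name)
    if match is None:
        raise ValueError(f"no YYMMDD prefix in notebook name: {name!r}")
    return f"{BUCKET}/notebook/{match.group(1)}/"


print(notebook_data_prefix("serratus/notebook/200411_CoV_Divergence_Simulations.ipynb"))
# prints s3://serratus-public/notebook/200411/
```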
- ~/out/200525_viro/bam: Aligned output files, named by SRA accession
- ~/out/200525_viro/summary: .summary files for this experiment
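Paths on this page use `~` as shorthand for the working bucket. A minimal sketch of expanding that shorthand into a full S3 URI follows; the helper name `expand_working_path` is hypothetical and not part of any Serratus tooling.

```python
# "~" on this page stands for the Serratus working bucket.
WORKING_BUCKET = "s3://serratus-public/"


def expand_working_path(path: str) -> str:
    """Expand a leading "~" into the working bucket URI; leave other paths alone."""
    if path == "~":
        return WORKING_BUCKET
    if path.startswith("~/"):
        return WORKING_BUCKET + path[2:]
    return path


print(expand_working_path("~/out/200525_viro/summary"))
# prints s3://serratus-public/out/200525_viro/summary
```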
Reference sequence sets and their associated index files. Includes pan-genomes, mega-genomes, nucleotide and protein.
Examples:
- ~/seq/cov0: All CoV sequences from NCBI
  - NCBI search: "(Coronaviridae) AND "viruses"[porgn:txid10239]"
  - Date Accessed: 2020/03/30
  - Results: 33296
- ~/seq/hgr1: Human rDNA testing sequence
  - From this publication
SRA accession and run information master tables. Retrieved from the SRA website using the following basic filter (written as BASE in the queries below):
"type_rnaseq"[Filter] AND cluster_public[prop] AND "platform illumina"[Properties] AND "cloud s3"[Properties] NOT "scRNA"[All Fields] AND <SUBFILTER>
- Test Data Set
  - Mammals and CoV+ swabs for testing pipeline
  - SARS-CoV-2: PRJNA616446
  - Felis catus: PRJNA432069
  - Homo sapiens (HCT116): PRJEB29794
  - Macaca fascicularis: PRJNA553361
  - Mus musculus: PRJNA553361
  - Date Accessed: 2020/04/07
  - Results: 49 libraries
- Non-Human, Non-Mouse Mammals
  (BASE AND "Mammalia"[Organism] NOT "Homo sapiens"[Organism]) NOT "Mus musculus"[orgn]
  - Date Accessed: 2020/03/28
  - Results: 66926, 0.15 PB
- Human
  BASE AND "Homo sapiens"[Organism]
  - Date Accessed: 2020/03/05
  - Results: 520257, 4.75 PB
- Mouse
  BASE AND "Mus musculus"[orgn]
  - Results: 539233
  - Not accessed
- Vertebrates, Non-mammal
  BASE NOT "Mammalia"[Organism] NOT "Homo sapiens"[Organism] NOT "Mus musculus"[orgn]
  - Date Accessed: 2020/03/29
  - Results: 74532, 0.115 PB
- Invertebrates
  BASE NOT "Vertebrata"[Organism]
  - Date Accessed: 2020/03/30
  - Results: 403639, 0.7 PB
- HCT116 RNAseq
  - For testing; ca. 1000 entries of human HCT116 cell line
- CoV Positive Control (known CoV)
  "platform illumina"[Properties] OR "platform bgiseq"[Properties] AND txid694002[Organism:exp]
  - Date Accessed: 2020/04/27
  - Results: 862 samples
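The queries above can be expanded into full SRA search strings by substituting the basic filter for the BASE placeholder. The sketch below assumes BASE stands for the basic filter without its trailing `AND <SUBFILTER>` part; the function name `expand_query` is hypothetical, and the substitution is purely textual.

```python
# Basic filter from this page, without the trailing "AND <SUBFILTER>" (assumption).
BASE = (
    '"type_rnaseq"[Filter] AND cluster_public[prop] '
    'AND "platform illumina"[Properties] AND "cloud s3"[Properties] '
    'NOT "scRNA"[All Fields]'
)


def expand_query(template: str) -> str:
    """Replace a leading BASE placeholder with the basic filter text."""
    if not template.startswith("BASE"):
        # Queries such as the CoV positive control do not use BASE.
        return template
    return BASE + template[len("BASE"):]


print(expand_query('BASE AND "Homo sapiens"[Organism]'))
```

Queries that do not start with BASE are returned unchanged, so the same helper can be applied to every entry in the list.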
Sequence Files
- ../bam/: aligned bam files for breaking into blocks
- ../bam-block: bam file output of fq-blocks requiring merging
- ../fq/: sequencing reads of various length
- ../fq-block: fq files broken into 'blocks'
- ../out: Example output data of re-aligned reads
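To illustrate what breaking fq files into 'blocks' can look like, here is a minimal sketch that groups FASTQ lines into blocks of a fixed number of reads. It is not the Serratus implementation: the function name `fastq_blocks` and the `reads_per_block` parameter are hypothetical, and it assumes plain four-line FASTQ records.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def fastq_blocks(lines: Iterable[str], reads_per_block: int) -> Iterator[List[str]]:
    """Yield lists of FASTQ lines, each holding up to reads_per_block 4-line records."""
    if reads_per_block < 1:
        raise ValueError("reads_per_block must be at least 1")
    iterator = iter(lines)
    while True:
        # Each record is 4 lines: header, sequence, separator, qualities.
        block = list(islice(iterator, 4 * reads_per_block))
        if not block:
            return
        yield block


reads = [
    "@r1", "ACGT", "+", "IIII",
    "@r2", "GGCC", "+", "IIII",
    "@r3", "TTAA", "+", "IIII",
]
for number, block in enumerate(fastq_blocks(reads, reads_per_block=2)):
    print(number, len(block) // 4, "reads")
```

Each block can then be aligned independently, which is why the resulting per-block bam files need merging afterwards.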
In assemblies/analysis/:
- catA-v[XXX].txt: list of assemblies of category A: single contig, longer than 25 Kbp
- catB-v[XXX].txt: list of assemblies of category B: more than one contig, total length longer than 25 Kbp
- cat[A/B]-v[XXX].fa: multifasta files of the lists above
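The two categories above amount to a simple rule on contig lengths, sketched below. The function name `assembly_category` is hypothetical, and the sketch assumes that "25 Kbp" means 25,000 bp and that "longer than" is a strict comparison.

```python
from typing import List, Optional

# "25 Kbp" read as 25,000 bp (assumption).
MIN_LENGTH_BP = 25_000


def assembly_category(contig_lengths: List[int]) -> Optional[str]:
    """Return "A", "B", or None for an assembly given its contig lengths in bp."""
    if sum(contig_lengths) <= MIN_LENGTH_BP:
        return None  # too short for either category
    if len(contig_lengths) == 1:
        return "A"  # single contig longer than 25 Kbp
    return "B"  # several contigs, total length longer than 25 Kbp


print(assembly_category([29903]))         # prints A
print(assembly_category([15000, 12000]))  # prints B
print(assembly_category([20000]))         # prints None
```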
In assemblies/contigs/:
- SRRxxx.minia.checkv_filtered.fa: Minia k31 contigs filtered by CheckV, keeping only coronavirus hits
- SRRxxx.coronaspades.checkv_filtered.fa: coronaSPAdes scaffolds filtered by CheckV, keeping only coronavirus hits
- SRRxxx.coronaspades.gene_clusters.fa: coronaSPAdes' gene_clusters.fasta (you can think of them as contigs obtained by matching to a coronavirus HMM but the construction process is more complex than that!)
- SRRxxx.coronaspades.gene_clusters.checkv_filtered.fa: coronaSPAdes gene_clusters.fasta further filtered by CheckV
In assemblies/other/SRRxxx.[assembler]/:
- SRRxxx.[assembler].contigs.fa.mfc: unfiltered, straight out-of-the-assembler, MFCompress'd contigs (Minia) or scaffolds (coronaSPAdes) file. This file contains the whole assembly of the dataset, hence host contigs too, and also other viruses.
- SRRxxx.inputdata.txt: some statistics about the reads (number of reads, FASTQ file size)
- SRRxxx.[assembler].checkv.[completeness|contamination|quality_summary].tsv.gz: output of CheckV on the whole assembly file (i.e. contigs.fa.mfc)
- SRRxxx.coronaspades.gene_clusters.checkv.[completeness|contamination|quality_summary].tsv.gz: output of CheckV on the gene_clusters.fasta file (for coronaSPAdes)
- SRRxxx.[assembler].txt: output log of the assembler
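The CheckV outputs listed above are gzip-compressed TSV files, so they can be read with the Python standard library once downloaded. The sketch below is a generic reader; the function name `read_tsv_gz` is hypothetical, and it makes no assumption about which columns CheckV writes, only that the first row is a header.

```python
import csv
import gzip
from typing import Dict, List


def read_tsv_gz(path: str) -> List[Dict[str, str]]:
    """Read a gzip-compressed, tab-separated file into a list of row dictionaries."""
    # "rt" opens the compressed file in text mode; the header row supplies the keys.
    with gzip.open(path, "rt", newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))
```

For example, calling `read_tsv_gz` on a locally downloaded quality_summary file returns one dictionary per row, keyed by the column names in that file's header.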
the magic that performed this is https://gitlab.pasteur.fr/rchikhi_pasteur/serratus-batch-assembly