-
Notifications
You must be signed in to change notification settings - Fork 34
SQL Schema
The following image shows the format of the summary reports that are generated upon each serratus run:

The SQL Schema for accessing each summary report is made up of four tables: Runs, FamilySections, AccessionSections, and FastaSections.
'Runs' corresponds to the first line of the summary file, where the data for the SRA, reference genome, and date is present. This table has a one to many relationship with the three following tables, all linked by the SRA and auto-generated PK RunId. The columns present on this table are as follows:
- RunId: PK for the table, auto-generated when entered into the database
- Sra: Sra for the analysis represented in the summary file
- FileName: File name of the summary file in the S3 Bucket ex: "summary/DRR000231.summary"
- Genome: Name of the reference genome used for analysis
- Date: Date the summary file was generated
'FamilySections' corresponds to the next section of the summary report, where the data for the pan-genome is sorted by the number of hits. The columns present on this table are as follows:
- FamilySectionId: PK for the table, auto-generated when entered into the database
- FamilySectionLineId: Number indicating the position of the family line in the family section of the summary file
- RunId: FK linking back to the Runs table
- Sra: FK linking back to the Runs table, added here for easier query construction
- Family: Name of the family of the pan-genome that is being analyzed
- Score: Score given for the quality of the alignment
- PctId: Percent identity of the sequences aligned (wrt the reference genome)
- Aln: Number of aligned reads
- Glb: Number of global aligned reads
- PanLen: Pangenome length
- Cvg: Coverage cartoon generated, giving a picture of the quality of alignment throughout the specific sequence
- Top: Top accession
- TopAln: Top accession aligned reads
- TopLen: Length of the top accession
- TopName: Study name linking to the top accession
'AccessionSections' corresponds to the next section of the summary file, where each line is a reference sequence sorted by hits. The columns present on this table are as follows:
- AccessionSectionId: PK for the table, auto-generated when entered into the database
- AccessionSectionLineId: Number indicating the position of the accession line in the accession section of the summary file
- RunId: FK linking back to the Runs table
- Sra: FK linking back to the Runs table, added here for easier query construction
- Fam: Name of the family of the pan-genome that is being analyzed
- Acc: Name of the accession
- PctId: Percent identity of the sequences aligned (wrt the reference genome)
- Aln: Number of aligned reads
- Glb: Number of global aligned reads
- Len: Length of accession
- CvgPct: Coverage percentage of mapping
- Depth: Depth of the sequence
- Cvg: Coverage cartoon generated, giving a picture of the quality of alignment throughout the specific sequence
- Name: Study name linking to the accession
'FastaSections' corresponds to the final section of the summary report, where sequences with good scores assigned to cov and non-cov(Other) and are listed. The columns on this table are as follows:
- FastaSectionId: PK for the table, auto-generated when entered into the database
- FastaSectionLineId: Number indicating the position of the fasta line in the fasta section of the summary file
- RunId: FK linking back to the Runs table
- Sra: FK linking back to the Runs table, added here for easier query construction
- SequenceId: Identifier of the shorthand for the fasta sequence in the fasta section (>Cov.2, >Other.12, etc.)
- Sequence: Nucleotide sequence of the read