Skip to content
lohrd edited this page Jul 21, 2020 · 14 revisions

SQL Schema for Summary Reports

The following image shows the format of the summary reports that are generated upon each serratus run:

Summary report example

The SQL Schema for accessing each summary report is made up of four tables: Runs, FamilySections, AccessionSections, and FastaSections.

Runs

'Runs' corresponds to the first line of the summary file, where the data for the SRA, reference genome, and date is present. This table has a one to many relationship with the three following tables, all linked by the SRA and auto-generated PK RunId. The columns present on this table are as follows:

  • RunId: PK for the table, auto-generated when entered into the database
  • Sra: Sra for the analysis represented in the summary file
  • FileName: File name of the summary file in the S3 Bucket ex: "summary/DRR000231.summary"
  • Genome: Name of the reference genome used for analysis
  • Date: Date the summary file was generated

FamilySections

'FamilySections' corresponds to the next section of the summary report, where the data for the pan-genome is sorted by the number of hits. The columns present on this table are as follows:

  • FamilySectionId: PK for the table, auto-generated when entered into the database
  • FamilySectionLineId: Number indicating the position of the family line in the family section of the summary file
  • RunId: FK linking back to the Runs table
  • Sra: FK linking back to the Runs table, added here for easier query construction
  • Family: Name of the family of the pan-genome that is being analyzed
  • Score: Score given for the quality of the alignment
  • PctId: Percent identity of the sequences aligned (wrt the reference genome)
  • Aln: Number of aligned reads
  • Glb: Number of global aligned reads
  • PanLen: Pangenome length
  • Cvg: Coverage cartoon generated, giving a picture of the quality of alignment throughout the specific sequence
  • Top: Top accession
  • TopAln: Top accession aligned reads
  • TopLen: Length of the top accession
  • TopName: Study name linking to the top accession

AccessionSections

'AccessionSections' corresponds to the next section of the summary file, where each line is a reference sequence sorted by hits. The columns present on this table are as follows:

  • AccessionSectionId: PK for the table, auto-generated when entered into the database
  • AccessionSectionLineId: Number indicating the position of the accession line in the accession section of the summary file
  • RunId: FK linking back to the Runs table
  • Sra: FK linking back to the Runs table, added here for easier query construction
  • Fam: Name of the family of the pan-genome that is being analyzed
  • Acc: Name of the accession
  • PctId: Percent identity of the sequences aligned (wrt the reference genome)
  • Aln: Number of aligned reads
  • Glb: Number of global aligned reads
  • Len: Length of accession
  • CvgPct: Coverage percentage of mapping
  • Depth: Depth of the sequence
  • Cvg: Coverage cartoon generated, giving a picture of the quality of alignment throughout the specific sequence
  • Name: Study name linking to the accession

FastaSections

'FastaSections' corresponds to the final section of the summary report, where sequences with good scores assigned to cov and non-cov(Other) and are listed. The columns on this table are as follows:

  • FastaSectionId: PK for the table, auto-generated when entered into the database
  • FastaSectionLineId: Number indicating the position of the fasta line in the fasta section of the summary file
  • RunId: FK linking back to the Runs table
  • Sra: FK linking back to the Runs table, added here for easier query construction
  • SequenceId: Identifier of the shorthand for the fasta sequence in the fasta section (>Cov.2, >Other.12, etc.)
  • Sequence: Nucleotide sequence of the read
Clone this wiki locally