Fix short transcripts #23

Draft
AnnaLazarEBI wants to merge 5 commits into main from fix/short_transcript

@AnnaLazarEBI

Filter out transcripts with 1-codon translations (3 bp CDS).

Tested.

Original gtf: /hps/nobackup/flicek/ensembl/genebuild/lazar/bbr/zymoseptoria_tritici_pangenome/GCA_017766825.1/annotation_output/initial_region_gtfs/4.rs1.re2952612.busco_copy.gtf

Before filtering:
/hps/nobackup/flicek/ensembl/genebuild/lazar/dev/validation/old/output.gtf

After filtering:
/hps/nobackup/flicek/ensembl/genebuild/lazar/dev/validation/output.gtf


my $translation = $transcript->translation;

if (!$translation) {
Contributor

We compute the translation beforehand, so I think we want to exclude these?

@AnnaLazarEBI AnnaLazarEBI marked this pull request as draft January 20, 2026 13:15
@AnnaLazarEBI
Author

Issue: a 3 bp CDS survived the original filter.
For the 3 bp BUSCO transcript to survive it, one of the following must have been true:

  1. $transcript->translation was undef at that time, so the transcript went through the unless ($translation) branch and was kept. This happens if:

▫ it was the "final transcript in file" case, where compute_translation was commented out when the transcript was first built, or

▫ it is a transcript produced later (e.g. by joining / UTR processing) that has not had its translation set yet.

  2. Or it had a translation longer than 1 aa at that moment, and the CDS collapsed to 3 bp only later (after compute_translation is called in the "Computing translations" loop).

We now filter out 1-codon transcripts immediately after grouping and optional joining, instead of relying only on later CDS-length criteria. The change walks over joined_transcripts, inspects any existing translation, and removes transcripts whose translation length is ≤ 1 amino acid, logging what was dropped. This prevents obviously spurious 3 bp coding models (like BUSCO artefacts) from propagating into downstream steps and the final GTF, while leaving non-coding or untranslated models untouched for later processing.
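A minimal sketch of the filtering step described above, not the code in this PR. MockTranscript, MockTranslation, the stable_id accessor, the transcript IDs and filter_one_codon_transcripts are hypothetical names introduced for illustration; the sketch assumes only that a transcript exposes translation (undef when not yet computed) and that a translation exposes length in amino acids.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the real transcript/translation objects;
# only the methods the filter needs are mocked.
package MockTranslation {
    sub new {
        my ($class, %args) = @_;
        my $self = { %args };
        return bless $self, $class;
    }
    sub length { return $_[0]{aa_length} }    # length in amino acids
}

package MockTranscript {
    sub new {
        my ($class, %args) = @_;
        my $self = { %args };
        return bless $self, $class;
    }
    sub stable_id   { return $_[0]{stable_id} }
    sub translation { return $_[0]{translation} }    # undef if not computed
}

package main;

# Split transcripts into (kept, dropped). A transcript is dropped only when
# it already has a translation of <= 1 aa; models without a translation are
# kept so later steps can still process them.
sub filter_one_codon_transcripts {
    my ($transcripts) = @_;
    my (@kept, @dropped);
    for my $transcript (@$transcripts) {
        my $translation = $transcript->translation;
        if ($translation && $translation->length <= 1) {
            warn sprintf("Dropping %s: translation is %d aa\n",
                         $transcript->stable_id, $translation->length);
            push @dropped, $transcript;
            next;
        }
        push @kept, $transcript;
    }
    return (\@kept, \@dropped);
}

# Illustrative input: one 1-codon model, one untranslated, one normal.
my @joined_transcripts = (
    MockTranscript->new(stable_id   => 'tx_one_codon',
                        translation => MockTranslation->new(aa_length => 1)),
    MockTranscript->new(stable_id   => 'tx_untranslated',
                        translation => undef),
    MockTranscript->new(stable_id   => 'tx_normal',
                        translation => MockTranslation->new(aa_length => 250)),
);

my ($kept, $dropped) = filter_one_codon_transcripts(\@joined_transcripts);
print "kept: ", join(', ', map { $_->stable_id } @$kept), "\n";
```

Because untranslated models pass through untouched, a check like this only catches 1-codon models whose translation has already been computed when it runs, which is why its position relative to compute_translation matters.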

@AnnaLazarEBI
Author

These are filtered:
DS995906.1 anno exon 540775 543237 . - . gene_id "anno_4596"; transcript_id "anno_4596"; exon_number "1";
DS995906.1 anno transcript 1549067 1549069 . - . gene_id "anno_11407"; transcript_id "anno_11407"; biotype "busco"; translation_coords "1549067:1549069:1:1549067:1549069:3";
DS995906.1 anno exon 1549067 1549069 . - . gene_id "anno_11407"; transcript_id "anno_11407"; exon_number "1";

The first record (the anno_4596 exon) was not expected to be filtered.
original slice: /hps/nobackup/flicek/ensembl/genebuild/lazar/bbr/reannotation/batch2/GCA_000001985.1/annotation_output/slect_script_test/DS995906.1.rs1.re1818470.busco.gtf
filtered slice: /hps/nobackup/flicek/ensembl/genebuild/lazar/bbr/reannotation/batch2/GCA_000001985.1/annotation_output/slect_script_test/DS995906.1.rs1.re1818470.busco_sel.gtf
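One way to cross-check the original and filtered slices is to list every transcript feature whose span is at most 3 bp. The sketch below is an assumption-laden helper, not part of the PR: short_transcript_ids is a hypothetical name, the input is assumed to be standard tab-separated 9-column GTF, and the genomic span is only a proxy for CDS length (it matches for single-exon models such as anno_11407, but not for spliced transcripts).

```perl
use strict;
use warnings;

# Return the transcript_id of every "transcript" feature spanning <= $max_bp.
sub short_transcript_ids {
    my ($lines, $max_bp) = @_;
    my @ids;
    for my $line (@$lines) {
        next if $line =~ /^#/ || $line !~ /\S/;    # skip comments and blanks
        my @fields = split /\t/, $line, 9;
        next unless @fields == 9 && $fields[2] eq 'transcript';
        my ($start, $end, $attributes) = @fields[3, 4, 8];
        my $span = $end - $start + 1;    # GTF coordinates are 1-based, inclusive
        next if $span > $max_bp;
        push @ids, $1 if $attributes =~ /transcript_id "([^"]+)"/;
    }
    return @ids;
}

# Two of the records quoted above, rebuilt with tab separators.
my @gtf = (
    join("\t", 'DS995906.1', 'anno', 'transcript', 1549067, 1549069, '.', '-', '.',
         'gene_id "anno_11407"; transcript_id "anno_11407"; biotype "busco";'),
    join("\t", 'DS995906.1', 'anno', 'exon', 540775, 543237, '.', '-', '.',
         'gene_id "anno_4596"; transcript_id "anno_4596"; exon_number "1";'),
);

my @short = short_transcript_ids(\@gtf, 3);
print "$_\n" for @short;
```

Running the same check on both slices and comparing the ID lists would show which short models were removed and whether any longer model, such as anno_4596, disappeared along with them.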



2 participants