Added miniprot module by swatiebi · Pull Request #24 · Ensembl/ensembl-anno

swatiebi · 2026-02-03T15:24:06Z

Added the miniprot module. Tested it with:
python3 /hps/software/users/ensembl/genebuild/swati/develop_anno/ensembl-anno/src/python/ensembl/tools/anno/protein_annotation/miniprot.py --masked_genome_file /hps/nobackup/flicek/ensembl/genebuild/swati/anno_miniprot_test/GCA_046119055.2/red_output/mask_output/gorgonocephalus_arcticus_reheadered_toplevel.msk --output_dir /hps/nobackup/flicek/ensembl/genebuild/swati/anno_development_test/ --protein_file /hps/nobackup/flicek/ensembl/genebuild/swati/echinoderms_data_files/echinobase.fasta --miniprot_bin /hps/software/users/ensembl/genebuild/swati/miniprot/miniprot --num_threads 20 --protein_set uniprot

We now use two tools for protein annotation, so we need two separate dirs. I therefore updated the create_dir function to allow creating nested dirs.

Updated the genblast module to create the genblast_output dir for protein annotations.

Updated the create_dir function to create nested dirs.
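A minimal sketch of what nested directory creation could look like. The signature below is an assumption for illustration and is not taken from the repository's actual create_dir helper; the directory names in the usage note are hypothetical as well.

```python
from pathlib import Path
from typing import Optional, Union


def create_dir(
    main_output_dir: Union[str, Path], dir_name: Optional[str] = None
) -> Path:
    """Create main_output_dir/dir_name (or main_output_dir alone) and return it.

    Hypothetical signature, written for illustration only.
    """
    target = Path(main_output_dir)
    if dir_name:
        # dir_name may contain separators, e.g. "protein_output/miniprot_output"
        target = target / dir_name
    # parents=True creates any missing intermediate directories;
    # exist_ok=True makes repeated runs safe.
    target.mkdir(parents=True, exist_ok=True)
    return target
```

With this shape, a call such as create_dir(output_dir, "protein_output/miniprot_output") creates both levels in one step and returns the innermost path.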

AnnaLazarEBI · 2026-02-12T15:47:00Z

Is it intentional that it is not added to the anno wrapper? @swatiebi

swatiebi · 2026-03-08T21:12:39Z

> Is it intentional that it is not added to the anno wrapper? @swatiebi

Added now!

ens-ftricomi

The logic is fine. I would probably run mypy on the Python code, because it might complain about the generic variable types.

Moreover, I would add the miniprot call to pyproject.toml so that it is built with the package, and also check the dependencies.

We need to fix the naming convention for the protein sets in both GenBlast and Miniprot, because we are not using UniProt and OrthoDB in all clades and this is becoming confusing.

ens-ftricomi · 2026-03-09T11:24:53Z

src/python/ensembl/tools/anno/protein_annotation/miniprot.py

+            if miniprot_file.endswith(".gff"):
+                convert_miniprot_gff_to_gtf(input_file=miniprot_file,output_file=file_out_name)
+
+def convert_miniprot_gff_to_gtf(input_file=None,output_file=None):


Missing return type, -> None?

def convert_miniprot_gff_to_gtf(
    input_file: Union[str, Path],
    output_file: Union[str, Path],
) -> None:
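As a rough illustration of the typed style suggested above, the sketch below rewrites only the attribute column from GFF3 syntax (key=value;) to GTF syntax (key "value";). The function names are hypothetical, and the sketch does not reproduce the PR's converter: a complete GTF also needs gene_id and transcript_id attributes, which are not derived here.

```python
from pathlib import Path
from typing import Union


def gff_attrs_to_gtf(attributes: str) -> str:
    """Rewrite 'ID=MP1;Rank=1' as 'ID "MP1"; Rank "1";' (syntax only)."""
    pairs = []
    for field in attributes.strip().strip(";").split(";"):
        if "=" not in field:
            continue  # skip malformed or empty fields
        key, value = field.split("=", 1)
        pairs.append(f'{key.strip()} "{value.strip()}";')
    return " ".join(pairs)


def convert_gff_attributes(
    input_file: Union[str, Path],
    output_file: Union[str, Path],
) -> None:
    """Copy 9-column feature lines, rewriting column 9; skip comment lines."""
    # Context managers close both files even if an exception is raised,
    # so no explicit close() call is needed.
    with open(input_file) as gff_in, open(output_file, "w") as gtf_out:
        for line in gff_in:
            if line.startswith("#") or not line.strip():
                continue
            columns = line.rstrip("\n").split("\t")
            if len(columns) != 9:
                continue
            columns[8] = gff_attrs_to_gtf(columns[8])
            gtf_out.write("\t".join(columns) + "\n")
```

Opening both files in a single with statement also removes the need for the separate file_out.close() call.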

ens-ftricomi · 2026-03-09T11:25:31Z

src/python/ensembl/tools/anno/protein_annotation/miniprot.py

+                file_out.write("%s\n"% "\t".join(ele))
+    file_out.close()
+
+def run_miniprot_index(


Missing variable types and return type.

ens-ftricomi · 2026-03-09T11:26:57Z

src/python/ensembl/tools/anno/protein_annotation/miniprot.py

+    miniprot_cmd = [
+        str(miniprot_bin),
+        "-t" + str(num_threads),
+        "-N", "1", #get exactly one alignment per protein, primary alignment


I wouldn't hard-code these two choices. They can be input variables with these values as defaults, so the user is free to modify them according to the scenario.
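One way to expose the hard-coded values as parameters is sketched below. The function and parameter names (build_miniprot_cmd, max_alignments, extra_args) are hypothetical, and only the -t and -N options visible in the hunk are used; the second hard-coded option is not shown in the excerpt, so it would be passed through extra_args.

```python
from pathlib import Path
from typing import List, Optional, Union


def build_miniprot_cmd(
    miniprot_bin: Union[str, Path],
    genome_index: Union[str, Path],
    protein_file: Union[str, Path],
    num_threads: int = 1,
    max_alignments: int = 1,
    extra_args: Optional[List[str]] = None,
) -> List[str]:
    """Assemble the miniprot argument list with overridable defaults."""
    cmd = [
        str(miniprot_bin),
        "-t" + str(num_threads),
        "-N", str(max_alignments),  # default mirrors the value in the diff
    ]
    # Any further options are supplied by the caller instead of being fixed here.
    cmd.extend(extra_args or [])
    cmd.extend([str(genome_index), str(protein_file)])
    return cmd
```

The defaults keep the current behaviour, while a caller working on a different clade can override them without editing the module.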

ens-ftricomi · 2026-03-09T11:31:43Z

support_scripts_perl/finalise_geneset.pl

 my $input_gtf_file;
 my $output_gtf_file;
-my $biotypes_hash = ['transcriptomic','busco','protein'];
+my $biotypes_hash = ['transcriptomic','genblast-protein','miniprot-protein','genblast-orthodb','miniprot-orthodb'];


We are not using OrthoDB as a second line of evidence anymore, and this is already confusing for fungi, so I would stay general and use something like miniprot-protein-evidence-1 and miniprot-protein-evidence-2.

ens-ftricomi · 2026-03-09T11:46:39Z

ensembl_anno.py

    # Proteins
    flags['run_genblast'] = (run_genblast if run_genblast is not None else run_proteins) and protein_file is not None
-    flags['run_busco'] = (run_busco if run_busco is not None else run_proteins) and busco_protein_file is not None
+    flags['run_genblast_op'] = (run_genblast_op if run_genblast_op is not None else run_proteins) and other_protein_file is not None


The naming convention for the protein files is not clear from a user perspective: I would keep them generic (protein_file_1 and protein_file_2, or something similar) and specify whether at least one must always be present.
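A minimal sketch of generic protein file arguments with an explicit "at least one" check. The option names --protein_file_1, --protein_file_2 and --run_proteins are assumptions based on the suggestion above, not the wrapper's actual interface.

```python
import argparse
from typing import List, Optional


def parse_protein_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse hypothetical generic protein options and validate them."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--protein_file_1", type=str,
        help="Path to a fasta file with the first set of protein sequences",
    )
    parser.add_argument(
        "--protein_file_2", type=str,
        help="Path to a fasta file with the second set of protein sequences",
    )
    parser.add_argument(
        "--run_proteins", action="store_true",
        help="Run the protein alignment steps",
    )
    args = parser.parse_args(argv)
    # Fail early, with a clear message, when no protein set was supplied.
    if args.run_proteins and not (args.protein_file_1 or args.protein_file_2):
        parser.error(
            "--run_proteins requires --protein_file_1 and/or --protein_file_2"
        )
    return args
```

parser.error prints the usage message and exits with status 2, so a missing input is reported before any alignment starts.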

ens-ftricomi · 2026-03-09T11:48:52Z

ensembl_anno.py

+    parser.add_argument("--run_miniprot_op", action="store_const", const=True, default=None, help="Run Miniprot to align other protein sequences")
    parser.add_argument("--protein_file", type=str, help="Path to a fasta file with protein sequences")
-    parser.add_argument("--busco_protein_file", type=str, help="Path to a fasta file with BUSCO (OrthoDB) protein sequences")
+    parser.add_argument("--other_protein_file", type=str, help="Path to a fasta file with other protein sequences")


See the comment above. Should the first source be more accurate than the second one? If so, we need to state this in the description.

ens-ftricomi · 2026-03-09T11:53:40Z

src/python/ensembl/tools/anno/protein_annotation/miniprot.py

+        str(masked_genome),
+    ]
+    logger.info(" ".join(miniprot_index_cmd))
+    subprocess.run(miniprot_index_cmd)


ens-ftricomi · 2026-03-09T11:54:11Z

src/python/ensembl/tools/anno/protein_annotation/miniprot.py

+    logger.info(" ".join(miniprot_cmd))
+
+    with open(initial_output_file, 'w') as process_output_file:
+        subprocess.run(miniprot_cmd, stdout=process_output_file)
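One possible hardening of the two subprocess.run calls shown in these hunks, offered as an assumption rather than something stated in the thread: passing check=True makes a non-zero exit status raise an exception instead of passing silently. The helper name run_to_file is hypothetical.

```python
import subprocess
from pathlib import Path
from typing import List, Union


def run_to_file(cmd: List[str], output_file: Union[str, Path]) -> None:
    """Run cmd, write its stdout to output_file, and raise on failure."""
    with open(output_file, "w") as handle:
        # check=True raises subprocess.CalledProcessError when the command
        # exits with a non-zero status.
        subprocess.run(cmd, stdout=handle, check=True)
```

Without check=True, a failed alignment would leave an empty or partial output file that later steps might still try to convert.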


ens-ftricomi · 2026-03-09T11:54:54Z

src/python/ensembl/tools/anno/protein_annotation/miniprot.py

+    generate_miniprot_gtf(miniprot_dir) 
+
+def generate_miniprot_gtf(
+    miniprot_dir


Missing parameter type and return type.

ens-ftricomi · 2026-03-09T11:55:18Z

src/python/ensembl/tools/anno/protein_annotation/miniprot.py

+):
+    logger.info("generate_miniprot_gff")
+    file_out_name = os.path.join(miniprot_dir, "annotation.gtf")
+    for root, dirs, files in os.walk(miniprot_dir):


If miniprot_dir is a Path, you need to convert it with str(miniprot_dir).
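An alternative sketch that avoids the str conversion by staying in pathlib throughout. The helper name find_gff_files is hypothetical. On Python 3.6 and later, os.walk and os.path.join also accept path-like objects, so the conversion mainly affects the type of the values they return.

```python
from pathlib import Path
from typing import List, Union


def find_gff_files(miniprot_dir: Union[str, Path]) -> List[Path]:
    """Return every .gff file below miniprot_dir, in sorted order."""
    # Path() accepts both str and Path, so callers may pass either type.
    # rglob searches the directory tree recursively.
    return sorted(p for p in Path(miniprot_dir).rglob("*.gff") if p.is_file())
```

The output path can then be built the same way, for example Path(miniprot_dir) / "annotation.gtf".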

Added miniprot module

a58516e

updated create_dir function to created nested dirs

swatiebi requested review from AnnaLazarEBI and ens-ftricomi February 3, 2026 15:24

Added layers

5646efc

swatiebi added 2 commits February 13, 2026 15:11

fixed bug where alignment lines were being truncated

f8dd65f

Added miniprot in the wrapper script

4a6e404

AnnaLazarEBI approved these changes Mar 9, 2026

ens-ftricomi requested changes Mar 9, 2026

3 participants