Added miniprot module #24

Open
swatiebi wants to merge 4 commits into dev/anno_module_wrapper from feature/add_miniprot

swatiebi commented Feb 3, 2026

Added the miniprot module. Tested it with:
python3 /hps/software/users/ensembl/genebuild/swati/develop_anno/ensembl-anno/src/python/ensembl/tools/anno/protein_annotation/miniprot.py --masked_genome_file /hps/nobackup/flicek/ensembl/genebuild/swati/anno_miniprot_test/GCA_046119055.2/red_output/mask_output/gorgonocephalus_arcticus_reheadered_toplevel.msk --output_dir /hps/nobackup/flicek/ensembl/genebuild/swati/anno_development_test/ --protein_file /hps/nobackup/flicek/ensembl/genebuild/swati/echinoderms_data_files/echinobase.fasta --miniprot_bin /hps/software/users/ensembl/genebuild/swati/miniprot/miniprot --num_threads 20 --protein_set uniprot

Since we now use two tools for protein annotation, we need two separate dirs. I therefore updated the create_dir function to allow creating nested dirs.
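For readers following along, a minimal sketch of what nested-directory support in a helper like create_dir could look like. The signature below (main_output_dir plus an optional dir_name) and the example sub-directory name are assumptions for illustration and may differ from the actual ensembl-anno utility.

```python
from pathlib import Path
from typing import Optional, Union


def create_dir(main_output_dir: Union[str, Path], dir_name: Optional[str] = None) -> Path:
    """Create main_output_dir (or a possibly nested sub-directory of it) and return its path."""
    target = Path(main_output_dir)
    if dir_name:
        # dir_name may span several levels, e.g. "protein_output/miniprot_output"
        target = target / dir_name
    # parents=True creates missing intermediate directories;
    # exist_ok=True makes the call safe to repeat
    target.mkdir(parents=True, exist_ok=True)
    return target
```

Calling it twice with the same arguments returns the same path without raising, which matters when genblast and miniprot share a parent directory.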

Updated the genblast module to create a genblast_output dir for protein annotations.

updated create_dir function to create nested dirs

@AnnaLazarEBI

Is it intentional that it is not added to the anno wrapper? @swatiebi

swatiebi commented Mar 8, 2026

> Is it intentional that it is not added to the anno wrapper? @swatiebi

Added now!

ens-ftricomi left a comment

The logic is fine. I would run mypy on the Python code, because it might complain about the generic variable types.

Moreover, I would add the miniprot call to pyproject.toml so that it is built with the package; please also check the dependencies.

We need to fix the naming convention for the protein sets in both GenBlast and Miniprot, because we are not using UniProt and OrthoDB in all the clades and this is becoming confusing.

    if miniprot_file.endswith(".gff"):
        convert_miniprot_gff_to_gtf(input_file=miniprot_file,output_file=file_out_name)

def convert_miniprot_gff_to_gtf(input_file=None,output_file=None):

Missing return type; -> None?

def convert_miniprot_gff_to_gtf(
    input_file: Union[str, Path],
    output_file: Union[str, Path],
) -> None:
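To make the suggestion concrete, here is a hedged sketch of a fully typed converter. It is not the code from this PR: it assumes standard nine-column GFF3 input, and the attribute mapping (Parent, or else ID, reused as both gene_id and transcript_id) is an assumption chosen purely for illustration. The helper parse_gff3_attributes is hypothetical.

```python
from pathlib import Path
from typing import Dict, Union


def parse_gff3_attributes(column: str) -> Dict[str, str]:
    """Split a GFF3 attribute column ("key=value;key=value") into a dict."""
    attributes: Dict[str, str] = {}
    for field in column.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attributes[key] = value
    return attributes


def convert_miniprot_gff_to_gtf(
    input_file: Union[str, Path],
    output_file: Union[str, Path],
) -> None:
    """Rewrite the ninth column of each feature line in GTF attribute syntax."""
    with open(input_file, encoding="utf-8") as file_in, open(
        output_file, "w", encoding="utf-8"
    ) as file_out:
        for line in file_in:
            if line.startswith("#") or not line.strip():
                continue  # skip comments, directives and blank lines
            ele = line.rstrip("\n").split("\t")
            if len(ele) != 9:
                continue  # not a feature line
            attributes = parse_gff3_attributes(ele[8])
            # Assumption: child features point to their transcript via Parent,
            # top-level features carry the transcript identifier in ID.
            transcript_id = attributes.get("Parent") or attributes.get("ID")
            if transcript_id is None:
                continue
            ele[8] = f'gene_id "{transcript_id}"; transcript_id "{transcript_id}";'
            file_out.write("\t".join(ele) + "\n")
```

Using with blocks for both files also removes the need for an explicit file_out.close() call.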

        file_out.write("%s\n"% "\t".join(ele))
    file_out.close()

def run_miniprot_index(

Missing variable types and return type.

    miniprot_cmd = [
        str(miniprot_bin),
        "-t" + str(num_threads),
        "-N", "1",  # get exactly one alignment per protein, primary alignment

I wouldn't hard-code these two choices. They can be input variables with these values as defaults, so the user is free to modify them according to the scenario.
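One possible shape for this, as a sketch only: the command is assembled by a small function whose parameters default to the values currently hard-coded. The function name build_miniprot_cmd and the parameters genome_index, max_alignments and extra_args are hypothetical; only the -t and -N flags are taken from the snippet above, and the order of the two positional arguments (genome or index first, then the protein file) is an assumption about the rest of the command.

```python
from pathlib import Path
from typing import List, Sequence, Union


def build_miniprot_cmd(
    miniprot_bin: Union[str, Path],
    genome_index: Union[str, Path],
    protein_file: Union[str, Path],
    num_threads: int = 1,
    max_alignments: int = 1,
    extra_args: Sequence[str] = (),
) -> List[str]:
    """Assemble the miniprot command line from overridable defaults."""
    return [
        str(miniprot_bin),
        "-t" + str(num_threads),
        "-N", str(max_alignments),  # default mirrors the value hard-coded in the PR
        *extra_args,  # any further miniprot options chosen by the caller
        str(genome_index),
        str(protein_file),
    ]
```

The same parameters could then be exposed through argparse, so the defaults live in one place and a caller can override them per run.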

 my $input_gtf_file;
 my $output_gtf_file;
-my $biotypes_hash = ['transcriptomic','busco','protein'];
+my $biotypes_hash = ['transcriptomic','genblast-protein','miniprot-protein','genblast-orthodb','miniprot-orthodb'];

We are not using OrthoDB as a second line of evidence anymore, and this is already confusing for fungi, so I would stay general and use something like miniprot-protein-evidence-1 and miniprot-protein-evidence-2.
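As an illustration of the proposed naming scheme, the labels could be generated rather than listed by hand. This is a Python sketch (the script under review is Perl), and the function name evidence_biotypes and its parameters are hypothetical.

```python
from typing import List, Sequence


def evidence_biotypes(
    tools: Sequence[str] = ("genblast", "miniprot"),
    num_evidence_sets: int = 2,
) -> List[str]:
    """Build source-agnostic biotype labels such as 'miniprot-protein-evidence-1'."""
    return [
        f"{tool}-protein-evidence-{rank}"
        for tool in tools
        for rank in range(1, num_evidence_sets + 1)
    ]
```

With the defaults this yields four labels, two per aligner, none of which names a specific protein database.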

    # Proteins
    flags['run_genblast'] = (run_genblast if run_genblast is not None else run_proteins) and protein_file is not None
    flags['run_busco'] = (run_busco if run_busco is not None else run_proteins) and busco_protein_file is not None
    flags['run_genblast_op'] = (run_genblast_op if run_genblast_op is not None else run_proteins) and other_protein_file is not None

The naming convention for the protein files is not clear from a user perspective: I would keep them generic (protein_file_1 and protein_file_2, or something similar) and specify whether at least one always needs to be present.

parser.add_argument("--run_miniprot_op", action="store_const", const=True, default=None, help="Run Miniprot to align other protein sequences")
parser.add_argument("--protein_file", type=str, help="Path to a fasta file with protein sequences")
parser.add_argument("--busco_protein_file", type=str, help="Path to a fasta file with BUSCO (OrthoDB) protein sequences")
parser.add_argument("--other_protein_file", type=str, help="Path to a fasta file with other protein sequences")

See the comment above. Should the first source be more accurate than the second one? If yes, we need to state this in the description.
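A sketch of how the two naming comments could be addressed together in argparse. The option names --protein_file_1 and --protein_file_2 follow the reviewer's suggestion; the "primary/secondary" wording in the help text and the rule that at least one file must be given are assumptions that still need to be confirmed for the pipeline.

```python
import argparse
from typing import Optional, Sequence


def parse_protein_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Parse generic protein evidence options and require at least one file."""
    parser = argparse.ArgumentParser(description="Protein evidence options (sketch)")
    parser.add_argument(
        "--protein_file_1",
        type=str,
        help="Path to a fasta file with the first (assumed primary) set of protein sequences",
    )
    parser.add_argument(
        "--protein_file_2",
        type=str,
        help="Path to a fasta file with the second (assumed secondary) set of protein sequences",
    )
    args = parser.parse_args(argv)
    if args.protein_file_1 is None and args.protein_file_2 is None:
        # parser.error prints the usage message and exits with status 2
        parser.error("at least one of --protein_file_1 / --protein_file_2 is required")
    return args
```

Putting the expected priority of each set into the help text keeps the convention visible to users without tying the option names to a particular database.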

        str(masked_genome),
    ]
    logger.info(" ".join(miniprot_index_cmd))
    subprocess.run(miniprot_index_cmd)

Add check=True.

    logger.info(" ".join(miniprot_cmd))

    with open(initial_output_file, 'w') as process_output_file:
        subprocess.run(miniprot_cmd, stdout=process_output_file)

Add check=True here as well.
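For context, a small sketch of the pattern being requested. Without check=True, subprocess.run returns normally even when the command exits with a non-zero status, so a failed run can leave an empty or truncated output file that later steps pick up. The helper name run_to_file is hypothetical.

```python
import subprocess
from pathlib import Path
from typing import Sequence, Union


def run_to_file(cmd: Sequence[str], output_file: Union[str, Path]) -> None:
    """Run cmd, write its stdout to output_file and fail loudly on a non-zero exit."""
    with open(output_file, "w", encoding="utf-8") as process_output_file:
        # check=True raises subprocess.CalledProcessError if the command fails,
        # instead of silently leaving a partial or empty output file
        subprocess.run(cmd, stdout=process_output_file, check=True)
```

The caller can catch subprocess.CalledProcessError to log the failing command and stop before the GFF to GTF conversion.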

    generate_miniprot_gtf(miniprot_dir)

def generate_miniprot_gtf(
    miniprot_dir

Missing parameter type and return type.

):
    logger.info("generate_miniprot_gff")
    file_out_name = os.path.join(miniprot_dir, "annotation.gtf")
    for root, dirs, files in os.walk(miniprot_dir):

If miniprot_dir is a Path, you need str(miniprot_dir).
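A typed sketch of the directory scan that accepts either a str or a Path. Note that os.walk and os.path.join have accepted path-like objects since Python 3.6, so the explicit conversion mainly matters where string methods are applied to the path itself; os.fspath is used here to make the intent explicit. The helper name find_gff_files is hypothetical and only covers the file discovery step.

```python
import os
from pathlib import Path
from typing import List, Union


def find_gff_files(miniprot_dir: Union[str, Path]) -> List[Path]:
    """Return all .gff files found below miniprot_dir, in sorted order."""
    gff_files: List[Path] = []
    # os.fspath() normalises str and Path inputs to a plain string
    for root, _dirs, files in os.walk(os.fspath(miniprot_dir)):
        for name in files:
            if name.endswith(".gff"):
                gff_files.append(Path(root) / name)
    return sorted(gff_files)
```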

3 participants