Conversation
Updated create_dir function to create nested dirs
Is it intentional that it is not added to the anno wrapper? @swatiebi
Added now!
ens-ftricomi
left a comment
The logic is fine. I would run mypy on the Python code, because it might complain about the generic variable types.
Moreover, I would add the miniprot call to pyproject.toml to build the package, and also check the dependencies.
We need to fix the naming convention for the proteins in both GenBlast and Miniprot, because we are not using UniProt and OrthoDB in all the clades and this is becoming confusing.
if miniprot_file.endswith(".gff"):
    convert_miniprot_gff_to_gtf(input_file=miniprot_file,output_file=file_out_name)


def convert_miniprot_gff_to_gtf(input_file=None,output_file=None):
missing return type, -> None?
from pathlib import Path
from typing import Union

def convert_miniprot_gff_to_gtf(
    input_file: Union[str, Path],
    output_file: Union[str, Path]
) -> None:
file_out.write("%s\n"% "\t".join(ele))
file_out.close()


def run_miniprot_index(
missing variable types and return
miniprot_cmd = [
    str(miniprot_bin),
    "-t" + str(num_threads),
    "-N", "1", #get exactly one alignment per protein, primary alignment
I wouldn't hard-code these two choices. They can be input variables with these default values, so the user is free to change them according to the scenario.
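A minimal sketch of what this suggestion could look like, assuming the command is assembled in a small helper. The function name `build_miniprot_cmd` and the parameters `max_alignments` and `extra_args` are hypothetical; only the `-t` and `-N` flags are taken from the diff above, and the second hard-coded choice is not visible in this hunk, so it is left to `extra_args`.

```python
from pathlib import Path
from typing import List, Optional, Sequence, Union

PathLike = Union[str, Path]


def build_miniprot_cmd(
    miniprot_bin: PathLike,
    genome_index: PathLike,
    protein_file: PathLike,
    num_threads: int = 1,
    max_alignments: int = 1,  # hypothetical name; value passed to -N, default as in the diff
    extra_args: Optional[Sequence[str]] = None,  # hypothetical hook for any other flags
) -> List[str]:
    """Assemble the miniprot command line without running it."""
    cmd = [
        str(miniprot_bin),
        "-t" + str(num_threads),
        "-N", str(max_alignments),
    ]
    if extra_args:
        cmd.extend(extra_args)
    # Positional arguments go last: the genome (or its index), then the proteins.
    cmd.extend([str(genome_index), str(protein_file)])
    return cmd
```

Callers that are happy with the current behaviour pass nothing extra, while a different scenario can override the value, for example `build_miniprot_cmd(miniprot_bin, index, proteins, max_alignments=5)`.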
  my $input_gtf_file;
  my $output_gtf_file;
- my $biotypes_hash = ['transcriptomic','busco','protein'];
+ my $biotypes_hash = ['transcriptomic','genblast-protein','miniprot-protein','genblast-orthodb','miniprot-orthodb'];
We are no longer using OrthoDB as a second line of evidence, and this is already confusing for fungi, so I would stay general and use something like miniprot-protein-evidence-1 and miniprot-protein-evidence-2.
# Proteins
flags['run_genblast'] = (run_genblast if run_genblast is not None else run_proteins) and protein_file is not None
flags['run_busco'] = (run_busco if run_busco is not None else run_proteins) and busco_protein_file is not None
flags['run_genblast_op'] = (run_genblast_op if run_genblast_op is not None else run_proteins) and other_protein_file is not None
The naming convention for the protein files is not clear from a user perspective: I would keep them generic (protein_file_1 and protein_file_2, or similar) and specify whether at least one must always be present.
parser.add_argument("--run_miniprot_op", action="store_const", const=True, default=None, help="Run Miniprot to align other protein sequences")
parser.add_argument("--protein_file", type=str, help="Path to a fasta file with protein sequences")
parser.add_argument("--busco_protein_file", type=str, help="Path to a fasta file with BUSCO (OrthoDB) protein sequences")
parser.add_argument("--other_protein_file", type=str, help="Path to a fasta file with other protein sequences")
See comment above. Should the first source be more accurate than the second one? If yes, we need to state this in the description.
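As a rough illustration of the generic naming, here is a sketch using the `protein_file_1` / `protein_file_2` names suggested above. The helper name `parse_protein_args`, the help texts, and the rule that at least one file must be given are assumptions for the example, not decisions made in this PR; whether the first set should be treated as the more reliable one is still the open question raised above and would belong in the help text once settled.

```python
import argparse
from typing import List, Optional


def parse_protein_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the two generic protein evidence options (hypothetical names)."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--protein_file_1",
        type=str,
        default=None,
        help="Path to a fasta file with the first set of protein sequences",
    )
    parser.add_argument(
        "--protein_file_2",
        type=str,
        default=None,
        help="Path to a fasta file with the second set of protein sequences",
    )
    args = parser.parse_args(argv)
    # Assumption for this sketch: protein annotation needs at least one evidence set.
    if args.protein_file_1 is None and args.protein_file_2 is None:
        parser.error("provide at least one of --protein_file_1 or --protein_file_2")
    return args
```

`parser.error` prints the usage message and exits with status 2, so a missing input is reported before any aligner is started.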
    str(masked_genome),
]
logger.info(" ".join(miniprot_index_cmd))
subprocess.run(miniprot_index_cmd)
logger.info(" ".join(miniprot_cmd))

with open(initial_output_file, 'w') as process_output_file:
    subprocess.run(miniprot_cmd, stdout=process_output_file)
generate_miniprot_gtf(miniprot_dir)


def generate_miniprot_gtf(
    miniprot_dir
missing type and return
):
    logger.info("generate_miniprot_gff")
    file_out_name = os.path.join(miniprot_dir, "annotation.gtf")
    for root, dirs, files in os.walk(miniprot_dir):
if miniprot_dir is a Path you need to str(miniprot_dir)
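A small sketch of a typed directory walk that works for both `str` and `Path` inputs. The helper name `find_miniprot_gff_files` is hypothetical and only collects the `.gff` files; it does not reproduce the conversion step. Note that on Python 3.6+ `os.walk` and `os.path.join` accept path-like objects, so the explicit conversion here is defensive rather than strictly required.

```python
import os
from pathlib import Path
from typing import List, Union


def find_miniprot_gff_files(miniprot_dir: Union[str, Path]) -> List[Path]:
    """Return every .gff file found under miniprot_dir, sorted for a stable order."""
    gff_files: List[Path] = []
    # os.fspath() turns a str or Path into a plain str before walking.
    for root, _dirs, files in os.walk(os.fspath(miniprot_dir)):
        for name in files:
            if name.endswith(".gff"):
                gff_files.append(Path(root) / name)
    return sorted(gff_files)
```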
Added miniprot module. Tested it with:
python3 /hps/software/users/ensembl/genebuild/swati/develop_anno/ensembl-anno/src/python/ensembl/tools/anno/protein_annotation/miniprot.py --masked_genome_file /hps/nobackup/flicek/ensembl/genebuild/swati/anno_miniprot_test/GCA_046119055.2/red_output/mask_output/gorgonocephalus_arcticus_reheadered_toplevel.msk --output_dir /hps/nobackup/flicek/ensembl/genebuild/swati/anno_development_test/ --protein_file /hps/nobackup/flicek/ensembl/genebuild/swati/echinoderms_data_files/echinobase.fasta --miniprot_bin /hps/software/users/ensembl/genebuild/swati/miniprot/miniprot --num_threads 20 --protein_set uniprot
Since we now use two tools for protein annotation, we need two separate dirs. Therefore, I updated the create_dir function to allow making nested dirs.
Updated the genblast module to create genblast_ouput dir for protein annotations
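For readers who have not opened the diff, nested directory creation of this kind can be sketched as follows. This is an illustration only: the signature shown here is an assumption and the actual `create_dir` in ensembl-anno may differ.

```python
from pathlib import Path
from typing import Optional, Union


def create_dir(
    main_output_dir: Union[str, Path],
    dir_name: Optional[Union[str, Path]] = None,
) -> Path:
    """Create main_output_dir/dir_name, including any missing parent dirs."""
    target = Path(main_output_dir)
    if dir_name:
        # dir_name may itself contain several levels, e.g. "a/b".
        target = target / dir_name
    # parents=True creates intermediate dirs; exist_ok=True makes reruns safe.
    target.mkdir(parents=True, exist_ok=True)
    return target
```

With this shape, each protein annotation tool can ask for its own subdirectory under a shared output dir, and calling the helper again on an existing path does not raise.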