Arthropod Moulting Gene Discovery Pipeline

This repository contains a Snakemake workflow for the selection, processing, and annotation of arthropod proteomes to identify orthologous groups associated with moulting pathways.

Overview

The pipeline performs the following steps:

Selection of High-Quality Proteomes
Filters the A3cat table to retain one representative genome per arthropod species, excluding low-quality assemblies and downsampling overrepresented lineages.
Proteome Download and Isoform Filtering
Downloads protein FASTA files and retains the longest isoform per gene.
Metadata Extraction
Extracts gene-level and protein-level metadata for downstream analysis.
Orthologous Group Inference
Runs Orthologer to infer orthologous groups across the selected species.
Moulting Gene Identification
Detects orthologs of known Drosophila melanogaster moulting genes, based on pathways curated by Giulia Campli (PMID: 39039636).
Domain Annotation
Annotates each filtered proteome using InterProScan, retaining high-confidence protein domains.
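
As an illustration of the isoform-filtering step above, here is a minimal sketch of keeping the longest protein per gene. It is not the pipeline's actual implementation: the function names `parse_fasta` and `longest_isoform_per_gene`, and the assumption that a protein-to-gene mapping is available as a dictionary, are hypothetical and chosen only for this example.

```python
from typing import Dict, Iterable, Iterator, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (protein_id, sequence) pairs from FASTA-formatted lines."""
    protein_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if protein_id is not None:
                yield protein_id, "".join(chunks)
            # The ID is taken as the first whitespace-delimited header token.
            protein_id, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if protein_id is not None:
        yield protein_id, "".join(chunks)


def longest_isoform_per_gene(
    records: Iterable[Tuple[str, str]],
    protein_to_gene: Dict[str, str],
) -> Dict[str, Tuple[str, str]]:
    """Return {gene_id: (protein_id, sequence)} keeping the longest protein.

    Assumption: proteins missing from the mapping are skipped, and ties
    are resolved in favour of the first protein encountered.
    """
    best: Dict[str, Tuple[str, str]] = {}
    for protein_id, seq in records:
        gene = protein_to_gene.get(protein_id)
        if gene is None:
            continue
        if gene not in best or len(seq) > len(best[gene][1]):
            best[gene] = (protein_id, seq)
    return best


# Toy input: two isoforms of geneX and a single protein for geneY.
example_fasta = """>P1 isoform A
MKT
>P2 isoform B
MKTAYIAK
>P3
MSS
""".splitlines()
example_map = {"P1": "geneX", "P2": "geneX", "P3": "geneY"}

kept = longest_isoform_per_gene(parse_fasta(example_fasta), example_map)
for gene, (protein_id, seq) in kept.items():
    print(gene, protein_id, len(seq))
```

In practice the protein-to-gene mapping would come from the assembly's annotation, and how ties or unmapped proteins are handled is a design choice the workflow itself defines.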

Requirements

Snakemake
Conda (for environment management)
Access to compute resources (recommended for InterProScan)

Usage

Clone the repository:

git clone https://github.com/yourusername/MoultDB_genomics.git
cd MoultDB_genomics

Edit the config/config.yaml file to define input paths and parameters.

Run the pipeline:

snakemake --use-conda --cores 16

Outputs

a3cat_filtered.tsv: Genome information derived from the A3cat table after selecting the highest-quality genome assemblies, downsampling overrepresented orders, and keeping only species whose proteomes were downloaded successfully. Only the required columns are retained.
domains/"assembly_number"_"assembly_name"_filt_domains.tsv: Protein domain information for each genome/proteome. Only true positive InterPro domains are retained.
metadata_table/"assembly_number"_"assembly_name"_table.tsv: Metadata for all genes in each genome/proteome. The 'origin' column indicates whether the record was sourced from GenBank or RefSeq. GenBank entries have no gene ID or gene name, so the 'locus_tag' is used instead. GenBank entries also have no 'transcript_id', so a URL is used for access instead.
prot_moult_pathways.tsv: Details on each identified moulting protein, including its pathway, corresponding gene name, function, controlled-vocabulary identifiers, and references. Giulia still needs to complete this description.
orthogroups.tsv: Orthogroup information for all available proteins.
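
When reading the metadata tables, the GenBank/RefSeq difference described above means the gene identifier has to be chosen per row. The sketch below shows one way to do that. The column names 'origin' and 'locus_tag' come from the description above; the 'gene_id' column name, the literal value "GenBank" in 'origin', and the helper `gene_identifier` are assumptions made for illustration and should be checked against the real table header.

```python
import csv
import io


def gene_identifier(row: dict) -> str:
    """Pick a usable gene identifier for one metadata row.

    GenBank rows have no gene ID, so the locus tag is used. For other
    rows the gene ID is preferred, falling back to the locus tag if the
    gene ID is empty (assumed behaviour, not taken from the pipeline).
    """
    if row["origin"].strip().lower() == "genbank":
        return row["locus_tag"]
    return row["gene_id"] or row["locus_tag"]


# Toy table with one RefSeq row and one GenBank row (hypothetical values).
example_tsv = (
    "origin\tgene_id\tlocus_tag\n"
    "RefSeq\t12345\tLOC_A\n"
    "GenBank\t\tTAG_B\n"
)

reader = csv.DictReader(io.StringIO(example_tsv), delimiter="\t")
ids = [gene_identifier(row) for row in reader]
print(ids)
```

To use this on a real file, replace `io.StringIO(example_tsv)` with an open file handle for one of the tables in metadata_table/.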
