Identification of genetic similarity and distance between genomic sequences (approximate matching algorithm)
An alignment-free method for calculating genetic distances between DNA sequences, as a basis for similarity estimation and phylogenetic reconstruction. The proposed approximate matching algorithm is effective for measuring homology between mixed sequences and sequences of different lengths; individual chromosomes of the same or different species can also be analysed. Its application is not limited to determining the degree of homology between sequences: it can also be used for phylogenetic analysis, species identification, and sequence classification in genomic assemblies.
Ruslan Kalendar email: [email protected]
Operating system(s): Platform independent
Programming language: Java 25 or higher
How do I set or change the Java path system variable
To install a specific version of OpenJDK with Conda, specify the version number in the installation command and use the conda-forge channel, which provides the latest version.
- Add the conda-forge channel (if not already added) and set channel priority to strict, so that packages are preferentially installed from this channel:
conda config --add channels conda-forge
conda config --set channel_priority strict
- Create a new Conda environment and install the desired OpenJDK version. Creating a dedicated environment helps manage dependencies and avoid conflicts with other projects:
conda create -n java25 openjdk=25
- Activate the new environment:
conda activate java25
- Check that Java is installed. The output should show the installed Java version:
java -version
The program generates an output file that can be analysed in MEGA 12: https://www.megasoftware.net/
The program is run from the command line (CLI). Command-line options are separated by spaces.
The executable file GeneDistance.jar is in the dist directory and can be copied to any location.
Go to the folder containing GeneDistance.jar and type the following; either an individual file or a folder can be specified:
java -jar GeneDistance.jar <target_file_path/Folder_path>
java -jar <GeneDistancePath>\dist\GeneDistance.jar <target_file_path> optional_commands
java -jar C:\GeneDistance\dist\GeneDistance.jar C:\GeneDistance\test\t1.txt
java -jar C:\GeneDistance\dist\GeneDistance.jar E:\Genomes\Chloroplast\ -kmer=6
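The -kmer option in the example above presumably sets the k-mer (word) length. The program's own approximate matching algorithm is not reproduced here. As a minimal illustration of what an alignment-free comparison can look like, the sketch below counts overlapping k-mers in two sequences and computes a Bray-Curtis dissimilarity between the two count profiles. The class name KmerProfileSketch, its methods, and the choice of distance are assumptions made for illustration only, not the method implemented in GeneDistance.jar.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration; not the algorithm used by GeneDistance.jar.
public class KmerProfileSketch {

    // Count every overlapping k-mer made only of A, C, G, T (case-insensitive).
    public static Map<String, Integer> countKmers(String seq, int k) {
        Map<String, Integer> counts = new HashMap<>();
        String s = seq.toUpperCase();
        for (int i = 0; i + k <= s.length(); i++) {
            String kmer = s.substring(i, i + k);
            if (kmer.matches("[ACGT]+")) {   // skip windows containing N or other symbols
                counts.merge(kmer, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Bray-Curtis dissimilarity: 0.0 for identical profiles, 1.0 when no k-mer is shared.
    public static double distance(Map<String, Integer> a, Map<String, Integer> b) {
        long shared = 0;
        long total = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            shared += Math.min(e.getValue(), b.getOrDefault(e.getKey(), 0));
            total += e.getValue();
        }
        for (int c : b.values()) {
            total += c;
        }
        return total == 0 ? 0.0 : 1.0 - (2.0 * shared) / total;
    }

    public static void main(String[] args) {
        Map<String, Integer> p1 = countKmers("ACGTACGTAC", 3);
        Map<String, Integer> p2 = countKmers("ACGTTCGTAC", 3);
        System.out.println(distance(p1, p2));
    }
}
```

Because only k-mer counts are compared, the two sequences do not need to have the same length or the same arrangement, which is the general appeal of alignment-free measures.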
Large genomes (the JVM must be told to use more RAM, for example up to 64 GB of memory with -Xms16g -Xmx64g):
java -jar -Xms16g -Xmx64g C:\GeneDistance\dist\GeneDistance.jar E:\Genomes\T2T-CHM13v2.0\ -kmer=6
For chromosomes larger than 500 Mb, more memory is needed, for example up to 128 GB:
java -jar -Xms32g -Xmx128g C:\GeneDistance\dist\GeneDistance.jar E:\Genomes\Cycas_panzhihuaensis\ -kmer=8
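Whether a heap setting such as -Xmx64g has actually been applied can be checked from inside a JVM. The short sketch below is not part of GeneDistance, and the class name HeapCheck is hypothetical. It relies only on the standard Runtime.maxMemory() call, which reports the maximum amount of memory the JVM will attempt to use.

```java
// Hypothetical helper; not part of GeneDistance.
public class HeapCheck {

    // Maximum heap available to this JVM, in whole GiB (rounded down).
    public static long maxHeapGiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L * 1024L);
    }

    public static void main(String[] args) {
        long bytes = Runtime.getRuntime().maxMemory();
        System.out.printf("Max heap: %d bytes (about %d GiB)%n", bytes, maxHeapGiB());
    }
}
```

With Java 11 or later, the file can be run directly from source, for example with java -Xmx64g HeapCheck.java. The reported value should be close to the requested limit, though it can be slightly lower depending on the garbage collector in use.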
Sequence data files are prepared in a text editor and saved as plain ASCII text, with a .txt or .fasta extension or with no extension (a file extension is not required). The program accepts a single sequence or multiple DNA sequences in FASTA format. The sequence length is not limited.
A sequence in FASTA format consists of the following:
- One line starting with a ">" sign followed by a sequence identification code. A textual description of the sequence may optionally follow it; since it is not part of the official format description, software can ignore it when it is present.
- One or more lines containing the sequence itself.
A file in FASTA format may comprise more than one sequence.
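The FASTA layout described above can be read with a few lines of Java. This is a minimal sketch, not the parser used by GeneDistance; the class name FastaSketch and the method parse are hypothetical. It keeps only the identification code from each header line (the text up to the first whitespace), ignores the optional description, and joins the sequence lines of each record.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration; not the parser used by GeneDistance.
public class FastaSketch {

    // Returns sequences keyed by identification code, in file order.
    // Lines that appear before the first ">" header are ignored in this sketch.
    public static Map<String, String> parse(Reader in) {
        Map<String, StringBuilder> buffers = new LinkedHashMap<>();
        StringBuilder current = null;
        try (BufferedReader reader = new BufferedReader(in)) {
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty()) {
                    continue;                      // skip blank lines
                }
                if (line.startsWith(">")) {
                    // Header: keep the identifier, drop the optional description.
                    String id = line.substring(1).trim().split("\\s+")[0];
                    current = new StringBuilder();
                    buffers.put(id, current);
                } else if (current != null) {
                    current.append(line);          // sequence may span several lines
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Map<String, String> result = new LinkedHashMap<>();
        buffers.forEach((id, sb) -> result.put(id, sb.toString()));
        return result;
    }

    public static void main(String[] args) {
        String demo = ">seq1 first example\nACGT\nACGT\n>seq2\nTTGA\n";
        System.out.println(parse(new StringReader(demo)));
        // prints {seq1=ACGTACGT, seq2=TTGA}
    }
}
```

For a file on disk, a java.io.FileReader (or Files.newBufferedReader) can be passed instead of the StringReader. Records that share an identifier overwrite one another in this sketch, so real input would need a policy for duplicates.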