Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Scheme creation/tree visualization #58

Closed
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
72 commits
Select commit Hold shift + click to select a range
ae196e3
initial commit of scripts directory
gcttong May 31, 2018
693a417
modified ability to change filter out 2-state SNPS in vcf
gcttong Jun 6, 2018
0970ccf
added the split-genome functionality
gcttong Jun 6, 2018
ac9a58d
preliminary files for the pipeline
gcttong Jun 12, 2018
24cabfe
made initial changes on the command-line tool
gcttong Jun 14, 2018
a357f62
made initial changes on the command-line tool
gcttong Jun 14, 2018
e7ff894
made adjustments to the pipeline and allowed user to input files
gcttong Jun 18, 2018
14d3fd5
collecting sequences from local genbank file
gcttong Jun 19, 2018
a80499f
handled exception for file downloads
gcttong Jun 21, 2018
b24bf2e
added docstrings and modified the function structures
gcttong Jun 22, 2018
842a648
finished adding docstrings to files
gcttong Jun 22, 2018
e8a0f9e
added hiearchical clustering functionality
gcttong Jun 26, 2018
5e61271
made changes to main file
gcttong Jun 26, 2018
aa6b407
completed pipeline
gcttong Jun 26, 2018
a3532da
completed checks for pipelin
gcttong Jun 26, 2018
578aa4c
removed files
gcttong Jun 26, 2018
0a048e0
removed comments
gcttong Jun 26, 2018
1412a5e
smaller changes
gcttong Jun 26, 2018
dc1463b
addressed earlier commits
gcttong Jul 10, 2018
6d0d7aa
fixed the docstring for extract_test_columns
gcttong Jul 10, 2018
ce09e25
fixed the log debug info for reference groups
gcttong Jul 10, 2018
c292223
fixed the log debug info for reference groups
gcttong Jul 10, 2018
d4cd0c9
fixed the log debug info for reference groups
gcttong Jul 10, 2018
46ee176
fixed the import structure
gcttong Jul 11, 2018
674c46a
added docstrings and separated the findcluster function into smaller …
gcttong Jul 11, 2018
0e59bdc
fixed the thresholds so that the array takes distances between 0.0 an…
gcttong Jul 11, 2018
f81fe90
added the needed command-line arguments
gcttong Jul 13, 2018
1cf1fa7
merged development branch
gcttong Jul 13, 2018
40b47cb
changed the gitignore file to take out pytest_cache and Jupyter noteb…
gcttong Jul 13, 2018
c5d5586
changed the gitignore file
gcttong Jul 13, 2018
32aa088
changed gitignore files:
gcttong Jul 13, 2018
10d4357
changed the gitignore file
gcttong Jul 13, 2018
375f30e
deleted pytest_cache
gcttong Jul 13, 2018
83aa97c
changed the get-sequence name to write-sequence
gcttong Jul 13, 2018
aee9496
script no longer extracts test columns and modified scheme sequence o…
gcttong Jul 16, 2018
395792f
made some formatting changes
gcttong Jul 16, 2018
4d1d9c5
fixed the docstring in the write_sequence file
gcttong Jul 16, 2018
275637d
separated the write_sequence function into two distinct functions
gcttong Jul 17, 2018
d4ff06c
modifed read_vcf that splits up the two dataframes into binary data a…
gcttong Jul 18, 2018
e8ec3cf
changed the get-sequences return type
gcttong Jul 18, 2018
b00437e
fixed the way in which the file path is taken from the reference_geno…
gcttong Jul 18, 2018
957fe54
made the variable names in get_sequence more descriptive
gcttong Jul 18, 2018
a386dee
fixed the indentation problem for the tiles file output
gcttong Jul 18, 2018
ce3335a
changed the name of attribute_value to ratio_value in write_sequence
gcttong Jul 18, 2018
f68f6e9
allows for multiple records to be part of the reference genome file
gcttong Jul 19, 2018
b012ece
allowed for better querying of sequence file by using a dictionary
gcttong Jul 20, 2018
bad3a86
write_sequence now returns a string for the schema file output
gcttong Jul 20, 2018
1a71e15
first initial commit
gcttong Jul 20, 2018
90a26af
fixed return value from compute_distance_matrix
gcttong Jul 20, 2018
e98d7b0
allows for other file types in addition to genbank files
gcttong Jul 20, 2018
10c328c
Merge branch 'scheme-creation/_base' of github.com:phac-nml/biohansel…
gcttong Aug 8, 2018
dc053ca
Merge branch 'scheme-creation/_base' of github.com:phac-nml/biohansel…
gcttong Aug 8, 2018
a24132d
added tests and fixed formatting for files
gcttong Aug 8, 2018
8758254
Merge branch 'scheme-creation/initial' of github.com:phac-nml/biohans…
gcttong Aug 8, 2018
80e2bb2
added description for display_tree
gcttong Aug 8, 2018
507a105
removed logging messages
gcttong Aug 8, 2018
e96e954
changed the spacing in cli.py
gcttong Aug 9, 2018
bfa947b
Merge branch 'scheme-creation/_base' of github.com:phac-nml/biohansel…
gcttong Aug 9, 2018
9b70aa9
changed file names
gcttong Aug 10, 2018
0175d92
changed name of variable to tile-length and gives options of file for…
gcttong Aug 13, 2018
181b812
changed command line inputs and added default values
gcttong Aug 13, 2018
ad0706f
created a cluster class and added range of integer values as an option
gcttong Aug 15, 2018
d141a97
took out logging messages
gcttong Aug 15, 2018
b9b8857
modified the output format for the phylogenetic tree
gcttong Aug 16, 2018
46cbc1f
midified the string formatting in scheme creation
gcttong Aug 21, 2018
b554b06
allowed user to specify file format and program also parses file form…
gcttong Aug 30, 2018
b347944
changed fasta sequence file function to generator
gcttong Sep 4, 2018
292749a
added a funciton that extract snvs from ingroup and outgroup
gcttong Sep 5, 2018
784a18e
added docstring to new function
gcttong Sep 5, 2018
99966c0
modified function in cluster_generator
gcttong Sep 5, 2018
b5b061a
changed expand sets function in cluster generator to have dict compre…
gcttong Sep 5, 2018
94b2c02
refactored the output_flat_clusters function in cluster_generator files
gcttong Sep 6, 2018
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 8 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -145,3 +145,11 @@ match_results.tab
results.tab
test.tab

# VS Code
.vscode

# pytest_cache
.pytest_cache/*
.pytest_cache


104 changes: 90 additions & 14 deletions biohansel/cli.py
Original file line number Diff line number Diff line change
@@ -1,18 +1,22 @@
import logging
import os
from typing import Union, Optional, Tuple, List

import attr
import click
import pandas as pd

from biohansel.create.display_tree import display_tree
from biohansel.create.cluster_generator import find_clusters
from biohansel.create.io.output import write_sequence_file, generate_tile_fasta
from biohansel.create.io.parsers import parse_vcf, parse_sequence_file
from biohansel.create.schema_generator import get_sequences, group_snvs
from biohansel.subtype import subtype_contigs_samples, subtype_reads_samples, Subtype
from biohansel.subtype.const import SUBTYPE_SUMMARY_COLS, JSON_EXT_TMPL
from biohansel.subtype.metadata import read_metadata_table, merge_metadata_with_summary_results
from biohansel.subtype.subtype_stats import subtype_counts
from biohansel.subtype.util import get_scheme_fasta, init_subtyping_params
from biohansel.utils import does_file_exist, collect_inputs
from biohansel.utils import init_console_logger

from biohansel.utils import does_file_exist, collect_inputs, genome_name_from_fasta_path, init_console_logger
CONTEXT_SETTINGS = dict(help_option_names=['-h', '--help'])


Expand Down Expand Up @@ -247,35 +251,107 @@ def parse_comma_delimited_floats(ctx: click.Context, param: click.Option, value:


@cli.command()
@click.option('--vcf-file-path',
@click.option('-v', '--vcf-file-path',
required=True,
type=click.Path(exists=True, dir_okay=False),
help='Variant calling file (VCF) of a collection of input genomes for population of interest against a '
'reference genome that must be specified with --reference-genome-path')
@click.option('--reference-genome-path',
@click.option('-r', '--reference-genome-path',
required=True,
type=click.Path(exists=True, dir_okay=False),
help='Reference genome assembly file path. The reference used in the creation of the input VCF file.')
@click.option('--phylo-tree-path',
required=False,
type=click.Path(exists=True, dir_okay=False),
help='Optional phylogenetic tree created from variant calling analysis.')
@click.option('--distance-thresholds',
@click.option('-d', '--distance-thresholds',
required=False,
type=str,
callback=parse_comma_delimited_floats,
help='Comma delimited list of distance thresholds for creating hierarchical clustering groups '
'(e.g. "0,0.05,0.1,0.15")')
def create(vcf_file_path, reference_genome_path, phylo_tree_path, distance_thresholds):
@click.option('-o', '--output-folder-name',
required=True,
type=click.Path(exists=False, dir_okay=True),
help='Output folder name in which schema file would be located'
)
@click.option('-s', '--schema-name',
required=False,
default="biohansel-schema",
type=str,
help='A unique name for the schema file that is generated, the default is just'
'{bio_hansel-schema}-reference_genome_name}-{schema_version}')
@click.option('-m', '--schema-version',
required=False,
default="0.1.0",
type=str,
help='An optional version number for the schema file that is generated')
@click.option('-t', '--tile-length',
required=True,
type=int,
help='Length of sequence to be added around each SNV in the schema file,'
'if an even integer is provided, then the tile length would be n+1'
)
@click.option('-f', '--reference-genome-format',
required=False,
default=None,
type=click.Choice(['fasta', 'genbank']),
help='Reference genome file format: can be either fasta or genbank format'
)
@click.option('-g', '--group-size-range',
type=(int, int),
default=(2,10),
help='The range of child group size for each new subtype branching point from the parent group'
)
@click.option('-p', '--pairwise-distance-metric',
type=click.Choice(['hamming', 'euclidean', 'minkowski', 'cityblock', 'cosine', 'sqeuclidean', 'correlation', 'jaccard', 'chebyshev', 'braycurtis']),
default='hamming',
help='The distance metric used to calculate pairwise distances between SNVs'
)
@click.option('-l', '--linkage-method',
type=click.Choice(['single', 'complete', 'average', 'weighted', 'centroid', 'median', 'ward']),
default='complete',
help='The linkage method used to perform hierarchical clustering on SNVs'
)

def create(vcf_file_path, reference_genome_path, phylo_tree_path, distance_thresholds, output_folder_name, schema_name,
reference_genome_format, tile_length, group_size_range, schema_version, pairwise_distance_metric, linkage_method):
"""Create a biohansel subtyping scheme.

From the results of a variant calling analysis, create a biohansel subtyping with single nucleotide variants (SNV)
that discriminate subpopulations of genomes from all other genomes.
"""
click.secho(f'VCF file path: {vcf_file_path}', fg='green')
click.secho(f'Reference genome file path: {reference_genome_path}', fg='red')
click.secho(f'Phylogenetic tree file path: {phylo_tree_path}', fg='yellow')
click.secho(f'Distance thresholds: {distance_thresholds}', fg='blue')
logging.info(f'Creating biohansel subtyping scheme from SNVs in "{vcf_file_path}" using reference genome '
f'"{reference_genome_path}" at {distance_thresholds if distance_thresholds else "all possible"} '
f'distance threshold levels.')
logging.info(f'VCF file path: {vcf_file_path}')
logging.info(f'Reference genome file path: {reference_genome_path}')
logging.info(f'Phylogenetic tree file path: {phylo_tree_path}')
logging.info(f'Distance thresholds: {distance_thresholds}')
logging.info(f'Output folder name: {output_folder_name}')
logging.info(f'Scheme name: {schema_name}')
logging.info(f'Reference genome format: {reference_genome_format}')
logging.info(f'Group size range: {group_size_range}')
logging.info(f'Padding Sequence length: {tile_length}')
logging.info(f'Creating biohansel subtyping scheme from SNVs in "{vcf_file_path}" using reference genome ')
logging.info(f'Pairwise distance metric: {pairwise_distance_metric}')
logging.info(f'Linkage method to be used: {linkage_method}')
reference_genome_name = genome_name_from_fasta_path(reference_genome_path)
schema_name = f"{schema_name}-{reference_genome_name}-{schema_version}"

if not os.path.exists(output_folder_name):
os.makedirs(output_folder_name)


sequence_df, binary_df = parse_vcf(vcf_file_path)
logging.info(type(group_size_range))
clusters = find_clusters(binary_df, group_size_range, distance_thresholds, pairwise_distance_metric, linkage_method)

if phylo_tree_path is not None:
phylo_tree_string=display_tree(phylo_tree_path, clusters.hierarchical_clusters, output_folder_name)
#sequence_records: Dict[str, Seq.Seq]
sequence_records = parse_sequence_file(reference_genome_path, reference_genome_format)
results_dict = group_snvs(binary_df, sequence_df, clusters.hierarchical_clusters)
for group, curr_df in results_dict.items():
df_list = get_sequences(curr_df, tile_length,
sequence_records)
write_sequence_file(output_folder_name, schema_name, generate_tile_fasta(df_list, group))
output_schema_path = os.path.join(output_folder_name, f"{schema_name}.fasta")
logging.info(f"Finished writing schema file to {output_schema_path}")
10 changes: 10 additions & 0 deletions biohansel/create/cluster.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
import attr


@attr.s
class Cluster(object):
distance_matrix = attr.ib(default=None)
clustering_array = attr.ib(default=None)
hierarchical_clusters= attr.ib(default=None)


Loading