-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Scheme creation/tree visualization #58
Scheme creation/tree visualization #58
Conversation
… into scheme-creation/initial
… into scheme-creation/newChanges
…el into scheme-creation/tree_visualization
… into scheme-creation/tree_visualization
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Gary,
I'm not sure where you've added the phylogenetic tree visualization. Please see the ete3
tree drawing docs for more information on how to visualize trees with metadata like overlaying the subgroup information onto the tree like in this example.
Please find my comments below.
Thanks
biohansel/create/display_tree.py
Outdated
from typing import Dict | ||
|
||
|
||
def display_tree(phylo_tree_path: str, groups_dict: Dict[str, str]) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't look like this function "displays" a tree, but instead reads the contents of a (I'm guessing) Newick format file and dangerously replaces genome names with some new name, then parses the string into a ete3 Tree
object, prints the Tree to the log (what if the log is set to WARNING
or higher?), and returns the string contents of some input file.
Doing this is dangerous because you could be replacing more than the expected genome name
for genome, group in groups_dict.items():
new_name = f"{genome}-{group}"
new_tree = new_tree.replace(genome, new_name)
e.g. what if a genome is named 0
or a
or one genome is named SRR123
and another is named SRR1234
?
You're already using ete3
to create a Tree
object so why not use it to replace leaf names if that's what you want to do. However, I would recommend against renaming the genome names and instead output a tree visualization image (SVG/PNG) where you highlight the subgroups on the tree like in http://etetoolkit.org/docs/latest/tutorial/tutorial_drawing.html#node-backgrounds
The ete3
tree drawing docs have plenty of information on how to visualize trees and metadata in interesting and useful ways.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes this has been addressed in commit b9b8857
biohansel/cli.py
Outdated
@@ -276,6 +312,37 @@ def create(vcf_file_path, reference_genome_path, phylo_tree_path, distance_thres | |||
click.secho(f'Reference genome file path: {reference_genome_path}', fg='red') | |||
click.secho(f'Phylogenetic tree file path: {phylo_tree_path}', fg='yellow') | |||
click.secho(f'Distance thresholds: {distance_thresholds}', fg='blue') | |||
click.secho(f'Output folder name: {output_folder_name}', fg='magenta') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove or replace all click.secho
calls in this function with logging.info
calls instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes this has been fixed in 181b812
biohansel/cli.py
Outdated
reference_genome_name = os.path.split(reference_genome_path)[-1] | ||
reference_genome_name = reference_genome_name.split(".")[-2] | ||
|
||
if schema_version is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please set a default in the @click.option('-m', '--schema-version', ...
decorator instead of setting it here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes this has been fixed in 181b812
biohansel/cli.py
Outdated
logging.info(f'Creating biohansel subtyping scheme from SNVs in "{vcf_file_path}" using reference genome ' | ||
f'"{reference_genome_path}" at {distance_thresholds if distance_thresholds else "all possible"} ' | ||
f'distance threshold levels.') | ||
reference_genome_name = os.path.split(reference_genome_path)[-1] | ||
reference_genome_name = reference_genome_name.split(".")[-2] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use genome_name_from_fasta_path
from https://github.com/phac-nml/biohansel/blob/scheme-creation/_base/biohansel/utils.py#L39
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes this has been fixed in 181b812
biohansel/cli.py
Outdated
@click.option('-p', '--padding-sequence-length', | ||
required=True, | ||
type=int, | ||
help='Output folder name in which schema file would be located' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This help information does not seem to correspond to this option. Isn't padding_sequence_length
the length of sequence to extract around each SNP?
Also can you change this to --tile-length
? Where math.ceil((tile_length -1) / 2)
would be the length of sequence to get around each SNV with a warning to the user if they provided an even integer value that instead of tiles being length n
, they will be length n+1
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes this has been addressed in commit 0175d92
from scipy.spatial.distance import pdist | ||
|
||
|
||
def find_clusters(df: pd.DataFrame, min_group_size: int) -> Dict[str, int]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about user specified distance threshold levels? If None
then you can use the unique distances from the distance matrix. This function should be returning all the intermediate outputs as well. Please look into creating an attrs class to store this information like the Subtype
class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes this has been addressed in 94b2c02
""" | ||
|
||
return sp.spatial.distance.pdist( | ||
df.transpose(), metric='hamming') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The pairwise distance metric could be specified via command-line with hamming
as the default. Users might want to compute matching
or euclidean
distances. See http://click.pocoo.org/6/options/#choice-options for implementing choice options.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok yes this has been addressed in d141a97
df.transpose(), metric='hamming') | ||
|
||
|
||
def create_linkage_array(distance_matrix: np.ndarray) -> np.ndarray: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's make the linkage method up to the user via the command-line with complete
as the default. See http://click.pocoo.org/6/options/#choice-options
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok yes this has been addressed in d141a97
biohansel/cli.py
Outdated
os.makedirs(output_folder_name) | ||
|
||
sequence_df, binary_df = parse_vcf(vcf_file_path) | ||
groups_dict = find_clusters(binary_df, min_group_size) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about outputting all the intermediate results to the output directory? Those files may be useful to the user so they can see how the clusters were determined.
It would also be useful to create clusters at various group size thresholds. We could take a range of values from the command-line, e.g. . --min-group-size 2-10
and within the output directory you could have subdirectories containing the schemes created at each distinct min_group_size
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes this has been addressed in d141a97
biohansel/cli.py
Outdated
sequence_df, binary_df = parse_vcf(vcf_file_path) | ||
groups_dict = find_clusters(binary_df, min_group_size) | ||
if phylo_tree_path is not None: | ||
new_tree=display_tree(phylo_tree_path, groups_dict) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about the return value from display_tree
?
biohansel/cli.py
Outdated
@@ -159,7 +163,7 @@ def subtype(scheme, | |||
input_contigs, input_reads = collect_inputs(**locals()) | |||
if len(input_contigs) == 0 and len(input_reads) == 0: | |||
no_files_exception = click.UsageError('No input files specified!') | |||
click.secho('Please see -h/--help for more info', err=True) | |||
logging.info('Please see -h/--help for more info', err=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please revert this change. If there are issues with the code related to the subtype
command, let me know in #52
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes this has been addressed in d141a97
biohansel/cli.py
Outdated
help='Reference genome file format: can be either fasta or genbank format' | ||
) | ||
@click.option('-g', '--min-group-size', | ||
type=click.Choice(['2', '3', '4', '5', '6', '7', '8', '9', '10']), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please see my comment about making this option accepting a range of integer values.
Let's also set a default for this value to reduce the amount of things users need to specify via the command-line and that make sense like integers between 2 and 10 (or half of number of input samples) inclusive.
e.g. biohansel create ... -g 2-50
-> creates schemes with min group size from 2 to 50 inclusive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok yes this has been addressed in d141a97
We are working on a separate repo for the scheme creation tool: BioCanon https://github.com/phac-nml/bioCanon |
Added tree visualization feature, which displays subclade information on the phylogenetic tree