Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Scheme creation/tree visualization #58

Closed

Conversation

gcttong
Copy link

@gcttong gcttong commented Aug 8, 2018

Added tree visualization feature, which displays subclade information on the phylogenetic tree

gcttong added 30 commits August 8, 2018 11:24
@gcttong gcttong requested a review from peterk87 August 9, 2018 14:39
Copy link
Contributor

@peterk87 peterk87 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Gary,

I'm not sure where you've added the phylogenetic tree visualization. Please see the ete3 tree drawing docs for more information on how to visualize trees with metadata like overlaying the subgroup information onto the tree like in this example.

Please find my comments below.

Thanks

from typing import Dict


def display_tree(phylo_tree_path: str, groups_dict: Dict[str, str]) -> str:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't look like this function "displays" a tree, but instead reads the contents of a (I'm guessing) Newick format file and dangerously replaces genome names with some new name, then parses the string into a ete3 Tree object, prints the Tree to the log (what if the log is set to WARNING or higher?), and returns the string contents of some input file.

Doing this is dangerous because you could be replacing more than the expected genome name

        for genome, group in groups_dict.items():
            new_name = f"{genome}-{group}"
            new_tree = new_tree.replace(genome, new_name)

e.g. what if a genome is named 0 or a or one genome is named SRR123 and another is named SRR1234?

You're already using ete3 to create a Tree object so why not use it to replace leaf names if that's what you want to do. However, I would recommend against renaming the genome names and instead output a tree visualization image (SVG/PNG) where you highlight the subgroups on the tree like in http://etetoolkit.org/docs/latest/tutorial/tutorial_drawing.html#node-backgrounds

The ete3 tree drawing docs have plenty of information on how to visualize trees and metadata in interesting and useful ways.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes this has been addressed in commit b9b8857

biohansel/cli.py Outdated
@@ -276,6 +312,37 @@ def create(vcf_file_path, reference_genome_path, phylo_tree_path, distance_thres
click.secho(f'Reference genome file path: {reference_genome_path}', fg='red')
click.secho(f'Phylogenetic tree file path: {phylo_tree_path}', fg='yellow')
click.secho(f'Distance thresholds: {distance_thresholds}', fg='blue')
click.secho(f'Output folder name: {output_folder_name}', fg='magenta')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove or replace all click.secho calls in this function with logging.info calls instead.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes this has been fixed in 181b812

biohansel/cli.py Outdated
reference_genome_name = os.path.split(reference_genome_path)[-1]
reference_genome_name = reference_genome_name.split(".")[-2]

if schema_version is None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please set a default in the @click.option('-m', '--schema-version', ... decorator instead of setting it here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes this has been fixed in 181b812

biohansel/cli.py Outdated
logging.info(f'Creating biohansel subtyping scheme from SNVs in "{vcf_file_path}" using reference genome '
f'"{reference_genome_path}" at {distance_thresholds if distance_thresholds else "all possible"} '
f'distance threshold levels.')
reference_genome_name = os.path.split(reference_genome_path)[-1]
reference_genome_name = reference_genome_name.split(".")[-2]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes this has been fixed in 181b812

biohansel/cli.py Outdated
@click.option('-p', '--padding-sequence-length',
required=True,
type=int,
help='Output folder name in which schema file would be located'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This help information does not seem to correspond to this option. Isn't padding_sequence_length the length of sequence to extract around each SNP?

Also can you change this to --tile-length? Where math.ceil((tile_length -1) / 2) would be the length of sequence to get around each SNV with a warning to the user if they provided an even integer value that instead of tiles being length n, they will be length n+1.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes this has been addressed in commit 0175d92

from scipy.spatial.distance import pdist


def find_clusters(df: pd.DataFrame, min_group_size: int) -> Dict[str, int]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about user specified distance threshold levels? If None then you can use the unique distances from the distance matrix. This function should be returning all the intermediate outputs as well. Please look into creating an attrs class to store this information like the Subtype class.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes this has been addressed in 94b2c02

"""

return sp.spatial.distance.pdist(
df.transpose(), metric='hamming')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pairwise distance metric could be specified via command-line with hamming as the default. Users might want to compute matching or euclidean distances. See http://click.pocoo.org/6/options/#choice-options for implementing choice options.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok yes this has been addressed in d141a97

df.transpose(), metric='hamming')


def create_linkage_array(distance_matrix: np.ndarray) -> np.ndarray:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make the linkage method up to the user via the command-line with complete as the default. See http://click.pocoo.org/6/options/#choice-options

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok yes this has been addressed in d141a97

biohansel/cli.py Outdated
os.makedirs(output_folder_name)

sequence_df, binary_df = parse_vcf(vcf_file_path)
groups_dict = find_clusters(binary_df, min_group_size)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about outputting all the intermediate results to the output directory? Those files may be useful to the user so they can see how the clusters were determined.

It would also be useful to create clusters at various group size thresholds. We could take a range of values from the command-line, e.g. . --min-group-size 2-10 and within the output directory you could have subdirectories containing the schemes created at each distinct min_group_size

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes this has been addressed in d141a97

biohansel/cli.py Outdated
sequence_df, binary_df = parse_vcf(vcf_file_path)
groups_dict = find_clusters(binary_df, min_group_size)
if phylo_tree_path is not None:
new_tree=display_tree(phylo_tree_path, groups_dict)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about the return value from display_tree?

biohansel/cli.py Outdated
@@ -159,7 +163,7 @@ def subtype(scheme,
input_contigs, input_reads = collect_inputs(**locals())
if len(input_contigs) == 0 and len(input_reads) == 0:
no_files_exception = click.UsageError('No input files specified!')
click.secho('Please see -h/--help for more info', err=True)
logging.info('Please see -h/--help for more info', err=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please revert this change. If there are issues with the code related to the subtype command, let me know in #52

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes this has been addressed in d141a97

biohansel/cli.py Outdated
help='Reference genome file format: can be either fasta or genbank format'
)
@click.option('-g', '--min-group-size',
type=click.Choice(['2', '3', '4', '5', '6', '7', '8', '9', '10']),
Copy link
Contributor

@peterk87 peterk87 Aug 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see my comment about making this option accepting a range of integer values.

Let's also set a default for this value to reduce the amount of things users need to specify via the command-line and that make sense like integers between 2 and 10 (or half of number of input samples) inclusive.

e.g. biohansel create ... -g 2-50 -> creates schemes with min group size from 2 to 50 inclusive.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok yes this has been addressed in d141a97

@glabbe
Copy link
Collaborator

glabbe commented Nov 4, 2020

We are working on a separate repo for the scheme creation tool: BioCanon https://github.com/phac-nml/bioCanon

@glabbe glabbe closed this Nov 4, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants