Scheme creation/tree visualization #58

gcttong · 2018-08-08T21:27:54Z

Added a tree visualization feature that displays subclade information on the phylogenetic tree.

peterk87

Hi Gary,

I'm not sure where you've added the phylogenetic tree visualization. Please see the ete3 tree drawing docs for more information on how to visualize trees with metadata, such as overlaying the subgroup information onto the tree as in this example.

Please find my comments below.

Thanks

peterk87 · 2018-08-10T16:00:52Z

biohansel/create/display_tree.py

+from typing import Dict
+
+
+def display_tree(phylo_tree_path: str, groups_dict: Dict[str, str]) -> str:


It doesn't look like this function "displays" a tree. Instead, it reads the contents of (I'm guessing) a Newick format file, dangerously replaces genome names with new names, parses the string into an ete3 Tree object, prints the Tree to the log (what if the log level is set to WARNING or higher?), and returns the string contents of some input file.

Doing this is dangerous because you could be replacing more than the expected genome name:

    for genome, group in groups_dict.items():
        new_name = f"{genome}-{group}"
        new_tree = new_tree.replace(genome, new_name)

e.g. what if a genome is named 0 or a, or one genome is named SRR123 and another is named SRR1234?

You're already using ete3 to create a Tree object, so why not use it to replace leaf names if that's what you want to do? However, I would recommend against renaming the genomes and would instead output a tree visualization image (SVG/PNG) where you highlight the subgroups on the tree like in http://etetoolkit.org/docs/latest/tutorial/tutorial_drawing.html#node-backgrounds

The ete3 tree drawing docs have plenty of information on how to visualize trees and metadata in interesting and useful ways.

yes this has been addressed in commit b9b8857
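
The naming hazard raised in this thread can be reproduced with plain strings. The sketch below is only an illustration: the Newick string, the group labels, and both function names are made up for the example, and the regex variant is a stand-in that handles simple unquoted labels only. It is not the ete3-based approach recommended above.

```python
import re

# Toy Newick tree and genome -> group mapping (both invented for this example).
NEWICK = "((SRR123:0.1,SRR1234:0.2):0.05,a:0.3);"
GROUPS = {"SRR123": "1", "SRR1234": "2", "a": "3"}


def naive_rename(tree: str, groups: dict) -> str:
    """Same pattern as the code under review: blind substring replacement."""
    for genome, group in groups.items():
        tree = tree.replace(genome, f"{genome}-{group}")
    return tree


def label_aware_rename(tree: str, groups: dict) -> str:
    """Rename only whole leaf labels.

    Assumes simple, unquoted labels: a leaf label follows "(" or ","
    and is followed by ":", "," or ")".
    """
    leaf_label = re.compile(r"(?<=[(,])([^():,;]+)(?=[:,)])")

    def rename(match):
        name = match.group(1)
        return f"{name}-{groups[name]}" if name in groups else name

    return leaf_label.sub(rename, tree)


print(naive_rename(NEWICK, GROUPS))
# ((SRR123-1:0.1,SRR123-14:0.2):0.05,a-3:0.3);   <- SRR1234 was corrupted
print(label_aware_rename(NEWICK, GROUPS))
# ((SRR123-1:0.1,SRR1234-2:0.2):0.05,a-3:0.3);
```

With the naive version, SRR1234 is rewritten as part of the SRR123 pass and never receives its own group. Working on parsed leaf names (for example through the Tree object that is already being built) avoids the problem entirely.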

peterk87 · 2018-08-10T16:02:19Z

biohansel/cli.py

@@ -276,6 +312,37 @@ def create(vcf_file_path, reference_genome_path, phylo_tree_path, distance_thres
    click.secho(f'Reference genome file path: {reference_genome_path}', fg='red')
    click.secho(f'Phylogenetic tree file path: {phylo_tree_path}', fg='yellow')
    click.secho(f'Distance thresholds: {distance_thresholds}', fg='blue')
+    click.secho(f'Output folder name: {output_folder_name}', fg='magenta')


Please remove all click.secho calls in this function, or replace them with logging.info calls.

yes this has been fixed in 181b812
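
For reference, a minimal sketch of what the requested change could look like. The logger name and the report_inputs helper are hypothetical, and the messages simply mirror the ones in the diff above.

```python
import logging

# Hypothetical module-level logger; the real module may configure logging differently.
logger = logging.getLogger("biohansel.create.demo")


def report_inputs(reference_genome_path: str,
                  phylo_tree_path: str,
                  output_folder_name: str) -> None:
    """Log the resolved inputs instead of printing coloured text with click.secho."""
    logger.info('Reference genome file path: %s', reference_genome_path)
    logger.info('Phylogenetic tree file path: %s', phylo_tree_path)
    logger.info('Output folder name: %s', output_folder_name)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    report_inputs("ref.fasta", "tree.nwk", "scheme_out")
```

Unlike click.secho, these messages respect the configured log level, so they disappear when the user asks for WARNING or higher.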

peterk87 · 2018-08-10T16:03:54Z

biohansel/cli.py

+    reference_genome_name = os.path.split(reference_genome_path)[-1]
+    reference_genome_name = reference_genome_name.split(".")[-2]
+
+    if schema_version is None:


Please set a default in the @click.option('-m', '--schema-version', ... decorator instead of setting it here.

yes this has been fixed in 181b812
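
A small sketch of setting the default in the decorator rather than in the function body. The default value '0.1.0' is a placeholder, not the project's actual schema version, and click.echo is used only so the demo output is easy to observe.

```python
import click
from click.testing import CliRunner


@click.command()
@click.option('-m', '--schema-version',
              default='0.1.0',      # placeholder value for illustration only
              show_default=True,
              help='Version string to record in the generated scheme.')
def create(schema_version):
    # No "if schema_version is None" branch is needed: click fills in the default.
    click.echo(f'schema version: {schema_version}')


def run(args):
    """Invoke the command in-process and return its printed output."""
    return CliRunner().invoke(create, args).output.strip()


if __name__ == "__main__":
    print(run([]))
    print(run(['-m', '2.0']))
```

With show_default=True the default also appears in the --help output, so users can see it without reading the code.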

peterk87 · 2018-08-10T16:06:12Z

biohansel/cli.py

    logging.info(f'Creating biohansel subtyping scheme from SNVs in "{vcf_file_path}" using reference genome '
                 f'"{reference_genome_path}" at {distance_thresholds if distance_thresholds else "all possible"} '
                 f'distance threshold levels.')
+    reference_genome_name = os.path.split(reference_genome_path)[-1]
+    reference_genome_name = reference_genome_name.split(".")[-2]


Please use genome_name_from_fasta_path from https://github.com/phac-nml/biohansel/blob/scheme-creation/_base/biohansel/utils.py#L39

yes this has been fixed in 181b812
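
To illustrate why the two split calls in the diff are fragile, here is a comparison with an os.path.splitext based helper. Both function names are made up for this sketch; name_by_splitext is only a stand-in and is not claimed to match the implementation of genome_name_from_fasta_path.

```python
import os


def name_by_split(path: str) -> str:
    """Logic from the diff under review: take the second-to-last dot-separated part."""
    filename = os.path.split(path)[-1]
    return filename.split(".")[-2]


def name_by_splitext(path: str) -> str:
    """Hypothetical stand-in: strip only the final extension."""
    return os.path.splitext(os.path.basename(path))[0]


if __name__ == "__main__":
    for p in ["/data/SRR123.fasta", "/data/GCF_000006945.2.fasta"]:
        print(p, "->", name_by_split(p), "|", name_by_splitext(p))
    # A file name without any extension makes the split-based version fail.
    try:
        name_by_split("/data/genome")
    except IndexError as exc:
        print("split-based version failed:", exc)
```

File names that contain extra dots (such as versioned accessions) lose most of their name with the split-based logic, and names without an extension raise IndexError.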

peterk87 · 2018-08-10T16:17:13Z

biohansel/cli.py

+@click.option('-p', '--padding-sequence-length',
+              required=True,
+              type=int,
+              help='Output folder name in which schema file would be located'


This help information does not seem to correspond to this option. Isn't padding_sequence_length the length of sequence to extract around each SNP?

Also, can you change this to --tile-length? Then math.ceil((tile_length - 1) / 2) would be the length of sequence to get on either side of each SNV, with a warning to the user if they provide an even integer value n that the tiles will be length n+1 instead of n.

yes this has been addressed in commit 0175d92
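
The arithmetic behind the --tile-length suggestion can be sketched as follows. The function names are hypothetical, and the sketch assumes the tile is the SNV position plus the same number of flanking bases on each side.

```python
import math
import warnings


def flank_length(tile_length: int) -> int:
    """Number of bases to take on each side of the SNV for a requested tile length."""
    if tile_length < 1:
        raise ValueError("tile length must be a positive integer")
    if tile_length % 2 == 0:
        # A tile centred on one SNV position always has an odd length.
        warnings.warn(
            f"Even tile length {tile_length} requested; "
            f"tiles will be {tile_length + 1} bases long instead."
        )
    return math.ceil((tile_length - 1) / 2)


def effective_tile_length(tile_length: int) -> int:
    """Actual tile length: flank + SNV + flank."""
    return 2 * flank_length(tile_length) + 1


if __name__ == "__main__":
    print(flank_length(33), effective_tile_length(33))  # 16 33
    print(flank_length(32), effective_tile_length(32))  # 16 33 (with a warning)
```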

peterk87 · 2018-08-10T17:35:16Z

biohansel/create/cluster_generator.py

+from scipy.spatial.distance import pdist
+
+
+def find_clusters(df: pd.DataFrame, min_group_size: int) -> Dict[str, int]:


What about user-specified distance threshold levels? If None, you can use the unique distances from the distance matrix. This function should also return all the intermediate outputs. Please look into creating an attrs class to store this information, like the Subtype class.

yes this has been addressed in 94b2c02
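
A rough sketch of the two ideas in this thread: falling back to the unique observed distances when no thresholds are given, and bundling intermediate results in one object. It uses the standard library's dataclasses instead of attrs purely to stay self-contained, and every name here (ClusterResult, resolve_thresholds, the field names) is invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Sequence


@dataclass
class ClusterResult:
    """Container for intermediate clustering outputs (fields are illustrative)."""
    thresholds: List[float]
    pairwise_distances: List[float]
    # threshold -> {genome name -> cluster id}; filled in by the clustering step
    flat_clusters: Dict[float, Dict[str, int]] = field(default_factory=dict)


def resolve_thresholds(pairwise_distances: Sequence[float],
                       user_thresholds: Optional[Sequence[float]] = None) -> List[float]:
    """Use the user's thresholds if given, else the unique observed distances."""
    source = user_thresholds if user_thresholds else pairwise_distances
    return sorted(set(source))


if __name__ == "__main__":
    distances = [0.5, 0.25, 0.5, 0.75]
    result = ClusterResult(
        thresholds=resolve_thresholds(distances),
        pairwise_distances=distances,
    )
    print(result.thresholds)  # [0.25, 0.5, 0.75]
```

Returning one object like this lets the caller write each intermediate (distances, thresholds, flat clusters) to the output directory without changing the function signature again.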

peterk87 · 2018-08-10T17:35:50Z

biohansel/create/cluster_generator.py

+    """
+
+    return sp.spatial.distance.pdist(
+        df.transpose(), metric='hamming')


The pairwise distance metric could be specified via command-line with hamming as the default. Users might want to compute matching or euclidean distances. See http://click.pocoo.org/6/options/#choice-options for implementing choice options.

ok yes this has been addressed in d141a97
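
A minimal sketch of a choice option with hamming as the default. The option name --distance-metric and the list of choices are assumptions made for this example; the chosen string would then be passed on as the metric argument of pdist.

```python
import click
from click.testing import CliRunner

# Illustrative subset of metrics; the real list is a design decision for the tool.
METRICS = ['hamming', 'euclidean']


@click.command()
@click.option('--distance-metric',
              type=click.Choice(METRICS),
              default='hamming',
              show_default=True,
              help='Pairwise distance metric used for clustering.')
def create(distance_metric):
    # In the real command this value would be forwarded to the distance computation.
    click.echo(distance_metric)


def run(args):
    """Invoke the command in-process; return (exit_code, stripped output)."""
    result = CliRunner().invoke(create, args)
    return result.exit_code, result.output.strip()


if __name__ == "__main__":
    print(run([]))
    print(run(['--distance-metric', 'euclidean']))
    print(run(['--distance-metric', 'cosine'])[0])  # non-zero: rejected by click
```

click validates the value before the function body runs, so an unsupported metric produces a usage error rather than a failure deep inside the clustering code. The same pattern would apply to the linkage method.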

peterk87 · 2018-08-10T17:37:14Z

biohansel/create/cluster_generator.py

+        df.transpose(), metric='hamming')
+
+
+def create_linkage_array(distance_matrix: np.ndarray) -> np.ndarray:


Let's let the user choose the linkage method via the command line, with complete as the default. See http://click.pocoo.org/6/options/#choice-options

ok yes this has been addressed in d141a97

peterk87 · 2018-08-10T17:38:18Z

biohansel/cli.py

+        os.makedirs(output_folder_name)
+
+    sequence_df, binary_df = parse_vcf(vcf_file_path)
+    groups_dict = find_clusters(binary_df, min_group_size)


What about outputting all the intermediate results to the output directory? Those files may be useful to the user so they can see how the clusters were determined.

It would also be useful to create clusters at various group size thresholds. We could take a range of values from the command line, e.g. --min-group-size 2-10, and within the output directory you could have subdirectories containing the schemes created at each distinct min_group_size.

yes this has been addressed in d141a97
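
One possible layout for the per-threshold output, sketched with pathlib. The helper name and the min_group_size_N directory naming are assumptions for illustration, not an agreed convention.

```python
import tempfile
from pathlib import Path
from typing import Dict, Iterable


def make_scheme_dirs(output_dir: str, min_group_sizes: Iterable[int]) -> Dict[int, Path]:
    """Create one subdirectory per min_group_size and return them keyed by size."""
    dirs = {}
    for size in min_group_sizes:
        scheme_dir = Path(output_dir) / f"min_group_size_{size}"
        # exist_ok avoids failing when the command is re-run on the same folder.
        scheme_dir.mkdir(parents=True, exist_ok=True)
        dirs[size] = scheme_dir
    return dirs


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for size, path in make_scheme_dirs(tmp, range(2, 5)).items():
            print(size, path.name)
```

Intermediate files such as the distance matrix could live at the top level of the output directory, since they do not depend on min_group_size, with only the schemes placed in the subdirectories.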

peterk87 · 2018-08-10T17:38:50Z

biohansel/cli.py

+    sequence_df, binary_df = parse_vcf(vcf_file_path)
+    groups_dict = find_clusters(binary_df, min_group_size)
+    if phylo_tree_path is not None:
+        new_tree=display_tree(phylo_tree_path, groups_dict)


What about the return value from display_tree?

peterk87 · 2018-08-14T18:00:28Z

biohansel/cli.py

@@ -159,7 +163,7 @@ def subtype(scheme,
    input_contigs, input_reads = collect_inputs(**locals())
    if len(input_contigs) == 0 and len(input_reads) == 0:
        no_files_exception = click.UsageError('No input files specified!')
-        click.secho('Please see -h/--help for more info', err=True)
+        logging.info('Please see -h/--help for more info', err=True)


Please revert this change. If there are issues with the code related to the subtype command, let me know in #52

yes this has been addressed in d141a97
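
As a side note on the diff in this thread: click.secho accepts err=True to write to stderr, but the standard library's logging methods do not accept that keyword. The sketch below shows what happens once INFO is enabled; the logger and function names are invented for the demo.

```python
import logging

# Dedicated demo logger with INFO enabled so the call is actually processed.
logger = logging.getLogger("subtype.demo")
logger.setLevel(logging.INFO)


def log_hint_with_err_kwarg() -> str:
    """Call logger.info the way the diff does and report what happens."""
    try:
        logger.info('Please see -h/--help for more info', err=True)
    except TypeError as exc:
        return f"TypeError: {exc}"
    return "no error"


if __name__ == "__main__":
    print(log_hint_with_err_kwarg())
```

When the effective level is above INFO the call is skipped before the keyword is inspected, so the problem can go unnoticed until someone raises the verbosity.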

peterk87 · 2018-08-14T18:04:49Z

biohansel/cli.py

+              help='Reference genome file format: can be either fasta or genbank format'
+              )
+@click.option('-g', '--min-group-size',
+              type=click.Choice(['2', '3', '4', '5', '6', '7', '8', '9', '10']),


Please see my comment about making this option accept a range of integer values.

Let's also set a default for this value to reduce the number of things users need to specify on the command line. A sensible default would be the integers between 2 and 10 (or half the number of input samples) inclusive.

e.g. biohansel create ... -g 2-50 -> creates schemes with min group size from 2 to 50 inclusive.

ok yes this has been addressed in d141a97
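
A small sketch of parsing the suggested range syntax. The function name and the "2-10" default are assumptions based on the comment above. The result could be wired into click through a callback or a custom parameter type.

```python
def parse_group_size_range(value: str = "2-10") -> range:
    """Parse "N" or "LOW-HIGH" into an inclusive range of min group sizes."""
    parts = value.strip().split("-")
    try:
        if len(parts) == 1:
            low = high = int(parts[0])
        elif len(parts) == 2:
            low, high = int(parts[0]), int(parts[1])
        else:
            raise ValueError
    except ValueError:
        raise ValueError(f'Expected "N" or "LOW-HIGH", got "{value}"') from None
    if low < 1 or high < low:
        raise ValueError(f'Invalid min group size range "{value}"')
    # range() excludes its stop value, so add 1 to make HIGH inclusive.
    return range(low, high + 1)


if __name__ == "__main__":
    print(list(parse_group_size_range("2-5")))  # [2, 3, 4, 5]
    print(list(parse_group_size_range("3")))    # [3]
    print(list(parse_group_size_range()))       # 2 through 10
```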

glabbe · 2020-11-04T14:12:11Z

We are working on a separate repo for the scheme creation tool: BioCanon https://github.com/phac-nml/bioCanon

gcttong added 30 commits August 8, 2018 11:24

initial commit of scripts directory

ae196e3

modified ability to change filter out 2-state SNPS in vcf

693a417

added the split-genome functionality

0970ccf

preliminary files for the pipeline

ac9a58d

made initial changes on the command-line tool

24cabfe

made initial changes on the command-line tool

a357f62

made adjustments to the pipeline and allowed user to input files

e7ff894

collecting sequences from local genbank file

14d3fd5

handled exception for file downloads

a80499f

added docstrings and modified the function structures

b24bf2e

finished adding docstrings to files

842a648

added hiearchical clustering functionality

e8a0f9e

made changes to main file

5e61271

completed pipeline

aa6b407

completed checks for pipelin

a3532da

removed files

578aa4c

removed comments

0a048e0

smaller changes

1412a5e

addressed earlier commits

dc1463b

fixed the docstring for extract_test_columns

6d0d7aa

fixed the log debug info for reference groups

ce09e25

fixed the log debug info for reference groups

c292223

fixed the log debug info for reference groups

d4cd0c9

fixed the import structure

46ee176

added docstrings and separated the findcluster function into smaller functions

674c46a

fixed the thresholds so that the array takes distances between 0.0 and 1.0

0e59bdc

added the needed command-line arguments

f81fe90

merged development branch

1cf1fa7

changed the gitignore file to take out pytest_cache and Jupyter notebooks

40b47cb

changed the gitignore file

c5d5586

gcttong added 8 commits August 8, 2018 11:24

fixed return value from compute_distance_matrix

90a26af

allows for other file types in addition to genbank files

e98d7b0

Merge branch 'scheme-creation/_base' of github.com:phac-nml/biohansel into scheme-creation/initial

10c328c

Merge branch 'scheme-creation/_base' of github.com:phac-nml/biohansel into scheme-creation/newChanges

dc053ca

added tests and fixed formatting for files

a24132d

Merge branch 'scheme-creation/initial' of github.com:phac-nml/biohansel into scheme-creation/tree_visualization

8758254

added description for display_tree

80e2bb2

removed logging messages

507a105

gcttong requested a review from peterk87 August 9, 2018 14:39

gcttong added 3 commits August 9, 2018 16:11

changed the spacing in cli.py

e96e954

Merge branch 'scheme-creation/_base' of github.com:phac-nml/biohansel into scheme-creation/tree_visualization

bfa947b

changed file names

9b70aa9

peterk87 suggested changes Aug 10, 2018

gcttong added 2 commits August 13, 2018 14:26

changed name of variable to tile-length and gives options of file formats to user

0175d92

changed command line inputs and added default values

181b812

peterk87 reviewed Aug 14, 2018

gcttong added 11 commits August 15, 2018 13:49

created a cluster class and added range of integer values as an option

ad0706f

took out logging messages

d141a97

modified the output format for the phylogenetic tree

b9b8857

midified the string formatting in scheme creation

46cbc1f

allowed user to specify file format and program also parses file format if none given

b554b06

changed fasta sequence file function to generator

b347944

added a funciton that extract snvs from ingroup and outgroup

292749a

added docstring to new function

784a18e

modified function in cluster_generator

99966c0

changed expand sets function in cluster generator to have dict comprehension

b5b061a

refactored the output_flat_clusters function in cluster_generator files

94b2c02

glabbe closed this Nov 4, 2020
