add stratify argument #69

Draft
ljwolf wants to merge 4 commits into martinfleis:main from ljwolf:main
@ljwolf ljwolf commented Jun 12, 2025

This adds a stratify_by_k argument that allows the user to color links and nodes by a fixed "K" value. In a hierarchical example, this is interesting, since it lets you see which components of a cluster at k "flow into" or "out of" nodes in that stratum.

Figure_1

This is useful because it lets you investigate cluster co-occurrence while also visualizing similarity for other clusterings. Each edge represents the count of observations moving from ki->kj that are in stratum q at stratify_by_k.
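The edge counts described above can be sketched in a few lines. This is only an illustration, not the code in this PR: the function name `stratified_edge_counts` and the toy label lists are hypothetical, and it assumes each clustering is available as one label per observation.

```python
from collections import Counter


def stratified_edge_counts(labels_from, labels_to, labels_strat):
    """Count observations per (cluster at ki, cluster at kj, stratum q).

    Hypothetical helper: each argument holds one cluster label per
    observation, for ki, kj, and stratify_by_k respectively.
    """
    if not (len(labels_from) == len(labels_to) == len(labels_strat)):
        raise ValueError("each label sequence needs one entry per observation")
    # Every observation contributes to exactly one (source, target, stratum) key.
    return Counter(zip(labels_from, labels_to, labels_strat))


# Toy hierarchical labels for six observations at k=2, k=3, and k=4.
k2 = [0, 0, 0, 1, 1, 1]
k3 = [0, 0, 1, 2, 2, 2]
k4 = [0, 1, 2, 3, 3, 3]

# Edges between k=2 and k=3, split by the stratum each observation has at k=4.
edges = stratified_edge_counts(k2, k3, k4)
```

Here the single link from cluster 0 at k=2 to cluster 0 at k=3 is split into two edges, one per stratum at k=4, which is what makes the "flow" of each stratum visible.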

For linewidths, the linewidth parameter is the square root of the width of the root node. I've moved to the square root (like s in ax.scatter()) because the width here is going to be interpreted in areal units (points^2) rather than directly (points). Each line, then, is actually a parallelogram whose area is proportional to the number of ki->kj observations in stratum q at stratify_by_k.
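One way to read this scaling, as a rough sketch rather than the PR's actual code: the root node gets a width of linewidth squared, and each edge takes a share of that width equal to its share of the observations. The helper name `edge_widths` is hypothetical, and the sketch assumes the counts passed in sum to the number of observations in the root node. With a fixed horizontal gap between adjacent k levels, a parallelogram's area is its width times that gap, so the area stays proportional to the count.

```python
def edge_widths(counts, linewidth):
    """Return a width per edge, proportional to its observation count.

    Hypothetical helper. Assumes `counts` maps edge -> number of
    observations and that the counts sum to the root node's total.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one observation")
    # The parameter is the square root of the root node's width.
    root_width = linewidth ** 2
    return {edge: root_width * n / total for edge, n in counts.items()}


# Two edges carrying 1 and 3 observations, with linewidth=4 (root width 16).
widths = edge_widths({"a": 1, "b": 3}, linewidth=4)
```

Under these assumptions, the widths of all edges leaving a level add up to the root width, so no level is drawn wider than the root node.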

For colors, the cmap argument specifies the colormap to use for nodes and edges. The coloring for nodes/edges is calculated for all clusters at k. For clusters at k_i<k, nodes use the count-weighted average color of the strata in that node. This works better visually than re-calculating the cluster color from y using the colormap, because it's easier imho to think in terms of the color combination than it is to estimate the hue between two other hues on a colormap. I do have examples that just calculate color based on the colormap at each k if you're curious. Line colors always correspond to the color for stratum q.
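The count-weighted average for a node can be sketched as below. This is a minimal illustration, not the PR's implementation: `blend_colors` is a hypothetical name, the stratum colors are hard-coded RGBA tuples standing in for values sampled from cmap, and averaging channel by channel in RGBA space is an assumption, since the description does not say which color space is used.

```python
def blend_colors(stratum_colors, stratum_counts):
    """Average stratum colors, weighted by observations per stratum.

    Hypothetical helper. `stratum_colors` maps stratum -> RGBA tuple and
    `stratum_counts` maps stratum -> number of observations in the node.
    """
    total = sum(stratum_counts.values())
    if total == 0:
        raise ValueError("node contains no observations")
    n_channels = len(next(iter(stratum_colors.values())))
    # Assumption: a plain channel-wise weighted mean in RGBA space.
    return tuple(
        sum(stratum_colors[q][c] * n for q, n in stratum_counts.items()) / total
        for c in range(n_channels)
    )


# A node holding 3 observations from a red stratum and 1 from a blue stratum.
colors = {0: (1.0, 0.0, 0.0, 1.0), 1: (0.0, 0.0, 1.0, 1.0)}
node_color = blend_colors(colors, {0: 3, 1: 1})
```

The result is mostly red with some blue mixed in, which matches the idea of reading a node's color as a combination of the strata it contains.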

To enable this, I've had to also add a cmap argument. I think if stratify_by_k=None, cmap should color nodes/edges according to the fraction of observations they contain: bigger nodes/more populous links are "higher" values in the colormap. I didn't implement this yet, because I wanted to check:

  1. should this be allowed?
  2. is coloring by count more reasonable as a default than coloring on the y-axis height?

I still need to:

  • decide/implement cmap behaviors without stratification after a decision is made above.
  • check an example with non-hierarchical clustering to see if it "works"
  • write tests

@martinfleis (Owner)

> bigger nodes/more populous links are "higher" values in the colormap. I didn't implement this yet, because I wanted to check:
>
>   • should this be allowed?
>   • is coloring by count more reasonable as a default than coloring on the y-axis height?

I am not sure how much it matters how the colour is assigned. Visually, it will always look better to map it according to the y-axis, as that will show a nice gradient. Mapping it by size at k is also possible, but is it worth it? I interpret this colormap more as a categorical one, showing the flows that belong to the same class at k, than as a continuous one showing a quantity.
