
feat: Add random state feature. #150


Open
wants to merge 5 commits into base: john-development
21 changes: 1 addition & 20 deletions src/diffpy/snmf/main.py
@@ -7,27 +7,8 @@
Y0 = np.loadtxt("input/W0.txt", dtype=float)
N, M = MM.shape

# Convert to DataFrames for display
# df_X = pd.DataFrame(X, columns=[f"Comp_{i+1}" for i in range(X.shape[1])])
# df_Y = pd.DataFrame(Y, columns=[f"Sample_{i+1}" for i in range(Y.shape[1])])
# df_MM = pd.DataFrame(MM, columns=[f"Sample_{i+1}" for i in range(MM.shape[1])])
# df_Y0 = pd.DataFrame(Y0, columns=[f"Sample_{i+1}" for i in range(Y0.shape[1])])

# Print the matrices
"""
print("Feature Matrix (X):\n", df_X, "\n")
print("Coefficient Matrix (Y):\n", df_Y, "\n")
print("Data Matrix (MM):\n", df_MM, "\n")
print("Initial Guess (Y0):\n", df_Y0, "\n")
"""


my_model = snmf_class.SNMFOptimizer(MM=MM, Y0=Y0, X0=X0, A=A0, components=2)
my_model = snmf_class.SNMFOptimizer(MM=MM, Y0=Y0, X0=X0, A=A0, n_components=2)
print("Done")
# print(f"My final guess for X: {my_model.X}")
# print(f"My final guess for Y: {my_model.Y}")
# print(f"Compare to true X: {X_norm}")
# print(f"Compare to true Y: {Y_norm}")
np.savetxt("my_norm_X.txt", my_model.X, fmt="%.6g", delimiter=" ")
np.savetxt("my_norm_Y.txt", my_model.Y, fmt="%.6g", delimiter=" ")
np.savetxt("my_norm_A.txt", my_model.A, fmt="%.6g", delimiter=" ")
80 changes: 70 additions & 10 deletions src/diffpy/snmf/snmf_class.py
@@ -4,8 +4,69 @@


class SNMFOptimizer:
def __init__(self, MM, Y0=None, X0=None, A=None, rho=1e12, eta=610, max_iter=500, tol=5e-7, components=None):
Contributor: We need a docstring here and in the init. Please see the scikit-package FAQ about how to write these. Also, look at Yucong's code or diffpy.utils?

Author: Added one here. The package init dates back to the old codebase, but as soon as that is updated it will get a docstring as well.

Contributor: The package init (i.e., the __init__.py) doesn't need a docstring.

print("Initializing SNMF Optimizer")
"""A implementation of stretched NMF (sNMF), including sparse stretched NMF.

Instantiating the SNMFOptimizer class runs all the analysis immediately.
The results matrices can then be accessed as instance attributes
of the class (X, Y, and A).

For more information on sNMF, please reference:
Gu, R., Rakita, Y., Lan, L. et al. Stretched non-negative matrix factorization.
npj Comput Mater 10, 193 (2024). https://doi.org/10.1038/s41524-024-01377-5
Contributor: We would normally do a list of class attributes here. Everything that is self.something. This is obviously strongly overlapped with the arguments of the constructor, as many of the attributes get defined in the constructor, but logically they are different. Here we list and describe the class attributes; there we describe the __init__ function arguments.

Author: I'm not clear on how I'd distinguish the arguments from the attributes. I understand how they are different semantically, but what part of that is necessary to make clear here? Can you give an example? Those have been helpful.

Contributor: Everything that is self.something (except for methods, which are self.functions() and are not considered attributes) is an attribute. So MM, Y0, X0 are attributes, but also M, N, rng, num_updates, etc. Inside a function or method, the parameters are the arguments of the function (so for the __init__() function they will be MM, Y0, X0, A, rho, eta, and so on). Some of the descriptions will overlap, but for the function argument the user needs to know if it is optional or not, what the default is, and anything else they need to know to successfully instantiate the class. People will generally not see the two docstrings at the same time, so there can be some repetition, but try to keep it short but informative.

"""

def __init__(
self,
MM,
Y0=None,
X0=None,
A=None,
Contributor: More descriptive name?

Author: There are many different standards for what to name these matrices, and zero agreement between sources that use NMF. I'm inclined to eventually use what sklearn.decomposition.non_negative_factorization uses, which would mean MM->X, X->W, Y->H. But I'd like to leave this as is for the moment until there's a consensus about what would be the most clear or standard. If people will be finding this tool from the sNMF paper, there's also an argument for using the X, Y, and A names because those were used there.

Contributor: OK, sounds good. It has to be a very good reason to break PEP8. The only good enough reason I can think of is to be consistent with scikit-learn. Another way of saying it is that we can "adopt the scikit-learn standard".

Author: I'm fine with adopting the scikit-learn standard, but I also like the idea of giving them descriptive (and lowercase) names. The argument against that is that some lines of code use A/X/Y 10+ times in quick succession, so it would make the code very verbose.

Contributor: More readable code is always better, so lowercase descriptive is preferred by me. I don't actually like that scikit-learn breaks this. Shall we go with lowercase? Names can be short if they are defined in the docstring and docs too. Just the code benefits from being readable, so I would say use your judgement on that.

rho=1e12,
eta=610,
max_iter=500,
tol=5e-7,
n_components=None,
random_state=None,
):
"""Initialize an instance of SNMF and run the optimization

Parameters
----------
MM : ndarray
The data to be decomposed. Shape is (length_of_signal, number_of_conditions).
Y0 : ndarray
The initial guesses for the component weights at each stretching condition.
Shape is (number_of_components, number_of_conditions). Must be provided if
n_components is not provided. Will override n_components if both are provided.
X0 : ndarray
The initial guesses for the intensities of each component per
row/sample/angle. Shape is (length_of_signal, number_of_components).
A : ndarray
The initial guesses for the stretching factor for each component, at each
condition. Shape is (number_of_components, number_of_conditions).
rho : float
The stretching factor that influences the decomposition. Zero corresponds to no
stretching present. Relatively insensitive and typically adjusted in powers of 10.
eta : float
The sparsity factor than influences the decomposition. Should be set to zero for
Contributor: Typo: than -> that. Also, it might help to know a good value or range of values to choose when not setting it to zero?

non-sparse data such as PDF. Can be used to improve results for sparse data such
as XRD, but due to instability, should be used only after first selecting the
best value for rho.
max_iter : int
The maximum number of times to update each of A, X, and Y before stopping
the optimization.
tol : float
The convergence threshold. This is the minimum fractional improvement in the
objective function to allow without terminating the optimization. Note that
a minimum of 20 updates are run before this parameter is checked.
n_components : int
The number of components to attempt to extract from MM. Note that this will
Contributor: Attempt? So sometimes it extracts fewer than n_components when it attempts but doesn't manage?

Author: It should never find fewer. "Attempt" means that sometimes the optimization may not work. But if this is unclear I can change it.

Contributor (@sbillinge, Jun 13, 2025): Yes, delete "attempt" to make it clearer.

be overridden by Y0 if that is provided, but must be provided if no Y0 is
provided.
random_state : int
The seed for the initial matrices used in the optimization.
"""

self.MM = MM
self.X0 = X0
self.Y0 = Y0
@@ -15,23 +76,22 @@ def __init__(self, MM, Y0=None, X0=None, A=None, rho=1e12, eta=610, max_iter=500
# Capture matrix dimensions
self.N, self.M = MM.shape
self.num_updates = 0
self.rng = np.random.default_rng(random_state)
Contributor: Can we have a more descriptive variable name? Is this a range? What is the range?

Contributor: Ping on this one.
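For context on the question above: np.random.default_rng returns a numpy.random.Generator object, not a range, and seeding it is what makes the random initialization reproducible. A brief sketch (not part of the PR):

import numpy as np

gen_a = np.random.default_rng(42)  # a seeded numpy.random.Generator
gen_b = np.random.default_rng(42)

# Two generators built from the same seed produce identical draws, so the
# beta-distributed initial Y0 (and the other random initial matrices) repeat exactly.
assert np.array_equal(gen_a.beta(a=2.5, b=1.5, size=(2, 5)),
                      gen_b.beta(a=2.5, b=1.5, size=(2, 5)))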


if Y0 is None:
if components is None:
raise ValueError("Must provide either Y0 or a number of components.")
if n_components is None:
raise ValueError("Must provide either Y0 or n_components.")
else:
self.K = components
self.Y0 = np.random.beta(a=2.5, b=1.5, size=(self.K, self.M)) # This is untested
self.K = n_components
self.Y0 = self.rng.beta(a=2.5, b=1.5, size=(self.K, self.M))
else:
self.K = Y0.shape[0]

# Initialize A, X0 if not provided
if self.A is None:
self.A = np.ones((self.K, self.M)) + np.random.randn(self.K, self.M) * 1e-3 # Small perturbation
self.A = np.ones((self.K, self.M)) + self.rng.normal(0, 1e-3, size=(self.K, self.M))
Contributor: K and M are probably good names if the matrix decomposition equation is in the docstring, so they get defined there.

Contributor: I think you addressed this with your comment to MM, but as a general rule, please respond to each comment so the reviewer knows you have seen it. It wouldn't work here, but just a thumbs up works if you have seen a comment and agree; it saves time in the long run as I don't have to write this long comment...... :)

Author: Got it. I'd like to put the matrix decomposition in the docstring, but I'm having trouble formatting it. Might have to ask about this in one of the meetings.

Contributor: Yes, I am not 100% sure, but I think there is a way.
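On the formatting question above: if the documentation is built with Sphinx and numpydoc, a raw docstring with a .. math:: directive is one way to typeset the decomposition. The equation in this sketch is only schematic; it is not taken from the PR or the paper verbatim.

class SNMFOptimizer:
    r"""An implementation of stretched NMF (sNMF).

    Schematically, each column of ``MM`` is modelled as a sum of components
    ``X`` stretched by the factors in ``A`` and weighted by ``Y``:

    .. math::

        \mathbf{MM}_{:,m} \approx \sum_{k=1}^{K} Y_{k,m}\,
            \operatorname{stretch}\!\left(\mathbf{X}_{:,k};\, A_{k,m}\right)

    See Gu et al., npj Comput Mater 10, 193 (2024) for the precise definition.
    """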

if self.X0 is None:
self.X0 = np.random.rand(self.N, self.K) # Ensures values in [0,1]
self.X0 = self.rng.random((self.N, self.K))

# Initialize solution matrices to be iterated on
self.X = np.maximum(0, self.X0)
self.Y = np.maximum(0, self.Y0)
