Guidelines for bounds on clamping weights, scaling derivatives, and slope #5
-
Opening this issue for more detailed documentation on the bounds for clamping weights, scaling derivatives, and the slope parameter.
-
Thanks for organizing those tricks.
-
For catalysis reactions, the input and output weights only share signs but not values.
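A minimal sketch of the difference, using the two clamping lines quoted later in this thread (array sizes here are purely illustrative):

w_out = randn(Float32, 5, 4)             # hypothetical stoichiometric (output) weights
w_in  = randn(Float32, 5, 4)             # hypothetical reaction-order (input) weights

# Non-catalytic cases: input weights are tied to the output weights' values.
w_in_tied = clamp.(-w_out, 0.0f0, 2.5f0)

# Catalysis: the values are decoupled; only the sign/range constraint is kept.
w_in_free = clamp.(w_in, 0.0f0, 4.0f0)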
-
Use it when the species concentrations span several orders of magnitude.
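Assuming "it" refers to scaling the derivatives/concentrations (the dydt_scale that shows up later in this thread), the idea is roughly the following (names and sizes are illustrative, not the repo's exact code):

Y = 10 .^ (rand(Float32, 5, 100) .* 6 .- 3)   # hypothetical data: 5 species spanning ~1e-3 to 1e3
scale = maximum(abs.(Y), dims=2)              # one scale per species
Y_scaled = Y ./ scale                         # every species is now O(1)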
-
I would suggest trying different ones and profiling them. This depends on a lot of factors, and Chris has a nice paper on it: https://arxiv.org/pdf/1812.01892.pdf
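For instance, if the question was about ODE solvers, one way to "try and profile" is to time a few of them on the same problem; the toy problem and solver list below are placeholders, not the CRNN setup:

using OrdinaryDiffEq, BenchmarkTools

f(u, p, t) = -p .* u                                  # toy right-hand side standing in for the CRNN ODE
prob = ODEProblem(f, [1.0, 0.5], (0.0, 10.0), [1.0, 1000.0])

for alg in (Tsit5(), Rosenbrock23(), TRBDF2())
    println(nameof(typeof(alg)))
    @btime solve($prob, $alg, abstol=1e-8, reltol=1e-8)
end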
-
Use it when you see the gradient fluctuating a lot. It is hard to say in advance when that happens. Gradient clipping is a common practice in training RNNs and neural ODEs.
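A minimal sketch of the two common flavors of gradient clipping (nothing here is the repo's exact training code):

using LinearAlgebra

clip_by_value(g, c) = clamp.(g, -c, c)                      # elementwise clipping
clip_by_norm(g, c) = g .* min(1.0, c / (norm(g) + eps()))   # rescale only if the norm exceeds c

g = 100 .* randn(10)                                        # hypothetical noisy gradient
g_clipped = clip_by_norm(g, 1.0)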
-
Thanks @jiweiqi, this is a nice guideline.
-
This is a good question. It seems that MAE is preferred for neural ODEs, although I don't know the intuition. I noticed it when I tried the ODE demo code in the PyTorch package torchdiffeq. To me, the intuition is that the error accumulates over time, so the error can be substantially larger at a later phase. But we want the loss from the earlier phase to participate well in the training, so it is better to use MAE, since MSE will focus on the large errors. Those kinds of things are heuristic, though, and we should always try both.
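For concreteness, the two losses being compared (illustrative definitions; the array shapes are hypothetical):

using Statistics

mae(pred, obs) = mean(abs.(pred .- obs))     # weights all time points more evenly
mse(pred, obs) = mean(abs2.(pred .- obs))    # dominated by the largest (often late-time) errors

pred = rand(Float32, 5, 100)                 # hypothetical predicted trajectory: species x time
obs  = rand(Float32, 5, 100)
mae(pred, obs), mse(pred, obs)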
-
I made up a logic of my own by analogy with regularization (which is also often part of the loss function). L1 is analogous to MAE, whereas L2 is analogous to MSE. So just as we tend to use L1 methods for inducing sparsity or handling outliers, I think MAE does a similar job. MSE, on the other hand, tends to be biased toward outliers. This logic of course assumes that we have few outliers, and hence are using an L1 (MAE) type of metric.
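The analogy in code, if it helps (purely illustrative):

l1_penalty(w, λ) = λ * sum(abs.(w))     # analogous to MAE: promotes sparsity, tolerant of outliers
l2_penalty(w, λ) = λ * sum(abs2.(w))    # analogous to MSE: penalizes large values quadratically

w = randn(20)
l1_penalty(w, 1e-3), l2_penalty(w, 1e-3)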
-
Is w_out' .* dydt_scale' an elementwise product or a matrix-vector product?
-
I think so, since it is an elementwise product.
-
Yeah, it is certainly not a matrix-vector product; instead, it broadcasts in a certain way. I am always confused about it. It is good practice to check in the terminal as you did :)
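A quick check one can run in the terminal to see the difference (sizes are illustrative):

w_out      = rand(Float32, 3, 5)     # hypothetical output weights
dydt_scale = rand(Float32, 3)        # one scale per species

w_out' * dydt_scale                  # matrix-vector product: a length-5 vector
w_out' .* dydt_scale'                # broadcasted elementwise product: a 5x3 matrix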
-
w_out_ = (w_out' .* dydt_scale') .* exp.(w_b)       # scaling w_out
display(w_out_)
display(maximum(abs.(w_out_), dims=2)')             # maximum of the absolute values in each row
display(w_out_ ./ maximum(abs.(w_out_), dims=2))    # dividing each row by its maximum absolute value, so the largest-magnitude entry in each row becomes ±1

Am I correct with these comments?
-
I think you're right, unless the results look strange; in that case we might come back and check the formula.
-
@jiweiqi |
-
I don't think the training is sensitive to the boundary of the clamping for |
-
Oh no, I didn't mean to ask about training sensitivity. I understood it a bit better after re-reading your paper and code now. I was referring to this: in case 1 (and also case 2),

w_in = clamp.(-w_out, 0, 2.5);

but in case 3,

w_in = clamp.(w_in, 0.f0, 4.f0);

This refers to the quote "except that the sharing parameter between input weights and output weights are relaxed since the stoichiometric coefficients (output weights) for the catalysis could be zero while the reaction orders (input weights) are non-zero".
-
Yes, you are right. For case 3, we don't bind w_in and w_out as we did for case 1 and case 2. Similarly, for Robertson's problem, we don't bind the weights.
-
Regarding the slope, for example CRNN/case2/case2.jl, line 99 in e7242f6:
The general guideline is that if the activation energies or the log prefactor logA are far away from zero (compared to unity), it is recommended to use a slope to rescale the weights. This is because those weights are usually initialized from Gaussian distributions, and we should try our best to keep the weights close to a Gaussian distribution as well, to make the optimization easier.
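A minimal sketch of that rescaling idea (the names and magnitudes below are assumptions for illustration, not the exact code at case2.jl line 99):

n_reactions = 4
w_Ea   = randn(Float32, n_reactions)     # stays near its N(0, 1) initialization during training
w_logA = randn(Float32, n_reactions)

slope_Ea   = 100.0f0                     # assumed typical magnitude of the activation energies
slope_logA = 10.0f0                      # assumed typical magnitude of logA

Ea   = slope_Ea   .* w_Ea                # physical-scale activation energies
logA = slope_logA .* w_logA              # physical-scale log prefactors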