Guidelines for bounds on clamping weights, scaling derivatives, and slope #5
-
Opening this issue for more detailed documentation on the bounds for clamping weights, scaling derivatives, and the slope parameter.
-
Thanks for organizing those tricks.
-
For catalysis reactions, the input and output weights only share signs but not values.
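A minimal sketch of the difference, using the two clamping lines quoted later in this thread (array sizes here are purely illustrative):

w_out = randn(Float32, 5, 4)             # hypothetical stoichiometric (output) weights
w_in  = randn(Float32, 5, 4)             # hypothetical reaction-order (input) weights

# Non-catalytic cases: input weights are tied to the output weights' values.
w_in_tied = clamp.(-w_out, 0.0f0, 2.5f0)

# Catalysis: the values are decoupled; only the sign/range constraint is kept.
w_in_free = clamp.(w_in, 0.0f0, 4.0f0)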
-
Use it when the species concentrations span several orders of magnitude.
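Assuming "it" refers to scaling the derivatives/concentrations (the dydt_scale that shows up later in this thread), the idea is roughly the following (names and sizes are illustrative, not the repo's exact code):

Y = 10 .^ (rand(Float32, 5, 100) .* 6 .- 3)   # hypothetical data: 5 species spanning ~1e-3 to 1e3
scale = maximum(abs.(Y), dims=2)              # one scale per species
Y_scaled = Y ./ scale                         # every species is now O(1)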
-
I would suggest trying different ones and profiling them. This depends on a lot of factors, and Chris has a nice paper on it: https://arxiv.org/pdf/1812.01892.pdf
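For instance, if the question was about ODE solvers, one way to "try and profile" is to time a few of them on the same problem; the toy problem and solver list below are placeholders, not the CRNN setup:

using OrdinaryDiffEq, BenchmarkTools

f(u, p, t) = -p .* u                                  # toy right-hand side standing in for the CRNN ODE
prob = ODEProblem(f, [1.0, 0.5], (0.0, 10.0), [1.0, 1000.0])

for alg in (Tsit5(), Rosenbrock23(), TRBDF2())
    println(nameof(typeof(alg)))
    @btime solve($prob, $alg, abstol=1e-8, reltol=1e-8)
end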
-
Use it when you see the gradient fluctuating a lot. It is hard to say in advance when that happens. Gradient clipping is a common practice in training RNNs and neural ODEs.
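A minimal sketch of the two common flavors of gradient clipping (nothing here is the repo's exact training code):

using LinearAlgebra

clip_by_value(g, c) = clamp.(g, -c, c)                      # elementwise clipping
clip_by_norm(g, c) = g .* min(1.0, c / (norm(g) + eps()))   # rescale only if the norm exceeds c

g = 100 .* randn(10)                                        # hypothetical noisy gradient
g_clipped = clip_by_norm(g, 1.0)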
-
Thanks @jiweiqi, this is a nice guideline.
-
This is a good question. It seems that MAE is preferred for neural ODEs, although I don't know the intuition. I noticed it when I tried the ODE demo code in the PyTorch package torchdiffeq. To me, the intuition is that the error accumulates over time, so the error can be substantially larger at a later phase. But we want the loss from the earlier phase to participate well in the training, so it is better to use MAE, since MSE will focus on the large errors. Those kinds of things are heuristic, though, and we should always try both.
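For concreteness, the two losses being compared (illustrative definitions; the array shapes are hypothetical):

using Statistics

mae(pred, obs) = mean(abs.(pred .- obs))     # weights all time points more evenly
mse(pred, obs) = mean(abs2.(pred .- obs))    # dominated by the largest (often late-time) errors

pred = rand(Float32, 5, 100)                 # hypothetical predicted trajectory: species x time
obs  = rand(Float32, 5, 100)
mae(pred, obs), mse(pred, obs)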
-
I made up a logic of my own by analogy with regularization (which is also often part of the loss function). L1 is analogous to MAE, whereas L2 is analogous to MSE. So just as we tend to use L1 methods for inducing sparsity or handling outliers, I think MAE does a similar job. MSE, on the other hand, tends to be biased toward outliers. This logic of course assumes that we have few outliers, and hence are using an L1 (MAE) type of metric.
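The analogy in code, if it helps (purely illustrative):

l1_penalty(w, λ) = λ * sum(abs.(w))     # analogous to MAE: promotes sparsity, tolerant of outliers
l2_penalty(w, λ) = λ * sum(abs2.(w))    # analogous to MSE: penalizes large values quadratically

w = randn(20)
l1_penalty(w, 1e-3), l2_penalty(w, 1e-3)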
-
Is w_out' .* dydt_scale' an elementwise product or a matrix-vector product?
-
I think so, since it is an elementwise product.
-
Yeah, it is certainly not a matrix-vector product; instead, it broadcasts in a certain way. I am always confused about it. It is good practice to check in the terminal as you did :)
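A quick check one can run in the terminal to see the difference (sizes are illustrative):

w_out      = rand(Float32, 3, 5)     # hypothetical output weights
dydt_scale = rand(Float32, 3)        # one scale per species

w_out' * dydt_scale                  # matrix-vector product: a length-5 vector
w_out' .* dydt_scale'                # broadcasted elementwise product: a 5x3 matrix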
-
w_out_ = (w_out' .* dydt_scale') .* exp.(w_b)       # scaling w_out
display(w_out_)
display(maximum(abs.(w_out_), dims=2)')             # maximum of the absolute values in each row
display(w_out_ ./ maximum(abs.(w_out_), dims=2))    # dividing each row by its maximum absolute value, so the largest-magnitude entry in each row becomes ±1

Am I correct with these comments?
-
I think you're right, unless the results look strange; in that case we might come back and check the formula.
-
@jiweiqi |
-
I don't think the training is sensitive to the boundary of the clamping for |
-
Oh no, I didn't mean to ask about training sensitivity. I understood it a bit better after re-reading your paper and code now. I was referring to this: in case 1 (and also case 2),

w_in = clamp.(-w_out, 0, 2.5);

but in case 3,

w_in = clamp.(w_in, 0.f0, 4.f0);

This refers to the quote "except that the sharing parameter between input weights and output weights are relaxed since the stoichiometric coefficients (output weights) for the catalysis could be zero while the reaction orders (input weights) are non-zero".
-
Yes, you are right. For case 3, we don't bind w_in and w_out as we did for case 1 and case 2. Similarly, for Robertson's problem, we don't bind the weights.
-
Regarding the slope, for example CRNN/case2/case2.jl, line 99 in e7242f6:
The general guideline is that if the activation energies or the log prefactor logA are far away from zero (compared to unity), it is recommended to use a slope to rescale the weights. This is because those weights are usually initialized from Gaussian distributions, and we should try our best to keep the weights close to a Gaussian distribution as well, to make the optimization easier.
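A minimal sketch of that rescaling idea (the names and magnitudes below are assumptions for illustration, not the exact code at case2.jl line 99):

n_reactions = 4
w_Ea   = randn(Float32, n_reactions)     # stays near its N(0, 1) initialization during training
w_logA = randn(Float32, n_reactions)

slope_Ea   = 100.0f0                     # assumed typical magnitude of the activation energies
slope_logA = 10.0f0                      # assumed typical magnitude of logA

Ea   = slope_Ea   .* w_Ea                # physical-scale activation energies
logA = slope_logA .* w_logA              # physical-scale log prefactors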