
Silent Bug during loss computation #152

Closed
leonardtschora opened this issue Jun 7, 2021 · 4 comments · Fixed by #153
@leonardtschora (Collaborator)

Hi everyone,

I think the loss computation of the generic fit! function is bugged.

The loss function is defined as loss_func(x, y) = loss(chain(x), y) and computed as mean(loss_func(X[i], y[i]) for i=1:length(X)). However, chain(X[i]) yields a batch_size * 1 matrix, which causes a silent bug during the loss computation.

The default loss is Flux.Losses.mse, which is computed as agg((ŷ .- y) .^ 2). We expect the result of (ŷ .- y) to be a vector of length batch_size containing the differences between predictions and labels. However, given that ŷ is a matrix, the broadcast yields a batch_size * batch_size matrix containing inconsistent values, which is then reduced to a scalar by the agg function. The loss computation therefore raises no error, but the computed value is wrong!
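The same broadcasting trap is easy to reproduce outside MLJFlux. Here is a minimal NumPy sketch, not MLJFlux code, using a batch size of 25 as in the MWE below and made-up random values:

```python
import numpy as np

batch_size = 25                        # same batch size as in the MWE
rng = np.random.default_rng(0)

y_true = rng.random(batch_size)        # labels: a vector of length 25
y_pred = rng.random((batch_size, 1))   # matrix-shaped prediction, 25 x 1

diff = y_pred - y_true                 # silently broadcasts to 25 x 25
buggy_mse = np.mean(diff ** 2)         # aggregates 625 inconsistent values
true_mse = np.mean((np.ravel(y_pred) - y_true) ** 2)

print(diff.shape)                      # (25, 25)
```

Under NumPy's broadcasting rules, shapes (25, 1) against (25,) blow up to (25, 25), just as the Julia broadcast does, and the aggregated value then bears no relation to the true MSE.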

Here is a MWE to illustrate:

using MLJ, MLJFlux, Flux, DataFrames

# Define a small network structure
mutable struct TESTBuilder <: MLJFlux.Builder end
MLJFlux.build(builder::TESTBuilder, n_in, n_out) = Chain(Dense(n_in, n_out))

n = 100
d = 5
# We generate dummy data in the appropriate format
X_ = DataFrame(rand(Float64, n, d), :auto);
y_ = X_.x1 .^2 + X_.x2 .* X_.x3 - 4 * X_.x4

# Create the model
Model = @load NeuralNetworkRegressor pkg=MLJFlux verbosity=0
model = Model(builder = TESTBuilder(), batch_size=25, epochs=100)

# Check that the model works
mach = machine(model, X_, y_)
fit!(mach)
predict(mach, X_)
model.loss(predict(mach, X_), y_)

# Let's dig into the code.
# In MLJFlux.Regressor fit function
n_input = Tables.schema(X_).names |> length
data = MLJFlux.collate(model, X_, y_)
target_is_multivariate = Tables.istable(y_)
if target_is_multivariate
    target_column_names = Tables.schema(y_).names
else
    target_column_names = [""] # We won't be using this
end
n_output = length(target_column_names)
chain = MLJFlux.build(model.builder, n_input, n_output)

# We then call MLJFlux.core fit! function
loss = model.loss
X = data[1]
y = data[2]
loss_func(x, y) = loss(chain(x), y)
prev_loss = mean(loss_func(X[i], y[i]) for i=1:length(X)) # The loss is computed but is wrong!

# Let's check it : 
i = 1
x = X[i]
ytrue = y[i]
ypred = chain(x)
loss(ypred, ytrue)
ypred .- ytrue # We obtain a 25 * 25 matrix...

# Let's compute manually the mse loss
function mymse(ytrue, ypred)
    acc = 0.0
    for (yp, yt) in zip(ypred, ytrue)
        acc += (yp - yt)^2
    end
    return acc / length(ytrue)
end
mymse(ytrue, ypred)
loss(ypred, ytrue) # The 2 results are drastically different.

I think this issue is quite important because it potentially prevents all models from converging when using a batch_size > 1. Moreover, it gives the user unreliable results.

I have thought of 2 ways to tackle it:

  1. Use safe loss functions such as mymse
  2. Convert the result of the call to chain into a safe format
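For intuition, option 2 amounts to flattening the matrix-shaped prediction before the loss sees it. A NumPy sketch with made-up toy values (not MLJFlux code):

```python
import numpy as np

y_true = np.arange(5, dtype=float)       # labels: [0, 1, 2, 3, 4]
y_pred = y_true.reshape(5, 1) + 0.5      # "chain" output as a 5 x 1 matrix

unsafe = np.mean((y_pred - y_true) ** 2)          # broadcasts to 5 x 5
safe = np.mean((np.ravel(y_pred) - y_true) ** 2)  # flatten first

print(unsafe, safe)                      # 4.25 0.25
```

Every prediction is off by exactly 0.5, so the true MSE is 0.25; the broadcast version averages 25 inconsistent cross-differences instead.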

Anyway, please keep me posted on this issue; I think it is very important.

Thanks in advance.

@ablaom ablaom added the bug Something isn't working label Jun 8, 2021
@ablaom (Collaborator) commented Jun 8, 2021

Yes, @Leonardbcm, this is a bug. Great catch! This is invaluable user feedback.

Rather than messing with the result of chain, what do you think of inserting your tomat mapping from #142?

loss_func(x, y) = loss(chain(x), tomat(y))   # option 1

Or, perhaps even more natural, would be to redefine reformat to return a 1 x n matrix instead of a vector:

reformat(y, ::Type{<:AbstractVector{<:Continuous}}) = reshape(y, 1, length(y))   # option 2
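The effect of this reshape can be mimicked in NumPy (shapes and values are assumed, for illustration only): once the target is a 1 x n matrix like the chain output, the subtraction stays elementwise:

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0])           # original labels, a vector
y_pred = np.array([[1.5, 2.5, 3.5, 4.5]])    # 1 x n chain-style output

y_mat = y.reshape(1, len(y))                 # analogue of option 2's reshape
diff = y_pred - y_mat

print(diff.shape)              # (1, 4): no silent blow-up
print(np.mean(diff ** 2))      # 0.25
```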

@leonardtschora (Collaborator, Author)

This option seems fine to me. The new test passes on my computer, and I seem to get more coherent results when training my models.

I hope there are not too many other silent bugs like this one. Comparison of performance with scikit-learn's models becomes handy at this point!

Thanks for the quick and solid support.

@ablaom (Collaborator) commented Jun 9, 2021

Comparison of performance with scikit-learn's models becomes handy at this point!

Agreed. Can one control initialisation of weights there (to avoid RNG dependencies)?

Be great if you have a chance to look into this yourself: #157

@leonardtschora (Collaborator, Author)

Yes, it is possible and, as with everything else in Python, a bit less straightforward. One has to define a new class deriving from tf.keras.initializers.Initializer. The following example should be an entry point (I used it for another experiment, and it should run).

Anyway, the recipe is well described.

# Kernel and bias initializers
import tensorflow as tf
class ManualInitializer(tf.keras.initializers.Initializer):
    def __init__(self, weights):
        self.weights = weights
        
    def __call__(self, shape, dtype=None):
        w = tf.convert_to_tensor(self.weights, dtype=dtype)
        w = tf.reshape(w, shape)        
        return w

    def get_config(self):  # To support serialization
        return {'weights': self.weights}
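The core of the class above is just "reshape a fixed flat weight list to whatever shape the layer requests". A framework-free NumPy sketch of that logic (the weight values are assumptions for illustration):

```python
import numpy as np

fixed_weights = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]   # assumed toy values

def manual_init(shape, weights=fixed_weights):
    # Mirrors the initializer's __call__: convert, then reshape to `shape`
    w = np.asarray(weights, dtype=np.float64)
    return w.reshape(shape)

kernel = manual_init((2, 3))     # e.g. a kernel for a 2-in / 3-out dense layer
print(kernel.shape)              # (2, 3)
```

Feeding both frameworks the same flat weight list this way removes the RNG dependency from the comparison.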

I'll see if I can manage to spare some time to make a small comparison experiment.
