Replies: 1 comment
Thanks for the kind words. I'm glad to hear you're enjoying the book overall. Unfortunately, this discrepancy is expected, which is why I didn't recommend MPS in the main chapters. It was really bad when I started working on the book (PyTorch 2.0 back then) but has incrementally improved in newer PyTorch versions. Like you said, it's mostly fine in newer PyTorch versions now, except for chapter 7. There were also some relevant discussions about this in the forum.
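To make the discrepancy concrete, here is a small sketch (not from the book) that measures the numerical gap between CPU and MPS for a single matrix multiplication; tiny elementwise differences like these can compound over thousands of training steps into diverging loss curves:

```python
import torch

# Sketch: measure the numerical gap between CPU and MPS for one matmul.
# (Assumes an Apple Silicon machine; falls back to a CPU-only check otherwise.)
torch.manual_seed(123)
a = torch.randn(256, 256)
b = torch.randn(256, 256)

ref = a @ b  # float32 CPU reference

if torch.backends.mps.is_available():
    out = (a.to("mps") @ b.to("mps")).to("cpu")
else:
    out = ref.clone()  # no MPS device available: gap is trivially zero

max_gap = (ref - out).abs().max().item()
print(f"max abs difference: {max_gap:.2e}")
```

On MPS hardware the printed gap is typically nonzero but small; it is the accumulation of such differences, not any single op, that makes long training runs drift apart.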
-
I've been enjoying the book immensely and have been learning a lot. I'm running on an M1 Max using the MPS backend. Most of the time, my code produces exactly the same output as the book's. However, during some training runs, things go off the rails.
Specifically, looking at section 7.6, Fine-tuning the LLM on instruction data, when I run the following code:
I get:
whether I'm running on CPU or MPS. However, the output of the next bit of code, which fine-tunes the model, differs greatly.
CPU:
On MPS, it starts off similarly but goes off the rails around step 30:
Is there anything I can do about this? I understand there may be some differences, but in theory I should be able to fine-tune the model under some conditions using MPS, right (even if the output won't match the book's)? Would it be a matter of changing the learning rate or weight decay?
Any thoughts on how to make it work using MPS?
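For what it's worth, a minimal sketch of the usual stabilizers (seeding before weight initialization, a lower learning rate, and gradient clipping) is below; the tiny `Linear` model and the specific hyperparameter values are hypothetical stand-ins, not the book's chapter 7 code:

```python
import torch

# Sketch of common mitigations for unstable fine-tuning on MPS
# (hypothetical model and settings; adapt to your chapter 7 setup).
torch.manual_seed(123)  # seed before creating the model, for reproducibility

device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

model = torch.nn.Linear(8, 8).to(device)  # stand-in for the LLM

# A lower learning rate plus gradient clipping often tames runs that
# diverge mid-training, regardless of backend.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.1)

x = torch.randn(4, 8, device=device)
loss = model(x).pow(2).mean()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
optimizer.zero_grad()
print(f"loss is finite: {torch.isfinite(loss).item()}")
```

None of this makes MPS bit-identical to CPU; it only reduces the chance that small backend-level numerical differences snowball into a diverging loss.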