-
Hey @talmo! I'm working through the SLEAP tutorial on a PC with a decent GPU. The initial training step is taking a long time, though. Here's the dump from the terminal; I can't really see anything weird. Any idea what I need to do differently?
-
Hey @panichem! Are you able to use the sample dataset from the tutorial? If so, or if you just want to give it a quick try, does training a bottom-up multi-animal model work? (This works for single animals as well.) Give those a spin and if neither works, do you mind sharing the video + .slp file with Talmo?
-
Hi @panichem, I also came across this when I created a model where the receptive field (RF) size was relatively small compared to the overall frame size. You could try lowering the input scaling to ~0.5 (which increases the RF size relative to the frame) and see how that affects the first-epoch training time. Please let us know if any of these solutions work. Thanks!
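To make the input-scaling suggestion concrete, here is a rough arithmetic sketch of why a lower input scale makes the RF cover more of the frame. It is not SLEAP code: the function name `effective_rf_fraction`, the 76 px RF, and the 1024 px frame width are all hypothetical values chosen for illustration, and it assumes the RF is a fixed number of pixels in the resized image the network actually sees.

```python
def effective_rf_fraction(rf_px: float, frame_px: int, input_scale: float) -> float:
    """Fraction of the original frame width spanned by the receptive field.

    Hypothetical helper for illustration only; not part of SLEAP.
    """
    if input_scale <= 0:
        raise ValueError("input_scale must be positive")
    # The network sees the frame resized by input_scale, so an RF of rf_px
    # pixels in the resized image spans rf_px / input_scale pixels of the
    # original frame.
    rf_in_original_px = rf_px / input_scale
    # The RF cannot usefully cover more than the whole frame.
    return min(rf_in_original_px / frame_px, 1.0)


# Assumed example numbers: a 76 px RF on a 1024 px wide frame.
for scale in (1.0, 0.5):
    frac = effective_rf_fraction(rf_px=76, frame_px=1024, input_scale=scale)
    print(f"input_scale={scale}: RF spans {frac:.1%} of the frame width")
```

Under these assumptions, halving the input scale doubles the share of the frame that each RF covers, which is the effect described above.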
-
@talmo @roomrys - I switched to a bottom-up model and changed the input scaling to 0.5, and now the first ~10 epochs are done in a few minutes. Thanks for your help!!
-
Marking this as a TODO since there is a work-around, but we still need to find the root cause (and prevent it from happening).
-
The fact that there weren't any errors and that training didn't even start makes me think it's a TensorFlow deadlock. We've run into this in the past (see attempted fixes in 613c201 and 492b67b). I think it's related to how we use In the past I've had a hard time reliably reproducing this -- it seems to be stochastic and maybe system-dependent -- so maybe let's just close this for now and revisit it if more people are having the same problem. Also moving this to Discussions so folks see it when asking q's. Thanks for the report @panichem!
-
Hi all! I am also encountering this issue. However, I was previously able to train the model (and output predictions) on the same machine, with the same SLEAP version and the same video data. The only difference is that I'm now using a smaller skeleton, i.e., one with fewer nodes/edges. I did this because my previous predictions were not good enough for my goal, so I wanted to simplify. The centroid training runs and finishes, but the centered_instance training consistently stalls at Training Epoch 1, after 199 batches. It also doesn't stop when I click 'Stop Early', only when I cancel it. I already tried the trivial troubleshooting steps, such as rebooting the PC and clearing up memory. Do you have any insights into what is going on?