_pickle.UnpicklingError: invalid load key, '\xe2'. #1

Open
xiuzhilu opened this issue Sep 2, 2021 · 1 comment


xiuzhilu commented Sep 2, 2021

Hi, I downloaded the WMT14 En-De dataset, and when I run the script with "sh sh_train.sh" I get the following error:

Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/luxiuzhi/luxiuzhi/Data-Rejuvenation/fairseq/fairseq/distributed_utils.py", line 177, in all_gather_list
    result.append(pickle.loads(bytes(out_buffer[header_size:header_size + enc_size].tolist())))
_pickle.UnpicklingError: invalid load key, '\xe2'.

Exception: Unable to unpickle data from other workers. all_gather_list requires all workers to enter the function together, so this error usually indicates that the workers have fallen out of sync somehow. Workers can fall out of sync if one of them runs out of memory, or if there are other conditions in your training script that can cause one worker to finish an epoch while other workers are still iterating over their portions of the data.

Could you please help me look into this?

Owner

wxjiao commented Sep 2, 2021

I haven't run into this problem before, but you could check whether increasing "--all-gather-list-size" to a larger value helps.
Posting your training script and the full log here would also make the problem easier to spot.
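
To make the error message a bit more concrete, here is a minimal sketch of how a fixed-size gather buffer can produce this exact kind of failure. It is not fairseq's code: `encode_slot`, `decode_slot`, and the 4-byte length header are assumptions for illustration, loosely modelled on the `header_size` / `enc_size` slicing visible in the traceback. Each worker writes a length header followed by pickled bytes into its slot; if a slot holds bytes that are not a pickle stream (for example because the workers were not in the same call), `pickle.loads` fails with an "invalid load key" error.

```python
import pickle
import struct

# Assumption for this sketch: 4 bytes hold the payload length,
# similar in spirit to `header_size` in the traceback.
HEADER_SIZE = 4


def encode_slot(obj, max_size):
    """Pack obj into a fixed-size slot: big-endian length header + pickle bytes."""
    payload = pickle.dumps(obj)
    size = HEADER_SIZE + len(payload)
    if size > max_size:
        # In this sketch an oversized payload is rejected before it is written.
        raise ValueError(f"encoded data size ({size}) exceeds max_size ({max_size})")
    slot = bytearray(max_size)
    slot[:HEADER_SIZE] = struct.pack(">I", len(payload))
    slot[HEADER_SIZE:size] = payload
    return slot


def decode_slot(slot):
    """Read the length header, then unpickle exactly that many bytes."""
    (enc_size,) = struct.unpack(">I", bytes(slot[:HEADER_SIZE]))
    return pickle.loads(bytes(slot[HEADER_SIZE:HEADER_SIZE + enc_size]))


# A healthy slot round-trips.
stats = {"loss": 1.5, "ntokens": 4096}
assert decode_slot(encode_slot(stats, max_size=1024)) == stats

# A slot whose bytes are not a pickle stream (hypothetical stale content:
# UTF-8 text starting with byte 0xe2) cannot be unpickled.
text = "\u201cab".encode("utf-8")
stale = bytearray(struct.pack(">I", len(text)) + text)
try:
    decode_slot(stale)
except pickle.UnpicklingError as err:
    print("unpickling failed:", err)
```

In this sketch, a payload that is too large for the slot fails early with a `ValueError`, while garbage in the slot fails at `pickle.loads`. So if a larger "--all-gather-list-size" does not change anything, the out-of-sync explanation from the exception text (one worker running out of memory or finishing its data early) is worth checking in the log.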
