Flaky Test: TestDatasetIntegration.test_dataset_download[HuggingFace - Public dataset-huggingface-test_case0] #2460
Comments
cc @seanlaii Could you please take a look when you can? |
/area testing |
/good-first-issue |
@andreyvelich: Please ensure the request meets the requirements listed here. If this request no longer meets these requirements, the label can be removed. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Yes, will check tonight. |
/assign |
I am not able to reproduce it locally.
However, the dataset and model are both public and not gated: https://huggingface.co/api/datasets/karpathy/tiny_shakespeare/revision/main
I will spend more time looking into it. I am not sure whether it is related to a connection issue:
else:
# Otherwise: most likely a connection issue or Hub downtime => let's warn the user
> raise LocalEntryNotFoundError(
"An error happened while trying to locate the files on the Hub and we cannot find the appropriate"
" snapshot folder for the specified revision on the local disk. Please check your internet connection"
" and try again."
) from api_call_error
E huggingface_hub.errors.LocalEntryNotFoundError: An error happened while trying to locate the files on the Hub and we cannot find the appropriate snapshot folder for the specified revision on the local disk. Please check your internet connection and try again. |
Can I work on this? |
Hi @azzamh15 , yes, feel free to take it. Thank you! |
/assign |
I tried reproducing the flaky behavior of TestDatasetIntegration.test_dataset_download on my local setup, but the test passed consistently across 10+ runs. |
What do you think about adding a retry in case of an intermittent network issue? |
Is this issue resolved or still open? (The latest run of this test was successful.) Edit: https://github.com/kubeflow/trainer/actions/runs/13595830014/job/38012401152#step:5:1407 |
Okay, I will look into it. |
I saw the same error again: https://github.com/kubeflow/trainer/actions/runs/13925570979/job/38969057348?pr=2540 |
I will try reproducing this locally again by throttling my network. If this is a network-related issue (which seems likely), what should be the expected behavior in such cases? Should the test fail, retry, or handle it differently? Let me know your thoughts. |
Thank you for investigating that. What kind of network error is it? Could we retry the download, in the test code only, and only when the download itself raises an error? |
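For discussion, here is a minimal sketch of what a test-only retry could look like. It is not taken from the Trainer code base: the helper name `retry_on_error` and its parameters are hypothetical. It retries only on the exception types passed in, so in the integration test the tuple would presumably contain `huggingface_hub.errors.LocalEntryNotFoundError` (the error in the traceback above), and any other failure would still fail the test immediately.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_on_error(
    fn: Callable[[], T],
    exceptions: Tuple[Type[BaseException], ...],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), retrying with exponential backoff on the given exceptions.

    Hypothetical test helper: only the listed exception types trigger a
    retry; anything else propagates on the first occurrence.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except exceptions:
            if attempt == attempts:
                # Out of attempts: surface the original error to the test.
                raise
            # Wait base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

In the test, the download call would be wrapped in a zero-argument callable (for example a `lambda`) and passed as `fn`. Keeping the retry in the test code leaves the initializer's production behavior unchanged, which matches the suggestion above.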
What happened?
Flaky Integration Test:
TestDatasetIntegration.test_dataset_download[HuggingFace - Public dataset-huggingface-test_case0]
What did you expect to happen?
The test should never fail.
Environment
Kubernetes version:
Kubeflow Trainer version:
$ kubectl get pods -n kubeflow -l app.kubernetes.io/name=trainer -o jsonpath="{.items[*].spec.containers[*].image}"
Kubeflow Python SDK version:
Impacted by this bug?
Give it a 👍 We prioritize the issues with most 👍