Flaky Test: TestDatasetIntegration.test_dataset_download[HuggingFace - Public dataset-huggingface-test_case0] #2460

Open
tenzen-y opened this issue Feb 28, 2025 · 17 comments

Comments

@tenzen-y
Member

What happened?

Flaky Integration Test: TestDatasetIntegration.test_dataset_download[HuggingFace - Public dataset-huggingface-test_case0]

What did you expect to happen?

The test should never fail.

Environment

Kubernetes version:

$ kubectl version

Kubeflow Trainer version:

$ kubectl get pods -n kubeflow -l app.kubernetes.io/name=trainer -o jsonpath="{.items[*].spec.containers[*].image}"

Kubeflow Python SDK version:

$ pip show kubeflow

Impacted by this bug?

Give it a 👍 We prioritize the issues with most 👍

@andreyvelich
Member

cc @seanlaii Could you please take a look when you can?

@andreyvelich
Member

/area testing

@andreyvelich
Member

/good-first-issue

@andreyvelich:
This request has been marked as suitable for new contributors.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-good-first-issue command.

In response to this:

/good-first-issue

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@seanlaii
Contributor

seanlaii commented Mar 3, 2025

Yes, will check tonight.

@seanlaii
Contributor

seanlaii commented Mar 3, 2025

/assign

@seanlaii
Contributor

seanlaii commented Mar 4, 2025

I am not able to reproduce it locally.
These are the error messages:

E           requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://huggingface.co/api/datasets/karpathy/tiny_shakespeare/revision/main
E               huggingface_hub.errors.HfHubHTTPError: 403 Forbidden: None.
E               Cannot access content at: https://huggingface.co/api/datasets/karpathy/tiny_shakespeare/revision/main.
E               Make sure your token has the correct permissions.
E               huggingface_hub.errors.HfHubHTTPError: 403 Forbidden: None.
E               Cannot access content at: https://huggingface.co/api/models/hf-internal-testing/tiny-random-bert/revision/main.
E               Make sure your token has the correct permissions.

However, the dataset and the model are both public and not gated: https://huggingface.co/api/datasets/karpathy/tiny_shakespeare/revision/main
https://huggingface.co/api/models/hf-internal-testing/tiny-random-bert/revision/main

I will spend more time looking into it. I am not sure whether it is related to a connection issue:

else:
                # Otherwise: most likely a connection issue or Hub downtime => let's warn the user
>               raise LocalEntryNotFoundError(
                    "An error happened while trying to locate the files on the Hub and we cannot find the appropriate"
                    " snapshot folder for the specified revision on the local disk. Please check your internet connection"
                    " and try again."
                ) from api_call_error
E               huggingface_hub.errors.LocalEntryNotFoundError: An error happened while trying to locate the files on the Hub and we cannot find the appropriate snapshot folder for the specified revision on the local disk. Please check your internet connection and try again.

@azzamh15

Can I work on this?

@seanlaii
Contributor

Hi @azzamh15, yes, feel free to take it. Thank you!

@azzamh15

/assign

@azzamh15

I tried reproducing the flaky behavior of TestDatasetIntegration.test_dataset_download on my local setup, but the test passed consistently across 10+ runs.
cc @andreyvelich

@seanlaii
Contributor

What do you think about adding a retry in case of an intermittent network issue?
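
A minimal sketch of what such a retry could look like, assuming the download step can be wrapped in a zero-argument callable. The helper name `retry_with_backoff`, its parameters, and the default exception types are hypothetical and are not taken from the Trainer code base; the real test would need to pass whichever exception types the download actually raises.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fn: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), retrying with exponential backoff on the given exception types.

    Hypothetical helper: the defaults are placeholders, not values agreed on
    in this issue.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            # Give up after the last attempt and surface the original error.
            if attempt == max_attempts:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

Exceptions outside `retry_on` propagate immediately, so a genuine test failure would not be hidden by the retry.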

@izuku-sds

izuku-sds commented Mar 18, 2025

Is this issue resolved or still open? (The latest run of this test was successful.)
If it is still not resolved, please describe the steps to reproduce it locally.
cc @tenzen-y

Edit:
I got a similar error when running the test with no internet connection. What is the expected behaviour in this case?

https://github.com/kubeflow/trainer/actions/runs/13595830014/job/38012401152#step:5:1407
Connection issue or Hub downtime.

@azzamh15

> What do you think about adding a retry in case of an intermittent network issue?

Okay, I will look into it.

@tenzen-y
Member Author

@azzamh15

I will try reproducing this locally again by throttling my network. If this is a network-related issue (which seems likely), what should be the expected behavior in such cases? Should the test fail, retry, or handle it differently? Let me know your thoughts.
cc @tenzen-y @andreyvelich

@tenzen-y
Member Author

> I will try reproducing this locally again by throttling my network. If this is a network-related issue (which seems likely), what should be the expected behavior in such cases? Should the test fail, retry, or handle it differently? Let me know your thoughts. cc @tenzen-y @andreyvelich

Thank you for investigating that. What kind of network error is it? Could we retry only the model download, in the test code only, and only when the download itself errors?
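
One way to scope the retry to download failures only, sketched under assumptions: the class names `HfHubHTTPError` and `LocalEntryNotFoundError` come from the traceback earlier in this thread, while `is_download_error` and `retry_download` are hypothetical helpers. Matching by class name only keeps the sketch free of third-party imports; real test code would more likely import the exception classes and catch them directly.

```python
import functools
import time

# Class names taken from the traceback in this thread. Matching by name is an
# assumption made to keep this sketch self-contained.
DOWNLOAD_ERROR_NAMES = frozenset({"HfHubHTTPError", "LocalEntryNotFoundError"})


def is_download_error(exc):
    """Return True if exc, or any exception it was raised from, is a download error."""
    seen = set()
    while exc is not None and id(exc) not in seen:
        seen.add(id(exc))
        # Check the whole class hierarchy so subclasses also match.
        if any(cls.__name__ in DOWNLOAD_ERROR_NAMES for cls in type(exc).__mro__):
            return True
        # Follow "raise ... from ..." chains, as seen in the traceback above.
        exc = exc.__cause__
    return False


def retry_download(max_attempts=3, delay=2.0, sleep=time.sleep):
    """Hypothetical decorator: retry the wrapped call only on download errors."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception as exc:
                    # Assertion failures and other errors are re-raised at once.
                    if attempt == max_attempts or not is_download_error(exc):
                        raise
                    sleep(delay)
        return wrapper
    return decorator
```

Applied only to the helper that fetches the model or dataset inside the test, this would leave every other failure mode of the test unchanged.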
