
Commit 774a0d8

Merge branch 'main' into mc/rm_lock
2 parents 2b025e0 + 014e9d6 commit 774a0d8

File tree

11 files changed (+67, −83 lines)


.github/workflows/python-tests.yml

Lines changed: 2 additions & 0 deletions
@@ -56,6 +56,7 @@ jobs:
           case "${{ matrix.test_name }}" in

             "Repository only" | "Everything else")
+              sudo apt update
               sudo apt install -y libsndfile1-dev
               ;;

@@ -69,6 +70,7 @@ jobs:
               ;;

             tensorflow)
+              sudo apt update
               sudo apt install -y graphviz
               pip install .[tensorflow]
               ;;

docs/source/en/guides/upload.md

Lines changed: 2 additions & 56 deletions
@@ -435,62 +435,8 @@ For more detailed information, take a look at the [`HfApi`] reference.

 There are some limitations to be aware of when dealing with a large amount of data in your repo. Given the time it takes to stream the data,
 getting an upload/push to fail at the end of the process or encountering a degraded experience, be it on hf.co or when working locally, can be very annoying.
-We gathered a list of tips and recommendations for structuring your repo.
-
-
-| Characteristic     | Recommended | Tips                                                    |
-| ------------------ | ----------- | ------------------------------------------------------- |
-| Repo size          | -           | contact us for large repos (TBs of data)                 |
-| Files per repo     | <100k       | merge data into fewer files                              |
-| Entries per folder | <10k        | use subdirectories in repo                               |
-| File size          | <5GB        | split data into chunked files                            |
-| Commit size        | <100 files* | upload files in multiple commits                         |
-| Commits per repo   | -           | upload multiple files per commit and/or squash history   |
-
-_* Not relevant when using `git` CLI directly_
-
-Please read the next section to understand better those limits and how to deal with them.
-
-### Hub repository size limitations
-
-What are we talking about when we say "large uploads", and what are their associated limitations? Large uploads can be
-very diverse, from repositories with a few huge files (e.g. model weights) to repositories with thousands of small files
-(e.g. an image dataset).
-
-Under the hood, the Hub uses Git to version the data, which has structural implications on what you can do in your repo.
-If your repo is crossing some of the numbers mentioned in the previous section, **we strongly encourage you to check out [`git-sizer`](https://github.com/github/git-sizer)**,
-which has very detailed documentation about the different factors that will impact your experience. Here is a TL;DR of factors to consider:
-
-- **Repository size**: The total size of the data you're planning to upload. There is no hard limit on a Hub repository size. However, if you plan to upload hundreds of GBs or even TBs of data, we would appreciate it if you could let us know in advance so we can better help you if you have any questions during the process. You can contact us at [email protected] or on [our Discord](http://hf.co/join/discord).
-- **Number of files**:
-    - For optimal experience, we recommend keeping the total number of files under 100k. Try merging the data into fewer files if you have more.
-      For example, json files can be merged into a single jsonl file, or large datasets can be exported as Parquet files.
-    - The maximum number of files per folder cannot exceed 10k files per folder. A simple solution is to
-      create a repository structure that uses subdirectories. For example, a repo with 1k folders from `000/` to `999/`, each containing at most 1000 files, is already enough.
-- **File size**: In the case of uploading large files (e.g. model weights), we strongly recommend splitting them **into chunks of around 5GB each**.
-  There are a few reasons for this:
-    - Uploading and downloading smaller files is much easier both for you and the other users. Connection issues can always
-      happen when streaming data and smaller files avoid resuming from the beginning in case of errors.
-    - Files are served to the users using CloudFront. From our experience, huge files are not cached by this service
-      leading to a slower download speed.
-  In all cases no single LFS file will be able to be >50GB. I.e. 50GB is the hard limit for single file size.
-- **Number of commits**: There is no hard limit for the total number of commits on your repo history. However, from
-  our experience, the user experience on the Hub starts to degrade after a few thousand commits. We are constantly working to
-  improve the service, but one must always remember that a git repository is not meant to work as a database with a lot of
-  writes. If your repo's history gets very large, it is always possible to squash all the commits to get a
-  fresh start using [`super_squash_history`]. This is a non-revertible operation.
-- **Number of operations per commit**: Once again, there is no hard limit here. When a commit is uploaded on the Hub, each
-  git operation (addition or delete) is checked by the server. When a hundred LFS files are committed at once,
-  each file is checked individually to ensure it's been correctly uploaded. When pushing data through HTTP with `huggingface_hub`,
-  a timeout of 60s is set on the request, meaning that if the process takes more time, an error is raised
-  client-side. However, it can happen (in rare cases) that even if the timeout is raised client-side, the process is still
-  completed server-side. This can be checked manually by browsing the repo on the Hub. To prevent this timeout, we recommend
-  adding around 50-100 files per commit.
-
-### Practical tips
-
-Now that we've seen the technical aspects you must consider when structuring your repository, let's see some practical
-tips to make your upload process as smooth as possible.
+
+Check out our [Repository limitations and recommendations](https://huggingface.co/docs/hub/repositories-recommendations) guide for best practices on how to structure your repositories on the Hub. Next, let's move on with some practical tips to make your upload process as smooth as possible.

 - **Start small**: We recommend starting with a small amount of data to test your upload script. It's easier to iterate
   on a script when failing takes only a little time.
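The guidance now linked from this page still recommends splitting large files into chunks of around 5GB before uploading. As a rough, hypothetical illustration of that advice (not part of this commit), the sketch below splits a local file into `.partNN` chunks and uploads each one; the file name and `repo_id` are placeholders.

```python
import os
from huggingface_hub import HfApi

CHUNK_SIZE = 5 * 1024**3   # ~5GB per chunk, as recommended in the linked guide
BLOCK = 64 * 1024**2       # copy in 64MB blocks to keep memory usage low

def split_into_chunks(path: str, chunk_size: int = CHUNK_SIZE) -> list:
    """Write `<path>.partNN` files of at most `chunk_size` bytes and return their paths."""
    chunk_paths = []
    with open(path, "rb") as src:
        index = 0
        while True:
            written = 0
            chunk_path = f"{path}.part{index:02d}"
            with open(chunk_path, "wb") as dst:
                while written < chunk_size:
                    block = src.read(min(BLOCK, chunk_size - written))
                    if not block:
                        break
                    dst.write(block)
                    written += len(block)
            if written == 0:  # nothing left to copy: drop the empty chunk and stop
                os.remove(chunk_path)
                break
            chunk_paths.append(chunk_path)
            index += 1
    return chunk_paths

api = HfApi()
for chunk in split_into_chunks("pytorch_model.bin"):  # placeholder local file
    api.upload_file(
        path_or_fileobj=chunk,
        path_in_repo=os.path.basename(chunk),
        repo_id="username/my-large-model",  # placeholder repo
    )
```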

src/huggingface_hub/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -46,7 +46,7 @@
 from typing import TYPE_CHECKING


-__version__ = "0.17.0.dev0"
+__version__ = "0.18.0.dev0"

 # Alphabetical order of definitions is ensured in tests
 # WARNING: any comment added in this dictionary definition will be lost when

src/huggingface_hub/commands/upload.py

Lines changed: 13 additions & 9 deletions
@@ -51,7 +51,7 @@
 from huggingface_hub import logging
 from huggingface_hub._commit_scheduler import CommitScheduler
 from huggingface_hub.commands import BaseHuggingfaceCLICommand
-from huggingface_hub.hf_api import create_repo, upload_file, upload_folder
+from huggingface_hub.hf_api import HfApi
 from huggingface_hub.utils import disable_progress_bars, enable_progress_bars

@@ -134,7 +134,7 @@ def __init__(self, args: Namespace) -> None:
         self.commit_message: Optional[str] = args.commit_message
         self.commit_description: Optional[str] = args.commit_description
         self.create_pr: bool = args.create_pr
-        self.token: Optional[str] = args.token
+        self.api: HfApi = HfApi(token=args.token, library_name="huggingface-cli")
         self.quiet: bool = args.quiet  # disable warnings and progress bars

         # Check `--every` is valid
@@ -222,7 +222,7 @@ def _upload(self) -> str:
                 path_in_repo=path_in_repo,
                 private=self.private,
                 every=self.every,
-                token=self.token,
+                hf_api=self.api,
             )
             print(f"Scheduling commits every {self.every} minutes to {scheduler.repo_id}.")
             try:  # Block main thread until KeyboardInterrupt
@@ -235,33 +235,37 @@ def _upload(self) -> str:
         # Otherwise, create repo and proceed with the upload
         if not os.path.isfile(self.local_path) and not os.path.isdir(self.local_path):
             raise FileNotFoundError(f"No such file or directory: '{self.local_path}'.")
-        repo_id = create_repo(
-            repo_id=self.repo_id, repo_type=self.repo_type, exist_ok=True, private=self.private, token=self.token
+        repo_id = self.api.create_repo(
+            repo_id=self.repo_id,
+            repo_type=self.repo_type,
+            exist_ok=True,
+            private=self.private,
+            space_sdk="gradio" if self.repo_type == "space" else None,
+            # ^ We don't want it to fail when uploading to a Space => let's set Gradio by default.
+            # ^ I'd rather not add CLI args to set it explicitly as we already have `huggingface-cli repo create` for that.
         ).repo_id

         # File-based upload
         if os.path.isfile(self.local_path):
-            return upload_file(
+            return self.api.upload_file(
                 path_or_fileobj=self.local_path,
                 path_in_repo=self.path_in_repo,
                 repo_id=repo_id,
                 repo_type=self.repo_type,
                 revision=self.revision,
-                token=self.token,
                 commit_message=self.commit_message,
                 commit_description=self.commit_description,
                 create_pr=self.create_pr,
             )

         # Folder-based upload
         else:
-            return upload_folder(
+            return self.api.upload_folder(
                 folder_path=self.local_path,
                 path_in_repo=self.path_in_repo,
                 repo_id=repo_id,
                 repo_type=self.repo_type,
                 revision=self.revision,
-                token=self.token,
                 commit_message=self.commit_message,
                 commit_description=self.commit_description,
                 create_pr=self.create_pr,
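With this change the CLI routes every call through a single `HfApi` instance configured once with the token and library name, instead of passing `token=` to module-level helpers. A minimal sketch of the equivalent programmatic flow follows; it is for illustration only, and the repo id and file name are placeholders.

```python
from huggingface_hub import HfApi

# One client object holds the token and user-agent metadata for all calls.
api = HfApi(token=None, library_name="huggingface-cli")  # token=None -> use the cached login

# Create the repo if needed; for a Space, the CLI now defaults the SDK to Gradio.
repo_id = api.create_repo(
    repo_id="username/my-model",  # placeholder
    repo_type="model",
    exist_ok=True,
    private=False,
    space_sdk=None,  # would be "gradio" if repo_type were "space"
).repo_id

# Upload without passing a token explicitly: the HfApi instance already has it.
api.upload_file(
    path_or_fileobj="README.md",  # placeholder local file
    path_in_repo="README.md",
    repo_id=repo_id,
    repo_type="model",
)
```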

src/huggingface_hub/file_download.py

Lines changed: 17 additions & 3 deletions
@@ -40,6 +40,7 @@
 )
 from .utils import (
     EntryNotFoundError,
+    FileMetadataError,
     GatedRepoError,
     LocalEntryNotFoundError,
     RepositoryNotFoundError,
@@ -700,7 +701,7 @@ def cached_download(
     # we fallback to the regular etag header.
     # If we don't have any of those, raise an error.
     if etag is None:
-        raise OSError(
+        raise FileMetadataError(
             "Distant resource does not have an ETag, we won't be able to reliably ensure reproducibility."
         )
     # We get the expected size of the file, to check the download went well.
@@ -1246,15 +1247,19 @@ def hf_hub_download(
             # Commit hash must exist
             commit_hash = metadata.commit_hash
             if commit_hash is None:
-                raise OSError("Distant resource does not seem to be on huggingface.co (missing commit header).")
+                raise FileMetadataError(
+                    "Distant resource does not seem to be on huggingface.co. It is possible that a configuration issue"
+                    " prevents you from downloading resources from https://huggingface.co. Please check your firewall"
+                    " and proxy settings and make sure your SSL certificates are updated."
+                )

             # Etag must exist
             etag = metadata.etag
             # We favor a custom header indicating the etag of the linked resource, and
             # we fallback to the regular etag header.
             # If we don't have any of those, raise an error.
             if etag is None:
-                raise OSError(
+                raise FileMetadataError(
                     "Distant resource does not have an ETag, we won't be able to reliably ensure reproducibility."
                 )
@@ -1293,12 +1298,21 @@ def hf_hub_download(
             # (if it's not the case, the error will be re-raised)
             head_call_error = error
             pass
+        except FileMetadataError as error:
+            # Multiple reasons for a FileMetadataError:
+            # - Wrong network configuration (proxy, firewall, SSL certificates)
+            # - Inconsistency on the Hub
+            # => let's switch to 'local_files_only=True' to check if the files are already cached.
+            # (if it's not the case, the error will be re-raised)
+            head_call_error = error
+            pass

     # etag can be None for several reasons:
     # 1. we passed local_files_only.
     # 2. we don't have a connection
     # 3. Hub is down (HTTP 500 or 504)
     # 4. repo is not found -for example private or gated- and invalid/missing token sent
+    # 5. Hub is blocked by a firewall or proxy is not set correctly.
     # => Try to get the last downloaded one from the specified revision.
     #
     # If the specified revision is a commit hash, look inside "snapshots".
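Downstream code can now distinguish metadata-resolution problems from other `OSError`s. A hedged sketch of how a caller might react to the new exception is shown below; the repo id and filename are placeholders, and note that `hf_hub_download` itself already falls back to the local cache before re-raising.

```python
from huggingface_hub import hf_hub_download
from huggingface_hub.utils import FileMetadataError

try:
    path = hf_hub_download(repo_id="username/my-model", filename="config.json")  # placeholders
except FileMetadataError as err:
    # Raised when the ETag or commit hash cannot be resolved and nothing usable is cached,
    # typically a proxy, firewall or SSL configuration problem rather than a missing file.
    print(f"Could not resolve file metadata: {err}")
else:
    print(f"Downloaded to {path}")
```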

src/huggingface_hub/hf_api.py

Lines changed: 13 additions & 1 deletion
@@ -2540,7 +2540,19 @@ def create_repo(
             # See https://github.com/huggingface/huggingface_hub/pull/733/files#r820604472
             json["lfsmultipartthresh"] = self._lfsmultipartthresh  # type: ignore
         headers = self._build_hf_headers(token=token, is_write_action=True)
-        r = get_session().post(path, headers=headers, json=json)
+
+        while True:
+            r = get_session().post(path, headers=headers, json=json)
+            if r.status_code == 409 and "Cannot create repo: another conflicting operation is in progress" in r.text:
+                # Since https://github.com/huggingface/moon-landing/pull/7272 (private repo), it is not possible to
+                # concurrently create repos on the Hub for the same user. This is rarely an issue, except when running
+                # tests. To avoid any inconvenience, we retry creating the repo for this specific error.
+                # NOTE: This could have been fixed directly in the tests, but adding it here should fix CIs for all
+                # dependent libraries.
+                # NOTE: If a fix is implemented server-side, we should be able to remove this retry mechanism.
+                logger.debug("Create repo failed due to a concurrency issue. Retrying...")
+                continue
+            break

         try:
             hf_raise_for_status(r)
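The retry loop only absorbs the specific "conflicting operation in progress" 409; other errors still surface through `hf_raise_for_status`. A hypothetical sketch of the scenario it targets, concurrent repo creation for the same user (as happens in test suites), is shown below; the repo names are placeholders and the call requires a logged-in user.

```python
from concurrent.futures import ThreadPoolExecutor
from huggingface_hub import HfApi

api = HfApi()

def make_repo(i: int) -> str:
    # With the retry in place, transient 409 conflicts are handled transparently.
    return api.create_repo(repo_id=f"username/tmp-repo-{i}", exist_ok=True).repo_id  # placeholder names

with ThreadPoolExecutor(max_workers=4) as pool:
    created = list(pool.map(make_repo, range(4)))
print(created)
```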

src/huggingface_hub/utils/__init__.py

Lines changed: 1 addition & 0 deletions
@@ -32,6 +32,7 @@
 from ._errors import (
     BadRequestError,
     EntryNotFoundError,
+    FileMetadataError,
     GatedRepoError,
     HfHubHTTPError,
     LocalEntryNotFoundError,

src/huggingface_hub/utils/_errors.py

Lines changed: 7 additions & 0 deletions
@@ -5,6 +5,13 @@
 from ._fixes import JSONDecodeError


+class FileMetadataError(OSError):
+    """Error triggered when the metadata of a file on the Hub cannot be retrieved (missing ETag or commit_hash).
+
+    Inherits from `OSError` for backward compatibility.
+    """
+
+
 class HfHubHTTPError(HTTPError):
     """
     HTTPError to inherit from for any custom HTTP Error raised in HF Hub.
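Because the new exception subclasses `OSError`, existing `except OSError` handlers keep working while new code can catch the more specific type. A quick check of that backward-compatibility claim:

```python
from huggingface_hub.utils import FileMetadataError

assert issubclass(FileMetadataError, OSError)

try:
    raise FileMetadataError("missing ETag")
except OSError as err:  # pre-existing handlers still catch the new error
    print(f"caught as OSError: {err}")
```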

tests/test_cli.py

Lines changed: 9 additions & 11 deletions
@@ -85,7 +85,7 @@ def test_upload_basic(self) -> None:
         self.assertEqual(cmd.commit_description, None)
         self.assertEqual(cmd.create_pr, False)
         self.assertEqual(cmd.every, None)
-        self.assertEqual(cmd.token, None)
+        self.assertEqual(cmd.api.token, None)
         self.assertEqual(cmd.quiet, False)

     def test_upload_with_all_options(self) -> None:
@@ -135,7 +135,7 @@ def test_upload_with_all_options(self) -> None:
         self.assertEqual(cmd.commit_description, "My commit description")
         self.assertEqual(cmd.create_pr, True)
         self.assertEqual(cmd.every, 5)
-        self.assertEqual(cmd.token, "my-token")
+        self.assertEqual(cmd.api.token, "my-token")
         self.assertEqual(cmd.quiet, True)

     def test_upload_implicit_local_path_when_folder_exists(self) -> None:
@@ -211,8 +211,8 @@ def test_every_as_float(self) -> None:
         cmd = UploadCommand(self.parser.parse_args(["upload", DUMMY_MODEL_ID, ".", "--every", "0.5"]))
         self.assertEqual(cmd.every, 0.5)

-    @patch("huggingface_hub.commands.upload.upload_folder")
-    @patch("huggingface_hub.commands.upload.create_repo")
+    @patch("huggingface_hub.commands.upload.HfApi.upload_folder")
+    @patch("huggingface_hub.commands.upload.HfApi.create_repo")
     def test_upload_folder_mock(self, create_mock: Mock, upload_mock: Mock) -> None:
         with SoftTemporaryDirectory() as cache_dir:
             cmd = UploadCommand(
@@ -223,15 +223,14 @@ def test_upload_folder_mock(self, create_mock: Mock, upload_mock: Mock) -> None:
             cmd.run()

             create_mock.assert_called_once_with(
-                repo_id="my-model", repo_type="model", exist_ok=True, private=True, token=None
+                repo_id="my-model", repo_type="model", exist_ok=True, private=True, space_sdk=None
             )
             upload_mock.assert_called_once_with(
                 folder_path=cache_dir,
                 path_in_repo=".",
                 repo_id=create_mock.return_value.repo_id,
                 repo_type="model",
                 revision=None,
-                token=None,
                 commit_message=None,
                 commit_description=None,
                 create_pr=False,
@@ -240,8 +239,8 @@ def test_upload_folder_mock(self, create_mock: Mock, upload_mock: Mock) -> None:
                 delete_patterns=["*.json"],
             )

-    @patch("huggingface_hub.commands.upload.upload_file")
-    @patch("huggingface_hub.commands.upload.create_repo")
+    @patch("huggingface_hub.commands.upload.HfApi.upload_file")
+    @patch("huggingface_hub.commands.upload.HfApi.create_repo")
     def test_upload_file_mock(self, create_mock: Mock, upload_mock: Mock) -> None:
         with SoftTemporaryDirectory() as cache_dir:
             file_path = Path(cache_dir) / "file.txt"
@@ -254,21 +253,20 @@ def test_upload_file_mock(self, create_mock: Mock, upload_mock: Mock) -> None:
             cmd.run()

             create_mock.assert_called_once_with(
-                repo_id="my-dataset", repo_type="dataset", exist_ok=True, private=False, token=None
+                repo_id="my-dataset", repo_type="dataset", exist_ok=True, private=False, space_sdk=None
             )
             upload_mock.assert_called_once_with(
                 path_or_fileobj=str(file_path),
                 path_in_repo="logs/file.txt",
                 repo_id=create_mock.return_value.repo_id,
                 repo_type="dataset",
                 revision=None,
-                token=None,
                 commit_message=None,
                 commit_description=None,
                 create_pr=True,
             )

-    @patch("huggingface_hub.commands.upload.create_repo")
+    @patch("huggingface_hub.commands.upload.HfApi.create_repo")
     def test_upload_missing_path(self, create_mock: Mock) -> None:
         cmd = UploadCommand(self.parser.parse_args(["upload", "my-model", "/path/to/missing_file", "logs/file.txt"]))
         with self.assertRaises(FileNotFoundError):
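Since the CLI now calls bound methods, the tests patch the methods on the `HfApi` class rather than module-level functions. A minimal sketch of that patching pattern (the repo id is a placeholder): replacing the attribute on the class makes every instance use the mock, and because a plain `MagicMock` is not a descriptor, `self` does not appear in the recorded call.

```python
from unittest.mock import patch

from huggingface_hub import HfApi

with patch("huggingface_hub.hf_api.HfApi.create_repo") as create_mock:
    HfApi().create_repo(repo_id="username/demo", exist_ok=True)  # hits the mock, no network call

create_mock.assert_called_once_with(repo_id="username/demo", exist_ok=True)
```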

tests/test_inference_async_client.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -212,7 +212,7 @@ async def test_get_status_too_big_model() -> None:
212212

213213
@pytest.mark.asyncio
214214
async def test_get_status_loaded_model() -> None:
215-
model_status = await AsyncInferenceClient().get_model_status("bigcode/starcoder")
215+
model_status = await AsyncInferenceClient().get_model_status("bigscience/bloom")
216216
assert model_status.loaded is True
217217
assert model_status.state == "Loaded"
218218
assert model_status.compute_type == "gpu"
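The test now targets a model that is currently kept loaded on the Inference API. A hedged sketch of the async call it exercises; which models are warm changes over time, so the model id may need to be swapped when run.

```python
import asyncio

from huggingface_hub import AsyncInferenceClient

async def main() -> None:
    status = await AsyncInferenceClient().get_model_status("bigscience/bloom")
    print(status.loaded, status.state, status.compute_type)

asyncio.run(main())
```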

tests/test_inference_client.py

Lines changed: 1 addition & 1 deletion
@@ -519,7 +519,7 @@ def test_too_big_model(self) -> None:

     def test_loaded_model(self) -> None:
         client = InferenceClient()
-        model_status = client.get_model_status("bigcode/starcoder")
+        model_status = client.get_model_status("bigscience/bloom")
         self.assertTrue(model_status.loaded)
         self.assertEqual(model_status.state, "Loaded")
         self.assertEqual(model_status.compute_type, "gpu")
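The synchronous client is queried the same way; again, the model id is just an example of one that was loaded at the time of the change.

```python
from huggingface_hub import InferenceClient

status = InferenceClient().get_model_status("bigscience/bloom")
print(status.loaded, status.state, status.compute_type)
```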
