Commit 4849a5e
Optimize task creation from CS without manifest (#9827)
Related: #9757
When a raw images task is created from a cloud storage (CS) without a
manifest attached, CVAT downloads image headers to get the image
resolution. This operation can be quite time-consuming for big tasks,
but it can be optimized quite simply.
- Improved performance of CS image header downloading and manifest
creation by ~2-6x
The chunk size previously used in the downloader was 64KB. For most image
formats, the required information is available in the first 1KB, while
64KB can be the size of the whole file. It is tempting to change it to a
lower value, e.g. 1500 bytes (the default Ethernet v2 MTU size), and this
works fine, except for JPEGs that include an embedded thumbnail (preview)
image in the header, which can be of practically any size. <s>Probably,
this can be implemented with a more advanced JPEG parser. It doesn't seem
reasonable to use the reduced chunk size and download the whole image for
such images, as such JPEGs seem to be quite common, but maybe it can be
implemented as an exception just for the JPEG format.</s> Now, multiple
header sizes are attempted per file.
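The "multiple header sizes" idea can be sketched roughly as follows. This is a minimal illustration, not CVAT's actual implementation: the size schedule `HEADER_SIZES`, the `fetch` callback, and the simplified `jpeg_size` parser are all assumptions made for the example.

```python
import struct
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical schedule: one Ethernet MTU, then the old 64KB chunk, then the
# whole object (None). The values used in CVAT may differ.
HEADER_SIZES: Sequence[Optional[int]] = (1500, 64 * 1024, None)

# JPEG start-of-frame markers, which carry the image dimensions.
SOF_MARKERS = {0xC0, 0xC1, 0xC2, 0xC3, 0xC5, 0xC6, 0xC7,
               0xC9, 0xCA, 0xCB, 0xCD, 0xCE, 0xCF}


def jpeg_size(data: bytes) -> Optional[Tuple[int, int]]:
    """Return (width, height) if a SOF segment is fully present, else None.

    Simplified parser: it ignores fill bytes and other rarely used features.
    """
    if data[:2] != b"\xff\xd8":  # every JPEG starts with the SOI marker
        return None
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            return None  # not positioned at a marker, give up
        marker = data[pos + 1]
        (seg_len,) = struct.unpack(">H", data[pos + 2 : pos + 4])
        if marker in SOF_MARKERS:
            if pos + 9 > len(data):
                return None  # the SOF segment itself is truncated
            height, width = struct.unpack(">HH", data[pos + 5 : pos + 9])
            return width, height
        pos += 2 + seg_len  # skip the segment, e.g. APP1 with a thumbnail
    return None  # ran out of data before reaching a SOF segment


def read_size(fetch: Callable[[Optional[int]], bytes]) -> Tuple[int, int]:
    """Try progressively larger prefixes until the dimensions can be parsed."""
    for size in HEADER_SIZES:
        result = jpeg_size(fetch(size))
        if result is not None:
            return result
    raise ValueError("could not find image dimensions")


# Synthetic JPEG: SOI, a 4000-byte APP1 payload standing in for an embedded
# thumbnail, then a SOF0 segment declaring a 640x480 image.
blob = (
    b"\xff\xd8"
    + b"\xff\xe1" + struct.pack(">H", 4002) + bytes(4000)
    + b"\xff\xc0" + struct.pack(">HBHH", 17, 8, 480, 640) + bytes(10)
)
requested = []


def fetch(size: Optional[int]) -> bytes:
    requested.append(size)
    return blob[:size]  # stands in for a ranged cloud storage read


print(read_size(fetch), requested)  # (640, 480) [1500, 65536]
```

In this synthetic case, the 1500-byte attempt ends inside the thumbnail segment, so the second, larger request is needed. Files without a large embedded preview would be resolved by the first small read.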
AWS connection limits are not fixed and depend on the data filenames. In
the worst case, we can expect about 100 connections per prefix; in the
best case (random prefixes), the number is effectively unlimited. It's
also possible to get throttled by AWS (e.g. 503 Slow Down); this should
be handled by the boto library itself.
https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance.html
https://stackoverflow.com/questions/37432285/maximum-no-of-connections-that-can-be-held-by-s3
https://repost.aws/knowledge-center/http-5xx-errors-s3
https://boto3.amazonaws.com/v1/documentation/api/latest/guide/retries.html
Details:
test dataset: 26822 .jpg images (~10 GB)
baseline: 550s
with queue: 320 - 350s
with reduced chunk size (1 MTU): 220 - 280s
with improved connection reuse for AWS: up to 82s (64 connections for 16
cores, up from 10 kept alive with the default config)
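
For reference, the timings above translate into the following speedups over the baseline. The snippet only does arithmetic on the numbers listed; the dictionary keys are labels made up for the example.

```python
BASELINE_S = 550

# (fastest, slowest) measured time in seconds for each configuration
timings_s = {
    "queue": (320, 350),
    "reduced chunk size": (220, 280),
    "connection reuse": (82, 82),
}

# Speedup range relative to the baseline, from slowest to fastest run
speedups = {
    name: (round(BASELINE_S / slow, 1), round(BASELINE_S / fast, 1))
    for name, (fast, slow) in timings_s.items()
}

for name, (low, high) in speedups.items():
    print(f"{name}: {low}x - {high}x")
# queue: 1.6x - 1.7x
# reduced chunk size: 2.0x - 2.5x
# connection reuse: 6.7x - 6.7x
```

The last two rows roughly match the ~2-6x improvement quoted in the summary.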
Sample script for testing:
```python
from time import perf_counter
from tempfile import TemporaryDirectory

from tqdm import tqdm

from cvat.apps.engine import models, cloud_provider
from utils.dataset_manifest.core import ImageManifestManager

# Replace `yourcs` with the id of your cloud storage
cloud_storage = models.CloudStorage.objects.get(id=yourcs)
storage_client = cloud_provider.db_storage_to_storage_instance(cloud_storage)

media = [v["name"] for v in storage_client.list_files(prefix="images/", _use_flat_listing=True)]

header_downloader = cloud_provider.HeaderFirstMediaDownloader.create(
    models.DimensionType.DIM_2D, client=storage_client
)

content_generator = (
    v
    for v in tqdm(
        storage_client.bulk_download_to_memory(media, object_downloader=header_downloader.download),
        total=len(media),
    )
)

with TemporaryDirectory() as tempdir:
    start_time = perf_counter()
    manifest = ImageManifestManager(tempdir, upload_dir=tempdir, create_index=False)
    manifest.link(
        sources=content_generator,
        stop=len(media) - 1,
        DIM_3D=False,
    )
    manifest.create()
    duration = perf_counter() - start_time

    print(
        f"Manifest for {len(media)} files created in",
        duration,
        "seconds",
        f"avg. {duration / (len(media) or 1)}s.",
    )

# run with
# cat test_cs_downloading.py | python manage.py shell
```
1 parent 4f5be6c commit 4849a5e
2 files changed, +106 -42 lines changed:
- changelog.d
- cvat/apps/engine