Skip to content

Conversation

@SpecLad
Copy link
Contributor

@SpecLad SpecLad commented Nov 14, 2025

Motivation and context

Let's say we have a cloud storage-based task that was created with 6 input images, and the following settings: start frame 1, stop frame 4, frame step 2. A heavyweight backup of such a task will then contain only frames 1 and 3 in the data directory; however, the manifest will still contain all 6 frames. If you try to restore such a backup, CVAT will fail, because it checks that the files in the data directory correspond 1:1 to the manifest.

This could potentially be fixed in the restore code; however, it seems to me that the backups created in this case are incorrect, as they have manifests referencing nonexistent files. As such, I think it's more appropriate to fix this in the backup code.

The fix is to filter the manifest during backup, leaving only entries corresponding to frames that actually get backed up. We also need to reset the frame range to the default, so that it matches the filtered manifest.

Note that the same bug also affects backups of tasks based on the mounted share. It should be reasonably easy to fix (just use _write_filtered_media_manifest in the StorageChoice.SHARE branch), but I cannot test such a fix, because share backups are currently broken entirely. So I will defer this fix until #9972.

This bug does not affect backups of tasks based on local files, because those currently include all frames, even those that fall outside that frame range. I think that behavior is probably wrong, but it doesn't require immediate fixing.

How has this been tested?

Unit tests. Also manually.

Checklist

  • I submit my changes into the develop branch
  • I have created a changelog fragment
  • [ ] I have updated the documentation accordingly
  • I have added tests to cover my changes
  • [ ] I have linked related issues (see GitHub docs)

License

  • I submit my code changes under the same MIT License that covers the project.
    Feel free to contact the maintainers if that's a concern.

@SpecLad SpecLad force-pushed the heavyweight-backup-sparse branch from d1c7b51 to aeccf57 Compare November 14, 2025 12:56
@SpecLad SpecLad marked this pull request as ready for review November 14, 2025 12:58
@SpecLad SpecLad force-pushed the heavyweight-backup-sparse branch from aeccf57 to bc066b3 Compare November 14, 2025 15:49
…rame range

Let's say we have a cloud storage-based task that was created with 6 input
images, and the following settings: start frame 1, stop frame 4, frame step
2. A heavyweight backup of such a task will then contain only frames 1 and 3
in the `data` directory; however, the manifest will still contain all 6
frames. If you try to restore such a backup, CVAT will fail, because it
checks that the files in the `data` directory correspond 1:1 to the
manifest.

This could potentially be fixed in the restore code; however, it seems to me
that the backups created in this case are incorrect, as they have manifests
referencing nonexistent files. As such, I think it's more appropriate to fix
this in the backup code.

The fix is to filter the manifest during backup, leaving only entries
corresponding to frames that actually get backed up. We also need to reset
the frame range to the default, so that it matches the filtered manifest.

Note that the same bug also affects backups of tasks based on the mounted
share. It should be reasonably easy to fix (just use
`_write_filtered_media_manifest` in the `StorageChoice.SHARE` branch), but I
cannot test such a fix, because share backups are currently broken entirely.
So I will defer this fix until cvat-ai#9972.
@SpecLad SpecLad force-pushed the heavyweight-backup-sparse branch from bc066b3 to c37def9 Compare November 14, 2025 17:41
@sonarqubecloud
Copy link

target_dir=target_dir,
)

def _write_filtered_media_manifest(self, zip_object: ZipFile, target_dir: str) -> None:
Copy link
Contributor

@zhiltsov-max zhiltsov-max Nov 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like these frame filtering parameters (start, stop, and step) only make sense and should only apply to video tasks. This might be a big change for a bug fix, but could you check if we can move this into task creation instead? I'd probably prefer having lightweight and regular backups with the same data configurations.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like these frame filtering parameters (start, stop, and step) only make sense and should only apply to video tasks.

I don't think I agree. For example, when a user creates a task, they can select a directory full of images and then set these parameters to only use a subset of them.

Also, regardless of whether creating tasks like this should be supported, de facto it already works. So we should be able to back up and restore such tasks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I agree. For example, when a user creates a task, they can select a directory full of images and then set these parameters to only use a subset of them.

My point was only that we probably should stop including all the unused images into task manifests for non-video tasks. Basically, do the same manifest filtering that was added in this PR, but during the task creation after we apply all the frame filters (these parameters, wildcards etc.).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants