feat: ROOT-11: Support reading JSONL from source cloud storages #7555
base: develop
…k to task within file of tasks on cloud storage
Co-authored-by: Jo Booth <[email protected]>
✅ Deploy Preview for label-studio-docs-new-theme canceled.
✅ Deploy Preview for label-studio-storybook canceled.
✅ Deploy Preview for heartex-docs canceled.
/fm sync
label_studio/core/settings/base.py
Outdated
@@ -598,6 +598,7 @@
  MEMBER_PERM = 'core.api_permissions.MemberHasOwnerPermission'
  RECALCULATE_ALL_STATS = None
  GET_STORAGE_LIST = 'io_storages.functions.get_storage_list'
+ STORAGE_LOAD_TASKS_JSON = 'io_storages.utils._load_tasks_json'
Usually the meaning of a function or method that starts with a `_` is "this should not be used outside of this class/module/whatever". It seems to me that we are treating this function as more of a public one, since it's being referred to in this other file. Would it be a pain to rename?
label_studio/io_storages/utils.py
Outdated
@dataclass
class StorageObjectParams:
I think `StorageObject` would be a better name for this, since it actually contains the task now.
label_studio/io_storages/utils.py
Outdated
@classmethod
def bulk_create(
    cls, task_datas: list[dict], key, row_idxs: list[int] | None = None, row_groups: list[int] | None = None
I would prefer `row_indexes` - I'm generally in favor of avoiding abbreviations in variable names unless they're truly ubiquitous.
Note for posterity:
Co-authored-by: Jo Booth <[email protected]>
@@ -49,6 +49,7 @@ dependencies = [
  "ordered-set (==4.0.2)",
  "pandas (>=2.2.3)",
  "psycopg2-binary (==2.9.10)",
+ "pyarrow (>=18.0.0,<19.0.0)",
Suggested change:
- "pyarrow (>=18.0.0,<19.0.0)",
+ "pyarrow (>=18.1.0,<19.0.0)",
I would recommend we don't allow versions older than the one in poetry.lock. Note that this will require a relock.
label_studio/io_storages/utils.py
Outdated
_error_wrapper()


def load_tasks_json(blob_str: str, key: str) -> tuple[list[dict], list[StorageObjectParams]]:
Suggested change:
- def load_tasks_json(blob_str: str, key: str) -> tuple[list[dict], list[StorageObjectParams]]:
+ def load_tasks_json(blob_str: str, key: str) -> list[StorageObjectParams]:
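For readers following the thread, here is a minimal sketch of what the single-return-value shape could look like once each object carries its own task data. It is not the PR's implementation: the `StorageObject` field names (`task_data`, `key`, `row_index`) are assumptions for illustration, and stdlib `json` stands in for the pyarrow parsing the PR actually uses.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class StorageObject:
    # Hypothetical fields: the real class in the PR may differ.
    task_data: dict
    key: str
    row_index: Optional[int] = None


def load_tasks_json(blob_str: str, key: str) -> list[StorageObject]:
    """Parse a blob holding one JSON task, a JSON list of tasks, or JSONL."""
    try:
        parsed = json.loads(blob_str)
    except json.JSONDecodeError:
        # Fall back to JSONL: one JSON document per non-empty line.
        parsed = [json.loads(line) for line in blob_str.splitlines() if line.strip()]
    if isinstance(parsed, dict):
        # A single task needs no row index to be located within its file.
        return [StorageObject(task_data=parsed, key=key)]
    return [
        StorageObject(task_data=task, key=key, row_index=index)
        for index, task in enumerate(parsed)
    ]
```

Because the task data travels with the link details, callers iterate over one list instead of keeping two parallel lists in step.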
@@ -515,6 +523,7 @@ def sync(self):
  self.info_set_queued()
  import_sync_background(self.__class__, self.id)
  except Exception:
+ logger.debug(f'Storage {self} failed', exc_info=True)
Suggested change:
- logger.debug(f'Storage {self} failed', exc_info=True)
+ # needed to facilitate debugging storage-related testcases, since otherwise no exception is logged
+ logger.debug(f'Storage {self} failed', exc_info=True)
This is a rare case where I'm in favor of a comment - the reason it's here is fairly subtle and not at all obvious from the context
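To illustrate why the logging call matters, here is a small standalone sketch of the pattern. It is a simplification, not the project's `sync` method: the logger name, the `start_import` callable, and the `capture_debug_output` helper are all hypothetical. It shows that `exc_info=True` inside an `except` block attaches the full traceback to the debug record, which would otherwise be lost when the exception is swallowed.

```python
import io
import logging

logger = logging.getLogger("storage_sync_example")  # hypothetical logger name


def sync(start_import):
    try:
        start_import()
    except Exception:
        # Without this call the exception is swallowed and nothing is logged.
        logger.debug("Storage sync failed", exc_info=True)


def capture_debug_output() -> str:
    """Run a failing sync and return everything the logger emitted."""
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG)

    def failing_import():
        raise ValueError("bad blob")

    try:
        sync(failing_import)
    finally:
        logger.removeHandler(handler)
    return stream.getvalue()
```

The captured text contains both the message and the traceback ending in the `ValueError`, which is the information a failing storage test case needs.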
@@ -431,7 +438,8 @@ def _scan_and_create_links(self, link_class):

  logger.debug(f'{self}: found new key {key}')
  try:
- tasks_data = self.get_data(key)
+ # list of (task data + ImportStorageLink details)
+ links_params = self.get_data(key)
With the change to `StorageObject`, this can be `storage_objects`, which I think will be a lot more readable than `links_params` vs `link_params` (although, valiant effort at making the plurals situation readable enough!)
@@ -431,7 +438,8 @@ def _scan_and_create_links(self, link_class):

  logger.debug(f'{self}: found new key {key}')
  try:
- tasks_data = self.get_data(key)
+ # list of (task data + ImportStorageLink details)
# list of (task data + ImportStorageLink details)
[x] Factor out JSON parsing and validation logic from import storages into a function that can be hot-swapped in LSO/LSE
[x] add it to settings as an app-specific variable
[x] replace stdlib parsing with pyarrow parsing to handle JSONL as well as JSON
[x] test coverage for JSONL
[x] feature flag
[x] unskip multitask import tests in LSO, to prepare for testing different behavior in LSE and LSO
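The first two checklist items describe a common pattern: store a dotted path in settings and resolve it at runtime, so another edition can point the setting at a different function. Below is a minimal sketch of that idea using only the standard library. The `SETTINGS` dict and the `load_func` helper are stand-ins invented for this example, and `json.loads` is only a placeholder target so the sketch runs without the project installed.

```python
import importlib
import json

# Stand-in for a settings module; the dotted path is a placeholder target.
SETTINGS = {"STORAGE_LOAD_TASKS_JSON": "json.loads"}


def load_func(dotted_path: str):
    """Resolve 'package.module.attr' to the object it names."""
    module_path, _, attr = dotted_path.rpartition(".")
    return getattr(importlib.import_module(module_path), attr)


# Callers look the parser up through settings instead of importing it directly,
# so overriding the setting swaps the implementation without touching callers.
parse_tasks = load_func(SETTINGS["STORAGE_LOAD_TASKS_JSON"])
```

In a Django project, `django.utils.module_loading.import_string` performs the same resolution and would likely be preferable to a hand-rolled helper.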
Implementation notes: