-
Notifications
You must be signed in to change notification settings - Fork 95
refactor(libcommon): consolidate query() and query_truncated_binary() methods
#3253
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
query() and query_truncated_binary() methods
refactor(libcommon): remove unused hf_token in `parquet_utils` refactor(libcommon): remove the effectively unused arguments of `Indexer` (#3250) * refactor(libcommon): remove the effectively unused arguments of `Indexer` * style: remove unnecessarry imports * refactor(libcommon): remove `unsupported_features` argument from `RowsIndex` * style: remove unnecessarry imports refactor(libcommon): remove the effectively unused arguments of `Indexer` style: remove unnecessarry imports refactor(libcommon): remove `Indexer` refactor(services): directly create `RowsIndex` instead of `Indexer` test(libcommon): fix `test_rows_index_query_with_empty_dataset` to use `ds_empty` chore: missing import and mypy types style: fix import order fix(libcommon): cache the latest instance of `RowsIndex` test(libcommon): add a test for caching the latest RowsIndex instance fix(libcommon): only cache RowsIndex when serving from the rows endpoint test(libcommon): remove previously added test case for caching RowIndex instances chore: missing type annotations refactor(libcommon): remove now obsolete `get_supported_unsupported_columns\(\)` function style(libcommon): remove unnecessary imports chore: remove reduntant iterations
… to be used later by libviewer
b807572 to
0713677
Compare
| return is_list | ||
|
|
||
|
|
||
| def truncate_binary_columns(table: pa.Table, max_binary_length: int, features: Features) -> tuple[pa.Table, list[str]]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I applied this change in the libviewer PR as well but pulling this out here to make the reviewer process easier.
| ) | ||
|
|
||
| index = ParquetIndexWithMetadata( | ||
| files=[file_metadata], # type: ignore[list-item] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ParquetFileMetadataItem imported here is an exact duplicate of the ParquetFileMetadataItem in libcommon, but mypy complains about the type mismatch.
|
|
||
| @dataclass | ||
| class ParquetIndexWithMetadata: | ||
| files: list[ParquetFileMetadataItem] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Storing the list of ParquetFileMetadataItem here directly since the previously extracted field lists didn't provide any advantage but rather noise.
lhoestq
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm !
| method="parquet_index_with_metadata.query", step="load the remote parquet files using metadata from disk" | ||
| ): | ||
| parquet_files = [ | ||
| pq.ParquetFile( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pq.ParquetFile can accept a filesystem argument so I think it could be further simplified and the mocking in the tests could be simpler by passing a local filesystem rather than a network filesystem but leaving it as is since hopefully we can remove it entirely.
| row_group_readers, first_row_group_id, last_row_group_id, all_columns, binary_columns | ||
| ) | ||
| else: | ||
| pa_table, truncated_columns = self._read_without_binary( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Theoretically we could remove this branch as well, at most there would be negligible performance overhead, but trying to keep the old behavior.
Depends on #3252
The
.query()method was only used in the tests, so I removed it in favor ofquery_truncated_binary(). I also factored out the duplicated steps and simplified some additional states to make it more uniform with the upcoming libviewer code.