Skip to content

Conversation

@kszucs
Copy link
Member

@kszucs kszucs commented Nov 4, 2025

Depends on #3252

The .query() method was only used in the tests, so I removed it in favor of query_truncated_binary(). I also factored out the duplicated steps and simplified some additional states to make it more uniform with the upcoming libviewer code.

@kszucs kszucs changed the title refactor(libcommon): consolidate query() and query_truncated_binary() methods refactor(libcommon): consolidate query() and query_truncated_binary() methods Nov 4, 2025
refactor(libcommon): remove unused hf_token in `parquet_utils`

refactor(libcommon): remove the effectively unused arguments of `Indexer` (#3250)

* refactor(libcommon): remove the effectively unused arguments of `Indexer`

* style: remove unnecessarry imports

* refactor(libcommon): remove `unsupported_features` argument from `RowsIndex`

* style: remove unnecessarry imports

refactor(libcommon): remove the effectively unused arguments of `Indexer`

style: remove unnecessarry imports

refactor(libcommon): remove `Indexer`

refactor(services): directly create `RowsIndex` instead of `Indexer`

test(libcommon): fix `test_rows_index_query_with_empty_dataset` to use `ds_empty`

chore: missing import and mypy types

style: fix import order

fix(libcommon): cache the latest instance of `RowsIndex`

test(libcommon): add a test for caching the latest RowsIndex instance

fix(libcommon): only cache RowsIndex when serving from the rows endpoint

test(libcommon): remove previously added test case for caching RowIndex instances

chore: missing type annotations

refactor(libcommon): remove now obsolete `get_supported_unsupported_columns\(\)` function

style(libcommon): remove unnecessary imports

chore: remove reduntant iterations
@kszucs kszucs force-pushed the consolidate-query-methods branch from b807572 to 0713677 Compare November 4, 2025 14:49
return is_list


def truncate_binary_columns(table: pa.Table, max_binary_length: int, features: Features) -> tuple[pa.Table, list[str]]:
Copy link
Member Author

@kszucs kszucs Nov 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I applied this change in the libviewer PR as well but pulling this out here to make the reviewer process easier.

)

index = ParquetIndexWithMetadata(
files=[file_metadata], # type: ignore[list-item]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ParquetFileMetadataItem imported here is an exact duplicate of the ParquetFileMetadataItem in libcommon, but mypy complains about the type mismatch.


@dataclass
class ParquetIndexWithMetadata:
files: list[ParquetFileMetadataItem]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Storing the list of ParquetFileMetadataItem here directly since the previously extracted field lists didn't provide any advantage but rather noise.

Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm !

method="parquet_index_with_metadata.query", step="load the remote parquet files using metadata from disk"
):
parquet_files = [
pq.ParquetFile(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pq.ParquetFile can accept a filesystem argument so I think it could be further simplified and the mocking in the tests could be simpler by passing a local filesystem rather than a network filesystem but leaving it as is since hopefully we can remove it entirely.

row_group_readers, first_row_group_id, last_row_group_id, all_columns, binary_columns
)
else:
pa_table, truncated_columns = self._read_without_binary(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Theoretically we could remove this branch as well, at most there would be negligible performance overhead, but trying to keep the old behavior.

@kszucs kszucs merged commit dd1630b into main Nov 4, 2025
29 of 32 checks passed
@kszucs kszucs deleted the consolidate-query-methods branch November 4, 2025 18:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants