Skip to content

Rewrite manifests #1661

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 9 commits into
base: main
Choose a base branch
from
Open

Conversation

amitgilad3
Copy link
Contributor

@amitgilad3 amitgilad3 commented Feb 13, 2025

This is an initial implementation of rewrite manifests, aiming to mimic the Java implementation as closely as possible. I’ve tried to follow the same structure and logic, but there are still some areas that might need refinement.

I’m looking for feedback and suggestions on:
• Whether the approach aligns well with the existing design.
• Any gaps or optimizations that could improve performance.
• How best to proceed with completing this feature.

Would love any insights or guidance on the next steps! Thanks in advance for the review! 🙌

@amitgilad3 amitgilad3 marked this pull request as draft February 13, 2025 20:21
@Fokko Fokko self-requested a review February 18, 2025 15:10
Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the many comments, most of them are easy to resolve. Thanks @amitgilad3 for working on this, I like it how you reuse the _ManifestMergeManager 👍

@@ -861,6 +865,7 @@ def existing(self, entry: ManifestEntry) -> ManifestWriter:
entry.snapshot_id, entry.sequence_number, entry.file_sequence_number, entry.data_file
)
)
self._existing_files += 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we already do this here:

self._existing_files += 1

Comment on lines 483 to 484
rewritten_manifests: List[ManifestFile]
added_manifests: List[ManifestFile]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about slapping on some default factories: https://docs.python.org/3/library/dataclasses.html#dataclasses.field

Suggested change
rewritten_manifests: List[ManifestFile]
added_manifests: List[ManifestFile]
rewritten_manifests: List[ManifestFile] = field(default_factory=list)
added_manifests: List[ManifestFile] = field(default_factory=list)

@@ -477,6 +478,20 @@ def _deleted_entries(self) -> List[ManifestEntry]:
return []


@dataclass(init=False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think of making this Frozen? This gives some nice benefits like being hashable: https://docs.python.org/3/library/dataclasses.html#frozen-instances

Using the default_factory's, we can also drop the init.

Suggested change
@dataclass(init=False)
@dataclass(frozen=True)

Comment on lines 562 to 563
_table: Table
_spec: PartitionSpec
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think these are used.

def _deleted_entries(self) -> List[ManifestEntry]:
"""To determine if we need to record any deleted manifest entries.

In case of an append, nothing is deleted.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy paste :)

self.rewritten_manifests.extend(deletes_result.rewritten_manifests)
self.added_manifests.extend(deletes_result.added_manifests)

if not self.rewritten_manifests:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, just for clarity that it is a list:

Suggested change
if not self.rewritten_manifests:
if len(self.rewritten_manifests) == 0:

Comment on lines 634 to 635
min_count_to_merge=2,
merge_enabled=True,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


tbl.append(arrow_table_with_null)

assert tbl.format_version == 2, f"Expected v2, got: v{tbl.format_version}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are we testing here? I would love to see if we could rewrite the V1 and V2 manifest into a V2 one.

assert len(result.rewritten_manifests) == 4, "Action should rewrite 4 manifests"
assert len(result.added_manifests) == 2, "Action should add 2 manifests"

new_manifests = tbl.current_snapshot().manifests(tbl.io)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing wrong with this, but tbl.inspect might also be helpful to assert the manifests.

@Fokko
Copy link
Contributor

Fokko commented Apr 17, 2025

@amitgilad3 gentle ping, are you still interested in working on this?

@amitgilad3
Copy link
Contributor Author

Hey @Fokko , yes i am very interested in finishing this, must of missed this (sorry) , will look at this later today :)

@amitgilad3 amitgilad3 force-pushed the rewrite_manifests branch from abbb01c to 6913f09 Compare May 15, 2025 19:35
@Fokko Fokko marked this pull request as ready for review May 16, 2025 21:11
@Fokko
Copy link
Contributor

Fokko commented May 16, 2025

Looks like the CI is sad 😞

tests/integration/test_writes/test_rewrite_manifests.py:154: error: "rewrite_manifests" of "Table" does not return a value (it only ever returns None)  [func-returns-value]
tests/integration/test_writes/test_rewrite_manifests.py:227: error: "rewrite_manifests" of "Table" does not return a value (it only ever returns None)  [func-returns-value]

Comment on lines +585 to +591
ssc = SnapshotSummaryCollector()
partition_summary_limit = int(
self._transaction.table_metadata.properties.get(
TableProperties.WRITE_PARTITION_SUMMARY_LIMIT, TableProperties.WRITE_PARTITION_SUMMARY_LIMIT_DEFAULT
)
)
ssc.set_partition_summary_limit(partition_summary_limit)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ssc = SnapshotSummaryCollector()
partition_summary_limit = int(
self._transaction.table_metadata.properties.get(
TableProperties.WRITE_PARTITION_SUMMARY_LIMIT, TableProperties.WRITE_PARTITION_SUMMARY_LIMIT_DEFAULT
)
)
ssc.set_partition_summary_limit(partition_summary_limit)
ssc = SnapshotSummaryCollector(int(
self._transaction.table_metadata.properties.get(
TableProperties.WRITE_PARTITION_SUMMARY_LIMIT, TableProperties.WRITE_PARTITION_SUMMARY_LIMIT_DEFAULT
)
))

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants