Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(puffin): Add PuffinReader #892

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

fqaiser94
Copy link
Contributor

@fqaiser94 fqaiser94 commented Jan 15, 2025

Part of #744

Summary

  • Add PuffinReader

Context

  • This is the fourth of a number of PRs to add support for Iceberg Puffin file format.
  • It might be helpful to refer to the overarching PR from which these changes were split to understand better how these changes will fit in to the larger picture.
  • It may also be helpful to refer to the Java reference implementation for PuffinReader here.

Copy link
Contributor

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @fqaiser94 for this great pr, generally LGTM! Just some minor suggestions.

}

/// Returns file metadata
pub(crate) async fn file_metadata(&mut self) -> Result<&FileMetadata> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we consider using a LazyLock to initalize it? This way it could be thread safe, and discard the &mut modifier.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would love to be able to get rid of the &mut here 👍
Struggling to get a LazyLock based implementation working though.
The main challenge is that FileMetadata::read is an async function :/
Open to suggestions

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think async_lazy crate is a good fit?

crates/iceberg/src/puffin/reader.rs Outdated Show resolved Hide resolved
/// Sequence number of the Iceberg table's snapshot the blob was computed from
pub(crate) sequence_number: i64,
/// The actual blob data
pub(crate) data: Vec<u8>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add comment to explain that this is always uncompressed data?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or can we include a CompressionCodec in the Blob to indicate the compress method? (we can do this in a follow-up PR)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add comment to explain that this is always uncompressed data?

Done.

Or can we include a CompressionCodec in the Blob to indicate the compress method? (we can do this in a follow-up PR)

I can see where you're coming from and while I'm not completely opposed to the idea, I think I prefer to return the uncompressed data at this point. I will also note that the Java PuffinReader API does the same i.e. always returns the data after decompression.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants