Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

stat.ino() is assumed to be small #1817

Closed
EliahKagan opened this issue Jan 28, 2025 · 3 comments · Fixed by #1818
Closed

stat.ino() is assumed to be small #1817

EliahKagan opened this issue Jan 28, 2025 · 3 comments · Fixed by #1818
Labels
acknowledged an issue is accepted as shortcoming to be fixed help wanted Extra attention is needed

Comments

@EliahKagan
Copy link
Member

EliahKagan commented Jan 28, 2025

Current behavior 😯

gix_index::entry::Stat::from_fs obtains information from a stat structure on like this on all Unix-like systems:

#[cfg(not(windows))]
let res = {
Stat {
mtime: mtime.try_into().unwrap_or_default(),
ctime: ctime.try_into().unwrap_or_default(),
// truncating to 32 bits is fine here because
// that's what the linux syscalls returns
// just rust upcasts to 64 bits for some reason?
// numbers this large are impractical anyway (that's a lot of hard-drives).
dev: stat.dev() as u32,
ino: stat.ino() as u32,
uid: stat.uid(),
gid: stat.gid(),
// truncation to 32 bits is on purpose (git does the same).
size: stat.len() as u32,
}
};

But even if that the rationale holds for Linux, it is not clear that the device number always fits in a 32-bit integer without loss of information, because enormous device numbers do exist on some notable Unix-like operating systems with some filesystems. It is furthermore not accurate that such large numbers are impractical, because devices need not be assigned sequentially. For example, this forum discussion pertains to huge device numbers for ZFS volumes on FreeBSD.

In addition, I am not confident this holds on Linux, since even on the same architecture there is more than one stat-related syscall. This reference of syscalls gives stat as an example of an entry point with multiple syscalls on the same platforms that use data types with different sizes.

Expected behavior 🤔

I don't know if this actually causes a problem. This issue is thus incomplete in that I am not sure what happens with such devices. Maybe the truncation is okay? But it is not obvious why it would be.

However, even if the implementation is not changed, the comments should be revised, once the effect is known, to avoid implying that device numbers must be sequential or that the field fits in 32-bits on all systems, and to explain either:

  • why is okay to cast it to u32, or
  • what kinds of limitations or problems may arise.

Git behavior

It looks like Git is similar. On most systems, unsigned int is 32-bit, even if unsigned long is 64-bit, and Git defines:

struct stat_data {
	struct cache_time sd_ctime;
	struct cache_time sd_mtime;
	unsigned int sd_dev;
	unsigned int sd_ino;
	unsigned int sd_uid;
	unsigned int sd_gid;
	unsigned int sd_size;
};

Which is populated like this:

void fill_stat_data(struct stat_data *sd, struct stat *st)
{
	sd->sd_ctime.sec = (unsigned int)st->st_ctime;
	sd->sd_mtime.sec = (unsigned int)st->st_mtime;
	sd->sd_ctime.nsec = ST_CTIME_NSEC(*st);
	sd->sd_mtime.nsec = ST_MTIME_NSEC(*st);
	sd->sd_dev = st->st_dev;
	sd->sd_ino = st->st_ino;
	sd->sd_uid = st->st_uid;
	sd->sd_gid = st->st_gid;
	sd->sd_size = munge_st_size(st->st_size);
}

It is not clear that this is intended in Git either. Other lossy conversions are identified as such.

Steps to reproduce 🕹

I don't currently have a procedure to attempt to produce any incorrect runtime behavior based on this. I am unsure if it is only the comment that should be corrected, or if the representation and/or runtime behavior should also be changed.

@Byron Byron added the acknowledged an issue is accepted as shortcoming to be fixed label Jan 29, 2025
@Byron
Copy link
Member

Byron commented Jan 29, 2025

Thanks a lot for bringing this up.

For context, the reason this function exists is that the std functions for creation_time() doesn't return the desired field from the stat call. Further, by default I think inode numbers aren't used to determine if a file changed or not, so usually these possibly truncated numbers are ignored.
As the Index only has 32 bits for this, truncated numbers may be ineffective to detect change, but semantically this still valid as more fields will then be used for comparison.

I do not know what drove me to write the numbers this large are impractical anyway (that's a lot of hard-drives). comment though.

Ultimately, I'd think improved documentation should be able to resolve this issue as this truncation is acceptable.

@Byron Byron added the help wanted Extra attention is needed label Jan 29, 2025
@EliahKagan
Copy link
Member Author

EliahKagan commented Jan 29, 2025

For context, the reason this function exists is that the std functions for creation_time() doesn't return the desired field from the stat call.

Doesn't MetadataExt provide the necessary methods on Unix-like systems? It looks like everything accessed here is available through a same-named method in MetadataExt, and that those methods have been available since Rust 1.1.0.

Currently the logic in for Unix-like systems in gix-index/entry/stat.rs and some other places uses #[cfg(not(windows))] rather than #[cfg(unix)]. But I don't think not(windows) is practically broader than unix here, unless there are non-unix implementations for them.

Further, by default I think inode numbers aren't used to determine if a file changed or not, so usually these possibly truncated numbers are ignored.

That makes sense and suggests a comment-only or comment-and-documentation-only change is the way to go.

However, I wonder if the implementation here, or for gix_index::Metadata, which currently extends rustix::fs::Stat, should also change to use std facilities instead of rustix. This relates somewhat to where #1811 left off: the thought expressed there in "Future directions."

Especially if no such change can be made, the gix_index::fs documentation could be expanded to explain what it does differently. I'd be happy to do that once I understand this issue adequately (unless it really is documented for gix-index but I haven't found it due to my own oversight).

Relatedly, is it really accurate to say that the from_fs function is called with a result of symlink_metadata?

/// Creates stat information from the result of `symlink_metadata`.
pub fn from_fs(stat: &crate::fs::Metadata) -> Result<Stat, SystemTimeError> {

Edit: I've opened #1818.

EliahKagan added a commit to EliahKagan/gitoxide that referenced this issue Jan 30, 2025
This expands and slightly corrects the `gix_index::entry::from_fs`
documentation:

- To be more precise about the expectation of where the metadata
  come from (it needn't specifically be the `symlink_metadata`
  function, and it is sometimes even an `fstat`-like function
  rather than an `lstat`-like function, though not `stat`).

- To mention that default values are used for unavailable data.

- To mention that truncation is sometimes performed on some fields.

Comments are also updated regarding some cases of truncation: why
casting down to 32-bit should be okay for `st_dev` and `st_ino`
(see GitoxideLabs#1817).
@Byron
Copy link
Member

Byron commented Jan 30, 2025

Doesn't MetadataExt provide the necessary methods on Unix-like systems? It looks like everything accessed here is available through a same-named method in MetadataExt, and that those methods have been available since Rust 1.1.0.

One might think that the creation date provide there is what Git uses, but when looking it up in the std code one will see its the 'birth' time, not the 'creation' time (or something like that). This is why it can't be used, or else gitoxide will write indices with data that will cause git to refresh the index each time it sees it.

Currently the logic in for Unix-like systems in gix-index/entry/stat.rs and some other places uses #[cfg(not(windows))] rather than #[cfg(unix)]. But I don't think not(windows) is practically broader than unix here, unless there are non-unix implementations for them.

I agree that it's probably wrong to use unix-specific extensions under a #[cfg(not(windows))] attribute.
The reason for doing that probably was that it was unclear what happens if #[cfg(not(unix))], it's like a hole in the definition. However, unless the hole is plucked, I think the current code or a change to #[cfg(unix)] would lead to compile errors, something which certainly is desirable and it would be a question of which produces better error messages.

Especially if no such change can be made, the gix_index::fs documentation could be expanded to explain what it does differently. I'd be happy to do that once I understand this issue adequately (unless it really is documented for gix-index but I haven't found it due to my own oversight).

Along with my comment above, I think the documentation could be improved to explain why the custom implementation is needed.

Relatedly, is it really accurate to say that the from_fs function is called with a result of symlink_metadata?

This looks like the doc-string is stale and isn't correct anymore. And I think it can be adjusted to once again say what it does.
Right, this is in #1818 , thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
acknowledged an issue is accepted as shortcoming to be fixed help wanted Extra attention is needed
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants