Skip to content

Conversation

@iulialexandra
Copy link

Motivation for features / changes

When using the TensorboardLogger in Pytorch lightning to log to an Azure blob storage location, the following error is thrown:

ErrorCode:InvalidBlobType
Content: <?xml version="1.0" encoding="utf-8"?><Error><Code>InvalidBlobType</Code><Message>The blob type is invalid for this operation.

This is due to the assumption that if the filesystem supports append opperations, these will automatically be available for any file in the blob storage, see this code snippet from gfile.py:

if self.fs_supports_append:
    if not self.write_started:
        # write the first chunk to truncate file if it already exists
        self.fs.write(self.filename, file_content, self.binary_mode)
        self.write_started = True

    else:
        # append the later chunks
        self.fs.append(self.filename, file_content, self.binary_mode)

However, in Azure, if a file is created by opening it in write mode, it will not permit append operations later on.

Technical description of changes

Screenshots of UI changes (or N/A)

Detailed steps to verify changes work correctly (as executed by you)

Alternate designs / implementations considered (or N/A)

@bmd3k bmd3k self-requested a review June 4, 2024 17:58
Copy link
Contributor

@bmd3k bmd3k left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You said:

However, in Azure, if a file is created by opening it in write mode, it will not permit append operations later on.

Is there a case where Azure will permit operations later on? Is there some other mode we should open it in instead?

TensorFlow for file operations.
"""

from adlfs import AzureBlobFileSystem
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't have adlfs libraries installed on development or ci machines so this is not going to build. You can see this in the failures for the build checks:

https://github.com/tensorflow/tensorboard/actions/runs/9352527415/job/25755552731?pr=6860

What would be a good way to check for this without importing AzureBlobFileSystem?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, if a file is initially created by opening it in append mode, this file will allow append operations later on. However, as shown in the code snippet above, the file is always created by opening it in write mode:


        # write the first chunk to truncate file if it already exists
        self.fs.write(self.filename, file_content, self.binary_mode)

One other option, besides checking what filesystem we are dealing with, is to always open the file in append mode initially, when the fs supports that. However, this means the contents of the file, if it exists, will not be cleared.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option, which I am not very fond of, is checking the fs_class's string representation for the substring "Azure". You can see this approach in the latest commit.

However, I believe the approach I suggested above would be better. I don't have all the context behind why you would to want clear the file though, and whether always creating the file in append mode would be appropriate. In my context, where I use lightning's TensorboardLogger, a unique experiment directory gets created at the beginning of the run, so the events file will always be in a unique location, unless we explicitly want to continue an experiment, in which case appending to an existing events file should not be a problem.

self.filename = compat.as_bytes(filename)
self.fs = get_filesystem(self.filename)
self.fs_supports_append = hasattr(self.fs, "append")
if _get_fs_class(self.filename) == AzureBlobFileSystem:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bit odd that we would call get_filesystem, which calls fsspec.get_filesystem_class transitively, and then we call fsspec.get_filesystem_class again.

Can we remove the duplication?

Perhaps get_filesystem could return a tuple, where first entry is the filesystem (like _FSSPEC_FILESYSTEM) and the second entry is the filesystem class (if any, like AzureBlobFileSystem).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to avoid that approach because it required a bit more refactoring, but I implemented it now, please check it out.

@ioangatop
Copy link

Hi is there any update? 🙏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants