add efficient AV decoding #38

Draft

jon-barker wants to merge 15 commits into NVIDIA:develop from jon-barker:jbarker/efficient_video

Collaborator

jon-barker commented Jan 3, 2025

There is still tidying and documentation to be done. I am creating the draft PR to get initial feedback on design and big changes required.

philipp-fischer and others added 15 commits

November 6, 2024 13:46


          Merge pull request NVIDIA#27 from NVIDIA/develop

10c47c6

Apply asyn hotfix for a new bugfix release


          Merge pull request NVIDIA#30 from NVIDIA/develop

26700bb

Release 4.0.0


          WIP: integrate fastseek

86d4f08


          add video decoding tests

95ea59c


          tweak video decode test

3d384bb


          WIP: add audio decode with tests

ebe298a


          av tests

92e7ee4


          Merge branch 'main' of https://github.com/nvidia/Megatron-Energon int…

0916ae0

…o jbarker/efficient_video


          debugging audio

e2c4aa8


          remove poorly planned audio test

59ce7c2


          uncomment video test

4a8a544


          add audio resampling and spectrogram conversion

164b22f


          WIP: exposing av decode options through energon

ca16574


          expose av decode args through energon api

6006a03


          support decoding audio clips from a video

2394f17

voegtlel requested changes

Collaborator

voegtlel left a comment

Thanks for adding this code, it's a great addition! A few comments, mainly on the integration.

pyproject.toml

@@ -31,6 +31,11 @@ dependencies = [
                 "tqdm",
                 "webdataset",
                 "s3fs",
+                "bitstring>=4.2.3",

Collaborator

voegtlel Jan 10, 2025

Should we add these as extras, to be installed like pip install megatron-energon[video]?
That would require the imports in the Python files to be allowed to fail, though.
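For illustration, a minimal sketch of how imports could be "allowed to fail" on the Python side: the import is attempted once at module load, and a descriptive error is raised only when the feature is actually used. The helper names `optional_import` and `require`, the `decode_video` entry point, and the `video` extra name are hypothetical; only `bitstring` comes from the diff above.

```python
import importlib
from types import ModuleType
from typing import Optional


def optional_import(module_name: str) -> Optional[ModuleType]:
    """Return the imported module, or None if it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


def require(module: Optional[ModuleType], module_name: str, extra: str) -> ModuleType:
    """Raise a helpful error at use time if an optional dependency is missing."""
    if module is None:
        raise ImportError(
            f"{module_name!r} is required for this feature. "
            f"Install it with: pip install megatron-energon[{extra}]"
        )
    return module


# Attempted once at import time; a missing package is tolerated here.
bitstring = optional_import("bitstring")


def decode_video(data: bytes) -> None:
    # Hypothetical entry point: fails only when AV decoding is actually requested.
    require(bitstring, "bitstring", extra="video")
    ...
```

With this pattern, users who never touch video data can import the package without the extra installed.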

src/megatron/energon/flavors/webdataset/default_decoder_webdataset.py

Comment on lines +64 to +65

		video_num_frames: int = 64,
		video_out_frame_size: tuple = (224, 224),

Collaborator

voegtlel Jan 10, 2025

Should these be constants, rather than being set when fetching the frames in the user code? I think there should be a dynamic option as well.
E.g. Open-Sora group-batches videos into differently sized outputs/frames.

src/megatron/energon/flavors/webdataset/fastseek/__init__.py

		@@ -0,0 +1,4 @@
		# Copyright (c) 2024, NVIDIA CORPORATION.

Collaborator

voegtlel Jan 14, 2025

I'd suggest moving this to a more top-level package. Maybe under megatron.energon.video?

src/megatron/energon/flavors/webdataset/decode_av_frames.py

+                      """
+                      extension = re.sub(r".*[.]", "", key)
+                      # TODO(jbarker): we should add a debug log here
+                      if extension in "mov mp4 webm mkv".split():

Collaborator

voegtlel Jan 14, 2025

Suggested change

      
-                   if extension in "mov mp4 webm mkv".split():
+                   if extension in ("mov", "mp4", "webm", "mkv"):

src/megatron/energon/flavors/webdataset/decode_av_frames.py

+                          key: media file extension
+                          data: raw media bytes
+                      """
+                      extension = re.sub(r".*[.]", "", key)

Collaborator

voegtlel Jan 14, 2025

Suggested change

      
-                   extension = re.sub(r".*[.]", "", key)
+                   extension = key.rsplit('.', 1)[-1]

Otherwise, we usually compile regexes beforehand.
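As a quick sketch of the two options in this suggestion: for ordinary single-line keys, `rsplit` and the original regex yield the same extension, and if the regex is kept it can be compiled once at module level. The names `_EXTENSION_RE`, `extension_via_regex`, `extension_via_rsplit`, and `VIDEO_EXTENSIONS` are illustrative, not taken from the PR.

```python
import re

# Compiled once at module level instead of on every call.
_EXTENSION_RE = re.compile(r".*[.]")

# Tuple of known video extensions, as in the suggested change.
VIDEO_EXTENSIONS = ("mov", "mp4", "webm", "mkv")


def extension_via_regex(key: str) -> str:
    """Strip everything up to and including the last dot."""
    return _EXTENSION_RE.sub("", key)


def extension_via_rsplit(key: str) -> str:
    """Take the part after the last dot; returns the key itself if there is no dot."""
    return key.rsplit(".", 1)[-1]


for key in ("sample/0001.mp4", "archive.tar.webm", "noext"):
    assert extension_via_regex(key) == extension_via_rsplit(key)
```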

src/megatron/energon/flavors/webdataset/decode_av_frames.py

+                              out_frame_size=self.video_out_frame_size,
+                              decode_audio=self.video_decode_audio,
+                          )
+                      elif extension in "flac mp3".split():

Collaborator

voegtlel Jan 14, 2025

Suggested change

      
-                   elif extension in "flac mp3".split():
+                   elif extension in ("flac", "mp3"):

src/megatron/energon/flavors/webdataset/decode_av_frames.py


		DEFAULT_AUDIO_FRAME_SHIFT_MS = 10 # in milliseconds

		class AVDecoder:

Collaborator

voegtlel Jan 14, 2025

Maybe we should have an alternative decoder as well, which returns the decoder itself, so the user can decide in user code (=encode_sample) which frames to read?

Like this:

# This function is to be registered as decoder
def read_av_data(key: str, data: bytes):
    if key in ("mp3", ...):
        return AVData(data)


# This class is now passed to the user's `encode_sample` function (i.e. the raw video
# bytes are essentially passed through). This allows the user to decide on the 
# parameters on the fly (e.g. for open-sora).
class AVData:
    def __init__(self, raw: bytes):
        ...

    def get_frames(
            self,
            audio_convert_to_melspec: bool = False,
            audio_clip_duration: int = 1,
            audio_num_clips: int = -1,
            audio_target_rate: int = 16000,
            video_decode_audio: bool = False,
            video_num_frames: int = 64,
            video_out_frame_size: tuple = (224, 224),
        ) -> AudioVideoData:
            ...

WDYT?
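To make the pass-through idea concrete, here is a minimal runnable sketch under stated assumptions: the actual decoding is replaced by an injected callable (`decode_fn`), so nothing here depends on the PR's decoder or on any AV library. `FrameRequest`, `make_av_reader`, and `fake_decode` are hypothetical names introduced for illustration; only `AVData`, `read_av_data`, `get_frames`, and the two video parameters come from the outline above.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Tuple

AV_EXTENSIONS = ("mov", "mp4", "webm", "mkv", "flac", "mp3")


@dataclass
class FrameRequest:
    """Hypothetical container for decode parameters chosen per call."""

    video_num_frames: int = 64
    video_out_frame_size: Tuple[int, int] = (224, 224)


DecodeFn = Callable[[bytes, FrameRequest], Any]


class AVData:
    """Holds the raw bytes; nothing is decoded until get_frames is called."""

    def __init__(self, raw: bytes, decode_fn: DecodeFn) -> None:
        self.raw = raw
        self._decode_fn = decode_fn

    def get_frames(
        self,
        video_num_frames: int = 64,
        video_out_frame_size: Tuple[int, int] = (224, 224),
    ) -> Any:
        # Parameters are supplied by user code (e.g. encode_sample) at call time.
        request = FrameRequest(video_num_frames, video_out_frame_size)
        return self._decode_fn(self.raw, request)


def make_av_reader(decode_fn: DecodeFn) -> Callable[[str, bytes], Optional[AVData]]:
    """Build a decoder function that wraps AV payloads instead of decoding them."""

    def read_av_data(key: str, data: bytes) -> Optional[AVData]:
        extension = key.rsplit(".", 1)[-1]
        if extension in AV_EXTENSIONS:
            return AVData(data, decode_fn)
        return None  # Not an AV file: leave it to other decoders.

    return read_av_data


def fake_decode(raw: bytes, request: FrameRequest) -> Tuple[int, int, Tuple[int, int]]:
    # Stand-in for real decoding: just echoes what would be decoded.
    return (len(raw), request.video_num_frames, request.video_out_frame_size)


reader = make_av_reader(fake_decode)
sample = reader("clip.mp4", b"\x00" * 10)
```

In user code, the same `sample` could then be read with different settings per batch group, e.g. `sample.get_frames(video_num_frames=16, video_out_frame_size=(128, 128))`.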

voegtlel reviewed

src/megatron/energon/flavors/webdataset/decode_av_frames.py

+                          )
+                      return None
+              def waveform2melspec(waveform, sample_rate, num_mel_bins, target_length):

Collaborator

voegtlel Jan 14, 2025

I feel the functions below could have their own file, and could also reside in the fastseek package?

voegtlel reviewed

src/megatron/energon/flavors/webdataset/decode_av_frames.py

+              class AVDecoder:
+                  def __init__(
+                          self,
+                          audio_convert_to_melspec,

Collaborator

voegtlel Jan 14, 2025

Generally, we have all parameters statically typed. Also, all class variables are typically typed.
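A sketch of what a fully typed constructor might look like. The parameter names, types, and defaults are copied from the `get_frames` outline earlier in this review; whether the PR's `AVDecoder.__init__` uses the same set and the same defaults is an assumption.

```python
from typing import Tuple


class AVDecoder:
    # Class-level annotations declare the type of each instance attribute.
    audio_convert_to_melspec: bool
    audio_clip_duration: int
    audio_num_clips: int
    audio_target_rate: int
    video_decode_audio: bool
    video_num_frames: int
    video_out_frame_size: Tuple[int, int]

    def __init__(
        self,
        audio_convert_to_melspec: bool = False,
        audio_clip_duration: int = 1,
        audio_num_clips: int = -1,
        audio_target_rate: int = 16000,
        video_decode_audio: bool = False,
        video_num_frames: int = 64,
        video_out_frame_size: Tuple[int, int] = (224, 224),
    ) -> None:
        self.audio_convert_to_melspec = audio_convert_to_melspec
        self.audio_clip_duration = audio_clip_duration
        self.audio_num_clips = audio_num_clips
        self.audio_target_rate = audio_target_rate
        self.video_decode_audio = video_decode_audio
        self.video_num_frames = video_num_frames
        self.video_out_frame_size = video_out_frame_size
```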
