[speechm2] Support indexed sharegpt JSONL and webdataset formats#15410

Open
pzelasko wants to merge 2 commits into main from indexed-sharegpt-data-type-parsers

Conversation

Collaborator

@pzelasko pzelasko commented Feb 17, 2026

Important

The Update branch button must only be pressed on very rare occasions.
An outdated branch is never blocking the merge of a PR.
Please reach out to the automation team before pressing that button.

What does this PR do ?

The first PR to support indexed datasets. It reads a binary index (a sequence of uint64 byte offsets marking the beginning of each sample in a file), generates a random permutation of indices on the fly, and looks up the right sample.

This implementation pretends it's a sequential IO dataset for compatibility, but will be used as a building block for a resumable dataloader in the future.
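The lookup described above can be sketched as follows. The helper names and file paths here are illustrative only, not the PR's actual API; the sketch assumes the .idx file stores the start offset of each sample, as the description says.

```python
# Hedged sketch of the indexed-lookup idea: the companion .idx file is a
# flat sequence of little-endian uint64 byte offsets, one per sample,
# marking where each sample's line begins in the .jsonl file.
# Helper names are illustrative, not the PR's actual code.
import struct


def read_offsets(idx_path: str) -> list[int]:
    """Load every uint64 offset from the binary index file."""
    with open(idx_path, "rb") as f:
        data = f.read()
    n = len(data) // 8  # each offset is 8 bytes
    return list(struct.unpack(f"<{n}Q", data))


def read_sample(jsonl_path: str, offsets: list[int], i: int) -> bytes:
    """Seek directly to the i-th sample and read its single JSON line."""
    with open(jsonl_path, "rb") as f:
        f.seek(offsets[i])
        return f.readline()
```

Random access like this is what lets the dataset apply an on-the-fly permutation of indices instead of shuffling the file itself.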

Supported formats:

share_gpt

  • two files data.jsonl and data.jsonl.idx
  • schema (single line):
        {
            "id": "audio_convo",
            "sound": "audio.wav",
            "conversations": [
                {"from": "human", "value": "Listen to this: <sound> What do you think?"},
                {"from": "gpt", "value": "Response"}
            ]
        }
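A minimal sketch of consuming one such line; the field names follow the schema above, while the function name and return shape are purely illustrative:

```python
# Hedged sketch of parsing a single share_gpt JSONL line. Field names
# ("id", "sound", "conversations", "from", "value") come from the schema
# above; everything else is an illustrative assumption.
import json


def parse_share_gpt_line(line: bytes) -> tuple[str, str, list[tuple[str, str]]]:
    """Return (sample id, audio path, [(speaker, text), ...]) for one line."""
    sample = json.loads(line)
    turns = [(t["from"], t["value"]) for t in sample["conversations"]]
    return sample["id"], sample["sound"], turns
```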

share_gpt_webdataset

  • directory with files like shard_0.tar + shard_0.tar.idx
  • expected directory layout:
        data_dir/
          wids-meta.json                          # shard list metadata
          0/
            shard-0.tar      shard-0.tar.idx      # tar + optional index
            ...
    Each tar archive contains paired files per sample (same basename):
        0.json   0.wav
        1.json   1.wav
        ...
  • each individual JSON file contains a single line following the share_gpt schema above
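Extracting one paired sample from such a shard can be sketched with the standard tarfile module; the function name and return shape are assumptions, not the PR's implementation:

```python
# Hedged sketch of reading one webdataset-style sample: the .json metadata
# and .wav audio share a basename inside the tar shard, per the layout
# above. Names here are illustrative, not the PR's actual code.
import json
import tarfile


def read_pair(tar_path: str, basename: str) -> tuple[dict, bytes]:
    """Return (parsed JSON metadata, raw audio bytes) for one sample."""
    with tarfile.open(tar_path, "r") as tar:
        meta = json.load(tar.extractfile(f"{basename}.json"))
        audio = tar.extractfile(f"{basename}.wav").read()
    return meta, audio
```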

Collection: speechlm2

Changelog

  • Data type parsers for indexed JSONL and webdataset-based share_gpt format data.

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines contain specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

Signed-off-by: Piotr Żelasko <petezor@gmail.com>
@github-actions
Contributor

[🤖]: Hi @pzelasko 👋,

We wanted to let you know that a CICD pipeline for this PR just finished successfully.

So it might be time to merge this PR or get some approvals.

//cc @chtruong814 @ko3n1g @pablo-garay @thomasdhc


@tbartley94 tbartley94 left a comment


Formatting and structural changes. General logic looks good though.

shard_seed=config.shard_seed,
)
)
if not config.get("force_finite", False):

Add a quick comment for this flag (when going through dataloaders, we have a bit of a depth issue where the purpose of flags can get sidetracked).

)
if not config.get("force_finite", False):
    cuts = cuts.repeat(preserve_id=True)
return cuts, True

Where do we need this bool for compatibility? Curious if the same thing can be achieved by just checking the config.

bits += 1
self._half = bits // 2
self._mask = (1 << self._half) - 1
self._rounds = 6

Make this an arg in init.

left = (x >> self._half) & self._mask
right = x & self._mask
for key in self._keys:
left, right = right, left ^ (((right * 2654435761) ^ key) >> 32 & self._mask)

Make this a global var at the top of the file.

self._rounds = 6
self._keys = [rng.getrandbits(64) for _ in range(self._rounds)]

def _permute_one(self, x: int) -> int:

Are there no numpy methods you can crib for this? Or is the native CPython bitshift more efficient?
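For readers following this thread: the hunks above appear to describe a Feistel-style bijection over the index space, which permutes indices on the fly without materializing a shuffled list. A minimal self-contained sketch, with structure and constants inferred from the quoted hunks (function and variable names are assumptions, not the PR's exact code):

```python
# Hedged sketch of a Feistel-style index permutation, reconstructed from
# the quoted review hunks. A balanced Feistel network over the two halves
# of the bit width is a bijection on [0, 2**bits) for any round function,
# so it can stand in for a shuffled index array.
import random

KNUTH_MULT = 2654435761  # multiplicative-hash constant seen in the hunk
ROUNDS = 6               # round count; the review suggests making this an arg


def make_permuter(bits: int, seed: int):
    """Return a deterministic bijection on [0, 2**bits) (bits assumed even)."""
    rng = random.Random(seed)
    half = bits // 2
    mask = (1 << half) - 1
    keys = [rng.getrandbits(64) for _ in range(ROUNDS)]

    def permute(x: int) -> int:
        left = (x >> half) & mask
        right = x & mask
        for key in keys:
            # Standard Feistel round: swap halves, mix one with the round key.
            left, right = right, left ^ ((((right * KNUTH_MULT) ^ key) >> 32) & mask)
        return (left << half) | right

    return permute
```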

for line in f_in:
    current_offset += len(line)
    write_buffer.extend(struct.pack('<Q', current_offset))
    if len(write_buffer) > 8 * 1024 * 1024:

Very nitpicky, but just write out the full multiplication as a var above with a comment; no need to do the extra ops for every line.
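The suggestion above might look like the following; only the loop body mirrors the quoted hunk (which records the offset at each line's end, i.e. the start of the next line), while the surrounding function and names are assumed for illustration:

```python
# Sketch of the reviewer's suggestion: hoist the flush threshold out of
# the per-line loop as a named constant instead of recomputing
# 8 * 1024 * 1024 on every iteration. Surrounding code is assumed.
import struct

FLUSH_THRESHOLD_BYTES = 8 * 1024 * 1024  # flush the write buffer every 8 MiB


def write_index(f_in, f_out) -> None:
    """Stream one uint64 offset per input line to f_out, flushing in chunks."""
    current_offset = 0
    write_buffer = bytearray()
    for line in f_in:
        current_offset += len(line)
        write_buffer.extend(struct.pack("<Q", current_offset))
        if len(write_buffer) > FLUSH_THRESHOLD_BYTES:
            f_out.write(write_buffer)
            write_buffer.clear()
    if write_buffer:  # flush whatever remains after the last line
        f_out.write(write_buffer)
```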

"offset": turn.get("offset", 0.0),
}
)
if len(parts) > 1 and parts[1].strip():

Needs a catch to prevent silent fall-throughs.


{
"id": str,
"id": str, # not optional, but we will tolerate if it's missing

A bit cryptic; can you point to the line where we tolerate this?

elif isinstance(self.audio_placeholders, str):
self.audio_placeholders = [self.audio_placeholders]
self.audio_placeholders = _normalize_audio_placeholders(self.audio_placeholders)
self._has_index = all(Path(p + ".idx").exists() for p in self.manifest_filepath)

So I've been thinking: would the extension make more sense as manifest.idx/audio.idx instead of manifest.jsonl.idx?

raise FileNotFoundError(f"No wids-meta.json and no .tar files found under {self.data_dir}")
self.audio_placeholders = _normalize_audio_placeholders(self.audio_placeholders)
self._has_index = all(Path(p + ".idx").exists() for p in self._shard_paths)
self.epoch = 0

Hmm, is there any way we can sync this with the trainer? Having an adapter maintain the epoch on its own sounds like a pending desync issue that will be annoying to hunt down.

Collaborator

pyf98 commented Feb 19, 2026

Thanks. This PR looks good to me!
