[speechm2] Support indexed sharegpt JSONL and webdataset formats#15410

Open
pzelasko wants to merge 2 commits into main from indexed-sharegpt-data-type-parsers

Conversation

Collaborator

@pzelasko pzelasko commented Feb 17, 2026

Important

The Update branch button must only be pressed on very rare occasions.
An outdated branch is never blocking the merge of a PR.
Please reach out to the automation team before pressing that button.

What does this PR do ?

The first PR to support indexed datasets. It reads a binary index (a sequence of uint64 byte offsets marking the beginning of each sample in a file), generates a random permutation of indices on the fly, and looks up the right sample.

This implementation pretends it's a sequential IO dataset for compatibility, but will be used as a building block for a resumable dataloader in the future.
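The lookup described above can be sketched as follows. The helper names and file paths here are illustrative only, not the PR's actual API; the sketch assumes the .idx file stores the start offset of each sample, as the description says.

```python
# Hedged sketch of the indexed-lookup idea: the companion .idx file is a
# flat sequence of little-endian uint64 byte offsets, one per sample,
# marking where each sample's line begins in the .jsonl file.
# Helper names are illustrative, not the PR's actual code.
import struct


def read_offsets(idx_path: str) -> list[int]:
    """Load every uint64 offset from the binary index file."""
    with open(idx_path, "rb") as f:
        data = f.read()
    n = len(data) // 8  # each offset is 8 bytes
    return list(struct.unpack(f"<{n}Q", data))


def read_sample(jsonl_path: str, offsets: list[int], i: int) -> bytes:
    """Seek directly to the i-th sample and read its single JSON line."""
    with open(jsonl_path, "rb") as f:
        f.seek(offsets[i])
        return f.readline()
```

Random access like this is what lets the dataset apply an on-the-fly permutation of indices instead of shuffling the file itself.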

Supported formats:

share_gpt

  • two files data.jsonl and data.jsonl.idx
  • schema (single line):
        {
            "id": "audio_convo",
            "sound": "audio.wav",
            "conversations": [
                {"from": "human", "value": "Listen to this: <sound> What do you think?"},
                {"from": "gpt", "value": "Response"}
            ]
        }
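A minimal sketch of consuming one such line; the field names follow the schema above, while the function name and return shape are purely illustrative:

```python
# Hedged sketch of parsing a single share_gpt JSONL line. Field names
# ("id", "sound", "conversations", "from", "value") come from the schema
# above; everything else is an illustrative assumption.
import json


def parse_share_gpt_line(line: bytes) -> tuple[str, str, list[tuple[str, str]]]:
    """Return (sample id, audio path, [(speaker, text), ...]) for one line."""
    sample = json.loads(line)
    turns = [(t["from"], t["value"]) for t in sample["conversations"]]
    return sample["id"], sample["sound"], turns
```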

share_gpt_webdataset

  • directory with files like shard_0.tar + shard_0.tar.idx
  • expected directory layout:
        data_dir/
          wids-meta.json                          # shard list metadata
          0/
            shard-0.tar      shard-0.tar.idx      # tar + optional index
            ...
    Each tar archive contains paired files per sample (same basename):
        0.json   0.wav
        1.json   1.wav
        ...
  • each individual JSON file contains a single line following the share_gpt schema above
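Extracting one paired sample from such a shard can be sketched with the standard tarfile module; the function name and return shape are assumptions, not the PR's implementation:

```python
# Hedged sketch of reading one webdataset-style sample: the .json metadata
# and .wav audio share a basename inside the tar shard, per the layout
# above. Names here are illustrative, not the PR's actual code.
import json
import tarfile


def read_pair(tar_path: str, basename: str) -> tuple[dict, bytes]:
    """Return (parsed JSON metadata, raw audio bytes) for one sample."""
    with tarfile.open(tar_path, "r") as tar:
        meta = json.load(tar.extractfile(f"{basename}.json"))
        audio = tar.extractfile(f"{basename}.wav").read()
    return meta, audio
```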

Collection: speechlm2

Changelog

  • Data type parsers for indexed JSONL and webdataset-based share_gpt format data.

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines contain specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

Signed-off-by: Piotr Żelasko <petezor@gmail.com>
@github-actions
Contributor

[🤖]: Hi @pzelasko 👋,

We wanted to let you know that a CICD pipeline for this PR just finished successfully.

So it might be time to merge this PR or get some approvals.

//cc @chtruong814 @ko3n1g @pablo-garay @thomasdhc


@tbartley94 tbartley94 left a comment


Formatting and structural changes. General logic looks good though.

shard_seed=config.shard_seed,
)
)
if not config.get("force_finite", False):

Add a quick comment for this flag (when going through dataloaders, we have a bit of a depth issue where the purpose of flags can get sidetracked).

)
if not config.get("force_finite", False):
    cuts = cuts.repeat(preserve_id=True)
return cuts, True

Where do we need this bool for compatibility? Curious if the same thing can be achieved by just checking the config.

bits += 1
self._half = bits // 2
self._mask = (1 << self._half) - 1
self._rounds = 6

Make this an arg in init.

left = (x >> self._half) & self._mask
right = x & self._mask
for key in self._keys:
left, right = right, left ^ (((right * 2654435761) ^ key) >> 32 & self._mask)

Make this a global var at the top of the file.

self._rounds = 6
self._keys = [rng.getrandbits(64) for _ in range(self._rounds)]

def _permute_one(self, x: int) -> int:

Are there no numpy methods you can crib for this? Or is the native CPython bitshift more efficient?
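For readers following this thread: the hunks above appear to describe a Feistel-style bijection over the index space, which permutes indices on the fly without materializing a shuffled list. A minimal self-contained sketch, with structure and constants inferred from the quoted hunks (function and variable names are assumptions, not the PR's exact code):

```python
# Hedged sketch of a Feistel-style index permutation, reconstructed from
# the quoted review hunks. A balanced Feistel network over the two halves
# of the bit width is a bijection on [0, 2**bits) for any round function,
# so it can stand in for a shuffled index array.
import random

KNUTH_MULT = 2654435761  # multiplicative-hash constant seen in the hunk
ROUNDS = 6               # round count; the review suggests making this an arg


def make_permuter(bits: int, seed: int):
    """Return a deterministic bijection on [0, 2**bits) (bits assumed even)."""
    rng = random.Random(seed)
    half = bits // 2
    mask = (1 << half) - 1
    keys = [rng.getrandbits(64) for _ in range(ROUNDS)]

    def permute(x: int) -> int:
        left = (x >> half) & mask
        right = x & mask
        for key in keys:
            # Standard Feistel round: swap halves, mix one with the round key.
            left, right = right, left ^ ((((right * KNUTH_MULT) ^ key) >> 32) & mask)
        return (left << half) | right

    return permute
```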

for line in f_in:
    current_offset += len(line)
    write_buffer.extend(struct.pack('<Q', current_offset))
    if len(write_buffer) > 8 * 1024 * 1024:

Very nitpicky, but just write out the full multiplication as a var above with a comment; no need to do the extra ops for every line.
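The suggestion above might look like the following; only the loop body mirrors the quoted hunk (which records the offset at each line's end, i.e. the start of the next line), while the surrounding function and names are assumed for illustration:

```python
# Sketch of the reviewer's suggestion: hoist the flush threshold out of
# the per-line loop as a named constant instead of recomputing
# 8 * 1024 * 1024 on every iteration. Surrounding code is assumed.
import struct

FLUSH_THRESHOLD_BYTES = 8 * 1024 * 1024  # flush the write buffer every 8 MiB


def write_index(f_in, f_out) -> None:
    """Stream one uint64 offset per input line to f_out, flushing in chunks."""
    current_offset = 0
    write_buffer = bytearray()
    for line in f_in:
        current_offset += len(line)
        write_buffer.extend(struct.pack("<Q", current_offset))
        if len(write_buffer) > FLUSH_THRESHOLD_BYTES:
            f_out.write(write_buffer)
            write_buffer.clear()
    if write_buffer:  # flush whatever remains after the last line
        f_out.write(write_buffer)
```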

"offset": turn.get("offset", 0.0),
}
)
if len(parts) > 1 and parts[1].strip():

Needs a catch to prevent silent fall-throughs.


{
"id": str,
"id": str, # not optional, but we will tolerate if it's missing

A bit cryptic; can you point to the line where we tolerate this?

elif isinstance(self.audio_placeholders, str):
self.audio_placeholders = [self.audio_placeholders]
self.audio_placeholders = _normalize_audio_placeholders(self.audio_placeholders)
self._has_index = all(Path(p + ".idx").exists() for p in self.manifest_filepath)

So I've been thinking: would the extension make more sense as manifest.idx/audio.idx instead of manifest.jsonl.idx?

raise FileNotFoundError(f"No wids-meta.json and no .tar files found under {self.data_dir}")
self.audio_placeholders = _normalize_audio_placeholders(self.audio_placeholders)
self._has_index = all(Path(p + ".idx").exists() for p in self._shard_paths)
self.epoch = 0

Hmm, is there any way we can sync this with the trainer? Having an adapter maintain the epoch on its own sounds like a pending desync issue that will be annoying to hunt down.

Collaborator

pyf98 commented Feb 19, 2026

Thanks. This PR looks good to me!
