feat: enable timestamp support for batched beam search in RNN-T and TDT #15411

Open

pherber3 wants to merge 1 commit into NVIDIA-NeMo:main from pherber3:main

Conversation


@pherber3 pherber3 commented Feb 17, 2026

Important

The Update branch button should only be pressed on very rare occasions.
An outdated branch never blocks the merge of a PR.
Please reach out to the automation team before pressing that button.

What does this PR do?

Enable compute_timestamps=True for MALSD batch and MAES batch beam search strategies in RNN-T and TDT models (such as parakeet v3). Previously, these strategies raised NotImplementedError("Preserve alignments is not supported"), blocking timestamp generation even though the beam search infrastructure already tracks timestamp data internally via BatchedBeamHyps.

Collection: ASR

Changelog

  • Replace NotImplementedError with a warning in ModifiedALSDBatchedRNNTComputer, ModifiedALSDBatchedTDTComputer, and ModifiedAESBatchedRNNTComputer. Full alignment logprobs are unavailable in beam search, but timestamps are.
  • Add token_durations tensor to BatchedBeamHyps for TDT models so _compute_offsets_tdt() receives both start-frame timestamps and per-token durations (see the sketch after this list).
  • Store the start frame (not the end frame) in TDT beam timestamps, following the greedy decoding implementation.
  • Populate Hypothesis.token_duration in to_hyps_list() and to_nbest_hyps_list() for TDT.
  • Add 'malsd_batch' and 'maes_batch' to the beam strategy lists for preserve_alignments / compute_timestamps config resolution in RNNTDecoding.
  • Add tests for timestamp generation with both RNN-T and TDT beam decoding.
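
For context, a minimal sketch (not the NeMo implementation) of why _compute_offsets_tdt() needs both pieces of information for TDT: the start-frame timestamp marks where a token is emitted, and the TDT-predicted duration says how many frames it spans, so together they define a per-token frame range. The values below are made up for illustration.

import torch

# Hypothetical data for a single hypothesis.
start_frames = torch.tensor([0, 4, 9, 12])     # frame at which each token was emitted
token_durations = torch.tensor([4, 5, 3, 2])   # TDT-predicted duration (in frames) per token

end_frames = start_frames + token_durations
for i, (s, e) in enumerate(zip(start_frames.tolist(), end_frames.tolist())):
    print(f"token {i}: frames [{s}, {e})")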

Usage

import nemo.collections.asr as nemo_asr
from omegaconf import open_dict

# Load a pretrained TDT model (Parakeet TDT v3).
model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v3")

# Switch to batched MALSD beam search with timestamp computation enabled.
cfg = model.cfg.decoding
with open_dict(cfg):
    cfg.strategy = "malsd_batch"
    cfg.compute_timestamps = True
    cfg.preserve_alignments = True
    cfg.beam.beam_size = 4
    cfg.beam.search_type = "malsd_batch"
    cfg.beam.return_best_hypothesis = True
model.change_decoding_strategy(cfg)

output = model.transcribe(["audio.wav"], timestamps=True)
print(output[0].timestamp)  # {'char': [...], 'word': [...], 'segment': [...]}
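
Each entry in those lists can be inspected directly; a minimal follow-up sketch (the exact dictionary keys of each entry depend on the NeMo version in use):

# Print the word-level timestamp entries returned by transcribe().
for word_entry in output[0].timestamp['word']:
    print(word_entry)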

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI, remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation? (N/A - didn't find any related docs to change for this)
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc) (No)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs to various areas.

@nithinraok - Tagging according to above guidelines for ASR-related changes

Additional Information

  • No existing issue; I discovered while trying to use model.transcribe(timestamps=True) with the malsd_batch strategy that it wasn't supported, even though I believe the capability is already there under the hood.
  • All new tensors are pre-allocated with a fixed size, which should preserve CUDA graph compatibility (see the sketch after this list).
  • Tested on L40S GPU with stt_en_conformer_transducer_small (RNN-T) and nvidia/stt_en_fastconformer_tdt_large (TDT).
  • Also tested on production call center audio (~60 min stereo calls mixed to mono) with nvidia/parakeet-tdt-0.6b-v3 using malsd_batch + GPU-PB phrase boosting + NGPU-LM shallow fusion trained using the kenlm script. Timestamps are generated correctly and match the greedy decoder's segment/word boundaries.
  • Note 1: malsd_batch with high LM weights (e.g., ngram_lm_alpha=0.75) can cause content dropout on long audio where segments of speech get skipped entirely. I believe this is a pre-existing beam pruning interaction with LM scoring, not related to this PR's changes. Lower LM weights (0.2), greedy decoding with the same LM, or just using the phrase boosting alone do not exhibit this behavior.
  • Note 2: MAES batch is RNN-T only; there is no TDT MAES computer (i.e., tdt_maes_batched_computer.py does not exist). If one is added, this change should support it, but until then TDT models are only supported with MALSD.
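
As a generic illustration (not NeMo code) of the fixed-size pre-allocation pattern mentioned above: tensor shapes stay constant across decoding steps, which is what CUDA graph capture requires, so per-step data is written into a pre-allocated buffer rather than concatenated. The sizes and helper below are hypothetical.

import torch

batch_size, beam_size, max_tokens = 2, 4, 8  # assumed sizes for illustration

# Pre-allocate once with a fixed shape, as BatchedBeamHyps does for its buffers.
token_durations = torch.zeros(batch_size, beam_size, max_tokens, dtype=torch.long)
lengths = torch.zeros(batch_size, beam_size, dtype=torch.long)

def append_durations(new_durations: torch.Tensor) -> None:
    """Write this step's durations in place; no reallocation, no shape change."""
    token_durations.scatter_(2, lengths.unsqueeze(-1), new_durations.unsqueeze(-1))
    lengths.add_(1)

append_durations(torch.full((batch_size, beam_size), 4, dtype=torch.long))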

@github-actions github-actions bot added the ASR label Feb 17, 2026
Signed-off-by: Patrick Herbert <pherbert@gohealth.com>
@pherber3 pherber3 marked this pull request as ready for review February 17, 2026 23:05
@pherber3 pherber3 changed the title fix: enable timestamps for MALSD/MAES batch beam search in RNN-T and TDT feat: enable timestamps for MALSD/MAES batch beam search in RNN-T and TDT Feb 17, 2026
@pherber3 pherber3 changed the title feat: enable timestamps for MALSD/MAES batch beam search in RNN-T and TDT feat: enable timestamp support for batched beam search in RNN-T and TDT Feb 17, 2026
@nithinraok nithinraok requested a review from artbataev February 19, 2026 05:10