[Q&A] Handling long-form audio and speaker labels in Sortformer training manifest #15356

@LilDevsy0117

Description

I am preparing training data for the Sortformer diarization model and have a few questions regarding the handling of long-form audio recordings:

Virtual Slicing via Manifest: If I have a long audio file (e.g., 5 minutes), is it sufficient to define segments in the manifest using only the offset and duration fields (e.g., 0-90s, 90-180s, etc.) without physically cutting the audio into separate files? Will the model train correctly using these offset-based entries?
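To make the question concrete, here is a minimal sketch of what such offset-based "virtual slices" might look like as a JSON-lines manifest. The field names (`audio_filepath`, `offset`, `duration`, `rttm_filepath`) follow the common NeMo-style diarization manifest layout and are an assumption, not verified against the Sortformer docs; the file paths are hypothetical.

```python
import json

# Hypothetical illustration: one 5-minute recording sliced into 90 s
# windows purely via `offset`/`duration` in the manifest -- the audio
# file itself is never physically cut into separate files.
AUDIO = "/data/session_long.wav"   # hypothetical path
TOTAL_SEC = 300                    # 5-minute recording
WINDOW_SEC = 90                    # virtual slice length

def virtual_slices(total: float, window: float):
    """Yield (offset, duration) pairs covering the full recording."""
    offset = 0.0
    while offset < total:
        yield offset, min(window, total - offset)
        offset += window

entries = [
    {
        "audio_filepath": AUDIO,
        "offset": off,
        "duration": dur,
        "rttm_filepath": "/data/session_long.rttm",  # hypothetical
    }
    for off, dur in virtual_slices(TOTAL_SEC, WINDOW_SEC)
]

# One JSON object per line, as JSON-lines manifests expect
manifest = "\n".join(json.dumps(e) for e in entries)
print(manifest)
```

This yields four entries (0-90 s, 90-180 s, 180-270 s, 270-300 s) that all point at the same physical audio file; the last slice is shorter because the recording does not divide evenly into 90 s windows.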

Partial Speaker Presence in Segments: Suppose a 5-minute recording has 4 unique speakers in total. However, in a specific 90-second segment (e.g., the first 90 seconds), only 2 of those speakers are actively talking.

Is it okay to include segments where only a subset of the session's total speakers is present?

In the manifest entry for that 90s segment, should the num_speakers field reflect the number of speakers present in that specific segment (2), or the total speakers in the original long recording (4)?
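Purely to make the two interpretations being asked about concrete, here are the two candidate manifest entries for that first 90-second slice of the 4-speaker session. The field names and paths are assumptions for illustration; which convention the Sortformer pipeline expects is exactly the open question.

```python
import json

# A 90 s slice of a 5-minute, 4-speaker session in which only
# 2 speakers actually talk. Field names are hypothetical.
segment = {
    "audio_filepath": "/data/session_long.wav",  # hypothetical path
    "offset": 0.0,
    "duration": 90.0,
}

# Option A: count only the speakers active inside this 90 s window
option_a = {**segment, "num_speakers": 2}

# Option B: count all speakers in the full original recording
option_b = {**segment, "num_speakers": 4}

print(json.dumps(option_a))
print(json.dumps(option_b))
```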
