Description
I am preparing training data for the Sortformer diarization model and have a few questions regarding the handling of long-form audio recordings:
Virtual Slicing via Manifest: If I have a long audio file (e.g., 5 minutes), is it sufficient to define segments in the manifest using only the offset and duration fields (e.g., 0-90s, 90-180s, etc.) without physically cutting the audio into separate files? Will the model train correctly using these offset-based entries?
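To make the question concrete, here is a minimal sketch of what such offset-based "virtual slicing" would look like as a JSON-lines manifest. The field names (`audio_filepath`, `offset`, `duration`, `num_speakers`, `rttm_filepath`) follow the NeMo-style diarization manifest convention, but the paths, the `label`/`text` placeholder values, and the exact set of fields required for Sortformer training are assumptions for illustration only:

```python
import json

# Hypothetical example: one 5-minute recording virtually sliced into
# 90-second training segments via offset/duration, without physically
# cutting the audio file. All paths below are placeholders.
audio_path = "/data/session_01.wav"   # hypothetical path
rttm_path = "/data/session_01.rttm"   # hypothetical path
total_duration = 300.0                # 5-minute recording, in seconds
segment_len = 90.0

entries = []
offset = 0.0
while offset < total_duration:
    duration = min(segment_len, total_duration - offset)
    entries.append({
        "audio_filepath": audio_path,
        "offset": offset,            # start of the virtual slice
        "duration": duration,        # length of the virtual slice
        "label": "infer",
        "text": "-",
        "num_speakers": 4,           # see the num_speakers question below
        "rttm_filepath": rttm_path,
    })
    offset += segment_len

# One JSON object per line, as in a JSON-lines manifest.
for entry in entries:
    print(json.dumps(entry))
```

Note that the last slice is shorter than 90 s (270-300 s), so a 5-minute file yields four manifest entries rather than three.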
Partial Speaker Presence in Segments: Suppose a 5-minute recording has 4 unique speakers in total. However, in a specific 90-second segment (e.g., the first 90 seconds), only 2 of those speakers are actively talking.
Is it okay to include segments where only a subset of the session's total speakers is present?
In the manifest entry for that 90s segment, should the num_speakers field reflect the number of speakers present in that specific segment (2), or the total speakers in the original long recording (4)?
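The distinction can be made concrete by checking, from RTTM-style annotations, which speakers actually overlap a given slice. The RTTM lines, speaker labels, and timings below are invented for illustration; the point is only that a 90 s window of a 4-speaker session may contain fewer than 4 active speakers:

```python
# Hypothetical RTTM-style segments for a 5-minute session with 4 speakers.
# RTTM columns: SPEAKER <file-id> <chan> <onset> <dur> <NA> <NA> <speaker> <NA> <NA>
rttm_lines = [
    "SPEAKER session_01 1 5.0   40.0 <NA> <NA> spk0 <NA> <NA>",
    "SPEAKER session_01 1 50.0  35.0 <NA> <NA> spk1 <NA> <NA>",
    "SPEAKER session_01 1 120.0 60.0 <NA> <NA> spk2 <NA> <NA>",
    "SPEAKER session_01 1 200.0 80.0 <NA> <NA> spk3 <NA> <NA>",
]

def speakers_in_window(lines, offset, duration):
    """Return the set of speakers whose speech overlaps [offset, offset + duration)."""
    end = offset + duration
    active = set()
    for line in lines:
        fields = line.split()
        onset, dur, speaker = float(fields[3]), float(fields[4]), fields[7]
        # A segment overlaps the window iff it starts before the window
        # ends and ends after the window starts.
        if onset < end and onset + dur > offset:
            active.add(speaker)
    return active

# First 90 s slice: only 2 of the session's 4 speakers are active.
print(sorted(speakers_in_window(rttm_lines, 0.0, 90.0)))  # ['spk0', 'spk1']
```

So for this example slice, a per-segment count would give `num_speakers = 2` while the session-level total is 4; which of the two the manifest field should carry is exactly what the question above asks.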