Description
I am preparing training data for the Sortformer diarization model and have a few questions regarding the handling of long-form audio recordings:
Virtual Slicing via Manifest: If I have a long audio file (e.g., 5 minutes), is it sufficient to define segments in the manifest using only the offset and duration fields (e.g., 0-90s, 90-180s, etc.) without physically cutting the audio into separate files? Will the model train correctly using these offset-based entries?
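To make the question concrete, here is a minimal sketch of what such offset-based "virtual slicing" would look like as a JSON-lines manifest. The field names (`audio_filepath`, `offset`, `duration`, `num_speakers`, `rttm_filepath`) follow the NeMo-style diarization manifest convention, but the paths, the `label`/`text` placeholder values, and the exact set of fields required for Sortformer training are assumptions for illustration only:

```python
import json

# Hypothetical example: one 5-minute recording virtually sliced into
# 90-second training segments via offset/duration, without physically
# cutting the audio file. All paths below are placeholders.
audio_path = "/data/session_01.wav"   # hypothetical path
rttm_path = "/data/session_01.rttm"   # hypothetical path
total_duration = 300.0                # 5-minute recording, in seconds
segment_len = 90.0

entries = []
offset = 0.0
while offset < total_duration:
    duration = min(segment_len, total_duration - offset)
    entries.append({
        "audio_filepath": audio_path,
        "offset": offset,            # start of the virtual slice
        "duration": duration,        # length of the virtual slice
        "label": "infer",
        "text": "-",
        "num_speakers": 4,           # see the num_speakers question below
        "rttm_filepath": rttm_path,
    })
    offset += segment_len

# One JSON object per line, as in a JSON-lines manifest.
for entry in entries:
    print(json.dumps(entry))
```

Note that the last slice is shorter than 90 s (270-300 s), so a 5-minute file yields four manifest entries rather than three.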
Partial Speaker Presence in Segments: Suppose a 5-minute recording has 4 unique speakers in total. However, in a specific 90-second segment (e.g., the first 90 seconds), only 2 of those speakers are actively talking.
Is it okay to include segments where only a subset of the session's total speakers is present?
In the manifest entry for that 90s segment, should the num_speakers field reflect the number of speakers present in that specific segment (2), or the total speakers in the original long recording (4)?
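The distinction can be made concrete by checking, from RTTM-style annotations, which speakers actually overlap a given slice. The RTTM lines, speaker labels, and timings below are invented for illustration; the point is only that a 90 s window of a 4-speaker session may contain fewer than 4 active speakers:

```python
# Hypothetical RTTM-style segments for a 5-minute session with 4 speakers.
# RTTM columns: SPEAKER <file-id> <chan> <onset> <dur> <NA> <NA> <speaker> <NA> <NA>
rttm_lines = [
    "SPEAKER session_01 1 5.0   40.0 <NA> <NA> spk0 <NA> <NA>",
    "SPEAKER session_01 1 50.0  35.0 <NA> <NA> spk1 <NA> <NA>",
    "SPEAKER session_01 1 120.0 60.0 <NA> <NA> spk2 <NA> <NA>",
    "SPEAKER session_01 1 200.0 80.0 <NA> <NA> spk3 <NA> <NA>",
]

def speakers_in_window(lines, offset, duration):
    """Return the set of speakers whose speech overlaps [offset, offset + duration)."""
    end = offset + duration
    active = set()
    for line in lines:
        fields = line.split()
        onset, dur, speaker = float(fields[3]), float(fields[4]), fields[7]
        # A segment overlaps the window iff it starts before the window
        # ends and ends after the window starts.
        if onset < end and onset + dur > offset:
            active.add(speaker)
    return active

# First 90 s slice: only 2 of the session's 4 speakers are active.
print(sorted(speakers_in_window(rttm_lines, 0.0, 90.0)))  # ['spk0', 'spk1']
```

So for this example slice, a per-segment count would give `num_speakers = 2` while the session-level total is 4; which of the two the manifest field should carry is exactly what the question above asks.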