Description
Hello!
I would like to use an SSL model from this repo, https://github.com/Open-Speech-EkStep/vakyansh-models, trained with NeMo to align Hindi wavs, i.e. to get alignments between spoken words using continuous features extracted with the SSL model (similar to using a HuBERT model).
I would like to extract features from different layers, not just the last one as I do currently:
import torch

# Forward pass through the SSL model; keep the encoder output and its lengths
with torch.no_grad():
    _, _, feat, feat_len = ssl_model.forward(
        input_signal=wav,
        input_signal_length=torch.tensor([wav.shape[-1]]).to(device),
    )
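For the intermediate layers, the only approach I can think of is registering forward hooks on the encoder blocks. Here is a minimal sketch of what I mean; it assumes the Conformer blocks are exposed as ssl_model.encoder.layers, which I haven't verified and which may differ between NeMo versions:

import torch

# Hypothetical sketch: capture per-layer outputs with forward hooks.
# Assumes ssl_model.encoder.layers is an nn.ModuleList of Conformer blocks
# (attribute names may differ across NeMo versions -- please verify).
layer_outputs = {}

def make_hook(name):
    def hook(module, inputs, output):
        # A layer may return a tensor or a tuple; keep the tensor.
        layer_outputs[name] = output[0] if isinstance(output, tuple) else output
    return hook

handles = [
    layer.register_forward_hook(make_hook(f"layer_{i}"))
    for i, layer in enumerate(ssl_model.encoder.layers)
]

with torch.no_grad():
    _, _, feat, feat_len = ssl_model.forward(
        input_signal=wav,
        input_signal_length=torch.tensor([wav.shape[-1]]).to(device),
    )

for h in handles:
    h.remove()

# layer_outputs now maps "layer_0" ... "layer_N" to intermediate features.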
Extracting the features this way yields very strange results when I try to align them via a frame-wise similarity matrix; see the sketch below for roughly how I compute it.
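For context, this is a simplified sketch of the similarity matrix computation (feat_a and feat_b are placeholder names for the frame-level features of two utterances, shaped [T, D]; note the NeMo encoder output may come out as [B, D, T] and need squeezing/transposing first):

import torch
import torch.nn.functional as F

# feat_a: [T1, D], feat_b: [T2, D] -- placeholder frame-level features
a = F.normalize(feat_a, dim=-1)  # unit-normalize each frame vector
b = F.normalize(feat_b, dim=-1)
sim = a @ b.T                    # [T1, T2] cosine similarity matrix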
The matrix clearly doesn't capture good phonetic information, but I am also worried that I am not extracting the features from the encoder correctly. (P.S. I also apply VAD to the audio files.)
Does anyone have experience with this? Any suggestions would be greatly appreciated!