Description
Hello!
I would like to use an SSL model from this repo, https://github.com/Open-Speech-EkStep/vakyansh-models, trained with NeMo to align Hindi wavs, i.e. to get alignments between spoken words using continuous features extracted with the SSL model (similar to using a HuBERT model).
I would like to extract features from different layers, not just the last one as I do currently:
import torch

# Forward pass through the SSL model; keep the encoder output and its lengths
with torch.no_grad():
    _, _, feat, feat_len = ssl_model.forward(
        input_signal=wav,
        input_signal_length=torch.tensor([wav.shape[-1]]).to(device),
    )
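For the intermediate layers, the only approach I can think of is registering forward hooks on the encoder blocks. Here is a minimal sketch of what I mean; it assumes the Conformer blocks are exposed as ssl_model.encoder.layers, which I haven't verified and which may differ between NeMo versions:

import torch

# Hypothetical sketch: capture per-layer outputs with forward hooks.
# Assumes ssl_model.encoder.layers is an nn.ModuleList of Conformer blocks
# (attribute names may differ across NeMo versions -- please verify).
layer_outputs = {}

def make_hook(name):
    def hook(module, inputs, output):
        # A layer may return a tensor or a tuple; keep the tensor.
        layer_outputs[name] = output[0] if isinstance(output, tuple) else output
    return hook

handles = [
    layer.register_forward_hook(make_hook(f"layer_{i}"))
    for i, layer in enumerate(ssl_model.encoder.layers)
]

with torch.no_grad():
    _, _, feat, feat_len = ssl_model.forward(
        input_signal=wav,
        input_signal_length=torch.tensor([wav.shape[-1]]).to(device),
    )

for h in handles:
    h.remove()

# layer_outputs now maps "layer_0" ... "layer_N" to intermediate features.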
Extracting the features this way yields very strange results when I try to align them via a frame-wise similarity matrix; see the sketch below for roughly how I compute it.
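For context, this is a simplified sketch of the similarity matrix computation (feat_a and feat_b are placeholder names for the frame-level features of two utterances, shaped [T, D]; note the NeMo encoder output may come out as [B, D, T] and need squeezing/transposing first):

import torch
import torch.nn.functional as F

# feat_a: [T1, D], feat_b: [T2, D] -- placeholder frame-level features
a = F.normalize(feat_a, dim=-1)  # unit-normalize each frame vector
b = F.normalize(feat_b, dim=-1)
sim = a @ b.T                    # [T1, T2] cosine similarity matrix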
The matrix clearly doesn't capture good phonetic information, but I am also worried that I am not extracting the features from the encoder correctly. (P.S. I also apply VAD to the audio files.)
Does anyone have experience with this? Any suggestions would be greatly appreciated!