
Extract features from different layers of a SSL model #15408

@gabitza-tech

Description


Hello!

I would like to use an SSL model from this repo, https://github.com/Open-Speech-EkStep/vakyansh-models, trained with NeMo, to align Hindi wavs, i.e. to get alignments between spoken words using continuous features extracted with the SSL model (similar to using a HuBERT model).

I would like to extract features from different layers, not just the last one like I do currently:

# Forward pass through the SSL model; only the final encoder output is kept
with torch.no_grad():
    _, _, feat, feat_len = ssl_model.forward(
        input_signal=wav,
        input_signal_length=torch.tensor([wav.shape[-1]]).to(device),
    )

Extracting the features this way yields very strange results when I compute a similarity matrix to align them:

[Image: frame-to-frame similarity matrix]

It clearly doesn't capture good phonetic information, but I am also worried that I am not extracting the features from the encoder correctly. (P.S. I also apply VAD over the audios.)
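For reference, a frame-to-frame similarity matrix like the one above can be computed as cosine similarity between L2-normalized frame features. A minimal, self-contained sketch (illustrative only; the function name and shapes are my own, not from any library):

```python
import torch

def cosine_similarity_matrix(feats_a, feats_b, eps=1e-8):
    """feats_a: (T1, D), feats_b: (T2, D) -> (T1, T2) cosine similarities."""
    a = feats_a / feats_a.norm(dim=-1, keepdim=True).clamp_min(eps)
    b = feats_b / feats_b.norm(dim=-1, keepdim=True).clamp_min(eps)
    return a @ b.t()

# Self-similarity of a random feature sequence of 30 frames, 16 dims.
feats = torch.randn(30, 16)
sim = cosine_similarity_matrix(feats, feats)
print(sim.shape)  # torch.Size([30, 30])
# Each frame matches itself perfectly, so the diagonal is 1.
print(torch.allclose(sim.diagonal(), torch.ones(30), atol=1e-5))  # True
```

With real SSL features, blocky high-similarity regions along the diagonal typically correspond to phone- or word-level segments; a noisy, structureless matrix suggests the features (or the layer they come from) are the problem.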

Does anyone have experience with this? Any suggestions would be greatly appreciated!!!
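One way to capture intermediate-layer outputs without modifying the model is PyTorch forward hooks. Below is a minimal, self-contained sketch using a toy stack of layers; the idea would be to register the same hooks on the real encoder's blocks (e.g. something like `ssl_model.encoder.layers` — that attribute path is an assumption about the NeMo Conformer encoder, so check the actual module names with `print(ssl_model)` first):

```python
import torch
import torch.nn as nn

# Toy stand-in for an SSL encoder: a stack of blocks, analogous to the
# encoder's ModuleList of layers (the exact attribute path in NeMo is an
# assumption to verify on the real model).
encoder_layers = nn.ModuleList([nn.Linear(16, 16) for _ in range(4)])

captured = {}  # layer index -> activation tensor

def make_hook(idx):
    def hook(module, inputs, output):
        # Detach so the hook does not keep the autograd graph alive.
        captured[idx] = output.detach()
    return hook

handles = [layer.register_forward_hook(make_hook(i))
           for i, layer in enumerate(encoder_layers)]

x = torch.randn(1, 50, 16)  # (batch, time, feature)
with torch.no_grad():
    for layer in encoder_layers:
        x = layer(x)

# Every layer's output was recorded during the forward pass.
print(sorted(captured.keys()))  # [0, 1, 2, 3]
print(captured[2].shape)        # torch.Size([1, 50, 16])

for h in handles:
    h.remove()  # clean up so later forward passes are not recorded
```

On the real model you would run the usual `ssl_model.forward(...)` once and then read the per-layer activations out of `captured`, which lets you compare similarity matrices across layers (middle layers of HuBERT-style models are often reported to carry more phonetic information than the last one).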
