I have been following the tutorial for feature extraction using pytorch audio here: https://pytorch.org/audio/0.10.0/pipelines.html#wav2vec-2-0-hubert-representation-learning
It says the result is a list of tensors of length 12, where each entry is the output of a transformer layer. So the first tensor in the list has a shape of something like (1, 2341, 768).
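For reference, this is roughly what I am doing (a minimal sketch following the pipeline usage in the linked docs; the HUBERT_BASE bundle and the file path are just stand-ins for my actual setup):

```python
import torch
import torchaudio

# Representation-learning bundle from torchaudio.pipelines (assumed HUBERT_BASE here)
bundle = torchaudio.pipelines.HUBERT_BASE
model = bundle.get_model()

# Load and resample to the rate the bundle expects
waveform, sample_rate = torchaudio.load("example.wav")  # placeholder path
waveform = torchaudio.functional.resample(waveform, sample_rate, bundle.sample_rate)

with torch.inference_mode():
    features, _ = model.extract_features(waveform)

print(len(features))        # 12 (one tensor per transformer layer)
print(features[0].shape)    # e.g. torch.Size([1, 2341, 768]) for most files
```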
This seems correct, as I get this result for most audio files.
However, for some files I still get a list of length 12, but the entries bizarrely have a batch size greater than 1, so the shape is (2, 2341, 768).
I am baffled as to why this happens.
Any clues would be great.