I have been following the tutorial for feature extraction using pytorch audio here: https://pytorch.org/audio/0.10.0/pipelines.html#wav2vec-2-0-hubert-representation-learning
It says the result is a list of tensors of length 12, where each entry is the output of a transformer layer. So the first tensor in the list has a shape of something like (1, 2341, 768).
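For reference, this is roughly what I am doing (a minimal sketch following the pipeline usage in the linked docs; the HUBERT_BASE bundle and the file path are just stand-ins for my actual setup):

```python
import torch
import torchaudio

# Representation-learning bundle from torchaudio.pipelines (assumed HUBERT_BASE here)
bundle = torchaudio.pipelines.HUBERT_BASE
model = bundle.get_model()

# Load and resample to the rate the bundle expects
waveform, sample_rate = torchaudio.load("example.wav")  # placeholder path
waveform = torchaudio.functional.resample(waveform, sample_rate, bundle.sample_rate)

with torch.inference_mode():
    features, _ = model.extract_features(waveform)

print(len(features))        # 12 (one tensor per transformer layer)
print(features[0].shape)    # e.g. torch.Size([1, 2341, 768]) for most files
```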
This seems correct, as I get this result for most audio files.
However, for some files I still get a list of length 12, but the entries bizarrely have a batch size greater than 1, so the shape is (2, 2341, 768).
I am baffled as to why this happens.
Any clues would be great.