
I have set batch_size to 64, but when I print out the train_batch and val_batch, the size is not equal to 64.

The train data and val data are in the format below: [screenshot of the train/val CSV files, each with a 'sentences' column and a 'labels' column]

First, I define the TEXT and LABEL fields.

import torch
from torchtext import data

# simple whitespace tokenizer
tokenize = lambda x: x.split()

TEXT = data.Field(sequential=True, tokenize=tokenize)
LABEL = data.Field(sequential=False)

Then, following some tutorials, I wrote the code below:

train_data, valid_data = data.TabularDataset.splits(
        path='.',
        train='train_intent.csv', validation='val.csv',
        format='csv',
        fields={'sentences': ('text', TEXT),
                'labels': ('label', LABEL)}
)

test_data = data.TabularDataset(
        path='test.csv',
        format='csv',
        fields={'sentences': ('text', TEXT)}
)
TEXT.build_vocab(train_data)
LABEL.build_vocab(train_data)

BATCH_SIZE = 64
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iter, val_iter = data.BucketIterator.splits(
    (train_data, valid_data),
    batch_sizes=(BATCH_SIZE, BATCH_SIZE),
    sort_key=lambda x: len(x.text),
    sort_within_batch=False,
    repeat=False,
    device=device
)

But when I checked whether the iterator was working, I found the following strange thing:

train_batch = next(iter(train_iter))
print(train_batch.text.shape)
print(train_batch.label.shape)
[output]
torch.Size([15, 64])
torch.Size([64])

And the training process raises the error ValueError: Expected input batch_size (15) to match target batch_size (64). This is the training function:

def train(model, iterator, optimizer, criterion):

    epoch_loss = 0

    model.train()

    for batch in iterator:

        optimizer.zero_grad()

        predictions = model(batch.text)
        loss = criterion(predictions, batch.label)
        loss.backward()
        optimizer.step()

        epoch_loss += loss.item()

    return epoch_loss / len(iterator)

Any hint would be highly appreciated. Thanks!

Comment: Not having done anything with torchtext or NLP, I see you're working with Chinese characters, so my guess is that this issue stems from UTF encoding having variable character lengths. Taking n bytes of a UTF string does not guarantee getting any specific number of characters, and you may even end up in the middle of a character. Does this sound reasonable as the cause of the issue? (Commented Jan 22, 2019)

2 Answers


The returned batch size is not always equal to batch_size. For example, if you have 100 training examples and batch_size is 64, the returned batch sizes will be [64, 36].

Code: https://github.com/pytorch/text/blob/1c2ae32d67f7f7854542212b229cd95c85cf4026/torchtext/data/iterator.py#L255-L271
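To illustrate (not from the original answer, just a minimal sketch using a plain PyTorch DataLoader with 100 dummy samples): the last batch of an epoch simply contains whatever examples are left over.

import torch
from torch.utils.data import DataLoader, TensorDataset

# 100 dummy samples with batch_size=64 -> one batch of 64, then a final batch of 36
dataset = TensorDataset(torch.arange(100))
loader = DataLoader(dataset, batch_size=64)

for (batch,) in loader:
    print(batch.shape[0])  # prints 64, then 36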



I also encountered this problem. I think the issue is that batch_size is not at the shape[0] position. In your question:

train_batch = next(iter(train_iter))
print(train_batch.text.shape)
print(train_batch.label.shape)
[output]
torch.Size([15, 64])
torch.Size([64])

Here 15 is the max sequence length in the batch, which can be fixed with fix_length in the Field definition, and 64 is the batch size. I think you can reshape your text to solve this, but I'm also looking for a better answer.
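Not part of the original answer, just a sketch of the two usual workarounds, assuming the model expects its input with the batch dimension first (model, criterion, TEXT, tokenize and train_iter are the names from the question): either rebuild the TEXT field with batch_first=True (fix_length is optional, and the value 20 below is only an example), or transpose batch.text inside the training loop.

# Option A: have torchtext return [batch, seq_len] tensors directly
TEXT = data.Field(sequential=True, tokenize=tokenize,
                  batch_first=True, fix_length=20)
# (rebuild the vocab and iterators after changing the field)

# Option B: keep the field as-is and transpose each batch before the forward pass
for batch in train_iter:
    text = batch.text.t()                      # [15, 64] -> [64, 15]
    predictions = model(text)
    loss = criterion(predictions, batch.label)

Whether either option is enough depends on what the model does with its input internally, which the question does not show.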
