
I have set batch_size to 64, but when I print out the train_batch and val_batch, the size is not equal to 64.

The train data and val data are in the format below: [screenshot of the train/val CSV files, each with a 'sentences' column and a 'labels' column]

First, I define the TEXT and LABEL fields.

import torch
from torchtext import data

# simple whitespace tokenizer
tokenize = lambda x: x.split()

TEXT = data.Field(sequential=True, tokenize=tokenize)
LABEL = data.Field(sequential=False)

Then, following some tutorials, I wrote the code below:

train_data, valid_data = data.TabularDataset.splits(
        path='.',
        train='train_intent.csv', validation='val.csv',
        format='csv',
        fields={'sentences': ('text', TEXT),
                'labels': ('label', LABEL)}
)

test_data = data.TabularDataset(
        path='test.csv',
        format='csv',
        fields={'sentences': ('text', TEXT)}
)
TEXT.build_vocab(train_data)
LABEL.build_vocab(train_data)

BATCH_SIZE = 64
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iter, val_iter = data.BucketIterator.splits(
    (train_data, valid_data),
    batch_sizes=(BATCH_SIZE, BATCH_SIZE),
    sort_key=lambda x: len(x.text),
    sort_within_batch=False,
    repeat=False,
    device=device
)

But when I checked whether the iterator was working, I found the following strange thing:

train_batch = next(iter(train_iter))
print(train_batch.text.shape)
print(train_batch.label.shape)
[output]
torch.Size([15, 64])
torch.Size([64])

And the training process raises the error ValueError: Expected input batch_size (15) to match target batch_size (64). This is the training function:

def train(model, iterator, optimizer, criterion):

    epoch_loss = 0

    model.train()

    for batch in iterator:

        optimizer.zero_grad()

        predictions = model(batch.text)
        loss = criterion(predictions, batch.label)
        loss.backward()
        optimizer.step()

        epoch_loss += loss.item()

    return epoch_loss / len(iterator)

Any hint would be highly appreciated. Thanks!

Comment: Not having done anything with torchtext or NLP, I see you're working with Chinese characters, so my guess is that this issue stems from UTF encoding having variable character lengths. Taking n bytes of a UTF string does not guarantee getting any specific number of characters, and you may even end up in the middle of a character. Does this sound reasonable as the cause of the issue? (Commented Jan 22, 2019)

2 Answers


The returned batch size is not always equal to batch_size. For example, if you have 100 training examples and batch_size is 64, the returned batch sizes will be [64, 36].

Code: https://github.com/pytorch/text/blob/1c2ae32d67f7f7854542212b229cd95c85cf4026/torchtext/data/iterator.py#L255-L271
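To illustrate (not from the original answer, just a minimal sketch using a plain PyTorch DataLoader with 100 dummy samples): the last batch of an epoch simply contains whatever examples are left over.

import torch
from torch.utils.data import DataLoader, TensorDataset

# 100 dummy samples with batch_size=64 -> one batch of 64, then a final batch of 36
dataset = TensorDataset(torch.arange(100))
loader = DataLoader(dataset, batch_size=64)

for (batch,) in loader:
    print(batch.shape[0])  # prints 64, then 36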



I also encountered this problem. I think the issue is that batch_size is not at the shape[0] position. In your question:

train_batch = next(iter(train_iter))
print(train_batch.text.shape)
print(train_batch.label.shape)
[output]
torch.Size([15, 64])
torch.Size([64])

Here 15 is the max sequence length in the batch, which can be fixed with fix_length in the Field definition, and 64 is the batch size. I think you can reshape your text to solve this, but I'm also looking for a better answer.
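Not part of the original answer, just a sketch of the two usual workarounds, assuming the model expects its input with the batch dimension first (model, criterion, TEXT, tokenize and train_iter are the names from the question): either rebuild the TEXT field with batch_first=True (fix_length is optional, and the value 20 below is only an example), or transpose batch.text inside the training loop.

# Option A: have torchtext return [batch, seq_len] tensors directly
TEXT = data.Field(sequential=True, tokenize=tokenize,
                  batch_first=True, fix_length=20)
# (rebuild the vocab and iterators after changing the field)

# Option B: keep the field as-is and transpose each batch before the forward pass
for batch in train_iter:
    text = batch.text.t()                      # [15, 64] -> [64, 15]
    predictions = model(text)
    loss = criterion(predictions, batch.label)

Whether either option is enough depends on what the model does with its input internally, which the question does not show.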
