
I’m doing seq2seq machine translation on my own dataset. I have preprocessed my dataset using the code below.

The problem comes when I try to build the iterator for train_data using BucketIterator.splits().

def tokenize_word(text):
  return nltk.word_tokenize(text)

id = Field(sequential=True, tokenize = tokenize_word, lower=True, init_token="<sos>", eos_token="<eos>")
ti = Field(sequential=True, tokenize = tokenize_word, lower=True, init_token="<sos>", eos_token="<eos>")

fields = {'id': ('i', id), 'ti': ('t', ti)}

train_data = TabularDataset.splits(
    path='/content/drive/MyDrive/Colab Notebooks/Tidore/',
    train = 'id_ti.tsv',
    format='tsv',
    fields=fields
)[0]

id.build_vocab(train_data)
ti.build_vocab(train_data)

print(f"Unique tokens in source (id) vocabulary: {len(id.vocab)}")
print(f"Unique tokens in target (ti) vocabulary: {len(ti.vocab)}")

train_iterator = BucketIterator.splits(
    (train_data),
    batch_size = batch_size,
    sort_within_batch = True,
    sort_key = lambda x: len(x.id),
    device = device
)

print(len(train_iterator))

for data in train_iterator:
  print(data.i)

This is the result of the code above

Unique tokens in source (id) vocabulary: 1425
Unique tokens in target (ti) vocabulary: 1297
2004

---------------------------------------------------------------------------

AttributeError                            Traceback (most recent call last)

<ipython-input-72-e73a211df4bd> in <module>()
     31 
     32 for data in train_iterator:
---> 33   print(data.i)

AttributeError: 'BucketIterator' object has no attribute 'i'

This is what I get when I try to iterate over train_iterator and print from it.

I am very confused, because I don’t know which attribute I should use on the train iterator. Thank you for your help.


2 Answers

train_iterator = BucketIterator.splits(
  (train_data),
  batch_size = batch_size,
  sort_within_batch = True,
  sort_key = lambda x: len(x.id),
  device = device
)

The problem is here. Use BucketIterator instead of BucketIterator.splits when only one iterator needs to be generated: splits is meant to build one iterator per dataset (train/valid/test) and returns a tuple of iterators, not a single iterator.

I ran into this problem myself, and the change above fixed it.
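To see why the original call misbehaves: `(train_data)` is just `train_data`, since only a trailing comma makes a one-element tuple in Python. The legacy `splits` classmethod then iterates over what it was given and builds one iterator per element. Here is a plain-Python analogy (no torchtext required; `splits` below is a stand-in, not the real implementation):

```python
def splits(datasets):
    # Rough analogy of the legacy torchtext classmethod:
    # build one "iterator" per entry in `datasets`, return a tuple.
    return tuple(f"iterator({d})" for d in datasets)

examples = ["ex0", "ex1", "ex2"]   # stands in for a Dataset of Examples

correct = splits((examples,))      # one-element tuple -> 1 iterator
wrong = splits((examples))         # (examples) == examples -> 3 "iterators"!

print(len(correct))  # 1
print(len(wrong))    # 3
```

This would also explain why `print(len(train_iterator))` showed 2004 in the question: one iterator was created per example in the dataset, and iterating over that tuple yields BucketIterator objects, hence the AttributeError.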



According to the torchtext documentation, it's better to use TranslationDataset for what you want to do, but if for some reason you prefer TabularDataset, it's better to do it like this:

import nltk
print(nltk.__version__)
from torchtext import data
import torchtext
print(torchtext.__version__)

def tokenize_word(text):
    return nltk.word_tokenize(text)

batch_size = 5

SRC = data.Field(sequential=True, tokenize=tokenize_word, lower=True, init_token="<sos>", eos_token="<eos>")
TRG = data.Field(sequential=True, tokenize=tokenize_word, lower=True, init_token="<sos>", eos_token="<eos>")

train = data.TabularDataset.splits(
    path='./data/', train='tr.tsv', format='tsv',
    fields=[('src', SRC), ('trg', TRG)])[0]

SRC.build_vocab(train)
TRG.build_vocab(train)

# Note BucketIterator (not .splits) for a single dataset, and that the
# sort key must reference a field that exists on the examples ('src'/'trg').
train_iter = data.BucketIterator(
    train, batch_size=batch_size,
    sort_key=lambda x: len(x.src), device=0)

for item in train_iter:
    print(item.trg)

Output:

3.6.2
0.6.0
tensor([[2, 2, 2, 2, 2],
        [5, 5, 5, 5, 5],
        [4, 4, 4, 4, 4],
        [6, 6, 6, 6, 6],
        [7, 7, 7, 7, 7],
        [3, 3, 3, 3, 3]])
tensor([[2, 2, 2, 2, 2],
        [5, 5, 5, 5, 5],
        [4, 4, 4, 4, 4],
        [6, 6, 6, 6, 6],
        [7, 7, 7, 7, 7],
        [3, 3, 3, 3, 3]])

NOTE: make sure there is a tr.tsv file in the data directory containing two text columns separated by a tab. Welcome to Stack Overflow & hope it helps :)
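If you want to sanity-check the expected file layout first, a minimal tr.tsv can be generated like this (the sentence pairs below are placeholders, not real parallel data):

```python
import csv
import os
import tempfile

# Two-column TSV in the layout TabularDataset expects: source<TAB>target per line.
rows = [
    ("first source sentence", "first target sentence"),
    ("second source sentence", "second target sentence"),
]

data_dir = tempfile.mkdtemp()          # use './data/' in the actual project
path = os.path.join(data_dir, "tr.tsv")
with open(path, "w", newline="", encoding="utf-8") as f:
    csv.writer(f, delimiter="\t").writerows(rows)

with open(path, encoding="utf-8") as f:
    print(f.read())
```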

