I searched for a similar question but didn't come across one. I'm new to this area, so I hope I've explained my question well enough.
I have a dataset consisting of text data. I store it in a list, and every row of the list consists of a single string value. The rows are not all the same length, though. I want them to have the same length so I can use them in a self-attention model.
A sample of my dataset:
In [8]: myList
Out[8]:
[
['the first line of my dataset'],
['the second line'],
['the 3rd'],
['the 4th'],
['the 5th'],
['the 6th'],
['the 7th'],
]
As you can see, the first row is longer than the rest. I want to pad the shorter rows with a certain value, such as #, so that every row has the same word count.
A sample of the output I'd like to get:
In [8]: myList
Out[8]:
[
['the first line of my dataset'],
['the second line # # #'],
['the 3rd # # # #'],
['the 4th # # # #'],
['the 5th # # # #'],
['the 6th # # # #'],
['the 7th # # # #']
]
If this were a DataFrame, I could use the fillna() function from the Pandas library. I tried to apply this:
train_X = pd.Series(train_X).fillna("#").values
but since it is a nested list (I guess), it didn't work. Is there a better way to do that?
Any recommendation is appreciated.
You could use the pad_sequences function from keras, but you'll have to tokenize your sentences first.
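If you only need the padded strings shown in the question, a plain-Python approach may be enough. The following is a minimal sketch, not a definitive implementation: it assumes each row is a one-element list holding a string, that words are separated by whitespace, and that "#" never appears as a real word. The names pad_rows and my_list are made up for illustration.

```python
def pad_rows(rows, pad_token="#"):
    """Pad every one-string row so that all rows have the same word count."""
    # Split each row's string on whitespace to get its words.
    tokenized = [row[0].split() for row in rows]
    # The longest row decides the target length (0 if there are no rows).
    max_len = max((len(words) for words in tokenized), default=0)
    # Append pad tokens to the shorter rows and join the words back together.
    return [
        [" ".join(words + [pad_token] * (max_len - len(words)))]
        for words in tokenized
    ]


# Hypothetical sample data mirroring the question.
my_list = [
    ["the first line of my dataset"],
    ["the second line"],
    ["the 3rd"],
]

padded = pad_rows(my_list)
for row in padded:
    print(row)
```

For an actual self-attention model you would normally map each word to an integer id and pad those id sequences instead, which is what pad_sequences operates on. That is why the tokenization step comes first.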