Filling missing values for Embedded List in Python3

Question

I searched for a similar question but couldn't find one. I'm new to this area, so I hope I've explained my question well enough.

I have a dataset consisting of text data. I store it in a list, and every row of the list holds a single string. The rows do not all contain the same number of words. I want them to be equal in length so I can use them in a self-attention model.

A sample of my dataset:

In [8]: myList
Out[8]: 
[
['the first line of my dataset'], 
['the second line'],
['the 3rd'],
['the 4th'],
['the 5th'],
['the 6th'],
['the 7th'],
]

As you can see, the first row is longer than the rest. I want to fill the shorter rows with a placeholder value such as # to equalize the word count.

The output I'd like to get:

In [8]: myList
Out[8]: 
[
['the first line of my dataset'], 
['the second line # # #'],
['the 3rd # # # #'],
['the 4th # # # #'],
['the 5th # # # #'],
['the 6th # # # #'],
['the 7th # # # #']
]

If this were a DataFrame, I could use the fillna() function from the pandas library. I tried this:

train_X = pd.Series(train_X).fillna("#").values

but since it is a nested list (I guess), it didn't work. Is there a better way to do this?

Any recommendations are appreciated.

You can use the pad_sequences function from keras, but you'll have to tokenize your sentences first.
– bkshi, Commented Apr 28, 2020 at 13:58

aysebilgegunduz · Accepted Answer · 2020-04-29 09:04:49Z

Following @bkshi's suggestion, I came up with the solution below.

Since texts_to_sequences() maps words to integer indices starting from 1, I could use pad_sequences() and pad with 0 instead of a string value.

This satisfies my requirements, so I used a number as padding instead of a string.

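from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

my_list = [['the first line'],
           ['the 2nd line'],
           ['the 3r line'],
           ['the 4th line'],
           ['the 5th line'],
           ['the'],
           ['the 5th line, this is']]
max_features = 10  # vocabulary cap: only the most frequent (max_features - 1) words are kept

# Each row is a one-element list; pull out the string so the Tokenizer splits it into words.
# (Given a list of lists, the Tokenizer treats each inner element as a single token.)
texts = [row[0] for row in my_list]

tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
# Pad the end of each sequence with 0 so every row has the same length
my_list = pad_sequences(sequences, maxlen=None, dtype='int32', padding='post', truncating='post', value=0)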
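
If the literal output from the question is needed (strings padded with "#" rather than integer sequences padded with 0), plain Python may be enough. The following is only a minimal sketch and not part of the original answer: the helper name pad_rows and its pad_token parameter are made up for illustration, and it assumes every row is a one-element list holding a whitespace-separated string.

```python
def pad_rows(rows, pad_token="#"):
    """Pad each one-string row with pad_token so all rows have the same word count."""
    # Split the single string in each row into words.
    tokenized = [row[0].split() for row in rows]
    # The longest row determines the target length.
    max_len = max((len(words) for words in tokenized), default=0)
    # Append pad tokens to shorter rows, then join back into a one-string row.
    return [
        [" ".join(words + [pad_token] * (max_len - len(words)))]
        for words in tokenized
    ]


my_list = [
    ["the first line of my dataset"],
    ["the second line"],
    ["the 3rd"],
]
padded = pad_rows(my_list)
print(padded)
```

Note that a model would still need these words converted to numbers, which is why the Tokenizer and pad_sequences route above is usually the more practical choice for a self-attention model.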
