
I'm trying to load my pandas DataFrame (df) into a TensorFlow dataset with the following command:

target = df['label']
features = df['encoded_sentence']

dataset = tf.data.Dataset.from_tensor_slices((features.values, target.values))

Here's an excerpt from my pandas dataframe:

+-------+-----------------------+------------------+
| label | sentence              | encoded_sentence |
+-------+-----------------------+------------------+
| 0     | Hello world           | [5, 7]           |
+-------+-----------------------+------------------+
| 1     | my name is john smith | [1, 9, 10, 2, 6] |
+-------+-----------------------+------------------+
| 1     | Hello! My name is     | [5, 3, 9, 10]    |
+-------+-----------------------+------------------+
| 0     | foo baar              | [8, 4]           |
+-------+-----------------------+------------------+

# df.dtypes gives me:
label               int8
sentence            object
encoded_sentence    object

But it keeps giving me a ValueError:

ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type list).

Can anyone tell me how to use the encoded sentences in my TensorFlow dataset? Help would be greatly appreciated!


2 Answers


You can make your Pandas values into a ragged tensor first and then make the dataset from it:

import tensorflow as tf
import pandas as pd

df = pd.DataFrame({'label': [0, 1, 1, 0],
                   'sentence': ['Hello world', 'my name is john smith',
                                'Hello! My name is', 'foo baar'],
                   'encoded_sentence': [[5, 7], [1, 9, 10, 2, 6],
                                        [5, 3, 9, 10], [8, 4]]})
features = tf.ragged.stack(list(df['encoded_sentence']))
target = tf.convert_to_tensor(df['label'].values)
dataset = tf.data.Dataset.from_tensor_slices((features, target))
for f, t in dataset:
    print(f.numpy(), t.numpy())

Output:

[5 7] 0
[ 1  9 10  2  6] 1
[ 5  3  9 10] 1
[8 4] 0

Note you may want to use padded_batch to get batches of examples from the dataset.

EDIT: Since padded-batching does not seem to work with a dataset made from a ragged tensor at the moment, you can also convert the ragged tensor to a regular one first:

import tensorflow as tf
import pandas as pd

df = pd.DataFrame({'label': [0, 1, 1, 0],
                   'sentence': ['Hello world', 'my name is john smith',
                                'Hello! My name is', 'foo baar'],
                   'encoded_sentence': [[5, 7], [1, 9, 10, 2, 6],
                                        [5, 3, 9, 10], [8, 4]]})
features_ragged = tf.ragged.stack(list(df['encoded_sentence']))
features = features_ragged.to_tensor(default_value=-1)
target = tf.convert_to_tensor(df['label'].values)
dataset = tf.data.Dataset.from_tensor_slices((features, target))
batches = dataset.batch(2)
for f, t in batches:
    print(f.numpy(), t.numpy())

Output:

[[ 5  7 -1 -1 -1]
 [ 1  9 10  2  6]] [0 1]
[[ 5  3  9 10 -1]
 [ 8  4 -1 -1 -1]] [1 0]
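If you specifically want padded_batch, a sketch of another option (not tested against every TF version, and assuming TF >= 2.4 for output_signature) is to build the dataset from a generator, so each element keeps its own variable length and padded_batch can pad per batch:

```python
import tensorflow as tf
import pandas as pd

df = pd.DataFrame({'label': [0, 1, 1, 0],
                   'encoded_sentence': [[5, 7], [1, 9, 10, 2, 6],
                                        [5, 3, 9, 10], [8, 4]]})

def gen():
    # Yield each variable-length encoded sentence together with its label.
    for enc, lab in zip(df['encoded_sentence'], df['label']):
        yield enc, lab

dataset = tf.data.Dataset.from_generator(
    gen,
    output_signature=(tf.TensorSpec(shape=(None,), dtype=tf.int32),
                      tf.TensorSpec(shape=(), dtype=tf.int32)))

# padded_batch pads each feature to the longest sequence in its batch.
batches = dataset.padded_batch(2, padding_values=(-1, 0))
for f, t in batches:
    print(f.numpy(), t.numpy())
```

Note that padded_batch pads only to the longest element within each batch, so different batches can have different widths.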

3 Comments

Thank you so much for your help! When I try to create a batch it gives me a type error... TypeError: ('Padded batching of components of type ', <class 'tensorflow.python.ops.ragged.ragged_tensor.RaggedTensorSpec'>, ' is not supported.') Can you tell the correct way to create a train and test set?
@StudentAsker I see, I'd say that is a bug, I filed issue #39163.
@StudentAsker I added an alternative simply converting the ragged tensor into a regular one.

You can encode each array into a single string; the usual ways of creating a tf.data.Dataset (such as from_tensor_slices) will then accept the column, since it is just a column of strings.

Inside the tf.data pipeline you can then split the string feature back into a RaggedTensor and call to_tensor() on it. I'll provide an example below. Here is where I first found this string-encoding workaround: https://keras.io/examples/structured_data/movielens_recommendations_transformers/#encode-input-features

# Encode the pandas DataFrame column as a string:
import numpy as np
import tensorflow as tf

def encode_list_as_string(int_list: list, separator=","):
    return separator.join(map(str, int_list))

def encode_np_array_as_string_sep_comma(input: np.ndarray, separator=","):
    return separator.join(input.astype(str))

df['col_name'] = df['col_name'].apply(encode_list_as_string)
# or equivalently:
# df['col_name'] = df['col_name'].map(encode_list_as_string)


# Decode the string-encoded column inside the tf.data pipeline.
# When mapped over a *batched* dataset, tf.strings.split returns a
# RaggedTensor, which to_tensor() pads into a regular tensor.
def expand_string_to_tensor(features, col_name, d_type):
    split = tf.strings.split(features[col_name], ",")
    features[col_name] = tf.strings.to_number(split, out_type=d_type).to_tensor()
    return features

dataset = dataset.map(lambda x: expand_string_to_tensor(x, 'col_name', tf.int32))

The caveat is that the dataset is then a _MapDataset.
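For reference, here's a minimal end-to-end sketch of this workaround using the question's data (the dict layout and zero padding are just illustrative choices, not part of the original workaround):

```python
import tensorflow as tf
import pandas as pd

df = pd.DataFrame({'label': [0, 1, 1, 0],
                   'encoded_sentence': [[5, 7], [1, 9, 10, 2, 6],
                                        [5, 3, 9, 10], [8, 4]]})

# Encode each list as a comma-separated string so from_tensor_slices accepts it.
df['encoded_sentence'] = df['encoded_sentence'].map(
    lambda xs: ",".join(map(str, xs)))

dataset = tf.data.Dataset.from_tensor_slices(
    ({'encoded_sentence': df['encoded_sentence'].values},
     df['label'].values))

def decode(features, label):
    # Split the batch of strings into a RaggedTensor, parse the numbers,
    # and pad to a regular tensor (default padding value is 0).
    split = tf.strings.split(features['encoded_sentence'], ",")
    features['encoded_sentence'] = tf.strings.to_number(
        split, out_type=tf.int32).to_tensor()
    return features, label

batches = dataset.batch(2).map(decode)
for f, t in batches:
    print(f['encoded_sentence'].numpy(), t.numpy())
```

Decoding after batch() means to_tensor() pads each batch only to its own longest sequence, with 0 as the default padding value.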

