
Here's the link to the Colab notebook: https://colab.research.google.com/drive/1wftAvDu_Wu2Y9ahgI1Z1FLciUH5MnSJ9

train_labels = ['GovernmentSchemes', 'GovernmentSchemes', 'GovernmentSchemes', 'GovernmentSchemes', 'CropInsurance']

training_label_seq = np.array(label_tokenizer.texts_to_sequences(train_labels))

Actual output:

[list([3]) list([3]) list([3]) ... list([2]) list([5]) list([1])]

Expected output:

[[3] [3] [3] .. [2] [5]...]

num_epochs = 30
history = model.fit(train_padded, training_label_seq, epochs=num_epochs, validation_data=(validation_padded, validation_label_seq))

Error => ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type list)

  • What is the logic between the input and output, and how can label_tokenizer.texts_to_sequences be reproduced? Commented May 30, 2020 at 12:52
  • Yes, this code is a bit incomplete. label_tokenizer - is this from TensorFlow? If so, this should have been included. The code, as it stands, is a snippet, and can't be run. Posting a minimal reproducible example is important. Commented May 30, 2020 at 12:58
  • Output after using np.array([[x] for x in training_label_seq]): [list([3])] [list([3])] [list([3])] Commented May 30, 2020 at 13:16
  • Where do we get 'kcc_maharashtra.csv'? Commented May 30, 2020 at 14:14
  • @FrederikBode by uploading it ! available at data.gov.in Commented May 30, 2020 at 15:23

1 Answer


I was able to recreate your issue using the code below -

import numpy as np
import tensorflow as tf
print(tf.__version__)
from tensorflow.keras.preprocessing.text import Tokenizer

label_tokenizer = Tokenizer()

# Fit on a text 
fit_text = "Tensorflow warriors are awesome people"
label_tokenizer.fit_on_texts(fit_text)

# Training Labels
train_labels = "Tensorflow warriors are great people"
training_label_list = np.array(label_tokenizer.texts_to_sequences(train_labels))

# Print the results
print(training_label_list)
print(type(training_label_list))
print(type(training_label_list[0]))

Output -

2.2.0
[list([9]) list([1]) list([10]) list([5]) list([3]) list([2]) list([11])
 list([7]) list([3]) list([6]) list([]) list([6]) list([4]) list([2])
 list([2]) list([12]) list([3]) list([2]) list([5]) list([]) list([4])
 list([2]) list([1]) list([]) list([4]) list([2]) list([1]) list([])
 list([]) list([2]) list([1]) list([4]) list([9]) list([]) list([8])
 list([1]) list([3]) list([8]) list([7]) list([1])]
<class 'numpy.ndarray'>
<class 'list'>

Solution -

  1. Replacing np.array with np.hstack will fix your problem. Your model.fit() should work fine after that.
  2. Alternatively, if you are looking for the expected output shown in your question, training_label_list = label_tokenizer.texts_to_sequences(train_labels) will give you a list of lists. You can then use np.array([np.array(i) for i in training_label_list]) to convert it to an array of arrays. Note that this works only if all the inner lists have the same number of elements.

np.hstack code - for point 1 of the solution.

import numpy as np
import tensorflow as tf
print(tf.__version__)
from tensorflow.keras.preprocessing.text import Tokenizer

label_tokenizer = Tokenizer()

# Fit on a text 
fit_text = "Tensorflow warriors are awesome people"
label_tokenizer.fit_on_texts(fit_text)

# Training Labels
train_labels = "Tensorflow warriors are great people"
training_label_list = np.hstack(label_tokenizer.texts_to_sequences(train_labels))

# Print the results
print(training_label_list)
print(type(training_label_list))
print(type(training_label_list[0]))

Output -

2.2.0
[ 9.  1. 10.  4.  2.  3. 11.  7.  2.  5.  5.  6.  3.  3. 12.  2.  3.  4.
  6.  3.  1.  3.  1.  6.  9.  8.  1.  2.  8.  7.  1.]
<class 'numpy.ndarray'>
<class 'numpy.float64'>

Expected output as in the question - code for point 2 of the solution.

import numpy as np
import tensorflow as tf
print(tf.__version__)
from tensorflow.keras.preprocessing.text import Tokenizer

label_tokenizer = Tokenizer()

# Fit on a text 
fit_text = "Tensorflow warriors are awesome people"
label_tokenizer.fit_on_texts(fit_text)

# Training Labels
train_labels = "Tensorflow warriors are great people"
training_label_list = label_tokenizer.texts_to_sequences(train_labels)

# Print 
print(training_label_list)
print(type(training_label_list))
print(type(training_label_list[0]))

# To convert elements to array
training_label_list = np.array([np.array(i) for i in training_label_list])

# Print
print(training_label_list)
print(type(training_label_list))
print(type(training_label_list[0]))

Output -

2.2.0
[[9], [1], [10], [4], [2], [3], [11], [7], [2], [5], [], [5], [6], [3], [3], [12], [2], [3], [4], [], [6], [3], [1], [], [], [3], [1], [6], [9], [], [8], [1], [2], [8], [7], [1]]
<class 'list'>
<class 'list'>
[array([9]) array([1]) array([10]) array([4]) array([2]) array([3])
 array([11]) array([7]) array([2]) array([5]) array([], dtype=float64)
 array([5]) array([6]) array([3]) array([3]) array([12]) array([2])
 array([3]) array([4]) array([], dtype=float64) array([6]) array([3])
 array([1]) array([], dtype=float64) array([], dtype=float64) array([3])
 array([1]) array([6]) array([9]) array([], dtype=float64) array([8])
 array([1]) array([2]) array([8]) array([7]) array([1])]
<class 'numpy.ndarray'>
<class 'numpy.ndarray'>

Hope this answers your question. Happy Learning.


Update 2/6/2020 - @Anirudh_k07, as per our discussion, I had a look at your program, and you are getting the below error in model.fit() after using np.hstack for the labels.

ValueError: Data cardinality is ambiguous:
  x sizes: 41063
  y sizes: 41429
Please provide data which shares the same first dimension.

You are getting this error because a few of the labels contain special characters like - and /. As a result, np.hstack(label_tokenizer.texts_to_sequences(train_labels)) creates additional rows for those labels. You can print the list of unique train_labels using print(set(train_labels)).

Here is the gist of what I am trying to say -

# These labels have special characters
train_labels = ['Bio-PesticidesandBio-Fertilizers','Old/SenileOrchardRejuvenation']
training_label_seq = np.hstack(label_tokenizer.texts_to_sequences(train_labels))
print("Two labels are converted to five :", training_label_seq)

# These labels are fine
train_labels = ['SoilHealthCard', 'PostHarvestPreservation', 'FertilizerUseandAvailability']
training_label_seq = np.hstack(label_tokenizer.texts_to_sequences(train_labels))
print("Three labels remain three :", training_label_seq)

Output -

Two labels are converted to five : [17 18 19 51 52]
Three labels remain three : [20 36  5]

So do the proper preprocessing to eliminate these special characters from train_labels, and then use np.hstack(label_tokenizer.texts_to_sequences(train_labels)) on the labels. Your model.fit() should work fine after that.
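One way to do that preprocessing is a minimal sketch using Python's re module (the label values are taken from the example above; the exact cleaning rule is an assumption - here every non-alphanumeric character is dropped so each label stays a single token):

```python
import re

# Hypothetical labels containing the problematic characters
train_labels = ['Bio-PesticidesandBio-Fertilizers', 'Old/SenileOrchardRejuvenation']

# Drop every non-alphanumeric character so each label tokenizes as one word
clean_labels = [re.sub(r'[^A-Za-z0-9]', '', label) for label in train_labels]

print(clean_labels)
# ['BioPesticidesandBioFertilizers', 'OldSenileOrchardRejuvenation']
```

With labels cleaned this way, texts_to_sequences should produce exactly one token per label, and np.hstack will preserve the label count.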



8 Comments

@Anirudh_k07 - Does this answer your question?
Using 2nd method => Failed to convert a NumPy array to a Tensor (Unsupported object type numpy.ndarray).
Using Method 1 my shape is changing & the dimensions are no longer matching
As we mentioned in the answer, Method 1 is the proper way to prepare labels for model.fit(). Method 2 is mentioned only because you stated that expected output in your question. Input shape is altogether a different problem that depends on the shape of your input data and the input shape declared in the first layer. Do share that information so that we can help.
Are you using pad_sequences after the Tokenizer to pad the input sequences to the same length? I would recommend looking into this link - charon.me/posts/tf/tf3 - to better understand tokenization and text data preparation for model.fit().
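What that padding does can be sketched in plain NumPy (a stand-in for pad_sequences(..., padding='post'); the sequences and target length here are hypothetical, and note that the Keras function pads at the front by default):

```python
import numpy as np

# Hypothetical variable-length token sequences, as produced by texts_to_sequences
sequences = [[3, 7, 2], [5], [9, 1]]

max_len = 4  # target length, analogous to the maxlen argument of pad_sequences

# Post-pad each sequence with zeros so every row has the same length
padded = np.array([seq + [0] * (max_len - len(seq)) for seq in sequences])

print(padded.shape)  # (3, 4) - a rectangular array that model.fit() can consume
```

After padding, every row has the same length, so np.array produces a regular 2-D integer array instead of an object array of lists.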
