
I'm trying to create a Dataset object in TensorFlow 1.14 (I have some legacy code that I can't change for this specific project) starting from numpy arrays, but every time I try, everything gets copied onto my graph, so when I create an event log file it is huge (719 MB in this case).

Originally I tried using the function tf.data.Dataset.from_tensor_slices(), but it didn't work. Then I read that this is a common problem and someone suggested trying generators instead, so I tried the following code, but again I got a huge event file (719 MB again):

def fetch_batch(x, y, batch):
    i = 0
    while i < batch:
        yield (x[i, :, :, :], y[i])
        i += 1

train, test = tf.keras.datasets.fashion_mnist.load_data()
images, labels = train  
images = images/255

training_dataset = tf.data.Dataset.from_generator(
    fetch_batch,
    args=[images, np.int32(labels), batch_size],
    output_types=(tf.float32, tf.int32),
    output_shapes=(tf.TensorShape(features_shape), tf.TensorShape(labels_shape)))

file_writer = tf.summary.FileWriter("/content", graph=tf.get_default_graph())

I know in this case I could use the tensorflow_datasets API and it would be easier, but this is a more general question about how to create datasets in general, not only with the MNIST one. Could you explain to me what I am doing wrong? Thank you

  • Can you explain a bit more in detail what's causing your event file to be that large? Is it creating repetitive subgraphs? Commented Nov 25, 2019 at 0:47
  • Could you explain what didn't work with from_tensor_slices? Commented Nov 25, 2019 at 13:43

1 Answer


I guess it's because you are using args in from_generator. This will put the provided args into the graph as constants.

What you could do is define a function that returns a generator that iterates through your set, something like (haven't tested):

def data_generator(images, labels):
  def fetch_examples():
    i = 0
    while True:
      example = (images[i], labels[i])
      i += 1
      i %= len(labels)
      yield example
  return fetch_examples
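As a quick sanity check of this closure pattern outside TensorFlow, here is a standalone sketch (the toy arrays below are made up for illustration; the real images/labels come from fashion_mnist as in the question):

```python
import numpy as np

# Same closure pattern as above: data_generator captures the arrays,
# and the inner generator cycles over them forever.
def data_generator(images, labels):
    def fetch_examples():
        i = 0
        while True:
            example = (images[i], labels[i])
            i += 1
            i %= len(labels)  # wrap around so the generator never exhausts
            yield example
    return fetch_examples

images = np.arange(6).reshape(3, 2)   # 3 fake "images" of shape (2,)
labels = np.array([0, 1, 2])

gen = data_generator(images, labels)()
examples = [next(gen) for _ in range(5)]  # pull more examples than we have

print([lab for _, lab in examples])  # -> [0, 1, 2, 0, 1] (wraps around)
```

Because the arrays are only captured by the Python closure and pulled lazily, nothing ends up baked into the TensorFlow graph.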

This would give in your example:

train, test = tf.keras.datasets.fashion_mnist.load_data()
images, labels = train  
images = images/255

training_dataset = tf.data.Dataset.from_generator(
    data_generator(images, labels),
    output_types=(tf.float32, tf.int32),
    output_shapes=(tf.TensorShape(features_shape), tf.TensorShape(labels_shape))).batch(batch_size)

file_writer = tf.summary.FileWriter("/content", graph=tf.get_default_graph())

Note that I renamed fetch_batch to fetch_examples, since you probably want to batch using the dataset utilities (.batch).
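To illustrate what .batch(batch_size) does with such a generator, here is a rough plain-Python sketch (no TensorFlow; take_batch is a hypothetical helper, and the stacking logic is a simplification of what tf.data actually does):

```python
import itertools
import numpy as np

def fetch_examples_from(images, labels):
    # Endless per-example generator, as in the answer above.
    i = 0
    while True:
        yield images[i], labels[i]
        i = (i + 1) % len(labels)

def take_batch(gen, batch_size):
    # Collect batch_size examples and stack them along a new leading
    # axis, roughly mimicking Dataset.batch().
    pairs = list(itertools.islice(gen, batch_size))
    xs = np.stack([x for x, _ in pairs])
    ys = np.stack([y for _, y in pairs])
    return xs, ys

images = np.ones((10, 28, 28), dtype=np.float32)  # fake 28x28 images
labels = np.arange(10, dtype=np.int32)

gen = fetch_examples_from(images, labels)
xs, ys = take_batch(gen, 4)
print(xs.shape, ys.tolist())  # -> (4, 28, 28) [0, 1, 2, 3]
```

The point is that the generator yields one example at a time and the batching happens downstream, so the arrays never need to be embedded in the graph.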


4 Comments

Yes, I think you are right. I was about to post how I solved the problem (I actually found someone on GitHub suggesting this) and it worked for me: github.com/tensorflow/tensorflow/issues/14053 Thank you!
Cool, if this solution works then please accept it. Also, next time, try to provide code in your question that works if copy-pasted, and include version numbers (typically for TensorFlow, the API changes a lot between 1.14 and 2.0).
Done, but one more thing: what do you mean with "you won't be able to use multiprocessing"? Is there any more efficient way?
Actually forget what I said, you don't need multiprocessing at this stage (just getting the data), so I was completely confused.
