
So, I have a model where the theoretical justification for the update procedure relies on having a batch size of 1. (For those curious, it's called Bayesian Personalized Ranking for recommender systems.)

Now, I have some standard code written. My input is an Nx3 tf.placeholder, and I run it as normal with feed_dict. This is perfectly fine if I want N to be, say, 30K. However, if I want N to be 1, the feed_dict overhead really slows down my code.

For reference, I implemented the gradients by hand in pure Python, and it runs at about 70K iterations/second. In contrast, GradientDescentOptimizer runs at about 1K iterations/second. As you can see, this is just far too slow. So, as I said, I suspect the problem is that feed_dict has too much overhead when it is called once per single-sample update.

Here is the actual session code:

sess = tf.Session()
sess.run(tf.global_variables_initializer())
for iteration in range(100):
    samples = data.generate_train_samples(1000000)
    # One sess.run call (with its own feed_dict) per single sample.
    for sample in tqdm(samples):
        cvalues = sess.run([trainer, obj], feed_dict={input_data: [sample]})
    print("objective = " + str(cvalues[1]))

Is there a better way to do a single update at a time?


1 Answer


Your code probably runs much slower for two reasons:

  1. You copy your data to GPU memory (if you use a GPU) only when you run the session, and you do it many times, which is really time consuming.
  2. You do the feeding in a single thread.

Luckily, TensorFlow has the tf.data API, which helps to solve both problems. You can try something like:

inputs = tf.placeholder(tf.float32, your_shape)
labels = tf.placeholder(tf.float32, labels_shape)
dataset = tf.data.Dataset.from_tensor_slices((inputs, labels))

iterator = dataset.make_initializable_iterator()

# Feed the whole training set once, when initializing the iterator.
sess.run(iterator.initializer, {inputs: your_inputs, labels: your_labels})

Then, to get the next entry from the dataset, you just use iterator.get_next().
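
Here is a minimal sketch of how that could fit the question's single-sample setup. Since the question's input is Nx3 rows with no separate labels, only one placeholder is sliced; build_bpr_loss is a hypothetical stand-in for however you currently build obj, and the shapes, dtypes, and learning rate are assumptions:

import tensorflow as tf

# Assumed dtype/shape; the question's input is Nx3, with no separate labels.
inputs = tf.placeholder(tf.float32, shape=[None, 3])

dataset = tf.data.Dataset.from_tensor_slices(inputs)
dataset = dataset.shuffle(buffer_size=10000).prefetch(1)

iterator = dataset.make_initializable_iterator()
sample = iterator.get_next()  # one row per training step, no per-step feed_dict

# Hypothetical: build your BPR objective from the single sample.
obj = build_bpr_loss(sample)
trainer = tf.train.GradientDescentOptimizer(0.01).minimize(obj)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for iteration in range(100):
        samples = data.generate_train_samples(1000000)
        # The only feed_dict call: hand over the whole epoch's data at once.
        sess.run(iterator.initializer, feed_dict={inputs: samples})
        while True:
            try:
                _, cvalue = sess.run([trainer, obj])
            except tf.errors.OutOfRangeError:
                break
        print("objective = " + str(cvalue))

The per-step sess.run calls are still there, but they no longer copy data from Python on every step; if the Python call overhead itself turns out to dominate, tf.data alone won't remove it.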

If that's what you need, TensorFlow has exhaustive documentation on importing data with the tf.data API, where you can find an example suitable for your use case: documentation


3 Comments

I appreciate the response. I am not using a GPU, which unfortunately cannot be helped right now. My CPU does have 4 cores with 2 threads each, and my code is indeed using only 1 core, so I can speed things up by using all 4 cores. But I still run into the fundamental problem that, even with that fix, this computes at a maximum of 4K iterations/second, versus 70K iterations/second for my vanilla Python implementation. feed_dict is still a bottleneck.
@anon what makes you think feed_dict is the bottleneck? Have you profiled the code?
I have not. It's the logical conclusion from the fact that when my batch size is large enough, performance is totally fine (>100K it/sec) despite using a single thread on a single core, but when the batch size is down to 1, performance drops to ~1K it/sec. The only fundamental difference is the number of calls to feed_dict and to sess.run. Maybe the overhead is in sess.run?
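
A rough way to test that hypothesis (a sketch with trivial stand-in ops, not the actual model) is to time a bare sess.run against one that also feeds a single row, which separates the Python call overhead from the feed_dict copy:

import time
import numpy as np
import tensorflow as tf

x = tf.placeholder(tf.float32, shape=[None, 3])
no_feed_op = tf.constant(1.0)   # fetch that needs no feed at all
fed_op = tf.reduce_sum(x)       # fetch that needs one fed row

with tf.Session() as sess:
    row = np.zeros((1, 3), dtype=np.float32)

    start = time.time()
    for _ in range(10000):
        sess.run(no_feed_op)
    print("sess.run only:        %.0f calls/sec" % (10000 / (time.time() - start)))

    start = time.time()
    for _ in range(10000):
        sess.run(fed_op, feed_dict={x: row})
    print("sess.run + feed_dict: %.0f calls/sec" % (10000 / (time.time() - start)))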
