
I want to build a sequence-to-sequence autoencoder in Keras. The purpose is to produce document vectors ("doc2vec").

In the Keras blog, I found an example: https://blog.keras.io/building-autoencoders-in-keras.html

from keras.layers import Input, LSTM, RepeatVector
from keras.models import Model

inputs = Input(shape=(timesteps, input_dim))
encoded = LSTM(latent_dim)(inputs)

decoded = RepeatVector(timesteps)(encoded)
decoded = LSTM(input_dim, return_sequences=True)(decoded)

sequence_autoencoder = Model(inputs, decoded)
encoder = Model(inputs, encoded)

What if I need to add an embedding layer to this? If we are dealing with a paragraph of text, I suppose we should first tokenize the text and embed it with pre-trained vectors, right?

Do I need a Dense or TimeDistributed Dense layer in the decoder? Do I need to reverse the order of the sequence?

Thanks in advance.

  • The embedding layer should be the first layer. In my opinion it does not have to be pre-trained. As for the dense decoding layer: people prefer autoencoders with a similar encoding/decoding architecture, but this is not a must. I think the best approach would be to just try it out and get a feeling for what works best for you. Commented Aug 10, 2018 at 7:11

1 Answer


The Embedding layer can only be used as the first layer in a model, as the documentation states, so something like this:

from keras.layers import Embedding

inputs = Input(shape=(timesteps,))  # integer token ids, not one-hot vectors
embedded = Embedding(vocab_size, embedding_size, mask_zero=True)(inputs)  # other kwargs (e.g. weights=...) can be added
encoded = LSTM(latent_dim)(embedded)

As for "we should first tokenize the text and embed it with pre-trained vectors, right?": yes, this is the usual approach. You only train your own embeddings if you have a sufficiently large corpus; otherwise pre-trained GloVe vectors are often used. There is a Keras example that uses GloVe and the built-in Tokenizer to pass text into a model with an Embedding layer.
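
A minimal sketch of that tokenize-then-embed step (this is my own illustration, not the linked example; the corpus, the glove.6B.100d.txt path, max_words and timesteps are placeholder assumptions):

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import numpy as np

texts = ["first document ...", "second document ..."]   # your corpus
max_words = 20000                                        # keep only the top N words
timesteps = 100                                          # pad/truncate length

tokenizer = Tokenizer(num_words=max_words, oov_token="<unk>")
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
x = pad_sequences(sequences, maxlen=timesteps)           # shape (num_docs, timesteps)

# Build an embedding matrix from pre-trained GloVe vectors
embedding_size = 100
embedding_matrix = np.zeros((max_words, embedding_size))
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        values = line.split()
        word, vec = values[0], np.asarray(values[1:], dtype="float32")
        idx = tokenizer.word_index.get(word)
        if idx is not None and idx < max_words:
            embedding_matrix[idx] = vec

# Pass it to the Embedding layer and (optionally) freeze it:
# Embedding(max_words, embedding_size, weights=[embedding_matrix],
#           mask_zero=True, trainable=False)
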

For decoding, you will need a Dense layer, but wrapping it in TimeDistributed is optional in Keras 2: by default, Dense applies its kernel to the last axis, so it operates on every timestep of the 3D tensor you pass:

decoded = RepeatVector(timesteps)(encoded)
decoded = LSTM(input_dim, return_sequences=True)(decoded)
decoded = Dense(vocab_size, activation='softmax')(decoded)
# (batch_size, timesteps, vocab_size)

It's worth noting that restricting the vocabulary to the top N most frequent words will speed up training; otherwise that softmax becomes very costly to compute. The Keras example also keeps a limited number of words and maps every other word to a special UNK (unknown) token.
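
Putting the pieces together, a rough end-to-end sketch could look like the following (my assumption of how it fits, reusing vocab_size, embedding_size, timesteps and latent_dim from above; the compile/fit choices are illustrative, not part of the original answer):

from keras.layers import Input, LSTM, RepeatVector, Dense, Embedding
from keras.models import Model

inputs = Input(shape=(timesteps,))                       # integer token ids
embedded = Embedding(vocab_size, embedding_size, mask_zero=True)(inputs)
encoded = LSTM(latent_dim)(embedded)

decoded = RepeatVector(timesteps)(encoded)
decoded = LSTM(latent_dim, return_sequences=True)(decoded)
decoded = Dense(vocab_size, activation='softmax')(decoded)

sequence_autoencoder = Model(inputs, decoded)
encoder = Model(inputs, encoded)

# With integer targets (the inputs themselves, expanded to shape
# (batch, timesteps, 1)), sparse_categorical_crossentropy avoids
# building huge one-hot target tensors.
sequence_autoencoder.compile(optimizer='adam',
                             loss='sparse_categorical_crossentropy')
# sequence_autoencoder.fit(x, np.expand_dims(x, -1), epochs=10, batch_size=32)
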


1 Comment

For an autoencoder, suppose x = y. If I use an embedding layer, my inputs are tokenized word indices, which are embedded inside the model, then go through two RNN layers, and are then compared with the labels, which are the inputs themselves (tokenized indices). It seems the model is very hard to train: the loss is very high, the accuracy is very low, and it can't learn anything. Not sure where I'm getting it wrong.
