
I want to build a sequence-to-sequence autoencoder in Keras. The purpose is to produce document vectors ("doc2vec").

In the Keras blog, I found an example: https://blog.keras.io/building-autoencoders-in-keras.html

from keras.layers import Input, LSTM, RepeatVector
from keras.models import Model

inputs = Input(shape=(timesteps, input_dim))
encoded = LSTM(latent_dim)(inputs)

decoded = RepeatVector(timesteps)(encoded)
decoded = LSTM(input_dim, return_sequences=True)(decoded)

sequence_autoencoder = Model(inputs, decoded)
encoder = Model(inputs, encoded)

What if I need to add an embedding layer to this? If we are dealing with a paragraph of text, I suppose we should first tokenize the text and embed it with pre-trained vectors, right?

Do I need a Dense or TimeDistributed Dense layer in the decoder? Do I need to reverse the order of the sequence?

Thanks in advance.

  • The embedding layer should be the first layer. In my opinion it does not have to be pre-trained. As for the dense decoding layer: people prefer autoencoders with a similar encoding/decoding architecture, but this is not a must. I think the best approach would be to just try it out and get a feeling for what works best for you. Commented Aug 10, 2018 at 7:11

1 Answer


The Embedding layer can only be used as the first layer in a model, as the documentation states, so something like this:

from keras.layers import Embedding

inputs = Input(shape=(timesteps,))  # integer token ids, not one-hot vectors
embedded = Embedding(vocab_size, embedding_size, mask_zero=True)(inputs)  # other kwargs (e.g. weights=...) can be added
encoded = LSTM(latent_dim)(embedded)

As for "we should first tokenize the text and embed it with pre-trained vectors, right?": yes, this is the usual approach. You only train your own embeddings if you have a sufficiently large corpus; otherwise pre-trained GloVe vectors are often used. There is a Keras example that uses GloVe and the built-in Tokenizer to pass text into a model with an Embedding layer.
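
A minimal sketch of that tokenize-then-embed step (this is my own illustration, not the linked example; the corpus, the glove.6B.100d.txt path, max_words and timesteps are placeholder assumptions):

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import numpy as np

texts = ["first document ...", "second document ..."]   # your corpus
max_words = 20000                                        # keep only the top N words
timesteps = 100                                          # pad/truncate length

tokenizer = Tokenizer(num_words=max_words, oov_token="<unk>")
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
x = pad_sequences(sequences, maxlen=timesteps)           # shape (num_docs, timesteps)

# Build an embedding matrix from pre-trained GloVe vectors
embedding_size = 100
embedding_matrix = np.zeros((max_words, embedding_size))
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        values = line.split()
        word, vec = values[0], np.asarray(values[1:], dtype="float32")
        idx = tokenizer.word_index.get(word)
        if idx is not None and idx < max_words:
            embedding_matrix[idx] = vec

# Pass it to the Embedding layer and (optionally) freeze it:
# Embedding(max_words, embedding_size, weights=[embedding_matrix],
#           mask_zero=True, trainable=False)
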

For decoding, you will need a Dense layer, but wrapping it in TimeDistributed is optional in Keras 2: by default, Dense applies its kernel to the last axis, so it operates on every timestep of the 3D tensor you pass:

decoded = RepeatVector(timesteps)(encoded)
decoded = LSTM(input_dim, return_sequences=True)(decoded)
decoded = Dense(vocab_size, activation='softmax')(decoded)
# (batch_size, timesteps, vocab_size)

It's worth noting that restricting the vocabulary to the top N most frequent words will speed up training; otherwise that softmax becomes very costly to compute. The Keras example also keeps a limited number of words and maps every other word to a special UNK (unknown) token.
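
Putting the pieces together, a rough end-to-end sketch could look like the following (my assumption of how it fits, reusing vocab_size, embedding_size, timesteps and latent_dim from above; the compile/fit choices are illustrative, not part of the original answer):

from keras.layers import Input, LSTM, RepeatVector, Dense, Embedding
from keras.models import Model

inputs = Input(shape=(timesteps,))                       # integer token ids
embedded = Embedding(vocab_size, embedding_size, mask_zero=True)(inputs)
encoded = LSTM(latent_dim)(embedded)

decoded = RepeatVector(timesteps)(encoded)
decoded = LSTM(latent_dim, return_sequences=True)(decoded)
decoded = Dense(vocab_size, activation='softmax')(decoded)

sequence_autoencoder = Model(inputs, decoded)
encoder = Model(inputs, encoded)

# With integer targets (the inputs themselves, expanded to shape
# (batch, timesteps, 1)), sparse_categorical_crossentropy avoids
# building huge one-hot target tensors.
sequence_autoencoder.compile(optimizer='adam',
                             loss='sparse_categorical_crossentropy')
# sequence_autoencoder.fit(x, np.expand_dims(x, -1), epochs=10, batch_size=32)
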


1 Comment

For an autoencoder, suppose x = y. If I use an embedding layer, my inputs are tokenized word indices, which are embedded inside the model, then go through two RNN layers, and are then compared with the labels, which are the inputs themselves (tokenized indices). It seems the model is very hard to train: the loss is very high, the accuracy is very low, and it can't learn anything. Not sure where I'm getting it wrong.
