Here we implement the full Transformer model on the IWSLT 2016 de-en dataset, a much smaller dataset than the WMT dataset used by Vaswani et al., but sufficient to demonstrate the model's capabilities.
Since the IWSLT dataset is much smaller, we can use a smaller set of hyperparameters than the original Transformer model: 1 encoder layer instead of 6, a hidden dimension of 64 rather than 512, and 4 heads rather than 8 for multi-head attention. Finally, we use learned positional encodings for both the encoder and the decoder, rather than the sinusoidal functions.
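For reference, the sinusoidal encodings in the original paper set PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)); the learned variant simply replaces this fixed table with a trainable (max_len, hidden) matrix updated by backprop. Below is a minimal NumPy sketch of the fixed version for comparison; it is not the code used in this repo.

```python
import numpy as np

def sinusoidal_encoding(max_len=20, hidden=64):
    """Fixed sinusoidal positional encodings from Vaswani et al. (2017).

    The learned alternative used here would instead be a trainable
    (max_len, hidden) embedding matrix.
    """
    pos = np.arange(max_len)[:, None]                    # (max_len, 1)
    i = np.arange(hidden // 2)[None, :]                  # (1, hidden/2)
    angles = pos / np.power(10000.0, 2.0 * i / hidden)   # (max_len, hidden/2)
    enc = np.zeros((max_len, hidden))
    enc[:, 0::2] = np.sin(angles)                        # even dimensions
    enc[:, 1::2] = np.cos(angles)                        # odd dimensions
    return enc
```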
The training data and a pretrained model are available here for testing.
This README is still a WIP, but the model should work fine.
A plain old translation task: in this case, translating German to English using the training set from IWSLT 2016.
For example:
Input:
Der Großteil der Erde ist Meerwasser.
Output:
Most of the planet is ocean water.
We implement the entire Transformer model, with certain changes highlighted in the Preface above. You can try translating immediately with the pretrained model by running the following command:
$ python3 main.py --test --line=10
Change the `--line` parameter for a different sample.
$ python3 main.py --train
This trains a Transformer model with default parameters:
- Training steps: `--steps=50000`
- Batch size: `--batchsize=64`
- Learning rate: `--lr=1e-4`
- Savepath: `--savepath=models/`
- Encoding dimensions: `--hidden=64`
- Encoder layers: `--enc_layers=1`
- Decoder layers: `--dec_layers=6`
- Number of heads: `--heads=4`

The model will be trained on the Translation Task with default parameters:

- Max sequence length: `--max_len=20`
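Any of these defaults can be overridden on the command line, for example a shorter run with a larger learning rate (assuming the flags combine as listed above):

$ python3 main.py --train --steps=10000 --lr=3e-4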
$ python3 main.py --test
or
$ python3 main.py --test --line=10
This tests the trained model. You can specify a particular line using `--line`; otherwise it defaults to the first sample.
You can also use the `--plot` flag to plot the final encoder-decoder attention heatmaps.
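Under the hood, such a heatmap is just an image of the attention weight matrix, with source tokens on one axis and target tokens on the other. The sketch below shows one way to draw it with matplotlib; the function name and array shape are placeholders, not the repo's actual plotting code.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_attention(weights, src_tokens, tgt_tokens):
    """Plot one head's encoder-decoder attention weights.

    weights: (tgt_len, src_len) array of attention probabilities.
    """
    fig, ax = plt.subplots()
    ax.imshow(weights, cmap="viridis")
    ax.set_xticks(range(len(src_tokens)))
    ax.set_xticklabels(src_tokens, rotation=90)   # German tokens along x
    ax.set_yticks(range(len(tgt_tokens)))
    ax.set_yticklabels(tgt_tokens)                # English tokens along y
    ax.set_xlabel("source (German)")
    ax.set_ylabel("target (English)")
    plt.tight_layout()
    plt.show()
```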
$ python3 main.py --help
Run with the `--help` flag to get a list of possible flags.
Skip ahead to the Model section for details about attention.
As is typical of translation tasks, we first preprocess the dataset to generate a dictionary mapping each token to an index. Here we use two dictionaries, one for English and one for German, although some papers use a single dictionary for both source and target languages. With reference to the `.json` files, the index of each token is simply its position in the list. The dictionaries are generated by running `make_dict.py`.
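As a rough sketch of what this kind of preprocessing looks like (the file paths, the set of special tokens, and the whitespace tokenisation here are assumptions for illustration, not necessarily what `make_dict.py` does):

```python
import json
from collections import Counter

def build_dict(corpus_path, out_path, vocab_size=10000):
    """Build a token list ordered by frequency and save it as JSON.

    The index of each token is its position in the list; tokens outside
    the vocabulary map to <UNK> at lookup time.
    """
    counts = Counter()
    with open(corpus_path, encoding="utf-8") as f:
        for line in f:
            counts.update(line.lower().split())
    # Special tokens; the exact set used in the repo is an assumption.
    specials = ["<PAD>", "<UNK>", "<S>", "</S>"]
    tokens = specials + [tok for tok, _ in counts.most_common(vocab_size)]
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(tokens, f, ensure_ascii=False)
    return {tok: i for i, tok in enumerate(tokens)}
```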
The `Task` class's `next_batch` method will generate three arrays (see the sketch after this list):

- The one-hot encoded German sentences
- The one-hot encoded English sentences with `<S>` added to the beginning, i.e. shifting the sentences one token to the right; this serves as the decoder input
- The one-hot encoded English sentences; this serves as the labels, or the decoder outputs
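A minimal sketch of the shift described above, working with index arrays rather than one-hot vectors for brevity; `word2idx`, the `<PAD>` token, and the function name are hypothetical and not the actual `Task` implementation:

```python
import numpy as np

def make_decoder_arrays(en_tokens, word2idx, max_len=20):
    """Build the decoder input (shifted right with <S>) and the labels."""
    unk = word2idx["<UNK>"]
    ids = [word2idx.get(tok, unk) for tok in en_tokens][:max_len]
    dec_input = ([word2idx["<S>"]] + ids)[:max_len]   # shifted one token right
    labels = list(ids)                                # unshifted targets
    pad = word2idx["<PAD>"]                           # <PAD> token is an assumption
    dec_input += [pad] * (max_len - len(dec_input))
    labels += [pad] * (max_len - len(labels))
    return np.array(dec_input), np.array(labels)
```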
WIP: more details here later.
Some interesting results.
Testing with `--line=4`
Input :
ich denke das problem ist dass wir das meer für zu selbstverständlich halten
Truth :
and the problem i think is that we take the ocean for <UNK>
Output:
i think the problem is that we take the ocean for <UNK>
Both tokens in `the ocean` attend to `meer` for all four heads. In addition, `für zu selbstverständlich halten` is a German phrase that means `take for granted` in English, and here we see that the tokens `take` and `<UNK>` attend strongly to `selbstverständlich`.
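One way to read such patterns off programmatically is to average the encoder-decoder attention weights over heads and take the argmax along the source axis. This is only a sketch; the (heads, tgt_len, src_len) shape and the variable names are assumptions rather than the repo's API:

```python
import numpy as np

def strongest_attention(weights, src_tokens, tgt_tokens):
    """For each target token, report the most-attended source token.

    weights: (heads, tgt_len, src_len) encoder-decoder attention weights.
    """
    avg = weights.mean(axis=0)                 # average over heads
    for t, tgt in enumerate(tgt_tokens):
        s = int(np.argmax(avg[t]))             # strongest source position
        print(f"{tgt!r} -> {src_tokens[s]!r} ({avg[t, s]:.2f})")
```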
Testing with `--line=10`
Input :
die meisten tiere leben in den ozeanen
Truth :
most of the animals are in the <UNK>
Output:
most animals live in the <UNK>
In this case, we see that the token `most` attends to `meisten` and ignores the `die` at the beginning of the input German sentence.
Testing with `--line=153`
Input :
übrigens ist das zeug <UNK>
Truth :
this stuff is <UNK> as <UNK> by the way
Output:
by the way that's <UNK> stuff
Here the tokens `by the way` all attend strongly to `übrigens`, which is the German parallel of the English phrase. In addition, we see that the English translation `<UNK> stuff` correctly flips the order of the German tokens `zeug <UNK>` (where `zeug` means `stuff`).