1706.03762v7.pdf

TLDR: The Transformer, a new simple network architecture based solely on attention mechanisms, achieves state-of-the-art results on machine translation tasks while being more parallelizable and requiring significantly less training time compared to previous models.

Attention Is All You Need

Abstract

Introduces the Transformer, a new, simple network architecture based solely on attention mechanisms that dispenses with recurrence and convolutions entirely
Experiments on machine translation tasks show the Transformer to be superior in quality while being more parallelizable and requiring significantly less time to train
The Transformer achieves new state-of-the-art results on the WMT 2014 English-to-German (28.4 BLEU) and English-to-French (41.8 BLEU, single model) translation tasks

Introduction

Recurrent neural networks and encoder-decoder architectures have been state-of-the-art in sequence modeling and transduction problems
Attention mechanisms have become an integral part of sequence modeling and transduction models
The Transformer model relies entirely on an attention mechanism to draw global dependencies between input and output, eschewing recurrence

Background

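Earlier efforts to reduce sequential computation in sequence transduction models, such as the Extended Neural GPU, ByteNet, and ConvS2S, use convolutional neural networks as their basic building block
Self-attention (sometimes called intra-attention) is an attention mechanism that relates different positions of a single sequence in order to compute a representation of that sequence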
The Transformer is the first transduction model relying entirely on self-attention

Model Architecture

Encoder and Decoder Stacks

Encoder: Composed of a stack of N = 6 identical layers, each with a multi-head self-attention mechanism and a feed-forward network
Decoder: Also composed of a stack of N = 6 identical layers, with an additional sub-layer that performs multi-head attention over the output of the encoder stack

Attention

Scaled Dot-Product Attention

Computes the dot products of the query with all keys, divides each by √d_k (the square root of the key dimension d_k), and applies a softmax function to obtain the weights on the values
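
The computation above can be sketched in plain Python as follows. This is a minimal illustration, assuming queries, keys, and values are given as lists of row vectors; the function names and the list-based representation are choices made here, not taken from the paper.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    """Return one output vector per query: softmax(q . K^T / sqrt(d_k)) V."""
    d_k = len(keys[0])
    d_v = len(values[0])
    outputs = []
    for q in queries:
        # Dot product of the query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values)) for j in range(d_v)])
    return outputs
```

With an all-zero query every key receives the same weight, so the output is simply the mean of the value vectors.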

Multi-Head Attention

Linearly projects the queries, keys, and values h times with different, learned linear projections
Performs the attention function in parallel on the projected versions
Concatenates the outputs and projects again
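
The three steps above can be sketched as follows. This is a minimal plain-Python illustration, assuming matrices are lists of rows and that each head is supplied as a hypothetical (W_q, W_k, W_v) triple of projection matrices; the helper names and the data layout are assumptions, and no masking, dropout, or batching is included.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention on whole matrices."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)

def multi_head_attention(q, k, v, heads, w_o):
    """heads: one (W_q, W_k, W_v) triple per head; w_o: final output projection."""
    # Project q, k, v once per head and run attention on each projection.
    head_outputs = [
        attention(matmul(q, w_q), matmul(k, w_k), matmul(v, w_v))
        for w_q, w_k, w_v in heads
    ]
    # Concatenate the head outputs along the feature axis, position by position.
    concat = [sum((head[i] for head in head_outputs), []) for i in range(len(q))]
    # Project the concatenation back with the output matrix.
    return matmul(concat, w_o)
```

In the paper the base model uses h = 8 heads with d_k = d_v = d_model / h = 64, so the total cost is similar to single-head attention with full dimensionality.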

Position-wise Feed-Forward Networks

Applies a simple, position-wise fully connected feed-forward network to each position separately and identically
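
In the paper this network is two linear transformations with a ReLU in between, FFN(x) = max(0, xW1 + b1)W2 + b2. A minimal plain-Python sketch follows, assuming weight matrices are lists of rows; the function names and the toy weights in the usage note are illustrative only.

```python
def feed_forward(x, w1, b1, w2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2 for a single position vector x."""
    # First linear layer followed by ReLU; zip(*w1) iterates over columns of W1.
    hidden = [
        max(0.0, sum(xi * w for xi, w in zip(x, col)) + b)
        for col, b in zip(zip(*w1), b1)
    ]
    # Second linear layer, no activation.
    return [
        sum(h * w for h, w in zip(hidden, col)) + b
        for col, b in zip(zip(*w2), b2)
    ]

def position_wise_ffn(sequence, w1, b1, w2, b2):
    """Apply the same feed-forward network to every position independently."""
    return [feed_forward(x, w1, b1, w2, b2) for x in sequence]
```

For example, with w1 = [[1.0, -1.0]], w2 = [[1.0], [1.0]], and zero biases, the network maps each one-dimensional input to its absolute value, and every position in the sequence goes through the same weights.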

Embeddings and Softmax

Uses learned embeddings to convert input and output tokens to vectors
Shares the same weight matrix between the two embedding layers and the pre-softmax linear transformation

Positional Encoding

Adds "positional encodings" to the input embeddings to inject information about the relative or absolute position of the tokens
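
The paper uses fixed sine and cosine functions of different frequencies for these encodings: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A minimal plain-Python sketch, with function names chosen here for illustration:

```python
import math

def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal positional encodings."""
    table = []
    for pos in range(max_len):
        row = []
        for j in range(d_model):
            i = j // 2  # dimensions 2i and 2i+1 share one frequency
            angle = pos / (10000 ** (2 * i / d_model))
            # Even dimensions use sine, odd dimensions use cosine.
            row.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def add_positional_encoding(embeddings):
    """Add the encoding for each position to the embedding at that position."""
    d_model = len(embeddings[0])
    table = positional_encoding(len(embeddings), d_model)
    return [[e + p for e, p in zip(emb, row)] for emb, row in zip(embeddings, table)]
```

Because the encodings have the same dimension d_model as the embeddings, the two can be summed element-wise.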

Why Self-Attention

Compares various aspects of self-attention layers to recurrent and convolutional layers
Self-attention layers connect all positions with a constant number of sequential operations, whereas recurrent layers require O(n) sequential operations
Self-attention layers are faster than recurrent layers when the sequence length n is smaller than the representation dimensionality d
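
The last point follows from the per-layer complexities given in the paper: O(n² · d) for self-attention and O(n · d²) for a recurrent layer. The sketch below compares the two leading terms only; it ignores constant factors and hardware effects, so it is a rough illustration rather than a performance model, and the function names are made up here.

```python
def self_attention_cost(n, d):
    """Leading term of per-layer cost for self-attention: n^2 * d."""
    return n * n * d

def recurrent_cost(n, d):
    """Leading term of per-layer cost for a recurrent layer: n * d^2."""
    return n * d * d

def self_attention_is_cheaper(n, d):
    """True when n^2 * d < n * d^2, which is equivalent to n < d."""
    return self_attention_cost(n, d) < recurrent_cost(n, d)
```

For instance, with d = 512 a 50-token sentence favors self-attention, while a 1000-token sequence does not.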

Training

Training Data and Batching

Trained on the WMT 2014 English-German and English-French datasets
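Sentences encoded using byte-pair encoding (English-German, shared vocabulary of about 37,000 tokens) or word pieces (English-French, 32,000-token vocabulary), and batched by approximate sequence length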

Hardware and Schedule

Trained on one machine with 8 NVIDIA P100 GPUs
Base models trained for 100,000 steps (12 hours), big models trained for 300,000 steps (3.5 days)

Optimizer

Used the Adam optimizer with a learning rate that increases linearly for the first warmup_steps training steps and decreases proportionally to the inverse square root of the step number thereafter
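
The schedule described above is given in the paper as lrate = d_model^(-0.5) · min(step^(-0.5), step · warmup_steps^(-1.5)), with warmup_steps = 4000. A minimal Python sketch, where the function name and the default d_model = 512 (the base model size) are choices made for this example:

```python
def transformer_lrate(step, d_model=512, warmup_steps=4000):
    """Learning rate at a given training step (steps are counted from 1)."""
    if step < 1:
        raise ValueError("step must be >= 1")
    # Linear warmup while step < warmup_steps, inverse square root decay after.
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```

The two branches of the min meet at step = warmup_steps, which is where the learning rate peaks.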

Regularization

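Employed residual dropout (applied to the output of each sub-layer and to the sums of the embeddings and positional encodings, with P_drop = 0.1 for the base model) and label smoothing (ε_ls = 0.1), which hurts perplexity but improves accuracy and BLEU score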
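
Label smoothing replaces the one-hot training target with a softer distribution. The paper reports a smoothing value of 0.1 but does not spell out the exact formula, so the sketch below assumes the common variant that mixes the one-hot target with a uniform distribution over the vocabulary; the function name is hypothetical.

```python
def smooth_labels(target_index, vocab_size, epsilon=0.1):
    """Mix a one-hot target with a uniform distribution (assumed variant)."""
    # Every token, including the target, receives epsilon / vocab_size.
    dist = [epsilon / vocab_size] * vocab_size
    # The remaining 1 - epsilon probability mass goes to the true token.
    dist[target_index] += 1.0 - epsilon
    return dist
```

The result still sums to 1, but the model is no longer pushed toward assigning full probability to a single token.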

Results

Machine Translation

On the WMT 2014 English-to-German task, the big Transformer model outperforms the best previously reported models, including ensembles, by more than 2.0 BLEU, establishing a new state-of-the-art BLEU score of 28.4
On the WMT 2014 English-to-French task, the big Transformer model achieves a new single-model state-of-the-art BLEU score of 41.8

Model Variations

Evaluated the importance of different components of the Transformer architecture
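Observed that single-head attention is 0.9 BLEU worse than the best setting and that quality also drops with too many heads; reducing the attention key size d_k hurts quality, bigger models perform better, and dropout is very helpful in avoiding over-fitting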

English Constituency Parsing

Trained a 4-layer Transformer model on the Wall Street Journal portion of the Penn Treebank
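Achieved strong results despite the lack of task-specific tuning: better than all previously reported models except the Recurrent Neural Network Grammar, and better than the BerkeleyParser even when trained only on the 40K-sentence WSJ training set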

Conclusion

The Transformer, the first sequence transduction model based entirely on attention, can be trained significantly faster than architectures based on recurrent or convolutional layers
Achieved new state-of-the-art results on machine translation tasks
Plans to apply attention-based models to other tasks and investigate local, restricted attention mechanisms