TLDR: The Transformer, a new simple network architecture based solely on attention mechanisms, achieves state-of-the-art results on machine translation tasks while being more parallelizable and requiring significantly less training time compared to previous models.
Attention Is All You Need
Abstract
- The Transformer is a new, simple network architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely
- Experiments on machine translation tasks show the Transformer to be superior in quality while being more parallelizable and requiring significantly less time to train
- The Transformer achieves new state-of-the-art results on English-to-German and English-to-French translation tasks
Introduction
- Recurrent neural networks and encoder-decoder architectures have been state-of-the-art in sequence modeling and transduction problems
- Attention mechanisms have become an integral part of sequence modeling and transduction models
- The Transformer model relies entirely on an attention mechanism to draw global dependencies between input and output, eschewing recurrence
Background
- Prior efforts to reduce sequential computation, such as the Extended Neural GPU, ByteNet, and ConvS2S, use convolutional networks, in which the number of operations required to relate two positions grows with the distance between them
- Self-attention is an attention mechanism relating different positions of a single sequence in order to compute a representation of that sequence
- The Transformer is the first transduction model relying entirely on self-attention
Model Architecture
Encoder and Decoder Stacks
- Encoder: a stack of N = 6 identical layers, each containing a multi-head self-attention sub-layer and a position-wise feed-forward sub-layer; a residual connection is employed around each sub-layer, followed by layer normalization (a minimal sketch of one encoder layer follows this list)
- Decoder: also a stack of N = 6 identical layers, with a third sub-layer that performs multi-head attention over the output of the encoder stack; its self-attention sub-layer is masked to prevent positions from attending to subsequent positions
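A minimal sketch, assuming NumPy and hypothetical `self_attn` and `feed_forward` callables, of how one encoder layer composes its two sub-layers, each wrapped in a residual connection followed by layer normalization:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize over the last (d_model) dimension, as in the Add & Norm step.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def encoder_layer(x, self_attn, feed_forward):
    # Sub-layer 1: multi-head self-attention, residual connection, layer norm.
    x = layer_norm(x + self_attn(x, x, x))
    # Sub-layer 2: position-wise feed-forward network, residual connection, layer norm.
    x = layer_norm(x + feed_forward(x))
    return x
```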
Attention
Scaled Dot-Product Attention
- Computes the dot products of the query with all keys, divides each by √d_k, and applies a softmax function to obtain the weights on the values
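A minimal NumPy sketch of this computation, Attention(Q, K, V) = softmax(QKᵀ / √d_k) V; the function name and the optional mask argument (used in the decoder to block attention to future positions) are illustrative:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)
    d_k = Q.shape[-1]
    # Dot products of the queries with all keys, scaled by 1/sqrt(d_k).
    scores = Q @ K.T / np.sqrt(d_k)
    if mask is not None:
        # Mask out illegal connections (e.g. future positions in the decoder).
        scores = np.where(mask, scores, -1e9)
    # Softmax over the keys gives the weights on the values.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V
```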
Multi-Head Attention
- Linearly projects the queries, keys, and values h times with different, learned linear projections
- Performs the attention function in parallel on the projected versions
- Concatenates the outputs and projects them once more (see the sketch below)
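A minimal, self-contained NumPy sketch of this three-step process; the per-head projection matrices W_q, W_k, W_v and the output projection W_o are illustrative names for the learned parameters:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
    # W_q, W_k, W_v: lists of h per-head projection matrices (d_model x d_k or d_v);
    # W_o: final output projection of shape (h * d_v, d_model).
    heads = []
    for Wq_i, Wk_i, Wv_i in zip(W_q, W_k, W_v):
        # Project queries, keys, and values with head-specific learned projections.
        q, k, v = Q @ Wq_i, K @ Wk_i, V @ Wv_i
        # Scaled dot-product attention on the projected versions of each head.
        heads.append(softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v)
    # Concatenate the h head outputs and apply the final linear projection.
    return np.concatenate(heads, axis=-1) @ W_o
```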
Position-wise Feed-Forward Networks
- Applies a simple fully connected feed-forward network to each position separately and identically
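The paper gives this as FFN(x) = max(0, xW₁ + b₁)W₂ + b₂, with d_model = 512 and inner dimensionality d_ff = 2048; a minimal NumPy sketch (parameter names illustrative):

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    # Two linear transformations with a ReLU in between, applied to each
    # position independently: FFN(x) = max(0, x W1 + b1) W2 + b2
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2
```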
Embeddings and Softmax
- Uses learned embeddings to convert input and output tokens to vectors
- Shares the same weight matrix between the two embedding layers and the pre-softmax linear transformation; in the embedding layers, the weights are multiplied by √d_model
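A minimal sketch of this weight sharing, assuming NumPy and illustrative variable names: a single matrix serves as the token embedding (scaled by √d_model, as in the paper) and, transposed, as the pre-softmax projection.

```python
import numpy as np

d_model, vocab_size = 512, 37000  # d_model and shared BPE vocabulary size from the paper
embedding = np.random.randn(vocab_size, d_model) * 0.01  # single shared weight matrix

def embed(token_ids):
    # Token embedding lookup, scaled by sqrt(d_model) as described in the paper.
    return embedding[token_ids] * np.sqrt(d_model)

def output_logits(decoder_states):
    # Pre-softmax linear transformation reuses the same matrix, transposed.
    return decoder_states @ embedding.T
```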
Positional Encoding
- Adds "positional encodings" to the input embeddings to inject information about the relative or absolute position of the tokens
Why Self-Attention
- Compares various aspects of self-attention layers to recurrent and convolutional layers
- Self-attention layers connect all positions with a constant number of sequential operations, whereas recurrent layers require O(n) sequential operations
- Self-attention layers are faster than recurrent layers when the sequence length n is smaller than the representation dimensionality d (see the comparison below)
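For reference, the comparison summarized in Table 1 of the paper (n = sequence length, d = representation dimensionality, k = convolution kernel width):

| Layer type | Complexity per layer | Sequential operations | Maximum path length |
|---|---|---|---|
| Self-attention | O(n² · d) | O(1) | O(1) |
| Recurrent | O(n · d²) | O(n) | O(n) |
| Convolutional | O(k · n · d²) | O(1) | O(log_k(n)) |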
Training
Training Data and Batching
- Trained on the WMT 2014 English-German and English-French datasets
- Sentences encoded using byte-pair encoding and batched by approximate sequence length
Hardware and Schedule
- Trained on one machine with 8 NVIDIA P100 GPUs
- Base models trained for 100,000 steps (12 hours), big models trained for 300,000 steps (3.5 days)
Optimizer
- Used the Adam optimizer with a learning rate that increases linearly for the first warmup_steps training steps and decreases proportionally to the inverse square root of the step number thereafter
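The schedule from the paper is lrate = d_model^(-0.5) · min(step^(-0.5), step · warmup_steps^(-1.5)), with warmup_steps = 4000. A minimal sketch in plain Python:

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    # Learning rate increases linearly for the first warmup_steps steps,
    # then decays proportionally to the inverse square root of the step number.
    step = max(step, 1)  # guard against step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```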
Regularization
- Employed residual dropout (P_drop = 0.1) on sub-layer outputs and on the sums of embeddings and positional encodings, together with label smoothing (ε_ls = 0.1)
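A minimal NumPy sketch of one common formulation of label smoothing (the paper uses ε_ls = 0.1): each one-hot target is softened so that a small amount of probability mass is spread over the rest of the vocabulary. The function name and the exact way the mass is distributed are illustrative.

```python
import numpy as np

def smooth_labels(target_ids, vocab_size, eps=0.1):
    # Put 1 - eps on the true token and spread eps uniformly over the other tokens.
    smoothed = np.full((len(target_ids), vocab_size), eps / (vocab_size - 1))
    smoothed[np.arange(len(target_ids)), target_ids] = 1.0 - eps
    return smoothed
```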
Results
Machine Translation
- On the WMT 2014 English-to-German task, the big Transformer model outperforms the best previously reported models, including ensembles, by more than 2 BLEU, establishing a new state-of-the-art BLEU score of 28.4
- On the WMT 2014 English-to-French task, the big Transformer model achieves a new single-model state-of-the-art BLEU score of 41.8
Model Variations
- Evaluated the importance of different components of the Transformer architecture
- Observed that the number of attention heads, attention key size, model size, and dropout all have a significant impact on performance
English Constituency Parsing
- Trained a 4-layer Transformer model on the Wall Street Journal portion of the Penn Treebank
- Achieved strong results, outperforming previous models in the small-data regime and performing competitively in the semi-supervised setting
Conclusion
- The Transformer, the first sequence transduction model based entirely on attention, can be trained significantly faster than architectures based on recurrent or convolutional layers
- Achieved new state-of-the-art results on machine translation tasks
- Plans to apply attention-based models to other tasks and investigate local, restricted attention mechanisms