TLDR: The Transformer, a new simple network architecture based solely on attention mechanisms, achieves state-of-the-art results on machine translation tasks while being more parallelizable and requiring significantly less training time compared to previous models.



Attention Is All You Need

Abstract

  • Proposes the Transformer, a simple network architecture based solely on attention mechanisms that dispenses with recurrence and convolutions entirely
  • Experiments on machine translation tasks show the Transformer to be superior in quality while being more parallelizable and requiring significantly less time to train
  • The Transformer achieves new state-of-the-art results on English-to-German and English-to-French translation tasks

Introduction

  • Recurrent neural networks and encoder-decoder architectures have been state-of-the-art in sequence modeling and transduction problems
  • Attention mechanisms have become an integral part of sequence modeling and transduction models
  • The Transformer model relies entirely on an attention mechanism to draw global dependencies between input and output, eschewing recurrence

Background

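  • Earlier efforts to reduce sequential computation, such as the Extended Neural GPU, ByteNet, and ConvS2S, all use convolutional neural networks as their basic building block
  • Self-attention (sometimes called intra-attention) is an attention mechanism that relates different positions of a single sequence to compute a representation of that sequence
  • The Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output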

Model Architecture

Encoder and Decoder Stacks

  • Encoder: Composed of a stack of N = 6 identical layers, each with a multi-head self-attention mechanism and a feed-forward network
  • Decoder: Also composed of a stack of N = 6 identical layers, with an additional sub-layer that performs multi-head attention over the output of the encoder stack

Attention

Scaled Dot-Product Attention

  • Computes the dot products of the query with all keys, divides each by √d_k (where d_k is the key dimension), and applies a softmax function to obtain the weights on the values
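
The computation described above, softmax(Q Kᵀ / √d_k) V, can be sketched in plain Python. This is a minimal illustration on nested lists, not the authors' implementation; the function names `softmax` and `scaled_dot_product_attention` and the toy inputs are assumptions made for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Return (outputs, weights) for lists of query, key, and value vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Dot product of this query with every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys
        ]
        weights = softmax(scores)
        # The output is the attention-weighted sum of the value vectors.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Toy example (illustrative values): one query that matches the first key best.
outputs, weights = scaled_dot_product_attention(
    queries=[[1.0, 0.0]],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[1.0, 2.0], [3.0, 4.0]],
)
```

In this toy call the first key receives the larger weight, so the output lies closer to the first value vector than to the second.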

Multi-Head Attention

  • Linearly projects the queries, keys, and values h times with different, learned linear projections
  • Performs the attention function in parallel on the projected versions
  • Concatenates the outputs and projects again
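
The three steps above (project, attend per head, concatenate and project again) can be sketched as follows. This is a rough sketch under stated assumptions: the dimensions are toy values (d_model = 8, h = 2) rather than the paper's, the random matrices merely stand in for learned projections, and the helper names `matmul`, `attention`, `random_matrix`, and `multi_head_attention` are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply an (n x m) matrix by an (m x p) matrix, both nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def attention(q, k, v):
    """Scaled dot-product attention on nested lists."""
    d_k = len(k[0])
    out = []
    for q_row in q:
        scores = [
            sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d_k) for k_row in k
        ]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append(
            [sum(w * row[j] for w, row in zip(weights, v)) for j in range(len(v[0]))]
        )
    return out


def random_matrix(rows, cols, rng):
    """Stand-in for learned projection weights (hypothetical initialisation)."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def multi_head_attention(q, k, v, heads, w_out):
    """heads is a list of (W_q, W_k, W_v) projection triples, one per head."""
    head_outputs = [
        attention(matmul(q, w_q), matmul(k, w_k), matmul(v, w_v))
        for w_q, w_k, w_v in heads
    ]
    # Concatenate the per-head outputs along the feature axis, then project.
    concat = [sum((head[i] for head in head_outputs), []) for i in range(len(q))]
    return matmul(concat, w_out)


rng = random.Random(0)
d_model, h = 8, 2          # toy sizes chosen for this example
d_k = d_model // h         # per-head dimension
heads = [
    tuple(random_matrix(d_model, d_k, rng) for _ in range(3)) for _ in range(h)
]
w_out = random_matrix(h * d_k, d_model, rng)
x = random_matrix(3, d_model, rng)  # a sequence of 3 positions
y = multi_head_attention(x, x, x, heads, w_out)  # self-attention: q = k = v = x
```

Because each head works in a reduced dimension d_model / h, the total cost stays close to that of a single full-width attention head.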

Position-wise Feed-Forward Networks

  • Applies a simple, position-wise fully connected feed-forward network to each position separately and identically
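
As an illustration of "separately and identically", the sketch below applies one small network, FFN(x) = max(0, x·W1 + b1)·W2 + b2 (the form given in the paper), to each position on its own. The weights are tiny hand-picked values chosen for this example, and the function names are hypothetical.

```python
def feed_forward(x, w1, b1, w2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2 for a single position vector x."""
    hidden = [
        max(0.0, sum(xi * w for xi, w in zip(x, col)) + b)
        for col, b in zip(zip(*w1), b1)
    ]
    return [
        sum(hi * w for hi, w in zip(hidden, col)) + b
        for col, b in zip(zip(*w2), b2)
    ]


def position_wise_ffn(sequence, w1, b1, w2, b2):
    """Apply the same feed-forward network to every position independently."""
    return [feed_forward(x, w1, b1, w2, b2) for x in sequence]


# Toy weights: d_model = 2, inner dimension = 3 (illustrative only).
w1 = [[1.0, 0.0, -1.0], [0.0, 1.0, -1.0]]
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b2 = [0.0, 0.0]

result = position_wise_ffn([[1.0, 2.0], [-1.0, -2.0]], w1, b1, w2, b2)
# result == [[1.0, 2.0], [3.0, 3.0]]
```

No information flows between positions here; mixing across positions happens only in the attention sub-layers.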

Embeddings and Softmax

  • Uses learned embeddings to convert input and output tokens to vectors
  • Shares the same weight matrix between the two embedding layers and the pre-softmax linear transformation
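
One way to picture the weight sharing is a single matrix used both for embedding lookup and for producing pre-softmax logits. The sketch below assumes a hypothetical `TiedEmbedding` class with toy weights; the √d_model scaling on the embedding side follows the paper's description.

```python
import math


class TiedEmbedding:
    """One (vocab_size x d_model) matrix used for lookup and for output logits."""

    def __init__(self, weights):
        self.weights = weights
        self.d_model = len(weights[0])

    def embed(self, token_id):
        # Embedding lookup, scaled by sqrt(d_model) as described in the paper.
        scale = math.sqrt(self.d_model)
        return [w * scale for w in self.weights[token_id]]

    def logits(self, hidden):
        # Pre-softmax linear transformation reusing the same matrix (transposed).
        return [sum(h * w for h, w in zip(hidden, row)) for row in self.weights]


# Toy vocabulary of 3 tokens with d_model = 2 (illustrative values).
tied = TiedEmbedding([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
vector = tied.embed(2)
scores = tied.logits([1.0, 2.0])  # [1.0, 2.0, 3.0]
```

Sharing the matrix means the model stores one set of token parameters instead of three.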

Positional Encoding

  • Adds "positional encodings" to the input embeddings to inject information about the relative or absolute position of the tokens
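
The paper's fixed encodings use sine and cosine functions of different frequencies: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A minimal sketch, with hypothetical function names and toy sizes:

```python
import math


def positional_encoding(max_len, d_model):
    """Build a (max_len x d_model) table of sinusoidal positional encodings."""
    table = []
    for pos in range(max_len):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share the same frequency.
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def add_positional_encoding(embeddings):
    """Add the encoding for each position to that position's embedding."""
    pe = positional_encoding(len(embeddings), len(embeddings[0]))
    return [
        [e + p for e, p in zip(e_row, p_row)]
        for e_row, p_row in zip(embeddings, pe)
    ]


pe = positional_encoding(max_len=4, d_model=6)
# Position 0 encodes as alternating sin(0) = 0 and cos(0) = 1.
encoded = add_positional_encoding([[0.5] * 6 for _ in range(4)])
```

Since the encodings have the same dimension as the embeddings, the two can simply be summed.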

Why Self-Attention

  • Compares various aspects of self-attention layers to recurrent and convolutional layers
  • Self-attention layers connect all positions with a constant number of sequential operations, whereas recurrent layers require O(n) sequential operations
  • In terms of per-layer computational complexity, self-attention layers (O(n²·d)) are faster than recurrent layers (O(n·d²)) when the sequence length n is smaller than the representation dimensionality d
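
The comparison rests on the per-layer complexities the paper reports: O(n²·d) for self-attention and O(n·d²) for a recurrent layer. The sketch below plugs in illustrative numbers; the function names are hypothetical and the results are operation-count proxies with constants ignored, not measured timings.

```python
def self_attention_cost(n, d):
    """Per-layer cost proxy for self-attention: n^2 * d."""
    return n * n * d


def recurrent_cost(n, d):
    """Per-layer cost proxy for a recurrent layer: n * d^2."""
    return n * d * d


d = 512
short = (self_attention_cost(50, d), recurrent_cost(50, d))
# short == (1_280_000, 13_107_200): self-attention is cheaper when n < d
long = (self_attention_cost(1000, d), recurrent_cost(1000, d))
# long == (512_000_000, 262_144_000): the relation flips once n > d
```

The crossover sits at n = d, which is why the advantage holds for typical sentence lengths with word-piece or byte-pair representations.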

Training

Training Data and Batching

  • Trained on the WMT 2014 English-German and English-French datasets
  • Sentences encoded using byte-pair encoding and batched by approximate sequence length
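
Batching by approximate sequence length might look roughly like the sketch below: sort the sentence pairs by length, then fill each batch up to a token budget. The function `batch_by_length`, its `max_tokens` parameter, and the toy data are assumptions for illustration; the paper does not spell out its batching code.

```python
def batch_by_length(pairs, max_tokens):
    """Group (source, target) token lists of similar length under a token budget."""
    ordered = sorted(pairs, key=lambda p: (len(p[0]), len(p[1])))
    batches, current, src_tokens, tgt_tokens = [], [], 0, 0
    for src, tgt in ordered:
        over_budget = (
            src_tokens + len(src) > max_tokens or tgt_tokens + len(tgt) > max_tokens
        )
        # Start a new batch once either side would exceed the budget.
        if current and over_budget:
            batches.append(current)
            current, src_tokens, tgt_tokens = [], 0, 0
        current.append((src, tgt))
        src_tokens += len(src)
        tgt_tokens += len(tgt)
    if current:
        batches.append(current)
    return batches


# Toy sentence pairs with (source, target) lengths (5, 6), (2, 2), (3, 3), (6, 5).
pairs = [
    (["a"] * 5, ["b"] * 6),
    (["a"] * 2, ["b"] * 2),
    (["a"] * 3, ["b"] * 3),
    (["a"] * 6, ["b"] * 5),
]
batches = batch_by_length(pairs, max_tokens=6)
batch_sizes = [len(batch) for batch in batches]  # [2, 1, 1]
```

Grouping similar lengths keeps padding low, so most of each batch's budget goes to real tokens.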

Hardware and Schedule

  • Trained on one machine with 8 NVIDIA P100 GPUs
  • Base models trained for 100,000 steps (12 hours), big models trained for 300,000 steps (3.5 days)
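
The step counts and wall-clock times above imply an average time per training step, which is easy to check. The helper name `seconds_per_step` is hypothetical.

```python
def seconds_per_step(total_steps, hours):
    """Average wall-clock seconds per training step."""
    return hours * 3600 / total_steps


base_step_time = seconds_per_step(100_000, 12)       # about 0.432 s
big_step_time = seconds_per_step(300_000, 3.5 * 24)  # about 1.008 s
```

These figures are consistent with the roughly 0.4 and 1.0 seconds per step that the paper reports for the base and big models.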

Optimizer

  • Used the Adam optimizer with a learning rate that increases linearly for the first warmup_steps training steps and decreases proportionally to the inverse square root of the step number thereafter
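
The schedule described above corresponds to the paper's formula lrate = d_model^-0.5 · min(step^-0.5, step · warmup_steps^-1.5). A minimal sketch follows; the function name `transformer_lr` is hypothetical, and the defaults d_model = 512 and warmup_steps = 4000 are the base-model values from the paper.

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Linear warmup followed by inverse-square-root decay."""
    if step < 1:
        raise ValueError("step must be a positive integer")
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


warmup_rate = transformer_lr(100)    # still in the linear warmup phase
peak_rate = transformer_lr(4000)     # both branches of min() agree here
decayed_rate = transformer_lr(16000) # half the peak: sqrt(4000 / 16000) = 0.5
```

The two branches meet at step = warmup_steps, which is where the learning rate peaks.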

Regularization

  • Employed residual dropout (applied to each sub-layer output and to the sums of the embeddings and positional encodings) and label smoothing
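
Label smoothing replaces the one-hot training target with a slightly softened distribution. The sketch below uses ε = 0.1, the value given in the paper, and assumes the common variant that spreads ε uniformly over the whole vocabulary; the paper does not specify which variant it uses, and the function names are hypothetical.

```python
import math


def smooth_labels(target_index, vocab_size, epsilon=0.1):
    """Return (1 - epsilon) * one_hot + epsilon / vocab_size for every class."""
    uniform = epsilon / vocab_size
    dist = [uniform] * vocab_size
    dist[target_index] += 1.0 - epsilon
    return dist


def cross_entropy(target_dist, predicted_probs):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(
        t * math.log(p) for t, p in zip(target_dist, predicted_probs) if t > 0
    )


smoothed = smooth_labels(target_index=2, vocab_size=4)
# smoothed is approximately [0.025, 0.025, 0.925, 0.025]
loss = cross_entropy(smoothed, [0.1, 0.1, 0.7, 0.1])
```

With smoothed targets the model is penalized for placing all probability mass on one token, which the paper notes hurts perplexity but improves accuracy and BLEU.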

Results

Machine Translation

  • On the WMT 2014 English-to-German task, the big Transformer model outperforms the best previously reported models, including ensembles, by more than 2 BLEU, reaching a BLEU score of 28.4
  • On the WMT 2014 English-to-French task, the big Transformer model achieves a new single-model state-of-the-art BLEU score of 41.8

Model Variations

  • Evaluated the importance of different components of the Transformer architecture
  • Observed that the number of attention heads, attention key size, model size, and dropout all have a significant impact on performance

English Constituency Parsing

  • Trained a 4-layer Transformer model on the Wall Street Journal portion of the Penn Treebank
  • Achieved strong results, outperforming previous models in the small-data regime and performing competitively in the semi-supervised setting

Conclusion

  • The Transformer, the first sequence transduction model based entirely on attention, can be trained significantly faster than architectures based on recurrent or convolutional layers
  • Achieved new state-of-the-art results on machine translation tasks
  • Plans to apply attention-based models to other tasks and investigate local, restricted attention mechanisms