📄 FEATURED PAPER

Attention Is All You Need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
June 12, 2017
NeurIPS 2017
Deep Learning · NLP · Transformers · Attention

Abstract

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.


This foundational paper introduced the Transformer architecture, which has become the backbone of modern NLP systems including GPT, BERT, and many other state-of-the-art models.

Why This Paper Matters

The Transformer architecture revolutionized natural language processing by:

  1. Eliminating Recurrence: Unlike RNNs and LSTMs, Transformers process sequences in parallel, dramatically improving training speed
  2. Self-Attention Mechanism: Allows the model to weigh the importance of different words in a sequence, regardless of their distance (see the sketch after this list)
  3. Scalability: The architecture scales efficiently to very large models and datasets
  4. Transfer Learning: Pre-trained Transformers can be fine-tuned for various downstream tasks
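
The core operation behind point 2 is scaled dot-product attention, Attention(Q, K, V) = softmax(QKᵀ / √d_k)V, computed for all positions at once as matrix products. The snippet below is a minimal NumPy sketch (the function name, toy shapes, and random input are illustrative, not taken from the paper) showing how each token's output becomes a weighted mix of every token's value vector in a single parallel step.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    # Pairwise query-key similarities, scaled by sqrt(d_k) to keep the softmax well-behaved.
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax over the key axis: each row is one token's attention distribution.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Toy self-attention: 4 tokens, 8-dimensional vectors, with Q = K = V = x.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out, attn = scaled_dot_product_attention(x, x, x)
print(out.shape, attn.shape)  # (4, 8) (4, 4): every token attends to every token
```

Because the whole sequence is handled as one matrix multiplication rather than a step-by-step recurrence, this is also what makes the parallel training in point 1 possible.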

Key Contributions

  • Introduction of multi-head self-attention
  • Positional encodings for sequence order (sketched after this list)
  • Scaled dot-product attention mechanism
  • Layer normalization and residual connections
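
Since self-attention itself is order-agnostic, the paper injects word order through fixed sinusoidal positional encodings added to the token embeddings: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). Here is a small NumPy sketch of that formula (the function name and example sequence length are my own; d_model = 512 matches the paper's base model):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Fixed encodings: sine on even dimensions, cosine on odd dimensions."""
    positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]  # (1, d_model // 2)
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions use cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=50, d_model=512)
print(pe.shape)  # (50, 512); added to the token embeddings before the first layer
```

Each dimension oscillates at a different wavelength, so nearby positions get similar encodings while distant ones diverge, giving the attention layers a usable notion of relative position.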

This paper has over 80,000 citations and fundamentally changed how we approach sequence modeling tasks.
