Demystifying Efficient Self-Attention
A practical overview of efficient attention mechanisms that tackle the quadratic scaling problem.
Using extractive summarization to train Transformers on long documents efficiently.