The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
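Highlights: 1. a model architecture based entirely on attention, unlike conventional RNN/CNN models; 2. parallelizable, and therefore efficient to train; 3. strong results in practice.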
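The attention used in the paper is Scaled Dot-Product Attention, defined as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$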
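Another common form is additive attention, which computes the scores fed into the softmax with a feed-forward network. The two have similar theoretical complexity, but dot-product attention is faster in practice because it can rely on highly optimized matrix multiplication. When $$d_k$$ is large, the dot products grow large in magnitude and push the $$\mathrm{softmax}$$ into regions where its gradient is extremely small, which makes optimization difficult. To see why, assume the components of $$q$$ and $$k$$ are independent random variables with mean 0 and variance 1; then $$q \cdot k = \sum_{i=1}^{d_k} q_i k_i$$ has mean 0 and variance $$d_k$$. To counteract this effect, the dot products are scaled by $$\frac{1}{\sqrt{d_k}}$$, which is where the name "Scaled" comes from.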
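As a rough illustration, here is a minimal NumPy sketch of scaled dot-product attention, together with an empirical check of the variance argument. The function names, the shapes, and the choice of `d_k = 512` for the check are assumptions made for this example, not code from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max along the axis for numerical stability.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    """q: (n_q, d_k), k: (n_k, d_k), v: (n_k, d_v) -> (output, weights)."""
    d_k = q.shape[-1]
    # Similarity scores, scaled by 1/sqrt(d_k) before the softmax.
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

# Empirical check: with zero-mean, unit-variance components,
# q . k has variance close to d_k, and close to 1 after scaling.
rng = np.random.default_rng(0)
d_k = 512
q = rng.standard_normal((10000, d_k))
k = rng.standard_normal((10000, d_k))
dots = np.sum(q * k, axis=-1)
raw_var = dots.var()
scaled_var = (dots / np.sqrt(d_k)).var()
```

Without the scaling, the scores for `d_k = 512` would typically have a standard deviation around 22, so the softmax output would be close to one-hot and its gradient close to zero.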
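Multi-Head Attention first projects Q/K/V with a separate set of matrices W for each head, mapping them into different representation subspaces. It then applies Scaled Dot-Product Attention in each subspace, concatenates the results, and finally applies a linear projection to produce the output. The formula makes this easier to see:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^O, \quad \mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$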
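The project, attend, concatenate, project sequence can be sketched in NumPy as follows. This is a toy sketch under stated assumptions: the weights are random rather than learned, the sizes (`d_model = 64`, `h = 8`) are chosen only to keep the example small, and all names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention for a single head.
    d_k = q.shape[-1]
    weights = softmax(q @ np.swapaxes(k, -1, -2) / np.sqrt(d_k))
    return weights @ v

def multi_head_attention(q, k, v, w_q, w_k, w_v, w_o):
    """w_q, w_k: (h, d_model, d_k); w_v: (h, d_model, d_v); w_o: (h * d_v, d_model)."""
    # Project into each head's subspace and attend there.
    heads = [
        attention(q @ w_q[i], k @ w_k[i], v @ w_v[i])
        for i in range(w_q.shape[0])
    ]
    # Concatenate the heads, then apply the final linear projection W^O.
    return np.concatenate(heads, axis=-1) @ w_o

rng = np.random.default_rng(0)
d_model, h = 64, 8          # toy sizes, assumed for illustration
d_k = d_v = d_model // h
w_q = rng.standard_normal((h, d_model, d_k)) * 0.1
w_k = rng.standard_normal((h, d_model, d_k)) * 0.1
w_v = rng.standard_normal((h, d_model, d_v)) * 0.1
w_o = rng.standard_normal((h * d_v, d_model)) * 0.1

x = rng.standard_normal((5, d_model))   # 5 positions
y = multi_head_attention(x, x, x, w_q, w_k, w_v, w_o)
```

Because each head works in a subspace of size `d_model // h`, the concatenated heads have width `d_model` again, so the output has the same shape as the input.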
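The Transformer uses multi-head attention in three places:
- in the decoder, attention from the decoder over the encoder output (encoder-decoder attention);
- in the encoder, self-attention;
- in the decoder, masked attention from the current decoding position over the positions that have already been decoded.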
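The three uses differ mainly in where the queries, keys, and values come from, and in whether a mask is applied. The single-head sketch below illustrates this under simplifying assumptions: projections and multiple heads are omitted, the inputs are random, and the names (`attention`, `causal_mask`, `enc`, `dec`) are hypothetical. Taking keys and values from the encoder output for encoder-decoder attention follows the description in the original paper.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def attention(q, k, v, mask=None):
    d_k = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d_k)
    if mask is not None:
        # Disallowed positions get a large negative score,
        # so their softmax weight becomes zero.
        scores = np.where(mask, scores, -1e9)
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

def causal_mask(n):
    # True where position i may attend to position j (j <= i).
    return np.tril(np.ones((n, n), dtype=bool))

rng = np.random.default_rng(0)
d = 8
enc = rng.standard_normal((6, d))   # encoder states, source length 6
dec = rng.standard_normal((4, d))   # decoder states, target length 4

# 1. Encoder-decoder attention: queries from the decoder,
#    keys and values from the encoder output.
cross_out, cross_w = attention(dec, enc, enc)

# 2. Encoder self-attention: queries, keys, and values all from the encoder.
enc_out, enc_w = attention(enc, enc, enc)

# 3. Masked decoder self-attention: each position sees only itself
#    and earlier positions.
dec_out, dec_w = attention(dec, dec, dec, mask=causal_mask(4))
```

In `dec_w`, every entry above the diagonal is zero, which is what prevents a decoding position from looking at positions that have not been generated yet.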