The Idea :- Complex recurrent neural networks include an encoder and a decoder connected through an attention mechanism. The paper proposes that we don't need recurrence: the attention mechanism alone can produce great results. This is the Transformer model architecture, a paradigm shift in sequence processing, since attention shortens the path between any two positions and reduces the number of sequential computation steps (long paths otherwise lead to information loss).
Traditional :- RNN + attention. The paper proposes :- attention only.
How RNNs work :- RNNs perform sequential computation, which precludes parallelisation and becomes critical as longer input sequences are encountered. It is very hard for an RNN to capture long-range dependencies. Factorisation tricks and conditional computation do come to the rescue, but sequential processing remains a constraint. An RNN takes the current input and the previous hidden state and computes the current hidden state. Attention is a mechanism to improve the performance of the RNN: in popular attention mechanisms the decoder is taught to pay attention to the hidden states of the encoder.
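The sequential bottleneck described above can be sketched in a few lines. This is a minimal toy illustration (hypothetical weights and sizes, not the paper's model): each hidden state depends on the previous one, so the loop over time steps cannot be parallelised.

```python
import math

# One vanilla RNN step: h_t = tanh(W_xh · x_t + W_hh · h_{t-1})
def rnn_step(x_t, h_prev, W_xh, W_hh):
    # weighted sum over the input and the previous hidden state, then tanh
    return [math.tanh(sum(w * x for w, x in zip(W_xh[i], x_t)) +
                      sum(w * h for w, h in zip(W_hh[i], h_prev)))
            for i in range(len(h_prev))]

# Each step consumes the previous hidden state, forcing sequential processing:
h = [0.0, 0.0]
for x_t in [[1.0], [0.5], [-1.0]]:  # a 3-step input sequence
    h = rnn_step(x_t, h,
                 W_xh=[[0.5], [-0.3]],          # toy input-to-hidden weights
                 W_hh=[[0.1, 0.0], [0.0, 0.1]]) # toy hidden-to-hidden weights
```

By the last step, `h` has been squeezed through two intermediate states, which is exactly how information about early tokens gets diluted over long sequences.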
Other key words :- self-attention, end-to-end memory networks
The Proposed architecture :- The path length of information is much shorter now. The decoder decides which hidden states to look at via an addressing scheme.
Input embedding and output embedding – go into the network
Positional encoding – tells the network where each word sits in the sequence and gives it a significant boost
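The paper's positional encoding is a fixed pattern of sines and cosines added to the embeddings, so the model can tell positions apart without recurrence. A small sketch of that scheme (toy `seq_len` and `d_model` chosen for illustration):

```python
import math

# Sinusoidal positional encoding from the paper:
# PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
# PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
def positional_encoding(seq_len, d_model):
    pe = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # even indices use sin, odd indices use cos, sharing the same frequency
            angle = pos / (10000 ** ((i // 2) * 2 / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe

pe = positional_encoding(seq_len=4, d_model=8)
# pe[pos] is added element-wise to the embedding of the token at position pos
```

Each position gets a unique fingerprint, and nearby positions get similar ones, which is what lets attention reason about word order.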
There are three attention blocks: attention over the input sentence (the encoder hidden states), attention over the part of the output sentence already produced, and a third multi-head attention that combines the encoded forms of the input and output. The encoder of the source sentence builds key–value pairs (the key is a way to index the value), and the other part of the network builds the query.
Keys, values and the query
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
Each key has a corresponding value, and then we introduce the query. Compute the dot product of the query with each of the keys; the key with the biggest dot product matches the query best. Apply the softmax (exponentiation and normalisation) to turn the scores into weights that softly select that key, and output the weighted sum of the values. The scores are divided by sqrt(d_k) so the dot products don't grow too large before the softmax.
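The steps above can be sketched directly for a single query. This is a toy, list-based sketch of scaled dot-product attention (the keys, values, and query below are made-up illustrative vectors):

```python
import math

def softmax(xs):
    exps = [math.exp(x - max(xs)) for x in xs]  # shift by max for numerical stability
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Attention(q, K, V) = softmax(q·K^T / sqrt(d_k)) V, for one query vector q
def attention(q, K, V):
    d_k = len(q)
    scores = [dot(q, k) / math.sqrt(d_k) for k in K]  # similarity of q to each key
    weights = softmax(scores)                          # soft selection over the keys
    # output = weighted sum of the values
    return [sum(w * v[i] for w, v in zip(weights, V)) for i in range(len(V[0]))]

K = [[1.0, 0.0], [0.0, 1.0]]       # two keys
V = [[10.0, 0.0], [0.0, 10.0]]     # their corresponding values
out = attention([1.0, 0.0], K, V)  # query closest to the first key
```

Because the query aligns with the first key, the softmax puts most of its weight there and the output is pulled toward the first value.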
The dot product between two vectors is proportional to the cosine of the angle between them (scaled by the vectors' magnitudes), so it measures how aligned, i.e. how similar, the two vectors are.
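A quick numeric check of that relationship, using two hypothetical toy vectors:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

a, b = [1.0, 0.0], [1.0, 1.0]                 # vectors 45 degrees apart
cos_theta = dot(a, b) / (norm(a) * norm(b))   # recovers cos(45°) ≈ 0.7071
```

Vectors pointing the same way give a large dot product, orthogonal vectors give zero, which is why it works as the similarity score between queries and keys.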