Paper Review :- Attention Is All You Need

The Idea :- Dominant sequence transduction models are complex recurrent neural networks that include an encoder and a decoder, with the best models connecting the two through an attention mechanism. The paper proposes that we don't need recurrence at all: an attention mechanism alone can produce great results. The resulting architecture is called the Transformer, and it is a paradigm shift in sequence processing because attention shortens the path between any two positions and reduces the number of sequential computation steps (long paths otherwise lead to information loss).

Traditional :- RNN + Attention. The paper proposes :- Attention only.

How RNNs work :- RNNs perform sequential computation, which precludes parallelisation and becomes critical as longer input sequences are encountered. It is very hard for an RNN to capture long-range dependencies. Factorisation tricks and conditional computation do come to the rescue, but sequential processing remains a constraint. An RNN takes the current input and the last hidden state and computes the current hidden state, as sketched below. Attention is a mechanism to improve the performance of the RNN: in popular attention mechanisms the decoder is taught to pay attention to the hidden states of the encoder.
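A minimal sketch (my own illustration, not code from the paper) of a vanilla RNN cell: the hidden state is updated one token at a time, and this step-by-step loop is exactly what prevents parallelisation over the sequence. All dimensions and weights here are toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, seq_len = 8, 16, 5

W_xh = rng.normal(size=(d_in, d_hidden)) * 0.1      # input -> hidden weights
W_hh = rng.normal(size=(d_hidden, d_hidden)) * 0.1  # hidden -> hidden weights
b_h = np.zeros(d_hidden)

inputs = rng.normal(size=(seq_len, d_in))  # a toy input sequence
h = np.zeros(d_hidden)                     # initial hidden state

hidden_states = []
for x_t in inputs:                               # must run step by step, in order
    h = np.tanh(x_t @ W_xh + h @ W_hh + b_h)     # current input + last hidden state
    hidden_states.append(h)

print(np.stack(hidden_states).shape)  # (5, 16): one hidden state per time step
```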

Other keywords :- self-attention, end-to-end memory networks

The proposed architecture :- The path length for information flow is much shorter now. The decoder decides which hidden states to look at via an addressing scheme.

Input embedding and output embedding – these go into the network.

Positional encoding – tells the network where the words are in the sequence and gives it a significant boost (a sketch follows below).
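A minimal sketch of the sinusoidal positional encodings described in the paper, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)); the sequence length and embedding values below are illustrative, not from the paper.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]                    # (seq_len, 1) positions
    i = np.arange(d_model // 2)[None, :]                 # (1, d_model/2) dimension indices
    angles = pos / np.power(10000.0, (2 * i) / d_model)  # one frequency per dimension pair
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                         # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)                         # odd dimensions use cosine
    return pe

embeddings = np.random.default_rng(0).normal(size=(10, 512))  # toy word embeddings
x = embeddings + positional_encoding(10, 512)                  # position info added in
print(x.shape)  # (10, 512)
```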

There is attention over the input sentence (the encoder's hidden states), attention over the hidden states of the part of the output sentence already produced, and a third multi-head attention that combines the input and output (in their encoded forms). The encoder of the source sentence builds key–value pairs (the key is a way to index the value), and the other part of the network builds the query.

Keys, values and the query

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V

Each key has a corresponding value, and then we introduce the query. Compute the dot product of the query with each key, scale by 1/sqrt(d_k), and apply the softmax (exponentiation and normalisation) so that the keys with the biggest dot products get the largest weights. The output is the resulting weighted sum of the values. A minimal sketch follows.
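A small implementation sketch of the scaled dot-product attention formula above; the query, key and value matrices are random toy data, and the shapes are only illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # query-key dot products, scaled by 1/sqrt(d_k)
    weights = softmax(scores, axis=-1)  # soft selection over the keys
    return weights @ V                  # weighted sum of the corresponding values

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 64))  # 3 queries (e.g. decoder positions)
K = rng.normal(size=(5, 64))  # 5 keys built by the encoder
V = rng.normal(size=(5, 64))  # 5 values paired with the keys
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 64)
```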

The dot product between two vectors reflects the angle between them (a · b = |a||b| cos θ), so it measures how aligned the query is with each key.
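A tiny illustration of that point, with made-up vectors: aligned vectors give a large dot product and a cosine near 1, orthogonal vectors give a dot product near 0.

```python
import numpy as np

a = np.array([1.0, 0.0])
b = np.array([0.9, 0.1])  # nearly aligned with a
c = np.array([0.0, 1.0])  # orthogonal to a

cos = lambda u, v: u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
print(a @ b, cos(a, b))  # large dot product, cosine close to 1
print(a @ c, cos(a, c))  # zero dot product, cosine 0
```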
