RoFormer: Enhanced Transformer with Rotary Position Embedding
(https://arxiv.org/abs/2104.09864)
- Rotary embedding combines absolute & relative position embedding: each token is rotated according to its absolute position, yet attention scores end up depending only on relative positions
- Additional <PAD> tokens have been found to end up representing relative position information
- Using relative position embedding makes the transformer slower
- What is rotary embedding?
- each token embedding would be multiplied with a rotation matrix
- the rotation matrix consists of cosine & sine elements
- 2 elements at a time are taken from the token embedding & multiplied with the first 2x2 block of the rotation matrix; similarly, the next 2 elements (dims 3 & 4) are multiplied with the second 2x2 block, and so on (see the sketch below)
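A minimal NumPy sketch of this pairwise rotation (function names & shapes are my own, not from the paper):

```python
import numpy as np

def rotary_embed(x, pos, base=10000.0):
    """Rotate token embedding x (shape [d], d even) by its position pos.

    Dimensions (0,1), (2,3), ... form 2D pairs; pair i is rotated by
    angle pos * theta_i with theta_i = base^(-2i/d), as in the paper.
    """
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)  # one frequency per 2D pair
    cos, sin = np.cos(pos * theta), np.sin(pos * theta)
    x1, x2 = x[0::2], x[1::2]                  # the 2 elements of each pair
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin            # 2x2 rotation per pair
    out[1::2] = x1 * sin + x2 * cos
    return out

# Relative-position property: the attention score depends only on the
# distance between positions, so (5, 3) and (7, 5) give the same value.
q, k = np.random.randn(8), np.random.randn(8)
print(np.dot(rotary_embed(q, 5), rotary_embed(k, 3)))
print(np.dot(rotary_embed(q, 7), rotary_embed(k, 5)))
```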
- Uses
- sequence-length flexibility: RE scales with an increase in seq length
- decaying inter-token dependency with increasing relative distance, i.e., a pair of tokens with a long relative distance has a weaker connection (the dot product between the rotated embeddings of distant tokens is smaller than between nearby tokens; see the toy check after this list)
- normal self-attention is O(N^2); a linear self-attention formulation was proposed (2021), & RE combines well with linear self-attention
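A toy numerical check of the decay claim (my own illustration, not an experiment from the paper): measure the rotated dot product of a fixed vector with itself as the relative distance grows.

```python
import numpy as np

def rope_angles(d, r, base=10000.0):
    # per-pair rotation angles at relative distance r
    return r * base ** (-np.arange(0, d, 2) / d)

# Score between two identical all-ones pair-vectors at distance r,
# normalized so the value at r = 0 is 1; it decays (with oscillation)
# as r grows, matching the long-term decay property.
d = 128
for r in [0, 1, 10, 100, 1000]:
    print(r, round(np.sum(np.cos(rope_angles(d, r))) / (d // 2), 3))
```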
Efficient Memory Management for Large Language Model Serving with PagedAttention
(https://arxiv.org/abs/2309.06180)
KV Cache
- The KV cache is used only in autoregressive decoder models
- The KV cache stores the KEY & VALUE vectors of the previous tokens so that attention for the current token w.r.t. previous tokens can be computed faster, without recomputation (see the sketch below)
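A minimal sketch of decoding with a KV cache (shapes & names are illustrative, not from the paper):

```python
import numpy as np

class KVCache:
    """Append-only store of per-token key/value vectors for one sequence."""
    def __init__(self, d):
        self.keys = np.empty((0, d))
        self.values = np.empty((0, d))

    def append(self, k, v):
        self.keys = np.vstack([self.keys, k])
        self.values = np.vstack([self.values, v])

def decode_step(q, cache):
    """Attend the current query against all cached keys/values.

    Without the cache, K and V for every previous token would have to
    be recomputed at each generation step.
    """
    scores = cache.keys @ q / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ cache.values

d, cache = 16, KVCache(16)
for step in range(5):
    k, v, q = (np.random.randn(d) for _ in range(3))
    cache.append(k, v)            # cache grows by one K,V pair per token
    out = decode_step(q, cache)   # attention over all tokens so far
print(out.shape)                  # (16,)
```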
- In most implementations, the KV cache is managed inefficiently & only 20%-40% of its memory ends up storing actual token states. Also,
- there is no flexibility to reuse or share memory (e.g., in parallel sampling, beam search, etc.)
- memory has to be preallocated (either to the max length of the model (i.e., 2048 in LLaMA) or a user-restricted token length)
- the KV cache expects memory to be contiguous, resulting in internal & external fragmentation (i.e., memory allocated but not utilized)
- Internal fragmentation: over-provisioning of memory for the potential maximum sequence length (memory allocated up to max_length but unused because generation ends earlier, e.g., at an EOS token)
- Reservation: slots held for future tokens of a still-running sequence (not used yet, but needed later)
- External fragmentation: occurs because requests in a batch preallocate chunks of unequal sizes, leaving unusable gaps between them
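Back-of-the-envelope for the internal-fragmentation case above (dimensions follow the paper's OPT-13B example; the 200-token actual length is an arbitrary illustration):

```python
# Per-token KV cache: 2 vectors (K and V) x layers x hidden size x 2 bytes (fp16)
layers, hidden = 40, 5120
per_token = 2 * layers * hidden * 2          # = 819,200 bytes ~ 800 KB/token
max_len, actual_len = 2048, 200              # preallocated slot vs. tokens generated
print(f"{per_token / 1024:.0f} KB/token, "
      f"{actual_len / max_len:.0%} of the preallocated slot actually used")
```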
Solution: PagedAttention
- PagedAttention divides the request’s KV cache into blocks, each of which can contain the attention keys and values of a fixed number of tokens.
- In PagedAttention, the blocks for the KV cache are not necessarily stored in a contiguous space.
- The idea is inspired by virtual memory management: blocks as pages, tokens as bytes, requests as processes, KV cache as main memory, block table as page table
- The improvements are more pronounced with longer sequences, larger models, and more complex decoding algorithms
- PagedAttention partitions the KV cache of each sequence into KV blocks. Each block contains the key and value vectors for a fixed number of tokens, called the KV block size
- Physical memory space does NOT need to be fully reserved in advance; memory blocks are allocated on demand instead (see the sketch below)
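A minimal sketch of the block-table idea (data-structure names are mine; the real vLLM implementation differs):

```python
BLOCK_SIZE = 16  # tokens per KV block

class BlockManager:
    """Maps each request's logical KV blocks to physical blocks on demand."""
    def __init__(self, num_physical_blocks):
        self.free = list(range(num_physical_blocks))  # free physical block ids
        self.block_tables = {}                        # request id -> [physical ids]

    def append_token(self, request_id, token_index):
        table = self.block_tables.setdefault(request_id, [])
        if token_index % BLOCK_SIZE == 0:   # current block full (or none yet):
            table.append(self.free.pop())   # allocate only when actually needed
        return table[-1], token_index % BLOCK_SIZE  # (physical block, slot in block)

    def release(self, request_id):
        # freed blocks become reusable by other requests: no external fragmentation
        self.free.extend(self.block_tables.pop(request_id, []))

mgr = BlockManager(num_physical_blocks=8)
for t in range(20):                   # a 20-token request spans ceil(20/16) = 2 blocks
    block, slot = mgr.append_token("req-0", t)
print(mgr.block_tables["req-0"])      # physical ids need not be contiguous or ordered
mgr.release("req-0")
```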