RoFormer: Enhanced Transformer with Rotary Position Embedding

(https://arxiv.org/abs/2104.09864)

  1. Rotary position embedding (RoPE) combines the strengths of absolute & relative position embedding: each position is encoded with an absolute rotation, yet the resulting attention score depends only on the relative distance between tokens
  2. The additional <PAD> tokens are found to represent relative position information
  3. Conventional relative position embeddings add extra computation to attention and make the transformer slower
  4. What is rotary embedding?
    1. each token's query/key vector is multiplied by a position-dependent rotation matrix
    2. the rotation matrix is block-diagonal, with entries built from the cosine & sine of position-dependent angles
    3. the embedding is rotated 2 elements at a time: the first pair is multiplied by the first 2x2 block of the rotation matrix, the next pair by the second 2x2 block, and so on (see the sketch after this list)
  5. Uses
    1. sequence-length flexibility: RoPE extends to sequence lengths longer than those seen in training
    2. decaying inter-token dependency with increasing relative distance, i.e., a pair of tokens far apart has a weaker connection (the dot product between distant token embeddings is smaller than between tokens close to each other)
    3. vanilla self-attention is O(N^2) in sequence length; because RoPE acts directly on the queries & keys, it also combines well with the linear self-attention variants proposed around 2021
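A minimal NumPy sketch of the pairwise rotation from item 4 and the relative-distance property from item 5.2 (the base of 10000 follows the paper; the function name, toy dimensions, and vectors are illustrative assumptions, not the paper's code):

```python
# Minimal sketch of rotary position embedding (RoPE) on a single vector.
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate vector x (even dim d) for position `pos`.

    Dimensions are taken two at a time: pair i is rotated by the angle
    pos * base**(-2*i/d), i.e., multiplied by one 2x2 rotation block.
    """
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)   # per-pair frequencies
    angles = pos * theta                        # angle for each 2-D pair
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]                   # split into pairs
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin             # apply the 2x2 rotation
    out[1::2] = x1 * sin + x2 * cos             #   to each pair
    return out

# The score <rope(q, m), rope(k, n)> depends only on the offset m - n:
rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)
s1 = rope(q, 3) @ rope(k, 1)     # relative distance 2
s2 = rope(q, 10) @ rope(k, 8)    # relative distance 2 again
print(np.isclose(s1, s2))        # True: only the relative offset matters
```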

Efficient Memory Management for Large Language Model Serving with PagedAttention

(https://arxiv.org/abs/2309.06180)

KV Cache

  1. KV cache is used only in decoder models
  2. KV cache stores the KEYs & VALUEs of the previous tokens so that the attention of the current token with respect to the previous tokens can be computed faster (see the sketch after this list)
  3. In most implementations, KV cache management is inefficient: only 20%-40% of the allocated KV cache memory actually holds token states. Also,
    1. there is no flexibility to reuse the memory (ex: in parallel sampling, beam search, etc)
    2. memory has to be preallocated (either to the model's max length (e.g., 2048 in LLaMA) or to a user-restricted token length)
    3. the KV cache expects memory to be contiguous, resulting in internal & external fragmentation (i.e., memory allocated but not utilized)
      1. Internal fragmentation: over-provisioning of memory for the potential maximum sequence length (memory allocated but never used because generation stops at max length OR at an EOS token)
      2. Reservation: slots for future tokens
      3. External fragmentation: occurs because requests in a batch pre-allocate contiguous chunks of unequal sizes, leaving unusable gaps between them
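A minimal sketch of why a KV cache helps during decoding (toy single-head attention in NumPy; all names, shapes, and weights are illustrative assumptions, not vLLM's implementation):

```python
# Toy decoder step with a KV cache: past K/V are reused, not recomputed.
import numpy as np

d = 16
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
k_cache, v_cache = [], []            # grows by one entry per generated token

def decode_step(x):
    """Attend the current token over all previously cached keys/values."""
    q = x @ Wq
    k_cache.append(x @ Wk)           # only the NEW token's K/V are computed;
    v_cache.append(x @ Wv)           # past tokens come straight from the cache
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = (K @ q) / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V               # attention output for the current token

for _ in range(5):                   # naive serving systems reserve cache space
    decode_step(rng.normal(size=d))  # for the max possible length up front
```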

Solution: PagedAttention

  1. PagedAttention divides the request’s KV cache into blocks, each of which can contain the attention keys and values of a fixed number of tokens.
  2. In PagedAttention, the blocks for the KV cache are not necessarily stored in a contiguous space.
  3. The idea is inspired by virtual memory management: blocks as pages, tokens as bytes, requests as processes, KV Cache as Main memory, Page table as block table
  4. The improvements are more pronounced with longer sequences, larger models, and more complex decoding algorithms
  5. PagedAttention partitions the KV cache of each sequence into KV blocks. Each block contains the key and value vectors for a fixed number of tokens, called the KV block size
  6. Physical memory does NOT need to be fully reserved in advance; memory blocks are allocated on demand instead (see the sketch below)
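A minimal sketch of the block-table idea: logical blocks map to non-contiguous physical blocks, allocated only when needed (hypothetical names and pure-Python bookkeeping for illustration; real vLLM manages GPU memory, not Python lists):

```python
# Illustrative block-table bookkeeping in the spirit of PagedAttention.
BLOCK_SIZE = 4                       # tokens per KV block ("KV block size")

class BlockAllocator:
    def __init__(self, num_physical_blocks):
        self.free = list(range(num_physical_blocks))   # pool of physical blocks

    def allocate(self):
        return self.free.pop()       # any free block; need not be contiguous

class Sequence:
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []        # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # Allocate a new physical block only when the last one is full,
        # so at most BLOCK_SIZE - 1 slots are ever wasted per sequence.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

allocator = BlockAllocator(num_physical_blocks=8)
seq_a, seq_b = Sequence(allocator), Sequence(allocator)
for _ in range(6):
    seq_a.append_token()             # interleaved requests end up with
    seq_b.append_token()             # non-contiguous physical blocks
print(seq_a.block_table, seq_b.block_table)   # [7, 5] and [6, 4]
```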