transformers

  • replaces the FFN and the linear Q/K/V projections with an attention operation whose keys and values are learned parameter matrices
  • allows the learned K and V matrices to be scaled up (extra rows appended) without retraining from scratch
  • uses a non-parametric layer norm (no learned scale or bias), so that the only trainable parameters are the learned token/parameter matrices
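A minimal NumPy sketch of the idea in the bullets above: the input tokens act as queries against keys and values that are trainable matrices rather than projections of the input, and the matrices can be grown by appending rows. All names here are illustrative, and a plain softmax stands in for whatever normalization the actual architecture uses.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def token_param_attention(x, K_p, V_p):
    """Attention against learned parameter tokens: queries come from the
    input x, while K_p and V_p are trainable matrices, not projections."""
    d = x.shape[-1]
    scores = x @ K_p.T / np.sqrt(d)   # (n_tokens, n_params)
    return softmax(scores) @ V_p      # (n_tokens, d)

rng = np.random.default_rng(0)
d, m = 16, 32                        # model dim, number of parameter tokens
x = rng.normal(size=(4, d))          # 4 input tokens
K_p = rng.normal(size=(m, d))        # learned keys
V_p = rng.normal(size=(m, d))        # learned values

y = token_param_attention(x, K_p, V_p)

# Scaling up: append fresh parameter rows without touching the old ones.
# Zero-init values keep the new slots inert until they are trained.
K_big = np.vstack([K_p, rng.normal(size=(8, d)) * 1e-3])
V_big = np.vstack([V_p, np.zeros((8, d))])
y_big = token_param_attention(x, K_big, V_big)
```

Because the old rows of `K_p`/`V_p` are untouched, the grown model starts close to the original one and only the new slots need training.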