don’t normalize
Jan 21, 2025 · 1 min read
study ways to get rid of more modules in transformers:
bounded activations to remove normalization layers (first sketch below)
get rid of softmax and sequence-level compute everywhere (e.g. SigLIP and sigmoid attention; second sketch below)
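a minimal sketch of the bounded-activation idea, assuming PyTorch: a learnable tanh squashes activations into a fixed range, so the layer can sit where a LayerNorm would without computing per-token statistics. the module name, the tanh choice, and the init value are my illustrative assumptions (close in spirit to "dynamic tanh"-style replacements), not something this note pins down.

```python
import torch
import torch.nn as nn

class BoundedActivation(nn.Module):
    """Stand-in for LayerNorm: bound activations with a learnable tanh
    instead of normalizing by mean/variance statistics. Illustrative
    sketch; names and init are assumptions, not a spec."""

    def __init__(self, dim: int, init_alpha: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((1,), init_alpha))  # input scale
        self.gamma = nn.Parameter(torch.ones(dim))                # per-channel gain
        self.beta = nn.Parameter(torch.zeros(dim))                # per-channel shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # tanh bounds every activation to (-1, 1), so no reduction over
        # the hidden dimension (and no mean/variance) is needed.
        return self.gamma * torch.tanh(self.alpha * x) + self.beta


# usage: drop it where a LayerNorm would sit in a transformer block
x = torch.randn(2, 16, 512)       # (batch, sequence, hidden)
norm_free = BoundedActivation(512)
print(norm_free(x).shape)         # torch.Size([2, 16, 512])
```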
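and a minimal sketch of sigmoid attention, again assuming PyTorch: each query-key score goes through an element-wise sigmoid instead of a row-wise softmax, so there is no reduction over the sequence. the -log(n) bias follows the sigmoid-attention line of work (it keeps total attention mass near softmax's at init); the function name and shapes are assumptions for illustration. SigLIP makes the same move in the loss, scoring image-text pairs independently with a sigmoid rather than a softmax over the batch.

```python
import math
import torch

def sigmoid_attention(q, k, v):
    """Attention with an element-wise sigmoid in place of softmax.
    Softmax normalizes each row over the whole sequence; sigmoid scores
    each query-key pair independently, so there is no sequence-level
    reduction. Shapes: q, k, v are (batch, heads, seq, head_dim)."""
    n = k.shape[-2]
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    weights = torch.sigmoid(scores - math.log(n))  # element-wise, no row sum
    return weights @ v


q = k = v = torch.randn(1, 8, 16, 64)
out = sigmoid_attention(q, k, v)
print(out.shape)  # torch.Size([1, 8, 16, 64])
```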