study ways to get rid of more modules in transformers

Nov 20, 2024 · 1 min read

bounded activations to remove normalization layers (first sketch below)
get rid of softmax and sequence-level compute everywhere (e.g., SigLIP's sigmoid loss, sigmoid attention; second sketch below)
don't normalize
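one concrete reading of the first note, as a rough PyTorch sketch: put a learnable elementwise tanh where the pre-norm LayerNorm would sit, so activations stay bounded without computing any per-token statistics. `BoundedAct` and `NormFreeBlock` are made-up names for illustration, not from any particular paper.

```python
import torch
import torch.nn as nn

class BoundedAct(nn.Module):
    """learnable elementwise tanh used where a LayerNorm would normally sit;
    the output is bounded per element, with no mean/variance statistics."""
    def __init__(self, dim: int, init_scale: float = 0.5):
        super().__init__()
        self.scale = nn.Parameter(torch.full((dim,), init_scale))  # per-channel input scale
        self.gamma = nn.Parameter(torch.ones(dim))                 # per-channel output gain
        self.beta = nn.Parameter(torch.zeros(dim))                 # per-channel output shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.gamma * torch.tanh(self.scale * x) + self.beta


class NormFreeBlock(nn.Module):
    """pre-norm-style transformer block with BoundedAct instead of nn.LayerNorm."""
    def __init__(self, dim: int, heads: int):
        super().__init__()
        self.act1 = BoundedAct(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.act2 = BoundedAct(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.act1(x)                                  # bounded activation replaces LayerNorm
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        return x + self.mlp(self.act2(x))
```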
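and a sketch of the sigmoid-attention part of the second note: score each query-key pair with an elementwise sigmoid instead of softmax, so there's no row-wise max/sum over the sequence (same spirit as SigLIP replacing the batch-level softmax in the contrastive loss with independent per-pair sigmoids). the `-log(n)` bias is the stabilizer I remember from the sigmoid-attention work; treat the exact form here as approximate.

```python
import math
import torch

def sigmoid_attention(q, k, v, causal=True):
    """q, k, v: (batch, heads, seq, head_dim). elementwise sigmoid scores;
    no softmax, so no row-wise reduction over the sequence is needed."""
    n, d = q.shape[-2], q.shape[-1]
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)          # (batch, heads, seq, seq)
    if causal:
        mask = torch.triu(torch.ones(n, n, dtype=torch.bool, device=q.device), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))     # sigmoid(-inf) = 0
    # each weight is independent of the rest of its row; the -log(n) bias keeps
    # the summed output magnitude roughly in check at init
    return torch.sigmoid(scores - math.log(n)) @ v

q = k = v = torch.randn(2, 4, 16, 32)   # batch=2, heads=4, seq=16, head_dim=32
out = sigmoid_attention(q, k, v)        # -> (2, 4, 16, 32)
```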