study ways to get rid of more modules in transformers

bounded activations to remove normalization layers
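
a minimal sketch of the idea, in the spirit of Dynamic Tanh (DyT) from "Transformers without Normalization": a bounded elementwise activation with a learnable slope replaces LayerNorm, so no per-token mean/variance statistics are computed. the init value, shapes, and names here are illustrative assumptions, not a fixed recipe.

```python
import torch
import torch.nn as nn

class DyT(nn.Module):
    """Bounded activation tanh(alpha * x) with learnable scale/shift,
    used as a drop-in replacement for LayerNorm (no reductions over features)."""
    def __init__(self, dim: int, alpha_init: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((1,), alpha_init))  # global learnable slope
        self.gamma = nn.Parameter(torch.ones(dim))                # per-channel scale
        self.beta = nn.Parameter(torch.zeros(dim))                # per-channel shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # tanh bounds activations to (-1, 1), so no normalization statistics are needed
        return self.gamma * torch.tanh(self.alpha * x) + self.beta

# usage: swap nn.LayerNorm(d_model) for DyT(d_model) inside a transformer block
norm = DyT(dim=768)
x = torch.randn(2, 16, 768)   # (batch, seq, dim)
y = norm(x)
```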

get rid of softmax and other sequence-level compute everywhere (e.g. SigLIP's pairwise sigmoid loss, sigmoid attention)
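
a minimal sketch of sigmoid attention: the row-wise softmax is replaced by an elementwise sigmoid with a -log(n) bias (as explored in Apple's sigmoid self-attention work), so attention weights need no sum over the sequence dimension. the shapes and the exact bias choice are illustrative assumptions.

```python
import math
import torch

def sigmoid_attention(q, k, v):
    # q, k, v: (batch, heads, seq, head_dim)
    n = q.shape[-2]
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    # elementwise sigmoid with bias b = -log(n) roughly calibrates the total
    # attention mass per query without computing a row-wise normalizer
    weights = torch.sigmoid(scores - math.log(n))
    return weights @ v

q = torch.randn(2, 8, 128, 64)
k = torch.randn(2, 8, 128, 64)
v = torch.randn(2, 8, 128, 64)
out = sigmoid_attention(q, k, v)   # same shape as v
```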

don’t normalize