Remove all the things
Study ways to remove more modules from transformers:
- bounded activations instead of normalization layers (see the sketches right after this list)
- get rid of softmax and sequence-level compute everywhere (e.g. SigLIP and sigmoid attention; sketch below)
- don't normalize at all
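A minimal sketch of the "bounded activations instead of normalization" idea, under the simplest reading: swap each LayerNorm for an elementwise, learnable tanh so activations stay bounded without computing any per-token statistics. The `DynamicTanh` name and the `alpha`/`gamma`/`beta` parameters are illustrative assumptions, not a recipe taken from the papers listed below.

```python
import torch
import torch.nn as nn

class DynamicTanh(nn.Module):
    """Drop-in replacement for LayerNorm: bound activations with an
    elementwise tanh instead of normalizing with per-token statistics."""

    def __init__(self, dim: int, init_alpha: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((1,), init_alpha))  # learnable input scale
        self.gamma = nn.Parameter(torch.ones(dim))                # learnable output scale
        self.beta = nn.Parameter(torch.zeros(dim))                # learnable output shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # tanh keeps outputs in (-1, 1), so downstream layers see bounded
        # activations without any mean/variance reduction over the token.
        return torch.tanh(self.alpha * x) * self.gamma + self.beta
```

In a block this would sit where `nn.LayerNorm(dim)` normally goes; whether training stays stable without real normalization is exactly the open question in the note.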
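And a sketch of the softmax-free direction: replace the row-wise softmax with an elementwise sigmoid so each query/key score is squashed independently, removing the sequence-level reduction. The `-log(n)` bias is the initialization reported for sigmoid attention; treat the exact scaling here as an assumption.

```python
import math
import torch
import torch.nn.functional as F

def softmax_attention(q, k, v):
    # Baseline: softmax couples every score in a row -- a sequence-level reduction.
    scores = (q @ k.transpose(-2, -1)) / math.sqrt(q.shape[-1])
    return F.softmax(scores, dim=-1) @ v

def sigmoid_attention(q, k, v):
    # Softmax-free variant: each score is squashed independently by a sigmoid,
    # so there is no normalization over the sequence dimension at all.
    n = k.shape[-2]
    scores = (q @ k.transpose(-2, -1)) / math.sqrt(q.shape[-1])
    # The -log(n) bias keeps initial output magnitudes comparable to softmax
    # attention (assumption borrowed from sigmoid-attention work).
    return torch.sigmoid(scores - math.log(n)) @ v
```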
- [2406.15786] What Matters in Transformers? Not All Attention is Needed
- [2407.15516] Attention Is All You Need But You Don’t Need All Of It For Inference of Large Language Models
- Skip-Attention: Improving Vision Transformers by Paying Less Attention