Remove all the things
Study ways to remove more modules from transformers:
- bounded activations instead of normalization layers (see the sketches right after this list)
- get rid of softmax and sequence-level compute everywhere (e.g. SigLIP and sigmoid attention; sketch below)
- don't normalize at all
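A minimal sketch of the "bounded activations instead of normalization" idea, under the simplest reading: swap each LayerNorm for an elementwise, learnable tanh so activations stay bounded without computing any per-token statistics. The `DynamicTanh` name and the `alpha`/`gamma`/`beta` parameters are illustrative assumptions, not a recipe taken from the papers listed below.

```python
import torch
import torch.nn as nn

class DynamicTanh(nn.Module):
    """Drop-in replacement for LayerNorm: bound activations with an
    elementwise tanh instead of normalizing with per-token statistics."""

    def __init__(self, dim: int, init_alpha: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((1,), init_alpha))  # learnable input scale
        self.gamma = nn.Parameter(torch.ones(dim))                # learnable output scale
        self.beta = nn.Parameter(torch.zeros(dim))                # learnable output shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # tanh keeps outputs in (-1, 1), so downstream layers see bounded
        # activations without any mean/variance reduction over the token.
        return torch.tanh(self.alpha * x) * self.gamma + self.beta
```

In a block this would sit where `nn.LayerNorm(dim)` normally goes; whether training stays stable without real normalization is exactly the open question in the note.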
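And a sketch of the softmax-free direction: replace the row-wise softmax with an elementwise sigmoid so each query/key score is squashed independently, removing the sequence-level reduction. The `-log(n)` bias is the initialization reported for sigmoid attention; treat the exact scaling here as an assumption.

```python
import math
import torch
import torch.nn.functional as F

def softmax_attention(q, k, v):
    # Baseline: softmax couples every score in a row -- a sequence-level reduction.
    scores = (q @ k.transpose(-2, -1)) / math.sqrt(q.shape[-1])
    return F.softmax(scores, dim=-1) @ v

def sigmoid_attention(q, k, v):
    # Softmax-free variant: each score is squashed independently by a sigmoid,
    # so there is no normalization over the sequence dimension at all.
    n = k.shape[-2]
    scores = (q @ k.transpose(-2, -1)) / math.sqrt(q.shape[-1])
    # The -log(n) bias keeps initial output magnitudes comparable to softmax
    # attention (assumption borrowed from sigmoid-attention work).
    return torch.sigmoid(scores - math.log(n)) @ v
```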
- [2406.15786] What Matters in Transformers? Not All Attention is Needed
- [2407.15516] Attention Is All You Need But You Don’t Need All Of It For Inference of Large Language Models
- Skip-Attention: Improving Vision Transformers by Paying Less Attention