don’t normalize
Jan 21, 2025 · 1 min read
study ways to get rid of more modules in transformers:
bounded activations to remove normalization layers (first sketch below)
get rid of softmax and sequence-level compute everywhere (e.g. SigLIP and sigmoid attention; second sketch below)
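a minimal sketch of the bounded-activation idea, assuming PyTorch: a learnable tanh squashes activations into a fixed range, so the layer can sit where a LayerNorm would without computing per-token statistics. the module name, the tanh choice, and the init value are my illustrative assumptions (close in spirit to "dynamic tanh"-style replacements), not something this note pins down.

```python
import torch
import torch.nn as nn

class BoundedActivation(nn.Module):
    """Stand-in for LayerNorm: bound activations with a learnable tanh
    instead of normalizing by mean/variance statistics. Illustrative
    sketch; names and init are assumptions, not a spec."""

    def __init__(self, dim: int, init_alpha: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((1,), init_alpha))  # input scale
        self.gamma = nn.Parameter(torch.ones(dim))                # per-channel gain
        self.beta = nn.Parameter(torch.zeros(dim))                # per-channel shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # tanh bounds every activation to (-1, 1), so no reduction over
        # the hidden dimension (and no mean/variance) is needed.
        return self.gamma * torch.tanh(self.alpha * x) + self.beta


# usage: drop it where a LayerNorm would sit in a transformer block
x = torch.randn(2, 16, 512)       # (batch, sequence, hidden)
norm_free = BoundedActivation(512)
print(norm_free(x).shape)         # torch.Size([2, 16, 512])
```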
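and a minimal sketch of sigmoid attention, again assuming PyTorch: each query-key score goes through an element-wise sigmoid instead of a row-wise softmax, so there is no reduction over the sequence. the -log(n) bias follows the sigmoid-attention line of work (it keeps total attention mass near softmax's at init); the function name and shapes are assumptions for illustration. SigLIP makes the same move in the loss, scoring image-text pairs independently with a sigmoid rather than a softmax over the batch.

```python
import math
import torch

def sigmoid_attention(q, k, v):
    """Attention with an element-wise sigmoid in place of softmax.
    Softmax normalizes each row over the whole sequence; sigmoid scores
    each query-key pair independently, so there is no sequence-level
    reduction. Shapes: q, k, v are (batch, heads, seq, head_dim)."""
    n = k.shape[-2]
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    weights = torch.sigmoid(scores - math.log(n))  # element-wise, no row sum
    return weights @ v


q = k = v = torch.randn(1, 8, 16, 64)
out = sigmoid_attention(q, k, v)
print(out.shape)  # torch.Size([1, 8, 16, 64])
```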