michal.i/o

    2024-10-07

    Jan 21, 2025 · 1 min read

    study ways to get rid of more modules in transformers

    bounded activations to remove normalization layers
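
One way to read this note: a bounded activation such as tanh keeps every output inside [-1, 1] however large the pre-activations get, which covers part of the job a normalization layer does, and it needs no per-vector statistics. A minimal sketch in dependency-free Python; `layer_norm`, `bounded`, and the `alpha` scale are hypothetical names chosen for illustration, not something the note specifies.

```python
import math

def layer_norm(x, eps=1e-5):
    """LayerNorm without learned scale/shift: needs a mean and variance over the vector."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def bounded(x, alpha=0.5):
    """Elementwise tanh(alpha * v): output stays in [-1, 1], no statistics required.

    alpha is an assumed scale factor (hypothetical, for illustration only).
    """
    return [math.tanh(alpha * v) for v in x]

x = [0.1, -2.0, 30.0, 400.0]
print(layer_norm(x))  # rescaled using statistics of the whole vector
print(bounded(x))     # each element squashed independently
```

The trade-off this sketch makes visible: `bounded` is purely elementwise, but large inputs saturate near ±1, so whether it can replace normalization in practice is exactly the open question the note raises.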

    get rid of softmax and sequence-level compute everywhere (e.g. SigLIP and sigmoid attention)
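
To make "sequence-level compute" concrete: softmax attention weights need a max and a sum over every key for each query, while sigmoid weights depend only on their own score. A minimal sketch for a single query in dependency-free Python; the function names are hypothetical, and the default bias of `-log(n)` is an assumption borrowed from the sigmoid attention literature, not something stated in the note.

```python
import math

def softmax_weights(scores):
    """Weights for one query: requires a max and a sum across all keys."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid_weights(scores, bias=None):
    """Each weight depends only on its own score; no reduction over the scores."""
    if bias is None:
        bias = -math.log(len(scores))  # assumed default, depends only on length
    return [1.0 / (1.0 + math.exp(-(s + bias))) for s in scores]

def attend(weights, values):
    """Weighted sum of value vectors."""
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

scores = [2.0, 0.5, -1.0]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(attend(softmax_weights(scores), values))
print(attend(sigmoid_weights(scores), values))
```

Changing one score shifts every softmax weight but only the matching sigmoid weight, which is why the sigmoid variant avoids the row-wise reduction. The sigmoid weights do not sum to 1, so the output scale differs from softmax attention.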

    don’t normalize

    • [2406.15786] What Matters in Transformers? Not All Attention is Needed
    • [2407.15516] Attention Is All You Need But You Don’t Need All Of It For Inference of Large Language Models
    • Skip-Attention: Improving Vision Transformers by Paying Less Attention

