Recurrent Computation with Transformers by repeating layers
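A minimal sketch of what "repeating layers" could look like: one weight-tied block applied several times, so depth becomes a loop count rather than a stack of distinct layers. Everything here is an illustrative assumption, not a reference implementation: the names `make_block` and `recurrent_forward`, the random weights, and the block itself (single-head attention plus a residual, with no MLP or normalization).

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_block(dim, seed=0):
    """Build one toy attention block with fixed random weights (hypothetical)."""
    rng = random.Random(seed)

    def rand_matrix():
        return [[rng.gauss(0.0, dim ** -0.5) for _ in range(dim)] for _ in range(dim)]

    Wq, Wk, Wv = rand_matrix(), rand_matrix(), rand_matrix()

    def block(tokens):
        q = [matvec(Wq, t) for t in tokens]
        k = [matvec(Wk, t) for t in tokens]
        v = [matvec(Wv, t) for t in tokens]
        out = []
        for i, t in enumerate(tokens):
            # Attention weights of position i over all positions.
            scores = [sum(a * b for a, b in zip(q[i], kj)) / math.sqrt(dim) for kj in k]
            weights = softmax(scores)
            mixed = [sum(w * vj[d] for w, vj in zip(weights, v)) for d in range(dim)]
            # Residual connection: the block updates the state in place.
            out.append([a + b for a, b in zip(t, mixed)])
        return out

    return block


def recurrent_forward(block, tokens, n_steps):
    """Apply the SAME block n_steps times (shared weights across 'layers')."""
    for _ in range(n_steps):
        tokens = block(tokens)
    return tokens


if __name__ == "__main__":
    block = make_block(dim=4)
    x = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
    y = recurrent_forward(block, x, n_steps=8)
    print(len(y), len(y[0]))
```

Because the block maps a sequence to a sequence of the same shape, `n_steps` can be changed at inference time without adding parameters.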

Add a large set of register tokens so the model can write to positions that don’t correspond to any specific input token
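A rough sketch of the register idea, under loud assumptions: registers are extra positions appended to the sequence, initialized to zeros here (in practice they would more likely be learned embeddings), and `mean_mix` is a toy stand-in for an attention block. The names `mean_mix` and `run_with_registers` are hypothetical.

```python
def mean_mix(sequence):
    """Toy stand-in for attention: every position adds the mean of all positions."""
    dim = len(sequence[0])
    mean = [sum(vec[d] for vec in sequence) / len(sequence) for d in range(dim)]
    return [[x + m for x, m in zip(vec, mean)] for vec in sequence]


def run_with_registers(tokens, n_registers, n_steps, block=mean_mix):
    """Append register slots, run the block repeatedly, then split them off."""
    dim = len(tokens[0])
    # Assumption: registers start as zeros; they are scratch space, not inputs.
    registers = [[0.0] * dim for _ in range(n_registers)]
    sequence = [list(t) for t in tokens] + registers
    for _ in range(n_steps):
        sequence = block(sequence)
    n_tokens = len(tokens)
    # Token outputs and register contents are returned separately.
    return sequence[:n_tokens], sequence[n_tokens:]


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.0, 1.0]]
    out_tokens, out_registers = run_with_registers(tokens, n_registers=3, n_steps=2)
    print(len(out_tokens), len(out_registers))
```

The registers end up holding state that was written during the repeated steps but belongs to no input position, and they can be discarded or read out separately at the end.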


Using recurrence to achieve weak to strong generalization - YouTube