Spherical Latents + Few-Step Flow Refinement
Inspiration
There has been a lot of generative-modeling excitement in the last few weeks. One paper that caught my eye was the Sphere Encoder [1], a relatively simple way to give a basic autoencoder much more powerful generative capabilities. I have also recently been working a lot with a slightly older paper, Sample What You Can't Compress (SWYCC) [2], which uses a two-step process for generation: a small autoencoder produces an initial image, then a diffusion decoder refines it significantly.
Reasoning
Standard image autoencoders produce latents in unconstrained space, which makes generative modelling of the latent distribution harder. The sphere encoder constrains the latent to lie on a hypersphere by RMS-normalising the latent. As shown in [1], their system produces great-looking results with a latent space that is well-suited for sampling.
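The normalisation itself is tiny. Here is a numpy sketch (the scaling convention, unit mean-square so the sphere has radius sqrt(d), is my assumption, not taken from [1]):

```python
import numpy as np

def rms_normalize(z: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Project a latent onto the hypersphere of radius sqrt(d) by
    dividing out its RMS (root-mean-square) magnitude."""
    rms = np.sqrt(np.mean(z ** 2) + eps)
    return z / rms

# Any latent, regardless of scale, lands on the same sphere.
z = np.random.randn(256) * 5.0
z_sphere = rms_normalize(z)
print(np.allclose(np.mean(z_sphere ** 2), 1.0, atol=1e-4))  # → True
```

Because every latent ends up on the same sphere, the sampler never has to guess the scale of the latent distribution, only its direction.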
A criticism I have seen of their approach is that the model is HUGE: for ImageNet it needs ~1B parameters to achieve its results. That said, I think the basic idea is strong.
My idea is to drastically reduce the parameter count: train a small spherical AE, then apply techniques similar to SWYCC [2], with a flow model that refines the images in a few steps to very high quality.
Generation is a two-stage process:
- AE decode: a random point on the sphere is decoded directly to a rough image
- Flow refinement: a DiT refines that image, conditioned on the sphere latent
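The two stages above can be sketched end-to-end. The decoder and refiner below are dummy placeholders of mine, not the real networks, but the control flow matches the recipe (sampling uniformly on the sphere is just a normalised Gaussian draw):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_sphere_latent(dim: int) -> np.ndarray:
    """Uniform point on the hypersphere: Gaussian draw, RMS-normalised."""
    z = rng.standard_normal(dim)
    return z / np.sqrt(np.mean(z ** 2))

# Placeholder stand-ins for the trained networks (assumptions, not the
# real models): the AE decoder maps latent -> rough image, and the flow
# model refines that image, conditioned on the same latent.
def ae_decode(z: np.ndarray) -> np.ndarray:
    return np.tanh(z[:64])                # dummy "rough image"

def flow_refine(x_rough: np.ndarray, z: np.ndarray, steps: int = 4) -> np.ndarray:
    x = x_rough.copy()
    for _ in range(steps):                # few-step Euler integration, stubbed
        x = x + 0.0 * z[:64]              # a real DiT would predict a velocity here
    return x

z = sample_sphere_latent(256)             # random sphere latent
x_rough = ae_decode(z)                    # stage 1: direct AE decode
x_final = flow_refine(x_rough, z)         # stage 2: latent-conditioned refinement
```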
My Twist
Relative to SWYCC I have switched things up a bit. Their refinement model still follows the usual coupling you would see in any diffusion or flow-matching (FM) model: pure noise on one end, the target image on the other.
I define the following:
- x̂ — the rough image generated by the decoder
- x — the target image
- σ — controls how noisy the starting image is
I jumpstart diffusion by replacing the source distribution: instead of coupling from pure noise ε to the image, the coupling is (x̂ + σ·ε, x). This is an unusual idea AFAIK, but in my experience it significantly speeds up training: the model has far more image structure to work with right away, essentially skipping the early stages of reconstruction. The process still works very well few-step, so I am hopeful it can achieve similar results with fewer parameters and less training compute.
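A minimal numpy sketch of how a training pair would be built under this coupling (the function name, the linear interpolation path, the velocity target, and the uniform draw of t are my assumptions about the setup, not taken from the post's code):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_training_pair(x_hat: np.ndarray, x1: np.ndarray, sigma: float = 0.5):
    """Flow-matching pair with the modified coupling.

    Standard FM couples pure noise eps to the image x1. Here the source
    is the decoder output plus a little noise, x0 = x_hat + sigma * eps,
    so the flow only has to close the (much smaller) gap between the
    rough decode and the target.
    """
    eps = rng.standard_normal(x_hat.shape)
    x0 = x_hat + sigma * eps              # jump-started source sample
    t = rng.uniform()                     # random time in [0, 1]
    x_t = (1.0 - t) * x0 + t * x1         # linear interpolation path
    v_target = x1 - x0                    # velocity target for the refiner
    return x_t, t, v_target

x1 = rng.standard_normal(64)                  # target image (flattened)
x_hat = x1 + 0.1 * rng.standard_normal(64)    # rough AE reconstruction
x_t, t, v = make_training_pair(x_hat, x1)
```

With σ = 0 the source collapses to the decoder output itself; a little noise keeps the source distribution from being degenerate while preserving most of the image structure.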
I am also using an architecture based on JiT (Just Image Transformers) [3].
Results
I had this idea this morning, so results are still very much in progress. I am also very compute-limited, so if you have any resources I could use, please reach out.
You can, however, check out the code on GitHub: flow-sphere
Sources
[1] Image Generation with a Sphere Encoder (Yue et al., 2026)
[2] Sample What You Can't Compress (Birodkar et al., 2025)
[3] Back to Basics: Let Denoising Generative Models Denoise (Li et al., 2025)