0.22B Beats 11.9B: HUST and VIVO Create an Image Editing Model with Parameters Far Less Than FLUX

Moebius, Image Inpainting, AI Photo Editing, Open-Source Model, ECCV 2026

Published 2026-07-02 10:40:33

There's an unwritten rule in the AI photo editing world: the larger the model, the better the results.

FLUX.1-Fill-Dev has 11.9B parameters, SD3.5 Large exceeds 10B, and a single run of a model this size can tie up an entire A100. The industry's default assumption: if you want good results, stack compute first.

Then HUST + VIVO AI Lab unveiled Moebius.

0.22B parameters. 226 million.

Less than 2% of FLUX's parameter count, roughly 15x faster inference, 26ms per step on a single GPU. Across 6 standard benchmarks, it rivals FLUX.1-Fill-Dev, and even surpasses it in facial details and complex textures.

The project has been accepted by ECCV 2026, with code and weights fully open-sourced under Apache-2.0. It ranks #1 on Hugging Face's daily leaderboard and #4 on the weekly leaderboard.

Two Numbers, One Counter-Intuitive Truth

Let's look at the hard metrics first.

Parameter comparison:

FLUX.1-Fill-Dev: 11.9B
SD3.5 Large-Inpainting: 10B+
Moebius: 0.22B

Inference speed comparison:

FLUX: ~390ms per step
Moebius: 26ms per step
Speedup: about 15× (390 / 26)
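
As a quick sanity check, the two headline ratios can be recomputed from the figures listed above. This is only arithmetic on the article's own numbers (FLUX's step time is approximate); the function names are illustrative.

```python
# Figures quoted in the article; FLUX's per-step time is approximate.
FLUX_PARAMS_B = 11.9
MOEBIUS_PARAMS_B = 0.22
FLUX_MS_PER_STEP = 390.0
MOEBIUS_MS_PER_STEP = 26.0


def ratio_percent(small: float, large: float) -> float:
    """Return `small` as a percentage of `large`."""
    return 100.0 * small / large


def speedup(slow_ms: float, fast_ms: float) -> float:
    """Return how many times faster the fast model is per step."""
    return slow_ms / fast_ms


param_share = ratio_percent(MOEBIUS_PARAMS_B, FLUX_PARAMS_B)
step_speedup = speedup(FLUX_MS_PER_STEP, MOEBIUS_MS_PER_STEP)

print(f"Moebius has {param_share:.2f}% of FLUX's parameters")  # under 2%
print(f"Per-step speedup: {step_speedup:.1f}x")
```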

Quality comparison (6 benchmarks):

Across six datasets, including CelebA-HQ, FFHQ, and Places2, Moebius's PSNR, SSIM, and LPIPS scores are roughly on par with FLUX.1-Fill-Dev, and slightly better on portraits and complex textures.
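
For readers unfamiliar with these metrics, PSNR is the simplest of the three: it is a log-scaled mean squared error between the restored image and the ground truth, with higher values meaning a closer match. Below is a minimal sketch using the standard definition on a toy list of 8-bit pixel values; the pixel data is made up for illustration and is not from the benchmarks.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """PSNR in dB between two equally sized flat lists of pixel values."""
    if len(reference) != len(restored) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy 8-pixel "images" (hypothetical values).
reference = [52, 55, 61, 59, 79, 61, 76, 61]
restored = [54, 55, 60, 59, 77, 63, 76, 60]

score = psnr(reference, restored)
print(f"PSNR: {score:.2f} dB")
```

SSIM and LPIPS are more involved (structural and learned perceptual similarity, respectively) and are usually taken from an existing library rather than written by hand.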

This is not magic — it's a combination of architectural innovation and knowledge distillation.

How Was It Achieved?

First Punch: LλMI Attention Module

Standard self-attention scales quadratically: its computation grows with the square of the sequence length, so as image resolution increases, computation explodes. This is one of the fundamental reasons large models have to be large.
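
To make the quadratic growth concrete, the sketch below counts how many query-key pairs full self-attention has to score for one image. The 16-pixel patch size is an assumption chosen for illustration, not a detail taken from Moebius or FLUX.

```python
def attention_pairs(height: int, width: int, patch: int = 16) -> int:
    """Number of query-key pairs full self-attention scores for one image.

    `patch` is a hypothetical patch size; each patch becomes one token.
    """
    tokens = (height // patch) * (width // patch)
    return tokens * tokens


small = attention_pairs(512, 512)     # 32 * 32 = 1024 tokens
large = attention_pairs(1024, 1024)   # 64 * 64 = 4096 tokens

print(small, large, large // small)
```

Doubling each side of the image quadruples the token count and multiplies the number of pairs by 16.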

Moebius introduces the LλMI (Learnable λ-Matrix Integration) module, which compresses spatial context and global semantics into a fixed-size matrix. Because that matrix does not grow with the input image, the cost of attending to it stays constant regardless of image size.

This bypasses the quadratic computational overhead while retaining sufficient context information. This is the key to squeezing parameters down to 0.22B without sacrificing effectiveness.
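
The general idea of folding context into a fixed-size matrix can be sketched with generic kernelized linear attention. To be clear, this is not the LλMI module, whose internals are not described here; it is a well-known stand-in technique, and every name below is illustrative. All keys and values are summed into a d_k × d_v matrix `S` and a normalizer `z`, and each query then reads from that summary instead of from every token.

```python
import math


def feature_map(x):
    """elu(x) + 1: a positive feature map commonly used in linear attention."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def summarize(keys, values):
    """Fold all tokens into a d_k x d_v matrix S and a d_k vector z."""
    d_k, d_v = len(keys[0]), len(values[0])
    S = [[0.0] * d_v for _ in range(d_k)]
    z = [0.0] * d_k
    for k, v in zip(keys, values):
        fk = feature_map(k)
        for a in range(d_k):
            z[a] += fk[a]
            for b in range(d_v):
                S[a][b] += fk[a] * v[b]
    return S, z


def attend(query, S, z):
    """Read from the fixed-size summary; cost does not depend on token count."""
    fq = feature_map(query)
    norm = sum(fq[a] * z[a] for a in range(len(z)))
    return [
        sum(fq[a] * S[a][b] for a in range(len(z))) / norm
        for b in range(len(S[0]))
    ]


# 4 tokens versus 64 tokens: the summary has the same shape either way.
keys_small = [[0.1 * i, -0.2 * i] for i in range(4)]
vals_small = [[1.0, 2.0, 3.0] for _ in range(4)]
keys_large = [[0.1 * i, -0.2 * i] for i in range(64)]
vals_large = [[1.0, 2.0, 3.0] for _ in range(64)]

S_small, z_small = summarize(keys_small, vals_small)
S_large, z_large = summarize(keys_large, vals_large)
out = attend([0.5, -0.3], S_large, z_large)
```

In this sketch, building the summary is one pass over the tokens, and each query afterwards costs the same however many tokens were folded in.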

Second Punch: Knowledge Distillation

Moebius was not trained from scratch. It uses PixelHacker (earlier work by the same team, with a much larger parameter count) as the teacher model and transfers its capability through multi-granularity knowledge distillation.

Distillation is not simply copying outputs. Moebius uses a layered distillation strategy that distills at the feature, logit, and pixel levels simultaneously, so that the lightweight model doesn't just "learn the form" but captures the semantic structure of the image.

The result is easy to see: on portrait tasks, the student model surpasses the teacher on some metrics.
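
A minimal sketch of what a three-level distillation objective can look like is given below. The specific loss choices (MSE on features, temperature-softened KL on logits, L1 on pixels), the weights, and the dictionary layout are all assumptions for illustration; they are not taken from the Moebius code.

```python
import math


def mse(a, b):
    """Mean squared error between two flat lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def l1(a, b):
    """Mean absolute error between two flat lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(student, teacher, w_feat=1.0, w_logit=1.0,
                      w_pixel=1.0, temperature=2.0):
    """Weighted sum of feature-, logit-, and pixel-level mismatches.

    The weights and temperature are hypothetical hyperparameters.
    """
    feat = mse(student["features"], teacher["features"])
    logit = kl_divergence(softmax(teacher["logits"], temperature),
                          softmax(student["logits"], temperature))
    pixel = l1(student["pixels"], teacher["pixels"])
    return w_feat * feat + w_logit * logit + w_pixel * pixel


# Made-up teacher and student outputs for a single sample.
teacher = {"features": [0.2, 0.8, -0.1], "logits": [2.0, 0.5, -1.0],
           "pixels": [0.10, 0.40, 0.90]}
student = {"features": [0.1, 0.7, 0.0], "logits": [1.5, 0.7, -0.8],
           "pixels": [0.12, 0.35, 0.88]}

loss = distillation_loss(student, teacher)
print(f"distillation loss: {loss:.4f}")
```

The loss is zero when the student matches the teacher at all three levels and grows as any one of them drifts.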

What does this mean? A specialized expert model for a specific task doesn't need the parameter redundancy of a large general-purpose model.

How Does It Perform in Practice?

Numbers aside, let's look at the images.

Natural scene inpainting (Places2):

Traditional methods often fail on complex textures — discontinuous sky colors, broken grass textures, distorted building structures. FLUX and SD3.5 are generally good but still show color differences and artifacts in local details.

Moebius holds up consistently in these scenes, with natural boundary transitions and visibly better color consistency.

Portrait inpainting (CelebA-HQ / FFHQ):

This is where Moebius is strongest. When the mask covers key facial areas (eyes, nose, mouth), most methods either generate blurry facial features or make severe semantic errors: misaligned eyes, mismatched skin tones, or even implausible facial structures.

Moebius generates clear facial details, plausible feature positions, and skin tones that blend naturally with the surrounding area. In some cases, its face restorations look more natural than those of FLUX.1-Fill-Dev.

Simon Willison Has Already Ported It to the Browser

Less than a week after Moebius's open-source release, Simon Willison (Django co-creator) used Claude Code to port it to the browser. The port is built on ONNX Runtime Web and WebGPU and needs no backend: open the webpage and start using it.

Upload an image, paint over the area to remove, click "Run inpaint", wait a few seconds, and the editing is done.

This means that 0.22B parameters is not just "academically lightweight" — it's truly lightweight enough to run on consumer devices.

Why This Matters More Than the Model Itself

Moebius's core argument is not "we built a small model."

Its argument is: Stacking parameters is not the only way.

Over the past two years, the AI industry's faith in scaling laws has reached almost religious levels. Bigger models, more data, more expensive compute: this path has indeed produced milestones like GPT-4 and FLUX, but it has also had a side effect. The barrier to entry for innovation has become prohibitively high.

When only large companies with thousands of GPUs can compete, the field is no longer open. It's an oligopoly.

Moebius proves the feasibility of an alternative path:

Architectural innovation can replace brute-force scaling
Knowledge distillation allows small models to inherit large model capabilities
Task specialization can eliminate the parameter redundancy of general models

This is not to say large models are useless; general capabilities still require scale. But on specific tasks, a carefully designed small model can match or even beat a much larger general-purpose model.

Project page: github.com/hustvl/Moebius
Online demo: huggingface.co/spaces/multimodalart/Moebius
Browser version: simonw.github.io/moebius-web
