Mobile-VTON: High-Fidelity On-Device Virtual Try-On

CVPR 2026
1University of Sydney   2MBZUAI   3University of Melbourne   4Google
*Equal Contribution   Corresponding Author
Mobile-VTON Teaser

Comparison of virtual try-on methods in terms of model size, mobile compatibility, and visual quality. Our model, with only 415M parameters, achieves competitive visual results while running entirely on mobile devices. The leftmost column shows the input person and garment images. All outputs are generated or super-resolved at 1024 x 768 resolution.

Abstract

Virtual try-on (VTON) has recently achieved impressive visual fidelity, but most existing systems require uploading personal photos to cloud-based GPUs, raising privacy concerns and limiting on-device deployment. To address this, we present Mobile-VTON, a high-quality, privacy-preserving framework that enables fully offline virtual try-on on commodity mobile devices using only a single user image and a garment image. Mobile-VTON introduces a modular TeacherNet-GarmentNet-TryonNet (TGT) architecture that integrates knowledge distillation, garment-conditioned generation, and garment alignment into a unified pipeline optimized for on-device efficiency. Within this framework, we propose a Feature-Guided Adversarial (FGA) Distillation strategy that combines teacher supervision with adversarial learning to better match real-world image distributions. GarmentNet is trained with a trajectory-consistency loss to preserve garment semantics across diffusion steps, while TryonNet uses latent concatenation and lightweight cross-modal conditioning to enable robust garment-to-person alignment without large-scale pretraining. By combining these components, Mobile-VTON achieves high-fidelity generation with low computational overhead. Experiments on VITON-HD and DressCode at 1024 x 768 show that it matches or outperforms strong server-based baselines while running entirely offline. These results demonstrate that high-quality VTON is not only feasible but also practical on-device, offering a secure solution for real-world applications.
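To make the trajectory-consistency idea concrete, here is a minimal pure-Python sketch. It assumes the loss penalizes drift of garment features between adjacent diffusion steps via a mean-squared penalty; the function name `trajectory_consistency_loss`, the flat-list feature representation, and the exact pairwise-MSE form are illustrative assumptions, not the paper's actual formulation.

```python
def trajectory_consistency_loss(garment_feats_per_step):
    """Illustrative sketch: penalize change in garment features across
    adjacent diffusion steps, so garment semantics stay stable.

    garment_feats_per_step: list of flat feature vectors (lists of floats),
    one per diffusion step, all the same length. (Hypothetical layout.)
    """
    total, pairs = 0.0, 0
    for prev, curr in zip(garment_feats_per_step, garment_feats_per_step[1:]):
        # Mean squared difference between consecutive steps' features.
        total += sum((p - c) ** 2 for p, c in zip(prev, curr)) / len(prev)
        pairs += 1
    return total / pairs
```

Under this sketch, identical features at every step give a loss of zero, and the loss grows as garment features drift between steps.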

Mobile-VTON Framework

Training Overview of Mobile-VTON: The left side illustrates our Feature-Guided Adversarial (FGA) Distillation process, where a high-capacity TeacherNet supervises two lightweight student networks. The right side depicts the main training pipeline, where TryonNet and GarmentNet are jointly optimized with garment-aware supervision.
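The FGA Distillation objective described above combines two signals: teacher supervision on intermediate features and an adversarial term pushing student outputs toward the real-image distribution. The sketch below is a plain-Python illustration of one plausible form of that combination; the function name, the per-layer MSE feature matching, the non-saturating GAN term, and the weights `lambda_feat`/`lambda_adv` are all assumptions for illustration, not the paper's exact losses.

```python
import math

def mse(a, b):
    """Mean squared error between two flat feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def fga_distill_loss(student_feats, teacher_feats, disc_logit,
                     lambda_feat=1.0, lambda_adv=0.1):
    """Illustrative FGA-style student loss (hypothetical form):
    feature matching against the teacher plus a non-saturating
    adversarial term from a discriminator logit on the student output.
    """
    # Average feature-matching MSE over corresponding layers.
    feat_loss = sum(mse(s, t) for s, t in zip(student_feats, teacher_feats)) \
        / len(student_feats)
    # Non-saturating generator loss: -log sigmoid(discriminator logit).
    adv_loss = -math.log(1.0 / (1.0 + math.exp(-disc_logit)))
    return lambda_feat * feat_loss + lambda_adv * adv_loss
```

With matched features and a confident discriminator logit the loss is near zero; mismatched features or a low logit raise it, which is the intended trade-off between teacher fidelity and realism.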

On-Device Inference Demo

Real-time mobile inference demo of Mobile-VTON running entirely offline.

Mobile-VTON Results

Qualitative virtual try-on results on the VITON-HD In-the-Wild test set. Mobile-VTON generates high-fidelity results with precise garment alignment and skin texture preservation.

Mobile-VTON Results

Qualitative virtual try-on results on the DressCode test set. Mobile-VTON generates high-fidelity results with precise garment alignment and skin texture preservation.

Mobile-VTON Results

Qualitative virtual try-on results on the VITON-HD test set. Mobile-VTON generates high-fidelity results with precise garment alignment and skin texture preservation.

BibTeX

@misc{wan2026mobilevton,
      title={\textsc{Mobile-VTON}: High-Fidelity On-Device Virtual Try-On}, 
      author={Zhenchen Wan and Ce Chen and Runqi Lin and Jiaxin Huang and Tianxi Chen and Yanwu Xu and Tongliang Liu and Mingming Gong},
      year={2026},
      eprint={2603.00947},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2603.00947}, 
}