Virtual try-on (VTON) has recently achieved impressive visual fidelity, but most existing systems require uploading personal photos to cloud-based GPUs, raising privacy concerns and limiting on-device deployment. To address this, we present Mobile-VTON, a high-quality, privacy-preserving framework that enables fully offline virtual try-on on commodity mobile devices using only a single user image and a garment image. Mobile-VTON introduces a modular TeacherNet-GarmentNet-TryonNet (TGT) architecture that integrates knowledge distillation, garment-conditioned generation, and garment alignment into a unified pipeline optimized for on-device efficiency. Within this framework, we propose a Feature-Guided Adversarial (FGA) Distillation strategy that combines teacher supervision with adversarial learning to better match real-world image distributions. GarmentNet is trained with a trajectory-consistency loss to preserve garment semantics across diffusion steps, while TryonNet uses latent concatenation and lightweight cross-modal conditioning to enable robust garment-to-person alignment without large-scale pretraining. By combining these components, Mobile-VTON achieves high-fidelity generation with low computational overhead. Experiments on VITON-HD and DressCode at 1024 x 768 show that it matches or outperforms strong server-based baselines while running entirely offline. These results demonstrate that high-quality VTON is not only feasible but also practical on-device, offering a secure solution for real-world applications.
Real-time mobile inference demo of Mobile-VTON running entirely offline.
@misc{wan2026textscmobilevtonhighfidelityondevicevirtual,
title={\textsc{Mobile-VTON}: High-Fidelity On-Device Virtual Try-On},
author={Zhenchen Wan and Ce Chen and Runqi Lin and Jiaxin Huang and Tianxi Chen and Yanwu Xu and Tongliang Liu and Mingming Gong},
year={2026},
eprint={2603.00947},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2603.00947},
}