MF-VITON: High-Fidelity Mask-Free Virtual Try-On with Minimal Input

1University of Melbourne, Australia
2University of Sydney, Australia
3Mohamed bin Zayed University of Artificial Intelligence, UAE
MF-VITON Teaser

We propose a Mask-Free Virtual Try-On framework that achieves SOTA visual quality by eliminating artifacts caused by inaccurate masks. (a) Eliminating interference from inaccurate masks: inaccurate masks cause over-masking, which leads to unnatural regeneration of hair or hands, and mask leakage, which leaves artifacts such as remnants of the original clothing. (b) Demonstration on VITON-HD In-the-Wild.

Abstract

Recent advancements in Virtual Try-On (VITON) have significantly improved image realism and garment detail preservation, driven by powerful text-to-image (T2I) diffusion models. However, existing methods often rely on user-provided masks, introducing complexity and performance degradation due to imperfect inputs, as shown in the teaser figure (a) above. To address this, we propose a Mask-Free VITON (MF-VITON) framework that achieves realistic VITON using only a single person image and a target garment, eliminating the requirement for auxiliary masks. Our approach introduces a novel two-stage pipeline: (1) We leverage existing Mask-based VITON models to synthesize a high-quality dataset. This dataset contains diverse, realistic pairs of person images and corresponding garments, augmented with varied backgrounds to mimic real-world scenarios. (2) The pre-trained Mask-based model is fine-tuned on the generated dataset, enabling garment transfer without mask dependencies. This stage simplifies the input requirements while preserving garment texture and shape fidelity. Our framework achieves state-of-the-art (SOTA) performance in garment transfer accuracy and visual realism. Notably, the proposed Mask-Free model significantly outperforms existing Mask-based approaches, setting a new benchmark for the task.
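The two-stage recipe above can be sketched schematically. This is a minimal toy illustration, not the authors' implementation: `mask_based_tryon`, `synthesize_dataset`, and `fine_tune_mask_free` are hypothetical stand-ins, and "images" are simple pixel lists.

```python
# Hedged sketch of the two-stage MF-VITON training recipe.
# All names and data shapes are illustrative placeholders.

def mask_based_tryon(person, garment, mask):
    """Stand-in for a pre-trained Mask-based VITON model:
    pastes garment pixels into the masked region of the person image."""
    return [g if m else p for p, g, m in zip(person, garment, mask)]

def synthesize_dataset(people, garments, masks):
    """Stage 1: use the Mask-based model to build
    (person, garment, try-on result) triplets for mask-free training."""
    return [(p, g, mask_based_tryon(p, g, m))
            for p, g, m in zip(people, garments, masks)]

def fine_tune_mask_free(dataset):
    """Stage 2 (schematic): fine-tune on the synthesized pairs so the model
    learns garment transfer from (person, garment) alone. Here we just
    count training examples to keep the sketch runnable."""
    return len(dataset)

# Toy 1-D "images": lists of pixel values plus a boolean garment mask.
people = [[0, 0, 0, 0], [1, 1, 1, 1]]
garments = [[9, 9, 9, 9], [7, 7, 7, 7]]
masks = [[False, True, True, False]] * 2

dataset = synthesize_dataset(people, garments, masks)
print(dataset[0][2])               # garment pixels inside the masked region
print(fine_tune_mask_free(dataset))
```

The key point the sketch captures is that masks are consumed only in Stage 1, at dataset-generation time; the fine-tuned model never sees one.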

MF-VITON Teaser

Overview of MF-VITON: We propose a combined Mask-based & Mask-Free VITON pipeline that enables seamless adaptation from Mask-based VITON to MF-VITON. The pipeline comprises two branches:

(a) the Garment Extractor, which leverages ReferenceNet to encode fine-grained garment features ℱ(Xg) and employs an Adapter [Ye et al., 2023] to extract high-level semantics from garment images Xg using a pretrained image encoder;

(b) the Denoising Network, which uses TryonNet as the primary denoising branch: it processes the noised latent Xt concatenated with either the Mask-based conditions (ℱ(XMasked-Con) and the Mask-based text prompt) or the Mask-Free conditions (ℱ(XUnmasked-Con) and the Mask-Free text prompt).
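The two-branch switch in (b) can be illustrated with a toy function. This is only a sketch of the input-assembly logic: in the actual model the conditions are latent feature maps concatenated channel-wise, not Python lists, and `denoising_inputs` is a hypothetical name.

```python
# Toy sketch of how the Denoising Network selects its conditioning inputs:
# the noised latent is concatenated with either the Mask-based or the
# Mask-Free condition, paired with the matching text prompt.

def denoising_inputs(noised_latent, condition, mode):
    prompts = {
        "mask-based": "Mask-based text prompt",
        "mask-free": "Mask-Free text prompt",
    }
    if mode not in prompts:
        raise ValueError(f"unknown mode: {mode}")
    # List concatenation stands in for channel-wise latent concatenation.
    return noised_latent + condition, prompts[mode]

x_t = [0.1, 0.2]        # stand-in for the noised latent Xt
masked_con = [0.9]      # stand-in for F(X_Masked-Con)
inputs, prompt = denoising_inputs(x_t, masked_con, "mask-based")
print(inputs, prompt)
```

Because both branches feed the same TryonNet, switching from Mask-based to Mask-Free training amounts to swapping the condition and prompt, which is what makes the fine-tuning stage lightweight.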

dataset_generation Teaser

Overview of MF-VITON Dataset Generation:

(a) In-the-Wild Mask-Free Dataset Generation: Uses FLUX.1-Fill-dev to generate realistic background-filled model images Xbg, which are then composited with the Mask-based background bg to create Mask-Free dataset samples XUnmasked-Con-bg.
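The compositing step in (a) is essentially a hard foreground/background merge. Below is a minimal sketch under that assumption; `composite` is a hypothetical helper, and pixel lists with a boolean foreground mask stand in for real images.

```python
# Hard composite: keep the model's foreground pixels where fg_mask is True,
# otherwise take the corresponding pixel from the Mask-based background bg.

def composite(filled_model, bg, fg_mask):
    return [f if keep else b
            for f, b, keep in zip(filled_model, bg, fg_mask)]

x_unmasked_con_bg = composite([3, 3, 3], [8, 8, 8], [False, True, False])
print(x_unmasked_con_bg)  # → [8, 3, 8]
```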

(b) Mask-Free Dataset Generation: Concatenates the noised latent encoding ℱ(Xmodel) with the Mask-based conditions ℱ(XMasked-Con). The Mask-based VITON model then synthesizes garment-swapped images XUnmasked-Con.

BibTeX

@article{wan2025mfviton,
  title={MF-VITON: High-Fidelity Mask-Free Virtual Try-On with Minimal Input},
  author={Wan, Zhenchen and Xu, Yanwu and Hu, Dongting and Cheng, Weilun and Chen, Tianxi and Wang, Zhaoqing and Liu, Feng and Liu, Tongliang and Gong, Mingming},
  journal={arXiv preprint arXiv:2503.08650},
  year={2025}
}