TED-VITON

Transformer-Empowered Diffusion Models for Virtual Try-On

1University of Melbourne, Australia
2Snapchat, Los Angeles, USA
3University of Sydney, Australia
4Mohamed bin Zayed University of Artificial Intelligence, UAE
TED-VITON Teaser

TED-VITON demonstrates state-of-the-art performance in garment realism and text clarity under diverse poses and lighting conditions.

Abstract

Recent advancements in Virtual Try-On (VTO) have demonstrated exceptional efficacy in generating realistic images and preserving garment details, largely attributed to the robust generative capabilities of text-to-image (T2I) diffusion backbones. However, the T2I models that underpin these methods have become outdated, thereby limiting the potential for further improvement in VTO. Additionally, current methods face notable challenges in accurately rendering text on garments without distortion and in preserving fine-grained details such as textures and material fidelity. The emergence of Diffusion Transformer (DiT) based T2I models has showcased impressive performance and offers a promising opportunity for advancing VTO. Directly applying existing VTO techniques to transformer-based T2I models is ineffective due to substantial architectural differences, which hinder their ability to fully leverage the models' advanced capabilities for improved text generation. To address these challenges and unlock the full potential of DiT-based T2I models for VTO, we propose TED-VITON, a novel framework that integrates a Garment Semantic (GS) Adapter for enhancing garment-specific features, a Text Preservation Loss to ensure accurate and distortion-free text rendering, and a constraint mechanism for generating prompts by optimizing a Large Language Model (LLM). These innovations enable state-of-the-art (SOTA) performance in visual quality and text fidelity, establishing a new benchmark for the VTO task.
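The abstract names a Text Preservation Loss but this page does not spell out its formula. Purely as an illustrative sketch, and not the paper's definition, one common way such a loss can be instantiated is a feature-matching penalty restricted to the on-garment text region under a frozen encoder. The function name text_preservation_loss, the text_mask input, and the dummy encoder below are all assumptions introduced for illustration.

# Hedged illustration only: this is NOT the loss defined in the TED-VITON paper.
# It shows one plausible instantiation: an L1 penalty between frozen-encoder
# features of the masked text regions of the generated and reference images.
import torch
import torch.nn.functional as F


def text_preservation_loss(gen_img, ref_img, text_mask, encoder):
    """L1 distance between frozen-encoder features of the text regions of the
    generated image and the garment reference. `encoder` is any frozen feature
    extractor; `text_mask` marks where text is rendered on the garment."""
    gen_feat = encoder(gen_img * text_mask)
    ref_feat = encoder(ref_img * text_mask)
    return F.l1_loss(gen_feat, ref_feat)


# Toy usage with a dummy frozen encoder (stand-in for a real visual/OCR backbone).
dummy_encoder = torch.nn.Conv2d(3, 8, kernel_size=3, padding=1).eval()
for p in dummy_encoder.parameters():
    p.requires_grad_(False)
gen = torch.rand(1, 3, 256, 256)
ref = torch.rand(1, 3, 256, 256)
mask = torch.zeros(1, 1, 256, 256)
mask[..., 64:128, 64:192] = 1.0  # hypothetical text region
print(text_preservation_loss(gen, ref, mask, dummy_encoder))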

TED-VITON Overview

Overview of TED-VITON: We present the architecture of the proposed model along with details of its block modules. (a) Our model consists of 1) DiT-GarmentNet, which encodes fine-grained features of the garment image \( X_g \); 2) the GS-Adapter, which captures higher-order semantics of \( X_g \); and 3) DiT-TryOnNet, the main Transformer for processing person images. The Transformer input is formed by concatenating the noised latents \( X_t \) with the segmentation mask \( m \), the masked person image \( \mathcal{E}(X_\text{model}) \), and the DensePose map \( \mathcal{E}(X_\text{pose}) \). Additionally, a detailed description of the garment (e.g., “[D]: The clothing item is a black T-shirt...”) is generated by an LLM and fed as input to both DiT-GarmentNet and DiT-TryOnNet. The model preserves garment-specific details through a text preservation loss, which ensures that key textual features are retained. (b) Intermediate features from DiT-TryOnNet and DiT-GarmentNet are concatenated and then refined through joint-attention and cross-attention layers, with the GS-Adapter further contributing to the refinement. In this architecture, the DiT-TryOnNet and GS-Adapter modules are fine-tuned, while all other components remain frozen.
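The data flow described above can be summarized in a few lines of PyTorch. The sketch below is a minimal, assumption-laden illustration only: the FusionBlock class, the patchify layer, and all token counts and channel sizes are hypothetical stand-ins, and the real DiT-TryOnNet, DiT-GarmentNet, and GS-Adapter are full networks rather than the stubs shown here.

# Minimal PyTorch sketch of the input assembly and feature-fusion flow described
# in the caption above. Module names, channel sizes, and the single fused block
# are illustrative assumptions -- not the released TED-VITON implementation.
import torch
import torch.nn as nn


class FusionBlock(nn.Module):
    """Toy stand-in for one DiT block that (b) concatenates try-on and garment
    tokens, runs joint attention over them, then cross-attends to the
    GS-Adapter's garment-semantic tokens."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.joint_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, tryon_tokens, garment_tokens, gs_tokens):
        # Joint attention over the concatenated try-on + garment token sequence.
        joint = torch.cat([tryon_tokens, garment_tokens], dim=1)
        q = self.norm1(joint)
        joint = joint + self.joint_attn(q, q, q)[0]
        # Cross-attention: fused tokens query the GS-Adapter semantic tokens.
        joint = joint + self.cross_attn(self.norm2(joint), gs_tokens, gs_tokens)[0]
        # Keep only the try-on portion as the block output.
        return joint[:, : tryon_tokens.shape[1]]


# --- Illustrative input assembly (hypothetical 64x64 latents, 4 VAE channels) ---
B, C, H, W, dim = 1, 4, 64, 64, 512
x_t          = torch.randn(B, C, H, W)   # noised latents X_t
mask         = torch.randn(B, 1, H, W)   # segmentation mask m (downsampled)
masked_model = torch.randn(B, C, H, W)   # E(X_model): masked person latent
pose         = torch.randn(B, C, H, W)   # E(X_pose): DensePose latent

# DiT-TryOnNet input: channel-wise concatenation of the conditioning maps.
tryon_input = torch.cat([x_t, mask, masked_model, pose], dim=1)  # (B, 13, H, W)

# Patchify to token sequences (hypothetical 2x2 patches -> dim-channel tokens).
to_tokens = nn.Conv2d(tryon_input.shape[1], dim, kernel_size=2, stride=2)
tryon_tokens   = to_tokens(tryon_input).flatten(2).transpose(1, 2)  # (B, 1024, dim)
garment_tokens = torch.randn(B, 1024, dim)  # stand-in for DiT-GarmentNet features of X_g
gs_tokens      = torch.randn(B, 16, dim)    # stand-in for GS-Adapter semantic tokens

out = FusionBlock(dim)(tryon_tokens, garment_tokens, gs_tokens)
print(out.shape)  # torch.Size([1, 1024, 512]) -- refined try-on tokens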

BibTeX

@article{wan2024tedviton,
  title={TED-VITON: Transformer-Empowered Diffusion Models for Virtual Try-On},
  author={Wan, Zhenchen and Xu, Yanwu and Wang, Zhaoqing and Liu, Feng and Liu, Tongliang and Gong, Mingming},
  journal={arXiv preprint arXiv:2411.10499},
  year={2024}
}