CV Paper List

Towards Scalable Unpaired Virtual Try-On via Patch-Routed Spatially-Adaptive GAN

Zhenyu Xie¹, Zaiyu Huang¹, Fuwei Zhao¹, Haoye Dong¹, Michael Kampffmeyer², Xiaodan Liang¹,³* ¹Shenzhen Campus of Sun Yat-Sen University, ²UiT The Arctic University of Norway, ³Peng Cheng Laboratory {xiezhy6,huangzy225,zhaofw,donghy7}@mail2.sysu.edu.cn, michael.c.kampffmeyer@uit.no, xdliang328@gmail.com

Abstract

Image-based virtual try-on is one of the most promising applications of human-centric image generation due to its tremendous real-world potential. Yet, as most try-on approaches fit in-shop garments onto a target person, they require the laborious and restrictive construction of a paired training dataset, severely limiting their scalability. While a few recent works attempt to transfer garments directly from one person to another, alleviating the need to collect paired datasets, their performance is impacted by the lack of paired (supervised) information. In particular, disentangling the style and spatial information of the garment becomes a challenge, which existing methods address either by requiring auxiliary data or through extensive online optimization procedures, thereby still inhibiting their scalability. To achieve a scalable virtual try-on system that can transfer arbitrary garments between a source and a target person in an unsupervised manner, we thus propose a texture-preserving end-to-end network, the PAtch-routed SpaTially-Adaptive GAN (PASTA-GAN), that facilitates real-world unpaired virtual try-on. Specifically, to disentangle the style and spatial information of each garment, PASTA-GAN contains an innovative patch-routed disentanglement module that retains garment texture and shape characteristics. Guided by the source person keypoints, the patch-routed disentanglement module first decouples garments into normalized patches, thus eliminating the inherent spatial information of the garment, and then reconstructs the normalized patches into the warped garment complying with the target person pose. Given the warped garment, PASTA-GAN further introduces novel spatially-adaptive residual blocks that guide the generator to synthesize more realistic garment details. Extensive comparisons with paired and unpaired approaches demonstrate the superiority of PASTA-GAN, highlighting its ability to generate high-quality try-on images when faced with a large variety of garments (e.g., vests, shirts, pants), taking a crucial step towards real-world scalable try-on.

이미지 기반 가상 트라이온은 엄청난 실제 잠재력으로 인해 인간 중심 이미지 생성의 가장 유망한 응용 프로그램 중 하나이다. 그러나 대부분의 트라이온 접근법은 대상자에게 매장 내 의류를 적합시키기 때문에 쌍을 이룬 훈련 데이터 세트의 힘들고 제한적인 구성이 필요하여 확장성이 심각하게 제한된다. 최근의 몇 가지 연구는 한 사람에서 다른 사람으로 의복을 직접 전송하려고 시도하여 페어링된 데이터 세트를 수집할 필요성을 완화하지만, 페어링된(감독된) 정보가 부족하여 성능에 영향을 미친다. 특히 의류의 스타일과 공간 정보를 분리하는 것은 과제가 되는데, 기존 방법은 보조 데이터나 광범위한 온라인 최적화 절차를 요구하여 해결하므로 여전히 확장성을 저해한다. 따라서 소스와 대상자 간에 임의의 의복을 감독되지 않은 방식으로 전송할 수 있는 확장 가능한 가상 트라이온 시스템을 달성하기 위해 실제 짝을 이루지 않은 가상 트라이온을 용이하게 하는 텍스처 보존 종단 간 네트워크인 패치 라우팅 SpaTial-Adaptive GAN(PASTA-GAN)을 제안한다. 특히, 각 의복의 스타일과 공간 정보를 분리하기 위해, PASTA-GAN은 의복 질감과 모양 특성을 성공적으로 유지하기 위한 혁신적인 패치 라우팅 분리 모듈로 구성된다. 소스 사람의 핵심 포인트에 의해 안내되는 패치 라우팅 분리 모듈은 먼저 의복을 정규화된 패치로 분리하여 의복의 고유한 공간 정보를 제거한 다음 대상 사람 포즈를 준수하여 뒤틀린 의복에 정규화된 패치를 재구성한다. 뒤틀린 의복을 고려할 때, PASTA-GAN은 생성기를 더 사실적인 의복 세부 사항을 합성하도록 안내하는 새로운 공간 적응형 잔류 블록을 추가로 도입한다. 짝을 이룬 접근 방식과 짝을 이루지 않은 접근 방식과의 광범위한 비교는 PASTA-GAN의 우수성을 보여주며, 다양한 의류(예: 조끼, 셔츠, 바지)와 마주쳤을 때 고품질의 트라이온 이미지를 생성할 수 있는 능력을 강조하여 실제 확장 가능한 트라이온을 향한 중요한 발걸음을 내딛는다.

1 Introduction

Figure 1: Example virtual try-on results from our PASTA-GAN, which is flexible for various try-on scenarios, e.g., garment transfer for the upper body, the lower body, and the full body.

그림 1: 다양한 트라이온 시나리오(예: 상체, 하체, 전신)에 대해 유연한 가상 트라이온 결과의 예.

Image-based virtual try-on, the process of computationally transferring a garment onto a particular person in a query image, is one of the most promising applications of human-centric image generation with the potential to revolutionize shopping experiences and reduce purchase returns. However, to fully exploit its potential, scalable solutions are required that can leverage easily accessible training data, handle arbitrary garments, and provide efficient inference results. Unfortunately, to date, most existing methods [35, 38, 12, 7, 37, 9, 10, 4, 36, 39] rely on paired training data, i.e., a person image and its corresponding in-shop garment, leading to laborious data-collection processes. Furthermore, these methods are unable to exchange garments directly between two person images, thus largely limiting their application scenarios and raising the need for unpaired solutions to ensure scalability.

쿼리 이미지에서 특정 사람에게 의복을 계산적으로 전송하는 프로세스인 이미지 기반 가상 트라이온은 쇼핑 경험을 혁신하고 구매 수익을 줄일 수 있는 잠재력을 가진 인간 중심 이미지 생성의 가장 유망한 응용 프로그램 중 하나이다. 그러나 그 잠재력을 최대한 활용하려면 쉽게 액세스할 수 있는 훈련 데이터를 활용하고 임의의 의복을 처리하며 효율적인 추론 결과를 제공할 수 있는 확장 가능한 솔루션이 필요하다. 불행하게도, 현재까지 대부분의 기존 방법[35, 38, 12, 7, 37, 9, 10, 4, 36, 39]은 쌍을 이룬 훈련 데이터, 즉 개인 이미지와 그에 상응하는 매장 내 의류에 의존하여 힘든 데이터 수집 프로세스를 초래한다. 또한 이러한 방법은 두 사람의 이미지 간에 직접 의복을 교환할 수 없으므로 응용 시나리오를 크게 제한하고 확장성을 보장하기 위한 짝을 이루지 않은 솔루션의 필요성이 제기된다.

While unpaired solutions have recently started to emerge, performing virtual try-on in an unsupervised setting is extremely challenging and tends to affect the visual quality of the try-on results. Specifically, without access to the paired data, these models are usually trained by reconstructing the same person image, which is prone to over-fitting, and thus underperform when handling garment transfer during testing. The performance discrepancy is mainly reflected in the garment synthesis results, in particular the shape and texture, which we argue is caused by the entanglement of the garment style and spatial representations in the synthesis network during the reconstruction process.

최근 쌍을 이루지 않은 솔루션이 등장하기 시작했지만, 감독되지 않은 환경에서 가상 트라이온을 수행하는 것은 매우 어렵고 트라이온 결과의 시각적 품질에 영향을 미치는 경향이 있다. 특히, 쌍을 이룬 데이터에 액세스하지 않으면, 이러한 모델은 대개 과적합되기 쉬운 동일 인물 이미지를 재구성하여 훈련되며, 따라서 테스트 중에 의복 전송을 처리할 때 성능이 떨어진다. 성능 불일치는 주로 의복 합성 결과, 특히 모양과 질감에 반영되는데, 우리는 재구성 과정에서 의복 스타일과 공간 표현이 합성 네트워크에서 얽힘으로써 발생한다고 주장한다.

Traditional paired try-on approaches [35, 12, 37, 10] avoid this problem and preserve the garment characteristics by utilizing a supervised warping network to obtain the warped garment in the target shape, but this is not possible in the unpaired setting due to the lack of warped ground truth. The few works that do attempt unpaired virtual try-on therefore circumvent this problem either by relying on person images in various poses for feature disentanglement [23, 33, 32, 31, 1, 5], which again leads to a laborious data-collection process, or by requiring extensive online optimization procedures [25, 17] to obtain fine-grained details of the original garments, which harms inference efficiency. However, none of the existing unpaired try-on methods directly consider the problem of coupled style and spatial garment information, which is crucial for obtaining accurate garment transfer results in the unpaired and unsupervised virtual try-on scenario.

이것은 이 문제를 피하고 감독된 뒤틀림 네트워크를 활용하여 뒤틀린 옷을 목표 모양으로 얻음으로써 의복 특성을 보존하는 전통적인 쌍체 시도 접근법[35, 12, 37, 10]에서는 문제가 되지 않지만, 뒤틀린 바닥 진실이 없기 때문에 짝을 이루지 않은 환경에서는 불가능하다. 따라서 짝을 이루지 않은 가상 시험을 달성하려고 시도하는 소수의 연구는 기능 분리[23, 33, 32, 31, 1, 5]를 위해 다양한 포즈의 사람 이미지에 의존하거나 미세한 데이터를 얻기 위해 광범위한 온라인 최적화 절차[25, 17]를 필요로 함으로써 이 문제를 우회하기로 선택한다.원래 의복의 곡물 세부 사항, 추론 효율성을 손상시킵니다. 그러나 기존의 짝을 이루지 않은 트라이온 방법 중 어떤 것도 짝을 이루지 않고 감독되지 않은 가상 트라이온 시나리오에서 정확한 의류 전송 결과를 얻는 데 중요한 결합 스타일 및 공간 의류 정보의 문제를 직접 고려하지 않는다.

In this paper, to tackle the essential challenges mentioned above, we propose a novel PAtch-routed SpaTially-Adaptive GAN, named PASTA-GAN, a scalable solution to the unpaired try-on task. Our PASTA-GAN can precisely synthesize garment shape and style (see Fig. 1) by introducing a patch-routed disentanglement module that decouples the garment style and spatial features, as well as a novel spatially-adaptive residual module to mitigate the problem of feature misalignment.

본 논문에서는 위에서 언급된 필수 과제를 해결하기 위해 짝을 이루지 않은 트라이온 작업에 대한 확장 가능한 솔루션인 PATAS-GAN이라는 새로운 패치 라우팅 SpaTial-Adaptive GAN을 제안한다. 우리의 PASTA-GAN은 의복 스타일과 공간적 특징을 분리하는 패치 라우팅 분리 모듈과 특징 불일치 문제를 완화하는 새로운 공간 적응형 잔류 모듈을 도입하여 의복 모양과 스타일을 정밀하게 합성할 수 있다(그림 1 참조).

The innovation of our PASTA-GAN includes three aspects: First, by separating the garments into normalized patches with the inherent spatial information largely reduced, the patch-routed disentanglement module encourages the style encoder to learn spatial-agnostic garment features. These features enable the synthesis network to generate images with accurate garment style regardless of varying spatial garment information. Second, given the target human pose, the normalized patches can be easily reconstructed to the warped garment complying with the target shape, without requiring a warping network or a 3D human model. Finally, the spatially-adaptive residual module extracts the warped garment feature and adaptively inpaints the region that is misaligned with the target garment shape. Thereafter, the inpainted warped garment features are embedded into the intermediate layer of the synthesis network, guiding the network to generate try-on results with realistic garment texture.

PASTA-GAN의 혁신은 세 가지 측면을 포함한다. 첫째, 고유한 공간 정보가 크게 줄어든 상태에서 의복을 정규화된 패치로 분리함으로써 패치 라우팅 분리 모듈은 스타일 인코더가 공간에 구애받지 않는 의복 특징을 학습하도록 장려한다. 이러한 기능을 통해 합성 네트워크는 다양한 공간 의류 정보에 관계없이 정확한 의류 스타일로 이미지를 생성할 수 있습니다. 둘째, 대상 인간 자세를 고려할 때, 워핑 네트워크나 3D 인간 모델이 필요하지 않고 목표 모양을 따라 워핑된 의복으로 정규화된 패치를 쉽게 재구성할 수 있다. 마지막으로, 공간 적응형 잔류 모듈은 뒤틀린 의복 특징을 추출하고 대상 의복 형태와 잘못 정렬된 영역을 적응적으로 색칠합니다. 그 후, 페인트로 칠해진 뒤틀린 의복 특징이 합성 네트워크의 중간 계층에 내장되어 네트워크가 사실적인 의복 질감으로 시험 결과를 생성하도록 안내한다.

We collect a scalable UnPaired virtual Try-on (UPT) dataset and conduct extensive experiments on the UPT dataset and two existing try-on benchmark datasets (i.e., the DeepFashion [21] and the MPV [6] datasets). Experimental results demonstrate that our unsupervised PASTA-GAN outperforms both the previous unpaired and paired try-on approaches.

확장 가능한 UnPaired Virtual Try-on(UP) 데이터 세트를 수집하고 UPT 데이터 세트와 두 개의 기존 트라이온 벤치마크 데이터 세트(즉, DeepFashion [21] 및 MPV [6] 데이터 세트)에 대한 광범위한 실험을 수행한다. 실험 결과는 우리의 비지도 PASTA-GAN이 이전의 짝을 이루지 않고 쌍을 이룬 트라이온 접근법 모두를 능가한다는 것을 보여준다.

Figure 2: Overview of the inference process. (a) Given the source and target person images $(I_{s}, I_{t})$, we can extract the source garment $G_{s}$, the source pose $J_{s}$, and the target pose $J_{t}$. The three are then sent to the patch-routed disentanglement module to yield the normalized garment patches $P_{n}$ and the warped garment $G_{t}$. (b) The modified conditional StyleGAN2 first collaboratively exploits the disentangled style code $w$, projected from $P_{n}$, and the person identity feature $f_{id}$, encoded from the target head and pose $(H_{t}, J_{t})$, to synthesize the coarse try-on result $\tilde{I_{t}}'$ in the style synthesis branch along with the target garment mask $M_{g}$. It then leverages the warped garment feature $f_{g}$ in the texture synthesis branch to generate the final try-on result $I_{t}'$.

그림 2: 추론 과정의 개요. (a) 사람 $(I_{s}, I_{t})$의 소스 이미지와 타겟 이미지가 주어지면 소스 의류 $G_{s}$, 소스 포즈 $J_{s}$, 타깃 포즈 $J_{t}$를 추출할 수 있다. 그런 다음 세 개를 패치 라우팅 분리 모듈로 전송하여 정규화된 의복 패치 $P_{n}$와 뒤틀린 의복 $G_{t}$를 생성합니다. (b) 수정된 조건부 StyleGAN2는 먼저 $P_{n}$에서 투영된 분리된 스타일 코드 $w$와 대상 헤드 및 포즈 $(H_{t}, J_{t})$에서 인코딩된 개인 식별 기능 $f_{id}$을 공동으로 활용하여 대상 의류 마스크 $M_{g}$과 함께 스타일 합성 분기에서 거친 시도 결과 $I_{t}’$를 합성한다. 그런 다음 텍스처 합성 분기에서 뒤틀린 의복 특징 $f_{g}$을 활용하여 최종 시도 결과 $\hat{I’_{t}}$을 생성한다.

2 Related Work

Paired Virtual Try-on. Paired try-on methods [13, 35, 38, 12, 24, 37, 9, 10, 36] aim to transfer an in-shop garment onto a reference person. Among them, VITON [13] is the first to integrate a U-Net [29] based generation network with a TPS [2] based deformation approach to synthesize the try-on result. CP-VTON [35] improves this paradigm by replacing the time-consuming warping module with a trainable geometric matching module. VTNFP [38] adopts human parsing to guide the generation of various body parts, while [24, 37, 39] introduce a smoothness constraint for the warping module to alleviate the excessive distortion in TPS warping. Besides the TPS-based warping strategy, [12, 36, 10] turn to a flow-based warping scheme that models the per-pixel deformation. Recently, VITON-HD [4] focuses on high-resolution virtual try-on and proposes an ALIAS normalization mechanism to resolve garment misalignment. PF-AFN [10] improves the learning process by employing knowledge distillation, achieving state-of-the-art results. However, all of these methods require paired training data and are incapable of exchanging garments between two person images.

쌍으로 구성된 가상 트라이온입니다. 짝을 이룬 트라이온 방법[13, 35, 38, 12, 24, 37, 9, 10, 36]은 매장 내 의류를 기준 담당자에게 전달하는 것을 목표로 합니다. 그 중에서, VITON[13]은 처음으로 U-Net[29] 기반 생성 네트워크와 TPS[2] 기반 변형 접근법을 통합하여 트라이온 결과를 합성한다. CP-VTON[35]은 시간이 많이 걸리는 뒤틀림 모듈을 훈련 가능한 기하학적 매칭 모듈로 대체함으로써 이 패러다임을 개선한다. VTNFP[38]는 인간 파싱을 채택하여 다양한 신체 부위의 생성을 안내하는 반면, [24, 37, 39]는 TPS 워핑의 과도한 왜곡을 완화하기 위해 워핑 모듈에 대한 부드러운 제약 조건을 도입한다. TPS 기반 워핑 전략 외에도 [12, 36, 10]은 픽셀당 변형을 모델링하는 흐름 기반 워핑 체계로 전환한다. 최근, VITON-HD[4]는 고해상도 가상 트라이온에 초점을 맞추고 의류 정렬 오류를 해결하기 위한 ALIAS 정규화 메커니즘을 제안한다. PF-AFN[10]은 지식 증류를 사용하여 최첨단 결과를 달성함으로써 학습 프로세스를 개선한다. 그러나 이러한 모든 방법은 쌍을 이룬 훈련 데이터가 필요하며 두 사람의 이미지 간에 의복을 교환할 수 없다.

Unpaired Virtual Try-on. Different from the above methods, some recent works [23, 33, 32, 31, 25, 17] eliminate the need for in-shop garment images and directly transfer garments between two person images. Among them, [23, 33, 32, 31, 1, 5] leverage pose transfer as the pretext task to learn disentangled pose and appearance features for human synthesis, but require images of the same person in different poses. In contrast, [25, 17] are more flexible and can be directly trained with unpaired person images. However, OVITON [25] requires online appearance optimization for each garment region during testing to maintain the texture detail of the original garment. VOGUE [17] needs to separately optimize the latent codes for each person image and the interpolation coefficient for the final try-on result during testing. Therefore, existing unpaired methods require either cumbersome data collection or extensive online optimization, severely harming their scalability in real scenarios.

짝을 이루지 않은 가상 트라이온. 위의 방법과 달리, 일부 최근 작업[23, 33, 32, 31, 25, 17]은 인숍(in-shop) 의류 이미지의 필요성을 제거하고 두 사람의 이미지 간에 의류를 직접 전송한다. 그 중 [23, 33, 32, 31, 1, 5]는 인간 합성을 위해 흐트러진 포즈와 외모 특징을 배우기 위한 핑계 과제로 포즈 전송을 활용하지만, 다른 포즈를 가진 동일한 사람의 이미지가 필요하다. 2 대조적으로, [25, 17]은 더 유연하며 짝을 이루지 않은 사람 이미지로 직접 훈련할 수 있다. 그러나 OVITON [25]에서는 원래 의복의 질감 세부 정보를 유지하기 위해 테스트 중에 각 의복 영역에 대해 온라인 모양 최적화가 필요합니다. VOGUE[17]는 각 개인 이미지에 대한 잠재 코드와 테스트 중 최종 시도 결과에 대한 보간 계수를 별도로 최적화해야 한다. 따라서 기존의 짝을 이루지 않은 방법은 번거로운 데이터 수집 또는 광범위한 온라인 최적화를 필요로 하여 실제 시나리오에서 확장성을 크게 손상시킨다.

3 PASTA-GAN

Given a source image $I_{s}$ of a person wearing a garment $G_{s}$, and a target person image $I_{t}$, the unpaired virtual try-on task aims to synthesize the try-on result $I_{t}’$ retaining the identity of $I_{t}$ but wearing the source garment $G_{s}$. To achieve this, our PASTA-GAN first utilizes the patch-routed disentanglement module (Sec. 3.1) to transform the garment $G_{s}$ into normalized patches $P_{n}$ that are mostly agnostic to the spatial features of the garment, and further deforms $P_{n}$ to obtain the warped garment $G_{t}$ complying with the target person pose. Then, an attribute-decoupled conditional StyleGAN2 (Sec. 3.2) is designed to synthesize try-on results in a coarse-to-fine manner, where we introduce novel spatially-adaptive residual blocks (Sec. 3.3) to inject the warped garment features into the generator network for more realistic texture synthesis. The loss functions and training details will be described in Sec. 3.4. Fig. 2 illustrates the overview of the inference process for PASTA-GAN.

의류 $G_{s}$를 입은 사람의 소스 이미지 $I_{s}$와 대상 인물 이미지 $I_{t}$가 주어지면, 짝을 이루지 않은 가상 트라이온 작업은 $I_{t}$의 정체성을 유지하면서도 소스 의류 $I_{t}’$를 착용한 트라이온 결과 $G_{s}$를 합성하는 것을 목표로 한다. 이를 달성하기 위해, 우리의 PASTA-GAN은 먼저 패치 라우팅 분리 모듈(3.1절)을 사용하여 의류 $G_{s}$를 의류의 공간적 특징에 대부분 무관한 정규화된 패치 $P_{n}$로 변환하고, 나아가 $P_{n}$를 변형하여 대상 사람의 포즈를 준수하는 뒤틀린 의류 $G_{t}$를 얻는다. 그런 다음 속성이 분리된 조건부 StyleGAN2(3.2절)는 거친 방식에서 미세한 방식으로 트라이온 결과를 합성하도록 설계되었으며, 여기서 우리는 보다 현실적인 텍스처 합성을 위해 뒤틀린 의복 특징을 생성기 네트워크에 주입하기 위해 새로운 공간 적응형 잔류 블록(3.3절)을 도입한다. 손실 함수와 훈련 세부사항은 3.4절에 설명될 것이다. 그림 2는 PASTA-GAN에 대한 추론 과정의 개요를 보여준다.

3.1 Patch-routed Disentanglement Module

Since paired data for supervised training is unavailable in the unpaired virtual try-on task, the synthesis network has to be trained in an unsupervised manner via image reconstruction. It thus takes a person image as input and separately extracts the feature of the intact garment and the feature of the person representation to reconstruct the original person image. While such a training strategy retains the intact garment information, which is helpful for garment reconstruction, the features of the intact garment entangle the garment style with the spatial information in the original image. This is detrimental to garment transfer during testing. Note that the garment style here refers to the garment color and category (e.g., long sleeve or short sleeve), while the garment spatial information refers to the location, the orientation, and the relative size of the garment patch in the person image. The first two are influenced by the human pose, while the third is determined by the relative camera distance to the person.

감독된 훈련을 위한 쌍을 이룬 데이터는 짝을 이루지 않은 가상 시험 작업에 사용할 수 없기 때문에 합성 네트워크는 이미지 재구성을 통해 감독되지 않은 방식으로 훈련되어야 하며, 따라서 사람 이미지를 입력으로 가져가서 손상되지 않은 의복의 특징과 사람 표현의 특징을 재구성하기 위해 별도로 추출한다. 오리지널 인물 이미지 이러한 교육 전략은 의복 재구성에 도움이 되는 손상되지 않은 의복 정보를 유지하는 반면, 손상되지 않은 의복의 특징은 의복 스타일을 원래 이미지의 공간 정보와 얽히게 한다. 이는 테스트 중 의복 이송에 해롭습니다. 여기서의 의복 스타일은 의복 색상 및 범주, 즉 긴 소매, 짧은 소매 등을 의미하는 반면, 의복 공간 정보는 인물 이미지에서 의복 패치의 위치, 방향 및 상대적 크기를 의미하며, 여기서 처음 두 부분은 사람의 자세에 의해 영향을 받는 반면 세 번째 부분은 결정됩니다.상대적인 카메라 거리에 의해 고정됩니다.

To address this issue, we explicitly divide the garment into normalized patches to remove the inherent spatial information of the garment. Taking the sleeve patch as an example, by using division and normalization, various sleeve regions from different person images can be deformed to normalized patches with the same orientation and scale. Without the guidance of the spatial information, the network is forced to learn the garment style feature to reconstruct the garment in the synthesis image.

이 문제를 해결하기 위해, 우리는 의복의 고유한 공간 정보를 제거하기 위해 의복을 명시적으로 정규화된 패치로 나눈다. 슬리브 패치를 예로 들면, 분할 및 정규화를 사용함으로써, 서로 다른 사람 이미지의 다양한 슬리브 영역이 동일한 방향 및 스케일의 정규화된 패치로 변형될 수 있다. 공간 정보의 안내 없이, 네트워크는 합성 이미지에서 의복을 재구성하기 위해 의복 스타일 기능을 학습해야 한다.

Figure 3: The process of the patch-routed deformation. Please zoom in for more details.

그림 3: 패치 경로 변형의 과정. 자세한 내용을 보려면 확대하십시오.

Fig. 3 illustrates the process of obtaining normalized garment patches, which includes two main steps: (1) pose-guided garment segmentation, and (2) perspective transformation-based patch normalization. Specifically, in the first step, the source garment $G_{s}$ and human pose (joints) $J_{s}$ are first obtained by applying [11] and [3] to the source person $I_{s}$, respectively. Given the body joints, we can segment the source garment into several patches $P_{s}$, which can be quadrilaterals with arbitrary shapes (e.g., rectangle, square, trapezoid, etc.), and will later be normalized. Taking the torso region as an example, with the coordinates of the left/right shoulder joints and the left/right hip joints in $J_{s}$, a quadrilateral crop (of which the four corner points are visualized in color in $P_{s}^{i}$ of Fig. 3) covering the torso region of $G_{s}$ can be easily performed to produce an unnormalized garment patch. Note that we define eight patches for upper-body garments, i.e., the patches around the left/right upper/lower arm, the patches around the left/right hips, a patch around the torso, and a patch around the neck. In the second step, all patches are normalized to remove their spatial information by perspective transformations. For this, we first define the same number of template patches $P_{n}$ with a fixed 64 × 64 resolution as transformation targets for all unnormalized source patches, and then compute a homography matrix $H_{s→n}^{i} ∈ \mathbb{R}^{3×3}$ [40] for each pair of $P_{s}^{i}$ and $P_{n}^{i}$, based on the four corresponding corner points of the two patches. Concretely, $H_{s→n}^{i}$ serves as a perspective transformation to relate the pixel coordinates in the two patches, formulated as:

그림 3은 (1) 자세 유도 의복 분할 및 (2) 원근 변환 기반 패치 정규화의 두 가지 주요 단계를 포함하는 정규화된 의복 패치를 얻는 과정을 보여준다. 구체적으로, 첫 번째 단계에서, 소스 의류 $G_{s}$와 인간 포즈(관절) $J_{s}$는 먼저 소스 사람 $I_{s}$에 각각 [11]과 [3]을 적용하여 얻는다. 신체 관절을 고려할 때, 우리는 소스 의복을 여러 패치 P로 분할할 수 있으며, 이 패치는 임의의 모양(예: 직사각형, 사각형, 사다리꼴 등)을 가진 4각형일 수 있으며, 나중에 표준화될 것이다. 몸통 부위를 예로 들면, 왼쪽/오른쪽 어깨 관절과 왼쪽/오른쪽 엉덩이 관절의 좌표가 $P_{i}^{s}$인 사각형 크롭(그림 3의 $P_{i}^{s}$에서 네 모서리 지점이 색상으로 시각화됨)으로 비정상적인 의복 패치를 생성하기 위해 쉽게 수행될 수 있다. 상체 의류를 위한 8개의 패치, 즉 왼쪽/오른쪽 상/하부 팔 주위의 패치, 왼쪽/오른쪽 엉덩이 주위의 패치, 몸통 주위의 패치, 목 주위의 패치를 정의한다. 두 번째 단계에서는 모든 패치가 원근 변환을 통해 공간 정보를 제거하도록 정규화된다. 이를 위해, 우리는 먼저 모든 정규화되지 않은 소스 패치에 대한 변환 대상과 고정된 64x64 해상도를 가진 동일한 양의 템플릿 패치 $P_{n}$를 정의한 다음, 두 패치의 해당하는 네 지점을 기반으로 $P_{i}^{s→n}$와 $P_{n}^{i}$의 각 쌍에 대한 호모그래피 매트릭스 $H_{i}^{snn}$[40]를 계산한다. 구체적으로, $H_{i}^{s→n}$는 다음과 같이 공식화된 두 패치의 픽셀 좌표를 연관시키는 원근 변환 역할을 한다.

\[\begin{bmatrix} x_{n}^{i}\\ y_{n}^{i} \\ 1 \end{bmatrix} = H_{s→n}^{i} \begin{bmatrix} x_{s}^{i}\\ y_{s}^{i} \\ 1 \end{bmatrix} = \begin{bmatrix} h_{11}^{i} & h_{12}^{i} & h_{13}^{i}\\ h_{21}^{i} & h_{22}^{i} & h_{23}^{i}\\ h_{31}^{i} & h_{32}^{i} & h_{33}^{i} \end{bmatrix} \begin{bmatrix} x_{s}^{i}\\ y_{s}^{i} \\ 1 \end{bmatrix} \tag{1}\]

where $(x_{n}^{i}, y_{n}^{i})$ and $(x_{s}^{i}, y_{s}^{i})$ are the pixel coordinates in the normalized template patch and the unnormalized source patch, respectively. To compute the homography matrix $H_{s→n}^{i}$, we directly leverage the OpenCV API, which takes as inputs the corner points of the two patches and is implemented by using least-squares optimization and the Levenberg-Marquardt method [8]. After obtaining $H_{s→n}^{i}$, we can transform the source patch $P_{s}^{i}$ to the normalized patch $P_{n}^{i}$ according to Eq. 1.

여기서 $(x_{i}^{n}, y_{i}^{n})$와 $(x_{i}^{s}, y_{i}^{s})$는 각각 정규화된 템플릿 패치와 정규화되지 않은 소스 패치의 픽셀 좌표이다. 호모그래피 매트릭스 $H_{s→n}^{i}$를 계산하기 위해, 우리는 두 패치의 코너 포인트를 입력으로 사용하고 최소 제곱 최적화와 레벤베르크-마쿼르트 방법을 사용하여 구현되는 OpenCV API를 직접 활용한다[8]. $H_{s→n}^{i}$를 얻은 후에, 우리는 소스 패치 $P_{i}^{s}$를 Eq. 1에 따라 정규화된 패치 $P_{n}^{i}$로 변환할 수 있다.
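As an illustration of the normalization step, the following minimal sketch fits a homography from four corner correspondences and maps the source corners onto a 64 × 64 template. It is NumPy-only and solves the eight linear equations directly with $h_{33}$ fixed to 1, which is an assumption made here and a stand-in for the OpenCV routine the text mentions. The function names and corner coordinates are hypothetical. Since the relation in Eq. 1 holds in homogeneous coordinates, the mapped point is divided by its third component.

```python
import numpy as np

def homography_from_corners(src, dst):
    """Fit the 3x3 matrix H that maps four src corners onto four dst corners.

    Assumes h33 = 1, so the eight remaining entries follow from the
    eight equations given by the four point pairs.
    """
    rows, rhs = [], []
    for (x, y), (u, v) in zip(src, dst):
        # u = (h11*x + h12*y + h13) / (h31*x + h32*y + 1), likewise for v
        rows.append([x, y, 1.0, 0.0, 0.0, 0.0, -u * x, -u * y])
        rhs.append(u)
        rows.append([0.0, 0.0, 0.0, x, y, 1.0, -v * x, -v * y])
        rhs.append(v)
    h = np.linalg.solve(np.array(rows, dtype=float), np.array(rhs, dtype=float))
    return np.append(h, 1.0).reshape(3, 3)

def apply_homography(H, pts):
    """Map (x, y) points through H, dividing by the homogeneous coordinate."""
    pts = np.asarray(pts, dtype=float)
    homog = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = homog @ H.T
    return mapped[:, :2] / mapped[:, 2:3]

# Hypothetical torso quadrilateral (shoulders, then hips) in source-image pixels.
source_corners = [(40.0, 30.0), (120.0, 35.0), (110.0, 160.0), (50.0, 150.0)]
# Corners of a 64 x 64 template patch.
template_corners = [(0.0, 0.0), (63.0, 0.0), (63.0, 63.0), (0.0, 63.0)]

H_s2n = homography_from_corners(source_corners, template_corners)
normalized_corners = apply_homography(H_s2n, source_corners)
```

Warping the full patch would apply the same matrix to every pixel coordinate, for example with an image-warping routine from an imaging library.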

Moreover, the normalized patches $P_{n}$ can further be transformed to target garment patches $P_{t}$ by utilizing the target pose $J_{t}$, which can be obtained from the target person $I_{t}$ via [3]. The mechanism of that backward transformation is equivalent to the forward one in Eq. 1, i.e., computing the homography matrix $H_{n→t}^{i}$ based on the four point pairs extracted from the normalized patch $P_{n}^{i}$ and the target pose $J_{t}$. The recovered target patches $P_{t}$ can then be stitched to form the warped garment $G_{t}$ that will be sent to the texture synthesis branch in Fig. 2 to generate more realistic garment transfer results. We can also regard $H_{s→t} = H_{n→t}\cdot{}H_{s→n}$ as the combined deformation matrix that warps the source garment to the target person pose, bridged by an intermediate normalized patch representation that is helpful for disentangling garment styles and spatial features.

또한, 정규화된 패치 $P_{n}$는 [3]를 통해 대상자 $I_{t}$로부터 얻을 수 있는 목표 자세 $J_{t}$를 활용하여 목표 의류 패치 $P_{t}$로 추가로 변환될 수 있다. 그 역변환의 메커니즘은 Eq.1의 순방향 변환과 동일하다. 즉, 정규화된 패치 $P_{n}^{i}$와 대상 포즈 $J_{t}$에서 추출된 4개의 점 쌍을 기반으로 호모그래피 행렬 $H_{n→t}^{i}$를 계산한다. 복구된 타겟 패치 $P_{t}$는 보다 현실적인 의류 전달 결과를 생성하기 위해 도 2의 텍스처 합성 분기로 보내질 뒤틀린 의류 $G_{t}$을 형성하기 위해 스티치될 수 있다. $H_{s→t} = H_{n→t}\cdot{}H_{s→n}$는 또한 의류 스타일과 공간적 특징을 푸는 데 도움이 되는 중간 정규화된 패치 표현에 의해 연결된 소스 의류를 대상 사람 포즈로 뒤틀리는 결합 변형 매트릭스로 간주할 수 있다.
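The composition $H_{s→t} = H_{n→t}\cdot{}H_{s→n}$ can be checked numerically. The sketch below fits both homographies from four point pairs each and verifies that their product sends the source corners to the target corners. It uses plain NumPy with $h_{33} = 1$ as a simplifying assumption, and all quadrilateral coordinates are hypothetical.

```python
import numpy as np

def fit_homography(src, dst):
    """Fit a 3x3 homography from four point pairs, assuming h33 = 1."""
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1.0, 0.0, 0.0, 0.0, -u * x, -u * y])
        A.append([0.0, 0.0, 0.0, x, y, 1.0, -v * x, -v * y])
        b.extend([u, v])
    h = np.linalg.solve(np.array(A, dtype=float), np.array(b, dtype=float))
    return np.append(h, 1.0).reshape(3, 3)

def warp_points(H, pts):
    """Apply H to (x, y) points and divide by the homogeneous coordinate."""
    p = np.hstack([np.asarray(pts, dtype=float), np.ones((len(pts), 1))]) @ H.T
    return p[:, :2] / p[:, 2:3]

# Hypothetical corner sets: source patch, 64 x 64 template, target patch.
source_quad = [(40.0, 30.0), (120.0, 35.0), (110.0, 160.0), (50.0, 150.0)]
template_quad = [(0.0, 0.0), (63.0, 0.0), (63.0, 63.0), (0.0, 63.0)]
target_quad = [(60.0, 40.0), (130.0, 55.0), (115.0, 170.0), (45.0, 165.0)]

H_s2n = fit_homography(source_quad, template_quad)  # source -> normalized
H_n2t = fit_homography(template_quad, target_quad)  # normalized -> target
H_s2t = H_n2t @ H_s2n                               # combined deformation

# A homography fitted directly from source to target, for comparison.
H_direct = fit_homography(source_quad, target_quad)
```

The product agrees with the directly fitted matrix only up to a scale factor, which is why it is compared after dividing by its bottom-right entry.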

3.2 Attribute-decoupled Conditional StyleGAN2

Motivated by the impressive performance of StyleGAN2 [15] in the field of image synthesis, our PASTA-GAN inherits the main architecture of StyleGAN2 and modifies it into a conditional version (see Fig. 2). In the synthesis network, the normalized patches $P_{n}$ are projected to the style code $w$ through a style encoder followed by a mapping network; thanks to the disentanglement module, this style code is spatially agnostic. In parallel, the conditional information, including the target head $H_{t}$ and pose $J_{t}$, is passed through the identity encoder to produce a feature map $f_{id}$ that encodes the identity of the target person. Thereafter, the synthesis network starts from the identity feature map and leverages the style code as the injected vector for each synthesis block to generate the try-on result $\tilde{I_{t}}'$.

Style의 인상적인 공연에 의해 동기 부여됨이미지 합성 분야에서 GAN2[15], 우리의 PASTA-GAN은 스타일의 주요 아키텍처를 계승한다.GAN2를 조건부 버전으로 수정합니다(그림 2 참조). 합성 네트워크에서 정규화된 패치 $P_{n}$는 스타일 인코더를 통해 스타일 코드 w에 투영된 후 매핑 네트워크를 통해, 이는 분리 모듈의 이점을 공간 불가지론적으로 제공한다. 동시에, 타겟 헤드 $H_{t}$ 및 포즈 $J_{t}$를 포함하는 조건부 정보는 아이덴티티 인코더에 의해 타겟 인물의 아이덴티티를 인코딩하는 특징 맵 $f_{id}$로 전송된다. 그 후, 합성 네트워크는 아이덴티티 피처 맵에서 시작하여 스타일 코드를 각 합성 블록에 대한 주입된 벡터로 활용하여 트라이온 결과 $\tilde{I_{t}}’$를 생성한다.

However, the standalone conditional StyleGAN2 is insufficient to generate compelling garment details especially in the presence of complex textures or logos. For example, although the illustrated $\tilde{I_{t}}’$ in Fig. 2 can recover accurate garment style (color and shape) given the disentangled style code $w$, it lacks the complete texture pattern. The reasons for this are twofold: First, the style encoder projects the normalized patches into a one-dimensional vector, resulting in loss of high frequency information. Second, due to the large variety of garment texture, learning the local distribution of the particular garment details is highly challenging for the basic synthesis network.

그러나 독립형 조건부 스타일GAN2는 특히 복잡한 질감이나 로고가 있는 경우 매력적인 의류 세부 정보를 생성하기에 충분하지 않다. 예를 들어, 도 2에 도시된 $\tilde{I_{t}}’$는 분리된 스타일 코드 $w$가 주어졌을 때 정확한 의복 스타일(색상 및 모양)을 회복할 수 있지만, 완전한 텍스처 패턴이 부족하다. 그 이유는 두 가지가 있습니다. 첫째, 스타일 인코더는 정규화된 패치를 1차원 벡터로 투영하여 고주파 정보의 손실을 초래한다. 둘째, 의류 질감의 다양성이 크기 때문에 특정 의류 세부 사항의 국소 분포를 학습하는 것은 기본 합성 네트워크에 매우 어렵다.

To generate more accurate garment details, instead of only having a single-path synthesis network, we intentionally split PASTA-GAN into two branches after the 128 × 128 synthesis block, namely the Style Synthesis Branch (SSB) and the Texture Synthesis Branch (TSB). The SSB, with normal StyleGAN2 synthesis blocks, aims to generate intermediate try-on results $\tilde{I_{t}}'$ with accurate garment style and to predict a precise garment mask $M_{g}$ that will be used by the TSB. The purpose of the TSB is to exploit the warped garment $G_{t}$, which has rich texture information to guide the synthesis path, and to generate high-quality try-on results. We introduce a novel spatially-adaptive residual module before the final synthesis block of the TSB. It embeds the warped garment feature $f_{g}$ (obtained by passing $M_{g}$ and $G_{t}$ through the garment encoder) into the intermediate features and then sends them to the newly designed spatially-adaptive residual blocks, which help synthesize the texture of the final try-on result $I_{t}'$. The details of this module are described in the following section.

보다 정확한 의류 세부 정보를 생성하기 위해, 단방향 합성 네트워크만 갖는 대신, 우리는 의도적으로 128 x 128 합성 블록, 즉 스타일 합성 분기(SSB)와 텍스처 합성 분기(TSB)의 두 가지 분기로 PASTA-GAN을 분할했다. 일반 스타일의 SSBGAN2 합성 블록은 정확한 의복 스타일로 중간 시도 결과 $\tilde{I_{t}}’$를 생성하고 TSB가 사용할 정밀한 의복 마스크 $M_{g}$를 예측하는 것을 목표로 한다. TSB의 목적은 합성 경로를 안내하기 위해 풍부한 텍스처 정보를 가진 뒤틀린 의복 $G_{t}$를 활용하고 고품질의 트라이온 결과를 생성하는 것이다. 우리는 특히 TSB의 최종 합성 블록 이전에 새로운 공간 적응형 잔류 모듈을 도입하여 뒤틀린 의복 특징 $f_{g}$(의류 인코더를 통해 $M_{g}$와 $G_{t}$를 통과하여 얻은)를 중간 특징에 내장한 다음 성공에 유리한 새로 설계된 공간 적응형 잔류 블록으로 보낸다.최종 시험 결과 $I_{t}’$의 텍스처를 완전히 합성한다. 이 모듈에 대한 자세한 내용은 다음 섹션에서 설명합니다.

3.3 Spatially-adaptive Residual Module

Given the style code that factors out the spatial information and only keeps the style information of the garment, the style synthesis branch in Fig. 2 can accurately predict the mean color and the shape mask of the target garment. However, its inability to model the complex texture raises the need to exploit the warped garment $G_{t}$ to provide features that encode high-frequency texture patterns, which is in fact the motivation of the target garment reconstruction in Fig. 3.

그림 2의 스타일 합성 분기는 공간 정보를 요소화하여 의류의 스타일 정보만 유지하는 스타일 코드를 고려할 때, 대상 의류의 평균 색상과 형태 마스크를 정확하게 예측할 수 있다. 그러나, 복잡한 텍스처를 모델링할 수 없는 것은 뒤틀린 의복 $G_{t}$를 활용하여 고주파 텍스처 패턴을 인코딩하는 기능을 제공할 필요성을 제기하며, 이는 사실 그림 3에서 대상 의복 재구성의 동기이다.

Figure 4: Illustration of the misalignment between the warped garment and the target garment shape. The orange and green regions represent the regions to be inpainted and to be removed, respectively.

그림 4: 뒤틀린 의복과 대상 의복 형태 사이의 정렬 불량의 그림. 주황색 및 녹색 영역은 각각 도장 및 제거할 영역을 나타냅니다.

However, as the coarse warped garment $G_{t}$ is directly obtained by stitching the target patches together, its shape is inaccurate and is usually misaligned with the predicted mask $M_{g}$ (see Fig. 4). Such shape misalignment in $G_{t}$ consequently reduces the quality of the extracted warped garment feature $f_{g}$.

그러나, 거친 비뚤어진 의복 $G_{t}$는 대상 패치를 함께 꿰매서 직접 얻기 때문에, 그 모양이 부정확하고, 대개 예측된 마스크 $M_{g}$와 정렬이 잘못된다(그림 4 참조). $G_{t}$에서 이러한 형상 정렬 오류는 결과적으로 추출된 뒤틀린 의복 특징 $f_{g}$의 품질을 저하시킬 것이다.

To address this issue, we introduce the spatially-adaptive residual module between the last two synthesis blocks in the texture synthesis branch, as shown in Fig. 2. This module comprises a garment encoder and three spatially-adaptive residual blocks with a feature inpainting mechanism, which modulate the intermediate features by leveraging the inpainted warped garment feature.

이 문제를 해결하기 위해, 우리는 그림 2와 같이 텍스처 합성 분기의 마지막 두 합성 블록 사이에 공간 적응형 잔류 모듈을 도입한다. 이 모듈은 의류 인코더와 도장된 뒤틀린 의복 기능을 활용하여 중간 기능을 변조하는 기능 인페인팅 메커니즘이 있는 3개의 공간 적응형 잔류 블록으로 구성됩니다.

More specifically, in the feature inpainting process we first remove the part of $G_{t}$ that falls outside of $M_{g}$ (green region in Fig. 4), and explicitly inpaint the misaligned regions of the feature map within $M_{g}$ with average feature values (orange region in Fig. 4). The inpainted feature map can then help the final synthesis block infer reasonable texture in the misaligned parts inside $M_{g}$.

형상 인 페인팅 공정을 구체적으로 설명하기 위해, 먼저 $M_{g}$의 외부에 있는 $G_{t}$ 부분(그림 4의 녹색 영역)을 제거하고, $M_{g}$ 내 형상 맵의 잘못 정렬된 영역을 평균 형상 값(그림 4의 주황색 영역)으로 명시적으로 페인팅한다. 그런 다음 인페인팅된 피처 맵은 최종 합성 블록이 정렬되지 않은 내부 부품에서 합리적인 텍스처를 추론하는 데 도움이 될 수 있다.

Therefore given the predicted garment mask $M_{g}$, the coarse warped garment $G_{t}$ and its mask $M_{t}$, the process of feature inpainting can be formulated as:

따라서 예측된 의복 마스크 $M_{g}$, 거친 뒤틀림 의복 $G_{t}$ 및 그 마스크 $M_{t}$를 고려할 때, 특징 인 페인팅 프로세스는 다음과 같이 공식화될 수 있습니다.
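A form consistent with the description above and with the definitions that follow is sketched here. The auxiliary masks $M_{align}$ and $M_{misalign}$ are notation assumed for this sketch rather than taken from the text, and the masks are assumed to be resized to the resolution of the feature map:

\[\begin{aligned} M_{align} &= M_{g} \cap M_{t}, \qquad M_{misalign} = M_{g} - M_{align},\\ f_{g}' &= E_{g}(G_{t} \odot M_{g}),\\ f_{g} &= f_{g}' \odot (1 - M_{misalign}) + A(f_{g}' \odot M_{align}) \odot M_{misalign}, \end{aligned}\]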

where $E_{g}(·)$ represents the garment encoder and $f_{g}’$ denotes the raw feature map of $G_{t}$ masked by $M_{g}$. $A(·)$ calculates the average garment features and $f_{g}$ is the final inpainted feature map.

여기서 $E_{g}(·)$는 의복 인코더를 나타내고 $f_{g}’$는 $M_{g}$로 마스킹된 $G_{t}$의 원시 형상 맵을 나타냅니다. $A(·)$는 평균 의복 형상을 계산하고 $f_{g}$는 도장된 최종 형상 지도입니다.
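The feature inpainting step can be illustrated on toy arrays. This is a minimal sketch under stated assumptions: the garment encoder $E_{g}$ is replaced by a given feature array, the masks are binary and already at feature resolution, and the average $A(·)$ is taken per channel over the region where both masks agree. The function and variable names are hypothetical.

```python
import numpy as np

def inpaint_garment_feature(feature, mask_pred, mask_warp):
    """Inpaint misaligned regions of a (C, H, W) garment feature map.

    feature   : stand-in for the encoded warped garment (assumption)
    mask_pred : predicted garment mask M_g, shape (H, W), values in {0, 1}
    mask_warp : mask M_t of the warped garment G_t, shape (H, W)
    """
    aligned = mask_pred * mask_warp              # inside M_g and covered by G_t
    misaligned = mask_pred * (1.0 - mask_warp)   # inside M_g, not covered (orange)
    raw = feature * mask_pred                    # drop what lies outside M_g (green)
    denom = max(float(aligned.sum()), 1.0)       # guard against an empty overlap
    mean = (raw * aligned).sum(axis=(1, 2)) / denom  # per-channel average feature
    return raw * (1.0 - misaligned) + mean[:, None, None] * misaligned

# Toy example: M_g covers columns 1-2, the warped garment covers columns 0-1.
mask_pred = np.zeros((4, 4))
mask_pred[:, 1:3] = 1.0
mask_warp = np.zeros((4, 4))
mask_warp[:, 0:2] = 1.0
feature = np.stack([2.0 * mask_warp, 4.0 * mask_warp])  # two channels

inpainted = inpaint_garment_feature(feature, mask_pred, mask_warp)
```

In this toy case column 0 is removed, column 1 keeps its original values, and column 2 is filled with the per-channel averages.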

Subsequently, inspired by the SPADE ResBlk from SPADE [26], the inpainted garment features are used to calculate a set of affine transformation parameters that efficiently modulate the normalized feature map within each spatially-adaptive residual block. The normalization and modulation process for a particular sample $h_{z,y,x}$ at location $(z ∈ C, y ∈ H, x ∈ W)$ in a feature map can then be formulated as:

이후 SPADE의 SPADE ResBlk에서 영감을 받아 [26], 도장된 의복 특징은 각 공간 적응형 잔차 블록 내에서 정규화된 특징 맵을 효율적으로 변조하는 일련의 아핀 변환 매개 변수를 계산하는 데 사용된다. 형상 지도의 위치 $(z ∈ C, y ∈ H, x ∈ W)$에서 특정 샘플 $h_{z,y,x}$에 대한 정규화 및 변조 프로세스는 다음과 같이 공식화될 수 있다.
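Assuming the standard SPADE formulation and the symbols defined immediately after this point, the modulated activation at location $(z, y, x)$ can be written as:

\[\gamma_{z,y,x}(f_{g})\,\frac{h_{z,y,x} - \mu_{z}}{\sigma_{z}} + \beta_{z,y,x}(f_{g}),\]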

where $µ_{z} =\frac{1}{HW}\sum_{y,x} h_{z,y,x}$ and $σ_{z}=\sqrt{\frac{1}{HW}\sum_{y,x} (h_{z,y,x} − µ_{z})^{2}}$ are the mean and standard deviation of the feature map along channel $C$. $γ_{z,y,x}(·)$ and $β_{z,y,x}(·)$ are the convolution operations that convert the inpainted feature to affine parameters.

3.4 Loss Functions and Training Details

As paired training data is unavailable, our PASTA-GAN is trained in an unsupervised manner via image reconstruction. During training, we apply the reconstruction loss $L_{rec}$ and the perceptual loss [14] $L_{perc}$ to both the coarse try-on result $\tilde{I}'$ and the final try-on result $I'$:

\[L_{rec}=\sum_{I\in\{\tilde{I}',I'\}}\|I - I_{s}\|_{1}\quad\text{and}\quad L_{perc}=\sum_{I\in\{\tilde{I}',I'\}}\sum_{k=1}^{5}\lambda_{k}\|\phi_{k}(I) - \phi_{k}(I_{s})\|_{1},\]

where $φ_{k}(I)$ denotes the $k$-th feature map in a VGG-19 network [34] pre-trained on the ImageNet [30] dataset. We also use the L1 loss between the predicted garment mask $M_{g}$ and the ground-truth mask $M_{gt}$, which is obtained via human parsing [11]:

\[L_{mask} = \|M_{g} - M_{gt}\|_{1}.\]

In addition, for both $\tilde{I}'$ and $I'$, we calculate the adversarial loss $L_{GAN}$, which is the same as in StyleGAN2 [15]. The total loss can be formulated as

\[L = L_{GAN} + \lambda_{rec}L_{rec} + \lambda_{perc}L_{perc} + \lambda_{mask}L_{mask},\]

where $λ_{rec}$, $λ_{perc}$, and $λ_{mask}$ are the trade-off hyper-parameters.
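As a rough illustration of how these terms combine, the sketch below computes the reconstruction term on toy data and forms the weighted sum. It is not the training code: images are flat lists of floats, the adversarial, perceptual, and mask terms are passed in as plain numbers (computing them needs a discriminator, a VGG-19 network, and a parsing mask), and the helper names `l1` and `total_loss` are hypothetical. The default weights of 40, 40, and 100 are the values reported in the implementation details. Note that $\|\cdot\|_{1}$ is taken literally as a sum of absolute differences here, whereas many implementations average instead.

```python
def l1(a, b):
    """Sum of absolute differences between two equally sized flat lists."""
    return sum(abs(x - y) for x, y in zip(a, b))


def total_loss(l_gan, l_rec, l_perc, l_mask,
               lam_rec=40.0, lam_perc=40.0, lam_mask=100.0):
    """Weighted sum L = L_GAN + lam_rec*L_rec + lam_perc*L_perc + lam_mask*L_mask."""
    return l_gan + lam_rec * l_rec + lam_perc * l_perc + lam_mask * l_mask


# Toy "images": the source image I_s, the coarse result, and the final result.
i_s = [0.0, 1.0]
coarse = [0.5, 1.0]
final = [0.0, 0.75]

# L_rec sums the L1 distance to I_s over both try-on results.
l_rec = l1(coarse, i_s) + l1(final, i_s)
loss = total_loss(l_gan=1.0, l_rec=l_rec, l_perc=0.0, l_mask=0.0)
```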

Figure 5: Comparison among the source garment and different warped garments.

During training, although the source and target poses are the same, the coarse warped garment $G_{t}$ is not identical to the intact source garment $G_{s}$, due to the crop mechanism in the patch-routed disentanglement module. More specifically, the quadrilateral crop of $G_{s}$ is by design not seamless, so there are often small seams between adjacent patches in $G_{t}$ as well as missing regions along the boundary of the torso. To further reduce the training-test gap of the warped garment, we introduce two random erasing operations during the training phase. First, we randomly remove one of the four arm patches in the warped garment with a probability of $α_{1}$. Second, we use the random mask from [19] to additionally erase parts of the warped garment with a probability of $α_{2}$. Both erasing operations imitate self-occlusion in the source person image. Fig. 5 illustrates the process by displaying the source garment $G_{s}$, the warped garment $G_{t}'$ that is obtained by directly stitching the warped patches together, and the warped garment $G_{t}$ that is sent to the network. We can observe a considerable difference between $G_{t}$ and $G_{s}$. An ablation experiment that validates the necessity of the random erasing operations for unsupervised training is included in the supplementary material.
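The two erasing operations can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' code: patches are held in a dictionary, the four arm patch names are hypothetical, and the random mask of [19] is not implemented, so the function only returns a flag saying whether that mask should be applied. The default probabilities 0.2 and 0.9 are the values of $α_{1}$ and $α_{2}$ reported in the implementation details.

```python
import random

# Hypothetical names for the four arm patches; the text does not name them.
ARM_PATCHES = ("left_upper_arm", "left_lower_arm",
               "right_upper_arm", "right_lower_arm")


def random_erase(patches, alpha1=0.2, alpha2=0.9, rng=random):
    """Return (kept_patches, apply_mask) for one training sample.

    patches: dict mapping patch name -> patch data.
    apply_mask: whether the random mask of [19] should additionally be applied
                (the mask itself is outside the scope of this sketch).
    """
    kept = dict(patches)
    # Operation 1: with probability alpha1, drop one arm patch chosen at random.
    if rng.random() < alpha1:
        present = [name for name in ARM_PATCHES if name in kept]
        if present:
            kept.pop(rng.choice(present))
    # Operation 2: with probability alpha2, request the additional random mask.
    apply_mask = rng.random() < alpha2
    return kept, apply_mask


patches = {name: object() for name in ARM_PATCHES + ("torso",)}
# Forcing the probabilities makes the behaviour deterministic for inspection.
always_drop, mask_off = random_erase(patches, alpha1=1.0, alpha2=0.0)
never_drop, mask_on = random_erase(patches, alpha1=0.0, alpha2=1.0)
```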

4 Experiments

Datasets. We conduct experiments on two existing virtual try-on benchmark datasets (MPV [6] and DeepFashion [21]) and on our newly collected large-scale benchmark dataset for unpaired try-on, named UPT. UPT contains 33,254 half- and full-body front-view images of persons wearing a large variety of garments, e.g., long/short sleeve, vest, sling, pants, etc. UPT is split into a training set of 27,139 images and a testing set of 6,115 images. In addition, we select the front-view images from MPV [6] and DeepFashion [21] to expand our training and testing sets to 54,714 and 10,493 images, respectively. Personally identifiable information (i.e., face information) has been masked out.

Metrics. We apply the Fréchet Inception Distance (FID) [27] to measure the similarity between real and synthesized images, and perform a human evaluation to quantitatively evaluate the synthesis quality of different methods. For the human evaluation, we design three questionnaires corresponding to the three datasets used. In each questionnaire, we randomly select 40 try-on results generated by our PASTA-GAN and the other compared methods. Then, we invite 30 volunteers to complete the 40 tasks by choosing the most realistic try-on result in each. Finally, the human evaluation score of a method is calculated as the percentage of choices in which it was selected.
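The human evaluation score described above is a simple vote share. The sketch below shows one way to compute it, assuming each vote is recorded as the name of the method a volunteer picked for one task. The votes in the example are toy data for illustration only, not results from the study, and the function name is hypothetical. With 30 volunteers and 40 tasks, a questionnaire would yield 1,200 votes if every volunteer answers every task.

```python
from collections import Counter


def human_eval_scores(votes, methods):
    """Percentage of votes received by each method.

    votes: list of method names, one entry per (volunteer, task) choice.
    methods: names of all compared methods (methods with no votes score 0.0).
    """
    counts = Counter(votes)
    total = len(votes)
    return {m: 100.0 * counts[m] / total for m in methods}


# Toy votes for illustration only.
votes = ["PASTA-GAN", "PASTA-GAN", "PFAFN", "PASTA-GAN"]
scores = human_eval_scores(votes, ["PASTA-GAN", "PFAFN", "ACGPN"])
```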

Implementation Details. Our PASTA-GAN is implemented using PyTorch [28] and is trained on 8 Tesla V100 GPUs. During training, the batch size is set to 96 and the model is trained for 4 million iterations with a learning rate of 0.002 using the Adam optimizer [16] with $β_{1} = 0$ and $β_{2} = 0.99$. The loss hyper-parameters $λ_{rec}$, $λ_{perc}$, and $λ_{mask}$ are set to 40, 40, and 100, respectively. The hyper-parameters for the random erasing probabilities $α_{1}$ and $α_{2}$ are set to 0.2 and 0.9, respectively.

Figure 6: Visual comparison among PASTA-GAN and the baseline methods under the unpaired setting on the UPT dataset. Please zoom in for more details.

Baselines. To validate the effectiveness of our PASTA-GAN, we compare it with state-of-the-art methods for which official code and pre-trained weights have been released, including three paired virtual try-on methods, CP-VTON [35], ACGPN [37], and PFAFN [10], and two unpaired methods, Liquid Warping GAN [20] and ADGAN [23]. We directly use the pre-trained models of these methods, as their training procedures depend on paired garment-person or person-person images, which are unavailable in our dataset. When testing paired methods under the unpaired try-on setting, we extract the desired garment from the person image and treat it as the in-shop garment required by the paired approaches. To compare fairly with the paired methods, we further conduct another experiment on the paired MPV dataset [6], in which the paired methods take an in-shop garment and a person image as inputs, while our PASTA-GAN still directly receives two person images. See the following two subsections for detailed comparisons in both the paired and unpaired settings.

4.1 Comparison with the state-of-the-art methods on unpaired benchmark

Quantitative: As reported in Table 1, when testing on the DeepFashion [21] and the UPT dataset under the unpaired setting, our PASTA-GAN outperforms both the paired methods [35, 37, 10] and the unpaired methods [23, 20] by a large margin, obtaining the lowest FID score and the highest human evaluation score, demonstrating that PASTA-GAN can generate more photo-realistic images. Note that, although ADGAN [23] is trained on the DeepFashion dataset, our PASTA-GAN still surpasses it. Since the data in the DeepFashion dataset is more complicated than the data in UPT, the FID scores for the DeepFashion dataset are generally higher than the FID scores for the UPT dataset.

Qualitative: As shown in Fig. 6, under the unpaired setting, PASTA-GAN is capable of generating more realistic and accurate try-on results. On the one hand, paired methods [35, 37, 10] tend to fail in deforming the cropped garment to the target shape, resulting in a distorted warped garment that is largely misaligned with the target body part. On the other hand, the unpaired method ADGAN [23] cannot preserve the garment texture and the person identity well due to its severe overfitting on the DeepFashion dataset. Liquid Warping GAN [20], another publicly available appearance transfer model, relies heavily on the 3D body model SMPL [22] to obtain the appearance transfer flow. It is sensitive to the prediction accuracy of the SMPL parameters, and is thus prone to incorrectly transferring the appearance of other body parts (e.g., hand, lower body) into the garment region when the SMPL predictions are inaccurate. In comparison, benefiting from the patch-routed mechanism, PASTA-GAN can learn appropriate garment features and predict a precise garment shape. Further, the spatially-adaptive residual module can leverage the warped garment feature to guide the network to synthesize try-on results with realistic garment textures. Note that, in the top-left example of Fig. 6, our PASTA-GAN seems to smooth out the belt region. The reason for this is a parsing error. Specifically, the human parsing model [18] that was used does not designate a label for the belt, and the parsing estimator [11] will therefore assign some other label to the belt region (e.g., pants, upper clothes, or background). In this particular example, the belt region is assigned the background label. This means that the pants obtained according to the predicted human parsing do not contain the belt, which is therefore absent from the normalized patches and the warped pants. The style synthesis branch then predicts the precise mask for the pants (including the belt region), and the texture synthesis branch inpaints the belt region in white according to the features of the pants.

Table 2: The FID score [27] and human evaluation score among different methods under their corresponding test setting on the MPV dataset [6].

Figure 7: Visual comparison among PASTA-GAN and the paired baseline methods under their corresponding test setting on the MPV dataset [6]. Please zoom in for more details.

4.2 Comparison with the state-of-the-art methods on paired benchmark

Quantitative: Table 2 reports the quantitative comparison on the MPV dataset [6], in which the paired methods are tested under the classical paired setting, i.e., transferring an in-shop garment onto a reference person. Our unpaired PASTA-GAN nevertheless surpasses the paired methods, including the state-of-the-art PFAFN [10], in both FID and human evaluation score, further evidencing the superiority of our PASTA-GAN.

Qualitative: Under the paired setting, the visual quality of the paired methods improves considerably, as shown in Fig. 7. The paired methods depend on TPS-based or flow-based warping architectures to deform the whole garment, which may distort texture and shape, since global interpolation and pixel-level correspondence are error-prone under large pose variation. Our PASTA-GAN, instead, warps semantic garment patches separately to alleviate the distortion and preserve the original garment texture to a larger extent. Moreover, the paired methods are unable to handle garments such as slings that are rarely present in the dataset, and they perform poorly on full-body images. Our PASTA-GAN instead generates compelling results even in these challenging scenarios.

4.3 Ablation Studies

Patch-routed Disentanglement Module: To validate its effectiveness, we train two PASTA-GANs without the texture synthesis branch, denoted as PASTA-GAN⋆ and PASTA-GAN∗, which take the intact garment and the garment patches as input of the style encoder, respectively. As shown in Fig. 8, PASTA-GAN⋆ fails to generate an accurate garment shape. In contrast, PASTA-GAN∗, which factors out the spatial information of the garment, can focus more on the garment style information, leading to an accurate synthesis of the garment shape. However, without the texture synthesis branch, both of them are unable to synthesize the detailed garment texture. The models with the texture synthesis branch preserve the garment texture well, as illustrated in Fig. 8.

Figure 8: Qualitative results and quantitative results of the ablation study with different configurations, in which SSB, TSB, GP, NRB, SRB refer to style synthesis branch, texture synthesis branch, garment patches, normal residual blocks, and spatially-adaptive residual blocks, respectively.

Spatially-adaptive Residual Module: To validate the effectiveness of this module, we further train two PASTA-GANs with the texture synthesis branch, denoted as PASTA-GAN† and PASTA-GAN‡, which exclude the style synthesis branch and replace the spatially-adaptive residual blocks with normal residual blocks, respectively. Without the support of the corresponding components, both PASTA-GAN† and PASTA-GAN‡ fail to fix the garment misalignment problem, leading to artifacts outside the target shape and blurred texture synthesis results. The full PASTA-GAN instead generates try-on results with precise garment shape and texture details. The quantitative comparison results in Fig. 8 further validate the effectiveness of our designed modules.

5 Conclusion

We propose the PAtch-routed SpaTially-Adaptive GAN (PASTA-GAN) towards facilitating scalable unpaired virtual try-on. By utilizing the novel patch-routed disentanglement module and the spatially-adaptive residual module, PASTA-GAN effectively disentangles garment style and spatial information and generates realistic and accurate virtual try-on results without requiring auxiliary data or extensive online optimization procedures. Experiments highlight PASTA-GAN's ability to handle a large variety of garments, outperforming previous methods in both the paired and the unpaired setting.

We believe that this work will inspire new scalable approaches, facilitating the use of the large amount of available unlabeled data. However, as with most generative applications, misuse of these techniques is possible in the form of image forgeries, i.e. warping of unwanted garments with malicious intent.

Acknowledgments and Disclosure of Funding

We would like to thank all the reviewers for their constructive comments. Our work was supported in part by the National Key R&D Program of China under Grant No. 2018AAA0100300, the National Natural Science Foundation of China (NSFC) under Grant No. U19A2073 and No. 61976233, the Guangdong Province Basic and Applied Basic Research (Regional Joint Fund-Key) Grant No. 2019B1515120039, the Guangdong Outstanding Youth Fund (Grant No. 2021B1515020061), the Shenzhen Fundamental Research Program (Project No. RCYX20200714114642083, No. JCYJ20190807154211365), and the CSIG Youth Fund.
