(GAN)Controllable Person Image Synthesis with Attribute-Decomposed GAN Translation

$\mathbf{Controllable\;Person\;Image\;Synthesis\;with\;Attribute-Decomposed\;GAN}$

$\mathbf{Yifang\;Men,\;Yiming\;Mao,\;Yuning\;Jiang,\;Wei-Ying\;Ma,\;Zhouhui\;Lian}$

$\mathbf{Wangxuan\;Institute\;of\;Computer\;Technology,\;Peking\;University,\;China}$

$\mathbf{Bytedance\;AI\;Lab}$

$\mathbf{Abstract}$

This paper introduces the Attribute-Decomposed GAN, a novel generative model for controllable person image synthesis, which can produce realistic person images with desired human attributes (e.g., pose, head, upper clothes and pants) provided in various source inputs. The core idea of the proposed model is to embed human attributes into the latent space as independent codes and thus achieve flexible and continuous control of attributes via mixing and interpolation operations in explicit style representations. Specifically, a new architecture consisting of two encoding pathways with style block connections is proposed to decompose the original hard mapping into multiple more accessible subtasks. In the source pathway, we further extract component layouts with an off-the-shelf human parser and feed them into a shared global texture encoder to obtain decomposed latent codes. This strategy allows for the synthesis of more realistic output images and automatic separation of un-annotated attributes. Experimental results demonstrate the proposed method's superiority over the state of the art in pose transfer and its effectiveness in the brand-new task of component attribute transfer.

$\mathbf{1.\;Introduction}$

Person image synthesis (PIS), a challenging problem in areas of Computer Vision and Computer Graphics, has huge potential applications for image editing, movie making, person re-identification (Re-ID), virtual clothes try-on and so on. An essential task of this topic is pose-guided image generation [23, 24, 9, 33], rendering the photo-realistic images of people in arbitrary poses, which has become a new hot topic in the community. Actually, not only poses but also many other valuable human attributes can be used to guide the synthesis process.

Figure 1: Controllable person image synthesis with desired human attributes provided by multiple source images. Human attributes, including pose and component attributes, are embedded into the latent space as the pose code and the decomposed style code. Target person images can be generated under user control by editing the style code.

In this paper, we propose a brand-new task that aims at synthesizing person images with controllable human attributes, including pose and component attributes such as head, upper clothes and pants. As depicted in Figure 1, users can input multiple source person images, each providing one of the desired human attributes. The proposed model embeds component attributes into the latent space to construct the style code and encodes the keypoint-based 2D skeleton extracted from the person image as the pose code, which enables intuitive component-specific (pose) control of the synthesis by freely editing the style (pose) code. Thus, our method can automatically synthesize high-quality person images with desired component attributes under arbitrary poses, and can be widely applied not only in pose transfer and Re-ID, but also in garment transfer and attribute-specific data augmentation (e.g., clothes commodity retrieval and recognition).

Due to the insufficiency of annotation for human attributes, the simplicity of the keypoint representation and the diversity of person appearances, it is challenging to achieve the goal mentioned above using existing methods. Pose transfer methods, first proposed by [23] and later extended by [24, 9, 33, 46], mainly focus on pose-guided person image synthesis and do not provide user control of human attributes such as head, pants and upper clothes. Moreover, because of the non-rigid nature of the human body, it is difficult to directly transform spatially misaligned body parts via convolutional neural networks, and thus these methods are unable to produce satisfactory results. Appearance transfer methods [40, 38, 28] allow users to transfer clothes from one person to another by estimating a complicated 3D human mesh and warping the textures to fit the body topology. Yet these methods fail to model the intricate interplay of inherent shape and appearance, and lead to unrealistic results with deformed textures. Another type of appearance transfer method [30, 20, 45] tries to model clothing textures by feeding the entire source person image into neural networks, but such methods cannot transfer human attributes from multiple source person images and lack the capability of component-level clothing editing.

The notion of attribute editing is commonly used in the field of facial attribute manipulation [14, 41, 39], but to the best of our knowledge this work is the first to achieve attribute editing in the task of person image synthesis. Unlike previous facial attribute editing methods, which require strict attribute annotation (e.g., whether smiling, beard and eyeglasses are present in the training dataset), the proposed method does not need any annotation of component attributes and enables automatic and unsupervised attribute separation via delicately designed modules. In another respect, our model is trained with only a partial observation of the person and needs to infer the unobserved body parts to synthesize images in different poses and views. This is more challenging than motion imitation methods [6, 1, 35], which either use all characters performing the same series of motions to disentangle appearance and pose, or train one model per character by learning a mapping from 2D pose to one specific domain.

To address the aforementioned challenges, we propose a novel controllable person image synthesis method via an Attribute-Decomposed GAN. In contrast to previous works [23, 3, 33], which forcibly learn a mapping from concatenated conditions to the target image, we introduce a new generator architecture with two independent pathways, one for pose encoding and the other for decomposed component encoding. For the latter, our model first separates component attributes automatically from the source person image via its semantic layouts, which are extracted with a pretrained human parser. Component layouts are fed into a global texture encoder with multi-branch embeddings, and their latent codes are recombined in a specific order to construct the style code. Then the cascaded style blocks, acting as a connection between the two pathways, inject the component attributes represented by the style code into the pose code by controlling the affine transform parameters of the AdaIN layer. Eventually, the desired image is reconstructed from the target features. In summary, our contributions are threefold:

We propose a brand-new task that synthesizes person images with controllable human attributes by directly providing different source person images, and solve it by modeling the intricate interplay of the inherent pose and component-level attributes.

We introduce the Attribute-Decomposed GAN, a neat and effective model that achieves not only flexible and continuous user control of human attributes, but also a significant quality boost for the original PIS task.

We tackle the challenge of insufficient annotation for human attributes by utilizing an off-the-shelf human parser to extract component layouts, which yields an automatic separation of component attributes.

$\mathbf{2.\;Related\;Work}$

$\mathbf{2.1.\;Image\;Synthesis}$

Due to their remarkable results, Generative Adversarial Networks (GANs) [13] have become powerful generative models for image synthesis [16, 44, 4] in the last few years. The image-to-image translation task was solved with conditional GANs [26] in Pix2pix [16] and extended to high resolution in Pix2pixHD [36]. Zhu et al. [44] introduced an unsupervised method, CycleGAN, which exploits cycle consistency to translate images between two domains using unlabeled images. Much of the subsequent work focused on improving the quality of GAN-synthesized images through stacked architectures [43, 27], more interpretable latent representations [7] or self-attention mechanisms [42]. StyleGAN [18] synthesized impressive images by proposing a brand-new generator architecture that controls the generator via adaptive instance normalization (AdaIN) [15], an outcome of the style transfer literature [10, 11, 17]. However, these techniques have limited scalability in handling attribute-guided person synthesis, due to complex appearances and simple poses with only several keypoints. Our method, built on GANs, overcomes these challenges with a novel generator architecture designed around attribute decomposition.

$\mathbf{2.2.\;Person\;Image\;Synthesis}$

Up to now, many techniques have been proposed to synthesize person images in arbitrary poses using adversarial learning. PG2 [23] first proposed a two-stage GAN architecture to generate person images, in which the person with the target pose is coarsely synthesized in the first stage and then refined in the second stage. Esser et al. [9] leveraged a variational autoencoder combined with the conditional U-Net [31] to model the inherent shape and appearance. Siarohin et al. [33] used a U-Net based generator with deformable skip connections to alleviate the pixel-to-pixel misalignments caused by pose differences. A later work by Zhu et al. [46] introduced cascaded Pose-Attentional Transfer Blocks into the generator to guide the deformable transfer process progressively. [29, 34] utilized a bidirectional strategy for synthesizing person images in an unsupervised manner. However, these methods focus only on transferring the pose of the target image to the reference person, whereas our method achieves controllable person image synthesis in which not only the pose but also component attributes (e.g., head, upper clothes and pants) are controlled. Moreover, it produces more realistic person images with coherent textures and consistent identity.

Figure 2: An overview of the network architecture of our generator. The target pose and source person are embedded into the latent space via two independent pathways, called pose encoding and decomposed component encoding, respectively. For the latter, we employ a human parser to separate component attributes and encode them via a global texture encoder. A series of style blocks equipped with a fusion module are introduced to inject the texture style of source person into the pose code by controlling the affine transform parameters in AdaIN layers. Finally, the desired image is reconstructed via a decoder.

$\mathbf{3.\;Method\;Description}$

Our goal is to synthesize high-quality person images with user-controlled human attributes, such as pose, head, upper clothes and pants. Unlike previous attribute editing methods [14, 39, 41], which require labeled data with a binary annotation for each attribute, our model achieves automatic and unsupervised separation of component attributes by introducing a well-designed generator. Thus, we only need a dataset that contains person images $\{I\in\mathbb{R}^{3\times H\times W}\}$ with each person in several poses. The corresponding keypoint-based pose $P\in\mathbb{R}^{18\times H\times W}$ of $I$, an 18-channel heat map that encodes the locations of 18 joints of the human body, can be automatically extracted via an existing pose estimation method [5]. During training, a target pose $P_{t}$ and a source person image $I_{s}$ are fed into the generator, and the synthesized image $I_{g}$, which follows the appearance of $I_{s}$ but takes the pose $P_{t}$, is challenged for realness by the discriminators. In the following, we give a detailed description of each part of our model.
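
As a minimal sketch of the pose representation described above, the following NumPy snippet builds an 18-channel map from 2D joint coordinates. The function name `keypoints_to_heatmap`, the disk radius, the 256×176 image size and the convention of marking an undetected joint with negative coordinates are illustrative assumptions; the text only states that $P$ has one channel per joint.

```python
import numpy as np

def keypoints_to_heatmap(keypoints, height, width, radius=4):
    """Return an (18, H, W) binary map with a small disk around each joint.

    keypoints: array of shape (18, 2) holding (x, y) pixel coordinates.
    A joint with a negative coordinate is treated as undetected (assumption).
    """
    keypoints = np.asarray(keypoints, dtype=np.float64)
    ys, xs = np.mgrid[0:height, 0:width]
    heatmap = np.zeros((keypoints.shape[0], height, width), dtype=np.float32)
    for j, (x, y) in enumerate(keypoints):
        if x < 0 or y < 0:
            continue  # leave the channel empty for a missing joint
        heatmap[j] = (xs - x) ** 2 + (ys - y) ** 2 <= radius ** 2
    return heatmap

# Random joints stand in for the output of a real pose estimator.
rng = np.random.default_rng(0)
kps = np.stack([rng.integers(0, 176, size=18), rng.integers(0, 256, size=18)], axis=1)
kps[3] = (-1, -1)  # pretend joint 3 was not detected
P_t = keypoints_to_heatmap(kps, height=256, width=176)
print(P_t.shape)
```

A Gaussian blob per joint would be an equally plausible encoding; the binary disk is used here only to keep the example short.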

$\mathbf{3.1.\;Generator}$

Figure 2 shows the architecture of our generator, whose inputs are the target pose $P_{t}$ and the source person image $I_{s}$, and whose output is the generated image $I_{g}$ showing the source person $I_{s}$ in the target pose $P_{t}$. Unlike the generator in [23], which directly concatenates the source image and the target condition as input to a U-Net architecture and forcibly learns a result under the supervision of the target image $I_{t}$, our generator embeds the target pose $P_{t}$ and the source person $I_{s}$ into two latent codes via two independent pathways, called pose encoding and decomposed component encoding, respectively. These two pathways are connected by a series of style blocks, which inject the texture style of the source person into the pose feature. Finally, the desired person image $I_{g}$ is reconstructed from the target features by a decoder.

$\mathbf{3.1.1\;Pose\;encoding}$

In the pose pathway, the target pose $P_{t}$ is embedded into the latent space as the pose code $C_{pose}$ by a pose encoder, which consists of $N$ down-sampling convolutional layers ($N=2$ in our case), following the regular configuration of encoder.

Figure 3: Details of the texture encoder in our generator. A global texture encoding is introduced by concatenating the output of learnable encoder and fixed VGG encoder.

$\mathbf{3.1.2\;Decomposed\;component\;encoding}$

In the source pathway, the source person image $I_{s}$ is embedded into the latent space as the style code $C_{sty}$ via a module called decomposed component encoding (DCE). As depicted in Figure 2, this module first extracts the semantic map $S$ of the source person $I_{s}$ with an existing human parser [12] and converts $S$ into a $K$-channel heat map $M\in\mathbb{R}^{K\times H\times W}$. For each channel $i$, there is a binary mask $M_{i}\in\mathbb{R}^{H\times W}$ for the corresponding component (e.g., upper clothes). The decomposed person image with component $i$ is computed by multiplying the source person image with the component mask $M_{i}$:

\[I_{s}^{i}=I_{s}\odot{}M_{i},\]

where $\odot$ denotes the element-wise product. $I_{s}^{i}$ is then fed into the texture encoder $T_{enc}$ to acquire the corresponding style code $C_{sty}^{i}$ in each branch by

\[C_{sty}^{i}=T_{enc}(I_{s}^{i}),\]

where the texture encoder $T_{enc}$ is shared by all branches; its detailed architecture is described below. All $C_{sty}^{i}$, $i=1,\dots,K$, are then concatenated in a top-down manner to form the full style code $C_{sty}$.
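
The decomposition and recombination steps above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' implementation: `toy_texture_encoder` is a hypothetical stand-in for $T_{enc}$ (mean colour followed by a shared linear map), the label map `S` is random rather than produced by a human parser, and $K=8$ and the code size are chosen arbitrarily.

```python
import numpy as np

def decompose(image, semantic_map, num_components):
    """Split an image (3, H, W) into K masked images using a label map (H, W)."""
    # One binary mask M_i per component label i.
    masks = np.stack([semantic_map == i for i in range(num_components)]).astype(image.dtype)
    # I_s^i = I_s * M_i, broadcast over the RGB channels -> (K, 3, H, W).
    return image[None] * masks[:, None]

def toy_texture_encoder(component_image, weight):
    """Hypothetical stand-in for T_enc: mean colour, then a shared linear map."""
    return weight @ component_image.mean(axis=(1, 2))

rng = np.random.default_rng(0)
K, H, W, code_dim = 8, 32, 16, 4                    # illustrative sizes
I_s = rng.random((3, H, W))                         # source person image
S = rng.integers(0, K, size=(H, W))                 # hypothetical parser output
shared_weight = rng.standard_normal((code_dim, 3))  # one set of weights for all branches

components = decompose(I_s, S, K)
# Concatenate the per-component codes in a fixed order to form C_sty.
C_sty = np.concatenate([toy_texture_encoder(c, shared_weight) for c in components])
print(components.shape, C_sty.shape)
```

Because the masks partition the image, summing the masked images over the component axis recovers $I_{s}$, and component $i$ always occupies the same slice of `C_sty`.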

In contrast to the common solution that directly encodes the entire source person image, this intuitive DCE module decomposes the source person into multiple components and recombines their latent codes to construct the full style code. Such an intuitive strategy kills two birds with one stone:

It speeds up the convergence of the model and achieves more realistic results in less time. Because the manifold formed by person images with different clothes and poses has a complex structure, it is hard to encode the entire person with detailed textures, but much simpler to learn the features of only one component of the person. Also, different components can share the same network parameters for color encoding, so DCE implicitly provides data augmentation for texture learning. The loss curves showing the effect of our DCE module during training are given in Figure 5, and visual results are provided in Figure 4 (d)(e).

It achieves automatic and unsupervised attribute separation without any annotation in the training dataset, by using an existing human parser for spatial decomposition. Specific attributes are learned at fixed positions of the style code. Thus we can easily control component attributes by mixing desired component codes extracted from different source persons.
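
Since each component is learned at a fixed position of the style code, attribute control reduces to slicing and blending vectors. The sketch below shows one way this could look; the helper names `mix_components` and `interpolate`, the code layout ($K$ equal-length slices) and the random codes are assumptions made for illustration.

```python
import numpy as np

def mix_components(code_a, code_b, component_index, code_dim):
    """Copy one component's slice of the style code from source B into source A."""
    mixed = code_a.copy()
    start = component_index * code_dim
    mixed[start:start + code_dim] = code_b[start:start + code_dim]
    return mixed

def interpolate(code_a, code_b, beta):
    """Linear blend between two style codes, with beta in [0, 1]."""
    return (1.0 - beta) * code_a + beta * code_b

K, code_dim = 8, 4                      # illustrative sizes
rng = np.random.default_rng(1)
C_a = rng.standard_normal(K * code_dim)  # style code of source person A
C_b = rng.standard_normal(K * code_dim)  # style code of source person B

# Take component 2 (e.g. upper clothes, by assumption) from B, keep the rest of A.
C_mix = mix_components(C_a, C_b, component_index=2, code_dim=code_dim)
# Halfway between A and the mixed code: only the swapped slice changes.
C_half = interpolate(C_a, C_mix, beta=0.5)
```

Feeding `C_mix` or `C_half` to the generator in place of the original style code is what would produce the edited image.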

Figure 4: Visualization of the effects of DCE and GTE. (a) A source person and (b) a target pose as inputs. (c) The result generated with neither DCE nor GTE. (d) The result generated with GTE but without DCE. (e) The result generated with both modules.

Figure 5: Loss curves for the effectiveness of our DCE module in the training process.

For the texture encoder, we are inspired by a style transfer method [15] that directly extracts the image code via a pretrained VGG network to improve the generalization ability of texture encoding. We introduce a global texture encoding architecture by concatenating the VGG features in corresponding layers to our original encoder, as shown in Figure 3. The parameters of the original encoder are learnable, while those of the VGG encoder are fixed. Since the fixed VGG network is pretrained on the COCO dataset [21] and has seen many images with various textures, it has a global property and strong generalization ability for in-the-wild textures. But unlike the typical style transfer task [15, 11], which requires only a roughly reasonable result without tight constraints, our model needs to output the explicitly specified result for a given source person in the target pose. It is difficult for a network with a fixed encoder to fit such a complex model, and thus the learnable encoder is introduced and combined with the fixed one. The effects of the global texture encoding (GTE) are shown in Figure 4 (c)(d).

Figure 6: Auxiliary effects of the fusion module (FM) for DCE. (a) A source person and (b) a target pose for inputs. (c) The result generated without DCE. (d) The result generated with DCE introduced but no FM contained in style blocks. (e) The result generated with both DCE and FM.

$\mathbf{3.1.3\;Texture\;style\;transfer}$

Texture style transfer aims to inject the texture pattern of the source person into the feature of the target pose, acting as a connection between the pose code and the style code of the two pathways. This transfer network consists of several cascaded style blocks, each of which is built from a fusion module and residual conv-blocks equipped with AdaIN. For the $t$-th style block, the inputs are the deep features $F_{t-1}$ output by the previous block and the style code $C_{sty}$. The output of this block is computed by

\[F_{t}=\phi_{t}(F_{t-1},A)+F_{t-1},\]

where $F_{t-1}$ first goes through the conv-blocks $\phi_{t}$, whose output is added back to $F_{t-1}$ to give the output $F_{t}$. $F_{0}=C_{pose}$ in the first block, and 8 style blocks are adopted in total. $A$ denotes the learned affine transform parameters (scale $\mu$ and shift $\sigma$) required in the AdaIN layer, which can be used to normalize the features into the desired style [8, 15]. These parameters are extracted from the style code $C_{sty}$ via a fusion module (FM), which is an important auxiliary module for DCE. Because component codes are concatenated in a specified order to construct the style code, position and component features become highly correlated; this imposes substantial hand-crafted intervention and conflicts with the learning tendency of the network itself. Thus we introduce FM, which consists of 3 fully connected layers: the first two allow the network to flexibly select the desired features via linear recombination, and the last one provides parameters in the required dimensionality. FM can effectively disentangle features and avoid conflicts between the forward operation and backward feedback. The effects of FM are shown in Figure 6. When DCE is applied to our model without FM, the result (see Figure 6 (d)) is even worse than that without DCE (see Figure 6 (c)). The fusion module makes our model more flexible and guarantees the proper performance of DCE.
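
To make the residual update and the role of the fusion module concrete, here is a heavily simplified NumPy sketch. Everything below is an assumption for illustration: the conv-blocks $\phi_{t}$ are reduced to a single AdaIN step, the three fully connected layers are kept purely linear because the text does not specify activations, and all sizes and weights are arbitrary.

```python
import numpy as np

def fusion_module(style_code, weights, biases):
    """Three fully connected layers mapping the style code to AdaIN parameters."""
    h = style_code
    for w, b in zip(weights, biases):
        h = w @ h + b  # purely linear here; activations are not specified in the text
    return h

def adain(features, scale, shift, eps=1e-5):
    """Normalise each channel of (C, H, W) features, then apply scale and shift."""
    mean = features.mean(axis=(1, 2), keepdims=True)
    std = features.std(axis=(1, 2), keepdims=True)
    normalised = (features - mean) / (std + eps)
    return scale[:, None, None] * normalised + shift[:, None, None]

def style_block(prev_features, style_code, weights, biases):
    """F_t = phi_t(F_{t-1}, A) + F_{t-1}, with phi_t reduced to one AdaIN step."""
    channels = prev_features.shape[0]
    params = fusion_module(style_code, weights, biases)  # A, of length 2 * C
    scale, shift = params[:channels], params[channels:]
    return adain(prev_features, scale, shift) + prev_features

rng = np.random.default_rng(0)
C, H, W, style_dim, hidden = 6, 8, 8, 32, 16   # illustrative sizes
C_pose = rng.standard_normal((C, H, W))        # F_0
C_sty = rng.standard_normal(style_dim)

F = C_pose
for _ in range(8):  # 8 cascaded style blocks, each with its own FM weights
    shapes = [(hidden, style_dim), (hidden, hidden), (2 * C, hidden)]
    weights = [0.1 * rng.standard_normal(s) for s in shapes]
    biases = [np.zeros(s[0]) for s in shapes]
    F = style_block(F, C_sty, weights, biases)
print(F.shape)
```

The last layer of the fusion module outputs `2 * C` values so that one scale and one shift are available per feature channel, which is the "required dimensionality" mentioned above.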

$\mathbf{3.1.4\;Person\;image\;reconstruction}$

With the final target features $F_{T−1}$ at the output of the last style block, the decoder generates the final image $I_{g}$ from $F_{T−1}$ via $N$ deconvolutional layers, following the regular decoder configuration.

$\mathbf{3.2.\;Discriminators}$

Following Zhu et al. [46], we adopt two discriminators $D_{p}$ and $D_{t}$, where $D_{p}$ is used to guarantee the alignment of the pose of the generated image $I_{g}$ with the target pose $P_{t}$, and $D_{t}$ is used to ensure the similarity of the appearance texture of $I_{g}$ to that of the source person $I_{s}$. For $D_{p}$, the target pose $P_{t}$ concatenated with the generated image $I_{g}$ (real target image $I_{t}$) is fed into $D_{p}$ as a fake (real) pair. For $D_{t}$, the source person image $I_{s}$ concatenated with $I_{g}$ ($I_{t}$) is fed into $D_{t}$ as a fake (real) pair. Both $D_{p}$ and $D_{t}$ are implemented as PatchGAN; more details can be found in [16].
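
The four discriminator inputs described above can be assembled as follows. This is a small sketch assuming channel-wise concatenation of `(channels, H, W)` arrays; the helper `make_pairs`, the dictionary keys and the 256×176 image size are hypothetical, and random arrays stand in for real data.

```python
import numpy as np

def make_pairs(P_t, I_s, I_t, I_g):
    """Build the real/fake inputs for D_p and D_t by channel-wise concatenation."""
    return {
        "Dp_real": np.concatenate([P_t, I_t], axis=0),  # target pose + real target image
        "Dp_fake": np.concatenate([P_t, I_g], axis=0),  # target pose + generated image
        "Dt_real": np.concatenate([I_s, I_t], axis=0),  # source image + real target image
        "Dt_fake": np.concatenate([I_s, I_g], axis=0),  # source image + generated image
    }

rng = np.random.default_rng(0)
H, W = 256, 176  # image size chosen for illustration
P_t = rng.random((18, H, W))
I_s, I_t, I_g = (rng.random((3, H, W)) for _ in range(3))
pairs = make_pairs(P_t, I_s, I_t, I_g)
print({name: x.shape for name, x in pairs.items()})
```

Under this assumption $D_{p}$ sees 21 input channels (18 pose + 3 RGB) and $D_{t}$ sees 6 (two RGB images).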

$\mathbf{3.3.\;Training}$

Our full training loss comprises an adversarial term, a reconstruction term, a perceptual term and a contextual term:

\[L_{total}=L_{adv}+λ_{rec}L_{rec}+λ_{per}L_{per}+λ_{CX}L_{CX},\]

where $λ_{rec}$, $λ_{per}$ and $λ_{CX}$ denote the weights of corresponding losses, respectively.

Adversarial loss. We employ an adversarial loss $L_{adv}$ with discriminators $D_{p}$ and $D_{t}$ to help the generator $G$ synthesize a target person image whose visual textures are similar to those of the reference image and whose pose follows the target pose. It penalizes the distance between the distribution of real pairs $(I_{s}(P_{t}), I_{t})$ and the distribution of fake pairs $(I_{s}(P_{t}), I_{g})$ containing generated images:

\[L_{adv}=\mathbb{E}_{I_{s},P_{t},I_{t}}[\log(D_{t}(I_{s},I_{t})\cdot D_{p}(P_{t},I_{t}))]+\mathbb{E}_{I_{s},P_{t}}[\log((1-D_{t}(I_{s},G(I_{s},P_{t})))\cdot(1-D_{p}(P_{t},G(I_{s},P_{t}))))].\]
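
As a worked example of the formula above, the snippet below estimates $L_{adv}$ from discriminator outputs in $(0, 1)$. The function name, the small `eps` added for numerical safety and the patch-map shapes are assumptions; constant arrays stand in for real discriminator outputs.

```python
import numpy as np

def adversarial_loss(dt_real, dp_real, dt_fake, dp_fake, eps=1e-12):
    """Empirical estimate of L_adv from discriminator outputs in (0, 1)."""
    # log(D_t(I_s, I_t) * D_p(P_t, I_t)) on real pairs.
    real_term = np.log(dt_real * dp_real + eps)
    # log((1 - D_t(I_s, I_g)) * (1 - D_p(P_t, I_g))) on fake pairs.
    fake_term = np.log((1.0 - dt_fake) * (1.0 - dp_fake) + eps)
    return real_term.mean() + fake_term.mean()

shape = (4, 1, 30, 30)  # a batch of PatchGAN-style score maps (illustrative)

# A discriminator that cannot tell real from fake outputs 0.5 everywhere.
confused = np.full(shape, 0.5)
loss_confused = adversarial_loss(confused, confused, confused, confused)

# A confident, correct discriminator: high scores on real, low on fake.
sharp_real, sharp_fake = np.full(shape, 0.99), np.full(shape, 0.01)
loss_sharp = adversarial_loss(sharp_real, sharp_real, sharp_fake, sharp_fake)
print(loss_confused, loss_sharp)
```

The discriminators push this quantity up toward 0, while the generator pushes it down by making the fake-pair scores larger.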

Reconstruction loss. The reconstruction loss directly guides the visual appearance of the generated image toward that of the target image $I_{t}$, which avoids obvious color distortions and accelerates convergence to satisfactory results. $L_{rec}$ is formulated as the L1 distance between the generated image and the target image $I_{t}$:

\[L_{rec}=\Vert G(I_{s},P_{t})-I_{t}\Vert_{1}.\]

Perceptual loss. In addition to low-level constraints in RGB space, we also exploit deep features extracted from certain layers of the pretrained VGG network for texture matching, which has proven effective in image synthesis tasks [9, 33]. The perceptual loss is computed as in [46]:

\[L_{per}=\frac{1}{W_{l}H_{l}C_{l}}\sum_{x=1}^{W_{l}}\sum_{y=1}^{H_{l}}\sum_{z=1}^{C_{l}}\Vert φ_{l}(G(I_{s},P_{t}))_{x,y,z}-φ_{l}(I_{t})_{x,y,z}\Vert_{1},\]

where $φ_{l}$ is the output feature from layer $l$ of the VGG19 network, and $W_{l}$, $H_{l}$ and $C_{l}$ are the spatial width, height and depth of the feature $φ_{l}$.
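The perceptual loss can be sketched as a mean absolute difference over one feature map. The example below assumes the layer-$l$ VGG19 features of the generated and target images have already been extracted into arrays of shape $(W_{l}, H_{l}, C_{l})$; feature extraction itself is not shown, and the function name `perceptual_loss` is hypothetical.

```python
import numpy as np


def perceptual_loss(feat_generated, feat_target):
    """Mean absolute difference between two feature maps.

    Both inputs are assumed to be precomputed layer-l features of
    shape (W_l, H_l, C_l); this sketch does not run VGG19 itself.
    """
    feat_generated = np.asarray(feat_generated, dtype=np.float64)
    feat_target = np.asarray(feat_target, dtype=np.float64)
    if feat_generated.shape != feat_target.shape:
        raise ValueError("feature maps must have the same shape")
    w, h, c = feat_generated.shape
    # Sum of element-wise absolute differences, normalised by W_l * H_l * C_l.
    return float(np.abs(feat_generated - feat_target).sum() / (w * h * c))
```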

Figure 7: Results of synthesizing person images in arbitrary poses.

Figure 8: Effects of the contextual loss.

Contextual loss. The contextual loss proposed in [25] is designed to measure the similarity between two non-aligned images for image transformation, and it is also effective in our GAN-based person image synthesis task. Compared with a pixel-level loss, which requires pixel-to-pixel alignment, the contextual loss allows spatial deformations with respect to the target, yielding less texture distortion and more reasonable outputs. We compute the contextual loss $L_{CX}$ by

\[L_{CX}=-\log(CX(F^{l}(I_{g}),F^{l}(I_{t}))),\]

where $F^{l}(I_{g})$ and $F^{l}(I_{t})$ denote the feature maps extracted from layers $l=relu\{3\_2, 4\_2\}$ of the pretrained VGG19 network for images $I_{g}$ and $I_{t}$, respectively, and CX denotes the similarity metric between matched features, which considers both the semantic meaning of pixels and the context of the entire image. More details can be found in [25]. Figure 8 shows the effect of $L_{CX}$, which enables the network to better preserve details with less distortion.
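To make the CX similarity more concrete, here is a minimal sketch based on our reading of the contextual loss in [25]; it is not taken from the authors' code. It assumes each feature map has been flattened to one row per spatial location, and the bandwidth `h = 0.5` and stabilizer `eps = 1e-5` are assumed defaults rather than values stated in this paper.

```python
import numpy as np


def contextual_loss(feat_generated, feat_target, h=0.5, eps=1e-5):
    """-log(CX) between two feature sets of shape (N, C) and (M, C).

    Rows are feature vectors at individual spatial locations; their
    order (i.e. spatial position) does not affect the result.
    """
    x = np.asarray(feat_generated, dtype=np.float64)
    y = np.asarray(feat_target, dtype=np.float64)
    # Centre both sets on the target mean, then L2-normalise every row.
    mu = y.mean(axis=0, keepdims=True)
    x = x - mu
    y = y - mu
    x = x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-12)
    y = y / (np.linalg.norm(y, axis=1, keepdims=True) + 1e-12)
    # Cosine distance d[i, j] between generated feature i and target feature j.
    d = 1.0 - x @ y.T
    # Normalise each row by its smallest distance.
    d_norm = d / (d.min(axis=1, keepdims=True) + eps)
    # Turn distances into similarities and normalise them per row.
    w = np.exp((1.0 - d_norm) / h)
    cx_ij = w / w.sum(axis=1, keepdims=True)
    # For every target feature keep its best match, then average.
    cx = cx_ij.max(axis=0).mean()
    return float(-np.log(cx))
```

Because the score depends only on which features match, not on where they sit, shuffling the rows of the generated feature set leaves the loss unchanged, which is the property that tolerates spatial deformation.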

Implementation details. Our method is implemented in PyTorch using two NVIDIA Tesla V100 GPUs with 16 GB memory. With the human parser [2], we acquire the semantic map of the person image and merge the original labels defined in [12] into $K=8$ categories (i.e., background, hair, face, upper clothes, pants, skirt, arm and leg). The weights for the loss terms are set to $λ_{rec}=2$, $λ_{per}=2$ and $λ_{CX}=0.02$. We adopt the Adam optimizer [19] with the momentum set to 0.5 and train our model for around 120k iterations. The initial learning rate is set to 0.001 and linearly decayed to 0 after 60k iterations. Following this configuration, we alternately train the generator and the two discriminators.
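The reported hyperparameters can be collected in a small configuration sketch. The loss weights and the initial learning rate come from the text above; the schedule is one possible reading of "linearly decayed to 0 after 60k iterations", namely a constant rate for the first 60k iterations followed by a linear decay that reaches 0 at 120k. All names below are hypothetical.

```python
# Loss weights reported in the implementation details.
LAMBDA_REC = 2.0
LAMBDA_PER = 2.0
LAMBDA_CX = 0.02

BASE_LR = 1e-3          # initial learning rate
DECAY_START = 60_000    # assumed: iteration at which the linear decay begins
TOTAL_ITERS = 120_000   # "around 120k iterations"


def total_loss(l_adv, l_rec, l_per, l_cx):
    """Weighted sum L_total = L_adv + λ_rec L_rec + λ_per L_per + λ_CX L_CX."""
    return l_adv + LAMBDA_REC * l_rec + LAMBDA_PER * l_per + LAMBDA_CX * l_cx


def learning_rate(iteration):
    """Constant rate until DECAY_START, then linear decay to 0 (assumed schedule)."""
    if iteration <= DECAY_START:
        return BASE_LR
    remaining = max(TOTAL_ITERS - iteration, 0)
    return BASE_LR * remaining / (TOTAL_ITERS - DECAY_START)
```

Under this reading the rate at iteration 90k would be half the initial value; a different interpretation of the schedule would change only `learning_rate`.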

$\mathbf{4.\;Experimental\;Results}$

In this section, we verify the effectiveness of the proposed network on attribute-guided person image synthesis tasks (pose transfer and component attribute transfer) and illustrate its superiority over other state-of-the-art methods. Detailed results are shown in the following subsections, and more are available in the supplemental materials (Supp).

Dataset. We conduct experiments on the In-shop Clothes Retrieval Benchmark of DeepFashion [22], which contains a large number of person images with various appearances and poses. It comprises 52,712 images in total at a resolution of 256 × 256. Following the same data configuration as in pose transfer [46], we randomly picked 101,966 pairs of images for training and 8,750 pairs for testing.

Evaluation Metrics. Inception Score (IS) [32] and Structural Similarity (SSIM) [37] are the two most commonly used evaluation metrics in the person image synthesis task and were first used in PG2 [23]. Later, Siarohin et al. [33] introduced the Detection Score (DS) to measure whether the person can be detected in the image. However, IS and DS rely only on the output image to judge quality and ignore its consistency with the conditional images. Here, we introduce a new metric called the contextual (CX) score, which was proposed for image transformation [25] and uses the cosine distance between deep features to measure the similarity of two non-aligned images, ignoring the spatial position of the features. CX can explicitly assess the texture coherence between two images, which makes it suitable for measuring the appearance consistency between the generated image and the source image (target image), recorded as CX-GS (CX-GT). In addition to these computed metrics, we also perform a user study in which humans assess the realness of the synthesized images.
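For reference, the Inception Score is usually defined as the exponential of the average KL divergence between each image's class posterior and the marginal class distribution. The sketch below applies that definition to class probabilities that are assumed to be precomputed by a classifier; the classifier itself and the name `inception_score` are outside the paper and shown only for illustration.

```python
import numpy as np


def inception_score(probs, eps=1e-12):
    """exp( mean_x KL( p(y|x) || p(y) ) ) from an (N, num_classes) array.

    Each row of probs is assumed to be a softmax output that sums to 1;
    eps is a hypothetical guard against log(0).
    """
    p_yx = np.asarray(probs, dtype=np.float64)
    # Marginal class distribution over all images.
    p_y = p_yx.mean(axis=0, keepdims=True)
    # Per-image KL divergence between the posterior and the marginal.
    kl = (p_yx * (np.log(p_yx + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))
```

The score depends only on the generated images' predicted classes, which illustrates why it cannot capture consistency with the source or target image.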

$\mathbf{4.1.\;Pose\;transfer}$

$\mathbf{4.1.1\;Person\;image\;synthesis\;in\;arbitrary\;poses}$

Pose is one of the most essential human attributes, and our experiments verify the effectiveness of our model in pose-controlled person image synthesis. Given the same source person image and several poses extracted from person images in the test set, our model can generate natural and realistic results even when the target poses differ drastically from the source in scale, viewpoint, etc. We show some results of our method in Figure 7, and more are available in Supp.

Figure 9: Qualitative comparison with state-of-the-art methods.

$\mathbf{4.1.2\;Comparison\;with\;state-of-the-art\;methods}$

For pose transfer, we evaluate our proposed method with both qualitative and quantitative comparisons.

Qualitative comparison. In Figure 9, we compare the synthesis results of our method with four state-of-the-art pose transfer methods: PG2 [23], DPIG [24], Def-GAN [33] and PATN [46]. All results of these methods are obtained by directly using the source code and trained models released by the authors. As can be seen, our method produces more realistic results in both global structures and detailed textures. The facial identity is better preserved, and even detailed muscles and clothing wrinkles are successfully synthesized. More results can be found in Supp.

Quantitative comparison. In Table 1, we show the quantitative comparison using the metrics described above. Since the data split information for the experiments of [23, 24, 33] is not given, we download their pre-trained models and evaluate their performance on our test set. Although some of our testing images may therefore be contained in their training samples, our method still outperforms them on most metrics. The results show that our method generates not only more realistic details, with the highest IS value, but also more similar and natural textures with respect to the source image and target image, respectively (lowest CX-GS and CX-GT values). Furthermore, our method has the highest confidence for person detection, with the best DS value. For SSIM, we observe that this metric slightly decreases when the value of IS increases, meaning that sharper images may have lower SSIM, which has also been observed in other methods [23, 24].

Table 1: Quantitative comparison with state-of-the-art methods on DeepFashion.

Table 2: Results of the user study (%). R2G is the percentage of real images rated as generated w.r.t. all real images. G2R is the percentage of generated images rated as real w.r.t. all generated images. The user preference for the most realistic images w.r.t. source persons is shown in the last row.

User study. We conduct a user study to assess the realness and faithfulness of the generated images and compare the performance of our method with four pose transfer techniques. For realness, participants are asked to judge within a second whether a given image is real or fake. Following the protocol of [23, 33, 46], we randomly selected 55 real images and 55 generated images; the first 10 are used for warming up and the remaining 100 are used for evaluation. For faithfulness, participants are shown a source image and 5 transferred outputs and are asked to select the most natural and reasonable image with respect to the source person image. We show 30 comparisons to each participant, and 40 responses are collected per experiment. The results in Table 2 further validate that our generated images are more realistic, natural and faithful. It is worth noting that our approach yields a significant quality boost over the other methods: over 70% of our results are selected as the most realistic ones.
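The R2G and G2R rates defined for Table 2 can be computed from one participant's answers as sketched below. The data layout (a list of `(is_real, rated_real)` pairs in presentation order) and the function name are assumptions made for illustration; the sketch also assumes that the warm-up images are simply the first 10 shown and are excluded from scoring.

```python
def user_study_rates(judgements, warmup=10):
    """Return (R2G, G2R) in percent for one participant.

    judgements: list of (is_real, rated_real) booleans in the order shown.
    The first `warmup` entries are discarded, as in the described protocol.
    """
    scored = judgements[warmup:]
    real = [rated for is_real, rated in scored if is_real]
    generated = [rated for is_real, rated in scored if not is_real]
    # R2G: real images that the participant rated as generated.
    r2g = 100.0 * sum(1 for rated in real if not rated) / len(real)
    # G2R: generated images that the participant rated as real.
    g2r = 100.0 * sum(1 for rated in generated if rated) / len(generated)
    return r2g, g2r
```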

$\mathbf{4.2.\;Component\;Attribute\;Transfer}$

Our method also achieves controllable person image synthesis with user-specified component attributes, which can be provided by multiple source person images. For example, given 3 source person images with different component attributes, we can automatically synthesize the target image with the basic appearance of person 1, the upper clothes of person 2 and the pants of person 3. This also provides a powerful tool for editing component-level human attributes, such as changing pants to a dress, a T-shirt to a waistcoat, or a man's head to a woman's.
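The three-person example can be pictured as swapping entries in a set of per-component codes before they are joined into one style code. The sketch below assumes that each component code is a fixed-length vector and that the full style code is their concatenation in a fixed order; the code length, the function name `recombine` and the dictionary layout are hypothetical, and the encoder that would produce the codes is not shown.

```python
import numpy as np

# The K = 8 component categories, in an assumed fixed order.
COMPONENTS = ["background", "hair", "face", "upper clothes",
              "pants", "skirt", "arm", "leg"]


def recombine(base_codes, overrides):
    """Build a full style code from a base person plus swapped components.

    base_codes: dict mapping every component name to a 1-D code (person 1).
    overrides:  dict mapping some component names to codes from other persons.
    """
    unknown = set(overrides) - set(COMPONENTS)
    if unknown:
        raise KeyError(f"unknown components: {sorted(unknown)}")
    # Take the override where one is given, otherwise keep the base code.
    parts = [np.asarray(overrides.get(name, base_codes[name]), dtype=np.float64)
             for name in COMPONENTS]
    # Concatenate in the fixed component order to form the full style code.
    return np.concatenate(parts)
```

For instance, passing person 2's code under `"upper clothes"` and person 3's code under `"pants"` leaves all other components of person 1 untouched.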

Figure 10: Results of synthesizing person images with controllable component attributes. The first column shows the original person images; the images on the right are synthesized results whose pants (first row) or upper clothes (second row) are replaced with those of the corresponding source images on the left.

Figure 11: Failure cases caused by component or pose attributes that deviate strongly from the manifold built upon the training data.

By encoding the source person images into decomposed component codes and recombining these codes to construct the full style code, our method can synthesize the target image with the desired attributes. In Figure 10, we edit the upper clothes or pants of target images by using additional source person images to provide the desired attributes. Our method generates natural images in which the new attributes are introduced harmoniously while the textures of the remaining components are preserved.

Style Interpolation. Using our Attribute-Decomposed GAN, we can travel along the manifold of all component attributes of the person in a given image, thus synthesizing an animation from one attribute to another. Taking the upper-clothes codes of person 1 and person 2 ($C_{uc1}$ and $C_{uc2}$) as an example, we define their mixing result as

\[C_{mix}=βC_{uc1}+(1-β)C_{uc2},\]

where $β\in{}(0, 1)$ and $β$ decreases from 1 to 0 in specific steps. Results of style interpolation are available in Supp.
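The mixing rule can be sketched as a sequence of blended codes, one per animation frame. The example assumes the two upper-clothes codes are plain lists of floats and that $β$ is stepped uniformly; it includes the endpoints $β=1$ and $β=0$ so that the first and last frames reproduce $C_{uc1}$ and $C_{uc2}$ exactly. The function name and the uniform step size are illustrative choices.

```python
def interpolate_codes(c_uc1, c_uc2, steps):
    """Return `steps` mixed codes with beta going uniformly from 1 to 0."""
    if steps < 2:
        raise ValueError("need at least two steps")
    if len(c_uc1) != len(c_uc2):
        raise ValueError("codes must have the same length")
    frames = []
    for k in range(steps):
        beta = 1.0 - k / (steps - 1)
        # C_mix = beta * C_uc1 + (1 - beta) * C_uc2, element by element.
        frames.append([beta * a + (1.0 - beta) * b
                       for a, b in zip(c_uc1, c_uc2)])
    return frames
```

Each mixed code would then replace the upper-clothes entry of the style code before decoding, giving one frame of the transition.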

$\mathbf{4.3.\;Failure\;cases}$

Although our method obtains impressive results in most cases, it fails to synthesize images with pose and component attributes that deviate strongly from the manifold built upon the training data. The model constructs a complex manifold composed of the various pose and component attributes of person images, and we can travel along this manifold from one attribute to another. Thus, valid synthesis results are in effect mixtures of seen attributes obtained via the interpolation operation. As shown in Figure 11, the specific cartoon pattern on a woman's T-shirt cannot be interpolated from seen ones, and a person in a rare pose cannot be synthesized seamlessly.

$\mathbf{5.\;Conclusion}$

In this paper, we presented a novel Attribute-Decomposed GAN for controllable person image synthesis, which allows flexible and continuous control of human attributes. Our method introduces a new generator architecture that embeds the source person image into the latent space as a series of decomposed component codes and recombines these codes in a specific order to construct the full style code. Experimental results demonstrated that this decomposition strategy enables not only more realistic output images but also flexible user control of component attributes. We also believe that our solution, which uses an off-the-shelf human parser to automatically separate component attributes from the entire person image, could inspire future research in settings with insufficient data annotation. Furthermore, our method is not only well suited to generating person images but can also potentially be adapted to other image synthesis tasks.

$\mathbf{Acknowledgements}$

This work was supported by the National Natural Science Foundation of China (Grant No.: 61672043 and 61672056), the Beijing Nova Program of Science and Technology (Grant No.: Z191100001119077), and the Key Laboratory of Science, Technology and Standard in Press Industry (Key Laboratory of Intelligent Press Media Technology).

$\mathbf{References}$

[1] Kfir Aberman, Rundi Wu, Dani Lischinski, Baoquan Chen, and Daniel Cohen-Or. Learning character-agnostic motion for motion retargeting in 2d. arXiv preprint arXiv:1905.01680, 2019.

[2] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE transactions on pattern analysis and machine intelligence, 39(12):2481–2495, 2017.

[3] Guha Balakrishnan, Amy Zhao, Adrian V Dalca, Fredo Durand, and John Guttag. Synthesizing images of humans in unseen poses. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8340–8348, 2018.

[4] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.

[5] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7291–7299, 2017.

[6] Caroline Chan, Shiry Ginosar, Tinghui Zhou, and Alexei A Efros. Everybody dance now. In Proceedings of the IEEE International Conference on Computer Vision, pages 5933–5942, 2019.

[7] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in neural information processing systems, pages 2172–2180, 2016.

[8] Vincent Dumoulin, Jonathon Shlens, and Manjunath Kudlur. A learned representation for artistic style. arXiv preprint arXiv:1610.07629, 2016.

[9] Patrick Esser, Ekaterina Sutter, and Bjorn Ommer. A variational u-net for conditional appearance and shape generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8857–8866, 2018.

[10] Leon Gatys, Alexander S Ecker, and Matthias Bethge. Texture synthesis using convolutional neural networks. In Advances in Neural Information Processing Systems, pages 262–270, 2015.

[11] Leon A Gatys, Alexander S Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on, pages 2414–2423. IEEE, 2016.

[12] Ke Gong, Xiaodan Liang, Dongyu Zhang, Xiaohui Shen, and Liang Lin. Look into person: Self-supervised structure-sensitive learning and a new benchmark for human parsing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 932–940, 2017.

[13] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.

[14] Zhenliang He, Wangmeng Zuo, Meina Kan, Shiguang Shan, and Xilin Chen. Attgan: Facial attribute editing by only changing what you want. IEEE Transactions on Image Processing, 2019.

[15] Xun Huang and Serge Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE International Conference on Computer Vision, pages 1501–1510, 2017.

[16] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1125–1134, 2017.

[17] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pages 694–711. Springer, 2016.

[18] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4401–4410, 2019.

[19] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[20] Christoph Lassner, Gerard Pons-Moll, and Peter V Gehler. A generative model of people in clothing. In Proceedings of the IEEE International Conference on Computer Vision, pages 853–862, 2017.

[21] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.

[22] Ziwei Liu, Ping Luo, Shi Qiu, Xiaogang Wang, and Xiaoou Tang. Deepfashion: Powering robust clothes recognition and retrieval with rich annotations. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1096–1104, 2016.

[23] Liqian Ma, Xu Jia, Qianru Sun, Bernt Schiele, Tinne Tuytelaars, and Luc Van Gool. Pose guided person image generation. In Advances in Neural Information Processing Systems, pages 406–416, 2017.

[24] Liqian Ma, Qianru Sun, Stamatios Georgoulis, Luc Van Gool, Bernt Schiele, and Mario Fritz. Disentangled person image generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 99–108, 2018.

[25] Roey Mechrez, Itamar Talmi, and Lihi Zelnik-Manor. The contextual loss for image transformation with non-aligned data. In Proceedings of the European Conference on Computer Vision (ECCV), pages 768–783, 2018.

[26] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.

[27] Gonçalo Mordido, Haojin Yang, and Christoph Meinel. Dropout-gan: Learning from a dynamic ensemble of discriminators. arXiv preprint arXiv:1807.11346, 2018.

[28] Gerard Pons-Moll, Sergi Pujades, Sonny Hu, and Michael J Black. Clothcap: Seamless 4d clothing capture and retargeting. ACM Transactions on Graphics (TOG), 36(4):73, 2017.

[29] Albert Pumarola, Antonio Agudo, Alberto Sanfeliu, and Francesc Moreno-Noguer. Unsupervised person image synthesis in arbitrary poses. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8620–8628, 2018.

[30] Amit Raj, Patsorn Sangkloy, Huiwen Chang, James Hays, Duygu Ceylan, and Jingwan Lu. Swapnet: Image based garment transfer. In European Conference on Computer Vision, pages 679–695. Springer, 2018.

[31] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.

[32] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In Advances in neural information processing systems, pages 2234–2242, 2016.

[33] Aliaksandr Siarohin, Enver Sangineto, Stéphane Lathuilière, and Nicu Sebe. Deformable gans for pose-based human image generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3408–3416, 2018.

[34] Sijie Song, Wei Zhang, Jiaying Liu, and Tao Mei. Unsupervised person image generation with semantic parsing transformation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2357–2366, 2019.

[35] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Guilin Liu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. Video-to-video synthesis. arXiv preprint arXiv:1808.06601, 2018.

[36] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8798–8807, 2018.

[37] Zhou Wang, Alan C Bovik, Hamid R Sheikh, Eero P Simoncelli, et al. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.

[38] Shan Yang, Tanya Ambert, Zherong Pan, Ke Wang, Licheng Yu, Tamara Berg, and Ming C Lin. Detailed garment recovery from a single-view image. arXiv preprint arXiv:1608.01250, 2016.

[39] Weidong Yin, Yanwei Fu, Leonid Sigal, and Xiangyang Xue. Semi-latent gan: Learning to generate and modify facial images from attributes. arXiv preprint arXiv:1704.02166, 2017.

[40] Mihai Zanfir, Alin-Ionut Popa, Andrei Zanfir, and Cristian Sminchisescu. Human appearance transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5391–5399, 2018.

[41] Gang Zhang, Meina Kan, Shiguang Shan, and Xilin Chen. Generative adversarial network with spatial attention for face attribute editing. In Proceedings of the European Conference on Computer Vision (ECCV), pages 417–432, 2018.

[42] Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318, 2018.

[43] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N Metaxas. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 5907–5915, 2017.

[44] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pages 2223–2232, 2017.

[45] Shizhan Zhu, Raquel Urtasun, Sanja Fidler, Dahua Lin, and Chen Change Loy. Be your own prada: Fashion synthesis with structural coherence. In Proceedings of the IEEE International Conference on Computer Vision, pages 1680–1688, 2017.

[46] Zhen Zhu, Tengteng Huang, Baoguang Shi, Miao Yu, Bofei Wang, and Xiang Bai. Progressive pose attention transfer for person image generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2347–2356, 2019.