(GAN)StarGAN: Unified Generative Adversarial Networks for Multi-Domain Image-to-Image Translation

$\mathbf{Yunjey Choi,\;Minje\;Choi,\;Munyoung\;Kim,\;Jung-Woo\;Ha,\;Sunghun\;Kim,\;Jaegul\;Choo}$

$\mathbf{Korea\;University,\;Clova\;AI\;Research,\;NAVER\;Corp}$

$\mathbf{The\;College\;of\;New\;Jersey,\;Hong\;Kong\;University\;of\;Science\;and\;Technology}$

Figure 1. Multi-domain image-to-image translation results on the CelebA dataset via transferring knowledge learned from the RaFD dataset. The first and sixth columns show input images while the remaining columns are images generated by StarGAN. Note that the images are generated by a single generator network, and facial expression labels such as angry, happy, and fearful are from RaFD, not CelebA.

그림 1. RaFD 데이터 세트에서 학습한 지식을 전달하여 얻은 CelebA 데이터 세트에 대한 다중 도메인 이미지 간 변환 결과. 첫 번째 및 여섯 번째 열은 입력 이미지를 보여주고 나머지 열은 StarGAN이 생성한 이미지이다. 이미지는 단일 생성기 네트워크에 의해 생성되며, 분노, 행복 및 공포와 같은 얼굴 표정 라벨은 CelebA가 아닌 RaFD에서 가져온 것이다.

$\mathbf{Abstract}$

Recent studies have shown remarkable success in image-to-image translation for two domains. However, existing approaches have limited scalability and robustness in handling more than two domains, since different models should be built independently for every pair of image domains. To address this limitation, we propose StarGAN, a novel and scalable approach that can perform image-to-image translations for multiple domains using only a single model. Such a unified model architecture of StarGAN allows simultaneous training of multiple datasets with different domains within a single network. This leads to StarGAN’s superior quality of translated images compared to existing models as well as the novel capability of flexibly translating an input image to any desired target domain. We empirically demonstrate the effectiveness of our approach on facial attribute transfer and facial expression synthesis tasks.

최근 연구는 두 개의 도메인에 대한 이미지 간 변환에서 주목할 만한 성공을 보여주었다. 그러나 기존 접근 방식은 모든 이미지 도메인 쌍에 대해 서로 다른 모델이 독립적으로 구축되어야 하기 때문에 세 개 이상의 도메인을 처리하는 데 확장성과 견고성이 제한적이다. 이러한 한계를 해결하기 위해 단일 모델만 사용하여 여러 도메인에 대해 이미지 대 이미지 변환을 수행할 수 있는 새롭고 확장 가능한 접근 방식인 StarGAN을 제안한다. 이러한 StarGAN의 통합 모델 아키텍처를 통해 단일 네트워크 내에서 서로 다른 도메인을 가진 여러 데이터 세트를 동시에 교육할 수 있다. 이는 StarGAN이 기존 모델에 비해 번역된 이미지의 품질이 우수할 뿐만 아니라 입력 이미지를 원하는 대상 도메인으로 유연하게 번역할 수 있는 새로운 기능으로 이어진다. 우리는 얼굴 속성 전달 및 얼굴 표정 합성 작업에 대한 접근 방식의 효과를 경험적으로 입증한다.

$\mathbf{1.\;Introduction}$

The task of image-to-image translation is to change a particular aspect of a given image to another, e.g., changing the facial expression of a person from smiling to frowning (see Fig. 1). This task has experienced significant improvements following the introduction of generative adversarial networks (GANs), with results ranging from changing hair color [9], reconstructing photos from edge maps [7], and changing the seasons of scenery images [33].

이미지 대 이미지 번역의 작업은 주어진 이미지의 특정 측면을 다른 것으로 바꾸는 것이다. 예를 들어, 사람의 얼굴 표정을 웃는 얼굴에서 찡그리는 얼굴로 바꾸는 것이다(그림 1 참조). 이 작업은 생성적 적대 네트워크(GAN)의 도입에 따라 상당한 개선을 경험했으며, 머리색 변경[9], 에지 맵의 사진 재구성[7], 풍경 이미지의 계절 변경[33] 등의 결과를 얻었다.

Given training data from two different domains, these models learn to translate images from one domain to the other. We denote the terms attribute as a meaningful feature inherent in an image such as hair color, gender or age, and attribute value as a particular value of an attribute, e.g., black/blond/brown for hair color or male/female for gender. We further denote domain as a set of images sharing the same attribute value. For example, images of women can represent one domain while those of men represent another.

두 개의 서로 다른 도메인의 훈련 데이터가 주어지면, 이러한 모델은 한 도메인에서 다른 도메인으로 이미지를 변환하는 방법을 배운다. 우리는 속성(attribute)이라는 용어를 머리색, 성별 또는 나이와 같이 이미지에 내재된 의미 있는 특성으로, 속성 값(attribute value)을 속성의 특정 값(예: 머리색의 경우 검정/금발/갈색, 성별의 경우 남성/여성)으로 정의한다. 우리는 또한 도메인을 동일한 속성 값을 공유하는 이미지 집합으로 정의한다. 예를 들어, 여성의 이미지는 하나의 도메인을 나타낼 수 있고 남성의 이미지는 다른 도메인을 나타낼 수 있다.

Several image datasets come with a number of labeled attributes. For instance, the CelebA[19] dataset contains 40 labels related to facial attributes such as hair color, gender, and age, and the RaFD [13] dataset has 8 labels for facial expressions such as ‘happy’, ‘angry’ and ‘sad’. These settings enable us to perform more interesting tasks, namely multi-domain image-to-image translation, where we change images according to attributes from multiple domains. The first five columns in Fig. 1 show how a CelebA image can be translated according to any of the four domains, ‘blond hair’, ‘gender’, ‘aged’, and ‘pale skin’. We can further extend to training multiple domains from different datasets, such as jointly training CelebA and RaFD images to change a CelebA image’s facial expression using features learned by training on RaFD, as in the rightmost columns of Fig. 1.

여러 이미지 데이터 세트에는 여러 레이블이 지정된 속성이 포함되어 있다. 예를 들어, CelebA[19] 데이터 세트에는 머리색, 성별 및 나이와 같은 얼굴 속성과 관련된 40개의 레이블이 포함되어 있으며, RaFD[13] 데이터 세트에는 ‘행복’, ‘분노’, ‘슬픔’과 같은 얼굴 표정을 위한 8개의 레이블이 있다. 이러한 설정을 통해 여러 도메인의 속성에 따라 이미지를 변경하는 다중 도메인 이미지 간 변환과 같은 보다 흥미로운 작업을 수행할 수 있다. 그림 1의 처음 5개의 열은 CelebA 이미지가 ‘금발’, ‘성별’, ‘노화’, ‘창백한 피부’의 네 가지 도메인 중 하나에 따라 어떻게 변환될 수 있는지를 보여준다. 우리는 그림 1의 가장 오른쪽 열에서처럼, RaFD에서 학습한 특징을 사용하여 CelebA 이미지의 얼굴 표정을 변경하기 위해 CelebA 및 RaFD 이미지를 공동으로 훈련하는 것과 같이 서로 다른 데이터 세트의 여러 도메인을 훈련하는 데까지 확장할 수 있다.

However, existing models are both inefficient and ineffective in such multi-domain image translation tasks. Their inefficiency results from the fact that, in order to learn all mappings among $k$ domains, $k(k-1)$ generators have to be trained. Fig. 2 (a) illustrates how twelve distinct generator networks have to be trained to translate images among four different domains. Meanwhile, they are ineffective in that, even though there exist global features that can be learned from images of all domains, such as face shapes, each generator cannot fully utilize the entire training data and can learn from only two domains out of $k$. Failure to fully utilize training data is likely to limit the quality of generated images. Furthermore, they are incapable of jointly training domains from different datasets because each dataset is partially labeled, which we further discuss in Section 3.2.

그러나 기존 모델은 이러한 다중 도메인 이미지 변환 작업에서 비효율적이고 비효과적이다. 이들의 비효율성은 $k$개 도메인 간의 모든 매핑을 학습하기 위해 $k(k-1)$개의 생성기를 훈련시켜야 한다는 사실에서 비롯된다. 그림 2(a)는 4개의 서로 다른 도메인 간에 이미지를 변환하기 위해 12개의 별개의 생성기 네트워크가 어떻게 훈련되어야 하는지를 보여준다. 한편, 얼굴 모양과 같은 모든 도메인의 이미지에서 학습할 수 있는 전역 특징이 존재하더라도 각 생성기는 전체 훈련 데이터를 완전히 활용할 수 없고 $k$개 중 2개 도메인에서만 학습할 수 있다는 점에서 비효과적이다. 훈련 데이터를 완전히 활용하지 못하면 생성된 이미지의 품질이 제한될 수 있다. 또한, 각 데이터 세트는 부분적으로 레이블이 지정되어 있기 때문에 서로 다른 데이터 세트의 도메인을 공동으로 훈련할 수 없다. 이는 섹션 3.2에서 추가로 논의한다.
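The generator count above is simple arithmetic, and a tiny sketch makes the scaling visible. The function name below is hypothetical; it only counts one generator per ordered pair of domains, as described in the text.

```python
def num_cross_domain_generators(k: int) -> int:
    """One generator per ordered pair of distinct domains: k * (k - 1)."""
    return k * (k - 1)


# A single-generator model always needs exactly one generator.
SINGLE_MODEL_GENERATORS = 1

for k in (2, 4, 7):
    print(k, num_cross_domain_generators(k), SINGLE_MODEL_GENERATORS)
```

For four domains this gives the twelve generators shown in Fig. 2 (a).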

As a solution to such problems we propose StarGAN, a novel and scalable approach capable of learning mappings among multiple domains. As demonstrated in Fig. 2 (b), our model takes in training data of multiple domains, and learns the mappings between all available domains using only a single generator. The idea is simple. Instead of learning a fixed translation (e.g., black-to-blond hair), our generator takes in as inputs both image and domain information, and learns to flexibly translate the image into the corresponding domain. We use a label (e.g., binary or one-hot vector) to represent domain information. During training, we randomly generate a target domain label and train the model to flexibly translate an input image into the target domain. By doing so, we can control the domain label and translate the image into any desired domain at testing phase.

이러한 문제에 대한 해결책으로 여러 도메인 간의 매핑을 학습할 수 있는 새롭고 확장 가능한 접근 방식인 StarGAN을 제안한다. 그림 2(b)에 설명된 바와 같이, 우리의 모델은 여러 도메인의 훈련 데이터를 받아들이고, 단일 생성기만을 사용하여 사용 가능한 모든 도메인 간의 매핑을 학습한다. 아이디어는 간단하다. 우리의 생성기는 고정된 번역(예: 검은색에서 금발로)을 배우는 대신 이미지와 도메인 정보를 모두 입력하고 이미지를 해당 도메인으로 유연하게 변환하는 방법을 학습한다. 우리는 도메인 정보를 나타내기 위해 레이블(예: 이진 또는 원핫 벡터)을 사용한다. 훈련 중에 우리는 무작위로 대상 도메인 레이블을 생성하고 입력 이미지를 대상 도메인으로 유연하게 변환하도록 모델을 훈련시킨다. 이를 통해 도메인 레이블을 제어하고 테스트 단계에서 이미지를 원하는 도메인으로 변환할 수 있습니다.

We also introduce a simple but effective approach that enables joint training between domains of different datasets by adding a mask vector to the domain label. Our proposed method ensures that the model can ignore unknown labels and focus on the label provided by a particular dataset. In this manner, our model can perform well on tasks such as synthesizing facial expressions of CelebA images using features learned from RaFD, as shown in the rightmost columns of Fig. 1. To the best of our knowledge, our work is the first to successfully perform multi-domain image translation across different datasets.

또한 도메인 레이블에 마스크 벡터를 추가하여 서로 다른 데이터 세트의 도메인 간에 공동 훈련을 가능하게 하는 간단하지만 효과적인 접근 방식을 소개한다. 우리가 제안한 방법은 모델이 알 수 없는 레이블을 무시하고 특정 데이터 세트에서 제공하는 레이블에 집중할 수 있도록 한다. 이러한 방식으로 우리 모델은 그림 1의 가장 오른쪽 열에 표시된 것처럼 RaFD에서 학습한 기능을 사용하여 CelebA 이미지의 얼굴 표정을 합성하는 등의 작업에서 잘 수행할 수 있다. 우리가 아는 한, 우리의 작업은 서로 다른 데이터 세트에서 다중 도메인 이미지 변환을 성공적으로 수행하는 첫 번째 작업이다.

Figure 2. Comparison between cross-domain models and our proposed model, StarGAN. (a) To handle multiple domains, cross-domain models should be built for every pair of image domains. (b) StarGAN is capable of learning mappings among multiple domains using a single generator. The figure represents a star topology connecting multi-domains.

그림 2. 교차 도메인 모델과 제안된 모델인 StarGAN 간의 비교. (a) 여러 도메인을 처리하려면 모든 이미지 도메인 쌍에 대해 교차 도메인 모델을 구축해야 한다. (b) StarGAN은 단일 생성기를 사용하여 여러 도메인 간의 매핑을 학습할 수 있다. 이 그림은 다중 도메인을 연결하는 별 토폴로지를 나타냅니다.

Overall, our contributions are as follows:

전체적으로 우리의 기여는 다음과 같다.

We propose StarGAN, a novel generative adversarial network that learns the mappings among multiple domains using only a single generator and a discriminator, training effectively from images of all domains.

단일 생성기와 판별기만을 사용하여 여러 도메인 간의 매핑을 학습하는 새로운 생성 적대적 네트워크인 StarGAN을 제안하며, 모든 도메인의 이미지에서 효과적으로 훈련한다.
We demonstrate how we can successfully learn multi-domain image translation between multiple datasets by utilizing a mask vector method that enables StarGAN to control all available domain labels.

우리는 StarGAN이 사용 가능한 모든 도메인 레이블을 제어할 수 있는 마스크 벡터 방법을 활용하여 여러 데이터 세트 간의 다중 도메인 이미지 변환을 성공적으로 학습할 수 있는 방법을 보여준다.
We provide both qualitative and quantitative results on facial attribute transfer and facial expression synthesis tasks using StarGAN, showing its superiority over baseline models.

우리는 StarGAN을 사용한 얼굴 속성 전달 및 얼굴 표정 합성 작업에 대한 질적 및 정량적 결과를 모두 제공하여 기준 모델보다 우수함을 보여준다.

$\mathbf{2.\;Related\;Work}$

Generative Adversarial Networks. Generative adversarial networks (GANs) [3] have shown remarkable results in various computer vision tasks such as image generation [6, 24, 32, 8], image translation [7, 9, 33], super-resolution imaging [14], and face image synthesis [10, 16, 26, 31]. A typical GAN model consists of two modules: a discriminator and a generator. The discriminator learns to distinguish between real and fake samples, while the generator learns to generate fake samples that are indistinguishable from real samples. Our approach also leverages the adversarial loss to make the generated images as realistic as possible.

생성적 적대 네트워크. 생성적 적대 네트워크(GAN) [3]는 이미지 생성 [6, 24, 32, 8], 이미지 변환 [7, 9, 33], 초해상도 이미징 [14] 및 얼굴 이미지 합성 [10, 16, 26, 31]과 같은 다양한 컴퓨터 비전 작업에서 주목할 만한 결과를 보여주었다. 일반적인 GAN 모델은 판별기와 생성기의 두 가지 모듈로 구성된다. 판별기는 실제 샘플과 가짜 샘플을 구별하는 방법을 배우는 반면, 생성기는 실제 샘플과 구별할 수 없는 가짜 샘플을 생성하는 방법을 학습한다. 우리의 접근 방식 또한 적대적 손실을 활용하여 생성된 이미지를 가능한 한 현실적으로 만든다.

Conditional GANs. GAN-based conditional image generation has also been actively studied. Prior studies have provided both the discriminator and generator with class information in order to generate samples conditioned on the class [20, 21, 22]. Other recent approaches focused on generating particular images highly relevant to a given text description [25, 30]. The idea of conditional image generation has also been successfully applied to domain transfer [9, 28], super-resolution imaging [14], and photo editing [2, 27]. In this paper, we propose a scalable GAN framework that can flexibly steer the image translation to various target domains by providing conditional domain information.

조건부 GAN. GAN 기반 조건부 이미지 생성도 활발히 연구되었다. 이전 연구에서는 클래스에 따라 조건화된 샘플을 생성하기 위해 판별기와 생성기 모두에 클래스 정보를 제공했다 [20, 21, 22]. 최근의 다른 접근법은 주어진 텍스트 설명과 매우 관련성이 높은 특정 이미지를 생성하는 데 초점을 맞췄다[25, 30]. 조건부 이미지 생성의 개념은 도메인 전이[9, 28], 초해상도 이미징[14] 및 사진 편집[2, 27]에도 성공적으로 적용되었다. 본 논문에서는 조건부 도메인 정보를 제공하여 이미지 변환을 다양한 대상 도메인으로 유연하게 조정할 수 있는 확장 가능한 GAN 프레임워크를 제안한다.

Figure 3. Overview of StarGAN, consisting of two modules, a discriminator $D$ and a generator $G$. (a) $D$ learns to distinguish between real and fake images and to classify the real images to their corresponding domains. (b) $G$ takes in as input both the image and the target domain label and generates a fake image. The target domain label is spatially replicated and concatenated with the input image. (c) $G$ tries to reconstruct the original image from the fake image given the original domain label. (d) $G$ tries to generate images indistinguishable from real images and classifiable as the target domain by $D$.

그림 3. 판별기 $D$와 생성기 $G$의 두 모듈로 구성된 StarGAN의 개요. (a) $D$는 실제 이미지와 가짜 이미지를 구별하고 실제 이미지를 해당 도메인으로 분류하는 방법을 학습한다. (b) $G$는 이미지와 대상 도메인 레이블을 모두 입력으로 받아들여 가짜 이미지를 생성한다. 대상 도메인 라벨은 공간적으로 복제되고 입력 이미지와 연결된다. (c) $G$는 원본 도메인 레이블이 주어진 가짜 이미지에서 원본 이미지를 재구성하려고 한다. (d) $G$는 실제 이미지와 구별할 수 없고 $D$에 의해 대상 도메인으로 분류될 수 있는 이미지를 생성하려고 한다.

Image-to-Image Translation. Recent works have achieved impressive results in image-to-image translation [7, 9, 17, 33]. For instance, pix2pix [7] learns this task in a supervised manner using cGANs [20]. It combines an adversarial loss with an L1 loss and thus requires paired data samples. To alleviate the problem of obtaining data pairs, unpaired image-to-image translation frameworks [9, 17, 33] have been proposed. UNIT [17] combines variational autoencoders (VAEs) [12] with CoGAN [18], a GAN framework where two generators share weights to learn the joint distribution of images in cross domains. CycleGAN [33] and DiscoGAN [9] preserve key attributes between the input and the translated image by utilizing a cycle consistency loss. However, all these frameworks are only capable of learning the relations between two different domains at a time. Their approaches have limited scalability in handling multiple domains since different models should be trained for each pair of domains. Unlike the aforementioned approaches, our framework can learn the relations among multiple domains using only a single model.

이미지 간 변환. 최근 연구는 이미지 간 변환에서 인상적인 결과를 달성했다[7, 9, 17, 33]. 예를 들어, pix2pix[7]는 cGAN[20]을 사용하여 지도 학습 방식으로 이 작업을 학습한다. 적대적 손실과 L1 손실을 결합하므로 쌍을 이룬 데이터 샘플이 필요하다. 데이터 쌍을 얻는 문제를 완화하기 위해, 쌍이 없는 이미지 대 이미지 변환 프레임워크[9, 17, 33]가 제안되었다. UNIT [17]은 변분 오토인코더(VAE) [12]를 CoGAN [18]과 결합한다. CoGAN [18]은 두 생성기가 가중치를 공유하여 교차 도메인에서 이미지의 결합 분포를 학습하는 GAN 프레임워크이다. CycleGAN [33] 및 DiscoGAN [9]은 주기 일관성 손실을 활용하여 입력과 변환된 이미지 사이의 주요 속성을 보존한다. 그러나 이러한 모든 프레임워크는 한 번에 두 개의 서로 다른 도메인 간의 관계만 학습할 수 있다. 그들의 접근 방식은 각 도메인 쌍에 대해 서로 다른 모델이 훈련되어야 하기 때문에 여러 도메인을 처리하는 데 있어 확장성이 제한적이다. 앞서 언급한 접근 방식과 달리, 우리의 프레임워크는 단일 모델만 사용하여 여러 도메인 간의 관계를 학습할 수 있다.

$\mathbf{3.\;Star\;Generative\;Adversarial\;Networks}$

We first describe our proposed StarGAN, a framework to address multi-domain image-to-image translation within a single dataset. Then, we discuss how StarGAN incorporates multiple datasets containing different label sets to flexibly perform image translations using any of these labels.

우리는 먼저 단일 데이터 세트 내에서 다중 도메인 이미지 간 변환을 해결하기 위한 프레임워크인 제안된 StarGAN을 설명한다. 그런 다음 StarGAN이 이러한 레이블 중 하나를 사용하여 이미지 변환을 유연하게 수행하기 위해 서로 다른 레이블 세트를 포함하는 여러 데이터 세트를 통합하는 방법에 대해 논의한다.

$\mathbf{3.1.\;Multi-Domain\;Image-to-Image\;Translation}$

Our goal is to train a single generator $G$ that learns mappings among multiple domains. To achieve this, we train $G$ to translate an input image $x$ into an output image $y$ conditioned on the target domain label $c$, $G(x,c)\to{}y$. We randomly generate the target domain label $c$ so that $G$ learns to flexibly translate the input image. We also introduce an auxiliary classifier [22] that allows a single discriminator to control multiple domains. That is, our discriminator produces probability distributions over both sources and domain labels, $D:x\to{}\lbrace{}D_{src}(x),D_{cls}(x)\rbrace{}$. Fig. 3 illustrates the training process of our proposed approach.

우리의 목표는 여러 도메인 간의 매핑을 학습하는 단일 생성기 $G$를 훈련하는 것이다. 이를 위해, 우리는 입력 이미지 $x$를 대상 도메인 레이블 $c$에 조건화된 출력 이미지 $y$로 변환하도록 $G$를 훈련시킨다($G(x,c)\to{}y$). $G$가 입력 이미지를 유연하게 변환하는 방법을 학습하도록 대상 도메인 레이블 $c$를 무작위로 생성한다. 또한 단일 판별기가 여러 도메인을 제어할 수 있게 하는 보조 분류기 [22]를 도입한다. 즉, 우리의 판별기는 소스 및 도메인 레이블 모두에 대한 확률 분포 $D:x\to{}\lbrace{}D_{src}(x),D_{cls}(x)\rbrace{}$를 생성한다. 그림 3은 우리가 제안한 접근 방식의 훈련 과정을 보여준다.
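The caption of Fig. 3 notes that the target domain label is spatially replicated and concatenated with the input image. A minimal sketch of that conditioning step, using plain nested lists instead of a tensor library, could look as follows. The helper name `concat_label`, the channel-first layout, and the toy sizes are assumptions made for illustration only.

```python
def concat_label(image, label):
    """Append one constant H x W plane per label entry to a C x H x W image.

    image: list of C channels, each a list of H rows with W floats.
    label: list of n_c numbers (binary or one-hot domain label).
    Returns a new list with C + n_c channels.
    """
    height, width = len(image[0]), len(image[0][0])
    # Spatial replication: every label entry becomes a constant plane.
    planes = [[[float(v)] * width for _ in range(height)] for v in label]
    # Channel-wise concatenation of the image and the label planes.
    return image + planes


image = [[[0.5] * 4 for _ in range(4)] for _ in range(3)]  # toy 3 x 4 x 4 image
label = [0, 1, 0, 1, 1]                                    # toy 5-dim label
g_input = concat_label(image, label)
print(len(g_input), len(g_input[0]), len(g_input[0][0]))   # 8 4 4
```

The generator then only sees a deeper input volume, so its structure does not depend on which target domain is requested.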

Adversarial Loss. To make the generated images indistinguishable from real images, we adopt an adversarial loss

적대적 손실. 생성된 이미지를 실제 이미지와 구별할 수 없도록 하기 위해 적대적 손실을 채택한다.

\[L_{adv}=E_{x}[\log{}D_{src}(x)]+E_{x,c}[\log{}(1-D_{src}(G(x,c)))], \tag{1}\]

where $G$ generates an image $G(x,c)$ conditioned on both the input image $x$ and the target domain label $c$, while $D$ tries to distinguish between real and fake images. In this paper, we refer to the term $D_{src}(x)$ as a probability distribution over sources given by $D$. The generator $G$ tries to minimize this objective, while the discriminator $D$ tries to maximize it.

여기서 $G$는 입력 이미지 $x$와 대상 도메인 레이블 $c$에 대해 조건화된 이미지 $G(x,c)$를 생성하는 반면, $D$는 실제 이미지와 가짜 이미지를 구별하려고 한다. 본 논문에서, 우리는 $D$에 의해 주어진 소스에 대한 확률 분포로 $D_{src}(x)$라는 용어를 언급한다. 생성기 $G$는 이 목표를 최소화하려고 하는 반면 판별기 $D$는 이를 최대화하려고 한다.
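To make the two terms of the adversarial loss concrete, the sketch below estimates it from a handful of discriminator outputs. It assumes $D_{src}$ returns a probability in $(0, 1)$ for each image; the function name and the sample values are hypothetical.

```python
import math


def adversarial_loss(d_real, d_fake):
    """Sample estimate of E[log D_src(x)] + E[log(1 - D_src(G(x, c)))].

    d_real: D_src outputs on real images, each in (0, 1).
    d_fake: D_src outputs on generated images, each in (0, 1).
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# A discriminator that separates real from fake well gets a value close to 0.
confident_d = adversarial_loss([0.9, 0.95], [0.1, 0.05])
# A fully fooled discriminator (always 0.5) gets 2 * log(0.5).
fooled_d = adversarial_loss([0.5, 0.5], [0.5, 0.5])
print(confident_d > fooled_d)
```

This matches the roles in the text: $D$ pushes the value up, while $G$ pushes it down by making its outputs harder to tell apart from real images.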

Domain Classification Loss. For a given input image $x$ and a target domain label $c$, our goal is to translate $x$ into an output image $y$, which is properly classified to the target domain $c$. To achieve this condition, we add an auxiliary classifier on top of $D$ and impose the domain classification loss when optimizing both $D$ and $G$. That is, we decompose the objective into two terms: a domain classification loss of real images used to optimize $D$, and a domain classification loss of fake images used to optimize $G$. In detail, the former is defined as

도메인 분류 손실. 주어진 입력 이미지 $x$와 대상 도메인 레이블 $c$의 경우, 우리의 목표는 $x$를 대상 도메인 $c$로 적절하게 분류되는 출력 이미지 $y$로 변환하는 것이다. 이 조건을 달성하기 위해 $D$ 위에 보조 분류기를 추가하고 $D$와 $G$를 모두 최적화할 때 도메인 분류 손실을 부과한다. 즉, 우리는 목표를 $D$ 최적화에 사용되는 실제 이미지의 도메인 분류 손실과 $G$ 최적화에 사용되는 가짜 이미지의 도메인 분류 손실의 두 가지 용어로 분해한다. 세부적으로, 전자는 다음과 같이 정의된다.

\[L_{cls}^{r}=E_{x,c'}[-\log{}D_{cls}(c'\vert{}x)], \tag{2}\]

where the term $D_{cls}(c'\vert{}x)$ represents a probability distribution over domain labels computed by $D$. By minimizing this objective, $D$ learns to classify a real image $x$ to its corresponding original domain $c'$. We assume that the input image and domain label pair $(x,c')$ is given by the training data. On the other hand, the loss function for the domain classification of fake images is defined as

여기서 $D_{cls}(c'\vert{}x)$라는 항은 $D$에 의해 계산된 도메인 레이블에 대한 확률 분포를 나타낸다. 이 목표를 최소화함으로써 $D$는 실제 이미지 $x$를 해당 원본 도메인 $c'$로 분류하는 것을 학습한다. 입력 이미지와 도메인 레이블 쌍 $(x,c')$는 훈련 데이터에 의해 주어졌다고 가정한다. 반면, 가짜 이미지의 도메인 분류를 위한 손실 함수는 다음과 같이 정의된다.

\[L_{cls}^{f}=E_{x,c}[-\log{}D_{cls}(c\vert{}G(x,c))]. \tag{3}\]

In other words, $G$ tries to minimize this objective to generate images that can be classified as the target domain $c$.

즉, $G$는 대상 도메인 $c$로 분류될 수 있는 이미지를 생성하기 위해 이 목표를 최소화하려고 한다.
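Both classification terms are negative log-likelihoods; they differ only in which image is scored and which network is updated. The sketch below assumes a categorical label and a classifier that outputs one probability per domain. The helper name and the probability values are made up for illustration.

```python
import math


def domain_cls_loss(probs, target_index):
    """-log D_cls(c | image) for one image with a categorical domain label."""
    return -math.log(probs[target_index])


# L_cls^r: a real image x scored against its original label c' (updates D).
loss_real = domain_cls_loss([0.7, 0.2, 0.1], target_index=0)

# L_cls^f: a fake image G(x, c) scored against the target label c (updates G).
loss_fake = domain_cls_loss([0.1, 0.1, 0.8], target_index=2)

print(loss_real, loss_fake)
```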

Reconstruction Loss. By minimizing the adversarial and classification losses, $G$ is trained to generate images that are realistic and classified to their correct target domain. However, minimizing the losses (Eqs. (1) and (3)) does not guarantee that translated images preserve the content of their input images while changing only the domain-related part of the inputs. To alleviate this problem, we apply a cycle consistency loss [9, 33] to the generator, defined as

재구성 손실. 적대적 및 분류 손실을 최소화함으로써 $G$는 현실적이고 올바른 대상 도메인으로 분류되는 이미지를 생성하도록 훈련된다. 그러나 손실(식 (1) 및 (3))을 최소화한다고 해서 변환된 이미지가 입력의 도메인 관련 부분만 변경하면서 입력 이미지의 내용을 보존한다는 보장은 없다. 이 문제를 완화하기 위해 다음과 같이 정의된 사이클 일관성 손실 [9, 33]을 생성기에 적용한다.

\[L_{rec}=E_{x,c,c'}[\lVert{}x-G(G(x,c),c')\rVert{}_{1}], \tag{4}\]

where $G$ takes in the translated image $G(x,c)$ and the original domain label $c'$ as input and tries to reconstruct the original image $x$. We adopt the L1 norm as our reconstruction loss. Note that we use a single generator twice, first to translate an original image into an image in the target domain and then to reconstruct the original image from the translated image.

여기서 $G$는 변환된 이미지 $G(x,c)$와 원본 도메인 레이블 $c'$를 입력으로 받아들이고 원본 이미지 $x$를 재구성하려고 한다. 우리는 L1 노름을 재구성 손실로 채택한다. 단일 생성기를 두 번 사용한다는 점에 유의하라. 먼저 원본 이미지를 대상 도메인의 이미지로 변환한 다음 변환된 이미지에서 원본 이미지를 재구성한다.
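The "same generator used twice" idea can be sketched with a toy stand-in for $G$. Everything here is hypothetical: the toy generator simply shifts pixel values so that their mean equals the requested label, which is not how StarGAN's generator works, and the loss is averaged over pixels, which equals the L1 norm only up to a constant factor.

```python
def l1_distance(x, x_rec):
    """Mean absolute difference between two flattened images."""
    return sum(abs(a - b) for a, b in zip(x, x_rec)) / len(x)


def toy_generator(x, c):
    """Hypothetical stand-in for G: shift pixels so that their mean equals c."""
    mean = sum(x) / len(x)
    return [p - mean + c for p in x]


x = [0.2, 0.4, 0.6]                # flattened toy image whose "domain" is its mean
c_orig, c_target = 0.4, 0.9        # original label c' and target label c

fake = toy_generator(x, c_target)  # G(x, c)
rec = toy_generator(fake, c_orig)  # G(G(x, c), c')
loss_rec = l1_distance(x, rec)
print(loss_rec)
```

Because the toy generator only changes the domain-related quantity (the mean), the cycle returns the original image and the loss is numerically zero; a generator that also altered the content would be penalized.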

Full Objective. Finally, the objective functions to optimize $G$ and $D$ are written, respectively, as

완전 목표. 마지막으로, $G$와 $D$를 최적화하는 목적 함수는 각각 다음과 같이 작성된다.

\[L_{D}=-L_{adv}+\lambda_{cls}L_{cls}^{r}, \tag{5}\]

\[L_{G}=L_{adv}+\lambda_{cls}L_{cls}^{f}+\lambda_{rec}L_{rec}, \tag{6}\]

where $\lambda_{cls}$ and $\lambda_{rec}$ are hyper-parameters that control the relative importance of domain classification and reconstruction losses, respectively, compared to the adversarial loss. We use $\lambda_{cls}=1$ and $\lambda_{rec}=10$ in all of our experiments.

여기서 $\lambda_{cls}$와 $\lambda_{rec}$는 각각 적대적 손실과 비교하여 도메인 분류와 재구성 손실의 상대적 중요성을 제어하는 하이퍼 매개 변수이다. 우리는 모든 실험에서 $\lambda_{cls}=1$과 $\lambda_{rec}=10$을 사용한다.
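The two objectives are weighted sums of the terms defined earlier. The sketch below combines already-computed scalar losses with the weights stated in the text; the function names and the numeric loss values are placeholders.

```python
LAMBDA_CLS = 1.0   # weight of the domain classification loss (from the text)
LAMBDA_REC = 10.0  # weight of the reconstruction loss (from the text)


def discriminator_objective(l_adv, l_cls_real):
    """L_D = -L_adv + lambda_cls * L_cls^r."""
    return -l_adv + LAMBDA_CLS * l_cls_real


def generator_objective(l_adv, l_cls_fake, l_rec):
    """L_G = L_adv + lambda_cls * L_cls^f + lambda_rec * L_rec."""
    return l_adv + LAMBDA_CLS * l_cls_fake + LAMBDA_REC * l_rec


# Placeholder scalar losses for one training step.
l_d = discriminator_objective(l_adv=-0.5, l_cls_real=0.3)
l_g = generator_objective(l_adv=-0.5, l_cls_fake=0.4, l_rec=0.05)
print(l_d, l_g)
```

With $\lambda_{rec}=10$, even a small reconstruction error contributes as much to $L_G$ as a much larger classification error, which reflects the emphasis on preserving image content.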

$\mathbf{3.2.\;Training\;with\;Multiple\;Datasets}$

An important advantage of StarGAN is that it simultaneously incorporates multiple datasets containing different types of labels, so that StarGAN can control all the labels at the test phase. An issue when learning from multiple datasets, however, is that the label information is only partially known to each dataset. In the case of CelebA [19] and RaFD [13], while the former contains labels for attributes such as hair color and gender, it does not have any labels for facial expressions such as ‘happy’ and ‘angry’, and vice versa for the latter. This is problematic because the complete information on the label vector $c'$ is required when reconstructing the input image $x$ from the translated image $G(x,c)$ (see Eq. (4)).

StarGAN의 중요한 장점은 다른 유형의 레이블을 포함하는 여러 데이터 세트를 동시에 통합하여 StarGAN이 테스트 단계에서 모든 레이블을 제어할 수 있다는 것이다. 그러나 여러 데이터 세트에서 학습할 때 문제는 레이블 정보가 각 데이터 세트에 부분적으로만 알려져 있다는 것이다. CelebA[19]와 RaFD[13]의 경우, 전자는 머리색이나 성별과 같은 속성에 대한 라벨을 포함하고 있지만 ‘행복’과 ‘분노’와 같은 얼굴 표정에 대한 라벨은 없으며, 후자는 그 반대이다. 이것은 변환된 이미지 $G(x,c)$에서 입력 이미지 $x$를 재구성할 때 레이블 벡터 $c'$에 대한 완전한 정보가 필요하기 때문에 문제가 된다(식 (4) 참조).

Mask Vector. To alleviate this problem, we introduce a mask vector $m$ that allows StarGAN to ignore unspecified labels and focus on the explicitly known label provided by a particular dataset. In StarGAN, we use an n-dimensional one-hot vector to represent $m$, with $n$ being the number of datasets. In addition, we define a unified version of the label as a vector

마스크 벡터. 이 문제를 완화하기 위해 StarGAN이 지정되지 않은 레이블을 무시하고 특정 데이터 세트에서 제공하는 명시적으로 알려진 레이블에 집중할 수 있는 마스크 벡터 $m$을 소개한다. StarGAN에서 우리는 n차원 원핫 벡터를 사용하여 $m$을 나타내며, $n$은 데이터 세트의 수입니다. 또한 레이블의 통합 버전을 벡터로 정의한다.

\[\tilde{c}=[c_{1},\ldots,c_{n},m], \tag{7}\]

where $[\cdot]$ refers to concatenation, and $c_{i}$ represents a vector for the labels of the $i$-th dataset. The vector of the known label $c_{i}$ can be represented as either a binary vector for binary attributes or a one-hot vector for categorical attributes. For the remaining $n-1$ unknown labels we simply assign zero values. In our experiments, we utilize the CelebA and RaFD datasets, where $n$ is two.

여기서 $[\cdot]$는 연결을 의미하며, $c_{i}$는 $i$번째 데이터 세트의 레이블에 대한 벡터를 나타낸다. 알려진 레이블 $c_{i}$의 벡터는 이진 속성에 대한 이진 벡터 또는 범주 속성에 대한 원핫 벡터로 표현될 수 있다. 나머지 $n-1$개의 알 수 없는 레이블에 대해서는 단순히 0 값을 할당한다. 우리의 실험에서는 CelebA 및 RaFD 데이터 세트를 활용하며, 이때 $n$은 2이다.
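The construction of the unified label can be sketched in a few lines. The function name is hypothetical, and the five-entry CelebA label is an assumption chosen for the example; the eight RaFD expression classes come from the text.

```python
def unified_label(labels_per_dataset, known_index):
    """Build [c_1, ..., c_n, m] for an image from dataset `known_index`.

    labels_per_dataset: list of n label vectors; entries for unknown
        datasets are only used for their length and are zeroed out.
    known_index: index of the dataset the image comes from.
    """
    n = len(labels_per_dataset)
    parts = []
    for i, c_i in enumerate(labels_per_dataset):
        # Keep the known label, assign zeros to the n - 1 unknown ones.
        parts += list(c_i) if i == known_index else [0] * len(c_i)
    mask = [1 if i == known_index else 0 for i in range(n)]  # one-hot m
    return parts + mask


celeba_attrs = [1, 0, 0, 1, 1]             # assumed 5 binary attributes
rafd_onehot = [0, 0, 1, 0, 0, 0, 0, 0]     # one of 8 expressions

print(unified_label([celeba_attrs, [0] * 8], known_index=0))
print(unified_label([[0] * 5, rafd_onehot], known_index=1))
```

The mask is what lets the generator tell a genuinely all-zero known label apart from a label block that is zero only because it is unknown.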

Training Strategy. When training StarGAN with multiple datasets, we use the domain label $\tilde{c}$ defined in Eq. (7) as input to the generator. By doing so, the generator learns to ignore the unspecified labels, which are zero vectors, and focus on the explicitly given label. The structure of the generator is exactly the same as in training with a single dataset, except for the dimension of the input label $\tilde{c}$. On the other hand, we extend the auxiliary classifier of the discriminator to generate probability distributions over labels for all datasets. Then, we train the model in a multi-task learning setting, where the discriminator tries to minimize only the classification error associated to the known label. For example, when training with images in CelebA, the discriminator minimizes only classification errors for labels related to CelebA attributes, and not facial expressions related to RaFD. Under these settings, by alternating between CelebA and RaFD the discriminator learns all of the discriminative features for both datasets, and the generator learns to control all the labels in both datasets.

교육 전략. 여러 데이터 세트로 StarGAN을 훈련할 때, 우리는 생성기에 대한 입력으로 등식(7)에 정의된 도메인 레이블 $\tilde{c}$를 사용한다. 그렇게 함으로써, 생성자는 0 벡터인 지정되지 않은 레이블을 무시하고 명시적으로 주어진 레이블에 초점을 맞추는 법을 배운다. 생성기의 구조는 입력 레이블 $\tilde{c}$의 차원을 제외하고 단일 데이터 세트를 사용한 훈련과 정확히 동일하다. 한편, 우리는 판별기의 보조 분류기를 확장하여 모든 데이터 세트에 대한 레이블에 대한 확률 분포를 생성한다. 그런 다음 판별기가 알려진 레이블과 관련된 분류 오류만 최소화하려고 하는 다중 작업 학습 환경에서 모델을 훈련시킨다. 예를 들어, CelebA에서 이미지로 훈련할 때 판별기는 CelebA 속성과 관련된 레이블에 대한 분류 오류만 최소화하고 RaFD와 관련된 얼굴 표정은 최소화하지 않는다. 이러한 설정에서 CelebA와 RaFD를 번갈아 가며 판별기는 두 데이터 세트에 대한 모든 차별적 기능을 학습하고 생성기는 두 데이터 세트의 모든 레이블을 제어하는 방법을 학습한다.
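The multi-task setting described above amounts to computing the classification error on only the slice of the classifier output that belongs to the current dataset. The sketch below is a simplification under stated assumptions: the 5 + 8 output layout is hypothetical, and binary cross-entropy is used for every entry, whereas a categorical label such as an expression would more naturally use a softmax cross-entropy.

```python
import math

# Assumed layout of the classifier output: 5 CelebA entries, then 8 RaFD entries.
SLICES = {"celeba": slice(0, 5), "rafd": slice(5, 13)}


def bce(p, t):
    """Binary cross-entropy for one probability p in (0, 1) and target t."""
    return -(t * math.log(p) + (1 - t) * math.log(1 - p))


def known_label_cls_loss(probs, target, dataset):
    """Average classification error over the label entries of `dataset` only."""
    s = SLICES[dataset]
    pairs = list(zip(probs[s], target[s]))
    return sum(bce(p, t) for p, t in pairs) / len(pairs)


target = [1, 0, 0, 1, 1] + [0] * 8
probs_a = [0.9, 0.1, 0.2, 0.8, 0.7] + [0.5] * 8
probs_b = [0.9, 0.1, 0.2, 0.8, 0.7] + [0.99] * 8  # different RaFD outputs

loss_a = known_label_cls_loss(probs_a, target, "celeba")
loss_b = known_label_cls_loss(probs_b, target, "celeba")
print(loss_a == loss_b)  # the RaFD outputs do not affect a CelebA batch
```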

Figure 4. Facial attribute transfer results on the CelebA dataset. The first column shows the input image, the next four columns show the single-attribute transfer results, and the rightmost columns show the multi-attribute transfer results. H: Hair color, G: Gender, A: Aged.

그림 4. CelebA 데이터 세트에 대한 얼굴 속성 전송 결과. 첫 번째 열에는 입력 이미지가 표시되고, 다음 네 개의 열에는 단일 속성 전송 결과가 표시되며, 오른쪽 끝 열에는 다중 속성 전송 결과가 표시됩니다. H: 머리색, G: 성별, A: 나이.

$\mathbf{4.\;Implementation}$

Improved GAN Training. To stabilize the training process and generate higher quality images, we replace Eq. (1) with the Wasserstein GAN objective with gradient penalty [1, 4], defined as

GAN 교육 개선. 훈련 프로세스를 안정화하고 고품질 이미지를 생성하기 위해 다음과 같이 정의된 그레이디언트 패널티 [1, 4]로 Wasserstein GAN 목표로 등식 (1)을 대체한다.

\[L_{adv}=E_{x}[D_{src}(x)]-E_{x,c}[D_{src}(G(x,c))]-\lambda_{gp}E_{\hat{x}}[(\lVert{}\nabla_{\hat{x}}D_{src}(\hat{x})\rVert{}_{2}-1)^{2}], \tag{8}\]

where $\hat{x}$ is sampled uniformly along a straight line between a pair of real and generated images. We use $\lambda_{gp}=10$ for all experiments.

여기서 $\hat{x}$는 실제 이미지와 생성된 이미지 쌍 사이의 직선을 따라 균일하게 샘플링된다. 우리는 모든 실험에 $\lambda_{gp}=10$을 사용한다.
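The gradient penalty can be illustrated without an automatic differentiation library by assuming a linear critic $D_{src}(x)=w\cdot x$, whose gradient with respect to $x$ is $w$ everywhere. This is a deliberate simplification; a real critic is a deep network and the gradient at $\hat{x}$ would be obtained by backpropagation. All names below are hypothetical.

```python
import math


def interpolate(real, fake, alpha):
    """x_hat on the straight line between a real and a generated sample."""
    return [alpha * r + (1 - alpha) * f for r, f in zip(real, fake)]


def gradient_penalty_linear(w):
    """(||grad D||_2 - 1)^2 for the linear critic D(x) = w . x.

    The gradient of a linear critic is w at every point, so the value
    does not depend on where x_hat is sampled.
    """
    norm = math.sqrt(sum(v * v for v in w))
    return (norm - 1.0) ** 2


def wgan_gp_objective(w, real, fake, lambda_gp=10.0):
    """D(real) - D(fake) - lambda_gp * penalty, for one sample pair."""
    def critic(x):
        return sum(wi * xi for wi, xi in zip(w, x))

    return critic(real) - critic(fake) - lambda_gp * gradient_penalty_linear(w)


x_hat = interpolate([1.0, 1.0], [0.0, 0.0], alpha=0.25)
print(x_hat)
print(gradient_penalty_linear([3.0, 4.0]))  # gradient norm 5, far from 1
print(wgan_gp_objective([3.0, 4.0], [1.0, 1.0], [0.0, 0.0]))
```

A critic with a steep gradient scores the real sample much higher than the fake one, but the penalty term outweighs that gain, which is what keeps the critic close to unit gradient norm.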

Network Architecture. Adapted from CycleGAN [33], StarGAN has the generator network composed of two convolutional layers with the stride size of two for downsampling, six residual blocks [5], and two transposed convolutional layers with the stride size of two for upsampling. We use instance normalization [29] for the generator but no normalization for the discriminator. We leverage PatchGANs [7, 15, 33] for the discriminator network, which classifies whether local image patches are real or fake. See the appendix (Section 7.2) for more details about the network architecture.

네트워크 아키텍처. CycleGAN [33]에서 채택한 StarGAN은 다운샘플링을 위해 스트라이드 크기가 2인 2개의 컨볼루션 레이어, 6개의 잔여 블록[5], 업샘플링을 위해 스트라이드 크기가 2인 2개의 전치 컨볼루션 레이어로 구성된 생성기 네트워크를 가지고 있다. 우리는 생성기에 인스턴스 정규화[29]를 사용하지만 판별기에 대한 정규화는 사용하지 않는다. 우리는 로컬 이미지 패치가 진짜인지 가짜인지를 분류하는 판별기 네트워크에 PatchGAN[7, 15, 33]을 활용한다. 네트워크 아키텍처에 대한 자세한 내용은 부록(7.2절)을 참조하라.
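The effect of the stride-two layers on spatial resolution can be checked with the standard output-size formulas for convolution and transposed convolution. The kernel size of 4 and padding of 1 used below are assumptions made for the example, since this section does not state them; only the stride of two and the layer counts come from the text.

```python
def conv_out(size, kernel, stride, padding):
    """Output size of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def deconv_out(size, kernel, stride, padding):
    """Output size of a transposed convolution (no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel


size = 128          # input resolution used in the experiments
trace = [size]

for _ in range(2):  # two stride-2 convolutions for downsampling
    size = conv_out(size, kernel=4, stride=2, padding=1)
    trace.append(size)

# The six residual blocks are assumed to keep the spatial size unchanged.

for _ in range(2):  # two stride-2 transposed convolutions for upsampling
    size = deconv_out(size, kernel=4, stride=2, padding=1)
    trace.append(size)

print(trace)        # [128, 64, 32, 64, 128]
```

Under these assumptions the residual blocks operate at a quarter of the input resolution, and the output image has the same size as the input.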

$\mathbf{5.\;Experiments}$

In this section, we first compare StarGAN against recent methods on facial attribute transfer by conducting user studies. Next, we perform a classification experiment on facial expression synthesis. Lastly, we demonstrate empirical results showing that StarGAN can learn image-to-image translation from multiple datasets. All our experiments were conducted using model outputs on images that were unseen during the training phase.

이 섹션에서는 먼저 사용자 연구를 수행하여 StarGAN을 얼굴 속성 전달에 대한 최신 방법과 비교한다. 다음으로, 우리는 얼굴 표정 합성에 대한 분류 실험을 수행한다. 마지막으로, 우리는 StarGAN이 여러 데이터 세트에서 이미지 간 변환을 학습할 수 있다는 경험적 결과를 보여준다. 우리의 모든 실험은 훈련 단계에서 보이지 않는 이미지의 모델 출력을 사용하여 수행되었다.

$\mathbf{5.1.\;Baseline\;Models}$

As our baseline models, we adopt DIAT [16] and CycleGAN [33], both of which perform image-to-image translation between two different domains. For comparison, we trained these models multiple times for every pair of two different domains. We also adopt IcGAN [23] as a baseline, which can perform attribute transfer using a cGAN [22].

기준 모델로, 우리는 DIAT[16]와 CycleGAN[33]을 채택하는데, 둘 다 서로 다른 두 도메인 간에 이미지 간 변환을 수행한다. 비교를 위해, 우리는 두 개의 서로 다른 도메인의 모든 쌍에 대해 이러한 모델을 여러 번 훈련시켰다. 우리는 또한 cGAN [22]을 사용하여 속성 전송을 수행할 수 있는 IcGAN [23]을 기준선으로 채택한다.

DIAT uses an adversarial loss to learn the mapping from $x\in{}X$ to $y\in{}Y$, where $x$ and $y$ are face images in two different domains $X$ and $Y$, respectively. This method has a regularization term on the mapping as $\lVert{}x-F(G(x))\rVert{}_{1}$ to preserve identity features of the source image, where $F$ is a feature extractor pretrained on a face recognition task.

DIAT는 적대적 손실을 사용하여 $x\in{}X$에서 $y\in{}Y$로의 매핑을 학습한다. 여기서 $x$와 $y$는 각각 두 개의 서로 다른 도메인 $X$와 $Y$의 얼굴 이미지이다. 이 방법은 소스 이미지의 신원 특징을 보존하기 위해 매핑에 정규화 항 $\lVert{}x-F(G(x))\rVert{}_{1}$을 가지고 있으며, 여기서 $F$는 얼굴 인식 작업에 대해 사전 훈련된 특징 추출기이다.

CycleGAN also uses an adversarial loss to learn the mapping between two different domains $X$ and $Y$. This method regularizes the mapping via cycle consistency losses, $\lVert{}x-(G_{YX}(G_{XY}(x)))\rVert{}_{1}$ and $\lVert{}y-(G_{XY}(G_{YX}(y)))\rVert{}_{1}$. This method requires two generators and discriminators for each pair of two different domains.

CycleGAN은 또한 적대적 손실을 사용하여 두 개의 서로 다른 도메인 $X$와 $Y$ 사이의 매핑을 학습한다. 이 방법은 주기 일관성 손실인 $\lVert{}x-(G_{YX}(G_{XY}(x)))\rVert{}_{1}$와 $\lVert{}y-(G_{XY}(G_{YX}(y)))\rVert{}_{1}$를 통해 매핑을 정규화한다. 이 방법에는 두 개의 서로 다른 도메인의 각 쌍에 대해 두 개의 생성기와 판별기가 필요하다.

IcGAN combines an encoder with a cGAN [22] model. cGAN learns the mapping $G:\lbrace{}z,c\rbrace{}\to{}x$ that generates an image $x$ conditioned on both the latent vector $z$ and the conditional vector $c$. In addition, IcGAN introduces an encoder to learn the inverse mappings of cGAN, $E_{z}:x\to{}z$ and $E_{c}:x\to{}c$. This allows IcGAN to synthesize images by changing only the conditional vector and preserving the latent vector.

IcGAN은 인코더를 cGAN [22] 모델과 결합한다. cGAN은 잠재 벡터 $z$와 조건부 벡터 $c$ 모두에 대해 조건화된 이미지 $x$를 생성하는 매핑 $G:\lbrace{}z,c\rbrace{}\to{}x$를 학습한다. 또한 IcGAN은 인코더를 도입하여 cGAN의 역매핑인 $E_{z}:x\to{}z$ 및 $E_{c}:x\to{}c$를 학습한다. 이를 통해 IcGAN은 조건부 벡터만 변경하고 잠재 벡터를 보존함으로써 이미지를 합성할 수 있다.

Figure 5. Facial expression synthesis results on the RaFD dataset.

그림 5. RaFD 데이터 세트에 대한 얼굴 표정 합성 결과.

$\mathbf{5.2.\;Datasets}$

CelebA. The CelebFaces Attributes (CelebA) dataset [19] contains 202,599 face images of celebrities, each annotated with 40 binary attributes. We crop the initial 178×218 size images to 178×178, then resize them to 128×128. We randomly select 2,000 images as the test set and use all remaining images as training data. We construct seven domains using the following attributes: hair color (black, blond, brown), gender (male/female), and age (young/old).

셀럽 A. CelebFaces 속성(CelebA) 데이터 세트[19]에는 각각 40개의 이진 속성으로 주석이 달린 202,599개의 유명인 얼굴 이미지가 포함되어 있다. 우리는 초기 178×218 크기의 이미지를 178×178로 자른 다음 128×128로 크기를 조정한다. 2,000개의 이미지를 테스트 세트로 무작위로 선택하고 나머지 모든 이미지를 교육 데이터에 사용합니다. 우리는 머리색(검정색, 금발, 갈색), 성별(남성/여성) 및 나이(젊음/고령)의 속성을 사용하여 7개의 도메인을 구성한다.

RaFD. The Radboud Faces Database (RaFD) [13] consists of 4,824 images collected from 67 participants. Each participant makes eight facial expressions in three different gaze directions, which are captured from three different angles. We crop the images to 256 × 256, where the faces are centered, and then resize them to 128 × 128.

RaFD. Radboud Faces Database (RaFD) [13]는 67명의 참가자로부터 수집된 4,824개의 이미지로 구성되어 있다. 각 참가자는 세 개의 다른 시선 방향에서 여덟 개의 표정을 짓는데, 세 개의 다른 각도에서 포착된다. 우리는 얼굴이 중앙에 있는 256 × 256으로 이미지를 자른 다음 128 × 128로 크기를 조정한다.
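The dataset figures above are easy to sanity-check with a few lines of arithmetic. The crop helper below assumes a center crop for CelebA, which the text does not state explicitly; the helper name is hypothetical.

```python
def center_crop_box(width, height, crop):
    """(left, top, right, bottom) of a centered square crop (assumed centering)."""
    left = (width - crop) // 2
    top = (height - crop) // 2
    return left, top, left + crop, top + crop


# CelebA: 178 x 218 images cropped to 178 x 178 before resizing to 128 x 128.
celeba_box = center_crop_box(178, 218, 178)

# CelebA split: 2,000 test images, the rest for training.
celeba_train = 202_599 - 2_000

# RaFD: participants x expressions x gaze directions x camera angles.
rafd_images = 67 * 8 * 3 * 3

print(celeba_box, celeba_train, rafd_images)
```

The RaFD product reproduces the 4,824 images quoted in the text.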

$\mathbf{5.3.\;Training}$

All models are trained using Adam [11] with $β_{1}=0.5$ and $β_{2}=0.999$. For data augmentation we flip the images horizontally with a probability of 0.5. We perform one generator update after five discriminator updates as in [4]. The batch size is set to 16 for all experiments. For experiments on CelebA, we train all models with a learning rate of 0.0001 for the first 10 epochs and linearly decay the learning rate to 0 over the next 10 epochs. To compensate for the lack of data, when training with RaFD we train all models for 100 epochs with a learning rate of 0.0001 and apply the same decaying strategy over the next 100 epochs. Training takes about one day on a single NVIDIA Tesla M40 GPU.

모든 모델은 Adam [11]을 사용하여 $β_{1}=0.5$ 및 $β_{2}=0.999$를 사용하여 훈련된다. 데이터 확대를 위해 0.5의 확률로 이미지를 수평으로 뒤집는다. [4]에서와 같이 5개의 판별기 업데이트 후 하나의 발전기 업데이트를 수행한다. 배치 크기는 모든 실험에 대해 16으로 설정됩니다. CelebA에 대한 실험을 위해, 우리는 처음 10세기 동안 0.0001의 학습률을 가진 모든 모델을 훈련시키고 다음 10세기 동안 학습률을 0으로 선형적으로 감소시킨다. 데이터 부족을 보완하기 위해 RaFD로 훈련할 때 모든 모델을 학습률 0.0001로 100세기 동안 훈련하고 다음 100세기 동안 동일한 붕괴 전략을 적용한다. 교육은 NVIDIA Tesla M40 GPU 하나로 약 하루 정도 소요됩니다.
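A rough sketch of the schedule described above is given below. It assumes the learning rate is updated once per epoch (the text does not say whether the decay is applied per epoch or per iteration), and the function names are hypothetical.

```python
def learning_rate(epoch, base_lr=1e-4, constant_epochs=10, decay_epochs=10):
    """Learning rate at the start of a 0-indexed epoch.

    The rate stays at base_lr for `constant_epochs`, then decays
    linearly so that it reaches 0 after `decay_epochs` more epochs.
    """
    if epoch < constant_epochs:
        return base_lr
    progress = (epoch - constant_epochs) / decay_epochs
    return max(0.0, base_lr * (1.0 - progress))


def is_generator_step(iteration, n_critic=5):
    """True when the generator is updated on this 1-indexed iteration.

    The discriminator is updated on every iteration; the generator is
    updated once after every `n_critic` discriminator updates.
    """
    return iteration % n_critic == 0


# CelebA setting: 10 constant epochs followed by 10 decaying epochs.
celeba_rates = [learning_rate(e) for e in range(20)]

# RaFD setting: 100 constant epochs followed by 100 decaying epochs.
rafd_mid_decay = learning_rate(150, constant_epochs=100, decay_epochs=100)
```

With these defaults the CelebA rate halves at epoch 15, and the RaFD rate halves at epoch 150.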

$\mathbf{5.4.\;Experimental\;Results\;on\;CelebA}$

We first compare our proposed method to the baseline models on single- and multi-attribute transfer tasks. We train the cross-domain models, such as DIAT and CycleGAN, multiple times to cover all possible attribute value pairs. In the case of DIAT and CycleGAN, we perform multi-step translations to synthesize multiple attributes (e.g., transferring a gender attribute after changing a hair color).

먼저 제안된 방법을 단일 및 다중 속성 전송 작업에 대한 기준 모델과 비교한다. 우리는 가능한 모든 속성 값 쌍을 고려하여 DIAT 및 CycleGAN과 같은 교차 도메인 모델을 여러 번 훈련한다. DIAT와 CycleGAN의 경우, 우리는 여러 속성을 합성하기 위해 다단계 변환을 수행한다(예: 머리카락 색을 변경한 후 성별 속성 전달).

Qualitative evaluation. Fig. 4 shows the facial attribute transfer results on CelebA. We observed that our method provides a higher visual quality of translation results on test data compared to the cross-domain models. One possible reason is the regularization effect of StarGAN through a multi-task learning framework. In other words, rather than training a model to perform a fixed translation (e.g., brown-to-blond hair), which is prone to overfitting, we train our model to flexibly translate images according to the labels of the target domain. This allows our model to learn reliable features universally applicable to multiple domains of images with different facial attribute values.

질적 평가. 그림 4는 셀럽 A의 얼굴 속성 전달 결과를 보여준다. 우리는 우리의 방법이 교차 도메인 모델에 비해 테스트 데이터에 대한 번역 결과의 시각적 품질을 제공한다는 것을 관찰했다. 가능한 한 가지 이유는 다중 작업 학습 프레임워크를 통한 StarGAN의 정규화 효과이다. 즉, 우리는 과적합하기 쉬운 고정된 번역(예: 갈색에서 금발)을 수행하기 위해 모델을 훈련시키는 대신 대상 도메인의 레이블에 따라 이미지를 유연하게 변환하도록 모델을 훈련시킨다. 이를 통해 우리 모델은 얼굴 속성 값이 다른 이미지의 여러 도메인에 보편적으로 적용할 수 있는 신뢰할 수 있는 기능을 학습할 수 있다.

Furthermore, compared to IcGAN, our model demonstrates an advantage in preserving the facial identity feature of an input. We conjecture that this is because our method maintains the spatial information by using activation maps from the convolutional layer as latent representation, rather than just a low-dimensional latent vector as in IcGAN.

또한, IcGAN과 비교하여, 우리의 모델은 입력의 얼굴 정체성 기능을 보존하는 데 이점을 보여준다. 우리는 우리의 방법이 IcGAN에서와 같이 단순한 저차원 잠재 벡터가 아닌 컨볼루션 레이어의 활성화 맵을 잠재 표현으로 사용하여 공간 정보를 유지하기 때문이라고 추측한다.

Quantitative evaluation protocol. For quantitative evaluations, we performed two user studies in a survey format using Amazon Mechanical Turk (AMT) to assess single and multiple attribute transfer tasks. Given an input image, the Turkers were instructed to choose the best generated image based on perceptual realism, quality of transfer in attribute(s), and preservation of a figure’s original identity. The options were four randomly shuffled images generated from four different methods. The generated images in one study have a single attribute transfer in either hair color (black, blond, brown), gender, or age. In another study, the generated images involve a combination of attribute transfers. Each Turker was asked 30 to 40 questions with a few simple yet logical questions for validating human effort. The number of validated Turkers in each user study is 146 and 100 in single and multiple transfer tasks, respectively.

정량평가 프로토콜입니다. 정량적 평가를 위해, 우리는 단일 및 다중 속성 전송 작업을 평가하기 위해 Amazon Mechanical Turk(AMT)를 사용하여 설문 형식의 두 가지 사용자 연구를 수행했다. 입력 이미지가 주어졌을 때, 터커들은 지각 사실성, 속성의 전송 품질, 그리고 인물의 원래 정체성의 보존에 기초하여 가장 잘 생성된 이미지를 선택하라는 지시를 받았다. 옵션은 네 가지 다른 방법에서 생성된 네 개의 무작위로 혼합된 이미지였다. 한 연구에서 생성된 이미지는 머리색(검정색, 금발, 갈색), 성별 또는 나이에서 단일 속성 전달을 가집니다. 다른 연구에서 생성된 이미지는 속성 전송의 조합을 포함한다. 각 Turker는 인간의 노력을 검증하기 위해 간단하지만 논리적인 몇 가지 질문을 30~40개씩 받았다. 각 사용자 연구에서 검증된 Turker 수는 단일 및 다중 전송 작업에서 각각 146개 및 100개입니다.

Figure 6. Facial expression synthesis results of StarGAN-SNG and StarGAN-JNT on CelebA dataset.

그림 6. CelebA 데이터 세트에서 StarGAN-SNG 및 StarGAN-JNT의 얼굴 표정 합성 결과.

Table 1

Table 1. AMT perceptual evaluation for ranking different models on a single attribute transfer task. Each column sums to 100%.

표 1. 단일 속성 전송 태스크에서 여러 모델의 순위를 매기기 위한 AMT 지각 평가 각 열의 합계는 100%입니다.

Table 2

Table 2. AMT perceptual evaluation for ranking different models on a multi-attribute transfer task. H: Hair color; G: Gender; A: Aged.

표 2. 다중 속성 전송 태스크에서 여러 모델의 순위를 매기기 위한 AMT 지각 평가 H: 머리색, G: 성별, A: 나이.

Quantitative results. Tables 1 and 2 show the results of our AMT experiment on single- and multi-attribute transfer tasks, respectively. StarGAN obtained the most votes for best transferring attributes in all cases. In the case of gender changes in Table 1, the voting difference between our model and other models was marginal, e.g., 39.1% for StarGAN vs. 31.4% for DIAT. However, in multi-attribute changes, e.g., the ‘G+A’ case in Table 2, the performance difference becomes significant (e.g., 49.8% for StarGAN vs. 20.3% for IcGAN), clearly showing the advantages of StarGAN in more complicated, multi-attribute transfer tasks. This is because, unlike the other methods, StarGAN can handle image translation involving multiple attribute changes by randomly generating a target domain label in the training phase.

정량 결과. 표 1과 2는 각각 단일 및 다중 속성 전송 태스크에 대한 AMT 실험 결과를 보여줍니다. 스타GAN은 모든 경우에서 가장 좋은 전송 속성으로 다수의 표를 얻었다. 표 1의 성별 변화의 경우, 우리 모델과 다른 모델 간의 투표 차이는 미미했다. 예를 들어, StarGAN의 경우 39.1% 대 DIAT의 경우 31.4%였다. 그러나 표 2의 ‘G+A’ 사례와 같이 다중 속성 변경에서 성능 차이는 현저해지며, 예를 들어 StarGAN의 경우 49.8% 대 IcGAN의 경우 20.3%로 더 복잡한 다중 속성 전송 작업에서 StarGAN의 장점을 분명히 보여준다. 이는 다른 방법과 달리 StarGAN이 훈련 단계에서 대상 도메인 레이블을 무작위로 생성하여 여러 속성 변경을 포함하는 이미지 변환을 처리할 수 있기 때문이다.
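The random target label generation mentioned above can be illustrated with a short sketch. It assumes a five-dimensional binary label laid out as [black, blond, brown, male, young], with exactly one hair color active; this encoding is an assumption for illustration and is not specified in this section.

```python
import random

# Hypothetical label layout: three hair-color bits, then gender and age bits.
ATTRIBUTES = ("black", "blond", "brown", "male", "young")


def sample_target_label(rng):
    """Draw a random target domain label for one training image."""
    hair = [0, 0, 0]
    hair[rng.randrange(3)] = 1      # exactly one hair color is active
    gender = rng.randint(0, 1)      # 1 = male, 0 = female
    age = rng.randint(0, 1)         # 1 = young, 0 = old
    return hair + [gender, age]


rng = random.Random(0)
batch_labels = [sample_target_label(rng) for _ in range(16)]  # batch size 16
```

Because hair color, gender, and age are drawn together, a single sampled label may change several attributes at once, which is the multi-attribute case the cross-domain baselines can only reach through multi-step translation.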

$\mathbf{5.5.\;Experimental\;Results\;on\;RaFD}$

We next train our model on the RaFD dataset to learn the task of synthesizing facial expressions. To compare StarGAN and baseline models, we fix the input domain as the ‘neutral’ expression, but the target domain varies among the seven remaining expressions.

다음에는 RaFD 데이터 세트에 대한 모델을 훈련하여 얼굴 표정을 합성하는 작업을 학습한다. StarGAN과 기준 모델을 비교하기 위해 입력 도메인을 ‘중립’ 표현으로 고정하지만, 대상 도메인은 나머지 7개 표현식 사이에서 다양하다.

Qualitative evaluation. As seen in Fig. 5, StarGAN clearly generates the most natural-looking expressions while properly maintaining the personal identity and facial features of the input. While DIAT and CycleGAN mostly preserve the identity of the input, many of their results appear blurry and do not maintain the degree of sharpness seen in the input. IcGAN even fails to preserve the personal identity in the image by generating male images.

정성평가. 그림 5에서 볼 수 있듯이 StarGAN은 입력의 개인 정체성과 얼굴 특징을 적절히 유지하면서 가장 자연스럽게 보이는 표정을 명확하게 생성한다. DIAT와 CycleGAN은 대부분 입력의 ID를 보존하지만, 많은 결과는 흐릿하게 표시되며 입력에서 보는 것처럼 선명도를 유지하지 않는다. IcGAN은 심지어 남성 이미지를 생성함으로써 이미지에서 개인 정체성을 보존하는 데 실패한다.

We believe that the superiority of StarGAN in image quality is due to its implicit data augmentation effect from a multi-task learning setting. RaFD contains a relatively small number of samples, e.g., 500 images per domain. When trained on two domains, DIAT and CycleGAN can only use 1,000 training images at a time, but StarGAN can use 4,000 images in total from all the available domains for its training. This allows StarGAN to properly learn how to maintain the quality and sharpness of the generated output.

이미지 품질에서 StarGAN의 우수성은 다중 작업 학습 설정에서 암시적인 데이터 확대 효과 때문이라고 생각한다. RaFD 이미지에는 도메인당 500개의 이미지와 같이 비교적 작은 크기의 샘플이 포함되어 있습니다. 두 개의 도메인에서 훈련할 때, DIAT와 CycleGAN은 한 번에 1,000개의 훈련 이미지만 사용할 수 있지만, StarGAN은 훈련에 사용할 수 있는 모든 도메인에서 총 4,000개의 이미지를 사용할 수 있다. 이를 통해 StarGAN은 생성된 출력의 품질과 선명도를 유지하는 방법을 올바르게 학습할 수 있다.

Quantitative evaluation. For a quantitative evaluation, we compute the classification error of a facial expression on synthesized images. We trained a facial expression classifier on the RaFD dataset (90%/10% splitting for training and test sets) using a ResNet-18 architecture [5], resulting in a near-perfect accuracy of 99.55%. We then trained each of the image translation models using the same training set and performed image translation on the same, unseen test set. Finally, we classified the expression of these translated images using the above-mentioned classifier. As can be seen in Table 3, our model achieves the lowest classification error, indicating that our model produces the most realistic facial expressions among all the methods compared.

정량적 평가. 정량적 평가를 위해 합성된 이미지에 대한 얼굴 표정의 분류 오류를 계산한다. ResNet-18 아키텍처[5]를 사용하여 RaFD 데이터 세트에서 얼굴 표정 분류기(훈련 및 테스트 세트의 경우 90%/10% 분할)를 훈련하여 99.55%의 거의 완벽한 정확도를 달성했다. 그런 다음 동일한 훈련 세트를 사용하여 각 이미지 번역 모델을 교육하고 보이지 않는 동일한 테스트 세트에서 이미지 번역을 수행했다. 마지막으로, 우리는 위에서 언급한 분류기를 사용하여 이러한 번역된 이미지의 표현을 분류했다. 표 3에서 볼 수 있듯이, 우리의 모델은 가장 낮은 분류 오류를 달성하는데, 이는 우리의 모델이 비교한 모든 방법 중 가장 현실적인 얼굴 표정을 만든다는 것을 나타낸다.
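The metric in this protocol is the fraction of translated images whose predicted expression differs from the intended target expression. A minimal sketch is given below; the expression names and predictions are made-up placeholders, not results from the paper.

```python
def classification_error(predicted, target):
    """Fraction of samples whose predicted label differs from the target."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    if not target:
        raise ValueError("at least one sample is required")
    wrong = sum(p != t for p, t in zip(predicted, target))
    return wrong / len(target)


# Hypothetical example: target expressions requested from the generator,
# and the labels a pretrained classifier assigns to the translated images.
target = ["happy", "angry", "sad", "fearful"]
predicted = ["happy", "angry", "sad", "happy"]
error = classification_error(predicted, target)
```

For reference, the classifier's own error on real test images is 100% − 99.55% = 0.45%, so errors measured on synthesized images sit on top of that small baseline.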

Another important advantage of our model is the scalability in terms of the number of parameters required. The last column in Table 3 shows that the number of parameters required to learn all translations by StarGAN is seven times smaller than that of DIAT and fourteen times smaller than that of CycleGAN. This is because StarGAN requires only a single generator and discriminator pair, regardless of the number of domains, while in the case of cross-domain models such as CycleGAN, a completely different model should be trained for each source-target domain pair.

우리 모델의 또 다른 중요한 장점은 필요한 매개 변수의 수 측면에서 확장성이다. 표 3의 마지막 열은 StarGAN이 모든 번역을 학습하는 데 필요한 매개 변수의 수가 DIAT보다 7배 작고 CycleGAN보다 14배 작다는 것을 보여준다. 이는 StarGAN이 도메인 수에 관계없이 단일 생성기와 판별기 쌍만 필요로 하는 반면, CycleGAN과 같은 교차 도메인 모델의 경우 소스-대상 도메인 쌍마다 완전히 다른 모델을 교육해야 하기 때문이다.
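One accounting that reproduces the reported ratios is sketched below. It assumes that every generator and discriminator pair has the same size as StarGAN's, that DIAT needs one such pair per translation, and that CycleGAN needs two per translation (one for each direction). These are assumptions for illustration; the section itself only reports the ratios.

```python
# Neutral input translated to each of the seven other expressions.
NUM_TRANSLATIONS = 7


def relative_size(pairs_per_translation, num_translations=NUM_TRANSLATIONS):
    """Total generator-discriminator pairs, in units of one StarGAN pair."""
    return pairs_per_translation * num_translations


stargan = 1                  # a single pair, regardless of domain count
diat = relative_size(1)      # assumed: one pair per translation
cyclegan = relative_size(2)  # assumed: two pairs per translation
```

Under these assumptions the cross-domain totals grow linearly with the number of target domains, while the StarGAN total stays constant.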

$\mathbf{5.6.\;Experimental\;Results\;on\;CelebA+RaFD}$

Finally, we empirically demonstrate that our model can learn not only from multiple domains within a single dataset, but also from multiple datasets. We train our model jointly on the CelebA and RaFD datasets using the mask vector (see Section 3.2). To distinguish between the model trained only on RaFD and the model trained on both CelebA and RaFD, we denote the former as StarGAN-SNG (single) and the latter as StarGAN-JNT (joint).

마지막으로, 우리는 모델이 단일 데이터 세트 내의 여러 도메인에서뿐만 아니라 여러 데이터 세트에서도 학습할 수 있음을 경험적으로 보여준다. 마스크 벡터를 사용하여 CelebA 및 RaFD 데이터 세트에서 모델을 공동으로 훈련한다(섹션 3.2 참조). RaFD에서만 훈련된 모델과 CellebA와 RaFD 모두에서 훈련된 모델을 구별하기 위해 전자를 StarGAN-SNG(단일)로, 후자를 StarGAN-JNT(관절)로 표시한다.

Effects of joint training. Fig. 6 shows qualitative comparisons between StarGAN-SNG and StarGAN-JNT, where the task is to synthesize facial expressions of images in CelebA. StarGAN-JNT exhibits emotional expressions with high visual quality, while StarGAN-SNG generates reasonable but blurry images with gray backgrounds. This difference arises because StarGAN-JNT learns to translate CelebA images during training, whereas StarGAN-SNG does not. In other words, StarGAN-JNT can leverage both datasets to improve shared low-level tasks such as facial keypoint detection and segmentation. By utilizing both CelebA and RaFD, StarGAN-JNT can improve these low-level tasks, which is beneficial to learning facial expression synthesis.

합동훈련의 효과. 그림 6은 StarGAN-SNG와 StarGAN-JNT의 정성적 비교를 보여준다. 여기서 과제는 CellebA에서 이미지의 얼굴 표정을 합성하는 것이다. StarGAN-JNT는 높은 시각적 품질로 감정 표현을, StarGAN-SNG는 회색 배경의 합리적이지만 흐릿한 이미지를 생성한다. 이러한 차이는 StarGAN-JNT가 훈련 중에 CellebA 이미지 번역은 배우지만 StarGAN-SNG는 번역하지 않기 때문이다. 즉, StarGAN-JNT는 두 데이터 세트를 모두 활용하여 얼굴 키포인트 감지 및 분할과 같은 공유 낮은 수준의 작업을 개선할 수 있다. StarGAN-JNT는 CelebA와 RaFD를 모두 활용하여 이러한 낮은 수준의 작업을 개선할 수 있으며, 이는 얼굴 표정 합성을 학습하는 데 도움이 된다.

Learned role of mask vector. In this experiment, we gave a one-hot vector $c$ by setting the dimension of a particular facial expression (available from the second dataset, RaFD) to one. In this case, since the label associated with the second data set is explicitly given, the proper mask vector would be [0, 1]. Fig. 7 shows the case where this proper mask vector was given and the opposite case where a wrong mask vector of [1, 0] was given. When the wrong mask vector was used, StarGAN-JNT fails to synthesize facial expressions, and it manipulates the age of the input image. This is because the model ignores the facial expression label as unknown and treats the facial attribute label as valid by the mask vector. Note that since one of the facial attributes is ‘young’, the model translates the image from young to old when it takes in a zero vector as input. From this behavior, we can confirm that StarGAN properly learned the intended role of a mask vector in image-to-image translations when involving all the labels from multiple datasets altogether.

마스크 벡터의 역할을 학습했습니다. 이 실험에서, 우리는 (두 번째 데이터 세트인 RaFD에서 사용 가능한) 특정 얼굴 표정의 치수를 1로 설정하여 원핫 벡터 $c$를 주었다. 이 경우, 두 번째 데이터 세트와 관련된 라벨이 명시적으로 주어지기 때문에, 적절한 마스크 벡터는 [0, 1]이 될 것이다. 도 7은 이러한 적절한 마스크 벡터가 주어지는 경우와 [1, 0]의 잘못된 마스크 벡터가 주어지는 반대의 경우를 보여준다. 잘못된 마스크 벡터가 사용되었을 때 StarGAN-JNT는 얼굴 표정을 합성하지 못하고 입력 이미지의 나이를 조작한다. 이는 모델이 얼굴 표정 레이블을 알 수 없는 것으로 무시하고 얼굴 속성 레이블을 마스크 벡터에 의해 유효한 것으로 취급하기 때문이다. 얼굴 속성 중 하나가 ‘젊음’이기 때문에 모델은 영 벡터를 입력으로 받아들이면 이미지를 젊음에서 늙음으로 번역한다. 이 동작을 통해 StarGAN이 여러 데이터 세트의 모든 레이블을 모두 포함할 때 이미지 간 변환에서 마스크 벡터의 의도된 역할을 제대로 학습했음을 확인할 수 있다.
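The label layout in this experiment can be sketched as follows. The sketch assumes a unified label of the form [CelebA attributes, RaFD expressions, mask] with five CelebA bits and eight RaFD bits; the CelebA dimension is an assumption, and `active_slot` hard-codes, purely for illustration, the behavior that the trained model is reported to have learned.

```python
NUM_CELEBA = 5  # assumed number of CelebA attribute bits
NUM_RAFD = 8    # eight facial expressions in RaFD


def one_hot(index, size):
    """One-hot list of length `size` with a 1 at position `index`."""
    return [1 if i == index else 0 for i in range(size)]


def unified_label(celeba_label, rafd_label, mask):
    """Concatenate [CelebA attributes, RaFD expression, mask]."""
    return list(celeba_label) + list(rafd_label) + list(mask)


def active_slot(label):
    """Return the part of the label that the mask marks as valid."""
    mask = label[-2:]
    if mask == [1, 0]:
        return label[:NUM_CELEBA]
    if mask == [0, 1]:
        return label[NUM_CELEBA:NUM_CELEBA + NUM_RAFD]
    raise ValueError("mask must be [1, 0] or [0, 1]")


expression = one_hot(3, NUM_RAFD)   # hypothetical index of one expression
unknown_attributes = [0] * NUM_CELEBA

proper = unified_label(unknown_attributes, expression, [0, 1])
wrong = unified_label(unknown_attributes, expression, [1, 0])
```

With the proper mask the expression bits are read; with the wrong mask the all-zero attribute slot is read instead, which matches the observation that the ‘young’ bit being zero leads the model to age the face.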

Figure 7. Learned role of the mask vector. All images are generated by StarGAN-JNT. The first row shows the result of applying the proper mask vector, and the last row shows the result of applying the wrong mask vector.

그림 7. 마스크 벡터의 학습된 역할입니다. 모든 영상은 StarGAN-JNT에 의해 생성됩니다. 첫 번째 행은 적절한 마스크 벡터를 적용한 결과를 나타내고, 마지막 행은 잘못된 마스크 벡터를 적용한 결과를 나타냅니다.

$\mathbf{6.\;Conclusion}$

In this paper, we proposed StarGAN, a scalable image-to-image translation model among multiple domains using a single generator and a discriminator. Besides the advantages in scalability, StarGAN generated images of higher visual quality compared to existing methods [16, 23, 33], owing to the generalization capability behind the multi-task learning setting. In addition, the use of the proposed simple mask vector enables StarGAN to utilize multiple datasets with different sets of domain labels, thus handling all available labels from them. We hope our work enables users to develop interesting image translation applications across multiple domains.

본 논문에서는 단일 생성기와 판별기를 사용하여 여러 도메인 간에 확장 가능한 이미지 대 이미지 변환 모델인 StarGAN을 제안하였다. 확장성의 장점 외에도, StarGAN은 다중 작업 학습 설정 뒤에 있는 일반화 능력 때문에 기존 방법[16, 23, 33]에 비해 더 높은 시각적 품질의 이미지를 생성했다. 또한 제안된 단순 마스크 벡터를 사용하면 StarGAN이 서로 다른 도메인 레이블 세트를 가진 여러 데이터 세트를 활용하여 이들로부터 사용 가능한 모든 레이블을 처리할 수 있다. 사용자가 여러 도메인에 걸쳐 흥미로운 이미지 번역 응용 프로그램을 개발할 수 있도록 하는 우리의 작업을 희망한다.

$\mathbf{Acknowledgements.}$

This work was mainly done while the first author did a research internship at Clova AI Research, NAVER. We thank all the researchers at NAVER, especially Donghyun Kwak, for insightful discussions. This work was partially supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIP) (No. NRF2016R1C1B2015924). Jaegul Choo is the corresponding author.

이 작업은 네이버 클로바 AI 리서치에서 첫 번째 저자가 연구 인턴십을 하면서 주로 이뤄졌다. 네이버의 모든 연구자들, 특히 곽동현 연구위원께서 통찰력 있는 토론을 해주셔서 감사드린다. 본 연구는 한국정부(MSIP)가 지원하는 국가연구재단(NRF) 보조금(No. NRF2016R1C1B2015924)에 의해 일부 지원되었다. 추재걸 씨가 해당 작가입니다.

$\mathbf{References}$

[1] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 214–223, 2017.

[2] A. Brock, T. Lim, J. M. Ritchie, and N. Weston. Neural photo editing with introspective adversarial networks. arXiv preprint arXiv:1609.07093, 2016.

[3] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems (NIPS), pages 2672–2680, 2014.

[4] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville. Improved training of Wasserstein GANs. arXiv preprint arXiv:1704.00028, 2017.

[5] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.

[6] X. Huang, Y. Li, O. Poursaeed, J. Hopcroft, and S. Belongie. Stacked generative adversarial networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.

[7] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[8] T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.

[9] T. Kim, M. Cha, H. Kim, J. K. Lee, and J. Kim. Learning to discover cross-domain relations with generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 1857–1865, 2017.

[10] T. Kim, B. Kim, M. Cha, and J. Kim. Unsupervised visual attribute transfer with reconfigurable generative adversarial networks. arXiv preprint arXiv:1707.09798, 2017.

[11] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[12] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In Proceedings of the 2nd International Conference on Learning Representations (ICLR), 2014.

[13] O. Langner, R. Dotsch, G. Bijlstra, D. H. Wigboldus, S. T. Hawk, and A. Van Knippenberg. Presentation and validation of the Radboud Faces Database. Cognition and Emotion, 24(8):1377–1388, 2010.

[14] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi. Photo-realistic single image super-resolution using a generative adversarial network. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[15] C. Li and M. Wand. Precomputed real-time texture synthesis with Markovian generative adversarial networks. In Proceedings of the 14th European Conference on Computer Vision (ECCV), pages 702–716, 2016.

[16] M. Li, W. Zuo, and D. Zhang. Deep identity-aware transfer of facial attributes. arXiv preprint arXiv:1610.05586, 2016.

[17] M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised image-to-image translation networks. arXiv preprint arXiv:1703.00848, 2017.

[18] M.-Y. Liu and O. Tuzel. Coupled generative adversarial networks. In Advances in Neural Information Processing Systems (NIPS), pages 469–477, 2016.

[19] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015.

[20] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.

[21] A. Odena. Semi-supervised learning with generative adversarial networks. arXiv preprint arXiv:1606.01583, 2016.

[22] A. Odena, C. Olah, and J. Shlens. Conditional image synthesis with auxiliary classifier GANs. arXiv preprint arXiv:1610.09585, 2016.

[23] G. Perarnau, J. van de Weijer, B. Raducanu, and J. M. Álvarez. Invertible conditional GANs for image editing. arXiv preprint arXiv:1611.06355, 2016.

[24] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

[25] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text to image synthesis. arXiv preprint arXiv:1605.05396, 2016.

[26] W. Shen and R. Liu. Learning residual images for face attribute manipulation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[27] Z. Shu, E. Yumer, S. Hadap, K. Sunkavalli, E. Shechtman, and D. Samaras. Neural face editing with intrinsic image disentangling. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[28] Y. Taigman, A. Polyak, and L. Wolf. Unsupervised cross-domain image generation. In 5th International Conference on Learning Representations (ICLR), 2017.

[29] D. Ulyanov, A. Vedaldi, and V. Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016.

[30] H. Zhang, T. Xu, H. Li, S. Zhang, X. Huang, X. Wang, and D. Metaxas. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. arXiv preprint arXiv:1612.03242, 2016.

[31] Z. Zhang, Y. Song, and H. Qi. Age progression/regression by conditional adversarial autoencoder. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.

[32] J. Zhao, M. Mathieu, and Y. LeCun. Energy-based generative adversarial network. In 5th International Conference on Learning Representations (ICLR), 2017.

[33] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.