StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks

$\mathbf{Han\;Zhang,\;Tao Xu,\;Hongsheng\;Li}$

$\mathbf{Shaoting\;Zhang,\;Xiaogang\;Wang,\;Xiaolei\;Huang,\;Dimitris\;Metaxas}$

$\mathbf{Rutgers\;University,\;Lehigh\;University,\;The\;Chinese\;University\;of\;Hong\;Kong,\;Baidu\;Research}$

$\mathbf{Abstract}$

Synthesizing high-quality images from text descriptions is a challenging problem in computer vision and has many practical applications. Samples generated by existing text-to-image approaches can roughly reflect the meaning of the given descriptions, but they fail to contain necessary details and vivid object parts. In this paper, we propose Stacked Generative Adversarial Networks (StackGAN) to generate 256×256 photo-realistic images conditioned on text descriptions. We decompose the hard problem into more manageable sub-problems through a sketch-refinement process. The Stage-I GAN sketches the primitive shape and colors of the object based on the given text description, yielding Stage-I low-resolution images. The Stage-II GAN takes Stage-I results and text descriptions as inputs, and generates high-resolution images with photo-realistic details. It is able to rectify defects in Stage-I results and add compelling details with the refinement process. To improve the diversity of the synthesized images and stabilize the training of the conditional GAN, we introduce a novel Conditioning Augmentation technique that encourages smoothness in the latent conditioning manifold. Extensive experiments and comparisons with state-of-the-art methods on benchmark datasets demonstrate that the proposed method achieves significant improvements on generating photo-realistic images conditioned on text descriptions.

텍스트 설명에서 고품질 이미지를 합성하는 것은 컴퓨터 비전에서 어려운 문제이며 많은 실용적 응용이 있습니다. 기존 텍스트-이미지 접근법에 의해 생성된 샘플은 주어진 설명의 의미를 대략적으로 반영할 수 있지만 필요한 세부 정보와 생생한 객체 부분을 포함하지 못합니다. 본 논문에서, 우리는 텍스트 설명을 조건으로 256×256의 실제 사진 이미지를 생성하기 위해 StackGAN(Stack GAN)을 제안한다. 우리는 스케치 다듬기 과정을 통해 어려운 문제를 보다 다루기 쉬운 하위 문제로 분해합니다. Stage-IGAN은 주어진 텍스트 설명을 기반으로 객체의 기본 모양과 색상을 스케치하여 1단계 저해상도 이미지를 생성합니다. 2단계 GAN은 1단계 결과 및 텍스트 설명을 입력으로 사용하고 사진 사실적인 세부 정보가 포함된 고해상도 이미지를 생성합니다. 1단계 결과의 결함을 수정하고 개선 프로세스를 통해 설득력 있는 세부 정보를 추가할 수 있습니다. 합성된 이미지의 다양성을 개선하고 조건부 GAN의 훈련을 안정화하기 위해, 우리는 잠재된 컨디셔닝 매니폴드의 부드러움을 장려하는 새로운 컨디셔닝 증강 기술을 도입합니다. 벤치마크 데이터 세트에 대한 광범위한 실험과 최첨단 비교는 제안된 방법이 텍스트 설명에 따라 사진 사실적인 이미지를 생성하는 데 있어 상당한 개선을 달성한다는 것을 보여줍니다.

$\mathbf{1.\;Introduction}$

Generating photo-realistic images from text is an important problem and has tremendous applications, including photo-editing, computer-aided design, etc. Recently, Generative Adversarial Networks (GAN) [8, 5, 23] have shown promising results in synthesizing real-world images. Conditioned on given text descriptions, conditional GANs [26, 24] are able to generate images that are highly related to the text meanings.

텍스트에서 사실적인 이미지를 생성하는 것은 중요한 문제이며 사진 편집, 컴퓨터 지원 설계 등을 포함한 엄청난 응용 프로그램을 가지고 있습니다. 최근, 생성적 적대 네트워크(GAN)[8, 5, 23]는 실제 이미지를 합성하는 데 유망한 결과를 보여주었습니다. 주어진 텍스트 설명에 따라 조건부 GAN [26, 24]은 텍스트 의미와 매우 관련이 있는 이미지를 생성할 수 있습니다.

Figure 1. Comparison of the proposed StackGAN and a vanilla one-stage GAN for generating 256×256 images. (a) Given text descriptions, Stage-I of StackGAN sketches rough shapes and basic colors of objects, yielding low-resolution images. (b) Stage-II of StackGAN takes Stage-I results and text descriptions as inputs, and generates high-resolution images with photo-realistic details. (c) Results by a vanilla 256×256 GAN which simply adds more upsampling layers to state-of-the-art GAN-INT-CLS [26]. It is unable to generate any plausible images of 256×256 resolution.

그림 1입니다. 제안된 StackGAN과 256×256 이미지를 생성하기 위한 바닐라 1단계 GAN의 비교입니다. (a) 텍스트 설명이 주어지면 Stack의 I단계GAN은 물체의 거친 모양과 기본 색상을 스케치하여 저해상도 이미지를 생성합니다. (b) StackGAN의 II단계는 1단계 결과 및 텍스트 설명을 입력으로 하고, 사진 사실적인 세부 정보를 가진 고해상도 이미지를 생성합니다. (c) 최신 GAN-INT-CL에 업샘플링 레이어를 더 추가하는 바닐라 256×256 GAN의 결과 [26] 256×256 해상도의 그럴듯한 이미지를 생성할 수 없습니다.

However, it is very difficult to train GAN to generate high-resolution photo-realistic images from text descriptions. Simply adding more upsampling layers in state-of-the-art GAN models for generating high-resolution (e.g., 256×256) images generally results in training instability.

그러나 텍스트 설명에서 고해상도 사진 사실적 이미지를 생성하도록 GAN을 훈련시키는 것은 매우 어렵습니다. 고해상도(예: 256×256) 이미지를 생성하기 위해 최신 GAN 모델에 업샘플링 레이어를 추가하는 것만으로 일반적으로 교육이 불안정해집니다.

Such models also produce nonsensical outputs (see Figure 1(c)). The main difficulty for generating high-resolution images by GANs is that the supports of the natural image distribution and the implied model distribution may not overlap in high dimensional pixel space [31, 1]. This problem is more severe as the image resolution increases. Reed et al. only succeeded in generating plausible 64×64 images conditioned on text descriptions [26], which usually lack details and vivid object parts, e.g., beaks and eyes of birds. Moreover, they were unable to synthesize higher resolution (e.g., 128×128) images without providing additional annotations of objects [24].

또한 무의미한 출력을 생성합니다(그림 1(c) 참조). GAN에 의한 고해상도 이미지 생성의 주요 어려움은 자연 이미지 배포와 암시적 모델 배포의 지원이 고차원 픽셀 공간에서 겹치지 않을 수 있다는 것입니다 [31, 1]. 이 문제는 이미지 해상도가 높아질수록 더 심각합니다. 리드 외입니다. 오직 텍스트 설명[26]을 조건으로 한 그럴듯한 64×64 이미지를 생성하는 데 성공했습니다. 일반적으로 세부 정보와 생생한 물체 부분(예: 새의 부리와 눈)이 결여되어 있습니다. 더욱이, 객체에 대한 추가 주석을 제공하지 않고는 고해상도(예: 128×128) 이미지를 합성할 수 없었습니다 [24].

In analogy to how human painters draw, we decompose the problem of text to photo-realistic image synthesis into two more tractable sub-problems with Stacked Generative Adversarial Networks (StackGAN). Low-resolution images are first generated by our Stage-I GAN (see Figure 1(a)). On top of our Stage-I GAN, we stack Stage-II GAN to generate realistic high-resolution (e.g., 256×256) images conditioned on Stage-I results and text descriptions (see Figure 1(b)). By conditioning on the Stage-I result and the text again, Stage-II GAN learns to capture the text information that is omitted by Stage-I GAN and draws more details for the object. The support of the model distribution generated from a roughly aligned low-resolution image has a better probability of intersecting with the support of the image distribution. This is the underlying reason why Stage-II GAN is able to generate better high-resolution images.

인간 화가가 그리는 방법과 유사하게, 우리는 텍스트에서 사진 현실 이미지 합성 문제를 StackGAN(Stack GAN)으로 다루기 쉬운 두 가지 하위 문제로 분해한다. 저해상도 영상은 Stage-IGAN에 의해 먼저 생성됩니다(그림 1(a) 참조). Stage-IGAN의 상단에 Stage-IIGAN을 스택하여 Stage-II 결과 및 텍스트 설명에 따라 실제 고해상도(예: 256×256) 이미지를 생성한다(그림 1(b) 참조). Stage-IGAN은 Stage-IGAN 결과 및 텍스트를 다시 조건화함으로써 Stage-IGAN에서 생략한 텍스트 정보를 캡처하는 방법을 학습하고 개체에 대한 자세한 내용을 그립니다. 대략적으로 정렬된 저해상도 이미지에서 생성된 모델 분포의 지원은 이미지 분포의 지원으로 교차할 가능성이 더 높습니다. 이것이 2단계 GAN이 더 나은 고해상도 이미지를 생성할 수 있는 근본적인 이유입니다.

In addition, for the text-to-image generation task, the limited number of training text-image pairs often results in sparsity in the text conditioning manifold and such sparsity makes it difficult to train GAN. Thus, we propose a novel Conditioning Augmentation technique to encourage smoothness in the latent conditioning manifold. It allows small random perturbations in the conditioning manifold and increases the diversity of synthesized images.

또한, 텍스트-이미지 생성 작업의 경우, 훈련 텍스트-이미지 쌍의 수가 제한되어 텍스트 조정 매니폴드에서 희소성이 발생하는 경우가 많으며, 이러한 희소성은 GAN을 훈련하는 것을 어렵게 합니다. 따라서, 우리는 잠재된 컨디셔닝 매니폴드의 부드러움을 장려하기 위한 새로운 컨디셔닝 증강 기술을 제안합니다. 이는 컨디셔닝 매니폴드에서 작은 랜덤 섭동을 허용하고 합성 이미지의 다양성을 증가시킵니다.

The contribution of the proposed method is threefold:

제안된 방법의 기여는 세 가지입니다.

We propose novel Stacked Generative Adversarial Networks for synthesizing photo-realistic images from text descriptions. The model decomposes the difficult problem of generating high-resolution images into more manageable subproblems and significantly improves the state of the art. The StackGAN for the first time generates images of 256×256 resolution with photo-realistic details from text descriptions.

우리는 텍스트 설명에서 사진 사실적인 이미지를 합성하기 위한 새로운 스택 생성 적대적 네트워크를 제안한다. 고해상도 이미지를 생성하는 어려운 문제를 보다 관리하기 쉬운 하위 문제로 분해하고 최첨단 기술을 크게 향상시킵니다. StackGAN은 처음으로 텍스트 설명에서 사진 사실적인 세부 정보가 포함된 256×256 해상도의 이미지를 생성합니다.

A new Conditioning Augmentation technique is proposed to stabilize conditional GAN training and to improve the diversity of the generated samples.

조건부 GAN 훈련을 안정화하고 생성된 샘플의 다양성을 개선하기 위해 새로운 조건화 증강 기술이 제안됩니다.

Extensive qualitative and quantitative experiments demonstrate the effectiveness of the overall model design as well as the effects of individual components, which provide useful information for designing future conditional GAN models. Our code is available at https://github.com/hanzhanggit/StackGAN.

광범위한 정성적 및 정량적 실험은 전체 모델 설계의 효과와 개별 구성 요소의 효과를 입증하여 미래의 조건부 GAN 모델을 설계하는 데 유용한 정보를 제공합니다. 우리의 코드는 https://github.com/hanzhanggit/StackGAN에서 이용할 수 있습니다.

$\mathbf{2.\;Related\;Work}$

Generative image modeling is a fundamental problem in computer vision. There has been remarkable progress in this direction with the emergence of deep learning techniques. Variational Autoencoders (VAE) [13, 28] formulated the problem with probabilistic graphical models whose goal was to maximize the lower bound of data likelihood. Autoregressive models (e.g., PixelRNN) [33] that utilized neural networks to model the conditional distribution of the pixel space have also generated appealing synthetic images. Recently, Generative Adversarial Networks (GAN) [8] have shown promising performance for generating sharper images. But training instability makes it hard for GAN models to generate high-resolution (e.g., 256×256) images. Several techniques [23, 29, 18, 1, 3] have been proposed to stabilize the training process and generate compelling results. An energy-based GAN [38] has also been proposed for more stable training behavior.

생성 이미지 모델링은 컴퓨터 비전의 근본적인 문제입니다. 딥러닝 기술의 등장으로 이 방향에서 괄목할 만한 진전이 있었습니다. 가변 자동 인코더(VAE) [13, 28]는 데이터 우도의 하한을 최대화하는 것이 목적인 확률적 그래픽 모델로 문제를 공식화했습니다. 신경망을 활용하여 픽셀 공간의 조건부 분포를 모델링한 자기 회귀 모델(예: PixelRNN)[33]도 매력적인 합성 이미지를 생성했습니다. 최근, 생성적 적대 네트워크(GAN)[8]는 더 날카로운 이미지를 생성하는 유망한 성능을 보여주었습니다. 그러나 훈련 불안정성으로 인해 GAN 모델은 고해상도(예: 256×256) 이미지를 생성하기 어렵습니다. 교육 과정을 안정화하고 설득력 있는 결과를 얻기 위해 여러 기술[23, 29, 18, 1, 3]이 제안되었습니다. 보다 안정적인 훈련 행동을 위해 에너지 기반 GAN[38]도 제안되었습니다.

Built upon these generative models, conditional image generation has also been studied. Most methods utilized simple conditioning variables such as attributes or class labels [37, 34, 4, 22]. There is also work conditioned on images to generate images, including photo editing [2, 39], domain transfer [32, 12] and super-resolution [31, 15]. However, super-resolution methods [31, 15] can only add limited details to low-resolution images and cannot correct large defects as our proposed StackGAN does. Recently, several methods have been developed to generate images from unstructured text. Mansimov et al. [17] built an AlignDRAW model by learning to estimate alignment between text and the generating canvas. Reed et al. [27] used conditional PixelCNN to generate images using the text descriptions and object location constraints. Nguyen et al. [20] used an approximate Langevin sampling approach to generate images conditioned on text. However, their sampling approach requires an inefficient iterative optimization process. With conditional GAN, Reed et al. [26] successfully generated plausible 64×64 images for birds and flowers based on text descriptions. Their follow-up work [24] was able to generate 128×128 images by utilizing additional annotations on object part locations.

이러한 생성 모델을 기반으로 조건부 이미지 생성도 연구되었습니다. 대부분의 방법은 속성 또는 클래스 레이블과 같은 간단한 조건화 변수를 사용했습니다 [37, 34, 4, 22]. 또한 사진 편집 [2, 39], 도메인 전송 [32, 12] 및 초고해상도 [31, 15]를 포함하여 이미지를 생성하기 위한 이미지에 대한 작업이 필요합니다. 그러나 초해상도 방법[31, 15]은 저해상도 이미지에 제한된 세부 정보만 추가할 수 있으며 제안된 StackGAN이 하는 것처럼 큰 결점을 수정할 수 없습니다. 최근에는 구조화되지 않은 텍스트에서 이미지를 생성하는 몇 가지 방법이 개발되었습니다. 만시모프 외입니다. [17] 텍스트와 생성 캔버스 간의 정렬을 추정하는 방법을 학습하여 Align DRAW 모델을 구축했습니다. 리드 외입니다. [27] 조건부 PixelCNN을 사용하여 텍스트 설명 및 개체 위치 제약 조건을 사용하여 이미지를 생성합니다. 응우옌 외입니다. [20] 대략적인 Langevin 샘플링 접근 방식을 사용하여 텍스트에 따라 조정된 이미지를 생성했습니다. 그러나 샘플링 접근 방식에는 비효율적인 반복 최적화 프로세스가 필요합니다. 조건부 GAN 사용, 리드 등입니다. [26] 텍스트 설명을 기반으로 새와 꽃을 위한 그럴듯한 64×64 이미지를 성공적으로 생성했습니다. 그들의 후속 작업[24]은 객체 부품 위치에 대한 추가 주석을 활용하여 128×128개의 이미지를 생성할 수 있었습니다.

Besides using a single GAN for generating images, there is also work [36, 5, 10] that utilized a series of GANs for image generation. Wang et al. [36] factorized the indoor scene generation process into structure generation and style generation with the proposed $S^{2}$-GAN. In contrast, the second stage of our StackGAN aims to complete object details and correct defects of Stage-I results based on text descriptions. Denton et al. [5] built a series of GANs within a Laplacian pyramid framework. At each level of the pyramid, a residual image was generated conditioned on the image of the previous stage and then added back to the input image to produce the input for the next stage. Concurrent to our work, Huang et al. [10] also showed that they can generate better images by stacking several GANs to reconstruct the multi-level representations of a pre-trained discriminative model. However, they only succeeded in generating 32×32 images, while our method utilizes a simpler architecture to generate 256×256 images with photo-realistic details and sixty-four times more pixels.

이미지 생성을 위해 단일 GAN을 사용하는 것 외에도 이미지 생성을 위해 일련의 GAN을 활용한 연구[36, 5, 10]도 있습니다. 왕 외입니다. [36] 제안된 S2-GAN을 사용하여 실내 장면 생성 프로세스를 구조 생성과 스타일 생성으로 인수 분해했습니다. 대조적으로, StackGAN의 두 번째 단계는 객체 세부 정보를 완료하고 텍스트 설명을 기반으로 Stage-I 결과의 결함을 수정하는 것을 목표로 합니다. 덴튼 외입니다. [5] 라플라시안 피라미드 프레임워크 내에서 일련의 GAN을 구축했습니다. 피라미드의 각 레벨에서, 이전 단계의 이미지에 조건부로 잔여 이미지가 생성된 다음 다음 단계의 입력을 생성하기 위해 입력 이미지에 다시 추가되었습니다. Huang 등 우리의 일과 동시에 말입니다. [10] 또한 사전 훈련된 차별 모델의 다단계 표현을 재구성하기 위해 여러 GAN을 쌓음으로써 더 나은 이미지를 생성할 수 있다는 것을 보여주었습니다. 그러나 32x32 이미지 생성에만 성공한 반면, 우리의 방법은 더 단순한 아키텍처를 활용하여 사진 사실적인 세부 정보와 64배 많은 픽셀을 가진 256x256 이미지를 생성합니다.

$\mathbf{3.\;Stacked\;Generative\;Adversarial\;Networks}$

To generate high-resolution images with photo-realistic details, we propose simple yet effective Stacked Generative Adversarial Networks. The model decomposes the text-to-image generative process into two stages (see Figure 2).

사진 사실적인 세부 정보로 고해상도 이미지를 생성하기 위해 단순하지만 효과적인 스택 생성 적대적 네트워크를 제안한다. 텍스트-이미지 생성 프로세스를 두 단계로 분해합니다(그림 2 참조).

Stage-I GAN: it sketches the primitive shape and basic colors of the object conditioned on the given text description, and draws the background layout from a random noise vector, yielding a low-resolution image.

Stage-IGAN:는 주어진 텍스트 설명에 따라 조건화된 객체의 기본 모양과 기본 색상을 스케치하고 무작위 노이즈 벡터로부터 배경 레이아웃을 그려 저해상도 이미지를 생성합니다.

Stage-II GAN: it corrects defects in the low-resolution image from Stage-I and completes details of the object by reading the text description again, producing a high-resolution photo-realistic image.

Stage-II GAN:은 Stage-I의 저해상도 이미지의 결함을 수정하고 텍스트 설명을 다시 읽어 객체의 세부 정보를 완성하여 고해상도 사진 사실적인 이미지를 생성합니다.

$\mathbf{3.1.\;Preliminaries}$

Generative Adversarial Networks (GAN) [8] are composed of two models that are alternately trained to compete with each other. The generator $G$ is optimized to reproduce the true data distribution $p_{data}$ by generating images that are difficult for the discriminator $D$ to differentiate from real images. Meanwhile, $D$ is optimized to distinguish real images from synthetic images generated by $G$. Overall, the training procedure is similar to a two-player min-max game with the following objective function,

GAN(Generative Adversarial Network) [8]은 서로 경쟁하도록 대안적으로 훈련된 두 개의 모델로 구성됩니다. 생성기 $G$는 판별기 $D$가 실제 이미지와 구별하기 어려운 이미지를 생성하여 실제 데이터 분포 $p_{data}$를 재생성하도록 최적화되었습니다. 한편, $D$는 $G$에 의해 생성된 실제 이미지와 합성 이미지를 구별하도록 최적화되었습니다. 전반적으로, 훈련 절차는 다음과 같은 객관적인 기능을 가진 2인용 미니맥스 게임과 유사합니다.

\[\min_{G}\max_{D}V(D,G)=E_{x\sim{}p_{data}}[\log{}D(x)]+E_{z\sim{}p_{z}}[\log{}(1-D(G(z)))],\tag{1}\]

where $x$ is a real image from the true data distribution $p_{data}$, and $z$ is a noise vector sampled from distribution $p_{z}$ (e.g., uniform or Gaussian distribution).

여기서 $x$는 실제 데이터 분포 $p_{data}$의 실제 이미지이고, $z$는 분포 $p_{z}$(예: 균일 또는 가우스 분포)에서 샘플링된 노이즈 벡터입니다.
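The value function in Eq. (1) can be estimated from a batch of discriminator outputs. The following is a minimal sketch, not the authors' training code: the function name `gan_value` and the example scores are hypothetical, and the discriminator outputs are assumed to be probabilities strictly between 0 and 1.

```python
import math

def gan_value(d_real, d_fake):
    """Monte Carlo estimate of V(D, G) in Eq. (1).

    d_real: values D(x) for real samples x ~ p_data
    d_fake: values D(G(z)) for noise samples z ~ p_z
    Both are assumed to lie strictly inside (0, 1).
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# A discriminator that cannot tell real from fake outputs 0.5 everywhere.
v_confused = gan_value([0.5, 0.5], [0.5, 0.5])
# A sharper discriminator scores real samples high and fake samples low.
v_sharp = gan_value([0.9, 0.8], [0.1, 0.2])
```

The discriminator is trained to push this value up, while the generator is trained to push it down by making `d_fake` larger.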

Conditional GAN [7, 19] is an extension of GAN where both the generator and discriminator receive additional conditioning variables $c$, yielding $G(z, c)$ and $D(x, c)$. This formulation allows $G$ to generate images conditioned on variables $c$.

조건부 GAN [7, 19]은 생성자와 판별자 모두 추가 조건 변수 $c$를 수신하여 $G(z, c)$와 $D(x, c)$를 산출하는 GAN의 확장입니다. 이 공식을 통해 $G$는 변수 $c$에 따라 조정된 이미지를 생성할 수 있습니다.

$\mathbf{3.2.\;Conditioning\;Augmentation}$

As shown in Figure 2, the text description $t$ is first encoded by an encoder, yielding a text embedding $ϕ_{t}$. In previous works [26, 24], the text embedding is nonlinearly transformed to generate conditioning latent variables as the input of the generator. However, latent space for the text embedding is usually high dimensional (> 100 dimensions). With a limited amount of data, it usually causes discontinuity in the latent data manifold, which is not desirable for learning the generator. To mitigate this problem, we introduce a Conditioning Augmentation technique to produce additional conditioning variables $\hat{c}$. In contrast to the fixed conditioning text variable $c$ in [26, 24], we randomly sample the latent variables $\hat{c}$ from an independent Gaussian distribution $N(µ(ϕ_{t}),Σ(ϕ_{t}))$, where the mean $µ(ϕ_{t})$ and diagonal covariance matrix $Σ(ϕ_{t})$ are functions of the text embedding $ϕ_{t}$. The proposed Conditioning Augmentation yields more training pairs given a small number of image-text pairs, and thus encourages robustness to small perturbations along the conditioning manifold. To further enforce the smoothness over the conditioning manifold and avoid overfitting [6, 14], we add the following regularization term to the objective of the generator during training,

그림 2와 같이 텍스트 설명 $t$는 먼저 인코더에 의해 인코딩되어 텍스트 임베딩 $ϕ_{t}$를 생성합니다. 이전 작업[26, 24]에서 텍스트 임베딩은 생성기의 입력으로 조건화 잠재 변수를 생성하기 위해 비선형적으로 변환됩니다. 그러나 텍스트 임베딩을 위한 잠재 공간은 일반적으로 고차원(> 100차원)입니다. 제한된 양의 데이터로 인해 일반적으로 잠재 데이터 매니폴드에서 불연속성을 발생시키며 이는 생성기를 학습하는 데 바람직하지 않습니다. 이 문제를 완화하기 위해, 우리는 추가적인 조건화 변수 $\hat{c}$를 생성하기 위해 조건화 증강 기술을 도입한다. [26, 24]의 고정 조건 텍스트 변수 $c$와 대조적으로, 우리는 평균 $µ(ϕ_{t})$와 대각선 공분산 행렬 $Σ(ϕ_{t})$가 텍스트 임베딩 $ϕ_{t}$의 함수인 독립 가우스 분포 $N($µ(ϕ_{t})$, $Σ(ϕ_{t})$)$에서 잠재 변수 $\hat{c}$를 무작위로 샘플링한다. 제안된 조건화 증강은 적은 수의 이미지 텍스트 쌍이 주어지면 더 많은 훈련 쌍을 생성하므로 조건화 매니폴드를 따라 작은 섭동에 대한 견고성을 장려합니다. 컨디셔닝 매니폴드에 평활성을 더욱 강화하고 [6, 14] 과적합을 방지하기 위해 교육 중에 제너레이터의 목표에 다음과 같은 정규화 용어를 추가합니다.

\[D_{KL}(N(µ(ϕ_{t}),Σ(ϕ_{t}))||N(0,I)),\]

which is the Kullback-Leibler divergence (KL divergence) between the standard Gaussian distribution and the conditioning Gaussian distribution. The randomness introduced in the Conditioning Augmentation is beneficial for modeling text to image translation as the same sentence usually corresponds to objects with various poses and appearances.

이는 표준 가우스 분포와 조건부 가우스 분포 사이의 Kullback-Leibler 발산(KL 발산)입니다. Conditioning Augmentation에 도입된 랜덤성은 일반적으로 동일한 문장이 다양한 포즈 및 모양을 가진 객체에 해당하므로 텍스트에서 이미지 번역으로 모델링하는 데 유용합니다.
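For a diagonal covariance, the KL term in Eq. (2) has the standard closed form $\frac{1}{2}\sum_{i}(σ_{i}^{2}+µ_{i}^{2}-1-\log σ_{i}^{2})$. The sketch below evaluates it for given mean and standard-deviation vectors. The function name is hypothetical, and passing standard deviations (rather than log-variances) is an assumption made for readability.

```python
import math

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) for a diagonal Gaussian.

    mu, sigma: equal-length sequences; every sigma must be positive.
    """
    return 0.5 * sum(
        s * s + m * m - 1.0 - math.log(s * s)
        for m, s in zip(mu, sigma)
    )

# The regularizer vanishes when the conditioning Gaussian is already N(0, I).
kl_zero = kl_to_standard_normal([0.0, 0.0], [1.0, 1.0])
# Shifting one mean by 1 with unit variance costs exactly 0.5.
kl_shift = kl_to_standard_normal([1.0], [1.0])
```

The term grows as the predicted mean drifts from zero or the variance drifts from one, which is how it keeps the conditioning distribution close to the standard Gaussian.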

$\mathbf{3.3.\;Stage-I\;GAN}$

Instead of directly generating a high-resolution image conditioned on the text description, we simplify the task to first generate a low-resolution image with our Stage-I GAN, which focuses on drawing only rough shape and correct colors for the object.

우리는 텍스트 설명에 따라 고해상도 이미지를 직접 생성하는 대신 먼저 객체에 대한 거친 모양과 올바른 색상만 그리는 Stage-IGAN을 사용하여 저해상도 이미지를 생성하도록 작업을 단순화합니다.

Let $ϕ_{t}$ be the text embedding of the given description, which is generated by a pre-trained encoder [25] in this paper. The Gaussian conditioning variables $\hat{c}_{0}$ for text embedding are sampled from $N(µ_{0}(ϕ_{t}),Σ_{0}(ϕ_{t}))$ to capture the meaning of $ϕ_{t}$ with variations. Conditioned on $\hat{c}_{0}$ and random variable $z$, Stage-I GAN trains the discriminator $D_{0}$ and the generator $G_{0}$ by alternately maximizing $L_{D_{0}}$ in Eq. (3) and minimizing $L_{G_{0}}$ in Eq. (4),

$ϕ_{t}$를 주어진 설명의 텍스트 임베딩이라 하겠습니다. 이 임베딩은 본 논문에서 사전 훈련된 인코더 [25]에 의해 생성됩니다. 텍스트 임베딩을 위한 가우스 조건 변수 $\hat{c}_{0}$는 $N(µ_{0}(ϕ_{t}),Σ_{0}(ϕ_{t}))$에서 샘플링되어 $ϕ_{t}$의 의미를 변형과 함께 포착합니다. $\hat{c}_{0}$ 및 랜덤 변수 $z$를 조건으로 Stage-I GAN은 식 (3)의 $L_{D_{0}}$를 최대화하고 식 (4)의 $L_{G_{0}}$를 최소화하는 과정을 번갈아 수행하여 판별기 $D_{0}$와 생성기 $G_{0}$를 훈련시킵니다.

\[L_{D_{0}}=E_{(I_{0},t)\sim{}p_{data}}[\log{}D_{0}(I_{0},ϕ_{t})]+E_{z\sim{}p_{z},t\sim{}p_{data}}[\log{}(1-D_{0}(G_{0}(z,\hat{c}_{0}),ϕ_{t}))],\tag{3}\]

\[L_{G_{0}}=E_{z\sim{}p_{z},t\sim{}p_{data}}[\log{}(1-D_{0}(G_{0}(z,\hat{c}_{0}),ϕ_{t}))]+λD_{KL}(N(µ_{0}(ϕ_{t}),Σ_{0}(ϕ_{t}))||N(0,I)),\tag{4}\]

where the real image $I_{0}$ and the text description $t$ are from the true data distribution $p_{data}$. $z$ is a noise vector randomly sampled from a given distribution $p_{z}$ (Gaussian distribution in this paper). $λ$ is a regularization parameter that balances the two terms in Eq. (4). We set $λ=1$ for all our experiments. Using the reparameterization trick introduced in [11], both $µ_{0}(ϕ_{t})$ and $Σ_{0}(ϕ_{t})$ are learned jointly with the rest of the network.

여기서 실제 이미지 $I_{0}$ 및 텍스트 설명 $t$는 실제 데이터 분포 $p_{data}$에서 추출됩니다. $z$는 주어진 분포 $p_{z}$(본 논문에서는 가우스 분포)에서 무작위로 샘플링된 노이즈 벡터입니다. $λ$는 식 (4)의 두 항 사이의 균형을 맞추는 정규화 매개변수입니다. 우리는 모든 실험에서 $λ=1$로 설정합니다. [11]에 도입된 재매개변수화 트릭을 사용하여, $µ_{0}(ϕ_{t})$와 $Σ_{0}(ϕ_{t})$ 모두 네트워크의 나머지 부분과 함께 학습됩니다.

Model Architecture. For the generator $G_{0}$, to obtain the text conditioning variable $\hat{c}_{0}$, the text embedding $ϕ_{t}$ is first fed into a fully connected layer to generate $µ_{0}$ and $σ_{0}$ ($σ_{0}$ are the values in the diagonal of $Σ_{0}$) for the Gaussian distribution $N(µ_{0}(ϕ_{t}),Σ_{0}(ϕ_{t}))$. $\hat{c}_{0}$ is then sampled from the Gaussian distribution. Our $N_{g}$ dimensional conditioning vector $\hat{c}_{0}$ is computed by $\hat{c}_{0}=µ_{0}+σ_{0}\odot{}\epsilon{}$ (where $\odot{}$ is the element-wise multiplication, $\epsilon{}\sim{}N(0,I)$). Then, $\hat{c}_{0}$ is concatenated with a $N_{z}$ dimensional noise vector to generate a $W_{0}×H_{0}$ image by a series of up-sampling blocks.

모델 아키텍처. 생성기 $G_{0}$의 경우, 텍스트 조건화 변수 $\hat{c}_{0}$를 얻기 위해 텍스트 임베딩 $ϕ_{t}$가 먼저 완전 연결 레이어에 공급되어 가우스 분포 $N(µ_{0}(ϕ_{t}),Σ_{0}(ϕ_{t}))$의 $µ_{0}$와 $σ_{0}$($σ_{0}$는 $Σ_{0}$의 대각 성분 값)를 생성합니다. 그런 다음 $\hat{c}_{0}$가 이 가우스 분포에서 샘플링됩니다. 우리의 $N_{g}$ 차원 조건화 벡터 $\hat{c}_{0}$는 $\hat{c}_{0}=µ_{0}+σ_{0}\odot{}\epsilon{}$으로 계산됩니다(여기서 $\odot{}$는 요소별 곱셈, $\epsilon{}\sim{}N(0,I)$). 그런 다음 $\hat{c}_{0}$를 $N_{z}$ 차원 노이즈 벡터와 연결하여 일련의 업샘플링 블록에 의해 $W_{0}×H_{0}$ 이미지를 생성합니다.
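The sampling step described above is the usual reparameterization: draw $\epsilon\sim N(0,I)$, scale it element-wise by $σ_{0}$, and shift by $µ_{0}$. The sketch below illustrates it with plain Python lists. The function names are hypothetical, the constant `mu` and `sigma` stand in for the output of the fully connected layer, and placing $\hat{c}_{0}$ before $z$ in the concatenation is an assumption.

```python
import random

def sample_conditioning(mu, sigma, rng):
    """Return c_hat = mu + sigma ⊙ eps with eps ~ N(0, I)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

def generator_input(c_hat, n_z, rng):
    """Concatenate the conditioning vector with an n_z-dimensional noise vector."""
    z = [rng.gauss(0.0, 1.0) for _ in range(n_z)]
    return list(c_hat) + z

N_G, N_Z = 128, 100          # default sizes stated in the implementation details
rng = random.Random(0)       # seeded only to make the sketch repeatable

# Placeholder statistics; in the model these come from a learned layer.
mu = [0.0] * N_G
sigma = [1.0] * N_G

c_hat = sample_conditioning(mu, sigma, rng)
g_in = generator_input(c_hat, N_Z, rng)
```

Because the randomness enters only through `eps`, gradients can flow to whatever produces `mu` and `sigma`, which is what allows them to be learned jointly with the rest of the network.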

Figure 2. The architecture of the proposed StackGAN. The Stage-I generator draws a low-resolution image by sketching rough shape and basic colors of the object from the given text and painting the background from a random noise vector. Conditioned on Stage-I results, the Stage-II generator corrects defects and adds compelling details into Stage-I results, yielding a more realistic high-resolution image.

그림 2입니다. 제안된 StackGAN의 아키텍처입니다. 1단계 생성기는 주어진 텍스트에서 객체의 거친 모양과 기본 색상을 스케치하고 랜덤 노이즈 벡터에서 배경을 그려 저해상도 이미지를 그립니다. 1단계 결과에 따라 조건화된 2단계 생성기는 결함을 수정하고 1단계 결과에 매력적인 세부 정보를 추가하여 보다 사실적인 고해상도 이미지를 생성합니다.

For the discriminator $D_{0}$, the text embedding $ϕ_{t}$ is first compressed to $N_{d}$ dimensions using a fully-connected layer and then spatially replicated to form a $M_{d}×M_{d}×N_{d}$ tensor. Meanwhile, the image is fed through a series of down-sampling blocks until it has $M_{d}×M_{d}$ spatial dimension. Then, the image filter map is concatenated along the channel dimension with the text tensor. The resulting tensor is further fed to a 1×1 convolutional layer to jointly learn features across the image and the text. Finally, a fully-connected layer with one node is used to produce the decision score.

판별기 D_{0}의 경우, 임베딩 $ϕ_{t}$는 먼저 완전히 연결된 레이어를 사용하여 $N_{d}$ 차원으로 압축된 다음 공간적으로 복제되어 $M_{d}×M_{d}×N_{d}$ 텐서를 형성합니다. 한편, 이미지는 $M_{d}×M_{d}$ 공간 차원을 가질 때까지 일련의 다운 샘플링 블록을 통해 공급됩니다. 그런 다음 영상 필터 맵이 텍스트 텐서와 함께 채널 치수를 따라 연결됩니다. 결과 텐서는 이미지와 텍스트 전반에 걸쳐 특징을 공동으로 학습하기 위해 1×1 컨볼루션 레이어에 추가로 공급됩니다. 마지막으로, 하나의 노드가 있는 완전히 연결된 레이어를 사용하여 의사 결정 점수를 생성합니다.
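The replicate-and-concatenate step can be traced with nested lists indexed as `[row][column][channel]`. This is only a shape-level sketch: the helper names are hypothetical, and the image channel count `IMG_CH = 512` is an assumed value that the text does not specify.

```python
def replicate_spatially(vec, m):
    """Copy a length-N vector to every cell of an m x m grid (m x m x N)."""
    return [[list(vec) for _ in range(m)] for _ in range(m)]

def concat_channels(a, b):
    """Concatenate two m x m x C tensors along the channel axis."""
    return [[pa + pb for pa, pb in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]

M_D, N_D = 4, 128            # defaults stated in the implementation details
IMG_CH = 512                 # assumed channel count of the image feature map

text_vec = [0.1] * N_D                       # stand-in for the compressed embedding
text_tensor = replicate_spatially(text_vec, M_D)
image_map = [[[0.0] * IMG_CH for _ in range(M_D)] for _ in range(M_D)]

joint = concat_channels(image_map, text_tensor)
```

Every spatial location of `joint` carries the same text features next to its own image features, so a 1×1 convolution over it can mix the two modalities per location.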

$\mathbf{3.4.\;Stage-II\;GAN}$

Low-resolution images generated by Stage-I GAN usually lack vivid object parts and might contain shape distortions. Some details in the text might also be omitted in the first stage, which is vital for generating photo-realistic images. Our Stage-II GAN is built upon Stage-I GAN results to generate high-resolution images. It is conditioned on low-resolution images and also the text embedding again to correct defects in Stage-I results. The Stage-II GAN completes previously ignored text information to generate more photo-realistic details.

Stage-IGAN에서 생성된 저해상도 이미지는 일반적으로 선명한 객체 부품이 없으며 형상 왜곡을 포함할 수 있습니다. 텍스트의 일부 세부 정보는 첫 번째 단계에서 생략될 수 있으며, 이는 사실적인 이미지를 생성하는 데 매우 중요합니다. 우리의 Stage-IIGAN은 Stage-IGAN 결과를 기반으로 구축되어 고해상도 이미지를 생성합니다. 이 기능은 저해상도 영상과 1단계 결과의 결점을 수정하기 위해 텍스트 임베딩에 대해 조건화됩니다. 2단계 GAN은 이전에 무시된 텍스트 정보를 완성하여 보다 사실적인 세부 정보를 생성합니다.

Conditioning on the low-resolution result $s_{0}=G_{0}(z,\hat{c}_{0})$ and Gaussian latent variables $\hat{c}$, the discriminator $D$ and generator $G$ in Stage-II GAN are trained by alternately maximizing $L_{D}$ in Eq. (5) and minimizing $L_{G}$ in Eq. (6),

저해상도 결과 $s_{0}=G_{0}(z,\hat{c}_{0})$ 및 가우스 잠재 변수 $\hat{c}$를 조건으로, Stage-II GAN의 판별기 $D$ 및 생성기 $G$는 식 (5)의 $L_{D}$를 최대화하고 식 (6)의 $L_{G}$를 최소화하는 과정을 번갈아 수행하여 훈련됩니다.

\[L_{D}=E_{(I,t)\sim{}p_{data}}[\log{}D(I,ϕ_{t})]+E_{s_{0}\sim{}p_{G_{0}},t\sim{}p_{data}}[\log{}(1-D(G(s_{0},\hat{c}),ϕ_{t}))],\tag{5}\]

\[L_{G}=E_{s_{0}\sim{}p_{G_{0}},t\sim{}p_{data}}[\log{}(1-D(G(s_{0},\hat{c}),ϕ_{t}))]+λD_{KL}(N(µ(ϕ_{t}),Σ(ϕ_{t}))||N(0,I)),\tag{6}\]

Different from the original GAN formulation, the random noise $z$ is not used in this stage with the assumption that the randomness has already been preserved by $s_{0}$. Gaussian conditioning variables $\hat{c}$ used in this stage and $\hat{c}_{0}$ used in Stage-I GAN share the same pre-trained text encoder, generating the same text embedding $ϕ_{t}$. However, Stage-I and Stage-II Conditioning Augmentation have different fully connected layers for generating different means and standard deviations. In this way, Stage-II GAN learns to capture useful information in the text embedding that is omitted by Stage-I GAN.

원래의 GAN 공식과 달리, 랜덤 노이즈 $z$는 무작위성이 $s_{0}$에 의해 이미 보존되었다는 가정 하에 이 단계에서 사용되지 않습니다. 이 단계에서 사용되는 가우스 조건화 변수 $\hat{c}$와 Stage-I GAN에서 사용되는 $\hat{c}_{0}$는 동일한 사전 훈련된 텍스트 인코더를 공유하여 동일한 텍스트 임베딩 $ϕ_{t}$를 생성합니다. 그러나 Stage-I과 Stage-II 조건화 증강은 서로 다른 평균과 표준 편차를 생성하기 위해 서로 다른 완전 연결 레이어를 가집니다. 이러한 방식으로, Stage-II GAN은 Stage-I GAN이 생략한 텍스트 임베딩의 유용한 정보를 포착하는 방법을 학습합니다.

Model Architecture. We design the Stage-II generator as an encoder-decoder network with residual blocks [9]. Similar to the previous stage, the text embedding $ϕ_{t}$ is used to generate the $N_{g}$ dimensional text conditioning vector $\hat{c}$, which is spatially replicated to form a $M_{g}×M_{g}×N_{g}$ tensor. Meanwhile, the Stage-I result $s_{0}$ generated by Stage-I GAN is fed into several down-sampling blocks (i.e., encoder) until it has a spatial size of $M_{g}×M_{g}$. The image features and the text features are concatenated along the channel dimension. The encoded image features coupled with text features are fed into several residual blocks, which are designed to learn multi-modal representations across image and text features. Finally, a series of up-sampling layers (i.e., decoder) are used to generate a $W×H$ high-resolution image. Such a generator is able to help rectify defects in the input image while adding more details to generate the realistic high-resolution image.

모델 아키텍처입니다. 우리는 2단계 생성기를 잔여 블록[9]이 있는 인코더-디코더 네트워크로 설계합니다. 이전 단계와 유사하게 텍스트 임베딩 $ϕ_{t}$은 $N_{g}$ 차원 텍스트 조건화 벡터 $\hat{c}$를 생성하는 데 사용되며, 이는 공간적으로 복제되어 $M_{g}×M_{g}×N_{g}$ 텐서를 형성합니다. 한편, Stage-IGAN에 의해 생성된 Stage-I 결과 $s_{0}$는 $M_{g}×M_{g}$의 공간 크기를 가질 때까지 여러 다운 샘플링 블록(즉, 인코더)으로 공급됩니다. 영상 피쳐와 텍스트 피쳐는 채널 치수를 따라 연결됩니다. 텍스트 기능과 결합된 인코딩된 이미지 기능은 여러 잔여 블록으로 공급되며, 이미지 및 텍스트 기능 전반에 걸친 다중 모드 표현을 학습하도록 설계되었습니다. 마지막으로, 일련의 업샘플링 레이어(즉, 디코더)를 사용하여 $W×H$ 고해상도 이미지를 생성합니다. 이러한 생성기는 입력 이미지의 결함을 수정하는 데 도움이 되는 동시에 실제 고해상도 이미지를 생성하기 위한 세부 정보를 추가할 수 있습니다.

For the discriminator, its structure is similar to that of Stage-I discriminator with only extra down-sampling blocks since the image size is larger in this stage. To explicitly enforce GAN to learn better alignment between the image and the conditioning text, rather than using the vanilla discriminator, we adopt the matching-aware discriminator proposed by Reed et al. [26] for both stages. During training, the discriminator takes real images and their corresponding text descriptions as positive sample pairs, whereas negative sample pairs consist of two groups. The first is real images with mismatched text embeddings, while the second is synthetic images with their corresponding text embeddings.

판별기의 경우 이미지 크기가 이 단계에서 더 크기 때문에 추가 다운샘플링 블록만 있는 1단계 판별기의 구조와 유사합니다. 바닐라 판별기를 사용하는 대신 이미지와 조건 텍스트 간의 더 나은 정렬을 학습하도록 GAN을 명시적으로 시행하기 위해, 우리는 리드 등이 제안한 매칭 인식 판별기를 채택한다. [26] 두 단계 모두요 훈련 중에 판별기는 실제 이미지와 해당 텍스트 설명을 양의 샘플 쌍으로 사용하는 반면 음의 샘플 쌍은 두 그룹으로 구성됩니다. 첫 번째는 텍스트 임베딩이 일치하지 않는 실제 이미지이고, 두 번째는 해당 텍스트 임베딩이 있는 합성 이미지입니다.
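The three pair types can be folded into a single discriminator objective. The sketch below follows the form I recall from the GAN-CLS algorithm of Reed et al. [26], in which the two negative groups are averaged with weight 1/2; that weighting and the function name are assumptions, not details given in this text.

```python
import math

def matching_aware_d_objective(s_real_match, s_real_mismatch, s_fake_match):
    """Discriminator objective (to be maximized) for one training triple.

    s_real_match:    D(real image, matching text)      -> positive pair
    s_real_mismatch: D(real image, mismatched text)    -> negative pair
    s_fake_match:    D(synthetic image, matching text) -> negative pair
    All scores are assumed to lie strictly inside (0, 1).
    """
    positive = math.log(s_real_match)
    # Assumed weighting: the two negative groups share the negative term equally.
    negative = 0.5 * (math.log(1.0 - s_real_mismatch)
                      + math.log(1.0 - s_fake_match))
    return positive + negative

obj_good = matching_aware_d_objective(0.9, 0.1, 0.1)
obj_confused = matching_aware_d_objective(0.5, 0.5, 0.5)
```

The mismatched-text term is what distinguishes this from a vanilla discriminator: a realistic image paired with the wrong description must also be rejected, so the discriminator has to attend to the text.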

$\mathbf{3.5.\;Implementation\;details}$

The up-sampling blocks consist of the nearest-neighbor upsampling followed by a 3×3 stride 1 convolution. Batch normalization [11] and ReLU activation are applied after every convolution except the last one. The residual blocks consist of 3×3 stride 1 convolutions, Batch normalization and ReLU. Two residual blocks are used in 128×128 StackGAN models while four are used in 256×256 models. The down-sampling blocks consist of 4×4 stride 2 convolutions, Batch normalization and LeakyReLU, except that the first one does not have Batch normalization.

업샘플링 블록은 최근접 이웃 업샘플링에 이어 3×3 스트라이드 1 컨볼루션으로 구성됩니다. 배치 정규화 [11] 및 ReLU 활성화는 마지막 컨볼루션을 제외한 모든 컨볼루션 후에 적용됩니다. 잔차 블록은 3×3 스트라이드 1 컨볼루션, 배치 정규화 및 ReLU로 구성됩니다. 128×128 StackGAN 모델에는 두 개의 잔차 블록이 사용되고 256×256 모델에는 네 개가 사용됩니다. 다운샘플링 블록은 4×4 스트라이드 2 컨볼루션, 배치 정규화 및 LeakyReLU로 구성됩니다. 단, 첫 번째 블록에는 배치 정규화가 없습니다.
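The spatial effect of these blocks can be checked without any deep learning library. The sketch below implements 2× nearest-neighbor upsampling on a nested list and the standard convolution output-size formula. A padding of 1 for both kernel sizes and an upsampling factor of 2 are assumptions; the text states only the kernel size and stride.

```python
def upsample_nearest_2x(grid):
    """Double the height and width of a 2-D grid by repeating each value."""
    out = []
    for row in grid:
        wide = [v for v in row for _ in range(2)]   # repeat along width
        out.append(wide)
        out.append(list(wide))                      # repeat along height
    return out

def conv_out_size(n, kernel, stride, padding):
    """Output length of a convolution along one spatial axis."""
    return (n + 2 * padding - kernel) // stride + 1

up = upsample_nearest_2x([[1, 2], [3, 4]])

# With the assumed padding of 1:
same = conv_out_size(64, kernel=3, stride=1, padding=1)   # 3x3 stride 1 keeps the size
half = conv_out_size(64, kernel=4, stride=2, padding=1)   # 4x4 stride 2 halves the size
```

Under these assumptions an up-sampling block doubles the resolution (the 3×3 convolution preserves it) and a down-sampling block halves it.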

By default, $N_{g}=128$, $N_{z}=100$, $M_{g}=16$, $M_{d}=4$, $N_{d}=128$, $W_{0}=H_{0}=64$ and $W=H=256$. For training, we first iteratively train $D_{0}$ and $G_{0}$ of Stage-I GAN for 600 epochs by fixing Stage-II GAN. Then we iteratively train $D$ and $G$ of Stage-II GAN for another 600 epochs by fixing Stage-I GAN. All networks are trained using ADAM solver with batch size 64 and an initial learning rate of 0.0002. The learning rate is decayed to 1/2 of its previous value every 100 epochs.

기본적으로 $N_{g}=128$, $N_{z}=100$, $M_{g}=16$, $M_{d}=4$, $N_{d}=128$, $W_{0}=H_{0}=64$ 및 $W=H=256$입니다. 훈련을 위해 먼저 Stage-II GAN을 고정한 채 Stage-I GAN의 $D_{0}$와 $G_{0}$를 600 에폭 동안 반복적으로 훈련합니다. 그런 다음 Stage-I GAN을 고정한 채 Stage-II GAN의 $D$와 $G$를 또 다른 600 에폭 동안 반복적으로 훈련합니다. 모든 네트워크는 배치 크기 64, 초기 학습률 0.0002인 ADAM 솔버를 사용하여 훈련됩니다. 학습률은 100 에폭마다 이전 값의 1/2로 감소합니다.
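The step decay described above can be written as a one-line schedule. In this sketch the function name is hypothetical and epochs are assumed to be counted from 0, so the first halving takes effect at epoch 100.

```python
INITIAL_LR = 0.0002
DECAY_EVERY = 100      # epochs between halvings
EPOCHS_PER_STAGE = 600

def learning_rate(epoch):
    """Learning rate at a 0-indexed epoch: halved every DECAY_EVERY epochs."""
    return INITIAL_LR * 0.5 ** (epoch // DECAY_EVERY)

lr_start = learning_rate(0)
lr_after_first_decay = learning_rate(100)
lr_final = learning_rate(EPOCHS_PER_STAGE - 1)   # five halvings by the last epoch
```

Over one 600-epoch stage the rate is halved five times, ending at 1/32 of its initial value.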

$\mathbf{4.\;Experiments}$

To validate our method, we conduct extensive quantitative and qualitative evaluations. Two state-of-the-art methods on text-to-image synthesis, GAN-INT-CLS [26] and GAWWN [24], are compared. Results by the two compared methods are generated using the code released by their authors. In addition, we design several baseline models to investigate the overall design and important components of our proposed StackGAN. For the first baseline, we directly train Stage-I GAN for generating 64×64 and 256×256 images to investigate whether the proposed stacked structure and Conditioning Augmentation are beneficial. Then we modify our StackGAN to generate 128×128 and 256×256 images to investigate whether larger images by our method result in higher image quality. We also investigate whether inputting text at both stages of StackGAN is useful.

우리의 방법을 검증하기 위해, 우리는 광범위한 양적, 질적 평가를 수행합니다. 텍스트-이미지 합성에 관한 두 가지 최신 방법인 GAN-INT-CLS[26]와 GAWWN[24]을 비교합니다. 비교한 두 가지 방법에 의한 결과는 작성자가 공개한 코드를 사용하여 생성됩니다. 또한 제안된 StackGAN의 전체 설계와 중요한 구성 요소를 조사하기 위해 몇 가지 기본 모델을 설계합니다. 첫 번째 기준선에 대해 64×64 및 256×256 이미지를 생성하기 위해 Stage-I GAN을 직접 훈련하여 제안된 스택 구조와 컨디셔닝 증강이 유익한지 여부를 조사합니다. 그런 다음 StackGAN을 수정하여 128×128 및 256×256 이미지를 생성하여 우리 방법으로 이미지가 클수록 이미지 품질이 향상되는지 여부를 조사합니다. 또한 StackGAN의 두 단계 모두에서 텍스트를 입력하는 것이 유용한지 조사합니다.

$\mathbf{4.1.\;Datasets\;and\;evaluation\;metrics}$

CUB [35] contains 200 bird species with 11,788 images. Since 80% of birds in this dataset have object-image size ratios of less than 0.5 [35], as a pre-processing step, we crop all images to ensure that bounding boxes of birds have object-image size ratios greater than 0.75. Oxford-102 [21] contains 8,189 images of flowers from 102 different categories. To show the generalization capability of our approach, a more challenging dataset, MS COCO [16], is also utilized for evaluation. Different from CUB and Oxford-102, the MS COCO dataset contains images with multiple objects and various backgrounds. It has a training set with 80k images and a validation set with 40k images. Each image in COCO has 5 descriptions, while 10 descriptions are provided by [25] for every image in the CUB and Oxford-102 datasets. Following the experimental setup in [26], we directly use the training and validation sets provided by COCO, while we split CUB and Oxford-102 into class-disjoint training and test sets. Evaluation metrics. It is difficult to evaluate the performance of generative models (e.g., GAN). We choose a recently proposed numerical assessment approach, the “inception score” [29], for quantitative evaluation,

CUB [35]는 11,788개의 이미지를 가진 200종의 새를 포함합니다. 이 데이터 세트에서 80%의 새는 객체-이미지 크기 비율이 0.5보다 작기 때문에 [35], 사전 처리 단계로 새의 경계 상자가 0.75보다 큰 객체-이미지 크기 비율을 갖도록 모든 이미지를 자릅니다. Oxford-102[21]에는 102개의 서로 다른 카테고리에 속하는 8,189개의 꽃 이미지가 포함되어 있습니다. 우리 접근 방식의 일반화 능력을 보여주기 위해, 더 어려운 데이터 세트인 MS COCO[16]도 평가에 활용합니다. CUB 및 Oxford-102와 달리 MS COCO 데이터 세트에는 여러 객체와 다양한 배경이 있는 이미지가 포함되어 있습니다. 80k 이미지로 구성된 훈련 세트와 40k 이미지로 구성된 검증 세트가 있습니다. COCO의 각 이미지에는 5개의 설명이 있으며, CUB 및 Oxford-102 데이터 세트의 모든 이미지에 대해서는 [25]에서 10개의 설명이 제공됩니다. [26]의 실험 설정에 따라, 우리는 COCO가 제공하는 훈련 및 검증 세트를 직접 사용하는 한편, CUB와 Oxford-102는 클래스가 겹치지 않는 훈련 및 테스트 세트로 나눕니다. 평가 지표. 생성 모델(예: GAN)의 성능을 평가하는 것은 어렵습니다. 정량적 평가를 위해 최근에 제안된 수치 평가 방식인 “인셉션 점수(inception score)”[29]를 선택합니다.

\[I=\exp\left(\mathbb{E}_{x}\,D_{KL}\left(p(y|x)\,\|\,p(y)\right)\right),\]

where $x$ denotes one generated sample, and $y$ is the label predicted by the Inception model [30]. The intuition behind this metric is that good models should generate diverse but meaningful images. Therefore, the KL divergence between the marginal distribution $p(y)$ and the conditional distribution $p(y|x)$ should be large. In our experiments, we directly use the pre-trained Inception model for the COCO dataset. For the fine-grained datasets, CUB and Oxford-102, we fine-tune an Inception model for each of them. As suggested in [29], we evaluate this metric on a large number of samples (i.e., 30k randomly selected samples) for each model.

여기서 $x$는 생성된 샘플 하나를 나타내고 $y$는 Inception 모델[30]이 예측한 레이블입니다. 이 지표의 직관은 좋은 모델이라면 다양하면서도 의미 있는 이미지를 생성해야 한다는 것입니다. 따라서 주변 분포 $p(y)$와 조건부 분포 $p(y|x)$ 사이의 KL 발산은 커야 합니다. 우리의 실험에서, COCO 데이터 세트에 대해서는 사전 훈련된 Inception 모델을 직접 사용합니다. 세분화된 데이터 세트인 CUB와 Oxford-102의 경우 각각에 대해 Inception 모델을 미세 조정합니다. [29]에서 제안된 바와 같이, 우리는 각 모델에 대해 많은 수의 샘플(즉, 무작위로 선택된 30k 샘플)로 이 지표를 평가합니다.
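To make the formula concrete, the following sketch computes the score from a list of class-probability vectors $p(y|x)$, one per generated sample. It assumes those probabilities have already been produced by a classifier, estimates $p(y)$ as their average, and evaluates everything in a single pass; the function name `inception_score` and the toy inputs are illustrative, not part of the paper.

```python
import math


def inception_score(probs):
    """exp of the mean KL(p(y|x) || p(y)) over a list of probability vectors."""
    n = len(probs)
    k = len(probs[0])
    # Marginal p(y): average of the conditional distributions over all samples.
    marginal = [sum(p[j] for p in probs) / n for j in range(k)]
    mean_kl = 0.0
    for p in probs:
        # Terms with p_j == 0 contribute nothing to the KL divergence.
        kl = sum(pj * math.log(pj / marginal[j]) for j, pj in enumerate(p) if pj > 0)
        mean_kl += kl / n
    return math.exp(mean_kl)


# Identical, uninformative predictions: KL is zero, so the score is 1.
print(inception_score([[0.5, 0.5], [0.5, 0.5]]))
# Confident and diverse predictions over two classes: the score approaches 2.
print(inception_score([[1.0, 0.0], [0.0, 1.0]]))
```

The two toy cases show the intended behaviour: the score is lowest when every sample receives the same flat prediction and grows when predictions are both confident and spread across classes.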

Although the inception score has been shown to correlate well with human perception of the visual quality of samples [29], it cannot reflect whether the generated images are well conditioned on the given text descriptions. Therefore, we also conduct human evaluation. We randomly select 50 text descriptions for each class of the CUB and Oxford-102 test sets. For the COCO dataset, 4k text descriptions are randomly selected from its validation set. For each sentence, 5 images are generated by each model. Given the same text descriptions, 10 users (not including any of the authors) are asked to rank the results by different methods. The average ranks by human users are calculated to evaluate all compared methods.

인셉션 점수는 샘플의 시각적 품질에 대한 인간의 인식과 상관관계가 높은 것으로 나타났지만 [29], 생성된 이미지가 주어진 텍스트 설명을 잘 반영하는지 여부는 나타내지 못합니다. 따라서, 우리는 인간 평가도 실시합니다. CUB 및 Oxford-102 테스트 세트의 각 클래스에 대해 50개의 텍스트 설명을 무작위로 선택합니다. COCO 데이터 세트의 경우 검증 세트에서 4k개의 텍스트 설명을 무작위로 선택합니다. 각 문장에 대해 모델별로 5개의 이미지가 생성됩니다. 동일한 텍스트 설명이 주어지면, 10명의 사용자(저자 제외)에게 서로 다른 방법의 결과에 순위를 매기도록 요청합니다. 비교된 모든 방법을 평가하기 위해 사용자들이 매긴 평균 순위를 계산합니다.
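The averaging step of the human evaluation can be sketched as follows. The data layout (one dictionary of ranks per user judgment, with 1 as the best rank), the function name `average_ranks`, and the placeholder method names are all assumptions made for illustration; the values are toy numbers, not results from the paper.

```python
def average_ranks(judgments):
    """Average the rank each method received across all user judgments.

    `judgments` is a list of dicts mapping method name -> rank (1 = best).
    """
    totals = {}
    for judgment in judgments:
        for method, rank in judgment.items():
            totals[method] = totals.get(method, 0) + rank
    return {method: total / len(judgments) for method, total in totals.items()}


# Three hypothetical judgments over three placeholder methods.
toy_judgments = [
    {"method_a": 1, "method_b": 2, "method_c": 3},
    {"method_a": 1, "method_b": 3, "method_c": 2},
    {"method_a": 2, "method_b": 1, "method_c": 3},
]
print(average_ranks(toy_judgments))
```

A lower average rank means the method was preferred more often.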

$\mathbf{4.2.\;Quantitative\;and\;qualitative\;results}$

We compare our results with the state-of-the-art text-to-image methods [24, 26] on CUB, Oxford-102 and COCO datasets. The inception scores and average human ranks for our proposed StackGAN and compared methods are reported in Table 1. Representative examples are compared in Figure 3 and Figure 4.

우리는 우리의 결과를 CUB, Oxford-102 및 COCO 데이터 세트에서 최첨단 텍스트-이미지 방법[24, 26]과 비교합니다. 제안된 StackGAN 및 비교 방법에 대한 인셉션 점수와 평균 인간 순위는 표 1에 보고되어 있습니다. 대표적인 예는 그림 3과 그림 4에서 비교합니다.

Figure 3. Example results by our StackGAN, GAWWN [24], and GAN-INT-CLS [26] conditioned on text descriptions from CUB test set.

그림 3입니다. 예제 결과는 CUB 테스트 세트의 텍스트 설명을 조건으로 StackGAN, GAWWN [24] 및 GAN-INT-CLS [26]에 의한 것입니다.

Figure 4. Example results by our StackGAN and GAN-INT-CLS [26] conditioned on text descriptions from Oxford-102 test set (leftmost four columns) and COCO validation set (rightmost four columns).

그림 4입니다. 옥스퍼드-102 테스트 세트(맨 왼쪽 4열) 및 COCO 유효성 검사 세트(맨 오른쪽 4열)의 텍스트 설명을 조건으로 한 StackGAN 및 GAN-INT-CLS [26]의 예제 결과입니다.

Table 1

Table 1. Inception scores and average human ranks of our StackGAN, GAWWN [24], and GAN-INT-CLS [26] on CUB, Oxford102, and MS-COCO datasets.

표 1. CUB, Oxford-102 및 MS-COCO 데이터 세트에서 StackGAN, GAWWN[24] 및 GAN-INT-CLS[26]의 인셉션 점수와 평균 인간 순위.

Our StackGAN achieves the best inception score and average human rank on all three datasets. Compared with GAN-INT-CLS [26], StackGAN achieves a 28.47% improvement in terms of inception score on the CUB dataset (from 2.88 to 3.70), and a 20.30% improvement on Oxford-102 (from 2.66 to 3.20). The better average human rank of our StackGAN also indicates that our proposed method is able to generate more realistic samples conditioned on text descriptions.

StackGAN은 세 데이터 세트 모두에서 가장 좋은 인셉션 점수와 평균 인간 순위를 달성합니다. GAN-INT-CLS[26]와 비교하여 StackGAN은 인셉션 점수 측면에서 CUB 데이터 세트에서 28.47%(2.88에서 3.70으로), Oxford-102에서 20.30%(2.66에서 3.20으로)의 개선을 달성했습니다. StackGAN의 더 나은 평균 인간 순위는 또한 우리가 제안한 방법이 텍스트 설명을 조건으로 더 현실적인 샘플을 생성할 수 있음을 나타냅니다.
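The percentages quoted above are relative improvements over the GAN-INT-CLS scores, which can be checked with a few lines of arithmetic. The helper name `relative_improvement` is illustrative; the four scores are the ones stated in the text.

```python
def relative_improvement(baseline, new):
    """Relative gain of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


cub = relative_improvement(2.88, 3.70)
oxford = relative_improvement(2.66, 3.20)
print(f"CUB: {cub:.2f}%  Oxford-102: {oxford:.2f}%")
# prints "CUB: 28.47%  Oxford-102: 20.30%"
```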

As shown in Figure 3, the 64×64 samples generated by GAN-INT-CLS [26] can only reflect the general shape and color of the birds. Their results lack vivid parts (e.g., beak and legs) and convincing details in most cases, so they are neither realistic enough nor of sufficiently high resolution. By using additional conditioning variables on location constraints, GAWWN [24] obtains a better inception score on the CUB dataset, which is still slightly lower than ours. It generates higher resolution images with more details than GAN-INT-CLS, as shown in Figure 3. However, as mentioned by its authors, GAWWN fails to generate any plausible images when it is only conditioned on text descriptions [24]. In comparison, our StackGAN can generate 256×256 photo-realistic images from only text descriptions.

그림 3과 같이 GAN-INT-CLS [26]에서 생성된 64×64 샘플은 새의 일반적인 모양과 색만 반영할 수 있습니다. 그 결과는 대부분의 경우 생생한 부분(예: 부리와 다리)과 설득력 있는 세부 사항이 부족하여, 충분히 현실적이지도 않고 해상도도 충분히 높지 않습니다. GAWWN [24]은 위치 제약에 대한 추가 조건 변수를 사용하여 CUB 데이터 세트에서 더 나은 인셉션 점수를 얻지만, 이는 여전히 우리의 결과보다 약간 낮습니다. 그림 3과 같이 GAWWN은 GAN-INT-CLS보다 세부 사항이 더 많은 고해상도 이미지를 생성합니다. 그러나 저자들이 언급한 바와 같이, GAWWN은 텍스트 설명만을 조건으로 할 경우 그럴듯한 이미지를 생성하지 못합니다 [24]. 이에 비해 우리의 StackGAN은 텍스트 설명만으로 256×256 크기의 사실적인 이미지를 생성할 수 있습니다.

Figure 5. Samples generated by our StackGAN from unseen texts in CUB test set. Each column lists the text description, images generated from the text by Stage-I and Stage-II of StackGAN.

그림 5입니다. CUB 테스트 세트의 보이지 않는 텍스트에서 StackGAN에 의해 생성된 샘플입니다. 각 열에는 StackGAN의 1단계 및 2단계에 의해 텍스트에서 생성된 영상, 텍스트 설명이 나열됩니다.

Figure 6. For generated images (column 1), retrieving their nearest training images (columns 2-6) by utilizing Stage-II discriminator D to extract visual features. The L2 distances between features are calculated for nearest-neighbor retrieval.

그림 6입니다. 생성된 이미지(1열)의 경우 2단계 판별기 D를 사용하여 가장 가까운 교육 이미지(2-6열)를 검색하여 시각적 특징을 추출합니다. 형상 간 L2 거리는 가장 가까운 이웃 검색을 위해 계산됩니다.

Figure 5 illustrates some examples of the Stage-I and Stage-II images generated by our StackGAN. As shown in the first row of Figure 5, in most cases, Stage-I GAN is able to draw rough shapes and colors of objects given text descriptions. However, Stage-I images are usually blurry with various defects and missing details, especially for foreground objects. As shown in the second row, Stage-II GAN generates 4× higher resolution images with more convincing details to better reflect corresponding text descriptions. For cases where Stage-I GAN has generated plausible shapes and colors, Stage-II GAN completes the details. For instance, in the 1st column of Figure 5, with a satisfactory Stage-I result, Stage-II GAN focuses on drawing the short beak and white color described in the text as well as details for the tail and legs. In all other examples, different degrees of details are added to Stage-II images. In many other cases, Stage-II GAN is able to correct the defects of Stage-I results by processing the text description again. For example, while the Stage-I image in the 5th column has a blue crown rather than the reddish brown crown described in the text, the defect is corrected by Stage-II GAN. Even when Stage-I GAN fails to draw a plausible shape, Stage-II GAN is able to generate reasonable objects. We also observe that StackGAN has the ability to transfer backgrounds from Stage-I images and fine-tune them to be more realistic with higher resolution at Stage-II.

그림 5는 StackGAN에서 생성된 Stage-I 및 Stage-II 이미지의 몇 가지 예를 보여줍니다. 그림 5의 첫 번째 행에서 볼 수 있듯이, 대부분의 경우 Stage-I GAN은 텍스트 설명이 주어지면 객체의 대략적인 모양과 색상을 그릴 수 있습니다. 그러나 Stage-I 이미지는 일반적으로 흐릿하고, 특히 전경 객체에서 다양한 결함이 있으며 세부 정보가 누락되어 있습니다. 두 번째 행에서 볼 수 있듯이 Stage-II GAN은 해당 텍스트 설명을 더 잘 반영하도록 보다 설득력 있는 세부 정보를 갖춘 4배 더 높은 해상도의 이미지를 생성합니다. Stage-I GAN이 그럴듯한 모양과 색상을 생성한 경우, Stage-II GAN은 세부 정보를 완성합니다. 예를 들어, 그림 5의 첫 번째 열에서는 만족스러운 Stage-I 결과를 바탕으로 Stage-II GAN이 텍스트에 설명된 짧은 부리와 흰색, 그리고 꼬리와 다리의 세부 정보를 그리는 데 초점을 맞춥니다. 다른 모든 예에서도 Stage-II 이미지에 서로 다른 수준의 세부 정보가 추가됩니다. 다른 많은 경우, Stage-II GAN은 텍스트 설명을 다시 처리하여 Stage-I 결과의 결함을 수정할 수 있습니다. 예를 들어, 5번째 열의 Stage-I 이미지는 텍스트에 설명된 적갈색 정수리가 아닌 파란색 정수리를 가지고 있지만, 이 결함은 Stage-II GAN에 의해 수정됩니다. Stage-I GAN이 그럴듯한 모양을 그리지 못한 경우에도 Stage-II GAN은 합리적인 객체를 생성할 수 있습니다. 또한 StackGAN은 Stage-I 이미지에서 배경을 가져와 Stage-II에서 더 높은 해상도로 보다 사실적으로 다듬을 수 있음을 관찰합니다.

Importantly, the StackGAN does not achieve good results by simply memorizing training samples but by capturing the complex underlying language-image relations. We extract visual features from our generated images and all training images by the Stage-II discriminator $D$ of our StackGAN. For each generated image, its nearest neighbors from the training set can be retrieved. By visually inspecting the retrieved images (see Figure 6), we can conclude that the generated images have some similar characteristics with the training samples but are essentially different.

중요한 것은 StackGAN이 단순히 훈련 샘플을 암기하는 것이 아니라 복잡한 기본 언어-이미지 관계를 캡처함으로써 좋은 결과를 달성한다는 것입니다. 우리는 StackGAN의 2단계 판별기 $D$에 의해 생성된 이미지와 모든 훈련 이미지에서 시각적 특징을 추출한다. 생성된 각 이미지에 대해 교육 세트에서 가장 가까운 인접 이미지를 검색할 수 있습니다. 검색된 이미지를 육안으로 검사함으로써(그림 6 참조), 생성된 이미지는 교육용 샘플과 유사한 특성을 가지고 있지만 본질적으로 다르다는 결론을 내릴 수 있습니다.
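The retrieval step can be sketched as a brute-force nearest-neighbor search under the L2 distance. The sketch assumes the feature vectors have already been extracted (in the paper, by the Stage-II discriminator $D$); the function names and the toy two-dimensional features are hypothetical, and `k=5` simply mirrors the five neighbors shown per generated image in Figure 6.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbors(query, train_features, k=5):
    """Indices of the k training features closest to `query` in L2 distance."""
    order = sorted(range(len(train_features)),
                   key=lambda i: l2_distance(query, train_features[i]))
    return order[:k]


# Toy features standing in for discriminator activations.
train = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.1, 0.0]]
print(nearest_neighbors([0.0, 0.0], train, k=2))  # prints [0, 3]
```

For a training set of realistic size, an approximate or batched search would be preferable to this exhaustive scan.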

$\mathbf{4.3.\;Component\;analysis}$

In this subsection, we analyze different components of StackGAN on CUB dataset with our baseline models. The inception scores for those baselines are reported in Table 2.

이 하위 섹션에서는 기준 모델을 사용하여 CUB 데이터 세트에서 StackGAN의 여러 구성 요소를 분석합니다. 이러한 기준 모델의 인셉션 점수는 표 2에 보고되어 있습니다.

The design of StackGAN. As shown in the first four rows of Table 2, if Stage-I GAN is directly used to generate images, the inception scores decrease significantly. Such a performance drop can be well illustrated by the results in Figure 7. As shown in the first row of Figure 7, Stage-I GAN fails to generate any plausible 256×256 samples without using Conditioning Augmentation (CA). Although Stage-I GAN with CA is able to generate more diverse 256×256 samples, those samples are not as realistic as samples generated by StackGAN. This demonstrates the necessity of the proposed stacked structure. In addition, by decreasing the output resolution from 256×256 to 128×128, the inception score decreases from 3.70 to 3.35. Note that all images are scaled to 299×299 before calculating the inception score. Thus, if our StackGAN just increased the image size without adding more information, the inception score would remain the same for samples of different resolutions. Therefore, the decrease in inception score for the 128×128 StackGAN demonstrates that our 256×256 StackGAN does add more details into the larger images. For the 256×256 StackGAN, if the text is only input to Stage-I (denoted as “no Text twice”), the inception score decreases from 3.70 to 3.45. This indicates that processing text descriptions again at Stage-II helps refine Stage-I results. The same conclusion can be drawn from the results of the 128×128 StackGAN models.

StackGAN의 설계. 표 2의 처음 네 행에서 볼 수 있듯이 Stage-I GAN을 직접 사용하여 이미지를 생성하는 경우 인셉션 점수가 크게 감소합니다. 이러한 성능 저하는 그림 7의 결과를 통해 잘 알 수 있습니다. 그림 7의 첫 번째 행에서 볼 수 있듯이, Stage-I GAN은 컨디셔닝 증강(CA)을 사용하지 않으면 그럴듯한 256×256 샘플을 생성하지 못합니다. CA를 사용한 Stage-I GAN은 더 다양한 256×256 샘플을 생성할 수 있지만, 이러한 샘플은 StackGAN이 생성한 샘플만큼 현실적이지 않습니다. 이것은 제안된 스택 구조의 필요성을 보여줍니다. 또한 출력 해상도를 256×256에서 128×128로 줄이면 인셉션 점수가 3.70에서 3.35로 감소합니다. 모든 이미지는 인셉션 점수를 계산하기 전에 299×299로 크기가 조정된다는 점에 유의하십시오. 따라서 StackGAN이 더 많은 정보를 추가하지 않고 이미지 크기만 늘린다면, 해상도가 다른 샘플에 대해서도 인셉션 점수는 동일하게 유지될 것입니다. 그러므로 128×128 StackGAN에서 인셉션 점수가 감소한다는 것은 256×256 StackGAN이 더 큰 이미지에 실제로 더 많은 세부 정보를 추가한다는 것을 보여줍니다. 256×256 StackGAN의 경우, 텍스트가 Stage-I에만 입력되면(“no Text twice”로 표시됨) 인셉션 점수가 3.70에서 3.45로 감소합니다. 이는 Stage-II에서 텍스트 설명을 다시 처리하는 것이 Stage-I 결과를 다듬는 데 도움이 된다는 것을 나타냅니다. 128×128 StackGAN 모델의 결과에서도 동일한 결론을 도출할 수 있습니다.

Figure 7. Conditioning Augmentation (CA) helps stabilize the training of conditional GAN and improves the diversity of the generated samples. (Row 1) without CA, Stage-I GAN fails to generate plausible 256×256 samples. Although different noise vector $z$ is used for each column, the generated samples collapse to be the same for each input text description. (Row 2-3) with CA but fixing the noise vectors $z$, methods are still able to generate birds with different poses and viewpoints.

그림 7입니다. Conditioning Augmentation(CA)은 조건부 GAN의 교육을 안정화하고 생성된 샘플의 다양성을 향상시키는 데 도움이 된다. (1행) CA가 없으면 Stage-IGAN은 그럴듯한 256×256 샘플을 생성하지 못합니다. 각 열에 대해 서로 다른 노이즈 벡터 $z$가 사용되지만 생성된 샘플은 각 입력 텍스트 설명에 대해 동일하게 축소됩니다. (2-3행) CA를 사용하지만 노이즈 벡터 $z$를 고정하는 방법은 여전히 다른 자세와 관점을 가진 새를 생성할 수 있습니다.

Table 2

Table 2. Inception scores calculated with 30,000 samples generated by different baseline models of our StackGAN.

표 2입니다. 인셉션 점수는 StackGAN의 다양한 기준 모델에서 생성된 30,000개의 샘플로 계산되었습니다.

Conditioning Augmentation. We also investigate the efficacy of the proposed Conditioning Augmentation (CA). By removing it from StackGAN 256×256 (denoted as “no CA” in Table 2), the inception score decreases from 3.70 to 3.31. Figure 7 also shows that 256×256 Stage-I GAN (and StackGAN) with CA can generate birds with different poses and viewpoints from the same text embedding. In contrast, without using CA, samples generated by 256×256 Stage-I GAN collapse to nonsensical images due to the unstable training dynamics of GANs. Consequently, the proposed Conditioning Augmentation helps stabilize the conditional GAN training and improves the diversity of the generated samples because of its ability to encourage robustness to small perturbations along the latent manifold.

Figure 8. (Left to right) Images generated by interpolating two sentence embeddings. Gradual appearance changes from the first sentence’s meaning to that of the second sentence can be observed. The noise vector $z$ is fixed to be zeros for each row. Row 1: “The bird is completely red” $\to$ “The bird is completely yellow”. Row 2: “This bird is completely red with black wings and pointy beak” $\to$ “this small blue bird has a short pointy beak and brown on its wings”.

컨디셔닝 증강. 우리는 또한 제안된 컨디셔닝 증강(CA)의 효과를 조사합니다. StackGAN 256×256에서 이를 제거하면(표 2에서 “no CA”로 표시됨) 인셉션 점수가 3.70에서 3.31로 감소합니다. 그림 7은 또한 CA를 사용한 256×256 Stage-I GAN(및 StackGAN)이 동일한 텍스트 임베딩에서 서로 다른 자세와 시점을 가진 새를 생성할 수 있음을 보여줍니다. 대조적으로, CA를 사용하지 않으면 256×256 Stage-I GAN에서 생성된 샘플은 GAN의 불안정한 훈련 역학으로 인해 무의미한 이미지로 붕괴됩니다. 결과적으로, 제안된 컨디셔닝 증강은 잠재 매니폴드를 따라 작은 섭동에 대한 견고성을 장려하기 때문에 조건부 GAN 훈련을 안정화하고 생성된 샘플의 다양성을 개선하는 데 도움이 됩니다.

그림 8. (왼쪽에서 오른쪽) 두 문장 임베딩을 보간하여 생성한 이미지입니다. 첫 번째 문장의 의미에서 두 번째 문장의 의미로의 점진적인 외관 변화를 관찰할 수 있습니다. 노이즈 벡터 $z$는 각 행에 대해 0으로 고정됩니다. 1행: “The bird is completely red” $\to$ “The bird is completely yellow”. 2행: “This bird is completely red with black wings and pointy beak” $\to$ “this small blue bird has a short pointy beak and brown on its wings”.

Sentence embedding interpolation. To further demonstrate that our StackGAN learns a smooth latent data manifold, we use it to generate images from linearly interpolated sentence embeddings, as shown in Figure 8. We fix the noise vector $z$, so the generated image is inferred from the given text description only. Images in the first row are generated by simple sentences made up by us. Those sentences contain only simple color descriptions. The results show that the generated images from interpolated embeddings can accurately reflect color changes and generate plausible bird shapes. The second row illustrates samples generated from more complex sentences, which contain more details on bird appearances. The generated images change their primary color from red to blue, and change the wing color from black to brown.

문장 임베딩 보간. StackGAN이 매끄러운 잠재 데이터 매니폴드를 학습한다는 것을 추가로 보여주기 위해, 우리는 그림 8과 같이 선형 보간된 문장 임베딩에서 이미지를 생성합니다. 노이즈 벡터 $z$를 고정하므로 생성된 이미지는 주어진 텍스트 설명만으로 추론됩니다. 첫 번째 행의 이미지는 우리가 직접 만든 간단한 문장에서 생성됩니다. 그 문장들은 단순한 색 묘사만을 포함하고 있습니다. 결과는 보간된 임베딩에서 생성된 이미지가 색상 변화를 정확하게 반영하고 그럴듯한 새 모양을 생성할 수 있음을 보여줍니다. 두 번째 행은 새의 외형에 대한 세부 정보를 더 많이 포함하는, 더 복잡한 문장에서 생성된 샘플을 보여줍니다. 생성된 이미지는 주된 색상이 빨간색에서 파란색으로 바뀌고 날개 색상은 검은색에서 갈색으로 바뀝니다.
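Linear interpolation between two sentence embeddings can be sketched as follows. The embeddings are represented here as plain lists of floats, the number of steps is arbitrary, and the function name `interpolate_embeddings` is hypothetical; in the paper's setting each interpolated vector would then be fed to the generator together with the fixed noise vector $z$.

```python
def interpolate_embeddings(e_start, e_end, steps):
    """Return `steps` embeddings evenly spaced on the line from e_start to e_end.

    Requires steps >= 2 so that both endpoints are included.
    """
    if steps < 2:
        raise ValueError("steps must be at least 2")
    result = []
    for i in range(steps):
        t = i / (steps - 1)  # interpolation weight, from 0.0 to 1.0
        result.append([(1 - t) * a + t * b for a, b in zip(e_start, e_end)])
    return result


# Toy two-dimensional embeddings standing in for real sentence embeddings.
for vec in interpolate_embeddings([0.0, 0.0], [1.0, 2.0], steps=3):
    print(vec)
```

The first and last vectors reproduce the two original embeddings, so the leftmost and rightmost images of each row correspond to the two input sentences.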

$\mathbf{5.\;Conclusions}$

In this paper, we propose Stacked Generative Adversarial Networks (StackGAN) with Conditioning Augmentation for synthesizing photo-realistic images. The proposed method decomposes the text-to-image synthesis to a novel sketch-refinement process. Stage-I GAN sketches the object following basic color and shape constraints from given text descriptions. Stage-II GAN corrects the defects in Stage-I results and adds more details, yielding higher resolution images with better image quality. Extensive quantitative and qualitative results demonstrate the effectiveness of our proposed method. Compared to existing text-to-image generative models, our method generates higher resolution images (e.g., 256×256) with more photo-realistic details and diversity.

본 논문에서, 우리는 사진 사실적인 이미지를 합성하기 위해 조건화 증강이 있는 스택 생성 적대적 네트워크(StackGAN)를 제안한다. 제안된 방법은 텍스트와 이미지 합성을 새로운 스케치 정제 프로세스로 분해합니다. Stage-IGAN은 주어진 텍스트 설명에서 기본 색상과 모양 제약 조건에 따라 객체를 스케치합니다. 2단계 GAN은 1단계 결과의 결함을 수정하고 더 많은 세부 정보를 추가하여 더 나은 영상 화질로 더 높은 해상도의 영상을 생성합니다. 광범위한 양적 및 질적 결과는 우리가 제안한 방법의 효과를 보여줍니다. 기존의 텍스트-이미지 생성 모델과 비교하여, 우리의 방법은 더 사실적인 세부 사항과 다양성을 가진 고해상도 이미지(예: 256×256)를 생성한다.

$\mathbf{References}$

[1] M. Arjovsky and L. Bottou. Towards principled methods for training generative adversarial networks. In ICLR, 2017. 2

[2] A. Brock, T. Lim, J. M. Ritchie, and N. Weston. Neural photo editing with introspective adversarial networks. In ICLR, 2017. 2

[3] T. Che, Y. Li, A. P. Jacob, Y. Bengio, and W. Li. Mode regularized generative adversarial networks. In ICLR, 2017. 2

[4] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In NIPS, 2016. 2

[5] E. L. Denton, S. Chintala, A. Szlam, and R. Fergus. Deep generative image models using a laplacian pyramid of adversarial networks. In NIPS, 2015. 1, 2

[6] C. Doersch. Tutorial on variational autoencoders. arXiv:1606.05908, 2016. 3

[7] J. Gauthier. Conditional generative adversarial networks for convolutional face generation. Technical report, 2015. 3

[8] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014. 1, 2, 3

[9] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016. 4

[10] X. Huang, Y. Li, O. Poursaeed, J. Hopcroft, and S. Belongie. Stacked generative adversarial networks. In CVPR, 2017. 2, 3

[11] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015. 5

[12] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017. 2

[13] D. P. Kingma and M. Welling. Auto-encoding variational bayes. In ICLR, 2014. 2, 3

[14] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther. Autoencoding beyond pixels using a learned similarity metric. In ICML, 2016. 3

[15] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, 2017. 2

[16] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014. 5

[17] E. Mansimov, E. Parisotto, L. J. Ba, and R. Salakhutdinov. Generating images from captions with attention. In ICLR, 2016. 2

[18] L. Metz, B. Poole, D. Pfau, and J. Sohl-Dickstein. Unrolled generative adversarial networks. In ICLR, 2017. 2

[19] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv:1411.1784, 2014. 3

[20] A. Nguyen, J. Yosinski, Y. Bengio, A. Dosovitskiy, and J. Clune. Plug & play generative networks: Conditional iterative generation of images in latent space. In CVPR, 2017. 2

[21] M.-E. Nilsback and A. Zisserman. Automated flower classification over a large number of classes. In ICCVGIP, 2008. 5

[22] A. Odena, C. Olah, and J. Shlens. Conditional image synthesis with auxiliary classifier gans. In ICML, 2017. 2

[23] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016. 1, 2

[24] S. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, and H. Lee. Learning what and where to draw. In NIPS, 2016. 1, 2, 3, 5, 6, 7

[25] S. Reed, Z. Akata, B. Schiele, and H. Lee. Learning deep representations of fine-grained visual descriptions. In CVPR, 2016. 3, 5

[26] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text-to-image synthesis. In ICML, 2016. 1, 2, 3, 5, 6

[27] S. Reed, A. van den Oord, N. Kalchbrenner, V. Bapst, M. Botvinick, and N. de Freitas. Generating interpretable images with controllable structure. Technical report, 2016. 2

[28] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, 2014. 2

[29] T. Salimans, I. J. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training gans. In NIPS, 2016. 2, 5

[30] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016. 5

[31] C. K. Sønderby, J. Caballero, L. Theis, W. Shi, and F. Huszar. Amortised map inference for image super-resolution. In ICLR, 2017. 2

[32] Y. Taigman, A. Polyak, and L. Wolf. Unsupervised cross-domain image generation. In ICLR, 2017. 2

[33] A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel recurrent neural networks. In ICML, 2016. 2

[34] A. van den Oord, N. Kalchbrenner, O. Vinyals, L. Espeholt, A. Graves, and K. Kavukcuoglu. Conditional image generation with pixelcnn decoders. In NIPS, 2016. 2

[35] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011. 5

[36] X. Wang and A. Gupta. Generative image modeling using style and structure adversarial networks. In ECCV, 2016. 2

[37] X. Yan, J. Yang, K. Sohn, and H. Lee. Attribute2image: Conditional image generation from visual attributes. In ECCV, 2016. 2

[38] J. Zhao, M. Mathieu, and Y. LeCun. Energy-based generative adversarial network. In ICLR, 2017. 2

[39] J. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros. Generative visual manipulation on the natural image manifold. In ECCV, 2016. 2