[GauGAN] Semantic Image Synthesis with Spatially-Adaptive Normalization

cl2020 2021. 3. 26. 17:09

Semantic Image Synthesis with Spatially-Adaptive Normalization

Abstract

We propose spatially-adaptive normalization, a simple but effective layer for synthesizing photorealistic images given an input semantic layout. Previous methods directly feed the semantic layout as input to the deep network, which is then processed through stacks of convolution, normalization, and nonlinearity layers. We show that this is suboptimal as the normalization layers tend to “wash away” semantic information. To address the issue, we propose using the input layout for modulating the activations in normalization layers through a spatially-adaptive, learned transformation. Experiments on several challenging datasets demonstrate the advantage of the proposed method over existing approaches, regarding both visual fidelity and alignment with input layouts. Finally, our model allows user control over both semantic and style. Code is available at https://github.com/NVlabs/SPADE.

1. Introduction

Conditional image synthesis refers to the task of generating photorealistic images conditioning on certain input data. Seminal work computes the output image by stitching pieces from a single image (e.g., Image Analogies [16]) or using an image collection [7, 14, 23, 30, 35]. Recent methods directly learn the mapping using neural networks [3, 6, 22, 47, 48, 54, 55, 56]. The latter methods are faster and require no external database of images.

We are interested in a specific form of conditional image synthesis, which is converting a semantic segmentation mask to a photorealistic image. This form has a wide range of applications such as content generation and image editing [6, 22, 48]. We refer to this form as semantic image synthesis. In this paper, we show that the conventional network architecture [22, 48], which is built by stacking convolutional, normalization, and nonlinearity layers, is at best suboptimal because its normalization layers tend to “wash away” information contained in the input semantic masks. To address the issue, we propose spatially-adaptive normalization, a conditional normalization layer that modulates the activations using input semantic layouts through a spatially-adaptive, learned transformation and can effectively propagate the semantic information throughout the network.

We conduct experiments on several challenging datasets including the COCO-Stuff [4, 32], the ADE20K [58], and the Cityscapes [9]. We show that with the help of our spatially-adaptive normalization layer, a compact network can synthesize significantly better results compared to several state-of-the-art methods. Additionally, an extensive ablation study demonstrates the effectiveness of the proposed normalization layer against several variants for the semantic image synthesis task. Finally, our method supports multimodal and style-guided image synthesis, enabling controllable, diverse outputs, as shown in Figure 1. Also, please see our SIGGRAPH 2019 Real-Time Live demo and try our online demo yourself.

2. Related Work

Deep generative models can learn to synthesize images. Recent methods include generative adversarial networks (GANs) [13] and variational autoencoders (VAEs) [28]. Our work is built on GANs but aims for the conditional image synthesis task. GANs consist of a generator and a discriminator, where the goal of the generator is to produce realistic images so that the discriminator cannot tell the synthesized images apart from the real ones.

Conditional image synthesis exists in many forms that differ in the type of input data. For example, class-conditional models [3, 36, 37, 39, 41] learn to synthesize images given category labels. Researchers have explored various models for generating images based on text [18,44,52,55]. Another widely-used form is image-to-image translation based on a type of conditional GANs [20, 22, 24, 25, 33, 57, 59, 60], where both input and output are images. Compared to earlier non-parametric methods [7, 16, 23], learning-based methods typically run faster during test time and produce more realistic results. In this work, we focus on converting segmentation masks to photorealistic images. We assume the training dataset contains registered segmentation masks and images. With the proposed spatially-adaptive normalization, our compact network achieves better results compared to leading methods.

Unconditional normalization layers have been an important component in modern deep networks and can be found in various classifiers, including the Local Response Normalization in the AlexNet [29] and the Batch Normalization (BatchNorm) in the Inception-v2 network [21]. Other popular normalization layers include the Instance Normalization (InstanceNorm) [46], the Layer Normalization [2], the Group Normalization [50], and the Weight Normalization [45]. We label these normalization layers as unconditional as they do not depend on external data in contrast to the conditional normalization layers discussed below.

Conditional normalization layers include the Conditional Batch Normalization (Conditional BatchNorm) [11] and Adaptive Instance Normalization (AdaIN) [19]. Both were first used in the style transfer task and later adopted in various vision tasks [3, 8, 10, 20, 26, 36, 39, 42, 49, 54]. Different from the earlier normalization techniques, conditional normalization layers require external data and generally operate as follows. First, layer activations are normalized to zero mean and unit standard deviation. Then the normalized activations are denormalized by modulating the activation using a learned affine transformation whose parameters are inferred from external data. For style transfer tasks [11, 19], the affine parameters are used to control the global style of the output, and hence are uniform across spatial coordinates. In contrast, our proposed normalization layer applies a spatially-varying affine transformation, making it suitable for image synthesis from semantic masks. Wang et al. proposed a closely related method for image super-resolution [49]. Both methods are built on spatially-adaptive modulation layers that condition on semantic inputs. While they aim to incorporate semantic information into super-resolution, our goal is to design a generator for style and semantics disentanglement. We focus on providing the semantic information in the context of modulating normalized activations. We use semantic maps in different scales, which enables coarse-to-fine generation. The reader is encouraged to review their work for more details.

3. Semantic Image Synthesis

Let $m \in \mathbb{L}^{H \times W}$ be a semantic segmentation mask where $\mathbb{L}$ is a set of integers denoting the semantic labels, and $H$ and $W$ are the image height and width. Each entry in $m$ denotes the semantic label of a pixel. We aim to learn a mapping function that can convert an input segmentation mask $m$ to a photorealistic image.
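As a small illustration of the mask definition above, the following sketch turns an integer label map into one binary plane per label, which is a common way to feed such a mask to a convolutional network. The helper name `one_hot_mask` and the three-label toy mask are assumptions made for this example, not something the text specifies.

```python
def one_hot_mask(mask, num_labels):
    """Convert an H x W integer label map into num_labels binary H x W planes."""
    return [
        [[1.0 if label == k else 0.0 for label in row] for row in mask]
        for k in range(num_labels)
    ]


# A toy 2 x 3 mask with labels from L = {0, 1, 2} (hypothetical classes).
mask = [
    [0, 0, 1],
    [2, 2, 1],
]
planes = one_hot_mask(mask, num_labels=3)
```

Each pixel is set in exactly one plane, so the planes carry the same information as the integer mask.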

Spatially-adaptive denormalization.

Let $h^i$ denote the activations of the $i$-th layer of a deep convolutional network for a batch of $N$ samples. Let $C^i$ be the number of channels in the layer. Let $H^i$ and $W^i$ be the height and width of the activation map in the layer. We propose a new conditional normalization method called the SPatially-Adaptive (DE)normalization (SPADE). Similar to the Batch Normalization [21], the activation is normalized in the channel-wise manner and then modulated with learned scale and bias. Figure 2 illustrates the SPADE design. The activation value at site ($n \in N$, $c \in C^i$, $y \in H^i$, $x \in W^i$) is

$$\gamma^i_{c,y,x}(m)\,\frac{h^i_{n,c,y,x} - \mu^i_c}{\sigma^i_c} + \beta^i_{c,y,x}(m) \tag{1}$$

where $h^i_{n,c,y,x}$ is the activation at the site before normalization and $\mu^i_c$ and $\sigma^i_c$ are the mean and standard deviation of the activations in channel $c$:

$$\mu^i_c = \frac{1}{N H^i W^i} \sum_{n,y,x} h^i_{n,c,y,x} \tag{2}$$

$$\sigma^i_c = \sqrt{\frac{1}{N H^i W^i} \sum_{n,y,x} \left( (h^i_{n,c,y,x})^2 - (\mu^i_c)^2 \right)} \tag{3}$$
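The normalize-then-modulate computation described here can be sketched in plain Python on nested lists. Two things in this sketch are assumptions rather than the paper's design: the modulation functions are replaced by per-label lookup tables (`gamma_table`, `beta_table`, both hypothetical names), whereas the text uses a small convolutional network over the mask, and a small `eps` is added to the denominator for numerical safety.

```python
import math


def channel_stats(h, c):
    """Mean and standard deviation of channel c over batch and spatial sites."""
    values = [v for sample in h for row in sample[c] for v in row]
    mean = sum(values) / len(values)
    var = sum(v * v for v in values) / len(values) - mean * mean
    return mean, math.sqrt(max(var, 0.0))


def spade(h, mask, gamma_table, beta_table, eps=1e-5):
    """Normalize h per channel, then scale and shift each site by its label.

    h is indexed [n][c][y][x]; mask is indexed [n][y][x] and holds integer labels.
    gamma_table[c][label] and beta_table[c][label] are simplified stand-ins
    for the learned, mask-dependent scale and bias.
    """
    stats = [channel_stats(h, c) for c in range(len(h[0]))]
    out = []
    for n, sample in enumerate(h):
        out_sample = []
        for c, plane in enumerate(sample):
            mean, std = stats[c]
            out_sample.append([
                [
                    gamma_table[c][mask[n][y][x]] * (v - mean) / (std + eps)
                    + beta_table[c][mask[n][y][x]]
                    for x, v in enumerate(row)
                ]
                for y, row in enumerate(plane)
            ])
        out.append(out_sample)
    return out


# One sample, one channel, 2 x 2 activations; the top row has label 0, the bottom row label 1.
h = [[[[1.0, 3.0], [1.0, 3.0]]]]
mask = [[[0, 0], [1, 1]]]
gamma_table = [[1.0, 2.0]]  # scale per label for channel 0
beta_table = [[0.0, 10.0]]  # bias per label for channel 0
out = spade(h, mask, gamma_table, beta_table)
```

Both rows share the same normalized values (about -1 and 1), but the bottom row is scaled and shifted differently because its label differs, which is the spatially varying behavior the layer is meant to provide.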

The variables $\gamma^i_{c,y,x}(m)$ and $\beta^i_{c,y,x}(m)$ in (1) are the learned modulation parameters of the normalization layer. In contrast to the BatchNorm [21], they depend on the input segmentation mask and vary with respect to the location $(y, x)$. We use the symbols $\gamma^i_{c,y,x}$ and $\beta^i_{c,y,x}$ to denote the functions that convert $m$ to the scaling and bias values at the site $(c, y, x)$ in the $i$-th activation map. We implement the functions $\gamma^i_{c,y,x}$ and $\beta^i_{c,y,x}$ using a simple two-layer convolutional network, whose design is in the appendix.

In fact, SPADE is related to, and is a generalization of, several existing normalization layers. First, replacing the segmentation mask $m$ with the image class label and making the modulation parameters spatially-invariant (i.e., $\gamma^i_{c,y_1,x_1} \equiv \gamma^i_{c,y_2,x_2}$ and $\beta^i_{c,y_1,x_1} \equiv \beta^i_{c,y_2,x_2}$ for any $y_1, y_2 \in \{1, 2, ..., H^i\}$ and $x_1, x_2 \in \{1, 2, ..., W^i\}$), we arrive at the form of the Conditional BatchNorm [11]. Indeed, for any spatially-invariant conditional data, our method reduces to the Conditional BatchNorm. Similarly, we can arrive at the AdaIN [19] by replacing $m$ with a real image, making the modulation parameters spatially-invariant, and setting $N = 1$. As the modulation parameters are adaptive to the input segmentation mask, the proposed SPADE is better suited for semantic image synthesis.

SPADE generator. With the SPADE, there is no need to feed the segmentation map to the first layer of the generator, since the learned modulation parameters have encoded enough information about the label layout. Therefore, we discard the encoder part of the generator, which is commonly used in recent architectures [22, 48]. This simplification results in a more lightweight network. Furthermore, similarly to existing class-conditional generators [36, 39, 54], the new generator can take a random vector as input, enabling a simple and natural way for multi-modal synthesis [20, 60].

Figure 4 illustrates our generator architecture, which employs several ResNet blocks [15] with upsampling layers. The modulation parameters of all the normalization layers are learned using the SPADE. Since each residual block operates at a different scale, we downsample the semantic mask to match the spatial resolution.
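The mask downsampling mentioned above can be sketched as follows. Nearest-neighbor sampling is an assumption chosen here because it keeps every output entry a valid integer label (averaging would blend labels); the function name `downsample_nearest` is hypothetical.

```python
def downsample_nearest(mask, out_h, out_w):
    """Nearest-neighbor resize of an H x W label map to out_h x out_w."""
    in_h, in_w = len(mask), len(mask[0])
    return [
        [mask[(y * in_h) // out_h][(x * in_w) // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]


# A 4 x 4 mask made of four uniform 2 x 2 regions.
mask = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [2, 2, 3, 3],
    [2, 2, 3, 3],
]
small = downsample_nearest(mask, 2, 2)  # one label per region
```

Each residual block would receive the mask resized to its own activation resolution in this manner.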

We train the generator with the same multi-scale discriminator and loss function used in pix2pixHD [48] except that we replace the least-squares loss term [34] with the hinge loss term [31, 38, 54]. We test several ResNet-based discriminators used in recent unconditional GANs [1, 36, 39] but observe similar results at the cost of a higher GPU memory requirement. Adding the SPADE to the discriminator also yields a similar performance. For the loss function, we observe that removing any loss term in the pix2pixHD loss function leads to degraded generation results.
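For readers unfamiliar with the hinge loss mentioned above, here is a minimal sketch of the commonly used hinge formulation of the adversarial objective, written over plain lists of discriminator scores. Only the adversarial term is shown; the remaining pix2pixHD loss terms are omitted, and the function names and the sample scores are illustrative assumptions.

```python
def d_hinge_loss(real_scores, fake_scores):
    """Discriminator hinge loss: push real scores above +1 and fake scores below -1."""
    real_term = sum(max(0.0, 1.0 - s) for s in real_scores) / len(real_scores)
    fake_term = sum(max(0.0, 1.0 + s) for s in fake_scores) / len(fake_scores)
    return real_term + fake_term


def g_hinge_loss(fake_scores):
    """Generator loss under the hinge formulation: raise the scores of fakes."""
    return -sum(fake_scores) / len(fake_scores)


real_scores = [1.5, 0.2]   # hypothetical discriminator outputs on real images
fake_scores = [-2.0, 0.5]  # hypothetical discriminator outputs on synthesized images
d_loss = d_hinge_loss(real_scores, fake_scores)
g_loss = g_hinge_loss(fake_scores)
```

Scores that already clear the margin (1.5 for a real image, -2.0 for a fake one) contribute nothing to the discriminator loss, so the gradient concentrates on the harder samples.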

Why does the SPADE work better? A short answer is that it can better preserve semantic information against common normalization layers. Specifically, while normalization layers such as the InstanceNorm [46] are essential pieces in almost all the state-of-the-art conditional image synthesis models [48], they tend to wash away semantic information when applied to uniform or flat segmentation masks.

Let us consider a simple module that first applies convolution to a segmentation mask and then normalization. Furthermore, let us assume that a segmentation mask with a single label is given as input to the module (e.g., all the pixels have the same label such as sky or grass). Under this setting, the convolution outputs are again uniform, with different labels having different uniform values. Now, after we apply InstanceNorm to the output, the normalized activation will become all zeros no matter which input semantic label is given. Therefore, semantic information is totally lost. This limitation applies to a wide range of generator architectures, including pix2pixHD and its variant that concatenates the semantic mask at all intermediate layers, as long as a network applies convolution and then normalization to the semantic mask. In Figure 3, we empirically show this is precisely the case for pix2pixHD. Because a segmentation mask consists of a few uniform regions in general, the issue of information loss emerges when applying normalization.
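The argument above can be checked numerically with a toy sketch. It assumes a single-channel convolution whose output on a constant input is itself constant (border padding effects are ignored), and a plain instance normalization without learned affine parameters; the helper names and the weight values are hypothetical.

```python
import math


def instance_norm(plane, eps=1e-5):
    """Normalize one channel of one sample to zero mean and unit variance."""
    values = [v for row in plane for v in row]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [[(v - mean) / math.sqrt(var + eps) for v in row] for row in plane]


def conv_on_uniform_mask(label_value, weight_sum, bias, h, w):
    """Convolution output for a constant input: the same value at every site."""
    return [[label_value * weight_sum + bias for _ in range(w)] for _ in range(h)]


# Two different single-label masks ("sky" = 1.0, "grass" = 2.0) through the same module.
sky = instance_norm(conv_on_uniform_mask(1.0, 0.7, 0.1, 3, 3))
grass = instance_norm(conv_on_uniform_mask(2.0, 0.7, 0.1, 3, 3))
```

Before normalization the two maps differ (0.8 versus 1.5 everywhere), yet after it both are numerically zero at every site, so nothing downstream can tell the labels apart.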

In contrast, the segmentation mask in the SPADE Generator is fed through spatially adaptive modulation without normalization. Only activations from the previous layer are normalized. Hence, the SPADE generator can better preserve semantic information. It enjoys the benefit of normalization without losing the semantic input information.

Multi-modal synthesis.

By using a random vector as the input of the generator, our architecture provides a simple way for multi-modal synthesis [20, 60]. Namely, one can attach an encoder that processes a real image into a random vector, which is then fed to the generator. The encoder and generator form a VAE [28], in which the encoder tries to capture the style of the image, while the generator combines the encoded style and the segmentation mask information via the SPADEs to reconstruct the original image. The encoder also serves as a style guidance network at test time to capture the style of target images, as used in Figure 1. For training, we add a KL-Divergence loss term [28].
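A minimal sketch of the two VAE ingredients referred to above: sampling the style vector and computing the KL-divergence term. It assumes the standard setup of a diagonal Gaussian posterior with a standard normal prior, where the encoder outputs a mean and a log-variance per dimension; the text does not spell out these details, and the function names are hypothetical.

```python
import math
import random


def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), one value per dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]


def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))


mu = [0.0, 1.0]      # hypothetical encoder means
logvar = [0.0, 0.0]  # hypothetical encoder log-variances (unit variance)
z = reparameterize(mu, logvar, random.Random(0))  # style vector fed to the generator
kl = kl_to_standard_normal(mu, logvar)
```

At test time one can either draw `z` from the prior for diverse outputs or encode a style image and use its mean as the generator input.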

4. Experiments

Implementation details.

We apply the Spectral Norm [38] to all the layers in both generator and discriminator. The learning rates for the generator and discriminator are 0.0001 and 0.0004, respectively [17]. We use the ADAM solver [27] with β1 = 0 and β2 = 0.999. All the experiments are conducted on an NVIDIA DGX1 with eight 32GB V100 GPUs. We use synchronized BatchNorm, i.e., the normalization statistics are collected from all the GPUs.
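To make the optimizer settings concrete, the sketch below implements one scalar Adam update and applies it with the two learning rates quoted above. The `eps` value, the starting parameter, and the gradient are assumptions for illustration; with β1 = 0 the first-moment estimate is simply the current gradient, so no momentum is carried between steps.

```python
import math


def adam_step(param, grad, m, v, t, lr, beta1=0.0, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    m_hat = m / (1.0 - beta1 ** t)  # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)  # bias-corrected second moment
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


G_LR, D_LR = 1e-4, 4e-4  # generator and discriminator learning rates from the text

# First step from param = 1.0 with gradient 0.5 (hypothetical values).
g_param, g_m, g_v = adam_step(1.0, 0.5, 0.0, 0.0, t=1, lr=G_LR)
d_param, d_m, d_v = adam_step(1.0, 0.5, 0.0, 0.0, t=1, lr=D_LR)
```

On the first step the update magnitude is approximately the learning rate itself, so the discriminator parameter moves about four times as far as the generator parameter.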

Datasets.

We conduct experiments on several datasets.

• COCO-Stuff [4] is derived from the COCO dataset [32]. It has 118,000 training images and 5,000 validation images captured from diverse scenes. It has 182 semantic classes. Due to its vast diversity, existing image synthesis models perform poorly on this dataset.

• ADE20K [58] consists of 20,210 training and 2,000 validation images. Similarly to the COCO, the dataset contains challenging scenes with 150 semantic classes.

• ADE20K-outdoor is a subset of the ADE20K dataset that only contains outdoor scenes, used in Qi et al. [43].

• Cityscapes dataset [9] contains street scene images in German cities. The training and validation set sizes are 3,000 and 500, respectively. Recent work has achieved photorealistic semantic image synthesis results [43, 47] on the Cityscapes dataset.

• Flickr Landscapes. We collect 41,000 photos from Flickr and use 1,000 samples for the validation set. To avoid expensive manual annotation, we use a well-trained DeepLabV2 [5] to compute input segmentation masks.

We train the competing semantic image synthesis methods on the same training set and report their results on the same validation set for each dataset.

Performance metrics.

We adopt the evaluation protocol from previous work [6, 48]. Specifically, we run a semantic segmentation model on the synthesized images and compare how well the predicted segmentation mask matches the ground truth input. Intuitively, if the output images are realistic, a well-trained semantic segmentation model should be able to predict the ground truth label. For measuring the segmentation accuracy, we use both the mean Intersection-over-Union (mIoU) and the pixel accuracy (accu). We use the state-of-the-art segmentation networks for each dataset: DeepLabV2 [5, 40] for COCO-Stuff, UperNet101 [51] for ADE20K, and DRN-D-105 [53] for Cityscapes. In addition to the mIoU and the accu segmentation performance metrics, we use the Fréchet Inception Distance (FID) [17] to measure the distance between the distribution of synthesized results and the distribution of real images.
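The two segmentation metrics named above can be sketched over flattened label arrays as follows. How IoU is accumulated (per image or over a whole dataset) and how absent classes are treated vary between evaluation scripts; this version, which averages over classes that appear in either array, is an assumption for illustration, as is the toy data.

```python
def segmentation_metrics(pred, target, num_classes):
    """Return (mIoU, pixel accuracy) for two equally long lists of labels."""
    correct = sum(1 for p, t in zip(pred, target) if p == t)
    pixel_acc = correct / len(target)
    ious = []
    for k in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == k and t == k)
        union = sum(1 for p, t in zip(pred, target) if p == k or t == k)
        if union > 0:  # skip classes absent from both prediction and target
            ious.append(inter / union)
    return sum(ious) / len(ious), pixel_acc


target = [0, 0, 1, 1, 2, 2]  # labels of the input mask (ground truth)
pred = [0, 1, 1, 1, 2, 0]    # labels predicted on the synthesized image
miou, accu = segmentation_metrics(pred, target, num_classes=3)
```

Here the per-class IoUs are 1/3, 2/3, and 1/2, giving an mIoU of 0.5, while four of six pixels match, giving a pixel accuracy of about 0.67.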

Baselines.

We compare our method with three leading semantic image synthesis models: the pix2pixHD model [48], the cascaded refinement network (CRN) [6], and the semi-parametric image synthesis method (SIMS) [43]. The pix2pixHD is the current state-of-the-art GAN-based conditional image synthesis framework. The CRN uses a deep network that repeatedly refines the output from low to high resolution, while the SIMS takes a semi-parametric approach that composites real segments from a training set and refines the boundaries. Both the CRN and SIMS are mainly trained using image reconstruction loss. For a fair comparison, we train the CRN and pix2pixHD models using the implementations provided by the authors. As image synthesis using the SIMS requires many queries to the training dataset, it is computationally prohibitive for a large dataset such as the COCO-Stuff and the full ADE20K. Therefore, we use the results provided by the authors when available.

Quantitative comparisons.

As shown in Table 1, our method outperforms the current state-of-the-art methods by a large margin in all the datasets. For the COCO-Stuff, our method achieves an mIoU score of 35.2, which is about 1.5 times better than the previous leading method. Our FID is also 2.2 times better than the previous leading method. We note that the SIMS model produces a lower FID score but has poor segmentation performance on the Cityscapes dataset. This is because the SIMS synthesizes an image by first stitching image patches from the training dataset. Because it uses real image patches, the resulting image distribution can better match the distribution of real images. However, because there is no guarantee that a perfect query (e.g., a person in a particular pose) exists in the dataset, it tends to copy objects that do not match the input segments.

Qualitative results.

In Figures 5 and 6, we provide qualitative comparisons of the competing methods. We find that our method produces results with much better visual quality and fewer visible artifacts, especially for diverse scenes in the COCO-Stuff and ADE20K datasets. When the training dataset size is small, the SIMS model also renders images with good visual quality. However, the depicted content often deviates from the input segmentation mask (e.g., the shape of the swimming pool in the second row of Figure 6). In Figures 7 and 8, we show more example results from the Flickr Landscape and COCO-Stuff datasets. The proposed method can generate diverse scenes with high image fidelity. More results are included in the appendix.

Human evaluation.

We use the Amazon Mechanical Turk (AMT) to compare the perceived visual fidelity of our method against existing approaches. Specifically, we give the AMT workers an input segmentation mask and two synthesis outputs from different methods and ask them to choose the output image that looks more like a corresponding image of the segmentation mask. The workers are given unlimited time to make the selection. For each comparison, we randomly generate 500 questions for each dataset, and each question is answered by 5 different workers. For quality control, only workers with a lifetime task approval rate greater than 98% can participate in our study.

Table 2 shows the evaluation results. We find that users strongly favor our results on all the datasets, especially on the challenging COCO-Stuff and ADE20K datasets. For the Cityscapes, even when all the competing methods achieve high image fidelity, users still prefer our results.

Effectiveness of the SPADE.

To quantify the importance of the SPADE, we introduce a strong baseline called pix2pixHD++, which combines all the techniques we find useful for enhancing the performance of pix2pixHD except the SPADE. We also train models that receive the segmentation mask input at all the intermediate layers via feature concatenation in the channel direction, which we term pix2pixHD++ w/ Concat. Finally, the model that combines the strong baseline with the SPADE is denoted as pix2pixHD++ w/ SPADE.

As shown in Table 3, the architectures with the proposed SPADE consistently outperform their counterparts, in both the decoder-style architecture described in Figure 4 and the more traditional encoder-decoder architecture used in the pix2pixHD. We also find that concatenating segmentation masks at all intermediate layers, a reasonable alternative to the SPADE, does not achieve the same performance as SPADE. Furthermore, the decoder-style SPADE generator works better than the strong baselines even with a smaller number of parameters.

Variations of SPADE generator.

Table 4 reports the performance of several variations of our generator. First, we compare two types of input to the generator, where one is random noise and the other is the downsampled segmentation map. We find that both variants render similar performance and conclude that the modulation by SPADE alone provides sufficient signal about the input mask. Second, we vary the type of parameter-free normalization layers before applying the modulation parameters. We observe that the SPADE works reliably across different normalization methods. Next, we vary the convolutional kernel size acting on the label map, and find that a kernel size of 1×1 hurts performance, likely because it prohibits utilizing the context of the label. Lastly, we modify the capacity of the generator by changing the number of convolutional filters. We present more variations and ablations in the appendix.

Multi-modal synthesis. 다중 모드 합성.

In Figure 9, we show the multimodal image synthesis results on the Flickr Landscape dataset. For the same input segmentation mask, we sample different noise inputs to achieve different outputs. More results are included in the appendix.

그림 9에서는 Flickr Landscape 데이터 세트에 대한 다중 모드 이미지 합성 결과를 보여줍니다. 동일한 입력 분할 마스크에 대해 서로 다른 출력을 얻기 위해 서로 다른 노이즈 입력을 샘플링합니다. 더 많은 결과는 부록에 포함되어 있습니다.

Semantic manipulation and guided image synthesis. 시맨틱 조작 및 안내 이미지 합성.

In Figure 1, we show an application where a user draws different segmentation masks, and our model renders the corresponding landscape images. Moreover, our model allows users to choose an external style image to control the global appearances of the output image. We achieve it by replacing the input noise with the embedding vector of the style image computed by the image encoder.

그림 1에서는 사용자가 다른 분할 마스크를 그리고 모델이 해당 풍경 이미지를 렌더링하는 애플리케이션을 보여줍니다. 또한, 우리 모델은 사용자가 외부 스타일 이미지를 선택하여 출력 이미지의 전체 모양을 제어 할 수 있도록합니다. 입력 노이즈를 이미지 인코더가 계산 한 스타일 이미지의 임베딩 벡터로 대체하여이를 달성합니다.

5. Conclusion

We have proposed the spatially-adaptive normalization, which utilizes the input semantic layout while performing the affine transformation in the normalization layers. The proposed normalization leads to the first semantic image synthesis model that can produce photorealistic outputs for diverse scenes including indoor, outdoor, landscape, and street scenes. We further demonstrate its application for multi-modal synthesis and guided image synthesis.

우리는 정규화 계층에서 아핀 변환을 수행하면서 입력 시맨틱 레이아웃을 활용하는 공간 적응 형 정규화를 제안했습니다. 제안 된 정규화는 실내, 실외, 풍경 및 거리 장면을 포함한 다양한 장면에 대해 사실적인 출력을 생성 할 수있는 최초의 시맨틱 이미지 합성 모델로 이어집니다. 우리는 다중 모드 합성 및 안내 이미지 합성에 대한 응용 프로그램을 추가로 보여줍니다.

A. Additional Implementation Details 추가 구현 세부 정보

Generator.

The architecture of the generator consists of a series of the proposed SPADE ResBlks with nearest neighbor upsampling. We train our network using 8 GPUs simultaneously and use the synchronized version of the BatchNorm. We apply the Spectral Norm [38] to all the convolutional layers in the generator. The architectures of the proposed SPADE and SPADE ResBlk are given in Figure 10 and Figure 11, respectively. The architecture of the generator is shown in Figure 12.

생성기의 아키텍처는 최근 접 이웃 업 샘플링을 사용하여 제안 된 일련의 SPADE ResBlk로 구성됩니다. 8 개의 GPU를 동시에 사용하여 네트워크를 훈련시키고 BatchNorm의 동기화 된 버전을 사용합니다. Spectral Norm [38]을 생성기의 모든 conv 층에 적용합니다. 제안 된 SPADE 및 SPADE ResBlk의 아키텍처는 각각 그림 10과 그림 11에 나와 있습니다. 생성기의 아키텍처는 그림 12에 나와 있습니다.

Discriminator.

The architecture of the discriminator follows the one used in the pix2pixHD method [48], which uses a multi-scale design with the InstanceNorm (IN). The only difference is that we apply the Spectral Norm to all the convolutional layers of the discriminator. The details of the discriminator architecture are shown in Figure 13.

판별 기의 아키텍처는 InstanceNorm (IN)과 함께 다중 스케일 디자인을 사용하는 pix2pixHD 방법 [48]에서 사용되는 아키텍처를 따릅니다. 유일한 차이점은 식별기의 모든 컨벌루션 레이어에 Spectral Norm을 적용한다는 것입니다. 판별 기 아키텍처의 세부 사항은 그림 13에 나와 있습니다.

Image Encoder.

The image encoder consists of 6 stride-2 convolutional layers followed by two linear layers to produce the mean and variance of the output distribution as shown in Figure 14.

이미지 인코더는 그림 14와 같이 출력 분포의 평균과 분산을 생성하기 위해 6 개의 stride-2 컨벌루션 레이어와 2 개의 선형 레이어로 구성됩니다.
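As a rough check on the encoder described above, the sketch below tracks the spatial size of the feature map through six stride-2 convolutions. The kernel size of 3 and padding of 1 are assumptions (this section states only the stride and the number of layers), and the function names are hypothetical.

```python
def conv_out_size(size, kernel=3, stride=2, padding=1):
    """Output length of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def encoder_spatial_sizes(input_size=256, num_layers=6):
    """Spatial size after each stride-2 layer, starting from the input size."""
    sizes = [input_size]
    for _ in range(num_layers):
        sizes.append(conv_out_size(sizes[-1]))
    return sizes


sizes = encoder_spatial_sizes()
print(sizes)  # each layer halves the resolution under these assumptions
```

Under these assumptions a 256 × 256 input is reduced to a 4 × 4 feature map, which is then flattened and passed to the two linear layers that output the mean and variance vectors.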

Learning objective.

We use the learning objective function in the pix2pixHD work [48] except that we replace its LSGAN loss [34] term with the Hinge loss term [31, 38, 54]. We use the same weighting among the loss terms in the objective function as that in the pix2pixHD work.

우리는 LSGAN 손실 [34] 항을 힌지 손실 항 [31, 38, 54]으로 대체하는 것을 제외하고는 pix2pixHD 작업 [48]에서 학습 목적 함수를 사용합니다. pix2pixHD 작업에서와 같이 목적 함수에서 손실 항간에 동일한 가중치를 사용합니다.
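For reference, the hinge loss term can be sketched as follows. This is the common hinge formulation for GANs, written in plain Python over lists of discriminator scores; it is assumed here to correspond to the cited formulation, and the function names are hypothetical. The perceptual and feature matching terms of the pix2pixHD objective are not shown.

```python
def d_hinge_loss(real_scores, fake_scores):
    """Discriminator hinge loss: push real scores above +1 and fake scores below -1."""
    real_term = sum(max(0.0, 1.0 - s) for s in real_scores) / len(real_scores)
    fake_term = sum(max(0.0, 1.0 + s) for s in fake_scores) / len(fake_scores)
    return real_term + fake_term


def g_hinge_loss(fake_scores):
    """Generator loss: raise the discriminator's score on generated samples."""
    return -sum(fake_scores) / len(fake_scores)


# Hypothetical discriminator outputs for two real and two generated samples.
real_scores = [2.0, 0.5]
fake_scores = [-2.0, 0.0]

d_loss = d_hinge_loss(real_scores, fake_scores)
g_loss = g_hinge_loss(fake_scores)
print(d_loss, g_loss)
```

Scores that already lie beyond the margin (for example 2.0 for a real sample, or -2.0 for a generated one) contribute nothing to the discriminator loss, which is the main difference from the least-squares penalty it replaces.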

When training the proposed framework with the image encoder for multi-modal synthesis and style-guided image synthesis, we include a KL Divergence loss:

다중 모드 합성 및 스타일 안내 이미지 합성을위한 이미지 인코더로 제안 된 프레임 워크를 훈련 할 때 KL 발산 손실을 포함합니다.

$\mathcal{L}_{\mathrm{KLD}} = \mathcal{D}_{\mathrm{KL}}\big(q(z \mid x)\,\|\,p(z)\big)$

where the prior distribution p(z) is a standard Gaussian distribution and the variational distribution q is fully determined by a mean vector and a variance vector [28]. We use the reparameterization trick [28] for back-propagating the gradient from the generator to the image encoder. The weight for the KL Divergence loss is 0.05.

여기서 사전 분포 p (z)는 표준 가우스 분포이고 변동 분포 q는 평균 벡터와 분산 벡터에 의해 완전히 결정됩니다 [28]. 우리는 생성기에서 이미지 인코더로 기울기를 역 전파하기 위해 재 매개 변수화 트릭 [28]을 사용합니다. KL 발산 손실의 가중치는 0.05입니다.
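The KL term and the reparameterization trick can be sketched in a few lines. The example assumes q(z|x) is a Gaussian with diagonal covariance, for which the KL divergence to a standard Gaussian has the closed form 0.5 · Σ(μ² + σ² − log σ² − 1). The function names and the sample mean and variance values are hypothetical; only the weight 0.05 comes from the text.

```python
import math
import random


def kl_to_standard_normal(mean, var):
    """KL(q || p) for q = N(mean, diag(var)) and p = N(0, I), in closed form."""
    return 0.5 * sum(m * m + v - math.log(v) - 1.0 for m, v in zip(mean, var))


def reparameterize(mean, var, rng=random):
    """Sample z = mean + sqrt(var) * eps with eps ~ N(0, I).

    The randomness is isolated in eps, so z is a differentiable function of
    mean and var when this is written in an autodiff framework.
    """
    return [m + math.sqrt(v) * rng.gauss(0.0, 1.0) for m, v in zip(mean, var)]


KLD_WEIGHT = 0.05  # weight of the KL Divergence loss stated in the text

# Hypothetical encoder outputs for a 2-dimensional latent code.
mean = [0.0, 1.0]
var = [1.0, 1.0]

kld = kl_to_standard_normal(mean, var)
weighted_kld = KLD_WEIGHT * kld
z = reparameterize(mean, var, rng=random.Random(0))
print(kld, weighted_kld, z)
```

When the encoder output matches the prior exactly (zero mean, unit variance) the KL term is zero, so the loss only penalizes deviation of q from the standard Gaussian.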

In Figure 15, we give an overview of the training data flow. The image encoder encodes a real image to a mean vector and a variance vector. They are used to compute the noise input to the generator via the reparameterization trick [28]. The generator also takes the segmentation mask of the input image as input through the proposed SPADE ResBlks. The discriminator takes the concatenation of the segmentation mask and the output image from the generator as input and aims to classify it as fake.

그림 15에서는 훈련 데이터 흐름을 개괄적으로 보여줍니다. 이미지 인코더는 실제 이미지를 평균 벡터와 분산 벡터로 인코딩합니다. 재 매개 변수화 트릭을 통해 발생기에 입력되는 노이즈를 계산하는 데 사용됩니다 [28]. 생성기는 또한 제안 된 SPADE ResBlk를 사용하여 입력 이미지의 분할 마스크를 입력으로 사용합니다. 판별 기는 세분화 마스크와 생성기의 출력 이미지를 입력으로 연결하여 가짜로 분류하는 것을 목표로합니다.

Training details.

We perform 200 epochs of training on the Cityscapes and ADE20K datasets, 100 epochs of training on the COCO-Stuff dataset, and 50 epochs of training on the Flickr Landscapes dataset. The image sizes are 256 × 256, except for Cityscapes at 512 × 256. We linearly decay the learning rate to 0 from epoch 100 to 200 for the Cityscapes and ADE20K datasets. The batch size is 32. We initialize the network weights using the Glorot initialization [12].

우리는 Cityscapes 및 ADE20K 데이터 세트에 대한 200 epoch의 교육, COCO-Stuff 데이터 세트에 대한 100 epoch의 교육, Flickr Landscapes 데이터 세트에 대한 50 epoch의 교육을 수행합니다. 이미지 크기는 512 × 256의 Cityscapes를 제외하고 256 × 256입니다. Cityscapes 및 ADE20K 데이터 세트에 대해 학습률을 epoch 100에서 200까지 선형 적으로 0으로 감소시킵니다. 배치 크기는 32입니다. Glorot 초기화 [12]를 사용하여 네트워크 가중치를 초기화합니다.
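The learning rate schedule for Cityscapes and ADE20K can be sketched as below. The base learning rate is not given in this section, so `base_lr` is a placeholder parameter, and the function name and the exact epoch indexing (constant up to and including epoch 100) are assumptions.

```python
def learning_rate(epoch, base_lr, constant_epochs=100, total_epochs=200):
    """Constant rate for the first phase, then linear decay to 0 at total_epochs."""
    if epoch <= constant_epochs:
        return base_lr
    remaining = max(0, total_epochs - epoch)
    return base_lr * remaining / (total_epochs - constant_epochs)


# base_lr = 1.0 is a placeholder so the printed values read as fractions of the base rate.
for epoch in (1, 100, 150, 200):
    print(epoch, learning_rate(epoch, base_lr=1.0))
```

Returning a multiplier of the base rate in this way also makes the schedule easy to plug into a framework scheduler that expects a per-epoch scaling function.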

B. Additional Ablation Study

Table 5 provides additional ablation study results analyzing the contribution of individual components in the proposed method. We first find that both the perceptual loss and the GAN feature matching loss inherited from the learning objective function of pix2pixHD [48] are important. Removing either of them leads to a performance drop. We also find that increasing the depth of the discriminator by inserting one more convolutional layer at the top of the pix2pixHD discriminator does not improve the results.

표 5는 제안 된 방법에서 개별 구성 요소의 기여도를 분석 한 추가 절제 연구 결과를 제공합니다. 우리는 먼저 pix2pixHD [48]의 학습 목적 함수에서 물려받은 지각 손실과 GAN 기능 매칭 손실이 모두 중요하다는 것을 발견했습니다. 이들 중 하나를 제거하면 성능이 저하됩니다. 또한 pix2pixHD 판별 기의 상단에 하나 이상의 컨볼 루션 레이어를 삽입하여 판별 기의 깊이를 늘려도 결과가 개선되지 않음을 발견했습니다.

In Table 5, we also analyze the effectiveness of each component used in our strong baseline, the pix2pixHD++ method, derived from the pix2pixHD method. We found that the Spectral Norm, synchronized BatchNorm, TTUR [17], and the hinge loss objective all contribute to the performance boost. Adding the SPADE to the strong baseline further improves the performance. Note that the pix2pixHD++ w/o Sync BatchNorm and w/o Spectral Norm still differs from the pix2pixHD in that it uses the hinge loss objective, TTUR, a large batch size, and the Glorot initialization [12].

표 5에서는 pix2pixHD 방법에서 파생 된 강력한 기준 인 pix2pixHD ++ 방법에 사용 된 각 구성 요소의 효과도 분석합니다. Spectral Norm, 동기화 된 BatchNorm, TTUR [17] 및 힌지 손실 목표가 모두 성능 향상에 기여한다는 것을 발견했습니다. SPADE를 강력한베이스 라인에 추가하면 성능이 더욱 향상됩니다. Sync BatchNorm이없고 Spectral Norm이없는 pix2pixHD ++는 힌지 손실 목표, TTUR, 큰 배치 크기 및 Glorot 초기화를 사용한다는 점에서 pix2pixHD와 여전히 다릅니다 [12].

C. Additional Results

In Figures 16, 17, and 18, we show additional synthesis results from the proposed method on the COCO-Stuff and ADE20K datasets with comparisons to those from the CRN [6] and pix2pixHD [48] methods. In Figures 19 and 20, we show additional synthesis results from the proposed method on the ADE20K-outdoor and Cityscapes datasets with comparisons to those from the CRN [6], SIMS [43], and pix2pixHD [48] methods.

그림 16, 17, 18에서는 COCO-Stuff 및 ADE20K 데이터 세트에 대해 제안 된 방법의 추가 합성 결과를 CRN [6] 및 pix2pixHD [48] 방법과 비교하여 보여줍니다. 그림 19와 20에서는 CRN [6], SIMS [43], pix2pixHD [48] 방법과 비교하여 ADE20K-outdoor 및 Cityscapes 데이터 세트에 대해 제안 된 방법의 추가 합성 결과를 보여줍니다.

In Figure 21, we show additional multi-modal synthesis results from the proposed method. By sampling different z from a standard multivariate Gaussian distribution, we synthesize images of diverse appearances.

그림 21에서는 제안 된 방법의 추가 다중 모드 합성 결과를 보여줍니다. 표준 다변량 가우시안 분포에서 다른 z를 샘플링하여 다양한 모양의 이미지를 합성합니다.

In the accompanying video, we demonstrate our semantic image synthesis interface. We show how a user can create photorealistic landscape images by painting semantic labels on a canvas. We also show how a user can synthesize images of diverse appearances for the same semantic segmentation mask as well as transfer the appearance of a provided style image to the synthesized one.

함께 제공되는 비디오에서 시맨틱 이미지 합성 인터페이스를 보여줍니다. 사용자가 캔버스에 의미 론적 레이블을 페인팅하여 사실적인 풍경 이미지를 만드는 방법을 보여줍니다. 또한 사용자가 동일한 의미 분할 마스크에 대해 다양한 모양의 이미지를 합성하고 제공된 스타일 이미지의 모양을 합성 된 이미지로 전송하는 방법도 보여줍니다.