Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks, 2020

Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks
순환 일관된 적대 네트워크를 사용하여 쌍을 이루는 이미지 대 이미지 변환

Abstract

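Figure 1: Given any two unordered image collections X and Y, our algorithm learns to automatically "translate" an image from one into the other and vice versa: (left) Monet paintings and landscape photos from Flickr; (center) zebras and horses from ImageNet; (right) summer and winter Yosemite photos from Flickr. Example application (bottom): using a collection of paintings of famous artists, our method learns to render natural photographs into the respective styles.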

Image-to-image translation is a class of vision and graphics problems where the goal is to learn the mapping between an input image and an output image using a training set of aligned image pairs.
이미지 대 이미지 변환은 정렬 된 이미지 쌍의 훈련 세트를 사용하여 입력 이미지와 출력 이미지 간의 매핑을 학습하는 것이 목표 인 비전 및 그래픽 문제의 한 종류입니다.
However, for many tasks, paired training data will not be available.
그러나 많은 작업의 경우 쌍을 이룬 학습 데이터를 사용할 수 없습니다.
We present an approach for learning to translate an image from a source domain X to a target domain Y in the absence of paired examples.
쌍을 이룬 예제가없는 경우 소스 도메인 X에서 대상 도메인 Y로 이미지를 변환하는 방법을 학습하는 방법을 제시합니다.
Our goal is to learn a mapping G : X → Y such that the distribution of images from G(X) is indistinguishable from the distribution Y using an adversarial loss.
우리의 목표는 G (X)의 이미지 분포가 적대적 손실을 사용하는 분포 Y와 구별 할 수 없도록 매핑 G : X → Y를 학습하는 것입니다.
Because this mapping is highly under-constrained, we couple it with an inverse mapping F : Y → X and introduce a cycle consistency loss to enforce F(G(X)) ≈ X (and vice versa).
이 매핑은 제약이 매우 적기 때문에이를 역 매핑 F : Y → X와 연결하고주기 일관성 손실을 도입하여 F (G (X)) ≈ X를 적용합니다 (반대의 경우도 마찬가지).
Qualitative results are presented on several tasks where paired training data does not exist, including collection style transfer, object transfiguration, season transfer, photo enhancement, etc.
수집 스타일 전송, 개체 변형, 시즌 전송, 사진 향상 등을 포함하여 쌍을 이루는 훈련 데이터가 존재하지 않는 여러 작업에 대한 정 성적 결과가 제공됩니다.
Quantitative comparisons against several prior methods demonstrate the superiority of our approach.
몇 가지 이전 방법에 대한 정량적 비교는 우리의 접근 방식의 우수성을 보여줍니다.

1. Introduction

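Figure 2: Paired training data (left) consists of training examples $\{x_i, y_i\}_{i=1}^N$, where the correspondence between $x_i$ and $y_i$ exists [22]. We instead consider unpaired training data (right), consisting of a source set $\{x_i\}_{i=1}^N$ ($x_i \in X$) and a target set $\{y_j\}_{j=1}^M$ ($y_j \in Y$), with no information provided as to which $x_i$ matches which $y_j$.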

What did Claude Monet see as he placed his easel by the bank of the Seine near Argenteuil on a lovely spring day in 1873 (Figure 1, top-left)?
클로드 모네는 1873 년 아름다운 봄날 아르 장 퇴유 근처 센 강둑 옆에 자신의 이젤을 놓았을 때 무엇을 보았습니까 (그림 1, 왼쪽 상단)?
A color photograph, had it been invented, may have documented a crisp blue sky and a glassy river reflecting it.
컬러 사진이 발명 되었다면 맑고 푸른 하늘과 그것을 반사하는 유리 강을 기록했을 것입니다.
Monet conveyed his impression of this same scene through wispy brush strokes and a bright palette.
모네는 희미한 붓놀림과 밝은 팔레트를 통해 같은 장면에 대한 인상을 전달했습니다.
What if Monet had happened upon the little harbor in Cassis on a cool summer evening (Figure 1, bottom-left)?
모네가 시원한 여름 저녁에 카시스의 작은 항구에서 일어났다면 어떨까요 (그림 1, 왼쪽 하단)?
A brief stroll through a gallery of Monet paintings makes it possible to imagine how he would have rendered the scene:
모네 그림 갤러리를 잠시 산책하면 그가 장면을 어떻게 렌더링했는지 상상할 수 있습니다.
perhaps in pastel shades, with abrupt dabs of paint, and a somewhat flattened dynamic range.
갑작스런 페인트와 다소 평탄한 다이나믹 레인지가있는 파스텔 색조 일 것입니다.
We can imagine all this despite never having seen a side by side example of a Monet painting next to a photo of the scene he painted.
그가 그린 장면의 사진 옆에 모네 그림의 나란히있는 예를 본 적이 없음에도 불구하고 우리는이 모든 것을 상상할 수 있습니다.
Instead, we have knowledge of the set of Monet paintings and of the set of landscape photographs.
대신 우리는 모네 그림 세트와 풍경 사진 세트에 대한 지식을 가지고 있습니다.
We can reason about the stylistic differences between these two sets, and thereby imagine what a scene might look like if we were to “translate” it from one set into the other.
이 두 세트 간의 스타일 차이에 대해 추론 할 수 있으므로 한 세트에서 다른 세트로 "번역"하면 장면이 어떻게 보일지 상상할 수 있습니다.
In this paper, we present a method that can learn to do the same: capturing special characteristics of one image collection and figuring out how these characteristics could be translated into the other image collection, all in the absence of any paired training examples.
이 백서에서는 한 이미지 컬렉션의 특수한 특성을 캡처하고 이러한 특성을 다른 이미지 컬렉션으로 변환 할 수있는 방법을 파악하는 등 동일한 방법을 학습 할 수있는 방법을 제시합니다.
This problem can be more broadly described as image-to-image translation [22], converting an image from one representation of a given scene, x, to another, y, e.g., grayscale to color, image to semantic labels, edge-map to photograph.
이 문제는 이미지를 이미지로 변환하는 것으로보다 광범위하게 설명 할 수 있습니다 [22], 주어진 장면 x의 한 표현에서 다른 이미지로, 예를 들어 그레이 스케일을 컬러로, 이미지를 의미 레이블로, 에지 맵을 사진으로 변환하는 이미지 .
Years of research in computer vision, image processing, computational photography, and graphics have produced powerful translation systems in the supervised setting, where example image pairs $\{x_i, y_i\}_{i=1}^N$ are available (Figure 2, left), e.g., [11, 19, 22, 23, 28, 33, 45, 56, 58, 62].
컴퓨터 비전, 이미지 처리, 컴퓨터 사진 및 그래픽에 대한 수년간의 연구를 통해 감독 된 설정에서 강력한 번역 시스템이 만들어졌습니다. 여기에서 예제 이미지 쌍 {xi, yi} Ni = 1을 사용할 수 있습니다 (그림 2, 왼쪽). 예 : [11 , 19, 22, 23, 28, 33, 45, 56, 58,62].
However, obtaining paired training data can be difficult and expensive. For example, only a couple of datasets exist for tasks like semantic segmentation (e.g., [4]), and they are relatively small.
그러나 페어링 된 훈련 데이터를 얻는 것은 어렵고 비용이 많이들 수 있습니다. 예를 들어 의미 론적 세분화 (예 : [4])와 같은 작업에 대해 몇 개의 데이터 세트 만 존재하며 상대적으로 작습니다.
Obtaining input-output pairs for graphics tasks like artistic stylization can be even more difficult since the desired output is highly complex, typically requiring artistic authoring.
원하는 출력이 매우 복잡하고 일반적으로 예술적 저작이 필요하기 때문에 예술적 스타일 화와 같은 그래픽 작업에 대한 입력-출력 쌍을 얻는 것은 훨씬 더 어려울 수 있습니다.
For many tasks, like object transfiguration (e.g., zebra↔horse, Figure 1 top-middle), the desired output is not even well-defined.
객체 변형 (예 : zebra↔horse, 그림 1 상단 중간)과 같은 많은 작업의 경우 원하는 출력이 잘 정의되어 있지 않습니다.
We therefore seek an algorithm that can learn to translate between domains without paired input-output examples (Figure 2, right).
따라서 우리는 쌍을 이루는 입출력 예제없이 도메인 간 번역을 배울 수있는 알고리즘을 찾습니다 (그림 2, 오른쪽).
We assume there is some underlying relationship between the domains – for example, that they are two different renderings of the same underlying scene – and seek to learn that relationship.
예를 들어, 동일한 기본 장면의 두 가지 다른 렌더링이라고 가정하고 도메인간에 몇 가지 기본 관계가 있다고 가정하고 그 관계를 배우려고합니다.
Although we lack supervision in the form of paired examples, we can exploit supervision at the level of sets: we are given one set of images in domain X and a different set in domain Y .
쌍을 이룬 예의 형태로 감독이 부족하더라도 세트 수준에서 감독을 활용할 수 있습니다. 도메인 X에 한 세트의 이미지가 있고 도메인 Y에 다른 세트가 주어집니다.
We may train a mapping G : X → Y such that the output ŷ = G(x), x ∈ X, is indistinguishable from images y ∈ Y by an adversary trained to classify ŷ apart from y.
ŷ를 y와 구별하여 분류하도록 훈련 된 적에 의해 출력 ŷ = G (x), x ∈ X가 이미지 y ∈ Y와 구별되지 않도록 매핑 G : X → Y를 훈련 할 수 있습니다.
In theory, this objective can induce an output distribution over ŷ that matches the empirical distribution pdata(y) (in general, this requires G to be stochastic) [16].
이론적으로이 목표는 경험적 분포 pdata (y)와 일치하는 ŷ에 대한 출력 분포를 유도 할 수 있습니다 (일반적으로 G는 확률 적이어야 함) [16].
The optimal G thereby translates the domain X to a domain Ŷ distributed identically to Y.
따라서 최적 G는 도메인 X를 Y와 동일하게 분포 된 도메인 Ŷ로 변환합니다.
However, such a translation does not guarantee that an individual input x and output y are paired up in a meaningful way: there are infinitely many mappings G that will induce the same distribution over ŷ.
그러나 이러한 변환은 개별 입력 x와 출력 y가 의미있는 방식으로 쌍을 이루는 것을 보장하지 않습니다. ŷ에 대해 동일한 분포를 유도하는 매핑 G가 무한히 많습니다.
Moreover, in practice, we have found it difficult to optimize the adversarial objective in isolation:
또한 실제로는 적대적 목표를 분리하여 최적화하는 것이 어렵다는 것을 알게되었습니다.
standard procedures often lead to the well-known problem of mode collapse, where all input images map to the same output image and the optimization fails to make progress [15].
표준 절차는 종종 모든 입력 이미지가 동일한 출력 이미지에 매핑되고 최적화가 진행되지 않는 모드 붕괴라는 잘 알려진 문제로 이어집니다 [15].
These issues call for adding more structure to our objective.
이러한 문제는 우리의 목표에 더 많은 구조를 추가 할 것을 요구합니다.
Therefore, we exploit the property that translation should be “cycle consistent”, in the sense that if we translate, e.g., a sentence from English to French, and then translate it back from French to English, we should arrive back at the original sentence [3].
따라서 우리는 예를 들어 영어에서 프랑스어로 문장을 번역 한 다음 다시 프랑스어에서 영어로 번역하면 원래 문장으로 돌아 가야한다는 의미에서 번역이 "주기 일관성"이어야한다는 속성을 이용합니다 [3].
Mathematically, if we have a translator G : X → Y and another translator F : Y → X, then G and F should be inverses of each other, and both mappings should be bijections.
수학적으로 우리가 번역가 G : X → Y와 다른 번역가 F : Y → X를 가지고 있다면, G와 F는 서로 역이어야하며 두 매핑은 모두 bijections 여야합니다.
We apply this structural assumption by training both the mapping G and F simultaneously, and adding a cycle consistency loss [64] that encourages F(G(x)) ≈ x and G(F(y)) ≈ y.
매핑 G와 F를 동시에 훈련하고 F (G (x)) ≈ x 및 G (F (y)) ≈ y를 장려하는주기 일관성 손실 [64]을 추가하여이 구조적 가정을 적용합니다.
Combining this loss with adversarial losses on domains X and Y yields our full objective for unpaired image-to-image translation.
이러한 손실과 도메인 X 및 Y의 적대적 손실을 결합하면 짝을 이루지 않은 이미지 대 이미지 변환에 대한 전체 목표가 산출됩니다.
We apply our method to a wide range of applications, including collection style transfer, object transfiguration, season transfer and photo enhancement.
우리는 컬렉션 스타일 전송, 개체 변형, 시즌 전송 및 사진 향상을 포함한 광범위한 응용 프로그램에 우리의 방법을 적용합니다.
We also compare against previous approaches that rely either on hand-defined factorizations of style and content, or on shared embedding functions, and show that our method outperforms these baselines.
또한 스타일 및 콘텐츠의 수동 분해 또는 공유 임베딩 기능에 의존하는 이전 접근 방식과 비교하여 우리의 방법이 이러한 기준을 능가한다는 것을 보여줍니다.
We provide both PyTorch and Torch implementations. Check out more results at our website.
PyTorch 및 Torch 구현을 모두 제공합니다. 저희 웹 사이트에서 더 많은 결과를 확인하십시오.

2. Related work

Generative Adversarial Networks (GANs) [16, 63] have achieved impressive results in image generation [6, 39], image editing [66], and representation learning [39, 43, 37].
Generative Adversarial Networks (GAN) [16, 63]는 이미지 생성 [6, 39], 이미지 편집 [66] 및 표현 학습 [39, 43, 37]에서 인상적인 결과를 얻었습니다.

Recent methods adopt the same idea for conditional image generation applications, such as text2image [41], image inpainting [38], and future prediction [36], as well as to other domains like videos [54] and 3D data [57].
최근의 방법은 text2image [41], image inpainting [38], 미래 예측 [36]과 같은 조건부 이미지 생성 응용 프로그램뿐만 아니라 비디오 [54] 및 3D 데이터 [57]와 같은 다른 영역에도 동일한 아이디어를 채택합니다.
The key to GANs’ success is the idea of an adversarial loss that forces the generated images to be, in principle, indistinguishable from real photos.
GAN의 성공의 열쇠는 생성 된 이미지가 원칙적으로 실제 사진과 구별 할 수 없도록 만드는 적대적 손실의 아이디어입니다
This loss is particularly powerful for image generation tasks, as this is exactly the objective that much of computer graphics aims to optimize.
이 손실은 이미지 생성 작업에 특히 강력합니다. 이는 대부분의 컴퓨터 그래픽이 최적화하려는 목표이기 때문입니다.
We adopt an adversarial loss to learn the mapping such that the translated images cannot be distinguished from images in the target domain.
변환 된 이미지가 대상 도메인의 이미지와 구별되지 않도록 매핑을 학습하기 위해 적대적 손실을 채택합니다.

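Figure 3: (a) Our model contains two mapping functions G : X → Y and F : Y → X, and associated adversarial discriminators DY and DX. DY encourages G to translate X into outputs indistinguishable from domain Y, and vice versa for DX and F. To further regularize the mappings, we introduce two cycle consistency losses that capture the intuition that if we translate from one domain to the other and back again we should arrive at where we started: (b) forward cycle-consistency loss: x → G(x) → F(G(x)) ≈ x, and (c) backward cycle-consistency loss: y → F(y) → G(F(y)) ≈ y.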

*Image-to-Image Translation
* 이미지-이미지 번역
The idea of image-to-image translation goes back at least to Hertzmann et al.’s Image Analogies [19], who employ a non-parametric texture model [10] on a single input-output training image pair.
이미지-이미지 변환의 아이디어는 적어도 단일 입력-출력 훈련 이미지 쌍에 비모수 텍스처 모델 [10]을 사용하는 Hertzmann 등의 Image Analogies [19]로 거슬러 올라갑니다.
More recent approaches use a dataset of input-output examples to learn a parametric translation function using CNNs (e.g., [33]).
보다 최근의 접근 방식은 입력-출력 예제 데이터 세트를 사용하여 CNN을 사용하는 매개 변수 변환 함수를 학습합니다 (예 : [33]).
Our approach builds on the “pix2pix” framework of Isola et al. [22], which uses a conditional generative adversarial network [16] to learn a mapping from input to output images.
우리의 접근 방식은 Isola 등의 "pix2pix"프레임 워크를 기반으로합니다. [22], 조건부 생성 적대 네트워크 [16]를 사용하여 입력에서 출력 이미지로의 매핑을 학습합니다.
Similar ideas have been applied to various tasks such as generating photographs from sketches [44] or from attribute and semantic layouts [25].
유사한 아이디어가 스케치 [44] 또는 속성 및 의미 레이아웃 [25]에서 사진을 생성하는 것과 같은 다양한 작업에 적용되었습니다.
However, unlike the above prior work, we learn the mapping without paired training examples.
그러나 위의 이전 작업과 달리 페어링 된 학습 예제없이 매핑을 학습합니다.

* Unpaired Image-to-Image Translation
* 페어링되지 않은 이미지 대 이미지 번역

Several other methods also tackle the unpaired setting, where the goal is to relate two data domains: X and Y .
다른 몇 가지 방법은 X와 Y의 두 데이터 도메인을 연결하는 것이 목표 인 짝이없는 설정을 다룹니다.
Rosales et al. [42] propose a Bayesian framework that includes a prior based on a patch-based Markov random field computed from a source image and a likelihood term obtained from multiple style images.
Rosales et al. 소스 이미지에서 계산 된 패치 기반 Markov 랜덤 필드와 여러 스타일 이미지에서 얻은 우도 항을 기반으로 한 사전을 포함하는 베이지안 프레임 워크를 제안합니다.
More recently, CoGAN [32] and cross-modal scene networks [1] use a weight-sharing strategy to learn a common representation across domains.
최근에는 CoGAN [32]과 크로스 모달 씬 네트워크 [1]가 가중치 공유 전략을 사용하여 도메인 간 공통된 표현을 학습합니다.
Concurrent with our method, Liu et al. [31] extend the above framework with a combination of variational autoencoders [27] and generative adversarial networks [16].
우리의 방법과 동시에 Liu et al. [31] 변형 자동 인코더 [27]와 생성 적대 네트워크 [16]의 조합으로 위의 프레임 워크를 확장합니다.
Another line of concurrent work [46, 49, 2] encourages the input and output to share specific “content” features even though they may differ in “style”.
또 다른 동시 작업 라인 [46, 49, 2]은 입력과 출력이 "스타일"이 다를 수 있지만 특정 "콘텐츠"기능을 공유하도록 장려합니다.
These methods also use adversarial networks, with additional terms to enforce the output to be close to the input in a predefined metric space, such as class label space [2], image pixel space [46], and image feature space [49].
이러한 방법은 또한 클래스 레이블 공간 [2], 이미지 픽셀 공간 [46] 및 이미지 특징 공간 [49]과 같은 미리 정의 된 메트릭 공간의 입력에 근접하도록 출력을 강제하는 추가 용어와 함께 적대적 네트워크를 사용합니다.
Unlike the above approaches, our formulation does not rely on any task-specific, predefined similarity function between the input and output, nor do we assume that the input and output have to lie in the same low-dimensional embedding space.
위의 접근 방식과 달리, 우리의 공식은 입력과 출력 사이의 작업 별 미리 정의 된 유사성 함수에 의존하지 않으며 입력과 출력이 동일한 저 차원 임베딩 공간에 있어야한다고 가정하지도 않습니다
This makes our method a general-purpose solution for many vision and graphics tasks.
이것은 우리의 방법을 많은 비전 및 그래픽 작업에 대한 범용 솔루션으로 만듭니다.
We directly compare against several prior and contemporary approaches in Section 5.1.
5.1 절에서 몇 가지 이전 및 현재 접근 방식과 직접 비교합니다.

*Cycle Consistency
*주기 일관성

The idea of using transitivity as a way to regularize structured data has a long history.
구조화 된 데이터를 정규화하는 방법으로 전이성을 사용한다는 아이디어는 오랜 역사를 가지고 있습니다.
In visual tracking, enforcing simple forward-backward consistency has been a standard trick for decades [24, 48].
시각적 추적에서 단순한 전후 일관성을 적용하는 것은 수십 년 동안 표준 트릭이었습니다 [24, 48].
In the language domain, verifying and improving translations via “back translation and reconciliation” is a technique used by human translators [3] (including, humorously, by Mark Twain [51]), as well as by machines [17].
언어 영역에서 "역 번역 및 조정"을 통한 번역 확인 및 개선은 인간 번역가 [3] (Mark Twain [51] 포함)와 기계 [17]가 사용하는 기술입니다.
More recently, higher-order cycle consistency has been used in structure from motion [61], 3D shape matching [21], cosegmentation [55], dense semantic alignment [65, 64], and depth estimation [14].
최근에는 모션 [61], 3D 형상 매칭 [21], 코 세그멘테이션 [55], 조밀 한 의미 정렬 [65, 64] 및 깊이 추정 [14]의 구조에서 고차주기 일관성이 사용되었습니다.
Of these, Zhou et al. [64] and Godard et al. [14] are most similar to our work, as they use a cycle consistency loss as a way of using transitivity to supervise CNN training.
이들 중 Zhou et al. [64] 및 Godard et al. [14] CNN 훈련을 감독하기 위해 전이성을 사용하는 방법으로주기 일관성 손실을 사용한다는 점에서 우리 작업과 가장 유사합니다.
In this work, we are introducing a similar loss to push G and F to be consistent with each other.
이 작업에서 우리는 G와 F를 서로 일치시키기 위해 유사한 손실을 도입합니다.
Concurrent with our work, in these same proceedings, Yi et al. [59] independently use a similar objective for unpaired image-to-image translation, inspired by dual learning in machine translation [17].
우리의 작업과 동시에이 같은 절차에서 Yi et al. [59]은 기계 번역의 이중 학습에서 영감을 얻은 짝을 이루지 않은 이미지 대 이미지 번역에 유사한 목표를 독립적으로 사용합니다 [17].
Neural Style Transfer [13, 23, 52, 12] is another way to perform image-to-image translation, which synthesizes a novel image by combining the content of one image with the style of another image (typically a painting) based on matching the Gram matrix statistics of pre-trained deep features.
Neural Style Transfer [13, 23, 52, 12]는 이미지 대 이미지 변환을 수행하는 또 다른 방법으로, 일치를 기반으로 한 이미지의 내용을 다른 이미지 (일반적으로 그림)의 스타일과 결합하여 새로운 이미지를 합성합니다. 사전 훈련 된 심층 특성의 그람 행렬 통계.
Our primary focus, on the other hand, is learning the mapping between two image collections, rather than between two specific images, by trying to capture correspondences between higher-level appearance structures.
반면에 우리의 주요 초점은 더 높은 수준의 외관 구조 간의 일치를 포착하려고 시도함으로써 두 특정 이미지 간의 매핑이 아닌 두 이미지 컬렉션 간의 매핑을 학습하는 것입니다.
Therefore, our method can be applied to other tasks, such as painting→ photo, object transfiguration, etc. where single sample transfer methods do not perform well.
따라서 우리의 방법은 단일 샘플 전송 방법이 잘 수행되지 않는 페인팅 → 사진, 물체 변형 등과 같은 다른 작업에 적용될 수 있습니다.
We compare these two methods in Section 5.2.
5.2 절에서이 두 가지 방법을 비교합니다.

3. Formulation

Our goal is to learn mapping functions between two domains X and Y given training samples $\{x_i\}_{i=1}^N$ where $x_i \in X$ and $\{y_j\}_{j=1}^M$ where $y_j \in Y$.
우리의 목표는 훈련 샘플 {xi} N i = 1 (여기서 xi ∈ X 및 {yj} M j = 1 여기서 yj ∈ Y)이 주어지면 두 도메인 X와 Y 사이의 매핑 함수를 학습하는 것입니다.
We denote the data distribution as x ∼ pdata(x) and y ∼ pdata(y).
데이터 분포를 x ∼ pdata (x) 및 y ∼ pdata (y)로 표시합니다.
As illustrated in Figure 3 (a), our model includes two mappings G : X → Y and F : Y → X.
그림 3 (a)에서 볼 수 있듯이 모델에는 G : X → Y 및 F : Y → X의 두 가지 매핑이 포함됩니다.
In addition, we introduce two adversarial discriminators DX and DY , where DX aims to distinguish between images {x} and translated images {F(y)};
또한 두 개의 적대적 판별 기 DX와 DY를 소개합니다. 여기서 DX는 이미지 {x}와 번역 된 이미지 {F (y)}를 구별하는 것을 목표로합니다.
in the same way, DY aims to discriminate between {y} and {G(x)}.
같은 방식으로 DY는 {y}와 {G (x)}를 구별하는 것을 목표로합니다.
Our objective contains two types of terms: adversarial losses [16] for matching the distribution of generated images to the data distribution in the target domain;
우리의 목표는 두 가지 유형의 용어를 포함합니다. 생성 된 이미지의 분포를 대상 도메인의 데이터 분포와 일치시키기위한 적대적 손실 [16]
and cycle consistency losses to prevent the learned mappings G and F from contradicting each other.
학습 된 매핑 G 및 F가 서로 모순되는 것을 방지하기위한주기 일관성 손실.

3.1. Adversarial Loss

We apply adversarial losses [16] to both mapping functions.
우리는 두 매핑 함수에 적대적 손실 [16]을 적용합니다.
For the mapping function G : X → Y and its discriminator DY , we express the objective as:
매핑 함수 G : X → Y 및 그 판별 자 DY의 경우 목적을 다음과 같이 표현합니다.
$$\mathcal{L}_{GAN}(G, D_Y, X, Y) = \mathbb{E}_{y \sim p_{data}(y)}[\log D_Y(y)] + \mathbb{E}_{x \sim p_{data}(x)}[\log(1 - D_Y(G(x)))] \tag{1}$$
where G tries to generate images G(x) that look similar to images from domain Y, while DY aims to distinguish between translated samples G(x) and real samples y.
여기서 G는 도메인 Y의 이미지와 유사한 이미지 G (x)를 생성하려고 시도하는 반면 DY는 번역 된 샘플 G (x)와 실제 샘플 y를 구별하는 것을 목표로합니다.
G aims to minimize this objective against an adversary DY that tries to maximize it, i.e., $\min_G \max_{D_Y} \mathcal{L}_{GAN}(G, D_Y, X, Y)$.
G는이 목표를 최대화하려고 시도하는 적 D, 즉 minG maxDY LGAN (G, DY, X, Y)에 대해이 목표를 최소화하는 것을 목표로합니다.
We introduce a similar adversarial loss for the mapping function F : Y → X and its discriminator DX as well:
매핑 함수 F : Y → X 및 판별 기 DX에 대해서도 유사한 적대적 손실을 도입합니다.
i.e., $\min_F \max_{D_X} \mathcal{L}_{GAN}(F, D_X, Y, X)$.
즉, minF maxDX LGAN (F, DX, Y, X).
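To make the min-max game in Equation 1 concrete, the following minimal sketch estimates the objective from a handful of discriminator outputs. The function name `gan_objective` and the toy probabilities are assumptions introduced for illustration; they are not part of the paper's implementation.

```python
import math

def gan_objective(d_real, d_fake):
    """Monte Carlo estimate of Equation 1 (illustrative helper, not from the paper).

    d_real: discriminator outputs D_Y(y) on real samples, each strictly in (0, 1)
    d_fake: discriminator outputs D_Y(G(x)) on translated samples, each strictly in (0, 1)
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# A discriminator that separates real from translated images well (high objective).
confident = gan_objective([0.9, 0.8], [0.1, 0.2])

# A fully fooled discriminator outputs 0.5 everywhere (objective equals -log 4).
fooled = gan_objective([0.5, 0.5], [0.5, 0.5])
```

The discriminator pushes this value up by separating the two sets, while the generator pushes it down by making D_Y(G(x)) large.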

Figure 4: The input images x, output images G(x) and the reconstructed images F(G(x)) from various experiments. From top to bottom: photo↔Cezanne, horses↔zebras, winter→summer Yosemite, aerial photos↔Google maps.

3.2. Cycle Consistency Loss
주기 일관성 손실
Adversarial training can, in theory, learn mappings G and F that produce outputs identically distributed as target domains Y and X respectively (strictly speaking, this requires G and F to be stochastic functions) [15].
이론적으로 적대적 훈련은 목표 도메인 Y와 X로 각각 동일하게 분포 된 출력을 생성하는 매핑 G와 F를 학습 할 수 있습니다 (엄격히 말하면 G와 F가 확률 함수 여야 함) [15].
However, with large enough capacity, a network can map the same set of input images to any random permutation of images in the target domain, where any of the learned mappings can induce an output distribution that matches the target distribution.
그러나 용량이 충분히 큰 네트워크는 동일한 입력 이미지 세트를 대상 도메인의 임의의 이미지 순열에 매핑 할 수 있으며, 여기서 학습 된 매핑은 대상 분포와 일치하는 출력 분포를 유도 할 수 있습니다.
Thus, adversarial losses alone cannot guarantee that the learned function can map an individual input xi to a desired output yi.
따라서 적대적 손실만으로는 학습 된 함수가 개별 입력 xi를 원하는 출력 yi에 매핑 할 수 있음을 보장 할 수 없습니다.
To further reduce the space of possible mapping functions, we argue that the learned mapping functions should be cycle-consistent: as shown in Figure 3 (b), for each image x from domain X, the image translation cycle should be able to bring x back to the original image, i.e., x → G(x) → F(G(x)) ≈ x. We call this forward cycle consistency.
가능한 매핑 함수의 공간을 더 줄이기 위해 학습 된 매핑 함수는주기에 일관성이 있어야한다고 주장합니다. 그림 3 (b)에 표시된 것처럼 도메인 X의 각 이미지 x에 대해 이미지 변환주기는 x를 다시 가져올 수 있어야합니다. 원본 이미지로, 즉 x → G (x) → F (G (x)) ≈ x. 이를 순방향주기 일관성이라고합니다.
Similarly, as illustrated in Figure 3 (c), for each image y from domain Y , G and F should also satisfy backward cycle consistency: y → F(y) → G(F(y)) ≈ y.
마찬가지로 그림 3 (c)에 표시된 것처럼 도메인 Y의 각 이미지 y에 대해 G 및 F는 역방향주기 일관성 (y → F (y) → G (F (y)) ≈ y)도 충족해야합니다.
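We incentivize this behavior using a cycle consistency loss:
$$\mathcal{L}_{cyc}(G, F) = \mathbb{E}_{x \sim p_{data}(x)}[\|F(G(x)) - x\|_1] + \mathbb{E}_{y \sim p_{data}(y)}[\|G(F(y)) - y\|_1] \tag{2}$$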
주기 일관성 손실을 사용하여이 동작을 장려합니다. : (2)
In preliminary experiments, we also tried replacing the L1 norm in this loss with an adversarial loss between F(G(x)) and x, and between G(F(y)) and y, but did not observe improved performance.
예비 실험에서 우리는이 손실의 L1 표준을 F (G (x))와 x 사이, G (F (y))와 y 사이의 적대적 손실로 대체하려고 시도했지만 성능이 향상되지 않았습니다.
The behavior induced by the cycle consistency loss can be observed in Figure 4: the reconstructed images F(G(x)) end up matching closely to the input images x.
주기 일관성 손실에 의해 유도 된 동작은 그림 4에서 관찰 할 수 있습니다. 재구성 된 이미지 x.
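The forward and backward cycle terms described in this section can be sketched on toy data. In this illustrative example, images are flat lists of floats and the mappings `G`, `F_good`, and `F_bad` are hypothetical stand-ins for trained networks; summing absolute differences per image (rather than averaging over pixels) is also an assumption.

```python
def l1_distance(a, b):
    """L1 norm of the difference between two equally sized flat images."""
    return sum(abs(p - q) for p, q in zip(a, b))

def cycle_consistency_loss(G, F, xs, ys):
    """Forward term ||F(G(x)) - x||_1 plus backward term ||G(F(y)) - y||_1,
    each averaged over its sample set."""
    forward = sum(l1_distance(F(G(x)), x) for x in xs) / len(xs)
    backward = sum(l1_distance(G(F(y)), y) for y in ys) / len(ys)
    return forward + backward

# Toy mappings (assumptions): G brightens every pixel by 0.5.
G = lambda img: [p + 0.5 for p in img]
F_good = lambda img: [p - 0.5 for p in img]  # exact inverse of G
F_bad = lambda img: list(img)                # ignores what G did

xs = [[0.0, 1.0], [2.0, 3.0]]  # samples from domain X
ys = [[0.5, 1.5]]              # samples from domain Y

loss_good = cycle_consistency_loss(G, F_good, xs, ys)
loss_bad = cycle_consistency_loss(G, F_bad, xs, ys)
```

A pair of mappings that invert each other incurs zero loss, while a pair that does not is penalized in both directions.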

3.3. Full Objective
3.3. 전체 목표

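Our full objective is:
$$\mathcal{L}(G, F, D_X, D_Y) = \mathcal{L}_{GAN}(G, D_Y, X, Y) + \mathcal{L}_{GAN}(F, D_X, Y, X) + \lambda \mathcal{L}_{cyc}(G, F) \tag{3}$$
where λ controls the relative importance of the two objectives.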
우리의 전체 목표는 다음과 같습니다. (3) 여기서 λ는 두 목표의 상대적 중요성을 제어합니다.
We aim to solve:
$$G^*, F^* = \arg\min_{G, F} \max_{D_X, D_Y} \mathcal{L}(G, F, D_X, D_Y) \tag{4}$$
우리는 다음을 해결하고자합니다 : (4)
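As a small worked illustration of how λ trades off the adversarial and cycle terms, the sketch below combines already-computed scalar loss values. The function name `full_objective` and the numeric inputs are assumptions chosen only to show the weighting.

```python
def full_objective(gan_xy, gan_yx, cycle, lam=10.0):
    """Weighted sum of the two adversarial terms and the cycle term.

    gan_xy: adversarial loss value for G : X -> Y with discriminator D_Y
    gan_yx: adversarial loss value for F : Y -> X with discriminator D_X
    cycle:  cycle consistency loss value
    lam:    weight on the cycle term (illustrative default)
    """
    return gan_xy + gan_yx + lam * cycle

# Same loss values, two different weights on the cycle term.
moderate = full_objective(-1.0, -1.0, 0.05, lam=10.0)
heavy = full_objective(-1.0, -1.0, 0.05, lam=100.0)
```

With a larger λ, the same cycle error contributes more to the total, so the optimizer is pushed harder toward mappings that invert each other.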
Notice that our model can be viewed as training two “autoencoders” [20]:
우리 모델은 두 개의 "자동 인코더"[20]를 훈련시키는 것으로 볼 수 있습니다.
we learn one autoencoder F ◦ G : X → X jointly with another G◦F : Y → Y .
우리는 하나의 오토 인코더 F ◦ G : X → X를 다른 G◦F : Y → Y와 함께 배웁니다.
However, these autoencoders each have special internal structures: they map an image to itself via an intermediate representation that is a translation of the image into another domain.
그러나 이러한 오토 인코더에는 각각 특수한 내부 구조가 있습니다. 즉, 이미지를 다른 도메인으로 변환하는 중간 표현을 통해 이미지를 자체에 매핑합니다.
Such a setup can also be seen as a special case of “adversarial autoencoders” [34], which use an adversarial loss to train the bottleneck layer of an autoencoder to match an arbitrary target distribution.
이러한 설정은 "적대적 자동 인코더"[34]의 특수한 경우로 볼 수도 있습니다.이 경우 적대적 손실을 사용하여 자동 인코더의 병목 계층을 임의의 대상 분포와 일치하도록 훈련시킵니다.
In our case, the target distribution for the X → X autoencoder is that of the domain Y .
우리의 경우 X → X autoencoder의 대상 분포는 도메인 Y의 분포입니다.
In Section 5.1.4, we compare our method against ablations of the full objective, including the adversarial loss LGAN alone and the cycle consistency loss Lcyc alone, and empirically show that both objectives play critical roles in arriving at high-quality results.
섹션 5.1.4에서 우리는 적대적 손실 LGAN 단독 및주기 일관성 손실 Lcyc 단독을 포함하여 전체 목표의 제거와 우리의 방법을 비교하고, 두 목표 모두 고품질 결과에 도달하는 데 중요한 역할을한다는 것을 경험적으로 보여줍니다.
We also evaluate our method with only cycle loss in one direction and show that a single cycle is not sufficient to regularize the training for this under-constrained problem.
또한 한 방향으로 만주기 손실을 사용하여 방법을 평가하고 단일 주기로는이 부족한 제약 문제에 대한 훈련을 정규화하기에 충분하지 않음을 보여줍니다.

4. Implementation 구현
Network Architecture We adopt the architecture for our generative networks from Johnson et al. [23] who have shown impressive results for neural style transfer and super-resolution.
네트워크 아키텍처 우리는 신경 스타일 전달 및 초 해상도에 대해 인상적인 결과를 보여준 Johnson 등 [23]의 생성 네트워크 아키텍처를 채택했습니다.
This network contains three convolutions, several residual blocks [18], two fractionally-strided convolutions with stride 1/2, and one convolution that maps features to RGB.
이 네트워크에는 3 개의 컨볼 루션, 여러 개의 잔차 블록 [18], 스트라이드 1 2가있는 2 개의 부분 줄무늬 컨볼 루션 및 특징을 RGB에 매핑하는 컨볼 루션 1 개가 포함되어 있습니다.
We use 6 blocks for 128 × 128 images and 9 blocks for 256×256 and higher-resolution training images.
128x128 이미지에 6 개 블록을 사용하고 256x256 및 고해상도 훈련 이미지에 9 개 블록을 사용합니다.
Similar to Johnson et al. [23], we use instance normalization [53]. For the discriminator networks we use 70 × 70 PatchGANs [22, 30, 29], which aim to classify whether 70 × 70 overlapping image patches are real or fake.
Johnson et al. [23], 인스턴스 정규화 [53]를 사용합니다. 판별 자 네트워크의 경우 70 × 70의 겹치는 이미지 패치가 진짜인지 가짜인지 분류하는 것을 목표로하는 70 × 70 PatchGAN [22, 30, 29]을 사용합니다.
Such a patch-level discriminator architecture has fewer parameters than a full-image discriminator and can work on arbitrarily-sized images in a fully convolutional fashion [22].
이러한 패치 레벨 판별 기 아키텍처는 전체 이미지 판별 기보다 적은 매개 변수를 가지며 완전 컨볼 루션 방식으로 임의 크기의 이미지에 대해 작업 할 수 있습니다 [22].
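To see where a 70 × 70 patch size can come from, the sketch below computes the receptive field of a stack of convolutions. The layer configuration (five 4 × 4 convolutions with strides 2, 2, 2, 1, 1) is an assumption based on a commonly used PatchGAN layout; this section does not list the discriminator layers.

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of one output unit.

    layers: list of (kernel_size, stride) tuples in input-to-output order.
    Walks backward from a single output unit, growing the field layer by layer.
    """
    size = 1
    for kernel, stride in reversed(layers):
        size = (size - 1) * stride + kernel
    return size

# Assumed PatchGAN layout: 4x4 kernels, strides 2, 2, 2, 1, 1.
patchgan_layers = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]
patch_size = receptive_field(patchgan_layers)
```

Because every layer is convolutional, each output unit judges one such patch, and larger inputs simply yield a larger grid of patch decisions.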
Training details We apply two techniques from recent works to stabilize our model training procedure.
훈련 내용 우리는 모델 훈련 절차를 안정화하기 위해 최근 작업에서 두 가지 기술을 적용합니다.
First, for LGAN (Equation 1), we replace the negative log likelihood objective by a least-squares loss [35].

첫째, LGAN (방정식 1)의 경우 음의 로그 우도 목표를 최소 제곱 손실로 대체합니다 [35].
This loss is more stable during training and generates higher quality results.
이 손실은 훈련 중에 더 안정적이며 더 높은 품질의 결과를 생성합니다.
In particular, for a GAN loss $\mathcal{L}_{GAN}(G, D, X, Y)$, we train G to minimize $\mathbb{E}_{x \sim p_{data}(x)}[(D(G(x)) - 1)^2]$ and train D to minimize $\mathbb{E}_{y \sim p_{data}(y)}[(D(y) - 1)^2] + \mathbb{E}_{x \sim p_{data}(x)}[D(G(x))^2]$.
특히 GAN 손실 LGAN (G, D, X, Y)의 경우 G를 훈련하여 Ex∼pdata (x) [(D (G (x)) − 1) 2]를 최소화하고 D를 훈련시켜 Ey∼pdata (y) [(D (y) − 1) 2] + Ex∼pdata (x) [D (G (x)) 2].
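The two least-squares objectives above can be written directly as functions of discriminator outputs. This is a minimal sketch on plain Python lists; the function names are assumptions for illustration.

```python
def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)

def lsgan_generator_loss(d_fake):
    """E_x[(D(G(x)) - 1)^2]: the generator wants fakes scored as 1."""
    return mean((d - 1.0) ** 2 for d in d_fake)

def lsgan_discriminator_loss(d_real, d_fake):
    """E_y[(D(y) - 1)^2] + E_x[D(G(x))^2]: real scored as 1, fake as 0."""
    return mean((d - 1.0) ** 2 for d in d_real) + mean(d ** 2 for d in d_fake)

# A perfect discriminator: real images score 1, translated images score 0.
perfect_d = lsgan_discriminator_loss([1.0, 1.0], [0.0, 0.0])
# The generator's loss against that same discriminator is maximal.
worst_g = lsgan_generator_loss([0.0, 0.0])
```

Unlike the log-likelihood form, these losses stay finite when the discriminator outputs exactly 0 or 1.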

Second, to reduce model oscillation [15], we follow Shrivastava et al.’s strategy [46] and update the discriminators using a history of generated images rather than the ones produced by the latest generators.
둘째, 모델 진동을 줄이기 위해 [15] Shrivastava 등의 전략 [46]을 따르고 최신 생성기에서 생성 된 이미지가 아닌 생성 된 이미지의 이력을 사용하여 판별기를 업데이트합니다.
We keep an image buffer that stores the 50 previously created images.
이전에 생성 된 50 개의 이미지를 저장하는 이미지 버퍼를 유지합니다.
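One way to realize such a history buffer is sketched below. The text only states that 50 previously generated images are stored; the replacement policy here (once full, return a random stored image half of the time and swap in the new one) is an assumption, as is the class name `ImageHistory`.

```python
import random

class ImageHistory:
    """Fixed-size pool of previously generated images (illustrative sketch)."""

    def __init__(self, capacity=50, rng=None):
        self.capacity = capacity
        self.images = []
        self.rng = rng or random.Random()

    def query(self, image):
        """Record a newly generated image and return one for the discriminator."""
        if len(self.images) < self.capacity:
            # Buffer not yet full: store the new image and use it directly.
            self.images.append(image)
            return image
        if self.rng.random() < 0.5:
            # Assumed policy: return an older image and replace it with the new one.
            idx = self.rng.randrange(self.capacity)
            old = self.images[idx]
            self.images[idx] = image
            return old
        # Otherwise use the new image without storing it.
        return image

# Toy usage with integers standing in for images and a small capacity.
history = ImageHistory(capacity=3, rng=random.Random(0))
shown = [history.query(i) for i in range(10)]
```

The discriminator then sees a mix of current and older generator outputs rather than only the latest ones.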
For all the experiments, we set λ = 10 in Equation 3.
모든 실험에 대해 방정식 3에서 λ = 10을 설정합니다.
We use the Adam solver [26] with a batch size of 1.
배치 크기가 1 인 Adam 솔버 [26]를 사용합니다.
All networks were trained from scratch with a learning rate of 0.0002.
모든 네트워크는 0.0002의 학습률로 처음부터 훈련되었습니다.
We keep the same learning rate for the first 100 epochs and linearly decay the rate to zero over the next 100 epochs.
우리는 처음 100 개의 Epoch에 대해 동일한 학습률을 유지하고 다음 100 Epoch 동안 속도를 0으로 선형 적으로 감소시킵니다.
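The learning rate schedule described above can be expressed as a function of the epoch index. This sketch assumes zero-based epoch numbering and that the rate reaches exactly zero at epoch 200; the function and parameter names are illustrative.

```python
def learning_rate(epoch, base_lr=0.0002, constant_epochs=100, decay_epochs=100):
    """Constant rate for the first phase, then linear decay to zero.

    epoch: zero-based epoch index (an assumption about the indexing).
    """
    if epoch < constant_epochs:
        return base_lr
    remaining = constant_epochs + decay_epochs - epoch
    return base_lr * max(0.0, remaining / decay_epochs)

# Sample the schedule at a few points.
schedule = {e: learning_rate(e) for e in (0, 99, 150, 200)}
```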
Please see the appendix (Section 7) for more details about the datasets, architectures, and training procedures.
데이터 세트, 아키텍처 및 교육 절차에 대한 자세한 내용은 부록 (섹션 7)을 참조하십시오.

5. Results 결과
We first compare our approach against recent methods for unpaired image-to-image translation on paired datasets where ground truth input-output pairs are available for evaluation.
먼저 평가에 사용할 수있는 Ground Truth 입력-출력 쌍이있는 쌍을 이루는 데이터 세트에서 쌍을 이루지 않은 이미지-이미지 변환을위한 최근 방법과 우리의 접근 방식을 비교합니다.
We then study the importance of both the adversarial loss and the cycle consistency loss and compare our full method against several variants.
그런 다음 적대적 손실과주기 일관성 손실의 중요성을 연구하고 전체 방법을 여러 변형과 비교합니다.
Finally, we demonstrate the generality of our algorithm on a wide range of applications where paired data does not exist.
마지막으로 쌍을 이룬 데이터가 존재하지 않는 광범위한 응용 프로그램에서 알고리즘의 일반성을 보여줍니다.
For brevity, we refer to our method as CycleGAN.
간결함을 위해 우리의 방법을 CycleGAN이라고합니다.
The PyTorch and Torch code, models, and full results can be found at our website.
PyTorch 및 Torch 코드, 모델 및 전체 결과는 당사 웹 사이트에서 찾을 수 있습니다.

5.1. Evaluation 평가
Using the same evaluation datasets and metrics as “pix2pix” [22], we compare our method against several baselines both qualitatively and quantitatively.
"pix2pix"[22]와 동일한 평가 데이터 세트 및 메트릭을 사용하여 우리의 방법을 정 성적 및 정량적으로 여러 기준과 비교합니다.
The tasks include semantic labels↔photo on the Cityscapes dataset [4], and map↔aerial photo on data scraped from Google Maps.
작업에는 Cityscapes 데이터 세트 [4]의 의미 레이블 ↔ 사진, Google지도에서 스크랩 한 데이터의지도 ↔ 공중 사진이 포함됩니다.
We also perform an ablation study on the full loss function.
또한 전체 손실 함수에 대한 절제 연구를 수행합니다.

Figure 5: Different methods for mapping labels↔photos trained on Cityscapes images. From left to right: input, BiGAN/ALI [7, 9], CoGAN [32], feature loss + GAN, SimGAN [46], CycleGAN (ours), pix2pix [22] trained on paired data, and ground truth.

Figure 6: Different methods for mapping aerial photos↔maps on Google Maps. From left to right: input, BiGAN/ALI [7, 9], CoGAN [32], feature loss + GAN, SimGAN [46], CycleGAN (ours), pix2pix [22] trained on paired data, and ground truth.

5.1.1 Evaluation Metrics 평가 지표
*AMT perceptual studies * AMT 지각 연구

On the map↔aerial photo task, we run “real vs fake” perceptual studies on Amazon Mechanical Turk (AMT) to assess the realism of our outputs.
map↔aerial 사진 작업에서 Amazon Mechanical Turk (AMT)에 대한 "실제 대 가짜"지각 연구를 실행하여 출력의 사실성을 평가합니다.
We follow the same perceptual study protocol from Isola et al. [22], except we only gather data from 25 participants per algorithm we tested.
우리는 Isola et al.의 동일한 지각 연구 프로토콜을 따릅니다. [22] 단, 테스트 한 알고리즘 당 25 명의 참가자로부터 만 데이터를 수집합니다.
Participants were shown a sequence of pairs of images, one a real photo or map and one fake (generated by our algorithm or a baseline), and asked to click on the image they thought was real.
참가자들은 일련의 이미지 쌍, 하나는 실제 사진 또는지도이고 하나는 가짜 (우리 알고리즘 또는 기준선에 의해 생성됨)를 보여주고 실제라고 생각되는 이미지를 클릭하도록 요청했습니다.
The first 10 trials of each session were practice and feedback was given as to whether the participant’s response was correct or incorrect.
각 세션의 처음 10 개 시험은 연습이었고 참가자의 응답이 맞았는지 틀렸는 지에 대한 피드백이 제공되었습니다
The remaining 40 trials were used to assess the rate at which each algorithm fooled participants.
나머지 40 개의 시험은 각 알고리즘이 참가자를 속이는 비율을 평가하는 데 사용되었습니다.
Each session only tested a single algorithm, and participants were only allowed to complete a single session.
각 세션은 단일 알고리즘 만 테스트했으며 참가자는 단일 세션 만 완료 할 수있었습니다.
The numbers we report here are not directly comparable to those in [22] as our ground truth images were processed slightly differently and the participant pool we tested may be differently distributed from those tested in [22] (due to running the experiment at a different date and time).
우리가 여기에보고 한 수치는 우리의 지상 진실 이미지가 약간 다르게 처리 되었기 때문에 [22]의 수치와 직접 비교할 수 없으며, 우리가 테스트 한 참가자 풀은 [22]에서 테스트 한 것과 다르게 분포 될 수 있습니다 (실험을 a 다른 날짜와 시간).
Therefore, our numbers should only be used to compare our current method against the baselines (which were run under identical conditions), rather than against [22].
따라서 우리의 수치는 현재 방법을 기준선 (동일한 조건에서 실행)과 비교하는 데에만 사용해야합니다 [22].
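The per-session scoring described above (10 practice trials followed by 40 scored trials) can be sketched as follows. The data layout, a list of booleans per session, and the function name are assumptions for illustration; no real study data is shown.

```python
def fooling_rate(responses, practice_trials=10):
    """Fraction of scored trials in which the participant picked the fake image.

    responses: one boolean per trial in session order; True means the
    participant clicked the generated image believing it was real.
    """
    scored = responses[practice_trials:]
    return sum(scored) / len(scored)

# Hypothetical session: 10 practice trials, then 10 fooled and 30 correct answers.
session = [False] * 10 + [True] * 10 + [False] * 30
rate = fooling_rate(session)
```

Averaging this rate over all participants assigned to one algorithm gives a single figure per algorithm.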

*FCN score

Although perceptual studies may be the gold standard for assessing graphical realism, we also seek an automatic quantitative measure that does not require human experiments.
지각적 연구가 그래픽 리얼리즘을 평가하기위한 표준이 될 수 있지만, 우리는 인간 실험이 필요하지 않은 자동 정량 측정도 추구합니다.
For this, we adopt the “FCN score” from [22], and use it to evaluate the Cityscapes labels→photo task.
이를 위해 [22]의“FCN 점수”를 채택하여 도시 경관 라벨 → 사진 작업을 평가하는 데 사용합니다.
The FCN metric evaluates how interpretable the generated photos are according to an off-the-shelf semantic segmentation algorithm (the fully-convolutional network, FCN, from [33]).
FCN 메트릭은 기성 시맨틱 분할 알고리즘 ([33]의 완전 합성 곱 네트워크, FCN)에 따라 생성 된 사진이 얼마나 해석 가능한지 평가합니다.
The FCN predicts a label map for a generated photo.
FCN은 생성 된 사진의 레이블 맵을 예측합니다.
This label map can then be compared against the input ground truth labels using standard semantic segmentation metrics described below.
그런 다음 이 레이블 맵을 아래에 설명 된 표준 의미 론적 세분화 메트릭을 사용하여 입력 Ground Truth 레이블과 비교할 수 있습니다.
The intuition is that if we generate a photo from a label map of “car on the road”, then we have succeeded if the FCN applied to the generated photo detects “car on the road”.
직감적으로 "도로 위의 자동차"라는 레이블 맵에서 사진을 생성하면 생성 된 사진에 적용된 FCN이 "도로 위의 자동차"를 감지하면 성공한 것입니다.

*Semantic segmentation metrics 의미론적 세분화 메트릭

To evaluate the performance of photo→labels, we use the standard metrics from the Cityscapes benchmark [4], including per-pixel accuracy, per-class accuracy, and mean class Intersection-Over-Union (Class IOU) [4].
사진 → 라벨의 성능을 평가하기 위해 픽셀 당 정확도, 클래스 당 정확도 및 평균 클래스 Intersection-Over-Union (Class IOU) [4]을 포함하여 Cityscapes 벤치 마크 [4]의 표준 메트릭을 사용합니다.
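The three metrics named above can all be derived from a class confusion matrix. The following is a minimal NumPy sketch under simplifying assumptions: the helper names are hypothetical, labels are integers in `[0, n_classes)`, and the "ignore" labels handled by the official Cityscapes evaluation scripts are not modeled.

```python
import numpy as np


def confusion_matrix(pred, target, n_classes):
    """Rows index the ground-truth class, columns the predicted class."""
    idx = target.ravel() * n_classes + pred.ravel()
    return np.bincount(idx, minlength=n_classes ** 2).reshape(n_classes, n_classes)


def segmentation_scores(conf):
    """Return (per-pixel accuracy, per-class accuracy, mean class IoU)."""
    tp = np.diag(conf).astype(float)   # correctly labeled pixels per class
    gt_count = conf.sum(axis=1)        # ground-truth pixels per class
    pred_count = conf.sum(axis=0)      # predicted pixels per class
    pixel_acc = tp.sum() / conf.sum()
    with np.errstate(divide="ignore", invalid="ignore"):
        class_acc = tp / gt_count
        iou = tp / (gt_count + pred_count - tp)
    # Classes that never occur yield NaN and are skipped by nanmean.
    return pixel_acc, np.nanmean(class_acc), np.nanmean(iou)


target = np.array([[0, 0], [1, 1]])
pred = np.array([[0, 1], [1, 1]])
pixel_acc, class_acc, class_iou = segmentation_scores(
    confusion_matrix(pred, target, n_classes=2)
)
```

In this toy case three of four pixels are correct, so per-pixel accuracy is 0.75; the per-class IoUs are 1/2 and 2/3.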

5.1.2 Baselines 기준
CoGAN [32] This method learns one GAN generator for domain X and one for domain Y, with tied weights on the first few layers for shared latent representations.
CoGAN [32] 이 방법은 공유 된 잠재 표현을 위해 처음 몇 개의 레이어에 가중치를 묶어 도메인 X에 대한 GAN 생성기 하나와 도메인 Y에 대한 GAN 생성기 하나를 학습합니다.
Translation from X to Y can be achieved by finding a latent representation that generates image X and then rendering this latent representation into style Y.
이미지 X를 생성하는 잠재 표현을 찾은 다음 이 잠재 표현을 스타일 Y로 렌더링하여 X에서 Y로 변환 할 수 있습니다.
SimGAN [46] Like our method, Shrivastava et al. [46] use an adversarial loss to train a translation from X to Y.
SimGAN [46] 우리의 방법과 마찬가지로 Shrivastava et al. [46]은 적대적 손실을 사용하여 X에서 Y로의 변환을 훈련시킵니다.
The regularization term $\lVert x - G(x) \rVert_1$ is used to penalize making large changes at the pixel level.
정규화 항 $\lVert x - G(x) \rVert_1$는 픽셀 수준에서 큰 변화를 일으키는 데 페널티를 주는 데 사용됩니다.
Feature loss + GAN We also test a variant of SimGAN [46] where the L1 loss is computed over deep image features using a pretrained network (VGG-16 relu4_2 [47]), rather than over RGB pixel values.
특징 손실 + GAN 우리는 또한 L1 손실이 RGB 픽셀 값이 아닌 사전 훈련 된 네트워크 (VGG-16 relu4_2 [47])를 사용하여 심층 이미지 특징에 대해 계산되는 SimGAN [46]의 변형을 테스트합니다.
Computing distances in deep feature space, like this, is also sometimes referred to as using a “perceptual loss” [8, 23].
이와 같이 딥 피처 공간에서 거리를 계산하는 것은 "지각 손실"[8, 23]을 사용하는 것으로도 불립니다.
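The difference between the pixel-space L1 penalty and its feature-space variant can be illustrated with a small sketch. Everything here is an assumption for illustration: `self_regularization` is a hypothetical helper, and `toy_features` (horizontal finite differences) merely stands in for a pretrained network such as the VGG-16 layer mentioned above.

```python
import numpy as np


def l1_distance(a, b):
    """Sum of absolute differences, i.e. the L1 norm of (a - b)."""
    return float(np.abs(a - b).sum())


def self_regularization(x, g_x, features=None):
    """L1 penalty between an input x and its translation G(x).

    With features=None the raw pixels are compared; otherwise both
    images are first mapped through the given feature extractor.
    """
    if features is not None:
        x, g_x = features(x), features(g_x)
    return l1_distance(x, g_x)


def toy_features(img):
    # Stand-in for a pretrained network (assumption): horizontal gradients.
    return np.diff(img, axis=1)


x = np.array([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]])
g_x = x + 1.0  # a "translation" that only brightens the image

pixel_penalty = self_regularization(x, g_x)
feature_penalty = self_regularization(x, g_x, features=toy_features)
```

A uniform brightness shift is penalized in pixel space but not by this toy feature extractor, which hints at why the two baselines can behave differently.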
BiGAN/ALI [9, 7] Unconditional GANs [16] learn a generator G : Z → X, that maps a random noise z to an image x.
BiGAN / ALI [9, 7] 무조건 GAN [16]은 임의의 노이즈 z를 이미지 x에 매핑하는 생성기 G : Z → X를 학습합니다.
BiGAN [9] and ALI [7] propose to also learn the inverse mapping function F : X → Z.
BiGAN [9]과 ALI [7]는 역 매핑 함수 F : X → Z도 배울 것을 제안합니다.
Though they were originally designed for mapping a latent vector z to an image x, we implemented the same objective for mapping a source image x to a target image y.
원래는 잠복 벡터 z를 이미지 x에 매핑하기 위해 설계되었지만 소스 이미지 x를 대상 이미지 y에 매핑하는 데 동일한 목표를 구현했습니다.
pix2pix [22] We also compare against pix2pix [22], which is trained on paired data, to see how close we can get to this “upper bound” without using any paired data.
pix2pix [22] 또한 쌍을 이루는 데이터에 대해 훈련 된 pix2pix [22]와 비교하여 쌍을 이룬 데이터를 사용하지 않고도이“상한값”에 얼마나 근접 할 수 있는지 확인합니다.
For a fair comparison, we implement all the baselines using the same architecture and details as our method, except for CoGAN [32].
공정한 비교를 위해 CoGAN [32]을 제외하고는 방법과 동일한 아키텍처 및 세부 사항을 사용하여 모든 기준을 구현합니다.
CoGAN builds on generators that produce images from a shared latent representation, which is incompatible with our image-to-image network.
CoGAN은 공유 잠재 표현에서 이미지를 생성하는 생성기를 기반으로하며, 이는 이미지 대 이미지 네트워크와 호환되지 않습니다.
We use the public implementation of CoGAN instead.
대신 CoGAN의 공개 구현을 사용합니다.

5.1.3 Comparison against baselines 기준선과 비교

As can be seen in Figure 5 and Figure 6, we were unable to achieve compelling results with any of the baselines.
그림 5와 그림 6에서 볼 수 있듯이 어떤 기준선에서도 놀라운 결과를 얻을 수 없었습니다.
Our method, on the other hand, can produce translations that are often of similar quality to the fully supervised pix2pix.
반면에 우리의 방법은 완전히 감독되는 pix2pix와 유사한 품질의 번역을 생성 할 수 있습니다.
Table 1 reports performance regarding the AMT perceptual realism task.
표 1은 AMT 지각 현실성 작업과 관련된 성능을보고합니다.
Here, we see that our method can fool participants on around a quarter of trials, in both the maps→aerial photos direction and the aerial photos→maps direction at 256 × 256 resolution³.
여기서 우리는 우리의 방법이 256 × 256 해상도에서지도 → 항공 사진 방향과 항공 사진 →지도 방향 모두에서 약 1/4의 시험에서 참가자를 속일 수 있음을 알 수 있습니다.
All the baselines almost never fooled participants.
모든 기준은 거의 참가자를 속이지 않았습니다.
Table 2 assesses the performance of the labels→photo task on the Cityscapes and Table 3 evaluates the opposite mapping (photos→labels).
표 2는 도시 경관에서 라벨 → 사진 작업의 성능을 평가하고 표 3은 반대 매핑 (사진 → 라벨)을 평가합니다.
In both cases, our method again outperforms the baselines.
두 경우 모두 우리의 방법은 다시 기준선을 능가합니다.

5.1.4 Analysis of the loss function 손실 함수 분석

In Table 4 and Table 5, we compare against ablations of our full loss.
표 4와 표 5에서 우리는 완전한 손실의 절제와 비교합니다.
Removing the GAN loss substantially degrades results, as does removing the cycle-consistency loss.
GAN 손실을 제거하면주기 일관성 손실이 제거되는 것처럼 결과가 크게 저하됩니다.
We therefore conclude that both terms are critical to our results.
따라서 두 용어가 모두 결과에 중요하다는 결론을 내립니다.
We also evaluate our method with the cycle loss in only one direction:
또한 한 방향의 사이클 손실로 방법을 평가합니다.
GAN + forward cycle loss $\mathbb{E}_{x \sim p_{\text{data}}(x)}[\lVert F(G(x)) - x \rVert_1]$, or GAN + backward cycle loss $\mathbb{E}_{y \sim p_{\text{data}}(y)}[\lVert G(F(y)) - y \rVert_1]$ (Equation 2), and find that it often incurs training instability and causes mode collapse, especially for the direction of the mapping that was removed.
GAN + 순방향 사이클 손실 $\mathbb{E}_{x \sim p_{\text{data}}(x)}[\lVert F(G(x)) - x \rVert_1]$ 또는 GAN + 역방향 사이클 손실 $\mathbb{E}_{y \sim p_{\text{data}}(y)}[\lVert G(F(y)) - y \rVert_1]$ (수식 2)로 평가한 결과, 특히 제거 된 매핑 방향에 대해 훈련이 불안정 해지고 모드 붕괴가 발생하는 경우가 많습니다.
Figure 7 shows several qualitative examples.
그림 7은 몇 가지 질적 예를 보여줍니다.
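The forward and backward cycle terms used in this ablation can be estimated separately, as in the sketch below. It is a toy illustration rather than the authors' training code: `G`, `F_exact`, and `F_biased` are made-up scalar maps standing in for the learned generators, and expectations are replaced by sample means.

```python
import numpy as np


def cycle_losses(G, F, xs, ys):
    """Sample estimates of the two cycle-consistency terms.

    forward:  mean over x of ||F(G(x)) - x||_1
    backward: mean over y of ||G(F(y)) - y||_1
    """
    forward = np.mean([np.abs(F(G(x)) - x).sum() for x in xs])
    backward = np.mean([np.abs(G(F(y)) - y).sum() for y in ys])
    return float(forward), float(backward)


def G(x):
    # Toy "generator" X -> Y (assumption for illustration).
    return 2.0 * x


def F_exact(y):
    # Exact inverse of G, so both cycle terms vanish.
    return 0.5 * y


def F_biased(y):
    # Imperfect inverse of G, so both cycle terms are positive.
    return 0.5 * y + 0.25


xs = [np.zeros((2, 2)), np.ones((2, 2))]
ys = [np.ones((2, 2))]

exact = cycle_losses(G, F_exact, xs, ys)
biased = cycle_losses(G, F_biased, xs, ys)
```

The full objective uses the sum of both terms, whereas each one-direction ablation keeps only one of them alongside the GAN loss.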

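Figure 7: Different variants of our method for mapping labels↔photos, trained on Cityscapes. From left to right: input, cycle-consistency loss alone, adversarial loss alone, GAN + forward cycle-consistency loss (F(G(x)) ≈ x), GAN + backward cycle-consistency loss (G(F(y)) ≈ y), CycleGAN (our full method), and ground truth. Both Cycle alone and GAN + backward fail to produce images similar to the target domain. GAN alone and GAN + forward suffer from mode collapse, producing identical label maps regardless of the input photo.

Figure 8: Example results of CycleGAN on paired datasets used in “pix2pix” [22], such as architectural labels↔photos and edges↔shoes.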

5.1.5 Image reconstruction quality 이미지 재구성 품질

In Figure 4, we show a few random samples of the reconstructed images F(G(x)).
그림 4에서는 재구성 된 이미지 F (G (x))의 몇 가지 무작위 샘플을 보여줍니다
We observed that the reconstructed images were often close to the original inputs x, at both training and testing time, even in cases where one domain represents significantly more diverse information, such as map↔aerial photos.
우리는 하나의 도메인이지도 ↔ 항공 사진과 같이 훨씬 더 다양한 정보를 나타내는 경우에도 재구성 된 이미지가 학습 및 테스트 시간 모두에서 원래 입력 x에 가깝다는 것을 관찰했습니다.

5.1.6 Additional results on paired datasets 쌍을 이룬 데이터 세트에 대한 추가 결과

Figure 8 shows some example results on other paired datasets used in “pix2pix” [22], such as architectural labels↔photos from the CMP Facade Database [40], and edges↔shoes from the UT Zappos50K dataset [60].
그림 8은 CMP Facade Database [40]의 건축 라벨 ↔ 사진, UT Zappos50K 데이터 세트 [60]의 edge↔shoes와 같이“pix2pix”[22]에 사용 된 다른 쌍을 이루는 데이터 세트에 대한 몇 가지 예시 결과를 보여줍니다.
The image quality of our results is close to those produced by the fully supervised pix2pix while our method learns the mapping without paired supervision.
결과의 이미지 품질은 완전히 감독 된 pix2pix에 의해 생성 된 것과 비슷하지만, 우리의 방법은 페어링 된 감독없이 매핑을 학습합니다.

5.2. Applications 응용
We demonstrate our method on several applications where paired training data does not exist.
쌍을 이룬 학습 데이터가 존재하지 않는 여러 애플리케이션에서 방법을 시연합니다.
Please refer to the appendix (Section 7) for more details about the datasets.
데이터 세트에 대한 자세한 내용은 부록 (Section 7)을 참조하십시오.
We observe that translations on training data are often more appealing than those on test data, and full results of all applications on both training and test data can be viewed on our project website.
우리는 훈련 데이터에 대한 번역이 종종 테스트 데이터에 대한 번역보다 더 매력적이며 훈련 및 테스트 데이터에 대한 모든 응용 프로그램의 전체 결과를 프로젝트 웹 사이트에서 볼 수 있습니다.

*Collection style transfer (Figure 10 and Figure 11)
*컬렉션 스타일 전송 (그림 10 및 그림 11)

We train the model on landscape photographs downloaded from Flickr and WikiArt.
Flickr 및 WikiArt에서 다운로드 한 풍경 사진으로 모델을 훈련합니다.
Unlike recent work on “neural style transfer” [13], our method learns to mimic the style of an entire collection of artworks, rather than transferring the style of a single selected piece of art.
"신경 스타일 전달"[13]에 대한 최근 작업과 달리, 우리의 방법은 선택한 단일 예술 작품의 스타일을 전달하는 대신 전체 작품 컬렉션의 스타일을 모방하는 방법을 배웁니다.
Therefore, we can learn to generate photos in the style of, e.g., Van Gogh, rather than just in the style of Starry Night.
따라서 우리는 Starry Night 스타일이 아닌 Van Gogh 스타일로 사진을 생성하는 방법을 배울 수 있습니다.
The size of the dataset for each artist/style was 526, 1073, 400, and 563 for Cezanne, Monet, Van Gogh, and Ukiyo-e.
각 아티스트 / 스타일의 데이터 세트 크기는 Cezanne, Monet, Van Gogh 및 Ukiyo-e의 경우 526, 1073, 400 및 563이었습니다.

*Object transfiguration (Figure 13) 객체 변형 (그림 13)

The model is trained to translate one object class from ImageNet [5] to another (each class contains around 1000 training images).
모델은 하나의 객체 클래스를 ImageNet [5]에서 다른 객체로 변환하도록 훈련되었습니다 (각 클래스에는 약 1000 개의 훈련 이미지가 포함됨).
Turmukhambetov et al. [50] propose a subspace model to translate one object into another object of the same category, while our method focuses on object transfiguration between two visually similar categories.
Turmukhambetov et al. [50]은 하나의 객체를 동일한 범주의 다른 객체로 변환하는 부분 공간 모델을 제안하는 반면, 우리의 방법은 시각적으로 유사한 두 범주 간의 객체 변형에 중점을 둡니다.

*Season transfer (Figure 13) 시즌 전환 (그림 13)

The model is trained on 854 winter photos and 1273 summer photos of Yosemite downloaded from Flickr.
이 모델은 Flickr에서 다운로드 한 854 개의 겨울 사진과 1273 개의 요세미티 여름 사진으로 훈련되었습니다.

*Photo generation from paintings (Figure 12) 그림에서 사진 생성 (그림 12)

For painting→photo, we find that it is helpful to introduce an additional loss to encourage the mapping to preserve color composition between the input and output.
페인팅 → 사진의 경우 입력과 출력 사이의 색상 구성을 유지하기 위해 매핑을 장려하기 위해 추가 손실을 도입하는 것이 유용하다는 것을 알았습니다.
In particular, we adopt the technique of Taigman et al. [49] and regularize the generator to be near an identity mapping when real samples of the target domain are provided as the input to the generator: i.e., $\mathcal{L}_{\text{identity}}(G, F) = \mathbb{E}_{y \sim p_{\text{data}}(y)}[\lVert G(y) - y \rVert_1] + \mathbb{E}_{x \sim p_{\text{data}}(x)}[\lVert F(x) - x \rVert_1]$.
특히, 우리는 Taigman et al. [49]의 기법을 채택하여, 타겟 도메인의 실제 샘플이 생성기에 입력으로 제공 될 때 생성기를 항등 매핑에 가깝게 정규화합니다. 즉, $\mathcal{L}_{\text{identity}}(G, F) = \mathbb{E}_{y \sim p_{\text{data}}(y)}[\lVert G(y) - y \rVert_1] + \mathbb{E}_{x \sim p_{\text{data}}(x)}[\lVert F(x) - x \rVert_1]$.
Without $\mathcal{L}_{\text{identity}}$, the generators G and F are free to change the tint of input images when there is no need to.
$\mathcal{L}_{\text{identity}}$가 없으면 생성기 G와 F는 필요가 없을 때에도 입력 이미지의 색조를 자유롭게 변경할 수 있습니다.
For example, when learning the mapping between Monet’s paintings and Flickr photographs, the generator often maps paintings of daytime to photographs taken during sunset, because such a mapping may be equally valid under the adversarial loss and cycle consistency loss.
예를 들어, Monet의 그림과 Flickr 사진 간의 매핑을 학습 할 때 생성기는 종종 낮의 그림을 일몰 동안 찍은 사진에 매핑합니다. 이러한 매핑은 적대적 손실 및주기 일관성 손실에서도 똑같이 유효 할 수 있기 때문입니다.
The effect of this identity mapping loss is shown in Figure 9.
이 ID 매핑 손실의 영향은 그림 9에 나와 있습니다.
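The identity mapping loss described above can be sketched in a few lines. This is an illustrative estimate under stated assumptions: `identity_loss` is a hypothetical helper, the "generators" are toy functions, and expectations are replaced by sample means.

```python
import numpy as np


def identity_loss(G, F, xs, ys):
    """Sample estimate of the identity mapping loss.

    G maps X -> Y, so it is fed real samples y and should leave them alone;
    F maps Y -> X, so it is fed real samples x.
    """
    g_term = np.mean([np.abs(G(y) - y).sum() for y in ys])
    f_term = np.mean([np.abs(F(x) - x).sum() for x in xs])
    return float(g_term + f_term)


def tinting_generator(img):
    # Toy generator that shifts every color channel (assumption).
    return img + 0.5


def identity_generator(img):
    return img


xs = [np.zeros((2, 2, 3))]  # one tiny RGB "photo" per domain
ys = [np.zeros((2, 2, 3))]

loss_tinted = identity_loss(tinting_generator, identity_generator, xs, ys)
loss_clean = identity_loss(identity_generator, identity_generator, xs, ys)
```

A generator that changes the tint of an image already in its target domain incurs a positive penalty, while one that leaves it untouched incurs none.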
In Figure 12, we show additional results translating Monet’s paintings to photographs.
그림 12에서는 Monet의 그림을 사진으로 번역 한 추가 결과를 보여줍니다.
This figure and Figure 9 show results on paintings that were included in the training set, whereas for all other experiments in the paper, we only evaluate and show test set results. Because the training set does not include paired data, coming up with a plausible translation for a training set painting is a nontrivial task.
이 그림과 그림 9는 학습 세트에 포함 된 그림에 대한 결과를 보여주는 반면, 논문의 다른 모든 실험에 대해서는 테스트 세트 결과 만 평가하고 표시합니다. 훈련 세트에는 쌍을 이루는 데이터가 포함되어 있지 않기 때문에 훈련 세트 그림에 대한 그럴듯한 번역을 만드는 것은 사소하지 않은 작업입니다.
Indeed, since Monet is no longer able to create new paintings, generalization to unseen, “test set”, paintings is not a pressing problem.
실제로 Monet은 더 이상 새로운 그림을 만들 수 없기 때문에, 보지 못한 "테스트 세트" 그림으로의 일반화는 시급한 문제가 아닙니다.

*Photo enhancement (Figure 14) 사진 향상 (그림 14)

We show that our method can be used to generate photos with shallower depth of field. We train the model on flower photos downloaded from Flickr.
우리는 우리의 방법을 사용하여 더 얕은 피사계 심도로 사진을 생성 할 수 있음을 보여줍니다. Flickr에서 다운로드 한 꽃 사진으로 모델을 훈련시킵니다.
The source domain consists of flower photos taken by smartphones, which usually have deep DoF due to a small aperture.
소스 도메인은 스마트 폰으로 찍은 꽃 사진으로 구성되며 일반적으로 작은 조리개로 인해 DoF가 깊습니다.
The target contains photos captured by DSLRs with a larger aperture.
대상에는 조리개가 더 큰 DSLR로 캡처 한 사진이 포함됩니다.
Our model successfully generates photos with shallower depth of field from the photos taken by smartphones.
우리 모델은 스마트 폰으로 찍은 사진에서 피사계 심도가 얕은 사진을 성공적으로 생성합니다.

*Comparison with Gatys et al. [13] Gatys et al. [13]과의 비교

In Figure 15, we compare our results with neural style transfer [13] on photo stylization.
그림 15에서는 사진 스타일 화에 대한 신경 스타일 전송 [13]과 결과를 비교합니다.
For each row, we first use two representative artworks as the style images for [13].
각 행에 대해 먼저 두 개의 대표 작품을 [13]의 스타일 이미지로 사용합니다.
Our method, on the other hand, can produce photos in the style of an entire collection.
반면에 우리의 방법은 전체 컬렉션 스타일의 사진을 생성 할 수 있습니다.
To compare against neural style transfer of an entire collection, we compute the average Gram matrix across the target domain and use this matrix to transfer the “average style” with Gatys et al. [13].
전체 컬렉션의 신경 스타일 전달과 비교하기 위해 대상 도메인 전체에 걸쳐 평균 Gram Matrix를 계산하고이 행렬을 사용하여 Gatys 등 [13]과 함께 "평균 스타일"을 전달합니다.
Figure 16 demonstrates similar comparisons for other translation tasks.
그림 16은 다른 번역 작업에 대한 유사한 비교를 보여줍니다.
We observe that Gatys et al. [13] requires finding target style images that closely match the desired output, but still often fails to produce photorealistic results, while our method succeeds in generating natural-looking results, similar to the target domain.
Gatys et al. [13]은 원하는 출력과 거의 일치하는 대상 스타일 이미지를 찾아야 하지만 여전히 종종 사실적인 결과를 생성하지 못하는 반면, 우리의 방법은 대상 도메인과 유사한 자연스러운 결과를 생성하는 데 성공한다는 것을 관찰했습니다.
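The "average style" baseline above relies on averaging Gram matrices over a collection. A minimal sketch of that averaging step is given below; the function names are hypothetical, the normalization by the number of spatial positions is one common convention (an assumption here), and the feature maps would in practice come from a pretrained network rather than the toy arrays used in the example.

```python
import numpy as np


def gram_matrix(feat):
    """feat: (channels, height, width) feature map -> (channels, channels) Gram matrix."""
    c, h, w = feat.shape
    flat = feat.reshape(c, h * w)
    # Normalizing by the number of spatial positions is an assumed convention.
    return flat @ flat.T / (h * w)


def average_gram(feature_maps):
    """Mean Gram matrix over a collection of feature maps with equal channel count."""
    return np.mean([gram_matrix(f) for f in feature_maps], axis=0)


# Two toy 2-channel feature maps standing in for two artworks.
f1 = np.ones((2, 1, 2))
f2 = np.array([[[1.0, 0.0]], [[0.0, 1.0]]])
avg = average_gram([f1, f2])
```

Because a Gram matrix has shape (channels, channels) regardless of spatial size, artworks of different resolutions can be averaged this way.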

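Figure 9: The effect of the identity mapping loss on Monet's painting→photos. From left to right: input paintings, CycleGAN without identity mapping loss, CycleGAN with identity mapping loss. The identity mapping loss helps preserve the color of the input paintings.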

6. Limitations and Discussion 제한 및 논의

Although our method can achieve compelling results in many cases, the results are far from uniformly positive.
우리의 방법은 많은 경우에 설득력있는 결과를 얻을 수 있지만 결과는 균일하게 긍정적 인 것과는 거리가 멀다.
Figure 17 shows several typical failure cases.
그림 17은 몇 가지 일반적인 실패 사례를 보여줍니다.
On translation tasks that involve color and texture changes, like many of those reported above, the method often succeeds.
색상 및 질감 변경이 포함 된 번역 작업에서 위에서보고 한 많은 작업에서이 방법은 종종 성공합니다.
We have also explored tasks that require geometric changes, with little success.
우리는 또한 거의 성공하지 못한 채 기하학적 변화가 필요한 작업을 탐색했습니다.
For example, on the task of dog→cat transfiguration, the learned translation degenerates into making minimal changes to the input (Figure 17).
예를 들어, 개 → 고양이 변형 작업에서 학습 된 번역은 입력에 최소한의 변경을가하도록 퇴화됩니다 (그림 17).
This failure might be caused by our generator architectures which are tailored for good performance on the appearance changes.
이 오류는 모양 변경에 대한 우수한 성능을 위해 조정 된 생성기 아키텍처로 인해 발생할 수 있습니다.
Handling more varied and extreme transformations, especially geometric changes, is an important problem for future work.
더 다양하고 극단적 인 변형, 특히 기하학적 변화를 처리하는 것은 향후 작업에 중요한 문제입니다.
Some failure cases are caused by the distribution characteristics of the training datasets.
일부 실패 사례는 훈련 데이터 세트의 분포 특성으로 인해 발생합니다.
For example, our method gets confused in the horse→zebra example (Figure 17, right), because our model was trained on the wild horse and zebra synsets of ImageNet, which do not contain images of a person riding a horse or zebra.
예를 들어, 우리의 모델은 말이나 얼룩말을 타는 사람의 이미지를 포함하지 않는 ImageNet의 야생마와 얼룩말 합성 세트에서 훈련 되었기 때문에 말 → 얼룩말 예제 (그림 17, 오른쪽)에서 우리의 방법이 혼란 스러웠습니다.
We also observe a lingering gap between the results achievable with paired training data and those achieved by our unpaired method.
또한 짝을 이룬 훈련 데이터로 얻을 수있는 결과와 짝을 이루지 않은 방법으로 얻은 결과 사이에 남아있는 차이를 관찰합니다.
In some cases, this gap may be very hard – or even impossible – to close: for example, our method sometimes permutes the labels for tree and building in the output of the photos→labels task.
경우에 따라이 간격은 닫기가 매우 어렵거나 불가능할 수 있습니다. 예를 들어, 우리의 방법은 때때로 사진 → 라벨 작업의 출력에서 나무 및 건물의 라벨을 변경합니다.
Resolving this ambiguity may require some form of weak semantic supervision. Integrating weak or semi-supervised data may lead to substantially more powerful translators, still at a fraction of the annotation cost of the fully-supervised systems.
이 모호성을 해결하려면 어떤 형태의 약한 의미 감독이 필요할 수 있습니다. 약하거나 반 감독 된 데이터를 통합하면 완전히 감독되는 시스템의 주석 비용의 일부로 훨씬 더 강력한 번역가로 이어질 수 있습니다.
Nonetheless, in many cases completely unpaired data is plentifully available and should be made use of.
그럼에도 불구하고 대부분의 경우 완전히 짝을 이루지 않은 데이터를 많이 사용할 수 있으므로 사용해야합니다.
This paper pushes the boundaries of what is possible in this “unsupervised” setting.
이 백서는 이 “감독되지 않는”환경에서 가능한 것의 경계를 넓힙니다.

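Figure 10: Collection style transfer I: we transfer input images into the artistic styles of Monet, Van Gogh, Cezanne, and Ukiyo-e. Please see our website for additional examples.

Figure 11: Collection style transfer II: we transfer input images into the artistic styles of Monet, Van Gogh, Cezanne, and Ukiyo-e. Please see our website for additional examples.

Figure 12: Relatively successful results on mapping Monet's paintings to a photographic style. Please see our website for additional examples.

Figure 13: Our method applied to several translation problems. These images are selected as relatively successful results; please see our website for more comprehensive and random results. The top two rows show results on object transfiguration between horses and zebras, trained on 939 images from the wild horse class and 1177 images from the zebra class in ImageNet [5]. Also check out the horse→zebra demo video. The middle two rows show results on season transfer, trained on winter and summer photos of Yosemite from Flickr. In the bottom two rows, we train our method on 996 apple images and 1020 navel orange images from ImageNet.

Figure 14: Photo enhancement: mapping from a set of smartphone snaps to professional DSLR photographs, the system often learns to produce shallow focus. Here we show some of the most successful results in our test set; average performance is considerably worse. Please see our website for more comprehensive and random examples.

Figure 15: We compare our method with neural style transfer [13] on photo stylization. Left to right: input image, results from Gatys et al. [13] using two different representative artworks as style images, results from Gatys et al. [13] using the entire collection of the artist, and CycleGAN (ours).

Figure 16: We compare our method with neural style transfer [13] on various applications. From top to bottom: apple→orange, horse→zebra, and Monet→photo. Left to right: input image, results from Gatys et al. [13] using two different images as style images, results from Gatys et al. [13] using all the images from the target domain, and CycleGAN (ours).

Figure 17: Typical failure cases of our method. Left: in the task of dog→cat transfiguration, CycleGAN can only make minimal changes to the input. Right: CycleGAN also fails in this horse→zebra example, as our model has not seen images of horseback riding during training. Please see our website for more comprehensive results.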
