[논문]DiscoFaceGAN,2020

Disentangled and Controllable Face Image Generation via 3D Imitative-Contrastive Learning
3D 모방-대비 학습을 통한 엉킴 및 제어 가능한 얼굴 이미지 생성(Microsoft Research Asia)

Abstract

We propose DiscoFaceGAN, an approach for face image generation with DISentangled, precisely-COntrollable latent representations for identity of non-existing people, expression, pose, and illumination.
We embed 3D priors into adversarial learning and train the network to imitate the image formation of an analytic 3D face deformation and rendering process.
To deal with the generation freedom induced by the domain gap between real and rendered faces, we further introduce contrastive learning to promote disentanglement by comparing pairs of generated images.
Experiments show that through our imitative-contrastive learning, the factor variations are very well disentangled and the properties of a generated face can be precisely controlled.
We also analyze the learned latent space and present several meaningful properties supporting factor disentanglement.
Our method can also be used to embed real images into the disentangled latent space.
We hope our method could provide new understandings of the relationship between physical properties and deep image synthesis.
우리는 존재하지 않는 사람의 정체성, 표현, 포즈 및 조명에 대한 얽힌 것을 해체한(DISentangled), 정밀하게 통제할수 있는(COntrollable) 잠재 표현을 사용한 얼굴 이미지 생성을 위한 접근 방식인 DiscoFaceGAN을 제안합니다.
우리는 3D 사전을 적대적 학습에 포함하고 분석적 3D 얼굴 변형 및 렌더링 프로세스의 이미지 형성을 모방하도록 네트워크를 훈련시킵니다.
실제 얼굴과 렌더링 된 얼굴 사이의 영역 차이로 인해 발생하는 생성 자유를 처리하기 위해 생성된 이미지 쌍을 비교하여 분리를 촉진하는 대조 학습을 추가로 도입합니다.
실험은 우리의 모방-대비 학습을 통해 요인 변화가 매우 잘 풀리고 생성 된 얼굴의 속성을 정확하게 제어 할 수 있음을 보여줍니다.
우리는 또한 학습 된 잠재 공간을 분석하고 요인 분리를 지원하는 몇 가지 의미있는 속성을 제시합니다.
우리의 방법은 또한 실제 이미지를 풀린 잠복 공간에 삽입하는 데 사용할 수 있습니다.
우리의 방법이 물리적 특성과 깊은 이미지 합성 간의 관계에 대한 새로운 이해를 제공 할 수 있기를 바랍니다.

1. Introduction
Face image synthesis has achieved tremendous success in the past few years with the rapid advance of Generative Adversarial Networks (GANs) [14].
State-of-the art GAN models, such as the recent StyleGAN [23], can generate high-fidelity virtual face images that are sometimes even hard to distinguish from real ones.
Compared to the vast body of works devoted to improving the image generation quality and tailoring GANs for various applications, synthesizing face images de novo with multiple disentangled latent spaces characterizing different properties of a face image is still not well investigated.
Such a disentangled latent representation is desirable for constrained face image generation (e.g., random identities with specific illuminations or poses).
It can also derive a disentangled representation of a real image by embedding it into the learned feature space.
A seminal GAN research for disentangled image generation is InfoGAN [6], where the representation disentanglement is learned in an unsupervised manner via maximizing the mutual information between the latent variables and the observation.
However, it has been shown that without any prior or weak supervision, there is no guarantee that each latent variable contains a semantically-meaningful factor of variation [30, 7].
In this paper, we investigate synthesizing face images of virtual people with independent latent variables for identity, expression, pose, lighting, and an additional noise.
To gain predictable controllability on the former four variables, we translate them to the coefficients of parametric models through training a set of Variational Autoencourders (VAE).
We incorporate priors from 3D Morphable Face Models (3DMM) [4, 33] and an analytic rendering procedure into adversarial learning.
A set of imitative losses is introduced which enforces the generator to imitate the explainable image rendering process, thus generating face properties characterized by the latent variables.
However, the domain gap between real and rendered faces gives rise to a certain generation freedom that is uncontrollable, leading to unsatisfactory disentanglement of factor variations.
To deal with such generation freedom and enhance disentanglement, we further propose a collection of contrastive losses for training. We compare pairs of generated images and penalize the appearance difference that is only induced by a set of identical latent variables shared between each pair.
얼굴 이미지 합성은 GAN (Generative Adversarial Networks)[14]의 급속한 발전으로 지난 몇 년간 엄청난 성공을 거두었습니다.
최신 StyleGAN [23]과 같은 최첨단 GAN 모델은 때때로 실제와 구별하기 어려운 고 충실도(high-fidelity) 가상 얼굴 이미지를 생성 할 수 있습니다.
이미지 생성 품질을 개선하고 다양한 응용 분야에 대한 GAN을 조정하는 데 전념하는 방대한 작업과 비교하여 얼굴 이미지의 다양한 특성을 특징으로하는 여러 개의 분리 된 잠재 공간으로 새로운 얼굴 이미지를 합성하는 것은 아직 잘 조사되지 않았습니다.
이러한 얽 히지 않은 잠재 표현은 제한된 얼굴 이미지 생성 (예 : 특정 조명 또는 포즈가있는 임의의 정체성)에 바람직합니다.
또한 학습된 특징 공간에 포함하여 실제 이미지의 얽히지 않은 표현을 유도 할 수 있습니다.
disentangled 이미지 생성을 위한 중요한 GAN 연구는 InfoGAN[6]으로, 잠재 변수와 관찰 사이의 상호 정보를 최대화하여 감독되지 않은 방식으로 표현 disentanglement를 학습합니다.
그러나 사전 또는 약한 감독 없이는 각 잠재 변수에 의미상 의미있는 변동 인자가 포함된다는 보장이 없다는 것이 밝혀졌습니다 [30, 7].
본 논문에서는 정체성, 표현, 포즈, 조명, 추가 노이즈에 대한 독립적인 잠재 변수를 가진 가상 인물의 얼굴 이미지를 합성하는 방법을 조사합니다.
이전 4 개의 변수에 대한 예측 가능한 제어 가능성을 얻기 위해 VAE (Variational Autoencourders) 세트를 학습하여 매개 변수 모델의 계수로 변환.
3D Morphable Face Models (3DMM) [4, 33]의 사전과 분석적 렌더링 절차를 적대적 학습에 통합합니다.
생성기가 설명 가능한 이미지 렌더링 프로세스를 모방하도록 강제하는 일련의 모방 손실이 도입되어 잠재 변수로 특징 지어지는 얼굴 속성을 생성합니다.
그러나 실제 얼굴과 렌더링 된 얼굴 사이의 영역 격차는 제어 할 수없는 특정 세대의 자유를 가져 와서 요인 변동의 불만족스러운 분리를 초래합니다.
이러한 세대의 자유를 처리하고 분리를 강화하기 위해 우리는 훈련을위한 대조적 손실 모음을 추가로 제안합니다. 생성 된 이미지 쌍을 비교하고 각 쌍간에 공유되는 동일한 잠재 변수 집합에 의해서만 유발되는 모양 차이에 페널티를 줍니다.

This way, the generator is forced to express an independent influence of each latent variable to the final output.
We show that these contrastive losses are crucial to achieve complete latent variable disentanglement.
The model we use in this paper is based on the StyleGAN structure [23], though our method can be extended to other GAN models as well. We modify the latent code layer of StyleGAN and equip it with our new loss functions for training.
We show that the latent variables can be highly disentangled and the generation can be accurately controlled.
Similar to StyleGAN, the faces generated by our method do not correspond to any real person in the world.
We further analyze the learned StyleGAN latent space and find some meaningful properties supporting factor disentanglement.
Our method can be used to embed real images into the disentangled latent space and we demonstrate this with various experiments.
The contributions of this paper can be summarized as follows.
We propose a novel disentangled representation learning scheme for de novo face image generation via a imitative-contrastive paradigm leveraging 3D priors.
Our method enables precise control of the targeted face properties such as pose, expression, and illumination, achieving flexible and high-quality face image generation that, to our knowledge, cannot be achieved by any previous method.
Moreover, we offer several analyses to understand the properties of the disentangled StyleGAN latent space.
At last, we demonstrate that our method can be used to project real images into the disentangled latent space for analysis and decomposition.
이런 식으로 생성기는 최종 출력에 대한 각 잠재 변수의 독립적인 영향을 표현해야합니다.
우리는 이러한 대비 손실이 완전한 잠재 변수 분리를 달성하는 데 중요하다는 것을 보여줍니다.
이 논문에서 사용하는 모델은 StyleGAN 구조[23]를 기반으로 하지만 다른 GAN 모델로도 확장 할 수 있습니다.

StyleGAN의 잠복 코드 레이어(latent code layer)를 수정하고 훈련을 위한 새로운 손실 함수를 장착합니다.
우리는 잠재 변수가 매우 얽혀있고, 생성을 정확하게 제어 할 수 있음을 보여줍니다.
StyleGAN과 유사하게, 우리의 방법으로 생성된 얼굴은 세상의 실제 사람과 일치하지 않습니다.
학습된 StyleGAN 잠재 공간을 추가로 분석하고 요인 분리를 지원하는 의미있는 속성을 찾습니다.
우리의 방법은 실제 이미지를 분리된 잠복 공간에 삽입하는데 사용할 수 있으며, 다양한 실험을 통해 이를 입증합니다.
이 논문의 공헌은 다음과 같이 요약 될 수있다.
우리는 3D 사전을 활용하는 모방-대비(imitative-contrastive) 패러다임을 통해 de novo 얼굴 이미지 생성을위한 새로운 disentangled 표현 학습 체계를 제안합니다.
우리의 방법은 포즈, 표현 및 조명과 같은 대상 얼굴 속성을 정밀하게 제어하여 이전 방법으로는 달성 할 수없는 유연하고 고품질의 얼굴 이미지 생성을 가능하게합니다.
또한, 우리는 해체된 StyleGAN 잠재 공간의 속성을 이해하기 위해 여러가지 분석을 제공합니다.
마지막으로, 우리는 분석 및 분해를 위해 실제 이미지를 풀린(disentangled) 잠재 공간에 투영하는 데 우리의 방법을 사용할 수 있음을 보여줍니다.

2. Related Work
We briefly review the literature on disentangled representation learning and face image synthesis as follows.
우리는 다음과 같이 해체 표현 학습 및 얼굴 이미지 합성에 관한 문헌을 간략히 검토합니다.

Disentangled representation learning. 얽히지 않은 표현 학습.
Disentangled representation learning (DRL) for face images has been vividly studied in the past.
Historical attempts are based on simple bilinear models [46], restricted Boltzmann machines [10, 39], among others.
A seminal GAN research along this direction is InfoGAN [6].
However, InfoGAN is known to suffer from training instability [48], and there is no guarantee that each latent variable is semantically meaningful [30, 7].
InfoGAN-CR [29] introduces an additional discriminator to identify the latent code under traversal.

SD-GAN [11] applies a discriminator on image pairs to disentangle identity and appearance factors.
Very recently, HoloGAN [32] disentangles 3D pose and identity with unsupervised learning using 3D convolutions and rigid feature transformations.

DRL with VAEs also received much attention in recent years [26, 48, 18, 5, 25]
얼굴 이미지에 대한 DRL (Disentangled Representation Learning)은 과거에 생생하게 연구되었습니다.
역사적 시도는 단순한 쌍 선형 모델 [46], 제한된 볼츠만 기계 [10, 39] 등을 기반으로합니다.(Disentangled representation learning (DRL) for face images has been vividly studied in the past. )
이 방향에 따른 중요한 GAN 연구는 InfoGAN [6]입니다.
그러나 InfoGAN은 훈련 불안정성 [48]으로 고통받는 것으로 알려져 있으며 각 잠재 변수가 의미 상 의미가 있다는 보장은 없습니다 [30, 7].
InfoGAN-CR [29]은 순회에서 잠복 코드를 식별하기 위해 추가 판별자를 도입합니다.

SD-GAN [11]은 이미지 쌍에 식별기를 적용하여 정체성과 외모 요인을 분리합니다.
아주 최근에 HoloGAN [32]은 3D 컨볼루션과 엄격한 특징 변환을 사용하는 비지도 학습으로 3D 포즈와 정체성을 분리합니다.

VAE가있는 DRL도 최근 몇 년 동안 많은 주목을 받았습니다 [26, 48, 18, 5, 25].

Conditional GAN for face synthesis. 얼굴 합성을위한 조건부 GAN.

CGAN [31] has been widely used in face image synthesis tasks especially identity-preserving generation [47, 2, 52, 3, 42].
In a typical CGAN framework, the input to a generator consists of random noises together with some preset conditional factors (e.g., categorical labels or features) as constraints, and an auxiliary classifier/feature extractor is applied to restore the conditional factors from generator outputs.
It does not offer a generative modeling of the conditional factors.
Later we show that our method can be applied to various face generation tasks handled previously with CGAN frameworks.

CGAN [31]은 얼굴 이미지 합성 작업, 특히 신원 보존 세대 [47, 2, 52, 3, 42]에 널리 사용되었습니다.
일반적인 CGAN 프레임 워크에서 생성기에 대한 입력은 제약 조건으로 사전 설정된 조건부 요인 (예 : 범주 레이블 또는 기능)과 함께 임의의 노이즈로 구성되며, 생성기 출력에서 조건부 요인을 복원하기 위해 보조 분류기 / 특징 추출기가 적용됩니다.
조건부 요인의 생성 모델링을 제공하지 않습니다.
나중에 우리는 우리의 방법이 이전에 CGAN 프레임 워크로 처리 된 다양한 얼굴 생성 작업에 적용될 수 있음을 보여줍니다.

Face image embedding and editing with GANs. GAN을 사용한 얼굴 이미지 삽입 및 편집.
GANs have seen heavy use in face image manipulation [34, 19, 49,44, 36, 45, 54].
These methods typically share an encoderdecoder/generator-discriminator paradigm where the encoder embeds images into disentangled latent representations characterizing different facial properties.
Our method can also be applied to embed face images into our disentangled latent space, as we will show in the experiments.
GAN은 얼굴 이미지 조작에 많이 사용되었습니다 [34, 19, 49,44, 36, 45, 54].
이러한 방법은 일반적으로 인코더가 이미지를 서로 다른 얼굴 속성을 특성화하는 분리 된 잠재 표현에 삽입하는 인코더 디코더 / 생성기-분별 패러다임을 공유합니다.
우리의 방법은 실험에서 보여줄 것처럼 얽 히지 않은 잠복 공간에 얼굴 이미지를 삽입하는 데 적용될 수도 있습니다.

3D prior for GANs. GAN에 대한 3D 사전.
Many methods have been proposed to incorporate 3D prior into GAN for face synthesis [52, 43, 24, 8, 12, 35, 13, 32, 50].
Most of them leverages 3DMMs. For example, [24] utilizes 3DMM coefficients extracted from input images as low-frequency feature for frontal face synthesis.
[12] and [35] translate rendered 3DMM faces and real face images in a cycle fashion.
[24] generates video frames from 3DMM faces for face re-animation.
[50] uses 3DMM for portrait reconstruction and pose manipulation.
Different from these methods, we only employ 3DMM as priors in the training stage for our imitative-contrastive learning.
After training, we do not require a 3DMM model or any rendering procedure.
얼굴 합성을 위해 GAN에 3D를 사전에 통합하는 많은 방법이 제안되었습니다 [52, 43, 24, 8, 12, 35, 13, 32, 50].
대부분은 3DMM을 활용합니다. 예를 들어, [24]는 입력 영상에서 추출한 3DMM 계수를 정면 합성을위한 저주파 특성으로 활용합니다.
[12]와 [35]는 렌더링 된 3DMM 얼굴과 실제 얼굴 이미지를 주기적으로 변환합니다.
[24]는 얼굴 재생을 위해 3DMM 얼굴에서 비디오 프레임을 생성합니다.
[50]은 초상화 재구성 및 포즈 조작을 위해 3DMM을 사용합니다.
이러한 방법과 달리 우리는 모방-대비 학습을위한 훈련 단계에서 3DMM만을 우선 순위로 사용합니다.
훈련 후에는 3DMM 모델이나 렌더링 절차가 필요하지 않습니다.

3. Approach
Given a collection of real face images Y, our goal is to train a network G that generates realistic face images x from random noise z, which consists of multiple independent variables zi ∈ R Ni, each following the normal distribution.
We consider latent variables for five independent factors: identity, expression, illumination, pose, and a random noise accounting for other properties such as background.
As in standard GAN, a discriminator D is applied to compete with G.
To obtain disentangled and interpretable latent space, we incorporate 3D priors in an imitative-contrastive learning scheme (Fig. 2), described as follows.
실제 얼굴 이미지 Y의 모음이 주어지면, 우리의 목표는 각각 정규 분포를 따르는 여러 독립 변수 z_i ∈ R Ni로 구성된 랜덤 노이즈 z에서 사실적인 얼굴 이미지 x를 생성하는 네트워크 G를 훈련시키는 것입니다.
정체성, 표현, 조명, 포즈, 배경과 같은 다른 속성을 설명하는 랜덤 노이즈의 다섯 가지 독립 요소에 대한 잠재 변수를 고려합니다.
표준 GAN에서와 같이 G와 경쟁하기 위해 판별자 D가 적용됩니다.
얽 히지 않고 해석 가능한 잠복 공간을 얻기 위해 다음과 같이 모방 대조 학습 계획 (그림 2)에 3D 사전을 통합합니다.

3.1. Imitative Learning
To learn how a face image should be generated following the desired properties, we incorporate a 3DMM model [33] and train the generator to imitate the rendered 3D faces.
With a 3DMM, the 3D shape S and texture T of a face is parameterized as (1)
where S¯ and T¯ are the average face shape and texture, Bid, Bexp, and Bt are the PCA bases of identity, expression, and texture, respectively, and αs, β, and αt are the corresponding 3DMM coefficient vectors.
We denote α.= [αs, αt] as the identity-bearing coefficients.
We approximate scene illumination with Spherical Harmonics (SH) [38] parameterized by coefficient vector γ.
Face pose is defined as three rotation angles2 expressed as vector θ. With λ.= [α, β, γ, θ], we can easily obtain a rendered face xˆ through a wellestablished analytic image formation [4].
To enable imitation, we first bridge the z-space to the λspace.
We achieve this by training VAE models on the λ samples extracted from real image set Y.
More specifically, we use the 3D face reconstruction network from [9] to obtain the coefficients of all training images and train four simple VAEs for α, β, γ and θ, respectively.
After training, we discard the VAE encoders and keep the decoders, denoted as Vi, i= 1, 2, 3, 4, for z-space to λ-space mapping.
In our GAN training, we sample z = [z1, . . . , z5] from standard normal distribution, map it to λ, and feed λ to both the generator G and the renderer to obtain a generated face x and a rendered face xˆ, respectively.
Note that we can input either z or λ into G – in practice we observe no difference between these two options in terms of either visual quality or disentangling efficacy.
The benefit of using λ is the ease of face property control since λ is interpretable.
원하는 속성에 따라 얼굴 이미지를 생성하는 방법을 배우기 위해 3DMM 모델 [33]을 통합하고, 렌더링된 3D 얼굴을 모방하도록 생성기를 훈련합니다.
3DMM을 사용하면 얼굴의 3D 모양 S와 텍스처 T가 (1)로 매개 변수화됩니다.

$$$$(1)
여기서 S¯ 및 T¯은 평균 얼굴 모양 및 질감이고 Bid, Bexp 및 Bt는 각각 동일성, 표현 및 질감의 PCA 기반이며 αs, β 및 αt는 해당 3DMM 계수 벡터입니다.
α. = [αs, αt]를 동일성을 갖는 계수로 표시합니다.
계수 벡터 γ로 매개 변수화 된 구면 고조파 (SH) [38]를 사용하여 장면 조명을 근사화합니다.
얼굴 포즈는 벡터 θ로 표현되는 3 개의 회전 각도 2로 정의됩니다. λ. = [α, β, γ, θ]를 사용하면 잘 정립된 분석 이미지 형성을 통해 렌더링 된 얼굴 xˆ를 쉽게 얻을 수 있습니다 [4].
모방을 활성화하려면 먼저 z 공간을 λ 공간에 연결합니다.
실제 이미지 세트 Y에서 추출한 λ 샘플에서 VAE 모델을 학습하여이를 달성합니다.
보다 구체적으로, 우리는 [9]의 3D 얼굴 재구성 네트워크를 사용하여 모든 훈련 이미지의 계수를 얻고 α, β, γ 및 θ에 대해 각각 4 개의 간단한 VAE를 학습.
학습 후, 우리는 VAE 인코더를 버리고 z- 공간 대 λ- 공간 매핑을 위해 Vi, i = 1, 2, 3, 4로 표시된 디코더를 유지합니다.
GAN 훈련에서는 z = [z1,. . . , z5]를 표준 정규 분포에서 λ에 매핑하고 생성기 G와 렌더러 모두에 λ를 공급하여 생성 된면 x와 렌더링 된면 xˆ를 각각 얻습니다.
z 또는 λ를 G에 입력 할 수 있습니다. 실제로 시각적 품질이나 엉킴 해제 효과 측면에서 이 두 옵션간에 차이가 없음을 관찰합니다.
λ 사용의 이점은 λ가 해석 가능하기 때문에 얼굴 속성 제어가 쉽다는 것입니다.

We define the following loss functions on x for imitative learning.
First, we enforce x to mimic the identity of xˆ perceptually by (2) where fid(·) is the deep identity feature from a face recognition network, < ·, · > denotes cosine similarity, and τ is a constant margin which we empirically set as 0.3.
Since there is an obvious domain gap between rendered 3DMM faces and real ones, we allow a small difference between the features.
The face recognition network from [51] is used in this paper for deep identity feature extraction.
For expression and pose, we penalize facial landmark differences via (3) where p(·) denotes the landmark positions detected by the 3D face reconstruction network, and pˆ is the landmarks of the rendered face obtained trivially.
For illumination, we simply minimize the SH coefficient discrepancy by (4) where γ(·) represents the coefficient given by the 3D face reconstruction network, and γˆ is the coefficient of xˆ.
Finally, we add a simple loss which enforces the output to mimic the skin color of the rendered face via (5) where c(·) denotes the average color of face region defined by the mask in 3DMM.
By using these imitative losses, the generator will learn to generate face images following the identity, expression, pose, and illumination characterized by the corresponding latent variables.
The domain gap issue.
Obviously, there is an inevitable domain gap between the rendered 3DMM faces and generated ones.
Understanding the effect of this domain gap and judiciously dealing with it is important.
On one hand, retaining a legitimate domain gap that is reasonably large is necessary as it avoids the conflict with the adversarial loss and ensures the realism of generated images.
It also prevents the generative modeling from being trapped into the small identity subspace of the 3DMM model3.
On the other hand, however, it may lead to poor factor variation disentanglement (for example, changing expression may lead to unwanted variations of identity and image background, and changing illumination may disturb expression and hair structure; see Fig. 3 and 6).
To understand why this happens, we first symbolize the difference between a generated face x and its rendered counterpart xˆ as ∆x, i.e., x = ˆx + ∆x.
In the imitative learning, x is free to deviate from xˆ in terms of certain identity characteristics and other image contents beyond face region (e.g., background, hair, and eyewear).
As a consequence, ∆x has a certain degree of freedom that is uncontrollable.
We resolve this issue via contrastive learning, to be introduced next.
모방 학습을 위해 x에 다음 손실 함수를 정의합니다.
첫째, 우리는 식(2)로서 xˆ의 신원을 지각적으로 모방하도록 x를 강제합니다.

(2) : $l_I^{id}=max(1-<f_{id}(x),f_{id}(x^^)>-τ, 0)$

(2)의 fid(·)는 얼굴인식 네트워크를 통해서 deep identity feature이고, < ·, · >는 코사인 유사도이며, τ는 constant margin 0.3으로 설정합니다.
렌더링 된 3DMM 얼굴과 실제 얼굴 사이에 명백한 도메인 간격이 있으므로 기능간에 약간의 차이를 허용합니다.
본 논문에서는 Neural aggregation network for video face recognition[51]의 얼굴 인식 네트워크를 사용하여 심층 신원 특징 추출을 수행합니다.
표정과 포즈의 경우 (3)을 통해 얼굴 랜드 마크 차이에 페널티를 줍니다.

(3) : $l_I^{lm}(x)=||p(x)-p^||^2$

여기서 p (·)는 3D 얼굴 재구성 네트워크에 의해 감지된 랜드 마크 위치를 나타내고, pˆ는 사소하게 얻은 렌더링 된 얼굴의 랜드 마크입니다.
조명의 경우 (4)에 의해 SH 계수 불일치를 최소화합니다.

(4) : $l_I^{sh}=|γ(x)-γ^|_1$

여기서 γ (·)는 3D 얼굴 재구성 네트워크에 의해 주어진 계수를 나타내고 γˆ는 xˆ의 계수입니다.

마지막으로 (5)를 통해 렌더링 된 얼굴의 피부색을 모방하도록 출력을 강제하는 단순 손실을 추가합니다. 여기서 c (·)는 3DMM에서 마스크에 의해 정의 된 얼굴 영역의 평균 색상을 나타냅니다.

(5) : $l_I^{cl}(x)=|c(x)-c(x^)|_1$

이러한 모방 손실을 사용하여 생성기는 해당 잠재 변수가 특징으로하는 정체성, 표현, 포즈 및 조명에 따라 얼굴 이미지를 생성하는 방법을 학습합니다.
도메인 갭 문제.
분명히 렌더링 된 3DMM면과 생성 된면 사이에는 피할 수없는 도메인 갭이 있습니다.
이 도메인 갭의 영향을 이해하고 신중하게 처리하는 것이 중요합니다.
한편으로는 적대적 손실과의 충돌을 피하고 생성 된 이미지의 사실성을 보장하기 때문에 합법적으로 큰 합법적 인 도메인 갭을 유지하는 것이 필요합니다.
또한 생성 모델링이 3DMM 모델의 작은 식별 부분 공간에 갇히는 것을 방지합니다 3.
그러나 다른 한편으로는 요인 변화의 엉킴이 좋지 않을 수 있습니다 (예를 들어, 표현을 변경하면 원치 않는 정체성과 이미지 배경의 변화가 발생할 수 있으며 조명을 변경하면 표현과 모발 구조를 방해 할 수 있습니다. 그림 3 및 6 참조).
왜 이런 일이 발생하는지 이해하기 위해 먼저 생성 된면 x와 렌더링 된 대응 $xˆ$ 사이의 차이를 ∆x, 즉 $x = ˆx + ∆x$로 상징합니다.
모방 학습에서 x는 특정 정체성 특성과 얼굴 영역 (예 : 배경, 머리카락 및 안경)을 넘어서는 기타 이미지 콘텐츠 측면에서 xˆ에서 자유롭게 벗어날 수 있습니다.
결과적으로 ∆x는 제어 할 수없는 어느 정도의 자유도를가집니다.
이 문제는 다음에 소개할 대조 학습을 통해 해결합니다.

3.2. Contrastive Learning

To fortify disentanglement, we enforce the invariance of the latent representations for image generation in a contrastive manner: we vary one latent variable while keeping others unchanged, and enforce that the difference on the generated face images relates only to that latent variable.
Concretely, we sample pairs of latent code z, z' which differ only at zk and share the same zi, ∀i 6= k.
We compare the generated face images x, x', and then penalize the difference induced by any of zi but zk.
To enable such a comparison, we need to find a function $φk(G(z))$ which is, to the extent possible, invariant to $z_k$ but sensitive to variations of zi’s.
In this work, we implement two simple functions for face images.
The first one is designed for expression-invariant comparison.
Our idea is to restore a neutral expression for x and x' to enable the comparison.
However, high-fidelity expression removal per se is a challenging problem still being actively studied in GAN-based face image manipulation [37, 13].
To circumvent this issue, we resort to the rendered 3DMM face xˆ to get a surrogate flow field for image warping.
Such a flow field can be trivially obtained by revising the expression coefficient and rendering another 3DMM face with a neutral expression.
엉킴을 강화하기 위해 이미지 생성을위한 잠재 표현의 불변성을 대조적인 방식으로 적용합니다. 하나의 잠재 변수를 변경하고 나머지는 변경하지 않고 생성 된 얼굴 이미지의 차이가 잠재 변수에만 관련되도록 강제합니다.
구체적으로, zk에서만 다르고 동일한 $z_i$, ∀i 6 = k를 공유하는 잠복 코드 z, z'쌍을 샘플링합니다.
생성 된 얼굴 이미지 x, x '를 비교 한 다음 zi와 zk 중 하나에 의해 유발된 차이에 페널티를 적용합니다.
이러한 비교를 가능하게하려면 가능한 한 z_k에는 변하지 않지만 zi의 변화에는 민감한 φk (G (z)) 함수를 찾아야합니다.
이 작업에서는 얼굴 이미지에 대한 두 가지 간단한 기능을 구현합니다.
첫 번째는 식 불변 비교를 위해 설계되었습니다.
우리의 아이디어는 비교를 가능하게하기 위해 x와 x '에 대한 중립 표현을 복원하는 것입니다.
그러나 고 충실도 표현 제거 자체는 GAN 기반 얼굴 이미지 조작에서 여전히 활발히 연구되고있는 도전적인 문제입니다 [37, 13].
이 문제를 피하기 위해 렌더링 된 3DMM면 xˆ에 의존하여 이미지 왜곡을위한 대리 유동장을 얻습니다.
이러한 유동장은 표현 계수를 수정하고 중립 표현으로 다른 3DMM 얼굴을 렌더링함으로써 사소하게 얻을 수 있습니다.

In practice, it is unnecessary to warp both x and x'.
We simply generate the flow field v from xˆ to xˆ'and warp x to x' accordingly (see Fig. 3 for an example).
We then minimize the image color difference via (6) where x(v) is the warped image.
Second, we design two illumination-invariant losses for contrastive learning.
Since the pixel color across the whole image can be affected by illumination change, we simply enforce the semantical structure to remain static.
We achieve this by minimizing the difference between the face structures of x and x': (7) where m(·) is the hair segmentation probability map obtained from a face parsing network [28], p(·) denotes landmark positions same as in Eq. 3, and ω is a balancing weight.
We also apply a deep identity feature loss via (8) In this paper, using the above contrastive learning losses regarding expression and illumination can lead to satisfactory disentanglement (we found that pose variations can be well disentangled without need for another contrastive loss).
실제로 x와 x '를 모두 왜곡 할 필요는 없습니다.
우리는 xˆ에서 xˆ '로의 유동장 v를 생성하고 그에 따라 x에서 x'로 휘게합니다 (예를 들어 그림 3 참조).
그런 다음 (6)을 통해 이미지 색상 차이를 최소화합니다. 여기서 x (v)는 뒤틀린 이미지입니다.

(6): $l_C^{ex}(x,x')=|x(v)-x'|_1$
둘째, 우리는 대조 학습을 위해 두 가지 조명 불변 손실을 설계합니다.
전체 이미지의 픽셀 색상은 조명 변경의 영향을받을 수 있으므로 단순히 의미 구조를 정적으로 유지하도록 강제합니다.
x와 x '의 얼굴 구조 간의 차이를 최소화하여이를 달성합니다.

(7) :$l_C^{il_1}(x,x')=||m(x)-m(x||^2+||||^2$

여기서 m (·)은 얼굴 분석 네트워크에서 얻은 헤어 분할 확률 맵이고 [28], p (·)는 다음과 같은 랜드 마크 위치를 나타냅니다. 식에서. ω는 균형 가중치입니다.
우리는 또한 (8)을 통해 깊은 정체성 특징 손실을 적용합니다.이 논문에서 표현 및 조명에 관한 위의 대조 학습 손실을 사용하면 만족스러운 분리로 이어질 수 있습니다 (다른 대비 손실없이 포즈 변형이 잘 분리 될 수 있음을 발견했습니다).

Effect of contrastive learning. 대조 학습의 효과.
Following the discussion in Section 3.1, for two rendered faces xˆ and xˆ',which only (and perfectly) differ at one factor such as expression, both ∆x and ∆x' have certain free variations that are uncontrollable.
Therefore, achieving complete disentanglement with imitative learning is difficult, if not impossible.
The contrastive learning is an essential complement to imitative learning: it imposes proper constrains on ∆x and ∆x' by explicitly learning the desired differences between x and x',thus leading to enhanced disentanglement.
We empirically find that the contrastive learning also leads to better imitation and more accurate face property control.
This is because the pairwise comparison can also suppress imitation noise: any misalignment of pose or expression between x and xˆ or between x' and xˆ' will incur larger contrastive losses.
3.1 절의 논의에 따라, 표현과 같은 한 요소에서만 (그리고 완벽하게) 다른 두 개의 렌더링 된면 xˆ 및 xˆ '에 대해 ∆x와 ∆x'는 제어 할 수없는 특정 자유 변형을가집니다.
따라서 모방 학습으로 완전한 얽힘을 이루는 것은 불가능하지는 않지만 어렵습니다.
대조 학습은 모방 학습의 필수 보완 요소입니다. x와 x '사이의 원하는 차이를 명시 적으로 학습하여 ∆x 및 ∆x'에 적절한 제약을 부과하여 엉킴을 향상시킵니다.
우리는 대조 학습이 더 나은 모방과 더 정확한 얼굴 속성 제어로 이어진다는 것을 경험적으로 발견했습니다.
이는 쌍별 비교가 모조 잡음을 억제 할 수도 있기 때문입니다. x와 xˆ 사이 또는 x '와 xˆ'사이에서 포즈 나 표현이 잘못 정렬되면 더 큰 대비 손실이 발생합니다.

4. Experiments
Implementation details.
In this paper, we adopt the StyleGAN structure [23] and the FFHQ dataset [23] for training.
We train the λ-space VAEs following the schedule of [7], where encoders and decoders of the VAEs are all MLPs with three hidden layers.
구현 세부 사항.
이 논문에서는 훈련을 위해 StyleGAN 구조 [23]와 FFHQ 데이터 셋 [23]을 채택합니다.
VAE의 인코더와 디코더는 모두 3 개의 은닉 계층이있는 MLP 인 [7]의 일정에 따라 λ 공간 VAE를 훈련합니다.
For StyleGAN, we follow the standard training procedure of the original method except that we
1) remove the normalization operation for input latent variable layer,
2) discard the style-mixing strategy, and
3) train up to image resolution of 256 × 256 due to time constraint.
StyleGAN의 경우 원래 방법의 표준 교육 절차를 따릅니다.
1) 입력 잠재 변수 레이어에 대한 정규화 작업을 제거하고,
2) 스타일 믹싱 전략을 버리고
3) 시간 제약으로 인해 256 × 256의 이미지 해상도까지 훈련합니다.
We first train the network with the adversarial loss as in [23] and our imitative losses until seeing 15M real images to obtain reasonable imitation.
Then we add contrastive losses into the training process and train the network up to seeing 20M real images in total.
More training details can be found in the supplementary material.
우리는 합리적인 모방을 얻기 위해 15M 실제 이미지를 볼 때까지 [23]에서와 같이 적대적 손실과 모방 손실로 네트워크를 먼저 훈련시킵니다.
그런 다음 훈련 프로세스에 대비 손실을 추가하고 총 2 천만 개의 실제 이미지를 볼 수 있도록 네트워크를 훈련시킵니다.
자세한 교육 정보는 보충 자료에서 찾을 수 있습니다.

4.1. Generation Results
Figure 4 presents some image samples generated by our DiscoFaceGAN after training.
It can be seen that our method is able to randomly generate high-fidelity face images with a large variant of identities with diverse pose, illumination, and facial expression.
More importantly, the variations of identity, expression, pose, and illumination are highly disentangled – when we vary one factor, all others can be well preserved.
Furthermore, we can precisely control expression, illumination and pose using the parametric model coefficients for each of them.
One more example for precisely controlled generation is given in Fig. 1.
Figure 5 shows that we can generate images of new identities by mimicking the properties of a real reference image.
We achieve this by extracting the expression, lighting, and pose parameters from the reference image and combine them with random identity variables for generation.
그림 4는 훈련 후 DiscoFaceGAN에서 생성 된 일부 이미지 샘플을 보여줍니다.
우리의 방법은 다양한 포즈, 조명 및 표정으로 다양한 정체성을 가진 고 충실도 얼굴 이미지를 무작위로 생성 할 수 있음을 알 수 있습니다.
더 중요한 것은 정체성, 표현, 포즈 및 조명의 변형이 매우 분리되어 있다는 것입니다. 하나의 요소를 변경하면 다른 모든 요소가 잘 보존 될 수 있습니다.
또한 각각에 대한 파라 메트릭 모델 계수를 사용하여 표현, 조명 및 포즈를 정확하게 제어 할 수 있습니다.
정밀하게 제어되는 생성에 대한 또 다른 예가 그림 1에 나와 있습니다.
그림 5는 실제 참조 이미지의 속성을 모방하여 새로운 ID의 이미지를 생성 할 수 있음을 보여줍니다.
참조 이미지에서 표현, 조명 및 포즈 매개 변수를 추출하고 생성을 위해 무작위 ID 변수와 결합하여 이를 달성합니다.

4.2. Ablation Study
In this section, we train the DiscoFaceGAN with different losses to validate the effectiveness of our imitativecontrastive learning scheme. Some typical results are presented in Fig. 6.
Obviously, the network cannot generate reasonable face images if we remove the imitation losses.
This is because the contrastive losses rely on reasonable imitation, without which they are less meaningful and the network behavior will be unpredictable.
On the other hand, without contrastive losses, variations of different factors cannot be fully disentangled.
For example, expression and lighting changes may influence certain identity-related characteristics and some other properties such as hair structure.
The contrastive losses can also improve the desired preciseness of imitation (e.g., see the mouth-closing status in the last row), leading to more accurate generation control.
이 섹션에서는 모방 적 대조 학습 체계의 효과를 검증하기 위해 다양한 손실로 DiscoFaceGAN을 훈련합니다. 몇 가지 일반적인 결과가 그림 6에 나와 있습니다.
모방 손실을 제거하면 네트워크는 합리적인 얼굴 이미지를 생성 할 수 없습니다.
이는 대조적 손실이 합리적인 모방에 의존하기 때문입니다. 그렇지 않으면 의미가 떨어지고 네트워크 동작을 예측할 수 없습니다.
반면에 대비 손실 없이는 다양한 요인의 변화를 완전히 풀 수 없습니다.
예를 들어, 표현 및 조명 변경은 특정 정체성 관련 특성 및 머리카락 구조와 같은 다른 속성에 영향을 미칠 수 있습니다.
대조 손실은 또한 원하는 모방의 정확성을 향상시킬 수 있으며 (예 : 마지막 행의 입 닫힘 상태 확인)보다 정확한 생성 제어로 이어집니다.

4.3. Quantitative Evaluation
In this section, we evaluate the performance of our DiscoFaceGAN quantitatively in terms of disentanglement efficacy as well as generation quality.
For the former, several metrics have been proposed in VAE-based disentangled representation learning, such as factor score [25] and mutual information gap [5].
However, these metrics are not suitable for our case. Here we design a simple metric named disentanglement score (DS), described as follows.
Our goal is to measure that when we only vary the latent variable for one single factor, if other factors on the generated images are stable.
We denote the four λ-space variables α, β, γ, θ as ui, and we use u{j} as the shorthand notation for the variable set {uj |j = 1, . . . , 4, j 6= i}.
To measure the disentanglement score for ui, we first randomly generate 1K sets of u{j}, and for each u{j} we randomly generate 10 ui.
Therefore, we can generate 10K images using the trained network with combinations of ui and u{j}.
For these images, we re-estimate ui and u{j} using the 3D reconstruction network [9] (for identity we use a face recognition network [51] to extract deep identity feature instead).
We calculate the variance of the estimated values for each of the 1K groups, and then average them to obtain σui and σuj.
We further normalize σui and σuj by dividing the variance of the corresponding variable computed on FFHQ.
Finally, we measure the disentanglement score via (9)
A high DS indicates that when varying a certain factor, only the corresponding property in the generated images is changing (σui > 0) while other factors remain unchanged(σuj → 0).
Table 1 shows that the imitative learning leads to high factor disentanglement and the contrastive learning further enhances it for expression, illumination, and pose.
The disentanglement score for identity decreases with contrastive learning.
We found the 3D reconstruction results from the network are slightly unstable when identity changes, which increased the variances of other factors.
To evaluate the quality of image generation, we follow [23] to compute the Frechet Inception Distances (FID) [ ´ 17] and the Perceptual Path Lengths (PPL) [23] using 50K and 100K randomly generated images, respectively.
Table 1 shows that the FID increases with our method.
This is expected as the additional losses added to the adversarial training will inevitably affect the generative modeling.
However,we found that the PPL is comparable to the results trained with only the adversarial loss.
이 섹션에서 우리는 DiscoFaceGAN의 성능을 생성 품질뿐만 아니라 분해 효능 측면에서 정량적으로 평가합니다.
전자의 경우 요인 점수 [25] 및 상호 정보 격차 [5]와 같은 VAE 기반 분리 표현 학습에서 몇 가지 메트릭이 제안되었습니다.
그러나 이러한 메트릭은 우리의 경우에 적합하지 않습니다. 여기에서는 다음과 같이 disentanglement score (DS)라는 간단한 메트릭을 설계합니다.
우리의 목표는 생성 된 이미지의 다른 요소가 안정적인 경우 하나의 단일 요소에 대한 잠재 변수 만 변경할 때 측정하는 것입니다.
우리는 4 개의 λ 공간 변수 α, β, γ, θ를 ui로 표시하고 변수 세트 {uj | j = 1,에 대한 속기 표기법으로 u {j}를 사용합니다. . . , 4, j 6 = i}.
UI의 엉킴 해제 점수를 측정하기 위해 먼저 u {j}의 1K 세트를 무작위로 생성하고 각 u {j}에 대해 무작위로 10 개의 ui를 생성합니다.
따라서 훈련 된 네트워크를 사용하여 ui 및 u {j} 조합을 사용하여 10K 이미지를 생성 할 수 있습니다.
이러한 이미지의 경우 3D 재구성 네트워크 [9]를 사용하여 ui 및 u {j}를 재 추정합니다 (신원을 위해 얼굴 인식 네트워크 [51]를 사용하여 대신 깊은 신원 특성을 추출합니다).
1K 그룹 각각에 대해 추정 된 값의 분산을 계산 한 다음 평균을내어 σui 및 σuj를 얻습니다.
FFHQ에서 계산 된 해당 변수의 분산을 나눔으로써 σui 및 σuj를 추가로 정규화합니다.
마지막으로 (9)를 통해 풀림 점수를 측정합니다.
높은 DS는 특정 요소를 변경할 때 생성 된 이미지에서 해당 속성 만 변경되고 (σui> 0) 다른 요소는 변경되지 않은 상태 (σuj → 0)를 나타냅니다.
표 1은 모방 학습이 높은 요소 분리로 이어지고 대조 학습이 표현, 조명 및 포즈를 더욱 향상 시킨다는 것을 보여줍니다.
대조적 학습에 따라 정체성에 대한 얽힘 해제 점수가 감소합니다.
네트워크의 3D 재구성 결과는 정체성이 변경 될 때 약간 불안정하여 다른 요인의 분산이 증가한다는 것을 발견했습니다.
이미지 생성의 품질을 평가하기 위해 [23]에 따라 각각 50K 및 100K 임의로 생성 된 이미지를 사용하여 FID (Frechet Inception Distances) [´ 17] 및 PPL (Perceptual Path Lengths) [23]을 계산합니다.
표 1은 우리의 방법에 따라 FID가 증가 함을 보여줍니다.
이는 적대적 훈련에 추가 된 추가 손실이 생성 모델링에 필연적으로 영향을 미치기 때문에 예상됩니다.
그러나 우리는 PPL이 적대적 손실만으로 훈련 된 결과와 비슷하다는 것을 발견했습니다.

5. Latent Space Analysis and Embedding 잠재 공간 분석 및 임베딩
In this section, we analyze the latent space of our DiscoFaceGAN.
We show some meaningful properties supporting factor variation disentanglement, based on which we further present a method for embedding and manipulating real face images in the disentangled latent space.
이 섹션에서는 DiscoFaceGAN의 잠재 공간을 분석합니다.
우리는 팩터 변화를 지원하는 몇 가지 의미있는 속성을 보여 주며,이를 기반으로 얽 히지 않은 잠복 공간에서 실제 얼굴 이미지를 삽입하고 조작하는 방법을 추가로 제시합니다.

5.1. Analysis of Latent Space 잠재 공간 분석
One key ingredients of StyleGAN is the mapping from zspace to W-space, the latter of which relates linearly to the AdaIN [22] parameters that control “styles” (we refer the readers to [23] for more details).
Previous studies [41, 1] have shown that certain direction of changes in W-space leads to variations of corresponding attributes in generated images.
In our case, W space is mapped from λ space which naturally relates to image attributes.
Therefore, we analyze the direction of changes in the learned W-space by varying λ variables, and some interesting properties have been found.
We will introduce these properties and then provide strong empirical evidences supporting them.
Recall that the input to generator is λ-space variables α, β, γ, θ and an additional noise ε.
Here we denote these five variables as ui with u5 = ε.
We use u{j} as the shorthand notation for the variable set {uj |j = 1, . . . , 5, j 6= i}, and w(ui, u{j}) denotes the W space variable mapped from ui and u{j}.
We further denote a unit vector (10) to represent the direction of change in W space when we change ui from a to b.
The following two properties of ∆dw(i, a, b) are observed: Property 1.
For the i-th variable ui, i ∈ 1, 2, 3, 4, with any given starting value a and ending value b, we have:∆dw(i, a, b) is almost constant for ∀u{j}.
Property 2. For the i-th variable ui, i ∈ 1, 2, 3, 4, with any given offset vector 4, we have: ∆dw(i, a, a+4) is almost constant for ∀u{j}and ∀a.
Property 1 states that if the starting and ending values of a certain factor in λ space are fixed, then the direction of change in W space is stable regardless of the choice of all other factors.
Property 2 further indicates that it is unnecessary to fix the starting and ending values – the direction of change in W space is only decided by the difference between them.
To empirically examine Property 1, we randomly sampled 50 pairs of (a, b) values for each ui and 100 remaining factors for each pair.
For each (a, b) pair, we calculate 100 ∆w = w2 − w1 and get 100 × 100 pairwise cosine distances.
StyleGAN의 핵심 요소 중 하나는 zspace에서 W-space 로의 매핑이며, 후자는 "스타일"을 제어하는 AdaIN [22] 매개 변수와 선형 적으로 관련됩니다 (자세한 내용은 독자를 [23] 참조).
이전 연구 [41, 1]에서는 W 공간의 특정 방향 변화가 생성 된 이미지에서 해당 속성의 변화를 초래한다는 것을 보여주었습니다.
우리의 경우 W 공간은 자연스럽게 이미지 속성과 관련된 λ 공간에서 매핑됩니다.
따라서 우리는 λ 변수를 변경하여 학습 된 W- 공간의 변화 방향을 분석하고 몇 가지 흥미로운 특성을 발견했습니다.
우리는 이러한 속성을 소개하고이를 뒷받침하는 강력한 경험적 증거를 제공 할 것입니다.
생성기에 대한 입력은 λ 공간 변수 α, β, γ, θ 및 추가 노이즈 ε입니다.
여기서 우리는이 5 개의 변수를 u5 = ε 인 ui로 표시합니다.
변수 세트 {uj | j = 1,.에 대한 축약 표기법으로 u {j}를 사용합니다. . . , 5, j 6 = i} 및 w (ui, u {j})는 ui 및 u {j}에서 매핑 된 W 공간 변수를 나타냅니다.
ui를 a에서 b로 변경할 때 W 공간의 변경 방향을 나타내는 단위 벡터 (10)를 추가로 표시합니다.
다음 두 가지 ∆dw (i, a, b) 속성이 관찰됩니다.
i 번째 변수 ui, i ∈ 1, 2, 3, 4에 대해 주어진 시작 값 a와 끝 값 b를 사용하면 다음과 같습니다. ∆dw (i, a, b)는 ∀u {j}에 대해 거의 일정합니다. .
속성 2. i 번째 변수 ui, i ∈ 1, 2, 3, 4에 대해 주어진 오프셋 벡터 4를 사용하면 다음과 같습니다. ∆dw (i, a, a + 4)는 ∀u {j에 대해 거의 일정합니다. } 및 ∀a.
속성 1은 λ 공간에서 특정 요소의 시작 값과 끝 값이 고정되면 다른 모든 요소의 선택에 관계없이 W 공간의 변화 방향이 안정적이라고 말합니다.
속성 2는 시작 값과 끝 값을 수정할 필요가 없음을 나타냅니다. W 공간의 변경 방향은 둘 사이의 차이에 의해서만 결정됩니다.
속성 1을 경험적으로 조사하기 위해 각 ui에 대해 50 쌍의 (a, b) 값과 각 쌍에 대해 나머지 100 개 요소를 무작위로 샘플링했습니다.
각 (a, b) 쌍에 대해 100 ∆w = w2 − w1을 계산하고 100 × 100 쌍별 코사인 거리를 얻습니다.
We average all these distances for each (a, b) pair, and finally compute the mean and standard derivation of the 50 average distance values from all 50 pairs.
Similarly, we examine Property 2 by randomly generating offsets for ui, and all the results are presented in Table 2.
It can be seen that all the cosine similarities are close to 1, indicating the high consistency of W-space direction change.
For reference, in the table we also present the statistics obtained using a model trained with the same pipeline but without our imitative-contrastive losses.
각 (a, b) 쌍에 대해 이러한 모든 거리를 평균화하고 마지막으로 모든 50 쌍에서 50 개의 평균 거리 값의 평균 및 표준 도출을 계산합니다.
마찬가지로 ui에 대한 오프셋을 무작위로 생성하여 속성 2를 검사하고 모든 결과가 표 2에 나와 있습니다.
모든 코사인 유사성이 1에 가까워 W 공간 방향 변화의 높은 일관성을 나타냅니다.
참고로 표에는 동일한 파이프 라인으로 훈련 된 모델을 사용하여 얻은 통계도 있지만 모방 대비 손실이 없습니다.

5.2. Real Image Embedding and Editing 실제 이미지 삽입 및 편집
Based on the above analysis, we show that our method can be used to embed real images into the latent space and edit the factors in a disentangled manner.
We present the experimental results on various factors.
More results can be found in the suppl. material due to space limitation.
A natural latent space for image embedding and editing is the λ space.
However, embedding an image to it leads to poor image reconstruction.
Even inverting to the W space is problematic – the image details are lost as shown in previous works [1, 41].
For higher fidelity, we embed the image into a latent code w+ in the W+ space suggested by [1] which is an extended W space.
An optimization-based embedding method is used similar to [1]. However, W or W+space is not geometrically interpretable thus cannot be directly used for controllable generation.
Fortunately though, thanks to the nice properties of the learned W space (see Section 5.1), we have the following latent representation editing and image generation method: (11) where xs is an input image and xt is the targeted image after editing.
Gsyn is the synthesis sub-network of StyleGAN (after the 8-layer MLP). ∆w(i, a, b) denotes the offset of w induced by changing ui, the i-th λ-space latent variable,from a to b (see Eq. 10). It can be computed with any u{j}(we simply use the embedded one).
Editing can be achieved by flexibly setting a and b.
위의 분석을 바탕으로 우리는 우리의 방법을 사용하여 실제 이미지를 잠재 공간에 삽입하고 얽 히지 않은 방식으로 요소를 편집 할 수 있음을 보여줍니다.
다양한 요인에 대한 실험 결과를 제시합니다.
더 많은 결과는 suppl에서 찾을 수 있습니다. 공간 제한으로 인해 재료.
이미지 임베딩 및 편집을위한 자연적인 잠재 공간은 λ 공간입니다.
그러나 이미지를 포함하면 이미지 재구성이 제대로 이루어지지 않습니다.
W 공간으로 반전하는 것조차 문제가됩니다. 이전 작업 [1, 41]에서 볼 수 있듯이 이미지 세부 정보가 손실됩니다.
충실도를 높이기 위해 확장 된 W 공간 인 [1]에서 제안한 W + 공간의 잠복 코드 w +에 이미지를 삽입합니다.
최적화 기반 임베딩 방법은 [1]과 유사하게 사용됩니다. 그러나 W 또는 W + space는 기하학적으로 해석 할 수 없으므로 제어 가능한 생성에 직접 사용할 수 없습니다.
다행히도 학습 된 W 공간 (5.1 절 참조)의 멋진 속성 덕분에 다음과 같은 잠재 표현 편집 및 이미지 생성 방법이 있습니다. (11) 여기서 xs는 입력 이미지이고 xt는 편집 후 대상 이미지입니다.
Gsyn은 StyleGAN의 합성 서브 네트워크입니다 (8 계층 MLP 이후). ∆w (i, a, b)는 i 번째 λ 공간 잠재 변수 인 ui를 a에서 b로 변경하여 유도 된 w의 오프셋을 나타냅니다 (식 10 참조). 어떤 u {j}로도 계산할 수 있습니다 (단순히 내장 된 것을 사용합니다).
a와 b를 유연하게 설정하여 편집 할 수 있습니다.
Pose Editing.
Figure 7 (top) shows the typical results of pose manipulation where we freely rotate the input face by desired angles.
We also test our method with the task of face frontalization, and compare with previous methods.
Figure 7 (bottom) shows the results on face images from the LFW dataset [20].
Our method well-preserved the identitybearing characteristics as well as other contextual information such as hair structure and illumination.
포즈 편집.
그림 7 (위)은 입력면을 원하는 각도로 자유롭게 회전하는 포즈 조작의 일반적인 결과를 보여줍니다.
우리는 또한 얼굴 정면 화 작업으로 우리의 방법을 테스트하고 이전 방법과 비교합니다.
그림 7 (아래)은 LFW 데이터 세트 [20]의 얼굴 이미지에 대한 결과를 보여줍니다.
우리의 방법은 모발 구조 및 조명과 같은 기타 상황 정보뿐만 아니라 정체성 보유 특성을 잘 보존했습니다.
Image Relighting. 이미지 재조명.
Figure 8 (top) shows an example of image relighting with our method, where we freely vary the lighting direction and intensity.
In addition, we follow the previous methods to evaluate our method on the MultiPIE [15] images.
Figure 8 (bottom) shows a challenging case for lighting transfer.
Despite the extreme indoor lighting may be outside of the training data, our method still produces reasonable results with lighting directions well conforming with the references.
그림 8 (상단)은 조명 방향과 강도를 자유롭게 변경하는 방법으로 이미지 재조명의 예를 보여줍니다.
또한 MultiPIE [15] 이미지에 대한 방법을 평가하기 위해 이전 방법을 따릅니다.
그림 8 (아래)은 조명 전송에 대한 까다로운 사례를 보여줍니다.
극단적 인 실내 조명이 훈련 데이터 밖에있을 수 있지만, 우리의 방법은 여전히 참조와 잘 일치하는 조명 방향으로 합리적인 결과를 생성합니다.

6. Conclusion and Future Work 결론 및 향후 작업
We presented DiscoFaceGAN for disentangled and controllable latent representations for face image generation.
The core idea is to incorporate 3D priors into the adversarial learning framework and train the network to imitate the rendered 3D faces.
Influence of the domain gap between rendered faces and real images is properly handled by introducing the contrastive losses which explicitly enforce disentanglement.
Extensive experiments on disentangled virtual face image synthesis and face image embedding have demonstrated the efficacy of our proposed imitationcontrastive learning scheme.
The generated virtual identity face images with accurately controlled properties could be used for a wide range of vision and graphics applications which we will explore in our future work.
It is also possible to apply our method for forgery image detection and anti-spoofing by analyzing real and faked images in the disentangled space.
우리는 얼굴 이미지 생성을 위해 얽 히지 않고 제어 가능한 잠재 표현을 위해 DiscoFaceGAN을 제시했습니다.
핵심 아이디어는 3D 사전을 적대적 학습 프레임 워크에 통합하고 렌더링 된 3D 얼굴을 모방하도록 네트워크를 훈련시키는 것입니다.
렌더링 된 얼굴과 실제 이미지 사이의 도메인 간격의 영향은 명시 적으로 엉킴을 강제하는 대조 손실을 도입하여 적절하게 처리됩니다.
얽 히지 않은 가상 얼굴 이미지 합성 및 얼굴 이미지 임베딩에 대한 광범위한 실험은 제안 된 모방 대조 학습 체계의 효과를 입증했습니다.
정확하게 제어 된 속성으로 생성 된 가상 신원 얼굴 이미지는 향후 작업에서 살펴볼 광범위한 비전 및 그래픽 애플리케이션에 사용될 수 있습니다.
또한 엉킴이 풀린 공간에서 실물 이미지와 위조 이미지를 분석하여 위조 이미지 감지 및 위조 방지 방법을 적용 할 수도 있습니다.

'비지도학습 > GAN' 카테고리의 다른 글

Analyzing and Improving the Image Quality of StyleGAN, 2020(StyleGAN2) (0)	2021.03.02
A Style-Based Generator Architecture for Generative Adversarial Networks, 2019(버전 1) (0)	2021.03.02
Patch-Based Image Inpainting with Generative Adversarial Networks,2018 (0)	2021.02.18
pix2pixHD,2015 (0)	2021.02.09
[11주차] StarGAN v2: Diverse Image Synthesis for Multiple Domains, 2020 (0)	2021.01.28

내가 보려고 만든 블로그

[논문]DiscoFaceGAN,2020

'비지도학습 > GAN' 카테고리의 다른 글

티스토리툴바

[논문]DiscoFaceGAN,2020

'비지도학습 > GAN' 카테고리의 다른 글

'비지도학습/GAN' Related Articles

티스토리툴바