AI Analysis Report - NVIDIA: Facial Performance Capture with Deep Neural Networks (edited draft) - Content summary:

Our initial efforts concentrated on a fully connected network, but it soon became clear that a convolutional network architecture was better suited for the task. For the sake of completeness, we detail the fully connected network in Appendix B, where we also briefly characterize its strengths and weaknesses compared to the convolutional network.

Convolutional network

Our convolutional network is based on the all-convolutional architecture [Springenberg et al. 2014], extended with two fully connected layers that produce the full-resolution vertex data at the output. The input is a whitened version of the 240×320 grayscale image. For whitening, we calculate the mean and variance over all pixels in the training images, and bias and scale the input so that these are normalized to zero and one, respectively. Note that the same whitening coefficients, fixed at training time, are used for all input images during training, validation, and production use. If the whitening were done on a per-image or per-shot basis, we would lose part of the benefits of the standardized lighting environment. For example, variation in the color of the actor's shirt between shots would end up affecting the brightness of the face.

The layers of the network are listed in the table below.

Name    Description
input   Input 1 × 240 × 320 image
conv1a  Conv 3×3, 1 → 64, stride 2×2, ReLU
conv1b  Conv 3×3, 64 → 64, stride 1×1, ReLU
conv2a  Conv 3×3, 64 → 96, stride 2×2, ReLU
conv2b  Conv 3×3, 96 → 96, stride 1×1, ReLU
conv3a  Conv 3×3, 96 → 144, stride 2×2, ReLU
conv3b  Conv 3×3, 144 → 144, stride 1×1, ReLU
conv4a  Conv 3×3, 144 → 216, stride 2×2, ReLU
conv4b  Conv 3×3, 216 → 216, stride 1×1, ReLU
conv5a  Conv 3×3, 216 → 324, stride 2×2, ReLU
conv5b  Conv 3×3, 324 → 324, stride 1×1, ReLU
conv6a  Conv 3×3, 324 → 486, stride 2×2, ReLU
conv6b  Conv 3×3, 486 → 486, stride 1×1, ReLU
drop    Dropout, p =
fc      Fully connected 9720 → 160, linear activation
output  Fully connected 160 → Nout, linear activation

The output layer is initialized by precomputing a PCA basis for the output meshes based on the target meshes from the training data. Allowing 160 basis vectors explains approximately % of the variance seen in the meshes, which was considered sufficient. If we made the weights of the output layer fixed, i.e., made it a non-trainable layer, that would effectively train the remainder of the network to output the 160 PCA coefficients. However, we found that allowing the last layer to be trainable as well improved the results. Note that if we merged the last two layers together, we would have a single 9720 → Nout fully connected layer, which would have almost 40 times the number of trainable weights compared to the combination of the two layers. Because 160 PCA basis vectors are sufficient for accurately covering the space of target meshes, these degrees of freedom would be unnecessary and would only make the training more difficult and prone to overfitting.

The convolutional network, when trained with proper input augmentation (see the section on input augmentation), is not sensitive to the position and orientation of the actor's head in the input images. Hence no image stabilization is required as a preprocess.

It should be noted that the quality of the results is not overly sensitive to the exact composition of the network. Changing the geometric progression of the number of feature maps, removing some or all of the 1×1 stride convolution layers, or adding more such layers did not substantially change the results.
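To make the architecture concrete, below is a minimal PyTorch sketch of the network in the table; the paper's excerpt does not state the framework used, so PyTorch, the class name FaceCaptureNet, the constructor arguments, and the dropout placeholder p_drop=0.2 are all illustrative assumptions (the dropout probability is missing from this copy). Note that with 3×3 convolutions, padding 1, and six stride-2 layers, a 240×320 input is reduced to 4×5 with 486 feature maps, i.e., 4 × 5 × 486 = 9720 features, matching the fc layer in the table.

import torch
import torch.nn as nn

class FaceCaptureNet(nn.Module):
    def __init__(self, n_out, pixel_mean, pixel_std, pca_basis, pca_mean, p_drop=0.2):
        super().__init__()
        # Whitening constants are computed once over all training pixels and then
        # fixed; the identical values are reused for validation and production.
        self.register_buffer("pixel_mean", torch.tensor(float(pixel_mean)))
        self.register_buffer("pixel_std", torch.tensor(float(pixel_std)))

        def conv(cin, cout, stride):
            return [nn.Conv2d(cin, cout, kernel_size=3, stride=stride, padding=1),
                    nn.ReLU(inplace=True)]

        chans = [1, 64, 96, 144, 216, 324, 486]
        layers = []
        for i in range(6):
            layers += conv(chans[i], chans[i + 1], stride=2)      # conv{i+1}a: downsample
            layers += conv(chans[i + 1], chans[i + 1], stride=1)  # conv{i+1}b: keep resolution
        self.conv = nn.Sequential(*layers)

        # 240x320 halved six times -> 4x5 spatial, 486 maps = 9720 features.
        self.drop = nn.Dropout(p=p_drop)  # p elided in the source; 0.2 is a placeholder
        self.fc = nn.Linear(9720, 160)
        self.output = nn.Linear(160, n_out)  # n_out = 3 coords per output vertex (assumed)

        # Initialize the (still trainable) output layer from the PCA basis of the
        # training target meshes: basis has shape [n_out, 160], mean has shape [n_out],
        # so the layer computes the PCA reconstruction basis @ coeffs + mean.
        with torch.no_grad():
            self.output.weight.copy_(torch.as_tensor(pca_basis, dtype=torch.float32))
            self.output.bias.copy_(torch.as_tensor(pca_mean, dtype=torch.float32))

    def forward(self, x):                            # x: [batch, 1, 240, 320] grayscale
        x = (x - self.pixel_mean) / self.pixel_std   # fixed whitening
        x = self.conv(x)
        x = self.drop(torch.flatten(x, 1))
        x = self.fc(x)                               # linear activation, 160 units
        return self.output(x)                        # linear activation, vertex data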
The architecture listed above was found to perform consistently well and could be trained in a reasonable time, so it was chosen for use in production. All results in this paper were computed using this architecture.

4 Training

For each actor, the training set consists of four parts, totaling approximately 10 minutes of footage. The composition of the training set is as follows.

Range of motion. In order to capture the maximal extents of the facial motion, a single range-of-motion shot is taken where the actor goes through a predefined set of extreme expressions. These include, e.g., opening the mouth as wide as possible, moving the jaw sideways and forward as far as possible, pursing the lips, opening the eyes wide and forcing them shut, etc.

Expressions. Unlike the range-of-motion shot, which contains exaggerated expressions, this set contains normal expressions such as squinting of the eyes, an expression of disgust, etc. These kinds of expressions must be included in the training set, as otherwise the network would not be able to replicate them in production use.

Pangrams. This set attempts to cover the set of possible facial motions during normal speech for a given target language, in our case English. The actor speaks one to three pangrams, i.e., sentences that are designed to contain as many different phonemes as possible, in several different emotional tones. A pangram fitting the emotion would be optimal, but in practice this is not always feasible.

In-character material. This set leverages the fact that an actor's performance of a character is often heavily biased in terms of emotional and expressive range for various dramatic and narrative reasons. This material is composed of the preliminary version of the script, or it may be otherwise prepared for the training. Only the shots that are deemed to support the different aspects of the character are selected, so as to ensure that the trained network produces output that stays in character even if the inference isn't perfect or completely novel or out-of-character acting is encountered. The composition of the training set is typically roughly a…
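For reference, here is a small numpy sketch of how the fixed whitening statistics and the 160-vector PCA basis used above might be precomputed from the training footage and target meshes. The function names, array layouts, and SVD-based approach are assumptions for illustration, not the authors' published code; the explained-variance fraction it returns corresponds to the variance figure elided in this copy.

import numpy as np

def whitening_stats(train_images):
    """train_images: [N, 240, 320] float array of grayscale training frames.
    Mean and standard deviation are taken over ALL pixels and then fixed, so
    the same coefficients apply at training, validation, and production time."""
    return float(train_images.mean()), float(train_images.std())

def pca_basis(target_meshes, k=160):
    """target_meshes: [N, 3*V] flattened vertex positions of the training
    target meshes. Returns (basis [3*V, k], mean [3*V], explained), where
    'explained' is the fraction of variance captured by the k components."""
    mean = target_meshes.mean(axis=0)
    centered = target_meshes - mean
    # SVD of the centered data; rows of vt are the principal directions.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = float((s[:k] ** 2).sum() / (s ** 2).sum())
    return vt[:k].T, mean, explained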