Austin.AI hosted an AI event during South by Southwest showcasing AI startups and companies in Austin. KUNGFU.AI attended and demonstrated StyleGAN, the new face generation architecture from NVIDIA. StyleGAN augments the GAN (Generative Adversarial Network) architecture with techniques from the style transfer literature. These techniques allow it to generate fake faces that are far more realistic than previous GAN results.
During the event, we used StyleGAN along with some other tricks to generate and control pictures of people who came to our booth. The process involves several steps and models that must be combined to produce the final result. In this post, I’ll first review how the neural network works, including explanations of GANs and style transfer. Then I’ll explain what generating faces has to do with “Practical AI”.
The intuition behind generative adversarial networks is very approachable compared to other complicated architectures. Imagine you have two neural networks, one that produces images and one that judges them, known as the “generator” and “discriminator” respectively. These two networks are pitted against each other. The generator attempts to create pictures that are realistic enough that the discriminator fails to distinguish them from pictures in the dataset. Meanwhile, the discriminator tries to outpace the generator by identifying which images are real and which are generated. Because the discriminator’s verdict is differentiable, we can backpropagate through this game to improve both networks as training proceeds. The StyleGAN model we used was trained on roughly 70,000 photos of faces scraped from Flickr (the FFHQ dataset).
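To make the game concrete, here is a deliberately tiny sketch. It is not StyleGAN and not our demo code: I’m assuming a one-dimensional “image”, a linear generator, and a logistic discriminator (all names are hypothetical), and writing the gradients out by hand so it runs with nothing but the Python standard library.

```python
import math
import random


def sigmoid(s):
    # Numerically stable logistic function.
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)


def discriminate(d, x):
    """Probability that x is real under the toy discriminator d = [w, c]."""
    return sigmoid(d[0] * x + d[1])


def generate(g, z):
    """Toy generator g = [a, b]: maps random input z to a fake sample."""
    return g[0] * z + g[1]


def d_loss(d, real, fake):
    # The discriminator wants D(real) -> 1 and D(fake) -> 0.
    return -math.log(discriminate(d, real)) - math.log(1.0 - discriminate(d, fake))


def g_loss(d, fake):
    # Non-saturating generator loss: the generator wants D(fake) -> 1.
    return -math.log(discriminate(d, fake))


def train_step(g, d, real, z, lr=0.05):
    """One alternating update; returns new (g, d) parameter lists."""
    fake = generate(g, z)
    p_real, p_fake = discriminate(d, real), discriminate(d, fake)
    # Gradient of d_loss with respect to the discriminator's w and c.
    grad_w = (p_real - 1.0) * real + p_fake * fake
    grad_c = (p_real - 1.0) + p_fake
    d = [d[0] - lr * grad_w, d[1] - lr * grad_c]
    # Gradient of g_loss with respect to the generator's a and b,
    # backpropagated through the freshly updated discriminator.
    p_fake = discriminate(d, fake)
    grad_fake = (p_fake - 1.0) * d[0]
    g = [g[0] - lr * grad_fake * z, g[1] - lr * grad_fake]
    return g, d


if __name__ == "__main__":
    rng = random.Random(0)
    g, d = [1.0, 0.0], [0.0, 0.0]
    for _ in range(2000):
        real = rng.gauss(4.0, 0.5)  # toy "dataset": samples centered on 4
        z = rng.gauss(0.0, 1.0)     # random input to the generator
        g, d = train_step(g, d, real, z)
    print("generator parameters:", g, "discriminator parameters:", d)
```

Each call to `train_step` first nudges the discriminator toward telling real from fake, then nudges the generator toward fooling the updated discriminator. Toy games like this can oscillate rather than settle, which is one reason real GAN training needs so much care.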
Interpolation between the “style” of two friends who attended our demo. The images on the sides are StyleGAN’s reproductions of the attendees’ faces. Note that during the demo we could only spend a limited time (two minutes) finding each attendee’s latent representation, so the reproductions are not as faithful as they could be.
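The interpolation itself can be sketched as a straight-line blend between two latent vectors, with each intermediate vector handed to the generator to produce one frame. The snippet below is only an illustration: the vectors are short made-up lists rather than real StyleGAN encodings, and `lerp` and `interpolation_path` are hypothetical helper names.

```python
def lerp(a, b, t):
    """Linearly blend two equal-length latent vectors; t=0 gives a, t=1 gives b."""
    if len(a) != len(b):
        raise ValueError("latent vectors must have the same length")
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]


def interpolation_path(a, b, steps):
    """Return `steps` evenly spaced latents from a to b, endpoints included."""
    if steps < 2:
        raise ValueError("need at least two steps")
    return [lerp(a, b, i / (steps - 1)) for i in range(steps)]


if __name__ == "__main__":
    friend_a = [0.0, 2.0, -1.0]  # placeholder encodings, not real latents
    friend_b = [4.0, 0.0, 1.0]
    for latent in interpolation_path(friend_a, friend_b, 5):
        print(latent)  # each latent would be fed to the generator
```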
Next is the “Style” of StyleGAN. Standard GAN architectures generate images by taking random vectors and upsampling them as they move through the generator’s network, eventually arriving at something that can ideally fool the discriminator. StyleGAN moves away from this approach by instead starting at a learned constant and adding in “style”, a vector derived from the random input, at multiple points during the generation process. Repeatedly injecting the style vector gives it a greater impact on the final image. If it were used only once at the start, its influence would be drowned out by later operations by the time the full picture is generated. This technique is powerful, producing the most realistic faces from a generative model to date. It also gives us the ability to mix styles, operate on styles, find specific encodings, and more.
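In the StyleGAN paper, the injection step is adaptive instance normalization (AdaIN): each feature map is normalized to zero mean and unit variance, then rescaled and shifted by values computed from the style vector. The sketch below shows that single operation on plain Python lists. It is a simplified illustration rather than NVIDIA’s implementation, and the function names and example numbers are my own assumptions.

```python
import math


def adain(feature_map, scale, bias, eps=1e-8):
    """Normalize one feature map (a flat list of activations), then apply
    the style: scale sets the new standard deviation, bias the new mean."""
    n = len(feature_map)
    mean = sum(feature_map) / n
    var = sum((v - mean) ** 2 for v in feature_map) / n
    std = math.sqrt(var + eps)
    return [scale * (v - mean) / std + bias for v in feature_map]


def apply_style(feature_maps, style):
    """Apply one (scale, bias) pair per channel. In StyleGAN these pairs
    come from a learned affine transform of the style vector; here they
    are simply passed in."""
    return [adain(fm, s, b) for fm, (s, b) in zip(feature_maps, style)]


if __name__ == "__main__":
    features = [[1.0, 2.0, 3.0, 4.0], [10.0, 10.0, 30.0, 30.0]]
    style = [(2.0, 0.5), (0.1, -1.0)]  # made-up (scale, bias) per channel
    for channel in apply_style(features, style):
        print([round(v, 3) for v in channel])
```

Because every layer re-normalizes before applying its own scale and bias, the statistics left over from earlier layers are wiped out and the style sets them afresh each time. Feeding one face’s style to the early layers and another’s to the later layers is what makes style mixing possible.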
Our demo used this new network, but it had other pieces as well. We made use of three different networks in total: a face detector to find the attendee’s face within the picture we took, a VGG network to find the encoding within StyleGAN’s latent space, and finally StyleGAN itself for control of the output. Combining models like this is increasingly standard practice: the solution to a problem is rarely a single network or method. It is common to pool together several models, networks, or techniques to create whatever custom solution the specific use case needs. It can get complicated, but the end result is a powerful tool for solving a complicated problem.
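One common way to find an encoding, and I’m describing the general idea here rather than the exact demo code, is to start from some latent vector and run gradient descent on it until the features of the generated image match the features of the photo. The sketch below uses tiny hypothetical stand-ins (`toy_generator` for StyleGAN, `toy_features` for the VGG network) and a numerical gradient so that it runs with only the standard library.

```python
def toy_generator(w):
    """Hypothetical stand-in for StyleGAN: maps a 2-D latent to 3 'pixels'."""
    return [w[0] + w[1], w[0] - w[1], 2.0 * w[0]]


def toy_features(image):
    """Hypothetical stand-in for VGG features; here just the pixels."""
    return list(image)


def perceptual_loss(w, target_features):
    """Squared distance between generated and target features."""
    feats = toy_features(toy_generator(w))
    return sum((f - t) ** 2 for f, t in zip(feats, target_features))


def numerical_grad(loss_fn, w, h=1e-5):
    """Central-difference gradient, so no autodiff library is needed."""
    grad = []
    for i in range(len(w)):
        up, down = list(w), list(w)
        up[i] += h
        down[i] -= h
        grad.append((loss_fn(up) - loss_fn(down)) / (2.0 * h))
    return grad


def find_latent(target_image, steps=200, lr=0.05):
    """Gradient descent on the latent until its features match the target's."""
    target_features = toy_features(target_image)

    def loss_fn(w):
        return perceptual_loss(w, target_features)

    w = [0.0, 0.0]
    for _ in range(steps):
        grad = numerical_grad(loss_fn, w)
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w, loss_fn(w)


if __name__ == "__main__":
    photo = toy_generator([0.5, -1.0])  # pretend this is the cropped face
    latent, loss = find_latent(photo)
    print("recovered latent:", latent, "final loss:", loss)
```

In a real pipeline the face detector would crop the photo first, and the number of optimization steps is where a time budget like our two minutes per attendee comes in: fewer steps means a rougher match.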
How can StyleGAN be practical? Generating fake faces might not be immediately useful in a business context, but any piece of the process could be. Practitioners who can wire these parts together correctly can solve very complex problems with grace. I like to compare neural networks to plumbing, a metaphor that fits both constructing the specifics of a network architecture and building a full-stack solution to a problem with AI. When putting pipes in a home, a plumber must decide which pieces to use and how to fit them together, depending on where the water comes in, where it needs to come out, and the specifics of the situation. Neural networks don’t have twisting pipes, but they do have huge tensors of weights. There is an assortment of different methods, and it’s up to the practitioner to know which parts suit the current situation best. One must make intelligent decisions about which base model suits the problem (CNN vs LSTM vs Transformer), how to glue layers together (non-linear activation functions), how deep to make the network (number of layers), and other hyperparameters. StyleGAN has artfully stitched together GANs and style transfer to give us an awesome, powerful, fine-grained generative tool.
But again, how will this new technology be useful beyond being cool? That’s up to our creativity (which computers have yet to figure out). Product photos might no longer need human models, since we could generate them with specific constraints and ideas in mind. Extras in movies could be procedurally generated. NPCs in video games could be more realistic, interesting, and varied. These are just possibilities for facial generation, but GANs can work on any dataset of images that share similarities, and more recently on non-image datasets such as text and audio. As a non-human example, GANs are already used to create training data for driverless cars. GANs may also help generate synthetic data to train all sorts of models where real data is lacking, which would be a huge breakthrough for the field and speed up other innovation. If you’d like to explore some of the demo itself, check out our repo at https://github.com/maxisawesome/stylegan-encoder.