Artificial Facial Detection & Classification
Separating Real Faces from the Artificially Generated
UW Madison - CS 766 - Ante Tonkovic-Capin & Udhbhav Gupta

Introduction


This webpage is part of a submission for CS 766: Computer Vision at the University of Wisconsin-Madison Department of Computer Sciences. The project details the research and findings related to the identification and classification of artificially generated human faces. It was authored by Ante Tonkovic-Capin and Udhbhav Gupta during the Spring 2024 semester.

Model Demonstration


You can find our pretrained model, along with a ready-made inference script that takes an image path as input, returns the inference result, and displays the output of the facial feature extraction, here. Below are the results of running inference on two example images, one of a real human face and one of an artificially generated face:

Demo: Real Face Demo: Fake Face
Inference demo results for Fake Facial Detection with Facial Extraction: real sample (left) and fake sample (right).

To reproduce the baseline examples above, or to run inference on your own images, clone or download the contents of the ./inference directory available here, then install the requirements with pip install -r requirements.txt. From there you can run the above examples directly with the provided sample images using run_inference.py, or supply your own image with python run_inference.py "image.png" to see the results!

Presentation Slides


The presentation slides for the project, presented on 22 April 2024, are available here. For reference purposes, the slides have not been modified since the presentation and include the original notes used during it. Additional instructions for the repository containing the slides, including the website source, are available in the main README.md of the website and slides repository.

Project Source


All of the project source code used in researching this project, as well as the code used for training and testing, is available in this dedicated repository. The various datasets used, including the primary 140k images dataset, are not included in the source due to their size; however, their original download links are well-documented throughout the project and the downloads can be reproduced at will. The two primary datasets are linked here for convenience: the 140k images dataset and the "Photoshop" dataset. The only images included in the source are those referenced in the demonstration portion and those used for running the provided example inference scripts. Model checkpoints are provided for all reported test results and accuracies, along with any modules or scripts required to instantiate them.

Motivation


Artificially generated image technology, sometimes termed 'Deep Fakes', has become increasingly sophisticated, allowing the creation of realistic videos and images that can deceive viewers. In recent years this has become even more acute, with increasingly advanced techniques for creating artificial images, and for detecting them, leading to a sort of arms race in the field. This is particularly dangerous when it comes to false or misleading representations of individuals. Detecting fake or manipulated images of artificially generated individuals is crucial to preventing their malicious use in spreading misinformation, identity theft, and other harmful activities. Our project aims to explore the potential of an AI-generation detection system for facial images using modern computer vision techniques.

Below is an example of just how powerful modern image generation and manipulation techniques have become. Using the original Mona Lisa, just a single reference image, multiple perspectives and poses can be generated, seemingly bringing her to life:

Monalisa Comes to Life GIF
Egor Zakharov, Skolkovo Institute of Science and Technology

Defining Real vs Fake


Before moving on, we first need to establish what we mean when we say a face is either "Real" or "Fake". This is not as easy as it sounds, especially when trying to draw an objective line between standard processing and false representation. With the widespread use of image processing and filters on social media, image manipulation has become commonplace. These tools have democratized the ability to alter images, making it possible for anyone to modify their photos with just a few taps on a screen. From simple filters that adjust lighting and color, to more complex features that can smooth skin, change eye color, or even reshape facial features, these apps have transformed the way we present ourselves online. While these tools can be used creatively and for fun, they also raise questions about authenticity and blur the line between what should be considered "real" and "fake". Take these four images as an example; moving from left to right, we increasingly manipulate the image using various techniques:

RVF: Original RVF: Second RVF: Third RVF: Fourth
Right-most image generated using the Stable Diffusion model at https://www.artguru.ai/ai-text-to-image-generator with the left-most image as input

The left-most image is the unaltered original, though even "unaltered" still includes the native iPhone 14 camera's compensation and processing. The second image has only been modified by converting the original color to black & white. The third image has additional modifications, including sharp changes to contrast, clipping of pixel intensities at both ends of the color range, and a slight Gaussian blur. The fourth image is the output of Stable Diffusion given the prompt "modify the faces of the human and the dog" along with the original unaltered image. For these four images, you could fairly argue that the first two should be classified as "Real" and the latter two as "Fake". Easy enough, right? But instead of four discrete stages of modification, consider a spectrum of all the images in between these four, each step slightly more manipulated and modified than the last. At which image, at which level of alteration from the original, can we stop and draw a line declaring everything beyond it "Fake"? Given this difficulty, we used the following verbose but careful definition of what constitutes a "Real" human facial image:

The face, as digitally represented, is considered real if it existed physically as a biological human at a moment in time and faithfully depicts the countenance, demeanor and position originally expressed by the digital representation such that the average observer's prima facie impression would reflect the reality of the moment captured.

All this is to say that we consider an image of a face "Real" if it accurately represents a moment in time, such that the average observer is left with an accurate impression of that moment when presented with the image. For the purposes of this project, it is easier to classify what constitutes a "Fake" image of a face: the majority we use were artificially generated by GAN models or by models performing localized manipulations of facial features.
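
As a side note, the second and third stages of manipulation described above are simple to reproduce programmatically. Below is a minimal sketch using Pillow; the exact parameter values are illustrative assumptions, not the ones used to produce the images shown:

    # Reproducing the grayscale, contrast/clipping, and blur manipulations
    # described above. Parameter values are illustrative.
    from PIL import Image, ImageFilter, ImageOps

    img = Image.open("original.png")              # stage 1: unaltered original
    bw = ImageOps.grayscale(img)                  # stage 2: black & white
    # stage 3: boost contrast by clipping 10% of pixel intensities
    # from each end of the range, then apply a slight Gaussian blur
    contrasted = ImageOps.autocontrast(bw, cutoff=10)
    blurred = contrasted.filter(ImageFilter.GaussianBlur(radius=1))
    blurred.save("manipulated.png")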

Related Work


Detection of fake or manipulated images has been a topic of interest for many years, pre-dating the advent of Machine Learning models. Traditionally, digital images were manually manipulated using software like Photoshop. Gradually, multiple types of Machine Learning models were developed which could manipulate parts of images, for example facial expressions. More recently, with developments in Generative AI, ML models such as Generative Adversarial Networks (GANs) have become quite capable of generating entire images on their own based on text prompts. A variety of approaches and techniques have been developed over time to identify manipulated or AI-generated images. Some of these include identifying disparities in color components between camera-captured and deep-network-generated images4, training deep networks to detect fake face images in the wild5, and extracting adaptive manipulation traces from images6.

Methodology


At a high level, our project takes a two-phased approach. The first phase leverages Multitask Cascaded Convolutional Networks, or MTCNN3. MTCNN is a powerful algorithm for facial detection, iteratively feeding candidate detections through three different networks:

  1. P-Net (Proposal Network): quickly scans the image and proposes candidate facial regions.
  2. R-Net (Refine Network): rejects a large fraction of false candidates and refines the remaining bounding boxes.
  3. O-Net (Output Network): produces the final bounding box along with facial landmark positions.

These combined networks allow efficient and highly accurate detections; here's an example of MTCNN in action:

MTCNN: Example
MTCNN detecting and isolating two candidates, one human and one canine

Once a face is detected and isolated by the MTCNN phase, we extract it from the image and apply any required resizing and normalization. Here you can see that, while MTCNN is capable of detecting the canine's face, we take only the most likely facial candidate for use in our pipeline:

MTCNN: Example
MTCNN detection and extraction between candidates, one human and one canine
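
To make the extraction phase concrete, here is a minimal sketch of the detection and extraction step. We use the facenet-pytorch implementation of MTCNN for illustration; the specific library and parameter choices are assumptions rather than a verbatim excerpt from our scripts:

    # Minimal sketch of the MTCNN detection/extraction phase.
    # Assumes the facenet-pytorch package (pip install facenet-pytorch).
    from PIL import Image
    from facenet_pytorch import MTCNN

    # keep_all=False returns only the highest-probability candidate,
    # mirroring how we keep just the most likely face in our pipeline.
    mtcnn = MTCNN(image_size=160, margin=0, keep_all=False)

    img = Image.open("image.png").convert("RGB")
    boxes, probs = mtcnn.detect(img)  # all candidate boxes with confidences
    face = mtcnn(img)                 # cropped, resized (160x160), normalized face tensor
    if face is None:
        print("No face detected")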

The second phase of our process involves a CNN classifier trained on the output of the first phase: the extracted facial images produced by the MTCNN model. We tried multiple architectures and approaches to find the CNN that best classified the images as either "Real" or "Fake". Ultimately, we found that a four-layer model, two convolutional layers followed by two fully connected ones, provided the best results. Here's an ONNX representation of the model we used:

CNN: ONNX
Fake Facial Detection CNN Classifier ONNX Graph
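
For reference, a PyTorch sketch of a classifier in this spirit is shown below. The channel counts, kernel sizes, and hidden width are illustrative assumptions; the exact hyperparameters live in the project repository:

    # Sketch of the four-layer classifier: two convolutional layers
    # followed by two fully connected layers. Sizes are illustrative.
    import torch.nn as nn

    class FFDClassifier(nn.Module):
        def __init__(self):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 32, kernel_size=3, padding=1),   # conv layer 1
                nn.ReLU(),
                nn.MaxPool2d(2),                              # 160 -> 80
                nn.Conv2d(32, 64, kernel_size=3, padding=1),  # conv layer 2
                nn.ReLU(),
                nn.MaxPool2d(2),                              # 80 -> 40
            )
            self.classifier = nn.Sequential(
                nn.Flatten(),
                nn.Linear(64 * 40 * 40, 128),                 # fully connected layer 1
                nn.ReLU(),
                nn.Linear(128, 2),                            # fully connected layer 2: Real/Fake logits
            )

        def forward(self, x):
            return self.classifier(self.features(x))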

Combining the Facial Extraction phase with the CNN Classifier phase, the overall pipeline from input image to classification prediction can be summarized as:

FFD + FFX: Pipeline
Fake Facial Detection Pipeline
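
Putting the two phases together, end-to-end inference might look like the following sketch, reusing the FFDClassifier sketched above; the checkpoint filename and the class ordering are assumptions:

    # End-to-end sketch: MTCNN extraction followed by CNN classification.
    import torch
    from PIL import Image
    from facenet_pytorch import MTCNN

    mtcnn = MTCNN(image_size=160)                 # phase 1: facial extraction
    model = FFDClassifier()                       # phase 2: classifier sketched above
    model.load_state_dict(torch.load("ffd.pt"))   # hypothetical checkpoint path
    model.eval()

    face = mtcnn(Image.open("image.png").convert("RGB"))
    if face is not None:
        with torch.no_grad():
            probs = torch.softmax(model(face.unsqueeze(0)), dim=1)
        # Treating class index 1 as "Fake" is an assumption about label order.
        print("Fake" if probs[0, 1] > 0.5 else "Real")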

Results


We started our project by exploring which model type would work best for classifying images as fake or real, with our candidates being the Vision Transformer (ViT) and the Convolutional Neural Network (CNN). Vision Transformers are a relatively new architecture that divides an image into patches, converts the patches into vector embeddings, and then uses a transformer encoder to learn dependencies and relationships between the patches. ViTs typically perform well on image classification tasks, but we found that they did not perform as well as CNNs on our dataset. This can likely be attributed to the lesser emphasis ViTs place on local features, and perhaps to their need for a larger training dataset. We trained both models on the Photoshop and StyleGAN datasets with different configurations and hyperparameters, and found that the CNN model performed better.

Next, we trained the CNN with and without the MTCNN facial extraction phase. Below are the training results for both approaches, using only the Fake Facial Detector CNN (FFD) versus our two-phased approach of detection and extraction through the MTCNN model followed by the CNN classifier (FFX+FFD), over the entire 140,000-image dataset across 10 full training epochs:

FFD vs FFX: Loss FFD vs FFX: Accuracy
Test results for Fake Facial Detection with and without Facial Extraction, loss (left) and test accuracy (right) across epochs.
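
For context, the curves above come from a standard supervised training loop. The sketch below shows one way to set it up, reusing the FFDClassifier sketched earlier; the optimizer, learning rate, batch size, and data layout are assumptions:

    # Standard supervised training loop over extracted face crops.
    import torch
    from torch import nn, optim
    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms

    tfm = transforms.Compose([transforms.Resize((160, 160)), transforms.ToTensor()])
    train_loader = DataLoader(
        datasets.ImageFolder("data/train", tfm),  # hypothetical real/fake folder layout
        batch_size=64, shuffle=True,
    )

    model = FFDClassifier()                       # classifier sketched above
    optimizer = optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(10):                       # 10 full epochs, as in the runs above
        model.train()
        for faces, labels in train_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(faces), labels)
            loss.backward()
            optimizer.step()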

Based on the above results, it would appear that the higher accuracy of FFD compared to FFX+FFD indicates that the MTCNN phase is not improving the overall classification process. However, the higher accuracy of the FFD model can be attributed to the fact that, without Facial Extraction, the model has more background pixels to learn from (256x256 inputs versus 160x160 with extraction), so it captures more information about synthetically generated pixels. Those background pixels, however, provide no information about the synthetic facial features we ideally want our model to learn. By using MTCNN to extract the face, we not only restrict the model to learning from pixels representing facial features, but also help it better capture information about different facial orientations. To highlight this point, take a look at the left-most image in Figure 2 below: a face generated by StyleGAN which the FFD model classified as real with high probability (0.9997) but which the FFX+FFD model correctly classified as fake.

After training the two models only on the StyleGAN dataset, we also wanted to test how they generalize to other datasets of fake images. The results follow:

Model generalization results
Figure 1: Model Accuracy of FFD and FFX+FFD on various datasets

Sample dataset images
Figure 2: Sample dataset images correctly classified with Facial Extraction but not without

Unsurprisingly, the overall test accuracy of both models was low on StarGAN6 and DALL-E7, which falls in line with the fact that regular CNNs tend to work well only on the GAN whose images they were trained on and don't generalize well to other GANs. However, the model did surprisingly well on the Photoshop image dataset, which we did not expect since we never trained on images from that dataset, showing our approach was able to generalize to a completely unknown dataset with some degree of effectiveness.

As is apparent in the results in Figure 1, a common theme we noticed was that, despite being trained only on the StyleGAN dataset, the CNN model with MTCNN Facial Extraction had better accuracy on the other GAN datasets than the model without it. The main takeaway from these results is that Facial Extraction helps the CNN classifier generalize to GANs it was never trained on.
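
The per-dataset accuracies in Figure 1 come from a straightforward evaluation loop. A sketch follows, assuming a hypothetical layout where each dataset directory contains real/ and fake/ subfolders:

    # Evaluate a trained classifier on several held-out datasets.
    import torch
    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms

    tfm = transforms.Compose([transforms.Resize((160, 160)), transforms.ToTensor()])

    def accuracy(model, root):
        loader = DataLoader(datasets.ImageFolder(root, tfm), batch_size=64)
        correct = total = 0
        model.eval()
        with torch.no_grad():
            for x, y in loader:
                correct += (model(x).argmax(dim=1) == y).sum().item()
                total += y.numel()
        return correct / total

    model = FFDClassifier()                       # classifier sketched above
    model.load_state_dict(torch.load("ffd.pt"))   # hypothetical checkpoint
    for name in ["stylegan", "stargan", "dalle", "photoshop"]:
        print(name, accuracy(model, f"data/{name}"))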

Conclusion


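Our two-phased approach, MTCNN facial extraction followed by a CNN classifier, proved effective at separating real faces from artificially generated ones. While the extraction phase slightly reduced raw accuracy on the training distribution, it forced the classifier to learn from facial features rather than background pixels, and in turn generalized noticeably better to GAN datasets and manipulation techniques it was never trained on. We believe this makes facial extraction a worthwhile preprocessing step for future fake-face detection work.
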
References


All images and content were sourced as referenced unless otherwise cited. All demonstration and training images were obtained from the 140k Kaggle dataset1, with the remainder referenced as follows:

  1. XHLULU. 140k Real and Fake Faces. Kaggle, 2020. Accessed March 2024. https://www.kaggle.com/datasets/xhlulu/140k-real-and-fake-faces
  2. CIPLab. Real and Fake Face Detection ("Photoshop" dataset). Kaggle. Accessed March 2024. https://www.kaggle.com/datasets/ciplab/real-and-fake-face-detection
  3. Long-Hua Ma, Hang-Yu Fan, Zhe-Ming Lu, Dong Tian. Acceleration of Multi-Task Cascaded Convolutional Networks. IET Image Processing, 14: 2435-2441, 2020. https://doi.org/10.1049/iet-ipr.2019.0141
  4. H. Li, B. Li, S. Tan, J. Huang. Identification of Deep Network Generated Images Using Disparities in Color Components. Signal Processing, 174 (2020) 107616. https://doi.org/10.48550/arXiv.1808.07276
  5. Chih-Chung Hsu, Chia-Yen Lee, Yi-Xiu Zhuang. Learning to Detect Fake Face Images in the Wild. IEEE International Symposium on Computer, Consumer and Control (IS3C), Dec. 2018. https://doi.org/10.48550/arXiv.1809.08754
  6. Zhiqing Guo, Gaobo Yang, Jiyou Chen, Xingming Sun. Fake Face Detection via Adaptive Manipulation Traces Extraction Network. Computer Vision and Image Understanding, Volume 204, 2021, 103170, ISSN 1077-3142. https://doi.org/10.1016/j.cviu.2021.103170
  7. Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, Ilya Sutskever. Zero-Shot Text-to-Image Generation. 2021. https://doi.org/10.48550/arXiv.2102.12092