# Project 5A

## Description

The main goal of this project was to explore DeepFloyd diffusion models and different algorithms for generating images from noise with text prompts. The project also included experiments with image combination and editing, as well as some prompt engineering.

## Part 0: Setup

For the whole project I used `YOUR_SEED = 42`. Here are the initial results from the two stages of the DeepFloyd model with `num_inference_steps=20`:
an oil painting of a snowy mountain village
a man wearing a hat
a rocket ship


Here are the results I got from experimenting with different `num_inference_steps` values:

`5 steps`
an oil painting of a snowy mountain village
a man wearing a hat
a rocket ship

`10 steps`
an oil painting of a snowy mountain village
a man wearing a hat
a rocket ship

`30 steps`
an oil painting of a snowy mountain village
a man wearing a hat
a rocket ship

`50 steps`
an oil painting of a snowy mountain village
a man wearing a hat
a rocket ship

Overall, the more steps, the more detailed and higher quality the image is. This is clearly visible for the mountain village, which gains detail with more steps, and for the man wearing a hat, whose image becomes more and more realistic. There is still a good amount of randomness in the generation process, though: the rocket ship looks most realistic at 10 steps and then degrades into a simpler, cartoonish image.

## 1.1 Implementing the Forward Process

For this part I implemented the forward process exactly as described by the formulas in the assignment. Here is how the Berkeley Campanile looks at different noise levels:
original image
noise level 250
noise level 500
noise level 750

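The forward process can be sketched in plain PyTorch. The `alphas_cumprod` schedule below is a made-up stand-in (the project reads it from DeepFloyd's scheduler), and the image is a random tensor rather than the Campanile:

```python
import torch

# Hypothetical cumulative-product noise schedule; the project uses the
# scheduler's alphas_cumprod tensor instead.
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def forward(im, t):
    """Noise a clean image: x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps."""
    a_bar = alphas_cumprod[t]
    eps = torch.randn_like(im)
    return a_bar.sqrt() * im + (1.0 - a_bar).sqrt() * eps

x0 = torch.rand(3, 64, 64)  # stand-in for the Campanile image
x250, x500, x750 = (forward(x0, t) for t in (250, 500, 750))
```

Higher `t` means a smaller `a_bar`, so more of the image is replaced by noise.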
## 1.2 Classical Denoising

In this part I tried to denoise the noisy images with different Gaussian blur filters. I got the best results with `kernel_size=11` and `sigma=1.5`. Here are the results:
denoised from noise level 250
denoised from noise level 500
denoised from noise level 750
Applying a Gaussian blur smooths out some of the noise, but it also destroys high-frequency detail, so the images still look far less clean and sharp than the original. Clearly, a simple Gaussian blur doesn't work very well for denoising.
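A Gaussian blur like the one I used can be sketched with plain torch (torchvision's `gaussian_blur` does the same job; the "image" here is a random stand-in):

```python
import torch
import torch.nn.functional as F

def gaussian_blur(img, kernel_size=11, sigma=1.5):
    """Separable Gaussian blur over a (C, H, W) tensor."""
    half = kernel_size // 2
    x = torch.arange(kernel_size, dtype=torch.float32) - half
    k = torch.exp(-x ** 2 / (2 * sigma ** 2))
    k = k / k.sum()
    c = img.shape[0]
    kh = k.view(1, 1, 1, -1).expand(c, 1, 1, kernel_size)
    kv = k.view(1, 1, -1, 1).expand(c, 1, kernel_size, 1)
    out = img.unsqueeze(0)
    out = F.conv2d(out, kh, padding=(0, half), groups=c)  # horizontal pass
    out = F.conv2d(out, kv, padding=(half, 0), groups=c)  # vertical pass
    return out.squeeze(0)

clean = torch.rand(3, 64, 64)                 # stand-in image
noisy = clean + 0.5 * torch.randn_like(clean)
denoised = gaussian_blur(noisy)
```

The blur averages away noise and real detail alike, which is exactly why the results look soft.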
## 1.3 One-Step Denoising

In this part I implemented the one-step denoising algorithm, using `stage_1.unet` for noise prediction with the prompt `a high quality photo`. To denoise the image, I solved the forward-process formula for the original image in terms of the noisy one. Here are the results:
denoised from noise level 250
denoised from noise level 500
denoised from noise level 750

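Solving the forward equation for the clean image gives the one-step estimate. This is a sketch with a stand-in `alphas_cumprod` schedule and a known noise sample in place of the UNet's prediction:

```python
import torch

# Stand-in schedule; the project uses the scheduler's alphas_cumprod.
alphas_cumprod = torch.cumprod(1.0 - torch.linspace(1e-4, 0.02, 1000), dim=0)

def one_step_denoise(x_t, t, noise_estimate):
    """Invert x_t = sqrt(a_bar) * x_0 + sqrt(1 - a_bar) * eps for x_0."""
    a_bar = alphas_cumprod[t]
    return (x_t - (1.0 - a_bar).sqrt() * noise_estimate) / a_bar.sqrt()

# Sanity check: with the exact noise, the clean image is recovered.
x0 = torch.rand(3, 64, 64)
t = 500
eps = torch.randn_like(x0)
x_t = alphas_cumprod[t].sqrt() * x0 + (1.0 - alphas_cumprod[t]).sqrt() * eps
x0_hat = one_step_denoise(x_t, t, eps)
```

In the project, `noise_estimate` comes from `stage_1.unet`, so the recovery is only approximate.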
## 1.4 Iterative Denoising

In the previous part the results were good but not perfect: the final images still had some noise and were not perfectly sharp. In this part I implemented the iterative denoising algorithm described in the assignment, which is based on the [DDPM paper](https://arxiv.org/pdf/2006.11239). The algorithm iteratively denoises while gradually decreasing the noise level until the final image at noise level 0 is produced. Here are the results:
original image
noise level 660
noise level 510
noise level 360
noise level 210
noise level 60
final image
one-step denoising
As we can see, the final image is much cleaner and sharper than the one produced by the one-step denoising algorithm for the same noise level.
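The update I implemented can be sketched as below, with a stand-in `alphas_cumprod` schedule and a placeholder `predict_noise` callable where the project uses the UNet; the added variance term from the DDPM paper is omitted in this sketch:

```python
import torch

alphas_cumprod = torch.cumprod(1.0 - torch.linspace(1e-4, 0.02, 1000), dim=0)
strided_timesteps = list(range(990, -1, -30))  # 990, 960, ..., 30, 0

def iterative_denoise(x, predict_noise, i_start=0):
    for i in range(i_start, len(strided_timesteps) - 1):
        t, t_prev = strided_timesteps[i], strided_timesteps[i + 1]
        a_bar, a_bar_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
        alpha = a_bar / a_bar_prev
        beta = 1.0 - alpha
        eps = predict_noise(x, t)                                 # UNet's noise estimate
        x0_hat = (x - (1.0 - a_bar).sqrt() * eps) / a_bar.sqrt()  # clean-image estimate
        # Blend the clean estimate with the current noisy image:
        x = (a_bar_prev.sqrt() * beta / (1.0 - a_bar)) * x0_hat \
            + (alpha.sqrt() * (1.0 - a_bar_prev) / (1.0 - a_bar)) * x
    return x
```

Each step moves a little of the way toward the current clean-image estimate, which is what makes the result sharper than a single big jump.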
## 1.5 Diffusion Model Sampling

In this part I used the iterative denoising algorithm to create new images from noise. The prompt was `a high quality photo`. Here are the generated images:
sample 1
sample 2
sample 3
sample 4
sample 5
Though some of the samples are confusing and low quality, overall they look good and close to real images. Sample 2 looks like a group of performers, sample 3 like a hot air balloon flying over a mountain at sunset, and sample 4 like a painting of knights.
## 1.6 Classifier-Free Guidance (CFG)

In this part I modified the iterative denoising algorithm to use classifier-free guidance, closely following the algorithm described in the assignment. The prompt was `a high quality photo` and I used a guidance scale of `scale=7`. Here are the generated images:
sample 1
sample 2
sample 3
sample 4
sample 5
Immediately we can see that the images are far more realistic and sharp, with lots of detail. They are also generally more beautiful and colorful. Four out of five samples show people, and three out of five show some body of water like a lake or a sea.
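The CFG combination itself is a one-liner: the unconditional estimate is pushed toward (and, for `scale > 1`, past) the conditional one. A minimal sketch with dummy tensors in place of the two UNet outputs:

```python
import torch

def cfg_noise(eps_uncond, eps_cond, scale=7.0):
    """Classifier-free guidance: extrapolate from the unconditional estimate
    in the direction of the conditional one."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

# Dummy stand-ins for the UNet's two noise estimates:
e_uncond = torch.zeros(3, 64, 64)
e_cond = torch.ones(3, 64, 64)
e = cfg_noise(e_uncond, e_cond)
```

With `scale=1` this reduces to the plain conditional estimate; larger scales exaggerate the prompt's influence, which is why the images get sharper and more saturated.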
## 1.7 Image-to-image Translation

In this part I used the `iterative_denoise_cfg` function from the previous part to denoise different images starting from a range of noise levels. Depending on the starting noise level, this produces entirely new images or just slight modifications of the originals. Berkeley Campanile:
original
from noise level at index 1
from noise level at index 3
from noise level at index 5
from noise level at index 7
from noise level at index 10
from noise level at index 20

President Trump:
original
from noise level at index 1
from noise level at index 3
from noise level at index 5
from noise level at index 7
from noise level at index 10
from noise level at index 20

Fat cat:
original
from noise level at index 1
from noise level at index 3
from noise level at index 5
from noise level at index 7
from noise level at index 10
from noise level at index 20
I found it fascinating how key aspects of the images were mostly preserved during generation. For the Campanile, all the pictures depict a tall central figure pointing to the sky. For President Trump, starting at index 5 the images show a close-up of a smiling person. For the fat cat, starting at index 5 the images preserve the angle at which the cat is lying down.
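The edit strength comes from where in the strided timestep list we start: a small index starts near pure noise (almost a fresh image), while a large index keeps most of the original. A sketch of building those starting points, with a stand-in schedule and a random stand-in image:

```python
import torch

alphas_cumprod = torch.cumprod(1.0 - torch.linspace(1e-4, 0.02, 1000), dim=0)
strided_timesteps = list(range(990, -1, -30))  # 990, 960, ..., 30, 0

def noisy_start(im, i_start):
    """Noise the input to strided_timesteps[i_start]; iterative denoising
    (with CFG in the project) then runs from that index onward."""
    t = strided_timesteps[i_start]
    a_bar = alphas_cumprod[t]
    return a_bar.sqrt() * im + (1.0 - a_bar).sqrt() * torch.randn_like(im)

im = torch.rand(3, 64, 64)  # stand-in for the Campanile photo
starts = [noisy_start(im, i) for i in (1, 3, 5, 7, 10, 20)]
```

Index 1 corresponds to t = 960 (almost all noise), index 20 to t = 390, which is why the later images stay close to the original.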
## 1.7.1 Editing Hand-Drawn and Web Images

In this part I repeated the steps from part 1.7 to generate images based on unrealistic images and sketches. Troll face:
original
from noise level at index 1
from noise level at index 3
from noise level at index 5
from noise level at index 7
from noise level at index 10
from noise level at index 20

Sketch of a cat:
original
from noise level at index 1
from noise level at index 3
from noise level at index 5
from noise level at index 7
from noise level at index 10
from noise level at index 20

Sketch of a ship:
original
from noise level at index 1
from noise level at index 3
from noise level at index 5
from noise level at index 7
from noise level at index 10
from noise level at index 20

Sketch of an elephant:
original
from noise level at index 1
from noise level at index 3
from noise level at index 5
from noise level at index 7
from noise level at index 10
from noise level at index 20

I found the modified version of the troll face super fun and creepy! My sketches mostly didn't work very well, but it was interesting to see how the cat got colored, and super fun to see my elephant turn into a surreal animal in Antarctica. For the ship sketch, the model failed to come up with anything close: maybe it hadn't seen many ships, or my sketch was too bad. It's interesting how the ship turned into elegant text saying "... quality".
## 1.7.2 Inpainting

In this part I modified the iterative denoising algorithm to inpaint part of an image. I masked out the region I wanted to inpaint and then used the iterative denoising algorithm to fill it in. It was important to also blend in a freshly noised version of the original image at every step, so that the model only changes the masked region. Modified campanile top:
original
mask
to replace
result

My face in a cafe replaced (funny how the head now is turned the other way):
original
mask
to replace
result

Car removed from the front of a castle (got replaced with a tollgate to prevent cars from entering, lol):
original
mask
to replace
result

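The per-step blending for inpainting can be sketched as follows, with a stand-in schedule and random tensors in place of the real image and denoising state:

```python
import torch

alphas_cumprod = torch.cumprod(1.0 - torch.linspace(1e-4, 0.02, 1000), dim=0)

def forward(im, t):
    """Forward noising process, as in part 1.1."""
    a_bar = alphas_cumprod[t]
    return a_bar.sqrt() * im + (1.0 - a_bar).sqrt() * torch.randn_like(im)

def inpaint_step(x_t, orig, mask, t):
    """After each denoising step, force everything outside the mask back to a
    freshly noised copy of the original, so only the masked region is generated."""
    return mask * x_t + (1.0 - mask) * forward(orig, t)

orig = torch.rand(3, 64, 64)      # stand-in original image
mask = torch.zeros(3, 64, 64)
mask[:, 20:40, 20:40] = 1.0       # region to inpaint
x_t = torch.randn(3, 64, 64)      # current state of the denoising loop
out = inpaint_step(x_t, orig, mask, t=500)
```

Because the unmasked pixels are re-noised to match the current timestep, the model's next prediction stays consistent with the untouched parts of the photo.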
## 1.7.3 Editing with Text Prompts

This part is the same as part 1.7 except that the prompt is "a rocket ship". Campanile into a rocket ship:
original
from noise level at index 1
from noise level at index 3
from noise level at index 5
from noise level at index 7
from noise level at index 10
from noise level at index 20

My girlfriend into a rocket ship:
original
from noise level at index 1
from noise level at index 3
from noise level at index 5
from noise level at index 7
from noise level at index 10
from noise level at index 20

Campanile at night into a rocket ship (I love the night sky!):
original
from noise level at index 1
from noise level at index 3
from noise level at index 5
from noise level at index 7
from noise level at index 10
from noise level at index 20

My sketch of an elephant into a rocket ship:
original
from noise level at index 1
from noise level at index 3
from noise level at index 5
from noise level at index 7
from noise level at index 10
from noise level at index 20

## 1.8 Visual Anagrams

In this part I created visual anagrams: images that look like one thing normally and another when flipped upside down. The results are super fun and fascinating! Old man and campfire:
an oil painting of an old man
an oil painting of people around a campfire

Lady and butterfly:
a detailed sketch of a dancing lady
a detailed sketch of a butterfly

Knights and dancing show:
a lithograph of knights on horses
a lithograph of a dancing show

Cat and dog:
a watercolor of a cat
a watercolor of a dog

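The anagram trick is to average two noise estimates: one for the first prompt on the image as-is, and one for the second prompt on the flipped image, flipped back. A sketch where `predict_p1` and `predict_p2` are hypothetical stand-ins for the (CFG) noise estimators under the two prompts:

```python
import torch

def anagram_noise(x, predict_p1, predict_p2, t):
    """Noise estimate for a visual anagram: prompt 1 governs the image as-is,
    prompt 2 governs the upside-down image."""
    eps1 = predict_p1(x, t)
    flipped = torch.flip(x, dims=[-2])                     # flip upside down
    eps2 = torch.flip(predict_p2(flipped, t), dims=[-2])   # flip the estimate back
    return (eps1 + eps2) / 2.0
```

Denoising with this averaged estimate steers the image toward both prompts at once, one per orientation.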
## 1.9 Hybrid Images

In this part I combined the denoising logic with some logic from Project 2 to create hybrid images. The main idea is to combine the low frequencies of one noise estimate with the high frequencies of another to get a hybrid noise estimate, which is then used in the denoising process to produce a hybrid image. Below are examples of combinations of two prompts, with the first providing the low frequencies and the second the high frequencies. "a lithograph of a skull" and "a lithograph of waterfalls":

"a watercolor of a dog" and "an oil painting of people around a campfire":

"a watercolor of a soccer ball" and "a pair of people in black and white dancing":

"a hedgehog" and "an oil painting of a forest":

"a detailed sketch of a butterfly" and "a detailed sketch of a dancing lady":

This part proved to be the most complicated due to the difficulty of prompt engineering. It was very hard to come up with prompts that combine well as low and high frequencies; I had to try more than 20 different combinations to get something that looks good.
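The frequency split can be sketched as below; the Gaussian lowpass and its `kernel_size`/`sigma` are illustrative choices, and the two estimates are random stand-ins for the UNet's outputs under the two prompts:

```python
import torch
import torch.nn.functional as F

def lowpass(x, kernel_size=33, sigma=2.0):
    """Gaussian lowpass over a (C, H, W) noise estimate."""
    half = kernel_size // 2
    g = torch.exp(-(torch.arange(kernel_size, dtype=torch.float32) - half) ** 2
                  / (2 * sigma ** 2))
    g = g / g.sum()
    k = torch.outer(g, g).expand(x.shape[0], 1, kernel_size, kernel_size)
    return F.conv2d(x.unsqueeze(0), k, padding=half, groups=x.shape[0]).squeeze(0)

def hybrid_noise(eps1, eps2):
    """Low frequencies from prompt 1's estimate, high frequencies from prompt 2's."""
    return lowpass(eps1) + (eps2 - lowpass(eps2))

e1 = torch.randn(3, 64, 64)  # stand-in estimate for the low-frequency prompt
e2 = torch.randn(3, 64, 64)  # stand-in estimate for the high-frequency prompt
e = hybrid_noise(e1, e2)
```

This mirrors Project 2's hybrid images, except the split is applied to noise estimates rather than to the images themselves.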


# Project 5B

## Description

The main goal of this project was to implement variations of a UNet from scratch in PyTorch. The MNIST dataset was used to train the models and test their quality.

## Part 1: Training a Single-Step Denoising UNet

First, a simple denoising UNet was implemented. The architecture of the UNet is shown below:

Noising process for sigma values `[0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]`:

I trained the model to denoise MNIST images at noise level `sigma = 0.5`, using L2 loss. Training losses graph:

Predictions:
after 1 epoch
after 5 epochs
Denoiser results for different noise levels (sigma values `[0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]`):

The results are pretty good for all noise levels, though they tend to get worse for sigma values above 0.5. This is especially visible for the digit 1, which becomes hard to recognize when denoised from higher noise levels.
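One training step can be sketched as follows; the tiny convolutional model here is a hypothetical stand-in for the project's UNet, and the batch is random rather than real MNIST (the learning rate is also just an illustrative choice):

```python
import torch
import torch.nn as nn

# Tiny stand-in for the project's UNet (the real one has down/up blocks with skips).
model = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, 3, padding=1),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

x = torch.rand(8, 1, 28, 28)            # stand-in MNIST batch
z = x + 0.5 * torch.randn_like(x)       # noising: z = x + sigma * eps, sigma = 0.5
loss = nn.functional.mse_loss(model(z), x)  # L2 between denoised output and clean image
opt.zero_grad()
loss.backward()
opt.step()
```

Note that the model maps the noisy image directly back to the clean one in a single step; the diffusion variant in Part 2 predicts the noise instead.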

## Part 2: Training a Diffusion Model

In this part I switched to a UNet that predicts noise instead of denoising directly. The architecture was modified to condition the UNet on the noise level (more precisely, the time step `t` of the denoising process). Here is the updated architecture:

I trained the model to predict noise for MNIST images at various noise levels (time steps from 0 to 300), using L2 loss. Training losses graph:

Then the denoising algorithm from Project 5A was modified to use the trained UNet as the noise predictor. Here are examples of sampling from pure noise with this modified algorithm:
after 5 epochs
after 20 epochs
After 5 epochs the results are still quite random: the produced samples are often hard to match with a specific class label. After 20 epochs the results are more recognizable, but there are still cases where the output is not close to any specific digit.
## 2.4 Adding Class-Conditioning to UNet

In this part the previous architecture was modified to also condition the UNet on the class label, which makes it possible to sample specific digits using the denoising algorithm with the modified UNet. I trained the model to predict noise for MNIST images at various noise levels (time steps from 0 to 300) and class labels (0 to 9), using L2 loss. Training losses graph:

Then the classifier-free guidance denoising algorithm from Project 5A was modified to use the trained class-conditioned UNet. Here are some examples of sampling from pure noise (each column corresponds to a different class label):
after 5 epochs
after 20 epochs
After 5 epochs there is still a little bit of background noise (visible as gray in the background). After 20 epochs the background noise is minimal.
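The class-conditioning piece that enables CFG at sampling time can be sketched as follows: during training, the one-hot class vector is zeroed with some probability so the same UNet also learns an unconditional estimate (the `p_uncond` value shown is the one I used, included here as an illustrative default):

```python
import torch

def one_hot_with_dropout(labels, num_classes=10, p_uncond=0.1):
    """One-hot class-conditioning vectors, zeroed out with probability p_uncond
    so the model also learns the unconditional estimate that CFG needs."""
    c = torch.nn.functional.one_hot(labels, num_classes).float()
    keep = (torch.rand(labels.shape[0], 1) >= p_uncond).float()  # per-sample mask
    return c * keep

labels = torch.randint(0, 10, (8,))   # stand-in MNIST labels
c = one_hot_with_dropout(labels)
```

At sampling time, running the UNet once with the one-hot vector and once with the zero vector gives the conditional and unconditional estimates that the CFG formula combines.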