Introduction

In this post I am presenting my work on an autoencoder neural network for denoising images and renders, using Keras and Python.


First, let's explain what an autoencoder neural network is, in a nutshell:

"Autoencoding" is a data compression algorithm where the compression and decompression functions learned automatically from examples rather than engineered by a human.

In addition:

  • An autoencoder is one of the simplest forms of neural network. It can consist of just 3 layers (encoding, hidden and decoding), which makes it a practical example for an introduction to deep learning (see the minimal sketch just after this list).
  • This type of network performs dimensionality reduction during the encoding process.
  • For example, it reduces the image dimensions (rows/columns) into a small representation that you can apply other algorithms to, while still preserving the key features of the data.
  • Decoding then reconstructs the original data from that small representation.
  • Ideally, the reconstructed output should match the input (image pixels, audio waves, etc.) as closely as possible.
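
As a rough sketch, a minimal dense autoencoder in Keras looks something like this (the layer sizes here are just placeholders, not the ones used in the experiments below):

from keras.layers import Input, Dense
from keras.models import Model

# a 784-dimensional input (e.g. a flattened 28x28 image)
input_img = Input(shape=(784,))

# encoder: compress the input down to a 32-dimensional representation
encoded = Dense(32, activation='relu')(input_img)

# decoder: reconstruct the original 784 values from the compressed code
decoded = Dense(784, activation='sigmoid')(encoded)

autoencoder = Model(input_img, decoded)
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')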

Motivation

Nvidia's demo at GTC 2017 about denoising rendered images with artificial intelligence was the biggest boost toward this work.

During my trip to SIGGRAPH 2017 this year I saw this demo in person, which was really informative.

Later they released their development kit, the Nvidia OptiX Ray Tracing Engine, free for commercial use.


Research

So I started by following the steps from the official Keras blog to build an autoencoder, and then went on to my own experiments and tests.

The most basic example is to build a model and train it to reconstruct a given input image using the MNIST handwritten digit database, just to get used to building a model, an encoder and a decoder.
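
The data preparation and training step from that example goes roughly like this (a sketch reusing the autoencoder model from the earlier snippet; epoch count and batch size are illustrative):

from keras.datasets import mnist
import numpy as np

# load MNIST, scale the pixels to [0, 1] and flatten each 28x28 image to a 784-vector
(x_train, _), (x_test, _) = mnist.load_data()
x_train = x_train.astype('float32') / 255.0
x_test = x_test.astype('float32') / 255.0
x_train = x_train.reshape((len(x_train), 784))
x_test = x_test.reshape((len(x_test), 784))

# train the network to reproduce its own input
autoencoder.fit(x_train, x_train,
                epochs=50,
                batch_size=256,
                shuffle=True,
                validation_data=(x_test, x_test))

# reconstruct the test images to inspect how well the key features survive
decoded_imgs = autoencoder.predict(x_test)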

And these were the results:

Here you can see how the decoder works after the encoder preserved the key features of the input image. The input was a 28x28 pixel image.


Then I tried to do the same but with a different input, so I had to write a function to prepare the input data to match the form expected by the example.
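
The preparation function itself is nothing special; the idea is roughly this (a sketch, the function name is mine, assuming OpenCV for loading and a model that expects flattened 28x28 grayscale inputs in [0, 1]):

import cv2
import numpy as np

def prepare_input(image_path):
    # load as grayscale, resize to 28x28 and scale the pixel values to [0, 1]
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    img = cv2.resize(img, (28, 28))
    img = img.astype('float32') / 255.0
    # flatten to a 784-vector and add a batch dimension for the model
    return img.reshape(1, 784)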

And these were the results:

It was again a 28x28 pixel image, but with a different shape in it.


So, why not try it with a real noisy image? The problem is that the trained network only accepts 28x28 pixel images as input.

So I decided to write a function for splitting an image into small patches of 28x28 pixels, and I grabbed this image from Google just for testing.



import cv2
import os
import numpy as np

def image_to_patches(input_image, height, width):

    image_patches = []
    # numpy stores a grayscale image as (rows, cols) = (height, width)
    imgheight, imgwidth = input_image.shape

    # walk over the image one patch at a time
    for i in range(0, imgheight, height):
        for j in range(0, imgwidth, width):

            # corners of the patch: (top row, left column, bottom row, right column)
            box = (i, j, i + height, j + width)
            # numpy indexing is [rows, columns]; flatten the patch into a vector
            patch_img = (input_image[box[0]:box[2], box[1]:box[3]]).flatten()

            image_patches.append(patch_img)

    return image_patches
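
To get a full frame back out of the model, the patches also have to be stitched together again after prediction. Something along these lines works (a sketch; patches_to_image and noisy_image are my own names, and it assumes the image dimensions are exact multiples of the patch size):

def patches_to_image(patches, imgheight, imgwidth, height, width):

    # paste every patch back at the position it was cut from
    output = np.zeros((imgheight, imgwidth), dtype=np.float32)
    index = 0
    for i in range(0, imgheight, height):
        for j in range(0, imgwidth, width):
            output[i:i+height, j:j+width] = patches[index].reshape(height, width)
            index += 1
    return output

# denoise every 28x28 patch with the MNIST-trained model, then reassemble the frame
patches = np.array(image_to_patches(noisy_image, 28, 28), dtype=np.float32) / 255.0
denoised_patches = autoencoder.predict(patches)
result = patches_to_image(denoised_patches, noisy_image.shape[0], noisy_image.shape[1], 28, 28)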


Using the model trained on the MNIST dataset to denoise these small patches didn't work properly, which makes sense, but it was worth trying just to learn.

Here were the results:


It seems that I have to train a network on a different dataset with a higher resolution, so it can work later with larger frames.

Hmmm... time to find a new dataset. Then I found this one from the University of Toronto:

http://www.cs.utoronto.ca/~strider/Denoise/Benchmark/

It contains 300 clean images and another 300 images with RGB noise at different levels (low, medium and high noise), at a resolution of 480x320.

Time for training! I ran many training sessions, changing the parameters, the layer count and the number of epochs, and even splitting the training data into small patches, to find out what really affects the reconstructed output.
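
For reference, the setup of one of these trainings looked roughly like the sketch below (a convolutional layout; the layer sizes, filter counts and the x_noisy / x_clean arrays are placeholders, not the final configuration):

from keras.layers import Input, Conv2D, MaxPooling2D, UpSampling2D
from keras.models import Model

# convolutional autoencoder for 480x320 RGB frames
input_img = Input(shape=(320, 480, 3))

# encoder: two convolution + downsampling stages
x = Conv2D(32, (3, 3), activation='relu', padding='same')(input_img)
x = MaxPooling2D((2, 2), padding='same')(x)
x = Conv2D(32, (3, 3), activation='relu', padding='same')(x)
encoded = MaxPooling2D((2, 2), padding='same')(x)

# decoder: mirror the encoder with upsampling back to the original resolution
x = Conv2D(32, (3, 3), activation='relu', padding='same')(encoded)
x = UpSampling2D((2, 2))(x)
x = Conv2D(32, (3, 3), activation='relu', padding='same')(x)
x = UpSampling2D((2, 2))(x)
decoded = Conv2D(3, (3, 3), activation='sigmoid', padding='same')(x)

denoiser = Model(input_img, decoded)
denoiser.compile(optimizer='adam', loss='binary_crossentropy')

# noisy frames as input, the matching clean frames as the target
denoiser.fit(x_noisy, x_clean, epochs=100, batch_size=8, validation_split=0.1)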





And finally I got these results:




Comparison



The results looked quite blurry. It seems that during reconstruction we are not only losing the noise but also the sharpness.

I decided to train a new network with a bigger dataset, a higher resolution and a different type of noise, but this time running it on every single channel of the image separately, instead of the three RGB channels combined.
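
The per-channel idea is simply to split the RGB image, run each channel through the single-channel network on its own and merge the results back (a sketch; denoise_per_channel is my own helper name and channel_model stands for that single-channel network):

import cv2
import numpy as np

def denoise_per_channel(channel_model, image):
    # split the RGB image, denoise each channel separately, then merge them back
    channels = cv2.split(image.astype('float32') / 255.0)
    denoised = []
    for channel in channels:
        # add the batch and channel dimensions the network expects
        prediction = channel_model.predict(channel[np.newaxis, ..., np.newaxis])
        denoised.append(prediction[0, ..., 0])
    merged = cv2.merge(denoised)
    return (merged * 255.0).clip(0, 255).astype('uint8')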

I got the INRIA Holidays dataset from here:

http://lear.inrialpes.fr/~jegou/data.php

This time the dataset contains 803 images with a resolution of 480x640.

I also wrote a function to add salt-and-pepper noise to the images, which looks similar to rendering noise.

import numpy as np
import random
import cv2

def noise_image(image, noise_rate):

    output = np.zeros(image.shape, np.uint8)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            random_val = random.random()
            if random_val > (1 - noise_rate):
                # a fraction of the pixels (noise_rate) becomes pure white speckles
                output[i][j] = 255
            else:
                # the rest keep their original value
                output[i][j] = image[i][j]
    return output

Training time! It took approximately 1 hour and 28 minutes for 100 epochs on my workstation:

GPU: Nvidia GTX 660 (CUDA Compute Capability 3.0)
CPU: Intel i7-3770 3.40GHz
Memory: 8GB

Results


This time the results are better, but still not sharp.
I had another look at Nvidia's video and found that they are using the albedo buffer to preserve the details in the image, which is missing from normal 2D images, so I am still working on that and learning more.
I only started studying deep learning and neural networks recently, so please correct me if you find any mistakes.