# Prediction Uncertainty for Neural Networks

## Determining Epistemic Uncertainty while Reducing Computational Costs

# Executive Summary

Evidential deep learning allows us to understand when we can trust the output of our deep learning models. When training any machine learning model, including a deep neural network, we assume that our training data is representative of what the model will see in practice once it is deployed in the real world. However, this is rarely entirely true, and the model often encounters patterns that were absent from the training dataset.

In this paper, I investigate evidential deep learning in the context of self-driving vehicles. To train a model for an autonomous vehicle, image data must be collected from the real world (a large dataset of road images); these images are then labeled and used for training so that the vehicle can identify the center of the road, signs, other vehicles, pedestrians, and anything else it may encounter while driving. In a safety-critical application such as autonomous vehicles, knowing when we can trust a model's predictions, and when we cannot, is essential.

In this paper, I generate images of a road and predict the center of the right lane. These images are artificially generated (cartoon images drawn in Python) rather than real photographs of a road. I chose this approach in order to have a very clean dataset with which to test the theory behind evidential deep learning. Using a controlled environment was important so that we could eliminate extraneous noise and focus on the problem itself. In the future, these methods could be applied to real-world data.

Later in this paper, I discuss the process and theory behind adding noise to the artificially generated images to test the robustness of the model, and to understand how we can test for evidence in our predictions. Finally, I discuss a code repository for evidential deep learning that can be used to calculate prediction uncertainty for this problem.

I worked on this project under the mentorship of Dr. Larry Jackel, President of North-C[1] Technologies, and consultant to NVIDIA and Toyota on self-driving car technology and software.

### Github Repository

The code for this paper can be found in a GitHub repository. The README file contains a description of each of the code files in the repository:

https://github.com/pmank64/Prediction-Uncertainty-for-Neural-Networks

# Prediction Uncertainty

The motivation behind this project stems from a paper on Deep Evidential Regression and a lecture presented by Alexander Amini[2], a PhD student at MIT. A neural network can be trained to output a probability distribution in a classification setting. For example, if we had a set of images of animals and wanted to distinguish a horse from a cow from a cat, the output of our model would be the probability that a given sample belongs to each of these classes. A softmax activation function ensures that the probability outputs sum to one, and a negative log likelihood loss function can be used to train this type of network. This probabilistic output gives us a sense of how confident the network is that an image falls into a particular category.
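The classification setup above can be sketched in a few lines of PyTorch; the three animal classes and the logit values are purely illustrative:

```python
import torch
import torch.nn.functional as F

# Hypothetical raw scores (logits) from a network for one image,
# over three classes: horse, cow, cat.
logits = torch.tensor([[2.0, 0.5, -1.0]])

# Softmax turns the scores into probabilities that sum to one.
probs = F.softmax(logits, dim=1)

# Negative log likelihood of the true class (say, "horse" = index 0);
# minimizing this trains the network to assign it high probability.
nll = F.nll_loss(torch.log(probs), torch.tensor([0]))
```

The probability assigned to the most likely class is what gives the sense of confidence described above.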

As mentioned previously, the ability to understand how confident we should be in our model predictions is crucial in safety critical applications. The challenge is to be able to understand similar uncertainty in regression problems, or problems with a continuous output, such as predicting the pixel location of the center of the road.

To do this, we assume that our truth values are normally distributed, and we use a specific loss function so that the network makes not just a single point prediction, but outputs a mean and a sigma describing a probability distribution, which indicates how confident we are in the prediction for that particular sample. It is important to note that this method does not capture our confidence in the model, but rather our confidence in the data.
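A minimal sketch of this idea in PyTorch, assuming a network head that emits a raw mean and a raw sigma (a softplus keeps sigma positive; all values here are illustrative):

```python
import torch
import torch.nn as nn

# Hypothetical two-headed output for one sample: a predicted mean and
# a raw sigma value, with softplus enforcing sigma > 0.
raw_mu, raw_sigma = torch.tensor([250.0]), torch.tensor([0.5])
mu = raw_mu
sigma = nn.functional.softplus(raw_sigma)

# Gaussian negative log likelihood of a truth value y under N(mu, sigma^2).
# Minimizing this loss trains the network to output both the mean and
# a sigma that reflects the noise in the data (aleatoric uncertainty).
y = torch.tensor([248.0])
loss = nn.GaussianNLLLoss()(mu, y, sigma**2)
```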

There are two types of uncertainty: epistemic and aleatoric. Epistemic uncertainty is uncertainty in the predictive process (the model may be unsure of its prediction). Aleatoric uncertainty is uncertainty in the data itself: there is statistical noise in the data. Aleatoric uncertainty can be learned using the method just described (designing a model that outputs a probability distribution); the sigma is the aleatoric uncertainty. Epistemic uncertainty is more challenging to determine.

One way that epistemic uncertainty has been estimated is by training an ensemble of independently trained networks. These networks are trained in a stochastic manner with random subsets of the training data. If the variance of the predictions across these models is high, we can conclude that we have high epistemic uncertainty. The big downside to this method is that many networks must be trained; for a network with many millions of parameters, this becomes very computationally costly.
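The ensemble idea can be sketched as follows; the tiny architecture and freshly initialized (untrained) weights are stand-ins for the independently trained networks described above:

```python
import torch
import torch.nn as nn

# A small stand-in network; in practice each ensemble member would be
# a full model trained on a random subset of the training data.
def make_model():
    return nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 1))

torch.manual_seed(0)
ensemble = [make_model() for _ in range(5)]

# All members predict the same input.
x = torch.randn(1, 4)
with torch.no_grad():
    preds = torch.stack([m(x) for m in ensemble])

# High variance across members indicates high epistemic uncertainty.
epistemic = preds.var().item()
```

The cost issue is visible even in this sketch: the work scales linearly with the number of ensemble members.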

As discussed above, we can learn the mean and sigma for a particular input to get the aleatoric uncertainty, or the uncertainty in the data. We can train multiple models, plot their means and sigmas (image below), and fit a distribution over them: a normal inverse gamma distribution. This distribution represents the epistemic uncertainty of the model, and its spread indicates how confident the model is in its predictions. The goal is to learn the parameters of the normal inverse gamma distribution directly, so that epistemic uncertainty can be determined without training an ensemble.

Depending on the shape of the normal inverse gamma distribution, we can tell whether we have high aleatoric or high epistemic uncertainty. For example, if the distribution is tall and narrow (spread mainly along the sigma axis), then a wide range of sigmas is plausible, and therefore aleatoric uncertainty is high. If the distribution is wide in both directions, then we have high epistemic uncertainty.

There is a GitHub repository[3] associated with the paper and lecture that implements this method of calculating epistemic uncertainty on a toy dataset. The last layer of the neural network outputs the four parameters of the normal inverse gamma distribution (mu, lambda, alpha, and beta), and a custom loss function updates the model to learn these parameters from the y labels. This method could also be applied to the road images discussed in this paper in order to understand how certain we are about the center of the road.
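A sketch of such an evidential output head and the resulting uncertainty decomposition, following the formulas from the Deep Evidential Regression paper; the raw output values and the softplus constraints used here are illustrative assumptions, not the repository's exact code:

```python
import torch
import torch.nn.functional as F

# Hypothetical raw values from the network's final layer. Softplus
# enforces the constraints nu > 0, alpha > 1, beta > 0 on the
# normal inverse gamma parameters.
raw = torch.tensor([250.0, 0.1, 0.2, 0.3])
mu = raw[0]
nu = F.softplus(raw[1])          # lambda in the paper's notation
alpha = F.softplus(raw[2]) + 1.0
beta = F.softplus(raw[3])

# Uncertainty decomposition from the Deep Evidential Regression paper:
prediction = mu                           # E[mu]
aleatoric = beta / (alpha - 1.0)          # E[sigma^2]
epistemic = beta / (nu * (alpha - 1.0))   # Var[mu]
```

Note that the epistemic term shrinks as nu (the "virtual evidence" for the mean) grows, which is how a single network can report low epistemic uncertainty on familiar inputs without an ensemble.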

# Simple Implementation: Predicting the Center of the Road

The first phase of the project involved designing a neural network to predict the center of the right lane in a set of artificially generated images. I generated images such as the ones below using the Python imaging library Pillow.

Each image is varied slightly by two parameters, offset and rotation. In other words, in each image the road is shifted left or right and rotated based on values drawn from a uniform distribution. The red line drawn down the center of the right lane is the target we are attempting to predict, which is a continuous value. The truth values are expressed in horizontal pixels; since we know the image is 500 pixels wide, this provides a sanity check when assessing the predictions and error rate. We aim to predict two values: the horizontal positions (measured from the left) of the bottom and top of the line.
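A hedged sketch of how such images might be generated with Pillow; the colors, geometry, and parameter ranges below are assumptions for illustration, not the values actually used in this project:

```python
from PIL import Image, ImageDraw
import random

W, H = 500, 375  # image width and height in pixels

def make_road_image(seed):
    random.seed(seed)
    offset = random.uniform(-50, 50)    # horizontal shift in pixels (assumed range)
    rotation = random.uniform(-10, 10)  # rotation in degrees (assumed range)

    img = Image.new("RGB", (W, H), (135, 206, 235))  # plain background
    draw = ImageDraw.Draw(img)

    # Draw the road as a polygon, shifted horizontally by the offset.
    cx = W / 2 + offset
    draw.polygon([(cx - 150, H), (cx + 150, H), (cx + 40, 0), (cx - 40, 0)],
                 fill=(90, 90, 90))

    # Red target line down the center of the right lane; its bottom and
    # top horizontal positions are the two truth values to predict.
    draw.line([(cx + 75, H), (cx + 20, 0)], fill=(255, 0, 0), width=3)

    return img.rotate(rotation, fillcolor=(135, 206, 235))

img = make_road_image(0)
```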

### Designing the network

The images I am working with in this paper are very clean, with very little noise; therefore, a relatively simple convolutional network achieves accuracy within a fraction of a pixel.

Below is a diagram of the network architecture. Each image is 375 x 500 pixels, with 3 color channels. The network consists of a convolutional layer with 16 filters, a batch normalization layer, and a rectified linear unit (ReLU) activation function. Finally, the output is flattened and fed through a fully connected network with one simple linear layer with two outputs. These two continuous outputs represent the top and bottom horizontal positions of the center of the right lane.
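The architecture might be sketched in PyTorch as below; the kernel size, stride, and padding are assumptions, since the description above specifies only the 16-filter convolution, batch normalization, ReLU, and the two-output linear layer:

```python
import torch
import torch.nn as nn

class RoadNet(nn.Module):
    def __init__(self):
        super().__init__()
        # 3 input channels -> 16 filters; kernel/stride/padding assumed.
        self.conv = nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1)
        self.bn = nn.BatchNorm2d(16)
        self.relu = nn.ReLU()
        # With a stride-2 convolution, a 375 x 500 image becomes 188 x 250.
        self.fc = nn.Linear(16 * 188 * 250, 2)

    def forward(self, x):
        x = self.relu(self.bn(self.conv(x)))
        # Flatten and map to the two continuous outputs: the top and
        # bottom horizontal positions of the lane-center line.
        return self.fc(torch.flatten(x, start_dim=1))

model = RoadNet()
out = model(torch.randn(1, 3, 375, 500))
```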

### Initial results

Below are the hyperparameters used to train the initial network.

The y axis shows the loss in horizontal pixels. Early in training, the predictions were on average about 90 pixels off from the true center of the right lane. Below, the actual and predicted lines are plotted on a road image; only the blue (predicted) line is visible because it sits directly on top of the red line (the actual center of the road). Overall, training converged to an error of about 0.404 pixels on the training dataset and 0.336 pixels on the validation dataset.

# Adding Noise to Our Images

After establishing a network that makes accurate predictions for clean images, it was time to make predictions for noisy images. To generate this noise, I took each value in every color channel and added to it the product of a draw from a normal distribution (with mean zero and standard deviation sigma) and the square root of the original value. The amount of noise can therefore be controlled by the sigma of the normal distribution. Note that in the training example in the section above, sigma was zero, meaning no noise was added. In the next section of the paper, I explore the effect of adding varying amounts of noise.
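The noise model described above can be sketched in NumPy; clipping back to the valid [0, 255] range is an assumption added here so the result remains a displayable image:

```python
import numpy as np

def add_noise(image, sigma, rng):
    """Add noise: each channel value v becomes v + N(0, sigma) * sqrt(v)."""
    image = image.astype(np.float64)
    noise = rng.normal(0.0, sigma, size=image.shape) * np.sqrt(image)
    # Clip back to valid pixel range (assumed step, not stated in the paper).
    return np.clip(image + noise, 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
clean = np.full((375, 500, 3), 128, dtype=np.uint8)  # illustrative gray image
noisy = add_noise(clean, sigma=10, rng=rng)
```

With sigma = 0 the function returns the image unchanged, matching the no-noise training setup above.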

**Below are the results of applying differing levels of noise (values in pixels):**

**Noise: sigma = 10**

**Noise: sigma = 100**

**Noise: sigma = 500**

**Noise: sigma = 1000**

# Experimenting with Hyperparameters

In this section, I experiment with different hyperparameters to understand their effect on the loss graphs and results. Rather than training with 2,000 images in total (split between training and validation), I use just 50 images, since this section is for experimentation. I also reduce the batch size to match the smaller number of images.

The red line represents the true center of the lane, and the blue line represents the model’s prediction.

The chart below plots the loss over epochs; the loss on the y axis is in horizontal pixels. Training starts at a loss of between 200 and 250 pixels: in other words, the predicted line is, on average, around 200 to 250 pixels away from the actual center of the lane. Even with this level of noise, and only 50 images and 50 epochs, the model converges very well, with only about 5 pixels of error on the validation dataset.

As a test of the characteristics of a neural network, I increase the batch size to 25, with all else held constant. With a smaller batch size, we are able to get past some of the complexities of the loss surface by incorporating some "noise" into the training: each batch represents only a small part of the dataset, which helps us avoid getting stuck in local minima and other abnormalities of the loss surface. With the larger batch size, we can see that the training does not converge as quickly.

Next, I try increasing the learning rate to 3e-05. We can see that the loss jumps around a lot and is initially unstable. This indicates that the training may be overshooting the global minimum, jumping around it before finding the optimal solution. We can also see that at the end of training the validation loss is significantly higher than the training loss, indicating that we may be overfitting, with much better performance on the training dataset than on the validation dataset.

# Conclusions and Future Work

In the section investigating different levels of noise, we can clearly see that as the amount of noise increases, both the training loss and the validation loss increase. This makes sense, since adding more noise makes the images more complex and harder to learn. In particular, the validation loss increases substantially, indicating more and more out-of-sample error. The road images, which are from the validation dataset, show that as we add more noise, the blue line (the predicted center) drifts slightly away from the red line (the true center of the right lane).

Adding noise to these images is a useful capability for evaluating evidential learning algorithms. The code in the GitHub repository[3] associated with Alexander Amini's lecture and paper is currently implemented in TensorFlow and would have to be converted to PyTorch in order to work with the implementation I put together for the analysis in this paper. Future work would include making this conversion and verifying that we still see the same results on the toy dataset currently implemented. The next task would then be to apply the evidential deep learning algorithm to this problem (predicting the center of the road).

With the capability to add varying levels of noise to our road images, we can test the evidential deep learning algorithm and examine both aleatoric and epistemic uncertainty. We can hypothesize that one or both of these will increase as we add more noise to our images. This very clean dataset of road images would allow us to test the method in a controlled environment before applying it to real-world road images. A great source of real-world road images is the KITTI dataset[4].

[1] North C Technologies: http://north-c.com/

[2] Lecture by Alexander Amini: https://www.youtube.com/watch?v=toTcf7tZK8c&t=673s

[3] GitHub repository: https://github.com/aamini/evidential-deep-learning

[4] KITTI Dataset: http://www.cvlibs.net/datasets/kitti/raw_data.php