The AI Safety course, specifically the Introduction to Machine Learning section of the AI Alignment curriculum, starts by introducing the basic concepts of a neural network. The great thing about this curriculum is that it pulls from free, open-source materials. The introduction to neural networks is accomplished by leveraging Grant Sanderson's video But what is a neural network? from his YouTube channel 3Blue1Brown. With that, I'll be using screenshots from 3Blue1Brown to help demonstrate the concepts discussed here.
Structure
The initial approach is what could be considered the "Hello, world!" of machine learning (ML). That is, many begin their ML journey by looking at how a neural network can be used to recognize hand-written numerical digits. This example uses a 28 × 28 = 784 pixel image for the digit.
As a primer for more current and advanced neural networks, such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), transformers, etc., it's important to first learn about the very basic "plain vanilla" neural networks, which consist of a relatively small number of layers and neurons. Let's dig into that.
So, What is a Neuron?
The simplest way for me to think about a neuron is to imagine it as a light bulb. When the bulb is emitting light, it is active. When it isn't emitting light, it is inactive. Pretty basic and straightforward. However, it's also important to know that neurons in a neural network are treated as a gradient of activation, rather than as a strict binary on/off. Continuing with the light bulb analogy, you can imagine that the bulb is connected to a dimmer switch. The bulb can emit its full capacity of light, no light, or any amount of light in between those two extremes. This is actually very similar to how neurons in the brain work, but I think the light bulb example is more approachable.
Each neuron in this digit recognition neural network can have an activation ranging from 0.0 to 1.0, where 0.0 is considered a completely inactive pixel and 1.0 is a fully activated pixel.
Having the activation values for each pixel is great, but what does any of that mean for the neural network? In short, we have to put each of the neurons and their corresponding activation values into the first layer of our neural network. You can imagine taking each row (or column, if you prefer) of neurons and arranging all of them (784 to be exact) into a giant vertical line of neurons. That's it! We just created the first layer, the input layer, of our neural network.
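That "unrolling" step can be sketched in a few lines of NumPy. This is a minimal sketch, not code from the video: the `image` array here is randomly generated just to stand in for a real 28 × 28 grayscale digit.

```python
import numpy as np

# A stand-in for a 28 x 28 grayscale image, with pixel values in [0, 255].
image = np.random.randint(0, 256, size=(28, 28))

# Scale each pixel to an activation between 0.0 and 1.0, then flatten
# the grid into a single column of 784 input neurons.
activations = (image / 255.0).flatten()

print(activations.shape)  # (784,)
```

Each entry of `activations` is one input neuron's brightness on the dimmer-switch scale from 0.0 to 1.0.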
Moving from the first layer to the final layer, we encounter two "hidden layers" and then the last layer, the output layer. It's the output layer where the network states which digit it thinks corresponds to the provided hand-written digit. As you can see, the output layer has 10 neurons that correspond to the digits 0-9. The neuron with the highest activation value determines the output digit of the network.
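Picking the winning digit is just an argmax over the 10 output neurons. A minimal sketch, with made-up activation values for illustration:

```python
import numpy as np

# Hypothetical activations for the 10 output neurons (digits 0-9).
output_activations = np.array(
    [0.02, 0.01, 0.05, 0.90, 0.03, 0.01, 0.02, 0.04, 0.06, 0.01]
)

# The network's answer is the digit whose neuron is most active.
predicted_digit = int(np.argmax(output_activations))
print(predicted_digit)  # 3
</```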
The Layers
When Grant Sanderson designed the architecture for this neural network, he arbitrarily chose to have two hidden layers, with 16 neurons each. Despite the arbitrary number of layers and neurons, it still provides ample room to demonstrate the impact of activation at each layer on the output layer.
I decided to dig a little deeper into how to choose the number of hidden layers after learning about Grant's arbitrary decision. Quickly going through Stack Overflow, Medium, and other sites, I was surprised to learn that it's actually common, and even expected, to start with only one or two hidden layers. This is generally sufficient for networks that don't absolutely require more complexity.
The primary reason for the layers is to break down the initial input layer into potential subcomponents that can be used to determine the output layer. Exactly how that's accomplished is dependent on weights, biases, and training (I'll cover training in a separate post).
The layered structure of the neural network is great because it allows you to break down difficult problems into bite-size steps, so that moving from one layer to the next is relatively straightforward. - Grant Sanderson
Moving from One Layer to the Next
Each of the input neurons will require a weight, which allows us to tell the network how much influence that particular neuron has on the subsequent layer. A positive weight indicates that the neuron in the second layer should be active. A negative weight indicates that the neuron in the second layer should be inactive.
To make this work, you compute the value of the neuron in the second layer by taking the weighted sum of all the neurons in the first layer.
w₁a₁ + w₂a₂ + w₃a₃ + w₄a₄ + ⋯ + wₙaₙ
This allows us to put a higher or lower emphasis on the regions we care about or don't care about, respectively.
In this example, the blue pixels indicate positive weights for the area we care about, while the red pixels indicate negative weights for the areas we don't really care about.
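The weighted sum itself is a one-liner once the activations and weights are in arrays. A minimal sketch, with random values standing in for real activations and learned weights:

```python
import numpy as np

rng = np.random.default_rng(0)

# Activations of the 784 input neurons (values in [0, 1]).
a = rng.random(784)

# One weight per input neuron, for a single second-layer neuron.
# Positive weights emphasize regions we care about; negative
# weights de-emphasize regions we don't.
w = rng.normal(size=784)

# The weighted sum w1*a1 + w2*a2 + ... + wn*an.
weighted_sum = np.dot(w, a)
print(weighted_sum)
```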
The Sigmoid Function Squish Effect
The problem with using a weighted sum is that it can result in any number being calculated; what we need is a number somewhere between 0 and 1. This is where the sigmoid function comes in. In short, it squishes the total sum into a value that lies between the 0 to 1 range that we need. The input values are mapped in a way that causes very negative numbers to approach 0, and very positive numbers approach 1.
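The sigmoid function is simple enough to write out directly. A minimal sketch of the squishing behavior described above:

```python
import math

def sigmoid(x: float) -> float:
    """Squish any real number into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-x))

# Very negative inputs approach 0; very positive inputs approach 1.
print(sigmoid(-10))  # ~0.000045
print(sigmoid(0))    # 0.5
print(sigmoid(10))   # ~0.99995
```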
Now we have the values we need to indicate the activation of a neuron. Amazing! However, this may not provide the expected results.
A "bias" can be used to help tweak the results so that the network recognizes activations correctly. More specifically, if the neural network shouldn't activate all neurons with a weighted sum greater than 0, a bias can be used to only activate those neurons with a meaningful value above 0. In this example from 3blue1brown, a bias of -10 is used to only activate weighted sums that are greater than 10.
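In code, the bias is simply added to the weighted sum before the sigmoid is applied. A minimal sketch using the -10 bias from the example (the `activation` helper and its inputs are illustrative, not from the video):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def activation(weighted_sum: float, bias: float = -10.0) -> float:
    # The bias shifts the threshold: with bias = -10, the weighted
    # sum must exceed 10 before the neuron activates meaningfully.
    return sigmoid(weighted_sum + bias)

print(activation(5.0))   # near 0: a sum of 5 doesn't clear the threshold
print(activation(15.0))  # near 1: a sum of 15 is well past it
```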
One thing to keep in mind is that the calculation in the image above, going from the input layer to the second layer, is for just one single neuron. Every neuron goes through the same weighted sum, bias, and sigmoid treatment. In total, this relatively basic neural network has 13,002 weights and biases. With an understanding of how those weights and biases work, it's easier to demystify the hidden layers (at least somewhat) to make more accurate predictions or catch anomalies.
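That 13,002 figure can be verified by counting one weight per connection between adjacent layers, plus one bias per non-input neuron:

```python
# Layer sizes for this network: 784 inputs, two hidden layers of 16,
# and 10 outputs.
layers = [784, 16, 16, 10]

# One weight for every connection between adjacent layers.
weights = sum(n_in * n_out for n_in, n_out in zip(layers, layers[1:]))

# One bias for every neuron outside the input layer.
biases = sum(layers[1:])

print(weights + biases)  # 13002
```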
An important note to recognize is that there is a more compact notation for this whole computation. The way to do this is via matrix multiplication, which simultaneously computes the activations of all the neurons in the next layer. Rather than spend too much time going over linear algebra's matrix multiplication concepts, the main thing to know is that doing it this way organizes all of the weights into a matrix and all of the biases into a vector: the weight matrix is multiplied by the vector of activations, the bias vector is added to that matrix-vector product, and the sigmoid is applied to each component. The end result is a very concise expression.
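That matrix form can be sketched in NumPy, where `@` is the matrix-vector product. The weights and biases here are randomly initialized purely for illustration; in a real network they would come from training:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

# Activations of the 784 input neurons.
a = rng.random(784)

# Weight matrix (16 x 784) and bias vector (16,) for the first
# hidden layer of 16 neurons.
W = rng.normal(size=(16, 784))
b = rng.normal(size=16)

# One matrix-vector product computes all 16 weighted sums at once;
# adding b and applying sigmoid gives the next layer's activations.
a_next = sigmoid(W @ a + b)

print(a_next.shape)  # (16,)
```

Each row of `W` holds the 784 weights for one hidden neuron, so this single line replaces 16 separate weighted-sum calculations.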
Conclusion
The very basics of a neural network consist of neurons that can be active or inactive. The degree to which the neurons are active depends on how the weights and biases are structured. Finally, using mathematical tools like the sigmoid function and matrix multiplication can help simplify and predict what happens within the hidden layers, resulting in a more meaningful (and accurate) output. It is important to note that this simplified example of a neural network, using the sigmoid function, is actually quite outdated and isn't really used in modern neural networks. In any case, this is a great place to begin understanding how neural networks function, which is a good primer for moving on to the much more complex aspects of neural networks.
Since this was my first real post about a topic that I began digging into, this post is lacking any real information to do with AI Safety in particular (even though this was part of the curriculum). I'll likely avoid doing a writeup on each lesson I go through moving forward. That way I can write more well-rounded posts that encompass the bigger picture, while taking deep dives into the necessary aspects.