Thanks.

Mimicking biological structures, the product of millions of years of evolution, has worked well in the object detection field. Thanks go to Warren McCulloch and Walter Pitts, the scientists who came up with the Artificial Neural Network idea; to Kunihiko Fukushima, who applied pattern recognition to neural networks; to Yann LeCun, the designer of the first Convolutional Neural Network; of course to Alex Krizhevsky, the man who started the age of CNNs and the architect of AlexNet; and to all the others who contributed to this purpose with their valuable work.

I tried to explain Convolutional Neural Networks in this presentation. I hope you enjoyed it.

This is a SWE-596 project for the Bogazici University Software Engineering Master of Science Program, prepared by Ferhat SAL - 2021.

Nemo vir est qui mundum non reddat meliorem! ("What man is a man who does not make the world better?")

Last Step: Softmax Classifier

The last step of the network is applying the Softmax function to the logits produced by the fully connected layer. The mathematical calculation takes the exponential of each score and divides it by the sum of all the exponentials, as shown in the figure. By applying this math to the scores (logits), softmax turns them into a vector of K real values that sum to 1, where K is the class count. This means that at the end we have a probability vector over the classes, and we can select the largest one as our prediction.
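As a minimal sketch of this calculation, the hypothetical NumPy snippet below applies softmax to three made-up logits (the values and the class count are illustrative, not from the presentation):

    import numpy as np

    def softmax(logits):
        # Subtract the max for numerical stability; softmax is unchanged
        # when all logits are shifted by the same constant.
        exps = np.exp(logits - np.max(logits))
        return exps / np.sum(exps)

    logits = np.array([2.0, 1.0, 0.1])   # hypothetical scores for K = 3 classes
    probs = softmax(logits)
    print(probs)                         # [0.659 0.242 0.099] -- sums to 1
    print(probs.argmax())                # index 0 is the predicted class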

Of course, during the training process these predictions are compared to the real class values for those inputs, and a loss function such as cross-entropy is calculated. Then the CNN uses these "errors": it gets the gradients (derivatives) of the loss with respect to each layer's inputs and updates the weights by a small learning-rate step, as mentioned in the backpropagation section.
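A minimal sketch of the cross-entropy loss for a single sample, reusing the hypothetical probabilities from the softmax example above:

    import numpy as np

    def cross_entropy(probs, true_class):
        # Negative log-probability assigned to the correct class:
        # near 0 when the network is confident and right, large when wrong.
        return -np.log(probs[true_class])

    probs = np.array([0.659, 0.242, 0.099])
    print(cross_entropy(probs, true_class=0))   # ~0.417; a perfect prediction gives 0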

After the training completes, the CNN model weights are saved, so we can run the CNN to predict objects in images using these saved weights.

Fully Connected Layer

CNNs contain a series of Convolutional Layers, ReLU layers and Pooling Layers, and finally end with the "Classifier Part" of the network. The "Classifier of the Network" is made of a Fully Connected Layer plus an Output Layer. To feed the fully connected layer, the result of the last layer (mostly a pooling layer), a feature map matrix, must be flattened into a feature vector.

Flattening converts the data into a 1-dimensional array for input to the next layer. We flatten the output of the convolutional layers to create a single long feature vector. This vector is connected to the final classification model, which is called a fully connected layer. In other words, we put all the pixel data in one line and make connections with the final layer.

The fully connected layer calculates the dot product of the flattened values and the weights matrix, then adds biases. This linear transformation step produces "class scores", or in ML jargon, "logits". Logits are the unnormalised (not-yet-normalised) predictions (outputs) of a network.
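A minimal sketch of flattening plus the linear transformation, with hypothetical sizes (2 channels of 4x4 feature maps, 3 classes):

    import numpy as np

    feature_maps = np.random.rand(2, 4, 4)   # output of the last pooling layer

    x = feature_maps.flatten()               # (2, 4, 4) -> vector of 32 values

    W = np.random.randn(3, 32) * 0.01        # weights: one row per class
    b = np.zeros(3)                          # biases

    logits = W @ x + b                       # dot product + biases -> class scores
    print(logits.shape)                      # (3,) -- one unnormalised score per class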

Max Pooling Layer

After some Convolutional layers followed by ReLU layers, CNNs apply pooling, or more precisely max-pooling. Pooling layers are used to simplify the output of a convolutional layer without compromising the spatial relationships, and they reduce the dimensions of the data, which is known as "downsampling".

Getting rid of the redundant feature data in the feature map, by selecting the maximum value inside the applied filter window, is crucial for speeding up the training process and decreasing memory consumption.

However, the output feature maps are sensitive to the spatial location of the features in the input. Downsampling the input feature maps with the max-pooling technique, sliding through the map and selecting the local maximums, makes the resulting downsampled feature maps more robust to changes in the position of the feature in the image, referred to by the technical phrase "local translation invariance".
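A minimal max-pooling sketch on a hypothetical 4x4 feature map, using a 2x2 window with stride 2:

    import numpy as np

    def max_pool_2x2(fmap):
        # Group the map into 2x2 windows and keep only the maximum of each,
        # halving both spatial dimensions.
        h, w = fmap.shape
        return fmap.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

    fmap = np.array([[1, 3, 2, 0],
                     [4, 6, 1, 1],
                     [0, 2, 9, 5],
                     [1, 2, 3, 7]])
    print(max_pool_2x2(fmap))
    # [[6 2]
    #  [2 9]]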

ReLU Layer

After the Convolutional Layers generate feature maps, the CNN runs the ReLU (Rectified Linear Unit) layer, which uses the rectifier as its activation function. If we look closely at a rectifier function, we can easily see that it simply turns negative input values to zero, while positive values are output unaltered.

This means ReLU removes all negative values in the input data, which corresponds to the feature map generated by the previous convolutional layer, as shown in the figure.
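A one-line NumPy sketch of this behavior on a small made-up feature map:

    import numpy as np

    feature_map = np.array([[-2.0, 1.5],
                            [ 0.3, -0.7]])

    print(np.maximum(0, feature_map))   # negatives -> 0, positives unchanged
    # [[0.  1.5]
    #  [0.3 0. ]]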

Convolutional Layer

CNNs use "filters" or "kernels" to analyze the pixels in a spatial neighborhood by applying some calculations to the "target area" of the input data and the "filter data". The target area is the portion of the entire input image that has the same size as the applied filter.

Filter Sliding

A filter starts at the top-left of the image, gets a portion of the image (a patch), let's say 3x3 pixels, and applies a calculation using the values in that image patch, such as multiplying each pixel value with the corresponding filter value. Then the filter slides to the next position, until it finishes scanning the whole image. The "amount of sliding" is called the "stride", which is important because it is used as a hyperparameter.

This operation is called "convolution", and the result of the filtering process is a feature map, which is more meaningful for the purpose of the network.
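A minimal sketch of this sliding-filter operation in NumPy (strictly speaking it computes cross-correlation, which is what CNN libraries implement under the name "convolution"; the image and the vertical-edge kernel are hypothetical):

    import numpy as np

    def convolve2d(image, kernel, stride=1):
        # Slide the kernel over the image; at each position multiply the
        # image patch element-wise with the kernel and sum the products.
        kh, kw = kernel.shape
        out_h = (image.shape[0] - kh) // stride + 1
        out_w = (image.shape[1] - kw) // stride + 1
        out = np.zeros((out_h, out_w))
        for i in range(out_h):
            for j in range(out_w):
                patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
                out[i, j] = np.sum(patch * kernel)
        return out

    image = np.random.rand(6, 6)
    kernel = np.array([[1, 0, -1],
                       [1, 0, -1],
                       [1, 0, -1]])            # a simple vertical-edge filter
    print(convolve2d(image, kernel).shape)     # (4, 4) feature map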

Convolutional Neural Networks

Problem

Traditional neural networks have some disadvantages in processing images and detecting objects. They use a neuron for every single input element, such as each pixel in an image. Since every input has a weight in a NN model, the total count of those weights becomes huge when the image size gets larger, which makes the training process hard and both power- and time-consuming. For example, for an image with only 224x224 pixels of RGB data we must adjust 224 x 224 x 3 = 150,528 weights during training! Moreover, traditional neural networks are weak against spatial changes such as shifting or rotating.

Solution

To solve the problems of traditional networks such as MLPs, we can use Convolutional Neural Networks. Convolutional Neural Networks are an altered and improved version of fully connected "traditional" neural networks.

CNNs apply some special operations which extract features from images, reduce the data size for the next layers, and keep crucial spatial information. After these operations they feed a fully connected layer and make predictions about the object class in the images.

Backpropagation

The network must apply this adjustment process to all layers and all the neurons in them. How? Here another piece of magic appears: the backpropagation algorithm takes control. The obtained loss values are pushed back through the network in order to update the parameters (which are nothing but the weights), and these updates reduce the total loss value.

After completing the first sample of the dataset, the learning algorithm takes the next one and applies all the steps again. The loop continues until the network can produce an acceptable prediction accuracy.

To summarize and simplify the whole learning process: a neural network "knows nothing" at the beginning of the learning. It starts with random parameters (weights), takes the training data, applies all the "transformations" to that input layer by layer, and makes some predictions (which are far away from reality for the first samples). Then it learns how bad or good the predictions are by comparing them with the real answers given in the training set, uses this information to adjust the weights slightly, and repeats the same process with the next samples until the network's predictions are good enough.
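A minimal sketch of this loop for a single linear "neuron" learning the toy rule y = 2x; the dataset, learning rate and epoch count are made-up illustrations, not a full CNN:

    import numpy as np

    rng = np.random.default_rng(0)
    w = rng.standard_normal()              # random weight: the network "knows nothing"
    lr = 0.1                               # small learning rate

    data = [(x, 2 * x) for x in [1.0, 2.0, 3.0]]   # (input, real answer) pairs

    for epoch in range(20):                # repeat until predictions are good enough
        for x, y_true in data:
            y_pred = w * x                 # forward pass: make a prediction
            grad = 2 * (y_pred - y_true) * x   # slope of the squared-error loss
            w -= lr * grad                 # adjust the weight slightly

    print(w)                               # converges close to 2.0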

How Does a Neural Network Learn?

"The Loss"

The answer lies in the supervised learning technique, in which a labeled input dataset is used to train your network, in other words to adjust "those weights".

The learning algorithm starts with weight values (ω) which do not make any sense, namely random values. Then it takes the first item in the training dataset as an input, processes it across the network, and produces a prediction at the last layer.

The algorithm compares this first "primitive" prediction with the labeled values in the dataset, which are the real answers. The result of this comparison gives the "error" value, in other words the "loss".

"Gradient Descent"

If these loss values can be minimized for every given input of the training dataset, the predictions become more accurate. That means we have a loss function and we have to find the minimum of this loss function, which is a common task in calculus:

We will use the slope of the function and gradually change our parameters according to the value of the slope, until we find a local minimum! The name of the method used to minimize the loss values is “gradient descent”.
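A toy sketch of the idea (the loss function is an assumed example): minimize L(w) = (w - 3)^2, whose slope is 2(w - 3):

    w = 0.0            # start far from the minimum
    lr = 0.1           # step size (learning rate)

    for step in range(50):
        slope = 2 * (w - 3)   # derivative of the loss at the current w
        w -= lr * slope       # move against the slope, i.e. downhill

    print(w)           # approaches 3.0, the minimum of the loss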

Neural Networks

Now that we have given a lot of information about these artificial neurons, we can focus on the "network of these artificial neurons". Neurons are mostly aggregated into network layers, and each of these layers may apply different processes or computations to its input data.

The input data of the whole network is processed from the first layer through to the last layer of the network. The first layer is usually named the "Input Layer", and the last one the "Output Layer". The multiple layers between these two are known as "Hidden Layers".

The connections between layers may appear in different organizations. For example, if every neuron in a layer is connected to every neuron of the next one, this is called "fully connected". But if a group of neurons in a layer is connected to just a single neuron of the next layer, this is called "pooling".

"Training & Learning"

The "learning" of a neural network is gradual better performance of the network in its intended task using the given datasets . For example making a prediction about the class in which the input belongs to. To do this , network uses the weights of inputs, which means which inputs has stronger effect on each class. 

The magic of a neural network shows itself at exactly this point: it adjusts the weights (including biases) of the neurons until they reach optimal values. This is called "learning", and the process is called "training".

Activation Functions

Activation functions can simulate biological neurons by outputting a value with a positive meaning, such as 1, if the summation of the inputs is over a threshold, and a value with a negative meaning, such as 0, in the opposite case. This behavior can be described as a "step function". But an activation function does not have to be a step function that produces only binary outputs such as 1 and 0; it can produce "probability values" in the range between 0 and 1, or it can output any other values according to the purpose and design. Activation functions can also be classified as linear or nonlinear functions. The Sigmoid function, the Rectifier function and the Softmax function are examples of activation functions.
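A minimal sketch of a binary step activation (the threshold of 0 is an arbitrary assumption):

    import numpy as np

    def step(x, threshold=0.0):
        # Outputs 1 where the input exceeds the threshold, 0 otherwise.
        return np.where(x > threshold, 1, 0)

    print(step(np.array([-1.2, 0.4, 3.0])))   # [0 1 1]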

Sigmoid Function Example

If the purpose of the activation function is to produce values between 0 and 1 (probabilities), a Sigmoid function is the appropriate choice, because, as the figure below shows, if the input is a large negative value the sigmoid function outputs a value close to 0, and if the input is a large positive value it tends to output a result close to 1.
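A minimal sigmoid sketch showing this behavior on a few made-up inputs:

    import numpy as np

    def sigmoid(x):
        return 1 / (1 + np.exp(-x))

    print(sigmoid(np.array([-6.0, 0.0, 6.0])))
    # [0.00247 0.5     0.99753] -- large negative -> ~0, large positive -> ~1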

Artificial Neurons

Definition

Just like the "Biological Neurons" in the human brain do , Artificial Neurons take some “signals” as inputs, process those signals using applying some sort of calculations and then can send signals as outputs to the connected neurons.

In the light of this description, to define an artificial neuron in more mathematical terms, we can simply say that an artificial neuron takes some inputs such as x1, x2, ..., xn, that these inputs have weights such as ω1, ω2, ..., ωn, and that it processes these weighted inputs, for example by calculating their summation. It then uses a function, mostly called the activation or transfer function, to "normalize" the result and produce an output value y, which will be an input to the connected target neuron.
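A minimal sketch of such a neuron in NumPy; the input values, weights and bias are hypothetical, and a sigmoid serves as the activation function:

    import numpy as np

    def neuron(x, w, b, activation):
        # Weighted sum of the inputs plus a bias, passed through an
        # activation function: y = f(w1*x1 + ... + wn*xn + b)
        return activation(np.dot(w, x) + b)

    x = np.array([0.5, 0.3, 0.2])    # inputs  x1..xn
    w = np.array([0.4, 0.7, -0.2])   # weights w1..wn
    b = 0.1                          # bias
    y = neuron(x, w, b, activation=lambda z: 1 / (1 + np.exp(-z)))
    print(y)                         # the output signal sent to connected neurons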

Why?

The reasoning behind this logic relies on the weights of the inputs; the final purpose is understanding what effect an input has on the output. Of course, this cannot be achieved by using just one neuron; we have to connect thousands of them in a network. But to answer the question "why are we doing this stuff, such as taking the summation of inputs times weights, adding biases to them, then producing an output if the calculated value is over a threshold?", we have to know what our purpose is here.

Convolutional Neural Networks

 

The state-of-the-art technology for object classification and object detection in images is the Convolutional Neural Network. These stunning algorithms and the math behind them are hard to understand for many people who are not familiar with the Deep Learning and Computer Vision areas. Although CNNs are complicated architectures, like all complicated things in the world, if they can be explained with visualizations, every average person can easily get the idea behind them and adore this product of science.

In this presentation the main purpose is to explain CNNs with the support of SVG graphics.

This is a SWE-596 project for the Bogazici University Software Engineering Master of Science Program, prepared by Ferhat SAL - 2021.

 

[Figure: example CNN architecture with Convolution, Max-Pool, Convolution, Max-Pool and Dense layers; feature map sizes 8@128x128, 8@64x64, 24@48x48, 24@16x16, 1x256, 1x128]