Deep learning for computer vision

A primer for the animal behaviour scientist

Sofía Miñano

Overview

  • Overview
  • Vanilla neural network
  • Convolutional neural network
  • Applications to animal behaviour

Overview

Some preliminaries

What is Deep Learning?


Deep Learning is:

  • an approach to Artificial Intelligence
  • a type of Machine Learning that uses artificial neural networks

What is Artificial Intelligence?


Tesler’s theorem:

“AI is whatever hasn’t been done yet.”

Figure from stackoverflow.com

Figure from Waymo

Deep Learning as an approach to Artificial Intelligence


  • Early AI: problems that were intellectually challenging for humans, but easy for computers
  • True challenge: tasks that are easy for humans but hard to describe formally
  • DL has proven very powerful in solving these intuitive problems within AI.

Deep Learning is a subset of Machine Learning

  • ML allows computers to tackle problems using knowledge (data) from the real world
  • Algorithms depend heavily on the representation of the data

Figure from Deep Learning book

Deep Learning is a subset of Machine Learning

  • ML allows computers to tackle problems using knowledge (data) from the real world
  • Algorithms depend heavily on the representation of the data
  • DL is a kind of representation learning

Figure from Deep Learning book

Deep Learning is compositional

Figure from Deep Learning book

[Figure annotations: the input layer holds the variables we are able to observe; the “hidden” layers (i.e., not observable) extract increasingly abstract features; the final layer recognises the objects in the image]

Recap

Deep Learning is:

  • an approach to AI
  • a subset of ML
  • well-suited to solving intuitive problems, because it learns from data and represents the world hierarchically

Figure modified from Deep Learning book

Additional references

Anatomy of a neural network

What is a neural network?


  • Deep learning and neural networks

  • Digit recognition task as an intuitive problem

  • Multilayer perceptron

Our task

A single neuron

[Diagram: inputs x₁, x₂, x₃ feed into a single neuron f]

A multi-layer perceptron

The simplest neural network

  • An extension of the single neuron to more neurons and more layers
  • Requires a non-linear activation function
  • Also called fully-connected or feed-forward networks

Layers

Input layer

Output layer

Hidden layers

Layers: input layer

image.reshape(28*28, 1)  # flatten the 28×28 pixel image into a 784×1 input vector

Layers: output layer

[Diagram: the output layer has ten neurons, one per digit class 0–9; the neuron with the highest activation gives the prediction → 2]

Additional references

Forward pass

From layer to layer

Hidden layers

From layer to layer: one neuron

[Diagram: inputs x₁, x₂, x₃ feed into a single neuron f]


\[ f(x_1, x_2, x_3, \ldots) = \mathbf{\color{rgb(225, 174, 65)}h} \]

From layer to layer: one neuron

[Diagram: inputs x₁, x₂, x₃ connect to neuron f through weights w₁, w₂, w₃; the neuron outputs h]

  1. Compute weighted sum \[ {\color{rgb(225, 65, 185)}\Sigma} = {\color{rgb(137, 225, 65)}w_1} x_1 + {\color{rgb(137, 225, 65)}w_2} x_2 + {\color{rgb(137, 225, 65)}w_3} x_3 \]

  2. Apply non-linearity \[ \mathbf{\color{rgb(225, 174, 65)}h} = max({\color{rgb(225, 65, 185)}\Sigma}, 0) \]

From layer to layer: one neuron

[Diagram: inputs x₁, x₂, x₃ connect to neuron f through weights w₁, w₂, w₃, together with a bias b; the neuron outputs h]

  1. Compute weighted sum \[ {\color{rgb(225, 65, 185)}\Sigma} = {\color{rgb(137, 225, 65)}w_1} x_1 + {\color{rgb(137, 225, 65)}w_2} x_2 + {\color{rgb(137, 225, 65)}w_3} x_3 \]

  2. Apply non-linearity \[ \mathbf{\color{rgb(225, 174, 65)}h} = max({\color{rgb(225, 65, 185)}\Sigma} + {\color{rgb(213, 24, 24)}b}, 0) \]
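A minimal NumPy sketch of this two-step computation (the input values, weights and bias below are made-up numbers for illustration):

    import numpy as np

    x = np.array([0.2, 0.5, 0.1])    # inputs x1, x2, x3 (illustrative values)
    w = np.array([0.4, -0.3, 0.8])   # weights w1, w2, w3 (illustrative values)
    b = 0.1                          # bias

    weighted_sum = np.dot(w, x)      # step 1: Σ = w1·x1 + w2·x2 + w3·x3
    h = max(weighted_sum + b, 0.0)   # step 2: ReLU non-linearity, h = max(Σ + b, 0)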

From layer to layer: many neurons

In an MLP:

  • each connection is associated with a weight
  • each neuron is associated with a bias

From layer to layer: many neurons

[Diagrams: a fully connected layer, where every neuron h₀⁰ … hₙ⁰ of layer 0 connects to every neuron h₀¹ … hₖ¹ of layer 1, and each neuron of layer 1 has its own bias (e.g. b₀¹)]

\[ \small \qquad \color{rgb(0, 0, 255)}{h_0^1} = ReLU(\sum_{i=0}^{n} \color{rgb(137, 225, 65)}{w_i^{0,1}} \color{rgb(173, 216, 230)}{h_i^0} + \color{rgb(213, 24, 24)}{b_0^1}) \]

\[ \scriptstyle ReLU\left( \begin{bmatrix} w_{0,0} & \cdots & w_{0,n} \\ \vdots & \ddots & \vdots \\ w_{k,0} & \cdots & w_{k,n} \end{bmatrix} \begin{bmatrix} h_0^0 \\ \vdots \\ h_n^0 \end{bmatrix} + \begin{bmatrix} b_0^1 \\ \vdots \\ b_k^1 \end{bmatrix} \right) = \begin{bmatrix} h_0^1 \\ \vdots \\ h_k^1 \end{bmatrix} \]

\[ \scriptstyle ReLU\left( \begin{bmatrix} w_{0,0} & \cdots & w_{0,n} & b_0^1 \\ \vdots & \ddots & \vdots & \vdots \\ w_{k,0} & \cdots & w_{k,n} & b_k^1 \end{bmatrix} \begin{bmatrix} h_0^0 \\ \vdots \\ h_n^0 \\ 1 \end{bmatrix} \right) = \begin{bmatrix} h_0^1 \\ \vdots \\ h_k^1 \end{bmatrix} \]

\[ ReLU(\color{rgb(137, 225, 65)}{\mathbf{W}'} \color{rgb(173, 216, 230)}{h^{0^{\prime}}} ) = \color{rgb(0, 0, 255)}{h^1} \]
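As a sketch, the matrix form h¹ = ReLU(W h⁰ + b) for one fully connected layer in NumPy (the layer sizes and random values below are arbitrary placeholders):

    import numpy as np

    def relu(z):
        return np.maximum(z, 0)

    # One layer: h1 = ReLU(W h0 + b), with n = 4 inputs and k = 3 neurons
    rng = np.random.default_rng(0)
    W = rng.normal(size=(3, 4))    # one row of weights per neuron in layer 1
    b = rng.normal(size=(3, 1))    # one bias per neuron in layer 1
    h0 = rng.normal(size=(4, 1))   # activations of the previous layer

    h1 = relu(W @ h0 + b)          # shape (3, 1): the next layer's activations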

A two-layer network

[Diagram: a two-layer network with an input layer x₀ … xₙ, a hidden layer h₀ … hₖ and an output layer y₀ … yₘ, fully connected between consecutive layers, with biases in each layer]



\[ \color{rgb(255, 176, 0)}{y} = \color{rgb(143, 204, 143)}{W_1} \color{rgb(92, 144, 224)}{ReLU({W_0}x)} \]
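A minimal sketch of this two-layer network for the digit task, with 784 = 28 × 28 inputs and 10 output scores (the hidden width k and the random weights are placeholders, not trained values):

    import numpy as np

    def relu(z):
        return np.maximum(z, 0)

    k = 64                               # hidden layer width (placeholder)
    rng = np.random.default_rng(0)
    W0 = rng.normal(size=(k, 28 * 28))   # input -> hidden weights
    W1 = rng.normal(size=(10, k))        # hidden -> output weights

    x = rng.random((28 * 28, 1))         # a flattened 28×28 input image
    y = W1 @ relu(W0 @ x)                # raw scores, one per digit class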

Forward pass

\[ \scriptstyle ReLU(\color{rgb(255,0,0)}{W} h^n) = h^{n+1} \]

[Figure: the forward pass maps the input image to the predicted digit → 2]

How to choose them?

Additional references

Training and backward pass

Training: intuition

[Figure: supervised learning uses labelled data, i.e. pairs of an image and its label, such as (image of a handwritten digit, 2); these labelled pairs make up the training set]

Testing: intuition

[Figure: the trained network is applied to a held-out test set and evaluated by its accuracy]

Dataset split


  • Hyperparameters
  • Keep test set aside! ⚠️

Figures modified from CS231n lecture notes: neural networks

Dataset split


  • Hyperparameters
  • Keep test set aside! ⚠️
  • So how do we choose hyperparameters, then?

Figures modified from CS231n lecture notes: neural networks
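One common answer is to carve a validation set out of the training data and use it to compare hyperparameter choices, while the test set stays untouched until the very end. A minimal sketch (the 70/15/15 proportions are an assumption, not a rule):

    import numpy as np

    n_samples = 1000
    rng = np.random.default_rng(0)
    indices = rng.permutation(n_samples)

    n_train = int(0.70 * n_samples)
    n_val = int(0.15 * n_samples)

    train_idx = indices[:n_train]                # used to fit the weights
    val_idx = indices[n_train:n_train + n_val]   # used to choose hyperparameters
    test_idx = indices[n_train + n_val:]         # kept aside until the very end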

Training as an optimisation problem

Loss function

Gradient descent

Loss function

Classes:       0      1      2      3      4      5      6      7      8      9
Raw scores:   -6.8    2.6    6.7    5.9    1.9   -1.3   -5.7    3.2    1.3    1.0

Probabilities

Loss function

Raw scores z:  -6.8    2.6    6.7    5.9    1.9   -1.3   -5.7    3.2    1.3    1.0

Probabilities

Ground truth p

Softmax:

\[ \tiny \widehat{p} (z_{j}) = \frac{e^{z_j}}{\sum_{k} e^{z_k}} \]

Cross-entropy:

\[ \tiny \text{H}(p, \widehat{p}) = -\sum_{k} p_k \log(\widehat{p}_{k}) \]

Loss: \[ \tiny \text{L}_{i} = - \log(\widehat{p} (z_{j=y_i}) ) \]

\[ \tiny \text{L} = \frac{1}{N} \sum_{i} \text{L}_{i} \]
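A minimal NumPy sketch of these formulas applied to the raw scores above, assuming the ground-truth digit for this sample is 2:

    import numpy as np

    z = np.array([-6.8, 2.6, 6.7, 5.9, 1.9, -1.3, -5.7, 3.2, 1.3, 1.0])
    y_true = 2                       # assumed ground-truth class

    p_hat = np.exp(z - z.max())      # subtract max(z) for numerical stability
    p_hat /= p_hat.sum()             # softmax: probabilities over the 10 classes

    loss_i = -np.log(p_hat[y_true])  # cross-entropy loss for this sample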

Optimisation: intuition

Optimisation: intuition

Seven optimisation takeaways!

  1. We can think of the loss function as a high-dimensional “surface” over the network’s parameters.

  2. The gradient is a vector that at any point in the loss “surface” gives us the direction of steepest ascent.

  3. The negative gradient gives us the direction of steepest descent.

  4. Gradient descent is an optimisation procedure that iteratively adjusts the parameters based on the gradient.

Until when?….

Seven optimisation takeaways!

  5. To update the parameters we take a small step in the direction of the negative gradient (a minimal sketch follows this list). \[ W_{new} = W_{old} - \alpha \nabla \text{L}_W \]

  6. Stochastic gradient descent is a more efficient variant of gradient descent, which computes the gradient on batches of training samples.

  7. An epoch is a single pass through the complete training set. A training process typically consists of multiple epochs.
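A minimal sketch of the update in takeaway 5 (the shapes, learning rate and placeholder gradient are made up; in practice the gradient comes from the backward pass):

    import numpy as np

    alpha = 0.01                            # learning rate (placeholder value)
    rng = np.random.default_rng(0)
    W = rng.normal(size=(10, 784))          # current parameters
    grad_L_W = rng.normal(size=(10, 784))   # placeholder for the gradient of L w.r.t. W

    W = W - alpha * grad_L_W                # one small step downhill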

A reminder

  • In training: forward and backward pass

  • In testing and inference: only forward pass

Additional references

Convolutional neural networks

Convolutional neural networks


  • Regular NNs don’t scale well to images
  • CNNs take advantage of the fact that their inputs are images

Layers used in CNNs

Four common types:

  • Convolutional layer
  • Pooling layer
  • Batch normalisation layer
  • Fully connected layer

Layers used in CNNs

Four common types:

  • Convolutional layer
  • Pooling layer
  • Batch normalisation layer
  • Fully connected layer

Convolutional layer

  • A set of n learnable filters
  • Each filter is a small matrix of weights + 1 bias
  • We slide (convolve) each filter across the width and height of the input volume; each filter extends through the input’s full depth (a naive sketch follows the figures below)

Figure modified from CS231n

Figure modified from CS231n

Figure modified from CS231n
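A naive NumPy sketch of the sliding operation for a single filter on a single-channel input (stride 1, no padding); real convolutional layers apply many filters across multi-channel inputs with far faster implementations:

    import numpy as np

    def conv2d_single(image, kernel, bias=0.0):
        """Convolve one filter over a 2D single-channel image (stride 1, no padding)."""
        ih, iw = image.shape
        kh, kw = kernel.shape
        out = np.zeros((ih - kh + 1, iw - kw + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                patch = image[i:i + kh, j:j + kw]        # region under the filter
                out[i, j] = np.sum(patch * kernel) + bias
        return out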

Convolutional layer

A few hyperparameters:

  • number of filters
  • filter size
  • stride
  • padding

Stride = 1. Figure from Convolution arithmetic

Stride = 2. Figure from Convolution arithmetic
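These hyperparameters determine the spatial size of the output. A small helper using the standard formula (assuming square inputs and filters):

    def conv_output_size(input_size, filter_size, stride=1, padding=0):
        # (input - filter + 2*padding) / stride + 1, along one spatial dimension
        return (input_size - filter_size + 2 * padding) // stride + 1

    conv_output_size(28, 3, stride=1, padding=0)   # -> 26
    conv_output_size(28, 3, stride=2, padding=1)   # -> 14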

Layers used in CNNs

Four common types:

  • Convolutional layer
  • Pooling layer
  • Batch normalisation layer
  • Fully connected layer

Pooling layer

An example CNN architecture: VGG-16

ImageNet 2014 challenge (1000 categories)

Figure from Neuralception

An example CNN architecture: ResNet-18

Figure from LearnOpenCV

Transfer learning

  • Few people train a CNN from scratch

  • More common scenarios:

    • Fine-tune a pretrained model (e.g. backbones for SLEAP, DLC, etc.)

    • Use as a feature extractor (e.g. DINOv2)

    • Directly use the model for inference (e.g. OpenPose)
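As an illustration of the fine-tuning scenario, a hedged sketch using PyTorch/torchvision (assuming a recent torchvision is installed; the 10-class output head is a placeholder for whatever the downstream task needs):

    import torch.nn as nn
    import torchvision.models as models

    # Load a ResNet-18 pretrained on ImageNet
    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

    # Freeze the backbone so it acts as a feature extractor
    for param in model.parameters():
        param.requires_grad = False

    # Replace the final fully connected layer with a new, trainable head
    model.fc = nn.Linear(model.fc.in_features, 10)   # 10 classes: placeholder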

Data augmentation

  • To improve performance, train on more and diverse data.
  • One easy way: transform the images we already have, while preserving the label
  • The choice of transformations depends on the task and the dataset (see the sketch below).
  • Reduces overfitting and improves robustness.
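A minimal NumPy sketch of label-preserving transforms for a single-channel image (assuming pixel values in [0, 1]; note that, for example, horizontal flips would not preserve the label of a handwritten digit):

    import numpy as np

    def augment(image):
        # Return simple transformed copies of the image; the right set of
        # transforms depends on the task and the dataset.
        flipped = np.fliplr(image)                  # horizontal flip
        brighter = np.clip(image * 1.2, 0.0, 1.0)   # brightness increase
        darker = np.clip(image * 0.8, 0.0, 1.0)     # brightness decrease
        return [flipped, brighter, darker]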

Additional references

CV applications in animal behaviour

CV tasks in animal behaviour


What are tasks?

  • A task is a problem that we want to solve
  • There may be multiple ways to solve a task
  • DL has proven very powerful at solving many vision tasks

CV tasks in animal behaviour

  • Image classification
  • Detection
  • Segmentation
  • Pose estimation
  • Tracking
  • Behaviour classification
  • Re-identification
Which species are present in this image?

Which pixels are “mouse”?

Sample image from Aeon project

Which pixels are “mouse 1”?

Sample image from Aeon project

How do detections in frame f map to frame f+1?

What is the behaviour of the animal in this frame/clip?

Which individual is the animal in this frame?