Deep learning for computer vision

A primer for the animal behaviour scientist

Sofía Miñano

Overview

  • Overview
  • Vanilla neural network
  • Convolutional neural network
  • Applications to animal behaviour

Overview

Some preliminaries

What is Deep Learning?


Deep Learning is:

  • an approach to Artificial Intelligence
  • a type of Machine Learning that uses artificial neural networks

What is Artificial Intelligence?


Tesler’s theorem:

“AI is whatever hasn’t been done yet.”

Figure from stackoverflow.com

Figure from Waymo

Deep Learning as an approach to Artificial Intelligence


  • Early AI: problems that were intellectually challenging for humans, but easy for computers
  • True challenge: tasks that are easy for humans but hard to describe formally
  • DL has proven very powerful in solving these intuitive problems within AI.

Deep Learning is a subset of Machine Learning

  • ML allows computers to tackle problems using knowledge (data) from the real world
  • Algorithms depend heavily on the representation of the data

Figure from Deep Learning book

Deep Learning is a subset of Machine Learning

  • ML allows computers to tackle problems using knowledge (data) from the real world
  • Algorithms depend heavily on the representation of the data
  • DL is a kind of representation learning

Figure from Deep Learning book

Deep Learning is compositional

Figure from Deep Learning book

[Figure annotations: the input layer holds the variables we are able to observe; the “hidden” layers (i.e., not observable) extract increasingly abstract features; the final layer recognises the objects in the image]

Recap

Deep Learning is:

  • an approach to AI
  • a subset of ML
  • well-suited to solving intuitive problems, because it learns from data and represents the world hierarchically

Figure modified from Deep Learning book

Additional references

Anatomy of a neural network

What is a neural network?


  • Deep learning and neural networks

  • Digit recognition task as an intuitive problem

  • Multilayer perceptron

Our task

A single neuron

[Diagram: inputs x₁, x₂, x₃ feed into a single neuron f]

A multi-layer perceptron

The simplest neural network

  • An extension of the single neuron to more neurons and more layers
  • Requires a non-linear activation function
  • Also called fully-connected or feed-forward networks

Layers

Input layer

Output layer

Hidden layers

Layers: input layer

image.reshape(28*28, 1)  # flatten the 28×28 pixel image into a 784×1 input vector

Layers: output layer

[Diagram: the output layer has ten neurons, one per digit class 0–9; the neuron with the highest activation gives the prediction → 2]

Additional references

Forward pass

From layer to layer

Hidden layers

From layer to layer: one neuron

[Diagram: inputs x₁, x₂, x₃ feed into a single neuron f]


\[ f(x_1, x_2, x_3, \ldots) = \mathbf{\color{rgb(225, 174, 65)}h} \]

From layer to layer: one neuron

[Diagram: inputs x₁, x₂, x₃ connect to neuron f through weights w₁, w₂, w₃; the neuron outputs h]

  1. Compute weighted sum \[ {\color{rgb(225, 65, 185)}\Sigma} = {\color{rgb(137, 225, 65)}w_1} x_1 + {\color{rgb(137, 225, 65)}w_2} x_2 + {\color{rgb(137, 225, 65)}w_3} x_3 \]

  2. Apply non-linearity \[ \mathbf{\color{rgb(225, 174, 65)}h} = max({\color{rgb(225, 65, 185)}\Sigma}, 0) \]

From layer to layer: one neuron

[Diagram: inputs x₁, x₂, x₃ connect to neuron f through weights w₁, w₂, w₃, together with a bias b; the neuron outputs h]

  1. Compute weighted sum \[ {\color{rgb(225, 65, 185)}\Sigma} = {\color{rgb(137, 225, 65)}w_1} x_1 + {\color{rgb(137, 225, 65)}w_2} x_2 + {\color{rgb(137, 225, 65)}w_3} x_3 \]

  2. Apply non-linearity \[ \mathbf{\color{rgb(225, 174, 65)}h} = max({\color{rgb(225, 65, 185)}\Sigma} + {\color{rgb(213, 24, 24)}b}, 0) \]
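A minimal NumPy sketch of this two-step computation (the input values, weights and bias below are made-up numbers for illustration):

    import numpy as np

    x = np.array([0.2, 0.5, 0.1])    # inputs x1, x2, x3 (illustrative values)
    w = np.array([0.4, -0.3, 0.8])   # weights w1, w2, w3 (illustrative values)
    b = 0.1                          # bias

    weighted_sum = np.dot(w, x)      # step 1: Σ = w1·x1 + w2·x2 + w3·x3
    h = max(weighted_sum + b, 0.0)   # step 2: ReLU non-linearity, h = max(Σ + b, 0)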

From layer to layer: many neurons

In an MLP:

  • each connection is associated with a weight
  • each neuron is associated with a bias

From layer to layer: many neurons

[Diagrams: a fully connected layer, where every neuron h₀⁰ … hₙ⁰ of layer 0 connects to every neuron h₀¹ … hₖ¹ of layer 1, and each neuron of layer 1 has its own bias (e.g. b₀¹)]

\[ \small \qquad \color{rgb(0, 0, 255)}{h_0^1} = ReLU(\sum_{i=0}^{n} \color{rgb(137, 225, 65)}{w_i^{0,1}} \color{rgb(173, 216, 230)}{h_i^0} + \color{rgb(213, 24, 24)}{b_0^1}) \]

\[ \scriptstyle ReLU\left( \begin{bmatrix} w_{0,0} & \cdots & w_{0,n} \\ \vdots & \ddots & \vdots \\ w_{k,0} & \cdots & w_{k,n} \end{bmatrix} \begin{bmatrix} h_0^0 \\ \vdots \\ h_n^0 \end{bmatrix} + \begin{bmatrix} b_0^1 \\ \vdots \\ b_k^1 \end{bmatrix} \right) = \begin{bmatrix} h_0^1 \\ \vdots \\ h_k^1 \end{bmatrix} \]

\[ \scriptstyle ReLU\left( \begin{bmatrix} w_{0,0} & \cdots & w_{0,n} & b_0^1 \\ \vdots & \ddots & \vdots & \vdots \\ w_{k,0} & \cdots & w_{k,n} & b_k^1 \end{bmatrix} \begin{bmatrix} h_0^0 \\ \vdots \\ h_n^0 \\ 1 \end{bmatrix} \right) = \begin{bmatrix} h_0^1 \\ \vdots \\ h_k^1 \end{bmatrix} \]

\[ ReLU(\color{rgb(137, 225, 65)}{\mathbf{W}'} \color{rgb(173, 216, 230)}{h^{0^{\prime}}} ) = \color{rgb(0, 0, 255)}{h^1} \]
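As a sketch, the matrix form h¹ = ReLU(W h⁰ + b) for one fully connected layer in NumPy (the layer sizes and random values below are arbitrary placeholders):

    import numpy as np

    def relu(z):
        return np.maximum(z, 0)

    # One layer: h1 = ReLU(W h0 + b), with n = 4 inputs and k = 3 neurons
    rng = np.random.default_rng(0)
    W = rng.normal(size=(3, 4))    # one row of weights per neuron in layer 1
    b = rng.normal(size=(3, 1))    # one bias per neuron in layer 1
    h0 = rng.normal(size=(4, 1))   # activations of the previous layer

    h1 = relu(W @ h0 + b)          # shape (3, 1): the next layer's activations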

A two-layer network

[Diagram: a two-layer network with an input layer x₀ … xₙ, a hidden layer h₀ … hₖ and an output layer y₀ … yₘ, fully connected between consecutive layers, with biases in each layer]



\[ \color{rgb(255, 176, 0)}{y} = \color{rgb(143, 204, 143)}{W_1} \color{rgb(92, 144, 224)}{ReLU({W_0}x)} \]
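A minimal sketch of this two-layer network for the digit task, with 784 = 28 × 28 inputs and 10 output scores (the hidden width k and the random weights are placeholders, not trained values):

    import numpy as np

    def relu(z):
        return np.maximum(z, 0)

    k = 64                               # hidden layer width (placeholder)
    rng = np.random.default_rng(0)
    W0 = rng.normal(size=(k, 28 * 28))   # input -> hidden weights
    W1 = rng.normal(size=(10, k))        # hidden -> output weights

    x = rng.random((28 * 28, 1))         # a flattened 28×28 input image
    y = W1 @ relu(W0 @ x)                # raw scores, one per digit class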

Forward pass

\[ \scriptstyle ReLU(\color{rgb(255,0,0)}{W} h^n) = h^{n+1} \]

[Figure: the forward pass maps the input image to the predicted digit → 2]

How to choose them?

Additional references

Training and backward pass

Training: intuition

[Figure: supervised learning uses labelled data, i.e. pairs of an image and its label, such as (image of a handwritten digit, 2); these labelled pairs make up the training set]

Testing: intuition

[Figure: the trained network is applied to a held-out test set and evaluated by its accuracy]

Dataset split


  • Hyperparameters
  • Keep test set aside! ⚠️

Figures modified from CS231n lecture notes: neural networks

Dataset split


  • Hyperparameters
  • Keep test set aside! ⚠️
  • So how do we choose hyperparameters, then?

Figures modified from CS231n lecture notes: neural networks
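One common answer is to carve a validation set out of the training data and use it to compare hyperparameter choices, while the test set stays untouched until the very end. A minimal sketch (the 70/15/15 proportions are an assumption, not a rule):

    import numpy as np

    n_samples = 1000
    rng = np.random.default_rng(0)
    indices = rng.permutation(n_samples)

    n_train = int(0.70 * n_samples)
    n_val = int(0.15 * n_samples)

    train_idx = indices[:n_train]                # used to fit the weights
    val_idx = indices[n_train:n_train + n_val]   # used to choose hyperparameters
    test_idx = indices[n_train + n_val:]         # kept aside until the very end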

Training as an optimisation problem

Loss function

Gradient descent

Loss function

Classes:       0      1      2      3      4      5      6      7      8      9
Raw scores:   -6.8    2.6    6.7    5.9    1.9   -1.3   -5.7    3.2    1.3    1.0

Probabilities

Loss function

Raw scores z:  -6.8    2.6    6.7    5.9    1.9   -1.3   -5.7    3.2    1.3    1.0

Probabilities

Ground truth p

Softmax:

\[ \tiny \widehat{p} (z_{j}) = \frac{e^{z_j}}{\sum_{k} e^{z_k}} \]

Cross-entropy:

\[ \tiny \text{H}(p, \widehat{p}) = -\sum_{k} p_k \log(\widehat{p}_{k}) \]

Loss: \[ \tiny \text{L}_{i} = - \log(\widehat{p} (z_{j=y_i}) ) \]

\[ \tiny \text{L} = \frac{1}{N} \sum_{i} \text{L}_{i} \]
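A minimal NumPy sketch of these formulas applied to the raw scores above, assuming the ground-truth digit for this sample is 2:

    import numpy as np

    z = np.array([-6.8, 2.6, 6.7, 5.9, 1.9, -1.3, -5.7, 3.2, 1.3, 1.0])
    y_true = 2                       # assumed ground-truth class

    p_hat = np.exp(z - z.max())      # subtract max(z) for numerical stability
    p_hat /= p_hat.sum()             # softmax: probabilities over the 10 classes

    loss_i = -np.log(p_hat[y_true])  # cross-entropy loss for this sample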

Optimisation: intuition

Optimisation: intuition

Seven optimisation takeaways!

  1. We can think of the loss function as a high-dimensional “surface” over the network’s parameters.

  2. The gradient is a vector that at any point in the loss “surface” gives us the direction of steepest ascent.

  3. The negative gradient gives us the direction of steepest descent.

  4. Gradient descent is an optimisation procedure that iteratively adjusts the parameters based on the gradient.

Until when?….

Seven optimisation takeaways!

  5. To update the parameters we take a small step in the direction of the negative gradient (a minimal sketch follows this list). \[ W_{new} = W_{old} - \alpha \nabla \text{L}_W \]

  6. Stochastic gradient descent is a more efficient variant of gradient descent, which computes the gradient on batches of training samples.

  7. An epoch is a single pass through the complete training set. A training process typically consists of multiple epochs.
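A minimal sketch of the update in takeaway 5 (the shapes, learning rate and placeholder gradient are made up; in practice the gradient comes from the backward pass):

    import numpy as np

    alpha = 0.01                            # learning rate (placeholder value)
    rng = np.random.default_rng(0)
    W = rng.normal(size=(10, 784))          # current parameters
    grad_L_W = rng.normal(size=(10, 784))   # placeholder for the gradient of L w.r.t. W

    W = W - alpha * grad_L_W                # one small step downhill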

A reminder

  • In training: forward and backward pass

  • In testing and inference: only forward pass

Additional references

Convolutional neural networks

Convolutional neural networks


  • Regular NNs don’t scale well to images
  • CNNs take advantage of the fact that their inputs are images

Layers used in CNNs

Four common types:

  • Convolutional layer
  • Pooling layer
  • Batch normalisation layer
  • Fully connected layer

Layers used in CNNs

Four common types:

  • Convolutional layer
  • Pooling layer
  • Batch normalisation layer
  • Fully connected layer

Convolutional layer

  • A set of n learnable filters
  • Each filter is a small matrix of weights + 1 bias
  • We slide (convolve) each filter across the width and height of the input volume; each filter extends through the input’s full depth (a naive sketch follows the figures below)

Figure modified from CS231n

Figure modified from CS231n

Figure modified from CS231n
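A naive NumPy sketch of the sliding operation for a single filter on a single-channel input (stride 1, no padding); real convolutional layers apply many filters across multi-channel inputs with far faster implementations:

    import numpy as np

    def conv2d_single(image, kernel, bias=0.0):
        """Convolve one filter over a 2D single-channel image (stride 1, no padding)."""
        ih, iw = image.shape
        kh, kw = kernel.shape
        out = np.zeros((ih - kh + 1, iw - kw + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                patch = image[i:i + kh, j:j + kw]        # region under the filter
                out[i, j] = np.sum(patch * kernel) + bias
        return out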

Convolutional layer

A few hyperparameters:

  • number of filters
  • filter size
  • stride
  • padding

Stride = 1. Figure from Convolution arithmetic

Stride = 2. Figure from Convolution arithmetic
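These hyperparameters determine the spatial size of the output. A small helper using the standard formula (assuming square inputs and filters):

    def conv_output_size(input_size, filter_size, stride=1, padding=0):
        # (input - filter + 2*padding) / stride + 1, along one spatial dimension
        return (input_size - filter_size + 2 * padding) // stride + 1

    conv_output_size(28, 3, stride=1, padding=0)   # -> 26
    conv_output_size(28, 3, stride=2, padding=1)   # -> 14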

Layers used in CNNs

Four common types:

  • Convolutional layer
  • Pooling layer
  • Batch normalisation layer
  • Fully connected layer

Pooling layer

An example CNN architecture: VGG-16

ImageNet 2014 challenge (1000 categories)

Figure from Neuralception

An example CNN architecture: ResNet-18

Figure from LearnOpenCV

Transfer learning

  • Few people train a CNN from scratch

  • More common scenarios:

    • Fine-tune a pretrained model (e.g. backbones for SLEAP, DLC, etc.)

    • Use as a feature extractor (e.g. DINOv2)

    • Directly use the model for inference (e.g. OpenPose)
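As an illustration of the fine-tuning scenario, a hedged sketch using PyTorch/torchvision (assuming a recent torchvision is installed; the 10-class output head is a placeholder for whatever the downstream task needs):

    import torch.nn as nn
    import torchvision.models as models

    # Load a ResNet-18 pretrained on ImageNet
    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

    # Freeze the backbone so it acts as a feature extractor
    for param in model.parameters():
        param.requires_grad = False

    # Replace the final fully connected layer with a new, trainable head
    model.fc = nn.Linear(model.fc.in_features, 10)   # 10 classes: placeholder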

Data augmentation

  • To improve performance, train on more and diverse data.
  • One easy way: transform the images we already have, while preserving the label
  • The choice of transformations depends on the task and the dataset (see the sketch below).
  • Reduces overfitting and improves robustness.
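A minimal NumPy sketch of label-preserving transforms for a single-channel image (assuming pixel values in [0, 1]; note that, for example, horizontal flips would not preserve the label of a handwritten digit):

    import numpy as np

    def augment(image):
        # Return simple transformed copies of the image; the right set of
        # transforms depends on the task and the dataset.
        flipped = np.fliplr(image)                  # horizontal flip
        brighter = np.clip(image * 1.2, 0.0, 1.0)   # brightness increase
        darker = np.clip(image * 0.8, 0.0, 1.0)     # brightness decrease
        return [flipped, brighter, darker]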

Additional references

CV applications in animal behaviour

CV tasks in animal behaviour


What are tasks?

  • A task is a problem that we want to solve
  • There may be multiple ways to solve a task
  • DL has proven very powerful at solving many vision tasks

CV tasks in animal behaviour

  • Image classification
  • Detection
  • Segmentation
  • Pose estimation
  • Tracking
  • Behaviour classification
  • Re-identification
Which species are present in this image?

Which pixels are “mouse”?

Sample image from Aeon project

Which pixels are “mouse 1”?

Sample image from Aeon project

How do detections in frame f map to frame f+1?

What is the behaviour of the animal in this frame/clip?

Which individual is the animal in this frame?