A CNN is pretrained for image classification (e.g. on ImageNet) and one of its intermediate layers is used as a feature extractor
A region proposal network (RPN) finds up to a predefined number of regions
Initially all combinations of given sizes (e.g. 64, 128, 256 pixels) and given ratios between width and height (e.g. 0.5, 1.0, 1.5) are used as reference regions
RPN learns to predict a probability that a reference region contains an object, and the x, y, width, and height offsets from the reference
The extracted features are reused for classifying the regions using a region-based CNN (R-CNN)
R-CNN consists of two fully-connected layers of size 4096, followed by two output layers—one predicting the class (including “background”) and one predicting the region offset
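As a concrete illustration of the reference regions mentioned above, here is a minimal NumPy sketch that enumerates all size/ratio combinations (the way width and height are derived from a size and ratio, and the function name, are assumptions for illustration, not the exact Faster R-CNN implementation):

```python
import numpy as np

def reference_anchors(sizes=(64, 128, 256), ratios=(0.5, 1.0, 1.5)):
    """One (width, height) reference shape per size/ratio combination.

    Each shape keeps an area of roughly size**2 pixels while its
    width/height ratio equals the requested aspect ratio.
    """
    shapes = []
    for size in sizes:
        for ratio in ratios:
            width = size * np.sqrt(ratio)    # ratio = width / height
            height = size / np.sqrt(ratio)
            shapes.append((width, height))
    return np.array(shapes)                  # shape (len(sizes) * len(ratios), 2)

print(reference_anchors())                   # 9 reference shapes for the RPN
```

For every such shape placed at every feature map position, the RPN then predicts an objectness probability and the four offsets described above.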
A single convolutional network extracts features and predicts the bounding boxes
The input image is conceptually divided into a grid of cells (for example, 7 × 7)
For each cell, the model predicts multiple bounding boxes (for example, two) whose centers fall inside the cell, and one set of class probabilities
The model predicts 5 values for each bounding box:
x and y coordinates (between 0 and 1, relative to the cell)
width and height (between 0 and 1, relative to the image)
confidence score (between 0 and 1)
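A rough sketch of how these five values could be turned into pixel coordinates (the grid size, image size, and function name are illustrative assumptions, not the reference YOLO code):

```python
def decode_box(pred, row, col, grid_size=7, img_w=448, img_h=448):
    """Decode one predicted box: pred = (x, y, w, h, confidence), all in [0, 1].

    x and y are relative to the grid cell; w and h are relative to the image.
    Returns the box center and size in pixels plus the confidence score.
    """
    x, y, w, h, conf = pred
    cell_w, cell_h = img_w / grid_size, img_h / grid_size
    center_x = (col + x) * cell_w            # cell offset plus in-cell position
    center_y = (row + y) * cell_h
    return center_x, center_y, w * img_w, h * img_h, conf
```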
At test time, the confidences are multiplied by the class probabilities to get a set of class scores for each bounding box
For every bounding box, only the class with the highest score is kept
Overlapping bounding boxes are filtered using non-maximum suppression
At training time, the bounding box with the highest Intersection over Union with a ground truth bounding box is “responsible for predicting the object”, i.e. takes part in the loss calculation
Target for the confidence score is the Intersection over Union between the predicted and ground truth bounding boxes
The loss of a bounding box is the sum of squared errors from its coordinates and dimensions, confidence score, and class probabilities
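A simplified sketch of the loss of one responsible bounding box under the description above (the weighting factors and the square roots of width and height used in the original YOLO paper are omitted; all names are illustrative):

```python
import numpy as np

def box_loss(pred_box, target_box, pred_conf, iou, pred_probs, target_probs):
    """Unweighted sum of squared errors for one responsible bounding box.

    pred_box / target_box hold (x, y, w, h); iou is the Intersection over Union
    between the predicted and ground truth boxes and serves as the target for
    the confidence score.
    """
    coord_loss = np.sum((np.asarray(pred_box) - np.asarray(target_box)) ** 2)
    conf_loss = (pred_conf - iou) ** 2
    class_loss = np.sum((np.asarray(pred_probs) - np.asarray(target_probs)) ** 2)
    return coord_loss + conf_loss + class_loss
```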
Non-Maximum Suppression
Bounding boxes that predict the same class are sorted by probability and iterated over, starting from the one with the highest probability
At each step, the Intersection over Union (IoU) between the current box and each remaining lower-scoring box is calculated
A high IoU means that there is considerable overlap between the bounding boxes
Remaining boxes whose IoU with the current box is higher than a threshold are discarded
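A minimal Python sketch of the procedure (a generic implementation, not any particular library's version; boxes are assumed to be given as corner coordinates):

```python
def non_maximum_suppression(boxes, scores, iou_threshold=0.5):
    """boxes: list of (x1, y1, x2, y2) corners; scores: matching list of scores.

    Returns the indices of the boxes that are kept, highest score first.
    """
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        intersection = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return intersection / (area_a + area_b - intersection)

    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        current = order.pop(0)
        keep.append(current)
        # Discard the remaining boxes that overlap the current box too much.
        order = [i for i in order if iou(boxes[current], boxes[i]) <= iou_threshold]
    return keep
```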
There are multiple detection layers at different levels of the model (for example, 3)
Each grid cell at each detection layer predicts multiple bounding boxes (for example, 3) and a separate set of class probabilities for every box (as opposed to YOLO)
Boxes are predicted relative to predefined anchor boxes, like in Faster R-CNN
The anchor boxes are centered at the cell center, i.e. the bounding box location is still predicted relative to the grid cell and the anchor box only defines a prior width and height
The three boxes predicted at a cell each have a separate prior width and height
The three detection layers use separate priors, totaling nine anchor box shapes
The anchor box dimensions are determined by running k-means clustering on the training set bounding boxes
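A sketch of this clustering step (the 1 - IoU distance between width/height pairs follows the YOLO papers; the plain k-means loop and the names are illustrative):

```python
import numpy as np

def wh_iou(wh, centers):
    """IoU of boxes that share a center, computed from widths and heights only."""
    inter = (np.minimum(wh[:, None, 0], centers[None, :, 0]) *
             np.minimum(wh[:, None, 1], centers[None, :, 1]))
    areas = (wh[:, 0] * wh[:, 1])[:, None]
    center_areas = (centers[:, 0] * centers[:, 1])[None, :]
    return inter / (areas + center_areas - inter)

def anchor_kmeans(wh, k=9, iterations=100):
    """Cluster training-set (width, height) pairs into k anchor box shapes."""
    centers = wh[np.random.choice(len(wh), k, replace=False)]
    for _ in range(iterations):
        assignment = (1.0 - wh_iou(wh, centers)).argmin(axis=1)
        new_centers = []
        for i in range(k):
            members = wh[assignment == i]
            # Keep the old center if a cluster happens to become empty.
            new_centers.append(members.mean(axis=0) if len(members) else centers[i])
        centers = np.array(new_centers)
    return centers
```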
At each detection layer, the bounding box that is responsible for a training target is the one whose anchor box best matches the object
The grid cell is determined by the object center
The predictor within the cell is the one whose prior dimensions give the highest Intersection over Union with the target
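A small sketch of this assignment (the grid and image sizes and the prior list are illustrative; the IoU is computed from widths and heights only, as if the prior and the target shared a center):

```python
def responsible_predictor(center_x, center_y, box_w, box_h, priors,
                          grid_size, img_w, img_h):
    """Return the grid cell (row, col) and the index of the best-matching prior."""
    col = min(grid_size - 1, int(center_x / img_w * grid_size))
    row = min(grid_size - 1, int(center_y / img_h * grid_size))
    best_index, best_iou = 0, 0.0
    for i, (prior_w, prior_h) in enumerate(priors):
        inter = min(box_w, prior_w) * min(box_h, prior_h)   # shared-center overlap
        iou = inter / (box_w * box_h + prior_w * prior_h - inter)
        if iou > best_iou:
            best_index, best_iou = i, iou
    return row, col, best_index
```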
Confidence score is now called objectness, and YOLOv3 uses binary targets
The target is one for the bounding boxes that are responsible for predicting any objects
Bounding boxes that are not responsible for predicting any objects, but whose anchor box overlaps an object (high enough IoU), are ignored when computing the objectness loss
Objectness loss is calculated as the binary cross entropy from those boxes that are not ignored
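A PyTorch-flavored sketch of the objectness loss with the ignore rule above (tensor layout and names are assumptions; the 0.5 ignore threshold follows the YOLOv3 paper):

```python
import torch
import torch.nn.functional as F

def objectness_loss(pred_logits, responsible, best_iou_with_objects, ignore_iou=0.5):
    """Binary cross entropy over the boxes that are not ignored.

    pred_logits:            raw objectness predictions
    responsible:            boolean tensor, True where a box is responsible
                            for a ground truth object (target = 1)
    best_iou_with_objects:  best IoU of each box's anchor with any object
    """
    ignore = (~responsible) & (best_iou_with_objects > ignore_iou)
    keep = ~ignore
    return F.binary_cross_entropy_with_logits(pred_logits[keep],
                                               responsible[keep].float())
```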
After reducing the resolution using the usual convolutional layers, the resolution is increased by upsampling
Cross connections (not part of the original U-Net architecture) connect the downsampling part of the network to the same-sized feature maps in the upsampling part
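A minimal PyTorch sketch of the idea, with one downsampling step, one upsampling step, and a single cross connection (channel counts, layer choices, and names are illustrative; the input height and width are assumed to be even):

```python
import torch
import torch.nn as nn

class TinyEncoderDecoder(nn.Module):
    """One down/up step with a cross connection, just to show the wiring."""
    def __init__(self, in_ch=3, mid_ch=16, out_ch=1):
        super().__init__()
        self.down = nn.Sequential(nn.Conv2d(in_ch, mid_ch, 3, padding=1), nn.ReLU())
        self.pool = nn.MaxPool2d(2)                             # reduce resolution
        self.bottom = nn.Sequential(nn.Conv2d(mid_ch, mid_ch, 3, padding=1), nn.ReLU())
        self.up = nn.Upsample(scale_factor=2, mode="nearest")   # increase resolution
        # After concatenating the cross connection the channel count doubles.
        self.out = nn.Conv2d(mid_ch * 2, out_ch, 3, padding=1)

    def forward(self, x):
        skip = self.down(x)                  # same-sized features, kept for later
        x = self.bottom(self.pool(skip))
        x = self.up(x)                       # back to the skip's resolution
        x = torch.cat([x, skip], dim=1)      # cross connection
        return self.out(x)
```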