# Deep Learning for Computer Vision

## Image Classification

- Assign a label to an image
- ImageNet Large Scale Visual Recognition Challenge: 1000 different classes

### AlexNet

- Five convolutional layers
- Convolution kernels are 3-dimensional, depth matching the input depth
- The kernel slides in two directions (x, y), so each kernel produces a 2-dimensional output
- Kernel sizes: 11x11, 5x5, and three layers of 3x3
- Three fully connected layers: two with 4096 outputs and one with 1000 outputs (one per class) before softmax
- Max pooling to reduce resolution after the first, second, and last convolutional layers
- Data augmentation by mirroring and cropping
- Dropout
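
The statement that a 3-dimensional kernel yields a 2-dimensional output can be checked with a minimal sketch. It assumes no padding and no bias, and uses plain nested lists rather than a deep learning library; the function name `conv2d_single_kernel` is made up for this illustration.

```python
def conv2d_single_kernel(image, kernel, stride=1):
    """Slide one 3-D kernel over a 3-D image and return a 2-D feature map.

    image:  nested lists indexed as image[channel][y][x]
    kernel: nested lists indexed as kernel[channel][ky][kx]; its depth must
            equal the image depth.
    """
    depth = len(image)
    assert len(kernel) == depth, "kernel depth must match input depth"
    height, width = len(image[0]), len(image[0][0])
    k_h, k_w = len(kernel[0]), len(kernel[0][0])

    # Output size without padding.
    out_h = (height - k_h) // stride + 1
    out_w = (width - k_w) // stride + 1
    output = [[0.0] * out_w for _ in range(out_h)]

    for oy in range(out_h):
        for ox in range(out_w):
            total = 0.0
            # Sum over the full depth and the kernel window: the depth
            # dimension disappears, leaving one number per (x, y) position.
            for c in range(depth):
                for ky in range(k_h):
                    for kx in range(k_w):
                        total += (image[c][oy * stride + ky][ox * stride + kx]
                                  * kernel[c][ky][kx])
            output[oy][ox] = total
    return output


# A 3-channel 4x4 image of ones and a 3-channel 3x3 kernel of ones.
image = [[[1.0] * 4 for _ in range(4)] for _ in range(3)]
kernel = [[[1.0] * 3 for _ in range(3)] for _ in range(3)]
feature_map = conv2d_single_kernel(image, kernel)
print(feature_map)  # [[27.0, 27.0], [27.0, 27.0]]
```

A layer with many kernels simply repeats this once per kernel and stacks the resulting 2-D maps, which gives the depth of the next layer's input.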

### VGGNet

- More layers, e.g. 13 convolutional and 3 fully connected layers
- Smaller kernels, all are sized 3x3
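
One commonly cited reason for preferring small kernels is that two stacked 3x3 layers cover the same 5x5 input window as a single 5x5 layer while using fewer weights. The arithmetic below is a sketch of that comparison; the channel count of 64 is an assumption chosen only for illustration, and biases are ignored.

```python
def conv_weights(kernel_size, in_channels, out_channels):
    """Number of weights in one convolutional layer (biases ignored)."""
    return kernel_size * kernel_size * in_channels * out_channels


def stacked_receptive_field(kernel_size, num_layers):
    """Receptive field of stacked stride-1 convolutions with equal kernel size."""
    return 1 + num_layers * (kernel_size - 1)


channels = 64  # hypothetical channel count, kept constant across layers

two_3x3 = 2 * conv_weights(3, channels, channels)
one_5x5 = conv_weights(5, channels, channels)

print(stacked_receptive_field(3, 2))  # 5
print(two_3x3, one_5x5)               # 73728 102400
```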

### ResNet

- Even deeper, typically 50 or 101 layers
- Consists of residual blocks, each containing two (or, in the deeper variants, three) convolutional layers and a bypass connection that adds the block's input to its output
- A residual block can represent the identity mapping simply by setting its layer weights to zero
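
The identity property can be illustrated with a small sketch. As simplifying assumptions, fully connected layers stand in for the convolutions, and the ReLU that is often applied after the addition is left out; all function names are hypothetical.

```python
def relu(vector):
    return [max(0.0, v) for v in vector]


def linear(weights, vector):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def residual_block(x, weights_1, weights_2):
    """Compute x + F(x), where F is two layers with a ReLU in between."""
    residual = linear(weights_2, relu(linear(weights_1, x)))
    # The bypass connection adds the unchanged input to the layer output.
    return [xi + ri for xi, ri in zip(x, residual)]


x = [0.5, -1.0, 2.0]
zero = [[0.0] * 3 for _ in range(3)]

# With all weights at zero, F(x) is zero and the block returns its input.
print(residual_block(x, zero, zero))  # [0.5, -1.0, 2.0]
```

Without the bypass connection, the same two layers would have to learn an explicit identity transformation to pass the input through unchanged.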

### Inception

- The original network (GoogLeNet) consists of nine inception modules
- Each module performs 1x1, 3x3, and 5x5 convolutions in parallel and concatenates the results along the depth dimension
- 1x1 convolutions are also performed to reduce the depth before the 3x3 and 5x5 convolutions
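
To see why the depth reduction matters, the sketch below counts weights for a 5x5 branch with and without a preceding 1x1 convolution, and computes the depth of the concatenated output. All channel counts are illustrative assumptions, not values taken from these notes, and only the three convolution branches described above are modelled.

```python
def conv_weights(kernel_size, in_channels, out_channels):
    """Number of weights in one convolutional layer (biases ignored)."""
    return kernel_size * kernel_size * in_channels * out_channels


in_depth = 192                      # assumed depth of the module input
branch_1x1 = 64                     # outputs of the plain 1x1 branch
reduce_3x3, branch_3x3 = 96, 128    # 1x1 reduction, then 3x3 outputs
reduce_5x5, branch_5x5 = 16, 32     # 1x1 reduction, then 5x5 outputs

# Concatenation in depth: the branch outputs are stacked channel-wise.
output_depth = branch_1x1 + branch_3x3 + branch_5x5

# 5x5 branch applied directly to the input versus after a 1x1 reduction.
direct_5x5 = conv_weights(5, in_depth, branch_5x5)
reduced_5x5 = (conv_weights(1, in_depth, reduce_5x5)
               + conv_weights(5, reduce_5x5, branch_5x5))

print(output_depth)             # 224
print(direct_5x5, reduced_5x5)  # 153600 15872
```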

## Object Detection

- Finding and classifying a variable number of objects in an image
- Variable number of outputs: a list of bounding boxes with associated labels (and probabilities for the labels and bounding boxes)
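
A variable-length detection result can be represented as a plain list of records. The sketch below is one possible layout; the `Detection` class, its field names, and the 0.5 threshold are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    x: float        # left edge of the bounding box
    y: float        # top edge of the bounding box
    width: float
    height: float
    label: str
    score: float    # probability assigned to the label


def keep_confident(detections, threshold=0.5):
    """Return only the detections whose score reaches the threshold."""
    return [d for d in detections if d.score >= threshold]


# The list can hold any number of detections, including none.
detections = [
    Detection(12.0, 30.0, 64.0, 48.0, "dog", 0.92),
    Detection(100.0, 80.0, 20.0, 20.0, "cat", 0.31),
]
confident = keep_confident(detections)
print([d.label for d in confident])  # ['dog']
```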

### Faster R-CNN

- A CNN is pretrained for image classification (e.g. on ImageNet) and one of its intermediate layers is used as a feature extractor
- A region proposal network (RPN) finds up to a predefined number of regions
- Initially all combinations of given sizes (e.g. 64, 128, 256 pixels) and given ratios between width and height (e.g. 0.5, 1.0, 1.5) are used as reference regions
- The RPN learns to predict the probability that a reference region contains an object, along with x, y, width, and height offsets from the reference
- The extracted features are reused for classifying the regions using a region-based CNN (R-CNN)
- R-CNN consists of two fully connected layers of size 4096, followed by two output layers: one predicting the class (including “background”) and one predicting the region offset
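
The reference regions and the predicted offsets can be sketched as follows. The sizes and ratios are the examples given above, with the ratio read as width divided by height. Treating each size as the side of a square with the same area, and the center/size offset parameterization in `apply_offsets`, are assumptions of this sketch (the latter is a commonly used form); both function names are hypothetical.

```python
import math


def make_anchors(center_x, center_y, sizes, ratios):
    """Return (x, y, w, h) reference boxes for every size/ratio combination.

    Each box has area size**2 and width / height equal to the ratio.
    """
    anchors = []
    for size in sizes:
        for ratio in ratios:
            width = size * math.sqrt(ratio)
            height = size / math.sqrt(ratio)
            anchors.append((center_x, center_y, width, height))
    return anchors


def apply_offsets(anchor, offsets):
    """Shift and rescale one anchor by predicted offsets (tx, ty, tw, th)."""
    xa, ya, wa, ha = anchor
    tx, ty, tw, th = offsets
    # Center shifts are relative to the anchor size; sizes scale exponentially.
    return (xa + tx * wa, ya + ty * ha, wa * math.exp(tw), ha * math.exp(th))


anchors = make_anchors(100.0, 100.0, sizes=[64, 128, 256], ratios=[0.5, 1.0, 1.5])
print(len(anchors))  # 9

# Zero offsets leave the reference region unchanged.
print(apply_offsets(anchors[4], (0.0, 0.0, 0.0, 0.0)))  # (100.0, 100.0, 128.0, 128.0)
```

In a full network this set of reference regions is repeated at every position of the feature map, and the RPN outputs one object probability and one offset tuple per reference region.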

## Image Segmentation

- Predict a segmentation map given an image
- The output is the same size as the input
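
One way to obtain such a map, sketched below, is to have the network output one score per class for every pixel and then pick the highest-scoring class at each position. The score layout `class_scores[class][y][x]` and the function name are assumptions for this example.

```python
def segmentation_map(class_scores):
    """Pick the highest-scoring class for every pixel.

    class_scores is indexed as class_scores[class][y][x].
    """
    num_classes = len(class_scores)
    height, width = len(class_scores[0]), len(class_scores[0][0])
    return [
        [max(range(num_classes), key=lambda c: class_scores[c][y][x])
         for x in range(width)]
        for y in range(height)
    ]


scores = [
    [[0.9, 0.2], [0.1, 0.4]],  # class 0
    [[0.1, 0.8], [0.9, 0.6]],  # class 1
]
# The result has the same height and width as the input scores.
print(segmentation_map(scores))  # [[0, 1], [1, 1]]
```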

### U-Net

- After reducing the resolution using the usual convolutional layers, increase the resolution again by upsampling
- Cross (skip) connections copy feature maps from the downsampling part of the network to the same-sized stage in the upsampling part, where they are concatenated with the upsampled features
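
The shape bookkeeping behind this design can be sketched with plain lists. As assumptions, downsampling is done by 2x2 max pooling, upsampling by repeating pixels (learned upsampling is also common), and image dimensions are even; the function names are hypothetical.

```python
def max_pool_2x2(channel):
    """Halve the resolution of one 2-D channel by taking 2x2 maxima."""
    return [
        [max(channel[y][x], channel[y][x + 1],
             channel[y + 1][x], channel[y + 1][x + 1])
         for x in range(0, len(channel[0]), 2)]
        for y in range(0, len(channel), 2)
    ]


def upsample_2x(channel):
    """Double the resolution of one 2-D channel by repeating each pixel."""
    rows = []
    for row in channel:
        wide = [value for value in row for _ in range(2)]
        rows.append(wide)
        rows.append(list(wide))
    return rows


def concat_depth(features_a, features_b):
    """Stack two lists of equally sized channels along the depth dimension."""
    assert len(features_a[0]) == len(features_b[0])
    assert len(features_a[0][0]) == len(features_b[0][0])
    return features_a + features_b


encoder = [[[1, 2, 3, 4],
            [5, 6, 7, 8],
            [9, 10, 11, 12],
            [13, 14, 15, 16]]]                   # depth 1, 4x4

bottleneck = [max_pool_2x2(c) for c in encoder]  # depth 1, 2x2
decoder = [upsample_2x(c) for c in bottleneck]   # depth 1, 4x4
merged = concat_depth(encoder, decoder)          # depth 2, 4x4

print(bottleneck[0])  # [[6, 8], [14, 16]]
print(len(merged), len(merged[0]), len(merged[0][0]))  # 2 4 4
```

The upsampled features alone have lost fine detail (each 2x2 block holds a single repeated value), which is why concatenating the same-sized features from the downsampling part helps recover precise boundaries.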