Deep Learning for Computer Vision

Image Classification

  • Assign a label to an image
  • ImageNet Large Scale Visual Recognition Challenge: 1000 different classes
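As a minimal sketch of what "assigning a label" means in practice, a classifier typically produces one score per class, and a softmax turns those scores into probabilities. The three labels and the score values below are hypothetical placeholders, not ImageNet classes or real network outputs.

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(scores, labels):
    """Return the most probable label and its probability."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]

labels = ["cat", "dog", "llama"]   # hypothetical 3-class problem
scores = [1.0, 3.0, 0.5]           # hypothetical network outputs
label, prob = classify(scores, labels)
print(label, round(prob, 3))
```

With 1000 classes the only change is the length of the score vector.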


AlexNet

  • Five convolutional layers
  • Convolution kernels are 3-dimensional, depth matching the input depth
  • Convolution is calculated in two directions (x, y), producing a 2-dimensional output for each kernel
  • Kernel sizes: 11x11, 5x5, and three layers of 3x3
  • Two position-wise fully connected layers (i.e. 1x1 convolutions) with 4096 outputs and one with 1000 outputs before softmax
  • Max pooling to reduce resolution after the first, second, and last convolutional layers
  • Data augmentation by mirroring and cropping
  • Dropout
AlexNet architecture. Llamas et al. 2017
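The point that a 3-dimensional kernel slides only in x and y, and so yields a 2-dimensional output, can be illustrated with a small pure-Python sketch. It is not AlexNet itself: the 6x6x3 input, the 3x3x3 kernel, and the non-overlapping 2x2 pooling are simplifying assumptions, and, as in most deep learning libraries, the "convolution" is computed as a cross-correlation.

```python
def conv2d_single_kernel(image, kernel):
    """Valid cross-correlation of an H x W x C image with a kh x kw x C kernel.

    The kernel slides in x and y only; the depth axis is summed out,
    so one kernel yields one 2-D feature map.
    """
    h, w, c = len(image), len(image[0]), len(image[0][0])
    kh, kw, kc = len(kernel), len(kernel[0]), len(kernel[0][0])
    assert kc == c, "kernel depth must match input depth"
    out = []
    for y in range(h - kh + 1):
        row = []
        for x in range(w - kw + 1):
            acc = 0.0
            for dy in range(kh):
                for dx in range(kw):
                    for d in range(c):
                        acc += image[y + dy][x + dx][d] * kernel[dy][dx][d]
            row.append(acc)
        out.append(row)
    return out

def max_pool(fmap, size=2):
    """Non-overlapping max pooling over size x size windows of a 2-D map."""
    return [[max(fmap[y + dy][x + dx]
                 for dy in range(size) for dx in range(size))
             for x in range(0, len(fmap[0]) - size + 1, size)]
            for y in range(0, len(fmap) - size + 1, size)]

# Hypothetical toy data: a 6x6 image with 3 channels and one 3x3x3 kernel.
image = [[[1.0, 1.0, 1.0] for _ in range(6)] for _ in range(6)]
kernel = [[[1.0, 1.0, 1.0] for _ in range(3)] for _ in range(3)]

fmap = conv2d_single_kernel(image, kernel)  # 2-D map, 4 rows by 4 columns
pooled = max_pool(fmap)                     # resolution halved to 2 by 2
print(len(fmap), len(fmap[0]), len(pooled), len(pooled[0]))
```

A layer with several kernels would repeat this once per kernel and stack the resulting 2-D maps along the depth axis.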


VGGNet

  • More layers, e.g. 13 convolutional and 3 fully connected layers
  • Smaller kernels, all are sized 3x3
VGGNet architecture. Wang et al. 2017
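One commonly cited motivation for using only 3x3 kernels is that two stacked 3x3 layers see the same 5x5 input window as a single 5x5 layer while using fewer parameters. The sketch below checks that arithmetic; the channel count of 64 is an arbitrary assumption, and stride 1 is assumed throughout.

```python
def conv_params(kernel_size, in_channels, out_channels):
    """Weights plus biases of one convolutional layer with square kernels."""
    return kernel_size * kernel_size * in_channels * out_channels + out_channels

def stacked_receptive_field(kernel_size, num_layers):
    """Receptive field of stride-1 convolutions stacked num_layers deep."""
    return 1 + num_layers * (kernel_size - 1)

channels = 64  # hypothetical number of input and output channels

two_3x3 = 2 * conv_params(3, channels, channels)
one_5x5 = conv_params(5, channels, channels)

print(stacked_receptive_field(3, 2), stacked_receptive_field(5, 1))
print(two_3x3, one_5x5)
```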


ResNet

  • Even deeper, typically 50 or 101 layers
  • Consists of residual blocks that contain two convolutional layers and a bypass connection
  • A residual block can achieve identity mapping by zeroing out the layer weights
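The identity-mapping property can be checked with a toy residual block. This is only a sketch under simplifying assumptions: dense layers on a short vector stand in for the two convolutional layers, and the block returns x + F(x) without the activation that real networks usually apply after the addition.

```python
def linear(x, weights, bias):
    """Dense layer used here as a stand-in for a convolutional layer."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def residual_block(x, w1, b1, w2, b2):
    """Compute y = x + F(x), where F is two layers with a ReLU in between."""
    hidden = relu(linear(x, w1, b1))
    fx = linear(hidden, w2, b2)
    # The bypass connection adds the unchanged input to the layer output.
    return [xi + fi for xi, fi in zip(x, fx)]

n = 3
zero_weights = [[0.0] * n for _ in range(n)]
zero_bias = [0.0] * n

x = [1.5, -2.0, 0.25]  # hypothetical input features
y = residual_block(x, zero_weights, zero_bias, zero_weights, zero_bias)
print(y)
```

With all weights and biases at zero, F(x) is zero and the block passes its input through unchanged, which is why adding such blocks need not make a network worse.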


GoogLeNet

  • Consists of nine inception modules
  • Each module performs 1x1, 3x3, and 5x5 convolutions and concatenates the results along the depth dimension
  • 1x1 convolutions are also performed to reduce the depth before the 3x3 and 5x5 convolutions
Inception module. Szegedy et al. 2014
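A rough count of multiplications shows why reducing the depth with a 1x1 convolution before a 5x5 convolution pays off, and how concatenation determines the output depth. All sizes below (28x28 feature map, input depth 192, and the branch widths) are illustrative assumptions, and only the three convolution branches described above are modelled.

```python
def conv_multiplies(kernel_size, in_depth, out_depth, height, width):
    """Multiplications for a stride-1 convolution that preserves height and width."""
    return kernel_size * kernel_size * in_depth * out_depth * height * width

# Hypothetical feature-map size and branch widths.
height = width = 28
in_depth = 192
n1x1, n3x3, n5x5 = 64, 128, 32
n5x5_reduce = 16  # depth after the 1x1 reduction in front of the 5x5 branch

# Concatenating along depth simply adds the branch depths.
out_depth = n1x1 + n3x3 + n5x5

direct_5x5 = conv_multiplies(5, in_depth, n5x5, height, width)
reduced_5x5 = (conv_multiplies(1, in_depth, n5x5_reduce, height, width)
               + conv_multiplies(5, n5x5_reduce, n5x5, height, width))

print(out_depth, direct_5x5, reduced_5x5)
```

Under these assumptions the reduced branch needs roughly a tenth of the multiplications of the direct 5x5 convolution.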

Object Detection

  • Finding and classifying a variable number of objects in an image
  • Variable number of outputs: a list of bounding boxes with associated labels (and probabilities for the labels and bounding boxes)

Faster R-CNN

  • A CNN is pretrained for image classification (e.g. on ImageNet) and one of its intermediate layers is used as a feature extractor
  • A region proposal network (RPN) finds up to a predefined number of regions
  • Initially all combinations of given sizes (e.g. 64, 128, 256 pixels) and given ratios between width and height (e.g. 0.5, 1.0, 1.5) are used as reference regions
  • RPN learns to predict the probability that a reference region contains an object, along with x, y, width, and height offsets from the reference
  • The extracted features are reused for classifying the regions using a region-based CNN (R-CNN)
  • R-CNN consists of two fully-connected layers of size 4096, followed by two output layers—one predicting the class (including “background”) and one predicting the region offset
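The reference regions and the predicted offsets can be sketched as follows. The function names are hypothetical, the ratio is taken as width divided by height, each reference region is assumed to keep an area of about size x size, and the offset decoding follows the parameterization commonly used in R-CNN-style detectors (center shifts scaled by the reference size, log-scale width and height changes).

```python
import math

def make_anchors(cx, cy, sizes=(64, 128, 256), ratios=(0.5, 1.0, 1.5)):
    """Reference regions centered at (cx, cy), one per size/ratio combination.

    Each region is (center_x, center_y, width, height) with
    width / height == ratio and width * height close to size * size.
    """
    anchors = []
    for size in sizes:
        for ratio in ratios:
            w = size * math.sqrt(ratio)
            h = size / math.sqrt(ratio)
            anchors.append((cx, cy, w, h))
    return anchors

def apply_offsets(anchor, offsets):
    """Turn predicted offsets (tx, ty, tw, th) into a box relative to an anchor."""
    ax, ay, aw, ah = anchor
    tx, ty, tw, th = offsets
    return (ax + tx * aw,        # shift the center by a fraction of the width
            ay + ty * ah,        # shift the center by a fraction of the height
            aw * math.exp(tw),   # scale the width
            ah * math.exp(th))   # scale the height

anchors = make_anchors(100.0, 100.0)   # 3 sizes x 3 ratios at one location
unchanged = apply_offsets(anchors[4], (0.0, 0.0, 0.0, 0.0))
shifted = apply_offsets(anchors[4], (0.25, 0.0, 0.0, 0.0))
print(len(anchors), unchanged, shifted)
```

In a full detector the same set of reference regions is placed at every position of the feature map, and only the highest-scoring regions are passed on for classification.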

Image Segmentation

  • Predict a segmentation map given an image
  • The output is the same size as the input
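One common way to obtain such a map, shown here as a minimal sketch, is to have the network output a score per class for every pixel and take the highest-scoring class at each position. The 2x2 image and the two-class scores are made-up values.

```python
def segmentation_map(class_scores):
    """Pick the highest-scoring class index for every pixel.

    class_scores[y][x] is a list with one score per class for that pixel.
    """
    return [[max(range(len(px)), key=px.__getitem__) for px in row]
            for row in class_scores]

# Hypothetical per-pixel scores for a 2x2 image and two classes.
scores = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.6, 0.4], [0.3, 0.7]],
]
seg = segmentation_map(scores)
print(seg)
```

The result has one label per input pixel, so its height and width match the input.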


U-Net

  • After reducing resolution using the usual convolutional layers, increase the resolution by upsampling
  • Skip connections carry feature maps from the downsampling part of the network to the matching resolution level in the upsampling part
U-Net architecture. Ronneberger et al. 2015
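The flow of resolutions through one down/up level can be traced with a toy sketch. It leaves out all learned layers: 2x2 max pooling stands in for the downsampling path, nearest-neighbour repetition for the upsampling, and the connection between the two paths is modelled as a concatenation along depth. All of these are simplifying assumptions rather than the published architecture.

```python
def max_pool_2x2(fmap):
    """Halve height and width of an H x W x C map (H and W must be even)."""
    h, w, c = len(fmap), len(fmap[0]), len(fmap[0][0])
    return [[[max(fmap[y + dy][x + dx][d] for dy in (0, 1) for dx in (0, 1))
              for d in range(c)]
             for x in range(0, w, 2)]
            for y in range(0, h, 2)]

def upsample_2x(fmap):
    """Double height and width by repeating each pixel (nearest neighbour)."""
    out = []
    for row in fmap:
        wide = [list(px) for px in row for _ in (0, 1)]
        out.append(wide)
        out.append([list(px) for px in wide])
    return out

def concat_depth(a, b):
    """Join two maps of equal height and width along the depth axis."""
    return [[pa + pb for pa, pb in zip(ra, rb)] for ra, rb in zip(a, b)]

def shape(fmap):
    return (len(fmap), len(fmap[0]), len(fmap[0][0]))

# Hypothetical 4x4 single-channel feature map from the downsampling path.
encoder = [[[float(y * 4 + x)] for x in range(4)] for y in range(4)]

bottleneck = max_pool_2x2(encoder)       # resolution reduced
decoder = upsample_2x(bottleneck)        # resolution restored
merged = concat_depth(encoder, decoder)  # fine detail rejoins coarse features
print(shape(encoder), shape(bottleneck), shape(decoder), shape(merged))
```

The merged map has the input resolution again, with the high-resolution features from the downsampling path sitting next to the upsampled coarse features.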