Computer Vision Algorithm Implementations

Year: 2015

I implemented, experimented with, and analyzed the results of three foundational computer vision algorithms: GrabCut image segmentation, the TLD online single-target tracker, and the R-CNN object detector. I completed these implementations and analyses for CS231B: The Cutting Edge of Computer Vision, taught by Professor Fei-Fei Li at Stanford University.

GrabCut Image Segmentation


The goal of foreground-background segmentation is to divide the pixels of an image into two sections, foreground and background. Foreground-background segmentation is a challenging task, and has many applications in object recognition and classification. A successful segmentation requires knowledge of both local features and global features in an image. GrabCut segments the image by minimizing an energy equation that balances both local and global relationships between pixels.
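For reference, the energy being minimized can be written as below. This follows the notation of the original GrabCut formulation by Rother et al. and is not necessarily the exact form used in this implementation: z_n is the color of pixel n, \alpha_n its foreground/background label, k_n its assigned GMM component, \theta the GMM parameters, and C the set of neighboring pixel pairs. The data term U captures how well each pixel fits its color model, and the smoothness term V penalizes label changes between similar-looking neighbors.

```latex
E(\alpha, k, \theta, z) = U(\alpha, k, \theta, z) + V(\alpha, z)

U(\alpha, k, \theta, z) = \sum_{n} \bigl[ -\log p(z_n \mid \alpha_n, k_n, \theta) - \log \pi(\alpha_n, k_n) \bigr]

V(\alpha, z) = \gamma \sum_{(m,n) \in C} [\alpha_n \neq \alpha_m] \, \exp\bigl(-\beta \lVert z_m - z_n \rVert^2\bigr)
```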

We implement GrabCut and experiment with three extensions. We try varying the number of Gaussian Mixture Model (GMM) components used, reinitializing GMM components after a few iterations, and restricting the background GMM to pixels within the bounding box.

The top row shows the ground truth original images. The bottom row shows the segmentations produced by the standard GrabCut algorithm (5-component GMMs). Images toward the left segment well, whereas images toward the right segment poorly.


From our three experiments, we see that the number of GMM components can have a large effect on the final segmentation for certain images. Moreover, reinitializing the GMM components after a few iterations can stabilize the segmentations. Finally, we achieve our best result of 96.94% accuracy and 87.71% Jaccard similarity when we constrain the background model to only the pixels within the bounding box on all but the initial iteration. This approach allows the model to better represent the local color distribution around the object. Nevertheless, some images still may not segment well because the foreground and background look similar. In these cases, we could further improve our segmentations by including user interaction.
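The two reported metrics can be computed from a predicted mask and a ground truth mask roughly as follows. This is a minimal sketch, not the evaluation code used for the numbers above: it assumes accuracy means the fraction of correctly labeled pixels and Jaccard similarity means intersection over union of the foreground regions. The function name `segmentation_scores` is hypothetical.

```python
def segmentation_scores(pred, truth):
    """Return (pixel accuracy, Jaccard similarity) for two binary masks.

    pred and truth are equally sized 2D lists of 0/1 values,
    where 1 marks a foreground pixel. (Assumed metric definitions.)
    """
    tp = fp = fn = tn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1          # foreground predicted, foreground in truth
            elif p:
                fp += 1          # foreground predicted, background in truth
            elif t:
                fn += 1          # background predicted, foreground in truth
            else:
                tn += 1          # background in both

    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total

    # Jaccard = |intersection| / |union| of the two foreground regions.
    union = tp + fp + fn
    jaccard = tp / union if union else 1.0  # two empty masks agree fully
    return accuracy, jaccard


pred = [[1, 1, 0],
        [0, 0, 0]]
truth = [[1, 0, 0],
         [1, 0, 0]]
accuracy, jaccard = segmentation_scores(pred, truth)
```

Jaccard similarity ignores the (often large) background region that both masks agree on, which is why it is typically lower than pixel accuracy for the same segmentation.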

TLD Online Single-Target Tracker

Online tracking of an object in video is a difficult task because the object's appearance can change substantially across frames. For example, the object might undergo changes in illumination, rotation, or occlusion. A successful system must learn to adapt to these changes in order to continue tracking the object.

Frames sampled from videos tracked by the TLD system.

Kalal et al. propose a system that combines tracking and detection to follow an object. They track the object between frames using the Lucas-Kanade optical flow tracker. The tracker works well if there is minimal object motion between frames, and can fail if the object moves too quickly or goes out of frame. The detector searches for the object in every frame by comparing individual patches to the learned object model. The comparisons are performed with fern hashing and nearest neighbor classifiers. After each iteration, the integrator compares the confidences of the tracker's and the detector's object location estimates to choose a final bounding box. The object model is then updated by sampling a set of positive patches around the object's estimated location and a set of negative patches from the background of the image. This learning step allows the object model to adapt to changes in appearance from frame to frame.
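The integrator step can be illustrated with a deliberately simplified sketch. It assumes the tracker and the detector each return at most one box with a confidence score, and it simply keeps the more confident one. The function name `integrate` and the `min_conf` threshold are hypothetical, and the full TLD integrator handles more cases (for example, multiple detections) than shown here.

```python
def integrate(tracker_box, tracker_conf, detector_box, detector_conf,
              min_conf=0.5):
    """Pick the final bounding box from the tracker and detector estimates.

    Boxes are (x1, y1, x2, y2) tuples, or None when that component failed.
    Returns the more confident box, or None if neither estimate is
    confident enough (object treated as not visible in this frame).
    min_conf is an illustrative threshold, not a value from the source.
    """
    candidates = []
    if tracker_box is not None:
        candidates.append((tracker_conf, tracker_box))
    if detector_box is not None:
        candidates.append((detector_conf, detector_box))
    if not candidates:
        return None

    best_conf, best_box = max(candidates, key=lambda c: c[0])
    return best_box if best_conf >= min_conf else None


# The tracker drifted (low confidence), so the detector's box wins.
final_box = integrate((10, 10, 50, 50), 0.4, (12, 11, 52, 51), 0.8)
```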

The TLD tracker with HOG descriptors successfully tracks the runner through the video. This works because the HOG descriptors easily discriminate the runner from the polyurethane track in the background by leveraging the sharp image intensity gradients at the border between the runner and the track.

We implement the TLD tracker and experiment with a few extensions including varying the fern dimensions and the number of ferns in the ensemble, replacing the fern classifier with an SVM, and using Histogram of Oriented Gradient (HOG) features to represent each image patch.
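To make "fern dimensions" and "number of ferns" concrete, here is a minimal sketch of a random fern ensemble. It assumes each fern applies a fixed set of pixel-pair intensity comparisons to a flattened grayscale patch, uses the resulting bits as a hash code, and stores positive and negative counts per code. The names `Fern`, `num_tests`, and `ensemble_confidence` are hypothetical, and the details may differ from the implementation described here.

```python
import random


class Fern:
    """One random fern: num_tests pixel comparisons hashed to a code."""

    def __init__(self, num_tests, patch_size, rng):
        # Each test compares the intensities at two distinct pixel indices.
        self.tests = [tuple(rng.sample(range(patch_size), 2))
                      for _ in range(num_tests)]
        # One positive and one negative counter per possible code.
        self.pos = [0] * (2 ** num_tests)
        self.neg = [0] * (2 ** num_tests)

    def code(self, patch):
        """Hash a flattened patch into an integer in [0, 2**num_tests)."""
        index = 0
        for a, b in self.tests:
            index = (index << 1) | (1 if patch[a] > patch[b] else 0)
        return index

    def update(self, patch, is_positive):
        """Record one labeled training patch."""
        counts = self.pos if is_positive else self.neg
        counts[self.code(patch)] += 1

    def posterior(self, patch):
        """Fraction of training patches with this code that were positive."""
        i = self.code(patch)
        total = self.pos[i] + self.neg[i]
        return self.pos[i] / total if total else 0.0


def ensemble_confidence(ferns, patch):
    """Average the posteriors of all ferns in the ensemble."""
    return sum(f.posterior(patch) for f in ferns) / len(ferns)


rng = random.Random(0)
ferns = [Fern(num_tests=4, patch_size=16, rng=rng) for _ in range(10)]
object_patch = list(range(16))            # toy "object" appearance
background_patch = list(range(15, -1, -1))  # toy "background" appearance
for fern in ferns:
    fern.update(object_patch, is_positive=True)
    fern.update(background_patch, is_positive=False)
```

In this sketch, the fern dimension (`num_tests`) controls how finely each fern partitions patch appearance, and the number of ferns controls how many independent votes are averaged.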

R-CNN Object Detector

Object detection is a challenging task for many reasons. For example, objects of one class may come in different shapes, colors, positions, and poses. An effective object detector must detect objects of the same class in spite of intra-class variation, while at the same time ignoring out-of-class objects. Recently, Convolutional Neural Networks (CNNs) have become more popular for describing objects because the variation within a class can be learned through training.

The system (1) takes an input image, (2) extracts around 2000 bottom-up region proposals, (3) computes features for each proposal using a large convolutional neural network (CNN), and then (4) classifies each region using class-specific linear SVMs. See the original publication for more details.

R-CNN achieves strong object detection performance by using selective search to retrieve bounding box proposals, a pre-trained CNN to extract a feature vector from each region, and an SVM classifier to score each region. The bounding box proposals are refined using ridge regression, and finally non-maximum suppression is applied to the bounding boxes to obtain a single prediction per object instance.
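The final suppression step can be sketched as greedy non-maximum suppression: keep the highest-scoring box, discard any remaining box that overlaps a kept box too much, and repeat. This is a minimal illustration rather than the project's code. Boxes are assumed to be (x1, y1, x2, y2) tuples, and the `iou_threshold` default is an illustrative value.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.3):
    """Return indices of the boxes kept by greedy NMS, best score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap any already-kept box
        # by more than the threshold.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)
```

In a multi-class detector, this procedure would typically be run separately for each class so that overlapping detections of different classes do not suppress one another.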


  • Python
  • C++
  • Matlab