Computer vision refers to the process by which computers interpret information from digital images or videos. While humans can quickly determine what items are present in a photograph and where those items are located, computers must be trained to do so, often with massive numbers of images and corresponding object labels. This task, called object detection, was considered incredibly challenging just a few years ago, but recent developments in deep learning, specifically convolutional neural networks (CNNs), have substantially improved computers’ ability to locate and identify objects.
This post will cover the basics of object detection: what it is, various approaches to it, the metrics used to judge its results, and a few important considerations of modern object detection. Look out for bold words throughout this post, as these are special vocabulary terms used by the object detection community that will help you sound like a pro in no time!
Image source: Personal photo processed with YOLOv2. Author and her dog.
What is object detection?
The goal of object detection is to draw rectangular bounding boxes around each object of interest as well as identify what object each box contains. A single image can feature many different objects, so multiple bounding boxes may be drawn. Object detection applications are basically limitless, but some uses include people or animal counting, face detection, self-driving cars, or even ball tracking in sports. These applications require many different kinds of objects to be detected, frequently with a high degree of both accuracy and prediction speed to meet the demands of real-time video tracking.
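To make the output of a detector concrete, here is a minimal sketch of a single detection record: a bounding box, a class label, and a confidence score. The field names and the (x_min, y_min, x_max, y_max) pixel-coordinate convention are illustrative assumptions; real frameworks vary (some encode boxes as a center point plus width and height).

```python
from dataclasses import dataclass

@dataclass
class Detection:
    """One detected object: a bounding box plus its predicted class.

    Field names and the (x_min, y_min, x_max, y_max) corner convention
    are assumptions for illustration; frameworks differ in practice.
    """
    x_min: float
    y_min: float
    x_max: float
    y_max: float
    label: str    # predicted object class, e.g. "dog"
    score: float  # model confidence, typically in [0, 1]

# A single image can contain many objects, so a detector
# returns a list of these records per image.
detections = [
    Detection(34, 50, 210, 300, "person", 0.97),
    Detection(180, 220, 320, 330, "dog", 0.91),
]
```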
The first successful object detection frameworks relied on more traditional machine learning techniques. These methods required extensive feature engineering to first look for very specific object patterns. For example, the Viola-Jones algorithm, introduced in 2001 for face detection, scans through an image to find features common to human faces like the bridge of the nose or the region around the eyes. The presence or absence of these features is then fed to a machine learning classifier, such as a support vector machine, to determine where faces are located in the image. These machine learning approaches achieve impressive test-time speeds and detection accuracy, but they often fail to generalize to other varieties of objects since the features are rigidly engineered.
The advent of CNNs brought great improvement to the world of object detection because they allow representative features to be learned directly from the images themselves, eliminating the need for manual feature engineering. In fact, nearly all modern object detection systems leverage deep learning for some portion of the process. The family of region-based CNN methods is made up of R-CNN, released in 2014, and the subsequent Fast and Faster R-CNN. These approaches take an image and put it through two stages: 1) the generation of candidate regions of interest (RoIs) where objects may be present and 2) the use of a CNN to produce characteristic features for classifying and further refining the bounding box coordinates of each RoI.
Single-shot detectors also adopt CNNs to learn about objects in images and videos. This category of algorithms includes YOLO and SSD, both published in 2016. Rather than following the two-stage pipeline of region-based methods, these systems perform object detection in one shot. This means that identifying each region of interest, determining whether the region contains an object, classifying each object detected, and refining the bounding box coordinates all happen within one pass through the CNN. Single-shot detectors are generally much faster than R-CNN methods; however, they often struggle with small objects and may exhibit worse accuracy than, say, Faster R-CNN.
Image designed by author by modifying the following sources: one, two, three, & four.
Because object detection involves two tasks, both object classification and object localization, special evaluation metrics apply. Two such metrics, IoU and mAP, prevail among the object detection community when reporting results or comparing techniques.
IoU stands for intersection over union. This measurement judges how well the algorithm has determined an object’s location. IoU compares the actual and predicted bounding boxes by calculating the area of their overlap divided by the total area enclosed by these two boxes. Typically, IoU above 50% is regarded as a positive match.
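The IoU calculation described above can be sketched in a few lines. This minimal implementation assumes boxes are given as (x_min, y_min, x_max, y_max) tuples, which is one common convention but not the only one.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned bounding boxes.

    Boxes are assumed to be (x_min, y_min, x_max, y_max) tuples.
    """
    # Corners of the overlap rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])

    # Overlap area is zero when the boxes do not intersect
    inter = max(0, x2 - x1) * max(0, y2 - y1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])

    # Union counts the overlap only once
    union = area_a + area_b - inter
    return inter / union
```

Two identical boxes give an IoU of 1.0, disjoint boxes give 0.0, and a prediction above the usual 50% threshold would count as a positive match.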
Image source: designed by the author.
mAP, or mean average precision, ultimately assesses the performance of the object classification. Once bounding box predictions have been made, those meeting a prescribed IoU level (often 50%) are deemed positive matches while others are negatives. Consider just one object class, all of the cars for example. Average precision (AP) approximates the area under the class’s entire precision-recall curve. The exact details of this calculation vary a bit between datasets because sometimes this quantity is interpolated, but overall, AP attempts to get a sense for precision across all possible values of recall. mAP then computes the numerical mean of AP for all classes of objects in the dataset (cars, trucks, dogs, etc.). Successful object detection frameworks perform well over many item types, thus producing higher mAP scores.
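To make the AP and mAP calculations more tangible, here is one minimal, non-interpolated variant: detections for a class are sorted by confidence, precision is accumulated at each true positive, and the sum is divided by the number of ground-truth objects. As noted above, benchmarks differ in the exact formula (some interpolate the curve), so treat this as a sketch rather than any one dataset's official metric.

```python
def average_precision(scores, is_positive, num_ground_truth):
    """Non-interpolated AP for one object class.

    scores: confidence of each detection.
    is_positive: whether each detection matched a ground-truth box
        (e.g. passed the 50% IoU threshold).
    num_ground_truth: total ground-truth objects of this class.
    """
    # Rank detections from most to least confident
    order = sorted(range(len(scores)), key=lambda i: -scores[i])

    true_positives = 0
    ap = 0.0
    for rank, i in enumerate(order, start=1):
        if is_positive[i]:
            true_positives += 1
            # Precision at this point on the recall curve
            ap += true_positives / rank
    return ap / num_ground_truth

def mean_average_precision(ap_per_class):
    # mAP is simply the arithmetic mean of AP across all classes
    return sum(ap_per_class) / len(ap_per_class)
```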
Building an object detection system today hinges upon making several important design decisions to suit the problem at hand. Perhaps the most critical consideration is the prioritization of prediction speed or detection accuracy. The typical way to locate items in videos requires each frame of the video to pass through the object detection procedure as an individual image. If real-time video tracking is required, the algorithm must be able to make predictions at a rate of at least 24 frames per second, meaning speed certainly ranks highly for this kind of work. Many design choices influence the speed-accuracy tradeoff, including picking between an R-CNN or SSD framework, deciding the number of layers in the CNN base, and even selecting image resolution in some approaches.
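The 24 frames-per-second requirement translates directly into a per-frame time budget, a quick back-of-the-envelope check worth doing for any real-time pipeline:

```python
def frame_budget_ms(fps):
    """Maximum time (in milliseconds) the whole detection pipeline
    may spend on a single frame to sustain the given frame rate."""
    return 1000.0 / fps

# At 24 fps, everything (preprocessing, the network's forward pass,
# and postprocessing) must finish in roughly 41.7 ms per frame.
budget = frame_budget_ms(24)
```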
Object detection presents several other challenges in addition to concerns about speed versus accuracy. Most object detection systems attempt to generalize in order to find items of many different shapes and sizes. Smaller objects tend to be much more difficult to catch, especially for single-shot detectors. A partial occlusion, where a portion of the item is hidden behind another item, can also confuse an object detection algorithm, causing it to miss an important object. To learn more about object detection challenges and the resulting ways researchers address these issues, check out my blog post: “5 Significant Object Detection Challenges and Solutions.”
Read more about Kimberly Fessel and the rest of the Sr. Data Scientist team here.