Instance segmentation algorithms overview
Image segmentation plays a crucial role when we want to localize an object and obtain information about its morphology (shape). Being one of the most widely used computer vision techniques, image segmentation is also a crucial step in building intelligent systems that can interact with the surrounding world and support human work. Image segmentation is widely used in self-driving cars, medical image analysis (X-rays, dental imaging), or satellite imagery.
Figure 1: Instance segmentation example (source)
Types of image segmentation
Image segmentation techniques can be divided into three main categories:
Semantic segmentation
The goal of semantic segmentation is to classify image pixels into a set of categories without differentiating separate object instances: all objects of the same class are segmented as one region. For example, we can recognize all cancer cells in medical scans, but we cannot distinguish one cell from another.
Instance segmentation
Compared to semantic segmentation, instance segmentation additionally differentiates each instance of the same class in the visual input. It combines two tasks: correctly detecting each object in the scene and precisely segmenting each instance. It can be considered semantic segmentation and object detection performed at the same time. In a cancer cell segmentation task, we can accurately predict each cell's shape and distinguish one cell from another.
Panoptic segmentation
Although panoptic segmentation is not a ground-breaking concept, it is very useful and important. In a nutshell, it connects the two previous segmentation approaches: it performs semantic and instance segmentation on a given image and merges the outcomes into one result. Panoptic segmentation provides holistic scene understanding, which is crucial for enabling intelligent behavior: intelligent systems should understand the visual scene both at the pixel level and at the class instance level. Panoptic segmentation categorizes scenes into "stuff" and "things". Stuff is, for example, the sky or a sidewalk - generally speaking, a background class - while "things" are particular instances of foreground classes, for example pedestrians, cars, cancer cells, etc. Example network architectures are Panoptic FPN or EfficientPS.
You can read more about panoptic segmentation in satellite imagery interpretation in this article by Maciej Adamiak.
Figure 2: semantic segmentation vs instance segmentation vs panoptic segmentation comparison (source)
Instance Segmentation
In this article, I will focus on the instance segmentation case. The biggest challenge in instance segmentation is dealing with occluded objects of the same class. In other words, instance segmentation has to properly assign pixels belonging to the same class and separate different instances, including instances that overlap. The main idea to help the neural network solve that task is to perform the segmentation on bounding boxes rather than processing the whole image. Solutions that follow this approach are called two-stage instance segmentation algorithms; an example is Mask R-CNN (16th place in the COCO instance segmentation challenge as of 04/2022). There are also neural networks that perform one-shot instance segmentation, for example PolarMask++ or YOLACT. One-shot means that the network performs object classification and object segmentation simultaneously, in parallel. Two-stage instance segmentation algorithms, on the other hand, first detect objects with bounding boxes and then run a segmentation head on the object proposals. One-shot methods are much faster than proposal-based approaches; however, they achieve worse results in terms of Average Precision. PolarMask++ took 38th place in the COCO instance segmentation challenge (as of 04/2022).
Categories of instance segmentation methods
We can divide instance segmentation solutions into three main categories based on their architecture:
Detection-based instance segmentation
The detection-based method is also referred to as the proposal-based method. First, this method detects different objects in the image using a detection network. Then, it runs a segmentation head on each detected bounding box in order to obtain instance segmentation. This architecture is implemented, for example, by Mask R-CNN.
Figure 3: Two-stage instance segmentation block diagram
Single-shot instance segmentation
Single-shot instance segmentation, in other words the proposal-free method, is a real-time approach, usually much faster than the detection-based approach. At the same time, the single-shot approach is less accurate. Representative examples of single-shot architectures are YOLACT (You Only Look At CoefficienTs), PolarMask++, and SSAP (Single-Shot Instance Segmentation With Affinity Pyramid). SSAP first runs a pixel category classification network (semantic segmentation) and then performs instance segmentation on the semantic segmentation map.
Figure 4: One-stage instance segmentation block diagram
Transformer-based methods
For a long time, Convolutional Neural Networks (CNNs) have been the standard computer vision algorithm. Everything started to change with the big success of transformers in NLP, which caused the research community to look for an equivalent to handle computer vision tasks. Starting from ViT (An Image is Worth 16x16 Words), transformers started to take over different challenges in computer vision. Transformer-based networks are top-rated in image classification challenges like ImageNet-1K, detection challenges like COCO object detection, and the semantic segmentation benchmark ADE20K. The same thing happens in instance segmentation: top-performing instance segmentation networks are transformer-based, among others ISTR-SMT (9th place in the COCO instance segmentation challenge), SWIN-L (6th place), and SwinV2-G (1st place). A transformer-based approach can be considered a proposal-based method; the main difference is the usage of transformers in the detection and mask generation phases. I believe that treating it as a separate category lets us zoom in on the transformer architecture itself.
Figure 5: Benchmark for instance segmentation networks (source)
Figure 5 presents a COCO benchmark for instance segmentation algorithms. All top-performing architectures, like SwinV2-G or SWIN-L, are transformer-based. It is worth noting that Mask R-CNN can achieve a much better score by a simple backbone swap: the same architecture, but using SpineNet-190 instead of ResNeXt-101 FPN for feature extraction, performs significantly better.
Example architectures overview
In this section, I will walk you through a representative architecture for each main instance segmentation category.
Mask R-CNN
Mask R-CNN is a very well-known and popular instance segmentation network. The solution builds on top of the Faster R-CNN architecture. The Faster R-CNN pipeline remains unchanged - there is a feature extraction backbone, a Region Proposal Network, and classification and box regression branches. Mask R-CNN additionally introduces a segmentation branch that runs on ROIs in parallel with the classification and regression branches.
Figure 6: The left and right images show the mask head for two existing Faster R-CNN heads. The left figure presents the mask head for the ResNet C4 backbone and the right figure shows the mask head for FPN backbones (source)
The segmentation branch is a couple of convolutional layers (2 or 4, depending on the architecture) stacked together; convolutional layers are good at producing spatially coherent masks. The loss function is extended with the L<sub>mask</sub> term.
L = L<sub>cls</sub> + L<sub>box</sub> + L<sub>mask</sub>
L<sub>mask</sub> is defined as per-pixel binary cross-entropy with sigmoid activation on the output mask. The original implementation uses 28x28 resolution output masks, which are then resized to the desired resolution.
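To make the loss term concrete, here is a minimal sketch of a per-pixel binary cross-entropy mask loss, assuming the mask head predicts one 28x28 mask per class per ROI and only the ground-truth class mask contributes to the loss (the tensors below are dummy data, not Mask R-CNN's actual implementation):

```python
import torch
import torch.nn.functional as F

# Dummy shapes: 4 ROIs, 80 classes, 28x28 mask resolution (illustrative only)
num_rois, num_classes = 4, 80
pred_masks = torch.randn(num_rois, num_classes, 28, 28)      # raw mask logits
gt_masks = torch.randint(0, 2, (num_rois, 28, 28)).float()   # binary targets
gt_classes = torch.randint(0, num_classes, (num_rois,))      # ground-truth class per ROI

# Select the predicted mask of the ground-truth class for each ROI
selected = pred_masks[torch.arange(num_rois), gt_classes]    # [num_rois, 28, 28]

# Per-pixel binary cross-entropy with sigmoid applied to the logits
l_mask = F.binary_cross_entropy_with_logits(selected, gt_masks)
print(l_mask)
```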
Figure 7: Mask R-CNN architecture (source)
A pre-trained implementation of Mask R-CNN can be directly obtained from the detectron2 repo or torchvision.models.detection.mask_rcnn with the following code:
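Below is a minimal torchvision-based sketch (note that newer torchvision versions replace the `pretrained=True` flag with a `weights` argument):

```python
import torch
import torchvision

# Load a Mask R-CNN with a ResNet-50 FPN backbone pre-trained on COCO
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
model.eval()

# Inference on a dummy image (a list of 3xHxW float tensors scaled to [0, 1])
image = torch.rand(3, 480, 640)
with torch.no_grad():
    predictions = model([image])

# Each prediction dict contains boxes, labels, scores, and per-instance masks
print(predictions[0].keys())
```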
ISTR
The ISTR architecture is very similar to the Mask R-CNN architecture. It also uses a ResNet backbone with a Feature Pyramid Network (FPN) for feature extraction, and it also has a classification branch, a regression branch, and a mask generation branch. It follows the same two-stage approach - first detecting an object, then segmenting it. The main difference lies in the design of the classification, regression, and mask branches and of the region of interest proposal mechanism. ISTR proposes a transformer-based architecture that allows end-to-end training, unlike Mask R-CNN, which first requires ROI network training and then classification, regression, and mask network training. Moreover, the new architecture outperforms the Mask R-CNN AP score on the COCO benchmark by 3.5 percentage points.
Figure 8: ISTR architecture (source)
Feature extraction backbone
In a nutshell, the backbone network (for example ResNet-50) extracts features from the input image. Feature extraction is the process of extracting meaningful information from an image - edges, circles, squares, sharp corners, etc. - that, combined together, allows identifying the class of an object. The idea of feature extraction is presented in Figure 9 below.
Figure 9: Feature extraction process
The obtained features are further processed by the FPN at different scales, which makes the model scale-invariant. Then the processing pipeline is split into two branches: region of interest (ROI) features and image features. Image features are simply features from the FPN, averaged over the height (H) and width (W) dimensions, summed over all pyramid stages, and repeated 300 times to match the number of proposal boxes; the output shape is [Bx300x256]. Next, positional embeddings are added to the image features. The learnable positional embeddings are initialized randomly. The ROI features branch consists of 300 learnable query boxes that initially cover the whole image; the ROI query has a shape of [Bx300x4]. The averaged image features and bounding-box (bbox) queries are the input to the six-stage attention head.
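A minimal sketch of how this feature preparation could look in PyTorch, assuming hypothetical FPN output shapes and illustrative variable names (this is not the official ISTR code):

```python
import torch
import torch.nn as nn

B, C, num_queries = 2, 256, 300

# Hypothetical FPN outputs at several pyramid levels, each of shape [B, C, H, W]
fpn_levels = [torch.randn(B, C, 200, 304), torch.randn(B, C, 100, 152),
              torch.randn(B, C, 50, 76), torch.randn(B, C, 25, 38)]

# Average each level over H and W, then sum across pyramid stages -> [B, C]
img_feat = sum(level.mean(dim=(2, 3)) for level in fpn_levels)

# Repeat for every query box to match the number of proposals -> [B, 300, 256]
img_feat = img_feat.unsqueeze(1).repeat(1, num_queries, 1)

# Learnable positional embeddings, initialized randomly, added to the image features
pos_embed = nn.Parameter(torch.randn(num_queries, C))
img_feat = img_feat + pos_embed

# Learnable query boxes (normalized cx, cy, w, h), initially covering the whole image
roi_queries = nn.Parameter(torch.tensor([[0.5, 0.5, 1.0, 1.0]]).repeat(num_queries, 1))
print(img_feat.shape, roi_queries.shape)  # [2, 300, 256], [300, 4]
```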
Attention head
In the attention head, the averaged image features are passed through a multi-head attention layer (as in the original transformer implementation) to capture complex relationships among different features; the value, key, and query are all the same features. The idea of a multi-head attention layer is presented in Figure 10, and a minimal code sketch follows the figure.
Figure 10: Multi-head attention architecture (source)
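As a minimal sketch, this self-attention step can be expressed with PyTorch's built-in multi-head attention, with query, key, and value all set to the same tensor (the head count here is an illustrative assumption):

```python
import torch
import torch.nn as nn

B, num_queries, C = 2, 300, 256
img_feat = torch.randn(B, num_queries, C)  # averaged image features

# Self-attention: query, key, and value are all the same features
self_attn = nn.MultiheadAttention(embed_dim=C, num_heads=8, batch_first=True)
attended, _ = self_attn(query=img_feat, key=img_feat, value=img_feat)
print(attended.shape)  # [2, 300, 256]
```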
Features obtained from the multi-head attention are used in the dynamic attention block to better fuse the ROI and image features. The dynamic attention block is a linear layer applied to the image features followed by a matrix multiplication between the image features and the ROI features. The ROI features are features pooled from the FPN based on the bbox queries. The output of the dynamic attention layer is propagated through 3 stages of fully connected layers, and at that point the processing pipeline splits into the classification, regression, and mask segmentation branches.
The classification and regression branches are eight stacked linear layers with normalization and the ELU activation function. The classification branch has an output shape of [Bx300x#classes] and uses a softmax layer for classification. The regression branch learns the deformation of the query boxes (deltas of the query boxes, in other words how to modify the normalized x, y, width, and height parameters of the query bbox to match the ground truth), and its output shape is [Bx300x4]. The boxes obtained from the regression branch are then used for pooling features from the FPN. The pooled features of shape [Bx300x256x28x28] are propagated through a 2D convolutional encoder with a 1x1 bottleneck; the output of the encoder is [Bx300x256x1x1]. The newly predicted boxes and the features obtained as the output of the FFN module are reused as new input (bbox queries and averaged image features) to the attention head. The attention head is run six times in a loop. At inference time, the final classifications, boxes, and segmentation masks are the averages of the values from each of the six stages of the attention head.
Mask head
During training and inference, the model outputs an encoded mask of shape [Bx300x256x1x1], which is the output of the mask encoder module. In order to obtain the final segmentation mask, the encoded mask needs to go through the mask decoder. The decoder is fixed and pre-learned, meaning it doesn't update its weights during the training procedure. It would be possible to insert a U-Net instead of a pre-learned encoder-decoder and learn the U-Net weights during training, but there is a smarter way. The authors of ISTR argue that not all pixels in the 28x28 masks are equally likely to appear. They perform a PCA analysis on the ground-truth masks and find that the majority of the information about the masks is embedded in the first few principal components. Therefore, they use a PCA encoder-decoder and learn the low-dimensional embeddings instead of the full mask. The official implementation uses class-agnostic PCA.
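A minimal sketch of the idea behind such a fixed mask encoder-decoder, using scikit-learn's PCA on flattened ground-truth masks (the embedding dimension and the random data below are illustrative, not the official implementation):

```python
import numpy as np
from sklearn.decomposition import PCA

# Dummy "ground-truth" masks: 1000 flattened 28x28 binary masks (illustrative data)
num_masks, mask_size, embed_dim = 1000, 28, 60
gt_masks = (np.random.rand(num_masks, mask_size * mask_size) > 0.5).astype(np.float32)

# Fit PCA once on the ground-truth masks; it stays frozen ("pre-learned") during training
pca = PCA(n_components=embed_dim)
pca.fit(gt_masks)

# Encode a mask into a low-dimensional embedding and decode it back
embedding = pca.transform(gt_masks[:1])            # [1, embed_dim]
reconstructed = pca.inverse_transform(embedding)   # [1, 784]
reconstructed = reconstructed.reshape(mask_size, mask_size)
print(embedding.shape, reconstructed.shape)
```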
YOLACT
YOLACT is a real-time instance segmentation algorithm. Unlike Mask R-CNN or ISTR, it is a proposal-free approach. It has a much worse AP than Mask R-CNN or ISTR, but it outperforms both networks in terms of inference speed and training time. It achieves 38.5 FPS with a ResNet-101 backbone on the RTX 2080 Ti and 45.9 FPS with a Darknet-53 backbone, reaching 29.8 and 28.9 AP, respectively. It is best suited for real-time applications where mask accuracy is not crucial.
Figure 11: Different instance segmentation algorithms comparison (source)
The proposal-free instance segmentation task is not trivial: SOTA solutions for instance segmentation heavily depend on feature localization (with boxes and ROI pooling layers) to produce masks. YOLACT forgoes explicit localization steps by processing the image in two parallel branches. The bottom branch, "Protonet", generates prototype masks using convolutional layers; it doesn't know anything about instances or class types of the predicted masks. The upper branch (the prediction head with NMS), on the other hand, uses fully convolutional layers to produce mask coefficients: for each detection it outputs 4 bounding-box coordinates, a class prediction, and k mask coefficients. The assembly of the Protonet and prediction head outputs is achieved efficiently with simple matrix multiplication, addition, and subtraction. The reasoning behind the two-branch processing is as follows: fully connected layers are good at producing semantic vectors, and convolutional layers are good at producing spatially coherent masks. The two parallel branches allow fast inference, and the assembly step adds only a small computational overhead.
Figure 12: YOLACT architecture (source)
Feature extraction backbone
For feature extraction, YOLACT uses the same approach as Mask R-CNN or ISTR: a ResNet backbone with an FPN to make the model scale-invariant.
Protonet
Protonet is just 5 convolutional layers with ReLU activations and one upscaling interpolation layer stacked together. Protonet aims to predict generic, class-agnostic masks without knowing the instance types. The output of this module has a [Bx32x138x138] shape. For prediction, it uses only the highest-resolution feature map from the FPN, of shape [Bx256x69x69].
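A rough Protonet-like sketch in PyTorch, following the layer counts and shapes described above; the kernel sizes and other details are assumptions rather than the official YOLACT code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Protonet(nn.Module):
    """5 convolutions with ReLU plus one bilinear upscaling, producing k prototype masks."""
    def __init__(self, in_channels=256, num_prototypes=32):
        super().__init__()
        self.pre = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(in_channels, in_channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(in_channels, in_channels, 3, padding=1), nn.ReLU())
        self.post = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(in_channels, num_prototypes, 1))

    def forward(self, x):                                      # [B, 256, 69, 69]
        x = self.pre(x)
        x = F.interpolate(x, scale_factor=2, mode="bilinear")  # upscale to 138x138
        return self.post(x)                                    # [B, 32, 138, 138]

prototypes = Protonet()(torch.randn(1, 256, 69, 69))
print(prototypes.shape)  # [1, 32, 138, 138]
```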
Prediction Head
The prediction head outputs the boxes, classes, and mask coefficients. It is run on each FPN feature map resolution (5 in total) in order to detect boxes and instances at different scales. It first passes each FPN level through a single convolutional layer and then splits the processing into 3 branches: one for classification, one for bbox regression, and one for mask coefficients; each branch is a convolutional layer. For each location on the output feature map it outputs 81x3 class predictions (every location has 3 anchor proposals: two rectangles and one square), 4x3 bbox predictions (again, 3 anchor proposals per location), and 32 mask coefficients. It is worth mentioning that the number of mask coefficients has to be the same as the number of masks in the Protonet branch. The output feature map is of size [69x69].
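A minimal sketch of a YOLACT-style prediction head applied to a single FPN level, using the anchor, class, and coefficient counts from the description above (everything else is an illustrative assumption, not the official implementation):

```python
import torch
import torch.nn as nn

class PredictionHead(nn.Module):
    def __init__(self, in_channels=256, num_anchors=3, num_classes=81, num_coeffs=32):
        super().__init__()
        self.shared = nn.Conv2d(in_channels, in_channels, 3, padding=1)
        # One convolutional branch each for classes, boxes, and mask coefficients
        self.cls = nn.Conv2d(in_channels, num_anchors * num_classes, 3, padding=1)
        self.box = nn.Conv2d(in_channels, num_anchors * 4, 3, padding=1)
        self.coef = nn.Conv2d(in_channels, num_anchors * num_coeffs, 3, padding=1)

    def forward(self, fpn_level):
        x = torch.relu(self.shared(fpn_level))
        return self.cls(x), self.box(x), self.coef(x)

head = PredictionHead()
cls, box, coef = head(torch.randn(1, 256, 69, 69))
print(cls.shape, box.shape, coef.shape)  # [1, 243, 69, 69] [1, 12, 69, 69] [1, 96, 69, 69]
```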
Final prediction
The final mask prediction is achieved by matrix multiplication of the mask coefficients (32) for each detected instance in the image and the masks output by the "Protonet" branch (also 32 masks, with h, w equal to [138x138]). The matrix multiplication weighs each prototype mask according to the mask coefficients and sums them; the output is a final mask prediction for each detected instance.
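A minimal sketch of that assembly step, using dummy tensors in place of the real Protonet output and predicted coefficients:

```python
import torch

num_prototypes, num_instances = 32, 10
prototypes = torch.randn(138, 138, num_prototypes)          # Protonet output, [H, W, k]
coefficients = torch.randn(num_instances, num_prototypes)   # one coefficient row per detection

# Linear combination of prototypes weighted by each instance's coefficients,
# followed by a sigmoid: [H, W, k] x [k, num_instances] -> [H, W, num_instances]
masks = torch.sigmoid(prototypes @ coefficients.t())

# Threshold to obtain a binary mask per detected instance
binary_masks = masks > 0.5
print(binary_masks.shape)  # [138, 138, 10]
```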
Summary
An instance segmentation algorithm generates a binary mask for each separate object in the image. As we just saw, instance segmentation approaches can be divided into two main categories: detection-based instance segmentation algorithms and single-shot instance segmentation algorithms. The first approach achieves a better mAP, i.e., a larger overlap between the predicted and ground-truth masks.
The second approach, on the other hand, is best suited for real-time scenarios, as it is much faster than the first approach. We also saw a third approach that utilizes a transformer-based architecture. It follows the same idea as the two-stage instance segmentation algorithms; however, it implements it using a transformer architecture and thus differs from the original solution.
The transformer-based approach achieves an even higher mAP than CNN-based two-stage architectures; however, it requires a significant amount of training data. An interesting implementation of a transformer-based instance segmentation algorithm for videos is End-to-End Video Instance Segmentation with Transformers. It additionally utilizes information between frames to assign the same masks across frames and further improve mask quality.