Instance segmentation loss functions
Instance segmentation aims to generate a binary mask for each object detected in the scene. There are two main categories of instance segmentation algorithms: detection-based instance segmentation algorithms and single-shot instance segmentation algorithms. The former generates higher quality masks; however, the latter is faster.
In the 1st article in the series, Instance segmentation algorithms overview, we dive deep into the types and approaches used to perform instance segmentation, and the 2nd part, Instance segmentation evaluation criteria, presents the evaluation metrics used by instance segmentation models. This article, in turn, focuses on the loss functions used to train instance segmentation algorithms. The most commonly used is the focal loss; however, it is not the only one out there.
For instance segmentation tasks, we can use the following loss functions:
- Weighted binary cross-entropy loss
- Focal Loss
- Dice Loss
- Generalized IoU
- Boundary loss
- Lovasz softmax loss
Weighted binary cross-entropy loss
It is a binary cross-entropy loss that handles class imbalance. Let’s imagine a seagull instance segmentation model. The input to and the output of the model are presented in Figure 1.
Figure 1. The left image is an input to the instance segmentation network and the right image is a segmentation mask obtained as an output of the instance segmentation network.
The network predicts black pixels as the background class (0) and white pixels as the object class (1). In this scenario, the network predicted the mask of the seagull poorly. However, if we calculated the loss using standard binary cross-entropy, the loss would be low (close to zero), which would suggest that the network performs well. Why is that?
It is because of a huge class imbalance. The number of black pixels (negative class) is significantly higher than the number of white pixels (positive class). Because most of the black pixels were correctly classified as background, the overall loss is small and the model appears to perform well. To handle class imbalance, we can use weighted binary cross-entropy (WBCE). WBCE takes into account the number of instances in each class:
$$WBCE = -\big(\alpha \, y \log(p) + (1 - \alpha)(1 - y) \log(1 - p)\big)$$

The 𝛂 parameter is a class weight. In the case of the seagull example, 𝛂 should be close to 1 - for instance, the fraction of black (background) pixels in the image - so that the rare white pixels receive a proportionally larger weight.
Incorporating the weight parameter into the binary cross-entropy favors the class with the smaller number of instances. In the given example, the white pixel class is the one with a small number of instances, so the weighted BCE loss will pay more attention to correctly classifying white pixels and less attention to correctly classifying black pixels. Therefore, the loss in the given example would be high, even though most of the black pixels were classified correctly.
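Below is a minimal PyTorch sketch of weighted binary cross-entropy for segmentation masks. The function name, the epsilon constant, and the per-pixel averaging are illustrative choices, not a reference implementation:

```python
import torch

def weighted_bce(probs, targets, alpha, eps=1e-7):
    """Weighted binary cross-entropy over per-pixel probabilities.

    probs   -- predicted foreground probabilities, shape (H, W)
    targets -- ground truth mask of 0s and 1s, shape (H, W)
    alpha   -- weight of the positive (foreground) class, in [0, 1]
    """
    loss = -(alpha * targets * torch.log(probs + eps)
             + (1 - alpha) * (1 - targets) * torch.log(1 - probs + eps))
    return loss.mean()

# With alpha set to the fraction of background pixels, misclassifying
# the rare foreground pixels dominates the loss:
targets = torch.zeros(100, 100)
targets[40:60, 40:60] = 1             # small white object
alpha = 1 - targets.mean()            # ~0.96 for this mask
probs = torch.full((100, 100), 0.05)  # network predicts background everywhere
print(weighted_bce(probs, targets, alpha))
```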
Focal Loss
Focal loss is an improved version of the binary cross-entropy loss. It handles the class imbalance problem and additionally adds a gamma parameter, which focuses the training on hard-to-classify cases.
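For reference, the focal loss introduced by Lin et al. is defined as:

$$FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$$

where $p_t$ is the probability the model assigns to the correct class, $\alpha_t$ is the class weight, and $(1 - p_t)^{\gamma}$ is the modulating term.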
It focuses more on hard-to-classify examples, giving them more attention, and reduces the loss for easy-to-classify ones. Let’s imagine an object classification task where we want to correctly classify dogs, wolves, and planes. While distinguishing between a dog and a plane or a wolf and a plane is fairly easy, the model may struggle to tell a dog and a wolf apart. Here comes the power of focal loss. The modulating term further increases the loss if the model predicts a low probability for the correct class label, and further decreases the loss if the predicted probability for the correct class is high.
The influence of gamma on the loss function is best illustrated by Figure 2. It shows that the loss value is squeezed toward zero for high-confidence correct predictions, while the loss stays high for low probability values assigned to correct classes.
Figure 2. Focal loss curve for different values of gamma.
The gamma parameter controls the shape of the curve. The focusing parameter γ controls the strength of the modulating term: when γ = 0, the focal loss is equivalent to the CE loss, and as γ increases, the shape of the loss changes so that “easy” examples with low loss get further discounted (see Figure 2). The authors report large gains over CE as γ is increased; with γ = 2, FL yields a 2.9 AP improvement over the α-balanced CE loss.
Focal loss can be easily adapted to the instance segmentation scenario. The weight (alpha) parameter handles the white-black pixel class imbalance the same way as weighted binary cross-entropy does. The gamma parameter allows focusing more on hard-to-classify pixels. For example, correctly classifying pixels close to the center of an object is a fairly easy task, so the network will predict the object’s center pixels as white with high confidence. On the other hand, classifying edge pixels is challenging; those pixels may or may not be classified as belonging to the object. The model should pay more attention to these pixels during learning, and focal loss does exactly that with the gamma term.
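A minimal per-pixel sketch of the focal loss, assuming sigmoid foreground probabilities as input (the function name and the epsilon constant are illustrative):

```python
import torch

def focal_loss(probs, targets, alpha=0.25, gamma=2.0, eps=1e-7):
    """Binary focal loss on predicted per-pixel foreground probabilities."""
    # p_t: probability assigned to the true class of each pixel
    p_t = torch.where(targets == 1, probs, 1 - probs)
    # alpha_t: class weight for foreground vs. background pixels
    alpha_t = torch.where(targets == 1,
                          torch.full_like(probs, alpha),
                          torch.full_like(probs, 1 - alpha))
    # (1 - p_t)^gamma down-weights easy, high-confidence pixels
    return (-alpha_t * (1 - p_t) ** gamma * torch.log(p_t + eps)).mean()

# A confident correct prediction contributes almost nothing,
# a confident mistake contributes a lot:
print(focal_loss(torch.tensor([0.95]), torch.tensor([1.0])))  # small loss
print(focal_loss(torch.tensor([0.05]), torch.tensor([1.0])))  # large loss
```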
Dice Loss
Dice loss is widely used in medical image segmentation tasks. It tackles the problem of class imbalance. The dice loss formula is given by the following equation:

$$L_{Dice} = 1 - DSC$$

where DSC is the Dice coefficient given by the equation:

$$DSC = \frac{2\,|A \cap B|}{|A| + |B|}$$
or, equivalently, in terms of true positives, false positives, and false negatives:

$$DSC = \frac{2\,TP}{2\,TP + FP + FN}$$
Figure 3: Dice loss computation
Dice loss is very similar to IoU: it is the area of overlap divided by the total area of the predicted and ground truth shapes. The main difference is in the denominator - IoU uses the area of the union, while DSC uses the sum of the areas. The DSC is equal to 1 if the two areas overlap perfectly and equal to 0 if they do not overlap at all. To make it a valid loss, we simply take 1 - DSC, so that minimizing the loss trains the model. The main disadvantage of dice loss is that when the two areas do not overlap, the loss takes the same value regardless of how far apart the ground truth and predicted pixels are.
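A soft Dice loss takes only a few lines; the smoothing epsilon that avoids division by zero is an implementation choice:

```python
import torch

def dice_loss(probs, targets, eps=1e-7):
    """Soft Dice loss on predicted per-pixel foreground probabilities."""
    # 2 * |A ∩ B| / (|A| + |B|), computed on soft probabilities
    intersection = (probs * targets).sum()
    dsc = (2 * intersection + eps) / (probs.sum() + targets.sum() + eps)
    return 1 - dsc
```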
Generalized IoU
IoU measures the overlapping area of two bounding boxes or, generally speaking, shapes, and normalizes it by the area of their union. IoU is successfully used for evaluation; however, it has problems when it comes to the loss application. IoU is equal to zero if there is no overlap between the ground truth and the predicted bounding box. It means that it doesn’t matter whether the two boxes are just one pixel apart or 100 pixels apart - the loss would be the same anyway. What’s more, if there is no intersection between the ground truth and the predicted mask, the IoU is constant at zero and thus provides no gradient, while a loss function has to be differentiable to allow backpropagation. This makes plain IoU a poor choice for a loss. To overcome this problem, Stanford researchers proposed the generalized IoU (GIoU) loss, which is fully differentiable.
GIoU extends the standard IoU with a penalty term based on C, the smallest area enclosing both bounding boxes:

$$GIoU = IoU - \frac{|C \setminus (A \cup B)|}{|C|}$$

Adding the second term to the standard IoU ensures that the loss is smaller when two bounding boxes approach each other and greater when they are far away from each other. The GIoU ranges from -1 to 1. Negative values occur when the fraction of C not covered by the two boxes is greater than the IoU. As the overlap increases, the penalty term shrinks and the value of GIoU converges to the IoU.
A loss function to be employed for IoU would be described by the following equation:

$$L_{IoU} = 1 - IoU$$
In the same way, the loss for GIoU can be written as:

$$L_{GIoU} = 1 - GIoU$$
For multilabel datasets, GIoU is commonly averaged across classes, yielding the mean GIoU (mGIoU). You can find more information in the original generalized IoU paper.
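A minimal sketch of the GIoU loss for a single pair of axis-aligned boxes in (x1, y1, x2, y2) format (the function name and argument layout are illustrative):

```python
import torch

def giou_loss(pred, gt):
    """GIoU loss for two boxes given as tensors (x1, y1, x2, y2)."""
    # Intersection rectangle (empty intersections clamp to zero area)
    ix1, iy1 = torch.max(pred[0], gt[0]), torch.max(pred[1], gt[1])
    ix2, iy2 = torch.min(pred[2], gt[2]), torch.min(pred[3], gt[3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)
    # Union of the two boxes
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_pred + area_gt - inter
    iou = inter / union
    # C: the smallest box enclosing both boxes
    cx1, cy1 = torch.min(pred[0], gt[0]), torch.min(pred[1], gt[1])
    cx2, cy2 = torch.max(pred[2], gt[2]), torch.max(pred[3], gt[3])
    area_c = (cx2 - cx1) * (cy2 - cy1)
    giou = iou - (area_c - union) / area_c
    return 1 - giou

# Non-overlapping boxes still produce a useful, distance-aware signal:
print(giou_loss(torch.tensor([0., 0., 2., 2.]),
                torch.tensor([3., 3., 5., 5.])))  # ~1.68
```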
Boundary loss
Dice or cross-entropy losses are based on integrals over the segmentation regions. Unfortunately, for highly imbalanced segmentations, such regional summations have values that differ by several orders of magnitude across classes, which affects training performance and stability. One way to handle it is to add class weights to the loss: classes with few occurrences receive more attention and classes with many occurrences less. Another approach is taken by boundary loss, which sums over the boundary regions rather than over the overlapping regions. In other words, the distance between the ground truth contour and the predicted contour is taken into account rather than the overlap of the masks.
Figure 4. The relationship between differential and integral approaches for evaluating boundary change (variation)
Boundary loss computation example
Let’s go through the example presented in Table 1 in order to understand how boundary loss actually works.
The value obtained in step 4 is the final boundary loss, and we could stop here.
Let’s write the boundary loss down with mathematical equations. Following the notation of the boundary loss paper, let $\phi_G$ be the signed distance map of the ground truth region $G$ (negative inside $G$, positive outside) and $s_\theta(q)$ the softmax prediction of the network at pixel $q$:

$$L_B(\theta) = \int_{\Omega} \phi_G(q)\, s_\theta(q)\, dq - \int_{G} \phi_G(q)\, dq$$

The first integral in the above equation corresponds to the multiplication of the predicted (orange) mask with the ground truth distance field, and it is the final boundary loss value. The second integral subtracts the distance field inside the G area. However, the integral over the distances inside the G area is constant and independent of the network parameters, and thus can be omitted. The simplified boundary loss without the last term is presented in the equation below; it is exactly equal to the element-wise multiplication of the predicted orange mask with the pre-computed ground truth distance mask from the example.

$$L_B(\theta) = \int_{\Omega} \phi_G(q)\, s_\theta(q)\, dq$$
The boundary loss described by the equation above is minimized (achieves its minimum value) when all negative values of the distance function are included in the sum (i.e., when the softmax predictions for the pixels within the ground truth foreground are equal to 1).
The boundary loss can easily be combined with standard regional losses (LR), such as Dice loss - and in practice, it usually is. There is a trivial solution the network may get stuck in: if the foreground prediction is empty, so that approximately all softmax probabilities are zero, the network receives very low gradients. This trivial solution is therefore close to a local minimum or a saddle point. To avoid it, the authors of the boundary loss suggest combining it with a region-based loss. The region-based loss matters most at the beginning of training; as training progresses, the boundary loss term starts to take over.
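A minimal sketch of the simplified boundary loss, assuming a binary NumPy ground truth mask and using SciPy’s Euclidean distance transform to build the signed distance map (the helper names are illustrative):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def signed_distance_map(gt_mask):
    """Level-set map phi_G: negative inside the object, positive outside."""
    outside = distance_transform_edt(gt_mask == 0)  # distances at background pixels
    inside = distance_transform_edt(gt_mask == 1)   # distances at foreground pixels
    return outside - inside

def boundary_loss(probs, gt_mask):
    """Element-wise product of predicted probabilities with phi_G."""
    phi = signed_distance_map(gt_mask)
    return (probs * phi).mean()

# In practice the boundary term is combined with a regional loss, e.g.
#   loss = alpha * dice_loss(probs, gt) + (1 - alpha) * boundary_loss(probs, gt)
# with alpha decreased over the course of training.
```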
Lovasz softmax loss
Lovasz softmax loss is a differentiable surrogate for the IoU (Jaccard) metric: it applies the Lovász extension of the discrete Jaccard loss to the network’s predictions, which lets the model optimize the IoU score directly. It is worth recalling why cross-entropy-style losses need rebalancing in the first place. If pt is equal to 1, log(pt) is equal to 0 and the loss is zero; when pt approaches 0, log(pt) goes to minus infinity. For negative classes, when we want the predicted pixel to be equal to 0, we use the “otherwise” branch of the equation. In the image segmentation task, pixels that we want to include in the mask are labeled as 1 and all other (background) pixels are labeled as 0. Image segmentation is a good example of an imbalanced classification problem: the number of background pixels is usually significantly larger than the number of object pixels. Therefore, weighted binary cross-entropy, focal loss, or an IoU-based surrogate such as the Lovasz softmax loss are good choices for the object segmentation loss function.
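For completeness, here is a sketch of the binary Lovász hinge variant, following the reference implementation released by the authors (Berman et al.); the multi-class Lovász-softmax follows the same pattern with per-class errors:

```python
import torch

def lovasz_grad(gt_sorted):
    """Gradient of the Lovasz extension w.r.t. sorted errors."""
    p = len(gt_sorted)
    gts = gt_sorted.sum()
    intersection = gts - gt_sorted.cumsum(0)
    union = gts + (1 - gt_sorted).cumsum(0)
    jaccard = 1.0 - intersection / union
    if p > 1:
        jaccard[1:p] = jaccard[1:p] - jaccard[0:-1]
    return jaccard

def lovasz_hinge(logits, labels):
    """Binary Lovasz hinge loss on flattened logits and float 0/1 labels."""
    signs = 2.0 * labels - 1.0             # map {0, 1} -> {-1, +1}
    errors = 1.0 - logits * signs          # per-pixel hinge errors
    errors_sorted, perm = torch.sort(errors, descending=True)
    grad = lovasz_grad(labels[perm])
    return torch.dot(torch.relu(errors_sorted), grad)
```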
Summary
The main challenge for image segmentation loss functions is to handle class imbalance properly in the loss computation. The pixels belonging to an object usually occupy a small part of the image, so standard loss functions are highly biased toward correctly classifying the image background rather than the foreground. This article walked through a few loss functions that handle class imbalance. For more information, please check out the links attached to each loss, which point to the original scientific papers describing them. Please also check our articles about Image Segmentation and Image Segmentation metrics to better grasp the image segmentation task.