Things and stuff, or how remote sensing could benefit from panoptic segmentation

I love things and stuff because I can find there my things and all the stuff 😉

Remote sensing

Earth observation is one of the human activities that secretly supports us in our everyday lives. It’s interesting how such a complex discipline involving advanced knowledge and technology can conceal itself from the human eye. Who knows, maybe it has to do with the fact that everything is happening high above our heads 🙂

In simple words, remote sensing is a process of gathering information regarding some objects or phenomena without touching them. Concerning the Earth, it’s related to acquiring data by utilizing satellites, radars, unmanned aerial vehicles, various sensors, and more. Not only are remote sensing projects a source of valuable data but they also include important activities such as processing, interpreting, and analyzing portions of information to form usable and accessible solutions.

You can find multiple examples of Earth observation projects that directly or indirectly influence various areas of human activity: weather forecasting, detecting changes in land cover and land use, estimating forest conditions, detecting points of interest, optimizing transportation, predicting hazardous events, and many more. All of these are possible thanks to the data acquired during remote sensing missions.

You probably wonder how such tasks are carried out. It’s obvious that we are dealing with massive volumes of data here, volumes that cannot be interpreted visually (manually). Therefore, the automation of remote sensing activities was only a matter of time. So was the application of machine learning and deep learning.

In this blog post, I will focus on a particular approach to satellite imagery interpretation: panoptic segmentation. Pack your things and collect all your stuff. We are going to take a deep dive into one of the most interesting computer vision tasks out there.

Panoptic segmentation

Let’s start by describing what stuff is. When you look at an aerial image, there are areas that clearly distinguish themselves from others. Although we can name separate regions by analyzing their color, shape, and texture, it’s difficult to extract distinct instances from them. Such regions are stuff. The first-choice method of identifying stuff is semantic segmentation.

Semantic segmentation

Semantic segmentation is the machine learning task of detecting a specific region of an image and assigning it a label that makes this region distinguishable from other discovered regions. Doing that facilitates the interpretation of image content. It’s helpful when you would like to measure the area where a certain phenomenon appears, e.g. delineating forest or urban areas.
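Measuring such areas from a semantic mask is mostly pixel counting. A minimal sketch with numpy, assuming a toy mask and a made-up ground sampling distance (the class ids and GSD below are illustrative, not from any real dataset):

```python
import numpy as np

# Hypothetical 2D semantic mask: each pixel holds a class id
# (0 = water, 1 = sand, 2 = plants, 3 = sediment).
mask = np.array([
    [0, 0, 1, 1],
    [0, 2, 2, 1],
    [3, 2, 2, 1],
])

GSD = 0.25  # assumed ground sampling distance in metres per pixel

# Count the pixels per class and convert pixel counts to ground area.
class_ids, pixel_counts = np.unique(mask, return_counts=True)
areas_m2 = pixel_counts * GSD**2

for cid, area in zip(class_ids, areas_m2):
    print(f"class {cid}: {area:.4f} m^2")
```

With real imagery you would read the GSD from the raster metadata instead of hard-coding it.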

Figure 1: Semantic segmentation example of The Pilica River UAV imagery to four classes: water, sand, plants, and sediment, source: own elaboration.

Instance segmentation

On the other hand, instance segmentation is the task of detecting objects belonging to different classes and, at the same time, distinguishing individual objects within each class. This makes it easy to count each occurrence of a specific object and precisely locate it in the image. Instance segmentation lets us identify the countable things present in the analyzed area. This task is especially useful for discriminating small objects from their amorphous background.
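Once you have an instance mask, counting and locating objects is straightforward. A small sketch, assuming the common convention that 0 marks the background and each positive id marks one object (the toy mask below is made up):

```python
import numpy as np

# Hypothetical instance mask: 0 = background, positive ids = object instances
# (e.g. individual airplanes found in an orthophoto).
instance_mask = np.array([
    [0, 1, 1, 0, 2],
    [0, 1, 0, 0, 2],
    [3, 0, 0, 0, 0],
])

# Count the instances: unique positive labels.
instance_ids = np.unique(instance_mask)
instance_ids = instance_ids[instance_ids != 0]  # drop the background
print(f"{len(instance_ids)} instances found: {instance_ids.tolist()}")

# Locate each instance with a tight bounding box (row/col extents).
for iid in instance_ids:
    rows, cols = np.nonzero(instance_mask == iid)
    print(f"instance {iid}: rows {rows.min()}-{rows.max()}, "
          f"cols {cols.min()}-{cols.max()}")
```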

Figure 2: Instance segmentation example of the Chopin Airport orthophoto into two classes: airplanes and markers, source: own elaboration based on acquired data (from left: RGB image, class mask, instance mask).

A while ago, it was hardly possible to have the best of both worlds. Of course, you could start with semantic segmentation using models such as U-Net or DeepLab and then introduce a postprocessing step to split the recognized regions into separate instances. Performing the inverse process was difficult and tedious: deep learning models like Mask R-CNN are not suitable for delineating large and complex regions, which they tend to treat as background.
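The postprocessing step mentioned above can be as simple as connected-component labelling. A sketch with `scipy.ndimage.label` on a made-up binary mask; note that this naive approach merges touching objects into one instance, which is part of why the step was tedious in practice (real pipelines often add watershed or similar separation):

```python
import numpy as np
from scipy import ndimage

# Hypothetical binary output of a semantic model: 1 = "building", 0 = background.
semantic = np.array([
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
])

# Each 4-connected blob of the "building" class becomes a separate instance.
instances, n_instances = ndimage.label(semantic)
print(f"{n_instances} instances extracted")
print(instances)
```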

Mixing things and stuff is a demanding task, handled by deep learning models capable of both semantic segmentation and instance segmentation, such as Panoptic FPN.

How does a panoptic segmentation model perform both tasks simultaneously?

Figure 3: Panoptic FPN training flow, source: own elaboration based on Detectron2. For an interactive version, visit Miro.

Let’s describe all the required steps:

  1. The input to the network is an image (resized to meet the backbone input criteria) along with the original dimensions needed to recreate the image later in the process.
  2. The image is fed into a good old Feature Pyramid Network (FPN), whose purpose is to extract features from the image using a complex sequence of convolutional networks.
  3. The backbone produces a tensor describing the input image’s features, i.e. a learned representation.
  4. The features are used by a segmentation head, similar to the one found in the original implementation of FPN. Its task is to produce a segmentation map, i.e. an image with information about which predefined class each pixel belongs to. This is our stuff analysis step.
  5. Both the image and its features are used by a region proposal network (RPN) responsible for selecting relevant instance bounding boxes. This step is needed to properly localize the objects.
  6. The RPN produces the proposals.
  7. The image, features, and proposals are used as input for the ROI heads. This submodule produces class-dependent bounding boxes and instance masks. This is our things analysis step.
  8. The results are gathered and served as output that you can directly use in your research.
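The final gathering step (8) essentially pastes the instance masks over the semantic prediction. A simplified numpy sketch of that merge, with toy data standing in for the real head outputs (all arrays, ids, and the score below are made up; Detectron2’s actual merge also handles overlap thresholds and mask confidence):

```python
import numpy as np

# Per-pixel stuff class ids from the semantic head (e.g. 0 = water, 1 = sand).
semantic = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 1],
])

# "Things" from the ROI heads: (confidence score, binary mask, thing class id).
instances = [
    (0.9, np.array([[0, 1, 1, 0],
                    [0, 1, 1, 0]], dtype=bool), 7),  # e.g. class 7 = boat
]

# Paste instances over the stuff map, most confident first; each instance
# gets its own segment id starting at 100 so it never collides with stuff ids.
panoptic = semantic.copy()
segment_id = 100
for score, mask, cls in sorted(instances, key=lambda t: -t[0]):
    free = mask & (panoptic < 100)  # don't overwrite already-placed instances
    panoptic[free] = segment_id
    segment_id += 1

print(panoptic)
```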

What results can you expect from Panoptic FPN when processing remote sensing imagery? Please check the example below. Notice that the background contains multiple regions divided into classes; this was produced by the semantic segmentation step. Furthermore, you can find all the object masks with their bounding boxes, which are the result of the instance segmentation step. Things and stuff were properly combined in the same process.

Figure 4: Panoptic segmentation of a beach shore near Hel, Poland, source: own elaboration based on acquired data (upper left: RGB ground truth, upper right: semantic segmentation, lower left: instance segmentation masks, lower right: instance segmentation bounding boxes).


Satellite and aerial imagery can greatly benefit from panoptic segmentation. Not only can one analyze the occurrence of amorphous features (stuff), which gives a high-level overview of the information stored in the analyzed area, but one can also detect smaller objects (things). One of the more interesting use cases is enriching land cover/land use classification: imagine working with land cover data while also being able to automatically separate chosen regions to determine whether they are formed by several smaller instances.

If you are interested in incorporating panoptic segmentation into your commercial or research project, the Detectron2 documentation and code will be a great starting point. Remember that preparing a dataset optimized for Panoptic FPN requires preparing both the things and the stuff segmentation masks. Before starting, carefully analyze your dataset and prepare the labels. You will probably need some GIS skills for that. The alternative is to use CVAT for annotation, but it’s limited to RGB images (no multispectral imagery feature available).
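To give a feel for what such labels look like, here is a sketch of a single image entry in the COCO panoptic JSON format, which Detectron2 can register for training. The file name, ids, and numbers below are entirely made up for illustration:

```python
# One image entry in COCO panoptic style: a PNG encodes segment ids per pixel,
# and segments_info carries the metadata for every segment in that PNG.
annotation = {
    "image_id": 42,
    "file_name": "tile_000042.png",  # hypothetical panoptic PNG for this tile
    "segments_info": [
        {"id": 1, "category_id": 3, "iscrowd": 0,   # a "stuff" region
         "bbox": [0, 0, 512, 256], "area": 81234},
        {"id": 2, "category_id": 12, "iscrowd": 0,  # a "thing" instance
         "bbox": [100, 40, 64, 64], "area": 2210},
    ],
}

# Every segment, things and stuff alike, must appear in segments_info.
assert all("category_id" in s for s in annotation["segments_info"])
print(f"{len(annotation['segments_info'])} segments described")
```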

Comments and questions are more than welcome 🙂
