YOLOv11 Instance Segmentation with OpenCV and Java (Part 2)



In this article, we will look at instance segmentation using YOLOv11.

This is Part 2 of a 3-part series.

Instance segmentation goes a step further than object detection and involves identifying individual objects in an image and segmenting them from the rest of the image.

In the first part, we set up our project, prepared the input image for the model, loaded our model, and fed the input data through the network. Now we’re going to inspect the output we got from the net.forward call. This will be our predictions data!

In this part, we’ll have a look at the post-processing — extracting the segmentation information from the results. This is somewhat tricky, as it’s not very well documented.

Let’s expand a bit on our last code snippet from Part 1 and print out the prediction results, which are saved in the outputsList:

List<String> outNames = net.getUnconnectedOutLayersNames();
List<Mat> outputsList = new ArrayList<>();
net.forward(outputsList, outNames);
 
// Get relevant outputs and print them out
Mat boxOutputs = outputsList.get(0);
Mat maskOutputs = outputsList.get(1);
 
LOGGER.info("Boxes Output: "+boxOutputs.toString());
LOGGER.info("Masks Output: "+maskOutputs.toString());

After we ran our input image through the network, we got a Mat output object — this is short for “Matrix” and is OpenCV’s primary data structure for storing images or numerical data.

The first element of our list (boxOutputs) is the Bounding Box Predictions. If we wanted to just perform object detection, we’d only need these. Let’s inspect the output of boxOutputs and concentrate on its dimensions:

INFO: Boxes Output: Mat [ 1*116*8400*CV_32FC1, isCont=true, isSubmat=false, nativeObj=0x6000024672a0, dataAddr=0x7fb2e9ff8000 ]

We see that it has a shape of [1, 116, 8400]. Let’s break it down:

The boxOutputs matrix contains predictions for 8400 potential detections, where each detection has 116 values (4 for bounding box, 80 for class probabilities, and 32 for mask coefficients). The 1 is the batch size, meaning one image is processed.
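
In case you’re wondering where the 8400 comes from: assuming the default 640 × 640 input, the model makes predictions on three grids with strides 8, 16, and 32, so 80 × 80 + 40 × 40 + 20 × 20 = 6400 + 1600 + 400 = 8400 candidate cells.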

So, for every one of the 8400 predictions, we have a vector of 116 values with the following structure (pinned down as constants in the sketch right after this list):

  • 4 values for bounding box coordinates (center_x, center_y, width, height)
  • 80 values for class probabilities (80 is the number of classes the model was trained to detect — COCO dataset)
  • 32 values for mask coefficients (used to generate segmentation masks by combining them with maskOutputs; we’ll get to this in the next part)
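
To keep the index arithmetic in the upcoming snippets readable, it helps to pin this layout down as constants. A minimal sketch; only NUM_CLASSES shows up in the code later in this article, the other names are my own:

// Layout of a single prediction row (constant names are illustrative)
static final int BOX_VALUES      = 4;  // center_x, center_y, width, height
static final int NUM_CLASSES     = 80; // COCO classes
static final int NUM_MASK_COEFFS = 32; // per-detection mask coefficients
static final int ROW_LENGTH      = BOX_VALUES + NUM_CLASSES + NUM_MASK_COEFFS; // 116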

Note: The previous statement only applies to YOLO versions ≥ 8. YOLOv5, for example, has a different output shape: (batchSize, 25200, 85), i.e. box [x, y, w, h] + confidence [c] + 80 class scores. It uses anchor boxes, with three anchors per grid cell, which triples the number of predictions (25200 = 8400 × 3). We’re not going to get into details here; it’s just worth a brief mention as a side note. If you have any questions, feel free to ask in the comments.

You probably noticed that the boxes output had the structure 1 x 116 x 8400, but we talked about having 8400 predictions, and for each one of them, 116 values. In order to work with this logic properly, we need to transpose the matrix to 8400 x 116. Luckily, this can be done easily with OpenCV in Java:

Mat mat2D = boxOutputs.reshape(1, boxOutputs.size(1)); // size(1) is the 116 dimension, used here as the row count
Core.transpose(mat2D, mat2D); // flips the 116 x 8400 matrix to 8400 x 116

This will change the structure of our mat2D object to 8400 x 116.
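
If you want to double-check, logging the transposed Mat should now show the flipped shape:

LOGGER.info("Transposed: " + mat2D.toString());
// Expected output along the lines of: Mat [ 8400*116*CV_32FC1, isCont=true, ... ]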

Now we can extract the relevant information from the output. This is going to be our plan:

  1. Iterate over every one of the 8400 rows
  2. For every row, extract the 80 class probabilities and find the maximum.
  3. Check this maximum (we can also call this “score”) against a defined threshold (for example, only take predictions that have score > 0.6)
  4. Extract the mask coefficients (the last 32 values out of the total 116)
  5. Generate a mask for this detection. We will use the mask coefficients from boxOutputs and the prototype masks from maskOutputs to generate the final segmentation masks for each detected object — we’ll talk about this in detail in the next part.
var segmentationMasks = new ArrayList<Mat>();

LOGGER.info("-----Start analysing the inference-----");
for (int i = 0; i < mat2D.rows(); i++) {
    Mat detectionMat = mat2D.row(i);

    // Collect the 80 class probabilities (indices 4 to 83)
    List<Double> scores = new ArrayList<>();
    for (int j = 4; j < NUM_CLASSES + 4; j++) {
        scores.add(mat2D.get(i, j)[0]);
    }

    // Skip detections whose best class score is below our 0.6 threshold
    MaxScore maxScore = ScoreUtils.findMaxScore(scores);
    if (maxScore.maxValue() < 0.6) {
        continue;
    }

    // Extract the 32 mask coefficients (indices 84 to 115)
    Mat maskCoeffs = detectionMat.colRange(4 + NUM_CLASSES, 4 + NUM_CLASSES + 32);
    // Generate the segmentation mask for this detection
    Mat objectMask = generateMask(maskOutputs, maskCoeffs);
    segmentationMasks.add(objectMask);
}

This code snippet seems simple, but it requires an understanding of the underlying structure. Let’s walk through it to really lock it in:

The for-loop iterates through all 8400 prediction rows, each with the structure we broke down earlier:

[ x, y, w, h | 80 class probabilities | 32 mask coefficients ]

The values from index 0 to index 3 are the bounding box predictions. We would use these if we wanted simple object detection — to draw the bounding box over the original image where an object was found.
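
As a quick sketch of that simpler path (not part of this series’ actual pipeline), this is roughly how you could turn those four values into a drawable rectangle. Keep in mind the coordinates are relative to the network’s input size (e.g. 640 × 640), so in practice they would still need rescaling to the original image:

// Inside the detection loop; assumes org.opencv.core.Rect, org.opencv.core.Scalar
// and org.opencv.imgproc.Imgproc are imported. "image" is a hypothetical Mat to draw on.
double cx = detectionMat.get(0, 0)[0]; // box center x
double cy = detectionMat.get(0, 1)[0]; // box center y
double w  = detectionMat.get(0, 2)[0]; // box width
double h  = detectionMat.get(0, 3)[0]; // box height
// Convert the center-based box to a top-left-based Rect and draw it
Rect box = new Rect((int) (cx - w / 2), (int) (cy - h / 2), (int) w, (int) h);
Imgproc.rectangle(image, box, new Scalar(0, 255, 0), 2); // green outline, 2 px thick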

For instance segmentation, we first go through the class probabilities (from index 4 to index 83, i.e. 4 + NUM_CLASSES − 1, with NUM_CLASSES = 80 in our case) and find the maximum. We call this maxScore — the class with the maximum score is the class of the object that the model found.

The code for ScoreUtils really just finds the maximum value in a list of doubles. Very simple:

record MaxScore(double maxValue, int indexOfMax) {}

public class ScoreUtils {

    // Linear scan: returns the maximum value and the index where it occurs
    public static MaxScore findMaxScore(List<Double> array) {
        double max = array.get(0);
        int indexOfMax = 0;
        for (int i = 1; i < array.size(); i++) {
            if (array.get(i) > max) {
                max = array.get(i);
                indexOfMax = i;
            }
        }
        return new MaxScore(max, indexOfMax);
    }
}

After we find the maxScore, we check whether it’s above a given threshold (we chose 0.6 here). If it isn’t, we simply move on to the next prediction.
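
If you also want a human-readable class name, indexOfMax maps directly into the COCO label list. Here is a minimal sketch, assuming a coco.names file with the 80 class names (one per line), which is not something we set up in Part 1:

// Assumes java.nio.file.Files and java.nio.file.Path are imported.
// "coco.names" is a hypothetical label file: 80 lines, one COCO class name each.
// (readAllLines throws IOException, so the enclosing method needs to declare it.)
List<String> cocoLabels = Files.readAllLines(Path.of("coco.names"));
MaxScore maxScore = ScoreUtils.findMaxScore(scores);
if (maxScore.maxValue() >= 0.6) {
    LOGGER.info("Detected '" + cocoLabels.get(maxScore.indexOfMax())
            + "' with score " + maxScore.maxValue());
}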

The code for extracting the mask coefficients is fairly simple too: Mat maskCoeffs = detectionMat.colRange(4 + NUM_CLASSES, 4 + NUM_CLASSES + 32); We just take the values from index 84 (4 + NUM_CLASSES) up to, but not including, index 116 (4 + NUM_CLASSES + 32), since colRange’s end index is exclusive. That’s exactly the 32 mask coefficients at indices 84 to 115.

In Part 3, we’ll have a look at the last two steps of the for loop:

// Generate mask for this detection
Mat objectMask = generateMask(maskOutputs, maskCoeffs);
segmentationMasks.add(objectMask);

We’ll also look into how to overlay the masks over the original image. We’ll inspect the maskOutputs in detail and perform a matrix multiplication between the 32 mask coefficients we extracted in the previous step and the 32 prototype masks in maskOutputs.
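
If you’re curious, here’s a rough preview of what generateMask boils down to. This is a minimal sketch, assuming maskOutputs has the usual 1 x 32 x 160 x 160 prototype shape (we’ll verify and refine this in Part 3); the sigmoid activation, thresholding, and cropping to the bounding box are all still missing here:

// Rough preview only; full details in Part 3. Assumes maskOutputs is 1 x 32 x 160 x 160.
Mat protos = maskOutputs.reshape(1, 32);  // 32 x 25600: each row is one flattened 160x160 prototype
Mat maskFlat = new Mat();
Core.gemm(maskCoeffs, protos, 1.0, new Mat(), 0.0, maskFlat); // (1 x 32) * (32 x 25600) = 1 x 25600
Mat mask = maskFlat.reshape(1, 160);      // back to a 160 x 160 mask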


We’re not done yet — Part 3 is coming soon.

Conclusion

Java keeps getting better and better for rapid experimentation with object detection, instance segmentation, and integration with LLMs. This series will help you get a grasp of what is now possible in this ecosystem with the help of OpenCV.

Hope you’ve managed to follow along so far. Stay tuned as we tackle the rest in the next part!