MRCNN Part 1: Starting off with Instance Segmentation using Mask R-CNN!

No. Object Detection and Image Segmentation do not require a PhD.

Yes. It truly is amazing that we can access and use numerous libraries for free without understanding the complexities behind them.

This series focuses on getting started with Image Segmentation in Python, using the Mask Region-based Convolutional Neural Network (Mask R-CNN, or MRCNN for short)!

Before anything, I would like to credit Matterport, Inc for making this model so easily accessible. You can check out their repository and tutorial on how to get started with this model.

Although their tutorial is well-detailed, starting off as a complete beginner (like myself) can be extremely challenging. Therefore, in this series I would like to ‘hand-hold’ you through the process of modifying the Python scripts and checking your results, eventually being able to implement them on your own projects!

So let us begin by talking briefly about Computer Vision.

In simplest terms, Computer Vision is a field of study that enables computers to ‘see’ and understand the contents of digital images. The picture above is a masterpiece, created not by people manually tracing the objects but by the ‘magic’ of Computer Vision.

Distinguishing a vehicle from a pedestrian may seem like a walk in the park, even for a baby. Unfortunately, computers are not nearly as smart as us…yet.

However, the boom in data generated over the past decades (the big-data era) has been a major contributing factor to the recent advancements in the field of Computer Vision.

With this vast amount of data floating in the digital space, many Machine Learning and Neural Network models have been trained to aid the field of Computer Vision. The most important of them, however, is one particular type of model: the Convolutional Neural Network (ConvNet).

“A convolution can be thought of as ‘looking at a function’s surroundings to make better/accurate predictions of its outcome’.” - Dr. Prasad Samarakoon

A convolution involves sliding a filter over an input (e.g. a static image or a video). Instead of looking at entire images at once to determine specific features, the model looks at certain portions of the images sequentially to identify their contents.
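To make the ‘sliding filter’ idea concrete, here is a minimal NumPy sketch. It is only an illustration, not how Mask R-CNN is implemented: the tiny 5x5 image, the hand-made vertical-edge filter and the helper name `convolve2d` are all my own assumptions. (Strictly speaking, this computes a cross-correlation, which is what deep learning libraries usually call a convolution.)

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the response map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Look only at the small patch currently under the filter
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

# Toy 5x5 "image": dark on the left (0), bright on the right (1)
image = np.array([[0, 0, 0, 1, 1]] * 5, dtype=float)

# Hand-made filter that responds to dark-to-bright vertical edges
kernel = np.array([[-1.0, 0.0, 1.0]] * 3)

response = convolve2d(image, kernel)
print(response)
```

The response is zero where the patch is uniformly dark and positive wherever the patch straddles the dark-to-bright edge, so the filter ‘lights up’ only around the feature it was designed for.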

Ok, so all that is great. But why ConvNets? We have had Computer Vision filters that were designed by hand, such as the well-known Viola-Jones detector, so why do we need this ‘new’ thing?

Well…developing a filter by hand took eons (figuratively speaking), but that is not the case for ConvNets.

Let me briefly explain what will happen behind-the-scenes as we train our model.

As we build the Convolutional Neural Network, we begin with the ‘training’ phase. We first randomly assign values to the filters (these values are the network’s ‘weights’) and feed the network with images of what we want to train it on, for example a Balloon. The network will make a ‘wild’ guess about the object, its location, etc., and then compare its guess with the actual answer. Based on how far off the guess was, the network tunes its weights to improve its next guess. After many iterations (how many depends on the size of your data-set), the network will be able to accurately detect the item you have trained it for.
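The guess, compare, tune loop described above can be sketched in a few lines of NumPy. Please treat this as a toy under loud assumptions: it learns a single 3x3 filter with plain gradient descent, the ‘actual answers’ are generated from a filter I made up (`true_kernel`), and the learning rate and step count are arbitrary picks. A real Mask R-CNN has many layers and is trained by a deep learning library, not by hand-written loops like these.

```python
import numpy as np

rng = np.random.default_rng(0)

def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# The "actual answers": responses of a filter the learner never sees (made up for this demo)
true_kernel = np.array([[-1.0, 0.0, 1.0]] * 3)
images = rng.random((20, 6, 6))
targets = np.array([convolve2d(img, true_kernel) for img in images])

def mean_squared_error(kernel):
    """How wrong the current guesses are, averaged over every output value."""
    preds = np.array([convolve2d(img, kernel) for img in images])
    return np.mean((preds - targets) ** 2)

# Step 1: randomly assign values to the filter
kernel = rng.normal(size=(3, 3))
learning_rate = 0.2  # arbitrary choice for this toy problem
initial_loss = mean_squared_error(kernel)

for step in range(300):
    grad = np.zeros_like(kernel)
    for img, target in zip(images, targets):
        # Step 2: make a guess and compare it with the actual answer (4x4 error map)
        error = convolve2d(img, kernel) - target
        # Step 3: work out how each filter weight contributed to the error
        for a in range(3):
            for b in range(3):
                grad[a, b] += 2 * np.sum(error * img[a:a + 4, b:b + 4])
    grad /= targets.size
    # Step 4: tune the weights a little in the direction that reduces the error
    kernel -= learning_rate * grad

final_loss = mean_squared_error(kernel)
print("loss before:", initial_loss, "loss after:", final_loss)
print(np.round(kernel, 2))
```

Nobody told the loop what the filter should look like; it only ever saw images and the expected answers, yet the learned values end up close to `true_kernel`. That is the ‘trial-and-error’ idea in miniature.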

Compared to Computer Vision experts having to think up innovative filter designs by hand, a ConvNet’s ‘trial-and-error’ methodology is highly efficient.

Yes I know… That was a ‘kindergarten’ explanation of ConvNets. However, if you are interested in learning more, you can check out this blog!

If not, let me delve into what EXACTLY we will be working on.

What is Instance Segmentation?

Let us discuss the different Computer Vision tasks.

Credits: Matterport, Inc

Image Classification: It can detect that there is a balloon in this image.

Semantic Segmentation: It can identify which pixels in the image belong to balloons, without telling one balloon apart from another.

Object Detection: It can detect where the balloons are and how many there are.

Instance Segmentation: It can detect how many balloons there are in this image, where each one is, and which pixels belong to each individual balloon.

What we are going to be working on in this series is instance segmentation. Interestingly, its output already contains the answers to all the other tasks, so by doing one project we are able to accomplish them all!
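To see why instance segmentation covers the other three tasks, here is a small NumPy sketch. Everything in it is an assumption for illustration: the toy 6x8 ‘image’, the two hand-drawn balloon masks, the (height, width, instances) array layout and the helper `bounding_box` are mine, not the output of any particular library.

```python
import numpy as np

# Hypothetical instance-segmentation output: one boolean mask per balloon,
# stacked as (height, width, number_of_instances)
masks = np.zeros((6, 8, 2), dtype=bool)
masks[1:3, 1:4, 0] = True  # pixels of balloon 1
masks[3:6, 5:8, 1] = True  # pixels of balloon 2

def bounding_box(mask):
    """Return (y1, x1, y2, x2) of the tightest box around a mask; y2 and x2 are exclusive."""
    rows = np.where(np.any(mask, axis=1))[0]
    cols = np.where(np.any(mask, axis=0))[0]
    return int(rows[0]), int(cols[0]), int(rows[-1]) + 1, int(cols[-1]) + 1

# Object Detection: how many balloons there are and where they are
num_balloons = masks.shape[-1]
boxes = [bounding_box(masks[:, :, i]) for i in range(num_balloons)]

# Semantic Segmentation: every pixel that belongs to any balloon
semantic_mask = np.any(masks, axis=-1)

# Image Classification: is there a balloon in this image at all?
contains_balloon = bool(semantic_mask.any())

print(num_balloons, boxes, int(semantic_mask.sum()), contains_balloon)
```

Going the other way is not possible: a single semantic mask cannot tell you where one balloon ends and the next begins, which is why instance segmentation is the most demanding of the four tasks.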

In this series, we will be using the Mask Region-based Convolutional Neural Network (Mask R-CNN). If you are interested in learning more about it, feel free to read the following articles.

  1. RCNN, Fast RCNN, Faster RCNN
  2. Mask RCNN


In this series, we will be using Python. Don’t worry! You need not be a professional Python programmer. As long as you understand basic syntax for troubleshooting, you should be good to go 😺!

The series consists of 4 parts:

  1. Preparing your training environment.
  2. Testing the Image Segmentation Model.
  3. Training your own model.
  4. Testing and implementing your Image Segmentation model.

Stay tuned for the upcoming Parts!

Machine Learning. CNN. Computer Vision. Looking to share simple and in-depth tutorials with the world! Sharing knowledge to help kick-start people’s interest!
