Counting Objects with Faster R-CNN

Accurately counting object instances in an image or video frame is a hard problem in machine learning. A number of solutions have been developed to count people, cars and other objects, and none of them is perfect. Since this is an image processing task, a neural network is a natural tool for the job.

Below you can find a description of different approaches, common problems, challenges and the latest solutions in the field of object counting with neural networks. As a proof of concept, an existing Faster R-CNN model will be used to count objects on the street, with video examples given at the end of the post.


People Counting Challenges

Finding a proper solution to the problem of counting objects depends on many factors. Besides the challenges common to all image processing with neural networks, such as the size and quality of the training data, there are challenges specific to the object counting problem:

  • type of the objects to be counted
  • overlapping
  • perspective view
  • the minimum size of detected objects
  • training and testing speed

The approach taken to count cars on a highway or crowds of people at a stadium, where most objects overlap and the perspective view usually includes very small objects in the far distance, will be completely different from counting people in a family photo. Also, a solution for counting objects in a single photo could be different from one suitable for counting objects in a video in real time.

Simple Needs, Simple Solutions

In this post I will try to tackle the problem of counting objects on the street, using sample videos with multiple objects visible at the same time, but not too crowded. For counting object instances accurately in images of a crowded scene or a traffic jam, I recommend diving into the latest research in the field: Towards perspective-free object counting with deep learning. The results from the paper can be reproduced using the code found on GitHub. Methods like CCNN and Hydra CNN described in that paper perform poorly when given an image with just a few objects of different types, so a different approach had to be taken.

There is a very interesting method in machine learning (and in deep learning with convolutional neural networks in particular) called Region-based Convolutional Neural Network (RCNN), which identifies multiple objects and their locations in a given image.

For the proof of concept I will use the Keras implementation of Faster R-CNN, modified to process video files and annotate the frames with the count of detected objects of a given class.

Fast and Faster

There have been a number of approaches to combining the tasks of finding the object location and identifying the object in order to increase speed and accuracy. Over the years, we have moved forward from standard RCNN networks, through Fast R-CNN, up to Faster R-CNN, which we are using to solve our simple counting problem. Fast R-CNN builds on the previous work to efficiently classify object proposals using deep convolutional networks. Compared to RCNN, Fast R-CNN introduced several innovations that improve training and testing speed as well as detection accuracy.

Approaches using RCNN-trained models in multi-stage pipelines (first detecting object boundaries and then performing identification) were not suited for real-time processing. Their main drawback is speed, both during training and at test time, when object detection is actually performed. Using the famous VGG16, training a standard RCNN takes 2.5 GPU-days for 5k images and requires hundreds of GB of storage. Detecting objects at test time takes 47 s per image on a GPU. This is mainly caused by performing a forward pass of the convolutional network for each object proposal, without sharing the computation.

Fast R-CNN improved on RCNN by introducing a single-stage training algorithm that jointly learns to classify object proposals and refine their spatial locations. The improvements introduced in Fast R-CNN are:

  • Higher detection quality
  • Training in a single stage using multi-task loss
  • Training can update all network layers
  • No disk storage is required for feature caching
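
For reference, the multi-task loss mentioned in the list can be written as below for a single region of interest. The notation follows the Fast R-CNN paper rather than the code used in this post, so treat it as a summary and not as a description of the Keras implementation:

```latex
L(p, u, t^{u}, v) = L_{\mathrm{cls}}(p, u) + \lambda \, [u \ge 1] \, L_{\mathrm{loc}}(t^{u}, v),
\qquad L_{\mathrm{cls}}(p, u) = -\log p_{u}
```

Here p is the predicted class distribution, u is the true class (0 for background), t^u are the predicted box offsets for class u, v is the regression target, and the bracket [u >= 1] switches the localization term off for background regions. The weight lambda balances the two terms.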

Faster R-CNN introduces a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, enabling nearly cost-free region proposals. The RPN component of this solution tells the unified network where to look. For the same VGG-16 model, Faster R-CNN has a frame rate of 5 fps on a GPU while achieving state-of-the-art object detection accuracy. The RPN is a fully convolutional network that can be trained end-to-end specifically for the task of generating detection proposals. It is designed to efficiently predict region proposals with a wide range of scales and aspect ratios.
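
To make "a wide range of scales and aspect ratios" more concrete, here is a minimal sketch of how reference boxes (anchors) for one position could be generated. The three scales and three ratios are the defaults described in the Faster R-CNN paper; the function name make_anchors is hypothetical and the Keras implementation used in this post may be configured differently.

```python
import math

def make_anchors(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return anchors as (x1, y1, x2, y2) boxes centred on the origin.

    Each anchor has an area of scale**2 and a height/width equal to ratio.
    The default values are an assumption taken from the paper.
    """
    anchors = []
    for scale in scales:
        for ratio in ratios:
            width = scale / math.sqrt(ratio)
            height = scale * math.sqrt(ratio)
            anchors.append((-width / 2, -height / 2, width / 2, height / 2))
    return anchors

anchors = make_anchors()
print(len(anchors))  # 9 anchors per position: 3 scales x 3 ratios
```

In the RPN, such a set of anchors is placed at every position of the shared feature map, and the network predicts for each one whether it contains an object and how the box should be adjusted.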

Faster R-CNN was used last year by Pinterest as a solution enabling visual search on their website, and it will be our choice for detecting and counting objects in sample videos in the PoC described below.

Proof Of Concept

To solve our imaginary problem, we are going to use the aforementioned Faster R-CNN model with Keras on a GPU-enabled AWS instance. With multiple deep learning frameworks available and competitions ongoing, we are in a comfortable position to download pretrained models best suited to our needs and our framework of choice. Of course, you can train the model yourself using the provided Python training script; just keep in mind that it can take many days to complete.

Multiple implementations of Faster R-CNN exist, for Caffe, TensorFlow and possibly many other frameworks. We are going to use Keras (v. 2.0.3) with a TensorFlow backend. The code is available on GitHub as a fork of the original Keras Faster R-CNN implementation.

The script for testing the network was modified so that it can process video files and annotate each frame with data for the detected objects (with probability) as well as a summary of the counted objects. I'm using OpenCV heavily to process the videos, and an already trained model (available for download here) to process the frames. There are a number of utility methods for processing the video, e.g.:

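import os
import cv2

def convert_to_images():
    # input_video_file and img_path are module-level settings of the script
    cam = cv2.VideoCapture(input_video_file)
    counter = 0
    while True:
        flag, frame = cam.read()
        if not flag:
            # no more frames to read
            break
        cv2.imwrite(os.path.join(img_path, str(counter) + '.jpg'), frame)
        counter = counter + 1
        if cv2.waitKey(1) == 27:
            # press esc to quit
            break
    cam.release()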

and saving the video from processed frames:

import os
import re
import cv2

def save_to_video():
    # natural sort, so that frame 2.jpg comes before 10.jpg
    list_files = sorted(get_file_names(output_path), key=lambda var: [int(x) if x.isdigit() else x for x in re.findall(r'[^0-9]|[0-9]+', var)])
    img0 = cv2.imread(os.path.join(output_path, '0.jpg'))
    height, width, layers = img0.shape

    # OpenCV 2 used cv2.cv.CV_FOURCC instead of cv2.VideoWriter_fourcc
    fourcc = cv2.VideoWriter_fourcc(*'mp4v')
    videowriter = cv2.VideoWriter(output_video_file, fourcc, frame_rate, (width, height))
    for f in list_files:
        print("saving..." + f)
        img = cv2.imread(os.path.join(output_path, f))
        # append the annotated frame to the output video
        videowriter.write(img)
    videowriter.release()
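
The sort key used in save_to_video is easy to misread, so here is a small standalone sketch of what it does. The helper name natural_key and the sample file names are made up for illustration; the regular expression is the one from the function above.

```python
import re

def natural_key(name):
    """Split a name into text and integer chunks so that 2 sorts before 10."""
    return [int(chunk) if chunk.isdigit() else chunk
            for chunk in re.findall(r'[^0-9]|[0-9]+', name)]

frames = ['10.jpg', '2.jpg', '0.jpg', '1.jpg']

# plain string sorting puts frame 10 before frame 2
print(sorted(frames))                   # ['0.jpg', '1.jpg', '10.jpg', '2.jpg']

# the natural key restores the order in which the frames were extracted
print(sorted(frames, key=natural_key))  # ['0.jpg', '1.jpg', '2.jpg', '10.jpg']
```

Without this key the frames would be written to the output video out of order as soon as there are ten or more of them.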

While object detection takes place during testing, we build a list of tuples, each holding the detected object class and the number 1, which is later reduced to count the number of occurrences of each object class:

for jk in range(new_boxes.shape[0]):
    (x1, y1, x2, y2) = new_boxes[jk, :]

    cv2.rectangle(img_scaled, (x1, y1), (x2, y2), class_to_color[key], 2)

    textLabel = '{}: {}'.format(key, int(100 * new_probs[jk]))
    all_objects.append((key, 1))

and the reducing method:

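import itertools
import operator

def accumulate(l):
    # groupby only merges adjacent items, so l must already be grouped by class
    it = itertools.groupby(l, operator.itemgetter(0))
    for key, subiter in it:
        yield key, sum(item[1] for item in subiter)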
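
A short usage sketch may help here. The detections below are invented for illustration, and accumulate is repeated so that the snippet runs on its own. It also shows one thing to watch out for: itertools.groupby only merges neighbouring items, so the list has to be grouped by class before it is reduced.

```python
import itertools
import operator
from collections import Counter

def accumulate(pairs):
    # same logic as the reducing method above
    for key, group in itertools.groupby(pairs, operator.itemgetter(0)):
        yield key, sum(item[1] for item in group)

# hypothetical detections for one frame, already grouped by class
all_objects = [('car', 1), ('car', 1), ('person', 1), ('person', 1), ('person', 1)]
print(list(accumulate(all_objects)))    # [('car', 2), ('person', 3)]

# if the same class is not adjacent, it is reported more than once
mixed = [('car', 1), ('person', 1), ('car', 1)]
print(list(accumulate(mixed)))          # [('car', 1), ('person', 1), ('car', 1)]
print(list(accumulate(sorted(mixed))))  # [('car', 2), ('person', 1)]

# collections.Counter gives the same totals regardless of ordering
print(Counter(key for key, _ in mixed))  # Counter({'car': 2, 'person': 1})
```

Counter is offered only as an alternative; the script itself uses the groupby-based method shown earlier.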

The script arguments are rather self-explanatory:

  • "--input_file": Path to input video file.
  • "--output_file": Path to output video file.
  • "--input_dir": Path to input working directory where the processed frames are stored.
  • "--output_dir": Path to output working directory where annotated processed frames are stored.
  • "--frame_rate": Frame rate to use while constructing the video output.

Example usage:

python test_frcnn_count.py --input_file ~/videos/MVI_6848.mp4 --output_file ~/output4.mp4 --frame_rate=25
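
For readers who want to wire up similar options in their own script, here is a minimal sketch using argparse from the standard library. It is not the parsing code of test_frcnn_count.py, which may use a different parser; the function name build_parser, the integer type and the default frame rate of 25 are assumptions made for this example.

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(description="Count objects in a video.")
    parser.add_argument("--input_file", help="Path to input video file.")
    parser.add_argument("--output_file", help="Path to output video file.")
    parser.add_argument("--input_dir",
                        help="Working directory where the extracted frames are stored.")
    parser.add_argument("--output_dir",
                        help="Working directory where the annotated frames are stored.")
    # the default of 25 is an assumption, not taken from the original script
    parser.add_argument("--frame_rate", type=int, default=25,
                        help="Frame rate to use while constructing the video output.")
    return parser

# parse an explicit argument list so the sketch can be run without a command line
args = build_parser().parse_args(
    ["--input_file", "video.mp4", "--output_file", "out.mp4", "--frame_rate=25"])
print(args.input_file, args.output_file, args.frame_rate)
```

Both "--frame_rate 25" and "--frame_rate=25" are accepted by argparse, which matches the style of the example command above.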

A few examples processed by the script:


Region-based Deep Convolutional Networks are exciting tools, enabling software developers to solve many interesting problems. The presented solution just scratches the surface. By fine-tuning the network for the particular data set or using transfer learning from other trained models, we can achieve high accuracy and speed while detecting objects.