Accelerating Computer Vision with DeepStream SDK and TAO Toolkit

Creating Computer Vision pipelines becomes easier with NVIDIA DeepStream SDK and TAO pre-trained models.

Ariz Siddiqui
11 min read · Feb 3, 2022

Welp, it looks like you have a computer vision problem. You have a video and need a pipeline to infer data from it. Training the model is a hassle on its own, and now you have to build the entire pipeline around it? There has to be a way to streamline everything.

Enter: NVIDIA DeepStream.

What is NVIDIA DeepStream? Well, it’s an SDK for accelerating the building and deployment of computer vision pipelines. It’s as easy as declaring the components and linking them together, and everything just works. The SDK’s libraries are compatible with both C/C++ and Python, making implementations especially easy through shared object libraries.

But that’s not all. As it turns out, you don’t even have to train a model to satisfy your computer vision needs. NVIDIA’s NGC Catalog contains loads of pre-trained models for you to use. Just download the model files, reference them in your configuration files, and you’re good to go!

While working as a Machine Learning Intern at NVIDIA, under the guidance of Mr Amit Kumar (Senior Solutions Architect — Deep Learning), I worked with the DeepStream SDK. I found that it isn’t obvious to new users how to build their own pipelines with this technology, so I’m here to help. I’ll show you how to get started with DeepStream by building DeepStream pipelines and integrating models pre-trained with NVIDIA’s TAO Toolkit.

Resources

Before anything, you need to cover some prerequisites. You need an NVIDIA GPU to run the models efficiently. I used an NVIDIA T4 GPU and would recommend the T4 or equivalent if you’re using a discrete GPU. If you’re using the Jetson platform, make sure you install the drivers specific to your machine, as they differ from those for regular GPUs. I also recommend installing Docker onto your system to reap the benefits of OS-level virtualisation.

Once you have your machine up and running, you need to install DeepStream. You can install DeepStream using the documentation listed here or use a Docker container. I find the container route more efficient and easier to implement. So, go to NVIDIA’s NGC Catalog, and copy the pull tag for the DeepStream Container. Go to your machine’s Terminal and paste the command you just copied. Docker will now download all relevant files from NGC onto your computer. All that’s left is running the container.
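
If you go the container route, the flow looks roughly like this. This is only a sketch: take the actual image name and tag from the pull command you copy on NGC, and adjust the mounted path to your own workspace.

docker pull nvcr.io/nvidia/deepstream:<tag-from-ngc>
docker run --gpus all -it --rm -v /path/to/your/workspace:/workspace nvcr.io/nvidia/deepstream:<tag-from-ngc>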

You can learn more about DeepStream, TAO Toolkit and NGC by clicking on the links provided.

Let’s Start

Now that you have the Docker container up and running, we can begin with our pipeline. I have two Python scripts for you to build. The first script runs a primary inference engine on a sample video. The second script extends the first by adding another inference engine that further modifies the output from the first engine.

The video we will use for our script is roadside footage from a busy road in California. Let’s have a look at it:

Our objective for this video and our pipeline is to detect all the vehicles and pedestrians in the video. We’ll also classify these vehicles based on their size and type. To accomplish this, we take two models from the NGC Catalog into our pipeline: TrafficCamNet and VehicleTypeNet.

Let’s start with the scripts. Follow along with the steps given below. If you get stuck at any moment, feel free to refer to the code present in the repository here.

First, we need to import all the necessary libraries into the script and declare some variables we’ll use throughout. We define the class variables that our model detects (TrafficCamNet detects cars, people, bicycles and road signs) and the path for our input video. For convenience, I also defined where to save the output video later.

import sys 
sys.path.append('../source_code')
import gi
import time
gi.require_version('Gst', '1.0')
from gi.repository import GObject, Gst, GLib
from common.bus_call import bus_call
import pyds

PGIE_CLASS_ID_VEHICLE = 0
PGIE_CLASS_ID_BICYCLE = 1
PGIE_CLASS_ID_PERSON = 2
PGIE_CLASS_ID_ROADSIGN = 3

INPUT_VIDEO_NAME = '../videos/sample_720p.h264'
OUTPUT_VIDEO_NAME = "../videos/out.mp4"

Now, we start building the pipeline and its components. We first initialize GStreamer and create an empty Pipeline, then define a helper function that either creates a pipeline element or prints an error if it fails. With that in place, we can build all the necessary parts of the pipeline.

Gst.init(None)

pipeline = Gst.Pipeline()

if not pipeline:
    sys.stderr.write(" Unable to create Pipeline \n")

def make_elm_or_print_err(factoryname, name, printedname, detail=""):
    print("Creating", printedname)
    elm = Gst.ElementFactory.make(factoryname, name)
    if not elm:
        sys.stderr.write("Unable to create " + printedname + " \n")
        if detail:
            sys.stderr.write(detail)
    return elm

The elements we need are:

  • A file source,
  • A .h264 parser (since we use an h264-encoded file),
  • A decoder,
  • A stream-muxer,
  • The inference engine,
  • Two video converters,
  • An OSD (On Screen Display),
  • A queue for encoding and saving,
  • An encoder,
  • The parser that parses the output from the encoder,
  • A container, and
  • A sink for the output.

We create most of the components using GStreamer Plugins. Some elements, however, use NVIDIA DeepStream plugins. These are Nvinfer, Nvosd and Nvvidconv. Let’s talk about them.

Structure of Nvinfer plugin

Nvinfer is the plugin we use for inference, detection and tracking. It takes input from a decoder, muxer or a dewarper and outputs each classified object’s class and its bounding boxes after clustering.

Structure of Nvosd plugin

Nvosd is the plugin we use for drawing bounding boxes, text and Region-of-Interest polygons on the metadata output by the Nvinfer plugin. Any metadata that the Nvosd plugin creates replaces the original metadata in place.

Structure of Nvvidconv plugin

Nvvidconv is the plugin that performs colour format conversions, which is necessary as these formats vary across video file types. This process prepares the video data coming in for further processing.

source = make_elm_or_print_err("filesrc", "file-source", "Source")
h264parser = make_elm_or_print_err("h264parse", "h264-parser", "h264 parse")
decoder = make_elm_or_print_err("nvv4l2decoder", "nvv4l2-decoder", "Nvv4l2 Decoder")
streammux = make_elm_or_print_err("nvstreammux", "Stream-muxer", "NvStreamMux")
pgie = make_elm_or_print_err("nvinfer", "primary-inference", "pgie")
nvvidconv = make_elm_or_print_err("nvvideoconvert", "convertor", "nvvidconv")
nvosd = make_elm_or_print_err("nvdsosd", "onscreendisplay", "nvosd")
queue = make_elm_or_print_err("queue", "queue", "Queue")
nvvidconv2 = make_elm_or_print_err("nvvideoconvert", "convertor2", "nvvidconv2")
encoder = make_elm_or_print_err("avenc_mpeg4", "encoder", "Encoder")
codeparser = make_elm_or_print_err("mpeg4videoparse", "mpeg4-parser", "Code Parser")
container = make_elm_or_print_err("qtmux", "qtmux", "Container")
sink = make_elm_or_print_err("filesink", "filesink", "Sink")

Now, we create a configuration file to pass to our inference engine. A configuration file is a text file that specifies how our model works and where all relevant files are. The configuration files we use are standard and are usually available by default in the container, albeit with a few modifications.
Let’s have a look at the configuration file we use:

Primary Inference Engine Configuration File
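
The full file ships with the container; what follows is a minimal sketch of the properties we care about, assuming the default TrafficCamNet files from NGC. The paths and values are illustrative, so check them against the sample config on your machine.

[property]
gpu-id=0
# net-scale-factor and tlt-model-key come pre-filled in the sample config
net-scale-factor=0.00392156862745098
tlt-model-key=tlt_encode
# model, label and calibration files downloaded along with TrafficCamNet (illustrative paths)
tlt-encoded-model=../models/trafficcamnet/resnet18_trafficcamnet_pruned.etlt
labelfile-path=../models/trafficcamnet/labels.txt
int8-calib-file=../models/trafficcamnet/trafficcamnet_int8.txt
# DeepStream builds this engine on the first run if it does not exist yet
model-engine-file=../models/trafficcamnet/resnet18_trafficcamnet_pruned.etlt_b1_gpu0_int8.engine
num-detected-classes=4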

Let’s understand what this all means.

The first property after the license header specifies which GPU to use. This is helpful if you have multiple GPUs and want to pin the engine to a specific one. The next properties are net-scale-factor and tlt-model-key, which are set in the sample file by default. tlt-model-key is usually the same for all TAO (and, previously, TLT) pre-trained models and is generally left untouched. Then we have the model, label and calibration files, which are downloaded along with the model itself.

The following line, however, is quite interesting. The model-engine-file property specifies the engine the pipeline is supposed to use. However, if we check our model files, there is no engine file to be found! That, luckily, is not an issue: at runtime, DeepStream notices this and compiles an engine for you. This process takes time but happens only once, and all subsequent runs require no re-compilation whatsoever. The compiled engine is built from the model data specified in the previous properties and saved at the path given by model-engine-file.

The other properties specified can be left as-is from the original config file, as these are standard.

Now that we understand our configuration file, we continue building. We have declared all the elements for the pipeline, but we haven’t assigned them their properties. That’s what we’ll do now.

First, we assign the input video name, which is a property of the source element. Next, we set the stream-muxer’s dimensions (matching the video’s resolution), its batch size and a timeout for forming a batch. Batching accelerates processing when feeding in multiple streams, but since we have a single short video, we set the batch size to 1. Then, we set the config file path for the inference engine. Now, .h264 videos usually don’t have a bitrate set, so we specify one as an encoder property. Finally, we set the output file path and disable sync and async for saving.

source.set_property('location', INPUT_VIDEO_NAME)
streammux.set_property('width', 1920)
streammux.set_property('height', 1080)
streammux.set_property('batch-size', 1)
streammux.set_property('batched-push-timeout', 4000000)
pgie.set_property('config-file-path', "../configs/config_infer_primary_trafficcamnet.txt")
encoder.set_property("bitrate", 2000000)
sink.set_property("location", OUTPUT_VIDEO_NAME)
sink.set_property("sync", 0)
sink.set_property("async", 0)

Now, we will finally complete the pipeline. We add all the declared elements into our pipeline and link them in the following order:

  1. The source
  2. The .h264 parser
  3. The decoder
  4. The stream-muxer
  5. The inference engine
  6. The first video converter
  7. The OSD
  8. The queue
  9. The second video converter
  10. The encoder
  11. The parser
  12. The container and
  13. The sink

pipeline.add(source)
pipeline.add(h264parser)
pipeline.add(decoder)
pipeline.add(streammux)
pipeline.add(pgie)
pipeline.add(nvvidconv)
pipeline.add(nvosd)
pipeline.add(queue)
pipeline.add(nvvidconv2)
pipeline.add(encoder)
pipeline.add(codeparser)
pipeline.add(container)
pipeline.add(sink)

source.link(h264parser)
h264parser.link(decoder)


sinkpad = streammux.get_request_pad("sink_0")
if not sinkpad:
sys.stderr.write(" Unable to get the sink pad of streammux \n")
srcpad = decoder.get_static_pad("src")
if not srcpad:
sys.stderr.write(" Unable to get source pad of decoder \n")

srcpad.link(sinkpad)
streammux.link(pgie)
pgie.link(nvvidconv)
nvvidconv.link(nvosd)
nvosd.link(queue)
queue.link(nvvidconv2)
nvvidconv2.link(encoder)
encoder.link(codeparser)
codeparser.link(container)
container.link(sink)

We have the pipeline ready! So, we create an event loop and feed it bus messages.

loop = GLib.MainLoop()
bus = pipeline.get_bus()
bus.add_signal_watch()
bus.connect ("message", bus_call, loop)

Our pipeline now produces metadata, but it doesn’t yet do anything with it. So, we create a function that writes onto each frame: it displays how many people and vehicles the engine detects and draws bounding boxes around them.

For that, we first count the objects detected and tally them against the class variables we declared earlier. Then, we get the metadata from the buffer and extract the relevant information from it. Finally, we display all this data on the corresponding frame of the video. All of this happens through a sink pad probe inserted into the OSD element.

def osd_sink_pad_buffer_probe(pad, info, u_data):

    obj_counter = {
        PGIE_CLASS_ID_VEHICLE: 0,
        PGIE_CLASS_ID_PERSON: 0,
        PGIE_CLASS_ID_BICYCLE: 0,
        PGIE_CLASS_ID_ROADSIGN: 0
    }
    frame_number = 0
    num_rects = 0

    gst_buffer = info.get_buffer()
    if not gst_buffer:
        print("Unable to get GstBuffer ")
        return

    batch_meta = pyds.gst_buffer_get_nvds_batch_meta(hash(gst_buffer))
    l_frame = batch_meta.frame_meta_list

    while l_frame is not None:
        try:
            frame_meta = pyds.NvDsFrameMeta.cast(l_frame.data)
        except StopIteration:
            break

        frame_number = frame_meta.frame_num
        num_rects = frame_meta.num_obj_meta
        l_obj = frame_meta.obj_meta_list

        while l_obj is not None:
            try:
                obj_meta = pyds.NvDsObjectMeta.cast(l_obj.data)
            except StopIteration:
                break
            obj_counter[obj_meta.class_id] += 1
            obj_meta.rect_params.border_color.set(0.0, 0.0, 1.0, 0.0)
            try:
                l_obj = l_obj.next
            except StopIteration:
                break

        display_meta = pyds.nvds_acquire_display_meta_from_pool(batch_meta)
        display_meta.num_labels = 1
        py_nvosd_text_params = display_meta.text_params[0]
        py_nvosd_text_params.display_text = "Frame Number={} Number of Objects={} Vehicle_count={} Person_count={}".format(
            frame_number, num_rects, obj_counter[PGIE_CLASS_ID_VEHICLE], obj_counter[PGIE_CLASS_ID_PERSON])
        py_nvosd_text_params.x_offset = 10
        py_nvosd_text_params.y_offset = 12
        py_nvosd_text_params.font_params.font_name = "Serif"
        py_nvosd_text_params.font_params.font_size = 10
        py_nvosd_text_params.font_params.font_color.set(1.0, 1.0, 1.0, 1.0)
        py_nvosd_text_params.set_bg_clr = 1
        py_nvosd_text_params.text_bg_clr.set(0.0, 0.0, 0.0, 1.0)
        print(pyds.get_string(py_nvosd_text_params.display_text))
        pyds.nvds_add_display_meta_to_frame(frame_meta, display_meta)

        try:
            l_frame = l_frame.next
        except StopIteration:
            break
    return Gst.PadProbeReturn.OK

osdsinkpad = nvosd.get_static_pad("sink")
if not osdsinkpad:
    sys.stderr.write(" Unable to get sink pad of nvosd \n")

osdsinkpad.add_probe(Gst.PadProbeType.BUFFER, osd_sink_pad_buffer_probe, 0)

Now that everything is finally ready, we run the pipeline.

print("Starting pipeline \n")
start_time = time.time()
pipeline.set_state(Gst.State.PLAYING)
try:
loop.run()
except:
pass
pipeline.set_state(Gst.State.NULL)
print("--- %s seconds ---" % (time.time() - start_time))

You will now notice a video has appeared in your destination folder. When opened, you will see an output like this:

Congratulations! You have now successfully created a DeepStream pipeline and integrated it with a TAO pre-trained model.

Next Steps

We now have building pipelines with a primary inference engine under our belt. Let’s move on to something a little more advanced: adding object tracking and secondary inference to our video.

The basic steps here remain the same. All we do is add a tracker to track the objects detected by the primary engine and pass them to a secondary engine.

First, we import configparser, which we’ll use to parse the tracker’s config file and set its properties.

import configparser

The model we will use for our secondary engine is VehicleTypeNet. VehicleTypeNet takes the objects detected by our TrafficCamNet-based inference engine and classifies them into multiple categories (SUV, Sedan, Truck, etc.) based on their size and shape.

To accomplish this, we declare our tracker and secondary inference engine while declaring our other elements, as described above.

We use a DeepStream plugin called Nvtracker to create a tracker element in our pipeline. Let’s talk about it.

Structure of Nvtracker Plugin

Nvtracker is the plugin that performs tracking operations on our object metadata. It updates the object’s metadata with a tracker-id, allowing our secondary inference engine to work on them.
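
Declaring the two new elements can look like this. This is a minimal sketch that reuses the make_elm_or_print_err helper from the first script; the element names passed in are just illustrative.

tracker = make_elm_or_print_err("nvtracker", "tracker", "Tracker")
sgie1 = make_elm_or_print_err("nvinfer", "secondary-inference", "sgie1")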

Then, we configure both the tracker and the secondary inference engine with their config files. For tracking, we will use a KLT tracker, since it runs on the CPU: it processes the video in parallel with the GPU work and doesn’t use up precious GPU resources.

sgie1.set_property('config-file-path', "../configs/config_infer_secondary_vehicletypenet.txt")

config = configparser.ConfigParser()
config.read('../configs/tracker_config.txt')
config.sections()

for key in config['tracker']:
    if key == 'tracker-width':
        tracker_width = config.getint('tracker', key)
        tracker.set_property('tracker-width', tracker_width)
    if key == 'tracker-height':
        tracker_height = config.getint('tracker', key)
        tracker.set_property('tracker-height', tracker_height)
    if key == 'gpu-id':
        tracker_gpu_id = config.getint('tracker', key)
        tracker.set_property('gpu_id', tracker_gpu_id)
    if key == 'll-lib-file':
        tracker_ll_lib_file = config.get('tracker', key)
        tracker.set_property('ll-lib-file', tracker_ll_lib_file)
    if key == 'll-config-file':
        tracker_ll_config_file = config.get('tracker', key)
        tracker.set_property('ll-config-file', tracker_ll_config_file)
    if key == 'enable-batch-process':
        tracker_enable_batch_process = config.getint('tracker', key)
        tracker.set_property('enable_batch_process', tracker_enable_batch_process)

We need two config files for our KLT tracker: a text file, like our config file earlier, and a YML file that provides additional configurations. You can find them in NVIDIA’s documentation, but we have them both here.
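
For reference, the [tracker] section of that text file carries the keys the loop above reads. Here is a rough sketch; the dimensions and paths are placeholders, and the library and YML file names should be taken from the files shipped with DeepStream.

[tracker]
tracker-width=640
tracker-height=384
gpu-id=0
# low-level tracker library and its extra config, both shipped with DeepStream (placeholder paths)
ll-lib-file=/opt/nvidia/deepstream/deepstream/lib/<tracker-library>.so
ll-config-file=../configs/<tracker-config>.yml
enable-batch-process=1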

Like our TrafficCamNet config file, our VehicleTypeNet config file is standard but modified to run secondary to TrafficCamNet.

Secondary Inference Engine Configuration File
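
The interesting part of the file boils down to a few properties. Here is a rough sketch with only those lines; the model, label and engine paths for VehicleTypeNet are omitted (they look just like the primary config), and the exact key names are worth double-checking against the sample config in the container.

[property]
gpu-id=0
# unique id of this engine; the primary engine uses 1
gie-unique-id=5
# operate on the output of the engine with id 1, i.e. our primary engine
operate-on-gie-id=1
# classify only objects of class 0, the 'Car' label of TrafficCamNet (the key takes a list of ids)
operate-on-class-ids=0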

In this config file, pay close attention to these properties: gie-unique-id, operate-on-gie-id and operate-on-class-id.

gie-unique-id is a unique id assigned to every inference engine in a pipeline. These uniquely identify each engine being run and allow secondary processing. We also had this property in our primary config file, identifying that engine with id 1. We can set any number as an id for engines, provided they are unique, so we declare the id of our secondary engine as 5.

operate-on-gie-id is the property that directs the secondary engine to operate on the output produced by a specific inference engine. We set this property to 1 to indicate that we want to work on the output of the engine with id 1, in this case our primary engine.

operate-on-class-id is the property that specifies which label we want to process further. In our example, if we check the labels file, we see that the first label in TrafficCamNet is Car, which has class id 0 (the ids are 0-indexed).

So, collectively, these three properties tell us that our secondary inference engine has a unique id of 5, operates on our primary engine with the unique id 1, and works on the class ‘Car’, which is the first class we encounter in the labels of our primary engine.

We now add them into the pipeline and link them in the appropriate places: the primary inference engine links to the tracker, which couples to the secondary inference engine (see the short sketch after the list). The updated order will be:

  1. The source
  2. The .h264 parser
  3. The decoder
  4. The stream-muxer
  5. The primary inference engine
  6. The tracker
  7. The secondary inference engine
  8. The first video converter
  9. The OSD
  10. The queue
  11. The second video converter
  12. The encoder
  13. The parser
  14. The container and
  15. The sink
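
Only the middle of the pipeline changes. A minimal sketch of the extra additions and links, assuming the element variables used above:

pipeline.add(tracker)
pipeline.add(sgie1)

# re-route the middle of the pipeline: pgie -> tracker -> sgie1 -> first video converter
pgie.link(tracker)
tracker.link(sgie1)
sgie1.link(nvvidconv)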

All other steps remain the same, and upon running the pipeline, you should have a video that looks like this:

And that’s it! You’ve successfully made two pipelines with DeepStream. Not only that, you also integrated two models that run one after the other and transform how your inferences look. You now know enough to experiment with this SDK and implement more complex solutions faster and more efficiently with NVIDIA DeepStream.

