Getting Started with Conversational AI Applications using NVIDIA Riva

NVIDIA Riva accelerates the development cycle for production-ready speech AI applications.

Ariz Siddiqui
4 min read · May 2, 2022

Conversational AI is a hot field among data scientists across the world. The need for bots with human-like conversational skills has become evident as more and more corporations employ chatbots and speech recognition for tasks like customer assistance, and as virtual assistants like Siri and Alexa become commonplace.

Unfortunately, ML developers interested in building applications in this field today face a peculiar problem: the lack of a general framework for the speedy development and deployment of such applications. Completing even a single sub-task in this domain (for example, Natural Language Processing (NLP)) requires different libraries and developer resources. A generalized SDK for creating conversational applications is therefore needed.

Today, we will explore an SDK that solves the above problem: NVIDIA Riva. I worked on this SDK as an ML intern at NVIDIA under Mr. Amit Kumar (Senior Solutions Architect, Deep Learning). In this blog, we will understand how Riva works, how to install it, and how to build production-ready applications with it.

Riva — An Overview

As I discussed above, Riva is an SDK for building speech AI applications that you can customize to your needs. Currently, Riva supports three vital areas of conversational AI: Automatic Speech Recognition (ASR), Natural Language Processing (NLP), and Text-To-Speech (TTS). Together, these give us an end-to-end workflow for speech, so you can focus on the problems you want to solve rather than on how to solve them.

Riva also supports TAO pre-trained models from NGC, allowing greater flexibility and efficiency in your speech AI pipelines. You can also deploy custom models trained using NVIDIA NeMo. If you are interested in learning about NeMo, you can read about its basics on my blog here.

A general overview of Riva’s architecture is available here. It may be a lot for the layperson to take in, so here’s the short version: Riva takes in pipelines (for speech applications, built from any model, such as a TAO pre-trained model or a NeMo model) and deploys them onto inference servers. It divides the job into three phases: Development, Build, and Deploy.

In the first (Development) phase, the models get converted into a Riva-compatible format. This step is essential since TAO and NeMo checkpoints represent single models, while Riva consumes complete pipelines, so we translate every model checkpoint in the pipeline to the .riva format.
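For NeMo checkpoints, this conversion is typically done with the nemo2riva utility. The sketch below assumes an ASR checkpoint named asr_model.nemo; the file names are placeholders, and the install step may additionally require NVIDIA’s pip index depending on your environment.

# Install the export utility (may require NVIDIA's pip index: pip install nvidia-pyindex)
pip install nemo2riva

# Convert a NeMo checkpoint to the .riva format
# (asr_model.nemo / asr_model.riva are placeholder file names)
nemo2riva --out asr_model.riva asr_model.nemo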

Next comes the Build phase. All necessary entities to deploy our application/service get bundled into an intermediary .rmir (Riva Model Intermediate Representation) file.

Finally, in the third (Deploy) phase, the RMIR file is converted into a Riva repository, and all models are exported and optimized for the target platform.

A typical Riva deployment cycle

The above was an overview of the inner workings of Riva. Next, we discuss the installation of Riva on your local machine.

Installation

The best way to run Riva on local machines is to use Docker. Docker helps maintain a clean environment for testing and deploying pipelines on your local machine using containers and can also clean up after itself once you’re finished.

All Riva-related container images are available on NGC, which you can access here.
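If you have the NGC CLI installed and configured, you can also list the available Riva images from the command line before pulling anything; this is a minimal sketch, and the registry path filter is an assumption.

# List Riva container images published on NGC
# (requires the NGC CLI configured with an API key)
ngc registry image list "nvidia/riva/*"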

Riva includes quick start scripts for fast setup on local systems. To use them, make sure you have Docker installed on your machine, then run the following commands:

ngc registry resource download-version nvidia/riva/riva_quickstart:2.0.0
cd riva_quickstart_v2.0.0
bash riva_init.sh
bash riva_start.sh
bash riva_start_client.sh

The above code launches a container where you can use Jupyter to try the different services. After you finish, shut down the Jupyter server, then stop the Riva server using the `riva_stop.sh` script.
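For reference, the shutdown and cleanup steps look roughly like this; `riva_clean.sh` is part of the quick start scripts, and you should run it only if you also want to remove the downloaded models and containers.

# Stop the running Riva server containers
bash riva_stop.sh

# Optional: remove the containers and downloaded model artifacts
bash riva_clean.sh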

Note: The above method is applicable only on dGPU setups. For the Jetson platform, please follow the steps given here.

Sample Pipeline Deployment

We will use NVIDIA TAO for quick access to pre-trained models. The steps we will follow are:

  1. Train and fine-tune a model using TAO
  2. Run inference on samples
  3. Evaluate model performance
  4. Export model for deployment
  5. Configure Riva Services
  6. Deploy model using Riva

To use TAO, install the TAO launcher from here. We can then access TAO as a command-line tool and choose which task we wish to accomplish from a list. TAO also includes example specification files for trying out models; we download these by adding `download_specs` as an argument to the task command, as shown below.
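As a concrete illustration, here is roughly what that looks like for an ASR task. The `speech_to_text` task name comes from TAO’s conversational AI tasks, but the directory paths are placeholders and the exact flags can differ between TAO versions.

# Download the example spec files for the speech_to_text task
# (-o: destination for the spec files, -r: directory for logs/results)
tao speech_to_text download_specs \
    -o /workspace/specs/speech_to_text \
    -r /workspace/results/speech_to_text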

We will now have a general model available for running inference on sample data on our system. Once you are happy with the general model, proceed with context-based training using the `train` command and a training config file (all of these are provided when you download the specs), as in the sketch below.
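A hedged sketch of the training step, reusing the placeholder paths from above; the spec file name, GPU count, and encryption key variable are assumptions.

# Train the ASR model using the downloaded training spec
# (-e: experiment spec file, -g: number of GPUs, -k: model encryption key, -r: results directory)
tao speech_to_text train \
    -e /workspace/specs/speech_to_text/train.yaml \
    -g 1 \
    -k $KEY \
    -r /workspace/results/speech_to_text/train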

Once we have retrained the model, we can evaluate its performance using the `evaluate` command. If the score is not up to the mark, we can fine-tune the model further using the `finetune` command; both are sketched below.
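The checkpoint path and spec file names below are placeholders carried over from the earlier sketches; only the `evaluate` and `finetune` subcommands themselves are taken from TAO.

# Evaluate the retrained model against the evaluation spec
tao speech_to_text evaluate \
    -e /workspace/specs/speech_to_text/evaluate.yaml \
    -m /workspace/results/speech_to_text/train/checkpoints/trained-model.tlt \
    -k $KEY \
    -r /workspace/results/speech_to_text/evaluate

# If the metrics are not good enough, fine-tune on additional data
tao speech_to_text finetune \
    -e /workspace/specs/speech_to_text/finetune.yaml \
    -m /workspace/results/speech_to_text/train/checkpoints/trained-model.tlt \
    -k $KEY \
    -r /workspace/results/speech_to_text/finetune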

After we finish tweaking the model to our needs, we export it to the `.riva` format. Now we are ready to deploy the pipeline with Riva: we run the ServiceMaker container with the `riva-build` command to build our RMIR file, roughly as sketched below.
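This sketch assumes we are building an ASR service. The container tag, mount paths, and file names follow the 2.0.0 quick start conventions but should be treated as assumptions; if your .riva file is encrypted, `riva-build` also expects the key appended to each path.

# Export the fine-tuned model to the .riva format
# (the export spec controls the output format; set it to RIVA to get a .riva file)
tao speech_to_text export \
    -e /workspace/specs/speech_to_text/export.yaml \
    -m /workspace/results/speech_to_text/train/checkpoints/trained-model.tlt \
    -k $KEY \
    -r /workspace/results/speech_to_text/export

# Launch the Riva ServiceMaker container with the exported model mounted
docker run --gpus all -it --rm \
    -v /workspace/riva_artifacts:/servicemaker-dev \
    nvcr.io/nvidia/riva/riva-speech:2.0.0-servicemaker bash

# Inside the container: bundle the pipeline into an RMIR file
riva-build speech_recognition \
    /servicemaker-dev/asr_pipeline.rmir \
    /servicemaker-dev/asr_model.riva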

Finally, to build our Riva model repository, we run `riva-deploy`, which follows the same argument pattern as `riva-build`. We should now have a model repository ready to be served.
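Still inside the ServiceMaker container, the deploy step looks roughly like this; the output directory is a placeholder and should match the model repository location your config.sh points to.

# Convert the RMIR file into an optimized Riva model repository
riva-deploy \
    /servicemaker-dev/asr_pipeline.rmir \
    /servicemaker-dev/models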

Lastly, we modify our `config.sh` file to match our repository specifications. Then we run `riva_init.sh` and `riva_start.sh` to initialize and start our server.
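The relevant edits and commands look roughly like this. The variable names follow the quick start’s config.sh, but they can differ between Riva releases, so treat them as assumptions.

# In config.sh: point Riva at the custom model repository and
# enable only the services we need (variable names may vary by release)
# riva_model_loc="/workspace/riva_artifacts/models"
# service_enabled_asr=true
# service_enabled_nlp=false
# service_enabled_tts=false

# Initialize (prepares the models) and start the server
bash riva_init.sh
bash riva_start.sh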

Your application is now deployed and ready on an inference server! To access it, connect to the server with a simple client. When you run the client, you should see the results of the task you configured in your application.
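For example, the client container started by `riva_start_client.sh` ships with sample command-line clients. The sketch below assumes an ASR deployment; the sample audio path is a placeholder and the binary names may vary by Riva release.

# Attach to the client container started earlier
bash riva_start_client.sh

# Inside the container: stream a sample audio file to the ASR service
# (placeholder audio path; the server address defaults to localhost:50051)
riva_streaming_asr_client --audio_file=/work/wav/en-US_sample.wav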

Ariz Siddiqui

ML Enthusiast, Engineer-in-Making, Campfire Guitarist.