Building Powerful Conversational AI Models with NVIDIA NeMo

NVIDIA NeMo provides all the tools for building efficient and accurate models for Conversational AI.

Ariz Siddiqui
4 min read · May 2, 2022

When creating Conversational AI applications, building powerful models that solve the problem well is a significant bottleneck in the development process. While there are SDKs available to speed up the creation of application pipelines, the model each application needs is custom, and there aren't many frameworks that let you build models tailored to your problem.

NVIDIA NeMo solves this exact problem: it is a toolkit for creating custom models effectively and efficiently. Models built with NeMo are accurate, optimized, and exportable to multiple frameworks and SDKs, such as NVIDIA Riva. To learn more about NVIDIA Riva, check out my blog explaining Riva here.

I worked on this toolkit as an ML Intern at NVIDIA under Mr. Amit Kumar (Senior Solutions Architect, Deep Learning). Here, we will discuss what NeMo is, how to install it, and give a brief overview of its various collections.

An Outline

As we discussed, NeMo is a toolkit for developing conversational AI models. It is fundamentally different from SDKs like DeepStream and Riva, since those SDKs handle the development and deployment of entire pipelines. NeMo, on the other hand, provides tools specifically for creating the conversational AI models themselves.

It provides a framework for building, training, and fine-tuning GPU-accelerated speech and Natural Language Understanding (NLU) models with a simple Python interface. Using NeMo, you can develop entirely new model architectures and train them with mixed-precision computing. Mixed-precision computing is a training methodology that combines the FP16 and FP32 precision formats, reducing training time and memory use while maintaining accuracy comparable to FP32-only training.
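Since NeMo models are built on PyTorch Lightning, mixed precision is usually just a trainer flag away. Below is a minimal sketch, assuming a single-GPU machine and a recent PyTorch Lightning release (the exact flag spelling can vary between versions):

import pytorch_lightning as pl

# precision=16 enables mixed-precision training: most operations run in FP16,
# while FP32 master weights preserve numerical stability.
trainer = pl.Trainer(accelerator="gpu", devices=1, precision=16, max_epochs=50)

# Any NeMo model (from the ASR, NLP, or TTS collections) can then be trained with:
# trainer.fit(model)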

With NeMo, it is possible to build models for ASR, NLP, and Text-to-Speech (TTS) applications such as video call transcription, video assistants, and many more across industries like healthcare and finance.

All the tools in NeMo are categorized into three collections according to the function they fulfill: ASR, NLP, and TTS. You can also work on large-scale language modeling using Megatron, a functionality in NeMo that allows you to train and scale language models to billions of parameters. You can learn more about Megatron here.

Installation

Installing NeMo is very easy. Since it has a Python interface, we will use pip to install it. You can also install and run it on cloud coding platforms like Google Colab. To install NeMo, run the following command in your terminal, replacing $BRANCH with the release branch you want (for example, 'main'):

python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]

In Jupyter notebooks, prefix the command with the '!' character, since the above is a shell command.

That’s it! Now, you can import NeMo into your Python scripts just like any other library.
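The three collections we discuss below live under the nemo.collections namespace. Here is a minimal sketch of the imports, using the module paths found in recent NeMo releases:

# NeMo groups its tools into three collections, importable as submodules.
import nemo.collections.asr as nemo_asr
import nemo.collections.nlp as nemo_nlp
import nemo.collections.tts as nemo_tts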

You can also download the NeMo Docker container from the NVIDIA NGC catalog, which provides a ready-to-run Linux-based environment for NeMo. You can check out the NGC Catalog page for NeMo here.

Collections

As we discussed above, NeMo divides its tools into three collections, depending on the task that the tool fulfills. These collections are Automatic Speech Recognition (ASR), Natural Language Processing (NLP), and Text-to-Speech (TTS). Let’s talk about each of these collections.

ASR

ASR refers to the problem of getting a program to transcribe spoken language (speech-to-text). The goal is usually to have a model that minimizes the Word Error Rate (WER) metric when transcribing speech input. In simpler terms, given an audio file containing speech, we need to transform the audio into its corresponding text with as few errors as possible.
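For reference, WER counts the word-level substitutions, deletions, and insertions needed to turn the model's transcript into the reference transcript, divided by the number of words in the reference. The snippet below is a hand-rolled sketch for illustration only (NeMo ships its own WER utilities):

def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in the reference."""
    ref, hyp = reference.split(), hypothesis.split()
    # Word-level Levenshtein distance via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))  # ~0.167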

To this end, there are many models available in NeMo, like Jasper, QuartzNet, Citrinet, and many more.
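As a quick illustration, the sketch below pulls one of these pretrained checkpoints from NGC and transcribes an audio file. The model name and the path 'sample.wav' are placeholders, and the transcribe arguments have changed slightly across NeMo versions:

import nemo.collections.asr as nemo_asr

# Download a pretrained CTC-based ASR model (QuartzNet here) from the NGC cloud.
asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name="QuartzNet15x5Base-En")

# Transcribe a local 16 kHz mono WAV file (the path is a placeholder).
transcripts = asr_model.transcribe(["sample.wav"])
print(transcripts[0])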

Citrinet model structure

NLP

NLP is the process of understanding human language and processing it to derive meaningful inferences. Many tasks fall under the umbrella of NLP, like Token Classification, Sentiment Analysis, Intent Classification, Language Translation, etc.
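As an example, the NLP collection ships pretrained models for several of these tasks. The sketch below restores punctuation and capitalization in raw ASR output; the checkpoint name 'punctuation_en_bert' comes from the NGC catalog and may differ between releases:

import nemo.collections.nlp as nemo_nlp

# Load a pretrained punctuation-and-capitalization model from NGC.
punct_model = nemo_nlp.models.PunctuationCapitalizationModel.from_pretrained("punctuation_en_bert")

# A typical use case: cleaning up raw, lower-case ASR transcripts.
queries = ["hello how are you doing today"]
print(punct_model.add_punctuation_capitalization(queries))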

TTS

Speech Synthesis, or Text-to-Speech (TTS), involves turning text into human speech. The NeMo TTS collection currently supports two pipelines for TTS:

  1. The two-stage pipeline: first, a model generates a Mel spectrogram from the input text; second, a model generates audio from that Mel spectrogram (a minimal sketch of this pipeline follows this list).
  2. The end-to-end approach, where a single model generates audio directly from the text.
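To make the two-stage pipeline concrete, the sketch below pairs a spectrogram generator with a vocoder (FastPitch and HiFi-GAN here). The checkpoint names come from the NGC catalog, and the exact method names may vary between NeMo releases:

import soundfile as sf
import nemo.collections.tts as nemo_tts

# Stage 1: text -> Mel spectrogram.
spec_generator = nemo_tts.models.FastPitchModel.from_pretrained("tts_en_fastpitch")
# Stage 2: Mel spectrogram -> audio waveform.
vocoder = nemo_tts.models.HifiGanModel.from_pretrained("tts_hifigan")

tokens = spec_generator.parse("Hello, welcome to NVIDIA NeMo.")
spectrogram = spec_generator.generate_spectrogram(tokens=tokens)
audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)

# Save the synthesized speech; these checkpoints produce 22.05 kHz audio.
sf.write("speech.wav", audio.to("cpu").detach().numpy()[0], 22050)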

That is all about NeMo! If you wish to learn more, you can always check out NeMo's documentation here, and if you want to see some practical examples, you can find them in the GitHub repository here.
