Speech recognition and speech synthesis using neural networks

Hello All,

Today I want to show you how Joker can be used for speech recognition and speech synthesis using neural networks and the Joker Empathy module.

Joker Empathy module

I have brewed two Docker containers for super simple usage. Just one command is required to run the neural network and obtain the results. This tutorial should work on any Linux system and on OS X. No GPU is required, only a CPU.

This funny video shows voice interaction with Joker:

Speech recognition (speech-to-text)

This service is based on the Kaldi ASR project. It uses Kaldi’s ‘chain’ models (a type of DNN-HMM model). The trained model itself was released by the api.ai team. The model contains 127,847 words. For comparison, the Oxford English Dictionary contains 171,476 words, and an average English-speaking adult knows between 20,000 and 30,000 words. It is also worth noting that this model shows an 11.2% word error rate (WER). This is a very good result! “Old” speech recognition methods (GMM-HMM) show a WER of 21% or more.
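To make the WER number more concrete: WER is the minimum number of word substitutions, deletions and insertions needed to turn the recognized text into the reference text, divided by the number of words in the reference. Below is a minimal Python sketch of that calculation. It is not taken from Kaldi or from the container; the function name `word_error_rate` and the example sentences are made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions are needed to reach an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical example: one dropped word and one misrecognized word.
    reference = "turn on the light in the kitchen"
    hypothesis = "turn on light in the chicken"
    print(f"WER: {word_error_rate(reference, hypothesis):.1%}")
```

In this made-up example the recognizer makes 2 errors on a 7-word reference, so the WER is 2/7. An 11.2% WER means roughly one error for every nine reference words.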

To run a test, just issue the following command in the console:

docker run -it aospan/stt

The built-in file will be processed, and the output should contain the following text:

That is what the system actually recognized from the audio file. Here is the audio file:


Supply your own audio file

If you want to use your own audio file, run the following command in the console:

docker run -it -v `pwd`/in:/opt/in aospan/stt

The input file must be in the format ‘WAV, 16 bit, mono, 16000 Hz’ and located at ‘in/in.wav’.
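If you are not sure whether your recording matches this format, you can check it before starting the container. Here is a minimal sketch that uses only Python's standard `wave` module. The helper name `check_wav_format` is my own and is not part of the container.

```python
import wave


def check_wav_format(path: str) -> list:
    """Return a list of format problems; an empty list means the file fits.

    Expected format: WAV, 16 bit, mono, 16000 Hz.
    """
    problems = []
    # wave.open raises wave.Error if the file is not a PCM WAV file.
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        sample_width = wav.getsampwidth()  # bytes per sample
        sample_rate = wav.getframerate()

    if channels != 1:
        problems.append(f"expected mono, got {channels} channels")
    if sample_width != 2:
        problems.append(f"expected 16 bit, got {sample_width * 8} bit")
    if sample_rate != 16000:
        problems.append(f"expected 16000 Hz, got {sample_rate} Hz")
    return problems
```

Calling `check_wav_format("in/in.wav")` returns an empty list when the file is ready to use. Otherwise, each entry describes what needs to be converted.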


Joker can process speech-to-text in real time with 25% CPU usage. This is more than enough for “real world” use cases like voice control, voice assistance, text dictation, smart home and many more.

Speech synthesis (text-to-speech)

This service is based on the Merlin project. I have trained a neural network on the ‘cmu_us_bdl_arctic’ dataset (male voice) prepared by Carnegie Mellon University.

To run a test, just issue the following command in the console:

docker run -it -v `pwd`/out:/opt/out aospan/tts

The resulting audio file is located at out/tts.wav. Here is the audio file:

The default phrase ‘Hello, my name is Joker. Today is a great day because it’s my birthday’ was used. To supply your own phrase, run the following command:

docker run -it -v `pwd`/out:/opt/out aospan/tts "your phrase here"


Now we can build very user-friendly systems with natural voice control, like Amazon Alexa or Google Home. But Joker doesn’t need online connectivity: all speech processing is done locally. This improves privacy and security, because no audio data is shared with third parties. And we can use voice control even when no internet connection is configured (for example, on fresh installations).

Please check the Joker Walker module for a voice control use case.
