Hello All,
Today I want to show you how Joker can be used for speech recognition and speech synthesis using neural networks and the Joker Empathy module.
I have brewed two Docker containers for super simple usage. Just one command is required to run the neural network and obtain the results. This tutorial should work on any Linux or OS X machine. No GPU is required, only a CPU.
This funny video shows voice interaction with Joker:
Speech recognition (speech-to-text)
This service is based on the Kaldi ASR project. It uses Kaldi’s ‘chain’ models (a type of DNN-HMM model). The actual trained model was released by the api.ai team. The model contains 127,847 words. Compare this number with the Oxford English Dictionary, which contains 171,476 words, or with the average English-speaking adult, who knows between 20,000 and 30,000 words. The model shows an 11.2% word error rate (WER), which is a very good result! “Old” speech recognition methods (GMM-HMM) show 21% WER or more.
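To make the WER figure more concrete, here is a minimal Python sketch of how word error rate is usually computed: the number of word substitutions, deletions and insertions needed to turn the reference transcript into the recognized one, divided by the number of words in the reference. The function name `word_error_rate` is my own for illustration and is not part of Kaldi or of the container; Kaldi's own scoring tools may normalize text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word was missed
                curr[j - 1] + 1,     # insertion: extra word was recognized
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

For example, if a ten-word reference sentence comes back with a single wrong word, `word_error_rate` returns 0.1, that is 10% WER.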
To run the test, just issue the following command in the console:
docker run -it aospan/stt
The built-in file will be processed and the output should contain the following text:
/opt/in/in.wav HELLO THIS IS SPEECH TO TEXT RECOGNITION FOR JOKER PROJECT
That is what the system actually recognized from the audio file. Here is the audio file:
Supply your own audio file
If you want to use your own audio file, run the following command in the console:
docker run -it -v `pwd`/in:/opt/in aospan/stt
The input file must be in the format ‘WAV, 16 bit, mono, 16000 Hz’ and must be located at ‘in/in.wav’, relative to the directory where you run the command.
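Before handing a file to the container, it can help to confirm that it really is 16-bit mono at 16000 Hz. Here is a small sketch that does this with Python's standard `wave` module. The helper name `check_stt_input` is hypothetical, and the check covers only the three properties listed above. I have not verified how the container itself reacts to other formats.

```python
import wave
from typing import List


def check_stt_input(path: str) -> List[str]:
    """Return a list of format problems; an empty list means the file matches."""
    problems = []
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        sample_width = wav.getsampwidth()  # bytes per sample
        rate = wav.getframerate()
    if channels != 1:
        problems.append(f"expected mono, found {channels} channels")
    if sample_width != 2:
        problems.append(f"expected 16-bit samples, found {8 * sample_width}-bit")
    if rate != 16000:
        problems.append(f"expected 16000 Hz, found {rate} Hz")
    return problems
```

Calling `check_stt_input("in/in.wav")` returns an empty list when the file matches the expected format. Otherwise it returns one message per mismatch, which tells you what to fix in your audio converter.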
Performance
Joker can process speech-to-text in real time with 25% CPU usage. This is more than enough for “real world” use cases like voice control, voice assistance, text dictation, smart home and many more.
Speech synthesis (text-to-speech)
This service is based on the Merlin project. I have trained the neural network on the ‘cmu_us_bdl_arctic’ dataset (male voice) prepared by Carnegie Mellon University.
To run the test, just issue the following command in the console:
docker run -it -v `pwd`/out:/opt/out aospan/tts
The resulting audio file is located at out/tts.wav. Here is the audio file:
The default phrase ‘Hello, my name is Joker. Today is a great day because it’s my birthday’ was used. To supply your own phrase, run the following command:
docker run -it -v `pwd`/out:/opt/out aospan/tts "your phrase here"
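If you want to call the synthesis container from a program instead of typing the command by hand, here is one possible Python sketch. The image name, the `/opt/out` mount and the `out/tts.wav` result path come from the commands above. The helper names `build_tts_command` and `synthesize` are my own. The `-it` flags are left out because a script may not have a terminal attached, which assumes the image does not need one.

```python
import os
import subprocess
from typing import List


def build_tts_command(phrase: str, out_dir: str = "out") -> List[str]:
    """Build the docker command as an argument list (no shell quoting needed)."""
    host_dir = os.path.abspath(out_dir)  # same role as `pwd`/out in the shell command
    return ["docker", "run", "-v", f"{host_dir}:/opt/out", "aospan/tts", phrase]


def synthesize(phrase: str, out_dir: str = "out") -> str:
    """Run the container and return the path where the audio is expected."""
    os.makedirs(out_dir, exist_ok=True)
    # check=True raises CalledProcessError if docker exits with an error.
    subprocess.run(build_tts_command(phrase, out_dir), check=True)
    return os.path.join(out_dir, "tts.wav")
```

Passing the phrase as a separate list element means apostrophes and quotes in the text (as in ‘it’s my birthday’) reach the container unchanged. Each run writes to the same out/tts.wav, so rename the file between calls if you want to keep several phrases.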
Conclusions
Now we can build very user-friendly systems with natural voice control, like Amazon Alexa or Google Home. But Joker doesn’t need online connectivity: all speech processing is done locally. This improves privacy and security, because no audio data is shared with third parties. And we can use voice control even when no internet connection is configured (for example, on fresh installations).
Please check the Joker Walker module for a use case of voice control.