Speech Recognition in 5 Minutes

Automatic Speech Recognition (ASR) is a rapidly growing technology driven by mainstream adoption through Apple Siri, Google Assistant, Microsoft Cortana and many more. ASR seeks to provide a novel voice-driven user interface through either natural language processing or hot-word detection. Future applications of ASR technology could include natural language chatbots, voice-activated Internet of Things systems, and advanced robotics.

This post is a continuation of Get Started on Deep Learning. In this example, you’ll use the Sphinx speech recognition library to build a speech-to-text service in under 5 minutes.

What is Sphinx?

Carnegie Mellon University developed the Sphinx system, an open source, large-vocabulary speech recognition codebase. The speech recognition process involves receiving an audio waveform, sectioning the waveform to isolate sounds from noise and silences, and then parsing the sounds to identify the best-matching combination of words.

How does Sphinx work?

To parse the speech, the samples are divided into frames of 10 ms. For each frame, 39 numbers derived from the spectrum are extracted and stored as a feature vector that represents the speech. Next comes the matching process itself. Speech can be modelled as a Hidden Markov Model, which describes any sequential process where the states are not directly visible but the output data is.
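To make the framing step concrete, here is a minimal sketch in plain Python that cuts a waveform into 10 ms frames. It is not Sphinx's own code: the function name `split_into_frames`, the 16 kHz sample rate and the silent placeholder audio are all assumptions for illustration. Real front ends typically use overlapping windows and then compute the feature vector for each frame, which this sketch leaves out.

```python
def split_into_frames(samples, sample_rate, frame_ms=10):
    """Split samples into consecutive, non-overlapping frames of frame_ms milliseconds."""
    # Number of samples that fit in one frame
    frame_len = sample_rate * frame_ms // 1000
    # Drop any trailing partial frame so every frame has the same length
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


# One second of placeholder (silent) audio at an assumed 16 kHz sample rate
sample_rate = 16000
samples = [0] * sample_rate

frames = split_into_frames(samples, sample_rate)
print(len(frames), "frames of", len(frames[0]), "samples each")
```

With these assumed values, one second of audio yields 100 frames of 160 samples.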

Within this model, three further models match the speech samples to words. The acoustic model contains the acoustic properties for each senone (or similar sound fragment), the phonetic dictionary maps words to phones, and the language model restricts which dictionary words are considered. Restrictions include sequential word norms, such as preferring “I have” to “I half”.
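The “I have” versus “I half” case can be sketched with a toy dictionary and a toy language model. Everything here is hypothetical and only for illustration: the phone sequences, the probabilities and the helper `pick_word` are made up, and a real Sphinx language model is far larger and is combined with acoustic scores.

```python
# Hypothetical phonetic dictionary: two words with nearly identical phones
phonetic_dictionary = {
    "have": ["HH", "AE", "V"],
    "half": ["HH", "AE", "F"],
}

# Hypothetical bigram language model: probability of a word given the previous word
bigram_probability = {
    ("i", "have"): 0.05,
    ("i", "half"): 0.0001,
}


def pick_word(previous_word, candidates):
    """Return the candidate the language model considers most likely after previous_word."""
    return max(
        candidates,
        key=lambda word: bigram_probability.get((previous_word, word), 0.0),
    )


# Both words sound alike, so let the language model decide between them
best = pick_word("i", list(phonetic_dictionary))
print("I " + best)
```

Because the made-up model rates “have” as far more likely after “I”, it is chosen even though the two words sound almost the same.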

Get Started

The following steps assume that you followed the installation process in Get Started on Deep Learning for Python 3.

First you need to install PyAudio. It’s required as you will be using your computer microphone as input. Make sure you are installing PyAudio version 0.2.11 or later:

sudo apt-get install python-pyaudio python3-pyaudio

Now activate the Python 3 Virtual Environment:

workon virtual-py3

And make sure that you are running the updated versions of pip, setuptools and wheel:

pip install --upgrade pip setuptools wheel

Then install Sphinx:

pip install --upgrade pocketsphinx

Installing the Speech Recognition Library

The SpeechRecognition library wraps Sphinx, Google Speech Recognition, Google Cloud Speech API, Wit.ai, Microsoft Bing Voice Recognition, Houndify API, IBM Speech to Text and Snowboy Hotword Detection behind a single interface for easy development. Install it with pip:

pip install SpeechRecognition
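Before moving on, you may want to confirm that everything landed in the active virtual environment. This is a small sketch using only the standard library; the helper name `check_modules` is made up for this post, and the module names are the import names I would expect for PyAudio, pocketsphinx and SpeechRecognition.

```python
from importlib.util import find_spec

# Import names of the packages installed in the steps above (assumed)
REQUIRED_MODULES = ["pyaudio", "pocketsphinx", "speech_recognition"]


def check_modules(names):
    """Map each module name to True if it can be found in this environment."""
    return {name: find_spec(name) is not None for name in names}


status = check_modules(REQUIRED_MODULES)
for name, installed in status.items():
    print(f"{name}: {'found' if installed else 'missing'}")
```

If any module is reported as missing, rerun the matching install command with the virtual environment activated.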

Build a Speech Recognition Program

Now you can build your first speech-to-text program! Here’s an example to help you get started. Fire up your favourite text editor (I love using Atom, but you can use Geany or others as well), and follow along:

# Import the SpeechRecognition library
import speech_recognition as sr

# Receive audio from the microphone
r = sr.Recognizer()
with sr.Microphone() as source:
    print("Say something!")
    audio_received = r.listen(source)

# Parse with Sphinx
try:
    print("Sphinx thinks that you said " + r.recognize_sphinx(audio_received))
except sr.UnknownValueError:
    print("Sphinx could not understand what you said")
except sr.RequestError as error:
    print("Sphinx found errors; {0}".format(error))

And that’s it! Try Sphinx with single-word recognition first before moving on to full sentences.

Expand your Learning

For you scientists out there, here’s the published paper on the CMU Sphinx system.

Geoffrey Momin is a Nuclear Engineer and Technology Consultant. He is actively researching the application of blockchain, artificial intelligence and augmented reality systems to improve human performance and safety.

Follow him on Twitter, GitHub, and LinkedIn.