Repeat After Me

This tutorial will walk you through manually generating visemes by recording an audio file and then playing it back using the robot face.

A common critique of text-to-speech systems is that they sound robotic. One way to make the speech more natural is to use recordings of actual human speech, which can convey a wide variety of emotions and intentions that are difficult to capture in plain text. If you are using PyLips for an interaction with mostly pre-recorded speech, you can record the phrases you need in your own voice (or hire a voice actor).

In this tutorial, we will use the sounddevice and soundfile libraries to record a 3-second audio clip. We will then use the allosaurus library to recognize the phonemes in the audio file. Finally, we will use the RobotFace class to play back the audio file and display the visemes on the robot face.

Prior to beginning this tutorial, ensure that you have run python3 -m pylips.face.start to start the robot face. You may also need to install the sounddevice and soundfile libraries using python3 -m pip install sounddevice soundfile. allosaurus is included in the PyLips requirements, so you should not need to install it separately.

First, we will import all the necessary libraries for this tutorial.

import sounddevice as sd
import soundfile as sf
import pickle

from pylips.speech import RobotFace
from pylips.speech.system_tts import IPA2VISEME

from allosaurus.app import read_recognizer

Next, we set up the parameters used in the rest of the script. Change duration to record a longer or shorter clip. You may also need to change sd.default.samplerate and sd.default.channels to match the audio input of your microphone.

# sound recording parameters
duration = 3  # seconds
sd.default.samplerate = 44100
sd.default.channels = 1

# load allosaurus for phoneme recognition
phoneme_model = read_recognizer()

# create robot face object for speaking
robot = RobotFace()

Next, we use the sounddevice library to record an audio clip and the soundfile library to save it to a file in the pylips_phrases directory, which is automatically created when the PyLips face is instantiated.

# record: sd.rec returns immediately, so sd.wait() blocks until the clip is finished
print("Recording Audio")
myrecording = sd.rec(int(duration * sd.default.samplerate))
sd.wait()

sf.write('pylips_phrases/parroted.wav', myrecording, sd.default.samplerate)
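
Before moving on, it can help to confirm that the microphone actually captured something, since a muted or mismatched input device produces a silent file. The following is a minimal sketch and not part of PyLips: peak_level and is_silent are hypothetical helper names, and the 0.01 threshold is an assumed value that you would tune for your own microphone. The helpers work on any flat sequence of float samples.

```python
def peak_level(samples):
    """Return the largest absolute sample value (0.0 for an empty clip)."""
    return max((abs(s) for s in samples), default=0.0)


def is_silent(samples, threshold=0.01):
    """Treat a clip as silent if its peak stays below the assumed threshold."""
    return peak_level(samples) < threshold
```

With the recording above, you could call is_silent(myrecording.flatten()) and record again if it returns True.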

Next, we use the allosaurus library to recognize the phonemes in the audio file. We then convert the phonemes to visemes using the IPA2VISEME dictionary, and save the result in the expected format for PyLips.

out = phoneme_model.recognize('pylips_phrases/parroted.wav', timestamp=True, lang_id='eng')

# each output line has the form "<start time> <duration> <phoneme>"
times = [float(i.split(' ')[0]) for i in out.split('\n')]
visemes = [IPA2VISEME[i.split(' ')[-1]] for i in out.split('\n')]

times.append(len(myrecording)/sd.default.samplerate + 0.2)
visemes.append('IDLE')

with open('pylips_phrases/parroted.pkl', 'wb') as f:
    pickle.dump((times, visemes), f)
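
The conversion step assumes that every line of the recognizer output is well formed and that every recognized phoneme has an entry in IPA2VISEME. A more defensive version might look like the sketch below. It is an illustration under stated assumptions, not the PyLips implementation: parse_timestamped_phonemes is a hypothetical helper, toy_map is a made-up mapping standing in for IPA2VISEME, and falling back to 'IDLE' for unknown phonemes is one possible choice among several.

```python
def parse_timestamped_phonemes(out, ipa_to_viseme, fallback='IDLE'):
    """Split "start duration phoneme" lines into parallel time and viseme lists."""
    times, visemes = [], []
    for line in out.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        times.append(float(fields[0]))
        # use the fallback viseme when the phoneme is not in the mapping
        visemes.append(ipa_to_viseme.get(fields[-1], fallback))
    return times, visemes


# toy mapping for illustration only; the viseme names here are invented
toy_map = {'h': 'HH', 'ə': 'AH'}
sample = "0.21 0.045 h\n0.30 0.045 ə\n0.42 0.045 ʘ"
print(parse_timestamped_phonemes(sample, toy_map))
```

In the tutorial script you would pass out and IPA2VISEME as the two arguments, then append the final 'IDLE' entry as before.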

Finally, we use the RobotFace class to play back the audio file and display the visemes on the robot face. We use the existing say_file method to play the files we created in the previous step.

robot.say_file('parroted')
robot.wait()
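
If the face does not move as expected, inspecting the saved timing file is a quick first check. The sketch below loads a (times, visemes) pickle like the one written earlier and reports basic inconsistencies. check_viseme_file is a hypothetical helper, and the checks reflect assumptions about what a well-formed file looks like rather than documented PyLips requirements.

```python
import pickle


def check_viseme_file(path):
    """Load a (times, visemes) pickle and return a list of problems found."""
    with open(path, 'rb') as f:
        times, visemes = pickle.load(f)

    problems = []
    if len(times) != len(visemes):
        problems.append('times and visemes differ in length')
    numeric = [float(t) for t in times]  # accepts numbers or numeric strings
    if any(b < a for a, b in zip(numeric, numeric[1:])):
        problems.append('times are not in ascending order')
    if visemes and visemes[-1] != 'IDLE':
        problems.append('last viseme is not IDLE')
    return problems
```

Calling check_viseme_file('pylips_phrases/parroted.pkl') returns an empty list when none of these checks fail.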

You are done! You can now run the script and record your own voice to play back on the robot face.