Phoneme timings in Text to Speech service

We'd like to use the Text to Speech service to control an animatronic. The animatronic has a mouth and needs to move its lips and jaw as it speaks. Amazon Polly provides phoneme and viseme support, which is what we had been using. However, we're switching to Watson for our upcoming demo and could not find anything related to the "mouth position" that corresponds to the audio. We tried generating the mouth shapes using acoustic models, but the result doesn't look good. We'd like to retrieve both the audio and the phoneme timings while the robot is speaking so that we can control the mouth directly. Is there any way to do that with IBM Watson's Text to Speech system? See the attached image for the different mouth shapes used by companies like Disney.

Related links: https://docs.aws.amazon.com/polly/latest/dg/viseme.html
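To make the request concrete, here is a minimal sketch of how timed viseme data could drive a mouth once a service returns it. It assumes newline-delimited JSON marks with "time" (milliseconds), "type", and "value" fields, modelled on the speech marks described at the Amazon Polly link above. The sample data, the MouthCue type, and the parse_viseme_marks helper are hypothetical names invented for illustration; they are not part of any Watson or Polly SDK, and the sample is not real service output.

```python
import json
from typing import List, NamedTuple


class MouthCue(NamedTuple):
    start_ms: int      # when the mouth shape should begin
    duration_ms: int   # how long to hold it
    viseme: str        # viseme label, e.g. "p" or "sil"


def parse_viseme_marks(lines: str, audio_length_ms: int) -> List[MouthCue]:
    """Turn newline-delimited JSON marks into a timeline of mouth cues.

    Only marks whose "type" is "viseme" are kept. Each cue lasts until the
    next viseme starts; the last cue lasts until the end of the audio.
    """
    marks = []
    for raw in lines.splitlines():
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines
        mark = json.loads(raw)
        if mark.get("type") == "viseme":
            marks.append((int(mark["time"]), str(mark["value"])))

    marks.sort(key=lambda m: m[0])  # ensure chronological order

    cues = []
    for i, (start, viseme) in enumerate(marks):
        end = marks[i + 1][0] if i + 1 < len(marks) else audio_length_ms
        cues.append(MouthCue(start, max(end - start, 0), viseme))
    return cues


# Illustrative sample only (assumed format, made-up timings).
SAMPLE_MARKS = """
{"time": 0, "type": "word", "start": 0, "end": 3, "value": "pat"}
{"time": 0, "type": "viseme", "value": "p"}
{"time": 90, "type": "viseme", "value": "a"}
{"time": 240, "type": "viseme", "value": "t"}
{"time": 330, "type": "viseme", "value": "sil"}
"""

cues = parse_viseme_marks(SAMPLE_MARKS, audio_length_ms=400)
for cue in cues:
    print(cue)
```

An animatronic controller could walk this list in step with audio playback, mapping each viseme label to a lip and jaw position.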

Guest

Dec 16, 2019

Animated systems such as Soul Machines Avatar and FaceMe Digital Human also need phoneme timings before they can use Watson's Text to Speech, so that the mouth animation lines up with the sound.

Guest

Aug 13, 2019

Any update on this? Our project would benefit from this as well. Currently, Amazon is the only company offering this feature.
