Synthetic Speech in Flash

Recently, I learned about Linear Predictive Coding (“LPC”), the technique used to synthesize speech in classic arcade games (such as Gauntlet) and in the Speak & Spell.

Here’s my first attempt at LPC speech in Flash: (click & explore)

It’s great, except for one tiny problem: It sounds horrific. Can you feel the cold, robotic love? This voice will stalk your nightmares.

The phonemes were derived from an unrehearsed recording of my voice. I’m confident that it can be improved. Note that direct LPC encodings of my voice, such as this one, sound more acceptable.

EDIT: I made an iPhone version, “Metal Mouth”, with lots of features. Here it is on YouTube and the iTunes Store!

EDIT #2: The source code is available here.

7 thoughts on “Synthetic Speech in Flash”

  1. This is incredible, Zach! I remember speech synth done on the C64 (called SAM or something) but I’ve never seen one in Flash. I like the noise option; it sounds like whispering.

  2. Og2t: Thank you!! You jogged my memory — I recall using “Say It Sam” on the Apple ][, which was another scratchy, low-fidelity voice. But it was impressive for the time. I have no idea how that software worked. I’ll investigate and report back…

  3. @cmoore: I’d like to release the code! However, I have to clear this with the team… We plan on refining the speech a bit, and using it in a special project. Watch this space…

  4. Great work :) and I think it’s a first for AS3, so congratulations!
    So, basically LPC is a way of compressing an audio sample, plus a (predictive?) decompression algorithm for playback?
    I’m interested in this type of thing a lot for AS3… is any blending used while transitioning between phonemes?
    If you could share any of the papers or knowledge used while making this, I’d really appreciate it.
    Cheers!

  5. Thanks Dave! As I understand it, LPC was designed with speech compression in mind. The playback works like this: you generate either a repeating pulse (for pitched frames such as vowel sounds) or noise (for sibilants like “S” and “Z”), and run this signal through a set of very short delays (echoes). The output is fed back to the input, so it self-oscillates a bit.

    Different vowel sounds can be produced by changing the amplitudes of the individual delays. The real magic is obtaining those values — my early experiments were plagued with howling feedback, and situations where the DC offset increased exponentially ;)

    Ultimately, I looked at the source of an app called rt_lpc (http://soundlab.cs.princeton.edu/software/rt_lpc/), which was very useful. I now have a vague idea of how the encoding works (their code could stand more comments, IMHO ;); the decoding & playback is pretty straightforward.

    Currently I’m not using any blending or transitions between phonemes (though I am synthesizing the sound at a 10 kHz sample rate and using anti-aliasing during playback, because I like how it sounds ;). When the feedback amplitudes change, the playback buffer retains its old state, and the result is a smooth-sounding transition between frames. Pretty neat.

    Really, I’m amazed that LPC works AT ALL. It’s very clever.

    I’d really like to share the code (the dictionary and speech-to-phoneme aspects are interesting, too). Hopefully I’ll get the go-ahead from the team soon… I’ll write another post when that’s ready.
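Editor’s aside: the playback scheme described in comment #5 (a pulse or noise excitation run through a set of very short delays, with the output fed back to the input) is what signal-processing texts call an all-pole IIR filter. Below is a minimal sketch of one frame in Python/NumPy rather than ActionScript; the function name and the coefficient values are illustrative, not taken from the actual Flash code:

```python
import numpy as np

def lpc_synthesize(coeffs, gain, n_samples, pitch_period=None, rng=None):
    """Synthesize one LPC frame.

    coeffs:       all-pole predictor coefficients a[1..p] (the per-delay
                  "amplitudes" from the comment above).
    pitch_period: spacing of excitation pulses in samples for a voiced
                  (pitched) frame, or None for a noise-excited frame
                  (sibilants such as "S" and "Z").
    """
    if rng is None:
        rng = np.random.default_rng(0)

    # Excitation: a repeating pulse train for voiced frames, noise otherwise.
    if pitch_period:
        excitation = np.zeros(n_samples)
        excitation[::pitch_period] = 1.0
    else:
        excitation = rng.standard_normal(n_samples)

    out = np.zeros(n_samples)
    p = len(coeffs)
    # Each output sample is the scaled excitation plus a weighted sum of the
    # previous p output samples: the "set of very short delays" whose output
    # is fed back to the input.
    for n in range(n_samples):
        acc = gain * excitation[n]
        for k in range(1, p + 1):
            if n - k >= 0:
                acc += coeffs[k - 1] * out[n - k]
        out[n] = acc
    return out
```

In a streaming version you would keep the output history across coefficient changes; as comment #5 notes, that retained filter state is what smooths the transition between frames.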
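As for obtaining those delay amplitudes (the “real magic” mentioned above), the textbook approach is the autocorrelation method combined with the Levinson-Durbin recursion. This is a hedged sketch of that standard algorithm, not a transcription of rt_lpc’s encoder:

```python
import numpy as np

def lpc_coefficients(frame, order):
    """Estimate predictor coefficients for one analysis frame via the
    autocorrelation method and the Levinson-Durbin recursion.

    Returns (a, gain) such that s[n] is approximated by
    gain * e[n] + sum_k a[k-1] * s[n-k].
    """
    frame = np.asarray(frame, dtype=float) * np.hamming(len(frame))
    # Short-time autocorrelation r[0..order].
    r = [float(np.dot(frame[: len(frame) - k], frame[k:]))
         for k in range(order + 1)]

    a = np.zeros(order)
    err = r[0]  # prediction error power; shrinks at each stage
    for i in range(1, order + 1):
        # Reflection coefficient for stage i.
        k = (r[i] - sum(a[j - 1] * r[i - j] for j in range(1, i))) / err
        new_a = a.copy()
        new_a[i - 1] = k
        for j in range(1, i):
            new_a[j - 1] = a[j - 1] - k * a[i - j - 1]
        a = new_a
        err *= 1.0 - k * k
    return a, float(np.sqrt(max(err, 0.0)))
```

Because the recursion keeps every reflection coefficient inside the unit interval, the resulting filter is stable by construction, which is exactly the property that tames the “howling feedback” described in comment #5.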
