“Alex, come here. I need you.”
Leopard’s new Alex voice is quite good. If the words “talking computer” make you think of electronic bleeps and drones in a bad 1950s science fiction movie, you might want to listen to this:
Apple has included voices in the Mac OS for as long as there have been Macs, and Steve Jobs even used the original MacinTalk voice to introduce the first Mac. But it wasn’t until Mac OS X, and especially its recent versions, that the voices included with your Mac have sounded natural.
Here’s a sample of Alex and some of the other voices included with Mac OS X, as well as voices commercially available from Cepstral and AT&T Natural Voices. As you’ll hear, the competition’s not so bad, either.
For he to-day that sheds his blood with me
Shall be my brother; be he ne’er so vile,
This day shall gentle his condition.”
Apple: Alex, Bruce, Fred, Ralph, Vicki, Victoria

Cepstral: Callie, David, Duncan, Lawrence, Linda, Walter, William

AT&T Natural Voices: Anjeli, Audrey, Charles, Claire, Krystal, Lauren, Mike, Rich
What makes one voice better than another, beyond personal preference? That is, what makes one voice technically better? One big difference is the size of the voice samples. Roughly speaking, the bigger the file size, the higher the fidelity of the voice. Looking at the size of Alex on disk, it’s no wonder it sounds so much better.
From /System/Library/Speech/Voices:
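To compare voice sizes on your own machine, a small Python sketch like the one below totals the bytes under each entry in that directory. The function names `tree_size` and `voice_sizes` are made up for this example, and it assumes each voice lives in its own entry under the path above.

```python
import os

def tree_size(path):
    """Total size in bytes of the regular files at or under path."""
    if os.path.isfile(path):
        return os.path.getsize(path)
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            # Skip symlinks so linked files are not counted twice.
            if os.path.isfile(full) and not os.path.islink(full):
                total += os.path.getsize(full)
    return total

def voice_sizes(voices_dir="/System/Library/Speech/Voices"):
    """Return [(entry name, bytes), ...] sorted largest first."""
    if not os.path.isdir(voices_dir):
        return []  # not a Mac, or the voices live somewhere else
    sizes = [(name, tree_size(os.path.join(voices_dir, name)))
             for name in os.listdir(voices_dir)]
    return sorted(sizes, key=lambda pair: pair[1], reverse=True)

if __name__ == "__main__":
    for name, size in voice_sizes():
        print(f"{size / 1e6:8.1f} MB  {name}")
```

File size is only a rough proxy for quality, so treat the ranking as a starting point rather than a verdict.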

But other differences are telling as well. For instance, Alex breathes. Go back and listen to Alex recite the Shakespeare passage, and listen carefully to the start of the second line, “For he to-day…”. You’ll hear a lifelike intake of breath. It’s so natural you don’t even notice it at first, but computers don’t breathe! The breathing is included to make the voice sound more real, and it works.
No matter how natural it sounds, a voice that parses text incorrectly sounds immediately alien to your ear. The Wikipedia entry on speech synthesis points out some of the problems with converting text to speech. Take the text “1325”, for instance: how should it be pronounced? As a number, it would be pronounced one thousand three hundred and twenty-five; as a street address, thirteen twenty-five; and as part of a phone number, one three two five.
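To make the “1325” example concrete, here is a minimal Python sketch of the three readings. It assumes the context (number, address, or phone) is already known, which is the hard part in practice. All function names are invented for illustration, and the rules only cover values up to 9999.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def under_100(n):
    """Spell out 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")

def cardinal(n):
    """Spell out 0..9999 as a quantity."""
    if n < 100:
        return under_100(n)
    thousands, rest = divmod(n, 1000)
    hundreds, tail = divmod(rest, 100)
    parts = []
    if thousands:
        parts.append(ONES[thousands] + " thousand")
    if hundreds:
        parts.append(ONES[hundreds] + " hundred")
    if tail:
        parts.append("and " + under_100(tail))
    return " ".join(parts)

def as_address(digits):
    """Read a four-digit street number as two pairs (a simplification)."""
    if len(digits) != 4:
        return cardinal(int(digits))
    first, second = int(digits[:2]), int(digits[2:])
    if second == 0:
        return under_100(first) + " hundred"
    if second < 10:
        return under_100(first) + " oh " + ONES[second]
    return under_100(first) + " " + under_100(second)

def as_phone(digits):
    """Read each digit on its own."""
    return " ".join(ONES[int(d)] for d in digits)

READERS = {"number": lambda s: cardinal(int(s)),
           "address": as_address,
           "phone": as_phone}

def verbalize(digits, context):
    """Pick a reading for a digit string given a known context label."""
    return READERS[context](digits)

for context in ("number", "address", "phone"):
    print(context, "->", verbalize("1325", context))
```

The same four characters produce three different strings, so a synthesizer has to decide on the context before it can say anything at all.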
Even a single digit can be challenging to pronounce properly. Jay Waltmunson at MSDN points out that pronouncing “1” is harder than you think:
Let’s look at the following English sentences:
(a) I have 1 friend. (“i have one friend”)
(b) Can you meet me on 1/3/05? (“can you meet me on january third two thousand and five?”)
(c) My birthday is on March 1. (“my birthday is on march first.”)
In sentence (a), we can read the digit “1” as “one”. But in sentence (b), the digit “1” can be spoken as “January” because it is in the context of a date. And in sentence (c), the “1” commonly takes on an ordinal reading by being pronounced as “first”.
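One way to picture the disambiguation in sentences (a) through (c) is as pattern matching on the text around each number. The Python sketch below is only an illustration under that assumption, not how any shipping synthesizer works. The name `classify_numbers` and the patterns are invented here, and numeric dates are assumed to be US-style with the month first.

```python
import re

MONTH_NAMES = ("January|February|March|April|May|June|July|"
               "August|September|October|November|December")

# Numeric date, month first (an assumption): 1/3/05 or 1/3/2005
NUMERIC_DATE = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{2}|\d{4})\b")
# Month name followed by a day number: March 1
MONTH_DAY = re.compile(r"\b(?:" + MONTH_NAMES + r")\s+(\d{1,2})\b")
NUMBER = re.compile(r"\d+")

def classify_numbers(sentence):
    """Return [(digits, kind), ...] for each digit run, in order."""
    kinds = {}  # start offset of a digit run -> how it should be read
    for m in NUMERIC_DATE.finditer(sentence):
        kinds[m.start(1)] = "month"
        kinds[m.start(2)] = "ordinal"
        kinds[m.start(3)] = "year"
    for m in MONTH_DAY.finditer(sentence):
        kinds.setdefault(m.start(1), "ordinal")
    # Anything not claimed by a date pattern defaults to a plain count.
    return [(m.group(), kinds.get(m.start(), "cardinal"))
            for m in NUMBER.finditer(sentence)]

for s in ("I have 1 friend.",
          "Can you meet me on 1/3/05?",
          "My birthday is on March 1."):
    print(s, classify_numbers(s))
```

Each “1” gets a different label (cardinal, month, ordinal), and a later stage would turn that label into the spoken word.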
It’s even harder in languages with gender, like Spanish:
(a) Yo tengo 1 amigo. (“i have one friend”) - “1” pronounced as “un”
(b) Yo tengo 1 amiga. (“i have one friend”) - “1” pronounced as “una”
(c) Yo tengo 1. (“i have one”) - “1” pronounced as “uno”
In these examples, the “1” can take on three different pronunciations! It’s not the semantic context (e.g., “date”, “time”, “fraction”) that requires disambiguation, but rather, the context is the gender of the word that the “1” modifies - or the part of speech of the “1”. So, in sentence (a) the pronunciation is “un” because the following noun is masculine, but in sentence (b) the pronunciation is “una” because “amiga” is feminine. But, in (c), the pronunciation is “uno” because the “1” is itself acting as a noun.
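The Spanish case can be sketched the same way, except that the deciding clue is the word after the digit. This is a toy Python example: the `GENDER` table and the `read_uno` function are invented for illustration, and a real system would need a full dictionary or a part-of-speech tagger instead of a four-word lookup.

```python
import re

# Toy lexicon (an assumption for this sketch): gender of a few nouns.
GENDER = {"amigo": "m", "amiga": "f", "libro": "m", "casa": "f"}

def read_uno(sentence):
    """Return how a standalone digit 1 would be spoken, or None if absent."""
    m = re.search(r"\b1\b\s*(\w+)?", sentence)
    if m is None:
        return None
    following = (m.group(1) or "").lower()
    gender = GENDER.get(following)
    if gender == "m":
        return "un"   # modifies a masculine noun
    if gender == "f":
        return "una"  # modifies a feminine noun
    # Nothing follows, so the numeral stands alone as a noun. Unknown
    # words also land here, which is a known weakness of this sketch.
    return "uno"

for s in ("Yo tengo 1 amigo.", "Yo tengo 1 amiga.", "Yo tengo 1."):
    print(s, "->", read_uno(s))
```

The lookup table is doing all the linguistic work here, which hints at why good text normalization takes much more than a handful of rules.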
Abbreviations require special handling, too. A human being can easily navigate a phrase like “St. George lived on George St.”, but it’s harder for a speech parser. The same goes for “she walked in 4-in heels”.
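A few positional rules are enough to handle these two phrases, as the Python sketch below shows. The rules and the function names (`expand_st`, `expand_units`, `normalize`) are assumptions made up for this example, and they are easy to fool: “Main St. Come visit.” would wrongly become “Saint Come”.

```python
import re

def expand_st(text):
    """Expand "St." as Saint or Street based on position alone."""
    # Before a capitalized word it reads as a title: St. George
    text = re.sub(r"\bSt\.(?=\s+[A-Z])", "Saint", text)
    # At the very end, the period also closes the sentence, so keep it.
    text = re.sub(r"\bSt\.(?=\s*$)", "Street.", text)
    # Any other "St." is treated as a street suffix.
    text = re.sub(r"\bSt\.", "Street", text)
    return text

def expand_units(text):
    """Expand "in" as inch only when it is hyphenated to a number."""
    return re.sub(r"\b(\d+)-in\b", r"\1 inch", text)

def normalize(text):
    return expand_units(expand_st(text))

print(normalize("St. George lived on George St."))
print(normalize("she walked in 4-in heels"))
```

Note that the preposition “in” is left alone in the second sentence because only the hyphenated form after a digit is rewritten.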
As disks get cheaper and computers get faster with more RAM, we can probably expect voices to continue to improve too. Just in time for the rise of smart phones and smart cars…
Great article. Nice to see someone commenting on text-to-speech in Mac OS X. You may also want to mention Acapela’s great voices:
They are used as the narration for all the videos at:
and here:
The post is interesting and informative. The fun bit was the Shakespeare quote.
Most of those voices sound “too American” for an English playwright’s words to me, although, in point of fact, a modern English accent would doubtless sound pretty weird to Shakespeare.
The word ne’er trips most of the voices up: they mispronounce it as “near”. I’ll bet most of these voices aren’t very good with archaic words and contractions.
Charles sounds like he’s had a few drinks before going on stage.