How might a purely mechanical 'voice' function?
In my world there are mechanical semi-humanoid automata.
Would it be possible to make them speak in a fashion that sounds natural (i.e. human)? How might it function?
The mechanism should be capable of (in order of priority):
- Pronouncing English sounds.
- Simulating intonation.
- Simulating changes in pitch (frequency).
- Simulating changes in timbre (waveform).
- Simulating changes in volume.
- Pronouncing non-English sounds.
Note that I wouldn't mind the contraption having 'quirks', such as the /s/ always being pronounced a little quieter than the other sounds through some odd side-effect of the mechanism, or that there may be a tell-tale hiss before each word from the pumps depressurizing.
These mechanical voice-boxes may be constructed from materials available in the late Victorian era, with precision parts (such as watch mechanisms) freely available.
Controlling the mechanism should not be of concern.
The more compact, modular, and alien the system, the better.
EDIT:
I had researched the human voice on Wikipedia and the tweeting of cuckoo clocks, but found the actual human vocal apparatus too large and spread out for practical use in machinery, particularly because of the involvement of the tongue (quite a large organ) and its distance from the lungs.
The cuckoo clocks are very primitive and I couldn't think of anything inspired by them.
I was wondering whether something more like a trumpet could be used, allowing the great distances to be coiled up to save space, but I know very little about more advanced acoustics.
This post was sourced from https://worldbuilding.stackexchange.com/q/136221. It is licensed under CC BY-SA 4.0.
1 answer
BACKGROUND (setting the context for my answer):
I actually used to be a researcher in a university lab that created the software for voice synthesizers back in the 1980s. At that time, all the synthesizers used recordings of human voices and edited examples of each phoneme (the sound you might associate with a letter, though phonemes aren't letters). Then the software would grab the sounds it needed and put them together. Very choppy and awful output.
The professors I worked with (I was an undergraduate doing this as paid full- or part-time work, depending on my school schedule) created a brand new system. They made a list of every two-phoneme combination (for example "b-ah" or "sh-ew") and some common longer combinations (like "st-ah"), then used a recorded human voice for the examples. My job was to cut each pairing at the exact middle (the second half of the "b" and the first half of the "ah", for example). The point was to keep all those important transitions intact. I had both the sounds and a graphic depiction of the recordings on a computer.
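A minimal sketch of that cutting step, assuming each recording is a plain mono sample array (the function name and layout are mine, not the lab's actual tooling; cut pairs like these are what speech synthesis now calls diphones):

```python
import numpy as np

def cut_diphone(pair: np.ndarray, boundary: int) -> np.ndarray:
    """Keep the second half of phoneme 1 plus the first half of
    phoneme 2, so the transition between them stays intact.

    pair: mono recording of two phonemes, phoneme 1 first.
    boundary: sample index where phoneme 1 ends and phoneme 2 begins.
    """
    mid1 = boundary // 2                            # midpoint of phoneme 1
    mid2 = boundary + (len(pair) - boundary) // 2   # midpoint of phoneme 2
    return pair[mid1:mid2]
```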
The results were gorgeous compared to anything that came before it. Much more lifelike. But there was still no intonation. Changing pitch, volume, and a few other tonal things was possible then and is even easier now. But intonation is HARD.
To create intonation you need extensive rules on which tones to use when. You might think this is easy (just like you might think my other job, writing the rules for text-to-speech that translated written words into lists of phonemes, was easy) but you'd be wrong. It's hard for humans speaking a second language to get right and it's insanely hard for computers.
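A taste of why those rules are hard: the notorious "ough" cluster, where one naive spelling-to-sound rule (say, "ough" → /ʌf/) covers "tough" and "rough" and nothing else (the IPA transcriptions are standard; the framing as a failed single rule is mine):

```python
# The same four letters take at least five pronunciations,
# depending on context no simple substitution table can see.
examples = {
    "tough":   "/tʌf/",
    "though":  "/ðoʊ/",
    "through": "/θruː/",
    "bough":   "/baʊ/",
    "cough":   "/kɒf/",
}
for word, pronunciation in examples.items():
    print(f"{word:8} -> {pronunciation}")
```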
But that was over 30 years ago. All that stuff I did by hand is now partially or completely automated. It's easier now than before, but it's still not easy. I mean, have you heard an electronic voice that does intonation well? Siri? Alexa? Yeah, no. At best you get a rising tone for questions.
Take all this into the near future and, sure, it's gonna happen. Already the electronic voices are worlds better than what I was working on, and that was lightyears ahead of what was out there. Electronic voices are used every day now and it's just going to increase. There are entire companies (and departments of larger companies) working on these problems.
YOUR QUESTION:
You have two differences from what I was talking about.
- Your electronic speakers may be intelligent. In that case, you don't need software to determine which phonemes to use or which tonal variations.
- You're stuck with Victorian-level tech.
It's unclear to me if your "mechanical semi-humanoid automata" are indeed intelligent. If not, they need to be programmed in some manner, even if it's just setting keys to utter various phonemes. You still need a way for the brains of the machines, or the programming, to transfer to the "mouths." This is really hard for that era.
If you use artificial mouths to articulate sounds, you'll need to break down every phoneme into its component parts (sketched as a data record after this list). These are:
- Voiced or unvoiced (whether the vocal folds vibrate during the sound).
- Position of the tongue and/or lips. For consonants there are just a few choices, but vowels are very complex, and some vowels require moving the tongue in a particular way.
- Method of articulation (stop, fricative, liquid, etc.).
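One way to picture that breakdown is a feature record per phoneme (the field names here are illustrative, not a linguistic standard):

```python
from dataclasses import dataclass

@dataclass
class Phoneme:
    symbol: str   # the sound's label (IPA, say)
    voiced: bool  # do the vocal folds vibrate?
    place: str    # where the tongue and/or lips shape the sound
    manner: str   # stop, fricative, liquid, nasal, vowel, ...

# Every record is a physical configuration the mechanism must
# hit, in sequence, while air is flowing through it.
inventory = [
    Phoneme("b", voiced=True,  place="bilabial", manner="stop"),
    Phoneme("s", voiced=False, place="alveolar", manner="fricative"),
    Phoneme("l", voiced=True,  place="alveolar", manner="liquid"),
]
```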
Then you need to pump air through the entire mechanism and somehow get everything to coordinate. Seriously, something like this would take a long time to build. And that's only for the version that takes 3 seconds to say every word.
If you use an electronic voice, you'll need to have a stored inventory of phonemes (or the cut phoneme pairs like I describe above). With modern computers you can use electronically generated sounds, but it's the same basic idea: create a string of sounds that come together as words.
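Picking up the cutting sketch from earlier, the assembly side is then a lookup-and-join over whatever inventory you stored (`DIPHONES` and the phoneme labels here are hypothetical placeholders):

```python
import numpy as np

# Hypothetical store of the cut recordings, keyed by ordered
# phoneme pair, e.g. ("SH", "UW") for the "sh-ew" transition.
DIPHONES: dict[tuple[str, str], np.ndarray] = {}

def synthesize(phonemes: list[str]) -> np.ndarray:
    """Join stored diphones end to end. Every seam falls at a
    phoneme midpoint, where the sound is steady, which is why
    this beats cutting at the phoneme boundaries themselves."""
    pairs = zip(phonemes, phonemes[1:])
    return np.concatenate([DIPHONES[pair] for pair in pairs])

# e.g. synthesize(["SH", "UW", "Z"]) for "shoes", given that the
# pairs ("SH", "UW") and ("UW", "Z") exist in the store.
```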
English vs non-English? Easy peasy. That's just about which phonemes and/or phoneme pairs you have in your database.
Can you change volume or pitch? Maybe. It can be done mechanically, but you need to have a human do it, use an intelligent machine, or figure out some way to program it.
How about intonation? No. Freaking. Way. If only the rudiments of intonation can be done with modern technology, it's not going to happen with Victorian-era tech.
[Intonation is] extraordinarily complex. "Although intonation is primarily a matter of pitch variation, it is important to be aware that functions attributed to intonation such as the expression of attitudes and emotions, or highlighting aspects of grammatical structure, almost always involve concomitant variation in other prosodic features. David Crystal for example says that 'intonation is not a single system of contours and levels, but the product of the interaction of features from different prosodic systems – tone, pitch-range, loudness, rhythmicality and tempo in particular.'" (ref)
What if controlling the machine was really not an issue?
The OP claims this but it really depends on the frame of the question. How much handwaving and "alien tech" is there? Even with "software" that's really an intelligent brain able to produce perfect control, you're still dealing with the slowness of Victorian-era machines. If everything is truly built locally, I don't see any way that speech can be a normal speed, let alone have all these nuances.
With 44 phonemes in English (Juǀ'hoan has about 130) and hundreds needed to account for all the world's languages, the database of recordings alone would take up too much space, even if done very small and even if you could build the tiny player and a machine to move it around. And that's assuming you only record phonemes, not the edited combinations that give much smoother and better results.
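For a sense of scale, here's a quick back-of-the-envelope count (44 and 130 are the phoneme figures above; the 0.3-second average per recording is my assumption):

```python
# Rough size of a diphone inventory: one recording per ordered
# phoneme pair. The ~0.3 s per recording is an assumed average.
for language, n in [("English", 44), ("Juǀ'hoan", 130)]:
    diphones = n * n
    minutes = diphones * 0.3 / 60
    print(f"{language}: {n} phonemes -> {diphones} diphones, "
          f"~{minutes:.0f} minutes of audio")
```

Against Victorian recording media, where a standard phonograph cylinder held roughly two minutes, even the ten-odd minutes for English alone means a rack of cylinders plus machinery to select and splice them on the fly.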