Life-like audio setup for spoken-word content?

Question

I listen to a lot of audiobooks, some of them ripped straight from CDs in lossless quality, on a pair of Airmotiv 4S speakers. While these are very far from the best speakers money can buy, most reviews consider them to be high quality studio monitors, appropriate for use by professional audio engineers. (I am not a professional audio engineer.)
I would generally expect these speakers to be overkill for spoken-word content. However, although they do make audiobooks sound quite good, I can definitely tell that they are speakers, playing back an artificial recording. I would never mistake them for a person who is physically present in the room with me.
Why? Would it be possible to create an audio setup which makes spoken-word content indistinguishable from a real person? In other words, if my eyes are closed, I shouldn't be able to tell the difference between a recording and a human sitting next to me.
What would it take? Does it come down to hardware/speakers, or software/mastering, or something else?

themodelplumber · Accepted Answer

The closest I've ever got to this has been using a 120-degree pickup Sony ECM-MS907 condenser mic and a minidisc recorder. Using Sony's closed over-ear headphones (the most expensive I could buy at the time), I recorded several group gatherings.
On playback through the headphones later, I instinctively turned my head at frequent intervals, because I couldn't tell that I wasn't listening to real people near me. It was a really weird brain-fakeout effect.
This was all consumer-level hardware at the time so I'm sure it's very possible today to achieve the general effect, though probably with different equipment. Personally I'm not sure I'd want that from an audiobook but it could be cool.

GrumpyYoungMan · Answer

I say this jokingly but if you really, really wanted to go all out, you could use an artificial head with mouth simulator, e.g. https://www.head-acoustics.com/products/artificial-head-bina... Being a very specialized test instrument, though, it costs vastly more to get such a device and the auxiliary equipment to operate it than the experience would ever be worth. Even basic mouth simulators, sans head, like https://www.bksv.com/en/transducers/simulators/ear-mouth-sim... are astonishingly expensive.

code_Whisperer · Answer

Based on experience, I would say that the most critical part of that chain is the recording technique and quality, followed closely by the audio reproduction. In my opinion, though, to get the amazing, lifelike experience you are seeking you would most likely need high quality headphones and a similarly high quality recording made using the binaural mic technique. This technique mimics how sound waves travel around your head and into your ears. A good binaural recording will shine only when you listen with headphones; don't even bother trying with speakers. If you've never heard it before, it will probably shock you.

DantesKite · Answer

I don't have an answer to your question, but it does make me wonder how much an algorithm could improve the sound quality, or whether it would be a negligible improvement.

duped · Answer

There are a couple things to take note of.
First is the setup. For an ideal studio monitor setup, the speakers need to be placed relatively far from the walls, with the tweeters level with the tips of your ears, positioned to form an equilateral triangle with your head, and with their levels set so that when you listen to a mono test signal (like a sine wave) the "phantom center" is obvious and directly in front of you as a listener. That's why they have a knob on the back. You also need to take care with nearby surfaces: if there's a large surface in front (like a desk), reflections off it will create phase interference, and if there is a wall directly behind you, it should have a diffuser or other wall treatment to redirect reflections away from you and/or absorb them instead of reflecting them.
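For anyone who wants to try the phantom-center check, here is a minimal sketch (not a calibrated test signal) that writes the same sine wave to both channels of a stereo WAV file using only the Python standard library. The 440 Hz frequency, the 0.3 amplitude, the function names, and the output filename are all arbitrary assumptions for illustration.

```python
import math
import struct
import wave


def make_mono_test_tone(freq_hz=440.0, duration_s=2.0, sample_rate=48000, amplitude=0.3):
    """Return stereo frames (left, right) carrying the same sine wave in both channels."""
    n_samples = int(duration_s * sample_rate)
    frames = []
    for n in range(n_samples):
        value = amplitude * math.sin(2.0 * math.pi * freq_hz * n / sample_rate)
        sample = int(round(value * 32767))  # scale to signed 16-bit
        frames.append((sample, sample))     # identical left and right = mono
    return frames


def write_stereo_wav(path, frames, sample_rate=48000):
    """Write (left, right) 16-bit frames to a stereo WAV file."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(2)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(b"".join(struct.pack("<hh", l, r) for l, r in frames))


if __name__ == "__main__":
    # Hypothetical output filename; play it back and adjust the level knobs
    # until the tone appears to come from directly between the speakers.
    write_stereo_wav("phantom_center_test.wav", make_mono_test_tone())
```

If the tone seems to pull toward one speaker, that side is probably louder or closer at the listening position.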
That kind of care is not expensive but makes a huge difference in the listening environment. You don't need to take as much care with a consumer loudspeaker for casual listening, because it is designed to have a more diffuse radiation pattern. That gives it a less well-defined sound stage but a much wider sweet spot, so it's less influenced by poor placement.
If you do this, it'll sound great. But it won't sound like Stephen Fry is sitting across from you talking at you when you listen to the Harry Potter audiobooks. It will sound like he's speaking closely into a microphone in a vocal booth, which he probably did. And that's what studio monitors are for: they are there to reproduce the environment the recording was made in.
If you want to mask that to sound like a human in your room, that's an interesting design problem. What I would do is ditch the stereo setup and use one speaker, on a stand, far enough away from me that I was in the "far field" (about a meter, maybe more) and away from any walls. That will let the room sound dominate. I'd also enable a low cut (a high-pass filter, usually found on the monitor as a switch) to compensate for what's called the "proximity effect" (speaking closely into a mic increases the bass of a human voice; it's great for psychoacoustics in bad listening environments but alien compared to real people). The most important thing is to turn the volume down, because the timbre of a voice changes naturally with loudness, and the narrator probably wasn't speaking loudly.
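Since the goal of that filter is to tame the bass boost from close-miking, the same idea can be tried in software. This is a rough sketch of a first-order high-pass filter in plain Python, not a model of any particular monitor's switch. The 100 Hz cutoff and the helper names are assumptions chosen for illustration, and a real proximity-effect correction would more likely be a gentle low shelf tuned by ear.

```python
import math


def high_pass(samples, cutoff_hz=100.0, sample_rate=48000):
    """First-order (6 dB/octave) high-pass: attenuates content below cutoff_hz."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate
    alpha = rc / (rc + dt)
    out = []
    prev_in = 0.0
    prev_out = 0.0
    for x in samples:
        # Standard discrete RC high-pass recurrence.
        y = alpha * (prev_out + x - prev_in)
        out.append(y)
        prev_in, prev_out = x, y
    return out


def sine(freq_hz, duration_s=1.0, sample_rate=48000):
    """Unit-amplitude sine wave as a list of floats."""
    n = int(duration_s * sample_rate)
    return [math.sin(2.0 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


def settled_peak(samples):
    """Peak level over the second half, after the filter's start-up transient."""
    return max(abs(s) for s in samples[len(samples) // 2:])


if __name__ == "__main__":
    for freq in (20.0, 100.0, 1000.0, 5000.0):
        print(freq, "Hz ->", round(settled_peak(high_pass(sine(freq))), 3))
```

Low frequencies come out noticeably quieter while the range that carries most speech intelligibility passes almost unchanged.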
That's what I'd do, on a budget. It has drawbacks, like having a speaker hanging out in the middle of a room and a cable running to it. If money were no object, there are better things to do. The first is to gratuitously treat the walls with diffusers to deaden the space, and hire contractors to install line arrays in the walls. Then run the signal through a DSP brick that implements a convolution reverb with an impulse response of a studio's live room, and meticulously EQ the signal for the mic effects for each book. That would cost $10-50k depending on how crazy you want to get, and it will sound incredible. There are a handful of AV consultancies in the US that will do it. I've heard their work and it is as close to "holy shit that guy is in the room with me" as you can get.
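The convolution reverb step is simple to describe in code: the dry signal is convolved with the room's impulse response, then blended back with the original. The sketch below uses direct convolution in plain Python so it stays self-contained. The tiny impulse response is made up for illustration, where a real one would be a measured recording of a room, and a practical implementation would use FFT-based convolution for speed. The function names and the `wet_mix` parameter are assumptions, not any product's API.

```python
def convolve(signal, impulse_response):
    """Direct-form linear convolution; output length is len(signal) + len(ir) - 1."""
    if not signal or not impulse_response:
        return []
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def apply_reverb(dry, impulse_response, wet_mix=0.3):
    """Blend the dry signal with its convolved (wet) version."""
    wet = convolve(dry, impulse_response)
    # Pad the dry signal with silence so the reverb tail is kept.
    padded_dry = list(dry) + [0.0] * (len(wet) - len(dry))
    return [(1.0 - wet_mix) * d + wet_mix * w for d, w in zip(padded_dry, wet)]


if __name__ == "__main__":
    # Hypothetical toy impulse response: direct sound plus two decaying echoes.
    toy_ir = [1.0, 0.0, 0.5, 0.0, 0.25]
    click = [1.0, 0.0, 0.0]
    print(apply_reverb(click, toy_ir, wet_mix=0.5))
```

Feeding a single click through the reverb reproduces the impulse response itself, which is a handy sanity check for any convolution implementation.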