I would generally expect these speakers to be overkill for spoken-word content. However, although they do make audiobooks sound quite good, I can definitely tell that they are speakers, playing back an artificial recording. I would never mistake them for a person who is physically present in the room with me.
Why? Would it be possible to create an audio setup which makes spoken-word content indistinguishable from a real person? In other words, if my eyes are closed, I shouldn't be able to tell the difference between a recording and a human sitting next to me.
What would it take? Does it come down to hardware/speakers, or software/mastering, or something else?
On playback through the headphones later, I instinctively turned my head at frequent intervals, because I couldn't tell that I wasn't listening to real people near me. It was a really weird brain-fakeout effect.
This was all consumer-level hardware at the time so I'm sure it's very possible today to achieve the general effect, though probably with different equipment. Personally I'm not sure I'd want that from an audiobook but it could be cool.
First is the setup. For an ideal setup, studio monitors need to be placed relatively far from the walls, with the tweeters level with the tops of your ears, forming an equilateral triangle with your head, and with their levels set so that when you listen to a mono test signal (like a sine wave) the "phantom center" is obvious and directly in front of you. That's why they have a level knob on the back. You also need to watch out for any large surface in front of you (like a desk), because reflections off it create phase interference. And if there is a wall directly behind you, it should have a diffuser or other treatment to redirect reflections away from you and/or absorb them.
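To put rough numbers on the desk reflection: the bounced sound travels a slightly longer path than the direct sound, and the two cancel at frequencies where that extra path is an odd number of half wavelengths. Here's a minimal sketch of that arithmetic. The function name, the example distances, and the simplification of the desk to a perfect mirror (no absorption, no phase flip) are all my assumptions, so treat the output as a ballpark figure only.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value for air at room temperature


def desk_notch_frequencies(distance_m, speaker_height_m, ear_height_m, count=3):
    """Estimate the first few cancellation frequencies caused by a desk bounce.

    Heights are measured from the desk surface. The desk is modeled as an
    ideal mirror (an assumption), so the reflected path is the straight line
    from the speaker's mirror image below the desk up to the ear.
    """
    direct = math.hypot(distance_m, speaker_height_m - ear_height_m)
    reflected = math.hypot(distance_m, speaker_height_m + ear_height_m)
    delay_s = (reflected - direct) / SPEED_OF_SOUND
    # Cancellation happens where the delay equals an odd number of half periods.
    return [(2 * k + 1) / (2 * delay_s) for k in range(count)]


if __name__ == "__main__":
    # Hypothetical layout: speaker 1 m away, tweeter and ears 30 cm above the desk.
    for f in desk_notch_frequencies(1.0, 0.3, 0.3):
        print(f"{f:.0f} Hz")
```

With that hypothetical layout the first notch lands around 1 kHz, right in the range where speech lives, which is why the desk matters.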
That kind of care is not expensive but makes a huge difference in the listening environment. You don't need to take as much care with a consumer loudspeaker for casual listening, because it is designed to have a more diffuse radiation pattern, which gives a less well-defined sound stage but a much wider sweet spot, so it's less affected by poor placement.
If you do this, it'll sound great. But it won't sound like Stephen Fry is sitting across from you talking at you when you listen to the Harry Potter audiobooks. It will sound like he's speaking closely into a microphone in a vocal booth, which he probably did. And that's what studio monitors are for: they are there to reproduce the sound of the environment the recording was made in.
If you want to mask that to sound like a human in your room, that's an interesting design problem. What I would do is ditch the stereo setup and use one speaker, on a stand, far enough away from me that I was in the "far field" (about a meter, maybe more) and away from any walls. That will let the room sound dominate. I'd also enable a low cut (usually a switch on the back of the monitor) to compensate for what's called the "proximity effect" (speaking closely into a mic boosts the bass of a human voice; it's great for psychoacoustics in bad listening environments but alien compared to real people). The most important thing is to turn the volume down, because the timbre of a voice naturally changes with how loudly the person is speaking, and the narrator probably wasn't raising their voice.
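If your monitor doesn't have a suitable switch, the same kind of correction can be done in software. Since the proximity effect is a bass boost, the compensating filter is one that attenuates low frequencies (a high-pass filter). Below is a minimal sketch of a first-order one. The function name and the 100 Hz cutoff in the example are my own placeholders, not values taken from any particular monitor or recording, and a real correction would be tuned by ear.

```python
import math


def high_pass(samples, sample_rate, cutoff_hz):
    """First-order high-pass filter: attenuates content below cutoff_hz.

    This is the discrete form of a simple RC filter,
    y[n] = alpha * (y[n-1] + x[n] - x[n-1]).
    """
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate
    alpha = rc / (rc + dt)
    out = []
    prev_in = 0.0
    prev_out = 0.0
    for x in samples:
        y = alpha * (prev_out + x - prev_in)
        out.append(y)
        prev_in, prev_out = x, y
    return out


if __name__ == "__main__":
    # A constant (0 Hz) input should fade away to nothing after the filter.
    filtered = high_pass([1.0] * 48000, 48000, 100.0)
    print(filtered[0], filtered[-1])
```

A first-order filter only rolls off at 6 dB per octave, which is gentle; that is probably what you want here, since the goal is to tame the boosted bass rather than remove it.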
That's what I'd do on a budget. It has drawbacks, like having a speaker hanging out in the middle of a room with a cable running to it. If money were no object, there are better things to do. The first is to generously treat the walls with diffusers to deaden the space, and hire contractors to install line arrays in the walls. Then run the signal through a DSP brick that implements a convolution reverb with an impulse response of a studio's live room, and meticulously EQ the signal for the mic effects for each book. That would cost $10-50k depending on how crazy you want to get, and it will sound incredible. There are a handful of AV consultancies in the US that will do it. I've heard their work and it is as close to "holy shit that guy is in the room with me" as you can get.
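For anyone wondering what the reverb step in that DSP box actually does: the dry voice recording is convolved with a recording of how a room responds to a single click (the impulse response), so every sample of the voice triggers a scaled copy of the room's echo pattern. Here's a toy sketch of the idea. The function names, the `mix` parameter, and the three-sample impulse response are made up for illustration; a real unit would use a measured impulse response thousands of samples long and FFT-based convolution to keep up in real time.

```python
def convolve(dry, impulse_response):
    """Direct-form convolution of two lists of samples."""
    wet = [0.0] * (len(dry) + len(impulse_response) - 1)
    for i, x in enumerate(dry):
        for j, h in enumerate(impulse_response):
            wet[i + j] += x * h
    return wet


def convolution_reverb(dry, impulse_response, mix=0.3):
    """Blend the dry signal with its convolved (reverberated) version.

    mix=0.0 returns the dry signal only, mix=1.0 the reverberated signal only.
    The output is longer than the input because the reverb tail rings on.
    """
    wet = convolve(dry, impulse_response)
    padded_dry = list(dry) + [0.0] * (len(wet) - len(dry))
    return [(1.0 - mix) * d + mix * w for d, w in zip(padded_dry, wet)]


if __name__ == "__main__":
    # Hypothetical toy impulse response: direct sound plus two quieter echoes.
    toy_room = [1.0, 0.5, 0.25]
    print(convolution_reverb([1.0, 0.0, 0.0, 0.0], toy_room, mix=0.5))
```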