So I have turned to a complex and highly unreliable software stack that provides both voice-to-text, and clumsy but limited control of Microsoft Windows, Chrome, etc. This includes Dragon Voice-to-text, Voice Computer, and Talon, plus a browser extension and heavy customization.
Users of Dragon will acknowledge that: a) the software is a creaky dumpster fire built on archaic code, and b) there is no viable alternative on the market.
My question is: *how is it that no one has built something better?* The market is huge, and the natural language processing behind "OK Google" and Siri is quite refined at this point.
References:
Dragon: https://www.nuance.com/dragon.html
Voice Computer: https://voicecomputer.com/
Talon: https://talonvoice.com/
I also have a friend who is a gifted programmer who lost his ability to type about a decade ago; he has put together an open-source software stack to help: http://www.cs.columbia.edu/~dwk/
Of course this doesn't really answer your question. But it's a hard problem, and you're basically forced to become a power user to reliably interact with your PC.
The big problem I see with voice interaction is that a human being will ask clarifying questions if they don't understand what you said, and current systems don't even try. (Actually the search paradigm lets you do some refinement; "OK Google" works amazingly well on Android TV.)
Superhuman accuracy at dictation doesn't translate to a useful ability to understand text. You're doing great if you only garble 1 out of 20 words. Some errors are inconsequential, but if it garbles every other sentence then you are going to feel 0% understood.
I think there are multiple reasons:
* The obvious market is dictation of natural language, but this isn't what you want for voice control. If you try to use long descriptive phrases as your command language, everything takes forever. So instead you end up making your own mini command language where all of your common actions are a single syllable, but now it's no longer the English (or other natural language) that users already know. So now your product has a substantial learning curve, just like learning a new keyboard layout.
* Everything other than Talon has terrible latency. Most existing speech recognition engines were not designed with the kind of latency you want for quick one-syllable commands.
* In order for it to be really effective you need the cooperation of applications (this is why I've written extensive Emacs integration). Some tools like Windows Speech Recognition try to hook in at the UI layer in order to figure out what text is in dialog boxes and such, but in practice they seem to do a pretty terrible job. Windows Speech Recognition has a very hard time consistently understanding which links you are trying to get it to click on, for example. There's also a long tail of applications that just do their own custom UI rendering inside a blank canvas where no hook is possible.
* Good speech recognition, even if not specifically targeting computer voice control, is a genuinely hard research problem, and standard benchmarks for accuracy are misleading. You see "95% accuracy" and think: wow, that's a high percentage, computers almost have this speech recognition thing solved. Then you think about it harder and realize: wait a minute, that's one mistake every 20 words! Maybe you are still impressed, but then you have to take into account that when the computer does the wrong thing, you'll need to issue more commands to correct it, and those are also likely to be misinterpreted. When you make a typo with a keyboard, the mistakes rarely cascade; you just hit backspace.
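To put rough numbers on that compounding effect (using the same illustrative 95% figure, and a simplistic assumption that word errors are independent):

```python
# How "95% accurate" degrades over a whole sentence or command,
# assuming independent per-word errors -- a crude model, but it
# shows why benchmark accuracy numbers are misleading.

def error_free_probability(per_word_accuracy: float, num_words: int) -> float:
    """Probability that a sequence of words comes through with no errors."""
    return per_word_accuracy ** num_words

# At 95% per-word accuracy, a 20-word sentence is transcribed
# cleanly only about 36% of the time.
print(round(error_free_probability(0.95, 20), 3))  # ~0.358

# And every correction is itself a spoken command that can fail,
# so the expected number of attempts per command grows as 1/p.
def expected_attempts(command_success_rate: float) -> float:
    return 1.0 / command_success_rate

print(round(expected_attempts(0.95), 3))  # ~1.053
```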
Windows Voice Recognition has been around forever (out of the box since XP); its UI is "serviceable" but not great. (It was slightly better when Cortana was briefly "out of the box" in Windows 10, but has reverted some since.) But I don't think you need to pay for Dragon (or its high memory consumption) if you don't mind taking the time to learn the quirks of Windows Voice Recognition directly. Most of Dragon's quirks are Windows' quirks anyway, papered over with a UI that makes it seem like they are adding value.
Also yeah, one of the answers to "how is it that no one has built something better?" is: Well, Microsoft tried with Cortana, got a huge blowback that "no one" wanted Cortana on their PCs, and gave up.
This is fairly insulting as RSI’s are very much a real thing.
Does this community also think that wheelchair ramps should never be invested in because stairs are clearly superior?
I’d rather see the brain power in this community focused on solutions. Keyboard + mouse have lasted so long because they work surprisingly well, but I hope there is a day that we dream up something better that does not require slowly giving ourselves carpal tunnel.
Regarding the software packages you referenced: Yes, Dragon is trash that I want nothing to do with, because of its inefficient interface, its complete inability to accurately understand my voice, and its generally shoddy software quality. Voice Computer (which I hadn't seen before) is therefore eliminated as well, though it doesn't look terrible as a front end to Dragon to better use the OS GUI-accessibility info. Many people like Talon, but I demand something open, which I can modify to suit my needs.
Background: I develop kaldi-active-grammar (https://github.com/daanzu/kaldi-active-grammar), a free and open source speech recognition backend usable by Dragonfly, itself entirely by voice. There's also a community of voice coders using Dragonfly and other tools that build on top of it, such as Caster (https://github.com/dictation-toolbox/Caster).
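For anyone curious what this style of voice command grammar looks like, here is a minimal stand-in sketch in plain Python, with no recognition engine attached. The spoken forms and actions below are invented examples, not taken from any real Dragonfly or Caster grammar file:

```python
# A toy dispatcher in the spirit of a voice command grammar: short,
# single-syllable spoken forms map to editor actions. A real system
# binds these to a live speech recognition engine; here the "actions"
# are plain strings so the mapping itself is the point.

COMMANDS = {
    "dell":  "key:backspace",  # delete the previous character
    "chuck": "key:ctrl-w",     # delete the previous word
    "slap":  "key:enter",      # press enter / newline
    "sage":  "key:ctrl-s",     # save the file
}

def dispatch(utterance: str):
    """Look up a recognized utterance; None means 'not a command'."""
    return COMMANDS.get(utterance.strip().lower())

print(dispatch("Slap"))   # key:enter
print(dispatch("hello"))  # None
```

Notice how the spoken forms are chosen to be short and acoustically distinct, rather than descriptive English; that is exactly the learning-curve trade-off discussed above.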
Simple dictation could be done at the DE (desktop environment) level, where the voice-to-text stream would be diverted to the keyboard input of the active app. It could also be done at the app level, but this is one feature I think belongs a level up, so it can be used by non-voice-enabled apps.
Compared to keyboards, you lose positional logic (WASD in games). You lose shortcuts. You lose control over capitalization and formatting. You lose punctuation. You lose non-text input (code; dictating code sounds like a horrible pain). You lose function keys. And, of course, you lose speed (think of instant things you do with shortcut keys, like Alt-Tab). Not to mention that you lose the ability to work in silence.
Make the recognition quality gorgeous, and it will still be a less flexible product than what we use today. It has value for accessibility, but people will likely choose keyboards over dictation based on UX alone.
For detailed work, though, the more direct method of translating movements is far more efficient.
When you can describe an abstract end goal, voice is great. When you have to actually do all the individual steps towards some high-level goal, then it's like talking a newbie programmer through some high-level database optimization: you only use voice there because your main goal is to teach someone. If the PC could be taught that way, then voice would be in demand for such tasks too.
Any input method where you frequently have to repeat yourself and undo things won't go mainstream. I'd bet people's tolerance for errors would have to be something like one per five to ten minutes before you could get them to really adopt something like this (barring disability reasons, like RSI). Until then, the tech and market don't match.
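For a rough sense of what "one error per five to ten minutes" actually demands, assuming a typical dictation pace of around 120 words per minute (my assumed number, and a uniform error model):

```python
# Per-word accuracy required to hit a target error rate: one error
# every N minutes at R words/minute means an allowable error rate of
# 1 / (N * R) per word.

def required_accuracy(minutes_per_error: float, words_per_minute: float) -> float:
    return 1.0 - 1.0 / (minutes_per_error * words_per_minute)

for minutes in (5, 10):
    acc = required_accuracy(minutes, 120)
    print(f"1 error / {minutes} min at 120 wpm -> {acc:.4%} per-word accuracy")
```

That works out to roughly 99.83% to 99.92% per-word accuracy, a long way past the ~95% that impresses people on benchmarks.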
You need dedicated software built on a hypothetical V(oice)UI to get anything decent.
Otherwise your best bet is to find a mouse/trackball/trackpad/pointer-stick/touch-screen/pen that doesn't injure you and use speech-to-text in simple text editors.
My finding, for text dictation (not code), is that even halfway decent dictation, such as is available on iPhone, still needs much post-dictation editing. I feel that the biggest impact to be made in this area is superior capabilities for this editing phase.
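As a tiny illustration of the kind of editing operations I mean, here is a sketch of two post-dictation commands operating on a plain text buffer. The command names ("scratch that", "cap that") and the chunk-based model are my own invented example, not any shipping system's behavior:

```python
# Hypothetical post-dictation editing commands on a text buffer:
# "scratch that" removes the last dictated chunk, "cap that"
# capitalizes it. A real system would get utterance boundaries from
# the recognizer; here each dictated chunk is just appended to a list.

class Buffer:
    def __init__(self):
        self.chunks = []

    def dictate(self, text: str) -> None:
        self.chunks.append(text)

    def scratch_that(self) -> None:
        """Undo the most recent dictated chunk."""
        if self.chunks:
            self.chunks.pop()

    def cap_that(self) -> None:
        """Capitalize the most recent dictated chunk."""
        if self.chunks:
            self.chunks[-1] = self.chunks[-1].capitalize()

    def text(self) -> str:
        return " ".join(self.chunks)

buf = Buffer()
buf.dictate("hello world")
buf.dictate("this is wrong")
buf.scratch_that()
buf.cap_that()
print(buf.text())  # Hello world
```

The point is that editing commands want to operate on dictation-sized units (the last utterance, the last phrase), not on single characters the way a keyboard does.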
I summarized and wrote up my thoughts as a grant proposal for Scott Alexander's recent "micro grants" project. Get in touch (email in my profile) if you want to read that, or if you'd like to talk about dictation, voice control, voice coding, and editing operations -- or just get some moral support.
The complexity of doing that is, IMO, a good explanation of why commercial speech recognition is worthless to someone who programs a computer instead of interacting with humans over a computer.
http://plover.stenoknight.com/2013/03/using-plover-for-pytho...
You can DL it here: https://chrome.google.com/webstore/detail/lipsurf-voice-cont...
People have built the tools you're talking about. They're Talon and Cursorless.
I think you'd be shocked if you saw how productive some people in the Talon community are. Be sure to join the community Slack.
You have just been fired and as the security boys are escorting you to the door, you call out, loud enough to be heard in all the cubicles -
"Computer! Format all drives!"
OR MAYBE THIS OTHER SCENARIO:
The guy in the next cubicle has a loud voice and while he is commanding his own computer to "Exit the file without saving" you find that the work you have carefully constructed over the last four hours is suddenly thrown away too.
But, it seems like all voice control development keeps getting bought up by the Big 3, so it's not likely to have any significant breakthroughs independent of what Apple, Google and Amazon think voice control is good for.
It’s an alternative input method. Might be worth giving a try.
Voice Finger by Cozendy [$9.99]
Lenovo Voice Control from msstore [free]
Amazon Alexa from msstore [free]
"Win Key + h" for the inbuilt text box dictation [inbuilt]
serenade.ai [$$]
I don't have an exact answer for you, OP, but I hope someone builds a helpful one for you.
There is a voice assistant app for Android that uses Vosk, called Dicio (on F-Droid). Storage is cheap and easy. The processing power is there, even in cheap third-world phones. I personally detest typing and would love to talk to my devices without any third-party nonsense requirements. Truly there is none, because the powers that be do not want everyone thinking they are in control of, essentially, anything.
With that precondition, any voice-to-control layer on the desktop is in the tough situation of translating between voice input and a piece of software that was designed without voice input in mind.
Google and Siri, etc., aren't as beholden to the desktop/browser interface paradigm, so they don't have to perform this interface translation.
"Open this program"
"Minimize"
"Focus on this text input"
..dictate..
"switch to command mode"
"save and close"
I'd rather just: "click click tab type ctrl-S"
There are two ways that new software gets built: either the market is big enough and accessible enough that commercial software gets built, or the software is easy enough to build that hobbyists enter the space and solve their own problems. For example, the commercial market for keyboard-driven interfaces is also quite small, but we still have stuff like Sway. But a good keyboard-driven interface is easier to build than speech recognition.
I've been curious about this area for a while, but my understanding is that open-source voice-to-text solutions are still kind of primitive for general text transcribing. The libraries aren't very fun to work with; they're often embedded Python/Java "stuff", and the accuracy isn't great once you advance past the level of text transcription. Additionally, controlling computers and hooking into X or Wayland feels a bit hacky.
That being said, I'll push back on people who are saying that no one would want to control an interface this way. The success of systems like Alexa/Siri/Google are pretty definitive proof to me that (all their weaknesses side) there is a market for voice interfaces. But the ties between that market and the desktop are not strong, and the ecosystem isn't open enough to really build on in that direction.
I suspect that until efforts like Mozilla's open speech datasets pick up more steam and become competitive (if they ever do), it's going to be kind of laggy to find solutions because it's not immediately obvious how to enter the market, either as a commercial company or as an Open Source dev. But maybe I'm wrong and I just haven't researched it enough and the area is totally ripe for disruption. Maybe for people with RSI they'd tolerate something like clipping a bluetooth mic to their lapel or something and that would boost accuracy. Maybe there's another way to approach entering code that isn't just straight text recognition, possibly combining it with some kind of AST or code analysis that made it easier to guess what people were saying.
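On that last idea of combining recognition with code analysis: here's a crude sketch of re-ranking transcription candidates by how well they match identifiers actually in scope. The identifiers and candidate strings are invented examples, and stdlib `difflib` similarity is just a stand-in for combining a real recognizer's confidence with scope information:

```python
import difflib

# Given several candidate transcriptions of a spoken identifier, prefer
# the one closest to a name that actually exists in the current scope.
# A real system would weight this against the recognizer's own scores.

def rank_candidates(candidates, identifiers):
    def best_match_score(candidate):
        # Similarity to the closest in-scope identifier (0.0 .. 1.0).
        return max(
            difflib.SequenceMatcher(None, candidate, ident).ratio()
            for ident in identifiers
        )
    return max(candidates, key=best_match_score)

in_scope = ["parse_config", "user_count", "response_time"]
heard = ["parse confit", "parse config", "purse con fig"]
print(rank_candidates(heard, in_scope))  # parse config
```

Even something this naive illustrates why code dictation is a different problem from prose dictation: the "language model" you want is the symbol table, not English.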
In any case, I don't think the problem is that people don't want to talk to their computers. Personally I don't like using voice assistants, but they are very popular, in no small part because of the voice part. So maybe there is an evolution of desktop UI controls that could become really popular, or at least competitive with entrenched solutions for people with limited mobility or RSI. But it would require someone to introduce some kind of actual UX innovation into the space, or to find a way of getting over the moat around good recognition and OS integration.
Apparently ... it's not
Or, rather, it's not YET "huge"
Sure - half the planet is online, but they're speaking myriad languages in more combinations of enunciation, dialect, and accent than is probably even calculable
>the Natural Language Processing of "OK Google" and Siri are quite refined at this point
It's totally different to ask for today's weather versus telling a computer what to do - just like it's totally different to hit your favorite search engine and type "what is Pluto's orbit" versus writing the search engine that goes off and does what you asked (and even when it does go off and do it, it still returns multiple (often conflicting) results - which leads to the whole problem of identifying authority online, something I wrote about 15+ years ago: https://antipaucity.com/2006/10/23/authority-issues-online/#...)
It's also worlds different to be able to respond to variations on a theme of maybe a couple hundred search keywords (is it even that many?) and the literally unlimited number of commands people issue to their computing devices every day. Let's even say Siri is That Good™ - you've got a MacBook, iPhone, and iPad on your desk ...which one should respond when you say, "Hey, Siri"? Why that one vs this one? Do you have to start every command with the name of the device? Maybe that's not so hard at home (maybe), but get into corporate environments with naming conventions like H5GG71WLD? ... or dozens/scores/hundreds of people within listening distance of everyone's microphones getting triggered by other conversations in the room, conference calls, your cubemates' inability to attenuate their voices and aim only at their laptop when talking ...
It's a nightmare to think about - practically, let alone computationally
Most people look at the example of, say, Star Trek for voice commands to "the computer". Ever notice the computer only responds when the script demands it? Geordi shouting commands to his team in Engineering, or panicked messages to the bridge, are never misinterpreted by the computer as commands to it
That's mighty convenient - and not at all representative of anything resembling a reality we can create [yet]
Maybe in another few decades or centuries ... but I'd wager probably not
Another consideration: speaking is very slow compared to a click, tap, or typing a few characters at a prompt. Why would you want to intentionally make your human-to-device interactions more clumsy and error-prone?
The thing about voice is how weak it is. Even if you've trained it well and you speak well (which I don't), it won't be as good as a keyboard.
Putting work into voice like this for productivity is pointless. Any effort is better placed in brain-computer interfaces. Hopefully ones that aren't surgically implanted, like Neuralink's; more of a headset, like what Valve and OpenBCI are doing.
Let's just wear a headset and work; keyboards can just be there in case you need them.