So I have turned to a complex and highly unreliable software stack that provides both voice-to-text, and clumsy but limited control of Microsoft Windows, Chrome, etc. This includes Dragon Voice-to-text, Voice Computer, and Talon, plus a browser extension and heavy customization.
Users of Dragon will acknowledge that: a) the software is a creaky dumpster fire built on archaic code, and b) there is no viable alternative on the market.
My question is: *how is it that no one has built something better?* The market is huge, and the natural language processing behind "OK Google" and Siri is quite refined at this point.
References:
Dragon: https://www.nuance.com/dragon.html
Voice Computer: https://voicecomputer.com/
Talon: https://talonvoice.com/
I also have a friend who is a gifted programmer who lost his ability to type about a decade ago; he has put together an open-source software stack to help: http://www.cs.columbia.edu/~dwk/
Of course this doesn't really answer your question. But it's a hard problem, and you're basically forced to become a power user to reliably interact with your PC.
The big problem I see with voice interaction is that a human being will ask clarifying questions if they don't understand what you said, and current systems don't even try. (Actually the search paradigm lets you do some refinement; "OK Google" works amazingly well on Android TV.)
Superhuman accuracy at dictation doesn't translate to a useful ability to understand text. You're doing great if you only garble 1 out of 20 words. Some errors are inconsequential, but if it garbles every other sentence then you are going to feel 0% understood.
I think there are multiple reasons:
* The obvious market is dictation of natural language, but this isn't what you want for voice control. If you try to use long descriptive phrases as your command language, everything takes forever. So instead you end up making your own mini command language where all of your common actions are a single syllable, but now it's no longer the English (or other natural language) that users already know. So now your product has a substantial learning curve, just like learning a new keyboard layout.
* Everything other than Talon has terrible latency. Most existing speech recognition engines were not designed with the kind of latency you want for quick one-syllable commands.
* In order for it to be really effective you need the cooperation of applications (this is why I've written extensive Emacs integration). Some tools like Windows Speech Recognition try to hook in at the UI layer in order to figure out what text is in dialog boxes and such, but in practice they seem to do a pretty terrible job. Windows Speech Recognition has a very hard time consistently understanding which links you are trying to get it to click on, for example. There's also a long tail of applications that just do their own custom UI rendering inside a blank canvas where no hook is possible.
* Good speech recognition, even if not specifically targeting computer voice control, is a genuinely hard research problem, and standard benchmarks for accuracy are misleading. You see "95% accuracy" and think: wow, that's a high percentage, computers almost have this speech recognition thing solved. Then you think about it harder and realize: wait a minute, that's one mistake every 20 words! Maybe you are still impressed, but then you have to take into account that when the computer does the wrong thing, you'll need to issue more commands to correct it, and those are also likely to be misinterpreted. When you make a typo with a keyboard, the mistakes rarely cascade; you just hit backspace.
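To put rough numbers on that compounding effect (using the same illustrative 95% figure, and a simplistic assumption that word errors are independent):

```python
# How "95% accurate" degrades over a whole sentence or command,
# assuming independent per-word errors -- a crude model, but it
# shows why benchmark accuracy numbers are misleading.

def error_free_probability(per_word_accuracy: float, num_words: int) -> float:
    """Probability that a sequence of words comes through with no errors."""
    return per_word_accuracy ** num_words

# At 95% per-word accuracy, a 20-word sentence is transcribed
# cleanly only about 36% of the time.
print(round(error_free_probability(0.95, 20), 3))  # ~0.358

# And every correction is itself a spoken command that can fail,
# so the expected number of attempts per command grows as 1/p.
def expected_attempts(command_success_rate: float) -> float:
    return 1.0 / command_success_rate

print(round(expected_attempts(0.95), 3))  # ~1.053
```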
Windows Voice Recognition has been around forever (out of the box since XP); its UI is "serviceable" but not great. (It was slightly better when Cortana was briefly "out of the box" in Windows 10, but has reverted some since.) But I don't think you need to pay for Dragon (or its high memory consumption) if you don't mind taking the time to learn the quirks of Windows Voice Recognition directly. Most of Dragon's quirks are Windows' quirks anyway, papered over with a UI that makes it seem like they are adding value.
Also yeah, one of the answers to "how is it that no one has built something better?" is: Well, Microsoft tried with Cortana, got a huge blowback that "no one" wanted Cortana on their PCs, and gave up.
This is fairly insulting as RSI’s are very much a real thing.
Does this community also think that wheelchair ramps should never be invested in because stairs are clearly superior?
I’d rather see the brain power in this community focused on solutions. Keyboard + mouse have lasted so long because they work surprisingly well, but I hope there is a day that we dream up something better that does not require slowly giving ourselves carpal tunnel.
Regarding the software packages you referenced: Yes, Dragon is trash that I want nothing to do with, because of its inefficient interface, its complete inability to accurately understand my voice, and its generally shoddy software quality. Voice Computer (which I hadn't seen before) is therefore eliminated as well, though it doesn't look terrible as a front end to Dragon to better use the OS GUI-accessibility info. Many people like Talon, but I demand something open, which I can modify to suit my needs.
Background: I develop kaldi-active-grammar (https://github.com/daanzu/kaldi-active-grammar), a free and open source speech recognition backend usable by Dragonfly, itself entirely by voice. There's also a community of voice coders using Dragonfly and other tools that build on top of it, such as Caster (https://github.com/dictation-toolbox/Caster).
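For anyone curious what this style of voice command grammar looks like, here is a minimal stand-in sketch in plain Python, with no recognition engine attached. The spoken forms and actions below are invented examples, not taken from any real Dragonfly or Caster grammar file:

```python
# A toy dispatcher in the spirit of a voice command grammar: short,
# single-syllable spoken forms map to editor actions. A real system
# binds these to a live speech recognition engine; here the "actions"
# are plain strings so the mapping itself is the point.

COMMANDS = {
    "dell":  "key:backspace",  # delete the previous character
    "chuck": "key:ctrl-w",     # delete the previous word
    "slap":  "key:enter",      # press enter / newline
    "sage":  "key:ctrl-s",     # save the file
}

def dispatch(utterance: str):
    """Look up a recognized utterance; None means 'not a command'."""
    return COMMANDS.get(utterance.strip().lower())

print(dispatch("Slap"))   # key:enter
print(dispatch("hello"))  # None
```

Notice how the spoken forms are chosen to be short and acoustically distinct, rather than descriptive English; that is exactly the learning-curve trade-off discussed above.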
Simple dictation could be done at the DE (desktop environment) level, where the voice-to-text stream would be diverted to the keyboard input of the active app. It could also be done at the app level, but this is one feature I think belongs a level up, so it can be used by non-voice-enabled apps.
Compared to keyboards, you lose positional logic (WASD in games). You lose shortcuts. You lose control over capitalization and formatting. You lose punctuation. You lose non-text input (code; dictating code sounds like a horrible pain). You lose function keys. And, of course, you lose speed (think of instant things you do with shortcut keys, like Alt-Tab). Not to mention that you lose the ability to work in silence.
Make the recognition quality gorgeous, and it will still be a less flexible product than what we use today. It has value for accessibility, but people will likely choose keyboards over dictation based on UX alone.
For detailed work, though, the more direct method of translating movements is far more efficient.
When you can describe an abstract end goal, voice is great. When you have to actually do all the individual steps towards some high-level goal, then it's like talking a newbie programmer through some high-level database optimization: you only use voice there because your main goal is to teach someone. If the PC could be taught that way, then voice would be in demand for such tasks too.
Any input method where you frequently have to repeat yourself and undo things won't go mainstream. I'd bet people's tolerance for errors would have to be something like one per five to ten minutes before you could get them to really adopt something like this (barring disability reasons, like RSI). Until then, the tech and market don't match.
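For a rough sense of what "one error per five to ten minutes" actually demands, assuming a typical dictation pace of around 120 words per minute (my assumed number, and a uniform error model):

```python
# Per-word accuracy required to hit a target error rate: one error
# every N minutes at R words/minute means an allowable error rate of
# 1 / (N * R) per word.

def required_accuracy(minutes_per_error: float, words_per_minute: float) -> float:
    return 1.0 - 1.0 / (minutes_per_error * words_per_minute)

for minutes in (5, 10):
    acc = required_accuracy(minutes, 120)
    print(f"1 error / {minutes} min at 120 wpm -> {acc:.4%} per-word accuracy")
```

That works out to roughly 99.83% to 99.92% per-word accuracy, a long way past the ~95% that impresses people on benchmarks.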
You need dedicated software built on a hypothetical V(oice)UI to get anything decent.
Otherwise your best bet is to find a mouse/trackball/trackpad/pointer-stick/touch-screen/pen that doesn't injure you and use speech-to-text in simple text editors.
My finding, for text dictation (not code), is that even halfway decent dictation, such as is available on iPhone, still needs much post-dictation editing. I feel that the biggest impact to be made in this area is superior capabilities for this editing phase.
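As a tiny illustration of the kind of editing operations I mean, here is a sketch of two post-dictation commands operating on a plain text buffer. The command names ("scratch that", "cap that") and the chunk-based model are my own invented example, not any shipping system's behavior:

```python
# Hypothetical post-dictation editing commands on a text buffer:
# "scratch that" removes the last dictated chunk, "cap that"
# capitalizes it. A real system would get utterance boundaries from
# the recognizer; here each dictated chunk is just appended to a list.

class Buffer:
    def __init__(self):
        self.chunks = []

    def dictate(self, text: str) -> None:
        self.chunks.append(text)

    def scratch_that(self) -> None:
        """Undo the most recent dictated chunk."""
        if self.chunks:
            self.chunks.pop()

    def cap_that(self) -> None:
        """Capitalize the most recent dictated chunk."""
        if self.chunks:
            self.chunks[-1] = self.chunks[-1].capitalize()

    def text(self) -> str:
        return " ".join(self.chunks)

buf = Buffer()
buf.dictate("hello world")
buf.dictate("this is wrong")
buf.scratch_that()
buf.cap_that()
print(buf.text())  # Hello world
```

The point is that editing commands want to operate on dictation-sized units (the last utterance, the last phrase), not on single characters the way a keyboard does.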
I summarized and wrote up my thoughts as a grant proposal for Scott Alexander's recent "micro grants" project. Get in touch (email in my profile) if you want to read that, or if you'd like to talk about dictation, voice control, voice coding, and editing operations -- or just get some moral support.
The complexity of doing that is, IMO, a good explanation of why commercial speech recognition is worthless to someone who programs a computer instead of interacting with humans over a computer.
http://plover.stenoknight.com/2013/03/using-plover-for-pytho...
You can DL it here: https://chrome.google.com/webstore/detail/lipsurf-voice-cont...
People have built the tools you're talking about. They're Talon and Cursorless.
I think you'd be shocked if you saw how productive some people in the Talon community are. Be sure to join the community Slack.
You have just been fired and as the security boys are escorting you to the door, you call out, loud enough to be heard in all the cubicles -
"Computer! Format all drives!"
OR MAYBE THIS OTHER SCENARIO:
The guy in the next cubicle has a loud voice and while he is commanding his own computer to "Exit the file without saving" you find that the work you have carefully constructed over the last four hours is suddenly thrown away too.
But, it seems like all voice control development keeps getting bought up by the Big 3, so it's not likely to have any significant breakthroughs independent of what Apple, Google and Amazon think voice control is good for.
It’s an alternative input method. Might be worth giving a try.
Voice Finger by Cozendy [$9.99]
Lenovo Voice Control from msstore [free]
Amazon Alexa from msstore [free]
"Win Key + h" for the inbuilt text box dictation [inbuilt]
serenade.ai [$$]
I don't have an exact answer for you, OP, but I hope someone builds a helpful one for you.
There is a voice assistant app for Android that uses Vosk, called Dicio (on F-Droid). Storage is cheap and easy. The processing power is there, even in cheap third-world phones. I personally detest typing and would love to talk to my devices without any third-party nonsense requirements. Truly there is none, because the powers that be do not want everyone thinking they are in control of, essentially, anything.
With that precondition, any voice-to-control layer on the desktop is in the tough situation of translating between voice input and a piece of software that was designed without voice input in mind.
Google and Siri, etc., aren't as beholden to the desktop/browser interface paradigm, so they don't have to perform this interface translation.
"Open this program"
"Minimize"
"Focus on this text input"
..dictate..
"switch to command mode"
"save and close"
I'd rather just: "click click tab type ctrl-S"
There are two ways that new software gets built: either the market is big enough and accessible enough that commercial software gets built, or the software is easy enough to build that hobbyists enter the space and solve their own problems. For example, the commercial market for keyboard-driven interfaces is also quite small, but we still have stuff like Sway. But a good keyboard-driven interface is easier to build than speech recognition.
I've been curious about this area for a while, but my understanding is that open-source voice-to-text solutions are still kind of primitive for general text transcribing. The libraries aren't very fun to work with; they're often embedded Python/Java "stuff", and the accuracy isn't great once you advance past the level of text transcription. Additionally, controlling computers and hooking into X or Wayland feels a bit hacky.
That being said, I'll push back on people who are saying that no one would want to control an interface this way. The success of systems like Alexa/Siri/Google are pretty definitive proof to me that (all their weaknesses side) there is a market for voice interfaces. But the ties between that market and the desktop are not strong, and the ecosystem isn't open enough to really build on in that direction.
I suspect that until efforts like Mozilla's open speech datasets pick up more steam and become competitive (if they ever do), it's going to be kind of laggy to find solutions because it's not immediately obvious how to enter the market, either as a commercial company or as an Open Source dev. But maybe I'm wrong and I just haven't researched it enough and the area is totally ripe for disruption. Maybe for people with RSI they'd tolerate something like clipping a bluetooth mic to their lapel or something and that would boost accuracy. Maybe there's another way to approach entering code that isn't just straight text recognition, possibly combining it with some kind of AST or code analysis that made it easier to guess what people were saying.
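On that last idea of combining recognition with code analysis: here's a crude sketch of re-ranking transcription candidates by how well they match identifiers actually in scope. The identifiers and candidate strings are invented examples, and stdlib `difflib` similarity is just a stand-in for combining a real recognizer's confidence with scope information:

```python
import difflib

# Given several candidate transcriptions of a spoken identifier, prefer
# the one closest to a name that actually exists in the current scope.
# A real system would weight this against the recognizer's own scores.

def rank_candidates(candidates, identifiers):
    def best_match_score(candidate):
        # Similarity to the closest in-scope identifier (0.0 .. 1.0).
        return max(
            difflib.SequenceMatcher(None, candidate, ident).ratio()
            for ident in identifiers
        )
    return max(candidates, key=best_match_score)

in_scope = ["parse_config", "user_count", "response_time"]
heard = ["parse confit", "parse config", "purse con fig"]
print(rank_candidates(heard, in_scope))  # parse config
```

Even something this naive illustrates why code dictation is a different problem from prose dictation: the "language model" you want is the symbol table, not English.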
In any case, I don't think the problem is that people don't want to talk to their computers. Personally I don't like using voice assistants, but they are very popular, in no small part because of the voice part. So maybe there is an evolution of desktop UI controls that could become really popular, or at least competitive with entrenched solutions for people with limited mobility or RSI. But it would require someone to introduce some kind of actual UX innovation into the space, or to find a way of getting over the moat around good recognition and OS integration.
Apparently ... it's not
Or, rather, it's not YET "huge"
Sure - half the planet is online, but they're speaking myriad languages in more combinations of enunciation, dialect, and accent than is probably even calculable
>the Natural Language Processing of "OK Google" and Siri are quite refined at this point
It's totally different to ask for today's weather versus telling a computer what to do - just like it's totally different to hit your favorite search engine and type "what is Pluto's orbit" versus writing the search engine that goes off and does what you asked (and even when it does go off and do it, it still returns multiple (often conflicting) results - which leads to the whole problem of identifying authority online, something I wrote about 15+ years ago: https://antipaucity.com/2006/10/23/authority-issues-online/#...)
It's also worlds different to be able to respond to variations on a theme of maybe a couple hundred search keywords (is it even that many?) and the literally unlimited number of commands people issue to their computing devices every day. Let's even say Siri is That Good™ - you've got a MacBook, iPhone, and iPad on your desk ...which one should respond when you say, "Hey, Siri"? Why that one vs this one? Do you have to start every command with the name of the device? Maybe that's not so hard at home (maybe), but get into corporate environments with naming conventions like H5GG71WLD? ... or dozens/scores/hundreds of people within listening distance of everyone's microphones getting triggered by other conversations in the room, conference calls, your cubemates' inability to attenuate their voices and aim only at their laptop when talking ...
It's a nightmare to think about - practically, let alone computationally
Most people look at the example of, say, Star Trek for voice commands to "the computer". Ever notice the computer only responds when the script demands it? Geordi shouting commands to his team in Engineering, or panicked messages to the bridge, are never misinterpreted by the computer as commands to it
That's mighty convenient - and not at all representative of anything resembling a reality we can create [yet]
Maybe in another few decades or centuries ... but I'd wager probably not
Another consideration: speaking is very slow compared to a click, tap, or typing a few characters at a prompt. Why would you want to intentionally make your human-to-device interactions more clumsy and error-prone?
The thing about voice is how weak it is. Even if you've trained it well and you speak well (which I don't), it won't be as good as a keyboard.
Putting work into voice like this for productivity is pointless. Any effort is better placed in brain-computer interfaces. Hopefully ones that aren't surgically implanted, like Neuralink's; more of a headset, like what Valve and OpenBCI are doing.
Let's just wear a headset and work; keyboards can just be there in case you need them.