HACKER Q&A
📣 rs23296008n1

Non-cloud voice recognition for home use?


I'd like home-based voice recognition that doesn't depend on some off-site cloud.

I'd like a kind of Echo Dot-like thing running on a set of Raspberry Pi devices, each with a microphone and speaker. Ideally they'd be all over the house. I'm happy if they talk back via WiFi to a server in my office for whatever real processing is needed. The server might have 16 cores and 128 GB of RAM. I might even have two of these if required.

What options do I have? What are the limits? I'd really prefer answers from people who have experience with the various options.

If it helps, I'm happy to reduce the vocabulary to a dictionary of words, as long as I can add more words as necessary. Training is also OK. I've already analysed my voice conversations with an Echo Dot and the vocabulary isn't that large.

Please remember: home use, no off-site clouds. I'm not interested in options involving even a free cloud speech-to-text service. This eliminates Google voice recognition, Amazon, etc. They are great but out of scope.

So far I've identified CMU Sphinx as a candidate but I'm sure there are others.

Ideas?


  👤 romwell Accepted Answer ✓
TL;DR: Win 10 IoT for RasPi does it.

-----------------

Windows 10 IoT for Raspberry Pi comes with an offline speech recognition API.

At a hackathon, it was not hard to slap together some code that turns on a light when someone says "banana".

Sounds like exactly what you need.

>If it helps I'm happy to reduce vocabulary to a dictionary of words

You can do that with an XML grammar file for offline recognition [4].

[1]https://docs.microsoft.com/en-us/windows/iot-core/tutorials/...

[2]https://docs.microsoft.com/en-us/windows/iot-core/extend-you...

Someone's demo project:

[3]https://www.hackster.io/krvarma/rpivoice-051857

[4]https://docs.microsoft.com/en-us/windows/uwp/design/input/sp...


👤 ftyers
Mozilla DeepSpeech trained on the Common Voice dataset for English. You can get pretrained models too. They have a nice Matrix channel where you can get help, and pretty good documentation. It is also actively developed by several engineers. http://voice.mozilla.org/en/datasets and http://github.com/mozilla/DeepSpeech/
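
For reference, transcribing a WAV file with a pretrained model takes only a few lines with the `deepspeech` Python package. A minimal sketch, assuming a 0.7+ release; the model/scorer filenames are placeholders for whichever release you download:

    import wave
    import numpy as np
    import deepspeech

    # Placeholder filenames; substitute the release artifacts you downloaded.
    model = deepspeech.Model("deepspeech-0.9.3-models.pbmm")
    model.enableExternalScorer("deepspeech-0.9.3-models.scorer")

    # DeepSpeech expects 16 kHz, 16-bit mono audio.
    with wave.open("command.wav", "rb") as wf:
        audio = np.frombuffer(wf.readframes(wf.getnframes()), dtype=np.int16)

    print(model.stt(audio))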

👤 albertzeyer
Are you searching for a complete solution including NLP and an engine to perform actions? Some of those have already been posted, like Home Assistant and Mycroft.

Sphinx is just for the automatic speech recognition (ASR) part. But there are better solutions for that:

Kaldi (https://kaldi-asr.org/) is probably the most comprehensive ASR solution, which yields very competitive state-of-the-art results.

RASR (https://www-i6.informatik.rwth-aachen.de/rwth-asr/) is for non-commercial use only but otherwise similar to Kaldi.

If you want to use a simpler ASR system, nowadays end-to-end models perform quite well. There are a huge number of projects that support these:

RETURNN (https://github.com/rwth-i6/returnn) is non-commercial, TF-based. (Disclaimer: I'm one of the main authors.)

Lingvo (https://github.com/tensorflow/lingvo), from Google, TF-based.

ESPnet (https://github.com/espnet/espnet), PyTorch/Chainer.

...


👤 daanzu
I develop Kaldi Active Grammar [1], which is mainly intended for use with strict command grammars. Compared to normal language models, these can provide much better accuracy, assuming you can describe (and speak) your command structure exactly. (This is probably more acceptable for a voice assistant aimed at a more technical audience.) The grammar can be specified by an FST, or you can use KaldiAG through Dragonfly, which lets you specify grammars (and their resulting actions) in Python. However, KaldiAG can also do simple plain dictation if you want.

KaldiAG has an English model available, but other models could be trained. Although you can't just drop in and use a standard Kaldi model with KaldiAG, the modifications required are fairly minimal and don't require any training or modification of its acoustic model. All recognition is performed locally and offline by default, but you can selectively choose to do some recognition in the cloud, too.

Kaldi generally performs at the state of the art. As a hybrid engine, although training can be more complicated, it generally requires far less training data to achieve high accuracy compared to "end-to-end" engines.

[1] https://github.com/daanzu/kaldi-active-grammar
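
For anyone curious what the Dragonfly route looks like, here is a rough sketch of a command grammar using the Kaldi backend (it assumes dragonfly2 with kaldi-active-grammar installed and a model directory in place; the rule and action are purely illustrative):

    from dragonfly import Grammar, MappingRule, Choice, Function, get_engine

    def set_light(room, state):
        # Replace with a real action, e.g. an MQTT publish or HTTP call.
        print("turning %s the %s light" % (state, room))

    class LightRule(MappingRule):
        mapping = {"turn <state> the <room> light": Function(set_light)}
        extras = [
            Choice("state", {"on": "on", "off": "off"}),
            Choice("room", {"kitchen": "kitchen", "office": "office"}),
        ]

    engine = get_engine("kaldi")   # expects a kaldi_model directory by default
    engine.connect()

    grammar = Grammar("lights")
    grammar.add_rule(LightRule())
    grammar.load()

    engine.do_recognition()        # blocks, listening for spoken commands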


👤 guptaneil
Disclaimer: I am the founder of Hiome, a smart home startup focused on private-by-design, local-only products.

What actions are you looking to handle with the assistant?

The reason I ask is that a voice assistant is a command-line interface with no auto-complete or visual feedback. It doesn’t scale well as you add more devices or commands to your home, because it becomes impossible to remember all the phrases you programmed. We’ve found that the person who sets up the voice assistant will use it for simple tasks like “turn off all lights”, but nobody else benefits, and it gets little use beyond timers and music. They are certainly nice to have, but they don’t significantly improve the smart home experience.

If you’re looking to control individual devices, I suggest taking a look at actual occupancy sensors like Hiome (https://hiome.com), which can let you automate your home with zero interaction so it just works for everyone without learning anything (like in a sci-fi movie). Even if you’re the only user, it’s much nicer to never think about your devices again.

Happy to answer any questions about Hiome or what we’ve learned helping people with smart homes in general! -> neil@hiome.com


👤 DataDrivenMD
Have you considered the Almond integration for Home Assistant? (https://www.home-assistant.io/integrations/almond/)

Alternatively, you could just fork the Almond project directly and take it from there: https://github.com/stanford-oval/almond-cloud


👤 perturbation
If you don't mind getting your hands dirty a bit, I think NVIDIA's model [Jasper](https://arxiv.org/pdf/1904.03288.pdf) is near SOTA, and they have [pretrained models](https://ngc.nvidia.com/catalog/models/nvidia:jaspernet10x5dr) and [tutorials / scripts](https://nvidia.github.io/NeMo/asr/tutorial.html) freely available. The first is in their library "NeMo", but it's available in [vanilla PyTorch](https://github.com/NVIDIA/DeepLearningExamples/tree/master/P...) as well.
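
I haven't verified the exact incantation, but with a recent NeMo release, loading a pretrained checkpoint and transcribing a file looks roughly like this (the model identifier is an assumption; check NGC for the current name):

    import nemo.collections.asr as nemo_asr

    # "stt_en_jasper10x5dr" is assumed here; the catalog name may differ
    # between NeMo releases.
    model = nemo_asr.models.EncDecCTCModel.from_pretrained("stt_en_jasper10x5dr")
    print(model.transcribe(["command.wav"]))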

👤 nshm
You are welcome to try Vosk

https://github.com/alphacep/vosk-api

Advantages are:

1) Supports 7 languages - English, German, French, Spanish, Portuguese, Chinese, Russian

2) Works offline even on lightweight devices - Raspberry Pi, Android, iOS

3) Installs with a simple `pip install vosk`

4) Model size per language is just 50 MB

5) Provides streaming API for the best user experience (unlike popular speech-recognition python package)

6) There are APIs for other programming languages too (Java, C#, etc.)

7) Allows quick reconfiguration of the vocabulary for better accuracy (see the sketch below).

8) Supports speaker identification besides plain speech recognition
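
A minimal usage sketch with the Python package (the model directory and phrase list are placeholders; the phrase-list restriction works with the small models):

    import json
    import wave
    from vosk import Model, KaldiRecognizer

    model = Model("model")               # path to an unpacked model directory
    wf = wave.open("command.wav", "rb")  # 16 kHz, 16-bit mono WAV

    # The optional third argument restricts recognition to a fixed phrase list.
    rec = KaldiRecognizer(model, wf.getframerate(),
                          '["turn on the light", "turn off the light", "[unk]"]')

    while True:
        data = wf.readframes(4000)
        if len(data) == 0:
            break
        if rec.AcceptWaveform(data):
            print(json.loads(rec.Result())["text"])

    print(json.loads(rec.FinalResult())["text"])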


👤 notemaker
https://rhasspy.readthedocs.io

Haven't used it, but seems very nice.

https://youtu.be/ijKTR_GqWwA


👤 lukifer
I’m currently assembling an offline home assistant setup using Node-RED and voice2json, all running on Raspberry Pis:

http://voice2json.org/

https://nodered.org/

Requires a little customization and/or coding, but it’s quite elegant, and all voice recognition happens on-device. Part of what makes the recognition much more accurate (subjectively, 99%ish) is the constrained vocabulary; the grammars are compiled from a simple user-defined markup language, and then parsed into JSON intents, containing both the full text string and appropriate keywords/variables split out into slots.
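
The pipeline is simple enough to drive from a small script. A rough sketch, assuming a trained voice2json profile and using its transcribe-wav and recognize-intent commands:

    import json
    import subprocess

    # Transcribe a recorded command and turn it into a JSON intent.
    # Assumes `voice2json train-profile` has already been run.
    with open("command.wav", "rb") as wav:
        transcription = subprocess.run(
            ["voice2json", "transcribe-wav"],
            stdin=wav, capture_output=True, check=True,
        ).stdout

    intent_json = subprocess.run(
        ["voice2json", "recognize-intent"],
        input=transcription, capture_output=True, check=True,
    ).stdout

    intent = json.loads(intent_json)
    print(intent["intent"]["name"], intent["slots"])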

Just finished a similar rig in my car, acting as a voice-controlled MP3 player, with thousands of artists and albums compiled into intents from the iTunes XML database. Works great, and feels awesome to have a little 3-watt baby computer doing a job normally delegated to massive corporate server farms. ;)


👤 carbon85
I was not able to find the same article online, but Volume 72 of Make Magazine has a great overview of different non-cloud voice recognition platforms. Here is a preview: https://www.mydigitalpublication.com/publication/?m=38377&i=...

👤 awinter-py
important question

I think there's a group of highly technical people who feel increasingly left behind by 'convenience tech' because of what they have to give up in order to use it


👤 skamoen
I've read good things about Mycroft [1], though I haven't tried it myself. It ticks all the boxes.

[1] https://mycroft.ai/


👤 reaperducer
I wish you luck with this, and more importantly, hope that it inspires many people to start building similar projects.

I know virtually nothing about voice recognition, but my spidey sense tells me that it should be possible with the hardware you specify.

A Commodore 64 with a Covox VoiceMaster could recognize voice commands and trigger X-10 switches around a house. (Usually. My setup had about a 70% success rate, which was pretty good for the time!) Surely a 16-core, 128 GB RAM machine should be able to do far more.


👤 otodic
My company develops SDKs for on-device speech recognition on Android/iOS: https://keenresearch.com/keenasr-docs (Raspberry Pi is an option too; we'll have a GA release in Q2).

We license this on a commercial basis but would be open to indie-developer-friendly licensing. We offer a trial SDK that makes testing/evaluation super easy (it works for 15 minutes at a time).

Ogi

ogi@keenresearch.com


👤 winkelwagen
I've had some good experience with https://snips.ai . Works as advertised, easy to implement. The hardest thing was getting the microphone and the Pi to get along.

👤 JanisL
I was one of the maintainers of the Persephone project which is an automated phonetic transcription tool. This came about from a research project that required a non-cloud solution. This project is open source and can be found on GitHub:

https://github.com/persephone-tools

This may be a little too low-level for what you want, as there's no language model, but maybe it's helpful as part of your system.


👤 gibs0ns
I was in the process of planning my multi-room voice AI setup based on SnipsAI (to be integrated with Home Assistant) when it was announced they were bought by Sonos, which killed their open-source project. Since then I have been trying various projects to find one that meets my needs.

Among those, I tried Mycroft, which still requires a cloud account to configure various things, and it doesn't support a multi-room setup at this time.

I've since switched to Rhasspy, which offers a larger array of config options and engines, and also multi-room support (I've yet to configure multi-room, though).

In the long term I plan to "train" the voice AI for various additions, including a custom wake word. No, I'm not calling it `Jarvis` ;)

I'm running each of these voice AIs on a Raspberry Pi 4 (4 GB model), though I'm considering switching them to Pi 3s. I'm using the `ReSpeaker 2mic Pi-Hat` on each Pi for the mic input. I'm planning to configure all the satellite nodes (the voice AI in each room) to PXE boot, so that they don't require an SD card and I can easily update their images/configs from a central location.


👤 villgax
Google has papers on on-device speech recognition; these models are used in the keyboard and for Live Caption on Pixel devices.

👤 coryrc
I tried to use Julius for this. I may have misconfigured it, but it would always match something to what it was hearing. I added some error terms to my grammar for sounds it would detect in quiet noise (like 'aa' and 'hh'), but it would still occasionally match words when nothing was going on.

Later I worked on the Microsoft Kinect with its 4-microphone array. With only a single microphone, it's much harder to filter out background noise. Unless you find a system based on multiple microphones, I don't believe you can be successful if there's any ongoing noise (dishwasher, loud fans, etc.), but a system that works only in quiet conditions is possible.


👤 beerandt
Homeseer automation software has this built in, with client listening apps for different platforms. I haven't used the voice recognition beyond testing, but I've been very happy with the software overall. It's relatively expensive, but goes on sale for about half price once or twice a year. There's a free 30 day trial.

I think there are two ID phrases per sub-device by default, but using virtual devices vastly expands the software's capability, especially for mapping virtual switches to multiple devices.

They also sell Z-Wave devices that are, for the most part, much better than most alternatives.

https://www.homeseer.com


👤 Havoc
>I'm happy if they talk back via wifi to a server in my office for whatever real processing. The server might have 16 cores and 128Gb ram.

Pretty sure Mycroft is capable of that, in theory, but you'll need to configure it manually. The standard Raspberry Pi route isn't powerful enough for local processing.

Check out ReSpeaker for a Raspberry Pi microphone. You'll want one of the more expensive ones for range, though at around 40 bucks they're not wildly expensive.

Make sure it's a Raspberry Pi 4, since the wake word is processed locally. And you probably don't need 128 GB of RAM; no idea what they use, but I doubt it's that much.


👤 abrichr
Precise [1], Snowboy [2], and Porcupine [3] are all designed to work offline.

[1] https://github.com/MycroftAI/mycroft-precise

[2] https://github.com/kitt-ai/snowboy

[3] https://github.com/Picovoice/porcupine
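
As a rough sketch of how the wake-word piece fits in, here is Porcupine's Python binding fed from a live microphone (illustrative only; newer Porcupine releases also require an access key, and Snowboy/Precise have similar run loops):

    import struct
    import pvporcupine
    import pyaudio

    porcupine = pvporcupine.create(keywords=["porcupine"])

    pa = pyaudio.PyAudio()
    stream = pa.open(rate=porcupine.sample_rate, channels=1,
                     format=pyaudio.paInt16, input=True,
                     frames_per_buffer=porcupine.frame_length)

    while True:
        pcm = struct.unpack_from("h" * porcupine.frame_length,
                                 stream.read(porcupine.frame_length))
        if porcupine.process(pcm) >= 0:
            print("wake word detected")  # hand off to the full recognizer here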


👤 Animats
Is there a voice dialer for Android that doesn't use Google? There used to be, but it disappeared in the "upgrade" which has the mothership listening all the time.

👤 LargeWu
The most recent edition of Make magazine had a pretty good overview of some different options. Doesn't go too much into depth but provides a good starting point.

👤 lunixbochs
Hi, I'm the dev behind https://talonvoice.com

I've been working with Facebook's wav2letter project and the results (speed on CPU, command accuracy) are extremely good in my experience. They also hold the "state of the art" for librispeech (a common benchmark) on wer_are_we [1]. Granted, that's with a 2GB model that doesn't run very well on CPU, but I think most of the fully "state of the art" models are computationally expensive and expected to run on GPU. Wav2letter has other models that are very fast on CPU and still extremely accurate.

You can run their "Streaming ConvNets" model on CPU to transcribe multiple live audio streams in parallel; see their wav2letter@anywhere post for more info [2].

I am getting very good accuracy on the in-progress model I am training for command recognition (3.7% word error rate on librispeech clean, about 8% WER on librispeech other, 20% WER on common voice, 3% WER on "speech commands"). I plan to release it alongside my other models here [5] once I'm done working on it.

There's a simple WER comparison between some of the command engines here [3]. Between this and wer_are_we [1], it should give you a general idea of what to expect when talking about word error rate (WER). (Note: the wav2letter-talonweb entry in [3] is a rather old model I trained, known to have worse accuracy; it's not even the same NN architecture.)

----

As far as constraining the vocabulary, you can try training a KenLM language model for Kaldi, DeepSpeech, and wav2letter by grabbing KenLM and piping normalized text (probably lowercased, with everything except ASCII letters and quotes removed) into lmplz:

    cat corpus.txt | kenlm/build/bin/lmplz -o 4 > model.arpa

And you can turn it into a compressed binary model for wav2letter like this:

    kenlm/build/bin/build_binary -a 22 -q 8 -b 8 trie model.arpa model.bin
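
The normalization step mentioned above can be a small stdin filter in front of lmplz; something like this sketch, assuming you only want lowercase ASCII letters, apostrophes, and spaces:

    import re
    import sys

    # Lowercase, strip everything except ASCII letters/apostrophes/spaces,
    # and collapse whitespace.
    for line in sys.stdin:
        line = line.lower()
        line = re.sub(r"[^a-z' ]+", " ", line)
        line = re.sub(r"\s+", " ", line).strip()
        if line:
            print(line)

Then e.g. `cat corpus.txt | python normalize.py | kenlm/build/bin/lmplz -o 4 > model.arpa` (normalize.py being wherever you saved the filter).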

There are other options, like using a "strict command grammar", but I don't have enough context as to how you want to program this to guide you there.

I also have tooling I wrote around wav2letter, such as wav2train [4] which builds wav2letter training and runtime data files for you.

I'm generally happy to talk more and answer any questions.

----

[1] https://github.com/syhw/wer_are_we

[2] https://ai.facebook.com/blog/online-speech-recognition-with-...

[3] https://github.com/daanzu/kaldi-active-grammar/blob/master/d...

[4] https://github.com/talonvoice/wav2train

[5] https://talonvoice.com/research/



👤 microtherion
Apple platforms offer an API (SFSpeechRecognizer) which for some languages supports on-device recognition. Trivial to set up, super easy to use, and pretty reasonable accuracy.

Disclaimer: I work for Apple, not directly on this API but on related subjects.


👤 cosmic_ape
As an aside, it seems you're interested in speech recognition (speech-to-text), not voice recognition. Voice recognition is a different problem, where the particular speaker needs to be recognized from their voice.

👤 vinniejames
I've been waiting for Mycroft to release something new, https://mycroft.ai/

Caveat: they keep having delays and may never release v2, IMO.


👤 thesuperbigfrog
Modern web browsers support the Web Speech API (https://developer.mozilla.org/en-US/docs/Web/API/Web_Speech_...), which may or may not involve a cloud service.

Here is the Google Chrome Web Speech API demo page: https://www.google.com/intl/en/chrome/demos/speech.html


👤 ParanoidShroom
Android has a local speech recognizer; maybe give that a go? You'll have to make an Android app, though.

👤 mirimir
If cloud services are such an issue (as they would be for me), then it's worth considering the security of local logs and so on. Maybe limit their lifetime, or even use full-disk encryption (FDE).

👤 wtvanhest
This feels like an area where a "Dropbox" of self-hosted solutions will emerge.

👤 stevewilhelm
Question: what aspect of your product restricts the software architecture from using an "off-site cloud"?

👤 gok
Don't bother.

The cloud-based solutions are so vastly superior to the current non-cloud solutions that unless you're something of an expert in ASR, you're just going to get frustrated. If you're worried about privacy, Google lets you pay a little extra to immediately delete the audio after you send it to their servers.