I'd like a kind of Echo Dot-like thing running on a set of Raspberry Pi devices, each with a microphone and speaker. Ideally they'd be all over the house. I'm happy if they talk back over Wi-Fi to a server in my office for whatever real processing is needed. The server might have 16 cores and 128 GB of RAM. I might even have two of these if required.
What options do I have? What are the limits? I'd really prefer answers from people who have experience with the various options.
If it helps I'm happy to reduce vocabulary to a dictionary of words as long as I can add more words as necessary. Training is also ok. I've already analysed my voice conversations with an echo dot and the vocabulary isn't that large.
Please remember: home use, no off-site clouds. I'm not interested in options involving even a free cloud speech-to-text service. This eliminates Google voice recognition, Amazon, etc. They are great, but out of scope.
So far I've identified CMU Sphinx as a candidate but I'm sure there are others.
Ideas?
-----------------
Windows 10 IoT for Raspberry Pi comes with an offline speech recognition API.
At a hackathon it was not hard to slap together some code that turns on a light when someone says "banana".
Sounds like exactly what you need.
>If it helps I'm happy to reduce vocabulary to a dictionary of words
You can do that with an XML grammar file for offline recognition [4].
[1]https://docs.microsoft.com/en-us/windows/iot-core/tutorials/...
[2]https://docs.microsoft.com/en-us/windows/iot-core/extend-you...
Someone's demo project:
[3]https://www.hackster.io/krvarma/rpivoice-051857
[4]https://docs.microsoft.com/en-us/windows/uwp/design/input/sp...
Sphinx is just for the automatic speech recognition (ASR) part. But there are better solutions for that:
Kaldi (https://kaldi-asr.org/) is probably the most comprehensive ASR solution, which yields very competitive state-of-the-art results.
RASR (https://www-i6.informatik.rwth-aachen.de/rwth-asr/) is for non-commercial use only but otherwise similar to Kaldi.
If you want to use a simpler ASR system, end-to-end models perform quite well nowadays. There are quite a few projects that support these:
RETURNN (https://github.com/rwth-i6/returnn) is TF-based, for non-commercial use. (Disclaimer: I'm one of the main authors.)
Lingvo (https://github.com/tensorflow/lingvo), from Google, TF-based.
ESPnet (https://github.com/espnet/espnet), PyTorch/Chainer.
...
KaldiAG has an English model available, but other models could be trained. Although you can't just drop in and use a standard Kaldi model with KaldiAG, the required modifications are fairly minimal and don't involve any training or changes to its acoustic model. All recognition is performed locally and offline by default, but you can also selectively choose to do some recognition in the cloud.
Kaldi generally performs at the state of the art. As a hybrid engine, its training can be more complicated, but it generally requires far less training data to achieve high accuracy compared to "end-to-end" engines.
What actions are you looking to handle with the assistant?
The reason I ask is that a voice assistant is a command-line interface with no auto-complete or visual feedback. It doesn't scale well as you add more devices or commands to your home, because it becomes impossible to remember all the phrases you programmed. We've found that the person who sets up the voice assistant will use it for simple tasks like "turn off all lights", but nobody else benefits, and it gets little use beyond timers and music. They are certainly nice to have, but they don't significantly improve the smart home experience.
If you’re looking to control individual devices, I suggest taking a look at actual occupancy sensors like Hiome (https://hiome.com), which can let you automate your home with zero interaction so it just works for everyone without learning anything (like in a sci-fi movie). Even if you’re the only user, it’s much nicer to never think about your devices again.
Happy to answer any questions about Hiome or what we’ve learned helping people with smart homes in general! -> neil@hiome.com
Alternatively, you could just fork the Almond project directly and take it from there: https://github.com/stanford-oval/almond-cloud
https://github.com/alphacep/vosk-api
Advantages are:
1) Supports 7 languages - English, German, French, Spanish, Portuguese, Chinese, Russian
2) Works offline even on lightweight devices - Raspberry Pi, Android, iOS
3) Install it with a simple `pip install vosk`
4) Model size per language is just 50 MB
5) Provides a streaming API for the best user experience (unlike the popular speech_recognition Python package)
6) There are APIs for other languages too: Java, C#, etc.
7) Allows quick reconfiguration of the vocabulary for best accuracy (see the sketch after this list)
8) Supports speaker identification besides plain speech recognition
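Here's a minimal streaming sketch of points 3, 5, and 7, assuming a 16 kHz model unpacked into ./model and the sounddevice package for microphone capture; the grammar phrases are just an illustrative example:

    # pip install vosk sounddevice
    import json
    import queue
    import sounddevice as sd
    from vosk import Model, KaldiRecognizer

    q = queue.Queue()

    def callback(indata, frames, time, status):
        # push raw 16-bit PCM chunks from the microphone into a queue
        q.put(bytes(indata))

    model = Model("model")  # path to an unpacked vosk model directory
    # optional third argument: a JSON list of phrases that constrains the vocabulary
    grammar = json.dumps(["turn on the light", "turn off the light", "[unk]"])
    rec = KaldiRecognizer(model, 16000, grammar)

    with sd.RawInputStream(samplerate=16000, blocksize=8000, dtype="int16",
                           channels=1, callback=callback):
        while True:
            if rec.AcceptWaveform(q.get()):
                print(json.loads(rec.Result()).get("text", ""))

Swapping the grammar string for a different phrase list is how you reconfigure the vocabulary on the fly.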
Requires a little customization and/or coding, but it’s quite elegant, and all voice recognition happens on-device. Part of what makes the recognition much more accurate (subjectively, 99%ish) is the constrained vocabulary; the grammars are compiled from a simple user-defined markup language, and then parsed into JSON intents, containing both the full text string and appropriate keywords/variables split out into slots.
Just finished a similar rig in my car, acting as a voice-controlled MP3 player, with thousands of artists and albums compiled into intents from the iTunes XML database. Works great, and it feels awesome to have a little 3-watt baby computer doing a job normally delegated to massive corporate server farms. ;)
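To give a rough idea of the shape of those intents, here's a hypothetical Python sketch; the regex patterns just stand in for the compiled grammar and are not the actual markup language I use:

    import re

    # hypothetical patterns standing in for a compiled grammar
    PATTERNS = [
        (re.compile(r"^play (?P<artist>.+) album (?P<album>.+)$"), "PlayAlbum"),
        (re.compile(r"^turn (?P<state>on|off) the (?P<device>.+)$"), "SetDevice"),
    ]

    def to_intent(text):
        # map a recognized phrase to a JSON-style intent with slots split out
        for pattern, name in PATTERNS:
            m = pattern.match(text.lower())
            if m:
                return {"intent": name, "text": text, "slots": m.groupdict()}
        return {"intent": "Unknown", "text": text, "slots": {}}

    print(to_intent("Turn on the kitchen lights"))
    # {'intent': 'SetDevice', 'text': 'Turn on the kitchen lights',
    #  'slots': {'state': 'on', 'device': 'kitchen lights'}}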
I think there's a group of highly technical people who feel increasingly left behind by 'convenience tech' because of what they have to give up in order to use it.
I know virtually nothing about voice recognition, but my spidey sense tells me that it should be possible with the hardware you specify.
A Commodore 64 with a Covox VoiceMaster could recognize voice commands and trigger X-10 switches around a house. (Usually. My setup had about a 70% success rate, but that was pretty good for the time!) Surely a 16-core, 128 GB RAM machine should be able to do far more.
We license this on a commercial basis but would be open to indie-developer-friendly licensing. We offer a trial SDK that makes testing/evaluation super easy (it works for 15 minutes at a time).
Ogi
ogi@keeenresearch.com
https://github.com/persephone-tools
This may be a little too low-level for what you need, as there's no language model, but maybe it's helpful as part of your system.
Among those, I tried Mycroft, which still requires a cloud account to configure various things, and it doesn't support a multi-room setup at this time.
I've since switched to Rhasspy, which offers a larger array of config options and engines, and also supports multi-room (I've yet to configure multi-room, though).
In the long-term I plan to "train" the voice-AI for various additions, including a custom wake word - No, I'm not calling it `Jarvis` ;)
I'm running each of these voice AIs on a Raspberry Pi 4 (4 GB model), though I'm considering switching them to Pi 3s. I'm using the `ReSpeaker 2mic Pi-Hat` on each Pi for the mic input. I'm planning to configure all the satellite nodes (the voice AI in each room) to PXE boot, so that they don't require an SD card and I can easily update their images/configs from a central location.
Later I worked on the Microsoft Kinect with its 4-microphone array. With only a single microphone, it's much harder to filter out background noise. If you don't find a system based on multiple microphones, I don't believe you can be successful when there's any ongoing noise (dishwasher, loud fans, etc.), but a system that works only in quiet conditions is possible.
I think there are two ID phrases per sub-device by default, but using virtual devices vastly expands the software's capability, especially for mapping virtual switches to multiple devices.
They also have Z-Wave devices that are, for the most part, much better than most alternatives.
Pretty sure Mycroft is capable of that, in theory; you'll need to configure it manually. The standard Raspberry Pi route isn't powerful enough for local processing.
Check out ReSpeaker for a Raspberry Pi microphone. You'll want one of the more expensive ones for range, though at around 40 bucks they're not that wildly expensive.
Make sure it's a Raspberry Pi 4, since the wake word is processed locally. And you probably don't need 128 GB of RAM; no idea what they use, but I doubt it's that much.
[1] https://github.com/MycroftAI/mycroft-precise
I've been working with Facebook's wav2letter project and the results (speed on CPU, command accuracy) are extremely good in my experience. They also hold the "state of the art" for librispeech (a common benchmark) on wer_are_we [1]. Granted, that's with a 2GB model that doesn't run very well on CPU, but I think most of the fully "state of the art" models are computationally expensive and expected to run on GPU. Wav2letter has other models that are very fast on CPU and still extremely accurate.
You can run their "Streaming ConvNets" model on CPU to transcribe multiple live audio streams in parallel; see their wav2letter@anywhere post for more info [2].
I am getting very good accuracy on the in-progress model I am training for command recognition (3.7% word error rate on librispeech clean, about 8% WER on librispeech other, 20% WER on common voice, 3% WER on "speech commands"). I plan to release it alongside my other models here [5] once I'm done working on it.
There's a simple WER comparison between some of the command engines here [3]. Between this and wer_are_we [1], it should give you a general idea of what to expect when talking about Word Error Rate (WER). (Note the wav2letter-talonweb entry in [3] is a rather old model I trained, known to have worse accuracy; it's not even the same NN architecture.)
----
As far as constraining the vocabulary goes, you can try training a KenLM language model for Kaldi, DeepSpeech, or wav2letter by grabbing KenLM and piping normalized text (probably lowercased, with everything but ASCII letters and quotes removed) into lmplz:
cat corpus.txt | kenlm/build/bin/lmplz -o 4 > model.arpa
And you can turn it into a compressed binary model for wav2letter like this:
kenlm/build/bin/build_binary -a 22 -q 8 -b 8 trie model.arpa model.bin
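For the normalization step, a small Python filter along these lines is enough (just an illustrative sketch, not part of KenLM or wav2letter):

    # usage: cat raw.txt | python3 normalize.py | kenlm/build/bin/lmplz -o 4 > model.arpa
    import re
    import sys

    for line in sys.stdin:
        text = line.lower()
        text = re.sub(r"[^a-z' ]+", " ", text)    # keep only ascii letters, quotes, spaces
        text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
        if text:
            print(text)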
There are other options, like using a "strict command grammar", but I don't have enough context on how you want to program this to guide you there. I also have tooling I wrote around wav2letter, such as wav2train [4], which builds wav2letter training and runtime data files for you.
I'm generally happy to talk more and answer any questions.
----
[1] https://github.com/syhw/wer_are_we
[2] https://ai.facebook.com/blog/online-speech-recognition-with-...
[3] https://github.com/daanzu/kaldi-active-grammar/blob/master/d...
Disclaimer: Working for Apple, not directly on this API but in related subjects.
Caveat: they keep having delays and may never release v2, IMO.
Here is the Google Chrome Web Speech API demo page: https://www.google.com/intl/en/chrome/demos/speech.html
The cloud-based solutions are so vastly superior to the current non-cloud solutions that unless you're something of an expert in ASR, you're just going to get frustrated. If you're worried about privacy, Google lets you pay a little extra to immediately delete the audio after you send it to their servers.